#*
Parses and transforms to html a *plain-text* format. The grammar of
this script is quite complex, and I prefer to use nomsf://eg/text.tohtml.pss
as the formatter now (2025) since its grammar is simpler and easier
to add to.
This script attempts to parse a "markdown"-like syntax (this script header is
an example of the document format; the format is also documented in
mark.format.txt) and to transform it into HTML source
code. The script runs on the "pep" parsing machine which is implemented at
http://bumble.sf.net/books/pars/ .
There are other ways of achieving this, which may seem better or more
straightforward. The ANTLR parsing system can be used to write grammars that
can parse markdown-like structures, or just regular expressions can be used.
But this is a good exercise for the pep/nom machine, and more complex
structures can be recognised than with plain regular expressions.
STATUS
starting to adapt from mark.latex.html
The script may generate html output with
>> pep -f eg/mark.html.pss doc.txt > test.html
TOKEN LIST
* tokens currently used by this script
>> --- >> 4dots codeblock codeline emline nl text uutext uuword word
bl, ulist, olist, dlist, dash,
[[* (for images) ??
We dont actually need heading* and subheading* tokens because they get
transpiled (into html) as soon as they are seen in the document.
Also, dont need link/file/quoted/star, although I could allow them
to exist for a brief moment.
For lists need: - dash, bl, list
eg: o/- -> olist,
olist, text, nl, dash -> list
star* has been eliminated by parsing immediately, and >> and ---
could also be eliminated. Probably need to add bl*=blankline
dash* and ulist/olist/dlist for unordered lists, ordered lists
and definition lists
Useful grammar analysis.
* get a unique list of tokens used during parsing
>> pep -f eg/mark.latex.simple.pss pars-book.txt | sed '/%% ---/q;' | sed 's/^[^:]*: *//;s/\* *$//' | tr '*' '\n' | sort | uniq
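The pipeline above can be approximated in Python. This is only a sketch: it
assumes each debug line has the form "prefix: token*token*..." and that a line
containing "%% ---" marks the end of the parse trace. The function name
unique_tokens is made up for illustration, and unlike the sed version the
marker line itself is skipped.

```python
def unique_tokens(debug_lines):
    """Return the sorted, unique token names found in parse-stack debug lines."""
    tokens = set()
    for line in debug_lines:
        if "%% ---" in line:
            # assumption: the marker line ends the trace and is ignored
            break
        # drop everything up to the first colon, like sed 's/^[^:]*: *//'
        _, sep, stack = line.partition(":")
        if not sep:
            stack = line
        # split the token stack on the star delimiter, like tr '*' '\n'
        for name in stack.strip().split("*"):
            name = name.strip()
            if name:
                tokens.add(name)
    return sorted(tokens)
```

For example, the two lines "%%> line 1 char 5: nl*word*" and
"%%> line 2 char 9: nl*text*nl*" give the list nl, text, word.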
TODO
adapt this script to translate to markdown. translate to
nroff man format.
BUGS
Strange pep/nom segmentation fault with multiline comment.
NOTES
Date-lists are proving tricky. This is because a date must be on
a line by itself, but blank lines are allowable in date-lists.
Using a technique to make image-related tokens and then make them
disappear by changing tokens to "word*". This is an effective parse
technique.
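The "vanishing token" technique can be illustrated with a toy reducer in
Python. This is a simplified sketch, not how the pep machine works internally:
it scans a plain list instead of a stack and tape, and it models only two of
the rules in this script (a float* not followed by ]]* turns into word*, and
adjacent word*/text* tokens merge into text*). The names TEXT_PAIRS and
reduce_stack are invented for the example.

```python
TEXT_PAIRS = {
    ("word*", "word*"), ("text*", "word*"),
    ("word*", "text*"), ("text*", "text*"),
}

def reduce_stack(tokens):
    """Apply the two toy rules repeatedly until neither matches."""
    stack = list(tokens)
    changed = True
    while changed:
        changed = False
        for i in range(len(stack) - 1):
            pair = (stack[i], stack[i + 1])
            # an out-of-context float token "vanishes" by becoming a word
            if pair[0] == "float*" and pair[1] != "]]*":
                stack[i] = "word*"
                changed = True
                break
            # neighbouring word/text tokens collapse into one text token
            if pair in TEXT_PAIRS:
                stack[i:i + 2] = ["text*"]
                changed = True
                break
    return stack
```

With this sketch the token list word* float* word* reduces to a single text*,
while float* ]]* is left alone because the float is still in a valid context.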
Lists might have a title- caption.
Could use d/- for description lists where definition occurs on the
same line, and D/- for lists where definition starts on next line.
Use pdfpages to create a bindable booklet on A4, without sticking pages
together; this works but the font is currently too small, and the margins too
big.
A signature is how many pages (not sheets) go into a "folio". Each folio
gets sewn through its centre onto the book spine. Signatures must be a multiple of 4.
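The multiple-of-4 rule follows from each folded sheet carrying 4 pages. A
small Python sketch of the arithmetic is below; the function names are
invented for illustration and nothing here is part of the pdfpages setup.

```python
def pad_to_signature(pages, multiple=4):
    """Round a page count up to the next multiple of 4."""
    if pages < 0:
        raise ValueError("page count cannot be negative")
    # ceiling division without floats
    return -(-pages // multiple) * multiple

def blank_pages_needed(pages, multiple=4):
    """How many blank pages must be appended to fill the last sheet."""
    return pad_to_signature(pages, multiple) - pages
```

A 13 page document would therefore be padded to 16 pages with 3 blanks.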
This script is probably a great deal more complex than some
equivalent regular expression type renderer (for a format such
as markdown). And when it goes wrong, it has to be carefully
debugged, thinking about how the rules interact with each other.
Also, normally you have to watch the token stack as it reduces
in order to find out what is going wrong.
But apart from these problems it has great advantages: Once the
grammar is robust and permissive, it can be easily modified to
output different formats such as html or markdown.
Also, it can be translated into scripting and compilable languages
using the pep/nom scripts in the tr/ folder: languages such
as "go","java","c","ruby","python" and maybe "tcl".
Add images, datelists, inline code?
Convert this grammar to generate html/markdown etc
Need to tidy up description lists.
Nested lists may just work- out of the box! But a different
list terminator (not blankline) would be handy.
What about inline code?
This script needs to parse *any* text successfully! Even text
that is not in any particular format.
May need to add "quoted" to handle quoted text, but not really
necessary at the moment.
Using o/- O/- u/- U/- d/- D/- to start ordered/unordered/description
lists!
IMAGE FORMAT
* examples of the image format
---
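[[ f.png """caption""" 80% >>> ]]
[[ f.png
80%
>>>
]]
,,,
Currently (feb 2025) the reduction rules in this script require the
image elements in the order [[*imfile*quote*width*float*]]* where
quote, width and float are optional. A float is one of >>> <<< ccc
and a caption is delimited by triple double-quotes.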
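The word classification that the lexing phase of this script applies to image
elements can be sketched in Python. This is an approximation under stated
assumptions: only single-word captions in triple double-quotes are handled,
the number, month and enddatelist tokens are left out, and the names classify
and tokenize are invented for the example.

```python
IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".eps", ".pdf")
FLOATS = (">>>", "<<<", "ccc")
UNITS = ("%", "pt", "cm", "mm", "em")
TRIPLE = '"' * 3

def classify(word):
    """Map one whitespace-delimited word to a token name."""
    if word in ("[[", "]]"):
        return word + "*"
    if word in FLOATS:
        return "float*"
    # single-word caption wrapped in triple double-quotes
    rest = word[3:]
    if word.startswith(TRIPLE) and rest.endswith(TRIPLE) and rest != TRIPLE:
        return "quote*"
    # a width ends with a unit, is not the bare unit, and uses only
    # digits plus the unit characters (same test as the script)
    for unit in UNITS:
        allowed = "0123456789" + unit
        if word.endswith(unit) and word != unit and all(c in allowed for c in word):
            return "width*"
    if word.endswith(IMAGE_SUFFIXES):
        return "imfile*"
    return "word*"

def tokenize(spec):
    """Classify every word of an image specification."""
    return [classify(w) for w in spec.split()]
```

Under these assumptions a specification with a file name, a quoted caption,
a width of 80% and a right float yields the six tokens of the standard image
rule, in order.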
DOCUMENT FORMAT
See mark.format.txt and mark.latex.pss for the document format.
HISTORY
2 feb 2025
Starting to adapt to html to use for writing blog entries
and other documents. Basic unpleasant html is now produced.
Definition lists need to be checked. Image floating needs to
be fixed.
vvvvvvv
*#
begin {
# create a dummy newline so that doc structures work even
# on the first line of the file/stream.
add "nl*"; push;
}
read;
![:space:] {
# count words per line with the accumulator
a+;
whilenot [:space:]; put;
# image structure delimiters
"[[","]]" { put; add "*"; push; .reparse }
# an image position indicators, default is centre?
">>>","<<<","ccc" { put; clear; add "float*"; push; .reparse }
# quotes for image captions. I will use """ to delimit
# image captions, allow multiline. A run-away multiline
# quote will eat up the whole document. Or use word parsing here
B'"""' {
put; clop; clop; clop;
E'"""'.!'"""' { clear; add "quote*"; push; .reparse }
clear; get;
# for multiline quotes
# until '"""';
whilenot ["\n];
!(eof) { read; } !(eof) { read; } !(eof) { read; }
!E'"""' {
# unterminated """ quote, probably an error
put; clear; add "word*"; push; .reparse
}
put; clear; add "quote*"; push; .reparse
}
# widths for images in format eg 20%
E"%".!"%".[0123456789%] {
put; clear; add "width*"; push; .reparse
}
# image width in point format
E"pt".!"pt".[0123456789pt] {
put; clear; add "width*"; push; .reparse
}
E"cm".!"cm".[0123456789cm] {
put; clear; add "width*"; push; .reparse
}
E"mm".!"mm".[0123456789mm] {
put; clear; add "width*"; push; .reparse
}
E"em".!"em".[0123456789em] {
put; clear; add "width*"; push; .reparse
}
# create an image file token for images.
#E".png",E".jpg",E".jpeg",E".bmp",E".gif" {
# these are the formats that pdflatex can handle
E".png",E".jpg",E".jpeg",E".eps",E".pdf" {
clear; add "imfile*"; push; .reparse
}
# date and datelist tokens
# post eliminate this enddatelist token
"[/dates]","[/date]" {
put; clear; add "enddatelist*"; push; .reparse
}
[0-9] {
put; clear; add "number*"; push; .reparse
}
# case insensitive month names
lower;
"jan","january","feb","february","mar","march",
"apr","april","may","jun","june","jul","july","aug","august",
"sep","sept","september","oct","october",
"nov","november","dec","december" {
put; clear; add "month*"; push; .reparse
}
clear; add "word*"; push; .reparse
}
# keep leading space in newline token?
[\n] {
# set accumulator == 0 so that we can count words
# per line (and know which is the first word)
zero; nochars;
while [ ]; put; clear; add "nl*";
push; .reparse
}
[\r\t ] { clear; !(eof){.restart} }
parse>
# for debugging, add % as a latex comment.
# add "%%> line "; lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
# ------------------
# Datelists: including numbers/months/dates
# I will try to put all datelist related stuff here for the
# sake of organisation.
pop; pop; pop; pop;
# A 4 token datelist test. This needs to come first after the
# parse> label because there is no special date list start token
# (a date can start a datelist or continue it).
# need to preserve the nl* token because other markup requires it.
"datelist*text*date*nl*" {
clear;
# if text is empty (meaning a date with no text) we need to
# add some text or else html may think that we are nesting
# a list.
++; get;
"",[:space:] { clear; add "..."; put; }
--; clear;
# already have \item start
get; add " "; ++; get; ++;
# also, put a \verbatim in [] because text is not escaped??
add "\n
"; get; add "
";
# add an empty text token to avoid dealing with dates
# with no text.
--; --; put; clear; ++; put; --; clear;
add "\n"; ++; ++; put; --; --; clear;
add "datelist*text*nl*"; push; push; push; .reparse
}
push; push; push; push;
pop; pop;
# ---------------
# 2 token datelist reductions.
#----------------
# dates for datelists
# dates begin on a newline and each date begins a list item.
# start a new datelist (we have already checked above for
# an existing datelist with the rules datelist*text*date* and
# datelist*date*)
"date*nl*" {
clear;
add "\n
"; get; add "
"; put;
# add an empty text* token so that we dont have to worry
# about dates with no text
clear; ++; put; --;
clear; add "\n"; ++; ++; put; --; --;
clear; add "datelist*text*nl*"; push; push; push; .reparse
}
# ------------------
# 2 token datelists reductions
# vanish numbers if not first on line or preceded by month*
E"number*".!"number*".!B"nl*".!B"bl*".!B"month*" {
replace "number*" "word*"; push; push; .reparse
}
B"number*".!"number*".!E"nl*".!E"bl*".!E"month*" {
replace "number*" "word*"; push; push; .reparse
}
# vanish months if not between day/year or first on line
# this should allow eg "aug 2022" and "30 aug 2022"
E"month*".!"month*".!B"nl*".!B"bl*".!B"number*" {
replace "month*" "word*"; push; push; .reparse
}
B"month*".!"month*".!E"number*" {
replace "month*" "word*"; push; push; .reparse
}
# tokenlist:
# --- >> 4dots codeblock codeline emline nl text uutext uuword word
# month number date datelist ulist olist dlist
#
# remove pesky newline tokens, 4dots handled elsewhere
#"nl*text*","nl*word*","nl*emline*","nl*codeline*",
#"nl*codeblock*"
# vanish nl/bl when not needed.
"nl*date*","bl*date*","nl*enddatelist*","bl*enddatelist*",
"nl*ulist*","bl*ulist*","nl*olist*","bl*olist*",
"nl*dlist*","bl*dlist*",
"nl*datelist*","bl*datelist*",
"nl*codeline*","bl*codeline*",
"nl*codeblock*","bl*codeblock*",
"nl*emline*","bl*emline*"
{
# delete nl token
clop; clop; clop; push; clear;
# ignore newline
get; --; put; ++; clear;
.reparse
}
# vanish enddatelist* if it wasnt already reduced.
B"enddatelist*".!"enddatelist*" {
replace "enddatelist*" "text*";
push; push; .reparse
}
pop;
# -----------------
# 3 token datelists
"text*text*enddatelist*" {
clear; get; ++; get; --; put; clear;
++; put; --;
add "text*enddatelist*"; push; push;
.reparse
}
# vanish strange date tokens.
"month*number*month*" {
clear; add "word*word*word*"; push; push; push;
.reparse
}
# finish off the date list
# we will make enddatelist chew up previous text tokens so:
# enddatelist ::= text enddatelist
# This allows to resolve
# text ::= datelist*text*text*enddatelist*
# This in turn allows us to include more markup in lists.
# Each token is responsible for turning itself into text* when
# it is no longer needed.
"datelist*text*enddatelist*" {
clear;
# if text is empty (meaning a date with no text) we may need to
# add some text or else html may think that we are nesting
# a list.
++; get;
"",[:space:] { clear; add "..."; put; }
--; clear;
add "\n
\n"; get;
add "\n "; ++; get; --;
add "\n
\n\n";
put; clear;
# insert the blankline attribute
add "\n\n"; ++; put; --; clear;
add "text*"; push; .reparse
}
# -------------------------
# 5 token datelist reductions
pop; pop;
# resolve dates, but need to leave the trailing newline
# because it is used for many other things
"nl*number*month*number*nl*","bl*number*month*number*nl*" {
clear; ++; get; --;
# make sure 1st number is a valid day number
"0","00","000","0000" {
clear; add "word*word*word*";
push; push; push; .reparse
}
clip; clip;
# >2 digits, not day number
!"" {
clear; add "word*word*word*";
push; push; push; .reparse
}
clear; ++; get; --;
# is valid day number (01-31 or 1-31)
# this is tricky
clear;
# now check the year number
++; ++; ++; get; --; --; --;
clip; clip;
# less than 3 digits not allowed for year
B"0","" {
clear; add "word*word*word*";
push; push; push; .reparse
}
clip; clip;
# >4 digits, not a year
!"" {
clear; add "word*word*word*";
push; push; push; .reparse
}
# now assemble date value
++; get; add " "; ++; get; add " "; ++; get; lower;
replace "jan " "January ";
replace "feb " "February ";
replace "mar " "March ";
--; --; --; put; clear;
# conserve trailing newline
add "\n"; ++; put; --;
clear; add "date*nl*"; push; push; .reparse
}
# end datelist parsing
push; push; push; push; push;
# -------------
# General token parsing.
# 1 token
pop;
"nl*" { nop; }
# here we classify words into other tokens
# we can use accumulator with a+ a- to determine if current
# word is the first word of the line, or even count number of
# words per line. This should simplify grammar items such as
# nl/--- and nl/star/ etc
# another advantage, is that we can dispense with tokens such as
# ---, >> etc and not have to get rid of them later.
"word*" {
clear; get;
# no numbers in headings!
[A-Z] { clear; add "uuword*"; push; .reparse }
# at least three --- on a newline marks a code block start
# use 'count;' here to simplify. The token --- probably doesnt
# need to exist.
B"---".[-] { clear; add "---*"; push; .reparse }
">>" { add "*"; push; .reparse }
# subheading marker
B"....".[.] { clear; add "4dots*"; push; .reparse }
# dash is used for lists
# only make a dash token if it is first word on the line
"-" {
clear; count;
"1" { clear; add "dash*"; push; .reparse }
clear; get;
}
# ordered list start token
# only make token if it is first word on the line
"o/-","O/-","0/-" {
clear; count;
"1" { clear; put; add "olist*"; push; .reparse }
clear; get;
}
# unordered list start token
"u/-","U/-" {
clear; count;
"1" { clear; put; add "ulist*"; push; .reparse }
clear; get;
}
# definition/description list start token
# need to parse a bit differently because of the desc
"d/-","D/-" {
clear; count;
"1" {
clear;
# read description here, but have to escape special
# verb cant go in here. Special chars will crash this.
add "\n
"; whilenot [\n:]; add "
"; put;
# remove ":" or \n
!(eof) { read; } clear;
add "dlist*"; push; .reparse
}
clear; get;
}
# star on newline marks emphasis, list or code description
# probably dont need star token.
"*" {
# check that * is 1st 'word' on line using accumulator
clear; count;
!"1" { clear; add "*"; }
"1" {
clear; while [ \t\f]; clear;
whilenot [\n]; cap; put; clear;
# this is a trick, because we want special LaTeX chars to
# be escaped. So, will add \\emph{} after next replace code.
add "::EMPH::"; get; put;
#add "emline*"; push; .reparse
}
}
# need to escape < > & ?
# This interferes with the page title
# replace "&" "&amp;";
replace ">" "&gt;";
replace "<" "&lt;";
# make apostrophes nice
replace "I'm" "I’m";
# it's that's etc
replace "thats" "that’s";
replace "t's" "t’s";
replace "e're" "e’re";
# they're They're
replace "hey're" "hey’re";
# isn't can't don't
replace "n't" "n’t";
replace "dont" "don’t";
replace "isnt" "isn’t";
replace "cant" "can’t";
replace "wont" "won’t";
replace "youll" "you’ll";
replace "you'll" "you’ll";
replace "he'll" "he’ll";
replace "theyll" "they’ll";
replace "they'll" "they’ll";
# now make the emphasis line token, after special chars have
# been escaped.
B"::EMPH::" {
replace "::EMPH::" " "; add "";
put; clear;
add "emline*"; push; .reparse
}
# If a previous test has matched, then the workspace should
# be clear, and so none of the following will match.
# graphical key representations
B"[".E"]" {
replace "[esc]" "Esc";
replace "[enter]" "Enter";
replace "[return]" "Return";
replace "[insert]" "Ins";
replace "[shift]" "\\Shift";
replace "[delete]" "\\Del";
replace "[home]" "\\Home";
}
put;
# urls are important in html
B"file://",B"http://",B"https://",B"www.",B"ftp://",B"nntp://" {
!"file://".!"http://".!"https://".!"www.".!"ftp://".!"nntp://" {
# clear; add "url*"; push; .reparse
clear; add ""; get;
add "";
put; clear;
}
}
# format acronyms as a small capital font, case insensitive
lower;
"antlr","pdf","json","ebnf","bnf","dns","html" {
clear; add "\\textsc{\\textbf{"; get; add "}}"; put; clear;
}
# restore the mixed-case version of the input word
!"" { clear; get; }
# filenames, could be elided with quoted filenames
"parse>","print","pop","push","get","put",".reparse",".restart", "add",
"sed","awk","grep","pep","nom","less","stdin","stdout","bash",
"lex","yacc","flex","bison","lalr","gnu",
E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
E".png",E".jpg",E".jpeg",E".bmp",
E".mp3",E".wav",E"aux",
E".tar",E".gz",E"/" {
clear; add ""; get; add ""; put; clear;
}
# mark up language names
"python","java","ruby","perl","tcl","rust","swift","markdown",
"c","c++" {
clear; add ""; get; add ""; put; clear;
}
# paths and directories ?
B"../".!"../" {
clear; add ""; get; add ""; put; clear;
}
B'"'.E'"'.!'""'.!'"' {
# filenames in quotes
clip; clop; put;
# quoted uppercase words in headings
[A-Z] {
# add html curly quotes to the heading word
# laquo; is like '<<'
# ldquo; is curly left quote
clear; add "«"; get; add "»"; put; clear;
add "uuword*"; push; .reparse
}
# markup language names
"python","java","ruby","perl","tcl","rust","swift","markdown",
"c","c++","forth" {
clear; add ""; get; add ""; put; clear;
}
# markup filenames and some unix and pep/nom names as fixed-pitch
# font.
"pep",
"parse>","print","pop","push","get","put",".reparse",".restart", "add",
"sed","awk","grep","pep","nom","less","stdin","stdout","bash",
E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
E".png",E".jpg",E".jpeg",E".bmp",
E".mp3",E".wav",E"aux",
E".tar",E".gz",E";"
{ clear; add "«"; get; add "»"; put; clear; }
# everything else in quotes (but only words without spaces!)
!"" { clear; add "'"; get; add "'"; put; clear; }
}
# filenames
# crude pattern checking.
B"/".!"/" {
clip; E"." { clear; add "
"; get;
add "
"; put; clear; }
clip; E"." {
clear; add "
"; get; add "
"; put; clear;
}
clip; E"." { clear; add "
"; get; add "
"; put; clear; }
}
# emphasis is *likethis* (only words, not phrases)
# This is the same as markdown but words only
B"*".E"*".!"**" {
clip; clop; put; clear;
add ""; get; add ""; put; clear;
}
# && starting a line marks the document title
# the document 'title' after && or first heading, & has already
# been escaped
"\\&\\&" {
clear; count;
"1" {
clear; while [ \t\f]; clear;
whilenot [\n]; put; clear;
add "
"; get;
add "
\n"; put; clear;
}
}
# A quote, starting the line
"quote:" {
clear; count;
"1" {
clear; while [ \t\f]; clear;
whilenot [\n]; put; clear;
add "
"; get;
add "
\n"; put; clear;
}
}
clear; add "word*";
}
pop;
# -------------
# 2 tokens
#--------------------
# images
# standard format is [[*imfile*quote*width*float*]]*
# A width is "50%" or "200pt"; float is left/right/center
# imfile is a image file name. quote/width/float are optional
# tokens. The order of tokens is mandatory
# remove newline and blank line tokens when parsing
# images. But this is tricky, because we want to preserve
# them otherwise.
# remove nl/bl tokens in image formats
"[[*nl*","[[*bl*","imfile*nl*","imfile*bl*",
"quote*nl*","quote*bl*","width*nl*","width*bl*",
"float*nl*","float*bl*" {
push; clear; .reparse
}
# vanish [[ if not followed by imfile
B"[[*".!"[[*".!E"imfile*" {
replace "[[*" "word*"; push; push; .reparse
}
# vanish ]] where not significant
E"]]*".!"]]*".
!B"imfile*".!B"float*".!B"quote*".!B"width*" {
replace "]]*" "word*"; push; push; .reparse
}
# vanish imfiles
B"imfile*".!"imfile*".!E"float*".!E"quote*".!E"width*".!E"]]*" {
replace "imfile*" "word*"; push; push; .reparse
}
E"imfile*".!B"[[*" {
replace "imfile*" "word*"; push; push; .reparse
}
# vanish quotes
B"quote*".!"quote*".!E"float*".!E"width*".!E"]]*" {
replace "quote*" "word*"; push; push; .reparse
}
E"quote*".!"quote*".!B"imfile*" {
replace "quote*" "word*"; push; push; .reparse
}
# vanish widths
B"width*".!"width*".!E"float*".!E"]]*" {
replace "width*" "word*"; push; push; .reparse
}
E"width*".!"width*".!B"quote*".!B"imfile*" {
replace "width*" "word*"; push; push; .reparse
}
# vanish floats
B"float*".!"float*".!E"]]*" {
replace "float*" "word*"; push; push; .reparse
}
E"float*".!"float*".!B"width*".!B"quote*".!B"imfile*" {
replace "float*" "word*"; push; push; .reparse
}
# Add missing attributes here. This is a technique for
# providing "optionality" in pep/nom scripts
"width*]]*" {
clear; add "width*float*]]*";
push; push; push;
# also add an appropriate attribute for a center float
--; --; get; ++; put;
clear; --; put; ++; ++;
.reparse
}
"quote*]]*","quote*float*" {
replace "quote*" "quote*width*"; push; push; push;
# now transfer the attributes and add null quote
--; --; get; ++; put;
# or add an appropriate width
clear; --; put; ++; ++;
.reparse
}
"imfile*]]*","imfile*width*","imfile*float*" {
replace "imfile*" "imfile*quote*";
push; push; push; # ws should be clear
# now transfer the attributes and add null quote
--; --; get; ++; put;
# or put a null quote here.
clear; --; put; ++; ++;
.reparse
}
# End image token manipulation
# elide text
"text*text*","word*text*",
"word*word*","text*word*",
"word*uuword*","text*uuword*","uutext*word*","uuword*word*" {
clear; get; add " "; ++; get; --; put; clear;
add "text*"; push; .reparse
}
# tokenlist:
# --- >> 4dots codeblock codeline emline nl text uutext uuword word
# codeblock,
# remove pesky newline tokens, 4dots handled elsewhere
# not really working
#*
"nl*text*","nl*word*","nl*emline*","nl*codeline*",
"nl*codeblock*" {
# delete nl token
clop; clop; clop; push; clear;
# ignore newline
get; --; put; ++; clear;
.reparse
}
*#
"nl*text*","nl*word*", "bl*text*","bl*word*" {
clear; get; ++; get; --; put; clear;
add "text*"; push; .reparse
}
"nl*dash*" {
clear; get; ++; get; --; put; clear;
add "dash*"; push; .reparse
}
"nl*emline*","bl*emline*" {
clear; ++; get; --; put; clear;
add "emline*"; push; .reparse
}
# We are using a dummy nl* token at the start of the doc, so the
# codeblock* codeline* etc tokens are not able to be the first token
# of the document. So we can remove the !"codeblock*". clause.
# multiline codeblocks with no caption
E"codeblock*".!"codeblock*".!B"emline*" {
replace "codeblock*" "text*"; push; push; clear;
# dont need to markup here because already done, just
# transfer the attribute.
# add "\n\n
\n ";
--; get;
# add "
\n";
put; ++; clear; .reparse
}
# single line code with no caption
E"codeline*".!"codeline*".!B"emline*" {
replace "codeline*" "text*"; push; push; clear;
# dont need to markup here because already done.
# add "\n\n
\n ";
--; get;
# add "
\n";
put; ++; clear; .reparse
}
# eliminate emline* tokens (not followed by codeblock/line)
# the logic is slightly different because emline* is significant before
# other tokens, not after.
# also, consider emline*text*nl*
B"emline*".!E"nl*".!E"codeline*".!E"codeblock*" {
replace "emline*" "text*"; push; push;
# make emline display on its own line, even when not
# followed by codeline/codeblock. LaTeX will treat a blank line
# as a paragraph break, but \newline or \\ could be used.
--; --; add "\n\n"; get; add "\n\n"; put; clear;
.reparse
}
# remove insignificant 4dots* tokens,
# 4 dots (....) marks a subheading and always comes at the end of
# all capitals line. Just replacing the 4dots token with a text
# token is safer and more logical.
E"4dots*".!B"uutext*".!B"uuword*" {
replace "4dots*" "text*"; push; push; .reparse
}
# remove insignificant ---* tokens
E"---*".!B"nl*".!B"bl*" {
clear; get; add " "; ++; get; --; put; clear;
add "text*"; push; .reparse
}
# remove insignificant >>* tokens
# lets assume that codelines cant start a document? Or lets
# generate a dummy nl* token at the start of the document to
# make parsing easier.
E">>*".!B"nl*".!B"bl*" {
#clear; get; add " "; ++; get; --; put; clear;
#add "text*"; push; .reparse
replace ">>*" "text*"; push; push; .reparse
}
# elide upper case text
"uuword*uuword*","uutext*uuword*" {
clear; get; add " "; ++; get; --; put;
clear; add "uutext*"; push; .reparse
}
# a blank line token for terminating lists etc
# bl/bl should not happen really
"nl*nl*","bl*nl*","bl*bl*" {
clear; get; ++; get; --; put; clear;
add "bl*"; push; .reparse
}
# code line (starts with >>)
"bl*>>*","nl*>>*" {
# ignore leading space.
clear; while [ \t\f]; clear;
whilenot [\n]; put; clear;
add "
\n"; put; clear;
add "codeblock*"; push; .reparse
}
# a code block with its preceding description
"emline*codeblock*" {
clear;
add "\n\n \n ";
get; add " "; ++; get; --;
add " \n"; put; clear;
add "text*"; push; .reparse
}
# a code line with its preceding description
"emline*codeline*" {
clear;
add "\n\n \n ";
get; add " "; ++; get; --;
add " \n"; put; clear;
add "text*"; push; .reparse
}
# probably indicates an empty - at the end of a list
# add a dummy text token
"olist*bl*","ulist*bl*","dlist*bl*" {
push; clear; add "empty"; put;
clear; add "\n\n"; ++; put; --;
clear; add "text*bl*"; push; push; .reparse
}
# or use this to terminate the list, and so allow nested lists
"olist*dash*","ulist*dash*","dlist*dash*" {
push; clear; add "empty"; put;
clear; add "text*dash*"; push; push; .reparse
}
pop;
# -------------
# 3 tokens
"olist*word*dash*","ulist*word*dash*","dlist*word*dash*",
"olist*word*bl*","ulist*word*bl*","dlist*word*bl*" {
replace "word*" "text*";
# or dont reparse
# push; push; push; .reparse
}
# eliminate dashes that are not part of a list
# eg: ulist*dash* olist*text*dash* dlist*word*dash*
# the logic is tricky, how do we know there are really 3 tokens
# here, and not 2. This is the problem with negative tests.
# doesnt matter because not altering attributes here.
E"dash*" {
!B"ulist*text*".!B"olist*text*".!B"dlist*text*" {
replace "dash*" "text*"; push; push; push; .reparse
}
}
"olist*text*dash*" {
clear;
get; add "\n
"; ++; get; --; put; clear;
add "olist*"; push; .reparse
}
# could be elided, but for readability, no
"ulist*text*dash*" {
clear;
get; add "\n
"; ++; get; --; put; clear;
add "ulist*"; push; .reparse
}
#
"dlist*text*dash*" {
clear;
# already have \item start
get; add " "; ++; get; --;
# also, put a \verbatim in [] because text is not escaped??
# The definition term is delimited by a newline or a ":" character
add "\n
"; whilenot [\n:]; add "
"; put;
# get rid of the trailing ":"
!(eof) { read; } clear;
add "dlist*"; push; .reparse
}
# finish off the ordered list, also could finish it off with
# ulist*dash* ??
"olist*text*bl*" {
clear;
add "\n \n"; get;
add "\n
"; ++; get; --;
add "\n
\n\n";
put; clear;
# insert the blankline attribute
add "\n\n"; ++; put; --; clear;
add "text*bl*"; push; push; .reparse
}
# finish off the unordered list
"ulist*text*bl*" {
clear;
add "\n
\n"; get;
add "\n
"; ++; get; --;
add "\n
\n\n";
put; clear;
# insert the blankline attribute
add "\n\n"; ++; put; --; clear;
add "text*bl*"; push; push; .reparse
}
# finish off the description list
"dlist*text*bl*" {
# or check here if it is D/- or d/- for nextline style
# or use \hfill \\ on each item which also works
clear;
add "\n
\n"; get;
add "\n "; ++; get; --;
add "\n
\n\n";
put; clear;
# insert the blankline attribute
add "\n\n"; ++; put; --; clear;
add "text*bl*"; push; push; .reparse
}
# top level headings, all upper case on the line in the source document.
# dont need a "heading" token because we dont parse the document as a
# hierarchy, we just render things as we find them in the stream.
"nl*uutext*nl*","nl*uuword*nl*",
"bl*uutext*nl*","bl*uuword*nl*" {
clear;
# Check that heading is at least 4 chars
++; get; --; clip; clip; clip;
"" {
add "nl*text*nl*"; push; push; push; .reparse
}
clear;
# make headings capital case
++; get;
# capitalise even 1st word in latex curly quotes
# add "<'; ++; get; --; add ""; put;
clear;
# transfer nl value
++; ++; get; --; put; clear; --;
add "text*nl*"; push; push; .reparse
}
# simple reductions
"nl*text*nl*","nl*word*nl*", "bl*text*nl*","bl*word*nl*",
"text*text*nl*","emline*text*nl*" {
clear; get; ++; get; --; put; clear;
++; ++; get; --; put; --; clear; # transfer newline value
add "text*nl*"; push; push; .reparse
}
# simple reductions
"nl*text*bl*","nl*word*bl*", "bl*text*bl*","bl*word*bl*",
"text*text*bl*","emline*text*bl*" {
clear; get; ++; get; --; put; clear;
++; ++; get; --; put; --; clear; # transfer blankline value
add "text*bl*"; push; push; .reparse
}
pop;
# -------------
# 4 tokens
# sub headings,
"nl*uutext*4dots*nl*","nl*uuword*4dots*nl*",
"bl*uutext*4dots*nl*","bl*uuword*4dots*nl*" {
clear;
# Check that sub heading text is at least 4 chars ?
# yes but need to transfer 4dots and nl
clear;
# make subheadings capital case
++; get;
# capitalise even 1st word in HTML curly quotes
B"``" { clop; clop; }
cap; put; replace "''" "";
# add open curly quotes if there before.
!(==) {
clear; add "``"; get;
}
put; --; clear;
get; # newline
add '
'; ++; get; --; add "
"; put; clear;
# transfer nl value, really? just add "\\n" no?
++; ++; ++; get; --; --; put; clear; --;
add "text*nl*"; push; push; .reparse
}
pop;
#------------------
# 5 tokens
pop;
# -------------
# 6 tokens
# all images have been standardised to this format (all
# optional tokens have been added but may be empty).
# for latex, image formats are jpeg,png,or pdf. Others need
# to be converted. Image names cant have dots in them???
"[[*imfile*quote*width*float*]]*" {
# need to translate widths floats etc to latex here because
# they may revert to word* if out of context.
clear;
++;
# get quote, if any, and remove """
++; get; clip; clip; clip; clop; clop; clop;
put; clear;
# get width attribute
++; get;
# turn percentage into decimal
#E"%","" {
# clip;
# !"".!"100" { put; clear; add "0."; get; }
# "" { add "0.60"; } # default width 60%
# "100" { clear; add "1.00"; }
# add "\\textwidth";
#}
# default width
"" { add "60%"; }
put; clear;
# translate floats into HTML css, default is centre
++; get; "" { add "none"; }
# unknown positioning spec
!"ccc".!"<<<".!">>>" { clear; add "none"; }
"ccc" { clear; add "none"; } # centre
"<<<" { clear; add "left"; } # left
">>>" { clear; add "right"; } # right
put; clear;
add "\n\n";
add "\n";
add "";
# get the quote attribute
++; ++; get; --; --; add "\n";
put; clear;
add "text*"; push; .reparse
}
# example: 75% page width
# \includegraphics[width=0.75\textwidth]{image/test.jpg}
push; push; push; push; push; push;
(eof) {
# or use 'unstack' but does it adjust the tape pointer?
pop; pop; pop; pop; pop; pop;
# "nl*word*","nl*text*" have already been dealt with.
# we would like "permissive" parsing, because this is just
# a document format, not code, so will just check for starting
# text token
B"text*",B"word*" {
# show the token parse stack at the top of the document
++; put; clear;
add "\n"; --;
clear;
# make a valid HTML document
add '
';
get;
add "\n \n";
add "\n \n";
add "\n\n \n";
# show parse-stack at end of doc as well
++; add " \n"; --;
print; quit;
}
stack;
add "Document parsed unusually!\n";
add "Stack at line "; lines; add " char "; chars; add ": "; print; clear;
unstack; print; stack; add "\n"; print; clear;
quit;
}