#*

  Parses a *plain-text* format and transforms it to html. The grammar of 
  this script is quite complex, and I prefer to use nomsf://eg/text.tohtml.pss
  as the formatter now (2025) since its grammar is simpler and easier 
  to add to.

  This script attempts to parse a "markdown"-like syntax and to transform it
  into html. This script header is an example of the document format, which
  is also documented in mark.format.txt . The script runs on the "pep"
  parsing machine, which is implemented at
  http://bumble.sf.net/books/pars/ . 

  There are other ways of achieving this which may seem better or more
  straightforward. The ANTLR parsing system can be used to write grammars that
  can parse markdown-like structures, or plain regular expressions can be used.
  But this is a good exercise for the pep/nom machine, and more complex
  structures can be recognised than with plain regular expressions.

STATUS
  
  starting to adapt from mark.latex.html

  The script may generate html output with 
  >> pep -f eg/mark.html.pss doc.txt > test.html
 
TOKEN LIST

  * tokens currently used by this script
  >> --- >> 4dots codeblock codeline emline nl text uutext uuword word
   bl, ulist, olist, dlist, dash, 
  
    [[* (for images) ??

  We dont actually need heading* and subheading* tokens because they get
  transpiled (into html) as soon as they are seen in the document.
  Also, dont need link/file/quoted/star, although I could allow them
  to exist for a brief moment.

  For lists need: - dash, bl, list
  eg:  o/- -> olist, 
  olist, text, nl, dash -> list

  star* has been eliminated by parsing immediately, and >> and ---
  could also be eliminated. Probably need to add bl*=blankline
  dash* and ulist/olist/dlist for unordered lists, ordered lists
  and definition lists

  Useful grammar analysis.

  * get a unique list of tokens used during parsing
  >> pep -f eg/mark.latex.simple.pss pars-book.txt | sed '/%% ---/q;' | sed 's/^[^:]*: *//;s/\* *$//' | tr '*' '\n' | sort | uniq

TODO

  adapt this script to translate to markdown. translate to 
  nroff man format. 

BUGS

  Strange pep/nom segmentation fault with multiline comment.

NOTES

 Date-lists are proving tricky. This is because a date must be on
 a line by itself, but blank lines are allowable in date-lists.
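
 As a rough sketch of the date check (in Python rather than nom, and only
 an illustration of the rule, not part of this script): a date line is
 taken here to be "day month year", where the day has 1 or 2 digits and is
 not zero, and the year has 3 or 4 digits with no leading zero. As in the
 script, the day is not checked against the length of the month. The names
 MONTHS and is_date_line are made up for the example.

```python
import re

# month names accepted by the script (matched case-insensitively)
MONTHS = {
    "jan", "january", "feb", "february", "mar", "march",
    "apr", "april", "may", "jun", "june", "jul", "july",
    "aug", "august", "sep", "sept", "september",
    "oct", "october", "nov", "november", "dec", "december",
}

def is_date_line(line):
    """Return True if the whole line looks like 'day month year'."""
    words = line.split()
    if len(words) != 3:
        return False
    day, month, year = words
    # day: 1 or 2 ascii digits, but not 0 or 00
    if re.fullmatch(r"[0-9]{1,2}", day) is None or int(day) == 0:
        return False
    if month.lower() not in MONTHS:
        return False
    # year: 3 or 4 ascii digits, not starting with 0
    return re.fullmatch(r"[1-9][0-9]{2,3}", year) is not None
```

 With this check a line such as "30 aug 2022" counts as a date, while
 "30 aug 22" and "0 aug 2022" do not.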
  
 Using a technique to make image-related tokens and then make them
 disappear by changing tokens to "word*". This is an effective parse
 technique. 

 Lists might have a title- caption.

 Could use d/- for description lists where definition occurs on the 
 same line, and D/- for lists where definition starts on next line.

 Use pdfpages to create a bindable booklet on A4, without sticking pages
 together; this works but the font is currently too small, and the margins too
 big.

 A signature is how many pages (not sheets) go into a "folio". Each folio
 gets sewn through its centre onto the book spine. The signature must be a
 multiple of 4.
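
 The arithmetic behind that rule can be sketched in Python (an
 illustration only, not part of this script; the function name and the
 16 page default are assumptions). It works out how many signatures a
 document needs and how many blank pages are required to fill the last one.

```python
def pad_to_signature(pages, signature=16):
    """Return (signatures, blank_pages) for a document of `pages` pages.

    `signature` is the number of pages per folio and must be a
    multiple of 4, since one folded sheet carries 4 pages.
    """
    if signature <= 0 or signature % 4 != 0:
        raise ValueError("signature must be a positive multiple of 4")
    if pages < 0:
        raise ValueError("pages must not be negative")
    # round up to the next whole signature
    count = -(-pages // signature)
    blanks = count * signature - pages
    return count, blanks
```

 For example a 30 page document with 16 page signatures needs 2
 signatures and 2 blank pages.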

  This script is probably a great deal more complex than some 
  equivalent regular expression type renderer (for a format such 
  as markdown). And when it goes wrong, it has to be carefully
  debugged, thinking about how the rules interact with each other.
  Also, normally you have to watch the token stack as it reduces
  in order to find out what is going wrong.

  But apart from these problems it has great advantages: Once the 
  grammar is robust and permissive, it can be easily modified to 
  output different formats such as html or markdown.
  Also, it can be translated into scripting and compilable languages
  using the pep/nom scripts in the tr/ folder: languages such
  as "go","java","c","ruby","python" and maybe "tcl".

  Add images, datelists, inline code?
  Convert this grammar to generate html/markdown etc

  Need to tidy up description lists.

  Nested lists may just work- out of the box! But a different
  list terminator (not blankline) would be handy.

  What about inline code?
  This script needs to parse *any* text successfully! Even text
  that is not in any particular format.

  May need to add "quoted" to handle quoted text, but not really
  necessary at the moment.
  
  Using o/- O/- u/- U/- d/- D/- to start ordered/unordered/description
  lists!

IMAGE FORMAT
  
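  * examples of the image format
  ---
    [[ f.png """caption""" 80% >>> ]]
    [[ f.png 
      80% 
    ]]
  ,,,

  Currently (feb 2025) the elements in the image must be in 
  the order [[*imfile*quote*width*float*]]* where quote, width and
  float are optional. A float is one of >>> <<< ccc and a caption
  is delimited by """.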
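
  For reference, the lexing phase of this script sorts the words inside
  an image element into tokens roughly as in the Python sketch below. This
  is a simplified illustration and not a translation of the nom code: it
  only handles a caption written as a single word, it is stricter about
  widths than the script, and everything else is lumped into word*. The
  names IMAGE_SUFFIXES and image_token are made up for the example.

```python
import re

IMAGE_SUFFIXES = (".png", ".jpg", ".jpeg", ".eps", ".pdf")

def image_token(word):
    """Return the token name that one whitespace-free word would get."""
    if word in ("[[", "]]"):
        return word + "*"
    if word in (">>>", "<<<", "ccc"):
        return "float*"
    # a one-word caption such as """caption"""
    if len(word) > 6 and word.startswith('"""') and word.endswith('"""'):
        return "quote*"
    # widths such as 80% 200pt 5cm 40mm 12em
    if re.fullmatch(r"[0-9]+(%|pt|cm|mm|em)", word):
        return "width*"
    if word.endswith(IMAGE_SUFFIXES):
        return "imfile*"
    return "word*"
```

  Splitting the text [[ f.png """caption""" 80% >>> ]] on whitespace and
  passing each word to image_token gives the tokens
  [[* imfile* quote* width* float* ]]* in that order.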

DOCUMENT FORMAT
  
  See mark.format.txt and mark.latex.pss for the document format.

HISTORY 
 
  2 feb 2025
    Starting to adapt to html to use for writing blog entries
    and other documents. Basic unpleasant html is now produced.
    Definition lists need to be checked. Image floating needs to
    be fixed.

vvvvvvv
*#
  
  begin {
    # create a dummy newline so that doc structures work even
    # on the first line of the file/stream.
    add "nl*"; push;
  }

  read;

  ![:space:] {
    # count words per line with the accumulator
    a+;
    whilenot [:space:]; put;
    
    # image structure delimiters
    "[[","]]" { put; add "*"; push; .reparse }
    # image position indicators, default is centre?
    ">>>","<<<","ccc" { put; clear; add "float*"; push; .reparse }
    # quotes for image captions. I will use """ to delimit 
    # image captions, allow multiline. A run-away multiline
    # quote will eat up the whole document. Or use word parsing here

    B'"""' { 
      put; clop; clop; clop;
      E'"""'.!'"""' { clear; add "quote*"; push; .reparse } 
      clear; get;
      # for multiline quotes
      # until '"""';
      whilenot ["\n]; 
      !(eof) { read; } !(eof) { read; } !(eof) { read; }
      !E'"""' { 
        # unterminated """ quote, probably an error
        put; clear; add "word*"; push; .reparse 
      } 
      put; clear; add "quote*"; push; .reparse
    }

    # widths for images in format eg 20%
    E"%".!"%".[0123456789%] {
      put; clear; add "width*"; push; .reparse
    }
    # image width in point format
    E"pt".!"pt".[0123456789pt] {
      put; clear; add "width*"; push; .reparse
    }
    E"cm".!"cm".[0123456789cm] {
      put; clear; add "width*"; push; .reparse
    }
    E"mm".!"mm".[0123456789mm] {
      put; clear; add "width*"; push; .reparse
    }
    E"em".!"em".[0123456789em] {
      put; clear; add "width*"; push; .reparse
    }

    # create an image file token for images. 
    #E".png",E".jpg",E".jpeg",E".bmp",E".gif" {

    # these are the formats that pdflatex can handle
    E".png",E".jpg",E".jpeg",E".eps",E".pdf" {
      clear; add "imfile*"; push; .reparse 
    }

    # date and datelist tokens
    # post eliminate this enddatelist token
    "[/dates]","[/date]" { 
      put; clear; add "enddatelist*"; push; .reparse
    }

    [0-9] {
      put; clear; add "number*"; push; .reparse
    }
    # case insensitive month names
    lower;
    "jan","january","feb","february","mar","march",
    "apr","april","may","jun","june","jul","july","aug","august",
    "sep","sept","september","oct","october",
    "nov","november","dec","december" {
      put; clear; add "month*"; push; .reparse
    }
    clear; add "word*"; push; .reparse 
  }

  # keep leading space in newline token?
  [\n] { 
    # set accumulator == 0 so that we can count words 
    # per line (and know which is the first word)
    zero; nochars;
    while [ ]; put; clear; add "nl*"; 
    push; .reparse
  }
  [\r\t ] { clear; !(eof){.restart} }

parse>

  # for debugging, add % as a latex comment.
  # add "%%> line "; lines; add " char "; chars; add ": "; print; clear; 
  # unstack; print; stack; add "\n"; print; clear;

  # ------------------
  # Datelists: including numbers/months/dates
  #  I will try to put all datelist related stuff here for the 
  #  sake of organisation.

  
  pop; pop; pop; pop; 
  # A 4 token datelist test. This needs to come first after the 
  # parse> label because there is no special date list start token
  # (a date can start a datelist or continue it). 
  # need to preserve the nl* token because other markup requires it.
  "datelist*text*date*nl*" {
    clear;
    # if text is empty (meaning a date with no text) we need to 
    # add some text or else html may think that we are nesting
    # a list.
    ++; get; 
    "",[:space:] { clear; add "..."; put; }
    --; clear;

    # already have \item start
    get; add " "; ++; get; ++; 
    # also, put a \verbatim in [] because text is not escaped??
    add "\n <dd class='date.item'>"; get; add "</dd> ";
    # add an empty text token to avoid dealing with dates
    # with no text.
    --; --; put; clear; ++; put; --; clear;
    add "\n"; ++; ++; put; --; --; clear;
    add "datelist*text*nl*"; push; push; push; .reparse
  }
  push; push; push; push;

  pop; pop;
  # ---------------
  # 2 token datelist reductions.

  #----------------
  # dates for datelists
  # dates begin on a newline and each date begins a list item.
  # start a new datelist (we have already checked above for 
  # an existing datelist with the rules datelist*text*date* and
  # datelist*date*)
  "date*nl*" {
    clear; 
    add "\n <dt class='date'>"; get; add "</dt><dd>"; put;
    # add an empty text* token so that we dont have to worry 
    # about dates with no text
    clear; ++; put; --;
    clear; add "\n"; ++; ++; put; --; --;
    clear; add "datelist*text*nl*"; push; push; push; .reparse
  }

  # ------------------
  # 2 token datelists reductions

  # vanish numbers if not first on line or preceded by month*
  E"number*".!"number*".!B"nl*".!B"bl*".!B"month*" {
    replace "number*" "word*"; push; push; .reparse
  }
  B"number*".!"number*".!E"nl*".!E"bl*".!E"month*" {
    replace "number*" "word*"; push; push; .reparse
  }

  # vanish months if not between day/year or first on line
  # this should allow eg "aug 2022" and "30 aug 2022"
  E"month*".!"month*".!B"nl*".!B"bl*".!B"number*" {
    replace "month*" "word*"; push; push; .reparse
  }
  B"month*".!"month*".!E"number*" {
    replace "month*" "word*"; push; push; .reparse
  }

  # tokenlist:
  # --- >> 4dots codeblock codeline emline nl text uutext uuword word
  # month number date datelist ulist olist dlist 
  #
  # remove pesky newline tokens, 4dots handled elsewhere
  #"nl*text*","nl*word*","nl*emline*","nl*codeline*",
  #"nl*codeblock*" 

  # vanish nl/bl when not needed.
  "nl*date*","bl*date*","nl*enddatelist*","bl*enddatelist*",
  "nl*ulist*","bl*ulist*","nl*olist*","bl*olist*",
  "nl*dlist*","bl*dlist*",
  "nl*datelist*","bl*datelist*",
  "nl*codeline*","bl*codeline*",
  "nl*codeblock*","bl*codeblock*",
  "nl*emline*","bl*emline*"
  {
    # delete nl token
    clop; clop; clop; push; clear;
    # ignore newline
    get; --; put; ++; clear;
    .reparse
  }

  # vanish enddatelist* if it wasnt already reduced. 
  B"enddatelist*".!"enddatelist*" {
    replace "enddatelist*" "text*";
    push; push; .reparse
  }

  pop;
  # -----------------
  # 3 token datelists

  "text*text*enddatelist*" {
     clear; get; ++; get; --; put; clear;
     ++; put; --; 
     add "text*enddatelist*"; push; push;
     .reparse
  }

  # vanish strange date tokens.
  "month*number*month*" {
    clear; add "word*word*word*"; push; push; push; 
    .reparse
  }

  # finish off the date list
  # we will make enddatelist chew up previous text tokens so:
  #  enddatelist ::= text enddatelist
  # This allows to resolve
  #  text ::= datelist*text*text*enddatelist*
  # This in turn allows us to include more markup in lists.
  # Each token is responsible for turning itself into text* when
  # it is no longer needed.
  "datelist*text*enddatelist*" {
    clear; 
    # if text is empty (meaning a date with no text) we may need to 
    # add some text or else html may think that we are nesting
    # a list.
    ++; get; 
    "",[:space:] { clear; add "..."; put; }
    --; clear;

    add "\n <dl class='datelist'>\n"; get;
    add "\n "; ++; get; --; 
    add "\n </dd></dl>\n\n"; 
    put; clear; 
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*"; push; .reparse
  }

  # -------------------------
  # 5 token datelist reductions
  pop; pop;
  # resolve dates, but need to leave the trailing newline
  # because it is used for many other things
  "nl*number*month*number*nl*","bl*number*month*number*nl*" {
    clear; ++; get; --;
    # make sure 1st number is a valid day number
    "0","00","000","0000" { 
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clip; clip;
    # >2 digits, not day number
    !"" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clear; ++; get; --;
    # is valid day number (01-31 or 1-31)
    # this is tricky
    clear; 
    # now check the year number
    ++; ++; ++; get; --; --; --;
    clip; clip; 
    # less than 3 digits not allowed for year
    B"0","" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    clip; clip;
    # >4 digits, not a year
    !"" {
      clear; add "word*word*word*"; 
      push; push; push; .reparse
    }
    # now assemble date value
    ++; get; add " "; ++; get; add " "; ++; get; lower; 
    replace "jan " "January ";
    replace "feb " "February ";
    replace "mar " "March ";
    --; --; --; put; clear;
    # conserve trailing newline
    add "\n"; ++; put; --; 
    clear; add "date*nl*"; push; push; .reparse
  }

  # end datelist parsing
  push; push; push; push; push;

 
  # -------------
  # General token parsing.
  # 1 token
  pop;

  "nl*" { nop; }

  # here we classify words into other tokens
  # we can use accumulator with a+ a- to determine if current
  # word is the first word of the line, or even count number of 
  # words per line. This should simplify grammar items such as
  # nl/---  and nl/star/ etc
  # another advantage, is that we can dispense with tokens such as 
  # ---, >> etc and not have to get rid of them later.
  "word*" {
    clear; get; 
    # no numbers in headings!
    [A-Z] { clear; add "uuword*"; push; .reparse }

    # at least three --- on a newline marks a code block start
    # use 'count;' here to simplify. The token --- probably doesnt
    # need to exist.
    B"---".[-] { clear; add "---*"; push; .reparse }

    ">>" { add "*"; push; .reparse }

    # subheading marker
    B"....".[.] { clear; add "4dots*"; push; .reparse }

    # dash is used for lists 
    # only make a dash token if it is first word on the line
    "-" { 
      clear; count;
      "1" { clear; add "dash*"; push; .reparse }
      clear; get;
    }

    # ordered list start token 
    # only make token if it is first word on the line
    "o/-","O/-","0/-" { 
      clear; count;
      "1" { clear; put; add "olist*"; push; .reparse }
      clear; get;
    }

    # unordered list start token 
    "u/-","U/-" { 
      clear; count;
      "1" { clear; put; add "ulist*"; push; .reparse }
      clear; get;
    }

    # definition/description list start token 
    # need to parse a bit differently because of the desc
    "d/-","D/-" { 
      clear; count;
      "1" { 
        clear;
        # read the definition term here; special characters in it
        # are not escaped yet. Open the <dd> here so that it matches the
        # </dd> that the next item or the end of the list emits.
        add "\n <dt class=title>"; whilenot [\n:]; add "</dt><dd>"; put;
        # remove ":" or \n
        !(eof) { read; } clear; 
        add "dlist*"; push; .reparse
      }
      clear; get;
    }

    # star on newline marks emphasis, list or code description 
    # probably dont need star token.
    "*" { 
      # check that * is 1st 'word' on line using accumulator
      clear; count; 
      !"1" { clear; add "*"; }
      "1" {
        clear; while [ \t\f]; clear;
        whilenot [\n]; cap; put; clear;
        # this is a trick, because we want special LaTeX chars to
        # be escaped. So, will add \\emph{} after next replace code. 
        add "::EMPH::"; get; put;
        #add "emline*"; push; .reparse
      }
    }

    # need to escape < > & ? 

    # This interferes with the page title
    # replace "&" "&amp;";
    replace ">" "&gt;";
    replace "<" "&lt;";
    # make apostrophes nice
    replace "I'm" "I&rsquo;m";
    # it's that's etc
    replace "thats" "that&rsquo;s";
    replace "t's" "t&rsquo;s";
    replace "e're" "e&rsquo;re";
    # they're They're
    replace "hey're" "hey&rsquo;re";
    # isn't can't don't
    replace "n't" "n&rsquo;t";
    replace "dont" "don&rsquo;t";
    replace "isnt" "isn&rsquo;t";
    replace "cant" "can&rsquo;t";
    replace "wont" "won&rsquo;t";
    replace "youll" "you&rsquo;ll";
    replace "you'll" "you&rsquo;ll";
    replace "he'll" "he&rsquo;ll";
    replace "theyll" "they&rsquo;ll";
    replace "they'll" "they&rsquo;ll";

    # now make the emphasis line token, after special chars have 
    # been escaped.
    B"::EMPH::" { 
      replace "::EMPH::" " <em>"; add "</em>";
      put; clear;
      add "emline*"; push; .reparse
    }

    # If a previous test has matched, then the workspace should
    # be clear, and so none of the following will match.

    # graphical key representations
    B"[".E"]" {
      replace "[esc]" "<kbd>Esc</kbd>";
      replace "[enter]" "<kbd>Enter</kbd>";
      replace "[return]" "<kbd>Return</kbd>";
      replace "[insert]" "<kbd>Ins</kbd>";
      replace "[shift]" "<kbd>Shift</kbd>";
      replace "[delete]" "<kbd>Del</kbd>";
      replace "[home]" "<kbd>Home</kbd>";
    }

    put;     
    
    # urls are important in html
    B"file://",B"http://",B"https://",B"www.",B"ftp://",B"nntp://" { 
      !"file://".!"http://".!"https://".!"www.".!"ftp://".!"nntp://" {
        # clear; add "url*"; push; .reparse
        clear; add "<a href='"; 
        get; 
        replace "href='www" "href='http://www";
        add "'>"; get;
        add "</a>";
        put; clear; 
      }
    }

    # mark up acronyms, case insensitive. Small capitals can be 
    # applied to the <abbr> element with css.
    lower;
    "antlr","pdf","json","ebnf","bnf","dns","html" {
      clear; add "<abbr>"; get; add "</abbr>"; put; clear;
    }
    # restore the mixed-case version of the input word
    !"" { clear; get; }

    # filenames, could be elided with quoted filenames
    "parse>","print","pop","push","get","put",".reparse",".restart", "add",
    "sed","awk","grep","pep","nom","less","stdin","stdout","bash",
    "lex","yacc","flex","bison","lalr","gnu",
    E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
    E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
    E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
    E".png",E".jpg",E".jpeg",E".bmp",
    E".mp3",E".wav",E".aux",
    E".tar",E".gz",E"/" {
      clear; add "<code>"; get; add "</code>"; put; clear;
    }

    # mark up language names
    "python","java","ruby","perl","tcl","rust","swift","markdown",
    "c","c++" {
      clear; add "<em><code>"; get; add "</code></em>"; put; clear;
    }

    # paths and directories ? 
    B"../".!"../" {
      clear; add "<code>"; get; add "</code>"; put; clear;
    }

    B'"'.E'"'.!'""'.!'"' {
      # filenames in quotes
      clip; clop; put;
      # quoted uppercase words in headings
      [A-Z] {
        # add html curly quotes to the heading word
        # laquo; is like '<<'
        # ldquo; is curly left quote
        clear; add "&laquo;"; get; add "&raquo;"; put; clear;
        add "uuword*"; push; .reparse 
      }

      # markup language names
      "python","java","ruby","perl","tcl","rust","swift","markdown",
      "c","c++","forth" {
        clear; add "<em><code>"; get; add "</code></em>"; put; clear;
      }

      # markup filenames and some unix and pep/nom names as fixed-pitch
      # font. 
      "parse>","print","pop","push","get","put",".reparse",".restart", "add",
      "sed","awk","grep","pep","nom","less","stdin","stdout","bash",
      E".h",E".c",E".a",E".txt",E".doc",E".py",E".rb",E".rs",E".java",E".class",
      E".tcl",E".tk",E".sw",E".js",E".go",E".pp",E".pss",E".cpp",E".pl",
      E".html",E".pdf",E".tex",E".sh",E".css",E".out",E".log",
      E".png",E".jpg",E".jpeg",E".bmp",
      E".mp3",E".wav",E".aux",
      E".tar",E".gz",E";"
        { clear; add "&laquo;<code>"; get; add "</code>&raquo;"; put; clear; }
      # everything else in quotes (but only words without spaces!)
      !"" { clear; add "'<code>"; get; add "</code>'"; put; clear; }
    }

    # filenames 
    # crude pattern checking.
    B"/".!"/" {
      clip; E"." { clear; add "<pre class='file'>"; get; 
        add "</pre>"; put; clear; }
      clip; E"." { 
        clear; add "<pre>"; get; add "</pre>"; put; clear; 
      }
      clip; E"." { clear; add "<pre>"; get; add "</pre>"; put; clear; }
    }

    # emphasis is *likethis* (only words, not phrases) 
    # This is the same as markdown but words only. A lone "*" is
    # excluded, otherwise it would be rendered as an empty <em></em>.
    B"*".E"*".!"**".!"*" {
      clip; clop; put; clear; 
      add "<em>"; get; add "</em>"; put; clear;
    }

    # && starting a line marks the document title 

    # the document 'title' after && or first heading. Note that & is
    # not escaped in this html version (the replace above is disabled),
    # so the word to test for is the literal &&
    "&&" { 
      clear; count; 
      "1" {
        clear; while [ \t\f]; clear;
        whilenot [\n]; put; clear;
        add "<h2> "; get;
        add "</h2> \n"; put; clear;
      }
    }

   # A quote, starting the line
   "quote:" { 
      clear; count; 
      "1" {
        clear; while [ \t\f]; clear;
        whilenot [\n]; put; clear;
        add "<blockquote cite=''>"; get;
        add "</blockquote> \n"; put; clear;
      }
    }

    clear; add "word*";
  }

  pop;
  # -------------
  # 2 tokens
  #--------------------
  # images 
  # standard format is [[*imfile*quote*width*float*]]*
  # A width is "50%" or "200pt"; float is left/right/center 
  # imfile is an image file name. quote/width/float are optional
  # tokens. The order of tokens is mandatory

  # remove newline and blank line tokens when parsing
  # images. But this is tricky, because we want to preserve
  # them otherwise.

  # remove nl/bl tokens in image formats  
  "[[*nl*","[[*bl*","imfile*nl*","imfile*bl*",
  "quote*nl*","quote*bl*","width*nl*","width*bl*",
  "float*nl*","float*bl*" { 
    push; clear; .reparse
  }
  
  # vanish [[ if not followed by imfile
  B"[[*".!"[[*".!E"imfile*" {
    replace "[[*" "word*"; push; push; .reparse
  }

  # vanish ]] where not significant
  E"]]*".!"]]*".
  !B"imfile*".!B"float*".!B"quote*".!B"width*" {
    replace "]]*" "word*"; push; push; .reparse
  }

  # vanish imfiles 
  B"imfile*".!"imfile*".!E"float*".!E"quote*".!E"width*".!E"]]*" {
    replace "imfile*" "word*"; push; push; .reparse
  }
  E"imfile*".!B"[[*" {
    replace "imfile*" "word*"; push; push; .reparse
  }

  # vanish quotes
  B"quote*".!"quote*".!E"float*".!E"width*".!E"]]*" {
    replace "quote*" "word*"; push; push; .reparse
  }
  E"quote*".!"quote*".!B"imfile*" {
    replace "quote*" "word*"; push; push; .reparse
  }

  # vanish widths
  B"width*".!"width*".!E"float*".!E"]]*" {
    replace "width*" "word*"; push; push; .reparse
  }
  E"width*".!"width*".!B"quote*".!B"imfile*" {
    replace "width*" "word*"; push; push; .reparse
  }

  # vanish floats
  B"float*".!"float*".!E"]]*" {
    replace "float*" "word*"; push; push; .reparse
  }
  E"float*".!"float*".!B"width*".!B"quote*".!B"imfile*" {
    replace "float*" "word*"; push; push; .reparse
  }

  # Add missing attributes here. This is a technique for 
  # providing "optionality" in pep/nom scripts
  "width*]]*" {
    clear; add "width*float*]]*"; 
    push; push; push; 
    # also add an appropriate attribute for a center float
    --; --; get; ++; put; 
    clear; --; put; ++; ++;
    .reparse
  }
  "quote*]]*","quote*float*" {
    replace "quote*" "quote*width*"; push; push; push;
    # now transfer the attributes and add null quote
    --; --; get; ++; put; 
    # or add an appropriate width
    clear; --; put; ++; ++;
    .reparse
  }
  "imfile*]]*","imfile*width*","imfile*float*" {
    replace "imfile*" "imfile*quote*";
    push; push; push; # ws should be clear
    # now transfer the attributes and add null quote
    --; --; get; ++; put; 
    # or put a null quote here.
    clear; --; put; ++; ++;
    .reparse
  }

  # End image token manipulation

  # elide text
  "text*text*","word*text*",
  "word*word*","text*word*",
  "word*uuword*","text*uuword*","uutext*word*","uuword*word*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # tokenlist:
  # --- >> 4dots codeblock codeline emline nl text uutext uuword word
  # codeblock,
  # remove pesky newline tokens, 4dots handled elsewhere
  # not really working
  #*
  "nl*text*","nl*word*","nl*emline*","nl*codeline*",
  "nl*codeblock*" {
    # delete nl token
    clop; clop; clop; push; clear;
    # ignore newline
    get; --; put; ++; clear;
    .reparse
  }
  *#

  "nl*text*","nl*word*", "bl*text*","bl*word*" {
    clear; get; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  "nl*dash*" {
    clear; get; ++; get; --; put; clear;
    add "dash*"; push; .reparse
  }

  "nl*emline*","bl*emline*" {
    clear; ++; get; --; put; clear;
    add "emline*"; push; .reparse
  }

  # We are using a dummy nl* token at the start of the doc, so the 
  # codeblock* codeline* etc tokens are not able to be the first token
  # of the document. So we can remove the !"codeblock*". clause.

  # multiline codeblocks with no caption 
  E"codeblock*".!"codeblock*".!B"emline*" {
    replace "codeblock*" "text*"; push; push; clear; 
    # dont need to markup here because already done, just 
    # transfer the attribute.
    # add "\n\n <pre class=codeblock><code>\n  ";
    --; get; 
    # add " </code></pre> \n";
    put; ++; clear; .reparse
  }

  # single line code with no caption 
  E"codeline*".!"codeline*".!B"emline*" {
    replace "codeline*" "text*"; push; push; clear; 

    # dont need to markup here because already done.
    # add "\n\n <pre><code>\n  ";
    --; get;  
    # add " </code></pre> \n";
    put; ++; clear; .reparse
  }

  # eliminate emline* tokens (not followed by codeblock/line)
  # the logic is slightly different because emline* is significant before
  # other tokens, not after.
  # also, consider emline*text*nl*
  B"emline*".!E"nl*".!E"codeline*".!E"codeblock*" {
    replace "emline*" "text*"; push; push; 
    # make emline display on its own line, even when not
    # followed by codeline/codeblock. LaTeX will treat a blank line 
    # as a paragraph break, but \newline or \\ could be used.
    --; --; add "\n\n"; get; add "\n\n"; put; clear;
    # restore the tape pointer so that it lines up with the 2 tokens
    # that were just pushed.
    ++; ++;
    .reparse
  }

  # remove insignificant 4dots* tokens, 
  # 4 dots (....) marks a subheading and always comes at the end of 
  # all capitals line. Just replacing the 4dots token with a text
  # token is safer and more logical.
  E"4dots*".!B"uutext*".!B"uuword*" {
    replace "4dots*" "text*"; push; push; .reparse
  }

  # remove insignificant ---* tokens
  E"---*".!B"nl*".!B"bl*" {
    clear; get; add " "; ++; get; --; put; clear;
    add "text*"; push; .reparse
  }

  # remove insignificant >>* tokens
  # lets assume that codelines cant start a document? Or lets
  # generate a dummy nl* token at the start of the document to 
  # make parsing easier.
  E">>*".!B"nl*".!B"bl*" {
    #clear; get; add " "; ++; get; --; put; clear;
    #add "text*"; push; .reparse
    replace ">>*" "text*"; push; push; .reparse
  }

  # elide upper case text 
  "uuword*uuword*","uutext*uuword*" {
    clear; get; add " "; ++; get; --; put; 
    clear; add "uutext*"; push; .reparse
  }

  # a blank line token for terminating lists etc 
  # bl/bl should not happen really
  "nl*nl*","bl*nl*","bl*bl*" {
    clear; get; ++; get; --; put; clear;
    add "bl*"; push; .reparse
  }

  # code line (starts with >>) 
  "bl*>>*","nl*>>*" { 
    # ignore leading space.
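    clear; while [ \t\f]; clear;
    # escape < and > so that the code displays literally in html
    whilenot [\n]; replace ">" "&gt;"; replace "<" "&lt;"; put; clear;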
    add " <pre class='oneline'><code> "; get;
    add " </code></pre>\n"; put; clear;
    add "codeline*"; push; .reparse
  }

  # code block marker 
  "bl*---*","nl*---*" { 
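    clear; until ",,,"; clip; clip; clip;
    # remove excessive indentation.
    replace "\n   " "\n";
    # escape < and > so that the code displays literally in html
    replace ">" "&gt;"; replace "<" "&lt;";
    put; while [,]; clear;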
    add "\n <pre><code>"; get;
    add "\n </code></pre> \n"; put; clear;
    add "codeblock*"; push; .reparse
  }

  # a code block with its preceding description
  "emline*codeblock*" {
    clear; 
    add "\n\n <figure class='code.block'><figcaption>\n  ";
    get; add " </figcaption> "; ++; get; --; 
    add " </figure> \n"; put; clear;
    add "text*"; push; .reparse
  }

  # a code line with its preceding description
  "emline*codeline*" {
    clear; 
    add "\n\n <figure class='code.line'><figcaption>\n  ";
    get; add " </figcaption> "; ++; get; --; 
    add " </figure> \n"; put; clear;
    add "text*"; push; .reparse
  }

  # probably indicates an empty - at the end of a list
  # add a dummy text token
  "olist*bl*","ulist*bl*","dlist*bl*" {
    push; clear; add "empty"; put; 
    clear; add "\n\n"; ++; put; --;
    clear; add "text*bl*"; push; push; .reparse
  }

  # or use this to terminate the list, and so allow nested lists
  "olist*dash*","ulist*dash*","dlist*dash*" {
    push; clear; add "empty"; put; 
    clear; add "text*dash*"; push; push; .reparse
  }

  pop;
  # -------------
  # 3 tokens
  "olist*word*dash*","ulist*word*dash*","dlist*word*dash*",
  "olist*word*bl*","ulist*word*bl*","dlist*word*bl*" {
    replace "word*" "text*"; 
    # or dont reparse
    # push; push; push; .reparse
  }

  # eliminate dashes that are not part of a list
  # eg: ulist*dash* olist*text*dash* dlist*word*dash*
  # the logic is tricky, how do we know there are really 3 tokens 
  # here, and not 2. This is the problem with negative tests.
  # doesnt matter because not altering attributes here.
  E"dash*" {
    !B"ulist*text*".!B"olist*text*".!B"dlist*text*" {
      replace "dash*" "text*"; push; push; push; .reparse
    }
  }

  "olist*text*dash*" {
    clear;
    get; add "\n <li> "; ++; get; --; put; clear;
    add "olist*"; push; .reparse
  }

  # could be elided, but for readability, no
  "ulist*text*dash*" {
    clear;
    get; add "\n <li> "; ++; get; --; put; clear;
    add "ulist*"; push; .reparse
  }

  # 
  "dlist*text*dash*" {
    clear;
    # already have \item start
    get; add " "; ++; get; --; 
    # also, put a \verbatim in [] because text is not escaped??
    # The definition term is delimited by a newline or a ":" character
    add "\n </dd><dt>"; whilenot [\n:]; add "</dt><dd> "; put; 
    # get rid of the trailing ":"
    !(eof) { read; } clear;
    add "dlist*"; push; .reparse
  }

  # finish off the ordered list, also could finish it off with 
  # ulist*dash* ??
  "olist*text*bl*" {
    clear; 
    add "\n <ol>\n"; get;
    add "\n <li> "; ++; get; --; 
    add "\n </ol>\n\n"; 
    put; clear;
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # finish off the unordered list
  "ulist*text*bl*" {
    clear; 
    add "\n <ul>\n"; get;
    add "\n <li> "; ++; get; --; 
    add "\n </ul>\n\n"; 
    put; clear; 
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # finish off the description list
  "dlist*text*bl*" {
    # or check here if it is D/- or d/- for nextline style
    # or use \hfill \\ on each item which also works
    clear; 
    add "\n <dl>\n"; get;
    add "\n "; ++; get; --; 
    add "\n </dl>\n\n"; 
    put; clear; 
    # insert the blankline attribute
    add "\n\n"; ++; put; --; clear;
    add "text*bl*"; push; push; .reparse
  }

  # top level headings, all upper case on the line in the source document.
  # dont need a "heading" token because we dont parse the document as a 
  # hierarchy, we just render things as we find them in the stream.
  "nl*uutext*nl*","nl*uuword*nl*",
  "bl*uutext*nl*","bl*uuword*nl*" {
    clear; 
    # Check that heading is at least 4 chars
    ++; get; --; clip; clip; clip; 
    "" { 
      add "nl*text*nl*"; push; push; push; .reparse
    }
    clear;
    # make headings capital case
    ++; get; 
    # capitalise even 1st word in latex curly quotes
    # add "<<heading\n"; print; replace "<<heading\n" "";
    B"``" { clop; clop; }
    cap; put; replace "''" "";
    # add open curly quotes if there before.
    !(==) {
      clear; add "``"; get;
    }
    put; --; clear; 
    get; # newline
    add '<h3 class="section.heading">'; ++; get; --; add "</h3>"; put; 
    clear;
    # transfer nl value
    ++; ++; get; --; put; clear; --;
    add "text*nl*"; push; push; .reparse
  }

  # simple reductions ending in a newline
  "nl*text*nl*","nl*word*nl*", "bl*text*nl*","bl*word*nl*",
  "text*text*nl*","emline*text*nl*" {
    clear; get; ++; get; --; put; clear;
    ++; ++; get; --; put; --; clear; # transfer newline value
    add "text*nl*"; push; push; .reparse
  }
  # simple reductions ending in a blank line
  "nl*text*bl*","nl*word*bl*", "bl*text*bl*","bl*word*bl*",
  "text*text*bl*","emline*text*bl*" {
    clear; get; ++; get; --; put; clear;
    ++; ++; get; --; put; --; clear; # transfer blankline value
    add "text*bl*"; push; push; .reparse
  }
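
  # A sketch of what the two reduction blocks above do to the stack and
  # tape (the attribute values here are invented for illustration):
  #   before:  stack = nl* text* nl*    tape = "\n" | "some words" | "\n"
  #   after:   stack = text* nl*        tape = "\nsome words" | "\n"
  # The first two attributes are joined in the first tape cell and the
  # trailing nl* (or bl*) attribute moves down one cell.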

  pop;
  # -------------
  # 4 tokens

  # sub headings, 
  "nl*uutext*4dots*nl*","nl*uuword*4dots*nl*",
  "bl*uutext*4dots*nl*","bl*uuword*4dots*nl*" {
    clear; 

    # Check that sub heading text is at least 4 chars ?
    # yes but need to transfer 4dots and nl

    clear;
    # make subheadings capital case
    ++; get; 
    # capitalise even 1st word in HTML curly quotes
    B"``" { clop; clop; }
    cap; put; replace "''" "";
    # add open curly quotes if there before.
    !(==) {
      clear; add "``"; get;
    }
    put; --; clear; 
    get; # newline
    add '<h4 class="subsection.heading">'; ++; get; --; add "</h4>"; put; clear;
    # transfer nl value, really? just add "\\n" no? 
    ++; ++; ++; get; --; --; put; clear; --;
    add "text*nl*"; push; push; .reparse
  }

  pop;

  #------------------
  # 5 tokens
  pop;
  # -------------
  # 6 tokens

  # all images have been standardised to this format (all 
  # optional tokens have been added but may be empty).
  # note from the latex version: for latex, image formats are jpeg, png
  # or pdf and others need to be converted. Can image names have dots
  # in them???
  "[[*imfile*quote*width*float*]]*" {
    # need to translate widths, floats etc to html/css here because
    # they may revert to word* if out of context.
    clear; 
    ++; 
    # get quote, if any, and remove """
    ++; get; clip; clip; clip; clop; clop; clop; 
    put; clear;
    # get width attribute
    ++; get; 
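    # the latex version turned the percentage into a decimal fraction of
    # \textwidth here. CSS accepts the percentage as it is, so that code
    # is commented out.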

    #E"%","" { 
    #  clip;  
    #  !"".!"100" { put; clear; add "0."; get; }
    #  "" { add "0.60"; } # default width 60%
    #  "100" { clear; add "1.00"; }
    #  add "\\textwidth";
    #}

    # default width
    "" { add "60%"; }

    put; clear; 
    # translate float specs into a CSS float value; the default (and
    # centre) is "none"
    ++; get; "" { add "none"; }
    # unknown positioning spec
    !"ccc".!"<<<".!">>>" { clear; add "none"; }
    "ccc" { clear; add "none"; } # centre
    "<<<" { clear; add "left"; } # left
    ">>>" { clear; add "right"; } # right
    put; clear;
    add "\n<figure style='float:"; 
    # position attribute
    get; add "; width:"; 
    # width attribute
    --; get; add "'>\n";
    add "<img style='width:"; 
    # width attribute again
    get;
    --; --; add "' src='"; 
    # image file name
    get; --; add "'/>\n"; 
    add "<figcaption class='image.caption'>";
    # get the quote attribute
    ++; ++; get; --; --; add "</figcaption>\n</figure>";
    put; clear;
    add "text*"; push; .reparse
  }
 
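  # example: a 75% wide image floated right, with file name image/test.jpg
  # and caption "A test", is rendered by the block above as
  #  <figure style='float:right; width:75%'>
  #  <img style='width:75%' src='image/test.jpg'/>
  #  <figcaption class='image.caption'>A test</figcaption>
  #  </figure>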

  push; push; push; push; push; push;

  (eof) {
    # or use 'unstack' but does it adjust the tape pointer?
    pop; pop; pop; pop; pop; pop;

    # "nl*word*","nl*text*" have already been dealt with.

    # we would like "permissive" parsing, because this is just
    # a document format, not code, so will just check for starting
    # text token

    B"text*",B"word*" {
      # show the token parse stack at the top of the document
      ++; put; clear; 
      add "<!-- Document parse-stack is: "; get; add " -->\n"; --;
      # keep the comment in the workspace so that it heads the output
      # make a valid HTML document
      add '
   <!-- html generated by nom script: mark.html.pss -->
   <html>
   <head>
   <meta http-equiv="Content-type" content="text/html;charset=UTF-8" />
   <style>

    figure {
      float: right;
      width: 30%;
      text-align: center;
      font-style: italic;
      font-size: smaller;
      text-indent: 0;
      border: thin silver solid;
      margin: 0.5em;
      padding: 0.5em;
    }

   </style>
   <title></title>
   </head>
   <body>

  ';

      get; 
      add "\n </body> \n";
      add "\n </html> \n";
      add "\n\n <!-- Document parsed as text*! luckily -->\n"; 
      # show parse-stack at end of doc as well
      ++; add " <!-- Document parse-stack is: "; get; add "--> \n"; --;
      print; quit;
    }

    stack; 
    add "Document parsed unusually!\n";
    add "Stack at line "; lines; add " char "; chars; add ": "; print; clear; 
    unstack; print; stack; add "\n"; print; clear;
    quit;

  }