Pep and Nom


general notes about grammars and nom

General notes about formal language grammars and nom.

In the Medieval period in Europe the four pillars of education were Rhetoric, Logic, Theology and Grammar. Today this seems a strange classification of knowledge and study, but Logic and Grammar are the basis of computer science.

I should state here that I am not a great formal-language theorist or high-level mathematician who can give insights into grammar theory. This page is just a set of heuristic notes about things I have discovered while parsing languages with the pep/nom system.

look ahead

You can use the beginswith and endswith modifiers to achieve token lookahead in nom. The scripts /eg/maths.parse.pss and /eg/maths.to.latex.pss have good examples of using lookahead.

The document /doc/howto/nom.lookahead.html has detailed information about how to write lookahead into a nom script.

There is also the peep register, which holds a single character and which is not used directly by the script writer.
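The general idea can be sketched using only the plain token tests that appear in the recogniser example further down this page (the beginswith/endswith modifiers described in the howto document above are the more convenient way to write this). The trick is to pop one more parse-token than the reduction actually needs, so that the extra token can be inspected before deciding what to do. The token names below are just illustrative, and the lexing blocks (the same as in the recogniser example) are omitted.

a contrived sketch of token lookahead in the parse> section

  parse>
    pop; pop;
    # pop a third parse-token so we can peek at what follows the pair
    pop;
    # reduce the pair only when it is followed by a semicolon token;
    # the semicolon is only inspected, not consumed: it is pushed back
    "keyword*parameter*;*" {
      clear; add "statement*"; push;
      clear; add ";*"; push;
      .reparse
    }
    # nothing matched: return the parse-tokens to the stack
    push; push; push;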

associativity

This is about how we parse “a + b + c”. Do we parse it as “(a + b) + c” (left-associative) or as “a + (b + c)” (right-associative)?

With nom we need to build this associativity into the grammar. See the arithmetic parser example for an example of associativity (there the + and - operators associate to the right).
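For example, the two readings correspond to two different shapes of grammar rule. A sketch in the same ebnf style used further down this page, with illustrative rule names:

left-associative and right-associative ebnf rules

   expr = expr '+' term ;
   expr = term '+' expr ;

The first (left-recursive) rule groups “a + b + c” as “(a + b) + c”; the second (right-recursive) rule groups it as “a + (b + c)”. A real grammar would use one or the other, together with rules defining 'term'.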

precedence

This is about how we parse “a + b * c”. Do we parse it as “(a + b) * c” (giving '+' the higher precedence) or as “a + (b * c)” (giving '*' the higher precedence)?

Again, this has to be factored into the grammar.
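In a grammar this is usually done by layering the rules so that the higher-precedence operator is matched deeper down. A sketch with illustrative rule names, giving '*' a higher precedence than '+':

ebnf rules where '*' binds more tightly than '+'

   expr   = expr '+' term ;
   expr   = term ;
   term   = term '*' factor ;
   term   = factor ;
   factor = number ;

Because a 'term' can never contain a bare '+', the input “a + b * c” can only be grouped as “a + (b * c)”.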

parser generators

A parser generator is basically a way to avoid writing a recursive descent parser (and compiler) or some other style of hand-coded compiler.

kleene star

The Kleene star is a grammar construct which means “zero or more of something”. The same construct exists in regular expressions.

example kleene star in a sed regular expression
 s/ab*c//

This means: 'a', followed by zero or more 'b's, followed by 'c'. The same construct is used in many EBNF grammar notations.

The rule below means that a 'statement' is a 'keyword', followed by zero or more 'parameters', followed by a semicolon.

kleene star in ebnf grammar rule
 statement = keyword parameter* ';' ;

How can we translate this into nom? We cannot create a single block which represents the rule above, but by using several blocks we can create a nom fragment which does.

The code below is a complete “lexer” and “recogniser” for statements like those shown below. Notice that white-space is completely ignored, except inside the quoted parameters.

example of statements recognised below

     say 'good'; fizz; bang 'crash' ;
     whirr 'ok' 'okay'; grow'big' 'small' 'tiny' ;
     bird
       'tweet' 'fly '
       'nest' ;
  

The nom script below is a recogniser: it doesn't actually compile/transpile/transform the input text, it just determines whether the input text consists of valid “statements”. Also notice that the single ebnf rule with a Kleene star is broken up into three nom blocks (statements between braces) in the parse> section.

equivalent ebnf rules as used by nom

    statement = keyword ';' ;
    statement = keyword parameter ';' ;
    statement = keyword parameterset ';' ;
    parameterset = parameter parameter ;
    parameterset = parameterset parameter ;
  

Although there are 5 ebnf rules above, nom only requires 3 blocks, because a single test can contain several alternative token sequences separated by the ',' comma character (OR logic).

nom script for the ebnf rule with the kleene star

  # "lexing"
  
  read; 
  # literal token, the character is also the parse-token
  ';' { add "*"; push; }
  [:space:] { while [:space:]; clear; }  
  [:alpha:] { while [:alpha:]; put; clear; add "keyword*"; push; }
  "'" { until "'"; put; clear; add "parameter*"; push; }
  !"" { 
    put; clear; 
    add "?? bad char ["; get; add "]\n";
    print; quit;
  }
 
  # The nom version of the ebnf rule
  #   statement = keyword parameter* ';' ;
  parse>
    # show the stack reductions for debugging
    # add "line "; lines; add " char "; chars; add ": "; print; clear;
    # unstack; print; stack; add "\n"; print; clear;
    # pop two parse-tokens off the stack into the workspace
    pop;
    pop;
    "keyword*;*" 
      { clear; add "statement*"; push; .reparse }
    "parameter*parameter*","paramset*parameter*" 
      { clear; add "paramset*"; push; .reparse }
    # pop a third token so that three-token sequences can be tested
    pop;
    "keyword*parameter*;*","keyword*paramset*;*" { 
      clear; add "statement!\n"; print;
      clip; clip; add "*";
      push; .reparse 
    }
    # no test matched: return the parse-tokens to the stack
    push;push;push;
  

parse token matching or recognising

Nom can only match or recognise a fixed number of parse-tokens at any point in the script. This is because it needs to pop the parse tokens off the stack before it can match parse token sequences.

This is why ebnf rules of the form


   a = b+ c ;
   a = c b* ;
  

need to be broken up into several NOM blocks, because b+ and b* represent a variable number of parse tokens.
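For example, the second rule above can be rewritten without the '*' operator (the rule names are illustrative), in the same way that the statement rule was broken up earlier; each resulting rule has a fixed-length right-hand-side and so corresponds to a test on a fixed number of parse-tokens.

the rule "a = c b* ;" rewritten without the kleene star

   a = c ;
   a = c bset ;
   bset = b ;
   bset = bset b ;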

recursive descent parsing

LL grammars (recursive descent parsers correspond, more or less, to LL grammars)

Although grammar theorists talk about all sorts of different types of parsing, in practice the parsers written by jobbing programmers seem to be recursive descent. There is a very surprising statement on Robert Nystrom's blog that all compilers for modern languages in common use are recursive descent. Mr. Nystrom is involved in the design of the Dart language and has written a great book about parsing and compiling, so I am pretty convinced he knows what he is talking about. Recursive descent is also the type of parser/compiler that seems to be taught in university computer-science courses. Niklaus Wirth of Pascal fame was a strong proponent of this type of parser/compiler.

I don't wish to criticise recursive descent parsers, because they obviously work, but it does seem odd to me that this is still the main way to recognise and translate formal languages.

Nom does not do recursive descent parsing: in fact I wrote nom because I wanted a way to understand the grammars of languages that was more “natural” for my way of thinking. The pep & nom system does shift-reduce parsing (www.google.com/search?q=shift+reduce+parsing), but it does it purely as a (unix-style) text-stream filter.

The pep/nom lexing/parsing/compiling process is as follows. In the lexical analysis phase of the nom script, pep/nom reads the input stream (more or less) one Unicode character at a time and creates parse-tokens and their “attributes” in a text buffer. The parse-tokens are pushed onto a stack and the “attributes” (the partially compiled/translated text) are put into the tape. Then, during the parsing or “shift-reduce” phase of the script, the parse-tokens are popped off the stack back into the workspace buffer. If a particular parse-token sequence is matched or recognised, the workspace is cleared, and the token attributes are got from the tape, compiled, and then put back onto the tape. The workspace (string) buffer is then cleared and the new parse-token is created in the workspace with the add command. Finally, the new parse-token or tokens are pushed onto the stack.
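As a concrete illustration of the lexing half of this cycle, the block below is a commented copy of one block from the recogniser script above: a run of letters becomes a 'keyword*' parse-token on the stack, and the matched text is stored on the tape as the token's “attribute”.

a lexing block which creates a parse-token and its attribute

  [:alpha:] {
    # read the whole word into the workspace
    while [:alpha:];
    # save the word on the tape: it becomes the token "attribute"
    put; clear;
    # create the parse-token name and push it onto the stack
    add "keyword*"; push;
  }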

This entire process is text-oriented, meaning that everything is a “string”, including the parse-tokens, and the parse-tokens and their attributes are manipulated in the same virtual machine with the same commands.

recursion in grammars

Left recursive grammar rules (the non-terminal 'E' refers to itself at the left-hand edge of its own definition)

   E := E + E
   E := T
 
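Left recursion is a well-known problem for recursive descent (LL) parsers, because a naive procedure for the first rule would call itself forever. In a shift-reduce style nom script the rule simply becomes a reduction over a parse-token sequence. A minimal sketch, assuming the lexer pushes 'E*' and '+*' parse-tokens in the style of the examples above:

the left recursive rule as a nom reduction

  "E*+*E*" { clear; add "E*"; push; .reparse }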

grammar and script construction

My knowledge of formal language grammar theory is quite limited. I am more interested in practical techniques. But there is a reasonably close correlation between bnf-type grammar rules and pep/nom script construction.

The right-hand-side of a (E)BNF grammar rule is represented by the quoted text before a brace block, and the left-hand-side correlates to the new token pushed onto the stack.

the rule "<nounphrase> ::= <article> <noun> ;" in a parse script
 "article*noun*" { clear; add "nounphrase*"; push; }

terminology

Some terminology: a production is the same thing as a grammar rule; terminals are the symbols that cannot be expanded any further, and correspond roughly to the tokens produced by the lexer; non-terminals are the symbols defined by the grammar rules. BNF stands for Backus-Naur Form.

automatons

There is a close relationship between virtual machines, automata, grammars and formal languages.

en.wikipedia.org/wiki/Deterministic_finite_automaton