An overview of the syntax of the ℕ𝕠𝕞 parsing language.
The file /doc/syntax/doc.dir.index.html contains a list of the documents available in this folder and links to documents that explain each element of the ℕ𝕠𝕞 syntax.
The nom language is interpreted by the pep tool or compiled by one of the language translation scripts, which are themselves nom scripts. It is implemented in the file compile.pss and has a syntax somewhat similar to that of the “sed” text stream editor (but hopefully less cryptic).
Nom commands can be placed in blocks which can be nested as deeply as required. The statements in the block are only executed if the tests return true.
Unlike sed, it also allows long names for commands (e.g. “clear” instead of “d”, “add” instead of “a”). Each command has a long and a short form. All commands must be terminated with a semicolon except for the following:
.reparse .restart parse>
White-space is not significant in the syntax of the parse-script language, except within ' and " quote characters and square brackets []. The three lines below are therefore a valid script, however unevenly they are spaced:
read; !
[a-e]{print;}
clear;
Braces “{” and “}” are used to define blocks of commands (as in sed, awk and c).
The script language (and its syntax) is implemented in the file compile.pss.
Some commands, such as “.reparse” and “.restart” affect the flow of the
program, but not the virtual machine.
[a-z] { print; clear; }
Most scripts start with read or r (which is the abbreviated form). There is no implicit read statement in a nom script (unlike in sed), so a script that does not read anything with read, while, whilenot or until will do nothing and will have to be terminated with [control-c] because it will never quit.
The read command reads one character from the input stream. Whereas sed and awk are line oriented (they process the input stream one line at a time), nom is character-orientated (the input stream is processed one character at a time).
As with sed and awk, nom scripts have an implicit loop. When the interpreter reaches the end of the script, it jumps back to the first command (usually read) and continues looping until the input stream is finished. This behaviour is the same as the .restart command but is implicit in every script.
In the code below .restart is not required because ℕ𝕠𝕞 automatically jumps back to the first statement after the begin block (if there is one). The script is not an infinite loop because the read command exits when it tries to read the <eof> end-of-file marker.
read; print; .restart
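The implicit loop can be modelled outside of pep. The Python sketch below is only an analogy and not how pep is implemented: `run_script` is a hypothetical name, the machine is reduced to a single workspace string, and it assumes that read appends one character to the workspace. It models the nom script `read; print; clear;` (the clear is added so that each character is printed once).

```python
import io

def run_script(text):
    """Model of the nom script `read; print; clear;` with its implicit loop."""
    stream = io.StringIO(text)
    workspace = ""
    output = []
    while True:                    # implicit jump back to the first command
        char = stream.read(1)      # read; takes exactly one character
        if char == "":             # end of input: read ends the script
            break
        workspace += char          # assumption: read appends to the workspace
        output.append(workspace)   # print;
        workspace = ""             # clear;
    return "".join(output)

print(run_script("abc"))           # prints abc
```

The loop has no explicit exit test of its own; it stops only because the read step finds nothing left, which is the behaviour described above.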
It is possible (or probable) that some people will find the syntax of the ℕ𝕠𝕞 language cryptic because it reflects the ℙ𝕖𝕡 virtual machine which underlies it. In order to program in nom you need to have the virtual-machine “in-mind” (this sounds like some Martin Heidegger phrase).
But we can create a more natural language on top of ℕ𝕠𝕞. Let's do it.
The code below recognises ebnf rules in the form
<token>+ = <token>+ ;
Obviously it is a “toy” ebnf parser because this version doesn't compile the ebnf to ℕ𝕠𝕞 script but it is a good start. The script below took about 10 minutes to write. The debugging lines below the parse> label are very handy for seeing how the parse stack is shift-reducing as it reads the input stream.
See the later toyBNF script below for a compiling version.
read;
# ignore white-space
[:space:] { while [:space:]; clear; }
# literal tokens ; and =
";","=" { add "*"; push; }
[:alpha:] {
while [:alpha:]; put; clear;
add "token*"; push;
}
!"" { add " ?? bad char \n"; print; quit; }
parse>
# An important grammar debugging technique for showing
# the parse-stack reductions.
# lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
pop; pop;
"token*token*","sequence*token*" {
clear; add "sequence*"; push; .reparse
}
"token*=*","sequence*=*" {
clear; add "LHS*"; push; .reparse
}
"token*;*","sequence*;*" {
clear; add "RHS*"; push; .reparse
}
"LHS*RHS*" {
clear; add "rule!!\n"; print;
clear; add "rule*"; push; .reparse
}
push; push;
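For readers who find the stack behaviour hard to picture, here is a rough Python model of the same shift-reduce recogniser. It is a sketch under assumptions rather than a translation of the script: `lex`, `recognise` and `REDUCTIONS` are hypothetical names, the parse stack is a plain list, and Python's `isalpha` stands in for the [:alpha:] class.

```python
import re

# Reductions on the top two stack items, like the blocks after parse>.
REDUCTIONS = {
    ("token*", "token*"): "sequence*",
    ("sequence*", "token*"): "sequence*",
    ("token*", "=*"): "LHS*",
    ("sequence*", "=*"): "LHS*",
    ("token*", ";*"): "RHS*",
    ("sequence*", ";*"): "RHS*",
    ("LHS*", "RHS*"): "rule*",
}

def lex(text):
    """Yield parse tokens: words become token*, '=' and ';' are literals."""
    for match in re.finditer(r"[A-Za-z]+|\S", text):   # white-space is skipped
        item = match.group()
        if item.isalpha():
            yield "token*"
        elif item in "=;":
            yield item + "*"
        else:
            raise ValueError(f"bad char {item!r}")

def recognise(text):
    """Return the parse stack left after shifting and reducing every token."""
    stack = []
    for token in lex(text):
        stack.append(token)                      # shift, like push;
        # keep reducing while the top two items match a rule, like .reparse
        while len(stack) >= 2 and tuple(stack[-2:]) in REDUCTIONS:
            stack[-2:] = [REDUCTIONS[tuple(stack[-2:])]]
    return stack

print(recognise("com = word param;"))            # prints ['rule*']
```

Printing `stack` inside the loop gives much the same view of the reductions as the commented-out debugging lines in the nom script.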
See bumble.sf.net/books/pars/eg/toybnf.pss for a development of the script below.
Creating a better language with ℕ𝕠𝕞 as the compile target.
It would be nice to have a more natural language that targets ℕ𝕠𝕞. Let's expand the script above to compile to ℕ𝕠𝕞.
This compiles very simple EBNF to ℕ𝕠𝕞 and is the first example of using nom as the target of a nom script. A strange corollary arises: we can use this new language to implement a recogniser for itself (but not a compiler, because so far our new language has no compiling syntax, just ebnf rule reductions). The script below parses the same syntax as above, but instead of just recognising the syntax, it creates executable ℕ𝕠𝕞 code.
#*
tokens:
LHS left-hand-side of the bnf rule
RHS right-hand-side
sequence a sequence/list of tokens
token one grammar token
'=' ';' literal tokens
*#
read;
# line-relative char numbers
[\n] { nochars; }
# ignore white-space
[:space:] { while [:space:]; clear; }
# literal tokens ; and =
";","=" { add "*"; push; }
[:alpha:] {
# add the default nom parse token delimiter '*'
while [:alpha:]; add "*"; put; clear;
add "token*"; push;
}
!"" {
put; clear;
add "! [toyBNF]\n";
add " bad character '"; get; add "'";
add " at line:"; lines; add " char:"; chars; add "\n";
add " I just can't go on... sorry, goodbye";
print; quit;
}
parse>
# An important grammar debugging technique for showing
# the parse-stack reductions.
# lines; add " char "; chars; add ": "; print; clear;
# unstack; print; stack; add "\n"; print; clear;
pop; pop;
"token*token*","sequence*token*" {
# count tokens to calculate "push;" later
a+;
clear; get; ++; get; --; put;
clear; add "sequence*"; push; .reparse
}
"token*=*","sequence*=*" {
# later have to transform this count number into
# push; or push;push; etc
clear; get; a+; count; put; clear;
# reset the token counter for the RHS
zero;
add "LHS*"; push; .reparse
}
"token*;*","sequence*;*" {
clear; get; a+; count; put;
clear; add "RHS*"; push; .reparse
}
"LHS*RHS*" {
clear;
# first build the new token string
# eg 'add "tok*tok*2"; push; push; '
# that is we need as many pushes as there are tokens and need to
# get rid of the trailing number
get;
# not very elegant but....if you've got more than 6 tokens in a
# row maybe you should reconsider your grammar
# could avoid all this with a 'stack' command that updates the
# tape pointer properly
E"1" { clip; add '"; push;'; }
E"2" { clip; add '"; push; push;'; }
E"3" { clip; add '"; push; push; push;'; }
E"4" { clip; add '"; push; push; push; push;'; }
E"5" { clip; add '"; push; push; push; push; push;'; }
E"6" { clip; add '"; push; push; push; push; push; push;'; }
put; clear; add 'add "'; get; put;
clear;
#*
now need to build the rhs which becomes the nom test in format
this is bit more tricky than the LHS. If we had "stack" it
would be much easier
pop;pop; "c*d*" {
}
push;push;
*#
++;
get;
# build the "pushes" separately and store in tapecell+1
E"1" { clear; add "push;"; }
E"2" { clear; add "push;push;"; }
E"3" { clear; add "push;push;push;"; }
E"4" { clear; add "push;push;push;push;"; }
E"5" { clear; add "push;push;push;push;push;"; }
E"6" { clear; add "push;push;push;push;push;push;"; }
!E"push;" {
clear; add "! sorry 6 token sequence limit\n";
print; quit;
}
++; put; --;
# easier just replace push; with pop; and start building
# the start of the nom block
replace "push;" "pop;";
add '\n"'; get; clip; add '"'; put;
clear;
--;
# now assemble the nom block, but the lhs and rhs
# have already been built.
++; get; --; add ' {\n';
add ' clear; '; get; add ' .reparse \n';
add '}\n';
# now get the prebuilt "pushes" which were saved up on tape.
++; ++; get; --; --;
#print;
put;
clear; add "rule*"; push; .reparse
}
"rule*rule*","grammar*rule*" {
clear; get; add "\n"; ++; get; --; put;
clear; add "grammar*"; push; .reparse
}
push; push;
(eof) {
pop; "rule*","grammar*" {
clear; get; add "\n\n"; print; quit;
}
}
You can save the script above as toyBNF.pss and test it:
pep -f toyBNF.pss -i 'com = word param; block = word newword;'
# sample input BNF rules (white-space doesn't matter):
# com = word param ;
# block = word newword ;
# output:
pop;pop;
"word*param*" {
clear; add "com*"; push; .reparse
}
push;push;
pop;pop;
"word*newword*" {
clear; add "block*"; push; push; push; .reparse
}
push;push;
This is pretty cool, because we now have a toybnf-to-nom compiler that produces executable and translatable (to go/java/tcl/python/ruby etc) ℕ𝕠𝕞 code. But we still need a lexing syntax for our toyBNF language.
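The mapping from one rule to one nom block is small enough to sketch in Python. This is an illustration of the intended translation only: `compile_rule` is a hypothetical name, the layout copies the sample output for the first rule, and toyBNF.pss itself builds the text with the tape and the accumulator rather than with lists.

```python
def compile_rule(rule):
    """Turn 'lhs... = rhs... ;' into the text of one nom reduction block."""
    left, _, right = rule.strip().rstrip(";").partition("=")
    lhs = left.split()
    rhs = right.split()
    test = "".join(name + "*" for name in rhs)      # e.g. word*param*
    result = "".join(name + "*" for name in lhs)    # e.g. com*
    lines = [
        "pop;" * len(rhs),                          # one pop per RHS token
        f'"{test}" {{',
        f'  clear; add "{result}"; {"push; " * len(lhs)}.reparse',
        "}",
        "push;" * len(rhs),                         # restore the stack
    ]
    return "\n".join(lines)

print(compile_rule("com = word param;"))
```

The number of pops and trailing pushes follows the count of right-hand-side tokens, and the number of pushes inside the block follows the count of left-hand-side tokens.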
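The “redundant push/pop” problem (the pushes that end one generated block are immediately undone by the pops that start the next) has a pretty simple solution, but we need to make sure there is no whitespace between the push; and pop; commands.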
replace "push;push;push;pop;pop;pop;" "";
replace "push;push;pop;pop;" "";
replace "push;pop;" "";
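The effect of these three replacements, and the reason whitespace matters, can be checked with a short Python sketch. `strip_redundant` is a hypothetical name, and Python's `str.replace` is only assumed to behave like the nom replace command for this purpose.

```python
def strip_redundant(code):
    """Delete each run of 'push;' that is directly undone by as many 'pop;'."""
    # longest pattern first, in the same order as the three replace commands
    for n in (3, 2, 1):
        code = code.replace("push;" * n + "pop;" * n, "")
    return code

print(repr(strip_redundant("}push;push;pop;pop;")))     # prints '}'
print(repr(strip_redundant("}push;push;\npop;pop;")))   # newline blocks the match
```

In the second call the text comes back unchanged, because the newline between the pushes and the pops stops any of the patterns from matching.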
This toyBNF language may not be as efficient as hand-coded ℕ𝕠𝕞 because it does redundant pushes and pops between code blocks, but it is easier to write and probably less prone to errors. To make it more than a “recogniser”, however, we have to add compiling syntax like this:
add = b chars {
#0 = "<a href=".$1.">".$2."</a>" ;
}
In the syntax above '.' is the string concatenator and $1 refers to the attribute of the first token on the right-hand side (RHS) of the bnf grammar rule. The compiling block takes the place of the ';' in the syntax above.
We don’t have any sensible way to actually create the 'tokens' yet (i.e. the lexing phase of the recogniser), but we can soon invent a syntax like this:
word = [:alnum:]+ ;
newline = '\n' ;
Here is how this will be compiled by toyBNF.pss into ℕ𝕠𝕞:
# toyBNF syntax: word = [:alnum:]+ ;
# the final reparse may not be necessary
read;
[:alnum:] {
while [:alnum:]; put; clear;
add "word*"; push; .reparse
}
# toyBNF syntax: newline = '\n' ;
'\n' { put; clear; add "newline*"; push; .reparse }
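The two lexing rules pair a parse token with an attribute: the matched text is saved with put and the token name is pushed. A minimal Python sketch of that pairing follows. The name `tokenise` is hypothetical, Python's `isalnum` stands in for [:alnum:], and skipping characters that match no rule is an assumption, since the nom fragment does not say what happens to them.

```python
def tokenise(text):
    """Return (token, attribute) pairs for the rules word and newline."""
    pairs = []
    i = 0
    while i < len(text):
        if text[i].isalnum():
            start = i
            # like `while [:alnum:];` : consume the whole run of characters
            while i < len(text) and text[i].isalnum():
                i += 1
            pairs.append(("word*", text[start:i]))
        elif text[i] == "\n":
            pairs.append(("newline*", "\n"))
            i += 1
        else:
            i += 1      # assumption: characters with no rule are skipped
    return pairs

print(tokenise("ab c\n"))
```

The inner loop has to test for the end of the text itself, which is the same concern the nom while command raises at the end of the input stream.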
Unlike read, the commands while, whilenot and until do not exit when they encounter the <end-of-stream> marker.