tacc - The Alternative Compiler Compiler
tacc: Version: 4.56.2.81 32 bit, Compiled: Jun 25 2021 at 01:46:21 (c) Trevor Boyd 19 Mar 2014
tacc: Version: 4.56.2.81 64 bit, Compiled: Jun 25 2021 at 02:19:21 (c) Trevor Boyd 19 Mar 2014
Download tacc - Windows 32 bit and 64 bit
Introduction
tacc is a compiler compiler.
tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source code in C++ or Java with a built in lexical scanner.
You can compile the output .cpp with a C++ or Java compiler or you can use tacc -I to "interpret" the .cpp file.
The resultant parse from either method is identical. The C++ output.cpp file is built from the actual source code of the tacc when tacc is compiled.
The ability to test your grammar and syntax check any file in your target language without having to compile the .cpp or build your own lexical scanner can greatly speed up the development of your language parser.
You can add semantic actions directly to the bnf or you can add them later manually to the output .cpp file.
![]() |
Similar in function to yacc and lex put together.
Elegant simplicity
Lots of automatic and default components, tacc just works it out for you.
Powerful default lexer
Interpreter mode
Rich trace and debug
Parts of tacc were written before 1993, so it is a bit old school in some areas:
Let's write a simple parser to recognise a stream of typical variable names (identifiers).
Identifiers start with a letter or underscore and are optionally followed by letters, underscore and numbers.
The grammar for the identifier stream is:
> cat identifiers.bnf
GOAL ::= ids
ids ::= id | ids id
id ::= /[A-Za-z_][A-Za-z_0-9]*/
By default, the parser generated with tacc automatically handles whitespace, so we don't need to worry about it here.
The first token encountered in the BNF file is the goal token. The entire parse will eventually end up as the goal token.
Tokens can be plain text or /regular expressions/. Regular expressions match the leftmost longest string in the input. If more than one token matches the input, and they are the same length, then the one that is specified first in the bnf file has priority. This means you can put keywords before generic regexes that catch identifiers and have them correctly tokenised as keywords.
Compile the bnf with tacc:
> tacc identifiers.bnf -o identifiers.cpp
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc: using internal template: default (vogue)
tacc: output file is: identifiers.cpp
Test it out with tacc -I:
> tacc -I identifiers.cpp "this and that"
tacc: string accepted
> tacc -I identifiers.cpp "this then 9fail"
tacc: Error: Unrecognised input "9fail"
1: this then 9fail
^^^^^
Or you can compile identifiers.cpp with a C++ compiler:
> cl /EHsc /nologo /O2 identifiers.cpp /Feidentifiers.exe
identifiers.cpp
Test it out with identifiers.exe:
> identifiers.exe "this and that"
identifiers: string accepted
> identifiers.exe "this then 9fail"
identifiers: Error: Unrecognised input "9fail"
1: this then 9fail
^^^^^
Apart from the program name, the output is the same.
tacc -I and your exe also have command line options that can show you what is going on during the parse of your text. You can use -? To see them all
> identifiers.exe -?
usage: identifiers { -f input_file | input_string... }
[-xrtasnlpvewb]
[-GWTV-?]
-options:
-f infile
-x print transitions
-r print reductions
-t print tokens as they are read (-ttt... -K = max t's)
-a print terminal args with reductions
-s print top of parse tree stack with reductions (needs -p)
-n only print reductions that have terminal args
-l print lookahead with reductions
-p print parse tree, -pp print fancier parse tree
-v verbose, equivalent to -alrpptx
-e suppress expecting... output
-k don't use context aware lexing
-w don't ignore end of lines (\n, \f) (\r is always ignored)
-ww don't ignore any whitespace (\n \f \v \t ' ') (\r is always ignored)
-b print the BNF source for this program
-B print the input string or filename
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-V Version
-- Stop -option parsing
-? This help message
These options are all explained in detail later on, but for now -rat is fun:
D:\Trev\Projects\tacc\Doc>identifiers.exe -rat "this and that"
lex: next token is {9 /id/ = "this"}
lex: next token is {9 /id/ = "and"}
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "this"
2: ids ::= id
lex: next token is {9 /id/ = "that"}
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "and"
3: ids ::= ids id
lex: next token is {2 EOS = ""}
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "that"
3: ids ::= ids id
1: GOAL ::= ids
It is nice to be able to recognise input, but it is better to be able to do something with it. The action you perform after parsing is semantics. You can add semantic actions directly to the .bnf grammar or you can add it to the default_semantics function in the output.cpp or even create your own. Generally the more complex the grammar and the more changes you make to the semantic code, the more likely you are to use cpp semantics. bnf semantics are for quick and simple grammars.
It might be a little easier to start with entering code into the output.cpp file.
tacc generates a default semantic function for you to edit. The one for the identifiers.bnf grammar above looks like this:
int
default_semantics(int red, int argc, string argv[], token_info *argt[], string
&error)
{
// ---- Description
//
// default_semantics is a callback function
from the parser that gives you a reduction that is
// happening now and the strings that are
the terminals in that production.
You will normally
// perform some processing based on the
reduction number (red) and the arguments in argv.
//
// ---- Parameters
//
// red - the reduction (production)
number.
//
// argc - the count of terminal token
strings in argv
//
// argv[] - Reduction terminal token
strings (the values of the tokens that are "variables")
//
// argv[0] is the
leftmost terminal token string
// argv[1] is the
next leftmost terminal token string
// ...
// argv[argc - 1]
is the rightmost terminal token string
// argv[argc] is the
lookahead token string
//
// argt[] - Reduction terminal token
info
//
//
argt[0]->token - the token id number
//
// Usually you
would only use .token, but see class token_info for others.
//
// error - an optional error message that
will be printed if you return non 0 from this function.
//
So if you found a semantic error on token 0 and wanted to stop parsing,
you would:
//
//
error = argt[0]->token_error("ERROR: I'm sorry, Dave, I'm afraid I
can't do that.");
//
return 1;
//
// return value: 0 = all OK and keep
parsing. Non 0 = error and stop
parsing.
//
// ---- Example
//
// 17. EXPR ::= /[0-9]+/
OPERATOR /[A-Za-z_][A-Za-z_0-9]*/
//
37
goo
;
//
// red = 17
- the number of the reduction, ie the 17th production in your
.bnf
// argc = 2
- the number of terminal tokens, here they are "37" and "goo"
// argv[0] = "37" - the value of the
first terminal token
// argv[1] = "goo" - the value of the second
terminal token
// argv[2] = ";" - the lookahead
token
// argt[0]->token = 7 - the
token for number, ie the 7th token defined in your .bnf
// argt[1]->token = 9 - the
token for identifier, ie the 9th token defined in your .bnf
//
// EXPR is what you are
calculating here, most likely you will push this result on your
// own weird stack.
//
// OPERATOR is a non terminal,
most likely you will have OPERATOR ::= + | - | ...
// somewhere in your
.bnf.
//
switch (red) {
case 1 : // GOAL ::= ids
break;
case 2 : // ids ::= id
case 3 : // ids ::= ids id
break;
case 4 : // id ::= /[A-Za-z_][A-Za-z_0-9]*/
break;
case 0:
default :
// printf("Unhanded or unknown reduction =
%d\n", red);
//
exit(EXIT_FAILURE);
break;
}
return 0;
}
So if we wanted to print out a message when we read in an id, we could put it in the code like this:
case 4 : // id ::= /[A-Za-z_][A-Za-z_0-9]*/
printf("id = %s\n",
argv[0].c_str());
break;
Compiling it and running it gives:
> identifiers.exe this and that
id = this
id = and
id = that
identifiers: string accepted
Note that we could not have put the printf under case 2 or 3 because the right hand side (RHS) of the ::= are all non terminals (ie they appear as a left hand side (LHS) somewhere else in the grammar). If you want to retain the id for later processing by higher order rules, the usual way is to push the values onto a stack. We will see how to do that a bit further on.
You can enter code directly into the bnf by putting it inside braces after the rule it matched like this (and it has to be pretty much exactly like this):
GOAL ::= ids
ids ::= id | ids id
id ::= /[A-Za-z_][A-Za-z_0-9]*/
{
printf("id = %s\n", argv[0].c_str());
}
tacc, compile and execute gives us the same result as putting it directly into the .cpp:
> identifiers.exe this and that
id = this
id = and
id = that
identifiers: string accepted
Note that tacc -I cannot interpret the semantic actions (aww...). tacc can only interpret the tables it generated in the cpp file. Disappointing, I know, but I wasn't writing an entire C++ interpreter. Besides, you can use your own template file and it need not be C++.
If you want to put some code in the bnf file that is global, you can put it in a named section.
Let's say we wanted to read in the identifiers and output them all at once with commas between them. We need to change the grammar slightly so that the ids ::= id | ids id is split into separate lines and add some { semantic blocks }
So input:
this and that
Outputs:
List: this, and, that
The default tacc template has a tacc_include and a tacc_global that we can use like this:
tacc_include
{
#include <iostream>
using std::cout;
#include <string>
#include <stack>
}
tacc_global
{
typedef std::stack<std::string> string_stack;
string_stack aux;
}
GOAL ::= ids
{
string list = aux.top(); aux.pop();
cout << "List: " << list << "\n";
}
ids ::= id
{
string id = aux.top();
cout << "leaving id = " << id << " on the stack\n";
}
ids ::= ids id
{
string id = aux.top(); aux.pop();
cout << "pop id = " << id << "\n";
string list = aux.top(); aux.pop();
cout << "pop list = " << list << "\n";
list = list + ", " + id;
cout << "pushing list = " << list << "\n";
aux.push(list);
}
id ::= /[A-Za-z_][A-Za-z_0-9]*/
{
cout << "pushing id = " << argv[0] << "\n";
aux.push(argv[0]);
}
tacc, compile and execute gives us:
> identifiers.exe this and that
pushing id = this
leaving id = this on the stack
pushing id = and
pop id = and
pop list = this
pushing list = this, and
pushing id = that
pop id = that
pop list = this, and
pushing list = this, and, that
List: this, and, that
identifiers: string accepted
It may seem like a kinda crazy way to get that string to the next higher level rule by popping and pushing, but it is very common to do it like that.
You might also pop values off the stack, perform some processing and push the result back on as we do to put the comma in the list.
It is also normal to have a stack that is a union of many types.
A named section may have any name. The name given is matched to a comment in the template file that tacc uses to generate the cpp file.
You can list the templates built in to tacc and export them for perusal like this:
> tacc -d
Name Ext Default Description
-------- ---- ------- ----------------------------------------------------------------
classic cpp No Older C++ template with no real redeeming features.
sienna cpp No C++ template generated around tacc 4.03. Still good.
vogue cpp Yes C++ template generated from the tacc source code.
jive java No Simple Java template for tiny grammars.
juke java Yes Java template with parser tables in *.joos.dat files.
cup cup Yes Convert to CUP parser format. Not perfect.
default * * Matches a default template above by output file extension.
> tacc -M default -o template.cpp
Named sections look like this:
//
>>>> tacc directive
[arguments]
Most of them are only really useful if you are building your own template file.
Directive is one of these.
tacc_include
tacc_global
cpp_bnf_name
cpp_bnf_created
cpp_gen_author
cpp_tacc_info
cpp_command_line
cpp_bnf_source
cpp_tokens
cpp_transition_tokens
cpp_transition_actions
cpp_transition_state_index
cpp_reductions
cpp_string_tokens
cpp_regex_tokens
cpp_input_tokens
cpp_productions
cpp_case_semantics
cpp_regex_lex
cpp_context_lex_state
java_bnf_name
java_bnf_created
java_gen_author
java_tacc_info
java_command_line
java_bnf_source
java_tokens
java_transition_tokens
java_transition_actions
java_transition_state_index
java_reductions
java_string_tokens
java_regex_tokens
java_input_tokens
java_productions
java_case_semantics
java_regex_lex
java_context_lex_state
java_main_class
joos_app_info $dir/$name.app_info.joos.dat
joos_standard_tokens $dir/$name.standard_tokens.joos.dat
joos_transition_tokens $dir/$name.transition_tokens.joos.dat
joos_transition_actions $dir/$name.transition_actions.joos.dat
joos_transition_state_index $dir/$name.transition_state_index.joos.dat
joos_reductions $dir/$name.reductions.joos.dat
joos_string_tokens $dir/$name.string_tokens.joos.dat
joos_regex_tokens $dir/$name.regex_tokens.joos.dat
joos_input_tokens $dir/$name.input_tokens.joos.dat
joos_productions $dir/$name.productions.joos.dat
joos_regex_lex_state_action $dir/$name.regex_lex_state_action.joos.dat
joos_regex_lex_state_match $dir/$name.regex_lex_state_match.joos.dat
joos_context_lex_state $dir/$name.context_lex_state.joos.dat
joos_file_extension
cup_preamble
cup_lex
cup_cf
The named sections that start cpp, java and joos are used by tacc and are replaced with parser tables. To see a full description of every one of these, just browse through the comments in the template files.
$name is name of the input.bnf file without the extension, ie input.
$filename is the name of the input.bnf file, ie input.bnf
$dir is the directory of this automaton, from the dir/input.bnf or "." if not present
tacc doesn't use tacc_include or tacc_global, they are just custom named sections used by the cpp templates.
You can make your own template.cpp and include your own named sections. For example if we modify the exported template.cpp from above and add a groovy section:
// >>>> tacc groovy
And put a groovy named section in the bnf:
groovy
{
// yeah this is groovy
}
Then use tacc with the new template with -m:
> tacc -m template.cpp identifiers.bnf -o identifiers.cpp
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc: using template file: template.cpp
tacc: output file is: identifiers.cpp
The output file contains:
// yeah this is groovy
In place of the:
// >>>> tacc groovy
That was in the template file.
tacc can also do the same thing for java.
> tacc identifiers.bnf -o identifiers.java
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc: using internal template: default (juke)
tacc: output file is: identifiers.java
> javac identifiers.java
> java identifiers this and that
identifiers: string accepted
> java identifiers this and 9fail
identifiers: Error: Unrecognised input "9fail"
1: this and 9fail
^^^^^
> tacc -I identifiers.java this and that
tacc: string accepted
> tacc -I identifiers.java this and 9fail
tacc: Error: Unrecognised input "9fail"
1: this and 9fail
^^^^^
That is really all you need to know to get started.
tacc -k Context Aware Lexing
Because
tacc creates both the lexer and the parser, it knows everything about the parser
when it is building the lexer. This
means that it can create a lexer that is specific to every parser
state.
Well
in some cases it is the only way.
Parsers have long separated the lexer from the parser and because of that
they have 1 big generic lexer that can recognise all of the tokens. tacc was no different, even though it
was all part of the same program, the lexer and the parser really didn't
interact that much. But now they do
and this can make some bnfs possible where previously they were
impossible.
For
example, let's say you had a language that recognised simple assignments of the
form:
id =
number
Potentially
separated by ; The bnf would look
like this:
GOAL ::=
program
program ::= program ; statement |
statement
statement ::= expression |
NULL
expression ::= id \=
number
id ::=
/[A-Za-z_][A-Za-z_0-9]*/
number ::=
/[0-9]+/
And
now you want to add a help command as well. The help command would be of the
form:
help
any_thing_at_all
Where
any_thing_at_all
really could be anything, including letters, numbers and even a
;
Conventional
lexers would have trouble with this.
Let's see why. First put the
new help command in the grammar: context.bnf:
GOAL ::=
program
program ::= program ; statement |
statement
statement ::= expression |
help_expression | NULL
help_expression ::= help
subject
expression ::= id \=
number
id ::=
/[A-Za-z_][A-Za-z_0-9]*/
number ::= /[0-9]+/
subject ::=
/[^[:space:]]+/
Compile
it with tacc and try it out:
> tacc context.bnf -o
context.cpp
And
try it out:
> tacc -I context.cpp -rt
"goo = 37"
lex: next token is {17 /id/
= "goo"}
lex: next token is {15 = =
"="}
9: id ::=
/[A-Za-z_][A-Za-z_0-9]*/
lex: next token is {18
/number/ = "37"}
lex: next token is {2 EOS =
""}
10: number ::=
/[0-9]+/
8:
expression ::= id = number
4:
statement ::= expression
3: program
::= statement
1: GOAL
::= program
context: string
accepted
Good
so far, but what happens if you take some of that whitespace
out?
> tacc -I context.cpp -rt
"goo=37"
lex: next token is {19
/subject/ = "goo=37"}
context: Error: Unexpected
input token: /subject/
1: goo=37
^^^^^^
found: /subject/, regex:
/[^[:space:]]+/
expecting: EOS program ; statement expression
help_expression help id /id/
The
trouble is the /subject/ regex is so inclusive it consumes all of the
input. But why should it? The only time the /subject/ regex is
used is after the keyword help.
How
about asking for help on ;
> tacc -I context.cpp -rt
"help ;"
lex: next token is {12 help
= "help"}
lex: next token is {8 ; =
";"}
context: Error: Unexpected
input token: ;
1: help ;
^
found:
;
expecting: subject /subject/
No
good either. Even though the
ordinary regex lexer whould have had the ; matched with ; and /subject/ for the
same length of 1, the bare ; has higher priority than /subject/ because it
occurs before it in the grammar. So
there
is
no way to allow the ; to be a parameter to the help
keyword.
If
you could limit looking for /subject/ only to the context when the previous
token was 'help' then you would be able to have your cake and eat it
too.
This
is what the tacc -k context aware lexer does. It only enables those input tokens that
can occur in the context of a particular parser state. Let's try it out:
> tacc -k context.bnf -o
context.cpp
tacc: reading grammar file:
context.bnf
tacc: checking
productions
tacc: building
automation
tacc: building
lexer
tacc: building context aware
lexer
Lexical contexts:
6
Lexical states: 11 normal + 5 contextual = 16
total
States:
15
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts:
0
Warnings: 0
Errors:
0
Automaton is
LR(1)
tacc: using internal
template: default (vogue)
tacc: output file is:
context.cpp
Now
let's try it out:
> tacc -I context.cpp -rt
"goo=37"
lex: next token is {17 /id/
= "goo"}
lex: next token is {15 = =
"="}
9: id ::=
/[A-Za-z_][A-Za-z_0-9]*/
lex: next token is {18
/number/ = "37"}
lex: next token is {2 EOS =
""}
10: number ::=
/[0-9]+/
8:
expression ::= id = number
4:
statement ::= expression
3: program ::=
statement
1: GOAL
::= program
context: string
accepted
And
you can ask help of practically anything, including the ;
> tacc -I context.cpp -rt
"help ;"
lex: next token is {12 help
= "help"}
lex: next token is {19
/subject/ = ";"}
lex: next token is {2 EOS =
""}
11: subject ::=
/[^[:space:]]+/
7:
help_expression ::= help subject
5:
statement ::= help_expression
3: program
::= statement
1: GOAL
::= program
context: string
accepted
Note
that tacc
-k
builds a context aware lexer in addition to the normal lexer. In tacc
-I
interpreter
mode and in your compiled cpp or java parser, the lexer will automatically use
the context aware lexer if available.
But if you want to temporarily disable context aware lexing you can use
tacc
-I
-k.
>
tacc -I -k context.cpp -rt "help ;"
lex: next token is {12 help
= "help"}
lex: next token is {8 ; =
";"}
context: Error: Unexpected
input token: ;
1: help
;
^
found: ;
expecting: subject /subject/
Formal tacc BNF Syntax
tacc was written with lesser versions of tacc (multiple times), so I should just put the tacc BNF here, but it is a bit scary looking, so I will included it later after we see some simple forms:
Comments are started with // and continue to the end of the line:
// this is a comment
Productions or rules are simply a single identifier ::= (becomes) some other identifiers. LHS (left hand side) identifier and RHS (right hand side) identifiers are called tokens.
LHS ::= RHS ...
You can provide alternatives with the vertical bar | "OR" symbol.
LHS ::= RHS1 ... { | RHS2 ... }
Is equivalent to:
LHS ::= RHS1 ...
LHS ::= RHS2 ...
You can use the | alternate on the end of one line or the start of another:
LHS ::= RHS1 ... |
RHS2 ...
LHS ::= RHS1 ...
| RHS2 ...
You can use ||= as a shorthand way of writing | between every letter of the RHS
HEX ||= ABCDEFabcdef0123456789
Is equivalent to:
HEX ::= A | B | C | D | ... | 8 | 9
This is useful for character grammars, so check out the -c option too.
Tokens may be practically any string separated by whitespace. The "normal" identifier rules you might be familiar with in other languages do not apply here. For example 98* is a perfectly acceptable token even though it starts with a number and has a * in it.
If a token only appears on the right hand side (RHS) of the ::=, it is said to be a terminal token. If a token appears on the left hand side (LHS) and the RHS, then the token is said to be a non-terminal token. tacc automatically works this out.
It is conventional in grammar BNFs to make non terminals uppercase and terminals lowercase. This is not required or encouraged style in tacc, but feel free to go with the flow. You can use ::= or just = as a production operator. This document was written before = was allowed as an alternative and the author writes ::= mostly out of habit. Don't feel that you need to. There is no need to end statements with a ; as you see in some extended BNFs either, tacc uses line breaks as a delimiter. tacc wants the BNF syntax to be as natural and uncluttered with language syntax as possible.
A right hand side RHS may be optional, you can indicate this in tacc by using NULL
LHS ::= NULL
A token can be enclosed in quotes or you can start it with a backslash \ if it has special characters that might confuse tacc.
LHS ::= "NULL"
LHS ::= \NULL
OR ::= \|
A terminal token may be a simple string or a regular expression in /slashes/
keyword ::= while
number ::= /[0-9]+/
tacc regular expressions are stock standard, 7 bit ASCII only. The tacc compiled lexer will recognise any number of regular expressions in parallel just as quickly as one.
The following tables can also be seen with tacc -RR for regex help.
regex |
Description |
Example |
c |
Any non special character c matches itself |
k matches k |
. |
matches any single character except \r \n and \0 |
. matches a, B, c, x, ... |
^ |
matches the start of a line |
^a matches lines that start with a |
$ |
matches the end of a line |
z$ matches lines that end in z |
\n \r \t |
matches \n newline, \r carriage return and \t tab |
\n matches \n |
\\ |
matches \ |
\\ matches \ |
\s |
matches special character s |
\. matches . |
\xHH |
matches hex character HH |
\x20 matches space |
[xyz0-9] |
matches characters and ranges in between [...] |
gr[ae]y[0-9] matches gray7 and grey9 |
[^xyz0-9] |
matches not (characters and ranges) in between [^...] |
[^A-Z] matches not uppercase |
[:class:] |
matches a standard class of characters (see below), must be used inside [...] |
[[:digit:]] matches a digit 0 to 9 |
a* |
matches 0 or more a's |
a* matches "", a, aa, aaa... |
a+ |
matches 1 or more a's |
a+ matches a, aa, aaa... |
a? |
matches 0 or 1 a's |
ab?c matches ac, abc, but not abbc |
a|b |
matches a or b, almost always used with (a|b) |
gr(ay|ey) matches gray and grey |
( ) |
delimits a sub-expression that you can apply * + ? to |
(ab)+ matches ab, abab, ababab, ... |
. means does NOT match -- any other
character means match | |
|
|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+-- 00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111 00000000001111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000111111111122222222 01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567 |----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+-- |
[:class:] |
................................ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~. |
none blank cr eol space cntrl digit xdigit lower upper alpha alnum word punct graph dot any notnull regord regres |
................................................................................................................................ .........Y......................S............................................................................................... ..........Y..Y.................................................................................................................. Y.........Y..Y.................................................................................................................. .........YYYYY..................S............................................................................................... YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY...............................................................................................Y ................................................0123456789...................................................................... ................................................0123456789.......ABCDEF..........................abcdef......................... .................................................................................................abcdefghijklmnopqrstuvwxyz..... .................................................................ABCDEFGHIJKLMNOPQRSTUVWXYZ..................................... .................................................................ABCDEFGHIJKLMNOPQRSTUVWXYZ......abcdefghijklmnopqrstuvwxyz..... ................................................0123456789.......ABCDEFGHIJKLMNOPQRSTUVWXYZ......abcdefghijklmnopqrstuvwxyz..... ................................................0123456789.......ABCDEFGHIJKLMNOPQRSTUVWXYZ...._.abcdefghijklmnopqrstuvwxyz..... .................................!Y#$%&'()*+,-Y/..........:;<=>?@..........................[\]^_`..........................{|}~. .................................!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~. ................................S!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~. .YYYYYYYYY.YY.YYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y .YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y .YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#.%&'....,../0123456789.;<=>.@ABCDEFGHIJKLMNOPQRSTUVWXYZ...._`abcdefghijklmnopqrstuvwxyz{.}~. Y...................................$...()*+.-Y...........:....?...........................[\]^.............................|... |
It might seem a first glance that ||= expressions could be replaced by a regular expressions, and in some cases they can, but they are very different.
A /regex/ is a single token that matches some input during lexical scanning of your input.
||= makes multiple tokens, all of which are matched and returned separately by the lexer.
For example this regex returns a single token number for digit, for which the argv[] could be any character '0'-'9':
digit ::= /[0-9]/
And this one returns a separate token for each digit 0, 1, 2, ... , 9 and the argv[] would be '0', '1', '2', ... , '9'.
digit ||= 0123456789
This is the actual grammar used to build tacc's BNF parser:
//
// tacc bnf grammar
//
goal ::= grammar
grammar ::= statement | grammar statement
statement ::= production EOL |
production eol code |
include_tag eol code |
eol
production ::= LHEX prodop RHEX |
LHEX prodop eol RHEX
LHEX ::= LHS | NULL
prodop ::= \::= | \= | \| | \||=
RHEX ::= RHS |
RHEX \| RHS |
RHEX \| eol RHS
RHS ::= regex_token | RHS regex_token
LHS ::= plain_token
regex_token ::= token | regex
plain_token ::= token
token ::= word | quoted_text | escaped_word
include_tag ::= word
EOL ::= eol | EOS
code ::= /{([^[:eol:]]|[[:eol:]][^}])*[[:eol:]]}/
regex ::= /\/([^\\[:eol:]\/]|\\.)*\//
escaped_word ::= /\\[^[:space:]]+/
quoted_text ::= /"([^\\[:eol:]\"]|\\.)*"|'([^\\[:eol:]\']|\\.)*'/
word ::= /[^[:space:]]+/
eol ::= /[[:space:]]*(\/\/[^[:eol:]]*)?[[:eol:]]+[[:blank:]]*([[:space:]]*(\/\/[^[:eol:]]*)?[[:eol:]]+[[:blank:]]*)*/
And this is the grammar used to build the regular expression parser:
goal ::= regex
regex ::= alternation
alternation ::= concatenation | alternation \| concatenation
concatenation ::= quantified | concatenation quantified
quantified ::= meta | meta * | meta + | meta ?
meta ::= character | ^ | $ | . | list | ( regex )
list ::= matching_list | non_matching_list
matching_list ::= [ list_spec ]
non_matching_list ::= [ ^ list_spec ]
list_spec ::= list_character | list_spec list_character
list_character ::= range | class | character
range ::= character - character
class ::= [ : class_name : ]
class_name ::= literal_string
literal_string ::= identifier_character | literal_string identifier_character
character ::= escape_sequence | hex_character | literal_character
escape_sequence ::= \\ escaped_character
hex_character ::= \\ x hex_digit hex_digit
literal_character ::= identifier_character | ordinary_character
escaped_character ::= reserved_character | slash_character | ordinary_character
all_characters
{
||= ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz _ 0123456789 !@#$%^&+-*.,:;?=~ () [] {} <> \/ \\ \| \" ' ` " " \t \r \n
}
identifier_character ||= ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz _ 0123456789
ordinary_character ||= !@# % & , ; =~ {} <> \/ \" ' ` " "
reserved_character ||= $ ^ +-* . : ? () [] \\ \|
slash_character ||= trn
hex_digit ||= 0123456789 ABCDEF abcdef
tacc -options
Simply prints a short usage:
> tacc
tacc compiles an extended BNF grammar to (LA)LR(1) parser in C++ or Java.
usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]
[{-m template_file | -M internal_template_name}]
[-GWTV-?]
tacc -I can interpret the generated output C++ file.
usage: tacc -I interpret_source_file.cpp { -f input_file | input_string... }
[-xrtasnlpvewb]
[-GWTV-?]
Use tacc -? for more help on BNF compiler mode
Use tacc -I? for more help on C++ interpreter mode
Simply prints all the -options:
> tacc -?
usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]
[{-m template_file | -M internal_template_name}]
[-GWTV-?]
tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source
code in C++ or Java with a built in lexical scanner. You can compile the
resultant source code immediately with a C++ or Java compiler or you can use
tacc -I to interpret the source file.
See tacc -I? for more help on interpreter mode.
-options:
-g print goal token
-t print tokens
-p print productions
-x print transitions
-s print states
-i print input tokens (more useful with -s)
-v verbose, equivalent to -gtpxsi
-l construct an LALR(1) parser instead of an LR(1) parser
-c ensure token ids for single character token names = ascii of the character
-d list available internal templates
-k generate a context aware lexer
-y if a state's actions are all the same reduction, do it on any token
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-RR help on regexes and regex classes.
-V Version
-- Stop -option parsing
-? This help message
Examples:
tacc regex.bnf -o regex.cpp -> compile regex.bnf to regex.cpp
tacc -I regex.cpp -pp "a*b+c?(this|that)" -> run the interpreter on regex.ccp with input string
tacc regex.bnf -o regex.java -> compile regex.bnf to regex.java
tacc -l regex.bnf -o regex.cpp -> compile regex.bnf as LALR (better)
tacc -d -> list available templates
tacc -M default -o template.cpp -> output the default template to template.cpp
It is fairly common for grammars to generate a lot of internal states that are very similar. The -l option constructs the LR(1) parser as normal and then tries to merge similar states to what is known as an LALR(1) parser. The LALR(1) parser is generally smaller than the LR(1) parser with almost no drawbacks, so it is generally a good idea to use -l. The drawbacks are that it takes tacc more time to build your parser and, if you give the generated parser incorrect input, that it may internally loop a few extra times before working out that the input is incorrect. The LALR(1) generated parser is still correct in every way though.
(Unlike some other parser generators, tacc will not generate any sort of mysterious conflicts when generating LALR(1) automatons, so you can use -l with confidence.)
I like to think of the -l as the "optimised" compile of your grammar.
Here it is being used to compile the regex grammar and reduces the number of states in the automaton from 686 to 176, which is pretty significant:
> tacc -l regex.bnf -o regex.cpp
tacc: reading grammar file: regex.bnf
tacc: checking productions
tacc: building automation
tacc: LALR merging states
tacc: building lexer
Parser States: 686 - 510 merged = 176
Lexical States: 2
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1)
tacc: using internal template: default (vogue)
tacc: output file is: regex.cpp
After you generate a parser with tacc, you can add semantic actions to your code (see Introduction). The only tokens that are actually delivered to your semantic code are terminal tokens. That is, tokens that appear on the right hand side (RHS) of the ::=, but not on the left hand side (LHS). Usually it is pretty obvious which tokens are terminal or not, but should you be wanting a list of tokens and how they are categorised, then use -t.
-g simply prints the goal token, which is the first token encountered in the BNF file.
> cat identifiers.bnf
GOAL ::= ids
ids ::= id | ids id
id ::= /[A-Za-z_][A-Za-z_0-9]*/
> tacc -tg identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Goal
Goal is GOAL
Tokens
0. virtual not_null terminal - plain UNKNOWN
1. virtual not_null terminal - plain ERROR
2. virtual not_null terminal - plain EOS
3. virtual not_null terminal - plain ANY
4. virtual nullable terminal - plain NULL
5. virtual not_null terminal - plain START
6. goal not_null product pure plain GOAL
7. real not_null product pure plain ids
8. real not_null product prime plain id
9. real not_null terminal input regex /id/ /[A-Za-z_][A-Za-z_0-9]*/
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
As you can see, tacc reserves some virtual tokens that are used by the parser. It is not necessary to understand what they are or what they do, but it is instructive to know what they are for:
token |
Description |
UNKNOWN |
in a list of tokens, the UNKNOWN token (usually 0) is placed on the end for human readability, it is not used by tacc specifically |
ERROR |
is the token returned by the lexer when none of your terminal tokens matched the input string |
EOS |
end of string is the last token of any input, it is supplied and handled automatically in tacc |
ANY |
a special token used with -y that means if all the reductions in the state are the same, do it on any token. See option -y. |
NULL |
the token that means it could be nothing, as in no input at all. eg prefix ::= + | - | NULL |
START |
simply marks the place in the token table where real (as opposed to virtual) tokens start, it is not used by tacc specifically |
Printing productions really just lists your .bnf file with tacc's categorisation.
> tacc -p identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Productions
1. nulls: 0 reachable terminating GOAL ::= ids
2. nulls: 0 reachable terminating ids ::= id
3. nulls: 0 reachable terminating ids ::= ids id
4. nulls: 0 reachable terminating id ::= /[A-Za-z_][A-Za-z_0-9]*/
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
Printing transitions displays the full pushdown automaton and shows exactly what the parser would do when it parses your input string.
> tacc -x identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Transitions
state 1 on ids ---> 2
state 1 on id ---> 3
state 1 on /id/ ---> 4
state 2 on EOS _o-> GOAL ::= ids
state 2 on id ---> 5
state 2 on /id/ ---> 4
state 3 on EOS /id/ _o-> ids ::= id
state 4 on EOS /id/ _o-> id ::= /[A-Za-z_][A-Za-z_0-9]*/
state 5 on EOS /id/ _o-> ids ::= ids id
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
When tacc is compiling your grammar, it makes a states which indicate where the parser is currently and transitions that indicate where it might go, given particular a particular token. Understanding transitions requires some knowledge of pushdown automata, so take a look at the section on Parsing Basics. Also see tacc -I -x for a walkthrough of what transitions look like when the parser is executing.
If you don't like the default template that tacc uses, you can use your own, or modify the one tacc uses.
You can list the templates built in to tacc and export them for perusal like this:
> tacc -d
Name Ext Default Description
-------- ---- ------- ----------------------------------------------------------------
classic cpp No Older C++ template with no real redeeming features.
sienna cpp No C++ template generated around tacc 4.03. Still good.
vogue cpp Yes C++ template generated from the tacc source code.
jive java No Simple Java template for tiny grammars.
juke java Yes Java template with parser tables in *.joos.dat files.
cup cup Yes Convert to CUP parser format. Not perfect.
default * * Matches a default template above by output file extension.
> tacc -M default -o template.cpp
Now you can edit template.cpp and use it with tacc -m
> tacc -m template.cpp identifiers.bnf -o identifiers.cpp
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc: using template file: template.cpp
tacc: output file is: identifiers.cpp
Printing states shows the internals of the compile process. It is unlikely that you will want this, but it can be instructive if you want to learn how to build a pushdown state automaton from a BNF grammar. It can sometimes be useful when debugging grammars, but most of the time it outputs just too much stuff.
> tacc -s identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
States
---- START ---> state 1
| 1. GOAL ::= . ids ---> 2 EOS
| 3. ids ::= . ids id ---> 2 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 2. ids ::= . id ---> 3 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- ids ---> state 2
| 1. GOAL ::= ids . _o-> -1 EOS
| 3. ids ::= ids . id ---> 5 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- id ---> state 3
| 2. ids ::= id . _o-> -2 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- /id/ ---> state 4
| 4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ . _o-> -4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- id ---> state 5
| 3. ids ::= ids id . _o-> -3 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
Creating an automaton by hand is pretty tedious, but if you want to learn how, see the section on Parsing Basics.
There is no need whatsoever to understand how tacc builds an automaton to use tacc.
If you want to get hardcore with your parsing, you might want to consider your input stream one character at a time. So instead of:
keyword ::= while
You would write:
keyword ::= w h i l e
If you are doing it one character at a time, you can greatly simplify the lexical scanner from the standard regex matcher to one that simply reads one character. And to make things even simpler, why not make the token number equal to the ascii of the character as well? This is what -c does. It makes all of the token numbers for single characters equal to the ascii of the character.
Even though it makes the lexer lightning fast, it puts more work on the parser (and you writing the BNF too).
tacc didn't always have a fancy regex lexer. tacc's regex lexer was built with tacc -c.
Most of the time, if your state has only one reduction, it will do that reduction on every token. -y reduces all those to one reduction on the special single any token.
The upside of -y is that it slightly speeds up finding what action to do when you are in state S with token T and makes the parser slightly smaller.
The downside of -y is that it is a relaxation of your grammar, meaning it could accept input that it should have rejected. This would be pretty rare, most of the time it would do a reduce based on any that it should not have done and then whatever the real token is coming up next would likely not be able to be shifted and it would stop anyway. Even if it did continue at that stage, you can deal with it in your semantics. -y is not for beginners.
The -y process in tacc is done during the output of the .cpp file, so you won't see it with -s, but if you go through the C++ output tables (not recommended), effectively it changes this:
> tacc -s identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
States
---- START ---> state 1
| 1. GOAL ::= . ids ---> 2 EOS
| 3. ids ::= . ids id ---> 2 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 2. ids ::= . id ---> 3 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- ids ---> state 2
| 1. GOAL ::= ids . _o-> -1 EOS
| 3. ids ::= ids . id ---> 5 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- id ---> state 3
| 2. ids ::= id . _o-> -2 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- /id/ ---> state 4
| 4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ . _o-> -4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- id ---> state 5
| 3. ids ::= ids id . _o-> -3 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
To this (NOTE! FAKED OUTPUT!):
> tacc -sy identifiers.bnf
tacc: reading grammar file: identifiers.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
States
---- START ---> state 1
| 1. GOAL ::= . ids ---> 2 EOS
| 3. ids ::= . ids id ---> 2 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 2. ids ::= . id ---> 3 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- ids ---> state 2
| 1. GOAL ::= ids . _o-> -1 EOS
| 3. ids ::= ids . id ---> 5 EOS /[A-Za-z_][A-Za-z_0-9]*/
| 4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/ ---> 4 EOS /[A-Za-z_][A-Za-z_0-9]*/
--------------------
---- id ---> state 3
| 2. ids ::= id . _o-> -2 ANY Note! faked output!
--------------------
---- /[A-Za-z_][A-Za-z_0-9]*/ ---> state 4
| 4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ . _o-> -4 ANY Note! faked output!
--------------------
---- id ---> state 5
| 3. ids ::= ids id . _o-> -3 ANY Note! faked output!
--------------------
Parser States: 5
Lexical States: 3
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc -I interpreter and -options
tacc can interpret its output C++ or Java file with the same results as if you had compiled and executed it.
tacc is not a general C++ or Java interpreter, it just knows the structure of its own output. So this means it can't execute code put in { ... } in your bnf.
The tacc interpreter is very useful when developing grammars because you can immediately do a test parse without compiling the output.cpp file. In fact you can play with grammars and LR parsing without a C++ compiler at all.
The built in lexer means there is no need for any extra tools to build a lexical scanner. The default lexer works out of the box and is flexible enough not to require any modifications to work with your input.
The parser engine and options for the interpreter are identical to the compiled version since they are built from same source.
It is difficult to explain just how good this is compared to other parsing tools if you haven't done this sort of thing before.
Just use tacc -I with the file.cpp file you get from tacc file.bnf:
> tacc -I identifiers.cpp "this and that"
tacc: string accepted
> tacc -I identifiers.cpp "this then 9fail"
tacc: Error: Unrecognised input "9fail"
Line 1. this then 9fail
^^^^^
Nice, but not very informative on what is going on. But there are options.
-I? Simply prints all the -options:
> tacc -I?
usage: tacc -I interpret_source_file.cpp { -f input_file | input_string... }
[-xrtasnlpvewb]
[-GWTV-?]
tacc -I interprets a source code file created from a tacc grammar.bnf
tacc -I is NOT a general C++ or Java interpreter, it just knows the structure
of tacc generated source code and reads the tables from the cpp files it created.
The parser engine and options for the interpreter are identical to the compiled
version since they are built from same source.
-options:
-f infile
-x print transitions
-r print reductions
-t print tokens as they are read (-ttt... -K = max t's)
-a print terminal args with reductions
-s print top of parse tree stack with reductions (needs -p)
-n only print reductions that have terminal args
-l print lookahead with reductions
-p print parse tree, -pp print fancier parse tree
-v verbose, equivalent to -alrpptx
-e suppress expecting... output
-k don't use context aware lexing
-w don't ignore end of lines (\n, \f) (\r is always ignored)
-ww don't ignore any whitespace (\n \f \v \t ' ') (\r is always ignored)
-b print the BNF source for this program
-B print the input string or filename
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-RR help on regexes and regex classes.
-V Version
-- Stop -option parsing
-? This help message
Examples:
tacc regex.bnf -o regex.cpp -> compile regex.bnf to regex.cpp
tacc -I regex.cpp -pp "a*b+c?(this|that)" -> run the interpreter on regex.ccp with input string
-t tells the lexer to print tokens when they are read. It is good to see that your terminal tokens are being recognised:
> tacc -I identifiers.cpp -t "this and that"
lex: next token is {9 /id/ = "this"}
lex: next token is {9 /id/ = "and"}
lex: next token is {9 /id/ = "that"}
lex: next token is {2 EOS = ""}
tacc: string accepted
The output format is { token_number terminal_token_plain_or_regex = "input that matched" }
The more -t's you give, the more info it reports on each token, up until an insane 12 t's (or use -K which is short for the max number of -t's) when it prints a ridiculous amount of debug as it steps through the regular expression matcher internals (not recommended).
Here is a more sensible -tt which prints neat little markers showing where it found the token on the input line (read so far).
> tacc -I identifiers.cpp -tt "this and that"
lex: next token is {9 /id/ = "this"}
1: this
^^^^
lex: next token is {9 /id/ = "and"}
1: this and
^^^
lex: next token is {9 /id/ = "that"}
1: this and that
^^^^
lex: next token is {2 EOS = ""}
1: this and that
^
tacc: string accepted
-r tells the parser to print reductions as they are recognised:
> tacc -I identifiers.cpp -r "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
2: ids ::= id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
3: ids ::= ids id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
3: ids ::= ids id
1: GOAL ::= ids
tacc: string accepted
-ra tells the parser to print reductions and the arguments that would be given to your semantic function. So when you were writing some semantic code to go with this grammar, you can see which reduction is being executed and what arguments you would get when they are:
> tacc -I identifiers.cpp -ra "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "this"
3: ids ::= id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "and"
2: ids ::= ids id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "that"
2: ids ::= ids id
1: GOAL ::= ids
tacc: string accepted
The more -a's you give, the more info it reports on each argument:
> tacc -I identifiers.cpp -raa "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "this", argt[0]->token = 9}
3: ids ::= id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "and", argt[0]->token = 9}
2: ids ::= ids id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "that", argt[0]->token = 9}
2: ids ::= ids id
1: GOAL ::= ids
tacc: string accepted
> tacc -I identifiers.cpp -raaa "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "this", argt[0]->token = 9, regex_tokens[9] = /id/}
3: ids ::= id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "and", argt[0]->token = 9, regex_tokens[9] = /id/}
2: ids ::= ids id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: {argv[0] = "that", argt[0]->token = 9, regex_tokens[9] = /id/}
2: ids ::= ids id
1: GOAL ::= ids
tacc: string accepted
> tacc -I identifiers.cpp -raaaa "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals:
[0] = {
argv[0] = "this"
argt[0] = {
->token = 9
string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/
regex_tokens[9] = /id/
->line_start = 1, ->start = 0, ->line_end = 1, ->end = 3
}
}
3: ids ::= id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals:
[0] = {
argv[0] = "and"
argt[0] = {
->token = 9
string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/
regex_tokens[9] = /id/
->line_start = 1, ->start = 5, ->line_end = 1, ->end = 7
}
}
2: ids ::= ids id
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals:
[0] = {
argv[0] = "that"
argt[0] = {
->token = 9
string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/
regex_tokens[9] = /id/
->line_start = 1, ->start = 9, ->line_end = 1, ->end = 12
}
}
2: ids ::= ids id
1: GOAL ::= ids
tacc: string accepted
-rl tells the parser to print reductions and the current lookahead token. The lookahead token is half the reason "why" this reduction was done (the other half is because we have already read the RHS of the reduction) It is not all that useful:
> tacc -I identifiers.cpp -rl "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and
2: ids ::= id . and
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that
3: ids ::= ids id . that
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS
3: ids ::= ids id . EOS
1: GOAL ::= ids . EOS
tacc: string accepted
-ran tells the parser to only print reductions that have terminal arguments. Almost always used with -a so you can see the terminal args as well. This is useful if you have a grammar with a deep reduction parse tree but with few that actually handle terminal tokens. This is seen in expression calculators that have lots of layers of precedence rules.
> tacc -I identifiers.cpp -ran "this and that"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "this"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "and"
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
terminals: argv[0] = "that"
tacc: string accepted
-p tells the parser to print a parse "tree". -pp prints the same thing, but on multiple lines, so it looks a bit more like a tree. -p does some guesswork on how you want the parse tree to look and doesn't always get it right, so it is always a parse tree, but sometimes it is not pretty.
> tacc -I identifiers.cpp -pp "this and that"
Parse Tree:
[thisandthat]
[thisandthat]
tacc: string accepted
It looks a little better with a calculator type parser. Here is a calculator that has been compiled and has semantics to actually do the calculation. Interpreted or compiled, each way has the same -options:
> calc -pp "(8 + 2 * 3) / (1 + 9)"
Parse Tree:
[[[8+[2*3]]()]/[[1+9]()]]
[ / ]
[ ()] [ ()]
[8+ ] [1+9]
[2*3]
1.4
-b prints the .bnf source that was used to make the parser. But not in the original format, so it is not that exciting. It is useful to remind yourself exactly what (version of your) grammar was used to build this parser. And it would be really useful if you had lost your .bnf.
C:\Trev\Projects\tacc\Debug>tacc -I identifiers.cpp -b
GOAL ::= ids
ids ::= id
ids ::= ids id
id ::= /[A-Za-z_][A-Za-z_0-9]*/
-x prints transitions between states as they occur during parsing. Are you ready for this?
> tacc -I identifiers.cpp -txal "this and that"
lex: next token is {9 /id/ = "this"}
state 1 on 9 /id/ ---> state 4
lex: next token is {9 /id/ = "and"}
state 4 on 9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and
terminals: argv[0] = "this"
state 1 on 8 id ---> state 3
state 3 on 9 /id/ _o-> state 1 reduce 3: ids ::= id . and
state 1 on 7 ids ---> state 2
state 2 on 9 /id/ ---> state 4
lex: next token is {9 /id/ = "that"}
state 4 on 9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that
terminals: argv[0] = "and"
state 2 on 8 id ---> state 5
state 5 on 9 /id/ _o-> state 1 reduce 2: ids ::= ids id . that
state 1 on 7 ids ---> state 2
state 2 on 9 /id/ ---> state 4
lex: next token is {2 EOS = ""}
state 4 on 2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS
terminals: argv[0] = "that"
state 2 on 8 id ---> state 5
state 5 on 2 EOS _o-> state 1 reduce 2: ids ::= ids id . EOS
state 1 on 7 ids ---> state 2
state 2 on 2 EOS _o-> state 1 reduce 1: GOAL ::= ids . EOS
tacc: string accepted
States are built during the compile of the grammar.
Transitions come in two different actions and are printed like this:
in_state S on some_token shift_action---> to_state_N
in_state S on some_token reduce_action_o-> to_state_N reduce some_production
A shift action accepts a token and moves the parser to another state.
A reduce action recognises a production in the grammar and consumes shifted tokens on the right hand side (RHS) of the production, replacing them with the token on the left hand side (LHS) of the production. It also returns to a prior state, but now with the LHS as the next token.
So this grammar compiles to 5 states. To recognise the string input, we boot the parser by pushing state 1 on the state stack. We also call this in_state 1, but it really means the top of the state stack is 1.
The input string is "this and that", the parser will need some input, so it calls the lexer:
lex: next token is {9 /id/ = "this"}
The lexer matched the string "this" to the regex terminal token number 9, which is a regular expression /[A-Za-z_][A-Za-z_0-9]*/ or /id/ for short, since id ::= /[A-Za-z_][A-Za-z_0-9]*/. tacc uses a short version of regexes if the production is of the form something ::= /one_regex/ to improve readability.
As tokens are read, they are pushed on to the token stack. The top of the token stack is known as the on_token. (Some call this the lookahead token, and in this case, it is. tacc only calls a token the lookahead token when it is a terminal. Later on you will see the on_token can change, but he lookahead will stay the same.) We say the on_token is 9 /id/ = "this".
In general parsing goes, if we are in state S, on token K then do action. It is very deterministic and at its heart it is quite simple. All the grunt work for parsing is done up front when you compile your grammar. The execution of the parser is straightforward really.
To parse the string the find the transition that matches in_state, on_token pair. In this case we are in_state 1, on_token 9 /id/ = "this":
state 1 on 9 /id/ ---> state 4
This transition tells us the action will be a ---> shift action to state 4. It is called a shift action, because the token is shifted to the left of an imaginary parsing cursor (often represented as a dot . character). Or the cursor is shifted to the right through the string if you prefer. Or the whole string is shifted to the left. All are correct. The shift action pushes the 4 on to the state stack, so now the in_state is 4.
Because of this shift action, and because the shifted token was a terminal, we will read the next token:
lex: next token is {9 /id/ = "and"}
The lexer matched the string "and" to /id/ and pushes it on to the token stack, "and" becomes the on_token.
Now we are in_state 4, on_token /id/ = "and", so we find the matching transition:
state 4 on 9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and
This transition tells us the action will be a _o-> reduce action with the production, or really now, the reduction id ::= /[A-Za-z_][A-Za-z_0-9]*/ (That _o-> is supposed to look like a curly arrow with a loop in it. It is the way the author draws a reduction on a state diagram)
The reduction action wants us to take tokens that we have read matching the right hand side (RHS) of the production and change them into the token on the left hand side (LHS).
This corresponds to the tokens below the on_token on the token stack. In this case it is 1 token, "this".
The parser calls your semantic action for the reduction id ::= /[A-Za-z_][A-Za-z_0-9]*/ with the argc = 1 and argv = "this".
terminals: argv[0] = "this"
Then argc elements are popped off the token stack and the state stack. (Most of the time it is argc, but NULLs can make it more difficult to predict, so tacc precalculates the exact number to pop off the state stack for each reduction during compilation) And the LHS token = id is pushed onto the stack and becomes the on_token.
The top of the state stack, or in_state is now back to 1 and this time the on_token is id. Find the matching transition:
state 1 on 8 id ---> state 3
And so on...
Another way to look at this is to imagine the parse cursor going through the string, reducing input as it goes.
(Unfortunately tacc can't produce this output since 1. it doesn't read that far ahead, and 2. it doesn't bother to keep non-terminals, they are implied by the state you are in):
. this and that state 1 on 9 /id/ ---> state 4
this . and that state 4 on 9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
. id and that state 1 on 8 id ---> state 3
id . and that state 3 on 9 /id/ _o-> state 1 reduce 2: ids ::= id
. ids and that state 1 on 7 ids ---> state 2
ids . and that state 2 on 9 /id/ ---> state 4
ids and . that state 4 on 9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
ids . id that state 2 on 8 id ---> state 5
ids id . that state 5 on 9 /id/ _o-> state 1 reduce 3: ids ::= ids id
. ids that state 1 on 7 ids ---> state 2
ids . that state 2 on 9 /id/ ---> state 4
ids that . state 4 on 2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/
ids . id state 2 on 8 id ---> state 5
ids id . state 5 on 2 EOS _o-> state 1 reduce 3: ids ::= ids id
. ids state 1 on 7 ids ---> state 2
ids . state 2 on 2 EOS _o-> state 1 reduce 1: GOAL ::= ids
. GOAL
-s prints the top of the -p parse tree stack stx.top. I thought it would be really good, but it turned out not so great. Use -xrpls for best effect.
> tacc -I identifiers.cpp -xrpls "this and that"
state 1 on 9 /id/ ---> state 4
state 4 on 9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and
stx.top: this
state 1 on 8 id ---> state 3
state 3 on 9 /id/ _o-> state 1 reduce 3: ids ::= id . and
3: ids ::= id . and
stx.top: this
state 1 on 7 ids ---> state 2
state 2 on 9 /id/ ---> state 4
state 4 on 9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that
stx.top: and
state 2 on 8 id ---> state 5
state 5 on 9 /id/ _o-> state 1 reduce 2: ids ::= ids id . that
2: ids ::= ids id . that
stx.top: [thisand]
state 1 on 7 ids ---> state 2
state 2 on 9 /id/ ---> state 4
state 4 on 2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS
4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS
stx.top: that
state 2 on 8 id ---> state 5
state 5 on 2 EOS _o-> state 1 reduce 2: ids ::= ids id . EOS
2: ids ::= ids id . EOS
stx.top: [thisandthat]
state 1 on 7 ids ---> state 2
state 2 on 2 EOS _o-> state 1 reduce 1: GOAL ::= ids . EOS
1: GOAL ::= ids . EOS
stx.top: [thisandthat]
Parse Tree:
[thisandthat]
tacc: string accepted
-w controls how tacc controls whitespace. Dealing with whitespace is a fairly tedious thing in parsing. Fortunately tacc does this automatically for you. And most of the time you want this, but if you want to take control, build your grammar with your whitespace tokens and run tacc with:
-w don't ignore end of lines (\n, \f) (\r is always ignored)
-ww don't ignore any whitespace (\n \f \v \t ' ') (\r is always ignored)
If you want these permanently set, you can call parser p.whitespace_control(int wflag) somewhere.
There are many ways to deal with whitespace. You can deal with it like it was a separator or treat them like "blank" ids.
Let's modify the identifiers.bnf grammar to handle whitespace as if they could be "blank" ids:
> cat identifiers.whitespace.bnf
GOAL ::= words
words ::= word | words word
word ::= id | ws
id ::= /[A-Za-z_][A-Za-z_0-9]*/
ws ::= /[[:space:]]+/
Compile the grammar with tacc:
> tacc identifiers.whitespace.bnf -o identifiers.whitespace.cpp
tacc: reading grammar file: identifiers.whitespace.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 8
Lexical States: 4
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token
tacc: using internal template: default (vogue)
tacc: output file is: identifiers.whitespace.cpp
Run tacc -I with -ww to stop it from automatically handling whitespace:
> tacc -I identifiers.whitespace.cpp -ww -tr "this and that"
lex: next token is {11 /id/ = "this"}
lex: next token is {12 /ws/ = " "}
6: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: word ::= id
2: words ::= word
lex: next token is {11 /id/ = "and"}
7: ws ::= /[[:space:]]+/
5: word ::= ws
3: words ::= words word
lex: next token is {12 /ws/ = " "}
6: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: word ::= id
3: words ::= words word
lex: next token is {11 /id/ = "that"}
7: ws ::= /[[:space:]]+/
5: word ::= ws
3: words ::= words word
lex: next token is {2 EOS = ""}
6: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: word ::= id
3: words ::= words word
1: GOAL ::= words
identifiers.whitespace: string accepted
The trouble with that approach is that you have to deal with a word possibly being blank. And even though it is pretty easy in this case, it would be nice to make the parser do the work of delivering "pure" ids to you. This approach to the grammar treats whitespace as an optional separator:
> cat identifiers.whitesep.bnf
GOAL ::= expr
expr ::= ws_opt ids ws_opt
ids ::= id | ids ws_opt id
ws_opt ::= ws | NULL
id ::= /[A-Za-z_][A-Za-z_0-9]*/
ws ::= /[[:space:]]+/
Compile the grammar with tacc:
> tacc identifiers.whitesep.bnf -o identifiers.whitesep.cpp
tacc: reading grammar file: identifiers.whitesep.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Parser States: 12
Lexical States: 4
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(1)
tacc: using internal template: default (vogue)
tacc: output file is: identifiers.whitesep.cpp
Run tacc -I with -ww to stop it from automatically handling whitespace. This time, when an id is recognised and delivered to your semantic function, it won't be blank:
> tacc -I identifiers.whitesep.cpp -ww -tr "this and that"
lex: next token is {12 /id/ = "this"}
6: ws_opt ::= NULL
lex: next token is {13 /ws/ = " "}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
3: ids ::= id
lex: next token is {12 /id/ = "and"}
8: ws ::= /[[:space:]]+/
5: ws_opt ::= ws
lex: next token is {13 /ws/ = " "}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: ids ::= ids ws_opt id
lex: next token is {12 /id/ = "that"}
8: ws ::= /[[:space:]]+/
5: ws_opt ::= ws
lex: next token is {2 EOS = ""}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: ids ::= ids ws_opt id
6: ws_opt ::= NULL
2: expr ::= ws_opt ids ws_opt
1: GOAL ::= expr
identifiers.whitesep: string accepted
Both these grammars have been written in such a way that they also work with automatic whitespace left on. There is no need to try to do that, I just did it for fun:
> tacc -I identifiers.whitesep.cpp -tr "this and that"
lex: next token is {12 /id/ = "this"}
6: ws_opt ::= NULL
lex: next token is {12 /id/ = "and"}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
3: ids ::= id
6: ws_opt ::= NULL
lex: next token is {12 /id/ = "that"}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: ids ::= ids ws_opt id
6: ws_opt ::= NULL
lex: next token is {2 EOS = ""}
7: id ::= /[A-Za-z_][A-Za-z_0-9]*/
4: ids ::= ids ws_opt id
6: ws_opt ::= NULL
2: expr ::= ws_opt ids ws_opt
1: GOAL ::= expr
identifiers.whitesep: string accepted
A topic related to whitespace control is handling the end of string. The end of string token EOS can be used in your grammar as if it was a real character if necessary.
-e When tacc is parsing input, if it receives an unexpected token, it prints an error message. Then it lists all the tokens that it was expecting. Sometimes this is nice, but a lot of times it is a massive list of 99% of all possible tokens. -e suppresses the listing of all the expected possible tokens.
For example:
> cat expecting.bnf
GOAL ::= count
count ::= one two three
Compile with tacc and run it with some wrong input:
> tacc -I expecting.cpp "one three"
tacc: Error: Unexpected input token: three
1: one three
^^^^^
found: three
expecting: two
Now if you run it with -e, it won't show the expecting bit:
> tacc -I expecting.cpp "one three" -e
tacc: Error: Unexpected input token: three
1: one three
^^^^^
Note that this is a different error to the one you get if you supply input that isn't recognised as a token:
> tacc -I expecting.cpp "one four" -e
tacc: Error: Unrecognised input "four"
1: one four
^^^^
tacc experimental -options
tacc -?? Will display normal help plus experimental and undocumented options and usage. Needless to say the experimental stuff is not guaranteed to do anything useful. It is mostly for debugging and regression testing.
> tacc -??
usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]
[{-m template_file | -M internal_template_name}]
[-GWTV-?]
tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source
code in C++ or Java with a built in lexical scanner. You can compile the
resultant source code immediately with a C++ or Java compiler or you can use
tacc -I to interpret the source file.
See tacc -I? for more help on interpreter mode.
-options:
-g print goal token
-t print tokens
-p print productions
-x print transitions
-s print states
-i print input tokens (more useful with -s)
-v verbose, equivalent to -gtpxsi
-l construct an LALR(1) parser instead of an LR(1) parser
-c ensure token ids for single character token names = ascii of the character
-d list available internal templates
-k generate a context aware lexer
-y if a state's actions are all the same reduction, do it on any token
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-RR help on regexes and regex classes.
-V Version
-- Stop -option parsing
-? This help message
Examples:
tacc regex.bnf -o regex.cpp -> compile regex.bnf to regex.cpp
tacc -I regex.cpp -pp "a*b+c?(this|that)" -> run the interpreter on regex.ccp with input string
tacc regex.bnf -o regex.java -> compile regex.bnf to regex.java
tacc -l regex.bnf -o regex.cpp -> compile regex.bnf as LALR (better)
tacc -d -> list available templates
tacc -M default -o template.cpp -> output the default template to template.cpp
Experimental and Undocumented Usage
-options:
-Ux print transitions before they are LALR merged with -l
-Us print states before they are LALR merged with -l
-Ui print input tokens before they are LALR merged with -l
-Cx enables printing of transitions within the compiler
-Cp enables printing of the parse tree within the compiler
-Ct enables lex (token) debugging within the compiler, up to 13 t's
-CM enables all lex (token) debugging within the compiler
-O switches back to the old BNF parser. Used for testing the new
BNF parser is identical. No real reason to use -O unless you
have found some spectacular bug with the new BNF parser.
-X enables regex-like extensions to token names. Needs -O.
Tokens ending in a regex like character * + ? - adds another
two productions as follows:
LHS ::= RHS* ... adds RHS* ::= RHS* RHS | NULL
LHS ::= RHS+ ... adds RHS+ ::= RHS+ RHS | RHS
LHS ::= RHS- ... adds RHS- ::= RHS RHS- | RHS
LHS ::= RHS? ... adds RHS? ::= RHS | NULL
And while this sounds like a nifty idea, it actually can create
conflicts fairly easily because of the extra intermediate non-terminal
in the heirachy. :(
-R switches to regex test mode, for more info on -R, use -R?
-L switches to lex test mode, for more info on -L, use -L?
-O uses the old bnf bootstrap parser that was hand written before tacc was operational enough to generate its own BNF parser. It is not as flexible as the new tacc generated one and shouldn't be used.
-X is best avoided regardless of what it says in the help above.
-R actually has some use. You can test what matches what with regular expressions.
> tacc -R?
usage: tacc -R "regex" ...
[-m match_string]
[-GWTV-?]
tacc -R tests the regex deterministic finite state (DFA) constructor
that is used in to build the regex lexical scanner.
-options:
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-RR help on regexes and regex classes.
-V Version
-- Stop -option parsing
-? This help message
Example:
tacc -R "int" "while" "[0-9]+" "[A-Za-z_][A-Za-z_0-9]*" -m "integrity"
Just looking at the example given in the help. There are 4 regexs there and the input is "integrity". He is hoping that integrity will be matched by regex #4 "[A-Za-z_][A-Za-z_0-9]*" and not by regex #1 "int". Even though the first part of "integrity" is "int", "integrity" is longer and the regex matcher matches the leftmost longest string.
Running the example shows tacc compiling the regexes to multiple non-deterministic finite state automatons (NDFAs) and then merging them into a single deterministic finite state automaton (DFA). It is similar to what tacc does when building a parser, but the regex automaton doesn't use any stack. Just being in a particular state is enough to recognise regular expressions.
> tacc -R "int" "while" "[0-9]+" "[A-Za-z_][A-Za-z_0-9]*" -m "integrity"
regex_test_main: regex 1 is int
NDFA:
Start 5
state 5 on epsilon --> 1
state 1 on i --> 2
state 2 on n --> 3
state 3 on t --> 4 MATCH 1
regex_test_main: regex 2 is while
NDFA:
Start 12
state 12 on epsilon --> 6
state 6 on w --> 7
state 7 on h --> 8
state 8 on i --> 9
state 9 on l --> 10
state 10 on e --> 11 MATCH 2
regex_test_main: regex 3 is [0-9]+
NDFA:
Start 16
state 16 on epsilon --> 13
state 13 on 0123456789 --> 14
state 14 on epsilon --> 13
state 14 on epsilon --> 15 MATCH 3
regex_test_main: regex 4 is [A-Za-z_][A-Za-z_0-9]*
NDFA:
Start 26
state 26 on epsilon --> 17
state 17 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 24
state 24 on epsilon --> 20
state 24 on epsilon --> 25 MATCH 4
state 20 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 24
regex_test_main: DFA compile:
DFA:
Start 1
State 1 on 0123456789 --> 2 MATCH 3
State 1 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvxyz --> 3 MATCH 4
State 1 on i --> 4 MATCH 4
State 1 on w --> 5 MATCH 4
State 2 on 0123456789 --> 2 MATCH 3
State 3 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
State 4 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmopqrstuvwxyz --> 3 MATCH 4
State 4 on n --> 6 MATCH 4
State 6 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrsuvwxyz --> 3 MATCH 4
State 6 on t --> 7 MATCH 1
State 7 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
State 5 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefgijklmnopqrstuvwxyz --> 3 MATCH 4
State 5 on h --> 8 MATCH 4
State 8 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvwxyz --> 3 MATCH 4
State 8 on i --> 9 MATCH 4
State 9 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz --> 3 MATCH 4
State 9 on l --> 10 MATCH 4
State 10 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdfghijklmnopqrstuvwxyz --> 3 MATCH 4
State 10 on e --> 11 MATCH 2
State 11 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
regex_test_main: DFA compress:
DFA:
Start 1
State 1 on 0123456789 --> 2 MATCH 3
State 1 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvxyz --> 3 MATCH 4
State 1 on i --> 4 MATCH 4
State 1 on w --> 7 MATCH 4
State 2 on 0123456789 --> 2 MATCH 3
State 3 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
State 4 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmopqrstuvwxyz --> 3 MATCH 4
State 4 on n --> 5 MATCH 4
State 5 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrsuvwxyz --> 3 MATCH 4
State 5 on t --> 6 MATCH 1
State 6 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
State 7 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefgijklmnopqrstuvwxyz --> 3 MATCH 4
State 7 on h --> 8 MATCH 4
State 8 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvwxyz --> 3 MATCH 4
State 8 on i --> 9 MATCH 4
State 9 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz --> 3 MATCH 4
State 9 on l --> 10 MATCH 4
State 10 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdfghijklmnopqrstuvwxyz --> 3 MATCH 4
State 10 on e --> 11 MATCH 2
State 11 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4
match: s:001 m:000 c = '\x00'
match: s:004 m:004 c = 'i'
match: s:005 m:004 c = 'n'
match: s:006 m:001 c = 't'
match: s:003 m:004 c = 'e'
match: s:003 m:004 c = 'g'
match: s:003 m:004 c = 'r'
match: s:003 m:004 c = 'i'
match: s:003 m:004 c = 't'
match: s:003 m:004 c = 'y'
rex.match("integrity") = regex 4: [A-Za-z_][A-Za-z_0-9]*
As a state diagram it looks like this. You can follow the matching process above (lines that start match: ). You can see it started in state1, i, state 4, n, state 5 etc and eventually gets stuck in a loop on state 3 matching the 4th regex which is a variable id. Double outline circles are accept states.
![]() |
-L actually has some use. You can test what the lexical scanner does.
> tacc -L?
usage: tacc -L [-ctTwslnN] {"plain_string_token" | "/regular_expression_token/"} ...
{[-e] [-E escape_character] -i match_string | -f input_filename.ext}
[-x expected_output]
[-GWTV-?]
tacc -L tests the regex lexical scanner.
-options:
-c ensure token ids for single character token names = ascii of the character
-t print tokens as they are read (more t's for lexical scanner debug info)
-k don't use context aware lexing
-K lexical scanner debug everything, equivalent to -ttttt...
-w don't ignore end of lines (\n, \f) (\r is always ignored)
-ww don't ignore any whitespace (\n \f \v \t ' ') (\r is always ignored)
-s short output format (for regression tests, same format at -x expects)
-l dump regex lex tables
-n do not execute the parser (used to show the debug tables when parsing might go nuts)
-e -i input_string contains escape sequences (like \n) that need to be descaped
-E escape_character, like -e above, but escape sequences start with escape_character
-N suppress warning messages
-G debug
-W wait for enter key before exiting
-T print time taken
-D once enables descape of words in [ ]. Use -DD for more help.
[NAME], [xHX], [oOCT], [aDEC], [NNN]
-RR help on regexes and regex classes.
-V Version
-- Stop -option parsing
-? This help message
Examples:
tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" "/[[:space:]]+/" -iw "int interest = 40"
tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40"
tacc -Ls "/[a-z]+/" "=" "/[0-9]+/" -i "that = 86" -x "1:that, 2:=, 3:86, EOS"
Here is tacc doing the second example in the help:
> tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40"
lex_test_main: assign token 1 = "int"
lex_test_main: assign token 2 = "while"
lex_test_main: assign token 3 = "="
lex_test_main: assign token 4 = "/[0-9]+/"
lex_test_main: assign token 5 = "/[a-z]+/"
Tokens
1. real not_null terminal input plain int
2. real not_null terminal input plain while
3. real not_null terminal input plain =
4. real not_null terminal input regex /[0-9]+/
5. real not_null terminal input regex /[a-z]+/
6. virtual not_null terminal - plain UNKNOWN
7. virtual not_null terminal - plain ERROR
8. virtual not_null terminal - plain EOS
9. virtual not_null terminal - plain ANY
10. virtual nullable terminal - plain NULL
11. virtual not_null terminal - plain START
lex_test_main: lexing string: int interest = 40
{1 int = "int"}
{5 /[a-z]+/ = "interest"}
{3 = = "="}
{4 /[0-9]+/ = "40"}
{8 EOS = ""}
-Ls prints out a short one line lex of the input. This is mainly for regression testing:
> tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40" -s
1:int, 5:interest, 3:=, 4:40, EOS
Other options will have you drowning in debug, go for it.
How to LR(1) parse (by hand)
First you generate the list of predicate follow states based on the possibili... oh noes not that explanation again! Can't you make it simpler than that?!
Let's go slow, starting with a very simple grammar, with 2 productions: Productions have the form LHS ::= RHS. We say Left Hand Side (LHS) becomes (::=) Right Hand Side (RHS). The LHS is always one token, the RHS can be many tokens.
This grammar accepts the input x y z. An x followed by y, followed by z. Once. Just x y z.
GOAL ::= expr
expr ::= x y z
We are building a state pushdown automaton from a bnf grammar. Sounds fancy.
Start with state 1.
State 1
Find the token that is the GOAL token. In tacc, the GOAL token is the first token in the bnf file. Which is the LHS of the first production. Here the GOAL token is GOAL.
Write the Goal production out:
State 1
GOAL ::= expr
Put a dot . after the ::=, before the RHS of the production. This is the parse cursor:
State 1
GOAL ::= . expr
The goal production will always be followed by the end of string EOS. Put that after the production. This is called the follow set:
State 1
GOAL ::= . expr EOS
For every token to the right of the parse cursor dot . (in this case expr) find a production with a matching left hand side and add it to the state. Don't worry if there aren't any:
State 1
GOAL ::= . expr EOS
expr ::= x y z
It too will have a parse cursor dot . placed just after the ::=
State 1
GOAL ::= . expr EOS
expr ::= . x y z
And it will inherit the follow set of the production it was expanded from:
State 1
GOAL ::= . expr EOS
expr ::= . x y z EOS
Now x has a dot in front of it, so we look for a production with an x as the LHS. And there aren't any. So we say that x is a terminal token. Terminal tokens do not expand and really do occur in our input stream. expr and GOAL appear on the LHS of the productions. We call them non-terminal tokens because they expand to other tokens. Non-terminal tokens do not occur in our input stream.
Now for each production in state 1, we create another state. Taking the first production in state 1:
GOAL ::= . expr EOS
We see that the right of the parse cursor . has a token. We say, "In state 1, on an expr, go to state 2" Where 2 is the new state:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z EOS
And we make state 2:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z EOS
State 2
Then we add all of the productions in state 1 that have . expr in them to the newly created state 2, but shift the parse cursor . one token to the right. We also take the follow set, but not the go to state number. In this case only one production needs to be taken:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z EOS
State 2
GOAL ::= expr . EOS
Now we go back to state 1 and continue with the second production:
expr ::= . x y z EOS
We see that the right of the parse cursor . has a token. We say, "In state 1, on an x, go to state 3" Where 3 is a new state:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . EOS
And we make state 3:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . EOS
State 3
Then we add all of the productions that have . x in them to the newly created state 3, but shift the parse cursor . one token to the right. We also take the follow set, but not the go to state number. In this case only one production needs to be taken:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . EOS
State 3
expr ::= x . y z EOS
And state 1 is now closed.
Now we do the same for all the new states, starting with state 2.
State 2 only has one production:
GOAL ::= expr . EOS
But this time the parse cursor . is at the end of the production. We say, "In state 2, on an EOS, reduce GOAL ::= expr" The reduce is a curly arrow and tacc writes it like _o->
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z EOS
And state 2 is now closed.
Moving on to state 3.
State 3 only has one production:
expr ::= x . y z EOS
We see that the right of the parse cursor . has a token. We say, "In state 3, on a y, go to state 4" Where 4 is a new state:
And we make state 4:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z ---> 4 EOS
State 4
Then we add all of the productions that have . y in them to the newly created state 4, but shift the parse cursor . one token to the right. We also take the follow set, but not the go to state number. In this case only one production needs to be taken:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z ---> 4 EOS
State 4
expr ::= x y . z EOS
And state 3 is now closed.
Moving on to state 4.
State 4 only has one production:
expr ::= x y . z EOS
We see that the right of the parse cursor . has a token. We say, "In state 4, on a z, go to state 5" Where 5 is a new state:
And we make state 5:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z ---> 4 EOS
State 4
expr ::= x y . z ---> 5 EOS
State 5
Then we add all of the productions that have . z in them to the newly created state 5, but shift the parse cursor . one token to the right. We also take the follow set, but not the go to state number. In this case only one production needs to be taken:
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z ---> 4 EOS
State 4
expr ::= x y . z ---> 5 EOS
State 5
expr ::= x y z . EOS
And state 4 is closed.
Moving on to state 5.
State 5 only has one production:
expr ::= x y z . EOS
This time the parse cursor . is at the end of the production. We say, "In state 5, on an EOS, reduce expr ::= x y z" The reduce is a curly arrow and tacc writes it like _o->
State 1
GOAL ::= . expr ---> 2 EOS
expr ::= . x y z ---> 3 EOS
State 2
GOAL ::= expr . _o-> EOS
State 3
expr ::= x . y z ---> 4 EOS
State 4
expr ::= x y . z ---> 5 EOS
State 5
expr ::= x y z . _o-> EOS
And state 5 is closed.
And we are done. The state automaton has 5 states.
Of course this is a simple example and doesn't show all the nuances of state creation, but it is the core of how it is done. We will walk through a more complicated example later on.
Once the automaton is built, we can discard some of the information in the states. We only need to know 1. the state we are in, 2. the token we have and 3. the transition action - either the state to go to or the reduction to do.
When a transition action is a go to another state, it is called a shift transition.
When a transition action is a reduce, it is called a reduction transition.
We can extract the vital information from the state pushdown automaton to give us a pushdown automaton:
State 1 on expr ---> 2
State 1 on x ---> 3
State 2 on EOS _o-> GOAL ::= expr
State 3 on y ---> 4
State 4 on z ---> 5
State 5 on EOS _o-> expr ::= x y z
Graphically it looks like this:
![]() |
If you were following the diagram and trying to parse the input x y z.
Start in state 1, add the parse cursor . before the input, then the input x y z, then the end of string token EOS:
state 1: . x y z EOS
So in state 1, on an x, go to state 3
state 3: x . y z EOS
In state 3, on y, go to state 4
state 4: x y . z EOS
In state 4, on z, go to state 5
state 5: x y z . EOS
In state 5, on EOS, reduce expr ::= x y z. OK... But what state are we in now? This is where the pushdown in pushdown automaton comes in.
As we shift through the states during the parse, we push the state number on to the state stack as we go. When we reduce, we pop off the number of states off the state stack equal to the number of tokens on the RHS of the reduction (not the whole story yet, but stick with it for now. We need to adjust for NULLs as well.)
If we were keeping a state stack, it would be 1, 3, 4, 5. The numbers of all the states we have visited so far.
There are 3 tokens in this reduction, x y z, so we pop 3
states off the stack: 1, 3, 4, 5
and end up back in state 1.
The reduction also replaces everything on the RHS of the reduction (x y z), with what is on the LHS (expr).
The parse cursor . is placed before the LHS.
And it now looks like this:
state 1: . expr EOS
In state 1, on expr, go to state 2
state 2: expr . EOS
In state 2, on EOS, reduce GOAL ::= expr.
The state stack is 1, 2. There is 1 token on the RHS of the
reduction, so pop off 1 state from the state stack: 1, 2 and end up in state 1 again.
The reduction also replaces everything on the RHS of the reduction (expr), with what is on the LHS (GOAL).
The parse cursor . is placed before the LHS.
And it now looks like this:
state 1: . GOAL EOS
In state 1, on GOAL with no more input, is the condition to accept the string. (Some would say we need to shift the GOAL and maybe even the EOS, but tacc doesn't bother and saves a few cycles)
We are done.
We could have done the states with tacc -s:
> tacc xyz.bnf -s
tacc: reading grammar file: xyz.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
States
---- START ---> state 1
| 1. GOAL ::= . expr ---> 2 EOS
| 2. expr ::= . x y z ---> 3 EOS
--------------------
---- expr ---> state 2
| 1. GOAL ::= expr . _o-> -1 EOS
--------------------
---- x ---> state 3
| 2. expr ::= x . y z ---> 4 EOS
--------------------
---- y ---> state 4
| 2. expr ::= x y . z ---> 5 EOS
--------------------
---- z ---> state 5
| 2. expr ::= x y z . _o-> -2 EOS
--------------------
Parser States: 5
Lexical States: 2
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(0)
You can see that tacc has created a number for each production. Also the transition action is a positive number equal to the state number for a shift action and a negative number corresponding to the -production number for each reduction action.
And the transitions with tacc -x:
C:\Trev\Projects\tacc\Debug>tacc xyz.bnf -x
tacc: reading grammar file: xyz.bnf
tacc: checking productions
tacc: building automation
tacc: building lexer
Transitions
state 1 on expr ---> 2
state 1 on x ---> 3
state 2 on EOS _o-> GOAL ::= expr
state 3 on y ---> 4
state 4 on z ---> 5
state 5 on EOS _o-> expr ::= x y z
Parser States: 5
Lexical States: 2
Shift-Reduce conflicts: 0
Reduce-Reduce conflicts: 0
Warnings: 0
Errors: 0
Automaton is LR(0)
And after compiling with tacc xyz.bnf -o xyz.cpp, we could parse the input x y z with tacc -I:
> tacc -I xyz.cpp "x y z" -x
state 1 on 8 x ---> state 3
state 3 on 9 y ---> state 4
state 4 on 10 z ---> state 5
state 5 on 2 EOS _o-> state 1 reduce 2: expr ::= x y z
state 1 on 7 expr ---> state 2
state 2 on 2 EOS _o-> state 1 reduce 1: GOAL ::= expr
xyz: string accepted