tacc - The Alternative Compiler Compiler


tacc: Version: 4.56.2.81 32 bit, Compiled: Jun 25 2021 at 01:46:21 (c) Trevor Boyd 19 Mar 2014

tacc: Version: 4.56.2.81 64 bit, Compiled: Jun 25 2021 at 02:19:21 (c) Trevor Boyd 19 Mar 2014


Download tacc - Windows 32 bit and 64 bit

Introduction

tacc

tacc is a compiler compiler.

tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source code in C++ or Java with a built in lexical scanner.

You can compile the output .cpp with a C++ or Java compiler or you can use tacc -I to "interpret" the .cpp file.

The resultant parse from either method is identical.  The C++ output.cpp file is built from the actual source code of the tacc when tacc is compiled.

The ability to test your grammar and syntax check any file in your target language without having to compile the .cpp or build your own lexical scanner can greatly speed up the development of your language parser.

You can add semantic actions directly to the bnf or you can add them later manually to the output .cpp file.

tacc workflow

tacc Features

Similar in function to yacc and lex put together.

Elegant simplicity

Lots of automatic and default components, tacc just works it out for you.

Powerful default lexer

Interpreter mode

Rich trace and debug

tacc Drawbacks

Parts of tacc were written before 1993, so it is a bit old school in some areas:

Simple Example

Create a grammar.bnf

Let's write a simple parser to recognise a stream of typical variable names (identifiers).

Identifiers start with a letter or underscore and are optionally followed by letters, underscore and numbers.

The grammar for the identifier stream is:

> cat identifiers.bnf

 

GOAL ::= ids

 

ids ::= id | ids id

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

By default, the parser generated with tacc automatically handles whitespace, so we don't need to worry about it here.

The first token encountered in the BNF file is the goal token.  The entire parse will eventually end up as the goal token.

Tokens can be plain text or /regular expressions/.  Regular expressions match the leftmost longest string in the input.  If more than one token matches the input, and they are the same length, then the one that is specified first in the bnf file has priority.  This means you can put keywords before generic regexes that catch identifiers and have them correctly tokenised as keywords.

Compile the grammar.bnf to grammar.cpp with tacc

Compile the bnf with tacc:

> tacc identifiers.bnf -o identifiers.cpp

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

 

tacc: using internal template: default (vogue)

tacc: output file is: identifiers.cpp

Execute the grammar.cpp with the tacc -I interpreter

Test it out with tacc -I:

> tacc -I identifiers.cpp "this and that"

 

tacc: string accepted

 

> tacc -I identifiers.cpp "this then 9fail"

 

tacc: Error: Unrecognised input "9fail"

 

       1: this then 9fail

                    ^^^^^

Or compile the grammar.cpp with C++

Or you can compile identifiers.cpp with a C++ compiler:

> cl /EHsc /nologo /O2 identifiers.cpp /Feidentifiers.exe

identifiers.cpp

Test it out with identifiers.exe:

> identifiers.exe "this and that"

 

identifiers: string accepted

 

> identifiers.exe "this then 9fail"

 

identifiers: Error: Unrecognised input "9fail"

 

       1: this then 9fail

                    ^^^^^

Apart from the program name, the output is the same.

Command line options

tacc -I and your exe also have command line options that can show you what is going on during the parse of your text.  You can use -? To see them all

> identifiers.exe -?

 

  usage: identifiers { -f input_file | input_string... }

              [-xrtasnlpvewb]

              [-GWTV-?]

 

  -options:

 

     -f infile

     -x print transitions

     -r print reductions

     -t print tokens as they are read (-ttt... -K = max t's)

     -a print terminal args with reductions

     -s print top of parse tree stack with reductions (needs -p)

     -n only print reductions that have terminal args

     -l print lookahead with reductions

     -p print parse tree, -pp print fancier parse tree

     -v verbose, equivalent to -alrpptx

     -e suppress expecting... output

     -k don't use context aware lexing

     -w don't ignore end of lines (\n, \f)  (\r is always ignored)

     -ww don't ignore any whitespace (\n \f \v \t ' ')   (\r is always ignored)

     -b print the BNF source for this program

     -B print the input string or filename

 

     -G debug

     -W wait for enter key before exiting

     -T print time taken

     -D once enables descape of words in [ ].  Use -DD for more help.

        [NAME], [xHX], [oOCT], [aDEC], [NNN]

     -V Version

     -- Stop -option parsing

     -? This help message

These options are all explained in detail later on, but for now -rat is fun:

D:\Trev\Projects\tacc\Doc>identifiers.exe -rat "this and that"

lex: next token is {9 /id/ = "this"}

lex: next token is {9 /id/ = "and"}

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "this"

   2: ids ::= id

lex: next token is {9 /id/ = "that"}

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "and"

   3: ids ::= ids id

lex: next token is {2 EOS = ""}

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "that"

   3: ids ::= ids id

   1: GOAL ::= ids

Entering Semantics

It is nice to be able to recognise input, but it is better to be able to do something with it.  The action you perform after parsing is semantics.  You can add semantic actions directly to the .bnf grammar or you can add it to the default_semantics function in the output.cpp or even create your own.  Generally the more complex the grammar and the more changes you make to the semantic code, the more likely you are to use cpp semantics.  bnf semantics are for quick and simple grammars.

It might be a little easier to start with entering code into the output.cpp file.

cpp semantics

tacc generates a default semantic function for you to edit.  The one for the identifiers.bnf grammar above looks like this:

 

int default_semantics(int red, int argc, string argv[], token_info *argt[], string &error)

{

//  ---- Description

//

//  default_semantics is a callback function from the parser that gives you a reduction that is

//  happening now and the strings that are the terminals in that production.  You will normally

//  perform some processing based on the reduction number (red) and the arguments in argv.

//

//  ---- Parameters

//

//  red    - the reduction (production) number.

//

//  argc   - the count of terminal token strings in argv

//

//  argv[] - Reduction terminal token strings (the values of the tokens that are "variables")

//

//      argv[0] is the leftmost terminal token string

//      argv[1] is the next leftmost terminal token string

//      ...

//      argv[argc - 1] is the rightmost terminal token string

//      argv[argc] is the lookahead token string

//

//  argt[] - Reduction terminal token info

//

//      argt[0]->token - the token id number

//

//      Usually you would only use .token, but see class token_info for others.

//

//  error - an optional error message that will be printed if you return non 0 from this function.

//          So if you found a semantic error on token 0 and wanted to stop parsing, you would:

//

//          error = argt[0]->token_error("ERROR: I'm sorry, Dave, I'm afraid I can't do that.");

//          return 1;

//

//  return value: 0 = all OK and keep parsing.  Non 0 = error and stop parsing.

//

//  ---- Example

//

//    17. EXPR ::= /[0-9]+/ OPERATOR /[A-Za-z_][A-Za-z_0-9]*/

//                 37                goo                              ;

//

//    red  = 17          - the number of the reduction, ie the 17th production in your .bnf

//    argc = 2           - the number of terminal tokens, here they are "37" and "goo"

//    argv[0] = "37"     - the value of the first terminal token

//    argv[1] = "goo"    - the value of the second terminal token

//    argv[2] = ";"      - the lookahead token

//    argt[0]->token = 7 - the token for number, ie the 7th token defined in your .bnf

//    argt[1]->token = 9 - the token for identifier, ie the 9th token defined in your .bnf

//

//    EXPR is what you are calculating here, most likely you will push this result on your

//    own weird stack.

//

//    OPERATOR is a non terminal, most likely you will have OPERATOR ::= + | - | ...

//    somewhere in your .bnf.

//

 

    switch (red) {

 

        case    1 :  //  GOAL ::= ids 

            break;

 

        case    2 :  //  ids ::= id 

        case    3 :  //  ids ::= ids id 

            break;

 

        case    4 :  //  id ::= /[A-Za-z_][A-Za-z_0-9]*/ 

            break;

 

 

        case 0:

        default :

            // printf("Unhanded or unknown reduction = %d\n", red);

            // exit(EXIT_FAILURE);

            break;

    }

 

    return 0;

}

 

Adding cpp semantics

So if we wanted to print out a message when we read in an id, we could put it in the code like this:

        case    4 :  //  id ::= /[A-Za-z_][A-Za-z_0-9]*/ 

            printf("id = %s\n", argv[0].c_str());

            break;

 

Compiling it and running it gives:

> identifiers.exe this and that

id = this

id = and

id = that

 

identifiers: string accepted

Note that we could not have put the printf under case 2 or 3 because the right hand side (RHS) of the ::= are all non terminals (ie they appear as a left hand side (LHS) somewhere else in the grammar).  If you want to retain the id for later processing by higher order rules, the usual way is to push the values onto a stack.  We will see how to do that a bit further on.

bnf semantics

You can enter code directly into the bnf by putting it inside braces after the rule it matched like this (and it has to be pretty much exactly like this):

GOAL ::= ids

 

ids ::= id | ids id

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

{

    printf("id = %s\n", argv[0].c_str());

}

tacc, compile and execute gives us the same result as putting it directly into the .cpp:

> identifiers.exe this and that

id = this

id = and

id = that

 

identifiers: string accepted

Note that tacc -I cannot interpret the semantic actions (aww...).  tacc can only interpret the tables it generated in the cpp file.  Disappointing, I know, but I wasn't writing an entire C++ interpreter.  Besides, you can use your own template file and it need not be C++.

Named Sections

If you want to put some code in the bnf file that is global, you can put it in a named section.

Let's say we wanted to read in the identifiers and output them all at once with commas between them.  We need to change the grammar slightly so that the ids ::= id | ids id is split into separate lines and add some { semantic blocks }

So input:

this and that

Outputs:

List: this, and, that

The default tacc template has a tacc_include and a tacc_global that we can use like this:

tacc_include

{

    #include <iostream>

    using std::cout;

    #include <string>

    #include <stack>

}

 

tacc_global

{

    typedef std::stack<std::string> string_stack;

    string_stack aux;

}

 

GOAL ::= ids

{

    string list = aux.top(); aux.pop();

    cout << "List: " << list << "\n";

}

 

ids ::= id

{

    string id = aux.top();

    cout << "leaving id = " << id << " on the stack\n";

}

 

ids ::= ids id

{

    string id = aux.top(); aux.pop();

    cout << "pop id = " << id << "\n";

    string list = aux.top(); aux.pop();

    cout << "pop list = " << list << "\n";

    list = list + ", " + id;

    cout << "pushing list = " << list << "\n";

    aux.push(list);

}

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

{

    cout << "pushing id = " << argv[0] << "\n";

    aux.push(argv[0]);

}

tacc, compile and execute gives us:

> identifiers.exe this and that

pushing id = this

leaving id = this on the stack

pushing id = and

pop id = and

pop list = this

pushing list = this, and

pushing id = that

pop id = that

pop list = this, and

pushing list = this, and, that

List: this, and, that

 

identifiers: string accepted

It may seem like a kinda crazy way to get that string to the next higher level rule by popping and pushing, but it is very common to do it like that.

You might also pop values off the stack, perform some processing and push the result back on as we do to put the comma in the list.

It is also normal to have a stack that is a union of many types.

A named section may have any name.  The name given is matched to a comment in the template file that tacc uses to generate the cpp file.

You can list the templates built in to tacc and export them for perusal like this:

> tacc -d

 

Name      Ext   Default  Description

--------  ----  -------  ----------------------------------------------------------------

classic   cpp   No       Older C++ template with no real redeeming features.

sienna    cpp   No       C++ template generated around tacc 4.03.  Still good.

vogue     cpp   Yes      C++ template generated from the tacc source code.

jive      java  No       Simple Java template for tiny grammars.

juke      java  Yes      Java template with parser tables in *.joos.dat files.

cup       cup   Yes      Convert to CUP parser format.  Not perfect.

default   *     *        Matches a default template above by output file extension.

 

> tacc -M default -o template.cpp

Named sections look like this:

// >>>> tacc directive [arguments]

Most of them are only really useful if you are building your own template file.

Directive is one of these. 

tacc_include

tacc_global

 

cpp_bnf_name

cpp_bnf_created

cpp_gen_author

cpp_tacc_info

cpp_command_line

cpp_bnf_source

cpp_tokens

cpp_transition_tokens

cpp_transition_actions

cpp_transition_state_index

cpp_reductions

cpp_string_tokens

cpp_regex_tokens

cpp_input_tokens

cpp_productions

cpp_case_semantics

cpp_regex_lex

cpp_context_lex_state

 

java_bnf_name

java_bnf_created

java_gen_author

java_tacc_info

java_command_line

java_bnf_source

java_tokens

java_transition_tokens

java_transition_actions

java_transition_state_index

java_reductions

java_string_tokens

java_regex_tokens

java_input_tokens

java_productions

java_case_semantics

java_regex_lex

java_context_lex_state

java_main_class

 

joos_app_info                $dir/$name.app_info.joos.dat

joos_standard_tokens         $dir/$name.standard_tokens.joos.dat

joos_transition_tokens       $dir/$name.transition_tokens.joos.dat

joos_transition_actions      $dir/$name.transition_actions.joos.dat

joos_transition_state_index  $dir/$name.transition_state_index.joos.dat

joos_reductions              $dir/$name.reductions.joos.dat

joos_string_tokens           $dir/$name.string_tokens.joos.dat

joos_regex_tokens            $dir/$name.regex_tokens.joos.dat

joos_input_tokens            $dir/$name.input_tokens.joos.dat

joos_productions             $dir/$name.productions.joos.dat

joos_regex_lex_state_action  $dir/$name.regex_lex_state_action.joos.dat

joos_regex_lex_state_match   $dir/$name.regex_lex_state_match.joos.dat

joos_context_lex_state       $dir/$name.context_lex_state.joos.dat

 

joos_file_extension

 

cup_preamble

cup_lex

cup_cf

The named sections that start cpp, java and joos are used by tacc and are replaced with parser tables.  To see a full description of every one of these, just browse through the comments in the template files.

$name is name of the input.bnf file without the extension, ie input.

$filename is the name of the input.bnf file, ie input.bnf

$dir is the directory of this automaton, from the dir/input.bnf or "." if not present

tacc doesn't use tacc_include or tacc_global, they are just custom named sections used by the cpp templates.

Custom Named Sections

You can make your own template.cpp and include your own named sections.  For example if we modify the exported template.cpp from above and add a groovy section:

// >>>> tacc groovy

And put a groovy named section in the bnf:

groovy

{

    // yeah this is groovy

}

Then use tacc with the new template with -m:

> tacc -m template.cpp identifiers.bnf -o identifiers.cpp

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

 

tacc: using template file: template.cpp

tacc: output file is: identifiers.cpp

The output file contains:

    // yeah this is groovy

In place of the:

    // >>>> tacc groovy

That was in the template file.

Java

tacc can also do the same thing for java.

> tacc identifiers.bnf -o identifiers.java

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

 

tacc: using internal template: default (juke)

tacc: output file is: identifiers.java

 

> javac identifiers.java

 

> java identifiers this and that

 

identifiers: string accepted

 

> java identifiers this and 9fail

 

identifiers: Error: Unrecognised input "9fail"

 

       1: this and 9fail

                   ^^^^^

 

> tacc -I identifiers.java this and that

 

tacc: string accepted

 

> tacc -I identifiers.java this and 9fail

 

tacc: Error: Unrecognised input "9fail"

 

       1: this and 9fail

                   ^^^^^

Fly Little Bird

That is really all you need to know to get started.

tacc -k Context Aware Lexing

tacc -k

Because tacc creates both the lexer and the parser, it knows everything about the parser when it is building the lexer.  This means that it can create a lexer that is specific to every parser state.

Benefits:

Drawbacks:

Why would you want to do that then?

Well in some cases it is the only way.  Parsers have long separated the lexer from the parser and because of that they have 1 big generic lexer that can recognise all of the tokens.  tacc was no different, even though it was all part of the same program, the lexer and the parser really didn't interact that much.  But now they do and this can make some bnfs possible where previously they were impossible.

For example, let's say you had a language that recognised simple assignments of the form:

 

    id = number

 

Potentially separated by ;  The bnf would look like this:

 

    GOAL ::= program

 

    program    ::= program ; statement | statement

 

    statement  ::= expression | NULL

 

    expression ::= id \= number

 

    id     ::= /[A-Za-z_][A-Za-z_0-9]*/

    number ::= /[0-9]+/

 

And now you want to add a help command as well.  The help command would be of the form:

 

    help any_thing_at_all

 

Where any_thing_at_all really could be anything, including letters, numbers and even a ;

Conventional lexers would have trouble with this.  Let's see why.  First put the new help command in the grammar: context.bnf:

 

    GOAL ::= program

 

    program   ::= program ; statement | statement

 

    statement ::= expression | help_expression | NULL

 

    help_expression ::= help subject

    expression      ::= id \= number

 

    id      ::= /[A-Za-z_][A-Za-z_0-9]*/

    number  ::= /[0-9]+/

 

    subject ::= /[^[:space:]]+/

 

Compile it with tacc and try it out:

 

    > tacc context.bnf -o context.cpp

 

And try it out:

 

    > tacc -I context.cpp -rt "goo = 37"

    lex: next token is {17 /id/ = "goo"}

    lex: next token is {15 = = "="}

       9: id ::= /[A-Za-z_][A-Za-z_0-9]*/

    lex: next token is {18 /number/ = "37"}

    lex: next token is {2 EOS = ""}

      10: number ::= /[0-9]+/

       8: expression ::= id = number

       4: statement ::= expression

       3: program ::= statement

       1: GOAL ::= program

 

    context: string accepted

 

Good so far, but what happens if you take some of that whitespace out?

 

    > tacc -I context.cpp -rt "goo=37"

 

    lex: next token is {19 /subject/ = "goo=37"}

 

    context: Error: Unexpected input token: /subject/

 

           1: goo=37

              ^^^^^^

 

        found:  /subject/, regex: /[^[:space:]]+/

    expecting:  EOS program ; statement expression help_expression help id /id/

 

The trouble is the /subject/ regex is so inclusive it consumes all of the input.  But why should it?  The only time the /subject/ regex is used is after the keyword help.

How about asking for help on ;

 

    > tacc -I context.cpp -rt "help ;"

    lex: next token is {12 help = "help"}

    lex: next token is {8 ; = ";"}

 

    context: Error: Unexpected input token: ;

 

           1: help ;

                   ^

 

        found:  ;

    expecting:  subject /subject/

 

No good either.  Even though the ordinary regex lexer whould have had the ; matched with ; and /subject/ for the same length of 1, the bare ; has higher priority than /subject/ because it occurs before it in the grammar.  So there

is no way to allow the ; to be a parameter to the help keyword.

If you could limit looking for /subject/ only to the context when the previous token was 'help' then you would be able to have your cake and eat it too.

This is what the tacc -k context aware lexer does.  It only enables those input tokens that can occur in the context of a particular parser state.  Let's try it out:

 

    > tacc -k context.bnf -o context.cpp

 

    tacc: reading grammar file: context.bnf

    tacc: checking productions

    tacc: building automation

    tacc: building lexer

    tacc: building context aware lexer

    Lexical contexts: 6

    Lexical states:   11 normal + 5 contextual = 16 total

 

    States: 15

 

    Shift-Reduce conflicts:  0

    Reduce-Reduce conflicts: 0

    Warnings:   0

    Errors:     0

 

    Automaton is LR(1)

 

    tacc: using internal template: default (vogue)

    tacc: output file is: context.cpp

 

Now let's try it out:

 

    > tacc -I context.cpp -rt "goo=37"

    lex: next token is {17 /id/ = "goo"}

    lex: next token is {15 = = "="}

       9: id ::= /[A-Za-z_][A-Za-z_0-9]*/

    lex: next token is {18 /number/ = "37"}

    lex: next token is {2 EOS = ""}

      10: number ::= /[0-9]+/

       8: expression ::= id = number

       4: statement ::= expression

       3: program ::= statement

       1: GOAL ::= program

 

    context: string accepted

 

And you can ask help of practically anything, including the ;

 

    > tacc -I context.cpp -rt "help ;"

    lex: next token is {12 help = "help"}

    lex: next token is {19 /subject/ = ";"}

    lex: next token is {2 EOS = ""}

      11: subject ::= /[^[:space:]]+/

       7: help_expression ::= help subject

       5: statement ::= help_expression

       3: program ::= statement

       1: GOAL ::= program

 

    context: string accepted

 

Note that tacc -k builds a context aware lexer in addition to the normal lexer.  In tacc -I interpreter mode and in your compiled cpp or java parser, the lexer will automatically use the context aware lexer if available.  But if you want to temporarily disable context aware lexing you can use tacc -I -k.

 

    > tacc -I -k context.cpp -rt "help ;"

    lex: next token is {12 help = "help"}

    lex: next token is {8 ; = ";"}

 

    context: Error: Unexpected input token: ;

 

       1: help ;

               ^

 

    found:  ;

    expecting:  subject /subject/

Formal tacc BNF Syntax

BNF

tacc was written with lesser versions of tacc (multiple times), so I should just put the tacc BNF here, but it is a bit scary looking, so I will included it later after we see some simple forms:

Comments

Comments are started with // and continue to the end of the line:

// this is a comment

Productions

Productions or rules are simply a single identifier ::= (becomes) some other identifiers.  LHS (left hand side) identifier and RHS (right hand side) identifiers are called tokens.

LHS ::= RHS ...

You can provide alternatives with the vertical bar | "OR" symbol.

LHS ::= RHS1 ... { | RHS2 ... }

Is equivalent to:

LHS ::= RHS1 ...

LHS ::= RHS2 ...

You can use the | alternate on the end of one line or the start of another:

LHS ::= RHS1 ... |

        RHS2 ...

 

LHS ::= RHS1 ...

      | RHS2 ...

You can use ||= as a shorthand way of writing | between every letter of the RHS

HEX ||= ABCDEFabcdef0123456789

Is equivalent to:

HEX ::= A | B | C | D | ... | 8 | 9

This is useful for character grammars, so check out the -c option too.

Tokens

Tokens may be practically any string separated by whitespace.  The "normal" identifier rules you might be familiar with in other languages do not apply here.  For example 98* is a perfectly acceptable token even though it starts with a number and has a * in it.

If a token only appears on the right hand side (RHS) of the ::=, it is said to be a terminal token.  If a token appears on the left hand side (LHS) and the RHS, then the token is said to be a non-terminal token.  tacc automatically works this out.

It is conventional in grammar BNFs to make non terminals uppercase and terminals lowercase.  This is not required or encouraged style in tacc, but feel free to go with the flow.  You can use ::= or just = as a production operator.  This document was written before = was allowed as an alternative and the author writes ::= mostly out of habit.  Don't feel that you need to.  There is no need to end statements with a ; as you see in some extended BNFs either, tacc uses line breaks as a delimiter.  tacc wants the BNF syntax to be as natural and uncluttered with language syntax as possible.

A right hand side RHS may be optional, you can indicate this in tacc by using NULL

LHS ::= NULL

A token can be enclosed in quotes or you can start it with a backslash \ if it has special characters that might confuse tacc.

LHS ::= "NULL"

LHS ::= \NULL

OR  ::= \|

Terminal Tokens

A terminal token may be a simple string or a regular expression in /slashes/

keyword ::= while

number  ::= /[0-9]+/

Regular Expressions

tacc regular expressions are stock standard, 7 bit ASCII only.  The tacc compiled lexer will recognise any number of regular expressions in parallel just as quickly as one.

The following tables can also be seen with tacc -RR for regex help.

 

regex

Description

Example

c

Any non special character c matches itself

k matches k

.

matches any single character except \r \n and \0

. matches a, B, c, x, ...

^

matches the start of a line

^a matches lines that start with a

$

matches the end of a line

z$ matches lines that end in z

\n \r \t

matches \n newline, \r carriage return and \t tab

\n matches \n

\\

matches \

\\ matches \

\s

matches special character s

\. matches .

\xHH

matches hex character HH

\x20 matches space

[xyz0-9]

matches characters and ranges in between [...]

gr[ae]y[0-9] matches gray7 and grey9

[^xyz0-9]

matches not (characters and ranges) in between [^...]

[^A-Z] matches not uppercase

[:class:]

matches a standard class of characters (see below), must be used inside [...]

[[:digit:]] matches a digit 0 to 9

a*

matches 0 or more a's

a* matches "", a, aa, aaa...

a+

matches 1 or more a's

a+ matches a, aa, aaa...

a?

matches 0 or 1 a's

ab?c matches ac, abc, but not abbc

a|b

matches a or b, almost always used with (a|b)

gr(ay|ey) matches gray and grey

( )

delimits a sub-expression that you can apply * + ? to

(ab)+ matches ab, abab, ababab, ...

 

. means does NOT match -- any other character means match

 

|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+--

00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000001111111111111111111111111111

00000000001111111111222222222233333333334444444444555555555566666666667777777777888888888899999999990000000000111111111122222222

01234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567890123456789012345678901234567

|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+----|----+--

[:class:]

................................ !"#$%&'()*+,-./0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~.

none 

blank

cr

eol  

space

cntrl

digit

xdigit

lower

upper

alpha

alnum

word 

punct

graph

print

dot  

any

notnull  

regord

regres

................................................................................................................................

.........Y......................S...............................................................................................

..........Y..Y..................................................................................................................

Y.........Y..Y..................................................................................................................

.........YYYYY..................S...............................................................................................

YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYY...............................................................................................Y

................................................0123456789......................................................................

................................................0123456789.......ABCDEF..........................abcdef.........................

.................................................................................................abcdefghijklmnopqrstuvwxyz.....

.................................................................ABCDEFGHIJKLMNOPQRSTUVWXYZ.....................................

.................................................................ABCDEFGHIJKLMNOPQRSTUVWXYZ......abcdefghijklmnopqrstuvwxyz.....

................................................0123456789.......ABCDEFGHIJKLMNOPQRSTUVWXYZ......abcdefghijklmnopqrstuvwxyz.....

................................................0123456789.......ABCDEFGHIJKLMNOPQRSTUVWXYZ...._.abcdefghijklmnopqrstuvwxyz.....

.................................!Y#$%&'()*+,-Y/..........:;<=>?@..........................[\]^_`..........................{|}~.

.................................!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~.

................................S!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~.

.YYYYYYYYY.YY.YYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y

YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y

.YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#$%&'()*+,-Y/0123456789:;<=>?@ABCDEFGHIJKLMNOPQRSTUVWXYZ[\]^_`abcdefghijklmnopqrstuvwxyz{|}~Y

.YYYYYYYYYYYYYYYYYYYYYYYYYYYYYYYS!Y#.%&'....,../0123456789.;<=>.@ABCDEFGHIJKLMNOPQRSTUVWXYZ...._`abcdefghijklmnopqrstuvwxyz{.}~.

Y...................................$...()*+.-Y...........:....?...........................[\]^.............................|...

 

||= and /regexes/

It might seem a first glance that ||= expressions could be replaced by a regular expressions, and in some cases they can, but they are very different.

A /regex/ is a single token that matches some input during lexical scanning of your input.

||= makes multiple tokens, all of which are matched and returned separately by the lexer.

 

For example this regex returns a single token number for digit, for which the argv[] could be any character '0'-'9':

digit ::= /[0-9]/

And this one returns a separate token for each digit 0, 1, 2, ... , 9 and the argv[] would be '0', '1', '2', ... , '9'.

digit ||= 0123456789

tacc grammar

This is the actual grammar used to build tacc's BNF parser:

//

//    tacc bnf grammar

//

 

goal ::= grammar

 

grammar    ::= statement | grammar statement

 

statement  ::= production  EOL        |

               production  eol code   |

               include_tag eol code   |

                           eol

               

production ::= LHEX prodop     RHEX   |

               LHEX prodop eol RHEX

 

LHEX       ::= LHS | NULL

 

prodop     ::= \::= | \= | \| | \||=

 

RHEX       ::=             RHS   |

               RHEX \|     RHS   |

               RHEX \| eol RHS

 

RHS        ::= regex_token | RHS regex_token

LHS        ::= plain_token

 

regex_token   ::= token | regex

plain_token   ::= token

 

token         ::= word | quoted_text | escaped_word

include_tag   ::= word

 

EOL           ::= eol | EOS

 

code          ::= /{([^[:eol:]]|[[:eol:]][^}])*[[:eol:]]}/

 

regex         ::= /\/([^\\[:eol:]\/]|\\.)*\//

escaped_word  ::= /\\[^[:space:]]+/

quoted_text   ::= /"([^\\[:eol:]\"]|\\.)*"|'([^\\[:eol:]\']|\\.)*'/

word          ::= /[^[:space:]]+/

 

eol           ::= /[[:space:]]*(\/\/[^[:eol:]]*)?[[:eol:]]+[[:blank:]]*([[:space:]]*(\/\/[^[:eol:]]*)?[[:eol:]]+[[:blank:]]*)*/

tacc regex grammar

And this is the grammar used to build the regular expression parser:

goal ::= regex

 

regex ::= alternation

 

alternation ::= concatenation | alternation \| concatenation

 

concatenation ::= quantified | concatenation quantified

 

quantified ::= meta | meta * | meta + | meta ?

 

meta ::= character | ^ | $ | . | list | ( regex )

 

list ::= matching_list | non_matching_list

 

matching_list     ::= [   list_spec ]

non_matching_list ::= [ ^ list_spec ]

 

list_spec ::= list_character | list_spec list_character

 

list_character ::= range | class | character

 

range ::= character - character

 

class ::= [ : class_name : ]

 

class_name ::= literal_string

 

literal_string ::= identifier_character | literal_string identifier_character

 

character ::= escape_sequence | hex_character | literal_character

 

escape_sequence   ::= \\ escaped_character

hex_character     ::= \\ x hex_digit hex_digit

literal_character ::= identifier_character | ordinary_character

 

escaped_character ::= reserved_character | slash_character | ordinary_character

 

all_characters

{

                     ||= ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz _ 0123456789 !@#$%^&+-*.,:;?=~ () [] {} <> \/ \\ \| \" ' ` " " \t \r \n

}      

identifier_character ||= ABCDEFGHIJKLMNOPQRSTUVWXYZ abcdefghijklmnopqrstuvwxyz _ 0123456789

ordinary_character   ||=                                                                    !@# % &    , ; =~       {} <> \/       \" ' ` " "

reserved_character   ||=                                                                       $ ^ +-* . : ?   () []         \\ \|

slash_character      ||= trn

 

hex_digit ||= 0123456789 ABCDEF abcdef

 

tacc -options

tacc compile -options you will use

tacc with no options

Simply prints a short usage:

> tacc

 

  tacc compiles an extended BNF grammar to (LA)LR(1) parser in C++ or Java.

 

  usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]

              [{-m template_file | -M internal_template_name}]

              [-GWTV-?]

 

  tacc -I can interpret the generated output C++ file.

 

  usage: tacc -I interpret_source_file.cpp { -f input_file | input_string... }

              [-xrtasnlpvewb]

              [-GWTV-?]

 

    Use tacc -?  for more help on BNF compiler mode

    Use tacc -I? for more help on C++ interpreter mode

-? Help

Simply prints all the -options:

> tacc -?

 

    usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]

              [{-m template_file | -M internal_template_name}]

              [-GWTV-?]

 

  tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source

  code in C++ or Java with a built in lexical scanner.  You can compile the

  resultant source code immediately with a C++ or Java compiler or you can use

  tacc -I to interpret the source file.

 

  See tacc -I? for more help on interpreter mode.

 

  -options:

 

     -g print goal token

     -t print tokens

     -p print productions

     -x print transitions

     -s print states

     -i print input tokens (more useful with -s)

     -v verbose, equivalent to -gtpxsi

     -l construct an LALR(1) parser instead of an LR(1) parser

     -c ensure token ids for single character token names = ascii of the character

     -d list available internal templates

     -k generate a context aware lexer

     -y if a state's actions are all the same reduction, do it on any token

 

     -G debug

     -W wait for enter key before exiting

     -T print time taken

     -D once enables descape of words in [ ].  Use -DD for more help.

        [NAME], [xHX], [oOCT], [aDEC], [NNN]

     -RR help on regexes and regex classes.

     -V Version

     -- Stop -option parsing

     -? This help message

 

  Examples:

 

    tacc regex.bnf -o regex.cpp                -> compile regex.bnf to regex.cpp

    tacc -I regex.cpp -pp "a*b+c?(this|that)"  -> run the interpreter on regex.ccp with input string

    tacc regex.bnf -o regex.java               -> compile regex.bnf to regex.java

    tacc -l regex.bnf -o regex.cpp             -> compile regex.bnf as LALR (better)

    tacc -d                                    -> list available templates

    tacc -M default -o template.cpp            -> output the default template to template.cpp

 

-l construct an LALR(1) parser instead of an LR(1) parser

It is fairly common for grammars to generate a lot of internal states that are very similar.  The -l option constructs the LR(1) parser as normal and then tries to merge similar states to what is known as an LALR(1) parser.  The LALR(1) parser is generally smaller than the LR(1) parser with almost no drawbacks, so it is generally a good idea to use -l.  The drawbacks are that it takes tacc more time to build your parser and, if you give the generated parser incorrect input, that it may internally loop a few extra times before working out that the input is incorrect.  The LALR(1) generated parser is still correct in every way though.

(Unlike some other parser generators, tacc will not generate any sort of mysterious conflicts when generating LALR(1) automatons, so you can use  -l with confidence.)

I like to think of the -l as the "optimised" compile of your grammar.

Here it is being used to compile the regex grammar and reduces the number of states in the automaton from 686 to 176, which is pretty significant:

> tacc -l regex.bnf -o regex.cpp

 

tacc: reading grammar file: regex.bnf

tacc: checking productions

tacc: building automation

tacc: LALR merging states

tacc: building lexer

 

Parser States:    686 - 510 merged = 176

Lexical States:   2

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1)

 

tacc: using internal template: default (vogue)

tacc: output file is: regex.cpp

-t -g print tokens

After you generate a parser with tacc, you can add semantic actions to your code (see Introduction).  The only tokens that are actually delivered to your semantic code are terminal tokens.  That is, tokens that appear on the right hand side (RHS) of the ::=, but not on the left hand side (LHS).  Usually it is pretty obvious which tokens are terminal or not, but should you be wanting a list of tokens and how they are categorised, then use -t.

-g simply prints the goal token, which is the first token encountered in the BNF file.

> cat identifiers.bnf

 

GOAL ::= ids

 

ids ::= id | ids id

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

 

> tacc -tg identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Goal

 

  Goal is GOAL

 

Tokens

 

   0.  virtual  not_null  terminal  -           plain    UNKNOWN

   1.  virtual  not_null  terminal  -           plain    ERROR

   2.  virtual  not_null  terminal  -           plain    EOS

   3.  virtual  not_null  terminal  -           plain    ANY

   4.  virtual  nullable  terminal  -           plain    NULL

   5.  virtual  not_null  terminal  -           plain    START

   6.  goal     not_null  product   pure        plain    GOAL

   7.  real     not_null  product   pure        plain    ids

   8.  real     not_null  product   prime       plain    id

   9.  real     not_null  terminal  input       regex    /id/  /[A-Za-z_][A-Za-z_0-9]*/

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

As you can see, tacc reserves some virtual tokens that are used by the parser.  It is not necessary to understand what they are or what they do, but it is instructive to know what they are for:

token

Description

UNKNOWN

in a list of tokens, the UNKNOWN token (usually 0) is placed on the end for human readability, it is not used by tacc specifically

ERROR

is the token returned by the lexer when none of your terminal tokens matched the input string

EOS

end of string is the last token of any input, it is supplied and handled automatically in tacc

ANY

a special token used with -y that means if all the reductions in the state are the same, do it on any token.  See option -y.

NULL

the token that means it could be nothing, as in no input at all.  eg prefix ::= + | - | NULL

START

simply marks the place in the token table where real (as opposed to virtual) tokens start, it is not used by tacc specifically

tacc compile -options you might use

-p print productions

Printing productions really just lists your .bnf file with tacc's categorisation.

> tacc -p identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Productions

 

  1. nulls:  0  reachable    terminating      GOAL ::= ids

  2. nulls:  0  reachable    terminating      ids ::= id

  3. nulls:  0  reachable    terminating      ids ::= ids id

  4. nulls:  0  reachable    terminating      id ::= /[A-Za-z_][A-Za-z_0-9]*/

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

-x print transitions

Printing transitions displays the full pushdown automaton and shows exactly what the parser would do when it parses your input string.

> tacc -x identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Transitions

 

state 1 on ids  --->  2

state 1 on id  --->  3

state 1 on /id/  --->  4

 

state 2 on EOS  _o->  GOAL ::= ids

state 2 on id  --->  5

state 2 on /id/  --->  4

 

state 3 on EOS /id/  _o->  ids ::= id

 

state 4 on EOS /id/  _o->  id ::= /[A-Za-z_][A-Za-z_0-9]*/

 

state 5 on EOS /id/  _o->  ids ::= ids id

 

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

When tacc is compiling your grammar, it makes a states which indicate where the parser is currently and transitions that indicate where it might go, given particular a particular token.  Understanding transitions requires some knowledge of pushdown automata, so take a look at the section on Parsing Basics.  Also see tacc -I -x for a walkthrough of what transitions look like when the parser is executing.

-m -M -d Using Templates

If you don't like the default template that tacc uses, you can use your own, or modify the one tacc uses.

You can list the templates built in to tacc and export them for perusal like this:

> tacc -d

 

Name      Ext   Default  Description

--------  ----  -------  ----------------------------------------------------------------

classic   cpp   No       Older C++ template with no real redeeming features.

sienna    cpp   No       C++ template generated around tacc 4.03.  Still good.

vogue     cpp   Yes      C++ template generated from the tacc source code.

jive      java  No       Simple Java template for tiny grammars.

juke      java  Yes      Java template with parser tables in *.joos.dat files.

cup       cup   Yes      Convert to CUP parser format.  Not perfect.

default   *     *        Matches a default template above by output file extension.

 

> tacc -M default -o template.cpp

Now you can edit template.cpp and use it with tacc -m

> tacc -m template.cpp identifiers.bnf -o identifiers.cpp

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

 

tacc: using template file: template.cpp

tacc: output file is: identifiers.cpp

tacc compile -options you won't use

-s print states

Printing states shows the internals of the compile process.  It is unlikely that you will want this, but it can be instructive if you want to learn how to build a pushdown state automaton from a BNF grammar.  It can sometimes be useful when debugging grammars, but most of the time it outputs just too much stuff.

> tacc -s identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

States

 

 ---- START ---> state 1

|      1. GOAL ::= . ids  --->  2  EOS

|      3. ids ::= . ids id  --->  2  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      2. ids ::= . id  --->  3  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- ids ---> state 2

|      1. GOAL ::= ids .  _o->  -1  EOS

|      3. ids ::= ids . id  --->  5  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- id ---> state 3

|      2. ids ::= id .  _o->  -2  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- /id/ ---> state 4

|      4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ .  _o->  -4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- id ---> state 5

|      3. ids ::= ids id .  _o->  -3  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

Creating an automaton by hand is pretty tedious, but if you want to learn how, see the section on Parsing Basics.

There is no need whatsoever to understand how tacc builds an automaton to use tacc.

-c character tokens = character ascii

If you want to get hardcore with your parsing, you might want to consider your input stream one character at a time.  So instead of:

keyword ::= while

You would write:

keyword ::= w h i l e

If you are doing it one character at a time, you can greatly simplify the lexical scanner from the standard regex matcher to one that simply reads one character.  And to make things even simpler, why not make the token number equal to the ascii of the character as well?  This is what -c does.  It makes all of the token numbers for single characters equal to the ascii of the character.

Even though it makes the lexer lightning fast, it puts more work on the parser (and you writing the BNF too).

tacc didn't always have a fancy regex lexer.  tacc's regex lexer was built with tacc -c.

-y if a state's actions are all the same reduction, do it on any token

Most of the time, if your state has only one reduction, it will do that reduction on every token.  -y reduces all those to one reduction on the special single any token.

The upside of -y is that it slightly speeds up finding what action to do when you are in state S with token T and makes the parser slightly smaller.

The downside of -y is that it is a relaxation of your grammar, meaning it could accept input that it should have rejected.  This would be pretty rare, most of the time it would do a reduce based on any that it should not have done and then whatever the real token is coming up next would likely not be able to be shifted and it would stop anyway.  Even if it did continue at that stage, you can deal with it in your semantics.  -y is not for beginners.

The -y process in tacc is done during the output of the .cpp file, so you won't see it with -s, but if you go through the C++ output tables (not recommended), effectively it changes this:

> tacc -s identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

States

 

 ---- START ---> state 1

|      1. GOAL ::= . ids  --->  2  EOS

|      3. ids ::= . ids id  --->  2  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      2. ids ::= . id  --->  3  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- ids ---> state 2

|      1. GOAL ::= ids .  _o->  -1  EOS

|      3. ids ::= ids . id  --->  5  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- id ---> state 3

|      2. ids ::= id .  _o->  -2  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- /id/ ---> state 4

|      4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ .  _o->  -4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- id ---> state 5

|      3. ids ::= ids id .  _o->  -3  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

To this (NOTE! FAKED OUTPUT!):

> tacc -sy identifiers.bnf

 

tacc: reading grammar file: identifiers.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

States

 

 ---- START ---> state 1

|      1. GOAL ::= . ids  --->  2  EOS

|      3. ids ::= . ids id  --->  2  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      2. ids ::= . id  --->  3  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- ids ---> state 2

|      1. GOAL ::= ids .  _o->  -1  EOS

|      3. ids ::= ids . id  --->  5  EOS /[A-Za-z_][A-Za-z_0-9]*/

|      4. id ::= . /[A-Za-z_][A-Za-z_0-9]*/  --->  4  EOS /[A-Za-z_][A-Za-z_0-9]*/

 --------------------

 

 ---- id ---> state 3

|      2. ids ::= id .  _o->  -2  ANY    Note!  faked output!

 --------------------

 

 ---- /[A-Za-z_][A-Za-z_0-9]*/ ---> state 4

|      4. id ::= /[A-Za-z_][A-Za-z_0-9]*/ .  _o->  -4  ANY    Note!  faked output!

 --------------------

 

 ---- id ---> state 5

|      3. ids ::= ids id .  _o->  -3  ANY    Note!  faked output!

 --------------------

 

Parser States:    5

Lexical States:   3

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

tacc -I interpreter and -options

tacc -I interpreter

tacc can interpret its output C++ or Java file with the same results as if you had compiled and executed it.

tacc is not a general C++ or Java interpreter, it just knows the structure of its own output.  So this means it can't execute code put in { ... } in your bnf.

The tacc interpreter is very useful when developing grammars because you can immediately do a test parse without compiling the output.cpp file.  In fact you can play with grammars and LR parsing without a C++ compiler at all.

The built in lexer means there is no need for any extra tools to build a lexical scanner.  The default lexer works out of the box and is flexible enough not to require any modifications to work with your input.

The parser engine and options for the interpreter are identical to the compiled version since they are built from same source.

It is difficult to explain just how good this is compared to other parsing tools if you haven't done this sort of thing before.

Using the tacc -I interpreter for C++

Just use tacc -I with the file.cpp file you get from tacc file.bnf:

> tacc -I identifiers.cpp "this and that"

 

tacc: string accepted

 

> tacc -I identifiers.cpp "this then 9fail"

 

tacc: Error: Unrecognised input "9fail"

 

Line 1. this then 9fail

                  ^^^^^

Nice, but not very informative on what is going on.  But there are options.

tacc -I interpreter -options you will use

-I? Help

-I? Simply prints all the -options:

 

  > tacc -I?

 

  usage: tacc -I interpret_source_file.cpp { -f input_file | input_string... }

              [-xrtasnlpvewb]

              [-GWTV-?]

 

  tacc -I interprets a source code file created from a tacc grammar.bnf

 

  tacc -I is NOT a general C++ or Java interpreter, it just knows the structure

  of tacc generated source code and reads the tables from the cpp files it created.

  The parser engine and options for the interpreter are identical to the compiled

  version since they are built from same source.

 

  -options:

 

     -f infile

     -x print transitions

     -r print reductions

     -t print tokens as they are read (-ttt... -K = max t's)

     -a print terminal args with reductions

     -s print top of parse tree stack with reductions (needs -p)

     -n only print reductions that have terminal args

     -l print lookahead with reductions

     -p print parse tree, -pp print fancier parse tree

     -v verbose, equivalent to -alrpptx

     -e suppress expecting... output

     -k don't use context aware lexing

     -w don't ignore end of lines (\n, \f)  (\r is always ignored)

     -ww don't ignore any whitespace (\n \f \v \t ' ')   (\r is always ignored)

     -b print the BNF source for this program

     -B print the input string or filename

 

     -G debug

     -W wait for enter key before exiting

     -T print time taken

     -D once enables descape of words in [ ].  Use -DD for more help.

        [NAME], [xHX], [oOCT], [aDEC], [NNN]

     -RR help on regexes and regex classes.

     -V Version

     -- Stop -option parsing

     -? This help message

 

  Examples:

 

    tacc regex.bnf -o regex.cpp                -> compile regex.bnf to regex.cpp

    tacc -I regex.cpp -pp "a*b+c?(this|that)"  -> run the interpreter on regex.ccp with input string

-I -t print tokens as they are read

-t tells the lexer to print tokens when they are read.  It is good to see that your terminal tokens are being recognised:

> tacc -I identifiers.cpp -t "this and that"

 

lex: next token is {9 /id/ = "this"}

lex: next token is {9 /id/ = "and"}

lex: next token is {9 /id/ = "that"}

lex: next token is {2 EOS = ""}

 

tacc: string accepted

The output format is { token_number  terminal_token_plain_or_regex = "input that matched" }

The more -t's you give, the more info it reports on each token, up until an insane 12 t's (or use -K which is short for the max number of -t's) when it prints a ridiculous amount of debug as it steps through the regular expression matcher internals (not recommended).

Here is a more sensible -tt which prints neat little markers showing where it found the token on the input line (read so far).

> tacc -I identifiers.cpp -tt "this and that"

 

lex: next token is {9 /id/ = "this"}

       1: this

          ^^^^

 

lex: next token is {9 /id/ = "and"}

       1: this and

               ^^^

 

lex: next token is {9 /id/ = "that"}

       1: this and that

                   ^^^^

 

lex: next token is {2 EOS = ""}

       1: this and that

                       ^

 

 

tacc: string accepted

-I -r -a -l -n print reductions

-r tells the parser to print reductions as they are recognised:

> tacc -I identifiers.cpp -r "this and that"

 

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   2: ids ::= id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   3: ids ::= ids id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   3: ids ::= ids id

   1: GOAL ::= ids

 

tacc: string accepted

-ra tells the parser to print reductions and the arguments that would be given to your semantic function.  So when you were writing some semantic code to go with this grammar, you can see which reduction is being executed and what arguments you would get when they are:

> tacc -I identifiers.cpp -ra "this and that"

 

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "this"

   3: ids ::= id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "and"

   2: ids ::= ids id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "that"

   2: ids ::= ids id

   1: GOAL ::= ids

 

tacc: string accepted

The more -a's you give, the more info it reports on each argument:

> tacc -I identifiers.cpp -raa "this and that"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "this", argt[0]->token = 9}

   3: ids ::= id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "and", argt[0]->token = 9}

   2: ids ::= ids id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "that", argt[0]->token = 9}

   2: ids ::= ids id

   1: GOAL ::= ids

 

tacc: string accepted

 

> tacc -I identifiers.cpp -raaa "this and that"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "this", argt[0]->token = 9, regex_tokens[9] = /id/}

   3: ids ::= id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "and", argt[0]->token = 9, regex_tokens[9] = /id/}

   2: ids ::= ids id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: {argv[0] = "that", argt[0]->token = 9, regex_tokens[9] = /id/}

   2: ids ::= ids id

   1: GOAL ::= ids

 

tacc: string accepted

 

> tacc -I identifiers.cpp -raaaa "this and that"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals:

      [0] = {

          argv[0] = "this"

          argt[0] = {

              ->token = 9

                  string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/

                   regex_tokens[9] = /id/

              ->line_start = 1, ->start = 0, ->line_end = 1, ->end = 3

          }

      }

   3: ids ::= id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals:

      [0] = {

          argv[0] = "and"

          argt[0] = {

              ->token = 9

                  string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/

                   regex_tokens[9] = /id/

              ->line_start = 1, ->start = 5, ->line_end = 1, ->end = 7

          }

      }

   2: ids ::= ids id

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals:

      [0] = {

          argv[0] = "that"

          argt[0] = {

              ->token = 9

                  string_tokens[9] = /[A-Za-z_][A-Za-z_0-9]*/

                   regex_tokens[9] = /id/

              ->line_start = 1, ->start = 9, ->line_end = 1, ->end = 12

          }

      }

   2: ids ::= ids id

   1: GOAL ::= ids

 

tacc: string accepted

-rl tells the parser to print reductions and the current lookahead token.  The lookahead token is half the reason "why" this reduction was done (the other half is because we have already read the RHS of the reduction)  It is not all that useful:

> tacc -I identifiers.cpp -rl "this and that"

 

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and

   2: ids ::= id . and

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that

   3: ids ::= ids id . that

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS

   3: ids ::= ids id . EOS

   1: GOAL ::= ids . EOS

 

tacc: string accepted

-ran tells the parser to only print reductions that have terminal arguments.  Almost always used with -a so you can see the terminal args as well.  This is useful if you have a grammar with a deep reduction parse tree but with few that actually handle terminal tokens.  This is seen in expression calculators that have lots of layers of precedence rules.

> tacc -I identifiers.cpp -ran "this and that"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "this"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "and"

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

      terminals: argv[0] = "that"

 

tacc: string accepted

-I -p -pp print (pretty) parse tree

-p tells the parser to print a parse "tree".  -pp prints the same thing, but on multiple lines, so it looks a bit more like a tree.  -p does some guesswork on how you want the parse tree to look and doesn't always get it right, so it is always a parse tree, but sometimes it is not pretty.

> tacc -I identifiers.cpp -pp "this and that"

Parse Tree:

[thisandthat]

 

[thisandthat]

 

 

tacc: string accepted

It looks a little better with a calculator type parser.  Here is a calculator that has been compiled and has semantics to actually do the calculation.  Interpreted or compiled, each way has the same -options:

> calc -pp "(8 + 2 * 3) / (1 + 9)"

Parse Tree:

[[[8+[2*3]]()]/[[1+9]()]]

 

[             /         ]

 [         ()] [     ()]

  [8+     ]     [1+9]

     [2*3]

 

1.4

tacc -I interpreter -options you might use

-I -b print source bnf

-b prints the .bnf source that was used to make the parser.  But not in the original format, so it is not that exciting.  It is useful to remind yourself exactly what (version of your) grammar was used to build this parser.  And it would be really useful if you had lost your .bnf.

C:\Trev\Projects\tacc\Debug>tacc -I identifiers.cpp -b

GOAL ::= ids

ids ::= id

ids ::= ids id

id ::= /[A-Za-z_][A-Za-z_0-9]*/

-I -x print transitions

-x prints transitions between states as they occur during parsing.  Are you ready for this?

> tacc -I identifiers.cpp -txal "this and that"

 

lex: next token is {9 /id/ = "this"}

state     1 on    9 /id/ ---> state 4

lex: next token is {9 /id/ = "and"}

state     4 on    9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and

      terminals: argv[0] = "this"

state     1 on    8 id ---> state 3

state     3 on    9 /id/ _o-> state 1 reduce 3: ids ::= id . and

state     1 on    7 ids ---> state 2

state     2 on    9 /id/ ---> state 4

lex: next token is {9 /id/ = "that"}

state     4 on    9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that

      terminals: argv[0] = "and"

state     2 on    8 id ---> state 5

state     5 on    9 /id/ _o-> state 1 reduce 2: ids ::= ids id . that

state     1 on    7 ids ---> state 2

state     2 on    9 /id/ ---> state 4

lex: next token is {2 EOS = ""}

state     4 on    2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS

      terminals: argv[0] = "that"

state     2 on    8 id ---> state 5

state     5 on    2 EOS _o-> state 1 reduce 2: ids ::= ids id . EOS

state     1 on    7 ids ---> state 2

state     2 on    2 EOS _o-> state 1 reduce 1: GOAL ::= ids . EOS

 

tacc: string accepted

States are built during the compile of the grammar.

Transitions come in two different actions and are printed like this:

in_state S on some_token   shift_action--->  to_state_N

in_state S on some_token  reduce_action_o->  to_state_N reduce some_production

A shift action accepts a token and moves the parser to another state.

A reduce action recognises a production in the grammar and consumes shifted tokens on the right hand side (RHS) of the production, replacing them with the token on the left hand side (LHS) of the production.  It also returns to a prior state, but now with the LHS as the next token.

So this grammar compiles to 5 states. To recognise the string input, we boot the parser by pushing state 1 on the state stack.  We also call this in_state 1, but it really means the top of the state stack is 1.

The input string is "this and that", the parser will need some input, so it calls the lexer:

lex: next token is {9 /id/ = "this"}

The lexer matched the string "this" to the regex terminal token number 9, which is a regular expression /[A-Za-z_][A-Za-z_0-9]*/ or /id/ for short, since id ::= /[A-Za-z_][A-Za-z_0-9]*/.  tacc uses a short version of regexes if the production is of the form something ::= /one_regex/ to improve readability.

As tokens are read, they are pushed on to the token stack.  The top of the token stack is known as the on_token.  (Some call this the lookahead token, and in this case, it is.  tacc only calls a token the lookahead token when it is a terminal.  Later on you will see the on_token can change, but he lookahead will stay the same.)  We say the on_token is 9 /id/ = "this".

In general parsing goes, if we are in state S, on token K then do action.  It is very deterministic and at its heart it is quite simple.  All the grunt work for parsing is done up front when you compile your grammar.  The execution of the parser is straightforward really.

To parse the string the find the transition that matches in_state, on_token pair.  In this case we are in_state 1, on_token 9 /id/ = "this":

state     1 on    9 /id/ ---> state 4

This transition tells us the action will be a ---> shift action to state 4.  It is called a shift action, because the token is shifted to the left of an imaginary parsing cursor (often represented as a dot . character).  Or the cursor is shifted to the right through the string if you prefer.  Or the whole string is shifted to the left.  All are correct.  The shift action pushes the 4 on to the state stack, so now the in_state is 4.

Because of this shift action, and because the shifted token was a terminal, we will read the next token:

lex: next token is {9 /id/ = "and"}

The lexer matched the string "and" to /id/ and pushes it on to the token stack, "and" becomes the on_token.

Now we are in_state 4, on_token /id/ = "and", so we find the matching transition:

state     4 on    9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and

This transition tells us the action will be a _o-> reduce action with the production, or really now, the reduction id ::= /[A-Za-z_][A-Za-z_0-9]*/  (That _o-> is supposed to look like a curly arrow with a loop in it.  It is the way the author draws a reduction on a state diagram)

The reduction action wants us to take tokens that we have read matching the right hand side (RHS) of the production and change them into the token on the left hand side (LHS).

This corresponds to the tokens below the on_token on the token stack.  In this case it is 1 token, "this".

The parser calls your semantic action for the reduction id ::= /[A-Za-z_][A-Za-z_0-9]*/ with the argc = 1 and argv = "this".

      terminals: argv[0] = "this"

Then argc elements are popped off the token stack and the state stack.  (Most of the time it is argc, but NULLs can make it more difficult to predict, so tacc precalculates the exact number to pop off the state stack for each reduction during compilation)  And the LHS token = id is pushed onto the stack and becomes the on_token.

The top of the state stack, or in_state is now back to 1 and this time the on_token is id.  Find the matching transition:

state     1 on    8 id ---> state 3

And so on...

Another way to look at this is to imagine the parse cursor going through the string, reducing input as it goes.

(Unfortunately tacc can't produce this output since 1. it doesn't read that far ahead, and 2. it doesn't bother to keep non-terminals, they are implied by the state you are in):

. this and that      state     1 on    9 /id/ ---> state 4

this . and that      state     4 on    9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

. id and that        state     1 on    8 id ---> state 3

id . and that        state     3 on    9 /id/ _o-> state 1 reduce 2: ids ::= id

. ids and that       state     1 on    7 ids ---> state 2

ids . and that       state     2 on    9 /id/ ---> state 4

ids and . that       state     4 on    9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

ids . id that        state     2 on    8 id ---> state 5

ids id . that        state     5 on    9 /id/ _o-> state 1 reduce 3: ids ::= ids id

. ids that           state     1 on    7 ids ---> state 2

ids . that           state     2 on    9 /id/ ---> state 4

ids that .           state     4 on    2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/

ids . id             state     2 on    8 id ---> state 5

ids id .             state     5 on    2 EOS _o-> state 1 reduce 3: ids ::= ids id

. ids                state     1 on    7 ids ---> state 2

ids .                state     2 on    2 EOS _o-> state 1 reduce 1: GOAL ::= ids

. GOAL

tacc -I interpreter -options you won't use

-I -s -p print the top of the parse tree stack

-s prints the top of the -p parse tree stack stx.top.  I thought it would be really good, but it turned out not so great.  Use -xrpls for best effect.

> tacc -I identifiers.cpp -xrpls "this and that"

 

state     1 on    9 /id/ ---> state 4

state     4 on    9 /id/ _o-> state 1 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . and

      stx.top: this

state     1 on    8 id ---> state 3

state     3 on    9 /id/ _o-> state 1 reduce 3: ids ::= id . and

   3: ids ::= id . and

      stx.top: this

state     1 on    7 ids ---> state 2

state     2 on    9 /id/ ---> state 4

state     4 on    9 /id/ _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . that

      stx.top: and

state     2 on    8 id ---> state 5

state     5 on    9 /id/ _o-> state 1 reduce 2: ids ::= ids id . that

   2: ids ::= ids id . that

      stx.top: [thisand]

state     1 on    7 ids ---> state 2

state     2 on    9 /id/ ---> state 4

state     4 on    2 EOS _o-> state 2 reduce 4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS

   4: id ::= /[A-Za-z_][A-Za-z_0-9]*/ . EOS

      stx.top: that

state     2 on    8 id ---> state 5

state     5 on    2 EOS _o-> state 1 reduce 2: ids ::= ids id . EOS

   2: ids ::= ids id . EOS

      stx.top: [thisandthat]

state     1 on    7 ids ---> state 2

state     2 on    2 EOS _o-> state 1 reduce 1: GOAL ::= ids . EOS

   1: GOAL ::= ids . EOS

      stx.top: [thisandthat]

Parse Tree:

[thisandthat]

 

tacc: string accepted

-I -w -ww turn off automatic whitespace control

-w controls how tacc controls whitespace.  Dealing with whitespace is a fairly tedious thing in parsing.  Fortunately tacc does this automatically for you.  And most of the time you want this, but if you want to take control, build your grammar with your whitespace tokens and run tacc with:

     -w don't ignore end of lines (\n, \f)  (\r is always ignored)

     -ww don't ignore any whitespace (\n \f \v \t ' ')   (\r is always ignored)

If you want these permanently set, you can call parser p.whitespace_control(int wflag) somewhere.

There are many ways to deal with whitespace.  You can deal with it like it was a separator or treat them like "blank" ids.

Let's modify the identifiers.bnf grammar to handle whitespace as if they could be "blank" ids:

> cat identifiers.whitespace.bnf

 

GOAL ::= words

 

words ::= word | words word

 

word ::= id | ws

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

ws ::= /[[:space:]]+/

Compile the grammar with tacc:

> tacc identifiers.whitespace.bnf -o identifiers.whitespace.cpp

 

tacc: reading grammar file: identifiers.whitespace.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    8

Lexical States:   4

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1) but is LR(0) if you allow shifting of the EOS token

 

tacc: using internal template: default (vogue)

tacc: output file is: identifiers.whitespace.cpp

Run tacc -I with -ww to stop it from automatically handling whitespace:

> tacc -I identifiers.whitespace.cpp -ww -tr "this and that"

lex: next token is {11 /id/ = "this"}

lex: next token is {12 /ws/ = " "}

   6: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: word ::= id

   2: words ::= word

lex: next token is {11 /id/ = "and"}

   7: ws ::= /[[:space:]]+/

   5: word ::= ws

   3: words ::= words word

lex: next token is {12 /ws/ = " "}

   6: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: word ::= id

   3: words ::= words word

lex: next token is {11 /id/ = "that"}

   7: ws ::= /[[:space:]]+/

   5: word ::= ws

   3: words ::= words word

lex: next token is {2 EOS = ""}

   6: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: word ::= id

   3: words ::= words word

   1: GOAL ::= words

 

identifiers.whitespace: string accepted

The trouble with that approach is that you have to deal with a word possibly being blank.  And even though it is pretty easy in this case, it would be nice to make the parser do the work of delivering "pure" ids to you.  This approach to the grammar treats whitespace as an optional separator:

> cat identifiers.whitesep.bnf

 

GOAL ::= expr

 

expr ::= ws_opt ids ws_opt

 

ids ::= id | ids ws_opt id

 

ws_opt ::= ws | NULL

 

id ::= /[A-Za-z_][A-Za-z_0-9]*/

ws ::= /[[:space:]]+/

Compile the grammar with tacc:

> tacc identifiers.whitesep.bnf -o identifiers.whitesep.cpp

 

tacc: reading grammar file: identifiers.whitesep.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Parser States:    12

Lexical States:   4

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(1)

 

tacc: using internal template: default (vogue)

tacc: output file is: identifiers.whitesep.cpp

Run tacc -I with -ww to stop it from automatically handling whitespace.  This time, when an id is recognised and delivered to your semantic function, it won't be blank:

> tacc -I identifiers.whitesep.cpp -ww -tr "this and that"

lex: next token is {12 /id/ = "this"}

   6: ws_opt ::= NULL

lex: next token is {13 /ws/ = " "}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   3: ids ::= id

lex: next token is {12 /id/ = "and"}

   8: ws ::= /[[:space:]]+/

   5: ws_opt ::= ws

lex: next token is {13 /ws/ = " "}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: ids ::= ids ws_opt id

lex: next token is {12 /id/ = "that"}

   8: ws ::= /[[:space:]]+/

   5: ws_opt ::= ws

lex: next token is {2 EOS = ""}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: ids ::= ids ws_opt id

   6: ws_opt ::= NULL

   2: expr ::= ws_opt ids ws_opt

   1: GOAL ::= expr

 

identifiers.whitesep: string accepted

Both these grammars have been written in such a way that they also work with automatic whitespace left on.  There is no need to try to do that, I just did it for fun:

> tacc -I identifiers.whitesep.cpp -tr "this and that"

lex: next token is {12 /id/ = "this"}

   6: ws_opt ::= NULL

lex: next token is {12 /id/ = "and"}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   3: ids ::= id

   6: ws_opt ::= NULL

lex: next token is {12 /id/ = "that"}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: ids ::= ids ws_opt id

   6: ws_opt ::= NULL

lex: next token is {2 EOS = ""}

   7: id ::= /[A-Za-z_][A-Za-z_0-9]*/

   4: ids ::= ids ws_opt id

   6: ws_opt ::= NULL

   2: expr ::= ws_opt ids ws_opt

   1: GOAL ::= expr

 

identifiers.whitesep: string accepted

A topic related to whitespace control is handling the end of string.  The end of string token EOS can be used in your grammar as if it was a real character if necessary.

-I -e suppress expecting... output

-e When tacc is parsing input, if it receives an unexpected token, it prints an error message.  Then it lists all the tokens that it was expecting.  Sometimes this is nice, but a lot of times it is a massive list of 99% of all possible tokens.  -e suppresses the listing of all the expected possible tokens.

For example:

> cat expecting.bnf

 

GOAL ::= count

count ::= one two three

Compile with tacc and run it with some wrong input:

> tacc -I expecting.cpp "one three"

 

tacc: Error: Unexpected input token: three

 

       1: one three

              ^^^^^

 

    found:  three

expecting:  two

Now if you run it with -e, it won't show the expecting bit:

> tacc -I expecting.cpp "one three" -e

 

tacc: Error: Unexpected input token: three

 

       1: one three

              ^^^^^

Note that this is a different error to the one you get if you supply input that isn't recognised as a token:

> tacc -I expecting.cpp "one four" -e

 

tacc: Error: Unrecognised input "four"

 

       1: one four

              ^^^^

tacc experimental -options

tacc -?? Full disclosure of options

tacc -?? Will display normal help plus experimental and undocumented options and usage.  Needless to say the experimental stuff is not guaranteed to do anything useful.  It is mostly for debugging and regression testing.

> tacc -??

 

  usage: tacc [-gtpxsvlcdy] grammar.bnf [-o {parser.cpp|parser.java}]

            [{-m template_file | -M internal_template_name}]

            [-GWTV-?]

 

tacc compiles a simple BNF like grammar and outputs (LA)LR(1) parser source

code in C++ or Java with a built in lexical scanner.  You can compile the

resultant source code immediately with a C++ or Java compiler or you can use

tacc -I to interpret the source file.

 

See tacc -I? for more help on interpreter mode.

 

-options:

 

   -g print goal token

   -t print tokens

   -p print productions

   -x print transitions

   -s print states

   -i print input tokens (more useful with -s)

   -v verbose, equivalent to -gtpxsi

   -l construct an LALR(1) parser instead of an LR(1) parser

   -c ensure token ids for single character token names = ascii of the character

   -d list available internal templates

   -k generate a context aware lexer

   -y if a state's actions are all the same reduction, do it on any token

 

   -G debug

   -W wait for enter key before exiting

   -T print time taken

   -D once enables descape of words in [ ].  Use -DD for more help.

      [NAME], [xHX], [oOCT], [aDEC], [NNN]

   -RR help on regexes and regex classes.

   -V Version

   -- Stop -option parsing

   -? This help message

 

Examples:

 

  tacc regex.bnf -o regex.cpp                -> compile regex.bnf to regex.cpp

  tacc -I regex.cpp -pp "a*b+c?(this|that)"  -> run the interpreter on regex.ccp with input string

  tacc regex.bnf -o regex.java               -> compile regex.bnf to regex.java

  tacc -l regex.bnf -o regex.cpp             -> compile regex.bnf as LALR (better)

  tacc -d                                    -> list available templates

  tacc -M default -o template.cpp            -> output the default template to template.cpp

 

Experimental and Undocumented Usage

 

-options:

 

 

   -Ux print transitions before they are LALR merged with -l

   -Us print states before they are LALR merged with -l

   -Ui print input tokens before they are LALR merged with -l

 

   -Cx enables printing of transitions within the compiler

   -Cp enables printing of the parse tree within the compiler

   -Ct enables lex (token) debugging within the compiler, up to 13 t's

   -CM enables all lex (token) debugging within the compiler

 

   -O switches back to the old BNF parser.  Used for testing the new

      BNF parser is identical.  No real reason to use -O unless you

      have found some spectacular bug with the new BNF parser.

 

   -X enables regex-like extensions to token names.  Needs -O.

 

      Tokens ending in a regex like character * + ? - adds another

      two productions as follows:

 

        LHS ::= RHS* ...  adds  RHS* ::= RHS* RHS | NULL

        LHS ::= RHS+ ...  adds  RHS+ ::= RHS+ RHS | RHS

        LHS ::= RHS- ...  adds  RHS- ::= RHS RHS- | RHS

        LHS ::= RHS? ...  adds  RHS? ::= RHS | NULL

 

      And while this sounds like a nifty idea, it actually can create

      conflicts fairly easily because of the extra intermediate non-terminal

      in the heirachy.  :(

 

   -R switches to regex test mode, for more info on -R, use -R?

 

   -L switches to lex test mode, for more info on -L, use -L?

-O use old BNF parser

-O uses the old bnf bootstrap parser that was hand written before tacc was operational enough to generate its own BNF parser.  It is not as flexible as the new tacc generated one and shouldn't be used.

-OX enable experimental options

-X is best avoided regardless of what it says in the help above.

-R regex test mode

-R actually has some use.  You can test what matches what with regular expressions.

> tacc -R?

 

  usage: tacc -R "regex" ...

            [-m match_string]

            [-GWTV-?]

 

tacc -R tests the regex deterministic finite state (DFA) constructor

that is used in to build the regex lexical scanner.

 

-options:

 

   -G debug

   -W wait for enter key before exiting

   -T print time taken

   -D once enables descape of words in [ ].  Use -DD for more help.

      [NAME], [xHX], [oOCT], [aDEC], [NNN]

   -RR help on regexes and regex classes.

   -V Version

   -- Stop -option parsing

   -? This help message

 

Example:

 

  tacc -R "int" "while" "[0-9]+" "[A-Za-z_][A-Za-z_0-9]*" -m "integrity"

Just looking at the example given in the help.  There are 4 regexs there and the input is "integrity".  He is hoping that integrity will be matched by regex #4 "[A-Za-z_][A-Za-z_0-9]*" and not by regex #1 "int".  Even though the first part of "integrity" is "int", "integrity" is longer and the regex matcher matches the leftmost longest string.

Running the example shows tacc compiling the regexes to multiple non-deterministic finite state automatons (NDFAs) and then merging them into a single deterministic finite state automaton (DFA).  It is similar to what tacc does when building a parser, but the regex automaton doesn't use any stack.  Just being in a particular state is enough to recognise regular expressions.

> tacc -R "int" "while" "[0-9]+" "[A-Za-z_][A-Za-z_0-9]*" -m "integrity"

 

regex_test_main: regex 1 is int

 

NDFA:

Start     5

state     5 on epsilon --> 1

state     1 on i --> 2

state     2 on n --> 3

state     3 on t --> 4 MATCH 1

 

regex_test_main: regex 2 is while

 

NDFA:

Start    12

state    12 on epsilon --> 6

state     6 on w --> 7

state     7 on h --> 8

state     8 on i --> 9

state     9 on l --> 10

state    10 on e --> 11 MATCH 2

 

regex_test_main: regex 3 is [0-9]+

 

NDFA:

Start    16

state    16 on epsilon --> 13

state    13 on 0123456789 --> 14

state    14 on epsilon --> 13

state    14 on epsilon --> 15 MATCH 3

 

regex_test_main: regex 4 is [A-Za-z_][A-Za-z_0-9]*

 

NDFA:

Start    26

state    26 on epsilon --> 17

state    17 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 24

state    24 on epsilon --> 20

state    24 on epsilon --> 25 MATCH 4

state    20 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 24

 

regex_test_main: DFA compile:

 

DFA:

Start     1

State     1 on 0123456789 --> 2 MATCH 3

State     1 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvxyz --> 3 MATCH 4

State     1 on i --> 4 MATCH 4

State     1 on w --> 5 MATCH 4

State     2 on 0123456789 --> 2 MATCH 3

State     3 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

State     4 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmopqrstuvwxyz --> 3 MATCH 4

State     4 on n --> 6 MATCH 4

State     6 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrsuvwxyz --> 3 MATCH 4

State     6 on t --> 7 MATCH 1

State     7 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

State     5 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefgijklmnopqrstuvwxyz --> 3 MATCH 4

State     5 on h --> 8 MATCH 4

State     8 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvwxyz --> 3 MATCH 4

State     8 on i --> 9 MATCH 4

State     9 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz --> 3 MATCH 4

State     9 on l --> 10 MATCH 4

State    10 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdfghijklmnopqrstuvwxyz --> 3 MATCH 4

State    10 on e --> 11 MATCH 2

State    11 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

 

regex_test_main: DFA compress:

 

DFA:

Start     1

State     1 on 0123456789 --> 2 MATCH 3

State     1 on ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvxyz --> 3 MATCH 4

State     1 on i --> 4 MATCH 4

State     1 on w --> 7 MATCH 4

State     2 on 0123456789 --> 2 MATCH 3

State     3 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

State     4 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmopqrstuvwxyz --> 3 MATCH 4

State     4 on n --> 5 MATCH 4

State     5 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrsuvwxyz --> 3 MATCH 4

State     5 on t --> 6 MATCH 1

State     6 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

State     7 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefgijklmnopqrstuvwxyz --> 3 MATCH 4

State     7 on h --> 8 MATCH 4

State     8 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghjklmnopqrstuvwxyz --> 3 MATCH 4

State     8 on i --> 9 MATCH 4

State     9 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijkmnopqrstuvwxyz --> 3 MATCH 4

State     9 on l --> 10 MATCH 4

State    10 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdfghijklmnopqrstuvwxyz --> 3 MATCH 4

State    10 on e --> 11 MATCH 2

State    11 on 0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ_abcdefghijklmnopqrstuvwxyz --> 3 MATCH 4

 

match: s:001 m:000 c = '\x00'

match: s:004 m:004 c = 'i'

match: s:005 m:004 c = 'n'

match: s:006 m:001 c = 't'

match: s:003 m:004 c = 'e'

match: s:003 m:004 c = 'g'

match: s:003 m:004 c = 'r'

match: s:003 m:004 c = 'i'

match: s:003 m:004 c = 't'

match: s:003 m:004 c = 'y'

rex.match("integrity") = regex 4: [A-Za-z_][A-Za-z_0-9]*

As a state diagram it looks like this.  You can follow the matching process above (lines that start match: ).  You can see it started in state1, i, state 4, n, state 5 etc and eventually gets stuck in a loop on state 3 matching the 4th regex which is a variable id.  Double outline circles are accept states.

int-while-variable pushdown automata


-L lex test mode

-L actually has some use.  You can test what the lexical scanner does.

> tacc -L?

 

usage: tacc -L [-ctTwslnN] {"plain_string_token" | "/regular_expression_token/"} ...

            {[-e] [-E escape_character] -i match_string | -f input_filename.ext}

            [-x expected_output]

            [-GWTV-?]

 

tacc -L tests the regex lexical scanner.

 

-options:

 

   -c ensure token ids for single character token names = ascii of the character

   -t print tokens as they are read (more t's for lexical scanner debug info)

   -k don't use context aware lexing

   -K lexical scanner debug everything, equivalent to -ttttt...

   -w don't ignore end of lines (\n, \f)  (\r is always ignored)

   -ww don't ignore any whitespace (\n \f \v \t ' ')   (\r is always ignored)

   -s short output format (for regression tests, same format at -x expects)

   -l dump regex lex tables

   -n do not execute the parser (used to show the debug tables when parsing might go nuts)

   -e -i input_string contains escape sequences (like \n) that need to be descaped

   -E escape_character, like -e above, but escape sequences start with escape_character

   -N suppress warning messages

 

   -G debug

   -W wait for enter key before exiting

   -T print time taken

   -D once enables descape of words in [ ].  Use -DD for more help.

      [NAME], [xHX], [oOCT], [aDEC], [NNN]

   -RR help on regexes and regex classes.

   -V Version

   -- Stop -option parsing

   -? This help message

 

Examples:

 

  tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" "/[[:space:]]+/" -iw "int interest = 40"

  tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40"

  tacc -Ls "/[a-z]+/" "=" "/[0-9]+/" -i "that = 86" -x "1:that, 2:=, 3:86, EOS"

Here is tacc doing the second example in the help:

> tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40"

 

lex_test_main: assign token 1 = "int"

lex_test_main: assign token 2 = "while"

lex_test_main: assign token 3 = "="

lex_test_main: assign token 4 = "/[0-9]+/"

lex_test_main: assign token 5 = "/[a-z]+/"

 

Tokens

 

   1.  real     not_null  terminal  input       plain    int

   2.  real     not_null  terminal  input       plain    while

   3.  real     not_null  terminal  input       plain    =

   4.  real     not_null  terminal  input       regex    /[0-9]+/

   5.  real     not_null  terminal  input       regex    /[a-z]+/

   6.  virtual  not_null  terminal  -           plain    UNKNOWN

   7.  virtual  not_null  terminal  -           plain    ERROR

   8.  virtual  not_null  terminal  -           plain    EOS

   9.  virtual  not_null  terminal  -           plain    ANY

  10.  virtual  nullable  terminal  -           plain    NULL

  11.  virtual  not_null  terminal  -           plain    START

 

lex_test_main: lexing string: int interest = 40

 

{1 int = "int"}

{5 /[a-z]+/ = "interest"}

{3 = = "="}

{4 /[0-9]+/ = "40"}

{8 EOS = ""}

-L -s lex test mode, short output

-Ls prints out a short one line lex of the input.  This is mainly for regression testing:

> tacc -L "int" "while" "=" "/[0-9]+/" "/[a-z]+/" -i "int interest = 40" -s

1:int, 5:interest, 3:=, 4:40, EOS

-L -others debug

Other options will have you drowning in debug, go for it.

 

How to LR(1) parse (by hand)

LR 101

First you generate the list of predicate follow states based on the possibili...  oh noes not that explanation again!  Can't you make it simpler than that?!

grammar.bnf

Let's go slow, starting with a very simple grammar, with 2 productions:  Productions have the form LHS ::= RHS.  We say Left Hand Side (LHS) becomes (::=) Right Hand Side (RHS).  The LHS is always one token, the RHS can be many tokens.

This grammar accepts the input x y z.  An x followed by y, followed by z.  Once.  Just x y z.

GOAL ::= expr

expr ::= x y z

State Pushdown Automaton

We are building a state pushdown automaton from a bnf grammar.  Sounds fancy.

State 1

Start with state 1.

State 1

Find the GOAL

Find the token that is the GOAL token.  In tacc, the GOAL token is the first token in the bnf file.  Which is the LHS of the first production.  Here the GOAL token is GOAL.

Write the Goal production out:

State 1

GOAL ::= expr

Add the parse cursor .

Put a dot . after the ::=, before the RHS of the production.  This is the parse cursor:

State 1

GOAL ::= . expr

Add the follow set

The goal production will always be followed by the end of string EOS.  Put that after the production.  This is called the follow set:

State 1

GOAL ::= . expr             EOS

Expand non-terminals

For every token to the right of the parse cursor dot . (in this case expr) find a production with a matching left hand side and add it to the state.  Don't worry if there aren't any:

State 1

GOAL ::= . expr             EOS

expr ::= x y z

It too will have a parse cursor dot . placed  just after the ::=

State 1

GOAL ::= . expr             EOS

expr ::= . x y z

And it will inherit the follow set of the production it was expanded from:

State 1

GOAL ::= . expr             EOS

expr ::= . x y z            EOS

Make (shift) transitions (and close the state)

Now x has a dot in front of it, so we look for a production with an x as the LHS.  And there aren't any.  So we say that x is a terminal token.  Terminal tokens do not expand and really do occur in our input stream.  expr and GOAL appear on the LHS of the productions.  We call them non-terminal tokens because they expand to other tokens.  Non-terminal tokens do not occur in our input stream.

Now for each production in state 1, we create another state.  Taking the first production in state 1:

GOAL ::= . expr             EOS

We see that the right of the parse cursor . has a token.  We say, "In state 1, on an expr, go to state 2"  Where 2 is the new state:

State 1

GOAL ::= . expr  ---> 2     EOS

expr ::= . x y z            EOS

And we make state 2:

State 1

GOAL ::= . expr  ---> 2     EOS

expr ::= . x y z            EOS

 

State 2

Then we add all of the productions in state 1 that have . expr in them to the newly created state 2, but shift the parse cursor . one token to the right.  We also take the follow set, but not the go to state number.  In this case only one production needs to be taken:

State 1

GOAL ::= . expr  ---> 2     EOS

expr ::= . x y z            EOS

 

State 2

GOAL ::= expr .             EOS

Now we go back to state 1 and continue with the second production:

expr ::= . x y z            EOS

We see that the right of the parse cursor . has a token.  We say, "In state 1, on an x, go to state 3"  Where 3 is a new state:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .             EOS

And we make state 3:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .             EOS

 

State 3

Then we add all of the productions that have . x in them to the newly created state 3, but shift the parse cursor . one token to the right.  We also take the follow set, but not the go to state number.  In this case only one production needs to be taken:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .             EOS

 

State 3

expr ::= x . y z            EOS

And state 1 is now closed.

State 2 - make reductions

Now we do the same for all the new states, starting with state 2.

State 2 only has one production:

GOAL ::= expr .             EOS

But this time the parse cursor . is at the end of the production.  We say, "In state 2, on an EOS, reduce GOAL ::= expr"  The reduce is a curly arrow and tacc writes it like _o->

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z            EOS

And state 2 is now closed.

State 3

Moving on to state 3.

State 3 only has one production:

expr ::= x . y z            EOS

We see that the right of the parse cursor . has a token.  We say, "In state 3, on a y, go to state 4"  Where 4 is a new state:

And we make state 4:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z  ---> 4    EOS

 

State 4

Then we add all of the productions that have . y in them to the newly created state 4, but shift the parse cursor . one token to the right.  We also take the follow set, but not the go to state number.  In this case only one production needs to be taken:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z  ---> 4    EOS

 

State 4

expr ::= x y . z            EOS

And state 3 is now closed.

State 4

Moving on to state 4.

State 4 only has one production:

expr ::= x y . z            EOS

We see that the right of the parse cursor . has a token.  We say, "In state 4, on a z, go to state 5"  Where 5 is a new state:

And we make state 5:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z  ---> 4    EOS

 

State 4

expr ::= x y . z  ---> 5    EOS

 

State 5

Then we add all of the productions that have . z in them to the newly created state 5, but shift the parse cursor . one token to the right.  We also take the follow set, but not the go to state number.  In this case only one production needs to be taken:

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z  ---> 4    EOS

 

State 4

expr ::= x y . z  ---> 5    EOS

 

State 5

expr ::= x y z .            EOS

And state 4 is closed.

State 5

Moving on to state 5.

State 5 only has one production:

expr ::= x y z .            EOS

This time the parse cursor . is at the end of the production.  We say, "In state 5, on an EOS, reduce expr ::= x y z"  The reduce is a curly arrow and tacc writes it like _o->

State 1

GOAL ::= . expr   ---> 2    EOS

expr ::= . x y z  ---> 3    EOS

 

State 2

GOAL ::= expr .   _o->      EOS

 

State 3

expr ::= x . y z  ---> 4    EOS

 

State 4

expr ::= x y . z  ---> 5    EOS

 

State 5

expr ::= x y z .  _o->      EOS

And state 5 is closed.

And we are done.  The state automaton has 5 states.

Of course this is a simple example and doesn't show all the nuances of state creation, but it is the core of how it is done.  We will walk through a more complicated example later on.

Pushdown Automaton

Once the automaton is built, we can discard some of the information in the states.  We only need to know 1. the state we are in, 2. the token we have and 3. the transition action - either the state to go to or the reduction to do.

When a transition action is a go to another state, it is called a shift transition.

When a transition action is a reduce, it is called a reduction transition.

We can extract the vital information from the state pushdown automaton to give us a pushdown automaton:

State 1 on expr ---> 2

State 1 on x    ---> 3

State 2 on EOS  _o-> GOAL ::= expr

State 3 on y    ---> 4

State 4 on z    ---> 5

State 5 on EOS  _o-> expr ::= x y z

Graphically it looks like this:

expr ::= x y z pushdown automata

Parsing input x y z

If you were following the diagram and trying to parse the input x y z.

Start in state 1, add the parse cursor . before the input, then the input x y z, then the end of string token EOS:

state 1:  . x y z  EOS

So in state 1, on an x, go to state 3

state 3:  x . y z  EOS

In state 3, on y, go to state 4

state 4:  x y . z  EOS

In state 4, on z, go to state 5

state 5:  x y z .  EOS

In state 5, on EOS, reduce expr ::= x y z.  OK...  But what state are we in now?  This is where the pushdown in pushdown automaton comes in.

As we shift through the states during the parse, we push the state number on to the state stack as we go.  When we reduce, we pop off the number of states off the state stack equal to the number of tokens on the RHS of the reduction (not the whole story yet, but stick with it for now.  We need to adjust for NULLs as well.)

If we were keeping a state stack, it would be 1, 3, 4, 5.  The numbers of all the states we have visited so far.

There are 3 tokens in this reduction, x y z, so we pop 3 states off the stack: 1,  3, 4, 5  and end up back in state 1.

The reduction also replaces everything on the RHS of the reduction (x y z), with what is on the LHS (expr).

The parse cursor . is placed before the LHS.

And it now looks like this:

state 1:  . expr  EOS

In state 1, on expr, go to state 2

state 2:  expr .  EOS

In state 2, on EOS, reduce GOAL ::= expr.

The state stack is 1, 2.  There is 1 token on the RHS of the reduction, so pop off 1 state from the state stack: 1,  2  and end up in state 1 again.

The reduction also replaces everything on the RHS of the reduction (expr), with what is on the LHS (GOAL).

The parse cursor . is placed before the LHS.

And it now looks like this:

state 1:  . GOAL  EOS

In state 1, on GOAL with no more input, is the condition to accept the string.  (Some would say we need to shift the GOAL and maybe even the EOS, but tacc doesn't bother and saves a few cycles)

We are done.

Doing exactly the same thing with tacc

States

We could have done the states with tacc -s:

> tacc xyz.bnf -s

 

tacc: reading grammar file: xyz.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

States

 

 ---- START ---> state 1

|      1. GOAL ::= . expr  --->  2  EOS

|      2. expr ::= . x y z  --->  3  EOS

 --------------------

 

 ---- expr ---> state 2

|      1. GOAL ::= expr .  _o->  -1  EOS

 --------------------

 

 ---- x ---> state 3

|      2. expr ::= x . y z  --->  4  EOS

 --------------------

 

 ---- y ---> state 4

|      2. expr ::= x y . z  --->  5  EOS

 --------------------

 

 ---- z ---> state 5

|      2. expr ::= x y z .  _o->  -2  EOS

 --------------------

 

 

Parser States:    5

Lexical States:   2

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(0)

You can see that tacc has created a number for each production.  Also the transition action is a positive number equal to the state number for a shift action and a negative number corresponding to the -production number for each reduction action.

Transitions

And the transitions with tacc -x:

C:\Trev\Projects\tacc\Debug>tacc xyz.bnf -x

 

tacc: reading grammar file: xyz.bnf

tacc: checking productions

tacc: building automation

tacc: building lexer

 

Transitions

 

state 1 on expr  --->  2

state 1 on x  --->  3

 

state 2 on EOS  _o->  GOAL ::= expr

 

state 3 on y  --->  4

 

state 4 on z  --->  5

 

state 5 on EOS  _o->  expr ::= x y z

 

 

Parser States:    5

Lexical States:   2

 

Shift-Reduce conflicts:  0

Reduce-Reduce conflicts: 0

Warnings:   0

Errors:     0

 

Automaton is LR(0)

Parsing input

And after compiling with tacc xyz.bnf -o xyz.cpp, we could parse the input x y z with tacc -I:

> tacc -I xyz.cpp "x y z" -x

state     1 on    8 x ---> state 3

state     3 on    9 y ---> state 4

state     4 on   10 z ---> state 5

state     5 on    2 EOS _o-> state 1 reduce 2: expr ::= x y z

state     1 on    7 expr ---> state 2

state     2 on    2 EOS _o-> state 1 reduce 1: GOAL ::= expr

 

xyz: string accepted