These examples are simple, but bisonc++ grammars for real programming languages are written the same way. You can copy the examples from this document into source files to create your own versions of these programs. In addition, the bisonc++ source archive contains the various source files ready for use.
All sources for this calculator are found in the demos/rpn/ directory.
%baseclass-preinclude cmath %token NUM %stype double
The grammar file's first (directive) configures bisonc++ by specifying the
tokens (see section The bisonc++ Declarations Section). Each terminal symbol that
is not a single-character literal must be declared here (Single-character
literals are normally not declared, but are represented by literal character
constants). In the current example, all arithmetic operators are designated by
single-character literals, so the only terminal symbol that needs to be
declared is NUM, the token type for numeric constants. As bisonc++ by default
uses the type int as the semantic value type, but a calculator usually
uses floating point values the directive %stype double is used to indicate
that the calculator uses `double' as the type for its semantic values.
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << std::endl;
}
;
exp:
NUM
|
exp exp '+'
{
$$ = $1 + $2;
}
|
exp exp '-'
{
$$ = $1 - $2;
}
|
exp exp '*'
{
$$ = $1 * $2;
}
|
exp exp '/'
{
$$ = $1 / $2;
}
|
// Exponentiation:
exp exp '^'
{
$$ = pow($1, $2);
}
|
// Unary minus:
exp 'n'
{
$$ = -$1;
}
;
The rules of the rpn `language' defined here are the expression
(using the name exp), the line of input (line), and the complete input
transcript (input). Each of these nonterminal symbols has several
alternate rules, joined by the `|' separator, separating the various
production rules. The various rules are explained next.
The semantics of the language are determined by the actions taken once nonterminals have been recognized. The actions consist of C++ code that appears inside braces. See section 4.6.2.
Actions are specified using C++, but bisonc++ provides the means for passing
semantic values between rules. Refer to section 4.6.2 below for an
extensive coverage of the various possibilities. As a short introduction, the
pseudo-variable $$ can be used in rules to represent the semantic value
for the nonterminal that the production rule is defining. Assigning a value to
$$ is the main task of many actions. The semantic values of the elements
of production rules are referred to as $1 (the first element of a
production rule), $2 (the second element), and so on.
input:
input:
// empty
|
input line
;
This definition should be read as follows: A complete input is either
an empty string, or a complete input followed by an input line. Notice that
`complete input' is defined in terms of itself. This definition is said to be
left recursive since the rule's nonterminal (input) appears as the
leftmost symbol in the production rule. See section 4.4.
The first alternative is empty: there are no symbols between the colon and the
first `|'; this means that input can match an empty string of input (no
tokens). We write the rules this way because it is legitimate to type
Ctrl-d (end of input) immediately after starting the calculator. By
convention empty alternatives are provided with the comment `// empty'.
The second production rule (`input line') handles all nontrivial input. It
means after reading any number of lines, read one more line. The left
recursion turns this rule into a loop: it processes any number of lines.
The parser's parsing function continues to process input until a grammatical error is encountered or untilthe lexical analyzer returns the end-of-file token.
line nonterminal:
line:
'\n'
|
exp '\n'
{
cout << "\t" << $1 << endl;
}
;
The first alternative is a newline character token; this means that
rpn accepts a blank line (and ignores it, since it has no associated
action). The second alternative is an expression (expr), followed by a
newline. This alternative handles all expressions that are entered by the
user. The semantic value of the exp nonterminal is the value of $1
because the exp nonterminal is the first symbol of the production
rule. The action simply prints the expression's value.
This action is unusual because it does not assign a value to $$. As a
consequence, the semantic value associated with line is not defined. This
becomes a bug if that value is ever used, but we don't use it: once rpn
has printed the value of the user's input line, that value is no longer
needed.
exp nonterminal has several rules, one for each kind of
expression. The first rule handles the simplest kind of expression: just a
single number. The second handles additions, which are two expressions
followed by a plus-sign. The third handles subtractions, and so on.
exp:
NUM
|
exp exp '+'
{
$$ = $1 + $2;
}
|
exp exp '-'
{
$$ = $1 - $2;
}
...
Normally the production rule separator `|' is used to separate the
various production rules, but the rules could equally well have separately
been defined:
exp:
NUM
;
exp:
exp exp '+'
{
$$ = $1 + $2;
}
;
exp:
exp exp '-'
{
$$ = $1 - $2;
}
;
...
Most rules have actions assoicated with them computing the value of the
expression using the values of the elements of production rules. For example,
in the rule for addition, $1 refers to the first component exp and
$2 refers to the second one. The third component, '+', has no semantic
value associated with it, but if it had you could refer to it as $3. When
the parser's parsing function recognizes a sum expression using this rule, the
sum of the two subexpressions' values is produced as the value of the entire
expression. See section 4.6.2.
You don't have to define actions for every rule. When a rule has no action,
bisonc++ by default copies the value of $1 into $$. This is what happens
in the first rule (the one that uses NUM).
The formatting shown here shows the recommended layout, but bisonc++ does not require it. You can alter whitespace as much as you like.
lex), which is a predefined member
of the parser class. See section 5.3.1.
Our calculator only needs a simple lexical analyzer. It skips blanks and tabs,
then reads numbers, returning them as NUM tokens. Any other character that
isn't part of a number is a separate token. Note that the token-value for such
a single-character token is the character itself.
The lexical analyzer function's return value is a numeric value representing a
token. If the token is a literal character, then its numeric code is the
character's ASCII code; if the token is the name of a token defined by bisonc++
in a grammar specification file, then such named tokens are defined by bisonc++ in
a C++ enumeration. In the current example, therefore, NUM becomes an
enumeration value, returned by the lex member as the `value' NUM.
Semantic values of nonterminals are stored in the parser's data member
d_val_ (comparable to the variable yylval used by, e.g., Bison). This
data member has int as its default type, but by specifying %stype in
the grammar file's directive section this default type is modified (to, e.g.,
double).
A token value of zero is returned once end-of-file is encountered (the parsing function produced by bisonc++ interprets any nonpositive token value as end-of-input).
Here is the lexical scanner's implementation:
#include "Parser.ih"
// Lexical scanner returns a double floating point
// number on the stack and the token NUM, or the ASCII
// character read if not a number. Skips all blanks
// and tabs, returns 0 for EOF.
int Parser::lex()
{
char c;
// get the next non-ws character
while (std::cin.get(c) && (c == ' ' || c == '\t'))
;
if (!std::cin) // no characters were obtained
return 0; // indicate End Of Input
if (c == '.' || isdigit(c)) // if a digit char was found
{
std::cin.putback(c); // return the character
std::cin >> d_val_; // extract a number
return NUM; // return the NUM token
}
return c; // otherwise return the extracted char.
}
main function is a very small one. It
constructs a parser object and then calls its parsing function to
start the calculator:
#include "main.ih"
int main()
{
Parser parser;
parser.parse();
}
parse encounters a syntax error, it calls the error reporting
member function (error) to print an error message. A very basic in-line
implementation is provided by bisonc++ in the parser class internal header file
(see chapter 5), which can easily be redefined by the
programmer. The error member's default implementation is OK for
rpn.
Once error returns, bisonc++'s parser may recover from the error and continue
parsing if the grammar contains a suitable error rule (see chapter
8). Otherwise, the parsing function parse returns a non-zero
value. No error production rules were used for rpn, so the calculator
terminates at the first syntax error.. Not very nice in a real-life
calculator, but it is acceptable for this first example.
rpn this means that a source file (main.cc) is constructed
holding main, and a file parser/lex.cc containing the lexical
scanner's implementation. Note that I've put all the parser's files in a
separate directory as well (also see section 3.7).
In rpn's parser directory the file grammar
contains the grammar specification. Bisonc++ constructs a parser class and a
parsing member function from this file after issuing the command:
b() grammar
From this, bisonc++ produced the following files:
Parser.h, the parser class definition;
Parserbase.h, the parser's base class definition, defining, among
other, the grammatical tokens to be used by externally defined lexical
scanners;
Parser.ih, the internal header file, to be included by all
implementations of the parser class' members;
parse.cc, the parsing member function.
Parserbase.h and parse.cc will be re-created each
time bisonc++ is re-run. Parser.h and Parser.ih may safely be modified
by the programmer, e.g., to add new members to to the parser class. These two
files will not be overwritten by bisonc++, unless explicitly instructed to do
so.
# List files (recursively) in the (current) examples/rpn directory.
% ls -R
.:
build* main.cc parser/ rpn.ih
./parser:
grammar lex.cc Parser.ih
# Create `rpn' using the `build' script:
% ./build program
# List files again, ./rpn is the constructed program
% ls -R
.:
build* main.cc parser/ program* rpn.ih
./parser:
grammar grammar.output lex.cc parse.cc Parserbase.h Parser.h Parser.ih
Here is an example session using the calculator:
% program
4 9 +
13
3 7 + 3 4 5 *+-
-13
3 7 + 3 4 5 * + - n Note the unary minus, `n'
13
5 6 / 4 n +
-3.16667
3 4 ^ Exponentiation
81
^D End-of-file indicator
%
rpn to handle infix operators instead of postfix. Infix
notation requires us to define operator precedence and to extend the grammar
with nested expressions. Here is bisonc++'s grammar specification for calc, an
infix calculator:
%baseclass-preinclude cmath
%stype double
%token NUM
%left '-' '+'
%left '*' '/'
%left NEG // negation--unary minus
%right '^' // exponentiation
%%
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << '\n';
}
;
exp:
NUM
|
exp '+' exp
{
$$ = $1 + $3;
}
|
exp '-' exp
{
$$ = $1 - $3;
}
|
exp '*' exp
{
$$ = $1 * $3;
}
|
exp '/' exp
{
$$ = $1 / $3;
}
|
'-' exp %prec NEG
{
$$ = -$2;
}
|
// Exponentiation:
exp '^' exp
{
$$ = pow($1, $3);
}
|
'(' exp ')'
{
$$ = $2;
}
;
The functions lex, error and main are identical to the ones
used with rpn.
This example illustrates several important new features:
%left declares token types
stating that the mentioned operators are left-associative. The directives
%left and %right (right associativity) are used instead of %token,
which is merely used to declare a token identifier. The tokens mentioned with
%left and %right are single-character tokens, which ordinarily don't
need to be declared. We declare them here to specify their priority and
associativity.
%left and %right are specified, the lower the priority of
the operators that are following these keywords. So addition/subtraction have
the lowest precedence, multiplication and division are next, then unary minus
(NEG), and exponentiation has the highest precedence. See section
4.5.9.
%prec directive encountered
in the grammar section at the definition of the unary minus operator
production rule. The %prec directive indicates that the minus operator in
the production rule `'-' exp' has NEG's, rather than the minus
('-') operator's precedence. See section 7.3.
Here is a sample run of calc:
% calc
4 + 4.5 - (34/(8*3+-3))
6.88095
-56 + 2
-54
3 ^ 2
9
parse returns
after calling error. This means that an erroneous input line ends the
calculator. Now we show how to recover from erroneous input.
The bisonc++ grammar-specification language includes the reserved symbol error,
which can be used in production rules. In the example below it has been added
as a new alternative for recognizing the line nonterminal:
line:
'\n'
|
exp '\n'
{
std::cout << "\t" << $1 << '\n';
}
|
error '\n'
;
This addition to the grammar allows the calculator to recover from syntax
errors. If a syntax error is encountered, the third production rule is
activated, skipping all tokens until a newline token has been encounters. At
that point line has been recognized, and parsing continues afresh at the
next input line (the error member function is also called, printing its
message). Different from bison's error implementation, bisonc++ proceeds on the
assumption that whenever error is used in a production rule it is the
grammar constructor's intention to have the parser continue parsing
(therefore, a statement like `yyerrok', encountered in bison grammars is
superfluous in bisonc++ grammars). The reserved symbol error itself causes the
parsing function to skip all subsequent input until a token that can follow
error has been encountered. In the above implementation that token is the
newline character `\n' (see chapter 8).
This form of error recovery deals with syntax errors. There are other
kinds of errors; for example, divisions by zero, which raises an exception
signal that is normally fatal. A real calculator program must handle this
signal and use whatever it takes to discard the rest of the current line of
input and resume parsing thereafter. Handling such signals and other forms of
semantic errors is not discussed here, as it is not specific to bisonc++
programs. But once a semantic error has been encountered, handling functions
may call ERROR(), resulting in the same procedure as the one that's used
for syntax errors: all subsequenct tokens are skipped until a token has been
encountered that can follow the reserved symbol error.
+, - * / and ^. It would be nice to have a calculator that
allows us to use some other mathematical functions as well, such as sin,
cos, etc..
It is easy to add new operators to the infix calculator as long as they are
only single-character literals. The parser's member lex returns all
non-number characters as tokens, so only some new grammar production rules
need to be added to the grammar when such tokens must be recognized. But we
want something more flexible: built-in functions that can be called like this:
function_name (argument)
At the same time, we add memory to the calculator, allowing us to use
variables. Here is a sample session with the multi-function calculator:
pi = 3.141592653589
3.14159
sin(pi)
7.93266e-13
alpha = beta1 = 2.3
2.3
alpha
2.3
ln(alpha)
0.832909
exp(ln(beta1))
2.3
Note that multiple assignment and nested function calls are supported.
mfcalc calculator shows several new
features. Here is the bisonc++ directive section for the mfcalc multi-function
calculator (line numbers were added for referential purposes, they are not
part of the declaraction section as used in the actual grammar file):
1 %union
2 {
3 double u_val;
4 double *u_symbol;
5 double (*u_fun)(double);
6 }
7
8 %token <u_val> NUM // Simple double precision number
9 %token <u_symbol> VAR // Variable
10 %token <u_fun> FNCT // Function
11 %type <u_val> exp
12
13 %right '='
14 %left '-' '+'
15 %left '*' '/'
16 %left NEG // negation--unary minus
17 %right '^' // exponentiation
The above specifications show two new features of bisonc++'s grammar
specification language, allowing semantic values to have different data types.
%union directive (given in lines 1 through 6) allows semantic
values to have various data types (see section 4.5.33). It is used
instead of %stype, and defines Parser::STYPE_ as a the union type:
all semantic values now have this type. Now semantic values can have any of
the types that are defined as the union's fields.
%type directive is used to associate (non)terminal tokens with
the types of the fields of the union:
double (for exp and NUM);
double, being a pointer to entries in
mfcalc's symbol table, used with VAR tokens.
double argument and
returning a double value, used with FNCT tokens.
NUM,
VAR, FNCT, and exp. The %type directive first specifies (between
angle brackets) one of the union's field types, followed by a list of (blank
or comma separated) nonterminal or terminal symbols which are associated with
the specified union field type.
=', defined in line
13: by making the assignment operator right-associative we can allow
sequential assignments of the form a = b = c = expression.
calc. Three rules, mentioning VAR or
FNCT, are new:
input:
// empty
|
input line
;
line:
'\n'
|
exp '\n'
{
cout << "\t" << $1 << endl;
}
|
error '\n'
;
exp:
NUM
|
VAR
{
$$ = *$1;
}
|
VAR '=' exp
{
$$ = *$1 = $3;
}
|
FNCT '(' exp ')'
{
$$ = (*$1)($3);
}
|
exp '+' exp
{
$$ = $1 + $3;
}
|
exp '-' exp
{
$$ = $1 - $3;
}
|
exp '*' exp
{
$$ = $1 * $3;
}
|
exp '/' exp
{
$$ = $1 / $3;
}
|
'-' exp %prec NEG
{
$$ = -$2;
}
|
// Exponentiation:
exp '^' exp
{
$$ = pow($1, $3);
}
|
'(' exp ')'
{
$$ = $2;
}
;
The symbol table itself varies in size and contents once mfcalc is
used. It is defined as the data member
d_symbols in the Parser's header file. In contrast, the function table
has a fixed size and contents. Because of this the function table is
defined as a static data member. Both tables are defined as
std::unordered_map containers: their keys are
std::string objects, their values, respecively, doubles and double
(*)(double)s. Here is the declaration of d_symbols and s_functions as
used in mfcalc's parser:
std::unordered_map<std::string, double> d_symbols;
static std::unordered_map<std::string, double (*)(double)> s_functions;
As s_functions is a static member, it can be initialized compile
time from an array of pairs. To ease the definition of such an array a
private using declaration
using FunctionPair = std::pair<char const *, double (*)(double)>;
is added to the parser class, as well as a private array
static FunctionPair s_funTab[];
These definitions allow us to initialize s_functions in a separate
source file (data.cc):
#include "Parser.ih"
Parser::FunctionPair Parser::s_funTab[] =
{
FunctionPair("sin", sin),
FunctionPair("cos", cos),
FunctionPair("atan", atan),
FunctionPair("ln", log),
FunctionPair("exp", exp),
FunctionPair("sqrt", sqrt),
};
unordered_map<string, double (*)(double)> Parser::s_functions
(
Parser::s_funTab,
Parser::s_funTab + sizeof(Parser::s_funTab) / sizeof(Parser::FunctionPair)
);
By simply editing the definition of s_funTab, additional
functions can be added to the calculator.
mfcalc, the parser's member function lex() must now recognize
variables, function names, numeric values, and the single-character arithmetic
operators. Strings of alphanumeric characters not starting with a digit are
recognized as either variables or functions depending on the table in which
they are found. By arranging lex's logic such that the function table is
searched first it is simple to ensure that no variable can ever have the name
of a predefined function. The implementation used here, in which two
different tables are used for the arithmetic functions and the variable
symbols is appealing because it's simple to implement. However, it also has
the drawback of being difficult to scale to more generic calculators, using,
e.g., different data types and different types of functions. In such
situations a single symbol table is more preferable, where the keys are the
identifiers (variables, function names, predefined constants, etc.) while the
values are objects describing their characteristics. A re-implementation of
mfcalc using an integrated symbol table is suggested in one of the
exercises of the next section 6.5.
The parser's lex member has these characteristics:
mfcalc's parser (d_val) is itself also a union,
the numerical value can be extracted into d_val_.u_val, and a
NUM token can be returned.
s_functions map. If found, d_val_.u_fun is given the function's
address, found as the value of the s_functions map element
corresponding to the read identifier, and token FNCT is returned.
If the symbol is not found in s_functions the address of the
value ofn d_symbols associated with the received identifier is
assigned to d_val_.u_symbol and token VAR is returned. Note
that this automatically defines newly used variables, since
d_symbols[name] automatically inserts a new element in a map if
d_symbol[name] wasn't already there.
lex member function:
#include "Parser.ih"
/*
Lexical scanner returns a double floating point
number on the stack and the token NUM, or the ASCII
character read if not a number. Skips all blanks
and tabs, returns 0 for EOF.
*/
int Parser::lex()
{
char c;
// get the next non-ws character
while (cin.get(c) && (c == ' ' || c == '\t'))
;
if (!cin) // no characters were obtained
return 0; // indicate End Of Input
if (c == '.' || isdigit (c)) // if a digit char was found
{
cin.putback(c); // return the character
cin >> d_val_.u_val; // extract a number
return NUM; // return the NUM token
}
if (!isalpha(c)) // c doesn't start an identifier:
return c; // return a single character token.
// in all other cases, an ident is entered. Recognize a var or function
string word; // store the name in a string object
while (true) // process all alphanumerics:
{
word += c; // add 'm to `word'
if (!cin.get(c)) // no more chars? then done here
break;
if (!isalnum(c)) // not an alphanumeric: put it back and done.
{
cin.putback(c);
break;
}
}
// Now lookup the name as a function's name
unordered_map<string, double (*)(double)>::iterator function =
s_functions.find(word);
// Got it, so return FPTR
if (function != s_functions.end())
{
d_val_.u_fun = function->second;
return FNCT;
}
// no function, so return a VAR. Set
// u_symbol to the symbol's address in the
// d_symbol map. The map will add the
// symbol if not yet there.
d_val_.u_symbol = &d_symbols[word];
return VAR;
}
mfcalc, the following steps are suggested:
mfcalc.cc. Actually, it is already available,
since all implementations of main() used so far are identical to
each other.
parser:
grammar;
bisonc++ grammar to produce the files Parser.h,
Parserbase.h, Parser.ih and parse.cc;
Parser.h so as to include FunctionPair,
s_functions, s_funTab and d_symbols;
Parser.ih so as to include cmath and optionally
`using namespace std', which is commented out by default;
data.cc and lex.cc to initialize the static
data and to contain the lexical scanner, respectively.
mfcalc in mfcalc.cc's directory using the
following command:
g++ -o mfcalc *.cc parser/*.cc
mfcalc's
implementation and operating mode:
cmath' to the
Parser::s_functions;
Symbol in which the symbol type, and an
appropriate value for the symbol is stored. Define only one map d_symbols
in the Parser, and provide the Symbol class with means to obtain the
appropriate values for the various token types.
%union directive, and change it into %stype
Symbol. Hint: use the %preinclude-header directive to make
Symbol known to the parser's base class.
CONST for numerical constants (like PI, (E)),
and pre-define some numerical constants;
get() and set() member pair in Symbol, and use the appropriate
member in the appropriate expr rule; use ERROR() to initiate error
recovery.