
=head1 NAME

Parse::RecDescent - Generate Recursive-Descent Parsers

=head1 VERSION

This document describes version 1.94 of Parse::RecDescent,
released April  9, 2003.

=head1 SYNOPSIS

 use Parse::RecDescent;

 # Generate a parser from the specification in $grammar:

	 $parser = new Parse::RecDescent ($grammar);

 # Generate a parser from the specification in $othergrammar

	 $anotherparser = new Parse::RecDescent ($othergrammar);


 # Parse $text using rule 'startrule' (which must be
 # defined in $grammar):

	$parser->startrule($text);


 # Parse $text using rule 'otherrule' (which must also
 # be defined in $grammar):

	 $parser->otherrule($text);


 # Change the universal token prefix pattern
 # (the default is: '\s*'):

	$Parse::RecDescent::skip = '[ \t]+';


 # Replace productions of existing rules (or create new ones)
 # with the productions defined in $newgrammar:

	$parser->Replace($newgrammar);


 # Extend existing rules (or create new ones)
 # by adding extra productions defined in $moregrammar:

	$parser->Extend($moregrammar);


 # Global flags (useful as command line arguments under -s):

	$::RD_ERRORS	   # unless undefined, report fatal errors
	$::RD_WARN	   # unless undefined, also report non-fatal problems
	$::RD_HINT	   # if defined, also suggest remedies
	$::RD_TRACE	   # if defined, also trace parsers' behaviour
	$::RD_AUTOSTUB	   # if defined, generates "stubs" for undefined rules
	$::RD_AUTOACTION   # if defined, appends specified action to productions


=head1 DESCRIPTION

=head2 Overview

Parse::RecDescent incrementally generates top-down recursive-descent text
parsers from simple I<yacc>-like grammar specifications. It provides:

=over 4

=item *

Regular expressions or literal strings as terminals (tokens),

=item *

Multiple (non-contiguous) productions for any rule,

=item *

Repeated and optional subrules within productions,

=item *

Full access to Perl within actions specified as part of the grammar,

=item *

Simple automated error reporting during parser generation and parsing,

=item *

The ability to commit to, uncommit to, or reject particular
productions during a parse,

=item *

The ability to pass data up and down the parse tree ("down" via subrule
argument lists, "up" via subrule return values),

=item *

Incremental extension of the parsing grammar (even during a parse),

=item *

Precompilation of parser objects,

=item *

User-definable reduce-reduce conflict resolution via
"scoring" of matching productions.

=back

=head2 Using C<Parse::RecDescent>

Parser objects are created by calling C<Parse::RecDescent::new>, passing in a
grammar specification (see the following subsections). If the grammar is
correct, C<new> returns a blessed reference which can then be used to initiate
parsing through any rule specified in the original grammar. A typical sequence
looks like this:

	$grammar = q {
			# GRAMMAR SPECIFICATION HERE
		     };

	$parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";

	# acquire $text

	defined $parser->startrule($text) or print "Bad text!\n";

The rule through which parsing is initiated must be explicitly defined
in the grammar (i.e. for the above example, the grammar must include a
rule of the form: "startrule: <subrules>").

If the starting rule succeeds, its value (see below)
is returned. Failure to generate the original parser or failure to match a text
is indicated by returning C<undef>. Note that it's easy to set up grammars
that can succeed, but which return a value of 0, "0", or "".  So don't be
tempted to write:

	$parser->startrule($text) or print "Bad text!\n";

Normally, the parser has no effect on the original text. So in the
previous example the value of $text would be unchanged after having
been parsed.

If, however, the text to be matched is passed by reference:

	$parser->startrule(\$text)

then any text which was consumed during the match will be removed from the
start of $text.
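
As a minimal sketch of the points above (the one-rule grammar and the sample
inputs are assumptions made purely for illustration), the following shows why
the result should be tested with C<defined>, and how passing a reference
consumes the matched text:

        use Parse::RecDescent;

        # Hypothetical grammar: a rule whose value may legitimately be 0
        my $parser = Parse::RecDescent->new(q{
                number: /[0-9]+/
        }) or die "Bad grammar!\n";

        # "0" parses successfully, but its value is false in boolean context
        my $value = $parser->number("0");
        print defined $value ? "Matched: $value\n" : "Bad text!\n";

        # Passing a reference removes the consumed text from the original
        my $text = "42 apples";
        $parser->number(\$text);
        print "Remaining: '$text'\n";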


=head2 Rules

In the grammar from which the parser is built, rules are specified by
giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a
colon I<on the same line>, followed by one or more productions,
separated by single vertical bars. The layout of the productions
is entirely free-format:

	rule1:	production1
	     |  production2 |
		production3 | production4

At any point in the grammar previously defined rules may be extended with
additional productions. This is achieved by redeclaring the rule with the new
productions. Thus:

	rule1: a | b | c
	rule2: d | e | f
	rule1: g | h

is exactly equivalent to:

	rule1: a | b | c | g | h
	rule2: d | e | f

Each production in a rule consists of zero or more items, each of which
may be either: the name of another rule to be matched (a "subrule"),
a pattern or string literal to be matched directly (a "token"), a
block of Perl code to be executed (an "action"), a special instruction
to the parser (a "directive"), or a standard Perl comment (which is
ignored).

A rule matches a text if one of its productions matches. A production
matches if each of its items matches consecutive substrings of the
text. The productions of a rule being matched are tried in the same
order that they appear in the original grammar, and the first matching
production terminates the match attempt (successfully). If all
productions are tried and none matches, the match attempt fails.

Note that this behaviour is quite different from the "prefer the longer match"
behaviour of I<yacc>. For example, if I<yacc> were parsing the rule:

	seq : 'A' 'B'
	    | 'A' 'B' 'C'

upon matching "AB" it would look ahead to see if a 'C' is next and, if
so, would match the second production in preference to the first. In
other words, I<yacc> effectively tries all the productions of a rule
breadth-first in parallel, and selects the "best" match, where "best"
means longest (note that this is a gross simplification of the true
behaviour of I<yacc> but it will do for our purposes).

In contrast, C<Parse::RecDescent> tries each production depth-first in
sequence, and selects the "best" match, where "best" means first. This is
the fundamental difference between "bottom-up" and "recursive descent"
parsing.
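
The difference can be observed directly. The sketch below (the input string
is an assumption chosen for illustration) feeds "ABC" to the C<seq> rule
shown earlier, passing the text by reference so that the unconsumed
remainder can be inspected. Because the first production is tried first and
succeeds on "AB", the second production is never attempted:

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                seq : 'A' 'B'
                    | 'A' 'B' 'C'
        }) or die "Bad grammar!\n";

        my $text   = "ABC";
        my $result = $parser->seq(\$text);

        # The first production wins, so the trailing 'C' is left unparsed
        print "Last item matched: '$result'; remainder: '$text'\n";

Listing the longer alternative first is the usual way to make the longer
match the preferred one.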

Each successfully matched item in a production is assigned a value,
which can be accessed in subsequent actions within the same
production (or, in some cases, as the return value of a successful
subrule call). Unsuccessful items don't have an associated value,
since the failure of an item causes the entire surrounding production
to immediately fail. The following sections describe the various types
of items and their success values.


=head2 Subrules

A subrule which appears in a production is an instruction to the parser to
attempt to match the named rule at that point in the text being
parsed. If the named subrule is not defined when requested, the
production containing it immediately fails (unless it was "autostubbed" - see
L<Autostubbing>).

A rule may (recursively) call itself as a subrule, but I<not> as the
left-most item in any of its productions (since such recursions are usually
non-terminating).

The value associated with a subrule is the value associated with its
C<$return> variable (see L<"Actions"> below), or with the last successfully
matched item in the subrule match.

Subrules may also be specified with a trailing repetition specifier,
indicating that they are to be (greedily) matched the specified number
of times. The available specifiers are:

		subrule(?)	# Match one-or-zero times
		subrule(s)	# Match one-or-more times
		subrule(s?)	# Match zero-or-more times
		subrule(N)	# Match exactly N times for integer N > 0
		subrule(N..M)	# Match between N and M times
		subrule(..M)	# Match between 1 and M times
		subrule(N..)	# Match at least N times

Repeated subrules keep matching until either the subrule fails to
match, or it has matched the minimal number of times but fails to
consume any of the parsed text (this second condition prevents the
subrule matching forever in some cases).

Since a repeated subrule may match many instances of the subrule itself, the
value associated with it is not a simple scalar, but rather a reference to a
list of scalars, each of which is the value associated with one of the
individual subrule matches. In other words in the rule:

		program: statement(s)

the value associated with the repeated subrule "statement(s)" is a reference
to an array containing the values matched by each call to the individual
subrule "statement".

Repetition modifiers may include a separator pattern:

		program: statement(s /;/)

specifying some sequence of characters to be skipped between each repetition.
This is really just a shorthand for the C<E<lt>leftop:...E<gt>> directive
(see below).
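
For instance, the following sketch (the C<sentence> and C<word> rules and
the input are hypothetical) dereferences the array returned by a repeated
subrule:

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                sentence: word(s)
                word:     /[a-z]+/
        }) or die "Bad grammar!\n";

        # The value of word(s) is an array reference, one element per match
        my $words = $parser->sentence("the quick brown fox");
        print scalar(@$words), " words: @$words\n" if defined $words;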

=head2 Tokens

If a quote-delimited string or a Perl regex appears in a production,
the parser attempts to match that string or pattern at that point in
the text. For example:

		typedef: "typedef" typename identifier ';'

		identifier: /[A-Za-z_][A-Za-z0-9_]*/

As in regular Perl, a single-quoted string is uninterpolated, whilst
a double-quoted string or a pattern is interpolated (at the time
of matching, I<not> when the parser is constructed). Hence, it is
possible to define rules in which tokens can be set at run-time:

		typedef: "$::typedefkeyword" typename identifier ';'

		identifier: /$::identpat/

Note that, since each rule is implemented inside a special namespace
belonging to its parser, it is necessary to explicitly qualify
variables from the main package.

Regex tokens can be specified using just slashes as delimiters
or with the explicit C<mE<lt>delimiterE<gt>......E<lt>delimiterE<gt>> syntax:

		typedef: "typedef" typename identifier ';'

		typename: /[A-Za-z_][A-Za-z0-9_]*/

		identifier: m{[A-Za-z_][A-Za-z0-9_]*}

A regex of either type can also have any valid trailing parameter(s)
(that is, any of [cgimsox]):

		typedef: "typedef" typename identifier ';'

		identifier: / [a-z_] 		# LEADING ALPHA OR UNDERSCORE
			      [a-z0-9_]*	# THEN DIGITS ALSO ALLOWED
			    /ix			# CASE/SPACE/COMMENT INSENSITIVE

The value associated with any successfully matched token is a string
containing the actual text which was matched by the token.

It is important to remember that, since each grammar is specified in a
Perl string, all instances of the universal escape character '\' within
a grammar must be "doubled", so that they interpolate to single '\'s when
the string is compiled. For example, to use the grammar:

		word:	    /\S+/ | backslash
		line:	    prefix word(s) "\n"
		backslash:  '\\'

the following code is required:

		$parser = new Parse::RecDescent (q{

			word:	    /\\S+/ | backslash
			line:	    prefix word(s) "\\n"
			backslash:  '\\\\'

		});
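
To see what the parser generator actually receives, it can help to print the
grammar string before handing it to C<new>. This small sketch uses only core
Perl and a cut-down version of the grammar above:

        my $grammar = q{
                word:      /\\S+/ | backslash
                backslash: '\\\\'
        };

        # Each doubled backslash has collapsed to a single one, so the
        # printed text is the grammar as Parse::RecDescent will see it
        print $grammar;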


=head2 Terminal Separators

For the purpose of matching, each terminal in a production is considered
to be preceded by a "prefix" - a pattern which must be
matched before a token match is attempted. By default, the
prefix is optional whitespace (which always matches, at
least trivially), but this default may be reset in any production.

The variable C<$Parse::RecDescent::skip> stores the universal
prefix, which is the default for all terminal matches in all parsers
built with C<Parse::RecDescent>.

The prefix for an individual production can be altered
by using the C<E<lt>skip:...E<gt>> directive (see below).


=head2 Actions

An action is a block of Perl code which is to be executed (as the
block of a C<do> statement) when the parser reaches that point in a
production. The action executes within a special namespace belonging to
the active parser, so care must be taken in correctly qualifying variable
names (see also L<Start-up Actions> below).

The action is considered to succeed if the final value of the block
is defined (that is, if the implied C<do> statement evaluates to a
defined value - I<even one which would be treated as "false">). Note
that the value associated with a successful action is also the final
value in the block.

An action will I<fail> if its last evaluated value is C<undef>. This is
surprisingly easy to accomplish by accident. For instance, here's an
infuriating case of an action that makes its production fail, but only
when debugging I<isn't> activated:

	description: name rank serial_number
			{ print "Got $item[2] $item[1] ($item[3])\n"
				if $::debugging
			}

If C<$::debugging> is false, no statement in the block is executed, so
the final value is C<undef>, and the entire production fails. The solution is:

	description: name rank serial_number
			{ print "Got $item[2] $item[1] ($item[3])\n"
				if $::debugging;
			  1;
			}
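
The distinction between "false" and "undefined" can be checked with a sketch
like this one (both rule names are invented for the example):

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                zero_rule:  /[a-z]+/ { 0 }
                undef_rule: /[a-z]+/ { undef }
        }) or die "Bad grammar!\n";

        # Final value of the action is 0: false, but defined, so it succeeds
        print "zero_rule succeeded\n" if defined $parser->zero_rule("abc");

        # Final value of the action is undef, so the production fails
        print "undef_rule failed\n" unless defined $parser->undef_rule("abc");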

Within an action, a number of useful parse-time variables are
available in the special parser namespace (there are other variables
also accessible, but meddling with them will probably just break your
parser. As a general rule, if you avoid referring to unqualified
variables - especially those starting with an underscore - inside an action,
things should be okay):

=over 4

=item C<@item> and C<%item>

The array slice C<@item[1..$#item]> stores the value associated with each item
(that is, each subrule, token, or action) in the current production. The
analogy is to C<$1>, C<$2>, etc. in a I<yacc> grammar.
Note that, for obvious reasons, C<@item> only contains the
values of items I<before> the current point in the production.

The first element (C<$item[0]>) stores the name of the current rule
being matched.

C<@item> is a standard Perl array, so it can also be indexed with negative
numbers, representing the number of items I<back> from the current position in
the parse:

	stuff: /various/ bits 'and' pieces "then" data 'end'
			{ print $item[-2] }  # PRINTS data
					     # (EASIER THAN: $item[6])

The C<%item> hash complements the C<@item> array, providing named
access to the same item values:

	stuff: /various/ bits 'and' pieces "then" data 'end'
			{ print $item{data}  # PRINTS data
					     # (EVEN EASIER THAN USING @item)
			}


The results of named subrules are stored in the hash under each
subrule's name (including the repetition specifier, if any),
whilst all other items are stored under a "named
positional" key that indicates their ordinal position within their item
type: __STRINGI<n>__, __PATTERNI<n>__, __DIRECTIVEI<n>__, __ACTIONI<n>__:

        stuff: /various/ bits 'and' pieces "then" data 'end' { save }
                        { print $item{__PATTERN1__}, # PRINTS 'various'
                                $item{__STRING2__},  # PRINTS 'then'
                                $item{__ACTION1__},  # PRINTS RETURN
						     # VALUE OF save
                        }


If you want proper I<named> access to patterns or literals, you need to turn
them into separate rules:

        stuff: various bits 'and' pieces "then" data 'end'
                        { print $item{various}  # PRINTS various
                        }

        various: /various/


The special entry C<$item{__RULE__}> stores the name of the current
rule (i.e. the same value as C<$item[0]>).

The advantage of using C<%item>, instead of C<@item>, is that it
removes the need to track item positions that may change as a grammar
evolves. For example, adding an interim C<E<lt>skipE<gt>> directive
or action can silently ruin a trailing action, by moving an C<@item>
element "down" the array one place. In contrast, the named entry
of C<%item> is unaffected by such an insertion.

A limitation of the C<%item> hash is that it only records the I<last>
value of a particular subrule. For example:

        range: '(' number '..' number ')'
                        { $return = $item{number} }

will return only the value corresponding to the I<second> match of the
C<number> subrule. In other words, successive calls to a subrule
overwrite the corresponding entry in C<%item>. Once again, the
solution is to rename each subrule in its own rule:

        range: '(' from_num '..' to_num ')'
                        { $return = $item{from_num} }

        from_num: number
        to_num:   number


=item C<@arg> and C<%arg>

The array C<@arg> and the hash C<%arg> store any arguments passed to
the rule from some other rule (see L<"Subrule argument lists">). Changes
to the elements of either variable do not propagate back to the calling
rule (data can be passed back from a subrule via the C<$return>
variable - see next item).


=item C<$return>

If a value is assigned to C<$return> within an action, that value is
returned if the production containing the action eventually matches
successfully. Note that setting C<$return> I<doesn't> cause the current
production to succeed. It merely tells it what to return if it I<does> succeed.
Hence C<$return> is analogous to C<$$> in a I<yacc> grammar.

If C<$return> is not assigned within a production, the value of the
last component of the production (namely: C<$item[$#item]>) is
returned if the production succeeds.


=item C<$commit>

The current state of commitment to the current production (see L<"Directives">
below).

=item C<$skip>

The current terminal prefix (see L<"Directives"> below).

=item C<$text>

The remaining (unparsed) text. Changes to C<$text> I<do not
propagate> out of unsuccessful productions, but I<do> survive
successful productions. Hence it is possible to dynamically alter the
text being parsed - for example, to provide a C<#include>-like facility:

        hash_include: '#include' filename
                                { $text = ::loadfile($item[2]) . $text }

        filename: '<' /[a-z0-9._-]+/i '>'  { $return = $item[2] }
                | '"' /[a-z0-9._-]+/i '"'  { $return = $item[2] }


=item C<$thisline> and C<$prevline>

C<$thisline> stores the current line number within the current parse
(starting from 1). C<$prevline> stores the line number for the last
character which was already successfully parsed (this will be different from
C<$thisline> at the end of each line).

For efficiency, C<$thisline> and C<$prevline> are actually tied
scalars, and only recompute the required line number when the variable's
value is used.

Assignment to C<$thisline> adjusts the line number calculator, so that
it believes that the current line number is the value being assigned. Note
that this adjustment will be reflected in all subsequent line number
calculations.

Modifying the value of the variable C<$text> (as in the previous
C<hash_include> example, for instance) will confuse the line
counting mechanism. To prevent this, you should call
C<Parse::RecDescent::LineCounter::resync($thisline)> I<immediately>
after any assignment to the variable C<$text> (or, at least, before the
next attempt to use C<$thisline>).

Note that if a production fails after assigning to or
resync'ing C<$thisline>, the parser's line counter mechanism will
usually be corrupted.

Also see the entry for C<@itempos>.

The line number can be set to values other than 1, by calling the start
rule with a second argument. For example:

        $parser = new Parse::RecDescent ($grammar);

        $parser->input($text, 10);      # START LINE NUMBERS AT 10


=item C<$thiscolumn> and C<$prevcolumn>

C<$thiscolumn> stores the current column number within the current line
being parsed (starting from 1). C<$prevcolumn> stores the column number
of the last character which was actually successfully parsed. Usually
C<$prevcolumn == $thiscolumn-1>, but not at the end of lines.

For efficiency, C<$thiscolumn> and C<$prevcolumn> are
actually tied scalars, and only recompute the required column number
when the variable's value is used.

Assignment to C<$thiscolumn> or C<$prevcolumn> is a fatal error.

Modifying the value of the variable C<$text> (as in the previous
C<hash_include> example, for instance) may confuse the column
counting mechanism.

Note that C<$thiscolumn> reports the column number I<before> any
whitespace that might be skipped before reading a token. Hence
if you wish to know where a token started (and ended) use something like this:

        rule: token1 token2 startcol token3 endcol token4
                        { print "token3: columns $item[3] to $item[5]"; }

        startcol: '' { $thiscolumn }    # NEED THE '' TO STEP PAST TOKEN SEP
        endcol:      { $prevcolumn }

Also see the entry for C<@itempos>.

=item C<$thisoffset> and C<$prevoffset>

C<$thisoffset> stores the offset of the current parsing position
within the complete text
being parsed (starting from 0). C<$prevoffset> stores the offset
of the last character which was actually successfully parsed. In all
cases C<$prevoffset == $thisoffset-1>.

For efficiency, C<$thisoffset> and C<$prevoffset> are
actually tied scalars, and only recompute the required offset
when the variable's value is used.

Assignment to C<$thisoffset> or C<$prevoffset> is a fatal error.

Modifying the value of the variable C<$text> will I<not> affect the
offset counting mechanism.

Also see the entry for C<@itempos>.

=item C<@itempos>

The array C<@itempos> stores a hash reference corresponding to
each element of C<@item>. The elements of the hash provide the
following:

        $itempos[$n]{offset}{from}      # VALUE OF $thisoffset BEFORE $item[$n]
        $itempos[$n]{offset}{to}        # VALUE OF $prevoffset AFTER $item[$n]
        $itempos[$n]{line}{from}        # VALUE OF $thisline BEFORE $item[$n]
        $itempos[$n]{line}{to}          # VALUE OF $prevline AFTER $item[$n]
        $itempos[$n]{column}{from}      # VALUE OF $thiscolumn BEFORE $item[$n]
        $itempos[$n]{column}{to}        # VALUE OF $prevcolumn AFTER $item[$n]

Note that the various C<$itempos[$n]...{from}> values record the
appropriate value I<after> any token prefix has been skipped.

Hence, instead of the somewhat tedious and error-prone:

        rule: startcol token1 endcol
              startcol token2 endcol
              startcol token3 endcol
                        { print "token1: columns $item[1]
                                              to $item[3]
                                 token2: columns $item[4]
                                              to $item[6]
                                 token3: columns $item[7]
                                              to $item[9]" }

        startcol: '' { $thiscolumn }    # NEED THE '' TO STEP PAST TOKEN SEP
        endcol:      { $prevcolumn }

it is possible to write:

        rule: token1 token2 token3
                        { print "token1: columns $itempos[1]{column}{from}
                                              to $itempos[1]{column}{to}
                                 token2: columns $itempos[2]{column}{from}
                                              to $itempos[2]{column}{to}
                                 token3: columns $itempos[3]{column}{from}
                                              to $itempos[3]{column}{to}" }

Note however that (in the current implementation) the use of C<@itempos>
anywhere in a grammar implies that item positioning information is
collected I<everywhere> during the parse. Depending on the grammar
and the size of the text to be parsed, this may be prohibitively
expensive and the explicit use of C<$thisline>, C<$thiscolumn>, etc. may
be a better choice.


=item C<$thisparser>

A reference to the S<C<Parse::RecDescent>> object through which
parsing was initiated.

The value of C<$thisparser> propagates down the subrules of a parse
but not back up. Hence, you can invoke subrules from another parser
for the scope of the current rule as follows:

        rule: subrule1 subrule2
            | { $thisparser = $::otherparser } <reject>
            | subrule3 subrule4
            | subrule5

The result is that the first production calls "subrule1" and "subrule2" of
the current parser, and the remaining productions call the named subrules
from C<$::otherparser>. Note, however, that "Bad Things" will happen if
C<$::otherparser> isn't a blessed reference and/or doesn't have methods
with the same names as the required subrules!

=item C<$thisrule>

A reference to the S<C<Parse::RecDescent::Rule>> object corresponding to the
rule currently being matched.

=item C<$thisprod>

A reference to the S<C<Parse::RecDescent::Production>> object
corresponding to the production currently being matched.

=item C<$score> and C<$score_return>

C<$score> stores the best production score to date, as specified by
an earlier C<E<lt>score:...E<gt>> directive. C<$score_return> stores
the corresponding return value for the successful production.

See L<Scored productions>.

=back
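
As a sketch of how several of these variables fit together (the grammar and
the input below are assumptions for illustration, built on the C<range>
example above), C<%item> gives named access to the two subrule values and
C<$return> passes both of them back up to the caller:

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                range:    '(' from_num '..' to_num ')'
                                { $return = [ $item{from_num}, $item{to_num} ] }

                from_num: number
                to_num:   number
                number:   /[0-9]+/
        }) or die "Bad grammar!\n";

        # On success the rule's value is the array reference held in $return
        my $range = $parser->range("(3 .. 12)");
        print "from $range->[0] to $range->[1]\n" if defined $range;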

B<Warning:> the parser relies on the information in the various C<this...>
objects in some non-obvious ways. Tinkering with the other members of
these objects will probably cause Bad Things to happen, unless you
I<really> know what you're doing. The only exception to this advice is
that the use of C<$this...-E<gt>{local}> is always safe.


=head2 Start-up Actions

Any actions which appear I<before> the first rule definition in a
grammar are treated as "start-up" actions. Each such action is
stripped of its outermost brackets and then evaluated (in the parser's
special namespace) just before the rules of the grammar are first
compiled.

The main use of start-up actions is to declare local variables within the
parser's special namespace:

        { my $lastitem = '???'; }

        list: item(s)   { $return = $lastitem }

        item: book      { $lastitem = 'book'; }
            | bell      { $lastitem = 'bell'; }
            | candle    { $lastitem = 'candle'; }

but start-up actions can be used to execute I<any> valid Perl code
within a parser's special namespace.

Start-up actions can appear within a grammar extension or replacement
(that is, a partial grammar installed via C<Parse::RecDescent::Extend()> or
C<Parse::RecDescent::Replace()> - see L<Incremental Parsing>), and will be
executed before the new grammar is installed. Note, however, that a
particular start-up action is only ever executed once.
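
A self-contained variation on this idea might look as follows. This is only
a sketch: the C<$count> variable, the rules, and the input are hypothetical,
and the counter is declared in a start-up action so that every action in the
grammar can see it:

        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                { my $count = 0; }

                list: item(s)   { $return = $count }

                item: /[a-z]+/  { $count++; 1 }
        }) or die "Bad grammar!\n";

        # Reports how many items the repeated subrule matched
        my $n = $parser->list("book bell candle");
        print "Matched $n items\n" if defined $n;

Note that the trailing C<1> keeps the C<item> action's value defined and
true regardless of the counter's value.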


=head2 Autoactions

It is sometimes desirable to be able to specify a default action to be
taken at the end of every production (for example, in order to easily
build a parse tree). If the variable C<$::RD_AUTOACTION> is defined
when C<Parse::RecDescent::new()> is called, the contents of that
variable are treated as a specification of an action which is to be appended
to each production in the corresponding grammar. So, for example, to construct
a simple parse tree:

    $::RD_AUTOACTION = q { [@item] };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
        });

which is equivalent to:

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression
                        { [@item] }
                  | and_expr
                        { [@item] }

        and_expr:   not_expr '&&' and_expr
                        { [@item] }
                |   not_expr
                        { [@item] }

        not_expr:   '!' brack_expr
                        { [@item] }
                |   brack_expr
                        { [@item] }

        brack_expr: '(' expression ')'
                        { [@item] }
                  | identifier
                        { [@item] }

        identifier: /[a-z]+/i
                        { [@item] }
        });

Alternatively, we could take an object-oriented approach, using a different
class for each kind of node (and also eliminating redundant intermediate nodes):

    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
        });

which is equivalent to:

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression
                        { new expression_node (@item[1..3]) }
                  | and_expr

        and_expr:   not_expr '&&' and_expr
                        { new and_expr_node (@item[1..3]) }
                |   not_expr

        not_expr:   '!' brack_expr
                        { new not_expr_node (@item[1..2]) }
                |   brack_expr

        brack_expr: '(' expression ')'
                        { new brack_expr_node (@item[1..3]) }
                  | identifier

        identifier: /[a-z]+/i
                        { new identifier_node ($item[1]) }
        });

Note that, if a production already ends in an action, no autoaction is appended
to it. For example, in this version:

    $::RD_AUTOACTION = q
      { $#item==1 ? $item[1] : "$item[0]_node"->new(@item[1..$#item]) };

    $parser = new Parse::RecDescent (q{
        expression: and_expr '||' expression | and_expr
        and_expr:   not_expr '&&' and_expr   | not_expr
        not_expr:   '!' brack_expr           | brack_expr
        brack_expr: '(' expression ')'       | identifier
        identifier: /[a-z]+/i
                        { new terminal_node($item[1]) }
        });

each C<identifier> match produces a C<terminal_node> object, I<not> an
C<identifier_node> object.

A level 1 warning is issued each time an "autoaction" is added to
some production.
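
To inspect what an autoaction produces, the result can be dumped with the
core C<Data::Dumper> module. The two-rule grammar and the input in this
sketch are assumptions chosen to keep the output small:

        use Parse::RecDescent;
        use Data::Dumper;

        # Must be set before the parser is built
        $::RD_AUTOACTION = q { [@item] };

        my $parser = Parse::RecDescent->new(q{
                sum:    number '+' sum | number
                number: /[0-9]+/
        }) or die "Bad grammar!\n";

        # Each node is an array reference whose first element is the rule name
        my $tree = $parser->sum("1 + 2");
        print Dumper($tree) if defined $tree;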


=head2 Autotrees

A commonly needed autoaction is one that builds a parse-tree. It is moderately
tricky to set up such an action (which must treat terminals differently from
non-terminals), so Parse::RecDescent simplifies the process by providing the
C<E<lt>autotreeE<gt>> directive.

If this directive appears at the start of a grammar, it causes
Parse::RecDescent to insert autoactions at the end of any production except
those which already end in an action. The action inserted depends on whether
the production is an intermediate rule (two or more items), or a terminal
of the grammar (i.e. a single pattern or string item).

So, for example, the following grammar:

        <autotree>

        file    : command(s)
        command : get | set | vet
        get     : 'get' ident ';'
        set     : 'set' ident 'to' value ';'
        vet     : 'check' ident 'is' value ';'
        ident   : /\w+/
        value   : /\d+/

is equivalent to:

        file    : command(s)                    { bless \%item, $item[0] }
        command : get                           { bless \%item, $item[0] }
                | set                           { bless \%item, $item[0] }
                | vet                           { bless \%item, $item[0] }
        get     : 'get' ident ';'               { bless \%item, $item[0] }
        set     : 'set' ident 'to' value ';'    { bless \%item, $item[0] }
        vet     : 'check' ident 'is' value ';'  { bless \%item, $item[0] }

        ident   : /\w+/          { bless {__VALUE__=>$item[1]}, $item[0] }
        value   : /\d+/          { bless {__VALUE__=>$item[1]}, $item[0] }

Note that each node in the tree is blessed into a class of the same name
as the rule itself. This makes it easy to build object-oriented
processors for the parse-trees that the grammar produces. Note too that
the last two rules produce special objects with the single attribute
'__VALUE__'. This is because they consist solely of a single terminal.

This autoaction-ed grammar would then produce a parse tree in a data
structure like this:

        {
          file => {
                    command => {
                                 [ get => {
                                            ident      => { __VALUE__ => 'a' },
                                          },
                                   set => {
                                            ident      => { __VALUE__ => 'b' },
                                            value      => { __VALUE__ => '7' },
                                          },
                                   vet => {
                                            ident      => { __VALUE__ => 'b' },
                                            value      => { __VALUE__ => '7' },
                                          },
                                  ],
                               },
                  }
        }

(except, of course, that each nested hash would also be blessed into
the appropriate class).

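The tree built by the C<E<lt>autotreeE<gt>> grammar above can be walked
with ordinary hash and array dereferences. The following is a minimal
sketch rather than a definitive recipe: the input string and the names
C<$tree> and C<@summary> are illustrative assumptions, and it relies on
the repeated subrule being stored under the key C<'command(s)'> (that is,
with its repetition specifier, as for C<%item>).

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
                <autotree>

                file    : command(s)
                command : get | set | vet
                get     : 'get' ident ';'
                set     : 'set' ident 'to' value ';'
                vet     : 'check' ident 'is' value ';'
                ident   : /\w+/
                value   : /\d+/
        };

        my $parser = Parse::RecDescent->new($grammar) or die "Bad grammar\n";
        my $tree   = $parser->file("get a; set b to 7; check b is 7;")
                or die "Parse failed\n";

        # Each "command" node holds exactly one of the keys get/set/vet
        my @summary;
        for my $command (@{ $tree->{'command(s)'} }) {
                my ($kind) = grep { exists $command->{$_} } qw(get set vet);
                my $node   = $command->{$kind};

                # ref() gives the class, which is the name of the rule
                my $line = ref($node) . " " . $node->{ident}{__VALUE__};
                $line .= " " . $node->{value}{__VALUE__} if $node->{value};
                push @summary, $line;
        }
        print "$_\n" for @summary;

Because every node is blessed into a class named after its rule, the
same traversal could instead be written as methods in packages called
C<get>, C<set>, and C<vet>.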

=head2 Autostubbing

Normally, if a subrule appears in some production, but no rule of that
name is ever defined in the grammar, the production which refers to the
non-existent subrule fails immediately. This typically occurs as a
result of misspellings, and is a sufficiently common occurrence that a
warning is generated for such situations.

However, when prototyping a grammar it is sometimes useful to be
able to use subrules before a proper specification of them is
really possible.  For example, a grammar might include a section like:

        function_call: identifier '(' arg(s?) ')'

        identifier: /[a-z]\w*/i

where the possible format of an argument is sufficiently complex that
it is not worth specifying in full until the general function call
syntax has been debugged. In this situation it is convenient to leave
the real rule C<arg> undefined and just slip in a placeholder (or
"stub"):

        arg: 'arg'

so that the function call syntax can be tested with dummy input such as:

        f0()
        f1(arg)
        f2(arg arg)
        f3(arg arg arg)

et cetera.

Early in prototyping, many such "stubs" may be required, so
C<Parse::RecDescent> provides a means of automating their definition.
If the variable C<$::RD_AUTOSTUB> is defined when a parser is built,
a subrule reference to any non-existent rule (say, C<sr>),
causes a "stub" rule of the form:

        sr: 'sr'

to be automatically defined in the generated parser.
A level 1 warning is issued for each such "autostubbed" rule.

Hence, with C<$::RD_AUTOSTUB> defined, it is possible to only partially
specify a grammar, and then "fake" matches of the unspecified
(sub)rules by just typing in their name.

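A minimal sketch of autostubbing in use, assuming the function-call
grammar shown earlier. The test inputs and the C<%matched> hash are
illustrative only. The variable is set to 1 here, which is one way of
making it defined:

        use strict;
        use warnings;
        use Parse::RecDescent;

        # Must be defined before the parser is built
        $::RD_AUTOSTUB = 1;

        my $parser = Parse::RecDescent->new(q{
                function_call: identifier '(' arg(s?) ')'
                identifier:    /[a-z]\w*/i
        }) or die "Bad grammar\n";

        # "arg" was never defined, so the stub  arg: 'arg'  stands in for it
        my %matched;
        for my $call ('f0()', 'f2(arg arg)', 'f1(42)') {
                $matched{$call} = defined $parser->function_call($call);
                print "$call: ", ($matched{$call} ? "matched" : "failed"), "\n";
        }

The last input is expected to fail, since C<42> is not the literal text
C<arg> that the stub requires.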

=head2 Look-ahead

If a subrule, token, or action is prefixed by "...", then it is
treated as a "look-ahead" request. That means that the current production can
(as usual) only succeed if the specified item is matched, but that the matching
I<does not consume any of the text being parsed>. This is very similar to the
C</(?=...)/> look-ahead construct in Perl patterns. Thus, the rule:

        inner_word: word ...word

will match whatever the subrule "word" matches, provided that match is followed
by some more text which subrule "word" would also match (although this
second substring is not actually consumed by "inner_word").

Likewise, a "...!" prefix causes the following item to succeed (without
consuming any text) if and only if it would normally fail. Hence, a
rule such as:

        identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/

matches a string of characters which satisfies the pattern
C</[A-Za-z_]\w*/>, but only if the same sequence of characters would
not match either subrule "keyword" or the literal token '_'.

Sequences of look-ahead prefixes accumulate, multiplying their positive and/or
negative senses. Hence:

        inner_word: word ...!......!word

is exactly equivalent to the original example above (a warning is issued in
cases like these, since they often indicate something left out, or
misunderstood).

Note that actions can also be treated as look-aheads. In such cases,
the state of the parser text (in the local variable C<$text>)
I<after> the look-ahead action is guaranteed to be identical to its
state I<before> the action, regardless of how it's changed I<within>
the action (unless you actually undefine C<$text>, in which case you
get the disaster you deserve :-).

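The following sketch shows both kinds of look-ahead at work. The
explicit action on C<inner_word>, the pattern used for C<keyword>, and
the sample inputs are assumptions made for illustration. The text is
passed by reference so that whatever the parser leaves unconsumed can be
inspected afterwards:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                inner_word: word ...word   { $item[1] }
                word:       /\w+/

                identifier: ...!keyword /[A-Za-z_]\w*/
                keyword:    /(?:if|while)\b/
        }) or die "Bad grammar\n";

        # The look-ahead must match, but it consumes nothing
        my $text  = "alpha beta";
        my $first = $parser->inner_word(\$text);
        print "matched '$first', remaining '$text'\n";

        # No following word, so the look-ahead (and hence the rule) fails
        my $last = $parser->inner_word("omega");
        print "lone word: ", (defined $last ? "matched" : "failed"), "\n";

        # The negative look-ahead rejects keywords but accepts other names
        for my $name (qw(while whilst)) {
                my $id = $parser->identifier($name);
                print "$name: ", (defined $id ? "identifier" : "rejected"), "\n";
        }

Note the C<\b> in the C<keyword> pattern: without it, the negative
look-ahead would also reject any identifier that merely starts with a
keyword.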

=head2 Directives

Directives are special pre-defined actions which may be used to alter
the behaviour of the parser. There are currently eighteen directives:
C<E<lt>commitE<gt>>,
C<E<lt>uncommitE<gt>>,
C<E<lt>rejectE<gt>>,
C<E<lt>scoreE<gt>>,
C<E<lt>autoscoreE<gt>>,
C<E<lt>skipE<gt>>,
C<E<lt>resyncE<gt>>,
C<E<lt>errorE<gt>>,
C<E<lt>rulevarE<gt>>,
C<E<lt>matchruleE<gt>>,
C<E<lt>leftopE<gt>>,
C<E<lt>rightopE<gt>>,
C<E<lt>deferE<gt>>,
C<E<lt>nocheckE<gt>>,
C<E<lt>perl_quotelikeE<gt>>,
C<E<lt>perl_codeblockE<gt>>,
C<E<lt>perl_variableE<gt>>,
and C<E<lt>tokenE<gt>>.

=over 4

=item Committing and uncommitting

The C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives permit the recursive
descent of the parse tree to be pruned (or "cut") for efficiency.
Within a rule, a C<E<lt>commitE<gt>> directive instructs the rule to ignore subsequent
productions if the current production fails. For example:

        command: 'find' <commit> filename
               | 'open' <commit> filename
               | 'move' filename filename

Clearly, if the leading token 'find' is matched in the first production but that
production fails for some other reason, then the remaining
productions cannot possibly match. The presence of the
C<E<lt>commitE<gt>> causes the "command" rule to fail immediately if
an invalid "find" command is found, and likewise if an invalid "open"
command is encountered.

It is also possible to revoke a previous commitment. For example:

        if_statement: 'if' <commit> condition
                                'then' block <uncommit>
                                'else' block
                    | 'if' <commit> condition
                                'then' block

In this case, a failure to find an "else" block in the first
production shouldn't preclude trying the second production, but a
failure to find a "condition" certainly should.

As a special case, any production in which the I<first> item is an
C<E<lt>uncommitE<gt>> immediately revokes a preceding C<E<lt>commitE<gt>>
(even though the production would not otherwise have been tried). For
example, in the rule:

        request: 'explain' expression
               | 'explain' <commit> keyword
               | 'save'
               | 'quit'
               | <uncommit> term '?'

if the text being matched was "explain?", and the first two
productions failed, then the C<E<lt>commitE<gt>> in production two would cause
productions three and four to be skipped, but the leading
C<E<lt>uncommitE<gt>> in production five would allow that production to
attempt a match.

Note, in the preceding example, that the C<E<lt>commitE<gt>> was only placed
in production two. If production one had been:

        request: 'explain' <commit> expression

then production two would be (inappropriately) skipped if a leading
"explain..." was encountered.

Both C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives always succeed, and their value
is always 1.

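The effect of a commitment is easiest to see by comparing two rules that
differ only in the presence of a C<E<lt>commitE<gt>>. The rule names
C<plain_command> and C<cut_command>, the fallback production, and the
C<filename> pattern in this sketch are all hypothetical:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                plain_command: 'find' filename           { "find $item{filename}" }
                             | /\w+/                     { "other: $item[1]" }

                cut_command:   'find' <commit> filename  { "find $item{filename}" }
                             | /\w+/                     { "other: $item[1]" }

                filename:      /[\w.]+/
        }) or die "Bad grammar\n";

        # A 'find' with no filename: only the uncommitted rule falls through
        my $plain = $parser->plain_command("find");
        my $cut   = $parser->cut_command("find");

        print "plain: ", (defined $plain ? $plain : "failed"), "\n";
        print "cut:   ", (defined $cut   ? $cut   : "failed"), "\n";

In the uncommitted rule the second production gets a chance to match the
word "find". In the committed rule it is never tried, so the rule fails.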

=item Rejecting a production

The C<E<lt>rejectE<gt>> directive immediately causes the current production
to fail (it is exactly equivalent to, but more obvious than, the
action C<{undef}>). A C<E<lt>rejectE<gt>> is useful when it is desirable to get
the side effects of the actions in one production, without prejudicing a match
by some other production later in the rule. For example, to insert
tracing code into the parse:

        complex_rule: { print "In complex rule...\n"; } <reject>

        complex_rule: simple_rule '+' 'i' '*' simple_rule
                    | 'i' '*' simple_rule
                    | simple_rule


It is also possible to specify a conditional rejection, using the
form C<E<lt>reject:I<condition>E<gt>>, which only rejects if the
specified condition is true. This form of rejection is exactly
equivalent to the action C<{(I<condition>)?undef:1}>.
For example:

        command: save_command
               | restore_command
               | <reject: defined $::tolerant> { exit }
               | <error: Unknown command. Ignored.>

A C<E<lt>rejectE<gt>> directive never succeeds (and hence has no
associated value). A conditional rejection may succeed (if its
condition is not satisfied), in which case its value is 1.

As an extra optimization, C<Parse::RecDescent> ignores any production
which I<begins> with an unconditional C<E<lt>rejectE<gt>> directive,
since any such production can never successfully match or have any
useful side-effects. A level 1 warning is issued in all such cases.

Note that productions beginning with conditional
C<E<lt>reject:...E<gt>> directives are I<never> "optimized away" in
this manner, even if they are always guaranteed to fail (for example:
C<E<lt>reject:1E<gt>>).

Due to the way grammars are parsed, there is a minor restriction on the
condition of a conditional C<E<lt>reject:...E<gt>>: it cannot
contain any raw '<' or '>' characters. For example:

        line: cmd <reject: $thiscolumn > max> data

results in an error when a parser is built from this grammar (since the
grammar parser has no way of knowing whether the first > is a "greater than"
or the end of the C<E<lt>reject:...E<gt>>).

To overcome this problem, put the condition inside a do{} block:

        line: cmd <reject: do{$thiscolumn > max}> data

Note that the same problem may occur in other directives that take
arguments. The same solution will work in all cases.
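
A conditional rejection can also act as a semantic check on a token that
has already matched. In this sketch the C<byte> rule and its limit of 255
are invented for illustration, and the condition is wrapped in a
C<do{}> block because it contains a C<E<gt>> character:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                byte: /\d+/ <reject: do { $item[1] > 255 }> { $item[1] }
        }) or die "Bad grammar\n";

        for my $input ("200", "300") {
                my $value = $parser->byte($input);
                print "$input: ", (defined $value ? "accepted" : "rejected"), "\n";
        }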

=item Skipping between terminals

The C<E<lt>skipE<gt>> directive enables the terminal prefix used in
a production to be changed. For example:

        OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/

causes only blanks and tabs to be skipped before terminals in the C<Arg>
subrule (and any of I<its> subrules), and also before the final C</;/> terminal.
Once the production is complete, the previous terminal prefix is
reinstated. Note that this implies that distinct productions of a rule
must reset their terminal prefixes individually.

The C<E<lt>skipE<gt>> directive evaluates to the I<previous> terminal prefix,
so it's easy to reinstate a prefix later in a production:

        Command: <skip:","> CSV(s) <skip:$item[1]> Modifier

The value specified after the colon is interpolated into a pattern, so all of
the following are equivalent (though their efficiency increases down the list):

        <skip: "$colon|$comma">   # ASSUMING THE VARS HOLD THE OBVIOUS VALUES

        <skip: ':|,'>

        <skip: q{[:,]}>

        <skip: qr/[:,]/>

There is no way of directly setting the prefix for
an entire rule, except as follows:

        Rule: <skip: '[ \t]*'> Prod1
            | <skip: '[ \t]*'> Prod2a Prod2b
            | <skip: '[ \t]*'> Prod3

or, better:

        Rule: <skip: '[ \t]*'>
            (
                Prod1
              | Prod2a Prod2b
              | Prod3
            )


B<Note: Up to release 1.51 of Parse::RecDescent, an entirely different
mechanism was used for specifying terminal prefixes. The current method
is not backwards-compatible with that early approach. The current approach
is stable and will not change again.>

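One common use of C<E<lt>skipE<gt>> is to make newlines significant. The
sketch below uses a made-up line-oriented grammar (the rules C<lines>,
C<line>, and C<word> are assumptions). With the default prefix, the
repeated C<word> subrule would skip over the newlines and the final
C</\n/> could never match:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                lines: line(s)
                line:  <skip: '[ \t]*'> word(s) /\n/   { $item[2] }
                word:  /\w+/
        }) or die "Bad grammar\n";

        # Each element of @$rows is a reference to the words on one line
        my $rows = $parser->lines("a b\nc d\n") or die "Parse failed\n";
        print join(",", @$_), "\n" for @$rows;

The directive occupies C<$item[1]> in the C<line> production, which is
why the list of words is found in C<$item[2]>.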

=item Resynchronization

The C<E<lt>resyncE<gt>> directive provides a visually distinctive
means of consuming some of the text being parsed, usually to skip an
erroneous input. In its simplest form C<E<lt>resyncE<gt>> simply
consumes text up to and including the next newline (C<"\n">)
character, succeeding only if the newline is found, in which case it
causes its surrounding rule to return zero on success.

In other words, a C<E<lt>resyncE<gt>> is exactly equivalent to the token
C</[^\n]*\n/> followed by the action S<C<{ $return = 0 }>> (except that
productions beginning with a C<E<lt>resyncE<gt>> are ignored when generating
error messages). A typical use might be:

        script : command(s)

        command: save_command
               | restore_command
               | <resync> # TRY NEXT LINE, IF POSSIBLE

It is also possible to explicitly specify a resynchronization
pattern, using the C<E<lt>resync:I<pattern>E<gt>> variant. This version
succeeds only if the specified pattern matches (and consumes) the
parsed text. In other words, C<E<lt>resync:I<pattern>E<gt>> is exactly
equivalent to the token C</I<pattern>/> (followed by a S<C<{ $return = 0 }>>
action). For example, if commands were terminated by newlines or semi-colons:

        command: save_command
               | restore_command
               | <resync:[^;\n]*[;\n]>

The value of a successfully matched C<E<lt>resyncE<gt>> directive (of either
type) is the text that it consumed. Note, however, that since the
directive also sets C<$return>, a production consisting of a lone
C<E<lt>resyncE<gt>> succeeds but returns the value zero (which a calling rule
may find useful to distinguish between "true" matches and "tolerant" matches).
Remember that returning a zero value indicates that the rule I<succeeded> (since
only an C<undef> denotes failure within C<Parse::RecDescent> parsers).

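A calling program can use the zero return value to tell real matches
from skipped input. The following is a sketch under assumed names: the
C<save> command syntax and the sample script are invented for the
purpose:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                script : command(s)

                command: 'save' /\w+/ ';'       { "save $item[2]" }
                       | <resync:[^;\n]*[;\n]>
        }) or die "Bad grammar\n";

        my $results = $parser->script("save a; bogus b; save c;")
                or die "Parse failed\n";

        # A false (but defined) entry marks a command that was skipped
        for my $result (@$results) {
                print $result ? "ok: $result\n" : "skipped a bad command\n";
        }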

=item Error handling

The C<E<lt>errorE<gt>> directive provides automatic or user-defined
generation of error messages during a parse. In its simplest form
C<E<lt>errorE<gt>> prepares an error message based on
the mismatch between the last item expected and the text which caused
it to fail. For example, given the rule:

        McCoy: curse ',' name ", I'm a doctor, not a" a_profession '!'
             | pronoun 'dead,' name '!'
             | <error>

the following strings would produce the following messages:

=over 4

=item "Amen, Jim!"

       ERROR (line 1): Invalid McCoy: Expected curse or pronoun
                       not found

=item "Dammit, Jim, I'm a doctor!"

       ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
                       but found ", I'm a doctor!" instead

=item "He's dead,\n"

       ERROR (line 2): Invalid McCoy: Expected name not found

=item "He's alive!"

       ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
                       "alive!" instead

=item "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"

       ERROR (line 1): Invalid McCoy: Expected a profession but found
                       "pointy-eared Vulcan!" instead


=back

Note that, when autogenerating error messages, all underscores in any
rule name used in a message are replaced by single spaces (for example
"a_production" becomes "a production"). Judicious choice of rule
names can therefore considerably improve the readability of automatic
error messages (as well as the maintainability of the original
grammar).

If the automatically generated error is not sufficient, it is possible to
provide an explicit message as part of the error directive. For example:

        Spock: "Fascinating" ',' (name | 'Captain') '.'
             | "Highly illogical, doctor."
             | <error: He never said that!>

which would result in I<all> failures to parse a "Spock" subrule printing the
following message:

       ERROR (line <N>): Invalid Spock:  He never said that!

The error message is treated as a "qq{...}" string and interpolated
when the error is generated (I<not> when the directive is specified!).
Hence:

        <error: Mystical error near "$text">

would correctly insert the ambient text string which caused the error.
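
As a minimal sketch of an explicit message in a complete program (the
C<greeting> and C<name> rules and the input are hypothetical), the
following parse fails and the queued message is then reported on
STDERR:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                greeting: 'hello' name
                        | <error: Expected a greeting, but found "$text">

                name:     /[A-Z]\w*/
        }) or die "Bad grammar\n";

        # Neither 'hello' nor a name is present, so the rule fails
        my $result = $parser->greeting("goodbye Jim");
        print "no greeting recognized\n" unless defined $result;

The exact layout of the reported message is left to the module, so none
is shown here.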

There are two other forms of error directive: C<E<lt>error?E<gt>> and
S<C<E<lt>error?: msgE<gt>>>. These behave just like C<E<lt>errorE<gt>>
and S<C<E<lt>error: msgE<gt>>> respectively, except that they are
only triggered if the rule is "committed" at the time they are
encountered. For example:

        Scotty: "Ya kenna change the Laws of Phusics," <commit> name
              | name <commit> ',' "she's goanta blaw!"
              | <error?>

will only generate an error for a string beginning with "Ya kenna
change the Laws of Phusics," or a valid name, but which still fails to match the
corresponding production. That is, C<$parser-E<gt>Scotty("Aye, Cap'ain")> will
fail silently (since neither production will "commit" the rule on that
input), whereas S<C<$parser-E<gt>Scotty("Mr Spock, ah jest kenna do'ut!")>>
will fail with the error message:

       ERROR (line 1): Invalid Scotty: expected 'she's goanta blaw!'
                       but found 'ah jest kenna do'ut!' instead.

since in that case the second production would commit after matching
the leading name.

Note that to allow this behaviour, all C<E<lt>errorE<gt>> directives which are
the first item in a production automatically uncommit the rule just
long enough to allow their production to be attempted (that is, when
their production fails, the commitment is reinstated so that
subsequent productions are skipped).

In order to I<permanently> uncommit the rule before an error message,
it is necessary to put an explicit C<E<lt>uncommitE<gt>> before the
C<E<lt>errorE<gt>>. For example:

        line: 'Kirk:'  <commit> Kirk
            | 'Spock:' <commit> Spock
            | 'McCoy:' <commit> McCoy
            | <uncommit> <error?> <reject>
            | <resync>


Error messages generated by the various C<E<lt>error...E<gt>> directives
are not displayed immediately. Instead, they are "queued" in a buffer and
are only displayed once parsing ultimately fails. Moreover,
C<E<lt>error...E<gt>> directives that cause one production of a rule
to fail are automatically removed from the message queue
if another production subsequently causes the entire rule to succeed.
This means that you can put
C<E<lt>error...E<gt>> directives wherever useful diagnosis can be done,
and only those associated with actual parser failure will ever be
displayed. Also see L<"Gotchas">.

As a general rule, the most useful diagnostics are usually generated
either at the very lowest level within the grammar, or at the very
highest. A good rule of thumb is to identify those subrules which
consist mainly (or entirely) of terminals, and then put an
C<E<lt>error...E<gt>> directive at the end of any other rule which calls
one or more of those subrules.

There is one other situation in which the output of the various types of
error directive is suppressed; namely, when the rule containing them
is being parsed as part of a "look-ahead" (see L<"Look-ahead">). In this
case, the error directive will still cause the rule to fail, but will do
so silently.

An unconditional C<E<lt>errorE<gt>> directive always fails (and hence has no
associated value). This means that encountering such a directive
always causes the production containing it to fail. Hence an
C<E<lt>errorE<gt>> directive will inevitably be the last (useful) item of a
rule (a level 3 warning is issued if a production contains items after an unconditional
C<E<lt>errorE<gt>> directive).

An C<E<lt>error?E<gt>> directive will I<succeed> (that is: fail to fail :-), if
the current rule is uncommitted when the directive is encountered. In
that case the directive's associated value is zero. Hence, this type
of error directive I<can> be used before the end of a
production. For example:

        command: 'do' <commit> something
               | 'report' <commit> something
               | <error?: Syntax error> <error: Unknown command>


B<Warning:> The C<E<lt>error?E<gt>> directive does I<not> mean "always fail (but
do so silently unless committed)". It actually means "only fail (and report) if
committed, otherwise I<succeed>". To achieve the "fail silently if uncommitted"
semantics, it is necessary to use:

        rule: item <commit> item(s)
            | <error?> <reject>      # FAIL SILENTLY UNLESS COMMITTED

However, because people seem to expect a lone C<E<lt>error?E<gt>> directive
to work like this:

        rule: item <commit> item(s)
            | <error?: Error message if committed>
            | <error:  Error message if uncommitted>

Parse::RecDescent automatically appends a
C<E<lt>rejectE<gt>> directive if the C<E<lt>error?E<gt>> directive
is the only item in a production. A level 2 warning (see below)
is issued when this happens.

The level of error reporting during both parser construction and
parsing is controlled by the presence or absence of four global
variables: C<$::RD_ERRORS>, C<$::RD_WARN>, C<$::RD_HINT>, and
C<$::RD_TRACE>. If C<$::RD_ERRORS> is defined (and, by default, it is)
then fatal errors are reported.

Whenever C<$::RD_WARN> is defined, certain non-fatal problems are also reported.
Warnings have an associated "level": 1, 2, or 3. The higher the level,
the more serious the warning. The value of the corresponding global
variable (C<$::RD_WARN>) determines the I<lowest> level of warning to
be displayed. Hence, to see I<all> warnings, set C<$::RD_WARN> to 1.
To see only the most serious warnings set C<$::RD_WARN> to 3.
By default C<$::RD_WARN> is initialized to 3, ensuring that serious but
non-fatal errors are automatically reported.

See L<"DIAGNOSTICS"> for a list of the various error and warning messages
that Parse::RecDescent generates when these two variables are defined.

Defining any of the remaining variables (which are not defined by
default) further increases the amount of information reported.
Defining C<$::RD_HINT> causes the parser generator to offer
more detailed analyses and hints on both errors and warnings.
Note that setting C<$::RD_HINT> at any point automagically
sets C<$::RD_WARN> to 1.

Defining C<$::RD_TRACE> causes the parser generator and the parser to
report their progress to STDERR in excruciating detail (although without hints,
unless C<$::RD_HINT> is separately defined). This detail
can be moderated in only one respect: if C<$::RD_TRACE> has an
integer value (I<N>) greater than 1, only I<N> characters of
the "current parsing context" (that is, where in the input string we
are at any point in the parse) are reported at any time.

C<$::RD_TRACE> is mainly useful for debugging a grammar that isn't
behaving as you expected it to. To this end, if C<$::RD_TRACE> is
defined when a parser is built, any actual parser code which is
generated is also written to a file named "RD_TRACE" in the local
directory.
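
A sketch of setting these flags from within a program follows. The
grammar is a placeholder, and the particular values are only one
reasonable choice. The flags are assigned before the parser is built so
that they also apply to diagnostics produced while the grammar is being
compiled:

        use strict;
        use warnings;
        use Parse::RecDescent;

        $::RD_ERRORS = 1;       # report fatal errors (on by default)
        $::RD_WARN   = 1;       # report warnings of every level
        $::RD_HINT   = 1;       # add explanatory hints

        my $parser = Parse::RecDescent->new(q{
                greeting: 'hello' name
                name:     /\w+/
        }) or die "Bad grammar\n";

        print defined $parser->greeting("hello world") ? "ok\n" : "failed\n";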

Note that the four variables belong to the "main" package, which
makes them easier to refer to in the code controlling the parser, and
also makes it easy to turn them into command line flags ("-RD_ERRORS",
"-RD_WARN", "-RD_HINT", "-RD_TRACE") under B<perl -s>.

=item Specifying local variables

It is occasionally convenient to specify variables which are local
to a single rule. This may be achieved by including a
C<E<lt>rulevar:...E<gt>> directive anywhere in the rule. For example:

        markup: <rulevar: $tag>

        markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag]

The example C<E<lt>rulevar: $tagE<gt>> directive causes a "my" variable named
C<$tag> to be declared at the start of the subroutine implementing the
C<markup> rule (that is, I<before> the first production, regardless of
where in the rule it is specified).

Specifically, any directive of the form:
C<E<lt>rulevar:I<text>E<gt>> causes a line of the form C<my I<text>;>
to be added at the beginning of the rule subroutine, immediately after
the definitions of the following local variables:

        $thisparser     $commit
        $thisrule       @item
        $thisline       @arg
        $text           %arg

This means that the following C<E<lt>rulevarE<gt>> directives work
as expected:

        <rulevar: $count = 0 >

        <rulevar: $firstarg = $arg[0] || '' >

        <rulevar: $myItems = \@item >

        <rulevar: @context = ( $thisline, $text, @arg ) >

        <rulevar: ($name,$age) = @arg{"name","age"} >

If a variable that is also visible to subrules is required, it needs
to be C<local>'d, not C<my>'d. C<rulevar> defaults to C<my>, but if C<local>
is explicitly specified:

        <rulevar: local $count = 0 >

then a C<local>-ized variable is declared instead, and will be available
within subrules.

Note however that, because all such variables are "my" variables, their
values I<do not persist> between match attempts on a given rule. To
preserve values between match attempts, values can be stored within the
"local" member of the C<$thisrule> object:

        countedrule: { $thisrule->{"local"}{"count"}++ }
                     <reject>
                   | subrule1
                   | subrule2
                   | <reject: $thisrule->{"local"}{"count"} == 1>
                     subrule3


When matching a rule, each C<E<lt>rulevarE<gt>> directive is matched as
if it were an unconditional C<E<lt>rejectE<gt>> directive (that is, it
causes any production in which it appears to immediately fail to match).
For this reason (and to improve readability) it is usual to specify any
C<E<lt>rulevarE<gt>> directive in a separate production at the start of
the rule (this has the added advantage that it enables
C<Parse::RecDescent> to optimize away such productions, just as it does
for the C<E<lt>rejectE<gt>> directive).

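As a sketch of a C<local>-ized rule variable shared with a subrule, the
hypothetical C<list> rule below counts the words matched by its C<word>
subrule. The rule names and input are assumptions made for
illustration:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                list: <rulevar: local $count = 0>

                list: word(s)   { $count }

                word: /\w+/     { $count++; $item[1] }
        }) or die "Bad grammar\n";

        my $total = $parser->list("a b c");
        print "words counted: $total\n";

Had the variable been declared without C<local>, it would have been a
"my" variable of the C<list> rule and therefore not visible inside
C<word>.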

=item Dynamically matched rules

Because regexes and double-quoted strings are interpolated, it is relatively
easy to specify productions with "context sensitive" tokens. For example:

        command:  keyword  body  "end $item[1]"

which ensures that a command block is bounded by a
"I<E<lt>keywordE<gt>>...end I<E<lt>same keywordE<gt>>" pair.

Building productions in which subrules are context sensitive is also possible,
via the C<E<lt>matchrule:...E<gt>> directive. This directive behaves
identically to a subrule item, except that the rule which is invoked to match
it is determined by the string specified after the colon. For example, we could
rewrite the C<command> rule like this:

        command:  keyword  <matchrule:body>  "end $item[1]"

Whatever appears after the colon in the directive is treated as an interpolated
string (that is, as if it appeared in a C<qq{...}> operator) and the value of
that interpolated string is the name of the subrule to be matched.

Of course, just putting a constant string like C<body> in a
C<E<lt>matchrule:...E<gt>> directive is of little interest or benefit.
The power of the directive is seen when we use a string that interpolates
to something interesting. For example:

        command:        keyword <matchrule:$item[1]_body> "end $item[1]"

        keyword:        'while' | 'if' | 'function'

        while_body:     condition block

        if_body:        condition block ('else' block)(?)

        function_body:  arglist block

Now the C<command> rule selects how to proceed on the basis of the keyword
that is found. It is as if C<command> were declared:

        command:        'while'    while_body    "end while"
               |        'if'       if_body       "end if"
               |        'function' function_body "end function"

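A cut-down, runnable variation on this idea is sketched below. The two
body rules and their actions are simplified stand-ins invented for the
example, not part of the grammar above:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
                command:    keyword <matchrule:$item[1]_body> "end $item[1]"
                                { "$item[1]: $item[2]" }

                keyword:    'while' | 'if'

                while_body: /\w+/ 'do'   /\w+/  { "repeat $item[3] while $item[1]" }

                if_body:    /\w+/ 'then' /\w+/  { "run $item[3] if $item[1]" }
        }) or die "Bad grammar\n";

        my @inputs = (
                "while hungry do eat end while",
                "if tired then sleep end if",
                "while hungry do eat end if",      # mismatched terminator
        );
        my @results = map { $parser->command($_) } @inputs;

        print defined $_ ? "$_\n" : "no match\n" for @results;

The third input is expected to fail, because the interpolated
terminator "end while" does not appear in it.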

When a C<E<lt>matchrule:...E<gt>> directive is used as a repeated
subrule, the rule name expression is "late-bound". That is, the name of
the rule to be called is re-evaluated I<each time> a match attempt is
made. Hence, the following grammar:

        { $::species = 'dogs' }

        pair:   'two' <matchrule:$::species>(s)

        dogs:   /dogs/ { $::species = 'cats' }

        cats:   /cats/

will match the string "two dogs cats cats" completely, whereas it will
only match the string "two dogs dogs dogs" up to the eighth letter. If
the rule name were "early bound" (that is, evaluated only the first
time the directive is encountered in a production), the reverse
behaviour would be expected.

Note that the C<matchrule> directive takes a string that is to be treated
as a rule name, I<not> as a rule invocation. That is,
it's like a Perl symbolic reference, not an C<eval>. Just as you can say:

        $subname = 'foo';

	# and later...

        &{$subname}(@args);

but not:

        $subname = 'foo(@args)';

	# and later...

        &{$subname};

likewise you can say:

	$rulename = 'foo';

	# and in the grammar...

	<matchrule:$rulename>[@args]

but not:

	$rulename = 'foo[@args]';

	# and in the grammar...

	<matchrule:$rulename>


=item Deferred actions

The C<E<lt>defer:...E<gt>> directive is used to specify an action to be
performed when (and only if!) the current production ultimately succeeds.

Whenever a C<E<lt>defer:...E<gt>> directive appears, the code it specifies
is converted to a closure (an anonymous subroutine reference) which is
queued within the active parser object. Note that,
because the deferred code is converted to a closure, the values of any
"local" variable (such as C<$text>, C<@item>, etc.) are preserved
until the deferred code is actually executed.

If the parse ultimately succeeds
I<and> the production in which the C<E<lt>defer:...E<gt>> directive was
evaluated formed part of the successful parse, then the deferred code is
executed immediately before the parse returns. If however the production
which queued a deferred action fails, or one of the higher-level
rules which called that production fails, then the deferred action is
removed from the queue, and hence is never executed.

For example, given the grammar:

        sentence: noun trans noun
                | noun intrans

        noun:     'the dog'
                        { print "$item[1]\t(noun)\n" }
            |     'the meat'
                        { print "$item[1]\t(noun)\n" }

        trans:    'ate'
                        { print "$item[1]\t(transitive)\n" }

        intrans:  'ate'
                        { print "$item[1]\t(intransitive)\n" }
               |  'barked'
                        { print "$item[1]\t(intransitive)\n" }

then parsing the sentence C<"the dog ate"> would produce the output:

        the dog  (noun)
        ate      (transitive)
        the dog  (noun)
        ate      (intransitive)

This is because, even though the first production of C<sentence>
ultimately fails, its initial subrules C<noun> and C<trans> do match,
and hence they execute their associated actions.
Then the second production of C<sentence> succeeds, causing the
actions of the subrules C<noun> and C<intrans> to be executed as well.

On the other hand, if the actions were replaced by C<E<lt>defer:...E<gt>>
directives:

        sentence: noun trans noun
                | noun intrans

        noun:     'the dog'
                        <defer: print "$item[1]\t(noun)\n" >
            |     'the meat'
                        <defer: print "$item[1]\t(noun)\n" >

        trans:    'ate'
                        <defer: print "$item[1]\t(transitive)\n" >

        intrans:  'ate'
                        <defer: print "$item[1]\t(intransitive)\n" >
               |  'barked'
                        <defer: print "$item[1]\t(intransitive)\n" >

the output would be:

        the dog  (noun)
        ate      (intransitive)

since deferred actions are only executed if they were evaluated in
a production which ultimately contributes to the successful parse.

In this case, even though the first production of C<sentence> caused
the subrules C<noun> and C<trans> to match, that production ultimately
failed and so the deferred actions queued by those subrules were subsequently
discarded. The second production then succeeded, causing the entire
parse to succeed, and so the deferred actions queued by the (second) match of
the C<noun> subrule and the subsequent match of C<intrans> I<are> preserved and
eventually executed.

Deferred actions provide a means of improving the performance of a parser,
by only executing those actions which are part of the final parse-tree
for the input data.

Alternatively, deferred actions can be viewed as a mechanism for building
(and executing) a
customized subroutine corresponding to the given input data, much in the
same way that autoactions (see L<"Autoactions">) can be used to build a
customized data structure for specific input.

Whether or not the action it specifies is ever executed,
a C<E<lt>defer:...E<gt>> directive always succeeds, returning the
number of deferred actions currently queued at that point.

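The deferred version of the grammar can be exercised from a program as
sketched below. To make the result easy to inspect, the deferred actions
here append to an array instead of printing. The array C<@seen>
(referred to as C<@::seen> inside the grammar) is an assumption of this
sketch:

        use strict;
        use warnings;
        use Parse::RecDescent;

        our @seen;      # filled in by the deferred actions

        my $parser = Parse::RecDescent->new(q{
                sentence: noun trans noun
                        | noun intrans

                noun:     'the dog'
                                <defer: push @::seen, "$item[1] (noun)" >
                    |     'the meat'
                                <defer: push @::seen, "$item[1] (noun)" >

                trans:    'ate'
                                <defer: push @::seen, "$item[1] (transitive)" >

                intrans:  'ate'
                                <defer: push @::seen, "$item[1] (intransitive)" >
                       |  'barked'
                                <defer: push @::seen, "$item[1] (intransitive)" >
        }) or die "Bad grammar\n";

        defined $parser->sentence("the dog ate") or die "Parse failed\n";
        print "$_\n" for @seen;

Only the actions queued by the successful second production should be
recorded, in the order in which they were queued.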

=item Parsing Perl

Parse::RecDescent provides limited support for parsing subsets of Perl,
namely: quote-like operators, Perl variables, and complete code blocks.

The C<E<lt>perl_quotelikeE<gt>> directive can be used to parse any Perl
quote-like operator: C<'a string'>, C<m/a pattern/>, C<tr{ans}{lation}>,
etc.  It does this by calling Text::Balanced::extract_quotelike().

If a quote-like operator is found, a reference to an array of eight elements
is returned. Those elements are identical to the last eight elements returned
by Text::Balanced::extract_quotelike() in an array context, namely:

=over 4

=item [0]

the name of the quotelike operator -- 'q', 'qq', 'm', 's', 'tr' -- if the
operator was named; otherwise C<undef>,

=item [1]

the left delimiter of the first block of the operation,

=item [2]

the text of the first block of the operation
(that is, the contents of a quote, the regex of a match or substitution,
or the target list of a translation),

=item [3]

the right delimiter of the first block of the operation,

=item [4]

the left delimiter of the second block of the operation if there is one
(that is, if it is a C<s>, C<tr>, or C<y>); otherwise C<undef>,

=item [5]

the text of the second block of the operation if there is one
(that is, the replacement of a substitution or the translation list
of a translation); otherwise C<undef>,

=item [6]

the right delimiter of the second block of the operation (if any);
otherwise C<undef>,

=item [7]

the trailing modifiers on the operation (if any); otherwise C<undef>.

=back

If a quote-like expression is not found, the directive fails with the usual
C<undef> value.
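
As a rough illustration of those eight elements, the following sketch calls
C<Text::Balanced::extract_quotelike()> directly (rather than through a
grammar) and keeps only the last eight values of the list it returns. The
sample input, the labels, and the variable names are assumptions made for
this example:

        use strict;
        use warnings;
        use Text::Balanced qw(extract_quotelike);

        # Sample input: a substitution with bracketing delimiters and modifiers
        my $text = 's{foo}{bar}gi; rest of the input';

        # In list context the extracted text, the remainder and the skipped
        # prefix come first; the last eight elements describe the operator
        my @all    = extract_quotelike($text);
        my @fields = @all[-8 .. -1];

        my @labels = ('name', 'left delim 1', 'text 1', 'right delim 1',
                      'left delim 2', 'text 2', 'right delim 2', 'modifiers');

        for my $i (0 .. $#labels) {
            my $value = defined $fields[$i] ? $fields[$i] : 'undef';
            print "[$i] $labels[$i]: $value\n";
        }

A successful C<E<lt>perl_quotelikeE<gt>> item should yield a reference to an
array with the same layout as C<@fields>.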

The C<E<lt>perl_variableE<gt>> directive can be used to parse any Perl
variable: $scalar, @array, %hash, $ref->{field}[$index], etc.
It does this by calling Text::Balanced::extract_variable().

If the directive matches text representing a valid Perl variable
specification, it returns that text. Otherwise it fails with the usual
C<undef> value.

The C<E<lt>perl_codeblockE<gt>> directive can be used to parse a
curly-brace-delimited block of Perl code, such as:
C<{ $a = 1; f() =~ m/pat/; }>.
It does this by calling Text::Balanced::extract_codeblock().

If the directive matches text representing a valid Perl code block,
it returns that text. Otherwise it fails with the usual C<undef> value.

You can also tell it what kind of brackets to use as the outermost
delimiters. For example:

	arglist: <perl_codeblock ()>

causes an arglist to match a perl code block whose outermost delimiters
are C<(...)> (rather than the default C<{...}>).


=item Constructing tokens

Eventually, Parse::RecDescent will be able to parse tokenized input, as
well as ordinary strings. In preparation for this joyous day, the
C<E<lt>token:...E<gt>> directive has been provided.
This directive creates a token which will be suitable for
input to a Parse::RecDescent parser (when it eventually supports
tokenized input).

The text of the token is the value of the
immediately preceding item in the production. A
C<E<lt>token:...E<gt>> directive always succeeds with a return
value which is the hash reference that is the new token. It also
sets the return value for the production to that hash ref.

The C<E<lt>token:...E<gt>> directive makes it easy to build
a Parse::RecDescent-compatible lexer in Parse::RecDescent:

        my $lexer = new Parse::RecDescent q
        {
                lex:    token(s)

                token:  /a\b/                      <token:INDEF>
                     |  /the\b/                    <token:DEF>
                     |  /fly\b/                    <token:NOUN,VERB>
                     |  /[a-z]+/i { lc $item[1] }  <token:ALPHA>
                     |  <error: Unknown token>

        };

which will eventually be able to be used with a regular Parse::RecDescent
grammar:

        my $parser = new Parse::RecDescent q
        {
                startrule: subrule1 subrule2

                # ETC...
        };

either with a pre-lexing phase:

        $parser->startrule( $lexer->lex($data) );

or with a lex-on-demand approach:

        $parser->startrule( sub{$lexer->token(\$data)} );

But at present, only the C<E<lt>token:...E<gt>> directive is
actually implemented. The rest is vapourware.

=item Specifying operations

One of the commonest requirements when building a parser is to specify
binary operators. Unfortunately, in a normal grammar, the rules for
such things are awkward:

        disjunction:    conjunction ('or' conjunction)(s?)
                                { $return = [ $item[1], @{$item[2]} ] }

        conjunction:    atom ('and' atom)(s?)
                                { $return = [ $item[1], @{$item[2]} ] }

or inefficient:

        disjunction:    conjunction 'or' disjunction
                                { $return = [ $item[1], @{$item[3]} ] }
                   |    conjunction
                                { $return = [ $item[1] ] }

        conjunction:    atom 'and' conjunction
                                { $return = [ $item[1], @{$item[3]} ] }
                   |    atom
                                { $return = [ $item[1] ] }

and either way is ugly and hard to get right.

The C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives provide an
easier way of specifying such operations. Using C<E<lt>leftop:...E<gt>> the
above examples become:

        disjunction:    <leftop: conjunction 'or' conjunction>
        conjunction:    <leftop: atom 'and' atom>

The C<E<lt>leftop:...E<gt>> directive specifies a left-associative binary operator.
It is specified around three other grammar elements
(typically subrules or terminals), which match the left operand,
the operator itself, and the right operand respectively.

A C<E<lt>leftop:...E<gt>> directive such as:

        disjunction:    <leftop: conjunction 'or' conjunction>

is converted to the following:

        disjunction:    ( conjunction ('or' conjunction)(s?)
                                { $return = [ $item[1], @{$item[2]} ] } )

In other words, a C<E<lt>leftop:...E<gt>> directive matches the left operand followed by zero
or more repetitions of both the operator and the right operand. It then
flattens the matched items into an anonymous array which becomes the
(single) value of the entire C<E<lt>leftop:...E<gt>> directive.

For example, a C<E<lt>leftop:...E<gt>> directive such as:

        output:  <leftop: ident '<<' expr >

when given a string such as:

        cout << var << "str" << 3

would match, and C<$item[1]> would be set to:

        [ 'cout', 'var', '"str"', '3' ]

In other words:

        output:  <leftop: ident '<<' expr >

is equivalent to a left-associative operator:

        output:  ident                                  { $return = [$item[1]]       }
              |  ident '<<' expr                        { $return = [@item[1,3]]     }
              |  ident '<<' expr '<<' expr              { $return = [@item[1,3,5]]   }
              |  ident '<<' expr '<<' expr '<<' expr    { $return = [@item[1,3,5,7]] }
              #  ...etc...


Similarly, the C<E<lt>rightop:...E<gt>> directive takes a left operand, an operator, and a right operand:

        assign:  <rightop: var '=' expr >

and converts them to:

        assign:  ( (var '=' {$return=$item[1]})(s?) expr
                                { $return = [ @{$item[1]}, $item[2] ] } )

which is equivalent to a right-associative operator:

        assign:  var                            { $return = [$item[1]]       }
              |  var '=' expr                   { $return = [@item[1,3]]     }
              |  var '=' var '=' expr           { $return = [@item[1,3,5]]   }
              |  var '=' var '=' var '=' expr   { $return = [@item[1,3,5,7]] }
              #  ...etc...


Note that for both the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives, the directive does not normally
return the operator itself, just a list of the operands involved. This is
particularly handy for specifying lists:

        list: '(' <leftop: list_item ',' list_item> ')'
                        { $return = $item[2] }

There is, however, a problem: sometimes the operator is itself significant.
For example, in a Perl list a comma and a C<=E<gt>> are both
valid separators, but the C<=E<gt>> has additional stringification semantics.
Hence it's important to know which was used in each case.

To solve this problem the
C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
I<do> return the operator(s) as well, under two circumstances.
The first case is where the operator is specified as a subrule. In that instance,
whatever the operator matches is returned (on the assumption that if the operator
is important enough to have its own subrule, then it's important enough to return).

The second case is where the operator is specified as a regular
expression. In that case, if the first bracketed subpattern of the
regular expression matches, that matching value is returned (this is analogous to
the behaviour of the Perl C<split> function, except that only the first subpattern
is returned).

In other words, given the input:

        ( a=>1, b=>2 )

the specifications:

        list:      '('  <leftop: list_item separator list_item>  ')'

        separator: ',' | '=>'

or:

        list:      '('  <leftop: list_item /(,|=>)/ list_item>  ')'

cause the list separators to be interleaved with the operands in the
anonymous array in C<$item[2]>:

        [ 'a', '=>', '1', ',', 'b', '=>', '2' ]


But the following version:

        list:      '('  <leftop: list_item /,|=>/ list_item>  ')'

returns only the operands:

        [ 'a', '1', 'b', '2' ]

Of course, none of the above specifications handle the case of an empty
list, since the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
require at least a single right or left operand to match. To specify
that the operator can match "trivially",
it's necessary to add a C<(?)> qualifier to the directive:

        list:      '('  <leftop: list_item /(,|=>)/ list_item>(?)  ')'

Note that in almost all the above examples, the first and third arguments
of the C<E<lt>leftop:...E<gt>> directive were the same subrule. That is because
C<E<lt>leftop:...E<gt>>'s are frequently used to specify "separated" lists of the
same type of item. To make such lists easier to specify, the following
syntax:

        list:   element(s /,/)

is exactly equivalent to:

        list:   <leftop: element /,/ element>

Note that the separator must be specified as a raw pattern (i.e.
not a string or subrule).
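
The effect of capturing parentheses in the separator pattern can be checked
with a small script. This is only a sketch: the rule names C<with_ops> and
C<without_ops>, and the C<list_item> pattern, are invented for the example.

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
            with_ops:    '(' <leftop: list_item /(,|=>)/ list_item> ')'
                                { $return = $item[2] }

            without_ops: '(' <leftop: list_item /,|=>/ list_item> ')'
                                { $return = $item[2] }

            list_item:   /\w+/
        };

        my $parser = Parse::RecDescent->new($grammar)
            or die "Bad grammar\n";

        my $input = '( a=>1, b=>2 )';

        # Capturing parentheses in the separator: separators are returned too
        my $with = $parser->with_ops($input);

        # No capturing parentheses: only the operands are returned
        my $without = $parser->without_ops($input);

        print scalar(@$with),    " elements: @$with\n";
        print scalar(@$without), " elements: @$without\n";

According to the rules described above, the first result should interleave
the separators with the operands, while the second should hold the operands
alone.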


=item Scored productions

By default, Parse::RecDescent grammar rules always accept the first
production that matches the input. But if two or more productions may
potentially match the same input, choosing the first that does so may
not be optimal.

For example, if you were parsing the sentence "time flies like an arrow",
you might use a rule like this:

        sentence: verb noun preposition article noun { [@item] }
                | adjective noun verb article noun   { [@item] }
                | noun verb preposition article noun { [@item] }

Each of these productions matches the sentence, but the third one
is the most likely interpretation. However, if the sentence had been
"fruit flies like a banana", then the second production is probably
the right match.

To cater for such situations, the C<E<lt>score:...E<gt>> directive can be used.
The directive is equivalent to an unconditional C<E<lt>rejectE<gt>>,
except that it allows you to specify a "score" for the current
production. If that score is numerically greater than the best
score of any preceding production, the current production is cached for later
consideration. If no later production matches, then the cached
production is treated as having matched, and the value of the
item immediately before its C<E<lt>score:...E<gt>> directive is returned as the
result.

In other words, by putting a C<E<lt>score:...E<gt>> directive at the end of
each production, you can select which production matches using
criteria other than specification order. For example:

        sentence: verb noun preposition article noun { [@item] } <score: sensible(@item)>
                | adjective noun verb article noun   { [@item] } <score: sensible(@item)>
                | noun verb preposition article noun { [@item] } <score: sensible(@item)>

Now, when each production reaches its respective C<E<lt>score:...E<gt>>
directive, the subroutine C<sensible> will be called to evaluate the
matched items (somehow). Once all productions have been tried, the
one which C<sensible> scored most highly will be the one that is
accepted as a match for the rule.

The variable C<$score> always holds the current best score of any production,
and the variable C<$score_return> holds the corresponding return value.

As another example, the following grammar matches lines that may be
separated by commas, colons, or spaces. This can be tricky if
a colon-separated line also contains commas, or vice versa. The grammar
resolves the ambiguity by selecting the production that results in the
fewest fields:

        line: seplist[sep=>',']  <score: -@{$item[1]}>
            | seplist[sep=>':']  <score: -@{$item[1]}>
            | seplist[sep=>" "]  <score: -@{$item[1]}>

        seplist: <skip:""> <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/>

Note the use of negation within the C<E<lt>score:...E<gt>> directive
to ensure that the seplist with the most items gets the lowest score.

As the above examples indicate, it is often the case that all productions
in a rule use exactly the same C<E<lt>score:...E<gt>> directive. It is
tedious to have to repeat this identical directive in every production, so
Parse::RecDescent also provides the C<E<lt>autoscore:...E<gt>> directive.

If an C<E<lt>autoscore:...E<gt>> directive appears in any
production of a rule, the code it specifies is used as the scoring
code for every production of that rule, except productions that already
end with an explicit C<E<lt>score:...E<gt>> directive. Thus the rules above could
be rewritten:

        line: <autoscore: -@{$item[1]}>
        line: seplist[sep=>',']
            | seplist[sep=>':']
            | seplist[sep=>" "]


        sentence: <autoscore: sensible(@item)>
                | verb noun preposition article noun { [@item] }
                | adjective noun verb article noun   { [@item] }
                | noun verb preposition article noun { [@item] }

Note that the C<E<lt>autoscore:...E<gt>> directive itself acts as an
unconditional C<E<lt>rejectE<gt>>, and (like the C<E<lt>rulevar:...E<gt>>
directive) is pruned at compile-time wherever possible.


=item Dispensing with grammar checks

During the compilation phase of parser construction, Parse::RecDescent performs
a small number of checks on the grammar it's given. Specifically it checks that
the grammar is not left-recursive, that there are no "insatiable" constructs of
the form:

        rule: subrule(s) subrule

and that there are no rules missing (i.e. referred to, but never defined).

These checks are important during development, but can slow down parser
construction in stable code. So Parse::RecDescent provides the
C<E<lt>nocheckE<gt>> directive to turn them off. The directive can only appear
before the first rule definition, and switches off checking throughout the rest
of the current grammar.

Typically, this directive would be added when a parser has been thoroughly
tested and is ready for release.

=back


=head2 Subrule argument lists

It is occasionally useful to pass data to a subrule which is being invoked. For
example, consider the following grammar fragment:

        classdecl: keyword decl

        keyword:   'struct' | 'class';

        decl:      # WHATEVER

The C<decl> rule might wish to know which of the two keywords was used
(since it may affect some aspect of the way the subsequent declaration
is interpreted). C<Parse::RecDescent> allows the grammar designer to
pass data into a rule, by placing that data in an I<argument list>
(that is, in square brackets) immediately after any subrule item in a
production. Hence, we could pass the keyword to C<decl> as follows:

        classdecl: keyword decl[ $item[1] ]

        keyword:   'struct' | 'class';

        decl:      # WHATEVER

The argument list can consist of any number (including zero!) of comma-separated
Perl expressions. In other words, it looks exactly like a Perl anonymous
array reference. For example, we could pass the keyword, the name of the
surrounding rule, and the literal 'keyword' to C<decl> like so:

        classdecl: keyword decl[$item[1],$item[0],'keyword']

        keyword:   'struct' | 'class';

        decl:      # WHATEVER

Within the rule to which the data is passed (C<decl> in the above examples)
that data is available as the elements of a local variable C<@arg>. Hence
C<decl> might report its intentions as follows:

        classdecl: keyword decl[$item[1],$item[0],'keyword']

        keyword:   'struct' | 'class';

        decl:      { print "Declaring $arg[0] (a $arg[2])\n";
                     print "(this rule called by $arg[1])" }

Subrule argument lists can also be interpreted as hashes, simply by using
the local variable C<%arg> instead of C<@arg>. Hence we could rewrite the
previous example:

        classdecl: keyword decl[keyword => $item[1],
                                caller  => $item[0],
                                type    => 'keyword']

        keyword:   'struct' | 'class';

        decl:      { print "Declaring $arg{keyword} (a $arg{type})\n";
                     print "(this rule called by $arg{caller})" }

Both C<@arg> and C<%arg> are always available, so the grammar designer may
choose whichever convention (or combination of conventions) suits best.
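
The following self-contained sketch shows named arguments reaching a subrule
through C<%arg>. The C<decl> pattern and the sample input are assumptions
chosen only to make the fragment runnable:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
            classdecl: keyword decl[ keyword => $item[1] ]

            keyword:   'struct' | 'class'

            # %arg holds the named arguments passed in by classdecl
            decl:      /\w+/  { $return = "$arg{keyword} $item[1]" }
        };

        my $parser = Parse::RecDescent->new($grammar)
            or die "Bad grammar\n";

        # classdecl has no action, so it returns the value of decl
        my $result = $parser->classdecl('class Widget');

        print defined $result ? "$result\n" : "no match\n";

Here the value of C<classdecl> is whatever C<decl> returns, which combines
the keyword it was passed with the name it matched itself.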

Subrule argument lists are also useful for creating "rule templates"
(especially when used in conjunction with the C<E<lt>matchrule:...E<gt>>
directive). For example, the subrule:

        list:     <matchrule:$arg{rule}> /$arg{sep}/ list[%arg]
                        { $return = [ $item[1], @{$item[3]} ] }
            |     <matchrule:$arg{rule}>
                        { $return = [ $item[1]] }

is a handy template for the common problem of matching a separated list.
For example:

        function: 'func' name '(' list[rule=>'param',sep=>';'] ')'

        param:    list[rule=>'name',sep=>','] ':' typename

        name:     /\w+/

        typename: name


When a subrule argument list is used with a repeated subrule, the argument list
goes I<before> the repetition specifier:

        list:   /some|many/ thing[ $item[1] ](s)

The argument list is "late bound". That is, it is re-evaluated for every
repetition of the repeated subrule.
This means that each repeated attempt to match the subrule may be
passed a completely different set of arguments if the value of the
expression in the argument list changes between attempts. So, for
example, the grammar:

        { $::species = 'dogs' }

        pair:   'two' animal[$::species](s)

        animal: /$arg[0]/ { $::species = 'cats' }

will match the string "two dogs cats cats" completely, whereas
it will only match the string "two dogs dogs dogs" up to the
eighth character. If the value of the argument list were "early bound"
(that is, evaluated only the first time a repeated subrule match is
attempted), one would expect the matching behaviours to be reversed.

Of course, it is possible to effectively "early bind" such argument lists
by passing them a value which does not change on each repetition. For example:

        { $::species = 'dogs' }

        pair:   'two' { $::species } animal[$item[2]](s)

        animal: /$arg[0]/ { $::species = 'cats' }


Arguments can also be passed to the start rule, simply by appending them
to the argument list with which the start rule is called (I<after> the
"line number" parameter). For example, given:

        $parser = new Parse::RecDescent ( $grammar );

        $parser->data($text, 1, "str", 2, \@arr);

        #             ^^^^^  ^  ^^^^^^^^^^^^^^^
        #               |    |         |
        # TEXT TO BE PARSED  |         |
        # STARTING LINE NUMBER         |
        # ELEMENTS OF @arg WHICH IS PASSED TO RULE data

then within the productions of the rule C<data>, the array C<@arg> will contain
C<("str", 2, \@arr)>.
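
A minimal sketch of the same idea, assuming a one-production C<data> rule
invented for the example, which simply returns the matched text followed by
whatever arrived in C<@arg>:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
            data: /\w+/  { $return = [ $item[1], @arg ] }
        };

        my $parser = Parse::RecDescent->new($grammar)
            or die "Bad grammar\n";

        # Text to parse, starting line number, then the elements of @arg
        my $result = $parser->data('hello', 1, 'str', 2);

        print join(', ', @$result), "\n";

The first element printed is the text matched by the pattern. The remaining
elements are the extra arguments passed after the line number.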


=head2 Alternations

Alternations are implicit (unnamed) rules defined as part of a production. An
alternation is defined as a series of '|'-separated productions inside a
pair of round brackets. For example:

        character: 'the' ( good | bad | ugly ) /dude/

Every alternation implicitly defines a new subrule, whose
automatically-generated name indicates its origin:
"_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate
values of <I>, <P>, and <R>. A call to this implicit subrule is then
inserted in place of the brackets. Hence the above example is merely a
convenient short-hand for:

        character: 'the'
                   _alternation_1_of_production_1_of_rule_character
                   /dude/

        _alternation_1_of_production_1_of_rule_character:
                   good | bad | ugly

Since alternations are parsed by recursively calling the parser generator,
any type(s) of item can appear in an alternation. For example:

        character: 'the' ( 'high' "plains"      # Silent, with poncho
                         | /no[- ]name/         # Silent, no poncho
                         | vengeance_seeking    # Poncho-optional
                         | <error>
                         ) drifter

In this case, if an error occurred, the automatically generated
message would be:

        ERROR (line <N>): Invalid implicit subrule: Expected
                          'high' or /no[- ]name/ or vengeance_seeking,
                          but found "pacifist" instead

Since every alternation actually has a name, it's even possible
to extend or replace them:

        $parser->Replace(
                "_alternation_1_of_production_1_of_rule_character:
                        'generic Eastwood'"
                        );

More importantly, since alternations are a form of subrule, they can be given
repetition specifiers:

        character: 'the' ( good | bad | ugly )(?) /dude/


=head2 Incremental Parsing

C<Parse::RecDescent> provides two methods - C<Extend> and C<Replace> - which
can be used to alter the grammar matched by a parser. Both methods
take the same argument as C<Parse::RecDescent::new>, namely a
grammar specification string.

C<Parse::RecDescent::Extend> interprets the grammar specification and adds any
productions it finds to the end of the rules for which they are specified. For
example:

        $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
        $parser->Extend($add);

adds two productions to the rule "name" (creating it if necessary) and one
production to the rule "desc".

C<Parse::RecDescent::Replace> is identical, except that it first resets any
rule specified in the additional grammar, removing any existing productions.
Hence after:

        $add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
        $parser->Replace($add);

there are I<only> two valid "name"s and one possible description.
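
The difference between the two methods can be observed with a short script.
This is a sketch only: the C<greeting> rule, its literals, and the C<report>
helper are hypothetical names used for illustration.

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{ greeting: 'hello' })
            or die "Bad grammar\n";

        # Hypothetical helper: report whether the greeting rule matches
        sub report {
            my ($text) = @_;
            my $status = defined $parser->greeting($text) ? 'matches' : 'fails';
            print "'$text' $status\n";
        }

        # Extend adds a production; the original one is kept
        $parser->Extend(q{ greeting: 'howdy' });
        report('hello');
        report('howdy');

        # Replace discards the existing productions of the rule first
        $parser->Replace(q{ greeting: 'hi' });
        report('hello');
        report('hi');

After the call to C<Extend> both literals should be accepted, whereas after
the call to C<Replace> only the new production should remain.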

A more interesting use of the C<Extend> and C<Replace> methods is to call them
inside the action of an executing parser. For example:

        typedef: 'typedef' type_name identifier ';'
                       { $thisparser->Extend("type_name: '$item[3]'") }
               | <error>

        identifier: ...!type_name /[A-Za-z_]\w*/

which automatically prevents type names from being typedef'd, or:

        command: 'map' key_name 'to' abort_key
                       { $thisparser->Replace("abort_key: '$item[2]'") }
               | 'map' key_name 'to' key_name
                       { map_key($item[2],$item[4]) }
               | abort_key
                       { exit if confirm("abort?") }

        abort_key: 'q'

        key_name: ...!abort_key /[A-Za-z]/

which allows the user to change the abort key binding, but not to unbind it.

The careful use of such constructs makes it possible to reconfigure a
running parser, eliminating the need for semantic feedback by
providing syntactic feedback instead. However, as currently implemented,
C<Replace()> and C<Extend()> have to regenerate and re-C<eval> the
entire parser whenever they are called. This makes them quite slow for
large grammars.

In such cases, the judicious use of an interpolated regex is likely to
be far more efficient:

        typedef: 'typedef' type_name identifier ';'
                       { $thisparser->{local}{type_name} .= "|$item[3]" }
               | <error>

        identifier: ...!type_name /[A-Za-z_]\w*/

        type_name: /$thisparser->{local}{type_name}/


=head2 Precompiling parsers

Normally Parse::RecDescent builds a parser from a grammar at run-time.
That approach simplifies the design and implementation of parsing code,
but has the disadvantage that it slows the parsing process down - you
have to wait for Parse::RecDescent to build the parser every time the
program runs. Long or complex grammars can be particularly slow to
build, leading to unacceptable delays at start-up.

To overcome this, the module provides a way of "pre-building" a parser
object and saving it in a separate module. That module can then be used
to create clones of the original parser.

A grammar may be precompiled using the C<Precompile> class method.
For example, to precompile a grammar stored in the scalar $grammar,
and produce a class named PreGrammar in a module file named PreGrammar.pm,
you could use:

        use Parse::RecDescent;

        Parse::RecDescent->Precompile($grammar, "PreGrammar");

The first argument is the grammar string, the second is the name of the class
to be built. The name of the module file is generated automatically by
appending ".pm" to the last element of the class name. Thus

        Parse::RecDescent->Precompile($grammar, "My::New::Parser");

would produce a module file named Parser.pm.

It is somewhat tedious to have to write a small Perl program just to
generate a precompiled grammar class, so Parse::RecDescent has some special
magic that allows you to do the job directly from the command-line.

If your grammar is specified in a file named F<grammar>, you can generate
a class named Yet::Another::Grammar like so:

        > perl -MParse::RecDescent - grammar Yet::Another::Grammar

This would produce a file named F<Grammar.pm> containing the full
definition of a class called Yet::Another::Grammar. Of course, to use
that class, you would need to put the F<Grammar.pm> file in a
directory named F<Yet/Another>, somewhere in your Perl include path.

Having created the new class, it's very easy to use it to build
a parser. You simply C<use> the new module, and then call its
C<new> method to create a parser object. For example:

        use Yet::Another::Grammar;
        my $parser = Yet::Another::Grammar->new();

The effect of these two lines is exactly the same as:

        use Parse::RecDescent;

        open GRAMMAR_FILE, "grammar" or die;
        local $/;
        my $grammar = <GRAMMAR_FILE>;

        my $parser = Parse::RecDescent->new($grammar);

only considerably faster.

Note however that the parsers produced by either approach are exactly
the same, so whilst precompilation has an effect on I<set-up> speed,
it has no effect on I<parsing> speed. RecDescent 2.0 will address that
problem.


=head2 A Metagrammar for C<Parse::RecDescent>

The following is a specification of the grammar format accepted by
C<Parse::RecDescent::new> (specified in the C<Parse::RecDescent> grammar format!):

 grammar    : component(s)

 component  : rule | comment

 rule       : "\n" identifier ":" production(s?)

 production : item(s)

 item       : lookahead(?) simpleitem
            | directive
            | comment

 lookahead  : '...' | '...!'                   # +'ve or -'ve lookahead

 simpleitem : subrule args(?)                  # match another rule
            | repetition                       # match repeated subrules
            | terminal                         # match the next input
            | bracket args(?)                  # match alternative items
            | action                           # do something

 subrule    : identifier                       # the name of the rule

 args       : {extract_codeblock($text,'[]')}  # just like a [...] array ref

 repetition : subrule args(?) howoften

 howoften   : '(?)'                            # 0 or 1 times
            | '(s?)'                           # 0 or more times
            | '(s)'                            # 1 or more times
            | /(\d+)[.][.](\d+)/               # $1 to $2 times
            | /[.][.](\d*)/                    # at most $1 times
            | /(\d*)[.][.]/                    # at least $1 times

 terminal   : /[/]([\][/]|[^/])*[/]/           # interpolated pattern
            | /"([\]"|[^"])*"/                 # interpolated literal
            | /'([\]'|[^'])*'/                 # uninterpolated literal

 action     : { extract_codeblock($text) }     # embedded Perl code

 bracket    : '(' item(s) production(s?) ')'   # alternative subrules

 directive  : '<commit>'                       # commit to production
            | '<uncommit>'                     # cancel commitment
            | '<resync>'                       # skip to newline
            | '<resync:' pattern '>'           # skip <pattern>
            | '<reject>'                       # fail this production
            | '<reject:' condition '>'         # fail if <condition>
            | '<error>'                        # report an error
            | '<error:' string '>'             # report error as "<string>"
            | '<error?>'                       # error only if committed
            | '<error?:' string '>'            #   "    "    "    "
            | '<rulevar:' /[^>]+/ '>'          # define rule-local variable
            | '<matchrule:' string '>'         # invoke rule named in string

 identifier : /[a-z]\w*/i                      # must start with alpha

 comment    : /#[^\n]*/                        # same as Perl

 pattern    : {extract_bracketed($text,'<')}   # allow embedded "<..>"

 condition  : {extract_codeblock($text,'{<')}  # full Perl expression

 string     : {extract_variable($text)}        # any Perl variable
            | {extract_quotelike($text)}       #   or quotelike string
            | {extract_bracketed($text,'<')}   #   or balanced brackets


=head1 GOTCHAS

This section describes common mistakes that grammar writers seem to
make on a regular basis.

=head2 1. Expecting an error to always invalidate a parse

A common mistake when using error messages is to write the grammar like this:

        file: line(s)

        line: line_type_1
            | line_type_2
            | line_type_3
            | <error>

The expectation seems to be that any line that is not of type 1, 2 or 3 will
invoke the C<E<lt>errorE<gt>> directive and thereby cause the parse to fail.

Unfortunately, that only happens if the error occurs in the very first line.
The first rule states that a C<file> is matched by one or more lines, so if
even a single line succeeds, the first rule is completely satisfied and the
parse as a whole succeeds. That means that any error messages generated by
subsequent failures in the C<line> rule are quietly ignored.

Typically what's really needed is this:

        file: line(s) eofile    { $return = $item[1] }

        line: line_type_1
            | line_type_2
            | line_type_3
            | <error>

        eofile: /^\Z/

The addition of the C<eofile> subrule to the first production means that
a file only matches a series of successful C<line> matches I<that consume the
complete input text>. If any input text remains after the lines are matched,
there must have been an error in the last C<line>. In that case the C<eofile>
rule will fail, causing the entire C<file> rule to fail too.

Note too that C<eofile> must match C</^\Z/> (end-of-text), I<not>
C</^\cZ/> or C</^\cD/> (end-of-file).

And don't forget the action at the end of the production. If you just
write:

        file: line(s) eofile

then the value returned by the C<file> rule will be the value of its
last item: C<eofile>. Since C<eofile> always returns an empty string
on success, that will cause the C<file> rule to return that empty
string. Apart from returning the wrong value, returning an empty string
will trip up code such as:

        $parser->file($filetext) || die;

(since "" is false).

Remember that C<Parse::RecDescent> returns C<undef> on failure,
so the only safe test for failure is:

        defined($parser->file($filetext)) || die;
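
The complete idiom can be sketched as a small program. The C<line> rule
below (a lowercase word followed by a semicolon) and the two sample texts
are assumptions made purely for illustration; only the C<eofile> subrule,
the explicit C<$return> action, and the C<defined> test come from the
discussion above:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $grammar = q{
            file: line(s) eofile    { $return = $item[1] }

            # Hypothetical line format: a lowercase word, then ';'
            line: /[a-z]+/ ';'      { $return = $item[1] }
                | <error>

            eofile: /^\Z/
        };

        my $parser = Parse::RecDescent->new($grammar)
            or die "Bad grammar\n";

        # The second text ends with a bad line, so eofile cannot match
        for my $text ("abc; def;", "abc; 123;") {
            my $lines = $parser->file($text);
            print defined $lines ? "matched: @$lines\n" : "parse failed\n";
        }

Under these assumptions the first text should parse and the second
should yield C<undef>, because the unconsumed C<123;> prevents C<eofile>
from matching.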


=head1 DIAGNOSTICS

Diagnostics are intended to be self-explanatory, particularly if you
use B<-RD_HINT> (under B<perl -s>) or define C<$::RD_HINT> inside the program.

C<Parse::RecDescent> currently diagnoses the following:

=over 4

=item *

Invalid regular expressions used as pattern terminals (fatal error).

=item *

Invalid Perl code in code blocks (fatal error).

=item *

Lookahead used in the wrong place or in a nonsensical way (fatal error).

=item *

"Obvious" cases of left-recursion (fatal error).

=item *

Missing or extra components in a C<E<lt>leftopE<gt>> or C<E<lt>rightopE<gt>>
directive.

=item *

Unrecognisable components in the grammar specification (fatal error).

=item *

"Orphaned" rule components specified before the first rule (fatal error)
or after an C<E<lt>errorE<gt>> directive (level 3 warning).

=item *

Missing rule definitions (this only generates a level 3 warning, since you
may be providing them later via C<Parse::RecDescent::Extend()>).

=item *

Instances where greedy repetition behaviour will almost certainly
cause the failure of a production (a level 3 warning - see
L<"ON-GOING ISSUES AND FUTURE DIRECTIONS"> below).

=item *

Attempts to define rules named 'Replace' or 'Extend', which cannot be
called directly through the parser object because of the predefined
meaning of C<Parse::RecDescent::Replace> and
C<Parse::RecDescent::Extend>. (Only a level 2 warning is generated, since
such rules I<can> still be used as subrules).

=item *

Productions which consist of a single C<E<lt>error?E<gt>>
directive, and which therefore may succeed unexpectedly
(a level 2 warning, since this might conceivably be the desired effect).

=item *

Multiple consecutive lookahead specifiers (a level 1 warning only, since their
effects simply accumulate).

=item *

Productions which start with a C<E<lt>rejectE<gt>> or C<E<lt>rulevar:...E<gt>>
directive. Such productions are optimized away (a level 1 warning).

=item *

Rules which are autogenerated under C<$::RD_AUTOSTUB> (a level 1 warning).

=back
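
As a rough illustration of how these diagnostics surface, the following
sketch turns on hints and warnings before building a parser from a
deliberately flawed grammar. The grammar and its rule names are
hypothetical, and the sketch assumes that C<$::RD_WARN> holds the lowest
warning level to be reported, so that a value of 1 requests every level:

        use strict;
        use warnings;
        use Parse::RecDescent;

        # Assumption: $::RD_WARN is the lowest warning level to report,
        # so 1 asks for all levels; $::RD_HINT adds suggested remedies
        $::RD_WARN = 1;
        $::RD_HINT = 1;

        # Hypothetical grammar: the greedy word(s) leaves nothing for
        # the final word, which should draw a level 3 warning
        my $parser = Parse::RecDescent->new(q{
            list: word(s) word
            word: /\w+/
        });

The parser object should still be created, since warnings are non-fatal;
the diagnostic text itself is not reproduced here.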

=head1 AUTHOR

Damian Conway (damian@conway.org)

=head1 BUGS AND IRRITATIONS

There are undoubtedly serious bugs lurking somewhere in this much code :-)
Bug reports and other feedback are most welcome.

Ongoing annoyances include:

=over 4

=item *

There's no support for parsing directly from an input stream.
If and when the Perl Gods give us regular expressions on streams,
this should be trivial (ahem!) to implement.

=item *

The parser generator can get confused if actions aren't properly
closed or if they contain particularly nasty Perl syntax errors
(especially unmatched curly brackets).

=item *

The generator only detects the most obvious form of left recursion
(potential recursion on the first subrule in a rule). More subtle
forms of left recursion (for example, through the second item in a
rule after a "zero" match of a preceding "zero-or-more" repetition,
or after a match of a subrule with an empty production) are not found.

=item *

Instead of complaining about left-recursion, the generator should
silently transform the grammar to remove it. Don't expect this
feature any time soon as it would require a more sophisticated
approach to parser generation than is currently used.

=item *

The generated parsers don't always run as fast as might be wished.

=item *

The meta-parser should be bootstrapped using C<Parse::RecDescent> :-)

=back

=head1 ON-GOING ISSUES AND FUTURE DIRECTIONS

=over 4

=item 1.

Repetitions are "incorrigibly greedy" in that they will eat everything they can
and won't backtrack if that behaviour causes a production to fail needlessly.
So, for example:

        rule: subrule(s) subrule

will I<never> succeed, because the repetition will eat all the
subrules it finds, leaving none to match the second item. Such
constructions are relatively rare (and C<Parse::RecDescent::new> generates a
warning whenever they occur) so this may not be a problem, especially
since the insatiable behaviour can be overcome "manually" by writing:

        rule: penultimate_subrule(s) subrule

        penultimate_subrule: subrule ...subrule

The issue is that this construction is exactly twice as expensive as the
original, whereas backtracking would add only 1/I<N> to the cost (for
matching I<N> repetitions of C<subrule>). I would welcome feedback on
the need for backtracking; particularly on cases where the lack of it
makes parsing performance problematic.

=item 2.

Having opened that can of worms, it's also necessary to consider whether there
is a need for non-greedy repetition specifiers. Again, it's possible (at some
cost) to manually provide the required functionality:

        rule: nongreedy_subrule(s) othersubrule

        nongreedy_subrule: subrule ...!othersubrule

Overall, the issue is whether the benefit of this extra functionality
outweighs the drawbacks of further complicating the (currently
minimalist) grammar specification syntax, and (worse) introducing more overhead
into the generated parsers.

=item 3.

An C<E<lt>autocommitE<gt>> directive would be nice. That is, it would be useful to be
able to say:

        command: <autocommit>
        command: 'find' name
               | 'find' address
               | 'do' command 'at' time 'if' condition
               | 'do' command 'at' time
               | 'do' command
               | unusual_command

and have the generator work out that this should be "pruned" thus:

        command: 'find' name
               | 'find' <commit> address
               | 'do' <commit> command <uncommit>
                        'at' time
                        'if' <commit> condition
               | 'do' <commit> command <uncommit>
                        'at' <commit> time
               | 'do' <commit> command
               | unusual_command

There are several issues here. Firstly, should the
C<E<lt>autocommitE<gt>> automatically install an C<E<lt>uncommitE<gt>>
at the start of the last production (on the grounds that the "command"
rule doesn't know whether an "unusual_command" might start with "find"
or "do") or should the "unusual_command" subgraph be analysed (to see
if it I<might> be viable after a "find" or "do")?

The second issue is how regular expressions should be treated. The simplest
approach would be simply to uncommit before them (on the grounds that they
I<might> match). Better efficiency would be obtained by analyzing all preceding
literal tokens to determine whether the pattern would match them.

Overall, the issues are: can such automated "pruning" approach a hand-tuned
version sufficiently closely to warrant the extra set-up expense, and (more
importantly) is the problem important enough to even warrant the non-trivial
effort of building an automated solution?

=back
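
The manual workaround from item 1 can be tried out with a short program.
This is only a sketch: the rule names C<list>, C<leading_word>, and
C<word>, the C<eofile> subrule, and the sample input are assumptions
chosen for illustration. Note the explicit action on C<leading_word>;
without it the rule would return the value of its last item, which here
is the lookahead rather than the word it consumed:

        use strict;
        use warnings;
        use Parse::RecDescent;

        my $parser = Parse::RecDescent->new(q{
            list: leading_word(s) word eofile
                    { $return = [ $item[1], $item[2] ] }

            # A word is leading only if another word follows it
            leading_word: word ...word
                    { $return = $item[1] }

            word: /\w+/

            eofile: /^\Z/
        }) or die "Bad grammar\n";

        my $result = $parser->list("a b c");
        defined $result or die "Parse failed\n";

        my ($leading, $last) = @$result;
        print "leading: @$leading; last: $last\n";

For the input shown, C<leading_word(s)> should collect C<a> and C<b>,
leaving C<c> for the final C<word>.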

=head1 COPYRIGHT

Copyright (c) 1997-2000, Damian Conway. All Rights Reserved.
This module is free software. It may be used, redistributed
and/or modified under the terms of the Perl Artistic License
  (see http://www.perl.com/perl/misc/Artistic.html)