=head1 NAME
|
|
|
|
Parse::RecDescent - Generate Recursive-Descent Parsers
|
|
|
|
=head1 VERSION
|
|
|
|
This document describes version 1.94 of Parse::RecDescent,
|
|
released April 9, 2003.
|
|
|
|
=head1 SYNOPSIS
|
|
|
|
use Parse::RecDescent;
|
|
|
|
# Generate a parser from the specification in $grammar:
|
|
|
|
$parser = new Parse::RecDescent ($grammar);
|
|
|
|
# Generate a parser from the specification in $othergrammar
|
|
|
|
$anotherparser = new Parse::RecDescent ($othergrammar);
|
|
|
|
|
|
# Parse $text using rule 'startrule' (which must be
|
|
# defined in $grammar):
|
|
|
|
$parser->startrule($text);
|
|
|
|
|
|
# Parse $text using rule 'otherrule' (which must also
|
|
# be defined in $grammar):
|
|
|
|
$parser->otherrule($text);
|
|
|
|
|
|
# Change the universal token prefix pattern
|
|
# (the default is: '\s*'):
|
|
|
|
$Parse::RecDescent::skip = '[ \t]+';
|
|
|
|
|
|
# Replace productions of existing rules (or create new ones)
|
|
# with the productions defined in $newgrammar:
|
|
|
|
$parser->Replace($newgrammar);
|
|
|
|
|
|
# Extend existing rules (or create new ones)
|
|
# by adding extra productions defined in $moregrammar:
|
|
|
|
$parser->Extend($moregrammar);
|
|
|
|
|
|
# Global flags (useful as command line arguments under -s):
|
|
|
|
$::RD_ERRORS # unless undefined, report fatal errors
|
|
$::RD_WARN # unless undefined, also report non-fatal problems
|
|
$::RD_HINT # if defined, also suggest remedies
|
|
$::RD_TRACE # if defined, also trace parsers' behaviour
|
|
$::RD_AUTOSTUB # if defined, generates "stubs" for undefined rules
|
|
$::RD_AUTOACTION # if defined, appends specified action to productions
|
|
|
|
|
|
=head1 DESCRIPTION
|
|
|
|
=head2 Overview
|
|
|
|
Parse::RecDescent incrementally generates top-down recursive-descent text
|
|
parsers from simple I<yacc>-like grammar specifications. It provides:
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
Regular expressions or literal strings as terminals (tokens),
|
|
|
|
=item *
|
|
|
|
Multiple (non-contiguous) productions for any rule,
|
|
|
|
=item *
|
|
|
|
Repeated and optional subrules within productions,
|
|
|
|
=item *
|
|
|
|
Full access to Perl within actions specified as part of the grammar,
|
|
|
|
=item *
|
|
|
|
Simple automated error reporting during parser generation and parsing,
|
|
|
|
=item *
|
|
|
|
The ability to commit to, uncommit to, or reject particular
|
|
productions during a parse,
|
|
|
|
=item *
|
|
|
|
The ability to pass data up and down the parse tree ("down" via subrule
|
|
argument lists, "up" via subrule return values)
|
|
|
|
=item *
|
|
|
|
Incremental extension of the parsing grammar (even during a parse),
|
|
|
|
=item *
|
|
|
|
Precompilation of parser objects,
|
|
|
|
=item *
|
|
|
|
User-definable reduce-reduce conflict resolution via
|
|
"scoring" of matching productions.
|
|
|
|
=back
|
|
|
|
=head2 Using C<Parse::RecDescent>
|
|
|
|
Parser objects are created by calling C<Parse::RecDescent::new>, passing in a
|
|
grammar specification (see the following subsections). If the grammar is
|
|
correct, C<new> returns a blessed reference which can then be used to initiate
|
|
parsing through any rule specified in the original grammar. A typical sequence
|
|
looks like this:
|
|
|
|
$grammar = q {
|
|
# GRAMMAR SPECIFICATION HERE
|
|
};
|
|
|
|
$parser = new Parse::RecDescent ($grammar) or die "Bad grammar!\n";
|
|
|
|
# acquire $text
|
|
|
|
defined $parser->startrule($text) or print "Bad text!\n";
|
|
|
|
The rule through which parsing is initiated must be explicitly defined
|
|
in the grammar (i.e. for the above example, the grammar must include a
|
|
rule of the form: "startrule: <subrules>").
|
|
|
|
If the starting rule succeeds, its value (see below)
|
|
is returned. Failure to generate the original parser or failure to match a text
|
|
is indicated by returning C<undef>. Note that it's easy to set up grammars
|
|
that can succeed, but which return a value of 0, "0", or "". So don't be
|
|
tempted to write:
|
|
|
|
$parser->startrule($text) or print "Bad text!\n";
|
|
|
|
Normally, the parser has no effect on the original text. So in the
|
|
previous example the value of $text would be unchanged after having
|
|
been parsed.
|
|
|
|
If, however, the text to be matched is passed by reference:
|
|
|
|
$parser->startrule(\$text)
|
|
|
|
then any text which was consumed during the match will be removed from the
|
|
start of $text.
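The two points above (testing the result with C<defined>, and passing the text by reference) can be sketched as follows. This is a minimal illustration that assumes Parse::RecDescent is installed; the one-rule grammar is made up for the example and is not part of the module.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical one-rule grammar: the start rule returns the digits it matched.
my $parser = Parse::RecDescent->new(q{
    startrule: /\d+/
}) or die "Bad grammar!\n";

# "0" parses successfully, but the returned value is false in boolean context.
my $value = $parser->startrule("0");
print "parsed\n"   if defined $value;   # the reliable success test
print "rejected\n" unless $value;       # misleading: also fires on this success

# Passing a reference lets the parser remove the consumed text from $text.
my $text   = "42 and the rest";
my $number = $parser->startrule(\$text);
print "value: $number, remaining: '$text'\n";
```

Because the first call returns the string "0", a plain boolean test would report a failure even though the parse succeeded.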
|
|
|
|
|
|
=head2 Rules
|
|
|
|
In the grammar from which the parser is built, rules are specified by
|
|
giving an identifier (which must satisfy /[A-Za-z]\w*/), followed by a
|
|
colon I<on the same line>, followed by one or more productions,
|
|
separated by single vertical bars. The layout of the productions
|
|
is entirely free-format:
|
|
|
|
rule1: production1
|
|
| production2 |
|
|
production3 | production4
|
|
|
|
At any point in the grammar previously defined rules may be extended with
|
|
additional productions. This is achieved by redeclaring the rule with the new
|
|
productions. Thus:
|
|
|
|
rule1: a | b | c
|
|
rule2: d | e | f
|
|
rule1: g | h
|
|
|
|
is exactly equivalent to:
|
|
|
|
rule1: a | b | c | g | h
|
|
rule2: d | e | f
|
|
|
|
Each production in a rule consists of zero or more items, each of which
|
|
may be either: the name of another rule to be matched (a "subrule"),
|
|
a pattern or string literal to be matched directly (a "token"), a
|
|
block of Perl code to be executed (an "action"), a special instruction
|
|
to the parser (a "directive"), or a standard Perl comment (which is
|
|
ignored).
|
|
|
|
A rule matches a text if one of its productions matches. A production
|
|
matches if each of its items matches consecutive substrings of the
|
|
text. The productions of a rule being matched are tried in the same
|
|
order that they appear in the original grammar, and the first matching
|
|
production terminates the match attempt (successfully). If all
|
|
productions are tried and none matches, the match attempt fails.
|
|
|
|
Note that this behaviour is quite different from the "prefer the longer match"
|
|
behaviour of I<yacc>. For example, if I<yacc> were parsing the rule:
|
|
|
|
seq : 'A' 'B'
|
|
| 'A' 'B' 'C'
|
|
|
|
upon matching "AB" it would look ahead to see if a 'C' is next and, if
|
|
so, would match the second production in preference to the first. In
|
|
other words, I<yacc> effectively tries all the productions of a rule
|
|
breadth-first in parallel, and selects the "best" match, where "best"
|
|
means longest (note that this is a gross simplification of the true
|
|
behaviour of I<yacc> but it will do for our purposes).
|
|
|
|
In contrast, C<Parse::RecDescent> tries each production depth-first in
|
|
sequence, and selects the "best" match, where "best" means first. This is
|
|
the fundamental difference between "bottom-up" and "recursive descent"
|
|
parsing.
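The difference between "first match wins" and "longest match wins" can be seen in a toy model. The sketch below uses plain Perl and two hypothetical helper functions (C<first_match> and C<longest_match>); it models only the selection policy and is not how either I<yacc> or C<Parse::RecDescent> is implemented.

```perl
use strict;
use warnings;

# Ordered choice: try each alternative in turn at the start of $text
# and stop at the first one that matches.
sub first_match {
    my ($text, @alternatives) = @_;
    for my $alt (@alternatives) {
        return $alt if index($text, $alt) == 0;
    }
    return;
}

# Longest match: try every alternative and keep the longest that matches.
sub longest_match {
    my ($text, @alternatives) = @_;
    my $best;
    for my $alt (@alternatives) {
        next unless index($text, $alt) == 0;
        $best = $alt if !defined $best || length($alt) > length($best);
    }
    return $best;
}

my @seq = ('AB', 'ABC');                  # same order as the "seq" rule above
print first_match('ABC', @seq), "\n";     # AB
print longest_match('ABC', @seq), "\n";   # ABC
```

Under the ordered policy, listing the longer production first (C<'A' 'B' 'C'> before C<'A' 'B'>) is the usual way to make it reachable.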
|
|
|
|
Each successfully matched item in a production is assigned a value,
|
|
which can be accessed in subsequent actions within the same
|
|
production (or, in some cases, as the return value of a successful
|
|
subrule call). Unsuccessful items don't have an associated value,
|
|
since the failure of an item causes the entire surrounding production
|
|
to immediately fail. The following sections describe the various types
|
|
of items and their success values.
|
|
|
|
|
|
=head2 Subrules
|
|
|
|
A subrule which appears in a production is an instruction to the parser to
|
|
attempt to match the named rule at that point in the text being
|
|
parsed. If the named subrule is not defined when requested the
|
|
production containing it immediately fails (unless it was "autostubbed" - see
|
|
L<Autostubbing>).
|
|
|
|
A rule may (recursively) call itself as a subrule, but I<not> as the
|
|
left-most item in any of its productions (since such recursions are usually
|
|
non-terminating).
|
|
|
|
The value associated with a subrule is the value associated with its
|
|
C<$return> variable (see L<"Actions"> below), or with the last successfully
|
|
matched item in the subrule match.
|
|
|
|
Subrules may also be specified with a trailing repetition specifier,
|
|
indicating that they are to be (greedily) matched the specified number
|
|
of times. The available specifiers are:
|
|
|
|
subrule(?) # Match one-or-zero times
|
|
subrule(s) # Match one-or-more times
|
|
subrule(s?) # Match zero-or-more times
|
|
subrule(N) # Match exactly N times for integer N > 0
|
|
subrule(N..M) # Match between N and M times
|
|
subrule(..M) # Match between 1 and M times
|
|
subrule(N..) # Match at least N times
|
|
|
|
Repeated subrules keep matching until either the subrule fails to
|
|
match, or it has matched the minimal number of times but fails to
|
|
consume any of the parsed text (this second condition prevents the
|
|
subrule matching forever in some cases).
|
|
|
|
Since a repeated subrule may match many instances of the subrule itself, the
|
|
value associated with it is not a simple scalar, but rather a reference to a
|
|
list of scalars, each of which is the value associated with one of the
|
|
individual subrule matches. In other words in the rule:
|
|
|
|
program: statement(s)
|
|
|
|
the value associated with the repeated subrule "statement(s)" is a reference
|
|
to an array containing the values matched by each call to the individual
|
|
subrule "statement".
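A short sketch of how the array reference produced by a repeated subrule might be used. It assumes Parse::RecDescent is installed; the C<statement> rule shown here is a made-up placeholder that accepts a lowercase word followed by a semicolon.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar: a program is one or more "word;" statements.
my $parser = Parse::RecDescent->new(q{
    program:   statement(s)
    statement: /[a-z]+/ ';'  { $return = $item[1] }
}) or die "Bad grammar!\n";

# The value of "program" is the value of its last item, statement(s),
# which is a reference to an array holding one value per statement matched.
my $statements = $parser->program("foo; bar; baz;");
defined $statements or die "Bad text!\n";

print scalar(@$statements), " statements: ", join(",", @$statements), "\n";
```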
|
|
|
|
Repetition modifiers may include a separator pattern:
|
|
|
|
program: statement(s /;/)
|
|
|
|
specifying some sequence of characters to be skipped between each repetition.
|
|
This is really just a shorthand for the E<lt>leftop:...E<gt> directive
|
|
(see below).
|
|
|
|
=head2 Tokens
|
|
|
|
If a quote-delimited string or a Perl regex appears in a production,
|
|
the parser attempts to match that string or pattern at that point in
|
|
the text. For example:
|
|
|
|
typedef: "typedef" typename identifier ';'
|
|
|
|
identifier: /[A-Za-z_][A-Za-z0-9_]*/
|
|
|
|
As in regular Perl, a single quoted string is uninterpolated, whilst
|
|
a double-quoted string or a pattern is interpolated (at the time
|
|
of matching, I<not> when the parser is constructed). Hence, it is
|
|
possible to define rules in which tokens can be set at run-time:
|
|
|
|
typedef: "$::typedefkeyword" typename identifier ';'
|
|
|
|
identifier: /$::identpat/
|
|
|
|
Note that, since each rule is implemented inside a special namespace
|
|
belonging to its parser, it is necessary to explicitly qualify
|
|
variables from the main package.
|
|
|
|
Regex tokens can be specified using just slashes as delimiters
|
|
or with the explicit C<mE<lt>delimiterE<gt>......E<lt>delimiterE<gt>> syntax:
|
|
|
|
typedef: "typedef" typename identifier ';'
|
|
|
|
typename: /[A-Za-z_][A-Za-z0-9_]*/
|
|
|
|
identifier: m{[A-Za-z_][A-Za-z0-9_]*}
|
|
|
|
A regex of either type can also have any valid trailing parameter(s)
|
|
(that is, any of [cgimsox]):
|
|
|
|
typedef: "typedef" typename identifier ';'
|
|
|
|
identifier: / [a-z_] # LEADING ALPHA OR UNDERSCORE
|
|
[a-z0-9_]* # THEN DIGITS ALSO ALLOWED
|
|
/ix # CASE/SPACE/COMMENT INSENSITIVE
|
|
|
|
The value associated with any successfully matched token is a string
|
|
containing the actual text which was matched by the token.
|
|
|
|
It is important to remember that, since each grammar is specified in a
|
|
Perl string, all instances of the universal escape character '\' within
|
|
a grammar must be "doubled", so that they interpolate to single '\'s when
|
|
the string is compiled. For example, to use the grammar:
|
|
|
|
word: /\S+/ | backslash
|
|
line: prefix word(s) "\n"
|
|
backslash: '\\'
|
|
|
|
the following code is required:
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
|
|
word: /\\S+/ | backslash
|
|
line: prefix word(s) "\\n"
|
|
backslash: '\\\\'
|
|
|
|
});
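The effect of doubling can be checked by printing the grammar string before handing it to the parser. The sketch below is plain Perl and relies only on the quoting rules of C<q{...}>, in which C<\\> collapses to a single backslash; the variable names are illustrative.

```perl
use strict;
use warnings;

# Inside q{...}, each '\\' becomes one backslash in the resulting string,
# so the parser generator sees /\S+/ and '\\'.
my $grammar = q{
    word:      /\\S+/ | backslash
    backslash: '\\\\'
};
print $grammar;

# q{} also leaves a lone backslash alone when it precedes an ordinary
# character, so both spellings of this pattern give the same text.
# Doubling remains the safe habit, and is required to get a literal '\\'.
my $doubled = q{/\\S+/};
my $single  = q{/\S+/};
print "identical\n" if $doubled eq $single;
```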
|
|
|
|
|
|
=head2 Terminal Separators
|
|
|
|
For the purpose of matching, each terminal in a production is considered
|
|
to be preceded by a "prefix" - a pattern which must be
|
|
matched before a token match is attempted. By default, the
|
|
prefix is optional whitespace (which always matches, at
|
|
least trivially), but this default may be reset in any production.
|
|
|
|
The variable C<$Parse::RecDescent::skip> stores the universal
|
|
prefix, which is the default for all terminal matches in all parsers
|
|
built with C<Parse::RecDescent>.
|
|
|
|
The prefix for an individual production can be altered
|
|
by using the C<E<lt>skip:...E<gt>> directive (see below).
|
|
|
|
|
|
=head2 Actions
|
|
|
|
An action is a block of Perl code which is to be executed (as the
|
|
block of a C<do> statement) when the parser reaches that point in a
|
|
production. The action executes within a special namespace belonging to
|
|
the active parser, so care must be taken in correctly qualifying variable
|
|
names (see also L<Start-up Actions> below).
|
|
|
|
The action is considered to succeed if the final value of the block
|
|
is defined (that is, if the implied C<do> statement evaluates to a
|
|
defined value - I<even one which would be treated as "false">). Note
|
|
that the value associated with a successful action is also the final
|
|
value in the block.
|
|
|
|
An action will I<fail> if its last evaluated value is C<undef>. This is
|
|
surprisingly easy to accomplish by accident. For instance, here's an
|
|
infuriating case of an action that makes its production fail, but only
|
|
when debugging I<isn't> activated:
|
|
|
|
description: name rank serial_number
|
|
{ print "Got $item[2] $item[1] ($item[3])\n"
|
|
if $::debugging
|
|
}
|
|
|
|
If C<$debugging> is false, no statement in the block is executed, so
|
|
the final value is C<undef>, and the entire production fails. The solution is:
|
|
|
|
description: name rank serial_number
|
|
{ print "Got $item[2] $item[1] ($item[3])\n"
|
|
if $::debugging;
|
|
1;
|
|
}
|
|
|
|
Within an action, a number of useful parse-time variables are
|
|
available in the special parser namespace (there are other variables
|
|
also accessible, but meddling with them will probably just break your
|
|
parser. As a general rule, if you avoid referring to unqualified
|
|
variables - especially those starting with an underscore - inside an action,
|
|
things should be okay):
|
|
|
|
=over 4
|
|
|
|
=item C<@item> and C<%item>
|
|
|
|
The array slice C<@item[1..$#item]> stores the value associated with each item
|
|
(that is, each subrule, token, or action) in the current production. The
|
|
analogy is to C<$1>, C<$2>, etc. in a I<yacc> grammar.
|
|
Note that, for obvious reasons, C<@item> only contains the
|
|
values of items I<before> the current point in the production.
|
|
|
|
The first element (C<$item[0]>) stores the name of the current rule
|
|
being matched.
|
|
|
|
C<@item> is a standard Perl array, so it can also be indexed with negative
|
|
numbers, representing the number of items I<back> from the current position in
|
|
the parse:
|
|
|
|
stuff: /various/ bits 'and' pieces "then" data 'end'
|
|
{ print $item[-2] } # PRINTS data
|
|
# (EASIER THAN: $item[6])
|
|
|
|
The C<%item> hash complements the C<@item> array, providing named
|
|
access to the same item values:
|
|
|
|
stuff: /various/ bits 'and' pieces "then" data 'end'
|
|
{ print $item{data} # PRINTS data
|
|
# (EVEN EASIER THAN USING @item)
}
|
|
|
|
|
|
The results of named subrules are stored in the hash under each
|
|
subrule's name (including the repetition specifier, if any),
|
|
whilst all other items are stored under a "named
|
|
positional" key that indicates their ordinal position within their item
|
|
type: __STRINGI<n>__, __PATTERNI<n>__, __DIRECTIVEI<n>__, __ACTIONI<n>__:
|
|
|
|
stuff: /various/ bits 'and' pieces "then" data 'end' { save }
|
|
{ print $item{__PATTERN1__}, # PRINTS 'various'
|
|
$item{__STRING2__}, # PRINTS 'then'
|
|
$item{__ACTION1__}, # PRINTS RETURN
|
|
# VALUE OF save
|
|
}
|
|
|
|
|
|
If you want proper I<named> access to patterns or literals, you need to turn
|
|
them into separate rules:
|
|
|
|
stuff: various bits 'and' pieces "then" data 'end'
|
|
{ print $item{various} # PRINTS various
|
|
}
|
|
|
|
various: /various/
|
|
|
|
|
|
The special entry C<$item{__RULE__}> stores the name of the current
|
|
rule (i.e. the same value as C<$item[0]>).
|
|
|
|
The advantage of using C<%item> instead of C<@item> is that it
|
|
removes the need to track item positions that may change as a grammar
|
|
evolves. For example, adding an interim C<E<lt>skipE<gt>> directive
|
|
or action can silently ruin a trailing action, by moving an C<@item>
|
|
element "down" the array one place. In contrast, the named entry
|
|
of C<%item> is unaffected by such an insertion.
|
|
|
|
A limitation of the C<%item> hash is that it only records the I<last>
|
|
value of a particular subrule. For example:
|
|
|
|
range: '(' number '..' number ')'
|
|
{ $return = $item{number} }
|
|
|
|
will return only the value corresponding to the I<second> match of the
|
|
C<number> subrule. In other words, successive calls to a subrule
|
|
overwrite the corresponding entry in C<%item>. Once again, the
|
|
solution is to rename each subrule in its own rule:
|
|
|
|
range: '(' from_num '..' to_num ')'
|
|
{ $return = $item{from_num} }
|
|
|
|
from_num: number
|
|
to_num: number
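The renaming technique can be tried end to end with a small script. This sketch assumes Parse::RecDescent is installed and adds a hypothetical C<number> rule (a run of digits) so that the grammar is complete; it returns both endpoints rather than only the first.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Each endpoint gets its own rule name, so each has its own %item entry.
my $parser = Parse::RecDescent->new(q{
    range:    '(' from_num '..' to_num ')'
              { $return = [ $item{from_num}, $item{to_num} ] }

    from_num: number
    to_num:   number
    number:   /\d+/
}) or die "Bad grammar!\n";

my $range = $parser->range("(3..7)");
defined $range or die "Bad text!\n";

print "from $range->[0] to $range->[1]\n";
```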
|
|
|
|
|
|
|
|
=item C<@arg> and C<%arg>
|
|
|
|
The array C<@arg> and the hash C<%arg> store any arguments passed to
|
|
the rule from some other rule (see L<"Subrule argument lists">). Changes
|
|
to the elements of either variable do not propagate back to the calling
|
|
rule (data can be passed back from a subrule via the C<$return>
|
|
variable - see next item).
|
|
|
|
|
|
=item C<$return>
|
|
|
|
If a value is assigned to C<$return> within an action, that value is
|
|
returned if the production containing the action eventually matches
|
|
successfully. Note that setting C<$return> I<doesn't> cause the current
|
|
production to succeed. It merely tells it what to return if it I<does> succeed.
|
|
Hence C<$return> is analogous to C<$$> in a I<yacc> grammar.
|
|
|
|
If C<$return> is not assigned within a production, the value of the
|
|
last component of the production (namely: C<$item[$#item]>) is
|
|
returned if the production succeeds.
|
|
|
|
|
|
=item C<$commit>
|
|
|
|
The current state of commitment to the current production (see L<"Directives">
|
|
below).
|
|
|
|
=item C<$skip>
|
|
|
|
The current terminal prefix (see L<"Directives"> below).
|
|
|
|
=item C<$text>
|
|
|
|
The remaining (unparsed) text. Changes to C<$text> I<do not
|
|
propagate> out of unsuccessful productions, but I<do> survive
|
|
successful productions. Hence it is possible to dynamically alter the
|
|
text being parsed - for example, to provide a C<#include>-like facility:
|
|
|
|
hash_include: '#include' filename
|
|
{ $text = ::loadfile($item[2]) . $text }
|
|
|
|
filename: '<' /[a-z0-9._-]+/i '>' { $return = $item[2] }
|
|
| '"' /[a-z0-9._-]+/i '"' { $return = $item[2] }
|
|
|
|
|
|
=item C<$thisline> and C<$prevline>
|
|
|
|
C<$thisline> stores the current line number within the current parse
|
|
(starting from 1). C<$prevline> stores the line number for the last
|
|
character which was already successfully parsed (this will be different from
|
|
C<$thisline> at the end of each line).
|
|
|
|
For efficiency, C<$thisline> and C<$prevline> are actually tied
|
|
hashes, and only recompute the required line number when the variable's
|
|
value is used.
|
|
|
|
Assignment to C<$thisline> adjusts the line number calculator, so that
|
|
it believes that the current line number is the value being assigned. Note
|
|
that this adjustment will be reflected in all subsequent line number
|
|
calculations.
|
|
|
|
Modifying the value of the variable C<$text> (as in the previous
|
|
C<hash_include> example, for instance) will confuse the line
|
|
counting mechanism. To prevent this, you should call
|
|
C<Parse::RecDescent::LineCounter::resync($thisline)> I<immediately>
|
|
after any assignment to the variable C<$text> (or, at least, before the
|
|
next attempt to use C<$thisline>).
|
|
|
|
Note that if a production fails after assigning to or
|
|
resync'ing C<$thisline>, the parser's line counter mechanism will
|
|
usually be corrupted.
|
|
|
|
Also see the entry for C<@itempos>.
|
|
|
|
The line number can be set to values other than 1, by calling the start
|
|
rule with a second argument. For example:
|
|
|
|
$parser = new Parse::RecDescent ($grammar);
|
|
|
|
$parser->input($text, 10); # START LINE NUMBERS AT 10
|
|
|
|
|
|
=item C<$thiscolumn> and C<$prevcolumn>
|
|
|
|
C<$thiscolumn> stores the current column number within the current line
|
|
being parsed (starting from 1). C<$prevcolumn> stores the column number
|
|
of the last character which was actually successfully parsed. Usually
|
|
C<$prevcolumn == $thiscolumn-1>, but not at the end of lines.
|
|
|
|
For efficiency, C<$thiscolumn> and C<$prevcolumn> are
|
|
actually tied hashes, and only recompute the required column number
|
|
when the variable's value is used.
|
|
|
|
Assignment to C<$thiscolumn> or C<$prevcolumn> is a fatal error.
|
|
|
|
Modifying the value of the variable C<$text> (as in the previous
|
|
C<hash_include> example, for instance) may confuse the column
|
|
counting mechanism.
|
|
|
|
Note that C<$thiscolumn> reports the column number I<before> any
|
|
whitespace that might be skipped before reading a token. Hence
|
|
if you wish to know where a token started (and ended) use something like this:
|
|
|
|
rule: token1 token2 startcol token3 endcol token4
|
|
{ print "token3: columns $item[3] to $item[5]"; }
|
|
|
|
startcol: '' { $thiscolumn } # NEED THE '' TO STEP PAST TOKEN SEP
|
|
endcol: { $prevcolumn }
|
|
|
|
Also see the entry for C<@itempos>.
|
|
|
|
=item C<$thisoffset> and C<$prevoffset>
|
|
|
|
C<$thisoffset> stores the offset of the current parsing position
|
|
within the complete text
|
|
being parsed (starting from 0). C<$prevoffset> stores the offset
|
|
of the last character which was actually successfully parsed. In all
|
|
cases C<$prevoffset == $thisoffset-1>.
|
|
|
|
For efficiency, C<$thisoffset> and C<$prevoffset> are
|
|
actually tied hashes, and only recompute the required offset
|
|
when the variable's value is used.
|
|
|
|
Assignment to C<$thisoffset> or C<$prevoffset> is a fatal error.
|
|
|
|
Modifying the value of the variable C<$text> will I<not> affect the
|
|
offset counting mechanism.
|
|
|
|
Also see the entry for C<@itempos>.
|
|
|
|
=item C<@itempos>
|
|
|
|
The array C<@itempos> stores a hash reference corresponding to
|
|
each element of C<@item>. The elements of the hash provide the
|
|
following:
|
|
|
|
$itempos[$n]{offset}{from} # VALUE OF $thisoffset BEFORE $item[$n]
|
|
$itempos[$n]{offset}{to} # VALUE OF $prevoffset AFTER $item[$n]
|
|
$itempos[$n]{line}{from} # VALUE OF $thisline BEFORE $item[$n]
|
|
$itempos[$n]{line}{to} # VALUE OF $prevline AFTER $item[$n]
|
|
$itempos[$n]{column}{from} # VALUE OF $thiscolumn BEFORE $item[$n]
|
|
$itempos[$n]{column}{to} # VALUE OF $prevcolumn AFTER $item[$n]
|
|
|
|
Note that the various C<$itempos[$n]...{from}> values record the
|
|
appropriate value I<after> any token prefix has been skipped.
|
|
|
|
Hence, instead of the somewhat tedious and error-prone:
|
|
|
|
rule: startcol token1 endcol
|
|
startcol token2 endcol
|
|
startcol token3 endcol
|
|
{ print "token1: columns $item[1]
|
|
to $item[3]
|
|
token2: columns $item[4]
|
|
to $item[6]
|
|
token3: columns $item[7]
|
|
to $item[9]" }
|
|
|
|
startcol: '' { $thiscolumn } # NEED THE '' TO STEP PAST TOKEN SEP
|
|
endcol: { $prevcolumn }
|
|
|
|
it is possible to write:
|
|
|
|
rule: token1 token2 token3
|
|
{ print "token1: columns $itempos[1]{column}{from}
|
|
to $itempos[1]{column}{to}
|
|
token2: columns $itempos[2]{column}{from}
|
|
to $itempos[2]{column}{to}
|
|
token3: columns $itempos[3]{column}{from}
|
|
to $itempos[3]{column}{to}" }
|
|
|
|
Note however that (in the current implementation) the use of C<@itempos>
|
|
anywhere in a grammar implies that item positioning information is
|
|
collected I<everywhere> during the parse. Depending on the grammar
|
|
and the size of the text to be parsed, this may be prohibitively
|
|
expensive and the explicit use of C<$thisline>, C<$thiscolumn>, etc. may
|
|
be a better choice.
|
|
|
|
|
|
=item C<$thisparser>
|
|
|
|
A reference to the S<C<Parse::RecDescent>> object through which
|
|
parsing was initiated.
|
|
|
|
The value of C<$thisparser> propagates down the subrules of a parse
|
|
but not back up. Hence, you can invoke subrules from another parser
|
|
for the scope of the current rule as follows:
|
|
|
|
rule: subrule1 subrule2
|
|
| { $thisparser = $::otherparser } <reject>
|
|
| subrule3 subrule4
|
|
| subrule5
|
|
|
|
The result is that the first production calls "subrule1" and "subrule2" of
|
|
the current parser, and the remaining productions call the named subrules
|
|
from C<$::otherparser>. Note, however that "Bad Things" will happen if
|
|
C<$::otherparser> isn't a blessed reference and/or doesn't have methods
|
|
with the same names as the required subrules!
|
|
|
|
=item C<$thisrule>
|
|
|
|
A reference to the S<C<Parse::RecDescent::Rule>> object corresponding to the
|
|
rule currently being matched.
|
|
|
|
=item C<$thisprod>
|
|
|
|
A reference to the S<C<Parse::RecDescent::Production>> object
|
|
corresponding to the production currently being matched.
|
|
|
|
=item C<$score> and C<$score_return>
|
|
|
|
$score stores the best production score to date, as specified by
|
|
an earlier C<E<lt>score:...E<gt>> directive. $score_return stores
|
|
the corresponding return value for the successful production.
|
|
|
|
See L<Scored productions>.
|
|
|
|
=back
|
|
|
|
B<Warning:> the parser relies on the information in the various C<this...>
|
|
objects in some non-obvious ways. Tinkering with the other members of
|
|
these objects will probably cause Bad Things to happen, unless you
|
|
I<really> know what you're doing. The only exception to this advice is
|
|
that the use of C<$this...-E<gt>{local}> is always safe.
|
|
|
|
|
|
=head2 Start-up Actions
|
|
|
|
Any actions which appear I<before> the first rule definition in a
|
|
grammar are treated as "start-up" actions. Each such action is
|
|
stripped of its outermost brackets and then evaluated (in the parser's
|
|
special namespace) just before the rules of the grammar are first
|
|
compiled.
|
|
|
|
The main use of start-up actions is to declare local variables within the
|
|
parser's special namespace:
|
|
|
|
{ my $lastitem = '???'; }
|
|
|
|
list: item(s) { $return = $lastitem }
|
|
|
|
item: book { $lastitem = 'book'; }
|
|
| bell { $lastitem = 'bell'; }
| candle { $lastitem = 'candle'; }
|
|
|
|
but start-up actions can be used to execute I<any> valid Perl code
|
|
within a parser's special namespace.
|
|
|
|
Start-up actions can appear within a grammar extension or replacement
|
|
(that is, a partial grammar installed via C<Parse::RecDescent::Extend()> or
|
|
C<Parse::RecDescent::Replace()> - see L<Incremental Parsing>), and will be
|
|
executed before the new grammar is installed. Note, however, that a
|
|
particular start-up action is only ever executed once.
|
|
|
|
|
|
=head2 Autoactions
|
|
|
|
It is sometimes desirable to be able to specify a default action to be
|
|
taken at the end of every production (for example, in order to easily
|
|
build a parse tree). If the variable C<$::RD_AUTOACTION> is defined
|
|
when C<Parse::RecDescent::new()> is called, the contents of that
|
|
variable are treated as a specification of an action which is to be appended
|
|
to each production in the corresponding grammar. So, for example, to construct
|
|
a simple parse tree:
|
|
|
|
$::RD_AUTOACTION = q { [@item] };
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
expression: and_expr '||' expression | and_expr
|
|
and_expr: not_expr '&&' and_expr | not_expr
|
|
not_expr: '!' brack_expr | brack_expr
|
|
brack_expr: '(' expression ')' | identifier
|
|
identifier: /[a-z]+/i
|
|
});
|
|
|
|
which is equivalent to:
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
expression: and_expr '||' expression
|
|
{ [@item] }
|
|
| and_expr
|
|
{ [@item] }
|
|
|
|
and_expr: not_expr '&&' and_expr
|
|
{ [@item] }
|
|
| not_expr
|
|
{ [@item] }
|
|
|
|
not_expr: '!' brack_expr
|
|
{ [@item] }
|
|
| brack_expr
|
|
{ [@item] }
|
|
|
|
brack_expr: '(' expression ')'
|
|
{ [@item] }
|
|
| identifier
|
|
{ [@item] }
|
|
|
|
identifier: /[a-z]+/i
|
|
{ [@item] }
|
|
});
|
|
|
|
Alternatively, we could take an object-oriented approach, using different
|
|
classes for each node (and also eliminating redundant intermediate nodes):
|
|
|
|
$::RD_AUTOACTION = q
|
|
{ $#item==1 ? $item[1] : new ${"$item[0]_node"} (@item[1..$#item]) };
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
expression: and_expr '||' expression | and_expr
|
|
and_expr: not_expr '&&' and_expr | not_expr
|
|
not_expr: '!' brack_expr | brack_expr
|
|
brack_expr: '(' expression ')' | identifier
|
|
identifier: /[a-z]+/i
|
|
});
|
|
|
|
which is equivalent to:
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
expression: and_expr '||' expression
|
|
{ new expression_node (@item[1..3]) }
|
|
| and_expr
|
|
|
|
and_expr: not_expr '&&' and_expr
|
|
{ new and_expr_node (@item[1..3]) }
|
|
| not_expr
|
|
|
|
not_expr: '!' brack_expr
|
|
{ new not_expr_node (@item[1..2]) }
|
|
| brack_expr
|
|
|
|
brack_expr: '(' expression ')'
|
|
{ new brack_expr_node (@item[1..3]) }
|
|
| identifier
|
|
|
|
identifier: /[a-z]+/i
|
|
{ new identifier_node ($item[1]) }
|
|
});
|
|
|
|
Note that, if a production already ends in an action, no autoaction is appended
|
|
to it. For example, in this version:
|
|
|
|
$::RD_AUTOACTION = q
|
|
{ $#item==1 ? $item[1] : new ${"$item[0]_node"} (@item[1..$#item]) };
|
|
|
|
$parser = new Parse::RecDescent (q{
|
|
expression: and_expr '||' expression | and_expr
|
|
and_expr: not_expr '&&' and_expr | not_expr
|
|
not_expr: '!' brack_expr | brack_expr
|
|
brack_expr: '(' expression ')' | identifier
|
|
identifier: /[a-z]+/i
|
|
{ new terminal_node($item[1]) }
|
|
});
|
|
|
|
each C<identifier> match produces a C<terminal_node> object, I<not> an
|
|
C<identifier_node> object.
|
|
|
|
A level 1 warning is issued each time an "autoaction" is added to
|
|
some production.
|
|
|
|
|
|
=head2 Autotrees
|
|
|
|
A commonly needed autoaction is one that builds a parse-tree. It is moderately
|
|
tricky to set up such an action (which must treat terminals differently from
|
|
non-terminals), so Parse::RecDescent simplifies the process by providing the
|
|
C<E<lt>autotreeE<gt>> directive.
|
|
|
|
If this directive appears at the start of a grammar, it causes
|
|
Parse::RecDescent to insert autoactions at the end of any production except
|
|
those which already end in an action. The action inserted depends on whether
|
|
the production is an intermediate rule (two or more items), or a terminal
|
|
of the grammar (i.e. a single pattern or string item).
|
|
|
|
So, for example, the following grammar:
|
|
|
|
<autotree>
|
|
|
|
file : command(s)
|
|
command : get | set | vet
|
|
get : 'get' ident ';'
|
|
set : 'set' ident 'to' value ';'
|
|
vet : 'check' ident 'is' value ';'
|
|
ident : /\w+/
|
|
value : /\d+/
|
|
|
|
is equivalent to:
|
|
|
|
file : command(s) { bless \%item, $item[0] }
|
|
command : get { bless \%item, $item[0] }
|
|
| set { bless \%item, $item[0] }
|
|
| vet { bless \%item, $item[0] }
|
|
get : 'get' ident ';' { bless \%item, $item[0] }
|
|
set : 'set' ident 'to' value ';' { bless \%item, $item[0] }
|
|
vet : 'check' ident 'is' value ';' { bless \%item, $item[0] }
|
|
|
|
ident : /\w+/ { bless {__VALUE__=>$item[1]}, $item[0] }
|
|
value : /\d+/ { bless {__VALUE__=>$item[1]}, $item[0] }
|
|
|
|
Note that each node in the tree is blessed into a class of the same name
|
|
as the rule itself. This makes it easy to build object-oriented
|
|
processors for the parse-trees that the grammar produces. Note too that
|
|
the last two rules produce special objects with the single attribute
|
|
'__VALUE__'. This is because they consist solely of a single terminal.
|
|
|
|
This autoaction-ed grammar would then produce a parse tree in a data
|
|
structure like this:
|
|
|
|
{
|
|
file => {
|
|
command => {
|
|
[ get => {
|
|
                           ident => { __VALUE__ => 'a' },
|
|
},
|
|
set => {
|
|
                           ident => { __VALUE__ => 'b' },
|
|
value => { __VALUE__ => '7' },
|
|
},
|
|
vet => {
|
|
                           ident => { __VALUE__ => 'b' },
|
|
value => { __VALUE__ => '7' },
|
|
},
|
|
],
|
|
},
|
|
}
|
|
}
|
|
|
|
(except, of course, that each nested hash would also be blessed into
|
|
the appropriate class).
|
|
|
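As a rough illustration, the grammar above can be fed to a parser and the
resulting tree walked by hand. This is only a sketch: the input string and the
variable names are invented for the example, and it assumes that the C<%item>
key for a repeated subrule includes its repetition specifier (that is,
C<'command(s)'>).

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Grammar copied from the example above; <autotree> comes first.
my $grammar = <<'END_GRAMMAR';
<autotree>

file    : command(s)
command : get | set | vet
get     : 'get' ident ';'
set     : 'set' ident 'to' value ';'
vet     : 'check' ident 'is' value ';'
ident   : /\w+/
value   : /\d+/
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

# Hypothetical input, chosen to exercise all three commands.
my $text = "get a; set b to 7; check b is 7;";
my $tree = $parser->file($text)
    or die "Parse failed\n";

# Each node is blessed into a class named after the rule that built it.
my @kinds;
for my $command ( @{ $tree->{'command(s)'} } ) {
    # A 'command' node holds exactly one of get/set/vet.
    my ($kind) = grep { exists $command->{$_} } qw(get set vet);
    my $node   = $command->{$kind};
    push @kinds, $kind;
    printf "%-3s (class %s) ident=%s\n",
        $kind, ref($node), $node->{ident}{__VALUE__};
}
```

Because every node carries its rule name as its class, a tree processor can
dispatch on C<ref($node)> instead of inspecting hash keys.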
|
|
|
=head2 Autostubbing
|
|
|
|
Normally, if a subrule appears in some production, but no rule of that
|
|
name is ever defined in the grammar, the production which refers to the
|
|
non-existent subrule fails immediately. This typically occurs as a
|
|
result of misspellings, and is a sufficiently common occurrence that a
|
|
warning is generated for such situations.
|
|
|
|
However, when prototyping a grammar it is sometimes useful to be
|
|
able to use subrules before a proper specification of them is
|
|
really possible. For example, a grammar might include a section like:
|
|
|
|
function_call: identifier '(' arg(s?) ')'
|
|
|
|
identifier: /[a-z]\w*/i
|
|
|
|
where the possible format of an argument is sufficiently complex that
|
|
it is not worth specifying in full until the general function call
|
|
syntax has been debugged. In this situation it is convenient to leave
|
|
the real rule C<arg> undefined and just slip in a placeholder (or
|
|
"stub"):
|
|
|
|
arg: 'arg'
|
|
|
|
so that the function call syntax can be tested with dummy input such as:
|
|
|
|
f0()
|
|
f1(arg)
|
|
f2(arg arg)
|
|
f3(arg arg arg)
|
|
|
|
et cetera.
|
|
|
|
Early in prototyping, many such "stubs" may be required, so
|
|
C<Parse::RecDescent> provides a means of automating their definition.
|
|
If the variable C<$::RD_AUTOSTUB> is defined when a parser is built,
|
|
a subrule reference to any non-existent rule (say, C<sr>),
|
|
causes a "stub" rule of the form:
|
|
|
|
sr: 'sr'
|
|
|
|
to be automatically defined in the generated parser.
|
|
A level 1 warning is issued for each such "autostubbed" rule.
|
|
|
|
Hence, with C<$::RD_AUTOSTUB> defined, it is possible to only partially
|
|
specify a grammar, and then "fake" matches of the unspecified
|
|
(sub)rules by just typing in their name.
|
|
|
|
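A minimal sketch of autostubbing in use follows. It takes the
C<function_call> fragment shown earlier, deliberately leaves C<arg>
undefined, and sets C<$::RD_AUTOSTUB> before the parser is built. The test
inputs and variable names are invented for the example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Must be defined *before* the parser is built.
$::RD_AUTOSTUB = 1;

# Note: no 'arg' rule is defined anywhere in this grammar.
my $grammar = <<'END_GRAMMAR';
function_call: identifier '(' arg(s?) ')'
identifier:    /[a-z]\w*/i
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my @inputs = ( 'f0()', 'f2(arg arg)', 'f1(oops)' );
my %matched;
for my $input (@inputs) {
    my $text = $input;    # parse a copy of each test string
    $matched{$input} = defined $parser->function_call($text) ? 1 : 0;
    print "$input: ", ( $matched{$input} ? "matched" : "failed" ), "\n";
}
```

The last input fails because the stub only matches the literal text C<arg>,
which is all a placeholder needs to do while the rest of the grammar is
being debugged.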
|
|
|
|
=head2 Look-ahead
|
|
|
|
If a subrule, token, or action is prefixed by "...", then it is
|
|
treated as a "look-ahead" request. That means that the current production can
|
|
(as usual) only succeed if the specified item is matched, but that the matching
|
|
I<does not consume any of the text being parsed>. This is very similar to the
|
|
C</(?=...)/> look-ahead construct in Perl patterns. Thus, the rule:
|
|
|
|
inner_word: word ...word
|
|
|
|
will match whatever the subrule "word" matches, provided that match is followed
|
|
by some more text which subrule "word" would also match (although this
|
|
second substring is not actually consumed by "inner_word").
|
|
|
|
Likewise, a "...!" prefix causes the following item to succeed (without
|
|
consuming any text) if and only if it would normally fail. Hence, a
|
|
rule such as:
|
|
|
|
identifier: ...!keyword ...!'_' /[A-Za-z_]\w*/
|
|
|
|
matches a string of characters which satisfies the pattern
|
|
C</[A-Za-z_]\w*/>, but only if the same sequence of characters would
|
|
not match either subrule "keyword" or the literal token '_'.
|
|
|
|
Sequences of look-ahead prefixes accumulate, multiplying their positive and/or
|
|
negative senses. Hence:
|
|
|
|
inner_word: word ...!......!word
|
|
|
|
is exactly equivalent to the original example above (a warning is issued in
|
|
cases like these, since they often indicate something left out, or
|
|
misunderstood).
|
|
|
|
Note that actions can also be treated as look-aheads. In such cases,
|
|
the state of the parser text (in the local variable C<$text>)
|
|
I<after> the look-ahead action is guaranteed to be identical to its
|
|
state I<before> the action, regardless of how it's changed I<within>
|
|
the action (unless you actually undefine C<$text>, in which case you
|
|
get the disaster you deserve :-).
|
|
|
|
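The negative look-ahead idiom can be tried out with a small script such as
the one below. It is a sketch under assumptions: the C<keyword> rule (three
keywords, each ending at a word boundary) and the sample inputs are
invented, and the C<...!'_'> test from the rule above is omitted for
brevity.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_GRAMMAR';
identifier: ...!keyword /[A-Za-z_]\w*/
keyword:    /(?:if|while|for)\b/
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my @inputs = qw(whilst while _tmp);
my %result;
for my $input (@inputs) {
    my $text = $input;    # parse a copy of each test string
    # The look-ahead consumes nothing, so on success the value of the
    # production is the text matched by the final pattern.
    $result{$input} = $parser->identifier($text);
    printf "%-6s => %s\n", $input,
        defined $result{$input} ? "identifier '$result{$input}'" : 'rejected';
}
```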
|
|
=head2 Directives
|
|
|
|
Directives are special pre-defined actions which may be used to alter
|
|
the behaviour of the parser. There are currently eighteen directives:
|
|
C<E<lt>commitE<gt>>,
|
|
C<E<lt>uncommitE<gt>>,
|
|
C<E<lt>rejectE<gt>>,
|
|
C<E<lt>scoreE<gt>>,
|
|
C<E<lt>autoscoreE<gt>>,
|
|
C<E<lt>skipE<gt>>,
|
|
C<E<lt>resyncE<gt>>,
|
|
C<E<lt>errorE<gt>>,
|
|
C<E<lt>rulevarE<gt>>,
|
|
C<E<lt>matchruleE<gt>>,
|
|
C<E<lt>leftopE<gt>>,
|
|
C<E<lt>rightopE<gt>>,
|
|
C<E<lt>deferE<gt>>,
|
|
C<E<lt>nocheckE<gt>>,
|
|
C<E<lt>perl_quotelikeE<gt>>,
|
|
C<E<lt>perl_codeblockE<gt>>,
|
|
C<E<lt>perl_variableE<gt>>,
|
|
and C<E<lt>tokenE<gt>>.
|
|
|
|
=over 4
|
|
|
|
=item Committing and uncommitting
|
|
|
|
The C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives permit the recursive
|
|
descent of the parse tree to be pruned (or "cut") for efficiency.
|
|
Within a rule, a C<E<lt>commitE<gt>> directive instructs the rule to ignore subsequent
|
|
productions if the current production fails. For example:
|
|
|
|
command: 'find' <commit> filename
|
|
| 'open' <commit> filename
|
|
| 'move' filename filename
|
|
|
|
Clearly, if the leading token 'find' is matched in the first production but that
|
|
production fails for some other reason, then the remaining
|
|
productions cannot possibly match. The presence of the
|
|
C<E<lt>commitE<gt>> causes the "command" rule to fail immediately if
|
|
an invalid "find" command is found, and likewise if an invalid "open"
|
|
command is encountered.
|
|
|
|
It is also possible to revoke a previous commitment. For example:
|
|
|
|
if_statement: 'if' <commit> condition
|
|
'then' block <uncommit>
|
|
'else' block
|
|
| 'if' <commit> condition
|
|
'then' block
|
|
|
|
In this case, a failure to find an "else" block in the first
|
|
production shouldn't preclude trying the second production, but a
|
|
failure to find a "condition" certainly should.
|
|
|
|
As a special case, any production in which the I<first> item is an
|
|
C<E<lt>uncommitE<gt>> immediately revokes a preceding C<E<lt>commitE<gt>>
|
|
(even though the production would not otherwise have been tried). For
|
|
example, in the rule:
|
|
|
|
request: 'explain' expression
|
|
| 'explain' <commit> keyword
|
|
| 'save'
|
|
| 'quit'
|
|
| <uncommit> term '?'
|
|
|
|
if the text being matched was "explain?", and the first two
|
|
productions failed, then the C<E<lt>commitE<gt>> in production two would cause
|
|
productions three and four to be skipped, but the leading
|
|
C<E<lt>uncommitE<gt>> in the production five would allow that production to
|
|
attempt a match.
|
|
|
|
Note that, in the preceding example, the C<E<lt>commitE<gt>> was only placed
|
|
in production two. If production one had been:
|
|
|
|
request: 'explain' <commit> expression
|
|
|
|
then production two would be (inappropriately) skipped if a leading
|
|
"explain..." was encountered.
|
|
|
|
Both C<E<lt>commitE<gt>> and C<E<lt>uncommitE<gt>> directives always succeed, and their value
|
|
is always 1.
|
|
|
|
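The effect of a commitment is easiest to see by building the same rule twice,
once with the C<E<lt>commitE<gt>> and once without. The grammar below is a
hypothetical one made up for this comparison (the rule names C<stmt>,
C<name>, and C<number> do not come from the examples above).

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The same rule twice: once with <commit>, once without.
my $with_commit = <<'END_GRAMMAR';
stmt:   'let' <commit> name '=' number  { 'assignment' }
      | /\w+/                           { 'bare word'  }
name:   /[a-z]+/
number: /\d+/
END_GRAMMAR

(my $without_commit = $with_commit) =~ s/<commit>//;

my $strict_parser  = Parse::RecDescent->new($with_commit)
    or die "Bad grammar\n";
my $lenient_parser = Parse::RecDescent->new($without_commit)
    or die "Bad grammar\n";

my @inputs = ( 'let x = 1', 'let x', 'hello' );
my %outcome;
for my $input (@inputs) {
    my ( $text1, $text2 ) = ( $input, $input );    # parse copies
    $outcome{$input} = [
        $strict_parser->stmt($text1)  // 'FAILED',
        $lenient_parser->stmt($text2) // 'FAILED',
    ];
    printf "%-11s with <commit>: %-10s without: %s\n",
        "'$input'", @{ $outcome{$input} };
}
```

For the incomplete input C<let x>, the committed rule gives up as soon as
the first production fails, whereas the uncommitted rule falls through to the
second production and accepts C<let> as a bare word.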
|
|
=item Rejecting a production
|
|
|
|
The C<E<lt>rejectE<gt>> directive immediately causes the current production
|
|
to fail (it is exactly equivalent to, but more obvious than, the
|
|
action C<{undef}>). A C<E<lt>rejectE<gt>> is useful when it is desirable to get
|
|
the side effects of the actions in one production, without prejudicing a match
|
|
by some other production later in the rule. For example, to insert
|
|
tracing code into the parse:
|
|
|
|
complex_rule: { print "In complex rule...\n"; } <reject>
|
|
|
|
complex_rule: simple_rule '+' 'i' '*' simple_rule
|
|
| 'i' '*' simple_rule
|
|
| simple_rule
|
|
|
|
|
|
It is also possible to specify a conditional rejection, using the
|
|
form C<E<lt>reject:I<condition>E<gt>>, which only rejects if the
|
|
specified condition is true. This form of rejection is exactly
|
|
equivalent to the action C<{(I<condition>)?undef:1}>.
|
|
For example:
|
|
|
|
command: save_command
|
|
| restore_command
|
|
| <reject: defined $::tolerant> { exit }
|
|
| <error: Unknown command. Ignored.>
|
|
|
|
A C<E<lt>rejectE<gt>> directive never succeeds (and hence has no
|
|
associated value). A conditional rejection may succeed (if its
|
|
condition is not satisfied), in which case its value is 1.
|
|
|
|
As an extra optimization, C<Parse::RecDescent> ignores any production
|
|
which I<begins> with an unconditional C<E<lt>rejectE<gt>> directive,
|
|
since any such production can never successfully match or have any
|
|
useful side-effects. A level 1 warning is issued in all such cases.
|
|
|
|
Note that productions beginning with conditional
|
|
C<E<lt>reject:...E<gt>> directives are I<never> "optimized away" in
|
|
this manner, even if they are always guaranteed to fail (for example:
|
|
C<E<lt>reject:1E<gt>>).
|
|
|
|
Due to the way grammars are parsed, there is a minor restriction on the
|
|
condition of a conditional C<E<lt>reject:...E<gt>>: it cannot
|
|
contain any raw '<' or '>' characters. For example:
|
|
|
|
line: cmd <reject: $thiscolumn > max> data
|
|
|
|
results in an error when a parser is built from this grammar (since the
|
|
grammar parser has no way of knowing whether the first > is a "greater than"
|
|
or the end of the C<E<lt>reject:...E<gt>>).
|
|
|
|
To overcome this problem, put the condition inside a do{} block:
|
|
|
|
line: cmd <reject: do{$thiscolumn > max}> data
|
|
|
|
Note that the same problem may occur in other directives that take
|
|
arguments. The same solution will work in all cases.
|
|
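As a small sketch of a conditional rejection that needs the C<do{}>
workaround, the hypothetical C<octet> rule below accepts a run of digits only
if its numeric value does not exceed 255. The rule name and inputs are
invented for the example.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# The '>' comparison is wrapped in do{} so that the grammar parser does
# not mistake it for the end of the directive.
my $grammar = <<'END_GRAMMAR';
octet: /\d+/ <reject: do{ $item[1] > 255 }> { $item[1] }
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my @inputs = qw(200 255 300);
my %result;
for my $input (@inputs) {
    my $text = $input;    # parse a copy of each test string
    $result{$input} = $parser->octet($text);
    printf "%s => %s\n", $input,
        defined $result{$input} ? 'accepted' : 'rejected';
}
```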
|
|
=item Skipping between terminals
|
|
|
|
The C<E<lt>skipE<gt>> directive enables the terminal prefix used in
|
|
a production to be changed. For example:
|
|
|
|
OneLiner: Command <skip:'[ \t]*'> Arg(s) /;/
|
|
|
|
causes only blanks and tabs to be skipped before terminals in the C<Arg>
|
|
subrule (and any of I<its> subrules), and also before the final C</;/> terminal.
|
|
Once the production is complete, the previous terminal prefix is
|
|
reinstated. Note that this implies that distinct productions of a rule
|
|
must reset their terminal prefixes individually.
|
|
|
|
The C<E<lt>skipE<gt>> directive evaluates to the I<previous> terminal prefix,
|
|
so it's easy to reinstate a prefix later in a production:
|
|
|
|
Command: <skip:","> CSV(s) <skip:$item[1]> Modifier
|
|
|
|
The value specified after the colon is interpolated into a pattern, so all of
|
|
the following are equivalent (though their efficiency increases down the list):
|
|
|
|
<skip: "$colon|$comma"> # ASSUMING THE VARS HOLD THE OBVIOUS VALUES
|
|
|
|
<skip: ':|,'>
|
|
|
|
<skip: q{[:,]}>
|
|
|
|
<skip: qr/[:,]/>
|
|
|
|
There is no way of directly setting the prefix for
|
|
an entire rule, except as follows:
|
|
|
|
Rule: <skip: '[ \t]*'> Prod1
|
|
| <skip: '[ \t]*'> Prod2a Prod2b
|
|
| <skip: '[ \t]*'> Prod3
|
|
|
|
or, better:
|
|
|
|
Rule: <skip: '[ \t]*'>
|
|
(
|
|
Prod1
|
|
| Prod2a Prod2b
|
|
| Prod3
|
|
)
|
|
|
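One common reason to change the prefix is to make newlines significant. The
sketch below uses a hypothetical C<line> rule (not one of the rules above)
that skips only blanks and tabs, so that the trailing C</\n/> terminal can
actually see the end of the line.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Within 'line' (and its subrule 'word') only blanks and tabs are
# skipped before terminals, so the newline is left for /\n/ to match.
my $grammar = <<'END_GRAMMAR';
line: <skip: '[ \t]*'> word(s) /\n/ { $item[2] }
word: /\w+/
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my $two_lines = "alpha beta\ngamma\n";
my $first     = $parser->line($two_lines);    # words on the first line only
print "first line: @$first\n" if defined $first;

my $no_newline = "alpha beta";
my $unfinished = $parser->line($no_newline);  # no terminating newline
print "unterminated line rejected\n" unless defined $unfinished;
```

The action returns C<$item[2]> because C<$item[1]> is the value of the
C<E<lt>skipE<gt>> directive itself (the previous prefix).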
|
|
|
B<Note: Up to release 1.51 of Parse::RecDescent, an entirely different
|
|
mechanism was used for specifying terminal prefixes. The current method
|
|
is not backwards-compatible with that early approach. The current approach
|
|
is stable and will not change again.>
|
|
|
|
|
|
=item Resynchronization
|
|
|
|
The C<E<lt>resyncE<gt>> directive provides a visually distinctive
|
|
means of consuming some of the text being parsed, usually to skip an
|
|
erroneous input. In its simplest form C<E<lt>resyncE<gt>> simply
|
|
consumes text up to and including the next newline (C<"\n">)
|
|
character, succeeding only if the newline is found, in which case it
|
|
causes its surrounding rule to return zero on success.
|
|
|
|
In other words, a C<E<lt>resyncE<gt>> is exactly equivalent to the token
|
|
C</[^\n]*\n/> followed by the action S<C<{ $return = 0 }>> (except that
|
|
productions beginning with a C<E<lt>resyncE<gt>> are ignored when generating
|
|
error messages). A typical use might be:
|
|
|
|
script : command(s)
|
|
|
|
command: save_command
|
|
| restore_command
|
|
| <resync> # TRY NEXT LINE, IF POSSIBLE
|
|
|
|
It is also possible to explicitly specify a resynchronization
|
|
pattern, using the C<E<lt>resync:I<pattern>E<gt>> variant. This version
|
|
succeeds only if the specified pattern matches (and consumes) the
|
|
parsed text. In other words, C<E<lt>resync:I<pattern>E<gt>> is exactly
|
|
equivalent to the token C</I<pattern>/> (followed by a S<C<{ $return = 0 }>>
|
|
action). For example, if commands were terminated by newlines or semi-colons:
|
|
|
|
command: save_command
|
|
| restore_command
|
|
| <resync:[^;\n]*[;\n]>
|
|
|
|
The value of a successfully matched C<E<lt>resyncE<gt>> directive (of either
|
|
type) is the text that it consumed. Note, however, that since the
|
|
directive also sets C<$return>, a production consisting of a lone
|
|
C<E<lt>resyncE<gt>> succeeds but returns the value zero (which a calling rule
|
|
may find useful to distinguish between "true" matches and "tolerant" matches).
|
|
Remember that returning a zero value indicates that the rule I<succeeded> (since
|
|
only an C<undef> denotes failure within C<Parse::RecDescent> parsers).
|
|
|
|
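The following sketch shows how a calling program might separate "true"
matches from "tolerant" ones. The two command productions are simplified
stand-ins for C<save_command> and C<restore_command>, and the input is
invented. The exact number of zeros returned depends on how many partial
lines the directive has to consume, so the example only distinguishes true
values from false ones.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_GRAMMAR';
script:  command(s)
command: 'save' /\w+/     { "save $item[2]" }
       | 'restore' /\w+/  { "restore $item[2]" }
       | <resync>         # TRY NEXT LINE, IF POSSIBLE
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my $text    = "save a\n!! garbage !!\nrestore b\n";
my $results = $parser->script($text)
    or die "Parse failed\n";

# Real commands return true strings; resynchronized text returns 0.
my @good    = grep { $_ } @$results;
my $skipped = grep { !$_ } @$results;

print "commands: @good\n";
print "resynchronizations: $skipped\n";
```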
|
|
=item Error handling
|
|
|
|
The C<E<lt>errorE<gt>> directive provides automatic or user-defined
|
|
generation of error messages during a parse. In its simplest form
|
|
C<E<lt>errorE<gt>> prepares an error message based on
|
|
the mismatch between the last item expected and the text which caused
|
|
it to fail. For example, given the rule:
|
|
|
|
McCoy: curse ',' name ', I'm a doctor, not a' a_profession '!'
|
|
| pronoun 'dead,' name '!'
|
|
| <error>
|
|
|
|
the following strings would produce the following messages:
|
|
|
|
=over 4
|
|
|
|
=item "Amen, Jim!"
|
|
|
|
ERROR (line 1): Invalid McCoy: Expected curse or pronoun
|
|
not found
|
|
|
|
=item "Dammit, Jim, I'm a doctor!"
|
|
|
|
ERROR (line 1): Invalid McCoy: Expected ", I'm a doctor, not a"
|
|
but found ", I'm a doctor!" instead
|
|
|
|
=item "He's dead,\n"
|
|
|
|
ERROR (line 2): Invalid McCoy: Expected name not found
|
|
|
|
=item "He's alive!"
|
|
|
|
ERROR (line 1): Invalid McCoy: Expected 'dead,' but found
|
|
"alive!" instead
|
|
|
|
=item "Dammit, Jim, I'm a doctor, not a pointy-eared Vulcan!"
|
|
|
|
ERROR (line 1): Invalid McCoy: Expected a profession but found
|
|
"pointy-eared Vulcan!" instead
|
|
|
|
|
|
=back
|
|
|
|
Note that, when autogenerating error messages, all underscores in any
|
|
rule name used in a message are replaced by single spaces (for example
|
|
"a_production" becomes "a production"). Judicious choice of rule
|
|
names can therefore considerably improve the readability of automatic
|
|
error messages (as well as the maintainability of the original
|
|
grammar).
|
|
|
|
If the automatically generated error is not sufficient, it is possible to
|
|
provide an explicit message as part of the error directive. For example:
|
|
|
|
        Spock: "Fascinating" ',' (name | 'Captain') '.'
|
|
| "Highly illogical, doctor."
|
|
| <error: He never said that!>
|
|
|
|
which would result in I<all> failures to parse a "Spock" subrule printing the
|
|
following message:
|
|
|
|
ERROR (line <N>): Invalid Spock: He never said that!
|
|
|
|
The error message is treated as a "qq{...}" string and interpolated
|
|
when the error is generated (I<not> when the directive is specified!).
|
|
Hence:
|
|
|
|
<error: Mystical error near "$text">
|
|
|
|
would correctly insert the ambient text string which caused the error.
|
|
|
|
There are two other forms of error directive: C<E<lt>error?E<gt>> and
|
|
S<C<E<lt>error?: msgE<gt>>>. These behave just like C<E<lt>errorE<gt>>
|
|
and S<C<E<lt>error: msgE<gt>>> respectively, except that they are
|
|
only triggered if the rule is "committed" at the time they are
|
|
encountered. For example:
|
|
|
|
Scotty: "Ya kenna change the Laws of Phusics," <commit> name
|
|
              | name <commit> ',' "she's goanta blaw!"
|
|
| <error?>
|
|
|
|
will only generate an error for a string beginning with "Ya kenna
|
|
change the Laws of Phusics," or a valid name, but which still fails to match the
|
|
corresponding production. That is, C<$parser-E<gt>Scotty("Aye, Cap'ain")> will
|
|
fail silently (since neither production will "commit" the rule on that
|
|
input), whereas S<C<$parser-E<gt>Scotty("Mr Spock, ah jest kenna do'ut!")>>
|
|
will fail with the error message:
|
|
|
|
ERROR (line 1): Invalid Scotty: expected 'she's goanta blaw!'
|
|
                               but found 'ah jest kenna do'ut!' instead.
|
|
|
|
since in that case the second production would commit after matching
|
|
the leading name.
|
|
|
|
Note that to allow this behaviour, all C<E<lt>errorE<gt>> directives which are
|
|
the first item in a production automatically uncommit the rule just
|
|
long enough to allow their production to be attempted (that is, when
|
|
their production fails, the commitment is reinstated so that
|
|
subsequent productions are skipped).
|
|
|
|
In order to I<permanently> uncommit the rule before an error message,
|
|
it is necessary to put an explicit C<E<lt>uncommitE<gt>> before the
|
|
C<E<lt>errorE<gt>>. For example:
|
|
|
|
line: 'Kirk:' <commit> Kirk
|
|
| 'Spock:' <commit> Spock
|
|
| 'McCoy:' <commit> McCoy
|
|
| <uncommit> <error?> <reject>
|
|
| <resync>
|
|
|
|
|
|
Error messages generated by the various C<E<lt>error...E<gt>> directives
|
|
are not displayed immediately. Instead, they are "queued" in a buffer and
|
|
are only displayed once parsing ultimately fails. Moreover,
|
|
C<E<lt>error...E<gt>> directives that cause one production of a rule
|
|
to fail are automatically removed from the message queue
|
|
if another production subsequently causes the entire rule to succeed.
|
|
This means that you can put
|
|
C<E<lt>error...E<gt>> directives wherever useful diagnosis can be done,
|
|
and only those associated with actual parser failure will ever be
|
|
displayed. Also see L<"Gotchas">.
|
|
|
|
As a general rule, the most useful diagnostics are usually generated
|
|
either at the very lowest level within the grammar, or at the very
|
|
highest. A good rule of thumb is to identify those subrules which
|
|
consist mainly (or entirely) of terminals, and then put an
|
|
C<E<lt>error...E<gt>> directive at the end of any other rule which calls
|
|
one or more of those subrules.
|
|
|
|
There is one other situation in which the output of the various types of
|
|
error directive is suppressed; namely, when the rule containing them
|
|
is being parsed as part of a "look-ahead" (see L<"Look-ahead">). In this
|
|
case, the error directive will still cause the rule to fail, but will do
|
|
so silently.
|
|
|
|
An unconditional C<E<lt>errorE<gt>> directive always fails (and hence has no
|
|
associated value). This means that encountering such a directive
|
|
always causes the production containing it to fail. Hence an
|
|
C<E<lt>errorE<gt>> directive will inevitably be the last (useful) item of a
|
|
rule (a level 3 warning is issued if a production contains items after an unconditional
|
|
C<E<lt>errorE<gt>> directive).
|
|
|
|
An C<E<lt>error?E<gt>> directive will I<succeed> (that is: fail to fail :-), if
|
|
the current rule is uncommitted when the directive is encountered. In
|
|
that case the directive's associated value is zero. Hence, this type
|
|
of error directive I<can> be used before the end of a
|
|
production. For example:
|
|
|
|
command: 'do' <commit> something
|
|
| 'report' <commit> something
|
|
| <error?: Syntax error> <error: Unknown command>
|
|
|
|
|
|
B<Warning:> The C<E<lt>error?E<gt>> directive does I<not> mean "always fail (but
|
|
do so silently unless committed)". It actually means "only fail (and report) if
|
|
committed, otherwise I<succeed>". To achieve the "fail silently if uncommitted"
|
|
semantics, it is necessary to use:
|
|
|
|
rule: item <commit> item(s)
|
|
| <error?> <reject> # FAIL SILENTLY UNLESS COMMITTED
|
|
|
|
However, because people seem to expect a lone C<E<lt>error?E<gt>> directive
|
|
to work like this:
|
|
|
|
rule: item <commit> item(s)
|
|
| <error?: Error message if committed>
|
|
| <error: Error message if uncommitted>
|
|
|
|
Parse::RecDescent automatically appends a
|
|
C<E<lt>rejectE<gt>> directive if the C<E<lt>error?E<gt>> directive
|
|
is the only item in a production. A level 2 warning (see below)
|
|
is issued when this happens.
|
|
|
|
The level of error reporting during both parser construction and
|
|
parsing is controlled by the presence or absence of four global
|
|
variables: C<$::RD_ERRORS>, C<$::RD_WARN>, C<$::RD_HINT>, and
|
|
C<$::RD_TRACE>. If C<$::RD_ERRORS> is defined (and, by default, it is)
|
|
then fatal errors are reported.
|
|
|
|
Whenever C<$::RD_WARN> is defined, certain non-fatal problems are also reported.
|
|
Warnings have an associated "level": 1, 2, or 3. The higher the level,
|
|
the more serious the warning. The value of the corresponding global
|
|
variable (C<$::RD_WARN>) determines the I<lowest> level of warning to
|
|
be displayed. Hence, to see I<all> warnings, set C<$::RD_WARN> to 1.
|
|
To see only the most serious warnings set C<$::RD_WARN> to 3.
|
|
By default C<$::RD_WARN> is initialized to 3, ensuring that serious but
|
|
non-fatal errors are automatically reported.
|
|
|
|
See L<"DIAGNOSTICS"> for a list of the various error and warning messages
|
|
that Parse::RecDescent generates when these two variables are defined.
|
|
|
|
Defining any of the remaining variables (which are not defined by
|
|
default) further increases the amount of information reported.
|
|
Defining C<$::RD_HINT> causes the parser generator to offer
|
|
more detailed analyses and hints on both errors and warnings.
|
|
Note that setting C<$::RD_HINT> at any point automagically
|
|
sets C<$::RD_WARN> to 1.
|
|
|
|
Defining C<$::RD_TRACE> causes the parser generator and the parser to
|
|
report their progress to STDERR in excruciating detail (although without hints
|
|
unless C<$::RD_HINT> is separately defined). This detail
|
|
can be moderated in only one respect: if C<$::RD_TRACE> has an
|
|
integer value (I<N>) greater than 1, only the I<N> characters of
|
|
the "current parsing context" (that is, where in the input string we
|
|
are at any point in the parse) are reported at any time.
|
|
|
|
C<$::RD_TRACE> is mainly useful for debugging a grammar that isn't
|
|
behaving as you expected it to. To this end, if C<$::RD_TRACE> is
|
|
defined when a parser is built, any actual parser code which is
|
|
generated is also written to a file named "RD_TRACE" in the local
|
|
directory.
|
|
|
|
Note that the four variables belong to the "main" package, which
|
|
makes them easier to refer to in the code controlling the parser, and
|
|
also makes it easy to turn them into command line flags ("-RD_ERRORS",
|
|
"-RD_WARN", "-RD_HINT", "-RD_TRACE") under B<perl -s>.
|
|
|
|
=item Specifying local variables
|
|
|
|
It is occasionally convenient to specify variables which are local
|
|
to a single rule. This may be achieved by including a
|
|
C<E<lt>rulevar:...E<gt>> directive anywhere in the rule. For example:
|
|
|
|
markup: <rulevar: $tag>
|
|
|
|
markup: tag {($tag=$item[1]) =~ s/^<|>$//g} body[$tag]
|
|
|
|
The example C<E<lt>rulevar: $tagE<gt>> directive causes a "my" variable named
|
|
C<$tag> to be declared at the start of the subroutine implementing the
|
|
C<markup> rule (that is, I<before> the first production, regardless of
|
|
where in the rule it is specified).
|
|
|
|
Specifically, any directive of the form:
|
|
C<E<lt>rulevar:I<text>E<gt>> causes a line of the form C<my I<text>;>
|
|
to be added at the beginning of the rule subroutine, immediately after
|
|
the definitions of the following local variables:
|
|
|
|
$thisparser $commit
|
|
$thisrule @item
|
|
$thisline @arg
|
|
$text %arg
|
|
|
|
This means that the following C<E<lt>rulevarE<gt>> directives work
|
|
as expected:
|
|
|
|
<rulevar: $count = 0 >
|
|
|
|
<rulevar: $firstarg = $arg[0] || '' >
|
|
|
|
<rulevar: $myItems = \@item >
|
|
|
|
<rulevar: @context = ( $thisline, $text, @arg ) >
|
|
|
|
        <rulevar: ($name,$age) = @arg{"name","age"} >
|
|
|
|
If a variable that is also visible to subrules is required, it needs
|
|
to be C<local>'d, not C<my>'d. C<rulevar> defaults to C<my>, but if C<local>
|
|
is explicitly specified:
|
|
|
|
<rulevar: local $count = 0 >
|
|
|
|
then a C<local>-ized variable is declared instead, and will be available
|
|
within subrules.
|
|
|
|
Note however that, because all such variables are "my" variables, their
|
|
values I<do not persist> between match attempts on a given rule. To
|
|
preserve values between match attempts, values can be stored within the
|
|
"local" member of the C<$thisrule> object:
|
|
|
|
countedrule: { $thisrule->{"local"}{"count"}++ }
|
|
<reject>
|
|
| subrule1
|
|
| subrule2
|
|
| <reject: $thisrule->{"local"}{"count"} == 1>
|
|
subrule3
|
|
|
|
|
|
When matching a rule, each C<E<lt>rulevarE<gt>> directive is matched as
|
|
if it were an unconditional C<E<lt>rejectE<gt>> directive (that is, it
|
|
causes any production in which it appears to immediately fail to match).
|
|
For this reason (and to improve readability) it is usual to specify any
|
|
C<E<lt>rulevarE<gt>> directive in a separate production at the start of
|
|
the rule (this has the added advantage that it enables
|
|
C<Parse::RecDescent> to optimize away such productions, just as it does
|
|
for the C<E<lt>rejectE<gt>> directive).
|
|
|
|
|
|
=item Dynamically matched rules
|
|
|
|
Because regexes and double-quoted strings are interpolated, it is relatively
|
|
easy to specify productions with "context sensitive" tokens. For example:
|
|
|
|
command: keyword body "end $item[1]"
|
|
|
|
which ensures that a command block is bounded by a
|
|
"I<E<lt>keywordE<gt>>...end I<E<lt>same keywordE<gt>>" pair.
|
|
|
|
Building productions in which subrules are context sensitive is also possible,
|
|
via the C<E<lt>matchrule:...E<gt>> directive. This directive behaves
|
|
identically to a subrule item, except that the rule which is invoked to match
|
|
it is determined by the string specified after the colon. For example, we could
|
|
rewrite the C<command> rule like this:
|
|
|
|
command: keyword <matchrule:body> "end $item[1]"
|
|
|
|
Whatever appears after the colon in the directive is treated as an interpolated
|
|
string (that is, as if it appeared in a C<qq{...}> operator) and the value of
|
|
that interpolated string is the name of the subrule to be matched.
|
|
|
|
Of course, just putting a constant string like C<body> in a
|
|
C<E<lt>matchrule:...E<gt>> directive is of little interest or benefit.
|
|
The power of the directive is seen when we use a string that interpolates
|
|
to something interesting. For example:
|
|
|
|
command: keyword <matchrule:$item[1]_body> "end $item[1]"
|
|
|
|
keyword: 'while' | 'if' | 'function'
|
|
|
|
while_body: condition block
|
|
|
|
if_body: condition block ('else' block)(?)
|
|
|
|
function_body: arglist block
|
|
|
|
Now the C<command> rule selects how to proceed on the basis of the keyword
|
|
that is found. It is as if C<command> were declared:
|
|
|
|
command: 'while' while_body "end while"
|
|
| 'if' if_body "end if"
|
|
| 'function' function_body "end function"
|
|
|
|
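A cut-down, runnable variant of this idea is sketched below. The keywords
and body rules are simplified placeholders (C<say> and the two one-pattern
bodies are invented), and a trailing action is added so that the keyword and
the body it selected are both visible in the result.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $grammar = <<'END_GRAMMAR';
command:    keyword <matchrule:$item[1]_body> "end $item[1]"
                { [ $item[1], $item[2] ] }
keyword:    'while' | 'say'
while_body: /\d+/
say_body:   /[a-z]+/
END_GRAMMAR

my $parser = Parse::RecDescent->new($grammar)
    or die "Bad grammar\n";

my @inputs = (
    'while 42 end while',       # while_body is chosen and matches
    'say hello end say',        # say_body is chosen and matches
    'while hello end while',    # while_body is chosen and fails
);
my %result;
for my $input (@inputs) {
    my $text = $input;    # parse a copy of each test string
    $result{$input} = $parser->command($text);
    printf "%-24s => %s\n", $input,
        $result{$input} ? "@{ $result{$input} }" : 'no match';
}
```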
|
|
When a C<E<lt>matchrule:...E<gt>> directive is used as a repeated
|
|
subrule, the rule name expression is "late-bound". That is, the name of
|
|
the rule to be called is re-evaluated I<each time> a match attempt is
|
|
made. Hence, the following grammar:
|
|
|
|
{ $::species = 'dogs' }
|
|
|
|
pair: 'two' <matchrule:$::species>(s)
|
|
|
|
dogs: /dogs/ { $::species = 'cats' }
|
|
|
|
cats: /cats/
|
|
|
|
will match the string "two dogs cats cats" completely, whereas it will
|
|
only match the string "two dogs dogs dogs" up to the eighth letter. If
|
|
the rule name were "early bound" (that is, evaluated only the first
|
|
time the directive is encountered in a production), the reverse
|
|
behaviour would be expected.
|
|
|
|
Note that the C<matchrule> directive takes a string that is to be treated
|
|
as a rule name, I<not> as a rule invocation. That is,
|
|
it's like a Perl symbolic reference, not an C<eval>. Just as you can say:
|
|
|
|
$subname = 'foo';
|
|
|
|
# and later...
|
|
|
|
        &{$subname}(@args);
|
|
|
|
but not:
|
|
|
|
$subname = 'foo(@args)';
|
|
|
|
# and later...
|
|
|
|
        &{$subname};
|
|
|
|
likewise you can say:
|
|
|
|
$rulename = 'foo';
|
|
|
|
# and in the grammar...
|
|
|
|
<matchrule:$rulename>[@args]
|
|
|
|
but not:
|
|
|
|
$rulename = 'foo[@args]';
|
|
|
|
# and in the grammar...
|
|
|
|
<matchrule:$rulename>
|
|
|
|
|
|
=item Deferred actions
|
|
|
|
The C<E<lt>defer:...E<gt>> directive is used to specify an action to be
|
|
performed when (and only if!) the current production ultimately succeeds.
|
|
|
|
Whenever a C<E<lt>defer:...E<gt>> directive appears, the code it specifies
|
|
is converted to a closure (an anonymous subroutine reference) which is
|
|
queued within the active parser object. Note that,
|
|
because the deferred code is converted to a closure, the values of any
|
|
"local" variables (such as C<$text>, C<@item>, etc.) are preserved
|
|
until the deferred code is actually executed.
|
|
|
|
If the parse ultimately succeeds
|
|
I<and> the production in which the C<E<lt>defer:...E<gt>> directive was
|
|
evaluated formed part of the successful parse, then the deferred code is
|
|
executed immediately before the parse returns. If however the production
|
|
which queued a deferred action fails, or one of the higher-level
|
|
rules which called that production fails, then the deferred action is
|
|
removed from the queue, and hence is never executed.
|
|
|
|
For example, given the grammar:
|
|
|
|
sentence: noun trans noun
|
|
| noun intrans
|
|
|
|
noun: 'the dog'
|
|
{ print "$item[1]\t(noun)\n" }
|
|
| 'the meat'
|
|
{ print "$item[1]\t(noun)\n" }
|
|
|
|
trans: 'ate'
|
|
{ print "$item[1]\t(transitive)\n" }
|
|
|
|
intrans: 'ate'
|
|
{ print "$item[1]\t(intransitive)\n" }
|
|
| 'barked'
|
|
{ print "$item[1]\t(intransitive)\n" }
|
|
|
|
then parsing the sentence C<"the dog ate"> would produce the output:
|
|
|
|
the dog (noun)
|
|
ate (transitive)
|
|
the dog (noun)
|
|
ate (intransitive)
|
|
|
|
This is because, even though the first production of C<sentence>
|
|
ultimately fails, its initial subrules C<noun> and C<trans> do match,
|
|
and hence they execute their associated actions.
|
|
Then the second production of C<sentence> succeeds, causing the
|
|
actions of the subrules C<noun> and C<intrans> to be executed as well.
|
|
|
|
On the other hand, if the actions were replaced by C<E<lt>defer:...E<gt>>
|
|
directives:
|
|
|
|
sentence: noun trans noun
|
|
| noun intrans
|
|
|
|
noun: 'the dog'
|
|
<defer: print "$item[1]\t(noun)\n" >
|
|
| 'the meat'
|
|
<defer: print "$item[1]\t(noun)\n" >
|
|
|
|
trans: 'ate'
|
|
<defer: print "$item[1]\t(transitive)\n" >
|
|
|
|
intrans: 'ate'
|
|
<defer: print "$item[1]\t(intransitive)\n" >
|
|
| 'barked'
|
|
<defer: print "$item[1]\t(intransitive)\n" >
|
|
|
|
the output would be:
|
|
|
|
the dog (noun)
|
|
ate (intransitive)
|
|
|
|
since deferred actions are only executed if they were evaluated in
|
|
a production which ultimately contributes to the successful parse.
|
|
|
|
In this case, even though the first production of C<sentence> caused
|
|
the subrules C<noun> and C<trans> to match, that production ultimately
|
|
failed and so the deferred actions queued by those subrules were subsequently
|
|
discarded. The second production then succeeded, causing the entire
|
|
parse to succeed, and so the deferred actions queued by the (second) match of
|
|
the C<noun> subrule and the subsequent match of C<intrans> I<are> preserved and
|
|
eventually executed.
|
|
|
|
Deferred actions provide a means of improving the performance of a parser,
|
|
by only executing those actions which are part of the final parse-tree
|
|
for the input data.
|
|
|
|
Alternatively, deferred actions can be viewed as a mechanism for building
|
|
(and executing) a
|
|
customized subroutine corresponding to the given input data, much in the
|
|
same way that autoactions (see L<"Autoactions">) can be used to build a
|
|
customized data structure for specific input.
|
|
|
|
Whether or not the action it specifies is ever executed,
|
|
a C<E<lt>defer:...E<gt>> directive always succeeds, returning the
|
|
number of deferred actions currently queued at that point.
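The queue-and-discard behaviour described above can be sketched in plain Perl. This is only an illustration of the semantics, not the module's internals: the C<defer> and C<try_production> subroutines and the C<@deferred> and C<@output> arrays are hypothetical names invented for the sketch.

```perl
use strict;
use warnings;

my @deferred;   # queued closures (hypothetical stand-in for the module's queue)
my @output;     # lines the executed actions would have printed

# Like <defer:...>: always succeeds, returning the current queue length.
sub defer { my ($action) = @_; push @deferred, $action; return scalar @deferred }

# Try one production, given as a list of [word, label] pairs.
sub try_production {
    my ($words, @expected) = @_;
    my $checkpoint = scalar @deferred;
    my $ok = 1;
    for my $i (0 .. $#expected) {
        my ($word, $label) = @{ $expected[$i] };
        unless (defined $words->[$i] && $words->[$i] eq $word) { $ok = 0; last }
        defer(sub { push @output, "$word\t($label)" });
    }
    $ok &&= (@$words == @expected);             # must consume every word
    splice(@deferred, $checkpoint) unless $ok;  # discard this attempt's actions
    return $ok;
}

my @words = ('the dog', 'ate');
my $matched =
       try_production(\@words, ['the dog', 'noun'], ['ate', 'transitive'], ['the meat', 'noun'])
    || try_production(\@words, ['the dog', 'noun'], ['ate', 'intransitive']);

# Only the actions queued by the successful production are left to run.
if ($matched) { $_->() for @deferred }
print "$_\n" for @output;
```

The first attempt queues two actions and then fails on the missing third word, so both are dropped before the second attempt starts.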
|
|
|
|
|
|
=item Parsing Perl
|
|
|
|
Parse::RecDescent provides limited support for parsing subsets of Perl,
|
|
namely: quote-like operators, Perl variables, and complete code blocks.
|
|
|
|
The C<E<lt>perl_quotelikeE<gt>> directive can be used to parse any Perl
|
|
quote-like operator: C<'a string'>, C<m/a pattern/>, C<tr{ans}{lation}>,
|
|
etc. It does this by calling Text::Balanced::extract_quotelike().
|
|
|
|
If a quote-like operator is found, a reference to an array of eight elements
|
|
is returned. Those elements are identical to the last eight elements returned
|
|
by Text::Balanced::extract_quotelike() in an array context, namely:
|
|
|
|
=over 4
|
|
|
|
=item [0]
|
|
|
|
the name of the quotelike operator -- 'q', 'qq', 'm', 's', 'tr' -- if the
|
|
operator was named; otherwise C<undef>,
|
|
|
|
=item [1]
|
|
|
|
the left delimiter of the first block of the operation,
|
|
|
|
=item [2]
|
|
|
|
the text of the first block of the operation
|
|
(that is, the contents of
|
|
a quote, the regex of a match or substitution, or the target list of a
|
|
translation),
|
|
|
|
=item [3]
|
|
|
|
the right delimiter of the first block of the operation,
|
|
|
|
=item [4]
|
|
|
|
the left delimiter of the second block of the operation if there is one
|
|
(that is, if it is a C<s>, C<tr>, or C<y>); otherwise C<undef>,
|
|
|
|
=item [5]
|
|
|
|
the text of the second block of the operation if there is one
|
|
(that is, the replacement of a substitution or the translation list
|
|
of a translation); otherwise C<undef>,
|
|
|
|
=item [6]
|
|
|
|
the right delimiter of the second block of the operation (if any);
|
|
otherwise C<undef>,
|
|
|
|
=item [7]
|
|
|
|
the trailing modifiers on the operation (if any); otherwise C<undef>.
|
|
|
|
=back
|
|
|
|
If a quote-like expression is not found, the directive fails with the usual
|
|
C<undef> value.
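To see what those eight elements look like, one can call C<Text::Balanced::extract_quotelike()> directly. The following is a minimal sketch, assuming a sample input chosen for illustration; the C<@names> labels are informal and not part of any API.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_quotelike);

my $text = 's{foo}{bar}gi;';

# In list context the function returns the extracted text, the remainder,
# the skipped prefix, and then the eight fields described in the list above.
my @result = extract_quotelike($text);
my @fields = @result[3 .. 10];

my @names = ('operator', 'left delim 1', 'text 1', 'right delim 1',
             'left delim 2', 'text 2', 'right delim 2', 'modifiers');
for my $i (0 .. $#names) {
    printf "[%d] %-14s %s\n", $i, $names[$i],
        defined $fields[$i] ? "'$fields[$i]'" : 'undef';
}
```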
|
|
|
|
The C<E<lt>perl_variableE<gt>> directive can be used to parse any Perl
|
|
variable: $scalar, @array, %hash, $ref->{field}[$index], etc.
|
|
It does this by calling Text::Balanced::extract_variable().
|
|
|
|
If the directive matches text representing a valid Perl variable
|
|
specification, it returns that text. Otherwise it fails with the usual
|
|
C<undef> value.
|
|
|
|
The C<E<lt>perl_codeblockE<gt>> directive can be used to parse a curly-brace-delimited block of Perl code, such as: C<{ $a = 1; f() =~ m/pat/; }>.
|
|
It does this by calling Text::Balanced::extract_codeblock().
|
|
|
|
If the directive matches text representing a valid Perl code block,
|
|
it returns that text. Otherwise it fails with the usual C<undef> value.
|
|
|
|
You can also tell it what kind of brackets to use as the outermost
|
|
delimiters. For example:
|
|
|
|
arglist: <perl_codeblock ()>
|
|
|
|
causes an arglist to match a perl code block whose outermost delimiters
|
|
are C<(...)> (rather than the default C<{...}>).
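The underlying Text::Balanced functions can be tried on their own to get a feel for what these two directives accept. This is a small sketch with made-up input strings; it calls the functions directly rather than going through a grammar.

```perl
use strict;
use warnings;
use Text::Balanced qw(extract_variable extract_codeblock);

# In list context each function returns (extracted text, remainder, prefix).
my ($var, $after_var) = extract_variable('$ref->{field}[$index] = 1;');
print "variable: $var\n";

my ($block, $after_block) = extract_codeblock('{ $a = 1; } and more');
print "block:    $block\n";

# The second argument names the delimiters to use, here round brackets.
my ($parens, $after_parens) = extract_codeblock('(1, 2) rest', '()');
print "parens:   $parens\n";
```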
|
|
|
|
|
|
=item Constructing tokens
|
|
|
|
Eventually, Parse::RecDescent will be able to parse tokenized input, as
|
|
well as ordinary strings. In preparation for this joyous day, the
|
|
C<E<lt>token:...E<gt>> directive has been provided.
|
|
This directive creates a token which will be suitable for
|
|
input to a Parse::RecDescent parser (when it eventually supports
|
|
tokenized input).
|
|
|
|
The text of the token is the value of the
|
|
immediately preceding item in the production. A
|
|
C<E<lt>token:...E<gt>> directive always succeeds with a return
|
|
value which is the hash reference that is the new token. It also
|
|
sets the return value for the production to that hash ref.
|
|
|
|
The C<E<lt>token:...E<gt>> directive makes it easy to build
|
|
a Parse::RecDescent-compatible lexer in Parse::RecDescent:
|
|
|
|
my $lexer = new Parse::RecDescent q
|
|
{
|
|
lex: token(s)
|
|
|
|
token: /a\b/ <token:INDEF>
|
|
| /the\b/ <token:DEF>
|
|
| /fly\b/ <token:NOUN,VERB>
|
|
| /[a-z]+/i { lc $item[1] } <token:ALPHA>
|
|
| <error: Unknown token>
|
|
|
|
};
|
|
|
|
which will eventually be able to be used with a regular Parse::RecDescent
|
|
grammar:
|
|
|
|
my $parser = new Parse::RecDescent q
|
|
{
|
|
startrule: subrule1 subrule2
|
|
|
|
# ETC...
|
|
};
|
|
|
|
either with a pre-lexing phase:
|
|
|
|
$parser->startrule( $lexer->lex($data) );
|
|
|
|
or with a lex-on-demand approach:
|
|
|
|
$parser->startrule( sub{$lexer->token(\$data)} );
|
|
|
|
But at present, only the C<E<lt>token:...E<gt>> directive is
|
|
actually implemented. The rest is vapourware.
|
|
|
|
=item Specifying operations
|
|
|
|
One of the commonest requirements when building a parser is to specify
|
|
binary operators. Unfortunately, in a normal grammar, the rules for
|
|
such things are awkward:
|
|
|
|
disjunction: conjunction ('or' conjunction)(s?)
|
|
{ $return = [ $item[1], @{$item[2]} ] }
|
|
|
|
conjunction: atom ('and' atom)(s?)
|
|
{ $return = [ $item[1], @{$item[2]} ] }
|
|
|
|
or inefficient:
|
|
|
|
disjunction: conjunction 'or' disjunction
|
|
{ $return = [ $item[1], @{$item[3]} ] }
|
|
| conjunction
|
|
{ $return = [ $item[1] ] }
|
|
|
|
conjunction: atom 'and' conjunction
|
|
{ $return = [ $item[1], @{$item[3]} ] }
|
|
| atom
|
|
{ $return = [ $item[1] ] }
|
|
|
|
and either way is ugly and hard to get right.
|
|
|
|
The C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives provide an
|
|
easier way of specifying such operations. Using C<E<lt>leftop:...E<gt>> the
|
|
above examples become:
|
|
|
|
disjunction: <leftop: conjunction 'or' conjunction>
|
|
conjunction: <leftop: atom 'and' atom>
|
|
|
|
The C<E<lt>leftop:...E<gt>> directive specifies a left-associative binary operator.
|
|
It is specified around three other grammar elements
|
|
(typically subrules or terminals), which match the left operand,
|
|
the operator itself, and the right operand respectively.
|
|
|
|
A C<E<lt>leftop:...E<gt>> directive such as:
|
|
|
|
disjunction: <leftop: conjunction 'or' conjunction>
|
|
|
|
is converted to the following:
|
|
|
|
disjunction: ( conjunction ('or' conjunction)(s?)
|
|
{ $return = [ $item[1], @{$item[2]} ] } )
|
|
|
|
In other words, a C<E<lt>leftop:...E<gt>> directive matches the left operand followed by zero
|
|
or more repetitions of both the operator and the right operand. It then
|
|
flattens the matched items into an anonymous array which becomes the
|
|
(single) value of the entire C<E<lt>leftop:...E<gt>> directive.
|
|
|
|
For example, an C<E<lt>leftop:...E<gt>> directive such as:
|
|
|
|
output: <leftop: ident '<<' expr >
|
|
|
|
when given a string such as:
|
|
|
|
cout << var << "str" << 3
|
|
|
|
would match, and C<$item[1]> would be set to:
|
|
|
|
[ 'cout', 'var', '"str"', '3' ]
|
|
|
|
In other words:
|
|
|
|
output: <leftop: ident '<<' expr >
|
|
|
|
is equivalent to a left-associative operator:
|
|
|
|
output: ident { $return = [$item[1]] }
|
|
| ident '<<' expr { $return = [@item[1,3]] }
|
|
| ident '<<' expr '<<' expr { $return = [@item[1,3,5]] }
|
|
| ident '<<' expr '<<' expr '<<' expr { $return = [@item[1,3,5,7]] }
|
|
# ...etc...
|
|
|
|
|
|
Similarly, the C<E<lt>rightop:...E<gt>> directive takes a left operand, an operator, and a right operand:
|
|
|
|
assign: <rightop: var '=' expr >
|
|
|
|
and converts them to:
|
|
|
|
assign: ( (var '=' {$return=$item[1]})(s?) expr
|
|
{ $return = [ @{$item[1]}, $item[2] ] } )
|
|
|
|
which is equivalent to a right-associative operator:
|
|
|
|
assign: var { $return = [$item[1]] }
|
|
| var '=' expr { $return = [@item[1,3]] }
|
|
| var '=' var '=' expr { $return = [@item[1,3,5]] }
|
|
| var '=' var '=' var '=' expr { $return = [@item[1,3,5,7]] }
|
|
# ...etc...
|
|
|
|
|
|
Note that for both the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives, the directive does not normally
|
|
return the operator itself, just a list of the operands involved. This is
|
|
particularly handy for specifying lists:
|
|
|
|
list: '(' <leftop: list_item ',' list_item> ')'
|
|
{ $return = $item[2] }
|
|
|
|
There is, however, a problem: sometimes the operator is itself significant.
|
|
For example, in a Perl list a comma and a C<=E<gt>> are both
|
|
valid separators, but the C<=E<gt>> has additional stringification semantics.
|
|
Hence it's important to know which was used in each case.
|
|
|
|
To solve this problem the
|
|
C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
|
|
I<do> return the operator(s) as well, under two circumstances.
|
|
The first case is where the operator is specified as a subrule. In that instance,
|
|
whatever the operator matches is returned (on the assumption that if the operator
|
|
is important enough to have its own subrule, then it's important enough to return).
|
|
|
|
The second case is where the operator is specified as a regular
|
|
expression. In that case, if the first bracketed subpattern of the
|
|
regular expression matches, that matching value is returned (this is analogous to
|
|
the behaviour of the Perl C<split> function, except that only the first subpattern
|
|
is returned).
|
|
|
|
In other words, given the input:
|
|
|
|
( a=>1, b=>2 )
|
|
|
|
the specifications:
|
|
|
|
list: '(' <leftop: list_item separator list_item> ')'
|
|
|
|
separator: ',' | '=>'
|
|
|
|
or:
|
|
|
|
list: '(' <leftop: list_item /(,|=>)/ list_item> ')'
|
|
|
|
cause the list separators to be interleaved with the operands in the
|
|
anonymous array in C<$item[2]>:
|
|
|
|
[ 'a', '=>', '1', ',', 'b', '=>', '2' ]
|
|
|
|
|
|
But the following version:
|
|
|
|
list: '(' <leftop: list_item /,|=>/ list_item> ')'
|
|
|
|
returns only the operands:
|
|
|
|
[ 'a', '1', 'b', '2' ]
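The analogy with Perl's C<split> mentioned earlier can be checked directly. This sketch uses C<split> itself, not a grammar, on the list body from the example input with the whitespace removed:

```perl
use strict;
use warnings;

my $body = 'a=>1,b=>2';

# With a capturing group, split returns the separators as well as the fields.
my @with_ops    = split /(,|=>)/, $body;

# Without a capturing group, only the fields come back.
my @without_ops = split /,|=>/, $body;

print join(' ', map { "'$_'" } @with_ops), "\n";
print join(' ', map { "'$_'" } @without_ops), "\n";
```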
|
|
|
|
Of course, none of the above specifications handle the case of an empty
|
|
list, since the C<E<lt>leftop:...E<gt>> and C<E<lt>rightop:...E<gt>> directives
|
|
require at least a single right or left operand to match. To specify
|
|
that the operator can match "trivially",
|
|
it's necessary to add a C<(?)> qualifier to the directive:
|
|
|
|
list: '(' <leftop: list_item /(,|=>)/ list_item>(?) ')'
|
|
|
|
Note that in almost all the above examples, the first and third arguments
|
|
of the C<E<lt>leftop:...E<gt>> directive were the same subrule. That is because
|
|
C<E<lt>leftop:...E<gt>> directives are frequently used to specify "separated" lists of the
|
|
same type of item. To make such lists easier to specify, the following
|
|
syntax:
|
|
|
|
list: element(s /,/)
|
|
|
|
is exactly equivalent to:
|
|
|
|
list: <leftop: element /,/ element>
|
|
|
|
Note that the separator must be specified as a raw pattern (i.e.
|
|
not a string or subrule).
|
|
|
|
|
|
=item Scored productions
|
|
|
|
By default, Parse::RecDescent grammar rules always accept the first
|
|
production that matches the input. But if two or more productions may
|
|
potentially match the same input, choosing the first that does so may
|
|
not be optimal.
|
|
|
|
For example, if you were parsing the sentence "time flies like an arrow",
|
|
you might use a rule like this:
|
|
|
|
sentence: verb noun preposition article noun { [@item] }
|
|
| adjective noun verb article noun { [@item] }
|
|
| noun verb preposition article noun { [@item] }
|
|
|
|
Each of these productions matches the sentence, but the third one
|
|
is the most likely interpretation. However, if the sentence had been
|
|
"fruit flies like a banana", then the second production is probably
|
|
the right match.
|
|
|
|
To cater for such situations, the C<E<lt>score:...E<gt>> directive can be used.
|
|
The directive is equivalent to an unconditional C<E<lt>rejectE<gt>>,
|
|
except that it allows you to specify a "score" for the current
|
|
production. If that score is numerically greater than the best
|
|
score of any preceding production, the current production is cached for later
|
|
consideration. If no later production matches, then the cached
|
|
production is treated as having matched, and the value of the
|
|
item immediately before its C<E<lt>score:...E<gt>> directive is returned as the
|
|
result.
|
|
|
|
In other words, by putting a C<E<lt>score:...E<gt>> directive at the end of
|
|
each production, you can select which production matches using
|
|
criteria other than specification order. For example:
|
|
|
|
sentence: verb noun preposition article noun { [@item] } <score: sensible(@item)>
|
|
| adjective noun verb article noun { [@item] } <score: sensible(@item)>
|
|
| noun verb preposition article noun { [@item] } <score: sensible(@item)>
|
|
|
|
Now, when each production reaches its respective C<E<lt>score:...E<gt>>
|
|
directive, the subroutine C<sensible> will be called to evaluate the
|
|
matched items (somehow). Once all productions have been tried, the
|
|
one which C<sensible> scored most highly will be the one that is
|
|
accepted as a match for the rule.
|
|
|
|
The variable $score always holds the current best score of any production,
|
|
and the variable $score_return holds the corresponding return value.
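The selection rule can be sketched in plain Perl. This is an illustration of the described behaviour only: the C<best_production> subroutine is hypothetical, and the numeric scores are invented stand-ins for whatever C<sensible> might return.

```perl
use strict;
use warnings;

# Each "production" is a [value, score] pair, listed in specification order.
# A later production replaces the cached best only if its score is strictly
# greater, so ties go to the earlier production.
sub best_production {
    my @productions = @_;
    my ($score, $score_return);
    for my $p (@productions) {
        my ($value, $this_score) = @$p;
        if (!defined $score || $this_score > $score) {
            ($score, $score_return) = ($this_score, $value);
        }
    }
    return $score_return;
}

my $winner = best_production(
    [ 'verb noun preposition article noun' => 0.2 ],   # invented scores
    [ 'adjective noun verb article noun'   => 0.1 ],
    [ 'noun verb preposition article noun' => 0.9 ],
);
print "$winner\n";

my $tie = best_production([ first => 1 ], [ second => 1 ]);
print "$tie\n";
```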
|
|
|
|
As another example, the following grammar matches lines that may be
|
|
separated by commas, colons, or spaces. This can be tricky if
|
|
a colon-separated line also contains commas, or vice versa. The grammar
|
|
resolves the ambiguity by selecting the rule that results in the
|
|
fewest fields:
|
|
|
|
line: seplist[sep=>','] <score: -@{$item[1]}>
|
|
| seplist[sep=>':'] <score: -@{$item[1]}>
|
|
| seplist[sep=>" "] <score: -@{$item[1]}>
|
|
|
|
seplist: <skip:""> <leftop: /[^$arg{sep}]*/ "$arg{sep}" /[^$arg{sep}]*/>
|
|
|
|
Note the use of negation within the C<E<lt>score:...E<gt>> directive
|
|
to ensure that the seplist with the most items gets the lowest score.
|
|
|
|
As the above examples indicate, it is often the case that all productions
|
|
in a rule use exactly the same C<E<lt>score:...E<gt>> directive. It is
|
|
tedious to have to repeat this identical directive in every production, so
|
|
Parse::RecDescent also provides the C<E<lt>autoscore:...E<gt>> directive.
|
|
|
|
If an C<E<lt>autoscore:...E<gt>> directive appears in any
|
|
production of a rule, the code it specifies is used as the scoring
|
|
code for every production of that rule, except productions that already
|
|
end with an explicit C<E<lt>score:...E<gt>> directive. Thus the rules above could
|
|
be rewritten:
|
|
|
|
line: <autoscore: -@{$item[1]}>
|
|
line: seplist[sep=>',']
|
|
| seplist[sep=>':']
|
|
| seplist[sep=>" "]
|
|
|
|
|
|
sentence: <autoscore: sensible(@item)>
|
|
| verb noun preposition article noun { [@item] }
|
|
| adjective noun verb article noun { [@item] }
|
|
| noun verb preposition article noun { [@item] }
|
|
|
|
Note that the C<E<lt>autoscore:...E<gt>> directive itself acts as an
|
|
unconditional C<E<lt>rejectE<gt>>, and (like the C<E<lt>rulevar:...E<gt>>
|
|
directive) is pruned at compile-time wherever possible.
|
|
|
|
|
|
=item Dispensing with grammar checks
|
|
|
|
During the compilation phase of parser construction, Parse::RecDescent performs
|
|
a small number of checks on the grammar it's given. Specifically it checks that
|
|
the grammar is not left-recursive, that there are no "insatiable" constructs of
|
|
the form:
|
|
|
|
rule: subrule(s) subrule
|
|
|
|
and that there are no rules missing (i.e. referred to, but never defined).
|
|
|
|
These checks are important during development, but can slow down parser
|
|
construction in stable code. So Parse::RecDescent provides the
|
|
C<E<lt>nocheckE<gt>> directive to turn them off. The directive can only appear
|
|
before the first rule definition, and switches off checking throughout the rest
|
|
of the current grammar.
|
|
|
|
Typically, this directive would be added when a parser has been thoroughly
|
|
tested and is ready for release.
|
|
|
|
=back
|
|
|
|
|
|
=head2 Subrule argument lists
|
|
|
|
It is occasionally useful to pass data to a subrule which is being invoked. For
|
|
example, consider the following grammar fragment:
|
|
|
|
classdecl: keyword decl
|
|
|
|
keyword: 'struct' | 'class';
|
|
|
|
decl: # WHATEVER
|
|
|
|
The C<decl> rule might wish to know which of the two keywords was used
|
|
(since it may affect some aspect of the way the subsequent declaration
|
|
is interpreted). C<Parse::RecDescent> allows the grammar designer to
|
|
pass data into a rule, by placing that data in an I<argument list>
|
|
(that is, in square brackets) immediately after any subrule item in a
|
|
production. Hence, we could pass the keyword to C<decl> as follows:
|
|
|
|
classdecl: keyword decl[ $item[1] ]
|
|
|
|
keyword: 'struct' | 'class';
|
|
|
|
decl: # WHATEVER
|
|
|
|
The argument list can consist of any number (including zero!) of comma-separated
|
|
Perl expressions. In other words, it looks exactly like a Perl anonymous
|
|
array reference. For example, we could pass the keyword, the name of the
|
|
surrounding rule, and the literal 'keyword' to C<decl> like so:
|
|
|
|
classdecl: keyword decl[$item[1],$item[0],'keyword']
|
|
|
|
keyword: 'struct' | 'class';
|
|
|
|
decl: # WHATEVER
|
|
|
|
Within the rule to which the data is passed (C<decl> in the above examples)
|
|
that data is available as the elements of a local variable C<@arg>. Hence
|
|
C<decl> might report its intentions as follows:
|
|
|
|
classdecl: keyword decl[$item[1],$item[0],'keyword']
|
|
|
|
keyword: 'struct' | 'class';
|
|
|
|
decl: { print "Declaring $arg[0] (a $arg[2])\n";
|
|
print "(this rule called by $arg[1])" }
|
|
|
|
Subrule argument lists can also be interpreted as hashes, simply by using
|
|
the local variable C<%arg> instead of C<@arg>. Hence we could rewrite the
|
|
previous example:
|
|
|
|
classdecl: keyword decl[keyword => $item[1],
|
|
caller => $item[0],
|
|
type => 'keyword']
|
|
|
|
keyword: 'struct' | 'class';
|
|
|
|
decl: { print "Declaring $arg{keyword} (a $arg{type})\n";
|
|
print "(this rule called by $arg{caller})" }
|
|
|
|
Both C<@arg> and C<%arg> are always available, so the grammar designer may
|
|
choose whichever convention (or combination of conventions) suits best.
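The relationship between the two views is the ordinary Perl one between a list and a hash built from it. A minimal sketch in plain Perl, where the C<decl> subroutine is a hypothetical stand-in for the rule and not generated parser code:

```perl
use strict;
use warnings;

sub decl {
    my @arg = @_;   # positional view of the argument list
    my %arg = @_;   # named view of the very same list
    return "Declaring $arg{keyword} (a $arg{type}); first element is '$arg[0]'";
}

my $report = decl(keyword => 'struct', caller => 'classdecl', type => 'keyword');
print "$report\n";
```

Note that a list with an odd number of elements would trigger Perl's usual warning when assigned to a hash, so the named view is only meaningful for key/value argument lists.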
|
|
|
|
Subrule argument lists are also useful for creating "rule templates"
|
|
(especially when used in conjunction with the C<E<lt>matchrule:...E<gt>>
|
|
directive). For example, the subrule:
|
|
|
|
list: <matchrule:$arg{rule}> /$arg{sep}/ list[%arg]
|
|
{ $return = [ $item[1], @{$item[3]} ] }
|
|
| <matchrule:$arg{rule}>
|
|
{ $return = [ $item[1]] }
|
|
|
|
is a handy template for the common problem of matching a separated list.
|
|
For example:
|
|
|
|
function: 'func' name '(' list[rule=>'param',sep=>';'] ')'
|
|
|
|
param: list[rule=>'name',sep=>','] ':' typename
|
|
|
|
name: /\w+/
|
|
|
|
typename: name
|
|
|
|
|
|
When a subrule argument list is used with a repeated subrule, the argument list
|
|
goes I<before> the repetition specifier:
|
|
|
|
list: /some|many/ thing[ $item[1] ](s)
|
|
|
|
The argument list is "late bound". That is, it is re-evaluated for every
|
|
repetition of the repeated subrule.
|
|
This means that each repeated attempt to match the subrule may be
|
|
passed a completely different set of arguments if the value of the
|
|
expression in the argument list changes between attempts. So, for
|
|
example, the grammar:
|
|
|
|
{ $::species = 'dogs' }
|
|
|
|
pair: 'two' animal[$::species](s)
|
|
|
|
animal: /$arg[0]/ { $::species = 'cats' }
|
|
|
|
will match the string "two dogs cats cats" completely, whereas
|
|
it will only match the string "two dogs dogs dogs" up to the
|
|
eighth character. If the value of the argument list were "early bound"
|
|
(that is, evaluated only the first time a repeated subrule match is
|
|
attempted), one would expect the matching behaviours to be reversed.
|
|
|
|
Of course, it is possible to effectively "early bind" such argument lists
|
|
by passing them a value which does not change on each repetition. For example:
|
|
|
|
{ $::species = 'dogs' }
|
|
|
|
pair: 'two' { $::species } animal[$item[2]](s)
|
|
|
|
animal: /$arg[0]/ { $::species = 'cats' }
|
|
|
|
|
|
Arguments can also be passed to the start rule, simply by appending them
|
|
to the argument list with which the start rule is called (I<after> the
|
|
"line number" parameter). For example, given:
|
|
|
|
$parser = new Parse::RecDescent ( $grammar );
|
|
|
|
$parser->data($text, 1, "str", 2, \@arr);
|
|
|
|
# ^^^^^ ^ ^^^^^^^^^^^^^^^
|
|
# | | |
|
|
# TEXT TO BE PARSED | |
|
|
# STARTING LINE NUMBER |
|
|
# ELEMENTS OF @arg WHICH IS PASSED TO RULE data
|
|
|
|
then within the productions of the rule C<data>, the array C<@arg> will contain
|
|
C<("str", 2, \@arr)>.
|
|
|
|
|
|
=head2 Alternations
|
|
|
|
Alternations are implicit (unnamed) rules defined as part of a production. An
|
|
alternation is defined as a series of '|'-separated productions inside a
|
|
pair of round brackets. For example:
|
|
|
|
character: 'the' ( good | bad | ugly ) /dude/
|
|
|
|
Every alternation implicitly defines a new subrule, whose
|
|
automatically-generated name indicates its origin:
|
|
"_alternation_<I>_of_production_<P>_of_rule_<R>" for the appropriate
|
|
values of <I>, <P>, and <R>. A call to this implicit subrule is then
|
|
inserted in place of the brackets. Hence the above example is merely a
|
|
convenient short-hand for:
|
|
|
|
character: 'the'
|
|
_alternation_1_of_production_1_of_rule_character
|
|
/dude/
|
|
|
|
_alternation_1_of_production_1_of_rule_character:
|
|
good | bad | ugly
|
|
|
|
Since alternations are parsed by recursively calling the parser generator,
|
|
any type(s) of item can appear in an alternation. For example:
|
|
|
|
character: 'the' ( 'high' "plains" # Silent, with poncho
|
|
| /no[- ]name/ # Silent, no poncho
|
|
| vengeance_seeking # Poncho-optional
|
|
| <error>
|
|
) drifter
|
|
|
|
In this case, if an error occurred, the automatically generated
|
|
message would be:
|
|
|
|
ERROR (line <N>): Invalid implicit subrule: Expected
|
|
'high' or /no[- ]name/ or vengeance_seeking,
|
|
but found "pacifist" instead
|
|
|
|
Since every alternation actually has a name, it's even possible
|
|
to extend or replace them:
|
|
|
|
$parser->Replace(
|
|
"_alternation_1_of_production_1_of_rule_character:
|
|
'generic Eastwood'"
|
|
);
|
|
|
|
More importantly, since alternations are a form of subrule, they can be given
|
|
repetition specifiers:
|
|
|
|
character: 'the' ( good | bad | ugly )(?) /dude/
|
|
|
|
|
|
=head2 Incremental Parsing
|
|
|
|
C<Parse::RecDescent> provides two methods - C<Extend> and C<Replace> - which
|
|
can be used to alter the grammar matched by a parser. Both methods
|
|
take the same argument as C<Parse::RecDescent::new>, namely a
|
|
grammar specification string.
|
|
|
|
C<Parse::RecDescent::Extend> interprets the grammar specification and adds any
|
|
productions it finds to the end of the rules for which they are specified. For
|
|
example:
|
|
|
|
$add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
|
|
$parser->Extend($add);
|
|
|
|
adds two productions to the rule "name" (creating it if necessary) and one
|
|
production to the rule "desc".
|
|
|
|
C<Parse::RecDescent::Replace> is identical, except that it first resets each
|
|
rule specified in the additional grammar, removing any existing productions.
|
|
Hence after:
|
|
|
|
$add = "name: 'Jimmy-Bob' | 'Bobby-Jim'\ndesc: colour /necks?/";
|
|
$parser->Replace($add);
|
|
|
|
there are I<only> two valid "name"s and the one possible description.
|
|
|
|
A more interesting use of the C<Extend> and C<Replace> methods is to call them
|
|
inside the action of an executing parser. For example:
|
|
|
|
typedef: 'typedef' type_name identifier ';'
|
|
{ $thisparser->Extend("type_name: '$item[3]'") }
|
|
| <error>
|
|
|
|
identifier: ...!type_name /[A-Za-z_]\w*/
|
|
|
|
which automatically prevents type names from being typedef'd, or:
|
|
|
|
command: 'map' key_name 'to' abort_key
|
|
{ $thisparser->Replace("abort_key: '$item[2]'") }
|
|
| 'map' key_name 'to' key_name
|
|
{ map_key($item[2],$item[4]) }
|
|
| abort_key
|
|
{ exit if confirm("abort?") }
|
|
|
|
abort_key: 'q'
|
|
|
|
key_name: ...!abort_key /[A-Za-z]/
|
|
|
|
which allows the user to change the abort key binding, but not to unbind it.
|
|
|
|
The careful use of such constructs makes it possible to reconfigure a
|
|
running parser, eliminating the need for semantic feedback by
|
|
providing syntactic feedback instead. However, as currently implemented,
|
|
C<Replace()> and C<Extend()> have to regenerate and re-C<eval> the
|
|
entire parser whenever they are called. This makes them quite slow for
|
|
large grammars.
|
|
|
|
In such cases, the judicious use of an interpolated regex is likely to
|
|
be far more efficient:
|
|
|
|
typedef: 'typedef' type_name identifier ';'
|
|
{ $thisparser->{local}{type_name} .= "|$item[3]" }
|
|
| <error>
|
|
|
|
identifier: ...!type_name /[A-Za-z_]\w*/
|
|
|
|
type_name: /$thisparser->{local}{type_name}/
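The effect of growing the pattern can be seen without a parser at all. In this sketch the scalar C<$type_name> is a hypothetical stand-in for C<$thisparser-E<gt>{local}{type_name}>, its initial contents are assumed, and the two helper subroutines are invented to mimic the C<type_name> and C<identifier> rules. The sketch also anchors the pattern, which the grammar above does not.

```perl
use strict;
use warnings;

# Assumed initial set of type names (not taken from the grammar above).
my $type_name = 'int|char';

# Mimics the type_name rule: does the whole word match the current pattern?
sub is_type_name {
    my ($word) = @_;
    return $word =~ /\A(?:$type_name)\z/ ? 1 : 0;
}

# Mimics the identifier rule: not a type name, but shaped like an identifier.
sub is_identifier {
    my ($word) = @_;
    return (!is_type_name($word) && $word =~ /\A[A-Za-z_]\w*\z/) ? 1 : 0;
}

my $before = is_identifier('myint');   # still an ordinary identifier
$type_name .= '|myint';                # what the typedef action appends
my $after  = is_identifier('myint');   # now rejected as a type name

print "before: $before, after: $after\n";
```

Because the pattern is interpolated each time it is used, the change takes effect immediately and nothing has to be regenerated.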
|
|
|
|
|
|
=head2 Precompiling parsers
|
|
|
|
Normally Parse::RecDescent builds a parser from a grammar at run-time.
|
|
That approach simplifies the design and implementation of parsing code,
|
|
but has the disadvantage that it slows the parsing process down - you
|
|
have to wait for Parse::RecDescent to build the parser every time the
|
|
program runs. Long or complex grammars can be particularly slow to
|
|
build, leading to unacceptable delays at start-up.
|
|
|
|
To overcome this, the module provides a way of "pre-building" a parser
|
|
object and saving it in a separate module. That module can then be used
|
|
to create clones of the original parser.
|
|
|
|
A grammar may be precompiled using the C<Precompile> class method.
|
|
For example, to precompile a grammar stored in the scalar $grammar,
|
|
and produce a class named PreGrammar in a module file named PreGrammar.pm,
|
|
you could use:
|
|
|
|
use Parse::RecDescent;
|
|
|
|
Parse::RecDescent->Precompile($grammar, "PreGrammar");
|
|
|
|
The first argument is the grammar string, the second is the name of the class
|
|
to be built. The name of the module file is generated automatically by
|
|
appending ".pm" to the last element of the class name. Thus
|
|
|
|
Parse::RecDescent->Precompile($grammar, "My::New::Parser");
|
|
|
|
would produce a module file named Parser.pm.
|
|
|
|
It is somewhat tedious to have to write a small Perl program just to
|
|
generate a precompiled grammar class, so Parse::RecDescent has some special
|
|
magic that allows you to do the job directly from the command-line.
|
|
|
|
If your grammar is specified in a file named F<grammar>, you can generate
|
|
a class named Yet::Another::Grammar like so:
|
|
|
|
> perl -MParse::RecDescent - grammar Yet::Another::Grammar
|
|
|
|
This would produce a file named F<Grammar.pm> containing the full
|
|
definition of a class called Yet::Another::Grammar. Of course, to use
|
|
that class, you would need to put the F<Grammar.pm> file in a
|
|
directory named F<Yet/Another>, somewhere in your Perl include path.
|
|
|
|
Having created the new class, it's very easy to use it to build
|
|
a parser. You simply C<use> the new module, and then call its
|
|
C<new> method to create a parser object. For example:
|
|
|
|
use Yet::Another::Grammar;
|
|
my $parser = Yet::Another::Grammar->new();
|
|
|
|
The effect of these two lines is exactly the same as:
|
|
|
|
use Parse::RecDescent;
|
|
|
|
open GRAMMAR_FILE, "grammar" or die;
|
|
local $/;
|
|
my $grammar = <GRAMMAR_FILE>;
|
|
|
|
my $parser = Parse::RecDescent->new($grammar);
|
|
|
|
only considerably faster.
|
|
|
|
Note however that the parsers produced by either approach are exactly
|
|
the same, so whilst precompilation has an effect on I<set-up> speed,
|
|
it has no effect on I<parsing> speed. RecDescent 2.0 will address that
|
|
problem.
|
|
|
|
|
|
=head2 A Metagrammar for C<Parse::RecDescent>
|
|
|
|
The following is a specification of the grammar format accepted by
|
|
C<Parse::RecDescent::new> (specified in the C<Parse::RecDescent> grammar format!):
|
|
|
|
grammar : component(s)
|
|
|
|
component : rule | comment
|
|
|
|
rule : "\n" identifier ":" production(s?)
|
|
|
|
production : item(s)
|
|
|
|
item : lookahead(?) simpleitem
|
|
| directive
|
|
| comment
|
|
|
|
lookahead : '...' | '...!' # +'ve or -'ve lookahead
|
|
|
|
simpleitem : subrule args(?) # match another rule
|
|
| repetition # match repeated subrules
|
|
| terminal # match the next input
|
|
| bracket args(?) # match alternative items
|
|
| action # do something
|
|
|
|
subrule : identifier # the name of the rule
|
|
|
|
args : {extract_codeblock($text,'[]')} # just like a [...] array ref
|
|
|
|
repetition : subrule args(?) howoften
|
|
|
|
howoften : '(?)' # 0 or 1 times
|
|
| '(s?)' # 0 or more times
|
|
| '(s)' # 1 or more times
|
|
| /[(](\d+)[.][.](\d+)[)]/ # $1 to $2 times
|
|
| /[(][.][.](\d*)[)]/ # at most $1 times
|
|
| /[(](\d*)[.][.][)]/ # at least $1 times
|
|
|
|
terminal : /[/]([\][/]|[^/])*[/]/ # interpolated pattern
|
|
| /"([\]"|[^"])*"/ # interpolated literal
|
|
| /'([\]'|[^'])*'/ # uninterpolated literal
|
|
|
|
action : { extract_codeblock($text) } # embedded Perl code
|
|
|
|
bracket : '(' item(s) production(s?) ')' # alternative subrules
|
|
|
|
directive : '<commit>' # commit to production
|
|
| '<uncommit>' # cancel commitment
|
|
| '<resync>' # skip to newline
|
|
| '<resync:' pattern '>' # skip <pattern>
|
|
| '<reject>' # fail this production
|
|
| '<reject:' condition '>' # fail if <condition>
|
|
| '<error>' # report an error
|
|
| '<error:' string '>' # report error as "<string>"
|
|
| '<error?>' # error only if committed
|
|
| '<error?:' string '>' # " " " "
|
|
| '<rulevar:' /[^>]+/ '>' # define rule-local variable
|
|
| '<matchrule:' string '>' # invoke rule named in string
|
|
|
|
identifier : /[a-z]\w*/i # must start with alpha
|
|
|
|
comment : /#[^\n]*/ # same as Perl
|
|
|
|
pattern : {extract_bracketed($text,'<')} # allow embedded "<..>"
|
|
|
|
condition : {extract_codeblock($text,'{<')} # full Perl expression
|
|
|
|
string : {extract_variable($text)} # any Perl variable
|
|
| {extract_quotelike($text)} # or quotelike string
|
|
| {extract_bracketed($text,'<')} # or balanced brackets
|
|
|
|
|
|
=head1 GOTCHAS
|
|
|
|
This section describes common mistakes that grammar writers seem to
|
|
make on a regular basis.
|
|
|
|
=head2 1. Expecting an error to always invalidate a parse
|
|
|
|
A common mistake when using error messages is to write the grammar like this:
|
|
|
|
file: line(s)
|
|
|
|
line: line_type_1
|
|
| line_type_2
|
|
| line_type_3
|
|
| <error>
|
|
|
|
The expectation seems to be that any line that is not of type 1, 2 or 3 will
|
|
invoke the C<E<lt>errorE<gt>> directive and thereby cause the parse to fail.
|
|
|
|
Unfortunately, that only happens if the error occurs in the very first line.
|
|
The first rule states that a C<file> is matched by one or more lines, so if
|
|
even a single line succeeds, the first rule is completely satisfied and the
|
|
parse as a whole succeeds. That means that any error messages generated by
|
|
subsequent failures in the C<line> rule are quietly ignored.
|
|
|
|
Typically what's really needed is this:
|
|
|
|
file: line(s) eofile { $return = $item[1] }
|
|
|
|
line: line_type_1
|
|
| line_type_2
|
|
| line_type_3
|
|
| <error>
|
|
|
|
eofile: /^\Z/
|
|
|
|
The addition of the C<eofile> subrule to the first production means that
|
|
a file only matches a series of successful C<line> matches I<that consume the
|
|
complete input text>. If any input text remains after the lines are matched,
|
|
there must have been an error in the last C<line>. In that case the C<eofile>
|
|
rule will fail, causing the entire C<file> rule to fail too.
|
|
|
|
Note too that C<eofile> must match C</^\Z/> (end-of-text), I<not>
|
|
C</^\cZ/> or C</^\cD/> (end-of-file).
|
|
|
|
And don't forget the action at the end of the production. If you just
|
|
write:
|
|
|
|
file: line(s) eofile
|
|
|
|
then the value returned by the C<file> rule will be the value of its
|
|
last item: C<eofile>. Since C<eofile> always returns an empty string
|
|
on success, that will cause the C<file> rule to return that empty
|
|
string. Apart from returning the wrong value, returning an empty string
|
|
will trip up code such as:
|
|
|
|
$parser->file($filetext) || die;
|
|
|
|
(since "" is false).
|
|
|
|
Remember that Parse::RecDescent returns undef on failure,
|
|
so the only safe test for failure is:
|
|
|
|
defined($parser->file($filetext)) || die;
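
The points above can be put together in a short script. This is only a sketch: the C<name = value> line format, the C<bare_file> rule and the variable names are assumptions made for illustration and are not part of the module.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Hypothetical grammar for illustration: every line is "name = value".
my $grammar = q{
    file:      line(s) eofile { $return = $item[1] }

    bare_file: line(s) eofile

    line:      /\w+/ '=' /\w+/ { $return = [ $item[1], $item[3] ] }
        |      <error>

    eofile:    /^\Z/
};

my $parser = Parse::RecDescent->new($grammar)
    or die "Could not build parser\n";

my $good = "a = 1\nb = 2\n";
my $bad  = "a = 1\n???\n";

my $lines = $parser->file($good);       # array ref of [name, value] pairs
my $empty = $parser->bare_file($good);  # defined, but "" (false)
my $fail  = $parser->file($bad);        # undef: text is left over at eofile

print "good input: ", scalar(@$lines), " lines\n" if defined $lines;
print "bare_file succeeded but returned a false value\n"
    if defined $empty && !$empty;
print "bad input rejected\n" unless defined $fail;
```

The C<bare_file> rule shows why C<|| die> is unsafe: the parse succeeds, yet the returned value is false. Only the C<defined> test separates that case from a genuine failure.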
|
|
|
|
|
|
=head1 DIAGNOSTICS
|
|
|
|
Diagnostics are intended to be self-explanatory (particularly if you
|
|
use B<-RD_HINT> (under B<perl -s>) or define C<$::RD_HINT> inside the program).
|
|
|
|
C<Parse::RecDescent> currently diagnoses the following:
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
Invalid regular expressions used as pattern terminals (fatal error).
|
|
|
|
=item *
|
|
|
|
Invalid Perl code in code blocks (fatal error).
|
|
|
|
=item *
|
|
|
|
Lookahead used in the wrong place or in a nonsensical way (fatal error).
|
|
|
|
=item *
|
|
|
|
"Obvious" cases of left-recursion (fatal error).
|
|
|
|
=item *
|
|
|
|
Missing or extra components in a C<E<lt>leftopE<gt>> or C<E<lt>rightopE<gt>>
|
|
directive.
|
|
|
|
=item *
|
|
|
|
Unrecognisable components in the grammar specification (fatal error).
|
|
|
|
=item *
|
|
|
|
"Orphaned" rule components specified before the first rule (fatal error)
|
|
or after an C<E<lt>errorE<gt>> directive (level 3 warning).
|
|
|
|
=item *
|
|
|
|
Missing rule definitions (this only generates a level 3 warning, since you
|
|
may be providing them later via C<Parse::RecDescent::Extend()>).
|
|
|
|
=item *
|
|
|
|
Instances where greedy repetition behaviour will almost certainly
|
|
cause the failure of a production (a level 3 warning - see
|
|
L<"ON-GOING ISSUES AND FUTURE DIRECTIONS"> below).
|
|
|
|
=item *
|
|
|
|
Attempts to define rules named 'Replace' or 'Extend', which cannot be
|
|
called directly through the parser object because of the predefined
|
|
meaning of C<Parse::RecDescent::Replace> and
|
|
C<Parse::RecDescent::Extend>. (Only a level 2 warning is generated, since
|
|
such rules I<can> still be used as subrules).
|
|
|
|
=item *
|
|
|
|
Productions which consist of a single C<E<lt>error?E<gt>>
|
|
directive, and which therefore may succeed unexpectedly
|
|
(a level 2 warning, since this might conceivably be the desired effect).
|
|
|
|
=item *
|
|
|
|
Multiple consecutive lookahead specifiers (a level 1 warning only, since their
|
|
effects simply accumulate).
|
|
|
|
=item *
|
|
|
|
Productions which start with a C<E<lt>rejectE<gt>> or C<E<lt>rulevar:...E<gt>>
|
|
directive. Such productions are optimized away (a level 1 warning).
|
|
|
|
=item *
|
|
|
|
Rules which are autogenerated under C<$::RD_AUTOSTUB> (a level 1 warning).
|
|
|
|
=back
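
As a rough illustration of how hints are requested from inside a program, the sketch below sets C<$::RD_HINT> before building a parser from a grammar that deliberately omits one rule. The grammar and its rule names are hypothetical, and the exact wording of any diagnostic printed to STDERR may differ between versions.

```perl
use strict;
use warnings;
use Parse::RecDescent;

# Set the flag before the parser is built, so that it is already in
# effect while the grammar is being checked.
$::RD_HINT = 1;

# Hypothetical grammar: the rule 'name' is deliberately left undefined.
my $parser = Parse::RecDescent->new(q{
    greeting: 'hello' name
});

print "parser built despite the missing rule\n" if defined $parser;
```

Roughly the same effect should be obtainable from the command line, without editing the program, by running it as C<perl -s script.pl -RD_HINT>.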
|
|
|
|
=head1 AUTHOR
|
|
|
|
Damian Conway (damian@conway.org)
|
|
|
|
=head1 BUGS AND IRRITATIONS
|
|
|
|
There are undoubtedly serious bugs lurking somewhere in this much code :-)
|
|
Bug reports and other feedback are most welcome.
|
|
|
|
Ongoing annoyances include:
|
|
|
|
=over 4
|
|
|
|
=item *
|
|
|
|
There's no support for parsing directly from an input stream.
|
|
If and when the Perl Gods give us regular expressions on streams,
|
|
this should be trivial (ahem!) to implement.
|
|
|
|
=item *
|
|
|
|
The parser generator can get confused if actions aren't properly
|
|
closed or if they contain particularly nasty Perl syntax errors
|
|
(especially unmatched curly brackets).
|
|
|
|
=item *
|
|
|
|
The generator only detects the most obvious form of left recursion
|
|
(potential recursion on the first subrule in a rule). More subtle
|
|
forms of left recursion (for example, through the second item in a
|
|
rule after a "zero" match of a preceding "zero-or-more" repetition,
|
|
or after a match of a subrule with an empty production) are not found.
|
|
|
|
=item *
|
|
|
|
Instead of complaining about left-recursion, the generator should
|
|
silently transform the grammar to remove it. Don't expect this
|
|
feature any time soon as it would require a more sophisticated
|
|
approach to parser generation than is currently used.
|
|
|
|
=item *
|
|
|
|
The generated parsers don't always run as fast as might be wished.
|
|
|
|
=item *
|
|
|
|
The meta-parser should be bootstrapped using C<Parse::RecDescent> :-)
|
|
|
|
=back
|
|
|
|
=head1 ON-GOING ISSUES AND FUTURE DIRECTIONS
|
|
|
|
=over 4
|
|
|
|
=item 1.
|
|
|
|
Repetitions are "incorrigibly greedy" in that they will eat everything they can
|
|
and won't backtrack if that behaviour causes a production to fail needlessly.
|
|
So, for example:
|
|
|
|
rule: subrule(s) subrule
|
|
|
|
will I<never> succeed, because the repetition will eat all the
|
|
subrules it finds, leaving none to match the second item. Such
|
|
constructions are relatively rare (and C<Parse::RecDescent::new> generates a
|
|
warning whenever they occur) so this may not be a problem, especially
|
|
since the insatiable behaviour can be overcome "manually" by writing:
|
|
|
|
rule: penultimate_subrule(s) subrule
|
|
|
|
penultimate_subrule: subrule ...subrule
|
|
|
|
The issue is that this construction is exactly twice as expensive as the
|
|
original, whereas backtracking would add only 1/I<N> to the cost (for
|
|
matching I<N> repetitions of C<subrule>). I would welcome feedback on
|
|
the need for backtracking; particularly on cases where the lack of it
|
|
makes parsing performance problematic.
|
|
|
|
=item 2.
|
|
|
|
Having opened that can of worms, it's also necessary to consider whether there
|
|
is a need for non-greedy repetition specifiers. Again, it's possible (at some
|
|
cost) to manually provide the required functionality:
|
|
|
|
rule: nongreedy_subrule(s) othersubrule
|
|
|
|
nongreedy_subrule: subrule ...!othersubrule
|
|
|
|
Overall, the issue is whether the benefit of this extra functionality
|
|
outweighs the drawbacks of further complicating the (currently
|
|
minimalist) grammar specification syntax, and (worse) introducing more overhead
|
|
into the generated parsers.
|
|
|
|
=item 3.
|
|
|
|
An C<E<lt>autocommitE<gt>> directive would be nice. That is, it would be useful to be
|
|
able to say:
|
|
|
|
command: <autocommit>
|
|
command: 'find' name
|
|
| 'find' address
|
|
| 'do' command 'at' time 'if' condition
|
|
| 'do' command 'at' time
|
|
| 'do' command
|
|
| unusual_command
|
|
|
|
and have the generator work out that this should be "pruned" thus:
|
|
|
|
command: 'find' name
|
|
| 'find' <commit> address
|
|
| 'do' <commit> command <uncommit>
|
|
'at' time
|
|
'if' <commit> condition
|
|
| 'do' <commit> command <uncommit>
|
|
'at' <commit> time
|
|
| 'do' <commit> command
|
|
| unusual_command
|
|
|
|
There are several issues here. Firstly, should the
|
|
C<E<lt>autocommitE<gt>> automatically install an C<E<lt>uncommitE<gt>>
|
|
at the start of the last production (on the grounds that the "command"
|
|
rule doesn't know whether an "unusual_command" might start with "find"
|
|
or "do") or should the "unusual_command" subgraph be analysed (to see
|
|
if it I<might> be viable after a "find" or "do")?
|
|
|
|
The second issue is how regular expressions should be treated. The simplest
|
|
approach would be simply to uncommit before them (on the grounds that they
|
|
I<might> match). Better efficiency would be obtained by analyzing all preceding
|
|
literal tokens to determine whether the pattern would match them.
|
|
|
|
Overall, the issues are: can such automated "pruning" approach a hand-tuned
|
|
version sufficiently closely to warrant the extra set-up expense, and (more
|
|
importantly) is the problem important enough to even warrant the non-trivial
|
|
effort of building an automated solution?
|
|
|
|
=back
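
The greedy-repetition workaround from item 1 can be tried out with a small script. This is a sketch only: the rule names C<greedy>, C<patient>, C<nonfinal_word> and C<word> are invented for the example, and building the parser is expected to produce the greedy-repetition warning on STDERR for the first rule.

```perl
use strict;
use warnings;
use Parse::RecDescent;

my $parser = Parse::RecDescent->new(q{
    greedy:        word(s) word

    patient:       nonfinal_word(s) word

    nonfinal_word: word ...word

    word:          /\w+/
}) or die "Could not build parser\n";

my $text = "alpha beta gamma";

# The repetition consumes every word, so the trailing 'word' cannot match.
my $greedy  = $parser->greedy($text);

# Each repeated word must be followed by another word, so the last one
# is left over for the trailing 'word'.
my $patient = $parser->patient($text);

print "greedy:  ", (defined $greedy  ? $greedy  : 'failed'), "\n";
print "patient: ", (defined $patient ? $patient : 'failed'), "\n";
```

The lookahead in C<nonfinal_word> means that every word except the last is matched twice, which is the extra cost described in item 1.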
|
|
|
|
=head1 COPYRIGHT
|
|
|
|
Copyright (c) 1997-2000, Damian Conway. All Rights Reserved.
|
|
This module is free software. It may be used, redistributed
|
|
and/or modified under the terms of the Perl Artistic License
|
|
(see http://www.perl.com/perl/misc/Artistic.html).
|