Version: 4.2.2
4 Simple Text Parser
This module provides a simple text parser that can read strings
and turn them into data without first building lexemes (although it
can be used to either lex or parse).
More complex or faster parsers may require the use of the
parser-tools integrated in Scheme.
A parser is given a list of matcher procedures and associated action procedures.
A matcher is generally a regexp; the associated action turns
the matched text into something else.
On the input string, the parser recursively looks
for the matcher that matches the earliest
character and applies its action.
no-match-proc is applied to the portion of the string (before the first matched
character) that has not been matched.
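The loop described above can be sketched with standard regexp functions only. This is an illustration, not the module's implementation: sketch-parse, the (matcher . action) rule representation, and the choice to pass matched substrings to the action are all assumptions made for the example. Ties between matchers are simply left to the first rule here.

```racket
#lang racket/base

;; Hypothetical sketch, not the module's implementation.
;; rules is a list of (matcher . action) pairs; each matcher takes a string
;; and returns #f or a list of positions, like regexp-match-positions.
;; Assumption: matchers never match the empty string.
(define (sketch-parse rules no-match-proc str)
  (let loop ([str str] [acc '()])
    ;; Keep the (positions . action) pair whose match starts earliest.
    (define best
      (for/fold ([cur #f]) ([rule (in-list rules)])
        (let ([m ((car rule) str)])
          (if (and m (or (not cur) (< (caar m) (caaar cur))))
              (cons m (cdr rule))
              cur))))
    (if (not best)
        ;; No matcher applies: the remaining text is unmatched.
        (reverse (if (string=? str "") acc (cons (no-match-proc str) acc)))
        (let* ([m (car best)]
               [start (caar m)]
               [end (cdar m)]
               ;; Unmatched text before the match goes to no-match-proc.
               [acc (if (zero? start)
                        acc
                        (cons (no-match-proc (substring str 0 start)) acc))]
               ;; The action receives the matched substrings (an assumption).
               [groups (map (lambda (p) (and p (substring str (car p) (cdr p))))
                            m)])
          (loop (substring str end)
                (cons (apply (cdr best) groups) acc))))))

(define demo-rules
  (list (cons (lambda (s) (regexp-match-positions #px"[0-9]+" s)) string->number)
        (cons (lambda (s) (regexp-match-positions #px"[a-z]+" s)) string->symbol)))

(sketch-parse demo-rules (lambda (s) s) "ab 12, cd")
; '(ab " " 12 ", " cd)
```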
The parser has an internal state, the "phase", where it is possible
to define local parsers that only work when the parser is in that phase.
Actions can make the parser switch to a given phase.
Automata transitions can then easily be defined.
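As an illustration of phases acting as automaton states, here is a small self-contained sketch. It does not use this module: run-phases and the (phase pattern action next-phase) rule layout are hypothetical, and only standard regexp functions are called. The automaton upcases text found between double quotes and drops the quotes.

```racket
#lang racket/base

;; Hypothetical rule layout: (phase pattern action next-phase).
(define phase-rules
  (list (list 'text   #px"\""    (lambda (s) "") 'string)  ; opening quote
        (list 'string #px"\""    (lambda (s) "") 'text)    ; closing quote
        (list 'string #px"[^\"]+" string-upcase  'string))) ; quoted text

(define (run-phases str)
  (let loop ([str str] [phase 'text] [acc '()])
    ;; Among the rules of the current phase, keep the earliest match.
    (define hit
      (for/fold ([cur #f]) ([r (in-list phase-rules)])
        (let ([m (and (eq? (car r) phase)
                      (regexp-match-positions (cadr r) str))])
          (if (and m (or (not cur) (< (caar m) (caaar cur))))
              (cons m r)
              cur))))
    (if (not hit)
        ;; Nothing matches in this phase: keep the rest unchanged.
        (apply string-append (reverse (cons str acc)))
        (let* ([start (caaar hit)]
               [end (cdaar hit)]
               [r (cdr hit)]
               [out ((caddr r) (substring str start end))])
          ;; Keep the unmatched prefix, add the action's output,
          ;; and continue in the phase named by the rule.
          (loop (substring str end)
                (cadddr r)
                (cons out (cons (substring str 0 start) acc)))))))

(run-phases "say \"hi\" now") ; "say HI now"
```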
Instead of switching to another phase, it is also possible to set the
parser into a "sub-parser" mode, and to provide the sub-parser with a callback
that will be applied only once the sub-parser has returned.
The fastest and easiest way to understand how it works is probably to
look at the examples in the "examples" directory.
Some simple examples are also given at the end of this page.
See also the "defs-parser.ss" source file for a more complex usage.
4.1 Priorities
When parsing a string, among all matchers of the current phase,
the matcher whose action is triggered is the one that matches the earliest
character in the string.
If several matchers match at that same character, only the
last added matcher is chosen.
In add-items, priority goes to the matcher that is defined lowest in the
source file.
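The selection rule can be sketched as follows. The helper pick-rule and the (name . pattern) rule list are hypothetical and only serve the example; the list is assumed to be in the order the rules were added.

```racket
#lang racket/base

;; Hypothetical helper: rules are listed in the order they were added.
;; Returns the name of the rule to fire, or #f if none matches.
(define (pick-rule rules str)
  (define best
    (for/fold ([cur #f]) ([rule (in-list rules)])
      (let ([m (regexp-match-positions (cdr rule) str)])
        ;; <= lets a later rule replace an earlier one on a tie.
        (if (and m (or (not cur) (<= (caar m) (cdr cur))))
            (cons (car rule) (caar m))
            cur))))
  (and best (car best)))

(define rules
  (list (cons 'word   #px"[a-z]+")
        (cons 'define #px"define")
        (cons 'number #px"[0-9]+")))

(pick-rule rules "42 define") ; 'number: it matches at position 0
(pick-rule rules "define 42") ; 'define: ties with word at 0, added later
```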
Returns #t if p is a parser, #f otherwise.
4.2 Main Functions
Creates a new parser with default behavior no-match-proc, starting in phase phase.
All the outputs generated by the parser are then appended with appender.
Adds the matcher in and its associated action out
to parser.
The matcher will match only when the parser is in a phase for which phase? returns #t.
If phase? is a procedure, it will be used as is to test the parser’s phase.
If phase? equals #t, it will be changed to (λ args #t)
so that it matches any phase.
Any other value of phase? will be turned into a procedure that compares
the parser’s phase with this value using equal?.
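The three cases for phase? can be sketched like this. The name phase->predicate is hypothetical; the module's internal helper, if any, is not shown in this page.

```racket
#lang racket/base

;; Hypothetical helper mirroring the three cases described above.
(define (phase->predicate phase?)
  (cond
    [(procedure? phase?) phase?]                     ; used as is
    [(eq? phase? #t) (lambda args #t)]               ; matches any phase
    [else (lambda (phase) (equal? phase phase?))]))  ; compared with equal?

((phase->predicate 'in-string) 'in-string)  ; #t
((phase->predicate 'in-string) 'start)      ; #f
((phase->predicate #t) 'anything)           ; #t
((phase->predicate symbol?) "not-a-symbol") ; #f
```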
If in is a string it will be turned into a procedure that matches
the corresponding pregexp.
If in is a symbol, it will be turned into a procedure that matches
the corresponding pregexp with word boundaries on both sides (useful
for matching names or programming-language keywords).
If in is a list, then add-item is called recursively on each member
of in with the same parser, phase? and out.
If in equals #t, it will modify the no-match-proc procedure
to add the corresponding action when phase? applies to the parser.
In the end, in returns the same kind of values as regexp-match-positions.
out must be a procedure that accepts the same number of arguments as
the number of values returned by the matcher in.
For example, if in is "aa(b+)c(d+)e", then out must
take 3 arguments (one for the whole string, and two for the b’s and the d’s).
If out is not a procedure, it will be turned into a procedure that accepts
any number of arguments and returns out.
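To make the arity rule concrete, the sketch below shows what regexp-match-positions returns for the pattern above and how an action of matching arity could be applied. The helpers out->procedure and apply-action are hypothetical, and passing the matched substrings (rather than positions) to the action is an assumption of this example.

```racket
#lang racket/base

(define str "xaabbbcdde")

;; One pair for the whole match, one per parenthesized group.
(define m (regexp-match-positions #px"aa(b+)c(d+)e" str))
m ; '((1 . 10) (3 . 6) (7 . 9))

;; Hypothetical: a non-procedure out becomes a constant procedure.
(define (out->procedure out)
  (if (procedure? out) out (lambda args out)))

;; Hypothetical: call the action with one substring per position pair.
(define (apply-action out str m)
  (apply (out->procedure out)
         (map (lambda (p) (substring str (car p) (cdr p))) m)))

(apply-action (lambda (all bs ds)
                (list (string-length bs) (string-length ds)))
              str m)                  ; '(3 2)
(apply-action "replaced" str m)       ; "replaced"
```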
(add-items parser [phase? [search-proc output-proc] ...] ...)
The general form for adding several items at once.
See the examples at the end of this page.
Parses text with parser, starting in phase phase, which is the current phase
by default.
If split is #t, the text is split into lines, separated by "\n" strings.
This is the default behavior for speed reasons: several regexps tried in parallel
can be greedy.
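A minimal sketch of line splitting, assuming each line is handed to some per-line parsing procedure. The name parse-by-lines is hypothetical, and whether the module puts the "\n" separators back into its result is not stated here, so this sketch simply returns one result per line.

```racket
#lang racket/base

;; Hypothetical: parse-line stands in for whatever parses a single line.
(define (parse-by-lines parse-line text)
  (map parse-line (regexp-split #rx"\n" text)))

(parse-by-lines string-upcase "ab\ncd") ; '("AB" "CD")
```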
It is possible to call the parser inside the parsing phase, i.e.
once a portion of the text has been parsed, it can be given to the parser
itself in some phase to make further transformations.
This is not the same as sub-parsing because there is no callback.
4.3 Matchers
This section describes matching functions that can be used in the
in argument of add-item and add-items.
Turns s into a pregexp and returns a procedure
that takes an input string and applies
regexp-match-positions on that string with the pregexp s.
Same as re but regexp-quotes s beforehand, so that the string s
is matched exactly.
Same as txt but adds word-boundaries around s.
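The three kinds of matchers described in this section can be approximated with standard functions as follows. The names make-re-matcher, make-text-matcher and make-word-matcher are hypothetical stand-ins chosen for this sketch; they are not the module's exported names.

```racket
#lang racket/base

;; Regexp matcher: compile s as a pregexp, return a matching procedure.
(define (make-re-matcher s)
  (let ([rx (pregexp s)])
    (lambda (str) (regexp-match-positions rx str))))

;; Exact-text matcher: quote s first so special characters are literal.
(define (make-text-matcher s)
  (make-re-matcher (regexp-quote s)))

;; Word matcher: exact text with a word boundary on each side.
(define (make-word-matcher s)
  (make-re-matcher (string-append "\\b" (regexp-quote s) "\\b")))

((make-re-matcher "a.c") "xabcx")    ; '((1 . 4))
((make-text-matcher "a.c") "xabcx")  ; #f
((make-text-matcher "a.c") "xa.cx")  ; '((1 . 4))
((make-word-matcher "if") "elif x")  ; #f
((make-word-matcher "if") "x if y")  ; '((2 . 4))
```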
4.4 Actions
This section describes action functions that can be used in the
out argument of add-item and add-items.
Sets the parser in the phase phase and returns "".
Sets the current parser in sub-parse mode and switches to new-phase.
The result of the sub-parse is appended with appender, which by default
is the same as the parser’s.
When the sub-parser has finished parsing (it has returned with sub-parse-return),
callback is called with the result of the sub-parse, and the result of
callback is added to the current parser result.
Sub-parsers can be called recursively, from within a sub-parsing mode
or from the callback.
Returns "".
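The idea of a sub-parse with a callback can be illustrated by plain recursion. This is an analogy rather than the module's mechanism: parse-items and parse-tree are hypothetical names, "(" plays the role of entering a sub-parse, ")" the role of returning from it, and callback post-processes each nested result before it is added to the enclosing one.

```racket
#lang racket/base

;; Hypothetical sketch. Returns two values: the parsed items and the
;; text left unread after the current nesting level.
(define (parse-items str callback)
  (let loop ([str str] [acc '()])
    (let ([m (regexp-match-positions #px"[()]|[a-z0-9]+" str)])
      (if (not m)
          (values (reverse acc) "")
          (let ([tok (substring str (caar m) (cdar m))]
                [tail (substring str (cdar m))])
            (cond
              [(string=? tok "(")
               ;; Sub-parse the nested part, then resume after it.
               (let-values ([(sub tail2) (parse-items tail callback)])
                 (loop tail2 (cons (callback sub) acc)))]
              [(string=? tok ")")
               ;; Return from the current nesting level.
               (values (reverse acc) tail)]
              [else (loop tail (cons (string->symbol tok) acc))]))))))

(define (parse-tree str callback)
  (let-values ([(items tail) (parse-items str callback)])
    items))

(parse-tree "root (node1 leaf1) leaf2" (lambda (sub) (cons 'group sub)))
; '(root (group node1 leaf1) leaf2)
```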
By default, the parser accumulates the return values
of the action procedures.
The function cons-out can be used to add a value to the parser result
without it being the return value of an action.
This should rarely be needed.
Adds out to the current parser result and returns
from the current sub-parsing mode.
4.5 Examples
Examples:
"YaïCaïDaï -glitch- CaïDaï -gloutch- \nTaïPaïCHaï"
(tree: (root (node1 (leaf1 leaf2) leaf3) (node2 leaf4 (node3 leaf5) leaf6) leaf7))
Note that the result of the last example is Scheme data, not a string.