4 Simple Text Parser

4.1 Priorities

4.2 Main Functions

4.3 Matchers

4.4 Actions

4.5 Examples

Version: 4.2.2

(require (planet orseau/lazy-doc:1:7/simple-parser))

This module provides a simple text parser that reads strings and turns them into data without first building lexemes (although it can be used either to lex or to parse).

More complex or faster parsers may require the parser-tools library integrated in Scheme.

A parser is given a list of matcher procedures and associated action procedures. A matcher is generally a regexp; the associated action turns the matched text into something else. On the input string, the parser recursively looks for the matcher that matches the earliest character and applies its action. no-match-proc is applied to the unmatched portion of the string, i.e. the text before the first matched character.
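
The matching loop described above can be sketched in a few lines. This is only an illustration in plain Racket (the successor of PLT Scheme), not the package's implementation: the names earliest-match, sketch-parse and items are hypothetical, phases are left out, matchers are assumed never to match the empty string, and skipping empty unmatched portions is a choice made here. Ties between matchers that start at the same character go to the one added last, following the priority rule stated in the Priorities part of this page.

```racket
#lang racket
;; Sketch only: NOT the package's code. Phases are ignored and matchers
;; are assumed never to match the empty string.

;; items: list of (cons matcher action), in the order they were added.
;; A matcher maps a string to the result of regexp-match-positions, or #f.
(define (earliest-match items str)
  (for/fold ([best #f]) ([item (in-list items)])
    (define pos ((car item) str))
    ;; <= lets a later-added item win when two matches start together.
    (if (and pos (or (not best) (<= (caar pos) (caaar best))))
        (cons pos (cdr item))
        best)))

;; Returns the list of outputs; a parser would then combine them
;; with its appender, e.g. (apply string-append outputs).
(define (sketch-parse items no-match-proc str)
  (let loop ([str str] [acc '()])
    (define best (earliest-match items str))
    (cond
      [(not best)
       (reverse (if (string=? str "") acc (cons (no-match-proc str) acc)))]
      [else
       (define pos (car best))
       (define start (caar pos))
       (define end (cdar pos))
       ;; One string per regexp group; unmatched optional groups stay #f.
       (define groups
         (for/list ([p (in-list pos)])
           (and p (substring str (car p) (cdr p)))))
       (define out (apply (cdr best) groups))
       ;; The text before the match goes through no-match-proc.
       (define acc2
         (if (= start 0)
             acc
             (cons (no-match-proc (substring str 0 start)) acc)))
       (loop (substring str end) (cons out acc2))])))

;; Usage: replace every "ou" by "-" and upcase everything else.
(define items
  (list (cons (lambda (s) (regexp-match-positions #px"ou" s))
              (lambda (whole) "-"))))
(apply string-append (sketch-parse items string-upcase "youcou")) ; "Y-C-"
```

Restarting the search on the remaining substring keeps the sketch short, but it loses the left context that word-boundary matchers may rely on; a fuller version would search from an offset instead.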

The parser has an internal state, the "phase", where it is possible to define local parsers that only work when the parser is in that phase. Actions can make the parser switch to a given phase. Automata transitions can then easily be defined.

Instead of switching to another phase, it is also possible to set the parser into a "sub-parser" mode, and to provide the sub-parser with a callback that will be applied only once the sub-parser has returned.

The fastest and easiest way to understand how it works is probably to look at the examples in the "examples" directory. Some simple examples are also given at the end of this page. See also the "defs-parser.ss" source file for a more complex usage.

4.1 Priorities

When parsing a string, among all matchers of the current phase, the matcher whose action is triggered is the one that matches the earliest character in the string. If several matchers match at that same position, only the most recently added one is chosen. In add-items, priority therefore goes to the matcher that is defined lowest in the source file.

4.2 Main Functions

(parser? p) → boolean?

p : any/c

Returns #t if p is a parser, #f otherwise.

(new-parser [no-match-proc #:phase phase #:appender appender]) → parser?

no-match-proc : procedure? = identity

phase : any/c = 'start

appender : procedure? = string-append

Creates a new parser with default behavior no-match-proc, starting in phase phase. All the outputs generated by the parser are then appended with appender.
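
How an appender combines outputs can be illustrated without the package. The sketch below assumes that the appender receives the outputs as separate arguments, which is what both the default string-append and the list-building appender in the examples at the end of this page expect; whether the parser calls it once or incrementally is not stated here. The names string-outputs, joined, list-appender and kept are illustrative only.

```racket
#lang racket
;; Assumption: the appender is applied to the outputs as separate arguments.
(define string-outputs (list "Y" "-" "C" "-"))
(define joined (apply string-append string-outputs)) ; "Y-C-"

;; An appender that keeps the outputs as a list and drops the empty
;; symbol ||, in the style of the tree example at the end of this page.
(define list-appender (lambda vals (remove* '(||) vals)))
(define kept
  (apply list-appender (list 'root '|| 'leaf1 '|| 'leaf2))) ; '(root leaf1 leaf2)
```

With a list-building appender the parser result is Scheme data rather than a string, so actions can return symbols or any other value.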

(add-item parser phase? in out) → void?

parser : parser?

phase? : any/c

in : (or/c #t procedure? list? symbol? string?)

out : (or/c procedure? symbol? string?)

Adds the matcher in and its associated action out to parser. The matcher will match only when the parser is in a phase for which phase? returns #t.

If phase? is a procedure, it will be used as is to match the parser’s phase. If phase? equals #t it will be changed to (λ args #t) such that it matches any phase. Any other value of phase? will be turned into a procedure that matches this value with equal?.
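
The three cases above amount to a small normalization step. The helper below is a hypothetical sketch (phase->predicate is not a name exported by this module) that turns any phase? value into a predicate on the current phase.

```racket
#lang racket
;; Hypothetical helper, written for illustration only.
(define (phase->predicate phase?)
  (cond
    [(procedure? phase?) phase?]                      ; used as is
    [(eq? phase? #t) (lambda args #t)]                ; matches any phase
    [else (lambda (current) (equal? current phase?))])) ; compared with equal?

((phase->predicate 'start) 'start)   ; #t
((phase->predicate 'start) 'comment) ; #f
((phase->predicate #t) 'comment)     ; #t
((phase->predicate symbol?) 'start)  ; #t
```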

If in is a string, it will be turned into a procedure that matches the corresponding pregexp. If in is a symbol, it will be turned into a procedure that matches the corresponding pregexp with word boundaries on both sides (useful for matching names or programming language keywords). If in is a list, then add-item is called recursively on each member of in with the same parser, phase? and out. If in equals #t, it will modify the no-match-proc procedure to add the corresponding action when phase? applies to the parser. In every case, the resulting matcher returns the same kind of values as regexp-match-positions.

out must be a procedure that accepts the same number of arguments as the number of values returned by the matcher in. For example, if in is "aa(b+)c(d+)e", then out must take 3 arguments (one for the whole string, and two for the b’s and the d’s). If out is not a procedure, it will be turned into a procedure that accepts any number of arguments and returns out.
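
The arity rule can be checked directly with regexp-match-positions, which returns one position pair for the whole match plus one per parenthesized group. The sketch below uses only standard Racket; count-action and constant-action are hypothetical names, and passing the matched substrings to the action is an assumption based on the examples at the end of this page.

```racket
#lang racket
(define rx (pregexp "aa(b+)c(d+)e"))
(define str "xxaabbbcddeyy")

;; One pair for the whole match, then one per group.
(define positions (regexp-match-positions rx str)) ; '((2 . 11) (4 . 7) (8 . 10))

;; Hence a matching action takes three arguments.
(define (count-action whole bs ds)
  (format "~a: ~a b, ~a d" whole (string-length bs) (string-length ds)))

(define result
  (apply count-action
         (for/list ([p (in-list positions)])
           (substring str (car p) (cdr p)))))
;; result is "aabbbcdde: 3 b, 2 d"

;; A non-procedure out behaves like a constant action of any arity.
(define (constant-action out) (lambda args out))
((constant-action "x") "whole" "group1" "group2") ; "x"
```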

(add-items parser [phase? [search-proc output-proc] ...] ...)

The general form for adding several items at once. See the examples at the end of this page.

(parse-text parser [#:phase phase #:split-lines split] text ...) → (listof any/c)

parser : parser?

phase : any/c = ((parser-phase parser))

split : boolean? = #t

text : string?

Parses text with parser, starting in phase phase, which is the current phase by default. If split is #t, the text is split into lines at "\n" separators before parsing. This is the default for speed reasons: running several regexps in parallel over a long string can be costly.

It is possible to call the parser inside the parsing phase, i.e. once a portion of the text has been parsed, it can be given to the parser itself in some phase to make further transformations. This is not the same as sub-parsing because there is no callback.

4.3 Matchers

This section describes matching functions that can be used in the in argument of add-item and add-items.

(re s) → procedure?

s : string?

Turns s into a pregexp and returns a procedure that takes an input string and applies regexp-match-positions on that string with the pregexp s.

(txt s) → procedure?

s : string?

Same as re but regexp-quotes s beforehand, so that the string s is matched exactly.

(kw s) → procedure?

s : string?

Same as txt but adds word-boundaries around s.
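
The relationship between the three matchers can be sketched with standard Racket regexp functions. The definitions below are hypothetical stand-ins (suffixed with -sketch so they are not mistaken for the package's code), and the use of \b for word boundaries is an assumption about how kw is implemented.

```racket
#lang racket
;; Stand-ins for illustration only; not the package's definitions.
(define (re-sketch s)
  (define rx (pregexp s))
  (lambda (str) (regexp-match-positions rx str)))

;; Quote the string first so that regexp metacharacters match literally.
(define (txt-sketch s)
  (re-sketch (regexp-quote s)))

;; Assumption: word boundaries are expressed with pregexp's \b.
(define (kw-sketch s)
  (re-sketch (string-append "\\b" (regexp-quote s) "\\b")))

((re-sketch "a.b") "xaxb")  ; '((1 . 4)), since "." matches any character
((txt-sketch "a.b") "xaxb") ; #f, since the dot must now match literally
((txt-sketch "a.b") "xa.b") ; '((1 . 4))
((kw-sketch "if") "elif x") ; #f, since "if" is not a whole word here
((kw-sketch "if") "x if y") ; '((2 . 4))
```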

4.4 Actions

This section describes action functions that can be used in the out argument of add-item and add-items.

(switch-phase phase) → string?

phase : any/c

Sets the parser in the phase phase and returns "".

(sub-parse new-phase [callback #:appender appender]) → string?

new-phase : any/c

callback : procedure? = identity

appender : procedure? = (parser-appender (current-parser))

Sets the current parser in sub-parse mode and switches to new-phase. The result of the sub-parse is appended with appender, which by default is the same as the parser’s. When the sub-parser has finished parsing (it has returned with sub-parse-return), callback is called with the result of the sub-parse and the result of callback is added to the current parser result.

Sub-parsers can be called recursively, either while already in sub-parsing mode or from within the callback.

Returns "".

(cons-out out) → void?

out : any/c

By default, the parser accumulates the return values of the action procedures. The function cons-out can be used to add a value to the parser result without it being the return value of an action. This is rarely needed.

(sub-parse-return [out]) → any

out : any/c = #f

Adds out to the current parser result and returns from the current sub-parsing mode.

4.5 Examples

Examples:
  > (let ([p (new-parser)])
      (add-items
       p
       ('start
        ["pl(.[^p]?)p" (λ(s x)(string-append " -gl" x "tch- "))]
        ["ou" "aï"]
        [#t string-upcase]))
      (parse-text p "youcoudouplipcoudouploup" "toupouchou"))
  "YaïCaïDaï -glitch- CaïDaï -gloutch- \nTaïPaïCHaï"
  > (let ([tree-parser
           (new-parser #:appender
                       (λ vals (remove* '(||) vals)))])
      (add-items
       tree-parser
       ('start
        [#t string->symbol]
        ["\\s+" '||]
        ["\\(" (λ(s)(sub-parse 'start)'||)]
        ["\\)" (λ(s)(sub-parse-return))]))

      (parse-text
       tree-parser
       "tree:(root (node1 (leaf1 leaf2) \nleaf3) (node2\n leaf4 (node3 leaf5) leaf6) leaf7)"))
  (tree: (root (node1 (leaf1 leaf2) leaf3) (node2 leaf4 (node3 leaf5) leaf6) leaf7))

Note that the result of the last example is Scheme data, not a string.
