Aho-Corasick
_Aho-Corasick_
The "ahocorasick" collection implements the classic Aho-Corasick
multiple keyword matching automaton. It includes functions for
constructing an AhoCorasickTree from a list of words, as well as a
macro for compiling them down into lambdas a-la Shriram
Krishnamurthi's Automata As Macros paper.
The documentation here is a little sparse since the interface might
change a bit; see the source code and its comments for the details.
I'll document the functions that will probably be used the most.
(Note: the documentation here has a string bias. I haven't tested my
assumption yet, but the core functions actually shouldn't care about
strings, so it should be possible to build automaton with different
label and output types.)
_Example of building and searching with an AhoCorasickTree_
The following quick-and-dirty program:
(require (prefix a: (planet "ahocorasick.ss" ("dyoo" "ahocorasick.plt" 1))))
(define tree (a:make-from-strings (list "he" "she" "his" "hers")))
(display (a:string-search
tree
"his doctor said she said PLT Scheme was his"))
(newline)
will show:
((his 0 3) (he 17 19) (she 16 19) (he 17 19) (he 31 33) (his 40 43))
_AhoCorasickTree API_
Use:
(require (planet "ahocorasick.ss" ("dyoo" "ahocorasick.plt" 1)))
to load this library.
> make: -> AhoCorasickTree
Constructs a new AhoCorasickTree with an initial zerostate state.
Example: (define tree (make-tree))
> add: AhoCorasickTree labels [output] -> void
Adds a sequence of labels to the tree. If the optional output is
given, associates a match of the labels with that output.
Otherwise, associates the keyword itself as the output.
Example:
(let ((tree (make-tree)))
(add tree (string->list "plt"))
(add tree (string->list "scheme") 'oh-yeah))
> prepare: AhoCorasickTree -> void
Finish up the construction of the tree. This needs to be called
before the tree is used for searching.
> make-from-strings: (listof string) -> AhoCorasickTree
Builds a prepared tree from a given set of strings.
Since a main application for building these automata are to process
strings, this function provides a one-step way of building a usable
automata.
> port-search: AhoCorasickTree port -> (listof (list output
start-index
end-index))
Returns a list of matches, using the port as the input source.
> string-search: AhoCorasickTree port -> (listof (list output
start-index
end-index))
Returns a list of matches, using the string as the input source.
_Compiling using macros_
Use:
(require (planet "ahocorasick-macro.ss" ("dyoo" "ahocorasick.plt" 1)))
to load this library.
This is mostly modeled from Shriram Krishnamurthi's Automata as Macros
paper. Here's a quick-and-dirty example of using it:
(require (prefix a: "ahocorasick-macro.ss"))
(require (lib "pretty.ss"))
(pretty-display (a:automaton-sexp-from-strings "he" "she" "his" "hers"))
(newline)
(newline)
(define my-automaton (a:automaton-from-strings "he" "she" "his" "hers"))
(display (a:string-search
my-automaton
"his doctor said she said PLT Scheme was his"))
(newline)
which should display:
(automaton
root
(root : (s -> state-1) (h -> state-2) (else -> root))
(state-1 : (h -> state-3) (fail -> root))
(state-2 : (e -> state-4) (i -> state-5) (fail -> root))
(state-3 : (e -> state-6) (fail -> state-2))
(state-4 : (outputs (he)) (r -> state-7) (fail -> root))
(state-5 : (s -> state-8) (fail -> root))
(state-6 : (outputs (he she)) (fail -> state-4))
(state-7 : (s -> state-9) (fail -> root))
(state-8 : (outputs (his)) (fail -> state-1))
(state-9 : (outputs (hers)) (fail -> state-1)))
((his 0 3) (he 17 19) (she 16 19) (he 31 33) (his 40 43))
The automaton s-expressions are compiled with the _automaton_ syntax;
it's similar to the one described in Shriram's paper, except:
In the list of transitions, the first "transition" can be an
'outputs' directive that lists the outputs associated with the
particular state in the Aho-Corasick automaton.
There are two special transition label types:
* else can consume any token, and follows to the next state.
Meant to be used for the root state, which in the Aho-Corasick
automaton, has edges to everything else.
* fail does not consume tokens, but still follows the edge. This
corresponds to the failure links of the Aho-Corasick automaton.
My macros are also quite a bit messier just because I'm still learning
the macro system.