4 Search (SXPath)
(sxpath path [ns-binding]) → (-> sxml? (listof sxml?))
  path : abbr-sxpath?
  ns-binding : ns-binding? = '()
AbbrPath is a list. It is translated to the full SXPath according to the following rewriting rules:
[Table of rewriting rules: fifteen rules, each of the form abbreviated-form ⇒ full-form. The rule bodies are not preserved in this copy.]
To extract all cells from an html table:
> (define table `(*TOP* (table (tr (td "a") (td "b")) (tr (td "c") (td "d")))))
> ((sxpath '(table tr td)) table)
'((td "a") (td "b") (td "c") (td "d"))
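The query returns whole td elements, so a common next step is to pull out their text. A minimal sketch, assuming the sxml package is installed; `cell-strings` is a hypothetical helper written in plain Racket, not part of sxml:

```racket
#lang racket
;; Assumes the sxml package is installed (raco pkg install sxml).
(require sxml)

(define table
  '(*TOP* (table (tr (td "a") (td "b")) (tr (td "c") (td "d")))))

;; Same query as above: every td under a tr under the top-level table.
(define cells ((sxpath '(table tr td)) table))

;; Hypothetical helper (not provided by sxml): keep only the string
;; children of each selected element, in document order.
(define (cell-strings elements)
  (append-map (lambda (el) (filter string? (cdr el))) elements))

(cell-strings cells) ; '("a" "b" "c" "d")
```

Filtering on `string?` rather than taking the second item of each element keeps the helper from tripping over cells that carry an attribute list.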
To extract all cells anywhere in a document:
> (define table `(*TOP* (div (p (table (tr (td "a") (td "b")) (tr (td "c") (td "d")))) (table (tr (td "e"))))))
> ((sxpath '(// td)) table)
'((td "a") (td "b") (td "c") (td "d") (td "e"))
One result may be nested in another one:
> (define doc `(*TOP* (div (p (div "3") (div (div "4"))))))
> ((sxpath '(// div)) doc)
'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))
There’s also a string-based syntax, txpath. As shown in the grammar above, sxpath assumes that any strings in the path are expressed using the txpath syntax.
So, for instance, the prior example could be rewritten using a string:
> (define doc `(*TOP* (div (p (div "3") (div (div "4"))))))
> ((sxpath "//div") doc)
'((div (p (div "3") (div (div "4")))) (div "3") (div (div "4")) (div "4"))
More generally, lists in the s-expression syntax correspond to string concatenation in the txpath syntax.
So, to find all italics that appear at top level within a paragraph:
> (define doc `(*TOP* (div (p (i "3") (froogy (i "4"))))))
> ((sxpath "//p/i") doc)
'((i "3"))
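To check the correspondence between the two syntaxes, the same query can be written both ways and the results compared. This is a sketch that assumes the sxml package is installed and that each step of the string path maps to one element of the list path; the names `by-string` and `by-list` are only illustrative:

```racket
#lang racket
;; Assumes the sxml package is installed (raco pkg install sxml).
(require sxml)

(define doc '(*TOP* (div (p (i "3") (froogy (i "4"))))))

;; String (txpath) form, as in the example above.
(define by-string ((sxpath "//p/i") doc))

;; List form of the same path: descend anywhere, then p, then i.
(define by-list ((sxpath '(// p i)) doc))

;; Both forms should select only the i that is a direct child of p.
(equal? by-string by-list) ; #t
```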
Handling of namespaces in sxpath is a bit surprising. In particular, it appears to me that sxpath’s model is that namespaces must appear fully expanded in the matched source. For instance:
> ((sxpath "//ns:p" `((ns . "http://example.com"))) '(*TOP* (html (http://example.com:body (http://example.com:p "first para") (http://example.com:p "second para containing" (http://example.com:p "third para") "inside it")))))
'((http://example.com:p "first para")
(http://example.com:p
"second para containing"
(http://example.com:p "third para")
"inside it")
(http://example.com:p "third para"))
But the corresponding example where the source document contains a namespace shortcut does not match in the same way. That is:
> ((sxpath "//ns:p" `((ns . "http://example.com"))) '(*TOP* (@ (*NAMESPACES* (ns "http://example.com"))) (html (ns:body (ns:p "first para") (ns:p "second para containing" (ns:p "third para") "inside it")))))
'()
It produces the empty list. Instead, you must pretend that the shortcut is actually the namespace, binding the prefix to the shortcut itself rather than to the URI. Thus:
> ((sxpath "//ns:p" `((ns . "ns"))) '(*TOP* (@ (*NAMESPACES* (ns "http://example.com"))) (html (ns:body (ns:p "first para") (ns:p "second para containing" (ns:p "third para") "inside it")))))
'((ns:p "first para")
(ns:p "second para containing" (ns:p "third para") "inside it")
(ns:p "third para"))
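The three cases can be put side by side on cut-down documents. This is only a sketch that restates the behaviour shown in the transcripts above, assuming the sxml package is installed; `expanded-doc`, `shortcut-doc`, `by-uri` and `by-shortcut` are illustrative names:

```racket
#lang racket
;; Assumes the sxml package is installed (raco pkg install sxml).
(require sxml)

;; Element names carry the fully expanded namespace URI.
(define expanded-doc
  '(*TOP* (html (http://example.com:body
                 (http://example.com:p "first para")))))

;; Element names carry the ns shortcut declared under *NAMESPACES*.
(define shortcut-doc
  '(*TOP* (@ (*NAMESPACES* (ns "http://example.com")))
          (html (ns:body (ns:p "first para")))))

;; Prefix bound to the real URI versus prefix bound to the shortcut itself.
(define by-uri      (sxpath "//ns:p" '((ns . "http://example.com"))))
(define by-shortcut (sxpath "//ns:p" '((ns . "ns"))))

(length (by-uri expanded-doc))      ; one match, as in the first transcript
(length (by-uri shortcut-doc))      ; no match, as in the second transcript
(length (by-shortcut shortcut-doc)) ; one match, as in the third transcript
```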
Ah well.