; Evaluation of this file yields an HTML document
; $Id: SXML.scm,v 3.4 2005/02/09 22:14:42 oleg Exp $
(define Content
'(html:begin
(Header
(title "SXML")
(description "Definition of SXML: an instance of XML Infoset as
S-expressions, an Abstract Syntax Tree of an XML document.")
(Date-Revision-yyyymmdd "20040312")
(Date-Creation-yyyymmdd "20010207")
(keywords "XML, XML parsing, XML Infoset, XML Namespaces, AST, SXML, Scheme")
(AuthorAddress "oleg-at-pobox.com")
(Revision "3.0")
(long-title "SXML")
(Links
(start "index.html" (title "Scheme Hash"))
(contents "../README.html")
(prev "xml.html")
(next "web.html")
(top "index.html")
(home "http://pobox.com/~oleg/ftp/")))
(body
(navbar)
(page-title)
(abstract
(Revision)
"SXML is an abstract syntax tree of an XML document. SXML is also
a concrete representation of the XML Infoset in the form of
S-expressions. The generic tree structure of SXML lends itself to a
compact library of combinators for querying and transforming SXML."
(prod-note (URL "http://pobox.com/~oleg/ftp/Scheme/xml.html"))
(keywords))
(TOC)
(Section 2 "Introduction")
(p
"An XML document is essentially a tree structure. The start and the end
tags of the root element enclose the document content,
which may include other elements or arbitrary character data. Text
with familiar angular brackets is an external representation of an XML
document. Applications ought to deal with an internalized form:
an XML Information Set, or its specializations. This form lets an
application locate specific data or transform an XML tree into another
tree, which can then be written out as an XML, HTML, PDF, etc.
document.")
(p "XML Information Set (Infoset) " (cite "XML Infoset") " is an
abstract datatype that describes information available in a
well-formed XML document. Infoset is made of \"information items\",
which denote elements, attributes, character data, processing
instructions, and other components of the document. Each information
item has a number of associated properties, e.g., name, namespace
URI. Some properties -- for example, 'children' and 'attributes' --
are collections of other information items. Infoset describes only the
information in an XML document that is relevant to applications. The
default value of attributes declared in the DTD, parameter entities,
the order of attributes within a start-tag, and other data used merely
for parsing or validation are not included. Although technically
Infoset is specified for XML it largely applies to other
semi-structured data formats, in particular, HTML.")
(p "The hierarchy of containers comprised of text strings and other
containers greatly lends itself to be described by
S-expressions. S-expressions " (cite "McCarthy") " are easy to parse
into an internal representation suitable for traversal. They have a
simple external notation (albeit with many a parentheses), which is
relatively easy to compose even by hand. S-expressions have another
advantage: " (em "provided") " an appropriate design, they can
represent Scheme code to be evaluated. This code-data dualism is a
distinguished feature of Lisp and Scheme. ")
(p
"SXML is a concrete instance of the XML Infoset. Infoset's goal is
to present in some form all relevant pieces of data and their " (em
"abstract") ", container-slot relationships with each other. SXML
gives the nest of containers a concrete realization as S-expressions,
and provides means of accessing items and their properties. SXML is a
\"relative\" of XPath " (cite "XPath") " and DOM " (cite "DOM") ",
whose data models are two other instances of the XML Infoset. SXML is
particularly suitable for Scheme-based XML/HTML authoring, SXPath
queries, and tree transformations. In John Hughes' terminology " (cite
"Hughes-PP") ", SXML is a term implementation of evaluation of the XML
document.")
;--------------------------------------------------
(Section 2 "Notation")
(p
"We will use an Extended BNF Notation (EBNF) employed in the XML
Recommendation " (cite "XML") ". The following table summarizes the notation.")
(dl
(dt (ebnf-opt "thing"))
(dd "An optional " (em "thing"))
(dt (ebnf-* "thing"))
(dd "Zero or more " (em "thing") "s")
(dt (ebnf-+ "thing"))
(dd "One or more " (em "thing") "s")
(dt (ebnf-choice "thing1" "thing2" "thing3"))
(dd "Choice of " (em "thing") "s")
(dt (nonterm "thing"))
(dd "A non-terminal of a grammar")
(dt (term-id "thing"))
(dd "A terminal of the grammar that is a Scheme identifier")
(dt (term-str "thing"))
(dd "A terminal of the grammar that is a Scheme string")
(dt (term-lit "thing"))
(dd "A literal Scheme symbol")
(dt (sexp (nonterm "A") (ebnf-* (nonterm "B")) (ebnf-opt (nonterm "C"))))
(dd "An S-expression made of " (nonterm "A")
" followed by zero or more " (nonterm "B")
" and, afterwards, optionally by " (nonterm "C"))
(dt (sset (term-lit "A") (ebnf-* (nonterm "B"))))
(dd "A tagged set: an S-expression made of " (term-lit "A")
" followed by zero or more instances of " (nonterm "B")
" in any order")
(dt (sexp-cons (nonterm "A") (nonterm "B")))
(dd "An S-expression that is made by prepending " (nonterm "A")
" to an S-expression denoted by " (nonterm "B"))
(dt (sexp-symb (nonterm "A") ":" (nonterm "B")))
(dd "A symbol whose string representation consists of all
characters that spell " (nonterm "A") " followed by the colon
character and by the characters that spell " (nonterm "B")
". The " (sexp-symb "" ) " notation can be regarded a
meta-function that creates symbols." )
)
;--------------------------------------------------
(Section 2 "Grammar")
(productions
(production 1
(nonterm "TOP")
((sexp
(term-lit "*TOP*")
(ebnf-opt (nonterm "annotations"))
(ebnf-* (nonterm "PI"))
(ebnf-* (nonterm "comment"))
(nonterm "Element")))))
(p
"This S-expression stands for the root of the SXML tree, a
document information item of the Infoset. Its only child element is
the root element of the XML document.")
(productions
(production 2
(nonterm "Element")
((sexp
(nonterm "name")
(ebnf-opt (nonterm "annot-attributes"))
(ebnf-* (nonterm "child-of-element")))))
(production 3
(nonterm "annot-attributes")
((sset
(term-lit "@")
(ebnf-* (nonterm "attribute"))
(ebnf-opt (nonterm "annotations")))))
(production 4
(nonterm "attribute")
((sexp (nonterm "name")
(ebnf-opt (term-str "value")) ; make several values? No!
(ebnf-opt (nonterm "annotations"))))) ; make sublist first? NO!
; see communication with Kirill,
; Feb 2004.
(production 5
(nonterm "child-of-element")
((ebnf-choice
(nonterm "Element")
(term-str "character data")
(nonterm "PI")
(nonterm "comment")
(nonterm "entity"))))
)
(p "These are the basic constructs of SXML.")
(productions
(production 6
(nonterm "PI")
((sexp (term-lit "*PI*")
(term-id "pi-target")
(ebnf-opt (nonterm "annotations"))
(term-str "processing instruction content string")))))
(p "The XML Recommendation specifies that processing instructions
(PI) are distinct from elements and character data; processing
instructions must be passed through to applications. In SXML, PIs are
therefore represented by nodes of a dedicated type " (code "*PI*")
". DOM Level 2 treats processing instructions in a similar way.")
(productions
(production 7
(nonterm "comment")
((sexp (term-lit "*COMMENT*")
(term-str "comment string"))))
(production 8
(nonterm "entity")
((sexp (term-lit "*ENTITY*")
(term-str "public-id")
(term-str "system-id")))))
(p "Comments are mentioned for completeness. A SSAX XML parser
" (cite "SSAX") ", among others, transparently skips the comments.
The XML Recommendation permits the parser to pass the comments to
an application or to completely disregard them. The present SXML grammar
admits comment nodes but does not mandate them by any means." (br)
"An " (nonterm "entity") " node represents a reference to an
unexpanded external entity. This node corresponds to an unexpanded
entity reference information item, defined in Section 2.5 of " (cite
"XML Infoset") ". Internal parsed entities are always expanded by the
XML processor at the point of their reference in the body of the
document.")
(productions
(production 9
(nonterm "name")
((ebnf-choice (nonterm "LocalName")
(nonterm "ExpName"))))
(production 10
(nonterm "LocalName")
((term-id "NCName")))
(production 11
(nonterm "ExpName")
((sexp-symb (nonterm "namespace-id") ":" (nonterm "LocalName"))))
(production 12
(nonterm "namespace-id")
((ebnf-choice
(sexp-symb (term-str "URI")) (term-id "user-ns-shortcut"))))
(production 13
(nonterm "namespaces")
((sset
(term-lit "*NAMESPACES*")
(ebnf-* (nonterm "namespace-assoc")))))
(production 14
(nonterm "namespace-assoc")
((sexp (nonterm "namespace-id") (term-str "URI")
(ebnf-opt (term-lit "original-prefix"))
)))
)
(p "An SXML " (nonterm "name") " is a single symbol. It is
generally an expanded name " (cite "XML-Namespaces") ", which
conceptually consists of a local name and a namespace URI. The latter
part may be empty, in which case " (nonterm "name") " is a " (term-id
"NCName") ": a Scheme symbol whose spelling conforms to production [4]
of the XML Namespaces Recommendation " (cite "XML-Namespaces") ".
"(nonterm "ExpName") " is also a Scheme symbol, whose string
representation contains an embedded colon that joins the local and the
namespace parts of the name. A " (sexp-symb (term-str "URI")) " is a
Namespace URI string converted to a Scheme symbol. Universal Resource
Identifiers (URI) may contain characters (e.g., parentheses) that are
prohibited in Scheme identifiers. Such characters must be %-quoted
during the conversion from a URI string to " (nonterm "namespace-id")
". The original XML Namespace prefix of a QName " (cite "XML-Namespaces") "
may be retained as an optional member " (term-lit
"original-prefix") " of a " (nonterm "namespace-assoc") "
association. A " (term-id "user-ns-shortcut") " is a Scheme
symbol chosen by an application programmer to represent a namespace
URI in the application program. The SSAX parser lets the programmer
define (short and mnemonic) unique shortcuts for often long and unwieldy
Namespace URIs.")
(Section 2 "Annotations")
(productions
(production 15
(nonterm "annotations")
((sset
(term-lit "@")
(ebnf-opt (nonterm "namespaces"))
(ebnf-* (nonterm "annotation")))))
(production 16
(nonterm "annotation")
((n_))
(em "To be defined in the future"))
)
(p "The XML Recommendation and related standards are not firmly
fixed, as the long list of errata and version 1.1 of XML
clearly show. Therefore, SXML has to be able to accommodate future
changes while guaranteeing backwards compatibility. SXML also ought to
permit applications to store various processing information (e.g.,
cached resolved IDREFs) in an SXML tree. A hash of ID-type attributes
would, for instance, let us implement efficient lookups in (SOAP-)
encoded arrays. To allow such extensibility, we introduce two new node
types: " (nonterm "annotations") " and " (nonterm "annotation") ". The
semantics of the latter is to be established in future versions of
SXML. Possible examples of an " (nonterm "annotation") " are the unique
id of an element or the reference to element's parent.")
(p
"The structure and the semantics of " (nonterm "annotations") "
is similar to those of an attribute list. In a manner of speaking,
annotations are ``attributes'' of an attribute list. The tag
" (code "@") " marks a collection of ancillary data associated
with an SXML node. For an element SXML node, the ancillary collection
is that of attributes. A nested " (code "@") " list is therefore a
collection of ``second-level'' attributes -- annotations -- such as
namespace nodes, parent pointers, etc. This design seems to be in
accord with the spirit of the XML Recommendation, which uses XML
attributes for two distinct purposes. Genuine, semantic attributes
provide ancillary description of the corresponding XML element,
e.g.,")
(verbatim
"16"
)
(p "On the other hand, attributes such as " (code "xmlns") ", " (code
"xml:prefix") ", " (code "xml:lang") " and " (code "xml:space") " are
auxiliary, or being used by XML itself. The XML Recommendation
distinguishes auxiliary attributes by their prefix
" (code "xml") ". SXML groups all such auxiliary attributes into a
" (code "@") "-tagged list inside the attribute list.")
(p
"XML attributes are treated as a dust bin. For example, the XSLT
Recommendation allows extra attributes in " (code
"xslt:template") ", provided these attributes are in a non-XSLT
namespace. A user may therefore annotate an XSLT template with his
own attributes, which will be silently disregarded by an XSLT
processor because the processor never looks for them. RELAX/NG
explicitly lets a schema author specify that an element may have more
attributes than given in the schema, provided those attributes come
from a particular namespace. The presence of these extra attributes
should not affect the XML processing applications that do not specifically
look for them. Annotations such as parent pointers and the source
location information are similarly targeted at specific
applications. The other applications should not be affected by the
presence or absence of annotations. Placing the collection of
annotations inside the attribute list accomplishes that goal.")
(p
"Annotations can be assigned to an element and to an attribute
of an element. The following example illustrates the difference
between the two annotations, which, in the example, contain only one
annotation: a pointer to the parent of a node.")
(verbatim
"(a (@ "
" (href \"http://somewhere/\" "
" (@ (*parent* a-node)) ; of the attribute 'href'"
" )"
" (@ (*parent* a-parent-node))) ; of the element 'a'"
" \"link\")"
)
(p
"The " (nonterm "TOP") " node may also contain annotations: for
example, " (nonterm "namespaces") " for the entire document or an
index of ID-type attributes.")
(verbatim
"(*TOP* (@ (id-collection id-hash)) (p (@ (id \"id1\")) \"par1\"))"
)
(p
"Annotations of the " (nonterm "TOP") " element look exactly
like `attributes' of the element. That should not cause any confusion
because " (nonterm "TOP") " cannot have genuine attributes. The SXML
element " (nonterm "TOP") " is an abstract representation of the whole
document and does not correspond to any single XML element. Assigning
annotations, which look and feel like an attribute list, to the
" (nonterm "TOP") " element does not contradict the Infoset
Recommendation, which specifically states that it is not intended to
be exhaustive. Attributes in general are not considered children of
their parents, therefore, even with our annotations the " (nonterm "TOP")
" element has only one child -- the root element.")
;--------------------------------------------------
(Section 2 "SXML Tree")
(p
"Infoset's information item is a sum of its properties. This makes
a list a particularly suitable data structure to represent an
item. The head of the list, a Scheme identifier, " (em "names") " the
item. For many items this is their (expanded) name. For an information
item that denotes an XML element, the corresponding list starts with
element's expanded name, optionally followed by a collection of
attributes and annotations. The rest of the element item list
is an ordered sequence of element's children -- character data,
processing instructions, and other elements. Every child is unique;
items never share their children even if the latter have the identical
content.")
(p
"A " (code "parent") " property of an Infoset information item
might seem troublesome. The Infoset Recommendation " (cite "XML Infoset") " specifies that element, attribute and other kinds of
information items have a property " (code "parent") ", whose value is
an information item that contains the given item in its " (code
"children") " property. The property parent thus is an upward link
from a child to its parent. At first sight, S-expressions seem lacking
in that respect: S-expressions represent directed trees and trees
cannot have upward links. An article " (cite "Parent-pointers") "
discusses and compares five different methods of determining the
parent of an SXML node. The existence of these methods is a crucial
step in a constructive proof that SXML is a complete model of the XML
Information set and the SXML query language (SXPath) can fully
implement the XPath Recommendation.")
(p
"Just as XPath does and the Infoset specification explicitly allows,
we group character information items into maximal text strings. The
value of an attribute is normally a string; it may be omitted (in
case of HTML) for a boolean attribute, e.g., " (code "