"Fossies" - the Fresh Open Source Software Archive

Member "SAOImageDS9/libxml2/doc/xmlreader.html" (13 Nov 2019, 20139 Bytes) of package /linux/misc/ds9.8.1.tar.gz:



    1 <!DOCTYPE html PUBLIC "-//W3C//DTD HTML 4.01 Transitional//EN"
    2     "http://www.w3.org/TR/html4/loose.dtd">
    3 <html>
    4 <head>
    5   <meta http-equiv="Content-Type" content="text/html">
    6   <style type="text/css">
    7 <!--
    8 TD {font-family: Verdana,Arial,Helvetica}
    9 BODY {font-family: Verdana,Arial,Helvetica; margin-top: 2em; margin-left: 0em; margin-right: 0em}
   10 H1 {font-family: Verdana,Arial,Helvetica}
   11 H2 {font-family: Verdana,Arial,Helvetica}
   12 H3 {font-family: Verdana,Arial,Helvetica}
   13 A:link, A:visited, A:active { text-decoration: underline }
   14 -->
   15   </style>
   16   <title>Libxml2 XmlTextReader Interface tutorial</title>
   17 </head>
   18 
   19 <body bgcolor="#fffacd" text="#000000">
   20 <h1 align="center">Libxml2 XmlTextReader Interface tutorial</h1>
   21 
   22 <p></p>
   23 
   24 <p>This document describes the use of the XmlTextReader streaming API added
   25 to libxml2 in version 2.5.0. This API is closely modeled after the <a
   26 href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader</a>
   27 and <a
   28 href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlReader.html">XmlReader</a>
   29 classes of the C# language.</p>
   30 
   31 <p>This tutorial will present the key points of this API, and working
   32 examples using both C and the Python bindings:</p>
   33 
   34 <p>Table of contents:</p>
   35 <ul>
   36   <li><a href="#Introducti">Introduction: why a new API</a></li>
   37   <li><a href="#Walking">Walking a simple tree</a></li>
   38   <li><a href="#Extracting">Extracting information for the current
   39   node</a></li>
   40   <li><a href="#Extracting1">Extracting information for the
   41   attributes</a></li>
   42   <li><a href="#Validating">Validating a document</a></li>
   43   <li><a href="#Entities">Entity substitution</a></li>
   44   <li><a href="#L1142">Relax-NG Validation</a></li>
   45   <li><a href="#Mixing">Mixing the reader and tree or XPath
   46   operations</a></li>
   47 </ul>
   48 
   49 <p></p>
   50 
   51 <h2><a name="Introducti">Introduction: why a new API</a></h2>
   52 
   53 <p>Libxml2 <a href="http://xmlsoft.org/html/libxml-tree.html">main API is
   54 tree based</a>: the parsing operation loads the whole document in memory
   55 and exposes it as a tree of nodes that are all available at the same
   56 time. This is very simple and quite powerful, but has the major
   57 limitation that the size of the document that can be handled is limited by
   58 the size of the memory available. Libxml2 also provides a <a
   59 href="http://www.saxproject.org/">SAX</a> based API, but that version was
   60 modeled on one of the early <a
   61 href="http://www.jclark.com/xml/expat.html">expat</a> versions of SAX, and SAX
   62 is not formally defined for C. SAX basically works by registering callbacks
   63 which are called directly by the parser as it progresses through the document
   64 stream. The problem is that this programming model is relatively complex and
   65 not well standardized, cannot provide validation directly, and makes entity,
   66 namespace and base processing relatively hard.</p>
   67 
   68 <p>The <a
   69 href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">XmlTextReader
   70 API from C#</a> provides a far simpler programming model. The API acts as a
   71 cursor going forward on the document stream and stopping at each node on the
   72 way. The user's code keeps control of the progress and simply calls a
   73 Read() function repeatedly to progress to each node in sequence in document
   74 order. There is direct support for namespaces, xml:base and entity handling, and
   75 adding DTD validation on top of it was relatively simple. This API is really
   76 close to the <a href="http://www.w3.org/TR/DOM-Level-2-Core/">DOM Core
   77 specification</a>. This provides a far more standard, easy to use and powerful
   78 API than the existing SAX. Moreover integrating extension features based on
   79 the tree seems relatively easy.</p>
   80 
   81 <p>In a nutshell the XmlTextReader API provides a simpler, more standard and
   82 more extensible interface to handle large documents than the existing SAX
   83 version.</p>
   84 
   85 <h2><a name="Walking">Walking a simple tree</a></h2>
   86 
   87 <p>Basically the XmlTextReader API is a forward only tree walking interface.
   88 The basic steps are:</p>
   89 <ol>
   90   <li>prepare a reader context operating on some input</li>
   91   <li>run a loop iterating over all nodes in the document</li>
   92   <li>free up the reader context</li>
   93 </ol>
   94 
   95 <p>Here is a basic C sample doing this:</p>
   96 <pre>#include &lt;libxml/xmlreader.h&gt;
   97 
   98 void processNode(xmlTextReaderPtr reader) {
   99     /* handling of a node in the tree */
  100 }
  101 
  102 void streamFile(const char *filename) {
  103     xmlTextReaderPtr reader;
  104     int ret;
  105 
  106     reader = xmlNewTextReaderFilename(filename);
  107     if (reader != NULL) {
  108         ret = xmlTextReaderRead(reader);
  109         while (ret == 1) {
  110             processNode(reader);
  111             ret = xmlTextReaderRead(reader);
  112         }
  113         xmlFreeTextReader(reader);
  114         if (ret != 0) {
  115             printf("%s : failed to parse\n", filename);
  116         }
  117     } else {
  118         printf("Unable to open %s\n", filename);
  119     }
  120 }</pre>
  121 
  122 <p>A few things to notice:</p>
  123 <ul>
  124   <li>the include file needed: <code>libxml/xmlreader.h</code></li>
  125   <li>the creation of the reader using a filename</li>
  126   <li>the repeated call to xmlTextReaderRead() and how any return value
  127     different from 1 should stop the loop</li>
  128   <li>that a negative return means a parsing error</li>
  129   <li>how xmlFreeTextReader() should be used to free up the resources used by
  130     the reader.</li>
  131 </ul>
  132 
  133 <p>Here is similar code in Python for exactly the same processing:</p>
  134 <pre>import libxml2
  135 
  136 def processNode(reader):
  137     pass
  138 
  139 def streamFile(filename):
  140     try:
  141         reader = libxml2.newTextReaderFilename(filename)
  142     except Exception:
  143         print("unable to open %s" % (filename))
  144         return
  145 
  146     ret = reader.Read()
  147     while ret == 1:
  148         processNode(reader)
  149         ret = reader.Read()
  150 
  151     if ret != 0:
  152         print("%s : failed to parse" % (filename))</pre>
  153 
  154 <p>The only things worth adding are that the <a
  155 href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">xmlTextReader
  156 is abstracted as a class like in C#</a> with the same method names (but the
  157 properties are currently accessed with methods) and that one doesn't need to
  158 free the reader at the end of the processing. It will get garbage collected
  159 once all references have disappeared.</p>
  160 
  161 <h2><a name="Extracting">Extracting information for the current node</a></h2>
  162 
  163 <p>So far the example code did not indicate how information was extracted
  164 from the reader. It was abstracted as a call to the processNode() routine,
  165 with the reader as the argument. At each invocation, the parser is stopped on
  166 a given node and the reader can be used to query that node's properties. Each
  167 <em>Property</em> is available at the C level as a function taking a single
  168 xmlTextReaderPtr argument whose name is
  169 <code>xmlTextReader</code><em>Property</em>. If the return type is an
  170 <code>xmlChar *</code> string then it must be deallocated with
  171 <code>xmlFree()</code> to avoid leaks. For the Python interface, there is a
  172 <em>Property</em> method on the reader class that can be called on the
  173 instance. The list of the properties is based on the <a
  174 href="http://dotgnu.org/pnetlib-doc/System/Xml/XmlTextReader.html">C#
  175 XmlTextReader class</a> set of properties and methods:</p>
  176 <ul>
  177   <li><em>NodeType</em>: The node type, 1 for start element, 15 for end of
  178     element, 2 for attributes, 3 for text nodes, 4 for CData sections, 5 for
  179     entity references, 6 for entity declarations, 7 for PIs, 8 for comments,
  180     9 for the document nodes, 10 for DTD/Doctype nodes, 11 for document
  181     fragment and 12 for notation nodes.</li>
  182   <li><em>Name</em>: the <a
  183     href="http://www.w3.org/TR/REC-xml-names/#ns-qualnames">qualified
  184     name</a> of the node, equal to (<em>Prefix</em>:)<em>LocalName</em>.</li>
  185   <li><em>LocalName</em>: the <a
  186     href="http://www.w3.org/TR/REC-xml-names/#NT-LocalPart">local name</a> of
  187     the node.</li>
  188   <li><em>Prefix</em>: a shorthand reference to the <a
  189     href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
  190     the node.</li>
  191   <li><em>NamespaceUri</em>: the URI defining the <a
  192     href="http://www.w3.org/TR/REC-xml-names/">namespace</a> associated with
  193     the node.</li>
  194   <li><em>BaseUri:</em> the base URI of the node. See the <a
  195     href="http://www.w3.org/TR/xmlbase/">XML Base W3C specification</a>.</li>
  196   <li><em>Depth:</em> the depth of the node in the tree, starts at 0 for the
  197     root node.</li>
  198   <li><em>HasAttributes</em>: whether the node has attributes.</li>
  199   <li><em>HasValue</em>: whether the node can have a text value.</li>
  200   <li><em>Value</em>: provides the text value of the node if present.</li>
  201   <li><em>IsDefault</em>: whether an Attribute node was generated from the
  202     default value defined in the DTD or schema (<em>not yet
  203   supported</em>).</li>
  204   <li><em>XmlLang</em>: the <a
  205     href="http://www.w3.org/TR/REC-xml#sec-lang-tag">xml:lang</a> scope
  206     within which the node resides.</li>
  207   <li><em>IsEmptyElement</em>: check if the current node is empty. This is a
  208     bit bizarre in the sense that <code>&lt;a/&gt;</code> will be considered
  209     empty while <code>&lt;a&gt;&lt;/a&gt;</code> will not.</li>
  210   <li><em>AttributeCount</em>: provides the number of attributes of the
  211     current node.</li>
  212 </ul>
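
<p>The numeric <em>NodeType</em> codes listed above are easier to read in a
trace once they are mapped to names. The following is a minimal sketch in plain
Python and is not part of the libxml2 bindings: the <code>NODE_TYPE_NAMES</code>
table and the <code>node_type_name()</code> helper are hypothetical, and the
names are borrowed from the C# XmlNodeType enumeration on the assumption that
it uses the same numbering.</p>

```python
# Hypothetical lookup table: the numeric codes come from the list above, the
# names are borrowed from the C# XmlNodeType enumeration (an assumption, not
# something exported by the libxml2 bindings).
NODE_TYPE_NAMES = {
    1: "Element",
    2: "Attribute",
    3: "Text",
    4: "CDATA",
    5: "EntityReference",
    6: "Entity",
    7: "ProcessingInstruction",
    8: "Comment",
    9: "Document",
    10: "DocumentType",
    11: "DocumentFragment",
    12: "Notation",
    15: "EndElement",
}


def node_type_name(code):
    """Return a readable name for a NodeType code, or a placeholder."""
    return NODE_TYPE_NAMES.get(code, "Unknown(%d)" % code)


if __name__ == "__main__":
    # Show the names for a few of the codes used in this tutorial.
    for code in (1, 3, 15):
        print("%d -> %s" % (code, node_type_name(code)))
```

<p>With the bindings installed, a call such as
<code>node_type_name(reader.NodeType())</code> could replace the raw number
printed by a processNode() routine.</p>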
  213 
  214 <p>Let's look first at a small example to put this in practice by redefining
  215 the processNode() function in the Python example:</p>
  216 <pre>def processNode(reader):
  217     print("%d %d %s %d" % (reader.Depth(), reader.NodeType(),
  218                            reader.Name(), reader.IsEmptyElement()))</pre>
  219 
  220 <p>and look at the result of calling streamFile("tst.xml") for various
  221 content of the XML test file.</p>
  222 
  223 <p>For the minimal document "<code>&lt;doc/&gt;</code>" we get:</p>
  224 <pre>0 1 doc 1</pre>
  225 
  226 <p>Only one node is found, its depth is 0, type 1 indicates an element start,
  227 of name "doc" and it is empty. Trying now with
  228 "<code>&lt;doc&gt;&lt;/doc&gt;</code>" instead leads to:</p>
  229 <pre>0 1 doc 0
  230 0 15 doc 0</pre>
  231 
  232 <p>The document root node is not flagged as empty anymore and both a start
  233 and an end of element are detected. The following document shows how
  234 character data are reported:</p>
  235 <pre>&lt;doc&gt;&lt;a/&gt;&lt;b&gt;some text&lt;/b&gt;
  236 &lt;c/&gt;&lt;/doc&gt;</pre>
  237 
  238 <p>We modify the processNode() function to also report the node Value:</p>
  239 <pre>def processNode(reader):
  240     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
  241                               reader.Name(), reader.IsEmptyElement(),
  242                               reader.Value()))</pre>
  243 
  244 <p>The result of the test is:</p>
  245 <pre>0 1 doc 0 None
  246 1 1 a 1 None
  247 1 1 b 0 None
  248 2 3 #text 0 some text
  249 1 15 b 0 None
  250 1 3 #text 0
  251 
  252 1 1 c 1 None
  253 0 15 doc 0 None</pre>
  254 
  255 <p>There are a few things to note:</p>
  256 <ul>
  257   <li>the increase of the depth value (first column) as child nodes are
  258     explored</li>
  259   <li>the text node child of the b element, of type 3 and its content</li>
  260   <li>the text node containing the line return between elements b and c</li>
  261   <li>that elements have the Value None (or NULL in C)</li>
  262 </ul>
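
<p>One consequence of the trace above is that an empty element such as
<code>&lt;a/&gt;</code> produces a single row, while any other element produces
a start row (type 1) and a matching end row (type 15) at the same depth. The
sketch below checks that property on rows recorded as (depth, type, name,
empty) tuples. It is plain Python and does not use libxml2; the
<code>check_balanced()</code> helper and the <code>TRACE</code> list are
hypothetical names introduced for illustration.</p>

```python
def check_balanced(rows):
    """Check that every non-empty start element has a matching end element.

    rows is an iterable of (depth, node_type, name, is_empty) tuples, in the
    same order as the columns printed by the processNode() example.
    """
    stack = []
    for depth, node_type, name, is_empty in rows:
        if node_type == 1 and not is_empty:
            # Start of a non-empty element: an end row is expected later.
            stack.append((depth, name))
        elif node_type == 15:
            # End of element: must close the most recently opened element.
            if not stack or stack.pop() != (depth, name):
                return False
    # Every opened element must have been closed.
    return not stack


# The trace shown above, without the Value column.
TRACE = [
    (0, 1, "doc", 0),
    (1, 1, "a", 1),
    (1, 1, "b", 0),
    (2, 3, "#text", 0),
    (1, 15, "b", 0),
    (1, 3, "#text", 0),
    (1, 1, "c", 1),
    (0, 15, "doc", 0),
]

if __name__ == "__main__":
    print(check_balanced(TRACE))
```

<p>Dropping the last row of <code>TRACE</code> leaves the doc element open, so
the check fails for the truncated list.</p>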
  263 
  264 <p>The equivalent routine for <code>processNode()</code> as used by
  265 <code>xmllint --stream --debug</code> is the following and can be found in
  266 the xmllint.c module in the source distribution:</p>
  267 <pre>static void processNode(xmlTextReaderPtr reader) {
  268     xmlChar *name, *value;
  269 
  270     name = xmlTextReaderName(reader);
  271     if (name == NULL)
  272         name = xmlStrdup(BAD_CAST "--");
  273     value = xmlTextReaderValue(reader);
  274 
  275     printf("%d %d %s %d",
  276             xmlTextReaderDepth(reader),
  277             xmlTextReaderNodeType(reader),
  278             name,
  279             xmlTextReaderIsEmptyElement(reader));
  280     xmlFree(name);
  281     if (value == NULL)
  282         printf("\n");
  283     else {
  284         printf(" %s\n", value);
  285         xmlFree(value);
  286     }
  287 }</pre>
  288 
  289 <h2><a name="Extracting1">Extracting information for the attributes</a></h2>
  290 
  291 <p>The previous examples don't indicate how attributes are processed. The
  292 simple test "<code>&lt;doc a="b"/&gt;</code>" provides the following
  293 result:</p>
  294 <pre>0 1 doc 1 None</pre>
  295 
  296 <p>This shows that attribute nodes are not traversed by default. The
  297 <em>HasAttributes</em> property allows detecting their presence. To check
  298 their content the API has special instructions. Basically two kinds of operations
  299 are possible:</p>
  300 <ol>
  301   <li>to move the reader to the attribute nodes of the current element, in
  302     which case the cursor is positioned on the attribute node</li>
  303   <li>to directly query the element node for the attribute value</li>
  304 </ol>
  305 
  306 <p>In both cases the attribute can be designated either by its position in the
  307 list of attributes (<em>MoveToAttributeNo</em> or <em>GetAttributeNo</em>) or
  308 by its name (and namespace):</p>
  309 <ul>
  310   <li><em>GetAttributeNo</em>(no): provides the value of the attribute with
  311     the specified index no relative to the containing element.</li>
  312   <li><em>GetAttribute</em>(name): provides the value of the attribute with
  313     the specified qualified name.</li>
  314   <li><em>GetAttributeNs</em>(localName, namespaceURI): provides the value of the
  315     attribute with the specified local name and namespace URI.</li>
  316   <li><em>MoveToAttributeNo</em>(no): moves the position of the current
  317     instance to the attribute with the specified index relative to the
  318     containing element.</li>
  319   <li><em>MoveToAttribute</em>(name): moves the position of the current
  320     instance to the attribute with the specified qualified name.</li>
  321   <li><em>MoveToAttributeNs</em>(localName, namespaceURI): moves the position
  322     of the current instance to the attribute with the specified local name
  323     and namespace URI.</li>
  324   <li><em>MoveToFirstAttribute</em>: moves the position of the current
  325     instance to the first attribute associated with the current node.</li>
  326   <li><em>MoveToNextAttribute</em>: moves the position of the current
  327     instance to the next attribute associated with the current node.</li>
  328   <li><em>MoveToElement</em>: moves the position of the current instance to
  329     the node that contains the current Attribute node.</li>
  330 </ul>
  331 
  332 <p>After modifying the processNode() function to show attributes:</p>
  333 <pre>def processNode(reader):
  334     print("%d %d %s %d %s" % (reader.Depth(), reader.NodeType(),
  335                               reader.Name(), reader.IsEmptyElement(),
  336                               reader.Value()))
  337     if reader.NodeType() == 1: # Element
  338         while reader.MoveToNextAttribute():
  339             print("-- %d %d (%s) [%s]" % (reader.Depth(), reader.NodeType(),
  340                                           reader.Name(), reader.Value()))</pre>
  341 
  342 <p>The output for the same input document reflects the attribute:</p>
  343 <pre>0 1 doc 1 None
  344 -- 1 2 (a) [b]</pre>
  345 
  346 <p>There are a couple of things to note on the attribute processing:</p>
  347 <ul>
  348   <li>Their depth is the one of the carrying element plus one.</li>
  349   <li>Namespace declarations are seen as attributes, as in DOM.</li>
  350 </ul>
  351 
  352 <h2><a name="Validating">Validating a document</a></h2>
  353 
  354 <p>The libxml2 implementation adds some extra features on top of the XmlTextReader
  355 API. The main one is the ability to DTD validate the parsed document
  356 progressively. This is simply the activation of the associated feature of the
  357 parser used by the reader structure. There are a few options available
  358 defined as the enum xmlParserProperties in the libxml/xmlreader.h header
  359 file:</p>
  360 <ul>
  361   <li>XML_PARSER_LOADDTD: force loading the DTD (without validating)</li>
  362   <li>XML_PARSER_DEFAULTATTRS: force attribute defaulting (this also implies
  363     loading the DTD)</li>
  364   <li>XML_PARSER_VALIDATE: activate DTD validation (this also implies loading
  365     the DTD)</li>
  366   <li>XML_PARSER_SUBST_ENTITIES: substitute entities on the fly, entity
  367     reference nodes are not generated and are replaced by their expanded
  368     content.</li>
  369   <li>more settings might be added, those were the ones available at the 2.5.0
  370     release...</li>
  371 </ul>
  372 
  373 <p>The GetParserProp() and SetParserProp() methods can then be used to get
  374 and set the values of those parser properties of the reader. For example:</p>
  375 <pre>def parseAndValidate(file):
  376     reader = libxml2.newTextReaderFilename(file)
  377     reader.SetParserProp(libxml2.PARSER_VALIDATE, 1)
  378     ret = reader.Read()
  379     while ret == 1:
  380         ret = reader.Read()
  381     if ret != 0:
  382         print("Error parsing and validating %s" % (file))</pre>
  383 
  384 <p>This routine will parse and validate the file. Error messages can be
  385 captured by registering an error handler. See python/tests/reader2.py for
  386 more complete Python examples. At the C level the equivalent call to activate
  387 the validation feature is just:</p>
  388 <pre>ret = xmlTextReaderSetParserProp(reader, XML_PARSER_VALIDATE, 1);</pre>
  389 
  390 <p>and a return value of 0 indicates success.</p>
  391 
  392 <h2><a name="Entities">Entity substitution</a></h2>
  393 
  394 <p>By default the xmlReader will report entities as such and not replace them
  395 with their content. This default behaviour can however be overridden using:</p>
  396 
  397 <p><code>reader.SetParserProp(libxml2.PARSER_SUBST_ENTITIES,1)</code></p>
  398 
  399 <h2><a name="L1142">Relax-NG Validation</a></h2>
  400 
  401 <p style="font-size: 10pt">Introduced in version 2.5.7</p>
  402 
  403 <p>Libxml2 can now validate the document being read through the xmlReader against
  404 Relax-NG schemas. While the Relax NG validator can't always work in a
  405 streamable mode, only subsets which cannot be reduced to regular expressions
  406 need to have their subtree expanded for validation. In practice it means
  407 that, unless the schema for the top level element content is not expressible
  408 as a regexp, only chunks of the document need to be parsed while
  409 validating.</p>
  410 
  411 <p>The steps to do so are:</p>
  412 <ul>
  413   <li>create a reader working on a document as usual</li>
  414   <li>before any call to Read(), associate it with a Relax NG schema, either the
  415     preparsed schema or the URL of the schema to use</li>
  416   <li>errors will be reported the usual way, and the validity status can be
  417     obtained using the IsValid() interface of the reader like for DTDs.</li>
  418 </ul>
  419 
  420 <p>Example, assuming the reader has already been created and that the schema
  421 string contains the Relax-NG schema:</p>
  422 <pre>rngp = libxml2.relaxNGNewMemParserCtxt(schema, len(schema))
  423 rngs = rngp.relaxNGParse()
  424 reader.RelaxNGSetSchema(rngs)
  425 ret = reader.Read()
  426 while ret == 1:
  427     ret = reader.Read()
  428 if ret != 0:
  429     print("Error parsing the document")
  430 if reader.IsValid() != 1:
  431     print("Document failed to validate")
  432 </pre>
  433 
  434 <p>See <code>reader6.py</code> in the sources or documentation for a complete
  435 example.</p>
  436 
  437 <h2><a name="Mixing">Mixing the reader and tree or XPath operations</a></h2>
  438 
  439 <p style="font-size: 10pt">Introduced in version 2.5.7</p>
  440 
  441 <p>While the reader is a streaming interface, its underlying implementation
  442 is based on the DOM builder of libxml2. As a result it is relatively simple
  443 to mix operations based on both models under some constraints. To do so the
  444 reader has an Expand() operation which grows the subtree under the
  445 current node. It returns a pointer to a standard node which can be
  446 manipulated in the usual ways. The node will get all its ancestors and the
  447 full subtree available. Usual operations like XPath queries can be used on
  448 that reduced view of the document. Here is an example extracted from
  449 reader5.py in the sources which extracts and prints the bibliography for the
  450 "Dragon" compiler book from the XML 1.0 recommendation:</p>
  451 <pre>f = open('../../test/valid/REC-xml-19980210.xml')
  452 input = libxml2.inputBuffer(f)
  453 reader = input.newTextReader("REC")
  454 res=""
  455 while reader.Read() == 1:                 # stop on end of document or error
  456     while reader.Name() == 'bibl':
  457         node = reader.Expand()            # expand the subtree
  458         if node.xpathEval("@id = 'Aho'"): # use XPath on it
  459             res = res + node.serialize()
  460         if reader.Next() != 1:            # skip the subtree
  461             break</pre>
  462 
  463 <p>Note, however, that the node instance returned by the Expand() call is only
  464 valid until the next Read() operation. The Expand() operation does not
  465 affect the Read() ones. However, once processed, the full subtree is usually
  466 not useful anymore, and the Next() operation allows skipping it completely and
  467 proceeding to the successor, or returns 0 if the document end is reached.</p>
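
<p>The difference between Read() and Next() can be pictured on a recorded
trace. The sketch below is a conceptual model in plain Python, not the libxml2
implementation: it assumes rows of (depth, type, name, empty) tuples as printed
by the earlier processNode() examples, and the
<code>next_sibling_index()</code> helper is a hypothetical name. From a
non-empty start element it skips the descendants and the matching end row; from
any other row it simply advances by one, as Read() would.</p>

```python
def next_sibling_index(rows, i):
    """Model of Next(): index of the row reached when skipping rows[i]'s subtree.

    rows is a list of (depth, node_type, name, is_empty) tuples.
    Returns None when the end of the trace is reached.
    """
    depth, node_type, _name, is_empty = rows[i]
    j = i + 1
    if node_type == 1 and not is_empty:
        # Skip the descendants up to the end-of-element row at the same depth.
        while j < len(rows) and not (rows[j][1] == 15 and rows[j][0] == depth):
            j += 1
        # Step past the end-of-element row itself.
        j += 1
    return j if j < len(rows) else None


# Rows for <doc><a/><b>some text</b>\n<c/></doc>, as in the earlier trace.
ROWS = [
    (0, 1, "doc", 0),
    (1, 1, "a", 1),
    (1, 1, "b", 0),
    (2, 3, "#text", 0),
    (1, 15, "b", 0),
    (1, 3, "#text", 0),
    (1, 1, "c", 1),
    (0, 15, "doc", 0),
]

if __name__ == "__main__":
    # From the start of b (index 2), the text child and the end of b are skipped.
    print(next_sibling_index(ROWS, 2))
```

<p>In this model, skipping from the start of the doc element consumes the whole
trace and yields <code>None</code>, which corresponds to Next() returning 0 at
the end of the document.</p>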
  468 
  469 <p><a href="mailto:xml@gnome.org">Daniel Veillard</a></p>
  470 
  471 <p>$Id$</p>
  472 
  473 <p></p>
  474 </body>
  475 </html>