Python notes - XML


I like dealing with XML in an object-based manner. It's bothersome to deal with in other forms, having to recursively iterate, test for node types to catch whitespace text nodes before the first element, and so on. Being able to access XML as python objects is nice - even though it's not always ideal.


There are various ways of doing this. ElementTree (and specifically the faster cElementTree implementation) is my current favourite. BeautifulSoup is specifically made to deal well with bad HTML, but is slow.

For other alternatives and their speeds, see mainly this (and possibly this and this) for overviews of (mostly xml-only) packages.

If you only need to construct XML, there are a few more options.


XML generation / serialization

py.xml

Perhaps the simplest method (codewise) is py.xml, but it is flawed in that it's a little too easy to create invalid XML: each of the parts is immediately converted to a string, meaning that while HTML encoding (of <>&) can be done automatically for attributes, there is no way to do so for text nodes, as code cannot tell the difference between text nodes and nested elements.
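
The escaping problem can be illustrated without py.xml itself. This is a minimal sketch, assuming Python 3 and only the standard library; the helper name text_element is hypothetical. When markup is assembled from plain strings, text and attribute values have to be escaped explicitly, and once a piece has become a string nothing distinguishes text from nested markup:

```python
from xml.sax.saxutils import escape, quoteattr

def text_element(tag, text, **attrs):
    """Build one element as a string, escaping text and attribute values explicitly."""
    attr_str = "".join(" %s=%s" % (name, quoteattr(value))
                       for name, value in sorted(attrs.items()))
    return "<%s%s>%s</%s>" % (tag, attr_str, escape(text), tag)

safe = text_element("p", "1 < 2 & 3 > 2", title="a & b")

# Nesting is where string-based building goes wrong:
# the inner markup is now escaped as if it were text.
nested = text_element("div", safe)
```

Here nested contains the inner element as escaped text rather than as a child element, which is the ambiguity described above seen from the other side.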


stan

Nevow's stan seems to have considered the py.xml problem, because stan is a document object model. An example slightly adapted from the documentation:

from nevow import flat, tags as T
document = T.html[
    T.head[
        T.title["Hello, world!"]
    ],
    T.body[
        T.h1[ "This is a complete XHTML document modeled in Stan." ],
        T.p[ "This text is inside a paragraph tag." ],
        T.div(style="color: blue; width: 200px; background-color: yellow;")[
            "And this is a coloured div."
        ]
    ]
]
print(flat.flatten(document))

Only that last call (flatten()) causes conversion to a string, and does it correctly. (Note there are several styles of using stan; this one seems the most convenient to me.) Before the flatten call, what is actually created is a tree of objects, each of which is flatten()able.


You can also:

  • make any object do stan flattening and therefore be usable directly in stan structures

Among other things, this makes templating more convenient. There is also some explicit template functionality, e.g. in patterns.

  • add functions, to insert chunks of dynamic content. These functions will be called on flatten()ing

ElementTree's SimpleXMLWriter

The syntax that ElementTree uses for XML generation is not quite as simple as the above, but still usable, and it may be handy to use one module for all XML things, just to keep your dependencies down.

The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers what tags are open, so can avoid some wrap-up calls. It consists of:

  • element(tag,attributes) (one-shot element, avoids separate start and end)
  • start(tag, attributes)
  • end() (current), or end(tag)
  • close() (all currently open) or close(id) (many levels; takes the reference you got from start())
  • data() (text nodes)


from elementtree.SimpleXMLWriter import XMLWriter
w = XMLWriter(req,encoding='utf-8') # req: a file-like object to write to. Without the encoding, unicode becomes entities
response = w.start("html")   #remember this so we can close it
 
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value=u"my \u2222 application 1.0")
w.end()             # current open tag is head
 
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
 
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(response)   # implies the end() for body and html

For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and the other two options (py.xml and stan) are less likely to be installed, or easily installable, as libraries.

ElementTree by objects

This can be a little more tedious, but may be handy for XML-based protocols. For example:

import xml.etree.ElementTree as ET

def login(username,password):
    E,SE = ET.Element,ET.SubElement  # for brevity
 
    server_request = E('server_request')
    login_request    = SE(server_request,'login_request')    
    SE(login_request,'user_name').text     = username
    SE(login_request,'user_password').text = password
 
    return ET.tostring(server_request,'UTF-8')

SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The above code stores that reference on the first use, and on the other two uses it only immediately (to set .text).

It is unicode-capable. If you don't use the encoding keyword, you get an XML fragment that uses numeric entities for such characters. If you do use it, it will both encode the output and prepend a <?xml?> header -- in both cases under the condition that it recognizes the encoding name. Since the encoding string is copied into that header, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encodings).
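
The difference between the two serializations can be seen in a short sketch, assuming Python 3's xml.etree.ElementTree (where tostring() returns bytes). Whether the <?xml?> header appears differs between ElementTree versions, so the sketch does not rely on it:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("title")
elem.text = u"angle \u2222"

# Default: ASCII output, with non-ASCII characters written as numeric entities
as_entities = ET.tostring(elem)

# With an encoding: the characters are written as encoded bytes instead
as_utf8 = ET.tostring(elem, encoding="UTF-8")
```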

XML parsing

DOM is a heavy model because it is heavily decorated, and requires all data to be loaded. Most XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.


When you don't need all of that, you can use SAX-style parsing. This can be particularly useful for keeping a small memory and CPU footprint on large data files from which you want to fetch relatively little (or fairly flat) data. It tends to be a little more bother, yes.
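
As a rough illustration of the streaming idea, here is a sketch that uses ElementTree's iterparse() rather than a real SAX handler (so it is a related technique, not SAX proper). It assumes Python 3; the function name iter_issns and the sample data are made up for the example:

```python
import io
import xml.etree.ElementTree as ET

def iter_issns(source):
    """Yield the text of each <issn>, discarding each <entry> once it is complete."""
    for event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "issn":
            yield elem.text
        elif elem.tag == "entry":
            elem.clear()  # drop the children we no longer need, to keep memory use low

sample = io.BytesIO(
    b"<subscriptions>"
    b"<entry><request><issn>1234-5678</issn></request></entry>"
    b"<entry><request><issn>0041-9999</issn></request></entry>"
    b"</subscriptions>")
issns = list(iter_issns(sample))
```

The source can equally be a filename or an open file, which is where the memory savings would matter.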


ElementTree

The thing I like about ET is that it's somewhere between the spartan SAX approach and the somewhat excessive DOM-style approach.

ElementTree is more element-centric than many other things, in that it does not keep elements, text, and attributes in separate objects - which makes some navigation code considerably shorter (though if instead of var-val-like XML such as config things you have HTML-like mixes of text with markup, things become more interesting).


Often enough, the easier it is to describe the XML in words, the easier it is to write ET code for. (If the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with)


It (and particularly the C-extension drop-in cElementTree) is decently fast ([1], [2]), though you wouldn't necessarily want the memory use you get from loading very large XML.



Example (parsing)

for:

xmldata='''
<subscriptions>
  <success/>
  <entry>
    <request>
      <issn>1234-5678</issn>
      <year>1984</year>
    </request>
    <presence>fulltext</presence>
  </entry>
  <entry>
    <request>
      <issn>0041-9999</issn>
      <year>1984</year>
    </request>
    <presence type="cached">fulltext</presence>
  </entry>
</subscriptions>'''

Code with comments:

import xml.etree.ElementTree as ET

subscriptions = ET.fromstring(xmldata) #gives the root element.   For bad XML, typically raises SyntaxError
 
error = subscriptions.find('error')  # check whether there is an <error>with message</error> instead of <success/>
if error is not None:
    raise Exception( "Response reports error: %s"%(error.text) )  # .text gets the node's text content
 
for entry in subscriptions.findall('entry'):  #find all direct-child 'entry' elements
    issn = entry.find('request/issn').text
    year = entry.findtext('request/year','') 
 
    # Using xpath-style paths like above can be handy, though when we want multiple things 
    # it's probably handier to get such central element references first
    presence = entry.find('presence')
    prestype = presence.get('type')  # attribute fetch. Using get() means we get None if missing
    prestext = presence.text
 
    print('%s in %s: %s (%s)'%(issn,year,prestext,prestype))

That code prints:

1234-5678 in 1984: fulltext (None)
0041-9999 in 1984: fulltext (cached)

Notes:

  • The functions on an element only look at its children and deeper (an element has no reference to its parent), which means you tend to use it as a deeper-only tree
  • If, in the above example, you don't always have issn under request, the python would sometimes try to fetch None.text, which would be a (python-side) mistake.
...which is why the year fetch is an example of a more robust variant - for text. If no matching element is found, the default (itself defaulting to None) is returned, here altered to '' (an empty string).
  • Your choice to use find(), findall(), getchildren(), getiterator() varies with case and taste
  • ET does character decoding to unicode, and seems to assume UTF8. If you want to be robust to other encodings, handle the string before handing it to ET. (verify)
    • Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
    • (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
  • There are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the number of children (can be handy for some logic), list() to get an anonymous list of the children (though using generators is a good habit), and more.
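
A few of these notes in code form. This is a small sketch assuming Python 3's xml.etree.ElementTree, with a cut-down entry that deliberately lacks an issn:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<entry><request><year>1984</year></request><presence/></entry>")

missing = root.find("request/issn")          # None: there is no <issn>, so .text would fail
issn = root.findtext("request/issn", "")     # the default ('') instead of an AttributeError
year = root.findtext("request/year", "")     # '1984'

child_count = len(root)                      # number of direct children
child_tags = [child.tag for child in root]   # iterating an element yields its children
```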

Some interesting functions (navigation, attributes)

Finding nodes:

  • getchildren()
    all direct children (no filter)
  • findall(path)
    all matching children (by tag name) / all matching descendant paths
  • find(path)
    finds the first matching element (by tag name, or by path)
  • getiterator(tag=None)
    does a recursive treewalk, returning all elements, or only the elements with the tag name you hand along (renamed to iter() in ElementTree 1.3, i.e. Python 2.7 and later)


Note that you can in many cases pick out elements several levels deeper (hence 'by tag name / descendant path' above) using xpath-style paths, for example using root.findall('record/item/location'). This can sometimes shorten code that still looks for specific structures.
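
For example, a sketch (assuming Python 3's xml.etree.ElementTree, with made-up data) that contrasts a fixed path with a search at any depth:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    "<db>"
    "<record><item><location>A1</location></item></record>"
    "<record><item><location>B2</location></item>"
    "<note><location>ignored</location></note></record>"
    "</db>")

# Only matches this exact structure below the root
by_path = [el.text for el in root.findall("record/item/location")]

# './/' matches at any depth, so it also picks up the one under <note>
anywhere = [el.text for el in root.findall(".//location")]
```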


Getting information from nodes:

  • .tag for the node's name
  • Attributes:
    • get(name,default=None)
      gets an attribute value (or the default value if the attribute didn't exist)
    • set(name,val)
      - useful when you want to serialize it again
    • keys()
      returns the names of the present attributes
    • items()
      returns attributes, in an arbitrarily ordered list of (name,value) tuples.
    • attrib
      is a dict with attributes. You can alter this, though note that you should not assign a new dict object - if you want new contents, do a clear() and update()
  • Text:
    • findtext(path,default=None)
      returns the text contents of the first matching element (or the default value, if nothing matched)
    • .text see below (may be None)
    • .tail see below (may be None)



On text

element.text gives the initial text within an element.

  • Can be None (if there is no text)
  • If node content is only text, this is all the text (so for XML config files this is typically enough)
  • If there's a mix of text and nodes, it's the text before the first contained element. If you want all text in a flatten-all-text-away way, see notes below.

element.tail is the text between a node and its next sibling

  • Can be None.


An example is probably useful here:

>>> nodes=ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>')
>>> nodes.text     # <a>'s initial text   (a is the root node)
'this'
 
>>> [ (el.tag,el.tail)  for el in nodes ]
[('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]
 
>>> [ (el.tag,el.text,el.tail)  for el in nodes ]
[('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]
 
>>> all_text_fragments(nodes)    # see helper function below
['this', 'and', 'that', ' ', 'foo']


So yes, initial spaces and newlines and such make things complex, but only really in XML that is fairly free-form (and possibly pretty-printed).

If you want to, say, flatten a subtree of XHTML to just the text nodes it contains, you'll want a helper function like:

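def all_text_fragments(et):
    ' Returns all fragments of text contained in a subtree, in document order, as a list of strings '
    r = []
    if et.text is not None:
        r.append( et.text )
    for child in et: # recurse, so that nested text stays in document order
        r.extend( all_text_fragments(child) )
        if child.tail is not None: # a child's tail is still inside this subtree
            r.append( child.tail )
    return r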
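
Newer ElementTree versions (1.3, as shipped with Python 2.7 and 3.x) also have Element.itertext(), which does a similar walk in document order. A short sketch, assuming such a version:

```python
import xml.etree.ElementTree as ET

nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>')

fragments = list(nodes.itertext())   # text and tail fragments inside <a>, in document order
flattened = "".join(fragments)       # the flatten-all-text-away variant
```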

Importing

There is an external library ElementTree, and a faster cElementTree, a C-extension drop-in (mostly - it implements the parsing, but not all of the generation part). ElementTree was also placed in Python ≥2.5's standard library as xml.etree.ElementTree and xml.etree.cElementTree.


Since there are a number of alternative names to import, it helps portability to import whatever you have, and to alias it so that you don't hardcode for specific versions.

I have an ET.py that contains something like:

''' A centralized place we can try for the various possible basic ElementTrees,
    and cElementTree if we have it, from modules or Python 2.5 
    (Note there is also lxml.etree, not imported by this)
    This code sits (alone) in a module so that various modules can easily use these fallbacks.
 
    Uses import *, so most members you want are direct members of this module.
    Since not everything is bound (ElementTree uses __all__), anything that is
    not imported that way can be accessed via a reference to the module 
    (you would access it as ET.ET if you use this code as-is and put it in ET.py)
'''
try:
    from xml.etree.ElementTree import * #Python 2.5.
    import xml.etree.ElementTree as ET
except ImportError:
    try:
        from elementtree.ElementTree import *
        import elementtree.ElementTree as ET
    except ImportError:
        raise ImportError('Cannot find any version of ElementTree')
 
#We want to replace functions with their cElementTree implementation where possible 
# (cET doesn't reimplement everything)
try:
    from xml.etree.cElementTree import *     #2.5
    import xml.etree.cElementTree as cET
except ImportError:
    try:
        from cElementTree import *
        import cElementTree as cET
    except ImportError:
        pass #possibly complain about the absence of the C implementation


Pretty-printer

Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented. Meant only for debug, as this will implicitly change the stored data.

def indent_inplace(elem, level=0, whitespacestrip=True):
    ''' Alters the text nodes so that the tostring()ed version will look nicely indented.
 
        whitespacestrip can make contents that contain a lot of newlines look cleaner, 
        but changes the stored data even more.
    '''
    i = "\n" + level*"  "
 
    if whitespacestrip:
        if elem.text:
            elem.text=elem.text.strip()
        if elem.tail:
            elem.tail=elem.tail.strip()
 
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent_inplace(child, level+1, whitespacestrip)
        # child is now the last child: its tail positions the parent's closing tag
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
 
 
def prettyprint(xml,whitespacestrip=True):
    ''' Convenience wrapper around indent_inplace():
        - Takes a string (parses it) or a ET structure (in which case it changes it),
        - Reindents,
        - Returns the result as a string.
 
        whitespacestrip: see note in indent_inplace()
 
        Not horribly efficient, and alters the structure you gave it,
        but you are only using this for debug, riiight?
    '''
    if type(xml) is str:
        xml = ET.fromstring(xml)
    indent_inplace(xml, whitespacestrip=whitespacestrip)
    return ET.tostring(xml).rstrip('\n')
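
Python 3.9 added ET.indent() to the standard library's ElementTree, which does a similar in-place reindent. A minimal sketch, assuming that version or later (hence the hasattr guard). Like the function above, it changes the text and tail values stored in the tree:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring("<a><b>1</b><c><d/></c></a>")

if hasattr(ET, "indent"):        # only present in Python 3.9 and later
    ET.indent(root, space="  ")

pretty = ET.tostring(root, encoding="unicode")   # 'unicode' gives a str rather than bytes
```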

Namespace stripper

Not thoroughly tested. The core of this was taken from elsewhere.

def strip_namespace_inplace(etree, namespace=None,remove_from_attr=True):
    """ Takes a parsed ET structure and does an in-place removal,
        by default of all namespaces, optionally a specific namespace (by its URL).
 
        Can make node searches simpler in structures with unpredictable namespaces
        and in content known not to mix namespaces ambiguously.
 
        By default does so for node names as well as attribute names.       
        (doesn't remove the namespace definitions, but apparently
         ElementTree serialization omits any that are unused)
 
        Note that for attributes that are unique only because of namespace,
        this will cause attributes to be overwritten. 
        For example: <e p:at="bar" at="quu">   would become: <e at="bar">
        I don't think I've seen any XML where this matters, though.
    """
    if namespace is None: # all namespaces
        for elem in etree.getiterator():
            tagname = elem.tag
            if tagname[0]=='{':
                elem.tag = tagname[ tagname.index('}',1)+1:]
 
            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name[0]=='{':
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[attr_name.index('}',1)+1:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
 
    else: # asked to remove specific namespace.
        ns = '{%s}' % namespace
        nsl = len(ns)
        for elem in etree.getiterator():
            if elem.tag.startswith(ns):
                elem.tag = elem.tag[nsl:]
 
            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name.startswith(ns):
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[nsl:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
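
For context, this is what namespaced names look like to ElementTree and why stripping them simplifies searches. A small sketch assuming Python 3's xml.etree.ElementTree; the namespace URL is a made-up example:

```python
import xml.etree.ElementTree as ET

NS = "http://example.com/ns"   # hypothetical namespace URL
root = ET.fromstring('<r xmlns="%s"><item>x</item></r>' % NS)

qualified_tag = root.tag                                      # '{http://example.com/ns}r'
plain_lookup = root.find("item")                              # None: the bare name does not match
full_lookup = root.find("{%s}item" % NS)                      # matches, using the {url}name form
prefixed_lookup = root.find("n:item", namespaces={"n": NS})   # same element, via a prefix mapping
```

If the namespaces are predictable, the last two forms avoid having to alter the tree at all.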

UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'

...or some other non-BMP Unicode character. Not necessarily a to-console problem as you might assume.

This probably comes from ET's tostring(), which uses 'ascii' encoding by default -- which usually works because it rewrites Unicode into numeric entities. However, that conversion plays safe and only supports writing U+0080 to U+FFFF in numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).


If you want a unicode-wide serializer, you probably want to work around that problem.

One short way is to do tostring(nodes,encoding='utf-8'), which tells it to encode unicode characters to UTF-8 bytes before dumping them into the document.

Note the dash in 'utf-8'. This is a value for ET, not the python codec name ('utf8'/'u8'). The dash in there is important as this string is dumped into the XML header (or, apparently, omitted if it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.
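
A quick way to check that a non-BMP character survives this route is to serialize and parse it back. A sketch assuming Python 3's xml.etree.ElementTree:

```python
import xml.etree.ElementTree as ET

elem = ET.Element("sym")
elem.text = u"\U0001d516"                     # a character outside the BMP

data = ET.tostring(elem, encoding="utf-8")    # raw UTF-8 bytes instead of numeric entities
round_tripped = ET.fromstring(data).text      # parse it back to compare
```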

lxml

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

lxml is based on libxml2 and libxslt, but has a better API than those libraries' basic bindings.

It imitates ElementTree, and has a few other useful features.


minidom

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Input encoding safety

The following code was made to parse XML that should be utf8, but may well contain literally dumped Latin1. It originally used chardet to try to be even smarter than that.

ET makes things slightly more interesting: as it is made for real-world data, it doesn't take unicode strings as input; it wants something like utf8.
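
A minimal sketch of that idea (not the original code, and without the chardet step), assuming Python 3. The function name parse_lenient is hypothetical. It treats the input as UTF-8 when it decodes cleanly and as Latin1 otherwise, for the whole document rather than per byte sequence, and it assumes the document carries no <?xml?> header declaring some other encoding:

```python
import xml.etree.ElementTree as ET

def parse_lenient(raw_bytes):
    """Parse XML bytes that should be UTF-8 but may be literally dumped Latin1."""
    try:
        text = raw_bytes.decode("utf-8")
    except UnicodeDecodeError:
        # Latin1 maps every byte to a code point, so this decode cannot fail
        text = raw_bytes.decode("latin-1")
    # Hand ET clean UTF-8 bytes, which is what it expects when nothing is declared
    return ET.fromstring(text.encode("utf-8"))

from_utf8 = parse_lenient(u"<a>caf\u00e9</a>".encode("utf-8"))
from_latin1 = parse_lenient(b"<a>caf\xe9</a>")
```

Both calls give the same text, since the stray Latin1 byte is not valid UTF-8 and triggers the fallback.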


rewriting