Python notes - XML

From Helpful

I like dealing with XML in an object-based manner. It's bothersome to deal with in other forms, having to recursively iterate, test for node types to catch whitespace text nodes before the first element, and so on. Being able to access XML as python objects is nice - even though it's not always ideal.


There are various ways of doing this. ElementTree (and specifically the faster cElementTree implementation) is my current favourite. BeautifulSoup is specifically made to deal well with bad HTML, but is slow.

For other alternatives and their speeds, see the various online overviews of (mostly XML-only) packages.

Just constructing XML can be done by a few more packages still.


XML generation / serialization

py.xml

Perhaps the simplest method (codewise) is what e.g. py.xml does - consider:

import py.xml
 
class ns(py.xml.Namespace):
    "Convenience class for creating requests"
 
def find_request(session,command,sourceArray):
    find=ns.x_server_request(
        ns.find_request(
            ns.wait_flag("N"),
            ns.find_request_command( command ),
            # generator expression, creating a variable number of slightly nested nodes:
            list( ns.find_base(ns.find_base_001(sourceID))
                  for sourceID in sourceArray ),
            ns.session_id(session)
        )
    )
    return '<?xml version="1.0" encoding="UTF-8"?>%s'%(
        find.unicode(indent=0).encode('utf8')
    )


However, this is flawed in that it's a little too easy to create invalid XML: each of the parts is immediately converted to a string, meaning that while XML escaping (of <, >, and &) can be done automatically for attributes, there is no way to do so for text nodes, as the code cannot tell the difference between text nodes and nested elements.
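For contrast, document-model APIs like ElementTree escape text only at serialization time, because until then they still know what is text and what is markup. A quick Python 3 sketch:

```python
import xml.etree.ElementTree as ET

e = ET.Element('p')
e.text = '5 < 6 & true'   # stored as-is; nothing is escaped yet

# escaping happens at serialization, so text can never break the markup
print(ET.tostring(e))     # b'<p>5 &lt; 6 &amp; true</p>'
```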

stan

Nevow's stan seems to have considered the py.xml problem, because stan is a document object model. An example slightly adapted from the documentation:

from nevow import flat, tags as T
document = T.html[
    T.head[
        T.title["Hello, world!"]
    ],
    T.body[
        T.h1[ "This is a complete XHTML document modeled in Stan." ],
        T.p[ "This text is inside a paragraph tag." ],
        T.div(style="color: blue; width: 200px; background-color: yellow;")[
            "And this is a coloured div."
        ]
    ]
]
print flat.flatten(document)

You actually create a tree of flatten()able objects, and only that last call (flatten()) causes the actual conversion to a string.

This also means you can make any of your objects stan-XML-serializable by giving it a flatten() function.

It can also make some ways of templating more convenient.


ElementTree's SimpleXMLWriter

The syntax that ElementTree uses for XML generation is not as brief as the above, but still usable.

And it may be handy to use one module for all XML things, just to keep your dependencies down.


The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers what tags are open, so can avoid some wrap-up calls. It consists of:

  • element(tag, attributes) (one-shot element; avoids separate start and end)
  • start(tag, attributes)
  • end() (ends the current element), or end(tag)
  • close() (closes all currently open elements) or close(token) (closes several levels, up to the token/reference you took from start()), or
  • data() (text nodes)


from elementtree.SimpleXMLWriter import XMLWriter
w = XMLWriter(req,encoding='utf-8') # without the encoding, unicode becomes numeric entities
response = w.start("html")   #remember this so we can close() it
 
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value=u"my \u2222 application 1.0")
w.end()             # current open tag is head
 
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
 
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(response)   # implies the end() for body and html

For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and the last two are less likely to be easily installed /installable as libraries.

ElementTree by objects

This can be a little more tedious, but may be handy for XML-based protocols. For example:

def login(username,password):
    E,SE = ET.Element,ET.SubElement  # for brevity
 
    server_request = E('server_request')
    login_request    = SE(server_request,'login_request')    
    SE(login_request,'user_name').text     = username
    SE(login_request,'user_password').text = password
 
    return ET.tostring(server_request,'UTF-8')

SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The code above stores that reference on the first use, and only uses the return value inline in the other two uses.

It is unicode-capable. If you don't use the encoding argument, you get an XML fragment that uses numeric entities for such characters. If you do use it, it will both encode the output and prepend an <?xml?> header -- in both cases under the condition that it recognizes the encoding name. Since the encoding string is also copied into that header, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encoding names).

XML parsing

DOM is a heavy model because it is heavily decorated, and requires all data to be loaded. Most XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.


When you don't need all of that, you can use SAX-style parsing. This can be particularly useful for keeping a small memory and CPU footprint on large data files you want to fetch relatively little (or fairly flat) data from. It tends to involve more coding, yes.


Relatedly, you may care about BeautifulSoup, targeted at possibly-not-valid-HTML.


ElementTree

The thing I like about ET is that it's somewhere between the spartan SAX approach and the somewhat excessive DOM-style approach.

ElementTree is more element-centric than many other things, in that it does not keep elements, text, and attributes in separate objects - which makes some navigation code considerably shorter (though if instead of var-val-like XML such as config things you have HTML-like mixes of text with markup, things become more interesting).


Often enough, the easier it is to describe the XML in words, the easier it is to write ET code for. (If the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with)


It (and particularly the C-extension drop-in cElementTree) is decently fast ([1], [2]), though you wouldn't necessarily want the memory use you get from loading very large XML.



Example (parsing)

for:

xmldata='''
<subscriptions>
  <success/>
  <entry>
    <request>
      <issn>1234-5678</issn>
      <year>1984</year>
    </request>
    <presence>fulltext</presence>
  </entry>
  <entry>
    <request>
      <issn>0041-9999</issn>
      <year>1984</year>
    </request>
    <presence type="cached">fulltext</presence>
  </entry>
</subscriptions>'''

Code with comments:

subscriptions = ET.fromstring(xmldata)  # returns the root element.   For bad XML, typically raises SyntaxError
 
error = subscriptions.find('error')  # check whether there is an <error>with message</error> instead of <success/>
if error is not None:
    raise Exception( "Response reports error: %s"%(error.text) )  # .text gets the node's text content
 
for entry in subscriptions.findall('entry'):  #find all direct-child 'entry' elements
    issn = entry.find('request/issn').text
    year = entry.findtext('request/year','') 
 
    # Using xpath-style paths like above can be handy, though when we want to fetch multiple details 
    # it's often handier to get element references first
    presence = entry.find('presence')
    prestype = presence.get('type')  # attribute fetch. Using get() means we get None if missing
    prestext = presence.text
 
    print '%s in %s: %s (%s)'%(issn,year,prestext,prestype)

That code prints:

1234-5678 in 1984: fulltext (None)
0041-9999 in 1984: fulltext (cached)

Notes:

  • The functions on an element mean you tend to use it as a deeper-only tree
  • If, in the above example, you don't always have issn under request, the code would sometimes try to fetch None.text, which would be a (Python-side) mistake.
...which is why the year fetch is an example of a more robust variant for text: if there is no matching node, the default (itself defaulting to None) is returned, here altered to an empty string.
  • Your choice to use find(), findall(), getchildren(), getiterator() varies with case and taste
  • ET does character decoding to unicode, and seems to assume UTF8. If you want to be robust to other encodings, handle the string before handing it to ET. (verify)
    • Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
    • (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
  • there are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the number of children (which can be handy for some logic), list() to get a list of the children (though using generators is a good habit), and more.
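For instance, the sequence-like behaviour from that last point, as a minimal Python 3 sketch:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b/><c/><d/></a>')

print(len(root))                       # 3, the number of direct children
print([child.tag for child in root])   # ['b', 'c', 'd']
print(list(root)[0].tag)               # 'b' -- list() of an element is its children
```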

Navigation and searching

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)


Finding tags under a tag

  • tag.find(path)
    finds the first matching element (by tag name, or by path)
  • tag.findall(path)
    all matching children (by tag name or path) / all matching descendant paths
  • tag.iterfind(path)
    like findall(), but emits elements while walking, rather than collecting them all before returning


  • tag.xpath(path)
    (only in lxml.etree, not ElementTree)
    like findall(), but a partial implementation of XPath (adds a few features, like navigating to the parent)
    note: often a little slower than find*() [3]


  • getchildren()
    all direct children (no filter); deprecated in newer versions in favour of list(tag)
  • getiterator(tag=None)
    does a recursive tree walk, returning all elements (when tag=None), or all elements with the tag name you hand along; deprecated in newer versions in favour of tag.iter()


In my experience,

find() and findall() tend to be simplest when picking data out of a fixed structure.
the other navigation tends to be more useful in helper functions.


A subset of XPath is supported[4], so you can pick out elements several levels deeper, e.g.

for location in root.findall('record/item/location'):
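With a small made-up document, that looks like (Python 3):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<records>'
    '  <record><item><location>shelf A</location></item></record>'
    '  <record><item><location>shelf B</location></item></record>'
    '</records>')

# each path segment is one level: record under root, item under record, ...
for location in root.findall('record/item/location'):
    print(location.text)   # shelf A, then shelf B
```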

Fetching out data

Much of the time you'll be happy with

.tag for the node's name


Attributes:

  • get(name,default=None)
    gets an attribute value (or the default value if the attribute didn't exist)
  • set(name,val)
    - useful when you want to serialize it again
  • keys()
    returns the names of the present attributes
  • items()
    returns attributes, in an arbitrarily ordered list of (name,value) tuples.
  • attrib
    is a dict with the attributes. You can alter this, though note that you should not assign a new dict object - if you want new contents, do a clear() and update()
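A quick Python 3 sketch of those accessors (element and attribute names made up):

```python
import xml.etree.ElementTree as ET

e = ET.fromstring('<entry a="1" b="2"/>')

print(e.get('a'))          # '1'
print(e.get('missing'))    # None  (no KeyError, unlike e.attrib['missing'])
print(sorted(e.keys()))    # ['a', 'b']
print(sorted(e.items()))   # [('a', '1'), ('b', '2')]

e.set('c', '3')            # shows up when you serialize again
print(ET.tostring(e))      # the new attribute c is now included
```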


Text:

  • findtext(path,default=None)
    returns the text contents of the first matching element (or the default value, if nothing matched)
  • .text see below (may be None)
  • .tail see below (may be None)



Fetching text from this data model

Unlike the DOM-like model, where text is a first-class citizen but you have to do a lot more typing, ElementTree tries to focus on tags, which means it sticks text onto the nearest tag object.


A little more precisely:

  • element.text
    Can be None (if there is no text)
    If the node content is only text, this happens to be all the text.
    If there's a mix of text and nodes, it's only the text before the first contained element.
    If you want all text in a flatten-all-text-away way, see the notes below.
  • element.tail is the text between a node and its next sibling
    Can be None.


There is also findtext(path), which seems to be equivalent to find(path).text (returns the text content of the first matching element).



Practically, you deal either with

  • specifically structured data - which you can often access key-value style
  • free-form documents, which is messier, but "fetch all text fragments" usually goes a long way


Structured data (e.g. XML config files, metadata) is often designed with very selective nesting. Say, in

<meta><title>I am title</title><etc>and so on</etc></meta>

if you know the node you are interested in, and that there is no nesting at all, then you know that .text is all the contents you care about, that .tail is empty (or just whitespace, if it's pretty-printed to show indentation), and that code like the following does all you want:

meta = ET.fromstring('<meta><title>I am title</title><etc>and so on</etc></meta>')
retdict = {}
for node in meta:
   retdict[node.tag] = node.text


Free-form things, like HTML markup, like to intersperse things and are messier. Consider:

>>> nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>') # which is mostly text and nodes mixed under root level, except for e
>>> nodes.text     # <a>'s initial text   (a is the root node)
'this'
 
>>> [ (el.tag,el.tail)  for el in nodes ]           # tail is the text after the node, if any
[('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]
 
>>> [ (el.tag,el.text,el.tail)  for el in nodes ]   # .text is the (first) text in the node
[('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]
 
>>> all_text_fragments(nodes)    # see helper function below
['this', 'and', 'that', ' ', 'foo']


For documents, the following may go a long way:

def all_text_fragments(under):
    ''' Returns all fragments of text contained in a subtree, as a list of strings.
        Keep in mind that in pretty-printed XML, many fragments are only spaces and newlines.
        You might extend this, e.g. with specific tag names to ignore the contents of.
    '''
    r = []
    for e in under.getiterator():  # walks the subtree (in newer versions, use .iter())
        if e.text != None:
            r.append( e.text )
        if e.tail != None:
            r.append( e.tail )
    return r

Namespaces

In searches
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)

The element tag has the namespace URI, in XPath-style syntax.

As such, you can

root.findall('{http://www.w3.org/2002/07/owl#}Class')
root.findall('.//{http://purl.org/dc/elements/1.1/}title')


There does not seem to be a "find in any namespace" variant of the existing find functions, though you can always do it yourself (by explicitly matching on the element's tag string).
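Doing it yourself might look something like this (Python 3 sketch; the helper name is made up):

```python
import xml.etree.ElementTree as ET

def findall_any_ns(root, localname):
    ''' Matches on the local part of the tag, ignoring any {uri} namespace prefix. '''
    return [e for e in root.iter() if e.tag.split('}')[-1] == localname]

doc = ET.fromstring(
    '<root xmlns:owl="http://www.w3.org/2002/07/owl#"><owl:Class/></root>')

for e in findall_any_ns(doc, 'Class'):
    print(e.tag)    # {http://www.w3.org/2002/07/owl#}Class
```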


If you want to use a prefix like

root.findall('owl:Class')

...then read the next section

Prefixes
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)
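ElementTree's find functions accept a namespaces mapping (prefix to URI), so you can write prefixed paths without spelling out the full URI each time. A minimal Python 3 sketch (prefix and URI chosen for the example):

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<root xmlns:owl="http://www.w3.org/2002/07/owl#"><owl:Class/></root>')

# the prefix is whatever you choose here; only the URI has to match the document
ns = {'owl': 'http://www.w3.org/2002/07/owl#'}

print(len(doc.findall('owl:Class', ns)))   # 1
```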


Streaming

If you have a huge XML file, and it is only used as a stream of records where you care about one at a time, then you may care to not load it in memory.


You can use a parser that does little more than emit start and end tags, (see e.g. SAX parsers) but consuming this is more manual work.


You can get a decent tradeoff between the two.

A good explanation at https://stackoverflow.com/questions/9809469/python-sax-to-lxml-for-80gb-xml/9814580#9814580

The slightly shorter version:

  • iterparse will build the same tree that plain ET parsing will, but does so incrementally.
    So yes, it loads the same thing into memory, just piece by piece -- and you can remove what you've used as you go.
  • this helps when the XML is mostly a long list of records that you can clear() immediately after you're done with each.
    It does not help much when the XML effectively references other parts, because you typically can't know what to throw away when.
  • clear() still leaves empty elements, which still take a handful of bytes each. Usually that is fine, but when processing truly humongous XML you may want to del the previous nodes to save a little more
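That pattern might look something like this (Python 3 sketch; the record structure is made up, and a BytesIO stands in for a huge file):

```python
import io
import xml.etree.ElementTree as ET

xmlbytes = b'<records>' + b'<record><n>1</n></record>' * 1000 + b'</records>'

count = 0
for event, elem in ET.iterparse(io.BytesIO(xmlbytes), events=('end',)):
    if elem.tag == 'record':
        count += 1       # ...process the fully-parsed record here...
        elem.clear()     # then drop its children, so memory use stays roughly flat
print(count)             # 1000
```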




Semi-sorted

On unicode and bytes
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)



Lineage, variants, and importing
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)


Roughly speaking:

lxml

is a python binding for the libxml C library.
not standard library, but installed on most systems
the libxml wrapping itself is more minimal, but it added an etree-like interface, which is actually more capable than (c)ElementTree(verify)


ElementTree

imitates lxml
and adds python code to support
standard library since Python ≥2.5, also present in 3


cElementTree

(a more efficient) C implementation of ElementTree
Mostly the same API as ElementTree (a few functions are missing?(verify))
so is actually very similar to lxml's etree interface



As such

  • there is a subset of functions shared by all three
  • lxml.etree is much like ElementTree, though they differ on a few details, like unicode (is this still true in py3?)



When you deal with XHTML or well-formed HTML, there is further comparison in flexibility/speed.

And robustness - see e.g. BeautifulSoup (lxml has become the default parser in bs4).



Pretty-printer

Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented.

Meant only for debug, as this will implicitly change the stored data (and is not the most efficient).

def indent_inplace(elem, level=0, whitespacestrip=False):
    ''' Alters the text nodes so that the tostring()ed version will look nicely indented.
 
        whitespacestrip can make contents that contain a lot of newlines look cleaner, 
        but changes the stored data even more.
    '''
    i = "\n" + level*"  "
 
    if whitespacestrip:
        if elem.text:
            elem.text=elem.text.strip()
        if elem.tail:
            elem.tail=elem.tail.strip()
 
    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:  # note: this rebinds elem; after the loop it is the last child
            indent_inplace(elem, level+1, whitespacestrip=whitespacestrip)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i
 
 
def prettyprint(xml,whitespacestrip=True):
    ''' Convenience wrapper around indent_inplace():
        - Takes a string (parses it) or a ET structure,
        - alters it to reindent according to depth,
        - returns the result as a string.
 
        whitespacestrip: see note in indent_inplace()
 
        Not horribly efficient, and alters the structure you gave it,
        but you are only using this for debug, riiight?
    '''
    if type(xml) is str:
        xml = ET.fromstring(xml)
    indent_inplace(xml, whitespacestrip=whitespacestrip)
    return ET.tostring(xml).rstrip('\n')


Namespace stripper

I use this to simplify searching and fetching whenever I know my use case has no actual conflicts.


Not thoroughly tested. The core of this was taken from elsewhere.

def strip_namespace_inplace(etree, namespace=None,remove_from_attr=True):
    """ Takes a parsed ET structure and does an in-place removal of namespaces.
        By default removes all namespaces, optionally just a specific namespace (by its URL).
 
        Can make node searches simpler in structures with unpredictable namespaces
        and in content given to not be ambiguously mixed.
 
        By default does so for node names as well as attribute names.       
        (doesn't remove the namespace definitions, but apparently
         ElementTree serialization omits any that are unused)
 
        Note that for attributes that are unique only because of namespace,
        this will cause attributes to be overwritten. 
        For example: <e p:at="bar" at="quu">   would become: <e at="bar">
        I don't think I've seen any XML where this matters, though.
    """
    if namespace is None: # all namespaces
        for elem in etree.getiterator():
            tagname = elem.tag
            if tagname[0]=='{':
                elem.tag = tagname[ tagname.index('}',1)+1:]
 
            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name[0]=='{':
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[attr_name.index('}',1)+1:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
 
    else: # asked to remove specific namespace.
        ns = '{%s}' % namespace
        nsl = len(ns)
        for elem in etree.getiterator():
            if elem.tag.startswith(ns):
                elem.tag = elem.tag[nsl:]
 
            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name.startswith(ns):
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[nsl:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
UnicodeEncodeError

You may run into an error like:

UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'

...or one naming some other non-BMP unicode character (so \U rather than \u).


This is not necessarily a printing-to-console problem, as you might first assume.


This probably comes from ET's tostring(), which uses 'ascii' encoding by default, and which usually works fine because it rewrites Unicode into numeric entities.

However, that conversion plays safe and only supports writing U+0080 to U+FFFF in numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).


One technically-more-standard way to work around that is to do tostring(nodes, encoding='utf-8'), which tells it to encode unicode characters to UTF-8 bytes before dumping them into the document.

Note the dash in 'utf-8'. This is a value for ET, not the Python codec name ('utf8'/'u8'): the dash matters because this string is also copied into the XML header (or, apparently, omitted when it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.

lxml

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)

lxml is based on libxml2 and libxslt, but has a better API than those libraries' basic bindings.

It imitates ElementTree, and has a few other useful features.

minidom

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)


pulldom

Input encoding safety

The following code was made to parse XML that should be utf8, but may well contain literally dumped Latin1. It originally used chardet to try to be even smarter than that.

ET makes things slightly more interesting: as it is made for real-world data, it doesn't want unicode strings as input, it wants something like UTF-8 bytes.


rewriting