Python notes - XML

I like dealing with XML in an object-based manner. It's bothersome to deal with in other forms, having to recursively iterate, test for node types to catch whitespace text nodes before the first element, and so on. Being able to access XML as python objects is nice - even though it's not always ideal.

There are various ways of doing this. ElementTree (and specifically the faster cElementTree implementation) is my current favourite. BeautifulSoup is specifically made to deal well with bad HTML, but is somewhat slower.

For other alternatives and their speeds, see mainly this (and possibly this and this) for overviews of (mostly xml-only) packages.

Just constructing XML can be done by a few more things.

XML parsing

DOM is a heavy model because it is heavily decorated, and requires all data to be loaded. Most XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.

When you don't need all of that, you can use SAX-style parsing. This can be particularly useful for keeping a small memory and CPU footprint on large data files that you want to fetch relatively little (or fairly flat) data from. It tends to involve more coding, yes.
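To give an idea of the "more coding" part, here is a minimal sketch of SAX-style parsing using the standard library's xml.sax. The document structure (book and title elements) and the TitleCollector class are made up for illustration. The point is that you keep your own state between callbacks, and no tree is ever built.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element (hypothetical document structure)."""
    def __init__(self):
        super().__init__()
        self.titles = []
        self._in_title = False
        self._buf = []

    def startElement(self, name, attrs):
        if name == 'title':
            self._in_title = True
            self._buf = []

    def characters(self, content):
        # text may arrive in several chunks, so collect and join later
        if self._in_title:
            self._buf.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self._buf))
            self._in_title = False

data = b'<lib><book><title>A</title></book><book><title>B</title></book></lib>'
handler = TitleCollector()
xml.sax.parseString(data, handler)
print(handler.titles)
```

For large files you would use xml.sax.parse() on a filename or file object instead of parseString().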

Relatedly, you may care about BeautifulSoup, targeted at possibly-not-valid-HTML.

lxml

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

lxml is a python interface based on libxml2 and libxslt, but has a better API than those libraries' basic bindings.

It also has a good portion that imitates ElementTree.

lxml also adds a few things that ET does not have.

ET and lxml - Lineage, variants, and importing

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Roughly speaking:

lxml

is a python binding for the libxml2 C library (and libxslt).

not standard library, but installed on most systems

the libxml wrapping itself is more minimal, but it added an etree-like interface, lxml.etree, which is actually more capable than (c)ElementTree(verify)

ElementTree

is the API that lxml.etree imitates

and adds python code for some extra functionality

in the standard library since Python 2.5, also present in 3

cElementTree

(a more efficient) C implementation of ElementTree

Mostly the same API as ElementTree (a few functions are missing?(verify)), but misses some of that extra python code

so is actually very similar to lxml's etree interface

As such

there is a subset of functions shared by all three

there are details that differ between the three, sometimes subtle but sometimes fundamental

read e.g. https://lxml.de/1.3/compatibility.html

you may wish to choose just one (I switched to lxml at some point)

and if you want code that can deal with objects from the others - mostly duck typing solves this, but there are a few footnotes (e.g. lxml.etree has comments and processing instructions, ET just ignores them while parsing)

when you deal with messy, possibly-incorrect HTML, take a look at BeautifulSoup (which now (bs4) defaults to using lxml under the covers).

When you deal with XHTML or well-formed HTML, there is further comparison in flexibility/speed.

In python2 it was

interesting where to import ET from - xml.etree was introduced in 2.5, before then you'd import elementtree(verify)

useful to mix/fallback imports, to e.g. get ET's added python code mixed with cET's faster c code.

This is less pressing in py3 (since 3.3), in that xml.etree.ElementTree uses a C implementation if available. The separate xml.etree.cElementTree module was deprecated in 3.3 and removed in 3.9.
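A fallback import might look like the following sketch. It assumes you only use the subset of the API that lxml.etree and the standard library's ElementTree share, and the preference for lxml here is just one possible choice.

```python
# Prefer lxml when it is installed, otherwise fall back to the standard library.
# Only safe if the rest of the code sticks to the API both have in common.
try:
    from lxml import etree as ET
except ImportError:
    import xml.etree.ElementTree as ET

root = ET.fromstring(b'<a><b/></a>')
print(root.tag, len(root))
```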

ElementTree

ElementTree works out as a pragmatic midway between the spartan SAX approach and the somewhat excessive DOM-style approach.

Where DOM keeps elements, text, and attributes in their own objects, ET smushes those onto one object:

the element (name, namespace)
the attributes on this element
the text directly within, and directly after (this one's a little funny, though)

This makes some navigation code considerably shorter.

Roughly speaking, the easier it is to describe the XML in words, the easier it is to write ET code for.

That said, if the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with.

And, in particular, you have to sit down once to think about how it handles text.

For data storage like variable-and-value lists it's actively more practical

for free-form documents it gets a lot more interesting
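A small sketch of how text ends up on elements, using the standard library's ElementTree. The example markup is made up. The thing to notice is that text following a child element is stored on that child (as .tail), not on the parent.

```python
import xml.etree.ElementTree as ET

p = ET.fromstring('<p class="x">Hello <b>bold</b> world<br/>!</p>')
b, br = p[0], p[1]

print(p.tag, p.attrib)     # element name and its attributes
print(repr(p.text))        # text directly inside <p>, before the first child
print(repr(b.text))        # text inside <b>
print(repr(b.tail))        # text after </b>, up to the next element
print(repr(br.text))       # empty element, so no text at all
print(repr(br.tail))       # text after <br/>

# For free-form documents you often just want all text in document order:
all_text = ''.join(p.itertext())
```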

Semi-sorted

On unicode and bytes

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Pretty-printer

Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented.

Meant only for debug, as this will implicitly change the stored data (and is not the most efficient).

import xml.etree.ElementTree as ET

def indent_inplace(elem, level=0, whitespacestrip=False):
    ''' Alters the text nodes so that the tostring()ed version will look nicely indented.

        whitespacestrip can make contents that contain a lot of newlines look cleaner,
        but changes the stored data even more.
    '''
    i = "\n" + level*"  "

    if whitespacestrip:
        if elem.text:
            elem.text = elem.text.strip()
        if elem.tail:
            elem.tail = elem.tail.strip()

    if len(elem):
        # whitespace-only (or missing) text before the first child: indent one level deeper
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent_inplace(child, level+1, whitespacestrip)
        # the last child's tail sits right before our closing tag, so it gets our indentation
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i


def prettyprint(xml, whitespacestrip=True):
    ''' Convenience wrapper around indent_inplace():
        - Takes a string (parses it) or a ET structure,
        - alters it to reindent according to depth,
        - returns the result as a string.

        whitespacestrip: see note in indent_inplace()

        Not horribly efficient, and alters the structure you gave it,
        but you are only using this for debug, riiight?
    '''
    if isinstance(xml, (str, bytes)):
        xml = ET.fromstring(xml)
    indent_inplace(xml, whitespacestrip=whitespacestrip)
    # encoding='unicode' makes tostring() return str rather than bytes (Python 3)
    return ET.tostring(xml, encoding='unicode').rstrip('\n')
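On Python 3.9 and later, the standard library has a built-in that does roughly the same in-place whitespace rewrite, xml.etree.ElementTree.indent(). A minimal sketch, with made-up input:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>text</b><c><d/></c></a>')
ET.indent(root, space='  ')   # Python 3.9+; alters text/tail in place, like the recipe above
pretty = ET.tostring(root, encoding='unicode')
print(pretty)
```

The same caveat applies: it changes the stored whitespace, so it is mostly for debugging and human-readable output.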

Namespace stripper

I use this to simplify searching and fetching whenever I know my use case has no actual conflicts.

Not thoroughly tested. The core of this was taken from elsewhere.

def strip_namespace_inplace(etree, namespace=None,remove_from_attr=True):
    """ Takes a parsed ET structure and does an in-place removal of namespaces.
        By default removes all namespaces, optionally just a specific namespace (by its URL).
        
        Can make node searches simpler in structures with unpredictable namespaces
        and in content given to not be ambiguously mixed.

        By default does so for node names as well as attribute names.       
        (doesn't remove the namespace definitions, but apparently
         ElementTree serialization omits any that are unused)

        Note that for attributes that are unique only because of namespace,
        this will cause attributes to be overwritten. 
        For example: <e p:at="bar" at="quu">   would become: <e at="bar">
        I don't think I've seen any XML where this matters, though.
    """
    if namespace is None: # all namespaces
        for elem in etree.iter():   # getiterator() was removed in Python 3.9
            tagname = elem.tag
            if tagname[0]=='{':
                elem.tag = tagname[ tagname.index('}',1)+1:]

            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name[0]=='{':
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[attr_name.index('}',1)+1:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)

    else: # asked to remove specific namespace.
        ns = '{%s}' % namespace
        nsl = len(ns)
        for elem in etree.iter():
            if elem.tag.startswith(ns):
                elem.tag = elem.tag[nsl:]

            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name.startswith(ns):
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[nsl:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
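For comparison, this is roughly what searching looks like when you leave the namespaces in place, which is the bother the stripper avoids. A sketch using the standard library's ElementTree. The Atom-like document is made up, and the {*} wildcard needs Python 3.8 or later.

```python
import xml.etree.ElementTree as ET

doc = ET.fromstring(
    '<feed xmlns="http://www.w3.org/2005/Atom">'
    '<entry><title>first</title></entry>'
    '<entry><title>second</title></entry>'
    '</feed>'
)

plain = doc.findall('entry')                                   # finds nothing: tags are namespaced
qualified = doc.findall('{http://www.w3.org/2005/Atom}entry')  # full {uri}name form
prefixed = doc.findall('atom:entry', {'atom': 'http://www.w3.org/2005/Atom'})
wildcard = doc.findall('{*}entry')                             # any namespace (Python 3.8+)

titles = [entry.findtext('{*}title') for entry in wildcard]
```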

etree errors

internal error: Huge input lookup

Unicode strings with encoding declaration are not supported

Usually this comes down to you having handed fromstring() an XML document as a unicode string instead of as bytes.

It wants to see bytes, and either read the encoding from the data or have you override it via an XMLParser().

Often: decoding is the parser's job, so don't do it yourself, and don't let file reading do it for you (open files in binary mode). If you had no control over how the data was read, I guess you could .encode('utf8') it.
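A sketch of the bytes-in approach. It uses the standard library's ElementTree so that it runs anywhere. The error message above is the one lxml gives (as a ValueError) when handed a str that carries an encoding declaration, and the same pattern of passing bytes applies there. The example data is made up.

```python
import xml.etree.ElementTree as ET

# Bytes, as they would come out of open(path, 'rb').read()
raw = '<?xml version="1.0" encoding="iso-8859-1"?><name>Zo\u00eb</name>'.encode('iso-8859-1')

# The parser reads the encoding declaration and does the decoding itself
root = ET.fromstring(raw)
name = root.text

# If the data has no (or a wrong) declaration, you can override the encoding on the parser
undeclared = b'<name>Zo\xeb</name>'
parser = ET.XMLParser(encoding='iso-8859-1')
name_overridden = ET.fromstring(undeclared, parser=parser).text
```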

UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'

...or some other non-BMP Unicode character (so \U rather than \u).

Not necessarily a to-console problem as you might assume.

This probably comes from ET's tostring(), which uses 'ascii' encoding by default, which usually works fine because it rewrites Unicode into numeric entities.

However, that conversion plays safe and only supports writing U+0080 to U+FFFF in numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).

One technically-more-standard way to work around that is tostring(nodes, encoding='utf-8'), which tells it to encode unicode characters to UTF-8 bytes before dumping them into the document. Note the dash in 'utf-8': this is a value for ET, not the python codec name ('utf8'/'u8'). The dash is important because this string is also dumped into the XML header (or, apparently, omitted if it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.
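A sketch of the difference on Python 3's standard library ElementTree, where the default (ASCII) output writes such characters as numeric character references rather than failing. The element name is made up.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('sym')
elem.text = '\U0001d516'    # a non-BMP character

as_ascii = ET.tostring(elem)                      # bytes; character becomes a numeric reference
as_utf8 = ET.tostring(elem, encoding='utf-8')     # bytes; character is encoded as UTF-8
as_text = ET.tostring(elem, encoding='unicode')   # str; nothing is encoded yet

print(as_ascii)
print(as_utf8)
```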

minidom

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

pulldom

Input encoding safety

The following code was made to parse XML that should be utf8, but may well contain literally dumped Latin1. It originally used chardet to try to be even smarter than that.

ET makes things slightly more interesting: as it is made for real-world data, it doesn't take unicode strings as input, it wants bytes in something like utf8.
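The original code is not reproduced here. As a rough sketch of the idea only (without chardet): check whether the bytes are valid UTF-8, and if not, treat the whole document as Latin1 and re-encode it before handing it to the parser. The function name is made up, and this assumes the document is entirely one or the other; a document that mixes both encodings would still come out mangled.

```python
import xml.etree.ElementTree as ET

def parse_utf8_or_latin1(data):
    """Parse XML bytes that should be UTF-8 but may actually be Latin1 (hypothetical helper)."""
    try:
        data.decode('utf-8')              # validity check only; the result is discarded
    except UnicodeDecodeError:
        # Latin1 can decode any byte sequence, so this cannot fail
        data = data.decode('latin-1').encode('utf-8')
    return ET.fromstring(data)

good = '<a>caf\u00e9</a>'.encode('utf-8')
bad = '<a>caf\u00e9</a>'.encode('latin-1')
text_good = parse_utf8_or_latin1(good).text
text_bad = parse_utf8_or_latin1(bad).text
```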

rewriting

XML generation / serialization

py.xml

One simple-looking method (codewise) is what e.g. py.xml does - consider:

import py.xml

class ns(py.xml.Namespace):
    "Convenience class for creating requests"
 
def find_request(session,command,sourceArray):
    find=ns.x_server_request(
        ns.find_request(
            ns.wait_flag("N"),
            ns.find_request_command( command ),
            # generator expression used to create a variable number of slightly nested nodes:
            list( ns.find_base(ns.find_base_001(sourceID))   
                  for sourceID in sourceArray ),
            ns.session_id(session)
        )
    )
    findXMLdoc = u'<?xml version="1.0" encoding="UTF-8"?>%s' % find.unicode(indent=0)
    return findXMLdoc.encode('utf8')   # encode the whole document once, and actually return it

However, this is flawed in that it's a little too easy to create invalid XML: each of the parts is immediately converted to a string, meaning that while escaping (of <>&) can be done automatically for attributes, there is no way to do so for text nodes, as the code cannot tell the difference between text nodes and nested elements.

stan

Nevow's stan seems to have considered the just-mentioned problem, because stan is a document object model. An example slightly adapted from the documentation:

from nevow import flat, tags as T
document = T.html[
    T.head[
        T.title["Hello, world!"]
    ],
    T.body[
        T.h1[ "This is a complete XHTML document modeled in Stan." ],
        T.p[ "This text is inside a paragraph tag." ],
        T.div(style="color: blue; width: 200px; background-color: yellow;")[
            "And this is a coloured div."
        ]
    ]
]
print(flat.flatten(document))

You actually create a tree of flatten()able objects, and only that last call (flatten()) actually makes a string view of that document model. This also means you can make any of your objects stan-XML-serializable by giving it a flatten() function.

It can also make some ways of templating more convenient.

ElementTree's SimpleXMLWriter

The syntax that ElementTree uses for XML generation is not as brief as the above, but still usable.

And it may be handy to use one module for all XML things, just to keep your dependencies down.

The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers what tags are open, so can avoid some wrap-up calls. It consists of:

element(tag,attributes) (one-shot element, avoids separate start and end)

start(tag, attributes)
end() (current), or end(tag)
close() (all currently open) or close(id) (many levels; id is the reference you got back from start())

data() (text nodes)

from elementtree.SimpleXMLWriter import XMLWriter
w = XMLWriter(req, encoding='utf-8')  # req: a file-like object to write to. Without the encoding, unicode becomes entities
response = w.start("html")   #remember this so we can close() it

w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value=u"my \u2222 application 1.0")
w.end()             # current open tag is head

w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")

w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(response)   # implies the end() for body and html

For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and py.xml and stan are less likely to be installed, or easily installable, as libraries.

ElementTree by objects

This can be a little more tedious, but may be handy for XML-based protocols. For example:

import xml.etree.ElementTree as ET

def login(username, password):
    E, SE = ET.Element, ET.SubElement  # for brevity

    server_request = E('server_request')
    login_request  = SE(server_request, 'login_request')
    SE(login_request, 'user_name').text     = username
    SE(login_request, 'user_password').text = password

    return ET.tostring(server_request, 'UTF-8')

SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The above code stores that reference on the first use (login_request), and on the other two uses only needs it long enough to set .text.

It is unicode-capable. If you don't use the encoding keyword, you get an XML fragment that uses numeric entities for such characters. If you do use it, it will both encode it and prepend a <?xml?> header, in both cases under the condition that it recognizes the encoding name. (Python 3's ElementTree omits the header by default when the encoding is UTF-8 or US-ASCII.) Since it uses the encoding string as-is in the header, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encodings).
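If you want the header regardless of encoding on Python 3, tostring() takes an xml_declaration keyword (Python 3.8 and later). A minimal sketch, with a made-up element and value:

```python
import xml.etree.ElementTree as ET

request = ET.Element('server_request')
ET.SubElement(request, 'user_name').text = 'J\u00fcrgen'

no_encoding = ET.tostring(request)   # ASCII bytes, with numeric character references
with_header = ET.tostring(request, encoding='UTF-8', xml_declaration=True)  # Python 3.8+

print(no_encoding)
print(with_header)
```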
