Python notes - XML
I like dealing with XML in an object-based manner. It's bothersome to deal with in other forms, having to recursively iterate, test for node types to catch whitespace text nodes before the first element, and so on. Being able to access XML as python objects is nice - even though it's not always ideal.
There are various ways of doing this. ElementTree (and specifically the faster cElementTree implementation) is my current favourite. BeautifulSoup is specifically made to deal well with bad HTML, but is slow.
For other alternatives and their speeds, see mainly this (and possibly this and this) for overviews of (mostly xml-only) packages.
Just constructing XML can be done by a few more things.
Contents
- 1 XML generation / serialization
- 2 XML parsing
XML generation / serialization
py.xml
Perhaps the simplest method (codewise) is what e.g. py.xml does - consider:
    import py.xml

    class ns(py.xml.Namespace):
        "Convenience class for creating requests"

    def find_request(session,command,sourceArray):
        find=ns.x_server_request(
            ns.find_request(
                ns.wait_flag("N"),
                ns.find_request_command( command ),
                # list of generated nodes, to create a variable amount of slightly nested nodes:
                list( ns.find_base(ns.find_base_001(sourceID))  for sourceID in sourceArray ),
                ns.session_id(session)
            )
        )
        findXMLdoc='<?xml version="1.0" encoding="UTF-8"?>%s'%( find.unicode(indent=0).encode('utf8') )
        return findXMLdoc   # added here so that the function hands back the document it built
stan
Nevow's stan seems to have considered the py.xml problem, because stan is a document object model. An example slightly adapted from the documentation:
    from nevow import flat, tags as T

    document = T.html[
        T.head[
            T.title["Hello, world!"]
        ],
        T.body[
            T.h1[ "This is a complete XHTML document modeled in Stan." ],
            T.p[ "This text is inside a paragraph tag." ],
            T.div(style="color: blue; width: 200px; background-color: yellow;")[
                "And this is a coloured div."
            ]
        ]
    ]
    print(flat.flatten(document))
You actually create a tree of flatten()able objects, and only that last call (flatten()) causes the actual conversion to a string.
This also means you can make any of your objects stan-XML-serializable by giving it a flatten() function.
It can also make some ways of templating more convenient.
ElementTree's SimpleXMLWriter
The syntax that ElementTree uses for XML generation is not as brief as the above, but is still usable.
And it may be handy to use one module for all XML things, just to keep your dependencies down.
The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers what tags are open, so can avoid some wrap-up calls.
It consists of:
- element(tag,attributes) (one-shot element, avoids separate start and end)
- start(tag, attributes)
- end() (current), or end(tag)
- close() (all currently open) or close(id) (closes many levels at once; id is the reference you got back from start())
- data() (text nodes)
    from elementtree.SimpleXMLWriter import XMLWriter

    # req is assumed to be a writable file-like object (e.g. a web response)
    w = XMLWriter(req,encoding='utf-8')   # without the encoding, unicode becomes entities
    response = w.start("html")            # remember this so we can close() it

    w.start("head")
    w.element("title", "my document")
    w.element("meta", name="generator", value=u"my \u2222 application 1.0")
    w.end()                               # current open tag is head

    w.start("body")
    w.element("h1", "this is a heading")
    w.element("p", "this is a paragraph")

    w.start("p")
    w.data("this is ")
    w.element("b", "bold")
    w.data(" and ")
    w.element("i", "italic")
    w.data(".")
    w.end("p")

    w.close(response)                     # implies the end() for body and html
For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and py.xml and stan are less likely to be easily installed / installable as libraries.
ElementTree by objects
This can be a little more tedious, but may be handy for XML-based protocols. For example:
    import xml.etree.ElementTree as ET

    def login(username,password):
        E,SE = ET.Element,ET.SubElement   # for brevity
        server_request = E('server_request')
        login_request = SE(server_request,'login_request')
        SE(login_request,'user_name').text = username
        SE(login_request,'user_password').text = password
        return ET.tostring(server_request,'UTF-8')
SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The above code stores that reference on the first use (login_request), and on the other two uses only uses it immediately, to set .text.
It is unicode-capable. If you don't use the encoding keyword, you get an XML fragment that uses numeric entities for such characters. If you do use it, it will both encode the output and prepend a <?xml?> header -- in both cases under the condition that it recognizes the encoding name. Since the encoding string is copied into that header as-is, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encodings). Details differ between versions: newer Pythons leave the header out for UTF-8 unless you ask for it with xml_declaration=True.
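To make the difference concrete, here is a small sketch using the standard library's xml.etree.ElementTree. It assumes Python 3.8 or later (tostring() has no xml_declaration argument before that), and the element name and text are made up for illustration.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('note')
elem.text = u'angle \u2222'

# No encoding given: ASCII-only output, non-ASCII characters become numeric entities.
as_fragment = ET.tostring(elem)

# Encoding given, and the header requested explicitly, so that the result
# does not depend on version-specific defaults.
as_document = ET.tostring(elem, encoding='UTF-8', xml_declaration=True)

print(as_fragment)
print(as_document)
```

Asking for the header explicitly is the safer habit if the receiving side insists on one.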
XML parsing
DOM is a heavy model because it is heavily decorated, and requires all data to be loaded. Most XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.
When you don't need all of that, you can use SAX-style parsing. This can be particularly useful for keeping a small memory and CPU footprint on large data files you want to fetch relatively little (or fairly flat) data from.
It tends to involve more coding, yes.
Relatedly, you may care about BeautifulSoup, targeted at possibly-not-valid-HTML.
lxml
This article/section is a stub - probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, or tell me)
lxml is a python interface based on libxml2 and libxslt, but has a better API than those libraries' basic bindings.
A good portion of its API also imitates ElementTree.
ET and lxml - Lineage, variants, and importing
Roughly speaking:
lxml
- is a python binding for the libxml2 (and libxslt) C libraries.
- not standard library, but installed on most systems
- the libxml wrapping itself is more minimal, but it added an etree-like interface, lxml.etree, which is actually more capable than (c)ElementTree(verify)
ElementTree
- is the API that lxml.etree imitates (ElementTree came first)
- and adds python code for some extra functionality
- in the standard library since Python 2.5, also present in 3
cElementTree
- (a more efficient) C implementation of ElementTree
- Mostly the same API as ElementTree (a few functions are missing?(verify)), but misses some of that extra python code
- so is actually very similar to lxml's etree interface
As such
- there is a subset of functions shared by all three
- there are details that differ between the three, sometimes subtle but sometimes fundamental
- read e.g. https://lxml.de/1.3/compatibility.html
- you may wish to choose just one (I switched to lxml at some point)
- and if you want code that can deal with objects from the others - mostly duck typing solves this, but there are a few footnotes (e.g. lxml.etree has comments and processing instructions, ET just ignores them while parsing)
- when you deal with messy, possibly-incorrect HTML, take a look at BeautifulSoup (which now (bs4) uses lxml under the covers when it is installed).
- When you deal with XHTML or well-formed HTML, there is further comparison in flexibility/speed.
In python2 it was
- interesting where to import ET from - xml.etree was introduced in 2.5, before then you'd import elementtree(verify)
- useful to mix/fallback imports, to e.g. get ET's added python code mixed with cET's faster c code.
- This is less pressing in py3 (since 3.3) in that xml.etree.ElementTree uses a C implementation if available. (The separate xml.etree.cElementTree name was deprecated at that point, and removed in 3.9.)
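A fallback import tends to look something like the following sketch. The order of preference (lxml first, then the standard library) is just one possible choice, and code written this way should stick to the subset of the API that both share.

```python
try:
    from lxml import etree as ET          # third-party, used if installed
except ImportError:
    import xml.etree.ElementTree as ET    # standard library fallback

# Only uses functions that both implementations provide.
root = ET.fromstring('<a><b/><b/></a>')
print(len(root.findall('b')))
```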
ElementTree
ElementTree works out as a pragmatic midway between the spartan SAX approach and the somewhat excessive DOM-style approach.
Where DOM keeps elements, text, and attributes in their own objects, ET smushes those onto one object:
- the element (name, namespace)
- the attributes on this element
- the text directly within, and directly after (this one's a little funny, though)
This makes some navigation code considerably shorter.
Roughly speaking, the easier it is to describe the XML in words, the easier it is to write ET code for.
That said, if the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with.
And, in particular, you have to sit down once to think about how it handles text.
- For data storage like variable-and-value lists it's actively more practical
- for free-form documents it gets a lot more interesting
Such abstract talk gets clearer with an example, though:
Example (parsing)
for:
    xmldata='''
    <subscriptions>
      <success/>
      <entry>
        <request>
          <issn>1234-5678</issn>
          <year>1984</year>
        </request>
        <presence>fulltext</presence>
      </entry>
      <entry>
        <request>
          <issn>0041-9999</issn>
          <year>1984</year>
        </request>
        <presence type="cached">fulltext</presence>
      </entry>
    </subscriptions>'''
Code with comments:
    import xml.etree.ElementTree as ET

    subscriptions = ET.fromstring(xmldata)   # parses the whole, returns the root element

    error = subscriptions.find('error')      # check whether there is an <error>with message</error> instead of <success/>
    if error is not None:
        raise Exception( "Response reports error: %s"%(error.text) )   # .text gets the node's direct text content

    for entry in subscriptions.findall('entry'):   # find all direct-child 'entry' elements
        issn = entry.find('request/issn').text     # note: if this node doesn't exist, this would error out (because None.text doesn't make sense)
        year = entry.findtext('request/year','')   # another way of fetching text - one that deals better with absence

        # Using xpath-style paths like above can be handy, though when we want to fetch multiple details
        # it's often handier to get element references first
        presence = entry.find('presence')
        prestype = presence.get('type')   # attribute fetch. Using get() means we get None if missing
        prestext = presence.text
        print('%s in %s: %s (%s)'%(issn,year,prestext,prestype))
That code prints:
    1234-5678 in 1984: fulltext (None)
    0041-9999 in 1984: fulltext (cached)
Notes:
- The functions on an element mean you tend to use it as a deeper-only tree
- If, in the above example, you don't always have issn under request, the python would sometimes try to fetch None.text which would be a (python-side) mistake.
- ...which is why the year fetch is an example of a more robust variant - for text. If no element matches the path, the default (itself defaulting to None) is returned, here altered to an empty string
- Your choice between find(), findall(), getchildren(), getiterator() varies with case and taste (note that getchildren() and getiterator() were removed in Python 3.9; list(element) and iter() are the current spellings)
- ET does character decoding to unicode, and seems to assume UTF8. If you want to be robust to other encodings, handle the string before handing it to ET. (verify)
- Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
- (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
- there are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the amount of children (can be handy for some logic), list() to get an anonymous list of the children (though using generators is a good habit), and more.
Direct children only
- getchildren() all direct children (no filter). Removed in Python 3.9; list(tag), or just iterating over the element, does the same
Relative path to current tag
- tag.find(path) finds the first matching element (by tag name, or by path)
- e.g. records = root.find('record_list')
- tag.findall(path) all matching descendants by path (which you can also use for all matching children by tag)
- e.g. records.findall('meta/location')
- tag.iterfind(path) - like findall(), but emits while walking, rather than collecting before returning
- tag.xpath(path) (only in lxml, not ElementTree)
- like findall(), but with real XPath rather than the limited subset that the find functions support (adds a few features, like navigating to parent)
- note: often a little slower than find*() [1]
Anywhere under, disregarding structure
- getiterator(tag=None) does a recursive treewalk (called iter(tag=None) in current versions; the old name was removed in Python 3.9)
- returning all elements (when tag=None)
- ...or filtered by the tag name you hand along
In my experience,
- find() and findall()/iterfind() tend to be simplest when picking data out of a known, fixed structure.
- and avoids some possible confusion when similarly named nodes are nested under others
- getiterator() is useful to pick out items when you don't care about context
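As a sketch of how these differ in practice, using the standard library's ElementTree and a made-up document (iter() is the current name for getiterator()):

```python
import xml.etree.ElementTree as ET

root = ET.fromstring(
    '<library>'
    '<record><meta><location>shelf 1</location></meta></record>'
    '<record><meta><location>shelf 2</location></meta>'
    '<note><location>basement</location></note></record>'
    '</library>')

# find(): only the first match, or None
first = root.find('record')

# findall() with a path: only elements in exactly that position in the structure
by_path = [el.text for el in root.findall('record/meta/location')]

# iter(): every element with that name, however deep and wherever it sits
anywhere = [el.text for el in root.iter('location')]

print(by_path)
print(anywhere)
```

The path-based search skips the location under note, which is the "similarly named nodes nested under others" case.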
Fetching out data
Much of the time you'll be happy with
.tag for the node's name
Attributes:
- get(name,default=None) gets an attribute value (or the default value if the attribute didn't exist)
- set(name,val) - useful when you want to serialize it again
- keys() returns the names of the present attributes
- items() returns attributes, in an arbitrarily ordered list of (name,value) tuples.
- attrib is a dict with attributes. You can alter this, though note that you should not assign a new dict object - if you want new contents, do a clear() and update()
Text:
- findtext(path,default=None) returns the text contents of the first matching element (or the default value, if nothing matched)
- .text see below (may be None)
- .tail see below (may be None)
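A quick sketch of those accessors on a single element, using the standard library's ElementTree (the 'lang' and 'checked' attribute names are made up):

```python
import xml.etree.ElementTree as ET

el = ET.fromstring('<presence type="cached">fulltext</presence>')

name = el.tag                          # the node's name
ptype = el.get('type')                 # attribute that is present
missing = el.get('lang', 'unknown')    # absent attribute: the default comes back
el.set('checked', 'yes')               # add or overwrite an attribute
attr_names = sorted(el.keys())         # sorted here only to get a predictable order

print(name, ptype, missing, attr_names, el.text, el.tail)
```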
Fetching text from this data model
Unlike the DOM-like model, where text is a first-class citizen but you have to do a lot more typing, ElementTree tries to focus on tags, which means it sticks text onto the nearest tag object.
A little more precisely:
- element.text
- Can be None (if there is no text)
- If node content is only text, this will be all the text
- If there's a mix of text and nodes, it's only the text before the first contained element.
- If you want all text in a flatten-all-text-away way, see notes below.
- element.tail is the text directly after the element's closing tag, up to the next tag (the start of the next sibling, or the end of the parent)
- Can be None.
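There is also findtext(path),
which is roughly equivalent to find(path).text (returns the text content of the first matching element), except that it returns a default instead of raising when nothing matches, and returns an empty string rather than None when the matching element has no text.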
Practically, you deal either with
- specifically structured data - which you can often access key-value style
- free-form documents, which is messier, but "fetch all text fragments" usually goes a long way
Most structured data (e.g. XML config files, metadata) will often be designed with very selective nesting. Say, in
<meta><title>I am title</title><etc>and so on</etc></meta>
if you know the node you are interested in, and that there is no nesting at all, then you know that .text is all the contents you care about, that .tail is empty (or just whitespace, if it's pretty-printed to show indentation), and that code like the following does all you want:
    retdict = {}
    for node in meta:   # meta is the parsed <meta> element
        retdict[node.tag] = node.text
Free-form things, like HTML markup, like to intersperse things and are messier. Consider:
    >>> nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>')
    # which is mostly text and nodes mixed under root level, except for e

    >>> nodes.text     # <a>'s initial text (a is the root node)
    'this'

    >>> [ (el.tag,el.tail) for el in nodes ]    # tail is the text after the node, if any
    [('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]

    >>> [ (el.tag,el.text,el.tail) for el in nodes ]    # .text is the (first) text in the node
    [('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]

    >>> all_text_fragments(nodes)   # see helper function below
    ['this', 'and', 'that', ' ', 'foo']
For documents, the following may go a long way:
    def all_text_fragments(under):
        ''' Returns all fragments of text contained in a subtree, as a list of strings.
            Keep in mind that in pretty-printed XML, many fragments are only spaces and newlines.
            You might extend this, e.g. with specific tag names to ignore the contents of.
        '''
        r = []
        for e in under.iter():   # walks the subtree (iter() was called getiterator() in older versions)
            if e.text is not None:
                r.append( e.text )
            if e.tail is not None:
                r.append( e.tail )
        return r
Namespaces
In searches
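The element's tag string includes the namespace URI, in {uri}localname form (sometimes called Clark notation).
As such, you can search with:
    root.findall('{http://www.w3.org/2002/07/owl#}Class')
    root.findall(".//{http://purl.org/dc/elements/1.1/}title")
Note that lxml's xpath() does not take this notation; it wants prefixes, plus a mapping from prefix to URI.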
The find functions long had no "find in any namespace" option,
though you could always do it yourself (explicitly matching each Element's tag string). Since Python 3.8, ElementTree's find functions accept {*}name to match name in any namespace.
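Doing it yourself could look like the following sketch. The find_local name and the example.com namespace URIs are made up for illustration.

```python
import xml.etree.ElementTree as ET

def find_local(root, localname):
    """Return all elements at or under root whose tag, ignoring any namespace, is localname."""
    found = []
    for el in root.iter():
        tag = el.tag
        if not isinstance(tag, str):   # lxml uses non-string tags for comments and such
            continue
        # '{uri}name' -> 'name'; tags without a namespace are left as they are
        if tag.rsplit('}', 1)[-1] == localname:
            found.append(el)
    return found

doc = ET.fromstring(
    '<root xmlns:a="http://example.com/a" xmlns:b="http://example.com/b">'
    '<a:title>one</a:title><b:title>two</b:title><title>three</title>'
    '</root>')
titles = [el.text for el in find_local(doc, 'title')]
print(titles)
```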
If you want to use a prefix like
root.findall('owl:Class')
...then read the next section
Prefixes
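The find functions take a second argument that maps prefixes to namespace URIs. A minimal sketch with the standard library's ElementTree follows; the document is made up, and it deliberately uses a different prefix than the search does, since only the URI has to match.

```python
import xml.etree.ElementTree as ET

# prefix -> namespace URI, as used in the search paths only
NS = {'owl': 'http://www.w3.org/2002/07/owl#'}

root = ET.fromstring(
    '<ontology xmlns:o="http://www.w3.org/2002/07/owl#">'
    '<o:Class/><o:Class/><Class/>'
    '</ontology>')

classes = root.findall('owl:Class', NS)
print([el.tag for el in classes])
```

The un-namespaced Class element is not matched, and the tags you get back are still in {uri}name form.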
Streaming
If you have a huge XML file, you can load it into memory, but you may run into RAM problems.
The most minimal way to deal with one is a SAX-style parser, one that does little more than start and end tags,
but it would be nothing more than that, so you would still need to remember enough of a record for your purpose (or basically build up that record),
which is manual work (and single purpose).
Very large XML files are often data dumps: lots of small and independent records in a row, of which you only care about one at a time. In that case RAM use shouldn't need to be much more than one record's worth at a time.
So wouldn't it be a nice tradeoff to get to hold one record in memory, and remove it when you're done?
In etree, you can do basically that with iterparse.
There is a good explanation at https://stackoverflow.com/questions/9809469/python-sax-to-lxml-for-80gb-xml/9814580#9814580 but the slightly shorter version is:
- iterparse will build the same tree that generic ET will, but does so incrementally
- so it returns control to you regularly (the parse will in fact be slower)
- so at the end of what you know is a full record, you can deal with that record, and then clear() elements from it immediately after
- clear() still leaves empty elements which still take a handful of bytes, but that barely adds up until it's many millions of records. Apparently you can del the previous nodes to save a little more
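A minimal sketch of that pattern with the standard library's iterparse. It assumes flat records; the iter_records name, the record tag, and the sample data are made up.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, record_tag):
    """Yield each record as a dict of child tag -> text, freeing it once handled."""
    # 'end' events fire when an element's closing tag has been read,
    # so the record's children are complete by then.
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == record_tag:
            yield dict((child.tag, child.text) for child in elem)
            elem.clear()   # drop the record's contents; an empty element remains in the tree

dump = io.BytesIO(
    b'<dump>'
    b'<record><id>1</id><name>first</name></record>'
    b'<record><id>2</id><name>second</name></record>'
    b'</dump>')
records = list(iter_records(dump, 'record'))
print(records)
```

Copying the data out before clear() matters: anything still pointing at the element afterwards will see it empty.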
Semi-sorted
internal error: Huge input lookup
On unicode and bytes
Pretty-printer
Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented.
Meant only for debug, as this will implicitly change the stored data (and is not the most efficient). Python 3.9 and later also have ET.indent(), which does much the same.
    import xml.etree.ElementTree as ET

    def indent_inplace(elem, level=0, whitespacestrip=False):
        ''' Alters the text nodes so that the tostring()ed version will look nicely indented.

            whitespacestrip can make contents that contain a lot of newlines look cleaner,
            but changes the stored data even more.
        '''
        i = "\n" + level*"  "   # two spaces per level

        if whitespacestrip:
            if elem.text:
                elem.text = elem.text.strip()
            if elem.tail:
                elem.tail = elem.tail.strip()

        if len(elem):
            if not elem.text or not elem.text.strip():
                elem.text = i + "  "
            if not elem.tail or not elem.tail.strip():
                elem.tail = i
            for child in elem:
                indent_inplace(child, level+1, whitespacestrip)
            # the last child's tail sits right before our closing tag, so it gets our own indent
            if not child.tail or not child.tail.strip():
                child.tail = i
        else:
            if level and (not elem.tail or not elem.tail.strip()):
                elem.tail = i

    def prettyprint(xml, whitespacestrip=True):
        ''' Convenience wrapper around indent_inplace():
            - Takes a string (parses it) or a ET structure,
            - alters it to reindent according to depth,
            - returns the result as a string.

            whitespacestrip: see note in indent_inplace()

            Not horribly efficient, and alters the structure you gave it,
            but you are only using this for debug, riiight?
        '''
        if isinstance(xml, (str, bytes)):
            xml = ET.fromstring(xml)
        indent_inplace(xml, whitespacestrip=whitespacestrip)
        # encoding='unicode' makes tostring() return str rather than bytes (python 3)
        return ET.tostring(xml, encoding='unicode').rstrip('\n')
Namespace stripper
I use this to simplify searching and fetching whenever I know my use case has no actual conflicts.
Not thoroughly tested. The core of this was taken from elsewhere.
    def strip_namespace_inplace(etree, namespace=None, remove_from_attr=True):
        """ Takes a parsed ET structure and does an in-place removal of namespaces.
            By default removes all namespaces, optionally just a specific namespace (by its URL).

            Can make node searches simpler in structures with unpredictable namespaces
            and in content known not to be ambiguously mixed.

            By default does so for node names as well as attribute names.
            (doesn't remove the namespace definitions, but apparently
             ElementTree serialization omits any that are unused)

            Note that for attributes that are unique only because of namespace,
            this will cause attributes to be overwritten.
            For example: <e p:at="bar" at="quu">   would become: <e at="bar">
            I don't think I've seen any XML where this matters, though.
        """
        if namespace is None:   # all namespaces
            def strip(name):
                if name.startswith('{'):
                    return name[name.index('}', 1) + 1:]
                return name
        else:                   # asked to remove a specific namespace
            ns = '{%s}' % namespace
            def strip(name):
                if name.startswith(ns):
                    return name[len(ns):]
                return name

        for elem in etree.iter():   # iter() was called getiterator() in older versions
            if not isinstance(elem.tag, str):   # lxml: comments and such have non-string tags
                continue
            elem.tag = strip(elem.tag)
            if remove_from_attr:
                for attr_name in list(elem.attrib):   # copy of the names, since we alter attrib while looping
                    stripped = strip(attr_name)
                    if stripped != attr_name:
                        elem.attrib[stripped] = elem.attrib.pop(attr_name)
UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'
...or some other non-BMP Unicode character (so \U rather than \u).
Not necessarily a to-console problem as you might assume.
This probably comes from ET's tostring(), which uses 'ascii' encoding by default, which usually works fine because it rewrites Unicode into numeric entities.
However, that conversion plays safe and only supports writing U+0080 to U+FFFF in numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).
The way around it is to ask for an encoding that can represent these characters directly, e.g. tostring(elem, encoding='utf-8').
Note the dash in 'utf-8'. This is a value for ET, not the python codec name ('utf8'/'u8'). The dash in there is important as this string is also dumped into the XML header (or, apparently, omitted if it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.
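A small sketch of asking for UTF-8 output, so that such a character is written as plain encoded bytes rather than as a numeric entity. This uses the standard library's ElementTree; the element name is made up.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('sym')
elem.text = u'\U0001d516'   # a character outside the BMP

# The character ends up in the output as UTF-8 bytes, not as an entity.
as_utf8 = ET.tostring(elem, encoding='utf-8')
print(as_utf8)
```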
minidom
pulldom
Input encoding safety
The following code was made to parse XML that should be utf8, but may well contain literally dumped Latin1. It originally used chardet to try to be even smarter than that.
ET makes things slightly more interesting: as it is made for real-world data, it doesn't take unicode strings as input, it wants encoded bytes, such as utf8.
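As a rough sketch of the idea only (this is not the original code, and leaves out the chardet step): try UTF-8 first, fall back to Latin1 when that fails, and hand ET clean UTF-8 bytes. It assumes Python 3, input given as bytes, and no XML header that declares some other encoding. The parse_probably_utf8 name is made up.

```python
import xml.etree.ElementTree as ET

def parse_probably_utf8(data):
    """Parse bytes that should be UTF-8 but may actually be Latin1."""
    try:
        text = data.decode('utf-8')
    except UnicodeDecodeError:
        # Latin1 maps every byte to a character, so this cannot fail,
        # though it can produce nonsense if the data was something else entirely.
        text = data.decode('latin-1')
    # Hand ET bytes again, now known to be valid UTF-8.
    return ET.fromstring(text.encode('utf-8'))

print(parse_probably_utf8(b'<a>caf\xc3\xa9</a>').text)   # was valid UTF-8
print(parse_probably_utf8(b'<a>caf\xe9</a>').text)       # was Latin1
```

Note that this falls back for the document as a whole. Data that mixes both encodings within one document would decode without error on the Latin1 path, but with its UTF-8 parts garbled.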
rewriting