ElementTree / lxml scraping





This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

lxml is a python library that wraps the libxml2 library (also libxslt) and gives a good balance between the speed of those C libraries and a nice interface on the python side.


It also implements the ElementTree interface.

For ElementTree in a broader sense, see Python_notes_-_XML#ElementTree
For extracting/scraping data the ElementTree way, you may specifically want lxml, because it has a few additions that can be really useful, such as

  • the ability to navigate to the parent
  • the node.xpath() method[1], which sometimes saves a lot of time

lxml should e.g. prove faster than BeautifulSoup, and not using both bs4 and etree can avoid some confusion ("was it findall() or find_all() or maybe findAll() after all?"). That said, each has its strengths, e.g. BS seems a little more convenient (less typing) for getting out text and dealing with less-structured things.


Navigation and searching



Direct children only

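  • getchildren() all direct children (no filter). Deprecated: it was removed from the standard library's ElementTree in Python 3.9 (lxml still has it), so prefer list(tag) or iterating over the tag directly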


Relative path to current tag

  • tag.find(path) finds the first matching element (by tag name, or by path)
e.g. records = root.find('record_list')
  • tag.findall(path) all matching elements by path, as a list (which you can also use for all matching children by tag)
e.g. records.findall('meta/location')
  • tag.iterfind(path) - like findall(), but emits while walking, rather than collecting before returning


  • tag.xpath(path) (only in lxml, not ElementTree)
like findall(), but evaluates real XPath 1.0, where the find*() functions only implement a subset of it (so it adds a few features, like navigating to parent)
note: often a little slower than find*() [2]


Anywhere under, disregarding structure

  • iter(tag=None) does a recursive treewalk
returning all elements (when tag=None)
...or filtered by tag name(s) you hand along



In my experience,

find() and findall()/iterfind() tend to be simplest when picking data out of a known, fixed structure,
and avoid some possible confusion when similarly named nodes are nested under others
iter() is useful to pick out items when you don't care about context
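
As a rough illustration of that difference, here is a minimal sketch using the standard library's ElementTree (lxml.etree should accept the same calls). The sample document and its tag names are made up for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample: <name> appears directly under a record, and also nested deeper
root = ET.fromstring(
    '<db><record_list>'
    '<record><name>first</name><meta><name>inner</name></meta></record>'
    '<record><name>second</name></record>'
    '</record_list></db>'
)

records = root.find('record_list')   # first match only (None if absent)

# Path-based: only <name> elements that sit directly under a <record>
direct_names = [el.text for el in records.findall('record/name')]

# Same path, but yielded while walking instead of collected into a list first
lazy_names = [el.text for el in records.iterfind('record/name')]

# Treewalk: every <name> anywhere below root, whatever it is nested in
all_names = [el.text for el in root.iter('name')]

print(direct_names)  # ['first', 'second']
print(all_names)     # ['first', 'inner', 'second']
```

The path-based calls skip the nested <name>, while iter() picks it up as well.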

Fetching out data

Much of the time you'll be happy with

.tag for the node's name


Attributes:

  • get(name,default=None) gets an attribute value (or the default value if the attribute didn't exist)
  • set(name,val) - useful when you want to serialize it again
  • keys() returns the names of the present attributes
  • items() returns attributes, in an arbitrarily ordered list of (name,value) tuples.
  • attrib is a dict with attributes. You can alter this, though note that you should not assign a new dict object - if you want new contents, do a clear() and update()
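
A small sketch of the attribute functions above, on a made-up element (standard library ElementTree, though lxml should behave the same for these calls):

```python
import xml.etree.ElementTree as ET

# Hypothetical element, just to have some attributes to play with
el = ET.fromstring('<presence type="cached" lang="en">fulltext</presence>')

present  = el.get('type')             # 'cached'
absent   = el.get('missing')          # None, because the attribute is not there
fallback = el.get('missing', 'n/a')   # 'n/a'

el.set('checked', 'yes')              # adds (or overwrites) an attribute
names = sorted(el.keys())             # ['checked', 'lang', 'type']

# New contents for attrib: alter the existing dict rather than assigning a new one
el.attrib.clear()
el.attrib.update({'type': 'live'})
pairs = list(el.items())              # [('type', 'live')]

print(ET.tostring(el, encoding='unicode'))   # <presence type="live">fulltext</presence>
```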


Text:

  • findtext(path,default=None) returns the text contents of the first matching element (or the default value, if nothing matched)
  • .text see below (may be None)
  • .tail see below (may be None)



Fetching text from this data model

Unlike the DOM-like model, where text is a first-class citizen but you have to do a lot more typing, ElementTree tries to focus on tags, which means it sticks text onto the nearest tag object.


A little more precisely:

  • element.text
Can be None (if there is no text)
If node content is only text, this will happen to be all the text
If there's a mix of text and nodes, it's only the text before the first contained element.
If you want all text in a flatten-all-text-away way, see notes below.
  • element.tail is the text between a node and its next sibling
Can be None.


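There is also findtext(path), which is mostly equivalent to find(path).text (it returns the text content of the first matching element). The differences: when nothing matches it returns the default instead of failing on None.text, and when the matching element has no text it returns an empty string rather than None.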
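
To make the comparison between find(path).text and findtext(path) concrete, a minimal sketch with the standard library's ElementTree; the <empty/> element is made up to show the no-text case:

```python
import xml.etree.ElementTree as ET

meta = ET.fromstring('<meta><title>I am title</title><empty/></meta>')

# Element present, with text: both spellings agree
via_find = meta.find('title').text
via_findtext = meta.findtext('title')

# Element missing: find() returns None (so .text on it would raise AttributeError),
# while findtext() returns its default
missing = meta.findtext('nope')
missing_with_default = meta.findtext('nope', '')

# Element present but without text: .text is None, findtext() gives an empty string
empty_via_find = meta.find('empty').text
empty_via_findtext = meta.findtext('empty')

print(repr(via_find), repr(via_findtext))
print(repr(missing), repr(missing_with_default))
print(repr(empty_via_find), repr(empty_via_findtext))
```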



Practically, you deal either with

  • specifically structured data - which you can often access key-value style
  • free-form documents, which is messier, but "fetch all text fragments" usually goes a long way


Most structured data (e.g. XML config files, metadata) will often be designed with very selective nesting. Say, in

<meta><title>I am title</title><etc>and so on</etc></meta>

if you know the node you are interested in, and that there is no nesting at all, then you know that .text is all the contents you care about, that .tail is empty (or just whitespace, if it's pretty-printed to show indentation), and that code like the following does all you want:

retdict = {}
for node in meta:   # meta is the <meta> element; iterating gives its direct children
   retdict[node.tag] = node.text 


Free-form things, like HTML markup, like to intersperse things and are messier. Consider:

>>> import xml.etree.ElementTree as ET   # lxml.etree behaves the same here
>>> nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>') # which is mostly text and nodes mixed under root level, except for e
>>> nodes.text     # <a>'s initial text   (a is the root node)
'this'

>>> [ (el.tag,el.tail)  for el in nodes ]           # tail is the text after the node, if any
[('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]

>>> [ (el.tag,el.text,el.tail)  for el in nodes ]   # .text is the (first) text in the node
[('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]

>>> all_text_fragments(nodes)    # see helper function below
['this', 'and', 'that', ' ', 'foo']


For documents, the following may go a long way:

def all_text_fragments(under):
    ''' Returns all fragments of text contained in a subtree, as a list of strings. 
        Keep in mind that in pretty-printed XML, many fragments are only spaces and newlines.
        You might extend this, e.g. with specific tag names to ignore the contents of.
    '''
    r = []
    for e in under.iter(): # walks the subtree, starting with 'under' itself
        if e.text is not None:
            r.append( e.text )
        # the tail of 'under' itself lies outside the subtree, so skip that one
        if e is not under and e.tail is not None:
            r.append( e.tail )
    return r
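
Both ElementTree and lxml also have a built-in itertext(), which yields the text fragments of a subtree in document order. A minimal sketch of using it to flatten a node to one string; flat_text is a hypothetical helper name, and collapsing whitespace is just one choice you might make:

```python
import xml.etree.ElementTree as ET

def flat_text(under):
    ''' Joins all text in a subtree into one string, collapsing runs of whitespace.
        (hypothetical helper, not part of either library)
    '''
    return ' '.join(''.join(under.itertext()).split())

nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>')

fragments = list(nodes.itertext())
print(fragments)         # ['this', 'and', 'that', ' ', 'foo']
print(flat_text(nodes))  # thisandthat foo
```

Note that fragments are joined without separators, so words on either side of an empty tag run together; whether that is what you want depends on the markup.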

Namespaces

In searches

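The element tag includes the namespace URI, in the {uri}localname form (Clark notation).

As such, you can

root.findall('{http://www.w3.org/2002/07/owl#}Class')
root.findall(".//{http://purl.org/dc/elements/1.1/}title")

Note that lxml's xpath() does not accept this {uri} form; it wants prefixes plus a namespaces mapping.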


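A "find in any namespace" wildcard on the existing find functions, written like '{*}Class', only exists since Python 3.8 (recent lxml versions seem to accept it as well). On older versions you could always do it yourself (explicitly matching on the Element's tag string).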
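
A minimal sketch of the do-it-yourself variant, matching on the tag string with the namespace part stripped. The localname helper and the sample document are made up for illustration.

```python
import xml.etree.ElementTree as ET

def localname(el):
    ''' Tag name without the {namespace} part (hypothetical helper). '''
    tag = el.tag
    # in lxml, comments and processing instructions have a non-string tag
    if isinstance(tag, str) and tag.startswith('{'):
        return tag.split('}', 1)[1]
    return tag

doc = ET.fromstring(
    '<root xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:x="http://example.com/x">'
    '<dc:title>one</dc:title><x:title>two</x:title><title>three</title>'
    '</root>'
)

# Walk everything, keep whatever is called 'title' in any (or no) namespace
titles = [el.text for el in doc.iter() if localname(el) == 'title']
print(titles)  # ['one', 'two', 'three']
```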


If you want to use a prefix like

root.findall('owl:Class')

...then read the next section

Prefixes
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
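
A minimal sketch of prefixed searches: the find functions take a prefix-to-URI mapping as an extra argument. The sample document is made up. The ns dict is our own choice of prefixes and does not have to match the prefixes used in the document itself.

```python
import xml.etree.ElementTree as ET

# Hypothetical, tiny OWL-like document
owl_doc = ET.fromstring(
    '<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"'
    ' xmlns:owl="http://www.w3.org/2002/07/owl#">'
    '<owl:Class rdf:about="#A"/><owl:Class rdf:about="#B"/>'
    '</rdf:RDF>'
)

# prefix -> namespace URI, as used in the search paths below
ns = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
}

classes = owl_doc.findall('owl:Class', ns)

# Attribute names are not paths, so those still use the {uri}name form
abouts = [c.get('{%s}about' % ns['rdf']) for c in classes]
print(abouts)  # ['#A', '#B']
```

In lxml, xpath() takes the same kind of mapping as a keyword argument, e.g. root.xpath('//owl:Class', namespaces=ns).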


Streaming

If you have a huge XML file, you can load it into memory, but you may run into RAM problems.


The most minimal way to deal with one is a SAX-style parser, which does little more than report start and end tags as it goes. It gives you nothing beyond that, so you would still need to remember enough of a record for your purpose (or basically build up that record yourself), which is manual work (and single purpose).


Very large XML files are often data dumps: lots of small and independent records in a row, where you only care about one at a time. In that case, RAM use shouldn't need to be much more than one record's worth at a time.

So wouldn't it be a nice tradeoff to get to hold one record in memory, and remove it when you're done?


In etree, you can do basically that with iterparse.

There is a good explanation at https://stackoverflow.com/questions/9809469/python-sax-to-lxml-for-80gb-xml/9814580#9814580 but the slightly shorter version is:

  • iterparse will build the same tree that generic ET will, but does so incrementally
so it returns control to you regularly (the parse will in fact be slower)
  • so at the end of what you know is a full record, you can deal with that record, and then clear() elements from it immediately after
clear() still leaves empty elements which still take a handful of bytes, but that barely adds up until it's many millions of records. Apparently you can del the previous nodes to save a little more
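
The points above can be sketched roughly as follows, using the standard library's iterparse (lxml.etree.iterparse is used the same way for this). The in-memory source stands in for a huge file, and the <record> tag name and the iter_records helper are assumptions for the example.

```python
import io
import xml.etree.ElementTree as ET

# Stand-in for a huge file on disk; iterparse also takes a filename
source = io.BytesIO(
    b'<dump>'
    b'<record><id>1</id><name>a</name></record>'
    b'<record><id>2</id><name>b</name></record>'
    b'</dump>'
)

def iter_records(source, tag='record'):
    ''' Yields one dict per record element, and clears each element once it is consumed.
        (hypothetical helper; assumes flat records of simple child tags)
    '''
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:      # an 'end' event means the whole record has been parsed
            yield {child.tag: child.text for child in elem}
            elem.clear()         # drops the record's children, text and attributes

records = list(iter_records(source))
print(records)  # [{'id': '1', 'name': 'a'}, {'id': '2', 'name': 'b'}]
```

In real use you would handle each record inside the loop rather than collecting them all with list(), since collecting them defeats the purpose.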




Examples

Parsing data

For data structured enough to always have the same thing at the same place...


xmldata='''
<subscriptions>
  <success/>
  <entry>
    <request>
      <issn>1234-5678</issn>
      <year>1984</year>
    </request>
    <presence>fulltext</presence>
  </entry>
  <entry>
    <request>
      <issn>0041-9999</issn>
      <year>1984</year>
    </request>
    <presence type="cached">fulltext</presence>
  </entry>
</subscriptions>'''

Code with comments:

import xml.etree.ElementTree as ET  # or: from lxml import etree as ET

subscriptions = ET.fromstring(xmldata)  # parses the whole, returns the root element

error = subscriptions.find('error')  # check whether there is an <error>with message</error> instead of <success/>
if error is not None:
    raise Exception( "Response reports error: %s"%(error.text) )  # .text gets the node's direct text content

for entry in subscriptions.findall('entry'):  #find all direct-child 'entry' elements
    issn = entry.find('request/issn').text     # note: if this node doesn't exist, this would error out (because None.text doesn't make sense)
    year = entry.findtext('request/year','')   # another way of fetching text - one that deals better with absence
    
    # Using xpath-style paths like above can be handy, though when we want to fetch multiple details
    # it's often handier to get element references first
    presence = entry.find('presence')
    prestype = presence.get('type')  # attribute fetch. Using get() means we get None if missing
    prestext = presence.text
    
    print('%s in %s: %s (%s)'%(issn,year,prestext,prestype))

That code prints:

1234-5678 in 1984: fulltext (None)
0041-9999 in 1984: fulltext (cached)

Notes:

  • The functions on an element only search downward, which means you tend to use it as a deeper-only tree
  • If, in the above example, you don't always have issn under request, the python would sometimes try to fetch None.text which would be a (python-side) mistake.
...which is why the year fetch is an example of a more robust variant - for text. If the node is not there, the default (itself defaulting to None) is returned, here altered to an empty string.
  • Your choice to use find(), findall(), getchildren(), iter() varies with case and taste
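  • ET does character decoding to unicode. Given bytes, the parser follows the encoding named in the XML declaration, and assumes UTF8 when there is none. If you want to be robust to mislabeled or otherwise broken encodings, handle the string before handing it to ET.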
    • Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
    • (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
  • there are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the number of children (can be handy for some logic), list() to get a plain list of the children (though using generators is a good habit), and more.
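
A short sketch of that sequence behaviour, on a made-up element (standard library ElementTree):

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring('<entry><request/><presence/><presence/></entry>')

child_count = len(entry)                  # number of direct children
first_tag   = entry[0].tag                # indexing works like on a list
later_tags  = [c.tag for c in entry[1:]]  # so does slicing
all_tags    = [c.tag for c in entry]      # iterating gives the direct children

# An element without children has length 0, so don't use an element's truth value
# to test whether find() found something; compare against None instead.
leaf = entry.find('request')
found = leaf is not None

print(child_count, first_tag, later_tags, all_tags, found, len(leaf))
```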