Python notes - XML and HTML
Intro
DOM, SAX, or other
DOM is a heavy model because it is heavily decorated, and requires all data to be loaded.
Many XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.
When you don't need all of that, you can use SAX-style parsing,
which remembers almost nothing,
it just mentions the details of tags as they pass the parser.
Whenever the XML is fairly flat, and/or you want to fetch fairly little from a large file, this can take significantly less RAM (and CPU).
It may involve a little more coding, but may be great to know when given many-gigabyte XML files.
Note that there are tricks in lxml that allow something similar,
which is nice to know when you already know that API.
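As a rough illustration of the SAX style described above, here is a minimal sketch using the standard library's xml.sax. The handler class name, the element names and the sample data are all made up for the example; the point is only that you keep the little state you need yourself, and nothing else is remembered.

```python
import xml.sax

class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of every <title> element; everything else passes by unremembered."""
    def __init__(self):
        super().__init__()
        self.in_title = False
        self.buffer = []
        self.titles = []

    def startElement(self, name, attrs):
        if name == 'title':
            self.in_title = True
            self.buffer = []

    def characters(self, content):
        # text may arrive in several chunks, so collect until the end tag
        if self.in_title:
            self.buffer.append(content)

    def endElement(self, name):
        if name == 'title':
            self.titles.append(''.join(self.buffer))
            self.in_title = False

# hypothetical sample data
data = (b'<library><book><title>First</title><pages>100</pages></book>'
        b'<book><title>Second</title></book></library>')
handler = TitleCollector()
xml.sax.parseString(data, handler)
print(handler.titles)
```

For a real many-gigabyte file you would hand xml.sax.parse() a filename or file object instead of a byte string.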
parse and/or generate
ElementTree intro
ElementTree, as an API, can be seen as a pragmatic midway between the spartan SAX approach and the somewhat excessive DOM-style approach.
And potentially less verbose than both of those.
Where DOM keeps elements, text, and attributes in their own objects, ET smushes those onto one object:
- the element (name, namespace)
- the attributes on this element
- the text directly within, and directly after (this one's a little funny, though)
This makes some navigation code considerably shorter.
Roughly speaking, the easier it is to describe the XML in words, the easier it is to write ET code for.
That said,
- if the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with anyway, and you gain nothing on that particular data.
- you have to sit down once to think about how it handles text.
- For data storage like variable-and-value lists it's actively more practical
- for free-form documents it gets a lot more interesting
lxml intro
lxml is a python library that wraps the libxml2 library (and libxslt), and gives a good balance between the speed of those C libraries and a capable, pleasant interface on the python side.
It has a nicer API than those libraries' basic bindings, so if you use those lower-level things from a higher-level language like Python, lxml may make you happier.
The python lxml library also adds an ElementTree-like API.
ET-like in that lxml also adds a few things that ET does not have.
When using any fancier features:
While lxml is a library you have to install, a lot of it has been around for a long time;
python's own etree is still changing, so fancier features might affect the python versions you can support,
and some support still seems different.
lxml has a few additions that can be really useful, such as
- the ability to navigate to the parent.
- lxml's xpath()[1] has been more capable for many years than python's recent xpath support is today(verify).
I have found lxml a little more productive when doing scraping.
lxml should e.g. prove faster than BeautifulSoup. Note that bs4 and etree are similar enough to be confusing ("was it findall() or find_all() or maybe findAll() after all?"), so while each has their strengths, you may want to use just one when that's suitable.
ET and lxml - Lineage, variants, and importing
Roughly speaking:
lxml
- is a python binding for the libxml C library.
- not standard library, but installed on many systems
- the libxml wrapping itself is more minimal, but it also adds an etree-like interface, lxml.etree, which is actually more capable than (c)ElementTree(verify)
ElementTree
- is the original of this API (lxml.etree imitates it, not the other way around)
- and adds python code for some extra functionality
- in the standard library since Python 2.5, also present in 3
cElementTree
- (a more efficient) C implementation of ElementTree
- Mostly the same API as ElementTree (a few functions are missing?(verify)), but omits some of that added python code
- so is actually more similar to lxml's etree interface again
- cElementTree has not been very relevant since py3.3, where xml.etree.ElementTree itself defaults to something very similar[2]; the xml.etree.cElementTree module was removed entirely in Python 3.9
As such
- there is a subset of functions shared by all three
- you may wish to choose just one (I switched to lxml-only at some point to lessen my confusion)
- and if you want code that can deal with objects from the others, duck typing mostly solves this, but there are a few footnotes (e.g. lxml.etree has comments and processing instructions, ET just ignores them while parsing)
- There used to be more difference than there is now
- cET was for speed, but that mostly got added to python
- lxml was for speed+features, but some of them are now getting added to python
- there are details that differ between the three, sometimes subtle, sometimes fundamental
- read e.g. https://lxml.de/1.3/compatibility.html
- when you deal with messy, possibly-incorrect HTML, take a look at BeautifulSoup (which since bs4 actually defaults to using lxml under the covers).
- When you deal with XHTML or well-formed HTML, there is further comparison in flexibility/speed.
In python2 it was
- interesting where to import ET from - xml.etree was introduced in 2.5, before then you'd import elementtree(verify)
- useful to fallback to other imports, to e.g. get ET's added python code mixed with cET's faster c code.
- This is less pressing in py3 (3.3) in that xml.etree.ElementTree uses a C implementation if available.
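The fallback-import idea can be sketched as follows. This is one possible arrangement, not a recommendation of a specific order: it prefers lxml when it is installed and otherwise uses the standard library, and then sticks to the subset of the API that both share. The HAVE_LXML flag is just an illustrative name.

```python
try:
    from lxml import etree as ET          # third-party, more features
    HAVE_LXML = True
except ImportError:
    import xml.etree.ElementTree as ET    # standard library (C-accelerated in py3.3+)
    HAVE_LXML = False

# Only use calls that exist in both implementations after this point
root = ET.fromstring(b'<a><b>text</b></a>')
first_b = root.find('b')
print(HAVE_LXML, first_b.tag, first_b.text)
```

Code that needs lxml-only features such as getparent() or xpath() can check the flag, or simply import lxml unconditionally.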
ElementTree ≈ etree
Given the above ElementTree and etree are
- sometimes used to refer approximately to the API,
- sometimes used to refer to specific implementations
On this page, both are mainly used to refer to the API.
mainly parsing
BeautifulSoup intro
BeautifulSoup is a Python module that reads in and parses HTML data, and has helpers to navigate and search the result.
It can deal with some common markup mistakes.
It is also fairly convenient about expressing how we want to search the parsed result.
Not all use is very fast - see #Performance
On this page I omit all the shorthand forms that I don't like, and mostly ignore pre-bs4 versions, because these variations are mostly confusing when all put side by side.
mainly generating
py.xml
One simple-looking method (codewise) is what e.g. py.xml does - consider:
import py.xml

class ns(py.xml.Namespace):
    "Convenience class for creating requests"

def find_request(session,command,sourceArray):
    find=ns.x_server_request(
        ns.find_request(
            ns.wait_flag("N"),
            ns.find_request_command( command ),
            # list comprehension used to create a variable amount of slightly nested nodes:
            list( ns.find_base(ns.find_base_001(sourceID))
                  for sourceID in sourceArray ),
            ns.session_id(session)
        )
    )
    findXMLdoc='<?xml version="1.0" encoding="UTF-8"?>%s'%(
        find.unicode(indent=0).encode('utf8')
    )
However, this is flawed in that it's easy to create invalid XML:
each of the parts is immediately converted to a string, meaning that while HTML encoding (of <>&) can be done automatically for attributes,
there is no way to do so for text nodes, as code can not tell the difference between text nodes and nested elements.
stan
Nevow's stan seems to have considered the just-mentioned problem, because stan is a document object model.
An example slightly adapted from its documentation:
from nevow import flat, tags as T
document = T.html[
    T.head[
        T.title["Hello, world!"]
    ],
    T.body[
        T.h1[ "This is a complete XHTML document modeled in Stan." ],
        T.p[ "This text is inside a paragraph tag." ],
        T.div(style="color: blue; width: 200px; background-color: yellow;")[
            "And this is a coloured div."
        ]
    ]
]
print( flat.flatten(document) )
You actually create a tree of flatten()able objects, and only that last call (flatten()) actually makes a string view of that document model. This also means you can make any of your objects stan-XML-serializable by giving it a flatten() function.
It can also make some ways of templating more convenient.
ElementTree's SimpleXMLWriter
The syntax that ElementTree uses for XML generation is not as brief as the above, but still usable.
And it may be handy to use one module for all XML things, just to keep your dependencies down.
The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers what tags are open, so can avoid some wrap-up calls.
It consists of:
- element(tag,attributes) (one-shot element, avoids separate start and end)
- start(tag, attributes)
- end() (current), or end(tag)
- close() (all currently open) or close(id) (many levels; takes the reference you got from start())
- data() (text nodes)
from elementtree.SimpleXMLWriter import XMLWriter
w = XMLWriter(req,encoding='utf-8') #without the encoding unicode become entities
response = w.start("html") #remember this so we can close() it
w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value=u"my \u2222 application 1.0")
w.end() # current open tag is head
w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")
w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(response) # implies the end() for body and html
For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and py.xml and stan are less likely to be installed, or easily installable, as libraries.
ElementTree by objects
This can be a little more tedious, but may be handy for XML-based protocols. For example:
import xml.etree.ElementTree as ET

def login(username,password):
    E,SE = ET.Element,ET.SubElement # for brevity
    server_request = E('server_request')
    login_request = SE(server_request,'login_request')
    SE(login_request,'user_name').text = username
    SE(login_request,'user_password').text = password
    return ET.tostring(server_request,'UTF-8')
SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The code above stores that reference for the first call (login_request), and for the other two only uses the returned element to set its .text.
It is unicode-capable. If you don't use the encoding keyword, you get an XML fragment that uses numeric entities for non-ASCII characters. If you do use it, it will encode the text and may prepend a <?xml?> header, under the condition that it recognizes the encoding name. Whether the header is written varies by version: current Python 3 omits it for UTF-8 and US-ASCII unless you pass xml_declaration=True. Since the encoding string you pass is copied into that header, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encoding names).
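To see the effect of the encoding argument, here is a small sketch against the standard library's ElementTree in Python 3. It assumes Python 3.8 or later, because that is where tostring() grew the xml_declaration argument; the element name and text are arbitrary.

```python
import xml.etree.ElementTree as ET

elem = ET.Element('name')
elem.text = 'caf\u00e9'   # contains one non-ASCII character

# No encoding given: bytes in US-ASCII, non-ASCII written as numeric entities
as_ascii = ET.tostring(elem)

# Explicit encoding, and explicitly asking for the <?xml ...?> header
as_utf8 = ET.tostring(elem, encoding='UTF-8', xml_declaration=True)

# The special value 'unicode' returns a str rather than bytes
as_text = ET.tostring(elem, encoding='unicode')

print(as_ascii)
print(as_utf8)
print(as_text)
```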
minidom, pulldom
https://docs.python.org/3/library/xml.dom.minidom.html
https://docs.python.org/3/library/xml.dom.pulldom.html
General notes
On unicode and bytes
On the XML side
Technically,
- The XML format should be considered a binary format that self-describes its character encoding
- (either explicitly, or falling back to UTF-8 by implication of the specs).
- XML 1.1 requires an XML declaration (the part that can carry encoding=)
- leaving it out is technically invalid, or rather:
- if it is missing, the specs say to treat the document as XML 1.0[3].
- XML 1.0 without an encoding declaration means the parser has to work out the encoding itself
- how it does that detection is not standardized (there are suggestions though)
The first point means that decoding is the parser's job, not yours.
And yes, that's a slight chicken-and-egg problem.
Each library has a slightly different take on it,
but generally, the parser expects bytes.
Parsers may or may not choose to accept already-decoded unicode (...in languages where unencoded unicode is an actual data type. Like Python.).
So you generally don't want to decode and then hand that in,
unless you have an explicitly good explanation of why you need to,
and why you want the job of also handling other cases just as correctly as the parser does.
(One good reason is that it guesses wrong, but that should only happen for some very unusual XML)
You also generally don't want to create XML without encoding declarations - it's considered bad practice, because you cannot guarantee every parser will get it right.
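A small sketch of "hand the parser bytes and let it read the declaration", using the standard library's ElementTree. The document here is made up, and deliberately not UTF-8, to show that the declared encoding is what gets used.

```python
import xml.etree.ElementTree as ET

# Bytes as they might come off disk or the network: Latin-1 encoded, and saying so
raw = '<?xml version="1.0" encoding="ISO-8859-1"?><name>caf\u00e9</name>'.encode('latin-1')

# The parser reads the declaration and does the decoding itself
root = ET.fromstring(raw)
print(repr(root.text))

# Decoding it yourself with a guessed encoding is where things go wrong
try:
    raw.decode('utf-8')
    guessed_ok = True
except UnicodeDecodeError:
    guessed_ok = False
print(guessed_ok)
```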
On the python side
There are a few different distinctions and questions here
- python2 and python3
- due to the different meanings of the str and bytes types
- lxml and ElementTree
- what you could read in
What you can feed in
- etree
- in py2 it would only consume bytes, though you could optionally tell it what encoding it's in.
- in python3 it can take either bytes or unicode.
The type of text that you pick out of the tree
In py3, both ElementTree and lxml.etree return str (unicode) for names and element text.
In py2, both ElementTree and lxml.etree return byte strings for plain ASCII text values (tag names, text in element, etc.), and unicode otherwise
- the argument seems to be less memory use, less decoding, and if combined with unicode they would be coerced as necessary anyway.
https://stackoverflow.com/questions/3418262/python-unicode-and-elementtree-parse
https://lxml.de/compatibility.html
...but people will do it wrong anyway
Libraries generating XML will do things correctly.
People are more fickle.
From memory, I remember:
- XML that says it's UTF8, but contains Latin1
- XML that says it's ASCII, but contains UTF8
The best fixes depend, in particular on how sure you are what that actual problem is,
because some duct tape makes things worse down the line.
etree and lxml notes
This API lets you do things in multiple ways,
- some functions take paths, others tag names,
- some functions return nodes, others contents, others are iterators
and it is easy to not know the easiest way to do a thing, or even how to choose a good subset and always work with that.
- getchildren()
- all direct children, no filter possible
- deprecated, and removed from xml.etree in Python 3.9; treating an element as a python iterable, e.g. list(elem), is equivalent
- returns list of element objects
- tag.find(path)
- first matching descendant by path (which easily doubles as first direct-child tagname)
- e.g. records = root.find('record_list')
- returns an element object
- tag.findall(path)
- all matching descendants by path
- e.g. records.findall('meta/location')
- return list of element objects
- iter(tag=None)
- does a treewalk, finds all descendants, optionally filtered by tag name (not path, so effectively disregards structure)
- can filter by tag name(s); the default tag=None returns all elements
- returns an iterator yielding elements
- tag.iterfind(path)
- matches like findall(), but emits while walking, rather than collecting before returning
- returns an iterator yielding elements
- find() and findall()/iterfind() tend to be simplest when picking data out of a known, fixed structure.
- and avoids some possible confusion when similarly named nodes are also found nested under others
- iter() is useful to pick out items when you don't care about context
- xpath() (see below) is useful when expressing complex things with syntax-fu, rather than a handful of lines of code
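To make the differences between these functions concrete, here is a short sketch using the standard library's ElementTree. The document and its element names are invented for the example (they borrow the record_list and meta/location names used in the list above).

```python
import xml.etree.ElementTree as ET

doc = '''<root>
  <record_list>
    <record><meta><location>Amsterdam</location></meta></record>
    <record><meta><location>Berlin</location></meta></record>
  </record_list>
  <location>elsewhere</location>
</root>'''
root = ET.fromstring(doc)

# find(): first match by path, here simply a direct child by name
records = root.find('record_list')

# findall(): all matches by path, structure matters
by_path = [el.text for el in records.findall('record/meta/location')]

# iter(): all descendants with that tag name, structure is ignored
by_name = [el.text for el in root.iter('location')]

# iterfind(): same matching as findall(), but lazy
first_lazy = next(records.iterfind('record/meta/location')).text

# iterating an element gives its direct children
children = list(records)

print(by_path, by_name, first_lazy, len(children))
```

Note how iter() also picks up the stray top-level location element, which is exactly the "similarly named nodes elsewhere" confusion that path-based find functions avoid.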
lxml only
- tag.getparent() (only in lxml, not ElementTree)
- tag.xpath(path) (only in lxml, not ElementTree)
- itersiblings -
- iterancestors -
- iterchildren -
- iterdescendants -
Fetching out data
name
.tag for the element's name
See also the namespaces section, because that's shoved into .tag when present.
attributes
- elem.get(name,default=None) gets an attribute value (or the default value if the attribute didn't exist)
- elem.set(name,val) - useful when you want to serialize it again
- elem.keys() returns the names of the present attributes
- elem.items() returns attributes, in an arbitrarily ordered list of (name,value) tuples.
- elem.attrib is a dict with attributes. You can alter this, though note that you should not assign a new dict object (if you want new contents, do a clear() and update())
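The attribute functions above, in a minimal sketch using the standard library's ElementTree. The element and attribute names are arbitrary.

```python
import xml.etree.ElementTree as ET

elem = ET.fromstring('<presence type="cached" lang="en">fulltext</presence>')

ptype = elem.get('type')                 # existing attribute
source = elem.get('source', 'unknown')   # missing attribute, so the default
names = sorted(elem.keys())              # attribute names (sorted here for a stable order)

elem.set('checked', 'yes')               # add or overwrite one attribute
has_checked = 'checked' in elem.attrib

# Replacing all attributes: alter the existing dict rather than assigning a new one
elem.attrib.clear()
elem.attrib.update({'type': 'live'})
serialized = ET.tostring(elem, encoding='unicode')

print(ptype, source, names, has_checked, serialized)
```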
Text
ElementTree's focus on elements, meaning it sticks text onto the nearest element object and in two different places on it, makes extracting text a little more adventurous.
More technically:
- elem.text is initial text (from the DOM view: the first node if it is text)
- If node content is only text, this will happen to be all the text
- Can be None (if there is no text inside)
- If there's a mix of text and nodes, it's only the text before the first contained element.
- If you want all text in a flatten-all-text-away way, see notes below.
- elem.tail is the text between a node and its next sibling
- Can be None (if there is no text between them)
- elem.findtext(path,default=None) returns the text contents of the first matching element (or the default value, if nothing matched; a match without any text gives '')
- roughly equivalent to find(path).text, but without the error when nothing matches
- elem.itertext() - basically an iterator over each .text and .tail in the subtree, in document order (you often want to "".join() that)
See also:
- https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.text
- https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.findtext
- https://docs.python.org/3/library/xml.etree.elementtree.html#xml.etree.ElementTree.Element.itertext
More practically
You often deal with either a well-structured data serialization (and you can pick out what you want), or a free-form document (and blanket 'all fragments under here' is what you want).
See also some of the scraping examples below.
When you have serialized data and know there are no sub-elements and you just want the direct text content
Consider:
<kv>
<key1>value1</key1>
<key2>value2</key2>
</kv>
- in this case, you know .text is enough (.tail will be empty or, if the XML is actually indented like that, just ignorable whitespace), and you might read that entire kv list with:
retdict = {}
for node in meta:   # meta being the element that contains the key/value nodes
    retdict[node.tag] = node.text
When you have serialized data and know exactly where you want a single text fragment from, but it's a little deeper
- see findtext()
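A small sketch of findtext() on slightly deeper data, using the standard library's ElementTree. The element names are invented; the three cases shown are a match with text, no match at all, and a match that has no text.

```python
import xml.etree.ElementTree as ET

entry = ET.fromstring(
    '<entry><request><issn>1234-5678</issn></request><note/></entry>'
)

issn = entry.findtext('request/issn')              # found, has text
year = entry.findtext('request/year')              # not found, default is None
year_or_empty = entry.findtext('request/year', '') # not found, our own default
note = entry.findtext('note')                      # found but empty: an empty string

print(repr(issn), repr(year), repr(year_or_empty), repr(note))
```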
For less-structured documents that can intersperse things anywhere (much like HTML), where you know you just want all subtree text
That is, if you want just the text nodes, removing all node structure
For example, running
list( element.itertext() )
on the example tree two sections back would give:
['\n ', 'value1', '\n ', 'value2', '\n']
Sometimes you want to smush that into a single string,
because you don't care about how each node contributed.
This comes with some footnotes. While the following
"".join(element.itertext())
...is least creative, it also means strings may get a little too smushed. As it is, it will give '\n value1\n value2\n' but if that example were not indented it would give 'value1value2', which
" ".join(element.itertext())
would solve -- but if it were actually HTML, this might insert some spaces where you don't want them. Say, in
Pointing out the difference between work<i>ed</i> and work<i>ing</i>
I would not want spaces added, but in
<div>First quote</div><div>Second quote</div>
I would, and yeaaah, deciding that is up to an HTML-specific parser following the HTML-specific standard.
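For data-like XML, one middle road between the two joins is to join without a separator and then normalize whitespace. A sketch using the standard library's ElementTree and the indented kv example from earlier; the helper name squeezed_text is made up here, and this does nothing to solve the HTML inline-versus-block question.

```python
import xml.etree.ElementTree as ET

kv = ET.fromstring('<kv>\n <key1>value1</key1>\n <key2>value2</key2>\n</kv>')

fragments = list(kv.itertext())

def squeezed_text(elem):
    """All text under elem, with each run of whitespace collapsed to one space."""
    return ' '.join(''.join(elem.itertext()).split())

print(fragments)
print(repr(squeezed_text(kv)))
```

This relies on whitespace already being present between the values, so it still gives 'value1value2' for the non-indented variant.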
Manually?
For reference on how .text and .tail works, consider:
>>> nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>') # which is mostly text and nodes mixed under root level, except for e
>>> nodes.text # <a>'s initial text (a is the root node)
'this'
>>> [ (el.tag,el.tail) for el in nodes ] # tail is the text after the node, if any
[('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]
>>> [ (el.tag,el.text,el.tail) for el in nodes ] # .text is the (first) text in the node
[('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]
>>> all_text_fragments(nodes) # see helper function below
['this', 'and', 'that', ' ', 'foo']
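The all_text_fragments() helper used in that session is not shown in these notes. The following is a guess at what such a helper could look like, written to show how .text and .tail combine; for this purpose it gives the same result as list(elem.itertext()).

```python
import xml.etree.ElementTree as ET

def all_text_fragments(elem):
    """Hypothetical helper: every non-empty .text and .tail under elem, in document order."""
    fragments = []
    if elem.text:
        fragments.append(elem.text)              # text before the first child
    for child in elem:
        fragments.extend(all_text_fragments(child))
        if child.tail:
            fragments.append(child.tail)         # text after this child, still inside elem
    return fragments

nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>')
print(all_text_fragments(nodes))
```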
Namespaces
In searches
The element tag has the namespace URI in it, in {uri}localname form (Clark notation).
As such, you can
root.findall('{http://www.w3.org/2002/07/owl#}Class')
root.findall(".//{http://purl.org/dc/elements/1.1/}title")
(lxml's xpath() does not take this notation; it wants prefixes plus a namespaces mapping.)
For a long time there was no "find in any namespace" on the existing find functions (Python 3.8 and later accept a {*}Class wildcard),
though you could always do it yourself (explicitly matching the Element's tag string).
If you want to use a prefix like
root.findall('owl:Class')
...then read the next section
Prefixes
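A minimal sketch of searching by prefix with the standard library's ElementTree: the find functions take a namespaces mapping from prefix to URI. The prefixes are whatever you choose in that mapping, they do not have to match the ones used in the document. The sample document is invented.

```python
import xml.etree.ElementTree as ET

doc = '''<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
                  xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="#Thing"/>
  <owl:Class rdf:about="#Other"/>
</rdf:RDF>'''
root = ET.fromstring(doc)

# Our own prefix-to-URI mapping, used only for searching
ns = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
}

classes = root.findall('owl:Class', ns)

# Tags and attribute names on the elements themselves stay in {uri}name form
first_tag = classes[0].tag
abouts = [c.get('{%s}about' % ns['rdf']) for c in classes]

print(len(classes), first_tag, abouts)
```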
Streaming
If you have a huge XML file, you can load it into memory, but you may run into RAM problems.
The most minimal way to deal with one is a SAX-style parser, one that does little more than start and end tags,
but it would be nothing more than that, so you would still need to remember enough of a record for your purpose (or basically build up that record),
which is manual work (and single purpose).
Very large XML files are often data dumps, which are often lots of small, independent records in a row, where you only care about one at a time, so RAM use need not be much more than one record's worth at a time.
So wouldn't it be a nice tradeoff to get to hold one record in memory, and remove it when you're done?
In etree, you can do basically that with iterparse.
There is a good explanation at https://stackoverflow.com/questions/9809469/python-sax-to-lxml-for-80gb-xml/9814580#9814580 but the slightly shorter version
- iterparse will build the same tree that generic ET will, but does so incrementally
- so it returns control to you regularly (the parse will in fact be slower)
- so at the end of what you know is a full record, you can deal with that record, and then clear() elements from it immediately after
- clear() still leaves empty elements which still take a handful of bytes, but that barely adds up until it's many millions of records. Apparently you can del the previous nodes to save a little more
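The points above can be sketched as follows with the standard library's iterparse. The record and value element names and the generated data are made up; a real use would pass a filename or an open file instead of the in-memory BytesIO.

```python
import io
import xml.etree.ElementTree as ET

def iter_records(source, tag='record'):
    """Yield (id, value) per record, clearing each record element once it has been used."""
    for event, elem in ET.iterparse(source, events=('end',)):
        if elem.tag == tag:            # 'end' means the whole record has been parsed
            yield elem.get('id'), elem.findtext('value')
            elem.clear()               # drop its children, text and attributes

# Small stand-in for a huge dump file
dump = '<dump>' + ''.join(
    '<record id="%d"><value>%d</value></record>' % (i, i * i) for i in range(5)
) + '</dump>'

results = list(iter_records(io.BytesIO(dump.encode('utf-8'))))
print(results)
```

The cleared, empty record elements stay attached to the root, which is the "handful of bytes per record" mentioned above.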
Examples
Parsing data
For data structured enough to always have the same thing at the same place...
xmldata='''
<subscriptions>
<success/>
<entry>
<request>
<issn>1234-5678</issn>
<year>1984</year>
</request>
<presence>fulltext</presence>
</entry>
<entry>
<request>
<issn>0041-9999</issn>
<year>1984</year>
</request>
<presence type="cached">fulltext</presence>
</entry>
</subscriptions>'''
Code with comments:
import xml.etree.ElementTree as ET

subscriptions = ET.fromstring(xmldata)  # parses the whole, returns the root element
error = subscriptions.find('error')     # check whether there is an <error>with message</error> instead of <success/>
if error is not None:
    raise Exception( "Response reports error: %s"%(error.text) )  # .text gets the node's direct text content
for entry in subscriptions.findall('entry'):  # find all direct-child 'entry' elements
    issn = entry.find('request/issn').text    # note: if this node doesn't exist, this would error out (because None.text doesn't make sense)
    year = entry.findtext('request/year','')  # another way of fetching text - one that deals better with absence
    # Using xpath-style paths like above can be handy, though when we want to fetch multiple details
    # it's often handier to get element references first
    presence = entry.find('presence')
    prestype = presence.get('type')  # attribute fetch. Using get() means we get None if missing
    prestext = presence.text
    print( '%s in %s: %s (%s)'%(issn,year,prestext,prestype) )
That code prints:
1234-5678 in 1984: fulltext (None)
0041-9999 in 1984: fulltext (cached)
Notes:
- The functions on an element mean you tend to use it as a deeper-only tree
- If, in the above example, you don't always have issn under request, the python would sometimes try to fetch None.text which would be a (python-side) mistake.
- ...which is why the year fetch is an example of a more robust variant, for text. If the node is missing, the default (itself defaulting to None) is returned, here altered to '' (an empty string)
- Your choice between find(), findall(), getchildren(), iter() varies with case and taste
- ET does character decoding to unicode, and seems to assume UTF8. If you want to be robust to other encodings, handle the string before handing it to ET. (verify)
- Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
- (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
- there are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the amount of children (can be handy for some logic), list() to get an anonymous list of the children (though using generators is a good habit), and more.
etree error notes
internal error: Huge input lookup
There are a few limits configured in libxml, which leads the parser to refuse rather than to handle some cases, such as:
- size of a single text node (10MByte)
- recursion limit with entities
- maximum depth of a document
This seems mostly to protect against handing in data that may be (or amount to) an algorithmic complexity attack / resource DoS.
You can lift these limits using XML_PARSE_HUGE / LIBXML_PARSEHUGE, though you should probably only do that if you trust all the XML you feed it.
In python's lxml the way to do that seems to be:
from lxml.etree import XMLParser, parse
p = XMLParser(huge_tree=True)
tree = parse('file.xml', parser=p)
Unicode strings with encoding declaration are not supported
Usually comes down to you having handed fromstring() an XML document as a unicode string instead of bytes, while it also still declares the encoding it is in.
It wants to
- see bytes and
- either
- read the encoding from the data
- (or have you override it via an XMLParser())
If you decode()d it, or had an open() decode it for you -- don't.
If you had no control over how it was handed to you, I guess you could .encode('utf8') it (which is only right if the declaration says UTF-8 as well).
UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'
...or some other non-BMP Unicode character (so \U rather than \u).
Not necessarily a to-console problem as you might assume.
This probably comes from ET's tostring(), which uses 'ascii' encoding by default, which usually works fine because it rewrites Unicode into numeric entities.
However, that conversion plays safe and only supports writing U+0080 to U+FFFF in numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).
One technically-more-standard way to work around that is to do: tostring(nodes,encoding='utf-8'), which tells it to encode unicode characters to UTF-8 bytes before dumping them into the document.
Note the dash in 'utf-8'. This is a value for ET, not the python codec name ('utf8'/'u8'). The dash in there is important as this string is also dumped into the XML header (or, apparently, omitted if it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.
BeautifulSoup notes
Basics
Firstly, there are multiple ways of filtering/fetching elements from a parsed tree.
You're probably best off deciding what single syntax you like, and ignoring all others. I dislike the short forms because they can clash, and raise more exceptions, which makes code more convoluted if you handle them properly.
A parse tree is made mainly of Tag and NavigableString objects, representing elements and text contents, respectively.
Example data used below:
<a>
 <b q="foo bar">
  1
  <c q="foo"/>
  <d>2</d>
  <c r="bar"/>
  <c/>
  3
 </b>
</a>
To play with that example:
import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', features='lxml')
Notes:
- the parser choice (the second argument, called features) has four options: [5]
- html.parser - works out of the box, not the fastest
- lxml - fast, external C dependency
- lxml-xml - focused on XML (plain lxml is aimed at HTML)
- html5lib - lenient, slow, external python dependency
- When you print Tag objects, it prints the entire subtree
- which is e.g. quite confusing when you're trying to step through the tree and printing its elements.
searching
You can
- walk through the Tag elements entirely manually
- but that is rarely useful, unless it was already data,
- or perhaps it works out as a good way to express some heavily contextual parsing.
- jump to specific parts you are interested in with find()/find_all() and friends, or possibly select(), then often still walk those interesting bits manually
select
this covers similar ground to find()/find_all()/friends, but as a note...
select() can frequently be more succinct, because it allows CSS selectors, letting you do things like:
soup.select("p > a:nth-of-type(2)")
soup.select("#link1,#link2")
soup.select("a[class~='externalLink']")
soup.select("div > ul[class*='browseList'] > li[class*='browseItem'] > a" )
SoupSieve
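The CSS selector library that implements select() in recent bs4 (4.7 and later); basically the more up-to-date version of the older built-in select()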
https://pypi.org/project/soupsieve/
find and friends
find() and its friends plus your code can be very direct in what you tell it to do, but is almost always more typing
- find() - finds the first match in the subtree
- find_all() - finds all matches in the subtree
- There are variations on the same idea, but with a specific direction or restriction:
- find_parent(), find_parents()
- find_next_sibling(), find_next_siblings()
- find_previous_sibling(), find_previous_siblings()
- find_next(), find_all_next()
- find_previous(), find_all_previous()
- You may never need more than a few, depending on how you like to think about searching in trees.
Finding things with specific properties - find functions take keyword arguments like
- name: match Tags by name. When you hand in a...
- string: exact name match
- list or tuple: exact match of any in list
- (compiled) regexp: regexp match
- function: use as arbitrary filter (should return True/False. Can of course be a lambda function)
- True: fetch all (often pointless; using only attrs implies this, and you can iterate over all children more directly)
- attrs: match Tags by attributes. When you hand in a...
- string: should match class, but different in older BeautifulSoup version(verify), so I avoid it
- dicts mapping string to...
- ...to a string: exact match
- ...to True: tags with this attribute present, e.g. soup.find_all(True,{'id':True})
- ...to a regexp: match attribute value by regexp, e.g. soup.find_all(True,{'class':re.compile(r'\bwikitable\b')}) (useful to properly match classes, since class attribute values are space-separated lists)
- string (called text in older versions): match NavigableStrings, by text content. Using this implies 'ignore name and attrs'. When you hand in a...
- string: exact match
- True: all strings
- regexp: regexp match
- recursive: (where relevant) these search recursively by default (recursive=True)
- You can change that, e.g. when you want to express e.g. "find all spans directly under this div", a non-recursive find_all might make sense
For example:
soup.find_all(['b','c']) # all b and c tags
soup.find_all( re.compile('[bc]') ) # all b and c tags
# anything with a q attribute at all:
soup.find_all(attrs={'q':True})
# anything with attribute q="foo"
soup.find_all(attrs={'q':'foo'})
#all divs with class set to tablewrapper (string equality)
soup.find_all('div', attrs={'class':'tablewrapper'})
#Anything with a class attribute that contains 'bar' (uses word-edge to be close enough to token-list matching, see https://dom.spec.whatwg.org/#interface-domtokenlist):
soup.find_all(attrs={'class':re.compile(r'\bbar\b')})
When nothing matches,
- find() would return None
- Which also means you can't really chain these, since that'll easily result in complaining you're trying to do something on None (specifically an AttributeError).
- find_all would return an empty list
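A short sketch of a few of these filters and of guarding against find() returning None. It assumes bs4 is installed and uses the html.parser backend so that nothing else is needed; the markup, class names and ids are made up.

```python
import re
import bs4

html = ('<div class="tablewrapper wide"><span id="x">one</span><span>two</span></div>'
        '<p>three</p>')
soup = bs4.BeautifulSoup(html, 'html.parser')

# anything with an id attribute at all
with_id = soup.find_all(attrs={'id': True})

# divs whose class list contains tablewrapper
wrappers = soup.find_all('div', attrs={'class': re.compile(r'\btablewrapper\b')})

# only spans directly under the div
div = soup.find('div')
direct_spans = div.find_all('span', recursive=False)

# find() gives None when nothing matches, so check before using the result
table = soup.find('table')
table_text = table.get_text() if table is not None else None

print(len(with_id), len(wrappers), len(direct_spans), table_text)
```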
There is quite a bit of extra decoration on Tag (and also NavigableString) objects.
Things you could keep in mind include:
- string: returns text child (NavigableString type)
- ...but only if a Tag contains exactly one of these.
- If there is more than that, this will yield None, even if the first element is a string.
- often, you instead want to use find(string=True) (next piece of text) or find_all(string=True) (all pieces of text), depending on what you know of the structure
- parent: selects single parent, a Tag object.
- contents: selects a Tag's sub-things, a list containing a mix of Tag and NavigableString objects. (DOM would call this 'children')
- using a Tag as an iterable (e.g. using for, list()) iterates its direct contents one element at a time.
- Sometimes this is convenient and clean, in other cases searching is faster, more flexible, or more readable
- previousSibling and nextSibling (spelled previous_sibling and next_sibling in bs4): selects the previous or next Tag or NavigableString at the current level. Think of this as walking the parent's contents list. Returns None when there is no such sibling. Useful for some specific node constructions.
- previous and next (previous_element and next_element in bs4) are essentially a single step out of a treewalk (that emits before walking(verify)).
- If that made you go 'huh?', you probably want previousSibling and nextSibling instead.
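A minimal sketch of the attributes listed above, on a made-up snippet with mixed content. It uses the bs4 spellings and assumes bs4 4.4.0 or later for the string argument.

```python
from bs4 import BeautifulSoup

soup = BeautifulSoup('<p>one<b>two</b>three</p>', 'html.parser')
p = soup.p

# .string only works when there is exactly one child
mixed = p.string          # None, because p has three children
single = p.b.string       # 'two'

# find / find_all with string=True fetch text regardless of structure
first_text = p.find(string=True)
all_text = p.find_all(string=True)

# contents is a list mixing NavigableString and Tag objects
kinds = [type(c).__name__ for c in p.contents]

# sibling and parent navigation
after_b = p.b.next_sibling
before_b = p.b.previous_sibling
parent_of_b = p.b.parent
```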
For example, with the example data mentioned earlier:
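import bs4
soup = bs4.BeautifulSoup('<a>1<c q="foo"/><d>2</d>3<c r="bar"/><c/></a>', 'lxml')
p = soup.a    # start at the a element (the lxml builder wraps the fragment in html and body, so soup.contents[0] would be html)
while p is not None:
    print( p )
    p = p.next_element    # spelled .next in older versions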
Will start at the a element and print:
- the a element
- the b element
- '1'
- the first c element (the one that contains d)
- the d element
- '2'
- '3'
- the first empty c element
- the second empty c element
While the following prints that list in exact reverse:
p = soup.find_all('c')[2]    # selects the last c
while p is not None:
    print( p )
    p = p.previous_element    # spelled .previous in older versions
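There are also find functions that walk in these directions: find_next() and find_all_next(), find_previous() and find_all_previous(), plus the sibling variants find_next_sibling(), find_next_siblings(), find_previous_sibling() and find_previous_siblings().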
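As a rough illustration of the treewalk order and the directional find functions, here is a sketch on a simplified variant of the example data. The empty tags are closed explicitly so that html.parser, chosen only to avoid extra installs, reads them as intended.

```python
from bs4 import BeautifulSoup, Tag

soup = BeautifulSoup('<a>1<c q="foo"></c><d>2</d>3<c r="bar"></c></a>', 'html.parser')

# descendants follows the same document order as stepping through next_element
walk = [el.name if isinstance(el, Tag) else str(el) for el in soup.a.descendants]

first_c = soup.find('c')
next_c = first_c.find_next('c')                   # next c in document order
later_text = first_c.find_all_next(string=True)   # all text after the first c
sibling_d = first_c.find_next_sibling('d')        # only looks at the same level

last_c = soup.find_all('c')[-1]
prev_c = last_c.find_previous('c')                # walks backwards instead
```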
Assorted notes
On Unicode
It used to be that it required unicode string input, so you needed to do decoding yourself, and correctly.
Recent versions consider UTF-8 as an input encoding, which means that, for a lot of modern web content, it'll work without you thinking too hard about it.
Once parsed, the strings are python unicode strings.
You can now also ask for the bytestring as it was in the source document.
(does this vary with parser you ask for?)
getting attributes, alternatives
Generally, use a.get('name'), largely because it returns None if not present (and you can have a fallback, like get('name', '')).
Alternative styles are more bother. Say, a['name'] raises KeyError when not present.
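A short sketch of the two styles side by side, on a made-up tag. Note that bs4's HTML builders hand back multi-valued attributes such as class as a list.

```python
from bs4 import BeautifulSoup

a = BeautifulSoup('<a href="/x" class="ext big">x</a>', 'html.parser').a

href = a.get('href')                 # '/x'
missing = a.get('title')             # None, no exception
with_fallback = a.get('title', '')   # '' instead of None

try:
    a['title']                       # dictionary-style access on a missing attribute
    raised = None
except KeyError as e:
    raised = type(e).__name__

classes = a.get('class')             # a list of the space-separated values
```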
Feeding in data, parser alternatives
There are a few ways of feeding in a page:
open_file_object=open('filename','r')
soup=BeautifulSoup(open_file_object)
#...or...
soup=BeautifulSoup(string)
#...or, in BeautifulSoup 3 (bs4 does not seem to offer this one)...
soup=BeautifulSoup()
soup.feed(page_contents) #...which can apparently be done incrementally if you wish.
You can get behaviour with lots of attempted correction, or nearly none.
Theoretically, you can feed in things much closer to even SGML, but you may find you want to customize the parser somewhat for any specific SGML, so that's not necessarily worth it.
In earlier versions you had parser alternatives like:
- BeautifulSoup.BeautifulSoup is tuned for HTML, and knows about self-closing tags.
- BeautifulSoup.BeautifulStoneSoup is for much more basic XML (and not XHTML).
And also:
- BeautifulSoup.BeautifulSOAP, a subclass of BeautifulStoneSoup
- BeautifulSoup.MinimalSoup - like BeautifulSoup.BeautifulSoup, but is ignorant of nesting rules. It is probably most useful as a base class for your own fine-tuned parsers.
- BeautifulSoup.ICantBelieveItsBeautifulSoup is quite like BeautifulSoup.BeautifulSoup, but in a few cases follows the HTML standard rather than common HTML abuse, so is sometimes a little more appropriate on very nearly correct HTML, but it seems you will rarely really need it.
It seems the preferred way now is to tell the constructor.
As of bs4 there are three builders included, based on html.parser, lxml, and html5lib
- html.parser
- python's own. decent speed but slower than lxml, less lenient than html5lib
- lxml
- fast, lenient, can also handle XML (unlike the other two(verify))
- html5lib
- slow, very lenient
- separate package
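The way you request these is the features argument to BeautifulSoup (the second positional argument, after the markup),
and it's more of a lookup than a direct specification: each builder registers a name plus some feature strings, and the best installed match is used. (TODO: figure out the details)
- 'html.parser' lands on html.parser. 'html' is a feature rather than a name, so it seems to land on whichever installed HTML builder ranks highest (lxml, when that is installed)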
- 'xml' and 'lxml-xml' seems to land on lxml's XML parser
- 'lxml' seems to land on lxml's HTML parser
- 'html5' seems to land on html5lib
See also https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
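One way to check where a given features string lands on your own install is to ask the resulting soup for its builder. A minimal sketch: only html.parser is assumed to be present, and the other parsers are probed and reported as None when missing.

```python
from bs4 import BeautifulSoup, FeatureNotFound

# html.parser ships with Python, so this always works
soup = BeautifulSoup('<p>hello</p>', 'html.parser')
print(soup.builder.NAME)

# lxml and html5lib are optional installs; bs4 raises FeatureNotFound without them
found = {}
for features in ('lxml', 'xml', 'html5lib'):
    try:
        found[features] = BeautifulSoup('<p>hello</p>', features).builder.NAME
    except FeatureNotFound:
        found[features] = None
print(found)
```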
Performance
As the documentation points out, "if there’s any ... reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml", because "Beautiful Soup will never be as fast as the parsers it sits on top of"
There are differences between the parsers you can use, e.g. in how they 'fix' incorrect HTML, and how fast they are.
You might want to sit down once and choose your preference.
For some cases (e.g. large documents) it can make sense to e.g. apply tidying (e.g. µTidylib) then feed it to a stricter parser.
When you can count on syntax-correctness of your data, you may want a stricter parser to start with. (if it's XML you may want to try the 'xml' builder, or BeautifulStoneSoup in BeautifulSoup 3)
You may want to prefer the lxml parser (which is a C library),
because html.parser is pure python and slower.
lxml is also faster than html5lib.
lxml has become the default parser in bs4 -- if it is installed.
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers
https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use
- install cchardet, because without it it'll use the pure-python chardet
- for huge documents, consider SoupStrainer, which parses only tags you're interested in.
- this won't make parsing faster, but it will make searching faster, and lower memory use
https://beautiful-soup-4.readthedocs.io/en/latest/#improving-performance
https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/
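A minimal SoupStrainer sketch on made-up markup, keeping only the a tags. The bs4 documentation notes that the html5lib builder does not support this, so the sketch uses html.parser.

```python
from bs4 import BeautifulSoup, SoupStrainer

# Hypothetical page fragment, made up for this illustration.
html = '<p>intro <a href="/x">x</a></p><div><a href="/y">y</a></div>'

only_links = SoupStrainer('a')    # takes the same kind of arguments as find_all()
soup = BeautifulSoup(html, 'html.parser', parse_only=only_links)

hrefs = [a['href'] for a in soup.find_all('a')]
print(hrefs)
print(soup.find('p'))             # the p never made it into the tree
```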
Scraping text
Warnings
DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead
From the docs:
With string you can search for strings instead of tags. The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.
4.4.0 is now years old. It seems that I had an old version because, while bs4 is a dummy package that installs beautifulsoup4, updating bs4 doesn't seem to update beautifulsoup4 itself.
Examples
You generally want to look at things per page, specifically asking yourself "What distinguishes that which I want to extract?" This is often an attribute or class, or sometimes an element context.
Table extraction
I was making an X-SAMPA / IPA conversion and wanted to save myself a lot of typing. I downloaded the wikipedia X-SAMPA page.
At a glance, it looks like mediawiki tables that are generated from markup have exactly one class, wikitable, which is rather convenient because it means we can select the data tables in one go.
The tables on that page have either four or five columns, which are different cases we have to deal with in our interpretation, and half the lines of code below deal with that.
(the page has changed since I wrote this - I should review it)
# Note the code is overly safe for a one-shot script, and a little overly commented
from bs4 import BeautifulSoup

# opened as bytes so that BeautifulSoup detects the encoding; naming the parser avoids bs4's warning
with open('X-SAMPA.html', 'rb') as f:
    soup = BeautifulSoup(f, 'lxml')

for table in soup.find_all('table', {'class':'wikitable'} ):
    # Gets the amount of columns, from the header row.
    # This because cells can be omitted, but generally aren't on the header row. (it's hard dealing with col/rowspans anyway)
    tablecols = len( table.find('th').parent.find_all('th') )
    # Actually means: "find first th, go to the parent tr, select all th children, count them"

    for tr in table.find_all('tr'):    # for all rows in the table
        TDs = tr.find_all('td')
        # deal with both tables in the same code -- check which we're dealing with by amount of columns
        if tablecols==5:      # XS, IPA, IPA, Description, Example
            # hack: pad with a bunch of nothings in case of missing cells, then use the first 4
            xs,ipa,_,descr = (TDs+[None]*5)[:4]
        elif tablecols==4:    # XS, IPA, IPA, Description
            xs,ipa,_,descr = (TDs+[None]*5)[:4]
        else:
            raise ValueError("Don't know this table type!")

        if xs is None or ipa is None or descr is None:    # empty or short rows (e.g. the header row)
            continue
        # We fish out all the text chunks. In this case we can join them together
        xs    = ' '.join( xs.find_all(string=True) )
        ipa   = ' '.join( ipa.find_all(string=True) )
        descr = ' '.join( descr.find_all(string=True) )
Similar idea, for the Kunrei-shiki Rōmaji page:
from bs4 import BeautifulSoup

with open('Kunrei-shiki_Romaji.html', 'rb') as f:
    soup = BeautifulSoup(f, 'lxml')

tables = soup('table', {'width':'100%'})                  # the main table
tables.extend( soup('table', {'class':'wikitable'}) )     # the exceptions table

for table in tables:
    for td in table('td'):    # cells are completely independent as far as we're concerned
        # is there text under this TD?  (bs4 has already decoded &#160; to U+00A0)
        tdtext = ' '.join( td(string=True) ).replace('\xa0',' ').strip()
        if len(tdtext)>0:     # Yup
            # There are a few styles of cell filling, which we unify both with the text select and with logic below
            a = tdtext.split()
            if len(a)==2:
                kana,roman = a
                half = len(kana)//2    # integer division, since this is used as an index
                hiragana,katakana = kana[:half],kana[half:]    # close enough
            elif len(a)==3:
                hiragana,katakana,roman = a
            else:
                raise ValueError('unexpected cell contents: %r'%tdtext)
            print( (hiragana,katakana,roman) )
More table extraction
for http://www.isbn-international.org/en/identifiers/allidentifiers.html. Will need some fixing.
import pprint
from bs4 import BeautifulSoup

with open('allidentifiers.html', 'rb') as f:
    bs = BeautifulSoup(f, 'lxml')

table = bs.find('table', {'style':'width: 565px;'})
t = {}
identifier = None
val = None
for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds)>=2:
        if tds[0].span is not None:
            try:
                identifier = int(tds[0].span.string)
            except (ValueError, TypeError):    # not an integer (or not a single string) - stick with the one we have
                pass
        val = ''.join( tds[1].find_all(string=True) ).replace('\r\n', ' ')
        if identifier not in t:
            t[identifier] = [val]
        else:
            t[identifier].append(val)

result = {}
for k in t:
    if k is not None:    # rows seen before the first identifier
        result[k] = ' '.join(t[k])

resultdata = pprint.pformat(result)
print( resultdata )
with open('allidentifiers.py','w') as f:
    f.write(resultdata)
Dealing with non-nestedness
For one site I needed the logic "Look for the first node that has text node 'title1' and return a list of all nodes (text nodes, elements) up to the next '<b>' tag"
I needed to fetch a list of things under a heading that wasn't really structurally stored at all. The code I used was roughly:
from bs4 import BeautifulSoup, NavigableString

example = """
<b>title1</b>
contents1<br>
contents2<br>
<b>nexttitle</b>
contents3<br>
"""

def section(soup, startAtTagWithText, stopAtTagName):
    ret = []
    e = soup.find(string=startAtTagWithText)    # first text node with exactly this text (firstText() in BeautifulSoup 3)
    if e is None:                               # no such heading
        return ret
    e = e.next_element    # step past the matched text, to whatever follows the tag it sits in
    while e is not None:
        if not isinstance(e, NavigableString) and e.name==stopAtTagName:    # an element, and the one to stop at
            break
        ret.append(e)     # text nodes and other elements alike
        e = e.next_sibling
    return ret

section(BeautifulSoup(example, 'html.parser'), 'title1', 'b')
See also
- http://www.crummy.com/software/BeautifulSoup/documentation.html (documentation, examples)
Helpers, and scraping notes
etree pretty-printer
Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented.
import xml.etree.ElementTree as ET

def indent_inplace(elem, level=0, whitespacestrip=False):
    ''' Alters the text nodes so that the tostring()ed version will look nicely indented.
        whitespacestrip can make contents that contain a lot of newlines look cleaner,
        but changes the stored data even more.
    '''
    i = "\n" + level*"  "
    if whitespacestrip:
        if elem.text:
            elem.text = elem.text.strip()
        if elem.tail:
            elem.tail = elem.tail.strip()
    if len(elem):    # has children
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for child in elem:
            indent_inplace(child, level+1, whitespacestrip)
        # the last child's tail sits just before our closing tag, so it gets our own indent
        if not child.tail or not child.tail.strip():
            child.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i

def prettyprint(xml, whitespacestrip=True):
    ''' Convenience wrapper around indent_inplace():
        - Takes a string (parses it) or a ET structure,
        - alters it to reindent according to depth,
        - returns the result as a string.
        whitespacestrip: see note in indent_inplace()
        Not horribly efficient, and alters the structure you gave it,
        but you are only using this for debug, riiight?
    '''
    if isinstance(xml, (str, bytes)):
        xml = ET.fromstring(xml)
    indent_inplace(xml, whitespacestrip=whitespacestrip)    # (this call used a name that did not exist)
    return ET.tostring(xml, encoding='unicode').rstrip('\n')    # encoding='unicode' returns str rather than bytes
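If you are on Python 3.9 or later, the standard library's ET.indent() does the same kind of in-place reindenting, which may save you carrying the helper above. A minimal sketch on a made-up document:

```python
import xml.etree.ElementTree as ET

root = ET.fromstring('<a><b>1</b><c><d/></c></a>')

ET.indent(root, space='  ')                     # alters text/tail whitespace in place (Python 3.9+)
text = ET.tostring(root, encoding='unicode')    # str rather than bytes
print(text)
```

Like the helper, this changes the whitespace stored in the tree, so it is mostly something for debug output.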
etree namespace stripper
I use this to simplify searching and fetching whenever I know my use case will have no conflicts introduced by namespace stripping.
def strip_namespace_inplace(etree, namespace=None, remove_from_attr=True):
    """ Takes a parsed ET structure and does an in-place removal of namespaces.
        By default removes all namespaces, optionally just a specific namespace (by its URL).

        Can make node searches simpler in structures with unpredictable namespaces
        and in content given to not be ambiguously mixed.

        By default does so for node names as well as attribute names.
        (doesn't remove the namespace definitions, but apparently
         ElementTree serialization omits any that are unused)

        Note that for attributes that are unique only because of namespace,
        this will cause attributes to be overwritten.
        For example: <e p:at="bar" at="quu"> would become: <e at="bar">
        I don't think I've seen any XML where this matters, though.
    """
    if namespace is None:    # all namespaces
        for elem in etree.iter():    # iter() replaces getiterator(), which Python 3.9 removed
            tagname = elem.tag
            if isinstance(tagname, str) and tagname[0]=='{':    # (lxml comments and such have non-string tags)
                elem.tag = tagname[ tagname.index('}',1)+1:]

            if remove_from_attr:
                to_delete = []
                to_set = {}
                for attr_name in elem.attrib:
                    if attr_name[0]=='{':
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[attr_name.index('}',1)+1:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)

    else:    # asked to remove specific namespace.
        ns = '{%s}' % namespace
        nsl = len(ns)
        for elem in etree.iter():
            if isinstance(elem.tag, str) and elem.tag.startswith(ns):
                elem.tag = elem.tag[nsl:]

            if remove_from_attr:
                to_delete = []
                to_set = {}
                for attr_name in elem.attrib:
                    if attr_name.startswith(ns):
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[nsl:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)
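For context on why the stripper looks for '{' and '}': ElementTree stores namespaced names as '{namespace-URL}localname'. A small sketch with made-up namespace URLs, which also shows the attribute collision mentioned in the docstring:

```python
import xml.etree.ElementTree as ET

# The namespace URLs here are made up for the illustration.
doc = ('<r xmlns="http://example.com/ns" xmlns:p="http://example.com/p">'
       '<e p:at="bar" at="quu"/></r>')
root = ET.fromstring(doc)
e = root[0]

print(root.tag)          # the default namespace applies to element names...
print(dict(e.attrib))    # ...but not to unprefixed attributes

# Taking the part after '}' gives the local name (and leaves plain names alone)
local_tag = e.tag.rpartition('}')[2]
local_attrs = [name.rpartition('}')[2] for name in e.attrib]
```

Both attributes end up with the local name at, which is why stripping keeps only one of them.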