Python notes - XML and HTML


Intro

DOM, SAX, or other

DOM is a relatively heavy model because it is highly decorated, and requires all data to be loaded.

Many XPath modules use a DOM tree to query on, which makes sense as its query capabilities match well with the decoration.


SAX, on the other hand, just mentions the nodes it sees pass by, and remembers almost nothing.

...except what you remember on top of it. This involves a little more thinking and a little more coding, but when the XML is fairly flat (and/or many distinct fragments in one large file), this essentially streams its contents and can take significantly less RAM (and possibly also less CPU).


Note that some libraries (like lxml) have tricks that work out as something SAX-like in resource terms, which is good to know when you already know that API.

parse and/or generate

ElementTree intro

ElementTree, as an API, can be seen as a pragmatic midway between the spartan SAX approach and the somewhat excessive DOM-style approach.


And potentially less verbose than both of those. Where DOM keeps elements, text, and attributes in their own objects, ET smushes those onto one object:

  • the element (name, namespace)
  • the attributes on this element
  • the text directly within, and directly after (this one's a little funny, though)


This makes some navigation code considerably shorter.
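
As a minimal sketch of that (assuming the standard library's ElementTree; the element and attribute names are made up):

import xml.etree.ElementTree as ET   # lxml.etree behaves the same here

el = ET.fromstring('<entry id="e1">hello<sub/>tail text</entry>')

print( el.tag )      # 'entry'       - the element name
print( el.attrib )   # {'id': 'e1'}  - its attributes, as a dict
print( el.text )     # 'hello'       - text directly inside, before the first child
print( el[0].tail )  # 'tail text'   - text directly after the <sub/> child (the "little funny" part)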

Roughly speaking, the easier it is to describe the XML in words, the easier it is to write ET code for.


That said,

  • if the XML structure has a lot of superfluous stuff, unnecessarily deep structures, arbitrary embedding and such, it may take more than a few lines to deal with anyway, and you gain nothing on that particular data
  • you have to sit down once to think about how it handles text
  • for data storage like variable-and-value lists it's actively more practical
  • for free-form documents it gets a lot more interesting

lxml intro

lxml is a python library that wraps the libxml2 (and libxslt) C libraries, and gives a good balance between the speed of those C libraries and a capable, pleasant interface on the python side.

It has a nicer API than those libraries' basic bindings, so if you would otherwise use those lower-level things from a higher-level language like Python, lxml may make you happier.


The python lxml library also adds an ElementTree-like API, lxml.etree -- ET-like in that lxml also adds a few things that ET does not have.


When using any fancier features: while lxml is a library you have to install, a lot of it has been around for a long time; python's own etree is still changing, so fancier features might affect the python versions you can support, and some support still seems to differ.


lxml has a few additions that can be really useful, such as

  • the ability to navigate to the parent
  • lxml's xpath()[1], which has been more capable for many years than python's own xpath support is today(verify)

I have found lxml a little more productive when doing scraping.

Two things worth knowing:


  • lxml should e.g. prove faster than BeautifulSoup.
  • bs4 and etree are similar enough to be confusing ("was it findall() or find_all() or maybe findAll() after all?"), so while each has their strengths, you may want to use just one when that's suitable.


ET and lxml - Lineage, variants, and importing
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Roughly speaking:

lxml

is a python binding for the libxml2 C library
not standard library, but installed on many systems
the libxml2 wrapping itself is fairly minimal, but lxml also adds an etree-like interface, lxml.etree, which is actually more capable than (c)ElementTree(verify)


ElementTree

is the original implementation of this API (which lxml.etree imitates)
its pure-python code adds some extra functionality (that cElementTree omits)
standard library since Python ≥2.5, also present in 3


cElementTree

(a more efficient) C implementation of ElementTree
mostly the same API as ElementTree (a few functions are missing?(verify)), but omits some of that added python code
so it is actually more similar to lxml's etree interface again
cElementTree has not been very relevant since python 3.3, where xml.etree.ElementTree itself defaults to a very similar C implementation[2]


As such,

  • there is a subset of functions shared by all three
  • you may wish to choose just one (I switched to lxml-only at some point to lessen my confusion)
    • if you want code that can deal with objects from the others: duck typing mostly solves this, but there are a few footnotes (e.g. lxml.etree has comments and processing instructions, while ET just ignores them while parsing)
  • there used to be more difference than there is now
    • cET was for speed, but that has mostly been added to python
    • lxml was for speed+features, and some of those features are now getting added to python
  • there are details that differ between the three, sometimes subtle, sometimes fundamental
    • read e.g. https://lxml.de/1.3/compatibility.html
  • when you deal with messy, possibly-incorrect HTML, take a look at BeautifulSoup (which since bs4 uses lxml under the covers when it is installed)
  • when you deal with XHTML or well-formed HTML, there is further comparison in flexibility/speed.


In python2 it was

  • interesting where to import ET from - xml.etree was introduced in 2.5; before that you'd import elementtree(verify)
  • useful to fall back to other imports, to e.g. get ET's added python code mixed with cET's faster C code.
    This is less pressing in py3 (≥3.3), in that xml.etree.ElementTree uses a C implementation if available.


ElementTree ≈ etree

Given the above, ElementTree and etree are

sometimes used to refer approximately to the API,
sometimes used to refer to specific implementations.

On this page, both are mainly used to refer to the API.

mainly parsing

BeautifulSoup intro

BeautifulSoup is a Python module that reads in and parses HTML data, and has helpers to navigate and search the result.


It can deal with some common markup mistakes.

It is also fairly convenient about expressing how we want to search the parsed result.


Not all use is very fast - see #Performance


⚠ This page is opinionated
On this page I omit all the shorthand forms that I don't like, and mostly ignore pre-bs4 versions, mainly because these variations get confusing when put side by side.


mainly generating

py.xml

One simple-looking method (codewise) is what e.g. py.xml does - consider:

import py.xml

class ns(py.xml.Namespace):
    "Convenience class for creating requests"

def find_request(session, command, sourceArray):
    find = ns.x_server_request(
        ns.find_request(
            ns.wait_flag("N"),
            ns.find_request_command( command ),
            # generator expression used to create a variable amount of slightly nested nodes:
            list( ns.find_base(ns.find_base_001(sourceID))
                  for sourceID in sourceArray ),
            ns.session_id(session)
        )
    )
    findXMLdoc = '<?xml version="1.0" encoding="UTF-8"?>%s' % (
        find.unicode(indent=0)
    )
    return findXMLdoc


However, this is flawed in that it's easy to create invalid XML: each of the parts is immediately converted to a string, meaning that while XML escaping (of <>&) can be done automatically for attributes, there is no way to do so for text nodes, as the code cannot tell the difference between text nodes and nested elements.


stan

Nevow's stan seems to have considered the just-mentioned problem, because stan is a document object model.

An example slightly adapted from its documentation:

from nevow import flat, tags as T
document = T.html[
    T.head[
        T.title["Hello, world!"]
    ],
    T.body[
        T.h1[ "This is a complete XHTML document modeled in Stan." ],
        T.p[ "This text is inside a paragraph tag." ],
        T.div(style="color: blue; width: 200px; background-color: yellow;")[
            "And this is a coloured div."
        ]
    ]
]
print( flat.flatten(document) )

You actually create a tree of flatten()able objects, and only that last flatten() call actually makes a string view of that document model. This also means you can make any of your own objects stan-XML-serializable by giving them a flatten() function.

It can also make some ways of templating more convenient.


ElementTree's SimpleXMLWriter

The syntax that ElementTree uses for XML generation is not as brief as the above, but still usable.

And it may be handy to use one module for all XML things, just to keep your dependencies down.


The SimpleXMLWriter class is basically a SAX-style writer with a few convenience functions. It remembers which tags are open, so it can avoid some wrap-up calls. It consists of:

  • element(tag, attributes) (one-shot element, avoids separate start and end)
  • start(tag, attributes)
  • end() (closes the current element), or end(tag)
  • close() (all currently open) or close(id) (closes several levels, up to the reference you took from start())
  • data() (text nodes)


from elementtree.SimpleXMLWriter import XMLWriter   # the standalone elementtree package, not part of xml.etree

w = XMLWriter(req, encoding='utf-8') # req: any writable file-like object. Without the encoding, non-ASCII becomes numeric entities
response = w.start("html")   # remember this so we can close() it

w.start("head")
w.element("title", "my document")
w.element("meta", name="generator", value=u"my \u2222 application 1.0")
w.end()             # current open tag is head

w.start("body")
w.element("h1", "this is a heading")
w.element("p", "this is a paragraph")

w.start("p")
w.data("this is ")
w.element("b", "bold")
w.data(" and ")
w.element("i", "italic")
w.data(".")
w.end("p")
w.close(response)   # implies the end() for body and html

For my latest project there were other details involved in the decision: I only have to generate a few simple XML documents, I'm using ET anyway so can make a leaner program if I do everything with it, and py.xml and stan are less likely to be easily installed/installable as libraries.


ElementTree by objects

This can be a little more tedious, but may be handy for XML-based protocols. For example:

def login(username,password):
    E,SE = ET.Element,ET.SubElement  # for brevity

    server_request = E('server_request')
    login_request    = SE(server_request,'login_request')    
    SE(login_request,'user_name').text     = username
    SE(login_request,'user_password').text = password

    return ET.tostring(server_request,'UTF-8')

SubElement creates a node by name, adds it under the given node, and also returns a reference to it. The above code stores that reference on the first use (login_request), and for the other two uses only sets .text and discards it.

It is unicode-capable. If you don't use the encoding argument, you get an XML fragment that uses numeric entities for such characters. If you do use it, it will both encode the output and prepend an <?xml?> header -- in both cases under the condition that it recognizes the encoding name. Since that string also ends up in the XML header, you probably want to say 'UTF-8', not 'utf8' (see the XML standard on encodings).

minidom, pulldom

https://docs.python.org/3/library/xml.dom.minidom.html

https://docs.python.org/3/library/xml.dom.pulldom.html

General notes

On unicode and bytes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

On the XML side

Technically,

  • The XML format should be considered a binary format that self-describes its data coding
(either explicitly, or falling back to UTF-8 by implication of the specs).
  • XML 1.1 requires you to specify an encoding=
leaving it out is invalid - or rather, if it is missing the specs say to treat it like XML 1.0[3].
  • XML 1.0 without the encoding declaration means the parser will have to guess
how it does that is not standardized (there are suggestions, though)


The first point means that decoding is the parser's job, not yours.

And yes, that's a slight chicken-and-egg problem.


Each library has a slightly different take on it, but generally, the parser expects bytes.

Parsers may or may not choose to accept already-decoded unicode (...in languages where not-yet-encoded unicode is an actual data type. Like Python.).


So you generally don't want to decode and then hand that in, unless you have an explicitly good reason that you need it, and want the job of handling all the other cases just as correctly as the parser does. (One good reason is that it guesses wrong, but that should only happen for some very unusual XML.)

You also generally don't want to create XML without encoding declarations - it's considered bad practice, because you cannot guarantee every parser will get it right.



On the python side

There are a few different distinctions and questions here

  • python2 and python3
due to the different meanings of the str and bytes types
  • lxml and ElementTree
  • what you could read in


What you can feed in

etree
in py2 it would only consume bytes, though you could optionally tell it what encoding it's in.
in python3 it can take either bytes or unicode.


The type of text that you pick out of the tree

In py3, both ElementTree and lxml.etree return str (unicode) for names and element text.

In py2, both ElementTree and lxml.etree returned byte strings for plain-ASCII text values (tag names, text in elements, etc.)

the argument seems to be less memory use, less decoding, and when combined with unicode they would be coerced as necessary anyway.


https://stackoverflow.com/questions/3418262/python-unicode-and-elementtree-parse

https://lxml.de/compatibility.html


...but people will do it wrong anyway

Libraries generating XML will do things correctly.


People are more fickle. From memory, I remember:

  • XML that says it's UTF8, but contains Latin1
  • XML that says it's ASCII, but contains UTF8


The best fix depends, in particular, on how sure you are of what the actual problem is, because some duct tape makes things worse down the line.



lxml.html

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


lxml.html

  • is mentioned to be able to handle broken HTML - but also that for broken-enough HTML it may not correct it usefully
you may want BeautifulSoup instead
or, if you want to stick to etree's API, use BS's output within lxml via ElementSoup
  • adds some extra functions on elements



https://lxml.de/lxmlhtml.html

etree and lxml notes

etree-style navigation and searching

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

This API lets you do things in multiple ways:

some functions take paths, others tag names,
some functions return nodes, others contents, others are iterators

and it is easy to not know the easiest way to do a thing, or even how to choose a good subset and always work with that.


  • getchildren()
all direct children, no filter possible (deprecated in ElementTree; just iterate the element instead)
treating an element as a python iterable is equivalent to getchildren()(verify)
returns a list of element objects
  • tag.find(path)
first matching descendant by path (which easily doubles as first direct-child tagname)
e.g. records = root.find('record_list')
returns an element object
  • tag.findall(path)
all matching descendants by path
e.g. records.findall('meta/location')
returns a list of element objects


  • iter(tag=None)
does a treewalk, finds all descendants, optionally filtered by tag name (not path, so effectively disregards structure)
can filter by tag name(s); the default tag=None returns all elements
returns an iterator yielding elements
  • tag.iterfind(path)
matches like findall(), but yields while walking, rather than collecting before returning
returns an iterator yielding elements


💤 In my experience,
find() and findall()/iterfind() tend to be simplest when picking data out of a known, fixed structure,
and avoid some possible confusion when similarly named nodes are also found nested under others
iter() is useful to pick out items when you don't care about context
xpath() (see below) is useful when expressing complex things with syntax-fu, rather than a handful of lines of code
(a short comparison sketch follows below)
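
As a minimal sketch of those differences (the element names here are made up):

import lxml.etree as ET   # find/findall/iter/iterfind work the same in xml.etree.ElementTree

root = ET.fromstring('<lib><shelf><book>A</book><book>B</book></shelf><book>C</book></lib>')

print( root.find('shelf/book').text )                    # 'A'             - first match by path
print( [b.text for b in root.findall('shelf/book')] )    # ['A', 'B']      - all matches by that path
print( [b.text for b in root.iter('book')] )             # ['A', 'B', 'C'] - treewalk, disregards structure
print( [b.text for b in root.iterfind('shelf/book')] )   # ['A', 'B']      - like findall, but yields as it goes
print( [b.text for b in root.xpath('//book')] )          # ['A', 'B', 'C'] - lxml only, see below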


lxml only

  • tag.getparent() (only in lxml, not ElementTree)
  • tag.xpath(path) (only in lxml, not ElementTree)
like findall(), but takes real XPath expressions (lxml supports XPath 1.0)
note: often a little slower than find*() [4]
yes, etree's find*() now also supports a subset of XPath, but it's still being worked on, whereas lxml's xpath() has been there since forever
  • itersiblings()
  • iterancestors()
  • iterchildren()
  • iterdescendants()



Fetching out data

name

.tag for the element's name

See also the namespaces section, because that's shoved into .tag when present.


attributes
  • elem.get(name, default=None) gets an attribute value (or the default value if the attribute doesn't exist)
  • elem.set(name, val) - useful when you want to serialize it again
  • elem.keys() returns the names of the present attributes
  • elem.items() returns the attributes, as an arbitrarily ordered list of (name, value) tuples
  • elem.attrib is a dict with the attributes. You can alter this, though note that you should not assign a new dict object - if you want new contents, do a clear() and update()
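
A minimal sketch of those accessors (element and attribute names are made up):

import xml.etree.ElementTree as ET   # lxml.etree behaves the same here

el = ET.fromstring('<item id="i1" lang="en"/>')

print( el.get('id') )            # 'i1'
print( el.get('missing', '-') )  # '-'  (default instead of None)
el.set('checked', 'yes')         # add or overwrite an attribute
print( list(el.keys()) )         # ['id', 'lang', 'checked']  - the attribute names
print( el.attrib )               # {'id': 'i1', 'lang': 'en', 'checked': 'yes'}
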
Text from etrees

ElementTree's focus on elements - meaning it sticks text onto the nearest element object, in two different places on it - makes extracting text a little more adventurous.


More technically:

  • elem.text is the initial text (from the DOM view: the first child node, if it is text)
If node content is only text, this will happen to be all the text
Can be None (if there is no text inside)
If there's a mix of text and nodes, it's only the text before the first contained element.
If you want all text in a flatten-all-text-away way, see the notes below.
  • elem.tail is the text between a node and its next sibling
Can be None (if there is no text between them)
  • elem.findtext(path, default=None) returns the text content of the first matching element (or the default value, if nothing matched)
roughly equivalent to find(path).text, but without the None.text problem when nothing matches
  • elem.itertext() - basically yields each .text and .tail, in document order
to make that a single string: "".join(elem.itertext())

See also:


More practically

You often deal with either a well-structured data serialization (and you can pick out what you want), or a free-form document (and blanket 'all fragments under here' is what you want).

See also some of the scraping examples below.


When you have serialized data and know there are no sub-elements and you just want the direct text content

Consider:

<kv>
  <key1>value1</key1>
  <key2>value2</key2>
</kv>
In this case you know .text is enough (and .tail will be empty or, if actually indented like that, just ignorable whitespace), so you might read that entire kv list with:

kv = ET.fromstring(xmldata)
retdict = {}
for node in kv:
    retdict[node.tag] = node.text


When you have serialized data and know exactly where you want a single text fragment from, but it's a little deeper
  • see findtext()


For less-structured documents, which can intersperse things anywhere (much like HTML), where you know you want all the subtree's text

That is, if you want just the text nodes, removing all node structure

For example, running

list( element.itertext() )

on the example tree two sections back would give:

['\n  ', 'value1', '\n  ', 'value2', '\n']



Sometimes you want to smush that into a single string, because you don't care about how each node contributed.

This comes with some footnotes. While the following

"".join(element.itertext())

...is the least creative, it also means strings may get a little too smushed. As it is, it will give '\n  value1\n  value2\n', but if that example were not indented it would give 'value1value2', which

" ".join(element.itertext())

would solve -- but if it were actually HTML, this might insert spaces where you don't want them. Say, in

Pointing out the difference between work<i>ed</i> and work<i>ing</i>

I would not, but in

<div>First quote</div><div>Second quote</div>

I would - and yeah, handling that properly is up to an HTML-specific parser following HTML-specific standards.


For HTML where you may need to insert spaces depending on the node

This is, almost inherently, a creative exercise, so there is no singular best example.

You will probably also find that HTML ignores a lot of whitespace (because you can indent things), so you will have to imitate that too.

And that it's never perfect, because enough webpages change layout semantics via display:.

That said,

Manually?

For reference on how .text and .tail work, consider:

>>> nodes = ET.fromstring('<a>this<b/>and<c/>that<d/> <e>foo</e><f/><g/></a>') # which is mostly text and nodes mixed under root level, except for e
>>> nodes.text     # <a>'s initial text   (a is the root node)
'this'

>>> [ (el.tag,el.tail)  for el in nodes ]           # tail is the text after the node, if any
[('b', 'and'), ('c', 'that'), ('d', ' '), ('e', None), ('f', None), ('g', None)]

>>> [ (el.tag,el.text,el.tail)  for el in nodes ]   # .text is the (first) text in the node
[('b', None, 'and'), ('c', None, 'that'), ('d', None, ' '), ('e', 'foo', None), ('f', None, None), ('g', None, None)]

>>> list( nodes.itertext() )     # all the text fragments, in document order
['this', 'and', 'that', ' ', 'foo']

Namespaces

In searches
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The element tag contains the namespace URI in curly braces (Clark notation, {namespaceURI}localname).

As such, you can

root.findall('{http://www.w3.org/2002/07/owl#}Class')
root.iter('{http://purl.org/dc/elements/1.1/}title')

(Note that lxml's xpath() does not accept this {uri} notation; it wants a prefix plus a namespaces= mapping - see below.)


There does not seem to be a "find in any namespace" on the existing find functions (though recent ElementTree versions accept {*}tag wildcards(verify)), and you could always do it yourself, by explicitly matching the Element's tag string.
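
A minimal do-it-yourself sketch of that (the helper name is made up):

def iter_localname(root, localname):
    # yield all elements whose local tag name matches, regardless of namespace
    for el in root.iter():
        if isinstance(el.tag, str) and el.tag.rpartition('}')[2] == localname:
            yield el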


If you want to use a prefix like

root.findall('owl:Class')

...then read the next section

Prefixes
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
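
A minimal sketch, using the namespaces= mapping that both ElementTree's find*() and lxml's find*()/xpath() accept. The prefixes here are our own choice and do not have to match the ones used in the document:

NSMAP = {
    'owl': 'http://www.w3.org/2002/07/owl#',
    'dc':  'http://purl.org/dc/elements/1.1/',
}

root.findall('owl:Class', namespaces=NSMAP)     # ElementTree and lxml
root.xpath('.//dc:title', namespaces=NSMAP)     # lxml only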


Streaming

If you have a huge XML file, you can load it into memory, but you may run into RAM problems.


Very large XML files are often data dumps: lots of small and independent records in a row, where you only care about one record at a time - so ideally, RAM use shouldn't be much more than one record's worth at a time.

So wouldn't it be a nice tradeoff to get to hold one record in memory, and remove it when you're done?


The most minimal way to deal with one is a SAX-style parser, which keeps essentially nothing in memory and just mentions start and end tags.

That still means you need to remember enough of a record for your purpose (basically building up that record yourself), which is manual work (and single-purpose, so not much reusable code next time you need that).


In lxml, you can get something in between with iterparse.

It's sort of like a SAX parser that also ends up building the same ET tree that parsing it regularly would (so yes, it would eventually use just as much RAM)...

...but it does so incrementally, yielding during that process.

So if it is a record-after-record structure (which most huge XML files are), you can process one record at a time - in that you can choose to get rid of that record's ET representation once you're done.


The least-code way is to clear() the contents of each record. That might look like:

import lxml.etree, io
test = b'<t><r>1</r><r>2</r><r>3</r></t>'

# Note: tag (and events) ask the generator to yield only at certain positions, rather than at every point.
for event, element in lxml.etree.iterparse( io.BytesIO(test), events=('end',), tag='r' ):
    print( event, element.text, lxml.etree.tostring(element) )
    element.clear()

This means it would build up the tree and yield at the end of every node called 'r' (without specifying tag, it yields at basically every point, which is generally not as useful). After doing something with that node (here we print it), we then clear() that particular record's tree contents.

Note that this still leaves a root with each past record's empty node (each element you get has increasingly many preceding siblings), and if you're parsing something with hundreds of millions of short records, that still adds up. If that matters, this suggests adding a "delete all previous siblings" step like the following (which also means element.clear() isn't strictly necessary anymore):

while element.getprevious() is not None:
    del element.getparent()[0]


See also:

etree examples

Parsing data

For data that is structured enough to always have the same thing in the same place...


xmldata='''
<subscriptions>
  <success/>
  <entry>
    <request>
      <issn>1234-5678</issn>
      <year>1984</year>
    </request>
    <presence>fulltext</presence>
  </entry>
  <entry>
    <request>
      <issn>0041-9999</issn>
      <year>1984</year>
    </request>
    <presence type="cached">fulltext</presence>
  </entry>
</subscriptions>'''

Code with comments:

subscriptions = ET.fromstring(xmldata)  # parses the whole, returns the root element

error = subscriptions.find('error')  # check whether there is an <error>with message</error> instead of <success/>
if error is not None:
    raise Exception( "Response reports error: %s"%(error.text) )  # .text gets the node's direct text content

for entry in subscriptions.findall('entry'):  #find all direct-child 'entry' elements
    issn = entry.find('request/issn').text     # note: if this node doesn't exist, this would error out (because None.text doesn't make sense)
    year = entry.findtext('request/year','')   # another way of fetching text - one that deals better with absence
    
    # Using xpath-style paths like above can be handy, though when we want to fetch multiple details
    # it's often handier to get element references first
    presence = entry.find('presence')
    prestype = presence.get('type')  # attribute fetch. Using get() means we get None if missing
    prestext = presence.text
    
    print( '%s in %s: %s (%s)'%(issn,year,prestext,prestype) )

That code prints:

1234-5678 in 1984: fulltext (None)
0041-9999 in 1984: fulltext (cached)

Notes:

  • The functions on an element mean you tend to use it as a deeper-only tree
  • If, in the above example, you don't always have issn under request, the python would sometimes try to fetch None.text, which would be a (python-side) error.
...which is why the year fetch is an example of a more robust variant - for text. If there is no text for the node, the default (itself defaulting to None) is returned, here set to ''.
  • Your choice to use find(), findall(), getchildren(), iter() varies with case and taste
  • ET does character decoding to unicode, and seems to assume UTF8. If you want to be robust to other encodings, handle the string before handing it to ET. (verify)
    • Be careful, though. Doing things like decode('utf8','ignore') may eat tags in some cases (when there is an invalid sequence right before a <)
    • (<py3k:) text nodes may be unicode or str strings, so your code should be robust to both
  • there are other ways of getting at items. An element object acts like a sequence, meaning you can use len() to get the number of children (handy for some logic), list() to get a plain list of the children (though using generators is a good habit), and more.

etree error notes

internal error: Huge input lookup
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


There are a few limits configured in libxml2, which lead the parser to refuse rather than handle some cases, such as:

  • size of a single text node (10MByte)
  • recursion limit with entities
  • maximum depth of a document

This seems mostly to protect against handing in data that may be (or amount to) an algorithmic complexity attack / resource DoS.

You can lift these limits using XML_PARSE_HUGE / LIBXML_PARSEHUGE, though you should probably only do that when you trust the XML you feed it not to do such things.


in python lxml the way to do that seems to be:

from lxml.etree import XMLParser, parse
p = XMLParser(huge_tree=True)
tree = parse('file.xml', parser=p)


Unicode strings with encoding declaration are not supported

Usually comes down to you having handed fromstring() XML as a unicode string instead of bytes -- while that XML still self-declares the encoding it is in.


It wants to

  • see bytes, and
  • either
    • read the encoding from the data
    • or have you override it via an XMLParser()


If you decode()d it, or had open() decode it for you -- don't.

If you had no control over how it got decoded, I guess you could .encode('utf8') it.
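
A minimal sketch of the difference (using lxml; the filename is made up):

import lxml.etree

xml_bytes = open('data.xml', 'rb').read()                  # bytes: fine, the parser handles decoding
root = lxml.etree.fromstring(xml_bytes)

xml_text = open('data.xml', encoding='utf-8').read()       # already-decoded str
#root = lxml.etree.fromstring(xml_text)                    # ValueError if the document has an encoding declaration
root = lxml.etree.fromstring(xml_text.encode('utf-8'))     # the re-encode workaround mentioned above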


UnicodeEncodeError: 'ascii' codec can't encode character u'\U0001d516'

...or some other non-BMP Unicode character (so \U rather than \u).


This is not necessarily a printing-to-console problem, as you might assume.


This probably comes from ET's tostring(), which uses 'ascii' encoding by default -- which usually works fine, because it rewrites Unicode into numeric entities.

However, that conversion plays it safe and only writes U+0080 through U+FFFF as numeric entities (Unicode numeric entities were not explicitly required to work for codepoints above U+FFFD until later versions of XML).


One technically-more-standard way to work around that is tostring(nodes, encoding='utf-8'), which tells it to encode unicode characters to UTF-8 bytes before dumping them into the document. Note the dash in 'utf-8': this is a value for ET, not the python codec name ('utf8'/'u8'), and it is also the string that ends up in the XML header (or, apparently, is omitted when it is 'utf-8', probably because that is the XML default). 'utf8', 'UTF8', and such are not valid encoding references in XML.

BeautifulSoup notes

Basics

Firstly, there are multiple ways of filtering/fetching elements from a parsed tree.

You're probably best off deciding what single syntax you like and ignoring all the others. I dislike the short forms because they can clash, and they raise more exceptions, which makes code more convoluted if you handle them properly.


A parse tree is made mainly of Tag and NavigableString objects, representing elements and text contents, respectively.

Example data used below:

 <a>
   <b q="foo bar">
      1
      <c q="foo"/>
      <d>2</d>
      <c r="bar"/>
      <c/>
      3
   </b>
 </a>

To play with that example:

import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', 'lxml')

Notes:

  • parser has four options: [5]
    • html.parser - works out of the box, not the fastest
    • lxml - fast, external C dependency
    • lxml-xml - focused on XML (plain lxml is aimed at HTML)
    • html5lib - lenient, slow, external python dependency


  • When you print Tag objects, it prints the entire subtree
which is e.g. quite confusing when you're stepping through the tree and printing its elements.


searching

You can

  • walk through the Tag elements entirely manually
but that is rarely useful, unless the document is already quite data-like,
or perhaps it works out as a good way to express some heavily contextual parsing.
  • jump to specific parts you are interested in with find()/find_all() and friends, or possibly select(), then often still walk those interesting bits manually
select

this covers similar ground to find()/find_all() and friends, but as a note...

select() can frequently be more succinct (than e.g. find_all), because it allows CSS selectors, letting you do things like:

soup.select("p > a:nth-of-type(2)")
soup.select("#link1,#link2")
soup.select("a[class~='externalLink']"
soup.select("div > ul[class*='browseList'] > li[class*='browseItem'] > a" )
SoupSieve

Basically the selector library that implements select() in current bs4 (and which can also be used on its own, for more complete CSS selector support).

https://pypi.org/project/soupsieve/

find and friends

find() and its friends (plus your code) can be very direct in what you tell them to do, but are almost always more typing

  • find() - finds the first match in the subtree (see the find() docs)
  • There are variations on the same idea, but with a specific direction or restriction:
find_parent(), find_parents()
find_next_sibling(), find_next_siblings()
find_previous_sibling(), find_previous_siblings()
find_next(), find_all_next()
find_previous(), find_all_previous()
You may never need more than a few, depending on how you like to think about searching in trees.


Finding things with specific properties - find functions take keyword arguments like

  • name: match Tags by name. When you hand in a...
    • string: exact name match
    • list or tuple: exact match of any in list
    • (compiled) regexp: regexp match
    • function: use as arbitrary filter (should return True/False. Can of course be a lambda function)
    • True: fetch all (often pointless; using only attrs implies this, and you can iterate over all children more directly)
  • attrs: match Tags by attributes. When you hand in a...
    • string: should match class, but different in older BeautifulSoup version(verify), so I avoid it
    • dicts mapping string to...
      • ...to a string: exact match
      • ...to True: tags with this attribute present, e.g. soup.find_all(True,{'id':True})
      • ...to a regexp: match attribute value by regexp, e.g. soup.find_all(True,{'class':re.compile(r'\bwikitable\b')}) (useful to properly match classes, since class attribute values are space-separated lists)
  • text: match NavigableStrings, by text content. Using this implies 'ignore name and attrs'. When you hand in a...
    • string: exact match
    • True: all strings
    • regexp: regexp match
  • recursive: (where relevant) these search recursively by default (recursive=True)
You can change that when you want to express e.g. "find all spans directly under this div", where a non-recursive find_all makes sense

For example:

soup.find_all(['b','c'])          # all b and c tags
soup.find_all( re.compile('[bc]') ) # all b and c tags


# anything with a q attribute at all:
soup.find_all(attrs={'q':True})

# anything with attribute q="foo"
soup.find_all(attrs={'q':'foo'})

#all divs with class set to tablewrapper (string equality)
soup.find_all('div', attrs={'class':'tablewrapper'})

# Anything with a class attribute that contains 'bar' (word-edge regexp, to be close enough to DOM token-list matching, https://dom.spec.whatwg.org/#interface-domtokenlist):
soup.find_all(attrs={'class':re.compile(r'\bbar\b')})



When nothing matches,

  • find() would return None
Which also means you can't really chain these, since that'll easily result in complaining you're trying to do something on None (specifically an AttributeError).
  • find_all would return an empty list

navigation

There is quite a bit of extra decoration on Tag (and also NavigableString) objects.

Things you could keep in mind include:

  • string: returns the text child (a NavigableString)
...but only if the Tag contains exactly one of these.
If there is more than that, this will yield None, even if the first child is a string.
often, you instead want find(string=True) (next piece of text) or find_all(string=True) (all pieces of text), depending on what you know of the structure
  • parent: selects the single parent, a Tag object.
  • contents: selects a Tag's sub-things, a list containing a mix of Tag and NavigableString objects. (DOM would call this 'children')
  • using a Tag as an iterable (e.g. with for, list()) iterates its direct contents one element at a time.
Sometimes this is convenient and clean; in other cases searching is faster, more flexible, or more readable


  • previousSibling and nextSibling (in bs4 also spelled previous_sibling and next_sibling): select the previous/next Tag or NavigableString at the current level. Think of this as walking the contents list. Returns None when sensible. Useful for some specific node constructions.
  • previous and next are essentially a single step of a treewalk (that emits before walking(verify)).
If that made you go 'huh?', you probably want previousSibling and nextSibling instead.


For example, with the example data mentioned earlier:

import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', 'lxml')
 
p = soup.contents[0]
while p is not None:
   print( p )
   p=p.next

Will start at the a element and print:

  • the a element
  • the b element
  • '1'
  • the first c element (the one that contains d)
  • the d element
  • '2'
  • '3'
  • the first empty c element
  • the second empty c element

While the following prints that list in exact reverse:

p = soup.find_all('c')[-1] # selects the last c
while p is not None:
    print( p )
    p=p.previous

There are also find functions that behave this way.

Assorted notes

On Unicode

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

It used to be that it required unicode string input, so you needed to do decoding yourself, and correctly.

Recent versions consider UTF-8 as an input encoding, which means that, for a lot of modern web content, it'll work without you thinking too hard about it.


Once parsed, the strings are python unicode strings.

You can now also ask for the bytestring as it was in the source document.


(does this vary with parser you ask for?)

getting attributes, alternatives

Generally, use a.get('name'), largely because it returns None if not present (and you can have a fallback, like a.get('name', '')).


Alternative styles are more bother. Say, a['name'], which raises a KeyError when not present.
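
A minimal sketch of that difference:

import bs4
a = bs4.BeautifulSoup('<a href="http://example.com">x</a>', 'html.parser').a

print( a.get('href') )        # 'http://example.com'
print( a.get('title', '') )   # ''  - missing attribute, falls back to the default
#print( a['title'] )          # would raise KeyError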



Feeding in data, parser alternatives

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

There are a few ways of feeding in a page:

open_file_object=open('filename','r')
soup=BeautifulSoup(open_file_object)  

#...or...

soup=BeautifulSoup(string)      

#...or...

soup=BeautifulSoup()
soup.feed(page_contents)  #...which can apparently be done incrementally if you wish.


You can get behaviour with lots of attempted correction, or nearly none.

Theoretically, you can feed in things much closer to even SGML, but you may find you want to customize the parser somewhat for any specific SGML, so that's not necessarily worth it.


In earlier versions you had parser alternatives like:

  • BeautifulSoup.BeautifulSoup is tuned for HTML, and knows about self-closing tags.
  • BeautifulSoup.BeautifulStoneSoup is for much more basic XML (and not XHTML).

And also:

  • BeautifulSoup.BeautifulSOAP, a subclass of BeautifulStoneSoup
  • BeautifulSoup.MinimalSoup - like BeautifulSoup.BeautifulSoup, but is ignorant of nesting rules. It is probably most useful as a base class for your own fine-tuned parsers.
  • BeautifulSoup.ICantBelieveItsBeautifulSoup is quite like BeautifulSoup.BeautifulSoup, but in a few cases follows the HTML standard rather than common HTML abuse, so is sometimes a little more appropriate on very nearly correct HTML - but it seems you will rarely really need it.



It seems the preferred way now is to tell the constructor.

As of bs4 there are three builders included, based on htmlparser, lxml, and html5lib

html.parser
python's own. decent speed but slower than lxml, less lenient than html5lib
lxml
fast, lenient, can also handle XML (unlike the other two(verify))
html5lib
slow, very lenient
separate package


The way you request these is the features argument to BeautifulSoup (the second positional argument), and it's more of a lookup of a matching builder than a direct specification. (TODO: figure out exactly how that works)

'html' and 'html.parser' seems to land on html.parser
'xml' and 'lxml-xml' seems to land on lxml's XML parser
'lxml' seems to land on lxml's HTML parser
'html5' seems to land on html5lib


See also https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers


Performance
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

As the documentation points out: "if there's any [...] reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml", because "Beautiful Soup will never be as fast as the parsers it sits on top of".


There are differences between the parsers you can use, e.g. in how they 'fix' incorrect HTML, and how fast they are. Roughly:

  • "html.parser" - medium speed, not so lenient, no external dependency
  • "lxml" - fast, (not so lenient?), an external dependency - generally recommended
  • "html5lib" - slow, though very lenient
  • ("lxml-xml" / "xml" for XML)

You might want to sit down once and choose your preference, but in general: install and use lxml.

Apparently html.parser used to be the default, but more recently lxml is preferred if it is installed.

(side note: you might also consider using lxml's html module (or lxml's iterparse with html=True and recover=True) instead of bs4)


https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use


  • install cchardet, because without it it'll use the pure-python chardet


  • for huge documents, consider SoupStrainer, which parses only tags you're interested in.
this won't make parsing faster, but it will make searching faster, and lower memory use
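
A minimal SoupStrainer sketch (here html is your document string; the tag choice is just an example):

import bs4

# only keep <a> tags that have an href; everything else is discarded while building the tree
only_links = bs4.SoupStrainer('a', href=True)
soup = bs4.BeautifulSoup(html, 'html.parser', parse_only=only_links)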


https://beautiful-soup-4.readthedocs.io/en/latest/#improving-performance

https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/

Scraping text

Warnings

DeprecationWarning: The 'text' argument to find()-type methods is deprecated. Use 'string' instead

From the docs:

With string you can search for strings instead of tags.
The string argument is new in Beautiful Soup 4.4.0. In earlier versions it was called text.

4.4.0 is now years old. It seems that I had an old version because, while bs4 is a dummy package that installs beautifulsoup4, updating bs4 doesn't seem to update beautifulsoup4.

BeautifulSoup examples

You generally want to look at things per page, specifically asking yourself "What distinguishes that which I want to extract?" This is often an attribute or class, or sometimes an element context.

Table extraction

I was making an X-SAMPA / IPA conversion and wanted to save myself a lot of typing, so I downloaded the wikipedia X-SAMPA page.

At a glance, it looks like mediawiki tables that are generated from markup have exactly one class, wikitable, which is rather convenient because it means we can select the data tables in one go.

The tables on that page have either four or five columns, which are different cases we have to deal with in our interpretation, and half the lines of code below deal with that.

(the page has changed since I wrote this - I should review it)

# Note the code is overly safe for a one-shot script, and a little overly commented
from bs4 import BeautifulSoup

soup = BeautifulSoup(open('X-SAMPA.html'), 'lxml')

for table in soup.find_all('table', {'class':'wikitable'} ):

    # Gets the amount of columns, from the header row.    
    # This because cells can be omitted, but generally aren't on the header row. (it's hard dealing with col/rowspans anyway)
    tablecols = len( table.find('th').parent.find_all('th') )
    # Actually means:   "find first th, go to the parent tr, select all th children, count them"

    for tr in table.find_all('tr'):   # for all rows in the table
        TDs= tr.find_all('td')
        # deal with both tables in the same code -- check which we're dealing with by amount of columns
        if tablecols==5:                             # XS, IPA, IPA, Description, Example
            xs,ipa,_,descr   = (TDs+[None]*5)[:4]    # pad with a bunch of Nones in case of missing cells, then use the first 4
        elif tablecols==4:                           # XS, IPA, IPA, Description
            xs,ipa,_,descr   = (TDs+[None]*5)[:4]
        else:
            raise ValueError("Don't know this table type!")

        if None in [xs,ipa]: #empty rows?
            pass
        else:
            #We fish out all the text chunks. In this case we can join them together
            xs    = ' '.join( xs.find_all(string=True) )
            ipa   = ' '.join( ipa.find_all(string=True) )
            descr = ' '.join( descr.find_all(string=True) )


Similar idea, for the Kunrei-shiki Rōmaji page:

from bs4 import BeautifulSoup

soup = BeautifulSoup(open('Kunrei-shiki_Romaji.html'), 'lxml')

tables = soup('table',{'width':'100%'})              # the main table
tables.extend( soup('table',{'class':'wikitable'}) ) # the exceptions table

for table in tables:
    for td in table('td'): # cells are completely independent as far as we're concerned
        tdtext = ' '.join( td(string=True) ).replace('\xa0',' ').strip()   # is there text under this TD?
        if len(tdtext)>0:  # Yup
            # There are a few styles of cell filling, which we unify both with the text select and with the logic below
            a = tdtext.split()
            kana=''
            if len(a)==2:
                kana,roman = a
                hiragana,katakana = kana[:len(kana)//2], kana[len(kana)//2:]  # close enough
            elif len(a)==3:
                hiragana,katakana,roman = a
            else:
                raise ValueError('BOOGA')
            print( (hiragana,katakana,roman) )

More table extraction

for http://www.isbn-international.org/en/identifiers/allidentifiers.html. Will need some fixing.

import pprint

from bs4 import BeautifulSoup
bs = BeautifulSoup(open('allidentifiers.html'), 'lxml')
table = bs.find('table', {'style':'width: 565px;'})

t={}
identifier=None
val=None

for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds)>=2:
        if tds[0].span is not None:
           try:
               identifier = int(tds[0].span.string)
           except ValueError: # not an integer - stick with the one we have
               pass
        val = ''.join( tds[1].find_all(string=True) ).replace('\r\n', ' ')

        if identifier not in t:
            t[identifier] = [val]
        else:
            t[identifier].append(val)

result={}
for k in t:
    if k is not None:
        result[k] = ' '.join(t[k])

resultdata = pprint.pformat(result)
print( resultdata )

f = open('allidentifiers.py','w')
f.write(resultdata)
f.close()

Dealing with non-nestedness

For one site I needed the logic "look for the first node that has text 'title1', and return a list of all nodes (text nodes, elements) up to the next <b> tag".

I needed to fetch a list of things under a heading that wasn't really structurally stored at all. The code I used was roughly:

from bs4 import BeautifulSoup, NavigableString

example="""
<b>title1</b>
  contents1<br>
  contents2<br>
<b>nexttitle</b>
  contents3<br>
"""


def section(soup, startAtTagWithText, stopAtTagName):
   ret=[]
   e = soup.find(string=startAtTagWithText)  # the NavigableString inside the first matching tag
   try:
      e = e.next_element         # step past the tag that contains that string
      while e is not None:
         if not isinstance(e, NavigableString):   # an element
            if e.name==stopAtTagName:
               break
            else:
               ret.append(e)
         else:                                    # a text node
            ret.append(e)
         e = e.next_sibling
   except AttributeError:        # nothing matched, so e was None
       pass
   return ret

section(BeautifulSoup(example, 'html.parser'), 'title1', 'b')

See also

Helpers, and scraping notes

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


etree pretty-printer

⚠ This will implicitly change the stored data (and is not the most efficient), so it is only meant for presenting to people, as a debugging aid

Rewrites the structure (adding and removing whitespace) so that its tostring()ed version looks nicely indented.

def indent_inplace(elem, level=0, whitespacestrip=False):
    ''' Alters the text nodes so that the tostring()ed version will look nicely indented.
     
        whitespacestrip can make contents that contain a lot of newlines look cleaner, 
        but changes the stored data even more.
    '''
    i = "\n" + level*"  "

    if whitespacestrip:
        if elem.text:
            elem.text=elem.text.strip()
        if elem.tail:
            elem.tail=elem.tail.strip()

    if len(elem):
        if not elem.text or not elem.text.strip():
            elem.text = i + "  "
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
        for elem in elem:   # (intentionally shadows elem; the last child is used just below)
            indent_inplace(elem, level+1, whitespacestrip=whitespacestrip)
        if not elem.tail or not elem.tail.strip():
            elem.tail = i
    else:
        if level and (not elem.tail or not elem.tail.strip()):
            elem.tail = i


def prettyprint(xml,whitespacestrip=True):
    ''' Convenience wrapper around indent_inplace():
        - Takes a string (parses it) or a ET structure,
        - alters it to reindent according to depth,
        - returns the result as a string.

        whitespacestrip: see note in indent_inplace()

        Not horribly efficient, and alters the structure you gave it,
        but you are only using this for debug, riiight?
    '''
    if type(xml) is str:
        xml = ET.fromstring(xml)
    indent_inplace(xml, whitespacestrip=whitespacestrip)
    return ET.tostring(xml, encoding='unicode').rstrip('\n')
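
Hypothetical usage, just to show the intent (this assumes the module imports ElementTree as ET, as the helpers above do):

import xml.etree.ElementTree as ET   # or lxml.etree

print( prettyprint('<a><b>some</b><b>thing</b></a>') )
# <a>
#   <b>some</b>
#   <b>thing</b>
# </a>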


etree namespace stripper

⚠ This will implicitly change the stored data (and is not the most efficient), so think of it as a convenience for searching and debugging. Not thoroughly tested (the core of this was taken from elsewhere).

I use this to simplify searching and fetching whenever I know my use case will have no conflicts introduced by namespace stripping.


def strip_namespace_inplace(etree, namespace=None,remove_from_attr=True):
    """ Takes a parsed ET structure and does an in-place removal of namespaces.
        By default removes all namespaces, optionally just a specific namespace (by its URL).
        
        Can make node searches simpler in structures with unpredictable namespaces
        and in content given to not be ambiguously mixed.

        By default does so for node names as well as attribute names.       
        (doesn't remove the namespace definitions, but apparently
         ElementTree serialization omits any that are unused)

        Note that for attributes that are unique only because of namespace,
        this will cause attributes to be overwritten. 
        For example: <e p:at="bar" at="quu">   would become: <e at="bar">
        I don't think I've seen any XML where this matters, though.
    """
    if namespace==None: # all namespaces
        for elem in etree.iter():
            tagname = elem.tag
            if tagname[0]=='{':
                elem.tag = tagname[ tagname.index('}',1)+1:]

            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name[0]=='{':
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[attr_name.index('}',1)+1:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)

    else: # asked to remove specific namespace.
        ns = '{%s}' % namespace
        nsl = len(ns)
        for elem in etree.iter():
            if elem.tag.startswith(ns):
                elem.tag = elem.tag[nsl:]

            if remove_from_attr:
                to_delete=[]
                to_set={}
                for attr_name in elem.attrib:
                    if attr_name.startswith(ns):
                        old_val = elem.attrib[attr_name]
                        to_delete.append(attr_name)
                        attr_name = attr_name[nsl:]
                        to_set[attr_name] = old_val
                for key in to_delete:
                    elem.attrib.pop(key)
                elem.attrib.update(to_set)