These are primarily notes - not complete in any sense, just fragments of useful information.


BeautifulSoup is a Python module that parses HTML (and can deal with common mistakes), and has helpers to navigate and search the result.

It's quite convenient for scraping information from web pages.

Not all use is very fast - see #Performance

Note that there have been one or two large redesigns, so if things don't seem to work:

  • can't import the BeautifulSoup module? - do you have bs4? That's the module name now.
  • findAll does nothing? - you probably have a very old version (which means you're actually using the shorthand to look for a tag named findAll)(verify)


Firstly, there are multiple ways of filtering/fetching. You're probably best off deciding what syntax you like and ignoring all others. I dislike the short forms because they can clash and raise more exceptions, so on this page I omit the shorthand forms I don't like.

A parse tree is made mainly of Tag and NavigableString objects, representing elements and text contents, respectively.

Example data used below:

   <a>
      <b q="foo bar">
         1
         <c q="foo"/>
         <d>2</d>
         3
         <c r="bar"/>
         <c/>
      </b>
   </a>

To play with that example:

import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', 'lxml')
# older versions:
# import BeautifulSoup
# soup = BeautifulSoup.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>')

Note that when you print a Tag object, it prints the entire subtree - which is e.g. quite confusing when you're stepping through the tree and printing its elements.
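To make the Tag / NavigableString distinction concrete, a small sketch (using the same example document, but with html.parser so it runs without extra packages):

```python
import bs4

soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>',
                         'html.parser')

b = soup.find('b')
print(type(b))              # <class 'bs4.element.Tag'>
print(type(b.contents[0]))  # <class 'bs4.element.NavigableString'>
print(b)                    # prints the whole subtree under b, not just the open tag
print(b.name, b.attrs)      # just the element's own name and attributes
```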


Tag objects have some find functions that start the search there.

The most interesting are probably the subtree searchers:

  • find() - finds the first, next match
  • findAll() - finds all matches

There are more, which search in specific parts/directions (see the types of navigation below).

These search recursively by default. You can change that.

They take the following arguments, all optional:

  • name
    : match Tags by name. When you hand in a...
    • string: exact name match
    • list or tuple: exact match of any in list
    • (compiled) regexp: regexp match
    • function: use as arbitrary filter (should return True/False. Can of course be a lambda function)
    • True: fetch all (often pointless; using only attrs implies this, and you can iterate over all children more directly)
  • attrs
    : match Tags by attributes. When you hand in a...
    • string: should match class, but this differed in older BeautifulSoup versions(verify), so I avoid it
    • dict mapping attribute name to...
      • a string: exact match
      • True: tags with this attribute present, e.g. {'q': True}
      • a regexp: match attribute value by regexp, e.g. {'class': re.compile(r'\btablewrapper\b')}
        (useful to properly match classes, since class attribute values are space-separated lists)
  • text
    : match NavigableStrings, by text content. Using this implies 'ignore name and attrs'. When you hand in a...
    • string: exact match
    • True: all strings
    • regexp: regexp match
  • recursive
    : True is default
  • keyword arguments, to be matched as attributes. I don't use this; I think it's messy since it may interact with other arguments and with python keywords.
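With the example soup from above (html.parser variant, so it needs no extra packages), the name and text arguments in their various forms:

```python
import re
import bs4

soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>',
                         'html.parser')

print(soup.findAll('c'))                    # name as string: all three c elements
print(soup.findAll(['c', 'd']))             # name as list: c and d elements, in document order
print(soup.findAll(re.compile('^[cd]$')))   # name as regexp: same result as the previous line
print(soup.findAll(lambda tag: len(tag.attrs) == 0))  # name as function: tags without attributes
print(soup.findAll('c', recursive=False))   # [] - no c elements directly under the soup itself
print(soup.findAll(text=True))              # all text nodes: ['1', '2', '3']
```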

find next, find all, and what to find

Element perspective:

soup.find('c')     # find the next c element

Returns None when nothing matches. This also means you can't blindly chain these calls, since that easily results in an AttributeError complaining that you're trying to do something on None.
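A minimal illustration of why chaining is fragile (the tag names here are just for demonstration):

```python
import bs4

soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/></b></a>', 'html.parser')

# soup.find('nosuchtag').find('c') would raise AttributeError,
# because the first find() returned None - so test each step:
e = soup.find('b')
if e is not None:
    e = e.find('c')
print(e)  # the c element - or None if either step had failed
```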

List-like perspective:

soup.findAll('c')  # returns a list of all matching c elements
# returns [] when nothing matches

A few examples

Searching for different properties, in different ways:

import re

soup.findAll(['b','c'])            # all b and c tags
soup.findAll(re.compile('^[bc]$')) # all b and c tags
soup.findAll(attrs={'q': True})    # anything with a q attribute
soup.findAll(attrs={'q': 'foo'})   # anything with attribute q="foo" (exact match)
soup.findAll('div', attrs={'class':'tablewrapper'})    # all divs with class set to tablewrapper (string equality)
soup.findAll(attrs={'class': re.compile(r'\bbar\b')})  # anything with a class attribute that contains 'bar' as a token
                                                       # (...since class is a token-list thing in (X)HTML)
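Against the example soup from earlier (html.parser variant), those attrs forms give:

```python
import re
import bs4

soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>',
                         'html.parser')

with_q    = soup.findAll(attrs={'q': True})                    # b and the first c
q_is_foo  = soup.findAll(attrs={'q': 'foo'})                   # just the first c ("foo bar" is not an exact match)
q_has_foo = soup.findAll(attrs={'q': re.compile(r'\bfoo\b')})  # b and the first c again
print(with_q, q_is_foo, q_has_foo)
```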


There is quite a bit of extra decoration on Tag (and also NavigableString) objects.

Things you could keep in mind include:

  • parent
    : selects the single parent, a Tag object.
  • contents
    : a Tag's direct sub-things, a list containing a mix of Tag and NavigableString objects. (DOM would call this 'children')
  • using a Tag as an iterable (e.g. in a for loop) iterates over its direct contents one element at a time. Sometimes this is convenient and clean; in other cases searching is faster and more flexible than direct child selection
  • string
    : returns the text child (NavigableString type) -- but only if the Tag contains exactly one of these. If there is more than that, this yields None, even if the first element is a string.
    • Often you want find(text=True) (next piece of text) or findAll(text=True) (all pieces of text) instead, depending on what you know of the structure
  • nextSibling, previousSibling
    : select the next/previous Tag or NavigableString at the current level. Think of this as walking the contents list. Return None when sensible. Useful for some specific node constructions.
  • next, previous
    : essentially a single step in a treewalk (that emits before walking(verify)). If that made you go 'huh?', you probably want nextSibling and previousSibling instead.
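With the example soup from above (html.parser variant), a few of these in action:

```python
import bs4

soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>',
                         'html.parser')

b = soup.find('b')
d = soup.find('d')
print(b.parent.name)      # 'a'
print(b.contents)         # mix of NavigableString and Tag objects
print(d.string)           # '2'  - exactly one text child, so .string works
print(b.string)           # None - b has several children, so .string gives up
print(b.find(text=True))  # '1'  - the next piece of text instead
```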

With the example data mentioned earlier:

p = soup.contents[0]
while p != None:
   print p
   p = p.next   # single treewalk step

Will print:

  • the a element
  • the b element
  • '1'
  • the first c element (the one that contains d)
  • the d element
  • '2'
  • '3'
  • the first empty c element
  • the second empty c element

While the following prints that list in exact reverse:

p = soup.findAll('c')[2]  # selects the last c
while p != None:
    print p
    p = p.previous

There are also find functions that behave this way.

Assorted notes

On Unicode

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

It used to be that BeautifulSoup required unicode string input, so you needed to do the decoding yourself, and correctly.

Recent versions consider UTF-8 as an input encoding, which means you can get away with not thinking about it for a lot of modern web content.

Once parsed, the strings are python unicode strings.

You can now also ask for the bytestring as it was in the source document.

(does this vary with parser?)

getting attributes, alternatives

Generally, use a.get('name'), largely because it returns None when the attribute is not present (and you can give a fallback, like a.get('name', '')).

Alternative styles are more bother. Say, a['name'], which raises KeyError when the attribute is not present.
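For example:

```python
import bs4

soup = bs4.BeautifulSoup('<c q="foo"/>', 'html.parser')
c = soup.find('c')

print(c.get('q'))      # 'foo'
print(c.get('r'))      # None - absent, but no exception
print(c.get('r', ''))  # ''   - absent, with a fallback
print(c['q'])          # 'foo'
# c['r'] would raise KeyError
```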

Feeding in data, parser alternatives

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

There are a few ways of feeding in a page:

soup.feed(page_contents)  #...which can apparently be done incrementally if you wish.

You can get behaviour with lots of attempted correction, or nearly none.

Theoretically, you can feed in things much closer to even SGML, but you may find you want to customize the parser somewhat for any specific SGML, so that's not necessarily worth it.

In earlier versions you had parser alternatives like:

  • BeautifulSoup.BeautifulSoup is tuned for HTML, and knows about self-closing tags.
  • BeautifulSoup.BeautifulStoneSoup is for much more basic XML (and not XHTML).

And also:

  • BeautifulSoup.BeautifulSOAP, a subclass of BeautifulStoneSoup
  • BeautifulSoup.MinimalSoup - like BeautifulSoup.BeautifulSoup, but is ignorant of nesting rules. It is probably most useful as a base class for your own fine-tuned parsers.
  • BeautifulSoup.ICantBelieveItsBeautifulSoup is much like BeautifulSoup.BeautifulSoup, but in a few cases follows the HTML standard rather than common HTML abuse, so it is sometimes a little more appropriate on very nearly correct HTML - but you will rarely really need it.

It seems the preferred way now is to tell the constructor.

As of bs4 there are three builders included, based on html.parser, lxml, and html5lib:

  • html.parser - python's own. Decent speed, but slower than lxml and less lenient than html5lib.
  • lxml - fast, lenient, can also handle XML (unlike the other two(verify)). Separate package.
  • html5lib - slow, very lenient. Separate package.

The way you request these is the features argument to BeautifulSoup (the second positional argument), and it's more of a feature lookup than a direct specification. (TODO: figure out exactly how that works)

  • 'html' and 'html.parser' seem to land on html.parser
  • 'xml' and 'lxml-xml' seem to land on lxml's XML parser
  • 'lxml' seems to land on lxml's HTML parser
  • 'html5' seems to land on html5lib
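A minimal check of the stdlib case (the others work the same way once lxml / html5lib are installed):

```python
import bs4

doc = '<a><b q="x">text</b></a>'
soup = bs4.BeautifulSoup(doc, 'html.parser')  # explicitly pick the stdlib builder
print(type(soup.builder).__name__)            # which builder the lookup actually selected
# bs4.BeautifulSoup(doc, 'lxml'), ...'xml'), ...'html5lib') similarly, when installed
```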

Performance


This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

As the documentation points out, "if there’s any ... reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml", because "Beautiful Soup will never be as fast as the parsers it sits on top of"

There are differences between the parsers you can use, e.g. in how they 'fix' incorrect HTML, and how fast they are.

You might want to sit down once and choose your preference.

For some cases (e.g. large documents) it can make sense to e.g. apply tidying (e.g. µTidylib) then feed it to a stricter parser.

When you can count on syntax-correctness of your data, you may want a stricter parser to start with. (if it's XML you may want to try BeautifulStoneSoup)

You may want to prefer the lxml parser (which is a C library), because html.parser is pure python and slower. lxml is also faster than html5lib.

lxml has become the default parser in bs4 -- if it is installed.

  • install cchardet, because without it bs4 will use the pure-python chardet for encoding detection

  • for huge documents, consider SoupStrainer, which keeps only the tags you're interested in.
    This won't make parsing faster, but it will make searching faster, and lower memory use.
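A SoupStrainer sketch (hypothetical document; html.parser variant):

```python
import bs4
from bs4 import SoupStrainer

doc = '<html><body><a href="/x">one</a><p>filler</p><a href="/y">two</a></body></html>'

only_links = SoupStrainer('a')  # keep only <a> elements (and what is inside them)
soup = bs4.BeautifulSoup(doc, 'html.parser', parse_only=only_links)
print(soup.findAll('a'))  # both links
print(soup.find('p'))     # None - the <p> never made it into the tree
```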

Scraping text


You generally want to look at things per page, specifically asking yourself "What distinguishes that which I want to extract?" This is often an attribute or class, or sometimes an element context.
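For example, if the distinguishing feature is a class (hypothetical page fragment):

```python
import bs4

doc = '''<div class="item">keep me</div>
<div class="ad">skip me</div>
<div class="item">me too</div>'''
soup = bs4.BeautifulSoup(doc, 'html.parser')

# select only the divs marked with the class we care about, then pull out their text
wanted = [' '.join(div.findAll(text=True))
          for div in soup.findAll('div', attrs={'class': 'item'})]
print(wanted)  # ['keep me', 'me too']
```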

Table extraction

I was making an X-SAMPA / IPA conversion and wanted to save myself a lot of typing, so I downloaded the wikipedia X-SAMPA page.

At a glance, it looks like mediawiki tables that are generated from markup have exactly one class, wikitable, which is rather convenient because it means we can select the data tables in one go.

The tables on that page have either four or five columns, which changes interpretation a little; half the code below deals with that.

# Note: the code is overly safe for a one-shot script, and a little overly commented
from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('X-SAMPA.html'))
for table in soup.findAll('table', {'class':'wikitable'} ):
    # Get the amount of columns, from the header row.
    # This because cells can be omitted, but generally aren't on the header row. (it's hard dealing with col/rowspans anyway)
    tablecols = len( table.find('th').parent.findAll('th') )
    # Actually means:   "find first th, go to the parent tr, select all th children, count them"
    for tr in table.findAll('tr'):   # for all rows in the table
        TDs = tr.findAll('td')
        # deal with both tables in the same code -- check which we're dealing with by amount of columns
        if tablecols==5:                          # XS, IPA, IPA, Description, Example
            xs,ipa,_,descr = (TDs+[None]*5)[:4]   # pad with nothings in case of missing cells, then use the first 4
        elif tablecols==4:                        # XS, IPA, IPA, Description
            xs,ipa,_,descr = (TDs+[None]*5)[:4]
        else:
            raise ValueError("Don't know this table type!")
        if None not in (xs,ipa,descr): # skip header rows and empty rows
            # We fish out all the text chunks. In this case we can join them together
            xs    = ' '.join( xs.findAll(text=True) )
            ipa   = ' '.join( ipa.findAll(text=True) )
            descr = ' '.join( descr.findAll(text=True) )

Similar idea, for the Kunrei-shiki Rōmaji page:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('Kunrei-shiki_Romaji.html'))
tables = soup('table',{'width':'100%'})              # the main table
tables.extend( soup('table',{'class':'wikitable'}) ) # the exceptions table
for table in tables:
    for td in table('td'): # cells are completely independent as far as we're concerned
        tdtext = ' '.join(td(text=True)).replace('&#160;',' ').strip()
        if len(tdtext)>0: # is there text under this TD?
            # There are a few styles of cell filling, which we unify both with the text select and with the logic below
            a = tdtext.split()  # (reconstructed: split the cell text into its chunks)
            if len(a)==2:
                kana,roman = a
                hiragana,katakana = kana[:len(kana)/2], kana[len(kana)/2:] # close enough
            elif len(a)==3:
                hiragana,katakana,roman = a
            else:
                continue
            print `hiragana,katakana,roman`

More table extraction

Will need some fixing:

# Reconstructed from fragmentary notes; the input filename and some details are guesses
import re,pprint
from BeautifulSoup import BeautifulSoup

bs = BeautifulSoup(open('page.html'))  # input filename wasn't in these notes
t = {}
identifier = None
table = bs.find('table', {'style':'width: 565px;'})
for tr in table.findAll('tr'):
    tds = tr.findAll('td')
    if len(tds)>=2:
        if tds[0].span!=None:  # first cell names the field this row (and following rows) belong to
            identifier = ''.join( tds[0].span.findAll(text=True) ).strip()
            try:
                identifier = int(identifier)
            except ValueError: # not an integer - stick with the one we have
                pass
        val = ''.join( tds[1].findAll(text=True) ).replace('\r\n', ' ')
        if identifier not in t:
            t[identifier] = [val]
        else:
            t[identifier].append(val)

result = {}
for k in t:
    if k!=None:
        result[k] = ' '.join(t[k])
pprint.pprint(result)

Dealing with non-nestedness

For one site I needed the logic "Look for the first node that has text node 'title1' and return a list of all nodes (text nodes, elements) up to the next '<b>' tag"

I needed to fetch a list of things under a heading that wasn't really structurally stored at all. The code I used was roughly:

# Reconstructed from fragmentary notes
from BeautifulSoup import BeautifulSoup, NavigableString

def section(soup, startAtTagWithText, stopAtTagName):
    ret = []
    e = soup.find(text=startAtTagWithText)
    if e == None:              # starting text not found
        return ret
    e = e.parent.nextSibling   # skip the tag that has the string in it
    while e != None:
        if type(e) != NavigableString and e.name == stopAtTagName:
            break          # an element we should stop at
        ret.append(e)      # a text node, or an element we keep
        e = e.nextSibling
    return ret

See also