BeautifulSoup





These are primarily notes.
It won't be complete in any sense.
It exists to contain fragments of useful information.

Intro

BeautifulSoup is a Python module that reads in and parses HTML data, and has helpers to navigate and search the result.


It can deal with some common markup mistakes, and it is fairly convenient about expressing how we want to search the parsed result.


Not all use is very fast - see #Performance

⚠ This page is opinionated
On this page I omit all the shorthand forms that I don't like, and mostly ignore pre-bs4 versions, both because putting all these variations side by side is mostly confusing.


Basics

Firstly, there are multiple ways of filtering/fetching elements from a parsed tree.

You're probably best off deciding what single syntax you like, and ignoring all others. I dislike the short forms because they can clash, and they raise more exceptions, which makes code more convoluted if you handle them properly.


A parse tree is made mainly of Tag and NavigableString objects, representing elements and text contents, respectively.

Example data used below:

 <a>
   <b q="foo bar">
      1
      <c q="foo"/>
      <d>2</d>
      3
      <c r="bar"/>
      <c/>
   </b>
 </a>

To play with that example:

import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', 'lxml')

Note that when you print Tag objects, it prints the entire subtree - which is e.g. quite confusing when you're stepping through the tree and printing its elements.
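
For example, with the soup from above:

b = soup.find('b')
print(b)        # prints the entire <b>...</b> subtree
print(b.name)   # 'b' - often more useful while stepping through a tree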


searching

You can walk through the Tag elements manually, but that is rarely useful, unless the markup is essentially data already, or perhaps it works out as a good way to express some heavily contextual parsing.

Usually the structure you are looking for is local, and you would use select() or find()-and-friends


select() can frequently be more succinct, because it allows CSS selectors, letting you do things like:

soup.select("p > a:nth-of-type(2)")
soup.select("#link1,#link2")
soup.select("a[class~='externalLink']"
soup.select("div > ul[class*='browseList'] > li[class*='browseItem'] > a" )


find()/find_*() plus your code can be more flexible and specific, but is almost always more typing than select().

  • find() - finds the first match in the subtree
  • find_all() - finds all matches in the subtree
These search recursively by default(verify).

You can change that with recursive=False, which makes sense when you want to express e.g. "find all spans directly under this div" (see the sketch below this list).

  • There are more, which are basically the same idea, but in with a specific direction or restriction:
find_parent(), find_parents()
find_next_sibling(), find_next_siblings()
find_previous_sibling(), find_previous_siblings()
find_next(), find_all_next()
find_previous(), find_all_previous()

You may never need more than a few, depending on how you approach searching in trees.
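
For example, a small standalone sketch of recursive=False:

import bs4

html = '<div><span>direct</span><p><span>nested</span></p></div>'
div = bs4.BeautifulSoup(html, 'lxml').find('div')

div.find_all('span')                   # both spans
div.find_all('span', recursive=False)  # only the span that is a direct child of the div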


find functions take a bunch of optional arguments, including:

  • name: match Tags by name. When you hand in a...
    • string: exact name match
    • list or tuple: exact match of any in list
    • (compiled) regexp: regexp match
    • function: use as arbitrary filter (should return True/False. Can of course be a lambda function)
    • True: fetch all (often pointless; using only attrs implies this, and you can iterate over all children more directly)
  • attrs: match Tags by attributes. When you hand in a...
    • string: should match the class attribute, but behaved differently in older BeautifulSoup versions(verify), so I avoid it
    • a dict mapping attribute names to...
      • ...to a string: exact match
      • ...to True: tags with this attribute present, e.g. soup.find_all(True,{'id':True})
      • ...to a regexp: match attribute value by regexp, e.g. soup.find_all(True,{'class':re.compile(r'\bwikitable\b')}) (useful to properly match classes, since class attribute values are space-separated lists)
  • text: match NavigableStrings, by text content. Using this implies 'ignore name and attrs'. When you hand in a...
    • string: exact match
    • True: all strings
    • regexp: regexp match
  • recursive: True is default
  • keyword arguments, to be matched. I don't use this, I think it's messy since it may interact with other arguments and python keywords.
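
For example, the function and regexp forms (a small sketch, using the soup from earlier):

import re

# name as a function: any tag that has a q attribute
soup.find_all(lambda tag: tag.has_attr('q'))

# name as a regexp, attrs as a dict mapping to a regexp
soup.find_all(re.compile('^[bc]$'), attrs={'q': re.compile(r'\bfoo\b')})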

find next, find all, and what to find

Docs: https://www.crummy.com/software/BeautifulSoup/bs4/doc/#searching-the-tree


Element perspective:

soup.find('c')     # finds the first c element (in the subtree)

Returns None when nothing matches. This also means you can't really chain these, since that'll easily result in an AttributeError complaining you're trying to do something on None.
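
So in practice you end up writing checks like:

c = soup.find('c')
if c is not None:          # only keep going when the first find() matched
    d = c.find('d')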


List-like perspective:

soup.find_all('c')  # returns a list, of all matching c elements
#Returns [] when nothing matches

# or its shortcut:
soup('c')


A few examples

Searching for different properties, in different ways:

soup.find_all(['b','c'])          # all b and c tags
soup.find_all(re.compile('[bc]')) # all b and c tags


#Anything with a q attribute:
soup.find_all(attrs={'q':True})

#Anything with attribute q="foo"
soup.find_all(attrs={'q':'foo'})

#all divs with class set to tablewrapper (string equality)
soup.find_all('div', attrs={'class':'tablewrapper'})

#Anything with a class attribute that contains 'bar' (uses word-edge to be close enough to [https://dom.spec.whatwg.org/#interface-domtokenlist token-list] matching):
soup.find_all(attrs={'class':re.compile(r'\bbar\b')})

navigation

There is quite a bit of extra decoration on Tag (and also NavigableString) objects.

Things you could keep in mind include:

  • parent: selects single parent, a Tag object.
  • contents: selects a Tag's sub-things, a list containing a mix of Tag and NavigableString objects. (DOM would call this 'children')
  • using a Tag as an iterable (e.g. using for, list()) iterates its direct contents one element at a time. Sometimes this is convenient and clean; in other cases searching is faster and more flexible than direct child selection
  • string: returns text child (NavigableString type) -- but only if a Tag contains exactly one of these. If there is more than that, this will yield None, even if the first element is a string.
    • Often you want to use find(text=True) (next piece of text) or find_all(text=True) (all pieces of text), depending on what you know of the structure
  • previous_sibling and next_sibling (previousSibling and nextSibling in older-style code): select the previous/next Tag or NavigableString at the current level. Think of this as walking the contents list. Returns None when sensible. Useful for some specific node constructions.
  • previous_element and next_element (previous and next in older-style code) are essentially a single step of a treewalk (that emits before walking(verify)). If that made you go 'huh?', you probably want previous_sibling and next_sibling instead.
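
For example, with the soup from earlier:

d = soup.find('d')
d.string               # '2' - d contains exactly one NavigableString

b = soup.find('b')
b.string               # None - b contains more than one thing
b.find(text=True)      # '1' - the first piece of text under b
b.find_all(text=True)  # all pieces of text under b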

With the example data mentioned earlier:

p = soup.find('a')   # start at the a element
while p is not None:
    print(p)
    p = p.next_element

Will print:

  • the a element
  • the b element
  • '1'
  • the first c element (the one that contains d)
  • the d element
  • '2'
  • '3'
  • the first empty c element
  • the second empty c element

While the following prints that list in exact reverse:

p = soup.find_all('c')[2]  # selects the last c
while p is not None:
    print(p)
    p = p.previous_element

There are also find functions that behave this way.
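
For example:

d = soup.find('d')
d.find_next('c')        # the next c after d, in document order
d.find_previous('b')    # the nearest b before d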

Assorted notes

On Unicode

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

It used to be that it required unicode string input, so you needed to do decoding yourself, and correctly.

Recent versions consider UTF-8 as an input encoding, which means you can get away with not thinking about it for a lot of modern web content.


Once parsed, the strings are python unicode strings.

You can now also ask for the bytestring as it was in the source document.
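
A small sketch of that (assuming bs4's usual attributes):

import bs4

soup = bs4.BeautifulSoup(b'<p>caf\xc3\xa9</p>', 'lxml')

soup.find('p').string      # 'café', a python (unicode) string
soup.original_encoding     # what it decided the input encoding was, e.g. 'utf-8'
soup.find('p').encode()    # re-serialize to bytes (UTF-8 by default)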


(does this vary with parser?)


getting attributes, alternatives

Generally, use a.get('name'), largely because it returns None if not present (and you can have a fallback, like get('name', '')).


Alternative styles are more bother. Say, a['name'] raises KeyError when not present.
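
A minimal illustration of that difference:

import bs4

a = bs4.BeautifulSoup('<a href="/x">link</a>', 'lxml').find('a')

a.get('href')        # '/x'
a.get('title')       # None - attribute not present
a.get('title', '')   # '' - explicit fallback
a['title']           # raises KeyError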



Feeding in data, parser alternatives

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

There are a few ways of feeding in a page:

open_file_object=open('filename','r')
soup=BeautifulSoup(open_file_object)  

#...or...

soup=BeautifulSoup(string)      

#...or, in pre-bs4 versions...

soup = BeautifulSoup()
soup.feed(page_contents)  #...which could apparently be done incrementally if you wished.


You can get behaviour with lots of attempted correction, or nearly none.

Theoretically, you can feed in things much closer to even SGML, but you may find you want to customize the parser somewhat for any specific SGML, so that's not necessarily worth it.


In earlier versions you had parser alternatives like:

  • BeautifulSoup.BeautifulSoup is tuned for HTML, and knows about self-closing tags.
  • BeautifulSoup.BeautifulStoneSoup is for much more basic XML (and not XHTML).

And also:

  • BeautifulSoup.BeautifulSOAP, a subclass of BeautifulStoneSoup
  • BeautifulSoup.MinimalSoup - like BeautifulSoup.BeautifulSoup, but is ignorant of nesting rules. It is probably most useful as a base class for your own fine-tuned parsers.
  • BeautifulSoup.ICantBelieveItsBeautifulSoup is quite like BeautifulSoup.BeautifulSoup, but in a few cases follows the HTML standard rather than common HTML abuse, so is sometimes a little more appropriate on very nearly correct HTML, but it seems you will rarely really need it.



It seems the preferred way now is to tell the constructor.

As of bs4 there are three builders included, based on html.parser, lxml, and html5lib:

  • html.parser - python's own. Decent speed, but slower than lxml, and less lenient than html5lib.
  • lxml - fast, lenient, can also handle XML (unlike the other two(verify)).
  • html5lib - slow, very lenient; a separate package.


The way you request these is the second (features) argument to BeautifulSoup, and it's more of a lookup than a direct specification. (TODO: figure out exactly how that lookup works)

  • 'html' and 'html.parser' seem to land on html.parser
  • 'xml' and 'lxml-xml' seem to land on lxml's XML parser
  • 'lxml' seems to land on lxml's HTML parser
  • 'html5' seems to land on html5lib
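
For example, the same markup handed to different parsers (assuming lxml and html5lib are installed):

import bs4

html = "<p>Hi <b>there</p>"

print( bs4.BeautifulSoup(html, 'html.parser') )
print( bs4.BeautifulSoup(html, 'lxml') )
print( bs4.BeautifulSoup(html, 'html5lib') )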


See also https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers


Performance
This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


As the documentation points out, "if there’s any ... reason why computer time is more valuable than programmer time, you should forget about Beautiful Soup and work directly atop lxml", because "Beautiful Soup will never be as fast as the parsers it sits on top of"



There are differences between the parsers you can use, e.g. in how they 'fix' incorrect HTML, and how fast they are.

You might want to sit down once and choose your preference.


For some cases (e.g. large documents) it can make sense to e.g. apply tidying (e.g. µTidylib) then feed it to a stricter parser.

When you can count on the syntax-correctness of your data, you may want a stricter parser to start with (if it's XML, you may want to try BeautifulStoneSoup, or in bs4 ask for the 'xml' parser).


You may want to prefer the lxml parser (which is a C library), because html.parser is pure python and slower. lxml is also faster than html5lib.

lxml has become the default parser in bs4 -- if it is installed.


https://www.crummy.com/software/BeautifulSoup/bs4/doc/#differences-between-parsers

https://www.crummy.com/software/BeautifulSoup/bs4/doc/#specifying-the-parser-to-use


  • install cchardet, because without it it'll use the pure-python chardet


  • for huge documents, consider SoupStrainer, which parses only tags you're interested in.
this won't make parsing faster, but it will make searching faster, and lower memory use
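
A minimal SoupStrainer sketch (the filename is a placeholder):

import bs4
from bs4 import SoupStrainer

only_links = SoupStrainer('a')   # build a tree containing only <a> tags (won't work with html5lib)
soup = bs4.BeautifulSoup(open('page.html'), 'lxml', parse_only=only_links)

for a in soup.find_all('a'):
    print( a.get('href') )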


https://beautiful-soup-4.readthedocs.io/en/latest/#improving-performance

https://thehftguy.com/2020/07/28/making-beautifulsoup-parsing-10-times-faster/


Scraping text

Examples

You generally want to look at things per page, specifically asking yourself "What distinguishes that which I want to extract?" This is often an attribute or class, or sometimes an element context.

Table extraction

I was making an X-SAMPA / IPA conversion and wanted to save myself a lot of typing, so I downloaded the wikipedia X-SAMPA page.

At a glance, it looks like mediawiki tables that are generated from markup have exactly one class, wikitable, which is rather convenient because it means we can select the data tables in one go.

The tables on that page have either four or five columns, which changes interpretation a little; half the code below deals with that.

#Note the code is overly safe for a one-shot script, and a little overly commented
import bs4

soup = bs4.BeautifulSoup(open('X-SAMPA.html'), 'lxml')

for table in soup.find_all('table', {'class':'wikitable'} ):

    # Gets the amount of columns, from the header row.    
    # This because cells can be omitted, but generally aren't on the header row. (it's hard dealing with col/rowspans anyway)
    tablecols = len( table.find('th').parent.find_all('th') )
    # Actually means:   "find first th, go to the parent tr, select all th children, count them"

    for tr in table.find_all('tr'):   # for all rows in the table
        TDs= tr.find_all('td')
        # deal with both tables in the same code -- check which we're dealing with by amount of columns
        if tablecols==5:                             # XS, IPA, IPA, Description, Example
            xs,ipa,_,descr   = (TDs+[None]*5)[:4]    # hack faking extra list entries when there aren't enough TDs in the table
           #pad with a bunch of nothings in case of missing cells, then use the first 4
        elif tablecols==4:                           # XS, IPA, IPA, Description
            xs,ipa,_,descr   = (TDs+[None]*5)[:4]
        else:
            raise ValueError("Don't know this table type!")

        if None in (xs, ipa, descr): # empty or short rows?
            pass
        else:
            #We fish out all the text chunks. In this case we can join them together
            xs    = ' '.join( xs.find_all(text=True) )
            ipa   = ' '.join( ipa.find_all(text=True) )
            descr = ' '.join( descr.find_all(text=True) )


Similar idea, for the Kunrei-shiki Rōmaji page:

import bs4

soup = bs4.BeautifulSoup(open('Kunrei-shiki_Romaji.html'), 'lxml')

tables = soup('table', {'width':'100%'})               # the main table
tables.extend( soup('table', {'class':'wikitable'}) )  # the exceptions table

for table in tables:
    for td in table('td'): # cells are completely independent as far as we're concerned
        tdtext = ' '.join( td(text=True) ).replace('\xa0', ' ').strip()   # is there text under this TD?
        if len(tdtext) > 0: # Yup
            # There are a few styles of cell filling, which we unify both with the text select and with the logic below
            a = tdtext.split()
            kana = ''
            if len(a) == 2:
                kana, roman = a
                hiragana, katakana = kana[:len(kana)//2], kana[len(kana)//2:] # close enough
            elif len(a) == 3:
                hiragana, katakana, roman = a
            else:
                raise ValueError('BOOGA')
            print( (hiragana, katakana, roman) )

More table extraction

For http://www.isbn-international.org/en/identifiers/allidentifiers.html. Will need some fixing.

import pprint
import bs4

bs = bs4.BeautifulSoup(open('allidentifiers.html'), 'lxml')
table = bs.find('table', {'style':'width: 565px;'})

t={}
identifier=None
val=None

for tr in table.find_all('tr'):
    tds = tr.find_all('td')
    if len(tds)>=2:
        span = tds[0].find('span')
        if span is not None:
            try:
                identifier = int(span.string)
            except (TypeError, ValueError): # not an integer (or missing) - stick with the one we have
                pass
        val = ''.join( tds[1].find_all(text=True) ).replace('\r\n', ' ')
    
        if identifier not in t:
            t[identifier] = [val]
        else:
            t[identifier].append(val)

result={}  
for k in t:
    if k is not None:
        result[k] = ' '.join(t[k])

resultdata=pprint.pformat(result)
print(resultdata)

f = open('allidentifiers.py', 'w')
f.write(resultdata)
f.close()

Dealing with non-nestedness

For one site I needed the logic "Look for the first node that has text node 'title1' and return a list of all nodes (text nodes, elements) up to the next '<b>' tag"

I needed to fetch a list of things under a heading that wasn't really structurally stored at all. The code I used was roughly:

import bs4
from bs4 import NavigableString

example="""
<b>title1</b>
  contents1<br>
  contents2<br>
<b>nexttitle</b>
  contents3<br>
"""


def section(soup, startAtTagWithText, stopAtTagName):
    ret = []
    e = soup.find(text=startAtTagWithText)   # the matching NavigableString
    try:
        e = e.next_element                   # step past the matched string itself
        while e is not None:
            if not isinstance(e, NavigableString):   # a Tag
                if e.name == stopAtTagName:
                    break
                else:
                    ret.append(e)
            else:                                    # a NavigableString
                ret.append(e)
            e = e.next_sibling
    except AttributeError:                   # find() returned None
        pass
    return ret

section(bs4.BeautifulSoup(example, 'lxml'), 'title1', 'b')


See also