BeautifulSoup

These are primarily notes.
They won't be complete in any sense.
They exist to contain fragments of useful information.

Intro

BeautifulSoup is a Python module that parses HTML (and can deal with common mistakes), and has helpers to navigate and search the result, which makes it convenient for scraping information.

It's not very fast, so when documents can be large you may want to go another way, e.g. apply a tidier (such as µTidylib) and then feed the result to a stricter, faster parser.

When you can count on the syntactic correctness of your data, you may want a stricter parser to start with (if it's XML, you may want to try BeautifulStoneSoup).
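
A minimal sketch of what typical use looks like (assuming bs4 and lxml are installed):

import bs4
soup = bs4.BeautifulSoup('<p class="intro">Hello <b>world</b></p>', 'lxml')
print(soup.find('p', attrs={'class': 'intro'}).get_text())   # 'Hello world'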


Note that there have been one or two large redesigns, so if things don't seem to work:

  • can't import the BeautifulSoup module - do you have bs4? That's the module name now.
  • findAll does nothing - you probably have a very old version (which means you're actually using the shorthand to look for a tag named findAll)(verify)


Basics

Firstly, there are multiple ways of filtering/fetching. You're probably best off deciding what syntax you like and ignoring all others. I dislike the short forms because they can clash and raise more exceptions, so this page omits the shorthand forms I don't like.


A parse tree is made mainly of Tag and NavigableString objects, representing elements and text contents, respectively.

Example data used below:

 <a>
   <b q="foo bar">
      1
      <c q="foo"/>
      <d>2</d>
      <c r="bar"/>
      <c/>
      3
   </b>
 </a>

To play with that example:

import bs4
soup = bs4.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>', 'lxml')

# older versions:
# import BeautifulSoup
# soup = BeautifulSoup.BeautifulSoup('<a><b q="foo bar">1<c q="foo"/><d>2</d>3<c r="bar"/><c/></b></a>')

Note that when you print a Tag object, it prints the entire subtree, which can be quite confusing when you're stepping through the tree and printing its elements.
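
For example:

print(soup.find('b'))   # prints the whole <b>...</b> subtree, not just the opening tag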


searching

Tag objects have a number of find functions that search starting from that tag.

The most interesting are probably the subtree searchers:

  • find() - finds the first, next match
  • findAll() - finds all matches

There are more, which search in specific parts/directions (see the types of navigation below).

These search the whole subtree by default; you can pass recursive=False to search only direct children, for example:
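
# using the example soup from above:
soup.find('b').findAll('c', recursive=False)   # the three c tags, all direct children of b
soup.findAll('c', recursive=False)             # [] -- no c element is a direct child of the root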


They take the following arguments, all optional:

  • name
    : match Tags by name. When you hand in a...
    • string: exact name match
    • list or tuple: exact match of any in list
    • (compiled) regexp: regexp match
    • function: used as an arbitrary filter; it is handed the Tag and should return True/False (can of course be a lambda; see the example after this list)
    • True: fetch all (often pointless; using only attrs implies this, and you can iterate over all children more directly)
  • attrs
    : match Tags by attributes. When you hand in a...
    • string: should match against class, but behaves differently in older BeautifulSoup versions(verify), so I avoid it
    • a dict mapping attribute-name strings to...
      • ...to a string: exact match
      • ...to True: tags with this attribute present, e.g.
        soup.findAll(True,{'id':True})
      • ...to a regexp: match attribute value by regexp, e.g.
        soup.findAll(True,{'class':re.compile(r'\bwikitable\b')})
        (useful to properly match classes, since class attribute values are space-separated lists)
  • text
    : match NavigableStrings, by text content. Using this implies 'ignore name and attrs'. When you hand in a...
    • string: exact match
    • True: all strings
    • regexp: regexp match
  • recursive
    : True is default
  • keyword arguments, to be matched. I don't use this; I think it's messy, since it may interact with other arguments and Python keywords.
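
For example, the function and text variants, using the example soup (a quick sketch):

import re

soup.findAll(lambda tag: len(tag.name) == 1)   # name as a function filter: tags with single-letter names
soup.findAll(text=re.compile(r'\d'))           # text: all text nodes containing a digit, here u'1', u'2', u'3'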

find next, find all, and what to find

Element perspective:

soup.find('c')     # find the next c element

Returns None when nothing matches. This also means you can't really chain these, since that'll easily result in an AttributeError complaining you're trying to do something on None.
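
For example:

# sketch: there is no <q> element in the example, so find() returns None
e = soup.find('q')
if e is not None:       # guard before chaining
    print(e.find('d'))
# soup.find('q').find('d') would raise AttributeError here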


List-like perspective:

soup.findAll('c')  # returns a list, of all matching c elements
#Returns [] when nothing matches
 
# or its shortcut:
soup('c')


A few examples

Searching for different properties, in different ways:

soup.findAll(['b','c'])          # all b and c tags
soup.findAll(re.compile('[bc]')) # all b and c tags
 
 
#Anything with a q attribute:
soup.findAll(attrs={'q':True})
 
#Anything with attribute q="foo"
soup.findAll(attrs={'q':'foo'})
 
#all divs with class set to tablewrapper (string equality)
soup.findAll('div', attrs={'class':'tablewrapper'})
 
#Anything with a class attribute that contains 'bar' as a token (...since class is a token-list thing in (X)HTML):
soup.findAll(attrs={'class':re.compile(r'\bbar\b')})

navigation

There is quite a bit of extra decoration on Tag (and also NavigableString) objects.

Things you could keep in mind include:

  • parent
    : selects the single parent, a Tag object.
  • contents
    : selects a Tag's sub-things, a list containing a mix of Tag and NavigableString objects. (DOM would call this 'children')
  • using a Tag as an iterable (e.g. with for or list()) iterates its direct contents one element at a time. Sometimes this is convenient and clean; in other cases searching is faster and more flexible than direct child selection.
  • string
    : returns the text child (NavigableString type) -- but only if a Tag contains exactly one of these. If there is more than that, this will yield None, even if the first element is a string. (see the example after this list)
    • Often you want to use find(text=True) (next piece of text) or findAll(text=True) (all pieces of text), depending on what you know of the structure
  • previousSibling and nextSibling
    : select the previous/next Tag or NavigableString at the current level. Think of this as walking the contents list. Returns None when sensible. Useful for some specific node constructions.
  • previous and next
    are essentially a single step of a treewalk (that emits before walking(verify)). If that made you go 'huh?', you probably want previousSibling and nextSibling instead.
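
For example, string versus find(text=True), using the example soup:

b = soup.find('b')
print(b.string)               # None -- b contains more than one thing
print(b.find(text=True))      # u'1', the first text node
print(soup.find('d').string)  # u'2' -- d has exactly one text child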

With the example data mentioned earlier:

p = soup.contents[0]
while p is not None:
    print(p)
    p = p.next

Will print (with bs4+lxml you will first also see the html and body elements that lxml wraps around the fragment):

  • the a element
  • the b element
  • '1'
  • the first c element (with the old parser, which did not know c is self-closing, this one contained d)
  • the d element
  • '2'
  • '3'
  • the second c element
  • the third c element

While the following prints that list in exact reverse:

p = soup.findAll('c')[2]  # selects the last c
while p is not None:
    print(p)
    p = p.previous

There are also find functions that behave this way.
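
For example, a few of the directional finds:

d = soup.find('d')
print(d.findNext('c'))             # the next c in document order
print(d.findParent('a'))           # the nearest enclosing a element
print(d.findPreviousSibling('c'))  # the previous sibling that is a c tag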

Assorted notes

getting attributes, alternatives

Use a.get('name'). It returns None if the attribute is not present.

One alternative style, a['name'], raises KeyError when it is not present.
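
For example:

c = soup.find('c')
print(c.get('q'))   # u'foo'
print(c.get('r'))   # None -- the first c has no r attribute
# c['r'] would raise KeyError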


On Unicode

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Once parsed, the strings are Python unicode strings.

It used to be that BeautifulSoup required unicode string input, so you needed to do the decoding yourself, and correctly. Recent versions consider UTF-8 as an input encoding.

You can now also ask for the bytestring as it was in the source document.
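
A quick check:

s = soup.find(text=True)
print(type(s))   # NavigableString, a unicode subclass (a str subclass in Python 3)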


Feeding in data, parser alternatives

There are a few ways of feeding in a page:

# old-style API shown; in bs4 you instead hand the markup to the constructor
from BeautifulSoup import BeautifulSoup

open_file_object = open('filename', 'r')
soup = BeautifulSoup(open_file_object)

#...or...

soup = BeautifulSoup(string)

#...or...

soup = BeautifulSoup()
soup.feed(page_contents)  #...which can apparently be done incrementally if you wish.


Parsers include:

  • BeautifulSoup.BeautifulSoup is tuned for HTML, and knows about self-closing tags.
  • BeautifulSoup.BeautifulStoneSoup is for much more basic XML (and not XHTML).

And also:

  • BeautifulSoup.BeautifulSOAP, a subclass of BeautifulStoneSoup
  • BeautifulSoup.MinimalSoup - like BeautifulSoup.BeautifulSoup, but is ignorant of nesting rules. It is probably most useful as a base class for your own fine-tuned parsers.
  • BeautifulSoup.ICantBelieveItsBeautifulSoup is quite like BeautifulSoup.BeautifulSoup, but in a few cases follows the HTML standard rather than common HTML abuse, so it is sometimes a little more appropriate for very nearly correct HTML; it seems you will rarely really need it.


Theoretically, you can feed in things much closer to SGML, but you may regularly want to customize the parser for any specific SGML.


It seems that functions may react differently when called on different parsers. I ran into trouble with findParents - this may have been a bug, or not; I haven't checked.


Examples

You generally want to look at things per page, specifically asking yourself "What distinguishes that which I want to extract?" This is often an attribute or class, or sometimes an element context.

Table extraction

I was making an X-SAMPA / IPA converter and wanted to save myself a lot of typing, so I downloaded the Wikipedia X-SAMPA page.

At a glance, it looks like mediawiki tables that are generated from markup have exactly one class, wikitable, which is rather convenient because it means we can select the data tables in one go.

The tables on that page have either four or five columns, which changes interpretation a little; half the code below deals with that.

#Note: the code is overly safe for a one-shot script, and a little overly commented
from BeautifulSoup import BeautifulSoup

soup = BeautifulSoup(open('X-SAMPA'))

for table in soup.findAll('table', {'class':'wikitable'}):

    # Get the number of columns, from the header row.
    # This because cells can be omitted, but generally aren't on the header row. (it's hard dealing with col/rowspans anyway)
    tablecols = len( table.find('th').parent.findAll('th') )
    # Actually means:   "find first th, go to the parent tr, select all th children, count them"

    for tr in table.findAll('tr'):   # for all rows in the table
        TDs = tr.findAll('td')
        # deal with both tables in the same code -- check which we're dealing with by the number of columns
        if tablecols==5:                          # XS, IPA, IPA, Description, Example
            xs,ipa,_,descr = (TDs+[None]*5)[:4]   # pad with Nones in case of missing cells, then use the first four
        elif tablecols==4:                        # XS, IPA, IPA, Description
            xs,ipa,_,descr = (TDs+[None]*5)[:4]
        else:
            raise ValueError("Don't know this table type!")

        if None in (xs, ipa, descr):  # empty or short rows?
            pass
        else:
            # We fish out all the text chunks. In this case we can join them together
            xs    = ' '.join( xs.findAll(text=True) )
            ipa   = ' '.join( ipa.findAll(text=True) )
            descr = ' '.join( descr.findAll(text=True) )


Similar idea, for the Kunrei-shiki Rōmaji page:

from BeautifulSoup import BeautifulSoup
soup = BeautifulSoup(open('Kunrei-shiki_Romaji'))

tables = soup('table', {'width':'100%'})              # the main table
tables.extend( soup('table', {'class':'wikitable'}) ) # the exceptions table

for table in tables:
    for td in table('td'): # cells are completely independent as far as we're concerned
        tdtext = ' '.join(td(text=True)).replace('&#160;',' ').strip()   # is there text under this TD?
        if len(tdtext)>0: # yup
            # There are a few styles of cell filling, which we unify both with the text select and with the logic below
            a = tdtext.split()
            kana = ''
            if len(a)==2:
                kana,roman = a
                hiragana,katakana = kana[:len(kana)/2], kana[len(kana)/2:] # close enough
            elif len(a)==3:
                hiragana,katakana,roman = a
            else:
                raise ValueError('BOOGA')  # unexpected cell layout
            print repr((hiragana, katakana, roman))

More table extraction

For http://www.isbn-international.org/en/identifiers/allidentifiers.html (will need some fixing).

import re, pprint

from BeautifulSoup import BeautifulSoup
bs = BeautifulSoup(file('allidentifiers.html').read())
table = bs.find('table', {'style':'width: 565px;'})

t = {}
identifier = None
val = None

for tr in table.findAll('tr'):
    tds = tr.findAll('td')
    if len(tds)>=2:
        if tds[0].span != None:
            try:
                identifier = int(tds[0].span.string)
            except ValueError: # not an integer - stick with the one we have
                pass
        # the second cell is the value; a row without a new identifier continues the previous one
        val = ''.join( tds[1].findAll(text=True) ).replace('\r\n', ' ')

        if identifier not in t:
            t[identifier] = [val]
        else:
            t[identifier].append(val)

result = {}
for k in t:
    if k != None:  # drop anything seen before the first identifier
        result[k] = ' '.join(t[k])

resultdata = pprint.pformat(result)
print resultdata

f = file('allidentifiers.py', 'w')
f.write(resultdata)
f.close()

Dealing with non-nestedness

For one site I needed the logic "Look for the first node that has text node 'title1' and return a list of all nodes (text nodes, elements) up to the next '<b>' tag"

I needed to fetch a list of things under a heading that wasn't really structurally stored at all. The code I used was roughly:

from BeautifulSoup import BeautifulSoup, NavigableString

example = """
<b>title1</b>
  contents1<br>
  contents2<br>
<b>nexttitle</b>
  contents3<br>
"""


def section(soup, startAtTagWithText, stopAtTagName):
    ret = []
    e = soup.firstText(startAtTagWithText)
    try:
        e = e.next # skip the tag that has the string in it
        while e != None:
            if not isinstance(e, NavigableString): # a Tag
                if e.name == stopAtTagName:
                    break
                else:
                    ret.append(e)
            else: # a NavigableString
                ret.append(e)
            e = e.nextSibling
    except AttributeError: # firstText found nothing, so e was None
        pass
    return ret

section(BeautifulSoup(example), 'title1', 'b')


See also