Harvesting wikipedia

From Helpful
Revision as of 21:59, 31 January 2015 by Helpful (Talk | contribs)

These are primarily notes. They won't be complete in any sense; they exist to contain fragments of useful information.

Note that this page is old and some information on it is no longer correct.


Check whether projects like DBpedia haven't already done what you want.

About markup

To do some information retrieval-like things on a large scale, you shouldn't spider wikipedia.

Options you have instead are:

  • Download static, rendered HTML
  • Download database dumps
  • Export single pages
    • While the article export feature uses XML, that XML only carries the metadata; the article text itself sits inside a single node, in mediawiki's own markup language. There is no formal grammar for that language, and the best working parser is simply an installation of mediawiki (specifically one of its PHP files). The basics of the language are obvious, but there are a lot of smaller exceptions; depending on what you want, you can get away with one of the parsers out there.
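To illustrate pulling the article text out of such an export: a minimal sketch using Python's standard XML parser. The schema namespace URI (export version 0.10) and the sample fragment are assumptions for illustration; check what your export actually declares.

```python
import xml.etree.ElementTree as ET

# The namespace URI depends on the export schema version; 0.10 is an
# assumption here -- read it off the <mediawiki> root element you get.
NS = '{http://www.mediawiki.org/xml/export-0.10/}'

def extract_pages(xml_string):
    """Yield (title, wikitext) pairs from an export XML string.

    The wikitext comes out as-is: raw mediawiki markup, which still
    needs its own parser (or screen-scraping of rendered pages)."""
    root = ET.fromstring(xml_string)
    for page in root.iter(NS + 'page'):
        title = page.find(NS + 'title').text
        text = page.find(NS + 'revision').find(NS + 'text').text
        yield title, text

# Trimmed, made-up export fragment:
sample = """<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Aardvark</title>
    <revision><text xml:space="preserve">The '''aardvark''' is a [[mammal]].</text></revision>
  </page>
</mediawiki>"""

pages = list(extract_pages(sample))
```

For a full dump you would feed this `ET.iterparse` instead of `fromstring`, so the whole file never sits in memory at once.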

Also, about article data, you need to decide whether to:

  • parse the mediawiki markup (perhaps take the mediawiki parser php and make it give you a syntax tree), or
  • screen-scrape the resulting pages

I'm currently trying the latter, because it's less work [for me, that is - it's not exactly easy on the CPU:)].

Static HTML download, content notes

English wikipedia

You can download an every-page-rendered-as-HTML version of various wikipediae on dumps.wikipedia.org (previously static.wikipedia.org).

In ~2007 the english dumps were ~5.5GB compressed as 7zip archives and more than 81GB uncompressed, and included many pages that are not directly content, which you may or may not care about. For example, discussion (Talk) pages contain more informal sentences, which could be useful to corpus linguistics, but not to much else. User talk pages consist largely of identical bot notices, so are generally even less useful.

Some summary (amounts are for the Nov 2006 dump, and may include a good bit of filesystem overhead from the sheer number of files involved):

  • User_talk: (1M files, ~20% of space)
  • Talk: (883K files, ~17% of space (or more, a filesystem filled up while I was doing this))
  • Image: (871K files, ~17% of space) and Image_talk: (19K files, 1% of space or less)
  • User: (286K files, ~5% of space)
  • Wikipedia:, WP: and Wp: (together 180K files, ~6% of space)
  • Template: and Template_talk: (~94K files, 1% of space or so)
  • Portal: pages (17K files, 1% of space or less)
  • Category talk: pages (17K files, 1% of space or less)
  • MediaWiki: and MediaWiki_talk: (together 250 files, 1% of space or less)
  • ...and various other wikipedia namespaces
  • Category: pages (Arguably. They are incomplete for some larger categories, for which they are just the first page of 200; for an example, see e.g. "1970 births")

Without these, about 33GB seemed to be actual content pages, whether regular or not (consider e.g. List of... pages)

Redirect pages

There are a lot of redirect pages.

While in the online wikipedia these may carry interesting information (signaling alias names, commonly mistyped titles and such), the static version does not contain the original title, so they are pretty much non-information. I assume they're left in to avoid broken links while browsing, but if you're here for text extraction, they just make processing slower (not because they take up much space, but because there are so many of them, and they have to be handled both by the filesystem and by you).

These files are roughly between 400 and 700 bytes in size (depending on title length), which makes them fairly easy to find:

find . -type f -size -700c

...and can be removed e.g. like:

find . -type f -size -700c -print0 | xargs -0 rm
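Since a genuinely short article can also fall under 700 bytes, it may be worth checking file content as well as size before deleting. A hedged sketch: the marker string below is an assumption, so inspect a few of the small files first and adjust it to what they actually contain.

```python
import os

# ASSUMPTION: the exact text inside a static-dump redirect page varies;
# 'Redirecting to' is a guess -- verify against a few real small files.
REDIRECT_MARKER = 'Redirecting to'
MAX_REDIRECT_SIZE = 700   # bytes, matching the find commands above

def is_redirect_stub(path, marker=REDIRECT_MARKER):
    "True for small files whose content looks like a redirect page."
    if os.path.getsize(path) >= MAX_REDIRECT_SIZE:
        return False
    with open(path, 'rb') as f:
        return marker.encode() in f.read()

def remove_redirects(top):
    "Walk a dump directory, delete redirect stubs, return the count."
    removed = 0
    for root, dirs, files in os.walk(top):
        for name in files:
            path = os.path.join(root, name)
            if is_redirect_stub(path):
                os.remove(path)
                removed += 1
    return removed
```

Run it as e.g. remove_redirects('./en'); it returns the number of files it deleted, which is handy for a sanity check against the find count.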

Counts per namespace, optional removal

If you want to see the counts per namespace, a quick copy-paste to do so could be:

find . -type f -iname '*~*.html' | awk ' BEGIN { FS="[/~]"} {print $2": "$6}' | \
   sort | uniq -c | sort -r -n > namespace.list

This was made for a directory structure that uses the language code, so that something like less ./en/a/a/r/Aardvarks.html would work.

The command will take a while.

Note that this will give you the meta-wiki pages mentioned above as well as pseudo-namespaced pages, which may be pages you care about. For example, the english wikipedia has a lot of UN LOCODE: pages. If you are going to automate removing everything with a colon in its name, be wary of such cases.

The following python script mostly does the same thing, but also has the option to remove files (same 'execute one level above the country code' requirement).

import glob, os

actuallyRemove = False   # set to True (after a dry run) to really delete files

def removeYN(lns):
    " decide purely from the (lowercased) namespace name whether it is up for deletion "
    if lns == 'wp':
        return True
    for s in ['wiki', 'templat', 'porta', 'imag', 'help', 'hulp', 'talk']:
        if lns.find(s) != -1:
            return True
    return False

for d in glob.glob('*'):
    if os.path.isdir(d):
        print "For country code '%s':" % d
        data = {}   # namespace -> list of full filenames
        for root, dirs, files in os.walk(d):
            for name in files:
                if name.find('~') != -1:
                    fullname = os.path.join(root, name)
                    namespace = name.split('~')[0].lower()
                    if namespace not in data:
                        data[namespace] = []
                    data[namespace].append(fullname)
        for ns in data:
            count = len(data[ns])
            if count > 1:  # if there's one, it's almost certainly a page with a : in its title
                print ("  [%s]: %d" % (ns, count)).ljust(40),
                if count > 100 or removeYN(ns):
                    print "(up for deletion)"
                    if actuallyRemove:
                        print "   removing... "
                        for filename in data[ns]:
                            os.remove(filename)
                        print "   Done."
                else:
                    print "(keeping)"

Usable as python scriptname.py | less. Once you are happy with what it wants to remove, set actuallyRemove to True and run again (use at your own risk!).

The deletion heuristic (an arbitrary threshold of at least 100 files, plus some name substrings) is in fact overzealous and inaccurate. For most analyses there's enough left to work with, of course.

Local wikipedia mirror

The short answer is that you want the static dumps instead unless you really must have an editable copy.

Running a local copy of wikipedia to abuse at will is hard to do correctly. The basic mediawiki installation is easy, but wikipedia not only uses a cutting-edge version of mediawiki, it also uses a large number of extensions.


  • You should use the same version (CVS HEAD) of mediawiki as wikipedia does; its content is fairly cutting-edge in what it uses. See MediaWiki_from_CVS. You can usually get away with older versions too, though.
  • You will need a lot of the extensions listed at [1]; many pages will not display properly without them, particularly the extra parsers.
  • For speed, a PHP opcode cache is quite useful, since it avoids continuously recompiling the PHP; most PHP accelerator packages include one.


Dumps have related tools (see [2]) including mwdumper, which can convert the XML dump to SQL that you can hand to mysql to be imported into a database, e.g.:

java -jar mwdumper.jar --format=sql:1.5 enwiki-20060219-pages-articles.xml.bz2 | \
    mysql --user=user --max_allowed_packet=16M --password databasename


  • The max_allowed_packet option is used to avoid errors; some SQL statements may be larger than a MB when sent to the mysql server. You will have to set this in the server configuration too, see [3].
  • The page, revision, and text tables have to be empty for this (the import will complain about duplicate keys if they're not), which they aren't on a bare mediawiki installation, so do the following first (TRUNCATE is effectively DROP and CREATE, so it is faster than DELETE, though be aware of [4]):
mysql -u user -p wikidatabasename
mysql> TRUNCATE page;
mysql> TRUNCATE revision;
mysql> TRUNCATE text;

Then you wait half a day, because mwdumper is rather slow. You probably want to tune mysql for speed unless you don't mind it taking even longer.

You can also download and insert the .sql files. I'm not sure exactly what most of them do.

MySQL details

Your binlogs will grow like crazy if you have them enabled; you may want to disable them for this database, which will also be faster. If you don't disable them, you probably want to flush them afterwards, though know that flushing is server-wide and makes restoring backups by replaying binlogs from before that flush impossible, which is not ideal, especially if you have other databases in the same MySQL instance.

About categories

This article/section is a stub — probably a pile of half-sorted notes, not well-checked, so it may have incorrect bits. (Feel free to ignore, fix, or tell me)

There are many meta-categories. I filtered many of them out of my final data (names lowercased here, as they are in my data).

Exact names:

  • 'pending deletions', 'deletion log'
  • 'parserfunctions',
  • 'redirects', 'soft redirects', 'category redirects',
  • 'protected redirects', 'protected deleted pages','protected deleted categories'
  • 'disambiguation',
  • 'npov dispute',
  • 'semi protected',
  • 'protected against vandalism', 'protected deleted categories', 'protected deleted pages',
  • 'permanently protected'
  • 'protected templates',
  • 'wikistress',
  • 'article lists', 'incomplete lists',
  • 'stub categories',
  • 'deprecated templates', 'attribution templates'
  • 'templates categorising temporary userpages'
  • 'template tracking categories',
  • 'citation templates', 'deprecated citation templates', 'law citation templates' and a few more like that

Name starts with:

  • 'list of ', 'lists of ',
  • 'wikipedian '
  • 'wikipedia ' (many, including wikipedia ... requests)
  • 'wikiproject ',
  • 'articles needing ', 'articles lacking ', 'articles to ', 'articles for ', 'articles with ', 'categories requiring', 'pages needing ',

Name ends with:

  • ' navigational boxes', ' userboxes'
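The filtering above can be sketched as a small predicate over (lowercased) category names. The lists here contain only a few entries from the bullets above; fill in the rest as needed.

```python
# Abbreviated filter lists -- extend with the full bullets above.
EXACT = {'pending deletions', 'deletion log', 'redirects',
         'disambiguation', 'npov dispute', 'stub categories', 'wikistress'}
PREFIXES = ('list of ', 'lists of ', 'wikipedian ', 'wikipedia ',
            'wikiproject ', 'articles needing ', 'pages needing ')
SUFFIXES = (' navigational boxes', ' userboxes')

def keep_category(name):
    "Return False for meta-categories we want filtered out of the data."
    if name in EXACT:
        return False
    if name.startswith(PREFIXES) or name.endswith(SUFFIXES):
        return False
    return True
```

str.startswith and str.endswith accept a tuple of alternatives, which keeps the prefix/suffix checks down to one call each.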

There are also a lot of categories about things like music and sports.

If you are after data for academic purposes, such as classification, you can save a lot of space by filtering out:

  • albums (~12K connections) and songs (~1.2K connections) (also good for correctness; there are a lot of album and song names with trivial and therefore ambiguous names that are not so interesting outside music systems). Arguably the same thing goes for films (~7K connections)
  • In sports, there are many detailed categories, for example marking people as players of a particular sport, players of a particular nationality, players in particular named teams, players in particular leagues, and players in particular positions. You may want to cull pitchers (a few thousand connections), for example.
  • the Surnames category. People themselves are in there too.