✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).

Corpora and treebanks

Corpora tend to refer to decently curated but fairly unstructured pieces of text, sometimes with syntactic tagging, useful at least as reasonably representative samples of text in a language.

Treebanks usually refer to corpora annotated with syntactic structure (typically parse trees), and sometimes with semantic analysis as well.

Dutch corpora and treebanks

Eindhoven corpus

A tagged Dutch corpus from the seventies, mostly from relatively informal and scientific printed use (verify). Originally compiled to produce a word frequency list for Dutch.

See also:

http://www.inl.nl/index.php?option=com_content&task=view&id=350&Itemid=579&lang=en

CGN

Corpus Gesproken Nederlands (CGN) is a tagged corpus of text that comes from transcriptions of spoken Dutch (in the Netherlands and Flanders).

https://lands.let.ru.nl/cgn/

LASSY

Large Scale Syntactic Annotation of written Dutch

See also:

http://www.let.rug.nl/~vannoord/Lassy/

English corpora and treebanks

20newsgroups

A small dataset of messages from a number of newsgroups, often used when teaching automatic text classifiers.

License: Unclear

See also:

http://people.csail.mit.edu/jrennie/20Newsgroups/

Reuters corpora

Reuters-21578

A corpus that is also often used for categorization tasks.

Copyrighted, free for research use.

See also:

Reuters Corpus, Volume 1 (RCV1)

http://trec.nist.gov/data/reuters/reuters.html

Reuters Corpus, Volume 2 (RCV2)

http://trec.nist.gov/data/reuters/reuters.html

ICAME collection

The International Computer Archive of Modern and Medieval English (ICAME) is a collection of various previously existing corpora.

Written:

Brown Corpus[1]

or fully, the Brown University Standard Corpus of Present-Day American English

Under copyright, usable for research. (verify)

LOB Corpus[2]
Tagged LOB Corpus
Freiburg-LOB Corpus of British English (FLOB)
Freiburg-Brown Corpus of American English (FROWN)
Kolhapur Corpus of Indian English
Australian Corpus of English (ACE)
Wellington Corpus of Written New Zealand English
International Corpus of English - East African component

Spoken:

London-Lund Corpus of Spoken English
Lancaster/IBM SEC Corpus, The Machine-Readable Corpus of Spoken English
Wellington Corpus of Spoken New Zealand English (WSC)
Bergen Corpus of London Teenage Language (COLT)
International Corpus of English - East African component

http://clu.uni.no/icame/manuals/

BNC

British National Corpus (BNC) (tagged)

http://www.natcorp.ox.ac.uk/

ANC

American National Corpus (ANC)

See also:

http://en.wikipedia.org/wiki/American_National_Corpus

COCA

Corpus of Contemporary American English (COCA)

http://www.americancorpus.org/

Oxford English Corpus

See also:

http://en.wikipedia.org/wiki/Oxford_English_Corpus

LUCY

A written UK English treebank.

See also:

http://www.grsampson.net/RLucy.html

Penn Treebank

Resource from the University of Pennsylvania.

See also:

http://www.cis.upenn.edu/~treebank/

Penn Middle English Treebank

See also:

http://www.ling.upenn.edu/hist-corpora/

Wordlists, dictionaries, and thesauri

12dicts

Alan Beale's `12dicts`

A collection of wordlists by Alan Beale that focuses on common words and avoids obscure ones (which e.g. Scrabble lists tend to include).

Named for the fact that its creator took/checked the words from/with various dictionaries (verify). AGID was also involved.

Most of the data is placed in the Public Domain. Some relatively recent revisions include information derived from SCOWL and WordNet, and those parts therefore carry the respective license restrictions.

Basic set (public domain):

lists of words that appear in at least n of the source dictionaries
- 2of12.txt - appear in at least two source dictionaries
- 6of12.txt - appear in at least six source dictionaries. Marked with some extra characters, see the documentation for them.
3esl.txt - words from ESL 'core vocabulary'

5desk.txt - meant as an everyday dictionary, something between a core vocabulary list and a dictionary wordlist (verify)

2of4brif.txt - internationalized wordlist (most of the rest is American-geared)

There are annotations you may want to use, or filter out (see the README):

a = at the end of a line marks a second-class word (mostly inflected/derived forms)
a + at the end of a line marks a signature word
a : at the end of a line marks an abbreviation (of some sort)
a & at the end of a line marks a word that is not part of common usage in America
a ^ at the end of a line marks a word that was arbitrarily chosen from a set that has no clear primary form
a < at the end of a line marks a form that is not clearly primary
a # at the end of a line marks a form considered a variant (non-lemma)
a % at the end of a line marks uncountable nouns (in 2of12 only?)
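
As a minimal sketch of using or filtering out these annotations, the following assumes each line holds one word followed by zero or more of the marker characters listed above. The function names and the exact marker set are illustrative assumptions, so check the README for the file you actually use.

```python
# Marker characters taken from the list above; which ones occur depends on the file.
MARKERS = "=+:&^<#%"

def split_markers(line):
    """Split a wordlist line into (word, set of trailing marker characters)."""
    text = line.strip()
    word = text.rstrip(MARKERS)
    return word, set(text[len(word):])

def plain_words(lines, reject=""):
    """Yield words without markers, skipping lines that carry any marker in `reject`."""
    for line in lines:
        word, marks = split_markers(line)
        if word and not (marks & set(reject)):
            yield word
```

For example, passing reject="&" would drop the words marked as not being in common American usage while keeping everything else with its markers stripped.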

Recent additions (rev 5? or earlier?):

neol2007.txt - some neologisms (public domain)

2of12inf.txt - The more recently added inflections mentioned above.
2+2gfreq.txt and 2+2lemma.txt are based on 2of12inf

Kevin Atkinson's `alt12dicts`

The 'Unofficial Alternate 12 Dicts Package' is a transformation of 12dicts that stores mostly the same information.

Its 2of12full.txt contains:

how many dictionaries contained the entry (making a 6of12.txt redundant)
how many list it as a non-variant American word
how many list it as a variant form
how many list it as a non-American word

Most interesting files are exactly as they are in the original 12dicts. A few files were added.

COMLEX

See also:

http://nlp.cs.nyu.edu/comlex/

DICT

DICT can refer to:

DICT Development Group (http://dict.org/)
their dictionary network protocol - see also RFC 2229
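
To give a feel for the protocol, here is a rough sketch of the client side of a DEFINE exchange as described in RFC 2229: building the command, reading the status code, and collecting a text block that ends with a lone period. The helper names are made up for illustration, and no network connection is opened here.

```python
DICT_PORT = 2628  # TCP port assigned to the DICT protocol in RFC 2229

def define_command(word, database="*"):
    """Build a DEFINE command; "*" asks for matches from all databases."""
    quoted = word.replace("\\", "\\\\").replace('"', '\\"')
    return 'DEFINE {} "{}"\r\n'.format(database, quoted)

def status_code(line):
    """Return the three-digit status code that starts a response line."""
    return int(line[:3])

def read_text_block(lines):
    """Collect text lines up to the lone "." terminator, undoing dot-doubling."""
    block = []
    for line in lines:
        line = line.rstrip("\r\n")
        if line == ".":
            break
        if line.startswith("."):
            line = line[1:]  # a leading period in the text is sent doubled
        block.append(line)
    return block
```

In a real client the lines would come from a socket connected to a DICT server (for example via a file object from socket.makefile()), and each definition's text block would follow a 151 status line.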

See also:

http://en.wikipedia.org/wiki/DICT

Various wordlists

Games

Largely for Scrabble and the like, in which case they are plain lists of words.

Note that there are multiple revisions of most of these lists.

English:

OSPD (Official Scrabble Players Dictionary)

OSW (Official Scrabble Words) is a British list based on the Chambers Dictionary

SOWPODS (derived from OSPD and OSW, and some others)

Used in Scrabble world championships, and in Britain, Australia, New Zealand, and some others.

Official Tournament and Club Word List or Tournament Word List (TWL, OWL, or OTaCWL)
TWL98
TWL3

These seem to be used in North America (the US and Canada).

Enhanced North American Benchmark LExicon (ENABLE), and ENABLE2K; see also [7]. Based on the OSPD and Merriam-Webster. Made by Alan Beale. Public Domain.

French:

L'Officiel du Scrabble (ODS)

Romanian:

Lista Oficiala de Cuvinte 2000 (LOC2000)

Dutch:

Scrabble Woorden Lijst (SWL)

Italian:

PARO (unofficial; there is no official Italian Scrabble list (still not?))
ZINGA

Spanish:

(verify) DRAE22, Diccionario de la Real Academia Española, 22nd edition.

See also:

http://www.math.toronto.edu/jjchew/scrabble/lists/

Other / unsorted

From Kevin Atkinson - generally under various copyrights (permissive for educational purposes):

Spell Checker Oriented Word Lists (SCOWL)
Automatically Generated Inflection Database (AGID)
VarCon (Variant Conversion Info)

See also:

[4]

Greg's Babble Dictionary - originally at www.photosphere.us/gj_babble_db.txt, now moved and replaced with a slightly slimmed version, at [5]. Based on ENABLE.

Crossword/cryptic wordlists may e.g. include synonyms, relation to common questions, and such.

UK Advanced Cryptics Dictionary (UKACD) [6]

Webster's 1913 Dictionary

One of the early (late eighteen hundreds), large English dictionaries available in the US.

Interesting because the 1913 edition's copyright has lapsed, so it is in the Public Domain, and because it is fairly easily available (although it needs a fair amount of work to parse well).

See for example the version in Project Gutenberg (search link below), which has been augmented by Micra, Inc.

Legalities seem a little unclear to me. There is certainly a later version under Micra's copyright, whereas the copies in Project Gutenberg mention that Micra's additions to them (the tags) are copyrighted and that the basic text version is Public Domain. It is unclear to me what Micra's base resource was.

See also:

OPTED

The Online Plain Text English Dictionary (OPTED) is a resource based on the public domain 1913 version of Webster's Dictionary in Project Gutenberg.

I have not checked the license details. It seems public domain - except with the requirement that it stays public domain and free of cost.

See also:

http://msowww.anu.edu.au/~ralph/OPTED/

CIDE

The Collaborative International Dictionary of English (CIDE) is derived from the 1913 (public domain) Webster's Dictionary.

There are various derivative versions.

See also:

http://en.wikipedia.org/wiki/Collaborative_International_Dictionary_of_English

GCIDE

GNU Collaborative International Dictionary of English

GPL license

See also:

http://en.wikipedia.org/wiki/GCIDE

Roget's thesaurus

A thesaurus from 1911.

The roget15a.txt file as stored in Project Gutenberg is not the original transcribed text. Some proof-reading, restructuring, extra marking, and a little content addition was done by Micra, Inc. [11]. They claim no restrictions on this version of their additions; the resource is in the Public Domain, with Gutenberg's basic, strippable restrictions.

From the notes included in that version:

Likely still contains errors, though it has had reasonable proof-reading.
It has been supplemented, but is still far from up to date with modern English.
About 1500 verbs (out of 6500) which can be found in a modern 80,000-word spell-checker are absent
Nouns are probably worse, as there are many words coined since, particularly in technical areas.
A good deal of words in here are not recognized by a modern spell checker, in most cases because they are Latin or obsolete.
Such non-modern cases are usually marked with comments in square brackets.

On formatting:
- Comments that are not thesaurus content are delimited <-- like so -->
- Section headings are included between percent (%) markers. (also seem to be all uppercase)
- @ and things within {} are internal references (by number), which you could ignore (as they seem incomplete, an experiment - though they do also seem useful)
- each of the ~1000 main entries has a pound sign (#) in front of an identifier (enumeration).
- Greek words and phrases are transliterated, and formatted <gr/like so/gr>
- obsolete words (in the 1911 version) are marked with a pipe, a.k.a. vertical bar (|)
- words current in 1911 but not in a current college-sized dictionary are marked with a pipe and exclamation (|!) (not complete) or [obs3] (probably also not complete, but better)
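
A few of these conventions can be picked apart with regular expressions. The following is only a sketch based on the descriptions above, not on a careful reading of the file: the patterns (in particular the assumed shape of entry identifiers, digits with an optional letter and a trailing period) and the function names are assumptions that should be checked against the actual text.

```python
import re

COMMENT = re.compile(r"<--.*?-->", re.DOTALL)  # non-content comments
GREEK = re.compile(r"<gr/(.*?)/gr>")           # transliterated Greek
HEADING = re.compile(r"%([^%]+)%")             # section headings between % markers
ENTRY = re.compile(r"#(\d+[a-z]?)\.")          # assumed shape of main entry identifiers

def strip_comments(text):
    """Remove everything delimited like <-- this -->."""
    return COMMENT.sub("", text)

def greek_phrases(text):
    """Return the transliterated Greek phrases found in the text."""
    return GREEK.findall(text)

def section_headings(text):
    """Return section headings, ignoring anything inside comments."""
    return [h.strip() for h in HEADING.findall(strip_comments(text))]

def entry_ids(text):
    """Return the identifiers of main entries, under the assumed pattern."""
    return ENTRY.findall(strip_comments(text))
```

Comments are stripped before looking for headings and entries so that markers inside comments are not mistaken for content.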

See also:

http://www.gutenberg.org/etext/22

Wiktionary

Wiktionary is a user-contributed dictionary/thesaurus for various languages.

It could be a useful resource.

See also:

http://en.wiktionary.org/

ICE

International Corpus of English (ICE)

A text corpus that consists in good part of transcribed spoken English.

See also:

Internet Dictionary Project

A semi-personal, semi-public project that ran from 1995 to 2007, meant to create free translation dictionaries.

Copyrighted, but royalty-free, and seemingly legally unrestricted (for uses other than selling unaltered copies for money).

See also:

Translation

Freedict

A number of bilingual translation dictionaries, under the GPL license.

See also:

Apertium

https://github.com/apertium

Phonetic

TIMIT

TIMIT is a corpus of read speech with phonemic and lexical transcriptions, recorded from speakers of various dialects of American English. It was commissioned by DARPA.

Uses its own phonetic script; see Phonetic scripts#TIMIT

Its name comes from some of the parties in its creation, Texas Instruments (recording) and MIT (transcription).(verify)

Paid-for corpus.(verify)

There is also a Network TIMIT (telephone network quality sound(verify)).

CMUdict

The CMU Pronouncing Dictionary is a public domain resource from Carnegie Mellon.

It encodes pronunciation using the ARPAbet system (see Phonetic scripts#ARPAbet). (verify)
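
As a small sketch of reading such entries, the following assumes the commonly distributed plain-text layout: one entry per line, the word followed by whitespace-separated phoneme symbols, alternative pronunciations written as WORD(1), and comment lines starting with ";;;". The function names are made up for illustration.

```python
import re

VARIANT = re.compile(r"\(\d+\)$")  # alternative pronunciations: WORD(1), WORD(2), ...

def parse_entry(line):
    """Return (word, list of phoneme symbols), or None for comments and blank lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *phones = line.split()
    return VARIANT.sub("", head), phones

def stress_pattern(phones):
    """Join the stress digits attached to vowels (0 none, 1 primary, 2 secondary)."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())
```

Stripping the variant suffix means several lines can map to the same word, so a real loader would probably collect pronunciations into a list per word.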

See also:

Longman Pronunciation Dictionary

See also:

CAAPR

Combined Anglo-American Pronunciation Reference, which has British English and American English pronunciations

By Alan Beale, and using his FLOSS phonemic system.

Based on FEWL (for American) and the EPD (for British).

License: Unclear(verify)

Note: FEWL is an earlier version of CAAPR

See also:

http://www.wyrdplay.org/AlanBeale/FEWL-ref.html

EPD

English Pronouncing Dictionary (EPD), published by the Cambridge University Press

See also:

Morphology

CELEX

Lexical database. For Dutch, English, and German, it has word forms, phonological and morphological information, frequencies, and more.

Has a revision - you probably now want CELEX2.

See also:

http://www.ldc.upenn.edu/Catalog/CatalogEntry.jsp?catalogId=LDC96L14

Mixed

Moby

Moby is a resource created by and placed in public domain by Grady Ward.

Consists of a number of sub-resources, including

part of speech information
pronunciation information
various wordlists
hyphenation information (a large list of examples)
more

It is a large data set, of decent quality (though with some known peculiarities and flaws).

Where non-ASCII is used, these files seem to be coded in one of:

Macintosh Roman [12]
Code page 437 [13]
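
Since both of these are single-byte encodings in which every byte decodes to something, a wrong guess will not raise an error; it just produces the wrong characters. A minimal sketch for comparing the two candidates on a sample of bytes (the function name is made up, the codec names are Python's standard ones):

```python
CANDIDATES = ("mac_roman", "cp437")

def decode_candidates(raw):
    """Decode the same bytes under each candidate encoding for side-by-side inspection."""
    return {name: raw.decode(name) for name in CANDIDATES}

if __name__ == "__main__":
    # Byte 0x8E is e-acute in Macintosh Roman but A-diaeresis in code page 437.
    for name, text in decode_candidates(b"caf\x8e").items():
        print(name, text)
```

Looking at a handful of accented words from a file this way is usually enough to tell which of the two encodings that file uses.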

See also:

And perhaps:

http://wixml.com/moby.html

Focus on semantics

Things like synonym sets, word senses, ontological structures, and such.

To a lesser extent, any corpus could apply. Note e.g. that some of the above have some limited semantic tagging.

VerbOcean

A semantic network of verbs

University of Southern California

See also:

WordNet and variants

WordNet is a semantic lexicon developed by Princeton University. It has a decent number of word-sense pairs, and some short dictionary-like explanations.

It does not contain etymological information, nor much information about inflection or other morphology.

It provides information on nouns, verbs, adjectives and adverbs. Each category can be queried separately, which lets you select the intended sense when there are homographs across lexical categories.

For nouns, it has synonyms, hypernyms, hyponyms, holonyms, meronyms, and coordinate terms (words that share a hypernym).

For verbs, it has hypernyms (more general actions), troponyms (more specific manners of an action), entailment (implied actions), and coordinate terms.

For adjectives, it has related nouns, and a relation to the verb if the adjective is a participle.

For adverbs, it has the root adjective.
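
To illustrate how these relations fit together, here is a toy sketch of deriving hyponyms and coordinate terms from a hypernym table. The data is made up for the example and this is not WordNet's own content, file format, or API.

```python
# Toy hypernym table (made-up data, not taken from WordNet).
HYPERNYMS = {
    "dog": {"canine"},
    "wolf": {"canine"},
    "fox": {"canine"},
    "canine": {"carnivore"},
    "cat": {"feline"},
}

def hyponyms(term):
    """Terms that list `term` as a direct hypernym."""
    return {word for word, parents in HYPERNYMS.items() if term in parents}

def coordinate_terms(term):
    """Terms that share at least one direct hypernym with `term`."""
    siblings = set()
    for parent in HYPERNYMS.get(term, ()):
        siblings |= hyponyms(parent)
    siblings.discard(term)
    return siblings
```

In the real resource these relations hold between senses (synonym sets) rather than bare word strings, which is what makes sense selection matter.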

Under copyright, provided under a BSD-like license.

Exactly which BSD variant it corresponds to is not entirely clear (this can be important, as the 4-clause-style BSD license is less compatible with other licenses).

It looks to effectively be most like a 2-clause BSD license, which would make it compatible with GPL, LGPL and such, as long as the WordNet license is preserved (as the 'appropriate copyright notice').

Princeton seems to have indicated this, and the FSF has suggested something along similar lines.

See also:

http://en.wikipedia.org/wiki/Wordnet

Extended WordNet

A resource that parsed the glosses (definitions) in WordNet, to be used as additional information.

BSD style license

See also:

GeoWordNet

A variant of wordnet with geospatial information; a merger of Wordnet, GeoNames, and the Italian part of MultiWordNet.

License: CC-by 3.0

Other-language wordnets

Some:

EuroWordNet
- English, Dutch, Estonian, French, German, Italian, Spanish
- http://en.wikipedia.org/wiki/EuroWordNet
- License:
  - The whole distributed via ELDA/ELRA (restrictive(verify))
  - Each wordnet can be released by its creator too

MultiWordNet - Italian
- http://multiwordnet.itc.it/english/home.php

Spanish Wordnet (TALP Group at the Universitat Politecnica de Catalunya (Spain))

Portuguese Wordnet (NLX-Group at the University of Lisbon (Portugal))

Hebrew Wordnet (Computational Linguistic Group at the University of Haifa (Israel))

Romanian Wordnet ("Alexandru Ioan Cuza" University of Iasi (Romania))

Latin WordNet (University of Verona (Italy))

BalkaNet

Asian WordNet
- Thai, Korean, Japanese, Indonesian, Myanmar, Vietnamese, Mongolian, Bengali

...many others

Global WordNet Grid
- has an overview list of wordnets for various languages.
- http://www.globalwordnet.org/gwa/gwa_grid.htm

MontyLingua

A toolkit that deals with POS tagging, chunking, lemmatizing, certain specific extractions, and uses information from Open Mind Common Sense to avoid some errors.

GPL license for non-commercial use (e.g. research), and licensable for commercial use.

MontyLingua3 seems to be a fork that is GPL for any purpose [14].

Python, Java

See also:

DAESO

Detecting And Exploiting Semantic Overlap (DAESO) is a project that studies expressions of similar information.

CORNETTO

Aims to build a lexical resource with semantic-style correlations.

See also:

http://www2.let.vu.nl/oz/cornetto/

SenseEval

http://www.senseval.org/

http://www.d.umn.edu/~tpederse/data.html

SenseClusters

See also:

http://senseclusters.sourceforge.net/

A largely user-contributed knowledge base, hosted by a commercial company, but apparently available under CC Attribution.

See also:

FrameNet

From Berkeley (ICSI)

See also:

FrameNet: Theory and practice (book)
http://en.wikipedia.org/wiki/FrameNet

DBpedia

A semantic resource created by harvesting data from Wikipedia (e.g. using page categories and geographic templates).

GFDL content.

See also:

http://dbpedia.org/

DOLCE

Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)

Apparently LGPL content.

See also:

http://www.loa-cnr.it/DOLCE.html

BFO

Basic Formal Ontology

Similar in idea to DOLCE, SUMO.

See also:

http://www.ifomis.org/bfo

SUMO

Suggested Upper Merged Ontology (SUMO) refers to a project that creates broad-level ontologies, which don't run into that many ambiguities. Has mappings to WordNet.

Available under the GPL.

Now also has MId-Level Ontologies (MILO)

See also:

YAGO

A knowledge base based on data from Wikipedia and WordNet.

Apparently available under the GFDL.

See also:

COSMO

COmmon Semantic MOdel (COSMO)

By Micra Inc.

Takes data from OpenCyc, SUMO, BFO, DOLCE

TEASE

Can refer to the algorithm, as well as its results from a large-scale execution.

Bar-Ilan University

http://www.aclweb.org/aclwiki/index.php?title=TEASE

Cyc

Cyc is an AI reasoning project, and consists mainly of an ontological knowledge base, supporting assertions, and a number of tools.

The full Cyc knowledge base is available only within Cycorp.

A subset of Cyc is released as OpenCyc (available under CC Share Alike (verify) or Apache (verify)). It contains mostly the ontological facts, but little of the extra knowledge present in the assertions.

Recently there is also ResearchCyc, which lies somewhere between Cyc and OpenCyc in detail, and has a specific license permissive for research (but not really FOSS compatible).

See also:

Open Mind Common Sense

MIT's Open Mind Common Sense project (a.k.a. just 'Common Sense') is an AI reasoning project, or rather, it aims to store a lot of knowledge commonly used in real-world reasoning and disambiguation.

Similar to Cyc in some ways (though knowledge is stored in a sentence-based form rather than in strictly formal relations).

Derivatives/relations based on (parsed) Common Sense data:

ConceptNet
LifeNet
StoryNet
ShapeNet

License is generally GPL for non-commercial use such as research. (verify)

Conceptnet3 can be used under GPL, or CC Attribution.

See also:

http://rct.media.mit.edu/rct/lifenet.html

http://csc.media.mit.edu/StoryNetAcquisitionHome.htm

MontyLingua uses Common Sense data

Unsorted

DoReCo

Aiksaurus

Aiksaurus is a C++ thesaurus library, an OSX interface, and data.

Available under GPL.

See also:

http://aiksaurus.sourceforge.net/

PAROLE

Preparatory Action for linguistic Resources Organisation for Language Engineering (PAROLE)

Copyrighted, non-distribution license.

See also:

LEFFF

LExique des Formes Fléchies du Français (LEFFF), a French morphological lexicon. The first version has only verbs, the second all lexical categories.

Under the permissive LGPL-LR license.

See also:

Proton

See also:

http://bach.arts.kuleuven.be/PA/proton.html

SALSA

Saarbrücken Lexical Semantics Acquisition Project (SALSA)

See also:

http://www.coli.uni-saarland.de/projects/salsa/page.php?id=index

SemCor and variants

SemCor is a tagged text corpus based on the intersection between the Brown corpus and WordNet.

See also:

MultiSemCor is an English/Italian parallel corpus based on SemCor and professional translations.

See also:

http://multisemcor.itc.it/index.php

SemiSUSANNE

A corpus generated from the overlap between SUSANNE and SemCor (and thereby indirectly WordNet).

See also:

http://www.grsampson.net/SemiSueDoc.html

SUSANNE

SUSANNE can refer to the SUSANNE scheme as well as the SUSANNE corpus.

The SUSANNE corpus was created from a subset of the Brown corpus according to the SUSANNE scheme. Its focus on quality rather than quantity makes it an interesting resource.

There seems to be no mention of a license, other than a note that it is freely usable for research.

See also:

http://www.grsampson.net/RSue.html

The CHRISTINE corpus extends SUSANNE for spoken English.

See also:

http://www.grsampson.net/RChristine.html

TIGER

The TIGER treebank is a German treebank that comes from newspapers

See also:

http://www.ims.uni-stuttgart.de/projekte/TIGER/TIGERCorpus/

TIME corpus

a.k.a. Time Magazine Corpus, based on Time magazine, and with a web interface.

See also:

http://corpus.byu.edu/time/

Wall Street Journal corpus

Wall Street Journal (WSJ) corpus

treebank?

Wellington corpus

A corpus of New Zealand English, or rather, two:

Wellington Corpus of Written New Zealand English (WWC)
Wellington Corpus of Spoken New Zealand English (WSC)

See also:

http://khnt.hit.uib.no/icame/manuals/wsc/

TwNC

The Twente News Corpus (TwNC) is a Dutch text corpus of content from newspapers, teletext subtitling and such.

See also:

http://wwwhome.cs.utwente.nl/~druid/TwNC/TwNC-main.html

VerbNet

Lexical resource from the University of Colorado (in Boulder).

Details verbs, with semantic and syntactic information (extending the verb classes of Levin 1993).

Related to PropBank, FrameNet.

Has its own license, which looks BSD-like, with a story similar to (verify) that of (the unrelated) WordNet.

(See http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html)

PropBank

See also:

NomBank

New York University

Related to PropBank (verify)

See also:

http://nlp.cs.nyu.edu/meyers/NomBank.html

Nomlex

Nominalizations

New York University

See also:

http://nlp.cs.nyu.edu/nomlex/index.html

Potential

These lists are here to be able to easily add a corpus/resource without having to immediately add a wiki page that actually says something useful about it.

There's always work in writing a short descriptive article for these.

Potential corpora

Project Gutenberg

Open Content Alliance sources (mostly hosted by Internet Archive) [15] [16]

Wikipedia (e.g. the static HTML downloads)

Government projects (often public domain)

The web, see e.g. W3Corpora

Academic projects

Corpora are commonly lexically annotated text, or treebanks, or just collections of documents that have been manually touched up and verified.

Single-language corpora

English corpora

Text

Lancaster-Oslo/Bergen (LOB) [17] and a subcorpus, the Lancaster Parsed Corpus [18]

Spoken/Dialogue

London-Lund Corpus (LLC) [19]
Spoken Corpus of British English (SCRIBE)
Lancaster/IBM Spoken English Corpus (SEC) and MARSEC, a machine geared variation
Dialogue Diversity Corpus (DDC) [20]
Göteborg Spoken Language Corpus (GSLC) [21]

ICSI Meeting Corpus [22]

Children

Polytechnic of Wales (POW)
Child Language Data Exchange System (CHILDES)

Specific

International Corpus of English (ICE) (non-natives), with subcorpora such as ICE-GB [23]
MICASE (academic) [24]
Corpus of Professional, Spoken American-English (CPSA) (nonfree) http://www.athel.com/cpsa.html
Penn-Helsinki (middle english, nonfree) [25]

Russian

http://www.orc.ru/~patrikey/liblib/enauth.htm

Eastern European

Czech

The Prague Dependency Treebank [26]

Slovene

Slovene Dependency Treebank [27]

Bulgarian

BulTreeBank (treebank) [28]

Bosnian

The Oslo Corpus of Bosnian Texts [29]

Western European

German

Negra (news) [30]
Cosmas [31]

French

American and French Research on the Treasury of the French Language (ARTFL) [32]

Spanish, Portuguese

TychoBrahe (historical) [33]
CETEMPúblico [34]
Acesso a corpora/Disponibilização de corpora (AC/DC) [35]
[36]
CUMBRE

Dutch

Text

Twente Nieuwscorpus (news) [37]
Alpino (treebank) [38]

Spoken

Corpus Gesproken Nederlands (CGN) [39]

Scandinavian

Scarrie (news) [40]

Norwegian

Oslo-Korpuset [41]

Swedish

SUC [42]
Spraakbanken [43]

Danish

Korpus2000 [44]
Danish Dependency Treebank (treebank) [45]

Asian

Chinese

Lancaster Corpus of Mandarin Chinese (LCMC) [46]
Penn Chinese Treebank (treebank) [47]

Korean

http://www.cis.upenn.edu/~xtag/koreantag/index.html

Multi-language corpora

Usually parallel, possibly aligned.

TODO: sort out http://tcc.itc.it/people/forner/multilingualcorpora.html

EuroParl (Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, Swedish) [48] - tokenized, sentence-aligned

MULTEXT-EAST: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and Serbian [49]

JRC-Acquis (EU law in Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish) [50]

ENPC (English-Norwegian Parallel Corpus Project) [51]

CRATER (English, French, Spanish) [52]

TRIPTIC (TRIlingual Parallel Text Information Corpus) (English, French and Dutch)

OPUS [53] [54]

SVEZ-IJS ACQUIS (Slovene, English) [55] and

IJS - ELAN (Slovene, English) [56]

EMILLE [57]

Lilabar (English, Russian) [58]

Parallel Corpus of English and Czech Texts [59]

COMPARA (Portuguese, English) [60]

MULTEXT [61]

http://www.athel.com/parallel_corpora.html

Semantic resources

http://en.wikipedia.org/wiki/Upper_ontology_%28computer_science%29#Available_ontologies

Unsorted (potential)

http://www.sil.org/linguistics/ETEXT.HTML#texts

http://www.net-comber.com/wordurls.html

http://www.lib.utexas.edu/books/etext.html

http://www.ldc.upenn.edu/Catalog/byType.jsp

http://www.clres.com/dict.html

http://www.aclweb.org/aclwiki/index.php?title=Knowledge_collections_and_datasets_(English)

http://www.comp.lancs.ac.uk/computing/research/stemming/Links/resources.htm

http://crl.ucsd.edu/corpora/

http://www.ldc.upenn.edu/Catalog/

http://kevinchai.net/datasets/

Public domain stuff

http://www.instructionaldesign.org/public_domain.html

http://www.eusd4kids.org/edtech/public_domain.htm

http://banis-associates.com/pdlist/

http://copyrightfree.blogspot.com/

http://www.thepublicdomain.net/
