Linguistic data and resources
⌛ This hasn't been updated for a while, so it could be outdated (particularly if it's about something that evolves constantly, such as software or research).
See also Linguistics software
Corpora and treebanks
Corpora tend to refer to decently curated but fairly unstructured pieces of text, sometimes with syntactic tagging. They are useful at least as decently representative samples of text in a language.
Treebanks usually refer to corpora with syntactic, structural, and semantic analysis.
Dutch corpora and treebanks
Eindhoven corpus
A tagged Dutch corpus, from the seventies, and mostly from relatively informal and scientific printed use (verify). Originally compiled to produce a word frequency list for Dutch.
See also:
CGN
Corpus Gesproken Nederlands (CGN) is a tagged corpus of text that comes from transcriptions of spoken Dutch (in the Netherlands and Flanders).
LASSY
Large Scale Syntactic Annotation of written Dutch
See also:
English corpora and treebanks
20newsgroups
A small dataset of messages from a number of newsgroups, often used when teaching automatic text classifiers.
License: Unclear
See also:
Reuters corpora
Reuters-21578
A corpus also often used for categorization tasks
Copyrighted, free for research use.
See also:
- http://kdd.ics.uci.edu/databases/reuters21578/reuters21578.html or
- http://www.daviddlewis.com/resources/testcollections/reuters21578/
Reuters Corpus, Volume 1 (RCV1)
http://trec.nist.gov/data/reuters/reuters.html
Reuters Corpus, Volume 2 (RCV2)
http://trec.nist.gov/data/reuters/reuters.html
ICAME collection
The International Computer Archive of Modern and Medieval English is a collection of various previously-existing corpora
Written:
- Brown Corpus[1]
- or fully, the Brown University Standard Corpus of Present-Day American English
- Under copyright, usable for research. (verify)
- LOB Corpus[2]
- Tagged LOB Corpus
- Freiburg-LOB Corpus of British English (FLOB)
- Freiburg-Brown Corpus of American English (FROWN)
- Kolhapur Corpus of Indian English
- Australian Corpus of English (ACE)
- Wellington Corpus of Written New Zealand English
- International Corpus of English - East African component
Spoken:
- London-Lund Corpus of Spoken English
- Lancaster/IBM SEC Corpus, The Machine-Readable Corpus of Spoken English
- Wellington Corpus of Spoken New Zealand English (WSC)
- Bergen Corpus of London Teenage Language (COLT)
- International Corpus of English - East African component
http://clu.uni.no/icame/manuals/
BNC
British National Corpus (BNC) (tagged)
ANC
American National Corpus (ANC)
See also:
COCA
Corpus of Contemporary American English (COCA)
http://www.americancorpus.org/
Oxford English Corpus
See also:
LUCY
A written UK English treebank.
See also:
Penn Treebank
Resource from the University of Pennsylvania.
See also:
Penn Middle English Treebank
See also:
Wordlists, dictionaries, and thesauri
12dicts
Alan Beale's 12dicts
A collection of wordlists focusing on common, non-obscure words (unlike e.g. Scrabble lists, which tend to include obscure ones), made by Alan Beale.
Named for the fact that its creator took/checked the words from/with various dictionaries (verify). AGID was also involved.
Most of the data is placed in the Public Domain. Some relatively recent revisions include information derived from SCOWL and WordNet, and therefore carry the respective license restrictions.
Basic set (public domain):
- lists of words that appear in more than n dictionaries
- 2of12.txt - appear in at least two source dictionaries
- 6of12.txt - appear in at least six source dictionaries. Marked with some extra characters, see the documentation for them.
- 3esl.txt - words from ESL 'core vocabulary'
- 5desk.txt - meant as an everyday dictionary, something between a core vocabulary list and a dictionary wordlist (verify)
- 2of4brif.txt - internationalized wordlist (most of the rest is American-geared)
There are annotations you may want to use, or filter out (see the README):
- a = at the end of a line marks a second-class word (mostly inflected/derived forms)
- a + at the end of a line marks a signature word
- a : at the end of a line marks an abbreviation (of some sort)
- a & at the end of a line marks a word that is not part of common usage in America
- a ^ at the end of a line marks a word that was arbitrarily chosen from a set that has no clear primary form
- a < at the end of a line marks a form that is not clearly primary
- a # at the end of a line marks a form considered a variant (non-lemma)
- a % at the end of a line marks uncountable nouns (in 2of12 only?)
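If you want plain words, or want to filter on these annotations, the trailing characters can be split off before use. The following is a minimal sketch, assuming the markers only ever appear as a run at the very end of a line (check the README for the file you actually use). The function names and the sample lines are made up for illustration and are not taken from 12dicts itself.

```python
# Trailing annotation characters, as described in the list above.
MARKERS = {
    "=": "second-class word",
    "+": "signature word",
    ":": "abbreviation",
    "&": "not in common American usage",
    "^": "arbitrarily chosen primary form",
    "<": "form that is not clearly primary",
    "#": "variant (non-lemma)",
    "%": "uncountable noun",
}


def split_annotations(line):
    """Return (word, set of marker characters) for one line of a list."""
    word = line.strip()
    found = set()
    # Assumption: markers only occur as a run at the end of the line.
    while word and word[-1] in MARKERS:
        found.add(word[-1])
        word = word[:-1]
    return word, found


def plain_words(lines, drop=frozenset()):
    """Yield bare words, skipping any word that carries a marker in `drop`."""
    for line in lines:
        word, found = split_annotations(line)
        if word and not (found & set(drop)):
            yield word


# Hypothetical sample lines, only to show the behaviour.
sample = ["apple", "abbr:", "colour&", "data%", ""]
print(list(plain_words(sample, drop={":", "&"})))
```

Passing an empty `drop` keeps every word and only strips the markers, which is probably what you want when you just need a wordlist.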
Recent additions (rev 5? or earlier?):
- neol2007.txt - some neologisms (public domain)
- 2of12inf.txt - The more recently added inflections mentioned above.
- 2+2gfreq.txt and 2+2lemma.txt are based on 2of12inf
Kevin Atkinson's alt12dicts
The 'Unofficial Alternate 12 Dicts Package' is a transformation of 12dicts that stores mostly the same information.
Its 2of12full.txt contains:
- how many dictionaries contained the entry (making a 6of12.txt redundant)
- how many list it as a non-variant American word
- how many list it as a variant form
- how many list it as a non-American word
Most interesting files are exactly as they are in the original 12dicts. A few files were added.
COMLEX
See also:
DICT
DICT can refer to
- DICT Development Group (http://dict.org/)
- their dictionary network protocol - see also RFC 2229
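As a rough illustration of the protocol: a client connects over TCP (port 2628 by default), sends a line such as DEFINE followed by a database name and a word, and reads back status lines and dot-terminated text blocks. The sketch below only builds the command and parses an already-received response, and leaves out the network exchange. The status codes used (151 before each definition, a lone "." ending it) are my reading of RFC 2229, so verify against the RFC before relying on this. The function names and the sample response are made up.

```python
def define_command(word, database="*"):
    """Build a DEFINE command line; "*" asks for all databases."""
    return 'DEFINE {} "{}"\r\n'.format(database, word).encode("utf-8")


def parse_define_response(lines):
    """Collect definition texts from response lines (CRLF already removed)."""
    definitions = []
    body = None  # list of text lines while inside a definition block
    for line in lines:
        if body is not None:
            if line == ".":
                # A lone dot terminates the text block.
                definitions.append("\n".join(body))
                body = None
            else:
                # Undo dot-stuffing: a leading ".." stands for a literal ".".
                body.append(line[1:] if line.startswith("..") else line)
        elif line.startswith("151"):
            # 151 introduces one definition; its text follows.
            body = []
    return definitions


# Hypothetical server response, only to show the parsing.
response = [
    "150 1 definitions retrieved",
    '151 "cat" wn "WordNet"',
    "cat",
    "  n 1: feline mammal",
    ".",
    "250 ok",
]
print(define_command("cat"))
print(parse_define_response(response))
```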
See also:
Various wordlists
Games
Largely for scrabble and the likes, in which case they are plain lists of words.
Note that there are multiple revisions of most of these lists.
English:
- OSPD (Official Scrabble Players Dictionary)
- OSW (Official Scrabble Words) is a British list based on the Chambers Dictionary
- SOWPODS, derived from OSPD and OSW
- Used in Scrabble world championships, and in Britain, Australia, New Zealand, and some others.
- Official Tournament and Club Word List or Tournament Word List (TWL, OWL, or OTaCWL)
- TWL98
- TWL3
- apparently used in North America (the United States and Canada)
- Enhanced North American Benchmark LExicon (ENABLE), and ENABLE2K
- see also [3].
- Based on the OSPD (Official Scrabble Players Dictionary) and Merriam-Webster. Made by Alan Beale. Public Domain
- YAWL - Yet Another Word List (YAWL),
- by Mendel Cooper (a.k.a. thegrendel)
- License: Public domain (it's based on public domain work by Alan Beale).
- Apparently, the list is a superset of OSPD, SOWPODS, and ENABLE
Has previously been available from, apparently, a number of personal webspaces that didn't live too long. Available as a package in a few Linux distributions, and floating around a few other places.
French:
- L'Officiel du Scrabble (ODS)
Romanian:
- Lista Oficiala de Cuvinte 2000 (LOC2000)
Dutch:
- Scrabble Woorden Lijst (SWL)
Italian
- PARO (unofficial; there is no official italian scrabble list (still not?))
- ZINGA
Spanish:
- (verify) DRAE22, Diccionario de la Real Academia Española, 22nd edition.
See also:
- http://en.wikipedia.org/wiki/Official_Scrabble_Players_Dictionary
- http://en.wikipedia.org/wiki/SOWPODS
- http://en.wikipedia.org/wiki/Official_Tournament_and_Club_Word_List
Other / unsorted
From Kevin Atkinson - generally under various copyrights (permissive for educational purposes):
- Spell Checker Oriented Word Lists (SCOWL)
- Automatically Generated Inflection Database (AGID)
- VarCon (Variant Conversion Info)
See also:
Greg's Babble Dictionary - originally at www.photosphere.us/gj_babble_db.txt, now moved and replaced with a slightly slimmed version, at [5]. Based on ENABLE.
Crossword/cryptic wordlists may e.g. include synonyms, relation to common questions, and such.
UK Advanced Cryptics Dictionary (UKACD) [6]
Webster's 1913 Dictionary
One of the early (late eighteen hundreds), large English dictionaries available in the US.
Interesting because the copyright on the 1913 version has lapsed, so it is Public Domain, and it is fairly easily available (although it needs a bunch of work to parse well).
See for example the version in Project Gutenberg (search link below), which has been augmented by Micra, Inc.
Legalities seem a little unclear to me. There is certainly a later version under Micra's copyright, whereas the copies in Project Gutenberg mention that Micra's additions (the tags) are copyrighted and that the basic text is Public Domain. It is unclear to me what Micra's base resource was.
See also:
- http://en.wikipedia.org/wiki/Webster%27s_Dictionary
- http://www.gutenberg.org/catalog/world/results?title=webster%20dictionary
OPTED
The Online Plain Text English Dictionary (OPTED) is a resource based on the public domain 1913 version of Webster's Dictionary in Project Gutenberg.
I have not checked the license details. It seems public domain - except with the requirement that it stays public domain and free of cost.
See also:
CIDE
The Collaborative International Dictionary of English (CIDE) is derived from the 1913 (public domain) Webster's Dictionary.
There are various derivative versions.
See also:
GCIDE
GNU Collaborative International Dictionary of English
GPL license
See also:
Roget's thesaurus
A thesaurus from 1911.
The roget15a.txt file as stored in Project Gutenberg is not the original transcribed text. Some proof-reading, restructuring, extra marking, and a little content addition was done by Micra, Inc. [11]. They claim no restrictions on this version of their additions; the resource is in the Public Domain, with Gutenberg's basic, strippable restrictions.
From the notes included in that version:
- Likely still contains errors, though it has had reasonable proof-reading.
- supplemented, but still far from up to date with modern English
- About 1500 verbs (out of 6500) which can be found in a modern 80,000-word spell-checker are absent
- Nouns are probably worse, as there are many words coined since, particularly in technical areas.
- A good deal of words in here are not recognized by a modern spell checker, in most cases because they are Latin or obsolete.
- Such non-modern cases are usually marked with comments in square brackets.
- On formatting:
- Comments that are not thesaurus content are delimited <-- like so -->
- Section headings are included between percent (%) markers. (also seem to be all uppercase)
- @ and things within {} are internal references (by number), which you could ignore (as they seem incomplete, an experiment - though they do also seem useful)
- each of the ~1000 main entries has a pound sign (#) in front of an identifier (enumeration).
- Greek words and phrases are transliterated, and formatted <gr/like so/gr>
- obsolete words (in the 1911 version) are marked with a pipe (|)
- words current in 1911 but not in a current college-sized dictionary are marked with a pipe and exclamation (|!) (not complete) or [obs3] (probably also not complete, but better)
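If you only want the thesaurus words themselves, most of this markup can be stripped with a few regular expressions. The following is a minimal sketch that relies only on the conventions listed above; it does not handle the section headings, entry numbers, or internal references. The function name and the sample line are made up and are not taken from the actual file.

```python
import re

# Comments that are not thesaurus content: <-- like so -->
COMMENT = re.compile(r"<--.*?-->", re.DOTALL)
# Transliterated Greek: <gr/like so/gr> (keep the text, drop the wrapper)
GREEK = re.compile(r"<gr/(.*?)/gr>", re.DOTALL)
# Obsolescence markers: |, |! and [obs3]
OBSOLETE = re.compile(r"\|!?|\[obs3\]")


def clean(text):
    """Strip comments and markers, keeping the words themselves."""
    text = COMMENT.sub("", text)
    text = GREEK.sub(r"\1", text)
    text = OBSOLETE.sub("", text)
    return text


# Hypothetical line, only to show the behaviour.
sample = "existence, being<-- note -->, esse|, <gr/ousia/gr>[obs3]"
print(clean(sample))
```

If you would rather drop obsolete words than merely unmark them, you would first need to check in the file which side of a word the marker sits on; the sketch deliberately avoids assuming that.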
See also:
Wiktionary
Wiktionary is a user-contributed dictionary/thesaurus for various languages.
It could be a potentially useful resource.
See also:
ICE
International Corpus of English (ICE)
A text corpus that consists for a good part of transcribed spoken English.
See also:
- http://www.ucl.ac.uk/english-usage/ice/
- http://en.wikipedia.org/wiki/International_Corpus_of_English
Internet Dictionary Project
A semi-personal, semi-public project that ran from 1995 to 2007, meant to create free translation dictionaries.
Copyrighted, but royalty-free, and seemingly legally unrestricted (for uses other than selling unaltered copies for money).
See also:
Translation
Freedict
A number of bilingual translation dictionaries, under the GPL license.
See also:
Apertium
Phonetic
TIMIT
TIMIT is a corpus of phonemic pronunciations for lexical items, based on various dialects of American English. It was commissioned by DARPA.
Uses its own phonetic script; see Phonetic scripts#TIMIT
Its name comes from some of the parties in its creation, Texas Instruments (recording) and MIT (transcription).(verify)
Paid-for corpus.(verify)
There is also a Network TIMIT (telephone network quality sound(verify)).
CMUdict
The CMU Pronouncing Dictionary is a public domain resource from Carnegie Mellon.
It encodes pronunciation using the Phonetic_scripts#ARPAbet system.(verify)
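As an illustration of what the entries look like, here is a minimal sketch of parsing one line. It assumes the commonly distributed plain-text layout: a word, whitespace, then space-separated ARPAbet phones, with stress digits (0, 1, 2) on vowels, alternative pronunciations written as WORD(1), and comment lines starting with ;;;. Check this against the release you download, as details differ between versions. The function names are made up.

```python
import re


def parse_cmudict_line(line):
    """Return (word, [phones]) or None for blank and comment lines."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    head, *phones = line.split()
    # Alternative pronunciations carry a numeric suffix such as WORD(1).
    word = re.sub(r"\(\d+\)$", "", head)
    return word, phones


def strip_stress(phones):
    """Remove the stress digits that vowels carry."""
    return [p.rstrip("012") for p in phones]


entry = parse_cmudict_line("HELLO  HH AH0 L OW1")
print(entry)
print(strip_stress(entry[1]))
```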
See also:
- http://en.wikipedia.org/wiki/CMU_Pronouncing_Dictionary
- http://www.speech.cs.cmu.edu/cgi-bin/cmudict
Longman Pronunciation Dictionary
See also:
- http://www.pearsonlongman.com/dictionaries/LPD/
- http://www.antimoon.com/how/pronunciation-dictionaries-review.htm
CAAPR
Combined Anglo-American Pronunciation Reference, which has British English and American English pronunciations
By Alan Beale, and using his FLOSS phonemic system.
Based on FEWL (for American) and the EPD (for British).
License: Unclear(verify)
Note: FEWL is an earlier version of CAAPR
See also:
- http://www.wyrdplay.org/AlanBeale/CAAPR-ref.html
- http://www.wyrdplay.org/AlanBeale/CAAPR-ref-12.html
EPD
English Pronouncing Dictionary (EPD), published by the Cambridge University Press
See also:
- http://www.antimoon.com/how/cepd-review.htm
- http://seas3.elte.hu/epd.html (web-searchable)
- http://www.antimoon.com/how/pronunciation-dictionaries-review.htm
Morphology
CELEX
Lexical database. For Dutch, English, and German, it has word forms, phonological and morphological info, frequencies, and more.
Has a revision - you probably now want CELEX2.
See also:
Mixed
Moby
Moby is a resource created by and placed in public domain by Grady Ward.
Consists of a number of sub-resources, including
- part of speech information
- pronunciation information
- various wordlists
- hyphenation information (a large list of examples)
- more
It is a large data set, of decent quality (though with some known peculiarities and flaws).
Where non-ASCII is used, these files seem to be coded in one of:
See also:
And perhaps:
Focus on semantics
Things like synonym sets, word senses, ontological structures, and such.
To a lesser extent, any corpus could apply. Note e.g. that some of the above have some limited semantic tagging.
VerbOcean
A semantic network of verbs
University of Southern California
See also:
- http://demo.patrickpantel.com/Content/verbocean/
- http://www.aclweb.org/aclwiki/index.php?title=VerbOcean
WordNet and variants
WordNet is a semantic lexicon developed by Princeton University. It has a decent number of word-sense pairs, and some short dictionary-like explanations.
It does not contain etymological information, nor much information about inflection or other morphology.
It provides information on nouns, verbs, adjectives and adverbs. These can be queried separately, which lets you select a sense when there are homographs across lexical categories.
For nouns, it has synonyms, hypernyms, hyponyms, holonyms, meronyms, and coordinate terms (words that share a hypernym)
For verbs, it has hypernyms (more general actions), troponyms (more specific manners of an action), entailment (implied actions), and coordinate terms.
For adjectives, it has related nouns, and a relation to the verb if the adjective is a participle.
For adverbs, it has the root adjective.
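To make the relation names concrete, here is a toy sketch of how hypernyms, hyponyms, and coordinate terms relate to each other. It uses a tiny hand-made mapping, not WordNet data or any WordNet API, and it simplifies by giving each term a single hypernym (WordNet itself works on synsets, which can have several).

```python
# Toy data: each term maps to its (single) hypernym. Not taken from WordNet.
HYPERNYM = {
    "dog": "canine",
    "wolf": "canine",
    "fox": "canine",
    "cat": "feline",
    "canine": "carnivore",
    "feline": "carnivore",
}


def hyponyms(term):
    """Terms that have `term` as their hypernym."""
    return sorted(w for w, h in HYPERNYM.items() if h == term)


def coordinate_terms(term):
    """Other terms that share a hypernym with `term`."""
    parent = HYPERNYM.get(term)
    if parent is None:
        return []
    return [w for w in hyponyms(parent) if w != term]


def hypernym_chain(term):
    """Follow hypernyms upward until there is none left."""
    chain = []
    while term in HYPERNYM:
        term = HYPERNYM[term]
        chain.append(term)
    return chain


print(coordinate_terms("dog"))
print(hypernym_chain("dog"))
```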
Under copyright, provided under a BSD-like license.
Exactly which BSD variant it corresponds to is not entirely clear (this can be important, as the 4-clause BSD license is less compatible with other licenses).
It looks to effectively be most like a 2-clause BSD license, which would make it compatible with GPL, LGPL and such, as long as the WordNet license is preserved (as the 'appropriate copyright notice').
Princeton seems to have indicated this, and the FSF has suggested something along similar lines.
See also:
Extended WordNet
A resource that parsed the glosses (definitions) in WordNet, to be used as additional information.
BSD style license
See also:
GeoWordNet
A variant of wordnet with geospatial information; a merger of Wordnet, GeoNames, and the Italian part of MultiWordNet.
License: CC-by 3.0
See also
- http://eprints.biblio.unitn.it/archive/00001777/
- http://semanticmatching.org/background-knowledge-datasets.html
Other-language wordnets
Some:
- EuroWordNet
- English, Dutch, Estonian, French, German, Italian, Spanish
- http://en.wikipedia.org/wiki/EuroWordNet
- License:
- The whole distributed via ELDA/ELRA (restrictive(verify))
- Each wordnet can be released by its creator too
- MultiWordNet - Italian
- Spanish Wordnet (TALP Group at the Universitat Politecnica de Catalunya (Spain))
- Portuguese Wordnet (NLX-Group at the University of Lisbon (Portugal))
- Hebrew Wordnet (Computational Linguistic Group at the University of Haifa (Israel))
- Romanian Wordnet ("Alexandru Ioan Cuza" University of Iasi (Romania))
- Latin WordNet (University of Verona (Italy))
- Asian WordNet
- Thai, Korean, Japanese, Indonesian, Myanmar, Vietnamese, Mongolian, Bengali
- ...many others
- Global WordNet Grid
MontyLingua
A toolkit that deals with POS tagging, chunking, lemmatizing, certain specific extractions, and uses information from Open Mind Common Sense to avoid some errors.
GPL license for non-commercial use (e.g. research), and licensable for commercial use.
MontyLingua3 seems to be a fork that is GPL for any purpose [14].
Python, Java
See also:
DAESO
Detecting And Exploiting Semantic Overlap (DAESO) is a project that studies expressions of similar information.
See also
CORNETTO
Aims to build a lexical resource with semantic-style correlations.
See also:
SenseEval
http://www.d.umn.edu/~tpederse/data.html
SenseClusters
See also:
A largely user-contributed knowledge base, hosted by a commercial company, but apparently available under CC Attribution.
See also:
FrameNet
From the International Computer Science Institute in Berkeley
See also:
- FrameNet: Theory and practice (book)
- http://en.wikipedia.org/wiki/FrameNet
DBpedia
A semantic resource created by harvesting data from wikipedia. (e.g. using page categories, geographic templates).
GFDL content.
See also:
DOLCE
Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)
Apparently LGPL content.
See also:
BFO
Basic Formal Ontology
Similar in idea to DOLCE, SUMO.
See also:
SUMO
Suggested Upper Merged Ontology (SUMO) refers to a project that creates broad-level ontologies, which don't run into that many ambiguities. Has mappings to WordNet.
Available under the GPL.
Now also has MId-Level Ontologies (MILO)
See also:
YAGO
A knowledge base based on data from Wikipedia and WordNet.
Apparently available under the GFDL.
See also:
COSMO
COmmon Semantic MOdel (COSMO)
By Micra Inc.
Takes data from OpenCyc, SUMO, BFO, DOLCE
TEASE
Can refer to the algorithm, as well as its results from a large-scale execution.
Bar-Ilan University
http://www.aclweb.org/aclwiki/index.php?title=TEASE
Cyc
Cyc is an AI reasoning project, and consists mainly of an ontological knowledge base, supporting assertions, and a number of tools.
The full Cyc knowledge base is available only within Cycorp.
A subset of Cyc is released as OpenCyc (available under CC Share Alike (verify) or Apache (verify)). It contains mostly the ontological facts, but little of the extra knowledge present in the assertions.
Recently there is also ResearchCyc, which lies somewhere between Cyc and OpenCyc in detail, and has a specific license permissive for research (but not really FOSS compatible).
See also:
Open Mind Common Sense
MIT's Open Mind Common Sense project (a.k.a. just 'Common Sense') is an AI reasoning project, or rather, it aims to store a lot of knowledge commonly used in real-world reasoning and disambiguation.
Similar to Cyc in some ways (though knowledge is stored in a sentence-based form rather than in strictly formal relations).
Derivatives/relations based on (parsed) Common Sense data:
- ConceptNet
- LifeNet
- StoryNet
- ShapeNet
License is generally GPL for non-commercial use such as research. (verify)
Conceptnet3 can be used under GPL, or CC Attribution.
See also:
- http://www.openmind.org/
- http://en.wikipedia.org/wiki/Open_Mind_Common_Sense
- http://commons.media.mit.edu/en/
- MontyLingua uses Common Sense data
Unsorted
DoReCo
Aiksaurus
AikSaurus is a C++ thesaurus library, with an OS X interface, and data.
Available under GPL.
See also:
PAROLE
Preparatory Action for linguistic Resources Organisation for Language Engineering (PAROLE)
Copyrighted, non-distribution license.
See also:
LEFFF
LExique des Formes Fléchies du Français (LEFFF), a French morphological lexicon. The first version has only verbs, the second all lexical categories.
Under the permissive LGPLLR license.
See also:
Proton
See also:
SALSA
Saarbrücken Lexical Semantics Acquisition Project (SALSA)
See also:
SemCor and variants
SemCor is a tagged text corpus based on the intersection between the Brown corpus and WordNet.
See also:
- http://multisemcor.itc.it/semcor.php
- http://citeseerx.ist.psu.edu/viewdoc/summary?doi=10.1.1.38.9945
MultiSemCor is an English/Italian parallel based on SemCor and professional translators.
See also:
SemiSUSANNE
A corpus generated from the overlap between SUSANNE and SemCor (and thereby indirectly WordNet).
See also:
SUSANNE
SUSANNE can refer to the SUSANNE scheme as well as the SUSANNE corpus.
The SUSANNE corpus was created from a subset of the Brown corpus according to the SUSANNE scheme. Its focus on quality rather than quantity makes it an interesting resource.
There seems no mention of a license other than a mention that it is freely usable for research.
See also:
Related:
CHRISTINE
The CHRISTINE corpus extends SUSANNE for spoken English.
See also:
TIGER
The TIGER treebank is a German treebank that comes from newspapers
See also:
TIME corpus
a.k.a. Time Magazine Corpus, based on Time magazine, and with a web interface.
See also:
Wall Street Journal corpus
Wall Street Journal (WSJ) corpus
treebank?
Wellington corpus
A corpus of New Zealand English, or rather, two:
- Wellington Corpus of Written New Zealand English (WWC)
- Wellington Corpus of Spoken New Zealand English (WSC)
See also:
TwNC
The Twente News Corpus (TwNC) is a Dutch text corpus of content from newspapers, teletext subtitling and such.
See also:
VerbNet
Lexical resource from the University of Colorado (in Boulder).
Details verbs, with semantic and syntactic information (extending (Levin 1993) classes).
Related to PropBank, FrameNet.
Has its own license, which looks BSD-like, with a story similar to (verify) that of (the unrelated) WordNet.
(See http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html)
See also
PropBank
See also:
NomBank
New York University
See also:
Nomlex
Nominalizations
New York University
See also:
Potential
These lists are here to be able to easily add a corpus/resource without having to immediately add a wiki page that actually says something useful about it.
See also Category:Linguistic data or resources
There's always work in writing a short descriptive article for these.
Potential corpora
- Project Gutenberg
- Wikipedia (e.g. the static HTML downloads)
- Government projects (often public domain)
- The web, see e.g. W3Corpora
- Academic projects
Corpora are commonly lexically annotated text, or treebanks, or just collections of documents that have been manually touched up and verified.
Single-language corpora
English corpora
Text
Spoken/Dialogue
- London-Lund Corpus (LLC) [19]
- Spoken Corpus of British English (SCRIBE)
- Lancaster/IBM Spoken English Corpus (SEC) and MARSEC, a machine geared variation
- Dialogue Diversity Corpus (DDC) [20]
- Göteborg Spoken Language Corpus (GSLC) [21]
- ICSI Meeting Corpus [22]
Children
- Polytechnic of Wales (POW)
- Child Language Data Exchange System (CHILDES)
Specific
- International Corpus of English (ICE) (non-natives), with subcorpora such as ICE-GB [23]
- MICASE (academic) [24]
- Corpus of Professional, Spoken American-English (CPSA) (nonfree) http://www.athel.com/cpsa.html
- Penn-Helsinki (middle english, nonfree) [25]
Russian
Eastern European
Czech
- The Prague Dependency Treebank [26]
Slovene
- Slovene Dependency Treebank [27]
Bulgarian
- BulTreeBank (treebank) [28]
Bosnian
- The Oslo Corpus of Bosnian Texts [29]
Western European
German
French
- American and French Research on the Treasury of the French Language (ARTFL) [32]
Spanish, Portuguese
- TychoBrahe (historical) [33]
- CETEMPúblico [34]
- Acesso a corpora/Disponibilização de corpora (AC/DC) [35]
- [36]
- CUMBRE
Dutch
Text
Spoken
- Corpus Gesproken Nederlands (CGN) [39]
- Scarrie (news) [40]
Norwegian
- Oslo-Korpuset [41]
Swedish
Danish
Asian
Chinese
Korean
Multi-language corpora
Usually parallel, possibly aligned.
TODO: sort out http://tcc.itc.it/people/forner/multilingualcorpora.html
- EuroParl (Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, Swedish) [48] - tokenized, sentence-aligned
- MULTEXT-EAST: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and Serbian [49]
- JRC-Acquis (EU law in Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish) [50]
- ENPC (English-Norwegian Parallel Corpus Project) [51]
- CRATER (English, French, Spanish) [52]
- TRIPTIC (TRIlingual Parallel Text Information Corpus) (English, French and Dutch)
- SVEZ-IJS ACQUIS (Slovene, English) [55]
- IJS - ELAN (Slovene, English) [56]
- EMILLE [57]
- Lilabar (English, Russian) [58]
- Parallel Corpus of English and Czech Texts [59]
- COMPARA (Portuguese, English) [60]
- MULTEXT [61]
http://www.athel.com/parallel_corpora.html
Semantic resources
http://en.wikipedia.org/wiki/Upper_ontology_%28computer_science%29#Available_ontologies
See also
- http://www.comp.lancs.ac.uk/computing/users/paul/ucrel/corpora.html
- http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/areas/nlp/corpora/0.html
- http://www.natcorp.ox.ac.uk/corpora.html
- http://www.rc.kyushu-u.ac.jp/~higuchi/text7/corpus.html
- http://ota.ahds.ac.uk/
- http://www.ldc.upenn.edu/
- http://www.elra.info/
- http://crl.nmsu.edu/Tools/CLR/
- http://www.elsnet.org/resources/eciCorpus.html
- http://trec.nist.gov/data/reuters/reuters.html
- http://www.ucl.ac.uk/english-usage/
Unsorted (potential)
- http://ucnk.ff.cuni.cz/
- http://www.fida.net/slo/index.html
- http://www.compapp.dcu.ie/~away/Treebank/treebank.html
- http://tangra.si.umich.edu/clair/CSTBank/
Public domain stuff