Linguistic data and resources

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.
⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).

See also Linguistics software

Corpora and treebanks

Corpora tend to refer to decently curated but fairly unstructured pieces of text, sometimes with syntactic tagging, useful at least as decently representative samples of a language's text.

Treebanks usually refer to corpora with syntactic, structural, and semantic analysis.








Dutch corpora and treebanks

Eindhoven corpus

A tagged Dutch corpus from the seventies, mostly from relatively informal and scientific printed use(verify). Originally compiled to produce a word frequency list for Dutch.

See also:

CGN

Corpus Gesproken Nederlands (CGN) is a tagged corpus of text that comes from transcriptions of spoken Dutch (in the Netherlands and Flanders).

https://lands.let.ru.nl/cgn/

LASSY

Large Scale Syntactic Annotation of written Dutch

See also:

English corpora and treebanks


20newsgroups

A small dataset of messages from a number of newsgroups, often used when teaching automatic text classifiers.

License: Unclear

See also:

Reuters corpora

Reuters-21578

A corpus also often used for categorization tasks

Copyrighted, free for research use.

See also:


Reuters Corpus, Volume 1 (RCV1)

http://trec.nist.gov/data/reuters/reuters.html


Reuters Corpus, Volume 2 (RCV2)

http://trec.nist.gov/data/reuters/reuters.html


ICAME collection

The International Computer Archive of Modern and Medieval English is a collection of various previously-existing corpora

Written:

  • Brown Corpus[1]
or fully, the Brown University Standard Corpus of Present-Day American English
Under copyright, usable for research. (verify)
  • LOB Corpus[2]
  • Tagged LOB Corpus
  • Freiburg-LOB Corpus of British English (FLOB)
  • Freiburg-Brown Corpus of American English (FROWN)
  • Kolhapur Corpus of Indian English
  • Australian Corpus of English (ACE)
  • Wellington Corpus of Written New Zealand English
  • International Corpus of English - East African component

Spoken:

  • London-Lund Corpus of Spoken English
  • Lancaster/IBM SEC Corpus, The Machine-Readable Corpus of Spoken English
  • Wellington Corpus of Spoken New Zealand English (WSC)
  • Bergen Corpus of London Teenage Language (COLT)
  • International Corpus of English - East African component

http://clu.uni.no/icame/manuals/


BNC

British National Corpus (BNC) (tagged)

http://www.natcorp.ox.ac.uk/


ANC

American National Corpus (ANC)

See also:

COCA

Corpus of Contemporary American English (COCA)

http://www.americancorpus.org/

Oxford English Corpus

See also:


LUCY

A written UK English treebank.

See also:

Penn Treebank

Resource from the University of Pennsylvania.

See also:

Penn Middle English Treebank

See also:

Wordlists, dictionaries, and thesauri

12dicts

Alan Beale's 12dicts


A collection of wordlists made by Alan Beale, focusing on common, non-obscure words (unlike e.g. Scrabble lists, which can include rather obscure ones).

Named for the fact that its creator took/checked the words from/with various dictionaries (verify). AGID was also involved.


Most of the data is placed in the Public Domain. Some relatively recent revisions include information derived from SCOWL and WordNet, and therefore carry the respective license restrictions.


Basic set (public domain):

  • lists of words that appear in more than n dictionaries
    • 2of12.txt - appear in at least two source dictionaries
    • 6of12.txt - appear in at least six source dictionaries. Marked with some extra characters, see the documentation for them.
  • 3esl.txt - words from ESL 'core vocabulary'
  • 5desk.txt - meant as an everyday dictionary, something between a core vocabulary list and a dictionary wordlist (verify)
  • 2of4brif.txt - internationalized wordlist (most of the rest is American-geared)


There are annotations you may want to use, or filter out (see the README):

  • a = at the end of a line marks a second-class word (mostly inflected/derived forms)
  • a + at the end of a line marks a signature word
  • a : at the end of a line marks an abbreviation (of some sort)
  • a & at the end of a line marks a word that is not part of common usage in America
  • a ^ at the end of a line marks a word that was arbitrarily chosen from a set that has no clear primary form
  • a < at the end of a line marks a form that is not clearly primary
  • a # at the end of a line marks a form considered a variant (non-lemma)
  • a % at the end of a line marks uncountable nouns (in 2of12 only?)
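
If you only want the plain words, the trailing markers can be split off. The following is a minimal sketch, assuming that markers are appended directly to the end of each line and that the marker set is exactly the one listed above; check the README of the revision you use, as this has not been verified against the actual files. The function names and sample lines are made up for illustration.

```python
# Marker characters as listed above (an assumption; see the 12dicts README).
MARKERS = "=+:&^<#%"


def split_entry(line):
    """Return (word, markers) for one wordlist line."""
    text = line.strip()
    word = text.rstrip(MARKERS)      # drop any run of trailing markers
    return word, text[len(word):]    # whatever was dropped is the marker part


def plain_words(lines, skip=""):
    """Yield words without markers, dropping entries that carry a marker in `skip`."""
    for line in lines:
        word, markers = split_entry(line)
        if not word:
            continue                 # blank line
        if any(m in skip for m in markers):
            continue                 # entry type the caller does not want
        yield word


# Hypothetical sample lines, not taken from the real files.
sample = ["abandon", "abandoned=", "colour&", "a.m.:", ""]
print(list(plain_words(sample, skip="&:")))
```

Passing skip="&:" here drops non-American words and abbreviations while keeping second-class words (with their marker removed).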


Recent additions (rev 5? or earlier?):

  • neol2007.txt - some neologisms (public domain)
  • 2of12inf.txt - The more recently added inflections mentioned above.
  • 2+2gfreq.txt and 2+2lemma.txt are based on 2of12inf


Kevin Atkinson's alt12dicts


The 'Unofficial Alternate 12 Dicts Package' is a transformation of 12dicts that stores mostly the same information.

Its 2of12full.txt contains:

  • how many dictionaries contained the entry (making a 6of12.txt redundant)
  • how many list it as a non-variant American word
  • how many list it as a variant form
  • how many list it as a non-American word

Most interesting files are exactly as they are in the original 12dicts. A few files were added.


COMLEX

See also:


DICT

DICT can refer to

  • DICT Development Group (http://dict.org/)
  • their dictionary network protocol - see also RFC 2229
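
The protocol in RFC 2229 is line-based text over TCP (the RFC names port 2628): the client sends commands such as DEFINE, and the server answers with lines that start with a three-digit status code. A minimal sketch of just the text handling, without any networking; the function names are made up for illustration and this is not a complete client:

```python
def define_command(word, database="*"):
    """Build a DEFINE command line; "*" asks the server to search all databases."""
    # Quote the word so that spaces survive; escape backslashes and quotes inside it.
    quoted = '"' + word.replace("\\", "\\\\").replace('"', '\\"') + '"'
    return f"DEFINE {database} {quoted}\r\n"


def parse_status(line):
    """Split a server status line into (numeric code, remaining text)."""
    code, _, text = line.rstrip("\r\n").partition(" ")
    return int(code), text


print(repr(define_command("corpus")))
print(parse_status("552 no match\r\n"))
```

A real client would also read the server banner, collect the definition text that follows each 151 line up to a line containing only a period, and send QUIT when done.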

See also:

Various wordlists

Games

Largely for Scrabble and the like, in which case they are plain lists of words.

Note that there are multiple revisions of most of these lists.


English:

  • OSPD (Official Scrabble Players Dictionary)
  • OSW (Official Scrabble Words) is a British list based on the Chambers Dictionary
  • SOWPODS (derived from OSPD and OSW, and some others)
Used in Scrabble world championships, and in Britain, Australia, New Zealand, and some others.
  • Official Tournament and Club Word List or Tournament Word List (TWL, OWL, or OTaCWL)
  • TWL98
  • TWL3
These seem to be used in North America (the United States and Canada).
  • Enhanced North American Benchmark LExicon (ENABLE), and ENABLE2K
see also [3].
Based on the OSPD (Official Scrabble Players Dictionary) and Merriam-Webster. Made by Alan Beale. Public Domain


  • YAWL - Yet Another Word List (YAWL),
by Mendel Cooper (a.k.a. thegrendel)
License: Public domain (it's based on public domain work by Alan Beale).
Apparently, the list is a superset of OSPD, SOWPODS, and ENABLE

Has previously been available from, apparently, a number of personal webspaces that didn't live too long. Available as a package in a few linux distributions, and floating around a few other places.


French:

  • L'Officiel du Scrabble (ODS)

Romanian:

  • Lista Oficiala de Cuvinte 2000 (LOC2000)

Dutch:

  • Scrabble Woorden Lijst (SWL)

Italian:

  • PARO (unofficial; there is no official Italian Scrabble list (still not?))
  • ZINGA

Spanish:

  • (verify) DRAE22, Diccionario de la Real Academia Española, 22nd edition.



See also:



Other / unsorted

From Kevin Atkinson - generally under various copyrights (permissive for educational purposes):

  • Spell Checker Oriented Word Lists (SCOWL)
  • Automatically Generated Inflection Database (AGID)
  • VarCon (Variant Conversion Info)


See also:


Greg's Babble Dictionary - originally at www.photosphere.us/gj_babble_db.txt, now moved and replaced with a slightly slimmed version, at [5]. Based on ENABLE.



Crossword/cryptic wordlists may e.g. include synonyms, relation to common questions, and such.

UK Advanced Cryptics Dictionary (UKACD) [6]




Webster's 1913 Dictionary


One of the early (late eighteen hundreds), large English dictionaries available in the US.

Interesting because the 1913 version's copyright has lapsed, so it is Public Domain, and it is fairly easily available (although it needs a bunch of work to parse well).

See for example the version in Project Gutenberg (search link below), which has been augmented by Micra, Inc.

Legalities seem a little unclear to me. There is certainly a later version under Micra's copyright, whereas the copies in Project Gutenberg mention that Micra's additions (the tags) are copyrighted and that the basic text version is Public Domain. It is unclear to me what Micra's base resource was.


See also:


OPTED

The Online Plain Text English Dictionary (OPTED) is a resource based on the public domain 1913 version of Webster's Dictionary in Project Gutenberg.

I have not checked the license details. It seems public domain - except with the requirement that it stays public domain and free of cost.


See also:


CIDE

The Collaborative International Dictionary of English (CIDE) is derived from the 1913 (public domain) Webster's Dictionary.


There are various derivative versions.

See also:


GCIDE

GNU Collaborative International Dictionary of English

GPL license

See also:

Roget's thesaurus

A thesaurus from 1911.

The roget15a.txt file as stored in Project Gutenberg is not the original transcribed text. Some proof-reading, restructuring, extra marking, and a little content addition was done by Micra, Inc. [11]. They claim no restrictions on this version of their additions; the resource is in the Public Domain, with Gutenberg's basic, strippable restrictions.


From the notes included in that version:

  • Likely still contains errors, though it has had reasonable proof-reading.
  • supplemented, but still far from up to date with modern English
  • About 1500 verbs (out of 6500) which can be found in a modern 80,000-word spell-checker are absent
  • Nouns are probably worse, as there are many words coined since, particularly in technical areas.
  • A good deal of words in here are not recognized by a modern spell checker, in most cases because they are Latin or obsolete.
  • Such non-modern cases are usually marked with comments in square brackets.
  • On formatting:
    • Comments that are not thesaurus content are delimited <-- like so -->
    • Section headings are included between percent (%) markers. (also seem to be all uppercase)
    • @ and things within {} are internal references (by number), which you could ignore (as they seem incomplete, an experiment - though they do also seem useful)
    • each of the ~1000 main entries has a pound sign (#) in front of an identifier (enumeration).
    • Greek words and phrases are transliterated, and formatted <gr/like so/gr>
    • obsolete words (in the 1911 version) are marked with a vertical bar, i.e. a pipe (|)
    • words current in 1911 but not in a current college-sized dictionary are marked with a pipe and exclamation (|!) (not complete) or [obs3] (probably also not complete, but better)
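
Some of these formatting conventions can be handled with a few regular expressions. This is a rough sketch based only on the notes above, not checked against roget15a.txt itself; in particular the assumption that an entry identifier is a number (optionally with a letter) followed by a period is a guess, and the sample string is made up.

```python
import re

# Non-content comments, delimited <-- like so -->
COMMENT = re.compile(r"<--.*?-->", re.DOTALL)
# Transliterated Greek, delimited <gr/like so/gr>
GREEK = re.compile(r"<gr/(.*?)/gr>", re.DOTALL)
# Main entries: a pound sign in front of an identifier (identifier format is assumed)
ENTRY = re.compile(r"#(\d+[a-z]?)\.")


def clean(text):
    """Remove comments and unwrap Greek markup, keeping the transliteration."""
    text = COMMENT.sub("", text)
    return GREEK.sub(r"\1", text)


def entry_ids(text):
    """Return the entry identifiers found in the text, in order."""
    return ENTRY.findall(text)


# Hypothetical snippet in the described style, not copied from the file.
sample = "#1. Existence.<-- a note --> being, <gr/ousia/gr>"
print(clean(sample))
print(entry_ids(clean(sample)))
```

Stripping comments before looking for entries avoids matching a pound sign that happens to sit inside a comment.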


See also:

Wiktionary

Wiktionary is a user-contributed dictionary/thesaurus for various languages.

It could be a potentially useful resource.

See also:


ICE

International Corpus of English (ICE)

A text corpus that consists in large part of transcribed spoken English.

See also:

Internet Dictionary Project

A semi-personal, semi-public project that ran from 1995 to 2007, meant to create free translation dictionaries.

Copyrighted, but royalty-free, and seemingly legally unrestricted (for uses other than selling unaltered copies for money).


See also:

Translation

Freedict

A number of bilingual translation dictionaries, under the GPL license.

See also:


Apertium

https://github.com/apertium

Phonetic

TIMIT

TIMIT is a corpus of phonemically and lexically transcribed read speech, from speakers of various dialects of American English. It was commissioned by DARPA.

Uses its own phonetic script; see Phonetic scripts#TIMIT

Its name comes from some of the parties in its creation, Texas Instruments (recording) and MIT (transcription).(verify)

Paid-for corpus.(verify)


There is also a Network TIMIT (telephone network quality sound(verify)).


CMUdict

The CMU Pronouncing Dictionary is a public domain resource from Carnegie Mellon.

It encodes pronunciation using the Phonetic_scripts#ARPAbet system.(verify)
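
As an illustration of what that looks like: the plain-text distribution has one entry per line, a word followed by space-separated ARPAbet phones, with a stress digit (0, 1 or 2) on the vowels. A minimal parsing sketch, assuming that layout and that comment lines start with ;;; (both worth checking against the version you download); the function names are made up for illustration.

```python
def parse_cmudict_line(line):
    """Return (word, [phones]) for an entry line, or None for blanks and comments."""
    line = line.strip()
    if not line or line.startswith(";;;"):
        return None
    word, *phones = line.split()
    return word, phones


def stress_pattern(phones):
    """Collect the stress digits carried by the vowel phones, in order."""
    return "".join(p[-1] for p in phones if p[-1].isdigit())


entry = parse_cmudict_line("HELLO  HH AH0 L OW1")
print(entry)
print(stress_pattern(entry[1]))
```

Alternative pronunciations appear as separate entries with a suffix such as (1) on the word, which this sketch leaves as part of the word.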


See also:


Longman Pronunciation Dictionary

See also:



CAAPR

Combined Anglo-American Pronunciation Reference, which has British English and American English pronunciations

By Alan Beale, and using his FLOSS phonemic system.

Based on FEWL (for American) and the EPD (for British).


License: Unclear(verify)

Note: FEWL is an earlier version of CAAPR


See also:


EPD

English Pronouncing Dictionary (EPD), published by the Cambridge University Press


See also:

Morphology

CELEX

Lexical database. For Dutch, English, and German, it has word forms, phonological and morphological info, frequencies, and more.

Has a revision - you probably now want CELEX2.

See also:

Mixed

Moby

Moby is a resource created by and placed in public domain by Grady Ward.

Consists of a number of sub-resources, including

  • part of speech information
  • pronunciation information
  • various wordlists
  • hyphenation information (a large list of examples)
  • more

It is a large data set, of decent quality (though with some known peculiarities and flaws).


Where non-ASCII is used, these files seem to be coded in one of:


See also:

And perhaps:



Focus on semantics

Things like synonym sets, word senses, ontological structures, and such.

To a lesser extent, any corpus could apply. Note e.g. that some of the above have some limited semantic tagging.


VerbOcean


A semantic network of verbs

University of Southern California

See also:

WordNet and variants

WordNet is a semantic lexicon developed by Princeton University. It has a decent number of word-sense pairs, and some short dictionary-like explanations.


It does not contain etymological information, nor much information about inflection or other morphology.

It provides information on nouns, verbs, adjectives and adverbs. These can be queried separately, which lets you select the sense when there are homographs across lexical categories.

For nouns, it has synonyms, hypernyms, hyponyms, holonyms, meronyms, and coordinate terms (words that share a hypernym).

For verbs, it has hypernyms (more general actions), troponyms (more specific ways of performing an action), entailment (implied actions), and coordinate terms.

For adjectives, it has related nouns, and a relation to the verb if the adjective is a participle.

For adverbs, it has the root adjective.
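
To make one of these relations concrete: coordinate terms can be derived from the hypernym relation alone. The sketch below uses a tiny hand-made hypernym table, not WordNet data or any WordNet API, and works on bare words, whereas WordNet itself relates senses (synsets); all names in it are made up for illustration.

```python
# Toy hypernym table: word -> set of direct hypernyms (hand-made, not from WordNet).
HYPERNYMS = {
    "dog": {"canine"},
    "wolf": {"canine"},
    "fox": {"canine"},
    "cat": {"feline"},
    "canine": {"carnivore"},
    "feline": {"carnivore"},
}


def coordinate_terms(word, hypernyms=HYPERNYMS):
    """Return the other words that share at least one direct hypernym with `word`."""
    parents = hypernyms.get(word, set())
    return sorted(
        other
        for other, other_parents in hypernyms.items()
        if other != word and parents & other_parents
    )


print(coordinate_terms("dog"))
print(coordinate_terms("canine"))
```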


Under copyright, provided under a BSD-like license.

Exactly which BSD variant this amounts to is not entirely clear (which can be important, as the 4-clause-style BSD is less compatible with other licenses).

Looks to effectively be most like a 2-clause BSD license, which would make it compatible with GPL, LGPL and such, as long as the wordnet license is preserved (as the 'appropriate copyright notice').

Princeton seems to have indicated this, and the FSF has suggested something in similar lines.


See also:


Extended WordNet

A resource that parsed the glosses (definitions) in WordNet, to be used as additional information.

BSD style license

See also:

GeoWordNet

A variant of wordnet with geospatial information; a merger of Wordnet, GeoNames, and the Italian part of MultiWordNet.

License: CC-by 3.0

See also

Other-language wordnets

Some:

  • Spanish Wordnet (TALP Group at the Universitat Politecnica de Catalunya (Spain))
  • Hebrew Wordnet (Computational Linguistic Group at the University of Haifa (Israel))
  • Asian WordNet
    • Thai, Korean, Japanese, Indonesian, Myanmar, Vietnamese, Mongolian, Bengali
  • ...many others



MontyLingua

A toolkit that deals with POS tagging, chunking, lemmatizing, certain specific extractions, and uses information from Open Mind Common Sense to avoid some errors.


GPL license for non-commercial use (e.g. research), and licensable for commercial use.

MontyLingua3 seems to be a fork that is GPL for any purpose [14].


Python, Java


See also:


DAESO

Detecting And Exploiting Semantic Overlap (DAESO) is a project that studies expressions of similar information.

See also


CORNETTO

Aims to build a lexical resource with semantic-style correlations.

See also:

SenseEval

http://www.senseval.org/

http://www.d.umn.edu/~tpederse/data.html


SenseClusters

See also:


A largely user-contributed knowledge base, hosted by a commercial company, but apparently available under CC Attribution.

See also:

FrameNet

From the International Computer Science Institute (ICSI) in Berkeley

See also:


DBpedia

A semantic resource created by harvesting data from wikipedia. (e.g. using page categories, geographic templates).

GFDL content.

See also:


DOLCE


Descriptive Ontology for Linguistic and Cognitive Engineering (DOLCE)

Apparently LGPL content.


See also:

BFO

Basic Formal Ontology

Similar in idea to DOLCE, SUMO.

See also:

SUMO

Suggested Upper Merged Ontology (SUMO) refers to a project that creates broad-level ontologies, which don't run into that many ambiguities. Has mappings to WordNet.

Available under the GPL.


Now also has MId-Level Ontologies (MILO)


See also:


YAGO

A knowledge base based on data from Wikipedia and WordNet.

Apparently available under the GFDL.


See also:

COSMO

COmmon Semantic MOdel (COSMO)

By Micra Inc.

Takes data from OpenCyc, SUMO, BFO, DOLCE


TEASE

Can refer to the algorithm, as well as its results from a large-scale execution.

Bar-Ilan University

http://www.aclweb.org/aclwiki/index.php?title=TEASE


Cyc

Cyc is an AI reasoning project, and consists mainly of an ontological knowledge base, supporting assertions, and a number of tools.


The full Cyc knowledge base is available only within Cycorp.

A subset of Cyc is released as OpenCyc (available under CC Share Alike(verify) or Apache(verify)). It contains mostly the ontological facts, but little of the extra knowledge present in the assertions.

Recently there is also ResearchCyc, which lies somewhere between Cyc and OpenCyc in detail, and has a specific license permissive for research (but not really FOSS compatible).


See also:

Open Mind Common Sense

MIT's Open Mind Common Sense project (a.k.a. just 'Common Sense') is an AI reasoning project, or rather, it aims to store a lot of knowledge commonly used in real-world reasoning and disambiguation.

Similar to Cyc in some ways (though knowledge is stored in a sentence-based form rather than in strictly formal relations).


Derivatives/relations based on (parsed) Common Sense data:

  • ConceptNet
  • LifeNet
  • StoryNet
  • ShapeNet


License is generally GPL for non-commercial use such as research. (verify)

Conceptnet3 can be used under GPL, or CC Attribution.


See also:

Unsorted

DoReCo

Aiksaurus


AikSaurus is a C++ thesaurus library, with an OSX interface and data.

Available under GPL.

See also:


PAROLE


Preparatory Action for linguistic Resources Organisation for Language Engineering (PAROLE)

Copyrighted, non-distribution license.


See also:


LEFFF

LExique des Formes Fléchies du Français (LEFFF), a French morphological lexicon. The first version has only verbs, the second all lexical categories.


Under the permissive LGPLLR license.


See also:


Proton

See also:


SALSA

Saarbrücken Lexical Semantics Acquisition Project (SALSA)

See also:


SemCor and variants

SemCor is a tagged text corpus based on the intersection between the Brown corpus and WordNet.

See also:


MultiSemCor is an English/Italian parallel corpus based on SemCor and professional translators.

See also:

SemiSUSANNE

A corpus generated from the overlap between SUSANNE and SemCor (and thereby indirectly WordNet).

See also:

SUSANNE

SUSANNE can refer to the SUSANNE scheme as well as the SUSANNE corpus.


The SUSANNE corpus was created from a subset of the Brown corpus according to the SUSANNE scheme. Its focus on quality rather than quantity makes it an interesting resource.

There seems to be no mention of a license, other than a note that it is freely usable for research.


See also:

Related:

CHRISTINE

The CHRISTINE corpus extends SUSANNE for spoken English.

See also:

TIGER

The TIGER treebank is a German treebank that comes from newspapers

See also:


TIME corpus

a.k.a. Time Magazine Corpus, based on Time magazine, and with a web interface.

See also:

Wall Street Journal corpus

Wall Street Journal (WSJ) corpus

treebank?

Wellington corpus

A corpus of New Zealand English, or rather, two:

  • Wellington Corpus of Written New Zealand English (WWC)
  • Wellington Corpus of Spoken New Zealand English (WSC)

See also:


TwNC

The Twente News Corpus (TwNC) is a Dutch text corpus of content from newspapers, teletext subtitling and such.

See also:

VerbNet

Lexical resource from the University of Colorado (in Boulder).

Details verbs, with semantic and syntactic information (extending (Levin 1993) classes).

Related to PropBank, FrameNet.


Has its own license, which looks BSD-like, with a story similar to(verify) that of (the unrelated) WordNet.

(See http://verbs.colorado.edu/~mpalmer/projects/verbnet/downloads.html)


See also


PropBank

See also:

NomBank

New York University

Related to PropBank (verify)

See also:

Nomlex

Nominalizations

New York University

See also:

Potential

These lists are here to be able to easily add a corpus/resource without having to immediately add a wiki page that actually says something useful about it.

See also Category:Linguistic data or resources

There's always work in writing a short descriptive article for these.


Potential corpora

  • Project Gutenberg
  • Open Content Alliance sources (mostly hosted by Internet Archive) [15] [16]
  • Wikipedia (e.g. the static HTML downloads)
  • Government projects (often public domain)
  • The web, see e.g. W3Corpora
  • Academic projects


Corpora are commonly lexically annotated text, or treebanks, or just collections of documents that have been manually touched up and verified.


Single-language corpora

English corpora

Text

  • Lancaster-Oslo/Bergen (LOB) [17] and a subcorpus, the Lancaster Parsed Corpus [18]


Spoken/Dialogue

  • London-Lund Corpus (LLC) [19]
  • Spoken Corpus of British English (SCRIBE)
  • Lancaster/IBM Spoken English Corpus (SEC) and MARSEC, a machine geared variation
  • Dialogue Diversity Corpus (DDC) [20]
  • Göteborg Spoken Language Corpus (GSLC) [21]
  • ICSI Meeting Corpus [22]


Children

  • Polytechnic of Wales (POW)
  • Child Language Data Exchange System (CHILDES)


Specific

  • International Corpus of English (ICE) (non-natives), with subcorpora such as ICE-GB [23]
  • MICASE (academic) [24]
  • Corpus of Professional, Spoken American-English (CPSA) (nonfree) http://www.athel.com/cpsa.html
  • Penn-Helsinki (middle english, nonfree) [25]

Russian

Eastern European

Czech

  • The Prague Dependency Treebank [26]

Slovene

  • Slovene Dependency Treebank [27]

Bulgarian

  • BulTreeBank (treebank) [28]

Bosnian

  • The Oslo Corpus of Bosnian Texts [29]

Western European

German

French

  • American and French Research on the Treasury of the French Language (ARTFL) [32]

Spanish, Portuguese

  • TychoBrahe (historical) [33]
  • CETEMPúblico [34]
  • Acesso a corpora/Disponibilização de corpora (AC/DC) [35]
  • [36]
  • CUMBRE

Dutch

Text

  • Twente Nieuwscorpus (news) [37]
  • Alpino (treebank) [38]

Spoken

  • Corpus Gesproken Nederlands (CGN) [39]

Scandinavian

Norwegian

Swedish

Danish

  • Korpus2000 [44]
  • Danish Dependency Treebank (treebank) [45]

Asian

Chinese

  • Lancaster Corpus of Mandarin Chinese (LCMC) [46]
  • Penn Chinese Treebank (treebank) [47]

Korean

Multi-language corpora

Usually parallel, possibly aligned.

TODO: sort out http://tcc.itc.it/people/forner/multilingualcorpora.html


  • EuroParl (Danish, German, Greek, English, Spanish, Finnish, French, Italian, Dutch, Portuguese, Swedish) [48] - tokenized, sentence-aligned

  • MULTEXT-EAST: Bulgarian, Croatian, Czech, English, Estonian, Hungarian, Lithuanian, Resian, Romanian, Russian, Slovene, and Serbian [49]
  • JRC-Acquis (EU law in Bulgarian, Czech, Danish, German, Greek, English, Spanish, Estonian, Finnish, French, Hungarian, Italian, Lithuanian, Latvian, Maltese, Dutch, Polish, Portuguese, Romanian, Slovak, Slovene and Swedish) [50]
  • ENPC (English-Norwegian Parallel Corpus Project) [51]
  • CRATER (English, French, Spanish) [52]
  • TRIPTIC (TRIlingual Parallel Text Information Corpus) (English, French and Dutch)
  • SVEZ-IJS ACQUIS (Slovene, English) [55]
  • IJS - ELAN (Slovene, English) [56]
  • Lilabar (English, Russian) [58]
  • Parallel Corpus of English and Czech Texts [59]
  • COMPARA (Portuguese, English) [60]



http://www.athel.com/parallel_corpora.html

Semantic resources


http://en.wikipedia.org/wiki/Upper_ontology_%28computer_science%29#Available_ontologies

See also

Unsorted (potential)




Public domain stuff