Labeling in linguistics


This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Lexical categories / POS

A tagset is the set of lexical categories that a POS tagger annotates data with. For example, the Penn tagset was used to create the Penn Treebank.

Tagsets are usually based primarily on parts of speech, sometimes with common linguistic features added.

The basic few POS tags are broadly agreed on, but there is always room for discussion about the details: special cases, and the degree of normalisation (many specific tags vs. fewer tags combined with features). These choices also affect how well a tagset can be applied to languages other than English.
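To make the "specific tags vs. tags with features" distinction concrete, here is a minimal sketch that splits a handful of Penn-style specific tags into a coarse category plus features. The table is a hand-picked, incomplete illustration and not an official conversion; the names `PENN_TO_COARSE`, `decompose` and `format_features` are made up for this example, and the feature names follow the style used by Universal Dependencies.

```python
# Illustrative only: a small, hand-picked subset of Penn Treebank tags,
# each decomposed into a coarse category plus features.
# This is NOT an official or complete conversion table.
PENN_TO_COARSE = {
    "NN":  ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "VBD": ("VERB", {"Tense": "Past"}),
    "VBZ": ("VERB", {"Tense": "Pres", "Number": "Sing", "Person": "3"}),
    "JJR": ("ADJ",  {"Degree": "Cmp"}),
    "JJS": ("ADJ",  {"Degree": "Sup"}),
}


def decompose(tag):
    """Return (coarse_tag, features) for a specific tag.

    Tags missing from the table fall back to ('X', {}).
    """
    coarse, feats = PENN_TO_COARSE.get(tag, ("X", {}))
    return coarse, dict(feats)  # copy, so callers cannot mutate the table


def format_features(feats):
    """Render features as 'Name=Value|Name=Value', sorted by name ('_' if none)."""
    if not feats:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


if __name__ == "__main__":
    for tag in ("NNS", "VBZ", "JJR"):
        coarse, feats = decompose(tag)
        print(tag, "->", coarse, format_features(feats))
```

With the specific-tag approach, `VBZ` is a single opaque label; with the feature approach, the same information is spread over a general tag and a feature set, which tends to carry over more easily to languages with different morphology.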



UPOS

[1]

Penn

[2] [3]

TEI

C5, a.k.a. BNC basic

61 tags. See e.g. [4]

(C6)

The same as C7 except for handling of punctuation.

C7, a.k.a. BNC enriched

146 tags. See e.g. [5]


CLAWS

Various tagset versions exist; the current one is C8.

The latest version of the tagging program itself seems to be referred to as CLAWS4. See [6]


CLAWS1 tagset - 132 tags - [7]

CLAWS2 tagset - 166 tags - [8]

C5 - [9]

C6 - [10]

C7 - [11]

C8 - [12]


https://ucrel.lancs.ac.uk/claws/

Corpus-specific

...usually meaning 'not generally used'. See also corpora.

Parole

[13]

Brown

[14]

London-Lund

[15]

LOB, SEC

[16] [17]

POW

[18]

German

STTS (Stuttgart-Tübingen-TagSet)

54 tags

[19]

Dependencies

Stanford dependency representation

https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

Universal Dependencies

Broad overview: https://universaldependencies.org/u/dep/



https://nlp.stanford.edu/pubs/USD_LREC14_paper_camera_ready.pdf

https://universaldependencies.org/introduction.html

https://en.wikipedia.org/wiki/Universal_Dependencies


Universal Parts of Speech

The Universal Dependencies project's set of POS tags (UPOS): http://universaldependencies.org/docs/u/pos/index.html

   ADJ: adjective
   ADP: adposition
   ADV: adverb
   AUX: auxiliary verb
   CONJ: coordinating conjunction (renamed CCONJ in UD v2)
   DET: determiner
   INTJ: interjection
   NOUN: noun
   NUM: numeral
   PART: particle
   PRON: pronoun
   PROPN: proper noun
   PUNCT: punctuation
   SCONJ: subordinating conjunction
   SYM: symbol
   VERB: verb
   X: other
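
As a rough sketch of how a fine-grained tagset relates to this coarse one, the following maps some Penn tags onto the tags listed above. The mapping is a deliberately incomplete illustration, not the official conversion; `PENN_TO_UPOS`, `to_upos` and `coarsen` are names invented for this example. Note that some mappings are lossy in ways a simple lookup cannot fix, e.g. Penn `IN` covers both prepositions and subordinating conjunctions.

```python
# The 17 tags as listed above (UD v1 naming; UD v2 uses CCONJ instead of CONJ).
UPOS = {
    "ADJ", "ADP", "ADV", "AUX", "CONJ", "DET", "INTJ", "NOUN", "NUM",
    "PART", "PRON", "PROPN", "PUNCT", "SYM", "SCONJ", "VERB", "X",
}

# Hypothetical, incomplete Penn -> UPOS lookup, for illustration only.
PENN_TO_UPOS = {
    "NN": "NOUN", "NNS": "NOUN",
    "NNP": "PROPN", "NNPS": "PROPN",
    "VB": "VERB", "VBD": "VERB", "VBG": "VERB",
    "VBN": "VERB", "VBP": "VERB", "VBZ": "VERB",
    "MD": "AUX",
    "JJ": "ADJ", "JJR": "ADJ", "JJS": "ADJ",
    "RB": "ADV", "RBR": "ADV", "RBS": "ADV",
    "DT": "DET", "CD": "NUM", "CC": "CONJ",
    "UH": "INTJ", "PRP": "PRON",
    "IN": "ADP",  # lossy: Penn IN also covers subordinating conjunctions (SCONJ)
}


def to_upos(penn_tag):
    """Map one Penn tag to a coarse tag; anything unmapped becomes 'X' (other)."""
    return PENN_TO_UPOS.get(penn_tag, "X")


def coarsen(tagged_tokens):
    """Convert a list of (word, penn_tag) pairs to (word, coarse_tag) pairs."""
    return [(word, to_upos(tag)) for word, tag in tagged_tokens]


if __name__ == "__main__":
    print(coarsen([("The", "DT"), ("dogs", "NNS"), ("bark", "VBP")]))
```

Going in this direction only throws information away (number, tense, degree); going the other way would need the features or the original text.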