Labeling in linguistics

This article/section is a stub: probably a pile of half-sorted notes and a first version that is not well-checked, so it may have incorrect bits. (Feel free to ignore, or tell me.)


Lexical categories / POS

Tagsets are the sets of lexical categories that POS taggers annotate data with. For example, the Penn tagset was used to annotate the Penn Treebank.

Tagsets are usually primarily based on parts of speech, sometimes with common linguistic features added.

The basic few POS tags are generally agreed on, but there is always room for discussion about the details, special cases, and the degree of normalisation (having many specific tags vs. having a few basic tags with features attached). These choices also affect how well a tagset can be applied to languages other than English.
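To make the "specific tags vs. tags with features" distinction concrete, here is a minimal sketch that splits a few fine-grained Penn tags into a coarse UPOS tag plus UD-style features. The mapping table is a hand-picked illustrative subset, not an official conversion table, and the helper names decompose and format_feats are made up for this example.

```python
# Illustrative, hand-picked subset; NOT an official Penn-to-UPOS conversion table.
PENN_TO_UPOS = {
    "NN":  ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "VBD": ("VERB", {"Tense": "Past"}),
    "VBZ": ("VERB", {"Tense": "Pres", "Person": "3", "Number": "Sing"}),
    "JJ":  ("ADJ", {}),
    "JJR": ("ADJ", {"Degree": "Cmp"}),
    "DT":  ("DET", {}),
}


def decompose(penn_tag):
    """Split a fine-grained Penn tag into (coarse tag, feature dict).

    Tags missing from the table fall back to UPOS "X" with no features.
    """
    upos, feats = PENN_TO_UPOS.get(penn_tag, ("X", {}))
    return upos, dict(feats)  # copy so callers cannot mutate the table


def format_feats(feats):
    """Render features as 'Key=Value|Key=Value' (sorted by key), or '_' if empty."""
    if not feats:
        return "_"
    return "|".join(f"{key}={value}" for key, value in sorted(feats.items()))


if __name__ == "__main__":
    tagged = [("The", "DT"), ("dogs", "NNS"), ("barked", "VBD")]
    for word, penn in tagged:
        upos, feats = decompose(penn)
        print(word, penn, upos, format_feats(feats))
```

The point of the sketch is only that one specific tag (e.g. NNS) carries the same information as a basic tag (NOUN) plus a feature (Number=Plur); a feature-based scheme tends to extend more easily to languages with richer morphology.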



UPOS

[1]

Penn

[2] [3]

TEI

C5, a.k.a. BNC basic

61 tags. See e.g. [4]

(C6)

The same as C7 except for handling of punctuation.

C7, a.k.a. BNC enriched

146 tags. See e.g. [5]


CLAWS

Various versions of the tagset; the current one is C8.

The latest program seems to be referred to as CLAWS4. See [6]


CLAWS1 tagset - 132 tags - [7]

CLAWS2 tagset - 166 tags - [8]

C5 - [9]

C6 - [10]

C7 - [11]

C8 - [12]


https://ucrel.lancs.ac.uk/claws/

Corpus-specific

...usually meaning 'not generally used'. See also corpora.

Parole

[13]

Brown

[14]

London-Lund

[15]

LOB, SEC

[16] [17]

POW

[18]

German

STTS

54 tags

[19]

Dependencies

Stanford dependency representation

https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

Universal Dependencies

Broad overview: https://universaldependencies.org/u/dep/



https://nlp.stanford.edu/pubs/USD_LREC14_paper_camera_ready.pdf

https://universaldependencies.org/introduction.html

https://en.wikipedia.org/wiki/Universal_Dependencies
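
As a rough illustration of what a dependency annotation amounts to, the sketch below stores, for each token, the index of its head and a relation label, then lists the resulting (relation, head, dependent) edges. The relation names det, nsubj and root are real Universal Dependencies labels; the Token class, the edges helper, and the example analysis are assumptions made up for this example rather than any library's API.

```python
from typing import NamedTuple


class Token(NamedTuple):
    idx: int      # 1-based position in the sentence
    form: str     # the word as written
    head: int     # idx of the governing token, 0 for the sentence root
    deprel: str   # dependency relation label


# Hand-written analysis of "The dog barked" (illustrative only).
sentence = [
    Token(1, "The", 2, "det"),
    Token(2, "dog", 3, "nsubj"),
    Token(3, "barked", 0, "root"),
]


def edges(tokens):
    """Return (relation, head word, dependent word) triples.

    A head index with no matching token (i.e. 0) is shown as "ROOT".
    """
    by_idx = {t.idx: t.form for t in tokens}
    return [(t.deprel, by_idx.get(t.head, "ROOT"), t.form) for t in tokens]


if __name__ == "__main__":
    for relation, head, dependent in edges(sentence):
        print(f"{relation}({head}, {dependent})")
```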