Labeling in linguistics

From Helpful
Revision as of 18:47, 10 July 2022 by Helpful (talk | contribs) (→‎ICE)
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Lexical categories / POS

A tagset is the set of lexical categories that a POS tagger annotates data with. For example, the Penn tagset was used to annotate the Penn Treebank.

Tagsets are based mainly on parts of speech, sometimes with common linguistic features added.

The few basic POS tags are widely agreed on, but the details are always open to discussion: special cases, and the degree of normalisation (many specific tags vs. fewer tags that essentially carry features). These choices also affect how well a tagset can be applied to languages other than English.
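To make the "specific tags vs. tags with features" distinction concrete, here is a minimal sketch that splits a few fine-grained Penn tags into a coarse UPOS-style category plus features. The table is a hand-picked illustrative subset, not a complete or official conversion, and the names PENN_TO_COARSE, decompose and format_features are made up for this example.

```python
# Illustrative subset only: a handful of Penn Treebank tags, each paired with
# a coarse category and the features the fine-grained tag bakes in.
# This is NOT a complete or official Penn-to-UPOS conversion table.
PENN_TO_COARSE = {
    "NN": ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "VBD": ("VERB", {"Tense": "Past"}),
    "VBZ": ("VERB", {"Tense": "Pres", "Person": "3", "Number": "Sing"}),
    "JJR": ("ADJ", {"Degree": "Cmp"}),
}


def decompose(tag):
    """Return (coarse_tag, features); tags not in the table map to ('X', {})."""
    coarse, feats = PENN_TO_COARSE.get(tag, ("X", {}))
    return coarse, dict(feats)  # copy, so callers cannot mutate the table


def format_features(feats):
    """Render features as sorted Name=Value pairs joined by '|', or '_' if empty."""
    if not feats:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(feats.items()))


if __name__ == "__main__":
    tagged = [("dogs", "NNS"), ("barked", "VBD")]
    for word, tag in tagged:
        coarse, feats = decompose(tag)
        print(word, tag, coarse, format_features(feats))
```

The fine-grained style puts everything into one opaque label (VBZ), while the feature style keeps a small set of categories and moves the rest into attributes, which tends to carry over more easily to languages with richer morphology.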



UPOS

[1]

Penn

[2] [3]

TEI

C5, a.k.a. BNC basic

61 tags. See e.g. [4]

(C6)

The same as C7 except for handling of punctuation.

C7, a.k.a. BNC enriched

146 tags. See e.g. [5]


CLAWS

Various tagset versions; the current one is C8.

The latest program seems to be referred to as CLAWS4. See [6]


CLAWS1 tagset - 132 tags - [7]

CLAWS2 tagset - 166 tags - [8]

C5 - [9]

C6 - [10]

C7 - [11]

C8 - [12]


https://ucrel.lancs.ac.uk/claws/

Corpus-specific

...usually meaning 'not generally used'. See also corpora.

Parole

[13]

Brown

[14]

London-Lund

[15]

LOB, SEC

[16] [17]

POW

[18]

German

STTS

54 tags

[19]

Dependencies

Stanford dependency representation

https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

Universal Dependencies

Broad overview: https://universaldependencies.org/u/dep/


https://nlp.stanford.edu/pubs/USD_LREC14_paper_camera_ready.pdf

https://universaldependencies.org/introduction.html

https://en.wikipedia.org/wiki/Universal_Dependencies
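
As a rough illustration of what a dependency annotation looks like in practice: Universal Dependencies treebanks are distributed in the CoNLL-U format, with one word per line and ten tab-separated columns. The sketch below reads one sentence and lists (head, relation, dependent) triples. The sample sentence and its analysis are hand-written for this example, not taken from a treebank, and the function names are made up. It assumes every word line has a numeric HEAD.

```python
# The ten CoNLL-U columns, in order.
CONLLU_FIELDS = ["id", "form", "lemma", "upos", "xpos",
                 "feats", "head", "deprel", "deps", "misc"]

# Hand-written sample sentence (an assumption for illustration, not treebank data).
SAMPLE = "\n".join([
    "# text = She reads books",
    "1\tShe\tshe\tPRON\tPRP\t_\t2\tnsubj\t_\t_",
    "2\treads\tread\tVERB\tVBZ\t_\t0\troot\t_\t_",
    "3\tbooks\tbook\tNOUN\tNNS\t_\t2\tobj\t_\t_",
])


def parse_conllu_sentence(text):
    """Parse one CoNLL-U sentence into a list of dicts, one per word."""
    tokens = []
    for line in text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blank lines and comment lines
        cols = line.split("\t")
        if len(cols) != len(CONLLU_FIELDS):
            raise ValueError(f"expected 10 columns, got {len(cols)}: {line!r}")
        tok = dict(zip(CONLLU_FIELDS, cols))
        if not tok["id"].isdigit():
            continue  # skip multiword-token ranges (1-2) and empty nodes (1.1)
        tok["id"] = int(tok["id"])
        tok["head"] = int(tok["head"])  # assumes HEAD is numeric, not "_"
        tokens.append(tok)
    return tokens


def dependency_triples(tokens):
    """Return (head_form, relation, dependent_form); head 0 is shown as 'ROOT'."""
    forms = {tok["id"]: tok["form"] for tok in tokens}
    forms[0] = "ROOT"
    return [(forms[tok["head"]], tok["deprel"], tok["form"]) for tok in tokens]


if __name__ == "__main__":
    for triple in dependency_triples(parse_conllu_sentence(SAMPLE)):
        print(triple)
```

For real data an existing CoNLL-U reader is probably the better choice; this only shows how the head and relation columns encode the tree.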