Tagsets

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A tagset is the set of lexical categories that a POS tagger annotates data with. For example, the Penn tagset was used to annotate the Penn Treebank.
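To make "annotating data with a tagset" concrete, here is a minimal sketch that reads text in the common word/TAG plain-text convention and lists the tags it uses. The sentence, the helper name parse_tagged and the "/" separator are assumptions made for this illustration, not something any particular corpus prescribes; the tags themselves are ordinary Penn tags.

```python
def parse_tagged(line, sep="/"):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # rpartition splits on the LAST separator, so a word that itself
        # contains the separator (e.g. 'and/or/CC') keeps its full form.
        word, _, tag = token.rpartition(sep)
        pairs.append((word, tag))
    return pairs


# Made-up example sentence, tagged with Penn tags.
sentence = "The/DT cat/NN sat/VBD on/IN the/DT mat/NN ./."
tagged = parse_tagged(sentence)

# The subset of the tagset that this sentence actually uses.
tagset_used = sorted({tag for _, tag in tagged})
print(tagged)
print(tagset_used)
```

Real corpora often use other encodings (XML attributes, one token per line, and so on), so treat the format here as one possibility among several.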

Tagsets are mostly based on parts of speech, sometimes with common linguistic features added.

The basic POS tags are broadly agreed on, but there is always room for discussion about the details: special cases, and the degree of normalisation (having many specific tags vs. having a few tags that essentially carry features). These choices also affect how well a tagset can be applied to languages other than English.
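The "specific tags vs. tags with features" distinction can be sketched by decomposing a few Penn tags into a coarse category plus features. The Penn tags are real, but the coarse labels, the feature names and the decompose helper are hypothetical choices made for this illustration, not part of any standard.

```python
# Hypothetical mapping: Penn tag -> (coarse category, features).
# Only a handful of noun and verb tags are covered.
PENN_TO_FEATURES = {
    "NN":   ("NOUN", {"number": "sing"}),
    "NNS":  ("NOUN", {"number": "plur"}),
    "NNP":  ("NOUN", {"number": "sing", "proper": True}),
    "NNPS": ("NOUN", {"number": "plur", "proper": True}),
    "VB":   ("VERB", {"form": "base"}),
    "VBD":  ("VERB", {"tense": "past"}),
    "VBG":  ("VERB", {"form": "gerund_or_present_participle"}),
    "VBN":  ("VERB", {"form": "past_participle"}),
    "VBP":  ("VERB", {"tense": "pres", "person": "non-3sg"}),
    "VBZ":  ("VERB", {"tense": "pres", "person": "3sg"}),
}


def decompose(tag):
    """Return (coarse category, features); unknown tags pass through as-is."""
    return PENN_TO_FEATURES.get(tag, (tag, {}))


for tag in ("NNS", "VBZ", "DT"):
    print(tag, "->", decompose(tag))
```

A tagset built the second way needs only the coarse categories and a feature inventory, which tends to carry over more easily to languages whose distinctions differ from English (case, gender and so on).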



Penn

[1] [2]

TEI

C5, a.k.a. BNC basic

61 tags. See e.g. [3]

(C6)

The same as C7 except for handling of punctuation.

C7, a.k.a. BNC enriched

146 tags. See e.g. [4]


CLAWS

The tagset exists in various versions; the current one is CLAWS7. The tagger program is numbered separately, and its latest version seems to be referred to as CLAWS4. See [5]

CLAWS1

132 tags [6]

CLAWS2

166 tags [7]

CLAWS5

[8]

CLAWS6

[9]

CLAWS7

[10]

Corpus-specific

Tagsets tied to a single corpus, which usually means they are not in general use. See also corpora.

Parole

[11]

Brown

[12]

ICE

London-Lund

[13]

LOB, SEC

[14] [15]

POW

[16]

German

STTS

54 tags

[17]