Labeling in linguistics: Difference between revisions

<!--
'''Versions'''
1 [https://universaldependencies.org/docsv1/]
Changes from 1 to 2 [https://universaldependencies.org/v2/summary.html]
2 [https://aclanthology.org/2020.lrec-1.497.pdf]
UD 2.0 versus UD 2.2 ?
2.0: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-1983
2.2: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2837
2.3: https://lindat.mff.cuni.cz/repository/xmlui/handle/11234/1-2895
https://universaldependencies.org/changes.html
https://github.com/UniversalDependencies




Revision as of 15:52, 3 April 2024

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Lexical categories / POS

A tagset is the set of lexical categories that a POS tagger annotates data with. For example, the Penn tagset was used to annotate the Penn Treebank.

Tagsets are usually primarily based on parts of speech, sometimes with common linguistic features added.

The few basic POS tags are generally agreed on, but there is always room for discussion about the details, special cases, and the degree of normalisation (many specific tags vs. a few coarse tags refined with features). These choices also affect how well a tagset can be applied to languages other than English.
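
To make the "specific tags vs. tags with features" distinction concrete, here is a minimal sketch. It splits a handful of Penn tags into a coarse category plus features. The feature names follow Universal Dependencies conventions; the table itself and the helpers `decompose` and `format_features` are hypothetical and hand-picked for illustration, not an official or complete conversion.

```python
# Illustrative only: a few Penn Treebank tags decomposed into a coarse
# part of speech plus morphological features (UD-style feature names).
# This table is a hand-picked sample, not a complete or official mapping.
PENN_TO_COARSE = {
    "NN":  ("NOUN", {"Number": "Sing"}),
    "NNS": ("NOUN", {"Number": "Plur"}),
    "VBD": ("VERB", {"Tense": "Past", "VerbForm": "Fin"}),
    "VBZ": ("VERB", {"Number": "Sing", "Person": "3",
                     "Tense": "Pres", "VerbForm": "Fin"}),
    "JJR": ("ADJ",  {"Degree": "Cmp"}),
}


def decompose(penn_tag):
    """Return (coarse_tag, features) for a Penn tag; unknown tags map to X."""
    return PENN_TO_COARSE.get(penn_tag, ("X", {}))


def format_features(features):
    """Render features as 'Name=Value|Name=Value' sorted by name, '_' if empty."""
    if not features:
        return "_"
    return "|".join(f"{name}={value}" for name, value in sorted(features.items()))


for tag in ("NNS", "VBZ", "FOO"):
    coarse, feats = decompose(tag)
    print(tag, coarse, format_features(feats))
```

A fine-grained tagset encodes everything in the tag name (VBZ), while a coarse tagset keeps the tag small (VERB) and moves the rest into features, which tends to transfer more easily between languages.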



UPOS

[1]

Penn

[2] [3]

TEI

C5, a.k.a. BNC basic

61 tags. See e.g. [4]

(C6)

The same as C7 except for the handling of punctuation.

C7, a.k.a. BNC enriched

146 tags. See e.g. [5]


CLAWS

Various versions exist; the current tagset is C8.

The latest program seems to be referred to as CLAWS4. See [6]


CLAWS1 tagset - 132 tags - [7]

CLAWS2 tagset - 166 tags - [8]

C5 - [9]

C6 - [10]

C7 - [11]

C8 - [12]


https://ucrel.lancs.ac.uk/claws/

Corpus-specific

...usually meaning 'not generally used'. See also corpora.

Parole

[13]

Brown

[14]

London-Lund

[15]

LOB, SEC

[16] [17]

POW

[18]

German

STTS

54 tags

[19]

Dependencies

Stanford dependency representation

https://downloads.cs.stanford.edu/nlp/software/dependencies_manual.pdf

Universal Dependencies

Broad overview: https://universaldependencies.org/u/dep/



https://nlp.stanford.edu/pubs/USD_LREC14_paper_camera_ready.pdf

https://universaldependencies.org/introduction.html

https://en.wikipedia.org/wiki/Universal_Dependencies


Universal Parts of Speech

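Universal Dependencies' universal POS tags (UPOS), as listed in the UD v1 documentation: http://universaldependencies.org/docs/u/pos/index.html (UD v2 renamed CONJ to CCONJ.)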

   ADJ: adjective
   ADP: adposition
   ADV: adverb
   AUX: auxiliary verb
   CONJ: coordinating conjunction
   DET: determiner
   INTJ: interjection
   NOUN: noun
   NUM: numeral
   PART: particle
   PRON: pronoun
   PROPN: proper noun
   PUNCT: punctuation
   SCONJ: subordinating conjunction
   SYM: symbol
   VERB: verb
   X: other
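
For checking tagger output against this list, a lookup table can be handy. The sketch below copies the 17 tags above into a dict and adds a hypothetical helper, `normalize_upos`, that validates a tag and maps the v1 name CONJ to its UD v2 name CCONJ. The names `UPOS_V1`, `V1_TO_V2` and `normalize_upos` are made up for this example, and CONJ to CCONJ is the only rename it handles.

```python
# The 17 universal POS tags exactly as listed above (UD v1 naming).
UPOS_V1 = {
    "ADJ": "adjective",
    "ADP": "adposition",
    "ADV": "adverb",
    "AUX": "auxiliary verb",
    "CONJ": "coordinating conjunction",
    "DET": "determiner",
    "INTJ": "interjection",
    "NOUN": "noun",
    "NUM": "numeral",
    "PART": "particle",
    "PRON": "pronoun",
    "PROPN": "proper noun",
    "PUNCT": "punctuation",
    "SCONJ": "subordinating conjunction",
    "SYM": "symbol",
    "VERB": "verb",
    "X": "other",
}

# Assumption for this sketch: the only rename handled is CONJ -> CCONJ.
V1_TO_V2 = {"CONJ": "CCONJ"}


def normalize_upos(tag):
    """Validate a UPOS tag and return its UD v2 spelling.

    Accepts v1 names from the table above as well as the v2 name CCONJ;
    raises ValueError for anything else.
    """
    cleaned = tag.strip().upper()
    if cleaned in V1_TO_V2.values():
        return cleaned
    if cleaned not in UPOS_V1:
        raise ValueError(f"not a universal POS tag: {tag!r}")
    return V1_TO_V2.get(cleaned, cleaned)


print(normalize_upos("conj"))
print(normalize_upos("NOUN"))
```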