Revision as of 11:23, 19 June 2024

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Tools

Online, open source

label studio

image, text
browser based (your own hosted copy)
https://labelstud.io/

doccano

text
browser based (your own hosted copy)
https://doccano.github.io/doccano/

ML-Annotate

text
browser based (your own hosted copy)
https://github.com/falcony-io/ml-annotate

brat

text?
browser based (your own hosted copy)
http://brat.nlplab.org/

annotator.js

text
browser extension, meant to work on webpages
http://annotatorjs.org/

Annotation Lab (a.k.a. NLP Lab)

text(, also images?)
https://nlp.johnsnowlabs.com/docs/en/alab/quickstart

Paid and/or closed source

(mostly online or self-hosted)

datagym

image and video
web based
assisted labeling
free/paid model
open source
https://www.datagym.ai/

LightTag

text
free start, mostly paid
https://www.lighttag.io/

Label Your Data

https://labelyourdata.com/
closed source
paid

prodigy

text (including specific spacy things like pos, ner, dep); images, audio
paid only?(verify) [1]
https://prodi.gy/

LabelBox

free for small data, mostly paid
https://labelbox.com/

CVAT

image and video
paid; free is limited
https://cvat.ai/

GUI, open source

LabelImg

image
Python, Qt (local install)
open source
https://github.com/heartexlabs/labelImg

MAE (Multi-document Annotation Environment)

text
GUI (Java)
open source
https://keighrim.github.io/mae-annotation/
https://github.com/keighrim/mae-annotation

YEDDA

text
GUI app
open source
https://github.com/jiesutd/YEDDA

ELAN

audio and video
open source
https://archive.mpi.nl/tla/elan

Praat

audio
open source
specialized for phonetics/linguistics
https://www.fon.hum.uva.nl/praat/

Phon

audio
specialized for phonetics/linguistics
open source
https://www.phon.ca/phon-manual/getting_started.html
https://github.com/phon-ca/phon

Unsorted

ipyannotations

text (images overall)
python notebook

poplar

text?
https://github.com/synyi/poplar

VGG Oxford University

varied

Annotation data formats

Text

CoNLL-X and CoNLL-U

CoNLL-U

https://universaldependencies.org/format.html

Roughly:

three types of lines:
- blank line, marking a sentence boundary
- # at the start, marking sentence comments
- word lines, which are tab-separated fields
  - ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
  - FORM: Word form or punctuation symbol.
  - LEMMA: Lemma or stem of word form.
  - UPOS: Universal part-of-speech tag.
  - XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
  - FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
  - HEAD: Head of the current word, which is either a value of ID or zero (0).
  - DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
  - DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
  - MISC: Any other annotation.
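
As a rough illustration of the three line types above, here is a minimal reader sketch using only the standard library. The names parse_conllu, FIELDS and SAMPLE are made up for this example, and the sample sentence is hand-written rather than taken from a treebank. It keeps every field as a string and does no validation beyond counting columns, so treat it as a sketch rather than a full CoNLL-U implementation.

```python
# Minimal CoNLL-U reader sketch (hypothetical helper names, standard library only).
FIELDS = ["ID", "FORM", "LEMMA", "UPOS", "XPOS",
          "FEATS", "HEAD", "DEPREL", "DEPS", "MISC"]


def parse_conllu(text):
    """Return a list of sentences; each is a dict with 'comments' and 'words'."""
    sentences = []
    comments, words = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: sentence boundary.
            if comments or words:
                sentences.append({"comments": comments, "words": words})
                comments, words = [], []
        elif line.startswith("#"):
            # Sentence-level comment, e.g. "# text = ..."
            comments.append(line[1:].strip())
        else:
            # Word line: exactly ten tab-separated fields.
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError(
                    f"expected {len(FIELDS)} fields, got {len(values)}")
            words.append(dict(zip(FIELDS, values)))
    # Accept a final sentence that is not followed by a blank line.
    if comments or words:
        sentences.append({"comments": comments, "words": words})
    return sentences


# Hand-written sample, for illustration only.
SAMPLE = (
    "# text = Dogs bark.\n"
    "1\tDogs\tdog\tNOUN\t_\tNumber=Plur\t2\tnsubj\t_\t_\n"
    "2\tbark\tbark\tVERB\t_\t_\t0\troot\t_\tSpaceAfter=No\n"
    "3\t.\t.\tPUNCT\t_\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for word in parse_conllu(SAMPLE)[0]["words"]:
    print(word["ID"], word["FORM"], word["UPOS"], word["HEAD"], word["DEPREL"])
```

ID and HEAD are deliberately left as strings here, because ID can be a range (multiword tokens) or a decimal (empty nodes), so a plain int() would fail on valid files.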

CoNLL-X

can be seen as an earlier version of CoNLL-U: similar, but with different column definitions. There are uses where the two are compatible(verify)

https://aclanthology.org/W06-2920.pdf

File-extension wise, .conll often means CoNLL-X, .conllu often means CoNLL-U

Only half related is CoNLL-UL.

unsorted

IOB

IOB (Inside, Outside, Beginning), a.k.a. BIO, is a tagging format that certain annotators output to mark sequences of adjacent tokens, such as in entity recognition.

This seems to originate from using a chunker to also tag these, and having it output a separate list that is the same length as the input tokens (which is easier to parse separately than trying to mush it into a single annotation along the way).

It marks

B the beginning of a sequence
I inside sequence
O outside a sequence (terminating an ongoing sequence, or later not inside one)

(examples often show something like 'I-LOC' (LOC for location) - because you can push that into a single list of strings the same length as the token list you are annotating, and you can recover both the variable-length chunks and the annotated types from that single list. The below separates that a little to focus on the B/I/O/other markings)
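
To make the "recover chunks and types from a single list" point concrete, here is a small decoding sketch. The function name decode_spans and the tuple layout are my own choices, not any library's API. It is deliberately lenient: a chunk starts at a B tag, or at an I tag that does not continue a chunk of the same type, so it accepts input whether or not chunks open with B.

```python
def decode_spans(tags):
    """Turn tags like ['B-LOC', 'I-LOC', 'O'] into (start, end, type) spans.

    'end' is exclusive, so tokens[start:end] gives the chunk's tokens.
    """
    spans = []
    start, current = None, None  # current open chunk, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        if prefix == "O":
            label = None
        # A new chunk starts on an explicit B, or on an I of a different type.
        starts_new = prefix == "B" or (prefix == "I" and label != current)
        # An open chunk ends on O or whenever a new chunk starts.
        if current is not None and (prefix == "O" or starts_new):
            spans.append((start, i, current))
            start, current = None, None
        if starts_new:
            start, current = i, label
    if current is not None:
        spans.append((start, len(tags), current))
    return spans


tokens = ["Los", "Angeles", "in", "California"]
tags = ["B-LOC", "I-LOC", "O", "B-LOC"]
for start, end, label in decode_spans(tags):
    print(label, " ".join(tokens[start:end]))
```

For the tags above this gives the spans (0, 2, 'LOC') and (3, 4, 'LOC'), i.e. "Los Angeles" and "California" as two separate locations.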

There are actually a number of flavours of this idea.

The simplest might be IO - which basically isn't used at all, because it can't code that two different things are right next to each other.

There are actually surprisingly few cases where this really matters for NER, just because most entities aren't directly adjacent to each other, but why build in that limitation?

So an approach with three markings (IOB) seems the default.

In IOB, the difference between I and B is there in part to let you annotate that there are separate adjacent things.

For example, doing NER tagging might output:

Los          I   LOC
Angeles      I   LOC
in           O
California   I   LOC

...which implies that Los Angeles belongs together, and the separate in/O implies that California is a different thing. If someone removed that 'in' (and were bad at adding punctuation) then you might want to annotate like:

Los          I   LOC
Angeles      I   LOC
California   B   LOC

to signify Los Angeles and California are separate things.

IOB2 seems to refer to the variant that puts B on the first token of every chunk, so it would write those same two examples as:

Los          B   LOC
Angeles      I   LOC
in           O
California   B   LOC

and

Los          B   LOC
Angeles      I   LOC
California   B   LOC
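
The difference between the two flavours is mechanical enough to sketch as a conversion. This assumes the combined 'I-LOC' style of tag strings and that the input follows the first convention shown above (B only where needed to split adjacent chunks); iob1_to_iob2 is a name made up for this example.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB1-style tags so that every chunk opens with B (IOB2 style)."""
    out = []
    prev_label = None  # type of the previous token's chunk, None when outside
    for tag in tags:
        if tag == "O":
            out.append("O")
            prev_label = None
            continue
        prefix, _, label = tag.partition("-")
        if prefix == "I" and label != prev_label:
            # IOB1 lets a chunk open with I; IOB2 always opens with B.
            prefix = "B"
        out.append(f"{prefix}-{label}")
        prev_label = label
    return out


# The two examples from the text above.
print(iob1_to_iob2(["I-LOC", "I-LOC", "O", "I-LOC"]))
print(iob1_to_iob2(["I-LOC", "I-LOC", "B-LOC"]))
```

The first call gives B-LOC, I-LOC, O, B-LOC and the second gives B-LOC, I-LOC, B-LOC, matching the IOB2 listings above.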

There is also the addition of

L or E to signify Last/Ending in a compound
S or U to signify Single-token entity (or 'Unit')

So now there is

IOE, IOE2: delimits adjacent things by putting E on the last token of the previous

BIOES and BILOU where we might annotate like

Alex    S   PER
is      O
going   O
with    O
Marty   B   PER
A.      I   PER
Rick    E   PER
to      O
Los     B   LOC
Angeles E   LOC
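
Going from IOB2 to BIOES only needs a one-token lookahead, which the following sketch shows. It assumes well-formed IOB2 input in the combined 'B-PER' string style; iob2_to_bioes is a hypothetical helper name, not something from a particular library.

```python
def iob2_to_bioes(tags):
    """Convert IOB2 tags to BIOES by checking whether the next tag continues the chunk."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, _, label = tag.partition("-")
        next_tag = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = next_tag == f"I-{label}"
        if prefix == "B":
            # A chunk that does not continue is a single-token chunk.
            out.append(f"B-{label}" if continues else f"S-{label}")
        else:
            # An inside token that does not continue is the chunk's last token.
            out.append(f"I-{label}" if continues else f"E-{label}")
    return out


tokens = ["Alex", "is", "going", "with", "Marty", "A.", "Rick", "to", "Los", "Angeles"]
iob2 = ["B-PER", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
for token, tag in zip(tokens, iob2_to_bioes(iob2)):
    print(f"{token:8}{tag}")
```

On the sentence above this reproduces the S/B/I/E/O markings of the listing. BILOU is the same idea with L and U in place of E and S.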

"START/END" might refer to BIOES?

BMEWO is similar, with different naming:

B beginning-of-entity

W single-token entity

E end-of-entity

M mid-entity

O outside

BMEWO+ is similar to BMEWO but the rules of interpretation are expanded/relaxed, because it was made for a model where the output doesn't have strict memory of adjacency(verify)

https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

https://spacy.io/usage/linguistic-features#accessing-ner

https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format

https://web.archive.org/web/20170805150451/https://lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/

Text in time

Text Encoding Initiative (TEI)

https://tei-c.org/

FoLiA

FoLiA (Format for Linguistic Annotation) is a format to annotate text resources, in theory for rich and interoperable linguistic annotation, for things like transcription, corpora (glossaries, dictionaries, thesauri and wordnets, etc), and processing.

While the serialized XML form is complex to read, libraries should make it reasonable to read and alter.

It was presented as a better alternative to ad hoc storage, which you tend to spend time figuring out for each dataset.

It is unopinionated in the sense that

it does not restrict to a particular label set or theory

it allows marking up of different things

all vocabulary sets need to be explicitly referenced (a SKOS / RDF thing, but don't let that scare you off).

It deals separately with things like

inline annotations of individual elements
(inline) annotations of spans of elements
subtoken, for morphology and phonology
document structure
higher-level things like arbitrary selections and arbitrary relations

See also:

https://proycon.github.io/folia/

https://folia.readthedocs.io/en/latest/introduction.html

https://www.researchgate.net/publication/261215684_FoLiA_A_practical_XML_format_for_linguistic_annotation_-_A_descriptive_and_comparative_study

There is also a FoLiA Query Language that lets you select and also edit documents.

https://folia.readthedocs.io/en/latest/fql.html

There are web annotation tools like FLAT, that build on a document server

What does it annotate?

things like:

relatively mechanical structure
  on the macro level (e.g. paragraphs, head, divisions, lists, figures), the ability to define terms and create glossaries and such
  on a smaller level (e.g. whitespace, tokens, morphemes)
more semantic things like quotes, events, the difference between utterances and sentences
additional annotation types, e.g. phonetic, sentiment, language, POS, lemma, sense, reference
larger annotation, like spans and span relations
corrections

...although it may not be advisable to use it for everything it can do at once.

https://folia.readthedocs.io/en/latest/introduction.html#annotation-types

What does it look like?

https://github.com/proycon/folia/tree/master/examples

Who or what uses it?

Universities, mainly.

Category: Computational linguistics

Data annotation notes: Difference between revisions
