Data annotation notes and tools
Tools
Online, open source
label studio
- image, text
- browser based (your own hosted copy)
- https://labelstud.io/
doccano
- text
- browser based (your own hosted copy)
- https://doccano.github.io/doccano/
ML-Annotate
- text
- browser based (your own hosted copy)
- https://github.com/falcony-io/ml-annotate
brat
- text?
- browser based (your own hosted copy)
- http://brat.nlplab.org/
annotator.js
- text
- browser extension, meant to work on webpages
- http://annotatorjs.org/
Annotation Lab (a.k.a. NLP Lab)
- text(, also images?)
- https://nlp.johnsnowlabs.com/docs/en/alab/quickstart
Paid and/or closed source
(mostly online or self-hosted)
datagym
- image and video
- web based
- assisted labeling
- free/paid model
- open source
- https://www.datagym.ai/
LightTag
- text
- free start, mostly paid
- https://www.lighttag.io/
Label Your Data
- https://labelyourdata.com/
- closed source
- paid
prodigy
- text (including specific spacy things like pos, ner, dep); images, audio
- paid only?(verify) [1]
- https://prodi.gy/
LabelBox
- free for small data, mostly paid
- https://labelbox.com/
CVAT
- image and video
- paid; free is limited
- https://cvat.ai/
GUI, open source
LabelImg
- image
- Python, Qt (local install)
- open source
- https://github.com/heartexlabs/labelImg
MAE (Multi-document Annotation Environment)
- text
- GUI (Java)
- open source
- https://keighrim.github.io/mae-annotation/
- https://github.com/keighrim/mae-annotation
YEDDA
- text
- GUI app
- open source
- https://github.com/jiesutd/YEDDA
ELAN
- audio and video
- open source
- https://archive.mpi.nl/tla/elan
Praat
- audio
- open source
- specialized for phonetics/linguistics
- https://www.fon.hum.uva.nl/praat/
Phon
- audio
- specialized for phonetics/linguistics
- open source
- https://www.phon.ca/phon-manual/getting_started.html
- https://github.com/phon-ca/phon
Unsorted
ipyannotations
- text (images overall)
- python notebook
poplar
VGG Oxford University
- varied
Annotation data formats
Text
CoNLL-X and CoNLL-U
CoNLL-U
Roughly:
- three types of lines:
- blank line, marking a sentence boundary
- # at the start, marking sentence comments
- word lines, which are tab-separated fields
- ID: Word index, integer starting at 1 for each new sentence; may be a range for multiword tokens; may be a decimal number for empty nodes (decimal numbers can be lower than 1 but must be greater than 0).
- FORM: Word form or punctuation symbol.
- LEMMA: Lemma or stem of word form.
- UPOS: Universal part-of-speech tag.
- XPOS: Optional language-specific (or treebank-specific) part-of-speech / morphological tag; underscore if not available.
- FEATS: List of morphological features from the universal feature inventory or from a defined language-specific extension; underscore if not available.
- HEAD: Head of the current word, which is either a value of ID or zero (0).
- DEPREL: Universal dependency relation to the HEAD (root iff HEAD = 0) or a defined language-specific subtype of one.
- DEPS: Enhanced dependency graph in the form of a list of head-deprel pairs.
- MISC: Any other annotation.
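As a rough illustration of the line types and ten fields above, here is a minimal reader sketch in Python. It is not the official tooling: the names `FIELDS`, `parse_conllu` and `SAMPLE` are made up for this example, the sample sentence is hand-written, and all values are kept as strings (ID can be a range or a decimal, so it is not converted to an integer).

```python
# Field names in column order, lowercased from the list above.
FIELDS = ["id", "form", "lemma", "upos", "xpos",
          "feats", "head", "deprel", "deps", "misc"]


def parse_conllu(text):
    """Return a list of sentences, each a dict with 'comments' and 'words'."""
    sentences = []
    comments, words = [], []
    for line in text.splitlines():
        if not line.strip():
            # Blank line: sentence boundary.
            if comments or words:
                sentences.append({"comments": comments, "words": words})
            comments, words = [], []
        elif line.startswith("#"):
            # Sentence-level comment.
            comments.append(line[1:].strip())
        else:
            # Word line: ten tab-separated fields, all kept as strings.
            values = line.split("\t")
            if len(values) != len(FIELDS):
                raise ValueError("expected 10 fields, got %d" % len(values))
            words.append(dict(zip(FIELDS, values)))
    if comments or words:
        # Input that does not end in a blank line.
        sentences.append({"comments": comments, "words": words})
    return sentences


# Hand-written sample, for illustration only.
SAMPLE = (
    "# text = They buy books.\n"
    "1\tThey\tthey\tPRON\tPRP\tCase=Nom|Number=Plur\t2\tnsubj\t_\t_\n"
    "2\tbuy\tbuy\tVERB\tVBP\tNumber=Plur|Tense=Pres\t0\troot\t_\t_\n"
    "3\tbooks\tbook\tNOUN\tNNS\tNumber=Plur\t2\tobj\t_\tSpaceAfter=No\n"
    "4\t.\t.\tPUNCT\t.\t_\t2\tpunct\t_\t_\n"
    "\n"
)

for sentence in parse_conllu(SAMPLE):
    for word in sentence["words"]:
        print(word["id"], word["form"], word["upos"], word["head"], word["deprel"])
```

For real data, an existing CoNLL-U library is probably the better choice, since this sketch does no validation beyond counting fields.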
CoNLL-X
- can be seen as an earlier version of CoNLL-U, with a similar structure but different column definitions.
- There are uses where the two are compatible(verify)
- https://aclanthology.org/W06-2920.pdf
File-extension wise,
.conll often means CoNLL-X, .conllu often means CoNLL-U
Only half related is CoNLL-UL.
unsorted
IOB
IOB (Inside, Outside, Beginning),
a.k.a. BIO for the order the letters more typically come in,
is a tagging scheme that some annotation output uses to mark sequences of adjacent tokens,
often for named entity recognition.
This seems to originate from using a chunker to also tag these, and having it output the tags separately, often as a list the same length as the list of tokens (easier to parse, and to keep separate, than trying to mush it into a single annotation along the way).
It marks
- B the beginning of a sequence
- I still inside a sequence
- O outside a sequence (terminating an ongoing sequence, or still not inside one)
Examples often show IOB output like I-LOC signifying both
- I for inside
- LOC for LOCation
- This is using a single string to mark both these things - and allowing some disambiguation.
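A small sketch of pulling such a combined string apart again, assuming the common convention of a hyphen between the marker and the type. The helper name `split_tag` and the sample tags are made up for illustration.

```python
def split_tag(tag):
    """Split 'I-LOC' into ('I', 'LOC'); a bare 'O' becomes ('O', None)."""
    # Split on the first hyphen only, so a type that itself contains
    # a hyphen stays in one piece.
    marker, sep, label = tag.partition("-")
    return marker, (label if sep else None)


# The tag list runs parallel to the token list: same length, same order.
tokens = ["Los", "Angeles", "in", "California"]
tags = ["I-LOC", "I-LOC", "O", "I-LOC"]

for token, tag in zip(tokens, tags):
    print(token, split_tag(tag))
```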
The below separates that a little to focus on the B/I/O(/other) markings.
There are actually a number of flavours of this idea.
- The simplest might be IO
- which basically isn't used at all, because it cannot signify that two sequences are right next to each other when nothing is in between.
- There are actually surprisingly few cases where this really matters for NER
- ...but still, why build in that limitation? So an approach with three markings (IOB) seems the default instead
IOB lets you annotate that there are separate adjacent strings, by having a distinction between I and B
There are also multiple ways of doing that.
Consider that NER tagging might output:
Los (I, LOC)  Angeles (I, LOC)  in (O)  California (I, LOC)
- ...which implies that Los Angeles belongs together, and the O in between implies that California is a different thing.
If someone removed that 'in' (and were bad at adding punctuation) then you might want to annotate like:
Los (I, LOC)  Angeles (I, LOC)  California (B, LOC)
to signify Los Angeles and California are separate things.
IOB2 seems to refer to the variant that starts every sequence with B, so would equivalently do that like:
Los (B, LOC)  Angeles (I, LOC)  in (O)  California (B, LOC)
and
Los (B, LOC)  Angeles (I, LOC)  California (B, LOC)
(There seems to be no functional difference. The latter also seems valid in classic IOB, just not its output convention?(verify))
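The relation between the two conventions can be sketched as a conversion: in IOB2 every sequence starts with B, so any I that opens a sequence in classic IOB becomes a B. This is a minimal sketch that assumes tags are already split into (marker, type) pairs; the function name `iob1_to_iob2` is made up for this example.

```python
def iob1_to_iob2(tags):
    """Convert classic IOB (marker, label) pairs to IOB2."""
    out = []
    prev_marker, prev_label = "O", None
    for marker, label in tags:
        # An I opens a new sequence when nothing precedes it,
        # or when the preceding sequence has a different type.
        if marker == "I" and (prev_marker == "O" or prev_label != label):
            marker = "B"
        out.append((marker, label))
        prev_marker, prev_label = marker, label
    return out


# "Los Angeles in California" in classic IOB
with_in = [("I", "LOC"), ("I", "LOC"), ("O", None), ("I", "LOC")]
# "Los Angeles California", where classic IOB already needs a B
without_in = [("I", "LOC"), ("I", "LOC"), ("B", "LOC")]

print(iob1_to_iob2(with_in))
print(iob1_to_iob2(without_in))
```

An existing B is left as it is, so input that is already IOB2 passes through unchanged.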
Further variants lie in the addition of
- L or E to signify Last/Ending in a compound
- S or U to signify Single-token entity (or 'Unit')
So now there is
- IOE, IOE2: delimits adjacent things by putting E on the last token of the previous
- BIOES and BILOU where we might annotate like
Alex (S, PER)  is (O)  going (O)  with (O)  Marty (B, PER)  A. (I, PER)  Rick (E, PER)  to (O)  Los (B, LOC)  Angeles (E, LOC)
- "START/END" might refer to BIOES?
- BWEMO is similar, with different naming:
- B beginning-of-entity
- W single-token entity
- E end-of-entity
- M mid-entity
- O outside
- BWEMO+ is similar to BWEMO but the rules of interpretation are expanded/relaxed,
- (because it was made for a model where the output doesn't have strict memory of adjacency?(verify))
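To make the BIOES markings concrete, here is a minimal sketch that derives them from IOB2 by looking one token ahead: a sequence that does not continue ends in E, or is a single-token S. It assumes (marker, type) pairs as input; the function name `iob2_to_bioes` is made up for this example.

```python
def iob2_to_bioes(tags):
    """Convert IOB2 (marker, label) pairs to BIOES."""
    out = []
    for i, (marker, label) in enumerate(tags):
        # Look one tag ahead; past the end counts as O.
        nxt_marker, nxt_label = tags[i + 1] if i + 1 < len(tags) else ("O", None)
        continues = nxt_marker == "I" and nxt_label == label
        if marker == "B":
            marker = "B" if continues else "S"   # single-token entity
        elif marker == "I":
            marker = "I" if continues else "E"   # last token of an entity
        out.append((marker, label))
    return out


# "Alex is going with Marty A. Rick to Los Angeles" in IOB2
iob2 = [("B", "PER"), ("O", None), ("O", None), ("O", None),
        ("B", "PER"), ("I", "PER"), ("I", "PER"), ("O", None),
        ("B", "LOC"), ("I", "LOC")]

print([marker for marker, _ in iob2_to_bioes(iob2)])
```

Renaming E to L and S to U gives BILOU, and the BWEMO letters map onto the same five roles.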
https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
https://spacy.io/usage/linguistic-features#accessing-ner
https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format
Text in time
Audacity annotation https://manual.audacityteam.org/man/creating_and_selecting_labels.html
ELAN (EAF, Elan annotation format)
TextGrid (for Praat)
SubRip (SRT), SubViewer (SBV/SUB), TTML, WebVTT (VTT);
Text Encoding Initiative (TEI)
FoLiA
FoLiA (Format for Linguistic Annotation) is a format to annotate text resources, in theory for rich and interoperable linguistic annotation, for things like transcription, corpora (glossaries, dictionaries, thesauri and wordnets, etc), and processing.
While the serialized XML form is complex to read,
libraries should make it reasonable to read and alter.
It was presented as a better alternative to ad hoc storage formats,
which you tend to spend time figuring out for each dataset.
It is unopinionated in the sense that
- it does not restrict to a particular label set or theory
- it allows marking up of different things
- all vocabulary sets need to be explicitly referenced (a SKOS / RDF thing, but don't let that scare you off).
It deals separately with things like
- inline annotations of individual elements
- (inline) annotations of spans of elements
- subtoken, for morphology and phonology
- document structure
- higher-level things like arbitrary selections and arbitrary relations
See also:
There is also a FoLiA Query Language that lets you select and also edit documents. [2]
There are web annotation tools like FLAT, that build on a document server
What does it annotate?
things like:
- relatively mechanical structure
- on the macro level (e.g. paragraphs, head, divisions, lists, figures), plus the ability to define terms and create glossaries and such
- on a smaller level (e.g. whitespace, tokens, morphemes)
- more semantic things like quotes, events, the difference between utterances and sentences
- additional annotation types, e.g. phonetic, sentiment, language, POS, lemma, sense, reference
- larger annotation, like spans and span relations
- corrections
...although it may not be advisable to use it for everything it can do at once.
https://folia.readthedocs.io/en/latest/introduction.html#annotation-types
What does it look like?
https://github.com/proycon/folia/tree/master/examples
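For a quick impression without opening those examples, here is a heavily simplified, hand-written fragment, read with Python's standard XML parser. This is only a sketch: the fragment leaves out the metadata and annotation declarations a real document has, so it is not a valid FoLiA document, and the class values and the helper name `words_with_pos` are made up. A dedicated FoLiA library is the better choice for real use.

```python
import xml.etree.ElementTree as ET

# FoLiA elements live in this XML namespace.
FOLIA_NS = {"f": "http://ilk.uvt.nl/folia"}

# Simplified illustration only; real documents also carry metadata
# and declare the annotation types and sets they use.
SAMPLE = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Hello</t><pos class="INTJ"/></w>
      <w xml:id="example.s.1.w.2"><t>world</t><pos class="NOUN"/></w>
    </s>
  </text>
</FoLiA>"""


def words_with_pos(xml_text):
    """Return (text, pos class) pairs for every word element."""
    root = ET.fromstring(xml_text)
    result = []
    for w in root.iterfind(".//f:w", FOLIA_NS):
        t = w.find("f:t", FOLIA_NS)        # text content of the word
        pos = w.find("f:pos", FOLIA_NS)    # inline part-of-speech annotation
        result.append((
            t.text if t is not None else None,
            pos.get("class") if pos is not None else None,
        ))
    return result


print(words_with_pos(SAMPLE))
```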
Who or what uses it?
Universities, mainly.