Data annotation notes
Tools
Online, open source
label studio
- image, text
- browser based (your own hosted copy)
- https://labelstud.io/
doccano
- text
- browser based (your own hosted copy)
- https://doccano.github.io/doccano/
ML-Annotate
- text
- browser based (your own hosted copy)
- https://github.com/falcony-io/ml-annotate
brat
- text?
- browser based (your own hosted copy)
- http://brat.nlplab.org/
annotator.js
- text
- browser extension, meant to work on webpages
- http://annotatorjs.org/
Annotation Lab (a.k.a. NLP Lab)
- text(, also images?)
- https://nlp.johnsnowlabs.com/docs/en/alab/quickstart
Paid and/or closed source
(mostly online or self-hosted)
datagym
- image and video
- web based
- assisted labeling
- free/paid model
- open source
- https://www.datagym.ai/
LightTag
- text
- free start, mostly paid
- https://www.lighttag.io/
Label Your Data
- https://labelyourdata.com/
- closed source
- paid
prodigy
- text (including specific spacy things like pos, ner, dep); images, audio
- paid only?(verify) [1]
- https://prodi.gy/
LabelBox
- free for small data, mostly paid
- https://labelbox.com/
CVAT
- image and video
- paid; free is limited
- https://cvat.ai/
GUI, open source
LabelImg
- image
- Python, Qt (local install)
- open source
- https://github.com/heartexlabs/labelImg
MAE (Multi-document Annotation Environment)
- text
- GUI (Java)
- open source
- https://keighrim.github.io/mae-annotation/
- https://github.com/keighrim/mae-annotation
YEDDA
- text
- GUI app
- open source
- https://github.com/jiesutd/YEDDA
ELAN
- audio and video
- open source
- https://archive.mpi.nl/tla/elan
Praat
- audio
- open source
- specialized for phonetics/linguistics
- https://www.fon.hum.uva.nl/praat/
Phon
- audio
- specialized for phonetics/linguistics
- open source
- https://www.phon.ca/phon-manual/getting_started.html
- https://github.com/phon-ca/phon
Unsorted
ipyannotations
- text (images overall)
- python notebook
poplar
VGG Oxford University
- varied
Annotation data formats
IOB
IOB (Inside, Outside, Beginning), a.k.a. BIO, is a format that certain annotation tools output to mark sequences of adjacent tokens,
such as in entity recognition.
This seems to originate in using a chunker to also tag these: its output is a separate thing, probably just a single list of tags the same length as the input tokens (kept separate, that is easier to parse than trying to mush it into a single annotation along the way).
It marks
- B the beginning of a sequence
- I inside sequence
- O outside a sequence (terminating an ongoing sequence, or later not inside one)
(examples often show tags like 'I-LOC', which combine the marker and the type, because you can push that into a single list of strings the same length as the token list you are annotating, and you can recover both the variable-length chunks and the annotated types from that)
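To illustrate that recovery step, here is a minimal sketch in Python. It assumes every tag is either 'O' or a marker plus a type joined by a hyphen (like 'B-LOC'). The function name `tags_to_spans` is made up for this note and not taken from any library. It is deliberately lenient: an I tag that follows O, or follows a different type, also starts a chunk.

```python
def tags_to_spans(tags):
    """Turn a list like ['B-LOC', 'I-LOC', 'O'] into (start, end, type) spans.

    end is exclusive. Assumes typed tags ('B-LOC', not a bare 'B').
    """
    spans = []
    start, current = None, None           # the span being built, if any
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        etype = etype or None
        # B always opens a span; I opens one if no span of that type is ongoing
        opens_new = prefix == "B" or (prefix == "I" and etype != current)
        if prefix == "O" or opens_new:
            if current is not None:       # close the span in progress
                spans.append((start, i, current))
            start, current = (i, etype) if opens_new else (None, None)
    if current is not None:               # span running up to the last token
        spans.append((start, len(tags), current))
    return spans


tokens = ["Los", "Angeles", "in", "California"]
tags = ["B-LOC", "I-LOC", "O", "B-LOC"]
for start, end, etype in tags_to_spans(tags):
    print(etype, " ".join(tokens[start:end]))
# prints:
# LOC Los Angeles
# LOC California
```

Because only the tag list is inspected, the same function works for any tokenization, as long as tokens and tags line up one to one.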
There are actually a number of flavours of this idea.
- The simplest might be IO - which basically isn't used at all, because it can't code that two different things are right next to each other.
- there are actually surprisingly few cases where this really matters for NER, but why build in that limitation?
- So IOB seems the default.
The difference between I and B comes in part to let you annotate that there are separate adjacent things. For example, doing NER tagging might output:
  Los         I  LOC
  Angeles     I  LOC
  in          O
  California  I  LOC
...which implies that Los Angeles belongs together, and the separate in/O implies that California is a different thing. If someone removed that 'in' (and were bad at adding punctuation) you might want to annotate like:
  Los         I  LOC
  Angeles     I  LOC
  California  B  LOC
to signify Los Angeles and California are separate things.
- IOB2 seems to refer to the variant that always puts B on the first token of a sequence, so it would do the equivalent like:
  Los         B  LOC
  Angeles     I  LOC
  in          O
  California  B  LOC
and
  Los         B  LOC
  Angeles     I  LOC
  California  B  LOC
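Since both variants carry the same information, converting the first style into IOB2 is mechanical. A minimal sketch, assuming tags are strings like 'I-LOC' or 'O' (`iob1_to_iob2` is a name made up for this note, not a library function):

```python
def iob1_to_iob2(tags):
    """Rewrite IOB tags so that every chunk starts with B-."""
    out = []
    prev_type = None                      # type of the previous token, None after O
    for tag in tags:
        prefix, _, etype = tag.partition("-")
        if prefix == "O":
            out.append("O")
            prev_type = None
            continue
        # An I- tag opens a chunk when nothing of that type is ongoing.
        if prefix == "I" and etype != prev_type:
            prefix = "B"
        out.append(f"{prefix}-{etype}")
        prev_type = etype
    return out


print(iob1_to_iob2(["I-LOC", "I-LOC", "O", "I-LOC"]))
# ['B-LOC', 'I-LOC', 'O', 'B-LOC']
print(iob1_to_iob2(["I-LOC", "I-LOC", "B-LOC"]))
# ['B-LOC', 'I-LOC', 'B-LOC']
```

Existing B tags are left alone, so running this on input that is already IOB2 changes nothing.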
There is also the addition of
- L or E to signify Last/Ending in a compound
- S or U to signify Single-token entity ('Unit')
So now there is
- IOE, IOE2: delimit adjacent things by putting E on the last token of the previous one
- BIOES and BILOU where we might annotate like
  Alex     S  PER
  is       O
  going    O
  with     O
  Marty    B  PER
  A.       I  PER
  Rick     E  PER
  to       O
  Los      B  LOC
  Angeles  E  LOC
- 'START/END' might refer to BIOES?
- BWEMO is similar, with different naming:
- B beginning-of-entity
- W single-token entity
- E end-of-entity
- M mid-entity
- O outside
- BWEMO+ is similar to BWEMO but the rules are expanded/relaxed because it was made for a model that doesn't have strict memory of adjacency(verify)
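BIOES, BILOU and BWEMO add no information over IOB2. They make the end of each chunk and single-token chunks explicit, under different letters. A minimal sketch of the conversion, assuming IOB2 input with tags like 'B-PER' or 'O'. The names `iob2_to_bioes` and `rename_prefixes` are made up for this note. BWEMO+ is not covered, because its relaxed rules are not spelled out here.

```python
def iob2_to_bioes(tags):
    """Convert IOB2 tags to BIOES by looking one token ahead."""
    out = []
    for i, tag in enumerate(tags):
        prefix, _, etype = tag.partition("-")
        if prefix == "O":
            out.append("O")
            continue
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{etype}"   # does the chunk go on after this token?
        if prefix == "B":
            new = "B" if continues else "S"
        else:                             # prefix is "I"
            new = "I" if continues else "E"
        out.append(f"{new}-{etype}")
    return out


def rename_prefixes(tags, mapping):
    """Swap the marker letter only, leaving the type part untouched."""
    return [mapping.get(t[0], t[0]) + t[1:] for t in tags]


iob2 = ["B-PER", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
bioes = iob2_to_bioes(iob2)
print(bioes)
# ['S-PER', 'O', 'O', 'O', 'B-PER', 'I-PER', 'E-PER', 'O', 'B-LOC', 'E-LOC']
print(rename_prefixes(bioes, {"E": "L", "S": "U"}))   # BILOU naming
print(rename_prefixes(bioes, {"S": "W", "I": "M"}))   # BWEMO naming
```

The renaming touches only the first character of each tag, so a type such as PER is not mangled by the E-to-L substitution.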
https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)
https://spacy.io/usage/linguistic-features#accessing-ner
https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format