Data annotation notes: Difference between revisions

Revision as of 16:22, 13 March 2024

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Tools

Online, open source

label studio


doccano


ML-Annotate


brat


annotator.js


Annotation Lab (a.k.a. NLP Lab)


(mostly online or self-hosted)


datagym


LightTag


Label Your Data


prodigy


LabelBox


CVAT

GUI, open source

LabelImg

MAE (Multi-document Annotation Environment)


YEDDA


ELAN


Praat


Phon

Unsorted

ipyannotations

  • text (images overall)
  • python notebook


poplar


VGG Oxford University

  • varied


Annotation data formats

IOB

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


IOB (Inside, Outside, Beginning), a.k.a. BIO, is a format that certain annotators output to mark sequences of adjacent tokens, such as in named entity recognition.

This seems to originate from using the chunker to also tag these, and having it output a separate thing, probably just a single list of the same length as the input tokens (kept separate like that, it is easier to parse than trying to mush it into a single annotation along the way).


It marks each token as

  • B the beginning of a sequence
  • I inside a sequence
  • O outside a sequence (terminating an ongoing sequence, or just not inside one)

(examples often show combined tags like 'I-LOC' - because you can push that into a single list of strings the same length as the token list you are annotating, and you can recover both the variable-length chunks and the annotated types from that)
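
To illustrate how both the chunks and their types can be recovered from such a single list, here is a minimal sketch. The function name tags_to_chunks and the exact 'B-LOC' / 'I-LOC' / 'O' string layout are assumptions made for this example, not something a particular tool prescribes.

```python
def tags_to_chunks(tokens, tags):
    """Recover (type, text) chunks from combined tags such as 'B-LOC', 'I-LOC', 'O'."""
    spans = []
    start, current = None, None  # start index and type of the open chunk, if any
    for i, tag in enumerate(tags):
        prefix, _, label = tag.partition("-")
        # Close the open chunk on O, on an explicit B, or when the type changes.
        if current is not None and (prefix in ("O", "B") or label != current):
            spans.append((current, start, i))
            start, current = None, None
        # Open a new chunk on B, or on an I that does not continue anything.
        if prefix in ("B", "I") and current is None:
            start, current = i, label
    if current is not None:
        spans.append((current, start, len(tags)))
    return [(label, " ".join(tokens[s:e])) for label, s, e in spans]


tokens = ["Los", "Angeles", "in", "California"]
tags = ["I-LOC", "I-LOC", "O", "I-LOC"]
print(tags_to_chunks(tokens, tags))
```

Because it opens a chunk on either B or a dangling I, this sketch accepts tag lists where B is used only between adjacent chunks as well as ones where B starts every chunk.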


There are actually a number of flavours of this idea.


The simplest might be IO - which basically isn't used at all, because it can't encode that two different things of the same type are right next to each other.

There are actually surprisingly few cases where this really matters for NER, but why build in that limitation?
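
A small sketch of that limitation, assuming IO tags are written as 'I-LOC' and 'O' (the helper name io_chunks is made up for this example): with only I and O, a run of identical tags is all there is to go on, so adjacent entities of the same type merge.

```python
from itertools import groupby


def io_chunks(tokens, tags):
    """Group runs of identical non-O tags, which is all that IO tagging can express."""
    chunks, i = [], 0
    for tag, run in groupby(tags):
        n = len(list(run))
        if tag != "O":
            label = tag.split("-", 1)[1]  # 'I-LOC' -> 'LOC'
            chunks.append((label, " ".join(tokens[i:i + n])))
        i += n
    return chunks


# With a token in between, the two locations stay separate...
print(io_chunks(["Los", "Angeles", "in", "California"], ["I-LOC", "I-LOC", "O", "I-LOC"]))
# ...but without it they collapse into a single chunk.
print(io_chunks(["Los", "Angeles", "California"], ["I-LOC", "I-LOC", "I-LOC"]))
```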


So IOB seems to be the default.

The distinction between I and B exists in part to let you annotate that separate things are adjacent. For example, doing NER tagging might output:

Los          I   LOC
Angeles      I   LOC
in           O
California   I   LOC

...which implies that Los Angeles belongs together, and the separate in/O implies that California is a different thing. If someone removed that 'in' (and were bad at adding punctuation) you might want to annotate like:

Los          I   LOC
Angeles      I   LOC
California   B   LOC

to signify Los Angeles and California are separate things.


IOB2 seems to refer to the variant where every chunk starts with B, not only a chunk that directly follows another one. It would annotate the same two examples like:

Los          B   LOC
Angeles      I   LOC
in           O
California   B   LOC

and

Los          B   LOC
Angeles      I   LOC
California   B   LOC
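
Converting from the plain IOB style to IOB2 is mechanical. A minimal sketch, assuming combined string tags like 'I-LOC' (the function name iob1_to_iob2 is just chosen for this example): any I that does not continue a chunk of the same type becomes a B.

```python
def iob1_to_iob2(tags):
    """Rewrite IOB tags so that every chunk starts with an explicit B."""
    out = []
    prev_label = None  # type of the chunk the previous token was in, None after O
    for tag in tags:
        if tag == "O":
            out.append(tag)
            prev_label = None
            continue
        prefix, label = tag.split("-", 1)
        # An I that does not continue a same-type chunk actually starts a new one.
        if prefix == "I" and label != prev_label:
            prefix = "B"
        out.append(f"{prefix}-{label}")
        prev_label = label
    return out


print(iob1_to_iob2(["I-LOC", "I-LOC", "O", "I-LOC"]))
print(iob1_to_iob2(["I-LOC", "I-LOC", "B-LOC"]))
```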


There is also the addition of

  • L or E to signify the Last/Ending token in a compound
  • S or U to signify a Single-token entity ('Unit')

So now there are BIOES and BILOU, where (using the BIOES letters) we might annotate like

Alex    S   PER
is      O
going   O
with    O
Marty   B   PER
A.      I   PER
Rick    E   PER
to      O
Los     B   LOC
Angeles E   LOC
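
The extra letters carry no new information over IOB2, they only make chunk ends and single-token chunks explicit, so one can be derived from the other. A minimal sketch, assuming well-formed IOB2 input as combined string tags (the name iob2_to_bioes is invented for this example):

```python
def iob2_to_bioes(tags):
    """Derive BIOES tags from IOB2 tags by looking one token ahead."""
    out = []
    for i, tag in enumerate(tags):
        if tag == "O":
            out.append(tag)
            continue
        prefix, label = tag.split("-", 1)
        nxt = tags[i + 1] if i + 1 < len(tags) else "O"
        continues = nxt == f"I-{label}"  # does the same chunk go on after this token?
        if prefix == "B":
            out.append(f"{'B' if continues else 'S'}-{label}")
        else:
            out.append(f"{'I' if continues else 'E'}-{label}")
    return out


iob2 = ["B-PER", "O", "O", "O", "B-PER", "I-PER", "I-PER", "O", "B-LOC", "I-LOC"]
print(iob2_to_bioes(iob2))
```

For BILOU the same logic applies with L in place of E and U in place of S.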


BMEWO is similar, with different naming:

  • B beginning of entity
  • M mid-entity
  • E end-of-entity
  • W single-token entity
  • O outside

BMEWO+ is similar but the rules are expanded because it was made for a model that doesn't have direct memory of adjacency (verify)


https://en.wikipedia.org/wiki/Inside%E2%80%93outside%E2%80%93beginning_(tagging)

https://spacy.io/usage/linguistic-features#accessing-ner

https://datascience.stackexchange.com/questions/37824/difference-between-iob-and-iob2-format

https://web.archive.org/web/20170805150451/https://lingpipe-blog.com/2009/10/14/coding-chunkers-as-taggers-io-bio-bmewo-and-bmewo/