FoLiA notes


Latest revision as of 00:46, 10 August 2023

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


FoLiA (Format for Linguistic Annotation) is an XML-based format for annotating text resources, in theory for rich and interoperable linguistic annotation, for things like transcription, corpora (glossaries, dictionaries, thesauri, wordnets, etc.), and processing.

While it is serialized into somewhat complex XML, libraries should make it reasonable to read and alter.
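
To give a feel for that read-and-alter step, here is a minimal sketch using only Python's standard `xml.etree.ElementTree`, not a dedicated FoLiA library (which would normally be the more convenient choice). The embedded document is a hand-written, simplified fragment that has not been validated against the FoLiA schema, and the IDs and helper names (`read_tokens`, `set_token_text`) are made up for illustration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Hand-written, simplified fragment; assumed shape, not schema-validated.
DOC = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Hello</t></w>
      <w xml:id="example.s.1.w.2"><t>wrld</t></w>
    </s>
  </text>
</FoLiA>"""


def q(name):
    """Qualify a local element name with the FoLiA namespace."""
    return f"{{{FOLIA_NS}}}{name}"


def read_tokens(root):
    """Return (id, text) pairs for every <w> element, in document order."""
    return [(w.get(XML_ID), w.findtext(q("t"))) for w in root.iter(q("w"))]


def set_token_text(root, token_id, new_text):
    """Replace the text of the token with the given id; True if it was found."""
    for w in root.iter(q("w")):
        if w.get(XML_ID) == token_id:
            w.find(q("t")).text = new_text
            return True
    return False


root = ET.fromstring(DOC)
print(read_tokens(root))

# Alter one token, then serialize with FoLiA as the default namespace.
set_token_text(root, "example.s.1.w.2", "world")
ET.register_namespace("", FOLIA_NS)
print(ET.tostring(root, encoding="unicode"))
```

Working on the raw XML like this ignores everything FoLiA says about declarations and validation, so it is mostly useful for quick inspection.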


It was presented as a better alternative to ad hoc storage, which you tend to spend time figuring out for each dataset.

It is unopinionated in the sense that

  • it does not restrict you to a particular label set or theory
  • it allows marking up different kinds of things
  • all vocabulary sets need to be explicitly referenced (a SKOS / RDF thing, but don't let that scare you off).
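
The explicit-reference idea can be sketched as follows. This assumes declarations live under `metadata/annotations` as elements named `<type>-annotation` with a `set` attribute, and that annotations carry a matching `set` attribute. That naming pattern is an assumption that fits `pos` and `lemma` but not every element (tokens, for example, are declared differently), and real documents may omit `set` on the annotation when only one set is declared. The `example.org` set URLs and the helper names are hypothetical.

```python
import xml.etree.ElementTree as ET

NS = {"f": "http://ilk.uvt.nl/folia"}

# Hand-written, simplified fragment; set URLs are placeholders.
DOC = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <metadata>
    <annotations>
      <pos-annotation set="https://example.org/sets/my-pos"/>
      <lemma-annotation set="https://example.org/sets/my-lemmas"/>
    </annotations>
  </metadata>
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1">
        <t>houses</t>
        <pos set="https://example.org/sets/my-pos" class="NOUN"/>
        <lemma set="https://example.org/sets/my-lemmas" class="house"/>
        <sense set="https://example.org/sets/undeclared" class="house-1"/>
      </w>
    </s>
  </text>
</FoLiA>"""

SUFFIX = "-annotation"


def local(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def declared_sets(root):
    """Collect (annotation type, set URL) pairs from the declarations."""
    pairs = set()
    for decl in root.findall("f:metadata/f:annotations/*", NS):
        name = local(decl.tag)
        if name.endswith(SUFFIX) and decl.get("set"):
            pairs.add((name[: -len(SUFFIX)], decl.get("set")))
    return pairs


def undeclared_uses(root):
    """List (element, set URL) pairs used in the text but never declared."""
    declared = declared_sets(root)
    problems = []
    for el in root.iter():
        name, used = local(el.tag), el.get("set")
        if used is None or name.endswith(SUFFIX):
            continue
        if (name, used) not in declared:
            problems.append((name, used))
    return problems


root = ET.fromstring(DOC)
print(sorted(declared_sets(root)))
print(undeclared_uses(root))
```

The point is only that the label vocabulary is named by reference rather than baked into the format, so a consumer can tell which vocabulary a class like `NOUN` comes from.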


It deals separately with things like

  • inline annotations of individual elements
  • (inline) annotations of spans of elements
  • subtoken annotations, for morphology and phonology
  • document structure
  • higher-level things like arbitrary selections and arbitrary relations
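
A rough sketch of the first two cases: an inline annotation sits inside the element it describes, while a span annotation sits in a separate layer and points at tokens by ID. The fragment below is hand-written and simplified (not schema-validated, declarations omitted), and the class labels, IDs, and helper names are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

FOLIA_NS = "http://ilk.uvt.nl/folia"
XML_ID = "{http://www.w3.org/XML/1998/namespace}id"  # how ElementTree names xml:id

# Inline: <pos> inside each <w>. Span: <entity> in an <entities> layer,
# referring back to tokens through <wref id="..."/>.
DOC = """<FoLiA xmlns="http://ilk.uvt.nl/folia" xml:id="example" version="2.0">
  <text xml:id="example.text">
    <s xml:id="example.s.1">
      <w xml:id="example.s.1.w.1"><t>Ada</t><pos class="PROPN"/></w>
      <w xml:id="example.s.1.w.2"><t>Lovelace</t><pos class="PROPN"/></w>
      <w xml:id="example.s.1.w.3"><t>wrote</t><pos class="VERB"/></w>
      <entities>
        <entity xml:id="example.s.1.entity.1" class="per">
          <wref id="example.s.1.w.1"/>
          <wref id="example.s.1.w.2"/>
        </entity>
      </entities>
    </s>
  </text>
</FoLiA>"""


def q(name):
    """Qualify a local element name with the FoLiA namespace."""
    return f"{{{FOLIA_NS}}}{name}"


def inline_pos(root):
    """Read the inline annotation: (token text, POS class) per token."""
    return [
        (w.findtext(q("t")), w.find(q("pos")).get("class"))
        for w in root.iter(q("w"))
    ]


def entity_spans(root):
    """Resolve each span annotation to (class, [token texts])."""
    by_id = {w.get(XML_ID): w for w in root.iter(q("w"))}
    spans = []
    for entity in root.iter(q("entity")):
        words = [
            by_id[ref.get("id")].findtext(q("t"))
            for ref in entity.findall(q("wref"))
        ]
        spans.append((entity.get("class"), words))
    return spans


root = ET.fromstring(DOC)
print(inline_pos(root))
print(entity_spans(root))
```

Keeping spans as references means several span layers can cover the same tokens without nesting conflicts.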


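See also:

  • https://proycon.github.io/folia/
  • https://folia.readthedocs.io/en/latest/introduction.html
  • https://www.researchgate.net/publication/261215684_FoLiA_A_practical_XML_format_for_linguistic_annotation_-_A_descriptive_and_comparative_study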


There is also a FoLiA Query Language (FQL) that lets you select from and also edit documents: https://folia.readthedocs.io/en/latest/fql.html

There are web annotation tools, like FLAT, that build on a document server.



What does it annotate?

Things like:

  • relatively mechanical structure
      • on the macro level (e.g. paragraphs, heads, divisions, lists, figures), plus the ability to define terms and create glossaries and such
      • on a smaller level (e.g. whitespace, tokens, morphemes)
  • more semantic things like quotes, events, the difference between utterances and sentences
  • additional annotation types, e.g. phonetic, sentiment, language, POS, lemma, sense, reference
  • larger annotation, like spans and span relations
  • corrections

...although it may not be advisable to use it for everything it can do at once.


https://folia.readthedocs.io/en/latest/introduction.html#annotation-types


What does it look like?

https://github.com/proycon/folia/tree/master/examples


Who or what uses it?

Universities, mainly.