Knowledge representation / Semantic annotation / structured data / linked data on the web



Generic, not related to library metadata

URN

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Many people know URLs, which locate a resource.

URLs answer where.


URN answers what, and only what

For example, you might say urn:isbn:0898156122 is a URN - a namespace to name a kind (here isbn), and a value to be interpreted in that namespace.

These namespaces are registered with IANA,
though there's nothing really keeping you from creating ad-hoc ones.
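
To make the namespace-plus-value idea concrete, here is a minimal sketch that splits a URN into its two parts. It assumes the full urn:<namespace>:<value> form, and the function name split_urn is made up for illustration; it does not check whether the namespace is actually registered.

```python
def split_urn(urn: str) -> tuple[str, str]:
    """Split 'urn:<namespace>:<value>' into (namespace, value)."""
    # Unpacking raises ValueError if there are fewer than three parts.
    scheme, namespace, value = urn.split(":", 2)
    if scheme.lower() != "urn" or not namespace or not value:
        raise ValueError(f"not a URN: {urn!r}")
    # The namespace is case-insensitive; how to read the value is up to the namespace.
    return namespace.lower(), value


print(split_urn("urn:isbn:0898156122"))
```

Note that this only tells you what kind of identifier you have. What to do with it next (e.g. where to look up an ISBN) is still up to the application.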

(URI basically encompasses both, so roughly means 'URL or URN'.)


URNs are, by themselves, a little too generic: there is no clear, generic way to embed one and say "this is a URN you should do something with", short of e.g. putting it in a context-specific lookup URL.

URNs are, potentially, a way to drop some basic type-value items into HTML, but since there is no standard way to resolve them, this is mostly useful in relatively niche cases, where multiple sides/apps have agreed on how to use them.

CURIE


A Compact URI (CURIE) amounts to a way to abbreviate a URI within a context - usually to abbreviate many of them.

It's seen more around XML and XHTML, because CURIEs are modelled on XML namespaces, though they could be used elsewhere fairly easily.

However, they are only parsed by CURIE-aware things, of which there aren't many(verify).


Consider:

<html xmlns:wikipedia="http://en.wikipedia.org/wiki/">
  <head></head>
  <body>
      <p>Find out more about <a href="[wikipedia:Biome]">biomes</a>.</p>
  </body>
</html>


It is reminiscent of a <base> as defined by HTML, but HTML bases are explicitly defined, and have some more path-related rules.


It ends up looking like a URN

but you can't really assume the prefix is fixed, as it is in URNs
also, a CURIE expands into a URL/URI, while a URN is, and stays, a string that just happens to be an identifier
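
The expansion step can be sketched in a few lines. This assumes the prefix mapping has already been collected from the document (here written by hand, matching the wikipedia prefix in the example above); expand_curie is a made-up name, and real CURIE processing has more rules than this.

```python
def expand_curie(curie: str, prefixes: dict[str, str]) -> str:
    """Expand 'prefix:reference' into a full URI, given a prefix mapping."""
    # "Safe CURIEs" are wrapped in square brackets; strip those first.
    if curie.startswith("[") and curie.endswith("]"):
        curie = curie[1:-1]
    prefix, sep, reference = curie.partition(":")
    if not sep or prefix not in prefixes:
        raise KeyError(f"no known prefix in {curie!r}")
    # Expansion is plain concatenation of the mapped URI and the reference.
    return prefixes[prefix] + reference


prefixes = {"wikipedia": "http://en.wikipedia.org/wiki/"}
print(expand_curie("[wikipedia:Biome]", prefixes))
```

This also illustrates the difference with a URN: the same CURIE expands to something else entirely when the context maps the prefix differently.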



Microformats

Microformats add semantic markers to HTML.

They emerged at the time of HTML4, and (probably mainly for validation reasons) chose to reuse existing attributes - seemingly mostly class and rel.


For example, showing personal information in HTML and also telling a potential microformat parser what's what according to hCard:

<ul class="vcard">
  <li class="fn">Joe Doe</li>
  <li class="org">The Example Company</li>
  <li class="tel">604-555-1234</li>
  <li><a class="url" href="http://example.com/">http://example.com/</a></li>
</ul>
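
To show what "telling a parser what's what" amounts to, here is a rough sketch that pulls those class-marked values back out using only Python's standard library. It is an assumption-heavy toy: it knows just the four property names used above, takes each value from the element's text, and ignores most of what a real hCard parser handles.

```python
from html.parser import HTMLParser

HCARD_PROPS = {"fn", "org", "tel", "url"}  # small illustrative subset of hCard

HTML = """<ul class="vcard">
  <li class="fn">Joe Doe</li>
  <li class="org">The Example Company</li>
  <li class="tel">604-555-1234</li>
  <li><a class="url" href="http://example.com/">http://example.com/</a></li>
</ul>"""


class HCardParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.card = {}
        self._current = None  # property whose text we are currently collecting

    def handle_starttag(self, tag, attrs):
        classes = (dict(attrs).get("class") or "").split()
        matches = [c for c in classes if c in HCARD_PROPS]
        self._current = matches[0] if matches else None

    def handle_data(self, data):
        if self._current and data.strip():
            self.card[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


parser = HCardParser()
parser.feed(HTML)
card = parser.card
print(card)
```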



Microdata

Microdata marks up HTML with the data it is also displaying, e.g.

<div itemscope>
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.</p>
  <p>I am <span itemprop="nationality">British</span>.</p>
</div>


...yes, it serves similar goals to microformats.


Microdata is usually explained as a slightly-more-expressive successor, in that...

Microformats reuse existing parts of HTML4 (mostly class and rel), whereas microdata extends HTML5 with specific custom attributes (e.g. itemscope, itemtype, itemprop).
Also, microformats have sort of a fixed existing set, while microdata points at 'use any schema.org thing', allowing community extension.


(side note: places that sanitize user-sourced HTML are more likely to remove microdata than microformats, due to the unknown attributes)
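
Reading the itemprop example above back out can be sketched much like you would for microformats, except keyed on the dedicated attribute rather than on class. This is a simplification that assumes a single flat item; it ignores nested itemscope, itemtype, and values carried in attributes like href or content.

```python
from html.parser import HTMLParser

HTML = """<div itemscope>
  <p>My name is <span itemprop="name">Neil</span>.</p>
  <p>My band is called <span itemprop="band">Four Parts Water</span>.</p>
  <p>I am <span itemprop="nationality">British</span>.</p>
</div>"""


class ItemPropParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.props = {}
        self._current = None  # itemprop name of the element we are inside, if any

    def handle_starttag(self, tag, attrs):
        self._current = dict(attrs).get("itemprop")

    def handle_data(self, data):
        if self._current:
            self.props[self._current] = data.strip()

    def handle_endtag(self, tag):
        self._current = None


parser = ItemPropParser()
parser.feed(HTML)
props = parser.props
print(props)
```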


JSON-LD


A linked data idea, with keys typically taken from schema.org, which happens to be embedded in a page as JSON. (Not to be confused with LDJSON / JSONL / NDJSON, which are line-delimited serialization formats.)


Consider this recipe example:

 <script type="application/ld+json">
    {
      "@context": "https://schema.org/",
      "@type": "Recipe",
      "name": "Party Coffee Cake",
      "author": {
        "@type": "Person",
        "name": "Mary Stone"
      },
      "datePublished": "2018-03-10",
      "description": "This coffee cake is awesome and perfect for parties.",
      "prepTime": "PT20M"
    }
 </script>

It's a machine-parseable form of the page's main content, assuming that main content is relatively bite-sized.
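
Getting that data back out of a page is mostly a matter of finding the script element and parsing its contents as JSON. A minimal standard-library sketch, using a shortened copy of the recipe above as a stand-in page (the class name JsonLdParser is made up, and this does no JSON-LD processing such as context expansion):

```python
import json
from html.parser import HTMLParser

PAGE = """<html><head>
<script type="application/ld+json">
{"@context": "https://schema.org/", "@type": "Recipe",
 "name": "Party Coffee Cake", "prepTime": "PT20M"}
</script>
</head><body><h1>Party Coffee Cake</h1></body></html>"""


class JsonLdParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.items = []
        self._in_jsonld = False
        self._buffer = []

    def handle_starttag(self, tag, attrs):
        if tag == "script" and dict(attrs).get("type") == "application/ld+json":
            self._in_jsonld = True
            self._buffer = []

    def handle_data(self, data):
        if self._in_jsonld:
            self._buffer.append(data)

    def handle_endtag(self, tag):
        if tag == "script" and self._in_jsonld:
            # Plain JSON parsing; the keys only get meaning via @context.
            self.items.append(json.loads("".join(self._buffer)))
            self._in_jsonld = False


parser = JsonLdParser()
parser.feed(PAGE)
items = parser.items
print(items[0]["@type"], items[0]["name"])
```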

Seems to be developed for and targeted primarily at web crawlers.


Yes, this omits the actually useful sections like "ingredient", "yield", "instructions", which seems to indicate this was aimed not at structured data, but at crawlers that just care about the type of page.

So this feels more like SEO-adjacent metadata than structured or linked data (also consider statements like "All annotated information must be on the page; adding information that is not on the page will likely not show in search results and is against Google guidelines").

Yet it's an open-ish mechanism, and due to the varied existing schemas you can code a lot of nontrivial things, that you could later extract again.


See also:

Arguably primarily a way to get away from doing the same in XML.


Ontologies and knowledge representation

RDF
RDFS

RDF Schema

https://en.wikipedia.org/wiki/RDF_Schema

Trix

RDF triples stored in XML



Embedded RDF (eRDF)

Embedded RDF (eRDF) places RDF inside HTML.

It was apparently inspired by microformats.

It seems effectively obsolete, as people seem to prefer things like RDFa, Microdata, JSON-LD.




RDFa

RDFa describes how to express RDF-like content within HTML attributes (hence the a).


This is still a flexible concept.

If you start with HTML and want to better mark what its parts are, you might start with

 <h2>My page title</h2>

and make that:

 <h2 property="http://purl.org/dc/terms/title">My page title</h2>

...but you probably want to look at full examples.


One real-world example I found was EUR-Lex encoding a bunch of metadata in their HTML pages, e.g. 31965L0001 contains 200+ lines like:

<meta about="http://data.europa.eu/eli/dir/1965/1/oj" typeof="eli:LegalResource"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:uri_schema" resource="http://data.europa.eu/eli/%7Btypedoc%7D/%7Byear%7D/%7Bnatural_number%7D/oj"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" content="31965L0001" lang="" property="eli:id_local"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:type_document" resource="http://publications.europa.eu/resource/authority/resource-type/DIR"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:passed_by" resource="http://publications.europa.eu/resource/authority/corporate-body/CONSIL"/>

..which seems to come more from a "We had RDF triples and put them in here for you to parse" angle.
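
Pulling triples back out of meta elements like those can be sketched as follows. This is a simplification, not an RDFa processor: it only looks at meta elements with an about attribute, leaves prefixes like eli: unexpanded, writes the type relation as the string "rdf:type", and does not distinguish resource (a URI) from content (a literal) in its output.

```python
from html.parser import HTMLParser

HTML = """
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" typeof="eli:LegalResource"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" content="31965L0001" lang="" property="eli:id_local"/>
<meta about="http://data.europa.eu/eli/dir/1965/1/oj" property="eli:type_document" resource="http://publications.europa.eu/resource/authority/resource-type/DIR"/>
"""


class MetaTripleParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.triples = []  # (subject, predicate, object) tuples

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        if tag != "meta" or "about" not in a:
            return
        if "typeof" in a:
            self.triples.append((a["about"], "rdf:type", a["typeof"]))
        if "property" in a:
            # Prefer a linked resource; fall back to the literal in content.
            obj = a.get("resource", a.get("content"))
            self.triples.append((a["about"], a["property"], obj))


parser = MetaTripleParser()
parser.feed(HTML)
triples = parser.triples
for triple in triples:
    print(triple)
```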


Attributes you can use include:

  • property - specifying a property for the content of an element or the partner resource
  • about - when you e.g. have a div describing a resource, this points to the actual resource (URI or CURIE)
  • rel, rev - relationships and reverse relationships with another source
  • typeof - RDF type(s) of the subject or the partner resource
  • src, href, resource - partner resources
  • content - attribute that overrides the content of the element when using the property attribute
  • datatype - attribute that specifies the datatype of text specified for use with the property attribute


http://en.wikipedia.org/wiki/RDFa

https://www.w3.org/TR/rdfa-core/

RDFa Lite

RDFa Lite intends to be a simpler subset of RDFa that allows most things people want to do, and is easier to deal with. It consists of five attributes:

  • property
  • vocab
  • resource
  • typeof
  • prefix

https://www.w3.org/TR/rdfa-lite/#introduction

https://www.w3.org/TR/rdfa-lite/

Notation3 (Notation 3, N3), Turtle, N-Triples

Assuming you consider RDF/XML the full form, then N3 (a.k.a. Notation3), Turtle, and N-Triples can be considered the non-XML shorthands.

They are also closely related:

  • N-Triples is equivalent to RDF
  • N3 can code RDF, and has some additional options
    file extensions used: n3
  • Turtle is a subset of N3 (also in the sense that Turtle examples are valid N3)
    file extensions used: ttl

N-Triples is meant to be simpler than N3 and Turtle (to read, write, parse), though it lacks some features that it could have had (like CURIE-style prefixes).
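
To get a feel for how simple N-Triples is, here is a small sketch that writes triples out in that form: one triple per line, every IRI written in full inside angle brackets, and a closing dot. The Iri marker class and function names are made up for this example, and it only handles IRIs and plain string literals (no language tags, datatypes, or blank nodes).

```python
class Iri(str):
    """Marker type: a string to be written as <IRI> rather than as a literal."""


def term(value: str) -> str:
    if isinstance(value, Iri):
        return f"<{value}>"
    # Plain literal: escape backslashes, quotes, and line breaks.
    escaped = (value.replace("\\", "\\\\").replace('"', '\\"')
                    .replace("\n", "\\n").replace("\r", "\\r"))
    return f'"{escaped}"'


def to_ntriples(triples) -> str:
    return "\n".join(f"{term(s)} {term(p)} {term(o)} ." for s, p, o in triples)


triples = [
    (Iri("http://example.com/page"),
     Iri("http://purl.org/dc/terms/title"),
     "My page title"),
]
ntriples = to_ntriples(triples)
print(ntriples)
```

In Turtle you could declare a prefix once and then write the predicate in abbreviated form, which is the main thing you give up here in exchange for trivially parseable lines.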



N-Quads

N-Triples plus an optional context value (a fourth term, naming the graph the triple belongs to)


TriG

Yet another RDF serialization(verify).

This one is an extension of Turtle


https://www.w3.org/TR/2014/REC-trig-20140225/#sec-trig-intro

SKOS

SKOS is a data model that helps with controlled vocabularies, e.g. taxonomies, thesauri, and subject headings.


Note that it is only a model for a knowledge representation language to use, and is not a knowledge representation language itself.

That is, it describes classes and properties, but does not assert that objects have them.


Its intent seems to be interoperability: to have a shared basis for modelling done in a semantic-web context, instead of each rolling their own (as you could with RDFS, and OWL properties and classes).


You could call it a linked open vocabulary - though note those vary wildly in how widely applicable they are https://lov.linkeddata.es/dataset/lov/


SKOS's model is itself defined as an ontology expressed in OWL, but that's a bit of a technicality.


https://en.wikipedia.org/wiki/Simple_Knowledge_Organization_System

https://www.w3.org/TR/2009/REC-skos-reference-20090818/


OWL

Web Ontology Language (OWL) is a family of languages for ontologies


https://en.wikipedia.org/wiki/Web_Ontology_Language

Isn't there real overlap between SKOS, OWL, and RDFS?
More potentially relevant standards

BS 8723

ISO 5964

ISO 2788 - Guidelines for the establishment and development of monolingual thesauri

ISO 25964 - international standard for thesauri and interoperability with other vocabularies[1]

apparently came from (and based on) BS 8723, ISO 2788?

More specific-purpose

FOAF


OGP

Facebook's Open Graph protocol lets you describe your page, and controls e.g. how it appears when linked from sites like Facebook, Twitter, and the like.

So it is arguably structured data, but arguably used largely as an SEO thing and/or just for a nicer preview of links.
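
In practice the description is a handful of meta elements whose property names start with og:. A minimal sketch of what a link-preview generator would read, with a made-up page (all values here are placeholders):

```python
from html.parser import HTMLParser

PAGE = """<html><head>
<meta property="og:title" content="An example page" />
<meta property="og:type" content="article" />
<meta property="og:url" content="http://example.com/page" />
<meta property="og:image" content="http://example.com/preview.png" />
</head><body></body></html>"""


class OpenGraphParser(HTMLParser):
    def __init__(self):
        super().__init__()
        self.og = {}

    def handle_starttag(self, tag, attrs):
        a = dict(attrs)
        prop = a.get("property") or ""
        if tag == "meta" and prop.startswith("og:") and "content" in a:
            self.og[prop[3:]] = a["content"]  # store without the "og:" prefix


parser = OpenGraphParser()
parser.feed(PAGE)
og = parser.og
print(og)
```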


https://ogp.me/

More library geared

FRBR

FRBR ('Functional Requirements for Bibliographic Records'): refers to a model of semantic interrelations between works, realizations, items, responsible entities, and others.


It is focused on bibliographic information, and meant to automate meaningful operations on it.

...though not all types of information found in libraries fit well into FRBR, which probably leads to some creative use of the model, that may be incompatible with other such creative use.


There are also limits to applying it to existing information, as various relations are hard to extract exactly and authoritatively. Practice shows that automatically dealing with even just sets of works can be quite a bit of work to do accurately.



People may use the term FRBR loosely, as

  • a reference to services built on this (e.g. 'we have a variant of that book you may be interested in')
  • the idea of, or specific datasets of (e.g. LibraryThing's), mappings between a particular book as a work and its various realizations (e.g. revisions)
  • a reference to some other types of semantic cross-relations


Group 1 means to refer to the products of intellectual or artistic endeavour:

  • A Work, a distinct intellectual or artistic creation.
  • An Expression refers to the unique intellectual/artistic form present in a realization of a work.
  • A Manifestation is the physical embodiment of an expression.
  • An Item is a single concrete manifestation. (an exemplar of a manifestation)


An example using books:

  • a book someone writes is a work
    it's clearly a distinct creation from all other books
  • ...which starts off with just one expression, the original text
    but revisions of the text may be considered distinct expressions
  • different printed forms are different manifestations - physically clearly different, but the same expression
    e.g. the paperback and hardcover variations
  • each physical book is an item (many items of the same manifestation exist)
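
The Group 1 hierarchy from the book example can be sketched as nested data. The class and field names below are just one way to model it for illustration (FRBR itself does not prescribe a data structure), and the book and copy identifiers are made up.

```python
from dataclasses import dataclass, field


@dataclass
class Item:
    copy_id: str  # hypothetical identifier for one physical copy


@dataclass
class Manifestation:
    form: str  # e.g. "paperback", "hardcover"
    items: list[Item] = field(default_factory=list)


@dataclass
class Expression:
    label: str  # e.g. "original text", "revised text"
    manifestations: list[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: list[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count physical copies across all expressions and manifestations."""
    return sum(len(m.items) for e in work.expressions for m in e.manifestations)


work = Work("An Example Novel", [
    Expression("original text", [
        Manifestation("paperback", [Item("copy-1"), Item("copy-2")]),
        Manifestation("hardcover", [Item("copy-3")]),
    ]),
    Expression("revised text"),
])
print(count_items(work))
```

The hard part in practice is not this structure but deciding, for existing catalogue records, which records belong to the same work or expression.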



Group 2 deals with the custodianship of Group 1 entities, mainly:

  • Persons
  • Corporate bodies


Group 3 deals with subject/event/place relations for Group 1 and 2

  • Concepts
  • Objects
  • Events
  • Places


Further notes:

  • Various bibliographical identifiers implicitly do some FRBRization. For example, XISBN can be seen as a manifestation identifier.

COinS

COinS (ContextObjects in Spans) describes how to embed OpenURL-style citation data in HTML.

It's functionally similar to microformats.


It allows browser plugins to do things including:

  • link you to full-text via your own institution's OpenURL resolver (e.g. using OpenURL Referrer)
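
As a rough sketch of what such an embedded citation looks like: COinS puts a URL-encoded OpenURL ContextObject into the title attribute of an otherwise empty span with class Z3988. The helper name coins_span and the article title below are made up, and only the journal metadata format is used here.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(fields: dict[str, str]) -> str:
    """Build a COinS span from referent fields such as atitle, jtitle, date."""
    kev = {
        "ctx_ver": "Z39.88-2004",
        "rft_val_fmt": "info:ofi/fmt:kev:mtx:journal",
        # Referent fields get an "rft." prefix in the key/value encoding.
        **{f"rft.{key}": value for key, value in fields.items()},
    }
    # URL-encode the key/value pairs, then HTML-escape for use in an attribute.
    return f'<span class="Z3988" title="{escape(urlencode(kev))}"></span>'


span = coins_span({"atitle": "An Example Article", "date": "2018"})
print(span)
```

A plugin that understands COinS would read the title attribute back, and append it to the base URL of your institution's resolver.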



Persistent and open identifiers

https://en.wikipedia.org/wiki/Persistent_identifier


DOI

DOI (Digital Object Identifier) is a specific standard (syntax is defined in ANSI/NISO Z39.84 [2]) meant to identify pieces of intellectual property, mostly journal articles, research reports, data sets, and official publications.


In a DOI code, for example 10.1000/abc...

  • the 10 designates a DOI
    DOI is technically just one implementation of the CNRI Handle system
  • The rest of the code before the slash is the DOI registrant prefix, 1000 in the example above.
    You can buy such a prefix from a Registration Agency (or possibly an experimental one from the International DOI Foundation(verify)).
  • ...followed by item identifiers assigned by that registrant. Can be anything.
    The combination will always be unique, as the registrant prefix acts as a namespace.


The combination is like a URN in that it identifies, but does not locate.

You need a DOI resolver to look up what it refers to, which gets you to the actual content (usually a landing page first).

This may amount to a search engine, but since these are unique identifiers, it can give more precise answers.
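
The split into prefix and suffix, and the hand-off to a resolver, can be sketched as follows. The suffix used here is a made-up placeholder rather than a real DOI, and the sketch only builds the URL for the public doi.org resolver without fetching anything.

```python
from urllib.parse import quote


def split_doi(doi: str) -> tuple[str, str]:
    """Split a DOI into (prefix, suffix) at the first slash."""
    prefix, sep, suffix = doi.partition("/")
    if not sep or not prefix.startswith("10.") or not suffix:
        raise ValueError(f"not a DOI: {doi!r}")
    return prefix, suffix


def resolver_url(doi: str) -> str:
    prefix, suffix = split_doi(doi)
    # Suffixes can contain almost anything, so percent-encode what a URL needs.
    return f"https://doi.org/{prefix}/{quote(suffix, safe='/')}"


print(split_doi("10.1000/xyz123"))     # hypothetical suffix
print(resolver_url("10.1000/xyz123"))
```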




ORCID

ORCID is a persistent identifier for individual researchers, based on the idea that names are not unique to one person, and that names themselves can change.



Older / unsorted

Some theory

Digital - Ontologies, semantic web stuff

Note that some of this is glorified metadata, so see also (Library) metadata notes


Query

SPARQL



An RDF-querying language that wants to look vaguely like SQL.


Has roughly four query types:

  • SELECT - fetches values as they are stored, in table form
  • CONSTRUCT - extract and transform into valid RDF
  • ASK query - ask a yes/no question
  • DESCRIBE - doesn't fetch resources, but describes them in a way that the database maintainer rather than you decides (verify) (which might actually just be a fetch sometimes?(verify))


The Wikipedia example for SELECT:

PREFIX foaf: <http://xmlns.com/foaf/0.1/>
SELECT ?name 
       ?email
WHERE
  {
    ?person  a          foaf:Person .
    ?person  foaf:name  ?name .
    ?person  foaf:mbox  ?email .
  }

...though the more complex the query, the more you need to know the details of the underlying data model (at least look up its constants).


For example, here is a query towards EUR-Lex:

PREFIX cdm: <http://publications.europa.eu/ontology/cdm#>
PREFIX annot: <http://publications.europa.eu/ontology/annotation#>
PREFIX skos:<http://www.w3.org/2004/02/skos/core#>
PREFIX dc:<http://purl.org/dc/elements/1.1/>
PREFIX xsd:<http://www.w3.org/2001/XMLSchema#>
PREFIX rdf:<http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX owl:<http://www.w3.org/2002/07/owl#>
SELECT DISTINCT ?work ?type ?celex ?date ?force
WHERE {
    ?work cdm:work_has_resource-type ?type .
    FILTER(?type = <http://publications.europa.eu/resource/authority/resource-type/JUDG>)
    FILTER NOT EXISTS { ?work cdm:work_has_resource-type <http://publications.europa.eu/resource/authority/resource-type/CORRIGENDUM> }
    OPTIONAL { ?work cdm:resource_legal_id_celex ?celex . }
    OPTIONAL { ?work cdm:work_date_document      ?date . }
    OPTIONAL { ?work cdm:resource_legal_in-force ?force . }
    FILTER NOT EXISTS { ?work cdm:do_not_index "true"^^<http://www.w3.org/2001/XMLSchema#boolean> }
}


Note:

  • all prefixes except the first (cdm) are unused here
  • Amounts to
    • get all things of type JUDG,
    • except those that also have type CORRIGENDUM,
    • except if marked do_not_index,
    • and return 'work' and 'type', plus 'celex', 'date' and 'force' if they're there
    • OPTIONAL amounts to "add field if it's there, but don't require it"
      stripping it of OPTIONAL { } means only solutions with values will be returned
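
On the receiving side, the effect of OPTIONAL shows up as variables that are simply absent from some result rows. A sketch of reading the standard SPARQL JSON results format, using a hand-written sample response (the work URIs and the celex value are made up, not real EUR-Lex output):

```python
import json

SAMPLE = """{
  "head": {"vars": ["work", "celex"]},
  "results": {"bindings": [
    {"work":  {"type": "uri", "value": "http://example.org/work/1"},
     "celex": {"type": "literal", "value": "EXAMPLE-1"}},
    {"work":  {"type": "uri", "value": "http://example.org/work/2"}}
  ]}
}"""


def rows_from_sparql_json(text: str) -> list[dict]:
    """Flatten SPARQL JSON results into dicts, with None for unbound variables."""
    doc = json.loads(text)
    variables = doc["head"]["vars"]
    return [
        {var: (binding[var]["value"] if var in binding else None)
         for var in variables}
        for binding in doc["results"]["bindings"]
    ]


rows = rows_from_sparql_json(SAMPLE)
for row in rows:
    print(row)
```

Without the OPTIONAL around the celex pattern, the second row would not have been returned at all.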


https://en.wikipedia.org/wiki/SPARQL

Unsorted

  • GRDDL (Gleaning Resource Descriptions from Dialects of Languages) is a markup format for extracting RDF from XML, such as XHTML, microformat data in XHTML, and such. [3] [4]



  • SKOS [Simple Knowledge Organization System] [6] [7]
  • SIOC [Semantically-Interlinked Online Communities] [8]
  • 'Common Vocabularies' (and vocabulary mapping) refer to settling things between systems so that they can make inferences more easily [9]
  • Semantic Annotation
  • Rules, specifically in the sense used e.g. in...
  • RuleML [10]
  • Rule Interchange Format (RIF) [11]
  • Semantic Web Rule Language (SWRL) (is OWL+RuleML)

And also various document/item metadata formats, in this context most often

  • Dublin Core
  • DOAP (Description Of A Project) [12]
  • Internet Content Description Language


Software

  • Ontology editors
    • Protégé
    • GATE
    • KAON
    • Hozo



http://en.wikipedia.org/wiki/Template:Semantic_Web
