Library metadata notes

Record syntaxes, Profiles

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A 'record syntax' defines the structure a record should follow. Record syntaxes include ISO2709/MARC, GRS-1, XML, and in some ways SGML, HTML, and SUTRS.


An 'application profile' (a term most often seen with Z39.50) refers to a collection of conventions and, usually, standards that make it easier to get different systems to cooperate for specific purposes. A profile may focus on any or all of:

  • search attribute sets (for clearly defined searchability, particularly of non-regular fields)
  • communication details
  • conventions in filling data records (fields to use, formatting to use, required fields, etc.)

Many application profiles are extensions of others.

Some standards are made mostly for ease of interchange and are for the most part profiles - consider for example Dublin Core, GILS, CIMI (their storage is not part of the profile, and/or already settled in the standard that encapsulates the profile and more - e.g. ISO2709 for CIMI and XML for GILS).

http://www.loc.gov/z3950/agency/profiles/about.html




Library-related metadata/record formats/standards




To see examples of several of these formats, note that some sites export records in a number of them. For example, see the list on http://cogprints.org/266/


Alphabetically:


CCF

CCF (Common Communication Format) (Based on ISO 2709)

See also:

  • CCF/B: CCF for bibliographic information [1]
  • CCF/F: CCF for factual information [2]


CIMI

CIMI (Computer Interchange of Museum Information), based on ISO 2709



CZP

Largely a profile.

  • CZP (Content Zoekprofiel), a profile based on IEEE LOM



DAIA

Document Availability Information API

http://www.gbv.de/wikis/cls/DAIA_-_Document_Availability_Information_API

Dublin Core (DC)

Dublin Core (DC) [4] [5] [6] was designed as a simplified, fairly general-purpose metadata profile (15 elements).

It is found stored in:

  • DC-XML [7] (perhaps most common?)
  • DC-HTML [8]
  • DC-TEXT [9]
  • DC-RDF [10]
  • DC-DS-XML (Description sets) [11]

EAD

Encoded Archival Description, an XML-based serialization (fairly rare in general, apparently used in some historical archives?)


EAP

EPrints Application Profile, also (and probably more commonly) known as the Scholarly Works Application Profile; see #SWAP

GILS

Largely a profile.

Government Information Locator Service (sometimes Global Information Locator Service?), mostly an index of US federal resources, and does not seem particularly suited to bibliographic uses.

An XML-based metadata format with a fairly human-readable list of named values (perhaps a few dozen options) (thereby resembling things like DC-XML and MODS to some degree).




GRS-1

GRS-1/GRS1 (Generic Record Syntax 1, supersedes GRS-0) stores a record as a tree hierarchy.

Has a bit of a learning curve and, being based on ASN.1, does not have a serialization outside of that (tutorials use varying plain-text representations). I'm guessing that makes it rare outside Z39.50.

GRS-1 offers things like easy encoding of variant forms, and of typed data (even binary data) more easily than in various other forms.



ISO 2709

ISO 2709 can be seen as the basis of a family of metadata formats: it mostly specifies the serialization, while the formats based on it, such as CCF and MARC (and its many variants), define the storage/use details, field meanings, and such.

Many of the real-world ISO 2709 derivatives are more specifically MARC variants, of which there are many.

See also #MARC, and everything that mentions ISO 2709 here.


ISO/DIS 25577

See #MarcXchange


ISO 20775

Record format for holdings information (relatively rare/specific?)

http://www.loc.gov/standards/iso20775/

LOM / LOMS

IEEE 1484.12, Learning Objects Metadata (Standard)

(XML-based)

(rare so far?)


MAB

MAB (Maschinelles Austauschformat fuer Bibliotheken),

Largely replaced by MARC21 now, see [15]


MADS

MADS (Metadata Authority Description Schema) (XML)


MARC and MARCXML

(Usually-ISO2709-stored) MARC

MARC (MAchine-Readable Cataloging), usually serialized per Z39.2 / ISO 2709 (and often loosely equated with them), and its many variants. An (incomplete) list of these variants:

USMARC

USMARC [16] (once called LCMARC(verify)); see MARC21

CAN/MARC

Canadian, but see MARC21.

UKMARC

[17]

MARC21

Evolved, joined version of USMARC and CAN/MARC.

[18]

[19]

[20]

DANMARC, DANMARC2

(Some Danish libraries use mainly/only this)

OCLC-MARC

[21]

AUSMARC

Mostly replaced by MARC21?

NORMARC

(based on MARC21)

SWEMARC
UNIMARC

[22]

INTERMARC
SIGLEMARC

SIGLE being 'System for Information on Grey Literature in Europe'

IBERMARC
JPMARC / JAPAN MARC
TRC-MARC
CMARC
KORMARC
INDOMARC
MALMARC

Malaysian

RUSMARC

[23]

ANNAMARC
PICAMARC
LIBRISMARC
BIBSYS-MARC
FINMARC, etc.

FINMARC, FINMARC2000, MARC21-fin

CATMARC
HUNMARC
COMARC

Used by COBISS (Co-operative Online Bibliographic System and Services), Slovenia

XML-stored MARC

(Note that these may still be based on any MARC variant, dumped in XML instead of ISO2709)

MARCXML

A fairly straightforward XML re-realization of MARC21 (but, note, one of a handful of XML variants of MARC).


MarcXchange

A variation on MARCXML that is less picky in use(verify), and largely compatible with MARCXML.


UnimarcXML

Variation on MARCXML. Uses abbreviated forms of XML nodes.

http://www.bncf.firenze.sbn.it/progetti/unimarc/slim/documentation/unimarcslim.html

PICA XML

Not unlike MARCXML, but its tags are different



...and more

See e.g.

METS

METS (Metadata Encoding and Transmission Standard) (XML) (Somewhat specific to libraries or specific fields)



MODS

MODS (Metadata Object Description Schema) (XML) (seen in libraries and some specific fields), intended as a compromise between MARC's complexity and DC's simplicity [24]


MPEG21

MPEG21, an XML-based 'Rights Expression Language' to share metadata related to rights, permissions, and restrictions [25]

MPEG-21 DIDL (Digital Item Declaration Language) aims to be a 'well-defined data model for complex digital objects' [26] [27]


OAI-ORE

OAI-ORE [28] Resource Map


Onix

Onix (EDItEUR's Onix for Books XML) (seen used by publishers, also sometimes seen in libraries)



'OPAC'

See e.g. Z39.50's "Record Syntaxes" Appendix

(relatively rare/specific?)



SBN

SBN (based on ISO 2709(verify)), mostly used at Servizio Bibliotecario Nazionale [32](verify)


SUTRS

SUTRS (Simple Unstructured Text Record Syntax) is human-readable text which a client can present with little to no manipulation.

There is no defined structure. It is a record syntax only in that it is something that a search client can fall back on, but it guarantees no machine parseability; there are no field names or such, and every database could add its own conventions.



SWAP

SWAP (Scholarly Works Application Profile), also known as the EPrints Application Profile (EAP)

Eprints DC XML (EPDCX)


TEF

TEF (Thèses Électroniques Françaises) (used for French dissertations)

Zthes

http://zthes.z3950.org/schema/

Unsorted

  • EPrint3's export format [33]

Other metadata/record formats/standards


Again alphabetically, and none too detailed.


CSDGM

Content Standard for Digital Geospatial Metadata

[34]

EDIFACT

Developed for various (cross-)industry communication

[35]


hCard

A microformat that can store/describe people and addresses


IMS-CP

IMS Content Packaging Specification (rare?)

[36]


RDF

RDF (Resource Description Framework) is used as a serialization option in some recent standards

TEI

Text Encoding Initiative

[37]

vCard (.vcf)

Text-based format describing people and addresses RFC 2425, RFC 2426, [38]

Unsorted

http://www.loc.gov/standards/sru/resources/schemas.html

More detailed notes

MARC and MARCXML notes


MARC refers to a group of related metadata specifications, probably most commonly seen in remote catalogue search. It is one of the more general-(library-)purpose formats in that it has loads of fairly well specified fields to be used, though for the same reason it is not particularly useful for human consumption either.

MARC was designed to fit bibliographic data, authority data, classification data, holdings data (copy-specific information on a library resource such as call number, shelf location, and/or volumes held) and community information.


MARC records are often encoded in records conforming to ISO 2709 (from 1981), also known as ANSI/NISO Z39.2 (from 1985). ISO2709 centers around what a record consists of, such as coding of/in the leader, directory, control+data fields, and such.

MARC, on the other hand, refers to what can be stored, and how it should be placed in a record, where 'record' is an abstract thing that can be coded in ISO2709 (and, since it usually is, is designed with ISO2709 organization in mind).

MARC is not synonymous with Z39.2 / ISO 2709. Things can be based on and conform to ISO2709/Z39.2 but not be particularly related to MARC - one example being CCF (developed by Unesco/UNISIST).

Of course, such formats must still use fields, and are likely to imitate MARC's use of fields, as there is no use completely reinventing that. This is one of a few reasons that lead some people to use the terms ISO2709/Z39.2 and MARC interchangeably.


Variations

MARC formats mostly specify field meanings, general text coding, field value coding, and such.

There are, however, a lot of MARC standards (see the general metadata section for a list), all variations of each other, many of which are local to a country and sometimes even to a single catalogue. Most of these exist mostly because they were tuned to some specific wishes.

Fairly generally adopted formats include USMARC (common in Z39.50 servers), MARC21 (a harmonization of USMARC and CAN/MARC), and now MARCXML (based on MARC21).(verify) A number of specific libraries/catalogues may only support a variant that is not globally supported. For example, a few Danish libraries support DANMARC but no other MARC.(verify)

Even adhering to MARCs, specific systems still find the space to have system-local conventions to the storage/interpretation of data - helped by the fact that MARC allows user-defined fields.


Each standard has its own added serialization details, restrictions, and such. For example, MARC21 renames the identifier to 'subfield code', and adds details such as that tags must be numeric and in the 002-999 range, and that text must be coded either as UTF-8 or as MARC-8 (MARC-8 being one among various encodings based on the variable-character-size ISO 2022).


Data fields, data subfields, and idiosyncrasies


MARC records have fields, identified by a three-character alphanumeric string. In practice they are rarely anything but numeric, and some MARC variants require purely numeric identifiers. There is a subdivision into

  • control fields (those with information that may be necessary for correct/complete record parsing or interpretation), and
  • data fields (which do not, just store record data).

Control fields come before data fields (usually/always?), and start with 00 (e.g. 001, 008).


The (data) field identifiers are organized into categorizing ranges of sorts. An overview of these ranges, and some mention of some of the more interesting fields:

  • 0xx: local and global codes (call numbers, other holding, record, some subject)
  • 2xx: mostly item/issue/holding-specific information (edition, publisher, etc.)
    • 245: Title
    • 246: Alternative title / subtitle
  • 1xx: basic description (titles/names of entities)
    • 100: Primary author(s)
  • 7xx: more bibliographic details (more authors, names, item relations, etc.)
    • 700: Additional author(s)
    • 701-709: Anyone else involved (editors, illustrators, etc.)
    • 710: Corporate name, often used as author when author is not a person (don't treat as personal name; e.g. don't reorder)
    • 773 citation information (but see 76x-78x, and notes below)
  • 3xx: work/item-specific details (physical characteristics, publication frequency, etc.)
    • 300: Physical description, often follows [39]
  • 5xx: notes (content notes, summary, other general notes)
    • 520 summary/abstract
    • 505 'Formatted contents note' - TOCs and such
  • 6xx: mostly subjects, keywords
    • Controlled keywords (controlled per source at best, sometimes formatted oddly) are regularly present in one or a few fields in the set: 650, 600, 651, 653, 654, 690, 693, some others. Note that some of this use is not correct. For example, fields such as 600 and 610 (and perhaps 698) are not strictly keyword fields, but for general search you may just want to assume all of the 6xx range is keywords
  • 4xx and 8xx: Series data
  • 9xx: not defined. Can be used as custom fields for most any purpose
    • 900 free-form text (sometimes a description, abstract, or summary)

There exist various conventions (some could be called semi-standards) to the use of the 9xx fields. There are also various other fields (undefined fields in non-9xx ranges) that have database-specific meaning, and are sometimes used in wider conventions.



Fields are major categories of content (e.g. title, author, ISSN, physical properties, abstract, etc.). For each field there are usually a few subfields defined (a single alphanumeric character, a-z or 0-9, and usually a letter) that are used to separate specific details of that field, sometimes as some sort of signalling, or even just as delimiters of sorts.

For example, field 260 stores publication, and 260a (field 260, subfield a) stores the place of publication, b the name of the publisher, c the date of publication. 260d, 260e, 260f, and 260g exist but are rarely used, other subfields aren't defined for 260 (though are sometimes seen; some use non-defined fields for local purposes).

Most fields, such as title (245), see common use of only one or two subfields, some fields only have one or two subfields defined to start with.


The most interesting information commonly sits in the 'a' subfield. Few fields put interesting information elsewhere, and subfield use varies per MARC flavour, and changes in restandardizations. Fields that do place interesting information elsewhere include 773 (citation data), 260, and 856 (commonly used for URLs, with titles in y and the URL in u, though also in a).


A record is probably most accurately seen as a list of (field, subfield, value) tuples that, strictly speaking, you can't treat as a set, map, or such. For most fields such a map view can be very convenient, but it is iffy in the (fairly rare) case of repeating (sub)fields or (sub)field ranges. A (hashmap) view can be convenient for code that only consumes records, and probably not so much for code that produces MARC.
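The list-versus-map distinction above can be sketched roughly as follows. This is only an illustration: the record content is made up, and `as_map` and `first` are hypothetical helper names, not part of any MARC library.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

Triple = Tuple[str, str, str]  # (field tag, subfield code, value)

# Made-up record content, purely for illustration.
record: List[Triple] = [
    ("245", "a", "Galactic Invasion for Dummies"),
    ("100", "a", "Zim, I."),
    ("650", "a", "Invasions"),
    ("650", "a", "Self-help"),
]


def as_map(triples: List[Triple]) -> Dict[str, List[str]]:
    """Group values under 'tag+code' keys, keeping repeats in order."""
    view: Dict[str, List[str]] = defaultdict(list)
    for tag, code, value in triples:
        view[tag + code].append(value)
    return dict(view)


def first(view: Dict[str, List[str]], key: str) -> Optional[str]:
    """Return the first value stored for a key, or None when absent."""
    values = view.get(key)
    return values[0] if values else None
```

Keeping a list per key preserves repeated subfields, but the map still forgets which subfields belonged to the same field occurrence, which is why the plain list stays the safer representation for code that writes records.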


Fields may repeat (appear more than once in a record), as may subfields (appear more than once per field). Note there may be standardized limitations to repetition specific to a field.

Repeating subfields often signals one of:

  • A continuation. For example, a list of a subfields in a 520 field is probably a single multi-kilobyte summary that has been split into a number of subfields.
  • Separate pieces of information, one per subfield
  • Pseudo-records in the form of regular chunks of consistent subfields, say, a,d,e, a,d,e, a,d,e. Logic like "if the subfield is the same or lower in the order, this is a new pseudo-record" usually works, but not always, and you may want a non-ASCII collation for it.
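The "same or lower in the order starts a new pseudo-record" heuristic from the last bullet could look something like this. It is a sketch of that heuristic only; `group_pseudo_records` and its `order` parameter are hypothetical names, and the heuristic itself is known to fail on some data.

```python
from typing import Callable, List, Tuple

Subfield = Tuple[str, str]  # (subfield code, value)


def group_pseudo_records(
    subfields: List[Subfield],
    order: Callable[[str], object] = lambda code: code,
) -> List[List[Subfield]]:
    """Start a new group whenever a code is the same or lower than the last.

    `order` maps a subfield code to a sortable key. The default is plain
    string (ASCII) order, which sorts digits before letters.
    """
    groups: List[List[Subfield]] = []
    previous = None
    for code, value in subfields:
        key = order(code)
        if previous is None or key <= previous:
            groups.append([])  # same or lower code: new pseudo-record
        groups[-1].append((code, value))
        previous = key
    return groups


def digits_last(code: str):
    """Alternative collation: letters first, then digits."""
    return (code.isdigit(), code)
```

Passing `order=digits_last` is one way to get a non-ASCII collation for databases that put numeric subfields after the lettered ones.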


There are hundreds of fields with fairly settled meaning, and given subfield specifics there are probably thousands of differentiable usable fields that see some use somewhere. Still, many databases expose fairly limited records (only if a database uses MARC internally is it likely to get detailed).

A lot of fields are filled using conventions and by technical librarians, meaning that a lot of data can stand some transformation, canonicalization, lookups, and/or other polishing, for machine processing and/or user-friendly presentation. Few fields are likely to be present and directly printable - even the title can sometimes use some work. Author names are generally ill defined, but also an example where per-database processing can help a lot.


Note that few databases use (or expose/translate, in case they don't use MARC internally) all fields and subfields correctly.

For example, you may find the ISSN in 773 (hopefully in addition to 022, the place where it should be if the record stores it), you may find the title in 773a instead of 773t, you may find values in non-standard fields for which you can't imagine what the values could be, and you may find formatting errors.

These deviations tend to be consistent for a specific database, so search aggregation systems effectively require per-database remapping and normalization transformations if you want to actually use the records in any real detail.


Note that stored values sometimes have further delimiting and/or may reference other parts of the record (or other records) by implication, some of which is so structural that its originator may consider it almost part of the record structure itself.

Semi-sorted MARC notes



MARC-specific value coding
  • MARC codes for geographical areas [40] [41], countries [42]
  • MARC language codes: Mostly ISO 639-2 (but not always?) [43] [44]
  • MARC time periods [45]



On identifiers

MARC records do not necessarily contain identifiers for the databases they came from -- so it is not always possible to refer or (deep)link back to the database unambiguously.

Aside from the fact that MARC may store various types of information (bibliographic, classification, authority, and/or other types) and so may logically not include one, real-world systems often don't have strictly enforced rules either. Even if, say, an OPAC system seems to identify records using, say, control field 001, it may not necessarily have that field on all records.



On 773 ('Host Item')

The subfields on 773 are actually those that are defined for a range of fields (76x-78x), defined with relatively loose semantics, such as:

  • t: title
  • b: edition
  • d: publisher/issuer data
  • g: identifier for specific piece of information in host
  • x: ISSN
  • a: entry header(/name/title type thing)

These subfields gain more specific meanings on specific fields. For example, for academic articles you will primarily see 773, 'Host Item', referring to the serial that contains the item, where the subfields will likely contain:

  • t: journal title (and p is sometimes used for abbreviated title)
  • d: city, publisher, and/or related details
  • g: volume, issue, page (usually)
  • x: ISSN
  • a: may contain a formatted citation, or title, or other.

Of course, 773 is also used for things other than articles.



On 300 ('Physical description')

Many databases follow ISBD physical description to a decent degree.

For books, this field commonly mentions the number of pages, the number of pre-content pages (e.g. TOC) in roman numerals, and the presence of (and sometimes the number of pages spent on) illustrations.

Varying examples, ignoring/discarding subfields for a moment:

578 p.
xi, 116 p.
iii, 65, 93 p.
xvi, 271 p. ill.
74 p. of ill., 15 p.
7, xxii, ca. 11, 26 p.
 
[115] p.
26 [i.e. 52] p.
96 p., 8 p. of plates

4 v. (loose-leaf)
5 v.
8 v. in 5

11 folded leaves
297 leaves

The most common subfields:

  • 300/a: Extent, often describes amount of content, usually in pages or, for archive-style content often in items, containers, volumes, linear feet, etc. May sometimes be free-form text (e.g. "12 stereograph reels (7 double fr.)", "Jigsaw puzzle"), and describes any non-regular content (additional or main) such as maps, floppies/CDs, film reels, etc., and/or (other) things that should be in the b/c subfields.
    • 160 p.
    • xii, 462 p. : 313 fig. ; 25 cm + CD-ROM.
    • 1 sound disc
    • '379 s.'
  • 300/b: 'Other physical details', often used for description of visual content
    • ill.
    • col.
    • sd., col.
    • helt tab.
    • digital
  • 300/c: Dimensions (of book, plate, reel, etc.), for example:
    • 24 cm.
    • 30 x 55 cm.
    • 35 mm.
    • 4 3/4 in.

And occasionally:

  • 300/f may be present if unusual units are used, e.g. a:17 f:boxes
  • 300/3 gives material details, either detailing another subfield, or sometimes (essentially) being the main material description (verify)
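As an illustration of the kind of per-field polishing that extent values in 300/a invite, here is a rough sketch that estimates a page count from the simplest pagination forms listed earlier (arabic and roman numeral sequences followed by "p."). The function names are hypothetical, and the pattern deliberately gives up on bracketed counts, "ca.", volumes, leaves, and anything else it does not recognise.

```python
import re
from typing import Optional

ROMAN_VALUES = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

# One or more comma-separated arabic or roman numbers, then "p."
PAGINATION = re.compile(
    r"^\s*((?:(?:\d+|[ivxlcdm]+)\s*,\s*)*(?:\d+|[ivxlcdm]+))\s+p\.",
    re.IGNORECASE,
)


def roman_to_int(text: str) -> int:
    """Convert a roman numeral such as 'xvi' using the subtractive rule."""
    values = [ROMAN_VALUES[ch] for ch in text.lower()]
    total = 0
    for index, value in enumerate(values):
        if index + 1 < len(values) and value < values[index + 1]:
            total -= value
        else:
            total += value
    return total


def estimate_pages(extent: str) -> Optional[int]:
    """Sum the leading page sequences of an extent, or None if unrecognised."""
    match = PAGINATION.match(extent)
    if match is None:
        return None
    total = 0
    for part in match.group(1).split(","):
        part = part.strip()
        total += int(part) if part.isdigit() else roman_to_int(part)
    return total
```

Only the leading pagination is read, so values with a second page statement later in the string are undercounted; a real system would likely need per-database rules on top of something like this.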



Notes on coded records, ISO 2709

Note that the following mixes terminology and coding details from both ISO 2709, and some values specifically for the MARC21 variation.


A record can be said to consist of:

  • a leader
  • a directory
  • the record's (coded) data:
    • zero or more control fields
    • zero or more data fields

The record data is the actual coded data, storing all record data except the tags (which come from the directory).


The leader (a.k.a. record label) is 24 characters long and contains the

  • Record length (5 bytes at position 0, left-zero-padded non-null-terminated ASCII string)
  • Status (1 byte)
  • Record type (1 byte)
  • Implementation-specific (2 bytes)
  • Character coding (1 byte)
  • Indicator count (1 byte) (value fixed to 2 in MARC21)
  • Subfield code length (1 byte) (value fixed to 2 in MARC21) (The ISO / ANSI standards call this 'identifier length' instead, 'subfield code' is the MARC term)
  • The data's base address (5 bytes, left-zero-padded non-null-terminated ASCII string)
  • Implementation-specific (3 bytes)
  • Entry map (4 bytes), specifies the directory entry structure, and has four individual bytes storing:
    • length of length-of-field (value fixed to 4 in MARC21)
    • starting-position (value fixed to 5 in MARC21)
    • length of implementation-defined (value fixed to 0 in MARC21)
    • undefined

Since the record length and base address mentioned above are five-digit ASCII strings, the maximum record length is 99999 bytes. The record length counts the whole record, including the leader and directory, so the space left for field data is somewhat less.
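The leader layout listed above can be read with plain slicing. The following is a minimal sketch that follows the byte positions as given in the list; the `Leader` class and `parse_leader` function are hypothetical names, and the sample leader in the usage note is made up.

```python
from dataclasses import dataclass


@dataclass
class Leader:
    record_length: int
    status: str
    record_type: str
    character_coding: str
    indicator_count: int
    subfield_code_length: int
    base_address: int
    length_of_length: int
    length_of_start: int
    length_of_impl_defined: int


def parse_leader(raw: bytes) -> Leader:
    """Split a 24-byte leader into the parts listed in the text."""
    if len(raw) < 24:
        raise ValueError("a leader is 24 bytes long")
    text = raw[:24].decode("ascii")
    return Leader(
        record_length=int(text[0:5]),
        status=text[5],
        record_type=text[6],
        # positions 7-8 are implementation-specific
        character_coding=text[9],
        indicator_count=int(text[10]),
        subfield_code_length=int(text[11]),
        base_address=int(text[12:17]),
        # positions 17-19 are implementation-specific
        length_of_length=int(text[20]),
        length_of_start=int(text[21]),
        length_of_impl_defined=int(text[22]),
    )
```

For example, `parse_leader(b"00714cam a2200205 a 4500")` yields a record length of 714 and a base address of 205, with the entry map values 4, 5, and 0 that MARC21 fixes.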


Directory

The directory contains entries that specify the positions of individual fields, and their tags. Entries contain:

  • the field's 3-character tag, indicating the content type.
  • the length of the field (in the data segment)
  • the position of the field (in the data segment)
  • Implementation-defined part (optional)

In MARC21, these are 12-byte things as those four elements respectively take 3, 4, 5, and 0 bytes.

Note that because of the four-digit length coding, a field's total length is at most 9999 bytes, and the data you can store is a little less. This is one reason long data is sometimes fragmented among multiple fields/subfields.
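Walking the directory is mostly fixed-width slicing. This sketch assumes the MARC21 entry layout (3 + 4 + 5 bytes) and assumes that the directory ends with a field terminator byte (0x1E) just before the base address; that terminator comes from the MARC21 specification rather than from the notes above. The function names are hypothetical.

```python
from typing import List, Tuple

LEADER_LENGTH = 24
ENTRY_LENGTH = 3 + 4 + 5  # tag + length-of-field + starting position (MARC21)


def parse_directory(record: bytes) -> List[Tuple[str, int, int]]:
    """Return (tag, field length, start offset) for each directory entry."""
    base_address = int(record[12:17].decode("ascii"))
    # The directory runs from the end of the leader up to the field
    # terminator (0x1E) that sits just before the base address.
    directory = record[LEADER_LENGTH:base_address - 1].decode("ascii")
    if len(directory) % ENTRY_LENGTH != 0:
        raise ValueError("directory length is not a multiple of 12")
    entries = []
    for offset in range(0, len(directory), ENTRY_LENGTH):
        entry = directory[offset:offset + ENTRY_LENGTH]
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return entries


def field_bytes(record: bytes, length: int, start: int) -> bytes:
    """Slice one field (terminator included) out of the data segment."""
    base_address = int(record[12:17].decode("ascii"))
    return record[base_address + start:base_address + start + length]
```

Positions in the directory are relative to the base address, which is why `field_bytes` adds it before slicing.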


Control fields exist for (machine-readable) codes, and seem meant for coding that may be necessary to parse the data fields that come after them. Control fields have no subfield codes (unlike data fields) and, in MARC21 at least, no indicators(verify), so a control field has only a tag and a value. In comparison, a regular data field has:

  • a tag
  • indicators (optional)
  • subfield code
  • value (specific to subfield)


Example in MARCXML coding:

<controlfield tag="001">99448814</controlfield>

<datafield tag="035" ind1=" " ind2=" ">
  <subfield code="a">(DLC)12848717</subfield>
</datafield>
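Reading such MARCXML into (tag, subfield code, value) tuples can be sketched with the standard library's ElementTree. Real MARCXML normally lives in the `http://www.loc.gov/MARC21/slim` namespace, so this sketch strips namespaces and matches on local names; the `<record>` wrapper in the usage note and the function names are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from typing import List, Optional, Tuple

Triple = Tuple[str, Optional[str], str]  # (tag, subfield code or None, value)


def local_name(tag: str) -> str:
    """Drop a '{namespace}' prefix from an ElementTree tag, if present."""
    return tag.rsplit("}", 1)[-1]


def parse_marcxml_fields(xml_text: str) -> List[Triple]:
    """Collect control fields and data subfields in document order."""
    root = ET.fromstring(xml_text)
    triples: List[Triple] = []
    for element in root.iter():
        name = local_name(element.tag)
        if name == "controlfield":
            # Control fields have no subfields: just a tag and a value.
            triples.append((element.get("tag"), None, element.text or ""))
        elif name == "datafield":
            for sub in element:
                if local_name(sub.tag) == "subfield":
                    triples.append(
                        (element.get("tag"), sub.get("code"), sub.text or "")
                    )
    return triples
```

Wrapping the two elements above in a `<record>` element and passing the result to `parse_marcxml_fields` gives one control-field tuple for 001 and one tuple for 035 subfield a. Indicators are ignored here to keep the sketch short.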

It is not yet clear to me how the distinction is made between control and data fields (or whether this varies between MARC variants). It seems that the two are disjoint: fields with tag 001-009 are control fields and the rest are data fields, but if so, then I've seen violations of this.



Bibliographic fields can use indicators (0-9 chars) and subfield identifiers (0-9 chars) as detailed by the leader.

  • Indicators provide further information about field contents, field relationships and manipulations. They are fairly rare, and rarely more than one or two characters when they are used (MARCXML fixes/truncates to two)
  • Identifiers are used to both delimit and type possible subfields in the data for a field. When used, field data consists of a repetition of:
    • the subfield delimiter 0x1F followed by a code; the identifier length given in the leader (0-9) counts the delimiter plus the code, so MARC21's value of 2 means a one-character code,
    • the subfield data




Records are terminated/separated with the 0x1D byte.
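Putting the delimiters together, a data field can be taken apart roughly as below. This is a sketch under MARC21-style assumptions: two indicator characters, a one-character subfield code after each 0x1F, and a 0x1E field terminator (the latter is taken from the MARC21 specification, not from the notes above). The function names are hypothetical.

```python
from typing import List, Tuple

SUBFIELD_DELIMITER = b"\x1f"
FIELD_TERMINATOR = b"\x1e"
RECORD_TERMINATOR = b"\x1d"


def split_records(stream: bytes) -> List[bytes]:
    """Cut a byte stream into records, keeping each 0x1D terminator."""
    return [
        chunk + RECORD_TERMINATOR
        for chunk in stream.split(RECORD_TERMINATOR)
        if chunk
    ]


def parse_data_field(
    raw: bytes,
    indicator_count: int = 2,
    code_length: int = 1,
    encoding: str = "utf-8",
) -> Tuple[str, List[Tuple[str, str]]]:
    """Return (indicators, [(subfield code, value), ...]) for one data field."""
    raw = raw.rstrip(FIELD_TERMINATOR)
    indicators = raw[:indicator_count].decode("ascii")
    subfields = []
    # Everything before the first delimiter is the indicators, so skip it.
    for chunk in raw[indicator_count:].split(SUBFIELD_DELIMITER)[1:]:
        code = chunk[:code_length].decode("ascii")
        value = chunk[code_length:].decode(encoding)
        subfields.append((code, value))
    return indicators, subfields
```

The `encoding` default assumes UTF-8 data; records coded in MARC-8 would need a separate conversion step that this sketch does not attempt.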

On semi-processed graphical MARC representation

The subfield separator 0x1F (IS1, Unit Separator (US) from ISO 646) may be graphically represented as $.

The record terminator 0x1D (IS3, Group Separator (GS) from ISO 646) may be represented as \ or RT.

A space can be used in indicators and identifiers(verify) to mean undefined or not applicable. A # in indicator fields seems to serve the same purpose.

For example, a semi-processed MARC field might be shown as:

260##$aNew York, N.Y. : $bElsevier, $c1984.

(no indicators, three subfields present)
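A display line like the one above can be turned back into its parts with simple string handling. This sketch assumes the layout shown (three-character tag, two indicator characters, then `$`-prefixed subfields) and assumes values never contain a literal `$`; `parse_display_field` is a hypothetical name.

```python
from typing import List, Tuple


def parse_display_field(line: str) -> Tuple[str, str, List[Tuple[str, str]]]:
    """Split 'TTTii$a...$b...' into (tag, indicators, subfields)."""
    tag = line[:3]
    # '#' is a display stand-in for a blank (undefined) indicator.
    indicators = line[3:5].replace("#", " ")
    subfields = []
    for chunk in line[5:].split("$")[1:]:
        if chunk:  # skip empty pieces from a doubled or trailing '$'
            subfields.append((chunk[0], chunk[1:].strip()))
    return tag, indicators, subfields
```

Applied to the 260 example, this gives tag `260`, two blank indicators, and the subfields a, b, and c with their trailing ISBD punctuation left in place.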


Semi-sorted notes

MODS


MODS is a format with a structure of about two dozen main named fields and a good number of specific named subfields, which allow decent, semantically accurate storage and basic relations.


MODS is loosely based on MARC, that is, on its ideas about what to separate into specific fields.

As such, there are mappings between MARC and MODS, though the conversion is still lossy.


(Note: MODS is also seen embedded in METS)


METS


METS (Metadata Encoding and Transmission Standard) is an XML-based format for documents that describe objects, but it is actually mostly an encapsulation, because it is agnostic about the metadata scheme you actually choose.

METS probably has value when one of the following applies:

  • you wish to tie metadata more closely to the data it describes, and want metadata to point into parts of it rather than at it in general
  • you have a repository primarily of readable or watchable things (e.g. text documents), rather than one containing bibliographic records
  • you want to try to separate:
    • reference to external file(s)
    • object structure/hierarchy metadata
    • rights, terms and conditions metadata
    • administrative metadata (non-resource stuff such as source, etc.)




Unsorted


I'm guessing it is generally most important to support MARC21, USMARC, DANMARC, MARCXML, arguably SUTRS, GRS-1, and MODS/METS (verify)



Z39.50:

  • It seems that InternationalString should generally be interpreted as a GeneralString in version 3, and a VisibleString in version 2 (see also ASN.1#Notes_on_strings) (apparently meaning that a record conforming to version 3 may not conform to version 2, if it uses InternationalString)


SUTRS:

  • Defined in Z39.50 since 1995
  • uses 0x0A (LF) as a line separator
  • Recommendation to (by default) not exceed 72 characters per line.
  • Z39.50 defines a SUTRS record to be a single InternationalString


It may contain lines like:

AUTHOR: I. Zim
TITLE: Galactic Invasion for Dummies

...but these field names are not standardized, so they could just as well be AU, AUT, or uncapitalised. Field names, if any, may be counted on to be constant for a specific database.
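Since SUTRS guarantees no structure, any parsing is guesswork. The following is a heuristic sketch only: it assumes the `LABEL: value` convention shown above and treats unlabelled lines as continuations, both of which a given database may well violate. The pattern and the function name are hypothetical.

```python
import re
from typing import List, Tuple

# A short label made of letters, digits, spaces, '_' or '-', then a colon.
LABEL_LINE = re.compile(r"^([A-Za-z][A-Za-z0-9 _-]{0,30}):\s*(.*)$")


def guess_sutrs_fields(text: str) -> List[Tuple[str, str]]:
    """Guess (LABEL, value) pairs from 'LABEL: value' lines."""
    fields: List[Tuple[str, str]] = []
    for line in text.split("\n"):  # SUTRS uses LF as the line separator
        match = LABEL_LINE.match(line)
        if match:
            label = match.group(1).strip().upper()
            fields.append((label, match.group(2).strip()))
        elif fields and line.strip():
            # No label: treat as a wrapped continuation of the last value.
            label, value = fields[-1]
            fields[-1] = (label, (value + " " + line.strip()).strip())
    return fields
```

Labels are upper-cased so that `Author`, `AUTHOR`, and `author` compare equal; mapping `AU` or `AUT` onto the same key would still need a per-database table.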

Various library-related resources use some variant of MARC, although there are exceptions. For example, Z39.50 interface supports only MODS, GRS-1, and SUTRS





Notes on metadata models

Dublin Core


Made to allow a simple, fairly general-purpose standard to be broadly used.

(Simple) Dublin Core has fifteen elements:

  • Title: Name by which the resource is (formally) known
  • Creator: Entity primarily responsible for the resource
  • Contributor: Any entity responsible for making contributions
  • Date: Point/period associated with the resource's lifecycle (suggestion: W3CDTF profile of ISO 8601; see also Common date formats).
  • Type: basic type of the resource (suggestion: DCMI)
  • Format: (suggestion: MIME types)
  • Subject: (suggestion: use a controlled vocabulary)
  • Description: free form, an abstract, a TOC, etc.
  • Coverage: Mostly the spatial/temporal context (its definition is wider)
  • Language: Language code. (suggestion: RFC 4646)
  • Publisher: Entity responsible for resource's availability
  • Identifier: resource's identifier in a system
  • Relation: Related resource (suggestion: use identifiers)
  • Source: Related resource from which this resource is derived.
  • Rights: details on (intellectual) property rights

'Qualified Dublin Core' adds:

  • Audience
  • Provenance
  • RightsHolder

and


Notes:

  • The standards suggest, but do not require, much specific formatting, reference codes, or controlled vocabularies.
  • No fields for bibliographic information.



Basic bibliographic/citation models


From the perspective of other fields, the most important difference between bibliographic records and more general metadata is probably specifically the concept of an article citation. It fits badly in general models (such as DC) as it adds some important descriptors (e.g. journal, volume, issue, page) that do not fit into more general models.

FRBR

These are primarily notes
It won't be complete in any sense.
It exists to contain fragments of useful information.

FRBR is a semantic model intended for bibliographic information (FRBR stands for Functional Requirements for Bibliographic Records), also for libraries.

Its design separates things into three groups: works and such, people, and relationships for those two (subject/event/place).


It is practical in that it would make bibliographic information more meaningful to work with in automated processes.

However, there is some question whether it is fit as a reference model for bibliographic information, since some types of information found in libraries do not immediately fit the model, which may lead to varied models and arguable use that breaks the automated/meaningful nature somewhat.

There are also limits to applying it to existing information, as various relations are hard to extract authoritatively. Practice shows that automatically dealing with even just sets of works can be quite a bit of work to do accurately.

One has to wonder whether it is just something hip to play with, but it certainly has various merits, and has usefully driven people to think about normalized forms of data, and to think about collocation-based browsing of materials.



Group 1 means to refer to the products of intellectual or artistic endeavour:

  • A Work, a distinct intellectual or artistic creation.
  • An Expression, refers to unique intellectual/artistic form present in a realization of a work.
  • A Manifestation is the physical embodiment of an expression.
  • An Item is a single concrete exemplar of a manifestation.

An example using books:

  • a book someone writes is a work
  • ...with at least one and usually just the one expression (the original text)
  • Different releases/publications are different manifestations. Consider paperback and hardcover variations, and those from different countries. In this case, a manifestation correlates strongly with 'something that gets an unique ISBN'
  • each physical book is an item
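The Group 1 hierarchy in the book example maps naturally onto nested data classes. This is only one possible modelling sketch; the attribute names (`language`, `binding`, `barcode`) and the `count_items` helper are assumptions chosen for illustration and are not prescribed by FRBR.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Item:
    barcode: str  # one physical copy


@dataclass
class Manifestation:
    isbn: str  # a release that would get its own ISBN
    binding: str
    items: List[Item] = field(default_factory=list)


@dataclass
class Expression:
    language: str  # e.g. the original text
    manifestations: List[Manifestation] = field(default_factory=list)


@dataclass
class Work:
    title: str
    expressions: List[Expression] = field(default_factory=list)


def count_items(work: Work) -> int:
    """Count all physical copies reachable from a work."""
    return sum(
        len(manifestation.items)
        for expression in work.expressions
        for manifestation in expression.manifestations
    )
```

A strict tree like this is the simple case; the translation and annotated-edition questions discussed in these notes are exactly where deciding which level a new record belongs to stops being mechanical.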

Another example:

  • 'any edition of Alice in Wonderland' is a work
  • 'The Annotated Alice' is an expression


Possible problem cases include translations, annotated editions and such. These can be considered a different expression of the same work, or a different work, which you can argue depends on the specific case - translations that involve/require involved creativity may well be considered a new work.

These sort of cases can break FRBR's potential simplicity (which is not altogether surprising, as it is a fairly simple model of a fuzzy domain).


Group 2 deals with the custodianship of Group 1 entities, mainly:

  • Persons
  • Corporate bodies


Group 3 deals with subject/event/place relations for Group 1 and 2

  • Concepts
  • Objects
  • Events
  • Places


Further notes:

  • Various bibliographical identifiers implicitly do some FRBRization. For example, XISBN can be seen as a manifestation identifier.

FRAD

Functional Requirements for Authority Data

http://en.wikipedia.org/wiki/Functional_Requirements_for_Authority_Data

On the web

Microformats

Microformats refer to embedded semantic information (often in XML/XHTML), often of a simple, pragmatic, ad-hoc nature.


FOAF

Unsorted

Ruby's postulate: "The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata."