Metadata models and standards

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Ruby's postulate: "The accuracy of metadata is inversely proportional to the square of the distance between the data and the metadata."


Metadata records

Record syntaxes, Profiles


A 'record syntax' defines the structure a record should follow.

Record syntaxes include ISO2709/MARC, GRS-1, XML, and in some ways SGML, HTML, and SUTRS.


An 'application profile' (a term most often seen with Z39.50) refers to a collection of conventions and (usually) standards that makes it easier to get different systems to cooperate for specific purposes.

They may focus on any or all of:

  • search attribute sets (for clearly defined searchability, particularly of non-regular fields)
  • communication details
  • conventions in filling data records (fields to use, formatting to use, required fields, etc.)

Many application profiles are extensions of others.

Some standards are made mostly for ease of interchange and are for the most part profiles - consider for example Dublin Core, GILS, CIMI (their storage is not part of the profile, and/or already settled in the standard that encapsulates the profile and more - e.g. ISO2709 for CIMI and XML for GILS).

http://www.loc.gov/z3950/agency/profiles/about.html


Various profiles (some of these are old, some rather specific and non-librarian)

  • Union Catalogue Profile [3]

Specific:

  • GILS profile [4]
  • Zthes (for thesauri)



Notes on metadata models

Metadata models are about which pieces of information are most useful to combine in a given context, in a way that is standard enough to be predictable and useful

(...rather than e.g. how to transfer that metadata)


Dublin Core


Dublin Core (DC) [5] [6] [7] was designed as a simplified, fairly broadly applicable metadata profile.

(Simple) Dublin Core has fifteen elements:

  • Title: Name by which the resource is (formally) known
  • Creator: Entity primarily responsible for the resource
  • Contributor: Any entity responsible for making contributions
  • Date: Point/period associated with the resource's lifecycle (suggestion: W3CDTF profile of ISO 8601; see also Common date formats).
  • Type: basic type of the resource (suggestion: DCMI)
  • Format: (suggestion: MIME types)
  • Subject: (suggestion: use a controlled vocabulary)
  • Description: free form, an abstract, a TOC, etc.
  • Coverage: Mostly the spatial/temporal context (its definition is wider)
  • Language: Language code. (suggestion: RFC 4646)
  • Publisher: Entity responsible for resource's availability
  • Identifier: resource's identifier in a system
  • Relation: Related resource (suggestion: use identifiers)
  • Source: Related resource from which this resource is derived.
  • Rights: details on (intellectual) property rights
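
As a rough illustration of how flat such a record is, the sketch below builds a simple DC record with Python's standard xml.etree.ElementTree. The wrapper element name record, the helper name dc_record, and all values are made up for the example; only the dc namespace URI is the real one. The DC-XML guidelines have their own container conventions, which this does not try to follow.

```python
import xml.etree.ElementTree as ET

DC_NS = "http://purl.org/dc/elements/1.1/"  # the Simple Dublin Core element namespace

def dc_record(values):
    """Build a <record> element with one dc:* child per (element, value) pair."""
    ET.register_namespace("dc", DC_NS)
    record = ET.Element("record")  # wrapper name is an assumption, not part of DC
    for name, value in values:
        child = ET.SubElement(record, "{%s}%s" % (DC_NS, name))
        child.text = value
    return record

# Hypothetical values; elements may repeat, and none are required.
example = dc_record([
    ("title", "An example report"),
    ("creator", "Doe, Jane"),
    ("date", "2009-03-14"),         # W3CDTF style
    ("language", "en"),
    ("format", "application/pdf"),  # MIME type
])
xml_text = ET.tostring(example, encoding="unicode")
```

Passing a list of pairs rather than a dict keeps repeated elements (say, two creators) possible.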

'Qualified Dublin Core' adds:

  • Audience
  • Provenance
  • RightsHolder

and


Notes:

  • The standards suggest, but do not require, much specific formatting, reference codes or controlled vocabularies.
  • No fields for bibliographic information.


  • It is found stored in e.g.
DC-XML [8] (perhaps most common?)
DC-HTML [9]
DC-TEXT [10]
DC-RDF [11]
DC-DS-XML (Description sets) [12]



Basic bibliographic/citation models


From the perspective of other fields, the most important difference between bibliographic records and more general metadata is probably specifically the concept of an article citation. It fits badly in general models (such as DC) as it adds some important descriptors (e.g. journal, volume, issue, page) that do not fit into more general models.


FRAD

Functional Requirements for Authority Data

http://en.wikipedia.org/wiki/Functional_Requirements_for_Authority_Data

Library-related metadata/record formats/standards




To see examples of several of these: some sites export to a bunch of them. For example, see the list on http://cogprints.org/266/


Alphabetically:


CCF

CCF (Common Communication Format) (Based on ISO 2709)

See also:

  • CCF/B: CCF for bibliographic information [13]
  • CCF/F: CCF for factual information [14]


CIMI

CIMI (Computer Interchange of Museum Information), based on ISO 2709



CZP

Largely a profile.

  • CZP (Content Zoekprofiel), a profile based on IEEE LOM



DAIA

Document Availability Information API

http://www.gbv.de/wikis/cls/DAIA_-_Document_Availability_Information_API


Dublin core in XML

https://www.dublincore.org/specifications/dublin-core/dc-xml-guidelines/

EAD

Encoded Archival Description, an XML-based serialization (fairly rare in general, apparently used in some historical archives?)


Some examples [16]


EAP

EPrints Application Profile, also (and probably more commonly) known as the Scholarly Works Application Profile; see #SWAP

GILS

Government Information Locator Service (sometimes Global Information Locator Service?), mostly an index of US federal resources, and does not seem particularly suited to bibliographic uses.


Largely a profile, but works out as an XML-based metadata format with a fairly human-readable list of named values (perhaps a few dozen options) (thereby resembling things like DC-XML and MODS to some degree).



GRS-1

GRS-1/GRS1 (Generic Record Syntax 1, supersedes GRS-0) stores a record as a tree hierarchy.

Has a bit of a learning curve and, being based on ASN.1, does not have a serialization outside of that (tutorials use varying plain-text representations). I'm guessing that makes it rare outside Z39.50.

GRS-1 offers things like easy encoding of variant forms, and of typed data (even binary data) more easily than in various other forms.



ISO 2709

ISO 2709 can be said to be a family of metadata types: it specifies mostly the serialization, while CCF, MARC (and its many variants), and anything else based on it define the storage/use details, field meanings, and such.

Many of the real-world ISO 2709 derivatives are more specifically MARC variants, of which there are many.

See also #MARC, and everything that mentions ISO 2709 here.


ISO/DIS 25577

See #MarcXchange


ISO 20775

Record format for holdings information (relatively rare/specific?)

http://www.loc.gov/standards/iso20775/

LOM / LOMS

IEEE 1484.12, Learning Objects Metadata (Standard)

(XML-based)

(rare so far?)


MAB

MAB (Maschinelles Austauschformat fuer Bibliotheken),

Largely replaced by MARC21 now, see [20]


MADS

MADS (Metadata Authority Description Schema) (XML)


MARC and MARCXML

(Usually-ISO2709-stored) MARC

MARC (MAchine-Readable Cataloging) (usually stored as Z39.2 / ISO 2709, and often loosely equated with them) and its many variants. An (incomplete) list of these variants:


  • USMARC [21] (once called LCMARC(verify)); see MARC21
  • CAN/MARC
Canadian, but see MARC21.
  • UKMARC
[22]


  • MARC21
apparently originates as an evolved, joined version of USMARC and CAN/MARC.
[23]
[24]
[25]


  • DANMARC, DANMARC2
(Some Danish libraries use mainly/only this)
  • OCLC-MARC
[26],
  • AUSMARC
Mostly replaced by MARC21?
  • NORMARC
(based on MARC21)
  • SWEMARC
  • UNIMARC
[27]
  • INTERMARC
  • SIGLEMARC
SIGLE being 'System for Information on Grey Literature in Europe'
  • IBERMARC
  • JPMARC / JAPAN MARC
  • TRC-MARC
  • CMARC
  • KORMARC
  • INDOMARC
  • MALMARC
Malaysian
  • RUSMARC
[28]
  • ANNAMARC
  • PICAMARC
  • LIBRISMARC
  • BIBSYS-MARC
  • FINMARC, etc.
FINMARC, FINMARC2000, MARC21-fin
  • CATMARC
  • HUNMARC
  • COMARC
Used by COBISS (Co-operative Online Bibliographic System and Services), Slovenia

XML-stored MARC

(Note that these may still be based on any MARC variant, dumped in XML instead of ISO2709)


  • MARCXML
A fairly straightforward XML re-realization of MARC21 (but, note, one of a handful of XML variants of MARC).
http://www.loc.gov/standards/marcxml/


  • MarcXchange
A variation on MARCXML that is less picky in use(verify), and largely compatible with MARCXML.
http://www.loc.gov/standards/iso25577/


  • UnimarcXML
Variation on MARCXML. Uses abbreviated forms of XML nodes.
http://www.bncf.firenze.sbn.it/progetti/unimarc/slim/documentation/unimarcslim.html


  • PICA XML
Not unlike MARCXML, but its tags are different
http://www.gbv.de/wikis/cls/PICA_XML_Version_1.0



METS

METS (Metadata Encoding and Transmission Standard) (XML) (Somewhat specific to libraries or specific fields)


METS notes


METS (Metadata Encoding and Transmission Standard) is an XML-based format for documents that describe objects, but is actually mostly an encapsulation, because it is agnostic about the metadata scheme you actually choose.

METS probably has value when one of the following applies:

  • you wish to tie metadata more closely to the data it describes, and want metadata to point into parts of it rather than at it in general
  • you have a repository primarily of readable or watchable things (e.g. text documents), rather than one containing bibliographic records
  • you want to try to separate:
    • reference to external file(s)
    • object structure/hierarchy metadata
    • rights, terms and conditions metadata
    • administrative metadata (non-resource stuff -- source, )





MODS

MODS (Metadata Object Description Schema) is an XML format, seen in libraries and some specific fields, intended as a compromise between MARC's more generic complexity and DC's more semantic simplicity.


https://www.loc.gov/standards/mods/userguide/examples.html



MODS notes


MODS has a structure of about two dozen main named fields and a good number of specific named subfields, which allow decent, semantically accurate storage and basic relations.


MODS is loosely based on MARC, that is, on its ideas about what to separate into specific fields.

As such, there are mappings between MARC and MODS, though the conversion is still lossy.


(Note: MODS is also seen embedded in METS)


MPEG21

MPEG21, an XML-based 'Rights Expression Language' to share metadata related to rights, permissions, and restrictions [29]

DIDL

MPEG-21 DIDL (Digital Item Declaration Language) aims to be a 'well-defined data model for complex digital objects' [30] [31]

It has been used in archives

OAI-ORE


Serialization

And you may want a graph-style description of the potentially complex ways its parts relate, so there are described ways to express it...

...and the HTTP implementation considers what you should do when the Resource Map is exposed in more than one of these formats



Onix

Onix (EDItEUR's Onix for Books XML) (seen used by publishers, also sometimes seen in libraries)



'OPAC'

See e.g. Z39.50's "Record Syntaxes" Appendix

(relatively rare/specific?)



SBN

SBN (based on ISO 2709(verify)), mostly used at Servizio Bibliotecario Nazionale[36](verify)


SUTRS

SUTRS (Simple Unstructured Text Record Syntax) is human-readable text which a client can present with little to no manipulation.

There is no defined structure. It is a record syntax only in that it is something that a search client can fall back on, but it guarantees no machine parseability; there are no field names or such, and every database could add its own conventions.



SWAP

SWAP (Scholarly Works Application Profile), also known as the EPrints Application Profile (EAP)

Eprints DC XML (EPDCX)


TEF

TEF (Thèses Électroniques Françaises) (used for French dissertations)



Zthes

http://zthes.z3950.org/schema/

Other metadata/record formats/standards

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)

Again alphabetically, and none too detailed.


CSDGM

Content Standard for Digital Geospatial Metadata

[39]

EDIFACT

Developed for various (cross-)industry communication

[40]


hCard

A microformat that can store/describe people and addresses


IMS-CP

IMS Content Packaging Specification (rare?)

[41]


TEI

Text Encoding Initiative

[42]

vCard (.vcf)

Text-based format describing people and addresses RFC 2425, RFC 2426, [43]

Unsorted

http://www.loc.gov/standards/sru/resources/schemas.html

More detailed notes

MARC and MARCXML notes


MARC refers to a group of related metadata specifications, probably most commonly seen in federated catalogue search.

It is one of the more general-(library-)purpose formats in that it has loads of fairly well specified fields to be used, though for the same reason is not particularly useful for human consumption either.

MARC was designed to fit bibliographic data, authority data, classification data, holdings data (copy-specific information on a library resource such as call number, shelf location, and/or volumes held) and community information.


MARC records are often encoded in records conforming to ISO 2709 (from 1981), basically equivalent to ANSI/NISO Z39.2 (from 1985).

ISO2709 centers around what a record consists of, such as coding of/in the leader, directory, control+data fields, and such.

MARC, on the other hand, refers more to what can be stored, and how it should be placed in a record, where 'record' is an abstract thing that can be coded in ISO2709. ...and, since it usually is, is designed with ISO2709 organization in mind.


Point being that once you get into the details, MARC is not synonymous with Z39.2 / ISO 2709. Things can be based on and conform to ISO2709/Z39.2 but not be particularly related to MARC, one example being CCF (developed by Unesco/UNISIST).

Such cases probably still use fields, and are still likely to imitate MARC's use of fields - no use completely reinventing that. This is one of a few reasons that lead some people to use the terms ISO2709/Z39.2 and MARC interchangeably.


Variations

MARC formats mostly specify field meanings, general text coding, field value coding, and such.

There are, however, a lot of MARC standards (see the general metadata section for a list), all variations of each other, many of which are local to a country and sometimes even to a single catalogue. Most of these standards exist mostly because they were tuned to some specific wishes.

Fairly generally adopted formats include

USMARC (common in Z39.50 servers)
MARC21 (a harmonization of USMARC and CAN/MARC)
MARCXML (based on MARC21).(verify)

A number of specific libraries/catalogues offer some translation, but others support only a format that is less than globally supported. For example, a few Danish libraries support DANMARC but no other MARC.(verify) That said, even those adhering to wider standards find room for system-local conventions in the storage/interpretation of data, helped by the fact that MARC allows user-defined fields.

So in practice, everything is both sort of close enough to each other, and everything can stand some looking at the details.


Each standard has its own added serialization details, restrictions, and such. For example, MARC21 renames the identifier as a 'subfield code', adds details like that tags must be numeric and in the 002-999 range, that text must be coded either as UTF-8 or as MARC-8 (MARC8 being one among various encodings based on the variable-character-size ISO 2022).

Data fields, data subfields, and idiosyncrasies


MARC records have fields, identified by a three-character alphanumeric string. In practice they are rarely anything but numeric, and some MARC variants require purely numeric identifiers. There is a subdivision into

  • control fields (those with information that may be necessary for correct/complete record parsing or interpretation), and
  • data fields (which do not, just store record data).

Control fields come before data fields (usually/always?), and start with 00 (e.g. 001, 008).




The (data) field identifiers are organized into categorizing ranges of sorts. An overview of these ranges, and some mention of some of the more interesting fields:

  • 0xx: local and global codes (call numbers, other holding, record, some subject)
  • 2xx: mostly item/issue/holding-specific information (edition, publisher, )
    • 245: Title
    • 246: Alternative title / subtitle
  • 1xx: basic description (titles/names of entities)
    • 100: Primary author(s)
  • 7xx: more bibliographic details (more authors, names, item relations, etc.)
    • 700: Additional author(s)
    • 701-709: Anyone else involved (editors, illustrators, etc.)
    • 710: Corporate name, often used as author when author is not a person (don't treat as personal name; e.g. don't reorder)
    • 773 citation information (but see 76x-78x, and notes below)
  • 3xx: work/item-specific details (physical characteristics, publication frequency, etc)
    • 300: Physical description, often follows [44]
  • 5xx: notes (content notes, summary, other general notes)
    • 520 summary/abstract
    • 505 'Formatted contents note' - TOCs and such
  • 6xx: mostly subjects, keywords
    • Controlled keywords (controlled per source at best, sometimes formatted oddly) are regularly present in one or a few fields in the set: 650, 600, 651, 653, 654, 690, 693, some others. Note that some of this use is not strictly correct. For example, fields such as 600 and 610 (and perhaps 698) are not strictly keyword fields, but for general search you may just want to assume all of the 6xx range is keywords
  • 4xx and 8xx: Series data
  • 9xx: not defined. Can be used as custom fields for most any purpose
    • 900 free-form text (sometimes a description, abstract, or summary)

There exist various conventions (some could be called semi-standards) to the use of the 9xx fields. There are also various other fields (undefined fields in non-9xx ranges) that have database-specific meaning, and are sometimes used in wider conventions.



Fields are major categories of content (e.g. title, author, ISSN, physical properties, abstract, etc.) and for each field there are usually a few subfields defined (alphanumeric: a-z and 0-9, and usually a letter) that are used to separate specific details of that function, sometimes as some sort of signalling, or even just as delimiters of sorts.

For example, field 260 stores publication, and 260a (field 260, subfield a) stores the place of publication, b the name of the publisher, c the date of publication. 260d, 260e, 260f, and 260g exist but are rarely used, other subfields aren't defined for 260 (though are sometimes seen; some use non-defined fields for local purposes).

Most fields, such as title (245), see common use of only one or two subfields, some fields only have one or two subfields defined to start with.


The most interesting information commonly sits in the 'a' subfield. Few fields put interesting information elsewhere, and subfield use varies per MARC flavour, and changes in restandardizations. Fields that do place interesting information elsewhere include 773 (citation data), 260, and 856 (commonly used for URLs, with titles in y and the URL in u, though also in a).


A record is probably most accurately seen as a list of (field, subfield, value) tuples that, strictly speaking, you can't treat as a set or map. For most fields such a map view is nevertheless very convenient, though it is iffy in the (fairly rare) case of repeating (sub)field( range)s. A (hashmap) view can be convenient for code that only consumes records, and probably not so much for code that produces MARC.
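
To make the two views concrete, here is a minimal Python sketch. The record values and the helper name as_map are invented for illustration. The map keeps repeated values in a list, but it does forget which occurrence of a repeated field a subfield came from, which is exactly the iffy part.

```python
from collections import defaultdict

# A record as an ordered list of (field, subfield, value) tuples (made-up values).
record = [
    ("245", "a", "An example title"),
    ("100", "a", "Doe, Jane"),
    ("650", "a", "Metadata"),
    ("650", "a", "Cataloging"),
]

def as_map(triples):
    """Collapse tuples into {(field, subfield): [values...]}, keeping repeats in order."""
    view = defaultdict(list)
    for field, subfield, value in triples:
        view[(field, subfield)].append(value)
    return dict(view)

view = as_map(record)
titles = view.get(("245", "a"), [])    # a list, possibly empty
subjects = view.get(("650", "a"), [])  # both 650 values, in record order
```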


Fields may repeat (appear more than once in a record), as may subfields (appear more than once per field). Note there may be standardized limitations to repetition specific to a field.

Repeating subfields often signals one of:

  • A continuation. For example, a list of a subfields in a 520 field is probably a single multi-kilobyte summary that has been split into a number of subfields.
  • Separate pieces of information, one per subfield
  • Pseudo-records in the form of regular chunks of consistent subfields, say, a,d,e, a,d,e, a,d,e. Logic like "if the subfield is the same or lower in the order, this is a new pseudo-record" usually works, but not always, and you may want a non-ASCII collation for it.
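
The "same or lower in the order" heuristic for pseudo-records could look roughly like this in Python. The function name is made up, and it uses plain string comparison, so digits sort before letters; swap in your own ordering if a database's convention differs.

```python
def split_pseudo_records(subfields):
    """Group (code, value) pairs, starting a new group when the code does not increase."""
    groups = []
    current = []
    previous = None
    for code, value in subfields:
        if previous is not None and code <= previous:
            groups.append(current)
            current = []
        current.append((code, value))
        previous = code
    if current:
        groups.append(current)
    return groups

# Made-up subfields following an a,d,e, a,d,e pattern.
chunks = split_pseudo_records([
    ("a", "first"), ("d", "1990"), ("e", "x"),
    ("a", "second"), ("d", "1991"), ("e", "y"),
])
```

As the text says, this usually works but not always: a group that legitimately repeats a code (a,a,d) gets split in two.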


There are hundreds of fields with fairly settled meaning, and given subfield specifics there are probably thousands of differentiable usable fields that see some use somewhere. Still, many databases expose fairly limited records (only if a database uses MARC internally is it likely to get detailed).

A lot of fields are filled using conventions and by technical librarians, meaning that a lot of data can stand some transformation, canonicalization, lookups, and/or other polishing, for machine processing and/or user-friendly presentation. Few fields are likely to be present and directly printable - even the title can sometimes use some work. Author names are generally ill defined, but also an example where per-database processing can help a lot.


Note that few databases use (or expose/translate, in case they don't use MARC internally) all fields and subfields correctly.

For example, you may find the ISSN in 773 (hopefully in addition to 022, the place that it should be if the record stores it; 020 is for the ISBN), you may find the title in 773a instead of 773t, you may find values in non-standard fields whose meaning you can only guess at, and you may find formatting errors.

These deviations tend to be consistent for a specific database, so search aggregation systems effectively require per-database remapping and normalization transformations if you want to actually use the records in any real detail.
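
One way to organize such per-database fix-ups is a small rule table. This is only a sketch: the database name "example-db", the REMAP table and the normalize helper are all hypothetical, and real systems will need value-level clean-up as well as field moves.

```python
# Hypothetical fix-up rules per database: (field, subfield) -> (field, subfield).
REMAP = {
    "example-db": {
        ("773", "a"): ("773", "t"),  # this source puts the host title in $a
    },
}

def normalize(triples, database):
    """Return (field, subfield, value) tuples with per-database moves applied."""
    rules = REMAP.get(database, {})
    return [rules.get((f, s), (f, s)) + (v,) for f, s, v in triples]

fixed = normalize([("773", "a", "Journal of Examples")], "example-db")
untouched = normalize([("773", "a", "Journal of Examples")], "other-db")
```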


Note that stored values sometimes have further delimiting and/or may reference other parts of the record (or other records) by implication, some of which is so structural that its originator may consider it almost part of the record structure itself.

Semi-sorted MARC notes



MARC-specific value coding
  • MARC language codes: Mostly ISO 639-2 (but not always?) [48] [49]
  • MARC time periods [50]



On identifiers

MARC records do not necessarily contain identifiers for the databases they came from -- so it is not always possible to refer or (deep)link back to the database unambiguously.

Aside from the fact that MARC may store various types of information (bibliographic, classification, authority, and/or other types) and so may logically not include one, real-world systems often don't have strictly enforced rules either. Even if, say, an OPAC system seems to identify records using, say, control field 001, it may not necessarily have that field on all records.



On 773 ('Host Item')

The subfields on 773 are actually defined for a range of fields (76x-78x, including things like subseries, supplements, ordered entries, additional forms, etc.).


Across those fields, they serve general functions like:

  • t: title
  • b: edition
  • d: publisher/issuer data
  • g: identifier for specific piece of information in host
  • x: ISSN
  • a: entry header(/name/title type thing)


On a specific field, these subfields often take on more specific meanings.

For example, for academic articles you will primarily see 773, 'Host Item', referring to the serial that contains the item, so the subfields will likely contain:

  • t: journal title (and p is sometimes used for abbreviated title)
  • d: city, publisher, and/or related details
  • g: volume, issue, page (usually)
  • x: ISSN
  • a: may contain a formatted citation, or title, or other.

...though because 773 is also frequently used for things other than articles, many of those subfields may be missing.
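
Since g usually holds volume, issue and pages as free text, getting them out is a best-effort job. The Python sketch below assumes English-style abbreviations (vol., no., p.); the sample string and the helper name parse_773g are invented, and real data will need per-database patterns.

```python
import re

# Assumed abbreviations; many databases use other languages or layouts.
PATTERNS = {
    "volume": r"\bvol\.?\s*(\d+)",
    "issue": r"\bno\.?\s*(\d+)",
    "pages": r"\bp\.?\s*(\d+(?:\s*-\s*\d+)?)",
}

def parse_773g(text):
    """Best-effort extraction of volume, issue and pages from a 773 $g string."""
    found = {}
    for key, pattern in PATTERNS.items():
        match = re.search(pattern, text, flags=re.IGNORECASE)
        if match:
            found[key] = match.group(1)
    return found

parts = parse_773g("Vol. 12, no. 3 (1999), p. 45-67")  # made-up example value
```

Anything not recognized is simply left out of the result, so callers can fall back to showing the raw string.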

On 300 ('Physical description')

Many databases follow ISBD physical description to a decent degree.

For books, this field commonly mentions the number of pages, the number of pre-content pages (e.g. TOC) in roman numerals, and the presence of (and sometimes the number of pages spent on) illustrations.

Varying examples, ignoring/discarding subfields for a moment:

578 p.
xi, 116 p.
iii, 65, 93 p.
xvi, 271 p. ill.
74 p. of ill., 15 p.
7, xxii, ca. 11, 26 p.
 
[115] p.
26 [i.e. 52] p.
96 p., 8 p. of plates

4 v. (loose-leaf)
5 v.
8 v. in 5

11 folded leaves
297 leaves
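
For the simplest of these forms you can pull out page counts mechanically. A rough Python sketch follows; the function names are made up, and it deliberately handles only the "578 p." and "xi, 116 p." shapes, returning None for everything else rather than guessing.

```python
import re

ROMAN = {"i": 1, "v": 5, "x": 10, "l": 50, "c": 100, "d": 500, "m": 1000}

def roman_to_int(text):
    """Convert a roman numeral such as 'xvi' to an integer."""
    values = [ROMAN[ch] for ch in text.lower()]
    total = 0
    for i, value in enumerate(values):
        if i + 1 < len(values) and value < values[i + 1]:
            total -= value  # subtractive notation, e.g. the i in ix
        else:
            total += value
    return total

def page_counts(extent):
    """Return (preliminary_pages, numbered_pages), or None if the shape is not recognized."""
    match = re.match(r"^\s*(?:([ivxlcdm]+),\s*)?(\d+)\s*p\.", extent, flags=re.IGNORECASE)
    if not match:
        return None
    prelim = roman_to_int(match.group(1)) if match.group(1) else 0
    return prelim, int(match.group(2))
```

Bracketed counts, multiple sequences, volumes and leaves are left alone here; they need their own rules.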

The most common subfields:

  • 300/a: Extent, often describes amount of content, usually in pages or, for archive-style content often in items, containers, volumes, linear feet, etc. May sometimes be free-form text (e.g. "12 stereograph reels (7 double fr.)", "Jigsaw puzzle"), and describes any non-regular content (additional or main) such as maps, floppies/CDs, film reels, etc., and/or (other) things that should be in the b/c subfields.
    • 160 p.
    • xii, 462 p. : 313 fig. ; 25 cm + CD-ROM.
    • 1 sound disc
    • '379 s.'
  • 300/b: 'Other physical details', often used for description of visual content
    • ill.
    • col.
    • sd., col.
    • helt tab.
    • digital
  • 300/c: Dimensions (of book, plate, reel, etc.), for example:
    • 24 cm.
    • 30 x 55 cm.
    • 35 mm.
    • 4 3/4 in.

And occasionally:

  • 300/f may be present if unusual units are used, e.g. a:17 f:boxes
  • 300/3 gives material details, either detailing another subfield, or sometimes (essentially) being the main material description (verify)




Notes on coded records, ISO 2709

Note that the following is meant as introduction, and mixes terminology and coding details from both ISO 2709, and some values specifically for the MARC21 variation.


A record can be said to consist of:

  • a leader
  • a directory
  • the record's (coded) data:
    • zero or more control fields
    • zero or more data fields

The record data is the actual coded data, storing all record data except the tags (which come from the directory).


A 24-character record label, a.k.a. Leader, which contains the

  • Record length (5 bytes at position 0, left-zero-padded non-null-terminated ASCII string)
  • Status (1 byte)
  • Record type (1 byte)
  • Implementation-specific (2 bytes)
  • Character coding (1 byte)
  • Indicator count (1 byte) (value fixed to 2 in MARC21)
  • Subfield code length (1 byte) (value fixed to 2 in MARC21) (The ISO / ANSI standards call this 'identifier length' instead, 'subfield code' is the MARC term)
  • The data's base address (5 bytes, left-zero-padded non-null-terminated ASCII string)
  • Implementation-specific (3 bytes)
  • Entry map (4 bytes), specifies the directory entry structure, and has four individual bytes storing:
    • length of length-of-field (value fixed to 4 in MARC21)
    • length of starting-position (value fixed to 5 in MARC21)
    • length of implementation-defined (value fixed to 0 in MARC21)
    • undefined

Since the record length and base address mentioned above are five-digit ASCII strings, neither can exceed 99999. In MARC21 the record length counts the whole record (leader, directory, and data), so the space left for the data itself is somewhat less than that.
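
The leader layout listed above maps directly onto fixed slices. A minimal Python sketch, with key names of my own choosing; the sample leader is only an illustration of a typical MARC21 one, and a leader with blanks in the numeric positions would make int() fail here.

```python
def parse_leader(leader):
    """Split a 24-character leader (as str) into named parts, following the list above."""
    if len(leader) != 24:
        raise ValueError("leader must be exactly 24 characters")
    return {
        "record_length": int(leader[0:5]),
        "status": leader[5],
        "record_type": leader[6],
        "implementation_1": leader[7:9],
        "character_coding": leader[9],
        "indicator_count": int(leader[10]),
        "subfield_code_length": int(leader[11]),
        "base_address": int(leader[12:17]),
        "implementation_2": leader[17:20],
        "entry_map": leader[20:24],
    }

info = parse_leader("00714cam a2200205 a 4500")  # illustrative leader
```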


Directory

The directory contains entries that specify the positions of individual fields, and their tags. Entries contain:

  • the field's 3-character tag indicating the content type.
  • the length of the field (in the data segment)
  • the position of the field (in the data segment)
  • Implementation-defined part (optional)

In MARC21, these are 12-byte things as those four elements respectively take 3, 4, 5, and 0 bytes.

Note that because of the length coding, a field's maximum total length is 9999 bytes, and the data you can store is a little less. This is one reason long data is sometimes fragmented among multiple fields/subfields.
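
With the MARC21 sizes (3 + 4 + 5 + 0), reading the directory is a matter of walking it in 12-character steps. A small Python sketch; the function name is made up, and it assumes the directory may still carry the 0x1E field terminator that ends it in MARC21.

```python
def parse_directory(directory):
    """Split a MARC21-style directory into (tag, length, start) tuples."""
    directory = directory.rstrip("\x1e")  # drop the terminator that closes the directory
    if len(directory) % 12:
        raise ValueError("directory length is not a multiple of 12")
    entries = []
    for offset in range(0, len(directory), 12):
        entry = directory[offset:offset + 12]
        entries.append((entry[0:3], int(entry[3:7]), int(entry[7:12])))
    return entries

# Two made-up entries: field 001 (9 bytes at 0) and field 245 (20 bytes at 9).
entries = parse_directory("001000900000" + "245002000009" + "\x1e")
```

For other ISO 2709 flavours the three widths would come from the leader's entry map instead of being hardcoded.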


Control fields exist for (machine-readable) codes, and seem meant for coding that may be necessary to parse the data fields that come after them. Control fields have no subfield codes (unlike data fields) and often no indicators(verify), so a control field has only a tag and a value. In comparison, a regular data field has:

  • a tag
  • indicators (optional)
  • subfield character
  • value (specific to subfield)


Example in MARCXML coding:

<controlfield tag="001">99448814</controlfield>

<datafield tag="035" ind1=" " ind2=" ">
  <subfield code="a">(DLC)12848717</subfield>
</datafield>

It is not yet clear to me how the distinction is made between control and data field (or whether this varies between MARC variants). It seems that the two are disjunct: fields with tag 001-009 are control fields and the rest are data fields -- but if so, then I've seen violations of this.
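
In MARCXML the distinction is at least explicit in the element names, so reading it is simple. A Python sketch using the standard library follows; the function names and the record wrapper around the fragment shown earlier are my own, and the namespace stripping is there because real MARCXML usually carries the http://www.loc.gov/MARC21/slim namespace.

```python
import xml.etree.ElementTree as ET

def local(tag):
    """Strip an XML namespace: '{uri}datafield' -> 'datafield'."""
    return tag.rsplit("}", 1)[-1]

def marcxml_triples(xml_text):
    """Return (tag, subfield_code_or_None, value) tuples in document order."""
    triples = []
    for element in ET.fromstring(xml_text).iter():
        kind = local(element.tag)
        if kind == "controlfield":
            triples.append((element.get("tag"), None, element.text or ""))
        elif kind == "datafield":
            for sub in element:
                if local(sub.tag) == "subfield":
                    triples.append((element.get("tag"), sub.get("code"), sub.text or ""))
    return triples

sample = """<record>
  <controlfield tag="001">99448814</controlfield>
  <datafield tag="035" ind1=" " ind2=" ">
    <subfield code="a">(DLC)12848717</subfield>
  </datafield>
</record>"""
triples = marcxml_triples(sample)
```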



Bibliographic fields can use indicators (0-9 chars) and subfield identifiers (0-9 chars) as detailed by the leader.

  • Indicators provide further information about field contents, field relationships and manipulations. They are fairly rare, and rarely more than one or two characters when they are used (MARCXML fixes/truncates to two)
  • Identifiers are used to both delimit and type possible subfields in the data for a field. When used, field data consists of a repetition of:
    • the subfield delimiter 0x1F followed by a code; the identifier length given in the leader (0-9) counts the delimiter too, so MARC21's value of 2 means a one-character code,
    • the subfield data




Records are terminated/separated with the 0x1D byte.
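
Putting leader, directory and fields together, a reader for one record could be sketched as below in Python. It hardcodes MARC21 sizes (12-byte directory entries, two indicators, one-character subfield codes), assumes UTF-8 text, and treats 00x tags as control fields; build_record exists only to produce a sample to parse. All names are my own.

```python
FT, US, RT = b"\x1e", b"\x1f", b"\x1d"  # field terminator, subfield delimiter, record terminator

def build_record(fields):
    """Serialize [(tag, body_bytes)] into one record, for demonstration only."""
    directory, data = b"", b""
    for tag, body in fields:
        body += FT
        directory += b"%s%04d%05d" % (tag.encode("ascii"), len(body), len(data))
        data += body
    directory += FT
    base = 24 + len(directory)
    total = base + len(data) + 1  # +1 for the record terminator
    leader = b"%05dnam a22%05d a 4500" % (total, base)
    return leader + directory + data + RT

def parse_record(raw):
    """Parse one record into (tag, indicators, value-or-subfield-list) tuples."""
    base = int(raw[12:17])
    directory = raw[24:base - 1]  # base - 1 skips the directory's own terminator
    fields = []
    for offset in range(0, len(directory), 12):
        entry = directory[offset:offset + 12]
        tag = entry[0:3].decode("ascii")
        length, start = int(entry[3:7]), int(entry[7:12])
        data = raw[base + start:base + start + length].rstrip(FT)
        if tag.startswith("00"):  # control field: tag and value only
            fields.append((tag, None, data.decode("utf-8")))
        else:
            indicators = data[0:2].decode("ascii")
            subfields = [(chunk[0:1].decode("ascii"), chunk[1:].decode("utf-8"))
                         for chunk in data[2:].split(US) if chunk]
            fields.append((tag, indicators, subfields))
    return fields

raw = build_record([
    ("001", b"99448814"),
    ("260", b"  " + US + b"aNew York, N.Y. :" + US + b"bElsevier," + US + b"c1984."),
])
fields = parse_record(raw)
```

A file of records can be split on the 0x1D byte first and fed through parse_record one at a time; MARC-8 data would need decoding that this sketch does not attempt.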

On semi-processed graphical MARC representation

The subfield separator 0x1F (IS1, Unit Separator (US) from ISO 646) may be graphically represented as $.

The record terminator 0x1D (IS3, Group Separator (GS) from ISO 646) may be represented as \ or RT.

A space can be used in indicators and identifiers(verify) to mean undefined or not applicable. A # in indicator fields seems to serve the same purpose.

For example, a semi-processed MARC field might be shown as:

260##$aNew York, N.Y. : $bElsevier, $c1984.

(no indicators, three subfields present)
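
Lines in this display form are easy to take apart again. The Python sketch below assumes exactly this layout (three-character tag, two indicator characters with # for blank, $ before each one-character code); the function name is made up, and a literal $ inside the data would break it.

```python
def parse_display_line(line):
    """Parse a line like '260##$a...$b...' into (tag, indicators, [(code, value), ...])."""
    tag, indicators, rest = line[0:3], line[3:5], line[5:]
    indicators = indicators.replace("#", " ")  # '#' stands for a blank indicator
    subfields = [(chunk[0], chunk[1:]) for chunk in rest.split("$") if chunk]
    return tag, indicators, subfields

parsed = parse_display_line("260##$aNew York, N.Y. : $bElsevier, $c1984.")
```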
