(Library) search system notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Search protocols

ANSI/NISO Z39.50 (also ISO 23950)


Z39.50 is a search-and-present protocol, one that is still fairly commonly used by libraries and scholarly databases.


Specifically:

  • ANSI/NISO Z39.50-1988: version 1 (considered obsolete)
  • ANSI/NISO Z39.50-1992: version 2 (incompatible with version 1)
  • ANSI/NISO Z39.50-1995: version 2 and 3 (a compatible superset of Z39.50-1992)
  • Z39.50-2003 - a clarified version of -1995?

Apparently various servers implement version 2 with some extensions from version 3.


  • runs on port 210 by default (the IANA-registered port), although many databases choose other ports.


A Z39.50 server's capabilities (services, record types, query operators) are usually not reported by the host itself; you generally have to discover them through experiments, then configure them. (Only a few percent of servers offer Explain-style support.)


Services

  • Basics include:
    • search
    • present
    • scan
    • sort (not always supported!)
  • Optionals:
    • extendedServices
    • namedResultSets
    • triggerResourceCtrl
    • delSet
    • negotiationModel
    • duplicationDetection

See also [2]


Z39.50 Attribute sets - Bib-1

These numbers are used in various Z39.50 query types (Type-1, Type-101), in PQF, and in relevant query translations.



Of the six types of attributes, use attributes are probably the most generally interesting. The others are only interesting if you're tweaking a client or query translator.


Bib-1 Use attributes (1)

'Use Attributes' are references to indices you can refer to in your searches. The most interesting are approximately:

  • 1016 is 'Any'
    • usually as an 'any common field', but may be creatively interpreted
    • other all-ish fields may appear in addition to (or sometimes instead of) 1016, for example 1035, 1036, and others
  • 4, Title
  • 1003, Author, though may be 1004 (Author-name personal), 1 (Personal name), and/or others at a specific database
  • 31, Date of publication (30 is the more general Date)
  • 7, ISBN
  • 8, ISSN
  • 21, Subject heading
  • 1035, 'Anywhere', which has assumed status of 'anywhere, including abstract and/or full text'
  • 1036, Author-Title-Subject, and other variations on this with two of the three and/or others.

In practice, you may require per-target use attribute remapping to convert queries, to get consistent behaviour out of a larger set of targets in which some are somewhat unusual.
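A minimal sketch of such per-target remapping, assuming a plain lookup table is enough. The use attribute numbers come from the list above; the target names and the override choices are made up for illustration.

```python
# Default Bib-1 use attributes for a few logical fields (numbers from the list above).
DEFAULT_USE = {
    "any": 1016,
    "title": 4,
    "author": 1003,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
}

# Hypothetical per-target overrides; these target names do not refer to real servers.
TARGET_OVERRIDES = {
    "example-target-a": {"author": 1004, "any": 1035},
    "example-target-b": {"author": 1},
}


def use_attribute(field, target=None):
    """Return the Bib-1 use attribute to send for a logical field at a target."""
    overrides = TARGET_OVERRIDES.get(target, {})
    if field in overrides:
        return overrides[field]
    return DEFAULT_USE[field]
```

A query translator would call use_attribute('author', 'example-target-a') while building the query for that target, and fall back to the defaults for targets that have no entry.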


The Z39.50 Bib-1 documentation mentions the MARC fields that each use attribute would likely map to.



Relation attributes (2)

  • Values:
    • 1, < less than
    • 2, <= less than or equal
    • 3, = equal
    • 4, >= greater or equal
    • 5, > greater than
    • 6, <> not equal
    • 100, phonetic
    • 101, stem
    • 102, relevance
    • 103, always matches
    • 104, custom, per-target



Position attribute (3)

Default is 3; rarely changed, and rarely useful to change.

  • 1, first in field
  • 2, first in subfield
  • 3, any position in field


Structure attribute (4)

The most interesting are 105 and 1 (and perhaps 108); the rest are usually too target-specific.

  • 1, phrase (requires same order, and adjacency)
  • 2, word
  • 3, key
  • 4, year (4-digit)
  • 5, date (normalized, see ISO 8824)
  • 6, word list (orderless, target-specific interpretation(verify))
  • 100, date (un-normalized)
  • 101, name (normalized)
  • 102, name (un-normalized)
  • 103, structure
  • 104, urx
  • 105, free-form-text
  • 106, document-text
  • 107, local number
  • 108, string
  • 109, numeric string



Truncation attribute (5)

When the term itself marks the truncation point (value 101, process), the marker is #.

  • 1 right
  • 2 left
  • 3 left & right
  • 100 none
  • 101 process
  • 102 regular-1
  • 103 regular-2
  • 104 CCL



Completeness attribute (6)

  • 1 incomplete subfield
  • 2 complete subfield
  • 3 complete field

Z39.50 Attribute sets - Others


Z39.50 defines:

  • Bib-1 {Z39-50-attributeSet 1} (See ATR.1)
  • Exp-1 {Z39-50-attributeSet 2} (See ATR.2)
  • Ext-1 {Z39-50-attributeSet 3} (See ATR.3)
  • CCL-1 {Z39-50-attributeSet 4}
  • GILS {Z39-50-attributeSet 5}
  • STAS {Z39-50-attributeSet 6}

Bib-1 is (probably by far) the most common.


There is a Danish Dan-1 set, which is usually used in addition to Bib-1.



Unsorted

extended services -- [3] or [4] ?

Init, Search, Retrieval Facilities: http://www.loc.gov/z3950/agency/markup/04.html

Delete, Access control, Sort, Scan: http://www.loc.gov/z3950/agency/markup/05.html

Explain: http://www.loc.gov/z3950/agency/markup/07.html



http://www.loc.gov/z3950/agency/markup/13.html

http://www.loc.gov/z3950/agency/clarify/



SRU and related

  • SRU (Search/Retrieval via URL) is a simple protocol written as a successor to Z39.50. It mostly takes CQL(verify) queries in URLs and returns XML. ([5])
  • SRW (Search/Retrieval as a Web service) is a deprecated name for what should now be called 'SRU over SOAP'. ([6], [7])
  • SRW/U refers collectively to both SRU and SRW(verify)
  • MXG (Metasearch XML Gateway) is an XML gateway meant to make metasearch and the required interfacing with content providers easier. It is based on and (strongly) prefers interfacing to NISO SRU, though other HTTP-based resources that take queries and respond with XML can also be supported. ([8], [9])


SRU is not fixed to a particular schema (or even to XML(verify)) and should report the data format (dtd identifier). A content provider may choose an XML schema that suits its contents best, and mix them; a client is expected to transform the contents it receives. XML-coded Dublin Core [10] and MARCXML are not uncommon.

Sources may choose to transform whatever format they use internally to more specific-purpose schemas, such as Metadata Object Description Schema (MODS, [11]) for bibliographic ends, or specific schemas such as Learning Object Metadata (LOM, [12]) for e-learning ends.
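As a rough illustration of 'queries in URLs', here is a sketch that builds an SRU searchRetrieve request URL. The parameter names (operation, version, query, startRecord, maximumRecords, recordSchema) are the ones I recall from the SRU 1.1/1.2 specification; the base URL and the schema short name in the usage note are placeholders, since each server advertises its own.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, start=1, maximum=10,
                   schema=None, version="1.1"):
    """Build a searchRetrieve URL; base_url is whatever the provider documents."""
    params = [
        ("operation", "searchRetrieve"),
        ("version", version),
        ("query", cql_query),            # the CQL query, URL-encoded below
        ("startRecord", str(start)),     # SRU record positions start at 1
        ("maximumRecords", str(maximum)),
    ]
    if schema is not None:
        # ask for a specific record schema; the short names are server-defined
        params.append(("recordSchema", schema))
    return base_url + "?" + urlencode(params)
```

For example, sru_search_url('http://sru.example.org/db', 'dc.title any fish', schema='marcxml') gives a URL you can fetch with any HTTP client; the XML response then has to be transformed from whatever schema the server actually returned.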


OpenSearch

A9's OpenSearch is comparable in function to SRU and MXG.

Other

Custom (regularly XML/HTTP based)

Proxied

A third party implements the search and provides it, usually, over Z39.50, SRU, or SRW.

For example OpenTranslators

http://www.librarytechnology.org/ltg-displayarticle.pl?RC=12980


OAI and related


OAI refers to the Open Archives Initiative. It is effectively a way to share repository metadata content, to be combined, indexed and such elsewhere.


In part, it is a more elegant and less quirky solution than federated search, although the fact that it is based on Dublin Core effectively limits the scope of application.

OAI-PMH

OAI often implies a setup where OAI-PMH is used to provide / fetch the metadata. (PMH: Protocol for Metadata Harvesting)

XML-based, transferred over HTTP
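A sketch of what harvesting looks like from the client side, assuming the usual ListRecords verb with the oai_dc metadata prefix and resumption tokens for paging. The verb, argument names and XML namespace are from the OAI-PMH 2.0 specification as I remember it; the base URL in the usage note is a placeholder, and no HTTP fetching or error handling is shown.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI_NS = "{http://www.openarchives.org/OAI/2.0/}"


def list_records_url(base_url, metadata_prefix="oai_dc", resumption_token=None):
    """Build a ListRecords request URL for a first or a follow-up page."""
    if resumption_token:
        # a resumptionToken is sent on its own; it replaces the other arguments
        params = [("verb", "ListRecords"), ("resumptionToken", resumption_token)]
    else:
        params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    return base_url + "?" + urlencode(params)


def next_token(response_xml):
    """Return the resumptionToken from a response, or None on the last page."""
    root = ET.fromstring(response_xml)
    element = root.find(".//" + OAI_NS + "resumptionToken")
    if element is None or not (element.text or "").strip():
        return None
    return element.text.strip()
```

A harvester would fetch list_records_url('http://repo.example.org/oai'), store the records, and repeat with next_token() of each response until it returns None.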


OAI-ORE


Unsorted

Query formats

Z39.50 query formats (Type 1, 101, and others)


Z39.50 defines a few query types, of which type 1 and 101 are the most commonly supported.

Both are postfix grammars, but note they are a structure (transported via ASN.1?(verify)) rather than a direct string representation(verify).


From a few minutes of googling it seems that:

  • Type 1: called RPN (Reverse Polish Notation)
    • must be supported by conforming Z39.50 target
  • Type 101: Called ERPN (Extended RPN), a fairly basic extension of Type 1, with additions like the prox operator
    • Target may claim support of Type-101, of its Prox operator, and of its Restriction operand (independently?(verify)) (can be claimed using Explain, or just by informal description)
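To make the 'postfix' idea concrete, a small sketch that flattens a query tree into reverse Polish order and rebuilds it with a stack. This only illustrates the ordering; the real Type-1 query travels as a structured (ASN.1-style) value, and the tuple representation and operator names here are my own choice for the example.

```python
OPERATORS = ("and", "or", "and-not")


def to_postfix(node):
    """Flatten a tree of (operator, left, right) tuples and string terms."""
    if isinstance(node, str):
        return [node]
    operator, left, right = node
    # operands first, operator last: that is what makes it postfix
    return to_postfix(left) + to_postfix(right) + [operator]


def from_postfix(tokens):
    """Rebuild the tree from postfix order using a stack."""
    stack = []
    for token in tokens:
        if token in OPERATORS:
            right = stack.pop()
            left = stack.pop()
            stack.append((token, left, right))
        else:
            stack.append(token)
    if len(stack) != 1:
        raise ValueError("not a well-formed postfix query")
    return stack[0]
```

Note the limitation that a search term spelled exactly like an operator would be misread by from_postfix; a structured representation avoids that ambiguity, which is presumably one reason the protocol does not send a flat string.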


And apparently:

  • Type 0: a-priori agreement (verify)
  • Type 2: ISO8777 (verify)
  • Type 100: Common Command Language (CCL)? (verify)
  • Type 102: Ranked List query - which consists of a list of queries and a weight for each
  • Type 104: (recent) (verify)



CCL (Common Command Language)

  • and, or, not


CQL - Common Query Language (up to 1.1) / Contextual Query Language (since 1.2)

  • Developed and maintained by LoC (Z39.50 maintenance agency), apparently focusing on Z39.50 use
  • basic queries are fairly simple (google-style, though boolean treatment seems unusual); complex queries are useful for query syntax conversion (optional index/field names and such) and/or go beyond what 99% of people need.
  • The only query syntax in SRU and SRW. Usable in Z39.50, and other protocols.


'Contextual' seems to refer mostly to exact query semantics being left up more to interpretation/implementation than to being strictly coded into the syntax. In this way it allows for complex queries being valid on most systems (Level 2) but leaves implementing them and/or any cleverness up to the specific server.

Part of this lies in the CQL concept of context sets - a term CQL uses to refer to sets of attributes referring, usually, to record properties to search in, and in a few cases allow reference to an earlier search set to narrow down. Context sets include a (quite simple) cql context set, a srw set, a Bath set (to be deprecated in favour of bib), and more specific ones (some serving niches), like zthes, one for LOM, ccg (collectible card games), music, and any you wish to create yourself. The registered ones are listed at [13].

A system may choose to support only very simple queries, fairly simple queries, or parse all queries. No system is required to support everything it can parse (but it does have to report, via a diagnostic, that it does not support something -- see SRU's list). The three levels:

  • Level 0 must support simple term-only queries (words, doublequoted strings, backslash escaped doublequotes in value), and report all that is not actually supported
  • Level 1 adds the ability to parse both of the following, and support at least one of them at a time in a query (not necessarily both):
    • search clauses with 'index relation searchTerm' structure
    • boolean combinations
  • Level 2 requires that all of CQL be parsed, and that the server report what is not actually supported

Arguably you would always want level 2 parsers because it won't ever bomb parsing complex queries - level 0 and 1 are certain to. However, level 2 requires a proper parser while in 0 and 1 you can get away with simpler hackish string handling.


As to CQL syntax, see [14] for a definition and e.g. [15] for a decent introduction.

Some notes and examples of my own:

  • The syntax of CQL is case insensitive, though the values it carries (term values, modifier values, prefix map values) should be carried through unchanged.
  • A query is one clause, or multiple clauses booleaned together
    • booleans being and, or, not (meaning and not, not a unary operator), and prox (which is like an and with the extra requirement that it is close)
    • booleans are evaluated left to right (no precedence), parentheses can be used to override left-to-right evaluation
  • A single clause consists of
    • an index (optional, defaults to cql.serverChoice. For more details on index specification, see [16])
    • a relation (optional, defaults to = in 1.2 and to scr in 1.1)
    • a term (required) -- must be double-quoted if it contains whitespace (e.g. a space) or one of <>=/() ; it may always be quoted (quoting is only unnecessary for single-word terms). Backslash-escaped doublequotes may appear in values.
  • the specification of index, relations, booleans, and sorting can be modified by adding a / (optional whitespace around it) and modifier details.
  • sort specification, as last element (e.g. fish sortBy dc.title, and more specific through modification: fish sortBy dc.date/sort.ascending)
  • Note that a single clause can imply boolean-like interpretation, e.g. through cql.any and cql.all
  • You sometimes see examples like title=((dinosaur and bird) or dinobird) or dc.title=(kern* or ritchie). This seems to be invalid CQL, and probably comes from confusion since it is valid CCL.


As to the optionals in the clause: you can either use just the term or the full thing, for example 'fish' or 'cql.serverChoice = fish', but not, say, '= fish' or 'cql.serverChoice fish'.
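When generating CQL from code, the quoting and 'all or nothing' rules above can be wrapped in two small helpers. This is a sketch under my own naming (cql_term and cql_clause are not from any library); it also quotes terms that look like boolean keywords, and passes wildcard characters through untouched.

```python
import re

# characters that force double-quoting, per the term rule above
_NEEDS_QUOTES = re.compile(r'[\s<>=/()"]')
# bare words that would otherwise be read as syntax rather than as a term
_RESERVED = {"and", "or", "not", "prox", "sortby"}


def cql_term(value):
    """Return value as a CQL term, double-quoted only when necessary."""
    needs_quotes = (
        value == ""
        or _NEEDS_QUOTES.search(value) is not None
        or value.lower() in _RESERVED
    )
    if not needs_quotes:
        return value
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return '"%s"' % escaped


def cql_clause(term, index=None, relation=None):
    """Build 'term' or 'index relation term'; never just one of the optionals."""
    if (index is None) != (relation is None):
        raise ValueError("give both index and relation, or neither")
    if index is None:
        return cql_term(term)
    return "%s %s %s" % (index, relation, cql_term(term))
```

For example, cql_clause('lord of the rings', 'dc.title', 'adj') produces a clause with the term double-quoted, while cql_clause('fish') is just the bare word.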

Index references must have a base name (e.g. title), and may have a prefix referring to a context set. If the prefix is omitted, it is determined/guessed by the server, so things like title any fish and dc.title any fish are correct - and in various applications will behave identically.

You can also map your own prefixes, but this is rarely seen and probably about as rarely supported. One example (there are more styles):

dc = "info:srw/context-sets/1/dc-v1.1" dc.title any fish


The any (short for cql.any) in the examples above is the relation, which specifies the way you wish for the search to use the term. This includes equality, range tests, non-equality and such. You probably commonly use one of:

  • = - server choice, which may be clever or not so much. Will probably frequently choose == or adj depending on field and/or value
  • exact - exact match
  • == - exact match
  • adj - phrase searches, e.g. dc.title adj "lord of the rings"
  • any - any of the parts of the term value (sort of a shorthand for ORing together all the words in the term)
  • all - all of the parts of the term value (sort of a shorthand for ANDing together all the words in the term)
  • <> - not equal (similar to NOT)

For example:

dc.title any fish
dc.title = fish
cql.serverChoice adj fish

You can use relation modifiers to ask for more specific tests, or just hints, in how to evaluate the relation test. For example:

dc.title any/relevant fish
dc.title any/ relevant /cql.string fish
dc.title any/rel.algorithm=cori fish

Generally defined modifiers (with no specific implementation) include:

  • stem - use stemming before matching
  • relevant - conceptually fuzzy. For example subject any/relevant "fish frog" might match various specific fishes, frogs, amphibians, and whatnot.
  • fuzzy - unspecified fuzzy matching, e.g. to match misspelled words, in whatever way it chooses.
  • phonetic - match homophones


Other query examples:

title =/stem "completed dinosaurs"
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
(caudal or dorsal) prox vertebra
ribs prox/>/0/paragraph chevrons
dc.title any fish prox/unit=word/distance>3 dc.title any squirrel
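To show what left-to-right evaluation without precedence means for a parser, here is a deliberately tiny sketch that handles only bare or double-quoted terms, parentheses, and the booleans and, or, not and prox. It ignores indexes, relations, modifiers and sortBy, so it is nowhere near a Level 2 parser; all function names are my own.

```python
import re

_CQL_TOKEN = re.compile(r'\s*(\(|\)|"(?:\\.|[^"\\])*"|[^\s()]+)')
BOOLEANS = {"and", "or", "not", "prox"}


def tokenize(text):
    """Split into parentheses, quoted strings, and bare words."""
    return _CQL_TOKEN.findall(text)


def parse(tokens):
    tree, rest = _parse_expression(list(tokens))
    if rest:
        raise ValueError("unexpected token: %r" % rest[0])
    return tree


def _parse_expression(tokens):
    left, tokens = _parse_operand(tokens)
    # fold booleans as they come: 'a or b and c' becomes ((a or b) and c)
    while tokens and tokens[0].lower() in BOOLEANS:
        operator = tokens[0].lower()
        right, tokens = _parse_operand(tokens[1:])
        left = (operator, left, right)
    return left, tokens


def _parse_operand(tokens):
    if not tokens:
        raise ValueError("unexpected end of query")
    head = tokens[0]
    if head == "(":
        tree, rest = _parse_expression(tokens[1:])
        if not rest or rest[0] != ")":
            raise ValueError("missing closing parenthesis")
        return tree, rest[1:]
    if head == ")":
        raise ValueError("unexpected closing parenthesis")
    return head, tokens[1:]
```

With this, parse(tokenize('a or b and c')) groups the or first, whereas a parser with conventional precedence would bind the and first; parentheses are the only way to get the other grouping.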

? and * are wildcards for exactly one character and for zero or more characters, respectively. The syntax allows them at any point (so allows arbitrary, regexp-ish matching), although many real-world indexing systems that CQL is used on may not support that.
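If the backend has no wildcard support of its own, one option is to translate the term into a regular expression. A minimal sketch, assuming a backslash makes the next character literal; the function name is my own.

```python
import re


def cql_wildcards_to_regex(term):
    """Compile a CQL term with ? and * wildcards into a regular expression."""
    parts = []
    i = 0
    while i < len(term):
        char = term[i]
        if char == "\\" and i + 1 < len(term):
            # backslash-escaped character: take the next character literally
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if char == "*":
            parts.append(".*")   # zero or more characters
        elif char == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(char))
        i += 1
    return re.compile("".join(parts))
```

Use the result with fullmatch() so the pattern has to cover the whole indexed value, e.g. cql_wildcards_to_regex('kern*').fullmatch('kernighan').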


XCQL

(Parsed form of) CQL serialized into XML.

Apparently mostly used in messaging to show how a given query was parsed -- is verbose and more explicit (not requiring parentheses or precedence rules), in fact to the point where only debuggers would really want to read it.


XCQL uses a representation centered on elements it calls searchClause and triple.

A triple is a structure that builds up a tree, and consists of:

  • boolean
  • leftOperand: a (terminal) searchClause, or another triple
  • rightOperand: a (terminal) searchClause, or another triple

A searchClause may contain an index, relation and term

Both searchClause and triple may contain an array of prefixes, which contains/overrides the mapping at that level and deeper.
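A sketch of serializing a parsed query into this shape. The element names searchClause, triple, boolean, leftOperand, rightOperand, index, relation and term are the ones described above; wrapping the boolean and relation names in a value child is how I remember the XCQL schema, and namespaces, prefixes and modifiers are left out, so treat the exact output as an approximation.

```python
import xml.etree.ElementTree as ET
from collections import namedtuple

# illustrative in-memory representation of a parsed query (my own naming)
Clause = namedtuple("Clause", "index relation term")
Triple = namedtuple("Triple", "boolean left right")


def to_xcql(node):
    """Turn a Clause/Triple tree into an ElementTree element."""
    if isinstance(node, Triple):
        element = ET.Element("triple")
        boolean = ET.SubElement(element, "boolean")
        ET.SubElement(boolean, "value").text = node.boolean
        ET.SubElement(element, "leftOperand").append(to_xcql(node.left))
        ET.SubElement(element, "rightOperand").append(to_xcql(node.right))
        return element
    element = ET.Element("searchClause")
    ET.SubElement(element, "index").text = node.index
    relation = ET.SubElement(element, "relation")
    ET.SubElement(relation, "value").text = node.relation
    ET.SubElement(element, "term").text = node.term
    return element
```

For instance, ET.tostring(to_xcql(Triple('and', Clause('dc.title', 'any', 'fish'), Clause('cql.serverChoice', '=', 'frog'))), encoding='unicode') gives the verbose nested form, which shows why no parentheses or precedence rules are needed once the query is a tree.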


See also:

PQF (Prefix Query Format), also sometimes Prefix Query Notation (PQN)


Example:

@attrset 1.2.840.10003.3.1 @or  @attr 1=1016 "foo"  @attr 1=1016 "snake"

In which:

  • 1.2.840.10003.3.1 refers to the bib-1 attribute set
  • 1=1016 refers to (bib-1) use attribute (1) and the 'any' field (1016).
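A sketch of generating such PQF strings from attribute numbers. The @attrset, @attr and @or tokens are the ones in the example above; @and and @not, and the backslash-escaping of quotes inside the term, are assumptions about what the receiving PQF parser accepts. The helper names are my own.

```python
BIB1_OID = "1.2.840.10003.3.1"


def pqf_term(term, attrs):
    """Render one term with its attributes, e.g. attrs={1: 1016} for use=any."""
    parts = ["@attr %d=%d" % (kind, value) for kind, value in sorted(attrs.items())]
    # assumption: the PQF parser understands backslash escapes in quoted terms
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    parts.append('"%s"' % escaped)
    return " ".join(parts)


def pqf_bool(operator, left, right):
    """Prefix notation: the operator comes before both operands."""
    if operator not in ("and", "or", "not"):
        raise ValueError("unsupported operator: %r" % operator)
    return "@%s %s %s" % (operator, left, right)


def pqf_query(body, attrset=BIB1_OID):
    """Prepend the attribute set that the attribute numbers refer to."""
    return "@attrset %s %s" % (attrset, body)
```

Calling pqf_query(pqf_bool('or', pqf_term('foo', {1: 1016}), pqf_term('snake', {1: 1016}))) reproduces the example above (with single spaces), and adding e.g. 4: 1 to the attrs dict would request phrase structure for that term.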

Cheshire II (C2)


http://cheshire.berkeley.edu/cheshire2.html#zfind


Z+SQL

  • also known as ZSQL, Z-SQL
  • [17],
  • apparently meant to give the Z39.50-1995 Version3 protocol an SQL-like syntax, and allows creation of more complex queries.


Not commonly seen?(verify) (yet?)


Others

You could implement the basic syntax that is approximately the lowest common denominator between systems like Google, Lucene, and such (AND, OR, doublequotes, brackets, minus as a NOT).


There are numerous minor expansions of this you could choose to support, such as varying ways of specifying NOT, field searching, and more.
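A sketch of a tokenizer for that lowest-common-denominator syntax, as a first step before building a query tree. The token kinds and the function name are my own; it treats only uppercase AND and OR as operators and a leading minus as NOT, which is one of several reasonable conventions.

```python
import re

_SIMPLE_TOKEN = re.compile(r'''
    (?P<neg>-)?                  # optional leading minus, meaning NOT
    (?: "(?P<phrase>[^"]*)"      # double-quoted phrase
      | (?P<paren>[()])          # opening or closing bracket
      | (?P<word>[^\s()"]+)      # bare word, possibly AND / OR
    )''', re.VERBOSE)


def tokenize_simple(query):
    """Return a list of (kind, value) tokens for a google-style query."""
    tokens = []
    for match in _SIMPLE_TOKEN.finditer(query):
        if match.group("neg"):
            tokens.append(("NOT", "-"))
        if match.group("phrase") is not None:
            tokens.append(("PHRASE", match.group("phrase")))
        elif match.group("paren"):
            tokens.append(("PAREN", match.group("paren")))
        else:
            word = match.group("word")
            kind = word if word in ("AND", "OR") else "WORD"
            tokens.append((kind, word))
    return tokens
```

From these tokens you can build whatever the target wants (CQL, PQF, a Lucene query), which is where field searching and alternative NOT spellings would be added.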




See also

Related standards

Z39.88: OpenURL

COinS

Describes how to embed OpenURL-style citation data in HTML.

It allows browser plugins to do things including:

  • link you to full-text via your own institution's OpenURL resolver (e.g. using OpenURL Referrer)
  • collect citations (e.g. using Zotero, although its COinS support is, as of ~2008, somewhat fragile)
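For reference, a sketch of producing such an embedded citation: an empty span with class Z3988 whose title attribute holds the URL-encoded OpenURL key/value pairs. The class name, ctx_ver value and the book format identifier are from the COinS / Z39.88-2004 conventions as I know them; the function name and the field choice in the usage note are only illustrative.

```python
from html import escape
from urllib.parse import urlencode


def coins_span(fields, fmt="info:ofi/fmt:kev:mtx:book"):
    """Build a COinS span from (key, value) pairs such as ('btitle', '...')."""
    pairs = [("ctx_ver", "Z39.88-2004"), ("rft_val_fmt", fmt)]
    # referent metadata keys get the rft. prefix
    pairs += [("rft." + key, value) for key, value in fields]
    # URL-encode the pairs first, then HTML-escape the result for the attribute
    title = escape(urlencode(pairs), quote=True)
    return '<span class="Z3988" title="%s"></span>' % title
```

For example, coins_span([('btitle', 'Fish & Frogs'), ('isbn', '0123456789')]) can be dropped into a page next to the visible citation; note the two layers of encoding, which are easy to get wrong by hand.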



Interesting searching, browsing or visualization

Relation browsing:

Faceted:

Metasearchers:


Semi-sorted notes

PyZ3950


Installation

You need the Python PLY package or you will get ImportError: No module named lex. Also, as of this writing, Ubuntu's python-ply package seems to be broken, but since the modules are pure Python, manually unpacking lex.py and yacc.py from the download (into the z3950 directory in your site-packages, for example) also works.

The 2.04 tarball has a bug in the install scripts that causes it to raise:

AttributeError: 'float' object has no attribute 'replace'

This seems to mean the version in vers.py should be a string, not a float.


Docs & notes


How to deal with slow or timing-out connects?
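One thing that may help, assuming PyZ3950 opens its connection with an ordinary socket and sets no timeout of its own (I have not verified this), is to set Python's process-wide default socket timeout around the connect. A sketch using only the standard library; the helper name is my own:

```python
import socket
from contextlib import contextmanager


@contextmanager
def default_socket_timeout(seconds):
    """Temporarily set the default timeout for newly created sockets."""
    previous = socket.getdefaulttimeout()
    socket.setdefaulttimeout(seconds)
    try:
        yield
    finally:
        # restore whatever was configured before, including None (no timeout)
        socket.setdefaulttimeout(previous)


# Intended use (not run here, needs PyZ3950 and network access):
#   with default_socket_timeout(10):
#       conn = zoom.Connection('z3950.loc.gov', 7090)
```

Sockets created inside the with block keep the timeout for their lifetime, so later reads on that connection can time out as well; whether that is desirable depends on how slow the target's searches are.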


Documentation example:

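from PyZ3950 import zoom  # note: PyZ3950 targets Python 2

# connect to the Library of Congress target and select database and record syntax
conn = zoom.Connection('z3950.loc.gov', 7090)
conn.databaseName          = 'VOYAGER'
conn.preferredRecordSyntax = 'USMARC'

try:
    # for example with CCL (there are other options)
    query = zoom.Query('CCL', 'ti="this and that"')
    results = conn.search(query)

    # each result is one record in the preferred record syntax
    for result in results:
        print(result)
finally:
    # always release the connection, even if the search fails
    conn.close()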