Online (library) search related

✎ This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Search protocols

ANSI/NISO Z39.50 (also ISO 23950)


Z39.50 is a search-and-present protocol, one that is still fairly commonly used by libraries and scholarly databases.

Specifically:

ANSI/NISO Z39.50-1988: version 1 (considered obsolete)
ANSI/NISO Z39.50-1992: version 2 (incompatible with version 1)
ANSI/NISO Z39.50-1995: version 2 and 3 (a compatible superset of Z39.50-1992)
Z39.50-2003 - a clarified version of -1995?

Apparently various servers use 2 with some extensions from version 3.

Related links:
- http://www.loc.gov/z3950/agency/
- ZING (Z39.50 International Next Generation initiative) (LOC's information on this has been merged into its SRU/SRW pages, though)
  - includes ZOOM (see e.g. [1]), an effort for more object-oriented Z39.50
- Mike Taylor's z3950.org, including resources related to ZOOM (and various bindings), ZING
- Some statistics
  - http://irspy.indexdata.com/stats.html
  - Some test hosts: http://www.loc.gov/z3950/agency/resources/testport.html

uses ASN.1 (Abstract Syntax Notation One) to describe record formats and such.

runs on port 210 by default, although many databases choose other ports.

A Z39.50 server's capabilities (services, record types, query operators) are not reported by the host itself; you generally have to discover them through experiments, then configure them. (Only a few percent of servers offer Explain-style support.)

Services

Basics include:
- search
- present
- scan
- sort (not always supported(!) )
Optionals:
- extendedServices (see )
- namedResultSets
- triggerResourceCtrl
- delSet
- negotiationModel
- duplicationDetection

Z39.50 Attribute sets - Bib-1

These numbers are used in various Z39.50 query types (Type-1, Type-101) queries, in PQF, and in relevant query translations.

See also:

http://www.loc.gov/z3950/agency/bib1.html

Of the six types of attributes, use attributes are probably most generally interesting. The others are only interesting if you're tweaking a client or query translator.

Bib-1 Use attributes (1)

'Use Attributes' are references to indices you can refer to in your searches. The most interesting are approximately:

1016 is 'Any'
- usually as an 'any common field', but may be creatively interpreted
- other all-ish fields may appear in addition to (or sometimes instead of) 1016, for example 1035, 1036, and others
4, Title
1003, Author, though may be 1004 (Author-name personal), 1 (Personal name), and/or others at a specific database
31, Date of publication (or 30?)
7, ISBN
8, ISSN
21, Subject heading

1035, 'Anywhere', which has assumed status of 'anywhere, including abstract and/or full text'

1036, Author-Title-Subject, and other variations on this with two of the three and/or others.

In practice, you may require per-target use attribute remapping to convert queries, to get consistent behaviour out of a larger set of targets in which some are somewhat unusual.

Z39.50 mentions the MARC fields that this would likely map to.
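
The per-target remapping mentioned above could be sketched as a lookup table with overrides. This is only an illustration: the target names and the particular overrides are made up, and the defaults are simply the use attribute numbers listed above.

```python
# Default Bib-1 use attributes, taken from the list above.
DEFAULT_USE = {
    "any": 1016,
    "title": 4,
    "author": 1003,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
}

# Hypothetical per-target overrides; these target names do not refer to real hosts.
TARGET_OVERRIDES = {
    "example-catalogue": {"author": 1004, "any": 1035},
    "example-fulltext": {"any": 1035},
}


def use_attribute(field, target=None):
    """Return the Bib-1 use attribute number for a logical field on a target."""
    overrides = TARGET_OVERRIDES.get(target, {})
    if field in overrides:
        return overrides[field]
    return DEFAULT_USE[field]
```

A query translator would call something like use_attribute("author", "example-catalogue") when building the query for that one target, and fall back to the defaults everywhere else.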

Relation attributes (2)

Values:
- 1, < less than
- 2, <= less than or equal
- 3, = equal
- 4, >= greater or equal
- 5, > greater than
- 6, <> not equal
- 100, phonetic
- 101, stem
- 102, relevance
- 103, always matches
- 104, custom, per-target

See also:

http://www.loc.gov/z3950/agency/defns/bib1.html#relation

Position attribute (3) Default is 3, rarely changed, rarely useful to change

1, first in field
2, first in subfield
3, any position in field

Structure attribute (4) The most interesting are 105 and 1 (and perhaps 108); the rest are usually too target-specific.

1, phrase (requires same order, and adjacency)
2, word
3, key
4, year (4-digit)
5, date (normalized, see ISO 8824)
6, word list (orderless, target-specific interpretation(verify))
100, date (un-normalized)
101, name (normalized)
102, name (un-normalized)
103, structure
104, urx
105, free-form-text
106, document-text
107, local number
108, string
109, numeric string

Truncation attribute (5) Truncation is done with #

1 right
2 left
3 left & right
100 none
101 process
102 regular-1
103 regular-2
104 CCL

Completeness attribute (6)

- 1 incomplete subfield
- 2 complete subfield
- 3 complete field

Z39.50 Attribute sets - Others


Z39.50 defines:

Bib-1 {Z39-50-attributeSet 1} (See ATR.1)
Exp-1 {Z39-50-attributeSet 2} (See ATR.2)
Ext-1 {Z39-50-attributeSet 3} (See ATR.3)
CCL-1 {Z39-50-attributeSet 4}
GILS {Z39-50-attributeSet 5}
STAS {Z39-50-attributeSet 6}

Bib-1 is (probably by far) the most common

There is a Danish Dan-1 set, which is often used in addition to Bib-1.

See also:

SRU and related

SRU (Search/Retrieval via URL) is a simple protocol intended as a successor to Z39.50.

returns XML

1.x uses CQL queries in URLs, 2.0 allows other query languages (but CQL still seems quite common)

Three versions: 1.1, 1.2, 2.0
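
As a rough sketch of what 'CQL queries in URLs' looks like for SRU 1.x: the request is an HTTP GET with a handful of parameters. The parameter names below (operation, version, query, startRecord, maximumRecords, recordSchema) are the SRU 1.x ones as far as I know; the base URL in the usage note is a placeholder, not a real server.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, version="1.2",
                   start=1, maximum=10, schema=None):
    """Build an SRU 1.x searchRetrieve URL for a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,          # urlencode takes care of the escaping
        "startRecord": start,
        "maximumRecords": maximum,
    }
    if schema is not None:
        # e.g. a short name the server advertises for Dublin Core or MARCXML
        params["recordSchema"] = schema
    return base_url + "?" + urlencode(params)
```

For example, sru_search_url("http://sru.example.org/db", 'dc.title any "fish frog"') gives a URL you could fetch with any HTTP client; the response would be XML.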

SRW (Search/Retrieval as a Web service) is a deprecated name for what should now be called 'SRU over SOAP'.

[3], [4]

SRW/U groups both SRU and SRW(verify)

MXG (Metasearch XML Gateway) is an XML gateway meant to make metasearch and the required interfacing with content providers easier.

It is based on and (strongly) prefers interfacing to SRU, though other HTTP-based resources that take queries and respond with XML can also be supported.

[5], [6]

SRU is not tied to a particular XML schema (or even to XML(verify)) and should report the data format (DTD identifier). A content provider may choose an XML schema that suits its contents best, and mix them; a client is expected to transform the contents it receives.

XML-coded Dublin Core [7] and MARCXML are not uncommon.

Sources may choose to transform whatever format they use internally to more specific-purpose schemas, such as Metadata Object Description Schema (MODS, [8]) for bibliographic ends, or specific schemas such as Learning Object Metadata (LOM, [9]) for e-learning ends.

OpenSearch

A9's OpenSearch is comparable in function to SRU and MXG

I've noticed some dead links, but it doesn't seem to be dead.

Proxied

A third party implements the search and provides it, usually, over Z39.50, SRU, or SRW.

For example OpenTranslators

http://www.librarytechnology.org/ltg-displayarticle.pl?RC=12980

Query formats

Z39.50 query formats (Type 1, 101, and others)


Z39.50 defines a few query types, of which type 1 and 101 are the most commonly supported.

Both are postfix grammars, but note they are a structure (transported via ASN.1?(verify)) rather than a direct string representation(verify).

From a few minutes of googling it seems that:

Type 1: called RPN (Reverse Polish Notation)
- must be supported by conforming Z39.50 target

Type 101: Called ERPN (Extended RPN), a fairly basic extension of Type 1, with additions like the prox operator
- Target may claim support of Type-101, of its Prox operator, and of its Restriction operand (independently?(verify)) (can be claimed using Explain, or just by informal description)

And apparently:

Type 0: a-priori agreement (verify)
Type 2: ISO8777 (verify)
Type 100: Common Command Language (CCL)? (verify)
Type 102: Ranked List query - which consists of a list of queries and a weight for each
Type 104: (recent) (verify)

See also:

http://www.loc.gov/z3950/agency/markup/09.html

http://www.loc.gov/z3950/agency/attrarch/arch.html

CCL (Common Command Language)


ISO 8777 (but not all are entirely up to that)
http://www.indexdata.dk/yaz/doc/tools.tkl#CCL
Meant to be a somewhat user-end query language (but probably still too complex for that)

and, or, not

CQL - Common Query Language (up to 1.1) / Contextual Query Language (since 1.2)


Developed and maintained by LoC, apparently with a focus on Z39.50 use. As a query syntax it is the only choice in SRU 1.x and SRW, an option in Z39.50, and it is sometimes seen elsewhere.

Basic 'just give me some words' queries are fairly simple. More structured queries are possible (support for them is not required, see the levels, but most servers have it), though the syntax and index/field names put this beyond what 99% of people would use directly.

'Contextual' seems to refer to leaving exact query interpretation somewhat up to the server-side implementation, rather than being defined strictly into the query syntax. Which allows for flexible searches, though also means each server is as clever or as dumb as its implementation happens to be, with few ways to work around that.

Context sets

CQL has the concept of context sets, mostly referring to sets of attributes referring (usually) to record properties to search in, but also to allow reference to an earlier search set to narrow down.

Already-defined context sets include a (quite simple) cql context set, a srw set, a Bath set (to be deprecated in favour of bib), and more specific ones (some serving niches), like zthes, one for LOM, ccg (collectible card games), music, and any you wish to create yourself. The registered ones are listed at [10].

Systems are allowed to deal only with simpler variations of CQL queries, without supporting some of the more complex stuff. No system is required to support everything, but it should still diagnose what it does not support (see SRU's list).

The three levels:

Level 0 must support simple term-only queries (words, doublequoted strings, backslash-escaped doublequotes in values), and report all that is not actually supported

Level 1 adds the ability to parse both of the following, and support at least one of them at a time in a query (not necessarily both):
- search clauses with 'index relation searchTerm' structure (e.g. 'dc.title any food')
- boolean combinations

Level 2 requires that all of CQL be parsed, and for it to report what is not actually supported

(level 2 requires a proper parser while in 0 and 1 you can get away with simpler hackish string handling)

CQL syntax

As to CQL syntax, see [11] for a definition and e.g. [12] for a decent introduction.

Some notes and examples of my own:

The syntax of CQL is case insensitive, though the values it carries (term values, modifier values, prefix map values) should be carried through unchanged.

A query is one clause, or multiple clauses booleaned together
- booleans being and, or, not (meaning and not, not a unary operator), and prox (which is like an and with the extra requirement that it is close)
- booleans are evaluated left to right (no precedence), parentheses can be used to override left-to-right evaluation
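
To make the left-to-right rule concrete, here is a small sketch that folds an already-tokenized, parenthesis-free sequence into nested (boolean, left, right) tuples. It is not a CQL parser; the token list format and the function name are my own for illustration.

```python
BOOLEANS = {"and", "or", "not", "prox"}


def group_left_to_right(tokens):
    """Fold 'term boolean term ...' into nested (boolean, left, right) tuples.

    There is no precedence: each boolean simply takes everything parsed
    so far as its left operand.
    """
    if len(tokens) % 2 == 0:
        raise ValueError("expected: term (boolean term)*")
    tree = tokens[0]
    for i in range(1, len(tokens), 2):
        boolean = tokens[i].lower()  # CQL keywords are case insensitive
        if boolean not in BOOLEANS:
            raise ValueError("not a boolean: %r" % tokens[i])
        tree = (boolean, tree, tokens[i + 1])
    return tree
```

So a and b or c groups as (a and b) or c, and you would need explicit parentheses to get a and (b or c).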

A single clause consists of
- an index (optional, defaults to cql.serverChoice. For more details on index specification, see [13])
- a relation (optional)
- a term (required). If it contains a whitespace (e.g. space) or one of <>=/() it must be double-quoted, may always be (only not necessary when you use single words as terms). Backslash-escaped doublequotes may appear in values.

The index and relation are optional in that if you write only a term, it is interpreted as cql.serverChoice = term (...since 1.2 anyway, the default relation in 1.1 was scr).
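
The quoting rule for terms is easy to get wrong when generating CQL from user input, so here is a minimal sketch of it. It assumes the input is raw text (doublequotes not yet escaped), leaves backslashes alone, and also quotes bare reserved words, which is my own precaution rather than something stated above. The function names are hypothetical.

```python
import re

# Whitespace, or one of < > = / ( ), or a doublequote, forces quoting.
_NEEDS_QUOTES = re.compile(r'[\s<>=/()"]')
_RESERVED = {"and", "or", "not", "prox", "sortby"}


def cql_term(value):
    """Return value as a CQL term, double-quoted when necessary."""
    if (value == "" or _NEEDS_QUOTES.search(value)
            or value.lower() in _RESERVED):
        return '"' + value.replace('"', '\\"') + '"'
    return value


def cql_clause(term, index=None, relation=None):
    """Build 'term' or 'index relation term'; nothing in between is valid."""
    if (index is None) != (relation is None):
        raise ValueError("give both index and relation, or neither")
    if index is None:
        return cql_term(term)
    return "%s %s %s" % (index, relation, cql_term(term))
```

For example, cql_clause("lord of the rings", "dc.title", "adj") produces dc.title adj "lord of the rings", while cql_clause("fish") is just fish.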

the specification of index, relations, booleans, and sorting can be modified by adding a / (optional whitespace around it) and modifier details.

sort specification, as last element (e.g. fish sortBy dc.title, and more specific through modification: fish sortBy dc.date/sort.ascending)

Note that a single clause can imply boolean-like interpretation, e.g. through cql.any and cql.all

You sometimes see examples like title=((dinosaur and bird) or dinobird) or dc.title=(kern* or ritchie). This seems to be invalid CQL, and probably comes from confusion since it is valid CCL.

As to the optionals in the clause: you can either use just the term or the full thing, for example fish or cql.serverChoice = fish, but not, say, = fish or cql.serverChoice fish

Index references must have a base name (e.g. title), and may have a prefix referring to a context set. If the prefix is omitted, it is determined/guessed by the server, so things like title any fish and dc.title any fish are correct - and in various applications will behave identically.

You can also map your own prefixes, but this is rarely seen and probably about as rarely supported. One example (there are more styles):

dc = "info:srw/context-sets/1/dc-v1.1" dc.title any fish

The any (short for cql.any) in the examples above is the relation, which specifies the way you wish for the search to use the term. This includes equality, range tests, non-equality and such. You probably commonly use one of:

= - server choice, which may be clever or not so much. Will probably frequently choose == or adj depending on field and/or value
exact - exact match
== - exact match
adj - phrase searches, e.g. dc.title adj "lord of the rings"
any - any of the parts of the term value (sort of a shorthand for ORing together all the words in the term)
all - all of the parts of the term value (sort of a shorthand for ANDing together all the words in the term)
<> - not equal (similar to NOT)

For example:

dc.title any fish
dc.title = fish
cql.serverChoice adj fish
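
The 'sort of a shorthand' reading of any and all can be illustrated by spelling the expansion out. This is a sketch only: a real server may treat any/all differently from the expanded form (the = relation is server choice, after all), and the function name is made up.

```python
def expand_relation(index, relation, term):
    """Rewrite 'index any/all "w1 w2"' as an explicit or/and of single words."""
    joiners = {"any": " or ", "all": " and "}
    if relation not in joiners:
        raise ValueError("only 'any' and 'all' are expanded here")
    clauses = ["%s = %s" % (index, word) for word in term.split()]
    if not clauses:
        raise ValueError("empty term")
    if len(clauses) == 1:
        return clauses[0]
    # Parenthesize so the result can be combined with other clauses safely.
    return "(" + joiners[relation].join(clauses) + ")"
```

This can be handy as a fallback for targets that only support level-1 style clauses without any and all.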

You can use relation modifiers to ask for more specific tests, or just hints, in how to evaluate the relation test. For example:

dc.title any/relevant fish
dc.title any/ relevant /cql.string fish
dc.title any/rel.algorithm=cori fish

Generally defined modifiers (with no specific implementation) include:

stem - use stemming before matching
relevant - conceptually fuzzy. For example subject any/relevant "fish frog" might match various specific fishes, frogs, amphibians, and whatnot.
fuzzy - unspecified fuzzy matching, e.g. to match misspelled words, in whatever way it chooses.
phonetic - match homophones

Other query examples:

title =/stem "completed dinosaurs"
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
(caudal or dorsal) prox vertebra
ribs prox/>/0/paragraph chevrons
dc.title any fish prox/unit=word/distance>3 dc.title any squirrel

? and * are wildcards for one and zero or more characters, respectively. The syntax allows them at any point (so for arbitrary, regexp-ish matching), although many real-world indexing systems that CQL is used on may not support that.
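
If you have to evaluate such wildcards yourself (say, filtering results client-side because the target ignores them), a translation to a regular expression is a small job. The sketch below assumes a backslash makes the next character literal, which is how I understand CQL's escaping; the function name is hypothetical.

```python
import re


def cql_wildcards_to_regex(term):
    """Compile a CQL term with ? and * wildcards into a Python regex."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Escaped character: match it literally, wildcard or not.
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")   # zero or more characters
        elif ch == "?":
            parts.append(".")    # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))
```

Use fullmatch() on the result to test a whole word, e.g. cql_wildcards_to_regex("kern*").fullmatch("kernighan").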

XCQL

The parsed form of CQL, serialized into XML.

Which is verbose and more explicit (not requiring parentheses or precedence rules), in fact to the point where only debuggers would really want to read it. It seems mostly used in messaging to show how a given query was parsed.

XCQL uses a representation centralizing around elements it calls searchClause and triple.

A triple is a structure that builds up a tree, and consists of:

boolean
leftOperand: a (terminal) searchclause, or another triple
rightOperand: a (terminal) searchclause, or another triple

A searchClause may contain an index, relation and term

Both searchClause and triple may contain an array of prefixes, which contains/overrides the mapping at that level and deeper.
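
To give an idea of the shape, here is a sketch that builds such a tree with Python's standard ElementTree. The element names triple, boolean, leftOperand, rightOperand, searchClause, index, relation and term come from the description above; wrapping the boolean and relation in a nested value element is how I remember the format, and the XML namespace and prefixes are left out, so treat the details as assumptions.

```python
import xml.etree.ElementTree as ET


def search_clause(term, index=None, relation=None):
    """Build a terminal <searchClause> element."""
    clause = ET.Element("searchClause")
    if index is not None:
        ET.SubElement(clause, "index").text = index
    if relation is not None:
        rel = ET.SubElement(clause, "relation")
        ET.SubElement(rel, "value").text = relation
    ET.SubElement(clause, "term").text = term
    return clause


def triple(boolean, left, right):
    """Combine two operands (searchClause or triple elements) with a boolean."""
    node = ET.Element("triple")
    bool_el = ET.SubElement(node, "boolean")
    ET.SubElement(bool_el, "value").text = boolean
    ET.SubElement(node, "leftOperand").append(left)
    ET.SubElement(node, "rightOperand").append(right)
    return node


# dc.title any fish and squirrel
tree = triple("and",
              search_clause("fish", "dc.title", "any"),
              search_clause("squirrel"))
xcql = ET.tostring(tree, encoding="unicode")
```

Because triples nest in the operands, grouping is explicit in the tree and no parentheses are needed.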

See also:

http://srw.cheshire3.org/cql/xcql.html

PQF (Prefix Query Format), also sometimes Prefix Query Notation (PQN)


a prefix-notation tree format sort of idea
Similar to Type-1/Type-101 RPN in structure, except that it is prefix. Used in some Z39.50 contexts
http://www.indexdata.dk/yaz/doc/tools.tkl#PQF

Example:

@attrset 1.2.840.10003.3.1 @or  @attr 1=1016 "foo"  @attr 1=1016 "snake"

In which:

1.2.840.10003.3.1 refers to the bib-1 attribute set
1=1016 refers to (bib-1) use attribute (1) and the 'any' field (1016).
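
Generating PQF like the example above from code might look like the following sketch. The helper names are made up, and backslash-escaping inside the quoted term is an assumption about what the receiving PQF parser accepts.

```python
BIB1_OID = "1.2.840.10003.3.1"  # the bib-1 attribute set, as in the example


def pqf_clause(term, attrs):
    """One operand: attrs maps attribute type (1=use, 2=relation, ...) to value."""
    parts = ["@attr %d=%d" % (atype, value)
             for atype, value in sorted(attrs.items())]
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    parts.append('"%s"' % escaped)
    return " ".join(parts)


def pqf_combine(operator, clauses):
    """Prefix operators are binary, so n operands need n-1 operators."""
    if not clauses:
        raise ValueError("need at least one clause")
    query = clauses[-1]
    for clause in reversed(clauses[:-1]):
        query = "@%s %s %s" % (operator, clause, query)
    return query


def pqf_query(operator, clauses, attrset=BIB1_OID):
    """Full query string with the attribute set declared up front."""
    return "@attrset %s %s" % (attrset, pqf_combine(operator, clauses))
```

Calling pqf_query("or", [pqf_clause("foo", {1: 1016}), pqf_clause("snake", {1: 1016})]) reproduces the example query (with single spaces).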

Cheshire II (C2)


http://cheshire.berkeley.edu/cheshire2.html#zfind

Z+SQL


also known as ZSQL, Z-SQL
[14],
apparently meant to give the Z39.50-1995 Version3 protocol an SQL-like syntax, and allows creation of more complex queries.

Not commonly seen?(verify) (yet?)

Others

You could implement the basic syntax that is approximately the lowest common denominator between systems like google, lucene, and such (AND, OR, doublequotes, brackets, minus as a NOT)

There are numerous minor expansions of this you could choose to support, such as varying ways of specifying NOT, field searching, and more.
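
A tokenizer for that lowest-common-denominator syntax fits in a few lines. This sketch only tokenizes (no tree building), treats uppercase AND, OR, NOT as operators and a leading minus as NOT; those choices and the token names are mine, not taken from any particular engine.

```python
import re

# A quoted phrase, a single bracket, or a run of other non-space characters.
_TOKEN = re.compile(r'"[^"]*"|[()]|[^\s()"]+')


def tokenize(query):
    """Split a simple search string into (kind, value) tokens."""
    tokens = []
    for raw in _TOKEN.findall(query):
        if raw in ("(", ")"):
            tokens.append(("PAREN", raw))
        elif raw.startswith('"'):
            tokens.append(("PHRASE", raw[1:-1]))
        elif raw in ("AND", "OR", "NOT"):
            tokens.append(("OP", raw))
        elif raw.startswith("-"):
            # Leading minus is a NOT applied to what follows.
            tokens.append(("OP", "NOT"))
            if len(raw) > 1:
                tokens.append(("TERM", raw[1:]))
        else:
            tokens.append(("TERM", raw))
    return tokens
```

From such tokens you could emit CQL, PQF, or whatever a given backend wants.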

See for example

lucene[15],
swish-e[16],

...and many others.

OAI and related


OAI is the Open Archives Initiative.

It is effectively a way to share repository metadata content, to be combined, indexed and such elsewhere.

In part, it is a more elegant and less quirky solution than federated search, although the fact that it is based on Dublin Core effectively limits the scope of application.

PMH

OAI often implies a setup where OAI-PMH is used to provide / fetch the metadata. (PMH: Protocol for Metadata Harvesting)

XML-based, transferred over HTTP
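
In practice a harvest is a series of HTTP GETs with a verb parameter. Below is a sketch of building a ListRecords request; verb, metadataPrefix, set, from and resumptionToken are OAI-PMH argument names as far as I know, while the function name and the base URL in the usage note are placeholders.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                         from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # When continuing a partial list, the token replaces the other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
        if from_date is not None:
            params["from"] = from_date   # e.g. "2020-01-01"
    return base_url + "?" + urlencode(params)
```

A harvester would fetch oai_list_records_url("http://repo.example.org/oai"), parse the XML, and keep requesting with the returned resumption token until none is given.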

See also:

ORE


See #OAI-ORE

Related standards

Z39.88: OpenURL

Unsorted

http://www.openarchives.org/

http://foresite.cheshire3.org/wiki/

Interesting searching, browsing or visualization

Relation browsing:

Visual Thesaurus
KartOO
AquaBrowser, e.g. at the Amsterdam public library
LivePlasma

Faceted:

Metasearchers:

Search engines, catalogues & ILS, repositories, supporting libraries


Some
- do just metadata search (like a classical card catalog)
  - the first were initially called online public access catalogs (OPAC)
  - later there was a "Library 2.0" thing
- are union catalogs ('what do related libraries have?')
  - without needing to do so via federation (faster)
- do search federation (active searching) of
  - library holdings
  - remote database subscriptions
  - some also search ebooks
- are 'integrated library systems' (ILS), which often means they also do loaning and such

BiblioteQ [17] - an ILS

Blacklight OPAC[18] (open source) - catalogue, built around Solr

Cheshire 2[19]
Cheshire 3[20]

DRIVER, DRIVER II [21]

DSpace [22] - repository

Emilda[23] - an ILS

EPrints [24] - repository

Evergreen[25] (open source) - an ILS

Ex Libris[26] offers (among others)
- SFX[27]: An OpenURL resolver
- Metalib[28]: a federated search aggregator
- Aleph [29]
- Voyager [30] - an ILS
- Primo [31]
  - Primo Central [32]

Ferret[33] (Ruby)

Fedora Commons[34] (unrelated to the linux distro), Java-based

Fedora Learning Objects Repository Interface (Flori) [35]

Hyper Estraier[36] - full-text search (written in C)

Hyrax [37] - repository

JAFER[38] - base for Z39.50 clients and servers (written in Java)

Java Z3950 Tookit[39] (Z39.50, Java)

Koha[40] (open source) - an ILS

LibLime[41] supports/delivers Koha, Evergreen

LibraryFind[42] (open source) - federated search(?)

mnoGoSearch[43] (GPL)

MasterKey[44] (open source) - search aggregating system. Can search Z39.50, SRU, PazPar2, a local Zebra index, and via HTML scraping. Harvests and presents it.

Meresco[45] is mostly an SRU(/SRW) interface around an OAI-PMH fetcher (it also has a web crawler, OAI export of its data, RSS import/export, etc., but these seem of secondary importance)

Metaproxy [46] - Frontend/switchboard that makes it easier to search Z39.50, SRU, SRW, and Solr (via webservice). Does result merging, filtering, caching. Exposed as Z39.50, SRU, and SRW.

Net::Z3950::ZOOM[47] (Perl)

OAI toolkit[48] (Perl)

OpenBiblio[49]

OCLC-Pica, an apparently now deprecated OPAC

OpenER[50] (EduCommons RSS → OAI)

OpenSiteSearch[51]

Pazpar2[52], a Z39.50-based federated search server (YAZ-based, but no SRU/SRW)

PhpMyBibli (PMB)[53]

PHP-MARC[54]

PhpMyLibrary[55]

Proai[56] (Java-based)

pyasn1[57] (see also ASN.1)

PyMARC[58]

PyZ3950[59] - Z39.50

pyoai[60]

QtZ39.50[61] (uses PyZ3950)

swish-e[62]

Solr[63] (Java, with various interfaces) (see also Solr)

Sphinx[64] - SQL database backed (C++) (See also sphinx notes)

Summon[65]

VB ZOOM[66] (Z39.50)

VuFind[67]

WebFeat

Xapian[68], a C++ search engine. (Has wrappers, like xappy[69])

YAZ[70]: C(/C++) toolkit for Z39.50, and also SRU and SRW

YAZ proxy[71]

Zebra[72], an indexer/search/retrieve server (OAI, with a Z39.50 interface)

Zedkit

ZMARCO[73] (Z39.50, MARC, OAI)

Zoom .NET[74] (Z39.50)

See also:

Unsorted

E-learning:

Sites like Moodle, Teletop, StudieWeb, Sharepoint
May use some specific formats (e.g. the Dutch CZP)

Interoperability policies often include

use of open standards
harvesting+indexing, not federated search
no supplier-specific / software-specific (proprietary) features

Notes on (non-)centralization of search

Union systems and data warehousing


To do better you would want a union system, because you would get control over various aspects of behaviour and uniformity you may want to guarantee.

There have been movements to warehouse data for groups, with licensing pretty much as it was when the same data was federated, meaning it mediates licenses to members of a group (sometimes licensing to the whole group).

From a search process view, this means you have more data sources in raw form (sometimes, it seems, in indexed form for a specific search system; I can imagine the wish for businesses to tie customers into their products). It does not guarantee you will search it significantly better than the data providers will, but chances are merging and sorting is better if only because they're handled automatically.
