Online (library) search related

This article/section is a stub — probably a pile of half-sorted notes and is probably a first version, is not well-checked, so may have incorrect bits. (Feel free to ignore, or tell me)


Search protocols

ANSI/NISO Z39.50 (also ISO 23950)


Z39.50 is a search-and-present protocol, one that is still fairly commonly used by libraries and scholarly databases.


Specifically:

  • ANSI/NISO Z39.50-1988: version 1 (considered obsolete)
  • ANSI/NISO Z39.50-1992: version 2 (incompatible with version 1)
  • ANSI/NISO Z39.50-1995: version 2 and 3 (a compatible superset of Z39.50-1992)
  • Z39.50-2003 - a clarified version of -1995?

Apparently various servers use version 2 with some extensions from version 3.


  • runs on port 210 by default (the registered port), although many databases choose other ports.


A Z39.50 server's capabilities (services, record types, query operators) are generally not reported by the host itself; you usually have to discover them through experiments, then configure them. (Only a few percent of servers offer Explain-style support.)


Services

  • Basics include:
    • search
    • present
    • scan
    • sort (not always supported(!) )
  • Optionals:
    • extendedServices
    • namedResultSets
    • triggerResourceCtrl
    • delSet
    • negotiationModel
    • duplicationDetection

See also [2]


Z39.50 Attribute sets - Bib-1

These numbers are used in various Z39.50 query types (Type-1, Type-101) queries, in PQF, and in relevant query translations.



Of the six types of attributes, use attributes are probably the most generally interesting. The others are only interesting if you're tweaking a client or query translator.


Bib-1 Use attributes (1)

'Use Attributes' are references to indices you can refer to in your searches. The most interesting are approximately:

  • 1016 is 'Any'
    • usually as an 'any common field', but may be creatively interpreted
    • other all-ish fields may appear in addition to (or sometimes instead of) 1016, for example 1035, 1036, and others
  • 4, Title
  • 1003, Author, though may be 1004 (Author-name personal), 1 (Personal name), and/or others at a specific database
  • 31, Date of publication (or 30?)
  • 7, ISBN
  • 8, ISSN
  • 21, Subject heading
  • 1035, 'Anywhere', which has assumed status of 'anywhere, including abstract and/or full text'
  • 1036, Author-Title-Subject, and other variations on this with two of the three and/or others.

In practice, you may require per-target use attribute remapping to convert queries, to get consistent behaviour out of a larger set of targets in which some are somewhat unusual.
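
The per-target remapping mentioned above could be sketched roughly as follows. This is a minimal sketch, not how any particular client does it: the default numbers are the Bib-1 use attributes listed above, while the field names, the target names, and the override tables are made up for illustration.

```python
# Canonical field names mapped to the Bib-1 use attributes listed above.
DEFAULT_USE = {
    "any": 1016,
    "title": 4,
    "author": 1003,
    "date": 31,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
}

# Hypothetical per-target overrides; these target names do not refer to real servers.
TARGET_OVERRIDES = {
    "example-target-a": {"author": 1004, "any": 1035},
    "example-target-b": {"author": 1, "date": 30},
}


def use_attribute(field, target=None):
    """Return the Bib-1 use attribute number to send for a field at a given target."""
    overrides = TARGET_OVERRIDES.get(target, {})
    if field in overrides:
        return overrides[field]
    return DEFAULT_USE[field]
```

A query translator would call something like use_attribute("author", "example-target-a") while building each target's query, so that the same user-level query behaves consistently across targets.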


The Z39.50 documentation mentions the MARC fields that each use attribute would likely map to.



Relation attributes (2)

  • Values:
    • 1, < less than
    • 2, <= less than or equal
    • 3, = equal
    • 4, >= greater or equal
    • 5, > greater than
    • 6, <> not equal
    • 100, phonetic
    • 101, stem
    • 102, relevance
    • 103, always matches
    • 104, custom, per-target



Position attribute (3)

Default is 3; rarely changed, rarely useful to change.

  • 1, first in field
  • 2, first in subfield
  • 3, any position in field


Structure attribute (4)

The most interesting are 105 and 1 (and perhaps 108); the rest are usually too target-specific.

  • 1, phrase (requires same order, and adjacency)
  • 2, word
  • 3, key
  • 4, year (4-digit)
  • 5, date (normalized, see ISO 8824)
  • 6, word list (orderless, target-specific interpretation(verify))
  • 100, date (un-normalized)
  • 101, name (normalized)
  • 102, name (un-normalized)
  • 103, structure
  • 104, urx
  • 105, free-form-text
  • 106, document-text
  • 107, local number
  • 108, string
  • 109, numeric string



Truncation attribute (5)

Truncation is done with #

  • 1 right
  • 2 left
  • 3 left & right
  • 100 none
  • 101 process
  • 102 regular-1
  • 103 regular-2
  • 104 CCL



Completeness attribute (6)

  • 1 incomplete subfield
  • 2 complete subfield
  • 3 complete field

Z39.50 Attribute sets - Others


Z39.50 defines:

  • Bib-1 {Z39-50-attributeSet 1} (See ATR.1)
  • Exp-1 {Z39-50-attributeSet 2} (See ATR.2)
  • Ext-1 {Z39-50-attributeSet 3} (See ATR.3)
  • CCL-1 {Z39-50-attributeSet 4}
  • GILS {Z39-50-attributeSet 5}
  • STAS {Z39-50-attributeSet 6}

Bib-1 is (probably by far) the most common


There is a Danish Dan-1 set, which is usually used in addition to Bib-1.




SRU and related

  • SRU (Search/Retrieval via URL) is a simple protocol intended as a successor to Z39.50.
    • returns XML
    • 1.x uses CQL queries in URLs; 2.0 allows other query languages (but CQL still seems quite common)
    • three versions: 1.1, 1.2, 2.0


  • SRW (Search/Retrieval as a Web service) is a deprecated name for what should now be called 'SRU over SOAP'.
[3], [4]


  • SRW/U groups both SRU and SRW(verify)


  • MXG (Metasearch XML Gateway) is an XML gateway meant to make metasearch and the required interfacing with content providers easier.
It is based on and (strongly) prefers interfacing to SRU, though other HTTP-based resources that take queries and respond with XML can also be supported.
[5], [6]


SRU is not tied to a particular XML schema (or even to XML(verify)) and should report the data format (DTD identifier). A content provider may choose an XML schema that suits its contents best, and mix them; a client is expected to transform the contents it receives.

XML-coded Dublin Core [7] and MARCXML are not uncommon.

Sources may choose to transform whatever format they use internally to more specific-purpose schemas, such as Metadata Object Description Schema (MODS, [8]) for bibliographic ends, or specific schemas such as Learning Object Metadata (LOM, [9]) for e-learning ends.
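
To make the 'queries in URLs' part concrete, here is a minimal sketch of building an SRU 1.x searchRetrieve request URL. The host sru.example.org and the schema name marcxml are placeholders (a real server documents its own base URL and schema names); the parameter names are the ones I understand SRU 1.1/1.2 to use, so check them against the target's Explain record.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, version="1.2", start=1, maximum=10, schema=None):
    """Build an SRU searchRetrieve URL carrying a CQL query."""
    params = {
        "operation": "searchRetrieve",
        "version": version,
        "query": cql_query,          # CQL; urlencode takes care of the escaping
        "startRecord": start,        # 1-based index of the first record wanted
        "maximumRecords": maximum,
    }
    if schema is not None:
        # Ask for a specific record schema, if the server offers a choice.
        params["recordSchema"] = schema
    return base_url + "?" + urlencode(params)


# Hypothetical base URL, for illustration only.
url = sru_search_url("http://sru.example.org/catalogue",
                     'dc.title any "fish frog"', schema="marcxml")
print(url)
```

The response is XML, with the records wrapped in whatever schema the server chose or was asked for, so a client still has to transform what it receives.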


OpenSearch

A9's OpenSearch is comparable in function to SRU and MXG

I've noticed some dead links, but it doesn't seem to be dead.

Proxied

A third party implements the search and provides it, usually, over Z39.50, SRU, or SRW.

For example OpenTranslators

http://www.librarytechnology.org/ltg-displayarticle.pl?RC=12980


Query formats

Z39.50 query formats (Type 1, 101, and others)


Z39.50 defines a few query types, of which type 1 and 101 are the most commonly supported.

Both are postfix grammars, but note they are a structure (transported via ASN.1?(verify)) rather than a direct string representation(verify).


From a few minutes of googling it seems that:

  • Type 1: called RPN (Reverse Polish Notation)
    • must be supported by conforming Z39.50 target
  • Type 101: Called ERPN (Extended RPN), a fairly basic extension of Type 1, with additions like the prox operator
    • Target may claim support of Type-101, of its Prox operator, and of its Restriction operand (independently?(verify)) (can be claimed using Explain, or just by informal description)


And apparently:

  • Type 0: a-priori agreement (verify)
  • Type 2: ISO8777 (verify)
  • Type 100: Common Command Language (CCL)? (verify)
  • Type 102: Ranked List query - which consists of a list of queries and a weight for each
  • Type 104: (recent) (verify)



CCL (Common Command Language)

  • and, or, not


CQL - Common Query Language (up to 1.1) / Contextual Query Language (since 1.2)


Developed and maintained by LoC (the Library of Congress), apparently with a focus on Z39.50 use. As a query syntax it is the only choice in SRU 1.x and SRW, an option in Z39.50, and it is sometimes seen elsewhere.


Basic 'just give me some words' queries are fairly simple. More structured queries are possible (support for them is not required, see the levels, but most servers do support them), though the syntax and index/field names put this beyond 99% of people to use directly.


'Contextual' seems to refer to leaving exact query interpretation somewhat up to the server-side implementation, rather than defining it strictly in the query syntax. This allows for flexible searches, though it also means each server is as clever or as dumb as its implementation happens to be, with few ways to work around that.






Context sets

CQL has the concept of context sets: mostly sets of attributes that (usually) refer to record properties to search in, but that can also allow reference to an earlier result set in order to narrow it down.

Already-defined context sets include a (quite simple) cql context set, a srw set, a Bath set (to be deprecated in favour of bib), and more specific ones (some serving niches), like zthes, one for LOM, ccg (collectible card games), music, and any you wish to create yourself. The registered ones are listed at [10].


Systems are allowed to deal only with simpler variations of CQL queries, without supporting some of the more complex stuff. No system is required to support everything, but it should still diagnose what it doesn't support (see SRU's list).

The three levels:

  • Level 0 must support simple term-only queries
(words, doublequoted strings, backslash-escaped doublequotes in values), and report all that is not actually supported
  • Level 1 adds the ability to parse both of the following, and support at least one of them at a time in a query (not necessarily both):
    • search clauses with 'index relation searchTerm' structure (e.g. 'dc.title any food')
    • boolean combinations
  • Level 2 requires that all of CQL be parsed, and that the server report what is not actually supported
(level 2 requires a proper parser, while in 0 and 1 you can get away with simpler hackish string handling)
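
As an illustration of the 'hackish string handling' that suffices for level 0, here is a minimal sketch that splits a term-only query into words and doublequoted strings, honouring backslash escapes. The function name is made up, and it does not do the 'report what is not supported' part; it only extracts terms.

```python
import re

# Either a double-quoted string (which may contain backslash-escaped characters),
# or a run of non-whitespace characters.
_TERM = re.compile(r'"((?:[^"\\]|\\.)*)"|(\S+)')


def level0_terms(query):
    """Split a level-0 (term-only) CQL query into a list of term values."""
    terms = []
    for quoted, word in _TERM.findall(query):
        if quoted or not word:
            # Quoted term (possibly empty): drop the escaping backslashes.
            terms.append(re.sub(r'\\(.)', r'\1', quoted))
        else:
            terms.append(word)
    return terms


print(level0_terms('fish "lord of the rings" "say \\"hi\\""'))
```

Anything beyond this (indexes, relations, booleans) needs at least level 1 handling, and a real parser by level 2.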


CQL syntax

As to CQL syntax, see [11] for a definition and e.g. [12] for a decent introduction.

Some notes and examples of my own:

  • The syntax of CQL is case insensitive, though the values it carries (term values, modifier values, prefix map values) should be carried through unchanged.
  • A query is one clause, or multiple clauses booleaned together
    • booleans being and, or, not (meaning 'and not'; it is not a unary operator), and prox (which is like an and with the extra requirement that the operands are close together)
    • booleans are evaluated left to right (no precedence); parentheses can be used to override left-to-right evaluation
  • A single clause consists of
    • an index (optional, defaults to cql.serverChoice. For more details on index specification, see [13])
    • a relation (optional)
    • a term (required). If it contains whitespace (e.g. a space) or one of <>=/() it must be double-quoted; it may always be double-quoted (quoting is only optional for single-word terms). Backslash-escaped doublequotes may appear in values.
the index and relation are optional in that if you write only a term, it is interpreted as cql.serverChoice = term (...since 1.2 anyway; the default relation in 1.1 was scr)
  • the specification of index, relations, booleans, and sorting can be modified by adding a / (optional whitespace around it) and modifier details.
  • sort specification, as last element (e.g. fish sortBy dc.title, and more specific through modification: fish sortBy dc.date/sort.ascending)
  • Note that a single clause can imply boolean-like interpretation, e.g. through cql.any and cql.all
  • You sometimes see examples like title=((dinosaur and bird) or dinobird) or dc.title=(kern* or ritchie). This seems to be invalid CQL, and probably comes from confusion since it is valid CCL.
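
The left-to-right rule for booleans is easy to get wrong if you expect and to bind tighter than or. A minimal sketch that makes the implied grouping explicit for a flat, parenthesis-free sequence of clauses and booleans (the function name is made up, and it does not handle nested parentheses or modifiers):

```python
BOOLEANS = {"and", "or", "not", "prox"}


def group_left_to_right(tokens):
    """Parenthesise a flat [clause, boolean, clause, ...] list the way CQL reads it."""
    if len(tokens) % 2 == 0:
        raise ValueError("expected: clause (boolean clause)*")
    grouped = tokens[0]
    for i in range(1, len(tokens), 2):
        boolean, clause = tokens[i], tokens[i + 1]
        if boolean.lower() not in BOOLEANS:
            raise ValueError("not a CQL boolean: %r" % boolean)
        # Everything seen so far becomes the left operand of the next boolean.
        grouped = "(%s %s %s)" % (grouped, boolean, clause)
    return grouped


print(group_left_to_right(["a", "or", "b", "and", "c"]))   # ((a or b) and c)
```

So a or b and c means (a or b) and c, not a or (b and c); use explicit parentheses when you want the latter.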


As to the optionals in the clause: you can either use just the term or the full thing, for example fish or cql.serverChoice = fish, but not, say, = fish or cql.serverChoice fish
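
The quoting rule and the 'just the term, or the full thing' rule can be sketched as a small clause builder. This is a minimal sketch with made-up function names; it assumes the term is a plain value (backslashes are passed through unchanged, so any wildcard escaping is up to the caller).

```python
import re

# Characters that force a term to be double-quoted: whitespace and < > = / ( )
_NEEDS_QUOTES = re.compile(r'[\s<>=/()]')


def quote_term(term):
    """Double-quote a term when CQL requires it, escaping embedded doublequotes."""
    if term == "" or '"' in term or _NEEDS_QUOTES.search(term):
        return '"' + term.replace('"', '\\"') + '"'
    return term


def cql_clause(term, index=None, relation=None):
    """Build either a bare term or a full 'index relation term' clause."""
    if (index is None) != (relation is None):
        # '= fish' and 'cql.serverChoice fish' are both invalid.
        raise ValueError("give both index and relation, or neither")
    if index is None:
        return quote_term(term)
    return "%s %s %s" % (index, relation, quote_term(term))


print(cql_clause("fish"))                                   # fish
print(cql_clause("lord of the rings", "dc.title", "adj"))   # dc.title adj "lord of the rings"
```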

Index references must have a base name (e.g. title), and may have a prefix referring to a context set. If the prefix is omitted, it is determined/guessed by the server, so things like title any fish and dc.title any fish are correct - and in various applications will behave identically.

You can also map your own prefixes, but this is rarely seen and probably about as rarely supported. One example (there are more styles):

> dc = "info:srw/context-sets/1/dc-v1.1" dc.title any fish


The any (short for cql.any) in the examples above is the relation, which specifies the way you wish for the search to use the term. This includes equality, range tests, non-equality and such. You will probably most commonly use one of:

  • = - server choice, which may be clever or not so much. Will probably frequently choose == or adj depending on field and/or value
  • exact - exact match
  • == - exact match
  • adj - phrase searches, e.g. dc.title adj "lord of the rings"
  • any - any of the parts of the term value (sort of a shorthand for ORing together all the words in the term)
  • all - all of the parts of the term value (sort of a shorthand for ANDing together all the words in the term)
  • <> - not equal (similar to NOT)

For example:

dc.title any fish
dc.title = fish
cql.serverChoice adj fish

You can use relation modifiers to ask for more specific tests, or just hints, in how to evaluate the relation test. For example:

dc.title any/relevant fish
dc.title any/ relevant /cql.string fish
dc.title any/rel.algorithm=cori fish

Generally defined modifiers (with no specific implementation) include:

  • stem - use stemming before matching
  • relevant - conceptually fuzzy. For example subject any/relevant "fish frog" might match various specific fishes, frogs, amphibians, and whatnot.
  • fuzzy - unspecified fuzzy matching, e.g. to match misspelled words, in whatever way it chooses.
  • phonetic - match homophones


Other query examples:

title =/stem "completed dinosaurs"
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
(caudal or dorsal) prox vertebra
ribs prox/>/0/paragraph chevrons
dc.title any fish prox/unit=word/distance>3 dc.title any squirrel

? and * are wildcards for one and zero or more characters, respectively. The syntax allows them at any point (so for arbitrary, regexp-ish matching), although many real-world indexing systems CQL is used on may not support that.
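
If the backend has no native wildcard support, one option is to translate the term to a regular expression yourself. A minimal sketch, assuming that a backslash in front of ? or * makes it a literal character (which is my understanding of CQL escaping, so verify against the spec); the function name is made up:

```python
import re


def cql_wildcards_to_regex(term):
    """Translate a CQL term with ? and * wildcards into a compiled regex."""
    parts = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Escaped character: match it literally.
            parts.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            parts.append(".*")    # zero or more characters
        elif ch == "?":
            parts.append(".")     # exactly one character
        else:
            parts.append(re.escape(ch))
        i += 1
    return re.compile("".join(parts))


pattern = cql_wildcards_to_regex("kern*")
print(bool(pattern.fullmatch("kernighan")))
```

Use fullmatch() rather than search(), since the term is meant to describe the whole word. Leading wildcards in particular tend to be expensive or unsupported in real indexes.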



XCQL

The parsed form of CQL, serialized into XML.

It is verbose and more explicit (not requiring parentheses or precedence rules), in fact to the point where only debuggers would really want to read it. It seems mostly used in messaging to show how a given query was parsed.


XCQL uses a representation centred on elements it calls searchClause and triple.

A triple is a structure that builds up a tree, and consists of:

  • boolean
  • leftOperand: a (terminal) searchClause, or another triple
  • rightOperand: a (terminal) searchClause, or another triple

A searchClause may contain an index, relation and term

Both searchClause and triple may contain an array of prefixes, which contains/overrides the mapping at that level and deeper.
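
The tree structure described above can be sketched with ElementTree. The element names searchClause, triple, boolean, leftOperand, rightOperand, index, relation and term follow the description; wrapping the boolean and relation in a nested value element and omitting namespaces and prefixes are simplifying assumptions on my part, so check the XCQL schema before relying on the exact shape.

```python
import xml.etree.ElementTree as ET


def search_clause(index, relation, term):
    """Build a terminal searchClause element."""
    clause = ET.Element("searchClause")
    ET.SubElement(clause, "index").text = index
    rel = ET.SubElement(clause, "relation")
    ET.SubElement(rel, "value").text = relation
    ET.SubElement(clause, "term").text = term
    return clause


def triple(boolean, left, right):
    """Combine two operands (searchClause or triple elements) with a boolean."""
    node = ET.Element("triple")
    bool_el = ET.SubElement(node, "boolean")
    ET.SubElement(bool_el, "value").text = boolean
    ET.SubElement(node, "leftOperand").append(left)
    ET.SubElement(node, "rightOperand").append(right)
    return node


# dc.title any fish and (dc.creator = smith or dc.creator = jones)
tree = triple(
    "and",
    search_clause("dc.title", "any", "fish"),
    triple("or",
           search_clause("dc.creator", "=", "smith"),
           search_clause("dc.creator", "=", "jones")),
)
xml_text = ET.tostring(tree, encoding="unicode")
print(xml_text)
```

Note how the grouping that needed parentheses in CQL is carried purely by nesting here.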



PQF (Prefix Query Format), also sometimes Prefix Query Notation (PQN)


Example:

@attrset 1.2.840.10003.3.1 @or  @attr 1=1016 "foo"  @attr 1=1016 "snake"

In which:

  • 1.2.840.10003.3.1 refers to the bib-1 attribute set
  • 1=1016 refers to (bib-1) use attribute (1) and the 'any' field (1016).
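
A minimal sketch of generating PQF strings like the example above from a list of terms. The helper names are made up. To sidestep the question of how a given PQF parser handles escaping, it simply refuses terms containing doublequotes, and it only chains and/or (chaining not would change meaning depending on nesting).

```python
BIB1_OID = "1.2.840.10003.3.1"


def pqf_term(term, attrs):
    """One operand: attrs maps attribute type to value, e.g. {1: 1016} for use=any."""
    if '"' in term:
        raise ValueError("doublequotes in terms are not handled by this sketch")
    parts = ["@attr %d=%d" % (atype, value) for atype, value in sorted(attrs.items())]
    parts.append('"%s"' % term)
    return " ".join(parts)


def pqf_query(operator, operands, attrset=BIB1_OID):
    """Combine operands with a binary prefix operator, nesting to the right."""
    if operator not in ("and", "or"):
        raise ValueError("this sketch only chains 'and' and 'or'")
    if not operands:
        raise ValueError("need at least one operand")
    query = operands[-1]
    for operand in reversed(operands[:-1]):
        # Prefix notation: the operator comes first, then its two operands.
        query = "@%s %s %s" % (operator, operand, query)
    return "@attrset %s %s" % (attrset, query)


print(pqf_query("or", [pqf_term("foo", {1: 1016}), pqf_term("snake", {1: 1016})]))
```

Other attribute types go in the same dict, e.g. {1: 4, 4: 1} for a title phrase search, using the numbers from the Bib-1 lists earlier on this page.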

Cheshire II (C2)


http://cheshire.berkeley.edu/cheshire2.html#zfind


Z+SQL

  • also known as ZSQL, Z-SQL
  • [14],
  • apparently meant to give the Z39.50-1995 version 3 protocol an SQL-like syntax, allowing creation of more complex queries.


Not commonly seen?(verify) (yet?)


Others

You could implement the basic syntax that is approximately the lowest common denominator between systems like Google, Lucene, and such (AND, OR, doublequotes, brackets, minus as a NOT).


There are numerous minor expansions of this you could choose to support, such as varying ways of specifying NOT, field searching, and more.
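
A minimal sketch of a tokenizer for that lowest-common-denominator syntax. The token kind names are my own, and a parser that turns the tokens into a query tree would still be needed on top of this.

```python
import re

_TOKEN = re.compile(r'''
    (?P<phrase>-?"[^"]*")    # double-quoted phrase, optionally negated with a minus
  | (?P<paren>[()])          # grouping brackets
  | (?P<word>[^\s()"]+)      # bare word: a term, AND, OR, or a -negated term
''', re.VERBOSE)


def tokenize(query):
    """Split a simple search string into (kind, value) tuples."""
    tokens = []
    for match in _TOKEN.finditer(query):
        kind, text = match.lastgroup, match.group(0)
        if kind == "paren":
            tokens.append(("LPAREN" if text == "(" else "RPAREN", text))
            continue
        if text.startswith("-") and len(text) > 1:
            # Leading minus acts as NOT on the term or phrase that follows.
            tokens.append(("NOT", "-"))
            text = text[1:]
        if kind == "phrase":
            tokens.append(("PHRASE", text[1:-1]))
        elif text in ("AND", "OR"):
            tokens.append((text, text))
        else:
            tokens.append(("TERM", text))
    return tokens


print(tokenize('fish AND ("red herring" OR -salmon)'))
```

Only uppercase AND and OR are treated as operators here, which is one of those choices that varies between systems.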





OAI and related


OAI is the Open Archives Initiative.

It is effectively a way to share repository metadata content, to be combined, indexed and such elsewhere.


In part, it is a more elegant and less quirky solution than federated search, although the fact that it is based on Dublin Core effectively limits the scope of application.


PMH

OAI often implies a setup where OAI-PMH is used to provide / fetch the metadata. (PMH: Protocol for Metadata Harvesting)

XML-based, transferred over HTTP
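
A harvest is just a series of HTTP GET requests. A minimal sketch of building ListRecords request URLs, where repo.example.org is a placeholder base URL; oai_dc is the Dublin Core format mentioned above, and a resumptionToken (when a repository returns one for a large result) is sent on its own, without the other arguments.

```python
from urllib.parse import urlencode


def oai_list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                         from_date=None, resumption_token=None):
    """Build an OAI-PMH ListRecords request URL."""
    if resumption_token is not None:
        # Continuing an earlier request: the token replaces all other arguments.
        params = {"verb": "ListRecords", "resumptionToken": resumption_token}
    else:
        params = {"verb": "ListRecords", "metadataPrefix": metadata_prefix}
        if set_spec is not None:
            params["set"] = set_spec
        if from_date is not None:
            params["from"] = from_date   # e.g. "2020-01-01", for incremental harvests
    return base_url + "?" + urlencode(params)


# Hypothetical repository URL, for illustration only.
print(oai_list_records_url("http://repo.example.org/oai", from_date="2020-01-01"))
```

A harvester would fetch the first URL, parse the XML, and keep requesting with each returned resumptionToken until none is given.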


ORE


See #OAI-ORE

Related standards

Z39.88: OpenURL


Unsorted

Interesting searching, browsing or visualization

Relation browsing:

Faceted:

Metasearchers:


Search engines, catalogues & ILS, repositories, supporting libraries




Some

  • do just metadata search (like a classical card catalog)
    • the first were initially called Online Public Access Catalog (OPAC)
    • later there was a "Library 2.0" thing
  • are union catalogs ('what do related libraries have?')
    • without needing to do so via federation (faster)
  • do search federation (active searching) of
    • library holdings
    • remote database subscriptions
    • some also search ebooks
  • are 'integrated library systems' (ILS), which often means they also do loaning and such


  • BiblioteQ [17] - an ILS
  • Blacklight OPAC[18] (open source) - catalogue, built around Solr
  • DRIVER, DRIVER II [21]
  • DSpace [22] - repository
  • EPrints [24] - repository
  • Evergreen[25] (open source) - an ILS
  • Fedora Commons[34] (unrelated to the linux distro), Java-based
  • Fedora Learning Objects Repository Interface (Flori) [35]
  • Hyper Estraier[36] - full-text search (written in C)
  • Hyrax [37] - repository
  • JAFER[38] - base for Z39.50 clients and servers (written in Java)
  • Java Z3950 Toolkit[39] (Z39.50, Java)
  • Koha[40] (open source) - an ILS
  • LibLime[41] supports/delivers Koha, Evergreen
  • mnoGoSearch[43] (GPL)
  • MasterKey[44] (open source) - search aggregating system. Can search Z39.50, SRU, PazPar2, a local Zebra index, and via HTML scraping. Harvests and presents it.
  • Meresco[45] is mostly an SRU(/SRW) interface around an OAI-PMH fetcher (also has a web crawler, OAI export of its data, RSS import/export, etc., but they seem of secondary importance)
  • Metaproxy [46] - Frontend/switchboard that makes it easier to search Z39.50, SRU, SRW, and Solr (via webservice). Does result merging, filtering, caching. Exposed as Z39.50, SRU, and SRW.
  • Net::Z3950::ZOOM[47] (Perl)
  • OAI toolkit[48] (Perl)
  • OCLC-Pica, an apparently now deprecated OPAC
  • OpenER[50] (EduCommons RSS → OAI)
  • PhpMyBibli (PMB)[53]
  • Proai[56] (Java-based)
  • PyZ3950[59] - Z39.50
  • QtZ39.50[61] (uses PyZ3950)
  • Solr[63] (Java, with various interfaces) (see also Solr)
  • VB ZOOM[66] (Z39.50)
  • WebFeat
  • Xapian[68], a C++ search engine. (Has wrappers, like xappy[69])
  • YAZ[70]: C(/C++) toolkit for Z39.50, and also SRU and SRW
  • Zebra[72], an indexer/search/retrieve server (OAI, with a Z39.50 interface)
  • Zedkit
  • ZMARCO[73] (Z39.50, MARC, OAI)
  • Zoom .NET[74] (Z39.50)






Unsorted

E-learning:

  • Sites like Moodle, Teletop, StudieWeb, Sharepoint
  • May use some specific formats (e.g. the Dutch CZP)


Interoperability policies often include

  • use of open standards
  • harvesting+indexing, not federated search
  • no supplier-specific / software-specific (proprietary) features






Notes on (non-)centralization of search

More on the problems in merging



Union systems and data warehousing


To do better you would want a union system, because you would get control over various aspects of behaviour and uniformity you may want to guarantee.

There have been movements to warehouse data for groups, with licensing pretty much as it was when the same data was federated, meaning the warehouse mediates licenses to members of a group (sometimes licensing to the whole group).

From a search process view, this means you have more data sources in raw form (sometimes, it seems, in indexed form for a specific search system; I can imagine the wish for businesses to tie customers into their products). It does not guarantee you will search it significantly better than the data providers will, but chances are that merging and sorting are better, if only because you handle them yourself, uniformly, in one place.