Online (library) search related
Search protocols
ANSI/NISO Z39.50 (also ISO 23950)
Z39.50 is a search-and-present protocol, one that is still fairly commonly used by libraries and scholarly databases.
Specifically:
- ANSI/NISO Z39.50-1988: version 1 (considered obsolete)
- ANSI/NISO Z39.50-1992: version 2 (incompatible with version 1)
- ANSI/NISO Z39.50-1995: version 2 and 3 (a compatible superset of Z39.50-1992)
- Z39.50-2003 - a clarified version of -1995?
Apparently various servers use 2 with some extensions from version 3.
- Related links:
- http://www.loc.gov/z3950/agency/
- ZING (Z39.50 International Next Generation initiative) (LOC's information on this has been merged into its SRU/SRW pages, though)
- includes ZOOM (see e.g. [1]), an effort for more object-oriented Z39.50
- Mike Taylor's z3950.org, including resources related to ZOOM (and various bindings), ZING
- Some statistics
- uses ASN.1 (Abstract Syntax Notation One) to describe record formats and such.
- runs on port 210 by default (the port registered for Z39.50), although many databases choose other ports.
A Z39.50 server's capabilities (services, record types, query operators) are usually not reported by the host itself; you generally have to discover them through experiments, then configure them. (Only a few percent of servers offer Explain-style support.)
Services
- Basics include:
- search
- present
- scan
- sort (not always supported(!) )
- Optionals:
- extendedServices
- namedResultSets
- triggerResourceCtrl
- delSet
- negotiationModel
- duplicationDetection
See also [2]
Z39.50 Attribute sets - Bib-1
These numbers are used in various Z39.50 query types (Type-1, Type-101), in PQF, and in relevant query translations.
See also:
Of the six types of attributes, use attributes are probably the most generally interesting. The others are only interesting if you're tweaking a client or query translator.
Bib-1 Use attributes (1)
'Use Attributes' are references to indices you can refer to in your searches. The most interesting are approximately:
- 1016 is 'Any'
- usually as an 'any common field', but may be creatively interpreted
- other all-ish fields may appear in addition to (or sometimes instead of) 1016, for example 1035, 1036, and others
- 4, Title
- 1003, Author, though may be 1004 (Author-name personal), 1 (Personal name), and/or others at a specific database
- 31, Date of publication (or 30?)
- 7, ISBN
- 8, ISSN
- 21, Subject heading
- 1035, 'Anywhere', which has assumed status of 'anywhere, including abstract and/or full text'
- 1036, Author-Title-Subject, and other variations on this with two of the three and/or others.
In practice, you may require per-target use attribute remapping to convert queries, to get consistent behaviour out of a larger set of targets in which some are somewhat unusual.
Z39.50 mentions the MARC fields that this would likely map to.
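The per-target remapping mentioned above could look roughly like the following. This is only a sketch: the generic field names, the target names, and their overrides are made up for illustration, and only the attribute numbers come from the list above.

```python
# Bib-1 use attribute numbers, taken from the list above.
GENERIC_USE = {
    "any": 1016,
    "title": 4,
    "author": 1003,
    "isbn": 7,
    "issn": 8,
    "subject": 21,
    "date": 31,
}

# Hypothetical per-target overrides. The target names and their quirks
# are invented for this example.
TARGET_OVERRIDES = {
    "example-target-a": {"author": 1004, "any": 1035},
    "example-target-b": {"author": 1},
}


def use_attribute(field: str, target: str) -> int:
    """Return the Bib-1 use attribute for a generic field at a given target."""
    overrides = TARGET_OVERRIDES.get(target, {})
    if field in overrides:
        return overrides[field]
    return GENERIC_USE[field]


if __name__ == "__main__":
    for target in ("example-target-a", "example-target-b", "unlisted-target"):
        print(target, use_attribute("author", target))
```

Keeping the overrides in plain data means a newly discovered quirk is a configuration change rather than a code change.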
Relation attributes (2)
- Values:
- 1, < less than
- 2, <= less than or equal
- 3, = equal
- 4, >= greater or equal
- 5, > greater than
- 6, <> not equal
- 100, phonetic
- 101, stem
- 102, relevance
- 103, always matches
- 104, custom, per-target
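When translating a comparison operator from another query language, the first six values above map fairly directly. A small sketch, assuming the output is wanted as a PQF-style attribute (PQF is described further down); the function name is made up:

```python
# Bib-1 relation attribute values (attribute type 2), from the list above.
BIB1_RELATION = {
    "<": 1,
    "<=": 2,
    "=": 3,
    ">=": 4,
    ">": 5,
    "<>": 6,
}


def relation_attr(symbol: str) -> str:
    """Return the PQF attribute string for a comparison operator."""
    try:
        value = BIB1_RELATION[symbol]
    except KeyError:
        raise ValueError(f"no Bib-1 relation for {symbol!r}") from None
    return f"@attr 2={value}"


if __name__ == "__main__":
    print(relation_attr("<="))
```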
See also:
Position attribute (3)
The default is 3; it is rarely changed, and rarely useful to change.
- 1, first in field
- 2, first in subfield
- 3, any position in field
Structure attribute (4)
The most interesting are 105 and 1 (and perhaps 108); the rest are usually too target-specific.
- 1, phrase (requires same order, and adjacency)
- 2, word
- 3, key
- 4, year (4-digit)
- 5, date (normalized, see ISO 8824)
- 6, word list (orderless, target-specific interpretation(verify))
- 100, date (un-normalized)
- 101, name (normalized)
- 102, name (un-normalized)
- 103, structure
- 104, urx
- 105, free-form-text
- 106, document-text
- 107, local number
- 108, string
- 109, numeric string
Truncation attribute (5)
Truncation is done with #
- 1 right
- 2 left
- 3 left & right
- 100 none
- 101 process
- 102 regular-1
- 103 regular-2
- 104 CCL
Completeness attribute (6)
- 1 incomplete subfield
- 2 complete subfield
- 3 complete field
Z39.50 Attribute sets - Others
Z39.50 defines:
- Bib-1 {Z39-50-attributeSet 1} (See ATR.1)
- Exp-1 {Z39-50-attributeSet 2} (See ATR.2)
- Ext-1 {Z39-50-attributeSet 3} (See ATR.3)
- CCL-1 {Z39-50-attributeSet 4}
- GILS {Z39-50-attributeSet 5}
- STAS {Z39-50-attributeSet 6}
Bib-1 is (probably by far) the most common
There is a Danish Dan-1 set, which is usually used in addition to Bib-1.
See also:
- SRU (Search/Retrieval via URL) is a simple protocol intended as a successor to Z39.50.
- SRW (Search/Retrieval as a Web service) is a deprecated name for what should now be called 'SRU over SOAP'.
- SRW/U groups both SRU and SRW(verify)
- MXG (Metasearch XML Gateway) is an XML gateway meant to make metasearch and the required interfacing with content providers easier.
SRU is not tied to a particular XML schema (or even to XML(verify)) and should report the data format (DTD identifier). A content provider may choose an XML schema that suits its contents best, and mix them; a client is expected to transform the contents it receives.
XML-coded Dublin Core [7] and MARCXML are not uncommon.
Sources may choose to transform whatever format they use internally to more specific-purpose schemas, such as Metadata Object Description Schema (MODS, [8]) for bibliographic ends, or specific schemas such as Learning Object Metadata (LOM, [9]) for e-learning ends.
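As a rough illustration of what an SRU request looks like: it is an ordinary HTTP GET with the query and the requested record schema as URL parameters. The sketch below only builds the URL. The base URL is a placeholder, and which schema names (here the short name dc) and which version a server accepts is server-specific, so treat those defaults as assumptions.

```python
from urllib.parse import urlencode


def sru_search_url(base_url, cql_query, record_schema="dc",
                   start_record=1, maximum_records=10, version="1.1"):
    """Build an SRU searchRetrieve URL for a CQL query."""
    params = [
        ("operation", "searchRetrieve"),
        ("version", version),
        ("query", cql_query),
        ("startRecord", str(start_record)),
        ("maximumRecords", str(maximum_records)),
        # The schema the client would like records in; servers differ
        # in which schemas they can deliver.
        ("recordSchema", record_schema),
    ]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder host, not a real SRU server.
    print(sru_search_url("http://sru.example.org/catalog", "dc.title any fish"))
```

The response is XML that wraps the records, so the client still has to parse the wrapper and then transform whichever record schema it asked for.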
OpenSearch
A9's OpenSearch is comparable in function to SRU and MXG
Some of its links are dead, but the project itself does not seem to be.
Proxied
A third party implements the search and provides it, usually, over Z39.50, SRU, or SRW.
For example OpenTranslators
http://www.librarytechnology.org/ltg-displayarticle.pl?RC=12980
Query formats
Z39.50 query formats (Type 1, 101, and others)
Z39.50 defines a few query types, of which type 1 and 101 are the most commonly supported.
Both are postfix grammars, but note they are a structure (transported via ASN.1?(verify)) rather than a direct string representation(verify).
From a few minutes of googling it seems that:
- Type 1: called RPN (Reverse Polish Notation)
- must be supported by conforming Z39.50 target
- Type 101: Called ERPN (Extended RPN), a fairly basic extension of Type 1, with additions like the prox operator
- Target may claim support of Type-101, of its Prox operator, and of its Restriction operand (independently?(verify)) (can be claimed using Explain, or just by informal description)
And apparently:
- Type 0: a-priori agreement (verify)
- Type 2: ISO8777 (verify)
- Type 100: Common Command Language (CCL)? (verify)
- Type 102: Ranked List query - which consists of a list of queries and a weight for each
- Type 104: (recent) (verify)
See also:
CCL (Common Command Language)
- ISO 8777 (but not all are entirely up to that)
- http://www.indexdata.dk/yaz/doc/tools.tkl#CCL
- Meant to be a somewhat user-end query language (but probably still too complex for that)
- and, or, not
CQL - Common Query Language (up to 1.1) / Contextual Query Language (since 1.2)
Developed and maintained by LoC, apparently with focus on Z39.50 use. As a query syntax it is the only choice in SRU and SRW, an option in Z39.50, and it is sometimes seen elsewhere.
Basic 'just give me some words' queries are fairly simple. More structured queries are possible (support is not required, see the levels, but most servers have it), though the syntax and index/field names put this beyond what 99% of people would use directly.
'Contextual' seems to refer to leaving exact query interpretation somewhat up to the server-side implementation, rather than being defined strictly into the query syntax.
Which allows for flexible searches, though also means each server is as clever or as dumb as its implementation happens to be, with few ways to work around that.
See also
Context sets
CQL has the concept of context sets, mostly referring to sets of attributes referring (usually) to record properties to search in, but also to allow reference to an earlier search set to narrow down.
Already-defined context sets include a (quite simple) cql context set, a srw set, a Bath set (to be deprecated in favour of bib), and more specific ones (some serving niches), like zthes, one for LOM, ccg (collectible card games), music, and any you wish to create yourself. The registered ones are listed at [10].
Systems are allowed to deal only with simpler variations of CQL queries, without supporting some of the more complex stuff.
No system is required to support everything, but it should still diagnose what it doesn't support (see SRU's list).
The three levels:
- Level 0 must support simple term-only queries
- (words, double-quoted strings, backslash-escaped double quotes in values), and report everything that is not actually supported
- Level 1 adds the ability to parse both of the following, and support at least one of them at a time in a query (not necessarily both):
- search clauses with 'index relation searchTerm' structure (e.g. 'dc.title any food')
- boolean combinations
- Level 2 requires that all of CQL be parsed, and for it to report what is not actually supported
- (level 2 requires a proper parser while in 0 and 1 you can get away with simpler hackish string handling)
CQL syntax
As to CQL syntax, see [11] for a definition and e.g. [12] for a decent introduction.
Some notes and examples of my own:
- The syntax of CQL is case insensitive, though the values it carries (term values, modifier values, prefix map values) should be carried through unchanged.
- A query is one clause, or multiple clauses booleaned together
- booleans being and, or, not (meaning and not, not a unary operator), and prox (which is like an and with the extra requirement that it is close)
- booleans are evaluated left to right (no precedence); parentheses can be used to override left-to-right evaluation
- A single clause consists of
- an index (optional, defaults to cql.serverChoice. For more details on index specification, see [13])
- a relation (optional)
- a term (required). If it contains whitespace (e.g. a space) or one of <>=/() it must be double-quoted. It may always be double-quoted (quoting is only unnecessary for single-word terms). Backslash-escaped double quotes may appear in values.
- the index and relation are optional in that if you write only a term, it is interpreted as cql.serverChoice = term (...since 1.2 anyway; the default relation in 1.1 was scr)
- the specification of index, relations, booleans, and sorting can be modified by adding a / (optional whitespace around it) and modifier details.
- sort specification, as last element (e.g. fish sortBy dc.title, and more specific through modification: fish sortBy dc.date/sort.ascending)
- Note that a single clause can imply boolean-like interpretation, e.g. through cql.any and cql.all
- You sometimes see examples like title=((dinosaur and bird) or dinobird) or dc.title=(kern* or ritchie). This seems to be invalid CQL, and probably comes from confusion since it is valid CCL.
As to the optionals in the clause: you can either use just the term or the full thing, for example fish or cql.serverChoice = fish, but not, say, = fish or cql.serverChoice fish
Index references must have a base name (e.g. title), and may have a prefix referring to a context set. If the prefix is omitted, it is determined/guessed by the server, so things like title any fish and dc.title any fish are correct - and in various applications will behave identically.
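The quoting and 'term only, or index plus relation plus term' rules above are simple enough to capture in a small helper. A minimal sketch with made-up function names; it passes backslashes through unchanged, so anything beyond double quotes has to be escaped by the caller:

```python
import re

# Characters that force a term to be double-quoted: whitespace, the
# characters <>=/() and the double quote itself.
_NEEDS_QUOTES = re.compile(r'[\s<>=/()"]')


def quote_term(term: str) -> str:
    """Double-quote a CQL term when required, escaping embedded quotes."""
    if term == "" or _NEEDS_QUOTES.search(term):
        return '"' + term.replace('"', '\\"') + '"'
    return term


def clause(term, index=None, relation=None):
    """Build one search clause: just a term, or index + relation + term."""
    if (index is None) != (relation is None):
        raise ValueError("give both index and relation, or neither")
    if index is None:
        return quote_term(term)
    return f"{index} {relation} {quote_term(term)}"


if __name__ == "__main__":
    print(clause("fish"))
    print(clause("lord of the rings", "dc.title", "adj"))
```

Rejecting a lone index or a lone relation mirrors the point above that = fish and cql.serverChoice fish are not valid.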
You can also map your own prefixes, but this is rarely seen and probably about as rarely supported. One example (there are more styles):
> dc = "info:srw/context-sets/1/dc-v1.1" dc.title any fish
The any (short for cql.any) in the examples above is the relation, which specifies the way you wish for the search to use the term. This includes equality, range tests, non-equality and such. You probably commonly use one of:
- = - server choice, which may be clever or not so much. Will probably frequently choose == or adj depending on field and/or value
- exact - exact match
- == - exact match
- adj - phrase searches, e.g. dc.title adj "lord of the rings"
- any - any of the parts of the term value (sort of a shorthand for ORing together all the words in the term)
- all - all of the parts of the term value (sort of a shorthand for ANDing together all the words in the term)
- <> - not equal (similar to NOT)
For example:
dc.title any fish
dc.title = fish
dc.title any fish
cql.serverChoice adj fish
You can use relation modifiers to ask for more specific tests, or just hints, in how to evaluate the relation test. For example:
dc.title any/relevant fish
dc.title any/ relevant /cql.string fish
dc.title any/rel.algorithm=cori fish
Generally defined modifiers (with no specific implementation) include:
- stem - use stemming before matching
- relevant - conceptually fuzzy. For example subject any/relevant "fish frog" might match various specific fishes, frogs, amphibians, and whatnot.
- fuzzy - unspecified fuzzy matching, e.g. to match misspelled words, in whatever way it chooses.
- phonetic - match homophones
Other query examples:
title =/stem "completed dinosaurs"
(((a and b) or (c not d) not (e or f and g)) and h not i) or j
(caudal or dorsal) prox vertebra
ribs prox/>/0/paragraph chevrons
dc.title any fish prox/unit=word/distance>3 dc.title any squirrel
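Since booleans have no precedence and are evaluated left to right, it can help to make the implied grouping explicit when reading a query without parentheses. A toy sketch: it assumes the input is already split into a flat list alternating terms and booleans, and does not parse real CQL.

```python
BOOLEANS = {"and", "or", "not", "prox"}


def group_left_to_right(tokens):
    """Parenthesize a flat [term, boolean, term, ...] list left to right."""
    if len(tokens) % 2 == 0:
        raise ValueError("expected: term (boolean term)*")
    result = tokens[0]
    for i in range(1, len(tokens), 2):
        boolean, term = tokens[i], tokens[i + 1]
        # Ignore modifiers such as prox/unit=word when checking the name.
        if boolean.split("/")[0].lower() not in BOOLEANS:
            raise ValueError(f"not a boolean: {boolean!r}")
        result = f"({result} {boolean} {term})"
    return result


if __name__ == "__main__":
    print(group_left_to_right("a or b and c".split()))
```

So a or b and c reads as (a or b) and c, which differs from the usual programming-language habit of letting and bind tighter.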
? and * are wildcards for one and zero or more characters, respectively. The syntax allows them at any point (so for arbitrary, regexp-ish matching), although many of the real-world indexing systems that CQL is used on do not support that.
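If a backend has no native wildcard support, one possible fallback is translating the term into a regular expression and filtering candidates with it. A sketch under the assumption that a backslash escapes the next character; how a given server actually matches (case folding, tokenization) will differ.

```python
import re


def cql_wildcards_to_regex(term: str):
    """Translate ? and * in a term into a compiled regular expression."""
    out = []
    i = 0
    while i < len(term):
        ch = term[i]
        if ch == "\\" and i + 1 < len(term):
            # Escaped character: take the next character literally.
            out.append(re.escape(term[i + 1]))
            i += 2
            continue
        if ch == "*":
            out.append(".*")   # zero or more characters
        elif ch == "?":
            out.append(".")    # exactly one character
        else:
            out.append(re.escape(ch))
        i += 1
    return re.compile("".join(out))


if __name__ == "__main__":
    pattern = cql_wildcards_to_regex("kern*")
    print(bool(pattern.fullmatch("kernighan")))
```

Use fullmatch rather than search, so that a term without wildcards only matches the whole value.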
XCQL
The parsed form of CQL, serialized into XML.
Which is verbose and more explicit (not requiring parentheses or precedence rules), in fact to the point where only debuggers would really want to read it. It seems mostly used in messaging to show how a given query was parsed.
XCQL uses a representation centralizing around elements it calls searchClause and triple.
A triple is a structure that builds up a tree, and consists of:
- boolean
- leftOperand: a (terminal) searchclause, or another triple
- rightOperand: a (terminal) searchclause, or another triple
A searchClause may contain an index, relation and term
Both searchClause and triple may contain an array of prefixes, which contains/overrides the mapping at that level and deeper.
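To get a feel for the tree shape, the structure described above can be built with the standard library's ElementTree. This is a simplified sketch: it omits namespaces, prefixes and modifiers, and wrapping the boolean and relation names in a value child element is my reading of XCQL rather than something checked against the schema.

```python
import xml.etree.ElementTree as ET


def search_clause(index, relation, term):
    """Build a terminal searchClause element."""
    el = ET.Element("searchClause")
    ET.SubElement(el, "index").text = index
    rel = ET.SubElement(el, "relation")
    ET.SubElement(rel, "value").text = relation
    ET.SubElement(el, "term").text = term
    return el


def triple(boolean, left, right):
    """Combine two operands (searchClause or triple) with a boolean."""
    el = ET.Element("triple")
    b = ET.SubElement(el, "boolean")
    ET.SubElement(b, "value").text = boolean
    ET.SubElement(el, "leftOperand").append(left)
    ET.SubElement(el, "rightOperand").append(right)
    return el


def example_tree():
    """Tree for: dc.title any fish and (dc.creator = smith or dc.creator = jones)."""
    return triple(
        "and",
        search_clause("dc.title", "any", "fish"),
        triple(
            "or",
            search_clause("dc.creator", "=", "smith"),
            search_clause("dc.creator", "=", "jones"),
        ),
    )


if __name__ == "__main__":
    print(ET.tostring(example_tree(), encoding="unicode"))
```

The nesting of triples replaces the parentheses of the string form, which is why no precedence rules are needed.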
See also:
PQF (Prefix Query Format), also sometimes Prefix Query Notation (PQN)
- a prefix-notation tree format sort of idea
- Similar to Type-1/Type-101 RPN in structure, except that it is prefix. Used in some Z39.50 contexts
- http://www.indexdata.dk/yaz/doc/tools.tkl#PQF
Example:
@attrset 1.2.840.10003.3.1 @or @attr 1=1016 "foo" @attr 1=1016 "snake"
In which:
- 1.2.840.10003.3.1 refers to the bib-1 attribute set
- 1=1016 refers to (bib-1) use attribute (1) and the 'any' field (1016).
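Queries like the example above are easy to assemble from parts, because prefix notation means an operator is simply followed by its two operands. A minimal sketch with made-up helper names; the backslash escaping of quotes inside terms is an assumption about how the receiving PQF parser handles them.

```python
BIB1_OID = "1.2.840.10003.3.1"  # the bib-1 attribute set, as in the example above


def pqf_term(term, use, relation=None, structure=None, truncation=None):
    """One operand: attribute list followed by a quoted term."""
    parts = [f"@attr 1={use}"]
    if relation is not None:
        parts.append(f"@attr 2={relation}")
    if structure is not None:
        parts.append(f"@attr 4={structure}")
    if truncation is not None:
        parts.append(f"@attr 5={truncation}")
    escaped = term.replace("\\", "\\\\").replace('"', '\\"')
    parts.append(f'"{escaped}"')
    return " ".join(parts)


def pqf_combine(operator, left, right):
    """Prefix notation: the operator comes first, then both operands."""
    if operator not in ("and", "or", "not"):
        raise ValueError(f"unsupported operator: {operator!r}")
    return f"@{operator} {left} {right}"


def pqf_query(body, attrset=BIB1_OID):
    """Prepend the attribute set declaration."""
    return f"@attrset {attrset} {body}"


if __name__ == "__main__":
    body = pqf_combine("or", pqf_term("foo", 1016), pqf_term("snake", 1016))
    print(pqf_query(body))
```

Because operands can themselves be the result of pqf_combine, nested queries need no parentheses.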
Cheshire II (C2)
http://cheshire.berkeley.edu/cheshire2.html#zfind
Z+SQL
- also known as ZSQL, Z-SQL
- [14],
- apparently meant to give the Z39.50-1995 Version3 protocol an SQL-like syntax, and allows creation of more complex queries.
Not commonly seen?(verify) (yet?)
Others
You could implement the basic syntax that is approximately the lowest common denominator between systems like Google, Lucene, and such (AND, OR, double quotes, brackets, minus as a NOT).
There are numerous minor expansions you could choose to support, such as varying ways of specifying NOT, field searching, and more.
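A tokenizer for that lowest-common-denominator syntax fits in a few lines. A sketch only: the token names are made up, AND and OR are recognized in uppercase only, and malformed input such as an unterminated quote is silently skipped rather than reported.

```python
import re

# A parenthesis, or an optionally negated quoted phrase or bare word.
_TOKEN = re.compile(
    r'(?P<lparen>\()|(?P<rparen>\))|'
    r'(?P<neg>-)?(?:"(?P<phrase>[^"]*)"|(?P<word>[^\s()"]+))'
)


def tokenize(query: str):
    """Split a query into (kind, text) tokens."""
    tokens = []
    for m in _TOKEN.finditer(query):
        if m.group("lparen"):
            tokens.append(("LPAREN", "("))
        elif m.group("rparen"):
            tokens.append(("RPAREN", ")"))
        elif m.group("phrase") is not None:
            if m.group("neg"):
                tokens.append(("NOT", "-"))
            tokens.append(("PHRASE", m.group("phrase")))
        else:
            word = m.group("word")
            if not m.group("neg") and word in ("AND", "OR"):
                tokens.append((word, word))
            else:
                if m.group("neg"):
                    tokens.append(("NOT", "-"))
                tokens.append(("WORD", word))
    return tokens


if __name__ == "__main__":
    for token in tokenize('fish AND (frog OR "tree frog") -toad'):
        print(token)
```

A minus only counts as NOT at the start of a word or phrase, so hyphenated words like e-mail stay intact.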
See for example
...and many others.
See also
OAI is the Open Archives Initiative.
It is effectively a way to share repository metadata content, to be combined, indexed and such elsewhere.
In part, it is a more elegant and less quirky solution than federated search, although the fact that it is based on Dublin Core effectively limits the scope of application.
PMH
OAI often implies a setup where OAI-PMH is used to provide / fetch the metadata. (PMH: Protocol for Metadata Harvesting)
XML-based, transferred over HTTP
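A harvesting request is just an HTTP GET with a verb and a few arguments. The sketch below builds ListRecords URLs only (no fetching or XML parsing). The base URL is a placeholder and the function names are made up; the parameter names are the ones OAI-PMH defines.

```python
from urllib.parse import urlencode


def list_records_url(base_url, metadata_prefix="oai_dc", set_spec=None,
                     from_date=None, until_date=None):
    """Build the first ListRecords request for a harvest."""
    params = [("verb", "ListRecords"), ("metadataPrefix", metadata_prefix)]
    if set_spec is not None:
        params.append(("set", set_spec))
    if from_date is not None:
        params.append(("from", from_date))
    if until_date is not None:
        params.append(("until", until_date))
    return base_url + "?" + urlencode(params)


def resume_url(base_url, resumption_token):
    """Build a follow-up request; the token replaces all other arguments."""
    params = [("verb", "ListRecords"), ("resumptionToken", resumption_token)]
    return base_url + "?" + urlencode(params)


if __name__ == "__main__":
    # Placeholder host, not a real repository.
    print(list_records_url("http://repository.example.org/oai",
                           from_date="2020-01-01"))
```

Large result lists come back in chunks, each carrying a resumption token for the next request, so a harvester typically loops on resume_url until no token is returned.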
See also:
ORE
See #OAI-ORE
Related standards
Z39.88: OpenURL
Unsorted
Interesting searching, browsing or visualization
Relation browsing:
- Visual Thesaurus
- KartOO
- AquaBrowser, e.g. at the Amsterdam public library
- LivePlasma
Faceted:
- Vivisimo, specifically Clusty
- http://askx.com
- http://www.gigablast.com/
- http://www.squidoo.com/
- http://kosmix.com/
Metasearchers:
Search engines, catalogues & ILS, repositories, supporting libraries
Some
- do just metadata search (like a classical card catalog)
- the first initially called Online public access catalog (OPAC)
- later there was a "Library 2.0" thing
- are union catalogs ('what do related libraries have?')
- without needing to do so via federation (faster)
- do search federation (active searching) of
- library holdings
- search remote database subscriptions
- some also search ebooks
- are 'integrated library systems' (ILS), which often means they also handle loans and such
- BiblioteQ [17] - an ILS
- DRIVER, DRIVER II [21]
- DSpace [22] - repository
- Emilda[23] - an ILS
- EPrints [24] - repository
- Evergreen[25] (open source) - an ILS
- Ex Libris[26] offers (among others)
- Ferret[33] (Ruby)
- Fedora Commons[34] (unrelated to the linux distro), Java-based
- Fedora Learning Objects Repository Interface (Flori) [35]
- Hyper Estraier[36] - full-text search (written in C)
- Hyrax [37] - repository
- JAFER[38] - base for Z39.50 clients and servers (written in Java)
- Java Z3950 Toolkit[39] (Z39.50, Java)
- Koha[40] (open source) - an ILS
- LibLime[41] supports/delivers Koha, Evergreen
- LibraryFind[42] (open source) - federated search(?)
- mnoGoSearch[43] (GPL)
- MasterKey[44] (open source) - search aggregating system. Can search Z39.50, SRU, PazPar2, a local Zebra index, and via HTML scraping. Harvests and presents it.
- Meresco[45] is mostly an SRU(/SRW) interface around an OAI-PMH fetcher (also has a web crawler, OAI export of its data, RSS import/export, etc., but they seem of secondary importance)
- Metaproxy [46] - Frontend/switchboard that makes it easier to search Z39.50, SRU, SRW, and Solr (via webservice). Does result merging, filtering, caching. Exposed as Z39.50, SRU, and SRW.
- Net::Z3950::ZOOM[47] (Perl)
- OAI toolkit[48] (Perl)
- OpenBiblio[49]
- OCLC-Pica, an apparently now deprecated OPAC
- OpenER[50] (EduCommons RSS → OAI)
- OpenSiteSearch[51]
- Pazpar2[52], a Z39.50-based federated search server (YAZ-based, but no SRU/SRW)
- PhpMyBibli (PMB)[53]
- PHP-MARC[54]
- PhpMyLibrary[55]
- Proai[56] (Java-based)
- PyMARC[58]
- PyZ3950[59] - Z39.50
- pyoai[60]
- QtZ39.50[61] (uses PyZ3950)
- swish-e[62]
- Sphinx[64] - SQL database backed (C++) (See also sphinx notes)
- Summon[65]
- VB ZOOM[66] (Z39.50)
- VuFind[67]
- WebFeat
- YAZ[70]: C(/C++) toolkit for Z39.50, and also SRU and SRW
- YAZ proxy[71]
- Zebra[72], an indexer/search/retrieve server (OAI, with a Z39.50 interface)
- Zedkit
- ZMARCO[73] (Z39.50, MARC, OAI)
- Zoom .NET[74] (Z39.50)
See also:
- http://en.wikipedia.org/wiki/Integrated_library_system
- http://en.wikipedia.org/wiki/List_of_next-generation_catalogs
- Recent_Apache_projects_and_related_notes
- http://www.openarchives.org/tools/tools.html
- http://www.loc.gov/z3950/agency/resources/software.html
Unsorted
E-learning:
- Sites like Moodle, Teletop, StudieWeb, Sharepoint
- May use some specific formats (e.g. the Dutch CZP)
Interoperability policies often include
- use of open standards
- harvesting+indexing, not federated search
- no supplier-specific / software-specific (proprietary) features
Notes on (non-)centralization of search
More on the problems in merging
Union systems and data warehousing
To do better you would want a union system, because you would get control over various aspects of behaviour and uniformity you may want to guarantee.
There have been movements to warehouse data for groups, with licensing pretty much as it was when the same data was federated, meaning the warehouse mediates licenses to members of a group (sometimes licensing to the whole group).
From a search process view, this means you have more data sources in raw form (sometimes, it seems, in indexed form for a specific search system; I can imagine the wish for businesses to tie customers into their products). It does not guarantee you will search it significantly better than the data providers will, but chances are merging and sorting is better if only because they're handled automatically.