OCLC Pica notes
- 1 Screen-scraping the availability
- 2 Deep linking
- 3 Other Pica notes and links
Screen-scraping the availability
The structure to match availability information from the record-viewing HTML:
<tr>
  <td valign="top" class="preslabel"><strong>Request number: </strong></td>
  <td valign="bottom" class="presvalue">UB zaal Wisk.Natuurw.Tech. uwnt 190S 013<BR></td>
</tr><tr>
  <td valign="top" class="preslabel"><strong>Request info: </strong></td>
  <td valign="bottom" class="presvalue">lendable<BR>Available Zaal W.N.T. (3e etage UB)</td>
</tr>
This structure can
- not appear at all,
- appear once, or
- appear several times (up to three seen so far),
- appear as only the 'Request number' row (e.g. containing just 'Use mentioned link'), with no second row ('Request info') to go with it,
- appear with <tr>s between the two rows mentioned here, containing other information
The HTML (4.01 right now) is not strictly valid, so a robust parser is needed. Since I use Python, I use BeautifulSoup for this, which also makes it easier to pick data out of the parsed document. Many other languages have similar libraries (e.g. Rubyful Soup) and/or a wrapper around the library form of (HTML) Tidy.
As to the states, these are the values seen so far (for English; you probably want to request a language explicitly, e.g. LNG=EN, to avoid the interface defaulting to another):
- 'restricted circulation'
- 'on approval'
- 'photocopy only'
- 'lendable' -- when an item is 'lendable', its details can contain:
  - 'on loan'
  - 'not yet available'
Note: This uses an older, messier version of BeautifulSoup. The code we actually use has been updated since, partly because a Pica update broke it.
As such, please treat this code as pseudocode.
import urllib2
from BeautifulSoup import BeautifulSoup   # the old, pre-bs4 module

def check_availability(ppn, host, db, ikt):
    """ Gets the state of some material by its PPN, and possibly some
        extra information.
        I use this in a  '<dfn title="%s">%s</dfn>'%(details,state)  sort of way.

        Returns a list of zero or more such (state,details) pairs.

        Fill in the settings as fits your own setup. For example:
          check_availability(ppn, 'http://opc.ub.rug.nl', db='1', ikt='12')
    """
    ## Fetch and parse
    url = "%s/LNG=EN/DB=%s/IKT=%s/FRM=ppn%%25%%33%%44%s/CLK?TRM=%s"%(host,db,ikt,ppn,ppn)
    handle = urllib2.urlopen( urllib2.Request(url) )
    data = handle.read()         # the page's fetched HTML
    handle.close()
    soup = BeautifulSoup(data)   # parses, returns doc structure.

    ## Extract information
    ret = []
    # note: The nbsps may change in future versions. A regexp would be more robust
    for rn in soup.fetchText('Request&nbsp;number:&nbsp;'):
        if ( ' '.join(rn.next.fetchText(True)) ).find('Use mentioned link')!=-1:
            ret.append( ('URL','') )   # rn.next is the td in the same row.
        else:
            rinfo = rn.findNext(text='Request&nbsp;info:')  # *Underaccepts, FIXME*
            if rinfo:   # appears at all?
                atx = rinfo.next.fetchText(True)   # fetch text in the next cell
                if len(atx)>1:
                    # first should be the state, the rest are details
                    state,details = atx[0],' '.join(atx[1:])
                    if details.find('n loan')!=-1:   # interesting to show
                        state = 'on loan'
                    ret.append( (state,details) )
    return ret
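For comparison, the extraction step can be sketched in current Python 3 with only the standard library. This is a rough sketch and not the code we run: the names extract_availability and cell_lines and the regular expressions are made up here, and it assumes the preslabel/presvalue markup shown earlier stays as it is. Unlike the function above, it also accepts a 'Request info' cell that holds only a state and no details.

```python
import html
import re

# One label/value row, as in the record-viewing HTML shown earlier.
ROW_RE = re.compile(
    r'<td[^>]*class="preslabel"[^>]*>(.*?)</td>\s*'
    r'<td[^>]*class="presvalue"[^>]*>(.*?)</td>',
    re.IGNORECASE | re.DOTALL)
BR_RE = re.compile(r'<br\s*/?>', re.IGNORECASE)
TAG_RE = re.compile(r'<[^>]+>')


def cell_lines(fragment):
    """Split a cell's HTML on <BR>, strip tags and entities, drop empty lines."""
    lines = []
    for part in BR_RE.split(fragment):
        text = html.unescape(TAG_RE.sub('', part)).replace('\xa0', ' ').strip()
        if text:
            lines.append(text)
    return lines


def extract_availability(page):
    """Return (state, details) pairs, at most one per 'Request number' row."""
    results = []
    pending = False  # True while a 'Request number' row awaits its 'Request info'
    for label_html, value_html in ROW_RE.findall(page):
        label = ' '.join(cell_lines(label_html))
        value = cell_lines(value_html)
        if label.startswith('Request number'):
            pending = True
            if any('Use mentioned link' in line for line in value):
                results.append(('URL', ''))
                pending = False
        elif label.startswith('Request info') and pending and value:
            state, details = value[0], ' '.join(value[1:])
            if 'on loan' in details.lower():
                state = 'on loan'
            results.append((state, details))
            pending = False
        # any other row (extra information between the two) is skipped
    return results
```

Rows that sit between 'Request number' and 'Request info' are ignored, and a page without the structure yields an empty list.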
Deep linking
Building URLs to deep-link into Pica is interesting for:
- screen-scraping
- linking people to records, or to searches
There is no direct documentation about the URL structure Pica uses. The following are some notes that may be useful as reference:
Pica displays a record's details automatically when a search has exactly one result, which is what you get when you search on a unique identifier such as a PPN (or you get no results at all).
This is nice when linking people to records or screen scraping based on e.g. PPNs, since it means you need no session or state - you can fetch all the data you want with a single request.
Based on the information below, it seems a structure like the following works well for PPNs:
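A minimal sketch of such a structure, reusing the URL pattern from check_availability above. The function name ppn_record_url is made up here, and the db and ikt defaults are just the values from that function's example; they will differ per installation.

```python
def ppn_record_url(host, ppn, db="1", ikt="12", lang="EN"):
    """Build a single-request URL that should show the record for one PPN.

    The db/ikt defaults are assumptions copied from the example call of
    check_availability; use the values of your own installation.
    """
    # '%25%33%44' is '=' percent-encoded twice, so FRM reads 'ppn=<ppn>'.
    return "%s/LNG=%s/DB=%s/IKT=%s/FRM=ppn%%25%%33%%44%s/CLK?TRM=%s" % (
        host, lang, db, ikt, ppn, ppn)


print(ppn_record_url("http://opc.ub.rug.nl", "780755812"))
```

Because the PPN matches at most one record, fetching this URL needs no session or earlier search.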
I did hit some problems doing the same query on a different host that should - I think - use the same database, but probably didn't use exactly the same Pica version/configuration. I should investigate.
It seems that there are three distinct parts to the URL:
- General parameters (slash-delimited before the command)
- A main command
- The command's parameters (&-style URL arguments)
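To make the three parts concrete, here is a small sketch that takes a Pica URL apart. It only reflects the structure as guessed in these notes; split_pica_url is a made-up name, and the values of general parameters are deliberately left undecoded.

```python
from urllib.parse import urlsplit, parse_qsl


def split_pica_url(url):
    """Split a Pica URL into (general parameters, command, command parameters)."""
    parts = urlsplit(url)
    segments = [s for s in parts.path.split('/') if s]
    # The command is the last path segment, unless that is itself VAR=VAL.
    command = segments.pop() if segments and '=' not in segments[-1] else ''
    # General parameters are the slash-delimited VAR=VAL segments before it.
    general = [tuple(s.split('=', 1)) for s in segments if '=' in s]
    # Command parameters are ordinary &-style URL arguments.
    params = parse_qsl(parts.query, keep_blank_values=True)
    return general, command, params


print(split_pica_url("/DB=1/SET=21/TTL=1/NXT?FRST=11"))
```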
Both kinds of parameters are case-sensitive VAR=VAL pairs. It seems that most (but apparently not all (verify)) command parameters may also appear as general parameters. For example, the following two are equivalent:
There seem to be some that can only appear as command parameters (or only as general parameters)(verify)
I suspect that all general parameters should be doubly URL-encoded (verify), e.g. an = in your value has to appear as %25%33%44. I know this applies to FRM; many other fields take only basic alphanumeric characters, so you don't have to worry about it in most other cases.
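A sketch of the double encoding with Python's urllib.parse. The helper names are made up. Note that the standard quote() gives the shorter form %253D for '=', where the form seen above, %25%33%44, also encodes the '3' and the 'D'. Both decode to '=' after two passes, but I have not verified that Pica accepts the shorter form.

```python
from urllib.parse import quote, unquote


def double_encode(value):
    """Percent-encode a value twice, for use as a general (path) parameter."""
    return quote(quote(value, safe=''), safe='')


def double_decode(value):
    """Undo two passes of percent-encoding."""
    return unquote(unquote(value))


print(double_encode("ppn=123"))      # standard double encoding of the '='
print(double_decode("%25%33%44"))    # the form seen in these notes
```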
Commands (searches, result browsing)
- CMD?ACT=something is a search action, where something is one of:
  - SRCH - OR search
  - SRCHA - AND search (default (?))
  - BRWS - Index (implies TRM and IKT)
  - RLV - Reorder by relevance (likely implies SET)
  - AND, OR, NOT - constrict or expand set (likely implies SET)
- CLK: I'm not entirely sure what this does, but it is often usable as a 'search' as well. Avoid it until you know what it is, I suppose.
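Putting the pieces together, a search deep link might be built as sketched below. This assumes IKT and TRM are passed as command parameters next to ACT; the function name search_url and the defaults (DB=1, IKT=1016 for 'all words') are assumptions for illustration, not documented behaviour.

```python
from urllib.parse import urlencode

# The ACT values listed in these notes.
SEARCH_ACTIONS = {"SRCH", "SRCHA", "BRWS", "RLV", "AND", "OR", "NOT"}


def search_url(host, term, db="1", ikt="1016", action="SRCHA", lang="EN"):
    """Build a CMD?ACT=... search URL (a sketch under the assumptions above)."""
    if action not in SEARCH_ACTIONS:
        raise ValueError("unknown ACT value: %r" % action)
    general = "/LNG=%s/DB=%s" % (lang, db)           # slash-delimited general part
    query = urlencode([("ACT", action), ("IKT", ikt), ("TRM", term)])
    return "%s%s/CMD?%s" % (host.rstrip("/"), general, query)


print(search_url("http://opc.ub.rug.nl", "smith jones"))
```

urlencode() turns spaces into '+', which is the usual form for query strings.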
General and/or common
Things that are relatively commonly present include:
DB is necessary: it refers to the database/source to look in.
The value will be specific to your installation. You may have multiple databases, but usually there is one main one that you want to hardcode into searches and deep links.
IKT: the field(s) to search in / type of search.
The codes are based on the Z39.50 Bib-1 use attributes; most are taken from it directly, but some are not. Note also that each database can have its own set of IKTs (verify).
Lifted from our version:
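Code  Keyword (use in browser search query)  Description (Dutch) [English]
1016  ALL      alle woorden [all words]
4     TTL      titelwoorden [title words]
5     TTI      titelwoorden: tijdschrift [title words: journal]
1004  AUT      auteur [author]
8063  HTL      hele titel tijdschrift [full journal title]
2     COR      corporatie [corporate body]
1006  CON      congres [conference]
5004  BCL      basisclassificatiecode (GOO) [basic classification code]
5040  GOO      GOO-trefwoord [GOO subject heading]
1009  PAO      trefwoord: persoon [subject heading: person]
7020  (none?)  centr.onderw.code RuG (tot 1993) [central subject code RuG, until 1993]
8110  DCL      decentrale onderwerpscode [decentralised subject code]
7     ISB      ISBN (monografieen) [monographs]
8     ISS      ISSN (periodieken) [periodicals]
9000  (blank)  jaar van uitgave [year of publication]
9004  TAA      taalcode (ned, fra, grk, grn (modern grieks), fri(es), spa, dui) [language code; grn is modern Greek]
9005  LAN      landcode (nl,fr,de,gb,be,es,vs,su) [country code]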
TRM: the query, often a simple string.
You can use booleans and simple filters, as in "smith AND mat bgm" (where mat refers to material types).
In addition to the keywords listed in the IKT list above, you can use the following:
PRS - person's name
PPN - Pica Production Number
SIG - Signature (I have some info on this elsewhere)
NUM - ISSN *or* ISBN
UPD - date of update/entry
UIT - Publisher's name
DRK - Printer/Publisher name
URL - URL (why?)
OPI - Internal note (?)
PER - period (?)
MAT - Material code
  From help:
    A - Articles
    B - Books
    Q,R - Archive pieces
    T - Magazines, serials
    D - Summaries
    E - Microfiches
    I - Illustrative
    K - Cartography
    V - Video (audiovisual)
    G - Sound/Audio (also subjects, it seems)
    M - Sheet music
    H - Manuscripts (Handschriften)
    L,U - Letters
    S - Software
    O - Online
    X - Other/Unknown
  Apparently also:
    P - person
    F - ?
Note that this material code resembles but is different from Pica material codes that you can find in records (and mentioned below).
URL paths for browsing result sets
/DB=1/SET=21/TTL=1/SHW?FRST=1
/DB=1/SET=21/TTL=1/NXT?FRST=11
/DB=1/SET=21/TTL=1/PRS=PP/SHW?FRST=1
Browsing search results: (needs SET reference)
- NXT?FRST=11 shows a page of results starting at item 11
- SHW?FRST=11 shows item 11
- PRS - presentation style
  - PP - shows internal code view
  - others? (HOL?)
- TTL - seems related to pages, not sure how
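As an illustration of paging through a result set, the sketch below generates NXT paths. It assumes pages of 10 items (inferred only from FRST going from 1 to 11 in the examples above) and copies TTL=1 from those examples without knowing what it means. result_page_paths is a made-up name, and the SET number has to come from an earlier search.

```python
def result_page_paths(db, set_id, total, page_size=10):
    """Yield NXT paths that page through result set `set_id`.

    page_size=10 is an assumption; TTL=1 is copied from the example paths.
    """
    for first in range(1, total + 1, page_size):
        yield "/DB=%s/SET=%s/TTL=1/NXT?FRST=%d" % (db, set_id, first)


for path in result_page_paths("1", "21", total=25):
    print(path)
```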
Various of these are related to fields you can set in the interface or browse to somehow:
LNG: interface language, useful to get some sort of canonical form when screen-scraping. (Note: languages may be broken in an installation, and it's possible these are not fixed codes.)
- NE for Dutch
- EN for English
- FR for French
- DU for German
...although the actual data necessary to present each may not be present in an installation.
FRM: this value is shown in the search input in the form. While you could say this is purely decoration (unless/until the user decides to click 'search'), not specifying it means the last search query may be shown there, taken from a cookie. To avoid confusion, you probably want to duplicate the real search query (TRM) in here (or perhaps effectively clear it).
Two passes of percent-encoding seem to apply here.
- IMPLAND=Y seems to set search type to AND instead of OR (the default?)
- SRT is sort order, can be
  - YOP (Year Of Publishing) or
  - RLV (Relevance)
- SV - seems to control whether the 'saved set' menu option appears (values: Y or N)
- SET refers to a result set (used when narrowing). Note that in general, SET often means the 'last result set' and is not necessarily required for the current action
- MAT - filter by material types. I've not seen this in the URL (yet?). Related:
  - MATSET?DBUHS&TOP=DBUHS - both seem to be necessary, not sure how yet
  - NOMAT=Y - remove this filter
- TTL - ? (guess: fetch title by index, so TTL=1 fetches the first? (verify))
- SID - Search ID?
- FKT - ?
- BOR_U - ?
- CHARSET=ISO-8859-1 - or maybe UTF-8, but that causes the test interface to only respond with errors. This value seems to be remembered per session.
- /BCL - subject list, by Nederlandse Basisclassificatiesysteem
- LBSREQUEST?PPN=780755812 - ILL request
- /loan, /loan/USERINFO (requires login)
'Related things' search, by PPN. Records and people have PPNs, as do various other things.
Possibly the only way to find relations.
A PPN refers to the concept of an overall entry.
Since Pica can do the overall management of a segmented library and something may be present at multiple locations, EPN is used to refer to individual copies of a PPN.
In the case of dictionaries, encyclopediae, etc., an EPN copy will refer to a group of individually loanable physical things.
Records from Pica will often contain the material type. The field is documented (see this web search, although I cannot find the English version), but most systems use it in a much simpler fashion, using no more than the first one or two positions.
Note that this is not a strictly controlled field, so the set of codes that is used (and correctness of that use) can vary per system.
You can generally conclude that something that starts with...:
- A effectively refers to paper-based materials (though this is approximate given electronic articles)
- O refers to online material (you probably want to look at the second character)
- B refers to audiovisual material
- G refers to audio, regularly music
- M refers to printed music
- S refers to (machine-readable) software or data
- I refers to illustrations
- K refers to maps
- V refers to other objects (such as museum items)
The second position gives more details, but you would generally only want to look at it if the first position is A, or perhaps O.
- Asv refers to an article
- Ab refers to journals, newspapers
- Aa (and most other A-somethings) usually means books
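The rules of thumb above can be written down as a small lookup. This is only a sketch of those rules: describe_material is a made-up name, the labels are mine, and since the field is not strictly controlled, real data will need local adjustments.

```python
# First-position meanings, as listed above.
FIRST_POSITION = {
    'O': 'online material',
    'B': 'audiovisual material',
    'G': 'audio',
    'M': 'printed music',
    'S': 'software or data',
    'I': 'illustrations',
    'K': 'maps',
    'V': 'other objects',
}


def describe_material(code):
    """Give a rough description for a Pica material type code from a record."""
    code = code.strip()
    if not code:
        return 'unknown'
    if code.startswith('Asv'):
        return 'article'
    if code.startswith('Ab'):
        return 'journal or newspaper'
    if code.startswith('A'):
        return 'book (probably)'   # Aa and most other A-somethings
    return FIRST_POSITION.get(code[0], 'unknown')


for example in ('Aau', 'Asv', 'Abvz', 'Kxx'):
    print(example, '->', describe_material(example))
```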