OCLC Pica notes

⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).

Screen-scraping the availability

Notes

The structure to match availability information from the record-viewing HTML:

<tr>
  <td valign="top" class="preslabel"><strong>Request number: </strong></td>
  <td valign="bottom" class="presvalue">UB zaal Wisk.Natuurw.Tech. uwnt 190S 013<BR></td>
</tr><tr>
  <td valign="top" class="preslabel"><strong>Request info: </strong></td>
  <td valign="bottom" class="presvalue">lendable<BR>Available Zaal W.N.T. (3e etage UB)</td>
</tr>

This structure can

  • not appear at all,
  • appear once,
  • appear several times (up to three seen so far),
  • appear as only the 'Request number' row, e.g. containing just 'Use mentioned link', which indicates there is no second ('Request info') row to go with it, or
  • appear with other <tr>s, containing other information, between the two rows mentioned here.
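The pairing rules above can be kept separate from the HTML parsing. A minimal sketch, assuming the rows were already extracted in document order as (label, text fragments) tuples; the name pair_rows and that input shape are made up for illustration and are not something Pica or BeautifulSoup provides:

```python
def pair_rows(rows):
    """Pair 'Request number' rows with the 'Request info' row that follows.

    rows: iterable of (label, [text fragments]) in document order.
    Returns a list of (request_number, info_fragments_or_None).
    """
    pairs = []
    pending = None                      # a 'Request number' still waiting for its info row
    for label, texts in rows:
        # Labels may be written with non-breaking spaces and a trailing colon
        key = label.replace('\xa0', ' ').strip().rstrip(':').lower()
        if key == 'request number':
            if pending is not None:     # previous item never got an info row
                pairs.append((pending, None))
            pending = ' '.join(texts)
        elif key == 'request info' and pending is not None:
            pairs.append((pending, list(texts)))
            pending = None
        # any other row (the <tr>s in between) is skipped
    if pending is not None:
        pairs.append((pending, None))
    return pairs
```

Because an unpaired 'Request number' is flushed as soon as the next one shows up, an info row is never attached to the wrong item.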

The HTML (4.01 right now) is not strictly valid, so you need a parser that tolerates broken markup. Since I use Python, I use BeautifulSoup for this, which also makes it easy to pick data out of the parsed document. Many other languages have similar libraries (e.g. Rubyful Soup) and/or a wrapper around the library form of (HTML) Tidy.


As to the states: these are the English values seen so far (you probably want to ask for a specific language, e.g. with LNG=EN, to avoid the interface defaulting to another):

  • 'restricted circulation'
  • 'on approval'
  • 'photocopy only'
  • 'lendable' -- when an item is 'lendable', its details can contain:
    • 'available'
    • 'on loan'
    • 'not yet available'
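For display you usually want one status per item. A small sketch that folds a (state, details) pair into one of the values listed above; the function name effective_status is hypothetical, and matching case-insensitively is an assumption based on the 'Available' seen in the HTML sample:

```python
# Checked in this order: 'not yet available' contains 'available'
LENDABLE_DETAILS = ('not yet available', 'on loan', 'available')


def effective_status(state, details):
    """Return the most specific status for one (state, details) pair."""
    state = state.strip().lower()
    if state != 'lendable':
        return state                    # e.g. 'photocopy only', 'on approval'
    lowered = details.lower()
    for marker in LENDABLE_DETAILS:
        if marker in lowered:
            return marker
    return state                        # lendable, but details not recognised
```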

Example code

Note: The code we actually use has been updated since this was written, partially because an earlier version broke after a Pica update. The version here is written for Python 3 and BeautifulSoup 4 (bs4).

As such, please treat this code as pseudocode.

import re
import urllib.request

from bs4 import BeautifulSoup


def _value_texts(label):
    """ Text fragments of the td that follows the td containing this label. """
    td = label.find_parent('td')
    cell = td.find_next_sibling('td') if td is not None else None
    return list(cell.stripped_strings) if cell is not None else []


def check_availability(ppn, host, db, ikt):
    """ Gets the state of some material by its PPN,
        and possibly some extra information.
        I use this in a '<dfn title="%s">%s</dfn>'%(details,state) sort of way.

        Returns a list of zero or more such (state,details) pairs.

        Fill in the settings as fits your own setup. For example:
          check_availability(ppn, 'http://opc.ub.rug.nl', db='1', ikt='12')
    """
    ## Fetch and parse
    url = "%s/LNG=EN/DB=%s/IKT=%s/FRM=ppn%%25%%33%%44%s/CLK?TRM=%s" % (host, db, ikt, ppn, ppn)
    with urllib.request.urlopen(url) as handle:
        data = handle.read()                     # the page's fetched HTML
    soup = BeautifulSoup(data, 'html.parser')    # lenient parse, returns doc structure

    ## Extract information
    ret = []
    # \s also matches the non-breaking spaces the labels are written with
    for rn in soup.find_all(string=re.compile(r'Request\s+number:')):
        if 'Use mentioned link' in ' '.join(_value_texts(rn)):
            ret.append(('URL', ''))
            continue
        # next info row in the document; may belong to a later item, FIXME
        rinfo = rn.find_next(string=re.compile(r'Request\s+info:'))
        if rinfo is None:                        # appears at all?
            continue
        atx = _value_texts(rinfo)                # text in the cell next to the label
        if len(atx) > 1:
            state, details = atx[0], ' '.join(atx[1:])   # first should be the state
            if 'n loan' in details:              # interesting to show
                state = 'on loan'
            ret.append((state, details))
    return ret

Deep linking

Building URLs to deep-link into PICA is interesting for:

  • web scraping
  • linking people to searches
  • linking people to records

There is no direct documentation about the URL structure Pica uses. The following are some notes that may be useful as reference:


Pica displays a record's details automatically when a search has exactly one result. A search on an identifier such as a PPN gives you exactly that (or no results at all).

This is nice when linking people to records or screen scraping based on e.g. PPNs, since it means you need no session or state - you can fetch all the data you want with a single request.

Based on the information below, it seems a structure like the following works well for PPNs:

http://opc.ub.rug.nl/LNG=EN/DB=1/IKT=12/FRM=ppn%25%33%44840351267/TRM=840351267/CMD?ACT=SRCH

I did hit some problems doing the same query on a different host that should - I think - use the same database, but probably didn't use exactly the same Pica version/configuration. I should investigate.
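As a sketch, the URL above can be assembled like this. The function name ppn_record_url and its defaults are illustrative only; the DB, IKT and LNG values are the ones from the example and will differ per installation:

```python
def ppn_record_url(host, ppn, db='1', ikt='12', lng='EN'):
    """Build a deep link that should show the record with this PPN."""
    # FRM carries 'ppn=<ppn>' with the '=' written doubly encoded
    frm = 'ppn%25%33%44' + ppn
    return '%s/LNG=%s/DB=%s/IKT=%s/FRM=%s/TRM=%s/CMD?ACT=SRCH' % (
        host, lng, db, ikt, frm, ppn)


print(ppn_record_url('http://opc.ub.rug.nl', '840351267'))
```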


Structure

It seems that there are three distinct parts to the URL:

  • General parameters (slash-delimited before the command)
  • A main command
  • The command's parameters (&-style URL arguments)

Both kinds of parameters are case-sensitive VAR=VAL pairs. It seems that most (but apparently not all (verify)) command parameters may also appear as general parameters. For example, the following two are equivalent:

/DB=1/CLK?IKT=12&TRM=245870547
/DB=1/IKT=12/TRM=245870547/CLK

There seem to be some that can only appear as command parameters (or only as general parameters)(verify)
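To make the three parts concrete, here is a minimal sketch that assembles a path from them. build_path is a made-up helper, and it does no encoding at all, so it assumes plain alphanumeric values:

```python
def build_path(general, command, params=()):
    """general, params: sequences of (name, value) pairs, kept in order."""
    # General parameters: slash-delimited, before the command
    path = ''.join('/%s=%s' % (name, value) for name, value in general)
    path += '/' + command
    # Command parameters: &-style URL arguments after the command
    if params:
        path += '?' + '&'.join('%s=%s' % (name, value) for name, value in params)
    return path


# The two equivalent forms from the text:
print(build_path([('DB', '1')], 'CLK', [('IKT', '12'), ('TRM', '245870547')]))
print(build_path([('DB', '1'), ('IKT', '12'), ('TRM', '245870547')], 'CLK'))
```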


I suspect that all general parameters should be doubly URL-encoded (verify) (e.g. an = in your value has to appear as %25%33%44). I know this applies to FRM (while many other fields take only basic alphanumeric characters, so you don't have to worry about it in most other cases)
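A sketch of double encoding with the standard library. Note that this gives %253D for an =, not the %25%33%44 seen in the URLs here. Both decode to = after two passes, but that Pica treats the two forms the same is an assumption I have not verified:

```python
from urllib.parse import quote, unquote


def double_encode(value):
    """Percent-encode a value twice, e.g. for use in FRM."""
    once = quote(value, safe='')        # '=' becomes '%3D'
    return quote(once, safe='')         # '%3D' becomes '%253D'


encoded = double_encode('ppn=840351267')
print(encoded)
print(unquote(unquote(encoded)))        # two decoding passes give the value back
print(unquote(unquote('%25%33%44')))    # the form seen in the URLs also decodes to '='
```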

Commands (searches, result browsing)

  • CMD?ACT=something is a search action, where something is one of:
    • SRCH - OR search
    • SRCHA - AND search (default (?))
    • BRWS - Index (implies TRM and IKT)
    • RLV - Reorder by relevance (likely implies SET)
    • AND, OR, NOT - constrict or expand set (likely implies SET)
  • CLK I'm not entirely sure about, but is often usable as 'search' as well. Avoid it until you know what it is, I suppose.
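For linking people to searches, something like the following may work. It is a sketch under assumptions: search_path is a hypothetical helper, it passes IKT and TRM as command parameters (the notes above only show that form with CLK), and it assumes command parameters need just one pass of percent-encoding:

```python
from urllib.parse import quote, urlencode

# The ACT values from the list above that start a new search or index browse
SEARCH_ACTIONS = ('SRCH', 'SRCHA', 'BRWS')


def search_path(db, ikt, term, act='SRCH', lng='EN'):
    """Build the path part of a search deep link."""
    if act not in SEARCH_ACTIONS:
        raise ValueError('unknown search action: %r' % (act,))
    # quote_via=quote writes spaces as %20 rather than '+'
    query = urlencode([('ACT', act), ('IKT', ikt), ('TRM', term)], quote_via=quote)
    return '/LNG=%s/DB=%s/CMD?%s' % (lng, db, query)


print(search_path('1', '1016', 'smith jones', act='SRCHA'))
```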


Parameters

General and/or common

Things that are relatively commonly present include:

DB

Necessary: it refers to the database/source to look in.

Example: DB=2.51

The value will be specific to your installation. You may have multiple databases, but usually there is one main one that you want to hardcode into searches and deep links.

IKT

The field(s) to search in / type of search.

Example: IKT=1016


The codes are based on the Z39.50 Bib-1 use attributes: most are taken from it directly, but some are not. Note also that each database can have its own set of IKTs (verify).

Lifted from our version:

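 Code    Description (Dutch)                                                In English                                Keyword in a browser search query
 1016    alle woorden                                                       all words                                 ALL
 4       titelwoorden                                                       title words                               TTL
 5       titelwoorden: tijdschrift                                          title words: journal                      TTI
 1004    auteur                                                             author                                    AUT
 8063    hele titel tijdschrift                                             full journal title                        HTL
 2       corporatie                                                         corporate body                            COR
 1006    congres                                                            conference                                CON
 5004    basisclassificatiecode (GOO)                                       basic classification code (GOO)           BCL
 5040    GOO-trefwoord                                                      GOO subject heading                       GOO
 1009    trefwoord: persoon                                                 subject heading: person                   PAO
 7020    centr.onderw.code RuG (tot 1993)                                   central subject code RuG (until 1993)     (none?)
 8110    decentrale onderwerpscode                                          decentralised subject code                DCL
 7       ISBN (monografieen)                                                ISBN (monographs)                         ISB
 8       ISSN (periodieken)                                                 ISSN (periodicals)                        ISS
 9000    jaar van uitgave                                                   year of publication
 9004    taalcode (ned, fra, grk, grn (modern grieks), fri(es), spa, dui)   language code (grn is modern Greek)       TAA
 9005    landcode (nl,fr,de,gb,be,es,vs,su)                                 country code                              LAN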
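If you build URLs in code, a lookup between the numeric codes and the keywords saves some guessing. A sketch with the values copied from the table above (so specific to our installation); the names IKT_KEYWORDS and ikt_for are made up:

```python
# IKT code -> keyword, as listed above. 7020 and 9000 have no known keyword.
IKT_KEYWORDS = {
    1016: 'ALL', 4: 'TTL', 5: 'TTI', 1004: 'AUT', 8063: 'HTL',
    2: 'COR', 1006: 'CON', 5004: 'BCL', 5040: 'GOO', 1009: 'PAO',
    8110: 'DCL', 7: 'ISB', 8: 'ISS', 9004: 'TAA', 9005: 'LAN',
}
KEYWORD_TO_IKT = {keyword: code for code, keyword in IKT_KEYWORDS.items()}


def ikt_for(keyword):
    """Return the IKT code for a search keyword such as 'ISB'."""
    return KEYWORD_TO_IKT[keyword.upper()]


print(ikt_for('isb'))
```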


TRM

The query, often a simple string.

You can use booleans and simple filters like "smith AND mat bgm" (where mat refers to material types).

In addition to the keywords listed in the IKT list above, you can use the following:

   PRS - person's name
   PPN - Pica Production Number
   SIG - Signature (I have some info on this elsewhere)
   NUM - ISSN *or* ISBN
   UPD - date of update/entry
   UIT - Publisher's name
   DRK - Printer/Publisher name
   URL - URL (why?)
   OPI - Internal note (?)
   PER - period (?)
   MAT - Material code
     From help:
       A - Articles
       B - Books
       Q,R - Archive pieces
       T - Magazines, serials
       D - Summaries
       E - Microfiches
       I - Illustrative
       K - Cartography
       V - Video (audiovisual)
       G - Sound/Audio    (also subjects, it seems)
       M - Sheet music
       H - Manuscripts (Handschriften)
       L,U - Letters
       S - Software
       O - Online
       X - Other/Unknown
     Apparently also:
       P - person
       F - ?

Note that this material code resembles, but is different from, the Pica material codes that you can find in records (mentioned below).
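A sketch of composing such a TRM value with a material filter, following the "smith AND mat bgm" form. The helper name with_material_filter is hypothetical, and it only accepts the letters from the help list (not the P and F that apparently also exist):

```python
# Material letters from the help text above
MATERIAL_CODES = set('ABQRTDEIKVGMHLUSOX')


def with_material_filter(query, materials):
    """Append a material filter to a query, e.g. 'smith' + 'bgm'."""
    letters = materials.upper()
    unknown = sorted(set(letters) - MATERIAL_CODES)
    if unknown:
        raise ValueError('unknown material code(s): %s' % ''.join(unknown))
    return '%s AND mat %s' % (query, letters.lower())


print(with_material_filter('smith', 'bgm'))
```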

URL paths for browsing result sets

Examples:

/DB=1/SET=21/TTL=1/SHW?FRST=1
/DB=1/SET=21/TTL=1/NXT?FRST=11
/DB=1/SET=21/TTL=1/PRS=PP/SHW?FRST=1


Commands:

Browsing search results: (needs SET reference)

  • NXT?FRST=11 shows a page of results starting at item 11
  • SHW?FRST=11 shows item 11

Parameters:

  • PRS - presentation style
    • PP - shows internal code view
    • others? (HOL?)
  • TTL - seems related to pages, not sure how
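When scraping a whole result set you can step through it with NXT. A sketch, assuming ten results per page (guessed from the FRST=11 example) and copying TTL=1 from the examples without knowing what it does; result_page_paths is a made-up name:

```python
def result_page_paths(db, set_number, total, page_size=10):
    """Paths for each page of a result set holding `total` items."""
    return ['/DB=%s/SET=%s/TTL=1/NXT?FRST=%d' % (db, set_number, first)
            for first in range(1, total + 1, page_size)]


for path in result_page_paths('1', 21, 25):
    print(path)
```

Since a SET refers to a result set within a session, these paths presumably only make sense with the session of the search that created the set.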



Interface-related

Several of these relate to fields you can set in the interface or browse to somehow:

LNG

Interface language, useful to get some sort of canonical form when screen-scraping. (Note: languages may be broken in an installation, and it's possible these are not fixed codes).

Example: LNG=EN

Observed languages:

  • NE for Dutch
  • EN for English
  • FR for French
  • DU for German

...although the actual data necessary to present each may not be present in an installation.


FRM

This value is shown in the search input of the form. You could say it is purely decoration (unless/until the user decides to click 'search'), but when it is not specified the form may show the last search query instead, taken from a cookie. To avoid confusion, you probably want to duplicate the real search query (TRM) in here (or perhaps effectively clear it).

Two passes of percent-encoding seem to apply here.

Unsorted

  • IMPLAND=Y seems to set search type to AND instead of OR (the default?)
  • SRT is sort order, can be
    • YOP (Year Of Publishing) or
    • RLV (Relevance)
  • SV - seems to control whether the 'saved set' menu option appears (values: Y or N)
  • SET refers to a result set (used when narrowing). Note that in general, SET often just means 'the last result set' and is not necessarily needed for the current action


  • MAT - filter by material types. I've not seen this in the URL (yet?). Related:
  • MATSET?DBUHS&TOP=DBUHS - both seem to be necessary, not sure how yet
  • NOMAT=Y - remove this filter


  • TTL - ? (guess: fetch title by index, so TTL=1 fetches the first? (verify))
  • LRSET=1
  • SID - Search ID?
  • FKT - ?
  • BOR_U - ?
  • CHARSET=ISO-8859-1 - or maybe UTF-8, but that causes the test interface to only respond with errors. This value seems to be remembered per session.
  • /BCL - subject list, by Nederlandse Basisclassificatiesysteem
  • LBSREQUEST?PPN=780755812 - ILL request
  • /loan, /loan/USERINFO (requires login)
  • HELP

PPN relations

'Related things' search, by PPN. Records and people have PPNs, as do various other things.

Example: /DB=1/REL?PPN=072367857


Possibly the only way to find relations.

Other Pica notes and links

PPN, EPN

A PPN refers to the concept of an overall entry.

Since Pica can do the overall management of a segmented library and something may be present at multiple locations, EPN is used to refer to individual copies of a PPN.

In the case of dictionaries, encyclopediae, etc., an EPN copy will refer to a group of individually loanable physical things.

Material code

Records from Pica will often contain the material type. It is documented (see this web search, although I cannot find the English version), although most systems use it in a much simpler fashion, using no more than the first one or two positions.

Note that this is not a strictly controlled field, so the set of codes that is used (and correctness of that use) can vary per system.


You can generally conclude that something that starts with...:

  • A effectively refers to paper-based materials (though this is approximate given electronic articles)
  • O refers to online material (you probably want to look at the second character)
  • B refers to audiovisual material
  • G refers to audio, regularly music
  • M refers to printed music
  • S refers to (machine-readable) software or data
  • I refers to illustrations
  • K refers to maps
  • V refers to other objects (such as museum items)

The second position gives more details, but you would generally only want to look at it if the first position is A, or perhaps O.

  • Asv refers to an article
  • Ab refers to journals, newspapers
  • Aa (and most other A-somethings) usually means books
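The rules of thumb above could be coded roughly as follows. This is a sketch: describe_material is a hypothetical name, the labels are my paraphrases of the list, and reading 'Asv' as "second position s means article" is an assumption. Since the field is not strictly controlled, treat the result as a hint only:

```python
FIRST_POSITION = {
    'A': 'paper-based material',
    'O': 'online material',
    'B': 'audiovisual material',
    'G': 'audio',
    'M': 'printed music',
    'S': 'software or data',
    'I': 'illustrations',
    'K': 'maps',
    'V': 'other objects',
}


def describe_material(code):
    """Give a rough description for a record's material code."""
    code = code.strip()
    if not code:
        return 'unknown'
    first = code[0].upper()
    # The second position is only looked at for A-codes
    if first == 'A' and len(code) > 1:
        second = code[1].lower()
        if second == 's':
            return 'article'
        if second == 'b':
            return 'journal or newspaper'
        return 'book (probably)'
    return FIRST_POSITION.get(first, 'unknown')
```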