Solr use notes

These are primarily notes; they won't be complete in any sense, and exist to contain fragments of useful information.
This hasn't been updated for a while, so it could be outdated (particularly where it concerns something that evolves constantly, such as software).

(See also Apache projects and related notes#Solr (subproject) - overview, context of other projects)


Install and config
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Solr is often used via servlets (it comes with Jetty for a quick start, and has been used in most or all containers by different people; Tomcat has reportedly had fewer problems with Unicode).


See also:

Container specifics:


Solr configuration is specific to a Solr instance, so multiple Solr setups need multiple container instances.

Note that different distributions (and admins) manage containers in different ways, meaning that configuration and index files may be in various places. It probably helps to study your installation enough to at least know where to find files.

In Ubuntu, the solr-tomcat5.5 package is missing a dependency on libxpp3-java, so install that yourself. (Its absence will cause an error mentioning XmlPullParserException.)
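
For example, on a stock Ubuntu setup that would be something like:

sudo apt-get install libxpp3-java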

(If you're new to or generally shy away from Java, administering Solr and everything around it can be a daunting learning curve. You may want to look at Sphinx instead.)



The configuration of the Solr instance itself is mostly:

  • schema.xml, which all new document adds adhere to (you can run into problems after significant changes; usually the easiest way out is a complete reindex)
  • solrconfig.xml
    • lucene index and indexing parameters
    • solr handler config
    • search config
    • solr cache config
    • and more

Note that the default solrconfig.xml's dismax configuration contains references to (fields defined in) the example schema, which can cause handler errors if you don't change them when you change the schema -- with cryptic descriptions such as 'The request sent by the client was syntactically incorrect (undefined field text)'.


Interfacing with Solr
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


You can interface with Solr in a few ways, but since the main way of interfacing is largely just HTTP, most clients use that in some form or other.

When you're working locally you can choose to interface using SolrJ [1] (instead of HTTP) for lower latency, which for some uses leads to noticeably better speed, and for others doesn't matter much. (You can also imitate what SolrJ does with your own code (sometimes termed Embedded Solr [2]), but this is often fragile.)


...so you can send in data and commands in the following ways and more:

  • tell Solr where to fetch documents from (sent in via the servlet, e.g. using curl)
  • data via post.jar (Java program)
  • data via curl (often in XML form, or CSV(verify))
  • Java code (Solr-specific, networked)


On post.jar:

  • see its -help
  • To point post.jar at a different port or host, add a parameter like:
-Durl='http://192.168.0.2:8180/solr/update'
-Durl='http://127.0.0.1:8080/solr/update'
  • post.jar sends a commit by default. To avoid it until you actually want it, use
-Dcommit=no    
  • the data source can be filename references (-Ddata=files, the default(verify)), stdin (-Ddata=stdin), or command-line arguments (-Ddata=args).


The -Ddata=args option can be handy to send specific commands, for example:

# Clear current index data:
java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"

# Enter some prepared docs, without committing yet
java -Ddata=files -Dcommit=no -jar post.jar some_test_data/*.xml

# Commit the changes
java -Ddata=args -jar post.jar "<commit/>"

# Optimize the index
java -Ddata=args -jar post.jar "<optimize/>"


Actually, post.jar is a fairly thin convenience wrapper around HTTP POSTs, so the last is roughly equivalent to something like:

curl http://192.168.0.2:8180/solr/update -H "Content-Type: text/xml" --data-binary '<optimize/>'


You can even use GETs, but long URLs can get truncated by browsers (or other user agents) and servers, so it's not preferable from code. It's useful enough for quick manual things like:

http://192.168.0.2:8180/solr/update?stream.body=%3Coptimize/%3E


See also:


Response format

When you use the HTTP interface, you can ask it to send the response in one of many formats, via the wt parameter (an example follows the list below).

The formats include:

  • XML (data, and the default)
  • XSLT applied to the basic XML (handy for simple transforms and even immediate presentation)
  • HTML via Velocity templating
  • JSON (data)
  • Python (data)
  • Ruby (data)
  • PHP, Serialized PHP
  • javabin[3], a binary format based on Java marshalling(verify)
  • a custom response writer class
  • the name of a response writer configured in solrconfig.xml (a class plus specific initialization parameters)
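
For example, asking for JSON instead of the default XML is just a matter of adding the parameter (the host, port, and query here are placeholders):

curl 'http://127.0.0.1:8080/solr/select?q=apache&wt=json&indent=on'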

See also:

Operations and handlers
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Operations are handled via RequestHandler subclasses - see e.g. http://lucene.apache.org/solr/api/org/apache/solr/request/SolrRequestHandler.html for a decent list.

The most central operations you'll probably use are:

  • query: well, that.
  • add: add/update documents, to be indexed (expects data already prepared into the fields that the schema defines, by default as XML (UTF-8), and possibly as CSV text(verify)). (You can probably expect on the order of 10 to 200 documents to be indexed per second(verify), depending on CPU, IO system, document size, and analysis complexity.)
  • delete: when necessary.
  • commit: adds and deletes are transactional; commit applies changes since the last commit. Since commits flush various caches (and for some other reasons, e.g. segment-related ones), it's generally best to commit after a decently sized batch job instead of after each of its items; see the sketch after this list. (Note that you can also configure automatic commits: after some number of pending documents, some time after the last add, and such.)
  • optimize: merge and optimize index data. Not strictly necessary, but good for compactness and speed (varying a little with how you configured segmenting behaviour). Can take some time, so it's probably of most use after large commits.
  • some other management
  • use of other features via separate requests / RequestHandlers
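
To make the add/commit cycle concrete, a sketch using curl (the id and name fields come from the example schema; substitute your own fields, host, and port):

# add (or replace, by uniqueKey) one document; not yet visible to searches
curl http://127.0.0.1:8080/solr/update -H 'Content-Type: text/xml' --data-binary \
  '<add><doc><field name="id">doc1</field><field name="name">An example document</field></doc></add>'

# make pending adds/deletes visible
curl http://127.0.0.1:8080/solr/update -H 'Content-Type: text/xml' --data-binary '<commit/>'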


Note that the standard search handler also eases combination/chaining of certain (more or less standard) components, including:

  • QueryComponent
  • QueryElevationComponent
  • MoreLikeThis
  • Highlighting
  • FacetComponent
  • ClusteringComponent
  • SpellCheckComponent
  • TermVectorComponent
  • StatsComponent
  • Statistics
  • debug
  • TermsComponent


See also

Schema design
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


See also:


The example schema is fairly well commented (look e.g. in [4] or your local copy of Solr). Some of the more central/interesting parts include (a small combined sketch follows the list):

  • types/fieldType
    nodes
    • (possible subnodes mentioning analyzer, tokenizer, filter specification)
    • name="text"
    • class="Solr.TextField", usually one of the following default classes:
      • solr.StrField (no analysis, indexed/stored verbatim. sortable)
      • solr.TextField (analysis as specified. not sortable)
      • solr.IntField, solr.SortableIntField. The sortable ones work correctly in range queries, so you usually want those.
      • solr.LongField, solr.SortableLongField
      • solr.FloatField, solr.SortableFloatField
      • solr.DoubleField, solr.SortableDoubleField
      • solr.BoolField ("true" or "false")
      • solr.DateField (formatted like 1995-12-31T23:59:59.999Z, with only the fractional-second part optional. While it looks like ISO 8601 (see e.g. Common date formats) it doesn't behave like it)
    • sortMissingLast="true" or sortMissingFirst="true" (documents that do not have this field are sorted last, or first. Works when internal representation is string-based)
    • omitNorms="true"
    • omitTermFreqAndPositions="true" (can be an optimization for non-main-text and certain short fields)


  • fields/field
    for the actual fields, with:
    • name="foo"
    • type="(a fieldType name)"
    • indexed="true" or "false" according to your needs
    • stored="true" or "false" according to your needs
    • compressed="true" lets you compress stored data (in StrFields and TextFields), compressThreshold lets you specify a minimum size that data should have before things are actually compressed (because <~1KB is probably nonsense)
    • multiValued="false" (if there may be multiple fields within a document -- i.e. multiple values before tokenization. )
    • omitNorms="false" --- if true, you lose index-time doc/field boosting, tf, lengthNorms. You generally do not want this on your main text field(s), but it may help efficiency to add it for things like fields used only for faceting {{{1}}}(note that norm values are stored in memory, costing a byte per document, which is usually acceptable until you hit collection sizes of hundreds of millions, a billion or so)}} (verify)
    • termVectors="true" (default: false) true is mostly/useful for MoreLikeThis, which then needs less precalc each time it runs on a document.
    • termOffsets="true" (verify), useful to speed up highlighting
    • termPositions="true" (verify)
    • default="0" (value to use if none in mentioned in document. Mostly useful for numeric fields) (verify)
  • fields/dynamicField
    can be handy for
    • automatically applying types based on field name (for copyFields)
    • automatically ignoring or indexing document fields not mentioned in the schema (robustness to change and variation, without changing the schema)
  • copyField
    from a source to a dest field name (you can use wildcards in source). It appends contents of specific fields to others, before analysis, which can be handy if you e.g. want to
    • add the keyword set you facet on to your general index.
    • index the same text in different ways (in different fields)
    • merge text from fields for an easier (and fairly redundant) 'all' field search
    • present an automatically copied-'n'-truncated preview
    • copy certain fields to one to be used for spelling correction
  • uniqueKey
    to judge document identity
  • defaultSearchField
    specifies what field to search in when one is not specified in the query.
  • solrQueryParser
    mostly lets you specify defaultOperator="OR" (the default) or defaultOperator="AND"
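
To tie the above together, a minimal sketch of a schema fragment -- the field names here are made up for illustration, though the classes and factories are standard ones:

<types>
  <fieldType name="string" class="solr.StrField" sortMissingLast="true"/>
  <fieldType name="text" class="solr.TextField">
    <analyzer>
      <tokenizer class="solr.WhitespaceTokenizerFactory"/>
      <filter class="solr.LowerCaseFilterFactory"/>
    </analyzer>
  </fieldType>
</types>
<fields>
  <!-- stored so it can be returned; indexed so it can act as the uniqueKey -->
  <field name="id"      type="string" indexed="true" stored="true"/>
  <field name="title"   type="text"   indexed="true" stored="true"/>
  <field name="body"    type="text"   indexed="true" stored="false"/>
  <!-- catch-all search field, filled via copyField below -->
  <field name="alltext" type="text"   indexed="true" stored="false" multiValued="true"/>
</fields>
<copyField source="title" dest="alltext"/>
<copyField source="body"  dest="alltext"/>
<uniqueKey>id</uniqueKey>
<defaultSearchField>alltext</defaultSearchField>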



When you change a schema, you change the way new documents are added, so you may want to reindex all documents you have, and possibly clear the index first.

If you are re-indexing everything, it's fastest to remove the index files from disk and start anew. The command <delete><query>*:*</query></delete> is a special case handled somewhat more efficiently than many individual deletes; it's not quite as fast as removing the files, but doesn't require filesystem write access (so it can be handy when you're not the sysadmin, or want to be careful).

This means: (from [5])

  • stopping the server
  • changing the schema.xml
  • deleting/clearing the index
  • (an <optimize/> may be a good idea)
  • starting up
  • indexing your data again.
  • an optional <optimize/>


You may sometimes want an 'all searchable content' field, particularly as a default, when other fields are specific-use things. Search will usually rely on a default set of fields (you can't specify a value meaning 'all', though you can list them all if you know the schema). It may be more efficient to search in this one field than in a lot of separate ones(verify), although you may get better matching if your analysis on different fields is clever and non-unifiable.


Different processing/storage details on fields are necessary or useful (and in a few cases prohibitive) for certain types of queries, features, and needs. However, the extra stored details are mostly for secondary features, and none of them are necessary for things like search, sort, or phrase queries.

For general details, see

and more specific considerations:


For a summary:

  • indexed: You need this to search in, sort on, and facet on this field.
  • stored: to retrieve the field content from the index. If you both index and store a field, you'll see that in index size. There are use cases for fields that are stored but not indexed.
  • multiValued: means that there can be multiple same-named fields in a document (order is maintained in the index). Use this when you need it and can't get the same effect from tokenizing the field. Note that some features don't work on multivalued fields (mostly using it as a key, and sorting on that field)
  • norms (omitNorms): norms are used in length-normalization boosting, and use memory. They can be useful where text fields of significantly different lengths should not be allowed to outweigh each other, but will otherwise waste memory (one byte per document, so this can add up if you unnecessarily leave it on for many fields). (sorting?)
  • termVectors: used by 'more like this' (highlighting will also use them if present)
  • termPositions: highlighting will use this (conditions?).
  • compressed, whether to compress stored values (only applies to StrField and TextField), and compressThreshold, the content size below which not to bother

Filter stack - Analyzers, Tokenizers, Token Filters
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

For a large part this is simply Lucene's filter stack, but Solr expands it somewhat.
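
For illustration, a sketch of what an analyzer stack looks like in schema.xml -- the tokenizer/filter factories named here are standard ones, but the particular combination is just an example:

<fieldType name="text_en" class="solr.TextField" positionIncrementGap="100">
  <!-- analysis applied at index time -->
  <analyzer type="index">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
  <!-- analysis applied to queries; should usually mirror the index-time stack -->
  <analyzer type="query">
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.SynonymFilterFactory" synonyms="synonyms.txt" ignoreCase="true" expand="true"/>
    <filter class="solr.StopFilterFactory" words="stopwords.txt" ignoreCase="true"/>
    <filter class="solr.LowerCaseFilterFactory"/>
    <filter class="solr.SnowballPorterFilterFactory" language="English"/>
  </analyzer>
</fieldType>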

See also:


On Scoring and boosting
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Boosting refers to changes in score, usually automatic ones. In Solr/Lucene, this includes:

  • query clause boost: weight given to a part of the query (over another)
  • coordination factor: boost according to the number of terms in a multi-term query that match
  • FunctionQuery allows use of field content as numbers for boost, with possible operations


Document/field data related:

  • term frequency: documents with more occurrences of a term score higher
  • inverse document frequency: rarer terms count more than common ones
  • smaller fields score higher than large ones (because smaller ones are likely to be more specific)


Index-time boosts:

  • index-time document boost
  • index-time per-field boost (note that this is per field, not per value -- if you have a multivalued field, all values in that document for that same field get an identical consolidated value)(verify)




search: doing queries; query control
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you use the servlet for search, you'll probably choose to search using one of the query parsers -- standard (Lucene-style) or dismax.

(controlled via various parameters; see http://wiki.apache.org/solr/CoreQueryParameters)



Note there are three types of clauses in terms of requirement:

  • optional, a.k.a. SHOULD (the default)
  • mandatory (when you use +)
  • prohibited (when you use -)


Both standard and dismax support (a combined example follows this list):

  • sort:
    • optional
    • on one or more of: (each asc/desc)
      • score
      • one or more fields (multiValued="false" indexed="true")
  • start, rows: the offset and number of items to fetch (defaults 0 and 10, respectively)
    • Note that there is a result cache
  • fq (filter query): Query that does subselection, without altering score
    • these sub-set selections can be cached
  • fl (field list): fields to return (default is *, i.e. all)
    • (not the same as field fetching laziness?(verify))
  • defType: specify the query parser to use (see QParserPlugin [6])
  • omitHeader
  • timeAllowed: timeout for the search part of queries, in milliseconds. Don't set this too low, as that could lead to arbitrary partial results under load. 0 means no limit.
  • debugQuery
  • explainOther
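
A quick sketch combining a few of these (the host, port, and the type field are placeholders):

http://127.0.0.1:8080/solr/select?q=apache&fq=type:article&fl=id,title,score&sort=score+desc&start=0&rows=10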


Dismax has more parameters/features (example after the list), including:

  • qf: list fields to query, and field boost to use for each
  • tie: since dismax searches multiple fields, a match in different fields has different implications for the score. Read up on disjunction max versus disjunction sum, or just choose something like 0.1.
  • mm: require that some part of the optional-type clauses must match [7] [8]
  • pf (Phrase Fields): once matching documents have been identified (via fq and qf), boost documents in which the query terms appear in close proximity in these fields
  • ps (phrase slop): the slop to use when using pf
  • qs (Query Phrase Slop): slop used for the main query string (so affects matching, not just scoring)
  • bf (Boost Function): Boost based on some value. One way of using FunctionQuery; see below for more detail. Useful for things like:
    • weighing by some pre-calculated value in a field
    • weighing by closeness to some value/date
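
For example, a dismax query sketch (the field names and boosts are made up):

http://127.0.0.1:8080/solr/select?defType=dismax&q=apache+solr&qf=title^2.0+body^0.8&pf=title&ps=3&mm=2&tie=0.1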


See also:


Queries and FilterQueries; queryResultCache and filterCache
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Note that the Solr-specific caches are associated with a specific Index Searcher(verify). The caches have an item-count limit, and entries seem to have no timeout.


The results of the main part of a query (the q parameter) will be calculated and entered into the queryResultCache (as a list of matching doc IDs -- and in many cases that's not a huge list).


A complex query is composed from many search results over the index, so caching these as a whole is relatively less useful.

(Dramatically put, complex queries will rarely hit, meaning you'll get higher-than-necessary eviction rates and general contention on that cache. The fact that some parts might be reusable (particularly more mechanical parts of a query, such as filtering on document type) is one of the reasons for filter queries.)


Filter queries (the fq parameter) are part of overall query execution (both standard and dismax), but are evaluated differently (and can serve a different function):

In the evaluation of the query as a whole, each document in the index either matches the filter query or it doesn't (boolean). The direct function is a simple filter; the score is also not affected by fq.

The per-document (orderless, scoreless) match-or-not is also storable in the Solr filter cache, to be reused by other queries that use the same filter criterion. If you implement faceted browsing, this is a great way to speed it up.


Arguably, the queryResultCache is often less memory-hungry than the filterCache -- although which of the two you try to exploit more depends a lot on what you're actually trying to cache. The former stores a list of IDs, but only for the document set that matched, which tends to be a relatively short list. The filter cache stores a bitmap of match/no-match for the entire document set, using one bit per document, so ~122KB per million docs per filterCache entry. In many cases this is more, so pick your filter queries so that they make a difference, and avoid cases where entries would never be looked up (you'd just be adding overhead).

The filter cache is best used with specific knowledge of your schema and data. For example:

  • As a rule of thumb, the fields mentioned in fq should generally have a relatively small value set (or be reduced by the filter query, e.g. as by number ranges).
  • controlled keywords, subject codes, or other such property metadata may be useful to use (copy/move) to fq

Free-text search doesn't really fit in fq: uncontrolled text makes for filter-cache entries that will rarely be reused.

Faceting is a feature that can use the filter cache (not too surprisingly), and it can be useful to place query parts on keyword or string fields into fq.
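
For example (with hypothetical field names): keep the free text in q, and move the controlled-vocabulary restrictions into fq so they can be cached and reused across queries:

http://127.0.0.1:8080/solr/select?q=climate+change&fq=subject:environment&fq=doctype:report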

Index inspection, processing/relevance debugging
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Notes on further solr features
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Highlighting/summary

  • works on fields that are indexed and stored (it is based on positions), and only makes sense when such a field is tokenized (example below)
  • generated per field (verify)
  • configurable number per field (default 1)
  • (Lucene can also guess based on new text, making it easier to use text stored elsewhere)
  • http://wiki.apache.org/solr/HighlightingParameters
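
A minimal example of turning it on per query (the field name is a placeholder):

http://127.0.0.1:8080/solr/select?q=apache&hl=true&hl.fl=body&hl.snippets=2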


FunctionQuery


Faceting

Note that it can be handy to create facet-only fields, e.g. on identities, names, subjects, and such.
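
For example, fetching only facet counts on such a field (the field name here is made up):

http://127.0.0.1:8080/solr/select?q=*:*&rows=0&facet=true&facet.field=subject_facet&facet.mincount=1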


Spell checking / alternate term suggestor


More Like This


Query Evaluation

  •  ?