Lucene notes

Some practicalities of search systems

Lucene and things that wrap it: Lucene · Solr · ElasticSearch

Search-related processing: tf-idf

⌛ This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software or research).

Lucene

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Lucene is an Apache project aimed at indexing and searching documents.

Lucene is the basis of some fuller packages, today most relevantly ElasticSearch, but also Solr, Nutch, and others.


Lucene itself does

  • indexing
includes incremental indexing (in chunks)
  • searches on that index
more than just an inverted index - stores enough information to support NEAR queries, wildcards, regexps
and search fairly quickly (IO permitting)
  • scoring of results
fairly mature in that it works decently even without much tweaking.


Its model and API are modular enough that you can specialize - which is what projects like Solr and ElasticSearch do.


Written in Java. There are wrappers for various other languages (and a few ports), and you can do most things via its APIs.


See also:


Some technical/model notes and various handy-to-knows

Parts of the system

On segments
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

IO behaviour should usually tend towards O(lg n).


Segments can be thought of as independent chunks of searchable index.

Lucene updates an index one segment at a time, and never in place, which avoids blocking and so allows continuous updates without interrupting search.


Having segments at all does mean a little more work when searching (each segment has to be searched independently), but it also makes search parallelizable. Keeping segments a reasonable size means adding/updating documents is reasonable to do at any time, with less need for things like rebalancing a single large structure.


Segments are occasionally merged, and you can control how and when via some parameters/conditions in config. This lets you balance the work necessary for index updates, against the work necessary for a particular search. (it also influences the need for optimizers, and the speed at which indexing happens)

Many will want to err on the faster-search side when tweaking these parameters, unless updates are expected to be relatively continuous.

When you explicitly tell the system to optimize the index, segments are merged into a single one, which helps speed, but tends to be a lot of work on large indexes, and will only make much difference if you haven't done so lately. You can argue over how often you should optimize on a continually updating system.


In general, you want more, smaller segments when you are doing a lot of updates, and fewer, larger segments when you are not - because a search has to look at all segments, while an update (eventually) has to rewrite the whole segment it goes into.


Relevant config

mergeFactor:

  • low values make for more common segment merges
  • high values do the opposite: fewer merge operations, more index files to search
  • 2 is low, 25 is high. 10 is often reasonable.

maxBufferedDocs:

  • number of documents that indexing stores in-memory before they are stored/merged. Larger values tend to mean somewhat faster indexing (varying a little with your segmenting settings), at the cost of a little delay in indexing, and a little more memory use.

maxMergeDocs:

  • The maximum number of documents merged into a segment (verify). Factory default seems to be Integer.MAX_VALUE (2^31-1, ~2 billion), i.e. effectively no maximum.

useCompoundFile:

  • compounds a number of data structures usually stored separately into a single file. This will often be a little slower - the setting is mostly useful to reduce the number of open file handles, which can be handy when you have many segments on a host (verify)
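
As a concrete sketch of where these settings live (the older 2.x-era IndexWriter setters that these notes seem to describe; the path and values are made up):

 // Sketch only: tuning a 2.x-era IndexWriter. Path and values are illustrative.
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;

 public class TuningExample {
     public static void main(String[] args) throws Exception {
         // open (and create) an index on disk
         IndexWriter writer = new IndexWriter("/data/myindex", new StandardAnalyzer(), true);

         writer.setMergeFactor(10);                  // middle ground between merge work and files to search
         writer.setMaxBufferedDocs(100);             // buffer more docs in RAM before flushing a segment
         writer.setMaxMergeDocs(Integer.MAX_VALUE);  // effectively no cap on segment size
         writer.setUseCompoundFile(true);            // fewer open file handles, at a small speed cost

         // ...addDocument() calls go here...

         writer.optimize();                          // merge down to one segment (expensive on large indexes)
         writer.close();
     }
 }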


More on indices
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The index package takes Documents and adds them to the index (with whatever processing is necessary).

IndexWriter:

  • lets you add Documents to an index
  • lets you merge indices
  • lets you optimize indices

IndexReader:

  • is what you want for deleting (by Term or Query - it requires an index search/scan)

IndexSearcher: (wrapped around an IndexReader)

  • Lets you do searches


Note that you can only update/replace a document by deleting it, then re-adding it.
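
A sketch of that delete-then-re-add pattern, against the older (2.x-era) API; the field names and path are made up, and the IndexReader has to be closed before the IndexWriter can take the write lock:

 // Sketch only: "updating" a document by deleting the old version and re-adding it.
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;

 public class UpdateExample {
     public static void main(String[] args) throws Exception {
         // delete the old version, by Term, via an IndexReader
         IndexReader reader = IndexReader.open("/data/myindex");
         reader.deleteDocuments(new Term("id", "doc42"));
         reader.close();  // releases the write lock and flushes the deletes

         // re-add the new version via an IndexWriter
         IndexWriter writer = new IndexWriter("/data/myindex", new StandardAnalyzer(), false);
         Document doc = new Document();
         doc.add(new Field("id", "doc42", Field.Store.YES, Field.Index.UN_TOKENIZED));
         doc.add(new Field("text", "the new body text", Field.Store.NO, Field.Index.TOKENIZED));
         writer.addDocument(doc);
         writer.close();
     }
 }

(Later 2.x versions also add deleteDocuments() and updateDocument() directly on IndexWriter, which avoids the separate reader step - verify for your version.)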


Note that an IndexWriter takes an Analyzer. This seems to be a (perhaps overly cautious) way of modelling around the possible mistake of mixing Analyzers in the same index (the only choice it leaves you is analyzing or not analyzing a field) - which is a bad idea unless you know enough about Lucene to hack your way around it.


An index can be stored in a Directory, a simple abstraction that lets indices be stored in various ways. Note that index reading and writing is subject to certain restrictions, which seem to be geared to making caching easier (verify)

There exist Directory implementations for a set of files on disk, RAM, and more. RAM indices may sound nice, but they are obviously much more size-limited than disk-based ones, and there are better ways to speed up and scale Lucene indexes.


Apparently, all index reading and writing is thread- and process-safe (verify), but since it works in an MVCC-esque, transactional way, index consumers should re-open the index if they want to see up-to-date results, and writers use advisory locking and so are effectively sequential.

Note that using many IndexSearchers on the same data set will lead to more memory use and more file handles, so within a single multi-threaded searcher, you may want to reuse an IndexSearcher.
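
A sketch of what sharing and refreshing a searcher can look like (2.x-era API; isCurrent()/reopen() appeared during the 2.x line, so check your version):

 // Sketch only: keep one shared IndexSearcher, and re-open it when the index has changed.
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.search.IndexSearcher;

 public class SearcherHolder {
     private IndexReader reader;
     private IndexSearcher searcher;

     public SearcherHolder(String path) throws Exception {
         reader = IndexReader.open(path);
         searcher = new IndexSearcher(reader);
     }

     // call this occasionally (or before searches) to pick up index updates
     public synchronized IndexSearcher getSearcher() throws Exception {
         if (!reader.isCurrent()) {
             IndexReader newReader = reader.reopen();   // cheap if little has changed
             if (newReader != reader) {
                 // note: simplified - real code must make sure no other thread is
                 // still using the old searcher before closing it
                 searcher.close();
                 reader.close();
                 reader = newReader;
                 searcher = new IndexSearcher(reader);
             }
         }
         return searcher;
     }
 }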

Documents, Fields, Tokens, Terms; Analyzers, Tokenizers, Filters

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Lucene uses a specific model that you need to fit your needs into, which you may want to know about even if you won't have to deal with all of it directly and most of it has sane defaults.


The document package handles the abstraction of things into Document objects, which are the things that can be indexed and returned as a hit.

Note that you control what parts of a document get analyzed (transformed) or not, indexed or not, and stored in the index.


From lucene's point of view, a document consists of named Fields. Exactly what you create fields for varies with the sort of setup you want. Beyond a main text-to-index field you could add fields for a title, authors, a document identifier, URLs e.g. for origin/presentation, a 'last modified' date, part names/codes, SKUs, categories or keyword fields to facet on, and whatnot.


Note that you can have multiple fields with the same name in a document (multivalued fields), which can make sense in certain cases, for example when you have multiple keywords that you want analysed as independent and not concatenated strings. You may want to read up on the effects this has on scoring and certain features.


From the index's point of view, a Document contains a number of Fields, and once indexed, they contain Terms - the processed form of field data that comes out of the analysis, which are the actual things that are searched for.

That processing usually consists of at least a tokenizer that splits up input text into words (you can give codes and such different treatment), and filters that transform tokens, and sometimes remove them or create new ones.

You can control how much transformation gets applied to the values in a specific field, from no change at all (index the whole thing as a single token) to complex splitting into many tokens, stemming inflected words, taking out stopwords, inserting synonyms, and whatnot.


On Fields and the index

For each field, you should decide:

  • whether to index it (if not, it's probably a field you only want to store)
  • whether to store the (pre-tokenized, if you tokenize) data in the index
  • whether to tokenize/analyse it (whether you want it broken down)
  • how to tokenize/analyse it


Different combinations fit different use cases. For example:

  • text to search in would be indexed, after analysis. Can be stored too, e.g. to present previews, sometimes to present the whole record from the index itself (not so handy if relatively large), or to support highlighting (since the indexed form of text is a fairly mangled form of the original).
  • A source URL for site/intranet would probably be both stored (for use) and indexed (to support in-url searching)
  • a path to a local/original data file (or perhaps immediately related data) might be only stored.
  • a document ID would probably be both stored and indexed, but not tokenized
  • Something like a product code (or SKU) could be both stored and indexed, and you may want to tweak the analysis so that you know you can search for it as a unit, as well as for its parts.
  • controlled keyword sets, e.g. to facet on, would probably be indexed, but not analyzed, or only tokenized and not e.g. stemmed.


(Older documentation will mention four types of field (Text, UnStored, UnIndexed, Keyword), which was a simple expansion of the few options you had at that time. These days you have to decide each separately)
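
A sketch of what those combinations look like in code, using the 2.x-era Field constants (field names and values are made up):

 // Sketch only: one Document with differently-treated fields.
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;

 public class DocBuilder {
     public static Document build(String body, String url, String path, String id, String category) {
         Document doc = new Document();
         // main text: indexed and analyzed; stored here to allow previews/highlighting
         doc.add(new Field("text", body, Field.Store.YES, Field.Index.TOKENIZED));
         // source URL: stored for presentation, indexed to support in-url searching
         doc.add(new Field("url", url, Field.Store.YES, Field.Index.TOKENIZED));
         // path to the original file: stored only, never searched
         doc.add(new Field("path", path, Field.Store.YES, Field.Index.NO));
         // document ID: stored and indexed, but not tokenized
         doc.add(new Field("id", id, Field.Store.YES, Field.Index.UN_TOKENIZED));
         // controlled keyword (e.g. to facet on): indexed as a single unanalyzed token
         doc.add(new Field("category", category, Field.Store.YES, Field.Index.UN_TOKENIZED));
         return doc;
     }
 }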

More on tokens
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

A token represents an occurrence of a term in the text of a Field. As such, it consists of:

  • the text
  • the (start and end) character offsets of the term in the character stream
  • the position increment relative to the previous term
  • optional: a (lexical) type, used by some analysers (specifically by filters in an analysis that wants to have some type/context awareness)

(Handing in the represented String directly is now deprecated, as it is slow.)

Token positions may be stored or not (depending on the field settings), support NEAR queries, and can also be handy during indexing. For example, a Filter that emits synonyms can pretend that various different words were present at the same original position.

Basics of Analyzers, Tokenizers, Filters

The analysis package primarily contains Analyzers.

Analyzer objects take a Reader (a character stream instead of a String, for scalability), apply a Tokenizer to emit Tokens (which are strings with, at least, a position), and may apply a number of Filters, which can alter that token stream (alter tokens, remove tokens, add tokens), before handing it to the indexing process.

As an example, the StandardAnalyzer consists of :

  • a StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • a StandardFilter (removes 's from the end of words and dots from acronyms)
  • a LowerCaseFilter (lowercases all text)
  • a StopFilter (removes common English stopwords).

See also the analyzer list below; you can also compose and create your own.
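
For example, a minimal sketch of an analyzer composed the same way as StandardAnalyzer (written against the older 2.x-era TokenStream API; adjust for your version):

 // Sketch only: a tokenizer plus a chain of filters, as StandardAnalyzer does it.
 import java.io.Reader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.StopAnalyzer;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;

 public class MyAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         TokenStream stream = new StandardTokenizer(reader);            // split into tokens
         stream = new StandardFilter(stream);                           // strip 's and acronym dots
         stream = new LowerCaseFilter(stream);                          // lowercase everything
         stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);  // drop common stopwords
         return stream;
     }
 }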


Note that since analysis is actually a transformation that results in a somewhat mangled form of the input, you must use the same (or at least a similar enough) analyzer when querying in a particular field, or you will not get the results you expect, or even any.


Note that you have the option of using different analysis for different fields. You could see this as fields being different sub-indexes, which can be powerful, but be aware of the extra work that needs to be done.
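
One convenient way to do that is PerFieldAnalyzerWrapper, which picks an analyzer per field name. A sketch (2.x-era API; the field names are made up):

 // Sketch only: different analysis per field, one wrapper object handed to both
 // the IndexWriter and the QueryParser.
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.KeywordAnalyzer;
 import org.apache.lucene.analysis.PerFieldAnalyzerWrapper;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;

 public class Analyzers {
     public static Analyzer build() {
         PerFieldAnalyzerWrapper wrapper = new PerFieldAnalyzerWrapper(new StandardAnalyzer());
         wrapper.addAnalyzer("sku", new KeywordAnalyzer());       // product codes, indexed as-is
         wrapper.addAnalyzer("category", new KeywordAnalyzer());  // controlled keywords, unanalyzed
         return wrapper;  // other fields fall through to the StandardAnalyzer default
     }
 }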


More detailed token and field stuff

Phrase and proximity search relies on position information. In the token generation phase, the position increment of all tokens is 1 by default, which just means that each term is positioned one term after the other.

There are a number of cases you could think of for which you would want to play with position / position offset information:

  • enabling phrase matches across stopwords by pretending stopwords aren't there (filters dealing with stopwords tend to do this)
  • injecting multiple forms at the same position, for example different stems of a term, different rewrites of a compound, injecting synonyms at the same position, and such
  • avoiding phrase/proximity matches across sentence/paragraph boundaries


When writing an analyzer, it is often easiest to play with the position increment, which is set on each token and is relative to the previous term (not the next).

For example, for a range of terms that should appear at the same position, you would set all but the first to zero.

Technically, you could even use this for a term frequency boost, repeating a token with an increment of zero.
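
As an illustration, a sketch of a filter that injects one synonym at the same position as the original token (written against the old Token-returning 2.x API; the word pair is made up):

 // Sketch only: emit "fast" at the same position whenever "quick" is seen.
 import java.io.IOException;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;

 public class SingleSynonymFilter extends TokenFilter {
     private Token pending;  // a synonym token waiting to be emitted

     public SingleSynonymFilter(TokenStream input) {
         super(input);
     }

     public Token next() throws IOException {
         if (pending != null) {            // emit a queued synonym first
             Token t = pending;
             pending = null;
             return t;
         }
         Token t = input.next();
         if (t != null && "quick".equals(t.termText())) {
             Token synonym = new Token("fast", t.startOffset(), t.endOffset(), "SYNONYM");
             synonym.setPositionIncrement(0);  // same position as the original token
             pending = synonym;
         }
         return t;
     }
 }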


Multivalued fields, that is, those for which you add multiple chunks, act position-wise as if they are concatenated. When applicable, you can set a large position increment between these values to avoid phrase matches across them (a property of the field?(verify)).
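
If memory serves, that gap is what the Analyzer is asked for per field name, via getPositionIncrementGap(). A hedged sketch (2.x-era API):

 // Sketch only: a large gap between the values of a multivalued field, so
 // phrase/proximity queries don't match across them. (verify per version)
 import java.io.Reader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardTokenizer;

 public class GappedAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         return new LowerCaseFilter(new StandardTokenizer(reader));
     }

     public int getPositionIncrementGap(String fieldName) {
         return 100;  // positions of successive values are at least 100 apart
     }
 }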

Some other...

Tokenizers include:

  • StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • CharTokenizer
    • LetterTokenizer (splits on non-letters, using Java's isLetter())
      • LowercaseTokenizer (LetterTokenizer + LowercaseFilter, mostly for a mild performance gain)
    • RussianLetterTokenizer
    • WhitespaceTokenizer (only splits on whitespaces, so e.g. not punctuation)
  • ChineseTokenizer (splits ideograms)
  • CJKTokenizer (splits ideograms, emits pairs(verify))
  • NGramTokenizer, EdgeNGramTokenizer (generates n-grams for all n in a given range)
  • KeywordTokenizer (emits entire input as single token)


Filters include:

  • StandardFilter (removes 's from the end of words and dots from acronyms)
  • LowerCaseFilter
  • StopFilter (removes stopwords (stopword set is handed in))
  • LengthFilter (removes tokens not in a given range of lengths)
  • SynonymTokenFilter (injects synonyms (synonym map is handed in))
  • ISOLatin1AccentFilter (filter that removes accents from letters in the Latin-1 set. That is, replaces accented letters by unaccented characters (mostly/all ASCII), without changing case. Looking at the code, this is VERY basic and will NOT do everything you want. A better bet is to normalize your text (removing diacritics) before feeding it to indexing)
  • Language-specific (some stemmers, some not) and/or charset-specific: PorterStemFilter, BrazilianStemFilter, DutchStemFilter, FrenchStemFilter, GermanStemFilter, RussianStemFilter, GreekLowerCaseFilter, RussianLowerCaseFilter, ChineseFilter (stopword stuff), ThaiWordFilter
  • SnowballFilter: Various stemmers
  • NGramTokenFilter, EdgeNGramTokenFilter

Queries, searches, Hits

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

The search package has Searcher / IndexSearcher, which wraps an IndexReader, takes a Query, and returns Hits, an iteration of the matching result Documents.

There are some ways to deal with searching multiple and remote indices; see e.g. the relations to the Searchable interface.


The queryParser package helps turn a query string into a Query object.

Note that for identifier queries and mechanical queries (e.g. items which share these terms but not the identifier) you'll probably want to construct a Query object structure instead of using the QueryParser.

(Note that unlike most other parts of Lucene, the QueryParser is not thread-safe)


For the basic search functionality lucene has fairly conventional syntax.

In various cases you may wish to rewrite the query somewhat before you feed it to the query parser - in fact, you may well want to protect your users from having to learn Lucene-specific syntax.
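
To make the two routes concrete, a sketch against the older (2.x-era) API; the field names, index path, and query strings are made up:

 // Sketch only: a parsed user query next to a programmatically built one.
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;

 public class SearchExample {
     public static void main(String[] args) throws Exception {
         IndexSearcher searcher = new IndexSearcher("/data/myindex");

         // user-facing query: parse it with the *same* analyzer used at index time
         QueryParser parser = new QueryParser("text", new StandardAnalyzer());
         Query userQuery = parser.parse("indexing AND segments");

         // mechanical query: build the structure directly, no parser involved
         BooleanQuery mechanical = new BooleanQuery();
         mechanical.add(new TermQuery(new Term("text", "lucene")), BooleanClause.Occur.MUST);
         mechanical.add(new TermQuery(new Term("id", "doc42")), BooleanClause.Occur.MUST_NOT);

         TopDocs top = searcher.search(userQuery, null, 10);  // no filter, top 10
         System.out.println("total hits: " + top.totalHits);
         searcher.close();
     }
 }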



On hits:

  • Hit (mostly wraps a Document; allows fetching of the score, field values, and the Document object)



The older way: (deprecated, will be removed in lucene 3)



See also:


Query classes:

  • PhraseQuery: matches a sequence of terms, with optional allowance for words in between (the max distance, called slop, is handed in)
  • PrefixQuery: matches the start of words (meant for truncation queries like epidem*); see the sketch below for both
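
A sketch of constructing both (2.x-era API; field and terms are made up):

 // Sketch only: a sloppy phrase query and a prefix (truncation) query.
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.PhraseQuery;
 import org.apache.lucene.search.PrefixQuery;

 public class QueryExamples {
     public static PhraseQuery sloppyPhrase() {
         PhraseQuery phrase = new PhraseQuery();
         phrase.add(new Term("text", "information"));
         phrase.add(new Term("text", "retrieval"));
         phrase.setSlop(2);  // allow up to 2 positions of slack between the terms
         return phrase;
     }

     public static PrefixQuery truncation() {
         return new PrefixQuery(new Term("text", "epidem"));  // matches epidemic, epidemiology, ...
     }
 }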



Related to scoring:


Unsorted:


Examples (various based on pylucene examples):

On scoring

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Scoring uses:

  • per term per document:
    • tf for the term in the document
    • idf for the term
    • norm (some index-time boosts: document, field, length boosts)
    • term's query boost
  • per document:
    • coordination (how many of the query terms are found in a document)
  • per query
    • query normalization, to make scores between (sub)queries work on a numerically compatible scale, which is necessary to make complex queries work properly


  • more manual boosts:
    • boost terms (query time)
    • boost a document (index time)
    • boost a field's importance (in all documents) (index time)
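
For reference, the classic formula from the Similarity/DefaultSimilarity documentation of that era combines these factors roughly as follows (details and defaults vary per version):

 \[
 \mathrm{score}(q,d) = \mathrm{coord}(q,d)\cdot \mathrm{queryNorm}(q)\cdot
 \sum_{t \in q} \Big( \mathrm{tf}(t,d)\cdot \mathrm{idf}(t)^{2}\cdot \mathrm{boost}(t)\cdot \mathrm{norm}(t,d) \Big)
 \]

where, in DefaultSimilarity, tf defaults to the square root of the raw term frequency and idf to 1 + log(numDocs / (docFreq + 1)).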


See also:



See also

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Nutch (subproject)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Nutch extends Lucene into a somewhat more worked-out search engine: an indexer plus a crawler appropriate for the fairly uncontrolled content of the web, intranets, and such (it works incrementally, focuses on well-linked content, and so on).

It has parsers for some formats(verify), and a basic JSP (servlet) based interface.


Nutch also triggered development of the Hadoop framework, and can now use its HDFS and MapReduce for distributed crawling and indexing, and for keeping that index updated(verify).

It seems that searching from an index stored on the DFS is not ideal, so you would want to copy indices to a local filesystem for fast search (possibly also distributed, but note that that has a few hairy details).



http://wiki.apache.org/nutch/NutchTutorial


http://wiki.apache.org/nutch/MapReduce http://wiki.apache.org/nutch/NutchDistributedFileSystem

http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

http://wiki.apache.org/nutch/DissectingTheNutchCrawler

Some comparison, and notes on scaling

On scaling indices and search

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Throwing resources at any of these