Lucene notes

{{search stuff}}
 
{{old}}
 
==Lucene==
{{stub}}
 
Lucene is an Apache project aimed at indexing and searching documents.
 
Lucene is the basis of some fuller packages, today most relevantly ElasticSearch, but also Solr, Nutch, and others.
 
 
Lucene itself does
* indexing
:: includes incremental indexing (in chunks)
 
* searches on that index
:: more than just an inverted index - stores enough information to support NEAR queries, wildcards, regexps
:: and search fairly quickly (IO permitting)
 
* scoring of results
:: fairly mature in that it works decently even without much tweaking.
 
 
Its model and API are modular enough that you can specialize - which is what projects like Solr and ElasticSearch do.
 
 
Written in Java.
There are wrappers into various other languages (and a few ports), and you can do most things via data APIs.
 
 
See also:
* http://lucene.apache.org/
* http://lucene.apache.org/java/docs/ (docs)
 
 
 
===Some technical/model notes and various handy-to-knows===
====Parts of the system====
<!--
 
The index contains:
* A dictionary (all terms in the index)
* A dictionary index (faster access to the dictionary)
* Term postings
* Field data
* Term positions
 
Index parts are stored in separate segments.
 
 
Field Cache
 
 
http://onjava.com/onjava/2003/03/05/lucene.html
-->
 
=====On segments=====
{{stub}}
 
IO behaviour should usually tend towards O(lg n).
 
 
Segments can be thought of as independent chunks of searchable index.
 
Lucene updates an index one segment at a time, and never in-place, which avoids blocking and so allows continuous updates without interrupting search.
 
 
Having segments at all does mean a little more work when searching (each segment has to be searched independently), but it also makes search parallelizable, and keeping segments a reasonable size means adding/updating documents is reasonable to do at any time, with less need for things like rebalancing a single large structure.
 
 
Segments are occasionally merged, and you can control how and when via some parameters/conditions in config. This lets you balance the work necessary for index updates, against the work necessary for a particular search. (it also influences the need for optimizers, and the speed at which indexing happens)
 
Many will want to err on the faster-search side, unless updates are expected to be relatively continuous.
 
When you explicitly tell the system to optimize the index, segments are merged into a single one, which helps speed, but tends to be a ''lot'' of work on large indexes, and will only make much difference if you haven't done so lately.
You can argue over how often you should optimize on a continually updating system.
 
 
In general, you want more, smaller segments when you are doing a lot of updates, and fewer, larger segments when you are not - because search has to look at all segments, while an update only has to touch the segment it goes into.
 
 
'''Relevant config'''
 
mergeFactor:
* low values make for more common segment merges
* high values do the opposite: fewer merge operations, more index files to search
* 2 is low, 25 is high. 10 is often reasonable.
 
maxBufferedDocs:
* number of documents that indexing stores in memory before they are stored/merged. Larger values tend to mean somewhat faster indexing (varying a little with your segmenting settings) at the cost of a little delay in indexing, and a little more memory use.
 
maxMergeDocs:
* The maximum number of documents merged into a segment{{verify}}. Factory default seems to be Integer.MAX_VALUE (2^31-1, ~2 billion), effectively no maximum.
 
useCompoundFile:
* compounds a number of data structures usually stored separately into a single file. Will often be a little slower - the setting is mostly useful to lessen the number of open file handles, which can be handy when you have many segments on a host{{verify}}
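
As a rough illustration, setting these on an IndexWriter looks something like the following {{comment|(a hedged sketch against the older 2.x-era API - later versions move these settings elsewhere; the path and values are just placeholders)}}:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.store.FSDirectory;
 
 public class SegmentConfigSketch {
     public static void main(String[] args) throws Exception {
         IndexWriter writer = new IndexWriter(
                 FSDirectory.getDirectory("/data/index"),  // placeholder path
                 new StandardAnalyzer(),
                 true);                                    // true: create a new index
         writer.setMergeFactor(10);          // segments that accumulate before a merge
         writer.setMaxBufferedDocs(1000);    // docs buffered in memory before flushing
         writer.setMaxMergeDocs(Integer.MAX_VALUE);
         writer.setUseCompoundFile(true);    // fewer open file handles per segment
         writer.close();
     }
 }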
 
<!--
maxFieldLength:
* maximum size of the contents of a Field in a Document. Effectively controls truncation. Large numbers can mean more varying memory usage (and out-of-memory errors).
-->
 
=====More on indices=====
{{stub}}
The '''<tt>index</tt> package''' takes Documents and adds them to the index (with whatever processing is necessary).
 
IndexWriter:
* lets you add Documents to an index
* lets you merge indices
* lets you optimize indices
 
IndexReader:
* is what you want for deleting (by Term or Query - it requires an index search/scan)
 
IndexSearcher: (wrapped around an IndexReader)
* Lets you do searches
 
 
Note that you can only update/replace a document by deleting it and then re-adding it.
 
 
 
Note that an IndexWriter takes an Analyzer. This seems to be a (perhaps overly cautious) way of modelling the possible mistake of mixing Analyzers in the same index (it leaves only the choice of analyzing or not analyzing a field), which is a bad idea unless you know enough about lucene to be hacking your way around that.
 
 
An index can be stored in a Directory, a simple abstraction that lets indices be stored in various ways. Note that index reading and writing is subject to certain restrictions that seem to be geared to making caching easier {{verify}}
 
There exist Directory implementations for a set of files on disk, RAM, and more. RAM indices may sound nice, but these are obviously a ''lot'' more size-limited than disk caches, and there are better ways to speed up and scale lucene indexes.
 
 
Apparently, all index reading and writing is thread- and process-safe{{verify}}, but since it does so in an MVCC-esque transactional way, index consumers should re-open the index if they want to see up-to-date results, and writers use advisory locking and so are sequential.
 
Note that using many IndexSearchers on the same data set will lead to more memory use and more file handles, so within a single multi-threaded searcher, you may want to reuse an IndexSearcher.
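
A minimal sketch of how these classes fit together {{comment|(hedged, 2.x-era API; the path, field names and query are made-up examples, and the Field constants were called TOKENIZED/UN_TOKENIZED in older versions)}}:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.FSDirectory;
 
 public class IndexLifecycleSketch {
     public static void main(String[] args) throws Exception {
         // IndexWriter adds Documents, analyzing them with the Analyzer it was given
         IndexWriter writer = new IndexWriter(
                 FSDirectory.getDirectory("/data/index"), new StandardAnalyzer(), true);
         Document doc = new Document();
         doc.add(new Field("id", "42", Field.Store.YES, Field.Index.NOT_ANALYZED));
         doc.add(new Field("text", "some body text", Field.Store.NO, Field.Index.ANALYZED));
         writer.addDocument(doc);
         writer.close();
 
         // 'updating' means deleting by a Term and re-adding the document
         IndexReader reader = IndexReader.open(FSDirectory.getDirectory("/data/index"));
         reader.deleteDocuments(new Term("id", "42"));
         reader.close();
 
         // IndexSearcher (wrapping an IndexReader) runs queries
         IndexSearcher searcher = new IndexSearcher(FSDirectory.getDirectory("/data/index"));
         TopDocs top = searcher.search(new TermQuery(new Term("text", "body")), null, 10);
         System.out.println(top.totalHits + " hits");
         searcher.close();
     }
 }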
 
====Documents, Fields, Tokens, Terms; Analyzers, Tokenizers, Filters====
{{stub}}
 
Lucene uses a specific model that you need to fit your needs into, and which is worth knowing about even though you won't have to deal with all of its parts directly, and most have sane defaults.
 
 
The '''<tt>document</tt> package''' handles the abstraction of things into <tt>Document</tt> objects, which are the things that can be indexed and returned as a hit.
 
Note that you control what parts of a document get analyzed (transformed) or not, indexed or not, and stored in the index.
 
 
From lucene's point of view, a document consists of named <tt>Field</tt>s. Exactly what you create fields for varies with the sort of setup you want. Beyond a main text-to-index field you could add fields for a title, authors, a document identifier, URLs e.g. for origin/presentation, a 'last modified' date, part names/codes, SKUs, categories or keyword fields to facet on, and whatnot.
 
 
Note that you can have multiple fields with the same name in a document (multivalued fields), which can make sense in certain cases, for example when you have multiple keywords that you want analysed as independent and not concatenated strings.
You may want to read up on the effects this has on scoring and certain features.
 
 
 
From the index's point of view, a Document contains a number of Fields, and once indexed, they contain '''Terms''' - the processed form of field data that comes out of the analysis, which are the actual things that are searched for.
 
That processing usually consists of at least a tokenizer that splits up input text into words (you can give codes and such different treatment), and filters that transform tokens, and sometimes remove them or create new ones.
 
You can control how much transformation gets applied to the values in a specific field, from no change at all (index the whole thing as a single token) to complex splitting into many tokens, stemming inflected words, taking out stopwords, inserting synonyms, and whatnot.
 
 
 
=====On Fields and the index=====
 
For each field, you should decide:
* whether to index it {{comment|(if not, it's probably a field you only want to store)}}
* whether to store the {{comment|(pre-tokenized, if you tokenize)}} data in the index
* whether to tokenize/analyse it (''whether'' you want it broken down)
* ''how'' to tokenize/analyse it
 
 
Different combinations fit different use cases. For example:
* text to search in would be indexed, after analysis. ''Can'' be stored too, e.g. to present previews, sometimes to present the whole record from the index itself (not so handy if relatively large), or to support highlighting (since the indexed form of text is a fairly mangled form of the original).
 
* A source URL for site/intranet would probably be both stored (for use) and indexed (to support in-url searching)
 
* a path to a local/original data file (or perhaps immediately related data) might be only stored.
 
* a document ID would probably be both stored and indexed, but not tokenized
 
* Something like a product code (or SKU) could be both stored and indexed, and you may want to tweak the analysis so that you know you can search for it as a unit, as well as for its parts.
 
* controlled keyword sets, e.g. to facet on, would probably be indexed, but not analyzed, or only tokenized and not e.g. stemmed.
 
 
{{comment|(Older documentation will mention four types of field (Text, UnStored, UnIndexed, Keyword), which was a simple expansion of the few options you had at that time. These days you have to decide each separately)}}
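
For illustration, those decisions map onto Field construction roughly like this {{comment|(hedged sketch with 2.x/3.x-style constants; the field names are just examples)}}:

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 
 public class FieldChoicesSketch {
     static Document build(String bodyText, String url, String localPath,
                           String docId, String category) {
         Document doc = new Document();
         // searchable body text: analyzed; stored so previews/highlighting can use the original
         doc.add(new Field("text", bodyText, Field.Store.YES, Field.Index.ANALYZED));
         // source URL: stored for presentation, indexed to support in-url searching
         doc.add(new Field("url", url, Field.Store.YES, Field.Index.ANALYZED));
         // path to the original file: only stored, never searched
         doc.add(new Field("path", localPath, Field.Store.YES, Field.Index.NO));
         // document ID: stored and indexed, but kept as a single untokenized term
         doc.add(new Field("id", docId, Field.Store.YES, Field.Index.NOT_ANALYZED));
         // facet-style keyword: indexed as-is, not stored
         doc.add(new Field("category", category, Field.Store.NO, Field.Index.NOT_ANALYZED));
         return doc;
     }
 }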
 
=====More on tokens=====
{{stub}}
 
A token represents an occurrence of a term in the text of a Field. As such, it consists of:
* the text of the term
* the (start and end) offset of the term in the character stream
* the position increment relative to the previous term
* optional: a (lexical) type, used by some analysers (specifically by filters in an analysis that wants to have some type/context awareness)

{{comment|(Handing in the represented String directly is now deprecated, as it is slow.)}}

Token positions may be stored or not (depending on the field type); when stored they support NEAR queries, and they can also be handy during indexing. For example, a Filter that emits synonyms can pretend that various different words were present at the same original position.
 
=====Basics of Analyzers, Tokenizers, Filters=====
 
The '''<tt>analysis</tt> package''' primarily has Analyzers.
 
Analyzer objects take a Reader (a character stream instead of a String, for scalability), apply a Tokenizer to emit Tokens (which are strings with, at least, a position), and may apply a number of Filters, which can alter that token stream (altering, removing, or adding tokens), before handing this to the indexing process.
 
As an example, the <tt>StandardAnalyzer</tt> consists of :
* a StandardTokenizer {{comment|(basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)}}
* a StandardFilter {{comment|(removes 's from the end of words and dots from acronyms)}}
* a LowerCaseFilter (lowercases all text)
* a StopFilter (removes common English stopwords).
See also the analyzer list below; you can also compose and create your own.
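
Composing your own analyzer amounts to chaining a tokenizer and filters, roughly like this {{comment|(hedged sketch of the 2.x-era API - constructor signatures changed in later versions; this approximates what StandardAnalyzer does internally)}}:

 import java.io.Reader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.StopAnalyzer;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;
 
 public class MyAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         TokenStream stream = new StandardTokenizer(reader);   // split into tokens
         stream = new StandardFilter(stream);                  // strip 's and acronym dots
         stream = new LowerCaseFilter(stream);                 // lowercase everything
         stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS);  // drop stopwords
         return stream;
     }
 }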
 
 
Note that since analysis is actually a transformation that results in a somewhat mangled form of the input, you ''must'' use the same (or at least a similar enough) analyzer when querying in a particular field, or you will not get the results you expect, or even any.
 
 
 
Note that you have the option of using different analysis for different fields. You could see this as fields being different sub-indexes, which can be powerful, but be aware of the extra work that needs to be done.
 
 
=====More detailed token and field stuff=====
 
 
Phrase and proximity search relies on '''position information'''. In the token generation phase, the position increment of each token is 1 by default, which just means that each term is positioned one term after the other.
 
There are a number of cases you could think of for which you would want to play with position / position offset information:
* enabling phrase matches across stopwords by pretending stopwords aren't there (filters dealing with stopwords tend to do this)
* injecting multiple forms at the same position, for example different stems of a term, different rewrites of a compound, injecting synonyms at the same position, and such
* avoiding phrase/proximity matches across sentence/paragraph boundaries
 
 
When writing an analyzer, it is often easiest to play with the position offset, which is set on each token and refers to the offset to the previous term (not the next).
 
For example, for a range of terms that should appear at the same position, you would set all but the first to zero.
 
Technically, you could even use this for a term frequency boost, repeating a token with an increment of zero.
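
As a sketch of what that looks like in a filter {{comment|(hedged, using the older Token-returning TokenFilter API; <tt>lookupSynonym</tt> is a made-up placeholder)}}:

 import java.io.IOException;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;
 
 // Injects at most one alternative form per token, at the same position,
 // by giving the injected token a position increment of 0.
 public class OneSynonymFilter extends TokenFilter {
     private Token pending;   // synonym waiting to be emitted
 
     public OneSynonymFilter(TokenStream input) {
         super(input);
     }
 
     public Token next() throws IOException {
         if (pending != null) {            // emit a previously queued synonym
             Token t = pending;
             pending = null;
             return t;
         }
         Token current = input.next();
         if (current == null) {
             return null;                  // end of stream
         }
         String alt = lookupSynonym(current.termText());
         if (alt != null) {
             pending = new Token(alt, current.startOffset(), current.endOffset());
             pending.setPositionIncrement(0);   // same position as the original term
         }
         return current;
     }
 
     private String lookupSynonym(String term) {
         return null;   // placeholder - plug in a real synonym map here
     }
 }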
 
 
 
Multivalued fields, that is, those for which you add multiple chunks, position-wise act as if they are concatenated. When applicable, you can set a large position increment between these values to avoid phrase matches across them (a property of the field?{{verify}}).
 
=====Some other...=====
 
Tokenizers include:
* StandardTokenizer {{comment|(basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)}}
* CharTokenizer
** LetterTokenizer (splits on non-letters, using Java's <tt>isLetter()</tt>)
*** LowercaseTokenizer (LetterTokenizer + LowercaseFilter, mostly for a mild performance gain)
** RussianLetterTokenizer
** WhitespaceTokenizer (only splits on whitespaces, so e.g. not punctuation)
* ChineseTokenizer (splits ideograms)
* CJKTokenizer (splits ideograms, emits pairs{{verify}})
* NGramTokenizer, EdgeNGramTokenizer (generates n-grams for all n in a given range)
* KeywordTokenizer (emits entire input as single token)
 
 
Filters include:
* StandardFilter {{comment|(removes 's from the end of words and dots from acronyms)}}
 
* LowerCaseFilter
* StopFilter (removes stopwords (stopword set is handed in))
 
* LengthFilter (removes tokens not in a given range of lengths)
 
* SynonymTokenFilter (injects synonyms (synonym map is handed in))
 
* ISOLatin1AccentFilter (filter that removes accents from letters in the Latin-1 set. That is, replaces accented letters by unaccented characters (mostly/all ASCII), without changing case. Looking at the code, this is VERY basic and will NOT do everything you want. A better bet is to normalize your text (removing diacritics) before feeding it to indexing)
 
* Language-specific (some stemmers, some not) and/or charset-specific: PorterStemFilter, BrazilianStemFilter, DutchStemFilter, FrenchStemFilter, GermanStemFilter, RussianStemFilter, GreekLowerCaseFilter, RussianLowerCaseFilter, ChineseFilter (stopword stuff), ThaiWordFilter
 
* SnowballFilter: Various stemmers
 
* NGramTokenFilter, EdgeNGramTokenFilter
 
====Queries, searches, Hits====
{{stub}}
 
The '''<tt>search</tt> package''' has a Searcher / IndexSearcher, which takes an IndexReader, a Query, and returns Hits, an iteration of the matching result Documents.
 
There are some ways to deal with searching multiple and remote indices; see e.g. the relations to the [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Searchable.html Searchable] interface.
 
 
The '''<tt>queryParser</tt> package''' helps turn a query string into a Query object.
 
Note that for identifier queries and mechanical queries (e.g. items which share these terms but not the identifier) you'll probably want to construct a Query object structure instead of using the QueryParser.
 
(Note that unlike most other parts of Lucene, the QueryParser is not thread-safe)
 
 
For the basic search functionality lucene has fairly conventional syntax.
 
In various cases you may wish to rewrite the query somewhat before you feed it to the query parser - in fact, you may well want to protect your users from having to learn Lucene-specific syntax.
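
A small sketch of both routes {{comment|(2.x-era API; field names and query text are made-up examples - remember to use the same analyzer as at index time)}}:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 
 public class QueryBuildingSketch {
     public static void main(String[] args) throws Exception {
         // Route 1: parse user-typed query syntax
         QueryParser parser = new QueryParser("text", new StandardAnalyzer());
         Query parsed = parser.parse("lucene AND (index OR search)");
 
         // Route 2: build the Query structure yourself, e.g. for mechanical queries
         BooleanQuery built = new BooleanQuery();
         built.add(new TermQuery(new Term("text", "lucene")), BooleanClause.Occur.MUST);
         built.add(new TermQuery(new Term("id", "42")), BooleanClause.Occur.MUST_NOT);
 
         System.out.println(parsed + "\n" + built);
     }
 }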
 
 
 
 
On hits:
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Hit.html Hit] (mostly wraps a Document; allows fetching the score, field values, and the Document object)
 
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TopDocs.html TopDocs]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TopDocCollector.html TopDocCollector]
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TimeLimitedCollector.html TimeLimitedCollector]
 
 
The older way: (deprecated, will be removed in lucene 3)
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Hits.html Hits] (ranked list of documents. <!--You can iterate over the entire set with this, but that's usually a bad idea)-->
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/HitIterator.html HitIterator] <!-- you can fetch this from Hits -->
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/HitCollector.html HitCollector] <!-- meant to handle large sets, primarily used for filtering, sorting and such. You can call search and hand along one of these. -->
 
 
<!--
You generally don't want to iterate over the whole set. If you want to do something with all results, you may want to use Hitcollector instead.
 
It is generally more efficient to implement paging by re-doing a query and showing a new subset every time -- and more efficient yet to cache a few pages from a search at a time, because most people never look at more than a few pages, meaning you'll reduce search load at limited memory cost.
-->
 
 
 
See also:
* http://lucene.apache.org/java/2_4_0/queryparsersyntax.html
* http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Query.html
* http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/queryParser/surround/parser/QueryParser.html
 
<!--
Example parses:
* foo: TermQuery
* "new foo": PhraseQuery
* foo AND bar: BooleanQuery, TermQuery
 
Any query clause (query class instantiation) can be boosted, which implies that the score will be multiplied by this value if a document matches (in addition to regular scoring; the boost is 1.0 by default)
-->
 
Query classes:
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/TermQuery.html TermQuery]: matches a single (exact, indexed) term
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/BooleanQuery.html BooleanQuery]: (see also [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/BooleanClause.Occur.html BooleanClause.Occur].SHOULD, MUST, and MUST_NOT)
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/PhraseQuery.html PhraseQuery] matches a sequence of terms, with optional allowance for words inbetween (max distance, called slop, is handed in)
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/MultiPhraseQuery.html MultiPhraseQuery] (a generalized version of PhraseQuery)
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/PrefixQuery.html PrefixQuery]: Match the start of words (meant for truncation queries like <tt>epidem*</tt>)
 
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanQuery.html SpanQuery] (abstract base class for various queries that use term position spans)
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanTermQuery.html SpanTermQuery] (matches spans containing a term{{verify}})
*** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/payloads/BoostingTermQuery.html BoostingTermQuery] {{verify}}
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanNearQuery.html SpanNearQuery] matches spans that are near to each other. You specify a slop and whether the spans have to be in the same order as in the query
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanOrQuery.html SpanOrQuery] (union of SpanQueries)
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/regex/SpanRegexQuery.html SpanRegexQuery]
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanFirstQuery.html SpanFirstQuery] (matches a span near the beginning of a field (last allowed position handed in))
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/spans/SpanNotQuery.html SpanNotQuery] (removes matches where another span query matches)
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/MultiTermQuery.html MultiTermQuery] (abstract class for queries that effectively match a whole set of terms, e.g. by pattern):
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FuzzyQuery.html FuzzyQuery] (based on Levenshtein distance)
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/WildcardQuery.html WildcardQuery]: supports glob-style wildcards
** [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/regex/RegexQuery.html RegexQuery] (regular-expression term matching; see SpanRegexQuery above for the SpanQuery variant)
 
 
Related to scoring:
* BoostingQuery: Affects scoring by giving relatively more/less weight to query parts (often more desirable than omitting results)
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/ConstantScoreRangeQuery.html ConstantScoreRangeQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/function/CustomScoreQuery.html CustomScoreQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/DisjunctionMaxQuery.html DisjunctionMaxQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/ConstantScoreQuery.html ConstantScoreQuery]
 
 
Unsorted:
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/RangeQuery.html RangeQuery]
 
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FilteredQuery.html FilteredQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/similar/MoreLikeThisQuery.html MoreLikeThisQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/FuzzyLikeThisQuery.html FuzzyLikeThisQuery] (like a mix between FuzzyQuery and MoreLikeThisQuery)
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/function/ValueSourceQuery.html ValueSourceQuery]
* [http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/MatchAllDocsQuery.html MatchAllDocsQuery]
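
For a sense of how a few of the query classes above are constructed directly {{comment|(hedged sketch; terms and slop values are arbitrary examples)}}:

 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.FuzzyQuery;
 import org.apache.lucene.search.PhraseQuery;
 import org.apache.lucene.search.PrefixQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.WildcardQuery;
 
 public class QueryClassesSketch {
     public static void main(String[] args) {
         PhraseQuery phrase = new PhraseQuery();
         phrase.add(new Term("text", "inverted"));
         phrase.add(new Term("text", "index"));
         phrase.setSlop(2);                              // allow a couple of words in between
 
         Query prefix = new PrefixQuery(new Term("text", "epidem"));    // epidem*
         Query wildcard = new WildcardQuery(new Term("text", "ind*x"));
         Query fuzzy = new FuzzyQuery(new Term("text", "lucene"));      // Levenshtein-based
 
         System.out.println(phrase + " " + prefix + " " + wildcard + " " + fuzzy);
     }
 }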
 
 
 
Examples (various based on [http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/ pylucene examples]):
<!--
# A manual 'books similar to this' query:
authors = doc.getValues("author")
authorQuery = BooleanQuery()
for author in authors:
    authorQuery.add(
            TermQuery(Term("author", author)),
            BooleanClause.Occur.SHOULD
    )
authorQuery.setBoost(2.0)
 
vector = self.reader.getTermFreqVector(id, "subject")
subjectQuery = BooleanQuery()
for term in vector.getTerms():
    tq = TermQuery(Term("subject", term))
    subjectQuery.add(tq, BooleanClause.Occur.SHOULD)
 
#Overall query: match author, subject, and avoid the item this was based on
likeThisQuery = BooleanQuery()
likeThisQuery.add(authorQuery,  BooleanClause.Occur.SHOULD)
likeThisQuery.add(subjectQuery, BooleanClause.Occur.SHOULD)
likeThisQuery.add(TermQuery(Term("isbn", doc.get("isbn"))), BooleanClause.Occur.MUST_NOT)
 
-->
 
====On scoring====
{{stub}}
 
Scoring uses:
* per term per document:
** tf for the term in the document
** idf for the term
** norm (some index-time boosts: document, field, length boosts)
** term's query boost
 
* per document:
** coordination (how many of the query terms are found in a document)
 
* per query
** query normalization, to make scores between (sub)queries work on a numerically compatible scale, which is necessary to make complex queries work properly
 
 
* more manual boosts:
** boost terms (query time)
** boost a document (index time)
** boost a field's importance (in all documents) (index time)
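
In code, those manual boosts look roughly like this {{comment|(hedged sketch, 2.x-era API; the boost values are arbitrary - index-time document/field boosts end up folded into the norm mentioned above)}}:

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 
 public class BoostSketch {
     public static void main(String[] args) {
         // query-time: weight this clause more heavily
         Query q = new TermQuery(new Term("text", "lucene"));
         q.setBoost(3.0f);
 
         // index-time: this document matters more overall
         Document doc = new Document();
         doc.setBoost(2.0f);
 
         // index-time: this field matters more within the document
         Field title = new Field("title", "Lucene notes", Field.Store.YES, Field.Index.ANALYZED);
         title.setBoost(1.5f);
         doc.add(title);
     }
 }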
 
 
See also:
* http://lucene.apache.org/java/2_4_0/scoring.html
* http://lucene.apache.org/java/2_4_0/api/org/apache/lucene/search/Similarity.html
 
 
 
<!--
====Pylucene notes====
{{stub}}
PyLucene is a python wrapper around the java lucene implementation.
 
Note that previous versions used [http://en.wikipedia.org/wiki/GCJ GCJ], current versions use JCC. The GCJ bindings were somewhat finicky, JCC less so.
 
Use your distribution's package unless you are aware of the potential trouble you may run into building this, and/or you must have the latest version.
 
Most installation errors that happen seem to be related to library problems. Both the use of jcc (if applicable) and the compilation of pylucene will fail if you do not have a jdk installed and library inclusion does not lead to libjava.so and libjvm.so.
 
It may also specifically need the sun jdk, though not necessarily 1.5.0 as some hardcoded paths in jcc's setup.py may indicate - just change them.
 
 
If you run into segfaults, the likeliest causes are:
* haven't done a <tt>lucene.initVM</tt> (any lucene class access will cause a segfault)
* you are using threaded access to indices, but not in a pylucene friendly way (details and fixes are mentioned/documented in various places)
 
 
 
See also:
* http://lucene.apache.org/pylucene/
* http://chandlerproject.org/PyLucene/WebHome
 
http://today.java.net/pub/a/today/2003/07/30/LuceneIntro.html
-->
 
====See also====
{{stub}}
 
* http://wiki.apache.org/lucene-java/LuceneFAQ
<!--
 
PyLucene:
* http://inkdroid.org/talks/pylucene/
* http://lucene.apache.org/pylucene/documentation/readme.html
 
* Some example-like things:
** http://sujitpal.blogspot.com/2007/08/executing-booleanquery-with-pylucene.html
** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/IndexFiles.py?view=markup
** http://crschmidt.net/python/lucene.highlight.py
** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/
** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/
*** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/lia/advsearching/BooksLikeThis.py?view=markup
*** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/lia/advsearching/
*** http://svn.apache.org/viewvc/lucene/pylucene/trunk/samples/LuceneInAction/lia/meetlucene/Indexer.py?view=markup
 
 
Unsorted:
* http://www.javaworld.com/javaworld/jw-09-2000/jw-0915-lucene.html
* http://darksleep.com/lucene/
* http://lucene.apache.org/java/2_2_0/api/org/apache/lucene/document/Field.TermVector.html
* http://www.knowledgefolders.com/akc/servlet/DisplayServlet?url=DisplayNoteMPURL&reportId=1395&ownerUserId=satya
-->
 
==Nutch (subproject)==
{{stub}}
 
[http://lucene.apache.org/nutch/ Nutch] extends Lucene into a somewhat more worked-out search engine: an indexer plus a crawler appropriate for the fairly uncontrolled content on the web, intranets, and such (it works incrementally, focuses on well-linked content, and so on).
 
It has parsers for some formats{{verify}}, and a basic JSP ([[servlet]]) based interface.
 
 
Nutch also triggered development of the Hadoop framework, and can now use its HDFS and MapReduce for distributed crawling and indexing, and for keeping that index updated{{verify}}.
 
It seems that searching from an index stored on the DFS is not ideal, so you would want to copy indices to a local filesystem for fast search (possibly also distributed, but note that this has a few hairy details).
 
 
<!--
 
* db:
** URLs (fetched or not)
 
basic index or searchable content?
 
* segments:
** crawler metadata
** named by the time they are created
 
 
====Nutch crawling====
Some crawler configuration sits in conf/nutch-site.xml.
 
You can control URL patterns to blacklist or whitelist in conf/crawl-urlfilter.txt, which also allows you to do things like staying within a predetermined set of sites
 
 
You can hand in a few URLs, something like:
bin/nutch crawl urls -dir crawl -topN 50
Where:
* -dir dir names the directory to put the crawl in.
* -threads threads determines the number of threads that will fetch in parallel.
* -depth depth indicates the link depth from the root page that should be crawled.
* -topN N determines the maximum number of pages that will be retrieved at each level up to the depth.
 
 
bin/nutch generate crawl/crawldb crawl/segments -topN 1000
 
 
bin/nutch updatedb crawl/crawldb
 
 
 
 
nutch admin db --create
nutch inject db --urlfile seed.txt
 
Generate a fetchlist of URLS due to be fetched (new, or known but outdated, based on db metadata)
nutch generate db segments
 
nutch fetch segments/mentionedsegment
nutch updatedb db segments/mentionedsegment
nutch index segments/mentionedsegment
nutch dedup (arguments?)
 
 
 
Nutch has its own WAR file, so e.g. if using tomcat, remove webapps/ROOT and copy the warfile in as webapps/ROOT.war
 
-->
 
 
http://wiki.apache.org/nutch/NutchTutorial
 
 
http://wiki.apache.org/nutch/MapReduce
http://wiki.apache.org/nutch/NutchDistributedFileSystem
 
http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine
 
http://wiki.apache.org/nutch/DissectingTheNutchCrawler
 
==Some comparison, and notes on scaling==
<!--
{{stub}}
 
(Approximately, in that I don't have serious production experience with any of these)
 
 
On Lucene, Solr, Nutch, ElasticSearch:
 
Lucene is really mostly just a core, and neither easy enough nor complete enough to expose directly.
 
 
 
Due to Lucene's field definitions,
all of these are essentially schema'd,
and most of them expose things like sorting on fields,
any of these can be used for structured data as well as free-form text,
but the way each wraps it makes specific uses a lot simpler.
 
From a distance, they all seem similar in this regard,
but in practice the differences may be less subtle and may make your life easy or difficult.
 
 
Solr extends Lucene for ease of management and use (Some might call Solr something like 'search process enterprise management'),
: uses a schema, e.g. adding faceting.
 
Nutch is geared towards web crawling (so has a crawler), can deal with less structured data
though is perhaps more focused on scaling the crawling (and indexing),
and can use either Solr or ElasticSearch to search the result.
: uses a schema, but mostly just to add a bunch of its own metadata. It seems to still like seeing content as a blob of text to be indexed.
 
ElasticSearch is ease of management (with Kibana) plus dealing with more structured data.
 
-->
 
 
====On scaling indices and search====
{{stub}}
 
<!--
A lucene index and the process of searching is a monolithic, single-node thing.
 
To scale beyond millions of documents, you need multiple nodes, e.g. by distributing documents to collections.
 
 
Most options can now index into separate shards, and scrape results from separate shards.
 
This comes with some footnotes (including idf not working globally, not all faceting working, and such)
Most of these limitations can be overcome to some degree.
 
 
 
Solr can be made to support more search load by replicating the indices to slave nodes, and searching in those via a load balancer. The index distribution isn't a central feature, but can apparently be automated well enough, see e.g. http://wiki.apache.org/solr/CollectionDistribution
 
 
Solr can also search and combine over various remote indices. Related notes:
* duplicate ids will only match once - meaning a few dupes in distributed search don't matter as long as they're not different versions
* There are certain limitations. See e.g. http://wiki.apache.org/solr/DistributedSearch
 
This can go along with chopping up a collection into many sets of documents, each fully handled by such a shard. (preferably based on random sampling, so that per-subselection statistics are relatively representative)
 
 
(is nutch indexing mostly indexing segments in map, and merging those indices in reduce?)
 
 
 
It seems that you can combine Solr and Nutch without too much work, in a few ways:
* Nutch can apparently export segments into solr indices
* You can make Nutch post documents to Solr to be indexed (incurring the size limitations of a single index) (is that what nutch solrindex does?)
* The nutch servlet can apparently search a Solr index (but this is not usual, so there are format requirements - about a dozen fields Nutch expects)
 
 
http://www.lucidimagination.com/blog/2009/03/09/nutch-solr/
http://blog.foofactory.fi/2007/02/online-indexing-integrating-nutch-with.html
 
 
http://wiki.apache.org/nutch/RunningNutchAndSolr
 
 
 
Solr faceting uses memory proportional to the amount of unique values {{verify}}, which can be a bad idea on certain fields.
 
 
http://wiki.apache.org/solr/SolrPerformanceFactors
 
http://www.mail-archive.com/solr-user@lucene.apache.org/msg10667.html
 
 
 
 
-->
 
====Throwing resources at any of these====
<!--
 
 
'''RAM'''
 
First, a bit of vertical scaling in the form of RAM.
 
Assuming that the inverse index is fast, the more of it you can fit in memory, the more documents you can serve quickly.
 
A single server with a bunch of gigs of RAM can usually be tweaked to comfortably serve up to a few dozen million documents, under reasonable query load {{comment|(...though yes, 'reasonable query load' varies with what you actual use is, so this is a rough estimate)}}
 
 
 
There is a tiny cost to the amount of documents you actually return in the result set, but if you're doing ''scored search'' then there was never much point in returning very large result sets anyway.
 
 
If you're using it as a database, or as a precisely scored set, then your bounds are much more likely be the amount of elements you have to handle for any search, and you are bound more by CPU (sooner or later - there are various clever things you can do to ''postpone'' when this happens, and even if it's only a few factors, at scale this is often worth at least considering).
 
 
While RAM is relatively cheap per computer until you physically can't
fit any more (mechanisms are different with cloud providers but it's still broadly true), vertically scaling CPU has a much more immediate and steep curve to it, so...
 
 
 
 
 
'''Horizontal scaling'''
 
Horizontal scaling is always an easy answer. It is the cloud's ''default'' answer.
 
This adds both CPU and storage, in a way that is not interdependent.
 
 
 
 
Storage
* SSDs to store the ''indexes'', because anything that has to come from IO is served with both lower and more consistent latency than the same thing falling back to HDD.
:: also, it's harder to cause an SSD to thrash than an HDD
 
 
 
Footnotes:
* RAM - if your index fits in RAM '''and''' you can dedicate the RAM to this index (in terms of size and use), you can exploit the OS cache for consistently fast searches.
 
* RAM - If you have noticeably more (free) memory than data, and your index is fairly compact {{comment|(doesn't e.g. spend most of its space on original documents in store-only fields, because that can easily double your index size)}}, then the OS cache can be a productive cache. You could even choose to rely on it as a real cache, which caches more data than the specific Solr caches choose to do.
:: In some tests on a home server (8GB RAM, 7200 RPM HDD) with a 4.5GB index and an hourly cronjob to cat the data to /dev/null {{comment|(as a side effect causing it to be read into OS cache)}}, searches not served this warming take somewhere between 0.15 sec and 0.6sec, while with a warmed OS cache most searches were served in the 0.02 - 0.15sec range (depending on complexity, mostly around 0.02 for simple queries. No filters/filtercache were used in this test).
:: '''However''', this is a tricky bit of advice - there are easy ways to push the data out of cache again, meaning you cannot guarantee this low latency. Reasons include
::: a process actually allocating the RAM (indexing could potentially do this)
::: something doing a lot of (cacheable) disk IO. Including your own backups, if you're not careful.
::: a sysadmin noticing the host seems to have resources free and installing other things on it. {{comment|(In some situations, it may be at least bothersome to argue that you're actually using that free-looking RAM)}}.
::: [[swappiness]], as it affects how much RAM is swapped out and therefore usable for OS cache (arguably, relatively high swappiness can serve searching in this way, but only if there is a lot of memory to go around anyway)
 
* RAM - Caching seems to have less effect on large indexes than on smaller ones. That's largely because it becomes impossible to avoid talking to disk (and so incur its latency, particularly on HDDs - SSDs are the better option), partly because of the relative size of index and memory, but also because some things aren't directly cached and were always likelier to have to come from disk, so are more bound by it{{verify}}
 
 
 
 
 
'''On Lucene and Solr caches'''
 
 
In Lucene and Solr, committing changes to your index amounts to some amount of cache flushing{{verify}}.
This is to be expected, but is also one of the factors in how (and how much) your indexing affects search speed.
 
 
The Lucene '''FieldCache''', well, caches document field values.
While you have no direct control over this, it can be useful to indirectly warm this through doing autowarming in Solr (newSearcher, firstSearcher).
For example, if you frequently sort on a particular field, then autowarming queries sorting on those fields will make it more likely that a lot of those values will already be in Lucene's cache.
 
 
 
Solr's caches, on the other hand, have fairly specific and limited purposes, and a number of things aren't cached by anything (such as term positions{{verify}}), so only come from the index on disk.
This means that while you can optimize some queries to be served mostly from cache, others will always have parts that come from disk (or, often preferably, OS cache).
 
 
Warming caches with realistic searches has a noticeable positive effect on the Lucene and various Solr caches (although whether it has much effect on any given user search obviously depends on that search and how representative the warming was).
 
When a new Index Searcher is started, its cache can be pre-populated from a number of items from an old cache. {{comment|(Note that while copying may be faster than warming via queries, configuring a large number of to-be-copied items increases the time between creation and registration (ready-to-use-ness) of a new IndexSearcher. As such, you have to consider the time for warming against the value of that warming over its lifetime, under the expected search loads (and how likely it is for users to have to wait for creation of a new searcher))}}
Note that you should have enough searchers around to serve concurrent requests, or a portion of your users will be delayed in general, or delayed by warming.
 
 
 
The Solr '''Query Results Cache''' is a map from unique (query,sort) combinations to an ordered result list (of document IDs) each.
 
This is handy for users requesting more data for the same search (there is no connection state for that - they would do new searches, but be served very quickly by this cache {{comment|(combined with the document cache and a sensible value for queryResultWindowSize)}}).
 
On memory use: Only matching documents are stored in each value, which is usually a set much smaller than the whole set of documents - often no more than a few thousand. You may mostly see entries smaller than a dozen KB{{verify}}, so this cache is useful and not too costly.
 
 
The Solr '''Filter Cache''' stores the results of each filter that is evaluated {{comment|(which depending on the way you use queries may be none at all)}}, which can be reused to make certain parts of queries more efficient - sometimes much more so.
 
It stores the result of each filter on the entire index {{comment|(a boolean of each document's applicability under this filter)}}, which is 1 bit per document, so ~122KB per million documents per entry. I would call this moderately expensive, so if you want to use it, read up on how to do so effectively.
 
 
The Solr '''Document Cache''' stores documents read from the index, largely for presentation.
 
When the Query Results Cache points to documents loaded in here, they can be presented from cache without talking to disk, so it is probably most useful as a (relatively small) buffer serving adjacent page-range documents with fewer separate disk accesses.
The ''queryResultWindowSize'' setting helps presentation efficiency, by controlling how many documents ''around'' a range request are also fetched. For example, fetching results 11-20 (probably as page 2) with a queryResultWindowSize of 50 means that 1-50 will be loaded into the document cache. This means people clicking on 'next page' or  'previous page' will likely be served from cache.
 
This cache cannot be autowarmed.
<!--
* stores the entire stored document (does it do lazy field loading?{{verify}})
* apparently fieldselector controls what parts of the document are loaded into this cache{{verify}}
* what about lazy loading?{{verify}}
 
 
 
 
There is also an application level data cache, usable by Solr plugins{{verify}} (seems to be a convenience API so that you don't have to use an external cache).
 
-->
 
<!--
====On Solr faceting and its speed====
 
See also http://wiki.apache.org/solr/SolrFacetingOverview
 
There are two ways to do faceting:
* '''Facet queries''' (<tt>facet.query</tt>): Eventually does a filterquery for each facet value. That also makes it handy for the filterCache size to be larger than the unique values in the faceted fields. If that's ''really a lot'', you may not want this method.
 
 
* '''Field queries''' (<tt>facet.field</tt>)
 
Uses one of two methods, based on the field type, overridable using <tt>facet.method</tt>:
* '''Enum Based Field Queries''' (boolean fields, <tt>facet.method=enum</tt>): Faster for fields with small set of distinct values. Resource-heavy if not
* '''Field Cache''' (<tt>facet.method=fc</tt>)
 
-->
 
<!--
====Other Lucene/Solr tweaking for speed and scale====
 
See:
* http://wiki.apache.org/lucene-java/ImproveSearchingSpeed
* http://wiki.apache.org/lucene-java/ImproveIndexingSpeed
 
 
* http://wiki.apache.org/solr/SolrPerformanceFactors
 
* http://wiki.apache.org/solr/FilterQueryGuidance
 
* http://wiki.apache.org/solr/SolrCaching
 
* http://www.lucidimagination.com/Community/Hear-from-the-Experts/Articles/Scaling-Lucene-and-Solr
 
 
* http://wiki.apache.org/solr/LargeIndexes
 
 
* http://wiki.apache.org/solr/SolrPerformanceData
 
* http://wiki.statsbiblioteket.dk/summa/Hardware
 
 
* http://www.eu.apachecon.com/c/aceu2009/sessions/201
 
* http://lucene.apache.org/solr/tutorial.html
 
 
-->
[[Category:Software]]
[[Category:Search]]
