Lucene and things that wrap it


ElasticSearch

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


ElasticSearch is another HTTP server wrapped around Lucene.


ElasticSearch is essentially a document store, with a more defined CRUD API around it.

It puts focus on making both text and more structured types of data easily searchable, with fairly consistent latency over time (...given some consideration and tweaking).



ElasticSearch itself is a standalone thing, with a fairly easy-to-use API that you can interface with directly from code (wrappers are often relatively thin).


In practice, people often pair it with (see also "ELK stack"):

  • Kibana - a web UI that makes a bunch of inspection and monitoring easier, including some dashboard stuff
    • seen in tutorials for its console (poking ES without code)
    • pluggable, with a bunch of optional things (many of which you might only eventually try)
  • Logstash - ingest, for when you're not coding your own; you can do all of that yourself, but logstash is a quick start and/or has some of the more complex work done for you
  • Beats - where logstash is a more generic, configurable thing, beats is a set of more lightweight, specific-purpose ingest scripts, e.g. for availability, log files, linux system metrics, windows event log, metrics of some cloud services, network traffic [1] - and a lot more contributed ones[2]
    • e.g. metricbeat, which stores ES metrics in ES



Parts of ES(/ELK) are dual-licensed public/proprietary, other parts are proprietary only. (There was a license change that seemed to be aimed at Amazon's use, which mostly prompted(verify) Amazon to fork it into its own OpenSearch[3].) There are also some paid-only features[4], though these are mostly advanced/scale features that a lot of setups won't need.


See also:


Some terminology

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Fields, types and mappings

More on indexing

Data streams

Search

Result set

Aggregation

Snapshots, Searchable snapshots

Things you may want to control (/ checklist before you get started)

APIs

On security

Performance considerations

query considerations

use and design

clusters

Tweaking settings

Some Kibana notes

Solr (subproject)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Solr is

  • a layer of management/functionality/caching between lucene and an actual use.
  • a Java servlet that provides friendlier search and management than bare Lucene does
  • a little more analysis on top of Lucene
  • a few features on top of Lucene

It mostly focuses on a nicer/more convenient interface for the data/search-providing side of things - see e.g. the solr bootcamp PDF (although there is slight detail overkill there).

Depending on your needs, Solr can either be a convenience interface that exposes most of Lucene's power while minimizing implementation bother -- or a lot of implementation that's an unnecessary layer and set of dependencies for your specific needs. It's probably the former more often than not.




Some more practical notes

Install and config
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Solr is often used via servlets (it comes with Jetty for a quick start, and has been used in most if not all servlet containers by different people; apparently Tomcat has been reported to have fewer problems with Unicode).


See also:

Container specifics:


Solr configuration is specific to a Solr instance, so multiple solr setups need multiple container instances.

Note that different distributions (and admins) manage containers in different ways, meaning that configuration and index files may be in various different places. It probably helps to study your installation enough to at least know where to find files.

In Ubuntu, the solr-tomcat5.5 package is missing a dependency on libxpp3-java, so install that yourself. (Its absence will cause an error mentioning XmlPullParserException)

(If you're new to or generally shy away from Java, administering Solr and everything you need for it may be pretty daunting as an immediate learning curve. You may want to look at Sphinx)



The configuration of the Solr instance itself is mostly:

  • schema.xml, which all new document adds adhere to (you can run into problems after significant changes; usually the easiest fix is a complete reindex)
  • solrconfig.xml
    • lucene index and indexing parameters
    • solr handler config
    • search config
    • solr cache config
    • and more

Note that default solrconfig.xml's dismax configuration contains references to (fields defined in) the example schema, which can cause handler errors if you don't change them when you change the schema -- with cryptic descriptions such as 'The request sent by the client was syntactically incorrect (undefined field text)' .


Interfacing with
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


You can interface with Solr in a few ways, but since the main way of interfacing is largely just HTTP, most clients use that in some form or other.

When you're working locally you can choose to interface using SolrJ [5] (instead of HTTP) for lower latency, which for some uses can lead to noticeably better speed, and for some others doesn't matter much. (You can also imitate what SolrJ does with your own code (sometimes termed Embedded Solr [6]) but this is often fragile.)


...so you can send in data and commands in the following ways and more:

  • tell Solr where to fetch documents from (sent in via the servlet, e.g. using curl)
  • data via post.jar (Java program)
  • data via curl (often in XML form, or CSV(verify))
  • Java code (Solr-specific, networked)


On post.jar:

  • see its -help
  • To have post.jar use a different port, or access a different host, add a parameter like:
-Durl='http://192.168.0.2:8180/solr/update'
-Durl='http://127.0.0.1:8080/solr/update'
  • post.jar sends a commit by default. To avoid that until you actually want it, use
-Dcommit=no
  • the data source can be filename references (-Ddata=files, the default(verify)), stdin (-Ddata=stdin), or command-line arguments (-Ddata=args).


The -Ddata=args option can be handy to send specific commands, for example:

#Clear current index data: 
java -Ddata=args -jar post.jar "<delete><query>*:*</query></delete>"
 
# To enter some prepared docs, and not commit yet
java -Ddata=files -Dcommit=no -jar post.jar some_test_data/*.xml
 
#Commit the changes
java -Ddata=args -jar post.jar "<commit/>"
 
#Optimize the index
java -Ddata=args -jar post.jar "<optimize/>"


Actually, post.jar is a fairly thin convenience wrapper around HTTP POSTs, so the last is roughly equivalent to something like:

curl http://192.168.0.2:8180/solr/update -H "Content-Type: text/xml" --data-binary '<optimize/>'


You can even use GETs, but long URLs can create issues. Still, it's useful enough for things like:

http://192.168.0.2:8180/solr/update?stream.body=%3Coptimize/%3E
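That GET form is just the XML command percent-encoded into the URL's stream.body parameter. A minimal sketch of building such a URL (the host and port are the same made-up examples as above, and the helper is hypothetical):

```python
from urllib.parse import quote

def update_url(base, xml_command):
    # Percent-encode an XML update command into Solr's stream.body GET form.
    # (Hypothetical helper, just to show the encoding.)
    return base + "/solr/update?stream.body=" + quote(xml_command)

print(update_url("http://192.168.0.2:8180", "<optimize/>"))
# http://192.168.0.2:8180/solr/update?stream.body=%3Coptimize/%3E
```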


See also:


Response format

When you use the HTTP interface, you can ask it to send the response in one of many formats (via the wt parameter).

The formats include:

  • XML (data, and the default)
  • XSLT applied to the basic XML (handy for simple transforms and even immediate presentation)
  • HTML via Velocity templating
  • JSON (data)
  • Python (data)
  • Ruby (data)
  • PHP, Serialized PHP
  • javabin[7], a binary format based on Java marshalling(verify)
  • a custom response writer class
  • names of configured response writers that are a class with a specific initialization

See also:

Operations and handlers
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Operations are handled via RequestHandler subclasses - see e.g. http://lucene.apache.org/solr/api/org/apache/solr/request/SolrRequestHandler.html for a decent list.

The most central operations you'll probably use are:

  • query: well, that.
  • add: add/update documents, to be indexed (expects data already prepared into the fields that the schema defines, by default as XML (UTF-8), and possibly as CSV text(verify)). (You can probably expect on the order of 10 to 200 documents to be indexed per second(verify), depending on CPU, IO system, document size, and analysis complexity)
  • delete: when necessary.
  • commit: adds and deletes are transactional; commit applies changes since the last commit. Since commits flush various caches (and for some other reasons, e.g. segment-related ones), it's generally best to do these after a decently sized batch job instead of each of its items. (Note that you can also configure automatic commits after some number of pending commits, after some number of adds, after some time after the last add, and such)
  • optimize: merge and optimize data. Not strictly necessary, but good for compactness and speed (varying a little with how you configured segmenting behaviour). Can take some time, so probably have the most use after large commits.
  • some other management
  • use of other features via separate requests / RequestHandlers
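The commit advice above can be sketched as a client that defers commits to batch boundaries. This is a hypothetical client shape, not Solr's API -- send() here only records what would be POSTed to the update handler:

```python
class BatchIndexer:
    """Collects adds and commits once per batch, instead of per document."""
    def __init__(self, batch_size=100):
        self.batch_size = batch_size
        self.pending = 0
        self.sent = []   # stands in for HTTP POSTs to /solr/update

    def send(self, command):
        self.sent.append(command)

    def add(self, doc_xml):
        self.send("<add>%s</add>" % doc_xml)
        self.pending += 1
        if self.pending >= self.batch_size:
            self.commit()

    def commit(self):
        if self.pending:
            self.send("<commit/>")
            self.pending = 0

ix = BatchIndexer(batch_size=2)
for d in ["<doc>1</doc>", "<doc>2</doc>", "<doc>3</doc>"]:
    ix.add(d)
ix.commit()   # final commit for the leftover document
print(ix.sent.count("<commit/>"))  # 2
```

Two commits for three documents, instead of three; with a realistic batch size the savings are much larger.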


Note that the standard search handler also eases combination/chaining of certain (more or less standard) components, including:

  • QueryComponent
  • QueryElevationComponent
  • MoreLikeThis
  • Highlighting
  • FacetComponent
  • ClusteringComponent
  • SpellCheckComponent
  • TermVectorComponent
  • StatsComponent
  • Statistics
  • debug
  • TermsComponent


See also

Schema design
This hasn't been updated for a while, so could be outdated (particularly if it's about something that evolves constantly, such as software).

(commented out for now, and possibly forever)


On Scoring and boosting
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Boosting refers to changes in score, usually automatic ones. In Solr/lucene, this includes:

  • query clause boost: weight given to a part of the query (over another)
  • coordination factor: boost according to the number of terms in a multi-term query that match
  • FunctionQuery allows use of field content as numbers for boost, with possible operations


Document/field data related:

  • term frequency: documents with more occurrences of a term score higher
  • inverse document frequency: rarer terms count more than common ones
  • smaller fields score higher than large ones (because smaller ones are likely to be more specific)
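These three factors combine multiplicatively in classic Lucene TF-IDF scoring; the following is a simplified illustrative sketch, not the exact Similarity formula:

```python
import math

def term_score(occurrences, doc_freq, num_docs, field_terms):
    tf = math.sqrt(occurrences)                    # term frequency: more occurrences score higher
    idf = 1 + math.log(num_docs / (doc_freq + 1))  # inverse document frequency: rarer terms count more
    length_norm = 1 / math.sqrt(field_terms)       # smaller fields score higher
    return tf * idf * length_norm

# A rare term in a short field outscores a common term in a long field:
print(term_score(2, 10, 10000, 50) > term_score(2, 5000, 10000, 500))  # True
```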


Index-time boosts:

  • index-time document boost
  • index-time per-field boost (note that this is per field, not per value -- if you have a multivalued field, all values in that document for that same field get an identical consolidated value)(verify)




search: doing queries; query control
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

If you use the servlet for search, you'll probably search using the standard or the dismax query handler.

(controlled via, well, see http://wiki.apache.org/solr/CoreQueryParameters)



Note there are three types of clauses in terms of requirement:

  • optional, a.k.a. SHOULD (the default)
  • mandatory (when you use +)
  • prohibited (when you use -)


Both standard and dismax support:

  • sort:
    • optional
    • on one or more of: (each asc/desc)
      • score
      • one or more fields (multiValued="false" indexed="true")
  • start, rows: the offset and amount of items to fetch (default 0 and 10, respectively)
    • Note that there is a result cache
  • fq (filter query): Query that does subselection, without altering score
    • these sub-set selections can be cached
  • fl (field list): fields to return (default is *, i.e. all)
    • (not the same as field fetching laziness?(verify))
  • defType: specify the query parser to use [8]
  • omitHeader
  • timeAllowed: timeout for the search part of queries, in milliseconds. Don't set this too low, as that could lead to arbitrary partial results under load. 0 means no limit.
  • debugQuery
  • explainOther
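Put together, a request with these parameters is just a URL-encoded query string - for example (the host, core, and field names are made up):

```python
from urllib.parse import urlencode

params = {
    "q": "title:lucene",        # the main, scored query
    "fq": "doctype:article",    # cacheable subselection, doesn't affect score
    "fl": "id,title,score",     # fields to return
    "start": 10, "rows": 10,    # second page of ten results
    "sort": "score desc",
    "wt": "json",               # response format (see above)
}
url = "http://localhost:8983/solr/select?" + urlencode(params)
print(url)
```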


Dismax has more parameters/features, including:

  • qf: list fields to query, and field boost to use for each
  • tie: Since dismax searches multiple fields, a match in different fields would have different implications for the score. Read up on what disjunction max and disjunction sum do, or just choose something like 0.1.
  • mm: require that some part of the optional-type clauses must match [9] [10]
  • pf (Phrase Fields): fields (with boosts, like qf) in which query terms appearing near each other boosts the score
  • ps (phrase slop): the slop to use when using pf
  • qs (Query Phrase Slop): slop used for the main query string (so affects matching, not just scoring)
  • bf (Boost Function): Boost based on some value. One way of using FunctionQuery; see below for more detail. Useful for things like:
    • weighing by some pre-calculated value in a field
    • weighing by closeness to some value/date


See also:


Queries and FilterQueries; queryResultCache and filterCache
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Note that the Solr-specific caches are associated with a specific Index Searcher(verify). The caches have an item amount limit, and entries seem to have no timeout.


The results of the main part of a query (the q parameter) will be calculated, and entered into the queryResultCache (as a list of matching doc IDs -- and in many cases that's not a huge list).


A complex query is composed from many search results over the index, so caching these as a whole is relatively less useful.

(Dramatically put, complex queries will rarely hit, meaning you'll get higher-than-necessary eviction rates, and general cache contention on that cache. The fact that some parts might be reusable (particularly more mechanical parts of a query, such as filtering on document type) is one of the reasons for filter queries.)


Filter queries (fq parameter) are part of overall query execution (both standard and dismax), but are evaluated differently (and support different functionality):

In the evaluation of the query as a whole, all documents in the index are evaluated: it either matches the filter query or not (boolean). The direct function is a simple filter, and the score is also not affected by fq.

The per-document (orderless, scoreless) match-or-not is also storable in the Solr filter cache, to be reused by other queries that use the same filter criterion. If you implement faceted browsing, this is a great way to speed it up.


Arguably, the queryResultCache is often less memory-hungry than the filterCache -- although which of the two you try to exploit more depends a lot on what you're actually trying to cache. The former stores a list of IDs, but only for the document set that matched, which tends to be a relatively short list. The filter cache stores a bitmap of match/no-match for the filter over the entire document set, using one bit per document, so ~122KB per million docs per filterCache entry. In many cases this is more, so pick your filter queries so that they make a difference, and avoid cases where entries would never be looked up (you'd just be adding overhead).
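That ~122KB figure is simply one bit per document, rounded up to whole bytes:

```python
def filter_cache_entry_bytes(num_docs):
    # One match/no-match bit per document in the index.
    return (num_docs + 7) // 8

print(filter_cache_entry_bytes(1_000_000) / 1024)  # ~122 KiB per entry
```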

The filter cache is best used with specific knowledge of your schema and data. For example:

  • As a rule of thumb, the fields mentioned in fq should generally have a relatively small value set (or be reduced by the filter query, e.g. as by number ranges).
  • controlled keywords, subject codes, or other such property metadata may be useful to use (copy/move) to fq

Free-text search fields are poor candidates for this, seeing as uncontrolled text isn't too useful in fq.

Faceting is a feature that can use the filter cache (not too surprisingly), and it can be useful to place query parts on keyword or string fields into fq.

Index inspection, processing/relevance debugging
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)



Notes on further solr features
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Highlighting/summary

  • works for fields that are indexed and stored (it relies on term positions), and only makes sense when such a field is tokenized
  • generated per field (verify)
  • configurable number per field (default 1)
  • (Lucene can also guess based on new text, making it easier to use text stored elsewhere)
  • http://wiki.apache.org/solr/HighlightingParameters


FunctionQuery


Faceting

Note that it can be handy to create facet-only fields, e.g. on identities, names, subjects, and such.


Spell checking / alternate term suggestor


More Like This


Query Evaluation

  •  ?


Lucene

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Lucene is an Apache project aimed at indexing and searching documents.

Lucene is the basis of some fuller packages, today most relevantly ElasticSearch, but also Solr, Nutch, and others.


Lucene itself does

  • indexing
    • includes incremental indexing (in chunks)
  • searches on that index
    • more than just an inverted index - stores enough information to support NEAR queries, wildcards, regexps
    • and searches fairly quickly (IO permitting)
  • scoring of results
    • fairly mature, in that it works decently even without much tweaking


Its model and API are modular enough that you can specialize - which is what projects like Solr and ElasticSearch do.


Written in Java. There are wrappers into various other languages (and a few ports), and you can do most things via data APIs.


See also:


Some technical/model notes and various handy-to-knows

Parts of the system

On segments
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

IO behaviour should usually tend towards O(lg n).


Segments can be thought of as independent chunks of searchable index.

Lucene updates an index one segment at a time, and never in-place, which avoids blocking and so allows continuous updates without interrupting search.


Having segments at all does mean a little more work searching (it has to search all segments independently), but at the same time makes that parallelizable; segments being a reasonable size means adding/updating documents is reasonable to do at any time, with less need for things like rebalancing a single large structure.


Segments are occasionally merged, and you can control how and when via some parameters/conditions in config. This lets you balance the work necessary for index updates against the work necessary for a particular search (it also influences the need for optimizing, and the speed at which indexing happens).

Many will want to err on the faster-search side, unless perhaps updates are expected to be relatively continuous.

When you explicitly tell the system to optimize the index, segments are merged into a single one, which helps speed, but tends to be a lot of work on large indexes, and will only make much difference if you haven't done so lately. You can argue over how often you should optimize on a continually updating system.


In general, you want more smaller segments whenever you want to do a lot of updates, and fewer large segments if you are not - because search has to search all segments, while update has to update all of the segment it goes into.


Relevant config

mergeFactor:

  • low values make for more common segment merges
  • high values do the opposite: fewer merge operations, more index files to search
  • 2 is low, 25 is high. 10 is often reasonable.
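A toy simulation of this logarithmic merge pattern (not Lucene's actual merge policy code, but illustrative of why lower mergeFactor values leave fewer segments to search):

```python
def segment_count(num_flushes, merge_factor):
    # levels maps a size level to how many segments sit at that level;
    # whenever merge_factor same-level segments accumulate, they merge
    # into one segment at the next level up.
    levels = {}
    for _ in range(num_flushes):
        levels[0] = levels.get(0, 0) + 1
        level = 0
        while levels.get(level, 0) >= merge_factor:
            levels[level] -= merge_factor
            levels[level + 1] = levels.get(level + 1, 0) + 1
            level += 1
    return sum(levels.values())

# Lower mergeFactor: more merge work, fewer segments left to search.
print(segment_count(1000, 2), segment_count(1000, 25))
```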

maxBufferedDocs:

  • the number of documents that indexing stores in memory before they are stored/merged. Larger values tend to mean somewhat faster indexing (varying a little with your segmenting settings), at the cost of a little delay in indexing and a little more memory use.

maxMergeDocs:

  • the maximum number of documents merged into a segment(verify). Factory default seems to be Integer.MAX_VALUE (2^31-1, ~2 billion), effectively no maximum.

useCompoundFile:

  • compounds a number of data structures usually stored separately into a single file. Will often be a little slower - the setting is mostly useful to lessen the amount of open file handles, which can be handy when you have many segments on a host(verify)


More on indices
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The index package takes Documents and adds them to the index (with whatever processing is necessary).

IndexWriter:

  • lets you add Documents to them
  • lets you merge indices
  • lets you optimize indices

IndexReader:

  • is what you want for deleting (by Term or Query - it requires an index search/scan)

IndexSearcher: (wrapped around an IndexReader)

  • Lets you do searches


Note that you can only update/replace documents by deleting, then re-adding it.


Note that an IndexWriter takes an Analyzer. This seems to be a (perhaps overly cautious) way of guarding against the possible mistake of mixing Analyzers in the same index (it leaves the only choice as analyzing or not analyzing a field), which is a bad idea unless you know enough about lucene to be hacking your way around that.


An index can be stored in a Directory, a simple abstraction that lets indices be stored in various ways. Note that index reading and writing is subject to certain restrictions, which seem to be geared to making caching easier(verify).

There exist Directory implementations for a set of files on disk, RAM, and more. RAM indices may sound nice, but these are obviously a lot more size-limited than disk caches, and there are better ways to speed up and scale lucene indexes.


Apparently, all index reading and writing is thread and process safe(verify), but since it does so via a MVCC-esque transactional way, index consumers should re-open the index if they want to see up-to-date results, and writers will use advisory locking and so be sequential.

Note that using many IndexSearchers on the same data set will lead to more memory use and more file handles, so within a single multi-threaded searcher, you may want to reuse an IndexSearcher.

Documents, Fields, Tokens, Terms; Analyzers, Tokenizers, Filters

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Lucene uses a specific model that you need to fit your needs into, which you may want to know about even if you won't have to deal with all of its parts directly, and most have sane defaults.


The document package handles the abstraction of things into Document objects, which are the things that can be indexed and returned as a hit.

Note that you control what parts of a document get analyzed (transformed) or not, indexed or not, and stored in the index.


From lucene's point of view, a document consists of named Fields. Exactly what you create fields for varies with the sort of setup you want. Beyond a main text-to-index field you could add fields for a title, authors, a document identifier, URLs e.g. for origin/presentation, a 'last modified' date, part names/codes, SKUs, categories or keyword fields to facet on, and whatnot.


Note that you can have multiple fields with the same name in a document (multivalued fields), which can make sense in certain cases, for example when you have multiple keywords that you want analysed as independent and not concatenated strings. You may want to read up on the effects this has on scoring and certain features.


From the index's point of view, a Document contains a number of Fields, and once indexed, they contain Terms - the processed form of field data that comes out of the analysis, which are the actual things that are searched for.

That processing usually consists of at least a tokenizer that splits up input text into words (you can give codes and such different treatment), and filters that transform, and sometimes take out or create new tokens/terms.

You can control how much transformation gets applied to the values in a specific field, from no change at all (index the whole things as a single token) to complex splitting into many tokens, stemming inflected words, taking out stopwords, inserting synonyms, and whatnot.


On Fields and the index

For each field, you should decide:

  • whether to index it (if not, it's probably a field you only want to store)
  • whether to store the (pre-tokenized, if you tokenize) data in the index
  • whether to tokenize/analyse it (whether you want it broken down)
  • how to tokenize/analyse it


Different combinations fit different use cases. For example:

  • text to search in would be indexed, after analysis. Can be stored too, e.g. to present previews, sometimes to present the whole record from the index itself (not so handy if relatively large), or to support highlighting (since the indexed form of text is a fairly mangled form of the original).
  • A source URL for site/intranet would probably be both stored (for use) and indexed (to support in-url searching)
  • a path to a local/original data file (or perhaps immediately related data) might be only stored.
  • a document ID would probably be both stored and indexed, but not tokenized
  • Something like a product code (or SKU) could be both stored and indexed, and you may want to tweak the analysis so that you can search for it as a unit, as well as for its parts.
  • controlled keyword sets, e.g. to facet on, would probably be indexed, but not analyzed, or only tokenized and not e.g. stemmed.


(Older documentation will mention four types of field (Text, UnStored, UnIndexed, Keyword), which was a simple expansion of the few options you had at that time. These days you have to decide each separately)

More on tokens
This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

A token represents an occurrence of a term in the text of a Field. As such, it consists of:

  • the text (handing in the represented String directly is now deprecated, as it is slow)
  • the (start and end) offset of the term in the character stream
  • the position increment relative to the previous term
  • optional: a (lexical) type, used by some analysers (specifically by filters in an analysis that wants to have some type/context awareness)

Token positions may be stored or not (depending on the field type), which supports NEAR queries; positions can also be handy during indexing. For example, a Filter that emits synonyms can pretend that various different words were present at the same original position.

Basics of Analyzers, Tokenizers, Filters

The analyser package primarily has Analysers.

Analyzer objects take a Reader (a character stream instead of a String, for scalability), apply a Tokenizer to emit Tokens (which are strings with, at least, a position), and may apply a bunch of Filters, which may alter that token stream (altering, removing, or adding tokens), before handing this to the indexing process.

As an example, the StandardAnalyzer consists of :

  • a StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • a StandardFilter (removes 's from the end of words and dots from acronyms)
  • a LowerCaseFilter (lowercases all text)
  • a StopFilter (removes common English stopwords).

See also analyzer list below, and you can compose and create your own.
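The tokenizer-plus-filters shape can be sketched like this (a toy pipeline; the real StandardTokenizer's splitting rules are considerably more involved):

```python
import re

STOPWORDS = {"the", "a", "an", "of"}   # real stop sets are larger

def tokenize(text):
    # crude stand-in for a Tokenizer: split on non-word characters
    return re.findall(r"\w+", text)

def lowercase_filter(tokens):
    return [t.lower() for t in tokens]

def stop_filter(tokens):
    return [t for t in tokens if t not in STOPWORDS]

def analyze(text):
    # tokenizer first, then the filter chain, like an Analyzer composes them
    return stop_filter(lowercase_filter(tokenize(text)))

print(analyze("The Analysis of Tokens"))  # ['analysis', 'tokens']
```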


Note that since analysis is actually a transformation that results in a somewhat mangled form of the input, you must use the same (or at least a similar enough) analyzer when querying in a particular field, or you will not get the results you expect, or even any.


Note that you have the option of using different analysis for different fields. You could see this as fields being different sub-indexes, which can be powerful, but be aware of the extra work that needs to be done.


More detailed token and field stuff

Phrase and proximity search relies on position information. In the token generation phase, the position increment of all tokens is 1, which just means that each term is positioned one term after the other.

There are a number of cases you could think of for which you would want to play with position / position offset information:

  • enabling phrase matches across stopwords by pretending stopwords aren't there (filters dealing with stopwords tend to do this)
  • injecting multiple forms at the same position, for example different stems of a term, different rewrites of a compound, injecting synonyms at the same position, and such
  • avoiding phrase/proximity matches across sentence/paragraph boundaries


When writing an analyzer, it is often easiest to play with the position increment, which is set on each token and refers to the offset from the previous term (not the next).

For example, for a range of terms that should appear at the same position, you would set all but the first to zero.

Technically, you could even use this for a term frequency boost, repeating a token with an increment of zero.
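The increment bookkeeping amounts to a running sum - a sketch of how increments resolve to absolute positions, with a synonym injected at increment zero:

```python
def positions(tokens):
    # tokens: (text, position_increment) pairs; increments are relative
    # to the previous token, so absolute position is a running sum.
    pos, out = 0, []
    for text, incr in tokens:
        pos += incr
        out.append((text, pos))
    return out

# "fast" is injected as a synonym of "quick", sharing its position:
print(positions([("the", 1), ("quick", 1), ("fast", 0), ("fox", 1)]))
# [('the', 1), ('quick', 2), ('fast', 2), ('fox', 3)]
```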


Multivalued fields, that is, those for which you add multiple chunks, position-wise act as if they are concatenated. When applicable, you can set a large position increment between these values to avoid phrase matches across them (a property of the field?(verify)).

Some other...

Tokenizers include:

  • StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • CharTokenizer
    • LetterTokenizer (splits on non-letters, using Java's isLetter())
      • LowercaseTokenizer (LetterTokenizer + LowercaseFilter, mostly for a mild performance gain)
    • RussianLetterTokenizer
    • WhitespaceTokenizer (only splits on whitespaces, so e.g. not punctuation)
  • ChineseTokenizer (splits ideograms)
  • CJKTokenizer (splits ideograms, emits pairs(verify))
  • NGramTokenizer, EdgeNGramTokenizer (generate n-grams for all n in a given range)
  • KeywordTokenizer (emits entire input as single token)


Filters include:

  • StandardFilter (removes 's from the end of words and dots from acronyms)
  • LowerCaseFilter
  • StopFilter (removes stopwords (stopword set is handed in))
  • LengthFilter (removes tokens whose length is not in a given range)
  • SynonymTokenFilter (injects synonyms (synonym map is handed in))
  • ISOLatin1AccentFilter (filter that removes accents from letters in the Latin-1 set. That is, replaces accented letters by unaccented characters (mostly/all ASCII), without changing case. Looking at the code, this is VERY basic and will NOT do everything you want. A better bet is to normalize your text (removing diacritics) before feeding it to indexing)
  • Language-specific (some stemmers, some not) and/or charset-specific: PorterStemFilter, BrazilianStemFilter, DutchStemFilter, FrenchStemFilter, GermanStemFilter, RussianStemFilter, GreekLowerCaseFilter, RussianLowerCaseFilter, ChineseFilter (stopword stuff), ThaiWordFilter
  • SnowballFilter: Various stemmers
  • NGramTokenFilter, EdgeNGramTokenFilter
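An analyzer is essentially a tokenizer with filters chained after it, each consuming the previous one's token stream. A toy plain-Python analogue of a WhitespaceTokenizer → LowerCaseFilter → StopFilter → LengthFilter chain, using generators for the streaming behavior (the names mirror the Lucene classes above, but this is a sketch, not their actual API):

```python
# Each stage consumes a token stream and yields a filtered/changed one,
# roughly how Lucene chains a Tokenizer and TokenFilters.

def whitespace_tokenizer(text):
    yield from text.split()

def lowercase_filter(tokens):
    for t in tokens:
        yield t.lower()

def stop_filter(tokens, stopwords):
    for t in tokens:
        if t not in stopwords:
            yield t

def length_filter(tokens, min_len, max_len):
    for t in tokens:
        if min_len <= len(t) <= max_len:
            yield t

stream = whitespace_tokenizer("The Quick brown FOX a")
stream = lowercase_filter(stream)
stream = stop_filter(stream, {"the", "a"})
stream = length_filter(stream, 2, 10)
print(list(stream))  # ['quick', 'brown', 'fox']
```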

Queries, searches, Hits

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

The search package has Searcher / IndexSearcher; an IndexSearcher wraps an IndexReader, takes a Query, and returns Hits, an iteration over the matching result Documents.

There are some ways to deal with searching multiple and remote indices; see e.g. the relations to the Searchable interface.


The queryParser package helps turn a query string into a Query object.

Note that for identifier queries and mechanical queries (e.g. finding items which share these terms but not this identifier) you'll probably want to construct a Query object structure directly instead of using the QueryParser.

(Note that unlike most other parts of Lucene, the QueryParser is not thread-safe)


For basic search functionality, Lucene has fairly conventional query syntax.

In various cases you may wish to rewrite the query somewhat before you feed it to the query parser - in fact, you may well want to protect your users from having to learn Lucene-specific syntax.



On hits:

  • Hit (mostly wraps a Document; allows fetching of the score, field values, and the Document object)



The older way: (deprecated, will be removed in Lucene 3)



See also:


Query classes:

  • PhraseQuery matches a sequence of terms, with optional allowance for words in between (the maximum distance, called slop, is handed in)
  • PrefixQuery: Match the start of words (meant for truncation queries like epidem*)
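To give a feel for slop, a much-simplified two-term sketch in plain Python: the real PhraseQuery slop is an edit distance over token positions (so slop ≥ 2 even allows the terms to appear in reverse order), while this sketch only handles the in-order case.

```python
# Simplified "phrase with slop" check for a two-term phrase, given
# the position lists of each term in one document. Real PhraseQuery
# slop is an edit distance over positions; this ignores reordering.

def phrase_match(positions_a, positions_b, slop):
    """True if some occurrence of b follows some occurrence of a
    with at most `slop` positions skipped in between."""
    return any(0 < pb - pa <= slop + 1
               for pa in positions_a
               for pb in positions_b)

# "new york" against "new shiny york": 'new' at 0, 'york' at 2.
print(phrase_match([0], [2], slop=0))  # False: one word in between
print(phrase_match([0], [2], slop=1))  # True
```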



Related to scoring:


Unsorted:


Examples (various based on pylucene examples):

On scoring

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Scoring uses:

  • per term per document:
    • tf for the term in the document
    • idf for the term
    • norm (some index-time boosts: document, field, length boosts)
    • term's query boost
  • per document:
    • coordination (how many of the query terms are found in a document)
  • per query
    • query normalization, to make scores between (sub)queries work on a numerically compatible scale, which is necessary to make complex queries work properly


  • more manual boosts:
    • boost terms (query time)
    • boost a document (index time)
    • boost a field's importance (in all documents) (index time)
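A much-simplified sketch of how these ingredients combine, in plain Python. This keeps only dampened tf, idf, coord, and a boost; real classic Lucene scoring also folds in idf a second time, the query norm, and per-field length norms, so treat the shapes of `tf` and `idf` here as illustrative approximations, not the exact formulas.

```python
import math

# Simplified classic-Lucene-style scoring ingredients (illustrative).

def tf(freq):
    """Term frequency, dampened so repeats help sublinearly."""
    return math.sqrt(freq)

def idf(num_docs, doc_freq):
    """Rarer terms weigh more."""
    return 1 + math.log(num_docs / (doc_freq + 1))

def coord(terms_matched, terms_in_query):
    """Reward documents matching more of the query's terms."""
    return terms_matched / terms_in_query

def score(term_freqs, doc_freqs, num_docs, terms_in_query, boost=1.0):
    """Sum tf*idf over the query terms found in one document."""
    s = sum(tf(f) * idf(num_docs, df) * boost
            for f, df in zip(term_freqs, doc_freqs))
    return s * coord(len(term_freqs), terms_in_query)

# Two of three query terms found in a doc, in a 1000-doc index:
# one rare term (df=10) occurring 3 times, one common term (df=200) once.
print(round(score([3, 1], [10, 200], 1000, 3), 3))
```

Note how the rare term dominates the score even before boosts, which is the basic tf-idf intuition at work.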


See also:



See also

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Nutch (subproject)

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Nutch extends Lucene into a somewhat more worked-out search engine, indexer, and crawler, appropriate for the fairly uncontrolled content of the web, intranets, and such (it works incrementally, focuses on well-linked content, and so on).

It has parsers for some formats(verify), and a basic JSP (servlet) based interface.


Nutch also triggered development of the Hadoop framework, and can now use its HDFS and MapReduce for distributed crawling and indexing, and for keeping that index updated(verify).

It seems that searching from an index stored on the DFS is not ideal, so you would want to copy it to a local filesystem for fast search (possibly also distributed, though note that has a few hairy details).



http://wiki.apache.org/nutch/NutchTutorial


http://wiki.apache.org/nutch/MapReduce

http://wiki.apache.org/nutch/NutchDistributedFileSystem

http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

http://wiki.apache.org/nutch/DissectingTheNutchCrawler

Some comparison, and notes on scaling

On scaling indices and search

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)


Throwing resources at any of these