Apache projects and related notes

This article/section is a stub — probably a pile of half-sorted notes, is not well-checked so may have incorrect bits. (Feel free to ignore, fix, or tell me)

Lucene


Lucene is the core of a search system: it provides indexing (including incremental indexing) and relatively flexible searches on that index (e.g. storing enough information to support NEAR queries, wildcards, and regexps), and does so fairly quickly (storage system IO permitting).

Its model and API are modular enough that you can implement your own cleverness. The default scoring system is also fairly mature, in that it works decently without tweaking.


It is written in Java, and has been wrapped for various other languages (there are also a few ports).


See also:


Some technical/model notes and various handy-to-knows

Parts of the system

On segments

IO behaviour should usually tend towards O(lg n).


Segments can be thought of as independent chunks of searchable index.

Lucene updates an index one segment at a time, and never in place, which avoids blocking and so allows continuous updates without interrupting search.


Having segments at all means a little more work when searching (each segment has to be searched independently), but it also makes search parallelizable, and keeping segments a reasonable size means adding/updating documents is reasonable to do at any time, with less need for things like rebalancing one single large structure.


Segments are occasionally merged, and you can control how and when via some parameters/conditions in the config. This lets you balance the work necessary for index updates against the work necessary for a particular search (it also influences the need for optimizing, and the speed at which indexing happens).

Many will want to err on the faster-search side, unless updates are expected to be relatively continuous.

When you explicitly tell the system to optimize the index, segments are merged into a single one, which helps search speed, but tends to be a lot of work on large indexes, and will only make much difference if you haven't done so lately. You can argue over how often you should optimize a continually updating system.


In general, you want more, smaller segments when you are doing a lot of updates, and fewer, larger segments when you are not - because a search has to look at all segments, while an update only has to touch the segment it goes into.


Relevant config

mergeFactor:

  • low values make for more common segment merges
  • high values do the opposite: fewer merge operations, more index files to search
  • 2 is low, 25 is high. 10 is often reasonable.

maxBufferedDocs:

  • number of documents that indexing buffers in memory before they are written out/merged. Larger values tend to mean somewhat faster indexing (varying a little with your segmenting settings), at the cost of a little delay in indexing and a little more memory use.

maxMergeDocs:

  • The maximum number of documents merged into a segment(verify). The factory default seems to be Integer.MAX_VALUE (2^31-1, ~2 billion), i.e. effectively no maximum.

useCompoundFile:

  • compounds a number of data structures usually stored separately into a single file. This will often be a little slower - the setting is mostly useful to reduce the number of open file handles, which can be handy when you have many segments on a host(verify)
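
As a rough illustration of setting these - a sketch against the older (roughly Lucene 2.9-era) IndexWriter API, where these were setters on the writer; newer versions moved them onto config/merge-policy objects, so exact names vary per version, and the index path is made up:

 import java.io.File;
 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.store.Directory;
 import org.apache.lucene.store.FSDirectory;

 // Sketch only: older-style IndexWriter tuning.
 Directory directory = FSDirectory.open(new File("/tmp/example-index"));  // example path
 IndexWriter writer = new IndexWriter(directory, new StandardAnalyzer(), true,
                                      IndexWriter.MaxFieldLength.UNLIMITED);

 writer.setMergeFactor(10);                  // merge roughly every 10 segments; lower = fewer segments, more merge work
 writer.setMaxBufferedDocs(1000);            // buffer this many documents in RAM before flushing a segment
 writer.setMaxMergeDocs(Integer.MAX_VALUE);  // effectively no per-segment document limit (the default)
 writer.setUseCompoundFile(true);            // fewer open file handles, at a small speed cost

 // ...addDocument() calls here...

 writer.optimize();   // optional: merge everything down to one segment (expensive on large indexes)
 writer.close();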


More on indices

The index package takes Documents and adds them to the index (with whatever processing is necessary).

IndexWriter:

  • lets you add Documents to an index
  • lets you merge indices
  • lets you optimize indices

IndexReader:

  • is what you want for deleting (by Term or Query - it requires an index search/scan)

IndexSearcher: (wrapped around an IndexReader)

  • Lets you do searches


Note that you can only update/replace a document by deleting it and then re-adding it.
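
A minimal end-to-end sketch of those three, against roughly the Lucene 2.9/3.0-era API (details differ per version; field names and values are made up). Note there is also IndexWriter.updateDocument(), which does the delete-and-re-add in one call:

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;
 import org.apache.lucene.index.IndexReader;
 import org.apache.lucene.index.IndexWriter;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.IndexSearcher;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.search.TopDocs;
 import org.apache.lucene.store.RAMDirectory;
 import org.apache.lucene.util.Version;

 public class IndexRoundTrip {
     public static void main(String[] args) throws Exception {
         RAMDirectory dir = new RAMDirectory();  // in-memory, just for the sketch

         // write
         IndexWriter writer = new IndexWriter(dir, new StandardAnalyzer(Version.LUCENE_30), true,
                                              IndexWriter.MaxFieldLength.UNLIMITED);
         Document doc = new Document();
         doc.add(new Field("id", "doc-1", Field.Store.YES, Field.Index.NOT_ANALYZED));
         doc.add(new Field("content", "hello lucene world", Field.Store.YES, Field.Index.ANALYZED));
         writer.addDocument(doc);
         // "update" = delete by term, then add the new version
         writer.deleteDocuments(new Term("id", "doc-1"));
         writer.addDocument(doc);
         writer.close();

         // search (IndexSearcher wrapped around an IndexReader)
         IndexSearcher searcher = new IndexSearcher(IndexReader.open(dir));
         TopDocs top = searcher.search(new TermQuery(new Term("content", "hello")), 10);
         System.out.println("hits: " + top.totalHits);
         searcher.close();
     }
 }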


Note that an IndexWriter takes an Analyzer. This seems to be a (perhaps overly cautious) way of modelling away the possible mistake of mixing Analyzers in the same index (the only choice left is analyzing or not analyzing a field), which is a bad idea unless you know enough about Lucene to hack your way around it.


An index can be stored in a Directory, a simple abstraction that lets indices be stored in various ways. Note that index reading and writing is subject to certain restrictions, which seem to be geared to making caching easier(verify)

There exist Directory implementations for a set of files on disk, RAM, and more. RAM indices may sound nice, but they are obviously a lot more size-limited than disk-based ones, and there are better ways to speed up and scale Lucene indexes.


Apparently, all index reading and writing is thread- and process-safe(verify), but since it works in an MVCC-esque transactional way, index consumers should re-open the index if they want to see up-to-date results, and writers use advisory locking, so writes are sequential.

Note that using many IndexSearchers on the same data set will lead to more memory use and more file handles, so within a single multi-threaded searcher, you may want to reuse an IndexSearcher.

Documents, Fields, Tokens, Terms; Analyzers, Tokenizers, Filters


Lucene uses a specific model that you need to fit your needs into, and which you may want to know about, even though you won't have to deal with all of its parts directly and most have sane defaults.


The document package handles the abstraction of things into Document objects, which are the things that can be indexed and returned as a hit.

Note that you control what parts of a document get analyzed (transformed) or not, indexed or not, and stored in the index.


From lucene's point of view, a document consists of named Fields. Exactly what you create fields for varies with the sort of setup you want. Beyond a main text-to-index field you could add fields for a title, authors, a document identifier, URLs e.g. for origin/presentation, a 'last modified' date, part names/codes, SKUs, categories or keyword fields to facet on, and whatnot.


Note that you can have multiple fields with the same name in a document (multivalued fields), which can make sense in certain cases, for example when you have multiple keywords that you want analysed as independent and not concatenated strings. You may want to read up on the effects this has on scoring and certain features.


From the index's point of view, a Document contains a number of Fields, and once indexed, they contain Terms - the processed form of field data that comes out of the analysis, which are the actual things that are searched for.

That processing usually consists of at least a tokenizer, which splits up input text into words (you can give codes and such different treatment), and filters, which transform tokens and sometimes remove or create tokens/terms.

You can control how much transformation gets applied to the values in a specific field, from no change at all (index the whole thing as a single token) to complex splitting into many tokens, stemming inflected words, taking out stopwords, inserting synonyms, and whatnot.


On Fields and the index

For each field, you should decide:

  • whether to index it (if not, it's probably a field you only want to store)
  • whether to store the (pre-tokenized, if you tokenize) data in the index
  • whether to tokenize/analyse it (whether you want it broken down)
  • how to tokenize/analyse it


Different combinations fit different use cases. For example:

  • text to search in would be indexed, after analysis. Can be stored too, e.g. to present previews, sometimes to present the whole record from the index itself (not so handy if relatively large), or to support highlighting (since the indexed form of text is a fairly mangled form of the original).
  • A source URL for site/intranet would probably be both stored (for use) and indexed (to support in-url searching)
  • a path to a local/original data file (or perhaps immediately related data) might be only stored.
  • a document ID would probably be both stored and indexed, but not tokenized
  • Something like a product code (or SKU) could be both stored and indexed, and you may want to tweak the analysis so that you know you can search for it as a unit, as well as for its parts.
  • controlled keyword sets, e.g. to facet on, would probably be indexed, but not analyzed, or only tokenized and not e.g. stemmed.


(Older documentation will mention four types of field (Text, UnStored, UnIndexed, Keyword), which was a simple expansion of the few options you had at that time. These days you have to decide each separately)
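As an illustration of those decisions in the (older, 2.x/3.0-era) Field API - the field names and values are made up:

 import org.apache.lucene.document.Document;
 import org.apache.lucene.document.Field;

 public class FieldChoices {
     // Sketch: builds one hypothetical document, showing per-field store/index/analyze decisions.
     static Document example() {
         Document doc = new Document();
         // main text: analyzed and indexed; stored here too, e.g. for previews/highlighting
         doc.add(new Field("content", "The quick brown fox ...", Field.Store.YES, Field.Index.ANALYZED));
         // identifier: stored and indexed, but not tokenized, so it only matches as a whole
         doc.add(new Field("id", "doc-0042", Field.Store.YES, Field.Index.NOT_ANALYZED));
         // path to the original file: stored only, never searched on
         doc.add(new Field("path", "/data/originals/0042.pdf", Field.Store.YES, Field.Index.NO));
         // controlled keywords, e.g. to facet on: not analyzed, one Field instance per keyword (multivalued)
         doc.add(new Field("keyword", "search", Field.Store.YES, Field.Index.NOT_ANALYZED));
         doc.add(new Field("keyword", "lucene", Field.Store.YES, Field.Index.NOT_ANALYZED));
         return doc;
     }
 }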

More on tokens

A Token represents an occurrence of a term in the text of a Field. As such, it consists of:

  • the term text
  • the (start and end) offset of the term in the character stream
  • optionally, a (lexical) type, used by some analyzers (specifically by filters in an analysis that wants to have some type/context awareness)
  • a position increment - the offset relative to the previous term (see below)

(Handing the represented String in directly is now deprecated, as it is slow.)

Token positions may be stored or not (depending on the field type); they are what supports NEAR queries, and they can also be handy during indexing. For example, a Filter that emits synonyms can pretend that various different words were present at the same original position.

Basics of Analyzers, Tokenizers, Filters

The analysis package primarily has Analyzers.

Analyzer objects take a Reader (a character stream rather than a String, for scalability), apply a Tokenizer to emit Tokens (which are strings with, at least, a position), and may apply a bunch of Filters, which can alter that token stream in various ways (altering, removing, or adding tokens), before handing it to the indexing process.

As an example, the StandardAnalyzer consists of:

  • a StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • a StandardFilter (removes 's from the end of words and dots from acronyms)
  • a LowerCaseFilter (lowercases all text)
  • a StopFilter (removes common English stopwords).

See also the analyzer list below; you can also compose and create your own.
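
For example, a hand-rolled Analyzer that mirrors the StandardAnalyzer composition above might look roughly like this (written against the older 2.x API; newer versions add Version parameters and reusable token streams, so treat this as a sketch):

 import java.io.Reader;
 import org.apache.lucene.analysis.Analyzer;
 import org.apache.lucene.analysis.LowerCaseFilter;
 import org.apache.lucene.analysis.StopAnalyzer;
 import org.apache.lucene.analysis.StopFilter;
 import org.apache.lucene.analysis.TokenStream;
 import org.apache.lucene.analysis.standard.StandardFilter;
 import org.apache.lucene.analysis.standard.StandardTokenizer;

 public class MyAnalyzer extends Analyzer {
     public TokenStream tokenStream(String fieldName, Reader reader) {
         TokenStream stream = new StandardTokenizer(reader);               // split into tokens
         stream = new StandardFilter(stream);                              // strip 's and acronym dots
         stream = new LowerCaseFilter(stream);                             // lowercase everything
         stream = new StopFilter(stream, StopAnalyzer.ENGLISH_STOP_WORDS); // drop common English stopwords
         return stream;
     }
 }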


Note that since analysis is effectively a transformation that results in a somewhat mangled form of the input, you must use the same (or at least a similar enough) analyzer when querying a particular field, or you will not get the results you expect, or even any at all.


Note that you have the option of using different analysis for different fields. You could see this as fields being different sub-indexes, which can be powerful, but be aware of the extra work that needs to be done.


More detailed token and field stuff

Phrase and proximity search relies on position information. In the token generation phase, the position increment of all tokens is 1 by default, which just means that each term is positioned one term after the other.

There are a number of cases you could think of for which you would want to play with position / position offset information:

  • enabling phrase matches across stopwords by pretending stopwords aren't there (filters dealing with stopwords tend to do this)
  • injecting multiple forms at the same position, for example different stems of a term, different rewrites of a compound, injecting synonyms at the same position, and such
  • avoiding phrase/proximity matches across sentence/paragraph boundaries


When writing an analyzer, it is often easiest to play with the position increment, which is set on each token and refers to the offset relative to the previous term (not the next).

For example, for a range of terms that should appear at the same position, you would set the increment of all but the first to zero.

Technically, you could even use this for a term frequency boost, repeating a token with an increment of zero.
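
A rough sketch of the synonym-injection case, written against the old 2.x-era Token/TokenStream API (newer versions use attribute-based streams and reused term buffers instead); the one-synonym-per-term map is a made-up simplification:

 import java.io.IOException;
 import java.util.Map;
 import org.apache.lucene.analysis.Token;
 import org.apache.lucene.analysis.TokenFilter;
 import org.apache.lucene.analysis.TokenStream;

 public class SimpleSynonymFilter extends TokenFilter {
     private final Map<String, String> synonyms;  // term -> synonym (one each, for the sketch)
     private Token pending;                       // a synonym waiting to be emitted

     public SimpleSynonymFilter(TokenStream input, Map<String, String> synonyms) {
         super(input);
         this.synonyms = synonyms;
     }

     public Token next() throws IOException {
         if (pending != null) {            // emit a queued synonym first
             Token t = pending;
             pending = null;
             return t;
         }
         Token token = input.next();
         if (token == null) return null;   // end of stream
         String syn = synonyms.get(token.termText());
         if (syn != null) {
             pending = new Token(syn, token.startOffset(), token.endOffset());
             pending.setPositionIncrement(0);  // 0 = same position as the original term
         }
         return token;
     }
 }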


Multivalued fields - that is, those for which you add multiple chunks - act position-wise as if the chunks are concatenated. When applicable, you can set a large position increment between these values to avoid phrase matches across them (a property of the field?(verify)).

Some other...

Tokenizers include:

  • StandardTokenizer (basic splitter at punctuation characters, removing all punctuation except dots not followed by a whitespace, hyphens within what seem to be hyphenated codes, and parts of email addresses and hostnames)
  • CharTokenizer
    • LetterTokenizer (splits on non-letters, using Java's isLetter())
      • LowercaseTokenizer (LetterTokenizer + LowercaseFilter, mostly for a mild performance gain)
    • RussianLetterTokenizer
    • WhitespaceTokenizer (only splits on whitespaces, so e.g. not punctuation)
  • ChineseTokenizer (splits ideograms)
  • CJKTokenizer (splits ideograms, emits pairs(verify))
  • NGramTokenizer, EdgeNGramTokenizer (generate n-grams for all n in a given range)
  • KeywordTokenizer (emits entire input as single token)


Filters include:

  • StandardFilter (removes 's from the end of words and dots from acronyms)
  • LowerCaseFilter
  • StopFilter (removes stopwords (stopword set is handed in))
  • LengthFilter (removes tokens whose length is not in a given range)
  • SynonymTokenFilter (injects synonyms (a synonym map is handed in))
  • ISOLatin1AccentFilter (removes accents from letters in the Latin-1 set, that is, replaces accented letters with unaccented characters (mostly/all ASCII) without changing case. Looking at the code, this is VERY basic and will NOT do everything you want. A better bet is to normalize your text (removing diacritics) before feeding it to indexing - see the sketch after this list)
  • Language-specific (some stemmers, some not) and/or charset-specific: PorterStemFilter, BrazilianStemFilter, DutchStemFilter, FrenchStemFilter, GermanStemFilter, RussianStemFilter, GreekLowerCaseFilter, RussianLowerCaseFilter, ChineseFilter (stopword stuff), ThaiWordFilter
  • SnowballFilter: Various stemmers
  • NGramTokenFilter, EdgeNGramTokenFilter
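
For that 'normalize before indexing' route, plain Java Unicode normalization (outside Lucene) already covers a lot more than ISOLatin1AccentFilter does - a minimal sketch:

 import java.text.Normalizer;

 public class Deaccent {
     // Decompose characters (NFD), then strip the combining marks - a broader-brush
     // de-accenting than ISOLatin1AccentFilter, though still not right for every language.
     static String stripDiacritics(String s) {
         String decomposed = Normalizer.normalize(s, Normalizer.Form.NFD);
         return decomposed.replaceAll("\\p{M}", "");   // remove combining marks
     }

     public static void main(String[] args) {
         System.out.println(stripDiacritics("Éléphant naïve façade"));  // -> "Elephant naive facade"
     }
 }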

Queries, searches, Hits


The search package has Searcher / IndexSearcher, which takes an IndexReader and a Query, and returns Hits, an iteration over the matching result Documents.

There are some ways to deal with searching multiple and remote indices; see e.g. the relations to the Searchable interface.


The queryParser package helps turn a query string into a Query object.

Note that for identifier queries and mechanical queries (e.g. items which share these terms but not the identifier) you'll probably want to construct a Query object structure instead of using the QueryParser.

(Note that unlike most other parts of Lucene, the QueryParser is not thread-safe)


For basic search functionality, Lucene has a fairly conventional query syntax.

In various cases you may wish to rewrite the query somewhat before you feed it to the query parser - in fact, you may well want to protect your users from having to learn Lucene-specific syntax.
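
A sketch of both routes - parsing a user query string versus building a mechanical query programmatically (roughly the 2.9/3.0-era constructors; field names and the query itself are made up):

 import org.apache.lucene.analysis.standard.StandardAnalyzer;
 import org.apache.lucene.index.Term;
 import org.apache.lucene.queryParser.QueryParser;
 import org.apache.lucene.search.BooleanClause;
 import org.apache.lucene.search.BooleanQuery;
 import org.apache.lucene.search.Query;
 import org.apache.lucene.search.TermQuery;
 import org.apache.lucene.util.Version;

 public class QueryExamples {
     public static void main(String[] args) throws Exception {
         // parsing a user-style query string
         QueryParser parser = new QueryParser(Version.LUCENE_30, "content",
                                              new StandardAnalyzer(Version.LUCENE_30));
         Query parsed = parser.parse("apache AND (lucene OR solr)");

         // building a mechanical query programmatically instead
         BooleanQuery q = new BooleanQuery();
         q.add(new TermQuery(new Term("content", "lucene")), BooleanClause.Occur.MUST);
         q.add(new TermQuery(new Term("id", "doc-0042")), BooleanClause.Occur.MUST_NOT);  // "but not this identifier"

         System.out.println(parsed + "\n" + q);
     }
 }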



On hits:

  • Hit (mostly wraps a Document; allows fetching of the score, field values, and the Document object)



The older way: (deprecated, will be removed in lucene 3)



See also:


Query classes:

  • PhraseQuery matches a sequence of terms, with optional allowance for other words in between (the maximum distance, called slop, is handed in)
  • PrefixQuery: Match the start of words (meant for truncation queries like epidem*)
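
For instance (classic API; the field name is made up):

 import org.apache.lucene.index.Term;
 import org.apache.lucene.search.PhraseQuery;
 import org.apache.lucene.search.PrefixQuery;
 import org.apache.lucene.search.Query;

 public class PhraseAndPrefix {
     // "apache ... lucene" with at most 2 other terms in between (slop)
     static Query phraseExample() {
         PhraseQuery phrase = new PhraseQuery();
         phrase.add(new Term("content", "apache"));
         phrase.add(new Term("content", "lucene"));
         phrase.setSlop(2);
         return phrase;
     }

     // truncation-style query: epidem*
     static Query prefixExample() {
         return new PrefixQuery(new Term("content", "epidem"));
     }
 }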



Related to scoring:


Unsorted:


Examples (various based on pylucene examples):

On scoring


Scoring uses:

  • per term per document:
    • tf for the term in the document
    • idf for the term
    • norm (some index-time boosts: document, field, length boosts)
    • term's query boost
  • per document:
    • coordination (how many of the query terms are found in a document)
  • per query
    • query normalization, to make scores between (sub)queries work on a numerically compatible scale, which is necessary to make complex queries work properly


  • more manual boosts:
    • boost terms (query time)
    • boost a document (index time)
    • boost a field's importance (in all documents) (index time)
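
Put together, the classic (DefaultSimilarity-style) scoring combines these roughly as follows, per the Similarity documentation of that era:

 \mathrm{score}(q,d) = \mathrm{coord}(q,d) \cdot \mathrm{queryNorm}(q) \cdot \sum_{t \in q} \Big( \mathrm{tf}(t,d) \cdot \mathrm{idf}(t)^{2} \cdot \mathrm{boost}(t) \cdot \mathrm{norm}(t,d) \Big)

where norm(t,d) folds together the index-time document, field, and length boosts.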


See also:


Pylucene notes


PyLucene is a Python wrapper around the Java Lucene implementation.

Note that previous versions used GCJ, current versions use JCC. The GCJ bindings were somewhat finicky, JCC less so.

Use your distribution's package unless you are aware of the potential trouble you may run into building this, and/or you must have the latest version.

Most installation errors seem to be related to library problems. Both the use of JCC (if applicable) and the compilation of PyLucene will fail if you do not have a JDK installed and the library path does not lead to libjava.so and libjvm.so.

It may also specifically need the sun jdk, though not necessarily 1.5.0 as some hardcoded paths in jcc's setup.py may indicate - just change them.


If you run into segfaults, the likeliest causes are:

  • you haven't done a lucene.initVM() (any lucene class access will then cause a segfault)
  • you are using threaded access to indices, but not in a pylucene friendly way (details and fixes are mentioned/documented in various places)


See also:


See also


Solr (subproject)


For a bunch of nittier-grittier details about actually working with SOLR, see Solr use notes


Solr is

  • a layer of management/functionality/caching between lucene and an actual use.
  • a Java servlet that provides friendlier search and management than bare Lucene does
  • a little more analysis on top of Lucene
  • a few features on top of Lucene

It mostly focuses on a nicer/more convenient interface for the data/search-providing side of things - see e.g. the solr bootcamp PDF (although there is slight detail overkill there).

Depending on your needs, Solr can either be a convenience interface that exposes most of Lucene's power while minimizing implementation bother -- or a lot of implementation that is an unnecessary layer and dependency for your specific needs. It's probably the former more often than not.


On tweaking for speed and scale

See:






Some (early) notes follow



On Lucene and Solr caches

Note that committing changes on your index flushes various caches. This is to be expected, but is also one of the factors in how (and how much) your indexing affects search speed.


The Lucene FieldCache, well, caches document field values. While you have no direct control over this, it can be useful to warm it indirectly through autowarming in Solr (newSearcher, firstSearcher). For example, if you frequently sort on a particular field, then autowarming queries that sort on that field will make it more likely that a lot of those values are already in Lucene's cache.


Solr's caches, on the other hand, have fairly specific and limited purposes, and a number of things aren't cached by anything, such as term positions(verify), so only come from the index on disk. This means that while you can optimize some queries to be served mostly from cache, others will always have parts that come from disk (or, often preferably, OS cache).


Warming caches with realistic searches has a noticeable positive effect on the Lucene and various Solr caches (although whether it has much effect on any given user search obviously depends on that search and how representative the warming was).

When a new IndexSearcher is started, its cache can be pre-populated with a number of items from an old cache. Note that while copying may be faster than warming via queries, configuring a large number of to-be-copied items increases the time between creation and registration (ready-to-use-ness) of a new IndexSearcher. As such, you have to weigh the time spent warming against the value of that warming over the searcher's lifetime, under the expected search load (and how likely it is that users have to wait for the creation of a new searcher). Also note that you should have enough searchers around to serve concurrent requests, or a portion of your users will be delayed in general, or delayed by warming.


The Solr Query Results Cache is a map from unique (query,sort) combinations to an ordered result list (of document IDs) each.

This is handy for users requesting more data for the same search: there is no connection state for that, so they would do new searches, but be served very quickly by this cache (combined with the document cache and a sensible value for queryResultWindowSize).

On memory use: only matching documents are stored in each value, which is usually a set much smaller than the whole set of documents - often no more than a few thousand. You may mostly see entries smaller than a dozen KB(verify), so this cache is useful and not too costly.


The Solr Filter Cache stores the results of each filter that is evaluated (which depending on the way you use queries may be none at all), which can be reused to make certain parts of queries more efficient - sometimes much more so.

It stores the result of each filter on the entire index (a boolean of each document's applicability under this filter), which is 1 bit per document, so ~122KB per million documents per entry. I would call this moderately expensive, so if you want to use it, read up on how to do so effectively.


The Solr Document Cache stores documents read from the index, largely for presentation.

When the Query Results Cache points to documents loaded in here, they can be presented from cache without talking to disk, so it is probably most useful as a (relatively small) buffer serving adjacent page-range documents with fewer separate disk accesses.

The queryResultWindowSize setting helps presentation efficiency, by controlling how many documents around a range request are also fetched. For example, fetching results 11-20 (probably as page 2) with a queryResultWindowSize of 50 means that 1-50 will be loaded into the document cache. This means people clicking on 'next page' or 'previous page' will likely be served from cache.

This cache cannot be autowarmed.


There is also an application level data cache, usable by Solr plugins(verify) (seems to be a convenience API so that you don't have to use an external cache).

Throwing resources at Solr (OS cache, SSD disks); other performance notes



General performance observations:

  • a single server with a bunch of gigs of RAM can usually be tweaked to comfortably serve up to a few dozen million documents, under reasonable query load(verify). Beyond that you have to get clever with resources and/or query optimization.
  • A lot of memory is good, because things that can come from memory (Lucene cache, Solr cache, even OS cache) are fast
    • You can use RAMDirectory
    • you could copy the index into tmpfs (or another type of RAMdisk) before starting
    • you can wait for the OS cache, which is the easiest way to load part of the index into memory, but is also the most fragile index caching (the index data may be moved out for various reasons)
  • Caching seems to have less effect on large indexes than on smaller ones. That's largely because it becomes impossible to avoid talking to disk (and so incur its latency, particularly on HDDs - SSDs are the better option), partly because of the relative size of index and memory, but also because some things aren't directly cached and were always likelier to have to come from disk, so are more bound by it(verify)
  • SSDs for index storage is good, because things that have to come from disk are served faster - lower latency as well as higher throughput (practically perhaps 80% of RAM speed)
  • You can try to exploit the Solr caches (mostly filtercache) where practical


If your index fits in RAM and you can dedicate the RAM to this index (in terms of size and use), you can exploit the OS cache for significantly faster searches.

If you have noticeably more (free) memory than data, and your index is fairly compact (doesn't e.g. spend most of its space on original documents in store-only fields, because that can easily double your index size), then the OS cache can be a productive cache. You could even choose to rely on it as a real cache, which caches more data than the specific Solr caches choose to do.

For example, in some tests on a home server (8GB RAM, 7200 RPM HDD) with a 4.5GB index and an hourly cronjob to cat the data to /dev/null (as a side effect causing it to be read into the OS cache), searches not served by this warming took somewhere between 0.15 and 0.6 sec, while with a warmed OS cache most searches were served in the 0.02 - 0.15 sec range (depending on complexity, mostly around 0.02 sec for simple queries; no filters/filtercache were used in this test).

However, this is a tricky bit of advice. There are many reasons this may not work so well, including anything that pushes the data out of cache again, including:

  • a process actually allocating the RAM, or causing large amounts of swapped-out data to be swapped back in
  • something doing cachable disk IO. That is, anything handling a lot of data that doesn't explicitly notify the kernel it should not be cached (if your OS even supports this). The more data they handle, the more they are likely to push your index data out. (Note that nightly backup may do this)
  • a sysadmin noticing the host seems to have resources free and installing other things on it. (In some situations, it may be hard or at least bothersome to argue that you're actually using that freeish RAM).

(Another factor in this is swappiness, as it affects how much RAM is swapped out and therefore usable for OS cache (arguably, relatively high swappiness can serve searching in this way, but only if there is a lot of memory to go around anyway))


The OS cache will always be a speed-helping factor, but if your index is larger than RAM or you cannot be sure the RAM will be and stay available, then search will be much likelier to have to fall back to disk IO.

Platter drives have relatively limited IOPS rates and seek times, so will support a relatively limited request rate compared to SSDs (and RAID - be careful not to overrate RAID in this context(verify)).

Both SSDs and RAID tend to be better at reading than writing, which is good news for search, as search tends to involve a lot more reading than writing (but even then both are okay).

Search supported by SSD reads may approach the speed of search backed by the OS cache, and be a lot more predictable, because the SSD is dedicated IO and won't be unpredictably re-purposed the way the OS cache may be. Also, even though SSD is expensive, even most smaller SSD drives fit indices when you dedicate them to searching (...although in practice you may also wish to place databases on them).



On faceting and its speed

See also http://wiki.apache.org/solr/SolrFacetingOverview


Distribution, replication

When you know that a node is at its processing limits, it pays to add more identical nodes to lighten the search load on each, and/or to distribute the documents among more hosts.


Nutch (subproject)


Nutch extends Lucene into a somewhat more worked-out search engine: an indexer plus a crawler appropriate for the fairly uncontrolled content on the web, intranets, and such (it works incrementally, focuses on well-linked content, and so on).

It has parsers for some formats(verify), and a basic JSP (servlet) based interface.


Nutch also triggered development of the Hadoop framework, and can now use its HDFS and MapReduce for distributed crawling and indexing, and for keeping that index updated(verify).

It seems that searching from an index stored on the DFS is not ideal, so you would want to copy indices to the local filesystem for fast search (possibly also distributed, but note that that has a few hairy details).



http://wiki.apache.org/nutch/NutchTutorial


http://wiki.apache.org/nutch/MapReduce http://wiki.apache.org/nutch/NutchDistributedFileSystem

http://wiki.apache.org/nutch/Nutch_-_The_Java_Search_Engine

http://wiki.apache.org/nutch/DissectingTheNutchCrawler

Comparison, contrasts, and notes on scaling

(From a large distance caused by not having used either Solr or Nutch)


On Lucene, Solr, and Nutch:

Lucene is really mostly just a core, and neither an easy/complete index engine nor an interface - though it is extensible.

Very roughly speaking: Nutch extends it for distribution and scale, particularly scaling the crawling. Solr for ease of management and use (Some might call Solr something like search process enterprise management).


Solr seems intended for collections of more structured data (often implicitly smaller sets), requiring a schema (which thinly wraps Lucene Field specifications) and allows operations like sorting on fields.


Nutch seems more geared to collections of data that is not homogeneously structured. Nutch uses a schema, but mostly just to add a bunch of its own metadata. It likes to see content as a single blob of text to be indexed. Nutch is much more geared to distribution, so indexing scales better.


Solr uses a schema for both metadata and to let you separate out aspects of your actual data -- including faceting.


Solr doesn't have a crawler (and for well-defined collections you don't need one), Nutch has one you can use on large sites and potentially the internet. You might want Droids for certain purposes.


On scaling indices and search



CouchDB, Cassandra, HBase


See Non-relational_database_notes


Hadoop


Hadoop is a Java implementation of concepts strongly inspired by Google Filesystem (GFS) and Google's MapReduce paper.

If that doesn't tell you how Hadoop is useful, read the two relevant papers now: MapReduce, GFS.


Hadoop mostly consists of the distributed filesystem and the Map-Reduce framework that interacts with it. These two are relatively separate, in that each could be used without the other.


Note that, as in many scaling systems, you are easily IO-bound. Unless you're Google, you are probably calling half a dozen nodes a cluster, and still have relatively few hard drives compared to the map jobs waiting on input from them, and compared to the amount of map-to-reduce transfers.

Consider the numbers when planning. If you have gigs of data and it has to come from a few disks at at most ~75MB/s, then even if you're alone on the system that is going to take minutes, as a direct implication.



Hadoopy terms and details:

HDFS

The Hadoop Distributed Filesystem, a.k.a. Hadoop DFS, a.k.a. HDFS.

Note that, from the map/reduce platform's point of view, there are storage options beyond HDFS.


You often use HDFS by referring to it from jobs, which means all the work of storing data is done for you. You can also use the HDFS shell for metadata management, and for copying files between the local filesystem and HDFS.
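
A minimal sketch of the Java FileSystem API route (paths are made up; the shell equivalents are things like hadoop fs -put and hadoop fs -ls):

 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.FileStatus;
 import org.apache.hadoop.fs.FileSystem;
 import org.apache.hadoop.fs.Path;

 public class HdfsSketch {
     public static void main(String[] args) throws Exception {
         Configuration conf = new Configuration();   // picks up the cluster's fs settings from the classpath
         FileSystem fs = FileSystem.get(conf);

         // copy a local file into HDFS (paths are made up)
         fs.copyFromLocalFile(new Path("/tmp/input.txt"), new Path("/user/me/input.txt"));

         // list what is there
         for (FileStatus status : fs.listStatus(new Path("/user/me"))) {
             System.out.println(status.getPath() + "  " + status.getLen() + " bytes");
         }
         fs.close();
     }
 }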


In slight technical detail, HDFS consists of:

  • one NameNode, which basically keeps track of the filesystem names, and of where, on the further nodes, the parts of the data are kept
  • one or (preferably) more DataNodes, among which data is distributed.


Map/Reduce

An execution framework that scales well (assuming a non-bottlenecked data source/sink, such as HDFS).

Note that by its nature it can be very high-throughput, but won't be real-time.


Terms and concepts in Hadoop's implementation:

  • jobs: A collection of code, data, and metadata that the job tracker expects - the things to complete as a whole
  • tasks: pieces of jobs, those that handle individual map and reduce steps
  • management around it (e.g. task distribution that chooses nodes near the data for execution)


A Hadoop job is a JAR file. In many cases this contains Java code (which is run in isolated VMs(verify)) - see the sketch after the list below.

That is not the only option. You can also:

  • use Hadoop Streaming, in which you pack a JAR with command-line executables that are actually the tasks (mapper, reducer, optionally more). Note that these have to be actually executable on the nodes, which means that the more complex or scripty they are, the more you may need to prepare nodes, e.g. installing libraries, scripting environments, etc., or include all of these (see e.g. python freezing)
  • Use Hadoop pipes for using map-reduce from C/C++, which is like streaming, but differs e.g. in how data is fed to processes
  • do everything yourself and interface with HDFS (e.g. using libhdfs)
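
As a sketch of the plain-Java route mentioned above: the usual shape is a Mapper, a Reducer, and a small driver that configures a Job - word counting, roughly, against the org.apache.hadoop.mapreduce API (paths come from the command line; details vary per Hadoop version):

 import java.io.IOException;
 import org.apache.hadoop.conf.Configuration;
 import org.apache.hadoop.fs.Path;
 import org.apache.hadoop.io.IntWritable;
 import org.apache.hadoop.io.LongWritable;
 import org.apache.hadoop.io.Text;
 import org.apache.hadoop.mapreduce.Job;
 import org.apache.hadoop.mapreduce.Mapper;
 import org.apache.hadoop.mapreduce.Reducer;
 import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
 import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

 public class WordCount {
     public static class TokenMapper extends Mapper<LongWritable, Text, Text, IntWritable> {
         private final static IntWritable ONE = new IntWritable(1);
         private final Text word = new Text();
         protected void map(LongWritable key, Text value, Context ctx) throws IOException, InterruptedException {
             for (String w : value.toString().split("\\s+")) {
                 if (w.isEmpty()) continue;
                 word.set(w);
                 ctx.write(word, ONE);     // emit (word, 1) for each word
             }
         }
     }

     public static class SumReducer extends Reducer<Text, IntWritable, Text, IntWritable> {
         protected void reduce(Text key, Iterable<IntWritable> counts, Context ctx) throws IOException, InterruptedException {
             int sum = 0;
             for (IntWritable c : counts) sum += c.get();
             ctx.write(key, new IntWritable(sum));  // emit (word, total)
         }
     }

     public static void main(String[] args) throws Exception {
         Job job = new Job(new Configuration(), "wordcount");  // older constructor; newer code uses Job.getInstance()
         job.setJarByClass(WordCount.class);
         job.setMapperClass(TokenMapper.class);
         job.setReducerClass(SumReducer.class);
         job.setOutputKeyClass(Text.class);
         job.setOutputValueClass(IntWritable.class);
         FileInputFormat.addInputPath(job, new Path(args[0]));    // HDFS input dir
         FileOutputFormat.setOutputPath(job, new Path(args[1]));  // HDFS output dir (must not exist yet)
         System.exit(job.waitForCompletion(true) ? 0 : 1);
     }
 }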




See also


Related apache projects (mostly subprojects)

Pig


A higher-level addition which should make creation of complex processes on Hadoop simpler.

Pig contains a language (Pig Latin) and a bunch of planning. Compilation produces sequences of Map-Reduce programs.


See e.g. http://hadoop.apache.org/pig/ http://www.cloudera.com/hadoop-training-pig-introduction


Hive


A SQL-like wrapper for accessing HDFS data, running on Hadoop. It makes relatively structured (and table-like) data act like tables, regardless of its stored format. Quite useful when your needs can be expressed in terms of Hive expressions.

Similar to Pig in that it usually compiles to a number of MapReduce steps.

Apparently developed for flexible log aggregation, but has various other uses.


Zookeeper


Supports coordination, mostly in the form of synchronization, providing in-order writes with eventual consistency, and has configurable redundancy.

Can be made the primary means of controlling a lot of your code. More generally, it lets you avoid reinventing (and debugging) a bunch of common distributed boilerplate code/services.

Some simple examples where ZooKeeper is useful include computation with shared memory, message queues, or simpler things like notifications, or storing configuration.
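
As a small sketch of the sort of primitive it gives you - storing and watching a configuration znode (connection string, paths, and data are made up):

 import org.apache.zookeeper.CreateMode;
 import org.apache.zookeeper.WatchedEvent;
 import org.apache.zookeeper.Watcher;
 import org.apache.zookeeper.ZooDefs;
 import org.apache.zookeeper.ZooKeeper;

 public class ConfigWatcher {
     public static void main(String[] args) throws Exception {
         ZooKeeper zk = new ZooKeeper("localhost:2181", 3000, new Watcher() {
             public void process(WatchedEvent event) {
                 System.out.println("event: " + event);   // connection/znode events arrive here
             }
         });

         // store a piece of configuration (persistent znode, open ACL)
         if (zk.exists("/app-config", false) == null) {
             zk.create("/app-config", "max_threads=8".getBytes(),
                       ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
         }

         // read it back, and ask to be notified of changes
         byte[] data = zk.getData("/app-config", true, null);
         System.out.println(new String(data));

         zk.close();
     }
 }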


See also:

Mahout


A machine learning system, meant to be scalable where possible, and easy to use.

Currently has clustering, classification, some collaborative stuff, some evolutionary stuff, and other things planned (e.g. SVM, linear regression, easier interoperability)

See also:


Chukwa and Flume (logging)


There are two distinct projects that do collection of logs and system metrics, letting you do log analytics of recent data, with some tuneable fault tolerance.



See also:


Hama

Avro (data serialization)


Typed data exchange, using a more dynamically typed, schema'd protocol than things like Apache Thrift, Protocol Buffers, and similar. Does not need pregenerated code.

Written around JSON.

See also:


Whirr

Other related projects

Dumbo


A project that makes it easier to use Hadoop's MapReduce from Python. Uses hadoop streaming under the covers, but saves you from most manual work.

Can use HDFS. Even without HDFS it can be handy, e.g. to execute simpler tasks on multiple nodes.

See also:




Droids


A robot writing framework -- letting you write things like web spiders, but also other things where automated and fault-tolerant fetching is nice.

See also:



Supporting projects

Tika

Extracts metadata and sometimes structured data out of a range of types of documents; see http://lucene.apache.org/tika/formats.html

Usable e.g. in Nutch, Droids, ...

See also:

...and e.g.:

Thrift


A service framework that can automatically create programming interfaces in Java, C#, Python, C++, Ruby, Perl, PHP, Smalltalk, OCaml, Erlang, Haskell, Cocoa, Squeak, and more, based on a service description.

You can see this as code generation around an RPC protocol, and it comes with a software stack (relying on boost) to run services as independent daemons.


As an example, there are thrift descriptions bundled with Hadoop that let you interface with HDFS, HBase and such.



Airflow

https://airflow.apache.org/