Elasticsearch notes

Some practicalities of search systems

Lucene and things that wrap it: Lucene · Solr · ElasticSearch

Search-related processing: tf-idf

Broad intro

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ElasticSearch is largely seen as a document store that is searchable.

You could see it as a document store with CRUD HTTP API which happens to do indexing.

Because of its heritage (Lucene at its core) it is largely seen as text search, but it is also reasonably suited for logging and certain metrics.

(You could use it like a database engine, though in that niche-purpose NoSQL-ey way: it doesn't do strong consistency, so you may not really want it as a primary store.)


Compared to some other text indexing, it

wraps a few more features, e.g. replication and distribution
does more to make more data types easily searchable, part of why it is reasonable for data logging and metrics (in fact the company behind it is currently leaning on the monitoring/metrics angle),
eases management and is a little more automatic at that (e.g. index stuff that gives more consistent latency over time).


ES client libraries are relatively thin wrappers around communicating with that HTTP API, and are mostly about convenience.

There is no user interface included, perhaps in part because ES is fairly flexible. It's not very hard to interface your own form with the API, though.
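For example, storing and fetching a document are plain HTTP requests (index name and document contents here are made up for illustration; on a default install this talks to http://localhost:9200):

PUT /myindex/_doc/1
{ "title": "Forks", "plaintext": "on forks and forking" }

GET /myindex/_doc/1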

The major moving parts

Some terminology

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(note that the core of this comes from the underlying lucene)


mapping - basically a schema, telling ES how to interpret each field

e.g. mentions types
if a field is not already defined in the mapping by you, ES will add fields as you mention them, with types guessed based on the first value it sees (a minimal example follows at the end of this terminology list)


field - a distinct thing in the mapping

(usually) indexed to be searchable
(usually) fetchable
(there are some field type variants - see below)


document - individual search result.

you can see it as the JSON you sent in, and/or as the part of that that got put into fields (and usually indexed) as specified by the mapping
side note: most fields can have zero or more values in a document, though in many cases it's practically just one, and in a few cases it's restricted to just one
original JSON is also stored in _source


index

if you squint, a grouping of documents - a collection you search as a whole
you can have multiple indexes per cluster (and may well want to)
an index is divided into one or more shards, which is about replication in general and distribution in a cluster
(and shards are divided into segment files, which is more about how updating documents works)


...and this is the point at which you can stop reading this section if you're just doing some experiments on one host. You will get to the rest if and when you need to scale to multiple hosts.

shard

you typically split an index into a number of shards, e.g. to be able to do horizontal scaling onto nodes
internally, each shard is a self-contained searchable thing. In the sense of the complete set of documents you fed in, it is just a portion of the overall thing we call the index here
Shards come in two types: primary (the basic type), and replica.

segment - a shard consists of a number of segments (segments are individual files).

Each segment file is immutable (which eases a lot of management, means no blocking, eases parallelism, eases caching).


replica is an exact copy of a shard

the point of replicas is robustness against node failure:
if you have two copies of every shard (and those copies are never on the same hardware), then one node can always drop out and search won't be missing anything
without duplication, node failure would mean missing a part of every index distributed onto it
you could run without replicas to save some memory
memory is unlikely to be much of your cloudy bill (something like ES is CPU and transfer hungry), so using replicas is typically worth it for long term stability


node - distinct ES server instance

in larger clusters, it makes sense to have some nodes take on specific roles/jobs - the cluster notes below discuss some of that

cluster - a group of nodes, serving a set of indices created in it

nodes must mention a shared ID to join a cluster
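To make the above concrete, a minimal sketch (index name, field names and values are made up): create an index with one primary shard and one replica and a small mapping, then put a document into it.

PUT /myindex
{
  "settings": { "number_of_shards": 1, "number_of_replicas": 1 },
  "mappings": {
    "properties": {
      "title":     { "type": "text" },
      "plaintext": { "type": "text" },
      "year":      { "type": "integer" }
    }
  }
}

PUT /myindex/_doc/1
{ "title": "Forks", "plaintext": "on forks and forking", "year": 2021 }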

Combined with

In practice, people often pair it with... (see also "ELK stack")

  • Kibana - web UI that makes a bunch of inspection and monitoring easier, including a dashboard interface to ES
also seen in tutorials, largely for its console, for interactively poking ES without writing code
itself pluggable, with a bunch of optional things


And, if you're not coding your own ingest,

  • Logstash - ingest/processing pipeline; you can do all of that yourself, but logstash can do a lot of work for you, or at least be a quick start
  • Beats - where logstash is a more generic, configurable thing, beats is a set of specific-purpose ingest scripts, e.g. for availability, log files, network traffic, linux system metrics, windows event log, cloud service metrics, [1] - and a lot more contributed ones[2]
e.g. metricbeat, which stores ES metrics in ES


Licensing / subscriptions

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

From an "I don't care about jargon and legalese" perspective,

  • you can get a managed cloudy thing
Itself split into Standard, Gold, Platinum, Enterprise
  • or you can self-host


Around self-hosting:

  • there are two or three variants: Basic, Platinum, and Enterprise (Gold is gone?)
Basic is perfectly good for smaller scale
the differences between Platinum and Enterprise mostly matter once you do this at larger scale [3]
  • There is a perfectly free subset of ES, but it's not quite the default


That is, ES configured with a Basic license is just that. (Before 6.3 the free license needed to be refreshed every year, now basic/free does not need an explicit license anymore?(verify))

But the default install sets up a 30-day trial license in which you get more features, to entice you to buy the fuller product.

If you know you want to stick with basic features, then after this 30-day period (or earlier, if you know what you want) you would

not only need to switch to Basic license,
but also disable some features (apparently all X-pack things(verify))

...but if you didn't know that, this is annoying in the form of "automated install configured a trial license for me, but things just stopped working completely after 30 days and now I'm panicking", because it takes more than a little reading to figure out what you now need to disable and why. (My logs showed "Elasticsearch Search and Analytics" -- which seemed to just be a confusing name for the machine learning and alerting stuff)
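For reference, the licensing API lets you check and change this; something like the following (check your version's docs, as licensing details have shifted over time):

GET /_license

POST /_license/start_basic?acknowledge=true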



Amazon kerfuffle

Around 2021, there was a license change from Apache2 to SSPL (similar to AGPL but even more aggressive, so people don't quite consider it a free/open license anymore), seemingly aimed specifically at limiting Amazon selling it.

The only exceptions to "you must create query ASTs yourself" seem to be:

  • query_string[4] -
still uses that JSON, but feeds in a single query string in that specific lucene syntax, to be parsed internally
exposes a bunch of the search features in a shorter-to-write syntax
...that you have to learn
...that is fairly breakable
...that allows frankenqueries that unduly load your system
...so you probably don't want to expose this to users unless all of them are power users / already know lucene
  • simple_query_string[5]
same idea as the previous, but using a somewhat simplified syntax
less breaky, still for experts

These exceptions may look simpler, but unless they cover all your needs (and don't invite the issues above), you will probably need to abandon them in the long run. If you want to learn it once, learn the AST way. (That said, a hacky translation to the latter syntax is sometimes an easier shortcut in the short term.)
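For example, a query_string search might look like the following (index and field names are made up); note that the operators and field references sit inside the query string itself, and are parsed by ES:

GET /myindex/_search
{
  "query": {
    "query_string": {
      "query": "(snow OR ice) AND title:fork",
      "default_field": "plaintext"
    }
  }
}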

API-wise

The search API[6] (/indexname/_search) should cover many basic needs.
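A minimal example (index and field names made up; size and _source are optional):

GET /myindex/_search
{
  "query":   { "match": { "plaintext": "fork" } },
  "size":    5,
  "_source": [ "title" ]
}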

The search template API[7] does the same sort of search, but you can avoid doing some complex middleware/client-side query construction by asking ES to slot some values into an existing saved template.
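A sketch of that (template id, index and field names are made up): store a mustache template once, then refer to it by id and send only the parameters:

PUT /_scripts/my-template
{
  "script": {
    "lang": "mustache",
    "source": {
      "query": { "match": { "plaintext": "{{query_string}}" } },
      "size": "{{size}}"
    }
  }
}

GET /myindex/_search/template
{
  "id": "my-template",
  "params": { "query_string": "fork", "size": 5 }
}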

Multi search[8] (/indexname/_msearch) is sort of like bulk-CRUD but for search: you send in multiple search commands in one request. Responds with the same number of result sets.

Multi search template API[9] - seems to just be the combination of multi search and use of templates.



On result state, and getting more consistency between search interactions

In search systems, it can be a good tradeoff to

get a rough indication of how many results there are, and stop counting when the number amounts to "well, um, many" -- rather than get a precise count
fetch-and-show the complete data for only the first few -- rather than everything
forget the search immediately after serving it -- rather than keeping state around in RAM and/or on disk, for who knows how long exactly

...because it turns out we can do that faster.

This is why ES by default stops counting at 10000 hits, only returns the first 10 (size=10, from=0), and won't let you page past the first 10000 (max_result_window).


Also, if search-and-fetch-a-few turns out to be cheap in terms of IO and CPU, we can consider doing them without storing state.

This is also because we know that in interactive browser use, most people will never check more than the first few.

If in fact someone does the somewhat unusual thing of browsing to page 2 or 3 or 4, you can redo the same search and fetch some more (use from to fetch what amounts to the next bunch).

However, if the index was refreshed between those actions, this new search may shift items around, so you might get the same item again, or never see one at all. Whenever consistency or completeness really matters, you are probably looking for async search or PIT (see below).

Also, if you use this to back an API that allows "fetching everything", you won't have won much.
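So paging in its simplest form is just from and size on the same query (index and field names made up), e.g. page 3 at ten results per page:

GET /myindex/_search
{
  "from":  20,
  "size":  10,
  "query": { "match": { "plaintext": "fork" } }
}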



Async search[10] - lets you start searches in the background. Search functionality is mostly equivalent; there are some minor parameter differences (think caching)

the searches are stored in their own index(verify), so you may want a mechanism to delete searches by ID if you need to, and/or lower the time they are kept (default is 5 days, see keep_alive)
note that wait_for_completion_timeout lets you ask for "return a regular search response if it finishes quickly, make it async if not"
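A sketch (index and field names made up): start the search, get back an id, and poll it later:

POST /myindex/_async_search?wait_for_completion_timeout=2s&keep_alive=1d
{ "query": { "match": { "plaintext": "fork" } } }

GET /_async_search/<id from the previous response>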


Point in time API[11] - consider that document updates and refreshes mean you will generally get the latest results. If it is instead more important to get a consistent set across requests, you can use PIT (or presumably async search?)
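A sketch (index and field names made up): open a point in time, then search against that id rather than against the index directly:

POST /myindex/_pit?keep_alive=1m

GET /_search
{
  "query": { "match": { "plaintext": "fork" } },
  "size":  10,
  "pit":   { "id": "<id from the previous response>", "keep_alive": "1m" }
}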




Searching multiple indices

...mostly points to Multi search (_msearch), which looks something like:

GET /index1/_msearch
{ }
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }
{"index": "index2"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["title","plaintext"]  } } }
{"index": "index3"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }



Note that you can also do:

GET /index1,index2,index3/_search
{ "query": { "multi_match":{"query":"fork",  "fields":["title","plaintext"]  } } }

On the upside, you don't have to creatively merge the separate result sets (because multi search will not do that for you).

On the downside, you don't get to control that merge - e.g. scoring can be difficult to control, and you no longer get to specify search fields or source filtering per index - though some forethought (e.g. about field naming between indices) can help with some of that.



Other, mostly more specific search-like APIs, and some debug stuff

knn[12] search - finds documents whose dense_vector field is close to the query vector.

the separate API is deprecated, being moved to an option of the regular search API

suggester[13] - suggests similar search terms based on edit distance

terms_enum[14] - enumerates indexed terms in a field that start with a given string, e.g. useful for autocomplete


count[15] - returns just the number of matching documents

explain[16] - explains how a given document's score was computed for a query

profile[17] - gives a timing breakdown of the parts of a search

validate[18] - checks whether a query is valid, without executing it

shards[19] - report which shards would be accessed

composing queries

Compound queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Compound queries[20] wrap others (leaf queries, or other compound queries), for a few reasons.


One reason is logical combinations of other queries (note that when you do this for filtering only, there are sometimes other ways to filter that might be more efficient in specific situations)

bool [21]

lets you combine subqueries, with occurrence requirements like
must
filter is must without contributing to scoring
must_not
should ('some portion of these should match', and more matches score higher)
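A sketch of a bool query combining those (index, field names and values made up):

GET /myindex/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "plaintext": "fork" } } ],
      "filter":   [ { "range": { "year": { "gte": 2000 } } } ],
      "must_not": [ { "term":  { "language": "de" } } ],
      "should":   [ { "match": { "title": "fork" } } ]
    }
  }
}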


The rest is mostly about more controlled scoring (particularly when you make such combinations)

dismax [22]

if multiple subqueries match the same document, the highest score gets used


boosting [23]

have subqueries weigh in positively and negatively


constant_score [24]

results from the search this wraps all get a fixed constant score

function_score [25]

scores results from the search this wraps using scripting that can consider values from the document and the query


Term-level and full-text queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Term queries are not put through an analyser, so stay one string, and are mostly used to match exact strings/terms.

Full-text queries are put through an analyzer (the same way as the text fields it searches), which does some alterations, and means the search then deals with multiple tokens.


term [26]

exact value
works in text fields but is only useful for fairly controlled things (e.g. username)
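For example (index, field name and value made up):

GET /myindex/_search
{ "query": { "term": { "username": { "value": "jdoe" } } } }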

terms [27]

like term, matches any from a provided list

terms_set [28]

like terms, but requiring a minimum number of matches from the provided list


exists[29]

Returns documents that contain any indexed value for a field.

fuzzy[30]

Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.

ids[31]

Returns documents based on their document IDs.

prefix[32]

Returns documents that contain a specific prefix in a provided field.

range [33]

Returns documents that contain terms within a provided range.

regexp[34]

Returns documents that contain terms matching a regular expression.

wildcard [35]

Returns documents that contain terms matching a wildcard pattern.


match [36]

matches text, number, date, or boolean data
in a single field
optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
text is analysed; all tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{
  "query": {
    "match": {
      "fieldname":{
        "query": "scary matrix snowman monster",
        "minimum_should_match": 2   //optional, but 
      }
    }
  }
}

multi_match [37]

match, but searches one query in multiple fields
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message" ] 
    }
  }
}

combined_fields [38]

somewhat like multi_match, but instead of doing the query to each field separately, it acts as if it matched on a single field (that consists of the mentioned fields combined), useful when the match you want could span multiple fields
{
  "query": {
    "combined_fields" : {
      "query":      "database systems",
      "fields":     [ "title", "subject", "message"],
      "operator":   "and"
    }
  }
}


match_phrase [39]

analyses into tokens, then matches only if all are in a field, in the same sequence, and by default with a slop of 0 meaning they must be consecutive
{
  "query": {
    "match_phrase": {
      "message": "this is a test"
    }
  }
}


match_phrase_prefix [40]

like match_phrase, but the last token is a prefix match

match_bool_prefix [41]

constructs a bool of term shoulds, but the last term is a prefix match



intervals [42]

imposes rules based on order and proximity of matching terms


query_string [43]

lucene-style query string which lets you specify fields, and AND, OR, NOT, in the query itself [44]
maybe don't expose to users, because it's quick to error out

simple_query_string [45]

like query_string but with a simpler syntax, which ignores invalid parts rather than erroring out

Span queries

Result set

Aggregation

Performance considerations

query considerations

More general settings

indexing considerations

Routing

Clusters

Security notes

ES errors and warnings

Some Kibana notes