Elasticsearch notes

Some practicalities to search systems

Lucene and things that wrap it: Lucene · Solr · ElasticSearch

Search-related processing: tf-idf

ElasticSearch

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ElasticSearch is largely seen as a document store focused on text search, and also some data storage (mostly logging/metrics).


ElasticSearch is an HTTP service with Lucene at its core. Compared to other things that do that, it

wraps more features (e.g. making replication and distribution easier),
does more to make more data types easily searchable (to the point that it is also usable for data logging and metrics - in fact the company behind it is leaning on the monitoring/metrics angle),
eases management and is a little more automatic at that (e.g. index handling that gives more consistent latency over time).


On one hand you can see it as a document store with a CRUD API and indexing -- so you could use it like a database engine (...though in that niche-purpose NoSQL-ey way: it doesn't do strong consistency, so you may not really want it as a primary store).


ElasticSearch itself is a standalone thing, with an API easy enough that you can interface with it directly from code or even from the command line - ES libraries are relatively thin and more about convenience.


The major moving parts

Some terminology

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(note that the core of this comes from the underlying lucene)


mapping - basically a schema, telling ES how to interpret each field

e.g. mentions types
unless you set the mapping, it will mostly add fields as you mention them, with types guessed based on the first value it sees


field - a distinct thing in the mapping

(usually) indexed to be searchable
(usually) fetchable
(there are some field type variants - see below)


document - individual search result.

you can see it as the JSON you sent in, and/or as the part of that that got put into fields (and usually indexed) as specified by the mapping
side note: most fields can have zero or more values in a document, though in practice it's often just one, and in a few cases it's restricted to just one
original JSON is also stored in _source


index

if you squint, a grouping of documents - a collection you search as a whole
you can have multiple indices per cluster (and may well want to)
an index is divided into one or more shards, which is about replication in general and distribution in a cluster
(and shards are divided into segment files, which is more about how updating documents works)


...and this is the point at which you can stop reading this section if you're just doing some experiments on one host. You will get to the rest if and when you need to scale to multiple hosts.

shard

you typically split an index into a number of shards, e.g. to be able to do horizontal scaling onto nodes
internally, each shard is a self-contained searchable thing. In the sense of the complete set of documents you fed in, this is just a portion of the overall thing we call index here
Shards come in two types: primary (the basic type), and replica.

segment - a shard consists of a number of segments (segments are individual files).

Each segment file is immutable (which eases a lot of management, means no blocking, eases parallelism, eases caching).


replica is an exact copy of a shard

the point of replicas is robustness against node failure:
if you have two copies of every shard (and those copies are never on the same hardware), then one node can always drop out and search won't be missing anything
without duplication, node failure would mean missing a part of every index distributed onto it
you could run without replicas to save some memory
memory is unlikely to be much of your cloudy bill (something like ES is CPU and transfer hungry), so using replicas is typically worth it for long term stability


node - distinct ES server instance

in larger clusters, it makes sense to have some nodes take on specific roles/jobs

cluster - a group of nodes, serving a set of indices created in it

nodes must mention a shared ID to join a cluster


Combined with

In practice, people often pair it with (see also "ELK stack")

  • Kibana - web UI that makes a bunch of inspection and monitoring easier, including some dashboard stuff
    also seen in tutorials, there largely for its console, which lets you interactively poke ES without writing code
    pluggable, has a bunch of optional things


And, if you're not coding your own ingest,

  • Logstash - you can do all of that yourself, but logstash can do a lot of work for you, or at least be a quick start
  • Beats - where logstash is a more generic, configurable thing, beats is a set of specific-purpose ingest scripts, e.g. for availability, log files, network traffic, linux system metrics, windows event log, cloud service metrics, [1] - and a lot more contributed ones[2]
    e.g. metricbeat, which stores ES metrics in ES


Parts of ES and other ELK components are dual-licensed public/proprietary, other parts are proprietary only. (There was a license change that seemed to be aimed at limiting Amazon selling it, which just prompted(verify) Amazon to fork it into its own OpenSearch[3].) There are also some paid-only features[4], though many of them are advanced / scale features that a lot of setups won't need.


  • "Current license is non-compliant for search application and behavioral analytics. Current license is active basic license"
refers to a feature it (confusingly) calls "Elasticsearch Search and Analytics", which is the machine learning and alerting stuff.
Either buy a license, or disable those features:
xpack.ml.enabled: false
xpack.graph.enabled: false
xpack.watcher.enabled: false


See also:

Indices

Fields, the mapping, field types

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Each regular field (there are others) gets its own behavior in terms of

whether it is stored and/or
whether it is indexed
and if it is indexed, how it is handled in that process.

By default, ES picks up every part of the document.

The content of each part of the document you send in can go towards multiple fields - or none at all - depending primarily on the mapping.

You can ignore fields (probably by mentioning the field in your mapping but setting "enabled": false), but ES can also do this itself, e.g. based on configuration like "values should never be bigger than X bytes".

Note also that many fields can be considered to store arrays of values, and most searches will match if any of those match.



The mapping[5][6] is the thing that records all fields in an index - their names, data types, and configured processing on data coming in.

Explicit mapping[7] amounts to specifying a mapping before you add documents. This can be preferred when

you want specific interpretations of some fields (e.g. ip, date), or
specific features (e.g. flattened, join, search_as_you_type)

Dynamic mapping[8] (what happens if you don't do explicit mapping) means not specifying fields ahead of time -- in which case ES acts schemaless, in the sense that it will add new fields to the mapping as you mention them, guessing the type based on the first value it sees. This usually does the right thing, particularly if you're mainly handling text.
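For example, a minimal explicit mapping might look like the following (a sketch - the index and field names here are made up; the types shown are standard ones):

PUT /my-index
{
  "mappings": {
    "properties": {
      "title":     { "type": "text"    },
      "tag":       { "type": "keyword" },
      "added":     { "type": "date"    },
      "client_ip": { "type": "ip"      }
    }
  }
}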


You can get behaviour in between those by setting the dynamic[9] parameter (normally set per index(verify) and inherited, but you can be more precise about it), which can be set to:

true - unknown fields are automatically added to the mapping as regular fields (indexed)
(the default)
it'll mostly end up with boolean, float, long, or text/keyword
runtime - unknown fields are automatically added to the mapping as runtime fields (not indexed)
false - unknown fields are ignored
strict - unknown fields mean the document is rejected

Note that

  • with true and runtime, the created field's type is based on the communicated-JSON's type - see this table
  • runtime and dynamic will
    • try to detect whether text seems to be a date[10] (apparently if it follows a specific configured date template(verify))
    • optionally detect numbers[11] (and whether they are reals or integers), but this is disabled by default - probably because it's much less error-prone to control this yourself.


  • You can also alter mappings later - see [12]
but this comes with some restrictions/requirements/footnotes
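As a sketch of the dynamic setting mentioned above: creating an index that rejects documents containing unmapped fields could look something like this (index and field names made up):

PUT /strict-demo
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "title": { "type": "text" }
    }
  }
}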




Text-like field types

The documentation makes a point of a split into

  • text family[13] - text, match_only_text
  • keyword family[14] - keyword, constant_keyword, wildcard
  • ...and it seems that e.g. search-as-you-type is considered miscellaneous


  • text[15] (previously known as string analysed)
flexible search of free-form text; analysis will transform it before indexing (you probably want to know how - see e.g. analysis)
no aggregations
no sorting
  • keyword[16] (previously known as string not_analysed)
structured content (exact search only?), e.g. identifiers, a small/known set of tags, also things like emails, hostnames, zip code, etc.
should be a little faster to match (if only because of smaller index size)(verify)
allows sorting
allows aggregation
can make sense for serial numbers/IDs, tags - even if they look like numbers, you will probably only ever search them as text equality (compared to storing those as a number, keyword may be a little larger, yet also saves some index complexity necessary for numeric-range search)
  • constant_keyword [17] - all documents in the index have the same value
e.g. when you send each log file to its own index, this might assist some combining queries (verify)
  • wildcard - a keyword variant that assists wildcard and regexp queries
  • search_as_you_type[19]
fast for prefix (match at start) and infix (terms within) matches
mainly for autocompletion of queries, but could work for other shortish things (if not short, the index for this will be large)
n-gram style, can have larger n (but costs in terms of index size, so only useful if you usually want many-token matches?)
kept in memory, costly to build


Further notes:

  • it seems that fields that are dynamically added to the mapping and detected as text will get two indexed fields: a free-form text one, and keyword with ignore_above of 256
this is useful if you don't know what it will be used for
but for e.g. identifiers it's pointless to also tokenize it
and for free-form text it's probably pointless to also have that keyword field (will probably end up _ignored for most documents, except very short ones)
separately, remember that report-only values (e.g. URLs or summaries, depending on your use) are fine to only store (and so send) and not index


Data-like field types (primitives):

  • numbers: [21]
    • byte, short, integer, long, unsigned_long
    • float, double, half_float, scaled_float
allows range queries
  • and, arguably, keyword (see above)


Data-like fields (specific, but text-like):

  • date - dates and times (internally milliseconds since epoch (UTC))
  • version - text that represents semantic versioning
mostly for sortability?
  • ip - IPv4 and IPv6 addresses
allows things like subnet searches (CIDR style)
  • geospatial - points and shapes [24]
including distance and overlap


Different ways of treating the JSON as transferred:

  • object - each subfield is separately mapped and indexed, with names based on the nesting dot notation, see e.g. the linked example
  • flattened - takes the JSON object and indexes it as one single thing (basically an array of its values combined)
  • nested[27] - basically a variant of object that allows some field indexability


Other stuff

  • binary - holds a base64-encoded value
not searchable
not stored (by default) - which makes sense. A search index is not a blob store
seems to be there for the possibility of some plugin extracting text to index?(verify) or fishing it out of _source?(verify)


  • join - relations to other documents, by id(verify)
seems to mean query-time lookups, so
you could get some basic lookups for much cheaper than separate requests
you probably don't want multiple levels of lookups unless you don't mind slow responses (if you need relational, use a relational database)


token_count[30]

stored as an integer, but takes text, analyses it into tokens, then counts the number of tokens
seems intended to be used via multi-field, to also get the length of text


dense_vector[31] (of floats by default, byte also possible)

if indexed, you can use these for knn searches[32]
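A rough sketch of both (the index and field names are made up, and details like dims, similarity, and the knn search option depend on your ES version and embedding model):

PUT /vec-demo
{
  "mappings": {
    "properties": {
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}

GET /vec-demo/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.2, 0.1, 0.9],
    "k": 5,
    "num_candidates": 50
  }
}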


Managing indices (and thinking about shards and segments)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Remember that shards are made of segments, and segments are immutable.

ES collects changes in memory, and only occasionally dumps those into a new segment file.

(this is why documents are not immediately searchable and we call it near-realtime)

Refresh is the act of writing changes to a new segment

also helps updates not be blocking - existing segments are searched, new ones once they're done

Merge refers to how smaller segments are periodically consolidated into fewer files [33]

also the only way that deletes actually free disk space (remember, segments are immutable).


Refresh interval is 1s by default, but various setups may want that larger - it's a set of tradeoffs

  • Refresh can be too quick, in that the overhead of refreshing would dominate and less work is spent actually doing useful things (and it can be a little pricier in pay-per-CPU-use cloudiness)
  • Refresh can also be too slow, both in that
    • new results take a long time to show up
    • the heavier load, and large new segments, could make for more irregular response times

The default is 1s. Large setups might increase that up to the order of 30s.

(more in the tradeoffs below)
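Changing it is an index setting, e.g. (index name made up):

PUT /my-index/_settings
{
  "index": { "refresh_interval": "30s" }
}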




index lifecycle management (ILM) lets you do slightly higher-level management like

  • create a new index when one grows too large
  • do things like creating an index per time interval

Both can help get the granularity you want when it comes to backing them up, duplicating them, retiring them to adhere to data retention standards.


https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-lifecycle-management.html




Also on the topic: a reindex basically means "read data, delete data in ES, ingest again".

You would not generally want this over a merge.
It's mainly useful when you make structural schema changes, and you want to ensure the data uniformly conforms to that.

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
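The basic form copies from one index into another (which you would typically have created with the new mapping first); the index names here are made up:

POST /_reindex
{
  "source": { "index": "my-index-v1" },
  "dest":   { "index": "my-index-v2" }
}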

other field types

multi-fields[34] let you specify that one thing in your JSON should be picked up towards multiple fields.

  • e.g. a city name
    • as a keyword, for exact match, sorting and aggregation
    • and towards the overall text search, via an analyser
  • text as a searchable thing, and also store its length via token_count
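A sketch of such a mapping (field and sub-field names made up; note that token_count requires you to name an analyzer):

PUT /my-index
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw":    { "type": "keyword" },
          "length": { "type": "token_count", "analyzer": "standard" }
        }
      }
    }
  }
}

The sub-fields are then addressed as city.raw and city.length in queries and aggregations.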



runtime fields [35]

mostly about query time - not part of the index (so won’t display in _source)
just calculated as part of the response
can be defined in the mapping, or in an individual query
(seem to be a refinement of script fields?)
easy to add and remove to a mapping (because there is no backing storage for them)
https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields
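For example, defining one in an individual query might look like this (a sketch; it assumes a numeric price field already exists in the mapping, and the field name and tax factor are made up):

GET /my-index/_search
{
  "runtime_mappings": {
    "price_with_tax": {
      "type": "double",
      "script": { "source": "emit(doc['price'].value * 1.21)" }
    }
  },
  "fields": [ "price_with_tax" ],
  "query": { "match_all": {} }
}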


you can search specific indices (comma separated names, or _all), or just one - sometimes useful


Metadata fields (that is, those other than e.g. the fields from your document, and runtime/script fields)

  • _id - document's ID
  • _type - document's mapping type
deprecated
  • _index - index that the document is in
  • _source - is the original JSON that indexing got
note that not everything in it needs to be indexed
this 'original document' is also used in update and reindex operations
so while you can ask it to not store _source at all, that disables such operations
that original submission can be handy for debug
that original submission can be core functionality for your app in a "we found it by the index, now we give you the actual document" way
you can filter what parts of _source are actually sent in search results - see source filtering
  • _size - byte size of _source


  • _ignored - fields in a document that indexing ignored for any reason

See https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-fields.html for others


More on indexing

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


The bulk API can be used for all CRUD operations, but in practice is probably mostly used for indexing.



Do bulk adds where possible

While you can do individual adds, doing a lot of them together reduces overheads a bunch (and can work out more efficiently refresh-wise as well), largely relating to the refresh interval.


Keep in mind that if some part of the operations fails, you need to notice and deal with that correctly.
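The bulk body is newline-delimited JSON, alternating an action line and (for index/update) a document line - a sketch with made-up index and field names:

POST /_bulk
{ "index": { "_index": "my-index", "_id": "1" } }
{ "title": "first document" }
{ "index": { "_index": "my-index", "_id": "2" } }
{ "title": "second document" }

The response contains a per-item status (and an overall errors flag), which is where you would notice such partial failures.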


Missing data

Setting a field to null (or an array of nulls, or an empty array) means it is not added to the index, which means it can not be matched in searches (except for missing, must_not exists, and such(verify)).

Note that it also affects aggregation on the relevant fields, and some other details.


Ignoring data and fields

Field values may be ignored for varied reasons, including:

by default, a value that can't be converted to the field's type means ES rejects the whole document add/update operation (though note that in bulk updates it only considers that one operation failed, and will report a partial success)
if you set ignore_malformed true, the malformed fields are not indexed, but the rest are processed normally.


You can search for documents where this happened at all with a query like

"query":{ "exists":{"field":"_ignored"} } 


If you set ignore_malformed[37], it will instead reject only the bad values (rather than the whole document).
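For example, per field (names made up):

PUT /my-index
{
  "mappings": {
    "properties": {
      "count": { "type": "integer", "ignore_malformed": true }
    }
  }
}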



Text processing

You want curacao to match Curaçao? Want fishes to match fish?

To computers those are different characters and different strings. Whether such variations are semantically equivalent, or close enough to be fuzzy with, and how, might vary per language.


Analysers represent relatively simple processing that helps, among other things, normalize data (and implicitly the later query) for such fuzziness.


An analyzer is a combination of

  • a character filter
  • a tokenizer
  • a token filter

(a normalizer seems to be an analyzer without a tokenizer - so practically mainly a character filter; it can also have a token filter, but that is probably not so useful)

An analyzer takes text, and applies a combination of character filters, a tokenizer, and token filters, usually to the end of

  • split it into tokens (e.g. words)
  • stripping out things (e.g symbols and punctuation)
  • normalize (e.g. lowercasing)
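For example, a custom analyzer is defined in index settings by naming its parts - a sketch (the analyzer and index names are made up; html_strip, standard, lowercase and asciifolding are built-in pieces):

PUT /my-index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_text_analyzer": {
          "type": "custom",
          "char_filter": [ "html_strip" ],
          "tokenizer":   "standard",
          "filter":      [ "lowercase", "asciifolding" ]
        }
      }
    }
  }
}

Here asciifolding is what would make curacao match Curaçao; you then point a text field at this analyzer via its "analyzer" mapping parameter.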


There are built-in analyzers to cover a lot of basic text search needs

  • standard (the default if you don't specify one)
    • lowercases
    • splits using Unicode Text Segmentation (for English this is mostly splitting on spaces and punctuation, but it is a better-behaved default for some other languages)
    • removes most punctuation
    • can remove stop words (by default does not)
  • language-specific analysers - currently arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai.
    • details vary
    • can remove stopwords
    • can apply stemming (and take a list of exception cases to those stemming rules)
  • simple
    • lowercases
    • splits on non-letters
  • stop - simple plus stopword removal, so
    • lowercases
    • splits on non-letters
    • removes stopwords
  • pattern - splits on a regular expression
    • can lowercase
    • can remove stopwords
  • whitespace
    • no lowercasing
    • splits on whitespace
  • keyword - does nothing, outputs what it was given
  • fingerprint - reduces text in a way that helps detect duplicates, see something like [38]


https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
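The _analyze API is handy for checking what a given analyzer actually does to your text, e.g.:

POST /_analyze
{
  "analyzer": "standard",
  "text": "Curaçao FISHES, and the fish."
}

...which returns the list of tokens it would index.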

Data streams

Snapshots, Searchable snapshots

Install

Things you may want to think about somewhere before you have a big index

APIs

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ES grew a lot of APIs over time.

Perhaps the most central are

  • the document CRUD part (Index, Get, Update, Delete, respectively),
  • and the searching.


Document CRUD

  • Index - add a document (and implicitly get it indexed soon)
e.g. PUT /test/_doc/1 with the JSON doc in the body
  • Get - get a specific document by ID
e.g. GET /test/_doc/1 (HEAD if you only want to check that it exists)
this basically just gives you _source, so you can filter what part of that gets sent[39]
  • Update - change an existing document
e.g. POST /test/_update/1
allows things like "script" : "ctx._source.votes += 1"
  • Delete - remove a document by ID
e.g. DELETE /test/_doc/1
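For example, that update call could look like either of these (sketches; votes and title are made-up field names):

POST /test/_update/1
{
  "script": { "source": "ctx._source.votes += 1" }
}

POST /test/_update/1
{
  "doc": { "title": "new title" },
  "detect_noop": true
}

detect_noop (on by default) only applies to the doc style of update.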


Note that there is also a bulk API [42]

that lets you do many of the above at once
POST _bulk
where the body can contain many index, delete, update
libraries often support the bulk API in a way that makes sense for how the rest of that library works



"Is update really update?"

It seems that update is essentially

  • construct a new document based on the current version's _source
  • add 1 to the version
  • save that to a new segment
...even if it turns out there is no difference (...by default - you can specify that it should check and not do that - see detect_noop)

It seems that the old version will be considered deleted but still be on disk in the immutable segment it was originally entered in (until the next merge), and searches just happen to report only the latest (using _version)(verify). So if you do continuous updates on all of your documents, you can actually get the update process to fall behind, which can mean you temporarily use a lot more space (and search may be slightly slower too).



"Is there a difference between indexing a new version and updating the document?"

Very little in terms of what gets done at low level (both will create a new document, mark old as deleted)

It's mostly a question of whether what you want to do is easier to express as

  • a new version of the data
  • the changes you want to see, or a script that makes them

Also, keep in mind how you are updating. If it's fetch, change, send, that's more steps than 'here is how to change it' in a script.

So if you do not keep track of changes to send to elasticsearch, you could update by throwing everything at it with detect_noop - which only applies to update.




Search[43]

GET /indexname/_search

Multi Search[44]

multiple searches with a single API call
GET /indexname/_msearch
...though the request can switch index

Async search


Multi Get[45]

Retrieves multiple JSON documents by ID.
GET /_mget



Search

Shapes of a query

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Most APIs take a query that looks something like:

{
  "query": {
    "match_phrase": {
      "fieldname": "word word"
    }
  }
}


That is, there is a specific Query DSL, where you express a structured search as an abstract syntax tree, which is communicated as JSON.


Notes:

  • the field names will vary with specific types of search
In fact even the nesting will
  • You might like Kibana's console to play with these, as it has some auto-completion of the search syntax.
  • a quick "am I even indexing" test would probably use match_all[46]


Exceptions / other options

The only exception to "AST as JSON" seems to be URI search[47], which exposes basic searches in a URL (without needing to send a JSON-style query in the body) - but unless this covers all your needs, you will eventually want to abandon this. Might as well do it properly up front.

The only exceptions to "you must create query ASTs yourself" seem to be:

  • query_string - still uses that JSON, but feeds in a single query string in that specific lucene syntax, to be parsed internally
exposes a bunch of the search features in a shorter-to-write syntax
...that you have to learn
...that is fairly breakable
...that allows frankenqueries that unduly load your system
...so you probably don't want to expose this to users unless all of them are power users / already know lucene
  • simple_query_string[49]
same idea as the previous, but using the somewhat more simplified syntax
less breaky, still for experts

These exceptions may look simpler, but in the long run you will probably need to abandon them. If you want to learn it once, learn the AST way. (That said, a hacky translation to the latter syntax is sometimes an easier shortcut for the short term.)

API-wise

The search API[50] (/indexname/_search) should cover many basic needs.

The search template API[51] does the same sort of search, but you can avoid doing some complex middleware/client-side query construction by asking ES to slot some values into an existing saved template.

Multi search[52] (/indexname/_msearch) is sort of like bulk-CRUD but for search: you send in multiple search commands in one request. Responds with the same number of result sets.

Multi search template API[53] - seems to just be the combination of multi search and use of templates.



On result state, and getting more consistency between search interactions

In search systems, it can be a good tradeoff to

get a good indication of how many there are, roughly, and stop counting when the number amounts to "well, um, many" -- rather than get a precise count
fetch-and-show the complete data for only the first few -- rather than for everything
forget the search immediately after serving it -- rather than keeping state around in RAM and/or on disk, for who knows how long exactly

...because it turns out we can do that faster.

This is why ES by default stops counting after 10000, and only returns the first 10 results (size=10, from=0; paging depth is capped by max_result_window).


Also, if search-and-fetch-a-few turns out to be cheap in terms of IO and CPU, we can consider doing them without storing state.

This is also because we know that in interactive browser use, most people will never check more than a few.

If in fact someone does the somewhat unusual thing of browsing to page 2 or 3 or 4, you can redo the same search and fetch some more (use from to fetch what amounts to the next bunch).

However, if the index was refreshed between those actions, this new search may shift items around, so you might get the same item again, or never see one that moved. Whenever consistency or completeness really matters, you are probably looking for async search or PIT (see below).

Also, if you use this to back an API that allows "fetching everything", you won't have won much.
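For example, fetching what amounts to the third page of ten results (a sketch; the index and field names are placeholders):

GET /indexname/_search
{
  "from": 20,
  "size": 10,
  "query": { "match": { "plaintext": "fork" } }
}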



Async search[54] - lets you start searches in the background. Search functionality is mostly equivalent; there are some minor parameter differences (think cache).

the searches are stored in their own index(verify), so you may want a mechanism to delete the searches by ID if you need to, and/or lower the time these are kept (default is 5 days, see keep_alive)
note that wait_for_completion_timeout lets you ask for "return a regular search response if it finishes quickly, make it async if not"


Point in time API[55] - consider that document updates and refreshes mean you will generally get the latest results. If instead it is more important to get a consistent set, you can use PIT (or presumably async search?)
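A rough sketch of PIT use (index and field names are placeholders; the id value stands in for what the first call returns, and a PIT search does not take an index in the path):

POST /indexname/_pit?keep_alive=1m

GET /_search
{
  "size": 10,
  "query": { "match": { "plaintext": "fork" } },
  "pit": { "id": "<id returned by the _pit call>", "keep_alive": "1m" }
}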




Searching multiple indices

...mostly points to Multi search (_msearch), which looks something like:

GET /index1/_msearch
{ }
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }
{"index": "index2"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["title","plaintext"]  } } }
{"index": "index3"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }



Note that you can also do:

GET /index1,index2,index3/_search
{ "query": { "multi_match":{"query":"fork",  "fields":["title","plaintext"]  } } }

On the upside, you don't have to creatively merge the separate result sets (because multi search will not do that for you).

On the downside, you don't get to control that merge, e.g. scoring can be difficult to control, you don't really get to specify search fields per index anymore, or source filtering fields - though some forethought (e.g. to field naming between indices) can help some of those.



Other, mostly more specific search-like APIs, and some debug stuff

knn[56] search searches a dense_vector close to the query vector.

the separate API is deprecated, being moved to an option of regular search

suggester[57] - suggests similar search terms based on edit distance

terms_enum[58] -


count[59]

explain[60]

profile[61]

validate[62]

shards[63] - report which shards would be accessed

composing queries

Compound queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Compound queries[64] wrap others (leaf queries, or other compound queries).


One reason is logical combinations of other queries (note that when you do this for filtering only, there are sometimes other ways to filter that might be more efficient in specific situations)

bool [65]

lets you combine sub-queries with clauses like
must
filter - like must, but without contributing to scoring
must_not
should ('a portion of these leaf queries should match', and more matches score higher)
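For example (a sketch; the index, field names, and values are made up):

GET /indexname/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "plaintext": "fork" } } ],
      "filter":   [ { "term":  { "tag": "kitchen" } } ],
      "must_not": [ { "match": { "plaintext": "tuning" } } ],
      "should":   [ { "match": { "title": "fork" } } ]
    }
  }
}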


The rest is mostly about more controlled scoring (particularly when you make such combinations)

dismax [66]

if multiple subqueries match the same document, the highest score gets used


boosting [67]

have subqueries weigh in positively and negatively


constant_score [68]

results from the search this wraps all get a fixed constant score

function_score [69]

scores the results from the search this wraps using scripting that can consider values from the document and query


Term-level and full-text queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Term queries are not put through an analyser, so stay one string, and so are mostly used to match exact strings/terms.

Full-text queries are put through an analyzer (the same way as the text fields it searches), which does some alterations, and means the search then deals with multiple tokens.


term [70]

exact value
works in text fields but is only useful for fairly controlled things (e.g. username)

terms [71]

like term, matches any from a provided list

terms_set [72]

like terms, but with a minimum amount to match from a list


exists[73]

Returns documents that contain any indexed value for a field.

fuzzy[74]

Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.

ids[75]

Returns documents based on their document IDs.

prefix[76]

Returns documents that contain a specific prefix in a provided field.

range [77]

Returns documents that contain terms within a provided range.
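For example (a sketch, assuming a numeric field called age):

GET /indexname/_search
{
  "query": {
    "range": {
      "age": { "gte": 10, "lte": 20 }
    }
  }
}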

regexp[78]

Returns documents that contain terms matching a regular expression.

wildcard [79]

Returns documents that contain terms matching a wildcard pattern.


match [80]

matches text, number, date, or boolean data
in a single field
optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
text is analysed; all tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{
  "query": {
    "match": {
      "fieldname":{
        "query": "scary matrix snowman monster",
        "minimum_should_match": 2   //optional, but 
      }
    }
  }
}

multi_match [81]

match, but searches one query in multiple fields
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message" ] 
    }
  }
}

combined_fields [82]

somewhat like multi_match, but instead of doing the query to each field separately, it acts as if it matched on a single field (that consists of the mentioned fields combined), useful when the match you want could span multiple fields
{
  "query": {
    "combined_fields" : {
      "query":      "database systems",
      "fields":     [ "title", "subject", "message"],
      "operator":   "and"
    }
  }
}


match_phrase [83]

analyses into tokens, then matches only if all are in a field, in the same sequence, and by default with a slop of 0 meaning they must be consecutive
{
  "query": {
    "match_phrase": {
      "message": "this is a test"
    }
  }
}


match_phrase_prefix [84]

like match_phrase, but the last token is a prefix match

match_bool_prefix [85]

constructs a bool of term-shoulds, but the last one is a prefix match



intervals [86]

imposes rules based on order and proximity of matching terms


query_string [87]

lucene-style query string which lets you specify fields, and AND,OR,NOT, in the query itself [88]
maybe don't expose to users, because it's quick to error out

simple_query_string [89]

Span queries

Result set

Aggregation

Performance considerations

query considerations

More general settings

indexing considerations

Routing

Clusters

Security notes

ES errors and warnings

Some Kibana notes