Elasticsearch notes
Some practicalities of search systems
Does ES suit your needs?
Broad intro
ElasticSearch is largely seen as
- a document store,
- that is searchable,
- and with a CRUD HTTP API to both search and manage it
Because of its heritage (Lucene at its core) it is largely seen as a text search engine.
Compared to some other text indexing choices, it makes things easier - it handles most index management for you, and it adds replication and distribution, so that it can scale to many hosts (should you ever need that)
It also does more to make more data types easily searchable. This makes it reasonably suited for logging and certain metrics,
particularly if you are primarily interested in recent data - i.e. monitoring (in fact the company behind ES is currently leaning into the monitoring/metrics angle).
(You could even abuse it like a database engine, though in that niche-purpose NoSQL-ey way where you may not really want it for a primary store, because it doesn't do strong consistency).
Libraries to interact with ES are relatively thin wrappers around communicating with the HTTP API, and more about convenience.
(In many cases, you can write those HTTP interactions yourself without that library)
There is no front-end user interface included.
It's not very hard to interface your own frontend with the API
(though when you first start this seems like another afternoon of work just to see anything, let alone something nice).
Subscription model
There are three options[1]:
- Free/Basic - gets you a subset of all features
- when you find the basic feature set enough, this is all you need
- (Note: before 6.3, the free license needed to be refreshed every year, now basic/free does not need an explicit license anymore?(verify))
- Platinum
- Enterprise
And, effectively,
- Trial, which acts like the full thing for 30 days
- when you're just installing it to try out all features, this is great
- after that, you would either
- choose a fancier package
- switch to Basic license, and maybe disable some features (apparently many but not all X-pack things(verify))
Depending on how it is installed, the default may not be basic, but trial.
- if you set that trial, you will know the implications and what settings to change
- if something else set up that trial, then thirty days later you may
- not know why it's not working
- not know what settings to change to make it work under the basic license
You would get something like "Current license is non-compliant for search application and behavioral analytics. Current license is active basic license".
Either buy a license, or disable those features:
(Logs seem to refer to "Elasticsearch Search and Analytics" -- which seemed to just be a confusing name for the machine learning and alerting stuff)
xpack.ml.enabled: false
xpack.graph.enabled: false
xpack.watcher.enabled: false
(note: I expect this to be outdated information)
(It doesn't help that X-pack is hard to explain because what it is changed a lot over the years)
In more serious installations
Around self-hosting there are two or three variants: Basic, platinum, and enterprise (Gold is gone?)
- Basic is perfectly good for smaller scale
- the differences between platinum and enterprise mostly matter once you do this on large scale [2]
From an "I don't care about jargon and legalese" perspective,
- you can get a managed cloudy thing - split into a few variants
- or you can self-host - also split into a few variants (not quite the same?)
License details
The core functionality is essentially free and mostly open.
ES has, over time, grown more proprietary parts, which seems to be so that they can sell a fancier variant for money.
From a FOSS view, this makes parts less open and more source-available.
From a practical view:
- When you're adding search to a hobby project, that's all you care about.
- When you're a business, the subscription that makes it easier to run at larger scale is clearly worth it.
Relatedly, they use a custom license.
Or rather, a choice between:
- SSPL
- a more-viral, less-open variant of the AGPL
- also meaning you cannot use(verify)/sell this as service with the ES branding removed
- Elastic License
- not open
This seems primarily related to Amazon essentially selling ES without really crediting ES (or helping its development).
Around 2021, ES changed its license from Apache2 to SSPL - similar to AGPL but a little more aggressive
at pointing out that when you provide ES as a service and remove the ES markings, you cannot use the free license.
Which apparently just prompted(verify) Amazon to fork a slightly earlier version into its own OpenSearch[3] (the SSPL seems made for another such case, MongoDB, and similarly Amazon just forked to DocumentDB(verify))
Install
"max virtual memory areas vm.max_map_count is too low"
Setup details you may want to think about before you find out you have to bring the whole thing down for a while, and/or redo that huge index
Note that among all the settings in ES,
- some are dynamic (changed on the fly)
- some are static (basically set in files, and may have more considerations to changing them)
...so you may want to go through at least some of the more static ones before you deploy something large.
Consider security
This matters more around multitenancy, or when your ES deployment is on a different provider (e.g. public cloud deployment), and may not be necessary within your own cloudiness.
Ease of management
Some implementation details you'd want to know
tl;dr
- you send in documents. You get back documents.
- documents can be seen as JSON with field:value
- a document is added to an index
- you can have one or more indexes
- more indexes is useful if you have different kinds of documents you want handled differently and/or searched separately
- each index has a mapping, which controls
- how it transforms each field
- whether it indexes the value
- whether it stores the value
- you can search one or more indexes at a time
- if you need scaling, you can have indexes spread over multiple computers
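As a minimal sketch of that flow (the index name, document, and field names here are made up):

PUT /notes/_doc/1
{ "title": "first note", "body": "some text to find later" }

GET /notes/_search
{ "query": { "match": { "body": "text" } } }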
The major moving parts
Some terminology
(note that the core of this comes from the underlying lucene)
mapping -
- tells ES how to interpret data for each individual field (is a schema or data dictionary if you prefer)
- e.g. mentions data types, whether to index on it and/or whether to (just) store it to return
- if a field is not already defined in the mapping by you, it will add fields as you mention them, with types guessed based on the first value it sees
field - a distinct thing in the mapping
- (usually) indexed to be searchable
- (usually) fetchable
- (there are some field type variants - see below)
document - both the individual document handed in to be indexed and the individual document handed back in search results are called a document
- think of this as a collection of fields, in JSON form
- note: most fields can have zero or more values in a document, though in a lot of practical cases it's practically just one, and in a few cases it's restricted to just one
- (the original JSON you asked it to index is also stored in _source)
index
- if you squint, a grouping of documents - a collection you search as a whole
- you can have multiple indexes per cluster (and easily want to)
- an index is internally divided into one or more shards, which is about replication in general and distribution in a cluster
For your first experiments, you can skim or skip reading the rest of this section - the details matter more to efficiency and distribution. You will get to them if and when you need to scale to multiple hosts.
shard
- you typically split an index into a number of shards, e.g. to be able to do horizontal scaling onto nodes
- internally, each shard is a self-contained searchable thing. In the sense of the complete set of documents you fed in, this is just a portion of the overall thing we call index here
- Shards come in two types: primary (the basic type), and replica.
- (and shards are further divided into segment files, which is more about how updating documents works)
segment - a shard consists of a number of segments (segments are individual files).
- Each segment file is immutable (which eases a lot of management, means no blocking, eases parallelism, eases cacheing).
replica is an exact copy of a shard
- the point of replicas is robustness against node failure:
- if you have two copies of every shard (and those copies are never on the same hardware), then one node can always drop out and search won't be missing anything
- without duplication, node failure would mean missing a part of every index distributed onto it
- you could run without replicas to save some memory
- memory is unlikely to be much of your cloudy bill (something like ES is CPU and transfer hungry), so using replicas is typically worth it for long term stability
node - distinct ES server instance
- in larger clusters, it makes sense to have some nodes take on specific roles/jobs - this discusses some of that
cluster - a group of nodes, serving a set of indices created in it
- nodes must mention a shared ID to join a cluster
Combined with
In practice, people often pair ES with... (see also "ELK stack")
- Kibana - a web UI that makes a bunch of inspection and monitoring easier, including a dashboard interface to ES
- also seen in tutorials, there largely for its console interactively poking ES without writing code
- itself pluggable, with a bunch of optional things
And, if you're not coding all your own ingest,
- Logstash - you can do all of that ingest yourself, but logstash can do a lot of work for you, or at least be a quick start
- Beats - where logstash is a more generic, configurable thing, beats is a set of specific-purpose ingest scripts,
- e.g. for availability, log files, network traffic, linux system metrics, windows event log, cloud service metrics, [4]
- and a lot more contributed ones[5]
- e.g. metricbeat, which stores ES metrics in ES
Indices
Fields, the mapping, field types
By default, ES picks up every part of the incoming JSON document, but you can configure it
- to have parts go towards multiple fields
- to have parts go to no field at all
- for all documents, by having the field in your mapping but setting "enabled": false on it,
- from other behaviour, such as "values should never be bigger than X bytes" config on a field
Note that in terms of the index, a document can have multiple values for a field. So fields can be considered to store arrays of values, and most searches will match if any of those match.
Each regular field (there are others) gets its own behavior in terms of
- whether it is stored and/or
- whether it is indexed
- if indexed, how it is transformed before going into the index
(For contrast and completeness, a runtime field is evaluated at runtime, and not indexed or in the _source [6]. This can be handy for certain calculated data that you won't search on, or to experiment with fields you'll later make regular fields(verify))
The mapping[7][8] is the thing that lists all fields in an index, mostly:
- name
- data type(s)
- configured processing when new data comes in
Explicit mapping[9], amounts to specifying a mapping before you add documents. This can be preferred when
- you want specific interpretations of some fields (e.g. identifier that shouldn't be parsed as text, an IP address, a date), and/or
- you want specific features (e.g. flattened, join, search_as_you_type)
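A minimal sketch of such an explicit mapping (the index and field names are made up):

PUT /logs
{
  "mappings": {
    "properties": {
      "message":   { "type": "text" },
      "client_ip": { "type": "ip" },
      "logged_at": { "type": "date" }
    }
  }
}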
Dynamic mapping[10] is (like 'schemaless') a nice name for
- "I didn't set a mapping for a field, so am counting on ES guessing right based on the first value it sees"
- if you're mainly handling text, this usually does the right or at least a very sane thing.
The dynamic parameter allows a little more variation of that (a sketch follows the notes below):
- true - unknown fields are automatically added to the mapping as regular fields (indexed) -- the default just described
- it'll mostly end up with boolean, float, long, or text/keyword
- runtime - unknown fields are automatically added to the mapping as runtime fields (not indexed)
- false - unknown fields are ignored
- strict - unknown fields mean the document is rejected
Note that
- with true and runtime, the created field's type is based on the communicated-JSON's type - see this table
- runtime and dynamic will
- You can also alter mappings later - see [13]
- but this comes with some restrictions/requirements/footnotes
- the dynamic parameter is normally set per index(verify) and inherited, but you can be more precise about it
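For instance, a sketch of an index that rejects documents containing unmapped fields (index and field names made up):

PUT /logs
{
  "mappings": {
    "dynamic": "strict",
    "properties": {
      "message": { "type": "text" }
    }
  }
}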
Text-like field types
The documentation makes a point of a split into
- text family[14] - text, match_only_text
- keyword family[15] - keyword, constant_keyword, wildcard
- ...and it seems that e.g. search-as-you-type are considered miscellaneous
- text[16] (previously known as string analysed)
- flexible search of free-form text; analysis will transform it before indexing (you probably want to know how - see e.g. analysis)
- no aggregations
- no sorting
- keyword (previously known as string not_analysed)
- structured content (exact search only?), e.g. identifiers, a small/known set of tags, also things like emails, hostnames, zip code, etc.
- should be a little faster to match (if only because of smaller index size)(verify)
- allows sorting
- allows aggregation
- can make sense for serial numbers/IDs, tags - even if they look like numbers, you will probably only ever search them up as text equality (compared to storing those as a number, keyword may be a little larger yet also saves some index complexity necessary for numeric-range search)
- constant_keyword - all documents in the index have the same value
- e.g. when you send each log file to its own index, this might assist some combining queries (verify)
- fast for prefix (match at start) and infix (terms within) matches
- mainly for autocompletion of queries, but could work for other shortish things (if not short, the index for this will be large)
- n-gram style, can have larger n (but costs in terms of index size, so only useful if you usually want many-token matches?)
- kept in memory, costly to build
Further notes:
- it seems that fields that are dynamically added to the mapping and detected as text will get two indexed fields: a free-form text one, and keyword with ignore_above of 256
- this is useful if you don't know what it will be used for
- but for e.g. identifiers it's pointless to also tokenize it
- and for free-form text it will probably do very little -- that is, for all but short documents it ends up _ignored. (It's a clever edge-case trick to deal with cases where the only value is actually something other than text, and is otherwise almost free)
- separately, some things you may wish to not index/search on, but still store so you can report them as part of a hit
Data-like field types
Primitives:
- numbers:
- byte, short, integer, long, unsigned_long
- float, double, half_float, scaled_float
- allows range queries
- and, arguably, keyword (see above)
Specific, text-like:
- date - internally milliseconds since epoch (UTC)
- version - text that represents semantic versioning
- mostly for sortability?
- ip - IPv4 and IPv6 addresses
- allows things like subnet searches (CIDR style)
- geospatial - points and shapes [17]
- including distance and overlap
Different way of treating JSON as transferred:
- object[18] -
- each subfield is separately mapped and indexed, with names based on the nesting dot notation, see e.g. the linked example
- flattened[19]
- takes the JSON object and indexes it as one single thing (basically an array of its values combined)
- nested[20] - basically a variant of object that keeps the objects in an array separate, so their fields can be queried independently of each other
Other stuff:
- binary[21]
- join[22]
- relations to other documents, by id(verify)
- seems to mean query-time lookups, so
- you could get some basic lookups for much cheaper than separate requests
- at the same time, you can make things slower by doing unnecessary and/or multiple levels of lookups (if you need relational, a relational database is better at that)
- token_count[23]
- stored as an integer, but takes text, analyses it into tokens, then counts the number of tokens
- seems intended to be used via multi-field, to also get the length of text
- dense_vector[24] (of floats by default, byte also possible)
- if indexed, you can use these for knn searches[25]
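A rough sketch of what that looks like in recent (8.x) versions - the index name, field name, and the tiny dimensionality are made up:

PUT /docs
{
  "mappings": {
    "properties": {
      "embedding": { "type": "dense_vector", "dims": 3, "index": true, "similarity": "cosine" }
    }
  }
}

POST /docs/_search
{
  "knn": {
    "field": "embedding",
    "query_vector": [0.2, 0.1, 0.9],
    "k": 5,
    "num_candidates": 50
  }
}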
Managing indices (and thinking about shards and segments)
Remember that shards are made of segments, and segments are immutable.
ES collects changes in memory, and only occasionally dumps that into a new segment file.
- (this is why documents are not immediately searchable and we call it near-realtime)
Refresh is the act of writing that out to a new segment
- also helps updates not be blocking - existing segments keep being searched, new ones once they're ready
Merge refers to how smaller segments are periodically consolidated into fewer files, [26]
- also the only way that delete operations actually flush (remember, segments are immutable).
Refresh interval is 1s by default, but various setups may want that larger - it's a set of tradeoffs
- Refreshing can be too frequent, in that the overhead of refreshing starts to dominate and less time is spent doing useful work (and it can be a little pricier in pay-per-CPU-use cloudiness)
- Refresh can also be too slow both in that
- new results take a long time to show up
- the heavier load, and large new segments, could make for more irregular response times
Large setups might increase the interval to the order of 30s.
(more in the tradeoffs below)
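Changing it is a per-index setting; a minimal sketch (index name made up):

PUT /logs/_settings
{ "index": { "refresh_interval": "30s" } }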
index lifecycle management (ILM) lets you do slightly higher-level management like
- create a new index when one grows too large
- do things like creating an index per time interval
Both can help get the granularity you want when it comes to backing them up, duplicating them, retiring them to adhere to data retention standards.
https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-lifecycle-management.html
Also on the topic: a reindex basically means "read data, delete data in ES, ingest again".
- You would not generally want this over a merge.
- It's mainly useful when you make structural schema changes, and you want to ensure the data uniformly conforms to that.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
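The _reindex API copies documents from one index into another (index names made up), which is typically how you apply such structural changes:

POST /_reindex
{
  "source": { "index": "logs-old" },
  "dest":   { "index": "logs-new" }
}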
other field types
multi-fields[27] let you specify that one thing in your JSON should be picked up towards multiple fields.
- e.g. a city name
- as a keyword, for exact match, sorting and aggregation
- and towards the overall text search, via an analyser
- text as a searchable thing, and also store its length via token_count
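A sketch of such a multi-field mapping (index and field names made up) - the same incoming value is indexed three ways:

PUT /places
{
  "mappings": {
    "properties": {
      "city": {
        "type": "text",
        "fields": {
          "raw":    { "type": "keyword" },
          "length": { "type": "token_count", "analyzer": "standard" }
        }
      }
    }
  }
}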
runtime fields [28]
- mostly about query time - not part of the index (so won’t display in _source)
- just calculated as part of the response
- can be defined in the mapping, or in an individual query
- (seem to be a refinement of script fields?)
- easy to add and remove to a mapping (because there is no backing storage for them)
- https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
- https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields
- you can search specific indices (comma separated names, or _all), or just one - sometimes useful
Metadata fields (that is, those other than e.g. the fields from your document, and runtime/script fields)
- _id - document's ID
- _type - document's mapping type
- deprecated
- _index - index that the document is in
- _source[29]
- is the original JSON that indexing got
- note that this is stored separately from whatever does or does not get indexed
- this 'original document' is also used in update and reindex operations
- so while you can ask it to not store _source at all, that disables such operations
- that original submission can be handy for debug
- that original submission can be core functionality for your app in a "we found it by the index, now we give you the actual document" way
- you can filter what parts of _source are actually sent in search results - see source filtering
- _size - byte size of _source
- _ignored - fields in a document that indexing ignored for any reason
For others, see https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-fields.html
More on indexing
The bulk API could be used for all CRUD operations, but in practice is probably mostly used for indexing.
Bulk adds help performance
While you can do individual adds, doing bunches together reduces overheads, and can make refreshes more efficient as well -- largely relating to the refresh interval.
However, keep in mind that if some part of a bulk operations fails, you need to notice, and deal with that correctly.
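A sketch of the newline-delimited body the bulk endpoint expects (index names and documents made up) - each action line is followed by its document, and the body ends with a newline:

POST /_bulk
{ "index": { "_index": "notes", "_id": "1" } }
{ "title": "first note" }
{ "index": { "_index": "notes", "_id": "2" } }
{ "title": "second note" }
{ "delete": { "_index": "notes", "_id": "3" } }

The response reports success or failure per item, which is where the 'notice partial failures' point comes in.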
Missing data
Setting a field to null (or an array of nulls, or an empty array) means it is not added to the index.
Sounds obvious, but do a mental check that "that field can not be matched in a lot of searches" (except for missing, must_not exists, and such(verify)) is the behaviour you intend.
Note that it also affects aggregation on the relevant fields, and some other details.
Ignoring data and fields
Field values may be ignored for varied reasons, including:
- use of ignore_malformed
- by default, a value that can't be converted to the field's type means the entire document add/update operation is rejected
- though note that in bulk updates it only considers that one operation failed, and will report a partial success
- if you set ignore_malformed[30], it will instead reject only bad values, and process the rest normally
- ignore_above (on keyword fields)
You can search for documents where this happened at all (you'd probably do that for debug reasons) with a query like
"query":{ "exists":{"field":"_ignored"} }
Text processing
Do you want curacao to match Curaçao?
Do you want fishes to match fish?
To computers those are different characters and different strings,
whether such variations are semantically equivalent, or close enough to be fuzzy with, and how, might vary per language.
Analysers represent relatively simple processing that helps, among other things, normalize data (and implicitly also the later query) for such fuzziness.
Analyzers take text, and apply a combination of
- character filters
- a tokenizer
- token filters
usually to the end of
- stripping out things (e.g. symbols and punctuation)
- splitting it into tokens (e.g. words)
- normalizing (e.g. lowercasing)
There are built-in analyzers to cover a lot of basic text search needs
- standard (the default if you don't specify one)
- lowercases
- splits using Unicode Text Segmentation (which for English is mostly splitting on spaces and punctuation, but is a better-behaved default for some other languages)
- removes most punctuation,
- can be told to remove stop words (by default does not)
- language-specific analysers - currently arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai.
- details vary
- can remove stopwords
- can apply stemming (and take a list of exception cases to those stemming rules)
- simple
- lowercases
- splits on non-letters
- no option for stopword removal
- stop - simple plus stopword removal, so
- lowercases
- splits on non-letters
- removes stopwords (English by default)
- pattern - splits by regexp
- can lowercase
- can remove stopwords
- whitespace
- no lowercasing
- splits on whitespace
- keyword - does nothing, outputs what it was given
- fingerprint - reduces text in a way that helps detect duplicates, see something like [31]
https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
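To see what a given analyzer actually does to some text, the _analyze endpoint is handy; a minimal sketch:

POST /_analyze
{ "analyzer": "standard", "text": "Curaçao fishes, 2 of them!" }

The response lists the tokens it produced, which makes it easier to reason about why something does or doesn't match.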
Data streams
Snapshots, Searchable snapshots
APIs
ES grew a lot of APIs over time.
Perhaps the most central are
- the document CRUD part (Index, Get, Update, Delete, respectively),
- and the searching.
Document CRUD
- Index - add a document (and implicitly get it indexed soon)
- e.g. PUT /test/_doc/1 with the JSON doc in the body
- Get - get a specific document by ID
- e.g. GET test/_doc/1 (HEAD if you only want to check that it exists)
- this basically just gives you _source, so you can filter what part of that gets sent[32]
- Update[33]
- e.g. POST /test/_update/1
- allows things like "script" : "ctx._source.votes += 1"
- Delete[34]
- e.g. DELETE /test/_doc/1
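Spelled out a little more (the index name and document are made up):

PUT /test/_doc/1
{ "title": "hello", "votes": 1 }

GET /test/_doc/1

DELETE /test/_doc/1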
Note that there is also a bulk API [35]
- that lets you do many of the above at once
- POST _bulk
- where the body can contain many index, delete, update
- libraries often support the bulk API in a way that makes sense for how the rest of that library works
"Is update really update?"
It seems that update is essentially
- construct a new document based on the current version's _source
- add 1 to the version
- save that to a new segment
- ...even if it turns out there is no difference (...by default - you can specify that it should check and not do that - see detect_noop)
It seems that the old version will be considered deleted but still be on disk in that immutable segment it was originally entered in (until the next merge), and searches just happen to report only the latest (using _version)(verify). So if you do continuous updates on all of your documents, you can actually get the update process to fall behind, which can mean you temporarily use a lot more space (and search may be slightly slower too).
"Is there a difference between indexing a new version and updating the document?"
Very little in terms of what gets done at low level (both will create a new document, mark old as deleted)
It's mostly a question of whether what you are doing is easier to express as
- a new version of the data
- the changes you want to see, or a script that makes them
Also, keep in mind how you are updating. If it's fetch, change, send, that's more steps than 'here is how to change it' in a script.
So if you do not keep track of changes to send to elasticsearch, you could update by throwing everything at it with detect_noop - which only applies to update.
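A sketch of that kind of partial update (index name and fields made up) - detect_noop is true by default for doc-style updates:

POST /test/_update/1
{
  "doc": { "title": "hello", "votes": 1 },
  "detect_noop": true
}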
Search[36]
- GET /indexname/_search
Multi Search[37]
- multiple searches with a single API call
- GET /indexname/_msearch
- ...though the request can switch index
Async search
Multi Get[38]
- Retrieves multiple JSON documents by ID.
- GET /_mget
Search
Shapes of a query
There is a specific Query DSL,
where you express a structured search as an abstract syntax tree, which is communicated as JSON,
so most APIs take a query that looks something like:
{
"query": {
"match_phrase": {
"fieldname": "word word"
}
}
}
Notes:
- the field names will vary with specific types of search. The nesting will too.
- You might like Kibana's console to play with these, as
- it has some auto-completion of the search syntax.
- it remembers the queries you leave around
- a quick "am I even indexing anything" test would probably use match_all[39] (no filter at all)
Exceptions / other options
The only exception to "you send ASTs as JSON" seems to be URI search[40] which exposes basic searches in a URL (without needing to send a JSON style query in the body), but unless this covers all your needs, you will eventually want to abandon this.
The only exceptions to "you must create query ASTs yourself" seem to be:
- query_string[41] -
- still uses that JSON, but feeds in a single query string in that specific lucene syntax, to be parsed internally
- exposes a bunch of the search features in a shorter-to-write syntax
- ...that you have to learn
- ...that is fairly breakable
- ...that allows frankenqueries that unduly load your system
- ...so you probably don't want to expose this to users unless all of them are power users / already know lucene
- simple_query_string[42]
- same idea as the previous, but using the somewhat more simplified syntax
- less breaky, still for experts
These exceptions may look simpler, but unless they cover all your needs, in the long run you will probably need to abandon them.
If you want to learn it once, learn the AST way.
(that said, a hacky translation to the latter syntax is sometimes an easier shortcut for the short term).
API-wise
The search API[43] (/indexname/_search) should cover many basic needs.
The search template API[44] does the same sort of search, but where you otherwise might find yourself writing some middleware to turn a form into a more complex query, templates allow you to slot some values into an existing saved template.
Multi search[45] (/indexname/_msearch) is sort of like bulk-CRUD but for search: you send in multiple search commands in one request. Responds with the same number of result sets.
Multi search template API[46] - seems to just be the combination of multi search and use of templates.
On result state, and getting more consistency between search interactions
In search systems in general, it can be a good tradeoff to
- get a good indication of roughly how many hits there are, and stop counting when the number amounts to "well, um, many" -- rather than get a precise count
- don't fetch anything beyond the first page or so (ES defaults to 10) -- most people will never need it (and the ones that do could e.g. be served by an additional query - how you do this matters less than the saving you get from "nothing beyond the first page 90% of the time")
- if search-and-fetch-a-few turns out to be cheap enough in terms of IO and CPU, we can consider doing them without storing state.
- If in fact someone does the somewhat unusual thing of browsing to page 2 or 3 or 4,
you can redo the same search and fetch some more (use from to fetch what amounts to the next bunch).
- Separately, you might as well stop counting the fact that you have hits once that hits "more than anyone is interested in" (ES defaults to 10000)
- forget the search immediately after serving it -- rather than keeping state around in RAM and/or on disk, for who knows how long exactly
...because it turns out we can do this faster than counting all the matching documents, or fetching all metadata.
However, this "we'll re-do it for page 2" means that if the index was refreshed between those actions, this new search will shift items around, so might get the same item again, or never show you are didn't see one one. Whenever that consistency or completeness really matters to you, you are probably looking for async or PIT:
- Async search[47] - lets you start searches in the background.
Search functionality is mostly equivalent, there's some minor parameter differences (think cache)
- the searches are stored in their own index(verify) so you may want a mechanism to delete the searches by ID if you need to, and/or lower the time these are kept (default is 5 days, see keep_alive)
- note that wait_for_completion_timeout lets you ask for "return a regular search if you finish quickly, make it async if not"
- Point in time API[48] - consider that document updates and refreshes mean you will generally get the latest results; a PIT instead lets you keep searching the index state as it was at one point in time, so paged/repeated searches stay consistent.
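As a sketch of the plain "redo the search for the next page" approach mentioned above (index and field names made up):

GET /notes/_search
{
  "from": 10,
  "size": 10,
  "query": { "match": { "body": "fork" } }
}

If you do want an exact total rather than the capped count, track_total_hits can be set to true, at some extra cost.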
Searching multiple indices
...mostly points to Multi search (_msearch), which looks something like:
GET /index1/_msearch
{ }
{ "query": {"multi_match":{ "query":"fork", "fields":["plaintext"] } } }
{"index": "index2"}
{ "query": {"multi_match":{ "query":"fork", "fields":["title","plaintext"] } } }
{"index": "index3"}
{ "query": {"multi_match":{ "query":"fork", "fields":["plaintext"] } } }
Note that you can also do:
GET /index1,index2,index3/_search
{ "query": { "multi_match":{"query":"fork", "fields":["title","plaintext"] } } }
The first gives you three different result sets. You get control over the queries but you will have to merge them yourself, and it can be hard to get the scoring to work well.
The second effectively merges for you - but gives you little to no control how, and that one query better make equal sense for each index - so requires some forethought about the mapping of each.
Other, mostly more specific search-like APIs, and some debug stuff
terms_enum - meant to be a light lookup of indexed terms by prefix match. Works on keyword fields, defaults to case sensitive.
- index_filter allows search in multiple indices
suggester - suggests similar search terms based on edit distance
knn search searches a dense_vector close to the query vector.
- doing this via a separate API is deprecated, being moved to a parameter on basic searches
count - count the number of hits a particular query gives.
Debugging searches:
explain - given a document ID and a query, explains how it does or doesn't match.
profile - gives timing info about parts of query execution
validate - ask whether a query is valid, without executing it; useful when crafting expensive queries
shards - report which shards would be accessed
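A sketch of explain and validate (the index name, document id, and field are made up):

GET /notes/_explain/1
{ "query": { "match": { "body": "fork" } } }

GET /notes/_validate/query?explain=true
{ "query": { "match": { "body": "fork" } } }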
composing queries
Compound queries
Compound queries[49] wrap others (leaf queries, or other compound queries), for a few reasons.
One reason is logical combinations of other queries (note that when you do this for filtering only, there are sometimes other ways to filter that might be more efficient in specific situations)
bool [50]
- lets you combine subqueries with clauses like the following (there is a sketch after this list)
- must
- filter is must without contributing to scoring
- must_not
- should ('portion of leafs must', and more matches scores higher)
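A sketch of a bool query combining those clauses (index and field names made up):

GET /notes/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "body": "fork" } } ],
      "filter":   [ { "term":  { "status": "published" } } ],
      "must_not": [ { "term":  { "status": "draft" } } ],
      "should":   [ { "match": { "title": "fork" } } ]
    }
  }
}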
The rest is mostly about more controlled scoring (particularly when you make such combinations)
dismax [51]
- if multiple subqueries match the same document, the highest score gets used
boosting [52]
- have subqueries weigh in positively and negatively
constant_score [53]
- results from the search this wraps all get a fixed constant score
function_score [54]
- results from the search this wraps are scored with scripting that can consider values from the document and query
Term-level and full-text queries
Term queries are not put through an analyser, so stay one string, and are mostly used to match exact strings/terms.
Full-text queries are put through an analyzer (the same way as the text fields it searches), which does some alterations, and means the search then deals with multiple tokens.
term [55]
- exact value
- works in text fields but is only useful for fairly controlled things (e.g. username)
terms [56]
- like term, matches any from a provided list
terms_set [57]
- like terms, but with a minimum amount to match from a list
exists[58]
- Returns documents that contain any indexed value for a field.
fuzzy[59]
- Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.
ids[60]
- Returns documents based on their document IDs.
prefix[61]
- Returns documents that contain a specific prefix in a provided field.
range [62]
- Returns documents that contain terms within a provided range.
regexp[63]
- Returns documents that contain terms matching a regular expression.
wildcard [64]
- Returns documents that contain terms matching a wildcard pattern.
match [65]
- matches text, number, date, or boolean data
- in a single field
- optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
- text is analysed; all tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{ "query": { "match": { "fieldname":{ "query": "scary matrix snowman monster", "minimum_should_match": 2 //optional, but } } } }
multi_match [66]
- match, but searches one query in multiple fields
{ "query": { "multi_match" : { "query": "this is a test", "fields": [ "subject", "message" ] } } }
combined_fields [67]
- somewhat like multi_match, but instead of doing the query to each field separately, it acts as if it matched on a single field (that consists of the mentioned fields combined), useful when the match you want could span multiple fields
{ "query": { "combined_fields" : { "query": "database systems", "fields": [ "title", "subject", "message"], "operator": "and" } } }
match_phrase [68]
- analyses into tokens, then matches only if all are in a field, in the same sequence, and by default with a slop of 0 meaning they must be consecutive
{ "query": { "match_phrase": { "message": "this is a test" } } }
match_phrase_prefix [69]
- like match_phrase, but the last token is a prefix match
match_bool_prefix [70]
- constructs a bool of term-shoulds, but the last term is a prefix match
intervals [71]
- imposes rules based on order and proximity of matching terms
query_string [72]
- lucene-style query string which lets you specify fields, and AND,OR,NOT, in the query itself [73]
- maybe don't expose to users, because it's quick to error out
simple_query_string [74]
- the simplified, less breakable variant of the above, discussed earlier
Span queries
Span queries[75]
use token-position information to be able to express term order and proximity.
This seems mostly useful for technical documents, and other things that conform to templates enough that this specificity is meaningful (rather than just rejecting a lot)
- span_term[76] - like regular term, but works when combining span queries with some of the below
- span_containing[77]
- takes list of span queries, returns only those spans which also match a second span query
- similar to near but more controlled?
- span_near[78] - multiple span queries must match within distance, and with optional order requirement
- span_or[79] - match by any span query
- span_not[80] - excludes based on a span query
- span_first[81] - appear within the first N positions of field
- span_field_masking[82] - lets you do span-near or span-or across fields
- span_multi[83] - Wraps a term, range, prefix, wildcard, regexp, or fuzzy query.
- span_within[84] - one span must appear within another
Result set
Aggregation
Aggregation is any calculation from multiple documents.
- Which can be particularly useful when you store structured information - like metrics.
- More rarely used in full-text search.
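A sketch of a simple aggregation - counting documents per value of a keyword field (index and field names made up), with size set to 0 because we only want the aggregation, not the hits themselves:

GET /notes/_search
{
  "size": 0,
  "aggs": {
    "by_status": {
      "terms": { "field": "status" }
    }
  }
}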
query considerations