{{#addbodyclass:tag_tech}}
{{search stuff}}


==Choice side==
===Broad intro===
{{stub}}
It's not very hard to interface your own form with the API, though.


===Subscription model===
{{stub}}


From an "I don't care about jargon and legalese" perspective,
* you can get a managed cloudy thing - split into a few variants
* or you can self-host - ''also'' split into a few variants (not quite the same?)


Around self-hosting there are two or three variants: Basic, Platinum, and Enterprise (Gold is gone?)
: Basic is perfectly good for smaller scale
: the differences between Platinum and Enterprise mostly matter once you do this at large scale [https://www.elastic.co/subscriptions]
: {{comment|(Before 6.3 the free license needed to be refreshed every year; now basic/free does not need an explicit license anymore?{{verify}})}}


There is a perfectly free subset of ES (ES configured with a Basic license).

...but the default is not a Basic license: it's a 30-day trial license in which you get more features, to entice you to buy the fuller product.


If you know you want to stick with basic features, then after this 30-day period (or earlier, if you know what you want) you would
: not only need to switch to a Basic license,
: but also disable some features (apparently all X-pack things{{verify}})


...but if you didn't know that, this is annoying in the form of "automated install configured a trial license for me, but things just stopped working ''completely'' after 30 days and now I'm panicking", because it takes more than a little reading to figure out what you now need to disable and why. {{comment|(My logs showed "Elasticsearch Search and Analytics" -- which seemed to just be a confusing name for the machine learning and alerting stuff)}}


Parts of ES and other ELK components are dual-licensed public/proprietary, other parts are proprietary only. {{comment|(There was a license change that seemed to be aimed at limiting Amazon selling it, which just prompted{{verify}} Amazon to fork it into its own OpenSearch[https://en.wikipedia.org/wiki/OpenSearch_(software)].)}} There are also some paid-only features[https://www.elastic.co/subscriptions], though many of them are advanced / scale features that a lot of setups won't need.


<!--
What is X-pack[https://www.elastic.co/guide/en/elasticsearch/reference/7.17/setup-xpack.html]?

An extension that groups
* "[https://www.elastic.co/guide/en/elasticsearch/reference/current/security-settings.html advanced security]" (meaning? TLS for whatever connects directly to the service?)
* [https://www.elastic.co/guide/en/elasticsearch/reference/7.17/monitoring-overview.html monitoring]
* [https://www.elastic.co/guide/en/elasticsearch/reference/current/ml-settings.html machine learning]
...and possibly a little more?

It ''used'' to be a closed-source set of extended features,
and is now open-sourced but still very much paid-for?
(there still is a licensing thing, and you can still get a 30-day trial for that[https://www.elastic.co/guide/en/kibana/7.17/managing-licenses.html])
:: is it optional?

https://www.elastic.co/guide/en/elasticsearch/reference/7.17/setup-xpack.html

xpack.security.enabled  (in elasticsearch.yml)
: defaults to true
: note that if you enable this, kibana needs some extra config to be able to connect
https://www.elastic.co/guide/en/elasticsearch/reference/current/security-settings.html

xpack.monitoring.collection.enabled
-->
===License details===
{{stub}}
<!--
There used to be entirely proprietary parts.{{verify}}
Many of those are now source-available ''but'' cannot be used without a subscription,
or taken elsewhere, without some legal implications.
So this is all more source-available rather than open source.
-->


'''Core functionality versus lots of automatic niceness'''

There are a lot of features that you will care about when running a large cluster,
and a business off it, and would ''absolutely'' want to pay for
...that you may not care about if you just want to add search to a hobby project.

This is, understandably, tied into the licensing, and part of the proprietary/open mix.


'''On licenses'''

Broadly, a choice between
* SSPL (a more-viral, less-open variant of the AGPL)
* Elastic License - ''not'' open

So the free version cannot be used/sold as a service with the ES branding removed.


Apparently:
* Before 6.3, you will need a license - you can get a free basic license at https://register.elastic.co/
:: note this is a 1-year license and needs to be refreshed
* Since 6.3, Basic (free) does not need an explicit license


That said, if you configure features that need more than a basic license, it ''will'' complain.

For example,
* "Current license is non-compliant for search application and behavioral analytics. Current license is active basic license"
: refers to a feature it (confusingly) calls "Elasticsearch Search and Analytics", which is the machine learning and alerting stuff.
: either buy a license, or disable those features:
 xpack.ml.enabled: false
 xpack.graph.enabled: false
 xpack.watcher.enabled: false
: (note that if this is a "basic license or not?" situation there is more than this to disable)


<!--
'''Amazon kerfuffle'''

Around 2021, there was a license change from [[Apache2]] to [[SSPL]] - similar to [[AGPL]] but even more aggressive, so people don't quite consider it a free/open license anymore - seemingly aimed specifically at limiting Amazon selling it {{comment|(which just prompted{{verify}} Amazon to fork it into its own OpenSearch[https://en.wikipedia.org/wiki/OpenSearch_(software)] - the SSPL seems made for another such case, MongoDB, and similarly Amazon just forked to DocumentDB{{verify}})}}
-->


==Implementation details you'd want to know to use it==
===The major moving parts===


See also:
* https://en.wikipedia.org/wiki/Elasticsearch


====Some terminology====
{{stub}}
{{comment|(note that the core of this comes from the underlying lucene)}}


'''mapping''' - basically a schema, telling ES how to interpret each field
: e.g. mentions types
: if a field is not already defined in the mapping by you, ES will add fields as you mention them, with types guessed based on the first value it sees


'''field''' - a distinct thing in the mapping
: (usually) indexed to be searchable
: (usually) fetchable
: (there are some field type variants - see below)


'''document''' - individual search result.
: you can see it as the JSON you sent in, and/or as the part of that that got put into fields (and usually indexed) as specified by the mapping
:: side note: ''most'' fields can have zero or more values in a document, though in a lot of practical cases it's practically just one, and in a few cases it's restricted to just one
: original JSON is also stored in _source


'''index'''
: if you squint, a grouping of documents - a collection you search as a whole
: you can have multiple indexes per cluster (and may well want to)
: an index is divided into one or more shards, which is about replication in general and distribution in a cluster
:: (and shards are divided into segment files, which is more about how updating documents works)


{{comment|...and this is the point at which you can stop reading this section if you're just doing some experiments on one host. You ''will'' get to the rest if and when you need to scale to multiple hosts.}}


'''shard'''
: you typically split an index into a number of shards, e.g. to be able to do [[horizontal scaling]] onto nodes
: internally, each shard is a self-contained searchable thing. In the sense of the complete set of documents you fed in, this is just a ''portion'' of the overall thing we call index here
: shards come in two types: primary (the basic type), and replica


'''segment''' - a shard consists of a number of segments (segments are individual files).
: each segment file is immutable (which eases a lot of management, means no blocking, eases parallelism, eases cacheing)


'''replica''' - an exact copy of a shard
: the point of replicas is robustness against node failure:
:: if you have two copies of every shard (and those copies are never on the same hardware), then one node can always drop out and search won't be missing anything
:: without duplication, node failure would mean missing a part of every index distributed onto it
: you could run without replicas to save some memory
: memory is unlikely to be much of your cloudy bill (something like ES is CPU and transfer hungry), so using replicas is typically worth it for long-term stability


'''node''' - distinct ES server instance
: in larger clusters, it makes sense to have some nodes take on specific roles/jobs - [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/modules-node.html#node-roles this] discusses some of that


'''cluster''' - a group of nodes, serving a set of indices created in it
: nodes must mention a shared ID to join a cluster
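
Since that is a lot of jargon in a row, a concrete sketch of where shards and replicas show up in the API - assuming a local node without security enabled, and a made-up index name:

 # create an index with explicit shard/replica counts
 curl -X PUT 'http://localhost:9200/myindex' -H 'Content-Type: application/json' -d '{
   "settings": { "number_of_shards": 3, "number_of_replicas": 1 }
 }'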


====Combined with====


In practice, people often pair it with...  {{comment|(see also "{{search|ELK stack}}")}}


* [https://www.elastic.co/kibana Kibana]
: Web UI, makes a bunch of inspection and monitoring easier, including a dashboard interface to ES
: also seen in tutorials, there largely for its '''console''', for interactively poking ES without writing code
: itself pluggable, with a ''bunch'' of optional things
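
Outside that console, the same sort of poking is just HTTP - for example (assuming a local node without security):

 curl 'http://localhost:9200/_cluster/health?pretty'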


And, '''if you're not coding your own ingest''',

* [https://www.elastic.co/logstash Logstash] is a log processor/aggregator, and [https://github.com/logstash-plugins plugins] let it be flexible
: you ''can'' do all of that yourself, but logstash can do a lot of work for you, or ''at least'' be a quick start


* [https://www.elastic.co/beats/ Beats] - where logstash is a more generic, configurable thing, beats is a set of specific-purpose ingest scripts, e.g. for availability, log files, network traffic, linux system metrics, windows event log, cloud service metrics, [https://www.elastic.co/guide/en/beats/libbeat/current/beats-reference.html] - and a lot more contributed ones[https://www.elastic.co/guide/en/beats/libbeat/current/community-beats.html]
: e.g. [https://www.elastic.co/guide/en/beats/metricbeat/7.16/index.html metricbeat], which stores ES metrics in ES


===Indices===
====Fields, the mapping, field types====
{{stub}}


By default, ES picks up every part of the document (remember it's JSON), but you can configure it
: to have parts go towards multiple fields
: to have parts go to no field at all
:: for ''all'' documents, based on your mapping having a field but setting "enabled": false
:: or from other behaviour, such as "values should never be bigger than X bytes" config on a field


{{comment|Note that in terms of the index, a document can have multiple values for a field. So fields can be considered to store ''arrays'' of values, and most searches will match if any of those match.}}




Each '''regular field''' (there are others) gets its own behavior in terms of
: whether it is stored and/or
: whether it is indexed
:: if indexed, ''how'' it is transformed before going into the index


{{comment|(For completeness, a '''runtime field''' is evaluated at runtime, and not indexed or in the _source
[https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html]. This can be handy for certain calculated data that you won't search on, or to experiment with fields you'll later make regular fields{{verify}})}}


The '''mapping'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping.html][https://www.elastic.co/blog/found-elasticsearch-mapping-introduction] is the thing that lists all fields in an index, mostly:
* name
* data type(s)
* configured processing when new data comes in


'''Explicit mapping'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/explicit-mapping.html]
amounts to specifying a mapping ''before'' you add documents.
This can be preferred when
* you want specific interpretations of some fields (e.g. an identifier that shouldn't be parsed as text, an IP address, a date), and/or
* you want specific features (e.g. {{inlinecode|flattened}}, {{inlinecode|join}}, {{inlinecode|search_as_you_type}})
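
A minimal explicit-mapping sketch, assuming a local node; the index and field names are made up:

 curl -X PUT 'http://localhost:9200/products' -H 'Content-Type: application/json' -d '{
   "mappings": { "properties": {
     "sku":        { "type": "keyword" },
     "name":       { "type": "text" },
     "added":      { "type": "date" },
     "visitor_ip": { "type": "ip" }
   } }
 }'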


'''Dynamic mapping'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-mapping.html] is (like '[[schemaless]]') a nice name for
: "I didn't set a mapping for a field, so am counting on ES guessing right based on the first value it sees"
: if you're mainly handling text, this usually does the right or at least a very sane thing.


{{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic.html#dynamic-parameters dynamic-parameters]}} allows a little more variation of that:
: '''true''' - unknown fields are automatically added to the mapping as regular fields (indexed) -- the default just described
:: it'll mostly end up with boolean, float, long, or text/keyword
: '''runtime''' - unknown fields are automatically added to the mapping as ''runtime'' fields (not indexed)
: '''false''' - unknown fields are ignored
: '''strict''' - unknown fields mean the document is rejected


Note that


* with {{inlinecode|true}} and {{inlinecode|runtime}}, the created field's type is based on the communicated JSON's type - see [https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html this table]

* {{inlinecode|runtime}} and {{inlinecode|dynamic}} will
** try to detect whether text seems to be a date[https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html#date-detection] (apparently if it follows a specific configured date template{{verify}})
** optionally detect numbers[https://www.elastic.co/guide/en/elasticsearch/reference/current/dynamic-field-mapping.html#numeric-detection] (and whether they are reals or integers), but this is disabled by default - probably because it's much less error-prone to control this yourself.


* You can also alter mappings later - see [https://www.elastic.co/guide/en/elasticsearch/reference/current/indices-put-mapping.html]
:: but this comes with some restrictions/requirements/footnotes

* dynamic-parameters is normally set per index{{verify}} and inherited, but you ''can'' be more precise about it
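
For example, a sketch of setting the strictest of those on an existing (made-up) index, so documents with unmapped fields get rejected:

 curl -X PUT 'http://localhost:9200/products/_mapping' -H 'Content-Type: application/json' -d '{
   "dynamic": "strict"
 }'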






'''Text-like field types'''


The documentation makes a point of a split into
* '''''text family'''''[https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html] - text, match_only_text
* '''''keyword family'''''[https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type] - keyword, constant_keyword, wildcard
* ...and it seems that e.g. search_as_you_type is considered miscellaneous


* '''{{inlinecode|text}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/text.html] {{comment|(previously known as {{inlinecode|string analysed}})}}
: flexible search of free-form text; analysis will transform it before indexing (you probably want to know how - see e.g. [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/analysis.html analysis])
: no aggregations
: no sorting


* '''{{inlinecode|keyword}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html] {{comment|(previously known as {{inlinecode|string not_analysed}})}}
: structured content (exact search only?), e.g. identifiers, a small/known set of tags, also things like emails, hostnames, zip codes, etc.
: should be a little faster to match (if only because of smaller index size){{verify}}
: allows sorting
: allows aggregation
: can make sense for serial numbers/IDs, tags - even if they ''look'' like numbers, you will probably only ever look them up by text equality (compared to storing those as a number, keyword may be a little larger, yet also saves some index complexity necessary for numeric-range search)


* {{inlinecode|constant_keyword}} [https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#constant-keyword-field-type] - all documents in the index have the same value
:: e.g. when you send each log file to its own index, this might assist some combining queries {{verify}}


* {{inlinecode|wildcard}} - [https://www.elastic.co/guide/en/elasticsearch/reference/current/keyword.html#wildcard-field-type]
:: assists [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html wildcard] and [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html regexp] queries


* '''search_as_you_type'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-as-you-type.html]
: fast for prefix (match at start) and infix (terms within) matches
: mainly for autocompletion of queries, but could work for other shortish things (if not short, the index for this will be large)
: n-gram style, can have larger n (but costs in terms of index size, so only useful if you usually want many-token matches?)
: kept in memory, costly to build




Further notes:
* it seems that fields that are dynamically added to the mapping and detected as text will get two indexed fields: a free-form text one, and a keyword one with ignore_above of 256
:: this is useful if you don't know what it will be used for
:: but for e.g. identifiers it's pointless to also tokenize it
:: and for free-form text it will probably do very little -- that is, for all but short documents it ends up _ignored. {{comment|(It's a clever edge-case trick to deal with cases where the only value is actually something other than text, and is ''otherwise'' almost free)}}
:: separately, some things you may wish to not index/search on, but still store so you can report them as part of a hit






'''Data-like field types (primitives)''':
* '''boolean'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/boolean.html]


* '''numbers''': [https://www.elastic.co/guide/en/elasticsearch/reference/current/number.html]
** '''byte''', '''short''', '''integer''', '''long''', '''unsigned_long'''
** '''float''', '''double''', '''half_float''', '''scaled_float'''
:: allows range queries


* and, arguably, keyword (see above)
'''Data-like fields (specific but text-like)''':
* '''date''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/date.html]
:: (internally milliseconds since epoch (UTC))


* '''date_nanos''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/date_nanos.html]


* '''version''' - text that represents semantic versioning
:: mostly for sortability?


* '''ip''' - IPv4 and IPv6 addresses
:: allows things like subnet searches ([[CIDR]] style)


* '''geospatial''' - points and shapes [https://www.elastic.co/guide/en/elasticsearch/reference/current/geo-point.html]
:: including distance and overlap
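
Several of these types exist largely for their search behaviour, e.g. range queries - a sketch (index and field names made up):

 curl -X POST 'http://localhost:9200/products/_search' -H 'Content-Type: application/json' -d '{
   "query": { "range": { "added": { "gte": "2024-01-01", "lt": "2025-01-01" } } }
 }'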


Different ways of treating JSON as transferred:
* '''object'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/object.html]
:: each subfield is separately mapped and indexed, with names based on the nesting dot notation, see e.g. the linked example


* '''flattened'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/flattened.html]
:: takes the JSON object and indexes it as one single thing (basically an array of its values combined)


* '''nested'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/nested.html] - basically a variant of object that allows some field indexability




Other stuff
* '''binary'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/binary.html]
:: not searchable
:: not stored (by default) - which makes sense, a search index is not a blob store
:: seems to be there for the possibility of some plugin extracting text to index?{{verify}} or fishing it out of _source?{{verify}}


* '''join'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/parent-join.html]
:: relations to other documents, by id{{verify}}
:: seems to mean query-time lookups, so
::: you could get some basic lookups for much cheaper than separate requests
::: at the same time, you can make things ''slower'' by doing unnecessary and/or multiple levels of lookups (if you need relational, a relational database is better at that)




'''token_count'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/token-count.html]
: stored as an integer, but takes text, analyses it into tokens, then counts the number of tokens
: seems intended to be used via multi-field, to also get the length of text




'''dense_vector'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/dense-vector.html] (of floats by default; byte is also possible)
: if indexed, you can use these for knn searches[https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search.html]
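
A sketch of such a knn search (ES 8.x style; the field name and vector contents are made up):

 curl -X POST 'http://localhost:9200/products/_search' -H 'Content-Type: application/json' -d '{
   "knn": { "field": "embedding", "query_vector": [0.1, 0.2, 0.3], "k": 5, "num_candidates": 50 }
 }'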




<!--
'''rank_feature'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-feature.html]
: basically a number usable to boost a document in ''any'' search{{verify}}
: used with [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-rank-feature-query.html rank_features] queries


'''rank_features'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/rank-features.html]
:




'''histogram'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/histogram.html]
: a length-matched pair of {{inlinecode|values}} (double) and counts (integer)
: not indexed
: (so) seems mostly useful as pre-aggregated data, and usable by only some aggregation functions




'''percolator'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/percolator.html]
:


'''sparse_vector'''[https://www.elastic.co/guide/en/elasticsearch/reference/7.17/sparse-vector.html]
: usable for scoring
: deprecated
-->


====Managing indices (and thinking about shards and segments)====
{{stub}}


Remember that shards are made of segments, and segments are immutable.


ES collects changes in memory, and only occasionally dumps them into a new segment file.
: (this is why documents are not ''immediately'' searchable and we call it [https://www.elastic.co/guide/en/elasticsearch/reference/current/near-real-time.html ''near''-realtime])


'''Refresh''' is the act of writing to a new segment
: also helps updates not be blocking - existing segments keep being searched, new ones once they're done


'''Merge''' refers to how smaller segments are periodically consolidated into fewer files [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-modules-merge.html]
: also the only way that delete operations actually flush (remember, segments are immutable).


Refresh interval is 1s by default, but various setups may want that larger - it's a set of tradeoffs
* Refresh can be too quick in that the overhead of refresh would dominate, and less work is spent actually doing useful things (and be a little pricier in pay-per-CPU-use cloudiness)


* Refresh can also be too slow both in that
** new results take a ''long'' time to show up
** the heavier load, and large new segments, could make for more irregular response times


The default is 1s. Large setups might increase that up to the order of 30s.
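
A sketch of changing that on an existing (made-up) index:

 curl -X PUT 'http://localhost:9200/products/_settings' -H 'Content-Type: application/json' -d '{
   "index": { "refresh_interval": "30s" }
 }'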


(more in the tradeoffs below)


----
'''index lifecycle management''' (ILM) lets you do slightly higher-level management like
* create a new index when one grows too large
* do things like creating an index per time interval
Both can help get the granularity you want when it comes to
backing them up,
duplicating them, retiring them to adhere to data retention standards.




https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-lifecycle-management.html
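
A minimal ILM policy sketch (the name and thresholds are made up, not a recommendation):

 curl -X PUT 'http://localhost:9200/_ilm/policy/logs-policy' -H 'Content-Type: application/json' -d '{
   "policy": { "phases": {
     "hot":    { "actions": { "rollover": { "max_age": "30d", "max_primary_shard_size": "50gb" } } },
     "delete": { "min_age": "90d", "actions": { "delete": {} } }
   } }
 }'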








Also on the topic: a '''reindex''' basically means "read data, delete data in ES, ingest again".
: You would not generally want this over a merge.
: It's mainly useful when you make structural schema changes, and you want to ensure the data uniformly conforms to that.
https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html
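
A minimal reindex sketch (index names made up):

 curl -X POST 'http://localhost:9200/_reindex' -H 'Content-Type: application/json' -d '{
   "source": { "index": "products" },
   "dest":   { "index": "products-v2" }
 }'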
====other field types====


'''multi-fields'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-fields.html]
let you specify that one thing in your JSON should be picked up towards multiple fields (see the sketch below).
* e.g. a city name
** as a keyword, for exact match, sorting and aggregation
** and towards the overall text search, via an analyser
* text as a searchable thing, and also store its length via token_count
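
A sketch of that city example (index and field names made up):

 curl -X PUT 'http://localhost:9200/places' -H 'Content-Type: application/json' -d '{
   "mappings": { "properties": {
     "city": {
       "type": "text",
       "fields": {
         "exact":  { "type": "keyword" },
         "length": { "type": "token_count", "analyzer": "standard" }
       }
     }
   } }
 }'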
'''runtime fields''' [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/runtime.html]
: mostly about query time - not part of the index (so won't display in _source)
: just calculated as part of the response
: can be defined in the mapping, or in an individual query
: (seem to be a refinement of script fields?)
: easy to add to and remove from a mapping (because there is no backing storage for them)
<!--
: when to use them?
:: on-the-fly aggregations, filtering, and sorting
::: ...on fields you rarely do that on. Things you search and filter all the time should probably be indexed
-->
: https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
: https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields


Separate note: you can search specific indices (comma separated names, or _all), or just one - sometimes useful
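
A sketch of adding a runtime field to a mapping ({{inlinecode|price}} is a made-up indexed field; the {{inlinecode|\u0027}} escapes are just single quotes, escaped to keep the surrounding shell quoting happy):

 curl -X PUT 'http://localhost:9200/products/_mapping' -H 'Content-Type: application/json' -d '{
   "runtime": {
     "price_with_tax": {
       "type": "double",
       "script": { "source": "emit(doc[\u0027price\u0027].value * 1.21)" }
     }
   }
 }'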
'''Metadata fields''' (that is, those other than e.g. the fields from your document, and runtime/script fields)


* '''{{inlinecode|_id}}''' - document's ID


* '''{{inlinecode|_type}}''' - document's mapping type
: deprecated


* '''{{inlinecode|_index}}''' - index that the document is in


* '''{{inlinecode|_source}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-source-field.html]
: is the original JSON that indexing got
:: note that everything needs to be indexed
:: this 'original document' is also used in update and reindex operations
::: so while you can ask it to not store _source at all, that disables such operations
:: that original submission can be handy for debug
:: that original submission can be core functionality for your app in a "we found it by the index, now we give you the actual document" way
: you can filter what parts of _source are actually sent in search results - see [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/search-fields.html#source-filtering source filtering]


* '''{{inlinecode|_size}}''' - byte size of _source


* '''{{inlinecode|_ignored}}''' - fields in a document that indexing ignored for any reason - see [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-fields.html] for others
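
A source-filtering sketch (index and field names made up):

 curl -X POST 'http://localhost:9200/products/_search' -H 'Content-Type: application/json' -d '{
   "_source": ["name", "sku"],
   "query": { "match_all": {} }
 }'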
====More on indexing====
{{stub}}


The [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html bulk API] ''could'' be used for all CRUD operations,
but in practice is probably ''mostly'' used for indexing.


=====Do bulk adds where possible=====
While you can do individual adds, doing bunches together reduces overheads,
and can make refreshes more efficient as well,
largely relating to the refresh interval.

However, keep in mind that if some part of a bulk operation fails, you need to notice, and deal with that correctly.
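
A minimal bulk-indexing sketch (the body is NDJSON, one action line plus one document line per operation, and must end with a newline; the index name is made up):

 curl -X POST 'http://localhost:9200/products/_bulk' -H 'Content-Type: application/x-ndjson' --data-binary '{"index":{"_id":"1"}}
 {"name":"first thing"}
 {"index":{"_id":"2"}}
 {"name":"second thing"}
 '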
=====Missing data=====
Setting a field to null (or an array of nulls, or an empty array) means it is not added to the index.

Sounds obvious, but do a mental check that "this field cannot be matched in most searches"
(except for missing, must_not exists, and such{{verify}}) is the behaviour you intend.

Note that it also affects aggregation on the relevant fields, and some other details.
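
For example, finding documents that have no indexed value for a (made-up) field:

 curl -X POST 'http://localhost:9200/products/_search' -H 'Content-Type: application/json' -d '{
   "query": { "bool": { "must_not": { "exists": { "field": "sku" } } } }
 }'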
=====Ignoring data and fields=====
Field values may be ignored for varied reasons, including:
* use of [https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html ignore_malformed]
:: by default, a value that can't be converted to the field's type '''rejects the entire document''' add/update operation
::: though note that in bulk updates it only considers that one operation failed, and will report a partial success
:: if you set {{inlinecode|ignore_malformed}}[https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html], it will instead reject only bad values, and process the rest normally
* [https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html ignore_above] (on keyword fields)
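
A sketch of the first of those, on a made-up numeric field:

 curl -X PUT 'http://localhost:9200/products2' -H 'Content-Type: application/json' -d '{
   "mappings": { "properties": {
     "weight": { "type": "integer", "ignore_malformed": true }
   } }
 }'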
You can search for documents where this happened ''at all'' {{comment|(you'd probably do that for debug reasons)}} with a query like
"query":{ "exists":{"field":"_ignored"} }
<!--
Values like NaN are not supported (by the JSON parser?)
If you're working from e.g. pandas, you have some extra RTFMing to do for processing on ''that'' side, because it's e.g. likely to add NaN in text (object) fields to have NA, and you'll have to filter those out before feeding it to ES.
res = df.apply(lambda x: x.fillna(0) if x.dtype.kind in 'biufc' else x.fillna('.'))
-->


=====Text processing=====


You want curacao to match Curaçao? Want fishes to match fish?


To computers those are different characters and different strings;
whether such variations are semantically equivalent, or close enough to be fuzzy with, and ''how'', might vary per language.


Analysers represent relatively simple processing that helps, among other things, normalize data (and implicitly the later query) for such fuzziness.




<!--
"Can't I do this myself?"

Sure. Note that the _source will now contain your own normalization, so if you also report data (e.g. fulltext) out of ES, that will not be the original.
-->






An ''analyzer'' is a combination of
* a ''character filter''
* a ''tokenizer''
* a ''token filter''
{{comment|(a normalizer seems to be an analyzer without a tokenizer - so practically mainly a character filter (it can also have a token filter, but this is probably not so useful))}}


'''Analyzers''' take text, and apply a combination of character filters, tokenizer, and token filters, usually to the end of
* splitting it into tokens (e.g. words)
* stripping out things (e.g. symbols and punctuation)
* normalizing (e.g. lowercasing)
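
You can see what an analyzer does to given text via the _analyze API - a sketch:

 curl -X POST 'http://localhost:9200/_analyze' -H 'Content-Type: application/json' -d '{
   "analyzer": "standard",
   "text": "Fishes near Curaçao!"
 }'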


* [https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-above.html ignore_above] (on keyword fields)


There are '''built-in analyzers''' to cover a lot of basic text search needs


You can search for documents where this happened at all with a query like
* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html standard]}} (the default if you don't specify one)
"query":{ "exists":{"field":"_ignored"} }  
** lowercases
** splits using [[Unicode Text Segmentation]] (for english is mostly splitting on spaces and punctuation, but is better behaved default for some other languages) and lowercases
** removes most punctuation,
** can remove stop words (by default does not)


* [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html language-specific analysers] - currently arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai.
:: details vary
:: can remove stopwords
:: can apply stemming (and take a list of exception cases to those stemming rules)


If you set ignore_malformed[https://www.elastic.co/guide/en/elasticsearch/reference/current/ignore-malformed.html], it will instead reject only bad values.
* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html simple]}}
:: no lowercasing
:: splits on non-letters


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html stop]}} - simple plus stopword removal, so
:: no lowercasing
:: splits on non-letters
:: can remove stopwords


<!--
* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html pattern]}} - splits by regexp
Values like NaN are not supported (by the JSON parser?)
:: can lowercase
:: can remove stopwords


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html whitespace]}}
:: no lowercasing
:: splits on whitespace


If you're working from e.g. pandas, you have some extra RTFMing to do for processing on ''that'' side, because it's e.g. likely to add NaN in text (object) fields to have NA, and you'll have to filter those out before feeding it to ES.
* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html keyword]}} - does nothing, outputs what it was given
 
* [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html fingerprint]
:: reduces text in a way that helps detect duplicates, see something like [https://openrefine.org/docs/technical-reference/clustering-in-depth]


res = df.apply(lambda x: x.fillna(0) if x.dtype.kind in 'biufc' else x.fillna('.'))


-->


https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
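For example, a minimal mapping sketch (index and field names here are made up):

<syntaxhighlight lang="javascript">
PUT /test
{
  "mappings": {
    "properties": {
      "viewcount": { "type": "integer", "ignore_malformed": true }
    }
  }
}

// a later  PUT /test/_doc/1  with a body like {"viewcount": "unknown"}
// is then accepted, but has viewcount left unindexed (and the field name listed in _ignored)
</syntaxhighlight>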


=====Text processing=====

<!--
You want curacao to match Curaçao? Want fishes to match fish?

To computers those are different characters and different strings;
whether such variations are semantically equivalent, or close enough to be fuzzy with, and ''how'', might vary per language.

Analysers represent relatively simple processing that helps, among other things, normalize data (and implicitly the later query) for such fuzziness.


"Can't I do this myself?"

Sure. Note that _source will then contain your own normalization, so if you also report data (e.g. fulltext) out of ES, that will no longer be the original text.
-->


An ''analyzer'' is a combination of
* a ''character filter''
* a ''tokenizer''
* a ''token filter''
(a normalizer seems to be an analyzer without a tokenizer - so practically mainly a character filter; it can also have a token filter, but this is probably not so useful)


'''Analyzers''' take text and apply a combination of character filters, tokenizer, and token filters, usually to the end of
* splitting it into tokens (e.g. words)
* stripping out things (e.g. symbols and punctuation)
* normalizing (e.g. lowercasing)


There are '''built-in analyzers''' to cover a lot of basic text search needs


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-standard-analyzer.html standard]}} (the default if you don't specify one)
** splits using [[Unicode Text Segmentation]] (for english this is mostly splitting on spaces and punctuation, but it is a better-behaved default for some other languages)
** lowercases
** removes most punctuation
** can remove stop words (by default does not)


* [https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-lang-analyzer.html language-specific analysers] - currently arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai
:: details vary
:: can remove stopwords
:: can apply stemming (and take a list of exception cases to those stemming rules)


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-simple-analyzer.html simple]}}
:: splits on non-letters
:: lowercases


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-stop-analyzer.html stop]}} - simple plus stopword removal, so
:: splits on non-letters
:: lowercases
:: can remove stopwords


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-pattern-analyzer.html pattern]}} - splits by regexp
:: can lowercase
:: can remove stopwords


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-whitespace-analyzer.html whitespace]}}
:: splits on whitespace
:: no lowercasing


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-keyword-analyzer.html keyword]}} - does nothing, outputs what it was given


* {{inlinecode|[https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-fingerprint-analyzer.html fingerprint]}}
:: reduces text in a way that helps detect duplicates, see something like [https://openrefine.org/docs/technical-reference/clustering-in-depth]


https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html
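You can check what a given analyzer does to text with the {{inlinecode|_analyze}} API, e.g.:

<syntaxhighlight lang="javascript">
GET /_analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped!"
}
// returns the tokens: the, 2, quick, brown, foxes, jumped
</syntaxhighlight>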


===Data streams===

When you start to do metrics, and daily analytics, you append all the time.

Your data may ''amount'' to a timeseries,
and storing it in a classical index is ''possible'',
but awkward, and its size would add up slowly but surely.


If you were to make that work without data streams, that might amount to
: use ILM to start a new index daily (doing what amounts to [[log rotation]])
: use index aliases to help keep your query's reference to an index the same over time, rather than changing that daily too
: use ILM to move old indices to cheaper archival hardware


A data stream amounts to a different sort of storage,
an append-only time series thing (that still seems backed by indices internally)
that requires [https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html index templates].

Data streams are sort of that DIY setup, but more automated (though you still configure rollover with ILM) and more controlled (append-only, and you can only append to the last index), which makes it a lot easier to deal with continuously generated data.


https://www.elastic.co/guide/en/elasticsearch/reference/current/data-streams.html

https://aravind.dev/elastic-data-stream/
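A minimal sketch of that coupling (names made up; details vary per version) - an index template marked as backing a data stream, then appending a document:

<syntaxhighlight lang="javascript">
PUT /_index_template/my-metrics-template
{
  "index_patterns": ["my-metrics*"],
  "data_stream": {},
  "template": {
    "mappings": {
      "properties": {
        "@timestamp": { "type": "date" }
      }
    }
  }
}

// documents appended to a data stream must have an @timestamp field
POST /my-metrics/_doc
{ "@timestamp": "2022-01-01T00:00:00Z", "value": 42 }
</syntaxhighlight>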


===Snapshots, Searchable snapshots===

<!--
Snapshots let you
* back up a cluster live
* recover after losing files
* do searchable snapshots[https://www.elastic.co/guide/en/elasticsearch/reference/current/searchable-snapshots.html] - low search volume on things basically archived to cheaper and slower-to-access storage [https://www.elastic.co/guide/en/elasticsearch/reference/current/put-snapshot-repo-api.html]


Snapshots can go to e.g.
: a shared filesystem
: plugins provide [https://www.elastic.co/guide/en/elasticsearch/plugins/7.16/repository-s3.html S3], [https://www.elastic.co/guide/en/elasticsearch/plugins/7.16/repository-gcs.html GCS], [https://www.elastic.co/guide/en/elasticsearch/plugins/7.16/repository-azure.html Azure], [https://www.elastic.co/guide/en/elasticsearch/plugins/7.16/repository-hdfs.html HDFS]


A snapshot contains
: cluster state (settings, templates, pipelines)
: all data streams
: all open indices


: https://www.elastic.co/guide/en/elasticsearch/reference/current/snapshots-register-repository.html
: https://www.elastic.co/guide/en/elasticsearch/reference/current/searchable-snapshots.html
-->


===Install===
<!--
You might want to consider going the docker way - it makes cluster setup easier.


"max virtual memory areas vm.max_map_count is too low"

See <tt>kernel-doc/Documentation/sysctl/vm.txt</tt>, but
these relate to the amount of allocations (also mmap, mprotect) done.
The default is 65530, and most processes use fewer than a thousand,
but e.g. malloc debuggers need more, and so does Elasticsearch for some reason.


Consider whether you need security.

ES has
* IP filtering
* its own user auth
:: role-based access control
* auditing
* TLS (towards clients)

These things are mostly useful around multitenancy,
or when your ES deployment is on a different provider (e.g. public cloud deployment),
and may not be necessary within your own cloudiness.

https://www.elastic.co/guide/en/elasticsearch/reference/6.8/configuring-security.html


ES/Kibana security

Minimal
* user passwords for ES
* optional passwords for Kibana

Basic Security
* TLS between nodes, isolates clusters
* no TLS for Kibana

Basic + HTTPS external
* TLS (HTTPS) on all ES and Kibana HTTP traffic
-->


====Things you may want to think about somewhere before you have a big index====

<!--
Note that for a bunch of things - not least indexes - there are static settings (only settable at creation time) and dynamic settings (changeable later).


* '''index template'''
: basically the settings that new indexes should be created with
: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-templates.html


* Index settings [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-modules.html]


* Refresh
: background: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/near-real-time.html
: interval setting: https://www.elastic.co/guide/en/elasticsearch/reference/current/index-modules.html#index-refresh-interval-setting


* Merge scheduling
: [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-modules-merge.html]


* Analysis of text fields
: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/analysis.html


* Mapping - in particular whether you want to control it rather than use dynamic mapping [https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-modules-mapper.html]


More advanced:
* Sharding details in clusters:
: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-modules-allocation.html
: https://www.elastic.co/guide/en/elasticsearch/reference/7.16/allocation-total-shards.html
-->


===APIs===
{{stub}}


ES grew a lot of APIs over time.


Perhaps the most central are
* the document CRUD part (Index, Get, Update, Delete, respectively),
* and the searching.


'''Document CRUD'''

* '''Index''' - add a document (and implicitly get it indexed soon)
:: e.g. {{inlinecode|PUT /test/_doc/1}} with the JSON doc in the body

* '''Get''' - get a specific document by ID
: e.g. GET test/_doc/1  {{comment|(HEAD if you only want to check that it exists)}}
: this basically just gives you {{inlinecode|_source}}, so you can filter what part of that gets sent[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-get.html#get-source-filtering]

* '''Update'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-update.html]
:: e.g. {{inlinecode|POST /test/_update/1}}
: allows things like <tt>"script" : "ctx._source.votes += 1"</tt>

* '''Delete'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-delete.html]
:: e.g. {{inlinecode|DELETE /test/_doc/1}}


Note that there is also a '''bulk API''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-bulk.html]
: that lets you do many of the above at once
: POST _bulk
: where the body can contain many index, delete, and update operations
: libraries often support the bulk API in a way that makes sense for how the rest of that library works
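A minimal sketch of what that body looks like (note that each action and document must be a single line of newline-delimited JSON):

<syntaxhighlight lang="javascript">
POST /_bulk
{ "index":  { "_index": "test", "_id": "1" } }
{ "title": "fork handles" }
{ "update": { "_index": "test", "_id": "1" } }
{ "doc": { "title": "four candles" } }
{ "delete": { "_index": "test", "_id": "2" } }
</syntaxhighlight>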


"Is update really update?"

It seems that update essentially will
* construct a new document based on the current version's _source
* add 1 to the version
* save that to a new segment
:: ...even if it turns out there is no difference (...by default - you can specify that it should check and not do that - see detect_noop)

It seems that the old version will be ''considered'' deleted, but still be on disk in that immutable segment it was originally entered into (until the next merge),
and searches just happen to report only the latest (using _version){{verify}}.
So if you do continuous updates on all of your documents, you ''can'' actually get the update process to fall behind, which can mean you temporarily use a lot more space (and search may be slightly slower too).


"Is there a difference between indexing a new version and updating the document?"

Very little in terms of what gets done at low level (both will create a new document, and mark the old one as deleted).

It's mostly that ''for you'' there is a question of whether what you want to do is easier to express as
* a new version of the data
* the changes you want to see, or a script that makes them

Also, keep in mind ''how'' you are updating. If it's fetch, change, send, that's more steps than 'here is how to change it' in a script.

So if you do not keep track of changes to send to elasticsearch, you could update by throwing everything at it with detect_noop - which only applies to update.
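For example, a partial update like the following (detect_noop defaults to true; it is spelled out here for clarity):

<syntaxhighlight lang="javascript">
POST /test/_update/1
{
  "doc": { "title": "four candles" },
  "detect_noop": true
}
// if the document already had exactly this title, ES answers with "result": "noop"
// and does not write a new version
</syntaxhighlight>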


Search[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html]
: GET /indexname/_search


Multi Search[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html]
: multiple searches with a single API call
: GET /indexname/_msearch
:: ...though the request can switch index


Async search
: lets you start searches in the background - see the notes in the Search section below


Multi Get[https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-multi-get.html]
: retrieves multiple JSON documents by ID
: GET /_mget
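A minimal sketch:

<syntaxhighlight lang="javascript">
GET /_mget
{
  "docs": [
    { "_index": "test",  "_id": "1" },
    { "_index": "test2", "_id": "7" }
  ]
}
</syntaxhighlight>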






"Is update really update?"


It seems that update is essentially
<!--
* construct a a new document based on the current version's _source
 
* add 1 to the version
* save that to a new segment
:: ...even if it turns out there is no difference (...by default - you can specify that it should check and not do that - see detect_noop)


It seems that
the old version will be ''considered'' deleted but still be on disk in that immutable segment it was originally entered in (until the next merge),
and searches just happen to report only the latest (using _version){{verify}}
So if you do continuous updates on all of your documents, you ''can'' actually get the update process to fall behind, which can mean you temporarily use a lot more space (and search may be slightly slower too).




helpers.bulk


helpers.streaming_bulk


"Is there a difference between indexing a new version and updating the document?"
helpers.parallel_bulk


Very little in terms of what gets done at low level (both will create a new document, mark old as deleted)


It's most that ''for you'' there is a question of what you do is easier to express as
* a new version of the data
* the changes you want to see, or a script that makes them


Also, keep in mind ''how'' you are updating. If it's fetch, change, send, that's more steps than 'here is how to change it' in a script.


So if you do not keep track of changes to send to elasticsearch,
you could update by throwing everything at at with detect_noop - which only applies to update.


python wrapper


Index creation
* https://elasticsearch-py.readthedocs.io/en/v7.16.3/api.html#indices




-->


===Search===




====Shapes of a query====
{{stub}}

Most APIs take a query that looks something like:

<syntaxhighlight lang="javascript">
{
  "query": {
    "match_phrase": {
      "fieldname": "word word"
    }
  }
}
</syntaxhighlight>


That is, there is a specific [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl.html Query DSL],
in which you express a structured search as an abstract syntax tree, which is communicated as JSON.


Notes:
* the field names will vary with specific types of search
:: in fact, even the nesting will

* You might like [https://www.elastic.co/guide/en/kibana/7.17/console-kibana.html Kibana's console] to play with these, as it has some auto-completion of the search syntax.

* a quick "am I even indexing" test would probably use {{inlinecode|match_all}}[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-all-query.html]
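...which, for reference, is about the smallest query body there is:

<syntaxhighlight lang="javascript">
{ "query": { "match_all": {} } }
</syntaxhighlight>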


'''Exceptions / other options'''

The only exception to "AST as JSON" seems to be '''URI search'''[https://www.elastic.co/guide/en/elasticsearch/reference/master/search-uri-request.html], which exposes basic searches in a URL (without needing to send a JSON-style query in the body) - but unless this covers ''all'' your needs, you will eventually want to abandon it, so you might as well do it properly up front.

The only exceptions to "you must create query ASTs yourself" seem to be:
* '''<tt>query_string</tt>'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html] -
: still uses that JSON, but feeds in a single query string in that specific lucene syntax, to be parsed internally
:: exposes a bunch of the search features in a shorter-to-write syntax
::: ...that you have to learn
::: ...that is fairly breakable
::: ...that allows frankenqueries that unduly load your system
::: ...so you probably don't want to expose this to users unless all of them are power users / already know lucene

* '''<tt>simple_query_string</tt>'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html]
: same idea as the previous, but using a somewhat more simplified syntax
:: less breaky, still for experts

These exceptions may look simpler, but in the long run you will probably need to abandon them;
if you want to learn it once, learn the AST way.
(That said, a hacky translation to the latter syntax is sometimes an easier shortcut in the short term.)


====API-wise====

The '''search API'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html] ({{inlinecode|/''indexname''/_search}}) should cover many basic needs.
: The '''search template API'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-template-api.html] does the same sort of search, but you can avoid doing some complex middleware/client-side query construction by asking ES to slot some values into an existing saved template.

'''Multi search'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-multi-search.html] ({{inlinecode|/''indexname''/_msearch}}) is sort of like bulk-CRUD but for search: you send in multiple search commands in one request.  Responds with the same number of result sets.
: '''Multi search template API'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/multi-search-template.html] - seems to just be the combination of multi search and use of templates.


=====On result state, and getting more consistency between search interactions=====

In search systems, it can be a good tradeoff to
: get a rough indication of how many results there are, and stop counting when the number amounts to "well, um, ''many''" -- rather than get a precise count
: fetch-and-show the complete data for only the first few -- rather than everything
: forget the search immediately after serving it -- rather than keeping state around in RAM and/or on disk, for who knows how long exactly

...because it turns out we can do that faster.

This is why ES stops counting after 10000 (max_result_window) and only returns the first 10 (size=10, from=0).

Also, if search-and-fetch-a-few turns out to be cheap in terms of IO and CPU, we can consider doing such searches without storing state.

This is also because we know that in interactive browser use, most people will never check more than the first few results.

If in fact someone ''does'' the somewhat unusual thing of browsing to page 2 or 3 or 4,
you can redo the same search and fetch some more (use {{inlinecode|from}} to fetch what ''amounts'' to the next bunch).

'''However''', if the index was refreshed between those actions, this new search may shift items around,
so you might get the same item again, or never see one you hadn't seen yet.
Whenever consistency or completeness really matters, you are probably looking for async search or PIT.

Also, if you use this to back an API that allows "fetching everything", you won't have won much.


'''Async search'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html] - lets you start searches in the background.
Search functionality is mostly equivalent; there are some minor parameter differences (think cache).
: the searches are stored in their own index{{verify}}, so you may want a mechanism to delete the searches by ID if you need to, and/or lower the time these are kept (default is 5 days, see keep_alive)
:: note that {{inlinecode|wait_for_completion_timeout}} lets you ask for "return a regular search result if it finishes quickly, make it async if not"

'''Point in time API'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/point-in-time-api.html] - consider that document updates and refreshes mean you will generally get the ''latest'' results.
If instead it is more important to get a consistent set between interactions, you can use PIT (or presumably async search?)
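A rough sketch of PIT use (the id here is abbreviated):

<syntaxhighlight lang="javascript">
// ask for a point-in-time handle on an index, kept for (at least) one minute
POST /test/_pit?keep_alive=1m
// returns something like {"id": "46To..."}

// then search against that handle (note: no index in the URL)
GET /_search
{
  "query": { "match": { "plaintext": "fork" } },
  "pit":   { "id": "46To...", "keep_alive": "1m" }
}
</syntaxhighlight>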
<!--
Consider efficiency

It helps speed/load to try to express queries as simpler queries, like
term and match (and most match_ variants),
and compounds of such.

In general, you may wish to stick to the more-efficient-to-evaluate where possible;
some query types are more costly (up-front),
and some require search.allow_expensive_queries to be true.
-->


=====Searching multiple indices=====

...mostly points to Multi search (_msearch), which looks something like:
<syntaxhighlight lang="javascript">
GET /index1/_msearch
{ }
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }
{"index": "index2"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["title","plaintext"]  } } }
{"index": "index3"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }
</syntaxhighlight>




Note that you can also do:
<syntaxhighlight lang="javascript">
GET /index1,index2,index3/_search
{ "query": { "multi_match":{"query":"fork",  "fields":["title","plaintext"]  } } }
</syntaxhighlight>


On the upside, you then don't have to creatively merge the separate result sets (because multi search ''will not do that for you'').

On the downside, you don't get to control that merge: e.g. scoring can be difficult to control, and you don't really get to specify search fields per index anymore, or source filtering fields - though some forethought (e.g. about field naming between indices) can help with some of those.
=====Other, mostly more specific search-like APIs, and some debug stuff=====
'''knn'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/knn-search-api.html] - searches a dense_vector field for vectors close to the query vector
: the separate API is deprecated, being folded into [https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html#search-api-knn an option] of the regular search API

'''suggester'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-suggesters.html] - suggests similar search terms based on [[edit distance]]

'''terms_enum'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-terms-enum.html] -

'''count'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-count.html]

'''explain'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-explain.html]

'''profile'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-profile.html]

'''validate'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-validate.html]

'''shards'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-shards.html] - report which shards would be accessed


====composing queries====

=====Compound queries=====
{{stub}}

Compound queries[https://www.elastic.co/guide/en/elasticsearch/reference/current/compound-queries.html] wrap others (leaf queries, or other compound queries).




One reason is logical combinations of other queries {{comment|(note that when you do this for filtering only, there are sometimes other ways to filter that might be more efficient in specific situations)}}

'''{{inlinecode|bool}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-bool-query.html]
:: lets you combine requirements like
::: {{inlinecode|must}}
::: {{inlinecode|filter}} is <tt>must</tt> without contributing to scoring
::: {{inlinecode|must_not}}
::: {{inlinecode|should}} ('a portion of the leaf queries must match', and more matches score higher)
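For example (field names made up):

<syntaxhighlight lang="javascript">
{
  "query": {
    "bool": {
      "must":     [ { "match": { "plaintext": "fork"   } } ],
      "filter":   [ { "term":  { "status": "published" } } ],
      "must_not": [ { "term":  { "hidden": true        } } ],
      "should":   [ { "match": { "title": "fork"       } } ]
    }
  }
}
</syntaxhighlight>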


The rest is mostly about more controlled scoring (particularly when you make such combinations)


'''{{inlinecode|dismax}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-dis-max-query.html]
:: if multiple subqueries match the same document, the highest score gets used


'''{{inlinecode|boosting}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-boosting-query.html]
:: have subqueries weigh in positively and negatively


'''{{inlinecode|constant_score}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-constant-score-query.html]
:: results from the search this wraps all get a fixed constant score


'''{{inlinecode|function_score}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-function-score-query.html]
:: scores results from the search it wraps, with scripting that can consider values from the document and query




=====Term-level and full-text queries=====
{{stub}}

'''Term queries''' are ''not'' put through an analyser, so stay one string, and are mostly used to match exact strings/terms.

'''Full-text queries''' are put through an [[#More_on_indexing|analyzer]] (the same way as the text fields they search), which does some alterations, and means the search then deals with multiple tokens.
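To illustrate that difference with a made-up text field:

<syntaxhighlight lang="javascript">
// will NOT match a text field whose value was analyzed to [quick, brown, fox]
{ "query": { "term":  { "plaintext": "Quick Brown Fox" } } }

// will match - the query text is itself analyzed to [quick, brown, fox] first
{ "query": { "match": { "plaintext": "Quick Brown Fox" } } }
</syntaxhighlight>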
'''{{inlinecode|term}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-term-query.html]
: exact value
: works on text fields, but is only useful for fairly controlled things (e.g. username)

'''{{inlinecode|terms}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-query.html]
: like term, matches ''any'' from a provided list

'''{{inlinecode|terms_set}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-terms-set-query.html]
: like terms, but with a minimum amount to match from a list

'''{{inlinecode|exists}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-exists-query.html]
: Returns documents that contain any indexed value for a field.

'''{{inlinecode|fuzzy}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-fuzzy-query.html]
: Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.

'''{{inlinecode|ids}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html]
: Returns documents based on their document IDs.

'''{{inlinecode|prefix}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html]
: Returns documents that contain a specific prefix in a provided field.

'''{{inlinecode|range}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html]
: Returns documents that contain terms within a provided range.


'''{{inlinecode|regexp}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html]
: Returns documents that contain terms matching a regular expression.

'''{{inlinecode|wildcard}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html]
: Returns documents that contain terms matching a wildcard pattern.


'''{{inlinecode|match}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html]
: matches text, number, date, or boolean data
: in a single field
: optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
: text is analysed; all tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{
  "query": {
    "match": {
      "fieldname":{
        "query": "scary matrix snowman monster",
        "minimum_should_match": 2  //optional, but
      }
    }
  }
}


'''{{inlinecode|multi_match}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html]
: match, but searches one query in multiple fields
{
  "query": {
    "multi_match" : {
      "query":    "this is a test",
      "fields": [ "subject", "message" ]
    }
  }
}


'''{{inlinecode|combined_fields}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-combined-fields-query.html]
: somewhat like multi_match, but instead of doing the query to each field separately, it acts as if it matched on a single field (that consists of the mentioned fields combined), useful when the match you want could span multiple fields
{
  "query": {
    "combined_fields" : {
      "query":      "database systems",
      "fields":    [ "title", "subject", "message"],
      "operator":  "and"
    }
  }
}


'''{{inlinecode|match_phrase}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html]
: analyses into tokens, then matches only if ''all'' are in a field, in the same sequence, and by default with a slop of 0, meaning they must be consecutive
{
  "query": {
    "match_phrase": {
      "message": "this is a test"
    }
  }
}
 




'''{{inlinecode|match_phrase_prefix}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html]
: like match_phrase, but the ''last'' token is a prefix match

'''{{inlinecode|match_bool_prefix}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-bool-prefix-query.html]
: constructs a {{inlinecode|bool}} of term-{{inlinecode|should}}s, but the last token is a prefix match




'''{{inlinecode|intervals}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-intervals-query.html]
: imposes rules based on order and proximity of matching terms


'''{{inlinecode|query_string}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html]
: lucene-style query string which lets you specify fields, and AND, OR, NOT, in the query itself [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax]
: maybe don't expose this to users, because it's quick to error out

'''{{inlinecode|simple_query_string}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html]
=====Span queries=====
<!--
'''Span queries'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/span-queries.html]
use token-position information to be able to express requirements on terms' order and proximity.

This seems mostly useful for technical documents,
and other things that conform to templates enough that this specificity is meaningful (rather than just rejecting a lot).

* span_term[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-term-query.html] - like regular term, but works when combining span queries with some of the below

* span_containing[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-containing-query.html]
: takes a list of span queries, returns only those spans which also match a second span query
: similar to near but more controlled?
* span_near[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-near-query.html] - multiple span queries must match within distance, and with optional order requirement
* span_or[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-or-query.html] - match by ''any'' span query
* span_not[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-not-query.html] - excludes based on a span query
* span_first[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-first-query.html] - must appear within the first N positions of a field

* span_field_masking[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-field-masking-query.html] - lets you do span-near or span-or across fields
* span_multi[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-multi-term-query.html] - wraps a term, range, prefix, wildcard, regexp, or fuzzy query
* span_within[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-within-query.html] - one span must appear within another
-->
 
====Result set====
 
<!--
* size - how many hits to actually return
:: together with from, you can imitate pages
:: from & size don't allow fetching more than 10000 documents. If you need to, look at [https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after]
:: ...also, these seem to be distinct searches{{verify}}, so you're assuming index doesn't change; if you care about that, consider async searches


'''{{inlinecode|ids}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-ids-query.html]
* sort
: Returns documents based on their document IDs.


'''{{inlinecode|prefix}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-prefix-query.html]
* search_type - how distribution is handled
: Returns documents that contain a specific prefix in a provided field.  
:: query_then_fetch - scoring per shard
:: dfs_query_then_fetch - overall scoring. More precise but slower.


'''{{inlinecode|range}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-range-query.html]
: Returns documents that contain terms within a provided range.


'''{{inlinecode|regexp}}'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-regexp-query.html]
: Returns documents that contain terms matching a regular expression.


'''{{inlinecode|wildcard}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-wildcard-query.html]
: Returns documents that contain terms matching a wildcard pattern.






{{inlinecode|match}} [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query.html]
'''Source filtering, stored fields, runtime fields'''
: matches text, number, date, or boolean data
 
: in a single field
: optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
: text is analysed; all tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{
  "query": {
    "match": {
      "fieldname":{
        "query": "scary matrix snowman monster",
        "minimum_should_match": 2  //optional, but
      }
    }
  }
}


'''{{inlinecode|multi_match}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-multi-match-query.html]
Values present in the
: match, but searches one query in multiple fields
{
  "query": {
    "multi_match" : {
      "query":    "this is a test",
      "fields": [ "subject", "message" ]
    }
  }
}


'''{{inlinecode|combined_fields}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-combined-fields-query.html]
fields is a little more controlled/flexible
: somewhat like multi_match, but instead of doing the query to each field separately, it acts as if it matched on a single field (that consists of the mentioned fields combined), useful when the match you want could span multiple fields
{
  "query": {
    "combined_fields" : {
      "query":      "database systems",
      "fields":    [ "title", "subject", "message"],
      "operator":  "and"
    }
  }
}


https://www.elastic.co/guide/en/elasticsearch/reference/current/search-fields.html




'''{{inlinecode|match_phrase}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase.html]
In queries:
: analyses into tokens, then matches only if ''all'' are in a field, in the same sequence, and by default with a slop of 0 meaning they must be consecutive
* fields
{
  "query": {
    "match_phrase": {
      "message": "this is a test"
    }
  }
}






'''{{inlinecode|match_phrase_prefix}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-query-phrase-prefix.html]
: like match_phrase, but the ''last'' token is a prefix match


'''{{inlinecode|match_bool_prefix}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-match-bool-prefix-query.html]
: constructs a {{inlinecode|bool}} of term-{{inlinecode|should}}s, but the last last is a prefix match




_source is the complete document as ingested.


Source filtering refers to asking for fewer fields from that.


'''{{inlinecode|intervals}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-intervals-query.html]
The upside is that it means less transfer.
: imposes rules based on order and proximity of matching terms


The downside is that _source needs to be parsed before it can be filtered, which for larger/more complex documents can add up.


'''{{inlinecode|query_string}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html]
: lucene-style query string which lets you specify fields, and AND,OR,NOT, in the query itself [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-query-string-query.html#query-string-syntax]
: maybe don't expose to users, because it's quick to error out


'''{{inlinecode|simple_query_string}}''' [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-simple-query-string-query.html]
In queries:
* _source - what fields of the _source you want returned in the results. can be true (all, default), false (none), or a comma separated list of field names
: (see also _source_includes, _source_excludes)


=====Span queries=====


<!--
'''Span queries'''[https://www.elastic.co/guide/en/elasticsearch/reference/current/span-queries.html]
use token-position information to express requirements on terms' order and proximity.


This seems mostly useful for technical documents,
and other things that conform to templates enough that this specificity is meaningful (rather than just rejecting a lot).


* span_term[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-term-query.html] - like regular term, but works when combining span queries with some of the below


* span_containing[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-containing-query.html]
: takes a list of span queries, returns only those spans which also match a second span query
: similar to near but more controlled?
* span_near[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-near-query.html] - multiple span queries must match within a given distance, with an optional order requirement
* span_or[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-or-query.html] - match by ''any'' span query
* span_not[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-not-query.html] - excludes based on a span query
* span_first[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-first-query.html] - must appear within the first N positions of a field
* span_field_masking[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-field-masking-query.html] - lets you do span-near or span-or across fields
* span_multi[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-multi-term-query.html] - wraps a term, range, prefix, wildcard, regexp, or fuzzy query
* span_within[https://www.elastic.co/guide/en/elasticsearch/reference/current/query-dsl-span-within-query.html] - one span must appear within another
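
A minimal sketch (field name hypothetical): require "error" followed by "timeout" within two positions:
{
  "query": {
    "span_near": {
      "clauses": [
        { "span_term": { "message": "error" } },
        { "span_term": { "message": "timeout" } }
      ],
      "slop": 2,
      "in_order": true
    }
  }
}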
-->


====Result set====
<!--
* size - how many hits to actually return
:: together with from, you can imitate pages
:: from & size don't allow fetching more than 10000 documents. If you need more, look at search_after [https://www.elastic.co/guide/en/elasticsearch/reference/current/paginate-search-results.html#search-after]
:: ...also, these seem to be distinct searches{{verify}}, so you're assuming the index doesn't change; if you care about that, consider async searches


* sort


* search_type - how distribution is handled
:: query_then_fetch - scoring per shard
:: dfs_query_then_fetch - overall scoring. More precise but slower.


'''Source filtering, stored fields, runtime fields'''


_source is the complete document as ingested.

Source filtering refers to asking for fewer fields from that.

The upside is that it means less transfer.

The downside is that _source needs to be parsed before it can be filtered, which for larger/more complex documents can add up.

In queries:
* _source - what fields of the _source you want returned in the results. Can be true (all, default), false (none), or a comma-separated list of field names
: (see also _source_includes, _source_excludes)


Stored fields[https://www.elastic.co/guide/en/elasticsearch/reference/7.15/search-fields.html#stored-fields] are stored in the index, and let you say "don't send _source at all, just these specific stored fields are enough for me".

If _source is large, this can save parsing time (and traffic for anyone who forgets to do source filtering).

The downside is that it duplicates data, so if there is little improvement, it's just bloat.

(stored fields are also something you want when you don't store _source at all - see the indexing section)

In queries:
* stored_fields - what stored-only fields to also return?
: interacts with the _source options


There are also doc value fields [https://www.elastic.co/guide/en/elasticsearch/reference/current/search-fields.html#docvalue-fields]

In queries:
* docvalue_fields


The fields option is a little more controlled/flexible.

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-fields.html

https://www.elastic.co/guide/en/elasticsearch/reference/current/search-search.html

In queries:
* fields
-->
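As a basic example of source filtering (index and field names hypothetical), returning only two fields of _source:
GET /indexname/_search
{
  "_source": ["title", "author"],
  "query": { "match": { "plaintext": "fork" } }
}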


====Aggregation====
<!--


Aggregation is any calculation from multiple documents.

This can be particularly useful when you store structured information - like metrics.
It is not as commonly used in full-text search.


Grouped into
* Metric aggregations[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-metrics.html] - whole-set things like
: min, max, avg, percentile, boxplot, cardinality, stats, extended_stats, geo_bounds, geo_centroid, matrix_stats
: t_test
: rate
: text: count, min_length, max_length, avg_length, entropy
: ...and more


* Bucket[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-bucket.html] - per-group-of-documents things like
: missing
: sampler
: adjacency_matrix, histogram, date_histogram, auto_date_histogram
: categorize_text, significant_terms, significant_text, terms (...counter)
: range
: filter
: geo_distance (and other geo things)
: children, nested


* Pipeline[https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations-pipeline.html] - take input from other aggregations
: bucket_selector
: bucket_sort
: avg_bucket, sum_bucket
: stats_bucket, extended_stats_bucket
: cumulative_sum
: cumulative_cardinality
: max_bucket, min_bucket
: moving_avg, moving_percentiles, moving_fn
: serial_diff
: percentiles_bucket
: bucket_correlation
: bucket_count_ks_test
: normalize
: inference
: derivative
: bucket_script
: ...and more


https://logz.io/blog/elasticsearch-aggregations/
-->
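For a flavor of the request shape (index and field names hypothetical; terms wants a keyword or numeric field), bucketing per host and averaging a metric within each bucket:
GET /indexname/_search
{
  "size": 0,
  "aggs": {
    "per_host": {
      "terms": { "field": "hostname" },
      "aggs": {
        "avg_bytes": { "avg": { "field": "bytes" } }
      }
    }
  }
}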


==further notes==
===Performance considerations===


<!--
https://www.elastic.co/guide/en/elasticsearch/reference/current/size-your-shards.html


https://www.elastic.co/blog/how-many-shards-should-i-have-in-my-elasticsearch-cluster
-->




====query considerations====
<!--


'''Search only the fields you need'''.


Searching more fields (via query_string or multi_match) ''roughly'' means
a number of separate searches
that are then merged.


If you ''typically'' want to search the same set of multiple fields (that you also want to be able to search separately),
it may be worth having a single field that contains the combination.
: You can do that without duplicating things in _source by using copy_to in the mapping (the combined field probably ''only'' gets values by being copied into)


'''Filtering''' search, i.e. matching without scoring (and sorting by that score), will tend to be faster.
: If you're mostly just aggregating data, avoid scoring.
: ...yet in a text search engine, scoring tends to be fairly central


'''Consider converting a subset of search specifications to queries yourself, instead of exposing arbitrary searching to users'''

They ''will'' at some point manage a franken-query that takes CPU-minutes.

In more expert cases, you probably can't do much about this - more flexibility is a feature.

But consider whether "present a nice form, then have code decide on a query that isn't going to spend CPU-minutes" is a thing you can do.




-->
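A sketch of that copy_to idea (field names hypothetical) - title and plaintext stay separately searchable, and also feed one combined field:
PUT /indexname
{
  "mappings": {
    "properties": {
      "title":      { "type": "text", "copy_to": "everything" },
      "plaintext":  { "type": "text", "copy_to": "everything" },
      "everything": { "type": "text" }
    }
  }
}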




====More general settings====
<!--
'''Lower the refresh interval if a minute later is also acceptable'''

Increasing it from the default 1 sec ("I want it now") to something on the order of 10 to 30 sec reduces the overhead that comes from this work
(a little more so when not using bulk import).

Exactly how many documents to add at once before things level off varies, so some trial and error can be good
[https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-indexing-speed.html#_use_bulk_requests]
(also, bulk inserts may wish to temporarily increase the refresh interval)

Where: refresh_interval in each index's _settings


Other aspects of the '''indexing buffer''' are sometimes interesting[https://www.elastic.co/guide/en/elasticsearch/reference/current/indexing-buffer.html].

Mostly, if the indexing buffer is too small, it'll still flush a lot.

You shouldn't need more than a few hundred MB{{verify}}, and since it defaults to 10% of Java's heap size, that's usually on the right order.
-->
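For example, relaxing the refresh interval on one index (index name hypothetical):
PUT /indexname/_settings
{
  "index": { "refresh_interval": "30s" }
}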


====indexing considerations====
<!--
'''Bulk alterations'''

Any CRUD you can do in bulk will lower latencies compared to doing things in serial.


'''Low/no refresh during bulk import'''

Adding documents in bulk also makes refreshes more efficient, and merges necessary less often.

You can set index.refresh_interval higher for just that import.

Or even disable it, and do a manual refresh request at the end.
This can make sense for some things that ''aren't'' continuously updated, though in most cases it doesn't save much.


'''Bulk import'''

Consider setting index.number_of_replicas to 0 during the import and setting it back later (so that the replication happens later - when?{{verify}})


Indexing is a moderately heavy operation.

You can consider having one cluster do the indexing,
then [https://www.elastic.co/guide/en/elasticsearch/reference/current/xpack-ccr.html replicating] its results onto a distinct search-only cluster that is exposed to users.


'''Parallel import'''

Once you've built a cluster, a single process feeding in documents is unlikely to max out the ingest capacity.

However, if you parallelize and ES gets behind, it will throw TOO_MANY_REQUESTS (429), so watch for that.
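
As a sketch of the bulk API (index name and documents hypothetical) - note the newline-delimited action/document pairs:
POST /_bulk
{ "index": { "_index": "indexname", "_id": "1" } }
{ "title": "first document" }
{ "index": { "_index": "indexname" } }
{ "title": "second document, with an auto-generated ID" }
{ "delete": { "_index": "indexname", "_id": "3" } }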




'''Consider auto-IDs'''

If you continuously index a lot of documents,
and use IDs that are ''only'' really there to guarantee uniqueness (and not for their actual value),
then consider using auto-generated IDs rather than your own.

Checking whether an ID is already in the store is basically free with auto-IDs (it's basically a counter), whereas with your own IDs it's an index check that grows with index size.


'''Use keyword for identifiers, not text'''

Identifiers may not be used like numbers, even when they look like numbers.
That is, they're often there for known-item search.
If you don't (or only exceptionally) need to do range searches on them,
you can keep the indexing slimmer by making them a keyword instead.

You can even consider a multi-field if you want faster exact search, plus somewhat slower range search.
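
A sketch of such a multi-field (field names hypothetical) - search serial for exact matches, serial.asnumber for the occasional range:
PUT /indexname
{
  "mappings": {
    "properties": {
      "serial": {
        "type": "keyword",
        "fields": {
          "asnumber": { "type": "long" }
        }
      }
    }
  }
}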


'''Avoid <tt>nested</tt> if you don't need it'''
as it can make things a factor slower.


'''Avoid <tt>join</tt> if you don't need it'''

Parent-child relationships can make even more of a difference.

Consider that a little duplication (as long as it doesn't become the bulk) may be well worth it in terms of search speed.


'''If you want relational/join stuff, then considering the split/distribution details can help'''

For example, if you can direct the joined documents to end up in the same shard,
operations involving joins and such ([https://www.elastic.co/guide/en/elasticsearch/reference/7.2/search-request-inner-hits.html inner hits]) should end up faster.

You have some control over that, in that documents are assigned to shards something like
 (hash(_routing) % num_routing_shards) / routing_factor
where
: _routing defaults to the document _id, but can be set


'''If searches are typically narrow, then split indices the same way you query them'''

E.g. if you parse logs, consider grouping into a daily index
(see ILM, and for this case probably specifically Data Streams instead),
which could really localize certain queries.

If you have distinct users who typically search their own data,
consider giving each their own index.


'''Add fields to dumb down common searches'''

Don't overdo it (particularly when potentially bloating the index), but...

If you know searches always look in a specific year, consider adding a year field (probably a keyword), so that you don't have to do a range search on a date field.

In general, rounded dates are likelier to get searches cached in the [https://www.elastic.co/guide/en/elasticsearch/reference/current/query-cache.html query cache]


Aggregations are basically query-specific.

If you do a lot of range aggregations, pre-indexing the ranges can help - see e.g. [https://www.elastic.co/guide/en/elasticsearch/reference/current/tune-for-search-speed.html#_pre_index_data]

([https://www.elastic.co/guide/en/elasticsearch/reference/current/search-aggregations.html aggregations] are sort of indices)
-->
====Routing====
<!--
For a given document, routing determines which of an index's shards it should go in
(based mostly on the document ID and the cluster size).

The default tends to spread documents quite evenly (particularly with auto IDs),
and you don't need to touch this.


Custom routing can be useful for a few reasons.

One example is when there are subsets of sorts in your data.
For example, you could group data per user without having to create one index per user.

It is arguably more of a trick than structure,
and it's up to ''you'' to be consistent about specifying the routing in all operations (index, search, more),
because if you e.g. send a search to look in only the wrong shard, it will just miss things.

Note also that it may be less load-balanced in a large cluster when it e.g. turns out you have one user with a lot of data.


https://www.elastic.co/guide/en/elasticsearch/reference/current/mapping-routing-field.html
-->
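A sketch of custom routing (values hypothetical) - note that the same routing value has to be supplied at index and search time:
PUT /indexname/_doc/1?routing=user199
{ "user": "user199", "message": "hello" }

GET /indexname/_search?routing=user199
{ "query": { "match": { "message": "hello" } } }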
====Clusters====
<!--
https://www.elastic.co/guide/en/elasticsearch/reference/current/scalability.html#scalability


e.g. an ingest node intercepts index requests (bulk or not), applies transformations,
and then passes the documents back to the index or bulk APIs.

By default, all nodes can ingest, so it's more a question of whether you've configured any processing.

In some cases you may want to dedicate one or a few nodes to ingest, for offload.

Though note that doing the processing in code before handing documents over to be indexed is, essentially, the same; this is more about how centralized, versionable, or accidentally-skippable that processing may be.
-->
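A sketch of the ingest idea (pipeline and field names hypothetical) - a pipeline with one processor, applied at index time:
PUT /_ingest/pipeline/lowercase-hostname
{
  "processors": [
    { "lowercase": { "field": "hostname" } }
  ]
}

PUT /indexname/_doc/1?pipeline=lowercase-hostname
{ "hostname": "WEBSERVER-01" }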


===Security notes===
<!--
You have probably run into the fact that out of the box, ES may be configured to use SSL,
with a self-signed certificate, meaning that all interaction that is not explicitly told
about the certificate that was generated at install time (or told to not verify it) will fail.


xpack.security.http.ssl.enabled
-->
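If you just want to poke at such an instance, something like the following (the certificate path and user are the usual defaults for package installs, but may differ on yours):
curl --cacert /etc/elasticsearch/certs/http_ca.crt -u elastic https://localhost:9200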


===ES errors and warnings===


<!--
'''Failed to get search application count to include in Enterprise Search usage''' / '''Current license is non-compliant for search application and behavioral analytics. Current license is active basic license. Search Applications and behavioral analytics require an active trial, platinum or enterprise license.'''

...which, if you're watching your logs, comes in a quite-spammy amount of lines (java stack traces).

https://github.com/elastic/elasticsearch/issues/96365

Probably an update enabled xpack monitoring?

If you didn't mean to use monitoring and an upgrade seems to have enabled it, edit kibana.yml and disable xpack.monitoring.collection.enabled



'''License information could not be obtained from Elasticsearch due to ConnectionError'''



'''Monitoring execution failed'''

...seems to come from pretty catch-all code.
Somewhere buried in that stack trace might be some 'Caused by' lines, one of which should be a little more helpful.
Look for those instead.



'''primary shard is not active'''

There are multiple reasons a primary shard isn't active:

* you manually stopped a node that contained a primary shard

* a node's disk is almost full, and ES refuses to allocate new shards to it (default at 85%) {{comment|or even try to reallocate (default at 90%)}}
: if you have a large disk, 15% might be 100+ GByte, and you might want to set these settings (cluster.routing.allocation.disk.watermark.low {{comment|, cluster.routing.allocation.disk.watermark.high}}) to absolute values instead [https://www.elastic.co/guide/en/elasticsearch/reference/current/modules-cluster.html#cluster-shard-allocation-settings]

For example:
PUT /_cluster/settings
{
  "persistent" : {
    "cluster.routing.allocation.disk.watermark.flood_stage": "98%",
    "cluster.routing.allocation.disk.watermark.high" : "97%",
    "cluster.routing.allocation.disk.watermark.low" : "95%"
  }
}

: https://stackoverflow.com/questions/27547091/primary-shard-is-not-active-or-isnt-assigned-is-a-known-node
-->
-->


Line 1,637: Line 1,748:


<!--
<!--





Revision as of 00:42, 21 April 2024

Some practicalities to search systems

Lucene and things that wrap it: Lucene · Solr · ElasticSearch

Search-related processing: tf-idf

Choice side

Broad intro

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ElasticSearch is largely seen as a document store that is searchable.

You could see it as a document store with CRUD HTTP API which happens to do indexing.

Because of its heritage (Lucene at its core) it is is largely seen as text search, but it is also reasonably suited for logging and certain metrics.

(You could use it like a database engine, though in that niche-purpose NoSQL-ey way where it doesn't do strong consistency where you may not really want it for a primary store).


Compared to some other text indexing, it

wraps a few more features, e.g. replication and distribution
does more to make more data types easily searchable, part of why it is reasonable for data logging and metrics (in fact the company behind it is currently leaning on the monitoring/metrics angle),
eases management and is a little more automatic at that (e.g. index stuff that gives more consistent latency over time).


ES libraries are relatively thin wrappers around communicating with the HTTP API, and more about convenience.

There is no user interface included, perhaps in part because ES is fairly flexible. It's not very hard to interface your own form with the API, though.

Subscription model

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

From an "I don't care about jargon and legalese" perspective,

  • you can get a managed cloudy thing - split into a few variants
  • or you can self-host - also split into a few variants (not quite the same?)


(Before 6.3 the free license needed to be refreshed every year, now basic/free does not need an explicit license anymore?(verify))


Around self-hosting there are two or three variants: Basic, platinum, and enterprise (Gold is gone?)

Basic is perfectly good for smaller scale
the differences between platinum and enterprise mostly matter once you do this on large scale [1]


There is a perfectly free subset of ES (ES configured with a Basic license).

...but the default is bit a Basic license, it's a 30-day trial license in which you get more features, to entice you to buy the fuller product.

If you know you want to stick with basic features, then after this 30-day period (or earlier, if you know what you want) you would

not only need to switch to Basic license,
but also disable some features (apparently all X-pack things(verify))

...but if you didn't know that, this is annoying in the form of "automated install configured a trial license for me, but things just stopped working completely after 30 days and now I'm panicking", because it takes more than a little reading what you now need to disable and why. (My logs showed "Elasticsearch Search and Analytics" -- which seemed to just be a confusing name for the machine learning and alerting stuff)


License details

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Implementation details you'd want to know to use it

The major moving parts

Some terminology

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

(note that the core of this comes from the underlying lucene)


mapping - basically a schema, telling ES of how to interpret each field

e.g. mentions types
if a field is not already defined in the mapping by you, it will adds fields as you mention them, with types guessed based on the first value it sees


field - a distinct thing in the mapping

(usually) indexed to be searchable
(usually) fetchable
(there are some field type variants - see below)


document - individual search result.

you can see it as the JSON you sent in, and/or as the part of that that got put into fields (and usually indexed) as specified by the mapping
side note: most fields can have zero or more values in a document, though in a lot of practical cases it's practically just one, and in a few cases it's restricted to just one
original JSON is also stored in _source


index

if you squint, a grouping of documents - a collection you search as a whole
you can have multiple indexes a per cluster (and may well want to)
an index is divided into one or more shards, which is about replication in general and distribution in a cluster
(and shards are divided into segment files, which is more about how updating documents works)


...and this is the point at which you can stop reading this section if you're just doing some experiments on one host. You will get to them once you if and when you need to scale to multiple hosts.

shard

you typically split an index into a number of shards, e.g. to be able to do horizontal scaling onto nodes
internally, each shard is a self-contained searchable thing. In the sense of the complete set of documents you fed in, this is just a portion of the overall thing we we call index here
Shards come in two types: primary (the basic type), and replica.

segment - a shard consists of a number of segments (segments are individual files).

Each segment file is immutable (which eases a lot of management, means no blocking, eases parallelism, eases cacheing).


replica is an exact copy of a shard

the point of replicas is robustness against node failure:
if you have two copies of every shard (and those copies are never on the same hardware), then one node can always drop out and search won't be missing anything
without duplication, node failure would mean missing a part of every index distributed onto it
you could run without replicas to save some memory
memory is unlikely to be much of your cloudy bill (something like ES is CPU and transfer hungry), so using replicas is typically worth it for long term stability


node - distinct ES server instance

in larger clusters, it makes sense to have some nodes take on specific roles/jobs - this discusses some of that

cluster - a group of nodes, serving a set of indices created in it

nodes must mention a shared ID to join a cluster

Combined with

In practice, people often pair it with... (see also "ELK stack")

Web UI, makes a bunch of inspection and monitoring easier, including a dashboard interface to ES
also seen in tutorials, there largely for its console interactively poking ES without writing code
itself pluggable, with bunch of optional things


And, if you're not coding your own ingest,

you can do all of that yourself, but logstash can do a lot of work for you, or at least be a quick start
  • Beats - where logstash is a more generic, configurable thing, beats is a set of specific-purpose ingest scripts, e.g. for availability, log files, network traffic, linux system metrics, windows event log, cloud service metrics, [2] - and a lot more contributed ones[3]
e.g. metricbeat, which is stores ES metrics in ES


Indices

Fields, the mapping, field types

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


By default, ES picks up every part of the document (remember it's JSON), but you can configure it

to have parts go towards multiple fields
to have parts go to no field at all
for all documents based on your mapping having a field but setting "enabled": false,
from other behaviour, such as "values should never be bigger than X bytes" config on a field

Note that in terms of the index, a document can have multiple values for a field. So fields can be considered to store arrays of values, and most searches will match if any of those match.


Each regular field (there are others) gets its own behavior in terms of

whether it is stored and/or
whether it is indexed
if indexed, how it is transformed before going into the index

(For completeness, a 'runtime field, is evaluated at runtime, and not indexed or in the _source. [4]. This can be handy for certain calculated data that you won't search on, or to experiment with fields you'll later make regular fields(verify))


The mapping[5][6] is the thing that lists all fields in an index, mostly:

  • name
  • data type(s)
  • configured processing when new data comes in

Explicit mapping[7], amounts to specifying a mapping before you add documents. This can be preferred when

  • you want specific interpretations of some fields (e.g. identifier that shouldn't be parsed as text, an IP address, a date), and/or
  • you want specific features (e.g. flattened, join, search_as_you_type)

Dynamic mapping[8] is (like 'schemaless') a nice name for

"I didn't set a mapping for a field, so am counting on ES guessing right based on the first value it sees"
if you're mainly handling text, this usually does the right or at least a very sane thing.


dynamic-parameters allows a little more variation of that:

true - unknown fields are automatically added to the mapping as regular fields (indexed) -- the default just described
it'll mostly end up with boolean, float, long, or text/keyword
runtime - unknown fields are automatically added to the mapping as runtime fields (not indexed)
false - unknown fields are ignored
strict - unknown fields mean the document is rejected

Note that

  • with true and runtime, the created field's type is based on the communicated-JSON's type - see this table
  • runtime and dynamic will
    • try to detect whether text seems to be a date[9] (apparently if it follows a specific configured date template(verify))
    • optionally detect numbers[10] (and whether they are reals or integers), but this is disabled by default - probably because it's much less error-prone to control this yourself.
  • You can also alter mappings later - see [11]
but this comes with some restrictions/requirements/footnotes
  • dynamic-parameters is normally set per index(verify) and inherited, but you can be more precise about it



Text-like field types

The documentation makes a point of a split into

  • text family[12] - text, match_only_text
  • keyword family[13] - keyword, constant_keyword, wildcard
  • ...and it seems that e.g. search-as-you-type are considered miscellaneous


  • text[14] (previously known as string analysed)
flexible search of free-form text, analysis can will transform it before indexing (you probably want to know how - see e.g. analysis)
no aggregations
no sorting
  • keyword[15] (previously known as string not_analysed)
structured content (exact search only?), e.g. identifiers, a small/known set of tags, also things like emails, hostnames, zip code, etc.
should be a little faster to match (if only because of smaller index size)(verify)
allows sorting
allows aggregation
can make makes sense for serial numbers/IDs, tags - even if they look like numbers, you will probably only ever search them up as text equality (compared to storing those as a number, keyword may be a little larger yet also saves some index complexity necessary for numeric-range search)
  • constant_keyword [16] - all documents in the index have the same value
e.g. when you send each log file to its own index, this might assist some combining queries (verify)
assists wildcard and regexp queries
  • search_as_you_type[18]
fast for prefix (match at start) and infix (terms within) matches
mainly for autocompletion of queries, but could work for other shortish things (if not short, the index for this will be large)
n-gram style, can have larger n (but costs in terms of index size, so only useful if you usually want many-token matches?)
kept in memory, costly to build


Further notes:

  • it seems that fields that are dynamically added to the mapping and detected as text will get two indexed fields: a free-form text one, and keyword with ignore_above of 256
this is useful if you don't know what it will be used for
but for e.g. identifiers it's pointless to also tokenize it
and for free-form text it will probably do very little -- that is, for all but short documents it ends up _ignored. (It's a clever edge-caes trick to deal with cases where the only value is actually something other than text, and is otherwise almost free)
separately, some things you may wish to not index/serch on, but still store it so you can report it as part of a hit


Data-like field types (primitives):

  • numbers: [20]
    • byte, short, integer, long, unsigned_long
    • float, double, half_float, scaled_float
allows range queries
  • and, arguably, keyword (see above)


Data-like fields (specifics but text-like):

(internally milliseconds since epoch (UTC)
  • version - text that represents semantic versioning
mostly for sortability?
  • ip - IPv4 and IPv6 addresses
allows things like subnet searches (CIDR style)
  • geospatial - points and shapes [23]
including distance and overlap


Different way of treating JSON as transferred

each subfield is separately mapped and indexed, with names based on the nesting dot notation, see e.g. the linked example
takes the JSON object and indexes it as one single thing (basically an array of its values combined)
  • nested[26] basically a variant of object that allows some field indexability


Other stuff

not searchable
not stored (by default) - which makes sense. A search index is not a blob store
seems to be there for the possibility of some plugin extracting text to index?(verify) or fishing it out of _source?(verify)


relations to other documents, by id(verify)
seems to mean query-time lookups, so
you could get some basic lookups for much cheaper than separate requests
at the same time, you can make things slower by doing unnecessary and/or multiple levels of lookups (if you need relational, a relational database is better at that)


token_count[29]

stored as an integer, but takes text, analyses it into tokens, then counts the number of tokens
seems intended to be used via multi-field, to also get the length of text


dense_vector[30] (of floats by defaults, byte also possible)

if indexes, you can use these for knn searches[31]


Managing indices (and thinking about shards and segments)

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Remember that shards are made of segments, and segments are immutable.

ES collects changes in memory, and only occasionally dumps that that into a new segment file.

(this is why documents are not immediately searchable and we call it near-realtime)

Refresh - is that act of writing to a new segment

also helps updates not be blocking - existing segments are searched, new one once they're done

Merge refers to how smaller segments are periodically consolidated into fewer files, [32]

also the only way that delete operations actually flush (remember, segments are immutable).


Refresh interval is 1s by default, but various setups may want that larger - it's a set of tradeoffs

  • Refresh can be too quick in that the overhead of refresh would dominate, and less work is spent actually doing useful things (and be a little pricier in pay-per-CPU-use cloudiness)
  • Refresh can also be too slow both in that
    • new results take a long time to show up
    • the heavier load, and large new segments, could make fore more irregular response times

The default is 1s. Large setups might increase that up to the order of 30s.

(more in the tradeoffs below)




index lifecycle management (ILM) lets you do slightly higher-level management like

  • create a new index when one grows too large
  • do things like creating an index per time interval

Both can help get the granularity you want when it comes to backing them up, duplicating them, retiring them to adhere to data retention standards.


https://www.elastic.co/guide/en/elasticsearch/reference/7.16/index-lifecycle-management.html




Also on the topic: a reindex basically means "read data, delete data in ES, ingest again".

You would not generally want this over a merge.
It's mainly useful when you make structural schema changes, and you want to ensure the data uniformly conforms to that.

https://www.elastic.co/guide/en/elasticsearch/reference/current/docs-reindex.html

other field types

multi-fields[33] let you specify that one thing in your JSON should be picked up towards multiple fields.

  • e.g. a city name
    • as a keyword, for exact match, sorting and aggregation
    • and towards the overall text search, via an analyser
  • text as a searchable thing, and also store its length via token_count



runtime fields [34]

mostly about query time - not part of the index (so won’t display in _source)
just calculated as part of the response
can be defined in the mapping, or in an individual query
(seem to be a refinement of script fields?)
easy to add and remove to a mapping (because there is no backing storage for them)
https://www.elastic.co/guide/en/elasticsearch/reference/current/runtime.html
https://www.elastic.co/blog/getting-started-with-elasticsearch-runtime-fields


you can search specific indices (comma separated names, or _all), or just one - sometimes useful


Metadata fields (that is, those other than e.g. the fields from your document, and runtime/script fields)

  • _id - document's ID
  • _type - document's mapping type
deprecated
  • _index - index that the document is in
is the original JSON that indexing got
note that everything needs to be indexed
this 'original document' is also used in update and reindex operations
so while you can ask it to not store _source at all, that disables such operations
that original submission can be handy for debug
that original submission can be core functionality for your app in a "we found it by the index, now we give you the actual document" way
you can filter what parts of _source are actually sent in search results - see source filtering
  • _size - byte size of _source


  • _ignored - fields in a document that indexing ignored for any reason - see

https://www.elastic.co/guide/en/elasticsearch/reference/7.16/mapping-fields.html others


More on indexing

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


The bulk API could be used for all CRUD operations, but in practice is probably mostly used for indexing.


Do bulk adds where possible

While you can do individual adds, doing bunches together reduces overheads, and can make refreshes more efficient as well, largely relating to the refresh interval.


However, keep in mind that if some part of a bulk operations fails, you need to notice, and deal with that correctly.


Missing data

Setting a field to null (or an array of nulls, or an empty array) means it is not added to the index.

Sounds obvious, but do a mental check that "that field can not be matched in a lot of searches" (except for missing, must_not exists, and such(verify)) is the behaviour you intend.

Note that it also affects aggregation on the relevant fields, and some other details.


Ignoring data and fields

Field values may be ignored for varied reasons, including:

by default, using a type that can't be converted means the entire rejects the document add/update operation
though note that in bulk updates it only considers that one operation failed, and will report a partial success
if you set ignore_malformed[36], it will instead reject only bad values, and process the rest normally



You can search for documents where this happened at all (you'd probably do that for debug reasons) with a query like

"query":{ "exists":{"field":"_ignored"} } 



Text processing

You want curacao to match Curaçao? Want fishes to match fish?

To computers those are different characters and different strings, whether such variations are semantically equivalent, or close enough to be fuzzy with, and how, might vary per language.


Analysers represent relatively simple processing that helps, among other things, normalize data (and implicity the later query) for such fuzziness.




An analyzer is a combination of

  • a character filter
  • a tokenizer
  • a token filter

(a normalizer seems to be an analyzer without a tokenizer - so practically mainly a character filter (can also have a token filter but this is probably not so useful)

Analyzers takes text, and applies a combination of character filters, tokenizer, and token filters, usually to the end of

  • split it into tokens (e.g. words)
  • stripping out things (e.g symbols and punctuation)
  • normalize (e.g. lowercasing)


There are built-in analyzers to cover a lot of basic text search needs

  • standard (the default if you don't specify one)
    • lowercases
    • splits using Unicode Text Segmentation (for english is mostly splitting on spaces and punctuation, but is better behaved default for some other languages) and lowercases
    • removes most punctuation,
    • can remove stop words (by default does not)
  • language-specific analysers - currently arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, estonian, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, and thai.
details vary
can remove stopwords
can apply stemming (and take a list of exception cases to those stemming rules)
no lowercasing
splits on non-letters
  • stop - simple plus stopword removal, so
no lowercasing
splits on non-letters
can remove stopwords
can lowercase
can remove stopwords
no lowercasing
splits on whitespace
  • keyword - does nothing, outputs what it was given
reduces text in a way that helps detect duplicates, see something like [37]


https://www.elastic.co/guide/en/elasticsearch/reference/current/analysis-analyzers.html

Data streams

Snapshots, Searchable snapshots

Install

Things you may want to think about somewhere before you have a big index

APIs

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


ES grew a lot of APIs over time.

Perhaps the most central are

  • the document CRUD part (Index, Get, Update, Delete, respectively),
  • and the searching.


Document CRUD

  • Index - add a document (and implicitly get it indexed soon)
e.g. PUT /test/_doc/1 with the JSON doc in the body
  • Get - get a specific document by ID
e.g. GET test/_doc/1 (HEAD if you only want to check that it exists)
this basically just gives you _source, so you can filter what part of that gets sent[38]
e.g. POST /test/_update/1
allows things like "script" : "ctx._source.votes += 1"
e.g. DELETE /test/_doc/1


Note that there is also a bulk API [41]

that lets you do many of the above at once
POST _bulk
where the body can contain many index, delete, update
libraries often support the bulk API in a way that makes sense for how the rest of that library



"Is update really update?"

It seems that update is essentially

  • construct a a new document based on the current version's _source
  • add 1 to the version
  • save that to a new segment
...even if it turns out there is no difference (...by default - you can specify that it should check and not do that - see detect_noop)

It seems that the old version will be considered deleted but still be on disk in that immutable segment it was originally entered in (until the next merge), and searches just happen to report only the latest (using _version)(verify) So if you do continuous updates on all of your documents, you can actually get the update process to fall behind, which can mean you temporarily use a lot more space (and search may be slightly slower too).



"Is there a difference between indexing a new version and updating the document?"

Very little in terms of what gets done at low level (both will create a new document, mark old as deleted)

It's most that for you there is a question of what you do is easier to express as

  • a new version of the data
  • the changes you want to see, or a script that makes them

Also, keep in mind how you are updating. If it's fetch, change, send, that's more steps than 'here is how to change it' in a script.

So if you do not keep track of changes to send to elasticsearch, you could update by throwing everything at at with detect_noop - which only applies to update.




Search[42]

GET /indexname/_search

Multi Search[43]

multiple searches with a single API call
GET /indexname/_msearch
...though the request can switch index

Async search


Multi Get[44]

Retrieves multiple JSON documents by ID.
GET /_mget



Search

Shapes of a query

This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Most APIs take a query that looks something like:

{
  "query": {
    "match_phrase": {
      "fieldname": "word word"
    }
  }
}


That is, there is a specific Query DSL, where you expressing a structured search as an abstract syntax tree, which is communicated as JSON.


Notes:

  • the field names will vary with specific types of search
In fact even the nesting will
  • You might like Kibana's console to play with these, as it has some auto-completion of the search syntax.
  • a quick "am I even indexing" test would probably use match_all[45]


Exceptions / other options

The only exception to "AST as JSON" seems to be URI search[46] which exposes basic searches in a URL (without needing to send a JSON style query in the body), but unless this covers all your needs, you will eventually want to abandon this. Might as well do it properly up front)}}

The only exceptions to "you must create query ASTs yourself" seem to be:

still uses that JSON, but feeds in a single query string in that specific lucene syntax, to be parsed internally
exposes a bunch of the search features in a shorter-to-write syntax
...that you have to learn
...that is fairly breakable
...that allows frankenqueries that unduly load your system
...so you probably don't want to expose this to users unless all of them are power users / already know lucene
  • simple_query_string[48]
same idea as the previous, but using the somewhat more simplified syntax
less breaky, still for experts

...but again, until this covers all your needs and doesn't ope


These exceptions may looks impler but in the long run you will probably need to abandon them. If you want to learn it once, learn the AST way. (that sais, a hacky translation to the latter syntax is sometimes an easier shortcut for the short term).

API-wise

The search API[49] (/indexname/_search) should cover many basic needs.

The search template API[50] does the same sort of search, but you can avoid doing some complex middleware/client-side query construction by asking ES to slot some values into an existing saved template.

Multi search[51] (/indexname/_msearch) is sort of like bulk-CRUD but for search: you send in multiple search commands in one request. Responds with the same number of result sets.

Multi search template API[52] - seems to just be the combination of multi search and use of templates.



On result state, and getting more consistency between search interactions

In search systems, it can be a good tradeoff to

get a good indication of how many there roughly, and stop counting when the number amounts to "well, um, many" -- rather than get a precise count
only fetch-and-show the complete data for only the first few -- rather than everything
forget the search immediately after serving it -- rather than keeping state around in RAM and/or on disk, for who knows how long exactly

...because it turns out we can do that faster.

This is ES stops counting after 10000 (max_result_window), only return the first 10 (size=10, from=0)


Also, if search-and-fetch-a-few turns out to be cheap in terms of IO and CPU, we can consider doing them without storing state.

This also because we know that in interactive browser use, most people will never check more than a few.

If in fact someone does the somewhat unusual thing of browing to page 2 or 3 or 4, you can redo the same search and fetch some more (use from to fetch what amounts to the next bunch).

However, if the index was refreshed between those actions, this new search will shift items around, so might get the same item again, or never show you are didn't see one one. Whenever the consistency or completeness really matters, you are probably looking for async or PIT:

Also, if you use this to back an API that allows "fetching everyting", you won't have won much.



Async search[53] - lets you start searches in the background. Search functionality is mostly equivalent, there's some minor parameter differences (think cache)

the searches are store din their own index(verify) so you may want a mechanism to delete the searches by ID if you need to, and/or lower the time these are kept (default is 5 days, see keep_alive)
note that you wait_for_completion_timeout lets you ask for "return regular search if you finish quickly, make it async if not"


Point in time API[54] - consider that document updates and refreshes means you will generally get the latest results. If instead it is more important to get a consisten set, you can use PIT (or presumably async search?)




Searching multiple indices

...mostly points to Multi search (_msearch), which looks something like:

GET /index1/_msearch
{ }
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }
{"index": "index2"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["title","plaintext"]  } } }
{"index": "index3"}
{ "query": {"multi_match":{ "query":"fork",  "fields":["plaintext"]          } } }



Note that you can also do:

GET /index1,index2,index3/_search
{ "query": { "multi_match":{"query":"fork",  "fields":["title","plaintext"]  } } }

On the upside, you don't have to creatively merge the separate result sets yourself (multi search will not do that for you).

On the downside, you don't get to control that merge: scoring can be difficult to control, and you no longer get to specify search fields (or source filtering fields) per index - though some forethought (e.g. about field naming between indices) can help with some of that.



Other, mostly more specific search-like APIs, and some debug stuff

knn[55] search searches dense_vector fields for the vectors closest to a query vector.

the separate kNN search API is deprecated; it is being moved to an option on the regular search API
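A minimal sketch, assuming a dense_vector field called embedding (of dimension 3 here, just to keep the example short):

GET /indexname/_search
{
  "knn": {
    "field":          "embedding",
    "query_vector":   [0.3, 0.1, 1.2],
    "k":              10,
    "num_candidates": 100
  }
}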

suggester[56] - suggests similar search terms based on edit distance
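For example, asking the term suggester for terms near a misspelled input (field name reused from earlier examples):

GET /indexname/_search
{
  "suggest": {
    "my-suggestion": {
      "text": "forc",
      "term": { "field": "plaintext" }
    }
  }
}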

terms_enum[57] - enumerates indexed terms in a field that match a given partial string, e.g. for autocomplete-style uses
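For example (the field name is just an example):

GET /indexname/_terms_enum
{
  "field":  "username",
  "string": "ki"
}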


count[58] - returns just the number of matching documents, without fetching them

explain[59] - explains how a specific document scores against a given query

profile[60] - gives a detailed timing breakdown of how a search executed (a flag on the search API rather than a separate endpoint)

validate[61] - checks whether a (potentially expensive) query is valid, without executing it

shards[62] - reports which shards would be accessed
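Several of these take a query in the same shape the search API does. For example, count:

GET /indexname/_count
{ "query": { "match": { "plaintext": "fork" } } }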

composing queries

Compound queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.

Compound queries[63] wrap other queries (leaf queries, or other compound queries), for a few reasons.


One reason is logical combinations of other queries (note that when you do this for filtering only, there are sometimes other ways to filter that might be more efficient in specific situations)

bool [64]

lets you combine subqueries with clauses like
must
filter (like must, but without contributing to scoring)
must_not
should ('some portion of these must match', and more matches score higher)
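A sketch combining those clause types (the status and language fields are made up for this example):

GET /indexname/_search
{
  "query": {
    "bool": {
      "must":     [ { "match": { "plaintext": "fork" } } ],
      "filter":   [ { "term":  { "status": "published" } } ],
      "must_not": [ { "term":  { "language": "de" } } ],
      "should":   [ { "match": { "title": "fork" } } ]
    }
  }
}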


The rest is mostly about more controlled scoring (particularly when you make such combinations)

dismax [65]

if multiple subqueries match the same document, the highest score gets used
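A sketch (note that in the query DSL it is spelled dis_max; tie_breaker optionally adds a fraction of the other matching subqueries' scores):

GET /indexname/_search
{
  "query": {
    "dis_max": {
      "queries": [
        { "match": { "title":     "fork" } },
        { "match": { "plaintext": "fork" } }
      ],
      "tie_breaker": 0.3
    }
  }
}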


boosting [66]

have subqueries weigh in positively and negatively
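A sketch: matches for the positive query are required, and matches for the negative query reduce (rather than remove) the score:

GET /indexname/_search
{
  "query": {
    "boosting": {
      "positive": { "match": { "plaintext": "fork" } },
      "negative": { "match": { "plaintext": "plastic" } },
      "negative_boost": 0.5
    }
  }
}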


constant_score [67]

results from the search this wraps all get a fixed constant score
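A sketch (the language field is made up for this example):

GET /indexname/_search
{
  "query": {
    "constant_score": {
      "filter": { "term": { "language": "en" } },
      "boost":  1.2
    }
  }
}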

function_score [68]

scores results from the search this wraps via scripting that can consider values from the document and the query
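A sketch, assuming a numeric popularity field; this multiplies the text relevance score by a value derived from the document:

GET /indexname/_search
{
  "query": {
    "function_score": {
      "query": { "match": { "plaintext": "fork" } },
      "script_score": {
        "script": { "source": "_score * Math.log(2 + doc['popularity'].value)" }
      }
    }
  }
}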


Term-level and full-text queries
This article/section is a stub — some half-sorted notes, not necessarily checked, not necessarily correct. Feel free to ignore, or tell me about it.


Term queries are not put through an analyzer, so the query stays one string, and they are mostly used to match exact strings/terms.

Full-text queries are put through an analyzer (the same way as the text fields it searches), which does some alterations, and means the search then deals with multiple tokens.


term [69]

exact value
works in text fields but is only useful for fairly controlled things (e.g. username)

terms [70]

like term, matches any from a provided list

terms_set [71]

like terms, but with a minimum amount to match from a list
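Sketches of the first two (username as a fairly controlled field, per the note above):

GET /indexname/_search
{ "query": { "term":  { "username": { "value": "kimchy" } } } }

GET /indexname/_search
{ "query": { "terms": { "username": [ "kimchy", "elastic" ] } } }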


exists[72]

Returns documents that contain any indexed value for a field.

fuzzy[73]

Returns documents that contain terms similar to the search term. Elasticsearch measures similarity, or fuzziness, using a Levenshtein edit distance.

ids[74]

Returns documents based on their document IDs.

prefix[75]

Returns documents that contain a specific prefix in a provided field.

range [76]

Returns documents that contain terms within a provided range.
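For example, on a date field (the timestamp field name is an assumption):

GET /indexname/_search
{ "query": { "range": { "timestamp": { "gte": "2024-01-01", "lt": "2025-01-01" } } } }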

regexp[77]

Returns documents that contain terms matching a regular expression.

wildcard [78]

Returns documents that contain terms matching a wildcard pattern.
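For example:

GET /indexname/_search
{ "query": { "wildcard": { "username": { "value": "ki*y" } } } }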


match [79]

matches text, number, date, or boolean data
in a single field
optionally does things like phrase or proximity queries, and things like edit-distance fuzziness
text is analyzed and all resulting tokens are searched - effectively defaults to OR of all terms (and/so apparently minimum_should_match=1)
{
  "query": {
    "match": {
      "fieldname":{
        "query": "scary matrix snowman monster",
        "minimum_should_match": 2   //optional, but 
      }
    }
  }
}

multi_match [80]

match, but searches one query in multiple fields
{
  "query": {
    "multi_match" : {
      "query":    "this is a test", 
      "fields": [ "subject", "message" ] 
    }
  }
}

combined_fields [81]

somewhat like multi_match, but instead of applying the query to each field separately, it acts as if matching on a single field that consists of the mentioned fields combined - useful when the match you want could span multiple fields
{
  "query": {
    "combined_fields" : {
      "query":      "database systems",
      "fields":     [ "title", "subject", "message"],
      "operator":   "and"
    }
  }
}


match_phrase [82]

analyzes the query into tokens, then matches only if all are present in a field, in the same sequence, by default with a slop of 0, meaning they must be consecutive
{
  "query": {
    "match_phrase": {
      "message": "this is a test"
    }
  }
}


match_phrase_prefix [83]

like match_phrase, but the last token is a prefix match

match_bool_prefix [84]

constructs a bool of term shoulds, where the last token is a prefix match
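To contrast the two: the first requires the terms in sequence (the last as a prefix), the second only requires the terms to be present anywhere (the last as a prefix):

GET /indexname/_search
{ "query": { "match_phrase_prefix": { "message": "this is a t" } } }

GET /indexname/_search
{ "query": { "match_bool_prefix":   { "message": "this is a t" } } }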



intervals [85]

imposes rules based on order and proximity of matching terms
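A minimal sketch: require both terms, in order, with at most two other tokens between them:

GET /indexname/_search
{
  "query": {
    "intervals": {
      "message": {
        "all_of": {
          "ordered":  true,
          "max_gaps": 2,
          "intervals": [
            { "match": { "query": "this" } },
            { "match": { "query": "test" } }
          ]
        }
      }
    }
  }
}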


query_string [86]

lucene-style query string that lets you specify fields, and AND, OR, NOT, in the query itself [87]
maybe don't expose this to end users, because invalid syntax quickly makes it error out
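For example:

GET /indexname/_search
{
  "query": {
    "query_string": {
      "query":         "(title:fork OR plaintext:fork) AND NOT plaintext:spoon",
      "default_field": "plaintext"
    }
  }
}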

simple_query_string [88]

like query_string, but with a simpler syntax; invalid parts of the query are ignored rather than causing an error, which makes it more reasonable to expose to users
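For example (+ means AND, | means OR, - negates, quotes mean a phrase):

GET /indexname/_search
{
  "query": {
    "simple_query_string": {
      "query":  "\"metal fork\" + (spoon | knife) -plastic",
      "fields": [ "title", "plaintext" ]
    }
  }
}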

Span queries

Result set

Aggregation

further notes

Performance considerations

query considerations

More general settings

indexing considerations

Routing

Clusters

Security notes

ES errors and warnings

Some Kibana notes