"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/reference/docs.asciidoc" (29 Dec 2021, 1103 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Reading and Writing documents

Introduction

Each index in Elasticsearch is divided into shards and each shard can have multiple copies. These copies are known as a replication group and must be kept in sync when documents are added or removed. If we fail to do so, reading from one copy will return very different results than reading from another. The process of keeping the shard copies in sync and serving reads from them is what we call the data replication model.

Elasticsearch’s data replication model is based on the primary-backup model and is described very well in the PacificA paper of Microsoft Research. That model is based on having a single copy from the replication group that acts as the primary shard. The other copies are called replica shards. The primary serves as the main entry point for all indexing operations. It is in charge of validating them and making sure they are correct. Once an index operation has been accepted by the primary, the primary is also responsible for replicating the operation to the other copies.

The purpose of this section is to give a high level overview of the Elasticsearch replication model and discuss the implications it has for various interactions between write and read operations.

Basic write model

Every indexing operation in Elasticsearch is first resolved to a replication group using routing, typically based on the document ID. Once the replication group has been determined, the operation is forwarded internally to the current primary shard of the group. The primary shard is responsible for validating the operation and forwarding it to the other replicas. Since replicas can be offline, the primary is not required to replicate to all replicas. Instead, Elasticsearch maintains a list of shard copies that should receive the operation. This list is called the in-sync copies and is maintained by the master node. As the name implies, these are the set of "good" shard copies that are guaranteed to have processed all of the index and delete operations that have been acknowledged to the user. The primary is responsible for maintaining this invariant and thus has to replicate all operations to each copy in this set.

The primary shard follows this basic flow:

  1. Validate the incoming operation and reject it if structurally invalid (for example, an object field where a number is expected).

  2. Execute the operation locally, i.e. indexing or deleting the relevant document. This will also validate the content of fields and reject the operation if needed (for example, a keyword value that is too long for indexing in Lucene).

  3. Forward the operation to each replica in the current in-sync copies set. If there are multiple replicas, this is done in parallel.

  4. Once all replicas have successfully performed the operation and responded to the primary, the primary acknowledges the successful completion of the request to the client.

Failure handling

Many things can go wrong during indexing — disks can get corrupted, nodes can be disconnected from each other, or some configuration mistake could cause an operation to fail on a replica despite it being successful on the primary. These are infrequent but the primary has to respond to them.

In the case that the primary itself fails, the node hosting the primary will send a message to the master about it. The indexing operation will wait (up to 1 minute, by default) for the master to promote one of the replicas to be a new primary. The operation will then be forwarded to the new primary for processing. Note that the master also monitors the health of the nodes and may decide to proactively demote a primary. This typically happens when the node holding the primary is isolated from the cluster by a networking issue. See here for more details.

Once the operation has been successfully performed on the primary, the primary has to deal with potential failures when executing it on the replica shards. This may be caused by an actual failure on the replica or due to a network issue preventing the operation from reaching the replica (or preventing the replica from responding). All of these share the same end result: a replica which is part of the in-sync replica set misses an operation that is about to be acknowledged. In order to avoid violating the invariant, the primary sends a message to the master requesting that the problematic shard be removed from the in-sync replica set. Only once removal of the shard has been acknowledged by the master does the primary acknowledge the operation. Note that the master will also instruct another node to start building a new shard copy in order to restore the system to a healthy state.

While forwarding an operation to the replicas, the primary will use the replicas to validate that it is still the active primary. If the primary has been isolated due to a network partition (or a long GC) it may continue to process incoming indexing operations before realising that it has been demoted. Operations that come from a stale primary will be rejected by the replicas. When the primary receives a response from the replica rejecting its request because it is no longer the primary then it will reach out to the master and will learn that it has been replaced. The operation is then routed to the new primary.

What happens if there are no replicas?

This is a valid scenario that can happen due to index configuration or simply because all the replicas have failed. In that case the primary is processing operations without any external validation, which may seem problematic. On the other hand, the primary cannot fail other shards on its own; it can only request that the master do so on its behalf. This means that the master knows the primary is the only good copy. We are therefore guaranteed that the master will not promote any other (out-of-date) shard copy to be a new primary and that any operation indexed into the primary will not be lost. Of course, since at that point we are running with only a single copy of the data, physical hardware issues can cause data loss. See Wait For Active Shards for some mitigation options.

Basic read model

Reads in Elasticsearch can be very lightweight lookups by ID or a heavy search request with complex aggregations that take non-trivial CPU power. One of the beauties of the primary-backup model is that it keeps all shard copies identical (with the exception of in-flight operations). As such, a single in-sync copy is sufficient to serve read requests.

When a read request is received by a node, that node is responsible for forwarding it to the nodes that hold the relevant shards, collating the responses, and responding to the client. We call that node the coordinating node for that request. The basic flow is as follows:

  1. Resolve the read request to the relevant shards. Note that since most searches will be sent to one or more indices, they typically need to read from multiple shards, each representing a different subset of the data.

  2. Select an active copy of each relevant shard, from the shard replication group. This can be either the primary or a replica. By default, Elasticsearch will simply round robin between the shard copies.

  3. Send shard level read requests to the selected copies.

  4. Combine the results and respond. Note that in the case of get by ID look up, only one shard is relevant and this step can be skipped.

Shard failures

When a shard fails to respond to a read request, the coordinating node sends the request to another shard copy in the same replication group. Repeated failures can result in no available shard copies.

To ensure fast responses, APIs such as search will respond with partial results if one or more shards fail.

Responses containing partial results still provide a 200 OK HTTP status code. Shard failures are indicated by the timed_out and _shards fields of the response header.

A few simple implications

Each of these basic flows determines how Elasticsearch behaves as a system for both reads and writes. Furthermore, since read and write requests can be executed concurrently, these two basic flows interact with each other. This has a few inherent implications:

Efficient reads

Under normal operation each read operation is performed once for each relevant replication group. Only under failure conditions do multiple copies of the same shard execute the same search.

Read unacknowledged

Since the primary first indexes locally and then replicates the request, it is possible for a concurrent read to already see the change before it has been acknowledged.

Two copies by default

This model can be fault tolerant while maintaining only two copies of the data. This is in contrast to quorum-based systems where the minimum number of copies for fault tolerance is 3.

Failures

Under failures, the following is possible:

A single shard can slow down indexing

Because the primary waits for all replicas in the in-sync copies set during each operation, a single slow shard can slow down the entire replication group. This is the price we pay for the read efficiency mentioned above. Of course a single slow shard will also slow down unlucky searches that have been routed to it.

Dirty reads

An isolated primary can expose writes that will not be acknowledged. This is caused by the fact that an isolated primary will only realize that it is isolated once it sends requests to its replicas or when reaching out to the master. At that point the operation is already indexed into the primary and can be read by a concurrent read. Elasticsearch mitigates this risk by pinging the master every second (by default) and rejecting indexing operations if no master is known.

The Tip of the Iceberg

This document provides a high level overview of how Elasticsearch deals with data. Of course, there is much much more going on under the hood. Things like primary terms, cluster state publishing, and master election all play a role in keeping this system behaving correctly. This document also doesn’t cover known and important bugs (both closed and open). We recognize that GitHub is hard to keep up with. To help people stay on top of those, we maintain a dedicated resiliency page on our website. We strongly advise reading it.

Index API

Important
See [removal-of-types].

The index API adds or updates a typed JSON document in a specific index, making it searchable. The following example inserts the JSON document into the "twitter" index, under a type called _doc with an id of 1:

PUT twitter/_doc/1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

The result of the above index operation is:

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 1,
    "_seq_no" : 0,
    "_primary_term" : 1,
    "result" : "created"
}

The _shards header provides information about the replication process of the index operation:

total

Indicates how many shard copies (primary and replica shards) the index operation should be executed on.

successful

Indicates the number of shard copies the index operation succeeded on.

failed

An array that contains replication-related errors in the case an index operation failed on a replica shard.

The index operation is successful if successful is at least 1.

Note
Replica shards may not all be started when an indexing operation successfully returns (by default, only the primary is required, but this behavior can be changed). In that case, total will be equal to the total shards based on the number_of_replicas setting and successful will be equal to the number of shards started (primary plus replicas). If there were no failures, failed will be 0.

Automatic Index Creation

The index operation automatically creates an index if it does not already exist, and applies any index templates that are configured. The index operation also creates a dynamic type mapping for the specified type if one does not already exist. By default, new fields and objects will automatically be added to the mapping definition for the specified type if needed. Check out the mapping section for more information on mapping definitions, and the put mapping API for information about updating type mappings manually.

Automatic index creation is controlled by the action.auto_create_index setting. This setting defaults to true, meaning that indices are always automatically created. Automatic index creation can be permitted only for indices matching certain patterns by changing the value of this setting to a comma-separated list of these patterns. It can also be explicitly permitted and forbidden by prefixing patterns in the list with a + or -. Finally it can be completely disabled by changing this setting to false.

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "twitter,index10,-index1*,+ind*" (1)
    }
}

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "false" (2)
    }
}

PUT _cluster/settings
{
    "persistent": {
        "action.auto_create_index": "true" (3)
    }
}
  1. Permit only the auto-creation of indices called twitter, index10, no other index matching index1*, and any other index matching ind*. The patterns are matched in the order in which they are given.

  2. Completely disable the auto-creation of indices.

  3. Permit the auto-creation of indices with any name. This is the default.

Operation Type

The index operation also accepts an op_type that can be used to force a create operation, allowing for "put-if-absent" behavior. When create is used, the index operation will fail if a document by that id already exists in the index.

Here is an example of using the op_type parameter:

PUT twitter/_doc/1?op_type=create
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Another option to specify create is to use the following URI:

PUT twitter/_doc/1/_create
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Automatic ID Generation

The index operation can be executed without specifying the id. In such a case, an id will be generated automatically. In addition, the op_type will automatically be set to create. Here is an example (note the POST used instead of PUT):

POST twitter/_doc/
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

The result of the above index operation is:

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "W0tpsmIBdwcYyG50zbta",
    "_version" : 1,
    "_seq_no" : 0,
    "_primary_term" : 1,
    "result": "created"
}

Optimistic concurrency control

Index operations can be made conditional and only be performed if the last modification to the document was assigned the sequence number and primary term specified by the if_seq_no and if_primary_term parameters. If a mismatch is detected, the operation will result in a VersionConflictException and a status code of 409. See Optimistic concurrency control for more details.
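
For example, a conditional index request along these lines (the sequence number and primary term values are illustrative) only succeeds if the stored document still carries those values:

PUT twitter/_doc/1?if_seq_no=0&if_primary_term=1
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "only applied if seq_no 0 and primary term 1 still match"
}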

Routing

By default, shard placement — or routing — is controlled by using a hash of the document’s id value. For more explicit control, the value fed into the hash function used by the router can be directly specified on a per-operation basis using the routing parameter. For example:

POST twitter/_doc?routing=kimchy
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

In the example above, the "_doc" document is routed to a shard based on the routing parameter provided: "kimchy".

When setting up explicit mapping, the _routing field can be optionally used to direct the index operation to extract the routing value from the document itself. This does come at the (very minimal) cost of an additional document parsing pass. If the _routing mapping is defined and set to be required, the index operation will fail if no routing value is provided or extracted.
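
As a sketch, a mapping along the following lines (using a hypothetical my_index index) makes the routing value mandatory for every index operation on that index:

PUT my_index
{
    "mappings": {
        "_doc": {
            "_routing": {
                "required": true
            }
        }
    }
}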

Distributed

The index operation is directed to the primary shard based on its route (see the Routing section above) and performed on the actual node containing this shard. After the primary shard completes the operation, if needed, the update is distributed to applicable replicas.

Wait For Active Shards

To improve the resiliency of writes to the system, indexing operations can be configured to wait for a certain number of active shard copies before proceeding with the operation. If the requisite number of active shard copies are not available, then the write operation must wait and retry, until either the requisite shard copies have started or a timeout occurs. By default, write operations only wait for the primary shards to be active before proceeding (i.e. wait_for_active_shards=1). This default can be overridden in the index settings dynamically by setting index.write.wait_for_active_shards. To alter this behavior per operation, the wait_for_active_shards request parameter can be used.
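
As an illustration, the following requests (with placeholder values) set the index default to two active copies and override it to three for a single operation:

PUT twitter/_settings
{
    "index.write.wait_for_active_shards": 2
}

PUT twitter/_doc/1?wait_for_active_shards=3
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}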

Valid values are all or any positive integer up to the total number of configured copies per shard in the index (which is number_of_replicas+1). Specifying a negative value or a number greater than the number of shard copies will throw an error.

For example, suppose we have a cluster of three nodes, A, B, and C and we create an index index with the number of replicas set to 3 (resulting in 4 shard copies, one more copy than there are nodes). If we attempt an indexing operation, by default the operation will only ensure the primary copy of each shard is available before proceeding. This means that even if B and C went down, and A hosted the primary shard copies, the indexing operation would still proceed with only one copy of the data. If wait_for_active_shards is set on the request to 3 (and all 3 nodes are up), then the indexing operation will require 3 active shard copies before proceeding, a requirement which should be met because there are 3 active nodes in the cluster, each one holding a copy of the shard. However, if we set wait_for_active_shards to all (or to 4, which is the same), the indexing operation will not proceed as we do not have all 4 copies of each shard active in the index. The operation will time out unless a new node is brought up in the cluster to host the fourth copy of the shard.

It is important to note that this setting greatly reduces the chances of the write operation not writing to the requisite number of shard copies, but it does not completely eliminate the possibility, because this check occurs before the write operation commences. Once the write operation is underway, it is still possible for replication to fail on any number of shard copies but still succeed on the primary. The _shards section of the write operation’s response reveals the number of shard copies on which replication succeeded/failed.

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    }
}

Refresh

Control when the changes made by this request are visible to search. See refresh.
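
For instance, a request along these lines does not return until the change has been made visible to search by a refresh:

PUT twitter/_doc/1?refresh=wait_for
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}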

Noop Updates

When updating a document using the index API a new version of the document is always created even if the document hasn’t changed. If this isn’t acceptable use the _update API with detect_noop set to true. This option isn’t available on the index API because the index API doesn’t fetch the old source and isn’t able to compare it against the new source.
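
A minimal sketch of such an update, with detect_noop spelled out even though it is already the default for the _update API:

POST twitter/_doc/1/_update
{
    "doc" : {
        "message" : "trying out Elasticsearch"
    },
    "detect_noop" : true
}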

There isn’t a hard and fast rule about when noop updates aren’t acceptable. It’s a combination of lots of factors like how frequently your data source sends updates that are actually noops and how many queries per second Elasticsearch runs on the shard receiving the updates.

Timeout

The primary shard assigned to perform the index operation might not be available when the index operation is executed. Some reasons for this might be that the primary shard is currently recovering from a gateway or undergoing relocation. By default, the index operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits. Here is an example of setting it to 5 minutes:

PUT twitter/_doc/1?timeout=5m
{
    "user" : "kimchy",
    "post_date" : "2009-11-15T14:12:12",
    "message" : "trying out Elasticsearch"
}

Versioning

Each indexed document is given a version number. By default, internal versioning is used that starts at 1 and increments with each update, deletes included. Optionally, the version number can be set to an external value (for example, if maintained in a database). To enable this functionality, version_type should be set to external. The value provided must be a numeric, long value greater than or equal to 0, and less than around 9.2e+18.

When using the external version type, the system checks to see if the version number passed to the index request is greater than the version of the currently stored document. If true, the document will be indexed and the new version number used. If the value provided is less than or equal to the stored document’s version number, a version conflict will occur and the index operation will fail. For example:

PUT twitter/_doc/1?version=2&version_type=external
{
    "message" : "elasticsearch now has versioning support, double cool!"
}

NOTE: Versioning is completely real time, and is not affected by the near real time aspects of search operations. If no version is provided, then the operation is executed without any version checks.

The above will succeed since the supplied version of 2 is higher than the current document version of 1. If the document was already updated and its version was set to 2 or higher, the indexing command will fail and result in a conflict (409 http status code).

A nice side effect is that there is no need to maintain strict ordering of async indexing operations executed as a result of changes to a source database, as long as version numbers from the source database are used. Even the simple case of updating the Elasticsearch index using data from a database is simplified if external versioning is used, as only the latest version will be used if the index operations arrive out of order for whatever reason.

Version types

Next to the external version type explained above, Elasticsearch also supports other types for specific use cases. Here is an overview of the different version types and their semantics.

internal

Only index the document if the given version is identical to the version of the stored document. deprecated:[6.7.0, "Please use if_seq_no & if_primary_term instead. See Optimistic concurrency control for more details."]

external or external_gt

Only index the document if the given version is strictly higher than the version of the stored document or if there is no existing document. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.

external_gte

Only index the document if the given version is equal or higher than the version of the stored document. If there is no existing document the operation will succeed as well. The given version will be used as the new version and will be stored with the new document. The supplied version must be a non-negative long number.

NOTE: The external_gte version type is meant for special use cases and should be used with care. If used incorrectly, it can result in loss of data. There is another option, force, which is deprecated because it can cause primary and replica shards to diverge.
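
To illustrate the difference, assuming the stored version of the document is currently 5, the following request succeeds under external_gte (equal versions are accepted) but would be rejected under external or external_gt:

PUT twitter/_doc/1?version=5&version_type=external_gte
{
    "message" : "elasticsearch now has versioning support, double cool!"
}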

Get API

The get API allows you to get a typed JSON document from the index based on its id. The following example gets a JSON document from an index called twitter, under a type called _doc, with id 0:

GET twitter/_doc/0

The result of the above get operation is:

{
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "0",
    "_version" : 1,
    "_seq_no" : 10,
    "_primary_term" : 1,
    "found": true,
    "_source" : {
        "user" : "kimchy",
        "date" : "2009-11-15T14:12:12",
        "likes": 0,
        "message" : "trying out Elasticsearch"
    }
}

The above result includes the _index, _type, _id, and _version of the document we wish to retrieve, as well as the actual _source of the document if it could be found (as indicated by the found field in the response).

The API also allows you to check for the existence of a document using HEAD, for example:

HEAD twitter/_doc/0

Realtime

By default, the get API is realtime, and is not affected by the refresh rate of the index (when data will become visible for search). If a document has been updated but is not yet refreshed, the get API will issue a refresh call in-place to make the document visible. This will also make other documents changed since the last refresh visible. In order to disable realtime GET, one can set the realtime parameter to false.
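
For example, the following get skips the realtime behavior and only returns what was visible at the last refresh:

GET twitter/_doc/0?realtime=false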

Source filtering

By default, the get operation returns the contents of the _source field unless you have used the stored_fields parameter or if the _source field is disabled. You can turn off _source retrieval by using the _source parameter:

GET twitter/_doc/0?_source=false

If you only need one or two fields from the complete _source, you can use the _source_includes and _source_excludes parameters to include or filter out the parts you need. This can be especially helpful with large documents where partial retrieval can save on network overhead. Both parameters take a comma separated list of fields or wildcard expressions. Example:

GET twitter/_doc/0?_source_includes=*.id&_source_excludes=entities

If you only want to specify includes, you can use a shorter notation:

GET twitter/_doc/0?_source=*.id,retweeted

Stored Fields

The get operation allows specifying a set of stored fields that will be returned by passing the stored_fields parameter. If the requested fields are not stored, they will be ignored. Consider for instance the following mapping:

PUT twitter
{
   "mappings": {
      "_doc": {
         "properties": {
            "counter": {
               "type": "integer",
               "store": false
            },
            "tags": {
               "type": "keyword",
               "store": true
            }
         }
      }
   }
}

Now we can add a document:

PUT twitter/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

And then try to retrieve it:

GET twitter/_doc/1?stored_fields=tags,counter

The result of the above get operation is:

{
   "_index": "twitter",
   "_type": "_doc",
   "_id": "1",
   "_version": 1,
   "_seq_no" : 22,
   "_primary_term" : 1,
   "found": true,
   "fields": {
      "tags": [
         "red"
      ]
   }
}

Field values fetched from the document itself are always returned as an array. Since the counter field is not stored the get request simply ignores it when trying to get the stored_fields.

It is also possible to retrieve metadata fields like the _routing field:

PUT twitter/_doc/2?routing=user1
{
    "counter" : 1,
    "tags" : ["white"]
}
GET twitter/_doc/2?routing=user1&stored_fields=tags,counter

The result of the above get operation is:

{
   "_index": "twitter",
   "_type": "_doc",
   "_id": "2",
   "_version": 1,
   "_seq_no" : 13,
   "_primary_term" : 1,
   "_routing": "user1",
   "found": true,
   "fields": {
      "tags": [
         "white"
      ]
   }
}

Also, only leaf fields can be returned via the stored_fields option. Object fields can’t be returned and such requests will fail.

Getting the _source directly

Use the /{index}/{type}/{id}/_source endpoint to get just the _source field of the document, without any additional content around it. For example:

GET twitter/_doc/1/_source

You can also use the same source filtering parameters to control which parts of the _source will be returned:

GET twitter/_doc/1/_source?_source_includes=*.id&_source_excludes=entities

Note, there is also a HEAD variant for the _source endpoint to efficiently test for document _source existence. An existing document will not have a _source if it is disabled in the mapping.

HEAD twitter/_doc/1/_source

Routing

When indexing with controlled routing (as described above), the routing value should also be provided in order to get a document. For example:

GET twitter/_doc/2?routing=user1

The above will get a tweet with id 2, but will be routed based on the user. Note that issuing a get without the correct routing will cause the document not to be fetched.

Preference

Controls a preference of which shard replicas to execute the get request on. By default, the operation is randomized between the shard replicas.

The preference can be set to:

_primary

The operation will go and be executed only on the primary shards.

_local

The operation will prefer to be executed on a local allocated shard if possible.

Custom (string) value

A custom value will be used to guarantee that the same shards will be used for the same custom value. This can help with "jumping values" when hitting different shards in different refresh states. A sample value can be something like the web session id, or the user name.
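
For example, the first request below prefers a locally allocated copy, while the second pins requests to the same copies using a placeholder session id:

GET twitter/_doc/0?preference=_local

GET twitter/_doc/0?preference=my_session_id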

Refresh

The refresh parameter can be set to true in order to refresh the relevant shard before the get operation and make it searchable. Setting it to true should be done after careful thought and verification that this does not cause a heavy load on the system (and slows down indexing).
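
A sketch of such a request:

GET twitter/_doc/0?refresh=true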

Distributed

The get operation gets hashed into a specific shard id. It then gets redirected to one of the replicas within that shard id and returns the result. The replicas are the primary shard and its replicas within that shard id group. This means that the more replicas we have, the better GET scaling we will have.

Versioning support

You can use the version parameter to retrieve the document only if its current version is equal to the specified one. This behavior is the same for all version types with the exception of version type FORCE which always retrieves the document. Note that FORCE version type is deprecated.

Internally, Elasticsearch has marked the old document as deleted and added an entirely new document. The old version of the document doesn’t disappear immediately, although you won’t be able to access it. Elasticsearch cleans up deleted documents in the background as you continue to index more data.

Delete API

The delete API allows you to delete a typed JSON document from a specific index based on its id. The following example deletes the JSON document from an index called twitter, under a type called _doc, with id 1:

DELETE /twitter/_doc/1

The result of the above delete operation is:

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 2
    },
    "_index" : "twitter",
    "_type" : "_doc",
    "_id" : "1",
    "_version" : 2,
    "_primary_term": 1,
    "_seq_no": 5,
    "result": "deleted"
}

Optimistic concurrency control

Delete operations can be made conditional and only be performed if the last modification to the document was assigned the sequence number and primary term specified by the if_seq_no and if_primary_term parameters. If a mismatch is detected, the operation will result in a VersionConflictException and a status code of 409. See Optimistic concurrency control for more details.

Versioning

Each document indexed is versioned. When deleting a document, the version can be specified to make sure the relevant document we are trying to delete is actually being deleted and it has not changed in the meantime. Every write operation executed on a document, deletes included, causes its version to be incremented. The version number of a deleted document remains available for a short time after deletion to allow for control of concurrent operations. The length of time for which a deleted document’s version remains available is determined by the index.gc_deletes index setting and defaults to 60 seconds.
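
For example, a delete that only proceeds if the stored version is still 2 (an illustrative value), and a hypothetical adjustment of the retention period to two minutes, could look like this:

DELETE /twitter/_doc/1?version=2

PUT /twitter/_settings
{
    "index.gc_deletes": "120s"
}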

Routing

When indexing with controlled routing, the routing value should also be provided in order to delete a document. For example:

DELETE /twitter/_doc/1?routing=kimchy

The above will delete a tweet with id 1, but will be routed based on the user. Note that issuing a delete without the correct routing will cause the document to not be deleted.

When the _routing mapping is set as required and no routing value is specified, the delete API will throw a RoutingMissingException and reject the request.

Automatic index creation

If an external versioning variant is used, the delete operation automatically creates an index if it has not been created before (check out the create index API for manually creating an index), and also automatically creates a dynamic type mapping for the specific type if it has not been created before (check out the put mapping API for manually creating type mapping).

Distributed

The delete operation gets hashed into a specific shard id. It then gets redirected into the primary shard within that id group, and replicated (if needed) to shard replicas within that id group.

Wait For Active Shards

When making delete requests, you can set the wait_for_active_shards parameter to require a minimum number of shard copies to be active before starting to process the delete request. See here for further details and a usage example.
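
A minimal example, requiring the primary plus one replica to be active:

DELETE /twitter/_doc/1?wait_for_active_shards=2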

Refresh

Control when the changes made by this request are visible to search. See refresh.
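
For instance, to have the delete only return once the change is visible to search:

DELETE /twitter/_doc/1?refresh=wait_for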

Timeout

The primary shard assigned to perform the delete operation might not be available when the delete operation is executed. Some reasons for this might be that the primary shard is currently recovering from a store or undergoing relocation. By default, the delete operation will wait on the primary shard to become available for up to 1 minute before failing and responding with an error. The timeout parameter can be used to explicitly specify how long it waits. Here is an example of setting it to 5 minutes:

DELETE /twitter/_doc/1?timeout=5m

Delete By Query API

The simplest usage of _delete_by_query just performs a deletion on every document that matches a query. Here is the API:

POST twitter/_delete_by_query
{
  "query": { (1)
    "match": {
      "message": "some message"
    }
  }
}
  1. The query must be passed as a value to the query key, in the same way as the Search API. You can also use the q parameter in the same way as the search API.

That will return something like this:

{
  "took" : 147,
  "timed_out": false,
  "deleted": 119,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 119,
  "failures" : [ ]
}

_delete_by_query gets a snapshot of the index when it starts and deletes what it finds using internal versioning. That means that you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the delete request is processed. When the versions match the document is deleted.

Note
Since internal versioning does not support the value 0 as a valid version number, documents with version equal to zero cannot be deleted using _delete_by_query and will fail the request.

During the _delete_by_query execution, multiple search requests are sequentially executed in order to find all the matching documents to delete. Every time a batch of documents is found, a corresponding bulk request is executed to delete all these documents. In case a search or bulk request got rejected, _delete_by_query relies on a default policy to retry rejected requests (up to 10 times, with exponential back off). Reaching the maximum retries limit causes the _delete_by_query to abort and all failures are returned in the failures of the response. The deletions that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.

If you’d like to count version conflicts rather than cause them to abort, then set conflicts=proceed on the url or "conflicts": "proceed" in the request body.

Back to the API format, this will delete tweets from the twitter index:

POST twitter/_doc/_delete_by_query?conflicts=proceed
{
  "query": {
    "match_all": {}
  }
}

It’s also possible to delete documents of multiple indices and multiple types at once, just like the search API:

POST twitter,blog/_docs,post/_delete_by_query
{
  "query": {
    "match_all": {}
  }
}

If you provide routing then the routing is copied to the scroll query, limiting the process to the shards that match that routing value:

POST twitter/_delete_by_query?routing=1
{
  "query": {
    "range" : {
        "age" : {
           "gte" : 10
        }
    }
  }
}

By default _delete_by_query uses scroll batches of 1000. You can change the batch size with the scroll_size URL parameter:

POST twitter/_delete_by_query?scroll_size=5000
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}

URL Parameters

In addition to the standard parameters like pretty, the delete by query API also supports refresh, wait_for_completion, wait_for_active_shards, timeout, and scroll.

Sending the refresh parameter will refresh all shards involved in the delete by query once the request completes. This is different from the delete API’s refresh parameter, which causes just the shard that received the delete request to be refreshed. Also unlike the delete API, it does not support wait_for.

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.
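
As a sketch, launching a delete by query as a task and then checking on it might look like this (the task id is a placeholder):

POST twitter/_delete_by_query?conflicts=proceed&wait_for_completion=false
{
  "query": {
    "match_all": {}
  }
}

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619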

wait_for_active_shards controls how many copies of a shard must be active before proceeding with the request. See here for details. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly how they work in the Bulk API. As _delete_by_query uses scroll search, you can also specify the scroll parameter to control how long it keeps the "search context" alive, e.g. ?scroll=10m. By default it’s 5 minutes.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc.) and throttles the rate at which delete by query issues batches of delete operations by padding each batch with a wait time. The throttling can be disabled by setting requests_per_second to -1.

The throttling is done by waiting between batches so that scroll that _delete_by_query uses internally can be given a timeout that takes into account the padding. The padding time is the difference between the batch size divided by the requests_per_second and the time spent writing. By default the batch size is 1000, so if the requests_per_second is set to 500:

target_time = 1000 / 500 per second = 2 seconds
wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds

Since the batch is issued as a single _bulk request, large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is -1.
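
For example, a throttled delete by query matching the numbers above could be issued like this (the values are illustrative):

POST twitter/_delete_by_query?requests_per_second=500&scroll_size=1000
{
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}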

Response body

The JSON response looks like this:

{
  "took" : 147,
  "timed_out": false,
  "total": 119,
  "deleted": 119,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures" : [ ]
}

took

The number of milliseconds from start to end of the whole operation.

timed_out

This flag is set to true if any of the requests executed during the delete by query execution has timed out.

total

The number of documents that were successfully processed.

deleted

The number of documents that were successfully deleted.

batches

The number of scroll responses pulled back by the delete by query.

version_conflicts

The number of version conflicts that the delete by query hit.

noops

This field is always equal to zero for delete by query. It only exists so that delete by query, update by query, and reindex APIs return responses with the same structure.

retries

The number of retries attempted by delete by query. bulk is the number of bulk actions retried, and search is the number of search actions retried.

throttled_millis

Number of milliseconds the request slept to conform to requests_per_second.

requests_per_second

The number of requests per second effectively executed during the delete by query.

throttled_until_millis

This field should always be equal to zero in a _delete_by_query response. It only has meaning when using the Task API, where it indicates the next time (in milliseconds since epoch) a throttled request will be executed again in order to conform to requests_per_second.

failures

Array of failures if there were any unrecoverable errors during the process. If this is non-empty then the request aborted because of those failures. Delete by query is implemented using batches, and any failure causes the entire process to abort but all failures in the current batch are collected into the array. You can use the conflicts option to prevent delete by query from aborting on version conflicts.

Works with the Task API

You can fetch the status of any running delete by query requests with the Task API:

GET _tasks?detailed=true&actions=*/delete/byquery

The response looks like:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/delete/byquery",
          "status" : {    (1)
            "total" : 6154,
            "updated" : 0,
            "created" : 0,
            "deleted" : 3500,
            "batches" : 36,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": 0,
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}
  1. This object contains the actual status. It is just like the response JSON with the important addition of the total field. total is the total number of operations that the delete by query expects to perform. You can estimate the progress by adding the updated, created, and deleted fields. The request will finish when their sum is equal to the total field.

With the task id you can look up the task directly:

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it then it’ll come back with results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.

Works with the Cancel Task API

Any delete by query can be canceled using the task cancel API:

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel

The task ID can be found using the tasks API.

Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the delete by query task until this task checks that it has been cancelled and terminates itself.

Rethrottling

The value of requests_per_second can be changed on a running delete by query using the _rethrottle API:

POST _delete_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

The task ID can be found using the tasks API.

Just like when setting it on the delete by query API, requests_per_second can be either -1 to disable throttling or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query will take effect after completing the current batch. This prevents scroll timeouts.

Slicing

Delete by query supports sliced scroll to parallelize the deleting process. This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.

Manual slicing

Slice a delete by query manually by providing a slice id and total number of slices to each request:

POST twitter/_delete_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}
POST twitter/_delete_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}

Which you can verify works with:

GET _refresh
POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}

Which results in a sensible total like this one:

{
  "hits": {
    "total": 0
  }
}

Automatic slicing

You can also let delete by query automatically parallelize using sliced scroll to slice on _uid. Use slices to specify the number of slices to use:

POST twitter/_delete_by_query?refresh&slices=5
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}

Which you also can verify works with:

POST twitter/_search?size=0&filter_path=hits.total
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}

Which results in a sensible total like this one:

{
  "hits": {
    "total": 0
  }
}

Setting slices to auto will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If there are multiple source indices, it will choose the number of slices based on the index with the smallest number of shards.
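
For example, letting Elasticsearch pick the slice count automatically:

POST twitter/_delete_by_query?refresh&slices=auto
{
  "query": {
    "range": {
      "likes": {
        "lt": 10
      }
    }
  }
}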

Adding slices to _delete_by_query just automates the manual process used in the section above, creating sub-requests which means it has some quirks:

  • You can see these requests in the Tasks APIs. These sub-requests are "child" tasks of the task for the request with slices.

  • Fetching the status of the task for the request with slices only contains the status of completed slices.

  • These sub-requests are individually addressable for things like cancellation and rethrottling.

  • Rethrottling the request with slices will rethrottle the unfinished sub-request proportionally.

  • Canceling the request with slices will cancel each sub-request.

  • Due to the nature of slices each sub-request won’t get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.

  • Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using size with slices might not result in exactly size documents being deleted.

  • Each sub-request gets a slightly different snapshot of the source index though these are all taken at approximately the same time.

Picking the number of slices

If slicing automatically, setting slices to auto will choose a reasonable number for most indices. If you’re slicing manually or otherwise tuning automatic slicing, use these guidelines.

Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (for example, 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.

Delete performance scales linearly across available resources with the number of slices.

Whether query or delete performance dominates the runtime depends on the documents being deleted and cluster resources.

Update API

The update API allows you to update a document based on a script provided. The operation gets the document (collocated with the shard) from the index, runs the script (with optional script language and parameters), and indexes back the result (it can also delete the document or ignore the operation). It uses versioning to make sure no updates have happened during the "get" and "reindex".

Note, this operation still means a full reindex of the document; it just removes some network roundtrips and reduces the chance of version conflicts between the get and the index. The _source field needs to be enabled for this feature to work.

For example, let’s index a simple doc:

PUT test/_doc/1
{
    "counter" : 1,
    "tags" : ["red"]
}

Scripted updates

Now, we can execute a script that would increment the counter:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}

We can add a tag to the list of tags (if the tag exists, it still gets added, since this is a list):

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.tags.add(params.tag)",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

We can remove a tag from the list of tags. Note that the Painless function to remove a tag takes as its parameter the array index of the element you wish to remove, so you need a bit more logic to locate it while avoiding a runtime error. Note that if the tag was present more than once in the list, this will remove only one occurrence of it:

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx._source.tags.remove(ctx._source.tags.indexOf(params.tag)) }",
        "lang": "painless",
        "params" : {
            "tag" : "blue"
        }
    }
}

In addition to _source, the following variables are available through the ctx map: _index, _type, _id, _version, _routing, and _now (the current timestamp).

We can also add a new field to the document:

POST test/_doc/1/_update
{
    "script" : "ctx._source.new_field = 'value_of_new_field'"
}

Or remove a field from the document:

POST test/_doc/1/_update
{
    "script" : "ctx._source.remove('new_field')"
}

And, we can even change the operation that is executed. This example deletes the doc if the tags field contains green, otherwise it does nothing (noop):

POST test/_doc/1/_update
{
    "script" : {
        "source": "if (ctx._source.tags.contains(params.tag)) { ctx.op = 'delete' } else { ctx.op = 'none' }",
        "lang": "painless",
        "params" : {
            "tag" : "green"
        }
    }
}

Updates with a partial document

The update API also supports passing a partial document, which will be merged into the existing document (simple recursive merge, inner merging of objects, replacing core "keys/values" and arrays). To fully replace the existing document, the index API should be used instead. The following partial update adds a new field to the existing document:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}

If both doc and script are specified, then doc is ignored. It is best to put the fields of your partial document in the script itself.

Detecting noop updates

If doc is specified, its value is merged with the existing _source. By default, updates that don’t change anything detect that and return "result": "noop", like this:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    }
}

If name was new_name before the request was sent then the entire update request is ignored. The result element in the response returns noop if the request was ignored.

{
   "_shards": {
        "total": 0,
        "successful": 0,
        "failed": 0
   },
   "_index": "test",
   "_type": "_doc",
   "_id": "1",
   "_version": 7,
   "result": "noop"
}

You can disable this behavior by setting "detect_noop": false like this:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "detect_noop": false
}

Upserts

If the document does not already exist, the contents of the upsert element will be inserted as a new document. If the document does exist, then the script will be executed instead:

POST test/_doc/1/_update
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    },
    "upsert" : {
        "counter" : 1
    }
}

scripted_upsert

If you would like your script to run regardless of whether the document exists or not — i.e. the script handles initializing the document instead of the upsert element — then set scripted_upsert to true:

POST sessions/session/dh3sgudg8gsrgl/_update
{
    "scripted_upsert":true,
    "script" : {
        "id": "my_web_session_summariser",
        "params" : {
            "pageViewEvent" : {
                "url":"foo.com/bar",
                "response":404,
                "time":"2014-01-01 12:32"
            }
        }
    },
    "upsert" : {}
}

doc_as_upsert

Instead of sending a partial doc plus an upsert doc, setting doc_as_upsert to true will use the contents of doc as the upsert value:

POST test/_doc/1/_update
{
    "doc" : {
        "name" : "new_name"
    },
    "doc_as_upsert" : true
}

Parameters

The update operation supports the following query-string parameters:

retry_on_conflict

In between the get and indexing phases of the update, it is possible that another process might have already updated the same document. By default, the update will fail with a version conflict exception. The retry_on_conflict parameter controls how many times to retry the update before finally throwing an exception.
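
For example, to retry the counter update up to three times on conflict (the values are illustrative):

POST test/_doc/1/_update?retry_on_conflict=3
{
    "script" : {
        "source": "ctx._source.counter += params.count",
        "lang": "painless",
        "params" : {
            "count" : 4
        }
    }
}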

routing

Routing is used to route the update request to the right shard and sets the routing for the upsert request if the document being updated doesn’t exist. Can’t be used to update the routing of an existing document.

timeout

Timeout waiting for a shard to become available.

wait_for_active_shards

The number of shard copies required to be active before proceeding with the update operation. See here for details.

refresh

Control when the changes made by this request are visible to search. See refresh.

_source

Allows you to control if and how the updated source should be returned in the response. By default the updated source is not returned. See Source filtering for details.
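
For instance, assuming _source is passed as a query-string parameter just like on the get API, the updated document can be returned along with the response:

POST test/_doc/1/_update?_source=true
{
    "doc" : {
        "name" : "new_name"
    }
}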

version

The update API uses the Elasticsearch versioning support internally to make sure the document doesn’t change during the update. You can use the version parameter to specify that the document should only be updated if its version matches the one specified. deprecated:[6.7.0, "Please use if_seq_no & if_primary_term instead. See Optimistic concurrency control for more details."]

Note
The update API does not support versioning other than internal.

External (version types external and external_gte) or forced (version type force) versioning is not supported by the update API as it would result in Elasticsearch version numbers being out of sync with the external system. Use the index API instead.

if_seq_no and if_primary_term

Update operations can be made conditional and only be performed if the last modification to the document was assigned the sequence number and primary term specified by the if_seq_no and if_primary_term parameters. If a mismatch is detected, the operation will result in a VersionConflictException and a status code of 409. See Optimistic concurrency control for more details.

Update By Query API

The simplest usage of _update_by_query just performs an update on every document in the index without changing the source. This is useful to pick up a new property or some other online mapping change. Here is the API:

POST twitter/_update_by_query?conflicts=proceed

That will return something like this:

{
  "took" : 147,
  "timed_out": false,
  "updated": 120,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 120,
  "failures" : [ ]
}

_update_by_query gets a snapshot of the index when it starts and indexes what it finds using internal versioning. That means you’ll get a version conflict if the document changes between the time when the snapshot was taken and when the index request is processed. When the versions match, the document is updated and the version number is incremented.

Note
Since internal versioning does not support the value 0 as a valid version number, documents with version equal to zero cannot be updated using _update_by_query and will fail the request.

All update and query failures cause the _update_by_query to abort and are returned in the failures element of the response. The updates that have been performed still stick. In other words, the process is not rolled back, only aborted. While the first failure causes the abort, all failures that are returned by the failing bulk request are returned in the failures element; therefore it’s possible for there to be quite a few failed entities.

If you want to simply count version conflicts, and not cause the _update_by_query to abort, you can set conflicts=proceed on the url or "conflicts": "proceed" in the request body. The first example does this because it is just trying to pick up an online mapping change, and a version conflict simply means that the conflicting document was updated between the start of the _update_by_query and the time when it attempted to update the document. This is fine because that update will have picked up the online mapping update.

Back to the API format, this will update tweets from the twitter index:

POST twitter/_doc/_update_by_query?conflicts=proceed

You can also limit _update_by_query using the Query DSL. This will update all documents from the twitter index for the user kimchy:

POST twitter/_update_by_query?conflicts=proceed
{
  "query": { (1)
    "term": {
      "user": "kimchy"
    }
  }
}
  1. The query must be passed as a value to the query key, in the same way as the Search API. You can also use the q parameter in the same way as the search API.

So far we’ve only been updating documents without changing their source. That is genuinely useful for things like picking up new properties but it’s only half the fun. _update_by_query supports scripts to update the document. This will increment the likes field on all of kimchy’s tweets:

POST twitter/_update_by_query
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}

Just as in the Update API, you can set ctx.op to change the operation that is executed:

noop

Set ctx.op = "noop" if your script decides that it doesn’t have to make any changes. That will cause _update_by_query to omit that document from its updates. This no operation will be reported in the noop counter in the response body.

delete

Set ctx.op = "delete" if your script decides that the document must be deleted. The deletion will be reported in the deleted counter in the response body.

Setting ctx.op to anything else is an error. Setting any other field in ctx is an error.

Note that we stopped specifying conflicts=proceed. In this case we want a version conflict to abort the process so we can handle the failure.
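As an illustrative sketch (the likes field comes from the earlier example and the zero check is arbitrary), a script that deletes documents with no likes and skips the rest might look like this:

POST twitter/_update_by_query
{
  "script": {
    "source": "if (ctx._source.likes == 0) { ctx.op = 'delete' } else { ctx.op = 'noop' }",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}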

This API doesn’t allow you to move the documents it touches, just modify their source. This is intentional! We’ve made no provisions for removing the document from its original location.

It’s also possible to do this whole thing on multiple indexes and multiple types at once, just like the search API:

POST twitter,blog/_doc,post/_update_by_query

If you provide routing then the routing is copied to the scroll query, limiting the process to the shards that match that routing value:

POST twitter/_update_by_query?routing=1

By default _update_by_query uses scroll batches of 1000. You can change the batch size with the scroll_size URL parameter:

POST twitter/_update_by_query?scroll_size=100

_update_by_query can also use the ingest node feature by specifying a pipeline like this:

PUT _ingest/pipeline/set-foo
{
  "description" : "sets foo",
  "processors" : [ {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
  } ]
}
POST twitter/_update_by_query?pipeline=set-foo

URL Parameters

In addition to the standard parameters like pretty, the Update By Query API also supports refresh, wait_for_completion, wait_for_active_shards, timeout, and scroll.

Sending the refresh parameter will refresh all shards in the index being updated when the request completes. This is different than the Update API’s refresh parameter, which causes just the shard that received the new data to be refreshed. Also unlike the Update API, it does not support wait_for.

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.
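For example (reusing the simple request from the start of this section purely for illustration), the update can be launched as a background task like this:

POST twitter/_update_by_query?conflicts=proceed&wait_for_completion=false

The response then contains a task identifier that can be used with the Tasks APIs described below.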

wait_for_active_shards controls how many copies of a shard must be active before proceeding with the request. See here for details. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly how they work in the Bulk API. Because _update_by_query uses scroll search, you can also specify the scroll parameter to control how long it keeps the "search context" alive, e.g. ?scroll=10m. By default it’s 5 minutes.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc.) and throttles the rate at which _update_by_query issues batches of index operations by padding each batch with a wait time. The throttling can be disabled by setting requests_per_second to -1.

The throttling is done by waiting between batches so that the scroll that _update_by_query uses internally can be given a timeout that takes the padding into account. The padding time is the difference between the batch size divided by the requests_per_second and the time spent writing. By default the batch size is 1000, so if requests_per_second is set to 500:

target_time = 1000 / 500 per second = 2 seconds
wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds

Since the batch is issued as a single _bulk request, large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default is -1.
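For example, the earlier script-based update could be throttled to 500 requests per second like this (the value is chosen only for illustration):

POST twitter/_update_by_query?requests_per_second=500
{
  "script": {
    "source": "ctx._source.likes++",
    "lang": "painless"
  },
  "query": {
    "term": {
      "user": "kimchy"
    }
  }
}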

Response body

The JSON response looks like this:

{
  "took" : 147,
  "timed_out": false,
  "total": 5,
  "updated": 5,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "failures" : [ ]
}
took

The number of milliseconds from start to end of the whole operation.

timed_out

This flag is set to true if any of the requests executed during the update by query execution has timed out.

total

The number of documents that were successfully processed.

updated

The number of documents that were successfully updated.

deleted

The number of documents that were successfully deleted.

batches

The number of scroll responses pulled back by the update by query.

version_conflicts

The number of version conflicts that the update by query hit.

noops

The number of documents that were ignored because the script used for the update by query returned a noop value for ctx.op.

retries

The number of retries attempted by update by query. bulk is the number of bulk actions retried, and search is the number of search actions retried.

throttled_millis

Number of milliseconds the request slept to conform to requests_per_second.

requests_per_second

The number of requests per second effectively executed during the update by query.

throttled_until_millis

This field should always be equal to zero in an _update_by_query response. It only has meaning when using the Task API, where it indicates the next time (in milliseconds since epoch) a throttled request will be executed again in order to conform to requests_per_second.

failures

Array of failures if there were any unrecoverable errors during the process. If this is non-empty then the request aborted because of those failures. Update by query is implemented using batches. Any failure causes the entire process to abort, but all failures in the current batch are collected into the array. You can use the conflicts option to prevent update by query from aborting on version conflicts.

Works with the Task API

You can fetch the status of all running update by query requests with the Task API:

GET _tasks?detailed=true&actions=*byquery

The response looks like:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/update/byquery",
          "status" : {    (1)
            "total" : 6154,
            "updated" : 3500,
            "created" : 0,
            "deleted" : 0,
            "batches" : 4,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": {
              "bulk": 0,
              "search": 0
            },
            "throttled_millis": 0
          },
          "description" : ""
        }
      }
    }
  }
}
  1. This object contains the actual status. It is just like the response JSON with the important addition of the total field. total is the total number of operations that the update by query expects to perform. You can estimate the progress by adding the updated, created, and deleted fields. The request will finish when their sum is equal to the total field.

With the task id you can look up the task directly. The following example retrieves information about task r1A2WoRbTwKZ516z6NEs5A:36619:

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set on it, then it’ll come back with a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.

Works with the Cancel Task API

Any update by query can be cancelled using the Task Cancel API:

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel

The task ID can be found using the tasks API.

Cancellation should happen quickly but might take a few seconds. The task status API above will continue to list the update by query task until this task checks that it has been cancelled and terminates itself.

Rethrottling

The value of requests_per_second can be changed on a running update by query using the _rethrottle API:

POST _update_by_query/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

The task ID can be found using the tasks API.

Just like when setting it on the _update_by_query API, requests_per_second can be either -1 to disable throttling or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query will take effect after completing the current batch. This prevents scroll timeouts.

Slicing

Update by query supports sliced scroll to parallelize the updating process. This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.

Manual slicing

Slice an update by query manually by providing a slice id and total number of slices to each request:

POST twitter/_update_by_query
{
  "slice": {
    "id": 0,
    "max": 2
  },
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}
POST twitter/_update_by_query
{
  "slice": {
    "id": 1,
    "max": 2
  },
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}

Which you can verify works with:

GET _refresh
POST twitter/_search?size=0&q=extra:test&filter_path=hits.total

Which results in a sensible total like this one:

{
  "hits": {
    "total": 120
  }
}

Automatic slicing

You can also let update by query automatically parallelize using sliced scroll to slice on _uid. Use slices to specify the number of slices to use:

POST twitter/_update_by_query?refresh&slices=5
{
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}

Which you can also verify works with:

POST twitter/_search?size=0&q=extra:test&filter_path=hits.total

Which results in a sensible total like this one:

{
  "hits": {
    "total": 120
  }
}

Setting slices to auto will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If there are multiple source indices, it will choose the number of slices based on the index with the smallest number of shards.
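For example, the automatic-slicing request above can let Elasticsearch pick the slice count instead of specifying 5 (the script is the same illustrative one as before):

POST twitter/_update_by_query?refresh&slices=auto
{
  "script": {
    "source": "ctx._source['extra'] = 'test'"
  }
}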

Adding slices to _update_by_query just automates the manual process used in the section above, creating sub-requests which means it has some quirks:

  • You can see these requests in the Tasks APIs. These sub-requests are "child" tasks of the task for the request with slices.

  • Fetching the status of the task for the request with slices only contains the status of completed slices.

  • These sub-requests are individually addressable for things like cancellation and rethrottling.

  • Rethrottling the request with slices will rethrottle the unfinished sub-request proportionally.

  • Canceling the request with slices will cancel each sub-request.

  • Due to the nature of slices each sub-request won’t get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.

  • Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using size with slices might not result in exactly size documents being updated.

  • Each sub-request gets a slightly different snapshot of the source index though these are all taken at approximately the same time.

Picking the number of slices

If slicing automatically, setting slices to auto will choose a reasonable number for most indices. If you’re slicing manually or otherwise tuning automatic slicing, use these guidelines.

Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (for example, 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.

Update performance scales linearly across available resources with the number of slices.

Whether query or update performance dominates the runtime depends on the documents being updated and cluster resources.

Pick up a new property

Say you created an index without dynamic mapping, filled it with data, and then added a mapping value to pick up more fields from the data:

PUT test
{
  "mappings": {
    "_doc": {
      "dynamic": false,   (1)
      "properties": {
        "text": {"type": "text"}
      }
    }
  }
}

POST test/_doc?refresh
{
  "text": "words words",
  "flag": "bar"
}
POST test/_doc?refresh
{
  "text": "words words",
  "flag": "foo"
}
PUT test/_mapping/_doc   (2)
{
  "properties": {
    "text": {"type": "text"},
    "flag": {"type": "text", "analyzer": "keyword"}
  }
}
  1. This means that new fields won’t be indexed, just stored in _source.

  2. This updates the mapping to add the new flag field. To pick up the new field you have to reindex all documents with it.

Searching for the data won’t find anything:

POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

which returns:

{
  "hits" : {
    "total" : 0
  }
}

But you can issue an _update_by_query request to pick up the new mapping:

POST test/_update_by_query?refresh&conflicts=proceed
POST test/_search?filter_path=hits.total
{
  "query": {
    "match": {
      "flag": "foo"
    }
  }
}

which returns:

{
  "hits" : {
    "total" : 1
  }
}

You can do the exact same thing when adding a field to a multifield.

Multi Get API

The Multi get API returns multiple documents based on an index, type (optional), and id (and possibly routing). The response includes a docs array with all the fetched documents in order corresponding to the original multi-get request (if there was a failure for a specific get, an object containing this error is included in its place in the response instead). The structure of a successful get is similar to that of a document provided by the get API.

Here is an example:

GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}

The mget endpoint can also be used against an index (in which case it is not required in the body):

GET /test/_mget
{
    "docs" : [
        {
            "_type" : "_doc",
            "_id" : "1"
        },
        {
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}

And type:

GET /test/_doc/_mget
{
    "docs" : [
        {
            "_id" : "1"
        },
        {
            "_id" : "2"
        }
    ]
}

In which case, the ids element can directly be used to simplify the request:

GET /test/_doc/_mget
{
    "ids" : ["1", "2"]
}

Source filtering

By default, the _source field will be returned for every document (if stored). Similar to the get API, you can retrieve only parts of the _source (or not at all) by using the _source parameter. You can also use the url parameters _source, _source_includes, and _source_excludes to specify defaults, which will be used when there are no per-document instructions.

For example:

GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1",
            "_source" : false
        },
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2",
            "_source" : ["field3", "field4"]
        },
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "3",
            "_source" : {
                "include": ["user"],
                "exclude": ["user.location"]
            }
        }
    ]
}
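The same defaults can also be set on the URL; for example (the field names here are only illustrative):

GET /test/_doc/_mget?_source_includes=field3,field4
{
    "ids" : ["1", "2"]
}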

Fields

Specific stored fields can be retrieved per document, similar to the stored_fields parameter of the Get API. For example:

GET /_mget
{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1",
            "stored_fields" : ["field1", "field2"]
        },
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2",
            "stored_fields" : ["field3", "field4"]
        }
    ]
}

Alternatively, you can specify the stored_fields parameter in the query string as a default to be applied to all documents.

GET /test/_doc/_mget?stored_fields=field1,field2
{
    "docs" : [
        {
            "_id" : "1" (1)
        },
        {
            "_id" : "2",
            "stored_fields" : ["field3", "field4"] (2)
        }
    ]
}
  1. Returns field1 and field2

  2. Returns field3 and field4

Routing

You can also specify a routing value as a parameter:

GET /_mget?routing=key1
{
    "docs" : [
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "1",
            "routing" : "key2"
        },
        {
            "_index" : "test",
            "_type" : "_doc",
            "_id" : "2"
        }
    ]
}

In this example, document test/_doc/2 will be fetched from the shard corresponding to routing key key1 but document test/_doc/1 will be fetched from the shard corresponding to routing key key2.

Security

Partial responses

To ensure fast responses, the multi get API will respond with partial results if one or more shards fail. See Shard failures for more information.

Bulk API

The bulk API makes it possible to perform many index/delete operations in a single API call. This can greatly increase the indexing speed.

Client support for bulk requests

Some of the officially supported clients provide helpers to assist with bulk requests and reindexing of documents from one index to another.

The REST API endpoint is /_bulk, and it expects the following newline delimited JSON (NDJSON) structure:

action_and_meta_data\n
optional_source\n
action_and_meta_data\n
optional_source\n
....
action_and_meta_data\n
optional_source\n

NOTE: The final line of data must end with a newline character \n. Each newline character may be preceded by a carriage return \r. When sending requests to this endpoint the Content-Type header should be set to application/x-ndjson.

The possible actions are index, create, delete, and update. index and create expect a source on the next line, and have the same semantics as the op_type parameter to the standard index API (i.e. create will fail if a document with the same index and type exists already, whereas index will add or replace a document as necessary). delete does not expect a source on the following line, and has the same semantics as the standard delete API. update expects that the partial doc, upsert, and script and its options are specified on the next line.

If you’re providing text file input to curl, you must use the --data-binary flag instead of plain -d. The latter doesn’t preserve newlines. Example:

$ cat requests
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
$ curl -s -H "Content-Type: application/x-ndjson" -XPOST localhost:9200/_bulk --data-binary "@requests"; echo
{"took":7, "errors": false, "items":[{"index":{"_index":"test","_type":"_doc","_id":"1","_version":1,"result":"created","forced_refresh":false}}]}

Because this format uses literal \n characters as delimiters, please be sure that the JSON actions and sources are not pretty printed. Here is an example of a correct sequence of bulk commands:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1" } }
{ "field1" : "value1" }
{ "delete" : { "_index" : "test", "_type" : "_doc", "_id" : "2" } }
{ "create" : { "_index" : "test", "_type" : "_doc", "_id" : "3" } }
{ "field1" : "value3" }
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "test"} }
{ "doc" : {"field2" : "value2"} }

The result of this bulk operation is:

{
   "took": 30,
   "errors": false,
   "items": [
      {
         "index": {
            "_index": "test",
            "_type": "_doc",
            "_id": "1",
            "_version": 1,
            "result": "created",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "status": 201,
            "_seq_no" : 0,
            "_primary_term": 1
         }
      },
      {
         "delete": {
            "_index": "test",
            "_type": "_doc",
            "_id": "2",
            "_version": 1,
            "result": "not_found",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "status": 404,
            "_seq_no" : 1,
            "_primary_term" : 2
         }
      },
      {
         "create": {
            "_index": "test",
            "_type": "_doc",
            "_id": "3",
            "_version": 1,
            "result": "created",
            "_shards": {
               "total": 2,
               "successful": 1,
               "failed": 0
            },
            "status": 201,
            "_seq_no" : 2,
            "_primary_term" : 3
         }
      },
      {
         "update": {
            "_index": "test",
            "_type": "_doc",
            "_id": "1",
            "_version": 2,
            "result": "updated",
            "_shards": {
                "total": 2,
                "successful": 1,
                "failed": 0
            },
            "status": 200,
            "_seq_no" : 3,
            "_primary_term" : 4
         }
      }
   ]
}

The endpoints are /_bulk, /{index}/_bulk, and /{index}/{type}/_bulk. When the index or the index/type are provided, they will be used by default on bulk items that don’t provide them explicitly.

A note on the format: the idea here is to make processing as fast as possible. As some of the actions will be redirected to other shards on other nodes, only action_and_meta_data is parsed on the receiving node side.

Client libraries using this protocol should strive to do something similar on the client side, and reduce buffering as much as possible.

The response to a bulk action is a large JSON structure with the individual results of each action that was performed in the same order as the actions that appeared in the request. The failure of a single action does not affect the remaining actions.

There is no "correct" number of actions to perform in a single bulk call. You should experiment with different settings to find the optimum size for your particular workload.

If using the HTTP API, make sure that the client does not send HTTP chunks, as this will slow things down.

Optimistic Concurrency Control

Each index and delete action within a bulk API call may include the if_seq_no and if_primary_term parameters in their respective action and meta data lines. The if_seq_no and if_primary_term parameters control how operations are executed, based on the last modification to existing documents. See Optimistic concurrency control for more details.
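As a sketch (the sequence number and primary term values are placeholders), a conditional index action looks like this:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1", "if_seq_no" : 3, "if_primary_term" : 1 } }
{ "field1" : "value1" }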

Versioning

Each bulk item can include the version value using the version field. It automatically follows the behavior of the index / delete operation based on the _version mapping. It also supports the version_type (see versioning).
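For example (the version number is arbitrary), an index action carrying an external version:

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1", "version" : 5, "version_type" : "external" } }
{ "field1" : "value1" }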

Routing

Each bulk item can include the routing value using the routing field. It automatically follows the behavior of the index / delete operation based on the _routing mapping.
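For example (the routing value is arbitrary):

POST _bulk
{ "index" : { "_index" : "test", "_type" : "_doc", "_id" : "1", "routing" : "key1" } }
{ "field1" : "value1" }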

Wait For Active Shards

When making bulk calls, you can set the wait_for_active_shards parameter to require a minimum number of shard copies to be active before starting to process the bulk request. See here for further details and a usage example.

Refresh

Control when the changes made by this request are visible to search. See refresh.

Note
Only the shards that receive the bulk request will be affected by refresh. Imagine a _bulk?refresh=wait_for request with three documents in it that happen to be routed to different shards in an index with five shards. The request will only wait for those three shards to refresh. The other two shards that make up the index do not participate in the _bulk request at all.

Update

When using the update action, retry_on_conflict can be used as a field in the action itself (not in the extra payload line), to specify how many times an update should be retried in the case of a version conflict.

The update action payload supports the following options: doc (partial document), upsert, doc_as_upsert, script, params (for script), lang (for script), and _source. See update documentation for details on the options. Example with update actions:

POST _bulk
{ "update" : {"_id" : "1", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"} }
{ "update" : { "_id" : "0", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "script" : { "source": "ctx._source.counter += params.param1", "lang" : "painless", "params" : {"param1" : 1}}, "upsert" : {"counter" : 1}}
{ "update" : {"_id" : "2", "_type" : "_doc", "_index" : "index1", "retry_on_conflict" : 3} }
{ "doc" : {"field" : "value"}, "doc_as_upsert" : true }
{ "update" : {"_id" : "3", "_type" : "_doc", "_index" : "index1", "_source" : true} }
{ "doc" : {"field" : "value"} }
{ "update" : {"_id" : "4", "_type" : "_doc", "_index" : "index1"} }
{ "doc" : {"field" : "value"}, "_source": true}

Security

Partial responses

To ensure fast responses, the bulk API will respond with partial results if one or more shards fail. See Shard failures for more information.

Reindex API

Important
Reindex requires _source to be enabled for all documents in the source index.
Important
Reindex does not attempt to set up the destination index. It does not copy the settings of the source index. You should set up the destination index prior to running a _reindex action, including setting up mappings, shard counts, replicas, etc.

The most basic form of _reindex just copies documents from one index to another. This will copy documents from the twitter index into the new_twitter index:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

That will return something like this:

{
  "took" : 147,
  "timed_out": false,
  "created": 120,
  "updated": 0,
  "deleted": 0,
  "batches": 1,
  "version_conflicts": 0,
  "noops": 0,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": -1.0,
  "throttled_until_millis": 0,
  "total": 120,
  "failures" : [ ]
}

Just like _update_by_query, _reindex gets a snapshot of the source index but its target must be a different index so version conflicts are unlikely. The dest element can be configured like the index API to control optimistic concurrency control. Just leaving out version_type (as above) or setting it to internal will cause Elasticsearch to blindly dump documents into the target, overwriting any that happen to have the same type and id:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "internal"
  }
}

Setting version_type to external will cause Elasticsearch to preserve the version from the source, create any documents that are missing, and update any documents that have an older version in the destination index than they do in the source index:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  }
}

Setting op_type to create will cause _reindex to only create missing documents in the target index. All existing documents will cause a version conflict:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

By default, version conflicts abort the _reindex process. The "conflicts" request body parameter can be used to instruct _reindex to proceed with the next document on version conflicts. It is important to note that the handling of other error types is unaffected by the "conflicts" parameter. When "conflicts": "proceed" is set in the request body, the _reindex process will continue on version conflicts and return a count of version conflicts encountered:

POST _reindex
{
  "conflicts": "proceed",
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "op_type": "create"
  }
}

You can limit the documents by adding a type to the source or by adding a query. This will only copy tweets made by kimchy into new_twitter:

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "_doc",
    "query": {
      "term": {
        "user": "kimchy"
      }
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

index and type in source can both be lists, allowing you to copy from lots of sources in one request. This will copy documents from the _doc and post types in the twitter and blog indices.

POST _reindex
{
  "source": {
    "index": ["twitter", "blog"],
    "type": ["_doc", "post"]
  },
  "dest": {
    "index": "all_together",
    "type": "_doc"
  }
}
Note
The Reindex API makes no effort to handle ID collisions, so the last document written will "win", but the order isn’t usually predictable, so it is not a good idea to rely on this behavior. Instead, make sure that IDs are unique using a script.

It’s also possible to limit the number of processed documents by setting size. This will only copy a single document from twitter to new_twitter:

POST _reindex
{
  "size": 1,
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

If you want a particular set of documents from the twitter index you’ll need to use sort. Sorting makes the scroll less efficient but in some contexts it’s worth it. If possible, prefer a more selective query to size and sort. This will copy 10000 documents from twitter into new_twitter:

POST _reindex
{
  "size": 10000,
  "source": {
    "index": "twitter",
    "sort": { "date": "desc" }
  },
  "dest": {
    "index": "new_twitter"
  }
}

The source section supports all the elements that are supported in a search request. For instance, only a subset of the fields from the original documents can be reindexed using source filtering as follows:

POST _reindex
{
  "source": {
    "index": "twitter",
    "_source": ["user", "_doc"]
  },
  "dest": {
    "index": "new_twitter"
  }
}

Like _update_by_query, _reindex supports a script that modifies the document. Unlike _update_by_query, the script is allowed to modify the document’s metadata. This example bumps the version of the source document:

POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter",
    "version_type": "external"
  },
  "script": {
    "source": "if (ctx._source.foo == 'bar') {ctx._version++; ctx._source.remove('foo')}",
    "lang": "painless"
  }
}

Just as in _update_by_query, you can set ctx.op to change the operation that is executed on the destination index:

noop

Set ctx.op = "noop" if your script decides that the document doesn’t have to be indexed in the destination index. This no operation will be reported in the noop counter in the response body.

delete

Set ctx.op = "delete" if your script decides that the document must be deleted from the destination index. The deletion will be reported in the deleted counter in the response body.

Setting ctx.op to anything else will return an error, as will setting any other field in ctx.

Think of the possibilities! Just be careful; you are able to change:

  • _id

  • _type

  • _index

  • _version

  • _routing

Setting _version to null or clearing it from the ctx map is just like not sending the version in an indexing request; it will cause the document to be overwritten in the target index regardless of the version on the target or the version type you use in the _reindex request.

By default if _reindex sees a document with routing then the routing is preserved unless it’s changed by the script. You can set routing on the dest request to change this:

keep

Sets the routing on the bulk request sent for each match to the routing on the match. This is the default value.

discard

Sets the routing on the bulk request sent for each match to null.

=<some text>

Sets the routing on the bulk request sent for each match to all text after the =.

For example, you can use the following request to copy all documents from the source index with the company name cat into the dest index with routing set to cat.

POST _reindex
{
  "source": {
    "index": "source",
    "query": {
      "match": {
        "company": "cat"
      }
    }
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}

By default _reindex uses scroll batches of 1000. You can change the batch size with the size field in the source element:

POST _reindex
{
  "source": {
    "index": "source",
    "size": 100
  },
  "dest": {
    "index": "dest",
    "routing": "=cat"
  }
}

Reindex can also use the ingest node feature by specifying a pipeline like this:

POST _reindex
{
  "source": {
    "index": "source"
  },
  "dest": {
    "index": "dest",
    "pipeline": "some_ingest_pipeline"
  }
}

Reindex from Remote

Reindex supports reindexing from a remote Elasticsearch cluster:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "username": "user",
      "password": "pass"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

The host parameter must contain a scheme, host, port (e.g. https://otherhost:9200), and optional path (e.g. https://otherhost:9200/proxy). The username and password parameters are optional, and when they are present _reindex will connect to the remote Elasticsearch node using basic auth. Be sure to use https when using basic auth or the password will be sent in plain text. There are a range of settings available to configure the behaviour of the https connection.

Remote hosts have to be explicitly whitelisted in elasticsearch.yml using the reindex.remote.whitelist property. It can be set to a comma delimited list of allowed remote host and port combinations (e.g. otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*). Scheme is ignored by the whitelist — only host and port are used, for example:

reindex.remote.whitelist: "otherhost:9200, another:9200, 127.0.10.*:9200, localhost:*"

The whitelist must be configured on any nodes that will coordinate the reindex.

This feature should work with remote clusters of any version of Elasticsearch you are likely to find. This should allow you to upgrade from any version of Elasticsearch to the current version by reindexing from a cluster of the old version.

Warning
Elasticsearch does not support forward compatibility across major versions. For example, you cannot reindex from a 7.x cluster into a 6.x cluster.

To enable queries sent to older versions of Elasticsearch, the query parameter is sent directly to the remote host without validation or modification.

Note
Reindexing from remote clusters does not support manual or automatic slicing.

Reindexing from a remote server uses an on-heap buffer that defaults to a maximum size of 100mb. If the remote index includes very large documents you’ll need to use a smaller batch size. The example below sets the batch size to 10 which is very, very small.

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200"
    },
    "index": "source",
    "size": 10,
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

It is also possible to set the socket read timeout on the remote connection with the socket_timeout field and the connection timeout with the connect_timeout field. Both default to 30 seconds. This example sets the socket read timeout to one minute and the connection timeout to 10 seconds:

POST _reindex
{
  "source": {
    "remote": {
      "host": "http://otherhost:9200",
      "socket_timeout": "1m",
      "connect_timeout": "10s"
    },
    "index": "source",
    "query": {
      "match": {
        "test": "data"
      }
    }
  },
  "dest": {
    "index": "dest"
  }
}

Configuring SSL parameters

Reindex from remote supports configurable SSL settings. These must be specified in the elasticsearch.yml file, with the exception of the secure settings, which you add in the Elasticsearch keystore. It is not possible to configure SSL in the body of the _reindex request.

The following settings are supported:

reindex.ssl.certificate_authorities

List of paths to PEM encoded certificate files that should be trusted. You cannot specify both reindex.ssl.certificate_authorities and reindex.ssl.truststore.path.

reindex.ssl.truststore.path

The path to the Java Keystore file that contains the certificates to trust. This keystore can be in "JKS" or "PKCS#12" format. You cannot specify both reindex.ssl.certificate_authorities and reindex.ssl.truststore.path.

reindex.ssl.truststore.password

The password to the truststore (reindex.ssl.truststore.path). This setting cannot be used with reindex.ssl.truststore.secure_password.

reindex.ssl.truststore.secure_password (Secure)

The password to the truststore (reindex.ssl.truststore.path). This setting cannot be used with reindex.ssl.truststore.password.

reindex.ssl.truststore.type

The type of the truststore (reindex.ssl.truststore.path). Must be either jks or PKCS12. If the truststore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults to PKCS12. Otherwise, it defaults to jks.

reindex.ssl.verification_mode

Indicates the type of verification to protect against man in the middle attacks and certificate forgery. One of full (verify the hostname and the certificate path), certificate (verify the certificate path, but not the hostname) or none (perform no verification - this is strongly discouraged in production environments). Defaults to full.

reindex.ssl.certificate

Specifies the path to the PEM encoded certificate (or certificate chain) to be used for HTTP client authentication (if required by the remote cluster) This setting requires that reindex.ssl.key also be set. You cannot specify both reindex.ssl.certificate and reindex.ssl.keystore.path.

reindex.ssl.key

Specifies the path to the PEM encoded private key associated with the certificate used for client authentication (reindex.ssl.certificate). You cannot specify both reindex.ssl.key and reindex.ssl.keystore.path.

reindex.ssl.key_passphrase

Specifies the passphrase to decrypt the PEM encoded private key (reindex.ssl.key) if it is encrypted. Cannot be used with reindex.ssl.secure_key_passphrase.

reindex.ssl.secure_key_passphrase (Secure)

Specifies the passphrase to decrypt the PEM encoded private key (reindex.ssl.key) if it is encrypted. Cannot be used with reindex.ssl.key_passphrase.

reindex.ssl.keystore.path

Specifies the path to the keystore that contains a private key and certificate to be used for HTTP client authentication (if required by the remote cluster). This keystore can be in "JKS" or "PKCS#12" format. You cannot specify both reindex.ssl.key and reindex.ssl.keystore.path.

reindex.ssl.keystore.type

The type of the keystore (reindex.ssl.keystore.path). Must be either jks or PKCS12. If the keystore path ends in ".p12", ".pfx" or "pkcs12", this setting defaults to PKCS12. Otherwise, it defaults to jks.

reindex.ssl.keystore.password

The password to the keystore (reindex.ssl.keystore.path). This setting cannot be used with reindex.ssl.keystore.secure_password.

reindex.ssl.keystore.secure_password (Secure)

The password to the keystore (reindex.ssl.keystore.path). This setting cannot be used with reindex.ssl.keystore.password.

reindex.ssl.keystore.key_password

The password for the key in the keystore (reindex.ssl.keystore.path). Defaults to the keystore password. This setting cannot be used with reindex.ssl.keystore.secure_key_password.

reindex.ssl.keystore.secure_key_password (Secure)

The password for the key in the keystore (reindex.ssl.keystore.path). Defaults to the keystore password. This setting cannot be used with reindex.ssl.keystore.key_password.
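For example, a minimal elasticsearch.yml sketch that trusts a PEM CA and presents a client certificate might look like this (all paths are hypothetical):

reindex.ssl.certificate_authorities: [ "/path/to/ca.pem" ]
reindex.ssl.certificate: "/path/to/client.crt"
reindex.ssl.key: "/path/to/client.key"
reindex.ssl.verification_mode: full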

URL Parameters

In addition to the standard parameters like pretty, the Reindex API also supports refresh, wait_for_completion, wait_for_active_shards, timeout, scroll, and requests_per_second.

Sending the refresh url parameter will cause all indexes to which the request wrote to be refreshed. This is different than the Index API’s refresh parameter which causes just the shard that received the new data to be refreshed. Also unlike the Index API it does not support wait_for.

If the request contains wait_for_completion=false then Elasticsearch will perform some preflight checks, launch the request, and then return a task which can be used with Tasks APIs to cancel or get the status of the task. Elasticsearch will also create a record of this task as a document at .tasks/task/${taskId}. This is yours to keep or remove as you see fit. When you are done with it, delete it so Elasticsearch can reclaim the space it uses.

wait_for_active_shards controls how many copies of a shard must be active before proceeding with the reindexing. See here for details. timeout controls how long each write request waits for unavailable shards to become available. Both work exactly how they work in the Bulk API. As _reindex uses scroll search, you can also specify the scroll parameter to control how long it keeps the "search context" alive, (e.g. ?scroll=10m). The default value is 5 minutes.

requests_per_second can be set to any positive decimal number (1.4, 6, 1000, etc.) and throttles the rate at which _reindex issues batches of index operations by padding each batch with a wait time. The throttling can be disabled by setting requests_per_second to -1.

The throttling is done by waiting between batches so that the scroll which _reindex uses internally can be given a timeout that takes into account the padding. The padding time is the difference between the batch size divided by the requests_per_second and the time spent writing. By default the batch size is 1000, so if the requests_per_second is set to 500:

target_time = 1000 / 500 per second = 2 seconds
wait_time = target_time - write_time = 2 seconds - .5 seconds = 1.5 seconds

Since the batch is issued as a single _bulk request, large batch sizes will cause Elasticsearch to create many requests and then wait for a while before starting the next set. This is "bursty" instead of "smooth". The default value is -1.

Response body

The JSON response looks like this:

{
  "took": 639,
  "timed_out": false,
  "total": 5,
  "updated": 0,
  "created": 5,
  "deleted": 0,
  "batches": 1,
  "noops": 0,
  "version_conflicts": 2,
  "retries": {
    "bulk": 0,
    "search": 0
  },
  "throttled_millis": 0,
  "requests_per_second": 1,
  "throttled_until_millis": 0,
  "failures": [ ]
}
took

The total milliseconds the entire operation took.

timed_out

This flag is set to true if any of the requests executed during the reindex timed out.

total

The number of documents that were successfully processed.

updated

The number of documents that were successfully updated.

created

The number of documents that were successfully created.

deleted

The number of documents that were successfully deleted.

batches

The number of scroll responses pulled back by the reindex.

noops

The number of documents that were ignored because the script used for the reindex returned a noop value for ctx.op.

version_conflicts

The number of version conflicts that reindex hit.

retries

The number of retries attempted by reindex. bulk is the number of bulk actions retried and search is the number of search actions retried.

throttled_millis

Number of milliseconds the request slept to conform to requests_per_second.

requests_per_second

The number of requests per second effectively executed during the reindex.

throttled_until_millis

This field should always be equal to zero in a _reindex response. It only has meaning when using the Task API, where it indicates the next time (in milliseconds since epoch) a throttled request will be executed again in order to conform to requests_per_second.

failures

Array of failures if there were any unrecoverable errors during the process. If this is non-empty then the request aborted because of those failures. Reindex is implemented using batches and any failure causes the entire process to abort but all failures in the current batch are collected into the array. You can use the conflicts option to prevent reindex from aborting on version conflicts.

Works with the Task API

You can fetch the status of all running reindex requests with the Task API:

GET _tasks?detailed=true&actions=*reindex

The response looks like:

{
  "nodes" : {
    "r1A2WoRbTwKZ516z6NEs5A" : {
      "name" : "r1A2WoR",
      "transport_address" : "127.0.0.1:9300",
      "host" : "127.0.0.1",
      "ip" : "127.0.0.1:9300",
      "attributes" : {
        "testattr" : "test",
        "portsfile" : "true"
      },
      "tasks" : {
        "r1A2WoRbTwKZ516z6NEs5A:36619" : {
          "node" : "r1A2WoRbTwKZ516z6NEs5A",
          "id" : 36619,
          "type" : "transport",
          "action" : "indices:data/write/reindex",
          "status" : {    (1)
            "total" : 6154,
            "updated" : 3500,
            "created" : 0,
            "deleted" : 0,
            "batches" : 4,
            "version_conflicts" : 0,
            "noops" : 0,
            "retries": {
              "bulk": 0,
              "search": 0
            },
            "throttled_millis": 0,
            "requests_per_second": -1,
            "throttled_until_millis": 0
          },
          "description" : "",
          "start_time_in_millis": 1535149899665,
          "running_time_in_nanos": 5926916792,
          "cancellable": true,
          "headers": {}
        }
      }
    }
  }
}
  1. This object contains the actual status. It is identical to the response JSON except for the important addition of the total field. total is the total number of operations that the _reindex expects to perform. You can estimate the progress by adding the updated, created, and deleted fields. The request will finish when their sum is equal to the total field.

With the task id you can look up the task directly. The following example retrieves information about the task r1A2WoRbTwKZ516z6NEs5A:36619:

GET /_tasks/r1A2WoRbTwKZ516z6NEs5A:36619

The advantage of this API is that it integrates with wait_for_completion=false to transparently return the status of completed tasks. If the task is completed and wait_for_completion=false was set, it will return a results or an error field. The cost of this feature is the document that wait_for_completion=false creates at .tasks/task/${taskId}. It is up to you to delete that document.

Works with the Cancel Task API

Any reindex can be canceled using the Task Cancel API. For example:

POST _tasks/r1A2WoRbTwKZ516z6NEs5A:36619/_cancel

The task ID can be found using the Tasks API.

Cancellation should happen quickly but might take a few seconds. The Tasks API will continue to list the task until it wakes to cancel itself.

Rethrottling

The value of requests_per_second can be changed on a running reindex using the _rethrottle API:

POST _reindex/r1A2WoRbTwKZ516z6NEs5A:36619/_rethrottle?requests_per_second=-1

The task ID can be found using the tasks API.

Just like when setting it on the Reindex API, requests_per_second can be either -1 to disable throttling or any decimal number like 1.7 or 12 to throttle to that level. Rethrottling that speeds up the query takes effect immediately, but rethrottling that slows down the query will take effect after completing the current batch. This prevents scroll timeouts.

Reindex to change the name of a field

_reindex can be used to build a copy of an index with renamed fields. Say you create an index containing documents that look like this:

POST test/_doc/1?refresh
{
  "text": "words words",
  "flag": "foo"
}

but you don’t like the name flag and want to replace it with tag. _reindex can create the other index for you:

POST _reindex
{
  "source": {
    "index": "test"
  },
  "dest": {
    "index": "test2"
  },
  "script": {
    "source": "ctx._source.tag = ctx._source.remove(\"flag\")"
  }
}

Now you can get the new document:

GET test2/_doc/1

which will return:

{
  "found": true,
  "_id": "1",
  "_index": "test2",
  "_type": "_doc",
  "_version": 1,
  "_seq_no": 44,
  "_primary_term": 1,
  "_source": {
    "text": "words words",
    "tag": "foo"
  }
}

Slicing

Reindex supports sliced scroll to parallelize the reindexing process. This parallelization can improve efficiency and provide a convenient way to break the request down into smaller parts.

Manual slicing

Slice a reindex request manually by providing a slice id and total number of slices to each request:

POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 0,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}
POST _reindex
{
  "source": {
    "index": "twitter",
    "slice": {
      "id": 1,
      "max": 2
    }
  },
  "dest": {
    "index": "new_twitter"
  }
}

You can verify this works by:

GET _refresh
POST new_twitter/_search?size=0&filter_path=hits.total

which results in a sensible total like this one:

{
  "hits": {
    "total": 120
  }
}

Automatic slicing

You can also let _reindex automatically parallelize using sliced scroll to slice on _uid. Use slices to specify the number of slices to use:

POST _reindex?slices=5&refresh
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}

You can also verify this works by:

POST new_twitter/_search?size=0&filter_path=hits.total

which results in a sensible total like this one:

{
  "hits": {
    "total": 120
  }
}

Setting slices to auto will let Elasticsearch choose the number of slices to use. This setting will use one slice per shard, up to a certain limit. If there are multiple source indices, it will choose the number of slices based on the index with the smallest number of shards.
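For example, the copy shown above can be run with slices=auto to let Elasticsearch pick the slice count:

POST _reindex?slices=auto&refresh
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  }
}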

Adding slices to _reindex just automates the manual process used in the section above, creating sub-requests which means it has some quirks:

  • You can see these requests in the Tasks APIs. These sub-requests are "child" tasks of the task for the request with slices.

  • Fetching the status of the task for the request with slices only contains the status of completed slices.

  • These sub-requests are individually addressable for things like cancellation and rethrottling.

  • Rethrottling the request with slices will rethrottle the unfinished sub-request proportionally.

  • Canceling the request with slices will cancel each sub-request.

  • Due to the nature of slices each sub-request won’t get a perfectly even portion of the documents. All documents will be addressed, but some slices may be larger than others. Expect larger slices to have a more even distribution.

  • Parameters like requests_per_second and size on a request with slices are distributed proportionally to each sub-request. Combine that with the point above about distribution being uneven and you should conclude that using size with slices might not result in exactly size documents being reindexed.

  • Each sub-request gets a slightly different snapshot of the source index, though these are all taken at approximately the same time.

Picking the number of slices

If slicing automatically, setting slices to auto will choose a reasonable number for most indices. If slicing manually or otherwise tuning automatic slicing, use these guidelines.

Query performance is most efficient when the number of slices is equal to the number of shards in the index. If that number is large (e.g. 500), choose a lower number as too many slices will hurt performance. Setting slices higher than the number of shards generally does not improve efficiency and adds overhead.

Indexing performance scales linearly across available resources with the number of slices.

Whether query or indexing performance dominates the runtime depends on the documents being reindexed and cluster resources.

Reindexing many indices

If you have many indices to reindex it is generally better to reindex them one at a time rather than using a glob pattern to pick up many indices. That way you can resume the process if there are any errors by removing the partially completed index and starting over at that index. It also makes parallelizing the process fairly simple: split the list of indices to reindex and run each list in parallel.

One-off bash scripts seem to work nicely for this:

for index in i1 i2 i3 i4 i5; do
  curl -HContent-Type:application/json -XPOST localhost:9200/_reindex?pretty -d'{
    "source": {
      "index": "'$index'"
    },
    "dest": {
      "index": "'$index'-reindexed"
    }
  }'
done

Reindex daily indices

Notwithstanding the above advice, you can use _reindex in combination with Painless to reindex daily indices to apply a new template to the existing documents.

Assuming you have indices consisting of documents as follows:

PUT metricbeat-2016.05.30/_doc/1?refresh
{"system.cpu.idle.pct": 0.908}
PUT metricbeat-2016.05.31/_doc/1?refresh
{"system.cpu.idle.pct": 0.105}

The new template for the metricbeat-* indices is already loaded into Elasticsearch, but it applies only to the newly created indices. Painless can be used to reindex the existing documents and apply the new template.

The script below extracts the date from the index name and creates a new index with -1 appended. All data from metricbeat-2016.05.31 will be reindexed into metricbeat-2016.05.31-1.

POST _reindex
{
  "source": {
    "index": "metricbeat-*"
  },
  "dest": {
    "index": "metricbeat"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'"
  }
}

All documents from the previous metricbeat indices can now be found in the *-1 indices.

GET metricbeat-2016.05.30-1/_doc/1
GET metricbeat-2016.05.31-1/_doc/1

The previous method can also be used in conjunction with changing a field name to load only the existing data into the new index and rename any fields if needed.
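
A sketch of that combination follows; old_field and new_field are placeholder field names, not fields of the metricbeat documents above:

POST _reindex
{
  "source": {
    "index": "metricbeat-*"
  },
  "dest": {
    "index": "metricbeat"
  },
  "script": {
    "lang": "painless",
    "source": "ctx._index = 'metricbeat-' + (ctx._index.substring('metricbeat-'.length(), ctx._index.length())) + '-1'; ctx._source.new_field = ctx._source.remove('old_field')"
  }
}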

Extracting a random subset of an index

_reindex can be used to extract a random subset of an index for testing:

POST _reindex
{
  "size": 10,
  "source": {
    "index": "twitter",
    "query": {
      "function_score" : {
        "query" : { "match_all": {} },
        "random_score" : {}
      }
    },
    "sort": "_score"    (1)
  },
  "dest": {
    "index": "random_twitter"
  }
}
  1. _reindex defaults to sorting by _doc so random_score will not have any effect unless you override the sort to _score.

Term Vectors

Returns information and statistics on terms in the fields of a particular document. The document could be stored in the index or artificially provided by the user. Term vectors are realtime by default, not near realtime. This can be changed by setting the realtime parameter to false.

GET /twitter/_doc/1/_termvectors
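
To return only terms that are already visible to search, the realtime behaviour can be switched off (a minimal variation of the request above):

GET /twitter/_doc/1/_termvectors?realtime=false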

Optionally, you can specify the fields for which the information is retrieved either with a parameter in the URL

GET /twitter/_doc/1/_termvectors?fields=message

or by adding the requested fields in the request body (see example below). Fields can also be specified with wildcards in a similar way to the multi match query
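
For example, a wildcard pattern such as the following would match the message field used in earlier examples (the pattern itself is only illustrative):

GET /twitter/_doc/1/_termvectors?fields=mess*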

Warning
Note that the usage of /_termvector is deprecated in 2.0, and replaced by /_termvectors.

Return values

Three types of values can be requested: term information, term statistics, and field statistics. By default, all term information and field statistics are returned for all fields but no term statistics.

Term information

  • term frequency in the field (always returned)

  • term positions (positions : true)

  • start and end offsets (offsets : true)

  • term payloads (payloads : true), as base64 encoded bytes

If the requested information wasn’t stored in the index, it will be computed on the fly if possible. Additionally, term vectors could be computed for documents not even existing in the index, but instead provided by the user.

Warning

Start and end offsets assume UTF-16 encoding is being used. If you want to use these offsets in order to get the original text that produced this token, you should make sure that the string you are taking a sub-string of is also encoded using UTF-16.

Term statistics

Setting term_statistics to true (default is false) will return

  • total term frequency (how often a term occurs in all documents)

  • document frequency (the number of documents containing the current term)

By default these values are not returned since term statistics can have a serious performance impact.
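
When they are needed, they can be requested explicitly; a pared-down version of the full example further below:

GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text"],
  "term_statistics" : true
}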

Field statistics

Setting field_statistics to false (default is true) will omit :

  • document count (how many documents contain this field)

  • sum of document frequencies (the sum of document frequencies for all terms in this field)

  • sum of total term frequencies (the sum of total term frequencies of each term in this field)

Terms Filtering

With the parameter filter, the terms returned could also be filtered based on their tf-idf scores. This could be useful in order to find out a good characteristic vector of a document. This feature works in a similar manner to the second phase of the More Like This Query. See example 5 for usage.

The following sub-parameters are supported:

max_num_terms

Maximum number of terms that must be returned per field. Defaults to 25.

min_term_freq

Ignore words with less than this frequency in the source doc. Defaults to 1.

max_term_freq

Ignore words with more than this frequency in the source doc. Defaults to unbounded.

min_doc_freq

Ignore terms which do not occur in at least this many docs. Defaults to 1.

max_doc_freq

Ignore words which occur in more than this many docs. Defaults to unbounded.

min_word_length

The minimum word length below which words will be ignored. Defaults to 0.

max_word_length

The maximum word length above which words will be ignored. Defaults to unbounded (0).

Behaviour

The term and field statistics are not accurate. Deleted documents are not taken into account. The information is only retrieved for the shard the requested document resides in. The term and field statistics are therefore only useful as relative measures whereas the absolute numbers have no meaning in this context. By default, when requesting term vectors of artificial documents, a shard to get the statistics from is randomly selected. Use routing only to hit a particular shard.
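
For example, statistics for an artificial document can be pinned to a specific shard by supplying a routing value; the value used here is purely illustrative:

GET /twitter/_doc/_termvectors?routing=kimchy
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}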

Example: Returning stored term vectors

First, we create an index that stores term vectors, payloads, etc. :

PUT /twitter/
{ "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "store" : true,
          "analyzer" : "fulltext_analyzer"
         },
         "fullname": {
          "type": "text",
          "term_vector": "with_positions_offsets_payloads",
          "analyzer" : "fulltext_analyzer"
        }
      }
    }
  },
  "settings" : {
    "index" : {
      "number_of_shards" : 1,
      "number_of_replicas" : 0
    },
    "analysis": {
      "analyzer": {
        "fulltext_analyzer": {
          "type": "custom",
          "tokenizer": "whitespace",
          "filter": [
            "lowercase",
            "type_as_payload"
          ]
        }
      }
    }
  }
}

Second, we add some documents:

PUT /twitter/_doc/1
{
  "fullname" : "John Doe",
  "text" : "twitter test test test "
}

PUT /twitter/_doc/2
{
  "fullname" : "Jane Doe",
  "text" : "Another twitter test ..."
}

The following request returns all information and statistics for field text in document 1 (John Doe):

GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text"],
  "offsets" : true,
  "payloads" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Response:

{
    "_id": "1",
    "_index": "twitter",
    "_type": "_doc",
    "_version": 1,
    "found": true,
    "took": 6,
    "term_vectors": {
        "text": {
            "field_statistics": {
                "doc_count": 2,
                "sum_doc_freq": 6,
                "sum_ttf": 8
            },
            "terms": {
                "test": {
                    "doc_freq": 2,
                    "term_freq": 3,
                    "tokens": [
                        {
                            "end_offset": 12,
                            "payload": "d29yZA==",
                            "position": 1,
                            "start_offset": 8
                        },
                        {
                            "end_offset": 17,
                            "payload": "d29yZA==",
                            "position": 2,
                            "start_offset": 13
                        },
                        {
                            "end_offset": 22,
                            "payload": "d29yZA==",
                            "position": 3,
                            "start_offset": 18
                        }
                    ],
                    "ttf": 4
                },
                "twitter": {
                    "doc_freq": 2,
                    "term_freq": 1,
                    "tokens": [
                        {
                            "end_offset": 7,
                            "payload": "d29yZA==",
                            "position": 0,
                            "start_offset": 0
                        }
                    ],
                    "ttf": 2
                }
            }
        }
    }
}

Example: Generating term vectors on the fly

Term vectors which are not explicitly stored in the index are automatically computed on the fly. The following request returns all information and statistics for the fields in document 1, even though the terms haven’t been explicitly stored in the index. Note that for the field text, the terms are not re-generated.

GET /twitter/_doc/1/_termvectors
{
  "fields" : ["text", "some_field_without_term_vectors"],
  "offsets" : true,
  "positions" : true,
  "term_statistics" : true,
  "field_statistics" : true
}

Example: Artificial documents

Term vectors can also be generated for artificial documents, that is for documents not present in the index. For example, the following request would return the same results as in example 1. The mapping used is determined by the index and type.

If dynamic mapping is turned on (default), the document fields not in the original mapping will be dynamically created.

GET /twitter/_doc/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  }
}

Per-field analyzer

Additionally, a different analyzer than the one at the field may be provided by using the per_field_analyzer parameter. This is useful in order to generate term vectors in any fashion, especially when using artificial documents. When providing an analyzer for a field that already stores term vectors, the term vectors will be re-generated.

GET /twitter/_doc/_termvectors
{
  "doc" : {
    "fullname" : "John Doe",
    "text" : "twitter test test test"
  },
  "fields": ["fullname"],
  "per_field_analyzer" : {
    "fullname": "keyword"
  }
}

Response:

{
  "_index": "twitter",
  "_type": "_doc",
  "_version": 0,
  "found": true,
  "took": 6,
  "term_vectors": {
    "fullname": {
       "field_statistics": {
          "sum_doc_freq": 2,
          "doc_count": 4,
          "sum_ttf": 4
       },
       "terms": {
          "John Doe": {
             "term_freq": 1,
             "tokens": [
                {
                   "position": 0,
                   "start_offset": 0,
                   "end_offset": 8
                }
             ]
          }
       }
    }
  }
}

Example: Terms filtering

Finally, the terms returned could be filtered based on their tf-idf scores. In the example below we obtain the three most "interesting" keywords from the artificial document having the given "plot" field value. Notice that the keyword "Tony" or any stop words are not part of the response, as their tf-idf must be too low.

GET /imdb/_doc/_termvectors
{
    "doc": {
      "plot": "When wealthy industrialist Tony Stark is forced to build an armored suit after a life-threatening incident, he ultimately decides to use its technology to fight against evil."
    },
    "term_statistics" : true,
    "field_statistics" : true,
    "positions": false,
    "offsets": false,
    "filter" : {
      "max_num_terms" : 3,
      "min_term_freq" : 1,
      "min_doc_freq" : 1
    }
}

Response:

{
   "_index": "imdb",
   "_type": "_doc",
   "_version": 0,
   "found": true,
   "term_vectors": {
      "plot": {
         "field_statistics": {
            "sum_doc_freq": 3384269,
            "doc_count": 176214,
            "sum_ttf": 3753460
         },
         "terms": {
            "armored": {
               "doc_freq": 27,
               "ttf": 27,
               "term_freq": 1,
               "score": 9.74725
            },
            "industrialist": {
               "doc_freq": 88,
               "ttf": 88,
               "term_freq": 1,
               "score": 8.590818
            },
            "stark": {
               "doc_freq": 44,
               "ttf": 47,
               "term_freq": 1,
               "score": 9.272792
            }
         }
      }
   }
}

Multi termvectors API

The multi termvectors API allows you to get multiple term vectors at once. The documents from which to retrieve the term vectors are specified by an index, type, and id. But the documents could also be artificially provided in the request itself.

The response includes a docs array with all the fetched termvectors, each element having the structure provided by the termvectors API. Here is an example:

POST /_mtermvectors
{
   "docs": [
      {
         "_index": "twitter",
         "_type": "_doc",
         "_id": "2",
         "term_statistics": true
      },
      {
         "_index": "twitter",
         "_type": "_doc",
         "_id": "1",
         "fields": [
            "message"
         ]
      }
   ]
}

See the termvectors API for a description of possible parameters.

The _mtermvectors endpoint can also be used against an index (in which case the index is not required in the body):

POST /twitter/_mtermvectors
{
   "docs": [
      {
         "_type": "_doc",
         "_id": "2",
         "fields": [
            "message"
         ],
         "term_statistics": true
      },
      {
         "_type": "_doc",
         "_id": "1"
      }
   ]
}

And type:

POST /twitter/_doc/_mtermvectors
{
   "docs": [
      {
         "_id": "2",
         "fields": [
            "message"
         ],
         "term_statistics": true
      },
      {
         "_id": "1"
      }
   ]
}

If all requested documents are on the same index and have the same type and the same parameters, the request can be simplified:

POST /twitter/_doc/_mtermvectors
{
    "ids" : ["1", "2"],
    "parameters": {
        "fields": [
            "message"
        ],
        "term_statistics": true
    }
}

Additionally, just like for the termvectors API, term vectors could be generated for user provided documents. The mapping used is determined by _index and _type.

POST /_mtermvectors
{
   "docs": [
      {
         "_index": "twitter",
         "_type": "_doc",
         "doc" : {
            "user" : "John Doe",
            "message" : "twitter test test test"
         }
      },
      {
         "_index": "twitter",
         "_type": "_doc",
         "doc" : {
           "user" : "Jane Doe",
           "message" : "Another twitter test ..."
         }
      }
   ]
}

?refresh

The Index, Update, Delete, and Bulk APIs support setting refresh to control when changes made by this request are made visible to search. These are the allowed values:

Empty string or true

Refresh the relevant primary and replica shards (not the whole index) immediately after the operation occurs, so that the updated document appears in search results immediately. This should ONLY be done after careful thought and verification that it does not lead to poor performance, both from an indexing and a search standpoint.

wait_for

Wait for the changes made by the request to be made visible by a refresh before replying. This doesn’t force an immediate refresh, rather, it waits for a refresh to happen. Elasticsearch automatically refreshes shards that have changed every index.refresh_interval, which defaults to one second. That setting is dynamic. Calling the Refresh API or setting refresh to true on any of the APIs that support it will also cause a refresh, in turn causing already running requests with refresh=wait_for to return.

false (the default)

Take no refresh related actions. The changes made by this request will be made visible at some point after the request returns.

Choosing which setting to use

Unless you have a good reason to wait for the change to become visible, always use refresh=false, or, because that is the default, just leave the refresh parameter out of the URL. That is the simplest and fastest choice.

If you absolutely must have the changes made by a request visible synchronously with the request then you must pick between putting more load on Elasticsearch (true) and waiting longer for the response (wait_for). Here are a few points that should inform that decision:

  • The more changes being made to the index, the more work wait_for saves compared to true. If the index is only changed once every index.refresh_interval, then it saves no work.

  • true creates less efficient index constructs (tiny segments) that must later be merged into more efficient index constructs (larger segments). This means the cost of true is paid at index time to create the tiny segment, at search time to search the tiny segment, and at merge time to make the larger segments.

  • Never start multiple refresh=wait_for requests in a row. Instead batch them into a single bulk request with refresh=wait_for and Elasticsearch will start them all in parallel and return only when they have all finished (see the sketch after this list).

  • If the refresh interval is set to -1, disabling the automatic refreshes, then requests with refresh=wait_for will wait indefinitely until some action causes a refresh. Conversely, setting index.refresh_interval to something shorter than the default like 200ms will make refresh=wait_for come back faster, but it’ll still generate inefficient segments.

  • refresh=wait_for only affects the request that it is on, but, by forcing a refresh immediately, refresh=true will affect other ongoing requests. In general, if you have a running system you don’t wish to disturb then refresh=wait_for is a smaller modification.
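
As a sketch of the batching advice above, two document changes that both need to be visible could be combined into one bulk request (the test index matches the examples below):

POST /_bulk?refresh=wait_for
{"index": {"_index": "test", "_type": "_doc", "_id": "10"}}
{"test": "test"}
{"index": {"_index": "test", "_type": "_doc", "_id": "11"}}
{"test": "test"}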

refresh=wait_for Can Force a Refresh

If a refresh=wait_for request comes in when there are already index.max_refresh_listeners (defaults to 1000) requests waiting for a refresh on that shard then that request will behave just as though it had refresh set to true instead: it will force a refresh. This keeps the promise that when a refresh=wait_for request returns that its changes are visible for search while preventing unchecked resource usage for blocked requests. If a request forced a refresh because it ran out of listener slots then its response will contain "forced_refresh": true.

Bulk requests only take up one slot on each shard that they touch no matter how many times they modify the shard.

Examples

These will create a document and immediately refresh the index so it is visible:

PUT /test/_doc/1?refresh
{"test": "test"}
PUT /test/_doc/2?refresh=true
{"test": "test"}

These will create a document without doing anything to make it visible for search:

PUT /test/_doc/3
{"test": "test"}
PUT /test/_doc/4?refresh=false
{"test": "test"}

This will create a document and wait for it to become visible for search:

PUT /test/_doc/4?refresh=wait_for
{"test": "test"}

Optimistic concurrency control

Elasticsearch is distributed. When documents are created, updated, or deleted, the new version of the document has to be replicated to other nodes in the cluster. Elasticsearch is also asynchronous and concurrent, meaning that these replication requests are sent in parallel, and may arrive at their destination out of sequence. Elasticsearch needs a way of ensuring that an older version of a document never overwrites a newer version.

To ensure an older version of a document doesn’t overwrite a newer version, every operation performed to a document is assigned a sequence number by the primary shard that coordinates that change. The sequence number is increased with each operation and thus newer operations are guaranteed to have a higher sequence number than older operations. Elasticsearch can then use the sequence number of operations to make sure a newer document version is never overridden by a change that has a smaller sequence number assigned to it.

For example, the following indexing command will create a document and assign it an initial sequence number and primary term:

PUT products/_doc/1567
{
    "product" : "r2d2",
    "details" : "A resourceful astromech droid"
}

You can see the assigned sequence number and primary term in the _seq_no and _primary_term fields of the response:

{
    "_shards" : {
        "total" : 2,
        "failed" : 0,
        "successful" : 1
    },
    "_index" : "products",
    "_type" : "_doc",
    "_id" : "1567",
    "_version" : 1,
    "_seq_no" : 362,
    "_primary_term" : 2,
    "result" : "created"
}

Elasticsearch keeps track of the sequence number and primary term of the last operation to have changed each of the documents it stores. The sequence number and primary term are returned in the _seq_no and _primary_term fields in the response of the GET API:

GET products/_doc/1567

returns:

{
    "_index" : "products",
    "_type" : "_doc",
    "_id" : "1567",
    "_version" : 1,
    "_seq_no" : 362,
    "_primary_term" : 2,
    "found": true,
    "_source" : {
        "product" : "r2d2",
        "details" : "A resourceful astromech droid"
    }
}

Note: The Search API can return the _seq_no and _primary_term for each search hit by setting the seq_no_primary_term parameter.
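
A sketch of such a search, reusing the products index from above and assuming the request-body form of the parameter:

POST products/_search
{
  "query": {
    "match": { "product": "r2d2" }
  },
  "seq_no_primary_term": true
}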

The sequence number and the primary term uniquely identify a change. By noting down the sequence number and primary term returned, you can make sure to only change the document if no other change was made to it since you retrieved it. This is done by setting the if_seq_no and if_primary_term parameters of either the Index API or the Delete API.

For example, the following indexing call will make sure to add a tag to the document without losing any potential change to the description or an addition of another tag by another API:

PUT products/_doc/1567?if_seq_no=362&if_primary_term=2
{
    "product" : "r2d2",
    "details" : "A resourceful astromech droid",
    "tags": ["droid"]
}