"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/reference/modules.asciidoc" (29 Dec 2021, 2699 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Cluster

One of the main roles of the master is to decide which shards to allocate to which nodes, and when to move shards between nodes in order to rebalance the cluster.

There are a number of settings available to control the shard allocation process:

Besides these, there are a few other miscellaneous cluster-level settings.

All of the settings in this section are dynamic settings which can be updated on a live cluster with the cluster-update-settings API.

Cluster Level Shard Allocation

Shard allocation is the process of allocating shards to nodes. This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.

Shard Allocation Settings

The following dynamic settings may be used to control shard allocation and recovery:

cluster.routing.allocation.enable

Enable or disable allocation for specific kinds of shards:

  • all - (default) Allows shard allocation for all kinds of shards.

  • primaries - Allows shard allocation only for primary shards.

  • new_primaries - Allows shard allocation only for primary shards for new indices.

  • none - No shard allocations of any kind are allowed for any indices.

This setting does not affect the recovery of local primary shards when restarting a node. A restarted node that has a copy of an unassigned primary shard will recover that primary immediately, assuming that its allocation id matches one of the active allocation ids in the cluster state.

cluster.routing.allocation.node_concurrent_incoming_recoveries

How many concurrent incoming shard recoveries are allowed to happen on a node. Incoming recoveries are the recoveries where the target shard (most likely the replica unless a shard is relocating) is allocated on the node. Defaults to 2.

cluster.routing.allocation.node_concurrent_outgoing_recoveries

How many concurrent outgoing shard recoveries are allowed to happen on a node. Outgoing recoveries are the recoveries where the source shard (most likely the primary unless a shard is relocating) is allocated on the node. Defaults to 2.

cluster.routing.allocation.node_concurrent_recoveries

A shortcut to set both cluster.routing.allocation.node_concurrent_incoming_recoveries and cluster.routing.allocation.node_concurrent_outgoing_recoveries.

cluster.routing.allocation.node_initial_primaries_recoveries

While the recovery of replicas happens over the network, the recovery of an unassigned primary after node restart uses data from the local disk. These should be fast so more initial primary recoveries can happen in parallel on the same node. Defaults to 4.

cluster.routing.allocation.same_shard.host

Allows to perform a check to prevent allocation of multiple instances of the same shard on a single host, based on host name and host address. Defaults to false, meaning that no check is performed by default. This setting only applies if multiple nodes are started on the same machine.

Shard Rebalancing Settings

The following dynamic settings may be used to control the rebalancing of shards across the cluster:

cluster.routing.rebalance.enable

Enable or disable rebalancing for specific kinds of shards:

  • all - (default) Allows shard balancing for all kinds of shards.

  • primaries - Allows shard balancing only for primary shards.

  • replicas - Allows shard balancing only for replica shards.

  • none - No shard balancing of any kind are allowed for any indices.

cluster.routing.allocation.allow_rebalance

Specify when shard rebalancing is allowed:

  • always - Always allow rebalancing.

  • indices_primaries_active - Only when all primaries in the cluster are allocated.

  • indices_all_active - (default) Only when all shards (primaries and replicas) in the cluster are allocated.

cluster.routing.allocation.cluster_concurrent_rebalance

Allow to control how many concurrent shard rebalances are allowed cluster wide. Defaults to 2. Note that this setting only controls the number of concurrent shard relocations due to imbalances in the cluster. This setting does not limit shard relocations due to allocation filtering or forced awareness.

Shard Balancing Heuristics

The following settings are used together to determine where to place each shard. The cluster is balanced when no allowed rebalancing operation can bring the weight of any node closer to the weight of any other node by more than the balance.threshold.

cluster.routing.allocation.balance.shard

Defines the weight factor for the total number of shards allocated on a node (float). Defaults to 0.45f. Raising this raises the tendency to equalize the number of shards across all nodes in the cluster.

cluster.routing.allocation.balance.index

Defines the weight factor for the number of shards per index allocated on a specific node (float). Defaults to 0.55f. Raising this raises the tendency to equalize the number of shards per index across all nodes in the cluster.

cluster.routing.allocation.balance.threshold

Minimal optimization value of operations that should be performed (non negative float). Defaults to 1.0f. Raising this will cause the cluster to be less aggressive about optimizing the shard balance.

Note
Regardless of the result of the balancing algorithm, rebalancing might not be allowed due to forced awareness or allocation filtering.

Disk-based Shard Allocation

{es} considers the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from that node.

Below are the settings that can be configured in the elasticsearch.yml config file or updated dynamically on a live cluster with the cluster-update-settings API:

cluster.routing.allocation.disk.threshold_enabled

Defaults to true. Set to false to disable the disk allocation decider.

cluster.routing.allocation.disk.watermark.low

Controls the low watermark for disk usage. It defaults to 85%, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can also be set to an absolute byte value (like 500mb) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices or, specifically, any shards that have never previously been allocated.

cluster.routing.allocation.disk.watermark.high

Controls the high watermark. It defaults to 90%, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.

cluster.routing.allocation.disk.watermark.flood_stage

Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block must be released manually when the disk utilization falls below the high watermark.

Note
You cannot mix the usage of percentage values and byte values within these settings. Either all values are set to percentage values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.

An example of resetting the read-only index block on the twitter index:

PUT /twitter/_settings
{
  "index.blocks.read_only_allow_delete": null
}
cluster.info.update.interval

How often {es} should check on disk usage for each node in the cluster. Defaults to 30s.

cluster.routing.allocation.disk.include_relocations

Defaults to true, which means that Elasticsearch will take into account shards that are currently being relocated to the target node when computing a node’s disk usage. Taking relocating shards' sizes into account may, however, mean that the disk usage for a node is incorrectly estimated on the high side, since the relocation could be 90% complete and a recently retrieved disk usage would include the total size of the relocating shard as well as the space already used by the running relocation.

Note
Percentage values refer to used disk space, while byte values refer to free disk space. This can be confusing, since it flips the meaning of high and low. For example, it makes sense to set the low watermark to 10gb and the high watermark to 5gb, but not the other way around.

An example of updating the low watermark to at least 100 gigabytes free, a high watermark of at least 50 gigabytes free, and a flood stage watermark of 10 gigabytes free, and updating the information about the cluster every minute:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.disk.watermark.low": "100gb",
    "cluster.routing.allocation.disk.watermark.high": "50gb",
    "cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
    "cluster.info.update.interval": "1m"
  }
}

Shard Allocation Awareness

When running nodes on multiple VMs on the same physical server, on multiple racks, or across multiple zones or domains, it is more likely that two nodes on the same physical server, in the same rack, or in the same zone or domain will crash at the same time, rather than two unrelated nodes crashing simultaneously.

If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimise the risk of losing all shard copies at the same time.

The shard allocation awareness settings allow you to tell Elasticsearch about your hardware configuration.

As an example, let’s assume we have several racks. When we start a node, we can tell it which rack it is in by assigning it an arbitrary metadata attribute called rack_id — we could use any attribute name. For example:

./bin/elasticsearch -Enode.attr.rack_id=rack_one (1)
  1. This setting could also be specified in the elasticsearch.yml config file.

Now, we need to set up shard allocation awareness by telling Elasticsearch which attributes to use. This can be configured in the elasticsearch.yml file on all master-eligible nodes, or it can be set (and changed) with the cluster-update-settings API.

For our example, we’ll set the value in the config file:

cluster.routing.allocation.awareness.attributes: rack_id

With this config in place, let’s say we start two nodes with node.attr.rack_id set to rack_one, and we create an index with 5 primary shards and 1 replica of each primary. All primaries and replicas are allocated across the two nodes.

Now, if we start two more nodes with node.attr.rack_id set to rack_two, Elasticsearch will move shards across to the new nodes, ensuring (if possible) that no two copies of the same shard will be in the same rack. However if rack_two were to fail, taking down both of its nodes, Elasticsearch will still allocate the lost shard copies to nodes in rack_one.

Prefer local shards

When executing search or GET requests, with shard awareness enabled, Elasticsearch will prefer using local shards — shards in the same awareness group — to execute the request. This is usually faster than crossing between racks or across zone boundaries.

Multiple awareness attributes can be specified, in which case each attribute is considered separately when deciding where to allocate the shards.

cluster.routing.allocation.awareness.attributes: rack_id,zone
Note
When using awareness attributes, shards will not be allocated to nodes that don’t have values set for those attributes.
Note
Number of primary/replica of a shard allocated on a specific group of nodes with the same awareness attribute value is determined by the number of attribute values. When the number of nodes in groups is unbalanced and there are many replicas, replica shards may be left unassigned.

Forced Awareness

Imagine that you have two zones and enough hardware across the two zones to host all of your primary and replica shards. But perhaps the hardware in a single zone, while sufficient to host half the shards, would be unable to host ALL the shards.

With ordinary awareness, if one zone lost contact with the other zone, Elasticsearch would assign all of the missing replica shards to a single zone. But in this example, this sudden extra load would cause the hardware in the remaining zone to be overloaded.

Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.

For example, lets say we have an awareness attribute called zone, and we know we are going to have two zones, zone1 and zone2. Here is how we can force awareness on a node:

cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 (1)
cluster.routing.allocation.awareness.attributes: zone
  1. We must list all possible values that the zone attribute can have.

Now, if we start 2 nodes with node.attr.zone set to zone1 and create an index with 5 shards and 1 replica. The index will be created, but only the 5 primary shards will be allocated (with no replicas). Only when we start more nodes with node.attr.zone set to zone2 will the replicas be allocated.

The cluster.routing.allocation.awareness.* settings can all be updated dynamically on a live cluster with the cluster-update-settings API.

Shard Allocation Filtering

While [index-modules-allocation] provides per-index settings to control the allocation of shards to nodes, cluster-level shard allocation filtering allows you to allow or disallow the allocation of shards from any index to particular nodes.

The available dynamic cluster settings are as follows, where {attribute} refers to an arbitrary node attribute.:

cluster.routing.allocation.include.{attribute}

Allocate shards to a node whose {attribute} has at least one of the comma-separated values.

cluster.routing.allocation.require.{attribute}

Only allocate shards to a node whose {attribute} has all of the comma-separated values.

cluster.routing.allocation.exclude.{attribute}

Do not allocate shards to a node whose {attribute} has any of the comma-separated values.

These special attributes are also supported:

_name

Match nodes by node names

_ip

Match nodes by IP addresses (the IP address associated with the hostname)

_host

Match nodes by hostnames

The typical use case for cluster-wide shard allocation filtering is when you want to decommission a node, and you would like to move the shards from that node to other nodes in the cluster before shutting it down.

For instance, we could decommission a node using its IP address as follows:

PUT _cluster/settings
{
  "transient" : {
    "cluster.routing.allocation.exclude._ip" : "10.0.0.1"
  }
}
Note
Shards will only be relocated if it is possible to do so without breaking another routing constraint, such as never allocating a primary and replica shard to the same node.

In addition to listing multiple values as a comma-separated list, all attribute values can be specified with wildcards, eg:

PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": "192.168.2.*"
  }
}

Miscellaneous cluster settings

Metadata

An entire cluster may be set to read-only with the following dynamic setting:

cluster.blocks.read_only

Make the whole cluster read only (indices do not accept write operations), metadata is not allowed to be modified (create or delete indices).

cluster.blocks.read_only_allow_delete

Identical to cluster.blocks.read_only but allows to delete indices to free up resources.

Warning
Don’t rely on this setting to prevent changes to your cluster. Any user with access to the cluster-update-settings API can make the cluster read-write again.

Cluster Shard Limit

In a Elasticsearch 7.0 and later, there will be a soft limit on the number of shards in a cluster, based on the number of nodes in the cluster. This is intended to prevent operations which may unintentionally destabilize the cluster. Prior to 7.0, actions which would result in the cluster going over the limit will issue a deprecation warning.

Note
You can set the system property es.enforce_max_shards_per_node to true to opt in to strict enforcement of the shard limit. If this system property is set, actions which would result in the cluster going over the limit will result in an error, rather than a deprecation warning. This property will be removed in Elasticsearch 7.0, as strict enforcement of the limit will be the default and only behavior.
Important
This limit is intended as a safety net, not a sizing recommendation. The exact number of shards your cluster can safely support depends on your hardware configuration and workload, but should remain well below this limit in almost all cases, as the default limit is set quite high.

If an operation, such as creating a new index, restoring a snapshot of an index, or opening a closed index would lead to the number of shards in the cluster going over this limit, the operation will issue a deprecation warning.

If the cluster is already over the limit, due to changes in node membership or setting changes, all operations that create or open indices will issue warnings until either the limit is increased as described below, or some indices are closed or deleted to bring the number of shards below the limit.

Replicas count towards this limit, but closed indexes do not. An index with 5 primary shards and 2 replicas will be counted as 15 shards. Any closed index is counted as 0, no matter how many shards and replicas it contains.

The limit defaults to 1,000 shards per data node, and be dynamically adjusted using the following property:

cluster.max_shards_per_node

Controls the number of shards allowed in the cluster per data node.

For example, a 3-node cluster with the default setting would allow 3,000 shards total, across all open indexes. If the above setting is changed to 500, then the cluster would allow 1,500 shards total.

User Defined Cluster Metadata

User-defined metadata can be stored and retrieved using the Cluster Settings API. This can be used to store arbitrary, infrequently-changing data about the cluster without the need to create an index to store it. This data may be stored using any key prefixed with cluster.metadata.. For example, to store the email address of the administrator of a cluster under the key cluster.metadata.administrator, issue this request:

PUT /_cluster/settings
{
  "persistent": {
    "cluster.metadata.administrator": "sysadmin@example.com"
  }
}
Important
User-defined cluster metadata is not intended to store sensitive or confidential information. Any information stored in user-defined cluster metadata will be viewable by anyone with access to the Cluster Get Settings API, and is recorded in the {es} logs.

Index Tombstones

The cluster state maintains index tombstones to explicitly denote indices that have been deleted. The number of tombstones maintained in the cluster state is controlled by the following property, which cannot be updated dynamically:

cluster.indices.tombstones.size

Index tombstones prevent nodes that are not part of the cluster when a delete occurs from joining the cluster and reimporting the index as though the delete was never issued. To keep the cluster state from growing huge we only keep the last cluster.indices.tombstones.size deletes, which defaults to 500. You can increase it if you expect nodes to be absent from the cluster and miss more than 500 deletes. We think that is rare, thus the default. Tombstones don’t take up much space, but we also think that a number like 50,000 is probably too big.

Logger

The settings which control logging can be updated dynamically with the logger. prefix. For instance, to increase the logging level of the indices.recovery module to DEBUG, issue this request:

PUT /_cluster/settings
{
  "transient": {
    "logger.org.elasticsearch.indices.recovery": "DEBUG"
  }
}

Persistent Tasks Allocations

Plugins can create a kind of tasks called persistent tasks. Those tasks are usually long-live tasks and are stored in the cluster state, allowing the tasks to be revived after a full cluster restart.

Every time a persistent task is created, the master node takes care of assigning the task to a node of the cluster, and the assigned node will then pick up the task and execute it locally. The process of assigning persistent tasks to nodes is controlled by the following properties, which can be updated dynamically:

cluster.persistent_tasks.allocation.enable

Enable or disable allocation for persistent tasks:

  • all - (default) Allows persistent tasks to be assigned to nodes

  • none - No allocations are allowed for any type of persistent task

This setting does not affect the persistent tasks that are already being executed. Only newly created persistent tasks, or tasks that must be reassigned (after a node left the cluster, for example), are impacted by this setting.

cluster.persistent_tasks.allocation.recheck_interval

The master node will automatically check whether persistent tasks need to be assigned when the cluster state changes significantly. However, there may be other factors, such as memory usage, that affect whether persistent tasks can be assigned to nodes but do not cause the cluster state to change. This setting controls how often assignment checks are performed to react to these factors. The default is 30 seconds. The minimum permitted value is 10 seconds.

Discovery

The discovery module is responsible for discovering nodes within a cluster, as well as electing a master node.

Note, Elasticsearch is a peer to peer based system, nodes communicate with one another directly if operations are delegated / broadcast. All the main APIs (index, delete, search) do not communicate with the master node. The responsibility of the master node is to maintain the global cluster state, and act if nodes join or leave the cluster by reassigning shards. Each time a cluster state is changed, the state is made known to the other nodes in the cluster (the manner depends on the actual discovery implementation).

Settings

The cluster.name allows to create separated clusters from one another. The default value for the cluster name is elasticsearch, though it is recommended to change this to reflect the logical group name of the cluster running.

Azure Classic Discovery

Azure classic discovery allows to use the Azure Classic APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See {plugins}/discovery-azure-classic.html[discovery-azure-classic] for more information.

EC2 Discovery

EC2 discovery is available as a plugin. See {plugins}/discovery-ec2.html[discovery-ec2] for more information.

Google Compute Engine Discovery

Google Compute Engine (GCE) discovery allows to use the GCE APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See {plugins}/discovery-gce.html[discovery-gce] for more information.

Zen Discovery

Zen discovery is the built-in, default, discovery module for Elasticsearch. It provides unicast and file-based discovery, and can be extended to support cloud environments and other forms of discovery via plugins.

Zen discovery is integrated with other modules, for example, all communication between nodes is done using the transport module.

It is separated into several sub modules, which are explained below:

Ping

This is the process where a node uses the discovery mechanisms to find other nodes.

Seed nodes

Zen discovery uses a list of seed nodes in order to start off the discovery process. At startup, or when electing a new master, Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete picture of the cluster. By default there are two methods for configuring the list of seed nodes: unicast and file-based. It is recommended that the list of seed nodes comprises the list of master-eligible nodes in the cluster.

Unicast

Unicast discovery configures a static list of hosts for use as seed nodes. These hosts can be specified as hostnames or IP addresses; hosts specified as hostnames are resolved to IP addresses during each round of pinging. Note that if you are in an environment where DNS resolutions vary with time, you might need to adjust your JVM security settings.

The list of hosts is set using the discovery.zen.ping.unicast.hosts static setting. This is either an array of hosts or a comma-delimited string. Each value should be in the form of host:port or host (where port defaults to the setting transport.profiles.default.port falling back to transport.port if not set). Note that IPv6 hosts must be bracketed. The default for this setting is 127.0.0.1, [::1]

Additionally, the discovery.zen.ping.unicast.resolve_timeout configures the amount of time to wait for DNS lookups on each round of pinging. This is specified as a time unit and defaults to 5s.

Unicast discovery uses the transport module to perform the discovery.

File-based

In addition to hosts provided by the static discovery.zen.ping.unicast.hosts setting, it is possible to provide a list of hosts via an external file. Elasticsearch reloads this file when it changes, so that the list of seed nodes can change dynamically without needing to restart each node. For example, this gives a convenient mechanism for an Elasticsearch instance that is run in a Docker container to be dynamically supplied with a list of IP addresses to connect to for Zen discovery when those IP addresses may not be known at node startup.

To enable file-based discovery, configure the file hosts provider as follows:

discovery.zen.hosts_provider: file

Then create a file at $ES_PATH_CONF/unicast_hosts.txt in the format described below. Any time a change is made to the unicast_hosts.txt file the new changes will be picked up by Elasticsearch and the new hosts list will be used.

Note that the file-based discovery plugin augments the unicast hosts list in elasticsearch.yml: if there are valid unicast host entries in discovery.zen.ping.unicast.hosts then they will be used in addition to those supplied in unicast_hosts.txt.

The discovery.zen.ping.unicast.resolve_timeout setting also applies to DNS lookups for nodes specified by address via file-based discovery. This is specified as a time unit and defaults to 5s.

The format of the file is to specify one node entry per line. Each node entry consists of the host (host name or IP address) and an optional transport port number. If the port number is specified, is must come immediately after the host (on the same line) separated by a :. If the port number is not specified, a default value of 9300 is used.

For example, this is an example of unicast_hosts.txt for a cluster with four nodes that participate in unicast discovery, some of which are not running on the default port:

10.10.10.5
10.10.10.6:9305
10.10.10.5:10005
# an IPv6 address
[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301

Host names are allowed instead of IP addresses (similar to discovery.zen.ping.unicast.hosts), and IPv6 addresses must be specified in brackets with the port coming after the brackets.

It is also possible to add comments to this file. All comments must appear on their lines starting with # (i.e. comments cannot start in the middle of a line).

Master Election

As part of the ping process a master of the cluster is either elected or joined to. This is done automatically. The discovery.zen.ping_timeout (which defaults to 3s) determines how long the node will wait before deciding on starting an election or joining an existing cluster. Three pings will be sent over this timeout interval. In case where no decision can be reached after the timeout, the pinging process restarts. In slow or congested networks, three seconds might not be enough for a node to become aware of the other nodes in its environment before making an election decision. Increasing the timeout should be done with care in that case, as it will slow down the election process. Once a node decides to join an existing formed cluster, it will send a join request to the master (discovery.zen.join_timeout) with a timeout defaulting at 20 times the ping timeout.

When the master node stops or has encountered a problem, the cluster nodes start pinging again and will elect a new master. This pinging round also serves as a protection against (partial) network failures where a node may unjustly think that the master has failed. In this case the node will simply hear from other nodes about the currently active master.

If discovery.zen.master_election.ignore_non_master_pings is true, pings from nodes that are not master eligible (nodes where node.master is false) are ignored during master election; the default value is false.

Nodes can be excluded from becoming a master by setting node.master to false.

The discovery.zen.minimum_master_nodes sets the minimum number of master eligible nodes that need to join a newly elected master in order for an election to complete and for the elected node to accept its mastership. The same setting controls the minimum number of active master eligible nodes that should be a part of any active cluster. If this requirement is not met the active master node will step down and a new master election will begin.

This setting must be set to a quorum of your master eligible nodes. It is recommended to avoid having only two master eligible nodes, since a quorum of two is two. Therefore, a loss of either master eligible node will result in an inoperable cluster.

Fault Detection

There are two fault detection processes running. The first is by the master, to ping all the other nodes in the cluster and verify that they are alive. And on the other end, each node pings to master to verify if its still alive or an election process needs to be initiated.

The following settings control the fault detection process using the discovery.zen.fd prefix:

Setting Description

ping_interval

How often a node gets pinged. Defaults to 1s.

ping_timeout

How long to wait for a ping response, defaults to 30s.

ping_retries

How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.

Cluster state updates

The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one cluster state update at a time, applies the required changes and publishes the updated cluster state to all the other nodes in the cluster. Each node receives the publish message, acknowledges it, but does not yet apply it. If the master does not receive acknowledgement from at least discovery.zen.minimum_master_nodes nodes within a certain time (controlled by the discovery.zen.commit_timeout setting which defaults to 30 seconds, with negative values treated as 0 seconds) the cluster state change is rejected.

Once enough nodes have responded, the cluster state is committed and a message will be sent to all the nodes. The nodes then proceed to apply the new cluster state to their internal state. The master node waits for all nodes to respond, up to a timeout, before going ahead processing the next updates in the queue. The discovery.zen.publish_timeout is set by default to 30 seconds and is measured from the moment the publishing started. Both timeout settings can be changed dynamically through the cluster update settings api

No master block

For the cluster to be fully operational, it must have an active master and the number of running master eligible nodes must satisfy the discovery.zen.minimum_master_nodes setting if set. The discovery.zen.no_master_block settings controls what operations should be rejected when there is no active master.

The discovery.zen.no_master_block setting has two valid options:

all

All operations on the node—​i.e. both read & writes—​will be rejected. This also applies for api cluster state read or write operations, like the get index settings, put mapping and cluster state api.

write

(default) Write operations will be rejected. Read operations will succeed, based on the last known cluster configuration. This may result in partial reads of stale data as this node may be isolated from the rest of the cluster.

The discovery.zen.no_master_block setting doesn’t apply to nodes-based apis (for example cluster stats, node info and node stats apis). Requests to these apis will not be blocked and can run on any available node.

Single-node discovery

The discovery.type setting specifies whether {es} should form a multiple-node cluster. By default, {es} discovers other nodes when forming a cluster and allows other nodes to join the cluster later. If discovery.type is set to single-node, {es} forms a single-node cluster. For more information about when you might use this setting, see Bootstrap checks.

Local Gateway

The local gateway module stores the cluster state and shard data across full cluster restarts.

The following static settings, which must be set on every master node, control how long a freshly elected master should wait before it tries to recover the cluster state and the cluster’s data:

gateway.expected_nodes

The number of (data or master) nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of nodes have joined the cluster. Defaults to 0

gateway.expected_master_nodes

The number of master nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of master nodes have joined the cluster. Defaults to 0

gateway.expected_data_nodes

The number of data nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of data nodes have joined the cluster. Defaults to 0

gateway.recover_after_time

If the expected number of nodes is not achieved, the recovery process waits for the configured amount of time before trying to recover regardless. Defaults to 5m if one of the expected_nodes settings is configured.

Once the recover_after_time duration has timed out, recovery will start as long as the following conditions are met:

gateway.recover_after_nodes

Recover as long as this many data or master nodes have joined the cluster.

gateway.recover_after_master_nodes

Recover as long as this many master nodes have joined the cluster.

gateway.recover_after_data_nodes

Recover as long as this many data nodes have joined the cluster.

Note
These settings only take effect on a full cluster restart.

HTTP

The http module allows to expose Elasticsearch APIs over HTTP.

The http mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. The benefit of using asynchronous communication for HTTP is solving the C10k problem.

When possible, consider using HTTP keep alive when connecting for better performance and try to get your favorite client not to do HTTP chunking.

Settings

The settings in the table below can be configured for HTTP. Note that none of them are dynamically updatable so for them to take effect they should be set in the Elasticsearch configuration file.

Setting Description

http.port

A bind port range. Defaults to 9200-9300.

http.publish_port

The port that HTTP clients should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the http.port is not directly addressable from the outside. Defaults to the actual port assigned via http.port.

http.bind_host

The host address to bind the HTTP service to. Defaults to http.host (if set) or network.bind_host.

http.publish_host

The host address to publish for HTTP clients to connect to. Defaults to http.host (if set) or network.publish_host.

http.host

Used to set the http.bind_host and the http.publish_host Defaults to http.host or network.host.

http.max_content_length

The max content of an HTTP request. Defaults to 100mb. If set to greater than Integer.MAX_VALUE, it will be reset to 100mb.

http.max_initial_line_length

The max length of an HTTP URL. Defaults to 4kb

http.max_header_size

The max size of allowed headers. Defaults to 8kB

http.compression

Support for compression when possible (with Accept-Encoding). Defaults to true.

http.compression_level

Defines the compression level to use for HTTP responses. Valid values are in the range of 1 (minimum compression) and 9 (maximum compression). Defaults to 3.

http.cors.enabled

Enable or disable cross-origin resource sharing, i.e. whether a browser on another origin can execute requests against Elasticsearch. Set to true to enable Elasticsearch to process pre-flight CORS requests. Elasticsearch will respond to those requests with the Access-Control-Allow-Origin header if the Origin sent in the request is permitted by the http.cors.allow-origin list. Set to false (the default) to make Elasticsearch ignore the Origin request header, effectively disabling CORS requests because Elasticsearch will never respond with the Access-Control-Allow-Origin response header. Note that if the client does not send a pre-flight request with an Origin header or it does not check the response headers from the server to validate the Access-Control-Allow-Origin response header, then cross-origin security is compromised. If CORS is not enabled on Elasticsearch, the only way for the client to know is to send a pre-flight request and realize the required response headers are missing.

http.cors.allow-origin

Which origins to allow. Defaults to no origins allowed. If you prepend and append a / to the value, this will be treated as a regular expression, allowing you to support HTTP and HTTPs. for example using /https?:\/\/localhost(:[0-9]+)?/ would return the request header appropriately in both cases. is a valid value but is considered a *security risk as your Elasticsearch instance is open to cross origin requests from anywhere.

http.cors.max-age

Browsers send a "preflight" OPTIONS-request to determine CORS settings. max-age defines how long the result should be cached for. Defaults to 1728000 (20 days)

http.cors.allow-methods

Which methods to allow. Defaults to OPTIONS, HEAD, GET, POST, PUT, DELETE.

http.cors.allow-headers

Which headers to allow. Defaults to X-Requested-With, Content-Type, Content-Length.

http.cors.allow-credentials

Whether the Access-Control-Allow-Credentials header should be returned. Note: This header is only returned, when the setting is set to true. Defaults to false

http.detailed_errors.enabled

Enables or disables the output of detailed error messages and stack traces in response output. Note: When set to false and the error_trace request parameter is specified, an error will be returned; when error_trace is not specified, a simple message will be returned. Defaults to true

http.pipelining

Enable or disable HTTP pipelining, defaults to true.

http.pipelining.max_events

The maximum number of events to be queued up in memory before an HTTP connection is closed, defaults to 10000.

http.max_warning_header_count

The maximum number of warning headers in client HTTP responses, defaults to unbounded.

http.max_warning_header_size

The maximum total size of warning headers in client HTTP responses, defaults to unbounded.

It also uses the common network settings.

Disable HTTP

The http module can be completely disabled and not started by setting http.enabled to false. Elasticsearch nodes (and Java clients) communicate internally using the transport interface, not HTTP. It might make sense to disable the http layer entirely on nodes which are not meant to serve REST requests directly. For instance, you could disable HTTP on data-only nodes if you also have client nodes which are intended to serve all REST requests. Be aware, however, that you will not be able to send any REST requests (eg to retrieve node stats) directly to nodes which have HTTP disabled.

Indices

The indices module controls index-related settings that are globally managed for all indices, rather than being configurable at a per-index level.

Available settings include:

Circuit breaker

Circuit breakers set limits on memory usage to avoid out of memory exceptions.

Fielddata cache

Set limits on the amount of heap used by the in-memory fielddata cache.

Node query cache

Configure the amount heap used to cache queries results.

Indexing buffer

Control the size of the buffer allocated to the indexing process.

Shard request cache

Control the behaviour of the shard-level request cache.

Recovery

Control the resource limits on the shard recovery process.

Search Settings

Control global search settings.

Circuit Breaker

Elasticsearch contains multiple circuit breakers used to prevent operations from causing an OutOfMemoryError. Each breaker specifies a limit for how much memory it can use. Additionally, there is a parent-level breaker that specifies the total amount of memory that can be used across all breakers.

These settings can be dynamically updated on a live cluster with the cluster-update-settings API.

Parent circuit breaker

The parent-level breaker can be configured with the following setting:

indices.breaker.total.limit

Starting limit for overall parent breaker, defaults to 70% of JVM heap.

Field data circuit breaker

The field data circuit breaker allows Elasticsearch to estimate the amount of memory a field will require to be loaded into memory. It can then prevent the field data loading by raising an exception. By default the limit is configured to 60% of the maximum JVM heap. It can be configured with the following parameters:

indices.breaker.fielddata.limit

Limit for fielddata breaker, defaults to 60% of JVM heap

indices.breaker.fielddata.overhead

A constant that all field data estimations are multiplied with to determine a final estimation. Defaults to 1.03

Request circuit breaker

The request circuit breaker allows Elasticsearch to prevent per-request data structures (for example, memory used for calculating aggregations during a request) from exceeding a certain amount of memory.

indices.breaker.request.limit

Limit for request breaker, defaults to 60% of JVM heap

indices.breaker.request.overhead

A constant that all request estimations are multiplied with to determine a final estimation. Defaults to 1

In flight requests circuit breaker

The in flight requests circuit breaker allows Elasticsearch to limit the memory usage of all currently active incoming requests on transport or HTTP level from exceeding a certain amount of memory on a node. The memory usage is based on the content length of the request itself.

network.breaker.inflight_requests.limit

Limit for in flight requests breaker, defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.

network.breaker.inflight_requests.overhead

A constant that all in flight requests estimations are multiplied with to determine a final estimation. Defaults to 1

Accounting requests circuit breaker

The accounting circuit breaker allows Elasticsearch to limit the memory usage of things held in memory that are not released when a request is completed. This includes things like the Lucene segment memory.

indices.breaker.accounting.limit

Limit for accounting breaker, defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.

indices.breaker.accounting.overhead

A constant that all accounting estimations are multiplied with to determine a final estimation. Defaults to 1

Script compilation circuit breaker

Slightly different than the previous memory-based circuit breaker, the script compilation circuit breaker limits the number of inline script compilations within a period of time.

See the "prefer-parameters" section of the scripting documentation for more information.

script.max_compilations_rate

Limit for the number of unique dynamic scripts within a certain interval that are allowed to be compiled. Defaults to 75/5m, meaning 75 every 5 minutes.

Fielddata

The field data cache is used mainly when sorting on or computing aggregations on a field. It loads all the field values to memory in order to provide fast document based access to those values. The field data cache can be expensive to build for a field, so its recommended to have enough memory to allocate it, and to keep it loaded.

The amount of memory used for the field data cache can be controlled using indices.fielddata.cache.size. Note: reloading the field data which does not fit into your cache will be expensive and perform poorly.

indices.fielddata.cache.size

The max size of the field data cache, eg 30% of node heap space, or an absolute value, eg 12GB. Defaults to unbounded. Also see Field data circuit breaker.

Note
These are static settings which must be configured on every data node in the cluster.

Monitoring field data

You can monitor memory usage for field data as well as the field data circuit breaker using Nodes Stats API

Node Query Cache

The query cache is responsible for caching the results of queries. There is one queries cache per node that is shared by all shards. The cache implements an LRU eviction policy: when a cache becomes full, the least recently used data is evicted to make way for new data. It is not possible to look at the contents being cached.

The query cache only caches queries which are being used in a filter context.

The following setting is static and must be configured on every data node in the cluster:

indices.queries.cache.size

Controls the memory size for the filter cache , defaults to 10%. Accepts either a percentage value, like 5%, or an exact value, like 512mb.

The following setting is an index setting that can be configured on a per-index basis:

index.queries.cache.enabled

Controls whether to enable query caching. Accepts true (default) or false.

Indexing Buffer

The indexing buffer is used to store newly indexed documents. When it fills up, the documents in the buffer are written to a segment on disk. It is divided between all shards on the node.

The following settings are static and must be configured on every data node in the cluster:

indices.memory.index_buffer_size

Accepts either a percentage or a byte size value. It defaults to 10%, meaning that 10% of the total heap allocated to a node will be used as the indexing buffer size shared across all shards.

indices.memory.min_index_buffer_size

If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute minimum. Defaults to 48mb.

indices.memory.max_index_buffer_size

If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute maximum. Defaults to unbounded.

Shard request cache

When a search request is run against an index or against many indices, each involved shard executes the search locally and returns its local results to the coordinating node, which combines these shard-level results into a ``global'' result set.

The shard-level request cache module caches the local results on each shard. This allows frequently used (and potentially heavy) search requests to return results almost instantly. The requests cache is a very good fit for the logging use case, where only the most recent index is being actively updated — results from older indices will be served directly from the cache.

Important

By default, the requests cache will only cache the results of search requests where size=0, so it will not cache hits, but it will cache hits.total, aggregations, and suggestions.

Most queries that use now (see [date-math]) cannot be cached.

Cache invalidation

The cache is smart — it keeps the same near real-time promise as uncached search.

Cached results are invalidated automatically whenever the shard refreshes, but only if the data in the shard has actually changed. In other words, you will always get the same results from the cache as you would for an uncached search request.

The longer the refresh interval, the longer that cached entries will remain valid. If the cache is full, the least recently used cache keys will be evicted.

The cache can be expired manually with the clear-cache API:

POST /kimchy,elasticsearch/_cache/clear?request=true

Enabling and disabling caching

The cache is enabled by default, but can be disabled when creating a new index as follows:

PUT /my_index
{
  "settings": {
    "index.requests.cache.enable": false
  }
}

It can also be enabled or disabled dynamically on an existing index with the update-settings API:

PUT /my_index/_settings
{ "index.requests.cache.enable": true }

Enabling and disabling caching per request

The request_cache query-string parameter can be used to enable or disable caching on a per-request basis. If set, it overrides the index-level setting:

GET /my_index/_search?request_cache=true
{
  "size": 0,
  "aggs": {
    "popular_colors": {
      "terms": {
        "field": "colors"
      }
    }
  }
}
Important
If your query uses a script whose result is not deterministic (e.g. it uses a random function or references the current time) you should set the request_cache flag to false to disable caching for that request.

Requests where size is greater than 0 will not be cached even if the request cache is enabled in the index settings. To cache these requests you will need to use the query-string parameter detailed here.

Cache key

The whole JSON body is used as the cache key. This means that if the JSON changes — for instance if keys are output in a different order — then the cache key will not be recognised.

Tip
Most JSON libraries support a canonical mode which ensures that JSON keys are always emitted in the same order. This canonical mode can be used in the application to ensure that a request is always serialized in the same way.

Cache settings

The cache is managed at the node level, and has a default maximum size of 1% of the heap. This can be changed in the config/elasticsearch.yml file with:

indices.requests.cache.size: 2%

Also, you can use the indices.requests.cache.expire setting to specify a TTL for cached results, but there should be no reason to do so. Remember that stale results are automatically invalidated when the index is refreshed. This setting is provided for completeness' sake only.

Monitoring cache usage

The size of the cache (in bytes) and the number of evictions can be viewed by index, with the indices-stats API:

GET /_stats/request_cache?human

or by node with the nodes-stats API:

GET /_nodes/stats/indices/request_cache?human

Indices Recovery

Peer recovery syncs data from a primary shard to a new or existing shard copy.

Peer recovery automatically occurs when {es}:

  • Recreates a shard lost during node failure

  • Relocates a shard to another node due to a cluster rebalance or changes to the shard allocation settings

You can view a list of in-progress and completed recoveries using the cat recovery API.

Peer recovery settings

indices.recovery.max_bytes_per_sec (Dynamic)

Limits total inbound and outbound recovery traffic for each node. Defaults to 40mb.

This limit applies to nodes only. If multiple nodes in a cluster perform recoveries at the same time, the cluster’s total recovery traffic may exceed this limit.

If this limit is too high, ongoing recoveries may consume an excess of bandwidth and other resources, which can destabilize the cluster.

Expert peer recovery settings

You can use the following expert setting to manage resources for peer recoveries.

indices.recovery.max_concurrent_file_chunks (Dynamic, Expert)

Number of file chunk requests sent in parallel for each recovery. Defaults to 2.

You can increase the value of this setting when the recovery of a single shard is not reaching the traffic limit set by indices.recovery.max_bytes_per_sec.

Search Settings

The following expert setting can be set to manage global search limits.

indices.query.bool.max_clause_count

Defaults to 1024.

This setting limits the number of clauses a Lucene BooleanQuery can have. The default of 1024 is quite high and should normally be sufficient. This limit does not only affect Elasticsearchs bool query, but many other queries are rewritten to Lucene’s BooleanQuery internally. The limit is in place to prevent searches from becoming to large and taking up too much CPU and memory. In case you consider to increase this setting, make sure you exhausted all other options to avoid having to do this. Higher values can lead to performance degradations and memory issues, especially in clusters with a high load or few resources.

Network Settings

Elasticsearch binds to localhost only by default. This is sufficient for you to run a local development server (or even a development cluster, if you start multiple nodes on the same machine), but you will need to configure some basic network settings in order to run a real production cluster across multiple servers.

Warning
Be careful with the network configuration!

Never expose an unprotected node to the public internet.

Commonly Used Network Settings

network.host

The node will bind to this hostname or IP address and publish (advertise) this host to other nodes in the cluster. Accepts an IP address, hostname, a special value, or an array of any combination of these. Note that any values containing a : (e.g., an IPv6 address or containing one of the special value) must be quoted because : is a special character in YAML. 0.0.0.0 is an acceptable IP address and will bind to all network interfaces. The value 0 has the same effect as the value 0.0.0.0.

Defaults to local.

discovery.zen.ping.unicast.hosts

In order to join a cluster, a node needs to know the hostname or IP address of at least some of the other nodes in the cluster. This setting provides the initial list of other nodes that this node will try to contact. Accepts IP addresses or hostnames. If a hostname lookup resolves to multiple IP addresses then each IP address will be used for discovery. Round robin DNS — returning a different IP from a list on each lookup — can be used for discovery; non- existent IP addresses will throw exceptions and cause another DNS lookup on the next round of pinging (subject to JVM DNS caching).

Defaults to ["127.0.0.1", "[::1]"].

http.port

Port to bind to for incoming HTTP requests. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range.

Defaults to 9200-9300.

transport.port

Port to bind for communication between nodes. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range.

Defaults to 9300-9400.

Special values for network.host

The following special values may be passed to network.host:

[networkInterface]

Addresses of a network interface, for example en0.

local

Any loopback addresses on the system, for example 127.0.0.1.

site

Any site-local addresses on the system, for example 192.168.0.1.

global

Any globally-scoped addresses on the system, for example 8.8.8.8.

IPv4 vs IPv6

These special values will work over both IPv4 and IPv6 by default, but you can also limit this with the use of :ipv4 of :ipv6 specifiers. For example, en0:ipv4 would only bind to the IPv4 addresses of interface en0.

Tip
Discovery in the cloud

More special settings are available when running in the cloud with either the {plugins}/discovery-ec2.html[EC2 discovery plugin] or the {plugins}/discovery-gce-network-host.html#discovery-gce-network-host[Google Compute Engine discovery plugin] installed.

Advanced network settings

The network.host setting explained in Commonly used network settings is a shortcut which sets the bind host and the publish host at the same time. In advanced used cases, such as when running behind a proxy server, you may need to set these settings to different values:

network.bind_host

This specifies which network interface(s) a node should bind to in order to listen for incoming requests. A node can bind to multiple interfaces, e.g. two network cards, or a site-local address and a local address. Defaults to network.host.

network.publish_host

The publish host is the single interface that the node advertises to other nodes in the cluster, so that those nodes can connect to it. Currently an Elasticsearch node may be bound to multiple addresses, but only publishes one. If not specified, this defaults to the `best'' address from `network.host, sorted by IPv4/IPv6 stack preference, then by reachability. If you set a network.host that results in multiple bind addresses yet rely on a specific address for node-to-node communication, you should explicitly set network.publish_host.

Both of the above settings can be configured just like network.host — they accept IP addresses, host names, and special values.

Advanced TCP Settings

Any component that uses TCP (like the HTTP and Transport modules) share the following settings:

network.tcp.no_delay

Enable or disable the TCP no delay setting. Defaults to true.

network.tcp.keep_alive

Enable or disable TCP keep alive. Defaults to true.

network.tcp.reuse_address

Should an address be reused or not. Defaults to true on non-windows machines.

network.tcp.send_buffer_size

The size of the TCP send buffer (specified with size units). By default not explicitly set.

network.tcp.receive_buffer_size

The size of the TCP receive buffer (specified with size units). By default not explicitly set.

Transport and HTTP protocols

An Elasticsearch node exposes two network protocols which inherit the above settings, but may be further configured independently:

TCP Transport

Used for communication between nodes in the cluster, by the Java {javaclient}/transport-client.html[Transport client] and by the Tribe node. See the Transport module for more information.

HTTP

Exposes the JSON-over-HTTP interface used by all clients other than the Java clients. See the HTTP module for more information.

Node

Any time that you start an instance of Elasticsearch, you are starting a node. A collection of connected nodes is called a cluster. If you are running a single node of Elasticsearch, then you have a cluster of one node.

Every node in the cluster can handle HTTP and Transport traffic by default. The transport layer is used exclusively for communication between nodes and the {javaclient}/transport-client.html[Java TransportClient]; the HTTP layer is used only by external REST clients.

All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node.

By default, a node is all of the following types: master-eligible, data, ingest, and machine learning (if available).

Tip
As the cluster grows and in particular if you have large {ml} jobs, consider separating dedicated master-eligible nodes from dedicated data nodes and dedicated {ml} nodes.
Master-eligible node

A node that has node.master set to true (default), which makes it eligible to be elected as the master node, which controls the cluster.

Data node

A node that has node.data set to true (default). Data nodes hold data and perform data related operations such as CRUD, search, and aggregations.

Ingest node

A node that has node.ingest set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.

Tribe node

A tribe node, configured via the tribe.* settings, is a special type of coordinating only node that can connect to multiple clusters and perform search and other operations across all connected clusters.

By default a node is a master-eligible node and a data node, plus it can pre-process documents through ingest pipelines. This is very convenient for small clusters but, as the cluster grows, it becomes important to consider separating dedicated master-eligible nodes from dedicated data nodes.

Machine learning node

A node that has xpack.ml.enabled and node.ml set to true, which is the default behavior in the {es} {default-dist}. If you want to use {ml-features}, there must be at least one {ml} node in your cluster. For more information about {ml-features}, see {ml-docs}/xpack-ml.html[Machine learning in the {stack}].

Important
If you use the {oss-dist}, do not set node.ml. Otherwise, the node fails to start.
Note
Coordinating node

Requests like search requests or bulk-indexing requests may involve data held on different data nodes. A search request, for example, is executed in two phases which are coordinated by the node which receives the client request — the coordinating node.

In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node. In the gather phase, the coordinating node reduces each data node’s results into a single global resultset.

Every node is implicitly a coordinating node. This means that a node that has all three node.master, node.data and node.ingest set to false will only act as a coordinating node, which cannot be disabled. As a result, such a node needs to have enough memory and CPU in order to deal with the gather phase.

Master Eligible Node

The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.

Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process.

Important
Master nodes must have access to the data/ directory (just like data nodes) as this is where the cluster state is persisted between node restarts.

Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.

While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.

To create a dedicated master-eligible node in the {default-dist}, set:

node.master: true (1)
node.data: false (2)
node.ingest: false (3)
node.ml: false (4)
xpack.ml.enabled: true (5)
cluster.remote.connect: false (6)
  1. The node.master role is enabled by default.

  2. Disable the node.data role (enabled by default).

  3. Disable the node.ingest role (enabled by default).

  4. Disable the node.ml role (enabled by default).

  5. The xpack.ml.enabled setting is enabled by default.

  6. Disable remote cluster connections (enabled by default).

To create a dedicated master-eligible node in the {oss-dist}, set:

node.master: true (1)
node.data: false (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
  1. The node.master role is enabled by default.

  2. Disable the node.data role (enabled by default).

  3. Disable the node.ingest role (enabled by default).

  4. Disable remote cluster connections (enabled by default).

Avoiding split brain with minimum_master_nodes

To prevent data loss, it is vital to configure the discovery.zen.minimum_master_nodes setting (which defaults to 1) so that each master-eligible node knows the minimum number of master-eligible nodes that must be visible in order to form a cluster.

To explain, imagine that you have a cluster consisting of two master-eligible nodes. A network failure breaks communication between these two nodes. Each node sees one master-eligible node…​ itself. With minimum_master_nodes set to the default of 1, this is sufficient to form a cluster. Each node elects itself as the new master (thinking that the other master-eligible node has died) and the result is two clusters, or a split brain. These two nodes will never rejoin until one node is restarted. Any data that has been written to the restarted node will be lost.

Now imagine that you have a cluster with three master-eligible nodes, and minimum_master_nodes set to 2. If a network split separates one node from the other two nodes, the side with one node cannot see enough master-eligible nodes and will realise that it cannot elect itself as master. The side with two nodes will elect a new master (if needed) and continue functioning correctly. As soon as the network split is resolved, the single node will rejoin the cluster and start serving requests again.

This setting should be set to a quorum of master-eligible nodes:

(master_eligible_nodes / 2) + 1

In other words, if there are three master-eligible nodes, then minimum master nodes should be set to (3 / 2) + 1 or 2:

discovery.zen.minimum_master_nodes: 2 (1)
  1. Defaults to 1.

To be able to remain available when one of the master-eligible nodes fails, clusters should have at least three master-eligible nodes, with minimum_master_nodes set accordingly. A rolling upgrade, performed without any downtime, also requires at least three master-eligible nodes to avoid the possibility of data loss if a network split occurs while the upgrade is in progress.

This setting can also be changed dynamically on a live cluster with the cluster update settings API:

PUT _cluster/settings
{
  "transient": {
    "discovery.zen.minimum_master_nodes": 2
  }
}
Tip
An advantage of splitting the master and data roles between dedicated nodes is that you can have just three master-eligible nodes and set minimum_master_nodes to 2. You never have to change this setting, no matter how many dedicated data nodes you add to the cluster.

Data Node

Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.

The main benefit of having dedicated data nodes is the separation of the master and data roles.

To create a dedicated data node in the {default-dist}, set:

node.master: false (1)
node.data: true (2)
node.ingest: false (3)
node.ml: false (4)
cluster.remote.connect: false (5)
  1. Disable the node.master role (enabled by default).

  2. The node.data role is enabled by default.

  3. Disable the node.ingest role (enabled by default).

  4. Disable the node.ml role (enabled by default).

  5. Disable remote cluster connections (enabled by default).

To create a dedicated data node in the {oss-dist}, set:

node.master: false (1)
node.data: true (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
  1. Disable the node.master role (enabled by default).

  2. The node.data role is enabled by default.

  3. Disable the node.ingest role (enabled by default).

  4. Disable remote cluster connections (enabled by default).

Ingest Node

Ingest nodes can execute pre-processing pipelines, composed of one or more ingest processors. Depending on the type of operations performed by the ingest processors and the required resources, it may make sense to have dedicated ingest nodes, that will only perform this specific task.

To create a dedicated ingest node in the {default-dist}, set:

node.master: false (1)
node.data: false (2)
node.ingest: true (3)
node.ml: false (4)
cluster.remote.connect: false (5)
  1. Disable the node.master role (enabled by default).

  2. Disable the node.data role (enabled by default).

  3. The node.ingest role is enabled by default.

  4. Disable the node.ml role (enabled by default).

  5. Disable remote cluster connections (enabled by default).

To create a dedicated ingest node in the {oss-dist}, set:

node.master: false (1)
node.data: false (2)
node.ingest: true (3)
cluster.remote.connect: false (4)
  1. Disable the node.master role (enabled by default).

  2. Disable the node.data role (enabled by default).

  3. The node.ingest role is enabled by default.

  4. Disable remote cluster connections (enabled by default).

Coordinating only node

If you take away the ability to be able to handle master duties, to hold data, and pre-process documents, then you are left with a coordinating node that can only route requests, handle the search reduce phase, and distribute bulk indexing. Essentially, coordinating only nodes behave as smart load balancers.

Coordinating only nodes can benefit large clusters by offloading the coordinating node role from data and master-eligible nodes. They join the cluster and receive the full cluster state, like every other node, and they use the cluster state to route requests directly to the appropriate place(s).

Warning
Adding too many coordinating only nodes to a cluster can increase the burden on the entire cluster because the elected master node must await acknowledgement of cluster state updates from every node! The benefit of coordinating only nodes should not be overstated — data nodes can happily serve the same purpose.

To create a dedicated coordinating node in the {default-dist}, set:

node.master: false (1)
node.data: false (2)
node.ingest: false (3)
node.ml: false (4)
cluster.remote.connect: false (5)
  1. Disable the node.master role (enabled by default).

  2. Disable the node.data role (enabled by default).

  3. Disable the node.ingest role (enabled by default).

  4. Disable the node.ml role (enabled by default).

  5. Disable remote cluster connections (enabled by default).

To create a dedicated coordinating node in the {oss-dist}, set:

node.master: false (1)
node.data: false (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
  1. Disable the node.master role (enabled by default).

  2. Disable the node.data role (enabled by default).

  3. Disable the node.ingest role (enabled by default).

  4. Disable remote cluster connections (enabled by default).

Machine learning node

The {ml-features} provide {ml} nodes, which run jobs and handle {ml} API requests. If xpack.ml.enabled is set to true and node.ml is set to false, the node can service API requests but it cannot run jobs.

If you want to use {ml-features} in your cluster, you must enable {ml} (set xpack.ml.enabled to true) on all master-eligible nodes. If you have the {oss-dist}, do not use these settings.

For more information about these settings, see [ml-settings].

To create a dedicated {ml} node in the {default-dist}, set:

node.master: false (1)
node.data: false (2)
node.ingest: false (3)
node.ml: true (4)
xpack.ml.enabled: true (5)
cluster.remote.connect: false (6)
  1. Disable the node.master role (enabled by default).

  2. Disable the node.data role (enabled by default).

  3. Disable the node.ingest role (enabled by default).

  4. The node.ml role is enabled by default.

  5. The xpack.ml.enabled setting is enabled by default.

  6. Disable remote cluster connections (enabled by default).

Node data path settings

path.data

Every data and master-eligible node requires access to a data directory where shards and index and cluster metadata will be stored. The path.data defaults to $ES_HOME/data but can be configured in the elasticsearch.yml config file an absolute path or a path relative to $ES_HOME as follows:

path.data:  /var/elasticsearch/data

Like all node settings, it can also be specified on the command line as:

./bin/elasticsearch -Epath.data=/var/elasticsearch/data
Tip
When using the .zip or .tar.gz distributions, the path.data setting should be configured to locate the data directory outside the Elasticsearch home directory, so that the home directory can be deleted without deleting your data! The RPM and Debian distributions do this for you already.

node.max_local_storage_nodes

The data path can be shared by multiple nodes, even by nodes from different clusters. This is very useful for testing failover and different configurations on your development machine. In production, however, it is recommended to run only one node of Elasticsearch per server.

By default, Elasticsearch is configured to prevent more than one node from sharing the same data path. To allow for more than one node (e.g., on your development machine), use the setting node.max_local_storage_nodes and set this to a positive integer larger than one.

Warning
Never run different node types (i.e. master, data) from the same data directory. This can lead to unexpected data loss.

Other node settings

More node settings can be found in Modules. Of particular note are the cluster.name, the node.name and the network settings.

Plugins

Plugins

Plugins are a way to enhance the basic Elasticsearch functionality in a custom manner. They range from adding custom mapping types, custom analyzers (in a more built in fashion), custom script engines, custom discovery and more.

See the {plugins}/index.html[Plugins documentation] for more.

Snapshot And Restore

A snapshot is a backup taken from a running Elasticsearch cluster. You can take a snapshot of individual indices or of the entire cluster and store it in a repository on a shared filesystem, and there are plugins that support remote repositories on S3, HDFS, Azure, Google Cloud Storage and more.

Snapshots are taken incrementally. This means that when creating a snapshot of an index Elasticsearch will avoid copying any data that is already stored in the repository as part of an earlier snapshot of the same index. Therefore it can be efficient to take snapshots of your cluster quite frequently.

Snapshots can be restored into a running cluster via the restore API. When restoring an index it is possible to alter the name of the restored index as well as some of its settings, allowing a great deal of flexibility in how the snapshot and restore functionality can be used.

Warning
It is not possible to back up an Elasticsearch cluster simply by taking a copy of the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running, and this means that copying its data directories cannot be expected to capture a consistent picture of their contents. Attempts to restore a cluster from such a backup may fail, reporting corruption and/or missing files, or may appear to have succeeded having silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality.

Version compatibility

Important
Version compatibility refers to the underlying Lucene index compatibility. Follow the Upgrade documentation when migrating between versions.

A snapshot contains a copy of the on-disk data structures that make up an index. This means that snapshots can only be restored to versions of Elasticsearch that can read the indices:

  • A snapshot of an index created in 5.x can be restored to 6.x.

  • A snapshot of an index created in 2.x can be restored to 5.x.

  • A snapshot of an index created in 1.x can be restored to 2.x.

Conversely, snapshots of indices created in 1.x cannot be restored to 5.x or 6.x, and snapshots of indices created in 2.x cannot be restored to 6.x.

Each snapshot can contain indices created in various versions of Elasticsearch, and when restoring a snapshot it must be possible to restore all of the indices into the target cluster. If any indices in a snapshot were created in an incompatible version, you will not be able restore the snapshot.

Important
When backing up your data prior to an upgrade, keep in mind that you won’t be able to restore snapshots after you upgrade if they contain indices created in a version that’s incompatible with the upgrade version.

If you end up in a situation where you need to restore a snapshot of an index that is incompatible with the version of the cluster you are currently running, you can restore it on the latest compatible version and use reindex-from-remote to rebuild the index on the current version. Reindexing from remote is only possible if the original index has source enabled. Retrieving and reindexing the data can take significantly longer than simply restoring a snapshot. If you have a large amount of data, we recommend testing the reindex from remote process with a subset of your data to understand the time requirements before proceeding.

Repositories

You must register a snapshot repository before you can perform snapshot and restore operations. We recommend creating a new snapshot repository for each major version. The valid repository settings depend on the repository type.

If you register same snapshot repository with multiple clusters, only one cluster should have write access to the repository. All other clusters connected to that repository should set the repository to readonly mode.

Important
The snapshot format can change across major versions, so if you have clusters on different versions trying to write the same repository, snapshots written by one version may not be visible to the other and the repository could be corrupted. While setting the repository to readonly on all but one of the clusters should work with multiple clusters differing by one major version, it is not a supported configuration.
PUT /_snapshot/my_backup
{
  "type": "fs",
  "settings": {
    "location": "my_backup_location"
  }
}

To retrieve information about a registered repository, use a GET request:

GET /_snapshot/my_backup

which returns:

{
  "my_backup": {
    "type": "fs",
    "settings": {
      "location": "my_backup_location"
    }
  }
}

To retrieve information about multiple repositories, specify a comma-delimited list of repositories. You can also use the * wildcard when specifying repository names. For example, the following request retrieves information about all of the snapshot repositories that start with repo or contain backup:

GET /_snapshot/repo*,*backup*

To retrieve information about all registered snapshot repositories, omit the repository name or specify _all:

GET /_snapshot

or

GET /_snapshot/_all
Shared File System Repository

The shared file system repository ("type": "fs") uses the shared file system to store snapshots. In order to register the shared file system repository it is necessary to mount the same shared filesystem to the same location on all master and data nodes. This location (or one of its parent directories) must be registered in the path.repo setting on all master and data nodes.

Assuming that the shared filesystem is mounted to /mount/backups/my_fs_backup_location, the following setting should be added to elasticsearch.yml file:

path.repo: ["/mount/backups", "/mount/longterm_backups"]

The path.repo setting supports Microsoft Windows UNC paths as long as at least server name and share are specified as a prefix and back slashes are properly escaped:

path.repo: ["\\\\MY_SERVER\\Snapshots"]

After all nodes are restarted, the following command can be used to register the shared file system repository with the name my_fs_backup:

PUT /_snapshot/my_fs_backup
{
    "type": "fs",
    "settings": {
        "location": "/mount/backups/my_fs_backup_location",
        "compress": true
    }
}

If the repository location is specified as a relative path this path will be resolved against the first path specified in path.repo:

PUT /_snapshot/my_fs_backup
{
    "type": "fs",
    "settings": {
        "location": "my_fs_backup_location",
        "compress": true
    }
}

The following settings are supported:

location

Location of the snapshots. Mandatory.

compress

Turns on compression of the snapshot files. Compression is applied only to metadata files (index mapping and settings). Data files are not compressed. Defaults to false.

chunk_size

Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 1GB, 10MB, 5KB, 500B. Defaults to null (unlimited chunk size).

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

Read-only URL Repository

The URL repository ("type": "url") can be used as an alternative read-only way to access data created by the shared file system repository. The URL specified in the url parameter should point to the root of the shared filesystem repository. The following settings are supported:

url

Location of the snapshots. Mandatory.

URL Repository supports the following protocols: "http", "https", "ftp", "file" and "jar". URL repositories with http:, https:, and ftp: URLs has to be whitelisted by specifying allowed URLs in the repositories.url.allowed_urls setting. This setting supports wildcards in the place of host, path, query, and fragment. For example:

repositories.url.allowed_urls: ["http://www.example.org/root/*", "https://*.mydomain.com/*?*#*"]

URL repositories with file: URLs can only point to locations registered in the path.repo setting similar to shared file system repository.

Source Only Repository

A source repository enables you to create minimal, source-only snapshots that take up to 50% less space on disk. Source only snapshots contain stored fields and index metadata. They do not include index or doc values structures and are not searchable when restored. After restoring a source-only snapshot, you must reindex the data into a new index.

Source repositories delegate to another snapshot repository for storage.

Important

Source only snapshots are only supported if the _source field is enabled and no source-filtering is applied. When you restore a source only snapshot:

  • The restored index is read-only and can only serve match_all search or scroll requests to enable reindexing.

  • Queries other than match_all and _get requests are not supported.

  • The mapping of the restored index is empty, but the original mapping is available from the types top level meta element.

When you create a source repository, you must specify the type and name of the delegate repository where the snapshots will be stored:

PUT _snapshot/my_src_only_repository
{
  "type": "source",
  "settings": {
    "delegate_type": "fs",
    "location": "my_backup_location"
  }
}
Repository plugins

Other repository backends are available in these official plugins:

  • {plugins}/repository-s3.html[repository-s3] for S3 repository support

  • {plugins}/repository-hdfs.html[repository-hdfs] for HDFS repository support in Hadoop environments

  • {plugins}/repository-azure.html[repository-azure] for Azure storage repositories

  • {plugins}/repository-gcs.html[repository-gcs] for Google Cloud Storage repositories

Repository Verification

When a repository is registered, it’s immediately verified on all master and data nodes to make sure that it is functional on all nodes currently present in the cluster. The verify parameter can be used to explicitly disable the repository verification when registering or updating a repository:

PUT /_snapshot/my_unverified_backup?verify=false
{
  "type": "fs",
  "settings": {
    "location": "my_unverified_backup_location"
  }
}

The verification process can also be executed manually by running the following command:

POST /_snapshot/my_unverified_backup/_verify

It returns a list of nodes where repository was successfully verified or an error message if verification process failed.

Snapshot

A repository can contain multiple snapshots of the same cluster. Snapshots are identified by unique names within the cluster. A snapshot with the name snapshot_1 in the repository my_backup can be created by executing the following command:

PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true

The wait_for_completion parameter specifies whether or not the request should return immediately after snapshot initialization (default) or wait for snapshot completion. During snapshot initialization, information about all previous snapshots is loaded into the memory, which means that in large repositories it may take several seconds (or even minutes) for this command to return even if the wait_for_completion parameter is set to false.

By default a snapshot of all open and started indices in the cluster is created. This behavior can be changed by specifying the list of indices in the body of the snapshot request.

PUT /_snapshot/my_backup/snapshot_2?wait_for_completion=true
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": false
}

The list of indices that should be included into the snapshot can be specified using the indices parameter that supports multi index syntax. The snapshot request also supports the ignore_unavailable option. Setting it to true will cause indices that do not exist to be ignored during snapshot creation. By default, when ignore_unavailable option is not set and an index is missing the snapshot request will fail. By setting include_global_state to false it’s possible to prevent the cluster global state to be stored as part of the snapshot. By default, the entire snapshot will fail if one or more indices participating in the snapshot don’t have all primary shards available. This behaviour can be changed by setting partial to true.

Snapshot names can be automatically derived using date math expressions, similarly as when creating new indices. Note that special characters need to be URI encoded.

For example, creating a snapshot with the current day in the name, like snapshot-2018.05.11, can be achieved with the following command:

# PUT /_snapshot/my_backup/<snapshot-{now/d}>
PUT /_snapshot/my_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E

The index snapshot process is incremental. In the process of making the index snapshot Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. That allows multiple snapshots to be preserved in the repository in a compact form. Snapshotting process is executed in non-blocking fashion. All indexing and searching operation can continue to be executed against the index that is being snapshotted. However, a snapshot represents the point-in-time view of the index at the moment when snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot. The snapshot process starts immediately for the primary shards that has been started and are not relocating at the moment. Before version 1.2.0, the snapshot operation fails if the cluster has any relocating or initializing primaries of indices participating in the snapshot. Starting with version 1.2.0, Elasticsearch waits for relocation or initialization of shards to complete before snapshotting them.

Besides creating a copy of each index the snapshot process can also store global cluster metadata, which includes persistent cluster settings and templates. The transient settings and registered snapshot repositories are not stored as part of the snapshot.

Only one snapshot process can be executed in the cluster at any time. While snapshot of a particular shard is being created this shard cannot be moved to another node, which can interfere with rebalancing process and allocation filtering. Elasticsearch will only be able to move a shard to another node (according to the current allocation filtering settings and rebalancing algorithm) once the snapshot is finished.

Once a snapshot is created information about this snapshot can be obtained using the following command:

GET /_snapshot/my_backup/snapshot_1

This command returns basic information about the snapshot including start and end time, version of Elasticsearch that created the snapshot, the list of included indices, the current state of the snapshot and the list of failures that occurred during the snapshot. The snapshot state can be

IN_PROGRESS

The snapshot is currently running.

SUCCESS

The snapshot finished and all shards were stored successfully.

FAILED

The snapshot finished with an error and failed to store any data.

PARTIAL

The global cluster state was stored, but data of at least one shard wasn’t stored successfully. The failure section in this case should contain more detailed information about shards that were not processed correctly.

INCOMPATIBLE

The snapshot was created with an old version of Elasticsearch and therefore is incompatible with the current version of the cluster.

Similar as for repositories, information about multiple snapshots can be queried in one go, supporting wildcards as well:

GET /_snapshot/my_backup/snapshot_*,some_other_snapshot

All snapshots currently stored in the repository can be listed using the following command:

GET /_snapshot/my_backup/_all

The command fails if some of the snapshots are unavailable. The boolean parameter ignore_unavailable can be used to return all snapshots that are currently available.

Getting all snapshots in the repository can be costly on cloud-based repositories, both from a cost and performance perspective. If the only information required is the snapshot names/uuids in the repository and the indices in each snapshot, then the optional boolean parameter verbose can be set to false to execute a more performant and cost-effective retrieval of the snapshots in the repository. Note that setting verbose to false will omit all other information about the snapshot such as status information, the number of snapshotted shards, etc. The default value of the verbose parameter is true.

A currently running snapshot can be retrieved using the following command:

GET /_snapshot/my_backup/_current

A snapshot can be deleted from the repository using the following command:

DELETE /_snapshot/my_backup/snapshot_2

When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots. If the deleted snapshot operation is executed while the snapshot is being created the snapshotting process will be aborted and all files created as part of the snapshotting process will be cleaned. Therefore, the delete snapshot operation can be used to cancel long running snapshot operations that were started by mistake.

A repository can be unregistered using the following command:

DELETE /_snapshot/my_backup

When a repository is unregistered, Elasticsearch only removes the reference to the location where the repository is storing the snapshots. The snapshots themselves are left untouched and in place.

Restore

A snapshot can be restored using the following command:

POST /_snapshot/my_backup/snapshot_1/_restore

By default, all indices in the snapshot are restored, and the cluster state is not restored. It’s possible to select indices that should be restored as well as to allow the global cluster state from being restored by using indices and include_global_state options in the restore request body. The list of indices supports multi index syntax. The rename_pattern and rename_replacement options can be also used to rename indices on restore using regular expression that supports referencing the original text as explained here. Set include_aliases to false to prevent aliases from being restored together with associated indices

POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "index_1,index_2",
  "ignore_unavailable": true,
  "include_global_state": true,
  "rename_pattern": "index_(.+)",
  "rename_replacement": "restored_index_$1"
}

The restore operation can be performed on a functioning cluster. However, an existing index can be only restored if it’s closed and has the same number of shards as the index in the snapshot. The restore operation automatically opens restored indices if they were closed and creates new indices if they didn’t exist in the cluster. If cluster state is restored with include_global_state (defaults to false), the restored templates that don’t currently exist in the cluster are added and existing templates with the same name are replaced by the restored templates. The restored persistent settings are added to the existing persistent settings.

Partial restore

By default, the entire restore operation will fail if one or more indices participating in the operation don’t have snapshots of all shards available. It can occur if some shards failed to snapshot for example. It is still possible to restore such indices by setting partial to true. Please note, that only successfully snapshotted shards will be restored in this case and all missing shards will be recreated empty.

Changing index settings during restore

Most of index settings can be overridden during the restore process. For example, the following command will restore the index index_1 without creating any replicas while switching back to default refresh interval:

POST /_snapshot/my_backup/snapshot_1/_restore
{
  "indices": "index_1",
  "index_settings": {
    "index.number_of_replicas": 0
  },
  "ignore_index_settings": [
    "index.refresh_interval"
  ]
}

Please note, that some settings such as index.number_of_shards cannot be changed during restore operation.

Restoring to a different cluster

The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer (only 1 major version newer) than the cluster that was used to create the snapshot. For example, you can restore a 1.x snapshot to a 2.x cluster, but not a 1.x snapshot to a 5.x cluster.

If the new cluster has a smaller size additional considerations should be made. First of all it’s necessary to make sure that new cluster have enough capacity to store all indices in the snapshot. It’s possible to change indices settings during restore to reduce the number of replicas, which can help with restoring snapshots into smaller cluster. It’s also possible to select only subset of the indices using the indices parameter.

If indices in the original cluster were assigned to particular nodes using shard allocation filtering, the same rules will be enforced in the new cluster. Therefore if the new cluster doesn’t contain nodes with appropriate attributes that a restored index can be allocated on, such index will not be successfully restored unless these index allocation settings are changed during restore operation.

The restore operation also checks that restored persistent settings are compatible with the current cluster to avoid accidentally restoring an incompatible settings such as discovery.zen.minimum_master_nodes and as a result disable a smaller cluster until the required number of master eligible nodes is added. If you need to restore a snapshot with incompatible persistent settings, try restoring it without the global cluster state.

Snapshot status

A list of currently running snapshots with their detailed status information can be obtained using the following command:

GET /_snapshot/_status

In this format, the command will return information about all currently running snapshots. By specifying a repository name, it’s possible to limit the results to a particular repository:

GET /_snapshot/my_backup/_status

If both repository name and snapshot id are specified, this command will return detailed status information for the given snapshot even if it’s not currently running:

GET /_snapshot/my_backup/snapshot_1/_status

The output looks similar to the following:

{
  "snapshots": [
    {
      "snapshot": "snapshot_1",
      "repository": "my_backup",
      "uuid": "XuBo4l4ISYiVg0nYUen9zg",
      "state": "SUCCESS",
      "include_global_state": true,
      "shards_stats": {
        "initializing": 0,
        "started": 0,
        "finalizing": 0,
        "done": 5,
        "failed": 0,
        "total": 5
      },
      "stats": {
        "incremental": {
          "file_count": 8,
          "size_in_bytes": 4704
        },
        "processed": {
          "file_count": 7,
          "size_in_bytes": 4254
        },
        "total": {
          "file_count": 8,
          "size_in_bytes": 4704
        },
        "start_time_in_millis": 1526280280355,
        "time_in_millis": 358,

        "number_of_files": 8,
        "processed_files": 8,
        "total_size_in_bytes": 4704,
        "processed_size_in_bytes": 4704
      }
    }
  ]
}

The output is composed of different sections. The stats sub-object provides details on the number and size of files that were snapshotted. As snapshots are incremental, copying only the Lucene segments that are not already in the repository, the stats object contains a total section for all the files that are referenced by the snapshot, as well as an incremental section for those files that actually needed to be copied over as part of the incremental snapshotting. In case of a snapshot that’s still in progress, there’s also a processed section that contains information about the files that are in the process of being copied.

Note: Properties number_of_files, processed_files, total_size_in_bytes and processed_size_in_bytes are used for backward compatibility reasons with older 5.x and 6.x versions. These fields will be removed in Elasticsearch v7.0.0.

Multiple ids are also supported:

GET /_snapshot/my_backup/snapshot_1,snapshot_2/_status

Monitoring snapshot/restore progress

There are several ways to monitor the progress of the snapshot and restores processes while they are running. Both operations support wait_for_completion parameter that would block client until the operation is completed. This is the simplest method that can be used to get notified about operation completion.

The snapshot operation can be also monitored by periodic calls to the snapshot info:

GET /_snapshot/my_backup/snapshot_1

Please note that snapshot info operation uses the same resources and thread pool as the snapshot operation. So, executing a snapshot info operation while large shards are being snapshotted can cause the snapshot info operation to wait for available resources before returning the result. On very large shards the wait time can be significant.

To get more immediate and complete information about snapshots the snapshot status command can be used instead:

GET /_snapshot/my_backup/snapshot_1/_status

While snapshot info method returns only basic information about the snapshot in progress, the snapshot status returns complete breakdown of the current state for each shard participating in the snapshot.

The restore process piggybacks on the standard recovery mechanism of the Elasticsearch. As a result, standard recovery monitoring services can be used to monitor the state of restore. When restore operation is executed the cluster typically goes into red state. It happens because the restore operation starts with "recovering" primary shards of the restored indices. During this operation the primary shards become unavailable which manifests itself in the red cluster state. Once recovery of primary shards is completed Elasticsearch is switching to standard replication process that creates the required number of replicas at this moment cluster switches to the yellow state. Once all required replicas are created, the cluster switches to the green states.

The cluster health operation provides only a high level status of the restore process. It’s possible to get more detailed insight into the current state of the recovery process by using indices recovery and cat recovery APIs.

Stopping currently running snapshot and restore operations

The snapshot and restore framework allows running only one snapshot or one restore operation at a time. If a currently running snapshot was executed by mistake, or takes unusually long, it can be terminated using the snapshot delete operation. The snapshot delete operation checks if the deleted snapshot is currently running and if it does, the delete operation stops that snapshot before deleting the snapshot data from the repository.

DELETE /_snapshot/my_backup/snapshot_1

The restore operation uses the standard shard recovery mechanism. Therefore, any currently running restore operation can be canceled by deleting indices that are being restored. Please note that data for all deleted indices will be removed from the cluster as a result of this operation.

Effect of cluster blocks on snapshot and restore operations

Many snapshot and restore operations are affected by cluster and index blocks. For example, registering and unregistering repositories require write global metadata access. The snapshot operation requires that all indices and their metadata as well as the global metadata were readable. The restore operation requires the global metadata to be writable, however the index level blocks are ignored during restore because indices are essentially recreated during restore. Please note that a repository content is not part of the cluster and therefore cluster blocks don’t affect internal repository operations such as listing or deleting snapshots from an already registered repository.

Thread Pool

A node holds several thread pools in order to improve how threads memory consumption are managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.

There are several thread pools, but the important ones include:

generic

For generic operations (e.g., background node discovery). Thread pool type is scaling.

index

For index/delete operations. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

search

For count/search/suggest operations. Thread pool type is fixed_auto_queue_size with a size of int((# of available_processors * 3) / 2) + 1, and initial queue_size of 1000.

search_throttled

For count/search/suggest/get operations on search_throttled indices. Thread pool type is fixed_auto_queue_size with a size of 1, and initial queue_size of 100.

get

For get operations. Thread pool type is fixed with a size of # of available processors, queue_size of 1000.

analyze

For analyze requests. Thread pool type is fixed with a size of 1, queue size of 16.

write

For single-document index/delete/update and bulk requests. Thread pool type is fixed with a size of # of available processors, queue_size of 200. The maximum size for this pool is 1 + # of available processors.

snapshot

For snapshot/restore operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).

warmer

For segment warm-up operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(5, (# of available processors)/2).

refresh

For refresh operations. Thread pool type is scaling with a keep-alive of 5m and a max of min(10, (# of available processors)/2).

listener

Mainly for java client executing of action when listener threaded is set to true. Thread pool type is scaling with a default max of min(10, (# of available processors)/2).

Changing a specific thread pool can be done by setting its type-specific parameters; for example, changing the index thread pool to have more threads:

thread_pool:
    index:
        size: 30

Thread pool types

The following are the types of thread pools and their respective parameters:

fixed

The fixed thread pool holds a fixed size of threads to handle the requests with a queue (optionally bounded) for pending requests that have no threads to service them.

The size parameter controls the number of threads, and defaults to the number of cores times 5.

The queue_size allows to control the size of the queue of pending requests that have no threads to execute them. By default, it is set to -1 which means its unbounded. When a request comes in and the queue is full, it will abort the request.

thread_pool:
    index:
        size: 30
        queue_size: 1000

fixed_auto_queue_size

experimental[]

The fixed_auto_queue_size thread pool holds a fixed size of threads to handle the requests with a bounded queue for pending requests that have no threads to service them. It’s similar to the fixed threadpool, however, the queue_size automatically adjusts according to calculations based on Little’s Law. These calculations will potentially adjust the queue_size up or down by 50 every time auto_queue_frame_size operations have been completed.

The size parameter controls the number of threads, and defaults to the number of cores times 5.

The queue_size allows to control the initial size of the queue of pending requests that have no threads to execute them.

The min_queue_size setting controls the minimum amount the queue_size can be adjusted to.

The max_queue_size setting controls the maximum amount the queue_size can be adjusted to.

The auto_queue_frame_size setting controls the number of operations during which measurement is taken before the queue is adjusted. It should be large enough that a single operation cannot unduly bias the calculation.

The target_response_time is a time value setting that indicates the targeted average response time for tasks in the thread pool queue. If tasks are routinely above this time, the thread pool queue will be adjusted down so that tasks are rejected.

thread_pool:
    search:
        size: 30
        queue_size: 500
        min_queue_size: 10
        max_queue_size: 1000
        auto_queue_frame_size: 2000
        target_response_time: 1s

scaling

The scaling thread pool holds a dynamic number of threads. This number is proportional to the workload and varies between the value of the core and max parameters.

The keep_alive parameter determines how long a thread should be kept around in the thread pool without it doing any work.

thread_pool:
    warmer:
        core: 1
        max: 8
        keep_alive: 2m

Processors setting

The number of processors is automatically detected, and the thread pool settings are automatically set based on it. In some cases it can be useful to override the number of detected processors. This can be done by explicitly setting the processors setting.

processors: 2

There are a few use-cases for explicitly overriding the processors setting:

  1. If you are running multiple instances of Elasticsearch on the same host but want Elasticsearch to size its thread pools as if it only has a fraction of the CPU, you should override the processors setting to the desired fraction (e.g., if you’re running two instances of Elasticsearch on a 16-core machine, set processors to 8). Note that this is an expert-level use-case and there’s a lot more involved than just setting the processors setting as there are other considerations like changing the number of garbage collector threads, pinning processes to cores, etc.

  2. Sometimes the number of processors is wrongly detected and in such cases explicitly setting the processors setting will workaround such issues.

In order to check the number of processors detected, use the nodes info API with the os flag.

Transport

The transport module is used for internal communication between nodes within the cluster. Each call that goes from one node to the other uses the transport module (for example, when an HTTP GET request is processed by one node, and should actually be processed by another node that holds the data). The transport module is also used for the TransportClient in the {es} Java API.

The transport mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. The benefit of using asynchronous communication is first solving the C10k problem, as well as being the ideal solution for scatter (broadcast) / gather operations such as search in Elasticsearch.

Transport Settings

The internal transport communicates over TCP. You can configure it with the following settings:

Setting Description

transport.port

A bind port range. Defaults to 9300-9400.

transport.publish_port

The port that other nodes in the cluster should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the transport.port is not directly addressable from the outside. Defaults to the actual port assigned via transport.port.

transport.bind_host

The host address to bind the transport service to. Defaults to transport.host (if set) or network.bind_host.

transport.publish_host

The host address to publish for nodes in the cluster to connect to. Defaults to transport.host (if set) or network.publish_host.

transport.host

Used to set the transport.bind_host and the transport.publish_host Defaults to transport.host or network.host.

transport.connect_timeout

The connect timeout for initiating a new connection (in time setting format). Defaults to 30s.

transport.compress

Set to true to enable compression (DEFLATE) between all nodes. Defaults to false.

transport.ping_schedule

Schedule a regular application-level ping message to ensure that transport connections between nodes are kept alive. Defaults to 5s in the transport client and -1 (disabled) elsewhere. It is preferable to correctly configure TCP keep-alives instead of using this feature, because TCP keep-alives apply to all kinds of long-lived connections and not just to transport connections.

It also uses the common network settings.

Transport Profiles

Elasticsearch allows you to bind to multiple ports on different interfaces by the use of transport profiles. See this example configuration

transport.profiles.default.port: 9300-9400
transport.profiles.default.bind_host: 10.0.0.1
transport.profiles.client.port: 9500-9600
transport.profiles.client.bind_host: 192.168.0.1
transport.profiles.dmz.port: 9700-9800
transport.profiles.dmz.bind_host: 172.16.1.2

The default profile is special. It is used as a fallback for any other profiles, if those do not have a specific configuration setting set, and is how this node connects to other nodes in the cluster.

The following parameters can be configured on each transport profile, as in the example above:

  • port: The port to bind to

  • bind_host: The host to bind

  • publish_host: The host which is published in informational APIs

  • tcp.no_delay: Configures the TCP_NO_DELAY option for this socket

  • tcp.keep_alive: Configures the SO_KEEPALIVE option for this socket

  • tcp.reuse_address: Configures the SO_REUSEADDR option for this socket

  • tcp.send_buffer_size: Configures the send buffer size of the socket

  • tcp.receive_buffer_size: Configures the receive buffer size of the socket

Long-lived idle connections

Elasticsearch opens a number of long-lived TCP connections between each pair of nodes in the cluster, and some of these connections may be idle for an extended period of time. Nonetheless, Elasticsearch requires these connections to remain open, and it can disrupt the operation of the cluster if any inter-node connections are closed by an external influence such as a firewall. It is important to configure your network to preserve long-lived idle connections between Elasticsearch nodes, for instance by leaving tcp.keep_alive enabled and ensuring that the keepalive interval is shorter than any timeout that might cause idle connections to be closed, or by setting transport.ping_schedule if keepalives cannot be configured.

Transport Compression

Request Compression

By default, the transport.compress setting is false and network-level request compression is disabled between nodes in the cluster. This default normally makes sense for local cluster communication as compression has a noticeable CPU cost and local clusters tend to be set up with fast network connections between nodes.

The transport.compress setting always configures local cluster request compression and is the fallback setting for remote cluster request compression. If you want to configure remote request compression differently than local request compression, you can set it on a per-remote cluster basis using the cluster.remote.${cluster_alias}.transport.compress setting.

Response Compression

The compression settings do not configure compression for responses. {es} will compress a response if the inbound request was compressed—​even when compression is not enabled. Similarly, {es} will not compress a response if the inbound request was uncompressed—​even when compression is enabled.

Transport Tracer

The transport module has a dedicated tracer logger which, when activated, logs incoming and out going requests. The log can be dynamically activated by settings the level of the org.elasticsearch.transport.TransportService.tracer logger to TRACE:

PUT _cluster/settings
{
   "transient" : {
      "logger.org.elasticsearch.transport.TransportService.tracer" : "TRACE"
   }
}

You can also control which actions will be traced, using a set of include and exclude wildcard patterns. By default every request will be traced except for fault detection pings:

PUT _cluster/settings
{
   "transient" : {
      "transport.tracer.include" : "*",
      "transport.tracer.exclude" : "internal:discovery/zen/fd*"
   }
}

Tribe node

deprecated[5.4.0,The tribe node is deprecated in favour of {ccs-cap} and will be removed in Elasticsearch 7.0.]

The tribes feature allows a tribe node to act as a federated client across multiple clusters.

The tribe node works by retrieving the cluster state from all connected clusters and merging them into a global cluster state. With this information at hand, it is able to perform read and write operations against the nodes in all clusters as if they were local. Note that a tribe node needs to be able to connect to each single node in every configured cluster.

The elasticsearch.yml config file for a tribe node just needs to list the clusters that should be joined, for instance:

tribe:
    t1: (1)
        cluster.name:   cluster_one
    t2: (1)
        cluster.name:   cluster_two
  1. t1 and t2 are arbitrary names representing the connection to each cluster.

The example above configures connections to two clusters, name t1 and t2 respectively. The tribe node will create a node client to connect each cluster using unicast discovery by default. Any other settings for the connection can be configured under tribe.{name}, just like the cluster.name in the example.

The merged global cluster state means that almost all operations work in the same way as a single cluster: distributed search, suggest, percolation, indexing, etc.

However, there are a few exceptions:

  • The merged view cannot handle indices with the same name in multiple clusters. By default it will pick one of them, see later for on_conflict options.

  • Master level read operations (eg [cluster-state], [cluster-health]) will automatically execute with a local flag set to true since there is no master.

  • Master level write operations (eg [indices-create-index]) are not allowed. These should be performed on a single cluster.

The tribe node can be configured to block all write operations and all metadata operations with:

tribe:
    blocks:
        write:    true
        metadata: true

The tribe node can also configure blocks on selected indices:

tribe:
    blocks:
        write.indices:    hk*,ldn*
        metadata.indices: hk*,ldn*

When there is a conflict and multiple clusters hold the same index, by default the tribe node will pick one of them. This can be configured using the tribe.on_conflict setting. It defaults to any, but can be set to drop (drop indices that have a conflict), or prefer_[tribeName] to prefer the index from a specific tribe.

Tribe node settings

The tribe node starts a node client for each listed cluster. The following configuration options are passed down from the tribe node to each node client:

  • node.name (used to derive the node.name for each node client)

  • network.host

  • network.bind_host

  • network.publish_host

  • transport.host

  • transport.bind_host

  • transport.publish_host

  • path.home

  • path.logs

  • shield.*

Almost any setting (except for path.*) may be configured at the node client level itself, in which case it will override any passed through setting from the tribe node. Settings you may want to set at the node client level include:

  • network.host

  • network.bind_host

  • network.publish_host

  • transport.host

  • transport.bind_host

  • transport.publish_host

  • cluster.name

  • discovery.zen.ping.unicast.hosts

network.host:   192.168.1.5 (1)

tribe:
  t1:
    cluster.name:   cluster_one
  t2:
    cluster.name:   cluster_two
    network.host:   10.1.2.3 (2)
  1. The network.host setting is inherited by t1.

  2. The t3 node client overrides the inherited from the tribe node.

Remote clusters

The remote clusters module enables you to establish uni-directional connections to a remote cluster. This functionality is used in {ccs}.

Remote cluster connections work by configuring a remote cluster and connecting only to a limited number of nodes in that remote cluster. Each remote cluster is referenced by a name and a list of seed nodes. When a remote cluster is registered, its cluster state is retrieved from one of the seed nodes and up to three gateway nodes are selected to be connected to as part of remote cluster requests. All the communication required between different clusters goes through the transport layer. Remote cluster connections consist of uni-directional connections from the coordinating node to the selected remote gateway nodes only.

Gateway nodes selection

The gateway nodes selection depends on the following criteria:

  • version: Remote nodes must be compatible with the cluster they are registered to. This is subject to rules that are similar to those for [rolling-upgrades]. Any node can communicate with any other node on the same major version (e.g. 7.0 can talk to any 7.x node). Only nodes on the last minor version of a certain major version can communicate with nodes on the following major version. Note that in the 6.x series, 6.8 can communicate with any 7.x node, while 6.7 can only communicate with 7.0. Version compatibility is symmetric, meaning that if 6.7 can communicate with 7.0, 7.0 can also communicate with 6.7. The matrix below summarizes compatibility as described above.

Compatibility

5.0→5.5

5.6

6.0→6.6

6.7

6.8

7.0

7.1→7.x

5.0→5.5

Yes

Yes

No

No

No

No

No

5.6

Yes

Yes

Yes

Yes

Yes

No

No

6.0→6.6

No

Yes

Yes

Yes

Yes

No

No

6.7

No

Yes

Yes

Yes

Yes

Yes

No

6.8

No

Yes

Yes

Yes

Yes

Yes

Yes

7.0

No

No

No

Yes

Yes

Yes

Yes

7.1→7.x

No

No

No

No

Yes

Yes

Yes

  • role: Dedicated master nodes never get selected.

  • attributes: You can tag which nodes should be selected (see Remote cluster settings), though such tagged nodes still have to satisfy the two above requirements.

Configuring remote clusters

You can configure remote clusters globally by using cluster settings, which you can update dynamically. Alternatively, you can configure them locally on individual nodes by using the elasticsearch.yml file.

If you specify the settings in elasticsearch.yml files, only the nodes with those settings can connect to the remote cluster. In other words, functionality that relies on remote cluster requests must be driven specifically from those nodes. For example:

cluster:
    remote:
        cluster_one: (1)
            seeds: 127.0.0.1:9300
            transport.ping_schedule: 30s (2)
        cluster_two:
            seeds: 127.0.0.1:9301
            transport.compress: true (3)
            skip_unavailable: true (4)
  1. cluster_one and cluster_two are arbitrary cluster aliases representing the connection to each cluster. These names are subsequently used to distinguish between local and remote indices.

  2. A keep-alive ping is configured for cluster_one.

  3. Compression is explicitly enabled for requests to cluster_two.

  4. Disconnected remote clusters are optional for cluster_two.

For more information about the optional transport settings, see Transport.

If you use cluster settings, the remote clusters are available on every node in the cluster. For example:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {
          "seeds": [
            "127.0.0.1:9300"
          ],
          "transport.ping_schedule": "30s"
        },
        "cluster_two": {
          "seeds": [
            "127.0.0.1:9301"
          ],
          "transport.compress": true,
          "skip_unavailable": true
        },
        "cluster_three": {
          "seeds": [
            "127.0.0.1:9302"
          ]
        }
      }
    }
  }
}

You can dynamically update the compression and ping schedule settings. However, you must re-include seeds in the settings update request. For example:

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {
          "seeds": [
            "127.0.0.1:9300"
          ],
          "transport.ping_schedule": "60s"
        },
        "cluster_two": {
          "seeds": [
            "127.0.0.1:9301"
          ],
          "transport.compress": false
        }
      }
    }
  }
}
Note
When the compression or ping schedule settings change, all the existing node connections must close and re-open, which can cause in-flight requests to fail.

A remote cluster can be deleted from the cluster settings by setting its seeds and optional settings to null :

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_two": { (1)
          "seeds": null,
          "skip_unavailable": null,
          "transport": {
            "compress": null
          }
        }
      }
    }
  }
}
  1. cluster_two would be removed from the cluster settings, leaving cluster_one and cluster_three intact.

Remote cluster settings

cluster.remote.connections_per_cluster

The number of gateway nodes to connect to per remote cluster. The default is 3.

cluster.remote.initial_connect_timeout

The time to wait for remote connections to be established when the node starts. The default is 30s.

cluster.remote.node.attr

A node attribute to filter out nodes that are eligible as a gateway node in the remote cluster. For instance a node can have a node attribute node.attr.gateway: true such that only nodes with this attribute will be connected to if cluster.remote.node.attr is set to gateway.

cluster.remote.connect

By default, any node in the cluster can act as a cross-cluster client and connect to remote clusters. The cluster.remote.connect setting can be set to false (defaults to true) to prevent certain nodes from connecting to remote clusters. Remote cluster requests must be sent to a node that is allowed to act as a cross-cluster client.

cluster.remote.${cluster_alias}.skip_unavailable

Per cluster boolean setting that allows to skip specific clusters when no nodes belonging to them are available and they are the targetof a remote cluster request. Default is false, meaning that all clusters are mandatory by default, but they can selectively be made optional by setting this setting to true.

cluster.remote.${cluster_alias}.transport.ping_schedule

Sets the time interval between regular application-level ping messages that are sent to ensure that transport connections to nodes belonging to remote clusters are kept alive. If set to -1, application-level ping messages to this remote cluster are not sent. If unset, application-level ping messages are sent according to the global transport.ping_schedule setting, which defaults to `-1 meaning that pings are not sent.

cluster.remote.${cluster_alias}.transport.compress

Per cluster boolean setting that enables you to configure compression for requests to a specific remote cluster. This setting impacts only requests sent to the remote cluster. If the inbound request is compressed, Elasticsearch compresses the response. If unset, the global transport.compress is used as the fallback setting.

Retrieving remote clusters info

You can use the remote cluster info API to retrieve information about the configured remote clusters, as well as the remote nodes that the node is connected to.

The {ccs} feature allows any node to act as a federated client across multiple clusters. In contrast to the tribe node feature, a {ccs} node won’t join the remote cluster, instead it connects to a remote cluster in a light fashion in order to execute federated search requests. For details on communication and compatibility between different clusters, see Remote clusters.

Using {ccs}

{ccs-cap} requires configuring remote clusters.

PUT _cluster/settings
{
  "persistent": {
    "cluster": {
      "remote": {
        "cluster_one": {
          "seeds": [
            "127.0.0.1:9300"
          ]
        },
        "cluster_two": {
          "seeds": [
            "127.0.0.1:9301"
          ]
        },
        "cluster_three": {
          "seeds": [
            "127.0.0.1:9302"
          ]
        }
      }
    }
  }
}

To search the twitter index on remote cluster cluster_one the index name must be prefixed with the alias of the remote cluster followed by the : character:

GET /cluster_one:twitter/_search
{
  "query": {
    "match": {
      "user": "kimchy"
    }
  }
}
{
  "took": 150,
  "timed_out": false,
  "_shards": {
    "total": 1,
    "successful": 1,
    "failed": 0,
    "skipped": 0
  },
  "_clusters": {
    "total": 1,
    "successful": 1,
    "skipped": 0
  },
  "hits": {
    "total": 1,
    "max_score": 1,
    "hits": [
      {
        "_index": "cluster_one:twitter",
        "_type": "_doc",
        "_id": "0",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch",
          "likes": 0
        }
      }
    ]
  }
}

In contrast to the tribe feature cross cluster search can also search indices with the same name on different clusters:

GET /cluster_one:twitter,twitter/_search
{
  "query": {
    "match": {
      "user": "kimchy"
    }
  }
}

Search results are disambiguated the same way as the indices are disambiguated in the request. Indices with same names are treated as different indices when results are merged. All results retrieved from an index located in a remote cluster are prefixed with their corresponding cluster alias:

{
  "took": 150,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0,
    "skipped": 0
  },
  "_clusters": {
    "total": 2,
    "successful": 2,
    "skipped": 0
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "cluster_one:twitter",
        "_type": "_doc",
        "_id": "0",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch",
          "likes": 0
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "0",
        "_score": 2,
        "_source": {
          "user": "kimchy",
          "date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch",
          "likes": 0
        }
      }
    ]
  }
}

Skipping disconnected clusters

By default, all remote clusters that are searched via {ccs} need to be available when the search request is executed. Otherwise, the whole request fails; even if some of the clusters are available, no search results are returned. You can use the boolean skip_unavailable setting to make remote clusters optional. By default, it is set to false.

PUT _cluster/settings
{
  "persistent": {
    "cluster.remote.cluster_two.skip_unavailable": true (1)
  }
}
  1. cluster_two is made optional

GET /cluster_one:twitter,cluster_two:twitter,twitter/_search (1)
{
  "query": {
    "match": {
      "user": "kimchy"
    }
  }
}
  1. Search against the twitter index in cluster_one, cluster_two and also locally

{
  "took": 150,
  "timed_out": false,
  "_shards": {
    "total": 2,
    "successful": 2,
    "failed": 0,
    "skipped": 0
  },
  "_clusters": { (1)
    "total": 3,
    "successful": 2,
    "skipped": 1
  },
  "hits": {
    "total": 2,
    "max_score": 1,
    "hits": [
      {
        "_index": "cluster_one:twitter",
        "_type": "_doc",
        "_id": "0",
        "_score": 1,
        "_source": {
          "user": "kimchy",
          "date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch",
          "likes": 0
        }
      },
      {
        "_index": "twitter",
        "_type": "_doc",
        "_id": "0",
        "_score": 2,
        "_source": {
          "user": "kimchy",
          "date": "2009-11-15T14:12:12",
          "message": "trying out Elasticsearch",
          "likes": 0
        }
      }
    ]
  }
}
  1. The clusters section indicates that one cluster was unavailable and got skipped