Cluster
One of the main roles of the master is to decide which shards to allocate to which nodes, and when to move shards between nodes in order to rebalance the cluster.
There are a number of settings available to control the shard allocation process:
- Cluster Level Shard Allocation lists the settings to control the allocation and rebalancing operations.
- Disk-based Shard Allocation explains how Elasticsearch takes available disk space into account, and the related settings.
- Shard Allocation Awareness and Forced Awareness control how shards can be distributed across different racks or availability zones.
- Shard Allocation Filtering allows certain nodes or groups of nodes to be excluded from allocation so that they can be decommissioned.
Besides these, there are a few other miscellaneous cluster-level settings.
All of the settings in this section are dynamic settings which can be updated on a live cluster with the cluster-update-settings API.
Cluster Level Shard Allocation
Shard allocation is the process of allocating shards to nodes. This can happen during initial recovery, replica allocation, rebalancing, or when nodes are added or removed.
Shard Allocation Settings
The following dynamic settings may be used to control shard allocation and recovery:
cluster.routing.allocation.enable
- Enable or disable allocation for specific kinds of shards:
  - all - (default) Allows shard allocation for all kinds of shards.
  - primaries - Allows shard allocation only for primary shards.
  - new_primaries - Allows shard allocation only for primary shards for new indices.
  - none - No shard allocations of any kind are allowed for any indices.
This setting does not affect the recovery of local primary shards when restarting a node. A restarted node that has a copy of an unassigned primary shard will recover that primary immediately, assuming that its allocation id matches one of the active allocation ids in the cluster state.
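For example, a common step before a rolling restart is to disable allocation entirely; a minimal sketch using the cluster-update-settings API (the transient scope and the value none are illustrative choices):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.enable": "none"
  }
}
Setting the value back to all, or to null to restore the default, re-enables allocation afterwards.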
cluster.routing.allocation.node_concurrent_incoming_recoveries
- How many concurrent incoming shard recoveries are allowed to happen on a node. Incoming recoveries are the recoveries where the target shard (most likely the replica unless a shard is relocating) is allocated on the node. Defaults to 2.
cluster.routing.allocation.node_concurrent_outgoing_recoveries
- How many concurrent outgoing shard recoveries are allowed to happen on a node. Outgoing recoveries are the recoveries where the source shard (most likely the primary unless a shard is relocating) is allocated on the node. Defaults to 2.
cluster.routing.allocation.node_concurrent_recoveries
- A shortcut to set both cluster.routing.allocation.node_concurrent_incoming_recoveries and cluster.routing.allocation.node_concurrent_outgoing_recoveries.
cluster.routing.allocation.node_initial_primaries_recoveries
- While the recovery of replicas happens over the network, the recovery of an unassigned primary after a node restart uses data from the local disk. These recoveries should be fast, so more initial primary recoveries can happen in parallel on the same node. Defaults to 4.
cluster.routing.allocation.same_shard.host
- Performs a check to prevent allocation of multiple instances of the same shard on a single host, based on host name and host address. Defaults to false, meaning that no check is performed by default. This setting only applies if multiple nodes are started on the same machine.
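As an illustration, the shortcut setting above could be used to raise both recovery concurrency limits at once; a sketch (the value 4 is illustrative, not a recommendation):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.node_concurrent_recoveries": 4
  }
}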
Shard Rebalancing Settings
The following dynamic settings may be used to control the rebalancing of shards across the cluster:
cluster.routing.rebalance.enable
- Enable or disable rebalancing for specific kinds of shards:
  - all - (default) Allows shard balancing for all kinds of shards.
  - primaries - Allows shard balancing only for primary shards.
  - replicas - Allows shard balancing only for replica shards.
  - none - No shard balancing of any kind is allowed for any indices.
cluster.routing.allocation.allow_rebalance
- Specify when shard rebalancing is allowed:
  - always - Always allow rebalancing.
  - indices_primaries_active - Only when all primaries in the cluster are allocated.
  - indices_all_active - (default) Only when all shards (primaries and replicas) in the cluster are allocated.
cluster.routing.allocation.cluster_concurrent_rebalance
- Controls how many concurrent shard rebalances are allowed cluster wide. Defaults to 2. Note that this setting only controls the number of concurrent shard relocations due to imbalances in the cluster. It does not limit shard relocations due to allocation filtering or forced awareness.
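For instance, rebalancing could be paused during maintenance and re-enabled afterwards; a sketch using the settings API:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.rebalance.enable": "none"
  }
}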
Shard Balancing Heuristics
The following settings are used together to determine where to place each shard. The cluster is balanced when no allowed rebalancing operation can bring the weight of any node closer to the weight of any other node by more than the balance.threshold.
cluster.routing.allocation.balance.shard
- Defines the weight factor for the total number of shards allocated on a node (float). Defaults to 0.45f. Raising this raises the tendency to equalize the number of shards across all nodes in the cluster.
cluster.routing.allocation.balance.index
- Defines the weight factor for the number of shards per index allocated on a specific node (float). Defaults to 0.55f. Raising this raises the tendency to equalize the number of shards per index across all nodes in the cluster.
cluster.routing.allocation.balance.threshold
- Minimal optimization value of operations that should be performed (non-negative float). Defaults to 1.0f. Raising this will cause the cluster to be less aggressive about optimizing the shard balance.
Note: Regardless of the result of the balancing algorithm, rebalancing might not be allowed due to forced awareness or allocation filtering.
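As an example, raising the threshold reduces rebalancing churn at the cost of a less perfectly balanced cluster; a sketch (1.5 is an illustrative value):
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.balance.threshold": 1.5
  }
}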
Disk-based Shard Allocation
{es} considers the available disk space on a node before deciding whether to allocate new shards to that node or to actively relocate shards away from that node.
Below are the settings that can be configured in the elasticsearch.yml config file or updated dynamically on a live cluster with the cluster-update-settings API:
cluster.routing.allocation.disk.threshold_enabled
- Defaults to true. Set to false to disable the disk allocation decider.
cluster.routing.allocation.disk.watermark.low
- Controls the low watermark for disk usage. It defaults to 85%, meaning that {es} will not allocate shards to nodes that have more than 85% disk used. It can also be set to an absolute byte value (like 500mb) to prevent {es} from allocating shards if less than the specified amount of space is available. This setting has no effect on the primary shards of newly-created indices or, specifically, any shards that have never previously been allocated.
cluster.routing.allocation.disk.watermark.high
- Controls the high watermark. It defaults to 90%, meaning that {es} will attempt to relocate shards away from a node whose disk usage is above 90%. It can also be set to an absolute byte value (similarly to the low watermark) to relocate shards away from a node if it has less than the specified amount of free space. This setting affects the allocation of all shards, whether previously allocated or not.
cluster.routing.allocation.disk.watermark.flood_stage
- Controls the flood stage watermark, which defaults to 95%. {es} enforces a read-only index block (index.blocks.read_only_allow_delete) on every index that has one or more shards allocated on the node, and that has at least one disk exceeding the flood stage. This setting is a last resort to prevent nodes from running out of disk space. The index block must be released manually when the disk utilization falls below the high watermark.
Note: You cannot mix the usage of percentage values and byte values within these settings. Either all values are set to percentage values, or all are set to byte values. This enforcement is so that {es} can validate that the settings are internally consistent, ensuring that the low disk threshold is less than the high disk threshold, and the high disk threshold is less than the flood stage threshold.
An example of resetting the read-only index block on the twitter index:
PUT /twitter/_settings
{
  "index.blocks.read_only_allow_delete": null
}
cluster.info.update.interval
- How often {es} should check disk usage for each node in the cluster. Defaults to 30s.
cluster.routing.allocation.disk.include_relocations
- Defaults to true, which means that Elasticsearch will take into account shards that are currently being relocated to the target node when computing a node's disk usage. Taking relocating shards' sizes into account may, however, mean that the disk usage for a node is incorrectly estimated on the high side, since the relocation could be 90% complete and a recently retrieved disk usage would include the total size of the relocating shard as well as the space already used by the running relocation.
Note: Percentage values refer to used disk space, while byte values refer to free disk space. This can be confusing, since it flips the meaning of high and low. For example, it makes sense to set the low watermark to 10gb and the high watermark to 5gb, but not the other way around.
An example of updating the low watermark to at least 100 gigabytes free, the high watermark to at least 50 gigabytes free, the flood stage watermark to 10 gigabytes free, and updating the cluster information interval to one minute:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.disk.watermark.low": "100gb",
"cluster.routing.allocation.disk.watermark.high": "50gb",
"cluster.routing.allocation.disk.watermark.flood_stage": "10gb",
"cluster.info.update.interval": "1m"
}
}
Shard Allocation Awareness
When running nodes on multiple VMs on the same physical server, on multiple racks, or across multiple zones or domains, it is more likely that two nodes on the same physical server, in the same rack, or in the same zone or domain will crash at the same time, rather than two unrelated nodes crashing simultaneously.
If Elasticsearch is aware of the physical configuration of your hardware, it can ensure that the primary shard and its replica shards are spread across different physical servers, racks, or zones, to minimise the risk of losing all shard copies at the same time.
The shard allocation awareness settings allow you to tell Elasticsearch about your hardware configuration.
As an example, let's assume we have several racks. When we start a node, we can tell it which rack it is in by assigning it an arbitrary metadata attribute called rack_id (we could use any attribute name). For example:
./bin/elasticsearch -Enode.attr.rack_id=rack_one (1)
(1) This setting can also be specified in the elasticsearch.yml config file.
Now, we need to set up shard allocation awareness by telling Elasticsearch which attributes to use. This can be configured in the elasticsearch.yml file on all master-eligible nodes, or it can be set (and changed) with the cluster-update-settings API.
For our example, we’ll set the value in the config file:
cluster.routing.allocation.awareness.attributes: rack_id
With this config in place, let's say we start two nodes with node.attr.rack_id set to rack_one, and we create an index with 5 primary shards and 1 replica of each primary. All primaries and replicas are allocated across the two nodes.
Now, if we start two more nodes with node.attr.rack_id set to rack_two, Elasticsearch will move shards across to the new nodes, ensuring (if possible) that no two copies of the same shard will be in the same rack. However, if rack_two were to fail, taking down both of its nodes, Elasticsearch will still allocate the lost shard copies to nodes in rack_one.
Multiple awareness attributes can be specified, in which case each attribute is considered separately when deciding where to allocate the shards.
cluster.routing.allocation.awareness.attributes: rack_id,zone
Note: When using awareness attributes, shards will not be allocated to nodes that don't have values set for those attributes.
Note: The number of copies of a shard that are allocated to a group of nodes with the same awareness attribute value is determined by the number of attribute values. When the number of nodes in each group is unbalanced and there are many replicas, replica shards may be left unassigned.
Forced Awareness
Imagine that you have two zones and enough hardware across the two zones to host all of your primary and replica shards. But perhaps the hardware in a single zone, while sufficient to host half the shards, would be unable to host ALL the shards.
With ordinary awareness, if one zone lost contact with the other zone, Elasticsearch would assign all of the missing replica shards to a single zone. But in this example, this sudden extra load would cause the hardware in the remaining zone to be overloaded.
Forced awareness solves this problem by NEVER allowing copies of the same shard to be allocated to the same zone.
For example, let's say we have an awareness attribute called zone, and we know we are going to have two zones, zone1 and zone2. Here is how we can force awareness on a node:
cluster.routing.allocation.awareness.force.zone.values: zone1,zone2 (1)
cluster.routing.allocation.awareness.attributes: zone
(1) We must list all possible values that the zone attribute can have.
Now, if we start two nodes with node.attr.zone set to zone1 and create an index with 5 shards and 1 replica, the index will be created, but only the 5 primary shards will be allocated (with no replicas). Only when we start more nodes with node.attr.zone set to zone2 will the replicas be allocated.
The cluster.routing.allocation.awareness.* settings can all be updated dynamically on a live cluster with the cluster-update-settings API.
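For example, the awareness attributes from the example above could be set dynamically instead of in the config file; a sketch:
PUT _cluster/settings
{
  "persistent": {
    "cluster.routing.allocation.awareness.attributes": "zone"
  }
}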
Shard Allocation Filtering
While [index-modules-allocation] provides per-index settings to control the allocation of shards to nodes, cluster-level shard allocation filtering allows you to allow or disallow the allocation of shards from any index to particular nodes.
The available dynamic cluster settings are as follows, where {attribute} refers to an arbitrary node attribute:
cluster.routing.allocation.include.{attribute}
- Allocate shards to a node whose {attribute} has at least one of the comma-separated values.
cluster.routing.allocation.require.{attribute}
- Only allocate shards to a node whose {attribute} has all of the comma-separated values.
cluster.routing.allocation.exclude.{attribute}
- Do not allocate shards to a node whose {attribute} has any of the comma-separated values.
These special attributes are also supported:
_name - Match nodes by node names
_ip - Match nodes by IP addresses (the IP address associated with the hostname)
_host - Match nodes by hostnames
The typical use case for cluster-wide shard allocation filtering is when you want to decommission a node, and you would like to move the shards from that node to other nodes in the cluster before shutting it down.
For instance, we could decommission a node using its IP address as follows:
PUT _cluster/settings
{
"transient" : {
"cluster.routing.allocation.exclude._ip" : "10.0.0.1"
}
}
Note: Shards will only be relocated if it is possible to do so without breaking another routing constraint, such as never allocating a primary and replica shard to the same node.
In addition to listing multiple values as a comma-separated list, all attribute values can be specified with wildcards, e.g.:
PUT _cluster/settings
{
"transient": {
"cluster.routing.allocation.exclude._ip": "192.168.2.*"
}
}
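Once the decommissioned node has been shut down, the filter can be cleared by setting it to null, relying on the settings API's standard behavior of resetting a setting to its default; a sketch:
PUT _cluster/settings
{
  "transient": {
    "cluster.routing.allocation.exclude._ip": null
  }
}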
Miscellaneous cluster settings
Metadata
An entire cluster may be set to read-only with the following dynamic setting:
cluster.blocks.read_only
- Make the whole cluster read only (indices do not accept write operations) and prevent metadata from being modified (no creating or deleting indices).
cluster.blocks.read_only_allow_delete
- Identical to cluster.blocks.read_only, but allows indices to be deleted to free up resources.
Warning: Don't rely on this setting to prevent changes to your cluster. Any user with access to the cluster-update-settings API can make the cluster read-write again.
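For example, the cluster could be made read-only before risky maintenance; a minimal sketch:
PUT _cluster/settings
{
  "persistent": {
    "cluster.blocks.read_only": true
  }
}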
Cluster Shard Limit
In Elasticsearch 7.0 and later, there will be a soft limit on the number of shards in a cluster, based on the number of nodes in the cluster. This is intended to prevent operations which may unintentionally destabilize the cluster. Prior to 7.0, actions which would result in the cluster going over the limit issue a deprecation warning instead.
Note: You can set the system property es.enforce_max_shards_per_node to true to opt in to strict enforcement of the shard limit. If this system property is set, actions which would result in the cluster going over the limit will result in an error, rather than a deprecation warning. This property will be removed in Elasticsearch 7.0, as strict enforcement of the limit will be the default and only behavior.
Important: This limit is intended as a safety net, not a sizing recommendation. The exact number of shards your cluster can safely support depends on your hardware configuration and workload, but should remain well below this limit in almost all cases, as the default limit is set quite high.
If an operation, such as creating a new index, restoring a snapshot of an index, or opening a closed index would lead to the number of shards in the cluster going over this limit, the operation will issue a deprecation warning.
If the cluster is already over the limit, due to changes in node membership or setting changes, all operations that create or open indices will issue warnings until either the limit is increased as described below, or some indices are closed or deleted to bring the number of shards below the limit.
Replicas count towards this limit, but closed indices do not. An index with 5 primary shards and 2 replicas will be counted as 15 shards. Any closed index is counted as 0, no matter how many shards and replicas it contains.
The limit defaults to 1,000 shards per data node, and can be dynamically adjusted using the following setting:
cluster.max_shards_per_node
- Controls the number of shards allowed in the cluster per data node.
For example, a 3-node cluster with the default setting would allow 3,000 shards total, across all open indexes. If the above setting is changed to 500, then the cluster would allow 1,500 shards total.
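A sketch of lowering the limit dynamically, matching the 500-shards-per-node example above:
PUT _cluster/settings
{
  "persistent": {
    "cluster.max_shards_per_node": 500
  }
}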
User Defined Cluster Metadata
User-defined metadata can be stored and retrieved using the Cluster Settings API. This can be used to store arbitrary, infrequently-changing data about the cluster without the need to create an index to store it. This data may be stored using any key prefixed with cluster.metadata.. For example, to store the email address of the administrator of a cluster under the key cluster.metadata.administrator, issue this request:
PUT /_cluster/settings
{
"persistent": {
"cluster.metadata.administrator": "sysadmin@example.com"
}
}
Important: User-defined cluster metadata is not intended to store sensitive or confidential information. Any information stored in user-defined cluster metadata will be viewable by anyone with access to the Cluster Get Settings API, and is recorded in the {es} logs.
Index Tombstones
The cluster state maintains index tombstones to explicitly denote indices that have been deleted. The number of tombstones maintained in the cluster state is controlled by the following property, which cannot be updated dynamically:
cluster.indices.tombstones.size
- Index tombstones prevent nodes that are not part of the cluster when a delete occurs from joining the cluster and reimporting the index as though the delete was never issued. To keep the cluster state from growing huge we only keep the last cluster.indices.tombstones.size deletes, which defaults to 500. You can increase it if you expect nodes to be absent from the cluster and miss more than 500 deletes. We think that is rare, thus the default. Tombstones don't take up much space, but we also think that a number like 50,000 is probably too big.
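Since this setting is not dynamic, it would have to be set in elasticsearch.yml before node startup; a sketch (1000 is an illustrative value):
cluster.indices.tombstones.size: 1000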
Logger
The settings which control logging can be updated dynamically with the logger. prefix. For instance, to increase the logging level of the indices.recovery module to DEBUG, issue this request:
PUT /_cluster/settings
{
"transient": {
"logger.org.elasticsearch.indices.recovery": "DEBUG"
}
}
Persistent Tasks Allocations
Plugins can create a kind of task called persistent tasks. Those tasks are usually long-lived and are stored in the cluster state, allowing the tasks to be revived after a full cluster restart.
Every time a persistent task is created, the master node takes care of assigning the task to a node of the cluster, and the assigned node will then pick up the task and execute it locally. The process of assigning persistent tasks to nodes is controlled by the following properties, which can be updated dynamically:
cluster.persistent_tasks.allocation.enable
- Enable or disable allocation for persistent tasks:
  - all - (default) Allows persistent tasks to be assigned to nodes.
  - none - No allocations are allowed for any type of persistent task.
This setting does not affect the persistent tasks that are already being executed. Only newly created persistent tasks, or tasks that must be reassigned (after a node left the cluster, for example), are impacted by this setting.
cluster.persistent_tasks.allocation.recheck_interval
- The master node will automatically check whether persistent tasks need to be assigned when the cluster state changes significantly. However, there may be other factors, such as memory usage, that affect whether persistent tasks can be assigned to nodes but do not cause the cluster state to change. This setting controls how often assignment checks are performed to react to these factors. The default is 30 seconds. The minimum permitted value is 10 seconds.
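For example, persistent task assignment could be paused cluster-wide; a sketch using the settings API:
PUT _cluster/settings
{
  "persistent": {
    "cluster.persistent_tasks.allocation.enable": "none"
  }
}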
Discovery
The discovery module is responsible for discovering nodes within a cluster, as well as electing a master node.
Note that Elasticsearch is a peer-to-peer based system, in which nodes communicate with one another directly. The main APIs (index, delete, search) do not communicate with the master node. The responsibility of the master node is to maintain the global cluster state, and to act when nodes join or leave the cluster by reassigning shards. Each time the cluster state is changed, the state is made known to the other nodes in the cluster (the manner depends on the actual discovery implementation).
Settings
The cluster.name setting allows clusters to be separated from one another. The default value for the cluster name is elasticsearch, though it is recommended to change this to reflect the logical group name of the cluster running.
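For example, in elasticsearch.yml (logging-prod is a hypothetical name):
cluster.name: logging-prod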
Azure Classic Discovery
Azure classic discovery uses the Azure Classic APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See {plugins}/discovery-azure-classic.html[discovery-azure-classic] for more information.
EC2 Discovery
EC2 discovery is available as a plugin. See {plugins}/discovery-ec2.html[discovery-ec2] for more information.
Google Compute Engine Discovery
Google Compute Engine (GCE) discovery uses the GCE APIs to perform automatic discovery (similar to multicast). It is available as a plugin. See {plugins}/discovery-gce.html[discovery-gce] for more information.
Zen Discovery
Zen discovery is the built-in, default, discovery module for Elasticsearch. It provides unicast and file-based discovery, and can be extended to support cloud environments and other forms of discovery via plugins.
Zen discovery is integrated with other modules, for example, all communication between nodes is done using the transport module.
It is separated into several sub modules, which are explained below:
Ping
This is the process where a node uses the discovery mechanisms to find other nodes.
Seed nodes
Zen discovery uses a list of seed nodes in order to start off the discovery process. At startup, or when electing a new master, Elasticsearch tries to connect to each seed node in its list, and holds a gossip-like conversation with them to find other nodes and to build a complete picture of the cluster. By default there are two methods for configuring the list of seed nodes: unicast and file-based. It is recommended that the list of seed nodes comprises the list of master-eligible nodes in the cluster.
Unicast
Unicast discovery configures a static list of hosts for use as seed nodes. These hosts can be specified as hostnames or IP addresses; hosts specified as hostnames are resolved to IP addresses during each round of pinging. Note that if you are in an environment where DNS resolutions vary with time, you might need to adjust your JVM security settings.
The list of hosts is set using the discovery.zen.ping.unicast.hosts static setting. This is either an array of hosts or a comma-delimited string. Each value should be in the form of host:port or host (where port defaults to the setting transport.profiles.default.port, falling back to transport.port if not set). Note that IPv6 hosts must be bracketed. The default for this setting is 127.0.0.1, [::1].
Additionally, the discovery.zen.ping.unicast.resolve_timeout setting configures the amount of time to wait for DNS lookups on each round of pinging. This is specified as a time unit and defaults to 5s.
Unicast discovery uses the transport module to perform the discovery.
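A sketch of what this might look like in elasticsearch.yml (the hosts are hypothetical):
discovery.zen.ping.unicast.hosts:
  - 192.168.1.10:9300
  - 192.168.1.11
  - seeds.mydomain.com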
File-based
In addition to hosts provided by the static discovery.zen.ping.unicast.hosts setting, it is possible to provide a list of hosts via an external file. Elasticsearch reloads this file when it changes, so that the list of seed nodes can change dynamically without needing to restart each node. For example, this gives a convenient mechanism for an Elasticsearch instance that is run in a Docker container to be dynamically supplied with a list of IP addresses to connect to for Zen discovery when those IP addresses may not be known at node startup.
To enable file-based discovery, configure the file hosts provider as follows:
discovery.zen.hosts_provider: file
Then create a file at $ES_PATH_CONF/unicast_hosts.txt in the format described below. Any time a change is made to the unicast_hosts.txt file, the new changes will be picked up by Elasticsearch and the new hosts list will be used.
Note that the file-based discovery plugin augments the unicast hosts list in elasticsearch.yml: if there are valid unicast host entries in discovery.zen.ping.unicast.hosts then they will be used in addition to those supplied in unicast_hosts.txt.
The discovery.zen.ping.unicast.resolve_timeout setting also applies to DNS lookups for nodes specified by address via file-based discovery. This is specified as a time unit and defaults to 5s.
The format of the file is to specify one node entry per line. Each node entry consists of the host (host name or IP address) and an optional transport port number. If the port number is specified, it must come immediately after the host (on the same line), separated by a :. If the port number is not specified, a default value of 9300 is used.
For example, here is a unicast_hosts.txt for a cluster with four nodes that participate in unicast discovery, some of which are not running on the default port:
10.10.10.5
10.10.10.6:9305
10.10.10.5:10005
# an IPv6 address
[2001:0db8:85a3:0000:0000:8a2e:0370:7334]:9301
Host names are allowed instead of IP addresses (similar to discovery.zen.ping.unicast.hosts), and IPv6 addresses must be specified in brackets with the port coming after the brackets.
It is also possible to add comments to this file. All comments must appear on their own lines starting with # (i.e. comments cannot start in the middle of a line).
Master Election
As part of the ping process a master of the cluster is either elected or joined to. This is done automatically. The discovery.zen.ping_timeout setting (which defaults to 3s) determines how long the node will wait before deciding on starting an election or joining an existing cluster. Three pings will be sent over this timeout interval. In case no decision can be reached after the timeout, the pinging process restarts. In slow or congested networks, three seconds might not be enough for a node to become aware of the other nodes in its environment before making an election decision. Increasing the timeout should be done with care in that case, as it will slow down the election process. Once a node decides to join an existing formed cluster, it will send a join request to the master, with a timeout (discovery.zen.join_timeout) defaulting to 20 times the ping timeout.
When the master node stops or has encountered a problem, the cluster nodes start pinging again and will elect a new master. This pinging round also serves as a protection against (partial) network failures where a node may unjustly think that the master has failed. In this case the node will simply hear from other nodes about the currently active master.
If discovery.zen.master_election.ignore_non_master_pings is true, pings from nodes that are not master eligible (nodes where node.master is false) are ignored during master election; the default value is false.
Nodes can be excluded from becoming a master by setting node.master to false.
The discovery.zen.minimum_master_nodes setting sets the minimum number of master eligible nodes that need to join a newly elected master in order for an election to complete and for the elected node to accept its mastership. The same setting controls the minimum number of active master eligible nodes that should be a part of any active cluster. If this requirement is not met the active master node will step down and a new master election will begin.
This setting must be set to a quorum of your master eligible nodes. It is recommended to avoid having only two master eligible nodes, since a quorum of two is two. Therefore, a loss of either master eligible node will result in an inoperable cluster.
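For example, with three master-eligible nodes a quorum is (3 / 2) + 1 = 2, so the setting would be:
discovery.zen.minimum_master_nodes: 2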
Fault Detection
There are two fault detection processes running. The first is by the master, to ping all the other nodes in the cluster and verify that they are alive. And on the other end, each node pings the master to verify if it is still alive or whether an election process needs to be initiated.
The following settings control the fault detection process using the discovery.zen.fd prefix:
discovery.zen.fd.ping_interval
- How often a node gets pinged. Defaults to 1s.
discovery.zen.fd.ping_timeout
- How long to wait for a ping response. Defaults to 30s.
discovery.zen.fd.ping_retries
- How many ping failures / timeouts cause a node to be considered failed. Defaults to 3.
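On slow or congested networks these might be relaxed in elasticsearch.yml; a sketch with purely illustrative values:
discovery.zen.fd.ping_interval: 2s
discovery.zen.fd.ping_timeout: 60s
discovery.zen.fd.ping_retries: 5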
Cluster state updates
The master node is the only node in a cluster that can make changes to the cluster state. The master node processes one cluster state update at a time, applies the required changes and publishes the updated cluster state to all the other nodes in the cluster. Each node receives the publish message, acknowledges it, but does not yet apply it. If the master does not receive acknowledgement from at least discovery.zen.minimum_master_nodes nodes within a certain time (controlled by the discovery.zen.commit_timeout setting, which defaults to 30 seconds, with negative values treated as 0 seconds) the cluster state change is rejected.
Once enough nodes have responded, the cluster state is committed and a message will be sent to all the nodes. The nodes then proceed to apply the new cluster state to their internal state. The master node waits for all nodes to respond, up to a timeout, before moving on to process the next update in the queue. The discovery.zen.publish_timeout is set by default to 30 seconds and is measured from the moment publishing started. Both timeout settings can be changed dynamically through the cluster-update-settings API.
No master block
For the cluster to be fully operational, it must have an active master and the number of running master eligible nodes must satisfy the discovery.zen.minimum_master_nodes setting, if set. The discovery.zen.no_master_block setting controls what operations should be rejected when there is no active master.
The discovery.zen.no_master_block setting has two valid options:
all - All operations on the node (i.e. both reads and writes) will be rejected. This also applies to API cluster state read or write operations, like the get index settings, put mapping and cluster state APIs.
write - (default) Write operations will be rejected. Read operations will succeed, based on the last known cluster configuration. This may result in partial reads of stale data as this node may be isolated from the rest of the cluster.
The discovery.zen.no_master_block setting doesn't apply to nodes-based APIs (for example cluster stats, node info and node stats APIs). Requests to these APIs will not be blocked and can run on any available node.
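For example, to reject reads as well as writes while no master is elected, elasticsearch.yml could contain:
discovery.zen.no_master_block: all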
Single-node discovery
The discovery.type setting specifies whether {es} should form a multiple-node cluster. By default, {es} discovers other nodes when forming a cluster and allows other nodes to join the cluster later. If discovery.type is set to single-node, {es} forms a single-node cluster. For more information about when you might use this setting, see Bootstrap checks.
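In elasticsearch.yml this is simply:
discovery.type: single-node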
Local Gateway
The local gateway module stores the cluster state and shard data across full cluster restarts.
The following static settings, which must be set on every master node, control how long a freshly elected master should wait before it tries to recover the cluster state and the cluster’s data:
gateway.expected_nodes
- The number of (data or master) nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of nodes have joined the cluster. Defaults to 0.
gateway.expected_master_nodes
- The number of master nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of master nodes have joined the cluster. Defaults to 0.
gateway.expected_data_nodes
- The number of data nodes that are expected to be in the cluster. Recovery of local shards will start as soon as the expected number of data nodes have joined the cluster. Defaults to 0.
gateway.recover_after_time
- If the expected number of nodes is not achieved, the recovery process waits for the configured amount of time before trying to recover regardless. Defaults to 5m if one of the expected_nodes settings is configured.
Once the recover_after_time duration has timed out, recovery will start as long as the following conditions are met:
gateway.recover_after_nodes
- Recover as long as this many data or master nodes have joined the cluster.
gateway.recover_after_master_nodes
- Recover as long as this many master nodes have joined the cluster.
gateway.recover_after_data_nodes
- Recover as long as this many data nodes have joined the cluster.
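As an illustration, for a five-node cluster these settings might be combined as follows in elasticsearch.yml (the values are hypothetical):
gateway.expected_nodes: 5
gateway.recover_after_time: 5m
gateway.recover_after_nodes: 3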
Note: These settings only take effect on a full cluster restart.
HTTP
The http module allows the Elasticsearch APIs to be exposed over HTTP.
The http mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. The benefit of using asynchronous communication for HTTP is solving the C10k problem.
When possible, consider using HTTP keep alive when connecting for better performance and try to get your favorite client not to do HTTP chunking.
Settings
The following settings can be configured for HTTP. Note that none of them are dynamically updatable, so for them to take effect they must be set in the Elasticsearch configuration file.
http.port
- A bind port range. Defaults to 9200-9300.
http.publish_port
- The port that HTTP clients should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the http.port is not directly addressable from the outside. Defaults to the actual port assigned via http.port.
http.bind_host
- The host address to bind the HTTP service to. Defaults to http.host (if set) or network.bind_host.
http.publish_host
- The host address to publish for HTTP clients to connect to. Defaults to http.host (if set) or network.publish_host.
http.host
- Used to set the http.bind_host and the http.publish_host. Defaults to network.host.
http.max_content_length
- The max content of an HTTP request. Defaults to 100mb.
http.max_initial_line_length
- The max length of an HTTP URL. Defaults to 4kb.
http.max_header_size
- The max size of allowed headers. Defaults to 8kb.
http.compression
- Support for compression when possible (with Accept-Encoding). Defaults to true.
http.compression_level
- Defines the compression level to use for HTTP responses. Valid values are in the range of 1 (minimum compression) and 9 (maximum compression). Defaults to 3.
http.cors.enabled
- Enable or disable cross-origin resource sharing, i.e. whether a browser on another origin can execute requests against Elasticsearch. Set to true to enable. Defaults to false.
http.cors.allow-origin
- Which origins to allow. Defaults to no origins allowed. If you prepend and append a / to the value, this will be treated as a regular expression.
http.cors.max-age
- Browsers send a "preflight" OPTIONS-request to determine CORS settings. max-age defines how long the result should be cached. Defaults to 1728000 (20 days).
http.cors.allow-methods
- Which methods to allow. Defaults to OPTIONS, HEAD, GET, POST, PUT, DELETE.
http.cors.allow-headers
- Which headers to allow. Defaults to X-Requested-With, Content-Type, Content-Length.
http.cors.allow-credentials
- Whether the Access-Control-Allow-Credentials header should be returned. Defaults to false.
http.detailed_errors.enabled
- Enables or disables the output of detailed error messages and stack traces in response output. Defaults to true.
http.pipelining
- Enable or disable HTTP pipelining. Defaults to true.
http.pipelining.max_events
- The maximum number of events to be queued up in memory before an HTTP connection is closed. Defaults to 10000.
http.max_warning_header_count
- The maximum number of warning headers in client HTTP responses. Defaults to unbounded.
http.max_warning_header_size
- The maximum total size of warning headers in client HTTP responses. Defaults to unbounded.
It also uses the common network settings.
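A sketch of an elasticsearch.yml fragment that enables CORS for a local dashboard (the origin and values are hypothetical):
http.port: 9200
http.cors.enabled: true
http.cors.allow-origin: "http://localhost:1337"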
Disable HTTP
The http module can be completely disabled and not started by setting http.enabled to false. Elasticsearch nodes (and Java clients) communicate internally using the transport interface, not HTTP. It might make sense to disable the http layer entirely on nodes which are not meant to serve REST requests directly. For instance, you could disable HTTP on data-only nodes if you also have client nodes which are intended to serve all REST requests.
Be aware, however, that you will not be able to send any REST requests (e.g. to retrieve node stats) directly to nodes which have HTTP disabled.
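In elasticsearch.yml on such a node:
http.enabled: false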
Indices
The indices module controls index-related settings that are globally managed for all indices, rather than being configurable at a per-index level.
Available settings include:
- Circuit breaker - Circuit breakers set limits on memory usage to avoid out of memory exceptions.
- Fielddata cache - Set limits on the amount of heap used by the in-memory fielddata cache.
- Node query cache - Configure the amount of heap used to cache query results.
- Indexing buffer - Control the size of the buffer allocated to the indexing process.
- Shard request cache - Control the behaviour of the shard-level request cache.
- Recovery - Control the resource limits on the shard recovery process.
- Search Settings - Control global search settings.
Circuit Breaker
Elasticsearch contains multiple circuit breakers used to prevent operations from causing an OutOfMemoryError. Each breaker specifies a limit for how much memory it can use. Additionally, there is a parent-level breaker that specifies the total amount of memory that can be used across all breakers.
These settings can be dynamically updated on a live cluster with the cluster-update-settings API.
Parent circuit breaker
The parent-level breaker can be configured with the following setting:
indices.breaker.total.limit
- Starting limit for overall parent breaker. Defaults to 70% of JVM heap.
Field data circuit breaker
The field data circuit breaker allows Elasticsearch to estimate the amount of memory a field will require to be loaded into memory. It can then prevent the field data loading by raising an exception. By default the limit is configured to 60% of the maximum JVM heap. It can be configured with the following parameters:
indices.breaker.fielddata.limit
- Limit for fielddata breaker. Defaults to 60% of JVM heap.
indices.breaker.fielddata.overhead
- A constant that all field data estimations are multiplied with to determine a final estimation. Defaults to 1.03.
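For example, the fielddata breaker could be tightened dynamically; a sketch (40% is an illustrative value):
PUT _cluster/settings
{
  "transient": {
    "indices.breaker.fielddata.limit": "40%"
  }
}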
Request circuit breaker
The request circuit breaker allows Elasticsearch to prevent per-request data structures (for example, memory used for calculating aggregations during a request) from exceeding a certain amount of memory.
indices.breaker.request.limit
- Limit for request breaker. Defaults to 60% of JVM heap.
indices.breaker.request.overhead
- A constant that all request estimations are multiplied with to determine a final estimation. Defaults to 1.
In flight requests circuit breaker
The in flight requests circuit breaker allows Elasticsearch to limit the memory usage of all currently active incoming requests at the transport or HTTP level from exceeding a certain amount of memory on a node. The memory usage is based on the content length of the request itself.
network.breaker.inflight_requests.limit
- Limit for in flight requests breaker. Defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.
network.breaker.inflight_requests.overhead
- A constant that all in flight requests estimations are multiplied with to determine a final estimation. Defaults to 1.
Accounting requests circuit breaker
The accounting circuit breaker allows Elasticsearch to limit the memory usage of things held in memory that are not released when a request is completed. This includes things like the Lucene segment memory.
indices.breaker.accounting.limit
- Limit for accounting breaker. Defaults to 100% of JVM heap. This means that it is bound by the limit configured for the parent circuit breaker.
indices.breaker.accounting.overhead
- A constant that all accounting estimations are multiplied with to determine a final estimation. Defaults to 1.
Script compilation circuit breaker
Slightly different from the previous memory-based circuit breakers, the script compilation circuit breaker limits the number of inline script compilations within a period of time.
See the "prefer-parameters" section of the scripting documentation for more information.
script.max_compilations_rate
- Limit for the number of unique dynamic scripts within a certain interval that are allowed to be compiled. Defaults to 75/5m, meaning 75 every 5 minutes.
Fielddata
The field data cache is used mainly when sorting on or computing aggregations on a field. It loads all the field values to memory in order to provide fast document based access to those values. The field data cache can be expensive to build for a field, so it's recommended to have enough memory to allocate it, and to keep it loaded.
The amount of memory used for the field data cache can be controlled using indices.fielddata.cache.size. Note: reloading field data which does not fit into your cache will be expensive and perform poorly.
indices.fielddata.cache.size
- The max size of the field data cache, e.g. 30% of node heap space, or an absolute value, e.g. 12GB. Defaults to unbounded. Also see Field data circuit breaker.
Note: These are static settings which must be configured on every data node in the cluster.
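A sketch of the corresponding elasticsearch.yml entry (20% is an illustrative value):
indices.fielddata.cache.size: 20%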
Monitoring field data
You can monitor memory usage for field data as well as the field data circuit breaker using the Nodes Stats API.
Node Query Cache
The query cache is responsible for caching the results of queries. There is one query cache per node that is shared by all shards. The cache implements an LRU eviction policy: when the cache becomes full, the least recently used data is evicted to make way for new data. It is not possible to look at the contents being cached.
The query cache only caches queries which are being used in a filter context.
The following setting is static and must be configured on every data node in the cluster:
indices.queries.cache.size
- Controls the memory size for the filter cache. Defaults to 10%. Accepts either a percentage value, like 5%, or an exact value, like 512mb.
The following setting is an index setting that can be configured on a per-index basis:
index.queries.cache.enabled
- Controls whether to enable query caching. Accepts true (default) or false.
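For example, query caching could be disabled at index creation time; a sketch (my_index is a placeholder):
PUT /my_index
{
  "settings": {
    "index.queries.cache.enabled": false
  }
}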
Indexing Buffer
The indexing buffer is used to store newly indexed documents. When it fills up, the documents in the buffer are written to a segment on disk. It is divided between all shards on the node.
The following settings are static and must be configured on every data node in the cluster:
indices.memory.index_buffer_size
- Accepts either a percentage or a byte size value. It defaults to 10%, meaning that 10% of the total heap allocated to a node will be used as the indexing buffer size shared across all shards.
indices.memory.min_index_buffer_size
- If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute minimum. Defaults to 48mb.
indices.memory.max_index_buffer_size
- If the index_buffer_size is specified as a percentage, then this setting can be used to specify an absolute maximum. Defaults to unbounded.
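In elasticsearch.yml this might look like the following (20% is illustrative):
indices.memory.index_buffer_size: 20%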
Shard request cache
When a search request is run against an index or against many indices, each involved shard executes the search locally and returns its local results to the coordinating node, which combines these shard-level results into a "global" result set.
The shard-level request cache module caches the local results on each shard. This allows frequently used (and potentially heavy) search requests to return results almost instantly. The requests cache is a very good fit for the logging use case, where only the most recent index is being actively updated — results from older indices will be served directly from the cache.
Important: By default, the requests cache will only cache the results of search requests where size=0, so it will not cache hits, but it will cache hits.total, aggregations, and suggestions. Most queries that use now (see Date Math) cannot be cached.
Cache invalidation
The cache is smart — it keeps the same near real-time promise as uncached search.
Cached results are invalidated automatically whenever the shard refreshes, but only if the data in the shard has actually changed. In other words, you will always get the same results from the cache as you would for an uncached search request.
The longer the refresh interval, the longer that cached entries will remain valid. If the cache is full, the least recently used cache keys will be evicted.
The cache can be expired manually with the clear-cache API:
POST /kimchy,elasticsearch/_cache/clear?request=true
Enabling and disabling caching
The cache is enabled by default, but can be disabled when creating a new index as follows:
PUT /my_index
{
"settings": {
"index.requests.cache.enable": false
}
}
It can also be enabled or disabled dynamically on an existing index with the update-settings API:
PUT /my_index/_settings
{
  "index.requests.cache.enable": true
}
Enabling and disabling caching per request
The request_cache query-string parameter can be used to enable or disable caching on a per-request basis. If set, it overrides the index-level setting:
GET /my_index/_search?request_cache=true
{
"size": 0,
"aggs": {
"popular_colors": {
"terms": {
"field": "colors"
}
}
}
}
Important: If your query uses a script whose result is not deterministic (e.g. it uses a random function or references the current time) you should set the request_cache flag to false to disable caching for that request.
Requests where size is greater than 0 will not be cached even if the request cache is enabled in the index settings. To cache these requests you will need to use the query-string parameter detailed here.
Cache key
The whole JSON body is used as the cache key. This means that if the JSON changes — for instance if keys are output in a different order — then the cache key will not be recognised.
Tip: Most JSON libraries support a canonical mode which ensures that JSON keys are always emitted in the same order. This canonical mode can be used in the application to ensure that a request is always serialized in the same way.
Cache settings
The cache is managed at the node level, and has a default maximum size of 1% of the heap. This can be changed in the config/elasticsearch.yml file with:
indices.requests.cache.size: 2%
Also, you can use the indices.requests.cache.expire setting to specify a TTL for cached results, but there should be no reason to do so. Remember that stale results are automatically invalidated when the index is refreshed. This setting is provided for completeness' sake only.
Monitoring cache usage
The size of the cache (in bytes) and the number of evictions can be viewed by index, with the indices-stats API:
GET /_stats/request_cache?human
or by node with the nodes-stats API:
GET /_nodes/stats/indices/request_cache?human
Indices Recovery
Peer recovery syncs data from a primary shard to a new or existing shard copy.
Peer recovery automatically occurs when {es}:
- Recreates a shard lost during node failure
- Relocates a shard to another node due to a cluster rebalance or changes to the shard allocation settings
You can view a list of in-progress and completed recoveries using the cat recovery API.
Peer recovery settings
indices.recovery.max_bytes_per_sec (Dynamic)
- Limits total inbound and outbound recovery traffic for each node. Defaults to 40mb.
This limit applies to nodes only. If multiple nodes in a cluster perform recoveries at the same time, the cluster's total recovery traffic may exceed this limit.
If this limit is too high, ongoing recoveries may consume an excess of bandwidth and other resources, which can destabilize the cluster.
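For example, the limit might be raised on hardware with fast disks and ample bandwidth; a sketch (100mb is illustrative):
PUT _cluster/settings
{
  "transient": {
    "indices.recovery.max_bytes_per_sec": "100mb"
  }
}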
Expert peer recovery settings
You can use the following expert setting to manage resources for peer recoveries.
indices.recovery.max_concurrent_file_chunks (Dynamic, Expert)
- Number of file chunk requests sent in parallel for each recovery. Defaults to 2.
You can increase the value of this setting when the recovery of a single shard is not reaching the traffic limit set by indices.recovery.max_bytes_per_sec.
Search Settings
The following expert setting can be set to manage global search limits.
indices.query.bool.max_clause_count
- Defaults to 1024.
This setting limits the number of clauses a Lucene BooleanQuery can have. The default of 1024 is quite high and should normally be sufficient. This limit does not only affect Elasticsearch's bool query; many other queries are rewritten to Lucene's BooleanQuery internally. The limit is in place to prevent searches from becoming too large and taking up too much CPU and memory. If you are considering increasing this setting, make sure you have exhausted all other options first. Higher values can lead to performance degradation and memory issues, especially in clusters with a high load or few resources.
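As a static setting it would go in elasticsearch.yml; a sketch (2048 is illustrative):
indices.query.bool.max_clause_count: 2048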
Network Settings
Elasticsearch binds to localhost only by default. This is sufficient for you to run a local development server (or even a development cluster, if you start multiple nodes on the same machine), but you will need to configure some basic network settings in order to run a real production cluster across multiple servers.
Warning: Be careful with the network configuration! Never expose an unprotected node to the public internet.
Commonly Used Network Settings
network.host
- The node will bind to this hostname or IP address and publish (advertise) this host to other nodes in the cluster. Accepts an IP address, a hostname, a special value, or an array of any combination of these. Note that any value containing a : (e.g. an IPv6 address, or one of the special values) must be quoted because : is a special character in YAML. 0.0.0.0 is an acceptable IP address and will bind to all network interfaces. The value 0 has the same effect as the value 0.0.0.0. Defaults to _local_.
discovery.zen.ping.unicast.hosts
- In order to join a cluster, a node needs to know the hostname or IP address of at least some of the other nodes in the cluster. This setting provides the initial list of other nodes that this node will try to contact. Accepts IP addresses or hostnames. If a hostname lookup resolves to multiple IP addresses then each IP address will be used for discovery. Round robin DNS — returning a different IP from a list on each lookup — can be used for discovery; non-existent IP addresses will throw exceptions and cause another DNS lookup on the next round of pinging (subject to JVM DNS caching). Defaults to ["127.0.0.1", "[::1]"].
http.port
- Port to bind to for incoming HTTP requests. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range. Defaults to 9200-9300.
transport.port
- Port to bind for communication between nodes. Accepts a single value or a range. If a range is specified, the node will bind to the first available port in the range. Defaults to 9300-9400.
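A sketch of a minimal elasticsearch.yml fragment combining these settings (the addresses are hypothetical):
network.host: 192.168.1.10
discovery.zen.ping.unicast.hosts: ["192.168.1.10", "192.168.1.11"]
http.port: 9200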
Special values for network.host
The following special values may be passed to network.host:
_[networkInterface]_ - Addresses of a network interface, for example _en0_.
_local_ - Any loopback addresses on the system, for example 127.0.0.1.
_site_ - Any site-local addresses on the system, for example 192.168.0.1.
_global_ - Any globally-scoped addresses on the system, for example 8.8.8.8.
IPv4 vs IPv6
These special values will work over both IPv4 and IPv6 by default, but you can also limit this with the use of :ipv4 or :ipv6 specifiers. For example, _en0:ipv4_ would only bind to the IPv4 addresses of interface en0.
Tip: Discovery in the cloud - More special settings are available when running in the cloud with either the {plugins}/discovery-ec2.html[EC2 discovery plugin] or the {plugins}/discovery-gce-network-host.html#discovery-gce-network-host[Google Compute Engine discovery plugin] installed.
Advanced network settings
The network.host setting explained in Commonly used network settings is a shortcut which sets the bind host and the publish host at the same time. In advanced use cases, such as when running behind a proxy server, you may need to set these settings to different values:
network.bind_host
- This specifies which network interface(s) a node should bind to in order to listen for incoming requests. A node can bind to multiple interfaces, e.g. two network cards, or a site-local address and a local address. Defaults to network.host.
network.publish_host
- The publish host is the single interface that the node advertises to other nodes in the cluster, so that those nodes can connect to it. Currently an Elasticsearch node may be bound to multiple addresses, but only publishes one. If not specified, this defaults to the "best" address from network.host, sorted by IPv4/IPv6 stack preference, then by reachability. If you set a network.host that results in multiple bind addresses yet rely on a specific address for node-to-node communication, you should explicitly set network.publish_host.
Both of the above settings can be configured just like network.host — they accept IP addresses, host names, and special values.
Advanced TCP Settings
network.tcp.no_delay
- Enable or disable the TCP no delay setting. Defaults to true.
network.tcp.keep_alive
- Enable or disable TCP keep alive. Defaults to true.
network.tcp.reuse_address
- Should an address be reused or not. Defaults to true on non-windows machines.
network.tcp.send_buffer_size
- The size of the TCP send buffer (specified with size units). By default not explicitly set.
network.tcp.receive_buffer_size
- The size of the TCP receive buffer (specified with size units). By default not explicitly set.
Transport and HTTP protocols
An Elasticsearch node exposes two network protocols which inherit the above settings, but may be further configured independently:
- TCP Transport - Used for communication between nodes in the cluster, by the Java {javaclient}/transport-client.html[Transport client] and by the Tribe node. See the Transport module for more information.
- HTTP - Exposes the JSON-over-HTTP interface used by all clients other than the Java clients. See the HTTP module for more information.
Node
Any time that you start an instance of Elasticsearch, you are starting a node. A collection of connected nodes is called a cluster. If you are running a single node of Elasticsearch, then you have a cluster of one node.
Every node in the cluster can handle HTTP and Transport traffic by default. The transport layer is used exclusively for communication between nodes and the {javaclient}/transport-client.html[Java TransportClient]; the HTTP layer is used only by external REST clients.
All nodes know about all the other nodes in the cluster and can forward client requests to the appropriate node.
By default, a node is all of the following types: master-eligible, data, ingest, and machine learning (if available).
Tip: As the cluster grows and in particular if you have large {ml} jobs, consider separating dedicated master-eligible nodes from dedicated data nodes and dedicated {ml} nodes.
- Master-eligible node - A node that has node.master set to true (default), which makes it eligible to be elected as the master node, which controls the cluster.
- Data node - A node that has node.data set to true (default). Data nodes hold data and perform data related operations such as CRUD, search, and aggregations.
- Ingest node - A node that has node.ingest set to true (default). Ingest nodes are able to apply an ingest pipeline to a document in order to transform and enrich the document before indexing. With a heavy ingest load, it makes sense to use dedicated ingest nodes and to mark the master and data nodes as node.ingest: false.
- Tribe node - A tribe node, configured via the tribe.* settings, is a special type of coordinating only node that can connect to multiple clusters and perform search and other operations across all connected clusters.
By default a node is a master-eligible node and a data node, plus it can pre-process documents through ingest pipelines. This is very convenient for small clusters but, as the cluster grows, it becomes important to consider separating dedicated master-eligible nodes from dedicated data nodes.
- Machine learning node
-
A node that has xpack.ml.enabled and node.ml set to true, which is the default behavior in the {es} {default-dist}. If you want to use {ml-features}, there must be at least one {ml} node in your cluster. For more information about {ml-features}, see {ml-docs}/xpack-ml.html[Machine learning in the {stack}].
Important
|
If you use the {oss-dist}, do not set node.ml. Otherwise, the node fails to start.
|
Note
|
Coordinating node
Requests like search requests or bulk-indexing requests may involve data held on different data nodes. A search request, for example, is executed in two phases which are coordinated by the node which receives the client request — the coordinating node. In the scatter phase, the coordinating node forwards the request to the data nodes which hold the data. Each data node executes the request locally and returns its results to the coordinating node. In the gather phase, the coordinating node reduces each data node’s results into a single global resultset. Every node is implicitly a coordinating node. This means that a node that has
all three node.master, node.data and node.ingest set to false will only act as a coordinating node, which cannot be disabled. As a result, such a node needs to have enough memory and CPU in order to deal with the gather phase. |
Master Eligible Node
The master node is responsible for lightweight cluster-wide actions such as creating or deleting an index, tracking which nodes are part of the cluster, and deciding which shards to allocate to which nodes. It is important for cluster health to have a stable master node.
Any master-eligible node (all nodes by default) may be elected to become the master node by the master election process.
Important
|
Master nodes must have access to the data/ directory (just like
data nodes) as this is where the cluster state is persisted between node restarts.
|
Indexing and searching your data is CPU-, memory-, and I/O-intensive work which can put pressure on a node’s resources. To ensure that your master node is stable and not under pressure, it is a good idea in a bigger cluster to split the roles between dedicated master-eligible nodes and dedicated data nodes.
While master nodes can also behave as coordinating nodes and route search and indexing requests from clients to data nodes, it is better not to use dedicated master nodes for this purpose. It is important for the stability of the cluster that master-eligible nodes do as little work as possible.
To create a dedicated master-eligible node in the {default-dist}, set:
node.master: true (1)
node.data: false (2)
node.ingest: false (3)
node.ml: false (4)
xpack.ml.enabled: true (5)
cluster.remote.connect: false (6)
-
The
node.master
role is enabled by default. -
Disable the
node.data
role (enabled by default). -
Disable the
node.ingest
role (enabled by default). -
Disable the
node.ml
role (enabled by default). -
The
xpack.ml.enabled
setting is enabled by default. -
Disable remote cluster connections (enabled by default).
To create a dedicated master-eligible node in the {oss-dist}, set:
node.master: true (1)
node.data: false (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
-
The
node.master
role is enabled by default. -
Disable the
node.data
role (enabled by default). -
Disable the
node.ingest
role (enabled by default). -
Disable remote cluster connections (enabled by default).
Avoiding split brain with minimum_master_nodes
To prevent data loss, it is vital to configure the
discovery.zen.minimum_master_nodes
setting (which defaults to 1
) so that
each master-eligible node knows the minimum number of master-eligible nodes
that must be visible in order to form a cluster.
To explain, imagine that you have a cluster consisting of two master-eligible
nodes. A network failure breaks communication between these two nodes. Each
node sees one master-eligible node… itself. With minimum_master_nodes
set
to the default of 1
, this is sufficient to form a cluster. Each node elects
itself as the new master (thinking that the other master-eligible node has
died) and the result is two clusters, or a split brain. These two nodes
will never rejoin until one node is restarted. Any data that has been written
to the restarted node will be lost.
Now imagine that you have a cluster with three master-eligible nodes, and
minimum_master_nodes
set to 2
. If a network split separates one node from
the other two nodes, the side with one node cannot see enough master-eligible
nodes and will realise that it cannot elect itself as master. The side with
two nodes will elect a new master (if needed) and continue functioning
correctly. As soon as the network split is resolved, the single node will
rejoin the cluster and start serving requests again.
This setting should be set to a quorum of master-eligible nodes:
(master_eligible_nodes / 2) + 1
In other words, if there are three master-eligible nodes, then minimum master
nodes should be set to (3 / 2) + 1
or 2
:
discovery.zen.minimum_master_nodes: 2 (1)
-
Defaults to
1
.
To be able to remain available when one of the master-eligible nodes fails,
clusters should have at least three master-eligible nodes, with
minimum_master_nodes
set accordingly. A rolling upgrade,
performed without any downtime, also requires at least three master-eligible
nodes to avoid the possibility of data loss if a network split occurs while the
upgrade is in progress.
This setting can also be changed dynamically on a live cluster with the cluster update settings API:
PUT _cluster/settings
{
"transient": {
"discovery.zen.minimum_master_nodes": 2
}
}
Tip
|
An advantage of splitting the master and data roles between dedicated
nodes is that you can have just three master-eligible nodes and set
minimum_master_nodes to 2. You never have to change this setting, no
matter how many dedicated data nodes you add to the cluster.
|
Data Node
Data nodes hold the shards that contain the documents you have indexed. Data nodes handle data related operations like CRUD, search, and aggregations. These operations are I/O-, memory-, and CPU-intensive. It is important to monitor these resources and to add more data nodes if they are overloaded.
The main benefit of having dedicated data nodes is the separation of the master and data roles.
To create a dedicated data node in the {default-dist}, set:
node.master: false (1)
node.data: true (2)
node.ingest: false (3)
node.ml: false (4)
cluster.remote.connect: false (5)
-
Disable the
node.master
role (enabled by default). -
The
node.data
role is enabled by default. -
Disable the
node.ingest
role (enabled by default). -
Disable the
node.ml
role (enabled by default). -
Disable remote cluster connections (enabled by default).
To create a dedicated data node in the {oss-dist}, set:
node.master: false (1)
node.data: true (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
-
Disable the
node.master
role (enabled by default). -
The
node.data
role is enabled by default. -
Disable the
node.ingest
role (enabled by default). -
Disable remote cluster connections (enabled by default).
Ingest Node
Ingest nodes can execute pre-processing pipelines, composed of one or more ingest processors. Depending on the type of operations performed by the ingest processors and the required resources, it may make sense to have dedicated ingest nodes that will only perform this specific task.
To create a dedicated ingest node in the {default-dist}, set:
node.master: false (1)
node.data: false (2)
node.ingest: true (3)
node.ml: false (4)
cluster.remote.connect: false (5)
-
Disable the
node.master
role (enabled by default). -
Disable the
node.data
role (enabled by default). -
The
node.ingest
role is enabled by default. -
Disable the
node.ml
role (enabled by default). -
Disable remote cluster connections (enabled by default).
To create a dedicated ingest node in the {oss-dist}, set:
node.master: false (1)
node.data: false (2)
node.ingest: true (3)
cluster.remote.connect: false (4)
-
Disable the
node.master
role (enabled by default). -
Disable the
node.data
role (enabled by default). -
The
node.ingest
role is enabled by default. -
Disable remote cluster connections (enabled by default).
Coordinating only node
If you take away the ability to handle master duties, to hold data, and to pre-process documents, then you are left with a coordinating node that can only route requests, handle the search reduce phase, and distribute bulk indexing. Essentially, coordinating only nodes behave as smart load balancers.
Coordinating only nodes can benefit large clusters by offloading the coordinating node role from data and master-eligible nodes. They join the cluster and receive the full cluster state, like every other node, and they use the cluster state to route requests directly to the appropriate place(s).
Warning
|
Adding too many coordinating only nodes to a cluster can increase the burden on the entire cluster because the elected master node must await acknowledgement of cluster state updates from every node! The benefit of coordinating only nodes should not be overstated — data nodes can happily serve the same purpose. |
To create a dedicated coordinating node in the {default-dist}, set:
node.master: false (1)
node.data: false (2)
node.ingest: false (3)
node.ml: false (4)
cluster.remote.connect: false (5)
-
Disable the
node.master
role (enabled by default). -
Disable the
node.data
role (enabled by default). -
Disable the
node.ingest
role (enabled by default). -
Disable the
node.ml
role (enabled by default). -
Disable remote cluster connections (enabled by default).
To create a dedicated coordinating node in the {oss-dist}, set:
node.master: false (1)
node.data: false (2)
node.ingest: false (3)
cluster.remote.connect: false (4)
-
Disable the
node.master
role (enabled by default). -
Disable the
node.data
role (enabled by default). -
Disable the
node.ingest
role (enabled by default). -
Disable remote cluster connections (enabled by default).
Machine learning node
The {ml-features} provide {ml} nodes, which run jobs and handle {ml} API
requests. If xpack.ml.enabled
is set to true and node.ml
is set to false
,
the node can service API requests but it cannot run jobs.
If you want to use {ml-features} in your cluster, you must enable {ml}
(set xpack.ml.enabled
to true
) on all master-eligible nodes. If you have the
{oss-dist}, do not use these settings.
For more information about these settings, see [ml-settings].
To create a dedicated {ml} node in the {default-dist}, set:
node.master: false (1)
node.data: false (2)
node.ingest: false (3)
node.ml: true (4)
xpack.ml.enabled: true (5)
cluster.remote.connect: false (6)
-
Disable the
node.master
role (enabled by default). -
Disable the
node.data
role (enabled by default). -
Disable the
node.ingest
role (enabled by default). -
The
node.ml
role is enabled by default. -
The
xpack.ml.enabled
setting is enabled by default. -
Disable remote cluster connections (enabled by default).
Node data path settings
path.data
Every data and master-eligible node requires access to a data directory where
shards and index and cluster metadata will be stored. The path.data
defaults
to $ES_HOME/data
but can be configured in the elasticsearch.yml
config file as an absolute path or a path relative to $ES_HOME as follows:
path.data: /var/elasticsearch/data
Like all node settings, it can also be specified on the command line as:
./bin/elasticsearch -Epath.data=/var/elasticsearch/data
Tip
|
When using the .zip or .tar.gz distributions, the path.data setting
should be configured to locate the data directory outside the Elasticsearch
home directory, so that the home directory can be deleted without deleting
your data! The RPM and Debian distributions do this for you already.
|
node.max_local_storage_nodes
The data path can be shared by multiple nodes, even by nodes from different clusters. This is very useful for testing failover and different configurations on your development machine. In production, however, it is recommended to run only one node of Elasticsearch per server.
By default, Elasticsearch is configured to prevent more than one node from sharing the same data
path. To allow for more than one node (e.g., on your development machine), use the setting
node.max_local_storage_nodes
and set this to a positive integer larger than one.
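For example, a minimal sketch for a development machine on which two nodes are allowed to share the same data path:
node.max_local_storage_nodes: 2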
Warning
|
Never run different node types (i.e. master, data) from the same data directory. This can lead to unexpected data loss. |
Other node settings
More node settings can be found in Modules. Of particular note are
the cluster.name
, the node.name
and the
network settings.
Plugins
Plugins are a way to enhance the basic Elasticsearch functionality in a custom manner. They range from adding custom mapping types, custom analyzers (in a more built-in fashion), custom script engines, custom discovery and more.
See the {plugins}/index.html[Plugins documentation] for more.
Snapshot And Restore
A snapshot is a backup taken from a running Elasticsearch cluster. You can take a snapshot of individual indices or of the entire cluster and store it in a repository on a shared filesystem, and there are plugins that support remote repositories on S3, HDFS, Azure, Google Cloud Storage and more.
Snapshots are taken incrementally. This means that when creating a snapshot of an index Elasticsearch will avoid copying any data that is already stored in the repository as part of an earlier snapshot of the same index. Therefore it can be efficient to take snapshots of your cluster quite frequently.
Snapshots can be restored into a running cluster via the restore API. When restoring an index it is possible to alter the name of the restored index as well as some of its settings, allowing a great deal of flexibility in how the snapshot and restore functionality can be used.
Warning
|
It is not possible to back up an Elasticsearch cluster simply by taking a copy of the data directories of all of its nodes. Elasticsearch may be making changes to the contents of its data directories while it is running, and this means that copying its data directories cannot be expected to capture a consistent picture of their contents. Attempts to restore a cluster from such a backup may fail, reporting corruption and/or missing files, or may appear to have succeeded having silently lost some of its data. The only reliable way to back up a cluster is by using the snapshot and restore functionality. |
Version compatibility
Important
|
Version compatibility refers to the underlying Lucene index compatibility. Follow the Upgrade documentation when migrating between versions. |
A snapshot contains a copy of the on-disk data structures that make up an index. This means that snapshots can only be restored to versions of Elasticsearch that can read the indices:
-
A snapshot of an index created in 5.x can be restored to 6.x.
-
A snapshot of an index created in 2.x can be restored to 5.x.
-
A snapshot of an index created in 1.x can be restored to 2.x.
Conversely, snapshots of indices created in 1.x cannot be restored to 5.x or 6.x, and snapshots of indices created in 2.x cannot be restored to 6.x.
Each snapshot can contain indices created in various versions of Elasticsearch, and when restoring a snapshot it must be possible to restore all of the indices into the target cluster. If any indices in a snapshot were created in an incompatible version, you will not be able to restore the snapshot.
Important
|
When backing up your data prior to an upgrade, keep in mind that you won’t be able to restore snapshots after you upgrade if they contain indices created in a version that’s incompatible with the upgrade version. |
If you end up in a situation where you need to restore a snapshot of an index that is incompatible with the version of the cluster you are currently running, you can restore it on the latest compatible version and use reindex-from-remote to rebuild the index on the current version. Reindexing from remote is only possible if the original index has source enabled. Retrieving and reindexing the data can take significantly longer than simply restoring a snapshot. If you have a large amount of data, we recommend testing the reindex from remote process with a subset of your data to understand the time requirements before proceeding.
Repositories
You must register a snapshot repository before you can perform snapshot and restore operations. We recommend creating a new snapshot repository for each major version. The valid repository settings depend on the repository type.
If you register the same snapshot repository with multiple clusters, only
one cluster should have write access to the repository. All other clusters
connected to that repository should set the repository to readonly
mode.
Important
|
The snapshot format can change across major versions, so if you have
clusters on different versions trying to write to the same repository, snapshots
written by one version may not be visible to the other and the repository could
be corrupted. While setting the repository to readonly on all but one of the
clusters should work with multiple clusters differing by one major version, it
is not a supported configuration.
|
Repositories are registered using the _snapshot API. For example, the following request registers a shared file system repository with the name my_backup:
PUT /_snapshot/my_backup
{
"type": "fs",
"settings": {
"location": "my_backup_location"
}
}
To retrieve information about a registered repository, use a GET request:
GET /_snapshot/my_backup
which returns:
{
"my_backup": {
"type": "fs",
"settings": {
"location": "my_backup_location"
}
}
}
To retrieve information about multiple repositories, specify a comma-delimited
list of repositories. You can also use the * wildcard when
specifying repository names. For example, the following request retrieves
information about all of the snapshot repositories that start with repo
or
contain backup
:
GET /_snapshot/repo*,*backup*
To retrieve information about all registered snapshot repositories, omit the
repository name or specify _all
:
GET /_snapshot
or
GET /_snapshot/_all
Shared File System Repository
The shared file system repository ("type": "fs"
) uses the shared file system to store snapshots. In order to register
the shared file system repository it is necessary to mount the same shared filesystem to the same location on all
master and data nodes. This location (or one of its parent directories) must be registered in the path.repo
setting on all master and data nodes.
Assuming that the shared filesystem is mounted to /mount/backups/my_fs_backup_location
, the following setting should
be added to the elasticsearch.yml file:
path.repo: ["/mount/backups", "/mount/longterm_backups"]
The path.repo
setting supports Microsoft Windows UNC paths as long as at least the server name and share are specified as
a prefix and backslashes are properly escaped:
path.repo: ["\\\\MY_SERVER\\Snapshots"]
After all nodes are restarted, the following command can be used to register the shared file system repository with
the name my_fs_backup
:
PUT /_snapshot/my_fs_backup
{
"type": "fs",
"settings": {
"location": "/mount/backups/my_fs_backup_location",
"compress": true
}
}
If the repository location is specified as a relative path, this path will be resolved against the first path specified
in path.repo
:
PUT /_snapshot/my_fs_backup
{
"type": "fs",
"settings": {
"location": "my_fs_backup_location",
"compress": true
}
}
The following settings are supported:
Setting | Description |
---|---|
location | Location of the snapshots. Mandatory. |
compress | Turns on compression of the snapshot files. Compression is applied only to metadata files (index mapping and settings). Data files are not compressed. Defaults to true. |
chunk_size | Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 1g, 10m, 5k. Defaults to null (unlimited chunk size). |
max_restore_bytes_per_sec | Throttles per node restore rate. Defaults to 40mb per second. |
max_snapshot_bytes_per_sec | Throttles per node snapshot rate. Defaults to 40mb per second. |
readonly | Makes repository read-only. Defaults to false. |
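As an illustration of the readonly setting, a sketch of how a second cluster might register the same shared file system location with write access disabled:
PUT /_snapshot/my_fs_backup
{
  "type": "fs",
  "settings": {
    "location": "/mount/backups/my_fs_backup_location",
    "readonly": true
  }
}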
Read-only URL Repository
The URL repository ("type": "url"
) can be used as an alternative read-only way to access data created by the shared file
system repository. The URL specified in the url
parameter should point to the root of the shared filesystem repository.
The following settings are supported:
url
|
Location of the snapshots. Mandatory. |
The URL repository supports the following protocols: "http", "https", "ftp", "file" and "jar". URL repositories with http:, https:, and ftp: URLs have to be whitelisted by specifying allowed URLs in the repositories.url.allowed_urls setting.
This setting supports wildcards in the place of host, path, query, and fragment. For example:
repositories.url.allowed_urls: ["http://www.example.org/root/*", "https://*.mydomain.com/*?*#*"]
URL repositories with file:
URLs can only point to locations registered in the path.repo
setting similar to
shared file system repository.
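For example, a sketch (the repository name is illustrative) that registers a URL repository pointing at the root of the shared file system repository registered earlier:
PUT /_snapshot/my_url_backup
{
  "type": "url",
  "settings": {
    "url": "file:/mount/backups/my_fs_backup_location"
  }
}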
Source Only Repository
A source repository enables you to create minimal, source-only snapshots that take up to 50% less space on disk. Source only snapshots contain stored fields and index metadata. They do not include index or doc values structures and are not searchable when restored. After restoring a source-only snapshot, you must reindex the data into a new index.
Source repositories delegate to another snapshot repository for storage.
Important
|
Source only snapshots are only supported if the _source field is enabled and no source-filtering is applied.
|
When you create a source repository, you must specify the type and name of the delegate repository where the snapshots will be stored:
PUT _snapshot/my_src_only_repository
{
"type": "source",
"settings": {
"delegate_type": "fs",
"location": "my_backup_location"
}
}
Repository plugins
Other repository backends are available in these official plugins:
-
{plugins}/repository-s3.html[repository-s3] for S3 repository support
-
{plugins}/repository-hdfs.html[repository-hdfs] for HDFS repository support in Hadoop environments
-
{plugins}/repository-azure.html[repository-azure] for Azure storage repositories
-
{plugins}/repository-gcs.html[repository-gcs] for Google Cloud Storage repositories
Repository Verification
When a repository is registered, it’s immediately verified on all master and data nodes to make sure that it is functional
on all nodes currently present in the cluster. The verify
parameter can be used to explicitly disable the repository
verification when registering or updating a repository:
PUT /_snapshot/my_unverified_backup?verify=false
{
"type": "fs",
"settings": {
"location": "my_unverified_backup_location"
}
}
The verification process can also be executed manually by running the following command:
POST /_snapshot/my_unverified_backup/_verify
It returns a list of nodes on which the repository was successfully verified, or an error message if the verification process failed.
Snapshot
A repository can contain multiple snapshots of the same cluster. Snapshots are identified by unique names within the
cluster. A snapshot with the name snapshot_1
in the repository my_backup
can be created by executing the following
command:
PUT /_snapshot/my_backup/snapshot_1?wait_for_completion=true
The wait_for_completion
parameter specifies whether or not the request should return immediately after snapshot
initialization (default) or wait for snapshot completion. During snapshot initialization, information about all
previous snapshots is loaded into memory, which means that in large repositories it may take several seconds (or
even minutes) for this command to return even if the wait_for_completion
parameter is set to false
.
By default a snapshot of all open and started indices in the cluster is created. This behavior can be changed by specifying the list of indices in the body of the snapshot request.
PUT /_snapshot/my_backup/snapshot_2?wait_for_completion=true
{
"indices": "index_1,index_2",
"ignore_unavailable": true,
"include_global_state": false
}
The list of indices that should be included in the snapshot can be specified using the indices
parameter that
supports multi index syntax. The snapshot request also supports the
ignore_unavailable
option. Setting it to true
will cause indices that do not exist to be ignored during snapshot
creation. By default, when the ignore_unavailable
option is not set and an index is missing, the snapshot request will fail.
By setting include_global_state
to false it’s possible to prevent the cluster global state from being stored as part of
the snapshot. By default, the entire snapshot will fail if one or more indices participating in the snapshot don’t have
all primary shards available. This behaviour can be changed by setting partial
to true
.
Snapshot names can be automatically derived using date math expressions, similarly to when creating new indices. Note that special characters need to be URI encoded.
For example, creating a snapshot with the current day in the name, like snapshot-2018.05.11
, can be achieved with
the following command:
# PUT /_snapshot/my_backup/<snapshot-{now/d}>
PUT /_snapshot/my_backup/%3Csnapshot-%7Bnow%2Fd%7D%3E
The index snapshot process is incremental. In the process of making the index snapshot, Elasticsearch analyses the list of the index files that are already stored in the repository and copies only files that were created or changed since the last snapshot. This allows multiple snapshots to be preserved in the repository in a compact form. The snapshot process is executed in a non-blocking fashion. All indexing and searching operations can continue to be executed against the index that is being snapshotted. However, a snapshot represents a point-in-time view of the index at the moment when the snapshot was created, so no records that were added to the index after the snapshot process was started will be present in the snapshot. The snapshot process starts immediately for the primary shards that have been started and are not relocating at the moment. Before version 1.2.0, the snapshot operation failed if the cluster had any relocating or initializing primaries of indices participating in the snapshot. Starting with version 1.2.0, Elasticsearch waits for relocation or initialization of shards to complete before snapshotting them.
Besides creating a copy of each index the snapshot process can also store global cluster metadata, which includes persistent cluster settings and templates. The transient settings and registered snapshot repositories are not stored as part of the snapshot.
Only one snapshot process can be executed in the cluster at any time. While a snapshot of a particular shard is being created, this shard cannot be moved to another node, which can interfere with the rebalancing process and allocation filtering. Elasticsearch will only be able to move a shard to another node (according to the current allocation filtering settings and rebalancing algorithm) once the snapshot is finished.
Once a snapshot is created, information about this snapshot can be obtained using the following command:
GET /_snapshot/my_backup/snapshot_1
This command returns basic information about the snapshot including start and end time, version of
Elasticsearch that created the snapshot, the list of included indices, the current state of the
snapshot and the list of failures that occurred during the snapshot. The snapshot state
can be one of the following:
IN_PROGRESS
|
The snapshot is currently running. |
SUCCESS
|
The snapshot finished and all shards were stored successfully. |
FAILED
|
The snapshot finished with an error and failed to store any data. |
PARTIAL
|
The global cluster state was stored, but data of at least one shard wasn’t stored successfully.
The failure section in this case should contain more detailed information about shards that were not processed correctly. |
INCOMPATIBLE
|
The snapshot was created with an old version of Elasticsearch and therefore is incompatible with the current version of the cluster. |
As with repositories, information about multiple snapshots can be queried in one go, with wildcards supported as well:
GET /_snapshot/my_backup/snapshot_*,some_other_snapshot
All snapshots currently stored in the repository can be listed using the following command:
GET /_snapshot/my_backup/_all
The command fails if some of the snapshots are unavailable. The boolean parameter ignore_unavailable
can be used to
return all snapshots that are currently available.
Getting all snapshots in the repository can be costly on cloud-based repositories,
both from a cost and performance perspective. If the only information required is
the snapshot names/uuids in the repository and the indices in each snapshot, then
the optional boolean parameter verbose
can be set to false
to execute a more
performant and cost-effective retrieval of the snapshots in the repository. Note
that setting verbose
to false
will omit all other information about the snapshot
such as status information, the number of snapshotted shards, etc. The default
value of the verbose
parameter is true
.
A currently running snapshot can be retrieved using the following command:
GET /_snapshot/my_backup/_current
A snapshot can be deleted from the repository using the following command:
DELETE /_snapshot/my_backup/snapshot_2
When a snapshot is deleted from a repository, Elasticsearch deletes all files that are associated with the deleted snapshot and not used by any other snapshots. If the delete snapshot operation is executed while the snapshot is being created, the snapshotting process will be aborted and all files created as part of the snapshotting process will be cleaned up. Therefore, the delete snapshot operation can be used to cancel long running snapshot operations that were started by mistake.
A repository can be unregistered using the following command:
DELETE /_snapshot/my_backup
When a repository is unregistered, Elasticsearch only removes the reference to the location where the repository is storing the snapshots. The snapshots themselves are left untouched and in place.
Restore
A snapshot can be restored using the following command:
POST /_snapshot/my_backup/snapshot_1/_restore
By default, all indices in the snapshot are restored, and the cluster state is
not restored. It’s possible to select indices that should be restored as well
as to restore the global cluster state by using the indices
and
include_global_state
options in the restore request body. The list of indices
supports multi index syntax. The rename_pattern
and rename_replacement
options can also be used to rename indices on restore
using a regular expression that supports referencing the original text as
explained
here.
Set include_aliases
to false
to prevent aliases from being restored together
with the associated indices.
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "index_1,index_2",
"ignore_unavailable": true,
"include_global_state": true,
"rename_pattern": "index_(.+)",
"rename_replacement": "restored_index_$1"
}
The restore operation can be performed on a functioning cluster. However, an
existing index can only be restored if it’s closed and
has the same number of shards as the index in the snapshot. The restore
operation automatically opens restored indices if they were closed and creates
new indices if they didn’t exist in the cluster. If cluster state is restored
with include_global_state
(defaults to false
), the restored templates that
don’t currently exist in the cluster are added and existing templates with the
same name are replaced by the restored templates. The restored persistent
settings are added to the existing persistent settings.
Partial restore
By default, the entire restore operation will fail if one or more indices participating in the operation don’t have
snapshots of all shards available. This can occur, for example, if some shards failed to snapshot. It is still possible to
restore such indices by setting partial
to true
. Please note that only successfully snapshotted shards will be
restored in this case and all missing shards will be recreated empty.
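For example, a sketch of a partial restore of the snapshot used in the earlier examples:
POST /_snapshot/my_backup/snapshot_1/_restore
{
  "partial": true
}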
Changing index settings during restore
Most index settings can be overridden during the restore process. For example, the following command will restore
the index index_1
without creating any replicas while switching back to the default refresh interval:
POST /_snapshot/my_backup/snapshot_1/_restore
{
"indices": "index_1",
"index_settings": {
"index.number_of_replicas": 0
},
"ignore_index_settings": [
"index.refresh_interval"
]
}
Please note that some settings such as index.number_of_shards
cannot be changed during the restore operation.
Restoring to a different cluster
The information stored in a snapshot is not tied to a particular cluster or a cluster name. Therefore it’s possible to restore a snapshot made from one cluster into another cluster. All that is required is registering the repository containing the snapshot in the new cluster and starting the restore process. The new cluster doesn’t have to have the same size or topology. However, the version of the new cluster should be the same or newer (only 1 major version newer) than the cluster that was used to create the snapshot. For example, you can restore a 1.x snapshot to a 2.x cluster, but not a 1.x snapshot to a 5.x cluster.
If the new cluster has a smaller size, additional considerations should be made. First of all, it’s necessary to make sure
that the new cluster has enough capacity to store all indices in the snapshot. It’s possible to change index settings
during restore to reduce the number of replicas, which can help with restoring snapshots into a smaller cluster. It’s also
possible to select only a subset of the indices using the indices
parameter.
If indices in the original cluster were assigned to particular nodes using shard allocation filtering, the same rules will be enforced in the new cluster. Therefore, if the new cluster doesn’t contain nodes with appropriate attributes that a restored index can be allocated on, such an index will not be successfully restored unless these index allocation settings are changed during the restore operation.
The restore operation also checks that restored persistent settings are compatible with the current cluster, to avoid accidentally
restoring incompatible settings such as discovery.zen.minimum_master_nodes
, which could disable a smaller cluster until the
required number of master-eligible nodes is added. If you need to restore a snapshot with incompatible persistent settings, try
restoring it without the global cluster state.
Snapshot status
A list of currently running snapshots with their detailed status information can be obtained using the following command:
GET /_snapshot/_status
In this format, the command will return information about all currently running snapshots. By specifying a repository name, it’s possible to limit the results to a particular repository:
GET /_snapshot/my_backup/_status
If both repository name and snapshot id are specified, this command will return detailed status information for the given snapshot even if it’s not currently running:
GET /_snapshot/my_backup/snapshot_1/_status
The output looks similar to the following:
{
"snapshots": [
{
"snapshot": "snapshot_1",
"repository": "my_backup",
"uuid": "XuBo4l4ISYiVg0nYUen9zg",
"state": "SUCCESS",
"include_global_state": true,
"shards_stats": {
"initializing": 0,
"started": 0,
"finalizing": 0,
"done": 5,
"failed": 0,
"total": 5
},
"stats": {
"incremental": {
"file_count": 8,
"size_in_bytes": 4704
},
"processed": {
"file_count": 7,
"size_in_bytes": 4254
},
"total": {
"file_count": 8,
"size_in_bytes": 4704
},
"start_time_in_millis": 1526280280355,
"time_in_millis": 358,
"number_of_files": 8,
"processed_files": 8,
"total_size_in_bytes": 4704,
"processed_size_in_bytes": 4704
}
}
]
}
The output is composed of different sections. The stats
sub-object provides details on the number and size of files that were
snapshotted. As snapshots are incremental, copying only the Lucene segments that are not already in the repository,
the stats
object contains a total
section for all the files that are referenced by the snapshot, as well as an incremental
section
for those files that actually needed to be copied over as part of the incremental snapshotting. In case of a snapshot that’s still
in progress, there’s also a processed
section that contains information about the files that are in the process of being copied.
Note: Properties number_of_files
, processed_files
, total_size_in_bytes
and processed_size_in_bytes
are used for
backward compatibility reasons with older 5.x and 6.x versions. These fields will be removed in Elasticsearch v7.0.0.
Multiple ids are also supported:
GET /_snapshot/my_backup/snapshot_1,snapshot_2/_status
Monitoring snapshot/restore progress
There are several ways to monitor the progress of the snapshot and restore processes while they are running. Both
operations support the wait_for_completion
parameter, which blocks the client until the operation is completed. This is
the simplest method that can be used to get notified about operation completion.
The snapshot operation can also be monitored by periodic calls to the snapshot info API:
GET /_snapshot/my_backup/snapshot_1
Please note that the snapshot info operation uses the same resources and thread pool as the snapshot operation. So, executing a snapshot info operation while large shards are being snapshotted can cause the snapshot info operation to wait for available resources before returning the result. On very large shards the wait time can be significant.
To get more immediate and complete information about snapshots the snapshot status command can be used instead:
GET /_snapshot/my_backup/snapshot_1/_status
While the snapshot info method returns only basic information about the snapshot in progress, the snapshot status command returns a complete breakdown of the current state for each shard participating in the snapshot.
The restore process piggybacks on the standard recovery mechanism of Elasticsearch. As a result, standard recovery
monitoring services can be used to monitor the state of restore. When the restore operation is executed, the cluster
typically goes into the red
state. This happens because the restore operation starts with "recovering" primary shards of the
restored indices. During this operation the primary shards become unavailable, which manifests itself in the red
cluster
state. Once recovery of the primary shards is completed, Elasticsearch switches to the standard replication process that
creates the required number of replicas; at this moment the cluster switches to the yellow
state. Once all required replicas
are created, the cluster switches to the green
state.
The cluster health operation provides only a high level status of the restore process. It’s possible to get more detailed insight into the current state of the recovery process by using indices recovery and cat recovery APIs.
Stopping currently running snapshot and restore operations
The snapshot and restore framework allows running only one snapshot or one restore operation at a time. If a currently running snapshot was executed by mistake, or takes unusually long, it can be terminated using the snapshot delete operation. The snapshot delete operation checks whether the deleted snapshot is currently running and, if it is, stops that snapshot before deleting the snapshot data from the repository.
DELETE /_snapshot/my_backup/snapshot_1
The restore operation uses the standard shard recovery mechanism. Therefore, any currently running restore operation can be canceled by deleting indices that are being restored. Please note that data for all deleted indices will be removed from the cluster as a result of this operation.
Effect of cluster blocks on snapshot and restore operations
Many snapshot and restore operations are affected by cluster and index blocks. For example, registering and unregistering repositories require write access to the global metadata. The snapshot operation requires that all indices and their metadata, as well as the global metadata, be readable. The restore operation requires the global metadata to be writable; however, the index level blocks are ignored during restore because indices are essentially recreated during restore. Please note that the content of a repository is not part of the cluster, and therefore cluster blocks don’t affect internal repository operations such as listing or deleting snapshots from an already registered repository.
Thread Pool
A node holds several thread pools in order to improve how thread memory consumption is managed within a node. Many of these pools also have queues associated with them, which allow pending requests to be held instead of discarded.
There are several thread pools, but the important ones include:
generic
-
For generic operations (e.g., background node discovery). Thread pool type is
scaling
. index
-
For index/delete operations. Thread pool type is
fixed
with a size of# of available processors
, queue_size of200
. The maximum size for this pool is1 + # of available processors
. search
-
For count/search/suggest operations. Thread pool type is
fixed_auto_queue_size
with a size ofint((# of available_processors * 3) / 2) + 1
, and initial queue_size of1000
. search_throttled
-
For count/search/suggest/get operations on
search_throttled indices
. Thread pool type isfixed_auto_queue_size
with a size of1
, and initial queue_size of100
. get
-
For get operations. Thread pool type is
fixed
with a size of# of available processors
, queue_size of1000
. analyze
-
For analyze requests. Thread pool type is
fixed
with a size of 1, queue size of 16. write
-
For single-document index/delete/update and bulk requests. Thread pool type is
fixed
with a size of# of available processors
, queue_size of200
. The maximum size for this pool is1 + # of available processors
. snapshot
-
For snapshot/restore operations. Thread pool type is
scaling
with a keep-alive of5m
and a max ofmin(5, (# of available processors)/2)
. warmer
-
For segment warm-up operations. Thread pool type is
scaling
with a keep-alive of5m
and a max ofmin(5, (# of available processors)/2)
. refresh
-
For refresh operations. Thread pool type is
scaling
with a keep-alive of5m
and a max ofmin(10, (# of available processors)/2)
. listener
-
Mainly for the Java client, for executing action listeners when they are configured to be threaded. Thread pool type is
scaling
with a default max ofmin(10, (# of available processors)/2)
.
Changing a specific thread pool can be done by setting its type-specific parameters; for example, changing the index
thread pool to have more threads:
thread_pool:
index:
size: 30
Thread pool types
The following are the types of thread pools and their respective parameters:
fixed
The fixed
thread pool holds a fixed size of threads to handle the
requests with a queue (optionally bounded) for pending requests that
have no threads to service them.
The size
parameter controls the number of threads, and defaults to the
number of cores times 5.
The queue_size
parameter controls the size of the queue of pending
requests that have no threads to execute them. By default, it is set to
-1
, which means it is unbounded. When a request comes in and the queue is
full, the request is aborted.
thread_pool:
index:
size: 30
queue_size: 1000
fixed_auto_queue_size
experimental[]
The fixed_auto_queue_size
thread pool holds a fixed size of threads to handle
the requests with a bounded queue for pending requests that have no threads to
service them. It’s similar to the fixed
threadpool, however, the queue_size
automatically adjusts according to calculations based on
Little’s Law. These calculations
will potentially adjust the queue_size
up or down by 50 every time
auto_queue_frame_size
operations have been completed.
The size
parameter controls the number of threads, and defaults to the
number of cores times 5.
The queue_size
parameter controls the initial size of the queue of pending
requests that have no threads to execute them.
The min_queue_size
setting controls the minimum amount the queue_size
can be
adjusted to.
The max_queue_size
setting controls the maximum amount the queue_size
can be
adjusted to.
The auto_queue_frame_size
setting controls the number of operations during
which measurement is taken before the queue is adjusted. It should be large
enough that a single operation cannot unduly bias the calculation.
The target_response_time
is a time value setting that indicates the targeted
average response time for tasks in the thread pool queue. If tasks are routinely
above this time, the thread pool queue will be adjusted down so that tasks are
rejected.
thread_pool:
search:
size: 30
queue_size: 500
min_queue_size: 10
max_queue_size: 1000
auto_queue_frame_size: 2000
target_response_time: 1s
scaling
The scaling
thread pool holds a dynamic number of threads. This
number is proportional to the workload and varies between the value of
the core
and max
parameters.
The keep_alive
parameter determines how long a thread should be kept
around in the thread pool without it doing any work.
thread_pool:
warmer:
core: 1
max: 8
keep_alive: 2m
Processors setting
The number of processors is automatically detected, and the thread pool
settings are automatically set based on it. In some cases it can be
useful to override the number of detected processors. This can be done
by explicitly setting the processors
setting.
processors: 2
There are a few use-cases for explicitly overriding the processors
setting:
-
If you are running multiple instances of Elasticsearch on the same host but want Elasticsearch to size its thread pools as if it only has a fraction of the CPU, you should override the
processors
setting to the desired fraction (e.g., if you’re running two instances of Elasticsearch on a 16-core machine, setprocessors
to 8). Note that this is an expert-level use-case and there’s a lot more involved than just setting theprocessors
setting as there are other considerations like changing the number of garbage collector threads, pinning processes to cores, etc. -
Sometimes the number of processors is wrongly detected and in such cases explicitly setting the
processors
setting will work around such issues.
In order to check the number of processors detected, use the nodes info
API with the os
flag.
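For example:
GET /_nodes/os
The os section of the response for each node includes the number of processors that were detected.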
Transport
The transport module is used for internal communication between nodes
within the cluster. Each call that goes from one node to the other uses
the transport module (for example, when an HTTP GET request is processed
by one node, and should actually be processed by another node that holds
the data). The transport module is also used for the TransportClient
in the
{es} Java API.
The transport mechanism is completely asynchronous in nature, meaning that there is no blocking thread waiting for a response. The benefit of using asynchronous communication is first solving the C10k problem, as well as being the ideal solution for scatter (broadcast) / gather operations such as search in Elasticsearch.
Transport Settings
The internal transport communicates over TCP. You can configure it with the following settings:
Setting | Description |
---|---|
transport.tcp.port | A bind port range. Defaults to 9300-9400. |
transport.publish_port | The port that other nodes in the cluster should use when communicating with this node. Useful when a cluster node is behind a proxy or firewall and the transport.tcp.port is not directly addressable from the outside. Defaults to the actual port assigned via transport.tcp.port. |
transport.bind_host | The host address to bind the transport service to. Defaults to transport.host (if set) or network.bind_host. |
transport.publish_host | The host address to publish for nodes in the cluster to connect to. Defaults to transport.host (if set) or network.publish_host. |
transport.host | Used to set the transport.bind_host and the transport.publish_host. Defaults to network.host. |
transport.tcp.connect_timeout | The connect timeout for initiating a new connection (in time setting format). Defaults to 30s. |
transport.tcp.compress | Set to true to enable compression (DEFLATE) between all nodes. Defaults to false. |
transport.ping_schedule | Schedule a regular application-level ping message to ensure that transport connections between nodes are kept alive. Defaults to 5s in the transport client and -1 (disabled) elsewhere. |
It also uses the common network settings.
Transport Profiles
Elasticsearch allows you to bind to multiple ports on different interfaces by the use of transport profiles. See this example configuration:
transport.profiles.default.port: 9300-9400
transport.profiles.default.bind_host: 10.0.0.1
transport.profiles.client.port: 9500-9600
transport.profiles.client.bind_host: 192.168.0.1
transport.profiles.dmz.port: 9700-9800
transport.profiles.dmz.bind_host: 172.16.1.2
The default
profile is special. It is used as a fallback for any other
profiles, if those do not have a specific configuration setting set, and is how
this node connects to other nodes in the cluster.
The following parameters can be configured on each transport profile, as in the example above:
-
port
: The port to bind to -
bind_host
: The host to bind -
publish_host
: The host which is published in informational APIs -
tcp.no_delay
: Configures theTCP_NO_DELAY
option for this socket -
tcp.keep_alive
: Configures theSO_KEEPALIVE
option for this socket -
tcp.reuse_address
: Configures theSO_REUSEADDR
option for this socket -
tcp.send_buffer_size
: Configures the send buffer size of the socket -
tcp.receive_buffer_size
: Configures the receive buffer size of the socket
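For example, a sketch (the values are illustrative) that adds TCP options to the client profile from the example above:
transport.profiles.client.tcp.no_delay: true
transport.profiles.client.tcp.keep_alive: true
transport.profiles.client.tcp.send_buffer_size: 1mb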
Long-lived idle connections
Elasticsearch opens a number of long-lived TCP connections between each pair of
nodes in the cluster, and some of these connections may be idle for an extended
period of time. Nonetheless, Elasticsearch requires these connections to remain
open, and it can disrupt the operation of the cluster if any inter-node
connections are closed by an external influence such as a firewall. It is
important to configure your network to preserve long-lived idle connections
between Elasticsearch nodes, for instance by leaving tcp.keep_alive
enabled
and ensuring that the keepalive interval is shorter than any timeout that might
cause idle connections to be closed, or by setting transport.ping_schedule
if
keepalives cannot be configured.
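For example, a minimal sketch that enables a 30-second application-level ping for the case where TCP keepalives cannot be configured:
transport.ping_schedule: 30s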
Transport Compression
Request Compression
By default, the transport.compress
setting is false
and network-level
request compression is disabled between nodes in the cluster. This default
normally makes sense for local cluster communication as compression has a
noticeable CPU cost and local clusters tend to be set up with fast network
connections between nodes.
The transport.compress
setting always configures local cluster request
compression and is the fallback setting for remote cluster request compression.
If you want to configure remote request compression differently than local
request compression, you can set it on a per-remote cluster basis using the
cluster.remote.${cluster_alias}.transport.compress
setting.
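For example, a sketch that keeps local request compression disabled while enabling it for requests to a remote cluster with the alias cluster_two (the alias matches the remote cluster examples later in this document):
transport.compress: false
cluster.remote.cluster_two.transport.compress: true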
Response Compression
The compression settings do not configure compression for responses. {es} will compress a response if the inbound request was compressed—even when compression is not enabled. Similarly, {es} will not compress a response if the inbound request was uncompressed—even when compression is enabled.
Transport Tracer
The transport module has a dedicated tracer logger which, when activated, logs incoming and outgoing requests. The log can be dynamically activated
by setting the level of the org.elasticsearch.transport.TransportService.tracer
logger to TRACE
:
PUT _cluster/settings
{
"transient" : {
"logger.org.elasticsearch.transport.TransportService.tracer" : "TRACE"
}
}
You can also control which actions will be traced, using a set of include and exclude wildcard patterns. By default every request will be traced except for fault detection pings:
PUT _cluster/settings
{
"transient" : {
"transport.tracer.include" : "*",
"transport.tracer.exclude" : "internal:discovery/zen/fd*"
}
}
Tribe node
deprecated[5.4.0,The tribe
node is deprecated in favour of {ccs-cap} and will be removed in Elasticsearch 7.0.]
The tribes feature allows a tribe node to act as a federated client across multiple clusters.
The tribe node works by retrieving the cluster state from all connected clusters and merging them into a global cluster state. With this information at hand, it is able to perform read and write operations against the nodes in all clusters as if they were local. Note that a tribe node needs to be able to connect to each single node in every configured cluster.
The elasticsearch.yml
config file for a tribe node just needs to list the
clusters that should be joined, for instance:
tribe:
t1: (1)
cluster.name: cluster_one
t2: (1)
cluster.name: cluster_two
-
t1
andt2
are arbitrary names representing the connection to each cluster.
The example above configures connections to two clusters, named t1
and t2
respectively. The tribe node will create a node client to
connect to each cluster using unicast discovery by default. Any
other settings for the connection can be configured under tribe.{name}
, just
like the cluster.name
in the example.
The merged global cluster state means that almost all operations work in the same way as a single cluster: distributed search, suggest, percolation, indexing, etc.
However, there are a few exceptions:
-
The merged view cannot handle indices with the same name in multiple clusters. By default it will pick one of them; see the on_conflict options below.
-
Master level read operations (e.g. [cluster-state], [cluster-health]) will automatically execute with a local flag set to true since there is no master.
-
Master level write operations (e.g. [indices-create-index]) are not allowed. These should be performed on a single cluster.
The tribe node can be configured to block all write operations and all metadata operations with:
tribe:
blocks:
write: true
metadata: true
The tribe node can also configure blocks on selected indices:
tribe:
blocks:
write.indices: hk*,ldn*
metadata.indices: hk*,ldn*
When there is a conflict and multiple clusters hold the same index, by default
the tribe node will pick one of them. This can be configured using the tribe.on_conflict
setting. It defaults to any
, but can be set to drop
(drop indices that have
a conflict), or prefer_[tribeName]
to prefer the index from a specific tribe.
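For example, a sketch that prefers the copy of a conflicting index held by the t1 cluster from the example above:
tribe:
  on_conflict: prefer_t1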
Tribe node settings
The tribe node starts a node client for each listed cluster. The following configuration options are passed down from the tribe node to each node client:
-
node.name
(used to derive thenode.name
for each node client) -
network.host
-
network.bind_host
-
network.publish_host
-
transport.host
-
transport.bind_host
-
transport.publish_host
-
path.home
-
path.logs
-
shield.*
Almost any setting (except for path.*
) may be configured at the node client
level itself, in which case it will override any passed through setting from
the tribe node. Settings you may want to set at the node client level
include:
-
network.host
-
network.bind_host
-
network.publish_host
-
transport.host
-
transport.bind_host
-
transport.publish_host
-
cluster.name
-
discovery.zen.ping.unicast.hosts
network.host: 192.168.1.5 (1)
tribe:
t1:
cluster.name: cluster_one
t2:
cluster.name: cluster_two
network.host: 10.1.2.3 (2)
-
The
network.host
setting is inherited by t1
. -
The
t2
node client overrides the setting inherited from the tribe node.
Remote clusters
The remote clusters module enables you to establish uni-directional connections to a remote cluster. This functionality is used in {ccs}.
Remote cluster connections work by configuring a remote cluster and connecting only to a limited number of nodes in that remote cluster. Each remote cluster is referenced by a name and a list of seed nodes. When a remote cluster is registered, its cluster state is retrieved from one of the seed nodes and up to three gateway nodes are selected to be connected to as part of remote cluster requests. All the communication required between different clusters goes through the transport layer. Remote cluster connections consist of uni-directional connections from the coordinating node to the selected remote gateway nodes only.
Gateway nodes selection
The gateway nodes selection depends on the following criteria:
-
version: Remote nodes must be compatible with the cluster they are registered to. This is subject to rules that are similar to those for [rolling-upgrades]. Any node can communicate with any other node on the same major version (e.g. 7.0 can talk to any 7.x node). Only nodes on the last minor version of a certain major version can communicate with nodes on the following major version. Note that in the 6.x series, 6.8 can communicate with any 7.x node, while 6.7 can only communicate with 7.0. Version compatibility is symmetric, meaning that if 6.7 can communicate with 7.0, 7.0 can also communicate with 6.7. The matrix below summarizes compatibility as described above.
Compatibility | 5.0→5.5 | 5.6 | 6.0→6.6 | 6.7 | 6.8 | 7.0 | 7.1→7.x |
---|---|---|---|---|---|---|---|
5.0→5.5 | Yes | Yes | No | No | No | No | No |
5.6 | Yes | Yes | Yes | Yes | Yes | No | No |
6.0→6.6 | No | Yes | Yes | Yes | Yes | No | No |
6.7 | No | Yes | Yes | Yes | Yes | Yes | No |
6.8 | No | Yes | Yes | Yes | Yes | Yes | Yes |
7.0 | No | No | No | Yes | Yes | Yes | Yes |
7.1→7.x | No | No | No | No | Yes | Yes | Yes |
- role: Dedicated master nodes never get selected.
- attributes: You can tag which nodes should be selected (see Remote cluster settings), though such tagged nodes still have to satisfy the two above requirements, as sketched below.
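For example, gateway eligibility can be narrowed to tagged nodes by combining a custom node attribute with the `cluster.remote.node.attr` setting described under Remote cluster settings below (the attribute name `gateway` here is illustrative):
# elasticsearch.yml on the remote cluster's nodes that should be eligible as gateways:
node.attr.gateway: true

# elasticsearch.yml on the local cluster's nodes that open remote connections:
cluster.remote.node.attr: gateway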
Configuring remote clusters
You can configure remote clusters globally by using cluster settings, which you can update dynamically. Alternatively, you can configure them locally on individual nodes by using the `elasticsearch.yml` file.
If you specify the settings in `elasticsearch.yml` files, only the nodes with those settings can connect to the remote cluster. In other words, functionality that relies on remote cluster requests must be driven specifically from those nodes. For example:
cluster:
    remote:
        cluster_one: (1)
            seeds: 127.0.0.1:9300
            transport.ping_schedule: 30s (2)
        cluster_two:
            seeds: 127.0.0.1:9301
            transport.compress: true (3)
            skip_unavailable: true (4)
- `cluster_one` and `cluster_two` are arbitrary cluster aliases representing the connection to each cluster. These names are subsequently used to distinguish between local and remote indices.
- A keep-alive ping is configured for `cluster_one`.
- Compression is explicitly enabled for requests to `cluster_two`.
- Disconnected remote clusters are optional for `cluster_two`.
For more information about the optional transport settings, see Transport.
If you use cluster settings, the remote clusters are available on every node in the cluster. For example:
PUT _cluster/settings
{
"persistent": {
"cluster": {
"remote": {
"cluster_one": {
"seeds": [
"127.0.0.1:9300"
],
"transport.ping_schedule": "30s"
},
"cluster_two": {
"seeds": [
"127.0.0.1:9301"
],
"transport.compress": true,
"skip_unavailable": true
},
"cluster_three": {
"seeds": [
"127.0.0.1:9302"
]
}
}
}
}
}
You can dynamically update the compression and ping schedule settings. However, you must re-include seeds in the settings update request. For example:
PUT _cluster/settings
{
"persistent": {
"cluster": {
"remote": {
"cluster_one": {
"seeds": [
"127.0.0.1:9300"
],
"transport.ping_schedule": "60s"
},
"cluster_two": {
"seeds": [
"127.0.0.1:9301"
],
"transport.compress": false
}
}
}
}
}
Note: When the compression or ping schedule settings change, all the existing node connections must close and re-open, which can cause in-flight requests to fail.
A remote cluster can be deleted from the cluster settings by setting its seeds and optional settings to `null`:
PUT _cluster/settings
{
"persistent": {
"cluster": {
"remote": {
"cluster_two": { (1)
"seeds": null,
"skip_unavailable": null,
"transport": {
"compress": null
}
}
}
}
}
}
- `cluster_two` would be removed from the cluster settings, leaving `cluster_one` and `cluster_three` intact.
Remote cluster settings
cluster.remote.connections_per_cluster
-
The number of gateway nodes to connect to per remote cluster. The default is `3`.
cluster.remote.initial_connect_timeout
-
The time to wait for remote connections to be established when the node starts. The default is `30s`.
cluster.remote.node.attr
-
A node attribute to filter out nodes that are eligible as a gateway node in the remote cluster. For instance, a node can have a node attribute `node.attr.gateway: true` such that only nodes with this attribute will be connected to if `cluster.remote.node.attr` is set to `gateway`.
cluster.remote.connect
-
By default, any node in the cluster can act as a cross-cluster client and connect to remote clusters. The `cluster.remote.connect` setting can be set to `false` (defaults to `true`) to prevent certain nodes from connecting to remote clusters. Remote cluster requests must be sent to a node that is allowed to act as a cross-cluster client.
cluster.remote.${cluster_alias}.skip_unavailable
-
Per-cluster boolean setting that allows skipping specific clusters when no nodes belonging to them are available and they are the target of a remote cluster request. Default is `false`, meaning that all clusters are mandatory by default, but they can selectively be made optional by setting this setting to `true`.
cluster.remote.${cluster_alias}.transport.ping_schedule
-
Sets the time interval between regular application-level ping messages that are sent to ensure that transport connections to nodes belonging to remote clusters are kept alive. If set to `-1`, application-level ping messages to this remote cluster are not sent. If unset, application-level ping messages are sent according to the global `transport.ping_schedule` setting, which defaults to `-1`, meaning that pings are not sent.
cluster.remote.${cluster_alias}.transport.compress
-
Per-cluster boolean setting that enables you to configure compression for requests to a specific remote cluster. This setting impacts only requests sent to the remote cluster. If the inbound request is compressed, Elasticsearch compresses the response. If unset, the global `transport.compress` setting is used as the fallback.
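For example, a minimal sketch that keeps a node from acting as a cross-cluster client, using only the `cluster.remote.connect` setting described above, is the following line in its `elasticsearch.yml`:
cluster.remote.connect: false  # remote cluster requests must be sent to other nodes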
Retrieving remote clusters info
You can use the remote cluster info API to retrieve information about the configured remote clusters, as well as the remote nodes that the node is connected to.
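For example, the following request lists each configured remote cluster along with its connection status:
GET _remote/info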
{ccs-cap}
The {ccs} feature allows any node to act as a federated client across multiple clusters. In contrast to the tribe node feature, a {ccs} node won’t join the remote cluster; instead, it connects to a remote cluster in a lightweight fashion in order to execute federated search requests. For details on communication and compatibility between different clusters, see Remote clusters.
Using {ccs}
{ccs-cap} requires configuring remote clusters.
PUT _cluster/settings
{
"persistent": {
"cluster": {
"remote": {
"cluster_one": {
"seeds": [
"127.0.0.1:9300"
]
},
"cluster_two": {
"seeds": [
"127.0.0.1:9301"
]
},
"cluster_three": {
"seeds": [
"127.0.0.1:9302"
]
}
}
}
}
}
To search the `twitter` index on remote cluster `cluster_one`, the index name must be prefixed with the alias of the remote cluster followed by the `:` character:
GET /cluster_one:twitter/_search
{
"query": {
"match": {
"user": "kimchy"
}
}
}
{
"took": 150,
"timed_out": false,
"_shards": {
"total": 1,
"successful": 1,
"failed": 0,
"skipped": 0
},
"_clusters": {
"total": 1,
"successful": 1,
"skipped": 0
},
"hits": {
"total": 1,
"max_score": 1,
"hits": [
{
"_index": "cluster_one:twitter",
"_type": "_doc",
"_id": "0",
"_score": 1,
"_source": {
"user": "kimchy",
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"likes": 0
}
}
]
}
}
In contrast to the `tribe` feature, {ccs} can also search indices with the same name on different clusters:
GET /cluster_one:twitter,twitter/_search
{
"query": {
"match": {
"user": "kimchy"
}
}
}
Search results are disambiguated the same way as the indices are disambiguated in the request. Indices with the same name are treated as different indices when results are merged. All results retrieved from an index located in a remote cluster are prefixed with their corresponding cluster alias:
{
"took": 150,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0,
"skipped": 0
},
"_clusters": {
"total": 2,
"successful": 2,
"skipped": 0
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "cluster_one:twitter",
"_type": "_doc",
"_id": "0",
"_score": 1,
"_source": {
"user": "kimchy",
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"likes": 0
}
},
{
"_index": "twitter",
"_type": "_doc",
"_id": "0",
"_score": 2,
"_source": {
"user": "kimchy",
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"likes": 0
}
}
]
}
}
Skipping disconnected clusters
By default, all remote clusters that are searched via {ccs} need to be
available when the search request is executed. Otherwise, the whole request
fails; even if some of the clusters are available, no search results are
returned. You can use the boolean skip_unavailable
setting to make remote
clusters optional. By default, it is set to false
.
PUT _cluster/settings
{
"persistent": {
"cluster.remote.cluster_two.skip_unavailable": true (1)
}
}
- `cluster_two` is made optional.
GET /cluster_one:twitter,cluster_two:twitter,twitter/_search (1)
{
"query": {
"match": {
"user": "kimchy"
}
}
}
- Search against the `twitter` index in `cluster_one`, `cluster_two`, and also locally.
{
"took": 150,
"timed_out": false,
"_shards": {
"total": 2,
"successful": 2,
"failed": 0,
"skipped": 0
},
"_clusters": { (1)
"total": 3,
"successful": 2,
"skipped": 1
},
"hits": {
"total": 2,
"max_score": 1,
"hits": [
{
"_index": "cluster_one:twitter",
"_type": "_doc",
"_id": "0",
"_score": 1,
"_source": {
"user": "kimchy",
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"likes": 0
}
},
{
"_index": "twitter",
"_type": "_doc",
"_id": "0",
"_score": 2,
"_source": {
"user": "kimchy",
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"likes": 0
}
}
]
}
}
- The `_clusters` section indicates that one cluster was unavailable and was skipped.