"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/plugins/index.asciidoc" (29 Dec 2021, 2134 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Unresolved directive in ../Versions.asciidoc - include::{asciidoc-dir}/../../shared/versions/stack/{source_branch}.asciidoc[]

Unresolved directive in ../Versions.asciidoc - include::{asciidoc-dir}/../../shared/attributes.asciidoc[]

Introduction to plugins

Plugins are a way to enhance the core Elasticsearch functionality in a custom manner. They range from adding custom mapping types, custom analyzers, and native scripts to custom discovery mechanisms, and more.

Plugins contain JAR files, but may also contain scripts and config files, and must be installed on every node in the cluster. After installation, each node must be restarted before the plugin becomes visible.

Note
A full cluster restart is required for installing plugins that have custom cluster state metadata, such as X-Pack. It is still possible to upgrade such plugins with a rolling restart.

This documentation distinguishes two categories of plugins:

Core Plugins

This category identifies plugins that are part of the Elasticsearch project. Delivered at the same time as Elasticsearch, their version number always matches the version number of Elasticsearch itself. These plugins are maintained by the Elastic team with the appreciated help of amazing community members (for open source plugins). Issues and bug reports can be reported on the GitHub project page.

Community contributed

This category identifies plugins that are external to the Elasticsearch project. They are provided by individual developers or private companies and have their own licenses as well as their own versioning system. Issues and bug reports can usually be reported on the community plugin’s web site.

For advice on writing your own plugin, see Help for plugin authors.

Important
Site plugins — plugins containing HTML, CSS and JavaScript — are no longer supported.

Plugin Management

The plugin script is used to install, list, and remove plugins. It is located in the $ES_HOME/bin directory by default but it may be in a different location depending on which Elasticsearch package you installed:

  • {ref}/zip-targz.html#zip-targz-layout[Directory layout of .zip and .tar.gz archives]

  • {ref}/deb.html#deb-layout[Directory layout of Debian package]

  • {ref}/rpm.html#rpm-layout[Directory layout of RPM]

Run the following command to get usage instructions:

sudo bin/elasticsearch-plugin -h
Important
Running as root

If Elasticsearch was installed using the deb or rpm package then run /usr/share/elasticsearch/bin/elasticsearch-plugin as root so it can write to the appropriate files on disk. Otherwise run bin/elasticsearch-plugin as the user that owns all of the Elasticsearch files.

Installing Plugins

The documentation for each plugin usually includes specific installation instructions for that plugin, but below we document the various available options:

Core Elasticsearch plugins

Core Elasticsearch plugins can be installed as follows:

sudo bin/elasticsearch-plugin install [plugin_name]

For instance, to install the core ICU plugin, just run the following command:

sudo bin/elasticsearch-plugin install analysis-icu

This command will install the version of the plugin that matches your Elasticsearch version and also show a progress bar while downloading.

Custom URL or file system

A plugin can also be downloaded directly from a custom location by specifying the URL:

sudo bin/elasticsearch-plugin install [url] (1)
  1. Must be a valid URL; the plugin name is determined from its descriptor.

Unix

To install a plugin from your local file system at /path/to/plugin.zip, you could run:

sudo bin/elasticsearch-plugin install file:///path/to/plugin.zip
Windows

To install a plugin from your local file system at C:\path\to\plugin.zip, you could run:

bin\elasticsearch-plugin install file:///C:/path/to/plugin.zip
Note
Any path that contains spaces must be wrapped in quotes!
Note
If you are installing a plugin from the filesystem, the plugin distribution must not be located inside the plugins directory of the node you are installing the plugin on, or installation will fail.
HTTP

To install a plugin from an HTTP URL:

sudo bin/elasticsearch-plugin install http://some.domain/path/to/plugin.zip

The plugin script will refuse to talk to an HTTPS URL with an untrusted certificate. To use a self-signed HTTPS cert, you will need to add the CA cert to a local Java truststore and pass the location to the script as follows:

sudo ES_JAVA_OPTS="-Djavax.net.ssl.trustStore=/path/to/trustStore.jks" bin/elasticsearch-plugin install https://host/plugin.zip

Mandatory Plugins

If you rely on some plugins, you can define mandatory plugins by adding plugin.mandatory setting to the config/elasticsearch.yml file, for example:

plugin.mandatory: analysis-icu,lang-js

For safety reasons, a node will not start if it is missing a mandatory plugin.

Listing, Removing and Updating Installed Plugins

Listing plugins

A list of the currently loaded plugins can be retrieved with the list option:

sudo bin/elasticsearch-plugin list

Alternatively, use the {ref}/cluster-nodes-info.html[node-info API] to find out which plugins are installed on each node in the cluster.
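
For example, a minimal sketch of such a request, filtered to plugin information:

GET _nodes/plugins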

Removing plugins

Plugins can be removed manually, by deleting the appropriate directory under plugins/, or using the public script:

sudo bin/elasticsearch-plugin remove [pluginname]

After a Java plugin has been removed, you will need to restart the node to complete the removal process.

By default, plugin configuration files (if any) are preserved on disk; this is so that configuration is not lost while upgrading a plugin. If you wish to purge the configuration files while removing a plugin, use -p or --purge. This option can also be used after a plugin is removed to remove any lingering configuration files.

Updating plugins

Plugins are built for a specific version of Elasticsearch, and therefore must be reinstalled each time Elasticsearch is updated.

sudo bin/elasticsearch-plugin remove [pluginname]
sudo bin/elasticsearch-plugin install [pluginname]

Other command line parameters

The plugin script supports a number of other command line parameters:

Silent/Verbose mode

The --verbose parameter outputs more debug information, while the --silent parameter turns off all output including the progress bar. The script may return the following exit codes:

0

everything was OK

64

unknown command or incorrect option parameter

74

IO error

70

any other error

Batch mode

Certain plugins require more privileges than those provided by default in core Elasticsearch. These plugins will list the required privileges and ask the user for confirmation before continuing with installation.

When running the plugin install script from another program (e.g. install automation scripts), the plugin script should detect that it is not being called from the console and skip the confirmation response, automatically granting all requested permissions. If console detection fails, then batch mode can be forced by specifying -b or --batch as follows:

sudo bin/elasticsearch-plugin install --batch [pluginname]

Custom config directory

If your elasticsearch.yml config file is in a custom location, you will need to specify the path to the config file when using the plugin script. You can do this as follows:

sudo ES_PATH_CONF=/path/to/conf/dir bin/elasticsearch-plugin install <plugin name>

Proxy settings

To install a plugin via a proxy, you can add the proxy details to the ES_JAVA_OPTS environment variable with the Java settings http.proxyHost and http.proxyPort (or https.proxyHost and https.proxyPort):

sudo ES_JAVA_OPTS="-Dhttp.proxyHost=host_name -Dhttp.proxyPort=port_number -Dhttps.proxyHost=host_name -Dhttps.proxyPort=https_port_number" bin/elasticsearch-plugin install analysis-icu

Or on Windows:

set ES_JAVA_OPTS="-Dhttp.proxyHost=host_name -Dhttp.proxyPort=port_number -Dhttps.proxyHost=host_name -Dhttps.proxyPort=https_port_number"
bin\elasticsearch-plugin install analysis-icu

Plugins directory

The default location of the plugins directory depends on which package you install:

  • {ref}/zip-targz.html#zip-targz-layout[Directory layout of .zip and .tar.gz archives]

  • {ref}/deb.html#deb-layout[Directory layout of Debian package]

  • {ref}/rpm.html#rpm-layout[Directory layout of RPM]

API Extension Plugins

API extension plugins add new functionality to Elasticsearch by adding new APIs or features, usually to do with search or mapping.

Community contributed API extension plugins

A number of plugins have been contributed by our community:

Alerting Plugins

Alerting plugins allow Elasticsearch to monitor indices and to trigger alerts when thresholds are breached.

Core alerting plugins

The core alerting plugins are:

X-Pack

X-Pack contains the alerting and notification product for Elasticsearch that lets you take action based on changes in your data. It is designed around the principle that if you can query something in Elasticsearch, you can alert on it. Simply define a query, condition, schedule, and the actions to take, and X-Pack will do the rest.

Analysis Plugins

Analysis plugins extend Elasticsearch by adding new analyzers, tokenizers, token filters, or character filters to Elasticsearch.

Core analysis plugins

The core analysis plugins are:

ICU

Adds extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.

Kuromoji

Advanced analysis of Japanese using the Kuromoji analyzer.

Nori

Morphological analysis of Korean using the Lucene Nori analyzer.

Phonetic

Analyzes tokens into their phonetic equivalent using Soundex, Metaphone, Caverphone, and other codecs.

SmartCN

An analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

Stempel

Provides high quality stemming for Polish.

Ukrainian

Provides stemming for Ukrainian.

Community contributed analysis plugins

A number of analysis plugins have been contributed by our community:

ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into {es}, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.

Important
ICU analysis and backwards compatibility

From time to time, the ICU library receives updates such as adding new characters and emojis, and improving collation (sort) orders. These changes may or may not affect search and sort orders, depending on which character sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index created in the previous major version will need to be reindexed in order to return correct (and correctly ordered) results, and to take advantage of new characters.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-icu

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-icu

The node must be stopped before removing the plugin.

ICU Analyzer

Performs basic normalization, tokenization and character folding, using the icu_normalizer char filter, icu_tokenizer and icu_normalizer token filter.

The following parameters are accepted:

method

Normalization method. Accepts nfkc, nfc or nfkc_cf (default).

mode

Normalization mode. Accepts compose (default) or decompose.
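
For illustration, here is a minimal sketch of an index that configures this analyzer with the parameters above, assuming the analyzer type is registered as icu_analyzer (the analyzer name is illustrative):

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "type": "icu_analyzer",
            "method": "nfkc_cf",
            "mode": "compose"
          }
        }
      }
    }
  }
}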

ICU Normalization Character Filter

Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively:

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

Here are two examples, the default usage and a customised character filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfd_normalizer token filter, which is set to use nfc normalization with decomposition.

ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the {ref}/analysis-standard-tokenizer.html[standard tokenizer], but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
Rules customization

Note
This functionality is marked as experimental in Lucene.

You can customize the icu-tokenizer behavior by specifying per-script rule files, see the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files setting, which should contain a comma-separated list of code:rulefile pairs in the following format: four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:

.+ {200};

Then create an analyzer to use this rule file as follows:

PUT icu_sample
{
    "settings": {
        "index":{
            "analysis":{
                "tokenizer" : {
                    "icu_user_file" : {
                       "type" : "icu_tokenizer",
                       "rule_files" : "Latn:KeywordTokenizer.rbbi"
                    }
                },
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "icu_user_file"
                    }
                }
            }
        }
    }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}

The above analyze request returns the following:

{
   "tokens": [
      {
         "token": "Elasticsearch. Wow!",
         "start_offset": 0,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

ICU Normalization Token Filter

Normalizes characters as explained here. It registers itself as the icu_normalizer token filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default).

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

You should probably prefer the Normalization character filter.

Here are two examples, the default usage and a customised token filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfc_normalizer token filter, which is set to use nfc normalization.

ICU Folding Token Filter

Case folding of Unicode characters based on UTR#30, like the {ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter] on steroids. It registers itself as the icu_folding token filter and is available to all indices:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}

The ICU folding token filter already does Unicode normalization, so there is no need to use the normalization character or token filter as well.

Which letters are folded can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. It is important to note that both upper and lowercase forms should be specified, and that these filtered characters are not lowercased, which is why we add the lowercase filter as well:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}

ICU Collation Token Filter

Warning

This token filter has been deprecated since Lucene 5.0. Please use ICU Collation Keyword Field.

ICU Collation Keyword Field

Collations are used for sorting documents in a language-specific word order. The icu_collation_keyword field type is available to all indices and will encode the terms directly as bytes in a doc values field and a single indexed token just like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation], which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in ``phonebook'' order:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {   (1)
          "type": "text",
          "fields": {
            "sort": {  (2)
              "type": "icu_collation_keyword",
              "index": false,
              "language": "de",
              "country": "DE",
              "variant": "@collation=phonebook"
            }
          }
        }
      }
    }
  }
}

GET _search (3)
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
  1. The name field uses the standard analyzer, and so supports full text queries.

  2. The name.sort field is an icu_collation_keyword field that will preserve the name as a single token in doc values, and applies the German ``phonebook'' order.

  3. An example query which searches the name field and sorts on the name.sort field.

Parameters for ICU Collation Keyword Fields

The following parameters are accepted by icu_collation_keyword fields:

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) or false.

null_value

Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the {ref}/mapping-source-field.html[_source] field. Accepts true or false (default).

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.

Collation options
strength

The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation for each value. Defaults to tertiary unless otherwise specified in the collation.

decomposition

Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user’s responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.

The following options are expert only:

alternate

Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable, which boils down to ignoring punctuation and whitespace.

case_level

Possible values: true or false (default). Whether case level sorting is required. When strength is set to primary this will ignore accent differences.

case_first

Possible values: lower or upper. Useful to control which case is sorted first when case is not ignored for strength tertiary. The default depends on the collation.

numeric

Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21.

variable_top

Single character or contraction. Controls what is variable for alternate.

hiragana_quaternary_mode

Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters in quaternary strength.
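
For illustration, a minimal sketch that combines a few of these collation options with the field parameters described above (the field name and option values are illustrative):

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text",
          "fields": {
            "sort": {
              "type": "icu_collation_keyword",
              "index": false,
              "language": "en",
              "strength": "secondary",
              "numeric": true
            }
          }
        }
      }
    }
  }
}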

ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the id parameter (defaults to Null), and specify text direction with the dir parameter which accepts forward (default) for LTR and reverse for RTL. Custom rulesets are not yet supported.

For example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" (1)
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" (2)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" (3)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" (4)
}
  1. This transform transliterates characters to Latin, separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.

  2. Returns ni hao.

  3. Returns zdravstvujte.

  4. Returns kon’nichiha.

For more documentation, please see the user guide of ICU Transform.

Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-kuromoji

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-kuromoji

The node must be stopped before removing the plugin.

kuromoji analyzer

The kuromoji analyzer consists of the following tokenizer and token filters:

It supports the mode and user_dictionary settings from kuromoji_tokenizer.
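
For illustration, a minimal sketch of configuring the kuromoji analyzer itself with the mode setting (the index and analyzer names are illustrative):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}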

kuromoji_iteration_mark character filter

The kuromoji_iteration_mark character filter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. It accepts the following settings:

normalize_kanji

Indicates whether kanji iteration marks should be normalized. Defaults to true.

normalize_kana

Indicates whether kana iteration marks should be normalized. Defaults to true.
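
For illustration, a minimal sketch of an analyzer that uses this character filter with the settings above (the analyzer and char filter names are illustrative):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "my_iteration_mark"
            ]
          }
        },
        "char_filter": {
          "my_iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        }
      }
    }
  }
}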

kuromoji_tokenizer

The kuromoji_tokenizer accepts the following settings:

mode

The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

normal

Normal segmentation, no decomposition for compounds. Example output:

関西国際空港
アブラカダブラ
search

Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

関西, 関西国際空港, 国際, 空港
アブラカダブラ
extended

Extended mode outputs unigrams for unknown words. Example output:

関西, 国際, 空港
ア, ブ, ラ, カ, ダ, ブ, ラ
discard_punctuation

Whether punctuation should be discarded from the output. Defaults to true.

user_dictionary

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
nbest_cost/nbest_examples

Additional expert user parameters nbest_cost and nbest_examples can be used to include additional tokens that are most likely according to the statistical model. If both parameters are used, the larger of the two resulting values is applied.

nbest_cost

The nbest_cost parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the nbest_cost value of the best path.

nbest_examples

The nbest_examples parameter can be used to find a nbest_cost value based on examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that, for the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we would like a cost that also gives us 箱根 (Hakone) and 成田 (Narita).
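
As a minimal sketch, nbest_examples might be set on a custom tokenizer like this (the tokenizer and analyzer names are illustrative):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_nbest": {
            "type": "kuromoji_tokenizer",
            "nbest_examples": "/箱根山-箱根/成田空港-成田/"
          }
        },
        "analyzer": {
          "nbest_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_nbest"
          }
        }
      }
    }
  }
}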

Then create an analyzer that uses the custom user dictionary as follows:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}

kuromoji_baseform token filter

The kuromoji_baseform token filter replaces terms with their BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}

which responds with:

{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}

kuromoji_part_of_speech token filter

The kuromoji_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags

An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar.

For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}

Which responds with:

{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_readingform token filter

The kuromoji_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:

use_romaji

Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined kuromoji_readingform filter, use_romaji is set to true. The default when defining a custom kuromoji_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

PUT kuromoji_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "romaji_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["romaji_readingform"]
                    },
                    "katakana_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["katakana_readingform"]
                    }
                },
                "filter" : {
                    "romaji_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : true
                    },
                    "katakana_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : false
                    }
                }
            }
        }
    }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" (2)
}
  1. Returns スシ.

  2. Returns sushi.

kuromoji_stemmer token filter

The kuromoji_stemmer token filter normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

minimum_length

Katakana words shorter than the minimum length are not stemmed (default is 4).

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" (2)
}
  1. Returns コピー.

  2. Returns サーバ.

ja_stop token filter

The ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}

The above request returns:

{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_number token filter

The kuromoji_number token filter normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters. For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}

Which results in:

{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}

Korean (nori) Analysis Plugin

The Korean (nori) Analysis plugin integrates the Lucene nori analysis module into Elasticsearch. It uses the mecab-ko-dic dictionary to perform morphological analysis of Korean texts.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-nori

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-nori

The node must be stopped before removing the plugin.

nori analyzer

The nori analyzer consists of the following tokenizer and token filters:

It supports the decompound_mode and user_dictionary settings from nori_tokenizer and the stoptags setting from nori_part_of_speech.
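
For illustration, a minimal sketch of configuring the nori analyzer itself with these settings (the analyzer name and stoptags values are illustrative):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": [
              "E",
              "J"
            ]
          }
        }
      }
    }
  }
}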

nori_tokenizer

The nori_tokenizer accepts the following settings:

decompound_mode

The decompound mode determines how the tokenizer handles compound tokens. It can be set to:

none

No decomposition for compounds. Example output:

가거도항
가곡역
discard

Decomposes compounds and discards the original form (default). Example output:

가곡역 => 가곡, 역
mixed

Decomposes compounds and keeps the original form. Example output:

가곡역 => 가곡역, 가곡, 역
user_dictionary

The Nori tokenizer uses the mecab-ko-dic dictionary by default. A user_dictionary with custom nouns (NNG) may be appended to the default dictionary. The dictionary should have the following format:

<token> [<token 1> ... <token n>]

The first token is mandatory and represents the custom noun that should be added in the dictionary. For compound nouns the custom segmentation can be provided after the first token ([<token 1> …​ <token n>]). The segmentation of the custom compound nouns is controlled by the decompound_mode setting.

As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ko.txt:

c++                 (1)
C샤프
세종
세종시 세종 시        (2)
  1. A simple noun

  2. A compound noun (세종시) followed by its decomposition: 세종 and 시.

Then create an analyzer as follows:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"  (1)
}
  1. Sejong city

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    (1)
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
   }]
}
  1. This is a compound token that spans two positions (mixed mode).

user_dictionary_rules

You can also inline the rules directly in the tokenizer definition using the user_dictionary_rules option:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

The nori_tokenizer sets a number of additional attributes per token that are used by token filters to modify the stream. You can view all these additional attributes with the following request:

GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",   (1)
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
  1. A tree with deep roots

Which responds with:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {
                    "token": "뿌리",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "가",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "깊",
                    "start_offset": 4,
                    "end_offset": 5,
                    "type": "word",
                    "position": 2,
                    "leftPOS": "VA(Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VA(Adjective)"
                },
                {
                    "token": "은",
                    "start_offset": 5,
                    "end_offset": 6,
                    "type": "word",
                    "position": 3,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                },
                {
                    "token": "나무",
                    "start_offset": 7,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "는",
                    "start_offset": 9,
                    "end_offset": 10,
                    "type": "word",
                    "position": 5,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                }
            ]
        },
        "tokenfilters": []
    }
}

nori_part_of_speech token filter

The nori_part_of_speech token filter removes tokens that match a set of part-of-speech tags. The list of supported tags and their meanings can be found here: Part of speech tags

It accepts the following setting:

stoptags

An array of part-of-speech tags that should be removed.

and defaults to:

"stoptags": [
    "E",
    "IC",
    "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV"
]

For example:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR"   (1)
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이"  (2)
}
  1. Korean numerals should be removed (NR)

  2. Six dragons

Which responds with:

{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}

nori_readingform token filter

The nori_readingform token filter rewrites tokens written in Hanja to their Hangul form.

PUT nori_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "nori_tokenizer",
                        "filter" : ["nori_readingform"]
                    }
                }
            }
        }
    }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌"      (1)
}
  1. A token written in Hanja: Hyangga

Which responds with:

{
  "tokens" : [ {
    "token" : "향가",     (1)
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
  1. The Hanja form is replaced by the Hangul translation.

Phonetic Analysis Plugin

The Phonetic Analysis plugin provides token filters which convert tokens to their phonetic representation using Soundex, Metaphone, and a variety of other algorithms.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-phonetic

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-phonetic

The node must be stopped before removing the plugin.

phonetic token filter

The phonetic token filter takes the following settings:

encoder

Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.

replace

Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_metaphone"
            ]
          }
        },
        "filter": {
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": false
          }
        }
      }
    }
  }
}

GET phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Joe Bloggs" (1)
}
  1. Returns: J, joe, BLKS, bloggs

Double metaphone settings

If the double_metaphone encoder is used, then this additional setting is supported:

max_code_len

The maximum length of the emitted metaphone token. Defaults to 4.
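
For illustration, a minimal sketch of a double_metaphone filter that uses this setting (the filter name and value are illustrative):

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_double_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "max_code_len": 6
          }
        }
      }
    }
  }
}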

Beider Morse settings

If the beider_morse encoder is used, then these additional settings are supported:

rule_type

Whether matching should be exact or approx (default).

name_type

Whether names are ashkenazi, sephardic, or generic (default).

languageset

An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.
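
Similarly, a minimal sketch of a beider_morse filter that uses these settings (the filter name and values are illustrative):

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "filter": {
          "my_beider_morse": {
            "type": "phonetic",
            "encoder": "beider_morse",
            "rule_type": "exact",
            "name_type": "generic",
            "languageset": [
              "english"
            ]
          }
        }
      }
    }
  }
}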

Smart Chinese Analysis Plugin

The Smart Chinese Analysis plugin integrates Lucene’s Smart Chinese analysis module into Elasticsearch.

It provides an analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-smartcn

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-smartcn

The node must be stopped before removing the plugin.

smartcn tokenizer and token filter

The plugin provides the smartcn analyzer and smartcn_tokenizer tokenizer, which are not configurable.

Note
The smartcn_word token filter and smartcn_sentence tokenizer have been deprecated.

Stempel Polish Analysis Plugin

The Stempel Analysis plugin integrates Lucene’s Stempel analysis module for Polish into Elasticsearch.

It provides high quality stemming for Polish, based on the Egothor project.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-stempel

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-stempel

The node must be stopped before removing the plugin.

stempel tokenizer and token filter

The plugin provides the polish analyzer and polish_stem token filter, which are not configurable.

Ukrainian Analysis Plugin

The Ukrainian Analysis plugin integrates Lucene’s UkrainianMorfologikAnalyzer into Elasticsearch.

It provides stemming for Ukrainian using the Morfologik project.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-ukrainian

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-ukrainian

The node must be stopped before removing the plugin.

ukrainian analyzer

The plugin provides the ukrainian analyzer.

Discovery Plugins

Discovery plugins extend Elasticsearch by adding new discovery mechanisms that can be used instead of {ref}/modules-discovery-zen.html[Zen Discovery].

Core discovery plugins

The core discovery plugins are:

EC2 discovery

The EC2 discovery plugin uses the AWS API for unicast discovery.

Azure Classic discovery

The Azure Classic discovery plugin uses the Azure Classic API for unicast discovery.

GCE discovery

The Google Compute Engine discovery plugin uses the GCE API for unicast discovery.

File-based discovery

The File-based discovery plugin allows providing the unicast hosts list through a dynamically updatable file.

Community contributed discovery plugins

A number of discovery plugins have been contributed by our community:

EC2 Discovery Plugin

The EC2 discovery plugin uses the AWS API for unicast discovery.

If you are looking for a hosted solution of Elasticsearch on AWS, please visit http://www.elastic.co/cloud.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install discovery-ec2

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove discovery-ec2

The node must be stopped before removing the plugin.

Getting started with AWS

The plugin provides a hosts provider for zen discovery named ec2. This hosts provider finds other Elasticsearch instances in EC2 through AWS metadata. Authentication is done using IAM Role credentials by default. To enable the plugin, set the unicast host provider for Zen discovery to ec2:

discovery.zen.hosts_provider: ec2

Settings

EC2 host discovery supports a number of settings. Some settings are sensitive and must be stored in the {ref}/secure-settings.html[elasticsearch keystore]. For example, to use explicit AWS access keys:

bin/elasticsearch-keystore add discovery.ec2.access_key
bin/elasticsearch-keystore add discovery.ec2.secret_key

The following are the available discovery settings. All should be prefixed with discovery.ec2.. Those that must be stored in the keystore are marked as Secure.

access_key

An ec2 access key. The secret_key setting must also be specified. (Secure)

secret_key

An ec2 secret key. The access_key setting must also be specified. (Secure)

session_token

An ec2 session token. The access_key and secret_key settings must also be specified. (Secure)

endpoint

The ec2 service endpoint to connect to. See http://docs.aws.amazon.com/general/latest/gr/rande.html#ec2_region. This defaults to ec2.us-east-1.amazonaws.com.

protocol

The protocol to use to connect to ec2. Valid values are either http or https. Defaults to https.

proxy.host

The host name of a proxy to connect to ec2 through.

proxy.port

The port of a proxy to connect to ec2 through.

proxy.username

The username to connect to the proxy.host with. (Secure)

proxy.password

The password to connect to the proxy.host with. (Secure)

read_timeout

The socket timeout for connecting to ec2. The value should specify the unit. For example, a value of 5s specifies a 5 second timeout. The default value is 50 seconds.

groups

Either a comma separated list or array based list of (security) groups. Only instances with the provided security groups will be used in the cluster discovery. (NOTE: You could provide either group NAME or group ID.)

host_type

The type of host to use to communicate with other instances. Can be one of private_ip, public_ip, private_dns, public_dns or tag:TAGNAME where TAGNAME refers to a name of a tag configured for all EC2 instances. Instances which don’t have this tag set will be ignored by the discovery process.

For example, if you defined a tag my-elasticsearch-host in ec2 and set it to myhostname1.mydomain.com, then setting host_type: tag:my-elasticsearch-host will tell the EC2 discovery plugin to read the host name from the my-elasticsearch-host tag. In this case, it will be resolved to myhostname1.mydomain.com. Read more about EC2 Tags.

Defaults to private_ip.

availability_zones

Either a comma separated list or array based list of availability zones. Only instances within the provided availability zones will be used in the cluster discovery.

any_group

If set to false, will require all security groups to be present for the instance to be used for the discovery. Defaults to true.

node_cache_time

How long the list of hosts is cached to prevent further requests to the AWS API. Defaults to 10s.
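
For illustration, a few of the non-secure settings above might be combined in elasticsearch.yml as follows; the values shown are assumptions for the sketch, not defaults:

discovery.zen.hosts_provider: ec2
discovery.ec2.endpoint: ec2.us-west-1.amazonaws.com
discovery.ec2.groups: my-security-group
discovery.ec2.host_type: private_dns
discovery.ec2.availability_zones: us-west-1a,us-west-1b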

All secure settings of this plugin are {ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you reload the settings, an aws sdk client with the latest settings from the keystore will be used.

Important
Binding the network host

It’s important to define network.host, as by default it is bound to localhost.

You can use {ref}/modules-network.html[core network host settings] or ec2 specific host settings:

EC2 Network Host

When the discovery-ec2 plugin is installed, the following are also allowed as valid network host settings:


ec2:privateIpv4

The private IP address (ipv4) of the machine.

ec2:privateDns

The private host of the machine.

ec2:publicIpv4

The public IP address (ipv4) of the machine.

ec2:publicDns

The public host of the machine.

ec2:privateIp

equivalent to ec2:privateIpv4.

ec2:publicIp

equivalent to ec2:publicIpv4.

ec2

equivalent to ec2:privateIpv4.

Recommended EC2 Permissions

EC2 discovery requires making a call to the EC2 service. You’ll want to set up an IAM policy to allow this. You can create a custom policy via the IAM Management Console. It should look similar to this.

{
  "Statement": [
    {
      "Action": [
        "ec2:DescribeInstances"
      ],
      "Effect": "Allow",
      "Resource": [
        "*"
      ]
    }
  ],
  "Version": "2012-10-17"
}
Filtering by Tags

The ec2 discovery can also filter machines to include in the cluster based on tags (and not just groups). The settings to use include the discovery.ec2.tag. prefix. For example, if you defined a tag stage in EC2 and set it to dev, setting discovery.ec2.tag.stage to dev will only filter instances with a tag key set to stage, and a value of dev. Adding multiple discovery.ec2.tag settings will require all of those tags to be set for the instance to be included.
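
For instance, the stage/dev example just described corresponds to the following line in elasticsearch.yml:

discovery.ec2.tag.stage: dev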

One practical use for tag filtering is when an ec2 cluster contains many nodes that are not running Elasticsearch. In this case (particularly with high discovery.zen.ping_timeout values) there is a risk that a new node’s discovery phase will end before it has found the cluster (which will result in it declaring itself master of a new cluster with the same name - highly undesirable). Tagging Elasticsearch ec2 nodes and then filtering by that tag will resolve this issue.

Automatic Node Attributes

Though not dependent on actually using ec2 for discovery (the discovery-ec2 plugin must still be installed), the plugin can automatically add node attributes relating to ec2. In the future this may support other attributes, but this will currently only add an aws_availability_zone node attribute, which is the availability zone of the current node. Attributes can be used to isolate primary and replica shards across availability zones by using the {ref}/allocation-awareness.html[Allocation Awareness] feature.

In order to enable it, set cloud.node.auto_attributes to true in the settings. For example:

cloud.node.auto_attributes: true

cluster.routing.allocation.awareness.attributes: aws_availability_zone

Best Practices in AWS

Collection of best practices and other information around running Elasticsearch on AWS.

Instance/Disk

When selecting a disk, please be aware of the following order of preference:

  • EFS - Avoid. The sacrifices made to offer durability, shared storage, and grow/shrink capability come at a performance cost; such file systems have been known to cause corruption of indices, and because Elasticsearch is distributed and has built-in replication, the benefits that EFS offers are not needed.

  • EBS - Works well if running a small cluster (1-2 nodes) and you cannot easily tolerate the loss of all storage backing a node, or if running indices with no replicas. If EBS is used, then leverage provisioned IOPS to ensure performance.

  • Instance Store - When running clusters of larger size and with replicas the ephemeral nature of Instance Store is ideal since Elasticsearch can tolerate the loss of shards. With Instance Store one gets the performance benefit of having disk physically attached to the host running the instance and also the cost benefit of avoiding paying extra for EBS.

Prefer Amazon Linux AMIs: since Elasticsearch runs on the JVM, OS dependencies are very minimal, and you can benefit from the lightweight nature, support, and EC2-specific performance tweaks that the Amazon Linux AMIs offer.

Networking
  • Network throttling takes place on smaller instance types, both in terms of bandwidth and number of connections. Therefore, if a large number of connections is needed and networking is becoming a bottleneck, avoid instance types with networking labeled as Moderate or Low.

  • Multicast is not supported, even within a VPC; the discovery-ec2 plugin instead joins nodes by performing a security group lookup.

  • When running in multiple availability zones be sure to leverage {ref}/allocation-awareness.html[shard allocation awareness] so that not all copies of shard data reside in the same availability zone.

  • Do not span a cluster across regions. If necessary, use cross cluster search.

Misc
  • If you have split your nodes into roles, consider tagging the EC2 instances by role to make it easier to filter and view your EC2 instances in the AWS console.

  • Consider enabling termination protection for all of your instances to avoid accidentally terminating a node in the cluster and causing a potentially disruptive reallocation.

Azure Classic Discovery Plugin

The Azure Classic Discovery plugin uses the Azure Classic API for unicast discovery.

deprecated[5.0.0, Use coming Azure ARM Discovery plugin instead]

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install discovery-azure-classic

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove discovery-azure-classic

The node must be stopped before removing the plugin.

Azure Virtual Machine Discovery

Azure VM discovery allows you to use the Azure APIs to perform automatic discovery (similar to multicast in non-hostile multicast environments). Here is a simple sample configuration:

cloud:
    azure:
        management:
             subscription.id: XXX-XXX-XXX-XXX
             cloud.service.name: es-demo-app
             keystore:
                   path: /path/to/azurekeystore.pkcs12
                   password: WHATEVER
                   type: pkcs12

discovery:
    zen.hosts_provider: azure
Important
Binding the network host

The keystore file must be placed in a directory accessible by Elasticsearch, such as the config directory.

It’s important to define network.host, as by default it’s bound to localhost.

You can use {ref}/modules-network.html[core network host settings]. For example, en0.
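
For example, a minimal sketch that binds to a specific network interface using the core underscore-delimited syntax (the interface name en0 is just the example mentioned above):

network.host: _en0_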

How to start (short story)
  • Create Azure instances

  • Install Elasticsearch

  • Install Azure plugin

  • Modify elasticsearch.yml file

  • Start Elasticsearch

Azure credential API settings

The following settings can further control the credential API:

cloud.azure.management.keystore.path

/path/to/keystore

cloud.azure.management.keystore.type

pkcs12, jceks or jks. Defaults to pkcs12.

cloud.azure.management.keystore.password

your_password for the keystore

cloud.azure.management.subscription.id

your_azure_subscription_id

cloud.azure.management.cloud.service.name

your_azure_cloud_service_name. This is the cloud service name/DNS but without the cloudapp.net part. So if the DNS name is abc.cloudapp.net then the cloud.service.name to use is just abc.

Advanced settings

The following settings can further control the discovery:

discovery.azure.host.type

Either public_ip or private_ip (default). Azure discovery will use the one you set to ping other nodes.

discovery.azure.endpoint.name

When using public_ip this setting is used to identify the endpoint name used to forward requests to Elasticsearch (aka transport port name). Defaults to elasticsearch. In the Azure management console, you could define an endpoint named elasticsearch that forwards, for example, requests arriving on the public IP on port 8100 to the virtual machine on port 9300.

discovery.azure.deployment.name

Deployment name if any. Defaults to the value set with cloud.azure.management.cloud.service.name.

discovery.azure.deployment.slot

Either staging or production (default).

For example:

discovery:
    type: azure
    azure:
        host:
            type: private_ip
        endpoint:
            name: elasticsearch
        deployment:
            name: your_azure_cloud_service_name
            slot: production

Setup process for Azure Discovery

Here we describe one strategy, which is to hide the Elasticsearch cluster from the outside.

With this strategy, only VMs behind the same virtual port can talk to each other. That means that with this mode, you can use Elasticsearch unicast discovery to build a cluster, using the Azure API to retrieve information about your nodes.

Prerequisites

Before starting, you need to have:

  • A Windows Azure account

  • OpenSSL that isn’t from MacPorts; specifically, OpenSSL 1.0.1f (6 Jan 2014) doesn’t seem to create a valid keypair for SSH. For what it’s worth, OpenSSL 1.0.1c (10 May 2012) on Ubuntu 14.04 LTS is known to work.

  • SSH keys and certificate

    You should follow this guide to learn how to create or use existing SSH keys. If you have already done this, you can skip the following.

    Here is a description of how to generate SSH keys using openssl:

    # You may want to use another dir than /tmp
    cd /tmp
    openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout azure-private.key -out azure-certificate.pem
    chmod 600 azure-private.key azure-certificate.pem
    openssl x509 -outform der -in azure-certificate.pem -out azure-certificate.cer

    Generate a keystore which will be used by the plugin to authenticate all Azure API calls with a certificate.

    # Generate a keystore (azurekeystore.pkcs12)
    # Transform private key to PEM format
    openssl pkcs8 -topk8 -nocrypt -in azure-private.key -inform PEM -out azure-pk.pem -outform PEM
    # Transform certificate to PEM format
    openssl x509 -inform der -in azure-certificate.cer -out azure-cert.pem
    cat azure-cert.pem azure-pk.pem > azure.pem.txt
    # You MUST enter a password!
    openssl pkcs12 -export -in azure.pem.txt -out azurekeystore.pkcs12 -name azure -noiter -nomaciter

    Upload the azure-certificate.cer file both in the Elasticsearch Cloud Service (under Manage Certificates), and under Settings → Manage Certificates.

    Important
    When prompted for a password, you need to enter a non-empty one.

    See this guide for more details about how to create keys for Azure.

    Once done, you need to upload your certificate in Azure:

    • Go to the management console.

    • Sign in using your account.

    • Click on Portal.

    • Go to Settings (bottom of the left list)

    • On the bottom bar, click on Upload and upload your azure-certificate.cer file.

    You may want to use the Windows Azure Command-Line Tool:

  • Install NodeJS, for example using homebrew on MacOS X:

    brew install node
  • Install Azure tools

    sudo npm install azure-cli -g
  • Download and import your azure settings:

    # This will open a browser and will download a .publishsettings file
    azure account download
    
    # Import this file (we have downloaded it to /tmp)
    # Note, it will create needed files in ~/.azure. You can remove azure.publishsettings when done.
    azure account import /tmp/azure.publishsettings
Creating your first instance

You need to have a storage account available. Check Azure Blob Storage documentation for more information.

You will need to choose the operating system you want to run on. To get a list of the official images available, run:

azure vm image list

Let’s say we are going to deploy an Ubuntu image on an extra small instance in West Europe:

Azure cluster name

azure-elasticsearch-cluster

Image

b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-13_10-amd64-server-20130808-alpha3-en-us-30GB

VM Name

myesnode1

VM Size

extrasmall

Location

West Europe

Login

elasticsearch

Password

password1234!!

Using command line:

azure vm create azure-elasticsearch-cluster \
                b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-13_10-amd64-server-20130808-alpha3-en-us-30GB \
                --vm-name myesnode1 \
                --location "West Europe" \
                --vm-size extrasmall \
                --ssh 22 \
                --ssh-cert /tmp/azure-certificate.pem \
                elasticsearch password1234\!\!

You should see something like:

info:    Executing command vm create
+ Looking up image
+ Looking up cloud service
+ Creating cloud service
+ Retrieving storage accounts
+ Configuring certificate
+ Creating VM
info:    vm create command OK

Now, your first instance is started.

Tip
Working with SSH

You need to give the private key and username each time you log on to your instance:

ssh -i ~/.ssh/azure-private.key elasticsearch@myescluster.cloudapp.net

But you can also define it once in ~/.ssh/config file:

Host *.cloudapp.net
 User elasticsearch
 StrictHostKeyChecking no
 UserKnownHostsFile=/dev/null
 IdentityFile ~/.ssh/azure-private.key

Next, you need to install Elasticsearch on your new instance. First, copy your keystore to the instance, then connect to the instance using SSH:

scp /tmp/azurekeystore.pkcs12 azure-elasticsearch-cluster.cloudapp.net:/home/elasticsearch
ssh azure-elasticsearch-cluster.cloudapp.net

Once connected, install Elasticsearch:

# Install Latest Java version
# Read http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html for details
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer

# If you want to install OpenJDK instead
# sudo apt-get update
# sudo apt-get install openjdk-8-jre-headless

# Download Elasticsearch
curl -s https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-{version}.deb -o elasticsearch-{version}.deb

# Prepare Elasticsearch installation
sudo dpkg -i elasticsearch-{version}.deb

Check that Elasticsearch is running:

GET /

This command should give you a JSON result:

{
  "name" : "Cp8oag6",
  "cluster_name" : "elasticsearch",
  "cluster_uuid" : "AT69_T_DTp-1qgIJlatQqA",
  "version" : {
    "number" : "{version}",
    "build_flavor" : "default",
    "build_type" : "zip",
    "build_hash" : "f27399d",
    "build_date" : "2016-03-30T09:51:41.449Z",
    "build_snapshot" : false,
    "lucene_version" : "7.7.3",
    "minimum_wire_compatibility_version" : "1.2.3",
    "minimum_index_compatibility_version" : "1.2.3"
  },
  "tagline" : "You Know, for Search"
}
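
If you prefer to check from the shell on the instance itself, an equivalent curl call is shown below (assuming Elasticsearch listens on the default HTTP port 9200):

curl -X GET "http://localhost:9200/"
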
Install Elasticsearch cloud azure plugin
# Stop Elasticsearch
sudo service elasticsearch stop

# Install the plugin
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install discovery-azure-classic

# Configure it
sudo vi /etc/elasticsearch/elasticsearch.yml

And add the following lines:

# If you don't remember your account id, you may get it with `azure account list`
cloud:
    azure:
        management:
             subscription.id: your_azure_subscription_id
             cloud.service.name: your_azure_cloud_service_name
             keystore:
                   path: /home/elasticsearch/azurekeystore.pkcs12
                   password: your_password_for_keystore

discovery:
    type: azure

# Recommended (warning: non durable disk)
# path.data: /mnt/resource/elasticsearch/data

Restart Elasticsearch:

sudo service elasticsearch start

If anything goes wrong, check your logs in /var/log/elasticsearch.
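
For example (the path assumes the Debian package layout used above and the default cluster name elasticsearch):

sudo tail -f /var/log/elasticsearch/elasticsearch.log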

Scaling Out!

First, you need to create an image of your previous machine. Disconnect from your machine and run the following commands locally:

# Shutdown the instance
azure vm shutdown myesnode1

# Create an image from this instance (it could take some minutes)
azure vm capture myesnode1 esnode-image --delete

# Note that the previous instance has been deleted (mandatory)
# So you need to create it again and BTW create other instances.

azure vm create azure-elasticsearch-cluster \
                esnode-image \
                --vm-name myesnode1 \
                --location "West Europe" \
                --vm-size extrasmall \
                --ssh 22 \
                --ssh-cert /tmp/azure-certificate.pem \
                elasticsearch password1234\!\!
Tip

Azure may change the endpoint public IP address, and DNS propagation can take some minutes before you can connect again using the DNS name. If needed, you can get the IP address from Azure using:

# Look at Network `Endpoints 0 Vip`
azure vm show myesnode1

Let’s start more instances!

for x in $(seq  2 10)
	do
		echo "Launching azure instance #$x..."
		azure vm create azure-elasticsearch-cluster \
		                esnode-image \
		                --vm-name myesnode$x \
		                --vm-size extrasmall \
		                --ssh $((21 + $x)) \
		                --ssh-cert /tmp/azure-certificate.pem \
		                --connect \
		                elasticsearch password1234\!\!
	done

If you want to remove your running instances:

azure vm delete myesnode1

GCE Discovery Plugin

The Google Compute Engine Discovery plugin uses the GCE API for unicast discovery.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install discovery-gce

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove discovery-gce

The node must be stopped before removing the plugin.

GCE Virtual Machine Discovery

Google Compute Engine VM discovery allows you to use the Google APIs to perform automatic discovery (similar to multicast in non-hostile multicast environments). Here is a simple sample configuration:

cloud:
  gce:
      project_id: <your-google-project-id>
      zone: <your-zone>
discovery:
      zen.hosts_provider: gce

The following gce settings (prefixed with cloud.gce) are supported:

project_id

Your Google project id. By default the project id will be derived from the instance metadata.

Note: Deriving the project id from system properties or environment variables
(`GOOGLE_CLOUD_PROJECT` or `GCLOUD_PROJECT`) is not supported.
zone

Helps to retrieve instances running in a given zone. It should be one of the GCE supported zones. By default the zone will be derived from the instance metadata. See also Using GCE zones.

retry

If set to true, the client will use an ExponentialBackOff policy to retry failed HTTP requests. Defaults to true.

max_wait

The maximum elapsed time since the client started to retry. If the elapsed time goes past max_wait, the client stops retrying. A negative value means that it will wait indefinitely. Defaults to 0s (retry indefinitely).

refresh_interval

How long the list of hosts is cached to prevent further requests to the GCE API. 0s disables caching. A negative value will cause infinite caching. Defaults to 0s.
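
For example, a minimal elasticsearch.yml sketch combining these optional settings with the sample configuration above (the values here are purely illustrative):

cloud:
  gce:
      project_id: es-cloud
      zone: europe-west1-a
      retry: true
      max_wait: 60s
      refresh_interval: 30s
discovery:
      zen.hosts_provider: gce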

Important
Binding the network host

It’s important to define network.host, as by default it’s bound to localhost.

You can use {ref}/modules-network.html[core network host settings] or gce specific host settings:

GCE Network Host

When the discovery-gce plugin is installed, the following are also allowed as valid network host settings:

GCE Host Value Description

gce:privateIp:X

The private IP address of the machine for a given network interface.

gce:hostname

The hostname of the machine.

gce

Same as gce:privateIp:0 (recommended).

Examples:

# get the IP address from network interface 1
network.host: _gce:privateIp:1_
# Using GCE internal hostname
network.host: _gce:hostname_
# shortcut for _gce:privateIp:0_ (recommended)
network.host: _gce_
How to start (short story)
  • Create Google Compute Engine instance (with compute rw permissions)

  • Install Elasticsearch

  • Install Google Compute Engine Cloud plugin

  • Modify elasticsearch.yml file

  • Start Elasticsearch

Setting up GCE Discovery

Prerequisites

Before starting, you need:

If you have not set it yet, you can define the default project you will work on:

gcloud config set project es-cloud
Login to Google Cloud

If you haven’t already, log in to Google Cloud:

gcloud auth login

This will open your browser. You will be asked to sign in to a Google account and authorize access to the Google Cloud SDK.

Creating your first instance
gcloud compute instances create myesnode1 \
       --zone <your-zone> \
       --scopes compute-rw

When done, a report like this one should appear:

Created [https://www.googleapis.com/compute/v1/projects/es-cloud-1070/zones/us-central1-f/instances/myesnode1].
NAME      ZONE          MACHINE_TYPE  PREEMPTIBLE INTERNAL_IP   EXTERNAL_IP   STATUS
myesnode1 us-central1-f n1-standard-1             10.240.133.54 104.197.94.25 RUNNING

You can now connect to your instance:

# Connect using google cloud SDK
gcloud compute ssh myesnode1 --zone europe-west1-a

# Or using SSH with external IP address
ssh -i ~/.ssh/google_compute_engine 192.158.29.199
Important
Service Account Permissions

It’s important when creating an instance that the correct permissions are set. At a minimum, you must ensure you have:

scopes=compute-rw

Failing to set this will result in unauthorized messages when starting Elasticsearch. See Machine Permissions.

Once connected, install Elasticsearch:

sudo apt-get update

# Download Elasticsearch
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-2.0.0.deb

# Prepare Java installation (Oracle)
sudo echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | sudo tee /etc/apt/sources.list.d/webupd8team-java.list
sudo echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | sudo tee -a /etc/apt/sources.list.d/webupd8team-java.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
sudo apt-get update
sudo apt-get install oracle-java8-installer

# Prepare Java installation (or OpenJDK)
# sudo apt-get install java8-runtime-headless

# Prepare Elasticsearch installation
sudo dpkg -i elasticsearch-2.0.0.deb
Install Elasticsearch discovery gce plugin

Install the plugin:

# Use Plugin Manager to install it
sudo bin/elasticsearch-plugin install discovery-gce

Open the elasticsearch.yml file:

sudo vi /etc/elasticsearch/elasticsearch.yml

And add the following lines:

cloud:
  gce:
      project_id: es-cloud
      zone: europe-west1-a
discovery:
      zen.hosts_provider: gce

Start Elasticsearch:

sudo /etc/init.d/elasticsearch start

If anything goes wrong, you should check logs:

tail -f /var/log/elasticsearch/elasticsearch.log

If needed, you can change log level to trace by opening log4j2.properties:

sudo vi /etc/elasticsearch/log4j2.properties

and adding the following line:

# discovery
logger.discovery_gce.name = discovery.gce
logger.discovery_gce.level = trace

Cloning your existing machine

In order to build a cluster on many nodes, you can clone your configured instance to new nodes. You won’t have to reinstall everything!

First create an image of your running instance and upload it to Google Cloud Storage:

# Create an image of your current instance
sudo /usr/bin/gcimagebundle -d /dev/sda -o /tmp/

# An image has been created in `/tmp` directory:
ls /tmp
e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz

# Upload your image to Google Cloud Storage:
# Create a bucket to hold your image, let's say `esimage`:
gsutil mb gs://esimage

# Copy your image to this bucket:
gsutil cp /tmp/e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz gs://esimage

# Then add your image to images collection:
gcloud compute images create elasticsearch-2-0-0 --source-uri gs://esimage/e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz

# If the previous command did not work for you, logout from your instance
# and launch the same command from your local machine.
Start new instances

Now that you have an image, you can create as many instances as you need:

# Just change node name (here myesnode2)
gcloud compute instances create myesnode2 --image elasticsearch-2-0-0 --zone europe-west1-a

# If you want to provide all details directly, you can use:
gcloud compute instances create myesnode2 --image=elasticsearch-2-0-0 \
       --zone europe-west1-a --machine-type f1-micro --scopes=compute-rw
Remove an instance (aka shut it down)

You can use Google Cloud Console or CLI to manage your instances:

# Stopping and removing instances
gcloud compute instances delete myesnode1 myesnode2 \
       --zone=europe-west1-a

# Consider removing disk as well if you don't need them anymore
gcloud compute disks delete boot-myesnode1 boot-myesnode2  \
       --zone=europe-west1-a

Using GCE zones

cloud.gce.zone helps to retrieve instances running in a given zone. It should be one of the GCE supported zones.

The GCE discovery can support multiple zones, although you need to be aware of network latency between zones. To enable discovery across more than one zone, just add your list of zones to the cloud.gce.zone setting:

cloud:
  gce:
      project_id: <your-google-project-id>
      zone: ["<your-zone1>", "<your-zone2>"]
discovery:
      zen.hosts_provider: gce

Filtering by tags

The GCE discovery can also filter machines to include in the cluster based on tags using the discovery.gce.tags setting. For example, setting discovery.gce.tags to dev will only include instances having a tag set to dev. Setting several tags will require all of those tags to be set for the instance to be included.

One practical use for tag filtering is when a GCE cluster contains many nodes that are not running Elasticsearch. In this case (particularly with high discovery.zen.ping_timeout values) there is a risk that a new node’s discovery phase will end before it has found the cluster (which will result in it declaring itself master of a new cluster with the same name - highly undesirable). Adding a tag to the Elasticsearch GCE nodes and then filtering by that tag will resolve this issue.

Add your tag when building the new instance:

gcloud compute instances create myesnode1 --project=es-cloud \
       --scopes=compute-rw \
       --tags=elasticsearch,dev

Then, define it in elasticsearch.yml:

cloud:
  gce:
      project_id: es-cloud
      zone: europe-west1-a
discovery:
      zen.hosts_provider: gce
      gce:
            tags: elasticsearch, dev

Changing default transport port

By default, the Elasticsearch GCE plugin assumes that Elasticsearch runs on the default transport port, 9300. You can specify the port value Elasticsearch should use via the Google Compute Engine metadata entry es_port:

When creating instance

Add --metadata es_port=9301 option:

# when creating first instance
gcloud compute instances create myesnode1 \
       --scopes=compute-rw,storage-full \
       --metadata es_port=9301

# when creating an instance from an image
gcloud compute instances create myesnode2 --image=elasticsearch-1-0-0-RC1 \
       --zone europe-west1-a --machine-type f1-micro --scopes=compute-rw \
       --metadata es_port=9301
On a running instance
gcloud compute instances add-metadata myesnode1 \
       --zone europe-west1-a \
       --metadata es_port=9301

GCE Tips

Store project id locally

If you don’t want to repeat the project id each time, you can save it in the local gcloud config

gcloud config set project es-cloud
Machine Permissions

If you have created a machine without the correct permissions, you will see 403 unauthorized error messages. To change the machine permissions on an existing instance, first stop the instance, then click Edit. Scroll down to Access Scopes to change the permissions. The other way to alter these permissions is to delete the instance (NOT THE DISK), then create another one with the correct permissions.

Creating machines with gcloud

Ensure the following flags are set:

--scopes=compute-rw
Creating with console (web)

When creating an instance using the web portal, click Show advanced options.

At the bottom of the page, under PROJECT ACCESS, choose >> Compute >> Read Write.

Creating with knife google

Set the service account scopes when creating the machine:

knife google server create www1 \
    -m n1-standard-1 \
    -I debian-8 \
    -Z us-central1-a \
    -i ~/.ssh/id_rsa \
    -x jdoe \
    --gce-service-account-scopes https://www.googleapis.com/auth/compute.full_control

Or, you may use the alias:

    --gce-service-account-scopes compute-rw

Testing GCE

Integration tests in this plugin require a working GCE configuration and are therefore disabled by default. To enable the tests, prepare a config file elasticsearch.yml with the following content:

cloud:
  gce:
      project_id: es-cloud
      zone: europe-west1-a
discovery:
      zen.hosts_provider: gce

Replace project_id and zone with your settings.

To run the tests:

mvn -Dtests.gce=true -Dtests.config=/path/to/config/file/elasticsearch.yml clean test

File-Based Discovery Plugin

The functionality provided by the discovery-file plugin is now available in Elasticsearch without requiring a plugin. This plugin still exists to ensure backwards compatibility, but it will be removed in a future version.

On installation, this plugin creates a file at $ES_PATH_CONF/discovery-file/unicast_hosts.txt that comprises comments that describe how to use it. It is preferable not to install this plugin and instead to create this file, and its containing directory, using standard tools.
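
For example, a minimal unicast_hosts.txt sketch (one host per line, optionally followed by a colon and a transport port; lines starting with # are comments; the hosts below are illustrative):

# Transport-layer addresses of the nodes to contact during discovery
10.0.0.5
10.0.0.6:9305
seed-node.example.internal:9301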

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install discovery-file

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove discovery-file

The node must be stopped before removing the plugin.

Ingest Plugins

The ingest plugins extend Elasticsearch by providing additional ingest node capabilities.

Core Ingest Plugins

The core ingest plugins are:

Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

Ingest geoip Processor Plugin

The geoip processor adds information about the geographical location of IP addresses, based on data from the Maxmind databases. This processor adds this information by default under the geoip field. The geoip processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See {ref}/geoip-processor.html[GeoIP processor] for more details.

Ingest user_agent Processor Plugin

A processor that extracts details from the User-Agent header value. The user_agent processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See {ref}/user-agent-processor.html[User Agent processor] for more details.

Community contributed ingest plugins

The following plugin has been contributed by our community:

Ingest Attachment Processor Plugin

The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.

You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.

The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will then skip the base64 decoding.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install ingest-attachment

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove ingest-attachment

The node must be stopped before removing the plugin.

Using the Attachment Processor in a Pipeline

Table 1. Attachment options
Name Required Default Description

field

yes

-

The field to get the base64 encoded data from

target_field

no

attachment

The field that will hold the attachment information

indexed_chars

no

100000

The number of chars being used for extraction to prevent huge fields. Use -1 for no limit.

indexed_chars_field

no

null

Field name from which you can overwrite the number of chars being used for extraction. See indexed_chars.

properties

no

all properties

Array of properties to select and store. Can be content, title, name, author, keywords, date, content_type, content_length, language

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

For example, this:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

Returns this:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem ipsum dolor sit amet",
      "content_length": 28
    }
  }
}

To specify only some fields to be extracted:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "properties": [ "content", "title" ]
      }
    }
  ]
}
Note
Extracting contents from binary data is a resource intensive operation. It is highly recommended to run pipelines using this processor on a dedicated ingest node.

Limit the number of extracted chars

To prevent extracting too many chars and overloading the node memory, the number of chars used for extraction is limited by default to 100000. You can change this value by setting indexed_chars. Use -1 for no limit, but ensure when setting this that your node will have enough heap to extract the content of very big documents.

You can also define this limit per document by extracting from a given field the limit to set. If the document has that field, it will overwrite the indexed_chars setting. To set this field, define the indexed_chars_field setting.

For example:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 11,
        "indexed_chars_field" : "max_size"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id

Returns this:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 35,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "attachment": {
      "content_type": "application/rtf",
      "language": "sl",
      "content": "Lorem ipsum",
      "content_length": 11
    }
  }
}
PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information",
  "processors" : [
    {
      "attachment" : {
        "field" : "data",
        "indexed_chars" : 11,
        "indexed_chars_field" : "max_size"
      }
    }
  ]
}
PUT my_index/_doc/my_id_2?pipeline=attachment
{
  "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
  "max_size": 5
}
GET my_index/_doc/my_id_2

Returns this:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id_2",
  "_version": 1,
  "_seq_no": 40,
  "_primary_term": 1,
  "_source": {
    "data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
    "max_size": 5,
    "attachment": {
      "content_type": "application/rtf",
      "language": "ro",
      "content": "Lorem",
      "content_length": 5
    }
  }
}

Using the Attachment Processor with arrays

To use the attachment processor within an array of attachments the {ref}/foreach-processor.html[foreach processor] is required. This enables the attachment processor to be run on the individual elements of the array.

For example, given the following source:

{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
    },
    {
      "filename" : "test.txt",
      "data" : "VGhpcyBpcyBhIHRlc3QK"
    }
  ]
}

In this case, we want to process the data field in each element of the attachments field and insert the properties into the document so the following foreach processor is used:

PUT _ingest/pipeline/attachment
{
  "description" : "Extract attachment information from arrays",
  "processors" : [
    {
      "foreach": {
        "field": "attachments",
        "processor": {
          "attachment": {
            "target_field": "_ingest._value.attachment",
            "field": "_ingest._value.data"
          }
        }
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
  "attachments" : [
    {
      "filename" : "ipsum.txt",
      "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
    },
    {
      "filename" : "test.txt",
      "data" : "VGhpcyBpcyBhIHRlc3QK"
    }
  ]
}
GET my_index/_doc/my_id

Returns this:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "_seq_no" : 50,
  "_primary_term" : 1,
  "found" : true,
  "_source" : {
    "attachments" : [
      {
        "filename" : "ipsum.txt",
        "data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
        "attachment" : {
          "content_type" : "text/plain; charset=ISO-8859-1",
          "language" : "en",
          "content" : "this is\njust some text",
          "content_length" : 24
        }
      },
      {
        "filename" : "test.txt",
        "data" : "VGhpcyBpcyBhIHRlc3QK",
        "attachment" : {
          "content_type" : "text/plain; charset=ISO-8859-1",
          "language" : "en",
          "content" : "This is a test",
          "content_length" : 16
        }
      }
    ]
  }
}

Note that the target_field needs to be set, otherwise the default value is used, which is a top level field named attachment. The properties on this top level field will contain the value of the first attachment only. However, by setting target_field to a value under _ingest._value, the properties are correctly associated with each attachment.

Ingest geoip Processor Plugin

The geoip processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See the {ref}/geoip-processor.html[GeoIP processor] for more details.

Using the geoip Processor in a Pipeline

See {ref}/geoip-processor.html#using-ingest-geoip[using ingest-geoip].

Ingest user_agent Processor Plugin

The user_agent processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See the {ref}/user-agent-processor.html[User Agent processor] for more details.

Management Plugins

Management plugins offer UIs for managing and interacting with Elasticsearch.

Core management plugins

The core management plugins are:

X-Pack

X-Pack contains the management and monitoring features for Elasticsearch. It aggregates cluster wide statistics and events and offers a single interface to view and analyze them. You can get a free license for basic monitoring or a higher level license for more advanced needs.

Mapper Plugins

Mapper plugins allow new field datatypes to be added to Elasticsearch.

Core mapper plugins

The core mapper plugins are:

Mapper Size Plugin

The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original {ref}/mapping-source-field.html[_source] field.

Mapper Murmur3 Plugin

The mapper-murmur3 plugin allows hashes to be computed at index-time and stored in the index for later use with the cardinality aggregation.

Mapper Annotated Text Plugin

The annotated text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).

Mapper Size Plugin

The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original {ref}/mapping-source-field.html[_source] field.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-size

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-size

The node must be stopped before removing the plugin.

Using the _size field

In order to enable the _size field, set the mapping as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "_size": {
        "enabled": true
      }
    }
  }
}

The value of the _size field is accessible in queries, aggregations, scripts, and when sorting:

# Example documents
PUT my_index/_doc/1
{
  "text": "This is a document"
}

PUT my_index/_doc/2
{
  "text": "This is another document"
}

GET my_index/_search
{
  "query": {
    "range": {
      "_size": { (1)
        "gt": 10
      }
    }
  },
  "aggs": {
    "sizes": {
      "terms": {
        "field": "_size", (2)
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_size": { (3)
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "size": {
      "script": "doc['_size']"  (4)
    }
  }
}
  1. Querying on the _size field

  2. Aggregating on the _size field

  3. Sorting on the _size field

  4. Accessing the _size field in scripts (inline scripts must be modules-security-scripting.html#enable-dynamic-scripting[enabled] for this example to work)

Mapper Murmur3 Plugin

The mapper-murmur3 plugin provides the ability to compute hash of field values at index-time and store them in the index. This can sometimes be helpful when running cardinality aggregations on high-cardinality and large string fields.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-murmur3

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-murmur3

The node must be stopped before removing the plugin.

Using the murmur3 field

The murmur3 field is typically used within a multi-field, so that both the original value and its hash are stored in the index:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "keyword",
          "fields": {
            "hash": {
              "type": "murmur3"
            }
          }
        }
      }
    }
  }
}

Such a mapping would allow you to refer to my_field.hash in order to get hashes of the values of the my_field field. This is only useful in order to run cardinality aggregations:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "This is a document"
}

PUT my_index/_doc/2
{
  "my_field": "This is another document"
}

GET my_index/_search
{
  "aggs": {
    "my_field_cardinality": {
      "cardinality": {
        "field": "my_field.hash" (1)
      }
    }
  }
}
  1. Counting unique values on the my_field.hash field

Running a cardinality aggregation on the my_field field directly would yield the same result; however, using my_field.hash instead might result in a speed-up if the field has a high cardinality. On the other hand, it is discouraged to use the murmur3 field on numeric fields and on string fields that are not almost unique, as the use of a murmur3 field is unlikely to bring significant speed-ups while increasing the amount of disk space required to store the index.

Mapper Annotated Text Plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).

The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token stream at the same position as the underlying text it annotates.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-annotated-text

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-annotated-text

The node must be stopped before removing the plugin.

Using the annotated-text field

The annotated-text field tokenizes text content in the same way as the more common text field (see "limitations" below), but also injects any marked-up annotation tokens directly into the search index:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "annotated_text"
        }
      }
    }
  }
}

Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text and structured tokens. The annotations use a markdown-like syntax using URL encoding of one or more values separated by the & symbol.

We can use the "_analyze" API to test how an example annotation would be stored as tokens in the search index:

GET my_index/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}

Response:

{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", (1)
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
  1. Note the whole annotation token Apple Inc. is placed, unchanged as a single token in the token stream and at the same position (position 2) as the text token (apple) it annotates.

We can now perform searches for annotations using regular term queries that don’t tokenize the provided search values. Annotations are a more precise way of matching, as can be seen in this example where a search for Beck will not match Jeff Beck:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"(1)
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"(2)
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
        "my_field": "Beck" (3)
    }
  }
}
  1. As well as tokenising the plain text into single words e.g. beck, here we inject the single token value Beck at the same position as beck in the token stream.

  2. Note annotations can inject multiple tokens at the same position - here we inject both the very specific value Jeff Beck and the broader term Guitarist. This enables broader positional queries e.g. finding mentions of a Guitarist near to strat.

  3. A benefit of searching with these carefully defined annotation tokens is that a query for Beck will not match document 2 that contains the tokens jeff, beck and Jeff Beck

Warning
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the document to be rejected with a parse failure. In the future we hope to have a use for the equals signs, so we will actively reject documents that contain this today.

Data modelling tips

Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for higher-precision search.

Entity resolution is a form of document enrichment undertaken by specialist software or people where references to entities in a document are disambiguated by attaching a canonical ID. The ID is used to resolve any number of aliases or distinguish between people with the same name. The hyperlinks connecting Wikipedia’s articles are a good example of resolved entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field but it often makes sense to include them in dedicated structured fields to support discovery via aggregations:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_unstructured_text_field": {
          "type": "annotated_text"
        },
        "my_structured_people_field": {
          "type": "text",
          "fields": {
          	"keyword" :{
          	  "type": "keyword"
          	}
          }
        }
      }
    }
  }
}

Applications would then typically provide content and discover it as follows:

# Example documents
PUT my_index/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] (1)
}

GET my_index/_search
{
  "query": {
    "query_string": {
        "query": "elasticsearch OR logstash OR kibana",(2)
        "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
  	"top_people" :{
  	    "significant_terms" : { (3)
	       "field" : "my_twitter_handles.keyword"
  	    }
  	}
  }
}
  1. Note the my_twitter_handles contains a list of the annotation values also used in the unstructured text. (Note the annotated_text syntax requires escaping). By repeating the annotation values in a structured field this application has ensured that the tokens discovered in the structured field can be used for search and highlighting in the unstructured field.

  2. In this example we search for documents that talk about components of the elastic stack

  3. We use the my_twitter_handles field here to discover people who are significantly associated with the elastic stack.

Avoiding over-matching annotations

By design, the regular text tokens and the annotation tokens co-exist in the same indexed field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a named entity (a person, place or company). The tokens for these named entities are inserted untokenized, and differ from typical text tokens because they are normally:

  • Mixed case e.g. Madonna

  • Multiple words e.g. Jeff Beck

  • Can have punctuation or numbers e.g. Apple Inc. or @kimchy

This means, for the most part, a search for a named entity in the annotated text field will not have any false positives e.g. when selecting Apple Inc. from an aggregation result you can drill down to highlight uses in the text without "over matching" on any text tokens like the word apple in this context:

the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the company elastic. In this case, a search on the annotated text field for the token elastic may match a text document such as this:

he fired an elastic band

To avoid such false matches, users should consider prefixing annotation values to ensure they don’t clash with text tokens, e.g.

[elastic](Company_elastic) released version 7.0 of the elastic stack today

Using the annotated highlighter

The annotated-text plugin includes a custom highlighter designed to mark up search hits in a way which is respectful of the original markup:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my_index/_search
{
  "query": {
    "query_string": {
        "query": "cats"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", (1)
        "require_field_match": false
      }
    }
  }
}
  1. The annotated highlighter type is designed for use with annotated_text fields

The annotated highlighter is based on the unified highlighter and supports the same settings but does not use the pre_tags or post_tags parameters. Rather than using html-like markup such as <em>cat</em> the annotated highlighter uses the same markdown-like syntax used for annotations and injects a key=value annotation where _hit_term is the key and the matched search term is the value e.g.

The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original text:

  • If the search term matches exactly the location of an existing annotation then the _hit_term key is merged into the url-like syntax used in the (…​) part of the existing annotation.

  • However, if the search term overlaps the span of an existing annotation it would break the markup formatting so the original annotation is removed in favour of a new annotation with just the search hit information in the results.

  • Any non-overlapping annotations in the original text are preserved in highlighter selections

Limitations

The annotated_text field type supports the same mapping settings as the text field type but with the following exceptions:

  • No support for fielddata or fielddata_frequency_filter

  • No support for index_prefixes or index_phrases indexing

Security Plugins

Security plugins add a security layer to Elasticsearch.

Core security plugins

The core security plugins are:

X-Pack

X-Pack is the Elastic product that makes it easy for anyone to add enterprise-grade security to their Elastic Stack. Designed to address the growing security needs of thousands of enterprises using the Elastic Stack today, X-Pack provides peace of mind when it comes to protecting your data.

Community contributed security plugins

The following plugins have been contributed by our community:

  • Readonly REST: High performance access control for Elasticsearch native REST API (by Simone Scarduzio)

Snapshot/Restore Repository Plugins

Repository plugins extend the {ref}/modules-snapshots.html[Snapshot/Restore] functionality in Elasticsearch by adding repositories backed by the cloud or by distributed file systems:

Core repository plugins

The core repository plugins are:

S3 Repository

The S3 repository plugin adds support for using S3 as a repository.

Azure Repository

The Azure repository plugin adds support for using Azure as a repository.

HDFS Repository

The Hadoop HDFS Repository plugin adds support for using HDFS as a repository.

Google Cloud Storage Repository

The GCS repository plugin adds support for using Google Cloud Storage service as a repository.

Community contributed repository plugins

The following plugin has been contributed by our community:

Azure Repository Plugin

The Azure Repository plugin adds support for using Azure as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install repository-azure

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove repository-azure

The node must be stopped before removing the plugin.

Azure Repository

To enable Azure repositories, you first have to define your azure storage settings as {ref}/secure-settings.html[secure settings], before starting up the node:

bin/elasticsearch-keystore add azure.client.default.account
bin/elasticsearch-keystore add azure.client.default.key

Where account is the azure account name and key the azure secret key. Instead of an azure secret key under key, you can alternatively define a shared access signature (SAS) token under sas_token to use for authentication. When using an SAS token instead of an account key, the SAS token must have read (r), write (w), list (l), and delete (d) permissions for the repository base path and all its contents. These permissions need to be granted for the blob service (b) and apply to resource types service (s), container (c), and object (o). These settings are used by the repository’s internal azure client.

Note that you can also define more than one account:

bin/elasticsearch-keystore add azure.client.default.account
bin/elasticsearch-keystore add azure.client.default.key
bin/elasticsearch-keystore add azure.client.secondary.account
bin/elasticsearch-keystore add azure.client.secondary.sas_token

default is the default account name which will be used by a repository, unless you set an explicit one in the repository settings.

The account, key, and sas_token storage settings are {ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you reload the settings, the internal azure clients, which are used to transfer the snapshot, will utilize the latest settings from the keystore.
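
For example, after adding or changing the keystore entries you can trigger a reload on all nodes with the reload API referenced above (a minimal console sketch):

POST _nodes/reload_secure_settings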

Note
In progress snapshot/restore jobs will not be preempted by a reload of the storage secure settings. They will complete using the client as it was built when the operation started.

You can set the client side timeout to use when making any single request. It can be defined globally, per account, or both. It’s not set by default, which means that Elasticsearch uses the default value set by the azure client (5 minutes).

max_retries can help to control the exponential backoff policy. It sets the number of retries in case of failures before considering the snapshot as failed. Defaults to 3 retries. The initial backoff period is defined by the Azure SDK as 30s, which means 30s of wait time before retrying after a first timeout or failure. The maximum backoff period is defined by the Azure SDK as 90s.

endpoint_suffix can be used to specify Azure endpoint suffix explicitly. Defaults to core.windows.net.

cloud.azure.storage.timeout: 10s
azure.client.default.max_retries: 7
azure.client.default.endpoint_suffix: core.chinacloudapi.cn
azure.client.secondary.timeout: 30s

In this example, the default client uses a timeout of 10s per try, with 7 retries before failing, and the endpoint suffix core.chinacloudapi.cn; the secondary client uses a timeout of 30s per try with the default 3 retries.

Important
Supported Azure Storage Account types

The Azure Repository plugin works with all Standard storage accounts:

  • Standard Locally Redundant Storage - Standard_LRS

  • Standard Zone-Redundant Storage - Standard_ZRS

  • Standard Geo-Redundant Storage - Standard_GRS

  • Standard Read Access Geo-Redundant Storage - Standard_RAGRS

Premium Locally Redundant Storage (Premium_LRS) is not supported as it is only usable as VM disk storage, not as general storage.

You can register a proxy per client using the following settings:

azure.client.default.proxy.host: proxy.host
azure.client.default.proxy.port: 8888
azure.client.default.proxy.type: http

Supported values for proxy.type are direct (default), http or socks. When proxy.type is set to http or socks, proxy.host and proxy.port must be provided.

Repository settings

The Azure repository supports the following settings:

client

Azure named client to use. Defaults to default.

container

Container name. You must create the azure container before creating the repository. Defaults to elasticsearch-snapshots.

base_path

Specifies the path within container to repository data. Defaults to empty (root directory).

chunk_size

Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 10MB, 5KB, 500B. Defaults to 64MB (64MB max).

compress

When set to true metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults to false.

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

location_mode

primary_only or secondary_only. Defaults to primary_only. Note that if you set it to secondary_only, it will force readonly to true.

Some examples, using scripts:

# The simplest one
PUT _snapshot/my_backup1
{
    "type": "azure"
}

# With some settings
PUT _snapshot/my_backup2
{
    "type": "azure",
    "settings": {
        "container": "backup-container",
        "base_path": "backups",
        "chunk_size": "32m",
        "compress": true
    }
}


# With two accounts defined in elasticsearch.yml (my_account1 and my_account2)
PUT _snapshot/my_backup3
{
    "type": "azure",
    "settings": {
        "client": "secondary"
    }
}
PUT _snapshot/my_backup4
{
    "type": "azure",
    "settings": {
        "client": "secondary",
        "location_mode": "primary_only"
    }
}

Example using Java:

client.admin().cluster().preparePutRepository("my_backup_java1")
    .setType("azure").setSettings(Settings.builder()
        .put(Storage.CONTAINER, "backup-container")
        .put(Storage.CHUNK_SIZE, new ByteSizeValue(32, ByteSizeUnit.MB))
    ).get();

Repository validation rules

According to the Azure container naming guide, a container name must be a valid DNS name, conforming to the following naming rules:

  • Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.

  • Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.

  • All letters in a container name must be lowercase.

  • Container names must be from 3 through 63 characters long.

S3 Repository Plugin

The S3 repository plugin adds support for using AWS S3 as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].

If you are looking for a hosted solution of Elasticsearch on AWS, please visit http://www.elastic.co/cloud.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install repository-s3

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove repository-s3

The node must be stopped before removing the plugin.

Getting Started

The plugin provides a repository type named s3 which may be used when creating a repository. The repository defaults to using ECS IAM Role or EC2 IAM Role credentials for authentication. The only mandatory setting is the bucket name:

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket"
  }
}

Client Settings

The client that you use to connect to S3 has a number of settings available. The settings have the form s3.client.CLIENT_NAME.SETTING_NAME. By default, s3 repositories use a client named default, but this can be modified using the repository setting client. For example:

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket",
    "client": "my_alternate_client"
  }
}

Most client settings can be added to the elasticsearch.yml configuration file with the exception of the secure settings, which you add to the {es} keystore. For more information about creating and updating the {es} keystore, see {ref}/secure-settings.html[Secure settings].

For example, before you start the node, run these commands to add AWS access key settings to the keystore:

bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key

All client secure settings of this plugin are {ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you reload the settings, the internal s3 clients, used to transfer the snapshot contents, will utilize the latest settings from the keystore. Any existing s3 repositories, as well as any newly created ones, will pick up the new values stored in the keystore.

Note
In-progress snapshot/restore tasks will not be preempted by a reload of the client’s secure settings. The task will complete using the client as it was built when the operation started.

The following list contains the available client settings. Those that must be stored in the keystore are marked as "secure" and are reloadable; the other settings belong in the elasticsearch.yml file.

access_key ({ref}/secure-settings.html[Secure])

An S3 access key. The secret_key setting must also be specified.

secret_key ({ref}/secure-settings.html[Secure])

An S3 secret key. The access_key setting must also be specified.

session_token ({ref}/secure-settings.html[Secure])

An S3 session token. The access_key and secret_key settings must also be specified.

endpoint

The S3 service endpoint to connect to. This defaults to s3.amazonaws.com but the AWS documentation lists alternative S3 endpoints. If you are using an S3-compatible service then you should set this to the service’s endpoint.

protocol

The protocol to use to connect to S3. Valid values are either http or https. Defaults to https.

proxy.host

The host name of a proxy to connect to S3 through.

proxy.port

The port of a proxy to connect to S3 through.

proxy.username ({ref}/secure-settings.html[Secure])

The username to connect to the proxy.host with.

proxy.password ({ref}/secure-settings.html[Secure])

The password to connect to the proxy.host with.

read_timeout

The socket timeout for connecting to S3. The value should specify the unit. For example, a value of 5s specifies a 5 second timeout. The default value is 50 seconds.

max_retries

The number of retries to use when an S3 request fails. The default value is 3.

use_throttle_retries

Whether retries should be throttled (i.e. should back off). Must be true or false. Defaults to true.
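
As an illustration, the non-secure client settings above could be combined in elasticsearch.yml as follows (a sketch only; the endpoint, proxy host, and values are hypothetical):

s3.client.default.endpoint: s3.eu-central-1.amazonaws.com
s3.client.default.protocol: https
s3.client.default.read_timeout: 50s
s3.client.default.max_retries: 3
s3.client.default.proxy.host: proxy.example.com
s3.client.default.proxy.port: 8080

The secure settings (access_key, secret_key, session_token, proxy.username, and proxy.password) must still be added to the keystore as shown earlier.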

S3-compatible services

There are a number of storage systems that provide an S3-compatible API, and the repository-s3 plugin allows you to use these systems in place of AWS S3. To do so, you should set the s3.client.CLIENT_NAME.endpoint setting to the system’s endpoint. This setting accepts IP addresses and hostnames and may include a port. For example, the endpoint may be 172.17.0.2 or 172.17.0.2:9000. You may also need to set s3.client.CLIENT_NAME.protocol to http if the endpoint does not support HTTPS.

Minio is an example of a storage system that provides an S3-compatible API. The repository-s3 plugin allows {es} to work with Minio-backed repositories as well as repositories stored on AWS S3. Other S3-compatible storage systems may also work with {es}, but these are not tested or supported.
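
For example, a client pointing at a local Minio instance might be configured in elasticsearch.yml as follows (a sketch; the address, port, and the client name minio are hypothetical):

s3.client.minio.endpoint: 172.17.0.2:9000
s3.client.minio.protocol: http

A repository can then reference this client by setting "client": "minio" in its repository settings, as in the earlier client example.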

Repository Settings

The s3 repository type supports a number of settings to customize how data is stored in S3. These can be specified when creating the repository. For example:

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket_name",
    "another_setting": "setting_value"
  }
}

The following settings are supported:

bucket

The name of the bucket to be used for snapshots. (Mandatory)

client

The name of the S3 client to use to connect to S3. Defaults to default.

base_path

Specifies the path to the repository data within its bucket. Defaults to an empty string, meaning that the repository is at the root of the bucket. The value of this setting should not start or end with a /.

chunk_size

Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 1GB, 10MB, 5KB, 500B. Defaults to 1GB.

compress

When set to true metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults to false.

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

server_side_encryption

When set to true files are encrypted on server side using AES256 algorithm. Defaults to false.

buffer_size

Minimum threshold below which the chunk is uploaded using a single request. Beyond this threshold, the S3 repository will use the AWS Multipart Upload API to split the chunk into several parts, each of buffer_size length, and to upload each part in its own request. Note that setting a buffer size lower than 5mb is not allowed since it will prevent the use of the Multipart API and may result in upload errors. It is also not possible to set a buffer size greater than 5gb as it is the maximum upload size allowed by S3. Defaults to the minimum between 100mb and 5% of the heap size.

canned_acl

The S3 repository supports all S3 canned ACLs: private, public-read, public-read-write, authenticated-read, log-delivery-write, bucket-owner-read, bucket-owner-full-control. Defaults to private. You can specify a canned ACL using the canned_acl setting. When the S3 repository creates buckets and objects, it adds the canned ACL to the buckets and objects.

storage_class

Sets the S3 storage class for objects stored in the snapshot repository. Values may be standard, reduced_redundancy, standard_ia. Defaults to standard. Changing this setting on an existing repository only affects the storage class for newly created objects, resulting in a mixed usage of storage classes. Additionally, S3 Lifecycle Policies can be used to manage the storage class of existing objects. Due to the extra complexity with the Glacier class lifecycle, it is not currently supported by the plugin. For more information about the different classes, see the AWS Storage Classes Guide.
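
Putting several of these options together, a repository using server-side encryption and the standard_ia storage class could be registered as in the sketch below (the bucket name, base path, and ACL are hypothetical):

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket_name",
    "base_path": "snapshots",
    "server_side_encryption": true,
    "storage_class": "standard_ia",
    "canned_acl": "bucket-owner-full-control"
  }
}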

Note
The option of defining client settings in the repository settings as documented below is considered deprecated, and will be removed in a future version.

In addition to the above settings, you may also specify all non-secure client settings in the repository settings. In this case, the client settings found in the repository settings will be merged with those of the named client used by the repository. Conflicts between client and repository settings are resolved by the repository settings taking precedence over client settings.

For example:

PUT _snapshot/my_s3_repository
{
  "type": "s3",
  "settings": {
    "client": "my_client_name",
    "bucket": "my_bucket_name",
    "endpoint": "my.s3.endpoint"
  }
}

This sets up a repository that uses all client settings from the client my_client_name except for the endpoint that is overridden to my.s3.endpoint by the repository settings.

Recommended S3 Permissions

In order to restrict the Elasticsearch snapshot process to the minimum required resources, we recommend using Amazon IAM in conjunction with pre-existing S3 buckets. Here is an example policy which allows the snapshot process access to an S3 bucket named "snaps.example.com". This may be configured through the AWS IAM console by creating a Custom Policy and using a Policy Document similar to this (changing snaps.example.com to your bucket name).

{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}

You may further restrict the permissions by specifying a prefix within the bucket; in this example the prefix is named "foo".

{
  "Statement": [
    {
      "Action": [
        "s3:ListBucket",
        "s3:GetBucketLocation",
        "s3:ListBucketMultipartUploads",
        "s3:ListBucketVersions"
      ],
      "Condition": {
        "StringLike": {
          "s3:prefix": [
            "foo/*"
          ]
        }
      },
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com"
      ]
    },
    {
      "Action": [
        "s3:GetObject",
        "s3:PutObject",
        "s3:DeleteObject",
        "s3:AbortMultipartUpload",
        "s3:ListMultipartUploadParts"
      ],
      "Effect": "Allow",
      "Resource": [
        "arn:aws:s3:::snaps.example.com/foo/*"
      ]
    }
  ],
  "Version": "2012-10-17"
}

The bucket needs to exist to register a repository for snapshots. If you did not create the bucket then the repository registration will fail.

Note: Starting in version 7.0, all bucket operations use the path style access pattern. In previous versions, the decision to use virtual hosted style or path style access was made by the AWS Java SDK.

AWS VPC Bandwidth Settings

AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch instances reside in a private subnet in an AWS VPC, then all traffic to S3 will go through that VPC's NAT instance. If your VPC's NAT instance is a smaller instance size (e.g. a t1.micro) or is handling a high volume of network traffic, your bandwidth to S3 may be limited by that NAT instance's networking bandwidth limitations.

Instances residing in a public subnet in an AWS VPC will connect to S3 via the VPC’s internet gateway and not be bandwidth limited by the VPC’s NAT instance.

Hadoop HDFS Repository Plugin

The HDFS repository plugin adds support for using HDFS File System as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install repository-hdfs

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove repository-hdfs

The node must be stopped before removing the plugin.

Getting started with HDFS

The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If the distro you are using is not protocol compatible with Apache Hadoop, consider replacing the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required).

Even if Hadoop is already installed on the Elasticsearch nodes, for security reasons, the required libraries need to be placed under the plugin folder. Note that in most cases, if the distro is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files (see below).

Windows Users

Using Apache Hadoop on Windows is problematic and thus not recommended. If you really want to use it, make sure you place the elusive winutils.exe under the plugin folder and point the HADOOP_HOME environment variable to it; this should minimize the permissions Hadoop requires (though you may still have to add some more).

Configuration Properties

Once installed, define the configuration for the hdfs repository through the {ref}/modules-snapshots.html[REST API]:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository",
    "conf.dfs.client.read.shortcircuit": "true"
  }
}

The following settings are supported:

uri

The uri address for hdfs. ex: "hdfs://<host>:<port>/". (Required)

path

The file path within the filesystem where data is stored/loaded. ex: "path/to/file". (Required)

load_defaults

Whether to load the default Hadoop configuration or not. (Enabled by default)

conf.<key>

Inlined configuration parameter to be added to the Hadoop configuration. (Optional) Only client-oriented properties from the Hadoop core and hdfs configuration files will be recognized by the plugin.

compress

Whether to compress the metadata or not. (Disabled by default)

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

chunk_size

Override the chunk size. (Disabled by default)

security.principal

Kerberos principal to use when connecting to a secured HDFS cluster. If you are using a service principal for your elasticsearch node, you may use the _HOST pattern in the principal name and the plugin will replace the pattern with the hostname of the node at runtime (see Creating the Secure Repository).
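
As a further illustration, several of these settings can be combined in a single repository definition (a sketch; the URI, path, and values are hypothetical):

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository",
    "compress": "true",
    "chunk_size": "10mb",
    "max_snapshot_bytes_per_sec": "100mb"
  }
}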

A Note on HDFS Availability

When you initialize a repository, its settings are persisted in the cluster state. When a node comes online, it will attempt to initialize all repositories for which it has settings. If your cluster has an HDFS repository configured, then all nodes in the cluster must be able to reach HDFS when starting. If not, then the node will fail to initialize the repository at start up and the repository will be unusable. If this happens, you will need to remove and re-add the repository or restart the offending node.

Hadoop Security

The HDFS Repository Plugin integrates seamlessly with Hadoop’s authentication model. The following authentication methods are supported by the plugin:

simple

Also means "no security" and is enabled by default. Uses information from underlying operating system account running Elasticsearch to inform Hadoop of the name of the current user. Hadoop makes no attempts to verify this information.

kerberos

Authenticates to Hadoop using a Kerberos principal and keytab. Interfacing with HDFS clusters secured with Kerberos requires a few additional steps to enable (see Principals and Keytabs and Creating the Secure Repository for more information).

Principals and Keytabs

Before attempting to connect to a secured HDFS cluster, provision the Kerberos principals and keytabs that the Elasticsearch nodes will use for authenticating to Kerberos. For maximum security and to avoid tripping up the Kerberos replay protection, you should create a service principal per node, following the pattern of elasticsearch/hostname@REALM.

Warning
In some cases, if the same principal is authenticating from multiple clients at once, services may reject authentication for those principals under the assumption that they could be replay attacks. If you are running the plugin in production with multiple nodes you should be using a unique service principal for each node.

On each Elasticsearch node, place the appropriate keytab file in the node’s configuration location under the repository-hdfs directory using the name krb5.keytab:

$> cd elasticsearch/config
$> ls
elasticsearch.yml  jvm.options        log4j2.properties  repository-hdfs/   scripts/
$> cd repository-hdfs
$> ls
krb5.keytab
Note
Make sure you have the correct keytabs! If you are using a service principal per node (like elasticsearch/hostname@REALM) then each node will need its own unique keytab file for the principal assigned to that host!
Creating the Secure Repository

Once your keytab files are in place and your cluster is started, creating a secured HDFS repository is simple. Just add the name of the principal that you will be authenticating as in the repository settings under the security.principal option:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "/user/elasticsearch/repositories/my_hdfs_repository",
    "security.principal": "elasticsearch@REALM"
  }
}

If you are using different service principals for each node, you can use the _HOST pattern in your principal name. Elasticsearch will automatically replace the pattern with the hostname of the node at runtime:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "/user/elasticsearch/repositories/my_hdfs_repository",
    "security.principal": "elasticsearch/_HOST@REALM"
  }
}
Authorization

Once Elasticsearch is connected and authenticated to HDFS, HDFS will infer a username to use for authorizing file access for the client. By default, it picks this username from the primary part of the kerberos principal used to authenticate to the service. For example, in the case of a principal like elasticsearch@REALM or elasticsearch/hostname@REALM then the username that HDFS extracts for file access checks will be elasticsearch.

Note
The repository plugin makes no assumptions about what Elasticsearch's principal name is. The main fragment of the Kerberos principal is not required to be elasticsearch. If you have a principal or service name that works better for you or your organization, feel free to use it instead!

Google Cloud Storage Repository Plugin

The GCS repository plugin adds support for using the Google Cloud Storage service as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install repository-gcs

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove repository-gcs

The node must be stopped before removing the plugin.

Getting started

The plugin uses the Google Cloud Java Client for Storage to connect to the Storage service. If you are using Google Cloud Storage for the first time, you must connect to the Google Cloud Platform Console and create a new project. After your project is created, you must enable the Cloud Storage Service for your project.

Creating a Bucket

The Google Cloud Storage service uses the concept of a bucket as a container for all the data. Buckets are usually created using the Google Cloud Platform Console. The plugin does not automatically create buckets.

To create a new bucket:

  1. Connect to the Google Cloud Platform Console.

  2. Select your project.

  3. Go to the Storage Browser.

  4. Click the Create Bucket button.

  5. Enter the name of the new bucket.

  6. Select a storage class.

  7. Select a location.

  8. Click the Create button.

For more detailed instructions, see the Google Cloud documentation.

Service Authentication

The plugin must authenticate the requests it makes to the Google Cloud Storage service. It is common for Google client libraries to employ a strategy named application default credentials. However, that strategy is not supported for use with Elasticsearch. The plugin operates under the Elasticsearch process, which runs with the security manager enabled. The security manager obstructs the "automatic" credential discovery. Therefore, you must configure service account credentials even if you are using an environment that does not normally require this configuration (such as Compute Engine, Kubernetes Engine or App Engine).

Using a Service Account

You have to obtain and provide service account credentials manually.

For detailed information about generating JSON service account files, see the Google Cloud documentation. Note that the PKCS12 format is not supported by this plugin.

Here is a summary of the steps:

  1. Connect to the Google Cloud Platform Console.

  2. Select your project.

  3. Go to the Permission tab.

  4. Select the Service Accounts tab.

  5. Click Create service account.

  6. After the account is created, select it and download a JSON key file.

A JSON service account file looks like this:

{
  "type": "service_account",
  "project_id": "your-project-id",
  "private_key_id": "...",
  "private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
  "client_email": "service-account-for-your-repository@your-project-id.iam.gserviceaccount.com",
  "client_id": "...",
  "auth_uri": "https://accounts.google.com/o/oauth2/auth",
  "token_uri": "https://accounts.google.com/o/oauth2/token",
  "auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
  "client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-bucket@your-project-id.iam.gserviceaccount.com"
}

To provide this file to the plugin, it must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. You must add a setting name of the form gcs.client.NAME.credentials_file, where NAME is the name of the client configuration for the repository. The implicit client name is default, but a different client name can be specified in the repository settings with the client key.

Note
Passing the file path via the GOOGLE_APPLICATION_CREDENTIALS environment variable is not supported.

For example, if you added a gcs.client.my_alternate_client.credentials_file setting in the keystore, you can configure a repository to use those credentials like this:

PUT _snapshot/my_gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "my_bucket",
    "client": "my_alternate_client"
  }
}
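
The keystore entry referenced by this example could have been added beforehand with a command along these lines (the file path is a placeholder):

bin/elasticsearch-keystore add-file gcs.client.my_alternate_client.credentials_file /path/to/service-account.json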

The credentials_file settings are {ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you reload the settings, the internal gcs clients, which are used to transfer the snapshot contents, utilize the latest settings from the keystore.

Note
Snapshot or restore jobs that are in progress are not preempted by a reload of the client’s credentials_file settings. They complete using the client as it was built when the operation started.

Client Settings

The client used to connect to Google Cloud Storage has a number of settings available. Client setting names are of the form gcs.client.CLIENT_NAME.SETTING_NAME and are specified inside elasticsearch.yml. The default client name looked up by a gcs repository is called default, but can be customized with the repository setting client.

For example:

PUT _snapshot/my_gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "my_bucket",
    "client": "my_alternate_client"
  }
}

Some settings are sensitive and must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. This is the case for the service account file:

bin/elasticsearch-keystore add-file gcs.client.default.credentials_file /path/service-account.json

The following are the available client settings. Those that must be stored in the keystore are marked as Secure.

credentials_file

The service account file that is used to authenticate to the Google Cloud Storage service. (Secure)

endpoint

The Google Cloud Storage service endpoint to connect to. This will be automatically determined by the Google Cloud Storage client but can be specified explicitly.

connect_timeout

The timeout to establish a connection to the Google Cloud Storage service. The value should specify the unit. For example, a value of 5s specifies a 5 second timeout. The value of -1 corresponds to an infinite timeout. The default value is 20 seconds.

read_timeout

The timeout to read data from an established connection. The value should specify the unit. For example, a value of 5s specifies a 5 second timeout. The value of -1 corresponds to an infinite timeout. The default value is 20 seconds.

application_name

Name used by the client when it uses the Google Cloud Storage service. Setting a custom name can be useful to authenticate your cluster when request statistics are logged in the Google Cloud Platform. Defaults to repository-gcs.

project_id

The Google Cloud project id. This will be automatically inferred from the credentials file but can be specified explicitly. For example, it can be used to switch between projects when the same credentials are usable for both the production and the development projects.
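
For instance, the non-secure client settings above could look like this in elasticsearch.yml (a sketch; the project id and values are hypothetical):

gcs.client.default.project_id: my-gcs-project
gcs.client.default.connect_timeout: 10s
gcs.client.default.read_timeout: 30s
gcs.client.default.application_name: my-elasticsearch-cluster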

Repository Settings

The gcs repository type supports a number of settings to customize how data is stored in Google Cloud Storage.

These can be specified when creating the repository. For example:

PUT _snapshot/my_gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "my_other_bucket",
    "base_path": "dev"
  }
}

The following settings are supported:

bucket

The name of the bucket to be used for snapshots. (Mandatory)

client

The name of the client to use to connect to Google Cloud Storage. Defaults to default.

base_path

Specifies the path within bucket to repository data. Defaults to the root of the bucket.

chunk_size

Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example: 10MB or 5KB. Defaults to 100MB, which is the maximum permitted.

compress

When set to true metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults to false.

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

application_name

Deprecated in 6.3.0: this setting is now defined in the client settings. Name used by the client when it uses the Google Cloud Storage service.
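
Combining several of the settings above, a repository definition might look like the following sketch (the bucket name and values are hypothetical):

PUT _snapshot/my_gcs_repository
{
  "type": "gcs",
  "settings": {
    "bucket": "my_other_bucket",
    "base_path": "dev",
    "chunk_size": "50MB",
    "compress": true
  }
}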

Recommended Bucket Permission

The service account used to access the bucket must have "Writer" access to the bucket:

  1. Connect to the Google Cloud Platform Console.

  2. Select your project.

  3. Go to the Storage Browser.

  4. Select the bucket and "Edit bucket permission".

  5. The service account must be configured as a "User" with "Writer" access.

Store Plugins

Store plugins offer alternatives to default Lucene stores.

Core store plugins

The core store plugins are:

Store SMB

The Store SMB plugin works around a bug in Windows SMB and Java on Windows.

Store SMB Plugin

The Store SMB plugin works around a bug in Windows SMB and Java on Windows.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install store-smb

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove store-smb

The node must be stopped before removing the plugin.

Working around a bug in Windows SMB and Java on windows

When using a shared file system based on the SMB protocol (like Azure File Service) to store indices, Lucene opens index segment files with a write-only flag. This is the correct way to open the files, as they will only be used for writes, and it allows different file system implementations to optimize for it. Sadly, on Windows with SMB, this disables the cache manager, causing writes to be slow. This has been described in LUCENE-6176, but it affects every Java program out there. It needs to be fixed outside of Elasticsearch and/or Lucene, either in Windows or in OpenJDK. For now, we provide support for opening the files with a read flag as well, but this should be considered experimental and the correct fix belongs in OpenJDK or Windows.

The Store SMB plugin provides two storage types optimized for SMB:

smb_mmap_fs

an SMB-specific implementation of the default {ref}/index-modules-store.html#mmapfs[mmap fs]

smb_simple_fs

an SMB-specific implementation of the default {ref}/index-modules-store.html#simplefs[simple fs]

To use one of these specific storage types, you need to install the Store SMB plugin and restart the node. Then configure Elasticsearch to set the storage type you want.

This can be configured for all indices by adding this to the elasticsearch.yml file:

index.store.type: smb_simple_fs

Note that this setting only applies to newly created indices.

It can also be set on a per-index basis at index creation time:

PUT my_index
{
   "settings": {
       "index.store.type": "smb_mmap_fs"
   }
}

Integrations

Integrations are not plugins, but are external tools or modules that make it easier to work with Elasticsearch.

CMS integrations

Supported by the community:

  • Drupal: Drupal Elasticsearch integration via Search API.

  • Drupal: Drupal Elasticsearch integration.

  • ElasticPress: Elasticsearch WordPress Plugin

  • WPSOLR: Elasticsearch (and Apache Solr) WordPress Plugin

  • Tiki Wiki CMS Groupware: Tiki has native support for Elasticsearch. This provides faster & better search (facets, etc.), along with some Natural Language Processing features (e.g. More Like This).

  • XWiki Next Generation Wiki: XWiki has an Elasticsearch and Kibana macro that allows running Elasticsearch queries and displaying the results in XWiki pages using XWiki's scripting language, as well as including Kibana widgets in XWiki pages.

Data import/export and validation

Note
Rivers were used to import data from external systems into Elasticsearch prior to the 2.0 release. Elasticsearch releases 2.0 and later do not support rivers.

Supported by Elasticsearch:

  • {logstash-ref}/plugins-outputs-elasticsearch.html[Logstash output to Elasticsearch]: The Logstash elasticsearch output plugin.

  • {logstash-ref}/plugins-inputs-elasticsearch.html[Elasticsearch input to Logstash] The Logstash elasticsearch input plugin.

  • {logstash-ref}/plugins-filters-elasticsearch.html[Elasticsearch event filtering in Logstash] The Logstash elasticsearch filter plugin.

  • {logstash-ref}/plugins-codecs-es_bulk.html[Elasticsearch bulk codec] The Logstash es_bulk plugin decodes the Elasticsearch bulk format into individual events.

Supported by the community:

  • JDBC importer: The Java Database Connection (JDBC) importer allows fetching data from JDBC sources for indexing into Elasticsearch (by Jörg Prante).

  • Kafka Standalone Consumer (Indexer) (https://github.com/BigDataDevs/kafka-elasticsearch-consumer): Reads messages from Kafka in batches, processes them, and bulk-indexes them into Elasticsearch. Flexible and scalable. More documentation is available in the GitHub repository's wiki.

  • Mongolastic: A tool that clones data from Elasticsearch to MongoDB and vice versa

  • Scrutineer: A high performance consistency checker to compare what you’ve indexed with your source of truth content (e.g. DB)

  • IMAP/POP3/Mail importer: The Mail importer allows fetching data from IMAP and POP3 servers for indexing into Elasticsearch (by Hendrik Saly).

  • FS Crawler: The File System (FS) crawler allows indexing documents (PDF, Open Office…) from your local file system and over SSH (by David Pilato).

Deployment

Supported by Elasticsearch:

  • Ansible playbook for Elasticsearch: An officially supported Ansible playbook for Elasticsearch. Tested with the latest versions of 5.x and 6.x on Ubuntu 14.04/16.04, Debian 8, and CentOS 7.

  • Puppet: Elasticsearch puppet module.

Supported by the community:

  • Chef: Chef cookbook for Elasticsearch

Framework integrations

Supported by the community:

  • Aspire for Elasticsearch: Aspire, from Search Technologies, is a powerful connector and processing framework designed for unstructured data. It has connectors to internal and external repositories including SharePoint, Documentum, Jive, RDB, file systems, websites and more, and can transform and normalize this data before indexing in Elasticsearch.

  • Apache Camel Integration: An Apache Camel component for integrating Elasticsearch.

  • Catmandu: An Elasticsearch backend for the Catmandu framework.

  • elasticsearch-test: Elasticsearch Java annotations for unit testing with JUnit

  • FOSElasticaBundle: Symfony2 Bundle wrapping Elastica.

  • Grails: Elasticsearch Grails plugin.

  • Haystack: Modular search for Django

  • Hibernate Search: Integration with Hibernate ORM, from the Hibernate team. Provides automatic synchronization of write operations, yet exposes full Elasticsearch capabilities for queries. Can return either native Elasticsearch results or re-map query results back into managed entities loaded within a transaction from the reference database.

  • play2-elasticsearch: Elasticsearch module for Play Framework 2.x

  • Spring Data Elasticsearch: Spring Data implementation for Elasticsearch

  • Spring Elasticsearch: Spring Factory for Elasticsearch

  • Twitter Storehaus: Thin asynchronous Scala client for Storehaus.

Hadoop integrations

Supported by Elasticsearch:

  • es-hadoop: Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.

Health and Performance Monitoring

Supported by the community:

  • check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios.

  • check-es: Nagios/Shinken plugins for checking on Elasticsearch

  • es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing.

  • ElasticOcean: Elasticsearch & DigitalOcean iOS real-time monitoring tool to keep an eye on DigitalOcean Droplets, Elasticsearch instances, or both on the go.

  • opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch

  • Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices.

  • SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc.

Other integrations

Supported by the community:

  • Pes: A pluggable elastic JavaScript query DSL builder for Elasticsearch

  • Wireshark: Protocol dissection for Zen discovery, HTTP and the binary protocol

  • ItemsAPI: Search backend for mobile and web

These projects appear to have been abandoned:

  • daikon: Daikon Elasticsearch CLI

  • dangle: A set of AngularJS directives that provide common visualizations for Elasticsearch based on D3.

  • eslogd: Linux daemon that replicates events to a central Elasticsearch server in realtime

Help for plugin authors

The Elasticsearch repository contains examples of:

  • a Java plugin with custom settings.

  • a Java plugin that registers a REST handler.

  • a Java rescore plugin.

  • a Java script plugin.

These examples provide the bare bones needed to get started. For more information about how to write a plugin, we recommend looking at the plugins listed in this documentation for inspiration.

Plugin descriptor file

All plugins must contain a file called plugin-descriptor.properties. The format for this file is described in detail in this example:

# Elasticsearch plugin descriptor file
# This file must exist as 'plugin-descriptor.properties' inside a plugin.
#
### example plugin for "foo"
#
# foo.zip <-- zip file for the plugin, with this structure:
# |____   <arbitrary name1>.jar <-- classes, resources, dependencies
# |____   <arbitrary nameN>.jar <-- any number of jars
# |____   plugin-descriptor.properties <-- example contents below:
#
# classname=foo.bar.BazPlugin
# description=My cool plugin
# version=6.0
# elasticsearch.version=6.0
# java.version=1.8
#
### mandatory elements for all plugins:
#
# 'description': simple summary of the plugin
description=${description}
#
# 'version': plugin's version
version=${version}
#
# 'name': the plugin name
name=${name}
#
# 'classname': the name of the class to load, fully-qualified.
classname=${classname}
#
# 'java.version': version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=${javaVersion}
#
# 'elasticsearch.version': version of elasticsearch compiled against
elasticsearch.version=${elasticsearchVersion}
### optional elements for plugins:
#
#  'extended.plugins': other plugins this plugin extends through SPI
extended.plugins=${extendedPlugins}
#
# 'has.native.controller': whether or not the plugin has a native controller
has.native.controller=${hasNativeController}
<% if (licensed) { %>
# This plugin requires that a license agreement be accepted before installation
licensed=${licensed}
<% } %>

Either fill in this template yourself or, if you are using Elasticsearch’s Gradle build system, you can fill in the necessary values in the build.gradle file for your plugin.
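
If you use the Gradle build, these values typically come from the esplugin extension provided by Elasticsearch's build tools; a minimal sketch (the plugin name, description, and class are hypothetical) might look like this:

apply plugin: 'elasticsearch.esplugin'

esplugin {
  name 'example-foo'
  description 'My cool plugin'
  classname 'foo.bar.BazPlugin'
}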

Mandatory elements for plugins

Element                  Type    Description
description              String  simple summary of the plugin
version                  String  plugin's version
name                     String  the plugin name
classname                String  the name of the class to load, fully-qualified
java.version             String  version of java the code is built against; use the system property java.specification.version; the version string must be a sequence of nonnegative decimal integers separated by "."'s and may have leading zeros
elasticsearch.version    String  version of Elasticsearch compiled against

Note that only jar files at the root of the plugin are added to the classpath for the plugin! If you need other resources, package them into a resources jar.

Important
Plugin release lifecycle

You will have to release a new version of the plugin for each new Elasticsearch release. This version is checked when the plugin is loaded so Elasticsearch will refuse to start in the presence of plugins with the incorrect elasticsearch.version.

Testing your plugin

When testing a Java plugin, it will only be auto-loaded if it is in the plugins/ directory. Use bin/elasticsearch-plugin install file:///path/to/your/plugin to install your plugin for testing.

You may also load your plugin within the test framework for integration tests. Read more in {ref}/integration-tests.html#changing-node-configuration[Changing Node Configuration].
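
For example, an integration test that loads the plugin into the test cluster usually extends ESIntegTestCase and overrides nodePlugins(); the sketch below assumes a hypothetical plugin class named MyPlugin:

import java.util.Collection;
import java.util.Collections;

import org.elasticsearch.plugins.Plugin;
import org.elasticsearch.test.ESIntegTestCase;

public class MyPluginIT extends ESIntegTestCase {

    // ask the test framework to load the plugin on every node of the test cluster
    @Override
    protected Collection<Class<? extends Plugin>> nodePlugins() {
        return Collections.singletonList(MyPlugin.class);
    }

    public void testClusterIsUp() {
        // trivial smoke test; real tests would exercise the plugin's functionality
        assertNotNull(client());
    }
}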

Java Security permissions

Some plugins may need additional security permissions. A plugin can include the optional plugin-security.policy file containing grant statements for additional permissions. Any additional permissions will be displayed to the user with a large warning, and they will have to confirm them when installing the plugin interactively. So if possible, it is best to avoid requesting any spurious permissions!
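
A plugin-security.policy file is a standard Java policy file made up of one or more grant blocks; the permission in the sketch below is only a hypothetical example, not one required by any particular plugin:

grant {
  // example: allow the plugin's jars to open outbound network connections
  permission java.net.SocketPermission "*", "connect,resolve";
};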

If you are using the Elasticsearch Gradle build system, place this file in src/main/plugin-metadata and it will be applied during unit tests as well.

Keep in mind that the Java security model is stack-based, and the additional permissions are only granted to the jars in your plugin, so you will have to write proper security code around operations requiring elevated privileges. It is recommended to add a check to prevent unprivileged code (such as scripts) from gaining escalated permissions. For example:

// ES permission you should check before doPrivileged() blocks
import java.security.AccessController;
import java.security.PrivilegedAction;

import org.elasticsearch.SpecialPermission;

SecurityManager sm = System.getSecurityManager();
if (sm != null) {
  // unprivileged code such as scripts do not have SpecialPermission
  sm.checkPermission(new SpecialPermission());
}
AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
  // sensitive operation
  return null;
});

See Secure Coding Guidelines for Java SE for more information.

Appendix A: Deleted pages

The following pages have moved or been deleted.

Multicast Discovery Plugin

The multicast-discovery plugin has been removed. Instead, configure networking using unicast (see {ref}/modules-network.html[Network settings]) or using one of the cloud discovery plugins.

AWS Cloud Plugin

Looking for a hosted solution for Elasticsearch on AWS? Check out http://www.elastic.co/cloud.

The Elasticsearch cloud-aws plugin has been split into two separate plugins: the EC2 Discovery Plugin (discovery-ec2) and the S3 Repository Plugin (repository-s3).

Azure Cloud Plugin

The cloud-azure plugin has been split into two separate plugins: the Azure Discovery Plugin (discovery-azure-classic) and the Azure Repository Plugin (repository-azure).

GCE Cloud Plugin

The cloud-gce plugin has been renamed to GCE Discovery Plugin (discovery-gce).

Delete-By-Query plugin removed

The Delete-By-Query plugin has been removed in favor of a new {ref}/docs-delete-by-query.html[Delete By Query API] implementation in core.