Introduction to plugins
Plugins are a way to enhance the core Elasticsearch functionality in a custom manner. They add custom mapping types, custom analyzers, native scripts, custom discovery mechanisms, and more.
Plugins contain JAR files, but may also contain scripts and config files, and must be installed on every node in the cluster. After installation, each node must be restarted before the plugin becomes visible.
Note: A full cluster restart is required for installing plugins that have custom cluster state metadata, such as X-Pack. It is still possible to upgrade such plugins with a rolling restart.
This documentation distinguishes two categories of plugins:
- Core Plugins: This category identifies plugins that are part of the Elasticsearch project. Delivered at the same time as Elasticsearch, their version number always matches the version number of Elasticsearch itself. These plugins are maintained by the Elastic team with the appreciated help of amazing community members (for open source plugins). Issues and bug reports can be reported on the GitHub project page.
- Community contributed: This category identifies plugins that are external to the Elasticsearch project. They are provided by individual developers or private companies and have their own licenses as well as their own versioning system. Issues and bug reports can usually be reported on the community plugin's web site.
For advice on writing your own plugin, see Help for plugin authors.
Important: Site plugins (plugins containing HTML, CSS and JavaScript) are no longer supported.
Plugin Management
The plugin script is used to install, list, and remove plugins. It is located in the $ES_HOME/bin directory by default, but it may be in a different location depending on which Elasticsearch package you installed:

- {ref}/zip-targz.html#zip-targz-layout[Directory layout of .zip and .tar.gz archives]
- {ref}/deb.html#deb-layout[Directory layout of Debian package]
- {ref}/rpm.html#rpm-layout[Directory layout of RPM]
Run the following command to get usage instructions:
sudo bin/elasticsearch-plugin -h
Important: Running as root

If Elasticsearch was installed using the deb or rpm package then run
Installing Plugins
The documentation for each plugin usually includes specific installation instructions for that plugin, but below we document the various available options:
Core Elasticsearch plugins
Core Elasticsearch plugins can be installed as follows:
sudo bin/elasticsearch-plugin install [plugin_name]
For instance, to install the core ICU plugin, just run the following command:
sudo bin/elasticsearch-plugin install analysis-icu
This command will install the version of the plugin that matches your Elasticsearch version and also show a progress bar while downloading.
Custom URL or file system
A plugin can also be downloaded directly from a custom location by specifying the URL:
sudo bin/elasticsearch-plugin install [url] (1)

(1) must be a valid URL; the plugin name is determined from its descriptor.

- Unix: To install a plugin from your local file system at /path/to/plugin.zip, you could run:

    sudo bin/elasticsearch-plugin install file:///path/to/plugin.zip

- Windows: To install a plugin from your local file system at C:\path\to\plugin.zip, you could run:

    bin\elasticsearch-plugin install file:///C:/path/to/plugin.zip

  Note: Any path that contains spaces must be wrapped in quotes!

  Note: If you are installing a plugin from the filesystem, the plugin distribution must not be contained in the plugins directory of the node that you are installing the plugin to, or installation will fail.

- HTTP: To install a plugin from an HTTP URL:
sudo bin/elasticsearch-plugin install http://some.domain/path/to/plugin.zip
The plugin script will refuse to talk to an HTTPS URL with an untrusted certificate. To use a self-signed HTTPS cert, you will need to add the CA cert to a local Java truststore and pass the location to the script as follows:
sudo ES_JAVA_OPTS="-Djavax.net.ssl.trustStore=/path/to/trustStore.jks" bin/elasticsearch-plugin install https://host/plugin.zip
Mandatory Plugins
If you rely on some plugins, you can define mandatory plugins by adding the plugin.mandatory setting to the config/elasticsearch.yml file, for example:

plugin.mandatory: analysis-icu,lang-js
For safety reasons, a node will not start if it is missing a mandatory plugin.
Listing, Removing and Updating Installed Plugins
Listing plugins
A list of the currently loaded plugins can be retrieved with the list option:

sudo bin/elasticsearch-plugin list

Alternatively, use the {ref}/cluster-nodes-info.html[node-info API] to find out which plugins are installed on each node in the cluster.
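For example, a minimal sketch of such a request, which returns the plugin information for every node:

GET _nodes/plugins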
Removing plugins

Plugins can be removed manually, by deleting the appropriate directory under plugins/, or using the public script:

sudo bin/elasticsearch-plugin remove [pluginname]

After a Java plugin has been removed, you will need to restart the node to complete the removal process.

By default, plugin configuration files (if any) are preserved on disk; this is so that configuration is not lost while upgrading a plugin. If you wish to purge the configuration files while removing a plugin, use -p or --purge. This option can also be used after a plugin is removed to remove any lingering configuration files.
Updating plugins
Plugins are built for a specific version of Elasticsearch, and therefore must be reinstalled each time Elasticsearch is updated.
sudo bin/elasticsearch-plugin remove [pluginname]
sudo bin/elasticsearch-plugin install [pluginname]
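For example, after upgrading Elasticsearch, the ICU plugin could be reinstalled like this (the plugin name is only illustrative):

sudo bin/elasticsearch-plugin remove analysis-icu
sudo bin/elasticsearch-plugin install analysis-icu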
Other command line parameters

The plugin script supports a number of other command line parameters:

Silent/Verbose mode

The --verbose parameter outputs more debug information, while the --silent parameter turns off all output including the progress bar. The script may return the following exit codes:

Exit code | Description
---|---
0 | everything was OK
64 | unknown command or incorrect option parameter
74 | IO error
70 | any other error
Batch mode
Certain plugins require more privileges than those provided by default in core Elasticsearch. These plugins will list the required privileges and ask the user for confirmation before continuing with installation.
When running the plugin install script from another program (e.g. install automation scripts), the plugin script should detect that it is not being called from the console and skip the confirmation response, automatically granting all requested permissions. If console detection fails, then batch mode can be forced by specifying -b or --batch as follows:
sudo bin/elasticsearch-plugin install --batch [pluginname]
Custom config directory
If your elasticsearch.yml config file is in a custom location, you will need to specify the path to the config file when using the plugin script. You can do this as follows:
sudo ES_PATH_CONF=/path/to/conf/dir bin/elasticsearch-plugin install <plugin name>
Proxy settings
To install a plugin via a proxy, you can add the proxy details to the ES_JAVA_OPTS environment variable with the Java settings http.proxyHost and http.proxyPort (or https.proxyHost and https.proxyPort):
sudo ES_JAVA_OPTS="-Dhttp.proxyHost=host_name -Dhttp.proxyPort=port_number -Dhttps.proxyHost=host_name -Dhttps.proxyPort=https_port_number" bin/elasticsearch-plugin install analysis-icu
Or on Windows:
set ES_JAVA_OPTS="-Dhttp.proxyHost=host_name -Dhttp.proxyPort=port_number -Dhttps.proxyHost=host_name -Dhttps.proxyPort=https_port_number"
bin\elasticsearch-plugin install analysis-icu
Plugins directory
The default location of the plugins directory depends on which package you install:

- {ref}/zip-targz.html#zip-targz-layout[Directory layout of .zip and .tar.gz archives]
- {ref}/deb.html#deb-layout[Directory layout of Debian package]
- {ref}/rpm.html#rpm-layout[Directory layout of RPM]
API Extension Plugins
API extension plugins add new functionality to Elasticsearch by adding new APIs or features, usually to do with search or mapping.
Community contributed API extension plugins
A number of plugins have been contributed by our community:
- carrot2 Plugin: Results clustering with carrot2 (by Dawid Weiss)
- Elasticsearch Trigram Accelerated Regular Expression Filter (by Wikimedia Foundation/Nik Everett)
- Elasticsearch Experimental Highlighter (by Wikimedia Foundation/Nik Everett)
- Entity Resolution Plugin: Uses Duke for duplication detection (by Yann Barraud)
- Entity Resolution Plugin (zentity): Real-time entity resolution with pure Elasticsearch (by Dave Moore)
- PQL language Plugin: Allows Elasticsearch to be queried with simple pipeline query syntax.
- Elasticsearch Taste Plugin: Mahout Taste-based Collaborative Filtering implementation (by CodeLibs Project)
- WebSocket Change Feed Plugin (by ForgeRock/Chris Clifton)
Alerting Plugins
Alerting plugins allow Elasticsearch to monitor indices and to trigger alerts when thresholds are breached.
Core alerting plugins
The core alerting plugins are:
- X-Pack: X-Pack contains the alerting and notification product for Elasticsearch that lets you take action based on changes in your data. It is designed around the principle that if you can query something in Elasticsearch, you can alert on it. Simply define a query, condition, schedule, and the actions to take, and X-Pack will do the rest.
Analysis Plugins
Analysis plugins extend Elasticsearch by adding new analyzers, tokenizers, token filters, or character filters to Elasticsearch.
Core analysis plugins
The core analysis plugins are:
- ICU: Adds extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
- Kuromoji: Advanced analysis of Japanese using the Kuromoji analyzer.
- Nori: Morphological analysis of Korean using the Lucene Nori analyzer.
- Phonetic: Analyzes tokens into their phonetic equivalent using Soundex, Metaphone, Caverphone, and other codecs.
- SmartCN: An analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
- Stempel: Provides high quality stemming for Polish.
- Ukrainian: Provides stemming for Ukrainian.
Community contributed analysis plugins
A number of analysis plugins have been contributed by our community:
- IK Analysis Plugin (by Medcl)
- Pinyin Analysis Plugin (by Medcl)
- Vietnamese Analysis Plugin (by Duy Do)
- Network Addresses Analysis Plugin (by Ofir123)
- Dandelion Analysis Plugin (by ZarHenry96)
- STConvert Analysis Plugin (by Medcl)
ICU Analysis Plugin
The ICU Analysis plugin integrates the Lucene ICU module into {es}, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
Important: ICU analysis and backwards compatibility

From time to time, the ICU library receives updates such as adding new characters and emojis, and improving collation (sort) orders. These changes may or may not affect search and sort orders, depending on which character sets you are using. While we restrict ICU upgrades to major versions, you may find that an index created in the previous major version will need to be reindexed in order to return correct (and correctly ordered) results, and to take advantage of new characters.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-icu
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-icu/analysis-icu-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-icu
The node must be stopped before removing the plugin.
ICU Analyzer
Performs basic normalization, tokenization and character folding, using the icu_normalizer char filter, icu_tokenizer and icu_normalizer token filter.

The following parameters are accepted:

Parameter | Description
---|---
method | Normalization method. Accepts nfkc, nfc or nfkc_cf (default).
mode | Normalization mode. Accepts compose (default) or decompose.
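As a usage sketch (assuming the analyzer type name icu_analyzer and the parameter names above), the analyzer can be configured explicitly in the index settings:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_normalizing_analyzer": {
            "type": "icu_analyzer",
            "method": "nfkc_cf",
            "mode": "decompose"
          }
        }
      }
    }
  }
}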
ICU Normalization Character Filter
Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively:

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

Here are two examples, the default usage and a customised character filter:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nfkc_cf_normalized": { (1)
"tokenizer": "icu_tokenizer",
"char_filter": [
"icu_normalizer"
]
},
"nfd_normalized": { (2)
"tokenizer": "icu_tokenizer",
"char_filter": [
"nfd_normalizer"
]
}
},
"char_filter": {
"nfd_normalizer": {
"type": "icu_normalizer",
"name": "nfc",
"mode": "decompose"
}
}
}
}
}
}
(1) Uses the default nfkc_cf normalization.
(2) Uses the customized nfd_normalizer character filter, which is set to use nfc normalization with decomposition.
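To see the effect of the customized character filter, you can run an _analyze request against the index created above; this is only a usage sketch and the exact tokens depend on the input text:

GET icu_sample/_analyze
{
  "analyzer": "nfd_normalized",
  "text": "café"
}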
ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the {ref}/analysis-standard-tokenizer.html[standard tokenizer], but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_icu_analyzer": {
"tokenizer": "icu_tokenizer"
}
}
}
}
}
}
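As a usage sketch, the analyzer defined above can be exercised with an _analyze request; the resulting tokens depend on the dictionary-based segmentation:

GET icu_sample/_analyze
{
  "analyzer": "my_icu_analyzer",
  "text": "東京スカイツリー"
}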
Rules customization
experimental[This functionality is marked as experimental in Lucene]
You can customize the icu_tokenizer behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files setting, which should contain a comma-separated list of code:rulefile pairs in the following format: a four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following user file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
Then create an analyzer to use this rule file as follows:
PUT icu_sample
{
"settings": {
"index":{
"analysis":{
"tokenizer" : {
"icu_user_file" : {
"type" : "icu_tokenizer",
"rule_files" : "Latn:KeywordTokenizer.rbbi"
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "icu_user_file"
}
}
}
}
}
}
GET icu_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "Elasticsearch. Wow!"
}
The above analyze request returns the following:
{
"tokens": [
{
"token": "Elasticsearch. Wow!",
"start_offset": 0,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 0
}
]
}
ICU Normalization Token Filter
Normalizes characters as explained here. It registers itself as the icu_normalizer token filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default).

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

You should probably prefer the Normalization character filter.

Here are two examples, the default usage and a customised token filter:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nfkc_cf_normalized": { (1)
"tokenizer": "icu_tokenizer",
"filter": [
"icu_normalizer"
]
},
"nfc_normalized": { (2)
"tokenizer": "icu_tokenizer",
"filter": [
"nfc_normalizer"
]
}
},
"filter": {
"nfc_normalizer": {
"type": "icu_normalizer",
"name": "nfc"
}
}
}
}
}
}
(1) Uses the default nfkc_cf normalization.
(2) Uses the customized nfc_normalizer token filter, which is set to use nfc normalization.
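As a usage sketch, the customized token filter can be exercised through the nfc_normalized analyzer defined above (the output depends on the input text):

GET icu_sample/_analyze
{
  "analyzer": "nfc_normalized",
  "text": "café"
}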
ICU Folding Token Filter
Case folding of Unicode characters based on UTR#30, like the {ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter] on steroids. It registers itself as the icu_folding token filter and is available to all indices:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
}
}
}
}
}
}
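As a usage sketch, the folded analyzer can be tried with an _analyze request; accented characters should be folded to their unaccented, lowercased equivalents, though the exact output depends on the input:

GET icu_sample/_analyze
{
  "analyzer": "folded",
  "text": "Besançon"
}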
The ICU folding token filter already does Unicode normalization, so there is no need to use the normalization character or token filter as well.
Which letters are folded can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. It is important to note that both upper and lowercase forms should be specified, and that these filtered characters are not lowercased, which is why we add the lowercase filter as well:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"swedish_analyzer": {
"tokenizer": "icu_tokenizer",
"filter": [
"swedish_folding",
"lowercase"
]
}
},
"filter": {
"swedish_folding": {
"type": "icu_folding",
"unicodeSetFilter": "[^åäöÅÄÖ]"
}
}
}
}
}
}
ICU Collation Token Filter
Warning: This token filter has been deprecated since Lucene 5.0. Please use the ICU Collation Keyword Field instead.
ICU Collation Keyword Field
Collations are used for sorting documents in a language-specific word order.
The icu_collation_keyword field type is available to all indices and will encode the terms directly as bytes in a doc values field and a single indexed token, just like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation], which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in "phonebook" order:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"name": { (1)
"type": "text",
"fields": {
"sort": { (2)
"type": "icu_collation_keyword",
"index": false,
"language": "de",
"country": "DE",
"variant": "@collation=phonebook"
}
}
}
}
}
}
}
GET _search (3)
{
"query": {
"match": {
"name": "Fritz"
}
},
"sort": "name.sort"
}
(1) The name field uses the standard analyzer, and so supports full text queries.
(2) The name.sort field is an icu_collation_keyword field that will preserve the name as a single-token doc value, and applies the German "phonebook" order.
(3) An example query which searches the name field and sorts on the name.sort field.
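As a brief usage sketch, you could index a couple of documents (the names are only illustrative) before running the sort query shown above:

PUT my_index/_doc/1
{ "name": "Fritz" }

PUT my_index/_doc/2
{ "name": "Förster" }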
Parameters for ICU Collation Keyword Fields
The following parameters are accepted by icu_collation_keyword fields:

Parameter | Description
---|---
doc_values | Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.
index | Should the field be searchable? Accepts true (default) or false.
null_value | Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.
store | Whether the field value should be stored and retrievable separately from the {ref}/mapping-source-field.html[_source] field. Accepts true or false (default).
fields | Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.
Collation options

- strength: The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.
- decomposition: Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user's responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world's languages do not require text normalization, most locales set no as the default decomposition mode.
The following options are expert only:

- alternate: Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable, which boils down to ignoring punctuation and whitespace.
- case_level: Possible values: true or false (default). Whether case level sorting is required. When strength is set to primary this will ignore accent differences.
- case_first: Possible values: lower or upper. Useful to control which case is sorted first when case is not ignored for strength tertiary. The default depends on the collation.
- numeric: Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21.
- variable_top: Single character or contraction. Controls what is variable for alternate.
- hiragana_quaternary_mode: Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters in quaternary strength.
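As a sketch of how these options are applied, they can be set directly on the field mapping; the index name, field name, and option values below are only illustrative:

PUT my_collation_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "icu_collation_keyword",
          "language": "en",
          "strength": "primary",
          "numeric": true
        }
      }
    }
  }
}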
ICU Transform Token Filter
Transforms are used to process Unicode text in many different ways, such as case mapping, normalization, transliteration and bidirectional text handling.
You can define which transformation you want to apply with the id parameter (defaults to Null), and specify text direction with the dir parameter, which accepts forward (default) for LTR and reverse for RTL. Custom rulesets are not yet supported.
For example:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"latin": {
"tokenizer": "keyword",
"filter": [
"myLatinTransform"
]
}
},
"filter": {
"myLatinTransform": {
"type": "icu_transform",
"id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" (1)
}
}
}
}
}
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "你好" (2)
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "здравствуйте" (3)
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "こんにちは" (4)
}
(1) This transform transliterates characters to Latin, separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.
(2) Returns ni hao.
(3) Returns zdravstvujte.
(4) Returns kon'nichiha.

For more documentation, please see the user guide of ICU Transform.
Japanese (kuromoji) Analysis Plugin
The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-kuromoji
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-kuromoji/analysis-kuromoji-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-kuromoji
The node must be stopped before removing the plugin.
kuromoji analyzer

The kuromoji analyzer consists of the following tokenizer and token filters:

- kuromoji_tokenizer
- kuromoji_baseform token filter
- kuromoji_part_of_speech token filter
- {ref}/analysis-cjk-width-tokenfilter.html[cjk_width] token filter
- ja_stop token filter
- kuromoji_stemmer token filter
- {ref}/analysis-lowercase-tokenfilter.html[lowercase] token filter

It supports the mode and user_dictionary settings from kuromoji_tokenizer.
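As a minimal sketch (assuming the analyzer type name kuromoji), these settings can be passed when defining a custom analyzer based on it:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_search_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}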
kuromoji_iteration_mark character filter

The kuromoji_iteration_mark character filter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. It accepts the following settings:

- normalize_kanji: Indicates whether kanji iteration marks should be normalized. Defaults to true.
- normalize_kana: Indicates whether kana iteration marks should be normalized. Defaults to true.
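As a usage sketch, the character filter can be wired into a custom analyzer; the analyzer name below is only illustrative:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "iteration_mark_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "kuromoji_iteration_mark"
            ]
          }
        }
      }
    }
  }
}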
kuromoji_tokenizer

The kuromoji_tokenizer accepts the following settings:

- mode: The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

  - normal: Normal segmentation, no decomposition for compounds. Example output:

      関西国際空港
      アブラカダブラ

  - search: Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

      関西, 関西国際空港, 国際, 空港
      アブラカダブラ

  - extended: Extended mode outputs unigrams for unknown words. Example output:

      関西, 国際, 空港
      ア, ブ, ラ, カ, ダ, ブ, ラ

- discard_punctuation: Whether punctuation should be discarded from the output. Defaults to true.

- user_dictionary: The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

      <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

  As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:

      東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

- nbest_cost/nbest_examples: Additional expert user parameters nbest_cost and nbest_examples can be used to include additional tokens that are most likely according to the statistical model. If both parameters are used, the largest number of both is applied.

  - nbest_cost: The nbest_cost parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the nbest_cost value of the best path.
  - nbest_examples: The nbest_examples parameter can be used to find a nbest_cost value based on examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that, in the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we'd like a cost that gives us 箱根 (Hakone) and 成田 (Narita).
Then create an analyzer as follows:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"kuromoji_user_dict": {
"type": "kuromoji_tokenizer",
"mode": "extended",
"discard_punctuation": "false",
"user_dictionary": "userdict_ja.txt"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "kuromoji_user_dict"
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "東京スカイツリー"
}
The above analyze request returns the following:
{
"tokens" : [ {
"token" : "東京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : "スカイツリー",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 1
} ]
}
kuromoji_baseform token filter

The kuromoji_baseform token filter replaces terms with their BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"kuromoji_baseform"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "飲み"
}
which responds with:
{
"tokens" : [ {
"token" : "飲む",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
} ]
}
kuromoji_part_of_speech token filter

The kuromoji_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

- stoptags: An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar.
For example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"my_posfilter"
]
}
},
"filter": {
"my_posfilter": {
"type": "kuromoji_part_of_speech",
"stoptags": [
"助詞-格助詞-一般",
"助詞-終助詞"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "寿司がおいしいね"
}
Which responds with:
{
"tokens" : [ {
"token" : "寿司",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : "おいしい",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 2
} ]
}
kuromoji_readingform token filter

The kuromoji_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:

- use_romaji: Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined kuromoji_readingform filter, use_romaji is set to true. The default when defining a custom kuromoji_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:
PUT kuromoji_sample
{
"settings": {
"index":{
"analysis":{
"analyzer" : {
"romaji_analyzer" : {
"tokenizer" : "kuromoji_tokenizer",
"filter" : ["romaji_readingform"]
},
"katakana_analyzer" : {
"tokenizer" : "kuromoji_tokenizer",
"filter" : ["katakana_readingform"]
}
},
"filter" : {
"romaji_readingform" : {
"type" : "kuromoji_readingform",
"use_romaji" : true
},
"katakana_readingform" : {
"type" : "kuromoji_readingform",
"use_romaji" : false
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "katakana_analyzer",
"text": "寿司" (1)
}
GET kuromoji_sample/_analyze
{
"analyzer": "romaji_analyzer",
"text": "寿司" (2)
}
(1) Returns スシ.
(2) Returns sushi.
kuromoji_stemmer token filter

The kuromoji_stemmer token filter normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

- minimum_length: Katakana words shorter than the minimum length are not stemmed (default is 4).
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"my_katakana_stemmer"
]
}
},
"filter": {
"my_katakana_stemmer": {
"type": "kuromoji_stemmer",
"minimum_length": 4
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "コピー" (1)
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "サーバー" (2)
}
(1) Returns コピー.
(2) Returns サーバ.
ja_stop token filter

The ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_with_ja_stop": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"ja_stop"
]
}
},
"filter": {
"ja_stop": {
"type": "ja_stop",
"stopwords": [
"_japanese_",
"ストップ"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "analyzer_with_ja_stop",
"text": "ストップは消える"
}
The above request returns:
{
"tokens" : [ {
"token" : "消える",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
} ]
}
kuromoji_number token filter

The kuromoji_number token filter normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters. For example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"kuromoji_number"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "一〇〇〇"
}
Which results in:
{
"tokens" : [ {
"token" : "1000",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
} ]
}
Korean (nori) Analysis Plugin
The Korean (nori) Analysis plugin integrates the Lucene nori analysis module into Elasticsearch. It uses the mecab-ko-dic dictionary to perform morphological analysis of Korean texts.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-nori
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-nori/analysis-nori-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-nori
The node must be stopped before removing the plugin.
nori analyzer

The nori analyzer consists of the following tokenizer and token filters:

- nori_tokenizer
- nori_part_of_speech token filter
- nori_readingform token filter
- {ref}/analysis-lowercase-tokenfilter.html[lowercase] token filter

It supports the decompound_mode and user_dictionary settings from nori_tokenizer and the stoptags setting from nori_part_of_speech.
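As a minimal sketch (assuming the analyzer type name nori), these settings can be passed when defining a custom analyzer based on it:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": ["E", "J"]
          }
        }
      }
    }
  }
}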
nori_tokenizer

The nori_tokenizer accepts the following settings:

- decompound_mode: The decompound mode determines how the tokenizer handles compound tokens. It can be set to:

  - none: No decomposition for compounds. Example output:

      가거도항
      가곡역

  - discard: Decomposes compounds and discards the original form (default). Example output:

      가곡역 => 가곡, 역

  - mixed: Decomposes compounds and keeps the original form. Example output:

      가곡역 => 가곡역, 가곡, 역

- user_dictionary: The Nori tokenizer uses the mecab-ko-dic dictionary by default. A user_dictionary with custom nouns (NNG) may be appended to the default dictionary. The dictionary should have the following format:

      <token> [<token 1> ... <token n>]

  The first token is mandatory and represents the custom noun that should be added in the dictionary. For compound nouns the custom segmentation can be provided after the first token ([<token 1> ... <token n>]). The segmentation of the custom compound nouns is controlled by the decompound_mode setting.

  As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ko.txt:

      c++ (1)
      C샤프
      세종
      세종시 세종 시 (2)

  (1) A simple noun
  (2) A compound noun (세종시) followed by its decomposition: 세종 and 시.

  Then create an analyzer as follows:

      PUT nori_sample
      {
        "settings": {
          "index": {
            "analysis": {
              "tokenizer": {
                "nori_user_dict": {
                  "type": "nori_tokenizer",
                  "decompound_mode": "mixed",
                  "user_dictionary": "userdict_ko.txt"
                }
              },
              "analyzer": {
                "my_analyzer": {
                  "type": "custom",
                  "tokenizer": "nori_user_dict"
                }
              }
            }
          }
        }
      }

      GET nori_sample/_analyze
      {
        "analyzer": "my_analyzer",
        "text": "세종시" (1)
      }

  (1) Sejong city

  The above analyze request returns the following:

      {
        "tokens" : [ {
          "token" : "세종시",
          "start_offset" : 0,
          "end_offset" : 3,
          "type" : "word",
          "position" : 0,
          "positionLength" : 2 (1)
        }, {
          "token" : "세종",
          "start_offset" : 0,
          "end_offset" : 2,
          "type" : "word",
          "position" : 0
        }, {
          "token" : "시",
          "start_offset" : 2,
          "end_offset" : 3,
          "type" : "word",
          "position" : 1
        }]
      }

  (1) This is a compound token that spans two positions (mixed mode).

- user_dictionary_rules: You can also inline the rules directly in the tokenizer definition using the user_dictionary_rules option:

      PUT nori_sample
      {
        "settings": {
          "index": {
            "analysis": {
              "tokenizer": {
                "nori_user_dict": {
                  "type": "nori_tokenizer",
                  "decompound_mode": "mixed",
                  "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
                }
              },
              "analyzer": {
                "my_analyzer": {
                  "type": "custom",
                  "tokenizer": "nori_user_dict"
                }
              }
            }
          }
        }
      }
The nori_tokenizer sets a number of additional attributes per token that are used by token filters to modify the stream.
You can view all these additional attributes with the following request:
GET _analyze
{
"tokenizer": "nori_tokenizer",
"text": "뿌리가 깊은 나무는", (1)
"attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
"explain": true
}
(1) A tree with deep roots
Which responds with:
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "nori_tokenizer",
"tokens": [
{
"token": "뿌리",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0,
"leftPOS": "NNG(General Noun)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "NNG(General Noun)"
},
{
"token": "가",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1,
"leftPOS": "J(Ending Particle)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "J(Ending Particle)"
},
{
"token": "깊",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2,
"leftPOS": "VA(Adjective)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "VA(Adjective)"
},
{
"token": "은",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3,
"leftPOS": "E(Verbal endings)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "E(Verbal endings)"
},
{
"token": "나무",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 4,
"leftPOS": "NNG(General Noun)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "NNG(General Noun)"
},
{
"token": "는",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 5,
"leftPOS": "J(Ending Particle)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "J(Ending Particle)"
}
]
},
"tokenfilters": []
}
}
nori_part_of_speech token filter

The nori_part_of_speech token filter removes tokens that match a set of part-of-speech tags. The list of supported tags and their meanings can be found here: Part of speech tags

It accepts the following setting:

- stoptags: An array of part-of-speech tags that should be removed. It defaults to:
"stoptags": [
"E",
"IC",
"J",
"MAG", "MAJ", "MM",
"SP", "SSC", "SSO", "SC", "SE",
"XPN", "XSA", "XSN", "XSV",
"UNA", "NA", "VSV"
]
For example:
PUT nori_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "nori_tokenizer",
"filter": [
"my_posfilter"
]
}
},
"filter": {
"my_posfilter": {
"type": "nori_part_of_speech",
"stoptags": [
"NR" (1)
]
}
}
}
}
}
}
GET nori_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "여섯 용이" (2)
}
(1) Korean numerals should be removed (NR)
(2) Six dragons
Which responds with:
{
"tokens" : [ {
"token" : "용",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
}, {
"token" : "이",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
} ]
}
nori_readingform token filter

The nori_readingform token filter rewrites tokens written in Hanja to their Hangul form.
PUT nori_sample
{
"settings": {
"index":{
"analysis":{
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "nori_tokenizer",
"filter" : ["nori_readingform"]
}
}
}
}
}
}
GET nori_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "鄕歌" (1)
}
(1) A token written in Hanja: Hyangga
Which responds with:
{
"tokens" : [ {
"token" : "향가", (1)
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}]
}
(1) The Hanja form is replaced by the Hangul translation.
Phonetic Analysis Plugin
The Phonetic Analysis plugin provides token filters which convert tokens to their phonetic representation using Soundex, Metaphone, and a variety of other algorithms.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-phonetic
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-phonetic/analysis-phonetic-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-phonetic
The node must be stopped before removing the plugin.
phonetic token filter

The phonetic token filter takes the following settings:

- encoder: Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.
- replace: Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.
PUT phonetic_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_metaphone"
]
}
},
"filter": {
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": false
}
}
}
}
}
}
GET phonetic_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "Joe Bloggs" (1)
}
(1) Returns: J, joe, BLKS, bloggs
Double metaphone settings

If the double_metaphone encoder is used, then this additional setting is supported:

- max_code_len: The maximum length of the emitted metaphone token. Defaults to 4.
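As a sketch, a double_metaphone filter with a longer code length could be configured like this (the index, analyzer, and filter names are only illustrative):

PUT phonetic_sample_dm
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "dm_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_double_metaphone"
            ]
          }
        },
        "filter": {
          "my_double_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "max_code_len": 6
          }
        }
      }
    }
  }
}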
Beider Morse settings

If the beider_morse encoder is used, then these additional settings are supported:

- rule_type: Whether matching should be exact or approx (default).
- name_type: Whether names are ashkenazi, sephardic, or generic (default).
- languageset: An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.
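As a sketch, a beider_morse filter could be configured like this (the index, analyzer, and filter names are only illustrative):

PUT phonetic_sample_bm
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "bm_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_beider_morse"
            ]
          }
        },
        "filter": {
          "my_beider_morse": {
            "type": "phonetic",
            "encoder": "beider_morse",
            "rule_type": "approx",
            "name_type": "generic",
            "languageset": ["english", "german"]
          }
        }
      }
    }
  }
}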
Smart Chinese Analysis Plugin
The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch.
It provides an analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-smartcn
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-smartcn/analysis-smartcn-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-smartcn
The node must be stopped before removing the plugin.
smartcn tokenizer and token filter

The plugin provides the smartcn analyzer and smartcn_tokenizer tokenizer, which are not configurable.

Note: The smartcn_word token filter and smartcn_sentence have been deprecated.
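As a usage sketch, the analyzer can be tried directly with the _analyze API (the sample text simply means "Hello world"):

GET _analyze
{
  "analyzer": "smartcn",
  "text": "你好世界"
}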
Stempel Polish Analysis Plugin
The Stempel Analysis plugin integrates Lucene's Stempel analysis module for Polish into Elasticsearch.
It provides high quality stemming for Polish, based on the Egothor project.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-stempel
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-stempel/analysis-stempel-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-stempel
The node must be stopped before removing the plugin.
stempel tokenizer and token filter

The plugin provides the polish analyzer and polish_stem token filter, which are not configurable.
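As a usage sketch, the analyzer can be tried with the _analyze API (the sample text is Polish):

GET _analyze
{
  "analyzer": "polish",
  "text": "miasta"
}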
Ukrainian Analysis Plugin
The Ukrainian Analysis plugin integrates Lucene's UkrainianMorfologikAnalyzer into Elasticsearch.
It provides stemming for Ukrainian using the Morfologik project.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-ukrainian
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/analysis-ukrainian/analysis-ukrainian-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-ukrainian
The node must be stopped before removing the plugin.
ukrainian analyzer

The plugin provides the ukrainian analyzer.
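As a usage sketch, the analyzer can be tried with the _analyze API (the sample text is Ukrainian):

GET _analyze
{
  "analyzer": "ukrainian",
  "text": "добрий день"
}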
Discovery Plugins
Discovery plugins extend Elasticsearch by adding new discovery mechanisms that can be used instead of {ref}/modules-discovery-zen.html[Zen Discovery].
Core discovery plugins
The core discovery plugins are:
- EC2 discovery: The EC2 discovery plugin uses the AWS API for unicast discovery.
- Azure Classic discovery: The Azure Classic discovery plugin uses the Azure Classic API for unicast discovery.
- GCE discovery: The Google Compute Engine discovery plugin uses the GCE API for unicast discovery.
- File-based discovery: The File-based discovery plugin allows providing the unicast hosts list through a dynamically updatable file.
Community contributed discovery plugins
A number of discovery plugins have been contributed by our community:
- eskka Discovery Plugin (by Shikhar Bhushan)
- Kubernetes Discovery Plugin (by Jimmi Dyson, fabric8)
EC2 Discovery Plugin
The EC2 discovery plugin uses the AWS API for unicast discovery.
If you are looking for a hosted solution of Elasticsearch on AWS, please visit http://www.elastic.co/cloud.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install discovery-ec2
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/discovery-ec2/discovery-ec2-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove discovery-ec2
The node must be stopped before removing the plugin.
Getting started with AWS
The plugin provides a hosts provider for zen discovery named ec2. This hosts provider finds other Elasticsearch instances in EC2 through AWS metadata. Authentication is done using IAM Role credentials by default. To enable the plugin, set the unicast host provider for Zen discovery to ec2:
discovery.zen.hosts_provider: ec2
Settings
EC2 host discovery supports a number of settings. Some settings are sensitive and must be stored in the {ref}/secure-settings.html[elasticsearch keystore]. For example, to use explicit AWS access keys:
bin/elasticsearch-keystore add discovery.ec2.access_key
bin/elasticsearch-keystore add discovery.ec2.secret_key
The following are the available discovery settings. All should be prefixed with discovery.ec2. Those that must be stored in the keystore are marked as Secure.
- access_key: An ec2 access key. The secret_key setting must also be specified. (Secure)
- secret_key: An ec2 secret key. The access_key setting must also be specified. (Secure)
- session_token: An ec2 session token. The access_key and secret_key settings must also be specified. (Secure)
- endpoint: The ec2 service endpoint to connect to. See http://docs.aws.amazon.com/general/latest/gr/rande.html#ec2_region. This defaults to ec2.us-east-1.amazonaws.com.
- protocol: The protocol to use to connect to ec2. Valid values are either http or https. Defaults to https.
- proxy.host: The host name of a proxy to connect to ec2 through.
- proxy.port: The port of a proxy to connect to ec2 through.
- proxy.username: The username to connect to the proxy.host with. (Secure)
- proxy.password: The password to connect to the proxy.host with. (Secure)
- read_timeout: The socket timeout for connecting to ec2. The value should specify the unit. For example, a value of 5s specifies a 5 second timeout. The default value is 50 seconds.
- groups: Either a comma separated list or array based list of (security) groups. Only instances with the provided security groups will be used in the cluster discovery. (NOTE: You could provide either group NAME or group ID.)
- host_type: The type of host to use to communicate with other instances. Can be one of private_ip, public_ip, private_dns, public_dns or tag:TAGNAME, where TAGNAME refers to a name of a tag configured for all EC2 instances. Instances which don't have this tag set will be ignored by the discovery process. For example, if you defined a tag my-elasticsearch-host in ec2 and set it to myhostname1.mydomain.com, then setting host_type: tag:my-elasticsearch-host will tell the Discovery EC2 plugin to read the host name from the my-elasticsearch-host tag. In this case, it will be resolved to myhostname1.mydomain.com. Read more about EC2 Tags. Defaults to private_ip.
- availability_zones: Either a comma separated list or array based list of availability zones. Only instances within the provided availability zones will be used in the cluster discovery.
- any_group: If set to false, will require all security groups to be present for the instance to be used for the discovery. Defaults to true.
- node_cache_time: How long the list of hosts is cached to prevent further requests to the AWS API. Defaults to 10s.
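As a sketch of how several of these settings combine in elasticsearch.yml (the group name, zones, and values below are only illustrative):

discovery.zen.hosts_provider: ec2
discovery.ec2.groups: my-security-group
discovery.ec2.host_type: private_ip
discovery.ec2.availability_zones: us-east-1a,us-east-1b
discovery.ec2.node_cache_time: 30s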
All secure settings of this plugin are {ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you reload the settings, an aws sdk client with the latest settings from the keystore will be used.
Important: Binding the network host

It's important to define network.host, as by default it is bound to localhost.

You can use {ref}/modules-network.html[core network host settings] or ec2 specific host settings:
EC2 Network Host
When the discovery-ec2
plugin is installed, the following are also allowed
as valid network host settings:
EC2 Host Value | Description
---|---
_ec2:privateIpv4_ | The private IP address (ipv4) of the machine.
_ec2:privateDns_ | The private host of the machine.
_ec2:publicIpv4_ | The public IP address (ipv4) of the machine.
_ec2:publicDns_ | The public host of the machine.
_ec2:privateIp_ | equivalent to _ec2:privateIpv4_.
_ec2:publicIp_ | equivalent to _ec2:publicIpv4_.
_ec2_ | equivalent to _ec2:privateIpv4_.
Recommended EC2 Permissions
EC2 discovery requires making a call to the EC2 service. You'll want to set up an IAM policy to allow this. You can create a custom policy via the IAM Management Console. It should look similar to this.
{
"Statement": [
{
"Action": [
"ec2:DescribeInstances"
],
"Effect": "Allow",
"Resource": [
"*"
]
}
],
"Version": "2012-10-17"
}
Filtering by Tags
The ec2 discovery can also filter machines to include in the cluster based on tags (and not just groups). The settings
to use include the discovery.ec2.tag.
prefix. For example, if you defined a tag stage
in EC2 and set it to dev
,
setting discovery.ec2.tag.stage
to dev
will only filter instances with a tag key set to stage
, and a value
of dev
. Adding multiple discovery.ec2.tag
settings will require all of those tags to be set for the instance to be included.
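For example, a sketch of the corresponding elasticsearch.yml entries (the tag names and values are only illustrative):

discovery.zen.hosts_provider: ec2
discovery.ec2.tag.stage: dev
discovery.ec2.tag.role: elasticsearch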
One practical use for tag filtering is when an ec2 cluster contains many nodes that are not running Elasticsearch. In
this case (particularly with high discovery.zen.ping_timeout
values) there is a risk that a new node’s discovery phase
will end before it has found the cluster (which will result in it declaring itself master of a new cluster with the same
name - highly undesirable). Tagging Elasticsearch ec2 nodes and then filtering by that tag will resolve this issue.
Automatic Node Attributes
Though not dependent on actually using ec2
as discovery (but still requires the discovery-ec2
plugin installed), the
plugin can automatically add node attributes relating to ec2. In the future this may support other attributes, but this will
currently only add an aws_availability_zone
node attribute, which is the availability zone of the current node. Attributes
can be used to isolate primary and replica shards across availability zones by using the
{ref}/allocation-awareness.html[Allocation Awareness] feature.
In order to enable it, set cloud.node.auto_attributes
to true
in the settings. For example:
cloud.node.auto_attributes: true
cluster.routing.allocation.awareness.attributes: aws_availability_zone
Best Practices in AWS
Collection of best practices and other information around running Elasticsearch on AWS.
Instance/Disk
When selecting disk please be aware of the following order of preference:
- EFS - Avoid, as the sacrifices made to offer durability, shared storage, and grow/shrink come at a performance cost; such file systems have been known to cause corruption of indices, and due to Elasticsearch being distributed and having built-in replication, the benefits that EFS offers are not needed.
- EBS - Works well if running a small cluster (1-2 nodes) and you cannot easily tolerate the loss of all storage backing a node, or if running indices with no replicas. If EBS is used, then leverage provisioned IOPS to ensure performance.
- Instance Store - When running clusters of larger size and with replicas, the ephemeral nature of Instance Store is ideal since Elasticsearch can tolerate the loss of shards. With Instance Store one gets the performance benefit of having disk physically attached to the host running the instance and also the cost benefit of avoiding paying extra for EBS.

Prefer Amazon Linux AMIs; since Elasticsearch runs on the JVM, OS dependencies are very minimal and one can benefit from the lightweight nature, support, and performance tweaks specific to EC2 that the Amazon Linux AMIs offer.
Networking
- Network throttling takes place on smaller instance types in both the form of bandwidth and number of connections. Therefore, if a large number of connections is needed and networking is becoming a bottleneck, avoid instance types with networking labeled as Moderate or Low.
- Multicast is not supported, even when in a VPC; the aws cloud plugin joins by performing a security group lookup.
- When running in multiple availability zones, be sure to leverage {ref}/allocation-awareness.html[shard allocation awareness] so that not all copies of shard data reside in the same availability zone.
- Do not span a cluster across regions. If necessary, use cross cluster search.
Misc
- If you have split your nodes into roles, consider tagging the EC2 instances by role to make it easier to filter and view your EC2 instances in the AWS console.
- Consider enabling termination protection for all of your instances to avoid accidentally terminating a node in the cluster and causing a potentially disruptive reallocation.
Azure Classic Discovery Plugin
The Azure Classic Discovery plugin uses the Azure Classic API for unicast discovery.
Deprecated in 5.0.0. Use the coming Azure ARM Discovery plugin instead.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install discovery-azure-classic
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/discovery-azure-classic/discovery-azure-classic-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove discovery-azure-classic
The node must be stopped before removing the plugin.
Azure Virtual Machine Discovery
Azure VM discovery allows you to use the Azure APIs to perform automatic discovery (similar to multicast in non hostile multicast environments). Here is a simple sample configuration:
cloud:
azure:
management:
subscription.id: XXX-XXX-XXX-XXX
cloud.service.name: es-demo-app
keystore:
path: /path/to/azurekeystore.pkcs12
password: WHATEVER
type: pkcs12
discovery:
zen.hosts_provider: azure
Important: Binding the network host

The keystore file must be placed in a directory accessible by Elasticsearch, such as the config directory.

It's important to define network.host, as by default it is bound to localhost.

You can use {ref}/modules-network.html[core network host settings].
How to start (short story)
- Create Azure instances
- Install Elasticsearch
- Install the Azure plugin
- Modify the elasticsearch.yml file
- Start Elasticsearch
Azure credential API settings
The following is a list of settings that can further control the credential API:

Setting | Value
---|---
cloud.azure.management.keystore.path | /path/to/keystore
cloud.azure.management.keystore.type | The keystore type, for example pkcs12 as used in the sample configuration above.
cloud.azure.management.keystore.password | your_password for the keystore
cloud.azure.management.subscription.id | your_azure_subscription_id
cloud.azure.management.cloud.service.name | your_azure_cloud_service_name. This is the cloud service name/DNS but without the cloudapp.net part.
Advanced settings

The following is a list of settings that can further control the discovery:

- discovery.azure.host.type: Either public_ip or private_ip (default). Azure discovery will use the one you set to ping other nodes.
- discovery.azure.endpoint.name: When using public_ip, this setting is used to identify the endpoint name used to forward requests to Elasticsearch (aka transport port name). Defaults to elasticsearch. In the Azure management console, you could define an endpoint elasticsearch forwarding, for example, requests on the public IP on port 8100 to the virtual machine on port 9300.
- discovery.azure.deployment.name: Deployment name, if any. Defaults to the value set with cloud.azure.management.cloud.service.name.
- discovery.azure.deployment.slot: Either staging or production (default).
For example:
discovery:
type: azure
azure:
host:
type: private_ip
endpoint:
name: elasticsearch
deployment:
name: your_azure_cloud_service_name
slot: production
Setup process for Azure Discovery
Here we describe one strategy, which is to hide our Elasticsearch cluster from the outside world.
With this strategy, only VMs behind the same virtual port can talk to each other. That means that with this mode, you can use Elasticsearch unicast discovery to build a cluster, using the Azure API to retrieve information about your nodes.
Prerequisites
Before starting, you need to have:
-
OpenSSL that isn’t from MacPorts, specifically
OpenSSL 1.0.1f 6 Jan 2014
doesn’t seem to create a valid keypair for ssh. FWIW,OpenSSL 1.0.1c 10 May 2012
on Ubuntu 14.04 LTS is known to work. -
SSH keys and certificate
You should follow this guide to learn how to create or use existing SSH keys. If you have already done this, you can skip the following.
Here is a description on how to generate SSH keys using
openssl
:# You may want to use another dir than /tmp cd /tmp openssl req -x509 -nodes -days 365 -newkey rsa:2048 -keyout azure-private.key -out azure-certificate.pem chmod 600 azure-private.key azure-certificate.pem openssl x509 -outform der -in azure-certificate.pem -out azure-certificate.cer
Generate a keystore which will be used by the plugin to authenticate all Azure API calls with a certificate.
# Generate a keystore (azurekeystore.pkcs12) # Transform private key to PEM format openssl pkcs8 -topk8 -nocrypt -in azure-private.key -inform PEM -out azure-pk.pem -outform PEM # Transform certificate to PEM format openssl x509 -inform der -in azure-certificate.cer -out azure-cert.pem cat azure-cert.pem azure-pk.pem > azure.pem.txt # You MUST enter a password! openssl pkcs12 -export -in azure.pem.txt -out azurekeystore.pkcs12 -name azure -noiter -nomaciter
Upload the
azure-certificate.cer
file both in the Elasticsearch Cloud Service (underManage Certificates
), and underSettings → Manage Certificates
.
Important: When prompted for a password, you need to enter a non-empty one. See this guide for more details about how to create keys for Azure.
Once done, you need to upload your certificate in Azure:
-
Go to the management console.
-
Sign in using your account.
-
Click on
Portal
. -
Go to Settings (bottom of the left list)
-
On the bottom bar, click on
Upload
and upload yourazure-certificate.cer
file.
You may want to use Windows Azure Command-Line Tool:
-
Install NodeJS, for example using homebrew on MacOS X:
brew install node
-
Install Azure tools
sudo npm install azure-cli -g
-
Download and import your azure settings:
# This will open a browser and will download a .publishsettings file azure account download # Import this file (we have downloaded it to /tmp) # Note, it will create needed files in ~/.azure. You can remove azure.publishsettings when done. azure account import /tmp/azure.publishsettings
Creating your first instance
You need to have a storage account available. Check Azure Blob Storage documentation for more information.
You will need to choose the operating system you want to run on. To get a list of official available images, run:
azure vm image list
Let’s say we are going to deploy an Ubuntu image on an extra small instance in West Europe:
Azure cluster name | azure-elasticsearch-cluster |
Image | b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-13_10-amd64-server-20130808-alpha3-en-us-30GB |
VM Name | myesnode1 |
VM Size | extrasmall |
Location | West Europe |
Login | elasticsearch |
Password | password1234!! |
Using command line:
azure vm create azure-elasticsearch-cluster \
b39f27a8b8c64d52b05eac6a62ebad85__Ubuntu-13_10-amd64-server-20130808-alpha3-en-us-30GB \
--vm-name myesnode1 \
--location "West Europe" \
--vm-size extrasmall \
--ssh 22 \
--ssh-cert /tmp/azure-certificate.pem \
elasticsearch password1234\!\!
You should see something like:
info: Executing command vm create
+ Looking up image
+ Looking up cloud service
+ Creating cloud service
+ Retrieving storage accounts
+ Configuring certificate
+ Creating VM
info: vm create command OK
Now, your first instance is started.
Tip
|
Working with SSH
You need to give the private key and username each time you log on to your instance, but you can also define these once in your ~/.ssh/config file.
|
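For example, a sketch using the key path, user name, and DNS name from earlier in this guide:
# Log on by passing the key and user explicitly
ssh -i /tmp/azure-private.key elasticsearch@azure-elasticsearch-cluster.cloudapp.net
# Or define it once in ~/.ssh/config (one possible entry)
Host *.cloudapp.net
  User elasticsearch
  IdentityFile /tmp/azure-private.key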
Next, you need to install Elasticsearch on your new instance. First, copy your keystore to the instance, then connect to the instance using SSH:
scp /tmp/azurekeystore.pkcs12 azure-elasticsearch-cluster.cloudapp.net:/home/elasticsearch
ssh azure-elasticsearch-cluster.cloudapp.net
Once connected, install Elasticsearch:
# Install Latest Java version
# Read http://www.webupd8.org/2012/09/install-oracle-java-8-in-ubuntu-via-ppa.html for details
sudo add-apt-repository ppa:webupd8team/java
sudo apt-get update
sudo apt-get install oracle-java8-installer
# If you want to install OpenJDK instead
# sudo apt-get update
# sudo apt-get install openjdk-8-jre-headless
# Download Elasticsearch
curl -s https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-{version}.deb -o elasticsearch-{version}.deb
# Prepare Elasticsearch installation
sudo dpkg -i elasticsearch-{version}.deb
Check that Elasticsearch is running:
GET /
This command should give you a JSON result:
{
"name" : "Cp8oag6",
"cluster_name" : "elasticsearch",
"cluster_uuid" : "AT69_T_DTp-1qgIJlatQqA",
"version" : {
"number" : "{version}",
"build_flavor" : "default",
"build_type" : "zip",
"build_hash" : "f27399d",
"build_date" : "2016-03-30T09:51:41.449Z",
"build_snapshot" : false,
"lucene_version" : "7.7.3",
"minimum_wire_compatibility_version" : "1.2.3",
"minimum_index_compatibility_version" : "1.2.3"
},
"tagline" : "You Know, for Search"
}
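The same check can be run from a shell on the instance itself with curl, assuming Elasticsearch listens on the default HTTP port 9200:
curl -s http://localhost:9200/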
Install Elasticsearch cloud azure plugin
# Stop Elasticsearch
sudo service elasticsearch stop
# Install the plugin
sudo /usr/share/elasticsearch/bin/elasticsearch-plugin install discovery-azure-classic
# Configure it
sudo vi /etc/elasticsearch/elasticsearch.yml
And add the following lines:
# If you don't remember your account id, you may get it with `azure account list`
cloud:
azure:
management:
subscription.id: your_azure_subscription_id
cloud.service.name: your_azure_cloud_service_name
keystore:
path: /home/elasticsearch/azurekeystore.pkcs12
password: your_password_for_keystore
discovery:
type: azure
# Recommended (warning: non durable disk)
# path.data: /mnt/resource/elasticsearch/data
Restart Elasticsearch:
sudo service elasticsearch start
If anything goes wrong, check your logs in /var/log/elasticsearch.
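For example, to follow the main log file (the path shown assumes the default Debian package layout):
tail -f /var/log/elasticsearch/elasticsearch.log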
Scaling Out!
First you need to create an image of your previous machine. Disconnect from your machine and run the following commands locally:
# Shutdown the instance
azure vm shutdown myesnode1
# Create an image from this instance (it could take some minutes)
azure vm capture myesnode1 esnode-image --delete
# Note that the previous instance has been deleted (mandatory)
# So you need to create it again and BTW create other instances.
azure vm create azure-elasticsearch-cluster \
esnode-image \
--vm-name myesnode1 \
--location "West Europe" \
--vm-size extrasmall \
--ssh 22 \
--ssh-cert /tmp/azure-certificate.pem \
elasticsearch password1234\!\!
Tip
|
It could happen that Azure changes the endpoint's public IP address, and DNS propagation can take a few minutes before you can connect again by name. If needed, you can get the IP address from Azure.
|
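For example, with the classic Azure CLI used elsewhere in this guide, the VM details (including its IP address) can usually be listed with a command along these lines:
azure vm show myesnode1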
Let’s start more instances!
for x in $(seq 2 10)
do
echo "Launching azure instance #$x..."
azure vm create azure-elasticsearch-cluster \
esnode-image \
--vm-name myesnode$x \
--vm-size extrasmall \
--ssh $((21 + $x)) \
--ssh-cert /tmp/azure-certificate.pem \
--connect \
elasticsearch password1234\!\!
done
If you want to remove your running instances:
azure vm delete myesnode1
GCE Discovery Plugin
The Google Compute Engine Discovery plugin uses the GCE API for unicast discovery.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install discovery-gce
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/discovery-gce/discovery-gce-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove discovery-gce
The node must be stopped before removing the plugin.
GCE Virtual Machine Discovery
Google Compute Engine VM discovery uses the Google APIs to perform automatic discovery, similar to multicast discovery in environments where multicast is supported. Here is a simple sample configuration:
cloud:
gce:
project_id: <your-google-project-id>
zone: <your-zone>
discovery:
zen.hosts_provider: gce
The following gce settings (prefixed with cloud.gce
) are supported (a combined example follows the list):
project_id
-
Your Google project id. By default the project id will be derived from the instance metadata.
Note: Deriving the project id from system properties or environment variables (`GOOGLE_CLOUD_PROJECT` or `GCLOUD_PROJECT`) is not supported.
zone
-
helps to retrieve instances running in a given zone. It should be one of the GCE supported zones. By default the zone will be derived from the instance metadata. See also Using GCE zones.
retry
-
If set to
true
, client will use ExponentialBackOff policy to retry the failed http request. Defaults totrue
. max_wait
-
The maximum elapsed time since the client started retrying. If the elapsed time goes past
max_wait
, the client stops retrying. A negative value means that it will wait indefinitely. Defaults to 0s
(retry indefinitely). refresh_interval
-
How long the list of hosts is cached to prevent further requests to the GCE API.
0s
disables caching. A negative value will cause infinite caching. Defaults to0s
.
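Combined in elasticsearch.yml, these settings might look like the following sketch (the project, zone, and timing values are placeholders):
cloud:
    gce:
        project_id: es-cloud
        zone: europe-west1-a
        retry: true
        max_wait: 30s
        refresh_interval: 60s
discovery:
    zen.hosts_provider: gce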
Important
|
Binding the network host
It’s important to define network.host, as by default it is bound to localhost. You can use {ref}/modules-network.html[core network host settings] or gce specific host settings: |
GCE Network Host
When the discovery-gce
plugin is installed, the following are also allowed
as valid network host settings:
GCE Host Value | Description |
---|---|
_gce:privateIp:X_ | The private IP address of the machine for a given network interface. |
_gce:hostname_ | The hostname of the machine. |
_gce_ | Same as _gce:privateIp:0_ (recommended). |
Examples:
# get the IP address from network interface 1
network.host: _gce:privateIp:1_
# Using GCE internal hostname
network.host: _gce:hostname_
# shortcut for _gce:privateIp:0_ (recommended)
network.host: _gce_
How to start (short story)
-
Create Google Compute Engine instance (with compute rw permissions)
-
Install Elasticsearch
-
Install Google Compute Engine Cloud plugin
-
Modify elasticsearch.yml file
-
Start Elasticsearch
Setting up GCE Discovery
Prerequisites
Before starting, you need:
-
Your project ID, e.g.
es-cloud
. Get it from Google API Console. -
To install Google Cloud SDK
If you have not set it yet, you can define the default project you will work on:
gcloud config set project es-cloud
Login to Google Cloud
If you haven’t already, log in to Google Cloud:
gcloud auth login
This will open your browser. You will be asked to sign-in to a Google account and authorize access to the Google Cloud SDK.
Creating your first instance
gcloud compute instances create myesnode1 \
--zone <your-zone> \
--scopes compute-rw
When done, a report like this one should appear:
Created [https://www.googleapis.com/compute/v1/projects/es-cloud-1070/zones/us-central1-f/instances/myesnode1].
NAME ZONE MACHINE_TYPE PREEMPTIBLE INTERNAL_IP EXTERNAL_IP STATUS
myesnode1 us-central1-f n1-standard-1 10.240.133.54 104.197.94.25 RUNNING
You can now connect to your instance:
# Connect using google cloud SDK
gcloud compute ssh myesnode1 --zone europe-west1-a
# Or using SSH with external IP address
ssh -i ~/.ssh/google_compute_engine 192.158.29.199
Important
|
Service Account Permissions
It’s important when creating an instance that the correct permissions are set. At a minimum, you must ensure the instance has read/write access to the Compute Engine API (the compute-rw scope used in the command above).
Failing to set this will result in unauthorized messages when starting Elasticsearch. See Machine Permissions. |
Once connected, install Elasticsearch:
sudo apt-get update
# Download Elasticsearch
wget https://download.elasticsearch.org/elasticsearch/elasticsearch/elasticsearch-2.0.0.deb
# Prepare Java installation (Oracle)
sudo echo "deb http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | sudo tee /etc/apt/sources.list.d/webupd8team-java.list
sudo echo "deb-src http://ppa.launchpad.net/webupd8team/java/ubuntu trusty main" | sudo tee -a /etc/apt/sources.list.d/webupd8team-java.list
sudo apt-key adv --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys EEA14886
sudo apt-get update
sudo apt-get install oracle-java8-installer
# Prepare Java installation (or OpenJDK)
# sudo apt-get install java8-runtime-headless
# Prepare Elasticsearch installation
sudo dpkg -i elasticsearch-2.0.0.deb
Install Elasticsearch discovery gce plugin
Install the plugin:
# Use Plugin Manager to install it
sudo bin/elasticsearch-plugin install discovery-gce
Open the elasticsearch.yml
file:
sudo vi /etc/elasticsearch/elasticsearch.yml
And add the following lines:
cloud:
gce:
project_id: es-cloud
zone: europe-west1-a
discovery:
zen.hosts_provider: gce
Start Elasticsearch:
sudo /etc/init.d/elasticsearch start
If anything goes wrong, you should check logs:
tail -f /var/log/elasticsearch/elasticsearch.log
If needed, you can change log level to trace
by opening log4j2.properties
:
sudo vi /etc/elasticsearch/log4j2.properties
and adding the following lines:
# discovery
logger.discovery_gce.name = discovery.gce
logger.discovery_gce.level = trace
Cloning your existing machine
In order to build a cluster on many nodes, you can clone your configured instance to new nodes. You won’t have to reinstall everything!
First create an image of your running instance and upload it to Google Cloud Storage:
# Create an image of your current instance
sudo /usr/bin/gcimagebundle -d /dev/sda -o /tmp/
# An image has been created in `/tmp` directory:
ls /tmp
e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz
# Upload your image to Google Cloud Storage:
# Create a bucket to hold your image, let's say `esimage`:
gsutil mb gs://esimage
# Copy your image to this bucket:
gsutil cp /tmp/e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz gs://esimage
# Then add your image to images collection:
gcloud compute images create elasticsearch-2-0-0 --source-uri gs://esimage/e4686d7f5bf904a924ae0cfeb58d0827c6d5b966.image.tar.gz
# If the previous command did not work for you, logout from your instance
# and launch the same command from your local machine.
Start new instances
Now that you have an image, you can create as many instances as you need:
# Just change node name (here myesnode2)
gcloud compute instances create myesnode2 --image elasticsearch-2-0-0 --zone europe-west1-a
# If you want to provide all details directly, you can use:
gcloud compute instances create myesnode2 --image=elasticsearch-2-0-0 \
--zone europe-west1-a --machine-type f1-micro --scopes=compute-rw
Remove an instance (aka shut it down)
You can use Google Cloud Console or CLI to manage your instances:
# Stopping and removing instances
gcloud compute instances delete myesnode1 myesnode2 \
--zone=europe-west1-a
# Consider removing disk as well if you don't need them anymore
gcloud compute disks delete boot-myesnode1 boot-myesnode2 \
--zone=europe-west1-a
Using GCE zones
cloud.gce.zone
helps to retrieve instances running in a given zone. It should be one of the
GCE supported zones.
The GCE discovery can support multi zones although you need to be aware of network latency between zones.
To enable discovery across more than one zone, just add your zone list to cloud.gce.zone
setting:
cloud:
gce:
project_id: <your-google-project-id>
zone: ["<your-zone1>", "<your-zone2>"]
discovery:
zen.hosts_provider: gce
Filtering by tags
The GCE discovery can also filter machines to include in the cluster based on tags using discovery.gce.tags
settings.
For example, setting discovery.gce.tags
to dev
will only filter instances having a tag set to dev
. Several tags
set will require all of those tags to be set for the instance to be included.
One practical use for tag filtering is when a GCE cluster contains many nodes that are not running
Elasticsearch. In this case (particularly with high discovery.zen.ping_timeout
values) there is a risk that a new
node’s discovery phase will end before it has found the cluster (which will result in it declaring itself master of a
new cluster with the same name - highly undesirable). Adding a tag to Elasticsearch GCE nodes and then filtering by that
tag will resolve this issue.
Add your tag when building the new instance:
gcloud compute instances create myesnode1 --project=es-cloud \
--scopes=compute-rw \
--tags=elasticsearch,dev
Then, define it in elasticsearch.yml
:
cloud:
gce:
project_id: es-cloud
zone: europe-west1-a
discovery:
zen.hosts_provider: gce
gce:
tags: elasticsearch, dev
Changing default transport port
By default, the Elasticsearch GCE plugin assumes that Elasticsearch runs on the default transport port, 9300.
You can specify a different port using the Google Compute Engine metadata key es_port
:
When creating instance
Add --metadata es_port=9301
option:
# when creating first instance
gcloud compute instances create myesnode1 \
--scopes=compute-rw,storage-full \
--metadata es_port=9301
# when creating an instance from an image
gcloud compute instances create myesnode2 --image=elasticsearch-1-0-0-RC1 \
--zone europe-west1-a --machine-type f1-micro --scopes=compute-rw \
--metadata es_port=9301
On a running instance
gcloud compute instances add-metadata myesnode1 \
--zone europe-west1-a \
--metadata es_port=9301
GCE Tips
Store project id locally
If you don’t want to repeat the project id each time, you can save it in the local gcloud config
gcloud config set project es-cloud
Machine Permissions
If you have created a machine without the correct permissions, you will see 403 unauthorized
error messages. To change machine permission on an existing instance, first stop the instance then Edit. Scroll down to Access Scopes
to change permission. The other way to alter these permissions is to delete the instance (NOT THE DISK). Then create another with the correct permissions.
- Creating machines with gcloud
-
Ensure the following flags are set:
--scopes=compute-rw
- Creating with console (web)
-
When creating an instance using the web portal, click Show advanced options.
At the bottom of the page, under
PROJECT ACCESS
, choose>> Compute >> Read Write
. - Creating with knife google
-
Set the service account scopes when creating the machine:
knife google server create www1 \ -m n1-standard-1 \ -I debian-8 \ -Z us-central1-a \ -i ~/.ssh/id_rsa \ -x jdoe \ --gce-service-account-scopes https://www.googleapis.com/auth/compute.full_control
Or, you may use the alias:
--gce-service-account-scopes compute-rw
Testing GCE
Integration tests in this plugin require a working GCE configuration and are therefore disabled by default. To enable tests, prepare a config file elasticsearch.yml with the following content:
cloud:
gce:
project_id: es-cloud
zone: europe-west1-a
discovery:
zen.hosts_provider: gce
Replace project_id
and zone
with your settings.
To run the tests:
mvn -Dtests.gce=true -Dtests.config=/path/to/config/file/elasticsearch.yml clean test
File-Based Discovery Plugin
The functionality provided by the discovery-file
plugin is now available in
Elasticsearch without requiring a plugin. This plugin still exists to ensure
backwards compatibility, but it will be removed in a future version.
On installation, this plugin creates a file at
$ES_PATH_CONF/discovery-file/unicast_hosts.txt
containing comments that
describe how to use it. It is preferable not to install this plugin and instead
to create this file, and its containing directory, using standard tools.
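As a sketch, such a file simply lists one transport address per line, as a host name or IP address optionally followed by a port, with # starting a comment (the addresses below are placeholders):
# $ES_PATH_CONF/discovery-file/unicast_hosts.txt
10.10.10.5
10.10.10.6:9305
seed-node.example.internal:9301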
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install discovery-file
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/discovery-file/discovery-file-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove discovery-file
The node must be stopped before removing the plugin.
Ingest Plugins
The ingest plugins extend Elasticsearch by providing additional ingest node capabilities.
Core Ingest Plugins
The core ingest plugins are:
- Ingest Attachment Processor Plugin
-
The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
- Ingest
geoip
Processor Plugin -
The
geoip
processor adds information about the geographical location of IP addresses, based on data from the Maxmind databases. This processor adds this information by default under thegeoip
field. Thegeoip
processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See {ref}/geoip-processor.html[GeoIP processor] for more details. - Ingest
user_agent
Processor Plugin -
A processor that extracts details from the User-Agent header value. The
user_agent
processor is no longer distributed as a plugin, but is now a module distributed by default with Elasticsearch. See {ref}/user-agent-processor.html[User Agent processor] for more details.
Community contributed ingest plugins
The following plugin has been contributed by our community:
-
Ingest CSV Processor Plugin (by Jun Ohtani)
Ingest Attachment Processor Plugin
The ingest attachment plugin lets Elasticsearch extract file attachments in common formats (such as PPT, XLS, and PDF) by using the Apache text extraction library Tika.
You can use the ingest attachment plugin as a replacement for the mapper attachment plugin.
The source field must be a base64 encoded binary. If you do not want to incur the overhead of converting back and forth between base64, you can use the CBOR format instead of JSON and specify the field as a bytes array instead of a string representation. The processor will then skip the base64 decoding.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install ingest-attachment
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/ingest-attachment/ingest-attachment-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove ingest-attachment
The node must be stopped before removing the plugin.
Using the Attachment Processor in a Pipeline
Name | Required | Default | Description |
---|---|---|---|
field | yes | - | The field to get the base64 encoded field from |
target_field | no | attachment | The field that will hold the attachment information |
indexed_chars | no | 100000 | The number of chars being used for extraction to prevent huge fields. Use -1 for no limit. |
indexed_chars_field | no | null | Field name from which you can overwrite the number of chars being used for extraction. See indexed_chars. |
properties | no | all properties | Array of properties to select to be stored. Can be content, title, name, author, keywords, date, content_type, content_length, language |
ignore_missing | no | false | If true and field does not exist, the processor quietly exits without modifying the document |
For example, this:
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data"
}
}
]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
Returns this:
{
"found": true,
"_index": "my_index",
"_type": "_doc",
"_id": "my_id",
"_version": 1,
"_seq_no": 22,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem ipsum dolor sit amet",
"content_length": 28
}
}
}
To specify only some fields to be extracted:
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"properties": [ "content", "title" ]
}
}
]
}
Note
|
Extracting contents from binary data is a resource intensive operation and consumes a lot of resources. It is highly recommended to run pipelines using this processor in a dedicated ingest node. |
Limit the number of extracted chars
To prevent extracting too many chars and overload the node memory, the number of chars being used for extraction
is limited by default to 100000
. You can change this value by setting indexed_chars
. Use -1
for no limit but
ensure when setting this that your node will have enough HEAP to extract the content of very big documents.
You can also define this limit per document by extracting from a given field the limit to set. If the document
has that field, it will overwrite the indexed_chars
setting. To set this field, define the indexed_chars_field
setting.
For example:
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size"
}
}
]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0="
}
GET my_index/_doc/my_id
Returns this:
{
"found": true,
"_index": "my_index",
"_type": "_doc",
"_id": "my_id",
"_version": 1,
"_seq_no": 35,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"attachment": {
"content_type": "application/rtf",
"language": "sl",
"content": "Lorem ipsum",
"content_length": 11
}
}
}
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information",
"processors" : [
{
"attachment" : {
"field" : "data",
"indexed_chars" : 11,
"indexed_chars_field" : "max_size"
}
}
]
}
PUT my_index/_doc/my_id_2?pipeline=attachment
{
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"max_size": 5
}
GET my_index/_doc/my_id_2
Returns this:
{
"found": true,
"_index": "my_index",
"_type": "_doc",
"_id": "my_id_2",
"_version": 1,
"_seq_no": 40,
"_primary_term": 1,
"_source": {
"data": "e1xydGYxXGFuc2kNCkxvcmVtIGlwc3VtIGRvbG9yIHNpdCBhbWV0DQpccGFyIH0=",
"max_size": 5,
"attachment": {
"content_type": "application/rtf",
"language": "ro",
"content": "Lorem",
"content_length": 5
}
}
}
Using the Attachment Processor with arrays
To use the attachment processor within an array of attachments the {ref}/foreach-processor.html[foreach processor] is required. This enables the attachment processor to be run on the individual elements of the array.
For example, given the following source:
{
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK"
}
]
}
In this case, we want to process the data field in each element
of the attachments field and insert
the properties into the document so the following foreach
processor is used:
PUT _ingest/pipeline/attachment
{
"description" : "Extract attachment information from arrays",
"processors" : [
{
"foreach": {
"field": "attachments",
"processor": {
"attachment": {
"target_field": "_ingest._value.attachment",
"field": "_ingest._value.data"
}
}
}
}
]
}
PUT my_index/_doc/my_id?pipeline=attachment
{
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo="
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK"
}
]
}
GET my_index/_doc/my_id
Returns this:
{
"_index" : "my_index",
"_type" : "_doc",
"_id" : "my_id",
"_version" : 1,
"_seq_no" : 50,
"_primary_term" : 1,
"found" : true,
"_source" : {
"attachments" : [
{
"filename" : "ipsum.txt",
"data" : "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
"content" : "this is\njust some text",
"content_length" : 24
}
},
{
"filename" : "test.txt",
"data" : "VGhpcyBpcyBhIHRlc3QK",
"attachment" : {
"content_type" : "text/plain; charset=ISO-8859-1",
"language" : "en",
"content" : "This is a test",
"content_length" : 16
}
}
]
}
}
Note that the target_field needs to be set, otherwise the default value is used, which is a top level field attachment. The properties on this top level field will contain the value of the first attachment only. However, by setting target_field to a value under _ingest._value, the properties are correctly associated with each attachment.
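A pipeline like this can also be tried without indexing any documents by using the ingest simulate API; a minimal sketch (the sample document below is illustrative):
POST _ingest/pipeline/attachment/_simulate
{
  "docs": [
    {
      "_source": {
        "attachments": [
          { "filename": "ipsum.txt", "data": "dGhpcyBpcwpqdXN0IHNvbWUgdGV4dAo=" }
        ]
      }
    }
  ]
}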
Ingest geoip
Processor Plugin
The geoip
processor is no longer distributed as a plugin, but is now a module
distributed by default with Elasticsearch. See the
{ref}/geoip-processor.html[GeoIP processor] for more details.
Using the geoip
Processor in a Pipeline
See {ref}/geoip-processor.html#using-ingest-geoip[using ingest-geoip
].
Ingest user_agent
Processor Plugin
The user_agent
processor is no longer distributed as a plugin, but is now a module
distributed by default with Elasticsearch. See the
{ref}/user-agent-processor.html[User Agent processor] for more details.
Management Plugins
Management plugins offer UIs for managing and interacting with Elasticsearch.
Core management plugins
The core management plugins are:
- X-Pack
-
X-Pack contains the management and monitoring features for Elasticsearch. It aggregates cluster wide statistics and events and offers a single interface to view and analyze them. You can get a free license for basic monitoring or a higher level license for more advanced needs.
Mapper Plugins
Mapper plugins allow new field datatypes to be added to Elasticsearch.
Core mapper plugins
The core mapper plugins are:
- Mapper Size Plugin
-
The mapper-size plugin provides the
_size
meta field which, when enabled, indexes the size in bytes of the original {ref}/mapping-source-field.html[_source
] field. - [mapper-murmur3]
-
The mapper-murmur3 plugin allows hashes to be computed at index-time and stored in the index for later use with the
cardinality
aggregation. - Mapper Annotated Text Plugin
-
The annotated text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).
Mapper Size Plugin
The mapper-size plugin provides the _size
meta field which, when enabled,
indexes the size in bytes of the original
{ref}/mapping-source-field.html[_source
] field.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install mapper-size
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/mapper-size/mapper-size-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove mapper-size
The node must be stopped before removing the plugin.
Using the _size
field
In order to enable the _size
field, set the mapping as follows:
PUT my_index
{
"mappings": {
"_doc": {
"_size": {
"enabled": true
}
}
}
}
The value of the _size
field is accessible in queries, aggregations, scripts,
and when sorting:
# Example documents
PUT my_index/_doc/1
{
"text": "This is a document"
}
PUT my_index/_doc/2
{
"text": "This is another document"
}
GET my_index/_search
{
"query": {
"range": {
"_size": { (1)
"gt": 10
}
}
},
"aggs": {
"sizes": {
"terms": {
"field": "_size", (2)
"size": 10
}
}
},
"sort": [
{
"_size": { (3)
"order": "desc"
}
}
],
"script_fields": {
"size": {
"script": "doc['_size']" (4)
}
}
}
-
Querying on the
_size
field -
Aggregating on the
_size
field -
Sorting on the
_size
field -
Accessing the
_size
field in scripts (inline scripts must be modules-security-scripting.html#enable-dynamic-scripting[enabled] for this example to work)
Mapper Murmur3 Plugin
The mapper-murmur3 plugin provides the ability to compute hash of field values at index-time and store them in the index. This can sometimes be helpful when running cardinality aggregations on high-cardinality and large string fields.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install mapper-murmur3
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/mapper-murmur3/mapper-murmur3-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove mapper-murmur3
The node must be stopped before removing the plugin.
Using the murmur3
field
The murmur3 field
is typically used within a multi-field, so that both the original
value and its hash are stored in the index:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_field": {
"type": "keyword",
"fields": {
"hash": {
"type": "murmur3"
}
}
}
}
}
}
}
Such a mapping allows you to refer to my_field.hash
in order to get hashes
of the values of the my_field
field. This is only useful for running
cardinality
aggregations:
# Example documents
PUT my_index/_doc/1
{
"my_field": "This is a document"
}
PUT my_index/_doc/2
{
"my_field": "This is another document"
}
GET my_index/_search
{
"aggs": {
"my_field_cardinality": {
"cardinality": {
"field": "my_field.hash" (1)
}
}
}
}
-
Counting unique values on the
my_field.hash
field
Running a cardinality
aggregation on the my_field
field directly would
yield the same result, however using my_field.hash
instead might result in
a speed-up if the field has a high cardinality. On the other hand, it is
discouraged to use the murmur3
field on numeric fields and string fields
that are not almost unique as the use of a murmur3
field is unlikely to
bring significant speed-ups, while increasing the amount of disk space required
to store the index.
Mapper Annotated Text Plugin
experimental[]
The mapper-annotated-text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).
The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token stream at the same position as the underlying text it annotates.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install mapper-annotated-text
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/mapper-annotated-text/mapper-annotated-text-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove mapper-annotated-text
The node must be stopped before removing the plugin.
Using the annotated-text
field
The annotated-text field
tokenizes text content as per the more common text
field (see
"limitations" below) but also injects any marked-up annotation tokens directly into
the search index:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_field": {
"type": "annotated_text"
}
}
}
}
}
Such a mapping would allow marked-up text, e.g. Wikipedia articles, to be indexed as both text
and structured tokens. The annotations use a markdown-like syntax using URL encoding of
one or more values separated by the &
symbol.
We can use the "_analyze" api to test how an example annotation would be stored as tokens in the search index:
GET my_index/_analyze
{
"field": "my_field",
"text":"Investors in [Apple](Apple+Inc.) rejoiced."
}
Response:
{
"tokens": [
{
"token": "investors",
"start_offset": 0,
"end_offset": 9,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "in",
"start_offset": 10,
"end_offset": 12,
"type": "<ALPHANUM>",
"position": 1
},
{
"token": "Apple Inc.", (1)
"start_offset": 13,
"end_offset": 18,
"type": "annotation",
"position": 2
},
{
"token": "apple",
"start_offset": 13,
"end_offset": 18,
"type": "<ALPHANUM>",
"position": 2
},
{
"token": "rejoiced",
"start_offset": 19,
"end_offset": 27,
"type": "<ALPHANUM>",
"position": 3
}
]
}
-
Note the whole annotation token
Apple Inc.
is placed, unchanged as a single token in the token stream and at the same position (position 2) as the text token (apple
) it annotates.
We can now perform searches for annotations using regular term
queries that don’t tokenize
the provided search values. Annotations are a more precise way of matching as can be seen
in this example where a search for Beck
will not match Jeff Beck
:
# Example documents
PUT my_index/_doc/1
{
"my_field": "[Beck](Beck) announced a new tour"(1)
}
PUT my_index/_doc/2
{
"my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"(2)
}
# Example search
GET my_index/_search
{
"query": {
"term": {
"my_field": "Beck" (3)
}
}
}
-
As well as tokenising the plain text into single words e.g.
beck
, here we inject the single token valueBeck
at the same position asbeck
in the token stream. -
Note annotations can inject multiple tokens at the same position - here we inject both the very specific value
Jeff Beck
and the broader termGuitarist
. This enables broader positional queries e.g. finding mentions of aGuitarist
near tostrat
. -
A benefit of searching with these carefully defined annotation tokens is that a query for
Beck
will not match document 2 that contains the tokensjeff
,beck
andJeff Beck
Warning
|
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will
cause the document to be rejected with a parse failure. In the future we hope to have a use for
the equals signs, so we actively reject documents that contain them today.
|
Data modelling tips
Use structured and unstructured fields
Annotations are normally a way of weaving structured information into unstructured text for higher-precision search.
Entity resolution
is a form of document enrichment undertaken by specialist software or people
where references to entities in a document are disambiguated by attaching a canonical ID.
The ID is used to resolve any number of aliases or distinguish between people with the
same name. The hyperlinks connecting Wikipedia’s articles are a good example of resolved
entity IDs woven into text.
These IDs can be embedded as annotations in an annotated_text field but it often makes sense to include them in dedicated structured fields to support discovery via aggregations:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_unstructured_text_field": {
"type": "annotated_text"
},
"my_structured_people_field": {
"type": "text",
"fields": {
"keyword" :{
"type": "keyword"
}
}
}
}
}
}
}
Applications would then typically provide content and discover it as follows:
# Example documents
PUT my_index/_doc/1
{
"my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
"my_twitter_handles": ["@kimchy"] (1)
}
GET my_index/_search
{
"query": {
"query_string": {
"query": "elasticsearch OR logstash OR kibana",(2)
"default_field": "my_unstructured_text_field"
}
},
"aggregations": {
"top_people" :{
"significant_terms" : { (3)
"field" : "my_twitter_handles.keyword"
}
}
}
}
-
Note the
my_twitter_handles
contains a list of the annotation values also used in the unstructured text. (Note the annotated_text syntax requires escaping). By repeating the annotation values in a structured field this application has ensured that the tokens discovered in the structured field can be used for search and highlighting in the unstructured field. -
In this example we search for documents that talk about components of the elastic stack
-
We use the
my_twitter_handles
field here to discover people who are significantly associated with the elastic stack.
Avoiding over-matching annotations
By design, the regular text tokens and the annotation tokens co-exist in the same indexed field but in rare cases this can lead to some over-matching.
The value of an annotation often denotes a named entity (a person, place or company). The tokens for these named entities are inserted untokenized, and differ from typical text tokens because they are normally:
-
Mixed case e.g.
Madonna
-
Multiple words e.g.
Jeff Beck
-
Can have punctuation or numbers e.g.
Apple Inc.
or@kimchy
This means, for the most part, a search for a named entity in the annotated text field will
not have any false positives e.g. when selecting Apple Inc.
from an aggregation result
you can drill down to highlight uses in the text without "over matching" on any text tokens
like the word apple
in this context:
the apple was very juicy
However, a problem arises if your named entity happens to be a single term and lower-case e.g. the
company elastic
. In this case, a search on the annotated text field for the token elastic
may match a text document such as this:
he fired an elastic band
To avoid such false matches users should consider prefixing annotation values to ensure they don’t name clash with text tokens e.g.
[elastic](Company_elastic) released version 7.0 of the elastic stack today
Using the annotated
highlighter
The annotated-text
plugin includes a custom highlighter designed to mark up search hits
in a way which is respectful of the original markup:
# Example documents
PUT my_index/_doc/1
{
"my_field": "The cat sat on the [mat](sku3578)"
}
GET my_index/_search
{
"query": {
"query_string": {
"query": "cats"
}
},
"highlight": {
"fields": {
"my_field": {
"type": "annotated", (1)
"require_field_match": false
}
}
}
}
-
The
annotated
highlighter type is designed for use with annotated_text fields
The annotated highlighter is based on the unified
highlighter and supports the same
settings but does not use the pre_tags
or post_tags
parameters. Rather than using
html-like markup such as <em>cat</em>
the annotated highlighter uses the same
markdown-like syntax used for annotations and injects a key=value annotation where _hit_term
is the key and the matched search term is the value e.g.
The [cat](_hit_term=cat) sat on the [mat](sku3578)
The annotated highlighter tries to be respectful of any existing markup in the original text:
-
If the search term matches exactly the location of an existing annotation then the
_hit_term
key is merged into the url-like syntax used in the(…)
part of the existing annotation. -
However, if the search term overlaps the span of an existing annotation it would break the markup formatting so the original annotation is removed in favour of a new annotation with just the search hit information in the results.
-
Any non-overlapping annotations in the original text are preserved in highlighter selections
Limitations
The annotated_text field type supports the same mapping settings as the text
field type
but with the following exceptions:
-
No support for
fielddata
orfielddata_frequency_filter
-
No support for
index_prefixes
orindex_phrases
indexing
Security Plugins
Security plugins add a security layer to Elasticsearch.
Core security plugins
The core security plugins are:
- X-Pack
-
X-Pack is the Elastic product that makes it easy for anyone to add enterprise-grade security to their Elastic Stack. Designed to address the growing security needs of thousands of enterprises using the Elastic Stack today, X-Pack provides peace of mind when it comes to protecting your data.
Community contributed security plugins
The following plugins have been contributed by our community:
-
Readonly REST: High performance access control for Elasticsearch native REST API (by Simone Scarduzio)
Snapshot/Restore Repository Plugins
Repository plugins extend the {ref}/modules-snapshots.html[Snapshot/Restore] functionality in Elasticsearch by adding repositories backed by the cloud or by distributed file systems:
Core repository plugins
The core repository plugins are:
- S3 Repository
-
The S3 repository plugin adds support for using S3 as a repository.
- Azure Repository
-
The Azure repository plugin adds support for using Azure as a repository.
- HDFS Repository
-
The Hadoop HDFS Repository plugin adds support for using HDFS as a repository.
- Google Cloud Storage Repository
-
The GCS repository plugin adds support for using Google Cloud Storage service as a repository.
Community contributed repository plugins
The following plugin has been contributed by our community:
-
Openstack Swift (by Wikimedia Foundation and BigData Boutique)
Azure Repository Plugin
The Azure Repository plugin adds support for using Azure as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install repository-azure
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-azure/repository-azure-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove repository-azure
The node must be stopped before removing the plugin.
Azure Repository
To enable Azure repositories, you first have to define your Azure storage settings as {ref}/secure-settings.html[secure settings], before starting up the node:
bin/elasticsearch-keystore add azure.client.default.account
bin/elasticsearch-keystore add azure.client.default.key
Where account
is the azure account name and key
the azure secret key. Instead of an azure secret key under key
, you can alternatively
define a shared access signature (SAS) token under sas_token
to use for authentication. When using an SAS token instead of an
account key, the SAS token must have read (r), write (w), list (l), and delete (d) permissions for the repository base path and
all its contents. These permissions need to be granted for the blob service (b) and apply to resource types service (s), container (c), and
object (o).
These settings are used by the repository’s internal azure client.
Note that you can also define more than one account:
bin/elasticsearch-keystore add azure.client.default.account
bin/elasticsearch-keystore add azure.client.default.key
bin/elasticsearch-keystore add azure.client.secondary.account
bin/elasticsearch-keystore add azure.client.secondary.sas_token
default
is the default account name which will be used by a repository,
unless you set an explicit one in the
repository settings.
The account
, key
, and sas_token
storage settings are
{ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you
reload the settings, the internal azure clients, which are used to transfer the
snapshot, will utilize the latest settings from the keystore.
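For example, a reload can be triggered with the nodes reload secure settings API (a sketch; see {ref}/secure-settings.html#reloadable-secure-settings[reloadable secure settings] for details):
POST _nodes/reload_secure_settings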
Note
|
In progress snapshot/restore jobs will not be preempted by a reload of the storage secure settings. They will complete using the client as it was built when the operation started. |
You can set the client side timeout to use when making any single request. It can be defined globally, per account, or both. It is not set by default, which means that Elasticsearch uses the default value set by the Azure client (5 minutes).
max_retries
can help to control the exponential backoff policy. It sets the number of retries
in case of failures before considering the snapshot as failed. Defaults to 3
retries.
The initial backoff period is defined by the Azure SDK as 30s
, which means 30s
of wait time
before retrying after a first timeout or failure. The maximum backoff period is defined by the Azure SDK as
90s
.
endpoint_suffix
can be used to specify Azure endpoint suffix explicitly. Defaults to core.windows.net
.
cloud.azure.storage.timeout: 10s
azure.client.default.max_retries: 7
azure.client.default.endpoint_suffix: core.chinacloudapi.cn
azure.client.secondary.timeout: 30s
In this example, timeout will be 10s
per try for default
with 7
retries before failing
and endpoint suffix will be core.chinacloudapi.cn
and 30s
per try for secondary
with 3
retries.
Important
|
Supported Azure Storage Account types
The Azure Repository plugin works with all Standard storage accounts.
Premium Locally Redundant Storage (Premium_LRS) is not supported, as it is only usable as VM disk storage, not as general purpose block blob storage. |
You can register a proxy per client using the following settings:
azure.client.default.proxy.host: proxy.host
azure.client.default.proxy.port: 8888
azure.client.default.proxy.type: http
Supported values for proxy.type
are direct
(default), http
or socks
.
When proxy.type
is set to http
or socks
, proxy.host
and proxy.port
must be provided.
Repository settings
The Azure repository supports following settings:
client
-
Azure named client to use. Defaults to
default
. container
-
Container name. You must create the azure container before creating the repository. Defaults to
elasticsearch-snapshots
. base_path
-
Specifies the path within container to repository data. Defaults to empty (root directory).
chunk_size
-
Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example:
10MB
,5KB
,500B
. Defaults to64MB
(64MB max). compress
-
When set to
true
metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults tofalse
. max_restore_bytes_per_sec
-
Throttles per node restore rate. Defaults to
40mb
per second. max_snapshot_bytes_per_sec
-
Throttles per node snapshot rate. Defaults to
40mb
per second. readonly
-
Makes repository read-only. Defaults to
false
. location_mode
-
primary_only
orsecondary_only
. Defaults toprimary_only
. Note that if you set it tosecondary_only
, it will forcereadonly
to true.
Some examples, using scripts:
# The simplest one
PUT _snapshot/my_backup1
{
"type": "azure"
}
# With some settings
PUT _snapshot/my_backup2
{
"type": "azure",
"settings": {
"container": "backup-container",
"base_path": "backups",
"chunk_size": "32m",
"compress": true
}
}
# With two accounts defined in elasticsearch.yml (my_account1 and my_account2)
PUT _snapshot/my_backup3
{
"type": "azure",
"settings": {
"client": "secondary"
}
}
PUT _snapshot/my_backup4
{
"type": "azure",
"settings": {
"client": "secondary",
"location_mode": "primary_only"
}
}
Example using Java:
client.admin().cluster().preparePutRepository("my_backup_java1")
.setType("azure").setSettings(Settings.builder()
.put(Storage.CONTAINER, "backup-container")
.put(Storage.CHUNK_SIZE, new ByteSizeValue(32, ByteSizeUnit.MB))
).get();
Repository validation rules
According to the containers naming guide, a container name must be a valid DNS name, conforming to the following naming rules:
-
Container names must start with a letter or number, and can contain only letters, numbers, and the dash (-) character.
-
Every dash (-) character must be immediately preceded and followed by a letter or number; consecutive dashes are not permitted in container names.
-
All letters in a container name must be lowercase.
-
Container names must be from 3 through 63 characters long.
S3 Repository Plugin
The S3 repository plugin adds support for using AWS S3 as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].
If you are looking for a hosted solution of Elasticsearch on AWS, please visit http://www.elastic.co/cloud.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install repository-s3
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-s3/repository-s3-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove repository-s3
The node must be stopped before removing the plugin.
Getting Started
The plugin provides a repository type named s3
which may be used when creating
a repository. The repository defaults to using
ECS
IAM Role or
EC2
IAM Role credentials for authentication. The only mandatory setting is the
bucket name:
PUT _snapshot/my_s3_repository
{
"type": "s3",
"settings": {
"bucket": "my_bucket"
}
}
Client Settings
The client that you use to connect to S3 has a number of settings available.
The settings have the form s3.client.CLIENT_NAME.SETTING_NAME
. By default,
s3
repositories use a client named default
, but this can be modified using
the repository setting client
. For example:
PUT _snapshot/my_s3_repository
{
"type": "s3",
"settings": {
"bucket": "my_bucket",
"client": "my_alternate_client"
}
}
Most client settings can be added to the elasticsearch.yml
configuration file
with the exception of the secure settings, which you add to the {es} keystore.
For more information about creating and updating the {es} keystore, see
{ref}/secure-settings.html[Secure settings].
For example, before you start the node, run these commands to add AWS access key settings to the keystore:
bin/elasticsearch-keystore add s3.client.default.access_key
bin/elasticsearch-keystore add s3.client.default.secret_key
All client secure settings of this plugin are
{ref}/secure-settings.html#reloadable-secure-settings[reloadable]. After you
reload the settings, the internal s3
clients, used to transfer the snapshot
contents, will utilize the latest settings from the keystore. Any existing s3
repositories, as well as any newly created ones, will pick up the new values
stored in the keystore.
Note
|
In-progress snapshot/restore tasks will not be preempted by a reload of the client’s secure settings. The task will complete using the client as it was built when the operation started. |
The following list contains the available client settings. Those that must be
stored in the keystore are marked as "secure" and are reloadable; the other
settings belong in the elasticsearch.yml
file; a combined example of the non-secure settings follows the list.
access_key
({ref}/secure-settings.html[Secure])-
An S3 access key. The
secret_key
setting must also be specified. secret_key
({ref}/secure-settings.html[Secure])-
An S3 secret key. The
access_key
setting must also be specified. session_token
-
An S3 session token. The
access_key
andsecret_key
settings must also be specified. (Secure) endpoint
-
The S3 service endpoint to connect to. This defaults to
s3.amazonaws.com
but the AWS documentation lists alternative S3 endpoints. If you are using an S3-compatible service then you should set this to the service’s endpoint. protocol
-
The protocol to use to connect to S3. Valid values are either
http
orhttps
. Defaults tohttps
. proxy.host
-
The host name of a proxy to connect to S3 through.
proxy.port
-
The port of a proxy to connect to S3 through.
proxy.username
({ref}/secure-settings.html[Secure])-
The username to connect to the
proxy.host
with. proxy.password
({ref}/secure-settings.html[Secure])-
The password to connect to the
proxy.host
with. read_timeout
-
The socket timeout for connecting to S3. The value should specify the unit. For example, a value of
5s
specifies a 5 second timeout. The default value is 50 seconds. max_retries
-
The number of retries to use when an S3 request fails. The default value is
3
. use_throttle_retries
-
Whether retries should be throttled (i.e. should back off). Must be
true
orfalse
. Defaults totrue
.
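As a sketch, the non-secure settings for a named client could be placed in elasticsearch.yml like this (the client name, proxy host, and values are placeholders), with the matching access_key and secret_key for that client added to the keystore as shown above:
s3.client.my_alternate_client.proxy.host: "proxy.example.com"
s3.client.my_alternate_client.proxy.port: 8080
s3.client.my_alternate_client.read_timeout: "60s"
s3.client.my_alternate_client.max_retries: 5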
S3-compatible services
There are a number of storage systems that provide an S3-compatible API, and
the repository-s3
plugin allows you to use these systems in place of AWS S3.
To do so, you should set the s3.client.CLIENT_NAME.endpoint
setting to the
system’s endpoint. This setting accepts IP addresses and hostnames and may
include a port. For example, the endpoint may be 172.17.0.2
or
172.17.0.2:9000
. You may also need to set s3.client.CLIENT_NAME.protocol
to
http
if the endpoint does not support HTTPS.
Minio is an example of a storage system that provides an
S3-compatible API. The repository-s3
plugin allows {es} to work with
Minio-backed repositories as well as repositories stored on AWS S3. Other
S3-compatible storage systems may also work with {es}, but these are not tested
or supported.
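A sketch of what this could look like for such a system (the endpoint address, client name, and bucket name are placeholders): point a named client at the service in elasticsearch.yml,
s3.client.minio.endpoint: "172.17.0.2:9000"
s3.client.minio.protocol: "http"
add that client's credentials to the keystore as shown above, and then register a repository that uses it:
PUT _snapshot/my_minio_repository
{
  "type": "s3",
  "settings": {
    "bucket": "my_bucket",
    "client": "minio"
  }
}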
Repository Settings
The s3
repository type supports a number of settings to customize how data is
stored in S3. These can be specified when creating the repository. For example:
PUT _snapshot/my_s3_repository
{
"type": "s3",
"settings": {
"bucket": "my_bucket_name",
"another_setting": "setting_value"
}
}
The following settings are supported:
bucket
-
The name of the bucket to be used for snapshots. (Mandatory)
client
-
The name of the S3 client to use to connect to S3. Defaults to
default
. base_path
-
Specifies the path to the repository data within its bucket. Defaults to an empty string, meaning that the repository is at the root of the bucket. The value of this setting should not start or end with a
/
. chunk_size
-
Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example:
1GB
, 10MB
, 5KB
, 500B
. Defaults to 1GB
. compress
-
When set to
true
metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults to false
. max_restore_bytes_per_sec
-
Throttles per node restore rate. Defaults to
40mb
per second. max_snapshot_bytes_per_sec
-
Throttles per node snapshot rate. Defaults to
40mb
per second. readonly
-
Makes repository read-only. Defaults to
false
. server_side_encryption
-
When set to
true
files are encrypted on the server side using the AES256 algorithm. Defaults to false
. buffer_size
-
Minimum threshold below which the chunk is uploaded using a single request. Beyond this threshold, the S3 repository will use the AWS Multipart Upload API to split the chunk into several parts, each of
buffer_size
length, and to upload each part in its own request. Note that setting a buffer size lower than 5mb
is not allowed since it will prevent the use of the Multipart API and may result in upload errors. It is also not possible to set a buffer size greater than 5gb
as it is the maximum upload size allowed by S3. Defaults to the minimum between 100mb
and 5%
of the heap size. canned_acl
-
The S3 repository supports all S3 canned ACLs:
private
, public-read
, public-read-write
, authenticated-read
, log-delivery-write
, bucket-owner-read
, bucket-owner-full-control
. Defaults to private
. You can specify a canned ACL using the canned_acl
setting. When the S3 repository creates buckets and objects, it applies the canned ACL to them. storage_class
-
Sets the S3 storage class for objects stored in the snapshot repository. Values may be
standard
, reduced_redundancy
, standard_ia
. Defaults to standard
. Changing this setting on an existing repository only affects the storage class for newly created objects, resulting in a mixed usage of storage classes. Additionally, S3 Lifecycle Policies can be used to manage the storage class of existing objects. Due to the extra complexity with the Glacier class lifecycle, it is not currently supported by the plugin. For more information about the different classes, see the AWS Storage Classes Guide.
Note
|
The option of defining client settings in the repository settings as documented below is considered deprecated, and will be removed in a future version. |
In addition to the above settings, you may also specify all non-secure client settings in the repository settings. In this case, the client settings found in the repository settings will be merged with those of the named client used by the repository. Conflicts between client and repository settings are resolved by the repository settings taking precedence over client settings.
For example:
PUT _snapshot/my_s3_repository
{
"type": "s3",
"settings": {
"client": "my_client_name",
"bucket": "my_bucket_name",
"endpoint": "my.s3.endpoint"
}
}
This sets up a repository that uses all client settings from the client
my_client_name
except for the endpoint
that is overridden to
my.s3.endpoint
by the repository settings.
Recommended S3 Permissions
In order to restrict the Elasticsearch snapshot process to the minimum required resources, we recommend using Amazon IAM in conjunction with pre-existing S3 buckets. Here is an example policy that allows the snapshot process access to an S3 bucket named "snaps.example.com". This may be configured through the AWS IAM console by creating a Custom Policy and using a Policy Document similar to the following (changing snaps.example.com to your bucket name).
{
"Statement": [
{
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListBucketMultipartUploads",
"s3:ListBucketVersions"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::snaps.example.com"
]
},
{
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::snaps.example.com/*"
]
}
],
"Version": "2012-10-17"
}
You may further restrict the permissions by specifying a prefix within the bucket; in this example, the prefix is named "foo".
{
"Statement": [
{
"Action": [
"s3:ListBucket",
"s3:GetBucketLocation",
"s3:ListBucketMultipartUploads",
"s3:ListBucketVersions"
],
"Condition": {
"StringLike": {
"s3:prefix": [
"foo/*"
]
}
},
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::snaps.example.com"
]
},
{
"Action": [
"s3:GetObject",
"s3:PutObject",
"s3:DeleteObject",
"s3:AbortMultipartUpload",
"s3:ListMultipartUploadParts"
],
"Effect": "Allow",
"Resource": [
"arn:aws:s3:::snaps.example.com/foo/*"
]
}
],
"Version": "2012-10-17"
}
The bucket needs to exist to register a repository for snapshots. If you did not create the bucket then the repository registration will fail.
Note: Starting in version 7.0, all bucket operations use the path style access pattern. In previous versions, the decision to use virtual hosted style or path style access was made by the AWS Java SDK.
AWS VPC Bandwidth Settings
AWS instances resolve S3 endpoints to a public IP. If the Elasticsearch instances reside in a private subnet in an AWS VPC then all traffic to S3 will go through that VPC’s NAT instance. If your VPC’s NAT instance is a smaller instance size (e.g. a t1.micro) or is handling a high volume of network traffic, your bandwidth to S3 may be limited by that NAT instance’s network bandwidth.
Instances residing in a public subnet in an AWS VPC will connect to S3 via the VPC’s internet gateway and not be bandwidth limited by the VPC’s NAT instance.
Hadoop HDFS Repository Plugin
The HDFS repository plugin adds support for using HDFS File System as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install repository-hdfs
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-hdfs/repository-hdfs-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove repository-hdfs
The node must be stopped before removing the plugin.
Getting started with HDFS
The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If the distro you are using is not protocol compatible with Apache Hadoop, consider replacing the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required).
Even if Hadoop is already installed on the Elasticsearch nodes, for security reasons, the required libraries need to be placed under the plugin folder. Note that in most cases, if the distro is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files (see below).
- Windows Users
-
Using Apache Hadoop on Windows is problematic and thus it is not recommended. For those really wanting to use it, make sure you place the elusive
winutils.exe
under the plugin folder and point the HADOOP_HOME
variable to it; this should minimize the amount of permissions Hadoop requires (though one would still have to add some more).
Configuration Properties
Once installed, define the configuration for the hdfs
repository through the
{ref}/modules-snapshots.html[REST API]:
PUT _snapshot/my_hdfs_repository
{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "elasticsearch/repositories/my_hdfs_repository",
"conf.dfs.client.read.shortcircuit": "true"
}
}
The following settings are supported:
uri
|
The URI address for HDFS, e.g. "hdfs://<host>:<port>/". (Required) |
path
|
The file path within the filesystem where data is stored/loaded, e.g. "path/to/file". (Required) |
load_defaults
|
Whether to load the default Hadoop configuration or not. (Enabled by default) |
conf.<key>
|
Inlined configuration parameter to be added to the Hadoop configuration. (Optional) Only client-oriented properties from the Hadoop core and hdfs configuration files will be recognized by the plugin. |
compress
|
Whether to compress the metadata or not. (Disabled by default) |
max_restore_bytes_per_sec
|
Throttles per node restore rate. Defaults to 40mb per second. |
max_snapshot_bytes_per_sec
|
Throttles per node snapshot rate. Defaults to 40mb per second. |
readonly
|
Makes repository read-only. Defaults to false. |
chunk_size
|
Override the chunk size. (Disabled by default) |
security.principal
|
Kerberos principal to use when connecting to a secured HDFS cluster.
If you are using a service principal for your elasticsearch node, you may
use the _HOST pattern notation for your hostname. |
A Note on HDFS Availability
When you initialize a repository, its settings are persisted in the cluster state. When a node comes online, it will attempt to initialize all repositories for which it has settings. If your cluster has an HDFS repository configured, then all nodes in the cluster must be able to reach HDFS when starting. If not, then the node will fail to initialize the repository at startup and the repository will be unusable. If this happens, you will need to remove and re-add the repository or restart the offending node.
Hadoop Security
The HDFS Repository Plugin integrates seamlessly with Hadoop’s authentication model. The following authentication methods are supported by the plugin:
simple
|
Also means "no security" and is enabled by default. Uses information from underlying operating system account running Elasticsearch to inform Hadoop of the name of the current user. Hadoop makes no attempts to verify this information. |
kerberos
|
Authenticates to Hadoop using a Kerberos principal and keytab. Interfacing with HDFS clusters secured with Kerberos requires a few additional steps to enable (see Principals and Keytabs and Creating the Secure Repository for more information). |
Principals and Keytabs
Before attempting to connect to a secured HDFS cluster, provision the Kerberos principals and keytabs that the
Elasticsearch nodes will use for authenticating to Kerberos. For maximum security and to avoid tripping up the Kerberos
replay protection, you should create a service principal per node, following the pattern of
elasticsearch/hostname@REALM
.
Warning
|
In some cases, if the same principal is authenticating from multiple clients at once, services may reject authentication for those principals under the assumption that they could be replay attacks. If you are running the plugin in production with multiple nodes, you should use a unique service principal for each node. |
On each Elasticsearch node, place the appropriate keytab file in the node’s configuration location under the
repository-hdfs
directory using the name krb5.keytab
:
$> cd elasticsearch/config
$> ls
elasticsearch.yml jvm.options log4j2.properties repository-hdfs/ scripts/
$> cd repository-hdfs
$> ls
krb5.keytab
Note
|
Make sure you have the correct keytabs! If you are using a service principal per node (like
elasticsearch/hostname@REALM ) then each node will need its own unique keytab file for the principal assigned to that
host!
|
Creating the Secure Repository
Once your keytab files are in place and your cluster is started, creating a secured HDFS repository is simple. Just
add the name of the principal that you will be authenticating as in the repository settings under the
security.principal
option:
PUT _snapshot/my_hdfs_repository
{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "/user/elasticsearch/repositories/my_hdfs_repository",
"security.principal": "elasticsearch@REALM"
}
}
If you are using different service principals for each node, you can use the _HOST
pattern in your principal
name. Elasticsearch will automatically replace the pattern with the hostname of the node at runtime:
PUT _snapshot/my_hdfs_repository
{
"type": "hdfs",
"settings": {
"uri": "hdfs://namenode:8020/",
"path": "/user/elasticsearch/repositories/my_hdfs_repository",
"security.principal": "elasticsearch/_HOST@REALM"
}
}
Authorization
Once Elasticsearch is connected and authenticated to HDFS, HDFS will infer a username to use for
authorizing file access for the client. By default, it picks this username from the primary part of
the Kerberos principal used to authenticate to the service. For example, in the case of a principal
like elasticsearch@REALM
or elasticsearch/hostname@REALM
then the username that HDFS
extracts for file access checks will be elasticsearch
.
Note
|
The repository plugin makes no assumptions about what Elasticsearch’s principal name is. The main fragment of the
Kerberos principal is not required to be elasticsearch . If you have a principal or service name that works better
for you or your organization then feel free to use it instead!
|
Google Cloud Storage Repository Plugin
The GCS repository plugin adds support for using the Google Cloud Storage service as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install repository-gcs
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/repository-gcs/repository-gcs-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove repository-gcs
The node must be stopped before removing the plugin.
Getting started
The plugin uses the Google Cloud Java Client for Storage to connect to the Storage service. If you are using Google Cloud Storage for the first time, you must connect to the Google Cloud Platform Console and create a new project. After your project is created, you must enable the Cloud Storage Service for your project.
Creating a Bucket
The Google Cloud Storage service uses the concept of a bucket as a container for all the data. Buckets are usually created using the Google Cloud Platform Console. The plugin does not automatically create buckets.
To create a new bucket:
-
Connect to the Google Cloud Platform Console.
-
Select your project.
-
Go to the Storage Browser.
-
Click the Create Bucket button.
-
Enter the name of the new bucket.
-
Select a storage class.
-
Select a location.
-
Click the Create button.
For more detailed instructions, see the Google Cloud documentation.
Service Authentication
The plugin must authenticate the requests it makes to the Google Cloud Storage service. It is common for Google client libraries to employ a strategy named application default credentials. However, that strategy is not supported for use with Elasticsearch. The plugin operates under the Elasticsearch process, which runs with the security manager enabled. The security manager obstructs the "automatic" credential discovery. Therefore, you must configure service account credentials even if you are using an environment that does not normally require this configuration (such as Compute Engine, Kubernetes Engine or App Engine).
Using a Service Account
You have to obtain and provide service account credentials manually.
For detailed information about generating JSON service account files, see the Google Cloud documentation. Note that the PKCS12 format is not supported by this plugin.
Here is a summary of the steps:
-
Connect to the Google Cloud Platform Console.
-
Select your project.
-
Go to the Permission tab.
-
Select the Service Accounts tab.
-
Click Create service account.
-
After the account is created, select it and download a JSON key file.
A JSON service account file looks like this:
{
"type": "service_account",
"project_id": "your-project-id",
"private_key_id": "...",
"private_key": "-----BEGIN PRIVATE KEY-----\n...\n-----END PRIVATE KEY-----\n",
"client_email": "service-account-for-your-repository@your-project-id.iam.gserviceaccount.com",
"client_id": "...",
"auth_uri": "https://accounts.google.com/o/oauth2/auth",
"token_uri": "https://accounts.google.com/o/oauth2/token",
"auth_provider_x509_cert_url": "https://www.googleapis.com/oauth2/v1/certs",
"client_x509_cert_url": "https://www.googleapis.com/robot/v1/metadata/x509/your-bucket@your-project-id.iam.gserviceaccount.com"
}
To provide this file to the plugin, it must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. You must add a setting name of the form gcs.client.NAME.credentials_file
, where NAME
is the name of the client configuration for the repository. The implicit client
name is default
, but a different client name can be specified in the
repository settings with the client
key.
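For example, assuming the JSON key was saved as /path/service-account.json (an illustrative path), it can be added to the keystore for a client named my_alternate_client with:
bin/elasticsearch-keystore add-file gcs.client.my_alternate_client.credentials_file /path/service-account.json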
Note
|
Passing the file path via the GOOGLE_APPLICATION_CREDENTIALS environment variable is not supported. |
For example, if you added a gcs.client.my_alternate_client.credentials_file
setting in the keystore, you can configure a repository to use those credentials
like this:
PUT _snapshot/my_gcs_repository
{
"type": "gcs",
"settings": {
"bucket": "my_bucket",
"client": "my_alternate_client"
}
}
The credentials_file
settings are {ref}/secure-settings.html#reloadable-secure-settings[reloadable].
After you reload the settings, the internal gcs
clients, which are used to
transfer the snapshot contents, utilize the latest settings from the keystore.
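For example, after updating the keystore on every node, a reload can be triggered with the reload secure settings API (shown here as a sketch):
POST _nodes/reload_secure_settings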
Note
|
Snapshot or restore jobs that are in progress are not preempted by a reload
of the client’s credentials_file settings. They complete using the client as
it was built when the operation started.
|
Client Settings
The client used to connect to Google Cloud Storage has a number of settings available.
Client setting names are of the form gcs.client.CLIENT_NAME.SETTING_NAME
and are specified
inside elasticsearch.yml
. The default client name looked up by a gcs
repository is
called default
, but can be customized with the repository setting client
.
For example:
PUT _snapshot/my_gcs_repository
{
"type": "gcs",
"settings": {
"bucket": "my_bucket",
"client": "my_alternate_client"
}
}
Some settings are sensitive and must be stored in the {ref}/secure-settings.html[Elasticsearch keystore]. This is the case for the service account file:
bin/elasticsearch-keystore add-file gcs.client.default.credentials_file /path/service-account.json
The following are the available client settings. Those that must be stored in the keystore
are marked as Secure
.
credentials_file
-
The service account file that is used to authenticate to the Google Cloud Storage service. (Secure)
endpoint
-
The Google Cloud Storage service endpoint to connect to. This will be automatically determined by the Google Cloud Storage client but can be specified explicitly.
connect_timeout
-
The timeout to establish a connection to the Google Cloud Storage service. The value should specify the unit. For example, a value of
5s
specifies a 5 second timeout. The value of -1
corresponds to an infinite timeout. The default value is 20 seconds. read_timeout
-
The timeout to read data from an established connection. The value should specify the unit. For example, a value of
5s
specifies a 5 second timeout. The value of -1
corresponds to an infinite timeout. The default value is 20 seconds. application_name
-
Name used by the client when it uses the Google Cloud Storage service. Setting a custom name can be useful to authenticate your cluster when request statistics are logged in the Google Cloud Platform. Defaults to
repository-gcs
.
project_id
-
The Google Cloud project id. This will be automatically inferred from the credentials file but can be specified explicitly. For example, it can be used to switch between projects when the same credentials are usable for both the production and the development projects.
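As a sketch, the non-secure client settings above are placed in elasticsearch.yml; the client name and values below are purely illustrative:
gcs.client.my_alternate_client.connect_timeout: 30s
gcs.client.my_alternate_client.read_timeout: 30s
gcs.client.my_alternate_client.application_name: my-cluster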
Repository Settings
The gcs
repository type supports a number of settings to customize how data
is stored in Google Cloud Storage.
These can be specified when creating the repository. For example:
PUT _snapshot/my_gcs_repository
{
"type": "gcs",
"settings": {
"bucket": "my_other_bucket",
"base_path": "dev"
}
}
The following settings are supported:
bucket
-
The name of the bucket to be used for snapshots. (Mandatory)
client
-
The name of the client to use to connect to Google Cloud Storage. Defaults to
default
. base_path
-
Specifies the path within the bucket to the repository data. Defaults to the root of the bucket.
chunk_size
-
Big files can be broken down into chunks during snapshotting if needed. Specify the chunk size as a value and unit, for example:
10MB
or 5KB
. Defaults to 100MB
, which is the maximum permitted. compress
-
When set to
true
metadata files are stored in compressed format. This setting doesn’t affect index files that are already compressed by default. Defaults to false
. max_restore_bytes_per_sec
-
Throttles per node restore rate. Defaults to
40mb
per second. max_snapshot_bytes_per_sec
-
Throttles per node snapshot rate. Defaults to
40mb
per second. readonly
-
Makes repository read-only. Defaults to
false
. application_name
-
deprecated:[6.3.0, "This setting is now defined in the client settings."] Name used by the client when it uses the Google Cloud Storage service.
Recommended Bucket Permission
The service account used to access the bucket must have "Writer" access to the bucket:
-
Connect to the Google Cloud Platform Console.
-
Select your project.
-
Go to the Storage Browser.
-
Select the bucket and "Edit bucket permission".
-
The service account must be configured as a "User" with "Writer" access.
Store Plugins
Store plugins offer alternatives to default Lucene stores.
Core store plugins
The core store plugins are:
- Store SMB
-
The Store SMB plugin works around a bug in Windows SMB and Java on Windows.
Store SMB Plugin
The Store SMB plugin works around a bug in Windows SMB and Java on Windows.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install store-smb
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from https://artifacts.elastic.co/downloads/elasticsearch-plugins/store-smb/store-smb-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove store-smb
The node must be stopped before removing the plugin.
Working around a bug in Windows SMB and Java on Windows
When using a shared file system based on the SMB protocol (like Azure File Service) to store indices, Lucene opens index segment files with a write-only flag. This is the correct way to open the files, as they will only be used for writes, and it allows different FS implementations to optimize for it. Sadly, on Windows with SMB, this disables the cache manager, causing writes to be slow. This has been described in LUCENE-6176, but it affects every Java program. This needs to be fixed outside of ES and/or Lucene, either in Windows or in OpenJDK. For now, we provide experimental support for opening the files with a read flag as well, but this should be considered experimental and the correct place to fix it is in OpenJDK or Windows.
The Store SMB plugin provides two storage types optimized for SMB:
smb_mmap_fs
-
an SMB-specific implementation of the default {ref}/index-modules-store.html#mmapfs[mmap fs]
smb_simple_fs
-
an SMB-specific implementation of the default {ref}/index-modules-store.html#simplefs[simple fs]
To use one of these specific storage types, you need to install the Store SMB plugin and restart the node. Then configure Elasticsearch to set the storage type you want.
This can be configured for all indices by adding this to the elasticsearch.yml
file:
index.store.type: smb_simple_fs
Note that this setting will only be applied to newly created indices.
It can also be set on a per-index basis at index creation time:
PUT my_index
{
"settings": {
"index.store.type": "smb_mmap_fs"
}
}
Integrations
Integrations are not plugins, but are external tools or modules that make it easier to work with Elasticsearch.
CMS integrations
Supported by the community:
-
Drupal: Drupal Elasticsearch integration via Search API.
-
Drupal: Drupal Elasticsearch integration.
-
ElasticPress: Elasticsearch WordPress Plugin
-
WPSOLR: Elasticsearch (and Apache Solr) WordPress Plugin
-
Tiki Wiki CMS Groupware: Tiki has native support for Elasticsearch. This provides faster and better search (facets, etc.), along with some Natural Language Processing features (e.g. More Like This).
-
XWiki Next Generation Wiki: XWiki has an Elasticsearch and Kibana macro that allows you to run Elasticsearch queries and display the results in XWiki pages using XWiki’s scripting language, as well as to include Kibana widgets in XWiki pages.
Data import/export and validation
Note
|
Rivers were used to import data from external systems into Elasticsearch prior to the 2.0 release. Elasticsearch releases 2.0 and later do not support rivers. |
Supported by Elasticsearch:
-
{logstash-ref}/plugins-outputs-elasticsearch.html[Logstash output to Elasticsearch]: The Logstash
elasticsearch
output plugin. -
{logstash-ref}/plugins-inputs-elasticsearch.html[Elasticsearch input to Logstash] The Logstash
elasticsearch
input plugin. -
{logstash-ref}/plugins-filters-elasticsearch.html[Elasticsearch event filtering in Logstash] The Logstash
elasticsearch
filter plugin. -
{logstash-ref}/plugins-codecs-es_bulk.html[Elasticsearch bulk codec] The Logstash
es_bulk
plugin decodes the Elasticsearch bulk format into individual events.
Supported by the community:
-
JDBC importer: The Java Database Connectivity (JDBC) importer allows you to fetch data from JDBC sources for indexing into Elasticsearch (by Jörg Prante)
-
https://github.com/BigDataDevs/kafka-elasticsearch-consumer[Kafka Standalone Consumer (Indexer)]: Reads messages from Kafka in batches, processes them (as implemented), and bulk-indexes them into Elasticsearch. Flexible and scalable. More documentation is available in the GitHub repo’s wiki.
-
Mongolastic: A tool that clones data from Elasticsearch to MongoDB and vice versa
-
Scrutineer: A high performance consistency checker to compare what you’ve indexed with your source of truth content (e.g. DB)
-
IMAP/POP3/Mail importer: The Mail importer allows you to fetch data from IMAP and POP3 servers for indexing into Elasticsearch (by Hendrik Saly)
-
FS Crawler: The File System (FS) crawler allows you to index documents (PDF, Open Office…) from your local file system and over SSH (by David Pilato).
Deployment
Supported by Elasticsearch:
-
Ansible playbook for Elasticsearch: An officially supported Ansible playbook for Elasticsearch. Tested with the latest versions of 5.x and 6.x on Ubuntu 14.04/16.04, Debian 8, and CentOS 7.
-
Puppet: Elasticsearch puppet module.
Supported by the community:
-
Chef: Chef cookbook for Elasticsearch
Framework integrations
Supported by the community:
-
Aspire for Elasticsearch: Aspire, from Search Technologies, is a powerful connector and processing framework designed for unstructured data. It has connectors to internal and external repositories including SharePoint, Documentum, Jive, RDB, file systems, websites and more, and can transform and normalize this data before indexing in Elasticsearch.
-
Apache Camel Integration: An Apache Camel component to integrate with Elasticsearch
-
Catmandu: An Elasticsearch backend for the Catmandu framework.
-
elasticsearch-test: Elasticsearch Java annotations for unit testing with JUnit
-
FOSElasticaBundle: Symfony2 Bundle wrapping Elastica.
-
Grails: Elasticsearch Grails plugin.
-
Haystack: Modular search for Django
-
Hibernate Search: Integration with Hibernate ORM, from the Hibernate team. Automatically synchronizes write operations, yet exposes full Elasticsearch capabilities for queries. Can either return native Elasticsearch results or re-map them back into managed entities loaded within the transaction from the reference database.
-
play2-elasticsearch: Elasticsearch module for Play Framework 2.x
-
Spring Data Elasticsearch: Spring Data implementation for Elasticsearch
-
Spring Elasticsearch: Spring Factory for Elasticsearch
-
Twitter Storehaus: Thin asynchronous Scala client for Storehaus.
Hadoop integrations
Supported by Elasticsearch:
-
es-hadoop: Elasticsearch real-time search and analytics natively integrated with Hadoop. Supports Map/Reduce, Cascading, Apache Hive, Apache Pig, Apache Spark and Apache Storm.
Health and Performance Monitoring
Supported by the community:
-
check_elasticsearch: An Elasticsearch availability and performance monitoring plugin for Nagios.
-
check-es: Nagios/Shinken plugins for checking on Elasticsearch
-
es2graphite: Send cluster and indices stats and status to Graphite for monitoring and graphing.
-
ElasticOcean: Elasticsearch & DigitalOcean iOS Real-Time Monitoring tool to keep an eye on DigitalOcean Droplets or Elasticsearch instances or both of them on-a-go.
-
opsview-elasticsearch: Opsview plugin written in Perl for monitoring Elasticsearch
-
Scout: Provides plugins for monitoring Elasticsearch nodes, clusters, and indices.
-
SPM for Elasticsearch: Performance monitoring with live charts showing cluster and node stats, integrated alerts, email reports, etc.
Other integrations
Supported by the community:
These projects appear to have been abandoned:
Help for plugin authors
The Elasticsearch repository contains examples of:
-
a Java plugin with custom settings.
-
a Java plugin that registers a REST handler.
-
a Java rescore plugin.
-
a Java script plugin.
These examples provide the bare bones needed to get started. For more information about how to write a plugin, we recommend looking at the plugins listed in this documentation for inspiration.
Plugin descriptor file
All plugins must contain a file called plugin-descriptor.properties
.
The format for this file is described in detail in this example:
# Elasticsearch plugin descriptor file
# This file must exist as 'plugin-descriptor.properties' inside a plugin.
#
### example plugin for "foo"
#
# foo.zip <-- zip file for the plugin, with this structure:
# |____ <arbitrary name1>.jar <-- classes, resources, dependencies
# |____ <arbitrary nameN>.jar <-- any number of jars
# |____ plugin-descriptor.properties <-- example contents below:
#
# classname=foo.bar.BazPlugin
# description=My cool plugin
# version=6.0
# elasticsearch.version=6.0
# java.version=1.8
#
### mandatory elements for all plugins:
#
# 'description': simple summary of the plugin
description=${description}
#
# 'version': plugin's version
version=${version}
#
# 'name': the plugin name
name=${name}
#
# 'classname': the name of the class to load, fully-qualified.
classname=${classname}
#
# 'java.version': version of java the code is built against
# use the system property java.specification.version
# version string must be a sequence of nonnegative decimal integers
# separated by "."'s and may have leading zeros
java.version=${javaVersion}
#
# 'elasticsearch.version': version of elasticsearch compiled against
elasticsearch.version=${elasticsearchVersion}
### optional elements for plugins:
#
# 'extended.plugins': other plugins this plugin extends through SPI
extended.plugins=${extendedPlugins}
#
# 'has.native.controller': whether or not the plugin has a native controller
has.native.controller=${hasNativeController}
<% if (licensed) { %>
# This plugin requires that a license agreement be accepted before installation
licensed=${licensed}
<% } %>
Either fill in this template yourself or, if you are using Elasticsearch’s Gradle build system, you
can fill in the necessary values in the build.gradle
file for your plugin.
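As a rough sketch, assuming the elasticsearch.esplugin Gradle plugin and reusing the illustrative values from the template above, the relevant part of build.gradle might look like this:
apply plugin: 'elasticsearch.esplugin'

esplugin {
    name 'my-plugin'
    description 'My cool plugin'
    classname 'foo.bar.BazPlugin'
}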
Mandatory elements for plugins
Element | Type | Description |
---|---|---|
description |
String |
simple summary of the plugin |
version |
String |
plugin’s version |
name |
String |
the plugin name |
classname |
String |
the name of the class to load, fully-qualified. |
java.version |
String |
version of java the code is built against.
Use the system property java.specification.version. |
elasticsearch.version |
String |
version of Elasticsearch compiled against. |
Note that only jar files at the root of the plugin are added to the classpath for the plugin! If you need other resources, package them into a resources jar.
Important
|
Plugin release lifecycle
You will have to release a new version of the plugin for each new Elasticsearch release.
This version is checked when the plugin is loaded so Elasticsearch will refuse to start
in the presence of plugins with the incorrect elasticsearch.version. |
Testing your plugin
When testing a Java plugin, it will only be auto-loaded if it is in the
plugins/
directory. Use bin/elasticsearch-plugin install file:///path/to/your/plugin
to install your plugin for testing.
You may also load your plugin within the test framework for integration tests. Read more in {ref}/integration-tests.html#changing-node-configuration[Changing Node Configuration].
Java Security permissions
Some plugins may need additional security permissions. A plugin can include
the optional plugin-security.policy
file containing grant
statements for
additional permissions. Any additional permissions will be displayed to the user
with a large warning, and they will have to confirm them when installing the
plugin interactively. So if possible, it is best to avoid requesting any
spurious permissions!
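A minimal sketch of such a policy file; the specific permission shown is only an example:
grant {
  // example: allow outbound network connections from the plugin's jars
  permission java.net.SocketPermission "*", "connect";
};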
If you are using the Elasticsearch Gradle build system, place this file in
src/main/plugin-metadata
and it will be applied during unit tests as well.
Keep in mind that the Java security model is stack-based, and the additional permissions will only be granted to the jars in your plugin, so you will have to write proper security code around operations requiring elevated privileges. It is recommended to add a check to prevent unprivileged code (such as scripts) from gaining escalated permissions. For example:
// ES permission you should check before doPrivileged() blocks
import org.elasticsearch.SpecialPermission;

import java.security.AccessController;
import java.security.PrivilegedAction;

SecurityManager sm = System.getSecurityManager();
if (sm != null) {
    // unprivileged code such as scripts do not have SpecialPermission
    sm.checkPermission(new SpecialPermission());
}
AccessController.doPrivileged((PrivilegedAction<Void>) () -> {
    // sensitive operation
    return null;
});
See Secure Coding Guidelines for Java SE for more information.
Appendix A: Deleted pages
The following pages have moved or been deleted.
Multicast Discovery Plugin
The multicast-discovery
plugin has been removed. Instead, configure networking
using unicast (see {ref}/modules-network.html[Network settings]) or using
one of the cloud discovery plugins.
AWS Cloud Plugin
Looking for a hosted solution for Elasticsearch on AWS? Check out http://www.elastic.co/cloud.
The Elasticsearch cloud-aws
plugin has been split into two separate plugins:
-
EC2 Discovery Plugin (
discovery-ec2
) -
S3 Repository Plugin (
repository-s3
)
Azure Cloud Plugin
The cloud-azure
plugin has been split into two separate plugins:
-
Azure Classic Discovery Plugin (
discovery-azure-classic
) -
Azure Repository Plugin (
repository-azure
)
GCE Cloud Plugin
The cloud-gce
plugin has been renamed to GCE Discovery Plugin (discovery-gce
).
Delete-By-Query plugin removed
The Delete-By-Query plugin has been removed in favor of a new {ref}/docs-delete-by-query.html[Delete By Query API] implementation in core.