Analysis Plugins
Analysis plugins extend Elasticsearch with new analyzers, tokenizers, token filters, or character filters.
Core analysis plugins
The core analysis plugins are:
- ICU: Adds extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
- Kuromoji: Advanced analysis of Japanese using the Kuromoji analyzer.
- Nori: Morphological analysis of Korean using the Lucene Nori analyzer.
- Phonetic: Analyzes tokens into their phonetic equivalent using Soundex, Metaphone, Caverphone, and other codecs.
- SmartCN: An analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
- Stempel: Provides high quality stemming for Polish.
- Ukrainian: Provides stemming for Ukrainian.
Community contributed analysis plugins
A number of analysis plugins have been contributed by our community:
- IK Analysis Plugin (by Medcl)
- Pinyin Analysis Plugin (by Medcl)
- Vietnamese Analysis Plugin (by Duy Do)
- Network Addresses Analysis Plugin (by Ofir123)
- Dandelion Analysis Plugin (by ZarHenry96)
- STConvert Analysis Plugin (by Medcl)
ICU Analysis Plugin
The ICU Analysis plugin integrates the Lucene ICU module into {es}, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.
Important: ICU analysis and backwards compatibility

From time to time, the ICU library receives updates such as adding new characters and emojis, and improving collation (sort) orders. These changes may or may not affect search and sort orders, depending on which character sets you are using. While we restrict ICU upgrades to major versions, you may find that an index created in the previous major version will need to be reindexed in order to return correct (and correctly ordered) results, and to take advantage of new characters.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-icu
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-icu/analysis-icu-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-icu
The node must be stopped before removing the plugin.
ICU Analyzer
Performs basic normalization, tokenization and character folding, using the icu_normalizer char filter, icu_tokenizer, and icu_normalizer token filter.

The following parameters are accepted:

- method: Normalization method. Accepts nfkc, nfc or nfkc_cf (default).
- mode: Normalization mode. Accepts compose (default) or decompose.
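For example, these parameters can be spelled out explicitly on a custom analyzer of type icu_analyzer. The following is only a minimal sketch (the index and analyzer names are placeholders, and the values shown are simply the documented defaults):

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "type": "icu_analyzer",
            "method": "nfkc_cf",
            "mode": "compose"
          }
        }
      }
    }
  }
}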
ICU Normalization Character Filter
Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively.

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

Here are two examples, the default usage and a customised character filter:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nfkc_cf_normalized": { (1)
"tokenizer": "icu_tokenizer",
"char_filter": [
"icu_normalizer"
]
},
"nfd_normalized": { (2)
"tokenizer": "icu_tokenizer",
"char_filter": [
"nfd_normalizer"
]
}
},
"char_filter": {
"nfd_normalizer": {
"type": "icu_normalizer",
"name": "nfc",
"mode": "decompose"
}
}
}
}
}
}
1. Uses the default nfkc_cf normalization.
2. Uses the customized nfd_normalizer character filter, which is set to use nfc normalization with decomposition.
ICU Tokenizer
Tokenizes text into words on word boundaries, as defined in
UAX #29: Unicode Text Segmentation.
It behaves much like the {ref}/analysis-standard-tokenizer.html[standard
tokenizer],
but adds better support for some Asian languages by using a dictionary-based
approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and
using custom rules to break Myanmar and Khmer text into syllables.
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_icu_analyzer": {
"tokenizer": "icu_tokenizer"
}
}
}
}
}
}
Rules customization
experimental[This functionality is marked as experimental in Lucene]
You can customize the icu_tokenizer behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files setting, which should contain a comma-separated list of code:rulefile pairs in the following format: a four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following rule file to $ES_HOME/config/KeywordTokenizer.rbbi:
.+ {200};
Then create an analyzer to use this rule file as follows:
PUT icu_sample
{
"settings": {
"index":{
"analysis":{
"tokenizer" : {
"icu_user_file" : {
"type" : "icu_tokenizer",
"rule_files" : "Latn:KeywordTokenizer.rbbi"
}
},
"analyzer" : {
"my_analyzer" : {
"type" : "custom",
"tokenizer" : "icu_user_file"
}
}
}
}
}
}
GET icu_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "Elasticsearch. Wow!"
}
The above analyze
request returns the following:
{
"tokens": [
{
"token": "Elasticsearch. Wow!",
"start_offset": 0,
"end_offset": 19,
"type": "<ALPHANUM>",
"position": 0
}
]
}
ICU Normalization Token Filter
Normalizes characters as explained here. It registers itself as the icu_normalizer token filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default).

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

You should probably prefer the Normalization character filter.

Here are two examples, the default usage and a customised token filter:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"nfkc_cf_normalized": { (1)
"tokenizer": "icu_tokenizer",
"filter": [
"icu_normalizer"
]
},
"nfc_normalized": { (2)
"tokenizer": "icu_tokenizer",
"filter": [
"nfc_normalizer"
]
}
},
"filter": {
"nfc_normalizer": {
"type": "icu_normalizer",
"name": "nfc"
}
}
}
}
}
}
1. Uses the default nfkc_cf normalization.
2. Uses the customized nfc_normalizer token filter, which is set to use nfc normalization.
ICU Folding Token Filter
Case folding of Unicode characters based on UTR#30, like the {ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter] on steroids. It registers itself as the icu_folding token filter and is available to all indices:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"folded": {
"tokenizer": "icu_tokenizer",
"filter": [
"icu_folding"
]
}
}
}
}
}
}
The ICU folding token filter already does Unicode normalization, so there is no need to use the normalization character filter or token filter as well.
Which letters are folded can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. It is important to note that both upper and lowercase forms should be specified, and that these filtered characters are not lowercased, which is why we add the lowercase filter as well:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"swedish_analyzer": {
"tokenizer": "icu_tokenizer",
"filter": [
"swedish_folding",
"lowercase"
]
}
},
"filter": {
"swedish_folding": {
"type": "icu_folding",
"unicodeSetFilter": "[^åäöÅÄÖ]"
}
}
}
}
}
}
ICU Collation Token Filter

Warning: This token filter has been deprecated since Lucene 5.0. Please use the ICU Collation Keyword Field.
ICU Collation Keyword Field
Collations are used for sorting documents in a language-specific word order.
The icu_collation_keyword field type is available to all indices and will encode the terms directly as bytes in a doc values field and a single indexed token, just like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation], which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in "phonebook" order:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"name": { (1)
"type": "text",
"fields": {
"sort": { (2)
"type": "icu_collation_keyword",
"index": false,
"language": "de",
"country": "DE",
"variant": "@collation=phonebook"
}
}
}
}
}
}
}
GET _search (3)
{
"query": {
"match": {
"name": "Fritz"
}
},
"sort": "name.sort"
}
1. The name field uses the standard analyzer, and so supports full text queries.
2. The name.sort field is an icu_collation_keyword field that will preserve the name as a single token doc values, and applies the German "phonebook" order.
3. An example query which searches the name field and sorts on the name.sort field.
Parameters for ICU Collation Keyword Fields
The following parameters are accepted by icu_collation_keyword fields:

- doc_values: Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.
- index: Should the field be searchable? Accepts true (default) or false.
- null_value: Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.
- store: Whether the field value should be stored and retrievable separately from the {ref}/mapping-source-field.html[_source] field. Accepts true or false (default).
- fields: Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.
Collation options
- strength: The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.
- decomposition: Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user's responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world's languages do not require text normalization, most locales set no as the default decomposition mode.
The following options are expert only:
- alternate: Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable, which effectively determines whether punctuation and whitespace are ignored.
- case_level: Possible values: true or false (default). Whether case level sorting is required. When strength is set to primary this will ignore accent differences.
- case_first: Possible values: lower or upper. Useful to control which case is sorted first when case is not ignored for strength tertiary. The default depends on the collation.
- numeric: Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21.
- variable_top: Single character or contraction. Controls what is variable for alternate.
- hiragana_quaternary_mode: Possible values: true or false. Distinguishes between Katakana and Hiragana characters in quaternary strength.
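For illustration, the collation options above are simply additional parameters on an icu_collation_keyword field. The sketch below is not from the plugin documentation; the index and field names are placeholders. It sets up a sort field that ignores case and accents and sorts embedded digits numerically:

PUT collation_sample
{
  "mappings": {
    "_doc": {
      "properties": {
        "product": {
          "type": "text",
          "fields": {
            "sort": {
              "type": "icu_collation_keyword",
              "index": false,
              "language": "de",
              "strength": "primary",
              "numeric": true
            }
          }
        }
      }
    }
  }
}

With strength set to primary, case and accent differences are not significant for sorting, and numeric set to true causes a value like product-2 to sort before product-10.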
ICU Transform Token Filter
Transforms are used to process Unicode text in many different ways, such as case mapping, normalization, transliteration and bidirectional text handling.
You can define which transformation you want to apply with the id parameter (defaults to Null), and specify text direction with the dir parameter, which accepts forward (default) for LTR and reverse for RTL. Custom rulesets are not yet supported.
For example:
PUT icu_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"latin": {
"tokenizer": "keyword",
"filter": [
"myLatinTransform"
]
}
},
"filter": {
"myLatinTransform": {
"type": "icu_transform",
"id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" (1)
}
}
}
}
}
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "你好" (2)
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "здравствуйте" (3)
}
GET icu_sample/_analyze
{
"analyzer": "latin",
"text": "こんにちは" (4)
}
1. This transform transliterates characters to Latin, separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.
2. Returns ni hao.
3. Returns zdravstvujte.
4. Returns kon’nichiha.

For more documentation, please see the ICU Transform user guide.
Japanese (kuromoji) Analysis Plugin
The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-kuromoji
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-kuromoji/analysis-kuromoji-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-kuromoji
The node must be stopped before removing the plugin.
kuromoji analyzer

The kuromoji analyzer consists of the following tokenizer and token filters:

- kuromoji_tokenizer
- kuromoji_baseform token filter
- kuromoji_part_of_speech token filter
- {ref}/analysis-cjk-width-tokenfilter.html[cjk_width] token filter
- ja_stop token filter
- kuromoji_stemmer token filter
- {ref}/analysis-lowercase-tokenfilter.html[lowercase] token filter

It supports the mode and user_dictionary settings from kuromoji_tokenizer, as shown in the sketch below.
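Because these settings are passed through, the search mode can be enabled directly on an analyzer of type kuromoji. A minimal sketch (the index and analyzer names are placeholders):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search"
          }
        }
      }
    }
  }
}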
kuromoji_iteration_mark character filter

The kuromoji_iteration_mark character filter normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. It accepts the following settings:

- normalize_kanji: Indicates whether kanji iteration marks should be normalized. Defaults to true.
- normalize_kana: Indicates whether kana iteration marks should be normalized. Defaults to true.
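A sketch of wiring this character filter into a custom analyzer (the index name, analyzer name, and the char_filter name my_iteration_mark are placeholders):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "char_filter": {
          "my_iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        },
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "my_iteration_mark"
            ]
          }
        }
      }
    }
  }
}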
kuromoji_tokenizer
The kuromoji_tokenizer accepts the following settings:

- mode: The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:
  - normal: Normal segmentation, no decomposition for compounds. Example output: 関西国際空港 アブラカダブラ
  - search: Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output: 関西, 関西国際空港, 国際, 空港 アブラカダブラ
  - extended: Extended mode outputs unigrams for unknown words. Example output: 関西, 国際, 空港 ア, ブ, ラ, カ, ダ, ブ, ラ
- discard_punctuation: Whether punctuation should be discarded from the output. Defaults to true.
- user_dictionary: The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

  <text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

  As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:

  東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞

- nbest_cost/nbest_examples: Additional expert user parameters nbest_cost and nbest_examples can be used to include additional tokens that are most likely according to the statistical model. If both parameters are used, the larger resulting value of the two is applied.
  - nbest_cost: The nbest_cost parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the nbest_cost value of the best path.
  - nbest_examples: The nbest_examples parameter can be used to find a nbest_cost value based on examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that in the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport) we'd like a cost that gives us 箱根 (Hakone) and 成田 (Narita).
Then create an analyzer as follows:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"tokenizer": {
"kuromoji_user_dict": {
"type": "kuromoji_tokenizer",
"mode": "extended",
"discard_punctuation": "false",
"user_dictionary": "userdict_ja.txt"
}
},
"analyzer": {
"my_analyzer": {
"type": "custom",
"tokenizer": "kuromoji_user_dict"
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "東京スカイツリー"
}
The above analyze
request returns the following:
{
"tokens" : [ {
"token" : "東京",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : "スカイツリー",
"start_offset" : 2,
"end_offset" : 8,
"type" : "word",
"position" : 1
} ]
}
kuromoji_baseform token filter
The kuromoji_baseform
token filter replaces terms with their
BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"kuromoji_baseform"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "飲み"
}
which responds with:
{
"tokens" : [ {
"token" : "飲む",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
} ]
}
kuromoji_part_of_speech token filter

The kuromoji_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

- stoptags: An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar.
For example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"my_posfilter"
]
}
},
"filter": {
"my_posfilter": {
"type": "kuromoji_part_of_speech",
"stoptags": [
"助詞-格助詞-一般",
"助詞-終助詞"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "寿司がおいしいね"
}
Which responds with:
{
"tokens" : [ {
"token" : "寿司",
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}, {
"token" : "おいしい",
"start_offset" : 3,
"end_offset" : 7,
"type" : "word",
"position" : 2
} ]
}
kuromoji_readingform token filter

The kuromoji_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:

- use_romaji: Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined kuromoji_readingform filter, use_romaji is set to true. The default when defining a custom kuromoji_readingform, however, is false. The only reason to use the custom form is if you need the
katakana reading form:
PUT kuromoji_sample
{
"settings": {
"index":{
"analysis":{
"analyzer" : {
"romaji_analyzer" : {
"tokenizer" : "kuromoji_tokenizer",
"filter" : ["romaji_readingform"]
},
"katakana_analyzer" : {
"tokenizer" : "kuromoji_tokenizer",
"filter" : ["katakana_readingform"]
}
},
"filter" : {
"romaji_readingform" : {
"type" : "kuromoji_readingform",
"use_romaji" : true
},
"katakana_readingform" : {
"type" : "kuromoji_readingform",
"use_romaji" : false
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "katakana_analyzer",
"text": "寿司" (1)
}
GET kuromoji_sample/_analyze
{
"analyzer": "romaji_analyzer",
"text": "寿司" (2)
}
1. Returns スシ.
2. Returns sushi.
kuromoji_stemmer token filter

The kuromoji_stemmer token filter normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

- minimum_length: Katakana words shorter than the minimum length are not stemmed (default is 4).
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"my_katakana_stemmer"
]
}
},
"filter": {
"my_katakana_stemmer": {
"type": "kuromoji_stemmer",
"minimum_length": 4
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "コピー" (1)
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "サーバー" (2)
}
1. Returns コピー.
2. Returns サーバ.
ja_stop token filter

The ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"analyzer_with_ja_stop": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"ja_stop"
]
}
},
"filter": {
"ja_stop": {
"type": "ja_stop",
"stopwords": [
"_japanese_",
"ストップ"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "analyzer_with_ja_stop",
"text": "ストップは消える"
}
The above request returns:
{
"tokens" : [ {
"token" : "消える",
"start_offset" : 5,
"end_offset" : 8,
"type" : "word",
"position" : 2
} ]
}
kuromoji_number token filter
The kuromoji_number
token filter normalizes Japanese numbers (kansūji)
to regular Arabic decimal numbers in half-width characters. For example:
PUT kuromoji_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "kuromoji_tokenizer",
"filter": [
"kuromoji_number"
]
}
}
}
}
}
}
GET kuromoji_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "一〇〇〇"
}
Which results in:
{
"tokens" : [ {
"token" : "1000",
"start_offset" : 0,
"end_offset" : 4,
"type" : "word",
"position" : 0
} ]
}
Korean (nori) Analysis Plugin
The Korean (nori) Analysis plugin integrates the Lucene nori analysis module into Elasticsearch. It uses the mecab-ko-dic dictionary to perform morphological analysis of Korean texts.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-nori
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-nori/analysis-nori-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-nori
The node must be stopped before removing the plugin.
nori analyzer

The nori analyzer consists of the following tokenizer and token filters:

- nori_tokenizer
- nori_part_of_speech token filter
- nori_readingform token filter
- {ref}/analysis-lowercase-tokenfilter.html[lowercase] token filter

It supports the decompound_mode and user_dictionary settings from nori_tokenizer and the stoptags setting from nori_part_of_speech, as shown in the sketch below.
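Since these settings are passed through, they can be set directly on an analyzer of type nori. A minimal sketch (the index and analyzer names are placeholders):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": [
              "NR"
            ]
          }
        }
      }
    }
  }
}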
nori_tokenizer
The nori_tokenizer accepts the following settings:

- decompound_mode: The decompound mode determines how the tokenizer handles compound tokens. It can be set to:
  - none: No decomposition for compounds. Example output: 가거도항 가곡역
  - discard: Decomposes compounds and discards the original form (default). Example output: 가곡역 => 가곡, 역
  - mixed: Decomposes compounds and keeps the original form. Example output: 가곡역 => 가곡역, 가곡, 역
- user_dictionary: The Nori tokenizer uses the mecab-ko-dic dictionary by default. A user_dictionary with custom nouns (NNG) may be appended to the default dictionary. The dictionary should have the following format:

  <token> [<token 1> ... <token n>]

  The first token is mandatory and represents the custom noun that should be added in the dictionary. For compound nouns, the custom segmentation can be provided after the first token ([<token 1> ... <token n>]). The segmentation of the custom compound nouns is controlled by the decompound_mode setting.

  As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ko.txt:

  c++              (1)
  C샤프
  세종
  세종시 세종 시    (2)

  1. A simple noun
  2. A compound noun (세종시) followed by its decomposition: 세종 and 시.
Then create an analyzer as follows:
PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"  (1)
}

1. Sejong city

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    (1)
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
  }]
}

1. This is a compound token that spans two positions (mixed mode).

- user_dictionary_rules: You can also inline the rules directly in the tokenizer definition using the user_dictionary_rules option:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}
The nori_tokenizer
sets a number of additional attributes per token that are used by token filters
to modify the stream.
You can view all these additional attributes with the following request:
GET _analyze
{
"tokenizer": "nori_tokenizer",
"text": "뿌리가 깊은 나무는", (1)
"attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
"explain": true
}
1. A tree with deep roots
Which responds with:
{
"detail": {
"custom_analyzer": true,
"charfilters": [],
"tokenizer": {
"name": "nori_tokenizer",
"tokens": [
{
"token": "뿌리",
"start_offset": 0,
"end_offset": 2,
"type": "word",
"position": 0,
"leftPOS": "NNG(General Noun)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "NNG(General Noun)"
},
{
"token": "가",
"start_offset": 2,
"end_offset": 3,
"type": "word",
"position": 1,
"leftPOS": "J(Ending Particle)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "J(Ending Particle)"
},
{
"token": "깊",
"start_offset": 4,
"end_offset": 5,
"type": "word",
"position": 2,
"leftPOS": "VA(Adjective)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "VA(Adjective)"
},
{
"token": "은",
"start_offset": 5,
"end_offset": 6,
"type": "word",
"position": 3,
"leftPOS": "E(Verbal endings)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "E(Verbal endings)"
},
{
"token": "나무",
"start_offset": 7,
"end_offset": 9,
"type": "word",
"position": 4,
"leftPOS": "NNG(General Noun)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "NNG(General Noun)"
},
{
"token": "는",
"start_offset": 9,
"end_offset": 10,
"type": "word",
"position": 5,
"leftPOS": "J(Ending Particle)",
"morphemes": null,
"posType": "MORPHEME",
"reading": null,
"rightPOS": "J(Ending Particle)"
}
]
},
"tokenfilters": []
}
}
nori_part_of_speech token filter

The nori_part_of_speech token filter removes tokens that match a set of part-of-speech tags. The list of supported tags and their meanings can be found here:
{lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

- stoptags: An array of part-of-speech tags that should be removed. It defaults to:
"stoptags": [
"E",
"IC",
"J",
"MAG", "MAJ", "MM",
"SP", "SSC", "SSO", "SC", "SE",
"XPN", "XSA", "XSN", "XSV",
"UNA", "NA", "VSV"
]
For example:
PUT nori_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "nori_tokenizer",
"filter": [
"my_posfilter"
]
}
},
"filter": {
"my_posfilter": {
"type": "nori_part_of_speech",
"stoptags": [
"NR" (1)
]
}
}
}
}
}
}
GET nori_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "여섯 용이" (2)
}
1. Korean numerals should be removed (NR)
2. Six dragons
Which responds with:
{
"tokens" : [ {
"token" : "용",
"start_offset" : 3,
"end_offset" : 4,
"type" : "word",
"position" : 1
}, {
"token" : "이",
"start_offset" : 4,
"end_offset" : 5,
"type" : "word",
"position" : 2
} ]
}
nori_readingform token filter
The nori_readingform
token filter rewrites tokens written in Hanja to their Hangul form.
PUT nori_sample
{
"settings": {
"index":{
"analysis":{
"analyzer" : {
"my_analyzer" : {
"tokenizer" : "nori_tokenizer",
"filter" : ["nori_readingform"]
}
}
}
}
}
}
GET nori_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "鄕歌" (1)
}
1. A token written in Hanja: Hyangga
Which responds with:
{
"tokens" : [ {
"token" : "향가", (1)
"start_offset" : 0,
"end_offset" : 2,
"type" : "word",
"position" : 0
}]
}
1. The Hanja form is replaced by the Hangul translation.
Phonetic Analysis Plugin
The Phonetic Analysis plugin provides token filters which convert tokens to their phonetic representation using Soundex, Metaphone, and a variety of other algorithms.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-phonetic
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-phonetic/analysis-phonetic-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-phonetic
The node must be stopped before removing the plugin.
phonetic token filter

The phonetic token filter takes the following settings:

- encoder: Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.
- replace: Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.
PUT phonetic_sample
{
"settings": {
"index": {
"analysis": {
"analyzer": {
"my_analyzer": {
"tokenizer": "standard",
"filter": [
"lowercase",
"my_metaphone"
]
}
},
"filter": {
"my_metaphone": {
"type": "phonetic",
"encoder": "metaphone",
"replace": false
}
}
}
}
}
}
GET phonetic_sample/_analyze
{
"analyzer": "my_analyzer",
"text": "Joe Bloggs" (1)
}
1. Returns: J, joe, BLKS, bloggs
Double metaphone settings
If the double_metaphone encoder is used, then this additional setting is supported:

- max_code_len: The maximum length of the emitted metaphone token. Defaults to 4.
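A sketch of a double_metaphone configuration with a longer code length (the index name, analyzer name, and the filter name my_double_metaphone are placeholders):

PUT phonetic_dm_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_double_metaphone"
            ]
          }
        },
        "filter": {
          "my_double_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "max_code_len": 6
          }
        }
      }
    }
  }
}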
Beider Morse settings
If the beider_morse encoder is used, then these additional settings are supported:

- rule_type: Whether matching should be exact or approx (default).
- name_type: Whether names are ashkenazi, sephardic, or generic (default).
- languageset: An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.
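A sketch of a Beider-Morse configuration restricted to two languages (the index name, analyzer name, and the filter name my_beider_morse are placeholders):

PUT phonetic_bm_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_beider_morse"
            ]
          }
        },
        "filter": {
          "my_beider_morse": {
            "type": "phonetic",
            "encoder": "beider_morse",
            "rule_type": "approx",
            "name_type": "generic",
            "languageset": [
              "english",
              "german"
            ]
          }
        }
      }
    }
  }
}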
Smart Chinese Analysis Plugin
The Smart Chinese Analysis plugin integrates Lucene's Smart Chinese analysis module into Elasticsearch.
It provides an analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-smartcn
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-smartcn/analysis-smartcn-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-smartcn
The node must be stopped before removing the plugin.
smartcn tokenizer and token filter

The plugin provides the smartcn analyzer and smartcn_tokenizer tokenizer, which are not configurable.

Note: The smartcn_word token filter and smartcn_sentence have been deprecated.
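Because the analyzer takes no configuration, it can be exercised directly with an _analyze request (the sample text here is arbitrary):

GET _analyze
{
  "analyzer": "smartcn",
  "text": "我是中国人"
}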
Stempel Polish Analysis Plugin
The Stempel Analysis plugin integrates Lucene's Stempel analysis module for Polish into Elasticsearch.
It provides high quality stemming for Polish, based on the Egothor project.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-stempel
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-stempel/analysis-stempel-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-stempel
The node must be stopped before removing the plugin.
stempel tokenizer and token filter

The plugin provides the polish analyzer and polish_stem token filter, which are not configurable.
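The polish analyzer can likewise be tried out with a simple _analyze request (the sample word is arbitrary):

GET _analyze
{
  "analyzer": "polish",
  "text": "studenci"
}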
Ukrainian Analysis Plugin
The Ukrainian Analysis plugin integrates Lucene's UkrainianMorfologikAnalyzer into Elasticsearch.
It provides stemming for Ukrainian using the Morfologik project.
Installation
This plugin can be installed using the plugin manager:
sudo bin/elasticsearch-plugin install analysis-ukrainian
The plugin must be installed on every node in the cluster, and each node must be restarted after installation.
This plugin can be downloaded for offline install from {plugin_url}/analysis-ukrainian/analysis-ukrainian-{version}.zip.
Removal
The plugin can be removed with the following command:
sudo bin/elasticsearch-plugin remove analysis-ukrainian
The node must be stopped before removing the plugin.
ukrainian analyzer

The plugin provides the ukrainian analyzer.
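As with the other language analyzers, a quick _analyze request is enough to try it out (the sample text is arbitrary):

GET _analyze
{
  "analyzer": "ukrainian",
  "text": "Добрий день"
}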