"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/plugins/analysis.asciidoc" (29 Dec 2021, 2232 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Analysis Plugins

Analysis plugins extend Elasticsearch by adding new analyzers, tokenizers, token filters, or character filters.

Core analysis plugins

The core analysis plugins are:

ICU

Adds extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.

Kuromoji

Advanced analysis of Japanese using the Kuromoji analyzer.

Nori

Morphological analysis of Korean using the Lucene Nori analyzer.

Phonetic

Analyzes tokens into their phonetic equivalent using Soundex, Metaphone, Caverphone, and other codecs.

SmartCN

An analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

Stempel

Provides high quality stemming for Polish.

Ukrainian

Provides stemming for Ukrainian.

Community contributed analysis plugins

A number of analysis plugins have been contributed by our community:

ICU Analysis Plugin

The ICU Analysis plugin integrates the Lucene ICU module into {es}, adding extended Unicode support using the ICU libraries, including better analysis of Asian languages, Unicode normalization, Unicode-aware case folding, collation support, and transliteration.

Important
ICU analysis and backwards compatibility

From time to time, the ICU library receives updates such as adding new characters and emojis, and improving collation (sort) orders. These changes may or may not affect search and sort orders, depending on which character sets you are using.

While we restrict ICU upgrades to major versions, you may find that an index created in the previous major version will need to be reindexed in order to return correct (and correctly ordered) results, and to take advantage of new characters.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-icu

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-icu/analysis-icu-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-icu

The node must be stopped before removing the plugin.

ICU Analyzer

Performs basic normalization, tokenization and character folding, using the icu_normalizer char filter, icu_tokenizer and icu_normalizer token filter.

The following parameters are accepted:

method

Normalization method. Accepts nfkc, nfc or nfkc_cf (default).

mode

Normalization mode. Accepts compose (default) or decompose.
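
For example, a minimal configuration sketch that sets both parameters explicitly (the index and analyzer names here are illustrative, not defined by the plugin):

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "type": "icu_analyzer",
            "method": "nfkc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}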

ICU Normalization Character Filter

Normalizes characters as explained here. It registers itself as the icu_normalizer character filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default). Set the mode parameter to decompose to convert nfc to nfd or nfkc to nfkd respectively:

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

Here are two examples, the default usage and a customised character filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "icu_normalizer"
            ]
          },
          "nfd_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "char_filter": [
              "nfd_normalizer"
            ]
          }
        },
        "char_filter": {
          "nfd_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc",
            "mode": "decompose"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfd_normalizer character filter, which is set to use nfc normalization with decomposition.

ICU Tokenizer

Tokenizes text into words on word boundaries, as defined in UAX #29: Unicode Text Segmentation. It behaves much like the {ref}/analysis-standard-tokenizer.html[standard tokenizer], but adds better support for some Asian languages by using a dictionary-based approach to identify words in Thai, Lao, Chinese, Japanese, and Korean, and using custom rules to break Myanmar and Khmer text into syllables.

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_icu_analyzer": {
            "tokenizer": "icu_tokenizer"
          }
        }
      }
    }
  }
}
Rules customization

experimental[This functionality is marked as experimental in Lucene]

You can customize the icu_tokenizer behavior by specifying per-script rule files. See the RBBI rules syntax reference for a more detailed explanation.

To add icu tokenizer rules, set the rule_files setting, which should contain a comma-separated list of code:rulefile pairs in the following format: a four-letter ISO 15924 script code, followed by a colon, then a rule file name. Rule files are placed in the ES_HOME/config directory.

As a demonstration of how the rule files can be used, save the following rule file to $ES_HOME/config/KeywordTokenizer.rbbi:

.+ {200};

Then create an analyzer to use this rule file as follows:

PUT icu_sample
{
    "settings": {
        "index":{
            "analysis":{
                "tokenizer" : {
                    "icu_user_file" : {
                       "type" : "icu_tokenizer",
                       "rule_files" : "Latn:KeywordTokenizer.rbbi"
                    }
                },
                "analyzer" : {
                    "my_analyzer" : {
                        "type" : "custom",
                        "tokenizer" : "icu_user_file"
                    }
                }
            }
        }
    }
}

GET icu_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Elasticsearch. Wow!"
}

The above analyze request returns the following:

{
   "tokens": [
      {
         "token": "Elasticsearch. Wow!",
         "start_offset": 0,
         "end_offset": 19,
         "type": "<ALPHANUM>",
         "position": 0
      }
   ]
}

ICU Normalization Token Filter

Normalizes characters as explained here. It registers itself as the icu_normalizer token filter, which is available to all indices without any further configuration. The type of normalization can be specified with the name parameter, which accepts nfc, nfkc, and nfkc_cf (default).

Which letters are normalized can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

You should probably prefer the Normalization character filter.

Here are two examples, the default usage and a customised token filter:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "nfkc_cf_normalized": { (1)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_normalizer"
            ]
          },
          "nfc_normalized": { (2)
            "tokenizer": "icu_tokenizer",
            "filter": [
              "nfc_normalizer"
            ]
          }
        },
        "filter": {
          "nfc_normalizer": {
            "type": "icu_normalizer",
            "name": "nfc"
          }
        }
      }
    }
  }
}
  1. Uses the default nfkc_cf normalization.

  2. Uses the customized nfc_normalizer token filter, which is set to use nfc normalization.

ICU Folding Token Filter

Case folding of Unicode characters based on UTR#30, like the {ref}/analysis-asciifolding-tokenfilter.html[ASCII-folding token filter] on steroids. It registers itself as the icu_folding token filter and is available to all indices:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "folded": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "icu_folding"
            ]
          }
        }
      }
    }
  }
}

The ICU folding token filter already does Unicode normalization, so there is no need to use the normalization character or token filter as well.

Which letters are folded can be controlled by specifying the unicodeSetFilter parameter, which accepts a UnicodeSet.

The following example exempts Swedish characters from folding. Note that both upper- and lowercase forms should be specified, and that these filtered characters are not lowercased, which is why we add the lowercase filter as well:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "swedish_analyzer": {
            "tokenizer": "icu_tokenizer",
            "filter": [
              "swedish_folding",
              "lowercase"
            ]
          }
        },
        "filter": {
          "swedish_folding": {
            "type": "icu_folding",
            "unicodeSetFilter": "[^åäöÅÄÖ]"
          }
        }
      }
    }
  }
}

ICU Collation Token Filter

Warning

This token filter has been deprecated since Lucene 5.0. Please use ICU Collation Keyword Field.

ICU Collation Keyword Field

Collations are used for sorting documents in a language-specific word order. The icu_collation_keyword field type is available to all indices and will encode the terms directly as bytes in a doc values field and a single indexed token just like a standard {ref}/keyword.html[Keyword Field].

Defaults to using {defguide}/sorting-collations.html#uca[DUCET collation], which is a best-effort attempt at language-neutral sorting.

Below is an example of how to set up a field for sorting German names in "phonebook" order:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {   (1)
          "type": "text",
          "fields": {
            "sort": {  (2)
              "type": "icu_collation_keyword",
              "index": false,
              "language": "de",
              "country": "DE",
              "variant": "@collation=phonebook"
            }
          }
        }
      }
    }
  }
}

GET _search (3)
{
  "query": {
    "match": {
      "name": "Fritz"
    }
  },
  "sort": "name.sort"
}
  1. The name field uses the standard analyzer, and so supports full text queries.

  2. The name.sort field is an icu_collation_keyword field that preserves the name as a single-token doc value, and applies the German "phonebook" order.

  3. An example query which searches the name field and sorts on the name.sort field.

Parameters for ICU Collation Keyword Fields

The following parameters are accepted by icu_collation_keyword fields:

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) or false.

null_value

Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the {ref}/mapping-source-field.html[_source] field. Accepts true or false (default).

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.
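
For illustration only (the index, field, and value names below are placeholders), a collation field that is usable for sorting but not searchable could be mapped like this:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name_sort": {
          "type": "icu_collation_keyword",
          "index": false,
          "language": "de",
          "null_value": "MISSING",
          "store": false
        }
      }
    }
  }
}

Here index: false leaves the field available for sorting and aggregations via doc values without making it searchable, and null_value substitutes the placeholder string for explicit null values.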

Collation options
strength

The strength property determines the minimum level of difference considered significant during comparison. Possible values are: primary, secondary, tertiary, quaternary or identical. See the ICU Collation documentation for a more detailed explanation of each value. Defaults to tertiary unless otherwise specified in the collation.

decomposition

Possible values: no (default, but collation-dependent) or canonical. Setting this decomposition property to canonical allows the Collator to handle unnormalized text properly, producing the same results as if the text were normalized. If no is set, it is the user’s responsibility to ensure that all text is already in the appropriate form before a comparison or before getting a CollationKey. Adjusting decomposition mode allows the user to select between faster and more complete collation behavior. Since a great many of the world’s languages do not require text normalization, most locales set no as the default decomposition mode.

The following options are expert only:

alternate

Possible values: shifted or non-ignorable. Sets the alternate handling for strength quaternary to be either shifted or non-ignorable, which boils down to ignoring punctuation and whitespace.

case_level

Possible values: true or false (default). Whether case level sorting is required. When strength is set to primary this will ignore accent differences.

case_first

Possible values: lower or upper. Useful to control which case is sorted first when case is not ignored for strength tertiary. The default depends on the collation.

numeric

Possible values: true or false (default). Whether digits are sorted according to their numeric representation. For example, the value egg-9 is sorted before the value egg-21.

variable_top

Single character or contraction. Controls what is variable for alternate.

hiragana_quaternary_mode

Possible values: true or false. Whether to distinguish between Katakana and Hiragana characters at quaternary strength.
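
For example, a sketch (the index and field names are illustrative) of a collation field that sorts digits numerically, so that egg-9 sorts before egg-21:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "filename_sort": {
          "type": "icu_collation_keyword",
          "index": false,
          "numeric": true
        }
      }
    }
  }
}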

ICU Transform Token Filter

Transforms are used to process Unicode text in many different ways, such as case mapping, normalization, transliteration and bidirectional text handling.

You can define which transformation you want to apply with the id parameter (defaults to Null), and specify text direction with the dir parameter which accepts forward (default) for LTR and reverse for RTL. Custom rulesets are not yet supported.

For example:

PUT icu_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "latin": {
            "tokenizer": "keyword",
            "filter": [
              "myLatinTransform"
            ]
          }
        },
        "filter": {
          "myLatinTransform": {
            "type": "icu_transform",
            "id": "Any-Latin; NFD; [:Nonspacing Mark:] Remove; NFC" (1)
          }
        }
      }
    }
  }
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "你好" (2)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "здравствуйте" (3)
}

GET icu_sample/_analyze
{
  "analyzer": "latin",
  "text": "こんにちは" (4)
}
  1. This transform transliterates characters to Latin, separates accents from their base characters, removes the accents, and then puts the remaining text into an unaccented form.

  2. Returns ni hao.

  3. Returns zdravstvujte.

  4. Returns kon’nichiha.

For more documentation, please see the ICU Transform user guide.

Japanese (kuromoji) Analysis Plugin

The Japanese (kuromoji) Analysis plugin integrates the Lucene kuromoji analysis module into Elasticsearch.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-kuromoji

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-kuromoji/analysis-kuromoji-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-kuromoji

The node must be stopped before removing the plugin.

kuromoji analyzer

The kuromoji analyzer consists of the following tokenizer and token filters:

  * kuromoji_tokenizer
  * kuromoji_baseform token filter
  * kuromoji_part_of_speech token filter
  * cjk_width token filter
  * ja_stop token filter
  * kuromoji_stemmer token filter
  * lowercase token filter

It supports the mode and user_dictionary settings from kuromoji_tokenizer.
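
Based on that statement, a rough sketch of configuring the analyzer itself (the index and analyzer names, and the userdict_ja.txt file, are illustrative; the dictionary format is described under kuromoji_tokenizer below):

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_kuromoji_analyzer": {
            "type": "kuromoji",
            "mode": "search",
            "user_dictionary": "userdict_ja.txt"
          }
        }
      }
    }
  }
}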

kuromoji_iteration_mark character filter

The kuromoji_iteration_mark normalizes Japanese horizontal iteration marks (odoriji) to their expanded form. It accepts the following settings:

normalize_kanji

Indicates whether kanji iteration marks should be normalized. Defaults to true.

normalize_kana

Indicates whether kana iteration marks should be normalized. Defaults to true.
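
A sketch of wiring this character filter into a custom analyzer (the analyzer and filter names are illustrative), with both settings spelled out at their defaults:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "iteration_mark_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "char_filter": [
              "iteration_mark"
            ]
          }
        },
        "char_filter": {
          "iteration_mark": {
            "type": "kuromoji_iteration_mark",
            "normalize_kanji": true,
            "normalize_kana": true
          }
        }
      }
    }
  }
}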

kuromoji_tokenizer

The kuromoji_tokenizer accepts the following settings:

mode

The tokenization mode determines how the tokenizer handles compound and unknown words. It can be set to:

normal

Normal segmentation, no decomposition for compounds. Example output:

関西国際空港
アブラカダブラ
search

Segmentation geared towards search. This includes a decompounding process for long nouns, also including the full compound token as a synonym. Example output:

関西, 関西国際空港, 国際, 空港
アブラカダブラ
extended

Extended mode outputs unigrams for unknown words. Example output:

関西, 国際, 空港
ア, ブ, ラ, カ, ダ, ブ, ラ
discard_punctuation

Whether punctuation should be discarded from the output. Defaults to true.

user_dictionary

The Kuromoji tokenizer uses the MeCab-IPADIC dictionary by default. A user_dictionary may be appended to the default dictionary. The dictionary should have the following CSV format:

<text>,<token 1> ... <token n>,<reading 1> ... <reading n>,<part-of-speech tag>

As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ja.txt:

東京スカイツリー,東京 スカイツリー,トウキョウ スカイツリー,カスタム名詞
nbest_cost/nbest_examples

The additional expert user parameters nbest_cost and nbest_examples can be used to include additional tokens that are most likely according to the statistical model. If both parameters are used, the larger of the two resulting values is applied.

nbest_cost

The nbest_cost parameter specifies an additional Viterbi cost. The KuromojiTokenizer will include all tokens in Viterbi paths that are within the nbest_cost value of the best path.

nbest_examples

The nbest_examples parameter can be used to find a nbest_cost value based on examples. For example, a value of /箱根山-箱根/成田空港-成田/ indicates that, for the texts 箱根山 (Mt. Hakone) and 成田空港 (Narita Airport), we would like a cost that gives us 箱根 (Hakone) and 成田 (Narita).

Then create an analyzer as follows:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "kuromoji_user_dict": {
            "type": "kuromoji_tokenizer",
            "mode": "extended",
            "discard_punctuation": "false",
            "user_dictionary": "userdict_ja.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "kuromoji_user_dict"
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "東京スカイツリー"
}

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "東京",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "スカイツリー",
    "start_offset" : 2,
    "end_offset" : 8,
    "type" : "word",
    "position" : 1
  } ]
}

kuromoji_baseform token filter

The kuromoji_baseform token filter replaces terms with their BaseFormAttribute. This acts as a lemmatizer for verbs and adjectives. Example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_baseform"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "飲み"
}

which responds with:

{
  "tokens" : [ {
    "token" : "飲む",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  } ]
}

kuromoji_part_of_speech token filter

The kuromoji_part_of_speech token filter removes tokens that match a set of part-of-speech tags. It accepts the following setting:

stoptags

An array of part-of-speech tags that should be removed. It defaults to the stoptags.txt file embedded in the lucene-analyzer-kuromoji.jar.

For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "kuromoji_part_of_speech",
            "stoptags": [
              "助詞-格助詞-一般",
              "助詞-終助詞"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "寿司がおいしいね"
}

Which responds with:

{
  "tokens" : [ {
    "token" : "寿司",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "おいしい",
    "start_offset" : 3,
    "end_offset" : 7,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_readingform token filter

The kuromoji_readingform token filter replaces the token with its reading form in either katakana or romaji. It accepts the following setting:

use_romaji

Whether romaji reading form should be output instead of katakana. Defaults to false.

When using the pre-defined kuromoji_readingform filter, use_romaji is set to true. The default when defining a custom kuromoji_readingform, however, is false. The only reason to use the custom form is if you need the katakana reading form:

PUT kuromoji_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "romaji_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["romaji_readingform"]
                    },
                    "katakana_analyzer" : {
                        "tokenizer" : "kuromoji_tokenizer",
                        "filter" : ["katakana_readingform"]
                    }
                },
                "filter" : {
                    "romaji_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : true
                    },
                    "katakana_readingform" : {
                        "type" : "kuromoji_readingform",
                        "use_romaji" : false
                    }
                }
            }
        }
    }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "katakana_analyzer",
  "text": "寿司" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "romaji_analyzer",
  "text": "寿司" (2)
}
  1. Returns スシ.

  2. Returns sushi.

kuromoji_stemmer token filter

The kuromoji_stemmer token filter normalizes common katakana spelling variations ending in a long sound character by removing this character (U+30FC). Only full-width katakana characters are supported.

This token filter accepts the following setting:

minimum_length

Katakana words shorter than the minimum length are not stemmed (default is 4).

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "my_katakana_stemmer"
            ]
          }
        },
        "filter": {
          "my_katakana_stemmer": {
            "type": "kuromoji_stemmer",
            "minimum_length": 4
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "コピー" (1)
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "サーバー" (2)
}
  1. Returns コピー.

  2. Returns サーバ.

ja_stop token filter

The ja_stop token filter filters out Japanese stopwords (_japanese_), and any other custom stopwords specified by the user. This filter only supports the predefined _japanese_ stopwords list. If you want to use a different predefined list, then use the {ref}/analysis-stop-tokenfilter.html[stop token filter] instead.

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "analyzer_with_ja_stop": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "ja_stop"
            ]
          }
        },
        "filter": {
          "ja_stop": {
            "type": "ja_stop",
            "stopwords": [
              "_japanese_",
              "ストップ"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "analyzer_with_ja_stop",
  "text": "ストップは消える"
}

The above request returns:

{
  "tokens" : [ {
    "token" : "消える",
    "start_offset" : 5,
    "end_offset" : 8,
    "type" : "word",
    "position" : 2
  } ]
}

kuromoji_number token filter

The kuromoji_number token filter normalizes Japanese numbers (kansūji) to regular Arabic decimal numbers in half-width characters. For example:

PUT kuromoji_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "kuromoji_tokenizer",
            "filter": [
              "kuromoji_number"
            ]
          }
        }
      }
    }
  }
}

GET kuromoji_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "一〇〇〇"
}

Which results in:

{
  "tokens" : [ {
    "token" : "1000",
    "start_offset" : 0,
    "end_offset" : 4,
    "type" : "word",
    "position" : 0
  } ]
}

Korean (nori) Analysis Plugin

The Korean (nori) Analysis plugin integrates the Lucene nori analysis module into Elasticsearch. It uses the mecab-ko-dic dictionary to perform morphological analysis of Korean texts.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-nori

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-nori/analysis-nori-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-nori

The node must be stopped before removing the plugin.

nori analyzer

The nori analyzer consists of the following tokenizer and token filters:

  * nori_tokenizer
  * nori_part_of_speech token filter
  * nori_readingform token filter
  * lowercase token filter

It supports the decompound_mode and user_dictionary settings from nori_tokenizer and the stoptags setting from nori_part_of_speech.
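
Based on that statement, a rough sketch of configuring the analyzer itself (the index and analyzer names, and the chosen stoptags, are illustrative):

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_nori_analyzer": {
            "type": "nori",
            "decompound_mode": "mixed",
            "stoptags": [
              "E",
              "J"
            ]
          }
        }
      }
    }
  }
}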

nori_tokenizer

The nori_tokenizer accepts the following settings:

decompound_mode

The decompound mode determines how the tokenizer handles compound tokens. It can be set to:

none

No decomposition for compounds. Example output:

가거도항
가곡역
discard

Decomposes compounds and discards the original form (default). Example output:

가곡역 => 가곡, 역
mixed

Decomposes compounds and keeps the original form. Example output:

가곡역 => 가곡역, 가곡, 역
user_dictionary

The Nori tokenizer uses the mecab-ko-dic dictionary by default. A user_dictionary with custom nouns (NNG) may be appended to the default dictionary. The dictionary should have the following format:

<token> [<token 1> ... <token n>]

The first token is mandatory and represents the custom noun that should be added in the dictionary. For compound nouns the custom segmentation can be provided after the first token ([<token 1> ... <token n>]). The segmentation of the custom compound nouns is controlled by the decompound_mode setting.

As a demonstration of how the user dictionary can be used, save the following dictionary to $ES_HOME/config/userdict_ko.txt:

c++                 (1)
C샤프
세종
세종시 세종 시        (2)
  1. A simple noun

  2. A compound noun (세종시) followed by its decomposition: 세종 and 시.

Then create an analyzer as follows:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary": "userdict_ko.txt"
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "세종시"  (1)
}
  1. Sejong city

The above analyze request returns the following:

{
  "tokens" : [ {
    "token" : "세종시",
    "start_offset" : 0,
    "end_offset" : 3,
    "type" : "word",
    "position" : 0,
    "positionLength" : 2    (1)
  }, {
    "token" : "세종",
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }, {
    "token" : "시",
    "start_offset" : 2,
    "end_offset" : 3,
    "type" : "word",
    "position" : 1
   }]
}
  1. This is a compound token that spans two positions (mixed mode).

user_dictionary_rules

You can also inline the rules directly in the tokenizer definition using the user_dictionary_rules option:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "tokenizer": {
          "nori_user_dict": {
            "type": "nori_tokenizer",
            "decompound_mode": "mixed",
            "user_dictionary_rules": ["c++", "C샤프", "세종", "세종시 세종 시"]
          }
        },
        "analyzer": {
          "my_analyzer": {
            "type": "custom",
            "tokenizer": "nori_user_dict"
          }
        }
      }
    }
  }
}

The nori_tokenizer sets a number of additional attributes per token that are used by token filters to modify the stream. You can view all these additional attributes with the following request:

GET _analyze
{
  "tokenizer": "nori_tokenizer",
  "text": "뿌리가 깊은 나무는",   (1)
  "attributes" : ["posType", "leftPOS", "rightPOS", "morphemes", "reading"],
  "explain": true
}
  1. A tree with deep roots

Which responds with:

{
    "detail": {
        "custom_analyzer": true,
        "charfilters": [],
        "tokenizer": {
            "name": "nori_tokenizer",
            "tokens": [
                {
                    "token": "뿌리",
                    "start_offset": 0,
                    "end_offset": 2,
                    "type": "word",
                    "position": 0,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "가",
                    "start_offset": 2,
                    "end_offset": 3,
                    "type": "word",
                    "position": 1,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                },
                {
                    "token": "깊",
                    "start_offset": 4,
                    "end_offset": 5,
                    "type": "word",
                    "position": 2,
                    "leftPOS": "VA(Adjective)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "VA(Adjective)"
                },
                {
                    "token": "은",
                    "start_offset": 5,
                    "end_offset": 6,
                    "type": "word",
                    "position": 3,
                    "leftPOS": "E(Verbal endings)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "E(Verbal endings)"
                },
                {
                    "token": "나무",
                    "start_offset": 7,
                    "end_offset": 9,
                    "type": "word",
                    "position": 4,
                    "leftPOS": "NNG(General Noun)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "NNG(General Noun)"
                },
                {
                    "token": "는",
                    "start_offset": 9,
                    "end_offset": 10,
                    "type": "word",
                    "position": 5,
                    "leftPOS": "J(Ending Particle)",
                    "morphemes": null,
                    "posType": "MORPHEME",
                    "reading": null,
                    "rightPOS": "J(Ending Particle)"
                }
            ]
        },
        "tokenfilters": []
    }
}

nori_part_of_speech token filter

The nori_part_of_speech token filter removes tokens that match a set of part-of-speech tags. The list of supported tags and their meanings can be found here: {lucene-core-javadoc}/../analyzers-nori/org/apache/lucene/analysis/ko/POS.Tag.html[Part of speech tags]

It accepts the following setting:

stoptags

An array of part-of-speech tags that should be removed.

and defaults to:

"stoptags": [
    "E",
    "IC",
    "J",
    "MAG", "MAJ", "MM",
    "SP", "SSC", "SSO", "SC", "SE",
    "XPN", "XSA", "XSN", "XSV",
    "UNA", "NA", "VSV"
]

For example:

PUT nori_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "nori_tokenizer",
            "filter": [
              "my_posfilter"
            ]
          }
        },
        "filter": {
          "my_posfilter": {
            "type": "nori_part_of_speech",
            "stoptags": [
              "NR"   (1)
            ]
          }
        }
      }
    }
  }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "여섯 용이"  (2)
}
  1. Korean numerals should be removed (NR)

  2. Six dragons

Which responds with:

{
  "tokens" : [ {
    "token" : "용",
    "start_offset" : 3,
    "end_offset" : 4,
    "type" : "word",
    "position" : 1
  }, {
    "token" : "이",
    "start_offset" : 4,
    "end_offset" : 5,
    "type" : "word",
    "position" : 2
  } ]
}

nori_readingform token filter

The nori_readingform token filter rewrites tokens written in Hanja to their Hangul form.

PUT nori_sample
{
    "settings": {
        "index":{
            "analysis":{
                "analyzer" : {
                    "my_analyzer" : {
                        "tokenizer" : "nori_tokenizer",
                        "filter" : ["nori_readingform"]
                    }
                }
            }
        }
    }
}

GET nori_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "鄕歌"      (1)
}
  1. A token written in Hanja: Hyangga

Which responds with:

{
  "tokens" : [ {
    "token" : "향가",     (1)
    "start_offset" : 0,
    "end_offset" : 2,
    "type" : "word",
    "position" : 0
  }]
}
  1. The Hanja form is replaced by the Hangul translation.

Phonetic Analysis Plugin

The Phonetic Analysis plugin provides token filters which convert tokens to their phonetic representation using Soundex, Metaphone, and a variety of other algorithms.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-phonetic

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-phonetic/analysis-phonetic-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-phonetic

The node must be stopped before removing the plugin.

phonetic token filter

The phonetic token filter takes the following settings:

encoder

Which phonetic encoder to use. Accepts metaphone (default), double_metaphone, soundex, refined_soundex, caverphone1, caverphone2, cologne, nysiis, koelnerphonetik, haasephonetik, beider_morse, daitch_mokotoff.

replace

Whether or not the original token should be replaced by the phonetic token. Accepts true (default) and false. Not supported by beider_morse encoding.

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_metaphone"
            ]
          }
        },
        "filter": {
          "my_metaphone": {
            "type": "phonetic",
            "encoder": "metaphone",
            "replace": false
          }
        }
      }
    }
  }
}

GET phonetic_sample/_analyze
{
  "analyzer": "my_analyzer",
  "text": "Joe Bloggs" (1)
}
  1. Returns: J, joe, BLKS, bloggs

Double metaphone settings

If the double_metaphone encoder is used, then this additional setting is supported:

max_code_len

The maximum length of the emitted metaphone token. Defaults to 4.
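
For example, a sketch (the filter and analyzer names are illustrative) of a double_metaphone filter that emits longer codes:

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_dm_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_double_metaphone"
            ]
          }
        },
        "filter": {
          "my_double_metaphone": {
            "type": "phonetic",
            "encoder": "double_metaphone",
            "max_code_len": 6
          }
        }
      }
    }
  }
}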

Beider Morse settings

If the beider_morse encoder is used, then these additional settings are supported:

rule_type

Whether matching should be exact or approx (default).

name_type

Whether names are ashkenazi, sephardic, or generic (default).

languageset

An array of languages to check. If not specified, then the language will be guessed. Accepts: any, common, cyrillic, english, french, german, hebrew, hungarian, polish, romanian, russian, spanish.
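
Putting these settings together, a sketch (the filter and analyzer names and the chosen language set are illustrative) of a beider_morse filter:

PUT phonetic_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_bm_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "my_beider_morse"
            ]
          }
        },
        "filter": {
          "my_beider_morse": {
            "type": "phonetic",
            "encoder": "beider_morse",
            "rule_type": "exact",
            "name_type": "generic",
            "languageset": [
              "english",
              "german"
            ]
          }
        }
      }
    }
  }
}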

Smart Chinese Analysis Plugin

The Smart Chinese Analysis plugin integrates Lucene’s Smart Chinese analysis module into Elasticsearch.

It provides an analyzer for Chinese or mixed Chinese-English text. This analyzer uses probabilistic knowledge to find the optimal word segmentation for Simplified Chinese text. The text is first broken into sentences, then each sentence is segmented into words.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-smartcn

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-smartcn/analysis-smartcn-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-smartcn

The node must be stopped before removing the plugin.

smartcn tokenizer and token filter

The plugin provides the smartcn analyzer and smartcn_tokenizer tokenizer, which are not configurable.

Note
The smartcn_word token filter and smartcn_sentence tokenizer have been deprecated.
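
Since neither component takes settings, usage amounts to referencing them by name. A minimal sketch (the index and analyzer names are illustrative):

PUT smartcn_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_smartcn_analyzer": {
            "tokenizer": "smartcn_tokenizer"
          }
        }
      }
    }
  }
}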

Stempel Polish Analysis Plugin

The Stempel Analysis plugin integrates Lucene’s Stempel analysis module for Polish into Elasticsearch.

It provides high quality stemming for Polish, based on the Egothor project.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-stempel

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-stempel/analysis-stempel-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-stempel

The node must be stopped before removing the plugin.

stempel tokenizer and token filter

The plugin provides the polish analyzer and polish_stem token filter, which are not configurable.
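
A minimal usage sketch (the index and analyzer names are illustrative), combining polish_stem with the standard tokenizer and lowercasing:

PUT stempel_sample
{
  "settings": {
    "index": {
      "analysis": {
        "analyzer": {
          "my_polish_analyzer": {
            "tokenizer": "standard",
            "filter": [
              "lowercase",
              "polish_stem"
            ]
          }
        }
      }
    }
  }
}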

Ukrainian Analysis Plugin

The Ukrainian Analysis plugin integrates Lucene’s UkrainianMorfologikAnalyzer into Elasticsearch.

It provides stemming for Ukrainian using the Morfologik project.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install analysis-ukrainian

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/analysis-ukrainian/analysis-ukrainian-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove analysis-ukrainian

The node must be stopped before removing the plugin.

ukrainian analyzer

The plugin provides the ukrainian analyzer.
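
A minimal usage sketch (the index and field names are illustrative) that applies the analyzer to a text field:

PUT ukrainian_sample
{
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text",
          "analyzer": "ukrainian"
        }
      }
    }
  }
}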