"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/reference/analysis.asciidoc" (29 Dec 2021, 3365 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Anatomy of an analyzer

An analyzer  — whether built-in or custom — is just a package which contains three lower-level building blocks: character filters, tokenizers, and token filters.

The built-in analyzers pre-package these building blocks into analyzers suitable for different languages and types of text. Elasticsearch also exposes the individual building blocks so that they can be combined to define new custom analyzers.

Character filters

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

An analyzer may have zero or more character filters, which are applied in order.
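For example, the built-in html_strip character filter can be tried on its own with the analyze API (described under "Testing analyzers" below). A minimal sketch, using the keyword tokenizer so the filtered text comes back as a single term:

POST _analyze
{
  "tokenizer":   "keyword",
  "char_filter": [ "html_strip" ],
  "text":        "<b>Quick</b> brown fox"
}

This should return a single term with the <b> tags stripped: [ Quick brown fox ].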

Tokenizer

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term and the start and end character offsets of the original word which the term represents.

An analyzer must have exactly one tokenizer.
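For example, the whitespace tokenizer from the paragraph above can be exercised directly with the analyze API (a minimal sketch):

POST _analyze
{
  "tokenizer": "whitespace",
  "text":      "Quick brown fox!"
}

This should produce the terms [ Quick, brown, fox! ].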

Token filters

A token filter receives the token stream and may add, remove, or change tokens. For example, a lowercase token filter converts all tokens to lowercase, a stop token filter removes common words (stop words) like "the" from the token stream, and a synonym token filter introduces synonyms into the token stream.

Token filters are not allowed to change the position or character offsets of each token.

An analyzer may have zero or more token filters, which are applied in order.
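For example, the following sketch chains the lowercase and stop token filters after the standard tokenizer; the stop filter uses its default English stop word list:

POST _analyze
{
  "tokenizer": "standard",
  "filter":    [ "lowercase", "stop" ],
  "text":      "The QUICK brown fox"
}

This should produce the terms [ quick, brown, fox ]: lowercase runs first, then stop removes "the".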

Testing analyzers

The analyze API is an invaluable tool for viewing the terms produced by an analyzer. A built-in analyzer (or combination of built-in tokenizer, token filters, and character filters) can be specified inline in the request:

POST _analyze
{
  "analyzer": "whitespace",
  "text":     "The quick brown fox."
}

POST _analyze
{
  "tokenizer": "standard",
  "filter":  [ "lowercase", "asciifolding" ],
  "text":      "Is this déja vu?"
}

Positions and character offsets

As can be seen from the output of the analyze API, analyzers not only convert words into terms, they also record the order or relative positions of each term (used for phrase queries or word proximity queries), and the start and end character offsets of each term in the original text (used for highlighting search snippets).
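For illustration, the whitespace request above returns output shaped roughly like the following, with a position and character offsets for every term (abridged; exact token types depend on the tokenizer used):

{
  "tokens": [
    { "token": "The",   "start_offset": 0,  "end_offset": 3,  "type": "word", "position": 0 },
    { "token": "quick", "start_offset": 4,  "end_offset": 9,  "type": "word", "position": 1 },
    { "token": "brown", "start_offset": 10, "end_offset": 15, "type": "word", "position": 2 },
    { "token": "fox.",  "start_offset": 16, "end_offset": 20, "type": "word", "position": 3 }
  ]
}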

Alternatively, a custom analyzer can be referred to when running the analyze API on a specific index:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_folded": { (1)
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type": "text",
          "analyzer": "std_folded" (2)
        }
      }
    }
  }
}

GET my_index/_analyze (3)
{
  "analyzer": "std_folded", (4)
  "text":     "Is this déjà vu?"
}

GET my_index/_analyze (3)
{
  "field": "my_text", (5)
  "text":  "Is this déjà vu?"
}
  1. Define a custom analyzer called std_folded.

  2. The field my_text uses the std_folded analyzer.

  3. To refer to this analyzer, the analyze API must specify the index name.

  4. Refer to the analyzer by name.

  5. Refer to the analyzer used by field my_text.

Analyzers

Elasticsearch ships with a wide range of built-in analyzers, which can be used in any index without further configuration:

Standard Analyzer

The standard analyzer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation, lowercases terms, and supports removing stop words.

Simple Analyzer

The simple analyzer divides text into terms whenever it encounters a character which is not a letter. It lowercases all terms.

Whitespace Analyzer

The whitespace analyzer divides text into terms whenever it encounters any whitespace character. It does not lowercase terms.

Stop Analyzer

The stop analyzer is like the simple analyzer, but also supports removal of stop words.

Keyword Analyzer

The keyword analyzer is a "noop" analyzer that accepts whatever text it is given and outputs the exact same text as a single term.

Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. It supports lower-casing and stop words.

Language Analyzers

Elasticsearch provides many language-specific analyzers like english or french.

Fingerprint Analyzer

The fingerprint analyzer is a specialist analyzer which creates a fingerprint which can be used for duplicate detection.

Custom analyzers

If you do not find an analyzer suitable for your needs, you can create a custom analyzer which combines the appropriate character filters, tokenizer, and token filters.

Configuring built-in analyzers

The built-in analyzers can be used directly without any configuration. Some of them, however, support configuration options to alter their behaviour. For instance, the standard analyzer can be configured to support a list of stop words:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_english": { (1)
          "type":      "standard",
          "stopwords": "_english_"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_text": {
          "type":     "text",
          "analyzer": "standard", (2)
          "fields": {
            "english": {
              "type":     "text",
              "analyzer": "std_english" (3)
            }
          }
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "field": "my_text", (2)
  "text": "The old brown cow"
}

POST my_index/_analyze
{
  "field": "my_text.english", (3)
  "text": "The old brown cow"
}
  1. We define the std_english analyzer to be based on the standard analyzer, but configured to remove the pre-defined list of English stopwords.

  2. The my_text field uses the standard analyzer directly, without any configuration. No stop words will be removed from this field. The resulting terms are: [ the, old, brown, cow ]

  3. The my_text.english field uses the std_english analyzer, so English stop words will be removed. The resulting terms are: [ old, brown, cow ]

Standard Analyzer

The standard analyzer is the default analyzer which is used if none is specified. It provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

Example output

POST _analyze
{
  "analyzer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog's, bone ]

Configuration

The standard analyzer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.

Example configuration

In this example, we configure the standard analyzer to have a max_token_length of 5 (for demonstration purposes), and to use the pre-defined list of English stop words:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_english_analyzer": {
          "type": "standard",
          "max_token_length": 5,
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_english_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ 2, quick, brown, foxes, jumpe, d, over, lazy, dog's, bone ]
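The stopwords_path parameter works the same way but reads the stop words from a file, one word per line, relative to the Elasticsearch config directory. A sketch (the analyzer name and file path are only illustrations):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "std_stopwords_file": {    (1)
          "type":           "standard",
          "stopwords_path": "analysis/my_stopwords.txt"
        }
      }
    }
  }
}
  1. Hypothetical analyzer name and file path, shown only to illustrate the stopwords_path parameter.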

Definition

The standard analyzer consists of:

Tokenizer
  Standard Tokenizer
Token Filters
  Lower Case Token Filter
  Stop Token Filter (disabled by default)

If you need to customize the standard analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in standard analyzer and you can use it as a starting point:

PUT /standard_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_standard": {
          "tokenizer": "standard",
          "filter": [
            "lowercase"       (1)
          ]
        }
      }
    }
  }
}
  1. You’d add any token filters after lowercase.

Simple Analyzer

The simple analyzer breaks text into terms whenever it encounters a character which is not a letter. All terms are lower cased.

Example output

POST _analyze
{
  "analyzer": "simple",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration

The simple analyzer is not configurable.

Definition

The simple analyzer consists of:

Tokenizer
  Lower Case Tokenizer

If you need to customize the simple analyzer then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in simple analyzer and you can use it as a starting point for further customization:

PUT /simple_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_simple": {
          "tokenizer": "lowercase",
          "filter": [         (1)
          ]
        }
      }
    }
  }
}
  1. You’d add any token filters here.

Whitespace Analyzer

The whitespace analyzer breaks text into terms whenever it encounters a whitespace character.

Example output

POST _analyze
{
  "analyzer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Configuration

The whitespace analyzer is not configurable.

Definition

It consists of:

Tokenizer
  Whitespace Tokenizer

If you need to customize the whitespace analyzer then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in whitespace analyzer and you can use it as a starting point for further customization:

PUT /whitespace_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_whitespace": {
          "tokenizer": "whitespace",
          "filter": [         (1)
          ]
        }
      }
    }
  }
}
  1. You’d add any token filters here.

Stop Analyzer

The stop analyzer is the same as the simple analyzer but adds support for removing stop words. It defaults to using the _english_ stop words.

Example output

POST _analyze
{
  "analyzer": "stop",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ quick, brown, foxes, jumped, over, lazy, dog, s, bone ]

Configuration

The stop analyzer accepts the following parameters:

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _english_.

stopwords_path

The path to a file containing stop words. This path is relative to the Elasticsearch config directory.

See the Stop Token Filter for more information about stop word configuration.

Example configuration

In this example, we configure the stop analyzer to use a specified list of words as stop words:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_stop_analyzer": {
          "type": "stop",
          "stopwords": ["the", "over"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_stop_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ quick, brown, foxes, jumped, lazy, dog, s, bone ]

Definition

It consists of:

Tokenizer
  Lower Case Tokenizer
Token filters
  Stop Token Filter

If you need to customize the stop analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in stop analyzer and you can use it as a starting point for further customization:

PUT /stop_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" (1)
        }
      },
      "analyzer": {
        "rebuilt_stop": {
          "tokenizer": "lowercase",
          "filter": [
            "english_stop"          (2)
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. You’d add any token filters after english_stop.

Keyword Analyzer

The keyword analyzer is a "noop" analyzer which returns the entire input string as a single token.

Example output

POST _analyze
{
  "analyzer": "keyword",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following single term:

[ The 2 QUICK Brown-Foxes jumped over the lazy dog's bone. ]

Configuration

The keyword analyzer is not configurable.

Definition

The keyword analyzer consists of:

Tokenizer
  Keyword Tokenizer

If you need to customize the keyword analyzer then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. Usually, you should prefer the Keyword type when you want strings that are not split into tokens, but just in case you need it, this would recreate the built-in keyword analyzer and you can use it as a starting point for further customization:

PUT /keyword_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_keyword": {
          "tokenizer": "keyword",
          "filter": [         (1)
          ]
        }
      }
    }
  }
}
  1. You’d add any token filters here.

Pattern Analyzer

The pattern analyzer uses a regular expression to split the text into terms. The regular expression should match the token separators not the tokens themselves. The regular expression defaults to \W+ (or all non-word characters).

Warning
Beware of Pathological Regular Expressions

The pattern analyzer uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Example output

POST _analyze
{
  "analyzer": "pattern",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ the, 2, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration

The pattern analyzer accepts the following parameters:

pattern

A Java regular expression, defaults to \W+.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

lowercase

Should terms be lowercased or not. Defaults to true.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.

Example configuration

In this example, we configure the pattern analyzer to split email addresses on non-word characters or on underscores (\W|_), and to lower-case the result:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_email_analyzer": {
          "type":      "pattern",
          "pattern":   "\\W|_", (1)
          "lowercase": true
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_email_analyzer",
  "text": "John_Smith@foo-bar.com"
}
  1. The backslashes in the pattern need to be escaped when specifying the pattern as a JSON string.

The above example produces the following terms:

[ john, smith, foo, bar, com ]
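The flags parameter can be combined with a pattern in the same way. As a sketch (the analyzer name and pattern are only illustrations), this analyzer splits on the word "and" regardless of letter case:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_and_splitter": {    (1)
          "type":    "pattern",
          "pattern": "\\s+and\\s+",
          "flags":   "CASE_INSENSITIVE"
        }
      }
    }
  }
}
  1. Hypothetical analyzer name and pattern, shown only to illustrate the flags parameter.

With this configuration, the text "foo AND bar" should produce the terms [ foo, bar ].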

CamelCase tokenizer

The following more complicated example splits CamelCase text into tokens:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "camel": {
          "type": "pattern",
          "pattern": "([^\\p{L}\\d]+)|(?<=\\D)(?=\\d)|(?<=\\d)(?=\\D)|(?<=[\\p{L}&&[^\\p{Lu}]])(?=\\p{Lu})|(?<=\\p{Lu})(?=\\p{Lu}[\\p{L}&&[^\\p{Lu}]])"
        }
      }
    }
  }
}

GET my_index/_analyze
{
  "analyzer": "camel",
  "text": "MooseX::FTPClass2_beta"
}

The above example produces the following terms:

[ moose, x, ftp, class, 2, beta ]

The regex above is easier to understand as:

  ([^\p{L}\d]+)                 # swallow non letters and numbers,
| (?<=\D)(?=\d)                 # or non-number followed by number,
| (?<=\d)(?=\D)                 # or number followed by non-number,
| (?<=[ \p{L} && [^\p{Lu}]])    # or lower case
  (?=\p{Lu})                    #   followed by upper case,
| (?<=\p{Lu})                   # or upper case
  (?=\p{Lu}                     #   followed by upper case
    [\p{L}&&[^\p{Lu}]]          #   then lower case
  )

Definition

The pattern analyzer consists of:

Tokenizer
  Pattern Tokenizer
Token Filters
  Lower Case Token Filter
  Stop Token Filter (disabled by default)

If you need to customize the pattern analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in pattern analyzer and you can use it as a starting point for further customization:

PUT /pattern_example
{
  "settings": {
    "analysis": {
      "tokenizer": {
        "split_on_non_word": {
          "type":       "pattern",
          "pattern":    "\\W+" (1)
        }
      },
      "analyzer": {
        "rebuilt_pattern": {
          "tokenizer": "split_on_non_word",
          "filter": [
            "lowercase"       (2)
          ]
        }
      }
    }
  }
}
  1. The default pattern is \W+ which splits on non-word characters and this is where you’d change it.

  2. You’d add other token filters after lowercase.

Language Analyzers

A set of analyzers aimed at analyzing specific language text. The following types are supported: arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, cjk, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, turkish, thai.

Configuring language analyzers

Stopwords

All analyzers support setting custom stopwords either internally in the config, or by using an external stopwords file by setting stopwords_path. Check Stop Analyzer for more details.

Excluding words from stemming

The stem_exclusion parameter allows you to specify an array of lowercase words that should not be stemmed. Internally, this functionality is implemented by adding the keyword_marker token filter with the keywords set to the value of the stem_exclusion parameter.

The following analyzers support setting custom stem_exclusion list: arabic, armenian, basque, bengali, bulgarian, catalan, czech, dutch, english, finnish, french, galician, german, hindi, hungarian, indonesian, irish, italian, latvian, lithuanian, norwegian, portuguese, romanian, russian, sorani, spanish, swedish, turkish.
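For example, the english analyzer could be configured to leave a couple of words unstemmed. A sketch (the analyzer name and word list are only illustrations):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "english_no_stem": {    (1)
          "type":           "english",
          "stem_exclusion": [ "organization", "organizations" ]
        }
      }
    }
  }
}
  1. Hypothetical analyzer name and word list, shown only to illustrate the stem_exclusion parameter.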

Reimplementing language analyzers

The built-in language analyzers can be reimplemented as custom analyzers (as described below) in order to customize their behaviour.

Note
If you do not intend to exclude words from being stemmed (the equivalent of the stem_exclusion parameter above), then you should remove the keyword_marker token filter from the custom analyzer configuration.
arabic analyzer

The arabic analyzer could be reimplemented as a custom analyzer as follows:

PUT /arabic_example
{
  "settings": {
    "analysis": {
      "filter": {
        "arabic_stop": {
          "type":       "stop",
          "stopwords":  "_arabic_" (1)
        },
        "arabic_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["مثال"] (2)
        },
        "arabic_stemmer": {
          "type":       "stemmer",
          "language":   "arabic"
        }
      },
      "analyzer": {
        "rebuilt_arabic": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_stop",
            "arabic_normalization",
            "arabic_keywords",
            "arabic_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

armenian analyzer

The armenian analyzer could be reimplemented as a custom analyzer as follows:

PUT /armenian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "armenian_stop": {
          "type":       "stop",
          "stopwords":  "_armenian_" (1)
        },
        "armenian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["օրինակ"] (2)
        },
        "armenian_stemmer": {
          "type":       "stemmer",
          "language":   "armenian"
        }
      },
      "analyzer": {
        "rebuilt_armenian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "armenian_stop",
            "armenian_keywords",
            "armenian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

basque analyzer

The basque analyzer could be reimplemented as a custom analyzer as follows:

PUT /basque_example
{
  "settings": {
    "analysis": {
      "filter": {
        "basque_stop": {
          "type":       "stop",
          "stopwords":  "_basque_" (1)
        },
        "basque_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["Adibidez"] (2)
        },
        "basque_stemmer": {
          "type":       "stemmer",
          "language":   "basque"
        }
      },
      "analyzer": {
        "rebuilt_basque": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "basque_stop",
            "basque_keywords",
            "basque_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

bengali analyzer

The bengali analyzer could be reimplemented as a custom analyzer as follows:

PUT /bengali_example
{
  "settings": {
    "analysis": {
      "filter": {
        "bengali_stop": {
          "type":       "stop",
          "stopwords":  "_bengali_" (1)
        },
        "bengali_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["উদাহরণ"] (2)
        },
        "bengali_stemmer": {
          "type":       "stemmer",
          "language":   "bengali"
        }
      },
      "analyzer": {
        "rebuilt_bengali": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "bengali_keywords",
            "indic_normalization",
            "bengali_normalization",
            "bengali_stop",
            "bengali_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

brazilian analyzer

The brazilian analyzer could be reimplemented as a custom analyzer as follows:

PUT /brazilian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "brazilian_stop": {
          "type":       "stop",
          "stopwords":  "_brazilian_" (1)
        },
        "brazilian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exemplo"] (2)
        },
        "brazilian_stemmer": {
          "type":       "stemmer",
          "language":   "brazilian"
        }
      },
      "analyzer": {
        "rebuilt_brazilian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "brazilian_stop",
            "brazilian_keywords",
            "brazilian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

bulgarian analyzer

The bulgarian analyzer could be reimplemented as a custom analyzer as follows:

PUT /bulgarian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "bulgarian_stop": {
          "type":       "stop",
          "stopwords":  "_bulgarian_" (1)
        },
        "bulgarian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["пример"] (2)
        },
        "bulgarian_stemmer": {
          "type":       "stemmer",
          "language":   "bulgarian"
        }
      },
      "analyzer": {
        "rebuilt_bulgarian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "bulgarian_stop",
            "bulgarian_keywords",
            "bulgarian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

catalan analyzer

The catalan analyzer could be reimplemented as a custom analyzer as follows:

PUT /catalan_example
{
  "settings": {
    "analysis": {
      "filter": {
        "catalan_elision": {
          "type":       "elision",
          "articles":   [ "d", "l", "m", "n", "s", "t"],
          "articles_case": true
        },
        "catalan_stop": {
          "type":       "stop",
          "stopwords":  "_catalan_" (1)
        },
        "catalan_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exemple"] (2)
        },
        "catalan_stemmer": {
          "type":       "stemmer",
          "language":   "catalan"
        }
      },
      "analyzer": {
        "rebuilt_catalan": {
          "tokenizer":  "standard",
          "filter": [
            "catalan_elision",
            "lowercase",
            "catalan_stop",
            "catalan_keywords",
            "catalan_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

cjk analyzer
Note
You may find that icu_analyzer in the ICU analysis plugin works better for CJK text than the cjk analyzer. Experiment with your text and queries.

The cjk analyzer could be reimplemented as a custom analyzer as follows:

PUT /cjk_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  [ (1)
            "a", "and", "are", "as", "at", "be", "but", "by", "for",
            "if", "in", "into", "is", "it", "no", "not", "of", "on",
            "or", "s", "such", "t", "that", "the", "their", "then",
            "there", "these", "they", "this", "to", "was", "will",
            "with", "www"
          ]
        }
      },
      "analyzer": {
        "rebuilt_cjk": {
          "tokenizer":  "standard",
          "filter": [
            "cjk_width",
            "lowercase",
            "cjk_bigram",
            "english_stop"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters. The default stop words are almost the same as the _english_ set, but not exactly the same.

czech analyzer

The czech analyzer could be reimplemented as a custom analyzer as follows:

PUT /czech_example
{
  "settings": {
    "analysis": {
      "filter": {
        "czech_stop": {
          "type":       "stop",
          "stopwords":  "_czech_" (1)
        },
        "czech_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["příklad"] (2)
        },
        "czech_stemmer": {
          "type":       "stemmer",
          "language":   "czech"
        }
      },
      "analyzer": {
        "rebuilt_czech": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "czech_stop",
            "czech_keywords",
            "czech_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

danish analyzer

The danish analyzer could be reimplemented as a custom analyzer as follows:

PUT /danish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "danish_stop": {
          "type":       "stop",
          "stopwords":  "_danish_" (1)
        },
        "danish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["eksempel"] (2)
        },
        "danish_stemmer": {
          "type":       "stemmer",
          "language":   "danish"
        }
      },
      "analyzer": {
        "rebuilt_danish": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "danish_stop",
            "danish_keywords",
            "danish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

dutch analyzer

The dutch analyzer could be reimplemented as a custom analyzer as follows:

PUT /dutch_example
{
  "settings": {
    "analysis": {
      "filter": {
        "dutch_stop": {
          "type":       "stop",
          "stopwords":  "_dutch_" (1)
        },
        "dutch_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["voorbeeld"] (2)
        },
        "dutch_stemmer": {
          "type":       "stemmer",
          "language":   "dutch"
        },
        "dutch_override": {
          "type":       "stemmer_override",
          "rules": [
            "fiets=>fiets",
            "bromfiets=>bromfiets",
            "ei=>eier",
            "kind=>kinder"
          ]
        }
      },
      "analyzer": {
        "rebuilt_dutch": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "dutch_stop",
            "dutch_keywords",
            "dutch_override",
            "dutch_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

english analyzer

The english analyzer could be reimplemented as a custom analyzer as follows:

PUT /english_example
{
  "settings": {
    "analysis": {
      "filter": {
        "english_stop": {
          "type":       "stop",
          "stopwords":  "_english_" (1)
        },
        "english_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["example"] (2)
        },
        "english_stemmer": {
          "type":       "stemmer",
          "language":   "english"
        },
        "english_possessive_stemmer": {
          "type":       "stemmer",
          "language":   "possessive_english"
        }
      },
      "analyzer": {
        "rebuilt_english": {
          "tokenizer":  "standard",
          "filter": [
            "english_possessive_stemmer",
            "lowercase",
            "english_stop",
            "english_keywords",
            "english_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

finnish analyzer

The finnish analyzer could be reimplemented as a custom analyzer as follows:

PUT /finnish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "finnish_stop": {
          "type":       "stop",
          "stopwords":  "_finnish_" (1)
        },
        "finnish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["esimerkki"] (2)
        },
        "finnish_stemmer": {
          "type":       "stemmer",
          "language":   "finnish"
        }
      },
      "analyzer": {
        "rebuilt_finnish": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "finnish_stop",
            "finnish_keywords",
            "finnish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

french analyzer

The french analyzer could be reimplemented as a custom analyzer as follows:

PUT /french_example
{
  "settings": {
    "analysis": {
      "filter": {
        "french_elision": {
          "type":         "elision",
          "articles_case": true,
          "articles": [
              "l", "m", "t", "qu", "n", "s",
              "j", "d", "c", "jusqu", "quoiqu",
              "lorsqu", "puisqu"
            ]
        },
        "french_stop": {
          "type":       "stop",
          "stopwords":  "_french_" (1)
        },
        "french_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["Exemple"] (2)
        },
        "french_stemmer": {
          "type":       "stemmer",
          "language":   "light_french"
        }
      },
      "analyzer": {
        "rebuilt_french": {
          "tokenizer":  "standard",
          "filter": [
            "french_elision",
            "lowercase",
            "french_stop",
            "french_keywords",
            "french_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

galician analyzer

The galician analyzer could be reimplemented as a custom analyzer as follows:

PUT /galician_example
{
  "settings": {
    "analysis": {
      "filter": {
        "galician_stop": {
          "type":       "stop",
          "stopwords":  "_galician_" (1)
        },
        "galician_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exemplo"] (2)
        },
        "galician_stemmer": {
          "type":       "stemmer",
          "language":   "galician"
        }
      },
      "analyzer": {
        "rebuilt_galician": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "galician_stop",
            "galician_keywords",
            "galician_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

german analyzer

The german analyzer could be reimplemented as a custom analyzer as follows:

PUT /german_example
{
  "settings": {
    "analysis": {
      "filter": {
        "german_stop": {
          "type":       "stop",
          "stopwords":  "_german_" (1)
        },
        "german_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["Beispiel"] (2)
        },
        "german_stemmer": {
          "type":       "stemmer",
          "language":   "light_german"
        }
      },
      "analyzer": {
        "rebuilt_german": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "german_stop",
            "german_keywords",
            "german_normalization",
            "german_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

greek analyzer

The greek analyzer could be reimplemented as a custom analyzer as follows:

PUT /greek_example
{
  "settings": {
    "analysis": {
      "filter": {
        "greek_stop": {
          "type":       "stop",
          "stopwords":  "_greek_" (1)
        },
        "greek_lowercase": {
          "type":       "lowercase",
          "language":   "greek"
        },
        "greek_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["παράδειγμα"] (2)
        },
        "greek_stemmer": {
          "type":       "stemmer",
          "language":   "greek"
        }
      },
      "analyzer": {
        "rebuilt_greek": {
          "tokenizer":  "standard",
          "filter": [
            "greek_lowercase",
            "greek_stop",
            "greek_keywords",
            "greek_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

hindi analyzer

The hindi analyzer could be reimplemented as a custom analyzer as follows:

PUT /hindi_example
{
  "settings": {
    "analysis": {
      "filter": {
        "hindi_stop": {
          "type":       "stop",
          "stopwords":  "_hindi_" (1)
        },
        "hindi_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["उदाहरण"] (2)
        },
        "hindi_stemmer": {
          "type":       "stemmer",
          "language":   "hindi"
        }
      },
      "analyzer": {
        "rebuilt_hindi": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "decimal_digit",
            "hindi_keywords",
            "indic_normalization",
            "hindi_normalization",
            "hindi_stop",
            "hindi_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

hungarian analyzer

The hungarian analyzer could be reimplemented as a custom analyzer as follows:

PUT /hungarian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "hungarian_stop": {
          "type":       "stop",
          "stopwords":  "_hungarian_" (1)
        },
        "hungarian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["példa"] (2)
        },
        "hungarian_stemmer": {
          "type":       "stemmer",
          "language":   "hungarian"
        }
      },
      "analyzer": {
        "rebuilt_hungarian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "hungarian_stop",
            "hungarian_keywords",
            "hungarian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

indonesian analyzer

The indonesian analyzer could be reimplemented as a custom analyzer as follows:

PUT /indonesian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "indonesian_stop": {
          "type":       "stop",
          "stopwords":  "_indonesian_" (1)
        },
        "indonesian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["contoh"] (2)
        },
        "indonesian_stemmer": {
          "type":       "stemmer",
          "language":   "indonesian"
        }
      },
      "analyzer": {
        "rebuilt_indonesian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "indonesian_stop",
            "indonesian_keywords",
            "indonesian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

irish analyzer

The irish analyzer could be reimplemented as a custom analyzer as follows:

PUT /irish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "irish_hyphenation": {
          "type":       "stop",
          "stopwords":  [ "h", "n", "t" ],
          "ignore_case": true
        },
        "irish_elision": {
          "type":       "elision",
          "articles":   [ "d", "m", "b" ],
          "articles_case": true
        },
        "irish_stop": {
          "type":       "stop",
          "stopwords":  "_irish_" (1)
        },
        "irish_lowercase": {
          "type":       "lowercase",
          "language":   "irish"
        },
        "irish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["sampla"] (2)
        },
        "irish_stemmer": {
          "type":       "stemmer",
          "language":   "irish"
        }
      },
      "analyzer": {
        "rebuilt_irish": {
          "tokenizer":  "standard",
          "filter": [
            "irish_hyphenation",
            "irish_elision",
            "irish_lowercase",
            "irish_stop",
            "irish_keywords",
            "irish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

italian analyzer

The italian analyzer could be reimplemented as a custom analyzer as follows:

PUT /italian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "italian_elision": {
          "type": "elision",
          "articles": [
                "c", "l", "all", "dall", "dell",
                "nell", "sull", "coll", "pell",
                "gl", "agl", "dagl", "degl", "negl",
                "sugl", "un", "m", "t", "s", "v", "d"
          ],
          "articles_case": true
        },
        "italian_stop": {
          "type":       "stop",
          "stopwords":  "_italian_" (1)
        },
        "italian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["esempio"] (2)
        },
        "italian_stemmer": {
          "type":       "stemmer",
          "language":   "light_italian"
        }
      },
      "analyzer": {
        "rebuilt_italian": {
          "tokenizer":  "standard",
          "filter": [
            "italian_elision",
            "lowercase",
            "italian_stop",
            "italian_keywords",
            "italian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

latvian analyzer

The latvian analyzer could be reimplemented as a custom analyzer as follows:

PUT /latvian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "latvian_stop": {
          "type":       "stop",
          "stopwords":  "_latvian_" (1)
        },
        "latvian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["piemērs"] (2)
        },
        "latvian_stemmer": {
          "type":       "stemmer",
          "language":   "latvian"
        }
      },
      "analyzer": {
        "rebuilt_latvian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "latvian_stop",
            "latvian_keywords",
            "latvian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

lithuanian analyzer

The lithuanian analyzer could be reimplemented as a custom analyzer as follows:

PUT /lithuanian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "lithuanian_stop": {
          "type":       "stop",
          "stopwords":  "_lithuanian_" (1)
        },
        "lithuanian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["pavyzdys"] (2)
        },
        "lithuanian_stemmer": {
          "type":       "stemmer",
          "language":   "lithuanian"
        }
      },
      "analyzer": {
        "rebuilt_lithuanian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "lithuanian_stop",
            "lithuanian_keywords",
            "lithuanian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

norwegian analyzer

The norwegian analyzer could be reimplemented as a custom analyzer as follows:

PUT /norwegian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "norwegian_stop": {
          "type":       "stop",
          "stopwords":  "_norwegian_" (1)
        },
        "norwegian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["eksempel"] (2)
        },
        "norwegian_stemmer": {
          "type":       "stemmer",
          "language":   "norwegian"
        }
      },
      "analyzer": {
        "rebuilt_norwegian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "norwegian_stop",
            "norwegian_keywords",
            "norwegian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

persian analyzer

The persian analyzer could be reimplemented as a custom analyzer as follows:

PUT /persian_example
{
  "settings": {
    "analysis": {
      "char_filter": {
        "zero_width_spaces": {
            "type":       "mapping",
            "mappings": [ "\\u200C=>\\u0020"] (1)
        }
      },
      "filter": {
        "persian_stop": {
          "type":       "stop",
          "stopwords":  "_persian_" (2)
        }
      },
      "analyzer": {
        "rebuilt_persian": {
          "tokenizer":     "standard",
          "char_filter": [ "zero_width_spaces" ],
          "filter": [
            "lowercase",
            "decimal_digit",
            "arabic_normalization",
            "persian_normalization",
            "persian_stop"
          ]
        }
      }
    }
  }
}
  1. Replaces zero-width non-joiners with an ASCII space.

  2. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

portuguese analyzer

The portuguese analyzer could be reimplemented as a custom analyzer as follows:

PUT /portuguese_example
{
  "settings": {
    "analysis": {
      "filter": {
        "portuguese_stop": {
          "type":       "stop",
          "stopwords":  "_portuguese_" (1)
        },
        "portuguese_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exemplo"] (2)
        },
        "portuguese_stemmer": {
          "type":       "stemmer",
          "language":   "light_portuguese"
        }
      },
      "analyzer": {
        "rebuilt_portuguese": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "portuguese_stop",
            "portuguese_keywords",
            "portuguese_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

romanian analyzer

The romanian analyzer could be reimplemented as a custom analyzer as follows:

PUT /romanian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "romanian_stop": {
          "type":       "stop",
          "stopwords":  "_romanian_" (1)
        },
        "romanian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exemplu"] (2)
        },
        "romanian_stemmer": {
          "type":       "stemmer",
          "language":   "romanian"
        }
      },
      "analyzer": {
        "rebuilt_romanian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "romanian_stop",
            "romanian_keywords",
            "romanian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

russian analyzer

The russian analyzer could be reimplemented as a custom analyzer as follows:

PUT /russian_example
{
  "settings": {
    "analysis": {
      "filter": {
        "russian_stop": {
          "type":       "stop",
          "stopwords":  "_russian_" (1)
        },
        "russian_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["пример"] (2)
        },
        "russian_stemmer": {
          "type":       "stemmer",
          "language":   "russian"
        }
      },
      "analyzer": {
        "rebuilt_russian": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "russian_stop",
            "russian_keywords",
            "russian_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

sorani analyzer

The sorani analyzer could be reimplemented as a custom analyzer as follows:

PUT /sorani_example
{
  "settings": {
    "analysis": {
      "filter": {
        "sorani_stop": {
          "type":       "stop",
          "stopwords":  "_sorani_" (1)
        },
        "sorani_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["mînak"] (2)
        },
        "sorani_stemmer": {
          "type":       "stemmer",
          "language":   "sorani"
        }
      },
      "analyzer": {
        "rebuilt_sorani": {
          "tokenizer":  "standard",
          "filter": [
            "sorani_normalization",
            "lowercase",
            "decimal_digit",
            "sorani_stop",
            "sorani_keywords",
            "sorani_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

spanish analyzer

The spanish analyzer could be reimplemented as a custom analyzer as follows:

PUT /spanish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "spanish_stop": {
          "type":       "stop",
          "stopwords":  "_spanish_" (1)
        },
        "spanish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["ejemplo"] (2)
        },
        "spanish_stemmer": {
          "type":       "stemmer",
          "language":   "light_spanish"
        }
      },
      "analyzer": {
        "rebuilt_spanish": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "spanish_stop",
            "spanish_keywords",
            "spanish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

swedish analyzer

The swedish analyzer could be reimplemented as a custom analyzer as follows:

PUT /swedish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "swedish_stop": {
          "type":       "stop",
          "stopwords":  "_swedish_" (1)
        },
        "swedish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["exempel"] (2)
        },
        "swedish_stemmer": {
          "type":       "stemmer",
          "language":   "swedish"
        }
      },
      "analyzer": {
        "rebuilt_swedish": {
          "tokenizer":  "standard",
          "filter": [
            "lowercase",
            "swedish_stop",
            "swedish_keywords",
            "swedish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

turkish analyzer

The turkish analyzer could be reimplemented as a custom analyzer as follows:

PUT /turkish_example
{
  "settings": {
    "analysis": {
      "filter": {
        "turkish_stop": {
          "type":       "stop",
          "stopwords":  "_turkish_" (1)
        },
        "turkish_lowercase": {
          "type":       "lowercase",
          "language":   "turkish"
        },
        "turkish_keywords": {
          "type":       "keyword_marker",
          "keywords":   ["örnek"] (2)
        },
        "turkish_stemmer": {
          "type":       "stemmer",
          "language":   "turkish"
        }
      },
      "analyzer": {
        "rebuilt_turkish": {
          "tokenizer":  "standard",
          "filter": [
            "apostrophe",
            "turkish_lowercase",
            "turkish_stop",
            "turkish_keywords",
            "turkish_stemmer"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

  2. This filter should be removed unless there are words which should be excluded from stemming.

thai analyzer

The thai analyzer could be reimplemented as a custom analyzer as follows:

PUT /thai_example
{
  "settings": {
    "analysis": {
      "filter": {
        "thai_stop": {
          "type":       "stop",
          "stopwords":  "_thai_" (1)
        }
      },
      "analyzer": {
        "rebuilt_thai": {
          "tokenizer":  "thai",
          "filter": [
            "lowercase",
            "decimal_digit",
            "thai_stop"
          ]
        }
      }
    }
  }
}
  1. The default stopwords can be overridden with the stopwords or stopwords_path parameters.

Fingerprint Analyzer

The fingerprint analyzer implements a fingerprinting algorithm which is used by the OpenRefine project to assist in clustering.

Input text is lowercased, normalized to remove extended characters, sorted, deduplicated and concatenated into a single token. If a stopword list is configured, stop words will also be removed.

Example output

POST _analyze
{
  "analyzer": "fingerprint",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The above sentence would produce the following single term:

[ and consistent godel is said sentence this yes ]

Configuration

The fingerprint analyzer accepts the following parameters:

separator

The character to use to concatenate the terms. Defaults to a space.

max_output_size

The maximum token size to emit. Defaults to 255. Tokens larger than this size will be discarded.

stopwords

A pre-defined stop words list like _english_ or an array containing a list of stop words. Defaults to _none_.

stopwords_path

The path to a file containing stop words.

See the Stop Token Filter for more information about stop word configuration.

Example configuration

In this example, we configure the fingerprint analyzer to use the pre-defined list of English stop words:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_fingerprint_analyzer",
  "text": "Yes yes, Gödel said this sentence is consistent and."
}

The above example produces the following term:

[ consistent godel said sentence yes ]
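The separator and max_output_size parameters can be set in the same way. The following is a minimal sketch (the index name, separator, and size are purely illustrative) that joins the terms with + and caps the output at 100 characters:

PUT my_index_2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_fingerprint_analyzer": {
          "type": "fingerprint",
          "separator": "+",
          "max_output_size": 100
        }
      }
    }
  }
}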

Definition

The fingerprint analyzer consists of the standard tokenizer together with the lowercase, asciifolding, stop (disabled by default), and fingerprint token filters.

If you need to customize the fingerprint analyzer beyond the configuration parameters then you need to recreate it as a custom analyzer and modify it, usually by adding token filters. This would recreate the built-in fingerprint analyzer and you can use it as a starting point for further customization:

PUT /fingerprint_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "rebuilt_fingerprint": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "asciifolding",
            "fingerprint"
          ]
        }
      }
    }
  }
}

Custom Analyzer

When the built-in analyzers do not fulfill your needs, you can create a custom analyzer which uses the appropriate combination of: zero or more character filters, a tokenizer, and zero or more token filters.

Configuration

The custom analyzer accepts the following parameters:

tokenizer

A built-in or customised tokenizer. (Required)

char_filter

An optional array of built-in or customised character filters.

filter

An optional array of built-in or customised token filters.

position_increment_gap

When indexing an array of text values, Elasticsearch inserts a fake "gap" between the last term of one value and the first term of the next value to ensure that a phrase query doesn’t match two terms from different array elements. Defaults to 100. See [position-increment-gap] for more.
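For example, a minimal sketch (the analyzer name and gap value are illustrative) of a custom analyzer that widens the gap:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_gap_analyzer": {
          "type": "custom",
          "tokenizer": "standard",
          "position_increment_gap": 200
        }
      }
    }
  }
}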

Example configuration

Here is an example that combines the following: the html_strip character filter, the standard tokenizer, and the lowercase and asciifolding token filters:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": {
          "type":      "custom", (1)
          "tokenizer": "standard",
          "char_filter": [
            "html_strip"
          ],
          "filter": [
            "lowercase",
            "asciifolding"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text": "Is this <b>déjà vu</b>?"
}
  1. Setting type to custom tells Elasticsearch that we are defining a custom analyzer. Compare this to how built-in analyzers can be configured: type will be set to the name of the built-in analyzer, like standard or simple.

The above example produces the following terms:

[ is, this, deja, vu ]

The previous example used a tokenizer, token filters, and character filters with their default configurations, but it is possible to create configured versions of each and to use them in a custom analyzer.

Here is a more complicated example that combines the following:

  • Character Filter: a custom mapping character filter (emoticons)

  • Tokenizer: a custom pattern tokenizer (punctuation)

  • Token Filters: the lowercase token filter and a custom stop token filter (english_stop)

Here is an example:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_custom_analyzer": { (1)
          "type": "custom",
          "char_filter": [
            "emoticons"
          ],
          "tokenizer": "punctuation",
          "filter": [
            "lowercase",
            "english_stop"
          ]
        }
      },
      "tokenizer": {
        "punctuation": { (2)
          "type": "pattern",
          "pattern": "[ .,!?]"
        }
      },
      "char_filter": {
        "emoticons": { (3)
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      },
      "filter": {
        "english_stop": { (4)
          "type": "stop",
          "stopwords": "_english_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_custom_analyzer",
  "text":     "I'm a :) person, and you?"
}
  1. Assigns the index a default custom analyzer, my_custom_analyzer. This analyzer uses a custom tokenizer, character filter, and token filter that are defined later in the request.

  2. Defines the custom punctuation tokenizer.

  3. Defines the custom emoticons character filter.

  4. Defines the custom english_stop token filter.

The above example produces the following terms:

[ i'm, _happy_, person, you ]

Normalizers

Normalizers are similar to analyzers except that they may only emit a single token. As a consequence, they do not have a tokenizer and only accept a subset of the available char filters and token filters. Only the filters that work on a per-character basis are allowed. For instance a lowercasing filter would be allowed, but not a stemming filter, which needs to look at the keyword as a whole. The current list of filters that can be used in a normalizer is following: arabic_normalization, asciifolding, bengali_normalization, cjk_width, decimal_digit, elision, german_normalization, hindi_normalization, indic_normalization, lowercase, persian_normalization, scandinavian_folding, serbian_normalization, sorani_normalization, uppercase.

Custom normalizers

Elasticsearch does not ship with built-in normalizers so far, so the only way to get one is by building a custom one. Custom normalizers take a list of character filters and a list of token filters.

PUT index
{
  "settings": {
    "analysis": {
      "char_filter": {
        "quote": {
          "type": "mapping",
          "mappings": [
            "« => \"",
            "» => \""
          ]
        }
      },
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": ["quote"],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

Tokenizers

A tokenizer receives a stream of characters, breaks it up into individual tokens (usually individual words), and outputs a stream of tokens. For instance, a whitespace tokenizer breaks text into tokens whenever it sees any whitespace. It would convert the text "Quick brown fox!" into the terms [Quick, brown, fox!].

The tokenizer is also responsible for recording the order or position of each term (used for phrase and word proximity queries) and the start and end character offsets of the original word which the term represents (used for highlighting search snippets).

Elasticsearch has a number of built in tokenizers which can be used to build custom analyzers.

Word Oriented Tokenizers

The following tokenizers are usually used for tokenizing full text into individual words:

Standard Tokenizer

The standard tokenizer divides text into terms on word boundaries, as defined by the Unicode Text Segmentation algorithm. It removes most punctuation symbols. It is the best choice for most languages.

Letter Tokenizer

The letter tokenizer divides text into terms whenever it encounters a character which is not a letter.

Lowercase Tokenizer

The lowercase tokenizer, like the letter tokenizer, divides text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms.

Whitespace Tokenizer

The whitespace tokenizer divides text into terms whenever it encounters any whitespace character.

UAX URL Email Tokenizer

The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.

Classic Tokenizer

The classic tokenizer is a grammar based tokenizer for the English Language.

Thai Tokenizer

The thai tokenizer segments Thai text into words.

Partial Word Tokenizers

These tokenizers break up text or words into small fragments, for partial word matching:

N-Gram Tokenizer

The ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word: a sliding window of continuous letters, e.g. quick → [qu, ui, ic, ck].

Edge N-Gram Tokenizer

The edge_ngram tokenizer can break up text into words when it encounters any of a list of specified characters (e.g. whitespace or punctuation), then it returns n-grams of each word which are anchored to the start of the word, e.g. quick → [q, qu, qui, quic, quick].

Structured Text Tokenizers

The following tokenizers are usually used with structured text like identifiers, email addresses, zip codes, and paths, rather than with full text:

Keyword Tokenizer

The keyword tokenizer is a "noop" tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters like lowercase to normalise the analysed terms.

Pattern Tokenizer

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

Simple Pattern Tokenizer

The simple_pattern tokenizer uses a regular expression to capture matching text as terms. It uses a restricted subset of regular expression features and is generally faster than the pattern tokenizer.

Char Group Tokenizer

The char_group tokenizer is configurable through sets of characters to split on, which is usually less expensive than running regular expressions.

Simple Pattern Split Tokenizer

The simple_pattern_split tokenizer uses the same restricted regular expression subset as the simple_pattern tokenizer, but splits the input at matches rather than returning the matches as terms.

Path Tokenizer

The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree, e.g. /foo/bar/baz → [ /foo, /foo/bar, /foo/bar/baz ].

Standard Tokenizer

The standard tokenizer provides grammar based tokenization (based on the Unicode Text Segmentation algorithm, as specified in Unicode Standard Annex #29) and works well for most languages.

Example output

POST _analyze
{
  "tokenizer": "standard",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Configuration

The standard tokenizer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

Example configuration

In this example, we configure the standard tokenizer to have a max_token_length of 5 (for demonstration purposes):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "standard",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]

Letter Tokenizer

The letter tokenizer breaks text into terms whenever it encounters a character which is not a letter. It does a reasonable job for most European languages, but does a terrible job for some Asian languages, where words are not separated by spaces.

Example output

POST _analyze
{
  "tokenizer": "letter",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, QUICK, Brown, Foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration

The letter tokenizer is not configurable.

Lowercase Tokenizer

The lowercase tokenizer, like the letter tokenizer breaks text into terms whenever it encounters a character which is not a letter, but it also lowercases all terms. It is functionally equivalent to the letter tokenizer combined with the lowercase token filter, but is more efficient as it performs both steps in a single pass.

Example output

POST _analyze
{
  "tokenizer": "lowercase",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ the, quick, brown, foxes, jumped, over, the, lazy, dog, s, bone ]

Configuration

The lowercase tokenizer is not configurable.

Whitespace Tokenizer

The whitespace tokenizer breaks text into terms whenever it encounters a whitespace character.

Example output

POST _analyze
{
  "tokenizer": "whitespace",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, 2, QUICK, Brown-Foxes, jumped, over, the, lazy, dog's, bone. ]

Configuration

The whitespace tokenizer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.
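As with the other tokenizers that support max_token_length, a custom version can be defined. A minimal sketch (the index name and length are illustrative):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "whitespace",
          "max_token_length": 10
        }
      }
    }
  }
}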

UAX URL Email Tokenizer

The uax_url_email tokenizer is like the standard tokenizer except that it recognises URLs and email addresses as single tokens.

Example output

POST _analyze
{
  "tokenizer": "uax_url_email",
  "text": "Email me at john.smith@global-international.com"
}

The above sentence would produce the following terms:

[ Email, me, at, john.smith@global-international.com ]

while the standard tokenizer would produce:

[ Email, me, at, john.smith, global, international.com ]

Configuration

The uax_url_email tokenizer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

Example configuration

In this example, we configure the uax_url_email tokenizer to have a max_token_length of 5 (for demonstration purposes):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "uax_url_email",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "john.smith@global-international.com"
}

The above example produces the following terms:

[ john, smith, globa, l, inter, natio, nal.c, om ]

Classic Tokenizer

The classic tokenizer is a grammar based tokenizer that is good for English language documents. This tokenizer has heuristics for special treatment of acronyms, company names, email addresses, and internet host names. However, these rules don’t always work, and the tokenizer doesn’t work well for most languages other than English:

  • It splits words at most punctuation characters, removing punctuation. However, a dot that’s not followed by whitespace is considered part of a token.

  • It splits words at hyphens, unless there’s a number in the token, in which case the whole token is interpreted as a product number and is not split.

  • It recognizes email addresses and internet hostnames as one token.

Example output

POST _analyze
{
  "tokenizer": "classic",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above sentence would produce the following terms:

[ The, 2, QUICK, Brown, Foxes, jumped, over, the, lazy, dog's, bone ]

Configuration

The classic tokenizer accepts the following parameters:

max_token_length

The maximum token length. If a token is seen that exceeds this length then it is split at max_token_length intervals. Defaults to 255.

Example configuration

In this example, we configure the classic tokenizer to have a max_token_length of 5 (for demonstration purposes):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "classic",
          "max_token_length": 5
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The 2 QUICK Brown-Foxes jumped over the lazy dog's bone."
}

The above example produces the following terms:

[ The, 2, QUICK, Brown, Foxes, jumpe, d, over, the, lazy, dog's, bone ]

Thai Tokenizer

The thai tokenizer segments Thai text into words, using the Thai segmentation algorithm included with Java. Text in other languages in general will be treated the same as the standard tokenizer.

Warning
This tokenizer may not be supported by all JREs. It is known to work with Sun/Oracle and OpenJDK. If your application needs to be fully portable, consider using the ICU Tokenizer (from the analysis-icu plugin) instead.

Example output

POST _analyze
{
  "tokenizer": "thai",
  "text": "การที่ได้ต้องแสดงว่างานดี"
}

The above sentence would produce the following terms:

[ การ, ที่, ได้, ต้อง, แสดง, ว่า, งาน, ดี ]

Configuration

The thai tokenizer is not configurable.

NGram Tokenizer

The ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word of the specified length.

N-grams are like a sliding window that moves across the word - a continuous sequence of characters of the specified length. They are useful for querying languages that don’t use spaces or that have long compound words, like German.

Example output

With the default settings, the ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2:

POST _analyze
{
  "tokenizer": "ngram",
  "text": "Quick Fox"
}

The above sentence would produce the following terms:

[ Q, Qu, u, ui, i, ic, c, ck, k, "k ", " ", " F", F, Fo, o, ox, x ]

Configuration

The ngram tokenizer accepts the following parameters:

min_gram

Minimum length of characters in a gram. Defaults to 1.

max_gram

Maximum length of characters in a gram. Defaults to 2.

token_chars

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

Character classes may be any of the following:

  • letter — for example a, b, ï or 京

  • digit —  for example 3 or 7

  • whitespace —  for example " " or "\n"

  • punctuation — for example ! or "

  • symbol — for example $ or √

Tip
It usually makes sense to set min_gram and max_gram to the same value. The smaller the length, the more documents will match but the lower the quality of the matches. The longer the length, the more specific the matches. A tri-gram (length 3) is a good place to start.

The index level setting index.max_ngram_diff controls the maximum allowed difference between max_gram and min_gram.
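If min_gram and max_gram need to be further apart than the default allows, this index-level setting has to be raised as well. A minimal sketch (the index name and values are illustrative) that permits grams of length 1 to 5:

PUT my_index
{
  "settings": {
    "index": {
      "max_ngram_diff": 4
    },
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 1,
          "max_gram": 5
        }
      }
    }
  }
}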

Example configuration

In this example, we configure the ngram tokenizer to treat letters and digits as tokens, and to produce tri-grams (grams of length 3):

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "ngram",
          "min_gram": 3,
          "max_gram": 3,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}

The above example produces the following terms:

[ Qui, uic, ick, Fox, oxe, xes ]

Edge NGram Tokenizer

The edge_ngram tokenizer first breaks text down into words whenever it encounters one of a list of specified characters, then it emits N-grams of each word where the start of the N-gram is anchored to the beginning of the word.

Edge N-Grams are useful for search-as-you-type queries.

Tip
When you need search-as-you-type for text which has a widely known order, such as movie or song titles, the completion suggester is a much more efficient choice than edge N-grams. Edge N-grams have the advantage when trying to autocomplete words that can appear in any order.

Example output

With the default settings, the edge_ngram tokenizer treats the initial text as a single token and produces N-grams with minimum length 1 and maximum length 2:

POST _analyze
{
  "tokenizer": "edge_ngram",
  "text": "Quick Fox"
}

The above sentence would produce the following terms:

[ Q, Qu ]
Note
These default gram lengths are almost entirely useless. You need to configure the edge_ngram before using it.

Configuration

The edge_ngram tokenizer accepts the following parameters:

min_gram

Minimum length of characters in a gram. Defaults to 1.

max_gram

Maximum length of characters in a gram. Defaults to 2.

token_chars

Character classes that should be included in a token. Elasticsearch will split on characters that don’t belong to the classes specified. Defaults to [] (keep all characters).

Character classes may be any of the following:

  • letter — for example a, b, ï or 京

  • digit —  for example 3 or 7

  • whitespace —  for example " " or "\n"

  • punctuation — for example ! or "

  • symbol — for example $ or √

Limitations of the max_gram parameter

The edge_ngram tokenizer’s max_gram value limits the character length of tokens. When the edge_ngram tokenizer is used with an index analyzer, this means search terms longer than the max_gram length may not match any indexed terms.

For example, if the max_gram is 3, searches for apple won’t match the indexed term app.

To account for this, you can use the truncate token filter with a search analyzer to shorten search terms to the max_gram character length. However, this could return irrelevant results.

For example, if the max_gram is 3 and search terms are truncated to three characters, the search term apple is shortened to app. This means searches for apple return any indexed terms matching app, such as apply, snapped, and apple.

We recommend testing both approaches to see which best fits your use case and desired search experience.
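A minimal sketch of the truncate approach (the index, analyzer, and filter names are illustrative, assuming a max_gram of 3):

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "truncate_to_3": {
          "type": "truncate",
          "length": 3
        }
      },
      "analyzer": {
        "my_search_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "truncate_to_3"
          ]
        }
      }
    }
  }
}

Such an analyzer would then be referenced as the search_analyzer of the field whose index analyzer uses the edge_ngram tokenizer.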

Example configuration

In this example, we configure the edge_ngram tokenizer to treat letters and digits as tokens, and to produce grams with minimum length 2 and maximum length 10:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter",
            "digit"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "2 Quick Foxes."
}

The above example produces the following terms:

[ Qu, Qui, Quic, Quick, Fo, Fox, Foxe, Foxes ]

Usually we recommend using the same analyzer at index time and at search time. In the case of the edge_ngram tokenizer, the advice is different. It only makes sense to use the edge_ngram tokenizer at index time, to ensure that partial words are available for matching in the index. At search time, just search for the terms the user has typed in, for instance: Quick Fo.

Below is an example of how to set up a field for search-as-you-type.

Note that the max_gram value for the index analyzer is 10, which limits indexed terms to 10 characters. Search terms are not truncated, meaning that search terms longer than 10 characters may not match any indexed terms.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "autocomplete": {
          "tokenizer": "autocomplete",
          "filter": [
            "lowercase"
          ]
        },
        "autocomplete_search": {
          "tokenizer": "lowercase"
        }
      },
      "tokenizer": {
        "autocomplete": {
          "type": "edge_ngram",
          "min_gram": 2,
          "max_gram": 10,
          "token_chars": [
            "letter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "analyzer": "autocomplete",
          "search_analyzer": "autocomplete_search"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title": "Quick Foxes" (1)
}

POST my_index/_refresh

GET my_index/_search
{
  "query": {
    "match": {
      "title": {
        "query": "Quick Fo", (2)
        "operator": "and"
      }
    }
  }
}
  1. The autocomplete analyzer indexes the terms [qu, qui, quic, quick, fo, fox, foxe, foxes].

  2. The autocomplete_search analyzer searches for the terms [quick, fo], both of which appear in the index.

Keyword Tokenizer

The keyword tokenizer is a "noop" tokenizer that accepts whatever text it is given and outputs the exact same text as a single term. It can be combined with token filters to normalise output, e.g. lower-casing email addresses.

Example output

POST _analyze
{
  "tokenizer": "keyword",
  "text": "New York"
}

The above sentence would produce the following term:

[ New York ]

Configuration

The keyword tokenizer accepts the following parameters:

buffer_size

The number of characters read into the term buffer in a single pass. Defaults to 256. The term buffer will grow by this size until all the text has been consumed. It is advisable not to change this setting.
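As an illustration of combining it with token filters, the following sketch lower-cases an email address through the analyze API (the input text is arbitrary):

POST _analyze
{
  "tokenizer": "keyword",
  "filter": [ "lowercase" ],
  "text": "John.SMITH@example.COM"
}

This should produce the single term:

[ john.smith@example.com ]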

Pattern Tokenizer

The pattern tokenizer uses a regular expression to either split text into terms whenever it matches a word separator, or to capture matching text as terms.

The default pattern is \W+, which splits text whenever it encounters non-word characters.

Warning
Beware of Pathological Regular Expressions

The pattern tokenizer uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Example output

POST _analyze
{
  "tokenizer": "pattern",
  "text": "The foo_bar_size's default is 5."
}

The above sentence would produce the following terms:

[ The, foo_bar_size, s, default, is, 5 ]

Configuration

The pattern tokenizer accepts the following parameters:

pattern

A Java regular expression, defaults to \W+.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

group

Which capture group to extract as tokens. Defaults to -1 (split).

Example configuration

In this example, we configure the pattern tokenizer to break text into tokens when it encounters commas:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": ","
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "comma,separated,values"
}

The above example produces the following terms:

[ comma, separated, values ]

In the next example, we configure the pattern tokenizer to capture values enclosed in double quotes (ignoring embedded escaped quotes \"). The regex itself looks like this:

"((?:\\"|[^"]|\\")*)"

And reads as follows:

  • A literal "

  • Start capturing:

    • A literal \" OR any character except "

    • Repeat until no more characters match

  • A literal closing "

When the pattern is specified in JSON, the " and \ characters need to be escaped, so the pattern ends up looking like:

\"((?:\\\\\"|[^\"]|\\\\\")+)\"
PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "pattern",
          "pattern": "\"((?:\\\\\"|[^\"]|\\\\\")+)\"",
          "group": 1
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "\"value\", \"value with embedded \\\" quote\""
}

The above example produces the following two terms:

[ value, value with embedded \" quote ]

Char Group Tokenizer

The char_group tokenizer breaks text into terms whenever it encounters a character which is in a defined set. It is mostly useful for cases where a simple custom tokenization is desired, and the overhead of use of the pattern tokenizer is not acceptable.

Configuration

The char_group tokenizer accepts one parameter:

tokenize_on_chars

A list of characters to tokenize the string on. Whenever a character from this list is encountered, a new token is started. This accepts either single characters such as -, or character groups: whitespace, letter, digit, punctuation, symbol.

Example output

POST _analyze
{
  "tokenizer": {
    "type": "char_group",
    "tokenize_on_chars": [
      "whitespace",
      "-",
      "\n"
    ]
  },
  "text": "The QUICK brown-fox"
}

returns

{
  "tokens": [
    {
      "token": "The",
      "start_offset": 0,
      "end_offset": 3,
      "type": "word",
      "position": 0
    },
    {
      "token": "QUICK",
      "start_offset": 4,
      "end_offset": 9,
      "type": "word",
      "position": 1
    },
    {
      "token": "brown",
      "start_offset": 10,
      "end_offset": 15,
      "type": "word",
      "position": 2
    },
    {
      "token": "fox",
      "start_offset": 16,
      "end_offset": 19,
      "type": "word",
      "position": 3
    }
  ]
}

Simple Pattern Tokenizer

This functionality is marked as experimental in Lucene.

The simple_pattern tokenizer uses a regular expression to capture matching text as terms. The set of regular expression features it supports is more limited than the pattern tokenizer, but the tokenization is generally faster.

This tokenizer does not support splitting the input on a pattern match, unlike the pattern tokenizer. To split on pattern matches using the same restricted regular expression subset, see the simple_pattern_split tokenizer.

This tokenizer uses Lucene regular expressions (org.apache.lucene.util.automaton.RegExp). For an explanation of the supported features and syntax, see Regular Expression Syntax.

The default pattern is the empty string, which produces no terms. This tokenizer should always be configured with a non-default pattern.

Configuration

The simple_pattern tokenizer accepts the following parameters:

pattern

A Lucene regular expression (org.apache.lucene.util.automaton.RegExp), defaults to the empty string.

Example configuration

This example configures the simple_pattern tokenizer to produce terms that are three-digit numbers

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern",
          "pattern": "[0123456789]{3}"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "fd-786-335-514-x"
}

The above example produces these terms:

[ 786, 335, 514 ]

Simple Pattern Split Tokenizer

This functionality is marked as experimental in Lucene.

The simple_pattern_split tokenizer uses a regular expression to split the input into terms at pattern matches. The set of regular expression features it supports is more limited than the pattern tokenizer, but the tokenization is generally faster.

This tokenizer does not produce terms from the matches themselves. To produce terms from matches using patterns in the same restricted regular expression subset, see the simple_pattern tokenizer.

This tokenizer uses Lucene regular expressions (org.apache.lucene.util.automaton.RegExp). For an explanation of the supported features and syntax, see Regular Expression Syntax.

The default pattern is the empty string, which produces one term containing the full input. This tokenizer should always be configured with a non-default pattern.

Configuration

The simple_pattern_split tokenizer accepts the following parameters:

pattern

A Lucene regular expression (org.apache.lucene.util.automaton.RegExp), defaults to the empty string.

Example configuration

This example configures the simple_pattern_split tokenizer to split the input text on underscores.

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "simple_pattern_split",
          "pattern": "_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "an_underscored_phrase"
}

The above example produces these terms:

[ an, underscored, phrase ]

Path Hierarchy Tokenizer

The path_hierarchy tokenizer takes a hierarchical value like a filesystem path, splits on the path separator, and emits a term for each component in the tree.

Example output

POST _analyze
{
  "tokenizer": "path_hierarchy",
  "text": "/one/two/three"
}

The above text would produce the following terms:

[ /one, /one/two, /one/two/three ]

Configuration

The path_hierarchy tokenizer accepts the following parameters:

delimiter

The character to use as the path separator. Defaults to /.

replacement

An optional replacement character to use for the delimiter. Defaults to the delimiter.

buffer_size

The number of characters read into the term buffer in a single pass. Defaults to 1024. The term buffer will grow by this size until all the text has been consumed. It is advisable not to change this setting.

reverse

If set to true, emits the tokens in reverse order. Defaults to false.

skip

The number of initial tokens to skip. Defaults to 0.

Example configuration

In this example, we configure the path_hierarchy tokenizer to split on - characters, and to replace them with /. The first two tokens are skipped:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "my_tokenizer"
        }
      },
      "tokenizer": {
        "my_tokenizer": {
          "type": "path_hierarchy",
          "delimiter": "-",
          "replacement": "/",
          "skip": 2
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "one-two-three-four-five"
}

The above example produces the following terms:

[ /three, /three/four, /three/four/five ]

If we were to set reverse to true, it would produce the following:

[ one/two/three/, two/three/, three/ ]

Detailed Examples

Path Hierarchy Tokenizer Examples

A common use-case for the path_hierarchy tokenizer is filtering results by file paths. If indexing a file path along with the data, the use of the path_hierarchy tokenizer to analyze the path allows filtering the results by different parts of the file path string.

This example configures an index to have two custom analyzers and applies those analyzers to multifields of the file_path text field that will store filenames. One of the two analyzers uses reverse tokenization. Some sample documents are then indexed to represent some file paths for photos inside photo folders of two different users.

PUT file-path-test
{
  "settings": {
    "analysis": {
      "analyzer": {
        "custom_path_tree": {
          "tokenizer": "custom_hierarchy"
        },
        "custom_path_tree_reversed": {
          "tokenizer": "custom_hierarchy_reversed"
        }
      },
      "tokenizer": {
        "custom_hierarchy": {
          "type": "path_hierarchy",
          "delimiter": "/"
        },
        "custom_hierarchy_reversed": {
          "type": "path_hierarchy",
          "delimiter": "/",
          "reverse": "true"
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "file_path": {
          "type": "text",
          "fields": {
            "tree": {
              "type": "text",
              "analyzer": "custom_path_tree"
            },
            "tree_reversed": {
              "type": "text",
              "analyzer": "custom_path_tree_reversed"
            }
          }
        }
      }
    }
  }
}

POST file-path-test/_doc/1
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_doc/2
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo2.jpg"
}

POST file-path-test/_doc/3
{
  "file_path": "/User/alice/photos/2017/05/16/my_photo3.jpg"
}

POST file-path-test/_doc/4
{
  "file_path": "/User/alice/photos/2017/05/15/my_photo1.jpg"
}

POST file-path-test/_doc/5
{
  "file_path": "/User/bob/photos/2017/05/16/my_photo1.jpg"
}

A search for a particular file path string against the text field matches all the example documents, with Bob’s documents ranking highest, because bob is also one of the terms created by the standard analyzer, which boosts relevance for Bob’s documents.

GET file-path-test/_search
{
  "query": {
    "match": {
      "file_path": "/User/bob/photos/2017/05"
    }
  }
}

It’s simple to match or filter documents with file paths that exist within a particular directory using the file_path.tree field.

GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree": "/User/alice/photos/2017/05/16"
    }
  }
}

With the reverse parameter for this tokenizer, it’s also possible to match from the other end of the file path, such as individual file names or a deep level subdirectory. The following example shows a search for all files named my_photo1.jpg within any directory via the file_path.tree_reversed field configured to use the reverse parameter in the mapping.

GET file-path-test/_search
{
  "query": {
    "term": {
      "file_path.tree_reversed": {
        "value": "my_photo1.jpg"
      }
    }
  }
}

Viewing the tokens generated with both forward and reverse is instructive in showing the tokens created for the same file path value.

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

POST file-path-test/_analyze
{
  "analyzer": "custom_path_tree_reversed",
  "text": "/User/alice/photos/2017/05/16/my_photo1.jpg"
}

It’s also useful to be able to filter with file paths when combined with other types of searches, such as this example looking for any file paths containing 16 that must also be in Alice’s photo directory.

GET file-path-test/_search
{
  "query": {
    "bool" : {
      "must" : {
        "match" : { "file_path" : "16" }
      },
      "filter": {
        "term" : { "file_path.tree" : "/User/alice" }
      }
    }
  }
}

Token Filters

Token filters accept a stream of tokens from a tokenizer and can modify tokens (e.g. lowercasing), delete tokens (e.g. remove stopwords) or add tokens (e.g. synonyms).

Elasticsearch has a number of built in token filters which can be used to build custom analyzers.

Standard Token Filter

Deprecated in 6.5.0. This filter is deprecated and will be removed in the next major version.

A token filter of type standard that normalizes tokens extracted with the Standard Tokenizer.

Tip

The standard token filter currently does nothing. It remains as a placeholder in case some filtering function needs to be added in a future version.

ASCII Folding Token Filter

A token filter of type asciifolding that converts alphabetic, numeric, and symbolic Unicode characters which are not in the first 127 ASCII characters (the "Basic Latin" Unicode block) into their ASCII equivalents, if one exists. Example:

PUT /asciifold_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["asciifolding"]
                }
            }
        }
    }
}

Accepts a preserve_original setting which defaults to false; if set to true, the filter keeps the original token as well as emitting the folded token. For example:

PUT /asciifold_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["my_ascii_folding"]
                }
            },
            "filter" : {
                "my_ascii_folding" : {
                    "type" : "asciifolding",
                    "preserve_original" : true
                }
            }
        }
    }
}

Flatten Graph Token Filter

This functionality is marked as experimental in Lucene.

The flatten_graph token filter accepts an arbitrary graph token stream, such as that produced by Synonym Graph Token Filter, and flattens it into a single linear chain of tokens suitable for indexing.

This is a lossy process, as separate side paths are squashed on top of one another, but it is necessary if you use a graph token stream during indexing because a Lucene index cannot currently represent a graph. For this reason, it’s best to apply graph analyzers only at search time because that preserves the full graph structure and gives correct matches for proximity queries.

For more information on this topic and its various complexities, please read the Lucene’s TokenStreams are actually graphs blog post.
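A minimal sketch of an index-time analyzer that applies a synonym_graph filter and then flattens the result (the index, analyzer, and filter names, as well as the synonym list, are illustrative):

PUT /flatten_graph_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_index_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "my_synonyms", "flatten_graph" ]
                }
            },
            "filter" : {
                "my_synonyms" : {
                    "type" : "synonym_graph",
                    "synonyms" : [ "dns, domain name system" ]
                }
            }
        }
    }
}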

Length Token Filter

A token filter of type length that removes words that are too long or too short for the stream.

The following are settings that can be set for a length token filter type:

Setting Description

min

The minimum token length. Defaults to 0.

max

The maximum token length. Defaults to Integer.MAX_VALUE, which is 2^31-1 or 2147483647.
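A minimal sketch of a custom analyzer using this filter (the index and filter names and the bounds are illustrative):

PUT /length_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "my_length" ]
                }
            },
            "filter" : {
                "my_length" : {
                    "type" : "length",
                    "min" : 2,
                    "max" : 10
                }
            }
        }
    }
}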

Lowercase Token Filter

A token filter of type lowercase that normalizes token text to lower case.

The lowercase token filter supports Greek, Irish, and Turkish lowercase token filters through the language parameter. Below is a usage example in a custom analyzer:

PUT /lowercase_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "standard_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        },
        "greek_lowercase_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["greek_lowercase"]
        }
      },
      "filter": {
        "greek_lowercase": {
          "type": "lowercase",
          "language": "greek"
        }
      }
    }
  }
}

Uppercase Token Filter

A token filter of type uppercase that normalizes token text to upper case.
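Its effect can be checked quickly with the analyze API; a sketch (the input text is arbitrary):

POST _analyze
{
  "tokenizer": "standard",
  "filter": [ "uppercase" ],
  "text": "quick brown fox"
}

This should produce the terms:

[ QUICK, BROWN, FOX ]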

NGram Token Filter

A token filter of type nGram.

The following are settings that can be set for a nGram token filter type:

Setting Description

min_gram

Defaults to 1.

max_gram

Defaults to 2.

The index level setting index.max_ngram_diff controls the maximum allowed difference between max_gram and min_gram.
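A minimal sketch of a custom analyzer using this filter (the index and filter names and gram lengths are illustrative; index.max_ngram_diff is raised to allow the spread between min_gram and max_gram):

PUT /ngram_filter_example
{
    "settings" : {
        "index" : {
            "max_ngram_diff" : 2
        },
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "my_ngram" ]
                }
            },
            "filter" : {
                "my_ngram" : {
                    "type" : "nGram",
                    "min_gram" : 2,
                    "max_gram" : 4
                }
            }
        }
    }
}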

Edge NGram Token Filter

A token filter of type edgeNGram.

The following are settings that can be set for a edgeNGram token filter type:

Setting Description

min_gram

Defaults to 1.

max_gram

Defaults to 2.

side

Deprecated. Either front or back. Defaults to front.
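A minimal sketch of a custom analyzer using this filter (the index and filter names and gram lengths are illustrative):

PUT /edge_ngram_filter_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "my_edge_ngram" ]
                }
            },
            "filter" : {
                "my_edge_ngram" : {
                    "type" : "edgeNGram",
                    "min_gram" : 1,
                    "max_gram" : 5
                }
            }
        }
    }
}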

Porter Stem Token Filter

A token filter of type porter_stem that transforms the token stream as per the Porter stemming algorithm.

Note, the input to the stemming filter must already be in lower case, so you will need to use the Lower Case Token Filter or Lower Case Tokenizer farther down the tokenizer chain in order for this to work properly. For example, when using a custom analyzer, make sure the lowercase filter comes before the porter_stem filter in the list of filters.
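For instance, a minimal sketch (the index and analyzer names are illustrative) with the lowercase filter placed before porter_stem:

PUT /porter_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "porter_stem" ]
                }
            }
        }
    }
}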

Shingle Token Filter

Note
Shingles are generally used to help speed up phrase queries. Rather than building filter chains by hand, you may find it easier to use the index-phrases option on a text field.

A token filter of type shingle that constructs shingles (token n-grams) from a token stream. In other words, it creates combinations of tokens as a single token. For example, the sentence "please divide this sentence into shingles" might be tokenized into shingles "please divide", "divide this", "this sentence", "sentence into", and "into shingles".

This filter handles position increments > 1 by inserting filler tokens (tokens with termtext "_"). It does not handle a position increment of 0.

The following are settings that can be set for a shingle token filter type:

Setting Description

max_shingle_size

The maximum shingle size. Defaults to 2.

min_shingle_size

The minimum shingle size. Defaults to 2.

output_unigrams

If true the output will contain the input tokens (unigrams) as well as the shingles. Defaults to true.

output_unigrams_if_no_shingles

If output_unigrams is false the output will contain the input tokens (unigrams) if no shingles are available. Note if output_unigrams is set to true this setting has no effect. Defaults to false.

token_separator

The string to use when joining adjacent tokens to form a shingle. Defaults to " ".

filler_token

The string to use as a replacement for each position at which there is no actual token in the stream. For instance this string is used if the position increment is greater than one when a stop filter is used together with the shingle filter. Defaults to "_"

The index level setting index.max_shingle_diff controls the maximum allowed difference between max_shingle_size and min_shingle_size.
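A minimal sketch of a custom analyzer using this filter (the index and filter names and shingle sizes are illustrative):

PUT /shingle_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "my_shingle" ]
                }
            },
            "filter" : {
                "my_shingle" : {
                    "type" : "shingle",
                    "min_shingle_size" : 2,
                    "max_shingle_size" : 3,
                    "output_unigrams" : true
                }
            }
        }
    }
}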

Stop Token Filter

A token filter of type stop that removes stop words from token streams.

The following are settings that can be set for a stop token filter type:

stopwords

A list of stop words to use. Defaults to english stop words.

stopwords_path

A path (either relative to config location, or absolute) to a stopwords file configuration. Each stop word should be in its own "line" (separated by a line break). The file must be UTF-8 encoded.

ignore_case

Set to true to lower case all words first. Defaults to false.

remove_trailing

Set to false in order to not ignore the last term of a search if it is a stop word. This is very useful for the completion suggester as a query like green a can be extended to green apple even though you remove stop words in general. Defaults to true.

The stopwords parameter accepts either an array of stopwords:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {
                    "type":       "stop",
                    "stopwords": ["and", "is", "the"]
                }
            }
        }
    }
}

or a predefined language-specific list:

PUT /my_index
{
    "settings": {
        "analysis": {
            "filter": {
                "my_stop": {
                    "type":       "stop",
                    "stopwords":  "_english_"
                }
            }
        }
    }
}

Elasticsearch provides the following predefined list of languages:

arabic, armenian, basque, bengali, brazilian, bulgarian, catalan, czech, danish, dutch, english, finnish, french, galician, german, greek, hindi, hungarian, indonesian, irish, italian, latvian, norwegian, persian, portuguese, romanian, russian, sorani, spanish, swedish, thai, turkish.

For the empty stopwords list (to disable stopwords) use: none.
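The remaining settings can be combined with either form of stopwords; a minimal sketch (the index and filter names are illustrative) that matches stop words case-insensitively and keeps a trailing stop word:

PUT /my_stop_example
{
    "settings" : {
        "analysis" : {
            "filter" : {
                "my_stop" : {
                    "type" : "stop",
                    "stopwords" : "_english_",
                    "ignore_case" : true,
                    "remove_trailing" : false
                }
            }
        }
    }
}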

Word Delimiter Token Filter

Named word_delimiter, it splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

  • split on intra-word delimiters (by default, all non alpha-numeric characters).

  • "Wi-Fi" → "Wi", "Fi"

  • split on case transitions: "PowerShot" → "Power", "Shot"

  • split on letter-number transitions: "SD500" → "SD", "500"

  • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"

  • trailing "'s" are removed for each subword: "O’Neil’s" → "O", "Neil"

Parameters include:

generate_word_parts

If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.

generate_number_parts

If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.

catenate_words

If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.

catenate_numbers

If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.

catenate_all

If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.

split_on_case_change

If true causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regards). Defaults to true.

preserve_original

If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.

split_on_numerics

If true causes "j2se" to be three tokens; "j" "2" "se". Defaults to true.

stem_english_possessive

If true causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.

Advanced settings include:

protected_words

A list of tokens the filter won’t split. Can be provided either as an array, or via protected_words_path, which resolves to a file containing the protected words (one on each line). The path is automatically resolved relative to the config/ location if present.

type_table

A custom type mapping table, for example (when configured using type_table_path):

    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \\u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D => ALPHANUM
Note
Using a tokenizer like the standard tokenizer may interfere with the catenate_* and preserve_original parameters, as the original string may already have lost punctuation during tokenization. Instead, you may want to use the whitespace tokenizer.
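A minimal sketch of a custom analyzer using this filter with the whitespace tokenizer, as suggested above (the index and filter names and the chosen options are illustrative):

PUT /word_delimiter_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : [ "my_word_delimiter" ]
                }
            },
            "filter" : {
                "my_word_delimiter" : {
                    "type" : "word_delimiter",
                    "catenate_words" : true,
                    "preserve_original" : true
                }
            }
        }
    }
}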

Word Delimiter Graph Token Filter

This functionality is marked as experimental in Lucene.

Named word_delimiter_graph, it splits words into subwords and performs optional transformations on subword groups. Words are split into subwords with the following rules:

  • split on intra-word delimiters (by default, all non alpha-numeric characters).

  • "Wi-Fi" → "Wi", "Fi"

  • split on case transitions: "PowerShot" → "Power", "Shot"

  • split on letter-number transitions: "SD500" → "SD", "500"

  • leading and trailing intra-word delimiters on each subword are ignored: "//hello---there, 'dude'" → "hello", "there", "dude"

  • trailing "'s" are removed for each subword: "O’Neil’s" → "O", "Neil"

Unlike the word_delimiter, this token filter correctly handles positions for multi terms expansion at search-time when any of the following options are set to true:

  • preserve_original

  • catenate_numbers

  • catenate_words

  • catenate_all

Parameters include:

generate_word_parts

If true causes parts of words to be generated: "PowerShot" ⇒ "Power" "Shot". Defaults to true.

generate_number_parts

If true causes number subwords to be generated: "500-42" ⇒ "500" "42". Defaults to true.

catenate_words

If true causes maximum runs of word parts to be catenated: "wi-fi" ⇒ "wifi". Defaults to false.

catenate_numbers

If true causes maximum runs of number parts to be catenated: "500-42" ⇒ "50042". Defaults to false.

catenate_all

If true causes all subword parts to be catenated: "wi-fi-4000" ⇒ "wifi4000". Defaults to false.

split_on_case_change

If true causes "PowerShot" to be two tokens; ("Power-Shot" remains two parts regards). Defaults to true.

preserve_original

If true includes original words in subwords: "500-42" ⇒ "500-42" "500" "42". Defaults to false.

split_on_numerics

If true causes "j2se" to be three tokens; "j" "2" "se". Defaults to true.

stem_english_possessive

If true causes trailing "'s" to be removed for each subword: "O’Neil’s" ⇒ "O", "Neil". Defaults to true.

Advanced settings include:

protected_words

A list of words protected from being split. Can be provided either as an array, or via protected_words_path, which resolves to a file containing the protected words (one on each line). The path is automatically resolved relative to the config/ location if present.

type_table

A custom type mapping table, for example (when configured using type_table_path):

    # Map the $, %, '.', and ',' characters to DIGIT
    # This might be useful for financial data.
    $ => DIGIT
    % => DIGIT
    . => DIGIT
    \\u002C => DIGIT

    # in some cases you might not want to split on ZWJ
    # this also tests the case where we need a bigger byte[]
    # see http://en.wikipedia.org/wiki/Zero-width_joiner
    \\u200D => ALPHANUM
Note
Using a tokenizer like the standard tokenizer may interfere with the catenate_* and preserve_original parameters, as the original string may already have lost punctuation during tokenization. Instead, you may want to use the whitespace tokenizer.
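A minimal sketch along the same lines (the index and filter names and options are illustrative); because this variant handles positions correctly when preserve_original or the catenate_* options are enabled, it is typically applied in a search analyzer:

PUT /word_delimiter_graph_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_search_analyzer" : {
                    "tokenizer" : "whitespace",
                    "filter" : [ "my_word_delimiter_graph" ]
                }
            },
            "filter" : {
                "my_word_delimiter_graph" : {
                    "type" : "word_delimiter_graph",
                    "catenate_words" : true,
                    "preserve_original" : true
                }
            }
        }
    }
}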

Multiplexer Token Filter

A token filter of type multiplexer will emit multiple tokens at the same position, each version of the token having been run through a different filter. Identical output tokens at the same position will be removed.

Warning
If the incoming token stream has duplicate tokens, then these will also be removed by the multiplexer

Options

filters

a list of token filters to apply to incoming tokens. These can be any token filters defined elsewhere in the index mappings. Filters can be chained using a comma-delimited string, so for example "lowercase, porter_stem" would apply the lowercase filter and then the porter_stem filter to a single token.

Warning
Shingle or multi-word synonym token filters will not function normally when they are declared in the filters array, because they read ahead internally, which is unsupported by the multiplexer.
preserve_original

if true (the default) then emit the original token in addition to the filtered tokens

Settings example

You can set it up like:

PUT /multiplexer_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "my_multiplexer" ]
                }
            },
            "filter" : {
                "my_multiplexer" : {
                    "type" : "multiplexer",
                    "filters" : [ "lowercase", "lowercase, porter_stem" ]
                }
            }
        }
    }
}

And test it like:

POST /multiplexer_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "Going HOME"
}

And it’d respond:

{
  "tokens": [
    {
      "token": "Going",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "going",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "go",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "HOME",
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "home",          (1)
      "start_offset": 6,
      "end_offset": 10,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
  1. The stemmer has also emitted a token home at position 1, but because it is a duplicate of this token it has been removed from the token stream

Note
The synonym and synonym_graph filters use their preceding analysis chain to parse and analyse their synonym lists, and ignore any token filters in the chain that produce multiple tokens at the same position. This means that any filters within the multiplexer will be ignored for the purpose of synonyms. If you want to use filters contained within the multiplexer for parsing synonyms (for example, to apply stemming to the synonym lists), then you should append the synonym filter to the relevant multiplexer filter list.

Conditional Token Filter

The conditional token filter takes a predicate script and a list of subfilters, and only applies the subfilters to the current token if it matches the predicate.

Options

filter

a chain of token filters to apply to the current token if the predicate matches. These can be any token filters defined elsewhere in the index mappings.

script

a predicate script that determines whether or not the filters will be applied to the current token. Note that only inline scripts are supported

Settings example

You can set it up like:

PUT /condition_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "my_condition" ]
                }
            },
            "filter" : {
                "my_condition" : {
                    "type" : "condition",
                    "filter" : [ "lowercase" ],
                    "script" : {
                        "source" : "token.getTerm().length() < 5"  (1)
                    }
                }
            }
        }
    }
}
  1. This will only apply the lowercase filter to terms that are less than 5 characters in length

And test it like:

POST /condition_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "What Flapdoodle"
}

And it’d respond:

{
  "tokens": [
    {
      "token": "what",              (1)
      "start_offset": 0,
      "end_offset": 4,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "Flapdoodle",        (2)
      "start_offset": 5,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}
  1. The term What has been lowercased, because it is only 4 characters long

  2. The term Flapdoodle has been left in its original case, because it doesn’t pass the predicate

Predicate Token Filter Script

The predicate_token_filter token filter takes a predicate script, and removes tokens that do not match the predicate.

Options

script

a predicate script that determines whether or not the current token will be emitted. Note that only inline scripts are supported.

Settings example

You can set it up like:

PUT /condition_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : [ "my_script_filter" ]
                }
            },
            "filter" : {
                "my_script_filter" : {
                    "type" : "predicate_token_filter",
                    "script" : {
                        "source" : "token.getTerm().length() > 5"  (1)
                    }
                }
            }
        }
    }
}
  1. This will emit tokens that are more than 5 characters long

And test it like:

POST /condition_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "What Flapdoodle"
}

And it’d respond:

{
  "tokens": [
    {
      "token": "Flapdoodle",        (1)
      "start_offset": 5,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 1                 (2)
    }
  ]
}
  1. The token 'What' has been removed from the token stream because it does not match the predicate.

  2. The position and offset values are unaffected by the removal of earlier tokens

Stemmer Token Filter

A filter that provides access to (almost) all of the available stemming token filters through a single unified interface. For example:

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_stemmer"]
                }
            },
            "filter" : {
                "my_stemmer" : {
                    "type" : "stemmer",
                    "name" : "light_german"
                }
            }
        }
    }
}

The language/name parameter controls the stemmer with the following available values (the preferred filters are marked in bold):

Arabic

arabic

Armenian

armenian

Basque

basque

Bengali

bengali, light_bengali

Brazilian Portuguese

brazilian

Bulgarian

bulgarian

Catalan

catalan

Czech

czech

Danish

danish

Dutch

dutch, dutch_kp

English

english, light_english, minimal_english, possessive_english, porter2, lovins

Finnish

finnish, light_finnish

French

french, light_french, minimal_french

Galician

galician, minimal_galician (Plural step only)

German

german, german2, light_german, minimal_german

Greek

greek

Hindi

hindi

Hungarian

hungarian, light_hungarian

Indonesian

indonesian

Irish

irish

Italian

italian, light_italian

Kurdish (Sorani)

sorani

Latvian

latvian

Lithuanian

lithuanian

Norwegian (Bokmål)

norwegian, light_norwegian, minimal_norwegian

Norwegian (Nynorsk)

light_nynorsk, minimal_nynorsk

Portuguese

portuguese, light_portuguese, minimal_portuguese, portuguese_rslp

Romanian

romanian

Russian

russian, light_russian

Spanish

spanish, light_spanish

Swedish

swedish, light_swedish

Turkish

turkish

Stemmer Override Token Filter

Overrides stemming algorithms by applying a custom mapping, then protecting these terms from being modified by stemmers. Must be placed before any stemming filters.

Rules are separated by newlines (one rule per line).

Setting Description

rules

A list of mapping rules to use.

rules_path

A path (either relative to config location, or absolute) to a list of mappings.

Here is an example:

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "custom_stems", "porter_stem"]
                }
            },
            "filter" : {
                "custom_stems" : {
                    "type" : "stemmer_override",
                    "rules_path" : "analysis/stemmer_override.txt"
                }
            }
        }
    }
}

The referenced file contains one rule per line, using the same source => target format as the inline rules shown in the next example.

You can also define the override rules inline:

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "custom_stems", "porter_stem"]
                }
            },
            "filter" : {
                "custom_stems" : {
                    "type" : "stemmer_override",
                    "rules" : [
                        "running => run",
                        "stemmer => stemmer"
                    ]
                }
            }
        }
    }
}

Keyword Marker Token Filter

Protects words from being modified by stemmers. Must be placed before any stemming filters.

Setting Description

keywords

A list of words to use.

keywords_path

A path (either relative to config location, or absolute) to a list of words.

keywords_pattern

A regular expression pattern to match against words in the text.

ignore_case

Set to true to lower case all words first. Defaults to false.

You can configure it like:

PUT /keyword_marker_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "protect_cats": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "protect_cats", "porter_stem"]
        },
        "normal": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "porter_stem"]
        }
      },
      "filter": {
        "protect_cats": {
          "type": "keyword_marker",
          "keywords": ["cats"]
        }
      }
    }
  }
}

And test it with:

POST /keyword_marker_example/_analyze
{
  "analyzer" : "protect_cats",
  "text" : "I like cats"
}

And it’d respond:

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "like",
      "start_offset": 2,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "cats",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

As compared to the normal analyzer which has cats stemmed to cat:

POST /keyword_marker_example/_analyze
{
  "analyzer" : "normal",
  "text" : "I like cats"
}

Response:

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "like",
      "start_offset": 2,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "cat",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Keyword Repeat Token Filter

The keyword_repeat token filter emits each incoming token twice, once as a keyword and once as a non-keyword, to allow an unstemmed version of a term to be indexed side by side with the stemmed version of the term. Given the nature of this filter, each token that isn’t transformed by a subsequent stemmer will be indexed twice. Therefore, consider adding a unique filter with only_on_same_position set to true to drop unnecessary duplicates.

Here is an example of using the keyword_repeat token filter to preserve both the stemmed and unstemmed version of tokens:

PUT /keyword_repeat_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "stemmed_and_unstemmed": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "keyword_repeat", "porter_stem", "unique_stem"]
        }
      },
      "filter": {
        "unique_stem": {
          "type": "unique",
          "only_on_same_position": true
        }
      }
    }
  }
}

And you can test it with:

POST /keyword_repeat_example/_analyze
{
  "analyzer" : "stemmed_and_unstemmed",
  "text" : "I like cats"
}

And it’d respond:

{
  "tokens": [
    {
      "token": "i",
      "start_offset": 0,
      "end_offset": 1,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "like",
      "start_offset": 2,
      "end_offset": 6,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "cats",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "cat",
      "start_offset": 7,
      "end_offset": 11,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

This preserves both the cat and cats tokens. Compare this to the example for the Keyword Marker Token Filter.

KStem Token Filter

The kstem token filter is a high-performance filter for English. All terms must already be lowercased (use the lowercase filter) for this filter to work correctly.
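
For illustration, a minimal sketch of an analyzer that applies lowercase before the built-in kstem filter (the index and analyzer names, such as kstem_example and my_analyzer, are placeholders):

PUT /kstem_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "kstem"]
                }
            }
        }
    }
}

With this analyzer, an inflected term such as cats should be reduced to cat.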

Snowball Token Filter

A filter that stems words using a Snowball-generated stemmer. The language parameter controls the stemmer with the following available values: Armenian, Basque, Catalan, Danish, Dutch, English, Finnish, French, German, German2, Hungarian, Italian, Kp, Lithuanian, Lovins, Norwegian, Porter, Portuguese, Romanian, Russian, Spanish, Swedish, Turkish.

For example:

PUT /my_index
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "my_snow"]
                }
            },
            "filter" : {
                "my_snow" : {
                    "type" : "snowball",
                    "language" : "Lovins"
                }
            }
        }
    }
}

Phonetic Token Filter

The phonetic token filter is provided as the {plugins}/analysis-phonetic.html[analysis-phonetic] plugin.

Synonym Token Filter

The synonym token filter makes it easy to handle synonyms during the analysis process. Synonyms are configured using a configuration file. Here is an example:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["synonym"]
                    }
                },
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}

The above configures a synonym filter, with a path of analysis/synonym.txt (relative to the config location). The synonym analyzer is then configured with the filter.

This filter tokenizes synonyms with whatever tokenizer and token filters appear before it in the chain.

Additional settings are:

  • expand (defaults to true).

  • lenient (defaults to false). If true, ignores exceptions while parsing the synonym configuration. It is important to note that only those synonym rules which cannot get parsed are ignored. For instance, consider the following request:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["my_stop", "synonym"]
                    }
                },
                "filter" : {
                    "my_stop": {
                        "type" : "stop",
                        "stopwords": ["bar"]
                    },
                    "synonym" : {
                        "type" : "synonym",
                        "lenient": true,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}

With the above request the word bar gets skipped but a mapping foo ⇒ baz is still added. However, if the mapping being added were "foo, baz ⇒ bar", nothing would get added to the synonym list, because the target word of the mapping is itself eliminated as a stop word. Similarly, if the mapping were "bar, foo, baz" and expand was set to false, no mapping would get added, because when expand=false the target mapping is the first word. However, if expand=true then the mappings added would be equivalent to foo, baz ⇒ foo, baz, i.e. all mappings other than the stop word.

tokenizer and ignore_case are deprecated

The tokenizer parameter controls the tokenizer that will be used to tokenize the synonyms; it exists only for backwards compatibility with indices created before 6.0. The ignore_case parameter works with the tokenizer parameter only.

Two synonym formats are supported: Solr, WordNet.

Solr synonyms

The file uses the Solr synonym format, with one rule per line: either a comma-separated list of equivalent terms (for example universe, cosmos) or an explicit mapping using => (for example i-pod, i pod => ipod). Lines starting with # are treated as comments.

You can also define synonyms for the filter directly in the configuration file (note use of synonyms instead of synonyms_path):

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "synonyms" : [
                            "i-pod, i pod => ipod",
                            "universe, cosmos"
                        ]
                    }
                }
            }
        }
    }
}

However, it is recommended to define large synonym sets in a file using synonyms_path, because specifying them inline increases cluster size unnecessarily.

WordNet synonyms

Synonyms based on the WordNet format can be declared using the format parameter:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym",
                        "format" : "wordnet",
                        "synonyms" : [
                            "s(100000001,1,'abstain',v,1,0).",
                            "s(100000001,2,'refrain',v,1,0).",
                            "s(100000001,3,'desist',v,1,0)."
                        ]
                    }
                }
            }
        }
    }
}

Using synonyms_path to define WordNet synonyms in a file is supported as well.

Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms, e.g. asciifolding will only produce the folded version of the token. Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.

Synonym Graph Token Filter

The synonym_graph token filter makes it easy to handle synonyms, including multi-word synonyms, correctly during the analysis process.

In order to properly handle multi-word synonyms this token filter creates a "graph token stream" during processing. For more information on this topic and its various complexities, please read the Lucene’s TokenStreams are actually graphs blog post.

Note

This token filter is designed to be used as part of a search analyzer only. If you want to apply synonyms during indexing please use the standard synonym token filter.

Synonyms are configured using a configuration file. Here is an example:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "search_synonyms" : {
                        "tokenizer" : "whitespace",
                        "filter" : ["graph_synonyms"]
                    }
                },
                "filter" : {
                    "graph_synonyms" : {
                        "type" : "synonym_graph",
                        "synonyms_path" : "analysis/synonym.txt"
                    }
                }
            }
        }
    }
}

The above configures a search_synonyms filter, with a path of analysis/synonym.txt (relative to the config location). The search_synonyms analyzer is then configured with the filter.

Additional settings are:

  • expand (defaults to true).

  • lenient (defaults to false). If true, ignores exceptions while parsing the synonym configuration. It is important to note that only those synonym rules which cannot get parsed are ignored. For instance, consider the following request:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "analyzer" : {
                    "synonym" : {
                        "tokenizer" : "standard",
                        "filter" : ["my_stop", "synonym_graph"]
                    }
                },
                "filter" : {
                    "my_stop": {
                        "type" : "stop",
                        "stopwords": ["bar"]
                    },
                    "synonym_graph" : {
                        "type" : "synonym_graph",
                        "lenient": true,
                        "synonyms" : ["foo, bar => baz"]
                    }
                }
            }
        }
    }
}

With the above request the word bar gets skipped but a mapping foo ⇒ baz is still added. However, if the mapping being added were "foo, baz ⇒ bar", nothing would get added to the synonym list, because the target word of the mapping is itself eliminated as a stop word. Similarly, if the mapping were "bar, foo, baz" and expand was set to false, no mapping would get added, because when expand=false the target mapping is the first word. However, if expand=true then the mappings added would be equivalent to foo, baz ⇒ foo, baz, i.e. all mappings other than the stop word.

tokenizer and ignore_case are deprecated

The tokenizer parameter controls the tokenizer that will be used to tokenize the synonyms; it exists only for backwards compatibility with indices created before 6.0. The ignore_case parameter works with the tokenizer parameter only.

Two synonym formats are supported: Solr, WordNet.

Solr synonyms

The file uses the same Solr synonym format described above for the synonym token filter: one rule per line, either a comma-separated list of equivalent terms or an explicit mapping using =>.

You can also define synonyms for the filter directly in the configuration file (note use of synonyms instead of synonyms_path):

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym_graph",
                        "synonyms" : [
                            "lol, laughing out loud",
                            "universe, cosmos"
                        ]
                    }
                }
            }
        }
    }
}

However, it is recommended to define large synonym sets in a file using synonyms_path, because specifying them inline increases cluster size unnecessarily.

WordNet synonyms

Synonyms based on the WordNet format can be declared using the format parameter:

PUT /test_index
{
    "settings": {
        "index" : {
            "analysis" : {
                "filter" : {
                    "synonym" : {
                        "type" : "synonym_graph",
                        "format" : "wordnet",
                        "synonyms" : [
                            "s(100000001,1,'abstain',v,1,0).",
                            "s(100000001,2,'refrain',v,1,0).",
                            "s(100000001,3,'desist',v,1,0)."
                        ]
                    }
                }
            }
        }
    }
}

Using synonyms_path to define WordNet synonyms in a file is supported as well.

Parsing synonym files

Elasticsearch will use the token filters preceding the synonym filter in a tokenizer chain to parse the entries in a synonym file. So, for example, if a synonym filter is placed after a stemmer, then the stemmer will also be applied to the synonym entries. Because entries in the synonym map cannot have stacked positions, some token filters may cause issues here. Token filters that produce multiple versions of a token may choose which version of the token to emit when parsing synonyms, e.g. asciifolding will only produce the folded version of the token. Others, e.g. multiplexer, word_delimiter_graph or ngram will throw an error.

Warning
The synonym rules should not contain words that are removed by a filter that appears later in the chain (a stop filter, for instance). Removing a term from a synonym rule breaks the matching at query time.

Compound Word Token Filters

The hyphenation_decompounder and dictionary_decompounder token filters can decompose compound words found in many Germanic languages into word parts.

Both token filters require a dictionary of word parts, which can be provided as:

word_list

An array of words, specified inline in the token filter configuration, or

word_list_path

The path (either absolute or relative to the config directory) to a UTF-8 encoded file containing one word per line.

Hyphenation decompounder

The hyphenation_decompounder uses hyphenation grammars to find potential subwords that are then checked against the word dictionary. The quality of the output tokens is directly connected to the quality of the grammar file you use. For languages like German they are quite good.

XML-based hyphenation grammar files can be found in the Objects For Formatting Objects (OFFO) Sourceforge project. Currently only FOP v1.2 compatible hyphenation files are supported. You can download offo-hyphenation_v1.2.zip directly and look in the offo-hyphenation/hyph/ directory. Credits for the hyphenation code go to the Apache FOP project.

Dictionary decompounder

The dictionary_decompounder uses a brute force approach in conjunction with only the word dictionary to find subwords in a compound word. It is much slower than the hyphenation decompounder but can be used as a first start to check the quality of your dictionary.

Compound token filter parameters

The following parameters can be used to configure a compound word token filter:

type

Either dictionary_decompounder or hyphenation_decompounder.

word_list

An array containing a list of words to use for the word dictionary.

word_list_path

The path (either absolute or relative to the config directory) to the word dictionary.

hyphenation_patterns_path

The path (either absolute or relative to the config directory) to a FOP XML hyphenation pattern file (required for the hyphenation_decompounder).

min_word_size

Minimum word size. Defaults to 5.

min_subword_size

Minimum subword size. Defaults to 2.

max_subword_size

Maximum subword size. Defaults to 15.

only_longest_match

Whether to include only the longest matching subword or not. Defaults to false

Here is an example:

PUT /compound_word_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "type": "custom",
                    "tokenizer": "standard",
                    "filter": ["dictionary_decompounder", "hyphenation_decompounder"]
                }
            },
            "filter": {
                "dictionary_decompounder": {
                    "type": "dictionary_decompounder",
                    "word_list": ["one", "two", "three"]
                },
                "hyphenation_decompounder": {
                    "type" : "hyphenation_decompounder",
                    "word_list_path": "analysis/example_word_list.txt",
                    "hyphenation_patterns_path": "analysis/hyphenation_patterns.xml",
                    "max_subword_size": 22
                }
            }
        }
    }
}

Reverse Token Filter

A token filter of type reverse that simply reverses each token.
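
For illustration, the effect of the built-in reverse filter can be seen directly with the analyze API (a minimal sketch, not tied to any index):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["reverse"],
  "text": "quick fox"
}

This should return the reversed terms [ kciuq, xof ].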

Elision Token Filter

A token filter which removes elisions. For example, "l’avion" (the plane) will be tokenized as "avion" (plane).

It accepts an articles parameter, which is a set of stop word articles, and an articles_case parameter, which indicates whether the filter treats those articles as case insensitive.

For example:

PUT /elision_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "default" : {
                    "tokenizer" : "standard",
                    "filter" : ["elision"]
                }
            },
            "filter" : {
                "elision" : {
                    "type" : "elision",
                    "articles_case": true,
                    "articles" : ["l", "m", "t", "qu", "n", "s", "j"]
                }
            }
        }
    }
}

Truncate Token Filter

The truncate token filter can be used to truncate tokens to a specific length.

It accepts a length parameter, which controls the number of characters to truncate to; it defaults to 10.
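
A minimal configuration sketch (the index name truncate_example, the filter name my_truncate and the length of 5 are placeholders chosen for illustration):

PUT /truncate_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["my_truncate"]
                }
            },
            "filter": {
                "my_truncate": {
                    "type": "truncate",
                    "length": 5
                }
            }
        }
    }
}

With this configuration a token like elasticsearch would be truncated to elast.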

Unique Token Filter

The unique token filter can be used to only index unique tokens during analysis. By default it is applied to the whole token stream. If only_on_same_position is set to true, it will only remove duplicate tokens on the same position.
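
For illustration, the default behaviour can be checked with the analyze API; duplicate tokens anywhere in the stream should be dropped:

POST _analyze
{
  "tokenizer": "whitespace",
  "filter": ["unique"],
  "text": "the quick fox jumps the lazy fox"
}

This should return [ the, quick, fox, jumps, lazy ], with the second occurrences of the and fox removed.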

Pattern Capture Token Filter

The pattern_capture token filter, unlike the pattern tokenizer, emits a token for every capture group in the regular expression. Patterns are not anchored to the beginning and end of the string, so each pattern can match multiple times, and matches are allowed to overlap.

Warning
Beware of Pathological Regular Expressions

The pattern capture token filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

For instance a pattern like:

"(([a-z]+)(\d*))"

when matched against:

"abc123def456"

would produce the tokens: [ abc123, abc, 123, def456, def, 456 ]

If preserve_original is set to true (the default) then it would also emit the original token: abc123def456.

This is particularly useful for indexing text like camel-case code, e.g. stripHTML, where a user may search for "strip html" or "striphtml":

PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "code" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "(\\p{Ll}+|\\p{Lu}\\p{Ll}+|\\p{Lu}+)",
                  "(\\d+)"
               ]
            }
         },
         "analyzer" : {
            "code" : {
               "tokenizer" : "pattern",
               "filter" : [ "code", "lowercase" ]
            }
         }
      }
   }
}

When used to analyze the text

import static org.apache.commons.lang.StringEscapeUtils.escapeHtml

this emits the tokens: [ import, static, org, apache, commons, lang, stringescapeutils, string, escape, utils, escapehtml, escape, html ]

Another example is analyzing email addresses:

PUT test
{
   "settings" : {
      "analysis" : {
         "filter" : {
            "email" : {
               "type" : "pattern_capture",
               "preserve_original" : true,
               "patterns" : [
                  "([^@]+)",
                  "(\\p{L}+)",
                  "(\\d+)",
                  "@(.+)"
               ]
            }
         },
         "analyzer" : {
            "email" : {
               "tokenizer" : "uax_url_email",
               "filter" : [ "email", "lowercase",  "unique" ]
            }
         }
      }
   }
}

When the above analyzer is used on an email address like:

john-smith_123@foo-bar.com

it would produce the following tokens:

john-smith_123@foo-bar.com, john-smith_123,
john, smith, 123, foo-bar.com, foo, bar, com

Multiple patterns are required to allow overlapping captures, but this also means that patterns are less dense and easier to understand.

Note: All tokens are emitted in the same position, and with the same character offsets. This means, for example, that a match query for john-smith_123@foo-bar.com that uses this analyzer will return documents containing any of these tokens, even when using the and operator. Also, when combined with highlighting, the whole original token will be highlighted, not just the matching subset. For instance, querying the above email address for "smith" would highlight:

  <em>john-smith_123@foo-bar.com</em>

not:

  john-<em>smith</em>_123@foo-bar.com

Pattern Replace Token Filter

The pattern_replace token filter makes it easy to handle string replacements based on a regular expression. The regular expression is defined using the pattern parameter, and the replacement string can be provided using the replacement parameter (supporting referencing the original text, as explained here).
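
As a sketch (the index and filter names are placeholders, and the pattern mirrors the pattern_replace character filter example later in this document), a filter that rewrites embedded dashes in numbers could look like:

PUT /pattern_replace_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["my_pattern_replace"]
                }
            },
            "filter": {
                "my_pattern_replace": {
                    "type": "pattern_replace",
                    "pattern": "(\\d+)-(?=\\d)",
                    "replacement": "$1_"
                }
            }
        }
    }
}

With this analyzer, a token such as 123-456-789 would be indexed as 123_456_789.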

Warning
Beware of Pathological Regular Expressions

The pattern replace token filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Trim Token Filter

The trim token filter trims the whitespace surrounding a token.
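
For illustration, the effect is easiest to see with the keyword tokenizer, whose single term keeps the surrounding whitespace until trim removes it (a minimal sketch):

POST _analyze
{
  "tokenizer": "keyword",
  "filter": ["trim"],
  "text": " fox "
}

This should return the single term fox.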

Limit Token Count Token Filter

Limits the number of tokens that are indexed per document and field.

Setting Description

max_token_count

The maximum number of tokens that should be indexed per document and field. The default is 1

consume_all_tokens

If set to true, the filter will exhaust the stream even if max_token_count tokens have been consumed already. The default is false.

Here is an example:

PUT /limit_example
{
  "settings": {
    "analysis": {
      "analyzer": {
        "limit_example": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase", "five_token_limit"]
        }
      },
      "filter": {
        "five_token_limit": {
          "type": "limit",
          "max_token_count": 5
        }
      }
    }
  }
}

Hunspell Token Filter

Basic support for hunspell stemming. Hunspell dictionaries will be picked up from a dedicated hunspell directory on the filesystem (<path.conf>/hunspell). Each dictionary is expected to have its own directory named after its associated locale (language). This dictionary directory is expected to hold a single .aff and one or more .dic files (all of which will automatically be picked up). For example, assuming the default hunspell location is used, the following directory layout will define the en_US dictionary:

- conf
    |-- hunspell
    |    |-- en_US
    |    |    |-- en_US.dic
    |    |    |-- en_US.aff

Each dictionary can be configured with one setting:

ignore_case

If true, dictionary matching will be case insensitive (defaults to false)

This setting can be configured globally in elasticsearch.yml using

  • indices.analysis.hunspell.dictionary.ignore_case

or for specific dictionaries:

  • indices.analysis.hunspell.dictionary.en_US.ignore_case.

It is also possible to add a settings.yml file under the dictionary directory which holds these settings (this will override any other settings defined in elasticsearch.yml).

One can use the hunspell stem filter by configuring it in the analysis settings:

PUT /hunspell_example
{
    "settings": {
        "analysis" : {
            "analyzer" : {
                "en" : {
                    "tokenizer" : "standard",
                    "filter" : [ "lowercase", "en_US" ]
                }
            },
            "filter" : {
                "en_US" : {
                    "type" : "hunspell",
                    "locale" : "en_US",
                    "dedup" : true
                }
            }
        }
    }
}

The hunspell token filter accepts four options:

locale

A locale for this filter. If this is unset, the lang or language parameters are used instead, so one of these has to be set.

dictionary

The name of a dictionary. The path to your hunspell dictionaries should be configured via indices.analysis.hunspell.dictionary.location before.

dedup

If only unique terms should be returned, this needs to be set to true. Defaults to true.

longest_only

If only the longest term should be returned, set this to true. Defaults to false: all possible stems are returned.

Note
As opposed to the snowball stemmers (which are algorithm based) this is a dictionary lookup based stemmer and therefore the quality of the stemming is determined by the quality of the dictionary.

Dictionary loading

By default, the default Hunspell directory (config/hunspell/) is checked for dictionaries when the node starts up, and any dictionaries are automatically loaded.

Dictionary loading can be deferred until they are actually used by setting indices.analysis.hunspell.dictionary.lazy to true in the config file.

References

Hunspell is a spell checker and morphological analyzer designed for languages with rich morphology and complex word compounding and character encoding.

Common Grams Token Filter

Token filter that generates bigrams for frequently occurring terms. Single terms are still indexed. It can be used as an alternative to the Stop Token Filter when we don’t want to completely ignore common terms.

For example, the text "the quick brown is a fox" will be tokenized as "the", "the_quick", "quick", "brown", "brown_is", "is", "is_a", "a", "a_fox", "fox", assuming "the", "is" and "a" are common words.

When query_mode is enabled, the token filter removes common words and single terms followed by a common word. This parameter should be enabled in the search analyzer.

For example, the query "the quick brown is a fox" will be tokenized as "the_quick", "quick", "brown_is", "is_a", "a_fox", "fox".

The following are settings that can be set:

Setting Description

common_words

A list of common words to use.

common_words_path

A path (either relative to config location, or absolute) to a list of common words. Each word should be in its own "line" (separated by a line break). The file must be UTF-8 encoded.

ignore_case

If true, common words matching will be case insensitive (defaults to false).

query_mode

Generates bigrams then removes common words and single terms followed by a common word (defaults to false).

Note that either the common_words or common_words_path field is required.

Here is an example:

PUT /common_grams_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "index_grams": {
                    "tokenizer": "whitespace",
                    "filter": ["common_grams"]
                },
                "search_grams": {
                    "tokenizer": "whitespace",
                    "filter": ["common_grams_query"]
                }
            },
            "filter": {
                "common_grams": {
                    "type": "common_grams",
                    "common_words": ["the", "is", "a"]
                },
                "common_grams_query": {
                    "type": "common_grams",
                    "query_mode": true,
                    "common_words": ["the", "is", "a"]
                }
            }
        }
    }
}

You can see the output by using e.g. the _analyze endpoint:

POST /common_grams_example/_analyze
{
  "analyzer" : "index_grams",
  "text" : "the quick brown is a fox"
}

And the response will be:

{
  "tokens" : [
    {
      "token" : "the",
      "start_offset" : 0,
      "end_offset" : 3,
      "type" : "word",
      "position" : 0
    },
    {
      "token" : "the_quick",
      "start_offset" : 0,
      "end_offset" : 9,
      "type" : "gram",
      "position" : 0,
      "positionLength" : 2
    },
    {
      "token" : "quick",
      "start_offset" : 4,
      "end_offset" : 9,
      "type" : "word",
      "position" : 1
    },
    {
      "token" : "brown",
      "start_offset" : 10,
      "end_offset" : 15,
      "type" : "word",
      "position" : 2
    },
    {
      "token" : "brown_is",
      "start_offset" : 10,
      "end_offset" : 18,
      "type" : "gram",
      "position" : 2,
      "positionLength" : 2
    },
    {
      "token" : "is",
      "start_offset" : 16,
      "end_offset" : 18,
      "type" : "word",
      "position" : 3
    },
    {
      "token" : "is_a",
      "start_offset" : 16,
      "end_offset" : 20,
      "type" : "gram",
      "position" : 3,
      "positionLength" : 2
    },
    {
      "token" : "a",
      "start_offset" : 19,
      "end_offset" : 20,
      "type" : "word",
      "position" : 4
    },
    {
      "token" : "a_fox",
      "start_offset" : 19,
      "end_offset" : 24,
      "type" : "gram",
      "position" : 4,
      "positionLength" : 2
    },
    {
      "token" : "fox",
      "start_offset" : 21,
      "end_offset" : 24,
      "type" : "word",
      "position" : 5
    }
  ]
}

Normalization Token Filter

There are several token filters available which try to normalize special characters of a certain language.

CJK Width Token Filter

The cjk_width token filter normalizes CJK width differences:

  • Folds fullwidth ASCII variants into the equivalent basic Latin

  • Folds halfwidth Katakana variants into the equivalent Kana

Note
This token filter can be viewed as a subset of NFKC/NFKD Unicode normalization. See the {plugins}/analysis-icu-normalization-charfilter.html[analysis-icu plugin] for full normalization support.
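
For illustration, a minimal analyze request folding fullwidth Latin characters (the input text is just an example):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["cjk_width"],
  "text": "Ｔｅｓｔ"
}

This should return the plain ASCII term Test.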

CJK Bigram Token Filter

The cjk_bigram token filter forms bigrams out of the CJK terms that are generated by the standard tokenizer or the icu_tokenizer (see {plugins}/analysis-icu-tokenizer.html[analysis-icu plugin]).

By default, when a CJK character has no adjacent characters to form a bigram, it is output in unigram form. If you always want to output both unigrams and bigrams, set the output_unigrams flag to true. This can be used for a combined unigram+bigram approach.

Bigrams are generated for characters in han, hiragana, katakana and hangul, but bigrams can be disabled for particular scripts with the ignored_scripts parameter. All non-CJK input is passed through unmodified.

PUT /cjk_bigram_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "han_bigrams" : {
                    "tokenizer" : "standard",
                    "filter" : ["han_bigrams_filter"]
                }
            },
            "filter" : {
                "han_bigrams_filter" : {
                    "type" : "cjk_bigram",
                    "ignored_scripts": [
                        "hiragana",
                        "katakana",
                        "hangul"
                    ],
                    "output_unigrams" : true
                }
            }
        }
    }
}

Delimited Payload Token Filter

Named delimited_payload. Splits tokens into tokens and payload whenever a delimiter character is found.

Warning

The older name delimited_payload_filter is deprecated and should not be used for new indices. Use delimited_payload instead.

Example: "the|1 quick|2 fox|3" is split by default into tokens the, quick, and fox with payloads 1, 2, and 3 respectively.

Parameters:

delimiter

Character used for splitting the tokens. Default is |.

encoding

The type of the payload. int for integer, float for float and identity for characters. Default is float.
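
A minimal configuration sketch (the index and filter names are placeholders; the whitespace tokenizer is used so that the pipe-delimited tokens from the example above reach the filter intact):

PUT /delimited_payload_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["my_payloads"]
                }
            },
            "filter": {
                "my_payloads": {
                    "type": "delimited_payload",
                    "delimiter": "|",
                    "encoding": "float"
                }
            }
        }
    }
}

Text such as "the|1 quick|2 fox|3" would then be indexed as the terms the, quick and fox with float payloads 1, 2 and 3.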

Keep Words Token Filter

A token filter of type keep that only keeps tokens with text contained in a predefined set of words. The set of words can be defined in the settings or loaded from a text file containing one word per line.

Options

keep_words

a list of words to keep

keep_words_path

a path to a words file

keep_words_case

a boolean indicating whether to lower case the words (defaults to false)

Settings example

PUT /keep_words_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "example_1" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "words_till_three"]
                },
                "example_2" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "words_in_file"]
                }
            },
            "filter" : {
                "words_till_three" : {
                    "type" : "keep",
                    "keep_words" : [ "one", "two", "three"]
                },
                "words_in_file" : {
                    "type" : "keep",
                    "keep_words_path" : "analysis/example_word_list.txt"
                }
            }
        }
    }
}

Keep Types Token Filter

A token filter of type keep_types that only keeps tokens with a token type contained in a predefined set.

Options

types

a list of types to include (default mode) or exclude

mode

if set to include (default) the specified token types will be kept, if set to exclude the specified token types will be removed from the stream

Settings example

You can set it up like:

PUT /keep_types_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "extract_numbers"]
                }
            },
            "filter" : {
                "extract_numbers" : {
                    "type" : "keep_types",
                    "types" : [ "<NUM>" ]
                }
            }
        }
    }
}

And test it like:

POST /keep_types_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "this is just 1 a test"
}

The response will be:

{
  "tokens": [
    {
      "token": "1",
      "start_offset": 13,
      "end_offset": 14,
      "type": "<NUM>",
      "position": 3
    }
  ]
}

Note how only the <NUM> token is in the output.

Exclude mode settings example

If the mode parameter is set to exclude like in the following example:

PUT /keep_types_exclude_example
{
    "settings" : {
        "analysis" : {
            "analyzer" : {
                "my_analyzer" : {
                    "tokenizer" : "standard",
                    "filter" : ["lowercase", "remove_numbers"]
                }
            },
            "filter" : {
                "remove_numbers" : {
                    "type" : "keep_types",
                    "mode" : "exclude",
                    "types" : [ "<NUM>" ]
                }
            }
        }
    }
}

And we test it like:

POST /keep_types_exclude_example/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "hello 101 world"
}

The response will be:

{
  "tokens": [
    {
      "token": "hello",
      "start_offset": 0,
      "end_offset": 5,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "world",
      "start_offset": 10,
      "end_offset": 15,
      "type": "<ALPHANUM>",
      "position": 2
    }
  ]
}

Classic Token Filter

The classic token filter does optional post-processing of terms that are generated by the classic tokenizer.

This filter removes the English possessive from the end of words, and it removes dots from acronyms.
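
For illustration, a minimal analyze request combining the classic tokenizer and the classic filter (the input text is just an example, and the expected terms are indicative):

POST _analyze
{
  "tokenizer": "classic",
  "filter": ["classic"],
  "text": "David's I.B.M. laptop"
}

This should produce terms along the lines of [ David, IBM, laptop ].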

Apostrophe Token Filter

The apostrophe token filter strips all characters after an apostrophe, including the apostrophe itself.
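
For illustration (this filter is mainly useful for Turkish), a minimal analyze request:

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["apostrophe"],
  "text": "Istanbul'a"
}

This should return the single term Istanbul, with the apostrophe and the trailing suffix removed.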

Decimal Digit Token Filter

The decimal_digit token filter folds Unicode digits to 0-9.
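
For illustration, a minimal analyze request folding Arabic-Indic digits (the input is just an example):

POST _analyze
{
  "tokenizer": "standard",
  "filter": ["decimal_digit"],
  "text": "١٢٣"
}

This should return the term 123.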

Fingerprint Token Filter

The fingerprint token filter emits a single token which is useful for fingerprinting a body of text, and/or providing a token that can be clustered on. It does this by sorting the tokens, deduplicating and then concatenating them back into a single token.

For example, the tokens ["the", "quick", "quick", "brown", "fox", "was", "very", "brown"] will be transformed into a single token: "brown fox quick the very was". Notice how the tokens were sorted alphabetically, and there is only one "quick".

The following are settings that can be set for a fingerprint token filter type:

Setting Description

separator

Defaults to a space.

max_output_size

Defaults to 255.

Maximum token size

Because a field may have many unique tokens, it is important to set a cutoff so that fields do not grow too large. The max_output_size setting controls this behavior. If the concatenated fingerprint grows larger than max_output_size, the token filter will exit and will not emit a token (i.e. the field will be empty).
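
A configuration sketch (the index, analyzer and filter names are placeholders; the separator and max_output_size shown are simply the defaults made explicit):

PUT /fingerprint_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_fingerprint_analyzer": {
                    "tokenizer": "standard",
                    "filter": ["lowercase", "my_fingerprint"]
                }
            },
            "filter": {
                "my_fingerprint": {
                    "type": "fingerprint",
                    "separator": " ",
                    "max_output_size": 255
                }
            }
        }
    }
}

Analyzing "the quick quick brown fox" with my_fingerprint_analyzer should then produce the single term brown fox quick the.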

MinHash Token Filter

The min_hash token filter hashes each token of the token stream and divides the resulting hashes into buckets, keeping the lowest-valued hashes per bucket. It then returns these hashes as tokens.

The following are settings that can be set for a min_hash token filter.

Setting Description

hash_count

The number of hashes to hash the token stream with. Defaults to 1.

bucket_count

The number of buckets to divide the minhashes into. Defaults to 512.

hash_set_size

The number of minhashes to keep per bucket. Defaults to 1.

with_rotation

Whether or not to fill empty buckets with the value of the first non-empty bucket to its circular right. Only takes effect if hash_set_size is equal to one. Defaults to true if bucket_count is greater than one, else false.

Some points to consider while setting up a min_hash filter:

  • min_hash filter input tokens should typically be k-word shingles produced by the shingle token filter. You should choose k large enough so that the probability of any given shingle occurring in a document is low. At the same time, as internally each shingle is hashed into a 128-bit hash, you should choose k small enough so that all possible different k-word shingles can be hashed to a 128-bit hash with minimal collision.

  • choosing the right settings for hash_count, bucket_count and hash_set_size needs some experimentation.

    • to improve the precision, you should increase bucket_count or hash_set_size. Higher values of bucket_count or hash_set_size will provide a higher guarantee that different tokens are indexed to different buckets.

    • to improve the recall, you should increase the hash_count parameter. For example, setting hash_count=2 will make each token be hashed in two different ways, thus increasing the number of potential candidates for search.

  • the default settings make the min_hash filter produce 512 min_hash tokens for each document, each of size 16 bytes. Thus, each document’s size will be increased by around 8Kb.

  • the min_hash filter is used to hash for Jaccard similarity. This means that it doesn’t matter how many times a document contains a certain token, only whether it contains it or not.

Theory

MinHash token filter allows you to hash documents for similarity search. Similarity search, or nearest neighbor search is a complex problem. A naive solution requires an exhaustive pairwise comparison between a query document and every document in an index. This is a prohibitive operation if the index is large. A number of approximate nearest neighbor search solutions have been developed to make similarity search more practical and computationally feasible. One of these solutions involves hashing of documents.

Documents are hashed in a way that similar documents are more likely to produce the same hash code and are put into the same hash bucket, while dissimilar documents are more likely to be hashed into different hash buckets. This type of hashing is known as locality sensitive hashing (LSH).

Depending on what constitutes the similarity between documents, various LSH functions have been proposed. For Jaccard similarity, a popular LSH function is MinHash. A general idea of the way MinHash produces a signature for a document is by applying a random permutation over the whole index vocabulary (random numbering for the vocabulary), and recording the minimum value for this permutation for the document (the minimum number for a vocabulary word that is present in the document). The permutations are run several times; combining the minimum values for all of them will constitute a signature for the document.

In practice, instead of random permutations, a number of hash functions are chosen. A hash function calculates a hash code for each of a document’s tokens and chooses the minimum hash code among them. The minimum hash codes from all hash functions are combined to form a signature for the document.

Example of setting MinHash Token Filter in Elasticsearch

Here is an example of setting up a min_hash filter:

PUT /index1
{
  "settings": {
    "analysis": {
      "filter": {
        "my_shingle_filter": { (1)
          "type": "shingle",
          "min_shingle_size": 5,
          "max_shingle_size": 5,
          "output_unigrams": false
        },
        "my_minhash_filter": {
          "type": "min_hash",
          "hash_count": 1,   (2)
          "bucket_count": 512, (3)
          "hash_set_size": 1, (4)
          "with_rotation": true (5)
        }
      },
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "filter": [
            "my_shingle_filter",
            "my_minhash_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "text": {
        "fingerprint": "text",
        "analyzer": "my_analyzer"
      }
    }
  }
}
  1. setting a shingle filter with 5-word shingles

  2. setting min_hash filter to hash with 1 hash

  3. setting min_hash filter to hash tokens into 512 buckets

  4. setting min_hash filter to keep only a single smallest hash in each bucket

  5. setting min_hash filter to fill empty buckets with values from neighboring buckets

Remove Duplicates Token Filter

A token filter of type remove_duplicates that drops identical tokens at the same position.
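
For illustration, duplicates at the same position typically arise when keyword_repeat is followed by a stemmer that leaves a term unchanged; a sketch of such a chain (the index and filter names are placeholders):

PUT /remove_duplicates_example
{
    "settings": {
        "analysis": {
            "analyzer": {
                "my_analyzer": {
                    "tokenizer": "whitespace",
                    "filter": ["keyword_repeat", "porter_stem", "my_remove_duplicates"]
                }
            },
            "filter": {
                "my_remove_duplicates": {
                    "type": "remove_duplicates"
                }
            }
        }
    }
}

Analyzing "jumping dog" with my_analyzer should keep both jumping and its stem jump, while the duplicate dog emitted by keyword_repeat at the same position is removed.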

Character Filters

Character filters are used to preprocess the stream of characters before it is passed to the tokenizer.

A character filter receives the original text as a stream of characters and can transform the stream by adding, removing, or changing characters. For instance, a character filter could be used to convert Hindu-Arabic numerals (٠‎١٢٣٤٥٦٧٨‎٩‎) into their Arabic-Latin equivalents (0123456789), or to strip HTML elements like <b> from the stream.

Elasticsearch has a number of built in character filters which can be used to build custom analyzers.

HTML Strip Character Filter

The html_strip character filter strips out HTML elements like <b> and decodes HTML entities like &amp;.

Mapping Character Filter

The mapping character filter replaces any occurrences of the specified strings with the specified replacements.

Pattern Replace Character Filter

The pattern_replace character filter replaces any characters matching a regular expression with the specified replacement.

HTML Strip Char Filter

The html_strip character filter strips HTML elements from the text and replaces HTML entities with their decoded value (e.g. replacing &amp; with &).

Example output

POST _analyze
{
  "tokenizer":      "keyword", (1)
  "char_filter":  [ "html_strip" ],
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}
  1. The keyword tokenizer returns a single term.

The above example returns the term:

[ \nI'm so happy!\n ]

The same example with the standard tokenizer would return the following terms:

[ I'm, so, happy ]

Configuration

The html_strip character filter accepts the following parameter:

escaped_tags

An array of HTML tags which should not be stripped from the original text.

Example configuration

In this example, we configure the html_strip character filter to leave <b> tags in place:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": ["my_char_filter"]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "html_strip",
          "escaped_tags": ["b"]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "<p>I&apos;m so <b>happy</b>!</p>"
}

The above example produces the following term:

[ \nI'm so <b>happy</b>!\n ]

Mapping Char Filter

The mapping character filter accepts a map of keys and values. Whenever it encounters a string of characters that is the same as a key, it replaces them with the value associated with that key.

Matching is greedy; the longest pattern matching at a given point wins. Replacements are allowed to be the empty string.

Configuration

The mapping character filter accepts the following parameters:

mappings

An array of mappings, with each element having the form key ⇒ value.

mappings_path

A path, either absolute or relative to the config directory, to a UTF-8 encoded text mappings file containing a key ⇒ value mapping per line.

Either the mappings or mappings_path parameter must be provided.

Example configuration

In this example, we configure the mapping character filter to replace Arabic numerals with their Latin equivalents:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "keyword",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            "٠ => 0",
            "١ => 1",
            "٢ => 2",
            "٣ => 3",
            "٤ => 4",
            "٥ => 5",
            "٦ => 6",
            "٧ => 7",
            "٨ => 8",
            "٩ => 9"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My license plate is ٢٥٠١٥"
}

The above example produces the following term:

[ My license plate is 25015 ]

Keys and values can be strings with multiple characters. The following example replaces the :) and :( emoticons with a text equivalent:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "mapping",
          "mappings": [
            ":) => _happy_",
            ":( => _sad_"
          ]
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "I'm delighted about it :("
}

The above example produces the following terms:

[ I'm, delighted, about, it, _sad_ ]

Pattern Replace Char Filter

The pattern_replace character filter uses a regular expression to match characters which should be replaced with the specified replacement string. The replacement string can refer to capture groups in the regular expression.

Warning
Beware of Pathological Regular Expressions

The pattern replace character filter uses Java Regular Expressions.

A badly written regular expression could run very slowly or even throw a StackOverflowError and cause the node it is running on to exit suddenly.

Configuration

The pattern_replace character filter accepts the following parameters:

pattern

A Java regular expression. Required.

replacement

The replacement string, which can reference capture groups using the $1..$9 syntax, as explained here.

flags

Java regular expression flags. Flags should be pipe-separated, e.g. "CASE_INSENSITIVE|COMMENTS".

Example configuration

In this example, we configure the pattern_replace character filter to replace any embedded dashes in numbers with underscores, i.e. 123-456-789 → 123_456_789:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(\\d+)-(?=\\d)",
          "replacement": "$1_"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "My credit card is 123-456-789"
}

The above example produces the following terms:

[ My, credit, card, is, 123_456_789 ]
Warning
Using a replacement string that changes the length of the original text will work for search purposes, but will result in incorrect highlighting, as can be seen in the following example.

This example inserts a space whenever it encounters a lower-case letter followed by an upper-case letter (i.e. fooBarBaz → foo Bar Baz), allowing camelCase words to be queried individually:

PUT my_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer": {
          "tokenizer": "standard",
          "char_filter": [
            "my_char_filter"
          ],
          "filter": [
            "lowercase"
          ]
        }
      },
      "char_filter": {
        "my_char_filter": {
          "type": "pattern_replace",
          "pattern": "(?<=\\p{Lower})(?=\\p{Upper})",
          "replacement": " "
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "my_analyzer"
        }
      }
    }
  }
}

POST my_index/_analyze
{
  "analyzer": "my_analyzer",
  "text": "The fooBarBaz method"
}

The above returns the following terms:

[ the, foo, bar, baz, method ]

Querying for bar will find the document correctly, but highlighting on the result will produce incorrect highlights, because our character filter changed the length of the original text:

PUT my_index/_doc/1?refresh
{
  "text": "The fooBarBaz method"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "bar"
    }
  },
  "highlight": {
    "fields": {
      "text": {}
    }
  }
}

The output from the above is:

{
  "timed_out": false,
  "took": $body.took,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "my_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "text": "The fooBarBaz method"
        },
        "highlight": {
          "text": [
            "The foo<em>Ba</em>rBaz method" (1)
          ]
        }
      }
    ]
  }
}
  1. Note the incorrect highlight.