"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/reference/mapping.asciidoc" (29 Dec 2021, 6066 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Removal of mapping types

Important
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Types will be deprecated in APIs in Elasticsearch 7.0.0, and completely removed in 8.0.0.

What are mapping types?

Since the first release of Elasticsearch, each document has been stored in a single index and assigned a single mapping type. A mapping type was used to represent the type of document or entity being indexed, for instance a twitter index might have a user type and a tweet type.

Each mapping type could have its own fields, so the user type might have a full_name field, a user_name field, and an email field, while the tweet type could have a content field, a tweeted_at field and, like the user type, a user_name field.

Each document had a _type meta-field containing the type name, and searches could be limited to one or more types by specifying the type name(s) in the URL:

GET twitter/user,tweet/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

The _type field was combined with the document’s _id to generate a _uid field, so documents of different types with the same _id could exist in a single index.

Mapping types were also used to establish a parent-child relationship between documents, so documents of type question could be parents to documents of type answer.

Why are mapping types being removed?

Initially, we spoke about an "index" being similar to a "database" in an SQL database, and a "type" being equivalent to a "table".

This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.

In an Elasticsearch index, fields that have the same name in different mapping types are backed by the same Lucene field internally. In other words, using the example above, the user_name field in the user type is stored in exactly the same field as the user_name field in the tweet type, and both user_name fields must have the same mapping (definition) in both types.

This can lead to frustration when, for example, you want deleted to be a date field in one type and a boolean field in another type in the same index.

On top of that, storing different entities that have few or no fields in common in the same index leads to sparse data and interferes with Lucene’s ability to compress documents efficiently.

For these reasons, we have decided to remove the concept of mapping types from Elasticsearch.

Alternatives to mapping types

Index per document type

The first alternative is to have an index per document type. Instead of storing tweets and users in a single twitter index, you could store tweets in the tweets index and users in the users index. Indices are completely independent of each other and so there will be no conflict of field types between indices.

This approach has two benefits:

  • Data is more likely to be dense and so benefit from compression techniques used in Lucene.

  • The term statistics used for scoring in full text search are more likely to be accurate because all documents in the same index represent a single entity.

Each index can be sized appropriately for the number of documents it will contain: you can use a smaller number of primary shards for users and a larger number of primary shards for tweets.
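
As a rough sketch (the shard counts below are purely illustrative, not recommendations), the two indices could be created with different numbers of primary shards like this:

PUT users
{
  "settings": {
    "number_of_shards": 1 (1)
  }
}

PUT tweets
{
  "settings": {
    "number_of_shards": 8 (2)
  }
}
  1. A small collection of users can fit comfortably in a single primary shard (illustrative value).

  2. A much larger collection of tweets is spread over more primary shards (illustrative value).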

Custom type field

Of course, there is a limit to how many primary shards can exist in a cluster so you may not want to waste an entire shard for a collection of only a few thousand documents. In this case, you can implement your own custom type field which will work in a similar way to the old _type.

Let’s take the user/tweet example above. Originally, the workflow would have looked something like this:

PUT twitter
{
  "mappings": {
    "user": {
      "properties": {
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" }
      }
    },
    "tweet": {
      "properties": {
        "content": { "type": "text" },
        "user_name": { "type": "keyword" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

PUT twitter/user/kimchy
{
  "name": "Shay Banon",
  "user_name": "kimchy",
  "email": "shay@kimchy.com"
}

PUT twitter/tweet/1
{
  "user_name": "kimchy",
  "tweeted_at": "2017-10-24T09:00:00Z",
  "content": "Types are going away"
}

GET twitter/tweet/_search
{
  "query": {
    "match": {
      "user_name": "kimchy"
    }
  }
}

You could achieve the same thing by adding a custom type field as follows:

PUT twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "type": { "type": "keyword" }, (1)
        "name": { "type": "text" },
        "user_name": { "type": "keyword" },
        "email": { "type": "keyword" },
        "content": { "type": "text" },
        "tweeted_at": { "type": "date" }
      }
    }
  }
}

PUT twitter/_doc/user-kimchy
{
  "type": "user", (1)
  "name": "Shay Banon",
  "user_name": "kimchy",
  "email": "shay@kimchy.com"
}

PUT twitter/_doc/tweet-1
{
  "type": "tweet", (1)
  "user_name": "kimchy",
  "tweeted_at": "2017-10-24T09:00:00Z",
  "content": "Types are going away"
}

GET twitter/_search
{
  "query": {
    "bool": {
      "must": {
        "match": {
          "user_name": "kimchy"
        }
      },
      "filter": {
        "match": {
          "type": "tweet" (1)
        }
      }
    }
  }
}
  1. The explicit type field takes the place of the implicit _type field.

Parent/Child without mapping types

Previously, a parent-child relationship was represented by making one mapping type the parent, and one or more other mapping types the children. Without types, we can no longer use this syntax. The parent-child feature will continue to function as before, except that the way of expressing the relationship between documents has been changed to use the new join field.
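
As a hedged sketch of the join field (the index and field names here are illustrative, not taken from this guide), a question/answer relationship could be mapped and indexed roughly like this:

PUT qa_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "qa_relation": {
          "type": "join", (1)
          "relations": {
            "question": "answer"
          }
        }
      }
    }
  }
}

PUT qa_index/_doc/1?refresh
{
  "text": "This is a question",
  "qa_relation": "question" (2)
}

PUT qa_index/_doc/2?routing=1&refresh
{
  "text": "This is an answer",
  "qa_relation": { (3)
    "name": "answer",
    "parent": "1"
  }
}
  1. The join field declares that question documents are parents of answer documents.

  2. A parent document simply names its relation.

  3. A child document names its relation and its parent's _id; the routing parameter keeps the child on the same shard as its parent.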

Schedule for removal of mapping types

This is a big change for our users, so we have tried to make it as painless as possible. The change will roll out as follows:

Elasticsearch 5.6.0
  • Setting index.mapping.single_type: true on an index will enable the single-type-per-index behaviour which will be enforced in 6.0.

  • The join field replacement for parent-child is available on indices created in 5.6.

Elasticsearch 6.x
  • Indices created in 5.x will continue to function in 6.x as they did in 5.x.

  • Indices created in 6.x only allow a single type per index. Any name can be used for the type, but there can be only one. The preferred type name is _doc, so that index APIs have the same path as they will have in 7.0: PUT {index}/_doc/{id} and POST {index}/_doc.

  • The _type name can no longer be combined with the _id to form the _uid field. The _uid field has become an alias for the _id field.

  • New indices no longer support the old-style of parent/child and should use the join field instead.

  • The default mapping type is deprecated.

  • In 6.7, the index creation, index template, and mapping APIs support a query string parameter (include_type_name) which indicates whether requests and responses should include a type name. It defaults to true, and should be set to an explicit value to prepare to upgrade to 7.0. Not setting include_type_name will result in a deprecation warning. Indices which don’t have an explicit type will use the dummy type name _doc.

Elasticsearch 7.x
  • Specifying types in requests is deprecated. For instance, indexing a document no longer requires a document type. The new index APIs are PUT {index}/_doc/{id} in case of explicit ids and POST {index}/_doc for auto-generated ids.

  • The include_type_name parameter in the index creation, index template, and mapping APIs will default to false. Setting the parameter at all will result in a deprecation warning.

  • The default mapping type is removed.

Elasticsearch 8.x
  • Specifying types in requests is no longer supported.

  • The include_type_name parameter is removed.

Migrating multi-type indices to single-type

The Reindex API can be used to convert multi-type indices to single-type indices. The following examples can be used in Elasticsearch 5.6 or Elasticsearch 6.x. In 6.x, there is no need to specify index.mapping.single_type as that is the default.

Index per document type

This first example splits our twitter index into a tweets index and a users index:

PUT users
{
  "settings": {
    "index.mapping.single_type": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "user_name": {
          "type": "keyword"
        },
        "email": {
          "type": "keyword"
        }
      }
    }
  }
}

PUT tweets
{
  "settings": {
    "index.mapping.single_type": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "content": {
          "type": "text"
        },
        "user_name": {
          "type": "keyword"
        },
        "tweeted_at": {
          "type": "date"
        }
      }
    }
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "user"
  },
  "dest": {
    "index": "users",
    "type": "_doc"
  }
}

POST _reindex
{
  "source": {
    "index": "twitter",
    "type": "tweet"
  },
  "dest": {
    "index": "tweets",
    "type": "_doc"
  }
}

Custom type field

This next example adds a custom type field and sets it to the value of the original _type. It also adds the type to the _id in case there are any documents of different types which have conflicting IDs:

PUT new_twitter
{
  "mappings": {
    "_doc": {
      "properties": {
        "type": {
          "type": "keyword"
        },
        "name": {
          "type": "text"
        },
        "user_name": {
          "type": "keyword"
        },
        "email": {
          "type": "keyword"
        },
        "content": {
          "type": "text"
        },
        "tweeted_at": {
          "type": "date"
        }
      }
    }
  }
}


POST _reindex
{
  "source": {
    "index": "twitter"
  },
  "dest": {
    "index": "new_twitter"
  },
  "script": {
    "source": """
      ctx._source.type = ctx._type;
      ctx._id = ctx._type + '-' + ctx._id;
      ctx._type = '_doc';
    """
  }
}

Typeless APIs in 7.0

In Elasticsearch 7.0, each API will support typeless requests, and specifying a type will produce a deprecation warning. Certain typeless APIs are also available in 6.7, to enable a smooth upgrade path to 7.0.

Indices APIs

Index creation, index template, and mapping APIs support a new include_type_name URL parameter that specifies whether mapping definitions in requests and responses should contain the type name. The parameter defaults to true in version 6.7 to match the pre-7.0 behavior of using type names in mappings. It defaults to false in version 7.0 and will be removed in version 8.0.

It should be set explicitly in 6.7 to prepare to upgrade to 7.0. To avoid deprecation warnings in 6.7, the parameter can be set to either true or false. In 7.0, setting include_type_name at all will result in a deprecation warning.

See some examples of interactions with Elasticsearch with this option set to false:

PUT index?include_type_name=false
{
  "mappings": {
    "properties": { (1)
      "foo": {
        "type": "keyword"
      }
    }
  }
}
  1. Mappings are included directly under the mappings key, without a type name.

PUT index/_mappings?include_type_name=false
{
  "properties": { (1)
    "bar": {
      "type": "text"
    }
  }
}
  1. Mappings are included directly under the mappings key, without a type name.

GET index/_mappings?include_type_name=false

The above call returns

{
  "index": {
    "mappings": {
      "properties": { (1)
        "foo": {
          "type": "keyword"
        },
        "bar": {
          "type": "text"
        }
      }
    }
  }
}
  1. Mappings are included directly under the mappings key, without a type name.

Index templates

It is recommended to make index templates typeless by re-adding them with include_type_name set to false. Under the hood, typeless templates will use the dummy type _doc when creating indices.

In case typeless templates are used with typed index creation calls or typed templates are used with typeless index creation calls, the template will still be applied but the index creation call decides whether there should be a type or not. For instance in the below example, index-1-01 will have a type in spite of the fact that it matches a template that is typeless, and index-2-01 will be typeless in spite of the fact that it matches a template that defines a type. Both index-1-01 and index-2-01 will inherit the foo field from the template that they match.

PUT _template/template1?include_type_name=false
{
  "index_patterns":[ "index-1-*" ],
  "mappings": {
    "properties": {
      "foo": {
        "type": "keyword"
      }
    }
  }
}

PUT _template/template2?include_type_name=true
{
  "index_patterns":[ "index-2-*" ],
  "mappings": {
    "type": {
      "properties": {
        "foo": {
          "type": "keyword"
        }
      }
    }
  }
}

PUT index-1-01?include_type_name=true
{
  "mappings": {
    "type": {
      "properties": {
        "bar": {
          "type": "long"
        }
      }
    }
  }
}

PUT index-2-01?include_type_name=false
{
  "mappings": {
    "properties": {
      "bar": {
        "type": "long"
      }
    }
  }
}

In case of implicit index creation, because of documents that get indexed in an index that doesn’t exist yet, the template is always honored. This is usually not a problem due to the fact that typeless index calls work on typed indices.

Mixed-version clusters

In a cluster composed of both 6.7 and 7.0 nodes, the parameter include_type_name should be specified in indices APIs like index creation. This is because the parameter has a different default between 6.7 and 7.0, so the same mapping definition will not be valid for both node versions.

Typeless document APIs such as bulk and update are only available as of 7.0, and will not work with 6.7 nodes. This also holds true for the typeless versions of queries that perform document lookups, such as terms.

Field datatypes

Elasticsearch supports a number of different datatypes for the fields in a document:

Core datatypes

string

text and keyword

Numeric datatypes

long, integer, short, byte, double, float, half_float, scaled_float

Date datatype

date

Boolean datatype

boolean

Binary datatype

binary

Range datatypes

integer_range, float_range, long_range, double_range, date_range, ip_range

Complex datatypes

Object datatype

object for single JSON objects

Nested datatype

nested for arrays of JSON objects

Geo datatypes

Geo-point datatype

geo_point for lat/lon points

Geo-Shape datatype

geo_shape for complex shapes like polygons

Specialised datatypes

IP datatype

ip for IPv4 and IPv6 addresses

Completion datatype

completion to provide auto-complete suggestions

Token count datatype

token_count to count the number of tokens in a string

Mapper murmur3 datatype (mapper-murmur3 plugin)

murmur3 to compute hashes of values at index-time and store them in the index

Mapper annotated text datatype (mapper-annotated-text plugin)

annotated-text to index text containing special markup (typically used for identifying named entities)

Percolator type

Accepts queries from the query-dsl

join datatype

Defines parent/child relation for documents within the same index

Alias datatype

Defines an alias to an existing field.

Arrays

In Elasticsearch, arrays do not require a dedicated field datatype. Any field can contain zero or more values by default, however, all values in the array must be of the same datatype. See Arrays.

Multi-fields

It is often useful to index the same field in different ways for different purposes. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations. Alternatively, you could index a text field with the standard analyzer, the english analyzer, and the french analyzer.

This is the purpose of multi-fields. Most datatypes support multi-fields via the fields parameter.
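
As a minimal sketch (the index and field names are illustrative), a text field with a keyword sub-field is declared via the fields parameter like this:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text", (1)
          "fields": {
            "raw": {
              "type": "keyword" (2)
            }
          }
        }
      }
    }
  }
}
  1. The city field can be used for full-text search.

  2. The city.raw sub-field can be used for sorting and aggregations.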

Alias datatype

Note
Field aliases can only be specified on indices with a single mapping type. To add a field alias, the index must therefore have been created in 6.0 or later, or be an older index with the setting index.mapping.single_type: true. Please see Removal of mapping types for more information.

An alias mapping defines an alternate name for a field in the index. The alias can be used in place of the target field in search requests, and selected other APIs like field capabilities.

PUT trips
{
  "mappings": {
    "_doc": {
      "properties": {
        "distance": {
          "type": "long"
        },
        "route_length_miles": {
          "type": "alias",
          "path": "distance" (1)
        },
        "transit_mode": {
          "type": "keyword"
        }
      }
    }
  }
}

GET _search
{
  "query": {
    "range" : {
      "route_length_miles" : {
        "gte" : 39
      }
    }
  }
}
  1. The path to the target field. Note that this must be the full path, including any parent objects (e.g. object1.object2.field).

Almost all components of the search request accept field aliases. In particular, aliases can be used in queries, aggregations, and sort fields, as well as when requesting docvalue_fields, stored_fields, suggestions, and highlights. Scripts also support aliases when accessing field values. Please see the section on unsupported APIs for exceptions.
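
For example, a hedged sketch (building on the trips index above) that both sorts on the alias and aggregates on it:

GET trips/_search
{
  "query": {
    "match_all": {}
  },
  "sort": [
    "route_length_miles" (1)
  ],
  "aggs": {
    "avg_route_length": {
      "avg": {
        "field": "route_length_miles" (2)
      }
    }
  }
}
  1. The alias can be used as a sort field in place of its target.

  2. The alias can also be used in aggregations in place of its target field.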

In some parts of the search request and when requesting field capabilities, field wildcard patterns can be provided. In these cases, the wildcard pattern will match field aliases in addition to concrete fields:

GET trips/_field_caps?fields=route_*,transit_mode

Alias targets

There are a few restrictions on the target of an alias:

  • The target must be a concrete field, and not an object or another field alias.

  • The target field must exist at the time the alias is created.

  • If nested objects are defined, a field alias must have the same nested scope as its target.

Additionally, a field alias can only have one target. This means that it is not possible to use a field alias to query over multiple target fields in a single clause.

An alias can be changed to refer to a new target through a mappings update. A known limitation is that if any stored percolator queries contain the field alias, they will still refer to its original target. More information can be found in the percolator documentation.
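
As a hedged sketch (the distance_km field is hypothetical; since an alias target must be a concrete field, it is added first), the alias from the trips example could be re-pointed with a mappings update:

PUT trips/_mapping/_doc
{
  "properties": {
    "distance_km": { (1)
      "type": "long"
    }
  }
}

PUT trips/_mapping/_doc
{
  "properties": {
    "route_length_miles": {
      "type": "alias",
      "path": "distance_km" (2)
    }
  }
}
  1. First add the new concrete target field (hypothetical name).

  2. Then update the alias so that it points at the new target.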

Unsupported APIs

Writes to field aliases are not supported: attempting to use an alias in an index or update request will result in a failure. Likewise, aliases cannot be used as the target of copy_to or in multi-fields.

Because alias names are not present in the document source, aliases cannot be used when performing source filtering. For example, the following request will return an empty result for _source:

GET /_search
{
  "query" : {
    "match_all": {}
  },
  "_source": "route_length_miles"
}

Currently only the search and field capabilities APIs will accept and resolve field aliases. Other APIs that accept field names, such as term vectors, cannot be used with field aliases.

Finally, some queries, such as terms, geo_shape, and more_like_this, allow for fetching query information from an indexed document. Because field aliases aren’t supported when fetching documents, the part of the query that specifies the lookup path cannot refer to a field by its alias.

Arrays

In Elasticsearch, there is no dedicated array datatype. Any field can contain zero or more values by default, however, all values in the array must be of the same datatype. For instance:

  • an array of strings: [ "one", "two" ]

  • an array of integers: [ 1, 2 ]

  • an array of arrays: [ 1, [ 2, 3 ]] which is the equivalent of [ 1, 2, 3 ]

  • an array of objects: [ { "name": "Mary", "age": 12 }, { "name": "John", "age": 10 }]

Note
Arrays of objects

Arrays of objects do not work as you would expect: you cannot query each object independently of the other objects in the array. If you need to be able to do this then you should use the nested datatype instead of the object datatype.

This is explained in more detail in Nested datatype.

When adding a field dynamically, the first value in the array determines the field type. All subsequent values must be of the same datatype or it must at least be possible to coerce subsequent values to the same datatype.

Arrays with a mixture of datatypes are not supported: [ 10, "some string" ]

An array may contain null values, which are either replaced by the configured null_value or skipped entirely. An empty array [] is treated as a missing field — a field with no values.

Nothing needs to be pre-configured in order to use arrays in documents, they are supported out of the box:

PUT my_index/_doc/1
{
  "message": "some arrays in this document...",
  "tags":  [ "elasticsearch", "wow" ], (1)
  "lists": [ (2)
    {
      "name": "prog_list",
      "description": "programming list"
    },
    {
      "name": "cool_list",
      "description": "cool stuff list"
    }
  ]
}

PUT my_index/_doc/2 (3)
{
  "message": "no arrays in this document...",
  "tags":  "elasticsearch",
  "lists": {
    "name": "prog_list",
    "description": "programming list"
  }
}

GET my_index/_search
{
  "query": {
    "match": {
      "tags": "elasticsearch" (4)
    }
  }
}
  1. The tags field is dynamically added as a string field.

  2. The lists field is dynamically added as an object field.

  3. The second document contains no arrays, but can be indexed into the same fields.

  4. The query looks for elasticsearch in the tags field, and matches both documents.

Multi-value fields and the inverted index

The fact that all field types support multi-value fields out of the box is a consequence of the origins of Lucene. Lucene was designed to be a full text search engine. In order to be able to search for individual words within a big block of text, Lucene tokenizes the text into individual terms, and adds each term to the inverted index separately.

This means that even a simple text field must be able to support multiple values by default. When other datatypes were added, such as numbers and dates, they used the same data structure as strings, and so got multi-values for free.

Binary datatype

The binary type accepts a binary value as a Base64 encoded string. The field is not stored by default and is not searchable:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": {
          "type": "text"
        },
        "blob": {
          "type": "binary"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "name": "Some binary blob",
  "blob": "U29tZSBiaW5hcnkgYmxvYg==" (1)
}
  1. The Base64 encoded binary value must not have embedded newlines \n.

Parameters for binary fields

The following parameters are accepted by binary fields:

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true or false (default).

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).
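
A minimal sketch (index name illustrative) that enables both parameters on a binary field:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "blob": {
          "type": "binary",
          "doc_values": true, (1)
          "store": true (2)
        }
      }
    }
  }
}
  1. Enables doc values so the field can later be used for sorting, aggregations, or scripting.

  2. Stores the field value so it can be retrieved separately from the _source field.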

Range datatypes

The following range types are supported:

integer_range

A range of signed 32-bit integers with a minimum value of -2^31 and maximum of 2^31-1.

float_range

A range of single-precision 32-bit IEEE 754 floating point values.

long_range

A range of signed 64-bit integers with a minimum value of -2^63 and maximum of 2^63-1.

double_range

A range of double-precision 64-bit IEEE 754 floating point values.

date_range

A range of date values represented as unsigned 64-bit integer milliseconds elapsed since system epoch.

ip_range

A range of ip values supporting either IPv4 or IPv6 (or mixed) addresses.

Below is an example of configuring a mapping with various range fields followed by an example that indexes several range types.

PUT range_index
{
  "settings": {
    "number_of_shards": 2
  },
  "mappings": {
    "_doc": {
      "properties": {
        "expected_attendees": {
          "type": "integer_range"
        },
        "time_frame": {
          "type": "date_range", (1)
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

PUT range_index/_doc/1?refresh
{
  "expected_attendees" : { (2)
    "gte" : 10,
    "lte" : 20
  },
  "time_frame" : { (3)
    "gte" : "2015-10-31 12:00:00", (4)
    "lte" : "2015-11-01"
  }
}
  1. date_range types accept the same field parameters defined by the date type.

  2. Example indexing a meeting with 10 to 20 attendees.

  3. Date ranges accept the same format as described in date range queries.

  4. Example date range using date time stamp. This also accepts date math formatting. Note that "now" cannot be used at indexing time.

The following is an example of a term query on the integer_range field named "expected_attendees".

GET range_index/_search
{
  "query" : {
    "term" : {
      "expected_attendees" : {
        "value": 12
      }
    }
  }
}

The result produced by the above query.

{
  "took": 13,
  "timed_out": false,
  "_shards" : {
    "total": 2,
    "successful": 2,
    "skipped" : 0,
    "failed": 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "range_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "expected_attendees" : {
            "gte" : 10, "lte" : 20
          },
          "time_frame" : {
            "gte" : "2015-10-31 12:00:00", "lte" : "2015-11-01"
          }
        }
      }
    ]
  }
}

The following is an example of a date_range query over the date_range field named "time_frame".

GET range_index/_search
{
  "query" : {
    "range" : {
      "time_frame" : { (1)
        "gte" : "2015-10-31",
        "lte" : "2015-11-01",
        "relation" : "within" (2)
      }
    }
  }
}
  1. Range queries work the same as described in range query.

  2. Range queries over range fields support a relation parameter which can be one of WITHIN, CONTAINS, INTERSECTS (default).

This query produces a similar result:

{
  "took": 13,
  "timed_out": false,
  "_shards" : {
    "total": 2,
    "successful": 2,
    "skipped" : 0,
    "failed": 0
  },
  "hits" : {
    "total" : 1,
    "max_score" : 1.0,
    "hits" : [
      {
        "_index" : "range_index",
        "_type" : "_doc",
        "_id" : "1",
        "_score" : 1.0,
        "_source" : {
          "expected_attendees" : {
            "gte" : 10, "lte" : 20
          },
          "time_frame" : {
            "gte" : "2015-10-31 12:00:00", "lte" : "2015-11-01"
          }
        }
      }
    ]
  }
}

IP Range

In addition to the range format above, IP ranges can be provided in CIDR notation:

PUT range_index/_mapping/_doc
{
  "properties": {
    "ip_whitelist": {
      "type": "ip_range"
    }
  }
}

PUT range_index/_doc/2
{
  "ip_whitelist" : "192.168.0.0/16"
}

Parameters for range fields

The following parameters are accepted by range types:

coerce

Try to convert strings to numbers and truncate fractions for integers. Accepts true (default) and false.

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

index

Should the field be searchable? Accepts true (default) and false.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

Boolean datatype

Boolean fields accept JSON true and false values, but can also accept strings which are interpreted as either true or false:

False values

false, "false", "" (empty string)

True values

true, "true"

For example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "is_published": {
          "type": "boolean"
        }
      }
    }
  }
}

POST my_index/_doc/1
{
  "is_published": "true" (1)
}

GET my_index/_search
{
  "query": {
    "term": {
      "is_published": true (2)
    }
  }
}
  1. Indexing a document with "true", which is interpreted as true.

  2. Searching for documents with a JSON true.

Aggregations like the terms aggregation use 1 and 0 for the key, and the strings "true" and "false" for the key_as_string. Boolean fields, when used in scripts, return 1 and 0:

POST my_index/_doc/1
{
  "is_published": true
}

POST my_index/_doc/2
{
  "is_published": false
}

GET my_index/_search
{
  "aggs": {
    "publish_state": {
      "terms": {
        "field": "is_published"
      }
    }
  },
  "script_fields": {
    "is_published": {
      "script": {
        "lang": "painless",
        "source": "doc['is_published'].value"
      }
    }
  }
}

Parameters for boolean fields

The following parameters are accepted by boolean fields:

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts any of the true or false values listed above. The value is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).
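
As a sketch of the null_value parameter above (index name illustrative), an explicit null can be mapped to false so that such documents still match a term query:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "is_published": {
          "type": "boolean",
          "null_value": false (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "is_published": null (2)
}

GET my_index/_search
{
  "query": {
    "term": {
      "is_published": false
    }
  }
}
  1. Explicit null values will be indexed as false.

  2. This document matches the term query below because of the configured null_value.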

Date datatype

JSON doesn’t have a date datatype, so dates in Elasticsearch can either be:

  • strings containing formatted dates, e.g. "2015-01-01" or "2015/01/01 12:10:30".

  • a long number representing milliseconds-since-the-epoch.

  • an integer representing seconds-since-the-epoch.

Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.

Queries on dates are internally converted to range queries on this long representation, and the result of aggregations and stored fields is converted back to a string depending on the date format that is associated with the field.

Note
Dates will always be rendered as strings, even if they were initially supplied as a long in the JSON document.

Date formats can be customised, but if no format is specified then it uses the default:

"strict_date_optional_time||epoch_millis"

This means that it will accept dates with optional timestamps, which conform to the formats supported by strict_date_optional_time or milliseconds-since-the-epoch.

For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type": "date" (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "date": "2015-01-01" } (2)

PUT my_index/_doc/2
{ "date": "2015-01-01T12:10:30Z" } (3)

PUT my_index/_doc/3
{ "date": 1420070400001 } (4)

GET my_index/_search
{
  "sort": { "date": "asc"} (5)
}
  1. The date field uses the default format.

  2. This document uses a plain date.

  3. This document includes a time.

  4. This document uses milliseconds-since-the-epoch.

  5. Note that the sort values that are returned are all in milliseconds-since-the-epoch.

Multiple date formats

Multiple formats can be specified by separating them with || as a separator. Each format will be tried in turn until a matching format is found. The first format will be used to convert the milliseconds-since-the-epoch value back into a string.

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
        }
      }
    }
  }
}

Parameters for date fields

The following parameters are accepted by date fields:

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

format

The date format(s) that can be parsed. Defaults to strict_date_optional_time||epoch_millis.

locale

The locale to use when parsing dates since months do not have the same names and/or abbreviations in all languages. The default is the ROOT locale.

ignore_malformed

If true, malformed dates are ignored. If false (default), malformed dates throw an exception and reject the whole document.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts a date value in one of the configured formats, which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

Geo-point datatype

Fields of type geo_point accept latitude-longitude pairs, which can be used to find geo-points within a bounding box or within a certain distance of a central point, to aggregate documents geographically or by distance from a central point, and to sort documents by distance.

There are four ways that a geo-point may be specified, as demonstrated below:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "location": {
          "type": "geo_point"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text": "Geo-point as an object",
  "location": { (1)
    "lat": 41.12,
    "lon": -71.34
  }
}

PUT my_index/_doc/2
{
  "text": "Geo-point as a string",
  "location": "41.12,-71.34" (2)
}

PUT my_index/_doc/3
{
  "text": "Geo-point as a geohash",
  "location": "drm3btev3e86" (3)
}

PUT my_index/_doc/4
{
  "text": "Geo-point as an array",
  "location": [ -71.34, 41.12 ] (4)
}

GET my_index/_search
{
  "query": {
    "geo_bounding_box": { (5)
      "location": {
        "top_left": {
          "lat": 42,
          "lon": -72
        },
        "bottom_right": {
          "lat": 40,
          "lon": -74
        }
      }
    }
  }
}
  1. Geo-point expressed as an object, with lat and lon keys.

  2. Geo-point expressed as a string with the format: "lat,lon".

  3. Geo-point expressed as a geohash.

  4. Geo-point expressed as an array with the format: [ lon, lat]

  5. A geo-bounding box query which finds all geo-points that fall inside the box.

Important
Geo-points expressed as an array or string

Please note that string geo-points are ordered as lat,lon, while array geo-points are ordered as the reverse: lon,lat.

Originally, lat,lon was used for both array and string, but the array format was changed early on to conform to the format used by GeoJSON.

Note
A point can be expressed as a geohash. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. Each character in a geohash adds an additional 5 bits to the precision, so the longer the hash, the more precise it is. For indexing purposes geohashes are translated into latitude-longitude pairs. During this process only the first 12 characters are used, so specifying more than 12 characters in a geohash doesn’t increase the precision. The 12 characters provide 60 bits, which should reduce a possible error to less than 2cm.

Parameters for geo_point fields

The following parameters are accepted by geo_point fields:

ignore_malformed

If true, malformed geo-points are ignored. If false (default), malformed geo-points throw an exception and reject the whole document.

ignore_z_value

If true (default) three dimension points will be accepted (stored in source) but only latitude and longitude values will be indexed; the third dimension is ignored. If false, geo-points containing any more than latitude and longitude (two dimensions) values throw an exception and reject the whole document.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts a geo-point value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

Using geo-points in scripts

When accessing the value of a geo-point in a script, the value is returned as a GeoPoint object, which allows access to the .lat and .lon values respectively:

def geopoint = doc['location'].value;
def lat      = geopoint.lat;
def lon      = geopoint.lon;

For performance reasons, it is better to access the lat/lon values directly:

def lat      = doc['location'].lat;
def lon      = doc['location'].lon;

Geo-Shape datatype

The geo_shape datatype facilitates the indexing of and searching with arbitrary geo shapes such as rectangles and polygons. It should be used when either the data being indexed or the queries being executed contain shapes other than just points.

Documents using this type can be queried with the geo_shape query.
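
For instance, a hedged sketch of such a query (assuming the example index and location field mapped later in this section; the envelope coordinates are illustrative):

GET /example/_search
{
  "query": {
    "geo_shape": {
      "location": {
        "shape": {
          "type": "envelope",
          "coordinates": [ [13.0, 53.0], [14.0, 52.0] ] (1)
        },
        "relation": "within" (2)
      }
    }
  }
}
  1. An envelope given as [[minLon, maxLat], [maxLon, minLat]] (illustrative coordinates).

  2. Matches documents whose shape falls entirely within the envelope.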

Mapping Options

The geo_shape mapping maps GeoJSON geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.

The following mapping options are accepted (each listed with its default value):

tree

Deprecated in 6.6 (PrefixTrees no longer used). Name of the PrefixTree implementation to be used: geohash for GeohashPrefixTree and quadtree for QuadPrefixTree. Note: This parameter is only relevant for term and recursive strategies. Default: quadtree.

precision

Deprecated in 6.6 (PrefixTrees no longer used). This parameter may be used instead of tree_levels to set an appropriate value for the tree_levels parameter. The value specifies the desired precision and Elasticsearch will calculate the best tree_levels value to honor this precision. The value should be a number followed by an optional distance unit. Valid distance units include: in, inch, yd, yard, mi, miles, km, kilometers, m, meters, cm, centimeters, mm, millimeters. Note: This parameter is only relevant for term and recursive strategies. Default: 50m.

tree_levels

Deprecated in 6.6 (PrefixTrees no longer used). Maximum number of layers to be used by the PrefixTree. This can be used to control the precision of shape representations and therefore how many terms are indexed. Defaults to the default value of the chosen PrefixTree implementation. Since this parameter requires a certain level of understanding of the underlying implementation, users may use the precision parameter instead. However, Elasticsearch only uses the tree_levels parameter internally and this is what is returned via the mapping API even if you use the precision parameter. Note: This parameter is only relevant for term and recursive strategies. Default: various (depends on the chosen PrefixTree implementation).

strategy

Deprecated in 6.6 (PrefixTrees no longer used). The strategy parameter defines the approach for how to represent shapes at indexing and search time. It also influences the capabilities available, so it is recommended to let Elasticsearch set this parameter automatically. There are two strategies available: recursive and term. The recursive and term strategies are deprecated and will be removed in a future version. While they are still available, the term strategy supports point types only (the points_only parameter will be automatically set to true) while the recursive strategy supports all shape types. (IMPORTANT: see Prefix trees for more detailed information about these strategies.) Default: recursive.

distance_error_pct

Deprecated in 6.6 (PrefixTrees no longer used). Used as a hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%) with 0.5 as the maximum supported value. PERFORMANCE NOTE: This value will default to 0 if a precision or tree_level definition is explicitly defined. This guarantees spatial precision at the level defined in the mapping. This can lead to significant memory usage for high resolution shapes with low error (e.g., large shapes at 1m with < 0.001 error). To improve indexing performance (at the cost of query accuracy) explicitly define tree_level or precision along with a reasonable distance_error_pct, noting that large shapes will have greater false positives. Note: This parameter is only relevant for term and recursive strategies. Default: 0.025.

orientation

Optionally define how to interpret vertex order for polygons / multipolygons. This parameter defines one of two coordinate system rules (Right-hand or Left-hand), each of which can be specified in three different ways: 1. Right-hand rule: right, ccw, counterclockwise; 2. Left-hand rule: left, cw, clockwise. The default orientation (counterclockwise) complies with the OGC standard, which defines outer ring vertices in counterclockwise order with inner ring(s) vertices (holes) in clockwise order. Setting this parameter in the geo_shape mapping explicitly sets vertex order for the coordinate list of a geo_shape field, but can be overridden in each individual GeoJSON or WKT document. Default: ccw.

points_only

Deprecated in 6.6 (PrefixTrees no longer used). Setting this option to true (defaults to false) configures the geo_shape field type for point shapes only (NOTE: Multi-Points are not yet supported). This optimizes index and search performance for the geohash and quadtree when it is known that only points will be indexed. At present geo_shape queries cannot be executed on geo_point field types. This option bridges the gap by improving point performance on a geo_shape field so that geo_shape queries are optimal on a point-only field. Default: false.

ignore_malformed

If true, malformed GeoJSON or WKT shapes are ignored. If false (default), malformed GeoJSON and WKT shapes throw an exception and reject the entire document. Default: false.

ignore_z_value

If true (default), three dimension points will be accepted (stored in source) but only latitude and longitude values will be indexed; the third dimension is ignored. If false, geo-points containing any more than latitude and longitude (two dimensions) values throw an exception and reject the whole document. Default: true.

coerce

If true, unclosed linear rings in polygons will be automatically closed. Default: false.

Indexing approach

GeoShape types are indexed by decomposing the shape into a triangular mesh and indexing each triangle as a 7-dimension point in a BKD tree. This provides near perfect spatial resolution (down to 1e-7 decimal degree precision) since all spatial relations are computed using an encoded vector representation of the original shape instead of a raster-grid representation as used by the Prefix trees indexing approach. Performance of the tessellator primarily depends on the number of vertices that define the polygon/multi-polygon. While this is the default indexing technique, prefix trees can still be used by setting the tree or strategy parameters according to the appropriate Mapping Options. Note that these parameters are now deprecated and will be removed in a future version.

IMPORTANT NOTES

The following features are not yet supported with the new indexing approach:

  • geo_shape query with MultiPoint geometry types - Elasticsearch currently prevents searching geo_shape fields with a MultiPoint geometry type to avoid a brute force linear search over each individual point. For now, if this is absolutely needed, this can be achieved using a bool query with each individual point.

  • CONTAINS relation query - when using the new default vector indexing strategy, geo_shape queries with relation defined as contains are not yet supported. If this query relation is an absolute necessity, it is recommended to set strategy to quadtree and use the deprecated PrefixTree strategy indexing approach.

Prefix trees

Deprecated in 6.6 (PrefixTrees no longer used). To efficiently represent shapes in an inverted index, shapes are converted into a series of hashes representing grid squares (commonly referred to as "rasters") using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth. This can be thought of as increasing the level of detail of a map or image at higher zoom levels. Since this approach causes precision issues with indexed shapes, it has been deprecated in favor of a vector indexing approach that indexes the shapes as a triangular mesh (see Indexing approach).

Multiple PrefixTree implementations are provided:

  • GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum number of levels in Elasticsearch is 24.

  • QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohash, quad trees interleave the bits of the latitude and longitude; the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum number of levels for the quad trees in Elasticsearch is 50.

Spatial strategies

Deprecated in 6.6 (PrefixTrees no longer used). The indexing implementation selected relies on a SpatialStrategy for choosing how to decompose the shapes (either as grid squares or a tessellated triangular mesh). Each strategy answers the following:

  • What type of Shapes can be indexed?

  • What types of Query Operations and Shapes can be used?

  • Does it support more than one Shape per field?

The following Strategy implementations (with corresponding capabilities) are provided:

  • recursive: supports all shapes; supported queries: INTERSECTS, DISJOINT, WITHIN, CONTAINS; multiple shapes per field: yes.

  • term: supports points only; supported queries: INTERSECTS; multiple shapes per field: yes.

Accuracy

The recursive and term strategies do not provide 100% accuracy and, depending on how they are configured, may return some false positives for INTERSECTS, WITHIN and CONTAINS queries, and some false negatives for DISJOINT queries. To mitigate this, it is important to select an appropriate value for the tree_levels parameter and to adjust expectations accordingly. For example, a point may be near the border of a particular grid cell and may thus not match a query that only matches the cell right next to it, even though the shape is very close to the point.

Example
PUT /example
{
    "mappings": {
        "doc": {
            "properties": {
                "location": {
                    "type": "geo_shape"
                }
            }
        }
    }
}

This mapping definition maps the location field to the geo_shape type using the default vector implementation. It provides approximately 1e-7 decimal degree precision.

Performance considerations with Prefix Trees

Deprecated in 6.6 (PrefixTrees no longer used). With prefix trees, Elasticsearch uses the paths in the tree as terms in the inverted index and in queries. The higher the level (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.

The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without overly bloating the resulting index relative to the input size.

Input Structure

Shapes can be represented using either the GeoJSON or Well-Known Text (WKT) format. The following table provides a mapping of GeoJSON and WKT to Elasticsearch types:

Each entry below lists the GeoJSON type, the WKT type, and the Elasticsearch type, followed by a description:

  • Point / POINT → point: A single geographic coordinate. Note: Elasticsearch uses WGS-84 coordinates only.

  • LineString / LINESTRING → linestring: An arbitrary line given two or more points.

  • Polygon / POLYGON → polygon: A closed polygon whose first and last point must match, thus requiring n + 1 vertices to create an n-sided polygon and a minimum of 4 vertices.

  • MultiPoint / MULTIPOINT → multipoint: An array of unconnected, but likely related points.

  • MultiLineString / MULTILINESTRING → multilinestring: An array of separate linestrings.

  • MultiPolygon / MULTIPOLYGON → multipolygon: An array of separate polygons.

  • GeometryCollection / GEOMETRYCOLLECTION → geometrycollection: A GeoJSON shape similar to the multi* shapes except that multiple types can coexist (e.g., a Point and a LineString).

  • N/A / BBOX → envelope: A bounding rectangle, or envelope, specified by only its top left and bottom right points.

  • N/A / N/A → circle: A circle specified by a center point and radius with units, which default to METERS.

Note

For all types, both the inner type and coordinates fields are required.

In GeoJSON and WKT, and therefore Elasticsearch, the correct coordinate order is longitude, latitude (X, Y) within coordinate arrays. This differs from many Geospatial APIs (e.g., Google Maps) that generally use the colloquial latitude, longitude (Y, X).

Point

A point is a single geographic coordinate, such as the location of a building or the current position given by a smartphone’s Geolocation API. The following is an example of a point in GeoJSON.

POST /example/doc
{
    "location" : {
        "type" : "point",
        "coordinates" : [-77.03653, 38.897676]
    }
}

The following is an example of a point in WKT:

POST /example/doc
{
    "location" : "POINT (-77.03653 38.897676)"
}
LineString

A linestring is defined by an array of two or more positions. By specifying only two points, the linestring will represent a straight line. Specifying more than two points creates an arbitrary path. The following is an example of a LineString in GeoJSON.

POST /example/doc
{
    "location" : {
        "type" : "linestring",
        "coordinates" : [[-77.03653, 38.897676], [-77.009051, 38.889939]]
    }
}

The following is an example of a LineString in WKT:

POST /example/doc
{
    "location" : "LINESTRING (-77.03653 38.897676, -77.009051 38.889939)"
}

The above linestring would draw a straight line from the White House to the US Capitol Building.

Polygon

A polygon is defined by a list of lists of points. The first and last points in each (outer) list must be the same (the polygon must be closed). The following is an example of a Polygon in GeoJSON.

POST /example/doc
{
    "location" : {
        "type" : "polygon",
        "coordinates" : [
            [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]
        ]
    }
}

The following is an example of a Polygon in WKT:

POST /example/doc
{
    "location" : "POLYGON ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0))"
}

The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"). The following is a GeoJSON example of a polygon with a hole:

POST /example/doc
{
    "location" : {
        "type" : "polygon",
        "coordinates" : [
            [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ],
            [ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2] ]
        ]
    }
}

The following is an example of a Polygon with a hole in WKT:

POST /example/doc
{
    "location" : "POLYGON ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8, 100.2 0.2))"
}

IMPORTANT NOTE: WKT does not enforce a specific order for vertices thus ambiguous polygons around the dateline and poles are possible. GeoJSON mandates that the outer polygon must be counterclockwise and interior shapes must be clockwise, which agrees with the Open Geospatial Consortium (OGC) Simple Feature Access specification for vertex ordering.

Elasticsearch accepts both clockwise and counterclockwise polygons if they appear not to cross the dateline (i.e. they cross less than 180° of longitude), but for polygons that do cross the dateline (or for other polygons wider than 180°) Elasticsearch requires the vertex ordering to comply with the OGC and GeoJSON specifications. Otherwise, an unintended polygon may be created and unexpected query/filter results will be returned.

The following provides an example of an ambiguous polygon. Elasticsearch will apply the GeoJSON standard to eliminate ambiguity resulting in a polygon that crosses the dateline.

POST /example/doc
{
    "location" : {
        "type" : "polygon",
        "coordinates" : [
            [ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ],
            [ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ]
        ]
    }
}

An orientation parameter can be defined when setting the geo_shape mapping (see Mapping Options). This will define vertex order for the coordinate list on the mapped geo_shape field. It can also be overridden on each document. The following is an example for overriding the orientation on a document:

POST /example/doc
{
    "location" : {
        "type" : "polygon",
        "orientation" : "clockwise",
        "coordinates" : [
            [ [100.0, 0.0], [100.0, 1.0], [101.0, 1.0], [101.0, 0.0], [100.0, 0.0] ]
        ]
    }
}
MultiPoint

The following is an example of a list of GeoJSON points:

POST /example/doc
{
    "location" : {
        "type" : "multipoint",
        "coordinates" : [
            [102.0, 2.0], [103.0, 2.0]
        ]
    }
}

The following is an example of a list of WKT points:

POST /example/doc
{
    "location" : "MULTIPOINT (102.0 2.0, 103.0 2.0)"
}
MultiLineString

The following is an example of a list of GeoJSON linestrings:

POST /example/doc
{
    "location" : {
        "type" : "multilinestring",
        "coordinates" : [
            [ [102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0] ],
            [ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0] ],
            [ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8] ]
        ]
    }
}

The following is an example of a list of WKT linestrings:

POST /example/doc
{
    "location" : "MULTILINESTRING ((102.0 2.0, 103.0 2.0, 103.0 3.0, 102.0 3.0), (100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8))"
}
MultiPolygon

The following is an example of a list of GeoJSON polygons (the second polygon contains a hole):

POST /example/doc
{
    "location" : {
        "type" : "multipolygon",
        "coordinates" : [
            [ [[102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0], [102.0, 2.0]] ],
            [ [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]],
              [[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]] ]
        ]
    }
}

The following is an example of a list of WKT polygons (second polygon contains a hole):

POST /example/doc
{
    "location" : "MULTIPOLYGON (((102.0 2.0, 103.0 2.0, 103.0 3.0, 102.0 3.0, 102.0 2.0)), ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8, 100.2 0.2)))"
}
Geometry Collection

The following is an example of a collection of GeoJSON geometry objects:

POST /example/doc
{
    "location" : {
        "type": "geometrycollection",
        "geometries": [
            {
                "type": "point",
                "coordinates": [100.0, 0.0]
            },
            {
                "type": "linestring",
                "coordinates": [ [101.0, 0.0], [102.0, 1.0] ]
            }
        ]
    }
}

The following is an example of a collection of WKT geometry objects:

POST /example/doc
{
    "location" : "GEOMETRYCOLLECTION (POINT (100.0 0.0), LINESTRING (101.0 0.0, 102.0 1.0))"
}
Envelope

Elasticsearch supports an envelope type, which consists of coordinates for the upper left and lower right points of the shape to represent a bounding rectangle in the format [[minLon, maxLat], [maxLon, minLat]]:

POST /example/doc
{
    "location" : {
        "type" : "envelope",
        "coordinates" : [ [100.0, 1.0], [101.0, 0.0] ]
    }
}

The following is an example of an envelope using the WKT BBOX format:

NOTE: WKT specification expects the following order: minLon, maxLon, maxLat, minLat.

POST /example/doc
{
    "location" : "BBOX (100.0, 102.0, 2.0, 0.0)"
}
Circle

Elasticsearch supports a circle type, which consists of a center point with a radius. Note that this circle representation can only be indexed when using the recursive Prefix Tree strategy. For the default Indexing approach circles should be approximated using a POLYGON.

POST /example/doc
{
    "location" : {
        "type" : "circle",
        "coordinates" : [101.0, 1.0],
        "radius" : "100m"
    }
}

Note: The inner radius field is required. If units are not specified, the radius defaults to METERS.

NOTE: Neither GeoJSON nor WKT supports a point-radius circle type.

Sorting and Retrieving Index Shapes

Due to the complex input structure and index representation of shapes, it is not currently possible to sort shapes or retrieve their fields directly. The geo_shape value is only retrievable through the _source field.

IP datatype

An ip field can index/store either IPv4 or IPv6 addresses.

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "ip_addr": {
          "type": "ip"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "ip_addr": "192.168.1.1"
}

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}
Note
You can also store ip ranges in a single field using an ip_range datatype.
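
For example, a minimal sketch of an ip_range mapping; the index and field names here are illustrative, not part of the examples above:

PUT range_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "ip_allowlist": {
          "type": "ip_range"
        }
      }
    }
  }
}

PUT range_index/_doc/1
{
  "ip_allowlist": "192.168.0.0/16"
}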

Parameters for ip fields

The following parameters are accepted by ip fields:

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts an IPv4 value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

Querying ip fields

The most common way to query ip addresses is to use the CIDR notation: [ip_address]/[prefix_length]. For instance:

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "192.168.0.0/16"
    }
  }
}

or

GET my_index/_search
{
  "query": {
    "term": {
      "ip_addr": "2001:db8::/48"
    }
  }
}

Also beware that colons are special characters to the query_string query, so IPv6 addresses will need to be escaped. The easiest way to do so is to put quotes around the searched value:

GET my_index/_search
{
  "query": {
    "query_string" : {
      "query": "ip_addr:\"2001:db8::/48\""
    }
  }
}

Keyword datatype

A field to index structured content such as email addresses, hostnames, status codes, zip codes or tags.

They are typically used for filtering (Find me all blog posts where status is published), for sorting, and for aggregations. Keyword fields are only searchable by their exact value.

If you need to index full text content such as email bodies or product descriptions, it is likely that you should rather use a text field.

Below is an example of a mapping for a keyword field:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tags": {
          "type":  "keyword"
        }
      }
    }
  }
}

Parameters for keyword fields

The following parameters are accepted by keyword fields (a combined example follows this list):

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

eager_global_ordinals

Should global ordinals be loaded eagerly on refresh? Accepts true or false (default). Enabling this is a good idea on fields that are frequently used for terms aggregations.

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations.

ignore_above

Do not index any string longer than this value. Defaults to 2147483647 so that all values are accepted. Note, however, that the default dynamic mapping rules create a sub keyword field that overrides this default by setting ignore_above: 256.

index

Should the field be searchable? Accepts true (default) or false.

index_options

What information should be stored in the index, for scoring purposes. Defaults to docs but can also be set to freqs to take term frequency into account when computing scores.

norms

Whether field-length should be taken into account when scoring queries. Accepts true or false (default).

null_value

Accepts a string value which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

similarity

Which scoring algorithm or similarity should be used. Defaults to BM25.

normalizer

How to pre-process the keyword prior to indexing. Defaults to null, meaning the keyword is kept as-is.

split_queries_on_whitespace

Whether full text queries should split the input on whitespace when building a query for this field. Accepts true or false (default).
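
The following sketch combines several of the parameters above in a single keyword mapping; the field name and parameter values are illustrative only:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": {
          "type": "keyword",
          "ignore_above": 256,
          "null_value": "NULL",
          "eager_global_ordinals": true
        }
      }
    }
  }
}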

Note
Indexes imported from 2.x do not support keyword. Instead they will attempt to downgrade keyword into string. This allows you to merge modern mappings with legacy mappings. Long lived indexes will have to be recreated before upgrading to 6.x but mapping downgrade gives you the opportunity to do the recreation on your own schedule.

Nested datatype

The nested type is a specialised version of the object datatype that allows arrays of objects to be indexed in a way that they can be queried independently of each other.

How arrays of objects are flattened

Arrays of inner object fields do not work the way you may expect. Lucene has no concept of inner objects, so Elasticsearch flattens object hierarchies into a simple list of field names and values. For instance, the following document:

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [ (1)
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}
  1. The user field is dynamically added as a field of type object.

would be transformed internally into a document that looks more like this:

{
  "group" :        "fans",
  "user.first" : [ "alice", "john" ],
  "user.last" :  [ "smith", "white" ]
}

The user.first and user.last fields are flattened into multi-value fields, and the association between alice and white is lost. This document would incorrectly match a query for alice AND smith:

GET my_index/_search
{
  "query": {
    "bool": {
      "must": [
        { "match": { "user.first": "Alice" }},
        { "match": { "user.last":  "Smith" }}
      ]
    }
  }
}

Using nested fields for arrays of objects

If you need to index arrays of objects and to maintain the independence of each object in the array, you should use the nested datatype instead of the object datatype. Internally, nested objects index each object in the array as a separate hidden document, meaning that each nested object can be queried independently of the others, with the nested query:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested" (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "group" : "fans",
  "user" : [
    {
      "first" : "John",
      "last" :  "Smith"
    },
    {
      "first" : "Alice",
      "last" :  "White"
    }
  ]
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "Smith" }} (2)
          ]
        }
      }
    }
  }
}

GET my_index/_search
{
  "query": {
    "nested": {
      "path": "user",
      "query": {
        "bool": {
          "must": [
            { "match": { "user.first": "Alice" }},
            { "match": { "user.last":  "White" }} (3)
          ]
        }
      },
      "inner_hits": { (4)
        "highlight": {
          "fields": {
            "user.first": {}
          }
        }
      }
    }
  }
}
  1. The user field is mapped as type nested instead of type object.

  2. This query doesn’t match because Alice and Smith are not in the same nested object.

  3. This query matches because Alice and White are in the same nested object.

  4. inner_hits allow us to highlight the matching nested documents.

Nested documents can be:

  • queried with the nested query.

  • analyzed with the nested and reverse_nested aggregations.

  • sorted with nested sorting.

  • retrieved and highlighted with nested inner hits.

Important

Because nested documents are indexed as separate documents, they can only be accessed within the scope of the nested query, the nested/reverse_nested aggregations, or nested inner hits.

For instance, if a string field within a nested document has index_options set to offsets to allow use of the postings during the highlighting, these offsets will not be available during the main highlighting phase. Instead, highlighting needs to be performed via nested inner hits. The same consideration applies when loading fields during a search through docvalue_fields or stored_fields.

Parameters for nested fields

The following parameters are accepted by nested fields (an example follows this list):

dynamic

Whether or not new properties should be added dynamically to an existing nested object. Accepts true (default), false and strict.

properties

The fields within the nested object, which can be of any datatype, including nested. New properties may be added to an existing nested object.
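
For instance, a sketch of a nested field with explicit properties and dynamic set to strict; the field names are illustrative:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested",
          "dynamic": "strict",
          "properties": {
            "first": { "type": "keyword" },
            "last":  { "type": "keyword" }
          }
        }
      }
    }
  }
}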

Limits on nested mappings and objects

As described earlier, each nested object is indexed as a separate document under the hood. Continuing with the example above, if we indexed a single document containing 100 user objects, then 101 Lucene documents would be created — one for the parent document, and one for each nested object. Because of the expense associated with nested mappings, Elasticsearch puts the following setting in place to guard against performance problems:

index.mapping.nested_fields.limit

The nested type should only be used in special cases, when arrays of objects need to be queried independently of each other. To safeguard against poorly designed mappings, this setting limits the number of unique nested types per index. In our example, the user mapping would count as only 1 towards this limit. Defaults to 50.
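
If a mapping legitimately needs more nested fields, the limit can be raised at index creation time. A minimal sketch, where the value 100 is only illustrative:

PUT my_index
{
  "settings": {
    "index.mapping.nested_fields.limit": 100
  },
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}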

Additional background on this setting can be found in Settings to prevent mappings explosion.

Numeric datatypes

The following numeric types are supported:

long

A signed 64-bit integer with a minimum value of -2^63^ and a maximum value of 2^63^-1.

integer

A signed 32-bit integer with a minimum value of -2^31^ and a maximum value of 2^31^-1.

short

A signed 16-bit integer with a minimum value of -32,768 and a maximum value of 32,767.

byte

A signed 8-bit integer with a minimum value of -128 and a maximum value of 127.

double

A double-precision 64-bit IEEE 754 floating point number, restricted to finite values.

float

A single-precision 32-bit IEEE 754 floating point number, restricted to finite values.

half_float

A half-precision 16-bit IEEE 754 floating point number, restricted to finite values.

scaled_float

A finite floating point number that is backed by a long, scaled by a fixed double scaling factor.

Below is an example of configuring a mapping with numeric fields:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "number_of_bytes": {
          "type": "integer"
        },
        "time_in_seconds": {
          "type": "float"
        },
        "price": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}
Note
The double, float and half_float types consider that -0.0 and +0.0 are different values. As a consequence, doing a term query on -0.0 will not match +0.0 and vice-versa. Same is true for range queries: if the upper bound is -0.0 then +0.0 will not match, and if the lower bound is +0.0 then -0.0 will not match.

Which type should I use?

As far as integer types (byte, short, integer and long) are concerned, you should pick the smallest type which is enough for your use-case. This will help indexing and searching be more efficient. Note however that storage is optimized based on the actual values that are stored, so picking one type over another one will have no impact on storage requirements.

For floating-point types, it is often more efficient to store floating-point data into an integer using a scaling factor, which is what the scaled_float type does under the hood. For instance, a price field could be stored in a scaled_float with a scaling_factor of 100. All APIs would work as if the field was stored as a double, but under the hood Elasticsearch would be working with the number of cents, price*100, which is an integer. This is mostly helpful to save disk space since integers are way easier to compress than floating points. scaled_float is also fine to use in order to trade accuracy for disk space. For instance imagine that you are tracking cpu utilization as a number between 0 and 1. It usually does not matter much whether cpu utilization is 12.7% or 13%, so you could use a scaled_float with a scaling_factor of 100 in order to round cpu utilization to the closest percent in order to save space.
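
A sketch of the cpu utilization example above expressed as a mapping; the field name is illustrative:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "cpu_utilization": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}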

If scaled_float is not a good fit, then you should pick the smallest type that is enough for the use-case among the floating-point types: double, float and half_float. Here is a table that compares these types in order to help make a decision.

Type         Minimum value   Maximum value        Significant bits / digits

double       2^-1074^        (2-2^-52^)·2^1023^   53 / 15.95

float        2^-149^         (2-2^-23^)·2^127^    24 / 7.22

half_float   2^-24^          65504                11 / 3.31

Parameters for numeric fields

The following parameters are accepted by numeric types:

coerce

Try to convert strings to numbers and truncate fractions for integers. Accepts true (default) and false.

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

ignore_malformed

If true, malformed numbers are ignored. If false (default), malformed numbers throw an exception and reject the whole document.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts a numeric value of the same type as the field which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

Parameters for scaled_float

scaled_float accepts an additional parameter:

scaling_factor

The scaling factor to use when encoding values. Values will be multiplied by this factor at index time and rounded to the closest long value. For instance, a scaled_float with a scaling_factor of 10 would internally store 2.34 as 23 and all search-time operations (queries, aggregations, sorting) will behave as if the document had a value of 2.3. High values of scaling_factor improve accuracy but also increase space requirements. This parameter is required.

Object datatype

JSON documents are hierarchical in nature: the document may contain inner objects which, in turn, may contain inner objects themselves:

PUT my_index/_doc/1
{ (1)
  "region": "US",
  "manager": { (2)
    "age":     30,
    "name": { (3)
      "first": "John",
      "last":  "Smith"
    }
  }
}
  1. The outer document is also a JSON object.

  2. It contains an inner object called manager.

  3. Which in turn contains an inner object called name.

Internally, this document is indexed as a simple, flat list of key-value pairs, something like this:

{
  "region":             "US",
  "manager.age":        30,
  "manager.name.first": "John",
  "manager.name.last":  "Smith"
}

An explicit mapping for the above document could look like this:

PUT my_index
{
  "mappings": {
    "_doc": { (1)
      "properties": {
        "region": {
          "type": "keyword"
        },
        "manager": { (2)
          "properties": {
            "age":  { "type": "integer" },
            "name": { (3)
              "properties": {
                "first": { "type": "text" },
                "last":  { "type": "text" }
              }
            }
          }
        }
      }
    }
  }
}
  1. The mapping type is a type of object, and has a properties field.

  2. The manager field is an inner object field.

  3. The manager.name field is an inner object field within the manager field.

You are not required to set the field type to object explicitly, as this is the default value.

Parameters for object fields

The following parameters are accepted by object fields (an example follows this list):

dynamic

Whether or not new properties should be added dynamically to an existing object. Accepts true (default), false and strict.

enabled

Whether the JSON value given for the object field should be parsed and indexed (true, default) or completely ignored (false).

properties

The fields within the object, which can be of any datatype, including object. New properties may be added to an existing object.
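
Building on the manager example above, the following sketch sets dynamic to strict on the inner object and uses enabled: false to keep an illustrative session_data object in _source without parsing or indexing it:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "manager": {
          "dynamic": "strict",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text" }
          }
        },
        "session_data": {
          "type": "object",
          "enabled": false
        }
      }
    }
  }
}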

Important
If you need to index arrays of objects instead of single objects, read Nested datatype first.

Text datatype

A field to index full-text values, such as the body of an email or the description of a product. These fields are analyzed, that is they are passed through an analyzer to convert the string into a list of individual terms before being indexed. The analysis process allows Elasticsearch to search for individual words within each full text field. Text fields are not used for sorting and seldom used for aggregations (although the significant text aggregation is a notable exception).

If you need to index structured content such as email addresses, hostnames, status codes, or tags, it is likely that you should rather use a keyword field.

Below is an example of a mapping for a text field:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "full_name": {
          "type":  "text"
        }
      }
    }
  }
}

Use a field as both text and keyword

Sometimes it is useful to have both a full text (text) and a keyword (keyword) version of the same field: one for full text search and the other for aggregations and sorting. This can be achieved with multi-fields.
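
For example, a minimal sketch of a field mapped as text with a keyword sub-field; the city and raw names are illustrative:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

The city field can then be used for full text search, while city.raw can be used for sorting and aggregations.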

Parameters for text fields

The following parameters are accepted by text fields:

analyzer

The analyzer which should be used for analyzed string fields, both at index-time and at search-time (unless overridden by the search_analyzer). Defaults to the default index analyzer, or the standard analyzer.

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

eager_global_ordinals

Should global ordinals be loaded eagerly on refresh? Accepts true or false (default). Enabling this is a good idea on fields that are frequently used for (significant) terms aggregations.

fielddata

Can the field use in-memory fielddata for sorting, aggregations, or scripting? Accepts true or false (default).

fielddata_frequency_filter

Expert settings which allow you to decide which values to load in memory when fielddata is enabled. By default all values are loaded. (A sketch combining fielddata and this filter appears after this parameter list.)

fields

Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers.

index

Should the field be searchable? Accepts true (default) or false.

index_options

What information should be stored in the index, for search and highlighting purposes. Defaults to positions.

index_prefixes

If enabled, term prefixes of between 2 and 5 characters are indexed into a separate field. This allows prefix searches to run more efficiently, at the expense of a larger index.

index_phrases

If enabled, two-term word combinations ('shingles') are indexed into a separate field. This allows exact phrase queries to run more efficiently, at the expense of a larger index. Note that this works best when stopwords are not removed, as phrases containing stopwords will not use the subsidiary field and will fall back to a standard phrase query. Accepts true or false (default).

norms

Whether field-length should be taken into account when scoring queries. Accepts true (default) or false.

position_increment_gap

The number of fake term positions which should be inserted between each element of an array of strings. Defaults to the position_increment_gap configured on the analyzer, which defaults to 100. 100 was chosen because it prevents phrase queries with reasonably large slops (less than 100) from matching terms across field values.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

search_analyzer

The analyzer that should be used at search time on analyzed fields. Defaults to the analyzer setting.

search_quote_analyzer

The analyzer that should be used at search time when a phrase is encountered. Defaults to the search_analyzer setting.

similarity

Which scoring algorithm or similarity should be used. Defaults to BM25.

term_vector

Whether term vectors should be stored for an analyzed field. Defaults to no.
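
As referenced under fielddata_frequency_filter above, here is a sketch of enabling in-memory fielddata with a frequency filter on a text field; the field name and threshold values are illustrative:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}

With these settings, only terms whose document frequency falls between min and max are loaded into memory, and only for segments with at least min_segment_size documents.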

Token count datatype

A field of type token_count is really an integer field which accepts string values, analyzes them, then indexes the number of tokens in the string.

For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "name": { (1)
          "type": "text",
          "fields": {
            "length": { (2)
              "type":     "token_count",
              "analyzer": "standard"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "name": "John Smith" }

PUT my_index/_doc/2
{ "name": "Rachel Alice Williams" }

GET my_index/_search
{
  "query": {
    "term": {
      "name.length": 3 (3)
    }
  }
}
  1. The name field is an analyzed string field which uses the default standard analyzer.

  2. The name.length field is a token_count multi-field which will index the number of tokens in the name field.

  3. This query matches only the document containing Rachel Alice Williams, as it contains three tokens.

Parameters for token_count fields

The following parameters are accepted by token_count fields:

analyzer

The analyzer which should be used to analyze the string value. Required. For best performance, use an analyzer without token filters.

enable_position_increments

Indicates if position increments should be counted. Set to false if you don’t want to count tokens removed by analyzer filters (like stop). Defaults to true.

boost

Mapping field-level query time boosting. Accepts a floating point number, defaults to 1.0.

doc_values

Should the field be stored on disk in a column-stride fashion, so that it can later be used for sorting, aggregations, or scripting? Accepts true (default) or false.

index

Should the field be searchable? Accepts true (default) and false.

null_value

Accepts a numeric value of the same type as the field which is substituted for any explicit null values. Defaults to null, which means the field is treated as missing.

store

Whether the field value should be stored and retrievable separately from the _source field. Accepts true or false (default).

Percolator type

The percolator field type parses a json structure into a native query and stores that query, so that the percolate query can use it to match provided documents.

Any field that contains a json object can be configured to be a percolator field. The percolator field type has no settings. Just configuring the percolator field type is sufficient to instruct Elasticsearch to treat a field as a query.

If the following mapping configures the percolator field type for the query field:

PUT my_index
{
    "mappings": {
        "_doc": {
            "properties": {
                "query": {
                    "type": "percolator"
                },
                "field": {
                    "type": "text"
                }
            }
        }
    }
}

Then you can index a query:

PUT my_index/_doc/match_value
{
    "query" : {
        "match" : {
            "field" : "value"
        }
    }
}
Important

Fields referred to in a percolator query must already exist in the mapping associated with the index used for percolation. In order to make sure these fields exist, add or update a mapping via the create index or put mapping APIs. Fields referred to in a percolator query may exist in any type of the index containing the percolator field type.

Reindexing your percolator queries

Reindexing percolator queries is sometimes required to benefit from improvements made to the percolator field type in new releases.

Percolator queries can be reindexed by using the reindex API. Let's take a look at the following index with a percolator field type:

PUT index
{
  "mappings": {
    "_doc" : {
      "properties": {
        "query" : {
          "type" : "percolator"
        },
        "body" : {
          "type": "text"
        }
      }
    }
  }
}

POST _aliases
{
  "actions": [
    {
      "add": {
        "index": "index",
        "alias": "queries" (1)
      }
    }
  ]
}

PUT queries/_doc/1?refresh
{
  "query" : {
    "match" : {
      "body" : "quick brown fox"
    }
  }
}
  1. It is always recommended to define an alias for your index, so that in case of a reindex, systems and applications don't need to be changed to know that the percolator queries are now in a different index.

Let's say you are going to upgrade to a new major version. In order for the new Elasticsearch version to still be able to read your queries, you need to reindex your queries into a new index on the current Elasticsearch version:

PUT new_index
{
  "mappings": {
    "_doc" : {
      "properties": {
        "query" : {
          "type" : "percolator"
        },
        "body" : {
          "type": "text"
        }
      }
    }
  }
}

POST /_reindex?refresh
{
  "source": {
    "index": "index"
  },
  "dest": {
    "index": "new_index"
  }
}

POST _aliases
{
  "actions": [ (1)
    {
      "remove": {
        "index" : "index",
        "alias": "queries"
      }
    },
    {
      "add": {
        "index": "new_index",
        "alias": "queries"
      }
    }
  ]
}
  1. If you have an alias don’t forget to point it to the new index.

Executing the percolate query via the queries alias:

GET /queries/_search
{
  "query": {
    "percolate" : {
      "field" : "query",
      "document" : {
        "body" : "fox jumps over the lazy dog"
      }
    }
  }
}

now returns matches from the new index:

{
  "took": 3,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "new_index", (1)
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "query": {
            "match": {
              "body": "quick brown fox"
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [0]
        }
      }
    ]
  }
}
  1. The percolator query hit is now presented from the new index.

Optimizing query time text analysis

When the percolator verifies a percolator candidate match it is going to parse, perform query time text analysis and actually run the percolator query on the document being percolated. This is done for each candidate match and every time the percolate query executes. If query time text analysis is a relatively expensive part of query parsing, then text analysis can become the dominating factor in the time spent percolating. This query parsing overhead can become noticeable when the percolator ends up verifying many candidate percolator query matches.

To avoid the most expensive part of text analysis at percolate time, one can choose to do the expensive part of text analysis when indexing the percolator query. This requires using two different analyzers. The first analyzer actually performs the text analysis that needs to be performed (the expensive part). The second analyzer (usually whitespace) just splits the generated tokens that the first analyzer has produced. Then, before indexing a percolator query, the analyze API should be used to analyze the query text with the more expensive analyzer. The result of the analyze API, the tokens, should be used to substitute the original query text in the percolator query. It is important that the query is now configured to override the analyzer from the mapping and use just the second analyzer. Most text based queries support an analyzer option (match, query_string, simple_query_string). Using this approach the expensive text analysis is performed once instead of many times.

Let's demonstrate this workflow via a simplified example.

Let's say we want to index the following percolator query:

{
  "query" : {
    "match" : {
      "body" : {
        "query" : "missing bicycles"
      }
    }
  }
}

with these settings and mapping:

PUT /test_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "my_analyzer" : {
          "tokenizer": "standard",
          "filter" : ["lowercase", "porter_stem"]
        }
      }
    }
  },
  "mappings": {
    "_doc" : {
      "properties": {
        "query" : {
          "type": "percolator"
        },
        "body" : {
          "type": "text",
          "analyzer": "my_analyzer" (1)
        }
      }
    }
  }
}
  1. For the purpose of this example, this analyzer is considered expensive.

First we need to use the analyze api to perform the text analysis prior to indexing:

POST /test_index/_analyze
{
  "analyzer" : "my_analyzer",
  "text" : "missing bicycles"
}

This results in the following response:

{
  "tokens": [
    {
      "token": "miss",
      "start_offset": 0,
      "end_offset": 7,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "bicycl",
      "start_offset": 8,
      "end_offset": 16,
      "type": "<ALPHANUM>",
      "position": 1
    }
  ]
}

All the tokens in the returned order need to replace the query text in the percolator query:

PUT /test_index/_doc/1?refresh
{
  "query" : {
    "match" : {
      "body" : {
        "query" : "miss bicycl",
        "analyzer" : "whitespace" (1)
      }
    }
  }
}
  1. It is important to select a whitespace analyzer here, otherwise the analyzer defined in the mapping will be used, which defeats the point of using this workflow. Note that whitespace is a built-in analyzer; if a different analyzer needs to be used, it needs to be configured first in the index’s settings.

This analyze API step prior to indexing should be performed for each percolator query.

At percolate time nothing changes and the percolate query can be defined normally:

GET /test_index/_search
{
  "query": {
    "percolate" : {
      "field" : "query",
      "document" : {
        "body" : "Bycicles are missing"
      }
    }
  }
}

This results in a response like this:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "test_index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "query": {
            "match": {
              "body": {
                "query": "miss bicycl",
                "analyzer": "whitespace"
              }
            }
          }
        },
        "fields" : {
          "_percolator_document_slot" : [0]
        }
      }
    ]
  }
}

Optimizing wildcard queries

Wildcard queries are more expensive than other queries for the percolator, especially if the wildcard expressions are large.

In the case of wildcard queries with prefix wildcard expressions or just the prefix query, the edge_ngram token filter can be used to replace these queries with a regular term query on a field where the edge_ngram token filter is configured.

Creating an index with custom analysis settings:

PUT my_queries1
{
  "settings": {
    "analysis": {
      "analyzer": {
        "wildcard_prefix": { (1)
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "wildcard_edge_ngram"
          ]
        }
      },
      "filter": {
        "wildcard_edge_ngram": { (2)
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 32
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        },
        "my_field": {
          "type": "text",
          "fields": {
            "prefix": { (3)
              "type": "text",
              "analyzer": "wildcard_prefix",
              "search_analyzer": "standard"
            }
          }
        }
      }
    }
  }
}
  1. The analyzer that generates the prefix tokens to be used at index time only.

  2. Increase the min_gram and decrease max_gram settings based on your prefix search needs.

  3. This multifield should be used to do the prefix search with a term or match query instead of a prefix or wildcard query.

Then instead of indexing the following query:

{
  "query": {
    "wildcard": {
      "my_field": "abc*"
    }
  }
}

this query should be indexed instead:

PUT /my_queries1/_doc/1?refresh
{
  "query": {
    "term": {
      "my_field.prefix": "abc"
    }
  }
}

This way the second query can be handled more efficiently than the first query.

The following search request will match with the previously indexed percolator query:

GET /my_queries1/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "my_field": "abcd"
      }
    }
  }
}

This request returns a response similar to the following:

{
  "took": 6,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 1,
    "max_score": 0.41501677,
    "hits": [
      {
        "_index": "my_queries1",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.41501677,
        "_source": {
          "query": {
            "term": {
              "my_field.prefix": "abc"
            }
          }
        },
        "fields": {
          "_percolator_document_slot": [
            0
          ]
        }
      }
    ]
  }
}

The same technique can also be used to speed up suffix wildcard searches, by using the reverse token filter before the edge_ngram token filter.

PUT my_queries2
{
  "settings": {
    "analysis": {
      "analyzer": {
        "wildcard_suffix": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "reverse",
            "wildcard_edge_ngram"
          ]
        },
        "wildcard_suffix_search_time": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "reverse"
          ]
        }
      },
      "filter": {
        "wildcard_edge_ngram": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 32
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        },
        "my_field": {
          "type": "text",
          "fields": {
            "suffix": {
              "type": "text",
              "analyzer": "wildcard_suffix",
              "search_analyzer": "wildcard_suffix_search_time" (1)
            }
          }
        }
      }
    }
  }
}
  1. A custom analyzer is needed at search time too, because otherwise the query terms are not reversed and would not match the reversed suffix tokens.

Then instead of indexing the following query:

{
  "query": {
    "wildcard": {
      "my_field": "*xyz"
    }
  }
}

this query should be indexed instead:

PUT /my_queries2/_doc/2?refresh
{
  "query": {
    "match": { (1)
      "my_field.suffix": "xyz"
    }
  }
}
  1. The match query should be used instead of the term query, because text analysis needs to reverse the query terms.

The following search request will match with the previously indexed percolator query:

GET /my_queries2/_search
{
  "query": {
    "percolate": {
      "field": "query",
      "document": {
        "my_field": "wxyz"
      }
    }
  }
}

Dedicated Percolator Index

Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in, these queries can also be added to a dedicated index. The advantage of this is that the dedicated percolator index can have its own index settings (for example, the number of primary and replica shards). If you choose to have a dedicated percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index. Otherwise percolate queries can be parsed incorrectly.
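
For instance, a sketch of a dedicated percolator index whose mapping mirrors the body field of a hypothetical data index and defines its own shard settings:

PUT dedicated_queries_index
{
  "settings": {
    "number_of_shards": 1,
    "number_of_replicas": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        },
        "body": { (1)
          "type": "text"
        }
      }
    }
  }
}
  1. The body mapping is copied from the data index so that percolator queries referring to it are parsed the same way.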

Forcing Unmapped Fields to be Handled as Strings

In certain cases it is unknown what kind of percolator queries get registered, and if no field mapping exists for fields that are referred to by percolator queries, then adding a percolator query fails. This means the mapping needs to be updated to have the field with the appropriate settings, and then the percolator query can be added. But sometimes it is sufficient if all unmapped fields are handled as if these were default text fields. In those cases one can configure the index.percolator.map_unmapped_fields_as_text setting to true (defaults to false), and then if a field referred to in a percolator query does not exist, it will be handled as a default text field so that adding the percolator query doesn’t fail.
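
A sketch of enabling this setting at index creation time; the index name is illustrative:

PUT my_queries_index
{
  "settings": {
    "index.percolator.map_unmapped_fields_as_text": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}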

Limitations

Parent/child

Because the percolate query is processing one document at a time, it doesn’t support queries and filters that run against child documents such as has_child and has_parent.

Fetching queries

There are a number of queries that fetch data via a get call during query parsing, for example the terms query when using terms lookup, the template query when using indexed scripts, and geo_shape when using pre-indexed shapes. When these queries are indexed by the percolator field type, the get call is executed once. So each time the percolator query evaluates these queries, the fetched terms, shapes, etc. as they were at index time will be used. It is important to note that the fetching of terms these queries perform happens each time the percolator query gets indexed, on both primary and replica shards, so the terms that are actually indexed can differ between shard copies if the source index changed while indexing.

Script query

The script inside a script query can only access doc values fields. The percolate query indexes the provided document into an in-memory index. This in-memory index doesn’t support stored fields and because of that the _source field and other stored fields are not stored. This is the reason why in the script query the _source and other stored fields aren’t available.

Field aliases

Percolator queries that contain field aliases may not always behave as expected. In particular, if a percolator query is registered that contains a field alias, and then that alias is updated in the mappings to refer to a different field, the stored query will still refer to the original target field. To pick up the change to the field alias, the percolator query must be explicitly reindexed.

join datatype

The join datatype is a special field that creates a parent/child relation within documents of the same index. The relations section defines a set of possible relations within the documents, each relation being a parent name and a child name. A parent/child relation can be defined as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": { (1)
          "type": "join",
          "relations": {
            "question": "answer" (2)
          }
        }
      }
    }
  }
}
  1. The name for the field

  2. Defines a single relation where question is parent of answer.

To index a document with a join, the name of the relation and the optional parent of the document must be provided in the source. For instance the following example creates two parent documents in the question context:

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": {
    "name": "question" (1)
  }
}

PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": {
    "name": "question"
  }
}
  1. This document is a question document.

When indexing parent documents, you can choose to specify just the name of the relation as a shortcut instead of encapsulating it in the normal object notation:

PUT my_index/_doc/1?refresh
{
  "text": "This is a question",
  "my_join_field": "question" (1)
}

PUT my_index/_doc/2?refresh
{
  "text": "This is another question",
  "my_join_field": "question"
}
  1. Simpler notation for a parent document just uses the relation name.

When indexing a child, the name of the relation as well as the parent id of the document must be added in the _source.

Warning
It is required to index the lineage of a parent in the same shard, so you must always route child documents using their greater parent id.

For instance the following example shows how to index two child documents:

PUT my_index/_doc/3?routing=1&refresh (1)
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer", (2)
    "parent": "1" (3)
  }
}

PUT my_index/_doc/4?routing=1&refresh
{
  "text": "This is another answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
  1. The routing value is mandatory because parent and child documents must be indexed on the same shard

  2. answer is the name of the join for this document

  3. The parent id of this child document

Parent-join and performance

The join field shouldn’t be used like joins in a relational database. In Elasticsearch the key to good performance is to de-normalize your data into documents. Each join field, has_child or has_parent query adds a significant tax to your query performance.

The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other. An example of such a case is products and offers for these products. When offers significantly outnumber products, it makes sense to model the product as the parent document and the offer as the child document.

Parent-join restrictions

  • Only one join field mapping is allowed per index.

  • Parent and child documents must be indexed on the same shard. This means that the same routing value needs to be provided when getting, deleting, or updating a child document (see the sketch after this list).

  • An element can have multiple children but only one parent.

  • It is possible to add a new relation to an existing join field.

  • It is also possible to add a child to an existing element but only if the element is already a parent.
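
For example, a sketch of getting and updating the answer document indexed earlier, reusing its parent's routing value (the updated text is illustrative):

GET my_index/_doc/3?routing=1

POST my_index/_doc/3/_update?routing=1
{
  "doc": {
    "text": "This is an edited answer"
  }
}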

Searching with parent-join

The parent-join creates one field to index the name of the relation within the document (my_parent, my_child, …​).

It also creates one field per parent/child relation. The name of this field is the name of the join field followed by # and the name of the parent in the relation. So for instance for the my_parent ⇒ [my_child, another_child] relation, the join field creates an additional field named my_join_field#my_parent.

This field contains the parent _id that the document links to if the document is a child (my_child or another_child), and the _id of the document if it’s a parent (my_parent).

When searching an index that contains a join field, these two fields are always returned in the search response:

GET my_index/_search
{
  "query": {
    "match_all": {}
  },
  "sort": ["_id"]
}

Will return:

{
    ...,
    "hits": {
        "total": 4,
        "max_score": null,
        "hits": [
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "1",
                "_score": null,
                "_source": {
                    "text": "This is a question",
                    "my_join_field": "question" (1)
                },
                "sort": [
                    "1"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "2",
                "_score": null,
                "_source": {
                    "text": "This is another question",
                    "my_join_field": "question" (2)
                },
                "sort": [
                    "2"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "3",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "text": "This is an answer",
                    "my_join_field": {
                        "name": "answer", (3)
                        "parent": "1"  (4)
                    }
                },
                "sort": [
                    "3"
                ]
            },
            {
                "_index": "my_index",
                "_type": "_doc",
                "_id": "4",
                "_score": null,
                "_routing": "1",
                "_source": {
                    "text": "This is another answer",
                    "my_join_field": {
                        "name": "answer",
                        "parent": "1"
                    }
                },
                "sort": [
                    "4"
                ]
            }
        ]
    }
}
  1. This document belongs to the question join

  2. This document belongs to the question join

  3. This document belongs to the answer join

  4. The linked parent id for the child document

Parent-join queries and aggregations

See the has_child and has_parent queries, the children aggregation, and inner hits for more information.

The value of the join field is accessible in aggregations and scripts, and may be queried with the parent_id query:

GET my_index/_search
{
  "query": {
    "parent_id": { (1)
      "type": "answer",
      "id": "1"
    }
  },
  "aggs": {
    "parents": {
      "terms": {
        "field": "my_join_field#question", (2)
        "size": 10
      }
    }
  },
  "script_fields": {
    "parent": {
      "script": {
         "source": "doc['my_join_field#question']" (3)
      }
    }
  }
}
  1. Querying the parent id field (also see the has_parent query and the has_child query)

  2. Aggregating on the parent id field (also see the children aggregation)

  3. Accessing the parent id field in scripts

Global ordinals

The join field uses global ordinals to speed up joins. Global ordinals need to be rebuilt after any change to a shard. The more parent id values are stored in a shard, the longer it takes to rebuild the global ordinals for the join field.

Global ordinals, by default, are built eagerly: if the index has changed, global ordinals for the join field will be rebuilt as part of the refresh. This can add significant time to the refresh. However, most of the time this is the right trade-off, because otherwise global ordinals are rebuilt when the first parent-join query or aggregation is used. This can introduce a significant latency spike for your users, and it is usually worse because global ordinals for the join field may need to be rebuilt multiple times within a single refresh interval when many writes are occurring.

When the join field is used infrequently and writes occur frequently it may make sense to disable eager loading:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
             "question": "answer"
          },
          "eager_global_ordinals": false
        }
      }
    }
  }
}

The amount of heap used by global ordinals can be checked per parent relation as follows:

# Per-index
GET _stats/fielddata?human&fields=my_join_field#question

# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=my_join_field#question

Multiple children per parent

It is also possible to define multiple children for a single parent:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": ["answer", "comment"]  (1)
          }
        }
      }
    }
  }
}
  1. question is parent of answer and comment.

Multiple levels of parent join

Warning
Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds an overhead at query time in terms of memory and computation. You should de-normalize your data if you care about performance.

Multiple levels of parent/child:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": ["answer", "comment"],  (1)
            "answer": "vote" (2)
          }
        }
      }
    }
  }
}
  1. question is parent of answer and comment

  2. answer is parent of vote

The mapping above represents the following tree:

   question
    /    \
   /      \
comment  answer
           |
           |
          vote

Indexing a grandchild document requires a routing value equal to the grand-parent (the greater parent of the lineage):

PUT my_index/_doc/3?routing=1&refresh (1)
{
  "text": "This is a vote",
  "my_join_field": {
    "name": "vote",
    "parent": "2" (2)
  }
}
  1. This child document must be on the same shard as its grand-parent and parent

  2. The parent id of this document (must point to an answer document)

Meta-Fields

Each document has metadata associated with it, such as the _index, mapping _type, and _id meta-fields. The behaviour of some of these meta-fields can be customised when a mapping type is created.

Identity meta-fields

_index

The index to which the document belongs.

_uid

A composite field consisting of the _type and the _id.

_type

The document’s mapping type.

_id

The document’s ID.

Document source meta-fields

_source

The original JSON representing the body of the document.

_size

The size of the _source field in bytes, provided by the mapper-size plugin.

Indexing meta-fields

_all

A catch-all field that indexes the values of all other fields. Disabled by default.

_field_names

All fields in the document which contain non-null values.

_ignored

All fields in the document that have been ignored at index time because of ignore_malformed.

Routing meta-field

_routing

A custom routing value which routes a document to a particular shard.

Other meta-field

_meta

Application specific metadata.

_all field

Deprecated in 6.0.0: _all may no longer be enabled for indices created in 6.0+; use a custom field and the mapping copy_to parameter instead.

The _all field is a special catch-all field which concatenates the values of all of the other fields into one big string, using space as a delimiter, which is then analyzed and indexed, but not stored. This means that it can be searched, but not retrieved.

The _all field allows you to search for values in documents without knowing which field contains the value. This makes it a useful option when getting started with a new dataset. For instance:

PUT /my_index
{
  "mapping": {
    "user": {
      "_all": {
        "enabled": true   (1)
      }
    }
  }
}

PUT /my_index/user/1      (2)
{
  "first_name":    "John",
  "last_name":     "Smith",
  "date_of_birth": "1970-10-24"
}

GET /my_index/_search
{
  "query": {
    "match": {
      "_all": "john smith 1970"
    }
  }
}
  1. Enabling the _all field

  2. The _all field will contain the terms: [ "john", "smith", "1970", "10", "24" ]

Note
All values treated as strings

The date_of_birth field in the above example is recognised as a date field and so will index a single term representing 1970-10-24 00:00:00 UTC. The _all field, however, treats all values as strings, so the date value is indexed as the three string terms: "1970", "24", "10".

It is important to note that the _all field combines the original values from each field as a string. It does not combine the terms from each field.

The _all field is just a text field, and accepts the same parameters that other string fields accept, including analyzer, term_vectors, index_options, and store.

The _all field can be useful, especially when exploring new data using simple filtering. However, by concatenating field values into one big string, the _all field loses the distinction between short fields (more relevant) and long fields (less relevant). For use cases where search relevance is important, it is better to query individual fields specifically.

The _all field is not free: it requires extra CPU cycles and uses more disk space. For this reason, it is disabled by default. If needed, it can be enabled.

Using the _all field in queries

The query_string and simple_query_string queries query the _all field by default if it is enabled, unless another field is specified:

GET _search
{
  "query": {
    "query_string": {
      "query": "john smith new york"
    }
  }
}

The same goes for the ?q= parameter in URI search requests (which is rewritten to a query_string query internally):

GET _search?q=john+smith+new+york

Other queries, such as the match and term queries, require you to specify the _all field explicitly, as per the first example.

Enabling the _all field

The _all field can be enabled per-type by setting enabled to true:

PUT my_index
{
  "mappings": {
    "type_1": { (1)
      "properties": {...}
    },
    "type_2": { (2)
      "_all": {
        "enabled": true
      },
      "properties": {...}
    }
  }
}
  1. The _all field in type_1 is disabled.

  2. The _all field in type_2 is enabled.

If the _all field is enabled, then URI search requests and the query_string and simple_query_string queries can automatically use it for queries (see Using the _all field in queries). You can configure them to use a different field with the index.query.default_field setting:

PUT my_index
{
  "mappings": {
    "my_type": {
      "properties": {
        "content": {
          "type": "text"
        }
      }
    }
  },
  "settings": {
    "index.query.default_field": "content" (1)
  }
}
  1. The query_string query will default to querying the content field in this index.

Index boosting and the _all field

Individual fields can be boosted at index time, with the boost parameter. The _all field takes these boosts into account:

PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {"enabled": true},
      "properties": {
        "title": { (1)
          "type": "text",
          "boost": 2
        },
        "content": { (1)
          "type": "text"
        }
      }
    }
  }
}
  1. When querying the _all field, words that originated in the title field are twice as relevant as words that originated in the content field.

Warning
Using index-time boosting with the _all field has a significant impact on query performance. Usually the better solution is to query fields individually, with optional query time boosting.

Custom _all fields

While there is only a single _all field per index, the copy_to parameter allows the creation of multiple custom _all fields. For instance, first_name and last_name fields can be combined together into the full_name field:

PUT myindex
{
  "mappings": {
    "mytype": {
      "properties": {
        "first_name": {
          "type":    "text",
          "copy_to": "full_name" (1)
        },
        "last_name": {
          "type":    "text",
          "copy_to": "full_name" (1)
        },
        "full_name": {
          "type":    "text"
        }
      }
    }
  }
}

PUT myindex/mytype/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET myindex/_search
{
  "query": {
    "match": {
      "full_name": "John Smith"
    }
  }
}
  1. The first_name and last_name values are copied to the full_name field.

Highlighting and the _all field

A field can only be used for highlighting if the original string value is available, either from the _source field or as a stored field.

The _all field is not present in the _source field, it is not stored, and it is disabled by default, so it cannot be highlighted. There are two options: either store the _all field, or highlight the original fields.

Store the _all field

If store is set to true, then the original field value is retrievable and can be highlighted:

PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {
        "enabled": true,
        "store": true
      }
    }
  }
}

PUT myindex/mytype/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET _search
{
  "query": {
    "match": {
      "_all": "John Smith"
    }
  },
  "highlight": {
    "fields": {
      "_all": {}
    }
  }
}

Of course, enabling and storing the _all field will use significantly more disk space and, because it is a combination of other fields, it may result in odd highlighting results.

The _all field also accepts the term_vector and index_options parameters, allowing highlighting to use it.
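
For example, the following is a minimal sketch (not one of the original examples) that enables term vectors with positions and offsets on the _all field so that a highlighter can use them; the index and type names are purely illustrative:

PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {
        "enabled": true,
        "term_vector": "with_positions_offsets"
      }
    }
  }
}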

Highlight original fields

You can query the _all field, but use the original fields for highlighting as follows:

PUT myindex
{
  "mappings": {
    "mytype": {
      "_all": {"enabled": true}
    }
  }
}

PUT myindex/mytype/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET _search
{
  "query": {
    "match": {
      "_all": "John Smith" (1)
    }
  },
  "highlight": {
    "fields": {
      "*_name": { (2)
        "require_field_match": false  (3)
      }
    }
  }
}
  1. The query inspects the _all field to find matching documents.

  2. Highlighting is performed on the two name fields, which are available from the _source.

  3. The query wasn’t run against the name fields, so set require_field_match to false.

_field_names field

The _field_names field used to index the names of every field in a document that contains any value other than null. This field was used by the exists query to find documents that either have or don’t have any non-null value for a particular field.

Now the _field_names field only indexes the names of fields that have doc_values and norms disabled. For fields which have either doc_values or norms enabled, the exists query is still available but does not use the _field_names field.
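
As a reminder, an exists query looks like the sketch below; the index and field names are only an illustration:

GET my_index/_search
{
  "query": {
    "exists": {
      "field": "user"
    }
  }
}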

Disabling _field_names

Disabling _field_names is often not necessary because it no longer carries the index overhead it once did. If you have a lot of fields which have doc_values and norms disabled and you do not need to execute exists queries using those fields, you might want to disable _field_names by adding the following to the mappings:

PUT tweets
{
  "mappings": {
    "_doc": {
      "_field_names": {
        "enabled": false
      }
    }
  }
}

_ignored field

Added in 6.4.0.

The _ignored field indexes and stores the names of every field in a document that has been ignored because it was malformed and ignore_malformed was turned on.

This field is searchable with term, terms and exists queries, and is returned as part of the search hits.

For instance, the query below matches all documents that have one or more fields that were ignored:

GET _search
{
  "query": {
    "exists": {
      "field": "_ignored"
    }
  }
}

Similarly, the below query finds all documents whose @timestamp field was ignored at index time:

GET _search
{
  "query": {
    "term": {
      "_ignored": "@timestamp"
    }
  }
}

_id field

Each document has an _id that uniquely identifies it, which is indexed so that documents can be looked up either with the GET API or the ids query.

Note
This was not the case with pre-6.0 indices because they supported multiple types, so the _type and _id were merged into a composite primary key called _uid.

The value of the _id field is accessible in certain queries (term, terms, match, query_string, simple_query_string).

# Example documents
PUT my_index/_doc/1
{
  "text": "Document with ID 1"
}

PUT my_index/_doc/2?refresh=true
{
  "text": "Document with ID 2"
}

GET my_index/_search
{
  "query": {
    "terms": {
      "_id": [ "1", "2" ] (1)
    }
  }
}
  1. Querying on the _id field (also see the ids query)

The value of the _id field is also accessible in aggregations or for sorting, but doing so is discouraged as it requires loading a lot of data in memory. If sorting or aggregating on the _id field is required, it is advised to duplicate the content of the _id field into another field that has doc_values enabled.
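
A minimal sketch of that approach, assuming a hypothetical id_copy keyword field that the indexing application fills with the same value as the document _id (index and field names are illustrative):

PUT my_ids_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "id_copy": {
          "type": "keyword"
        }
      }
    }
  }
}

# The application supplies the _id value again in id_copy,
# which has doc_values enabled (the default for keyword fields)
PUT my_ids_index/_doc/1
{
  "id_copy": "1",
  "text": "Document with ID 1"
}

Sorting or aggregating can then target id_copy instead of _id.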

_index field

When performing queries across multiple indexes, it is sometimes desirable to add query clauses that are associated with documents of only certain indexes. The _index field allows matching on the index a document was indexed into. Its value is accessible in term or terms queries, aggregations, scripts, and when sorting:

Note
The _index is exposed as a virtual field — it is not added to the Lucene index as a real field. This means that you can use the _index field in a term or terms query (or any query that is rewritten to a term query, such as the match, query_string or simple_query_string query), but it does not support prefix, wildcard, regexp, or fuzzy queries.
# Example documents
PUT index_1/_doc/1
{
  "text": "Document in index 1"
}

PUT index_2/_doc/2?refresh=true
{
  "text": "Document in index 2"
}

GET index_1,index_2/_search
{
  "query": {
    "terms": {
      "_index": ["index_1", "index_2"] (1)
    }
  },
  "aggs": {
    "indices": {
      "terms": {
        "field": "_index", (2)
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_index": { (3)
        "order": "asc"
      }
    }
  ],
  "script_fields": {
    "index_name": {
      "script": {
        "lang": "painless",
        "source": "doc['_index']" (4)
      }
    }
  }
}
  1. Querying on the _index field

  2. Aggregating on the _index field

  3. Sorting on the _index field

  4. Accessing the _index field in scripts

_meta field

A mapping type can have custom metadata associated with it. This metadata is not used at all by Elasticsearch, but it can be used to store application-specific information, such as the class that a document belongs to:

PUT my_index
{
  "mappings": {
    "_doc": {
      "_meta": { (1)
        "class": "MyApp::User",
        "version": {
          "min": "1.0",
          "max": "1.3"
        }
      }
    }
  }
}
  1. This _meta info can be retrieved with the GET mapping API.

The _meta field can be updated on an existing type using the PUT mapping API:

PUT my_index/_mapping/_doc
{
  "_meta": {
    "class": "MyApp2::User3",
    "version": {
      "min": "1.3",
      "max": "1.5"
    }
  }
}

_routing field

A document is routed to a particular shard in an index using the following formula:

shard_num = hash(_routing) % num_primary_shards

The default value used for _routing is the document’s _id.

Custom routing patterns can be implemented by specifying a custom routing value per document. For instance:

PUT my_index/_doc/1?routing=user1&refresh=true (1)
{
  "title": "This is a document"
}

GET my_index/_doc/1?routing=user1 (2)
  1. This document uses user1 as its routing value, instead of its ID.

  2. The same routing value needs to be provided when getting, deleting, or updating the document.

The value of the _routing field is accessible in queries:

GET my_index/_search
{
  "query": {
    "terms": {
      "_routing": [ "user1" ] (1)
    }
  }
}
  1. Querying on the _routing field (also see the ids query)

Searching with custom routing

Custom routing can reduce the impact of searches. Instead of having to fan out a search request to all the shards in an index, the request can be sent to just the shard that matches the specific routing value (or values):

GET my_index/_search?routing=user1,user2 (1)
{
  "query": {
    "match": {
      "title": "document"
    }
  }
}
  1. This search request will only be executed on the shards associated with the user1 and user2 routing values.

Making a routing value required

When using custom routing, it is important to provide the routing value whenever indexing, getting, deleting, or updating a document.

Forgetting the routing value can lead to a document being indexed on more than one shard. As a safeguard, the _routing field can be configured to make a custom routing value required for all CRUD operations:

PUT my_index2
{
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true (1)
      }
    }
  }
}

PUT my_index2/_doc/1 (2)
{
  "text": "No routing value provided"
}
  1. Routing is required for _doc documents.

  2. This index request throws a routing_missing_exception.

Unique IDs with custom routing

When indexing documents specifying a custom _routing, the uniqueness of the _id is not guaranteed across all of the shards in the index. In fact, documents with the same _id might end up on different shards if indexed with different _routing values.

It is up to the user to ensure that IDs are unique across the index.

Routing to an index partition

An index can be configured such that custom routing values will go to a subset of the shards rather than a single shard. This helps mitigate the risk of ending up with an imbalanced cluster while still reducing the impact of searches.

This is done by providing the index level setting index.routing_partition_size at index creation. As the partition size increases, the data will become more evenly distributed, at the expense of having to search more shards per request.

When this setting is present, the formula for calculating the shard becomes:

shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards

That is, the _routing field is used to calculate a set of shards within the index and then the _id is used to pick a shard within that set.

To enable this feature, the index.routing_partition_size should have a value greater than 1 and less than index.number_of_shards.

Once enabled, the partitioned index will have the following limitations:

  • Mappings with join field relationships cannot be created within it.

  • All mappings within the index must have the _routing field marked as required.
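
Putting this together, a minimal sketch of creating a partitioned index (the index name, shard count, and partition size are only an illustration):

PUT my_partitioned_index
{
  "settings": {
    "index.number_of_shards": 6,
    "index.routing_partition_size": 3
  },
  "mappings": {
    "_doc": {
      "_routing": {
        "required": true
      }
    }
  }
}

With these settings, each routing value maps to 3 of the 6 primary shards, and the _id then picks a single shard within that set.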

_source field

The _source field contains the original JSON document body that was passed at index time. The _source field itself is not indexed (and thus is not searchable), but it is stored so that it can be returned when executing fetch requests, like get or search.

Disabling the _source field

Though very handy to have around, the _source field does incur storage overhead within the index. For this reason, it can be disabled as follows:

PUT tweets
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}
Warning
Think before disabling the _source field

Users often disable the _source field without thinking about the consequences, and then live to regret it. If the _source field isn’t available then a number of features are not supported:

  • The update, update_by_query, and reindex APIs.

  • On the fly highlighting.

  • The ability to reindex from one Elasticsearch index to another, either to change mappings or analysis, or to upgrade an index to a new major version.

  • The ability to debug queries or aggregations by viewing the original document used at index time.

  • Potentially in the future, the ability to repair index corruption automatically.

Tip
If disk space is a concern, increase the compression level instead of disabling the _source.
The metrics use case

The metrics use case is distinct from other time-based or logging use cases in that there are many small documents which consist only of numbers, dates, or keywords. There are no updates, no highlighting requests, and the data ages quickly so there is no need to reindex. Search requests typically use simple queries to filter the dataset by date or tags, and the results are returned as aggregations.

In this case, disabling the _source field will save space and reduce I/O. It is also advisable to disable the _all field in the metrics case.
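
A minimal sketch of such a metrics index with the _source field disabled (the index name is illustrative; as noted above, the _all field is already disabled unless it has been explicitly enabled):

PUT metrics
{
  "mappings": {
    "_doc": {
      "_source": {
        "enabled": false
      }
    }
  }
}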

Including / Excluding fields from _source

An expert-only feature is the ability to prune the contents of the _source field after the document has been indexed, but before the _source field is stored.

Warning
Removing fields from the _source has similar downsides to disabling _source, especially the fact that you cannot reindex documents from one Elasticsearch index to another. Consider using source filtering instead.

The includes/excludes parameters (which also accept wildcards) can be used as follows:

PUT logs
{
  "mappings": {
    "_doc": {
      "_source": {
        "includes": [
          "*.count",
          "meta.*"
        ],
        "excludes": [
          "meta.description",
          "meta.other.*"
        ]
      }
    }
  }
}

PUT logs/_doc/1
{
  "requests": {
    "count": 10,
    "foo": "bar" (1)
  },
  "meta": {
    "name": "Some metric",
    "description": "Some metric description", (1)
    "other": {
      "foo": "one", (1)
      "baz": "two" (1)
    }
  }
}

GET logs/_search
{
  "query": {
    "match": {
      "meta.other.foo": "one" (2)
    }
  }
}
  1. These fields will be removed from the stored _source field.

  2. We can still search on this field, even though it is not in the stored _source.

_type field

Deprecated in 6.0.0. See Removal of mapping types.

Each document indexed is associated with a _type (see Mapping Type) and an _id. The _type field is indexed in order to make searching by type name fast.

The value of the _type field is accessible in queries, aggregations, scripts, and when sorting:

# Example documents

PUT my_index/_doc/1?refresh=true
{
  "text": "Document with type 'doc'"
}

GET my_index/_search
{
  "query": {
    "term": {
      "_type": "_doc"  (1)
    }
  },
  "aggs": {
    "types": {
      "terms": {
        "field": "_type", (2)
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_type": { (3)
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "type": {
      "script": {
        "lang": "painless",
        "source": "doc['_type']" (4)
      }
    }
  }
}
  1. Querying on the _type field

  2. Aggregating on the _type field

  3. Sorting on the _type field

  4. Accessing the _type field in scripts

_uid field

Deprecated in 6.0.0: Now that types have been removed, documents are uniquely identified by their _id and the _uid field has only been kept as a view over the _id field for backward compatibility.

Each document indexed is associated with a _type (see Mapping Type) and an _id. These values are combined as {type}#{id} and indexed as the _uid field.

The value of the _uid field is accessible in queries, aggregations, scripts, and when sorting:

# Example documents
PUT my_index/_doc/1
{
  "text": "Document with ID 1"
}

PUT my_index/_doc/2?refresh=true
{
  "text": "Document with ID 2"
}
GET my_index/_search
{
  "query": {
    "terms": {
      "_uid": [ "_doc#1", "_doc#2" ] (1)
    }
  },
  "aggs": {
    "UIDs": {
      "terms": {
        "field": "_uid", (2)
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_uid": { (3)
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "UID": {
      "script": {
         "lang": "painless",
         "source": "doc['_uid']" (4)
      }
    }
  }
}
  1. Querying on the _uid field (also see the ids query)

  2. Aggregating on the _uid field

  3. Sorting on the _uid field

  4. Accessing the _uid field in scripts

Mapping parameters

The following pages provide detailed explanations of the various mapping parameters that are used by field mappings:

The following mapping parameters are common to some or all field datatypes:

analyzer

The values of analyzed string fields are passed through an analyzer to convert the string into a stream of tokens or terms. For instance, the string "The quick Brown Foxes." may, depending on which analyzer is used, be analyzed to the tokens: quick, brown, fox. These are the actual terms that are indexed for the field, which makes it possible to search efficiently for individual words within big blobs of text.

This analysis process needs to happen not just at index time, but also at query time: the query string needs to be passed through the same (or a similar) analyzer so that the terms that it tries to find are in the same format as those that exist in the index.

Elasticsearch ships with a number of pre-defined analyzers, which can be used without further configuration. It also ships with many character filters, tokenizers, and token filters which can be combined to configure custom analyzers per index.

Analyzers can be specified per-query, per-field or per-index. At index time, Elasticsearch will look for an analyzer in this order:

  • The analyzer defined in the field mapping.

  • An analyzer named default in the index settings.

  • The standard analyzer.

At query time, there are a few more layers:

  • The analyzer defined in a full-text query.

  • The search_analyzer defined in the field mapping.

  • The analyzer defined in the field mapping.

  • An analyzer named default_search in the index settings.

  • An analyzer named default in the index settings.

  • The standard analyzer.

The easiest way to specify an analyzer for a particular field is to define it in the field mapping, as follows:

PUT /my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { (1)
          "type": "text",
          "fields": {
            "english": { (2)
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

GET my_index/_analyze (3)
{
  "field": "text",
  "text": "The quick Brown Foxes."
}

GET my_index/_analyze (4)
{
  "field": "text.english",
  "text": "The quick Brown Foxes."
}
  1. The text field uses the default standard analyzer.

  2. The text.english multi-field uses the english analyzer, which removes stop words and applies stemming.

  3. This returns the tokens: [ the, quick, brown, foxes ].

  4. This returns the tokens: [ quick, brown, fox ].

search_quote_analyzer

The search_quote_analyzer setting allows you to specify an analyzer for phrases. This is particularly useful when you want to remove stop words for regular searches but keep them for phrase queries.

To set this up, a field using three analyzer settings is required:

  1. An analyzer setting for indexing all terms including stop words

  2. A search_analyzer setting for non-phrase queries that will remove stop words

  3. A search_quote_analyzer setting for phrase queries that will not remove stop words

PUT my_index
{
   "settings":{
      "analysis":{
         "analyzer":{
            "my_analyzer":{ (1)
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase"
               ]
            },
            "my_stop_analyzer":{ (2)
               "type":"custom",
               "tokenizer":"standard",
               "filter":[
                  "lowercase",
                  "english_stop"
               ]
            }
         },
         "filter":{
            "english_stop":{
               "type":"stop",
               "stopwords":"_english_"
            }
         }
      }
   },
   "mappings":{
      "_doc":{
         "properties":{
            "title": {
               "type":"text",
               "analyzer":"my_analyzer", (3)
               "search_analyzer":"my_stop_analyzer", (4)
               "search_quote_analyzer":"my_analyzer" (5)
            }
         }
      }
   }
}

PUT my_index/_doc/1
{
   "title":"The Quick Brown Fox"
}

PUT my_index/_doc/2
{
   "title":"A Quick Brown Fox"
}

GET my_index/_search
{
   "query":{
      "query_string":{
         "query":"\"the quick brown fox\"" (6)
      }
   }
}
  1. my_analyzer analyzer which tokenizes all terms, including stop words

  2. my_stop_analyzer analyzer which removes stop words

  3. analyzer setting that points to the my_analyzer analyzer which will be used at index time

  4. search_analyzer setting that points to the my_stop_analyzer and removes stop words for non-phrase queries

  5. search_quote_analyzer setting that points to the my_analyzer analyzer and ensures that stop words are not removed from phrase queries

  6. Since the query is wrapped in quotes it is detected as a phrase query therefore the search_quote_analyzer kicks in and ensures the stop words are not removed from the query. The my_analyzer analyzer will then return the following tokens [the, quick, brown, fox] which will match one of the documents. Meanwhile term queries will be analyzed with the my_stop_analyzer analyzer which will filter out stop words. So a search for either The quick brown fox or A quick brown fox will return both documents since both documents contain the following tokens [quick, brown, fox]. Without the search_quote_analyzer it would not be possible to do exact matches for phrase queries as the stop words from phrase queries would be removed resulting in both documents matching.

normalizer

The normalizer property of keyword fields is similar to analyzer except that it guarantees that the analysis chain produces a single token.

The normalizer is applied prior to indexing the keyword, as well as at search-time when the keyword field is searched via a query parser such as the match query or via a term-level query such as the term query.

PUT index
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "char_filter": [],
          "filter": ["lowercase", "asciifolding"]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "foo": {
          "type": "keyword",
          "normalizer": "my_normalizer"
        }
      }
    }
  }
}

PUT index/_doc/1
{
  "foo": "BÀR"
}

PUT index/_doc/2
{
  "foo": "bar"
}

PUT index/_doc/3
{
  "foo": "baz"
}

POST index/_refresh

GET index/_search
{
  "query": {
    "term": {
      "foo": "BAR"
    }
  }
}

GET index/_search
{
  "query": {
    "match": {
      "foo": "BAR"
    }
  }
}

The above queries match documents 1 and 2, since BÀR is converted to bar at both index and query time, and return a response like this:

{
  "took": $body.took,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 2,
    "max_score": 0.2876821,
    "hits": [
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.2876821,
        "_source": {
          "foo": "bar"
        }
      },
      {
        "_index": "index",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.2876821,
        "_source": {
          "foo": "BÀR"
        }
      }
    ]
  }
}

The fact that keywords are converted prior to indexing also means that aggregations return normalized values:

GET index/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "foo"
      }
    }
  }
}

returns

{
  "took": 43,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "bar",
          "doc_count": 2
        },
        {
          "key": "baz",
          "doc_count": 1
        }
      ]
    }
  }
}

boost

Individual fields can be boosted automatically — count more towards the relevance score — at query time, with the boost parameter as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "boost": 2 (1)
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}
  1. Matches on the title field will have twice the weight of those on the content field, which has the default boost of 1.0.

Note
The boost is applied only for term queries (prefix, range and fuzzy queries are not boosted).

You can achieve the same effect by using the boost parameter directly in the query. For instance, the following query (which relies on the boost defined in the field mapping):

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox"
            }
        }
    }
}

is equivalent to:

POST _search
{
    "query": {
        "match" : {
            "title": {
                "query": "quick brown fox",
                "boost": 2
            }
        }
    }
}

The boost is also carried over when the field value is copied into the _all field. This means that, when querying the _all field, words that originated from the title field will have a higher score than words that originated in the content field. This functionality comes at a cost: queries on the _all field are slower when field boosting is used.

Deprecated in 5.0.0: Index time boost is deprecated. Instead, the field mapping boost is applied at query time. For indices created before 5.0.0, the boost will still be applied at index time.

Warning
Why index time boosting is a bad idea

We advise against using index time boosting for the following reasons:

  • You cannot change index-time boost values without reindexing all of your documents.

  • Every query supports query-time boosting which achieves the same effect. The difference is that you can tweak the boost value without having to reindex.

  • Index-time boosts are stored as part of the norm, which is only one byte. This reduces the resolution of the field length normalization factor which can lead to lower quality relevance calculations.

coerce

Data is not always clean. Depending on how it is produced a number might be rendered in the JSON body as a true JSON number, e.g. 5, but it might also be rendered as a string, e.g. "5". Alternatively, a number that should be an integer might instead be rendered as a floating point, e.g. 5.0, or even "5.0".

Coercion attempts to clean up dirty values to fit the datatype of a field. For instance:

  • Strings will be coerced to numbers.

  • Floating points will be truncated for integer values.

For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "number_one": {
          "type": "integer"
        },
        "number_two": {
          "type": "integer",
          "coerce": false
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "number_one": "10" (1)
}

PUT my_index/_doc/2
{
  "number_two": "10" (2)
}
  1. The number_one field will contain the integer 10.

  2. This document will be rejected because coercion is disabled.

Tip
The coerce setting is allowed to have different settings for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.

Index-level default

The index.mapping.coerce setting can be set on the index level to disable coercion globally across all mapping types:

PUT my_index
{
  "settings": {
    "index.mapping.coerce": false
  },
  "mappings": {
    "_doc": {
      "properties": {
        "number_one": {
          "type": "integer",
          "coerce": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "number_one": "10" } (1)

PUT my_index/_doc/2
{ "number_two": "10" } (2)
  1. The number_one field overrides the index level setting to enable coercion.

  2. This document will be rejected because the number_two field inherits the index-level coercion setting.

copy_to

The copy_to parameter allows you to create custom _all fields. In other words, the values of multiple fields can be copied into a group field, which can then be queried as a single field. For instance, the first_name and last_name fields can be copied to the full_name field as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "first_name": {
          "type": "text",
          "copy_to": "full_name" (1)
        },
        "last_name": {
          "type": "text",
          "copy_to": "full_name" (1)
        },
        "full_name": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "first_name": "John",
  "last_name": "Smith"
}

GET my_index/_search
{
  "query": {
    "match": {
      "full_name": { (2)
        "query": "John Smith",
        "operator": "and"
      }
    }
  }
}
  1. The values of the first_name and last_name fields are copied to the full_name field.

  2. The first_name and last_name fields can still be queried for the first name and last name respectively, but the full_name field can be queried for both first and last names.

Some important points:

  • It is the field value which is copied, not the terms (which result from the analysis process).

  • The original _source field will not be modified to show the copied values.

  • The same value can be copied to multiple fields, with "copy_to": [ "field_1", "field_2" ] (see the sketch after this list).

  • You cannot copy recursively via intermediary fields. For example, a copy_to on field_1 to field_2 combined with a copy_to on field_2 to field_3 will not cause a value indexed into field_1 to end up in field_3. Instead, use copy_to directly from the originating field to multiple fields.
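
As a sketch of the last two points (index and field names are only an illustration), copy the originating field directly to several targets rather than chaining copies:

PUT copy_to_example
{
  "mappings": {
    "_doc": {
      "properties": {
        "field_1": {
          "type": "text",
          "copy_to": [ "field_2", "field_3" ]
        },
        "field_2": {
          "type": "text"
        },
        "field_3": {
          "type": "text"
        }
      }
    }
  }
}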

doc_values

Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in a unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.

Sorting, aggregations, and access to field values in scripts requires a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms that it has in a field.

Doc values are the on-disk data structure, built at document index time, which makes this data access pattern possible. They store the same values as the _source but in a column-oriented fashion that is way more efficient for sorting and aggregations. Doc values are supported on almost all field types, with the notable exception of analyzed string fields.

All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": { (1)
          "type":       "keyword"
        },
        "session_id": { (2)
          "type":       "keyword",
          "doc_values": false
        }
      }
    }
  }
}
  1. The status_code field has doc_values enabled by default.

  2. The session_id has doc_values disabled, but can still be queried.

dynamic

By default, fields can be added dynamically to a document, or to inner objects within a document, just by indexing a document containing the new field. For instance:

PUT my_index/_doc/1 (1)
{
  "username": "johnsmith",
  "name": {
    "first": "John",
    "last": "Smith"
  }
}

GET my_index/_mapping (2)

PUT my_index/_doc/2 (3)
{
  "username": "marywhite",
  "email": "mary@white.com",
  "name": {
    "first": "Mary",
    "middle": "Alice",
    "last": "White"
  }
}

GET my_index/_mapping (4)
  1. This document introduces the string field username, the object field name, and two string fields under the name object which can be referred to as name.first and name.last.

  2. Check the mapping to verify the above.

  3. This document adds two string fields: email and name.middle.

  4. Check the mapping to verify the changes.

The details of how new fields are detected and added to the mapping is explained in Dynamic Mapping.

The dynamic setting controls whether new fields can be added dynamically or not. It accepts three settings:

true

Newly detected fields are added to the mapping. (default)

false

Newly detected fields are ignored. These fields will not be indexed, so they will not be searchable, but they will still appear in the _source field of returned hits. These fields will not be added to the mapping; new fields must be added explicitly.

strict

If new fields are detected, an exception is thrown and the document is rejected. New fields must be explicitly added to the mapping.

The dynamic setting may be set at the mapping type level, and on each inner object. Inner objects inherit the setting from their parent object or from the mapping type. For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic": false, (1)
      "properties": {
        "user": { (2)
          "properties": {
            "name": {
              "type": "text"
            },
            "social_networks": { (3)
              "dynamic": true,
              "properties": {}
            }
          }
        }
      }
    }
  }
}
  1. Dynamic mapping is disabled at the type level, so no new top-level fields will be added dynamically.

  2. The user object inherits the type-level setting.

  3. The user.social_networks object enables dynamic mapping, so new fields may be added to this inner object.

Tip
The dynamic setting can be updated on existing fields using the PUT mapping API.

enabled

Elasticsearch tries to index all of the fields you give it, but sometimes you want to just store the field without indexing it. For instance, imagine that you are using Elasticsearch as a web session store. You may want to index the session ID and last update time, but you don’t need to query or run aggregations on the session data itself.

The enabled setting, which can be applied only to the mapping type and to object fields, causes Elasticsearch to skip parsing of the contents of the field entirely. The JSON can still be retrieved from the _source field, but it is not searchable or stored in any other way:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "user_id": {
          "type":  "keyword"
        },
        "last_updated": {
          "type": "date"
        },
        "session_data": { (1)
          "enabled": false
        }
      }
    }
  }
}

PUT my_index/_doc/session_1
{
  "user_id": "kimchy",
  "session_data": { (2)
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

PUT my_index/_doc/session_2
{
  "user_id": "jpountz",
  "session_data": "none", (3)
  "last_updated": "2015-12-06T18:22:13"
}
  1. The session_data field is disabled.

  2. Any arbitrary data can be passed to the session_data field as it will be entirely ignored.

  3. The session_data will also ignore values that are not JSON objects.

The entire mapping type may be disabled as well, in which case the document is stored in the _source field, which means it can be retrieved, but none of its contents are indexed in any way:

PUT my_index
{
  "mappings": {
    "_doc": { (1)
      "enabled": false
    }
  }
}

PUT my_index/_doc/session_1
{
  "user_id": "kimchy",
  "session_data": {
    "arbitrary_object": {
      "some_array": [ "foo", "bar", { "baz": 2 } ]
    }
  },
  "last_updated": "2015-12-06T18:20:22"
}

GET my_index/_doc/session_1 (2)

GET my_index/_mapping (3)
  1. The entire _doc mapping type is disabled.

  2. The document can be retrieved.

  3. Checking the mapping reveals that no fields have been added.

The enabled setting for existing fields and the top-level mapping definition cannot be updated.

eager_global_ordinals

What are global ordinals?

To support aggregations and other operations that require looking up field values on a per-document basis, Elasticsearch uses a data structure called doc values. Term-based field types such as keyword store their doc values using an ordinal mapping for a more compact representation. This mapping works by assigning each term an incremental integer or 'ordinal' based on its lexicographic order. The field’s doc values store only the ordinals for each document instead of the original terms, with a separate lookup structure to convert between ordinals and terms.

When used during aggregations, ordinals can greatly improve performance. As an example, the terms aggregation relies only on ordinals to collect documents into buckets at the shard-level, then converts the ordinals back to their original term values when combining results across shards.

Each index segment defines its own ordinal mapping, but aggregations collect data across an entire shard. So to be able to use ordinals for shard-level operations like aggregations, Elasticsearch creates a unified mapping called 'global ordinals'. The global ordinal mapping is built on top of segment ordinals, and works by maintaining a map from global ordinal to the local ordinal for each segment.

Global ordinals are used if a search contains any of the following components:

  • Certain bucket aggregations on keyword, ip, and flattened fields. This includes terms aggregations as mentioned above, as well as composite, diversified_sampler, and significant_terms.

  • Bucket aggregations on text fields that require fielddata to be enabled.

  • Operations on parent and child documents from a join field, including has_child queries and parent aggregations.

Note
The global ordinal mapping is an on-heap data structure. When measuring memory usage, Elasticsearch counts the memory from global ordinals as 'fielddata'. Global ordinals memory is included in the fielddata circuit breaker, and is returned under fielddata in the node stats response.

Loading global ordinals

The global ordinal mapping must be built before ordinals can be used during a search. By default, the mapping is loaded during search on the first time that global ordinals are needed. This is the right approach if you are optimizing for indexing speed, but if search performance is a priority, it is recommended to eagerly load global ordinals on fields that will be used in aggregations:

PUT my_index/_mapping/_doc
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": true
    }
  }
}

When eager_global_ordinals is enabled, global ordinals are built when a shard is refreshed — Elasticsearch always loads them before exposing changes to the content of the index. This shifts the cost of building global ordinals from search to index-time. Elasticsearch will also eagerly build global ordinals when creating a new copy of a shard, as can occur when increasing the number of replicas or relocating a shard onto a new node.

Eager loading can be disabled at any time by updating the eager_global_ordinals setting:

PUT my_index/_mapping/_doc
{
  "properties": {
    "tags": {
      "type": "keyword",
      "eager_global_ordinals": false
    }
  }
}
Important
On a frozen index, global ordinals are discarded after each search and rebuilt again when they’re requested. This means that eager_global_ordinals should not be used on frozen indices: it would cause global ordinals to be reloaded on every search. Instead, the index should be force-merged to a single segment before being frozen. This avoids building global ordinals altogether (more details can be found in the next section).

Avoiding global ordinal loading

Usually, global ordinals do not present a large overhead in terms of their loading time and memory usage. However, loading global ordinals can be expensive on indices with large shards, or if the fields contain a large number of unique term values. Because global ordinals provide a unified mapping for all segments on the shard, they also need to be rebuilt entirely when a new segment becomes visible.

In some cases it is possible to avoid global ordinal loading altogether:

  • The terms, sampler, and significant_terms aggregations support a parameter execution_hint that helps control how buckets are collected. It defaults to global_ordinals, but can be set to map to instead use the term values directly (see the sketch after this list).

  • If a shard has been force-merged down to a single segment, then its segment ordinals are already 'global' to the shard. In this case, Elasticsearch does not need to build a global ordinal mapping and there is no additional overhead from using global ordinals. Note that for performance reasons you should only force-merge an index to which you will never write again.
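
For instance, a minimal sketch (reusing the tags field from the earlier eager_global_ordinals example) of a terms aggregation that uses the map execution hint instead of global ordinals:

GET my_index/_search
{
  "size": 0,
  "aggs": {
    "tag_counts": {
      "terms": {
        "field": "tags",
        "execution_hint": "map"
      }
    }
  }
}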

fielddata

Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.

Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".

Most fields can use index-time, on-disk doc_values for this data access pattern, but text fields do not support doc_values.

Instead, text fields use a query-time in-memory data structure called fielddata. This data structure is built on demand the first time that a field is used for aggregations, sorting, or in a script. It is built by reading the entire inverted index for each segment from disk, inverting the term ↔︎ document relationship, and storing the result in memory, in the JVM heap.

Fielddata is disabled on text fields by default

Fielddata can consume a lot of heap space, especially when loading high cardinality text fields. Once fielddata has been loaded into the heap, it remains there for the lifetime of the segment. Also, loading fielddata is an expensive process which can cause users to experience latency hits. This is why fielddata is disabled by default.

If you try to sort, aggregate, or access values from a script on a text field, you will see this exception:

Fielddata is disabled on text fields by default.  Set `fielddata=true` on
[`your_field_name`] in order to load  fielddata in memory by uninverting the
inverted index. Note that this can however use significant memory.

Before enabling fielddata

Before you enable fielddata, consider why you are using a text field for aggregations, sorting, or in a script. It usually doesn’t make sense to do so.

A text field is analyzed before indexing so that a value like New York can be found by searching for new or for york. A terms aggregation on this field will return a new bucket and a york bucket, when you probably want a single bucket called New York.

Instead, you should have a text field for full text searches, and an unanalyzed keyword field with doc_values enabled for aggregations, as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": { (1)
          "type": "text",
          "fields": {
            "keyword": { (2)
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
  1. Use the my_field field for searches.

  2. Use the my_field.keyword field for aggregations, sorting, or in scripts.

Enabling fielddata on text fields

You can enable fielddata on an existing text field using the PUT mapping API as follows:

PUT my_index/_mapping/_doc
{
  "properties": {
    "my_field": { (1)
      "type":     "text",
      "fielddata": true
    }
  }
}
  1. The mapping that you specify for my_field should consist of the existing mapping for that field, plus the fielddata parameter.

fielddata_frequency_filter

Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:

The frequency filter allows you to only load terms whose document frequency falls between a min and max value, which can be expressed as an absolute number (when the number is bigger than 1.0) or as a percentage (e.g. 0.01 is 1% and 1.0 is 100%). Frequency is calculated per segment. Percentages are based on the number of docs which have a value for the field, as opposed to all docs in the segment.

Small segments can be excluded completely by specifying the minimum number of docs that the segment should contain with min_segment_size:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "tag": {
          "type": "text",
          "fielddata": true,
          "fielddata_frequency_filter": {
            "min": 0.001,
            "max": 0.1,
            "min_segment_size": 500
          }
        }
      }
    }
  }
}

format

In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.

Besides the built-in formats, your own custom formats can be specified using the familiar yyyy/MM/dd syntax:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type":   "date",
          "format": "yyyy-MM-dd"
        }
      }
    }
  }
}

Many APIs which support date values also support date math expressions, such as now-1M/d: the current time, minus one month, rounded down to the nearest day.
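
For example, a minimal sketch of a range query that uses date math against the date field mapped above (the bounds are only an illustration):

GET my_index/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "now-1M/d",
        "lt": "now/d"
      }
    }
  }
}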

Custom date formats

Completely customizable date formats are supported. The syntax for these is explained in the Joda docs.

Built In Formats

Most of the below formats have a strict companion format, which means that the year, month and day parts of the date must use 4, 2 and 2 digits exactly, potentially prepending zeros. For instance a date like 5/11/1 would be considered invalid and would need to be rewritten to 2005/11/01 to be accepted by the date parser.

To use them, you need to prepend strict_ to the name of the date format, for instance strict_date_optional_time instead of date_optional_time.

These strict date formats are especially useful when date fields are dynamically mapped in order to make sure to not accidentally map irrelevant strings as dates.

The following lists all the default ISO formats supported:

epoch_millis

A formatter for the number of milliseconds since the epoch. Note that this timestamp is subject to the limits of a Java Long.MIN_VALUE and Long.MAX_VALUE.

epoch_second

A formatter for the number of seconds since the epoch. Note that this timestamp is subject to the limits of a Java Long.MIN_VALUE and Long.MAX_VALUE divided by 1000 (the number of milliseconds in a second).

date_optional_time or strict_date_optional_time

A generic ISO datetime parser where the date is mandatory and the time is optional. Full details here.

basic_date

A basic formatter for a full date as four digit year, two digit month of year, and two digit day of month: yyyyMMdd.

basic_date_time

A basic formatter that combines a basic date and time, separated by a 'T': yyyyMMdd'T'HHmmss.SSSZ.

basic_date_time_no_millis

A basic formatter that combines a basic date and time without millis, separated by a 'T': yyyyMMdd'T'HHmmssZ.

basic_ordinal_date

A formatter for a full ordinal date, using a four digit year and three digit dayOfYear: yyyyDDD.

basic_ordinal_date_time

A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear: yyyyDDD'T'HHmmss.SSSZ.

basic_ordinal_date_time_no_millis

A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear: yyyyDDD'T'HHmmssZ.

basic_time

A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset: HHmmss.SSSZ.

basic_time_no_millis

A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset: HHmmssZ.

basic_t_time

A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset prefixed by 'T': 'T'HHmmss.SSSZ.

basic_t_time_no_millis

A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by 'T': 'T'HHmmssZ.

basic_week_date or strict_basic_week_date

A basic formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week: xxxx'W'wwe.

basic_week_date_time or strict_basic_week_date_time

A basic formatter that combines a basic weekyear date and time, separated by a 'T': xxxx'W'wwe'T'HHmmss.SSSZ.

basic_week_date_time_no_millis or strict_basic_week_date_time_no_millis

A basic formatter that combines a basic weekyear date and time without millis, separated by a 'T': xxxx'W'wwe'T'HHmmssZ.

date or strict_date

A formatter for a full date as four digit year, two digit month of year, and two digit day of month: yyyy-MM-dd.

date_hour or strict_date_hour

A formatter that combines a full date and two digit hour of day: yyyy-MM-dd'T'HH.

date_hour_minute or strict_date_hour_minute

A formatter that combines a full date, two digit hour of day, and two digit minute of hour: yyyy-MM-dd'T'HH:mm.

date_hour_minute_second or strict_date_hour_minute_second

A formatter that combines a full date, two digit hour of day, two digit minute of hour, and two digit second of minute: yyyy-MM-dd'T'HH:mm:ss.

date_hour_minute_second_fraction or strict_date_hour_minute_second_fraction

A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: yyyy-MM-dd'T'HH:mm:ss.SSS.

date_hour_minute_second_millis or strict_date_hour_minute_second_millis

A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: yyyy-MM-dd'T'HH:mm:ss.SSS.

date_time or strict_date_time

A formatter that combines a full date and time, separated by a 'T': yyyy-MM-dd'T'HH:mm:ss.SSSZZ.

date_time_no_millis or strict_date_time_no_millis

A formatter that combines a full date and time without millis, separated by a 'T': yyyy-MM-dd'T'HH:mm:ssZZ.

hour or strict_hour

A formatter for a two digit hour of day: HH

hour_minute or strict_hour_minute

A formatter for a two digit hour of day and two digit minute of hour: HH:mm.

hour_minute_second or strict_hour_minute_second

A formatter for a two digit hour of day, two digit minute of hour, and two digit second of minute: HH:mm:ss.

hour_minute_second_fraction or strict_hour_minute_second_fraction

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: HH:mm:ss.SSS.

hour_minute_second_millis or strict_hour_minute_second_millis

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second: HH:mm:ss.SSS.

ordinal_date or strict_ordinal_date

A formatter for a full ordinal date, using a four digit year and three digit dayOfYear: yyyy-DDD.

ordinal_date_time or strict_ordinal_date_time

A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear: yyyy-DDD'T'HH:mm:ss.SSSZZ.

ordinal_date_time_no_millis or strict_ordinal_date_time_no_millis

A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear: yyyy-DDD'T'HH:mm:ssZZ.

time or strict_time

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset: HH:mm:ss.SSSZZ.

time_no_millis or strict_time_no_millis

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset: HH:mm:ssZZ.

t_time or strict_t_time

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset prefixed by 'T': 'T'HH:mm:ss.SSSZZ.

t_time_no_millis or strict_t_time_no_millis

A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by 'T': 'T'HH:mm:ssZZ.

week_date or strict_week_date

A formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week: xxxx-'W'ww-e.

week_date_time or strict_week_date_time

A formatter that combines a full weekyear date and time, separated by a 'T': xxxx-'W'ww-e'T'HH:mm:ss.SSSZZ.

week_date_time_no_millis or strict_week_date_time_no_millis

A formatter that combines a full weekyear date and time without millis, separated by a 'T': xxxx-'W'ww-e'T'HH:mm:ssZZ.

weekyear or strict_weekyear

A formatter for a four digit weekyear: xxxx.

weekyear_week or strict_weekyear_week

A formatter for a four digit weekyear and two digit week of weekyear: xxxx-'W'ww.

weekyear_week_day or strict_weekyear_week_day

A formatter for a four digit weekyear, two digit week of weekyear, and one digit day of week: xxxx-'W'ww-e.

year or strict_year

A formatter for a four digit year: yyyy.

year_month or strict_year_month

A formatter for a four digit year and two digit month of year: yyyy-MM.

year_month_day or strict_year_month_day

A formatter for a four digit year, two digit month of year, and two digit day of month: yyyy-MM-dd.

ignore_above

Strings longer than the ignore_above setting will not be indexed or stored. For arrays of strings, ignore_above will be applied for each array element separately and string elements longer than ignore_above will not be indexed or stored.

Note
All strings/array elements will still be present in the _source field, if the latter is enabled, which is the default in Elasticsearch.
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 20 (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1 (2)
{
  "message": "Syntax error"
}

PUT my_index/_doc/2 (3)
{
  "message": "Syntax error with some long stacktrace"
}

GET _search (4)
{
  "aggs": {
    "messages": {
      "terms": {
        "field": "message"
      }
    }
  }
}
  1. This field will ignore any string longer than 20 characters.

  2. This document is indexed successfully.

  3. This document will be indexed, but without indexing the message field.

  4. Search returns both documents, but only the first is present in the terms aggregation.

Tip
The ignore_above setting is allowed to have different settings for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.

This option is also useful for protecting against Lucene’s term byte-length limit of 32766.

Note
The value for ignore_above is the character count, but Lucene counts bytes. If you use UTF-8 text with many non-ASCII characters, you may want to set the limit to 32766 / 4 = 8191 since UTF-8 characters may occupy at most 4 bytes.
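
For instance, a sketch of a mapping that applies that limit (index and field names are illustrative):

PUT my_utf8_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "message": {
          "type": "keyword",
          "ignore_above": 8191
        }
      }
    }
  }
}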

ignore_malformed

Sometimes you don’t have much control over the data that you receive. One user may send a login field that is a date, and another sends a login field that is an email address.

Trying to index the wrong datatype into a field throws an exception by default, and rejects the whole document. The ignore_malformed parameter, if set to true, allows the exception to be ignored. The malformed field is not indexed, but other fields in the document are processed normally.

For example:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "number_one": {
          "type": "integer",
          "ignore_malformed": true
        },
        "number_two": {
          "type": "integer"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text":       "Some text value",
  "number_one": "foo" (1)
}

PUT my_index/_doc/2
{
  "text":       "Some text value",
  "number_two": "foo" (2)
}
  1. This document will have the text field indexed, but not the number_one field.

  2. This document will be rejected because number_two does not allow malformed values.

Tip
The ignore_malformed setting is allowed to have different settings for fields of the same name in the same index. Its value can be updated on existing fields using the PUT mapping API.

Index-level default

The index.mapping.ignore_malformed setting can be set at the index level to ignore malformed content globally across all mapping types.

PUT my_index
{
  "settings": {
    "index.mapping.ignore_malformed": true (1)
  },
  "mappings": {
    "_doc": {
      "properties": {
        "number_one": { (1)
          "type": "byte"
        },
        "number_two": {
          "type": "integer",
          "ignore_malformed": false (2)
        }
      }
    }
  }
}
  1. The number_one field inherits the index-level setting.

  2. The number_two field overrides the index-level setting to turn off ignore_malformed.

Dealing with malformed fields

Malformed fields are silently ignored at indexing time when ignore_malformed is turned on. Whenever possible it is recommended to keep the number of documents that have a malformed field contained, or queries on this field will become meaningless. Elasticsearch makes it easy to check how many documents have malformed fields by using exists, term, or terms queries on the special _ignored field.

Limits for JSON Objects

You can’t use ignore_malformed with the following datatypes:

  • Nested datatype

  • Object datatype

  • Range datatypes

You also can’t use ignore_malformed to ignore JSON objects submitted to fields of the wrong datatype. A JSON object is any data surrounded by curly brackets "{}" and includes data mapped to the nested, object, and range datatypes.

If you submit a JSON object to an unsupported field, {es} will return an error and reject the entire document regardless of the ignore_malformed setting.

index

The index option controls whether field values are indexed. It accepts true or false and defaults to true. Fields that are not indexed are not queryable.
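For example, a minimal sketch of a field whose values are kept in _source but are not searchable (the session_data field name is purely illustrative):

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "session_data": {
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}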

index_options

The index_options parameter controls what information is added to the inverted index, for search and highlighting purposes. It accepts the following settings:

docs

Only the doc number is indexed. Can answer the question Does this term exist in this field?

freqs

Doc number and term frequencies are indexed. Term frequencies are used to score repeated terms higher than single terms.

positions

Doc number, term frequencies, and term positions (or order) are indexed. Positions can be used for proximity or phrase queries.

offsets

Doc number, term frequencies, positions, and start and end character offsets (which map the term back to the original string) are indexed. Offsets are used by the unified highlighter to speed up highlighting.

Warning
The index_options parameter has been deprecated for Numeric fields in 6.0.0.

Analyzed string fields use positions as the default, and all other fields use docs as the default.

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "index_options": "offsets"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} (1)
    }
  }
}
  1. The text field will use the postings for the highlighting by default because offsets are indexed.

index_phrases

If enabled, two-term word combinations ('shingles') are indexed into a separate field. This allows exact phrase queries (no slop) to run more efficiently, at the expense of a larger index. Note that this works best when stopwords are not removed, as phrases containing stopwords will not use the subsidiary field and will fall back to a standard phrase query. Accepts true or false (default).
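A minimal sketch of enabling it on a text field (the body_copy field name is purely illustrative):

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "body_copy": {
          "type": "text",
          "index_phrases": true
        }
      }
    }
  }
}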

index_prefixes

The index_prefixes parameter enables the indexing of term prefixes to speed up prefix searches. It accepts the following optional settings:

min_chars

The minimum prefix length to index. Must be greater than 0, and defaults to 2. The value is inclusive.

max_chars

The maximum prefix length to index. Must be less than 20, and defaults to 5. The value is inclusive.

This example creates a text field using the default prefix length settings:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "body_text": {
          "type": "text",
          "index_prefixes": { }    (1)
        }
      }
    }
  }
}
  1. An empty settings object will use the default min_chars and max_chars settings.

This example uses custom prefix length settings:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "full_name": {
          "type": "text",
          "index_prefixes": {
            "min_chars" : 1,
            "max_chars" : 10
          }
        }
      }
    }
  }
}

fields

It is often useful to index the same field in different ways for different purposes. This is the purpose of multi-fields. For instance, a string field could be mapped as a text field for full-text search, and as a keyword field for sorting or aggregations:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": { (1)
              "type":  "keyword"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "city": "New York"
}

PUT my_index/_doc/2
{
  "city": "York"
}

GET my_index/_search
{
  "query": {
    "match": {
      "city": "york" (2)
    }
  },
  "sort": {
    "city.raw": "asc" (3)
  },
  "aggs": {
    "Cities": {
      "terms": {
        "field": "city.raw" (3)
      }
    }
  }
}
  1. The city.raw field is a keyword version of the city field.

  2. The city field can be used for full text search.

  3. The city.raw field can be used for sorting and aggregations.

Note
Multi-fields do not change the original _source field.
Tip
The fields setting is allowed to have different settings for fields of the same name in the same index. New multi-fields can be added to existing fields using the PUT mapping API.
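For example, a hedged sketch of adding a second multi-field to the existing city field from the example above (the length sub-field and its token_count datatype are purely illustrative):

PUT my_index/_mapping/_doc
{
  "properties": {
    "city": {
      "type": "text",
      "fields": {
        "raw": {
          "type": "keyword"
        },
        "length": {
          "type": "token_count",
          "analyzer": "standard"
        }
      }
    }
  }
}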

Multi-fields with multiple analyzers

Another use case of multi-fields is to analyze the same field in different ways for better relevance. For instance we could index a field with the standard analyzer which breaks text up into words, and again with the english analyzer which stems words into their root form:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": { (1)
          "type": "text",
          "fields": {
            "english": { (2)
              "type":     "text",
              "analyzer": "english"
            }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1
{ "text": "quick brown fox" } (3)

PUT my_index/_doc/2
{ "text": "quick brown foxes" } (3)

GET my_index/_search
{
  "query": {
    "multi_match": {
      "query": "quick brown foxes",
      "fields": [ (4)
        "text",
        "text.english"
      ],
      "type": "most_fields" (4)
    }
  }
}
  1. The text field uses the standard analyzer.

  2. The text.english field uses the english analyzer.

  3. Index two documents, one with fox and the other with foxes.

  4. Query both the text and text.english fields and combine the scores.

The text field contains the term fox in the first document and foxes in the second document. The text.english field contains fox for both documents, because foxes is stemmed to fox.

The query string is also analyzed by the standard analyzer for the text field, and by the english analyzer for the text.english field. The stemmed field allows a query for foxes to also match the document containing just fox. This allows us to match as many documents as possible. By also querying the unstemmed text field, we improve the relevance score of the document which matches foxes exactly.

norms

Norms store various normalization factors that are later used at query time in order to compute the score of a document relative to a query.

Although useful for scoring, norms also require quite a lot of disk space (typically on the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field. In particular, this is the case for fields that are used solely for filtering or aggregations.

Tip
The norms setting must have the same setting for fields of the same name in the same index. Norms can be disabled on existing fields using the PUT mapping API.

Norms can be disabled (but not reenabled) after the fact, using the PUT mapping API like so:

PUT my_index/_mapping/_doc
{
  "properties": {
    "title": {
      "type": "text",
      "norms": false
    }
  }
}
Note
Norms will not be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had norms removed might return inconsistent results since some documents won’t have norms anymore while other documents might still have norms.

null_value

A null value cannot be indexed or searched. When a field is set to null (or an empty array, or an array of null values), it is treated as though that field has no values.

The null_value parameter allows you to replace explicit null values with the specified value so that it can be indexed and searched. For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "status_code": {
          "type":       "keyword",
          "null_value": "NULL" (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "status_code": null
}

PUT my_index/_doc/2
{
  "status_code": [] (2)
}

GET my_index/_search
{
  "query": {
    "term": {
      "status_code": "NULL" (3)
    }
  }
}
  1. Replace explicit null values with the term NULL.

  2. An empty array does not contain an explicit null, and so won’t be replaced with the null_value.

  3. A query for NULL returns document 1, but not document 2.

Important
The null_value needs to be the same datatype as the field. For instance, a long field cannot have a string null_value.
Note
The null_value only influences how data is indexed, it doesn’t modify the _source document.

position_increment_gap

Analyzed text fields take term positions into account, in order to be able to support proximity or phrase queries. When indexing text fields with multiple values a "fake" gap is added between the values to prevent most phrase queries from matching across the values. The size of this gap is configured using position_increment_gap and defaults to 100.

For example:

PUT my_index/_doc/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln" (1)
            }
        }
    }
}

GET my_index/_search
{
    "query": {
        "match_phrase": {
            "names": {
                "query": "Abraham Lincoln",
                "slop": 101 (2)
            }
        }
    }
}
  1. This phrase query doesn’t match our document, which is expected.

  2. This phrase query matches our document, even though Abraham and Lincoln are in separate strings, because slop > position_increment_gap.

The position_increment_gap can be specified in the mapping. For instance:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "names": {
          "type": "text",
          "position_increment_gap": 0 (1)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
    "names": [ "John Abraham", "Lincoln Smith"]
}

GET my_index/_search
{
    "query": {
        "match_phrase": {
            "names": "Abraham Lincoln" (2)
        }
    }
}
  1. The first term in the next array element will be 0 terms apart from the last term in the previous array element.

  2. The phrase query matches our document, which may seem surprising, but it is what we asked for in the mapping.

properties

Type mappings, object fields and nested fields contain sub-fields, called properties. These properties may be of any datatype, including object and nested. Properties can be added:

  • explicitly by defining them when creating an index.

  • explicitly by defining them when adding or updating a mapping type with the PUT mapping API.

  • dynamically just by indexing documents containing new fields.

Below is an example of adding properties to a mapping type, an object field, and a nested field:

PUT my_index
{
  "mappings": {
    "_doc": { (1)
      "properties": {
        "manager": { (2)
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        },
        "employees": { (3)
          "type": "nested",
          "properties": {
            "age":  { "type": "integer" },
            "name": { "type": "text"  }
          }
        }
      }
    }
  }
}

PUT my_index/_doc/1 (4)
{
  "region": "US",
  "manager": {
    "name": "Alice White",
    "age": 30
  },
  "employees": [
    {
      "name": "John Smith",
      "age": 34
    },
    {
      "name": "Peter Brown",
      "age": 26
    }
  ]
}
  1. Properties under the _doc mapping type.

  2. Properties under the manager object field.

  3. Properties under the employees nested field.

  4. An example document which corresponds to the above mapping.

Tip
The properties setting is allowed to have different settings for fields of the same name in the same index. New properties can be added to existing fields using the PUT mapping API.

Dot notation

Inner fields can be referred to in queries, aggregations, etc., using dot notation:

GET my_index/_search
{
  "query": {
    "match": {
      "manager.name": "Alice White"
    }
  },
  "aggs": {
    "Employees": {
      "nested": {
        "path": "employees"
      },
      "aggs": {
        "Employee Ages": {
          "histogram": {
            "field": "employees.age",
            "interval": 5
          }
        }
      }
    }
  }
}
Important
The full path to the inner field must be specified.

search_analyzer

Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index.

Sometimes, though, it can make sense to use a different analyzer at search time, such as when using the edge_ngram tokenizer for autocomplete.

By default, queries will use the analyzer defined in the field mapping, but this can be overridden with the search_analyzer setting:

PUT my_index
{
  "settings": {
    "analysis": {
      "filter": {
        "autocomplete_filter": {
          "type": "edge_ngram",
          "min_gram": 1,
          "max_gram": 20
        }
      },
      "analyzer": {
        "autocomplete": { (1)
          "type": "custom",
          "tokenizer": "standard",
          "filter": [
            "lowercase",
            "autocomplete_filter"
          ]
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type": "text",
          "analyzer": "autocomplete", (2)
          "search_analyzer": "standard" (2)
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text": "Quick Brown Fox" (3)
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": {
        "query": "Quick Br", (4)
        "operator": "and"
      }
    }
  }
}
  1. Analysis settings to define the custom autocomplete analyzer.

  2. The text field uses the autocomplete analyzer at index time, but the standard analyzer at search time.

  3. This field is indexed as the terms: [ q, qu, qui, quic, quick, b, br, bro, brow, brown, f, fo, fox ]

  4. The query searches for both of these terms: [ quick, br ]

See {defguide}/_index_time_search_as_you_type.html[Index time search-as-you-type] for a full explanation of this example.

Tip
The search_analyzer setting can be updated on existing fields using the PUT mapping API.
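A minimal sketch of such an update, restating the field definition from the example above and switching the search-time analyzer to the built-in simple analyzer (the choice of simple is purely illustrative):

PUT my_index/_mapping/_doc
{
  "properties": {
    "text": {
      "type": "text",
      "analyzer": "autocomplete",
      "search_analyzer": "simple"
    }
  }
}

Note that the index-time analyzer itself cannot be changed on an existing field, so it is repeated here unchanged.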

similarity

Elasticsearch allows you to configure a scoring algorithm or similarity per field. The similarity setting provides a simple way of choosing a similarity algorithm other than the default BM25, such as TF/IDF.

Similarities are mostly useful for text fields, but can also apply to other field types.

Custom similarities can be configured by tuning the parameters of the built-in similarities. For more details about these expert options, see the similarity module.
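As a sketch of what such tuning might look like, a custom BM25-based similarity could be declared in the index settings and then referenced from a field mapping (the my_bm25 name and the b parameter value are purely illustrative; see the similarity module for the full list of tunable parameters):

PUT my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": {
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "similarity": "my_bm25"
        }
      }
    }
  }
}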

The only similarities which can be used out of the box, without any further configuration, are:

BM25

The Okapi BM25 algorithm. The algorithm used by default in Elasticsearch and Lucene. See {defguide}/pluggable-similarites.html[Pluggable Similarity Algorithms] for more information.

classic

The TF/IDF algorithm which used to be the default in Elasticsearch and Lucene. See {defguide}/practical-scoring-function.html[Lucene’s Practical Scoring Function] for more information.

boolean

A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.

The similarity can be set on the field level when a field is first created, as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "default_field": { (1)
          "type": "text"
        },
        "boolean_sim_field": {
          "type": "text",
          "similarity": "boolean" (2)
        }
      }
    }
  }
}
  1. The default_field uses the BM25 similarity.

  2. The boolean_sim_field uses the boolean similarity.

store

By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.

Usually this doesn’t matter. The field value is already part of the _source field, which is stored by default. If you only want to retrieve the value of a single field or of a few fields, instead of the whole _source, then this can be achieved with source filtering.
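For example, a minimal sketch of source filtering that returns only two fields from _source:

GET my_index/_search
{
  "_source": [ "title", "date" ]
}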

In certain situations it can make sense to store a field. For instance, if you have a document with a title, a date, and a very large content field, you may want to retrieve just the title and the date without having to extract those fields from a large _source field:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "title": {
          "type": "text",
          "store": true (1)
        },
        "date": {
          "type": "date",
          "store": true (1)
        },
        "content": {
          "type": "text"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "title":   "Some short title",
  "date":    "2015-01-01",
  "content": "A very long content field..."
}

GET my_index/_search
{
  "stored_fields": [ "title", "date" ] (2)
}
  1. The title and date fields are stored.

  2. This request will retrieve the values of the title and date fields.

Note
Stored fields returned as arrays

For consistency, stored fields are always returned as an array because there is no way of knowing if the original field value was a single value, multiple values, or an empty array.

If you need the original value, you should retrieve it from the _source field instead.

Another situation where it can make sense to make a field stored is for those that don’t appear in the _source field (such as copy_to fields).

term_vector

Term vectors contain information about the terms produced by the analysis process, including:

  • a list of terms.

  • the position (or order) of each term.

  • the start and end character offsets mapping the term to its origin in the original string.

  • payloads (if they are available) — user-defined binary data associated with each term position.

These term vectors can be stored so that they can be retrieved for a particular document.

The term_vector setting accepts:

no

No term vectors are stored. (default)

yes

Just the terms in the field are stored.

with_positions

Terms and positions are stored.

with_offsets

Terms and character offsets are stored.

with_positions_offsets

Terms, positions, and character offsets are stored.

with_positions_payloads

Terms, positions, and payloads are stored.

with_positions_offsets_payloads

Terms, positions, offsets and payloads are stored.

The fast vector highlighter requires with_positions_offsets. The term vectors API can retrieve whatever is stored.

Warning
Setting with_positions_offsets will double the size of a field’s index.
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "text": {
          "type":        "text",
          "term_vector": "with_positions_offsets"
        }
      }
    }
  }
}

PUT my_index/_doc/1
{
  "text": "Quick brown fox"
}

GET my_index/_search
{
  "query": {
    "match": {
      "text": "brown fox"
    }
  },
  "highlight": {
    "fields": {
      "text": {} (1)
    }
  }
}
  1. The fast vector highlighter will be used by default for the text field because term vectors are enabled.
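The term vectors stored for the document above could then be retrieved with the term vectors API, for example (a minimal sketch; the fields body parameter limits the response to the text field):

GET my_index/_doc/1/_termvectors
{
  "fields": [ "text" ]
}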

Dynamic Mapping

One of the most important features of Elasticsearch is that it tries to get out of your way and let you start exploring your data as quickly as possible. To index a document, you don’t have to first create an index, define a mapping type, and define your fields — you can just index a document and the index, type, and fields will spring to life automatically:

PUT data/_doc/1 (1)
{ "count": 5 }
  1. Creates the data index, the _doc mapping type, and a field called count with datatype long.

The automatic detection and addition of new fields is called dynamic mapping. The dynamic mapping rules can be customised to suit your purposes with:

Dynamic field mappings

The rules governing dynamic field detection.

Dynamic templates

Custom rules to configure the mapping for dynamically added fields.

Tip
Index templates allow you to configure the default mappings, settings and aliases for new indices, whether created automatically or explicitly.

Dynamic field mapping

By default, when a previously unseen field is found in a document, Elasticsearch will add the new field to the type mapping. This behaviour can be disabled, both at the document and at the object level, by setting the dynamic parameter to false (to ignore new fields) or to strict (to throw an exception if an unknown field is encountered).
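For example, a minimal sketch of a mapping that rejects documents containing unmapped fields (the user field is purely illustrative):

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic": "strict",
      "properties": {
        "user": {
          "type": "keyword"
        }
      }
    }
  }
}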

Assuming dynamic field mapping is enabled, some simple rules are used to determine which datatype the field should have:

JSON datatype: resulting Elasticsearch datatype

  • null: No field is added.

  • true or false: boolean field

  • floating point number: float field

  • integer: long field

  • object: object field

  • array: Depends on the first non-null value in the array.

  • string: Either a date field (if the value passes date detection), a double or long field (if the value passes numeric detection), or a text field with a keyword sub-field.

These are the only field datatypes that are dynamically detected. All other datatypes must be mapped explicitly.

Besides the options listed below, dynamic field mapping rules can be further customised with dynamic_templates.

Date detection

If date_detection is enabled (default), then new string fields are checked to see whether their contents match any of the date patterns specified in dynamic_date_formats. If a match is found, a new date field is added with the corresponding format.

The default value for dynamic_date_formats is:

[ "strict_date_optional_time","yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"]

For example:

PUT my_index/_doc/1
{
  "create_date": "2015/09/02"
}

GET my_index/_mapping (1)
  1. The create_date field has been added as a date field with the format:
    "yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z".

Disabling date detection

Dynamic date detection can be disabled by setting date_detection to false:

PUT my_index
{
  "mappings": {
    "_doc": {
      "date_detection": false
    }
  }
}

PUT my_index/_doc/1 (1)
{
  "create": "2015/09/02"
}
  1. The create_date field has been added as a text field.

Customising detected date formats

Alternatively, the dynamic_date_formats can be customised to support your own date formats:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_date_formats": ["MM/dd/yyyy"]
    }
  }
}

PUT my_index/_doc/1
{
  "create_date": "09/25/2015"
}

Numeric detection

While JSON has support for native floating point and integer datatypes, some applications or languages may sometimes render numbers as strings. Usually the correct solution is to map these fields explicitly, but numeric detection (which is disabled by default) can be enabled to do this automatically:

PUT my_index
{
  "mappings": {
    "_doc": {
      "numeric_detection": true
    }
  }
}

PUT my_index/_doc/1
{
  "my_float":   "1.0", (1)
  "my_integer": "1" (2)
}
  1. The my_float field is added as a float field.

  2. The my_integer field is added as a long field.

Dynamic templates

Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:

  • the datatype detected by Elasticsearch, with match_mapping_type.

  • the name of the field, with match and unmatch or match_pattern.

  • the full dotted path to the field, with path_match and path_unmatch.

The original field name {name} and the detected datatype {dynamic_type} template variables can be used in the mapping specification as placeholders.

Important
Dynamic field mappings are only added when a field contains a concrete value — not null or an empty array. This means that if the null_value option is used in a dynamic_template, it will only be applied after the first document with a concrete value for the field has been indexed.

Dynamic templates are specified as an array of named objects:

  "dynamic_templates": [
    {
      "my_template_name": { (1)
        ...  match conditions ... (2)
        "mapping": { ... } (3)
      }
    },
    ...
  ]
  1. The template name can be any string value.

  2. The match conditions can include any of: match_mapping_type, match, match_pattern, unmatch, path_match, path_unmatch.

  3. The mapping that the matched field should use.

Templates are processed in order — the first matching template wins. When putting new dynamic templates through the put mapping API, all existing templates are overwritten. This allows for dynamic templates to be reordered or deleted after they were initially added.
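For example, a hedged sketch of replacing the templates on an existing index through the put mapping API (whatever array is sent replaces the previous set, so any templates you want to keep must be included again):

PUT my_index/_mapping/_doc
{
  "dynamic_templates": [
    {
      "strings_as_keywords": {
        "match_mapping_type": "string",
        "mapping": {
          "type": "keyword"
        }
      }
    }
  ]
}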

match_mapping_type

The match_mapping_type is the datatype detected by the JSON parser. Since JSON has no way to distinguish a long from an integer or a double from a float, it will always choose the wider datatype, i.e. long for integers and double for floating-point numbers.

The following datatypes may be automatically detected:

  • boolean when true or false are encountered.

  • date when date detection is enabled and a string is found that matches any of the configured date formats.

  • double for numbers with a decimal part.

  • long for numbers without a decimal part.

  • object for objects, also called hashes.

  • string for character strings.

* may also be used in order to match all datatypes.

For example, if we wanted to map all integer fields as integer instead of long, and all string fields as both text and keyword, we could use the following template:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "integers": {
            "match_mapping_type": "long",
            "mapping": {
              "type": "integer"
            }
          }
        },
        {
          "strings": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "fields": {
                "raw": {
                  "type":  "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}

PUT my_index/_doc/1
{
  "my_integer": 5, (1)
  "my_string": "Some string" (2)
}
  1. The my_integer field is mapped as an integer.

  2. The my_string field is mapped as a text, with a keyword multi field.

match and unmatch

The match parameter uses a pattern to match on the field name, while unmatch uses a pattern to exclude fields matched by match.

The following example matches all string fields whose name starts with long_ (except for those which end with _text) and maps them as long fields:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "longs_as_strings": {
            "match_mapping_type": "string",
            "match":   "long_*",
            "unmatch": "*_text",
            "mapping": {
              "type": "long"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/_doc/1
{
  "long_num": "5", (1)
  "long_text": "foo" (2)
}
  1. The long_num field is mapped as a long.

  2. The long_text field uses the default string mapping.

match_pattern

The match_pattern parameter adjusts the behavior of the match parameter such that it supports full Java regular expression matching on the field name instead of simple wildcards, for instance:

  "match_pattern": "regex",
  "match": "^profit_\d+$"

path_match and path_unmatch

The path_match and path_unmatch parameters work in the same way as match and unmatch, but operate on the full dotted path to the field, not just the final name, e.g. some_object.*.some_field.

This example copies the values of any fields in the name object to the top-level full_name field, except for the middle field:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "full_name": {
            "path_match":   "name.*",
            "path_unmatch": "*.middle",
            "mapping": {
              "type":       "text",
              "copy_to":    "full_name"
            }
          }
        }
      ]
    }
  }
}

PUT my_index/_doc/1
{
  "name": {
    "first":  "John",
    "middle": "Winston",
    "last":   "Lennon"
  }
}

Note that the path_match and path_unmatch parameters match on object paths in addition to leaf fields. As an example, indexing the following document will result in an error because the path_match setting also matches the object field name.title, which can’t be mapped as text:

PUT my_index/_doc/2
{
  "name": {
    "first":  "Paul",
    "last":   "McCartney",
    "title": {
      "value": "Sir",
      "category": "order of chivalry"
    }
  }
}

{name} and {dynamic_type}

The {name} and {dynamic_type} placeholders are replaced in the mapping with the field name and detected dynamic type. The following example sets all string fields to use an analyzer with the same name as the field, and disables doc_values for all non-string fields:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "named_analyzers": {
            "match_mapping_type": "string",
            "match": "*",
            "mapping": {
              "type": "text",
              "analyzer": "{name}"
            }
          }
        },
        {
          "no_doc_values": {
            "match_mapping_type":"*",
            "mapping": {
              "type": "{dynamic_type}",
              "doc_values": false
            }
          }
        }
      ]
    }
  }
}

PUT my_index/_doc/1
{
  "english": "Some English text", (1)
  "count":   5 (2)
}
  1. The english field is mapped as a string field with the english analyzer.

  2. The count field is mapped as a long field with doc_values disabled.

Template examples

Here are some examples of potentially useful dynamic templates:

Structured search

By default Elasticsearch will map string fields as a text field with a keyword sub-field. However, if you are only indexing structured content and are not interested in full-text search, you can make Elasticsearch map your fields only as keyword fields. Note that this means that in order to search those fields, you will have to search on the exact same value that was indexed.

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "keyword"
            }
          }
        }
      ]
    }
  }
}
text-only mappings for strings

Contrary to the previous example, if the only thing that you care about on your string fields is full-text search, and if you don’t plan on running aggregations, sorting, or exact searches on your string fields, you could tell Elasticsearch to map them only as text fields (which was the default behaviour before 5.0):

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_text": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text"
            }
          }
        }
      ]
    }
  }
}
Disabled norms

Norms are index-time scoring factors. If you do not care about scoring, which would be the case for instance if you never sort documents by score, you could disable the storage of these scoring factors in the index and save some space.

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "strings_as_keywords": {
            "match_mapping_type": "string",
            "mapping": {
              "type": "text",
              "norms": false,
              "fields": {
                "keyword": {
                  "type": "keyword",
                  "ignore_above": 256
                }
              }
            }
          }
        }
      ]
    }
  }
}

The keyword sub-field appears in this template to be consistent with the default rules of dynamic mappings. Of course, if you do not need it because you don’t need to perform exact searches or aggregations on this field, you could remove it as described in the previous section.

Time-series

When doing time series analysis with Elasticsearch, it is common to have many numeric fields that you will often aggregate on but never filter on. In such a case, you could disable indexing on those fields to save disk space and also maybe gain some indexing speed:

PUT my_index
{
  "mappings": {
    "_doc": {
      "dynamic_templates": [
        {
          "unindexed_longs": {
            "match_mapping_type": "long",
            "mapping": {
              "type": "long",
              "index": false
            }
          }
        },
        {
          "unindexed_doubles": {
            "match_mapping_type": "double",
            "mapping": {
              "type": "float", (1)
              "index": false
            }
          }
        }
      ]
    }
  }
}
  1. Like the default dynamic mapping rules, doubles are mapped as floats, which are usually accurate enough, yet require half the disk space.

_default_ mapping

deprecated[6.0.0,See Removal of mapping types]

The default mapping, which will be used as the base mapping for a new mapping type, can be customised by adding a mapping type with the name _default_ to an index, either when creating the index or later on with the PUT mapping API.

The documentation for this feature has been removed as it no longer makes sense in 6.x where there can be only a single type per index.