Removal of mapping types
Important
|
Indices created in Elasticsearch 6.0.0 or later may only contain a single mapping type. Indices created in 5.x with multiple mapping types will continue to function as before in Elasticsearch 6.x. Types will be deprecated in APIs in Elasticsearch 7.0.0, and completely removed in 8.0.0. |
What are mapping types?
Since the first release of Elasticsearch, each document has been stored in a
single index and assigned a single mapping type. A mapping type was used to
represent the type of document or entity being indexed, for instance a
twitter
index might have a user
type and a tweet
type.
Each mapping type could have its own fields, so the user
type might have a
full_name
field, a user_name
field, and an email
field, while the
tweet
type could have a content
field, a tweeted_at
field and, like the
user
type, a user_name
field.
Each document had a _type
meta-field containing the type name, and searches
could be limited to one or more types by specifying the type name(s) in the
URL:
GET twitter/user,tweet/_search
{
"query": {
"match": {
"user_name": "kimchy"
}
}
}
The _type
field was combined with the document’s _id
to generate a _uid
field, so documents of different types with the same _id
could exist in a
single index.
Mapping types were also used to establish a
parent-child relationship
between documents, so documents of type question
could be parents to
documents of type answer
.
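For reference, here is a minimal sketch (not from the original text) of how such a parent-child relationship was declared in 5.x, reusing the question/answer example; the index name qa_index is only illustrative:
PUT qa_index
{
  "mappings": {
    "question": {
      "properties": {
        "title": { "type": "text" }
      }
    },
    "answer": {
      "_parent": {
        "type": "question" (1)
      },
      "properties": {
        "body": { "type": "text" }
      }
    }
  }
}
PUT qa_index/question/1
{
  "title": "Why are mapping types being removed?"
}
PUT qa_index/answer/1?parent=1 (2)
{
  "body": "Because fields with the same name must share a mapping."
}
-
The _parent meta-field declares question as the parent type of answer.
-
Child documents must be indexed with the parent parameter so that they are routed to the same shard as their parent.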
Why are mapping types being removed?
Initially, we spoke about an "index" being similar to a "database" in an
SQL database, and a "type" being equivalent to a "table".
This was a bad analogy that led to incorrect assumptions. In an SQL database, tables are independent of each other. The columns in one table have no bearing on columns with the same name in another table. This is not the case for fields in a mapping type.
In an Elasticsearch index, fields that have the same name in different mapping
types are backed by the same Lucene field internally. In other words, using
the example above, the user_name
field in the user
type is stored in
exactly the same field as the user_name
field in the tweet
type, and both
user_name
fields must have the same mapping (definition) in both types.
This can lead to frustration when, for example, you want deleted
to be a
date
field in one type and a boolean
field in another type in the same
index.
On top of that, storing different entities that have few or no fields in common in the same index leads to sparse data and interferes with Lucene’s ability to compress documents efficiently.
For these reasons, we have decided to remove the concept of mapping types from Elasticsearch.
Alternatives to mapping types
Index per document type
The first alternative is to have an index per document type. Instead of
storing tweets and users in a single twitter
index, you could store tweets
in the tweets
index and users in the user
index. Indices are completely
independent of each other and so there will be no conflict of field types
between indices.
This approach has two benefits:
-
Data is more likely to be dense and so benefit from compression techniques used in Lucene.
-
The term statistics used for scoring in full text search are more likely to be accurate because all documents in the same index represent a single entity.
Each index can be sized appropriately for the number of documents it will
contain: you can use a smaller number of primary shards for users
and a
larger number of primary shards for tweets
.
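As a rough illustration (the shard counts below are arbitrary, not a recommendation), the two indices could be created with different settings:
PUT users
{
  "settings": {
    "number_of_shards": 1
  }
}
PUT tweets
{
  "settings": {
    "number_of_shards": 8
  }
}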
Custom type field
Of course, there is a limit to how many primary shards can exist in a cluster
so you may not want to waste an entire shard for a collection of only a few
thousand documents. In this case, you can implement your own custom type
field which will work in a similar way to the old _type
.
Let’s take the user
/tweet
example above. Originally, the workflow would
have looked something like this:
PUT twitter
{
"mappings": {
"user": {
"properties": {
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" }
}
},
"tweet": {
"properties": {
"content": { "type": "text" },
"user_name": { "type": "keyword" },
"tweeted_at": { "type": "date" }
}
}
}
}
PUT twitter/user/kimchy
{
"name": "Shay Banon",
"user_name": "kimchy",
"email": "shay@kimchy.com"
}
PUT twitter/tweet/1
{
"user_name": "kimchy",
"tweeted_at": "2017-10-24T09:00:00Z",
"content": "Types are going away"
}
GET twitter/tweet/_search
{
"query": {
"match": {
"user_name": "kimchy"
}
}
}
You could achieve the same thing by adding a custom type
field as follows:
PUT twitter
{
"mappings": {
"_doc": {
"properties": {
"type": { "type": "keyword" }, (1)
"name": { "type": "text" },
"user_name": { "type": "keyword" },
"email": { "type": "keyword" },
"content": { "type": "text" },
"tweeted_at": { "type": "date" }
}
}
}
}
PUT twitter/_doc/user-kimchy
{
"type": "user", (1)
"name": "Shay Banon",
"user_name": "kimchy",
"email": "shay@kimchy.com"
}
PUT twitter/_doc/tweet-1
{
"type": "tweet", (1)
"user_name": "kimchy",
"tweeted_at": "2017-10-24T09:00:00Z",
"content": "Types are going away"
}
GET twitter/_search
{
"query": {
"bool": {
"must": {
"match": {
"user_name": "kimchy"
}
},
"filter": {
"match": {
"type": "tweet" (1)
}
}
}
}
}
-
The explicit type field takes the place of the implicit _type field.
Parent/Child without mapping types
Previously, a parent-child relationship was represented by making one mapping
type the parent, and one or more other mapping types the children. Without
types, we can no longer use this syntax. The parent-child feature will
continue to function as before, except that the way of expressing the
relationship between documents has been changed to use the new
join
field.
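The join field is covered in its own documentation; the following is only a brief sketch of what the new syntax looks like, again reusing the question/answer example (the index and field names are illustrative):
PUT qa_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_join_field": {
          "type": "join",
          "relations": {
            "question": "answer" (1)
          }
        }
      }
    }
  }
}
PUT qa_index/_doc/1
{
  "text": "This is a question",
  "my_join_field": "question" (2)
}
PUT qa_index/_doc/2?routing=1 (3)
{
  "text": "This is an answer",
  "my_join_field": {
    "name": "answer",
    "parent": "1"
  }
}
-
Declares question as the parent of answer within the single _doc type.
-
A parent document simply names its relation.
-
A child document must specify its parent's id and be routed to the parent's shard.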
Schedule for removal of mapping types
This is a big change for our users, so we have tried to make it as painless as possible. The change will roll out as follows:
- Elasticsearch 5.6.0
  - Setting index.mapping.single_type: true on an index will enable the single-type-per-index behaviour which will be enforced in 6.0.
  - The join field replacement for parent-child is available on indices created in 5.6.
- Elasticsearch 6.x
  - Indices created in 5.x will continue to function in 6.x as they did in 5.x.
  - Indices created in 6.x only allow a single type per index. Any name can be used for the type, but there can be only one. The preferred type name is _doc, so that index APIs have the same path as they will have in 7.0: PUT {index}/_doc/{id} and POST {index}/_doc.
  - The _type name can no longer be combined with the _id to form the _uid field. The _uid field has become an alias for the _id field.
  - New indices no longer support the old-style parent/child and should use the join field instead.
  - The _default_ mapping type is deprecated.
  - In 6.7, the index creation, index template, and mapping APIs support a query string parameter (include_type_name) which indicates whether requests and responses should include a type name. It defaults to true, and should be set to an explicit value to prepare to upgrade to 7.0. Not setting include_type_name will result in a deprecation warning. Indices which don't have an explicit type will use the dummy type name _doc.
- Elasticsearch 7.x
  - Specifying types in requests is deprecated. For instance, indexing a document no longer requires a document type. The new index APIs are PUT {index}/_doc/{id} in case of explicit ids and POST {index}/_doc for auto-generated ids.
  - The include_type_name parameter in the index creation, index template, and mapping APIs will default to false. Setting the parameter at all will result in a deprecation warning.
  - The _default_ mapping type is removed.
- Elasticsearch 8.x
  - Specifying types in requests is no longer supported.
  - The include_type_name parameter is removed.
Migrating multi-type indices to single-type
The Reindex API can be used to convert multi-type indices to
single-type indices. The following examples can be used in Elasticsearch 5.6
or Elasticsearch 6.x. In 6.x, there is no need to specify
index.mapping.single_type
as that is the default.
Index per document type
This first example splits our twitter
index into a tweets
index and a
users
index:
PUT users
{
"settings": {
"index.mapping.single_type": true
},
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"email": {
"type": "keyword"
}
}
}
}
}
PUT tweets
{
"settings": {
"index.mapping.single_type": true
},
"mappings": {
"_doc": {
"properties": {
"content": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"tweeted_at": {
"type": "date"
}
}
}
}
}
POST _reindex
{
"source": {
"index": "twitter",
"type": "user"
},
"dest": {
"index": "users",
"type": "_doc"
}
}
POST _reindex
{
"source": {
"index": "twitter",
"type": "tweet"
},
"dest": {
"index": "tweets",
"type": "_doc"
}
}
Custom type field
This next example adds a custom type
field and sets it to the value of the
original _type
. It also adds the type to the _id
in case there are any
documents of different types which have conflicting IDs:
PUT new_twitter
{
"mappings": {
"_doc": {
"properties": {
"type": {
"type": "keyword"
},
"name": {
"type": "text"
},
"user_name": {
"type": "keyword"
},
"email": {
"type": "keyword"
},
"content": {
"type": "text"
},
"tweeted_at": {
"type": "date"
}
}
}
}
}
POST _reindex
{
"source": {
"index": "twitter"
},
"dest": {
"index": "new_twitter"
},
"script": {
"source": """
ctx._source.type = ctx._type;
ctx._id = ctx._type + '-' + ctx._id;
ctx._type = '_doc';
"""
}
}
Typeless APIs in 7.0
In Elasticsearch 7.0, each API will support typeless requests, and specifying a type will produce a deprecation warning. Certain typeless APIs are also available in 6.7, to enable a smooth upgrade path to 7.0.
Indices APIs
Index creation, index template, and mapping APIs support a new include_type_name
URL parameter that specifies whether mapping definitions in requests and responses
should contain the type name. The parameter defaults to true
in version 6.7 to
match the pre-7.0 behavior of using type names in mappings. It defaults to false
in version 7.0 and will be removed in version 8.0.
It should be set explicitly in 6.7 to prepare to upgrade to 7.0. To avoid deprecation
warnings in 6.7, the parameter can be set to either true
or false
. In 7.0, setting
include_type_name
at all will result in a deprecation warning.
The following examples show interactions with Elasticsearch with this option set to false:
PUT index?include_type_name=false
{
"mappings": {
"properties": { (1)
"foo": {
"type": "keyword"
}
}
}
}
-
Mappings are included directly under the
mappings
key, without a type name.
PUT index/_mappings?include_type_name=false
{
"properties": { (1)
"bar": {
"type": "text"
}
}
}
-
Mappings are included directly under the
mappings
key, without a type name.
GET index/_mappings?include_type_name=false
The above call returns
{
"index": {
"mappings": {
"properties": { (1)
"foo": {
"type": "keyword"
},
"bar": {
"type": "text"
}
}
}
}
}
-
Mappings are included directly under the
mappings
key, without a type name.
Index templates
It is recommended to make index templates typeless by re-adding them with
include_type_name
set to false
. Under the hood, typeless templates will use
the dummy type _doc
when creating indices.
In case typeless templates are used with typed index creation calls or typed
templates are used with typeless index creation calls, the template will still
be applied but the index creation call decides whether there should be a type
or not. For instance in the below example, index-1-01
will have a type in
spite of the fact that it matches a template that is typeless, and index-2-01
will be typeless in spite of the fact that it matches a template that defines
a type. Both index-1-01
and index-2-01
will inherit the foo
field from
the template that they match.
PUT _template/template1?include_type_name=false
{
"index_patterns":[ "index-1-*" ],
"mappings": {
"properties": {
"foo": {
"type": "keyword"
}
}
}
}
PUT _template/template2?include_type_name=true
{
"index_patterns":[ "index-2-*" ],
"mappings": {
"type": {
"properties": {
"foo": {
"type": "keyword"
}
}
}
}
}
PUT index-1-01?include_type_name=true
{
"mappings": {
"type": {
"properties": {
"bar": {
"type": "long"
}
}
}
}
}
PUT index-2-01?include_type_name=false
{
"mappings": {
"properties": {
"bar": {
"type": "long"
}
}
}
}
In the case of implicit index creation, when documents are indexed into an index that doesn't exist yet, the template is always honored. This is usually not a problem because typeless index calls work on typed indices.
Mixed-version clusters
In a cluster composed of both 6.7 and 7.0 nodes, the parameter
include_type_name
should be specified in indices APIs like index
creation. This is because the parameter has a different default between
6.7 and 7.0, so the same mapping definition will not be valid for both
node versions.
Typeless document APIs such as bulk
and update
are only available as of
7.0, and will not work with 6.7 nodes. This also holds true for the typeless
versions of queries that perform document lookups, such as terms
.
Field datatypes
Elasticsearch supports a number of different datatypes for the fields in a document:
Core datatypes
- string: text and keyword
- Numeric datatypes: long, integer, short, byte, double, float, half_float, scaled_float
- Date datatype: date
- Boolean datatype: boolean
- Binary datatype: binary
- Range datatypes: integer_range, float_range, long_range, double_range, date_range, ip_range

Complex datatypes
- Object datatype: object for single JSON objects
- Nested datatype: nested for arrays of JSON objects

Geo datatypes
- Geo-point datatype: geo_point for lat/lon points
- Geo-Shape datatype: geo_shape for complex shapes like polygons

Specialised datatypes
- IP datatype: ip for IPv4 and IPv6 addresses
- Completion datatype: completion to provide auto-complete suggestions
- Token count datatype: token_count to count the number of tokens in a string
- mapper-murmur3 plugin: murmur3 to compute hashes of values at index-time and store them in the index
- mapper-annotated-text plugin: annotated-text to index text containing special markup (typically used for identifying named entities)
- Percolator type: accepts queries from the query-dsl
- join datatype: defines a parent/child relation for documents within the same index
- Alias datatype: defines an alias to an existing field
Arrays
In {es}, arrays do not require a dedicated field datatype. Any field can contain zero or more values by default, however, all values in the array must be of the same datatype. See Arrays.
Multi-fields
It is often useful to index the same field in different ways for different
purposes. For instance, a string
field could be mapped as
a text
field for full-text search, and as a keyword
field for
sorting or aggregations. Alternatively, you could index a text field with
the standard
analyzer, the
english
analyzer, and the
french
analyzer.
This is the purpose of multi-fields. Most datatypes support multi-fields
via the fields
parameter.
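As a brief sketch of the fields parameter (the index and field names below are illustrative, not from the original text), a single string value can be indexed both as text and as keyword:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text", (1)
          "fields": {
            "raw": {
              "type": "keyword" (2)
            }
          }
        }
      }
    }
  }
}
-
The city field is analyzed and can be used for full-text search.
-
The city.raw multi-field can be used for sorting and aggregations.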
Alias datatype
Note
|
Field aliases can only be specified on indexes with a single mapping type. To add a field
alias, the index must therefore have been created in 6.0 or later, or be an older index with
the setting index.mapping.single_type: true . Please see Removal of mapping types for more information.
|
An alias
mapping defines an alternate name for a field in the index.
The alias can be used in place of the target field in search requests,
and selected other APIs like field capabilities.
PUT trips
{
"mappings": {
"_doc": {
"properties": {
"distance": {
"type": "long"
},
"route_length_miles": {
"type": "alias",
"path": "distance" (1)
},
"transit_mode": {
"type": "keyword"
}
}
}
}
}
GET _search
{
"query": {
"range" : {
"route_length_miles" : {
"gte" : 39
}
}
}
}
-
The path to the target field. Note that this must be the full path, including any parent objects (e.g.
object1.object2.field
).
Almost all components of the search request accept field aliases. In particular, aliases can be
used in queries, aggregations, and sort fields, as well as when requesting docvalue_fields
,
stored_fields
, suggestions, and highlights. Scripts also support aliases when accessing
field values. Please see the section on unsupported APIs for exceptions.
In some parts of the search request and when requesting field capabilities, field wildcard patterns can be provided. In these cases, the wildcard pattern will match field aliases in addition to concrete fields:
GET trips/_field_caps?fields=route_*,transit_mode
Alias targets
There are a few restrictions on the target of an alias:
-
The target must be a concrete field, and not an object or another field alias.
-
The target field must exist at the time the alias is created.
-
If nested objects are defined, a field alias must have the same nested scope as its target.
Additionally, a field alias can only have one target. This means that it is not possible to use a field alias to query over multiple target fields in a single clause.
An alias can be changed to refer to a new target through a mappings update. A known limitation is that if any stored percolator queries contain the field alias, they will still refer to its original target. More information can be found in the percolator documentation.
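A sketch of such a mappings update, assuming a hypothetical new target field distance_in_miles already exists in the trips index:
PUT trips/_mapping/_doc
{
  "properties": {
    "route_length_miles": {
      "type": "alias",
      "path": "distance_in_miles" (1)
    }
  }
}
-
The alias now points at the new target field; as noted above, stored percolator queries that use the alias keep referring to the original target.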
Unsupported APIs
Writes to field aliases are not supported: attempting to use an alias in an index or update request
will result in a failure. Likewise, aliases cannot be used as the target of copy_to
or in multi-fields.
Because alias names are not present in the document source, aliases cannot be used when performing
source filtering. For example, the following request will return an empty result for _source
:
GET /_search
{
"query" : {
"match_all": {}
},
"_source": "route_length_miles"
}
Currently only the search and field capabilities APIs will accept and resolve field aliases. Other APIs that accept field names, such as term vectors, cannot be used with field aliases.
Finally, some queries, such as terms
, geo_shape
, and more_like_this
, allow for fetching query
information from an indexed document. Because field aliases aren’t supported when fetching documents,
the part of the query that specifies the lookup path cannot refer to a field by its alias.
Arrays
In Elasticsearch, there is no dedicated array
datatype. Any field can contain
zero or more values by default, however, all values in the array must be of the
same datatype. For instance:
-
an array of strings: [
"one"
,"two"
] -
an array of integers: [
1
,2
] -
an array of arrays: [
1
, [2
,3
]] which is the equivalent of [1
,2
,3
] -
an array of objects: [
{ "name": "Mary", "age": 12 }
,{ "name": "John", "age": 10 }
]
Note
|
Arrays of objects
Arrays of objects do not work as you would expect: you cannot query each
object independently of the other objects in the array. If you need to be
able to do this then you should use the nested datatype instead of the object datatype. This is explained in more detail in Nested datatype. |
When adding a field dynamically, the first value in the array determines the
field type
. All subsequent values must be of the same datatype or it must
at least be possible to coerce subsequent values to the same
datatype.
Arrays with a mixture of datatypes are not supported: [ 10
, "some string"
]
An array may contain null
values, which are either replaced by the
configured null_value
or skipped entirely. An empty array
[]
is treated as a missing field — a field with no values.
Nothing needs to be pre-configured in order to use arrays in documents; they are supported out of the box:
PUT my_index/_doc/1
{
"message": "some arrays in this document...",
"tags": [ "elasticsearch", "wow" ], (1)
"lists": [ (2)
{
"name": "prog_list",
"description": "programming list"
},
{
"name": "cool_list",
"description": "cool stuff list"
}
]
}
PUT my_index/_doc/2 (3)
{
"message": "no arrays in this document...",
"tags": "elasticsearch",
"lists": {
"name": "prog_list",
"description": "programming list"
}
}
GET my_index/_search
{
"query": {
"match": {
"tags": "elasticsearch" (4)
}
}
}
-
The
tags
field is dynamically added as astring
field. -
The
lists
field is dynamically added as anobject
field. -
The second document contains no arrays, but can be indexed into the same fields.
-
The query looks for
elasticsearch
in thetags
field, and matches both documents.
Binary datatype
The binary
type accepts a binary value as a
Base64 encoded string. The field is not
stored by default and is not searchable:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"name": {
"type": "text"
},
"blob": {
"type": "binary"
}
}
}
}
}
PUT my_index/_doc/1
{
"name": "Some binary blob",
"blob": "U29tZSBiaW5hcnkgYmxvYg==" (1)
}
-
The Base64 encoded binary value must not have embedded newlines
\n
.
Parameters for binary
fields
The following parameters are accepted by binary
fields:
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
store
|
Whether the field value should be stored and retrievable separately from
the |
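For example, here is a sketch of a binary field that enables both parameters (the index name my_binary_index is illustrative):
PUT my_binary_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "blob": {
          "type": "binary",
          "doc_values": true, (1)
          "store": true (2)
        }
      }
    }
  }
}
-
Enables doc values so the value is stored on disk in a column-stride fashion and is accessible from scripts.
-
Stores the value so it can be retrieved separately from _source, for example via stored_fields.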
Range datatypes
The following range types are supported:
integer_range
|
A range of signed 32-bit integers with a minimum value of -2^31^ and maximum of 2^31^-1. |
float_range
|
A range of single-precision 32-bit IEEE 754 floating point values. |
long_range
|
A range of signed 64-bit integers with a minimum value of -2^63^ and maximum of 2^63^-1. |
double_range
|
A range of double-precision 64-bit IEEE 754 floating point values. |
date_range
|
A range of date values represented as unsigned 64-bit integer milliseconds elapsed since system epoch. |
ip_range
|
A range of ip values supporting either IPv4 or IPv6 (or mixed) addresses. |
Below is an example of configuring a mapping with various range fields followed by an example that indexes several range types.
PUT range_index
{
"settings": {
"number_of_shards": 2
},
"mappings": {
"_doc": {
"properties": {
"expected_attendees": {
"type": "integer_range"
},
"time_frame": {
"type": "date_range", (1)
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
}
PUT range_index/_doc/1?refresh
{
"expected_attendees" : { (2)
"gte" : 10,
"lte" : 20
},
"time_frame" : { (3)
"gte" : "2015-10-31 12:00:00", (4)
"lte" : "2015-11-01"
}
}
-
date_range
types accept the same field parameters defined by thedate
type. -
Example indexing a meeting with 10 to 20 attendees.
-
Date ranges accept the same format as described in date range queries.
-
Example date range using date time stamp. This also accepts date math formatting. Note that "now" cannot be used at indexing time.
The following is an example of a term query on the integer_range
field named "expected_attendees".
GET range_index/_search
{
"query" : {
"term" : {
"expected_attendees" : {
"value": 12
}
}
}
}
The result produced by the above query.
{
"took": 13,
"timed_out": false,
"_shards" : {
"total": 2,
"successful": 2,
"skipped" : 0,
"failed": 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "range_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"expected_attendees" : {
"gte" : 10, "lte" : 20
},
"time_frame" : {
"gte" : "2015-10-31 12:00:00", "lte" : "2015-11-01"
}
}
}
]
}
}
The following is an example of a date_range
query over the date_range
field named "time_frame".
GET range_index/_search
{
"query" : {
"range" : {
"time_frame" : { (1)
"gte" : "2015-10-31",
"lte" : "2015-11-01",
"relation" : "within" (2)
}
}
}
}
-
Range queries work the same as described in range query.
-
Range queries over range fields support a
relation
parameter which can be one ofWITHIN
,CONTAINS
,INTERSECTS
(default).
This query produces a similar result:
{
"took": 13,
"timed_out": false,
"_shards" : {
"total": 2,
"successful": 2,
"skipped" : 0,
"failed": 0
},
"hits" : {
"total" : 1,
"max_score" : 1.0,
"hits" : [
{
"_index" : "range_index",
"_type" : "_doc",
"_id" : "1",
"_score" : 1.0,
"_source" : {
"expected_attendees" : {
"gte" : 10, "lte" : 20
},
"time_frame" : {
"gte" : "2015-10-31 12:00:00", "lte" : "2015-11-01"
}
}
}
]
}
}
IP Range
In addition to the range format above, IP ranges can be provided in CIDR notation:
PUT range_index/_mapping/_doc
{
"properties": {
"ip_whitelist": {
"type": "ip_range"
}
}
}
PUT range_index/_doc/2
{
"ip_whitelist" : "192.168.0.0/16"
}
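Range fields can then be queried just like the integer_range example above. The following sketch (the address is illustrative) finds whitelists that contain a given IP:
GET range_index/_search
{
  "query" : {
    "term" : {
      "ip_whitelist" : {
        "value" : "192.168.1.1" (1)
      }
    }
  }
}
-
Matches document 2 because the address falls inside the indexed 192.168.0.0/16 range.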
Parameters for range fields
The following parameters are accepted by range types:
coerce
|
Try to convert strings to numbers and truncate fractions for integers.
Accepts |
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
index
|
Should the field be searchable? Accepts |
store
|
Whether the field value should be stored and retrievable separately from
the |
Boolean datatype
Boolean fields accept JSON true
and false
values, but can also accept
strings which are interpreted as either true or false:
False values |
|
True values |
|
For example:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"is_published": {
"type": "boolean"
}
}
}
}
}
POST my_index/_doc/1
{
"is_published": "true" (1)
}
GET my_index/_search
{
"query": {
"term": {
"is_published": true (2)
}
}
}
-
Indexing a document with
"true"
, which is interpreted astrue
. -
Searching for documents with a JSON
true
.
Aggregations like the terms
aggregation use 1
and 0
for the key
, and the strings "true"
and
"false"
for the key_as_string
. Boolean fields when used in scripts,
return 1
and 0
:
POST my_index/_doc/1
{
"is_published": true
}
POST my_index/_doc/2
{
"is_published": false
}
GET my_index/_search
{
"aggs": {
"publish_state": {
"terms": {
"field": "is_published"
}
}
},
"script_fields": {
"is_published": {
"script": {
"lang": "painless",
"source": "doc['is_published'].value"
}
}
}
}
Parameters for boolean
fields
The following parameters are accepted by boolean
fields:
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts any of the true or false values listed above. The value is
substituted for any explicit |
store
|
Whether the field value should be stored and retrievable separately from
the |
Date datatype
JSON doesn’t have a date datatype, so dates in Elasticsearch can either be:
-
strings containing formatted dates, e.g.
"2015-01-01"
or"2015/01/01 12:10:30"
. -
a long number representing milliseconds-since-the-epoch.
-
an integer representing seconds-since-the-epoch.
Internally, dates are converted to UTC (if the time-zone is specified) and stored as a long number representing milliseconds-since-the-epoch.
Queries on dates are internally converted to range queries on this long representation, and the result of aggregations and stored fields is converted back to a string depending on the date format that is associated with the field.
Note
|
Dates will always be rendered as strings, even if they were initially supplied as a long in the JSON document. |
Date formats can be customised, but if no format
is specified then it uses
the default:
"strict_date_optional_time||epoch_millis"
This means that it will accept dates with optional timestamps, which conform
to the formats supported by strict_date_optional_time
or milliseconds-since-the-epoch.
For instance:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"date": {
"type": "date" (1)
}
}
}
}
}
PUT my_index/_doc/1
{ "date": "2015-01-01" } (2)
PUT my_index/_doc/2
{ "date": "2015-01-01T12:10:30Z" } (3)
PUT my_index/_doc/3
{ "date": 1420070400001 } (4)
GET my_index/_search
{
"sort": { "date": "asc"} (5)
}
-
The
date
field uses the defaultformat
. -
This document uses a plain date.
-
This document includes a time.
-
This document uses milliseconds-since-the-epoch.
-
Note that the
sort
values that are returned are all in milliseconds-since-the-epoch.
Multiple date formats
Multiple formats can be specified by separating them with ||.
Each format will be tried in turn until a matching format is found. The first
format will be used to convert the milliseconds-since-the-epoch value back
into a string.
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd HH:mm:ss||yyyy-MM-dd||epoch_millis"
}
}
}
}
}
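With that mapping, documents using any of the three formats would be accepted, for example (a minimal sketch):
PUT my_index/_doc/1
{ "date": "2015-01-01 12:10:30" } (1)
PUT my_index/_doc/2
{ "date": "2015-01-01" } (2)
PUT my_index/_doc/3
{ "date": 1420070400000 } (3)
-
Matches the yyyy-MM-dd HH:mm:ss format.
-
Matches the yyyy-MM-dd format.
-
Matches epoch_millis.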
Parameters for date
fields
The following parameters are accepted by date
fields:
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
format
|
The date format(s) that can be parsed. Defaults to
|
locale
|
The locale to use when parsing dates since months do not have the same names
and/or abbreviations in all languages. The default is the
|
ignore_malformed
|
If |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts a date value in one of the configured formats
which is substituted for any explicit |
store
|
Whether the field value should be stored and retrievable separately from
the |
Geo-point datatype
Fields of type geo_point
accept latitude-longitude pairs, which can be used:
-
to find geo-points within a bounding box, within a certain distance of a central point, or within a polygon.
-
to aggregate documents geographically or by distance from a central point.
-
to integrate distance into a document’s relevance score.
-
to sort documents by distance.
There are four ways that a geo-point may be specified, as demonstrated below:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Geo-point as an object",
"location": { (1)
"lat": 41.12,
"lon": -71.34
}
}
PUT my_index/_doc/2
{
"text": "Geo-point as a string",
"location": "41.12,-71.34" (2)
}
PUT my_index/_doc/3
{
"text": "Geo-point as a geohash",
"location": "drm3btev3e86" (3)
}
PUT my_index/_doc/4
{
"text": "Geo-point as an array",
"location": [ -71.34, 41.12 ] (4)
}
GET my_index/_search
{
"query": {
"geo_bounding_box": { (5)
"location": {
"top_left": {
"lat": 42,
"lon": -72
},
"bottom_right": {
"lat": 40,
"lon": -74
}
}
}
}
}
-
Geo-point expressed as an object, with
lat
andlon
keys. -
Geo-point expressed as a string with the format:
"lat,lon"
. -
Geo-point expressed as a geohash.
-
Geo-point expressed as an array with the format: [
lon
,lat
] -
A geo-bounding box query which finds all geo-points that fall inside the box.
Important
|
Geo-points expressed as an array or string
Please note that string geo-points are ordered as lat,lon, while array geo-points are ordered as the reverse: lon,lat. Originally, lat,lon was used for both formats, but the array format was changed early on to conform to the format used by GeoJSON. |
Note
|
A point can be expressed as a geohash. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. Each character in a geohash adds an additional 5 bits to the precision. So the longer the hash, the more precise it is. For indexing purposes, geohashes are translated into latitude-longitude pairs. During this process only the first 12 characters are used, so specifying more than 12 characters in a geohash doesn’t increase the precision. The 12 characters provide 60 bits, which should reduce a possible error to less than 2cm. |
Parameters for geo_point
fields
The following parameters are accepted by geo_point
fields:
ignore_malformed
|
If |
ignore_z_value
|
If |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts an geopoint value which is substituted for any explicit |
Using geo-points in scripts
When accessing the value of a geo-point in a script, the value is returned as
a GeoPoint
object, which allows access to the .lat
and .lon
values
respectively:
def geopoint = doc['location'].value;
def lat = geopoint.lat;
def lon = geopoint.lon;
For performance reasons, it is better to access the lat/lon values directly:
def lat = doc['location'].lat;
def lon = doc['location'].lon;
Geo-Shape datatype
The geo_shape
datatype facilitates the indexing of and searching
with arbitrary geo shapes such as rectangles and polygons. It should be
used when either the data being indexed or the queries being executed
contain shapes other than just points.
You can query documents of this type using the geo_shape query.
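As a brief sketch of such a query (the index and field names follow the example mapping later in this section; the envelope coordinates are illustrative):
GET /example/_search
{
  "query": {
    "geo_shape": {
      "location": {
        "shape": {
          "type": "envelope",
          "coordinates": [ [-80.0, 40.0], [-70.0, 30.0] ] (1)
        },
        "relation": "within" (2)
      }
    }
  }
}
-
An envelope given as [top-left, bottom-right] coordinates.
-
Finds documents whose location shape falls entirely within the envelope; the default relation is intersects.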
Mapping Options
The geo_shape mapping maps geo_json geometry objects to the geo_shape type. To enable it, users must explicitly map fields to the geo_shape type.
Option | Description | Default |
---|---|---|
|
deprecated[6.6, PrefixTrees no longer used] Name of the PrefixTree
implementation to be used: |
|
|
deprecated[6.6, PrefixTrees no longer used] This parameter may
be used instead of |
|
|
deprecated[6.6, PrefixTrees no longer used] Maximum number
of layers to be used by the PrefixTree. This can be used to control the
precision of shape representations and therefore how many terms are
indexed. Defaults to the default value of the chosen PrefixTree
implementation. Since this parameter requires a certain level of
understanding of the underlying implementation, users may use the
|
various |
|
deprecated[6.6, PrefixTrees no longer used] The strategy
parameter defines the approach for how to represent shapes at indexing
and search time. It also influences the capabilities available so it
is recommended to let Elasticsearch set this parameter automatically.
There are two strategies available: |
|
|
deprecated[6.6, PrefixTrees no longer used] Used as a
hint to the PrefixTree about how precise it should be. Defaults to 0.025 (2.5%)
with 0.5 as the maximum supported value. PERFORMANCE NOTE: This value will
default to 0 if a |
|
|
Optionally define how to interpret vertex order for
polygons / multipolygons. This parameter defines one of two coordinate
system rules (Right-hand or Left-hand) each of which can be specified in three
different ways. 1. Right-hand rule: |
|
|
deprecated[6.6, PrefixTrees no longer used] Setting this option to
|
|
|
If true, malformed GeoJSON or WKT shapes are ignored. If false (default), malformed GeoJSON and WKT shapes throw an exception and reject the entire document. |
|
|
If |
|
|
If |
|
Indexing approach
GeoShape types are indexed by decomposing the shape into a triangular mesh and
indexing each triangle as a 7 dimension point in a BKD tree. This provides
near perfect spatial resolution (down to 1e-7 decimal degree precision) since all
spatial relations are computed using an encoded vector representation of the
original shape instead of a raster-grid representation as used by the
Prefix trees indexing approach. Performance of the tessellator primarily
depends on the number of vertices that define the polygon/multi-polygon. While
this is the default indexing technique prefix trees can still be used by setting
the tree
or strategy
parameters according to the appropriate
Mapping Options. Note that these parameters are now deprecated
and will be removed in a future version.
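If the deprecated approach is really needed (for example for the CONTAINS relation mentioned below), a hedged sketch of opting into it could look like this (the index name and precision value are illustrative):
PUT /example_legacy
{
  "mappings": {
    "doc": {
      "properties": {
        "location": {
          "type": "geo_shape",
          "tree": "quadtree", (1)
          "strategy": "recursive", (2)
          "precision": "1m"
        }
      }
    }
  }
}
-
Selects the deprecated QuadPrefixTree implementation instead of the default BKD-backed approach.
-
Uses the recursive SpatialStrategy; precision controls how finely shapes are rasterized.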
IMPORTANT NOTES
The following features are not yet supported with the new indexing approach:
-
geo_shape
query withMultiPoint
geometry types - Elasticsearch currently prevents searching geo_shape fields with a MultiPoint geometry type to avoid a brute force linear search over each individual point. For now, if this is absolutely needed, this can be achieved using abool
query with each individual point. -
CONTAINS
relation query - when using the new default vector indexing strategy,geo_shape
queries withrelation
defined ascontains
are not yet supported. If this query relation is an absolute necessity, it is recommended to setstrategy
toquadtree
and use the deprecated PrefixTree strategy indexing approach.
Prefix trees
deprecated[6.6, PrefixTrees no longer used] To efficiently represent shapes in an inverted index, shapes are converted into a series of hashes representing grid squares (commonly referred to as "rasters") using implementations of a PrefixTree. The tree notion comes from the fact that the PrefixTree uses multiple grid layers, each with an increasing level of precision to represent the Earth. This can be thought of as increasing the level of detail of a map or image at higher zoom levels. Since this approach causes precision issues with indexed shapes, it has been deprecated in favor of a vector indexing approach that indexes the shapes as a triangular mesh (see Indexing approach).
Multiple PrefixTree implementations are provided:
-
GeohashPrefixTree - Uses geohashes for grid squares. Geohashes are base32 encoded strings of the bits of the latitude and longitude interleaved. So the longer the hash, the more precise it is. Each character added to the geohash represents another tree level and adds 5 bits of precision to the geohash. A geohash represents a rectangular area and has 32 sub rectangles. The maximum number of levels in Elasticsearch is 24.
-
QuadPrefixTree - Uses a quadtree for grid squares. Similar to geohashes, quad trees interleave the bits of the latitude and longitude, and the resulting hash is a bit set. A tree level in a quad tree represents 2 bits in this bit set, one for each coordinate. The maximum number of levels for the quad trees in Elasticsearch is 50.
Spatial strategies
deprecated[6.6, PrefixTrees no longer used] The indexing implementation selected relies on a SpatialStrategy for choosing how to decompose the shapes (either as grid squares or a tessellated triangular mesh). Each strategy answers the following:
-
What type of Shapes can be indexed?
-
What types of Query Operations and Shapes can be used?
-
Does it support more than one Shape per field?
The following Strategy implementations (with corresponding capabilities) are provided:
Strategy | Supported Shapes | Supported Queries | Multiple Shapes |
---|---|---|---|
|
|
Yes |
|
|
|
Yes |
Accuracy
Recursive
and Term
strategies do not provide 100% accuracy and depending on
how they are configured it may return some false positives for INTERSECTS
,
WITHIN
and CONTAINS
queries, and some false negatives for DISJOINT
queries.
To mitigate this, it is important to select an appropriate value for the tree_levels
parameter and to adjust expectations accordingly. For example, a point may be near
the border of a particular grid cell and may thus not match a query that only matches
the cell right next to it — even though the shape is very close to the point.
Example
PUT /example
{
"mappings": {
"doc": {
"properties": {
"location": {
"type": "geo_shape"
}
}
}
}
}
This mapping definition maps the location field to the geo_shape type using the default vector implementation. It provides approximately 1e-7 decimal degree precision.
Performance considerations with Prefix Trees
deprecated[6.6, PrefixTrees no longer used] With prefix trees, Elasticsearch uses the paths in the tree as terms in the inverted index and in queries. The higher the level (and thus the precision), the more terms are generated. Of course, calculating the terms, keeping them in memory, and storing them on disk all have a price. Especially with higher tree levels, indices can become extremely large even with a modest amount of data. Additionally, the size of the features also matters. Big, complex polygons can take up a lot of space at higher tree levels. Which setting is right depends on the use case. Generally one trades off accuracy against index size and query performance.
The defaults in Elasticsearch for both implementations are a compromise between index size and a reasonable level of precision of 50m at the equator. This allows for indexing tens of millions of shapes without overly bloating the resulting index too much relative to the input size.
Input Structure
Shapes can be represented using either the GeoJSON or Well-Known Text (WKT) format. The following table provides a mapping of GeoJSON and WKT to Elasticsearch types:
GeoJSON Type | WKT Type | Elasticsearch Type | Description |
---|---|---|---|
|
|
|
A single geographic coordinate. Note: Elasticsearch uses WGS-84 coordinates only. |
|
|
|
An arbitrary line given two or more points. |
|
|
|
A closed polygon whose first and last point
must match, thus requiring |
|
|
|
An array of unconnected, but likely related points. |
|
|
|
An array of separate linestrings. |
|
|
|
An array of separate polygons. |
|
|
|
A GeoJSON shape similar to the
|
|
|
|
A bounding rectangle, or envelope, specified by specifying only the top left and bottom right points. |
|
|
|
A circle specified by a center point and radius with
units, which default to |
Note
|
For all types, both the inner In GeoJSON and WKT, and therefore Elasticsearch, the correct coordinate order is longitude, latitude (X, Y) within coordinate arrays. This differs from many Geospatial APIs (e.g., Google Maps) that generally use the colloquial latitude, longitude (Y, X). |
Point
A point is a single geographic coordinate, such as the location of a building or the current position given by a smartphone’s Geolocation API. The following is an example of a point in GeoJSON.
POST /example/doc
{
"location" : {
"type" : "point",
"coordinates" : [-77.03653, 38.897676]
}
}
The following is an example of a point in WKT:
POST /example/doc
{
"location" : "POINT (-77.03653 38.897676)"
}
LineString
A linestring
is defined by an array of two or more positions. By
specifying only two points, the linestring
will represent a straight
line. Specifying more than two points creates an arbitrary path. The
following is an example of a LineString in GeoJSON.
POST /example/doc
{
"location" : {
"type" : "linestring",
"coordinates" : [[-77.03653, 38.897676], [-77.009051, 38.889939]]
}
}
The following is an example of a LineString in WKT:
POST /example/doc
{
"location" : "LINESTRING (-77.03653 38.897676, -77.009051 38.889939)"
}
The above linestring
would draw a straight line starting at the White
House to the US Capitol Building.
Polygon
A polygon is defined by a list of lists of points. The first and last points in each (outer) list must be the same (the polygon must be closed). The following is an example of a Polygon in GeoJSON.
POST /example/doc
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ]
]
}
}
The following is an example of a Polygon in WKT:
POST /example/doc
{
"location" : "POLYGON ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0))"
}
The first array represents the outer boundary of the polygon, the other arrays represent the interior shapes ("holes"). The following is a GeoJSON example of a polygon with a hole:
POST /example/doc
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0] ],
[ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2] ]
]
}
}
The following is an example of a Polygon with a hole in WKT:
POST /example/doc
{
"location" : "POLYGON ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8, 100.2 0.2))"
}
IMPORTANT NOTE: WKT does not enforce a specific order for vertices, thus ambiguous polygons around the dateline and poles are possible. GeoJSON mandates that the outer polygon must be counterclockwise and interior shapes must be clockwise, which agrees with the Open Geospatial Consortium (OGC) Simple Feature Access specification for vertex ordering.
Elasticsearch accepts both clockwise and counterclockwise polygons if they appear not to cross the dateline (i.e. they cross less than 180° of longitude), but for polygons that do cross the dateline (or for other polygons wider than 180°) Elasticsearch requires the vertex ordering to comply with the OGC and GeoJSON specifications. Otherwise, an unintended polygon may be created and unexpected query/filter results will be returned.
The following provides an example of an ambiguous polygon. Elasticsearch will apply the GeoJSON standard to eliminate ambiguity resulting in a polygon that crosses the dateline.
POST /example/doc
{
"location" : {
"type" : "polygon",
"coordinates" : [
[ [-177.0, 10.0], [176.0, 15.0], [172.0, 0.0], [176.0, -15.0], [-177.0, -10.0], [-177.0, 10.0] ],
[ [178.2, 8.2], [-178.8, 8.2], [-180.8, -8.8], [178.2, 8.8] ]
]
}
}
An orientation
parameter can be defined when setting the geo_shape mapping (see Mapping Options). This will define vertex
order for the coordinate list on the mapped geo_shape field. It can also be overridden on each document. The following is an example for
overriding the orientation on a document:
POST /example/doc
{
"location" : {
"type" : "polygon",
"orientation" : "clockwise",
"coordinates" : [
[ [100.0, 0.0], [100.0, 1.0], [101.0, 1.0], [101.0, 0.0], [100.0, 0.0] ]
]
}
}
MultiPoint
The following is an example of a list of geojson points:
POST /example/doc
{
"location" : {
"type" : "multipoint",
"coordinates" : [
[102.0, 2.0], [103.0, 2.0]
]
}
}
The following is an example of a list of WKT points:
POST /example/doc
{
"location" : "MULTIPOINT (102.0 2.0, 103.0 2.0)"
}
MultiLineString
The following is an example of a list of geojson linestrings:
POST /example/doc
{
"location" : {
"type" : "multilinestring",
"coordinates" : [
[ [102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0] ],
[ [100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0] ],
[ [100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8] ]
]
}
}
The following is an example of a list of WKT linestrings:
POST /example/doc
{
"location" : "MULTILINESTRING ((102.0 2.0, 103.0 2.0, 103.0 3.0, 102.0 3.0), (100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8))"
}
MultiPolygon
The following is an example of a list of geojson polygons (second polygon contains a hole):
POST /example/doc
{
"location" : {
"type" : "multipolygon",
"coordinates" : [
[ [[102.0, 2.0], [103.0, 2.0], [103.0, 3.0], [102.0, 3.0], [102.0, 2.0]] ],
[ [[100.0, 0.0], [101.0, 0.0], [101.0, 1.0], [100.0, 1.0], [100.0, 0.0]],
[[100.2, 0.2], [100.8, 0.2], [100.8, 0.8], [100.2, 0.8], [100.2, 0.2]] ]
]
}
}
The following is an example of a list of WKT polygons (second polygon contains a hole):
POST /example/doc
{
"location" : "MULTIPOLYGON (((102.0 2.0, 103.0 2.0, 103.0 3.0, 102.0 3.0, 102.0 2.0)), ((100.0 0.0, 101.0 0.0, 101.0 1.0, 100.0 1.0, 100.0 0.0), (100.2 0.2, 100.8 0.2, 100.8 0.8, 100.2 0.8, 100.2 0.2)))"
}
Geometry Collection
The following is an example of a collection of geojson geometry objects:
POST /example/doc
{
"location" : {
"type": "geometrycollection",
"geometries": [
{
"type": "point",
"coordinates": [100.0, 0.0]
},
{
"type": "linestring",
"coordinates": [ [101.0, 0.0], [102.0, 1.0] ]
}
]
}
}
The following is an example of a collection of WKT geometry objects:
POST /example/doc
{
"location" : "GEOMETRYCOLLECTION (POINT (100.0 0.0), LINESTRING (101.0 0.0, 102.0 1.0))"
}
Envelope
Elasticsearch supports an envelope
type, which consists of coordinates
for upper left and lower right points of the shape to represent a
bounding rectangle in the format [[minLon, maxLat], [maxLon, minLat]]:
POST /example/doc
{
"location" : {
"type" : "envelope",
"coordinates" : [ [100.0, 1.0], [101.0, 0.0] ]
}
}
The following is an example of an envelope using the WKT BBOX format:
NOTE: WKT specification expects the following order: minLon, maxLon, maxLat, minLat.
POST /example/doc
{
"location" : "BBOX (100.0, 102.0, 2.0, 0.0)"
}
Circle
Elasticsearch supports a circle
type, which consists of a center
point with a radius. Note that this circle representation can only
be indexed when using the recursive
Prefix Tree strategy. For
the default Indexing approach circles should be approximated using
a POLYGON
.
POST /example/doc
{
"location" : {
"type" : "circle",
"coordinates" : [101.0, 1.0],
"radius" : "100m"
}
}
Note: The inner radius field is required. If the units are not specified, the radius will default to METERS.
NOTE: Neither GeoJSON nor WKT supports a point-radius circle type.
Sorting and Retrieving Index Shapes
Due to the complex input structure and index representation of shapes,
it is not currently possible to sort shapes or retrieve their fields
directly. The geo_shape value is only retrievable through the _source
field.
IP datatype
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"ip_addr": {
"type": "ip"
}
}
}
}
}
PUT my_index/_doc/1
{
"ip_addr": "192.168.1.1"
}
GET my_index/_search
{
"query": {
"term": {
"ip_addr": "192.168.0.0/16"
}
}
}
Note
|
You can also store ip ranges in a single field using an ip_range datatype. |
Parameters for ip
fields
The following parameters are accepted by ip
fields:
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts an IPv4 value which is substituted for any explicit |
store
|
Whether the field value should be stored and retrievable separately from
the |
Querying ip
fields
The most common way to query ip addresses is to use the
CIDR
notation: [ip_address]/[prefix_length]
. For instance:
GET my_index/_search
{
"query": {
"term": {
"ip_addr": "192.168.0.0/16"
}
}
}
or
GET my_index/_search
{
"query": {
"term": {
"ip_addr": "2001:db8::/48"
}
}
}
Also beware that colons are special characters to the
query_string
query, so ipv6 addresses will
need to be escaped. The easiest way to do so is to put quotes around the
searched value:
GET my_index/_search
{
"query": {
"query_string" : {
"query": "ip_addr:\"2001:db8::/48\""
}
}
}
Keyword datatype
A field to index structured content such as email addresses, hostnames, status codes, zip codes or tags.
They are typically used for filtering (Find me all blog posts where status is published), for sorting, and for aggregations. Keyword fields are only searchable by their exact value.
If you need to index full text content such as email bodies or product
descriptions, it is likely that you should rather use a text
field.
Below is an example of a mapping for a keyword field:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"tags": {
"type": "keyword"
}
}
}
}
}
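As a quick sketch of how such a field is typically used (the tag values here are illustrative), after indexing a document you can filter on the exact value and aggregate on it:
PUT my_index/_doc/1
{
  "tags": [ "production", "search" ]
}
GET my_index/_search
{
  "query": {
    "term": {
      "tags": "production" (1)
    }
  },
  "aggs": {
    "popular_tags": {
      "terms": {
        "field": "tags" (2)
      }
    }
  }
}
-
term matches the exact keyword value; the value is not analyzed.
-
A terms aggregation counts documents per distinct tag.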
Parameters for keyword fields
The following parameters are accepted by keyword
fields:
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
eager_global_ordinals
|
Should global ordinals be loaded eagerly on refresh? Accepts |
fields
|
Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations. |
ignore_above
|
Do not index any string longer than this value. Defaults to |
index
|
Should the field be searchable? Accepts |
index_options
|
What information should be stored in the index, for scoring purposes.
Defaults to |
norms
|
Whether field-length should be taken into account when scoring queries.
Accepts |
null_value
|
Accepts a string value which is substituted for any explicit |
store
|
Whether the field value should be stored and retrievable separately from
the |
similarity
|
Which scoring algorithm or similarity should be used. Defaults
to |
normalizer
|
How to pre-process the keyword prior to indexing. Defaults to |
split_queries_on_whitespace
|
Whether full text queries should split the input on whitespace
when building a query for this field.
Accepts |
Note
|
Indexes imported from 2.x do not support keyword . Instead they will
attempt to downgrade keyword into string . This allows you to merge modern
mappings with legacy mappings. Long lived indexes will have to be recreated
before upgrading to 6.x but mapping downgrade gives you the opportunity to do
the recreation on your own schedule.
|
Nested datatype
The nested
type is a specialised version of the object
datatype
that allows arrays of objects to be indexed in a way that they can be queried
independently of each other.
How arrays of objects are flattened
Arrays of inner object
fields do not work the way you may expect.
Lucene has no concept of inner objects, so Elasticsearch flattens object
hierarchies into a simple list of field names and values. For instance, the
following document:
PUT my_index/_doc/1
{
"group" : "fans",
"user" : [ (1)
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
-
The
user
field is dynamically added as a field of typeobject
.
would be transformed internally into a document that looks more like this:
{
"group" : "fans",
"user.first" : [ "alice", "john" ],
"user.last" : [ "smith", "white" ]
}
The user.first
and user.last
fields are flattened into multi-value fields,
and the association between alice
and white
is lost. This document would
incorrectly match a query for alice AND smith
:
GET my_index/_search
{
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }}
]
}
}
}
Using nested
fields for arrays of objects
If you need to index arrays of objects and to maintain the independence of
each object in the array, you should use the nested
datatype instead of the
object
datatype. Internally, nested objects index each object in
the array as a separate hidden document, meaning that each nested object can be
queried independently of the others, with the nested
query:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"user": {
"type": "nested" (1)
}
}
}
}
}
PUT my_index/_doc/1
{
"group" : "fans",
"user" : [
{
"first" : "John",
"last" : "Smith"
},
{
"first" : "Alice",
"last" : "White"
}
]
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "Smith" }} (2)
]
}
}
}
}
}
GET my_index/_search
{
"query": {
"nested": {
"path": "user",
"query": {
"bool": {
"must": [
{ "match": { "user.first": "Alice" }},
{ "match": { "user.last": "White" }} (3)
]
}
},
"inner_hits": { (4)
"highlight": {
"fields": {
"user.first": {}
}
}
}
}
}
}
-
The
user
field is mapped as typenested
instead of typeobject
. -
This query doesn’t match because
Alice
andSmith
are not in the same nested object. -
This query matches because
Alice
andWhite
are in the same nested object. -
inner_hits
allow us to highlight the matching nested documents.
Nested documents can be:
-
queried with the
nested
query. -
analyzed with the
nested
andreverse_nested
aggregations. -
sorted with nested sorting.
-
retrieved and highlighted with nested inner hits.
Important
|
Because nested documents are indexed as separate documents, they can only be
accessed within the scope of the For instance, if a string field within a nested document has
|
Parameters for nested
fields
The following parameters are accepted by nested
fields:
dynamic
|
Whether or not new |
properties
|
The fields within the nested object, which can be of any
datatype, including |
Limits on nested
mappings and objects
As described earlier, each nested object is indexed as a separate document under the hood.
Continuing with the example above, if we indexed a single document containing 100 user
objects,
then 101 Lucene documents would be created — one for the parent document, and one for each
nested object. Because of the expense associated with nested
mappings, Elasticsearch puts
the following setting in place to guard against performance problems:
index.mapping.nested_fields.limit
-
The
nested
type should only be used in special cases, when arrays of objects need to be queried independently of each other. To safeguard against poorly designed mappings, this setting limits the number of uniquenested
types per index. In our example, theuser
mapping would count as only 1 towards this limit. Defaults to 50.
Additional background on this setting can be found in Settings to prevent mappings explosion.
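If a mapping legitimately needs more distinct nested fields, the limit can be raised at index creation time; a minimal sketch:
PUT my_index
{
  "settings": {
    "index.mapping.nested_fields.limit": 100 (1)
  },
  "mappings": {
    "_doc": {
      "properties": {
        "user": {
          "type": "nested"
        }
      }
    }
  }
}
-
Raises the per-index limit on unique nested fields from the default of 50 to 100.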
Numeric datatypes
The following numeric types are supported:
long
|
A signed 64-bit integer with a minimum value of -2^63^ and a maximum value of 2^63^-1. |
integer
|
A signed 32-bit integer with a minimum value of -2^31^ and a maximum value of 2^31^-1. |
short
|
A signed 16-bit integer with a minimum value of -32,768 and a maximum value of 32,767. |
byte
|
A signed 8-bit integer with a minimum value of -128 and a maximum value of 127. |
double
|
A double-precision 64-bit IEEE 754 floating point number, restricted to finite values. |
float
|
A single-precision 32-bit IEEE 754 floating point number, restricted to finite values. |
half_float
|
A half-precision 16-bit IEEE 754 floating point number, restricted to finite values. |
scaled_float
|
A finite floating point number that is backed by a |
Below is an example of configuring a mapping with numeric fields:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"number_of_bytes": {
"type": "integer"
},
"time_in_seconds": {
"type": "float"
},
"price": {
"type": "scaled_float",
"scaling_factor": 100
}
}
}
}
}
Note
|
The double , float and half_float types consider that -0.0 and
+0.0 are different values. As a consequence, doing a term query on
-0.0 will not match +0.0 and vice versa. The same is true for range queries:
if the upper bound is -0.0 then +0.0 will not match, and if the lower
bound is +0.0 then -0.0 will not match.
|
Which type should I use?
As far as integer types (byte
, short
, integer
and long
) are concerned,
you should pick the smallest type which is enough for your use-case. This will
help indexing and searching be more efficient. Note however that storage is
optimized based on the actual values that are stored, so picking one type over
another one will have no impact on storage requirements.
For floating-point types, it is often more efficient to store floating-point
data into an integer using a scaling factor, which is what the scaled_float
type does under the hood. For instance, a price
field could be stored in a
scaled_float
with a scaling_factor
of 100. All APIs would work as if
the field was stored as a double, but under the hood Elasticsearch would be
working with the number of cents, price*100, which is an integer. This is
mostly helpful to save disk space since integers are way easier to compress
than floating points. scaled_float
is also a good fit when you want to trade
accuracy for disk space. For instance, imagine that you are tracking CPU
utilization as a number between 0 and 1. It usually does not matter much
whether CPU utilization is 12.7% or 13%, so you could use a scaled_float
with a scaling_factor
of 100 to round CPU utilization to the
closest percent and save space.
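As a rough sketch of that CPU utilization case (the index and field names below are made up for illustration), the mapping could look like:
PUT my_metrics
{
  "mappings": {
    "_doc": {
      "properties": {
        "cpu_utilization": {
          "type": "scaled_float",
          "scaling_factor": 100
        }
      }
    }
  }
}
A value of 0.127 indexed into this field would be stored internally as the integer 13 (0.127 × 100, rounded to the closest long), while queries and aggregations would still see it as a floating point number.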
If scaled_float
is not a good fit, then you should pick the smallest type
that is enough for the use-case among the floating-point types: double
,
float
and half_float
. Here is a table that compares these types in order
to help make a decision.
Type | Minimum value | Maximum value | Significant bits / digits |
---|---|---|---|
double | 2^-1074^ | (2-2^-52^)·2^1023^ | 53 / 15.95 |
float | 2^-149^ | (2-2^-23^)·2^127^ | 24 / 7.22 |
half_float | 2^-24^ | 65504 | 11 / 3.31 |
Parameters for numeric fields
The following parameters are accepted by numeric types:
coerce
|
Try to convert strings to numbers and truncate fractions for integers.
Accepts |
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
ignore_malformed
|
If |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts a numeric value of the same |
store
|
Whether the field value should be stored and retrievable separately from
the |
Parameters for scaled_float
scaled_float
accepts an additional parameter:
scaling_factor
|
The scaling factor to use when encoding values. Values will be multiplied
by this factor at index time and rounded to the closest long value. For
instance, a |
Object datatype
JSON documents are hierarchical in nature: the document may contain inner objects which, in turn, may contain inner objects themselves:
PUT my_index/_doc/1
{ (1)
"region": "US",
"manager": { (2)
"age": 30,
"name": { (3)
"first": "John",
"last": "Smith"
}
}
}
-
The outer document is also a JSON object.
-
It contains an inner object called
manager
. -
Which in turn contains an inner object called
name
.
Internally, this document is indexed as a simple, flat list of key-value pairs, something like this:
{
"region": "US",
"manager.age": 30,
"manager.name.first": "John",
"manager.name.last": "Smith"
}
An explicit mapping for the above document could look like this:
PUT my_index
{
"mappings": {
"_doc": { (1)
"properties": {
"region": {
"type": "keyword"
},
"manager": { (2)
"properties": {
"age": { "type": "integer" },
"name": { (3)
"properties": {
"first": { "type": "text" },
"last": { "type": "text" }
}
}
}
}
}
}
}
}
-
The mapping type is a type of object, and has a
properties
field. -
The
manager
field is an innerobject
field. -
The
manager.name
field is an innerobject
field within themanager
field.
You are not required to set the field type
to object
explicitly, as this is the default value.
Parameters for object
fields
The following parameters are accepted by object
fields:
dynamic
|
Whether or not new |
enabled
|
Whether the JSON value given for the object field should be
parsed and indexed ( |
properties
|
The fields within the object, which can be of any
datatype, including |
Important
|
If you need to index arrays of objects instead of single objects, read Nested datatype first. |
Text datatype
A field to index full-text values, such as the body of an email or the
description of a product. These fields are analyzed
, that is they are passed through an
analyzer to convert the string into a list of individual terms
before being indexed. The analysis process allows Elasticsearch to search for
individual words within each full text field. Text fields are not
used for sorting and seldom used for aggregations (although the
significant text aggregation
is a notable exception).
If you need to index structured content such as email addresses, hostnames, status
codes, or tags, it is likely that you should rather use a keyword
field.
Below is an example of a mapping for a text field:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"full_name": {
"type": "text"
}
}
}
}
}
Use a field as both text and keyword
Sometimes it is useful to have both a full text (text
) and a keyword
(keyword
) version of the same field: one for full text search and the
other for aggregations and sorting. This can be achieved with
multi-fields.
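For instance, a minimal sketch of such a mapping (the city field name is just an example) could pair a text field with a keyword sub-field:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "city": {
          "type": "text",
          "fields": {
            "raw": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
Full text searches would target city, while sorting and aggregations would use city.raw.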
Parameters for text fields
The following parameters are accepted by text
fields:
analyzer
|
The analyzer which should be used for
|
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
eager_global_ordinals
|
Should global ordinals be loaded eagerly on refresh? Accepts |
fielddata
|
Can the field use in-memory fielddata for sorting, aggregations,
or scripting? Accepts |
fielddata_frequency_filter
|
Expert settings which allow to decide which values to load in memory when |
fields
|
Multi-fields allow the same string value to be indexed in multiple ways for different purposes, such as one field for search and a multi-field for sorting and aggregations, or the same string value analyzed by different analyzers. |
index
|
Should the field be searchable? Accepts |
index_options
|
What information should be stored in the index, for search and highlighting purposes.
Defaults to |
index_prefixes
|
If enabled, term prefixes of between 2 and 5 characters are indexed into a separate field. This allows prefix searches to run more efficiently, at the expense of a larger index. |
index_phrases
|
If enabled, two-term word combinations ('shingles') are indexed into a separate
field. This allows exact phrase queries to run more efficiently, at the expense
of a larger index. Note that this works best when stopwords are not removed,
as phrases containing stopwords will not use the subsidiary field and will fall
back to a standard phrase query. Accepts |
norms
|
Whether field-length should be taken into account when scoring queries.
Accepts |
position_increment_gap
|
The number of fake term position which should be inserted between each
element of an array of strings. Defaults to the |
store
|
Whether the field value should be stored and retrievable separately from
the |
search_analyzer
|
The |
search_quote_analyzer
|
The |
similarity
|
Which scoring algorithm or similarity should be used. Defaults
to |
term_vector
|
Whether term vectors should be stored for an |
Token count datatype
A field of type token_count
is really an integer
field which
accepts string values, analyzes them, then indexes the number of tokens in the
string.
For instance:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"name": { (1)
"type": "text",
"fields": {
"length": { (2)
"type": "token_count",
"analyzer": "standard"
}
}
}
}
}
}
}
PUT my_index/_doc/1
{ "name": "John Smith" }
PUT my_index/_doc/2
{ "name": "Rachel Alice Williams" }
GET my_index/_search
{
"query": {
"term": {
"name.length": 3 (3)
}
}
}
-
The
name
field is an analyzed string field which uses the defaultstandard
analyzer. -
The
name.length
field is atoken_count
multi-field which will index the number of tokens in thename
field. -
This query matches only the document containing
Rachel Alice Williams
, as it contains three tokens.
Parameters for token_count
fields
The following parameters are accepted by token_count
fields:
analyzer
|
The analyzer which should be used to analyze the string value. Required. For best performance, use an analyzer without token filters. |
enable_position_increments
|
Indicates if position increments should be counted.
Set to |
boost
|
Mapping field-level query time boosting. Accepts a floating point number, defaults
to |
doc_values
|
Should the field be stored on disk in a column-stride fashion, so that it
can later be used for sorting, aggregations, or scripting? Accepts |
index
|
Should the field be searchable? Accepts |
null_value
|
Accepts a numeric value of the same |
store
|
Whether the field value should be stored and retrievable separately from
the |
Percolator type
The percolator
field type parses a JSON structure into a native query and
stores that query, so that the percolate query
can use it to match provided documents.
Any field that contains a JSON object can be configured to be a percolator
field. The percolator field type has no settings. Just configuring the percolator
field type is sufficient to instruct Elasticsearch to treat a field as a
query.
If the following mapping configures the percolator
field type for the
query
field:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"query": {
"type": "percolator"
},
"field": {
"type": "text"
}
}
}
}
}
Then you can index a query:
PUT my_index/_doc/match_value
{
"query" : {
"match" : {
"field" : "value"
}
}
}
Important
|
Fields referred to in a percolator query must already exist in the mapping
associated with the index used for percolation. In order to make sure these fields exist,
add or update a mapping via the create index or put mapping APIs.
Fields referred to in a percolator query may exist in any type of the index containing the |
Reindexing your percolator queries
Reindexing percolator queries is sometimes required to benefit from improvements made to the percolator
field type in
new releases.
Percolator queries can be reindexed by using the reindex API. Let's take a look at the following index with a percolator field type:
PUT index
{
"mappings": {
"_doc" : {
"properties": {
"query" : {
"type" : "percolator"
},
"body" : {
"type": "text"
}
}
}
}
}
POST _aliases
{
"actions": [
{
"add": {
"index": "index",
"alias": "queries" (1)
}
}
]
}
PUT queries/_doc/1?refresh
{
"query" : {
"match" : {
"body" : "quick brown fox"
}
}
}
-
It is always recommended to define an alias for your index, so that in case of a reindex systems / applications don’t need to be changed to know that the percolator queries are now in a different index.
Let's say you're going to upgrade to a new major version. In order for the new Elasticsearch version to still be able to read your queries, you need to reindex your queries into a new index on the current Elasticsearch version:
PUT new_index
{
"mappings": {
"_doc" : {
"properties": {
"query" : {
"type" : "percolator"
},
"body" : {
"type": "text"
}
}
}
}
}
POST /_reindex?refresh
{
"source": {
"index": "index"
},
"dest": {
"index": "new_index"
}
}
POST _aliases
{
"actions": [ (1)
{
"remove": {
"index" : "index",
"alias": "queries"
}
},
{
"add": {
"index": "new_index",
"alias": "queries"
}
}
]
}
-
If you have an alias don’t forget to point it to the new index.
Executing the percolate
query via the queries
alias:
GET /queries/_search
{
"query": {
"percolate" : {
"field" : "query",
"document" : {
"body" : "fox jumps over the lazy dog"
}
}
}
}
now returns matches from the new index:
{
"took": 3,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "new_index", (1)
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"query": {
"match": {
"body": "quick brown fox"
}
}
},
"fields" : {
"_percolator_document_slot" : [0]
}
}
]
}
}
-
The percolator query hit is now returned from the new index.
Optimizing query time text analysis
When the percolator verifies a candidate match, it parses the stored query, performs query time text analysis, and actually runs
the percolator query against the document being percolated. This is done for each candidate match, every time the percolate
query executes.
If query time text analysis is a relatively expensive part of query parsing, then text analysis can become the
dominating factor in the time spent percolating. This query parsing overhead can become noticeable when the
percolator ends up verifying many candidate percolator query matches.
To avoid the most expensive part of text analysis at percolate time, one can choose to do the expensive part of text analysis
when indexing the percolator query. This requires using two different analyzers. The first analyzer actually performs
the text analysis that needs to be performed (the expensive part). The second analyzer (usually whitespace) just splits the tokens
that the first analyzer has produced. Then, before indexing a percolator query, the analyze API should be used to analyze the query
text with the more expensive analyzer. The result of the analyze API, the tokens, should be used to replace the original query
text in the percolator query. It is important that the query is now configured to override the analyzer from the mapping and
use just the second analyzer. Most text based queries support an analyzer
option (match
, query_string
, simple_query_string
).
Using this approach, the expensive text analysis is performed once instead of many times.
Let's demonstrate this workflow via a simplified example.
Let's say we want to index the following percolator query:
{
"query" : {
"match" : {
"body" : {
"query" : "missing bicycles"
}
}
}
}
with these settings and mapping:
PUT /test_index
{
"settings": {
"analysis": {
"analyzer": {
"my_analyzer" : {
"tokenizer": "standard",
"filter" : ["lowercase", "porter_stem"]
}
}
}
},
"mappings": {
"_doc" : {
"properties": {
"query" : {
"type": "percolator"
},
"body" : {
"type": "text",
"analyzer": "my_analyzer" (1)
}
}
}
}
}
-
For the purpose of this example, this analyzer is considered expensive.
First we need to use the analyze api to perform the text analysis prior to indexing:
POST /test_index/_analyze
{
"analyzer" : "my_analyzer",
"text" : "missing bicycles"
}
This results in the following response:
{
"tokens": [
{
"token": "miss",
"start_offset": 0,
"end_offset": 7,
"type": "<ALPHANUM>",
"position": 0
},
{
"token": "bicycl",
"start_offset": 8,
"end_offset": 16,
"type": "<ALPHANUM>",
"position": 1
}
]
}
All the tokens, in the order returned, need to replace the query text in the percolator query:
PUT /test_index/_doc/1?refresh
{
"query" : {
"match" : {
"body" : {
"query" : "miss bicycl",
"analyzer" : "whitespace" (1)
}
}
}
}
-
It is important to select a whitespace analyzer here, otherwise the analyzer defined in the mapping will be used, which defeats the point of using this workflow. Note that
whitespace
is a built-in analyzer; if a different analyzer needs to be used, it needs to be configured first in the index’s settings.
This analyze-then-index flow should be repeated for each percolator query.
At percolate time nothing changes and the percolate
query can be defined normally:
GET /test_index/_search
{
"query": {
"percolate" : {
"field" : "query",
"document" : {
"body" : "Bycicles are missing"
}
}
}
}
This results in a response like this:
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "test_index",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"query": {
"match": {
"body": {
"query": "miss bicycl",
"analyzer": "whitespace"
}
}
}
},
"fields" : {
"_percolator_document_slot" : [0]
}
}
]
}
}
Optimizing wildcard queries.
Wildcard queries are more expensive than other queries for the percolator, especially if the wildcard expressions are large.
In the case of wildcard
queries with prefix wildcard expressions or just the prefix
query,
the edge_ngram
token filter can be used to replace these queries with a regular term
query on a field where the edge_ngram
token filter is configured.
Creating an index with custom analysis settings:
PUT my_queries1
{
"settings": {
"analysis": {
"analyzer": {
"wildcard_prefix": { (1)
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"wildcard_edge_ngram"
]
}
},
"filter": {
"wildcard_edge_ngram": { (2)
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 32
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"query": {
"type": "percolator"
},
"my_field": {
"type": "text",
"fields": {
"prefix": { (3)
"type": "text",
"analyzer": "wildcard_prefix",
"search_analyzer": "standard"
}
}
}
}
}
}
}
-
The analyzer that generates the prefix tokens to be used at index time only.
-
Increase the
min_gram
and decreasemax_gram
settings based on your prefix search needs. -
This multifield should be used to do the prefix search with a
term
ormatch
query instead of aprefix
orwildcard
query.
Then instead of indexing the following query:
{
"query": {
"wildcard": {
"my_field": "abc*"
}
}
}
this query below should be indexed:
PUT /my_queries1/_doc/1?refresh
{
"query": {
"term": {
"my_field.prefix": "abc"
}
}
}
This way the second query can be handled much more efficiently than the first query.
The following search request will match with the previously indexed percolator query:
GET /my_queries1/_search
{
"query": {
"percolate": {
"field": "query",
"document": {
"my_field": "abcd"
}
}
}
}
{
"took": 6,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 1,
"max_score": 0.41501677,
"hits": [
{
"_index": "my_queries1",
"_type": "_doc",
"_id": "1",
"_score": 0.41501677,
"_source": {
"query": {
"term": {
"my_field.prefix": "abc"
}
}
},
"fields": {
"_percolator_document_slot": [
0
]
}
}
]
}
}
The same technique can also be used to speed up suffix
wildcard searches, by using the reverse
token filter
before the edge_ngram
token filter.
PUT my_queries2
{
"settings": {
"analysis": {
"analyzer": {
"wildcard_suffix": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"reverse",
"wildcard_edge_ngram"
]
},
"wildcard_suffix_search_time": {
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"reverse"
]
}
},
"filter": {
"wildcard_edge_ngram": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 32
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"query": {
"type": "percolator"
},
"my_field": {
"type": "text",
"fields": {
"suffix": {
"type": "text",
"analyzer": "wildcard_suffix",
"search_analyzer": "wildcard_suffix_search_time" (1)
}
}
}
}
}
}
}
-
A custom analyzer is needed at search time too, because otherwise the query terms are not reversed and would not match the reversed suffix tokens.
Then instead of indexing the following query:
{
"query": {
"wildcard": {
"my_field": "*xyz"
}
}
}
the following query should be indexed:
PUT /my_queries2/_doc/2?refresh
{
"query": {
"match": { (1)
"my_field.suffix": "xyz"
}
}
}
-
The
match
query should be used instead of theterm
query, because text analysis needs to reverse the query terms.
The following search request will match with the previously indexed percolator query:
GET /my_queries2/_search
{
"query": {
"percolate": {
"field": "query",
"document": {
"my_field": "wxyz"
}
}
}
}
Dedicated Percolator Index
Percolate queries can be added to any index. Instead of adding percolate queries to the index the data resides in, these queries can also be added to a dedicated index. The advantage of this is that this dedicated percolator index can have its own index settings (for example, the number of primary and replica shards). If you choose to have a dedicated percolate index, you need to make sure that the mappings from the normal index are also available on the percolate index. Otherwise percolate queries can be parsed incorrectly.
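A minimal sketch of such a dedicated index, assuming the data index maps a body text field (the index and field names are illustrative):
PUT queries_only
{
  "settings": {
    "index.number_of_shards": 1
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        },
        "body": {
          "type": "text"
        }
      }
    }
  }
}
The body mapping here mirrors the mapping of the data index, so that queries referring to body are parsed the same way in both places.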
Forcing Unmapped Fields to be Handled as Strings
In certain cases it is unknown what kind of percolator queries will get registered, and if no field mapping exists for fields
that are referred to by percolator queries, then adding a percolator query fails. This means the mapping needs to be updated
to have the field with the appropriate settings, and then the percolator query can be added. But sometimes it is sufficient
if all unmapped fields are handled as if these were default text fields. In those cases one can configure the
index.percolator.map_unmapped_fields_as_text
setting to true
(defaults to false
); then if a field referred to in
a percolator query does not exist, it will be handled as a default text field so that adding the percolator query doesn’t
fail.
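A minimal sketch, assuming the setting is supplied at index creation (the index name is illustrative):
PUT my_percolator_index
{
  "settings": {
    "index.percolator.map_unmapped_fields_as_text": true
  },
  "mappings": {
    "_doc": {
      "properties": {
        "query": {
          "type": "percolator"
        }
      }
    }
  }
}
With this in place, a percolator query that refers to a field with no mapping is accepted, and the unmapped field is treated as a default text field.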
Limitations
Parent/child
Because the percolate
query is processing one document at a time, it doesn’t support queries and filters that run
against child documents such as has_child
and has_parent
.
Fetching queries
There are a number of queries that fetch data via a get call during query parsing. For example the terms
query when
using terms lookup, template
query when using indexed scripts and geo_shape
when using pre-indexed shapes. When these
queries are indexed by the percolator
field type then the get call is executed once. So each time the percolator
query evaluates these queries, the fetched terms, shapes, etc. as they were at index time will be used. Important to note
is that the fetching of terms that these queries perform happens each time the percolator query gets indexed, on both primary
and replica shards, so the terms that are actually indexed can be different between shard copies if the source index
changed while indexing.
Script query
The script inside a script
query can only access doc values fields. The percolate
query indexes the provided document
into an in-memory index. This in-memory index doesn’t support stored fields and because of that the _source
field and
other stored fields are not stored. This is the reason why in the script
query the _source
and other stored fields
aren’t available.
Field aliases
Percolator queries that contain field aliases may not always behave as expected. In particular, if a percolator query is registered that contains a field alias, and then that alias is updated in the mappings to refer to a different field, the stored query will still refer to the original target field. To pick up the change to the field alias, the percolator query must be explicitly reindexed.
join
datatype
The join
datatype is a special field that creates a
parent/child relation within documents of the same index.
The relations
section defines a set of possible relations within the documents,
each relation being a parent name and a child name.
A parent/child relation can be defined as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_join_field": { (1)
"type": "join",
"relations": {
"question": "answer" (2)
}
}
}
}
}
}
-
The name for the field
-
Defines a single relation where
question
is parent ofanswer
.
To index a document with a join, the name of the relation and the optional parent
of the document must be provided in the source
.
For instance the following example creates two parent
documents in the question
context:
PUT my_index/_doc/1?refresh
{
"text": "This is a question",
"my_join_field": {
"name": "question" (1)
}
}
PUT my_index/_doc/2?refresh
{
"text": "This is another question",
"my_join_field": {
"name": "question"
}
}
-
This document is a
question
document.
When indexing parent documents, you can choose to specify just the name of the relation as a shortcut instead of encapsulating it in the normal object notation:
PUT my_index/_doc/1?refresh
{
"text": "This is a question",
"my_join_field": "question" (1)
}
PUT my_index/_doc/2?refresh
{
"text": "This is another question",
"my_join_field": "question"
}
-
Simpler notation for a parent document just uses the relation name.
When indexing a child, the name of the relation as well as the parent id of the document
must be added in the _source
.
Warning
|
It is required to index the lineage of a parent in the same shard so you must always route child documents using their greater parent id. |
For instance the following example shows how to index two child
documents:
PUT my_index/_doc/3?routing=1&refresh (1)
{
"text": "This is an answer",
"my_join_field": {
"name": "answer", (2)
"parent": "1" (3)
}
}
PUT my_index/_doc/4?routing=1&refresh
{
"text": "This is another answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
}
-
The routing value is mandatory because parent and child documents must be indexed on the same shard
-
answer
is the name of the join for this document -
The parent id of this child document
Parent-join and performance.
The join field shouldn’t be used like joins in a relational database. In Elasticsearch the key to good performance
is to de-normalize your data into documents. Each join field, has_child
or has_parent
query adds a
significant tax to your query performance.
The only case where the join field makes sense is if your data contains a one-to-many relationship where one entity significantly outnumbers the other entity. An example of such a case is products and offers for these products. If offers significantly outnumber the products, then it makes sense to model the product as the parent document and the offer as the child document.
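A minimal sketch of that modelling, with made-up index, field, and relation names:
PUT shop
{
  "mappings": {
    "_doc": {
      "properties": {
        "shop_join_field": {
          "type": "join",
          "relations": {
            "product": "offer"
          }
        }
      }
    }
  }
}
Each offer document would then be indexed with the id of its product as the parent, and routed with that same id so that it lands on the product's shard.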
Parent-join restrictions
-
Only one
join
field mapping is allowed per index. -
Parent and child documents must be indexed on the same shard. This means that the same
routing
value needs to be provided when getting, deleting, or updating a child document. -
An element can have multiple children but only one parent.
-
It is possible to add a new relation to an existing
join
field, as shown in the sketch after this list. -
It is also possible to add a child to an existing element but only if the element is already a parent.
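As a hedged sketch of adding a relation, assuming the earlier question ⇒ answer mapping, the put mapping API could extend it with an additional comment child:
PUT my_index/_mapping/_doc
{
  "properties": {
    "my_join_field": {
      "type": "join",
      "relations": {
        "question": ["answer", "comment"]
      }
    }
  }
}
The existing question ⇒ answer relation is repeated and the new comment child is appended to it.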
Searching with parent-join
The parent-join creates one field to index the name of the relation
within the document (my_parent
, my_child
, …).
It also creates one field per parent/child relation.
The name of this field is the name of the join
field followed by #
and the
name of the parent in the relation.
So for instance for the my_parent
⇒ [my_child
, another_child
] relation,
the join
field creates an additional field named my_join_field#my_parent
.
This field contains the parent _id
that the document links to
if the document is a child (my_child
or another_child
) and the _id
of
document if it’s a parent (my_parent
).
When searching an index that contains a join
field, these two fields are always
returned in the search response:
GET my_index/_search
{
"query": {
"match_all": {}
},
"sort": ["_id"]
}
Will return:
{
...,
"hits": {
"total": 4,
"max_score": null,
"hits": [
{
"_index": "my_index",
"_type": "_doc",
"_id": "1",
"_score": null,
"_source": {
"text": "This is a question",
"my_join_field": "question" (1)
},
"sort": [
"1"
]
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "2",
"_score": null,
"_source": {
"text": "This is another question",
"my_join_field": "question" (2)
},
"sort": [
"2"
]
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "3",
"_score": null,
"_routing": "1",
"_source": {
"text": "This is an answer",
"my_join_field": {
"name": "answer", (3)
"parent": "1" (4)
}
},
"sort": [
"3"
]
},
{
"_index": "my_index",
"_type": "_doc",
"_id": "4",
"_score": null,
"_routing": "1",
"_source": {
"text": "This is another answer",
"my_join_field": {
"name": "answer",
"parent": "1"
}
},
"sort": [
"4"
]
}
]
}
}
-
This document belongs to the
question
join -
This document belongs to the
question
join -
This document belongs to the
answer
join -
The linked parent id for the child document
Parent-join queries and aggregations
See the has_child
and
has_parent
queries,
the children
aggregation,
and inner hits for more information.
The value of the join
field is accessible in aggregations
and scripts, and may be queried with the
parent_id
query:
GET my_index/_search
{
"query": {
"parent_id": { (1)
"type": "answer",
"id": "1"
}
},
"aggs": {
"parents": {
"terms": {
"field": "my_join_field#question", (2)
"size": 10
}
}
},
"script_fields": {
"parent": {
"script": {
"source": "doc['my_join_field#question']" (3)
}
}
}
}
-
Querying the
parent id
field (also see thehas_parent
query and thehas_child
query) -
Aggregating on the
parent id
field (also see thechildren
aggregation) -
Accessing the parent id field in scripts
Global ordinals
The join
field uses global ordinals to speed up joins.
Global ordinals need to be rebuilt after any change to a shard. The more
parent id values are stored in a shard, the longer it takes to rebuild the
global ordinals for the join
field.
Global ordinals, by default, are built eagerly: if the index has changed,
global ordinals for the join
field will be rebuilt as part of the refresh.
This can add significant time to the refresh. However, most of the time this is the
right trade-off; otherwise global ordinals are rebuilt when the first parent-join
query or aggregation is used. This can introduce a significant latency spike for
your users, and is usually worse because multiple rebuilds of the global ordinals for the join
field may be attempted within a single refresh interval when many writes
are occurring.
When the join
field is used infrequently and writes occur frequently it may
make sense to disable eager loading:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_join_field": {
"type": "join",
"relations": {
"question": "answer"
},
"eager_global_ordinals": false
}
}
}
}
}
The amount of heap used by global ordinals can be checked per parent relation as follows:
# Per-index
GET _stats/fielddata?human&fields=my_join_field#question
# Per-node per-index
GET _nodes/stats/indices/fielddata?human&fields=my_join_field#question
Multiple children per parent
It is also possible to define multiple children for a single parent:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_join_field": {
"type": "join",
"relations": {
"question": ["answer", "comment"] (1)
}
}
}
}
}
}
-
question
is parent ofanswer
andcomment
.
Multiple levels of parent join
Warning
|
Using multiple levels of relations to replicate a relational model is not recommended. Each level of relation adds an overhead at query time in terms of memory and computation. You should de-normalize your data if you care about performance. |
Multiple levels of parent/child:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_join_field": {
"type": "join",
"relations": {
"question": ["answer", "comment"], (1)
"answer": "vote" (2)
}
}
}
}
}
}
-
question
is parent ofanswer
andcomment
-
answer
is parent ofvote
The mapping above represents the following tree:
   question
    /    \
   /      \
comment  answer
            |
            |
          vote
Indexing a grandchild document requires a routing
value equal
to the grand-parent (the greater parent of the lineage):
PUT my_index/_doc/3?routing=1&refresh (1)
{
"text": "This is a vote",
"my_join_field": {
"name": "vote",
"parent": "2" (2)
}
}
-
This child document must be on the same shard as its grand-parent and parent
-
The parent id of this document (must point to an
answer
document)
Meta-Fields
Each document has metadata associated with it, such as the _index
, mapping
_type
, and _id
meta-fields. The behaviour of some of these meta-fields
can be customised when a mapping type is created.
Identity meta-fields
_index
|
The index to which the document belongs. |
_uid
|
A composite field consisting of the |
_type
|
The document’s mapping type. |
_id
|
The document’s ID. |
Document source meta-fields
_source
-
The original JSON representing the body of the document.
- {plugins}/mapper-size.html[
_size
] -
The size of the
_source
field in bytes, provided by the {plugins}/mapper-size.html[mapper-size
plugin].
Indexing meta-fields
_all
-
A catch-all field that indexes the values of all other fields. Disabled by default.
_field_names
-
All fields in the document which contain non-null values.
_ignored
-
All fields in the document that have been ignored at index time because of
ignore_malformed
.
Routing meta-field
_routing
-
A custom routing value which routes a document to a particular shard.
Other meta-field
_meta
-
Application specific metadata.
_all
field
deprecated::[6.0.0, "`_all` may no longer be enabled for indices created in 6.0+, use a custom field and the mapping copy_to
parameter"]
The _all
field is a special catch-all field which concatenates the values
of all of the other fields into one big string, using space as a delimiter, which is then
analyzed and indexed, but not stored. This means that it can be
searched, but not retrieved.
The _all
field allows you to search for values in documents without knowing
which field contains the value. This makes it a useful option when getting
started with a new dataset. For instance:
PUT /my_index
{
"mapping": {
"user": {
"_all": {
"enabled": true (1)
}
}
}
}
PUT /my_index/user/1 (2)
{
"first_name": "John",
"last_name": "Smith",
"date_of_birth": "1970-10-24"
}
GET /my_index/_search
{
"query": {
"match": {
"_all": "john smith 1970"
}
}
}
-
Enabling the
_all
field -
The
_all
field will contain the terms: ["john"
,"smith"
,"1970"
,"10"
,"24"
]
Note
|
All values treated as strings
The It is important to note that the |
The _all
field is just a text
field, and accepts the same
parameters that other string fields accept, including analyzer
,
term_vectors
, index_options
, and store
.
The _all
field can be useful, especially when exploring new data using
simple filtering. However, by concatenating field values into one big string,
the _all
field loses the distinction between short fields (more relevant)
and long fields (less relevant). For use cases where search relevance is
important, it is better to query individual fields specifically.
The _all
field is not free: it requires extra CPU cycles and uses more disk
space. For this reason, it is disabled by default. If needed, it can be
enabled.
Using the _all
field in queries
The query_string
and
simple_query_string
queries query the
_all
field by default if it is enabled, unless another field is specified:
GET _search
{
"query": {
"query_string": {
"query": "john smith new york"
}
}
}
The same goes for the ?q=
parameter in URI search
requests (which is rewritten to a query_string
query internally):
GET _search?q=john+smith+new+york
Other queries, such as the match
and
term
queries require you to specify the _all
field
explicitly, as per the first example.
Enabling the _all
field
The _all
field can be enabled per-type by setting enabled
to true
:
PUT my_index
{
"mappings": {
"type_1": { (1)
"properties": {...}
},
"type_2": { (2)
"_all": {
"enabled": true
},
"properties": {...}
}
}
}
-
The
_all
field intype_1
is disabled. -
The
_all
field intype_2
is enabled.
If the _all
field is enabled, then URI search requests and the query_string
and simple_query_string
queries can automatically use it for queries (see
Using the _all
field in queries). You can configure them to use a different field with
the index.query.default_field
setting:
PUT my_index
{
"mappings": {
"my_type": {
"properties": {
"content": {
"type": "text"
}
}
}
},
"settings": {
"index.query.default_field": "content" (1)
}
}
-
The
query_string
query will default to querying thecontent
field in this index.
Index boosting and the _all
field
Individual fields can be boosted at index time, with the boost
parameter. The _all
field takes these boosts into account:
PUT myindex
{
"mappings": {
"mytype": {
"_all": {"enabled": true},
"properties": {
"title": { (1)
"type": "text",
"boost": 2
},
"content": { (1)
"type": "text"
}
}
}
}
}
-
When querying the
_all
field, words that originated in thetitle
field are twice as relevant as words that originated in thecontent
field.
Warning
|
Using index-time boosting with the _all field has a significant
impact on query performance. Usually the better solution is to query fields
individually, with optional query time boosting.
|
Custom _all
fields
While there is only a single _all
field per index, the copy_to
parameter allows the creation of multiple custom _all
fields. For
instance, first_name
and last_name
fields can be combined together into
the full_name
field:
PUT myindex
{
"mappings": {
"mytype": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name" (1)
},
"last_name": {
"type": "text",
"copy_to": "full_name" (1)
},
"full_name": {
"type": "text"
}
}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET myindex/_search
{
"query": {
"match": {
"full_name": "John Smith"
}
}
}
-
The
first_name
andlast_name
values are copied to thefull_name
field.
Highlighting and the _all
field
A field can only be used for highlighting if
the original string value is available, either from the
_source
field or as a stored field.
The _all
field is not present in the _source
field and it is not stored or
enabled by default, and so cannot be highlighted. There are two options. Either
store the _all
field or highlight the
original fields.
Store the _all
field
If store
is set to true
, then the original field value is retrievable and
can be highlighted:
PUT myindex
{
"mappings": {
"mytype": {
"_all": {
"enabled": true,
"store": true
}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith"
}
},
"highlight": {
"fields": {
"_all": {}
}
}
}
Of course, enabling and storing the _all
field will use significantly more
disk space and, because it is a combination of other fields, it may result in
odd highlighting results.
The _all
field also accepts the term_vector
and index_options
parameters, allowing highlighting to use it.
Highlight original fields
You can query the _all
field, but use the original fields for highlighting as follows:
PUT myindex
{
"mappings": {
"mytype": {
"_all": {"enabled": true}
}
}
}
PUT myindex/mytype/1
{
"first_name": "John",
"last_name": "Smith"
}
GET _search
{
"query": {
"match": {
"_all": "John Smith" (1)
}
},
"highlight": {
"fields": {
"*_name": { (2)
"require_field_match": false (3)
}
}
}
}
-
The query inspects the
_all
field to find matching documents. -
Highlighting is performed on the two name fields, which are available from the
_source
. -
The query wasn’t run against the name fields, so set
require_field_match
tofalse
.
_field_names
field
The _field_names
field used to index the names of every field in a document that
contains any value other than null
. This field was used by the
exists
query to find documents that
either have or don’t have any non-null value for a particular field.
Now the _field_names
field only indexes the names of fields that have
doc_values
and norms
disabled. For fields which have either doc_values
or norms
enabled, the exists
query will still
be available but will not use the _field_names
field.
Disabling _field_names
Disabling _field_names
is often not necessary because it no longer
carries the index overhead it once did. If you have a lot of fields
which have doc_values
and norms
disabled and you do not need to
execute exists
queries using those fields, you might want to disable
_field_names
by adding the following to the mappings:
PUT tweets
{
"mappings": {
"_doc": {
"_field_names": {
"enabled": false
}
}
}
}
_ignored
field
added[6.4.0]
The _ignored
field indexes and stores the names of every field in a document
that has been ignored because it was malformed and
ignore_malformed
was turned on.
This field is searchable with term
,
terms
and exists
queries, and is returned as part of the search hits.
For instance the below query matches all documents that have one or more fields that got ignored:
GET _search
{
"query": {
"exists": {
"field": "_ignored"
}
}
}
Similarly, the below query finds all documents whose @timestamp
field was
ignored at index time:
GET _search
{
"query": {
"term": {
"_ignored": "@timestamp"
}
}
}
_id
field
Each document has an _id
that uniquely identifies it, which is indexed
so that documents can be looked up either with the GET API or the
ids
query.
Note
|
This was not the case with pre-6.0 indices due to the fact that they
supported multiple types, so the _type and _id were merged into a composite
primary key called _uid .
|
The value of the _id
field is accessible in certain queries (term
,
terms
, match
, query_string
, simple_query_string
).
# Example documents
PUT my_index/_doc/1
{
"text": "Document with ID 1"
}
PUT my_index/_doc/2?refresh=true
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_id": [ "1", "2" ] (1)
}
}
}
-
Querying on the
_id
field (also see theids
query)
The value of the _id
field is also accessible in aggregations or for sorting,
but doing so is discouraged as it requires loading a lot of data in memory. In
case sorting or aggregating on the _id
field is required, it is advised to
duplicate the content of the _id
field in another field that has doc_values
enabled.
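A minimal sketch of that duplication, assuming a hypothetical id_copy field that the application fills with the same value as the document id:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "id_copy": {
          "type": "keyword"
        }
      }
    }
  }
}
PUT my_index/_doc/1?refresh=true
{
  "id_copy": "1",
  "text": "Document with ID 1"
}
GET my_index/_search
{
  "aggs": {
    "ids": {
      "terms": {
        "field": "id_copy",
        "size": 10
      }
    }
  }
}
Because id_copy is a keyword field it has doc_values enabled by default, so sorting and aggregating on it avoids loading the _id field into memory.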
_index
field
When performing queries across multiple indexes, it is sometimes desirable to
add query clauses that are associated with documents of only certain indexes.
The _index
field allows matching on the index a document was indexed into.
Its value is accessible in term
, or terms
queries, aggregations,
scripts, and when sorting:
Note
|
The _index is exposed as a virtual field — it is not added to the
Lucene index as a real field. This means that you can use the _index field
in a term or terms query (or any query that is rewritten to a term
query, such as the match , query_string or simple_query_string query),
but it does not support prefix , wildcard , regexp , or fuzzy queries.
|
# Example documents
PUT index_1/_doc/1
{
"text": "Document in index 1"
}
PUT index_2/_doc/2?refresh=true
{
"text": "Document in index 2"
}
GET index_1,index_2/_search
{
"query": {
"terms": {
"_index": ["index_1", "index_2"] (1)
}
},
"aggs": {
"indices": {
"terms": {
"field": "_index", (2)
"size": 10
}
}
},
"sort": [
{
"_index": { (3)
"order": "asc"
}
}
],
"script_fields": {
"index_name": {
"script": {
"lang": "painless",
"source": "doc['_index']" (4)
}
}
}
}
-
Querying on the
_index
field -
Aggregating on the
_index
field -
Sorting on the
_index
field -
Accessing the
_index
field in scripts
_meta
field
A mapping type can have custom meta data associated with it. These are not used at all by Elasticsearch, but can be used to store application-specific metadata, such as the class that a document belongs to:
PUT my_index
{
"mappings": {
"_doc": {
"_meta": { (1)
"class": "MyApp::User",
"version": {
"min": "1.0",
"max": "1.3"
}
}
}
}
}
-
This
_meta
info can be retrieved with the GET mapping API.
The _meta
field can be updated on an existing type using the
PUT mapping API:
PUT my_index/_mapping/_doc
{
"_meta": {
"class": "MyApp2::User3",
"version": {
"min": "1.3",
"max": "1.5"
}
}
}
_routing
field
A document is routed to a particular shard in an index using the following formula:
shard_num = hash(_routing) % num_primary_shards
The default value used for _routing
is the document’s _id
.
Custom routing patterns can be implemented by specifying a custom routing
value per document. For instance:
PUT my_index/_doc/1?routing=user1&refresh=true (1)
{
"title": "This is a document"
}
GET my_index/_doc/1?routing=user1 (2)
The value of the _routing
field is accessible in queries:
GET my_index/_search
{
"query": {
"terms": {
"_routing": [ "user1" ] (1)
}
}
}
-
Querying on the
_routing
field (also see theids
query)
Searching with custom routing
Custom routing can reduce the impact of searches. Instead of having to fan out a search request to all the shards in an index, the request can be sent to just the shard that matches the specific routing value (or values):
GET my_index/_search?routing=user1,user2 (1)
{
"query": {
"match": {
"title": "document"
}
}
}
-
This search request will only be executed on the shards associated with the
user1
anduser2
routing values.
Making a routing value required
When using custom routing, it is important to provide the routing value whenever indexing, getting, deleting, or updating a document.
Forgetting the routing value can lead to a document being indexed on more than
one shard. As a safeguard, the _routing
field can be configured to make a
custom routing
value required for all CRUD operations:
PUT my_index2
{
"mappings": {
"_doc": {
"_routing": {
"required": true (1)
}
}
}
}
PUT my_index2/_doc/1 (2)
{
"text": "No routing value provided"
}
-
Routing is required for
_doc
documents. -
This index request throws a
routing_missing_exception
.
Unique IDs with custom routing
When indexing documents specifying a custom _routing
, the uniqueness of the
_id
is not guaranteed across all of the shards in the index. In fact,
documents with the same _id
might end up on different shards if indexed with
different _routing
values.
It is up to the user to ensure that IDs are unique across the index.
Routing to an index partition
An index can be configured such that custom routing values will go to a subset of the shards rather than a single shard. This helps mitigate the risk of ending up with an imbalanced cluster while still reducing the impact of searches.
This is done by providing the index level setting index.routing_partition_size
at index creation.
As the partition size increases, the data will become more evenly distributed, at the
expense of having to search more shards per request.
When this setting is present, the formula for calculating the shard becomes:
shard_num = (hash(_routing) + hash(_id) % routing_partition_size) % num_primary_shards
That is, the _routing
field is used to calculate a set of shards within the index and then the
_id
is used to pick a shard within that set.
To enable this feature, the index.routing_partition_size
should have a value greater than 1 and
less than index.number_of_shards
.
Once enabled, the partitioned index will have the following limitations:
-
Mappings with
join
field relationships cannot be created within it. -
All mappings within the index must have the
_routing
field marked as required.
_source
field
The _source
field contains the original JSON document body that was passed
at index time. The _source
field itself is not indexed (and thus is not
searchable), but it is stored so that it can be returned when executing
fetch requests, like get or search.
Disabling the _source
field
Though very handy to have around, the _source field does incur storage overhead within the index. For this reason, it can be disabled as follows:
PUT tweets
{
"mappings": {
"_doc": {
"_source": {
"enabled": false
}
}
}
}
Warning
|
Think before disabling the
_source fieldUsers often disable the
|
Tip
|
If disk space is a concern, rather increase the
compression level instead of disabling the _source .
|
Including / Excluding fields from _source
An expert-only feature is the ability to prune the contents of the _source
field after the document has been indexed, but before the _source
field is
stored.
Warning
|
Removing fields from the _source has similar downsides to disabling
_source , especially the fact that you cannot reindex documents from one
Elasticsearch index to another. Consider using
source filtering instead.
|
The includes
/excludes
parameters (which also accept wildcards) can be used
as follows:
PUT logs
{
"mappings": {
"_doc": {
"_source": {
"includes": [
"*.count",
"meta.*"
],
"excludes": [
"meta.description",
"meta.other.*"
]
}
}
}
}
PUT logs/_doc/1
{
"requests": {
"count": 10,
"foo": "bar" (1)
},
"meta": {
"name": "Some metric",
"description": "Some metric description", (1)
"other": {
"foo": "one", (1)
"baz": "two" (1)
}
}
}
GET logs/_search
{
"query": {
"match": {
"meta.other.foo": "one" (2)
}
}
}
-
These fields will be removed from the stored
_source
field. -
We can still search on this field, even though it is not in the stored
_source
.
_type
field
deprecated[6.0.0,See Removal of mapping types]
Each document indexed is associated with a _type
(see
Mapping Type) and an _id
. The _type
field is
indexed in order to make searching by type name fast.
The value of the _type
field is accessible in queries, aggregations,
scripts, and when sorting:
# Example documents
PUT my_index/_doc/1?refresh=true
{
"text": "Document with type 'doc'"
}
GET my_index/_search
{
"query": {
"term": {
"_type": "_doc" (1)
}
},
"aggs": {
"types": {
"terms": {
"field": "_type", (2)
"size": 10
}
}
},
"sort": [
{
"_type": { (3)
"order": "desc"
}
}
],
"script_fields": {
"type": {
"script": {
"lang": "painless",
"source": "doc['_type']" (4)
}
}
}
}
-
Querying on the
_type
field -
Aggregating on the
_type
field -
Sorting on the
_type
field -
Accessing the
_type
field in scripts
_uid
field
deprecated::[6.0.0, "Now that types have been removed, documents are uniquely identified by their _id
and the _uid
field has only been kept as a view over the _id
field for backward compatibility."]
Each document indexed is associated with a _type
(see
Mapping Type) and an _id
. These values are
combined as {type}#{id}
and indexed as the _uid
field.
The value of the _uid
field is accessible in queries, aggregations, scripts,
and when sorting:
# Example documents
PUT my_index/_doc/1
{
"text": "Document with ID 1"
}
PUT my_index/_doc/2?refresh=true
{
"text": "Document with ID 2"
}
GET my_index/_search
{
"query": {
"terms": {
"_uid": [ "_doc#1", "_doc#2" ] (1)
}
},
"aggs": {
"UIDs": {
"terms": {
"field": "_uid", (2)
"size": 10
}
}
},
"sort": [
{
"_uid": { (3)
"order": "desc"
}
}
],
"script_fields": {
"UID": {
"script": {
"lang": "painless",
"source": "doc['_uid']" (4)
}
}
}
}
-
Querying on the
_uid
field (also see theids
query) -
Aggregating on the
_uid
field -
Sorting on the
_uid
field -
Accessing the
_uid
field in scripts
Mapping parameters
The following pages provide detailed explanations of the various mapping parameters that are used by field mappings:
The following mapping parameters are common to some or all field datatypes:
analyzer
The values of analyzed
string fields are passed through an
analyzer to convert the string into a stream of tokens or
terms. For instance, the string "The quick Brown Foxes."
may, depending
on which analyzer is used, be analyzed to the tokens: quick
, brown
,
fox
. These are the actual terms that are indexed for the field, which makes
it possible to search efficiently for individual words within big blobs of
text.
This analysis process needs to happen not just at index time, but also at query time: the query string needs to be passed through the same (or a similar) analyzer so that the terms that it tries to find are in the same format as those that exist in the index.
Elasticsearch ships with a number of pre-defined analyzers, which can be used without further configuration. It also ships with many character filters, tokenizers, and token filters which can be combined to configure custom analyzers per index.
Analyzers can be specified per-query, per-field or per-index. At index time, Elasticsearch will look for an analyzer in this order:
-
The
analyzer
defined in the field mapping. -
An analyzer named
default
in the index settings. -
The
standard
analyzer.
At query time, there are a few more layers:
-
The
analyzer
defined in a full-text query. -
The
search_analyzer
defined in the field mapping. -
The
analyzer
defined in the field mapping. -
An analyzer named
default_search
in the index settings. -
An analyzer named
default
in the index settings. -
The
standard
analyzer.
The easiest way to specify an analyzer for a particular field is to define it in the field mapping, as follows:
PUT /my_index
{
"mappings": {
"_doc": {
"properties": {
"text": { (1)
"type": "text",
"fields": {
"english": { (2)
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
GET my_index/_analyze (3)
{
"field": "text",
"text": "The quick Brown Foxes."
}
GET my_index/_analyze (4)
{
"field": "text.english",
"text": "The quick Brown Foxes."
}
-
The
text
field uses the defaultstandard
analyzer. -
The
text.english
multi-field uses theenglish
analyzer, which removes stop words and applies stemming. -
This returns the tokens: [
the
,quick
,brown
,foxes
]. -
This returns the tokens: [
quick
,brown
,fox
].
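Alternatively, an analyzer named default can be declared in the index settings; it is then picked up for any field that does not declare its own analyzer. A minimal sketch, with an illustrative index name and a simple lowercase-only analyzer:
PUT my_other_index
{
  "settings": {
    "analysis": {
      "analyzer": {
        "default": {
          "type": "custom",
          "tokenizer": "standard",
          "filter": ["lowercase"]
        }
      }
    }
  }
}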
search_quote_analyzer
The search_quote_analyzer
setting allows you to specify an analyzer for phrases. This is particularly useful when you want to disable
stop words for phrase queries.
To disable stop words for phrases, a field utilising three analyzer settings will be required:
-
An
analyzer
setting for indexing all terms including stop words -
A
search_analyzer
setting for non-phrase queries that will remove stop words -
A
search_quote_analyzer
setting for phrase queries that will not remove stop words
PUT my_index
{
"settings":{
"analysis":{
"analyzer":{
"my_analyzer":{ (1)
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase"
]
},
"my_stop_analyzer":{ (2)
"type":"custom",
"tokenizer":"standard",
"filter":[
"lowercase",
"english_stop"
]
}
},
"filter":{
"english_stop":{
"type":"stop",
"stopwords":"_english_"
}
}
}
},
"mappings":{
"_doc":{
"properties":{
"title": {
"type":"text",
"analyzer":"my_analyzer", (3)
"search_analyzer":"my_stop_analyzer", (4)
"search_quote_analyzer":"my_analyzer" (5)
}
}
}
}
}
PUT my_index/_doc/1
{
"title":"The Quick Brown Fox"
}
PUT my_index/_doc/2
{
"title":"A Quick Brown Fox"
}
GET my_index/_search
{
"query":{
"query_string":{
"query":"\"the quick brown fox\"" (6)
}
}
}
-
my_analyzer
analyzer which tokenizes all terms including stop words -
my_stop_analyzer
analyzer which removes stop words -
analyzer
setting that points to themy_analyzer
analyzer which will be used at index time -
search_analyzer
setting that points to themy_stop_analyzer
and removes stop words for non-phrase queries -
search_quote_analyzer
setting that points to themy_analyzer
analyzer and ensures that stop words are not removed from phrase queries -
Since the query is wrapped in quotes, it is detected as a phrase query, so the
search_quote_analyzer
kicks in and ensures that the stop words are not removed from the query. The
analyzer will then return the following tokens [the
,quick
,brown
,fox
] which will match one of the documents. Meanwhile term queries will be analyzed with themy_stop_analyzer
analyzer which will filter out stop words. So a search for eitherThe quick brown fox
orA quick brown fox
will return both documents since both documents contain the following tokens [quick
,brown
,fox
]. Without thesearch_quote_analyzer
it would not be possible to do exact matches for phrase queries as the stop words from phrase queries would be removed resulting in both documents matching.
normalizer
The normalizer
property of keyword
fields is similar to
analyzer
except that it guarantees that the analysis chain
produces a single token.
The normalizer
is applied prior to indexing the keyword, as well as at
search-time when the keyword
field is searched via a query parser such as
the match
query or via a term-level query
such as the term
query.
PUT index
{
"settings": {
"analysis": {
"normalizer": {
"my_normalizer": {
"type": "custom",
"char_filter": [],
"filter": ["lowercase", "asciifolding"]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"foo": {
"type": "keyword",
"normalizer": "my_normalizer"
}
}
}
}
}
PUT index/_doc/1
{
"foo": "BÀR"
}
PUT index/_doc/2
{
"foo": "bar"
}
PUT index/_doc/3
{
"foo": "baz"
}
POST index/_refresh
GET index/_search
{
"query": {
"term": {
"foo": "BAR"
}
}
}
GET index/_search
{
"query": {
"match": {
"foo": "BAR"
}
}
}
The above queries match documents 1 and 2 since BÀR
is converted to bar
at
both index and query time.
{
"took": $body.took,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 2,
"max_score": 0.2876821,
"hits": [
{
"_index": "index",
"_type": "_doc",
"_id": "2",
"_score": 0.2876821,
"_source": {
"foo": "bar"
}
},
{
"_index": "index",
"_type": "_doc",
"_id": "1",
"_score": 0.2876821,
"_source": {
"foo": "BÀR"
}
}
]
}
}
The fact that keywords are converted prior to indexing also means that aggregations return normalized values:
GET index/_search
{
"size": 0,
"aggs": {
"foo_terms": {
"terms": {
"field": "foo"
}
}
}
}
returns
{
"took": 43,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"foo_terms": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "bar",
"doc_count": 2
},
{
"key": "baz",
"doc_count": 1
}
]
}
}
}
boost
Individual fields can be boosted automatically — count more towards the relevance score — at query time, with the boost
parameter as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"boost": 2 (1)
},
"content": {
"type": "text"
}
}
}
}
}
-
Matches on the
title
field will have twice the weight as those on thecontent
field, which has the defaultboost
of1.0
.
Note
|
The boost is applied only for term queries (prefix, range and fuzzy queries are not boosted). |
You can achieve the same effect by using the boost parameter directly in the query. For instance, the following query (which relies on the boost set on the title field in the mapping):
POST _search
{
"query": {
"match" : {
"title": {
"query": "quick brown fox"
}
}
}
}
is equivalent to:
POST _search
{
"query": {
"match" : {
"title": {
"query": "quick brown fox",
"boost": 2
}
}
}
}
The boost is also applied when it is copied with the
value in the _all
field. This means that, when
querying the _all
field, words that originated from the title
field will
have a higher score than words that originated in the content
field.
This functionality comes at a cost: queries on the _all
field are slower
when field boosting is used.
Deprecated in 5.0.0: index time boost is deprecated. Instead, the field mapping boost is applied at query time. For indices created before 5.0.0 the boost will still be applied at index time.
Warning
|
Why index time boosting is a bad idea
We advise against using index time boosting for the following reasons:
-
You cannot change index-time boost values without reindexing all of your documents.
-
Every query supports query-time boosting, which achieves the same effect. The difference is that you can tweak the boost value without having to reindex.
-
Index-time boosts are stored as part of the norm, which is only one byte. This reduces the resolution of the field length normalization factor, which can lead to lower quality relevance calculations.
|
coerce
Data is not always clean. Depending on how it is produced a number might be
rendered in the JSON body as a true JSON number, e.g. 5
, but it might also
be rendered as a string, e.g. "5"
. Alternatively, a number that should be
an integer might instead be rendered as a floating point, e.g. 5.0
, or even
"5.0"
.
Coercion attempts to clean up dirty values to fit the datatype of a field. For instance:
-
Strings will be coerced to numbers.
-
Floating points will be truncated for integer values.
For instance:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"number_one": {
"type": "integer"
},
"number_two": {
"type": "integer",
"coerce": false
}
}
}
}
}
PUT my_index/_doc/1
{
"number_one": "10" (1)
}
PUT my_index/_doc/2
{
"number_two": "10" (2)
}
-
The
number_one
field will contain the integer10
. -
This document will be rejected because coercion is disabled.
Tip
|
The coerce setting is allowed to have different settings for fields of
the same name in the same index. Its value can be updated on existing fields
using the PUT mapping API.
|
Index-level default
The index.mapping.coerce
setting can be set on the index level to disable
coercion globally across all mapping types:
PUT my_index
{
"settings": {
"index.mapping.coerce": false
},
"mappings": {
"_doc": {
"properties": {
"number_one": {
"type": "integer",
"coerce": true
},
"number_two": {
"type": "integer"
}
}
}
}
}
PUT my_index/_doc/1
{ "number_one": "10" } (1)
PUT my_index/_doc/2
{ "number_two": "10" } (2)
-
The
number_one
field overrides the index level setting to enable coercion. -
This document will be rejected because the
number_two
field inherits the index-level coercion setting.
copy_to
The copy_to
parameter allows you to create custom
_all
fields. In other words, the values of multiple
fields can be copied into a group field, which can then be queried as a single
field. For instance, the first_name
and last_name
fields can be copied to
the full_name
field as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"first_name": {
"type": "text",
"copy_to": "full_name" (1)
},
"last_name": {
"type": "text",
"copy_to": "full_name" (1)
},
"full_name": {
"type": "text"
}
}
}
}
}
PUT my_index/_doc/1
{
"first_name": "John",
"last_name": "Smith"
}
GET my_index/_search
{
"query": {
"match": {
"full_name": { (2)
"query": "John Smith",
"operator": "and"
}
}
}
}
-
The values of the
first_name
andlast_name
fields are copied to thefull_name
field. -
The
first_name
andlast_name
fields can still be queried for the first name and last name respectively, but thefull_name
field can be queried for both first and last names.
Some important points:
-
It is the field value which is copied, not the terms (which result from the analysis process).
-
The original
_source
field will not be modified to show the copied values. -
The same value can be copied to multiple fields, with
"copy_to": [ "field_1", "field_2" ]
-
You cannot copy recursively via intermediary fields such as a
copy_to
onfield_1
tofield_2
andcopy_to
onfield_2
tofield_3
expecting indexing intofield_1
to end up infield_3
. Instead, use copy_to on the originating field to copy to multiple fields directly.
doc_values
Most fields are indexed by default, which makes them searchable. The inverted index allows queries to look up the search term in a unique sorted list of terms, and from that immediately have access to the list of documents that contain the term.
Sorting, aggregations, and access to field values in scripts requires a different data access pattern. Instead of looking up the term and finding documents, we need to be able to look up the document and find the terms that it has in a field.
Doc values are the on-disk data structure, built at document index time, which
makes this data access pattern possible. They store the same values as the
_source
but in a column-oriented fashion that is way more efficient for
sorting and aggregations. Doc values are supported on almost all field types,
with the notable exception of analyzed
string fields.
All fields which support doc values have them enabled by default. If you are sure that you don’t need to sort or aggregate on a field, or access the field value from a script, you can disable doc values in order to save disk space:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"status_code": { (1)
"type": "keyword"
},
"session_id": { (2)
"type": "keyword",
"doc_values": false
}
}
}
}
}
-
The
status_code
field hasdoc_values
enabled by default. -
The
session_id
hasdoc_values
disabled, but can still be queried.
dynamic
By default, fields can be added dynamically to a document, or to inner objects within a document, just by indexing a document containing the new field. For instance:
PUT my_index/_doc/1 (1)
{
"username": "johnsmith",
"name": {
"first": "John",
"last": "Smith"
}
}
GET my_index/_mapping (2)
PUT my_index/_doc/2 (3)
{
"username": "marywhite",
"email": "mary@white.com",
"name": {
"first": "Mary",
"middle": "Alice",
"last": "White"
}
}
GET my_index/_mapping (4)
-
This document introduces the string field
username
, the object fieldname
, and two string fields under thename
object which can be referred to asname.first
andname.last
. -
Check the mapping to verify the above.
-
This document adds two string fields:
email
andname.middle
. -
Check the mapping to verify the changes.
The details of how new fields are detected and added to the mapping is explained in Dynamic Mapping.
The dynamic
setting controls whether new fields can be added dynamically or
not. It accepts three settings:
true
|
Newly detected fields are added to the mapping. (default) |
false
|
Newly detected fields are ignored. These fields will not be indexed so will not be searchable
but will still appear in the _source field. |
strict
|
If new fields are detected, an exception is thrown and the document is rejected. New fields must be explicitly added to the mapping. |
The dynamic
setting may be set at the mapping type level, and on each
inner object. Inner objects inherit the setting from their parent
object or from the mapping type. For instance:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic": false, (1)
"properties": {
"user": { (2)
"properties": {
"name": {
"type": "text"
},
"social_networks": { (3)
"dynamic": true,
"properties": {}
}
}
}
}
}
}
}
-
Dynamic mapping is disabled at the type level, so no new top-level fields will be added dynamically.
-
The
user
object inherits the type-level setting. -
The
user.social_networks
object enables dynamic mapping, so new fields may be added to this inner object.
Tip
|
The dynamic setting can be updated on existing fields
using the PUT mapping API.
|
enabled
Elasticsearch tries to index all of the fields you give it, but sometimes you want to just store the field without indexing it. For instance, imagine that you are using Elasticsearch as a web session store. You may want to index the session ID and last update time, but you don’t need to query or run aggregations on the session data itself.
The enabled
setting, which can be applied only to the mapping type and to
object
fields, causes Elasticsearch to skip parsing of the
contents of the field entirely. The JSON can still be retrieved from the
_source
field, but it is not searchable or stored
in any other way:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"user_id": {
"type": "keyword"
},
"last_updated": {
"type": "date"
},
"session_data": { (1)
"enabled": false
}
}
}
}
}
PUT my_index/_doc/session_1
{
"user_id": "kimchy",
"session_data": { (2)
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
PUT my_index/_doc/session_2
{
"user_id": "jpountz",
"session_data": "none", (3)
"last_updated": "2015-12-06T18:22:13"
}
-
The
session_data
field is disabled. -
Any arbitrary data can be passed to the
session_data
field as it will be entirely ignored. -
The
session_data
will also ignore values that are not JSON objects.
The entire mapping type may be disabled as well, in which case the document is
stored in the _source
field, which means it can be
retrieved, but none of its contents are indexed in any way:
PUT my_index
{
"mappings": {
"_doc": { (1)
"enabled": false
}
}
}
PUT my_index/_doc/session_1
{
"user_id": "kimchy",
"session_data": {
"arbitrary_object": {
"some_array": [ "foo", "bar", { "baz": 2 } ]
}
},
"last_updated": "2015-12-06T18:20:22"
}
GET my_index/_doc/session_1 (2)
GET my_index/_mapping (3)
-
The entire
_doc
mapping type is disabled. -
The document can be retrieved.
-
Checking the mapping reveals that no fields have been added.
The enabled
setting for existing fields and the top-level mapping
definition cannot be updated.
eager_global_ordinals
What are global ordinals?
To support aggregations and other operations that require looking up field
values on a per-document basis, Elasticsearch uses a data structure called
doc values. Term-based field types such as keyword
store
their doc values using an ordinal mapping for a more compact representation.
This mapping works by assigning each term an incremental integer or 'ordinal'
based on its lexicographic order. The field’s doc values store only the
ordinals for each document instead of the original terms, with a separate
lookup structure to convert between ordinals and terms.
When used during aggregations, ordinals can greatly improve performance. As an
example, the terms
aggregation relies only on ordinals to collect documents
into buckets at the shard-level, then converts the ordinals back to their
original term values when combining results across shards.
Each index segment defines its own ordinal mapping, but aggregations collect data across an entire shard. So to be able to use ordinals for shard-level operations like aggregations, Elasticsearch creates a unified mapping called 'global ordinals'. The global ordinal mapping is built on top of segment ordinals, and works by maintaining a map from global ordinal to the local ordinal for each segment.
Global ordinals are used if a search contains any of the following components:
-
Certain bucket aggregations on
keyword
,ip
, andflattened
fields. This includesterms
aggregations as mentioned above, as well ascomposite
,diversified_sampler
, andsignificant_terms
. -
Bucket aggregations on
text
fields that requirefielddata
to be enabled. -
Operations on parent and child documents from a
join
field, includinghas_child
queries andparent
aggregations.
Note
|
The global ordinal mapping is an on-heap data structure. When measuring
memory usage, Elasticsearch counts the memory from global ordinals as
'fielddata'. Global ordinals memory is included in the
fielddata circuit breaker, and is returned
under fielddata in the node stats response.
|
Loading global ordinals
The global ordinal mapping must be built before ordinals can be used during a search. By default, the mapping is loaded during search on the first time that global ordinals are needed. This is the right approach if you are optimizing for indexing speed, but if search performance is a priority, it’s recommended to eagerly load global ordinals on fields that will be used in aggregations:
PUT my_index/_mapping/_doc
{
"properties": {
"tags": {
"type": "keyword",
"eager_global_ordinals": true
}
}
}
When eager_global_ordinals
is enabled, global ordinals are built when a shard
is refreshed — Elasticsearch always loads them before
exposing changes to the content of the index. This shifts the cost of building
global ordinals from search to index-time. Elasticsearch will also eagerly
build global ordinals when creating a new copy of a shard, as can occur when
increasing the number of replicas or relocating a shard onto a new node.
Eager loading can be disabled at any time by updating the eager_global_ordinals
setting:
PUT my_index/_mapping/_doc
{
"properties": {
"tags": {
"type": "keyword",
"eager_global_ordinals": false
}
}
}
Important
|
On a frozen index, global ordinals are discarded
after each search and rebuilt again when they’re requested. This means that
eager_global_ordinals should not be used on frozen indices: it would
cause global ordinals to be reloaded on every search. Instead, the index should
be force-merged to a single segment before being frozen. This avoids building
global ordinals altogether (more details can be found in the next section).
|
Avoiding global ordinal loading
Usually, global ordinals do not present a large overhead in terms of their loading time and memory usage. However, loading global ordinals can be expensive on indices with large shards, or if the fields contain a large number of unique term values. Because global ordinals provide a unified mapping for all segments on the shard, they also need to be rebuilt entirely when a new segment becomes visible.
In some cases it is possible to avoid global ordinal loading altogether:
-
The
terms
,sampler
, andsignificant_terms
aggregations support a parameterexecution_hint
that helps control how buckets are collected. It defaults toglobal_ordinals
, but can be set tomap
to instead use the term values directly (see the sketch after this list).
If a shard has been force-merged down to a single segment, then its segment ordinals are already 'global' to the shard. In this case, Elasticsearch does not need to build a global ordinal mapping and there is no additional overhead from using global ordinals. Note that for performance reasons you should only force-merge an index to which you will never write again.
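As a rough sketch of the first option, a terms aggregation can be told to bypass global ordinals with the execution_hint parameter. The request below reuses the tags field from the eager_global_ordinals examples above; the aggregation name is made up for illustration:
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "tag_counts": { (1)
      "terms": {
        "field": "tags",
        "execution_hint": "map" (2)
      }
    }
  }
}
-
A made-up aggregation name, used here only for illustration.
-
Collect buckets from the term values directly instead of building global ordinals.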
fielddata
Most fields are indexed by default, which makes them searchable. Sorting, aggregations, and accessing field values in scripts, however, requires a different access pattern from search.
Search needs to answer the question "Which documents contain this term?", while sorting and aggregations need to answer a different question: "What is the value of this field for this document?".
Most fields can use index-time, on-disk doc_values
for this
data access pattern, but text
fields do not support doc_values
.
Instead, text
fields use a query-time in-memory data structure called
fielddata
. This data structure is built on demand the first time that a
field is used for aggregations, sorting, or in a script. It is built by
reading the entire inverted index for each segment from disk, inverting the
term ↔︎ document relationship, and storing the result in memory, in the JVM
heap.
Fielddata is disabled on text
fields by default
Fielddata can consume a lot of heap space, especially when loading high
cardinality text
fields. Once fielddata has been loaded into the heap, it
remains there for the lifetime of the segment. Also, loading fielddata is an
expensive process which can cause users to experience latency hits. This is
why fielddata is disabled by default.
If you try to sort, aggregate, or access values from a script on a text
field, you will see this exception:
Fielddata is disabled on text fields by default. Set `fielddata=true` on [`your_field_name`] in order to load fielddata in memory by uninverting the inverted index. Note that this can however use significant memory.
Before enabling fielddata
Before you enable fielddata, consider why you are using a text
field for
aggregations, sorting, or in a script. It usually doesn’t make sense to do
so.
A text field is analyzed before indexing so that a value like
New York
can be found by searching for new
or for york
. A terms
aggregation on this field will return a new
bucket and a york
bucket, when
you probably want a single bucket called New York
.
Instead, you should have a text
field for full text searches, and an
unanalyzed keyword
field with doc_values
enabled for aggregations, as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"my_field": { (1)
"type": "text",
"fields": {
"keyword": { (2)
"type": "keyword"
}
}
}
}
}
}
}
-
Use the
my_field
field for searches. -
Use the
my_field.keyword
field for aggregations, sorting, or in scripts.
Enabling fielddata on text
fields
You can enable fielddata on an existing text
field using the
PUT mapping API as follows:
PUT my_index/_mapping/_doc
{
"properties": {
"my_field": { (1)
"type": "text",
"fielddata": true
}
}
}
-
The mapping that you specify for
my_field
should consist of the existing mapping for that field, plus thefielddata
parameter.
fielddata_frequency_filter
Fielddata filtering can be used to reduce the number of terms loaded into memory, and thus reduce memory usage. Terms can be filtered by frequency:
The frequency filter allows you to only load terms whose document frequency falls
between a min
and max
value, which can be expressed as an absolute
number (when the number is bigger than 1.0) or as a percentage
(eg 0.01
is 1%
and 1.0
is 100%
). Frequency is calculated
per segment. Percentages are based on the number of docs which have a
value for the field, as opposed to all docs in the segment.
Small segments can be excluded completely by specifying the minimum
number of docs that the segment should contain with min_segment_size
:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"tag": {
"type": "text",
"fielddata": true,
"fielddata_frequency_filter": {
"min": 0.001,
"max": 0.1,
"min_segment_size": 500
}
}
}
}
}
}
format
In JSON documents, dates are represented as strings. Elasticsearch uses a set of preconfigured formats to recognize and parse these strings into a long value representing milliseconds-since-the-epoch in UTC.
Besides the built-in formats, your own
custom formats can be specified using the familiar
yyyy/MM/dd
syntax:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"date": {
"type": "date",
"format": "yyyy-MM-dd"
}
}
}
}
}
Many APIs which support date values also support date math
expressions, such as now-1M/d, meaning the current time, minus one month, rounded down to the nearest day.
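For example, a range query on the date field mapped above could use such an expression. This is only a sketch, and the lower bound chosen here is arbitrary:
GET my_index/_search
{
  "query": {
    "range": {
      "date": {
        "gte": "now-1M/d" (1)
      }
    }
  }
}
-
Matches documents whose date is on or after the start of the day one month ago.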
Custom date formats
Completely customizable date formats are supported. The syntax for these is explained in the Joda docs.
Built In Formats
Most of the below formats have a strict
companion format, which means that
year, month, and day parts of the date must use 4, 2, and 2 digits respectively,
padding with zeros where needed. For instance a date like 5/11/1
would
be considered invalid and would need to be rewritten to 2005/11/01
to be
accepted by the date parser.
To use them, you need to prepend strict_
to the name of the date format, for
instance strict_date_optional_time
instead of date_optional_time
.
These strict date formats are especially useful when date fields are dynamically mapped in order to make sure to not accidentally map irrelevant strings as dates.
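As a minimal sketch, a date field that only accepts strictly formatted values could be mapped as follows (the index and field names are reused from the example above purely for illustration):
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "date": {
          "type": "date",
          "format": "strict_date_optional_time" (1)
        }
      }
    }
  }
}
-
A value like 2015-01-01 is accepted, while a loosely formatted value like 5/11/1 is rejected.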
The following list describes all the default ISO formats supported:
epoch_millis
-
A formatter for the number of milliseconds since the epoch. Note, that this timestamp is subject to the limits of a Java
Long.MIN_VALUE
andLong.MAX_VALUE
. epoch_second
-
A formatter for the number of seconds since the epoch. Note, that this timestamp is subject to the limits of a Java
Long.MIN_VALUE
andLong.MAX_VALUE
divided by 1000 (the number of milliseconds in a second). date_optional_time
orstrict_date_optional_time
-
A generic ISO datetime parser where the date is mandatory and the time is optional. Full details here.
basic_date
-
A basic formatter for a full date as four digit year, two digit month of year, and two digit day of month:
yyyyMMdd
. basic_date_time
-
A basic formatter that combines a basic date and time, separated by a 'T':
yyyyMMdd'T'HHmmss.SSSZ
. basic_date_time_no_millis
-
A basic formatter that combines a basic date and time without millis, separated by a 'T':
yyyyMMdd'T'HHmmssZ
. basic_ordinal_date
-
A formatter for a full ordinal date, using a four digit year and three digit dayOfYear:
yyyyDDD
. basic_ordinal_date_time
-
A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear:
yyyyDDD'T'HHmmss.SSSZ
. basic_ordinal_date_time_no_millis
-
A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear:
yyyyDDD'T'HHmmssZ
. basic_time
-
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone offset:
HHmmss.SSSZ
. basic_time_no_millis
-
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset:
HHmmssZ
. basic_t_time
-
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit millis, and time zone off set prefixed by 'T':
'T'HHmmss.SSSZ
. basic_t_time_no_millis
-
A basic formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by 'T':
'T'HHmmssZ
. basic_week_date
orstrict_basic_week_date
-
A basic formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week:
xxxx'W'wwe
. basic_week_date_time
orstrict_basic_week_date_time
-
A basic formatter that combines a basic weekyear date and time, separated by a 'T':
xxxx'W'wwe'T'HHmmss.SSSZ
. basic_week_date_time_no_millis
orstrict_basic_week_date_time_no_millis
-
A basic formatter that combines a basic weekyear date and time without millis, separated by a 'T':
xxxx'W'wwe'T'HHmmssZ
. date
orstrict_date
-
A formatter for a full date as four digit year, two digit month of year, and two digit day of month:
yyyy-MM-dd
. date_hour
orstrict_date_hour
-
A formatter that combines a full date and two digit hour of day:
yyyy-MM-dd'T'HH
. date_hour_minute
orstrict_date_hour_minute
-
A formatter that combines a full date, two digit hour of day, and two digit minute of hour:
yyyy-MM-dd'T'HH:mm
. date_hour_minute_second
orstrict_date_hour_minute_second
-
A formatter that combines a full date, two digit hour of day, two digit minute of hour, and two digit second of minute:
yyyy-MM-dd'T'HH:mm:ss
. date_hour_minute_second_fraction
orstrict_date_hour_minute_second_fraction
-
A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second:
yyyy-MM-dd'T'HH:mm:ss.SSS
. date_hour_minute_second_millis
orstrict_date_hour_minute_second_millis
-
A formatter that combines a full date, two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second:
yyyy-MM-dd'T'HH:mm:ss.SSS
. date_time
orstrict_date_time
-
A formatter that combines a full date and time, separated by a 'T':
yyyy-MM-dd'T'HH:mm:ss.SSSZZ
. date_time_no_millis
orstrict_date_time_no_millis
-
A formatter that combines a full date and time without millis, separated by a 'T':
yyyy-MM-dd'T'HH:mm:ssZZ
. hour
orstrict_hour
-
A formatter for a two digit hour of day:
HH
hour_minute
orstrict_hour_minute
-
A formatter for a two digit hour of day and two digit minute of hour:
HH:mm
. hour_minute_second
orstrict_hour_minute_second
-
A formatter for a two digit hour of day, two digit minute of hour, and two digit second of minute:
HH:mm:ss
. hour_minute_second_fraction
orstrict_hour_minute_second_fraction
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second:
HH:mm:ss.SSS
. hour_minute_second_millis
orstrict_hour_minute_second_millis
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and three digit fraction of second:
HH:mm:ss.SSS
. ordinal_date
orstrict_ordinal_date
-
A formatter for a full ordinal date, using a four digit year and three digit dayOfYear:
yyyy-DDD
. ordinal_date_time
orstrict_ordinal_date_time
-
A formatter for a full ordinal date and time, using a four digit year and three digit dayOfYear:
yyyy-DDD'T'HH:mm:ss.SSSZZ
. ordinal_date_time_no_millis
orstrict_ordinal_date_time_no_millis
-
A formatter for a full ordinal date and time without millis, using a four digit year and three digit dayOfYear:
yyyy-DDD'T'HH:mm:ssZZ
. time
orstrict_time
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset:
HH:mm:ss.SSSZZ
. time_no_millis
orstrict_time_no_millis
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset:
HH:mm:ssZZ
. t_time
orstrict_t_time
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, three digit fraction of second, and time zone offset prefixed by 'T':
'T'HH:mm:ss.SSSZZ
. t_time_no_millis
orstrict_t_time_no_millis
-
A formatter for a two digit hour of day, two digit minute of hour, two digit second of minute, and time zone offset prefixed by 'T':
'T'HH:mm:ssZZ
. week_date
orstrict_week_date
-
A formatter for a full date as four digit weekyear, two digit week of weekyear, and one digit day of week:
xxxx-'W'ww-e
. week_date_time
orstrict_week_date_time
-
A formatter that combines a full weekyear date and time, separated by a 'T':
xxxx-'W'ww-e'T'HH:mm:ss.SSSZZ
. week_date_time_no_millis
orstrict_week_date_time_no_millis
-
A formatter that combines a full weekyear date and time without millis, separated by a 'T':
xxxx-'W'ww-e'T'HH:mm:ssZZ
. weekyear
orstrict_weekyear
-
A formatter for a four digit weekyear:
xxxx
. weekyear_week
orstrict_weekyear_week
-
A formatter for a four digit weekyear and two digit week of weekyear:
xxxx-'W'ww
. weekyear_week_day
orstrict_weekyear_week_day
-
A formatter for a four digit weekyear, two digit week of weekyear, and one digit day of week:
xxxx-'W'ww-e
. year
orstrict_year
-
A formatter for a four digit year:
yyyy
. year_month
orstrict_year_month
-
A formatter for a four digit year and two digit month of year:
yyyy-MM
. year_month_day
orstrict_year_month_day
-
A formatter for a four digit year, two digit month of year, and two digit day of month:
yyyy-MM-dd
.
ignore_above
Strings longer than the ignore_above
setting will not be indexed or stored.
For arrays of strings, ignore_above
will be applied for each array element separately and string elements longer than ignore_above
will not be indexed or stored.
Note
|
All strings/array elements will still be present in the _source field, if the latter is enabled, which is the default in Elasticsearch.
|
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"message": {
"type": "keyword",
"ignore_above": 20 (1)
}
}
}
}
}
PUT my_index/_doc/1 (2)
{
"message": "Syntax error"
}
PUT my_index/_doc/2 (3)
{
"message": "Syntax error with some long stacktrace"
}
GET _search (4)
{
"aggs": {
"messages": {
"terms": {
"field": "message"
}
}
}
}
-
This field will ignore any string longer than 20 characters.
-
This document is indexed successfully.
-
This document will be indexed, but without indexing the
message
field. -
Search returns both documents, but only the first is present in the terms aggregation.
Tip
|
The ignore_above setting is allowed to have different settings for
fields of the same name in the same index. Its value can be updated on
existing fields using the PUT mapping API.
|
This option is also useful for protecting against Lucene’s term byte-length
limit of 32766
.
Note
|
The value for ignore_above is the character count, but Lucene counts
bytes. If you use UTF-8 text with many non-ASCII characters, you may want to
set the limit to 32766 / 4 = 8191 since UTF-8 characters may occupy at most
4 bytes.
|
ignore_malformed
Sometimes you don’t have much control over the data that you receive. One
user may send a login
field that is a date
, and another sends a
login
field that is an email address.
Trying to index the wrong datatype into a field throws an exception by
default, and rejects the whole document. The ignore_malformed
parameter, if
set to true
, allows the exception to be ignored. The malformed field is not
indexed, but other fields in the document are processed normally.
For example:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"number_one": {
"type": "integer",
"ignore_malformed": true
},
"number_two": {
"type": "integer"
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Some text value",
"number_one": "foo" (1)
}
PUT my_index/_doc/2
{
"text": "Some text value",
"number_two": "foo" (2)
}
-
This document will have the
text
field indexed, but not thenumber_one
field. -
This document will be rejected because
number_two
does not allow malformed values.
Tip
|
The ignore_malformed setting is allowed to have different settings for
fields of the same name in the same index. Its value can be updated on
existing fields using the PUT mapping API.
|
Index-level default
The index.mapping.ignore_malformed
setting can be set on the index level to
ignore malformed content globally across all mapping types:
PUT my_index
{
"settings": {
"index.mapping.ignore_malformed": true (1)
},
"mappings": {
"_doc": {
"properties": {
"number_one": { (1)
"type": "byte"
},
"number_two": {
"type": "integer",
"ignore_malformed": false (2)
}
}
}
}
}
-
The
number_one
field inherits the index-level setting. -
The
number_two
field overrides the index-level setting to turn offignore_malformed
.
Dealing with malformed fields
Malformed fields are silently ignored at indexing time when ignore_malformed
is turned on. Whenever possible it is recommended to keep the number of
documents that have a malformed field contained, or queries on this field will
become meaningless. Elasticsearch makes it easy to check how many documents
have malformed fields by using exists
,term
or terms
queries on the special
_ignored
field.
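For example, a minimal sketch of such a check with an exists query on the _ignored metadata field (using the my_index example above, where ignore_malformed is enabled on number_one):
GET my_index/_search
{
  "query": {
    "exists": {
      "field": "_ignored" (1)
    }
  }
}
-
Returns every document in which at least one field was ignored at index time because it was malformed.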
Limits for JSON Objects
You can’t use ignore_malformed
with the nested, object, and range datatypes.
You also can’t use ignore_malformed
to ignore JSON objects submitted to fields
of the wrong datatype. A JSON object is any data surrounded by curly brackets
"{}"
and includes data mapped to the nested, object, and range datatypes.
If you submit a JSON object to an unsupported field, Elasticsearch will return an error
and reject the entire document regardless of the ignore_malformed
setting.
index
The index
option controls whether field values are indexed. It accepts true
or false
and defaults to true
. Fields that are not indexed are not queryable.
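As an illustrative sketch, the mapping below disables indexing on a single field; the session_token field name is an assumption chosen for this example:
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "session_token": { (1)
          "type": "keyword",
          "index": false
        }
      }
    }
  }
}
-
The session_token value is still returned as part of _source, but queries against this field are rejected because it is not indexed.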
index_options
The index_options
parameter controls what information is added to the
inverted index, for search and highlighting purposes. It accepts the
following settings:
docs
|
Only the doc number is indexed. Can answer the question Does this term exist in this field? |
freqs
|
Doc number and term frequencies are indexed. Term frequencies are used to score repeated terms higher than single terms. |
positions
|
Doc number, term frequencies, and term positions (or order) are indexed. Positions can be used for proximity or phrase queries. |
offsets
|
Doc number, term frequencies, positions, and start and end character offsets (which map the term back to the original string) are indexed. Offsets are used by the unified highlighter to speed up highlighting. |
Warning
|
The index_options parameter has been deprecated for Numeric fields in 6.0.0.
|
Analyzed string fields use positions
as the default, and
all other fields use docs
as the default.
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"index_options": "offsets"
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {} (1)
}
}
}
-
The
text
field will use the postings for the highlighting by default becauseoffsets
are indexed.
index_phrases
If enabled, two-term word combinations ('shingles') are indexed into a separate
field. This allows exact phrase queries (no slop) to run more efficiently, at the expense
of a larger index. Note that this works best when stopwords are not removed,
as phrases containing stopwords will not use the subsidiary field and will fall
back to a standard phrase query. Accepts true
or false
(default).
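A minimal sketch of enabling it on a text field (the index and field names here are assumptions):
PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "body_text": {
          "type": "text",
          "index_phrases": true (1)
        }
      }
    }
  }
}
-
Two-term shingles for body_text are indexed into a separate field, speeding up exact (zero slop) phrase queries on it.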
index_prefixes
The index_prefixes
parameter enables the indexing of term prefixes to speed
up prefix searches. It accepts the following optional settings:
min_chars
|
The minimum prefix length to index. Must be greater than 0, and defaults to 2. The value is inclusive. |
max_chars
|
The maximum prefix length to index. Must be less than 20, and defaults to 5. The value is inclusive. |
This example creates a text field using the default prefix length settings:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"body_text": {
"type": "text",
"index_prefixes": { } (1)
}
}
}
}
}
-
An empty settings object will use the default
min_chars
andmax_chars
settings
This example uses custom prefix length settings:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"full_name": {
"type": "text",
"index_prefixes": {
"min_chars" : 1,
"max_chars" : 10
}
}
}
}
}
}
fields
It is often useful to index the same field in different ways for different
purposes. This is the purpose of multi-fields. For instance, a string
field could be mapped as a text
field for full-text
search, and as a keyword
field for sorting or aggregations:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"city": {
"type": "text",
"fields": {
"raw": { (1)
"type": "keyword"
}
}
}
}
}
}
}
PUT my_index/_doc/1
{
"city": "New York"
}
PUT my_index/_doc/2
{
"city": "York"
}
GET my_index/_search
{
"query": {
"match": {
"city": "york" (2)
}
},
"sort": {
"city.raw": "asc" (3)
},
"aggs": {
"Cities": {
"terms": {
"field": "city.raw" (3)
}
}
}
}
-
The
city.raw
field is akeyword
version of thecity
field. -
The
city
field can be used for full text search. -
The
city.raw
field can be used for sorting and aggregations
Note
|
Multi-fields do not change the original _source field.
|
Tip
|
The fields setting is allowed to have different settings for fields of
the same name in the same index. New multi-fields can be added to existing
fields using the PUT mapping API.
|
Multi-fields with multiple analyzers
Another use case of multi-fields is to analyze the same field in different
ways for better relevance. For instance we could index a field with the
standard
analyzer which breaks text up into
words, and again with the english
analyzer
which stems words into their root form:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"text": { (1)
"type": "text",
"fields": {
"english": { (2)
"type": "text",
"analyzer": "english"
}
}
}
}
}
}
}
PUT my_index/_doc/1
{ "text": "quick brown fox" } (3)
PUT my_index/_doc/2
{ "text": "quick brown foxes" } (3)
GET my_index/_search
{
"query": {
"multi_match": {
"query": "quick brown foxes",
"fields": [ (4)
"text",
"text.english"
],
"type": "most_fields" (4)
}
}
}
-
The
text
field uses thestandard
analyzer. -
The
text.english
field uses theenglish
analyzer. -
Index two documents, one with
fox
and the other withfoxes
. -
Query both the
text
andtext.english
fields and combine the scores.
The text
field contains the term fox
in the first document and foxes
in
the second document. The text.english
field contains fox
for both
documents, because foxes
is stemmed to fox
.
The query string is also analyzed by the standard
analyzer for the text
field, and by the english
analyzer for the text.english
field. The
stemmed field allows a query for foxes
to also match the document containing
just fox
. This allows us to match as many documents as possible. By also
querying the unstemmed text
field, we improve the relevance score of the
document which matches foxes
exactly.
norms
Norms store various normalization factors that are later used at query time in order to compute the score of a document relative to a query.
Although useful for scoring, norms also require quite a lot of disk (typically in the order of one byte per document per field in your index, even for documents that don’t have this specific field). As a consequence, if you don’t need scoring on a specific field, you should disable norms on that field. In particular, this is the case for fields that are used solely for filtering or aggregations.
Tip
|
The norms setting must have the same setting for fields of the
same name in the same index. Norms can be disabled on existing fields using
the PUT mapping API.
|
Norms can be disabled (but not reenabled) after the fact, using the PUT mapping API like so:
PUT my_index/_mapping/_doc
{
"properties": {
"title": {
"type": "text",
"norms": false
}
}
}
Note
|
Norms will not be removed instantly, but will be removed as old segments are merged into new segments as you continue indexing new documents. Any score computation on a field that has had norms removed might return inconsistent results since some documents won’t have norms anymore while other documents might still have norms. |
null_value
A null
value cannot be indexed or searched. When a field is set to null
,
(or an empty array or an array of null
values) it is treated as though that
field has no values.
The null_value
parameter allows you to replace explicit null
values with
the specified value so that it can be indexed and searched. For instance:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"status_code": {
"type": "keyword",
"null_value": "NULL" (1)
}
}
}
}
}
PUT my_index/_doc/1
{
"status_code": null
}
PUT my_index/_doc/2
{
"status_code": [] (2)
}
GET my_index/_search
{
"query": {
"term": {
"status_code": "NULL" (3)
}
}
}
-
Replace explicit
null
values with the termNULL
. -
An empty array does not contain an explicit
null
, and so won’t be replaced with thenull_value
. -
A query for
NULL
returns document 1, but not document 2.
Important
|
The null_value needs to be the same datatype as the field. For
instance, a long field cannot have a string null_value .
|
Note
|
The null_value only influences how data is indexed, it doesn’t modify
the _source document.
|
position_increment_gap
Analyzed text fields take term positions
into account, in order to be able to support
proximity or phrase queries.
When indexing text fields with multiple values a "fake" gap is added between
the values to prevent most phrase queries from matching across the values. The
size of this gap is configured using position_increment_gap
and defaults to
100
.
For example:
PUT my_index/_doc/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET my_index/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Lincoln" (1)
}
}
}
}
GET my_index/_search
{
"query": {
"match_phrase": {
"names": {
"query": "Abraham Lincoln",
"slop": 101 (2)
}
}
}
}
-
This phrase query doesn’t match our document which is totally expected.
-
This phrase query matches our document, even though
Abraham
andLincoln
are in separate strings, becauseslop
>position_increment_gap
.
The position_increment_gap
can be specified in the mapping. For instance:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"names": {
"type": "text",
"position_increment_gap": 0 (1)
}
}
}
}
}
PUT my_index/_doc/1
{
"names": [ "John Abraham", "Lincoln Smith"]
}
GET my_index/_search
{
"query": {
"match_phrase": {
"names": "Abraham Lincoln" (2)
}
}
}
-
The first term in the next array element will be 0 terms apart from the last term in the previous array element.
-
The phrase query matches our document, which may seem odd, but it is what we asked for in the mapping.
properties
Type mappings, object
fields and nested
fields
contain sub-fields, called properties
. These properties may be of any
datatype, including object
and nested
. Properties can
be added:
-
explicitly by defining them when creating an index.
-
explicitly by defining them when adding or updating a mapping type with the PUT mapping API.
-
dynamically just by indexing documents containing new fields.
Below is an example of adding properties
to a mapping type, an object
field, and a nested
field:
PUT my_index
{
"mappings": {
"_doc": { (1)
"properties": {
"manager": { (2)
"properties": {
"age": { "type": "integer" },
"name": { "type": "text" }
}
},
"employees": { (3)
"type": "nested",
"properties": {
"age": { "type": "integer" },
"name": { "type": "text" }
}
}
}
}
}
}
PUT my_index/_doc/1 (4)
{
"region": "US",
"manager": {
"name": "Alice White",
"age": 30
},
"employees": [
{
"name": "John Smith",
"age": 34
},
{
"name": "Peter Brown",
"age": 26
}
]
}
-
Properties under the
_doc
mapping type. -
Properties under the
manager
object field. -
Properties under the
employees
nested field. -
An example document which corresponds to the above mapping.
Tip
|
The properties setting is allowed to have different settings for fields
of the same name in the same index. New properties can be added to existing
fields using the PUT mapping API.
|
Dot notation
Inner fields can be referred to in queries, aggregations, etc., using dot notation:
GET my_index/_search
{
"query": {
"match": {
"manager.name": "Alice White"
}
},
"aggs": {
"Employees": {
"nested": {
"path": "employees"
},
"aggs": {
"Employee Ages": {
"histogram": {
"field": "employees.age",
"interval": 5
}
}
}
}
}
}
Important
|
The full path to the inner field must be specified. |
search_analyzer
Usually, the same analyzer should be applied at index time and at search time, to ensure that the terms in the query are in the same format as the terms in the inverted index.
Sometimes, though, it can make sense to use a different analyzer at search
time, such as when using the edge_ngram
tokenizer for autocomplete.
By default, queries will use the analyzer
defined in the field mapping, but
this can be overridden with the search_analyzer
setting:
PUT my_index
{
"settings": {
"analysis": {
"filter": {
"autocomplete_filter": {
"type": "edge_ngram",
"min_gram": 1,
"max_gram": 20
}
},
"analyzer": {
"autocomplete": { (1)
"type": "custom",
"tokenizer": "standard",
"filter": [
"lowercase",
"autocomplete_filter"
]
}
}
}
},
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"analyzer": "autocomplete", (2)
"search_analyzer": "standard" (2)
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Quick Brown Fox" (3)
}
GET my_index/_search
{
"query": {
"match": {
"text": {
"query": "Quick Br", (4)
"operator": "and"
}
}
}
}
-
Analysis settings to define the custom
autocomplete
analyzer. -
The
text
field uses theautocomplete
analyzer at index time, but thestandard
analyzer at search time. -
This field is indexed as the terms: [
q
,qu
,qui
,quic
,quick
,b
,br
,bro
,brow
,brown
,f
,fo
,fox
] -
The query searches for both of these terms: [
quick
,br
]
See Index-Time Search-as-You-Type in the Definitive Guide for a full explanation of this example.
Tip
|
The search_analyzer setting can be updated on existing fields
using the PUT mapping API.
|
similarity
Elasticsearch allows you to configure a scoring algorithm or similarity per
field. The similarity
setting provides a simple way of choosing a similarity
algorithm other than the default BM25
, such as TF/IDF
.
Similarities are mostly useful for text
fields, but can also apply
to other field types.
Custom similarities can be configured by tuning the parameters of the built-in similarities. For more details about these expert options, see the similarity module.
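As a sketch of such tuning, a custom BM25-based similarity could be declared in the index settings and referenced from a field mapping. The my_bm25 similarity name, the my_field field, and the chosen b value below are assumptions for illustration; see the similarity module for the full list of parameters:
PUT my_index
{
  "settings": {
    "index": {
      "similarity": {
        "my_bm25": { (1)
          "type": "BM25",
          "b": 0
        }
      }
    }
  },
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "text",
          "similarity": "my_bm25" (2)
        }
      }
    }
  }
}
-
A custom similarity named my_bm25, based on the built-in BM25 similarity with length normalization disabled (b set to 0).
-
The my_field field uses the custom similarity.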
The only similarities which can be used out of the box, without any further configuration, are:
BM25
-
The Okapi BM25 algorithm. The algorithm used by default in Elasticsearch and Lucene. See {defguide}/pluggable-similarites.html[Pluggable Similarity Algorithms] for more information.
classic
-
The TF/IDF algorithm which used to be the default in Elasticsearch and Lucene. See {defguide}/practical-scoring-function.html[Lucene’s Practical Scoring Function] for more information.
boolean
-
A simple boolean similarity, which is used when full-text ranking is not needed and the score should only be based on whether the query terms match or not. Boolean similarity gives terms a score equal to their query boost.
The similarity
can be set on the field level when a field is first created,
as follows:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"default_field": { (1)
"type": "text"
},
"boolean_sim_field": {
"type": "text",
"similarity": "boolean" (2)
}
}
}
}
}
-
The
default_field
uses theBM25
similarity. -
The
boolean_sim_field
uses theboolean
similarity.
store
By default, field values are indexed to make them searchable, but they are not stored. This means that the field can be queried, but the original field value cannot be retrieved.
Usually this doesn’t matter. The field value is already part of the
_source
field, which is stored by default. If you
only want to retrieve the value of a single field or of a few fields, instead
of the whole _source
, then this can be achieved with
source filtering.
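For instance, a sketch of source filtering that returns only two fields from _source (the field names match the store example that follows):
GET my_index/_search
{
  "_source": [ "title", "date" ], (1)
  "query": {
    "match_all": {}
  }
}
-
Only the title and date parts of _source are returned for each hit.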
In certain situations it can make sense to store
a field. For instance, if
you have a document with a title
, a date
, and a very large content
field, you may want to retrieve just the title
and the date
without having
to extract those fields from a large _source
field:
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"title": {
"type": "text",
"store": true (1)
},
"date": {
"type": "date",
"store": true (1)
},
"content": {
"type": "text"
}
}
}
}
}
PUT my_index/_doc/1
{
"title": "Some short title",
"date": "2015-01-01",
"content": "A very long content field..."
}
GET my_index/_search
{
"stored_fields": [ "title", "date" ] (2)
}
-
The
title
anddate
fields are stored. -
This request will retrieve the values of the
title
anddate
fields.
Note
|
Stored fields returned as arrays
For consistency, stored fields are always returned as an array because there is no way of knowing if the original field value was a single value, multiple values, or an empty array. If you need the original value, you should retrieve it from the _source field instead. |
Another situation where it can make sense to make a field stored is for those
that don’t appear in the _source
field (such as copy_to
fields).
term_vector
Term vectors contain information about the terms produced by the analysis process, including:
-
a list of terms.
-
the position (or order) of each term.
-
the start and end character offsets mapping the term to its origin in the original string.
-
payloads (if they are available) — user-defined binary data associated with each term position.
These term vectors can be stored so that they can be retrieved for a particular document.
The term_vector
setting accepts:
no
|
No term vectors are stored. (default) |
yes
|
Just the terms in the field are stored. |
with_positions
|
Terms and positions are stored. |
with_offsets
|
Terms and character offsets are stored. |
with_positions_offsets
|
Terms, positions, and character offsets are stored. |
with_positions_payloads
|
Terms, positions, and payloads are stored. |
with_positions_offsets_payloads
|
Terms, positions, offsets and payloads are stored. |
The fast vector highlighter requires with_positions_offsets
.
The term vectors API can retrieve whatever is stored (see the sketch at the end of this section).
Warning
|
Setting with_positions_offsets will double the size of a field’s
index.
|
PUT my_index
{
"mappings": {
"_doc": {
"properties": {
"text": {
"type": "text",
"term_vector": "with_positions_offsets"
}
}
}
}
}
PUT my_index/_doc/1
{
"text": "Quick brown fox"
}
GET my_index/_search
{
"query": {
"match": {
"text": "brown fox"
}
},
"highlight": {
"fields": {
"text": {} (1)
}
}
}
-
The fast vector highlighter will be used by default for the
text
field because term vectors are enabled.
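As mentioned above, a sketch of retrieving the stored term vectors for the document indexed in this example; the fields URL parameter simply restricts the response to the text field:
GET my_index/_doc/1/_termvectors?fields=text
The response lists each term of the text field along with its positions and character offsets, since with_positions_offsets was enabled in the mapping.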
Dynamic Mapping
One of the most important features of Elasticsearch is that it tries to get out of your way and let you start exploring your data as quickly as possible. To index a document, you don’t have to first create an index, define a mapping type, and define your fields — you can just index a document and the index, type, and fields will spring to life automatically:
PUT data/_doc/1 (1)
{ "count": 5 }
-
Creates the
data
index, the_doc
mapping type, and a field calledcount
with datatypelong
.
The automatic detection and addition of new fields is called dynamic mapping. The dynamic mapping rules can be customised to suit your purposes with:
- Dynamic field mappings
-
The rules governing dynamic field detection.
- Dynamic templates
-
Custom rules to configure the mapping for dynamically added fields.
Tip
|
Index templates allow you to configure the default mappings, settings and aliases for new indices, whether created automatically or explicitly. |
Dynamic field mapping
By default, when a previously unseen field is found in a document,
Elasticsearch will add the new field to the type mapping. This behaviour can
be disabled, both at the document and at the object
level, by
setting the dynamic
parameter to false
(to ignore new fields) or to strict
(to throw
an exception if an unknown field is encountered).
Assuming dynamic
field mapping is enabled, some simple rules are used to
determine which datatype the field should have:
JSON datatype |
Elasticsearch datatype |
null
|
No field is added. |
true or false
|
boolean field |
floating point number
|
float field |
integer
|
long field |
object
|
object field |
array
|
Depends on the first non-null value in the array. |
string
|
Either a date field (if the value passes date detection), a float or long field (if the value passes numeric detection), or a text field, with a keyword sub-field. |
These are the only field datatypes that are dynamically detected. All other datatypes must be mapped explicitly.
Besides the options listed below, dynamic field mapping rules can be further
customised with dynamic_templates
.
Date detection
If date_detection
is enabled (default), then new string fields are checked
to see whether their contents match any of the date patterns specified in
dynamic_date_formats
. If a match is found, a new date
field is
added with the corresponding format.
The default value for dynamic_date_formats
is:
[ "strict_date_optional_time"
,"yyyy/MM/dd HH:mm:ss Z||yyyy/MM/dd Z"
]
For example:
PUT my_index/_doc/1
{
"create_date": "2015/09/02"
}
GET my_index/_mapping (1)
-
The create_date field has been added as a date field.
Disabling date detection
Dynamic date detection can be disabled by setting date_detection
to false
:
PUT my_index
{
"mappings": {
"_doc": {
"date_detection": false
}
}
}
PUT my_index/_doc/1 (1)
{
"create": "2015/09/02"
}
-
The
create_date
field has been added as atext
field.
Customising detected date formats
Alternatively, the dynamic_date_formats
can be customised to support your
own date formats:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_date_formats": ["MM/dd/yyyy"]
}
}
}
PUT my_index/_doc/1
{
"create_date": "09/25/2015"
}
Numeric detection
While JSON has support for native floating point and integer datatypes, some applications or languages may sometimes render numbers as strings. Usually the correct solution is to map these fields explicitly, but numeric detection (which is disabled by default) can be enabled to do this automatically:
PUT my_index
{
"mappings": {
"_doc": {
"numeric_detection": true
}
}
}
PUT my_index/_doc/1
{
"my_float": "1.0", (1)
"my_integer": "1" (2)
}
-
The my_float field is added as a float field.
-
The my_integer field is added as a long field.
Dynamic templates
Dynamic templates allow you to define custom mappings that can be applied to dynamically added fields based on:
-
the datatype detected by Elasticsearch, with
match_mapping_type
. -
the name of the field, with
match
andunmatch
ormatch_pattern
. -
the full dotted path to the field, with
path_match
andpath_unmatch
.
The original field name {name}
and the detected datatype
{dynamic_type} template variables can be used in
the mapping specification as placeholders.
Important
|
Dynamic field mappings are only added when a field contains a
concrete value — not null or an empty array. This means that if the
null_value option is used in a dynamic_template , it will only be applied
after the first document with a concrete value for the field has been
indexed.
|
Dynamic templates are specified as an array of named objects:
"dynamic_templates": [
{
"my_template_name": { (1)
... match conditions ... (2)
"mapping": { ... } (3)
}
},
...
]
-
The template name can be any string value.
-
The match conditions can include any of:
match_mapping_type
,match
,match_pattern
,unmatch
,path_match
,path_unmatch
. -
The mapping that the matched field should use.
Templates are processed in order — the first matching template wins. When putting new dynamic templates through the put mapping API, all existing templates are overwritten. This allows for dynamic templates to be reordered or deleted after they were initially added.
match_mapping_type
The match_mapping_type
is the datatype detected by the JSON parser. Since
JSON cannot distinguish a long
from an integer
or a double
from
a float
, it will always choose the wider datatype, i.e. long
for integers
and double
for floating-point numbers.
The following datatypes may be automatically detected:
-
boolean
whentrue
orfalse
are encountered. -
date
when date detection is enabled and a string is found that matches any of the configured date formats. -
double
for numbers with a decimal part. -
long
for numbers without a decimal part. -
object
for objects, also called hashes. -
string
for character strings.
*
may also be used in order to match all datatypes.
For example, if we wanted to map all integer fields as integer
instead of
long
, and all string
fields as both text
and keyword
, we
could use the following template:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"integers": {
"match_mapping_type": "long",
"mapping": {
"type": "integer"
}
}
},
{
"strings": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"fields": {
"raw": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
]
}
}
}
PUT my_index/_doc/1
{
"my_integer": 5, (1)
"my_string": "Some string" (2)
}
-
The
my_integer
field is mapped as aninteger
. -
The
my_string
field is mapped as atext
, with akeyword
multi field.
match
and unmatch
The match
parameter uses a pattern to match on the field name, while
unmatch
uses a pattern to exclude fields matched by match
.
The following example matches all string
fields whose name starts with
long_
(except for those which end with _text
) and maps them as long
fields:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"longs_as_strings": {
"match_mapping_type": "string",
"match": "long_*",
"unmatch": "*_text",
"mapping": {
"type": "long"
}
}
}
]
}
}
}
PUT my_index/_doc/1
{
"long_num": "5", (1)
"long_text": "foo" (2)
}
-
The
long_num
field is mapped as along
. -
The
long_text
field uses the defaultstring
mapping.
match_pattern
The match_pattern
parameter adjusts the behavior of the match
parameter
such that it supports full Java regular expression matching on the field name
instead of simple wildcards, for instance:
"match_pattern": "regex",
"match": "^profit_\d+$"
path_match
and path_unmatch
The path_match
and path_unmatch
parameters work in the same way as match
and unmatch
, but operate on the full dotted path to the field, not just the
final name, e.g. some_object.*.some_field
.
This example copies the values of any fields in the name
object to the
top-level full_name
field, except for the middle
field:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"full_name": {
"path_match": "name.*",
"path_unmatch": "*.middle",
"mapping": {
"type": "text",
"copy_to": "full_name"
}
}
}
]
}
}
}
PUT my_index/_doc/1
{
"name": {
"first": "John",
"middle": "Winston",
"last": "Lennon"
}
}
Note that the path_match
and path_unmatch
parameters match on object paths
in addition to leaf fields. As an example, indexing the following document will
result in an error because the path_match
setting also matches the object
field name.title
, which can’t be mapped as text:
PUT my_index/_doc/2
{
"name": {
"first": "Paul",
"last": "McCartney",
"title": {
"value": "Sir",
"category": "order of chivalry"
}
}
}
{name} and {dynamic_type}
The {name} and {dynamic_type} placeholders are replaced in the mapping with the field name and detected dynamic type. The following example sets all string fields to use an analyzer with the same name as the field, and disables doc_values for all non-string fields:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"named_analyzers": {
"match_mapping_type": "string",
"match": "*",
"mapping": {
"type": "text",
"analyzer": "{name}"
}
}
},
{
"no_doc_values": {
"match_mapping_type":"*",
"mapping": {
"type": "{dynamic_type}",
"doc_values": false
}
}
}
]
}
}
}
PUT my_index/_doc/1
{
"english": "Some English text", (1)
"count": 5 (2)
}
(1) The english field is mapped as a string field with the english analyzer.
(2) The count field is mapped as a long field with doc_values disabled.
Template examples
Here are some examples of potentially useful dynamic templates:
Structured search
By default, Elasticsearch will map string fields as a text field with a sub keyword field. However, if you are only indexing structured content and are not interested in full text search, you can make Elasticsearch map your fields only as keyword fields. Note that this means that in order to search those fields, you will have to search on the exact same value that was indexed (as illustrated after the template below).
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"strings_as_keywords": {
"match_mapping_type": "string",
"mapping": {
"type": "keyword"
}
}
}
]
}
}
}
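With strings mapped only as keyword, queries must supply the exact value that was indexed. The sketch below uses a hypothetical city field; a term query with the exact value matches, while a lowercase or partial value would not:
PUT my_index/_doc/1
{
  "city": "New York"
}
GET my_index/_search
{
  "query": {
    "term": {
      "city": "New York"
    }
  }
}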
text-only mappings for strings
In contrast to the previous example, if the only thing that you care about on your string fields is full-text search, and if you don’t plan on running aggregations, sorting or exact search on your string fields, you could tell Elasticsearch to map them only as text fields (which was the default behaviour before 5.0):
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"strings_as_text": {
"match_mapping_type": "string",
"mapping": {
"type": "text"
}
}
}
]
}
}
}
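Full-text queries then work as usual against the dynamically mapped fields. The sketch below assumes a hypothetical description field; because the match query analyzes the query text, it matches on individual terms rather than the exact indexed value:
PUT my_index/_doc/1
{
  "description": "The quick brown fox"
}
GET my_index/_search
{
  "query": {
    "match": {
      "description": "quick fox"
    }
  }
}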
Disabled norms
Norms are index-time scoring factors. If you do not care about scoring, which would be the case for instance if you never sort documents by score, you could disable the storage of these scoring factors in the index and save some space.
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"strings_as_keywords": {
"match_mapping_type": "string",
"mapping": {
"type": "text",
"norms": false,
"fields": {
"keyword": {
"type": "keyword",
"ignore_above": 256
}
}
}
}
}
]
}
}
}
The sub keyword field appears in this template to be consistent with the default rules of dynamic mappings. Of course, if you do not need it because you don’t need to perform exact search or aggregate on this field, you could remove it as described in the previous section.
Time-series
When doing time series analysis with Elasticsearch, it is common to have many numeric fields that you will often aggregate on but never filter on. In such a case, you could disable indexing on those fields to save disk space and possibly gain some indexing speed:
PUT my_index
{
"mappings": {
"_doc": {
"dynamic_templates": [
{
"unindexed_longs": {
"match_mapping_type": "long",
"mapping": {
"type": "long",
"index": false
}
}
},
{
"unindexed_doubles": {
"match_mapping_type": "double",
"mapping": {
"type": "float", (1)
"index": false
}
}
}
]
}
}
}
(1) Like the default dynamic mapping rules, doubles are mapped as floats, which are usually accurate enough, yet require half the disk space.
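Even with index set to false, these fields can still be aggregated on, because doc_values remain enabled by default for numeric fields. The sketch below assumes a hypothetical cpu_load metric field:
PUT my_index/_doc/1
{
  "cpu_load": 0.85
}
GET my_index/_search
{
  "size": 0,
  "aggs": {
    "avg_cpu_load": {
      "avg": {
        "field": "cpu_load"
      }
    }
  }
}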
_default_ mapping
Deprecated in 6.0.0. See Removal of mapping types.
The _default_ mapping, which will be used as the base mapping for a new mapping type, can be customised by adding a mapping type with the name _default_ to an index, either when creating the index or later on with the PUT mapping API.
The documentation for this feature has been removed as it no longer makes sense in 6.x where there can be only a single type per index.