Metrics Aggregations
The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.
Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output a single numeric metric (e.g. avg) and are called single-value numeric metrics aggregations; others generate multiple metrics (e.g. stats) and are called multi-value numeric metrics aggregations. The distinction between single-value and multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket).
Avg Aggregation
A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing exam grades (between 0 and 100) of students, we can average their scores with:
POST /exams/_search?size=0
{
"aggs" : {
"avg_grade" : { "avg" : { "field" : "grade" } }
}
}
The above aggregation computes the average grade over all documents. The aggregation type is avg and the field setting defines the numeric field of the documents the average will be computed on. The above will return the following:
{
...
"aggregations": {
"avg_grade": {
"value": 75.0
}
}
}
The name of the aggregation (avg_grade above) also serves as the key by which the aggregation result can be retrieved from the returned response.
Script
Computing the average grade based on a script:
POST /exams/_search?size=0
{
"aggs" : {
"avg_grade" : {
"avg" : {
"script" : {
"source" : "doc.grade.value"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:
POST /exams/_search?size=0
{
"aggs" : {
"avg_grade" : {
"avg" : {
"script" : {
"id": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average:
POST /exams/_search?size=0
{
"aggs" : {
"avg_corrected_grade" : {
"avg" : {
"field" : "grade",
"script" : {
"lang": "painless",
"source": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
POST /exams/_search?size=0
{
"aggs" : {
"grade_avg" : {
"avg" : {
"field" : "grade",
"missing": 10 (1)
}
}
}
}
(1) Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
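The effect of the missing parameter can be sketched in plain Python. This is an illustration of the semantics described above, not how Elasticsearch implements the aggregation; the avg_grade helper and the sample documents are hypothetical.

```python
def avg_grade(docs, field="grade", missing=None):
    """Average `field` over docs. Docs lacking the field are skipped
    unless `missing` supplies a substitute value."""
    values = []
    for doc in docs:
        if field in doc:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # treat the doc as if it had this value
    return sum(values) / len(values) if values else None


# Hypothetical sample data: the third document has no grade.
docs = [{"grade": 50}, {"grade": 100}, {"name": "no grade recorded"}]

print(avg_grade(docs))              # 75.0 (third document ignored)
print(avg_grade(docs, missing=10))  # about 53.33 (third document counted as 10)
```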
Weighted Avg Aggregation
A single-value metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
When calculating a regular average, each datapoint has an equal "weight": it contributes equally to the final value. Weighted averages, on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the document, or provided by a script.
As a formula, a weighted average is ∑(value * weight) / ∑(weight).
A regular average can be thought of as a weighted average where every value has an implicit weight of 1.
Parameter Name | Description | Required | Default Value |
---|---|---|---|
value | The configuration for the field or script that provides the values | Required | |
weight | The configuration for the field or script that provides the weights | Required | |
format | The numeric response formatter | Optional | |
value_type | A hint about the values for pure scripts or unmapped fields | Optional | |
The value and weight objects have per-field specific configuration.
Parameters of the value object:
Parameter Name | Description | Required | Default Value |
---|---|---|---|
field | The field that values should be extracted from | Required | |
missing | A value to use if the field is missing entirely | Optional | |
Parameters of the weight object:
Parameter Name | Description | Required | Default Value |
---|---|---|---|
field | The field that weights should be extracted from | Required | |
missing | A weight to use if the field is missing entirely | Optional | |
Examples
If our documents have a "grade" field that holds a 0-100 numeric score, and a "weight" field which holds an arbitrary numeric weight, we can calculate the weighted average using:
POST /exams/_search
{
"size": 0,
"aggs" : {
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "grade"
},
"weight": {
"field": "weight"
}
}
}
}
}
Which yields a response like:
{
...
"aggregations": {
"weighted_grade": {
"value": 70.0
}
}
}
While multiple values-per-field are allowed, only one weight is allowed. If the aggregation encounters a document that has more than one weight (e.g. the weight field is a multi-valued field) it will throw an exception. If you have this situation, you will need to specify a script for the weight field, and use the script to combine the multiple values into a single value to be used.
This single weight will be applied independently to each value extracted from the value field.
This example shows how a single document with multiple values will be averaged with a single weight:
POST /exams/_doc?refresh
{
"grade": [1, 2, 3],
"weight": 2
}
POST /exams/_search
{
"size": 0,
"aggs" : {
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "grade"
},
"weight": {
"field": "weight"
}
}
}
}
}
The three values (1, 2, and 3) will be included as independent values, all with the weight of 2:
{
...
"aggregations": {
"weighted_grade": {
"value": 2.0
}
}
}
The aggregation returns 2.0 as the result, which matches what we would expect when calculating by hand:
((1*2) + (2*2) + (3*2)) / (2+2+2) == 2
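The same hand calculation can be written as a small Python sketch. It only mirrors the rules stated above (each value of a multi-valued field is weighted by the document's single weight, and more than one weight is an error); the weighted_avg helper is hypothetical and is not the Elasticsearch implementation.

```python
def weighted_avg(docs, value_field="grade", weight_field="weight"):
    """Compute sum(value * weight) / sum(weight) over all documents."""
    numerator = 0.0
    denominator = 0.0
    for doc in docs:
        values = doc[value_field]
        if not isinstance(values, list):
            values = [values]  # normalise single values to a list
        weight = doc[weight_field]
        if isinstance(weight, list):
            # Mirrors the rule above: only one weight per document.
            raise ValueError("only one weight per document is allowed")
        for value in values:
            numerator += value * weight
            denominator += weight
    return numerator / denominator


print(weighted_avg([{"grade": [1, 2, 3], "weight": 2}]))  # 2.0
```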
Script
Both the value and the weight can be derived from a script, instead of a field. As a simple example, the following will add one to the grade and weight in the document using a script:
POST /exams/_search
{
"size": 0,
"aggs" : {
"weighted_grade": {
"weighted_avg": {
"value": {
"script": "doc.grade.value + 1"
},
"weight": {
"script": "doc.weight.value + 1"
}
}
}
}
}
Missing values
The missing parameter defines how documents that are missing a value should be treated. The default behavior is different for value and weight:
By default, if the value field is missing the document is ignored and the aggregation moves on to the next document.
If the weight field is missing, it is assumed to have a weight of 1 (like a normal average).
Both of these defaults can be overridden with the missing parameter:
POST /exams/_search
{
"size": 0,
"aggs" : {
"weighted_grade": {
"weighted_avg": {
"value": {
"field": "grade",
"missing": 2
},
"weight": {
"field": "weight",
"missing": 3
}
}
}
}
}
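A rough Python sketch of these defaults, assuming single-valued fields: a document with no value is skipped unless a substitute is given, and a document with no weight gets a weight of 1 unless overridden. The helper name, its parameters and the sample documents are hypothetical.

```python
def weighted_avg_with_missing(docs, value_missing=None, weight_missing=1):
    """Weighted average with the default missing-value behaviour described above."""
    numerator = 0.0
    denominator = 0.0
    for doc in docs:
        value = doc.get("grade", value_missing)
        if value is None:
            continue  # no value and no substitute: ignore the document
        weight = doc.get("weight", weight_missing)
        numerator += value * weight
        denominator += weight
    return numerator / denominator if denominator else None


# Hypothetical documents: one complete, one without weight, one without grade.
docs = [{"grade": 80, "weight": 2}, {"grade": 50}, {"weight": 4}]

print(weighted_avg_with_missing(docs))  # 70.0 = (80*2 + 50*1) / (2 + 1)
# With "missing": 2 for the value and "missing": 3 for the weight:
print(weighted_avg_with_missing(docs, value_missing=2, weight_missing=3))  # about 35.33
```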
Cardinality Aggregation
A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.
Assume you are indexing store sales and would like to count the unique number of sold products that match a query:
POST /sales/_search?size=0
{
"aggs" : {
"type_count" : {
"cardinality" : {
"field" : "type"
}
}
}
}
Response:
{
...
"aggregations" : {
"type_count" : {
"value" : 3
}
}
}
Precision control
This aggregation also supports the precision_threshold option:
POST /sales/_search?size=0
{
"aggs" : {
"type_count" : {
"cardinality" : {
"field" : "type",
"precision_threshold": 100 (1)
}
}
}
}
(1) The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.
Counts are approximate
Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:
- configurable precision, which decides on how to trade memory for accuracy,
- excellent accuracy on low-cardinality sets,
- fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.
For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
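As a back-of-the-envelope sketch, the c * 8 rule combined with the 40000 cap and the default of 3000 stated earlier gives the following estimate. The function name is hypothetical and the result is only the approximation quoted above, not a measured figure.

```python
MAX_PRECISION_THRESHOLD = 40000      # thresholds above this behave like 40000
DEFAULT_PRECISION_THRESHOLD = 3000


def approx_cardinality_memory_bytes(precision_threshold=DEFAULT_PRECISION_THRESHOLD):
    """Rough memory estimate: about c * 8 bytes for a threshold of c."""
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8


print(approx_cardinality_memory_bytes(100))     # 800
print(approx_cardinality_memory_bytes())        # 24000
print(approx_cardinality_memory_bytes(100000))  # 320000 (capped at 40000)
```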
The following chart shows how the error varies before and after the threshold:
For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.
The HyperLogLog++ algorithm depends on the leading zeros of hashed values, so the exact distribution of hashes in a dataset can affect the accuracy of the cardinality.
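To make the role of leading zeros concrete, here is a simplified sketch of the classic HyperLogLog estimator in Python. It is not the HyperLogLog++ variant used by Elasticsearch: it omits the bias correction and sparse encoding, uses SHA-1 as a stand-in hash, and the parameter p (number of index bits) is a name chosen for this example.

```python
import hashlib
import math


def hll_estimate(values, p=12):
    """Estimate the number of distinct values with 2**p registers."""
    m = 1 << p
    registers = [0] * m
    for value in values:
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")    # 64-bit hash
        index = h >> (64 - p)                    # first p bits pick a register
        rest = h & ((1 << (64 - p)) - 1)         # remaining 64 - p bits
        # Rank = number of leading zeros in `rest`, plus one.
        rank = (64 - p) - rest.bit_length() + 1
        registers[index] = max(registers[index], rank)
    alpha = 0.7213 / (1 + 1.079 / m)
    raw = alpha * m * m / sum(2.0 ** -r for r in registers)
    zeros = registers.count(0)
    if raw <= 2.5 * m and zeros:
        # Small-range correction: fall back to linear counting.
        return m * math.log(m / zeros)
    return raw


# 50,000 inputs but only 10,000 distinct values.
data = [i % 10000 for i in range(50000)]
print(round(hll_estimate(data)))  # close to 10000, not exact
```

Memory use depends only on the number of registers, not on how many values are fed in, which is the fixed-memory property listed above.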
Pre-computed hashes
On string fields that have a high cardinality, it might be faster to store the hash of your field values in your index and then run the cardinality aggregation on this field. This can either be done by providing hash values from client-side or by letting Elasticsearch compute hash values for you by using the mapper-murmur3 plugin.
Note: Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.
Script
The cardinality metric supports scripting, with a noticeable performance hit however, since hashes need to be computed on the fly.
POST /sales/_search?size=0
{
"aggs" : {
"type_promoted_count" : {
"cardinality" : {
"script": {
"lang": "painless",
"source": "doc['type'].value + ' ' + doc['promoted'].value"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:
POST /sales/_search?size=0
{
"aggs" : {
"type_promoted_count" : {
"cardinality" : {
"script" : {
"id": "my_script",
"params": {
"type_field": "type",
"promoted_field": "promoted"
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
POST /sales/_search?size=0
{
"aggs" : {
"tag_cardinality" : {
"cardinality" : {
"field" : "tag",
"missing": "N/A" (1)
}
}
}
}
(1) Documents without a value in the tag field will fall into the same bucket as documents that have the value N/A.
Extended Stats Aggregation
A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
The extended_stats aggregation is an extended version of the stats aggregation, where additional metrics are added such as sum_of_squares, variance, std_deviation and std_deviation_bounds.
Assuming the data consists of documents representing exam grades (between 0 and 100) of students:
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : { "extended_stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0,
"sum_of_squares": 12500.0,
"variance": 625.0,
"std_deviation": 25.0,
"std_deviation_bounds": {
"upper": 125.0,
"lower": 25.0
}
}
}
}
The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.
Standard Deviation Bounds
By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request:
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"sigma" : 3 (1)
}
}
}
}
(1) sigma controls how many standard deviations +/- from the mean should be displayed.
sigma can be any non-negative double, meaning you can request non-integer values such as 1.5. A value of 0 is valid, but will simply return the average for both upper and lower bounds.
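The relationship between these metrics can be checked with a short Python sketch. It assumes the population variance (sum of squares divided by count, minus the squared mean), which reproduces the sample response shown earlier for the grades 50 and 100; the extended_stats function here is an illustration, not the Elasticsearch implementation.

```python
import math


def extended_stats(values, sigma=2.0):
    """Compute the extended stats metrics for a non-empty list of numbers."""
    count = len(values)
    total = float(sum(values))
    sum_of_squares = float(sum(v * v for v in values))
    avg = total / count
    # Population variance; clamp tiny negative rounding errors to zero.
    variance = max(0.0, sum_of_squares / count - avg * avg)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": float(min(values)),
        "max": float(max(values)),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


print(extended_stats([50, 100])["std_deviation_bounds"])           # {'upper': 125.0, 'lower': 25.0}
print(extended_stats([50, 100], sigma=3)["std_deviation_bounds"])  # {'upper': 150.0, 'lower': 0.0}
```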
Note: Standard Deviation and Bounds require normality. The standard deviation and its bounds are displayed by default, but they are not always applicable to all data-sets. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.
Script
Computing the grades stats based on a script:
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"script" : {
"source" : "doc['grade'].value",
"lang" : "painless"
}
}
}
}
}
This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"script" : {
"id": "my_script",
"params": {
"field": "grade"
}
}
}
}
}
}
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"script" : {
"lang" : "painless",
"source": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
GET /exams/_search
{
"size": 0,
"aggs" : {
"grades_stats" : {
"extended_stats" : {
"field" : "grade",
"missing": 0 (1)
}
}
}
}
(1) Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.
Geo Bounds Aggregation
A metric aggregation that computes the bounding box containing all geo_point values for a field.
Example:
PUT /museums
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"query" : {
"match" : { "name" : "musée" }
},
"aggs" : {
"viewport" : {
"geo_bounds" : {
"field" : "location", (1)
"wrap_longitude" : true (2)
}
}
}
}
(1) The geo_bounds aggregation specifies the field to use to obtain the bounds.
(2) wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true.
The above aggregation demonstrates how one would compute the bounding box of the location field for all documents whose name matches musée.
The response for the above aggregation:
{
...
"aggregations": {
"viewport": {
"bounds": {
"top_left": {
"lat": 48.86111099738628,
"lon": 2.3269999679178
},
"bottom_right": {
"lat": 48.85999997612089,
"lon": 2.3363889567553997
}
}
}
}
}
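For points that do not straddle the international date line, the bounding box reduces to the extreme latitudes and longitudes. The Python sketch below makes that assumption explicitly and does not model wrap_longitude; the geo_bounds function and the (lat, lon) tuple format are choices made for this example. The response above shows slightly different trailing digits, presumably because of how geo_point values are stored.

```python
def geo_bounds(points):
    """Bounding box of (lat, lon) pairs, assuming no date-line crossing."""
    if not points:
        raise ValueError("at least one point is required")
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# The two documents matched by the "musée" query above.
paris_museums = [(48.861111, 2.336389), (48.860000, 2.327000)]
print(geo_bounds(paris_museums))
```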
Geo Centroid Aggregation
A metric aggregation that computes the weighted centroid from all coordinate values for a geo_point field.
Example:
PUT /museums
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"aggs" : {
"centroid" : {
"geo_centroid" : {
"field" : "location" (1)
}
}
}
}
(1) The geo_centroid aggregation specifies the field to use for computing the centroid. (NOTE: field must be a geo_point type)
The above aggregation demonstrates how one would compute the centroid of the location field for all documents in the museums index.
The response for the above aggregation:
{
...
"aggregations": {
"centroid": {
"location": {
"lat": 51.00982963107526,
"lon": 3.9662130922079086
},
"count": 6
}
}
}
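For the six sample documents, a plain arithmetic mean of the latitudes and longitudes reproduces the response above to within about 1e-6 degrees. The Python sketch below assumes that simple averaging; the actual Elasticsearch computation may differ in detail, and the geo_centroid function is hypothetical.

```python
def geo_centroid(points):
    """Arithmetic-mean centroid of (lat, lon) pairs."""
    if not points:
        raise ValueError("at least one point is required")
    count = len(points)
    lat = sum(p[0] for p in points) / count
    lon = sum(p[1] for p in points) / count
    return {"location": {"lat": lat, "lon": lon}, "count": count}


museums = [
    (52.374081, 4.912350),  # NEMO Science Museum
    (52.369219, 4.901618),  # Museum Het Rembrandthuis
    (52.371667, 4.914722),  # Nederlands Scheepvaartmuseum
    (51.222900, 4.405200),  # Letterenhuis
    (48.861111, 2.336389),  # Musée du Louvre
    (48.860000, 2.327000),  # Musée d'Orsay
]
print(geo_centroid(museums))
```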
The geo_centroid aggregation is more interesting when combined as a sub-aggregation to other bucket aggregations.
Example:
POST /museums/_search?size=0
{
"aggs" : {
"cities" : {
"terms" : { "field" : "city.keyword" },
"aggs" : {
"centroid" : {
"geo_centroid" : { "field" : "location" }
}
}
}
}
}
The above example uses geo_centroid as a sub-aggregation to a terms bucket aggregation for finding the central location for museums in each city.
The response for the above aggregation:
{
...
"aggregations": {
"cities": {
"sum_other_doc_count": 0,
"doc_count_error_upper_bound": 0,
"buckets": [
{
"key": "Amsterdam",
"doc_count": 3,
"centroid": {
"location": {
"lat": 52.371655656024814,
"lon": 4.909563297405839
},
"count": 3
}
},
{
"key": "Paris",
"doc_count": 2,
"centroid": {
"location": {
"lat": 48.86055548675358,
"lon": 2.3316944623366
},
"count": 2
}
},
{
"key": "Antwerp",
"doc_count": 1,
"centroid": {
"location": {
"lat": 51.22289997059852,
"lon": 4.40519998781383
},
"count": 1
}
}
]
}
}
}
Max Aggregation
A single-value metrics aggregation that keeps track of and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Note: The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.
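The 2^53 limit comes from the 53-bit significand of an IEEE 754 double. The following Python sketch shows the effect using Python's float, which is a double; the max_as_double helper is hypothetical and only mimics the conversion described in the note.

```python
def max_as_double(longs):
    """Maximum of integer values after converting each one to a double."""
    return max(float(v) for v in longs)


values = [2**53, 2**53 + 1]
result = max_as_double(values)

print(result == float(2**53))      # True: 2**53 + 1 rounds to 2**53
print(int(result) == max(values))  # False: the true maximum was lost
```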
Computing the max price value across all documents:
POST /sales/_search?size=0
{
"aggs" : {
"max_price" : { "max" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"max_price": {
"value": 200.0
}
}
}
As can be seen, the name of the aggregation (max_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.
Script
The max aggregation can also calculate the maximum of a script. The example below computes the maximum price:
POST /sales/_search
{
"aggs" : {
"max_price" : {
"max" : {
"script" : {
"source" : "doc.price.value"
}
}
}
}
}
This will use the Painless scripting language and no script parameters. To use a stored script use the following syntax:
POST /sales/_search
{
"aggs" : {
"max_price" : {
"max" : {
"script" : {
"id": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to compute the max in EURO (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:
POST /sales/_search
{
"aggs" : {
"max_price_in_euros" : {
"max" : {
"field" : "price",
"script" : {
"source" : "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
POST /sales/_search
{
"aggs" : {
"grade_max" : {
"max" : {
"field" : "grade",
"missing": 10 (1)
}
}
}
}
(1) Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
Min Aggregation
A single-value metrics aggregation that keeps track of and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Note: The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.
Computing the min price value across all documents:
POST /sales/_search?size=0
{
"aggs" : {
"min_price" : { "min" : { "field" : "price" } }
}
}
Response:
{
...
"aggregations": {
"min_price": {
"value": 10.0
}
}
}
As can be seen, the name of the aggregation (min_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.
Script
The min aggregation can also calculate the minimum of a script. The example below computes the minimum price:
POST /sales/_search
{
"aggs" : {
"min_price" : {
"min" : {
"script" : {
"source" : "doc.price.value"
}
}
}
}
}
This will use the Painless scripting language and no script parameters. To use a stored script use the following syntax:
POST /sales/_search
{
"aggs" : {
"min_price" : {
"min" : {
"script" : {
"id": "my_script",
"params": {
"field": "price"
}
}
}
}
}
}
Value Script
Let’s say that the prices of the documents in our index are in USD, but we would like to compute the min in EURO (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:
POST /sales/_search
{
"aggs" : {
"min_price_in_euros" : {
"min" : {
"field" : "price",
"script" : {
"source" : "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 1.2
}
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
POST /sales/_search
{
"aggs" : {
"grade_min" : {
"min" : {
"field" : "grade",
"missing": 10 (1)
}
}
}
}
(1) Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
Percentiles Aggregation
A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.
Percentiles are often used to find outliers. In normal distributions, the 0.13th and 99.87th percentiles represent three standard deviations from the mean. Any data which falls outside three standard deviations is often considered an anomaly.
When a range of percentiles is retrieved, they can be used to estimate the data distribution and determine if the data is skewed, bimodal, etc.
Assume your data consists of website load times. The average and median load times are not overly useful to an administrator. The max may be interesting, but it can be easily skewed by a single slow response.
Let’s look at a range of percentiles representing load time:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time" (1)
}
}
}
}
(1) The field load_time must be a numeric field.
By default, the percentile metric will generate a range of percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. The response will look like this:
{
...
"aggregations": {
"load_time_outlier": {
"values" : {
"1.0": 5.0,
"5.0": 25.0,
"25.0": 165.0,
"50.0": 445.0,
"75.0": 725.0,
"95.0": 945.0,
"99.0": 985.0
}
}
}
}
As you can see, the aggregation will return a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the webpage normally loads in 5-725ms, but occasionally spikes to 945-985ms.
Often, administrators are only interested in outliers — the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be a value between 0-100 inclusive):
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9] (1)
}
}
}
}
(1) Use the percents parameter to specify particular percentiles to calculate.
Keyed Response
By default the keyed flag is set to true, which associates a unique string key with each percentile and returns the values as a hash rather than an array. Setting the keyed flag to false will disable this behavior:
GET latency/_search
{
"size": 0,
"aggs": {
"load_time_outlier": {
"percentiles": {
"field": "load_time",
"keyed": false
}
}
}
}
Response:
{
...
"aggregations": {
"load_time_outlier": {
"values": [
{
"key": 1.0,
"value": 5.0
},
{
"key": 5.0,
"value": 25.0
},
{
"key": 25.0,
"value": 165.0
},
{
"key": 50.0,
"value": 445.0
},
{
"key": 75.0,
"value": 725.0
},
{
"key": 95.0,
"value": 945.0
},
{
"key": 99.0,
"value": 985.0
}
]
}
}
}
Script
The percentile metric supports scripting. For example, if our load times are in milliseconds but we want percentiles calculated in seconds, we could use a script to convert them on-the-fly:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"lang": "painless",
"source": "doc['load_time'].value / params.timeUnit", (1)
"params" : {
"timeUnit" : 1000 (2)
}
}
}
}
}
}
(1) The field parameter is replaced with a script parameter, which uses the script to generate values which percentiles are calculated on.
(2) Scripting supports parameterized input just like any other script.
This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"script" : {
"id": "my_script",
"params": {
"field": "load_time"
}
}
}
}
}
}
Percentiles are (usually) approximate
There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at my_array[count(my_array) * 0.5].
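The naive approach can be sketched in a few lines of Python. The naive_percentile function and the sample load times are made up for this example; the index is clamped so that the 100th percentile maps to the last element.

```python
def naive_percentile(values, percent):
    """Exact percentile by sorting every value (memory grows with the data)."""
    if not values:
        raise ValueError("at least one value is required")
    ordered = sorted(values)
    # my_array[count(my_array) * percent / 100], clamped to the last index.
    index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
    return ordered[index]


load_times = [120, 5, 985, 445, 725, 25, 165, 945]  # hypothetical data
print(naive_percentile(load_times, 50))   # 445
print(naive_percentile(load_times, 100))  # 985
```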
Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, approximate percentiles are calculated.
The algorithm used by the percentile metric is called TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests).
When using this metric, there are a few guidelines to keep in mind:
- The approximation error is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.
- For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).
- As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated.
The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:
It shows how precision is better for extreme percentiles. The reason why error diminishes for large number of values is that the law of large numbers makes the distribution of values more and more uniform and the t-digest tree can do a better job at summarizing it. It would not be the case on more skewed distributions.
Warning: Percentile aggregations are also non-deterministic. This means you can get slightly different results using the same data.
Compression
Approximate algorithms must balance memory utilization with estimation accuracy.
This balance can be controlled using a compression parameter:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"tdigest": {
"compression" : 200 (1)
}
}
}
}
}
(1) Compression controls memory usage and approximation error.
The TDigest algorithm uses a number of "nodes" to approximate percentiles: the more nodes available, the higher the accuracy (and the larger the memory footprint), proportional to the volume of data. The compression parameter limits the maximum number of nodes to 20 * compression.
Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is 100.
A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of data which arrives sorted and in-order) the default settings will produce a TDigest roughly 64KB in size. In practice data tends to be more random and the TDigest will use less memory.
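The 64KB figure follows directly from the numbers above: at most 20 * compression nodes of roughly 32 bytes each. A small Python sketch of that arithmetic, with a hypothetical function name and the 32-byte node size taken as the rough estimate quoted in the text:

```python
APPROX_NODE_BYTES = 32  # rough per-node size quoted above


def tdigest_worst_case_bytes(compression=100):
    """Worst-case TDigest size: 20 * compression nodes of about 32 bytes."""
    max_nodes = 20 * compression
    return max_nodes * APPROX_NODE_BYTES


print(tdigest_worst_case_bytes())     # 64000, i.e. roughly 64KB
print(tdigest_worst_case_bytes(200))  # 128000
```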
HDR Histogram
Note
|
This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future. |
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentiles for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worse-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
The HDR Histogram can be used by specifying the hdr
object in the request:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_outlier" : {
"percentiles" : {
"field" : "load_time",
"percents" : [95, 99, 99.9],
"hdr": { (1)
"number_of_significant_value_digits" : 3 (2)
}
}
}
}
}
-
hdr
object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object -
number_of_significant_value_digits
specifies the resolution of values for the histogram in number of significant digits
The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.
Missing value
The missing
parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
GET latency/_search
{
"size": 0,
"aggs" : {
"grade_percentiles" : {
"percentiles" : {
"field" : "grade",
"missing": 10 (1)
}
}
}
}
-
Documents without a value in the
grade
field will fall into the same bucket as documents that have the value 10
.
Percentile Ranks Aggregation
A multi-value
metrics aggregation that calculates one or more percentile ranks
over numeric values extracted from the aggregated documents. These values
can be extracted either from specific numeric fields in the documents, or
be generated by a provided script.
Note
|
Please see Percentiles are (usually) approximate and Compression for advice regarding approximation and memory use of the percentile ranks aggregation |
Percentile ranks show the percentage of observed values which are below a certain value. For example, if a value is greater than or equal to 95% of the observed values it is said to be at the 95th percentile rank.
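To make the definition concrete, here is a minimal sketch that computes an exact percentile rank over a small in-memory sample. It is not how the aggregation works internally (the aggregation is approximate), the load times are hypothetical, and the choice of "less than or equal to" as the comparison is an assumption:

```python
from typing import Sequence


def percentile_rank(observed: Sequence[float], value: float) -> float:
    """Percentage of observed values that are less than or equal to `value`."""
    if not observed:
        raise ValueError("observed must not be empty")
    at_or_below = sum(1 for v in observed if v <= value)
    return 100.0 * at_or_below / len(observed)


# Hypothetical page load times in milliseconds.
load_times = [120, 250, 310, 480, 495, 510, 540, 580, 590, 700]

for threshold in (500, 600):
    print(threshold, percentile_rank(load_times, threshold))
```

With this sample, half of the loads complete within 500ms and nine out of ten within 600ms.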
Assume your data consists of website load times. You may have a service agreement that 95% of page loads complete within 500ms and 99% of page loads complete within 600ms.
Let’s look at the percentile ranks for these two load times:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"field" : "load_time", (1)
"values" : [500, 600]
}
}
}
}
-
The field
load_time
must be a numeric field
The response will look like this:
{
...
"aggregations": {
"load_time_ranks": {
"values" : {
"500.0": 90.01,
"600.0": 100.0
}
}
}
}
From this information you can determine you are hitting the 99% load time target but not quite hitting the 95% load time target.
Keyed Response
By default the keyed flag is set to true, which associates a unique string key with each value and returns the values as a hash rather than an array. Setting the keyed flag to false will disable this behavior:
GET latency/_search
{
"size": 0,
"aggs": {
"load_time_ranks": {
"percentile_ranks": {
"field": "load_time",
"values": [500, 600],
"keyed": false
}
}
}
}
Response:
{
...
"aggregations": {
"load_time_ranks": {
"values": [
{
"key": 500.0,
"value": 90.01
},
{
"key": 600.0,
"value": 100.0
}
]
}
}
}
Script
The percentile rank metric supports scripting. For example, if our load times are in milliseconds but we want to specify values in seconds, we could use a script to convert them on-the-fly:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"values" : [500, 600],
"script" : {
"lang": "painless",
"source": "doc['load_time'].value / params.timeUnit", (1)
"params" : {
"timeUnit" : 1000 (2)
}
}
}
}
}
}
-
The
field
parameter is replaced with a script
parameter, which uses the script to generate values which percentile ranks are calculated on -
Scripting supports parameterized input just like any other script
This will interpret the script
parameter as an inline
script with the painless
script language and no script parameters. To use a stored script use the following syntax:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"values" : [500, 600],
"script" : {
"id": "my_script",
"params": {
"field": "load_time"
}
}
}
}
}
}
HDR Histogram
Note
|
This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future. |
HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentile ranks for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
The HDR Histogram can be used by specifying the hdr
object in the request:
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [500, 600],
"hdr": { (1)
"number_of_significant_value_digits" : 3 (2)
}
}
}
}
}
-
hdr
object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object -
number_of_significant_value_digits
specifies the resolution of values for the histogram in number of significant digits
The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.
Missing value
The missing
parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
GET latency/_search
{
"size": 0,
"aggs" : {
"load_time_ranks" : {
"percentile_ranks" : {
"field" : "load_time",
"values" : [500, 600],
"missing": 10 (1)
}
}
}
}
-
Documents without a value in the
load_time
field will fall into the same bucket as documents that have the value 10
.
Scripted Metric Aggregation
A metric aggregation that executes using scripts to provide a metric output.
Example:
POST ledger/_search?size=0
{
"query" : {
"match_all" : {}
},
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : "state.transactions = []",
"map_script" : "state.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)", (1)
"combine_script" : "double profit = 0; for (t in state.transactions) { profit += t } return profit",
"reduce_script" : "double profit = 0; for (a in states) { profit += a } return profit"
}
}
}
}
-
map_script
is the only required parameter
The above aggregation demonstrates how one would use the scripted metric aggregation to compute the total profit from sale and cost transactions.
The response for the above aggregation:
{
"took": 218,
...
"aggregations": {
"profit": {
"value": 240.0
}
}
}
The above example can also be specified using stored scripts as follows:
POST ledger/_search?size=0
{
"aggs": {
"profit": {
"scripted_metric": {
"init_script" : {
"id": "my_init_script"
},
"map_script" : {
"id": "my_map_script"
},
"combine_script" : {
"id": "my_combine_script"
},
"params": {
"field": "amount" (1)
},
"reduce_script" : {
"id": "my_reduce_script"
}
}
}
}
}
-
Script parameters for the init, map and combine scripts must be specified in a global params object so that they can be shared between the scripts.
For more details on specifying scripts see script documentation.
Allowed return types
Whilst any valid script object can be used within a single script, the scripts must return or store in the state
object only the following types:
-
primitive types
-
String
-
Map (containing only keys and values of the types listed here)
-
Array (containing elements of only the types listed here)
Scope of scripts
The scripted metric aggregation uses scripts at 4 stages of its execution:
- init_script
-
Executed prior to any collection of documents. Allows the aggregation to set up any initial state.
In the above example, the init_script creates an array transactions in the state object.
- map_script
-
Executed once per document collected. This is the only required script. If no combine_script is specified, the resulting state needs to be stored in the state object.
In the above example, the map_script checks the value of the type field. If the value is 'sale' the value of the amount field is added to the transactions array. If the value of the type field is not 'sale' the negated value of the amount field is added to transactions.
- combine_script
-
Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from each shard. If a combine_script is not provided the combine phase will return the aggregation variable.
In the above example, the combine_script iterates through all the stored transactions, summing the values in the profit variable and finally returns profit.
- reduce_script
-
Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a variable states which is an array of the result of the combine_script on each shard. If a reduce_script is not provided the reduce phase will return the states variable.
In the above example, the reduce_script iterates through the profit returned by each shard, summing the values before returning the final combined profit which will be returned in the response of the aggregation.
Worked Example
Imagine a situation where you index the following documents into an index with 2 shards:
PUT /transactions/_doc/_bulk?refresh
{"index":{"_id":1}}
{"type": "sale","amount": 80}
{"index":{"_id":2}}
{"type": "cost","amount": 10}
{"index":{"_id":3}}
{"type": "cost","amount": 30}
{"index":{"_id":4}}
{"type": "sale","amount": 130}
Let’s say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is at each stage of the example above.
Before init_script
state
is initialized as a new empty object.
"state" : {}
After init_script
This is run once on each shard before any document collection is performed, and so we will have a copy on each shard:
- Shard A
-
"state" : { "transactions" : [] }
- Shard B
-
"state" : { "transactions" : [] }
After map_script
Each shard collects its documents and runs the map_script on each document that is collected:
- Shard A
-
"state" : { "transactions" : [ 80, -30 ] }
- Shard B
-
"state" : { "transactions" : [ -10, 130 ] }
After combine_script
The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each shard (by summing the values in the transactions array) which is passed back to the coordinating node:
- Shard A
-
50
- Shard B
-
120
After reduce_script
The reduce_script receives a states
array containing the result of the combine script for each shard:
"states" : [
50,
120
]
It reduces the responses for the shards down to a final overall profit figure (by summing the values) and returns this as the result of the aggregation to produce the response:
{
...
"aggregations": {
"profit": {
"value": 170
}
}
}
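The four stages of the worked example can be replayed outside Elasticsearch. The following is a plain Python simulation, not Painless and not the engine's implementation: the function names mirror the script names for readability, and the shard assignment is the one assumed in the text above.

```python
# Documents from the worked example, already split across the two shards.
SHARDS = {
    "A": [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 30}],
    "B": [{"type": "cost", "amount": 10}, {"type": "sale", "amount": 130}],
}


def init_script(state: dict) -> None:
    state["transactions"] = []


def map_script(state: dict, doc: dict) -> None:
    # Sales count as positive amounts, everything else as negative.
    signed = doc["amount"] if doc["type"] == "sale" else -1 * doc["amount"]
    state["transactions"].append(signed)


def combine_script(state: dict) -> float:
    return float(sum(state["transactions"]))


def reduce_script(states: list) -> float:
    return float(sum(states))


def run_scripted_metric(shards: dict) -> float:
    states = []
    for docs in shards.values():
        state = {}                             # before init_script
        init_script(state)                     # once per shard
        for doc in docs:
            map_script(state, doc)             # once per collected document
        states.append(combine_script(state))   # once per shard
    return reduce_script(states)               # once on the coordinating node


print(run_scripted_metric(SHARDS))
```

The per-shard values passed to the reduce step are 50 and 120, matching the states array shown above.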
Other Parameters
params |
Optional. An object whose contents will be passed as variables to the init_script, map_script and combine_script. |
Empty Buckets
If a parent bucket of the scripted metric aggregation does not collect any documents an empty aggregation response will be returned from the
shard with a null
value. In this case the reduce_script’s states
variable will contain null
as a response from that shard.
A reduce_script should therefore expect and deal with null
responses from shards.
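As an illustration of the defensive handling described here, the sketch below shows a reduce step that skips missing shard results. It is written in Python rather than Painless, with None standing in for null, and the function name is hypothetical:

```python
from typing import Optional, Sequence


def reduce_profit(states: Sequence[Optional[float]]) -> float:
    """Sum per-shard results, skipping shards that returned no value."""
    return float(sum(s for s in states if s is not None))


# The middle shard collected no documents and so reported None.
print(reduce_profit([50, None, 120]))
```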
Stats Aggregation
A multi-value
metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
The stats that are returned consist of: min
, max
, sum
, count
and avg
.
Assuming the data consists of documents representing exam grades (between 0 and 100) of students:
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : { "stats" : { "field" : "grade" } }
}
}
The above aggregation computes the grades statistics over all documents. The aggregation type is stats
and the field
setting defines the numeric field of the documents the stats will be computed on. The above will return the following:
{
...
"aggregations": {
"grades_stats": {
"count": 2,
"min": 50.0,
"max": 100.0,
"avg": 75.0,
"sum": 150.0
}
}
}
The name of the aggregation (grades_stats
above) also serves as the key by which the aggregation result can be retrieved from the returned response.
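The five values in the response can be reproduced with a few lines of Python. This sketch assumes the index holds exactly two grades, 50 and 100, which is consistent with the count, min and max shown above but is otherwise an assumption; the function name is hypothetical:

```python
from typing import Dict, Sequence


def stats(values: Sequence[float]) -> Dict[str, float]:
    """Compute the same five statistics the stats aggregation reports."""
    if not values:
        raise ValueError("this sketch expects at least one value")
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


print(stats([50.0, 100.0]))
```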
Script
Computing the grades stats based on a script:
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : {
"stats" : {
"script" : {
"lang": "painless",
"source": "doc['grade'].value"
}
}
}
}
}
This will interpret the script
parameter as an inline
script with the painless
script language and no script parameters. To use a stored script use the following syntax:
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : {
"stats" : {
"script" : {
"id": "my_script",
"params" : {
"field" : "grade"
}
}
}
}
}
}
Value Script
It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"script" : {
"lang": "painless",
"source": "_value * params.correction",
"params" : {
"correction" : 1.2
}
}
}
}
}
}
Missing value
The missing
parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
POST /exams/_search?size=0
{
"aggs" : {
"grades_stats" : {
"stats" : {
"field" : "grade",
"missing": 0 (1)
}
}
}
}
-
Documents without a value in the
grade
field will fall into the same bucket as documents that have the value 0
.
Sum Aggregation
A single-value
metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.
Assuming the data consists of documents representing sales records we can sum the sale price of all hats with:
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"hat_prices" : { "sum" : { "field" : "price" } }
}
}
Resulting in:
{
...
"aggregations": {
"hat_prices": {
"value": 450.0
}
}
}
The name of the aggregation (hat_prices
above) also serves as the key by which the aggregation result can be retrieved from the returned response.
Script
We could also use a script to fetch the sales price:
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"hat_prices" : {
"sum" : {
"script" : {
"source": "doc.price.value"
}
}
}
}
}
This will interpret the script
parameter as an inline
script with the painless
script language and no script parameters. To use a stored script use the following syntax:
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"hat_prices" : {
"sum" : {
"script" : {
"id": "my_script",
"params" : {
"field" : "price"
}
}
}
}
}
}
Value Script
It is also possible to access the field value from the script using _value
.
For example, this will sum the square of the prices for all hats:
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"square_hats" : {
"sum" : {
"field" : "price",
"script" : {
"source": "_value * _value"
}
}
}
}
}
Missing value
The missing
parameter defines how documents that are missing a value should
be treated. By default documents missing the value will be ignored but it is
also possible to treat them as if they had a value. For example, this treats
all hat sales without a price as being 100
.
POST /sales/_search?size=0
{
"query" : {
"constant_score" : {
"filter" : {
"match" : { "type" : "hat" }
}
}
},
"aggs" : {
"hat_prices" : {
"sum" : {
"field" : "price",
"missing": 100 (1)
}
}
}
}
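The effect of the missing parameter can be sketched in Python. The documents below are hypothetical, and the helper only imitates the behaviour described above (ignore documents without a value by default, or substitute a value for them):

```python
from typing import Iterable, Mapping, Optional


def sum_with_missing(
    docs: Iterable[Mapping[str, float]],
    field: str,
    missing: Optional[float] = None,
) -> float:
    """Sum `field` over docs, optionally substituting `missing` when absent."""
    total = 0.0
    for doc in docs:
        value = doc.get(field, missing)
        if value is None:
            continue  # default behaviour: ignore documents without a value
        total += value
    return total


# Hypothetical hat sales; the last one has no price.
hats = [{"price": 200}, {"price": 150}, {}]

print(sum_with_missing(hats, "price"))               # unpriced hat ignored
print(sum_with_missing(hats, "price", missing=100))  # unpriced hat counted as 100
```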
Top Hits Aggregation
A top_hits
metric aggregator keeps track of the most relevant documents being aggregated. This aggregator is intended
to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.
The top_hits
aggregator can effectively be used to group result sets by certain fields via a bucket aggregator.
One or more bucket aggregators determine the properties by which a result set gets sliced.
Options
-
from - The offset from the first result you want to fetch.
-
size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.
-
sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.
Supported per hit features
The top_hits aggregation returns regular search hits, so many per-hit features can be supported.
Example
In the following example we group the sales by type and per type we show the last sale. For each sale only the date and price fields are being included in the source.
POST /sales/_search?size=0
{
"aggs": {
"top_tags": {
"terms": {
"field": "type",
"size": 3
},
"aggs": {
"top_sales_hits": {
"top_hits": {
"sort": [
{
"date": {
"order": "desc"
}
}
],
"_source": {
"includes": [ "date", "price" ]
},
"size" : 1
}
}
}
}
}
}
Possible response:
{
...
"aggregations": {
"top_tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "hat",
"doc_count": 3,
"top_sales_hits": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmauCQpcRyxw6ChK",
"_source": {
"date": "2015/03/01 00:00:00",
"price": 200
},
"sort": [
1425168000000
],
"_score": null
}
]
}
}
},
{
"key": "t-shirt",
"doc_count": 3,
"top_sales_hits": {
"hits": {
"total": 3,
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmauCQpcRyxw6ChL",
"_source": {
"date": "2015/03/01 00:00:00",
"price": 175
},
"sort": [
1425168000000
],
"_score": null
}
]
}
}
},
{
"key": "bag",
"doc_count": 1,
"top_sales_hits": {
"hits": {
"total": 1,
"max_score": null,
"hits": [
{
"_index": "sales",
"_type": "_doc",
"_id": "AVnNBmatCQpcRyxw6ChH",
"_source": {
"date": "2015/01/01 00:00:00",
"price": 150
},
"sort": [
1420070400000
],
"_score": null
}
]
}
}
}
]
}
}
}
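Conceptually, the request above groups documents into buckets and keeps the first few hits of each bucket after sorting. A rough Python sketch of that idea follows; the sales documents are hypothetical, the function name is an assumption, and bucket ordering, scoring and source filtering are not modelled:

```python
from collections import defaultdict
from typing import Dict, List


def top_hits_per_bucket(
    docs: List[dict], bucket_field: str, sort_field: str, size: int = 3
) -> Dict[str, List[dict]]:
    """Group docs by bucket_field and keep the `size` highest by sort_field."""
    buckets: Dict[str, List[dict]] = defaultdict(list)
    for doc in docs:
        buckets[doc[bucket_field]].append(doc)
    return {
        key: sorted(hits, key=lambda d: d[sort_field], reverse=True)[:size]
        for key, hits in buckets.items()
    }


# Hypothetical sales; ISO date strings sort chronologically as plain text.
sales = [
    {"type": "hat", "date": "2015-01-01", "price": 200},
    {"type": "hat", "date": "2015-03-01", "price": 200},
    {"type": "bag", "date": "2015-01-01", "price": 150},
    {"type": "t-shirt", "date": "2015-02-01", "price": 10},
    {"type": "t-shirt", "date": "2015-03-01", "price": 175},
]

latest = top_hits_per_bucket(sales, "type", "date", size=1)
for sale_type, hits in latest.items():
    print(sale_type, hits[0]["date"], hits[0]["price"])
```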
Field collapse example
Field collapsing or result grouping is a feature that logically groups a result set into groups and per group returns
top documents. The ordering of the groups is determined by the relevancy of the first document in a group. In
Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits
aggregator as sub-aggregator.
In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage
belongs to. By defining a terms
aggregator on the domain
field we group the result set of webpages by domain. The
top_hits
aggregator is then defined as sub-aggregator, so that the top matching hits are collected per bucket.
Also a max
aggregator is defined which is used by the terms
aggregator’s order feature to return the buckets by
relevancy order of the most relevant document in a bucket.
POST /sales/_search
{
"query": {
"match": {
"body": "elections"
}
},
"aggs": {
"top_sites": {
"terms": {
"field": "domain",
"order": {
"top_hit": "desc"
}
},
"aggs": {
"top_tags_hits": {
"top_hits": {}
},
"top_hit" : {
"max": {
"script": {
"source": "_score"
}
}
}
}
}
}
}
At the moment the max
(or min
) aggregator is needed to make sure the buckets from the terms
aggregator are
ordered according to the score of the most relevant webpage per domain. Unfortunately the top_hits
aggregator
can’t be used in the order
option of the terms
aggregator yet.
top_hits support in a nested or reverse_nested aggregator
If the top_hits
aggregator is wrapped in a nested
or reverse_nested
aggregator then nested hits are being returned.
Nested hits are in a sense hidden mini documents that are part of a regular document whose mapping has a nested field type
configured. The top_hits
aggregator has the ability to un-hide these documents if it is wrapped in a nested
or reverse_nested
aggregator. Read more about nested in the nested type mapping.
If a nested type has been configured, a single document is actually indexed as multiple Lucene documents and they share
the same id. In order to determine the identity of a nested hit there is more needed than just the id, so that is why
nested hits also include their nested identity. The nested identity is kept under the _nested
field in the search hit
and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based.
Let’s see how it works with a real sample. Considering the following mapping:
PUT /sales
{
"mappings": {
"_doc" : {
"properties" : {
"tags" : { "type" : "keyword" },
"comments" : { (1)
"type" : "nested",
"properties" : {
"username" : { "type" : "keyword" },
"comment" : { "type" : "text" }
}
}
}
}
}
}
-
The
comments
is an array that holds nested documents under the product
object.
And some documents:
PUT /sales/_doc/1?refresh
{
"tags": ["car", "auto"],
"comments": [
{"username": "baddriver007", "comment": "This car could have better brakes"},
{"username": "dr_who", "comment": "Where's the autopilot? Can't find it"},
{"username": "ilovemotorbikes", "comment": "This car has two extra wheels"}
]
}
It’s now possible to execute the following top_hits
aggregation (wrapped in a nested
aggregation):
POST /sales/_search
{
"query": {
"term": { "tags": "car" }
},
"aggs": {
"by_sale": {
"nested" : {
"path" : "comments"
},
"aggs": {
"by_user": {
"terms": {
"field": "comments.username",
"size": 1
},
"aggs": {
"by_nested": {
"top_hits":{}
}
}
}
}
}
}
}
Top hits response snippet with a nested hit, which resides in the first slot of array field comments
:
{
...
"aggregations": {
"by_sale": {
"by_user": {
"buckets": [
{
"key": "baddriver007",
"doc_count": 1,
"by_nested": {
"hits": {
"total": 1,
"max_score": 0.2876821,
"hits": [
{
"_index": "sales",
"_type" : "_doc",
"_id": "1",
"_nested": {
"field": "comments", (1)
"offset": 0 (2)
},
"_score": 0.2876821,
"_source": {
"comment": "This car could have better brakes", (3)
"username": "baddriver007"
}
}
]
}
}
}
...
]
}
}
}
}
-
Name of the array field containing the nested hit
-
Position of the nested hit in the containing array
-
Source of the nested hit
If _source
is requested then just the part of the source of the nested object is returned, not the entire source of the document.
Also stored fields on the nested inner object level are accessible via top_hits
aggregator residing in a nested
or reverse_nested
aggregator.
Only nested hits will have a _nested
field in the hit, non nested (regular) hits will not have a _nested
field.
The information in _nested
can also be used to parse the original source somewhere else if _source
isn’t enabled.
If there are multiple levels of nested object types defined in mappings then the _nested
information can also be hierarchical
in order to express the identity of nested hits that are two layers deep or more.
In the example below a nested hit resides in the first slot of the field nested_grand_child_field
which then resides in
the second slot of the nested_child_field
field:
...
"hits": {
"total": 2565,
"max_score": 1,
"hits": [
{
"_index": "a",
"_type": "b",
"_id": "1",
"_score": 1,
"_nested" : {
"field" : "nested_child_field",
"offset" : 1,
"_nested" : {
"field" : "nested_grand_child_field",
"offset" : 0
}
},
"_source": ...
},
...
]
}
...
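Because the _nested identity is just a chain of field names and zero-based offsets, it can be followed through a copy of the original document kept elsewhere. The sketch below assumes each inner field name is relative to its parent object, as in the snippet above; the function name and the second sample document are hypothetical:

```python
from typing import Any, Mapping, Optional


def resolve_nested(source: Mapping[str, Any], nested: Mapping[str, Any]) -> Any:
    """Follow a (possibly hierarchical) _nested identity into a source document."""
    current: Any = source
    identity: Optional[Mapping[str, Any]] = nested
    while identity is not None:
        # Step into the array field, then pick the element at the given offset.
        current = current[identity["field"]][identity["offset"]]
        identity = identity.get("_nested")
    return current


# Single level: the sales document indexed earlier in this section.
sale = {
    "tags": ["car", "auto"],
    "comments": [
        {"username": "baddriver007", "comment": "This car could have better brakes"},
        {"username": "dr_who", "comment": "Where's the autopilot? Can't find it"},
    ],
}
print(resolve_nested(sale, {"field": "comments", "offset": 0})["username"])

# Two levels: a hypothetical document matching the hierarchical identity above.
doc = {
    "nested_child_field": [
        {"nested_grand_child_field": [{"v": "a"}]},
        {"nested_grand_child_field": [{"v": "b"}, {"v": "c"}]},
    ]
}
identity = {
    "field": "nested_child_field",
    "offset": 1,
    "_nested": {"field": "nested_grand_child_field", "offset": 0},
}
print(resolve_nested(doc, identity))
```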
Value Count Aggregation
A single-value
metrics aggregation that counts the number of values that are extracted from the aggregated documents.
These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically,
this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg
one might be interested in the number of values the average is computed over.
POST /sales/_search?size=0
{
"aggs" : {
"types_count" : { "value_count" : { "field" : "type" } }
}
}
Response:
{
...
"aggregations": {
"types_count": {
"value": 7
}
}
}
The name of the aggregation (types_count
above) also serves as the key by which the aggregation result can be
retrieved from the returned response.
Script
Counting the values generated by a script:
POST /sales/_search?size=0
{
"aggs" : {
"types_count" : {
"value_count" : {
"script" : {
"source" : "doc['type'].value"
}
}
}
}
}
This will interpret the script
parameter as an inline
script with the painless
script language and no script parameters. To use a stored script use the following syntax:
POST /sales/_search?size=0
{
"aggs" : {
"types_count" : {
"value_count" : {
"script" : {
"id": "my_script",
"params" : {
"field" : "type"
}
}
}
}
}
}
Median Absolute Deviation Aggregation
This single-value
aggregation approximates the median absolute deviation
of its search results.
Median absolute deviation is a measure of variability. It is a robust statistic, meaning that it is useful for describing data that may have outliers, or may not be normally distributed. For such data it can be more descriptive than standard deviation.
It is calculated as the median of each data point’s deviation from the median of the entire sample. That is, for a random variable X, the median absolute deviation is median(|median(X) - Xi|).
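The formula can be written out directly for a sample that fits in memory. This is the exact (naive) calculation, not the approximation the aggregation performs, and the ratings used are hypothetical:

```python
from statistics import median
from typing import Sequence


def median_absolute_deviation(sample: Sequence[float]) -> float:
    """Exact MAD: the median of absolute deviations from the sample median."""
    center = median(sample)
    return median(abs(center - x) for x in sample)


# Hypothetical one-to-five star ratings with a mean and median of 3.
ratings = [1, 1, 3, 5, 5]
print(median_absolute_deviation(ratings))
```

Here the deviations from the median are 2, 2, 0, 2 and 2, so their median is 2 even though the mean rating is 3.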
Example
Assume our data represents product reviews on a one to five star scale. Such reviews are usually summarized as a mean, which is easily understandable but doesn’t describe the reviews' variability. Estimating the median absolute deviation can provide insight into how much reviews vary from one another.
In this example we have a product which has an average rating of 3 stars. Let’s look at its ratings' median absolute deviation to determine how much they vary:
GET reviews/_search
{
"size": 0,
"aggs": {
"review_average": {
"avg": {
"field": "rating"
}
},
"review_variability": {
"median_absolute_deviation": {
"field": "rating" (1)
}
}
}
}
-
rating
must be a numeric field
The resulting median absolute deviation of 2
tells us that there is a fair
amount of variability in the ratings. Reviewers must have diverse opinions about
this product.
{
...
"aggregations": {
"review_average": {
"value": 3.0
},
"review_variability": {
"value": 2.0
}
}
}
Approximation
The naive implementation of calculating median absolute deviation stores the entire sample in memory, so this aggregation instead calculates an approximation. It uses the TDigest data structure to approximate the sample median and the median of deviations from the sample median. For more about the approximation characteristics of TDigests, see Percentiles are (usually) approximate.
The tradeoff between resource usage and accuracy of a TDigest’s quantile
approximation, and therefore the accuracy of this aggregation’s approximation
of median absolute deviation, is controlled by the compression
parameter. A
higher compression
setting provides a more accurate approximation at the
cost of higher memory usage. For more about the characteristics of the TDigest
compression
parameter see
Compression.
GET reviews/_search
{
"size": 0,
"aggs": {
"review_variability": {
"median_absolute_deviation": {
"field": "rating",
"compression": 100
}
}
}
}
The default compression
value for this aggregation is 1000
. At this
compression level this aggregation is usually within 5% of the exact result,
but observed performance will depend on the sample data.
Script
This metric aggregation supports scripting. In our example above, product reviews are on a scale of one to five. If we wanted to modify them to a scale of one to ten, we can use scripting.
To provide an inline script:
GET reviews/_search
{
"size": 0,
"aggs": {
"review_variability": {
"median_absolute_deviation": {
"script": {
"lang": "painless",
"source": "doc['rating'].value * params.scaleFactor",
"params": {
"scaleFactor": 2
}
}
}
}
}
}
To provide a stored script:
GET reviews/_search
{
"size": 0,
"aggs": {
"review_variability": {
"median_absolute_deviation": {
"script": {
"id": "my_script",
"params": {
"field": "rating"
}
}
}
}
}
}
Missing value
The missing
parameter defines how documents that are missing a value should be
treated. By default they will be ignored but it is also possible to treat them
as if they had a value.
Let’s be optimistic and assume some reviewers loved the product so much that they forgot to give it a rating. We’ll assign them five stars:
GET reviews/_search
{
"size": 0,
"aggs": {
"review_variability": {
"median_absolute_deviation": {
"field": "rating",
"missing": 5
}
}
}
}
Bucket Aggregations
Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do, but instead, they create
buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines
whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document
sets. In addition to the buckets themselves, the bucket
aggregations also compute and return the number of documents
that "fell into" each bucket.
Bucket aggregations, as opposed to metrics
aggregations, can hold sub-aggregations. These sub-aggregations will be
aggregated for the buckets created by their "parent" bucket aggregation.
There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some define a fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process.
Note
|
The maximum number of buckets allowed in a single response is limited by a dynamic cluster
setting named search.max_buckets . It is disabled by default (-1) but requests that try to return more than
10,000 buckets (the default value for future versions) will log a deprecation warning.
When using composite aggs however, the handling of -1 differs. Elasticsearch would use the soft limit as a
hard limit for those aggregations, and raise a TooManyBucketsException
about Trying to create too many buckets. Must be less than or equal to: [10000] if the soft limit is exceeded.
|
Adjacency Matrix Aggregation
A bucket aggregation returning a form of adjacency matrix.
The request provides a collection of named filter expressions, similar to the filters
aggregation
request.
Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.
Note
|
The adjacency_matrix aggregation is a new feature and we may evolve its design as we get feedback on its use. As a result, the API for this feature may change in non-backwards compatible ways. |
Given filters named A
, B
and C
the response would return buckets with the following names:
 | A | B | C |
---|---|---|---|
A | A | A&B | A&C |
B | | B | B&C |
C | | | C |
The intersecting buckets, e.g. A&C,
are labelled using a combination of the two filter names separated by
the ampersand character. Note that the response does not also include a "C&A" bucket as this would be the
same set of documents as "A&C". The matrix is said to be symmetric so we only return half of it. To do this we sort
the filter name strings and always use the lowest of a pair as the value to the left of the "&" separator.
An alternative separator
parameter can be passed in the request if clients wish to use a separator string
other than the default of the ampersand.
Example:
PUT /emails/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["hillary", "sidney"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["hillary", "donald"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["vladimir", "donald"]}
GET emails/_search
{
"size": 0,
"aggs" : {
"interactions" : {
"adjacency_matrix" : {
"filters" : {
"grpA" : { "terms" : { "accounts" : ["hillary", "sidney"] }},
"grpB" : { "terms" : { "accounts" : ["donald", "mitt"] }},
"grpC" : { "terms" : { "accounts" : ["vladimir", "nigel"] }}
}
}
}
}
}
In the above example, we analyse email messages to see which groups of individuals have exchanged messages. We will get counts for each group individually and also a count of messages for pairs of groups that have recorded interactions.
Response:
{
"took": 9,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"interactions": {
"buckets": [
{
"key":"grpA",
"doc_count": 2
},
{
"key":"grpA&grpB",
"doc_count": 1
},
{
"key":"grpB",
"doc_count": 2
},
{
"key":"grpB&grpC",
"doc_count": 1
},
{
"key":"grpC",
"doc_count": 1
}
]
}
}
}
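The bucket keys and counts in the response above can be reproduced by hand. The following is a minimal Python sketch, not Elasticsearch's implementation: the helper names matches and adjacency_matrix are hypothetical, and a terms filter is modelled as "the document contains at least one of the listed terms".

```python
from itertools import combinations

# Documents and filters copied from the example request above.
docs = [
    {"accounts": ["hillary", "sidney"]},
    {"accounts": ["hillary", "donald"]},
    {"accounts": ["vladimir", "donald"]},
]
filters = {
    "grpA": {"hillary", "sidney"},
    "grpB": {"donald", "mitt"},
    "grpC": {"vladimir", "nigel"},
}

def matches(doc, terms):
    """Model of a terms filter: true when the document holds any listed term."""
    return bool(terms & set(doc["accounts"]))

def adjacency_matrix(docs, filters, separator="&"):
    """Count documents per filter and per pair of filters (hypothetical helper)."""
    names = sorted(filters)
    buckets = {}
    # Diagonal cells: one bucket per filter.
    for name in names:
        buckets[name] = sum(matches(d, filters[name]) for d in docs)
    # Upper triangle only: the lower name always goes left of the separator.
    for left, right in combinations(names, 2):
        buckets[left + separator + right] = sum(
            matches(d, filters[left]) and matches(d, filters[right]) for d in docs
        )
    # Empty cells are not part of the response.
    return {key: count for key, count in sorted(buckets.items()) if count > 0}

print(adjacency_matrix(docs, filters))
```

No document matches both grpA and grpC, so that cell is empty and no "grpA&grpC" bucket appears, just as in the response.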
Usage
On its own this aggregation can provide all of the data required to create an undirected weighted graph. However, when used with child aggregations such as a date_histogram the results can provide the additional levels of data required to perform dynamic network analysis where examining interactions over time becomes important.
Limitations
For N filters the matrix of buckets produced can contain roughly N²/2 buckets, so a default maximum of 100 filters is imposed. This setting can be changed using the index.max_adjacency_matrix_filters index-level setting.
Auto-interval Date Histogram Aggregation
A multi-bucket aggregation similar to the Date Histogram Aggregation, except that instead of providing an interval to use as the width of each bucket, you provide a target number of buckets and the interval is chosen automatically to best achieve that target. The number of buckets returned will always be less than or equal to this target number.
The buckets field is optional, and will default to 10 buckets if not specified.
Requesting a target of 10 buckets:
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"auto_date_histogram" : {
"field" : "date",
"buckets" : 10
}
}
}
}
Keys
Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch. These timestamps are returned as the bucket keys. The key_as_string is the same timestamp converted to a formatted date string using the format specified with the format parameter:
TIP: If no format is specified, then it will use the first date format specified in the field mapping.
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"auto_date_histogram" : {
"field" : "date",
"buckets" : 5,
"format" : "yyyy-MM-dd" (1)
}
}
}
}
(1) Supports expressive date format pattern
Response:
{
...
"aggregations": {
"sales_over_time": {
"buckets": [
{
"key_as_string": "2015-01-01",
"key": 1420070400000,
"doc_count": 3
},
{
"key_as_string": "2015-02-01",
"key": 1422748800000,
"doc_count": 2
},
{
"key_as_string": "2015-03-01",
"key": 1425168000000,
"doc_count": 2
}
],
"interval": "1M"
}
}
}
Intervals
The interval of the returned buckets is selected based on the data collected by the aggregation so that the number of buckets returned is less than or equal to the number requested. The possible intervals returned are:
| seconds | In multiples of 1, 5, 10 and 30 |
| minutes | In multiples of 1, 5, 10 and 30 |
| hours | In multiples of 1, 3 and 12 |
| days | In multiples of 1 and 7 |
| months | In multiples of 1 and 3 |
| years | In multiples of 1, 5, 10, 20, 50 and 100 |
In the worst case, where the number of daily buckets is too large for the requested number of buckets, the interval moves up to the next multiple (7 days), so the number of buckets returned will be about 1/7th of the number of buckets requested.
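To see where the worst case comes from, consider a simplified model in Python. This is only a sketch under stated assumptions: it looks at day-level multiples alone, treats the data as a contiguous span of days, and the function name pick_day_interval is hypothetical. It does not reproduce how Elasticsearch selects intervals internally.

```python
import math

# Day-level multiples listed in the table above.
DAY_MULTIPLES = [1, 7]

def pick_day_interval(span_days, target_buckets):
    """Return (multiple, bucket_count) for the first multiple that fits the target.

    Simplified model: the bucket count is the span divided by the bucket width,
    rounded up.
    """
    for multiple in DAY_MULTIPLES:
        count = math.ceil(span_days / multiple)
        if count <= target_buckets:
            return multiple, count
    return None  # a coarser unit (months, years) would be needed

print(pick_day_interval(10, 10))  # daily buckets fit the target exactly
print(pick_day_interval(11, 10))  # one day too many forces 7-day buckets
```

In this model a single extra day of data is enough to jump from 1-day to 7-day buckets, which is why the returned bucket count can fall far below the requested target.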
Time Zone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.
Consider the following example:
PUT my_index/log/1?refresh
{
"date": "2015-10-01T00:30:00Z"
}
PUT my_index/log/2?refresh
{
"date": "2015-10-01T01:30:00Z"
}
PUT my_index/log/3?refresh
{
"date": "2015-10-01T02:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"auto_date_histogram": {
"field": "date",
"buckets" : 3
}
}
}
}
UTC is used if no time zone is specified, so three 1-hour buckets are returned starting at midnight UTC on 1 October 2015:
{
...
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-10-01T00:00:00.000Z",
"key": 1443657600000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T01:00:00.000Z",
"key": 1443661200000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T02:00:00.000Z",
"key": 1443664800000,
"doc_count": 1
}
],
"interval": "1h"
}
}
}
If a time_zone of -01:00 is specified, then midnight starts at one hour before midnight UTC:
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"auto_date_histogram": {
"field": "date",
"buckets" : 3,
"time_zone": "-01:00"
}
}
}
}
Now three 1-hour buckets are still returned but the first bucket starts at 11:00pm on 30 September 2015 since that is the local time for the bucket in the specified time zone.
{
...
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T23:00:00.000-01:00", (1)
"key": 1443657600000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T00:00:00.000-01:00",
"key": 1443661200000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T01:00:00.000-01:00",
"key": 1443664800000,
"doc_count": 1
}
],
"interval": "1h"
}
}
}
(1) The key_as_string value represents the start of each bucket in the specified time zone.
WARNING: When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different sizes than neighbouring buckets. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If the result of the aggregation was daily buckets, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals such as 12h. Here, we will have only an 11h bucket on the morning of 27 March when the DST shift happens.
Scripts
As with the normal date_histogram, both document level scripts and value level scripts are supported. However, this aggregation does not support the min_doc_count, extended_bounds and order parameters.
Missing value
The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.
POST /sales/_search?size=0
{
"aggs" : {
"sale_date" : {
"auto_date_histogram" : {
"field" : "date",
"buckets": 10,
"missing": "2000/01/01" (1)
}
}
}
}
(1) Documents without a value in the date field will fall into the same bucket as documents that have the value 2000-01-01.
Children Aggregation
A special single bucket aggregation that selects child documents that have the specified type, as defined in a join field.
This aggregation has a single option:
- type - The child type that should be selected.
For example, let’s say we have an index of questions and answers. The answer type has the following join field in the mapping:
PUT child_example
{
"mappings": {
"_doc": {
"properties": {
"join": {
"type": "join",
"relations": {
"question": "answer"
}
}
}
}
}
}
The question documents contain a tags field and the answer documents contain an owner field. With the children aggregation the tag buckets can be mapped to the owner buckets in a single request even though the two fields exist in two different kinds of documents.
An example of a question document:
PUT child_example/_doc/1
{
"join": {
"name": "question"
},
"body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...",
"title": "Whats the best way to file transfer my site from server to a newer one?",
"tags": [
"windows-server-2003",
"windows-server-2008",
"file-transfer"
]
}
Examples of answer documents:
PUT child_example/_doc/2?routing=1
{
"join": {
"name": "answer",
"parent": "1"
},
"owner": {
"location": "Norfolk, United Kingdom",
"display_name": "Sam",
"id": 48
},
"body": "<p>Unfortunately you're pretty much limited to FTP...",
"creation_date": "2009-05-04T13:45:37.030"
}
PUT child_example/_doc/3?routing=1&refresh
{
"join": {
"name": "answer",
"parent": "1"
},
"owner": {
"location": "Norfolk, United Kingdom",
"display_name": "Troll",
"id": 49
},
"body": "<p>Use Linux...",
"creation_date": "2009-05-05T13:45:37.030"
}
The following request can be built that connects the two together:
POST child_example/_search?size=0
{
"aggs": {
"top-tags": {
"terms": {
"field": "tags.keyword",
"size": 10
},
"aggs": {
"to-answers": {
"children": {
"type" : "answer" (1)
},
"aggs": {
"top-names": {
"terms": {
"field": "owner.display_name.keyword",
"size": 10
}
}
}
}
}
}
}
}
(1) The type points to the type / mapping with the name answer.
The above example returns the top question tags and per tag the top answer owners.
Possible response:
{
"took": 25,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped" : 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "file-transfer",
"doc_count": 1, (1)
"to-answers": {
"doc_count": 2, (2)
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Sam",
"doc_count": 1
},
{
"key": "Troll",
"doc_count": 1
}
]
}
}
},
{
"key": "windows-server-2003",
"doc_count": 1, (1)
"to-answers": {
"doc_count": 2, (2)
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Sam",
"doc_count": 1
},
{
"key": "Troll",
"doc_count": 1
}
]
}
}
},
{
"key": "windows-server-2008",
"doc_count": 1, (1)
"to-answers": {
"doc_count": 2, (2)
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Sam",
"doc_count": 1
},
{
"key": "Troll",
"doc_count": 1
}
]
}
}
}
]
}
}
}
(1) The number of question documents with the tag file-transfer, windows-server-2003, etc.
(2) The number of answer documents that are related to question documents with the tag file-transfer, windows-server-2003, etc.
Composite Aggregation
A multi-bucket aggregation that creates composite buckets from different sources.
Unlike the other multi-bucket aggregations, the composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similarly to what scroll does for documents.
The composite buckets are built from the combinations of the values extracted/created for each document, and each combination is considered a composite bucket.
For instance the following document:
{
"keyword": ["foo", "bar"],
"number": [23, 65, 76]
}
... creates the following composite buckets when keyword and number are used as values source for the aggregation:
{ "keyword": "foo", "number": 23 }
{ "keyword": "foo", "number": 65 }
{ "keyword": "foo", "number": 76 }
{ "keyword": "bar", "number": 23 }
{ "keyword": "bar", "number": 65 }
{ "keyword": "bar", "number": 76 }
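The six buckets listed above are the Cartesian product of the values of the two fields. As a rough illustration in Python (the function name composite_buckets is hypothetical, and this only models how combinations are formed for a single document):

```python
from itertools import product

# The example document from above.
doc = {"keyword": ["foo", "bar"], "number": [23, 65, 76]}

def composite_buckets(doc, sources):
    """Build one composite bucket per combination of the source values."""
    values = [doc[name] for name in sources]
    return [dict(zip(sources, combo)) for combo in product(*values)]

for bucket in composite_buckets(doc, ["keyword", "number"]):
    print(bucket)
```

A document with two keyword values and three number values therefore contributes to 2 x 3 = 6 composite buckets.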
Values source
The sources parameter controls the sources that should be used to build the composite buckets.
There are three different types of values source:
Terms
The terms value source is equivalent to a simple terms aggregation. The values are extracted from a field or a script exactly like the terms aggregation.
Example:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "product": { "terms" : { "field": "product" } } }
]
}
}
}
}
Like the terms aggregation it is also possible to use a script to create the values for the composite buckets:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{
"product": {
"terms" : {
"script" : {
"source": "doc['product'].value",
"lang": "painless"
}
}
}
}
]
}
}
}
}
Histogram
The histogram value source can be applied on numeric values to build fixed-size intervals over the values. The interval parameter defines how the numeric values should be transformed. For instance, an interval set to 5 translates each numeric value to the start of the interval that contains it: a value of 101 would be translated to 100, which is the key for the interval between 100 and 105.
Example:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "histo": { "histogram" : { "field": "price", "interval": 5 } } }
]
}
}
}
}
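The translation from a value to its interval key amounts to rounding down to a multiple of the interval. A minimal sketch in Python, where histogram_key is a hypothetical helper and no offset is assumed:

```python
import math

def histogram_key(value, interval):
    """Round a numeric value down to the start of its interval."""
    return math.floor(value / interval) * interval

print(histogram_key(101, 5))    # falls in the interval starting at 100
print(histogram_key(104.9, 5))  # still in the same interval
print(histogram_key(105, 5))    # starts the next interval
```

Note that rounding is always downwards, so negative values move away from zero (for example, -1 with an interval of 5 lands in the interval starting at -5).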
The values are built from a numeric field or a script that returns numerical values:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{
"histo": {
"histogram" : {
"interval": 5,
"script" : {
"source": "doc['price'].value",
"lang": "painless"
}
}
}
}
]
}
}
}
}
Date Histogram
The date_histogram is similar to the histogram value source except that the interval is specified by a date/time expression:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "date": { "date_histogram" : { "field": "timestamp", "interval": "1d" } } }
]
}
}
}
}
The example above creates an interval per day and translates all timestamp values to the start of their closest interval.
Available expressions for interval: year, quarter, month, week, day, hour, minute, second
Time values can also be specified via abbreviations supported by time units parsing.
Note that fractional time values are not supported, but you can address this by shifting to another time unit (e.g., 1.5h could instead be specified as 90m).
Format
Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch. These timestamps are returned as the bucket keys. It is possible to return a formatted date string instead using the format specified with the format parameter:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{
"date": {
"date_histogram" : {
"field": "timestamp",
"interval": "1d",
"format": "yyyy-MM-dd" (1)
}
}
}
]
}
}
}
}
(1) Supports expressive date format pattern
Time Zone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.
Mixing different values source
The sources parameter accepts an array of values source. It is possible to mix different values source to create composite buckets.
For example:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } },
{ "product": { "terms": {"field": "product" } } }
]
}
}
}
}
This will create composite buckets from the values created by two values source, a date_histogram and a terms. Each bucket is composed of two values, one for each value source defined in the aggregation. Any combination is allowed and the order in the array is preserved in the composite buckets.
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "shop": { "terms": {"field": "shop" } } },
{ "product": { "terms": { "field": "product" } } },
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } }
]
}
}
}
}
Order
By default the composite buckets are sorted by their natural ordering. Values are sorted in ascending order. When multiple value sources are requested, the ordering is done per value source: the first value of the composite bucket is compared to the first value of the other composite bucket, and if they are equal the next values in the composite bucket are used for tie-breaking. This means that the composite bucket [foo, 100] is considered smaller than [foobar, 0] because foo is considered smaller than foobar.
It is possible to define the direction of the sort for each value source by setting order to asc (default value) or desc (descending order) directly in the value source definition.
For example:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
{ "product": { "terms": {"field": "product", "order": "asc" } } }
]
}
}
}
}
... will sort the composite bucket in descending order when comparing values from the date_histogram source and in ascending order when comparing values from the terms source.
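The per-source comparison described in this section behaves like tuple comparison with a direction per position. The following Python sketch illustrates the idea; sort_composite is a hypothetical helper and the sample buckets are made up for illustration, not taken from a real response.

```python
def sort_composite(buckets, orders):
    """Sort composite keys (tuples) with an 'asc' or 'desc' direction per source."""
    result = list(buckets)
    # Stable sorts applied from the last source to the first, so that earlier
    # sources take precedence and later ones only break ties.
    for index in reversed(range(len(orders))):
        result.sort(key=lambda bucket: bucket[index],
                    reverse=(orders[index] == "desc"))
    return result

# Natural ordering: the first values decide, so [foo, 100] sorts before [foobar, 0].
print(("foo", 100) < ("foobar", 0))

# Hypothetical (date, product) keys: date descending, product ascending.
keys = [(1494201600000, "rocky"), (1494288000000, "rocky"), (1494288000000, "mad max")]
print(sort_composite(keys, ["desc", "asc"]))
```

With date descending and product ascending, the two buckets for the later date come first, and within that date "mad max" sorts before "rocky".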
Missing bucket
By default documents without a value for a given source are ignored. It is possible to include them in the response by setting missing_bucket to true (defaults to false):
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "product_name": { "terms" : { "field": "product", "missing_bucket": true } } }
]
}
}
}
}
In the example above the source product_name will emit an explicit null value for documents without a value for the field product. The order specified in the source dictates whether the null values should rank first (ascending order, asc) or last (descending order, desc).
Size
The size parameter can be set to define how many composite buckets should be returned. Each composite bucket is considered as a single bucket so setting a size of 10 will return the first 10 composite buckets created from the values source. The response contains the values for each composite bucket in an array containing the values extracted from each value source.
After
If the number of composite buckets is too high (or unknown) to be returned in a single response it is possible to split the retrieval in multiple requests. Since the composite buckets are flat by nature, the requested size is exactly the number of composite buckets that will be returned in the response (assuming that there are at least size composite buckets to return). If all composite buckets should be retrieved it is preferable to use a small size (100 or 1000 for instance) and then use the after parameter to retrieve the next results.
For example:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"size": 2,
"sources" : [
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } },
{ "product": { "terms": {"field": "product" } } }
]
}
}
}
}
... returns:
{
...
"aggregations": {
"my_buckets": {
"after_key": { (1)
"date": 1494288000000,
"product": "mad max"
},
"buckets": [
{
"key": {
"date": 1494201600000,
"product": "rocky"
},
"doc_count": 1
},
{
"key": {
"date": 1494288000000,
"product": "mad max"
},
"doc_count": 2
}
]
}
}
}
(1) The last composite bucket returned by the query.
NOTE: The after_key is equal to the last bucket returned in the response before any filtering that could be done by Pipeline aggregations. If all buckets are filtered/removed by a pipeline aggregation, the after_key will contain the last bucket before filtering.
The after parameter can be used to retrieve the composite buckets that are after the last composite buckets returned in a previous round. For the example below the last bucket can be found in after_key and the next round of result can be retrieved with:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"size": 2,
"sources" : [
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
{ "product": { "terms": {"field": "product", "order": "asc" } } }
],
"after": { "date": 1494288000000, "product": "mad max" } (1)
}
}
}
}
(1) Should restrict the aggregation to buckets that sort after the provided values.
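Put together, retrieving all composite buckets is a loop that feeds each response's after_key back into the next request. The sketch below shows one way to write that loop in Python. It is not tied to any client library: search is a hypothetical callable that takes a request body and returns the parsed response, fake_search is an in-memory stand-in used only to make the example runnable, and the loop assumes that after_key is absent once no buckets remain.

```python
def iterate_composite(search, sources, size=100, name="my_buckets"):
    """Yield every composite bucket, following after_key between requests."""
    after = None
    while True:
        composite = {"size": size, "sources": sources}
        if after is not None:
            composite["after"] = after
        body = {"size": 0, "aggs": {name: {"composite": composite}}}
        result = search(body)["aggregations"][name]
        yield from result["buckets"]
        # Assumption: after_key is missing when there is nothing left to fetch.
        after = result.get("after_key")
        if after is None:
            break

# In-memory stand-in for a cluster, for illustration only.
ALL_BUCKETS = [{"key": {"product": p}, "doc_count": 1} for p in "abcde"]

def fake_search(body):
    composite = body["aggs"]["my_buckets"]["composite"]
    after = composite.get("after")
    remaining = [b for b in ALL_BUCKETS
                 if after is None or b["key"]["product"] > after["product"]]
    page = remaining[:composite["size"]]
    result = {"buckets": page}
    if page:
        result["after_key"] = page[-1]["key"]
    return {"aggregations": {"my_buckets": result}}

sources = [{"product": {"terms": {"field": "product"}}}]
for bucket in iterate_composite(fake_search, sources, size=2):
    print(bucket["key"])
```

Using after_key rather than the last entry of buckets matters when pipeline aggregations filter the page, since the key is computed before that filtering.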
Sub-aggregations
Like any multi-bucket aggregation, the composite aggregation can hold sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this parent aggregation. For instance the following example computes the average value of a field per composite bucket:
GET /_search
{
"aggs" : {
"my_buckets": {
"composite" : {
"sources" : [
{ "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
{ "product": { "terms": {"field": "product" } } }
]
},
"aggregations": {
"the_avg": {
"avg": { "field": "price" }
}
}
}
}
}
... returns:
{
...
"aggregations": {
"my_buckets": {
"after_key": {
"date": 1494201600000,
"product": "rocky"
},
"buckets": [
{
"key": {
"date": 1494460800000,
"product": "apocalypse now"
},
"doc_count": 1,
"the_avg": {
"value": 10.0
}
},
{
"key": {
"date": 1494374400000,
"product": "mad max"
},
"doc_count": 1,
"the_avg": {
"value": 27.0
}
},
{
"key": {
"date": 1494288000000,
"product" : "mad max"
},
"doc_count": 2,
"the_avg": {
"value": 22.5
}
},
{
"key": {
"date": 1494201600000,
"product": "rocky"
},
"doc_count": 1,
"the_avg": {
"value": 10.0
}
}
]
}
}
}
Date Histogram Aggregation
This multi-bucket aggregation is similar to the normal histogram, but it can only be used with date values. Because dates are represented internally in Elasticsearch as long values, it is possible, but not as accurate, to use the normal histogram on dates as well. The main difference in the two APIs is that here the interval can be specified using date/time expressions. Time-based data requires special support because time-based intervals are not always a fixed length.
Setting intervals
There seems to be no limit to the creativity we humans apply to setting our clocks and calendars. We’ve invented leap years and leap seconds, standard and daylight savings times, and timezone offsets of 30 or 45 minutes rather than a full hour. While these creations help keep us in sync with the cosmos and our environment, they can make specifying time intervals accurately a real challenge. The only universal truth our researchers have yet to disprove is that a millisecond is always the same duration, and a second is always 1000 milliseconds. Beyond that, things get complicated.
Generally speaking, when you specify a single time unit, such as 1 hour or 1 day, you are working with a calendar interval, but multiples, such as 6 hours or 3 days, are fixed-length intervals.
For example, a specification of 1 day (1d) from now is a calendar interval that means "at this exact time tomorrow" no matter the length of the day. A change to or from daylight savings time that results in a 23 or 25 hour day is compensated for and the specification of "this exact time tomorrow" is maintained. But if you specify 2 or more days, each day must be of the same fixed duration (24 hours). In this case, if the specified interval includes the change to or from daylight savings time, the interval will end an hour sooner or later than you expect.
There are similar differences to consider when you specify single versus multiple minutes or hours. Multiple time periods longer than a day are not supported.
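The difference between a calendar day and 24 fixed hours can be checked with a little arithmetic. The Python sketch below hard-codes a single daylight savings transition (CET to CEST on 27 March 2016 at 01:00 UTC) instead of relying on a time zone database; the helper names are hypothetical and the model only covers dates around that one transition.

```python
from datetime import datetime, timedelta, timezone

# CET switched to CEST on 27 March 2016 at 02:00 local time, i.e. 01:00 UTC.
TRANSITION_UTC = datetime(2016, 3, 27, 1, 0, tzinfo=timezone.utc)
CET, CEST = timedelta(hours=1), timedelta(hours=2)

def utc_offset(instant_utc):
    """UTC offset in effect at the given instant (single-transition model)."""
    return CEST if instant_utc >= TRANSITION_UTC else CET

def local_midnight(year, month, day):
    """UTC instant at which the given local calendar day starts."""
    wall_clock = datetime(year, month, day, tzinfo=timezone.utc)
    for offset in (CET, CEST):
        candidate = wall_clock - offset
        if utc_offset(candidate) == offset:
            return candidate
    raise ValueError("local midnight does not exist on this day")

start = local_midnight(2016, 3, 27)

# Calendar interval: from the start of the day to the start of the next day.
calendar_day = local_midnight(2016, 3, 28) - start

# Fixed interval: exactly 24 hours after the same starting point.
fixed_end = start + timedelta(hours=24)
fixed_end_wall_clock = (fixed_end + utc_offset(fixed_end)).replace(tzinfo=None)

print(calendar_day)          # length of the calendar day
print(fixed_end_wall_clock)  # local wall-clock time 24 fixed hours later
```

In this model the calendar day lasts 23 hours, while 24 fixed hours from local midnight end at 01:00 local time on the following day, which is the one-hour drift the text describes.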
Here are the valid time specifications and their meanings:
- milliseconds (ms)
  - Fixed length interval; supports multiples.
- seconds (s)
  - 1000 milliseconds; fixed length interval (except for the last second of a minute that contains a leap-second, which is 2000ms long); supports multiples.
- minutes (m)
  - All minutes begin at 00 seconds.
  - One minute (1m) is the interval between 00 seconds of the first minute and 00 seconds of the following minute in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.
  - Multiple minutes (nm) are intervals of exactly 60x1000=60,000 milliseconds each.
- hours (h)
  - All hours begin at 00 minutes and 00 seconds.
  - One hour (1h) is the interval between 00:00 minutes of the first hour and 00:00 minutes of the following hour in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.
  - Multiple hours (nh) are intervals of exactly 60x60x1000=3,600,000 milliseconds each.
- days (d)
  - All days begin at the earliest possible time, which is usually 00:00:00 (midnight).
  - One day (1d) is the interval between the start of the day and the start of the following day in the specified timezone, compensating for any intervening time changes.
  - Multiple days (nd) are intervals of exactly 24x60x60x1000=86,400,000 milliseconds each.
- weeks (w)
  - One week (1w) is the interval between the start day_of_week:hour:minute:second and the same day of the week and time of the following week in the specified timezone.
  - Multiple weeks (nw) are not supported.
- months (M)
  - One month (1M) is the interval between the start day of the month and time of day and the same day of the month and time of the following month in the specified timezone, so that the day of the month and time of day are the same at the start and end.
  - Multiple months (nM) are not supported.
- quarters (q)
  - One quarter (1q) is the interval between the start day of the month and time of day and the same day of the month and time of day three months later, so that the day of the month and time of day are the same at the start and end.
  - Multiple quarters (nq) are not supported.
- years (y)
  - One year (1y) is the interval between the start day of the month and time of day and the same day of the month and time of day the following year in the specified timezone, so that the date and time are the same at the start and end.
  - Multiple years (ny) are not supported.
NOTE: In all cases, when the specified end time does not exist, the actual end time is the closest available time after the specified end.
Widely distributed applications must also consider vagaries such as countries that start and stop daylight savings time at 12:01 A.M., so end up with one minute of Sunday followed by an additional 59 minutes of Saturday once a year, and countries that decide to move across the international date line. Situations like that can make irregular timezone offsets seem easy.
As always, rigorous testing, especially around time-change events, will ensure that your time interval specification is what you intend it to be.
WARNING: To avoid unexpected results, all connected servers and clients must sync to a reliable network time service.
Examples
Requesting bucket intervals of a month.
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
}
}
}
}
You can also specify time values using abbreviations supported by time units parsing. Note that fractional time values are not supported, but you can address this by shifting to another time unit (e.g., 1.5h could instead be specified as 90m).
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "90m"
}
}
}
}
Keys
Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch (01/01/1970 midnight UTC). These timestamps are returned as the key name of the bucket. The key_as_string is the same timestamp converted to a formatted date string using the format parameter specification:
TIP: If you don’t specify format, the first date format specified in the field mapping is used.
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "1M",
"format" : "yyyy-MM-dd" (1)
}
}
}
}
(1) Supports expressive date format pattern
Response:
{
...
"aggregations": {
"sales_over_time": {
"buckets": [
{
"key_as_string": "2015-01-01",
"key": 1420070400000,
"doc_count": 3
},
{
"key_as_string": "2015-02-01",
"key": 1422748800000,
"doc_count": 2
},
{
"key_as_string": "2015-03-01",
"key": 1425168000000,
"doc_count": 2
}
]
}
}
}
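The relationship between key and key_as_string in the response above can be verified outside Elasticsearch. A small Python sketch follows; note that key_as_string here is a hypothetical helper and that Python's strftime pattern "%Y-%m-%d" stands in for the yyyy-MM-dd format used in the request.

```python
from datetime import datetime, timezone

def key_as_string(key_ms, pattern="%Y-%m-%d"):
    """Format a milliseconds-since-the-epoch bucket key as a UTC date string."""
    return datetime.fromtimestamp(key_ms / 1000, tz=timezone.utc).strftime(pattern)

# Bucket keys taken from the response above.
for key in (1420070400000, 1422748800000, 1425168000000):
    print(key, key_as_string(key))
```

Clients that need exact arithmetic on bucket boundaries can work with the numeric key and format it only for display.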
Timezone
Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. Use the time_zone parameter to indicate that bucketing should use a different timezone.
You can specify timezones as either an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone ID as specified in the IANA timezone database, such as America/Los_Angeles.
Consider the following example:
PUT my_index/_doc/1?refresh
{
"date": "2015-10-01T00:30:00Z"
}
PUT my_index/_doc/2?refresh
{
"date": "2015-10-01T01:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day"
}
}
}
}
If you don’t specify a timezone, UTC is used. This would result in both of these documents being placed into the same day bucket, which starts at midnight UTC on 1 October 2015:
{
...
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-10-01T00:00:00.000Z",
"key": 1443657600000,
"doc_count": 2
}
]
}
}
}
If you specify a time_zone of -01:00, midnight in that timezone is one hour before midnight UTC:
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"time_zone": "-01:00"
}
}
}
}
Now the first document falls into the bucket for 30 September 2015, while the second document falls into the bucket for 1 October 2015:
{
...
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T00:00:00.000-01:00", (1)
"key": 1443574800000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T00:00:00.000-01:00", (1)
"key": 1443661200000,
"doc_count": 1
}
]
}
}
}
(1) The key_as_string value represents midnight on each day in the specified timezone.
WARNING: When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different sizes than you would expect from the used interval. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If you use day as interval, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals, like 12h, where you’ll have only an 11h bucket on the morning of 27 March when the DST shift happens.
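The day buckets in the time_zone example of this section can be recomputed by converting each timestamp to the requested offset and truncating to local midnight. The Python sketch below does this for fixed UTC offsets only (it does not model DST); day_bucket is a hypothetical helper, and the input strings use an explicit +00:00 suffix in place of Z.

```python
from datetime import datetime, timedelta, timezone

def day_bucket(iso_utc, tz_offset_hours=0):
    """Return (key_as_string, key) for the day bucket holding the timestamp."""
    tz = timezone(timedelta(hours=tz_offset_hours))
    local = datetime.fromisoformat(iso_utc).astimezone(tz)
    # Truncate to midnight in the requested time zone.
    start = local.replace(hour=0, minute=0, second=0, microsecond=0)
    key = int(start.timestamp() * 1000)
    return start.isoformat(timespec="milliseconds"), key

# The two documents from the example, bucketed with time_zone -01:00.
print(day_bucket("2015-10-01T00:30:00+00:00", -1))
print(day_bucket("2015-10-01T01:30:00+00:00", -1))
```

The first document lands in the bucket for 30 September and the second in the bucket for 1 October, with the same numeric keys as in the response.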
Offset
Use the offset parameter to change the start value of each bucket by the specified positive (+) or negative offset (-) duration, such as 1h for an hour, or 1d for a day. See [time-units] for more possible time duration options.
For example, when using an interval of day, each bucket runs from midnight to midnight. Setting the offset parameter to +6h changes each bucket to run from 6am to 6am:
PUT my_index/_doc/1?refresh
{
"date": "2015-10-01T05:30:00Z"
}
PUT my_index/_doc/2?refresh
{
"date": "2015-10-01T06:30:00Z"
}
GET my_index/_search?size=0
{
"aggs": {
"by_day": {
"date_histogram": {
"field": "date",
"interval": "day",
"offset": "+6h"
}
}
}
}
Instead of a single bucket starting at midnight, the above request groups the documents into buckets starting at 6am:
{
...
"aggregations": {
"by_day": {
"buckets": [
{
"key_as_string": "2015-09-30T06:00:00.000Z",
"key": 1443592800000,
"doc_count": 1
},
{
"key_as_string": "2015-10-01T06:00:00.000Z",
"key": 1443679200000,
"doc_count": 1
}
]
}
}
}
NOTE: The start offset of each bucket is calculated after time_zone adjustments have been made.
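For fixed-length intervals in UTC, the effect of offset on the bucket key can be written as plain integer arithmetic on millisecond timestamps. The following Python sketch reproduces the keys from the offset example above; bucket_key is a hypothetical helper and the formula is a simplified model that ignores time zones.

```python
DAY_MS = 24 * 60 * 60 * 1000
HOUR_MS = 60 * 60 * 1000

def bucket_key(timestamp_ms, interval_ms, offset_ms=0):
    """Shift by the offset, round down to the interval, then shift back."""
    return (timestamp_ms - offset_ms) // interval_ms * interval_ms + offset_ms

doc1 = 1443677400000  # 2015-10-01T05:30:00Z
doc2 = 1443681000000  # 2015-10-01T06:30:00Z

print(bucket_key(doc1, DAY_MS), bucket_key(doc2, DAY_MS))
print(bucket_key(doc1, DAY_MS, 6 * HOUR_MS), bucket_key(doc2, DAY_MS, 6 * HOUR_MS))
```

Without an offset both documents share the bucket starting at midnight; with +6h the 05:30 document belongs to the bucket that started at 06:00 on the previous day.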
Keyed Response
Setting the keyed flag to true associates a unique string key with each bucket and returns the ranges as a hash rather than an array:
POST /sales/_search?size=0
{
"aggs" : {
"sales_over_time" : {
"date_histogram" : {
"field" : "date",
"interval" : "1M",
"format" : "yyyy-MM-dd",
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"sales_over_time": {
"buckets": {
"2015-01-01": {
"key_as_string": "2015-01-01",
"key": 1420070400000,
"doc_count": 3
},
"2015-02-01": {
"key_as_string": "2015-02-01",
"key": 1422748800000,
"doc_count": 2
},
"2015-03-01": {
"key_as_string": "2015-03-01",
"key": 1425168000000,
"doc_count": 2
}
}
}
}
}
Scripts
As with the normal histogram, both document-level scripts and value-level scripts are supported. You can control the order of the returned buckets using the order settings and filter the returned buckets based on a min_doc_count setting (by default all buckets between the first bucket that matches documents and the last one are returned). This histogram also supports the extended_bounds setting, which enables extending the bounds of the histogram beyond the data itself. For more information, see Extended Bounds.
Missing value
The missing
parameter defines how to treat documents that are missing a value.
By default, they are ignored, but it is also possible to treat them as if they
have a value.
POST /sales/_search?size=0
{
"aggs" : {
"sale_date" : {
"date_histogram" : {
"field" : "date",
"interval": "year",
"missing": "2000/01/01" (1)
}
}
}
}
(1) Documents without a value in the date field will fall into the same bucket as documents that have the value 2000-01-01.
Order
By default the returned buckets are sorted by their key ascending, but you can control the order using the order setting. This setting supports the same order functionality as Terms Aggregation.
Deprecated in 6.0.0: use _key instead of _time to order buckets by their dates/keys.
Using a script to aggregate by day of the week
When you need to aggregate the results by day of the week, use a script that returns the day of the week:
POST /sales/_search?size=0
{
"aggs": {
"dayOfWeek": {
"terms": {
"script": {
"lang": "painless",
"source": "doc['date'].value.dayOfWeekEnum.value"
}
}
}
}
}
Response:
{
...
"aggregations": {
"dayOfWeek": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "7",
"doc_count": 4
},
{
"key": "4",
"doc_count": 3
}
]
}
}
}
The response contains one bucket for each day of the week that has matching documents, keyed by the number of that day: 1 for Monday, 2 for Tuesday, and so on up to 7 for Sunday.
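When presenting these buckets, the numeric keys usually need to be turned back into day names. The following Python sketch shows one way to do that on the client side; the bucket list is copied from the response above, and the names DAY_NAMES and label_buckets are hypothetical helpers, not part of any Elasticsearch client.

```python
from datetime import date

# ISO numbering, as described above: 1 = Monday ... 7 = Sunday.
DAY_NAMES = {
    1: "Monday", 2: "Tuesday", 3: "Wednesday", 4: "Thursday",
    5: "Friday", 6: "Saturday", 7: "Sunday",
}

buckets = [
    {"key": "7", "doc_count": 4},
    {"key": "4", "doc_count": 3},
]


def label_buckets(buckets):
    """Map the string keys returned by the terms aggregation to day names."""
    return {DAY_NAMES[int(b["key"])]: b["doc_count"] for b in buckets}


print(label_buckets(buckets))
# Python's date.isoweekday() uses the same 1..7 numbering.
print(date(2015, 10, 1).isoweekday())
```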
Date Range Aggregation
A range aggregation that is dedicated to date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned.
Note that this aggregation includes the from value and excludes the to value for each range.
Example:
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyyy",
"ranges": [
{ "to": "now-10M/M" }, (1)
{ "from": "now-10M/M" } (2)
]
}
}
}
}
(1) < now minus 10 months, rounded down to the start of the month.
(2) >= now minus 10 months, rounded down to the start of the month.
In the example above, we created two range buckets: the first will "bucket" all documents dated prior to 10 months ago, and the second will "bucket" all documents dated since 10 months ago.
Response:
{
...
"aggregations": {
"range": {
"buckets": [
{
"to": 1.4436576E12,
"to_as_string": "10-2015",
"doc_count": 7,
"key": "*-10-2015"
},
{
"from": 1.4436576E12,
"from_as_string": "10-2015",
"doc_count": 0,
"key": "10-2015-*"
}
]
}
}
}
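To make the date math concrete, the following Python sketch evaluates an expression of the form now-10M/M by hand and applies the "from inclusive, to exclusive" rule. It is a simplified illustration, not the Elasticsearch date math parser: it works in UTC only, and the value used for now (15 August 2016) is an assumption chosen so that the boundary matches the 10-2015 shown in the response.

```python
from datetime import datetime, timezone


def months_ago_rounded(now, months):
    """Subtract whole months from `now`, then round down to the start of that month."""
    total = now.year * 12 + (now.month - 1) - months
    year, month0 = divmod(total, 12)
    return datetime(year, month0 + 1, 1, tzinfo=timezone.utc)


def bucket_for(doc_date, boundary):
    """`to` is exclusive and `from` is inclusive, so the boundary itself falls in the second range."""
    return "before" if doc_date < boundary else "since"


now = datetime(2016, 8, 15, tzinfo=timezone.utc)  # assumed "now"
boundary = months_ago_rounded(now, 10)
print(boundary.isoformat(), int(boundary.timestamp()) * 1000)
print(bucket_for(datetime(2015, 9, 30, tzinfo=timezone.utc), boundary))
print(bucket_for(boundary, boundary))
```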
Missing Values
The missing
parameter defines how documents that are missing a value should
be treated. By default they will be ignored but it is also possible to treat
them as if they had a value. This is done by adding a set of fieldname :
value mappings to specify default values per field.
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"missing": "1976/11/30",
"ranges": [
{
"key": "Older",
"to": "2016/02/01"
}, (1)
{
"key": "Newer",
"from": "2016/02/01",
"to" : "now/d"
}
]
}
}
}
}
(1) Documents without a value in the date field will be added to the "Older" bucket, as if they had a date value of "1976-11-30".
Date Format/Pattern
Note
|
this information was copied from JodaDate |
All ASCII letters are reserved as format pattern letters, which are defined as follows:
Symbol | Meaning | Presentation | Examples |
---|---|---|---|
G | era | text | AD |
C | century of era (>=0) | number | 20 |
Y | year of era (>=0) | year | 1996 |
x | weekyear | year | 1996 |
w | week of weekyear | number | 27 |
e | day of week | number | 2 |
E | day of week | text | Tuesday; Tue |
y | year | year | 1996 |
D | day of year | number | 189 |
M | month of year | month | July; Jul; 07 |
d | day of month | number | 10 |
a | halfday of day | text | PM |
K | hour of halfday (0~11) | number | 0 |
h | clockhour of halfday (1~12) | number | 12 |
H | hour of day (0~23) | number | 0 |
k | clockhour of day (1~24) | number | 24 |
m | minute of hour | number | 30 |
s | second of minute | number | 55 |
S | fraction of second | number | 978 |
z | time zone | text | Pacific Standard Time; PST |
Z | time zone offset/id | zone | -0800; -08:00; America/Los_Angeles |
' | escape for text | delimiter | '' |
The count of pattern letters determines the format.
- Text: If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available.
- Number: The minimum number of digits. Shorter numbers are zero-padded to this amount.
- Year: Numeric presentation for year and weekyear fields is handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits.
- Month: 3 or over, use text, otherwise use number.
- Zone: 'Z' outputs the offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.
- Zone names: Time zone names ('z') cannot be parsed.
Any characters in the pattern that are not in the ranges of ['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance, characters like ':', '.', ' ', '#' and '?' will appear in the resulting time text even if they are not enclosed within single quotes.
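For readers who post-process responses in another language, it can help to see how a few of these pattern letters line up with a different formatting system. The Python sketch below translates a small, hand-picked subset of patterns into strftime directives. It is a rough aid under stated assumptions, not a Joda-compatible formatter: only the six tokens in the table are handled, quoting with single quotes is not supported, and the name joda_to_strftime is hypothetical.

```python
import re
from datetime import datetime

# Assumed subset: only these exact letter counts are translated.
JODA_TO_STRFTIME = {
    "yyyy": "%Y",  # four-digit year
    "MM": "%m",    # month of year as a zero-padded number
    "dd": "%d",    # day of month
    "HH": "%H",    # hour of day (0~23)
    "mm": "%M",    # minute of hour
    "ss": "%S",    # second of minute
}
_TOKEN = re.compile("|".join(JODA_TO_STRFTIME))


def joda_to_strftime(pattern):
    """Replace each known token in a single pass; other characters pass through unchanged."""
    return _TOKEN.sub(lambda m: JODA_TO_STRFTIME[m.group(0)], pattern)


print(joda_to_strftime("yyyy-MM-dd HH:mm:ss"))
print(datetime(2015, 10, 1).strftime(joda_to_strftime("MM-yyyy")))
```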
Time zone in date range aggregations
Dates can be converted from another time zone to UTC by specifying the time_zone parameter.
Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as one of the time zone ids (http://www.joda.org/joda-time/timezones.html) from the TZ database.
The time_zone parameter is also applied to rounding in date math expressions.
As an example, to round to the beginning of the day in the CET time zone, you
can do the following:
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"time_zone": "CET",
"ranges": [
{ "to": "2016/02/01" }, (1)
{ "from": "2016/02/01", "to" : "now/d" }, (2)
{ "from": "now/d" }
]
}
}
}
}
(1) This date will be converted to 2016-02-01T00:00:00.000+01:00.
(2) now/d will be rounded to the beginning of the day in the CET time zone.
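The conversion described for the first range can be checked by hand. The Python sketch below interprets 2016/02/01 as midnight in a zone that is one hour ahead of UTC and converts it to UTC. As a simplifying assumption it models CET as a fixed +01:00 offset, which holds in February but ignores daylight saving time; the helper name local_midnight_to_utc is made up for this example.

```python
from datetime import datetime, timedelta, timezone

# Assumption: CET treated as a fixed +01:00 offset (no DST handling).
CET_WINTER = timezone(timedelta(hours=1))


def local_midnight_to_utc(year, month, day, tz):
    """Midnight at the start of the given local day, expressed in UTC."""
    return datetime(year, month, day, tzinfo=tz).astimezone(timezone.utc)


boundary = local_midnight_to_utc(2016, 2, 1, CET_WINTER)
print(boundary.isoformat())
```

The printed instant is 23:00 UTC on 31 January 2016, that is, the same moment as 2016-02-01T00:00:00.000+01:00.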
Keyed Response
Setting the keyed
flag to true
will associate a unique string key with each
bucket and return the ranges as a hash rather than an array:
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{ "to": "now-10M/M" },
{ "from": "now-10M/M" }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"range": {
"buckets": {
"*-10-2015": {
"to": 1.4436576E12,
"to_as_string": "10-2015",
"doc_count": 7
},
"10-2015-*": {
"from": 1.4436576E12,
"from_as_string": "10-2015",
"doc_count": 0
}
}
}
}
}
It is also possible to customize the key for each range:
POST /sales/_search?size=0
{
"aggs": {
"range": {
"date_range": {
"field": "date",
"format": "MM-yyy",
"ranges": [
{ "from": "01-2015", "to": "03-2015", "key": "quarter_01" },
{ "from": "03-2015", "to": "06-2015", "key": "quarter_02" }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"range": {
"buckets": {
"quarter_01": {
"from": 1.4200704E12,
"from_as_string": "01-2015",
"to": 1.425168E12,
"to_as_string": "03-2015",
"doc_count": 5
},
"quarter_02": {
"from": 1.425168E12,
"from_as_string": "03-2015",
"to": 1.4331168E12,
"to_as_string": "06-2015",
"doc_count": 2
}
}
}
}
}
Diversified Sampler Aggregation
Like the sampler
aggregation this is a filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
The diversified_sampler
aggregation adds the ability to limit the number of matches that share a common value such as an "author".
Note
|
Any good market researcher will tell you that when working with samples of data it is important that the sample represents a healthy variety of opinions rather than being skewed by any single voice. The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer). |
Example use cases:
- Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
- Removing bias from analytics by ensuring fair representation of content from different sources
- Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
A choice of field or script setting is used to provide values used for de-duplication and the max_docs_per_value setting controls the maximum number of documents collected on any one shard which share a common value. The default setting for max_docs_per_value is 1.
The aggregation will throw an error if the choice of field or script produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
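The interplay of the sample size and the per-value cap can be pictured with a small simulation of what happens on a single shard. The Python sketch below is a conceptual model only and does not mirror Elasticsearch's internal collector: the function diversified_sample, the document layout (a "value" and a "score" per document) and the sample data are all assumptions made for illustration.

```python
from collections import Counter


def diversified_sample(docs, shard_size=100, max_docs_per_value=1):
    """Keep the best-scoring docs, allowing at most max_docs_per_value per value."""
    sample, seen = [], Counter()
    for doc in sorted(docs, key=lambda d: d["score"], reverse=True):
        if len(sample) >= shard_size:
            break
        if seen[doc["value"]] < max_docs_per_value:
            seen[doc["value"]] += 1
            sample.append(doc)
    return sample


# Hypothetical posts on one shard, diversified on the author value.
docs = [
    {"value": "alice", "score": 3.0},
    {"value": "alice", "score": 2.5},
    {"value": "bob", "score": 2.0},
    {"value": "carol", "score": 1.0},
]
print([d["value"] for d in diversified_sample(docs, shard_size=2)])
print([d["value"] for d in diversified_sample(docs, shard_size=3, max_docs_per_value=2)])
```

With the default cap of 1, the second post by the most prolific author is skipped in favour of the next-best post from a different author.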
Example:
We might want to see which tags are strongly associated with #elasticsearch
on StackOverflow
forum posts but ignoring the effects of some prolific users with a tendency to misspell #Kibana as #Cabana.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:elasticsearch"
}
},
"aggs": {
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 200,
"field" : "author"
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "tags",
"exclude": ["elasticsearch"]
}
}
}
}
}
}
Response:
{
...
"aggregations": {
"my_unbiased_sample": {
"doc_count": 151,(1)
"keywords": {(2)
"doc_count": 151,
"bg_count": 650,
"buckets": [
{
"key": "kibana",
"doc_count": 150,
"score": 2.213,
"bg_count": 200
}
]
}
}
}
}
(1) 151 documents were sampled in total.
(2) The results of the significant_terms aggregation are not skewed by any single author’s quirks because we asked for a maximum of one post from any one author in our sample.
Scripted example:
In this scenario we might want to diversify on a combination of field values. We can use a script
to produce a hash of the
multiple values in a tags field to ensure we don’t have a sample that consists of the same repeated combinations of tags.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:kibana"
}
},
"aggs": {
"my_unbiased_sample": {
"diversified_sampler": {
"shard_size": 200,
"max_docs_per_value" : 3,
"script" : {
"lang": "painless",
"source": "doc['tags'].hashCode()"
}
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "tags",
"exclude": ["kibana"]
}
}
}
}
}
}
Response:
{
...
"aggregations": {
"my_unbiased_sample": {
"doc_count": 6,
"keywords": {
"doc_count": 6,
"bg_count": 650,
"buckets": [
{
"key": "logstash",
"doc_count": 3,
"score": 2.213,
"bg_count": 50
},
{
"key": "elasticsearch",
"doc_count": 3,
"score": 1.34,
"bg_count": 200
}
]
}
}
}
}
shard_size
The shard_size
parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
max_docs_per_value
The max_docs_per_value
is an optional parameter and limits how many documents are permitted per choice of de-duplicating value.
The default setting is "1".
execution_hint
The optional execution_hint
setting can influence the management of the values used for de-duplication.
Each option will hold up to shard_size
values in memory while performing de-duplication but the type of value held can be controlled as follows:
- hold field values directly (map)
- hold ordinals of the field as determined by the Lucene index (global_ordinals)
- hold hashes of the field values - with potential for hash collisions (bytes_hash)
The default setting is to use global_ordinals if this information is available from the Lucene index, and to revert to map if not.
The bytes_hash
setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions.
Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
Limitations
Cannot be nested under breadth_first aggregations
Being a quality-based filter the diversified_sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.
Limited de-dup logic.
The de-duplication logic applies only at a shard level so will not apply across shards.
No specialized syntax for geo/date fields
Currently the syntax for defining the diversifying values is defined by a choice of field or script - there is no added syntactical sugar for expressing geo or date units such as "7d" (7 days). This support may be added in a later release and users will currently have to create these sorts of values using a script.
Filter Aggregation
Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.
Example:
POST /sales/_search?size=0
{
"aggs" : {
"t_shirts" : {
"filter" : { "term": { "type": "t-shirt" } },
"aggs" : {
"avg_price" : { "avg" : { "field" : "price" } }
}
}
}
}
In the above example, we calculate the average price of all the products that are of type t-shirt.
Response:
{
...
"aggregations" : {
"t_shirts" : {
"doc_count" : 3,
"avg_price" : { "value" : 128.33333333333334 }
}
}
}
Filters Aggregation
Defines a multi bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.
Example:
PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
In the above example, we analyze log messages. The aggregation will build two collections (buckets) of log messages - one for all those containing an error, and another for all those containing a warning.
Response:
{
"took": 9,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"messages": {
"buckets": {
"errors": {
"doc_count": 1
},
"warnings": {
"doc_count": 2
}
}
}
}
}
Anonymous filters
The filters field can also be provided as an array of filters, as in the following request:
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"filters" : [
{ "match" : { "body" : "error" }},
{ "match" : { "body" : "warning" }}
]
}
}
}
}
The filtered buckets are returned in the same order as provided in the request. The response for this example would be:
{
"took": 4,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"messages": {
"buckets": [
{
"doc_count": 1
},
{
"doc_count": 2
}
]
}
}
}
Other Bucket
The other_bucket parameter can be set to add a bucket to the response which will contain all documents that do not match any of the given filters. The value of this parameter can be as follows:
- false: Does not compute the other bucket
- true: Returns the other bucket either in a bucket (named _other_ by default) if named filters are being used, or as the last bucket if anonymous filters are being used
The other_bucket_key parameter can be used to set the key for the other bucket to a value other than the default _other_. Setting this parameter will implicitly set the other_bucket parameter to true.
The following snippet shows a response where the other bucket is requested to be named other_messages.
PUT logs/_doc/4?refresh
{
"body": "info: user Bob logged out"
}
GET logs/_search
{
"size": 0,
"aggs" : {
"messages" : {
"filters" : {
"other_bucket_key": "other_messages",
"filters" : {
"errors" : { "match" : { "body" : "error" }},
"warnings" : { "match" : { "body" : "warning" }}
}
}
}
}
}
The response would be something like the following:
{
"took": 3,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"messages": {
"buckets": {
"errors": {
"doc_count": 1
},
"warnings": {
"doc_count": 2
},
"other_messages": {
"doc_count": 1
}
}
}
}
}
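The bucketing logic behind this response can be imitated in a few lines. The Python sketch below counts the four log messages indexed above into named filter buckets plus an "other" bucket. It is a simplified model under stated assumptions: each filter is reduced to a single lowercase term, and the tokens helper only approximates what an analyzer and a match query would do; filters_agg is a hypothetical name.

```python
import re

docs = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
    "info: user Bob logged out",
]


def tokens(text):
    """Very rough stand-in for text analysis: lowercase word tokens."""
    return set(re.findall(r"\w+", text.lower()))


def filters_agg(docs, filters, other_bucket_key=None):
    """Count docs per named filter; unmatched docs go to the other bucket if requested."""
    counts = {name: 0 for name in filters}
    if other_bucket_key is not None:
        counts[other_bucket_key] = 0
    for body in docs:
        words = tokens(body)
        matched = False
        for name, term in filters.items():
            if term in words:  # a document may fall into several buckets
                counts[name] += 1
                matched = True
        if not matched and other_bucket_key is not None:
            counts[other_bucket_key] += 1
    return counts


print(filters_agg(docs, {"errors": "error", "warnings": "warning"}, "other_messages"))
```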
Geo Distance Aggregation
A multi-bucket aggregation that works on geo_point fields and conceptually works very similarly to the range aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).
PUT /museums
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"aggs" : {
"rings_around_amsterdam" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100000 },
{ "from" : 100000, "to" : 300000 },
{ "from" : 300000 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"rings_around_amsterdam" : {
"buckets": [
{
"key": "*-100000.0",
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
{
"key": "100000.0-300000.0",
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
{
"key": "300000.0-*",
"from": 300000.0,
"doc_count": 2
}
]
}
}
}
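The counts in this response can be sanity-checked outside Elasticsearch. The Python sketch below computes great-circle distances from the origin to the six museums indexed above and assigns each to a ring. It is an approximation under stated assumptions: it uses the haversine formula with a mean Earth radius of 6,371 km, which may differ slightly from the arc calculation Elasticsearch performs, and the names haversine_m and ring_counts are made up for this example.

```python
import math

EARTH_RADIUS_M = 6371000.0  # assumption: mean Earth radius in meters


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance between two (lat, lon) points in meters."""
    p1, p2 = math.radians(lat1), math.radians(lat2)
    dphi = p2 - p1
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(p1) * math.cos(p2) * math.sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * math.asin(math.sqrt(a))


MUSEUMS = [
    (52.374081, 4.912350),  # NEMO Science Museum
    (52.369219, 4.901618),  # Museum Het Rembrandthuis
    (52.371667, 4.914722),  # Nederlands Scheepvaartmuseum
    (51.222900, 4.405200),  # Letterenhuis
    (48.861111, 2.336389),  # Musée du Louvre
    (48.860000, 2.327000),  # Musée d'Orsay
]
RANGES = [(None, 100000), (100000, 300000), (300000, None)]  # (from, to) in meters


def ring_counts(origin, points, ranges):
    """Count points per range; `from` is inclusive and `to` is exclusive."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        d = haversine_m(origin[0], origin[1], lat, lon)
        for i, (lo, hi) in enumerate(ranges):
            if (lo is None or d >= lo) and (hi is None or d < hi):
                counts[i] += 1
    return counts


print(ring_counts((52.3760, 4.894), MUSEUMS, RANGES))
```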
The specified field must be of type geo_point (which can only be set explicitly in the mappings). It can also hold an array of geo_point fields, in which case all will be taken into account during aggregation. The origin point can accept all formats supported by the geo_point type:
- Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it is the most explicit about the lat & lon values
- String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon
- Array format: [4.894, 52.3760] - which is based on the GeoJson standard and where the first number is the lon and the second one is the lat
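Because the string and array formats list latitude and longitude in opposite orders, client code that builds origins is easy to get wrong. The Python sketch below normalizes the three formats listed above into a (lat, lon) tuple. It is a hypothetical client-side helper, not part of Elasticsearch, and it assumes the string form is always "lat, lon" (other string forms that geo_point may accept are not handled).

```python
def parse_origin(origin):
    """Return (lat, lon) for the object, string, or array form of a point."""
    if isinstance(origin, dict):
        # Object format: explicit keys, no ordering ambiguity.
        return float(origin["lat"]), float(origin["lon"])
    if isinstance(origin, str):
        # String format: latitude first, then longitude.
        lat, lon = (float(part) for part in origin.split(","))
        return lat, lon
    # Array format (GeoJSON order): longitude first, then latitude.
    lon, lat = origin
    return float(lat), float(lon)


print(parse_origin({"lat": 52.3760, "lon": 4.894}))
print(parse_origin("52.3760, 4.894"))
print(parse_origin([4.894, 52.3760]))
```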
By default, the distance unit is m (meters) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).
POST /museums/_search?size=0
{
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"unit" : "km", (1)
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
(1) The distances will be computed in kilometers
There are two distance calculation modes: arc (the default), and plane. The arc calculation is the most accurate. The plane calculation is the fastest but least accurate. Consider using plane when your search context is "narrow" and spans smaller geographical areas (~5km). plane will return higher error margins for searches across very large areas (e.g. cross continent search). The distance calculation type can be set using the distance_type parameter:
POST /museums/_search?size=0
{
"aggs" : {
"rings" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"unit" : "km",
"distance_type" : "plane",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 300 },
{ "from" : 300 }
]
}
}
}
}
Keyed Response
Setting the keyed
flag to true
will associate a unique string key with each bucket and return the ranges as a hash rather than an array:
POST /museums/_search?size=0
{
"aggs" : {
"rings_around_amsterdam" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100000 },
{ "from" : 100000, "to" : 300000 },
{ "from" : 300000 }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"rings_around_amsterdam" : {
"buckets": {
"*-100000.0": {
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
"100000.0-300000.0": {
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
"300000.0-*": {
"from": 300000.0,
"doc_count": 2
}
}
}
}
}
It is also possible to customize the key for each range:
POST /museums/_search?size=0
{
"aggs" : {
"rings_around_amsterdam" : {
"geo_distance" : {
"field" : "location",
"origin" : "52.3760, 4.894",
"ranges" : [
{ "to" : 100000, "key": "first_ring" },
{ "from" : 100000, "to" : 300000, "key": "second_ring" },
{ "from" : 300000, "key": "third_ring" }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"rings_around_amsterdam" : {
"buckets": {
"first_ring": {
"from": 0.0,
"to": 100000.0,
"doc_count": 3
},
"second_ring": {
"from": 100000.0,
"to": 300000.0,
"doc_count": 1
},
"third_ring": {
"from": 300000.0,
"doc_count": 2
}
}
}
}
}
GeoHash grid Aggregation
A multi-bucket aggregation that works on geo_point
fields and groups points into buckets that represent cells in a grid.
The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a geohash which is of user-definable precision.
-
High precision geohashes have a long string length and represent cells that cover only a small area.
-
Low precision geohashes have a short string length and represent cells that each cover a large area.
Geohashes used in this aggregation can have a choice of precision between 1 and 12.
Warning
|
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high-levels of detail. |
The specified field must be of type geo_point
(which can only be set explicitly in the mappings) and it can also hold an array of geo_point
fields, in which case all points will be taken into account during aggregation.
Simple low-precision request
PUT /museums
{
"mappings": {
"_doc": {
"properties": {
"location": {
"type": "geo_point"
}
}
}
}
}
POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}
POST /museums/_search?size=0
{
"aggregations" : {
"large-grid" : {
"geohash_grid" : {
"field" : "location",
"precision" : 3
}
}
}
}
Response:
{
...
"aggregations": {
"large-grid": {
"buckets": [
{
"key": "u17",
"doc_count": 3
},
{
"key": "u09",
"doc_count": 2
},
{
"key": "u15",
"doc_count": 1
}
]
}
}
}
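The bucket keys in this response can be reproduced with the standard geohash encoding, which alternates longitude and latitude bisections and emits one base-32 character per five bits. The Python sketch below is a plain reference implementation written for this example (the name geohash_encode is not an Elasticsearch or library API); it encodes the six museums indexed above at precision 3 and counts them per cell.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"


def geohash_encode(lat, lon, precision):
    """Encode a point as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars, bits, bit_count = [], 0, 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits, lon_lo = (bits << 1) | 1, mid
            else:
                bits, lon_hi = bits << 1, mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits, lat_lo = (bits << 1) | 1, mid
            else:
                bits, lat_hi = bits << 1, mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # five bits make one base-32 character
            chars.append(BASE32[bits])
            bits, bit_count = 0, 0
    return "".join(chars)


MUSEUMS = [
    (52.374081, 4.912350), (52.369219, 4.901618), (52.371667, 4.914722),
    (51.222900, 4.405200), (48.861111, 2.336389), (48.860000, 2.327000),
]
grid = Counter(geohash_encode(lat, lon, 3) for lat, lon in MUSEUMS)
print(grid.most_common())
```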
High-precision requests
When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.
POST /museums/_search?size=0
{
"aggregations" : {
"zoomed-in" : {
"filter" : {
"geo_bounding_box" : {
"location" : {
"top_left" : "52.4, 4.9",
"bottom_right" : "52.3, 5.0"
}
}
},
"aggregations":{
"zoom1":{
"geohash_grid" : {
"field": "location",
"precision": 8
}
}
}
}
}
}
The geohashes returned by the geohash_grid aggregation as bucket keys can also be used for "zooming in" by translating them into bounding boxes using one of the available geohash libraries. For example, for javascript the node-geohash library can be used:
var geohash = require('ngeohash');
// bbox will contain [ 52.03125, 4.21875, 53.4375, 5.625 ]
// [ minlat, minlon, maxlat, maxlon]
var bbox = geohash.decode_bbox('u17');
Cell dimensions at the equator
The table below shows the metric dimensions for cells covered by various string lengths of geohash. Cell dimensions vary with latitude and so the table is for the worst-case scenario at the equator.
GeoHash length | Area width x height |
---|---|
1 | 5,009.4km x 4,992.6km |
2 | 1,252.3km x 624.1km |
3 | 156.5km x 156km |
4 | 39.1km x 19.5km |
5 | 4.9km x 4.9km |
6 | 1.2km x 609.4m |
7 | 152.9m x 152.4m |
8 | 38.2m x 19m |
9 | 4.8m x 4.8m |
10 | 1.2m x 59.5cm |
11 | 14.9cm x 14.9cm |
12 | 3.7cm x 1.9cm |
Options
field | Mandatory. The name of the field indexed with GeoPoints. |
precision | Optional. The string length of the geohashes used to define cells/buckets in the results. Defaults to 5. The precision can be defined in terms of the integer precision levels mentioned above; values outside of [1,12] will be rejected. Alternatively, the precision level can be approximated from a distance measure like "1km" or "10m". The precision level is calculated such that cells will not exceed the specified size (diagonal) of the required precision. When this would lead to precision levels higher than the supported 12 levels (e.g. for distances <5.6cm), the value is rejected. |
size | Optional. The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain. |
shard_size | Optional. To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning |
Global Aggregation
Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.
Note
|
Global aggregators can only be placed as top level aggregators because it doesn’t make sense to embed a global aggregator within another bucket aggregator. |
Example:
POST /sales/_search?size=0
{
"query" : {
"match" : { "type" : "t-shirt" }
},
"aggs" : {
"all_products" : {
"global" : {}, (1)
"aggs" : { (2)
"avg_price" : { "avg" : { "field" : "price" } }
}
},
"t_shirts": { "avg" : { "field" : "price" } }
}
}
(1) The global aggregation has an empty body
(2) The sub-aggregations that are registered for this global aggregation
The above aggregation demonstrates how one would compute aggregations (avg_price in this example) on all the documents in the search context, regardless of the query (in our example, it will compute the average price over all products in our catalog, not just on the t-shirts).
The response for the above aggregation:
{
...
"aggregations" : {
"all_products" : {
"doc_count" : 7, (1)
"avg_price" : {
"value" : 140.71428571428572 (2)
}
},
"t_shirts": {
"value" : 128.33333333333334 (3)
}
}
}
(1) The number of documents that were aggregated (in our case, all documents within the search context)
(2) The average price of all products in the index
(3) The average price of all t-shirts
Histogram Aggregation
A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document will be evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will "fall" into the bucket that is associated with the key 30.
To make this more formal, here is the rounding function that is used:
bucket_key = Math.floor((value - offset) / interval) * interval + offset
The interval must be a positive decimal, while the offset must be a decimal in [0, interval) (a decimal greater than or equal to 0 and less than interval).
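The rounding function translates directly into code. The Python sketch below is a literal transcription of the formula above, with the stated constraints on interval and offset checked up front; the function name bucket_key and the error messages are choices made for this example.

```python
import math


def bucket_key(value, interval, offset=0):
    """Return the key of the histogram bucket that `value` falls into."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= offset < interval:
        raise ValueError("offset must be in [0, interval)")
    return math.floor((value - offset) / interval) * interval + offset


print(bucket_key(32, 5))      # the price example from the text
print(bucket_key(-1, 5))      # negative values round down, away from zero
print(bucket_key(32, 5, 2))   # with an offset, bucket boundaries shift
```

Because the function uses floor rather than truncation, a value of -1 with interval 5 lands in the bucket keyed -5, not 0.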
The following snippet "buckets" the products based on their price by interval of 50:
POST /sales/_search?size=0
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50
}
}
}
}
And the following may be the response:
{
...
"aggregations": {
"prices" : {
"buckets": [
{
"key": 0.0,
"doc_count": 1
},
{
"key": 50.0,
"doc_count": 1
},
{
"key": 100.0,
"doc_count": 0
},
{
"key": 150.0,
"doc_count": 2
},
{
"key": 200.0,
"doc_count": 3
}
]
}
}
}
Minimum document count
The response above shows that no document has a price that falls within the range of [100, 150). By default the response will fill gaps in the histogram with empty buckets. It is possible to change that and request buckets with a higher minimum count thanks to the min_doc_count setting:
POST /sales/_search?size=0
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"min_doc_count" : 1
}
}
}
}
Response:
{
...
"aggregations": {
"prices" : {
"buckets": [
{
"key": 0.0,
"doc_count": 1
},
{
"key": 50.0,
"doc_count": 1
},
{
"key": 150.0,
"doc_count": 2
},
{
"key": 200.0,
"doc_count": 3
}
]
}
}
}
By default the histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (on which the histogram is computed) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when requesting empty buckets, this causes confusion, specifically when the data is also filtered.
To understand why, let’s look at an example:
Let’s say you’re filtering your request to get all docs with values between 0 and 500, and in addition you’d like to slice the data per price using a histogram with an interval of 50. You also specify "min_doc_count" : 0 as you’d like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than 100, the first bucket you’ll get will be the one with 100 as its key. This is confusing, as many times you’d also like to get those buckets between 0 - 100.
With the extended_bounds setting, you can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).
Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if the extended_bounds.min is higher than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the same goes for the extended_bounds.max and the last bucket). For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.
Example:
POST /sales/_search?size=0
{
"query" : {
"constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
},
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"extended_bounds" : {
"min" : 0,
"max" : 500
}
}
}
}
}
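The way extended_bounds widens, but never narrows, the set of returned buckets can be modelled in a few lines. The Python sketch below is a conceptual illustration rather than Elasticsearch's implementation: it assumes integer bucket keys, an offset of 0 and min_doc_count of 0, and the function name fill_buckets and the sample counts are hypothetical.

```python
def fill_buckets(counts, interval, extended_bounds=None):
    """Return every bucket between the first and last key, including empty ones.

    counts maps bucket keys to doc counts for the non-empty buckets.
    """
    keys = list(counts)
    lo, hi = (min(keys), max(keys)) if keys else (None, None)
    if extended_bounds is not None:
        b_min = extended_bounds["min"] // interval * interval
        b_max = extended_bounds["max"] // interval * interval
        # Bounds can only extend the range; the data still wins if it is wider.
        lo = b_min if lo is None else min(lo, b_min)
        hi = b_max if hi is None else max(hi, b_max)
    if lo is None:
        return {}
    return {k: counts.get(k, 0) for k in range(lo, hi + 1, interval)}


counts = {150: 2, 200: 3}  # hypothetical non-empty buckets
print(list(fill_buckets(counts, 50)))
print(list(fill_buckets(counts, 50, {"min": 0, "max": 500})))
print(list(fill_buckets(counts, 50, {"min": 300, "max": 400})))
```

In the last call the min bound of 300 is higher than the data, yet the first bucket is still 150, which is the "not filtering" behaviour described above.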
Order
By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled using the order setting. Supports the same order functionality as the Terms Aggregation.
Offset
By default the bucket keys start with 0 and then continue in evenly spaced steps
of interval, e.g. if the interval is 10, the first three buckets (assuming
there is data inside them) will be [0, 10), [10, 20), [20, 30). The bucket
boundaries can be shifted by using the offset option.
This is best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval 10
will result in two buckets with 5 documents each. If an additional offset 5
is used, there will be only a single bucket [5, 15) containing all 10 documents.
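The example above can be reproduced with a small Python sketch. It assumes the usual rounding rule for histogram keys (shift by the offset, round down to a multiple of the interval, shift back); the function name bucket_key is hypothetical.

```python
import math
from collections import Counter

def bucket_key(value, interval, offset=0):
    # Shift by the offset, round down to a multiple of the interval, shift back.
    return math.floor((value - offset) / interval) * interval + offset

values = range(5, 15)  # ten documents with values 5 through 14

# Without an offset: two buckets, keys 0 and 10, five documents each.
print(dict(Counter(bucket_key(v, 10) for v in values)))

# With offset 5: a single bucket with key 5 holding all ten documents.
print(dict(Counter(bucket_key(v, 10, offset=5) for v in values)))
```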
Response Format
By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash instead, keyed by the bucket keys:
POST /sales/_search?size=0
{
"aggs" : {
"prices" : {
"histogram" : {
"field" : "price",
"interval" : 50,
"keyed" : true
}
}
}
}
Response:
{
...
"aggregations": {
"prices": {
"buckets": {
"0.0": {
"key": 0.0,
"doc_count": 1
},
"50.0": {
"key": 50.0,
"doc_count": 1
},
"100.0": {
"key": 100.0,
"doc_count": 0
},
"150.0": {
"key": 150.0,
"doc_count": 2
},
"200.0": {
"key": 200.0,
"doc_count": 3
}
}
}
}
}
Missing value
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored, but it is also possible to treat them as if they
had a value.
POST /sales/_search?size=0
{
"aggs" : {
"quantity" : {
"histogram" : {
"field" : "quantity",
"interval": 10,
"missing": 0 (1)
}
}
}
}
(1) Documents without a value in the quantity field will fall into the same bucket as documents that have the value 0.
IP Range Aggregation
Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields:
Example:
GET /ip_addresses/_search
{
"size": 10,
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "to" : "10.0.0.5" },
{ "from" : "10.0.0.5" }
]
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets" : [
{
"key": "*-10.0.0.5",
"to": "10.0.0.5",
"doc_count": 10
},
{
"key": "10.0.0.5-*",
"from": "10.0.0.5",
"doc_count": 260
}
]
}
}
}
IP ranges can also be defined as CIDR masks:
GET /ip_addresses/_search
{
"size": 0,
"aggs" : {
"ip_ranges" : {
"ip_range" : {
"field" : "ip",
"ranges" : [
{ "mask" : "10.0.0.0/25" },
{ "mask" : "10.0.0.127/25" }
]
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets": [
{
"key": "10.0.0.0/25",
"from": "10.0.0.0",
"to": "10.0.0.128",
"doc_count": 128
},
{
"key": "10.0.0.127/25",
"from": "10.0.0.0",
"to": "10.0.0.128",
"doc_count": 128
}
]
}
}
}
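Both masks in the response above resolve to the same range because a /25 mask keeps only the first 25 bits of the address. The sketch below illustrates this with Python's standard ipaddress module; the helper mask_to_range is hypothetical, and the exclusive upper bound mirrors the "to" value in the response.

```python
import ipaddress

def mask_to_range(mask):
    """Return (from, to) for a CIDR mask, with `to` exclusive."""
    # strict=False accepts masks whose host bits are set, e.g. 10.0.0.127/25.
    network = ipaddress.ip_network(mask, strict=False)
    start = network.network_address
    end = network.broadcast_address + 1  # first address after the block
    return str(start), str(end)

print(mask_to_range("10.0.0.0/25"))
print(mask_to_range("10.0.0.127/25"))
```

The sketch does not handle a block that ends at the very last address of the address space, where there is no "next" address to use as the upper bound.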
Keyed Response
Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:
GET /ip_addresses/_search
{
"size": 0,
"aggs": {
"ip_ranges": {
"ip_range": {
"field": "ip",
"ranges": [
{ "to" : "10.0.0.5" },
{ "from" : "10.0.0.5" }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets": {
"*-10.0.0.5": {
"to": "10.0.0.5",
"doc_count": 10
},
"10.0.0.5-*": {
"from": "10.0.0.5",
"doc_count": 260
}
}
}
}
}
It is also possible to customize the key for each range:
GET /ip_addresses/_search
{
"size": 0,
"aggs": {
"ip_ranges": {
"ip_range": {
"field": "ip",
"ranges": [
{ "key": "infinity", "to" : "10.0.0.5" },
{ "key": "and-beyond", "from" : "10.0.0.5" }
],
"keyed": true
}
}
}
}
Response:
{
...
"aggregations": {
"ip_ranges": {
"buckets": {
"infinity": {
"to": "10.0.0.5",
"doc_count": 10
},
"and-beyond": {
"from": "10.0.0.5",
"doc_count": 260
}
}
}
}
}
Missing Aggregation
A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.
Example:
POST /sales/_search?size=0
{
"aggs" : {
"products_without_a_price" : {
"missing" : { "field" : "price" }
}
}
}
In the above example, we get the total number of products that do not have a price.
Response:
{
...
"aggregations" : {
"products_without_a_price" : {
"doc_count" : 0
}
}
}
Nested Aggregation
A special single bucket aggregation that enables aggregating nested documents.
For example, let’s say we have an index of products, and each product holds the list of resellers, each having its own price for the product. The mapping could look like:
PUT /products
{
"mappings": {
"product" : {
"properties" : {
"resellers" : { (1)
"type" : "nested",
"properties" : {
"reseller" : { "type" : "text" },
"price" : { "type" : "double" }
}
}
}
}
}
}
(1) resellers is an array that holds nested documents.
The following request adds a product with two resellers:
PUT /products/_doc/0
{
"name": "LED TV", (1)
"resellers": [
{
"reseller": "companyA",
"price": 350
},
{
"reseller": "companyB",
"price": 500
}
]
}
(1) We are using a dynamic mapping for the name attribute.
The following request returns the minimum price a product can be purchased for:
GET /products/_search
{
"query" : {
"match" : { "name" : "led tv" }
},
"aggs" : {
"resellers" : {
"nested" : {
"path" : "resellers"
},
"aggs" : {
"min_price" : { "min" : { "field" : "resellers.price" } }
}
}
}
}
As you can see above, the nested aggregation requires the path of the nested documents within the top level documents.
Then one can define any type of aggregation over these nested documents.
Response:
{
...
"aggregations": {
"resellers": {
"doc_count": 2,
"min_price": {
"value": 350
}
}
}
}
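Conceptually, the nested aggregation switches the aggregation context from the product to its nested reseller documents, and the min sub-aggregation then runs over those. A minimal Python model of that behaviour, using the document indexed above (the helper nested_min_price is hypothetical and is not how Elasticsearch stores or processes nested documents):

```python
product = {
    "name": "LED TV",
    "resellers": [
        {"reseller": "companyA", "price": 350},
        {"reseller": "companyB", "price": 500},
    ],
}

def nested_min_price(docs, path="resellers", field="price"):
    # Flatten the nested documents under `path`, then aggregate over them.
    nested_docs = [nested for doc in docs for nested in doc.get(path, [])]
    return {
        "doc_count": len(nested_docs),  # counts nested documents, not products
        "min_price": {"value": min(n[field] for n in nested_docs)},
    }

print(nested_min_price([product]))
```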
Parent Aggregation
A special single bucket aggregation that selects parent documents that have the specified type, as defined in a join field.
This aggregation has a single option:
-
type - The child type that should be selected.
For example, let’s say we have an index of questions and answers. The answer type has the following join field in the mapping:
PUT parent_example
{
"mappings": {
"_doc": {
"properties": {
"join": {
"type": "join",
"relations": {
"question": "answer"
}
}
}
}
}
}
The question document contains a tags field and the answer documents contain an owner field. With the parent
aggregation the owner buckets can be mapped to the tag buckets in a single request even though the two fields exist in
two different kinds of documents.
An example of a question document:
PUT parent_example/_doc/1
{
"join": {
"name": "question"
},
"body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...",
"title": "Whats the best way to file transfer my site from server to a newer one?",
"tags": [
"windows-server-2003",
"windows-server-2008",
"file-transfer"
]
}
Examples of answer documents:
PUT parent_example/_doc/2?routing=1
{
"join": {
"name": "answer",
"parent": "1"
},
"owner": {
"location": "Norfolk, United Kingdom",
"display_name": "Sam",
"id": 48
},
"body": "<p>Unfortunately you're pretty much limited to FTP...",
"creation_date": "2009-05-04T13:45:37.030"
}
PUT parent_example/_doc/3?routing=1&refresh
{
"join": {
"name": "answer",
"parent": "1"
},
"owner": {
"location": "Norfolk, United Kingdom",
"display_name": "Troll",
"id": 49
},
"body": "<p>Use Linux...",
"creation_date": "2009-05-05T13:45:37.030"
}
The following request can be built that connects the two together:
POST parent_example/_search?size=0
{
"aggs": {
"top-names": {
"terms": {
"field": "owner.display_name.keyword",
"size": 10
},
"aggs": {
"to-questions": {
"parent": {
"type" : "answer" (1)
},
"aggs": {
"top-tags": {
"terms": {
"field": "tags.keyword",
"size": 10
}
}
}
}
}
}
}
}
(1) The type points to the type / mapping with the name answer.
The above example returns the top answer owners and per owner the top question tags.
Possible response:
{
"took": 9,
"timed_out": false,
"_shards": {
"total": 5,
"successful": 5,
"skipped": 0,
"failed": 0
},
"hits": {
"total": 3,
"max_score": 0.0,
"hits": []
},
"aggregations": {
"top-names": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "Sam",
"doc_count": 1, (1)
"to-questions": {
"doc_count": 1, (2)
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "file-transfer",
"doc_count": 1
},
{
"key": "windows-server-2003",
"doc_count": 1
},
{
"key": "windows-server-2008",
"doc_count": 1
}
]
}
}
},
{
"key": "Troll",
"doc_count": 1,
"to-questions": {
"doc_count": 1,
"top-tags": {
"doc_count_error_upper_bound": 0,
"sum_other_doc_count": 0,
"buckets": [
{
"key": "file-transfer",
"doc_count": 1
},
{
"key": "windows-server-2003",
"doc_count": 1
},
{
"key": "windows-server-2008",
"doc_count": 1
}
]
}
}
}
]
}
}
}
(1) The number of answer documents with the owner display name Sam, Troll, etc.
(2) The number of question documents that are related to answer documents with the owner display name Sam, Troll, etc.
Range Aggregation
A multi-bucket value source based aggregation that enables the user to define a set of ranges, each representing a bucket. During the aggregation process, the values extracted from each document are checked against each bucket range, and the document is placed in every bucket whose range matches.
Note that this aggregation includes the from value and excludes the to value for each range.
Example:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 100.0 },
{ "from" : 100.0, "to" : 200.0 },
{ "from" : 200.0 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": [
{
"key": "*-100.0",
"to": 100.0,
"doc_count": 2
},
{
"key": "100.0-200.0",
"from": 100.0,
"to": 200.0,
"doc_count": 2
},
{
"key": "200.0-*",
"from": 200.0,
"doc_count": 3
}
]
}
}
}
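The from-inclusive, to-exclusive rule can be sketched as follows. The prices are hypothetical values chosen to reproduce the doc counts in the response above, and range_buckets is an illustrative helper rather than an Elasticsearch API.

```python
def range_buckets(values, ranges):
    """Count values per range; `from` is inclusive, `to` is exclusive."""
    buckets = []
    for r in ranges:
        lo = r.get("from", float("-inf"))  # open-ended when `from` is absent
        hi = r.get("to", float("inf"))     # open-ended when `to` is absent
        count = sum(1 for v in values if lo <= v < hi)
        buckets.append({**r, "doc_count": count})
    return buckets

# Hypothetical prices; note that 200 lands in the last bucket, not the middle one.
prices = [10, 50, 150, 175, 200, 200, 200]
ranges = [{"to": 100.0}, {"from": 100.0, "to": 200.0}, {"from": 200.0}]

for bucket in range_buckets(prices, ranges):
    print(bucket)
```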
Keyed Response
Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"keyed" : true,
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 200 },
{ "from" : 200 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": {
"*-100.0": {
"to": 100.0,
"doc_count": 2
},
"100.0-200.0": {
"from": 100.0,
"to": 200.0,
"doc_count": 2
},
"200.0-*": {
"from": 200.0,
"doc_count": 3
}
}
}
}
}
It is also possible to customize the key for each range:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"keyed" : true,
"ranges" : [
{ "key" : "cheap", "to" : 100 },
{ "key" : "average", "from" : 100, "to" : 200 },
{ "key" : "expensive", "from" : 200 }
]
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges" : {
"buckets": {
"cheap": {
"to": 100.0,
"doc_count": 2
},
"average": {
"from": 100.0,
"to": 200.0,
"doc_count": 2
},
"expensive": {
"from": 200.0,
"doc_count": 3
}
}
}
}
}
Script
Range aggregation accepts a script parameter. This parameter allows you to define an inline script that
will be executed during aggregation execution.
The following example shows how to use an inline script with the painless script language and no script parameters:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"script" : {
"lang": "painless",
"source": "doc['price'].value"
},
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 200 },
{ "from" : 200 }
]
}
}
}
}
It is also possible to use stored scripts. Here is a simple stored script:
POST /_scripts/convert_currency
{
"script": {
"lang": "painless",
"source": "doc[params.field].value * params.conversion_rate"
}
}
And this new stored script can be used in the range aggregation like this:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"script" : {
"id": "convert_currency", (1)
"params": { (2)
"field": "price",
"conversion_rate": 0.835526591
}
},
"ranges" : [
{ "from" : 0, "to" : 100 },
{ "from" : 100 }
]
}
}
}
}
(1) Id of the stored script
(2) Parameters to use when executing the stored script
Value Script
Let’s say the product prices are in USD but we would like to get the price ranges in EURO. We can use a value script to convert the prices prior to the aggregation (assuming a conversion rate of 0.8):
GET /sales/_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"script" : {
"source": "_value * params.conversion_rate",
"params" : {
"conversion_rate" : 0.8
}
},
"ranges" : [
{ "to" : 35 },
{ "from" : 35, "to" : 70 },
{ "from" : 70 }
]
}
}
}
}
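The order of operations matters here: the value script runs first, and the ranges are then applied to the converted values. A short Python sketch of that sequence, using hypothetical USD prices and a hypothetical helper convert_and_bucket:

```python
def convert_and_bucket(prices_usd, conversion_rate, ranges):
    # Apply the "value script" first, then bucket the converted values.
    converted = [p * conversion_rate for p in prices_usd]
    counts = []
    for r in ranges:
        lo = r.get("from", float("-inf"))
        hi = r.get("to", float("inf"))
        counts.append(sum(1 for v in converted if lo <= v < hi))
    return counts

prices_usd = [10, 50, 150, 175, 200]  # hypothetical USD prices
ranges = [{"to": 35}, {"from": 35, "to": 70}, {"from": 70}]

# Converted prices are 8, 40, 120, 140 and 160, so the counts are 1, 1 and 3.
print(convert_and_bucket(prices_usd, 0.8, ranges))
```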
Sub Aggregations
The following example not only "buckets" the documents into the different ranges but also computes statistics over the prices in each price range:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 200 },
{ "from" : 200 }
]
},
"aggs" : {
"price_stats" : {
"stats" : { "field" : "price" }
}
}
}
}
}
Response:
{
...
"aggregations": {
"price_ranges": {
"buckets": [
{
"key": "*-100.0",
"to": 100.0,
"doc_count": 2,
"price_stats": {
"count": 2,
"min": 10.0,
"max": 50.0,
"avg": 30.0,
"sum": 60.0
}
},
{
"key": "100.0-200.0",
"from": 100.0,
"to": 200.0,
"doc_count": 2,
"price_stats": {
"count": 2,
"min": 150.0,
"max": 175.0,
"avg": 162.5,
"sum": 325.0
}
},
{
"key": "200.0-*",
"from": 200.0,
"doc_count": 3,
"price_stats": {
"count": 3,
"min": 200.0,
"max": 200.0,
"avg": 200.0,
"sum": 600.0
}
}
]
}
}
}
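The per-bucket statistics in this response can be modelled by first selecting the values that fall into each range and then summarising them. The prices below are hypothetical values consistent with the figures shown above; the stats helper is illustrative only.

```python
def stats(values):
    # Mirrors the fields of the stats aggregation output shown above.
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": sum(values) / len(values),
        "sum": sum(values),
    }

prices = [10.0, 50.0, 150.0, 175.0, 200.0, 200.0, 200.0]  # hypothetical
ranges = [(None, 100.0), (100.0, 200.0), (200.0, None)]   # (from, to) pairs

for lo, hi in ranges:
    # `from` is inclusive and `to` is exclusive; None means open-ended.
    in_bucket = [p for p in prices
                 if (lo is None or p >= lo) and (hi is None or p < hi)]
    print(lo, hi, stats(in_bucket))
```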
If a sub aggregation is also based on the same value source as the range aggregation (like the stats aggregation in the example above) it is possible to leave out the value source definition for it. The following will return the same response as above:
GET /_search
{
"aggs" : {
"price_ranges" : {
"range" : {
"field" : "price",
"ranges" : [
{ "to" : 100 },
{ "from" : 100, "to" : 200 },
{ "from" : 200 }
]
},
"aggs" : {
"price_stats" : {
"stats" : {} (1)
}
}
}
}
}
(1) We don’t need to specify the price field as we "inherit" it by default from the parent range aggregation.
Reverse nested Aggregation
A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this aggregation can break out of the nested block structure and link to other nested structures or the root document, which allows nesting other aggregations that aren’t part of the nested object in a nested aggregation.
The reverse_nested aggregation must be defined inside a nested aggregation.
-
path - Defines which nested object field should be joined back to. The default is empty, which means that it joins back to the root / main document level. The path cannot contain a reference to a nested object field that falls outside the nested structure of the nested aggregation that the reverse_nested aggregation is in.
For example, let’s say we have an index for a ticket system with issues and comments. The comments are inlined into the issue documents as nested documents. The mapping could look like:
PUT /issues
{
"mappings": {
"issue" : {
"properties" : {
"tags" : { "type" : "keyword" },
"comments" : { (1)
"type" : "nested",
"properties" : {
"username" : { "type" : "keyword" },
"comment" : { "type" : "text" }
}
}
}
}
}
}
(1) comments is an array that holds nested documents under the issue object.
The following aggregations will return the usernames of the top commenters and, per top commenter, the top tags of the issues the user has commented on:
GET /issues/_search
{
"query": {
"match_all": {}
},
"aggs": {
"comments": {
"nested": {
"path": "comments"
},
"aggs": {
"top_usernames": {
"terms": {
"field": "comments.username"
},
"aggs": {
"comment_to_issue": {
"reverse_nested": {}, (1)
"aggs": {
"top_tags_per_comment": {
"terms": {
"field": "tags"
}
}
}
}
}
}
}
}
}
}
As you can see above, the reverse_nested aggregation is put into a nested aggregation as this is the only place
in the DSL where the reverse_nested aggregation can be used. Its sole purpose is to join back to a parent doc higher
up in the nested structure.
(1) A reverse_nested aggregation that joins back to the root / main document level, because no path has been defined. Via the path option the reverse_nested aggregation can join back to a different level, if multiple layered nested object types have been defined in the mapping.
Possible response snippet:
{
"aggregations": {
"comments": {
"doc_count": 1,
"top_usernames": {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets": [
{
"key": "username_1",
"doc_count": 1,
"comment_to_issue": {
"doc_count": 1,
"top_tags_per_comment": {
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets": [
{
"key": "tag_1",
"doc_count": 1
}
...
]
}
}
}
...
]
}
}
}
}
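The join back from comments to issues can be pictured with plain Python data structures. This is a conceptual sketch under stated assumptions: the issues, usernames and the helper top_tags_per_commenter are all hypothetical, and the model counts each parent issue once per commenter, however many comments that user left on it.

```python
from collections import Counter, defaultdict

# Hypothetical issues; each comment is a nested document inside its issue.
issues = [
    {"tags": ["tag_1", "tag_2"],
     "comments": [{"username": "username_1", "comment": "first"},
                  {"username": "username_1", "comment": "second"}]},
    {"tags": ["tag_1"],
     "comments": [{"username": "username_2", "comment": "third"}]},
]

def top_tags_per_commenter(issues):
    comment_counts = Counter()          # nested level: one count per comment
    issues_per_user = defaultdict(set)  # root level: distinct parent issues
    for issue_id, issue in enumerate(issues):
        for comment in issue["comments"]:
            comment_counts[comment["username"]] += 1
            # The reverse_nested step: go from the comment back to its issue.
            issues_per_user[comment["username"]].add(issue_id)
    result = {}
    for user, issue_ids in issues_per_user.items():
        tags = Counter(t for i in issue_ids for t in issues[i]["tags"])
        result[user] = {
            "doc_count": comment_counts[user],
            "comment_to_issue": {"doc_count": len(issue_ids),
                                 "top_tags_per_comment": dict(tags)},
        }
    return result

for user, bucket in top_tags_per_commenter(issues).items():
    print(user, bucket)
```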
Sampler Aggregation
A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.
Example use cases:
-
Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches
-
Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms
Example:
A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents - most of them missing the word Kibana. To focus
the significant_terms aggregation on top-scoring documents that are more likely to match
the most interesting parts of our query we use a sample.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:kibana OR tags:javascript"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 200
},
"aggs": {
"keywords": {
"significant_terms": {
"field": "tags",
"exclude": ["kibana", "javascript"]
}
}
}
}
}
}
Response:
{
...
"aggregations": {
"sample": {
"doc_count": 200,(1)
"keywords": {
"doc_count": 200,
"bg_count": 650,
"buckets": [
{
"key": "elasticsearch",
"doc_count": 150,
"score": 1.078125,
"bg_count": 200
},
{
"key": "logstash",
"doc_count": 50,
"score": 0.5625,
"bg_count": 50
}
]
}
}
}
}
(1) 200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.
Without the sampler aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies
less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.
POST /stackoverflow/_search?size=0
{
"query": {
"query_string": {
"query": "tags:kibana OR tags:javascript"
}
},
"aggs": {
"low_quality_keywords": {
"significant_terms": {
"field": "tags",
"size": 3,
"exclude":["kibana", "javascript"]
}
}
}
}
Response:
{
...
"aggregations": {
"low_quality_keywords": {
"doc_count": 600,
"bg_count": 650,
"buckets": [
{
"key": "angular",
"doc_count": 200,
"score": 0.02777,
"bg_count": 200
},
{
"key": "jquery",
"doc_count": 200,
"score": 0.02777,
"bg_count": 200
},
{
"key": "logstash",
"doc_count": 50,
"score": 0.0069,
"bg_count": 50
}
]
}
}
}
shard_size
The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard.
The default value is 100.
Limitations
Cannot be nested under breadth_first aggregations
Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document.
It therefore cannot be nested under a terms aggregation which has the collect_mode
switched from the default depth_first mode to breadth_first as this discards scores.
In this situation an error will be thrown.
Significant Terms Aggregation
An aggregation that returns interesting or unusual occurrences of terms in a set.
Example use cases:
-
Suggesting "H5N1" when users search for "bird flu" in text
-
Identifying the merchant that is the "common point of compromise" from the transaction history of credit card owners reporting loss
-
Suggesting keywords relating to stock symbol $ATI for an automated news classifier
-
Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash injuries
-
Spotting the tire manufacturer who has a disproportionate number of blow-outs
In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
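To make the size of that swing concrete, the two frequencies from the "H5N1" example can be compared directly. The sketch uses exact fractions so the ratio is not affected by floating point rounding; the variable names are chosen for the example.

```python
from fractions import Fraction

background = Fraction(5, 10_000_000)  # "H5N1" across the whole index
foreground = Fraction(4, 100)         # "H5N1" within the search results

# How many times more frequent the term is in the results than in the index.
ratio = foreground / background
print(ratio)  # 80000
```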
Single-set analysis
In the simplest case, the foreground set of interest is the search results matched by a query and the background set used for statistical comparisons is the index or indices from which the results were gathered.
Example:
GET /_search
{
"query" : {
"terms" : {"force" : [ "British Transport Police" ]}
},
"aggregations" : {
"significant_crime_types" : {
"significant_terms" : { "field" : "crime_type" }
}
}
}
Response:
{
...
"aggregations" : {
"significant_crime_types" : {
"doc_count": 47347,
"bg_count": 5064554,
"buckets" : [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371235374214817,
"bg_count": 66799
}
...
]
}
}
}
When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only about 1.3% of crimes (66799/5064554) but for the British Transport Police, who handle crime on railways and stations, about 7.7% of crimes (3640/47347) are bike thefts. This is a significant, roughly six-fold increase in frequency and so this anomaly was highlighted as the top crime type.
The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.
This can be a tedious way to look for unusual patterns in an index.
Multi-set analysis
A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis.
Example using a parent aggregation for segmentation:
GET /_search
{
"aggregations": {
"forces": {
"terms": {"field": "force"},
"aggregations": {
"significant_crime_types": {
"significant_terms": {"field": "crime_type"}
}
}
}
}
}
Response:
{
...
"aggregations": {
"forces": {
"doc_count_error_upper_bound": 1375,
"sum_other_doc_count": 7879845,
"buckets": [
{
"key": "Metropolitan Police Service",
"doc_count": 894038,
"significant_crime_types": {
"doc_count": 894038,
"bg_count": 5064554,
"buckets": [
{
"key": "Robbery",
"doc_count": 27617,
"score": 0.0599,
"bg_count": 53182
}
...
]
}
},
{
"key": "British Transport Police",
"doc_count": 47347,
"significant_crime_types": {
"doc_count": 47347,
"bg_count": 5064554,
"buckets": [
{
"key": "Bicycle theft",
"doc_count": 3640,
"score": 0.371,
"bg_count": 66799
}
...
]
}
}
]
}
}
}
Now we have anomaly detection for each of the police forces using a single request.
We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic area to identify unusual hot-spots of a particular crime type:
GET /_search
{
"aggs": {
"hotspots": {
"geohash_grid": {
"field": "location",
"precision": 5
},
"aggs": {
"significant_crime_types": {
"significant_terms": {"field": "crime_type"}
}
}
}
}
}
This example uses the geohash_grid aggregation to create result buckets that represent geographic areas, and inside each
bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g.
-
Airports exhibit unusual numbers of weapon confiscations
-
Universities show uplifts of bicycle thefts
At a higher geohash_grid zoom-level with larger coverage areas we would start to see where an entire police-force may be tackling an unusual volume of a particular crime type.
Obviously a time-based top-level segmentation would help identify current trends for each point in time
where a simple terms aggregation would typically show the very popular "constants" that persist across all time slots.
Use on free-text fields
The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:
-
keywords for refining end-user searches
-
keywords for use in percolator queries
Warning: Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt to load every unique word into RAM. It is recommended to only use this on smaller indices.
Tip: Show significant_terms in context
Free-text significant_terms are much more easily understood when viewed in context. Take the results of
Custom background sets
Ordinarily, the foreground set of documents is "diffed" against a background set of all the documents in your index.
However, sometimes it may prove useful to use a narrower background set as the basis for comparisons.
For example, a query on documents relating to "Madrid" in an index with content from all over the world might reveal that "Spanish"
was a significant term. This may be true but if you want some more focused terms you could use a background_filter
on the term 'spain' to establish a narrower set of documents as context. With this as a background "Spanish" would now
be seen as commonplace and therefore not as significant as words like "capital" that relate more strongly with Madrid.
Note that using a background filter will slow things down - each term’s background frequency must now be derived on-the-fly from filtering posting lists rather than reading the index’s pre-computed count for a term.
Limitations
Significant terms must be indexed values
Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. Because of the way the significant_terms aggregation must consider both foreground and background frequencies it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons. Also DocValues are not supported as sources of term data for similar reasons.
No analysis of floating point fields
Floating point fields are currently not supported as the subject of significant_terms analysis. While integer or long fields can be used to represent concepts like bank account numbers or category numbers which can be interesting to track, floating point fields are usually used to represent quantities of something. As such, individual floating point terms are not useful for this form of frequency analysis.
Use as a parent aggregation
If there is the equivalent of a match_all query or no query criteria providing a subset of the index, the significant_terms aggregation should not be used as the
top-most aggregation - in this scenario the foreground set is exactly the same as the background set and
so there is no difference in document frequencies to observe and from which to make sensible suggestions.
Another consideration is that the significant_terms aggregation produces many candidate results at shard level that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
Approximate counts
The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and as such may be:
-
low if certain shards did not provide figures for a given term in their top sample
-
high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
However, the size and shard size settings covered in the next section provide tools to help control the accuracy levels.
Parameters
JLH score
The JLH score can be used as a significance score by adding the parameter
"jlh": {
}
The scores are derived from the doc frequencies in foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the relative change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
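The description above translates into a short formula: the absolute change multiplied by the relative change. The Python sketch below follows that description and reproduces the scores in the sampler response shown earlier in this document; the function name jlh_score and the decision to score non-increasing terms as zero are assumptions for the example.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """JLH as described above: absolute change times relative change."""
    fg = subset_freq / subset_size      # foreground percentage
    bg = superset_freq / superset_size  # background percentage
    if bg == 0 or fg <= bg:
        return 0.0  # assumption: terms that are not more frequent score zero
    return (fg - bg) * (fg / bg)

# Figures from the sampler response: doc_count, sample size, bg_count, index size.
print(jlh_score(150, 200, 200, 650))  # elasticsearch, approximately 1.078125
print(jlh_score(50, 200, 50, 650))    # logstash, approximately 0.5625
```

The same function applied to the bicycle theft figures (3640 of 47347 against 66799 of 5064554) gives approximately 0.371, in line with the score reported in that response.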
Mutual information
Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter
"mutual_information": {
"include_negatives": true
}
Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms therefore can contain terms that appear more or less frequently in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, include_negatives can be set to false.
By default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set
"background_is_superset": false
Chi square
Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter
"chi_square": {
}
Chi square behaves like mutual information and can be configured with the same parameters include_negatives and background_is_superset.
Google normalized distance
Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (http://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter
"gnd": {
}
gnd also accepts the background_is_superset parameter.
Percentage
A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term. By default this produces a score greater than zero and less than one.
The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%.
It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat.
Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both min_doc_count and shard_min_doc_count to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence.
"percentage": {
}
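The bias towards one-off terms, and the way a minimum document count counters it, can be seen in a small Python sketch. The term counts are hypothetical, and percentage_score and significant are illustrative helpers, not Elasticsearch APIs.

```python
def percentage_score(subset_freq, superset_freq):
    # Foreground documents with the term / background documents with the term.
    return subset_freq / superset_freq

def significant(terms, min_doc_count=1):
    """terms maps term -> (subset_freq, superset_freq); highest score first."""
    scored = [(term, percentage_score(fg, bg))
              for term, (fg, bg) in terms.items() if fg >= min_doc_count]
    return sorted(scored, key=lambda item: item[1], reverse=True)

# Hypothetical counts: a one-off typo versus a genuinely frequent term.
terms = {"elasticsaerch": (1, 1), "logstash": (50, 60)}

print(significant(terms))                    # the typo wins with 1/1
print(significant(terms, min_doc_count=10))  # the typo is filtered out
```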
Which one is best?
Roughly, mutual_information
prefers high frequent terms even if they occur also frequently in the background. For example, in an analysis of natural language text this might lead to selection of stop words. mutual_information
is unlikely to select very rare terms like misspellings. gnd
prefers terms with a high co-occurrence and avoids selection of stopwords. It might be better suited for synonym detection. However, gnd
has a tendency to select very rare terms that are, for example, a result of misspelling. chi_square
and jlh
are somewhat in-between.
It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example [Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997](http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf) for a study on using significant terms for feature selection for text classification).
If none of the above measures suits your use case, then another option is to implement a custom significance measure:
Scripted
Customized scores can be implemented via a script:
"script_heuristic": {
"script": {
"lang": "painless",
"source": "params._subset_freq/(params._superset_freq - params._subset_freq + 1)"
}
}
Scripts can be inline (as in the example above), indexed or stored on disk. For details on the options, see the script documentation.
Available parameters in the script are:
_subset_freq | Number of documents the term appears in the subset.
_superset_freq | Number of documents the term appears in the superset.
_subset_size | Number of documents in the subset.
_superset_size | Number of documents in the superset.
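To get a feel for how the example expression behaves, it can be mirrored outside Elasticsearch. The following is a sketch only: the function name and the input counts are hypothetical, and it assumes floating-point division, which may not match how the script engine treats the numeric parameter types.

```python
def script_score(subset_freq, superset_freq, subset_size, superset_size):
    """Python mirror of the example expression.

    The two size parameters are available to a script but unused by this expression.
    """
    return subset_freq / (superset_freq - subset_freq + 1)


# A rare term concentrated in the subset: 4 of its 5 occurrences are in the subset.
print(script_score(4, 5, 100, 10_000_000))           # 2.0

# A common term: 50 subset occurrences out of 1,000,000 overall.
print(script_score(50, 1_000_000, 100, 10_000_000))  # a very small score
```

The expression rewards terms whose occurrences fall mostly inside the subset and penalises terms that are frequent outside it.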
Size & Shard Size
The size
parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
If the number of unique terms is greater than size
, the returned list can be slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).
To ensure better accuracy a multiple of the final size
is used as the number of terms to request from each shard
(2 * (size * 1.5 + 10)
). To take manual control of this setting the shard_size
parameter
can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the shard_size
parameter is set to
values significantly higher than the size
setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If shard_size
is set to -1 (the default) then shard_size
will be automatically estimated based on the number of shards and the size
parameter.
Note
|
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will
override it and reset it to be equal to size .
|
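The rules above can be summarised in a small sketch. The function names are hypothetical; the sketch uses only the formula quoted in the text, ignores the number of shards, and the truncation to an integer is an assumption.

```python
def default_shard_size(size: int) -> int:
    """Candidate terms requested per shard, following the formula 2 * (size * 1.5 + 10)."""
    # Truncating to an integer is an assumption; the text only gives the formula.
    return 2 * int(size * 1.5 + 10)


def effective_shard_size(size: int, shard_size: int = -1) -> int:
    """Resolve shard_size: -1 means 'estimate', and the result never drops below size."""
    if shard_size == -1:
        return default_shard_size(size)
    return max(shard_size, size)


print(default_shard_size(10))         # 50
print(effective_shard_size(10, 5))    # 10, because shard_size cannot be smaller than size
print(effective_shard_size(10, 200))  # 200
```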
Minimum document count
It is possible to only return terms that match at least a configured number of hits using the min_doc_count
option:
GET /_search
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tag",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more. Default value is 3
.
Terms that score highly are collected at the shard level and merged with the terms collected from other shards in a second step. However, a shard does not have the global term frequencies available. Whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not on the global frequencies of the word. The min_doc_count
criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add a term as a candidate is made without any certainty that the term will actually reach the required min_doc_count
. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the shard_size
parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
shard_min_doc_count
parameter
The parameter shard_min_doc_count
regulates how certain a shard must be that a term should be added to the candidate list with respect to the min_doc_count
. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count
. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the shard_min_doc_count
parameter to filter out candidate terms at the shard level that will, with reasonable certainty, not reach the required min_doc_count
even after merging the local frequencies. shard_min_doc_count
is set to 1
by default and has no effect unless you explicitly set it.
Warning
|
Setting min_doc_count to 1 is generally not advised as it tends to return terms that
are typos or other bizarre curiosities. Finding more than one instance of a term helps
reinforce that, while still rare, the term was not the result of a one-off accident. The
default value of 3 is used to provide a minimum weight-of-evidence.
Setting shard_min_doc_count too high will cause significant candidate terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards .
|
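The interaction between the two thresholds can be pictured with a simplified simulation. Everything here is hypothetical: the function name, the shard data, and the assumption that a term passes the shard-level check when its local count is at least shard_min_doc_count. Scoring and shard_size truncation are deliberately left out.

```python
from collections import Counter


def merge_candidates(shard_counts, min_doc_count, shard_min_doc_count):
    """Filter terms per shard, merge the survivors, then apply min_doc_count
    to the merged counts."""
    merged = Counter()
    for counts in shard_counts:
        for term, local_count in counts.items():
            # Assumption: the shard-level check is "local count >= shard_min_doc_count".
            if local_count >= shard_min_doc_count:
                merged[term] += local_count
    return {term: count for term, count in merged.items() if count >= min_doc_count}


# Hypothetical local counts on five shards: "h5n1" is spread evenly, "typo" is a one-off.
shards = [{"h5n1": 2, "typo": 1}] + [{"h5n1": 2} for _ in range(4)]

print(merge_candidates(shards, min_doc_count=10, shard_min_doc_count=1))  # {'h5n1': 10}
print(merge_candidates(shards, min_doc_count=10, shard_min_doc_count=3))  # {}
```

In the second call the term reaches 10 documents globally but only 2 on each shard, so a shard-level threshold of 3 removes it everywhere. This is why the shard-level value should stay well below min_doc_count divided by the number of shards.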
Custom background context
The default source of statistical information for background term frequencies is the entire index and this
scope can be narrowed through the use of a background_filter
to focus in on significant terms within a narrower
context:
GET /_search
{
"query" : {
"match" : {
"city" : "madrid"
}
},
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tag",
"background_filter": {
"term" : { "text" : "spain"}
}
}
}
}
}
The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing terms like "Spanish" that are unusual in the full index’s worldwide context but commonplace in the subset of documents containing the word "Spain".
Warning
|
Use of background filters will slow the query as each term’s postings must be filtered to determine a frequency. |
Filtering Values
It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the include
and
exclude
parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
described in the terms aggregation documentation.
Execution hint
There are different mechanisms by which terms aggregations can be executed:
- by using field values directly in order to aggregate data per-bucket (map)
- by using global ordinals of the field and allocating one bucket per global ordinal (global_ordinals)
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
global_ordinals
is the default option for keyword
fields. It uses global ordinals to allocate buckets dynamically
so memory usage is linear in the number of values of the documents that are part of the aggregation scope.
map
should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode
is significantly faster. By default, map
is only used when running an aggregation on scripts, since they don’t have
ordinals.
GET /_search
{
"aggs" : {
"tags" : {
"significant_terms" : {
"field" : "tags",
"execution_hint": "map" (1)
}
}
}
}
(1) the possible values are map and global_ordinals
Please note that Elasticsearch will ignore this execution hint if it is not applicable.
Significant Text Aggregation
An aggregation that returns interesting or unusual occurrences of free-text terms in a set. It is like the significant terms aggregation but differs in that:
- It is specifically designed for use on type text fields
- It does not require field data or doc-values
- It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.
Warning
|
Re-analyzing large result sets will require a lot of time and memory. It is recommended that the significant_text aggregation is used as a child of either the sampler or diversified sampler aggregation to limit the analysis to a small selection of top-matching documents e.g. 200. This will typically improve speed, memory use and quality of results. |
Example use cases:
- Suggesting "H5N1" when users search for "bird flu" to help expand queries
- Suggesting keywords relating to stock symbol $ATI for use in an automated news classifier
In these cases the words being selected are not simply the most popular terms in results. The most popular words tend to be very boring (and, of, the, we, I, they …). The significant words are the ones that have undergone a significant change in popularity measured between a foreground and background set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
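The size of that swing can be worked out directly. The sketch below uses a plain ratio of frequencies purely for illustration; it is not one of the significance heuristics, and the function name is hypothetical.

```python
def frequency(doc_count: int, set_size: int) -> float:
    """Fraction of documents in a set that contain the term."""
    return doc_count / set_size


foreground = frequency(4, 100)          # 4 of the 100 search results
background = frequency(5, 10_000_000)   # 5 of the 10 million indexed documents
lift = foreground / background

print(f"foreground={foreground:.2%} background={background:.5%} lift={lift:,.0f}x")
```

The term is roughly 80,000 times more frequent in the search results than in the index as a whole.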
Basic use
In the typical use case, the foreground set of interest is a selection of the top-matching search results for a query and the background set used for statistical comparisons is the index or indices from which the results were gathered.
Example:
GET news/article/_search
{
"query" : {
"match" : {"content" : "Bird flu"}
},
"aggregations" : {
"my_sample" : {
"sampler" : {
"shard_size" : 100
},
"aggregations": {
"keywords" : {
"significant_text" : { "field" : "content" }
}
}
}
}
}
Response:
{
"took": 9,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations" : {
"my_sample": {
"doc_count": 100,
"keywords" : {
"doc_count": 100,
"buckets" : [
{
"key": "h5n1",
"doc_count": 4,
"score": 4.71235374214817,
"bg_count": 5
}
...
]
}
}
}
}
The results show that "h5n1" is one of several terms strongly associated with bird flu.
It only occurs 5 times in our index as a whole (see the bg_count
) and yet 4 of these
were lucky enough to appear in our 100 document sample of "bird flu" results. That suggests
a significant word and one which the user can potentially add to their search.
Dealing with noisy data using filter_duplicate_text
Free-text fields often contain a mix of original content and mechanical copies of text (cut-and-paste biographies, email reply chains, retweets, boilerplate headers/footers, page navigation menus, sidebar news links, copyright notices, standard disclaimers, addresses).
In real-world data these duplicate sections of text tend to feature heavily in significant_text
results if they aren’t filtered out.
Filtering near-duplicate text is a difficult task at index-time but we can cleanse the data on-the-fly at query time using the
filter_duplicate_text
setting.
First let’s look at an unfiltered real-world example using the Signal media dataset of a million news articles covering a wide variety of news. Here are the raw significant text results for a search for the articles mentioning "elasticsearch":
{
...
"aggregations": {
"sample": {
"doc_count": 35,
"keywords": {
"doc_count": 35,
"buckets": [
{
"key": "elasticsearch",
"doc_count": 35,
"score": 28570.428571428572,
"bg_count": 35
},
...
{
"key": "currensee",
"doc_count": 8,
"score": 6530.383673469388,
"bg_count": 8
},
...
{
"key": "pozmantier",
"doc_count": 4,
"score": 3265.191836734694,
"bg_count": 4
},
...
}
The uncleansed documents have thrown up some odd-looking terms that are, on the face of it, statistically correlated with appearances of our search term "elasticsearch" e.g. "pozmantier". We can drill down into examples of these documents to see why pozmantier is connected using this query:
GET news/article/_search
{
"query": {
"simple_query_string": {
"query": "+elasticsearch +pozmantier"
}
},
"_source": [
"title",
"source"
],
"highlight": {
"fields": {
"content": {}
}
}
}
The results show a series of very similar news articles about a judging panel for a number of tech projects:
{
...
"hits": {
"hits": [
{
...
"_source": {
"source": "Presentation Master",
"title": "T.E.N. Announces Nominees for the 2015 ISE® North America Awards"
},
"highlight": {
"content": [
"City of San Diego Mike <em>Pozmantier</em>, Program Manager, Cyber Security Division, Department of",
" Janus, Janus <em>ElasticSearch</em> Security Visualization Engine "
]
}
},
{
...
"_source": {
"source": "RCL Advisors",
"title": "T.E.N. Announces Nominees for the 2015 ISE(R) North America Awards"
},
"highlight": {
"content": [
"Mike <em>Pozmantier</em>, Program Manager, Cyber Security Division, Department of Homeland Security S&T",
"Janus, Janus <em>ElasticSearch</em> Security Visualization Engine"
]
}
},
...
Mike Pozmantier was one of many judges on a panel and elasticsearch was used in one of many projects being judged.
As is typical, this lengthy press release was cut-and-pasted by a variety of news sites and consequently any rare names, numbers or typos it contains become statistically correlated with our matching query.
Fortunately similar documents tend to rank similarly so as part of examining the stream of top-matching documents the significant_text
aggregation can apply a filter to remove sequences of any 6 or more tokens that have already been seen. Let’s try this same query now but
with the filter_duplicate_text
setting turned on:
GET news/article/_search
{
"query": {
"match": {
"content": "elasticsearch"
}
},
"aggs": {
"sample": {
"sampler": {
"shard_size": 100
},
"aggs": {
"keywords": {
"significant_text": {
"field": "content",
"filter_duplicate_text": true
}
}
}
}
}
}
The results from analysing our deduplicated text are obviously of higher quality to anyone familiar with the Elastic Stack:
{
...
"aggregations": {
"sample": {
"doc_count": 35,
"keywords": {
"doc_count": 35,
"buckets": [
{
"key": "elasticsearch",
"doc_count": 22,
"score": 11288.001166180758,
"bg_count": 35
},
{
"key": "logstash",
"doc_count": 3,
"score": 1836.648979591837,
"bg_count": 4
},
{
"key": "kibana",
"doc_count": 3,
"score": 1469.3020408163263,
"bg_count": 5
}
]
}
}
}
}
Mr Pozmantier and other one-off associations with elasticsearch no longer appear in the aggregation results as a consequence of copy-and-paste operations or other forms of mechanical repetition.
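The idea of removing token sequences that have already been seen can be pictured with a deliberately simplified sketch. It is not the implementation used by Elasticsearch: the function name, the whitespace tokenisation and the exact-match set of 6-token windows are all assumptions made for illustration.

```python
def filter_duplicate_sequences(docs, min_run=6):
    """Drop tokens that belong to a run of min_run tokens already seen in an
    earlier document of the stream."""
    seen = set()  # every min_run-token window observed so far
    cleaned = []
    for tokens in docs:
        windows = [tuple(tokens[i:i + min_run]) for i in range(len(tokens) - min_run + 1)]
        duplicate = [False] * len(tokens)
        for start, window in enumerate(windows):
            if window in seen:
                # Mark every token covered by the repeated window.
                for pos in range(start, start + min_run):
                    duplicate[pos] = True
        seen.update(windows)
        cleaned.append([tok for tok, dup in zip(tokens, duplicate) if not dup])
    return cleaned


boilerplate = "mike pozmantier program manager cyber security division department".split()
docs = [
    boilerplate + ["elasticsearch"],
    ["breaking"] + boilerplate + ["kibana"],
]
first, second = filter_duplicate_sequences(docs)
print(first)   # unchanged: nothing had been seen before
print(second)  # ['breaking', 'kibana']
```

Only the first copy of the repeated passage contributes tokens, so names that appear solely in the copied text are counted once rather than once per copy.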
If your duplicate or near-duplicate content is identifiable via a single-value indexed field (perhaps
a hash of the article’s title
text or an original_press_release_url
field) then it would be more
efficient to use a parent diversified sampler aggregation
to eliminate these documents from the sample set based on that single key. The less duplicate content you can feed into
the significant_text aggregation up front the better in terms of performance.
Limitations
No support for child aggregations
The significant_text aggregation intentionally does not support the addition of child aggregations because:
- It would come with a high memory cost
- It isn’t a generally useful feature and there is a workaround for those that need it
The volume of candidate terms is generally very high and these are pruned heavily before the final
results are returned. Supporting child aggregations would generate additional churn and be inefficient.
Clients can always take the heavily-trimmed set of results from a significant_text
request and
make a subsequent follow-up query using a terms
aggregation with an include
clause and child
aggregations to perform further analysis of selected keywords in a more efficient fashion.
No support for nested objects
The significant_text aggregation currently also cannot be used with text fields in nested objects, because it works with the document JSON source. This makes this feature inefficient when matching nested docs from stored JSON given a matching Lucene docID.
Approximate counts
The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and as such may be:
- low if certain shards did not provide figures for a given term in their top sample
- high when considering the background frequency as it may count occurrences found in deleted documents
Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies.
However, the size
and shard size
settings covered in the next section provide tools to help control the accuracy levels.
Parameters
Significance heuristics
This aggregation supports the same scoring heuristics (JLH, mutual_information, gnd, chi_square etc.) as the significant terms aggregation.
Size & Shard Size
The size
parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
If the number of unique terms is greater than size
, the returned list can be slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).
To ensure better accuracy a multiple of the final size
is used as the number of terms to request from each shard
(2 * (size * 1.5 + 10)
). To take manual control of this setting the shard_size
parameter
can be used to control the volumes of candidate terms produced by each shard.
Low-frequency terms can turn out to be the most interesting ones once all results are combined so the
significant_terms aggregation can produce higher-quality results when the shard_size
parameter is set to
values significantly higher than the size
setting. This ensures that a bigger volume of promising candidate terms are given
a consolidated review by the reducing node before the final selection. Obviously large candidate term lists
will cause extra network traffic and RAM usage so this is a quality/cost trade-off that needs to be balanced. If shard_size
is set to -1 (the default) then shard_size
will be automatically estimated based on the number of shards and the size
parameter.
Note
|
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will
override it and reset it to be equal to size .
|
Minimum document count
It is possible to only return terms that match at least a configured number of hits using the min_doc_count
option.
The default value is 3.
Terms that score highly are collected at the shard level and merged with the terms collected from other shards in a second step.
However, a shard does not have the global term frequencies available. Whether a term is added to a
candidate list depends only on the score computed on the shard using local shard frequencies, not on the global frequencies of the word.
The min_doc_count
criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add a
term as a candidate is made without any certainty that the term will actually reach the required min_doc_count
.
This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency but high-scoring terms populated
the candidate lists. To avoid this, the shard_size
parameter can be increased to allow more candidate terms on the shards.
However, this increases memory consumption and network traffic.
shard_min_doc_count
parameter
The parameter shard_min_doc_count
regulates how certain a shard must be that a term should be added to the candidate list
with respect to the min_doc_count
. Terms will only be considered if their local shard frequency within the set is higher than the
shard_min_doc_count
. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings),
then you can set the shard_min_doc_count
parameter to filter out candidate terms at the shard level that will, with reasonable certainty,
not reach the required min_doc_count
even after merging the local frequencies. shard_min_doc_count
is set to 1
by default and has
no effect unless you explicitly set it.
Warning
|
Setting min_doc_count to 1 is generally not advised as it tends to return terms that
are typos or other bizarre curiosities. Finding more than one instance of a term helps
reinforce that, while still rare, the term was not the result of a one-off accident. The
default value of 3 is used to provide a minimum weight-of-evidence.
Setting shard_min_doc_count too high will cause significant candidate terms to be filtered out on a shard level.
This value should be set much lower than min_doc_count/#shards .
|
Custom background context
The default source of statistical information for background term frequencies is the entire index and this
scope can be narrowed through the use of a background_filter
to focus in on significant terms within a narrower
context:
GET news/article/_search
{
"query" : {
"match" : {
"content" : "madrid"
}
},
"aggs" : {
"tags" : {
"significant_text" : {
"field" : "content",
"background_filter": {
"term" : { "content" : "spain"}
}
}
}
}
}
The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing terms like "Spanish" that are unusual in the full index’s worldwide context but commonplace in the subset of documents containing the word "Spain".
Warning
|
Use of background filters will slow the query as each term’s postings must be filtered to determine a frequency. |
Dealing with source and index mappings
Ordinarily the indexed field name and the original JSON field being retrieved share the same name.
However, with more complex field mappings using features like copy_to
the source
JSON field(s) and the indexed field being aggregated can differ.
In these cases it is possible to list the JSON _source fields from which text
will be analyzed using the source_fields
parameter:
GET news/article/_search
{
"query" : {
"match" : {
"custom_all" : "elasticsearch"
}
},
"aggs" : {
"tags" : {
"significant_text" : {
"field" : "custom_all",
"source_fields": ["content" , "title"]
}
}
}
}
Filtering Values
It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the include
and
exclude
parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features
described in the terms aggregation documentation.
Terms Aggregation
A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.
Example:
GET /_search
{
"aggs" : {
"genres" : {
"terms" : { "field" : "genre" } (1)
}
}
}
(1) The terms aggregation should be run on a field of type keyword or any other data type suitable for bucket aggregations. In order to use it with text you will need to enable fielddata.
Response:
{
...
"aggregations" : {
"genres" : {
"doc_count_error_upper_bound": 0, (1)
"sum_other_doc_count": 0, (2)
"buckets" : [ (3)
{
"key" : "electronic",
"doc_count" : 6
},
{
"key" : "rock",
"doc_count" : 3
},
{
"key" : "jazz",
"doc_count" : 2
}
]
}
}
}
(1) an upper bound of the error on the document counts for each term, see below
(2) when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response
(3) the list of the top buckets, the meaning of top being defined by the order
By default, the terms
aggregation will return the buckets for the top ten terms ordered by the doc_count
. One can
change this default behaviour by setting the size
parameter.
Size
The size
parameter can be set to define how many term buckets should be returned out of the overall terms list. By
default, the node coordinating the search process will request each shard to provide its own top size
term buckets
and once all shards respond, it will reduce the results to the final list that will then be returned to the client.
This means that if the number of unique terms is greater than size
, the returned list can be slightly off and not accurate
(it could be that the term counts are slightly off and it could even be that a term that should have been in the top
size buckets was not returned).
Note
|
If you want to retrieve all terms or all combinations of terms in a nested terms aggregation
you should use the Composite aggregation which
allows you to paginate over all possible terms rather than setting a size greater than the cardinality of the field in the
terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.
|
Document counts are approximate
Document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. Each shard provides its own view of what the ordered list of terms should be. These views are combined to give a final view.
Shard Size
The higher the requested size
is, the more accurate the results will be, but also, the more expensive it will be to
compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data
transfers between the nodes and the client).
The shard_size
parameter can be used to minimize the extra work that comes with bigger requested size
. When defined,
it will determine how many terms the coordinating node will request from each shard. Once all the shards have responded, the
coordinating node will then reduce them to a final result which will be based on the size
parameter - this way,
one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to
the client.
Note
|
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will
override it and reset it to be equal to size .
|
The default shard_size
is (size * 1.5 + 10)
.
Calculating Document Count Error
There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms. This is calculated as the sum of the document count from the last term returned from each shard.
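The aggregation-level error can be illustrated with a small simulation of the merge step. The shard lists and the function name are hypothetical, and the sketch assumes every shard holds more terms than it returned, so that each shard's last returned count is a valid upper bound for its unreturned terms.

```python
from collections import Counter


def merge_shard_results(shard_top_terms, size):
    """Merge per-shard top-term lists; return (top buckets, doc_count_error_upper_bound)."""
    merged = Counter()
    error_upper_bound = 0
    for top_terms in shard_top_terms:
        for term, doc_count in top_terms:
            merged[term] += doc_count
        if top_terms:
            # A term the shard did not return can have at most the count of its last returned term.
            error_upper_bound += top_terms[-1][1]
    return merged.most_common(size), error_upper_bound


# Hypothetical top-3 lists (shard_size = 3), already sorted by descending doc_count.
shard_a = [("rock", 25), ("jazz", 10), ("pop", 6)]
shard_b = [("rock", 30), ("pop", 12), ("folk", 4)]
shard_c = [("jazz", 20), ("rock", 15), ("folk", 2)]

buckets, error = merge_shard_results([shard_a, shard_b, shard_c], size=2)
print(buckets)  # [('rock', 70), ('jazz', 30)]
print(error)    # 12
```

Here the last terms returned by the three shards have counts 6, 4 and 2, so a term missing from the final list could have been found in at most 12 documents.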
Per bucket document count error
The second error value can be enabled by setting the show_term_doc_count_error
parameter to true:
GET /_search
{
"aggs" : {
"products" : {
"terms" : {
"field" : "product",
"size" : 5,
"show_term_doc_count_error": true
}
}
}
}
This shows an error value for each term returned by the aggregation which represents the 'worst case' error in the document count
and can be useful when deciding on a value for the shard_size
parameter. This is calculated by summing the document counts for
the last term returned by all shards which did not return the term.
These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.
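The per-term calculation for the descending document count case can be sketched as follows. The shard data and the function name are hypothetical; the sketch assumes each shard list is sorted by descending doc_count and was cut off at shard_size.

```python
def per_term_errors(shard_top_terms, terms):
    """For each term, sum the last returned doc_count of every shard that did not return it."""
    errors = {}
    for term in terms:
        error = 0
        for top_terms in shard_top_terms:
            returned = {name for name, _ in top_terms}
            if term not in returned and top_terms:
                # The shard may still hold the term, but with at most this many documents.
                error += top_terms[-1][1]
        errors[term] = error
    return errors


shard_lists = [
    [("rock", 25), ("jazz", 10), ("pop", 6)],
    [("rock", 30), ("pop", 12), ("folk", 4)],
    [("jazz", 20), ("rock", 15), ("folk", 2)],
]

print(per_term_errors(shard_lists, ["rock", "jazz"]))  # {'rock': 0, 'jazz': 4}
```

"rock" was returned by every shard, so its count is exact, while "jazz" was missing from one shard whose last returned term had 4 documents.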
Order
The order of the buckets can be customized by setting the order
parameter. By default, the buckets are ordered by
their doc_count
descending. It is possible to change this behaviour as documented below:
Warning
|
Sorting by ascending _count or by sub aggregation is discouraged as it increases the
error on document counts.
It is fine when a single shard is queried, or when the field that is being aggregated was used
as a routing key at index time: in these cases results will be accurate since shards have disjoint
values. However otherwise, errors are unbounded. One particular case that could still be useful
is sorting by min or
max aggregation: counts will not be accurate
but at least the top buckets will be correctly picked.
|
Ordering the buckets by their doc _count
in an ascending manner:
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "_count" : "asc" }
}
}
}
}
Ordering the buckets alphabetically by their terms in an ascending manner:
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "_key" : "asc" }
}
}
}
}
Deprecated in 6.0.0. Use _key
instead of _term
to order buckets by their term.
Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "max_play_count" : "desc" }
},
"aggs" : {
"max_play_count" : { "max" : { "field" : "play_count" } }
}
}
}
}
Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name):
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"order" : { "playback_stats.max" : "desc" }
},
"aggs" : {
"playback_stats" : { "stats" : { "field" : "play_count" } }
}
}
}
}
Note
|
Pipeline aggs cannot be used for sorting
Pipeline aggregations are run during the reduce phase after all other aggregations have already completed. For this reason, they cannot be used for ordering. |
It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long
as the aggregations in the path are of a single-bucket type, where the last aggregation in the path may either be a single-bucket
one or a metrics one. If it’s a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. doc_count
),
in case it’s a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of
a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).
The path must be defined in the following form:
AGG_SEPARATOR = '>' ;
METRIC_SEPARATOR = '.' ;
AGG_NAME = <the name of the aggregation> ;
METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH = <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;
GET /_search
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "artist.country",
"order" : { "rock>playback_stats.avg" : "desc" }
},
"aggs" : {
"rock" : {
"filter" : { "term" : { "genre" : "rock" }},
"aggs" : {
"playback_stats" : { "stats" : { "field" : "play_count" }}
}
}
}
}
}
}
The above will sort the artist’s countries buckets based on the average play count among the rock songs.
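To make the path form concrete, here is a minimal sketch of splitting such a path into its parts. The function name is hypothetical and the sketch assumes aggregation names contain neither of the two separator characters; it does not validate aggregation types.

```python
def parse_order_path(path: str):
    """Split an order path into (aggregation names, metric name or None)."""
    *parents, last = path.split(">")
    # Only the final element may carry a metric, separated by '.'.
    name, _, metric = last.partition(".")
    return parents + [name], (metric or None)


print(parse_order_path("rock>playback_stats.avg"))  # (['rock', 'playback_stats'], 'avg')
print(parse_order_path("max_play_count"))           # (['max_play_count'], None)
```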
Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following:
GET /_search
{
"aggs" : {
"countries" : {
"terms" : {
"field" : "artist.country",
"order" : [ { "rock>playback_stats.avg" : "desc" }, { "_count" : "desc" } ]
},
"aggs" : {
"rock" : {
"filter" : { "term" : { "genre" : "rock" }},
"aggs" : {
"playback_stats" : { "stats" : { "field" : "play_count" }}
}
}
}
}
}
}
The above will sort the artist’s countries buckets based on the average play count among the rock songs and then by
their doc_count
in descending order.
Note
|
In the event that two buckets share the same values for all order criteria the bucket’s term value is used as a tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets. |
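The effect of several order criteria plus the tie-breaker can be reproduced with an ordinary sort. The bucket data and the rock_avg field name are hypothetical stand-ins for the values the aggregation would compute.

```python
buckets = [
    {"key": "spain", "doc_count": 8, "rock_avg": 120.0},
    {"key": "france", "doc_count": 8, "rock_avg": 120.0},
    {"key": "japan", "doc_count": 5, "rock_avg": 300.0},
    {"key": "brazil", "doc_count": 9, "rock_avg": 120.0},
]


def order_buckets(buckets):
    """Sort by average descending, then doc_count descending, then key ascending."""
    return sorted(buckets, key=lambda b: (-b["rock_avg"], -b["doc_count"], b["key"]))


print([b["key"] for b in order_buckets(buckets)])  # ['japan', 'brazil', 'france', 'spain']
```

"france" and "spain" agree on both criteria, so the ascending term value decides their relative order.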
Minimum document count
It is possible to only return terms that match at least a configured number of hits using the min_doc_count
option:
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"min_doc_count": 10
}
}
}
}
The above aggregation would only return tags which have been found in 10 hits or more. Default value is 1
.
Terms are collected and ordered at the shard level and merged with the terms collected from other shards in a second step. However, a shard does not have the global document count available. Whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count
criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add a term as a candidate is made without any certainty that the term will actually reach the required min_doc_count
. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency terms populated the candidate lists. To avoid this, the shard_size
parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.
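The effect can be reproduced with a small simulation. The sketch below is a simplified model, not Elasticsearch code: each shard keeps only its locally most frequent terms, the candidates are merged, and only then is the threshold applied. The function names and the data are hypothetical.

```python
from collections import Counter

def shard_candidates(term_counts, shard_size):
    """Return the shard_size most frequent terms on one shard (local counts only)."""
    return dict(Counter(term_counts).most_common(shard_size))

def merged_terms(shards, shard_size, min_doc_count):
    """Merge per-shard candidates, then apply min_doc_count to the merged counts."""
    merged = Counter()
    for term_counts in shards:
        merged.update(shard_candidates(term_counts, shard_size))
    return {term: count for term, count in merged.items() if count >= min_doc_count}

# Hypothetical data: "common" appears 4 times on every shard, but each shard
# also has a locally frequent term that is rare globally.
shards = [
    {"common": 4, "local_a": 5},
    {"common": 4, "local_b": 5},
    {"common": 4, "local_c": 5},
]

small = merged_terms(shards, shard_size=1, min_doc_count=10)
large = merged_terms(shards, shard_size=2, min_doc_count=10)
print(small)  # {} because "common" never became a candidate
print(large)  # {'common': 12}
```

With a candidate list of one term per shard, the globally frequent term is lost; allowing two candidates per shard recovers it at the cost of shipping more terms.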
shard_min_doc_count
parameter
The parameter shard_min_doc_count
regulates how certain a shard must be that a term should be added to the candidate list with respect to the min_doc_count
. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count
. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count
parameter to filter out candidate terms at the shard level that will, with reasonable certainty, not reach the required min_doc_count
even after merging the local counts. shard_min_doc_count
is set to 0
by default and has no effect unless you explicitly set it.
Note
|
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of
the returned terms which have a document count of zero might only belong to deleted documents or documents
from other types, so there is no guarantee that a match_all query would find a positive document count for
those terms.
|
Warning
|
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets
which is less than size because not enough data was gathered from the shards. Missing buckets can be
brought back by increasing shard_size.
Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
|
Script
Generating the terms using a script:
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"source": "doc['genre'].value",
"lang": "painless"
}
}
}
}
}
This will interpret the script
parameter as an inline
script with the default script language and no script parameters. To use a stored script use the following syntax:
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"script" : {
"id": "my_script",
"params": {
"field": "genre"
}
}
}
}
}
}
Value Script
GET /_search
{
"aggs" : {
"genres" : {
"terms" : {
"field" : "genre",
"script" : {
"source" : "'Genre: ' +_value",
"lang" : "painless"
}
}
}
}
}
Filtering Values
It is possible to filter the values for which buckets will be created. This can be done using the include
and
exclude
parameters which are based on regular expression strings or arrays of exact values. Additionally,
include
clauses can filter using partition
expressions.
Filtering Values with regular expressions
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"include" : ".*sport.*",
"exclude" : "water_.*"
}
}
}
}
In the above example, buckets will be created for all the tags that have the word sport
in them, except those starting
with water_
(so the tag water_sports
will not be aggregated). The include
regular expression will determine what
values are "allowed" to be aggregated, while the exclude
determines the values that should not be aggregated. When
both are defined, the exclude
has precedence, meaning, the include
is evaluated first and only then the exclude
.
The syntax is the same as regexp queries.
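The interplay of the two patterns can be sketched as follows. Python's `re` module is used here only as a stand-in; its syntax is not identical to the regular expression syntax Elasticsearch accepts, and the helper name `keep_term` is hypothetical. The sketch assumes a pattern has to match the whole term, which is why `re.fullmatch` is used.

```python
import re

def keep_term(term, include=None, exclude=None):
    """Return True if a bucket would be created for this term.

    The include pattern is checked first; a term that passes it is
    still dropped if it also matches the exclude pattern.
    """
    if include is not None and re.fullmatch(include, term) is None:
        return False
    if exclude is not None and re.fullmatch(exclude, term) is not None:
        return False
    return True

tags = ["sports", "water_sports", "motorsport", "music", "water_polo"]
kept = [t for t in tags if keep_term(t, include=".*sport.*", exclude="water_.*")]
print(kept)  # ['sports', 'motorsport']
```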
Filtering Values with exact values
For matching based on exact values the include
and exclude
parameters can simply take an array of
strings that represent the terms as they are found in the index:
GET /_search
{
"aggs" : {
"JapaneseCars" : {
"terms" : {
"field" : "make",
"include" : ["mazda", "honda"]
}
},
"ActiveCarManufacturers" : {
"terms" : {
"field" : "make",
"exclude" : ["rover", "jensen"]
}
}
}
}
Filtering Values with partitions
Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request. Consider this request which is looking for accounts that have not logged any access recently:
GET /_search
{
"size": 0,
"aggs": {
"expired_sessions": {
"terms": {
"field": "account_id",
"include": {
"partition": 0,
"num_partitions": 20
},
"size": 10000,
"order": {
"last_access": "asc"
}
},
"aggs": {
"last_access": {
"max": {
"field": "access_date"
}
}
}
}
}
}
This request is finding the last logged access date for a subset of customer accounts because we
might want to expire some customer accounts who haven’t been seen for a long while.
The num_partitions
setting has requested that the unique account_ids are organized evenly into twenty
partitions (0 to 19), and the partition
setting in this request filters to only consider account_ids falling
into partition 0. Subsequent requests should ask for partitions 1 then 2 etc to complete the expired-account analysis.
Note that the size
setting for the number of results returned needs to be tuned with the num_partitions
.
For this particular account-expiration example the process for balancing values for size
and num_partitions
would be as follows:
-
Use the cardinality aggregation to estimate the total number of unique account_id values
-
Pick a value for num_partitions to break the number from the first step up into more manageable chunks
-
Pick a size value for the number of responses we want from each partition
-
Run a test request
If we have a circuit-breaker error we are trying to do too much in one request and must increase num_partitions
.
If the request was successful but the last account ID in the date-sorted test response was still an account we might want to
expire then we may be missing accounts of interest and have set our numbers too low. We must either
-
increase the size parameter to return more results per partition (could be heavy on memory) or
-
increase the num_partitions to consider fewer accounts per request (could increase overall processing time as we need to make more requests)
Ultimately this is a balancing act between managing the Elasticsearch resources required to process a single request and the volume of requests that the client application must issue to complete a task.
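The idea behind partitioning can be sketched on the client side in a few lines. This is an illustration under assumptions: `zlib.crc32` is only a stand-in for whatever hash Elasticsearch applies to the terms, and the account IDs are generated for the example.

```python
import zlib

def partition_of(term, num_partitions):
    """Assign a term to a partition with a stable hash (crc32 is a stand-in)."""
    return zlib.crc32(term.encode("utf-8")) % num_partitions

def terms_in_partition(terms, partition, num_partitions):
    """Keep only the terms that fall into the requested partition."""
    return [t for t in terms if partition_of(t, num_partitions) == partition]

account_ids = [f"account-{i}" for i in range(1000)]
num_partitions = 20

# One "request" per partition; together they cover every account exactly once.
seen = []
for partition in range(num_partitions):
    seen.extend(terms_in_partition(account_ids, partition, num_partitions))

print(len(seen), len(set(seen)))
```

Because every term hashes to exactly one partition, iterating over partitions 0 to 19 visits each account once, while each individual request only has to handle a fraction of the terms.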
Multi-field terms aggregation
The terms
aggregation does not support collecting terms from multiple fields
in the same document. The reason is that the terms
agg doesn’t collect the
string term values themselves, but rather uses
global ordinals
to produce a list of all of the unique values in the field. Global ordinals
result in an important performance boost which would not be possible across
multiple fields.
There are two approaches that you can use to perform a terms
agg across
multiple fields:
- Script
-
Use a script to retrieve terms from multiple fields. This disables the global ordinals optimization and will be slower than collecting terms from a single field, but it gives you the flexibility to implement this option at search time.
- copy_to field
-
If you know ahead of time that you want to collect the terms from two or more fields, then use copy_to in your mapping to create a new dedicated field at index time which contains the values from both fields. You can aggregate on this single field, which will benefit from the global ordinals optimization.
Collect mode
Deferring calculation of child aggregations
For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree are expanded in one depth-first pass and only then any pruning occurs. In some scenarios this can be very wasteful and can hit memory constraints. An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:
GET /_search
{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "actors",
"size" : 5
}
}
}
}
}
}
Even though the number of actors may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets
during calculation - a single actor can produce n² buckets where n is the number of actors. The sane option would be to first determine
the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the breadth_first
collection
mode as opposed to the depth_first
mode.
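The difference between the two strategies can be made concrete by counting how many buckets each one has to build. The following sketch is a toy model, not the Elasticsearch collector: the movie data is invented, and for simplicity an actor is not counted as their own co-star.

```python
from collections import Counter
from itertools import permutations

def depth_first(movies, top_actors):
    """Build every actor bucket and every (actor, co-star) bucket, then prune."""
    actor_counts = Counter()
    pair_counts = Counter()
    for cast in movies:
        actor_counts.update(cast)
        pair_counts.update(permutations(cast, 2))
    top = [actor for actor, _ in actor_counts.most_common(top_actors)]
    return top, len(actor_counts) + len(pair_counts)

def breadth_first(movies, top_actors):
    """Find the top actors first, then build co-star buckets only for them."""
    actor_counts = Counter()
    for cast in movies:
        actor_counts.update(cast)
    top = [actor for actor, _ in actor_counts.most_common(top_actors)]
    keep = set(top)
    pair_counts = Counter()
    for cast in movies:  # second pass over the same documents
        pair_counts.update(
            (actor, other) for actor, other in permutations(cast, 2) if actor in keep
        )
    return top, len(actor_counts) + len(pair_counts)

movies = [
    ["a", "b", "c"],
    ["a", "b", "d"],
    ["a", "e", "f"],
    ["g", "h", "i"],
]

df_top, df_buckets = depth_first(movies, top_actors=1)
bf_top, bf_buckets = breadth_first(movies, top_actors=1)
print(df_top, df_buckets)  # ['a'] 31
print(bf_top, bf_buckets)  # ['a'] 14
```

Both strategies select the same top actor, but the breadth-first variant builds fewer than half the buckets even on this tiny data set, at the price of a second pass over the documents.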
Note
|
The breadth_first is the default mode for fields with a cardinality bigger than the requested size or when the cardinality is unknown (numeric fields or scripts for instance).
It is possible to override the default heuristic and to provide a collect mode directly in the request:
|
GET /_search
{
"aggs" : {
"actors" : {
"terms" : {
"field" : "actors",
"size" : 10,
"collect_mode" : "breadth_first" (1)
},
"aggs" : {
"costars" : {
"terms" : {
"field" : "actors",
"size" : 5
}
}
}
}
}
}
-
The possible values are breadth_first and depth_first
When using breadth_first
mode the set of documents that fall into the uppermost buckets are
cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents.
Note that the order
parameter can still be used to refer to data from a child aggregation when using the breadth_first
setting - the parent
aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.
Warning
|
Nested aggregations such as top_hits which require access to score information under an aggregation that uses the breadth_first
collection mode need to replay the query on the second pass but only for the documents belonging to the top buckets.
|
Execution hint
There are different mechanisms by which terms aggregations can be executed:
-
by using field values directly in order to aggregate data per-bucket (
map
) -
by using global ordinals of the field and allocating one bucket per global ordinal (
global_ordinals
)
Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.
global_ordinals
is the default option for keyword
fields; it uses global ordinals to allocate buckets dynamically,
so memory usage is linear in the number of values of the documents that are part of the aggregation scope.
map
should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode
is significantly faster. By default, map
is only used when running an aggregation on scripts, since they don’t have
ordinals.
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"execution_hint": "map" (1)
}
}
}
}
-
The possible values are map and global_ordinals
Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.
Missing value
The missing
parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they
had a value.
GET /_search
{
"aggs" : {
"tags" : {
"terms" : {
"field" : "tags",
"missing": "N/A" (1)
}
}
}
}
-
Documents without a value in the tags field will fall into the same bucket as documents that have the value N/A.
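The behavior of the missing parameter can be sketched with a small counting function. This is an illustration only; the function name `terms_counts` and the documents are hypothetical.

```python
from collections import Counter

def terms_counts(docs, field, missing=None):
    """Count documents per term; documents without a value are ignored
    unless a replacement value is given via `missing`."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field) or []
        if not values and missing is not None:
            values = [missing]
        counts.update(set(values))  # each document counts once per term
    return dict(counts)

docs = [
    {"tags": ["red", "blue"]},
    {"tags": ["red"]},
    {"title": "no tags here"},
    {"tags": ["N/A"]},
]

ignored = terms_counts(docs, "tags")
filled = terms_counts(docs, "tags", missing="N/A")
print(ignored)
print(filled)
```

In the second call the document without tags is added to the existing N/A bucket, so that bucket's count rises from 1 to 2.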
Mixing field types
Warning
|
When aggregating on multiple indices the type of the aggregated field may not be the same in all indices.
Some types are compatible with each other (integer and long or float and double ) but when the types are a mix
of decimal and non-decimal number the terms aggregation will promote the non-decimal numbers to decimal numbers.
This can result in a loss of precision in the bucket values.
|
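The loss of precision mentioned above can be demonstrated with plain Python numbers. The sketch assumes that the decimal representation is a 64-bit IEEE 754 double, which is what Python's `float` is; whether a particular bucket key is affected depends on its magnitude.

```python
big = 2**53 + 1          # an integer that still fits in a 64-bit long
as_double = float(big)   # the same value after promotion to a double

print(big)               # 9007199254740993
print(int(as_double))    # 9007199254740992
```

A double has a 53-bit significand, so integers above 2^53 can no longer all be represented exactly and neighbouring values may collapse into the same bucket key.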
Pipeline Aggregations
Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:
- Parent
-
A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.
- Sibling
-
Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.
Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path
parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the
buckets_path
Syntax section below.
Pipeline aggregations cannot have sub-aggregations but, depending on the type, they can reference another pipeline aggregation in the buckets_path
allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative
(i.e. a derivative of a derivative).
Note
|
Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation will be included in the final output. |
buckets_path
Syntax
Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path
parameter, which follows a specific format:
AGG_SEPARATOR = '>' ;
METRIC_SEPARATOR = '.' ;
AGG_NAME = <the name of the aggregation> ;
METRIC = <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH = <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;
For example, the path "my_bucket>my_stats.avg"
will point to the avg
value in the "my_stats"
metric, which is
contained in the "my_bucket"
bucket aggregation.
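The grammar above is small enough to be parsed by hand. The following sketch splits a path into its aggregation names and an optional metric. It is a simplified reading of the grammar, not the parser Elasticsearch uses, and it does not handle the bracket syntax for names that contain dots; the function name is hypothetical.

```python
def parse_buckets_path(path):
    """Split a path into (aggregation names, metric or None), following
    AGG_NAME ('>' AGG_NAME)* ('.' METRIC)?"""
    agg_names = path.split(">")
    metric = None
    if "." in agg_names[-1]:
        # Only the last element may carry a metric suffix.
        agg_names[-1], metric = agg_names[-1].split(".", 1)
    if not all(agg_names):
        raise ValueError(f"empty aggregation name in {path!r}")
    return agg_names, metric

print(parse_buckets_path("my_bucket>my_stats.avg"))  # (['my_bucket', 'my_stats'], 'avg')
print(parse_buckets_path("the_sum"))                 # (['the_sum'], None)
```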
Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the
aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling"
metric "the_sum"
:
POST /_search
{
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"timestamp",
"interval":"day"
},
"aggs":{
"the_sum":{
"sum":{ "field": "lemmings" } (1)
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" } (2)
}
}
}
}
}
-
The metric is called "the_sum"
-
The buckets_path refers to the metric via a relative path "the_sum"
buckets_path
is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets
instead of embedded "inside" them. For example, the max_bucket
aggregation uses the buckets_path
to specify
a metric embedded inside a sibling aggregation:
POST /_search
{
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"max_monthly_sales": {
"max_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.
Special Paths
Instead of pathing to a metric, buckets_path
can use a special "_count"
path. This instructs
the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket, instead of a specific metric:
POST /_search
{
"aggs": {
"my_date_histo": {
"date_histogram": {
"field":"timestamp",
"interval":"day"
},
"aggs": {
"the_movavg": {
"moving_avg": { "buckets_path": "_count" } (1)
}
}
}
}
}
-
By using
_count
instead of a metric name, we can calculate the moving average of document counts in the histogram
The buckets_path
can also use "_bucket_count"
and path to a multi-bucket aggregation to use the number of buckets
returned by that aggregation in the pipeline aggregation instead of a metric. For example, a bucket_selector
can be
used here to filter out buckets which contain no buckets for an inner terms aggregation:
POST /sales/_search
{
"size": 0,
"aggs": {
"histo": {
"date_histogram": {
"field": "date",
"interval": "day"
},
"aggs": {
"categories": {
"terms": {
"field": "category"
}
},
"min_bucket_selector": {
"bucket_selector": {
"buckets_path": {
"count": "categories._bucket_count" (1)
},
"script": {
"source": "params.count != 0"
}
}
}
}
}
}
}
-
By using _bucket_count instead of a metric name, we can filter out histo buckets where they contain no buckets for the categories aggregation
Dealing with dots in agg names
An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile. This metric may be referred to as:
"buckets_path": "my_percentile[99.9]"
Dealing with gaps in the data
Data in the real world is often noisy and sometimes contains gaps — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:
-
Documents falling into a bucket do not contain a required field
-
There are no documents matching the query for one or more buckets
-
The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, a HoltWinters moving average needs "warmup" data to begin calculating, etc.)
Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing
data is encountered. All pipeline aggregations accept the gap_policy
parameter. There are currently two gap policies
to choose from:
- skip
-
This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.
- insert_zeros
-
This option will replace missing values with a zero (
0
) and pipeline aggregation computation will proceed as normal.
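The two policies can be compared on a series with one missing value. The sketch below is a simplified model in which `None` marks a bucket without a value; the helper names and the numbers are hypothetical, and a plain mean stands in for the pipeline computation.

```python
def apply_gap_policy(values, gap_policy):
    """Return the values a pipeline computation would see.

    None marks a bucket with no value.
    """
    if gap_policy == "skip":
        return [v for v in values if v is not None]
    if gap_policy == "insert_zeros":
        return [0.0 if v is None else v for v in values]
    raise ValueError(f"unknown gap_policy: {gap_policy!r}")

def mean(values):
    return sum(values) / len(values)

monthly_sales = [10.0, None, 30.0]

skipped = mean(apply_gap_policy(monthly_sales, "skip"))
zero_filled = mean(apply_gap_policy(monthly_sales, "insert_zeros"))
print(skipped)      # 20.0
print(zero_filled)
```

With skip the empty bucket does not take part in the computation at all, whereas insert_zeros pulls the result towards zero because the gap is counted as a real data point.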
Avg Bucket Aggregation
A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
An avg_bucket
aggregation looks like this in isolation:
{
"avg_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the average for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the average of the total monthly sales
:
POST /_search
{
"size": 0,
"aggs": {
"sales_per_month": {
"date_histogram": {
"field": "date",
"interval": "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"avg_monthly_sales": {
"avg_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this avg_bucket aggregation that we want the (mean) average value of the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"avg_monthly_sales": {
"value": 328.33333333333333
}
}
}
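The value of avg_monthly_sales in the response can be checked by hand. The sketch below recomputes it from the three sales buckets shown above; the function name `avg_bucket` and the trimmed bucket structure are only illustrative.

```python
# The per-month sums from the example response, reduced to the fields needed.
sales_per_month = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 375.0}},
]

def avg_bucket(buckets, metric):
    """Mean of one single-value metric across all sibling buckets."""
    values = [bucket[metric]["value"] for bucket in buckets]
    return sum(values) / len(values)

avg_monthly_sales = avg_bucket(sales_per_month, "sales")
print(avg_monthly_sales)
```

The result is 985 / 3, which is approximately 328.33 and agrees with the response.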
Derivative Aggregation
A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram)
aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count
set to 0
(default
for histogram
aggregations).
Syntax
A derivative
aggregation looks like this in isolation:
"derivative": {
"buckets_path": "the_sum"
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the derivative for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
First Order Derivative
The following snippet calculates the derivative of the total monthly sales
:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales" (1)
}
}
}
}
}
}
-
buckets_path instructs this derivative aggregation to use the output of the sales aggregation for the derivative
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
} (1)
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
},
"sales_deriv": {
"value": -490.0 (2)
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2, (3)
"sales": {
"value": 375.0
},
"sales_deriv": {
"value": 315.0
}
}
]
}
}
}
-
No derivative for the first bucket since we need at least 2 data points to calculate the derivative
-
Derivative value units are implicitly defined by the sales aggregation and the parent histogram, so in this case the units would be $/month assuming the price field has units of $.
-
The number of documents in the bucket is represented by the doc_count
Second Order Derivative
A second order derivative can be calculated by chaining the derivative pipeline aggregation onto the result of another derivative pipeline aggregation as in the following example which will calculate both the first and the second order derivative of the total monthly sales:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales"
}
},
"sales_2nd_deriv": {
"derivative": {
"buckets_path": "sales_deriv" (1)
}
}
}
}
}
}
-
buckets_path
for the second derivative points to the name of the first derivative
And the following may be the response:
{
"took": 50,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
} (1)
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
},
"sales_deriv": {
"value": -490.0
} (1)
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
},
"sales_deriv": {
"value": 315.0
},
"sales_2nd_deriv": {
"value": 805.0
}
}
]
}
}
}
-
No second derivative for the first two buckets since we need at least 2 data points from the first derivative to calculate the second derivative
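The numbers in both responses can be reproduced with a simple difference function. The sketch below treats the derivative as the difference between consecutive bucket values and uses `None` where no value can be computed; it is an illustration of the arithmetic, not the Elasticsearch implementation.

```python
def derivative(values):
    """Difference between each value and the previous one.

    The result has the same length as the input; positions without a
    usable previous value hold None.
    """
    if not values:
        return []
    result = [None]
    for prev, cur in zip(values, values[1:]):
        result.append(None if prev is None or cur is None else cur - prev)
    return result

sales = [550.0, 60.0, 375.0]
sales_deriv = derivative(sales)
sales_2nd_deriv = derivative(sales_deriv)
print(sales_deriv)      # [None, -490.0, 315.0]
print(sales_2nd_deriv)  # [None, None, 805.0]
```

Chaining the function on its own output mirrors pointing the second buckets_path at the first derivative: 315.0 - (-490.0) gives 805.0 for the third bucket.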
Units
The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response
normalized_value
which reports the derivative value in the desired x-axis units. In the below example we calculate the derivative
of the total sales per month but ask for the derivative of the sales in units of sales per day:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"sales_deriv": {
"derivative": {
"buckets_path": "sales",
"unit": "day" (1)
}
}
}
}
}
}
-
unit
specifies what unit to use for the x-axis of the derivative calculation
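The normalization can be sketched as a division by the distance between two bucket keys, expressed in the requested unit. This is an assumption about the arithmetic that is consistent with the example values in this section, not a description of the actual implementation; the function name is hypothetical and the keys are the epoch-millisecond bucket keys used throughout the examples.

```python
MILLIS_PER_DAY = 24 * 60 * 60 * 1000

def normalized_derivative(prev_key, key, prev_value, value, unit_millis):
    """Return (value, normalized_value) for one bucket.

    value is the plain difference; normalized_value divides it by the
    number of units between the two bucket keys.
    """
    delta = value - prev_value
    units_between = (key - prev_key) / unit_millis
    return delta, delta / units_between

jan, feb, mar = 1420070400000, 1422748800000, 1425168000000

feb_deriv = normalized_derivative(jan, feb, 550.0, 60.0, MILLIS_PER_DAY)
mar_deriv = normalized_derivative(feb, mar, 60.0, 375.0, MILLIS_PER_DAY)
print(feb_deriv)
print(mar_deriv)  # (315.0, 11.25)
```

January has 31 days and February 2015 has 28, so the same monthly histogram yields different divisors for the two buckets.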
And the following may be the response:
{
"took": 50,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
} (1)
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
},
"sales_deriv": {
"value": -490.0, (1)
"normalized_value": -15.806451612903226 (2)
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
},
"sales_deriv": {
"value": 315.0,
"normalized_value": 11.25
}
}
]
}
}
}
-
value is reported in the original units of 'per month'
-
normalized_value is reported in the desired units of 'per day'
Max Bucket Aggregation
A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
A max_bucket
aggregation looks like this in isolation:
{
"max_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the maximum for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the maximum of the total monthly sales
:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"max_monthly_sales": {
"max_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"max_monthly_sales": {
"keys": ["2015/01/01 00:00:00"], (1)
"value": 550.0
}
}
}
-
keys is an array of strings since the maximum value may be present in multiple buckets
Min Bucket Aggregation
A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
A min_bucket
aggregation looks like this in isolation:
{
"min_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the minimum for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the minimum of the total monthly sales
:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"min_monthly_sales": {
"min_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this min_bucket aggregation that we want the minimum value of the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"min_monthly_sales": {
"keys": ["2015/02/01 00:00:00"], (1)
"value": 60.0
}
}
}
-
keys is an array of strings since the minimum value may be present in multiple buckets
Sum Bucket Aggregation
A sibling pipeline aggregation which calculates the sum across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
A sum_bucket
aggregation looks like this in isolation:
{
"sum_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the sum for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the sum of all the total monthly sales
buckets:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"sum_monthly_sales": {
"sum_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this sum_bucket aggregation that we want the sum of the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"sum_monthly_sales": {
"value": 985.0
}
}
}
Stats Bucket Aggregation
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
A stats_bucket
aggregation looks like this in isolation:
{
"stats_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the stats for monthly sales
:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"stats_monthly_sales": {
"stats_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60.0,
"max": 550.0,
"avg": 328.3333333333333,
"sum": 985.0
}
}
}
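As a rough illustration of where the stats_monthly_sales numbers come from, the following Python sketch recomputes them from the three per-bucket sales values in the response above. The helper name bucket_stats is hypothetical; this shows only the arithmetic, not how Elasticsearch implements the aggregation.

```python
def bucket_stats(values):
    """Summarize per-bucket metric values (illustrative helper, not an Elasticsearch API)."""
    if not values:
        raise ValueError("at least one bucket value is required")
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }


# The "sales" value of each monthly bucket from the response above.
monthly_sales = [550.0, 60.0, 375.0]
stats = bucket_stats(monthly_sales)
print(stats)
```

The count is the number of buckets, not the number of documents, because the aggregation works on one metric value per bucket.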
Extended Stats Bucket Aggregation
A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
This aggregation provides a few more statistics (sum of squares, standard deviation, etc.) compared to the stats_bucket aggregation.
Syntax
An extended_stats_bucket aggregation looks like this in isolation:
{
"extended_stats_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip |
format | Format to apply to the output value of this aggregation | Optional | null |
sigma | The number of standard deviations above/below the mean to display | Optional | 2 |
The following snippet calculates the extended stats for the monthly sales buckets:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"stats_monthly_sales": {
"extended_stats_bucket": {
"buckets_path": "sales_per_month>sales" (1)
}
}
}
}
-
buckets_path instructs this extended_stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"stats_monthly_sales": {
"count": 3,
"min": 60.0,
"max": 550.0,
"avg": 328.3333333333333,
"sum": 985.0,
"sum_of_squares": 446725.0,
"variance": 41105.55555555556,
"std_deviation": 202.74505063146563,
"std_deviation_bounds": {
"upper": 733.8234345962646,
"lower": -77.15676792959795
}
}
}
}
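The additional values in the response above can be reproduced with a few lines of arithmetic. The Python sketch below assumes a population variance (divide by the bucket count) and bounds of the mean plus or minus sigma standard deviations, which is consistent with the numbers shown; the function name extended_bucket_stats is hypothetical.

```python
import math


def extended_bucket_stats(values, sigma=2.0):
    """Extra statistics over per-bucket values (illustrative sketch only)."""
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of the squares minus the square of the mean.
    variance = sum_of_squares / count - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }


ext = extended_bucket_stats([550.0, 60.0, 375.0])
print(ext)
```

Raising sigma widens the bounds; with the default of 2 the lower bound here is negative even though every bucket value is positive.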
Percentiles Bucket Aggregation
A sibling pipeline aggregation which calculates percentiles across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
Syntax
A percentiles_bucket aggregation looks like this in isolation:
{
"percentiles_bucket": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the percentiles for (see buckets_path Syntax for more details) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip |
format | Format to apply to the output value of this aggregation | Optional | null |
percents | The list of percentiles to calculate | Optional | [ 1, 5, 25, 50, 75, 95, 99 ] |
The following snippet calculates the percentiles for the total monthly sales buckets:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"percentiles_monthly_sales": {
"percentiles_bucket": {
"buckets_path": "sales_per_month>sales", (1)
"percents": [ 25.0, 50.0, 75.0 ] (2)
}
}
}
}
-
buckets_path instructs this percentiles_bucket aggregation that we want to calculate percentiles for the sales aggregation in the sales_per_month date histogram.
-
percents specifies which percentiles we wish to calculate, in this case, the 25th, 50th and 75th percentiles.
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
}
}
]
},
"percentiles_monthly_sales": {
"values" : {
"25.0": 375.0,
"50.0": 375.0,
"75.0": 550.0
}
}
}
}
Percentiles_bucket implementation
The percentiles_bucket aggregation returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points.
The percentiles are calculated exactly and are not an approximation (unlike the Percentiles metric aggregation). This means the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of data points in a single percentiles_bucket.
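To make the "sorted list, no interpolation" idea concrete, here is a minimal Python sketch. The index rule (scale the percentile to a position in the sorted list and round half up) is an assumption chosen because it reproduces the example response above; the actual implementation may pick the data point differently. The function name percentiles_bucket is used only for illustration.

```python
import math


def percentiles_bucket(values, percents):
    """Pick existing data points for each percentile, without interpolation."""
    data = sorted(values)  # the in-memory, sorted copy of the bucket values
    result = {}
    for p in percents:
        if not data:
            result[str(float(p))] = float("nan")
            continue
        # Assumed rule: position in the sorted list, rounded half up.
        index = math.floor((p / 100.0) * (len(data) - 1) + 0.5)
        result[str(float(p))] = data[index]
    return result


pcts = percentiles_bucket([550.0, 60.0, 375.0], [25.0, 50.0, 75.0])
print(pcts)
```

Because only existing data points are returned, a small number of buckets can map several requested percentiles to the same value, as the 25th and 50th percentiles do here.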
Moving Average Aggregation
Deprecated in 6.4.0. The Moving Average aggregation has been deprecated in favor of the more general Moving Function Aggregation. The new Moving Function aggregation provides all the same functionality as the Moving Average aggregation, but also provides more flexibility.
Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average value of that window. For example, given the data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving average with a window size of 5 as follows:
-
(1 + 2 + 3 + 4 + 5) / 5 = 3
-
(2 + 3 + 4 + 5 + 6) / 5 = 4
-
(3 + 4 + 5 + 6 + 7) / 5 = 5
-
etc
Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, which allows the lower frequency trends to be more easily visualized, such as seasonality.
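The worked example above can be written as a short Python sketch. It only illustrates the sliding-window arithmetic on a plain list; the function name simple_moving_average is hypothetical and is not part of any Elasticsearch API.

```python
def simple_moving_average(data, window):
    """Average every run of `window` consecutive values in `data`."""
    return [
        sum(data[i:i + window]) / window
        for i in range(len(data) - window + 1)
    ]


series = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10]
averages = simple_moving_average(series, 5)
print(averages)  # [3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
```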
Syntax
A moving_avg
aggregation looks like this in isolation:
{
"moving_avg": {
"buckets_path": "the_sum",
"model": "holt",
"window": 5,
"gap_policy": "insert_zeros",
"settings": {
"alpha": 0.8
}
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | Path to the metric of interest (see buckets_path Syntax for more details) | Required | |
model | The moving average weighting model that we wish to use | Optional | simple |
gap_policy | Determines what should happen when a gap in the data is encountered. | Optional | skip |
window | The size of window to "slide" across the histogram. | Optional | 5 |
minimize | If the model should be algorithmically minimized. See Minimization for more details | Optional | false for most models |
settings | Model-specific settings, contents which differ depending on the model specified. | Optional | |
moving_avg aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{ (1)
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" } (2)
},
"the_movavg":{
"moving_avg":{ "buckets_path": "the_sum" } (3)
}
}
}
}
}
-
A date_histogram named "my_date_histo" is constructed on the "date" field, with one-month intervals.
-
A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc.).
-
Finally, we specify a moving_avg aggregation which uses "the_sum" metric as its input.
Moving averages are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the moving_avg is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see buckets_path Syntax for a description of the syntax for buckets_path).
An example response from the above aggregation may look like:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"my_date_histo": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"the_sum": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"the_sum": {
"value": 60.0
},
"the_movavg": {
"value": 550.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"the_sum": {
"value": 375.0
},
"the_movavg": {
"value": 305.0
}
}
]
}
}
}
Models
The moving_avg aggregation includes five different moving average "models": simple, linear, ewma, holt and holt_winters. The main difference is how the values in the window are weighted. As data points become "older" in the window, they may be weighted differently. This will affect the final average for that window.
Models are specified using the model parameter. Some models may have optional configurations which are specified inside the settings parameter.
Simple
The simple model calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg":{
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "simple"
}
}
}
}
}
}
A simple model has no special settings to configure.
The window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

In contrast, a simple moving average with a larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount:

Linear
The linear model assigns a linear weighting to points in the series, such that "older" data points (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "linear"
}
}
}
}
}
}
A linear model has no special settings to configure.
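The linear weighting described above can be sketched in Python as follows. This is the textbook form, in which the oldest value gets weight 1 and the newest gets weight n; it is an assumption that Elasticsearch normalizes the weights in exactly this way, and the function name linear_weighted_average is hypothetical.

```python
def linear_weighted_average(window_values):
    """Weighted mean where the i-th oldest value gets weight i (1-based)."""
    if not window_values:
        return float("nan")
    weights = range(1, len(window_values) + 1)
    weighted_sum = sum(w * v for w, v in zip(weights, window_values))
    return weighted_sum / sum(weights)


# Oldest value first. The plain mean is 20.0; the linear weighting
# pulls the result toward the newer, larger values.
lwa = linear_weighted_average([10.0, 20.0, 30.0])
print(lwa)
```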
Like the simple model, window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

In contrast, a linear moving average with a larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount, although typically less than the simple model:

EWMA (Exponentially Weighted)
The ewma model (aka "single-exponential") is similar to the linear model, except older data points become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.
The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.
The EWMA model can be Minimized.
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "ewma",
"settings" : {
"alpha" : 0.5
}
}
}
}
}
}
}
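The effect of alpha is easier to see in code. The Python sketch below uses the common single-exponential recurrence, seeded with the first value in the window; both the seeding and the function name ewma are assumptions made for illustration rather than a statement of how the model is implemented.

```python
def ewma(window_values, alpha):
    """Single-exponential smoothing over a window, oldest value first."""
    avg = None
    for v in window_values:
        if avg is None:
            avg = v  # assumed seed: the first value in the window
        else:
            # The newest value gets weight alpha; everything older decays.
            avg = alpha * v + (1 - alpha) * avg
    return float("nan") if avg is None else avg


window = [10.0, 20.0, 30.0]
print(ewma(window, 0.5))  # 22.5
print(ewma(window, 1.0))  # 30.0, only the newest value counts
print(ewma(window, 0.0))  # 10.0, the seed value never changes
```

A larger alpha moves the result toward the most recent value, which matches the "closer tracking, less smoothing" behavior described above.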


Holt-Linear
The holt model (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".
The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
The trend calculation looks at the difference between the current and last value (i.e. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.
Values are produced by combining the level and trend components (in the standard Holt formulation, the two are added).
The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1 inclusive.
The Holt-Linear model can be Minimized.
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt",
"settings" : {
"alpha" : 0.5,
"beta" : 0.5
}
}
}
}
}
}
}
In practice, the alpha value behaves very similarly in holt as in ewma: small values produce more smoothing and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger values emphasize short-term trends. This will become more apparent when you are predicting values.
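The interplay between the level and the trend can be sketched with the textbook double exponential smoothing recurrences. In the Python sketch below, the initial trend of 0.0 and the function name holt_level_and_trend are assumptions for illustration; the model's actual initialization may differ.

```python
def holt_level_and_trend(window_values, alpha, beta):
    """Return the final (level, trend) pair after smoothing the window."""
    level, trend = None, 0.0
    for v in window_values:
        if level is None:
            level = v  # assumed seed: first value, with no trend yet
            continue
        previous_level = level
        # Level: blend the new value with the previous level plus trend.
        level = alpha * v + (1 - alpha) * (previous_level + trend)
        # Trend: blend the latest change in level with the previous trend.
        trend = beta * (level - previous_level) + (1 - beta) * trend
    return level, trend


level, trend = holt_level_and_trend([1.0, 2.0, 3.0, 4.0, 5.0], 1.0, 1.0)
print(level, trend)  # 5.0 1.0 for perfectly linear data
print(holt_level_and_trend([1.0, 2.0, 3.0, 4.0, 5.0], 0.5, 0.5))
```

With alpha and beta both at 1.0 the model simply follows the data; lowering beta makes the trend estimate react more slowly to recent changes.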


Holt-Winters
The holt_winters model (aka "triple exponential") incorporates a third exponential term which tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend" and "seasonality".
The level and trend calculation is identical to holt. The seasonal calculation looks at the difference between the current point and the point one period earlier.
Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you would set period: 7. Similarly if there was a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned for future enhancements.
There are two varieties of Holt-Winters: additive and multiplicative.
"Cold Start"
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This means that your window must always be at least twice the size of your period. An exception will be thrown if it isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm does not backcast.
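The window constraint can be checked on the client side before a request is sent. The Python helper below is a hypothetical sketch of that check, based only on the rule stated above; it is not part of Elasticsearch.

```python
def check_holt_winters_window(window, period):
    """Validate the window size and return the length of the cold start."""
    if window < 2 * period:
        raise ValueError(
            f"window ({window}) must be at least twice the period ({period})"
        )
    # Number of leading buckets for which no value is emitted.
    return 2 * period


cold_start = check_holt_winters_window(window=30, period=7)
print(cold_start)  # 14
```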

Because the "cold start" obscures what the moving average looks like, the rest of the Holt-Winters images are truncated to not show the "cold start". Just be aware this will always be present at the beginning of your moving averages!
Additive Holt-Winters
Additive seasonality is the default; it can also be specified by setting "type": "add". This variety is preferred when the seasonal effect is additive to your data. E.g. you could simply subtract the seasonal effect to "de-seasonalize" your data into a flat trend.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive.
The default value of period is 1.
The additive Holt-Winters model can be Minimized.
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "add",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7
}
}
}
}
}
}
}

Multiplicative Holt-Winters
Multiplicative is specified by setting "type": "mult". This variety is preferred when the seasonal effect is multiplied against your data. E.g. the seasonal effect multiplies the data by 5, rather than simply adding to it.
The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive.
The default value of period is 1.
The multiplicative Holt-Winters model can be Minimized.
Warning
|
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the
|
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "holt_winters",
"settings" : {
"type" : "mult",
"alpha" : 0.5,
"beta" : 0.5,
"gamma" : 0.5,
"period" : 7,
"pad" : true
}
}
}
}
}
}
}
Prediction
This functionality is experimental.
All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the current smoothed, moving average. Depending on the model and parameter, these predictions may or may not be accurate.
Predictions are enabled by adding a predict parameter to any moving average aggregation, specifying the number of predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval as your buckets:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"window" : 30,
"model" : "simple",
"predict" : 10
}
}
}
}
}
}
The simple, linear and ewma models all produce "flat" predictions: they essentially converge on the mean of the last value in the series, producing a flat line.

In contrast, the holt model can extrapolate based on local or global constant trends. If we set a high beta value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end of the series was heading in a downward direction):

In contrast, if we choose a small beta, the predictions are based on the global constant trend. In this series, the global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope:

The holt_winters model has the potential to deliver the best predictions, since it also incorporates seasonal fluctuations into the model:
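The difference between a "flat" prediction and a trend-based one can be sketched in a few lines of Python. Both helpers are hypothetical and use the textbook forms: a flat forecast repeats the last smoothed value, and a Holt-style forecast extends the final level by the final trend once per step. Whether the first predicted step starts at one or zero trend increments is an assumption here.

```python
def flat_prediction(last_smoothed, steps):
    """Flat forecast: repeat the last smoothed value for every future step."""
    return [last_smoothed] * steps


def trend_prediction(level, trend, steps):
    """Linear forecast: extend the final level by the trend at each step."""
    return [level + k * trend for k in range(1, steps + 1)]


print(flat_prediction(305.0, 3))        # [305.0, 305.0, 305.0]
print(trend_prediction(5.0, 1.0, 3))    # [6.0, 7.0, 8.0]
print(trend_prediction(5.0, -0.5, 3))   # [4.5, 4.0, 3.5]
```

A negative trend produces predictions that head down, and a small positive one produces a gentle upward slope, which mirrors the two holt cases described above.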

Minimization
Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore, small deviations in these parameters can sometimes have a drastic effect on the output moving average.
For that reason, the three "tunable" models can be algorithmically minimized. Minimization is a process where parameters are tweaked until the predictions generated by the model closely match the output data. Minimization is not foolproof and can be susceptible to overfitting, but it often gives better results than hand-tuning.
Minimization is disabled by default for ewma and holt, while it is enabled by default for holt_winters.
Minimization is most useful with Holt-Winters, since it helps improve the accuracy of the predictions. EWMA and Holt-Linear are not great predictors, and are mostly used for smoothing data, so minimization is less useful on those models.
Minimization is enabled/disabled via the minimize parameter:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_avg":{
"buckets_path": "the_sum",
"model" : "holt_winters",
"window" : 30,
"minimize" : true, (1)
"settings" : {
"period" : 7
}
}
}
}
}
}
}
-
Minimization is enabled with the minimize parameter.
When enabled, minimization will find the optimal values for alpha, beta and gamma. The user should still provide appropriate values for window, period and type.
Warning
|
Minimization works by running a stochastic process called simulated annealing. This process will usually generate a good solution, but is not guaranteed to find the global optimum. It also requires some amount of additional computational power, since the model needs to be re-run multiple times as the values are tweaked. The run-time of minimization is linear to the size of the window being processed: excessively large windows may cause latency. Finally, minimization fits the model to the last |
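For readers unfamiliar with simulated annealing, the following Python sketch tunes a single alpha for a single-exponential model by minimizing its one-step-ahead squared error. Everything here is an illustrative assumption: the cost function, the cooling schedule, the step size and the function names are hypothetical and are not taken from the Elasticsearch implementation.

```python
import math
import random


def one_step_error(series, alpha):
    """Sum of squared errors when the running average predicts the next value."""
    avg = series[0]
    error = 0.0
    for v in series[1:]:
        error += (v - avg) ** 2  # the current average is the prediction for v
        avg = alpha * v + (1 - alpha) * avg
    return error


def anneal_alpha(series, steps=500, seed=0):
    """Search for an alpha in [0, 1] with a low one-step-ahead error."""
    rng = random.Random(seed)  # seeded so the sketch is repeatable
    alpha = 0.3
    cost = one_step_error(series, alpha)
    best_alpha, best_cost = alpha, cost
    temperature = 1.0
    for _ in range(steps):
        candidate = min(1.0, max(0.0, alpha + rng.uniform(-0.1, 0.1)))
        candidate_cost = one_step_error(series, candidate)
        # Always accept improvements; accept worse moves with a probability
        # that shrinks as the temperature cools.
        accept = candidate_cost < cost or rng.random() < math.exp(
            (cost - candidate_cost) / temperature
        )
        if accept:
            alpha, cost = candidate, candidate_cost
            if cost < best_cost:
                best_alpha, best_cost = alpha, cost
        temperature *= 0.99
    return best_alpha, best_cost


data = [3.0, 5.0, 4.0, 6.0, 8.0, 7.0, 9.0, 11.0]
tuned_alpha, tuned_cost = anneal_alpha(data)
print(tuned_alpha, tuned_cost)
```

Each candidate requires re-running the model over the window, which is why the cost of minimization grows with the window size.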
Moving Function Aggregation
Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom script that is executed on each window of data. For convenience, a number of common functions are predefined such as min/max, moving averages, etc.
This is conceptually very similar to the Moving Average pipeline aggregation, except it provides more functionality.
Syntax
A moving_fn
aggregation looks like this in isolation:
{
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.min(values)"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | Path to the metric of interest (see buckets_path Syntax for more details) | Required | |
window | The size of window to "slide" across the histogram. | Required | |
script | The script that should be executed on each window of data | Required | |
moving_fn aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{ (1)
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" } (2)
},
"the_movfn": {
"moving_fn": {
"buckets_path": "the_sum", (3)
"window": 10,
"script": "MovingFunctions.unweightedAvg(values)"
}
}
}
}
}
}
-
A date_histogram named "my_date_histo" is constructed on the "date" field, with one-month intervals.
-
A sum metric is used to calculate the sum of a field. This could be any numeric metric (sum, min, max, etc.).
-
Finally, we specify a moving_fn aggregation which uses "the_sum" metric as its input.
Moving functions are built by first specifying a histogram or date_histogram over a field. You can then optionally add numeric metrics, such as a sum, inside of that histogram. Finally, the moving_fn is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see buckets_path Syntax for a description of the syntax for buckets_path).
An example response from the above aggregation may look like:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"my_date_histo": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"the_sum": {
"value": 550.0
},
"the_movfn": {
"value": null
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"the_sum": {
"value": 60.0
},
"the_movfn": {
"value": 550.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"the_sum": {
"value": 375.0
},
"the_movfn": {
"value": 305.0
}
}
]
}
}
}
Custom user scripting
The Moving Function aggregation allows the user to specify any arbitrary script to define custom logic. The script is invoked each time a new window of data is collected. These values are provided to the script in the values variable. The script should then perform some kind of calculation and emit a single double as the result. Emitting null is not permitted, although NaN and +/- Inf are allowed.
For example, this script will simply return the first value from the window, or NaN if no values are available:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "return values.length > 0 ? values[0] : Double.NaN"
}
}
}
}
}
}
Pre-built Functions
For convenience, a number of functions have been prebuilt and are available inside the moving_fn
script context:
-
max()
-
min()
-
sum()
-
stdDev()
-
unweightedAvg()
-
linearWeightedAvg()
-
ewma()
-
holt()
-
holtWinters()
The functions are available from the MovingFunctions namespace. E.g. MovingFunctions.max().
max Function
This function accepts a collection of doubles and returns the maximum value in that window. null and NaN values are ignored; the maximum is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result.
Parameter Name | Description |
---|---|
values | The window of values to find the maximum |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_moving_max": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.max(values)"
}
}
}
}
}
}
min Function
This function accepts a collection of doubles and returns the minimum value in that window. null and NaN values are ignored; the minimum is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result.
Parameter Name | Description |
---|---|
values | The window of values to find the minimum |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_moving_min": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.min(values)"
}
}
}
}
}
}
sum Function
This function accepts a collection of doubles and returns the sum of the values in that window. null and NaN values are ignored; the sum is only calculated over the real values. If the window is empty, or all values are null/NaN, 0.0 is returned as the result.
Parameter Name | Description |
---|---|
values | The window of values to find the sum of |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_moving_sum": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.sum(values)"
}
}
}
}
}
}
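The null and NaN handling described for max, min and sum can be summarized in a short Python sketch. The function names are hypothetical stand-ins for the prebuilt functions, and Python's None plays the role of null; only the behavior stated above is modeled.

```python
import math


def _real(values):
    """Keep only the values that are neither None nor NaN."""
    return [v for v in values if v is not None and not math.isnan(v)]


def moving_max(values):
    real = _real(values)
    return max(real) if real else float("nan")


def moving_min(values):
    real = _real(values)
    return min(real) if real else float("nan")


def moving_sum(values):
    # An empty or all-missing window sums to 0.0 rather than NaN.
    return sum(_real(values), 0.0)


window = [1.0, float("nan"), 3.0, None]
print(moving_max(window), moving_min(window), moving_sum(window))  # 3.0 1.0 4.0
print(moving_sum([None, float("nan")]))  # 0.0
```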
stdDev Function
This function accepts a collection of doubles and an average, then returns the standard deviation of the values in that window. null and NaN values are ignored; the standard deviation is only calculated over the real values. If the window is empty, or all values are null/NaN, 0.0 is returned as the result.
Parameter Name | Description |
---|---|
values | The window of values to find the standard deviation of |
avg | The average of the window |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_moving_sum": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.stdDev(values, MovingFunctions.unweightedAvg(values))"
}
}
}
}
}
}
The avg
parameter must be provided to the standard deviation function because different styles of averages can be computed on the window
(simple, linearly weighted, etc). The various moving averages that are detailed below can be used to calculate the average for the
standard deviation function.
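The role of the avg parameter can be illustrated with a small Python sketch. It assumes a population-style standard deviation (divide by the number of real values) around whichever average is passed in; the function names std_dev and unweighted_avg are hypothetical stand-ins for the prebuilt functions.

```python
import math


def unweighted_avg(values):
    """Mean of the non-NaN values, or NaN if there are none."""
    real = [v for v in values if not math.isnan(v)]
    return sum(real) / len(real) if real else float("nan")


def std_dev(values, avg):
    """Spread of the non-NaN values around the supplied average."""
    real = [v for v in values if not math.isnan(v)]
    if not real:
        return 0.0  # empty or all-NaN window, as described above
    return math.sqrt(sum((v - avg) ** 2 for v in real) / len(real))


window = [550.0, 60.0, 375.0]
spread = std_dev(window, unweighted_avg(window))
print(spread)
```

Passing a differently weighted average as the second argument measures the spread around that average instead of around the plain mean.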
unweightedAvg Function
The unweightedAvg function calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple moving average does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.
null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.
Parameter Name | Description |
---|---|
values | The window of values to average |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.unweightedAvg(values)"
}
}
}
}
}
}
linearWeightedAvg Function
The linearWeightedAvg function assigns a linear weighting to points in the series, such that "older" data points (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.
If the window is empty, or all values are null/NaN, NaN is returned as the result.
Parameter Name | Description |
---|---|
values | The window of values to average |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.linearWeightedAvg(values)"
}
}
}
}
}
}
ewma Function
The ewma function (aka "single-exponential") is similar to the linearWeightedAvg function, except older data points become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.
null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.
Parameter Name | Description |
---|---|
values | The window of values to average |
alpha | Exponential decay |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.ewma(values, 0.3)"
}
}
}
}
}
}
holt Function
The holt function (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".
The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series.
The trend calculation looks at the difference between the current and last value (i.e. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.
Values are produced by combining the level and trend components (in the standard Holt formulation, the two are added).
null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN
values.
Parameter Name | Description |
---|---|
values | The window of values to find the sum of |
alpha | Level decay value |
beta | Trend decay value |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "MovingFunctions.holt(values, 0.3, 0.1)"
}
}
}
}
}
}
In practice, the alpha value behaves very similarly in holt as in ewma: small values produce more smoothing
and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult
to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger
values emphasize short-term trends.
holtWinters Function
The holtWinters function (aka "triple exponential") incorporates a third exponential term which
tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend"
and "seasonality".
The level and trend calculation is identical to holt. The seasonal calculation looks at the difference between
the current point and the point one period earlier.
Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity"
of your data: e.g. if your data has cyclic trends every 7 days, you would set period = 7. Similarly, if there was
a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned
for future enhancements.
null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are
null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN
values.
Parameter Name | Description |
---|---|
values | The window of values to find the sum of |
alpha | Level decay value |
beta | Trend decay value |
gamma | Seasonality decay value |
period | The periodicity of the data |
multiplicative | True if you wish to use multiplicative holt-winters, false to use additive |
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo":{
"date_histogram":{
"field":"date",
"interval":"1M"
},
"aggs":{
"the_sum":{
"sum":{ "field": "price" }
},
"the_movavg": {
"moving_fn": {
"buckets_path": "the_sum",
"window": 10,
"script": "if (values.length > 5*2) {MovingFunctions.holtWinters(values, 0.3, 0.1, 0.1, 5, false)}"
}
}
}
}
}
}
Warning
|
Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of
your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the
|
"Cold Start"
Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This
means that your window must always be at least twice the size of your period. An exception will be thrown if it
isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm
does not backcast.
You’ll notice in the above example we have an if () statement checking the size of values. This is checking to make sure
we have two periods' worth of data (5 * 2, where 5 is the period specified in the holtWinters function) before calling
the holtWinters function.
Cumulative Sum Aggregation
A parent pipeline aggregation which calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram)
aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default
for histogram aggregations).
Syntax
A cumulative_sum aggregation looks like this in isolation:
{
"cumulative_sum": {
"buckets_path": "the_sum"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | The path to the buckets we wish to find the cumulative sum for (see buckets_path Syntax) | Required | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates the cumulative sum of the total monthly sales:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
},
"cumulative_sales": {
"cumulative_sum": {
"buckets_path": "sales" (1)
}
}
}
}
}
}
(1) buckets_path instructs this cumulative sum aggregation to use the output of the sales aggregation for the cumulative sum
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"sales": {
"value": 550.0
},
"cumulative_sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"sales": {
"value": 60.0
},
"cumulative_sales": {
"value": 610.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"sales": {
"value": 375.0
},
"cumulative_sales": {
"value": 985.0
}
}
]
}
}
}
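The cumulative_sales values in the response above are simply a running total of the per-bucket sales values. A minimal Python sketch of that arithmetic, using the three monthly figures from the response (the variable names are illustrative only):

```python
from itertools import accumulate

# "sales" values of the three monthly buckets in the response above
monthly_sales = [550.0, 60.0, 375.0]

# each entry is the sum of the current bucket and all buckets before it
cumulative_sales = list(accumulate(monthly_sales))
print(cumulative_sales)  # [550.0, 610.0, 985.0]
```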
Bucket Script Aggregation
A parent pipeline aggregation which executes a script which can perform per bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.
Syntax
A bucket_script aggregation looks like this in isolation:
{
"bucket_script": {
"buckets_path": {
"my_var1": "the_sum", (1)
"my_var2": "the_value_count"
},
"script": "params.my_var1 / params.my_var2"
}
}
(1) Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to the metrics to use for that variable.
Parameter Name | Description | Required | Default Value |
---|---|---|---|
script | The script to run for this aggregation. The script can be inline, file or indexed (see [modules-scripting] for more details) | Required | |
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
format | format to apply to the output value of this aggregation | Optional | |
The following snippet calculates t-shirt sales as a percentage of total sales for each month:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"t-shirts": {
"filter": {
"term": {
"type": "t-shirt"
}
},
"aggs": {
"sales": {
"sum": {
"field": "price"
}
}
}
},
"t-shirt-percentage": {
"bucket_script": {
"buckets_path": {
"tShirtSales": "t-shirts>sales",
"totalSales": "total_sales"
},
"script": "params.tShirtSales / params.totalSales * 100"
}
}
}
}
}
}
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 200.0
}
},
"t-shirt-percentage": {
"value": 36.36363636363637
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 10.0
}
},
"t-shirt-percentage": {
"value": 16.666666666666664
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
},
"t-shirts": {
"doc_count": 1,
"sales": {
"value": 175.0
}
},
"t-shirt-percentage": {
"value": 46.666666666666664
}
}
]
}
}
}
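The t-shirt-percentage values can be reproduced by hand from the other two metrics in each bucket. The Python sketch below repeats the script's arithmetic on the figures from the response above; the flattened dictionary layout and the function name are assumptions of the sketch, not part of the API.

```python
# per-bucket metrics copied from the response above
buckets = [
    {"month": "2015/01/01 00:00:00", "t_shirt_sales": 200.0, "total_sales": 550.0},
    {"month": "2015/02/01 00:00:00", "t_shirt_sales": 10.0, "total_sales": 60.0},
    {"month": "2015/03/01 00:00:00", "t_shirt_sales": 175.0, "total_sales": 375.0},
]


def t_shirt_percentage(bucket):
    # same expression as the script: params.tShirtSales / params.totalSales * 100
    return bucket["t_shirt_sales"] / bucket["total_sales"] * 100


percentages = [t_shirt_percentage(b) for b in buckets]
for bucket, pct in zip(buckets, percentages):
    print(bucket["month"], round(pct, 2))
```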
Bucket Selector Aggregation
A parent pipeline aggregation which executes a script which determines whether the current bucket will be retained
in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value.
If the script language is expression then a numeric return value is permitted. In this case 0.0 will be evaluated as false
and all other values will evaluate to true.
Note
|
The bucket_selector aggregation, like all pipeline aggregations, executes after all other sibling aggregations. This means that using the bucket_selector aggregation to filter the returned buckets in the response does not save on execution time running the aggregations. |
Syntax
A bucket_selector aggregation looks like this in isolation:
{
"bucket_selector": {
"buckets_path": {
"my_var1": "the_sum", (1)
"my_var2": "the_value_count"
},
"script": "params.my_var1 > params.my_var2"
}
}
(1) Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to the metrics to use for that variable.
Parameter Name | Description | Required | Default Value |
---|---|---|---|
script | The script to run for this aggregation. The script can be inline, file or indexed (see [modules-scripting] for more details) | Required | |
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax) | Required | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
The following snippet only retains buckets where the total sales for the month is more than 200:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"sales_bucket_filter": {
"bucket_selector": {
"buckets_path": {
"totalSales": "total_sales"
},
"script": "params.totalSales > 200"
}
}
}
}
}
}
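The effect of the script can be illustrated outside Elasticsearch. The following Python sketch applies the same totalSales > 200 test to three monthly totals. The figures (550, 60 and 375) are assumed sample values matching the other sales examples in this section, and the dictionary layout is purely illustrative.

```python
# assumed monthly totals, keyed by the bucket's key_as_string
monthly_totals = {
    "2015/01/01 00:00:00": 550.0,
    "2015/02/01 00:00:00": 60.0,
    "2015/03/01 00:00:00": 375.0,
}

# keep a bucket only when the script expression evaluates to true
retained = {
    month: total
    for month, total in monthly_totals.items()
    if total > 200  # params.totalSales > 200
}
print(list(retained))
```

Note that all three totals still have to be computed before the test can run, which is why filtering in this way does not reduce the cost of the aggregation itself.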
And the following may be the response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
}
},(1)
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
}
}
]
}
}
}
(1) Bucket for 2015/02/01 00:00:00 has been removed as its total sales was less than 200
Bucket Sort Aggregation
A parent pipeline aggregation which sorts the buckets of its parent multi-bucket aggregation.
Zero or more sort fields may be specified together with the corresponding sort order.
Each bucket may be sorted based on its _key, _count or its sub-aggregations.
In addition, parameters from and size may be set in order to truncate the result buckets.
Note
|
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations.
This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example,
if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10
returned term buckets.
|
Syntax
A bucket_sort aggregation looks like this in isolation:
{
"bucket_sort": {
"sort": [
{"sort_field_1": {"order": "asc"}},(1)
{"sort_field_2": {"order": "desc"}},
"sort_field_3"
],
"from": 1,
"size": 3
}
}
(1) Here, sort_field_1 is the bucket path to the variable to be used as the primary sort and its order is ascending.
Parameter Name | Description | Required | Default Value |
---|---|---|---|
sort | The list of fields to sort on. See | Optional | |
from | Buckets in positions prior to the set value will be truncated. | Optional | |
size | The number of buckets to return. Defaults to all buckets of the parent aggregation. | Optional | |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | |
The following snippet returns the buckets corresponding to the 3 months with the highest total sales in descending order:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"total_sales": {
"sum": {
"field": "price"
}
},
"sales_bucket_sort": {
"bucket_sort": {
"sort": [
{"total_sales": {"order": "desc"}}(1)
],
"size": 3(2)
}
}
}
}
}
}
(1) sort is set to use the values of total_sales in descending order
(2) size is set to 3 meaning only the top 3 months in total_sales will be returned
And the following may be the response:
{
"took": 82,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/01/01 00:00:00",
"key": 1420070400000,
"doc_count": 3,
"total_sales": {
"value": 550.0
}
},
{
"key_as_string": "2015/03/01 00:00:00",
"key": 1425168000000,
"doc_count": 2,
"total_sales": {
"value": 375.0
}
},
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2,
"total_sales": {
"value": 60.0
}
}
]
}
}
}
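The combined effect of sort, from and size can be pictured as "order the buckets that were already returned, then slice". The Python sketch below mimics that on the three buckets from the response above. The helper and its parameter names are hypothetical (start stands in for from, which is a reserved word in Python), and it handles a single sort field only.

```python
def bucket_sort(buckets, sort_key=None, descending=False, start=0, size=None):
    """Order already-computed buckets, then truncate them (illustrative sketch)."""
    ordered = list(buckets)
    if sort_key is not None:
        ordered.sort(key=lambda b: b[sort_key], reverse=descending)
    end = None if size is None else start + size
    return ordered[start:end]


buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "total_sales": 550.0},
    {"key_as_string": "2015/02/01 00:00:00", "total_sales": 60.0},
    {"key_as_string": "2015/03/01 00:00:00", "total_sales": 375.0},
]

# top 3 months by total_sales, highest first
top = bucket_sort(buckets, sort_key="total_sales", descending=True, size=3)
print([b["key_as_string"] for b in top])

# no sort field: keep the original order and return only the second bucket
second = bucket_sort(buckets, start=1, size=1)
print([b["key_as_string"] for b in second])
```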
Truncating without sorting
It is also possible to use this aggregation in order to truncate the result buckets
without doing any sorting. To do so, just use the from and/or size parameters
without specifying sort.
The following example simply truncates the result so that only the second bucket is returned:
POST /sales/_search
{
"size": 0,
"aggs" : {
"sales_per_month" : {
"date_histogram" : {
"field" : "date",
"interval" : "month"
},
"aggs": {
"bucket_truncate": {
"bucket_sort": {
"from": 1,
"size": 1
}
}
}
}
}
}
Response:
{
"took": 11,
"timed_out": false,
"_shards": ...,
"hits": ...,
"aggregations": {
"sales_per_month": {
"buckets": [
{
"key_as_string": "2015/02/01 00:00:00",
"key": 1422748800000,
"doc_count": 2
}
]
}
}
}
Serial Differencing Aggregation
Serial differencing is a technique where a time series is subtracted from itself at different time lags or periods. For example, the datapoint f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.
A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.
Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.
By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the previous value +/- a random amount. This insight allows selection of further tools for analysis.

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.
The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

Syntax
A serial_diff aggregation looks like this in isolation:
{
"serial_diff": {
"buckets_path": "the_sum",
"lag": "7"
}
}
Parameter Name | Description | Required | Default Value |
---|---|---|---|
buckets_path | Path to the metric of interest (see buckets_path Syntax) | Required | |
lag | The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the value 7 buckets ago from the current value. Must be a positive, non-zero integer | Optional | |
gap_policy | Determines what should happen when a gap in the data is encountered. | Optional | |
format | Format to apply to the output value of this aggregation | Optional | |
serial_diff aggregations must be embedded inside of a histogram or date_histogram aggregation:
POST /_search
{
"size": 0,
"aggs": {
"my_date_histo": { (1)
"date_histogram": {
"field": "timestamp",
"interval": "day"
},
"aggs": {
"the_sum": {
"sum": {
"field": "lemmings" (2)
}
},
"thirtieth_difference": {
"serial_diff": { (3)
"buckets_path": "the_sum",
"lag" : 30
}
}
}
}
}
}
(1) A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals
(2) A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)
(3) Finally, we specify a serial_diff aggregation which uses "the_sum" metric as its input.
Serial differences are built by first specifying a histogram or date_histogram over a field. You can then optionally
add normal metrics, such as a sum, inside of that histogram. Finally, the serial_diff is embedded inside the histogram.
The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see
buckets_path Syntax for a description of the syntax for buckets_path).
Matrix Aggregations
This functionality is experimental.
The aggregations in this family operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.
Matrix Stats
The matrix_stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:
count | Number of per-field samples included in the calculation. |
mean | The average value for each field. |
variance | Per-field measurement of how spread out the samples are from the mean. |
skewness | Per-field measurement quantifying the asymmetric distribution around the mean. |
kurtosis | Per-field measurement quantifying the shape of the distribution. |
covariance | A matrix that quantitatively describes how changes in one field are associated with another. |
correlation | The covariance matrix scaled to a range of -1 to 1, inclusive. Describes the relationship between field distributions. |
The following example demonstrates the use of matrix stats to describe the relationship between income and poverty.
GET /_search
{
"aggs": {
"statistics": {
"matrix_stats": {
"fields": ["poverty", "income"]
}
}
}
}
The aggregation type is matrix_stats and the fields setting defines the set of fields (as an array) for computing
the statistics. The above request returns the following response:
{
...
"aggregations": {
"statistics": {
"doc_count": 50,
"fields": [{
"name": "income",
"count": 50,
"mean": 51985.1,
"variance": 7.383377037755103E7,
"skewness": 0.5595114003506483,
"kurtosis": 2.5692365287787124,
"covariance": {
"income": 7.383377037755103E7,
"poverty": -21093.65836734694
},
"correlation": {
"income": 1.0,
"poverty": -0.8352655256272504
}
}, {
"name": "poverty",
"count": 50,
"mean": 12.732000000000001,
"variance": 8.637730612244896,
"skewness": 0.4516049811903419,
"kurtosis": 2.8615929677997767,
"covariance": {
"income": -21093.65836734694,
"poverty": 8.637730612244896
},
"correlation": {
"income": -0.8352655256272504,
"poverty": 1.0
}
}]
}
}
}
The doc_count field indicates the number of documents involved in the computation of the statistics.
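The correlation entries in the response above can be cross-checked from the covariance and variance entries, since scaling a covariance by the two standard deviations gives a value between -1 and 1. The Python sketch below performs that check using the numbers from the response. It assumes the standard Pearson formula and says nothing about how the aggregation computes the value internally.

```python
import math

# values copied from the response above
income_variance = 7.383377037755103e7
poverty_variance = 8.637730612244896
income_poverty_covariance = -21093.65836734694


def correlation(covariance, variance_a, variance_b):
    # covariance divided by the product of the two standard deviations
    return covariance / math.sqrt(variance_a * variance_b)


r = correlation(income_poverty_covariance, income_variance, poverty_variance)
print(r)  # compare with the "correlation" entries in the response
```

A field's correlation with itself is its variance divided by itself, which is why the diagonal entries in the response are 1.0.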
Multi Value Fields
The matrix_stats aggregation treats each document field as an independent sample. The mode parameter controls what
array value the aggregation will use for array or multi-valued fields. This parameter can take one of the following:
avg | (default) Use the average of all values. |
min | Pick the lowest value. |
max | Pick the highest value. |
sum | Use the sum of all values. |
median | Use the median of all values. |
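In other words, each multi-valued field is reduced to a single sample before the statistics are computed. A minimal Python sketch of the five reductions listed above, where the lookup table and the function name are hypothetical and chosen only for illustration:

```python
import statistics

# one reducer per documented mode value
MODES = {
    "avg": statistics.mean,
    "min": min,
    "max": max,
    "sum": sum,
    "median": statistics.median,
}


def reduce_multi_valued(values, mode="avg"):
    """Collapse the values of a multi-valued field into a single sample."""
    return MODES[mode](values)


field_values = [1.0, 2.0, 6.0]
for mode in MODES:
    print(mode, reduce_multi_valued(field_values, mode))
```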
Missing Values
The missing parameter defines how documents that are missing a value should be treated.
By default they will be ignored but it is also possible to treat them as if they had a value.
This is done by adding a set of fieldname : value mappings to specify default values per field.
GET /_search
{
"aggs": {
"matrixstats": {
"matrix_stats": {
"fields": ["poverty", "income"],
"missing": {"income" : 50000} (1)
}
}
}
}
(1) Documents without a value in the income field will have the default value 50000.
Script
This aggregation family does not yet support scripting.
Caching heavy aggregations
Frequently used aggregations (e.g. for display on the home page of a website) can be cached for faster responses. These cached results are the same results that would be returned by an uncached aggregation — you will never get stale results.
See [shard-request-cache] for more details.
Returning only aggregation results
There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by
setting size=0. For example:
GET /twitter/_search
{
"size": 0,
"aggregations": {
"my_agg": {
"terms": {
"field": "text"
}
}
}
}
Setting size to 0 avoids executing the fetch phase of the search, making the request more efficient.
Aggregation Metadata
You can associate a piece of metadata with individual aggregations at request time that will be returned in place at response time.
Consider this example where we want to associate the color blue with our terms aggregation.
GET /twitter/_search
{
"size": 0,
"aggs": {
"titles": {
"terms": {
"field": "title"
},
"meta": {
"color": "blue"
}
}
}
}
Then that piece of metadata will be returned in place for our titles terms aggregation:
{
"aggregations": {
"titles": {
"meta": {
"color" : "blue"
},
"doc_count_error_upper_bound" : 0,
"sum_other_doc_count" : 0,
"buckets": [
]
}
},
...
}
Returning the type of the aggregation
Sometimes you need to know the exact type of an aggregation in order to parse its results. The typed_keys parameter
can be used to change the aggregation’s name in the response so that it will be prefixed by its internal type.
Consider the following date_histogram aggregation named tweets_over_time, which has a top_hits sub-aggregation named top_users:
GET /twitter/_search?typed_keys
{
"aggregations": {
"tweets_over_time": {
"date_histogram": {
"field": "date",
"interval": "year"
},
"aggregations": {
"top_users": {
"top_hits": {
"size": 1
}
}
}
}
}
}
In the response, the aggregation names will be changed to date_histogram#tweets_over_time and top_hits#top_users
respectively, reflecting the internal types of each aggregation:
{
"aggregations": {
"date_histogram#tweets_over_time": { (1)
"buckets" : [
{
"key_as_string" : "2009-01-01T00:00:00.000Z",
"key" : 1230768000000,
"doc_count" : 5,
"top_hits#top_users" : { (2)
"hits" : {
"total" : 5,
"max_score" : 1.0,
"hits" : [
{
"_index": "twitter",
"_type": "_doc",
"_id": "0",
"_score": 1.0,
"_source": {
"date": "2009-11-15T14:12:12",
"message": "trying out Elasticsearch",
"user": "kimchy",
"likes": 0
}
}
]
}
}
}
]
}
},
...
}
(1) The name tweets_over_time now contains the date_histogram prefix.
(2) The name top_users now contains the top_hits prefix.
Note
|
For some aggregations, it is possible that the returned type is not the same as the one provided with the
request. This is the case for Terms, Significant Terms and Percentiles aggregations, where the returned type
also contains information about the type of the targeted field: lterms (for a terms aggregation on a Long field),
sigsterms (for a significant terms aggregation on a String field), tdigest_percentiles (for a percentile
aggregation based on the TDigest algorithm).
|
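A client that parses such a response can recover the type and the original name by splitting each key at the first # character. The helper below is a hypothetical Python sketch of that step, not part of any client library. It assumes the type prefix itself never contains #, and returns None as the type for keys without a prefix.

```python
def split_typed_key(key):
    """Split 'type#name' into (type, name); untyped keys give (None, key)."""
    agg_type, separator, name = key.partition("#")
    if not separator:
        return None, key  # response was requested without typed_keys
    return agg_type, name


print(split_typed_key("date_histogram#tweets_over_time"))
print(split_typed_key("top_hits#top_users"))
print(split_typed_key("tweets_over_time"))
```

Because the returned type can differ from the requested one (for example lterms for a terms aggregation on a Long field), a parser would typically dispatch on the prefix rather than on the type used in the request.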