Member "elasticsearch-6.8.5/docs/reference/aggregations.asciidoc" (13 Nov 2019, 6710 Bytes) of package /linux/www/elasticsearch-6.8.5-src.tar.gz:


Metrics Aggregations

The aggregations in this family compute metrics based on values extracted in one way or another from the documents that are being aggregated. The values are typically extracted from the fields of the document (using the field data), but can also be generated using scripts.

Numeric metrics aggregations are a special type of metrics aggregation which output numeric values. Some aggregations output a single numeric metric (e.g. avg) and are called single-value numeric metrics aggregations; others generate multiple metrics (e.g. stats) and are called multi-value numeric metrics aggregations. The distinction between single-value and multi-value numeric metrics aggregations plays a role when these aggregations serve as direct sub-aggregations of some bucket aggregations (some bucket aggregations enable you to sort the returned buckets based on the numeric metrics in each bucket).

Avg Aggregation

A single-value metrics aggregation that computes the average of numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Assuming the data consists of documents representing exam grades (between 0 and 100) of students, we can average their scores with:

POST /exams/_search?size=0
{
    "aggs" : {
        "avg_grade" : { "avg" : { "field" : "grade" } }
    }
}

The above aggregation computes the average grade over all documents. The aggregation type is avg and the field setting defines the numeric field of the documents the average will be computed on. The above will return the following:

{
    ...
    "aggregations": {
        "avg_grade": {
            "value": 75.0
        }
    }
}

The name of the aggregation (avg_grade above) also serves as the key by which the aggregation result can be retrieved from the returned response.
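
As a minimal sketch of that lookup, the snippet below parses the sample response shown above (with the elided "..." part left out) and reads the result by aggregation name. The helper name read_single_value is hypothetical and not part of any client library.

```python
import json

# Sample response copied from the example above; the elided "..." part is omitted.
response_text = """
{
    "aggregations": {
        "avg_grade": {
            "value": 75.0
        }
    }
}
"""


def read_single_value(response, agg_name):
    """Return the value of a single-value metrics aggregation, looked up by its name."""
    return response["aggregations"][agg_name]["value"]


response = json.loads(response_text)
print(read_single_value(response, "avg_grade"))
```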

Script

Computing the average grade based on a script:

POST /exams/_search?size=0
{
    "aggs" : {
        "avg_grade" : {
            "avg" : {
                "script" : {
                    "source" : "doc.grade.value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

POST /exams/_search?size=0
{
    "aggs" : {
        "avg_grade" : {
            "avg" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "grade"
                    }
                }
            }
        }
    }
}

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new average:

POST /exams/_search?size=0
{
    "aggs" : {
        "avg_corrected_grade" : {
            "avg" : {
                "field" : "grade",
                "script" : {
                    "lang": "painless",
                    "source": "_value * params.correction",
                    "params" : {
                        "correction" : 1.2
                    }
                }
            }
        }
    }
}
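
The effect of the value script can be sketched outside Elasticsearch: every extracted value is multiplied by the correction factor before it is averaged. The grades below are an assumption, chosen to be consistent with the 75.0 average shown earlier; the source does not list the indexed documents.

```python
def avg_with_value_script(grades, correction):
    """Average the grades after multiplying each one by the correction factor."""
    # Plays the role of "_value * params.correction" in the request above.
    corrected = [grade * correction for grade in grades]
    return sum(corrected) / len(corrected)


# Assumed grades, not taken from the source.
grades = [50, 100]
corrected_avg = avg_with_value_script(grades, 1.2)
print(corrected_avg)
```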

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /exams/_search?size=0
{
    "aggs" : {
        "grade_avg" : {
            "avg" : {
                "field" : "grade",
                "missing": 10 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.
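
A rough sketch of the two behaviors, using hypothetical documents held in plain Python dictionaries: without a replacement, a document that lacks the field is skipped; with one, it is counted as if it held that value.

```python
def avg_with_missing(docs, field, missing=None):
    """Average a field; documents lacking it are skipped unless `missing` is given."""
    values = []
    for doc in docs:
        if field in doc:
            values.append(doc[field])
        elif missing is not None:
            values.append(missing)  # treat the document as if it had this value
    return sum(values) / len(values) if values else None


# Hypothetical documents: the third one has no grade.
docs = [{"grade": 50}, {"grade": 100}, {"student": "no grade recorded"}]
print(avg_with_missing(docs, "grade"))              # third document ignored
print(avg_with_missing(docs, "grade", missing=10))  # third document counted as 10
```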

Weighted Avg Aggregation

A single-value metrics aggregation that computes the weighted average of numeric values that are extracted from the aggregated documents. These values can be extracted from specific numeric fields in the documents, or provided by a script.

When calculating a regular average, each datapoint has an equal "weight": it contributes equally to the final value. Weighted averages, on the other hand, weight each datapoint differently. The amount that each datapoint contributes to the final value is extracted from the document, or provided by a script.

As a formula, a weighted average is the ∑(value * weight) / ∑(weight)

A regular average can be thought of as a weighted average where every value has an implicit weight of 1.
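
The formula can be checked with a few lines of Python. This is only an illustration of the arithmetic, with made-up (value, weight) pairs; it says nothing about how the aggregation is implemented.

```python
def weighted_avg(pairs):
    """Compute sum(value * weight) / sum(weight) over (value, weight) pairs."""
    total_weight = sum(weight for _, weight in pairs)
    if total_weight == 0:
        raise ValueError("the sum of the weights must be non-zero")
    return sum(value * weight for value, weight in pairs) / total_weight


# Made-up data: a grade of 90 with weight 1 and a grade of 60 with weight 3.
print(weighted_avg([(90, 1), (60, 3)]))
# With every weight set to 1 the result is the regular average.
print(weighted_avg([(90, 1), (60, 1)]))
```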

Table 1. weighted_avg Parameters

Parameter Name | Description                                                         | Required | Default Value
value          | The configuration for the field or script that provides the values  | Required |
weight         | The configuration for the field or script that provides the weights | Required |
format         | The numeric response formatter                                      | Optional |
value_type     | A hint about the values for pure scripts or unmapped fields         | Optional |

The value and weight objects have per-field specific configuration:

Table 2. value Parameters

Parameter Name | Description                                     | Required | Default Value
field          | The field that values should be extracted from  | Required |
missing        | A value to use if the field is missing entirely | Optional |

Table 3. weight Parameters

Parameter Name | Description                                      | Required | Default Value
field          | The field that weights should be extracted from  | Required |
missing        | A weight to use if the field is missing entirely | Optional |

Examples

If our documents have a "grade" field that holds a 0-100 numeric score, and a "weight" field which holds an arbitrary numeric weight, we can calculate the weighted average using:

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "field": "grade"
                },
                "weight": {
                    "field": "weight"
                }
            }
        }
    }
}

Which yields a response like:

{
    ...
    "aggregations": {
        "weighted_grade": {
            "value": 70.0
        }
    }
}

While multiple values-per-field are allowed, only one weight is allowed. If the aggregation encounters a document that has more than one weight (e.g. the weight field is a multi-valued field) it will throw an exception. If you have this situation, you will need to specify a script for the weight field, and use the script to combine the multiple values into a single value to be used.

This single weight will be applied independently to each value extracted from the value field.

This example shows how a single document with multiple values will be averaged with a single weight:

POST /exams/_doc?refresh
{
    "grade": [1, 2, 3],
    "weight": 2
}

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "field": "grade"
                },
                "weight": {
                    "field": "weight"
                }
            }
        }
    }
}

The three values (1, 2, and 3) will be included as independent values, all with the weight of 2:

{
    ...
    "aggregations": {
        "weighted_grade": {
            "value": 2.0
        }
    }
}

The aggregation returns 2.0 as the result, which matches what we would expect when calculating by hand: ((1*2) + (2*2) + (3*2)) / (2+2+2) == 2
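
The same hand calculation, written as a small sketch. It assumes documents are plain dictionaries with a grade field (a single number or a list) and a single weight, mirroring the example document above.

```python
def weighted_avg_docs(docs):
    """Apply each document's single weight to every value in its grade field."""
    numerator = 0.0
    denominator = 0.0
    for doc in docs:
        grades = doc["grade"] if isinstance(doc["grade"], list) else [doc["grade"]]
        for grade in grades:
            numerator += grade * doc["weight"]
            denominator += doc["weight"]  # the weight is counted once per value
    return numerator / denominator


docs = [{"grade": [1, 2, 3], "weight": 2}]
print(weighted_avg_docs(docs))
```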

Script

Both the value and the weight can be derived from a script, instead of a field. As a simple example, the following will add one to the grade and weight in the document using a script:

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "script": "doc.grade.value + 1"
                },
                "weight": {
                    "script": "doc.weight.value + 1"
                }
            }
        }
    }
}

Missing values

The missing parameter defines how documents that are missing a value should be treated. The default behavior is different for value and weight:

By default, if the value field is missing the document is ignored and the aggregation moves on to the next document. If the weight field is missing, it is assumed to have a weight of 1 (like a normal average).

Both of these defaults can be overridden with the missing parameter:

POST /exams/_search
{
    "size": 0,
    "aggs" : {
        "weighted_grade": {
            "weighted_avg": {
                "value": {
                    "field": "grade",
                    "missing": 2
                },
                "weight": {
                    "field": "weight",
                    "missing": 3
                }
            }
        }
    }
}
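
The defaults and their overrides can be sketched as follows. The documents are hypothetical, and the function only mimics the behavior described above: a document without a value is skipped unless a replacement value is given, and a document without a weight gets a weight of 1 unless a replacement weight is given.

```python
def weighted_avg_with_missing(docs, value_missing=None, weight_missing=1):
    """Weighted average over dict documents with optional replacements for absent fields."""
    numerator = 0.0
    denominator = 0.0
    for doc in docs:
        value = doc.get("grade", value_missing)
        if value is None:
            continue  # no value and no replacement: the document is ignored
        weight = doc.get("weight", weight_missing)
        numerator += value * weight
        denominator += weight
    return numerator / denominator if denominator else None


# Hypothetical documents: one complete, one without a weight, one without a grade.
docs = [{"grade": 80, "weight": 2}, {"grade": 50}, {"weight": 4}]
print(weighted_avg_with_missing(docs))                                     # defaults
print(weighted_avg_with_missing(docs, value_missing=2, weight_missing=3))  # overrides
```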

Cardinality Aggregation

A single-value metrics aggregation that calculates an approximate count of distinct values. Values can be extracted either from specific fields in the document or generated by a script.

Assume you are indexing store sales and would like to count the number of unique sold products that match a query:

POST /sales/_search?size=0
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "type"
            }
        }
    }
}

Response:

{
    ...
    "aggregations" : {
        "type_count" : {
            "value" : 3
        }
    }
}

Precision control

This aggregation also supports the precision_threshold option:

POST /sales/_search?size=0
{
    "aggs" : {
        "type_count" : {
            "cardinality" : {
                "field" : "type",
                "precision_threshold": 100 (1)
            }
        }
    }
}
  1. The precision_threshold option allows you to trade memory for accuracy, and defines a unique count below which counts are expected to be close to accurate. Above this value, counts might become a bit more fuzzy. The maximum supported value is 40000; thresholds above this number will have the same effect as a threshold of 40000. The default value is 3000.

Counts are approximate

Computing exact counts requires loading values into a hash set and returning its size. This doesn’t scale when working on high-cardinality sets and/or large values as the required memory usage and the need to communicate those per-shard sets between nodes would utilize too many resources of the cluster.
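
A sketch of the exact approach described above, with hypothetical per-shard values. Every shard has to build and ship its complete set of values, so the memory and network cost grows with the number of distinct values.

```python
def exact_distinct_count(shards):
    """Merge one set of values per shard and return the size of the union."""
    merged = set()
    for shard_values in shards:
        merged |= set(shard_values)  # each shard's full set must be transferred and merged
    return len(merged)


# Hypothetical product types spread over three shards.
shards = [["hat", "bag"], ["bag", "t-shirt"], ["hat"]]
print(exact_distinct_count(shards))
```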

This cardinality aggregation is based on the HyperLogLog++ algorithm, which counts based on the hashes of the values with some interesting properties:

  • configurable precision, which decides on how to trade memory for accuracy,

  • excellent accuracy on low-cardinality sets,

  • fixed memory usage: no matter if there are tens or billions of unique values, memory usage only depends on the configured precision.

For a precision threshold of c, the implementation that we are using requires about c * 8 bytes.
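
As a back-of-the-envelope sketch, the rule of thumb above can be combined with the documented maximum (40000) and default (3000) thresholds. The function name is hypothetical and the result is only the approximate figure given by c * 8.

```python
MAX_PRECISION_THRESHOLD = 40000      # larger thresholds behave like 40000
DEFAULT_PRECISION_THRESHOLD = 3000


def estimated_bytes(precision_threshold=DEFAULT_PRECISION_THRESHOLD):
    """Approximate memory use of one cardinality aggregation: about c * 8 bytes."""
    effective = min(precision_threshold, MAX_PRECISION_THRESHOLD)
    return effective * 8


for threshold in (100, 3000, 40000, 100000):
    print(threshold, estimated_bytes(threshold))
```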

The following chart shows how the error varies before and after the threshold:

[Figure: cardinality error]

For all 3 thresholds, counts have been accurate up to the configured threshold. Although not guaranteed, this is likely to be the case. Accuracy in practice depends on the dataset in question. In general, most datasets show consistently good accuracy. Also note that even with a threshold as low as 100, the error remains very low (1-6% as seen in the above graph) even when counting millions of items.

The HyperLogLog++ algorithm depends on the leading zeros of hashed values, so the exact distribution of hashes in a dataset can affect the accuracy of the cardinality estimate.

Pre-computed hashes

On string fields that have a high cardinality, it might be faster to store the hash of your field values in your index and then run the cardinality aggregation on this field. This can either be done by providing hash values from client-side or by letting Elasticsearch compute hash values for you by using the mapper-murmur3 plugin.

Note
Pre-computing hashes is usually only useful on very large and/or high-cardinality fields as it saves CPU and memory. However, on numeric fields, hashing is very fast and storing the original values requires as much or less memory than storing the hashes. This is also true on low-cardinality string fields, especially given that those have an optimization in order to make sure that hashes are computed at most once per unique value per segment.

Script

The cardinality metric supports scripting, but with a noticeable performance hit, since hashes need to be computed on the fly.

POST /sales/_search?size=0
{
    "aggs" : {
        "type_promoted_count" : {
            "cardinality" : {
                "script": {
                    "lang": "painless",
                    "source": "doc['type'].value + ' ' + doc['promoted'].value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

POST /sales/_search?size=0
{
    "aggs" : {
        "type_promoted_count" : {
            "cardinality" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "type_field": "type",
                        "promoted_field": "promoted"
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /sales/_search?size=0
{
    "aggs" : {
        "tag_cardinality" : {
            "cardinality" : {
                "field" : "tag",
                "missing": "N/A" (1)
            }
        }
    }
}
  1. Documents without a value in the tag field will fall into the same bucket as documents that have the value N/A.

Extended Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The extended_stats aggregation is an extended version of the stats aggregation, where additional metrics are added such as sum_of_squares, variance, std_deviation and std_deviation_bounds.

Assume the data consists of documents representing exam grades (between 0 and 100) of students:

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : { "extended_stats" : { "field" : "grade" } }
    }
}

The above aggregation computes the grades statistics over all documents. The aggregation type is extended_stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:

{
    ...

    "aggregations": {
        "grades_stats": {
           "count": 2,
           "min": 50.0,
           "max": 100.0,
           "avg": 75.0,
           "sum": 150.0,
           "sum_of_squares": 12500.0,
           "variance": 625.0,
           "std_deviation": 25.0,
           "std_deviation_bounds": {
               "upper": 125.0,
               "lower": 25.0
           }
        }
    }
}

The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.
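
The numbers in the sample response can be reproduced by hand. The sketch below assumes the two grades are 50 and 100 (consistent with the count, min and max shown) and uses the population variance, which is the variant that agrees with the 625.0 in the response; it is an illustration of the arithmetic, not the aggregation's implementation.

```python
import math


def extended_stats(values):
    """Compute the metrics shown in the sample extended_stats response."""
    count = len(values)
    total = sum(values)
    sum_of_squares = sum(v * v for v in values)
    avg = total / count
    # Population variance: mean of the squares minus the square of the mean.
    variance = max(sum_of_squares / count - avg * avg, 0.0)
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        # Plus/minus two standard deviations, as in the sample response.
        "std_deviation_bounds": {
            "upper": avg + 2 * std_deviation,
            "lower": avg - 2 * std_deviation,
        },
    }


stats = extended_stats([50.0, 100.0])  # assumed grades
print(stats)
```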

Standard Deviation Bounds

By default, the extended_stats metric will return an object called std_deviation_bounds, which provides an interval of plus/minus two standard deviations from the mean. This can be a useful way to visualize variance of your data. If you want a different boundary, for example three standard deviations, you can set sigma in the request:

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "sigma" : 3 (1)
            }
        }
    }
}
  1. sigma controls how many standard deviations +/- from the mean should be displayed

sigma can be any non-negative double, meaning you can request non-integer values such as 1.5. A value of 0 is valid, but will simply return the average for both upper and lower bounds.
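
The effect of sigma on the bounds is simple arithmetic, sketched here with the average (75.0) and standard deviation (25.0) from the sample response. The function name is hypothetical.

```python
def std_deviation_bounds(avg, std_deviation, sigma=2.0):
    """Return the interval avg +/- sigma * std_deviation."""
    if sigma < 0:
        raise ValueError("sigma must be non-negative")
    return {
        "upper": avg + sigma * std_deviation,
        "lower": avg - sigma * std_deviation,
    }


print(std_deviation_bounds(75.0, 25.0))             # default of two standard deviations
print(std_deviation_bounds(75.0, 25.0, sigma=3))    # three standard deviations
print(std_deviation_bounds(75.0, 25.0, sigma=1.5))  # non-integer values are allowed
print(std_deviation_bounds(75.0, 25.0, sigma=0))    # both bounds equal the average
```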

Note
Standard Deviation and Bounds require normality

The standard deviation and its bounds are displayed by default, but they are not always applicable to all data sets. Your data must be normally distributed for the metrics to make sense. The statistics behind standard deviations assume normally distributed data, so if your data is skewed heavily left or right, the value returned will be misleading.

Script

Computing the grades stats based on a script:

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "script" : {
                    "source" : "doc['grade'].value",
                    "lang" : "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "grade"
                    }
                }
            }
        }
    }
}

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "script" : {
                    "lang" : "painless",
                    "source": "_value * params.correction",
                    "params" : {
                        "correction" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

GET /exams/_search
{
    "size": 0,
    "aggs" : {
        "grades_stats" : {
            "extended_stats" : {
                "field" : "grade",
                "missing": 0 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.

Geo Bounds Aggregation

A metric aggregation that computes the bounding box containing all geo_point values for a field.

Example:

PUT /museums
{
    "mappings": {
        "_doc": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}

POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
    "query" : {
        "match" : { "name" : "musée" }
    },
    "aggs" : {
        "viewport" : {
            "geo_bounds" : {
                "field" : "location", (1)
                "wrap_longitude" : true (2)
            }
        }
    }
}
  1. The geo_bounds aggregation specifies the field to use to obtain the bounds

  2. wrap_longitude is an optional parameter which specifies whether the bounding box should be allowed to overlap the international date line. The default value is true

The above aggregation demonstrates how one would compute the bounding box of the location field for all documents whose name matches musée.

The response for the above aggregation:

{
    ...
    "aggregations": {
        "viewport": {
            "bounds": {
                "top_left": {
                    "lat": 48.86111099738628,
                    "lon": 2.3269999679178
                },
                "bottom_right": {
                    "lat": 48.85999997612089,
                    "lon": 2.3363889567553997
                }
            }
        }
    }
}
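
A simplified sketch of the bounding box for the two documents whose names match "musée" in the data above. It deliberately ignores the date-line handling controlled by wrap_longitude, so it only applies to points that do not straddle the date line. The response values above differ from the indexed coordinates in the trailing digits, so the comparison is against the inputs.

```python
def geo_bounds(points):
    """Bounding box of (lat, lon) pairs; does not handle boxes crossing the date line."""
    lats = [lat for lat, _ in points]
    lons = [lon for _, lon in points]
    return {
        "top_left": {"lat": max(lats), "lon": min(lons)},
        "bottom_right": {"lat": min(lats), "lon": max(lons)},
    }


# Locations of Musée du Louvre and Musée d'Orsay from the bulk request above.
points = [(48.861111, 2.336389), (48.860000, 2.327000)]
bounds = geo_bounds(points)
print(bounds)
```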

Geo Centroid Aggregation

A metric aggregation that computes the weighted centroid from all coordinate values for a geo_point field.

Example:

PUT /museums
{
    "mappings": {
        "_doc": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}

POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "city": "Amsterdam", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "city": "Amsterdam", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "city": "Amsterdam", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "city": "Antwerp", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "city": "Paris", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "city": "Paris", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
    "aggs" : {
        "centroid" : {
            "geo_centroid" : {
                "field" : "location" (1)
            }
        }
    }
}
  1. The geo_centroid aggregation specifies the field to use for computing the centroid. (NOTE: field must be a geo_point type)

The above aggregation demonstrates how one would compute the centroid of the location field for all documents in the museums index.

The response for the above aggregation:

{
    ...
    "aggregations": {
        "centroid": {
            "location": {
                "lat": 51.00982963107526,
                "lon": 3.9662130922079086
            },
            "count": 6
        }
    }
}
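
The sample response can be approximated by hand, assuming the centroid is the plain arithmetic mean of the latitudes and of the longitudes. That assumption agrees with the response above to roughly six decimal places; it is not a statement about the internal implementation.

```python
def geo_centroid(points):
    """Arithmetic mean of (lat, lon) pairs, plus the number of points."""
    count = len(points)
    lat = sum(lat for lat, _ in points) / count
    lon = sum(lon for _, lon in points) / count
    return {"location": {"lat": lat, "lon": lon}, "count": count}


# The six museum locations from the bulk request above.
points = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]
centroid = geo_centroid(points)
print(centroid)
```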

The geo_centroid aggregation is more interesting when combined as a sub-aggregation to other bucket aggregations.

Example:

POST /museums/_search?size=0
{
    "aggs" : {
        "cities" : {
            "terms" : { "field" : "city.keyword" },
            "aggs" : {
                "centroid" : {
                    "geo_centroid" : { "field" : "location" }
                }
            }
        }
    }
}

The above example uses geo_centroid as a sub-aggregation to a terms bucket aggregation for finding the central location for museums in each city.

The response for the above aggregation:

{
    ...
    "aggregations": {
        "cities": {
            "sum_other_doc_count": 0,
            "doc_count_error_upper_bound": 0,
            "buckets": [
               {
                   "key": "Amsterdam",
                   "doc_count": 3,
                   "centroid": {
                      "location": {
                         "lat": 52.371655656024814,
                         "lon": 4.909563297405839
                      },
                      "count": 3
                   }
               },
               {
                   "key": "Paris",
                   "doc_count": 2,
                   "centroid": {
                      "location": {
                         "lat": 48.86055548675358,
                         "lon": 2.3316944623366
                      },
                      "count": 2
                   }
                },
                {
                    "key": "Antwerp",
                    "doc_count": 1,
                    "centroid": {
                       "location": {
                          "lat": 51.22289997059852,
                          "lon": 4.40519998781383
                       },
                       "count": 1
                    }
                 }
            ]
        }
    }
}
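
The bucketed result can be sketched the same way: group the museums by city, then average each group. As before, the arithmetic mean is an assumption that matches the sample response to about six decimal places, and the grouping here is ordinary Python rather than a terms aggregation.

```python
from collections import defaultdict

# City and location of each museum from the bulk request above.
museums = [
    {"city": "Amsterdam", "location": (52.374081, 4.912350)},
    {"city": "Amsterdam", "location": (52.369219, 4.901618)},
    {"city": "Amsterdam", "location": (52.371667, 4.914722)},
    {"city": "Antwerp", "location": (51.222900, 4.405200)},
    {"city": "Paris", "location": (48.861111, 2.336389)},
    {"city": "Paris", "location": (48.860000, 2.327000)},
]


def centroids_by_city(docs):
    """Group documents by city and compute the mean location of each group."""
    groups = defaultdict(list)
    for doc in docs:
        groups[doc["city"]].append(doc["location"])
    result = {}
    for city, points in groups.items():
        count = len(points)
        result[city] = {
            "lat": sum(lat for lat, _ in points) / count,
            "lon": sum(lon for _, lon in points) / count,
            "count": count,
        }
    return result


by_city = centroids_by_city(museums)
for city, centroid in by_city.items():
    print(city, centroid)
```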

Max Aggregation

A single-value metrics aggregation that keeps track of and returns the maximum value among the numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note
The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.
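
The precision limit is a property of IEEE 754 doubles and can be seen in any language. In this sketch, Python's float stands in for the double representation; the function name is hypothetical.

```python
def max_as_double(values):
    """Take the maximum after converting every value to a double."""
    return max(float(v) for v in values)


exact_max = 2**53 + 1
approximate_max = max_as_double([2**53, exact_max])

# 2**53 + 1 has no exact double representation, so it collapses onto 2**53.
print(exact_max)
print(int(approximate_max))
print(float(exact_max) == float(2**53))
```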

Computing the max price value across all documents:

POST /sales/_search?size=0
{
    "aggs" : {
        "max_price" : { "max" : { "field" : "price" } }
    }
}

Response:

{
    ...
    "aggregations": {
        "max_price": {
            "value": 200.0
        }
    }
}

As can be seen, the name of the aggregation (max_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

The max aggregation can also calculate the maximum of a script. The example below computes the maximum price:

POST /sales/_search
{
    "aggs" : {
        "max_price" : {
            "max" : {
                "script" : {
                    "source" : "doc.price.value"
                }
            }
        }
    }
}

This will use the Painless scripting language and no script parameters. To use a stored script use the following syntax:

POST /sales/_search
{
    "aggs" : {
        "max_price" : {
            "max" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

Value Script

Let’s say that the prices of the documents in our index are in USD, but we would like to compute the max in EURO (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:

POST /sales/_search
{
    "aggs" : {
        "max_price_in_euros" : {
            "max" : {
                "field" : "price",
                "script" : {
                    "source" : "_value * params.conversion_rate",
                    "params" : {
                        "conversion_rate" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /sales/_search
{
    "aggs" : {
        "grade_max" : {
            "max" : {
                "field" : "grade",
                "missing": 10 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Min Aggregation

A single-value metrics aggregation that keeps track of and returns the minimum value among numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note
The min and max aggregations operate on the double representation of the data. As a consequence, the result may be approximate when running on longs whose absolute value is greater than 2^53.

Computing the min price value across all documents:

POST /sales/_search?size=0
{
    "aggs" : {
        "min_price" : { "min" : { "field" : "price" } }
    }
}

Response:

{
    ...

    "aggregations": {
        "min_price": {
            "value": 10.0
        }
    }
}

As can be seen, the name of the aggregation (min_price above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

The min aggregation can also calculate the minimum of a script. The example below computes the minimum price:

POST /sales/_search
{
    "aggs" : {
        "min_price" : {
            "min" : {
                "script" : {
                    "source" : "doc.price.value"
                }
            }
        }
    }
}

This will use the Painless scripting language and no script parameters. To use a stored script use the following syntax:

POST /sales/_search
{
    "aggs" : {
        "min_price" : {
            "min" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "price"
                    }
                }
            }
        }
    }
}

Value Script

Let’s say that the prices of the documents in our index are in USD, but we would like to compute the min in EURO (and for the sake of this example, let’s say the conversion rate is 1.2). We can use a value script to apply the conversion rate to every value before it is aggregated:

POST /sales/_search
{
    "aggs" : {
        "min_price_in_euros" : {
            "min" : {
                "field" : "price",
                "script" : {
                    "source" : "_value * params.conversion_rate",
                    "params" : {
                        "conversion_rate" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /sales/_search
{
    "aggs" : {
        "grade_min" : {
            "min" : {
                "field" : "grade",
                "missing": 10 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Percentiles Aggregation

A multi-value metrics aggregation that calculates one or more percentiles over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Percentiles show the point at which a certain percentage of observed values occur. For example, the 95th percentile is the value which is greater than 95% of the observed values.
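
To make the definition concrete, here is a small exact calculation using the nearest-rank method on made-up load times. It only illustrates what a percentile means; it is one of several common percentile definitions and is not claimed to be the method the aggregation uses.

```python
import math


def percentile_nearest_rank(values, percent):
    """Smallest observed value that is >= `percent` percent of the observations."""
    if not values:
        raise ValueError("values must not be empty")
    if not 0 <= percent <= 100:
        raise ValueError("percent must be between 0 and 100 inclusive")
    ordered = sorted(values)
    rank = max(1, math.ceil(percent * len(ordered) / 100))
    return ordered[rank - 1]


# Made-up load times in milliseconds: 10, 20, ..., 1000 (100 observations).
load_times = list(range(10, 1010, 10))
print(percentile_nearest_rank(load_times, 50))
print(percentile_nearest_rank(load_times, 95))
```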

Percentiles are often used to find outliers. In normal distributions, the 0.13th and 99.87th percentiles represent three standard deviations from the mean. Any data which falls outside three standard deviations is often considered an anomaly.
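
The two percentile figures follow from the cumulative distribution function of the normal distribution, which can be checked with Python's standard library:

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# Percentage of values expected below -3 and below +3 standard deviations.
lower_percentile = standard_normal.cdf(-3) * 100
upper_percentile = standard_normal.cdf(3) * 100

print(round(lower_percentile, 2))
print(round(upper_percentile, 2))
```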

When a range of percentiles is retrieved, they can be used to estimate the data distribution and determine if the data is skewed, bimodal, etc.

Assume your data consists of website load times. The average and median load times are not overly useful to an administrator. The max may be interesting, but it can be easily skewed by a single slow response.

Let’s look at a range of percentiles representing load time:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time" (1)
            }
        }
    }
}
  1. The field load_time must be a numeric field

By default, the percentile metric will generate a range of percentiles: [ 1, 5, 25, 50, 75, 95, 99 ]. The response will look like this:

{
    ...

   "aggregations": {
      "load_time_outlier": {
         "values" : {
            "1.0": 5.0,
            "5.0": 25.0,
            "25.0": 165.0,
            "50.0": 445.0,
            "75.0": 725.0,
            "95.0": 945.0,
            "99.0": 985.0
         }
      }
   }
}

As you can see, the aggregation will return a calculated value for each percentile in the default range. If we assume response times are in milliseconds, it is immediately obvious that the webpage normally loads in 5-725ms, but occasionally spikes to 945-985ms.

Often, administrators are only interested in outliers, that is, the extreme percentiles. We can specify just the percents we are interested in (requested percentiles must be between 0 and 100 inclusive):

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9] (1)
            }
        }
    }
}
  1. Use the percents parameter to specify particular percentiles to calculate

Keyed Response

By default, the keyed flag is set to true, which associates a unique string key with each percentile and returns the values as a hash rather than an array. Setting the keyed flag to false will disable this behavior:

GET latency/_search
{
    "size": 0,
    "aggs": {
        "load_time_outlier": {
            "percentiles": {
                "field": "load_time",
                "keyed": false
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "load_time_outlier": {
            "values": [
                {
                    "key": 1.0,
                    "value": 5.0
                },
                {
                    "key": 5.0,
                    "value": 25.0
                },
                {
                    "key": 25.0,
                    "value": 165.0
                },
                {
                    "key": 50.0,
                    "value": 445.0
                },
                {
                    "key": 75.0,
                    "value": 725.0
                },
                {
                    "key": 95.0,
                    "value": 945.0
                },
                {
                    "key": 99.0,
                    "value": 985.0
                }
            ]
        }
    }
}
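
The two response shapes carry the same information. As a minimal client-side sketch (the helper name `to_keyed` is hypothetical, not an Elasticsearch API), the array form can be converted back into the keyed form like this:

```python
def to_keyed(values):
    """Turn the keyed=false list form into the keyed=true mapping form."""
    return {str(entry["key"]): entry["value"] for entry in values}


# A shortened copy of the "values" array from the response above.
unkeyed = [
    {"key": 1.0, "value": 5.0},
    {"key": 50.0, "value": 445.0},
    {"key": 99.0, "value": 985.0},
]

print(to_keyed(unkeyed))
```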

Script

The percentile metric supports scripting. For example, if our load times are in milliseconds but we want percentiles calculated in seconds, we could use a script to convert them on-the-fly:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : {
                    "lang": "painless",
                    "source": "doc['load_time'].value / params.timeUnit", (1)
                    "params" : {
                        "timeUnit" : 1000   (2)
                    }
                }
            }
        }
    }
}
  1. The field parameter is replaced with a script parameter, which uses the script to generate values which percentiles are calculated on

  2. Scripting supports parameterized input just like any other script

This will interpret the script parameter as an inline script with the painless script language and the given script parameters. To use a stored script use the following syntax:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "load_time"
                    }
                }
            }
        }
    }
}

Percentiles are (usually) approximate

There are many different algorithms to calculate percentiles. The naive implementation simply stores all the values in a sorted array. To find the 50th percentile, you simply find the value that is at my_array[count(my_array) * 0.5].
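
A minimal Python sketch of this naive approach is shown below. The function name and the sample load times are illustrative; the index is clamped so that the 100th percentile maps to the last element.

```python
def naive_percentile(values, percent):
    """Return the value at the given percentile (0-100) using a full sort."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)  # keeps every value in memory
    # Same indexing as my_array[count(my_array) * 0.5] for the median,
    # clamped so that percent=100 maps to the last element.
    index = min(int(len(ordered) * percent / 100), len(ordered) - 1)
    return ordered[index]


load_times = [445, 5, 985, 165, 725, 25, 945]  # hypothetical values in ms
print(naive_percentile(load_times, 50))
```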

Clearly, the naive implementation does not scale — the sorted array grows linearly with the number of values in your dataset. To calculate percentiles across potentially billions of values in an Elasticsearch cluster, approximate percentiles are calculated.

The algorithm used by the percentile metric is called TDigest (introduced by Ted Dunning in Computing Accurate Quantiles using T-Digests).

When using this metric, there are a few guidelines to keep in mind:

  • The approximation error is proportional to q(1-q). This means that extreme percentiles (e.g. 99%) are more accurate than less extreme percentiles, such as the median.

  • For small sets of values, percentiles are highly accurate (and potentially 100% accurate if the data is small enough).

  • As the quantity of values in a bucket grows, the algorithm begins to approximate the percentiles. It is effectively trading accuracy for memory savings. The exact level of inaccuracy is difficult to generalize, since it depends on your data distribution and volume of data being aggregated
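
The q(1-q) term from the first guideline can be evaluated directly. The sketch below (with a hypothetical helper name) shows that the term is largest at the median (q = 0.5) and shrinks toward the extremes, which is consistent with extreme percentiles being estimated more accurately.

```python
def q_term(q):
    """Return q(1-q) for a quantile q between 0 and 1."""
    return q * (1 - q)


for q in (0.5, 0.95, 0.99):
    print(q, q_term(q))
```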

The following chart shows the relative error on a uniform distribution depending on the number of collected values and the requested percentile:

[Figure: percentiles error]

It shows how precision is better for extreme percentiles. The error diminishes for a large number of values because the law of large numbers makes the distribution of values more and more uniform, so the t-digest tree can do a better job of summarizing it. This would not be the case for more skewed distributions.

Compression

Approximate algorithms must balance memory utilization with estimation accuracy. This balance can be controlled using a compression parameter:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "tdigest": {
                  "compression" : 200 (1)
                }
            }
        }
    }
}
  1. Compression controls memory usage and approximation error

The TDigest algorithm uses a number of "nodes" to approximate percentiles: the more nodes available, the higher the accuracy (and the larger the memory footprint) proportional to the volume of data. The compression parameter limits the maximum number of nodes to 20 * compression.

Therefore, by increasing the compression value, you can increase the accuracy of your percentiles at the cost of more memory. Larger compression values also make the algorithm slower since the underlying tree data structure grows in size, resulting in more expensive operations. The default compression value is 100.

A "node" uses roughly 32 bytes of memory, so under worst-case scenarios (large amount of data which arrives sorted and in-order) the default settings will produce a TDigest roughly 64KB in size. In practice data tends to be more random and the TDigest will use less memory.
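
The worst-case figure follows from the two numbers given above: at most 20 * compression nodes, at roughly 32 bytes each. A small sketch of that arithmetic (the constant and function names are illustrative only):

```python
NODES_PER_COMPRESSION_UNIT = 20  # node limit is 20 * compression
BYTES_PER_NODE = 32              # rough per-node cost quoted above


def worst_case_tdigest_bytes(compression=100):
    """Rough upper bound on TDigest size for a given compression value."""
    max_nodes = NODES_PER_COMPRESSION_UNIT * compression
    return max_nodes * BYTES_PER_NODE


print(worst_case_tdigest_bytes())     # default compression of 100
print(worst_case_tdigest_bytes(200))  # compression used in the request above
```

This is an estimate only; as noted, real data rarely triggers the worst case.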

HDR Histogram

Note
This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future.

HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentiles for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).
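
The resolution figures in this paragraph can be reproduced with simple arithmetic. The sketch below is a simplified reading of "significant digits" (step size = value / 10^digits); it does not model the actual bucket layout of the HDR Histogram implementation, and the helper name is hypothetical.

```python
def value_resolution(value, significant_digits):
    """Coarsest step that still preserves the given number of significant digits."""
    return value / 10 ** significant_digits


ONE_MILLISECOND_US = 1_000
ONE_HOUR_US = 3_600_000_000

# Resolution in microseconds for values around 1 millisecond.
print(value_resolution(ONE_MILLISECOND_US, 3))
# Resolution in seconds for the maximum tracked value of 1 hour.
print(value_resolution(ONE_HOUR_US, 3) / 1_000_000)
```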

The HDR Histogram can be used by specifying the hdr object in the request:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_outlier" : {
            "percentiles" : {
                "field" : "load_time",
                "percents" : [95, 99, 99.9],
                "hdr": { (1)
                  "number_of_significant_value_digits" : 3 (2)
                }
            }
        }
    }
}
  1. hdr object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object

  2. number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits

The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "grade_percentiles" : {
            "percentiles" : {
                "field" : "grade",
                "missing": 10 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 10.

Percentile Ranks Aggregation

A multi-value metrics aggregation that calculates one or more percentile ranks over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Note

Please see Percentiles are (usually) approximate and Compression for advice regarding approximation and memory use of the percentile ranks aggregation

Percentile ranks show the percentage of observed values which are below a certain value. For example, if a value is greater than or equal to 95% of the observed values it is said to be at the 95th percentile rank.
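
For a small in-memory data set, a percentile rank can be computed exactly by counting. The following sketch assumes a hypothetical list of load times and is not how Elasticsearch computes the (approximate) result:

```python
def percentile_rank(values, threshold):
    """Percentage of observed values that are less than or equal to threshold."""
    if not values:
        raise ValueError("values must not be empty")
    at_or_below = sum(1 for v in values if v <= threshold)
    return 100.0 * at_or_below / len(values)


# Hypothetical load times in milliseconds.
load_times = [120, 300, 450, 480, 510, 590, 640, 700, 900, 1500]
for target in (500, 600):
    print(target, percentile_rank(load_times, target))
```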

Assume your data consists of website load times. You may have a service agreement that 95% of page loads complete within 500ms and 99% of page loads complete within 600ms.

Let’s look at the percentile ranks for these two load times:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "field" : "load_time", (1)
                "values" : [500, 600]
            }
        }
    }
}
  1. The field load_time must be a numeric field

The response will look like this:

{
    ...

   "aggregations": {
      "load_time_ranks": {
         "values" : {
            "500.0": 55.00000000000001,
            "600.0": 64.0
         }
      }
   }
}

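From this information you can determine whether you are hitting your load time targets. In the sample response above, only 55% of page loads complete within 500ms and 64% within 600ms, so neither the 95% nor the 99% target is being met.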

Keyed Response

By default, the keyed flag is set to true, which associates a unique string key with each value and returns the percentile ranks as a hash rather than an array. Setting the keyed flag to false will disable this behavior:

GET latency/_search
{
    "size": 0,
    "aggs": {
        "load_time_ranks": {
            "percentile_ranks": {
                "field": "load_time",
                "values": [500, 600],
                "keyed": false
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "load_time_ranks": {
            "values": [
                {
                    "key": 500.0,
                    "value": 55.00000000000001
                },
                {
                    "key": 600.0,
                    "value": 64.0
                }
            ]
        }
    }
}

Script

The percentile rank metric supports scripting. For example, if our load times are in milliseconds but we want to specify values in seconds, we could use a script to convert them on-the-fly:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "values" : [500, 600],
                "script" : {
                    "lang": "painless",
                    "source": "doc['load_time'].value / params.timeUnit", (1)
                    "params" : {
                        "timeUnit" : 1000   (2)
                    }
                }
            }
        }
    }
}
  1. The field parameter is replaced with a script parameter, which uses the script to generate values which percentile ranks are calculated on

  2. Scripting supports parameterized input just like any other script

This will interpret the script parameter as an inline script with the painless script language and the given script parameters. To use a stored script use the following syntax:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "values" : [500, 600],
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "load_time"
                    }
                }
            }
        }
    }
}

HDR Histogram

Note
This setting exposes the internal implementation of HDR Histogram and the syntax may change in the future.

HDR Histogram (High Dynamic Range Histogram) is an alternative implementation that can be useful when calculating percentile ranks for latency measurements as it can be faster than the t-digest implementation with the trade-off of a larger memory footprint. This implementation maintains a fixed worst-case percentage error (specified as a number of significant digits). This means that if data is recorded with values from 1 microsecond up to 1 hour (3,600,000,000 microseconds) in a histogram set to 3 significant digits, it will maintain a value resolution of 1 microsecond for values up to 1 millisecond and 3.6 seconds (or better) for the maximum tracked value (1 hour).

The HDR Histogram can be used by specifying the hdr object in the request:

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "field" : "load_time",
                "values" : [500, 600],
                "hdr": { (1)
                  "number_of_significant_value_digits" : 3 (2)
                }
            }
        }
    }
}
  1. hdr object indicates that HDR Histogram should be used to calculate the percentiles and specific settings for this algorithm can be specified inside the object

  2. number_of_significant_value_digits specifies the resolution of values for the histogram in number of significant digits

The HDRHistogram only supports positive values and will error if it is passed a negative value. It is also not a good idea to use the HDRHistogram if the range of values is unknown as this could lead to high memory usage.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

GET latency/_search
{
    "size": 0,
    "aggs" : {
        "load_time_ranks" : {
            "percentile_ranks" : {
                "field" : "load_time",
                "values" : [500, 600],
                "missing": 10 (1)
            }
        }
    }
}
  1. Documents without a value in the load_time field will fall into the same bucket as documents that have the value 10.

Scripted Metric Aggregation

A metric aggregation that executes using scripts to provide a metric output.

Example:

POST ledger/_search?size=0
{
    "query" : {
        "match_all" : {}
    },
    "aggs": {
        "profit": {
            "scripted_metric": {
                "init_script" : "state.transactions = []",
                "map_script" : "state.transactions.add(doc.type.value == 'sale' ? doc.amount.value : -1 * doc.amount.value)", (1)
                "combine_script" : "double profit = 0; for (t in state.transactions) { profit += t } return profit",
                "reduce_script" : "double profit = 0; for (a in states) { profit += a } return profit"
            }
        }
    }
}
  1. map_script is the only required parameter

The above aggregation demonstrates how one would use the scripted metric aggregation to compute the total profit from sale and cost transactions.

The response for the above aggregation:

{
    "took": 218,
    ...
    "aggregations": {
        "profit": {
            "value": 240.0
        }
   }
}

The above example can also be specified using stored scripts as follows:

POST ledger/_search?size=0
{
    "aggs": {
        "profit": {
            "scripted_metric": {
                "init_script" : {
                    "id": "my_init_script"
                },
                "map_script" : {
                    "id": "my_map_script"
                },
                "combine_script" : {
                    "id": "my_combine_script"
                },
                "params": {
                    "field": "amount" (1)
                },
                "reduce_script" : {
                    "id": "my_reduce_script"
                }
            }
        }
    }
}
  1. Script parameters for the init, map and combine scripts must be specified in a global params object so that they can be shared between the scripts.

For more details on specifying scripts see script documentation.

Allowed return types

Whilst any valid script object can be used within a single script, the scripts must return or store in the state object only the following types:

  • primitive types

  • String

  • Map (containing only keys and values of the types listed here)

  • Array (containing elements of only the types listed here)

Scope of scripts

The scripted metric aggregation uses scripts at 4 stages of its execution:

init_script

Executed prior to any collection of documents. Allows the aggregation to set up any initial state.

In the above example, the init_script creates an array transactions in the state object.

map_script

Executed once per document collected. This is the only required script. If no combine_script is specified, the resulting state needs to be stored in the state object.

In the above example, the map_script checks the value of the type field. If the value is 'sale' the value of the amount field is added to the transactions array. If the value of the type field is not 'sale' the negated value of the amount field is added to transactions.

combine_script

Executed once on each shard after document collection is complete. Allows the aggregation to consolidate the state returned from each shard. If a combine_script is not provided the combine phase will return the aggregation variable.

In the above example, the combine_script iterates through all the stored transactions, summing the values in the profit variable and finally returns profit.

reduce_script

Executed once on the coordinating node after all shards have returned their results. The script is provided with access to a variable states which is an array of the result of the combine_script on each shard. If a reduce_script is not provided the reduce phase will return the states variable.

In the above example, the reduce_script iterates through the profit returned by each shard summing the values before returning the final combined profit which will be returned in the response of the aggregation.

Worked Example

Imagine a situation where you index the following documents into an index with 2 shards:

PUT /transactions/_doc/_bulk?refresh
{"index":{"_id":1}}
{"type": "sale","amount": 80}
{"index":{"_id":2}}
{"type": "cost","amount": 10}
{"index":{"_id":3}}
{"type": "cost","amount": 30}
{"index":{"_id":4}}
{"type": "sale","amount": 130}

Let's say that documents 1 and 3 end up on shard A and documents 2 and 4 end up on shard B. The following is a breakdown of what the aggregation result is at each stage of the example above.

Before init_script

state is initialized as a new empty object.

"state" : {}

After init_script

This is run once on each shard before any document collection is performed, and so we will have a copy on each shard:

Shard A

"state" : {
    "transactions" : []
}

Shard B

"state" : {
    "transactions" : []
}

After map_script

Each shard collects its documents and runs the map_script on each document that is collected:

Shard A

"state" : {
    "transactions" : [ 80, -30 ]
}

Shard B

"state" : {
    "transactions" : [ -10, 130 ]
}

After combine_script

The combine_script is executed on each shard after document collection is complete and reduces all the transactions down to a single profit figure for each shard (by summing the values in the transactions array) which is passed back to the coordinating node:

Shard A

50

Shard B

120

After reduce_script

The reduce_script receives a states array containing the result of the combine script for each shard:

"states" : [
    50,
    120
]

It reduces the responses for the shards down to a final overall profit figure (by summing the values) and returns this as the result of the aggregation to produce the response:

{
    ...

    "aggregations": {
        "profit": {
            "value": 170
        }
   }
}
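
The four stages of the worked example can be simulated outside Elasticsearch. The Python sketch below mirrors the logic of the Painless scripts from the first example; the function names and the in-memory `shards` mapping are illustrative stand-ins, with the shard assignment taken from the worked example.

```python
def init_script():
    # state.transactions = []
    return {"transactions": []}


def map_script(state, doc):
    # Sales count as positive amounts, everything else as negative.
    amount = doc["amount"] if doc["type"] == "sale" else -doc["amount"]
    state["transactions"].append(amount)


def combine_script(state):
    # One profit figure per shard.
    return sum(state["transactions"])


def reduce_script(states):
    # Final profit across all shards.
    return sum(states)


# Documents 1 and 3 on shard A, documents 2 and 4 on shard B.
shards = {
    "A": [{"type": "sale", "amount": 80}, {"type": "cost", "amount": 30}],
    "B": [{"type": "cost", "amount": 10}, {"type": "sale", "amount": 130}],
}

states = []
for docs in shards.values():
    state = init_script()
    for doc in docs:
        map_script(state, doc)
    states.append(combine_script(state))

print(states, reduce_script(states))
```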

Other Parameters

params

Optional. An object whose contents will be passed as variables to the init_script, map_script and combine_script. This can be useful to allow the user to control the behavior of the aggregation and for storing state between the scripts. If this is not specified, the default is the equivalent of providing:

"params" : {}

Empty Buckets

If a parent bucket of the scripted metric aggregation does not collect any documents, an empty aggregation response will be returned from the shard with a null value. In this case the reduce_script's states variable will contain null as a response from that shard. A reduce_script should therefore expect and deal with null responses from shards.
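
The null handling can be illustrated with a reduce step that skips empty shard responses. This is a Python sketch of the idea only (with `None` standing in for null and a hypothetical function name); an actual reduce_script would perform the equivalent check in its scripting language.

```python
def reduce_profit(states):
    """Sum per-shard results, skipping shards that returned null (None)."""
    return sum(s for s in states if s is not None)


print(reduce_profit([50, None, 120]))
```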

Stats Aggregation

A multi-value metrics aggregation that computes stats over numeric values extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

The stats that are returned consist of: min, max, sum, count and avg.

Assuming the data consists of documents representing exam grades (between 0 and 100) of students, we can compute statistics over their grades with:

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : { "stats" : { "field" : "grade" } }
    }
}

The above aggregation computes the grades statistics over all documents. The aggregation type is stats and the field setting defines the numeric field of the documents the stats will be computed on. The above will return the following:

{
    ...

    "aggregations": {
        "grades_stats": {
            "count": 2,
            "min": 50.0,
            "max": 100.0,
            "avg": 75.0,
            "sum": 150.0
        }
    }
}

The name of the aggregation (grades_stats above) also serves as the key by which the aggregation result can be retrieved from the returned response.
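
For reference, the five metrics can be reproduced for a small list of values with a few lines of Python. The input grades below are an assumption chosen to match the response above, and the sketch assumes a non-empty list.

```python
def stats(values):
    """Compute the five metrics reported by the stats aggregation."""
    count = len(values)
    total = sum(values)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": total / count,
        "sum": total,
    }


grades = [50.0, 100.0]  # assumed input matching the sample response
print(stats(grades))
```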

Script

Computing the grades stats based on a script:

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : {
             "stats" : {
                 "script" : {
                     "lang": "painless",
                     "source": "doc['grade'].value"
                 }
             }
         }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : {
            "stats" : {
                "script" : {
                    "id": "my_script",
                    "params" : {
                        "field" : "grade"
                    }
                }
            }
        }
    }
}

Value Script

It turned out that the exam was way above the level of the students and a grade correction needs to be applied. We can use a value script to get the new stats:

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : {
            "stats" : {
                "field" : "grade",
                "script" : {
                    "lang": "painless",
                    "source": "_value * params.correction",
                    "params" : {
                        "correction" : 1.2
                    }
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /exams/_search?size=0
{
    "aggs" : {
        "grades_stats" : {
            "stats" : {
                "field" : "grade",
                "missing": 0 (1)
            }
        }
    }
}
  1. Documents without a value in the grade field will fall into the same bucket as documents that have the value 0.
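
The effect of the missing parameter can be sketched as a substitution step applied before the metric is computed. The documents and the helper name below are hypothetical:

```python
def values_with_missing(docs, field, missing=None):
    """Yield the field value per document; apply `missing` when the field is absent."""
    for doc in docs:
        if field in doc:
            yield doc[field]
        elif missing is not None:
            yield missing
        # Otherwise the document is ignored, which is the default behavior.


docs = [{"grade": 50}, {"grade": 100}, {"name": "no grade recorded"}]

ignored = list(values_with_missing(docs, "grade"))
defaulted = list(values_with_missing(docs, "grade", missing=0))
print(ignored, defaulted)
```

With the default behavior the average of these grades is 75; with `missing` set to 0 the third document is counted and the average drops to 50.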

Sum Aggregation

A single-value metrics aggregation that sums up numeric values that are extracted from the aggregated documents. These values can be extracted either from specific numeric fields in the documents, or be generated by a provided script.

Assuming the data consists of documents representing sales records we can sum the sale price of all hats with:

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "hat_prices" : { "sum" : { "field" : "price" } }
    }
}

Resulting in:

{
    ...
    "aggregations": {
        "hat_prices": {
           "value": 450.0
        }
    }
}

The name of the aggregation (hat_prices above) also serves as the key by which the aggregation result can be retrieved from the returned response.

Script

We could also use a script to fetch the sales price:

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "hat_prices" : {
            "sum" : {
                "script" : {
                   "source": "doc.price.value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "hat_prices" : {
            "sum" : {
                "script" : {
                    "id": "my_script",
                    "params" : {
                        "field" : "price"
                    }
                }
            }
        }
    }
}

Value Script

It is also possible to access the field value from the script using _value. For example, this will sum the square of the prices for all hats:

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "square_hats" : {
            "sum" : {
                "field" : "price",
                "script" : {
                    "source": "_value * _value"
                }
            }
        }
    }
}
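
Conceptually, a value script is applied to each field value before the values are summed. A minimal Python sketch of that behavior, using hypothetical hat prices and a lambda in place of the script:

```python
def sum_with_value_script(values, script):
    """Apply `script` to every field value, then sum the results."""
    return sum(script(value) for value in values)


hat_prices = [200, 200, 50]  # hypothetical prices
# Equivalent of the "_value * _value" script in the request above.
print(sum_with_value_script(hat_prices, lambda _value: _value * _value))
```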

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default documents missing the value will be ignored but it is also possible to treat them as if they had a value. For example, this treats all hat sales without a price as being 100.

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : {
            "filter" : {
                "match" : { "type" : "hat" }
            }
        }
    },
    "aggs" : {
        "hat_prices" : {
            "sum" : {
                "field" : "price",
                "missing": 100 (1)
            }
        }
    }
}
  1. Documents without a value in the price field will fall into the same bucket as documents that have the value 100.

Top Hits Aggregation

A top_hits metric aggregator keeps track of the most relevant document being aggregated. This aggregator is intended to be used as a sub aggregator, so that the top matching documents can be aggregated per bucket.

The top_hits aggregator can effectively be used to group result sets by certain fields via a bucket aggregator. One or more bucket aggregators determine the properties by which a result set is sliced.

Options

  • from - The offset from the first result you want to fetch.

  • size - The maximum number of top matching hits to return per bucket. By default the top three matching hits are returned.

  • sort - How the top matching hits should be sorted. By default the hits are sorted by the score of the main query.

Supported per hit features

The top_hits aggregation returns regular search hits, so many per-hit features can be supported.

Example

In the following example we group the sales by type and per type we show the last sale. For each sale only the date and price fields are being included in the source.

POST /sales/_search?size=0
{
    "aggs": {
        "top_tags": {
            "terms": {
                "field": "type",
                "size": 3
            },
            "aggs": {
                "top_sales_hits": {
                    "top_hits": {
                        "sort": [
                            {
                                "date": {
                                    "order": "desc"
                                }
                            }
                        ],
                        "_source": {
                            "includes": [ "date", "price" ]
                        },
                        "size" : 1
                    }
                }
            }
        }
    }
}

Possible response:

{
  ...
  "aggregations": {
    "top_tags": {
       "doc_count_error_upper_bound": 0,
       "sum_other_doc_count": 0,
       "buckets": [
          {
             "key": "hat",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChK",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 200
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "t-shirt",
             "doc_count": 3,
             "top_sales_hits": {
                "hits": {
                   "total": 3,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmauCQpcRyxw6ChL",
                         "_source": {
                            "date": "2015/03/01 00:00:00",
                            "price": 175
                         },
                         "sort": [
                            1425168000000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          },
          {
             "key": "bag",
             "doc_count": 1,
             "top_sales_hits": {
                "hits": {
                   "total": 1,
                   "max_score": null,
                   "hits": [
                      {
                         "_index": "sales",
                         "_type": "_doc",
                         "_id": "AVnNBmatCQpcRyxw6ChH",
                         "_source": {
                            "date": "2015/01/01 00:00:00",
                            "price": 150
                         },
                         "sort": [
                            1420070400000
                         ],
                         "_score": null
                      }
                   ]
                }
             }
          }
       ]
    }
  }
}

Field collapse example

Field collapsing or result grouping is a feature that logically groups a result set into groups and returns the top documents per group. The ordering of the groups is determined by the relevancy of the first document in a group. In Elasticsearch this can be implemented via a bucket aggregator that wraps a top_hits aggregator as a sub-aggregator.

In the example below we search across crawled webpages. For each webpage we store the body and the domain the webpage belongs to. By defining a terms aggregator on the domain field we group the result set of webpages by domain. The top_hits aggregator is then defined as a sub-aggregator, so that the top matching hits are collected per bucket.

A max aggregator is also defined. The terms aggregator's order feature uses it to return the buckets ordered by the relevancy of the most relevant document in each bucket.

POST /sales/_search
{
  "query": {
    "match": {
      "body": "elections"
    }
  },
  "aggs": {
    "top_sites": {
      "terms": {
        "field": "domain",
        "order": {
          "top_hit": "desc"
        }
      },
      "aggs": {
        "top_tags_hits": {
          "top_hits": {}
        },
        "top_hit" : {
          "max": {
            "script": {
              "source": "_score"
            }
          }
        }
      }
    }
  }
}

At the moment the max (or min) aggregator is needed to make sure the buckets from the terms aggregator are ordered according to the score of the most relevant webpage per domain. Unfortunately the top_hits aggregator can’t be used in the order option of the terms aggregator yet.
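As a rough illustration of what the request above computes, here is a minimal in-memory sketch in Python. It is not how Elasticsearch implements the aggregation; the function name `collapse_by_field`, the `top_n` parameter and the sample hits are assumptions made for this example.

```python
from collections import defaultdict


def collapse_by_field(hits, field, top_n=3):
    """Group scored hits by `field`, ordering groups by their best score."""
    groups = defaultdict(list)
    for hit in hits:
        groups[hit[field]].append(hit)

    buckets = []
    for key, members in groups.items():
        # Counterpart of the top_hits sub-aggregation: best hits first.
        members.sort(key=lambda h: h["_score"], reverse=True)
        buckets.append({
            "key": key,
            "doc_count": len(members),
            # Counterpart of the max sub-aggregation over _score.
            "top_hit": members[0]["_score"],
            "top_hits": members[:top_n],
        })

    # Counterpart of "order": {"top_hit": "desc"} on the terms aggregation.
    buckets.sort(key=lambda b: b["top_hit"], reverse=True)
    return buckets


# Hypothetical sample data, not taken from a real index.
hits = [
    {"domain": "a.example", "_score": 1.2},
    {"domain": "b.example", "_score": 3.4},
    {"domain": "a.example", "_score": 2.5},
]
buckets = collapse_by_field(hits, "domain")
```

The sketch sorts on the per-group maximum score, which is the role the max aggregator plays in the request.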

top_hits support in a nested or reverse_nested aggregator

If the top_hits aggregator is wrapped in a nested or reverse_nested aggregator then nested hits are returned. Nested hits are in a sense hidden mini documents that are part of a regular document whose mapping configures a nested field type. The top_hits aggregator has the ability to un-hide these documents if it is wrapped in a nested or reverse_nested aggregator. Read more about nested in the nested type mapping.

If a nested type has been configured, a single document is actually indexed as multiple Lucene documents that share the same id. Determining the identity of a nested hit therefore requires more than just the id, which is why nested hits also include their nested identity. The nested identity is kept under the _nested field in the search hit and includes the array field and the offset in the array field the nested hit belongs to. The offset is zero based.

Let’s see how it works with a real sample. Consider the following mapping:

PUT /sales
{
    "mappings": {
        "_doc" : {
            "properties" : {
                "tags" : { "type" : "keyword" },
                "comments" : { (1)
                    "type" : "nested",
                    "properties" : {
                        "username" : { "type" : "keyword" },
                        "comment" : { "type" : "text" }
                    }
                }
            }
        }
    }
}
  1. comments is an array that holds nested documents under the product object.

And some documents:

PUT /sales/_doc/1?refresh
{
    "tags": ["car", "auto"],
    "comments": [
        {"username": "baddriver007", "comment": "This car could have better brakes"},
        {"username": "dr_who", "comment": "Where's the autopilot? Can't find it"},
        {"username": "ilovemotorbikes", "comment": "This car has two extra wheels"}
    ]
}

It’s now possible to execute the following top_hits aggregation (wrapped in a nested aggregation):

POST /sales/_search
{
    "query": {
        "term": { "tags": "car" }
    },
    "aggs": {
        "by_sale": {
            "nested" : {
                "path" : "comments"
            },
            "aggs": {
                "by_user": {
                    "terms": {
                        "field": "comments.username",
                        "size": 1
                    },
                    "aggs": {
                        "by_nested": {
                            "top_hits":{}
                        }
                    }
                }
            }
        }
    }
}

Top hits response snippet with a nested hit, which resides in the first slot of array field comments:

{
  ...
  "aggregations": {
    "by_sale": {
      "by_user": {
        "buckets": [
          {
            "key": "baddriver007",
            "doc_count": 1,
            "by_nested": {
              "hits": {
                "total": 1,
                "max_score": 0.2876821,
                "hits": [
                  {
                    "_index": "sales",
                    "_type" : "_doc",
                    "_id": "1",
                    "_nested": {
                      "field": "comments",  (1)
                      "offset": 0 (2)
                    },
                    "_score": 0.2876821,
                    "_source": {
                      "comment": "This car could have better brakes", (3)
                      "username": "baddriver007"
                    }
                  }
                ]
              }
            }
          }
          ...
        ]
      }
    }
  }
}
  1. Name of the array field containing the nested hit

  2. Position of the nested hit in the containing array

  3. Source of the nested hit

If _source is requested then only the nested object's part of the source is returned, not the entire source of the document. Stored fields on the nested inner object level are also accessible via a top_hits aggregator residing in a nested or reverse_nested aggregator.

Only nested hits will have a _nested field in the hit; non-nested (regular) hits will not have a _nested field.

The information in _nested can also be used to parse the original source somewhere else if _source isn’t enabled.

If there are multiple levels of nested object types defined in mappings then the _nested information can also be hierarchical in order to express the identity of nested hits that are two layers deep or more.

In the example below a nested hit resides in the first slot of the field nested_grand_child_field which then resides in the second slot of the nested_child_field field:

...
"hits": {
 "total": 2565,
 "max_score": 1,
 "hits": [
   {
     "_index": "a",
     "_type": "b",
     "_id": "1",
     "_score": 1,
     "_nested" : {
       "field" : "nested_child_field",
       "offset" : 1,
       "_nested" : {
         "field" : "nested_grand_child_field",
         "offset" : 0
       }
     },
     "_source": ...
   },
   ...
 ]
}
...
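To make the nested identity concrete, the following Python sketch walks a `_nested` object (possibly hierarchical) through a document source to locate the nested object it identifies. The function `resolve_nested` and the sample document are hypothetical, and the sketch assumes each `field` name is relative to its parent object, as in the snippet above.

```python
def resolve_nested(source, nested):
    """Follow a (possibly hierarchical) _nested identity into `source`."""
    current = source
    while nested is not None:
        # `field` names the array, `offset` is the zero-based slot in it.
        current = current[nested["field"]][nested["offset"]]
        # A deeper level, if any, is carried under the inner `_nested` key.
        nested = nested.get("_nested")
    return current


# Hypothetical document source with two levels of nesting.
source = {
    "nested_child_field": [
        {"nested_grand_child_field": [{"value": "a"}]},
        {"nested_grand_child_field": [{"value": "b"}, {"value": "c"}]},
    ]
}
identity = {
    "field": "nested_child_field",
    "offset": 1,
    "_nested": {"field": "nested_grand_child_field", "offset": 0},
}
nested_object = resolve_nested(source, identity)
```

Here the identity selects the first grandchild of the second child, mirroring the offsets 1 and 0 in the response snippet.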

Value Count Aggregation

A single-value metrics aggregation that counts the number of values that are extracted from the aggregated documents. These values can be extracted either from specific fields in the documents, or be generated by a provided script. Typically, this aggregator will be used in conjunction with other single-value aggregations. For example, when computing the avg one might be interested in the number of values the average is computed over.

POST /sales/_search?size=0
{
    "aggs" : {
        "types_count" : { "value_count" : { "field" : "type" } }
    }
}

Response:

{
    ...
    "aggregations": {
        "types_count": {
            "value": 7
        }
    }
}

The name of the aggregation (types_count above) also serves as the key by which the aggregation result can be retrieved from the returned response.
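The distinction between counting values and counting documents can be sketched in a few lines of Python. This is only an in-memory illustration; the function `value_count` and the sample documents are assumptions made for this example.

```python
def value_count(docs, field):
    """Count the values found under `field`, not the documents."""
    total = 0
    for doc in docs:
        value = doc.get(field)
        if value is None:
            continue  # documents without the field contribute nothing
        # A multi-valued field contributes one count per value.
        total += len(value) if isinstance(value, list) else 1
    return total


# Hypothetical documents: one single-valued, one multi-valued, one missing.
docs = [{"type": "hat"}, {"type": ["bag", "t-shirt"]}, {"price": 10}]
count = value_count(docs, "type")
```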

Script

Counting the values generated by a script:

POST /sales/_search?size=0
{
    "aggs" : {
        "type_count" : {
            "value_count" : {
                "script" : {
                    "source" : "doc['type'].value"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the painless script language and no script parameters. To use a stored script use the following syntax:

POST /sales/_search?size=0
{
    "aggs" : {
        "types_count" : {
            "value_count" : {
                "script" : {
                    "id": "my_script",
                    "params" : {
                        "field" : "type"
                    }
                }
            }
        }
    }
}

Median Absolute Deviation Aggregation

This single-value aggregation approximates the median absolute deviation of its search results.

Median absolute deviation is a measure of variability. It is a robust statistic, meaning that it is useful for describing data that may have outliers, or may not be normally distributed. For such data it can be more descriptive than standard deviation.

It is calculated as the median of each data point’s deviation from the median of the entire sample. That is, for a random variable X, the median absolute deviation is median(|median(X) - Xi|).
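For reference, the exact (non-approximated) statistic defined above can be computed with the Python standard library. This is the naive in-memory calculation, not the TDigest-based approximation the aggregation uses; the function name and the sample ratings are chosen for this example.

```python
from statistics import median


def median_absolute_deviation(values):
    """Exact MAD: the median of absolute deviations from the sample median."""
    values = list(values)
    center = median(values)
    return median(abs(center - x) for x in values)


# Hypothetical star ratings with a mean of 3.
ratings = [1, 1, 3, 5, 5]
mad = median_absolute_deviation(ratings)
```

For these ratings the median is 3, the absolute deviations are 2, 2, 0, 2 and 2, and their median is 2.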

Example

Assume our data represents product reviews on a one to five star scale. Such reviews are usually summarized as a mean, which is easily understandable but doesn’t describe the reviews' variability. Estimating the median absolute deviation can provide insight into how much reviews vary from one another.

In this example we have a product which has an average rating of 3 stars. Let’s look at its ratings' median absolute deviation to determine how much they vary:

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_average": {
      "avg": {
        "field": "rating"
      }
    },
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating" (1)
      }
    }
  }
}
  1. rating must be a numeric field

The resulting median absolute deviation of 2 tells us that there is a fair amount of variability in the ratings. Reviewers must have diverse opinions about this product.

{
  ...
  "aggregations": {
    "review_average": {
      "value": 3.0
    },
    "review_variability": {
      "value": 2.0
    }
  }
}

Approximation

The naive implementation of calculating median absolute deviation stores the entire sample in memory, so this aggregation instead calculates an approximation. It uses the TDigest data structure to approximate the sample median and the median of deviations from the sample median. For more about the approximation characteristics of TDigests, see Percentiles are (usually) approximate.

The tradeoff between resource usage and accuracy of a TDigest’s quantile approximation, and therefore the accuracy of this aggregation’s approximation of median absolute deviation, is controlled by the compression parameter. A higher compression setting provides a more accurate approximation at the cost of higher memory usage. For more about the characteristics of the TDigest compression parameter see Compression.

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "compression": 100
      }
    }
  }
}

The default compression value for this aggregation is 1000. At this compression level this aggregation is usually within 5% of the exact result, but observed performance will depend on the sample data.

Script

This metric aggregation supports scripting. In our example above, product reviews are on a scale of one to five. If we wanted to modify them to a scale of one to ten, we can use scripting.

To provide an inline script:

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "script": {
          "lang": "painless",
          "source": "doc['rating'].value * params.scaleFactor",
          "params": {
            "scaleFactor": 2
          }
        }
      }
    }
  }
}

To provide a stored script:

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "script": {
          "id": "my_script",
          "params": {
            "field": "rating"
          }
        }
      }
    }
  }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

Let’s be optimistic and assume some reviewers loved the product so much that they forgot to give it a rating. We’ll assign them five stars:

GET reviews/_search
{
  "size": 0,
  "aggs": {
    "review_variability": {
      "median_absolute_deviation": {
        "field": "rating",
        "missing": 5
      }
    }
  }
}

Bucket Aggregations

Bucket aggregations don’t calculate metrics over fields like the metrics aggregations do, but instead, they create buckets of documents. Each bucket is associated with a criterion (depending on the aggregation type) which determines whether or not a document in the current context "falls" into it. In other words, the buckets effectively define document sets. In addition to the buckets themselves, the bucket aggregations also compute and return the number of documents that "fell into" each bucket.

Bucket aggregations, as opposed to metrics aggregations, can hold sub-aggregations. These sub-aggregations will be aggregated for the buckets created by their "parent" bucket aggregation.

There are different bucket aggregators, each with a different "bucketing" strategy. Some define a single bucket, some define a fixed number of multiple buckets, and others dynamically create the buckets during the aggregation process.

Note
The maximum number of buckets allowed in a single response is limited by a dynamic cluster setting named search.max_buckets. It is disabled by default (-1) but requests that try to return more than 10,000 buckets (the default value for future versions) will log a deprecation warning.

Adjacency Matrix Aggregation

A bucket aggregation returning a form of adjacency matrix. The request provides a collection of named filter expressions, similar to the filters aggregation request. Each bucket in the response represents a non-empty cell in the matrix of intersecting filters.

Note
This feature is in beta. The adjacency_matrix aggregation is a new feature and we may evolve its design as we get feedback on its use. As a result, the API for this feature may change in non-backwards compatible ways.

Given filters named A, B and C the response would return buckets with the following names:

      | A   | B   | C
  A   | A   | A&B | A&C
  B   |     | B   | B&C
  C   |     |     | C

The intersecting buckets, e.g. A&C, are labelled using a combination of the two filter names separated by the ampersand character. Note that the response does not also include a "C&A" bucket as this would be the same set of documents as "A&C". The matrix is said to be symmetric so we only return half of it. To do this we sort the filter name strings and always use the lowest of a pair as the value to the left of the "&" separator.

An alternative separator parameter can be passed in the request if clients wish to use a separator string other than the default of the ampersand.

Example:

PUT /emails/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "accounts" : ["hillary", "sidney"]}
{ "index" : { "_id" : 2 } }
{ "accounts" : ["hillary", "donald"]}
{ "index" : { "_id" : 3 } }
{ "accounts" : ["vladimir", "donald"]}

GET emails/_search
{
  "size": 0,
  "aggs" : {
    "interactions" : {
      "adjacency_matrix" : {
        "filters" : {
          "grpA" : { "terms" : { "accounts" : ["hillary", "sidney"] }},
          "grpB" : { "terms" : { "accounts" : ["donald", "mitt"] }},
          "grpC" : { "terms" : { "accounts" : ["vladimir", "nigel"] }}
        }
      }
    }
  }
}

In the above example, we analyse email messages to see which groups of individuals have exchanged messages. We will get counts for each group individually and also a count of messages for pairs of groups that have recorded interactions.

Response:

{
  "took": 9,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
  "aggregations": {
    "interactions": {
      "buckets": [
        {
          "key":"grpA",
          "doc_count": 2
        },
        {
          "key":"grpA&grpB",
          "doc_count": 1
        },
        {
          "key":"grpB",
          "doc_count": 2
        },
        {
          "key":"grpB&grpC",
          "doc_count": 1
        },
        {
          "key":"grpC",
          "doc_count": 1
        }
      ]
    }
  }
}
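The counts in the response above can be reproduced with a small in-memory Python sketch. It models each filter as a set of account names matched against the accounts field, which is a simplification of the terms filters in the request; the function `adjacency_matrix` and its `separator` argument are written for this example only.

```python
from itertools import combinations


def adjacency_matrix(docs, filters, separator="&"):
    """Return (key, doc_count) pairs for every non-empty matrix cell."""
    # Which documents match each named filter.
    matches = {
        name: {i for i, doc in enumerate(docs) if accounts & set(doc["accounts"])}
        for name, accounts in filters.items()
    }
    cells = {name: len(ids) for name, ids in matches.items()}
    # Sorting the names keeps the lower name to the left of the separator.
    for left, right in combinations(sorted(matches), 2):
        cells[left + separator + right] = len(matches[left] & matches[right])
    # Empty cells are not returned.
    return sorted((key, count) for key, count in cells.items() if count > 0)


docs = [
    {"accounts": ["hillary", "sidney"]},
    {"accounts": ["hillary", "donald"]},
    {"accounts": ["vladimir", "donald"]},
]
filters = {
    "grpA": {"hillary", "sidney"},
    "grpB": {"donald", "mitt"},
    "grpC": {"vladimir", "nigel"},
}
buckets = adjacency_matrix(docs, filters)
```

No message involves both grpA and grpC, so the grpA&grpC cell is empty and is left out, as in the response.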

Usage

On its own this aggregation can provide all of the data required to create an undirected weighted graph. However, when used with child aggregations such as a date_histogram the results can provide the additional levels of data required to perform dynamic network analysis where examining interactions over time becomes important.

Limitations

For N filters the matrix of buckets produced can be N²/2 and so there is a default maximum imposed of 100 filters. This setting can be changed using the index.max_adjacency_matrix_filters index-level setting.

Auto-interval Date Histogram Aggregation

A multi-bucket aggregation similar to the Date Histogram Aggregation, except that instead of providing an interval to use as the width of each bucket, you provide a target number of buckets and the interval is chosen automatically to best achieve that target. The number of buckets returned will always be less than or equal to this target number.

The buckets field is optional, and will default to 10 buckets if not specified.

Requesting a target of 10 buckets:

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "auto_date_histogram" : {
                "field" : "date",
                "buckets" : 10
            }
        }
    }
}

Keys

Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch. These timestamps are returned as the bucket keys. The key_as_string is the same timestamp converted to a formatted date string using the format specified with the format parameter:

Tip
If no format is specified, then it will use the first date format specified in the field mapping.

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "auto_date_histogram" : {
                "field" : "date",
                "buckets" : 5,
                "format" : "yyyy-MM-dd" (1)
            }
        }
    }
}
  1. Supports expressive date format pattern

Response:

{
    ...
    "aggregations": {
        "sales_over_time": {
            "buckets": [
                {
                    "key_as_string": "2015-01-01",
                    "key": 1420070400000,
                    "doc_count": 3
                },
                {
                    "key_as_string": "2015-02-01",
                    "key": 1422748800000,
                    "doc_count": 2
                },
                {
                    "key_as_string": "2015-03-01",
                    "key": 1425168000000,
                    "doc_count": 2
                }
            ],
            "interval": "1M"
        }
    }
}

Intervals

The interval of the returned buckets is selected based on the data collected by the aggregation so that the number of buckets returned is less than or equal to the number requested. The possible intervals returned are:

  • seconds - In multiples of 1, 5, 10 and 30

  • minutes - In multiples of 1, 5, 10 and 30

  • hours - In multiples of 1, 3 and 12

  • days - In multiples of 1 and 7

  • months - In multiples of 1 and 3

  • years - In multiples of 1, 5, 10, 20, 50 and 100

In the worst case, where the number of daily buckets is too large for the requested number of buckets, the number of buckets returned will be 1/7th of the number of buckets requested.
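The selection rule can be sketched in Python as "try the listed intervals from finest to coarsest and keep the first one that fits the target". This is a simplified model, not the actual Elasticsearch algorithm: it assumes UTC, counts buckets only from the minimum and maximum timestamps, and aligns fixed-length intervals to the epoch. All names below are hypothetical.

```python
from datetime import datetime, timezone

# (unit, length in seconds or None for calendar units, allowed multiples)
ROUNDINGS = [
    ("s", 1, (1, 5, 10, 30)),
    ("m", 60, (1, 5, 10, 30)),
    ("h", 3600, (1, 3, 12)),
    ("d", 86400, (1, 7)),
    ("M", None, (1, 3)),
    ("y", None, (1, 5, 10, 20, 50, 100)),
]


def bucket_index(epoch_millis, unit, seconds, multiple):
    """Index of the bucket a timestamp falls into for a given interval."""
    if seconds is not None:
        return (epoch_millis // 1000) // (seconds * multiple)
    dt = datetime.fromtimestamp(epoch_millis / 1000, tz=timezone.utc)
    if unit == "M":
        return (dt.year * 12 + dt.month - 1) // multiple
    return dt.year // multiple


def pick_interval(min_millis, max_millis, target):
    """First interval, finest to coarsest, yielding at most `target` buckets."""
    for unit, seconds, multiples in ROUNDINGS:
        for multiple in multiples:
            first = bucket_index(min_millis, unit, seconds, multiple)
            last = bucket_index(max_millis, unit, seconds, multiple)
            if last - first + 1 <= target:
                return f"{multiple}{unit}"
    return "100y"  # coarsest listed interval as a fallback


# Timestamps spanning 2015-01-01 to 2015-03-01 with a target of 5 buckets.
interval = pick_interval(1420070400000, 1425168000000, 5)
```

Under these assumptions a weekly interval would need nine buckets for that range, so the sketch moves on to monthly buckets.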

Time Zone

Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.

Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.

Consider the following example:

PUT my_index/log/1?refresh
{
  "date": "2015-10-01T00:30:00Z"
}

PUT my_index/log/2?refresh
{
  "date": "2015-10-01T01:30:00Z"
}

PUT my_index/log/3?refresh
{
  "date": "2015-10-01T02:30:00Z"
}

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "auto_date_histogram": {
        "field":     "date",
        "buckets" : 3
      }
    }
  }
}

UTC is used if no time zone is specified, so three 1-hour buckets are returned starting at midnight UTC on 1 October 2015:

{
  ...
  "aggregations": {
    "by_day": {
      "buckets": [
        {
          "key_as_string": "2015-10-01T00:00:00.000Z",
          "key": 1443657600000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T01:00:00.000Z",
          "key": 1443661200000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T02:00:00.000Z",
          "key": 1443664800000,
          "doc_count": 1
        }
      ],
      "interval": "1h"
    }
  }
}

If a time_zone of -01:00 is specified, then midnight starts at one hour before midnight UTC:

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "auto_date_histogram": {
        "field":     "date",
        "buckets" : 3,
        "time_zone": "-01:00"
      }
    }
  }
}

Now three 1-hour buckets are still returned but the first bucket starts at 11:00pm on 30 September 2015 since that is the local time for the bucket in the specified time zone.

{
  ...
  "aggregations": {
    "by_day": {
      "buckets": [
        {
          "key_as_string": "2015-09-30T23:00:00.000-01:00", (1)
          "key": 1443657600000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T00:00:00.000-01:00",
          "key": 1443661200000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T01:00:00.000-01:00",
          "key": 1443664800000,
          "doc_count": 1
        }
      ],
      "interval": "1h"
    }
  }
}
  1. The key_as_string value represents the start of each bucket in the specified time zone.
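The relationship between key and key_as_string in the two responses can be checked with a short Python sketch: the key stays the same instant in epoch milliseconds, and only its rendering changes with the offset. The helper `key_as_string` is written for this example and uses Python's ISO formatting, which prints UTC as +00:00 rather than Z.

```python
from datetime import datetime, timedelta, timezone


def key_as_string(key_millis, offset_hours=0):
    """Render an epoch-millisecond bucket key at a fixed UTC offset."""
    tz = timezone(timedelta(hours=offset_hours))
    moment = datetime.fromtimestamp(key_millis / 1000, tz=tz)
    return moment.isoformat(timespec="milliseconds")


# First bucket key from the responses above, rendered at -01:00.
rendered = key_as_string(1443657600000, offset_hours=-1)
```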

Warning
When using time zones that follow DST (daylight saving time) changes, buckets close to the moment when those changes happen can have slightly different sizes than neighbouring buckets. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If the result of the aggregation was daily buckets, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals such as 12h. Here, we will have only an 11h bucket on the morning of 27 March when the DST shift happens.

Scripts

As with the normal date_histogram, both document level scripts and value level scripts are supported. This aggregation does not, however, support the min_doc_count, extended_bounds and order parameters.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /sales/_search?size=0
{
    "aggs" : {
        "sale_date" : {
             "auto_date_histogram" : {
                 "field" : "date",
                 "buckets": 10,
                 "missing": "2000/01/01" (1)
             }
         }
    }
}
  1. Documents without a value in the date field will fall into the same bucket as documents that have the value 2000-01-01.

Children Aggregation

A special single bucket aggregation that selects child documents that have the specified type, as defined in a join field.

This aggregation has a single option:

  • type - The child type that should be selected.

For example, let’s say we have an index of questions and answers. The answer type has the following join field in the mapping:

PUT child_example
{
  "mappings": {
    "_doc": {
      "properties": {
        "join": {
          "type": "join",
          "relations": {
            "question": "answer"
          }
        }
      }
    }
  }
}

The question documents contain a tags field and the answer documents contain an owner field. With the children aggregation the tag buckets can be mapped to the owner buckets in a single request even though the two fields exist in two different kinds of documents.

An example of a question document:

PUT child_example/_doc/1
{
  "join": {
    "name": "question"
  },
  "body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...",
  "title": "Whats the best way to file transfer my site from server to a newer one?",
  "tags": [
    "windows-server-2003",
    "windows-server-2008",
    "file-transfer"
  ]
}

Examples of answer documents:

PUT child_example/_doc/2?routing=1
{
  "join": {
    "name": "answer",
    "parent": "1"
  },
  "owner": {
    "location": "Norfolk, United Kingdom",
    "display_name": "Sam",
    "id": 48
  },
  "body": "<p>Unfortunately you're pretty much limited to FTP...",
  "creation_date": "2009-05-04T13:45:37.030"
}

PUT child_example/_doc/3?routing=1&refresh
{
  "join": {
    "name": "answer",
    "parent": "1"
  },
  "owner": {
    "location": "Norfolk, United Kingdom",
    "display_name": "Troll",
    "id": 49
  },
  "body": "<p>Use Linux...",
  "creation_date": "2009-05-05T13:45:37.030"
}

The following request can be built that connects the two together:

POST child_example/_search?size=0
{
  "aggs": {
    "top-tags": {
      "terms": {
        "field": "tags.keyword",
        "size": 10
      },
      "aggs": {
        "to-answers": {
          "children": {
            "type" : "answer" (1)
          },
          "aggs": {
            "top-names": {
              "terms": {
                "field": "owner.display_name.keyword",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}
  1. The type points to the child relation named answer in the join field.

The above example returns the top question tags and per tag the top answer owners.

Possible response:

{
  "took": 25,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped" : 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "top-tags": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "file-transfer",
          "doc_count": 1, (1)
          "to-answers": {
            "doc_count": 2, (2)
            "top-names": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Sam",
                  "doc_count": 1
                },
                {
                  "key": "Troll",
                  "doc_count": 1
                }
              ]
            }
          }
        },
        {
          "key": "windows-server-2003",
          "doc_count": 1, (1)
          "to-answers": {
            "doc_count": 2, (2)
            "top-names": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Sam",
                  "doc_count": 1
                },
                {
                  "key": "Troll",
                  "doc_count": 1
                }
              ]
            }
          }
        },
        {
          "key": "windows-server-2008",
          "doc_count": 1, (1)
          "to-answers": {
            "doc_count": 2, (2)
            "top-names": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "Sam",
                  "doc_count": 1
                },
                {
                  "key": "Troll",
                  "doc_count": 1
                }
              ]
            }
          }
        }
      ]
    }
  }
}
  1. The number of question documents with the tag file-transfer, windows-server-2003, etc.

  2. The number of answer documents that are related to question documents with the tag file-transfer, windows-server-2003, etc.
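The mapping from tag buckets to owner buckets can be illustrated with an in-memory Python sketch over the three example documents. It only mimics the shape of the result; the function `tags_to_answer_owners` and the `_id` keys added to the documents are assumptions of this example.

```python
from collections import Counter, defaultdict


def tags_to_answer_owners(docs):
    """Per question tag, count the questions, their answers and answer owners."""
    answers_by_parent = defaultdict(list)
    for doc in docs:
        if doc["join"]["name"] == "answer":
            answers_by_parent[doc["join"]["parent"]].append(doc)

    buckets = {}
    for doc in docs:
        if doc["join"]["name"] != "question":
            continue
        for tag in doc["tags"]:
            bucket = buckets.setdefault(
                tag, {"doc_count": 0, "answers": 0, "owners": Counter()}
            )
            bucket["doc_count"] += 1  # question documents with this tag
            for answer in answers_by_parent[doc["_id"]]:
                bucket["answers"] += 1  # child documents of those questions
                bucket["owners"][answer["owner"]["display_name"]] += 1
    return buckets


docs = [
    {"_id": "1", "join": {"name": "question"},
     "tags": ["windows-server-2003", "windows-server-2008", "file-transfer"]},
    {"_id": "2", "join": {"name": "answer", "parent": "1"},
     "owner": {"display_name": "Sam"}},
    {"_id": "3", "join": {"name": "answer", "parent": "1"},
     "owner": {"display_name": "Troll"}},
]
buckets = tags_to_answer_owners(docs)
```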

Composite Aggregation

A multi-bucket aggregation that creates composite buckets from different sources.

Unlike the other multi-bucket aggregations, the composite aggregation can be used to paginate all buckets from a multi-level aggregation efficiently. This aggregation provides a way to stream all buckets of a specific aggregation, similar to what scroll does for documents.

The composite buckets are built from the combinations of the values extracted/created for each document and each combination is considered as a composite bucket.

For instance the following document:

{
    "keyword": ["foo", "bar"],
    "number": [23, 65, 76]
}

... creates the following composite buckets when keyword and number are used as values source for the aggregation:

{ "keyword": "foo", "number": 23 }
{ "keyword": "foo", "number": 65 }
{ "keyword": "foo", "number": 76 }
{ "keyword": "bar", "number": 23 }
{ "keyword": "bar", "number": 65 }
{ "keyword": "bar", "number": 76 }
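The combinations listed above are the Cartesian product of the values of each source. A minimal Python sketch, with a hypothetical helper `composite_buckets`, reproduces them for a single document:

```python
from itertools import product


def composite_buckets(doc, sources):
    """Build one composite bucket per combination of source values."""
    # Treat single values as one-element lists so product() can combine them.
    values = [doc[s] if isinstance(doc[s], list) else [doc[s]] for s in sources]
    return [dict(zip(sources, combo)) for combo in product(*values)]


doc = {"keyword": ["foo", "bar"], "number": [23, 65, 76]}
buckets = composite_buckets(doc, ["keyword", "number"])
```

Two keyword values combined with three number values give the six composite buckets shown.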

Values source

The sources parameter controls the sources that should be used to build the composite buckets. There are three different types of values source:

Terms

The terms value source is equivalent to a simple terms aggregation. The values are extracted from a field or a script exactly like the terms aggregation.

Example:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "product": { "terms" : { "field": "product" } } }
                ]
            }
        }
     }
}

Like the terms aggregation it is also possible to use a script to create the values for the composite buckets:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    {
                        "product": {
                            "terms" : {
                                "script" : {
                                    "source": "doc['product'].value",
                                    "lang": "painless"
                                }
                            }
                        }
                    }
                ]
            }
        }
    }
}

Histogram

The histogram value source can be applied on numeric values to build fixed size intervals over the values. The interval parameter defines how the numeric values should be transformed. For instance, an interval set to 5 will translate any numeric value to its closest interval: a value of 101 would be translated to 100, which is the key for the interval between 100 and 105.

Example:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "histo": { "histogram" : { "field": "price", "interval": 5 } } }
                ]
            }
        }
    }
}
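The translation of a value to its interval key described above amounts to rounding down to a multiple of the interval. A minimal Python sketch, with a hypothetical helper name:

```python
import math


def histogram_key(value, interval):
    """Round `value` down to the start of its fixed-size interval."""
    return math.floor(value / interval) * interval


key = histogram_key(101, 5)
```

Values from 100 up to, but not including, 105 all share the key 100.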

The values are built from a numeric field or a script that returns numerical values:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    {
                        "histo": {
                            "histogram" : {
                                "interval": 5,
                                "script" : {
                                    "source": "doc['price'].value",
                                    "lang": "painless"
                                }
                            }
                        }
                    }
                ]
            }
        }
    }
}

Date Histogram

The date_histogram is similar to the histogram value source except that the interval is specified by a date/time expression:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "date": { "date_histogram" : { "field": "timestamp", "interval": "1d" } } }
                ]
            }
        }
    }
}

The example above creates an interval per day and translates each timestamp value to the start of the interval that contains it. Available expressions for interval: year, quarter, month, week, day, hour, minute, second.

Time values can also be specified via abbreviations supported by time units parsing. Note that fractional time values are not supported, but you can address this by shifting to another time unit (e.g., 1.5h could instead be specified as 90m).

Format

Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch. These timestamps are returned as the bucket keys. It is possible to return a formatted date string instead using the format specified with the format parameter:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    {
                        "date": {
                            "date_histogram" : {
                                "field": "timestamp",
                                "interval": "1d",
                                "format": "yyyy-MM-dd" (1)
                            }
                        }
                    }
                ]
            }
        }
    }
}
  1. Supports expressive date format pattern

Time Zone

Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. The time_zone parameter can be used to indicate that bucketing should use a different time zone.

Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone id, an identifier used in the TZ database like America/Los_Angeles.

Mixing different value sources

The sources parameter accepts an array of value sources. It is possible to mix different value sources to create composite buckets. For example:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } },
                    { "product": { "terms": {"field": "product" } } }
                ]
            }
        }
    }
}

This will create composite buckets from the values created by two value sources, a date_histogram and a terms. Each bucket is composed of two values, one for each value source defined in the aggregation. Any combination of value sources is allowed, and the order in the array is preserved in the composite buckets.

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "shop": { "terms": {"field": "shop" } } },
                    { "product": { "terms": { "field": "product" } } },
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } }
                ]
            }
        }
    }
}

Order

By default the composite buckets are sorted by their natural ordering: values are sorted in ascending order. When multiple value sources are requested, the ordering is done per value source. The first value of the composite bucket is compared to the first value of the other composite bucket, and if they are equal the next values in the composite bucket are used for tie-breaking. This means that the composite bucket [foo, 100] is considered smaller than [foobar, 0] because foo is considered smaller than foobar. It is possible to define the direction of the sort for each value source by setting order to asc (default value) or desc (descending order) directly in the value source definition. For example:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
                    { "product": { "terms": {"field": "product", "order": "asc" } } }
                ]
            }
        }
    }
}

... will sort the composite bucket in descending order when comparing values from the date_histogram source and in ascending order when comparing values from the terms source.
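
The per-source comparison described above behaves much like tuple comparison in Python. The following sketch is only an analogy, not Elasticsearch code: sort_composite_keys is a hypothetical helper, and the sample (date, product) keys are illustrative values.

```python
def sort_composite_keys(keys):
    """Sort (date, product) keys: date descending, then product ascending."""
    # Negating the numeric date reverses its order, while the product
    # string keeps its natural ascending order for tie-breaking.
    return sorted(keys, key=lambda key: (-key[0], key[1]))

# Natural ordering compares value by value, like Python tuples do:
# "foo" sorts before "foobar", so the second value is never consulted.
assert ("foo", 100) < ("foobar", 0)

keys = [
    (1494201600000, "rocky"),
    (1494288000000, "mad max"),
    (1494460800000, "apocalypse now"),
    (1494374400000, "mad max"),
]
ordered = sort_composite_keys(keys)  # newest date first
```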

Missing bucket

By default documents without a value for a given source are ignored. It is possible to include them in the response by setting missing_bucket to true (defaults to false):

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "sources" : [
                    { "product_name": { "terms" : { "field": "product", "missing_bucket": true } } }
                ]
            }
        }
     }
}

In the example above the source product_name will emit an explicit null value for documents without a value for the field product. The order specified in the source dictates whether the null values should rank first (ascending order, asc) or last (descending order, desc).
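
The ranking of the null value can be illustrated with a small Python sketch. It models the null value as None and is not Elasticsearch code; sort_with_missing is a hypothetical name, and the sample product values are illustrative.

```python
def sort_with_missing(values, order="asc"):
    """Sort source values, treating None as the explicit null value.

    Assumption taken from the text above: null ranks first in
    ascending order and last in descending order.
    """
    if order not in ("asc", "desc"):
        raise ValueError("order must be 'asc' or 'desc'")
    # (False, ...) sorts before (True, ...), so None comes first in asc;
    # reversing the whole sort moves it to the end for desc.
    return sorted(
        values,
        key=lambda v: (v is not None, "" if v is None else v),
        reverse=(order == "desc"),
    )

products = ["rocky", None, "mad max"]
ascending = sort_with_missing(products)           # None ranks first
descending = sort_with_missing(products, "desc")  # None ranks last
```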

Size

The size parameter can be set to define how many composite buckets should be returned. Each composite bucket is considered as a single bucket, so setting a size of 10 will return the first 10 composite buckets created from the value sources. The response contains the values for each composite bucket in an array containing the values extracted from each value source.

After

If the number of composite buckets is too high (or unknown) to be returned in a single response, it is possible to split the retrieval into multiple requests. Since the composite buckets are flat by nature, the requested size is exactly the number of composite buckets that will be returned in the response (assuming that there are at least size composite buckets to return). If all composite buckets should be retrieved, it is preferable to use a small size (100 or 1000 for instance) and then use the after parameter to retrieve the next results. For example:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "size": 2,
                "sources" : [
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d" } } },
                    { "product": { "terms": {"field": "product" } } }
                ]
            }
        }
    }
}

... returns:

{
    ...
    "aggregations": {
        "my_buckets": {
            "after_key": { (1)
                "date": 1494288000000,
                "product": "mad max"
            },
            "buckets": [
                {
                    "key": {
                        "date": 1494201600000,
                        "product": "rocky"
                    },
                    "doc_count": 1
                },
                {
                    "key": {
                        "date": 1494288000000,
                        "product": "mad max"
                    },
                    "doc_count": 2
                }
            ]
        }
    }
}
  1. The last composite bucket returned by the query.

Note
The after_key is equal to the last bucket returned in the response before any filtering that could be done by Pipeline aggregations. If all buckets are filtered/removed by a pipeline aggregation, the after_key will contain the last bucket before filtering.

The after parameter can be used to retrieve the composite buckets that come after the last composite bucket returned in a previous round. In the response above, the last bucket can be found in after_key, and the next round of results can be retrieved with:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                "size": 2,
                 "sources" : [
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
                    { "product": { "terms": {"field": "product", "order": "asc" } } }
                ],
                "after": { "date": 1494288000000, "product": "mad max" } (1)
            }
        }
    }
}
  1. Should restrict the aggregation to buckets that sort after the provided values.
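
The request/response cycle above can be wrapped in a loop that follows after_key until no buckets are left. This is a minimal sketch under stated assumptions: search is a hypothetical callable that takes a request body as a dict and returns the parsed JSON response as a dict (plug in your own client), composite_pages is an illustrative name, and the top-level "size": 0 is an optional addition that suppresses search hits.

```python
def composite_pages(search, sources, size=1000, agg_name="my_buckets"):
    """Yield every composite bucket, fetching `size` buckets per request.

    `search` is a hypothetical callable: request body (dict) in,
    parsed JSON response (dict) out.
    """
    after = None
    while True:
        composite = {"size": size, "sources": sources}
        if after is not None:
            composite["after"] = after
        body = {"size": 0, "aggs": {agg_name: {"composite": composite}}}
        result = search(body)["aggregations"][agg_name]
        buckets = result["buckets"]
        if not buckets:
            break  # an empty page means every bucket has been returned
        yield from buckets
        # Prefer after_key; fall back to the key of the last bucket.
        after = result.get("after_key", buckets[-1]["key"])
```

The sources array, including any order settings, has to stay identical between rounds so that after refers to the same ordering of buckets.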

Sub-aggregations

Like any multi-bucket aggregation, the composite aggregation can hold sub-aggregations. These sub-aggregations can be used to compute other buckets or statistics on each composite bucket created by this parent aggregation. For instance, the following example computes the average value of a field per composite bucket:

GET /_search
{
    "aggs" : {
        "my_buckets": {
            "composite" : {
                 "sources" : [
                    { "date": { "date_histogram": { "field": "timestamp", "interval": "1d", "order": "desc" } } },
                    { "product": { "terms": {"field": "product" } } }
                ]
            },
            "aggregations": {
                "the_avg": {
                    "avg": { "field": "price" }
                }
            }
        }
    }
}

... returns:

{
    ...
    "aggregations": {
        "my_buckets": {
            "after_key": {
                "date": 1494201600000,
                "product": "rocky"
            },
            "buckets": [
                {
                    "key": {
                        "date": 1494460800000,
                        "product": "apocalypse now"
                    },
                    "doc_count": 1,
                    "the_avg": {
                        "value": 10.0
                    }
                },
                {
                    "key": {
                        "date": 1494374400000,
                        "product": "mad max"
                    },
                    "doc_count": 1,
                    "the_avg": {
                        "value": 27.0
                    }
                },
                {
                    "key": {
                        "date": 1494288000000,
                        "product" : "mad max"
                    },
                    "doc_count": 2,
                    "the_avg": {
                        "value": 22.5
                    }
                },
                {
                    "key": {
                        "date": 1494201600000,
                        "product": "rocky"
                    },
                    "doc_count": 1,
                    "the_avg": {
                        "value": 10.0
                    }
                }
            ]
        }
    }
}

Date Histogram Aggregation

This multi-bucket aggregation is similar to the normal histogram, but it can only be used with date values. Because dates are represented internally in Elasticsearch as long values, it is possible, but not as accurate, to use the normal histogram on dates as well. The main difference in the two APIs is that here the interval can be specified using date/time expressions. Time-based data requires special support because time-based intervals are not always a fixed length.

Setting intervals

There seems to be no limit to the creativity we humans apply to setting our clocks and calendars. We’ve invented leap years and leap seconds, standard and daylight savings times, and timezone offsets of 30 or 45 minutes rather than a full hour. While these creations help keep us in sync with the cosmos and our environment, they can make specifying time intervals accurately a real challenge. The only universal truth our researchers have yet to disprove is that a millisecond is always the same duration, and a second is always 1000 milliseconds. Beyond that, things get complicated.

Generally speaking, when you specify a single time unit, such as 1 hour or 1 day, you are working with a calendar interval, but multiples, such as 6 hours or 3 days, are fixed-length intervals.

For example, a specification of 1 day (1d) from now is a calendar interval that means "at this exact time tomorrow" no matter the length of the day. A change to or from daylight savings time that results in a 23 or 25 hour day is compensated for and the specification of "this exact time tomorrow" is maintained. But if you specify 2 or more days, each day must be of the same fixed duration (24 hours). In this case, if the specified interval includes the change to or from daylight savings time, the interval will end an hour sooner or later than you expect.

There are similar differences to consider when you specify single versus multiple minutes or hours. Multiple time periods longer than a day are not supported.

Here are the valid time specifications and their meanings:

milliseconds (ms)

Fixed length interval; supports multiples.

seconds (s)

1000 milliseconds; fixed length interval (except for the last second of a minute that contains a leap-second, which is 2000ms long); supports multiples.

minutes (m)

All minutes begin at 00 seconds.

  • One minute (1m) is the interval between 00 seconds of the first minute and 00 seconds of the following minute in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.

  • Multiple minutes (nm) are intervals of exactly 60x1000=60,000 milliseconds each.

hours (h)

All hours begin at 00 minutes and 00 seconds.

  • One hour (1h) is the interval between 00:00 minutes of the first hour and 00:00 minutes of the following hour in the specified timezone, compensating for any intervening leap seconds, so that the number of minutes and seconds past the hour is the same at the start and end.

  • Multiple hours (nh) are intervals of exactly 60x60x1000=3,600,000 milliseconds each.

days (d)

All days begin at the earliest possible time, which is usually 00:00:00 (midnight).

  • One day (1d) is the interval between the start of the day and the start of the following day in the specified timezone, compensating for any intervening time changes.

  • Multiple days (nd) are intervals of exactly 24x60x60x1000=86,400,000 milliseconds each.

weeks (w)
  • One week (1w) is the interval between the start day_of_week:hour:minute:second and the same day of the week and time of the following week in the specified timezone.

  • Multiple weeks (nw) are not supported.

months (M)
  • One month (1M) is the interval between the start day of the month and time of day and the same day of the month and time of the following month in the specified timezone, so that the day of the month and time of day are the same at the start and end.

  • Multiple months (nM) are not supported.

quarters (q)
  • One quarter (1q) is the interval between the start day of the month and time of day and the same day of the month and time of day three months later, so that the day of the month and time of day are the same at the start and end.

  • Multiple quarters (nq) are not supported.

years (y)
  • One year (1y) is the interval between the start day of the month and time of day and the same day of the month and time of day the following year in the specified timezone, so that the date and time are the same at the start and end.

  • Multiple years (ny) are not supported.

NOTE: In all cases, when the specified end time does not exist, the actual end time is the closest available time after the specified end.
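
The rules in the list above can be condensed into a small classifier. This is an illustrative sketch, not the parser Elasticsearch uses: classify_interval is a hypothetical name, and it only handles the abbreviated forms (for example 1d or 90m), not spelled-out units such as day or month.

```python
import re

# Milliseconds per unit for the units that support fixed-length multiples.
FIXED_MS = {"ms": 1, "s": 1000, "m": 60_000, "h": 3_600_000, "d": 86_400_000}
# Units that are only valid as a single calendar interval.
CALENDAR_ONLY = {"w", "M", "q", "y"}

def classify_interval(spec):
    """Classify an interval such as '1d' or '90m'.

    Returns ('calendar', unit) or ('fixed', milliseconds). Raises
    ValueError for fractional values and unsupported multiples.
    """
    match = re.fullmatch(r"(\d+)(ms|s|m|h|d|w|M|q|y)", spec)
    if match is None:
        raise ValueError(f"unsupported interval: {spec!r}")
    count, unit = int(match.group(1)), match.group(2)
    if count < 1:
        raise ValueError("interval must be at least 1")
    if unit in CALENDAR_ONLY:
        if count != 1:
            raise ValueError(f"multiples of {unit!r} are not supported")
        return ("calendar", unit)
    if count == 1 and unit in ("m", "h", "d"):
        return ("calendar", unit)  # a single minute, hour or day
    return ("fixed", count * FIXED_MS[unit])
```

For example, classify_interval("1d") yields a calendar interval, classify_interval("3d") yields a fixed interval of 259,200,000 milliseconds, and classify_interval("1.5h") raises an error because fractional values are not accepted.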

Widely distributed applications must also consider vagaries such as countries that start and stop daylight savings time at 12:01 A.M., so end up with one minute of Sunday followed by an additional 59 minutes of Saturday once a year, and countries that decide to move across the international date line. Situations like that can make irregular timezone offsets seem easy.

As always, rigorous testing, especially around time-change events, will ensure that your time interval specification is what you intend it to be.

WARNING: To avoid unexpected results, all connected servers and clients must sync to a reliable network time service.

Examples

Requesting bucket intervals of a month.

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            }
        }
    }
}

You can also specify time values using abbreviations supported by time units parsing. Note that fractional time values are not supported, but you can address this by shifting to another time unit (e.g., 1.5h could instead be specified as 90m).

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "90m"
            }
        }
    }
}
Keys

Internally, a date is represented as a 64 bit number representing a timestamp in milliseconds-since-the-epoch (01/01/1970 midnight UTC). These timestamps are returned as the key name of the bucket. The key_as_string is the same timestamp converted to a formatted date string using the format parameter specification:

Tip
If you don’t specify format, the first date format specified in the field mapping is used.
POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "1M",
                "format" : "yyyy-MM-dd" (1)
            }
        }
    }
}
  1. Supports expressive date format pattern

Response:

{
    ...
    "aggregations": {
        "sales_over_time": {
            "buckets": [
                {
                    "key_as_string": "2015-01-01",
                    "key": 1420070400000,
                    "doc_count": 3
                },
                {
                    "key_as_string": "2015-02-01",
                    "key": 1422748800000,
                    "doc_count": 2
                },
                {
                    "key_as_string": "2015-03-01",
                    "key": 1425168000000,
                    "doc_count": 2
                }
            ]
        }
    }
}
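
The relationship between key and key_as_string can be reproduced outside Elasticsearch. The sketch below is illustrative only: the function name mirrors the response field for readability, and the strftime pattern %Y-%m-%d is assumed to be the Python counterpart of the yyyy-MM-dd format used in the request.

```python
from datetime import datetime, timezone

def key_as_string(key_millis, pattern="%Y-%m-%d"):
    """Format an epoch-millisecond bucket key as a UTC date string."""
    moment = datetime.fromtimestamp(key_millis / 1000, tz=timezone.utc)
    return moment.strftime(pattern)

# The three bucket keys from the response above.
formatted = [
    key_as_string(k)
    for k in (1420070400000, 1422748800000, 1425168000000)
]
# formatted == ['2015-01-01', '2015-02-01', '2015-03-01']
```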
Timezone

Date-times are stored in Elasticsearch in UTC. By default, all bucketing and rounding is also done in UTC. Use the time_zone parameter to indicate that bucketing should use a different timezone.

You can specify timezones as either an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as a timezone ID as specified in the IANA timezone database, such as America/Los_Angeles.

Consider the following example:

PUT my_index/_doc/1?refresh
{
  "date": "2015-10-01T00:30:00Z"
}

PUT my_index/_doc/2?refresh
{
  "date": "2015-10-01T01:30:00Z"
}

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "interval":  "day"
      }
    }
  }
}

If you don’t specify a timezone, UTC is used. This would result in both of these documents being placed into the same day bucket, which starts at midnight UTC on 1 October 2015:

{
  ...
  "aggregations": {
    "by_day": {
      "buckets": [
        {
          "key_as_string": "2015-10-01T00:00:00.000Z",
          "key":           1443657600000,
          "doc_count":     2
        }
      ]
    }
  }
}

If you specify a time_zone of -01:00, midnight in that timezone is one hour before midnight UTC:

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "interval":  "day",
        "time_zone": "-01:00"
      }
    }
  }
}

Now the first document falls into the bucket for 30 September 2015, while the second document falls into the bucket for 1 October 2015:

{
  ...
  "aggregations": {
    "by_day": {
      "buckets": [
        {
          "key_as_string": "2015-09-30T00:00:00.000-01:00", (1)
          "key": 1443574800000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T00:00:00.000-01:00", (1)
          "key": 1443661200000,
          "doc_count": 1
        }
      ]
    }
  }
}
  1. The key_as_string value represents midnight on each day in the specified timezone.
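
The bucket keys in the two responses above can be reproduced with simple arithmetic. This is a sketch under a simplifying assumption, not the Elasticsearch implementation: the time zone is a fixed UTC offset, so every local day is exactly 24 hours long. The name day_bucket_key is hypothetical.

```python
DAY_MS = 24 * 60 * 60 * 1000
HOUR_MS = 60 * 60 * 1000

def day_bucket_key(timestamp_ms, utc_offset_ms=0):
    """Return the key (epoch millis) of the day bucket for a timestamp.

    Assumption: the time zone is a fixed UTC offset. Zones that
    observe DST need a real time zone library instead.
    """
    local = timestamp_ms + utc_offset_ms      # shift into local time
    local_midnight = local - local % DAY_MS   # round down to local midnight
    return local_midnight - utc_offset_ms     # shift back to UTC

doc1 = 1443659400000  # 2015-10-01T00:30:00Z
doc2 = 1443663000000  # 2015-10-01T01:30:00Z

utc_keys = [day_bucket_key(doc1), day_bucket_key(doc2)]
shifted_keys = [day_bucket_key(doc1, -HOUR_MS), day_bucket_key(doc2, -HOUR_MS)]
```

With no offset both documents map to 1443657600000; with an offset of -01:00 they map to 1443574800000 and 1443661200000, matching the keys in the responses above.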

Warning
When using time zones that follow DST (daylight savings time) changes, buckets close to the moment when those changes happen can have slightly different sizes than you would expect from the used interval. For example, consider a DST start in the CET time zone: on 27 March 2016 at 2am, clocks were turned forward 1 hour to 3am local time. If you use day as interval, the bucket covering that day will only hold data for 23 hours instead of the usual 24 hours for other buckets. The same is true for shorter intervals, like 12h, where you’ll have only an 11h bucket on the morning of 27 March when the DST shift happens.
Offset

Use the offset parameter to change the start value of each bucket by the specified positive (+) or negative (-) offset duration, such as 1h for an hour, or 1d for a day. See time units for more possible time duration options.

For example, when using an interval of day, each bucket runs from midnight to midnight. Setting the offset parameter to +6h changes each bucket to run from 6am to 6am:

PUT my_index/_doc/1?refresh
{
  "date": "2015-10-01T05:30:00Z"
}

PUT my_index/_doc/2?refresh
{
  "date": "2015-10-01T06:30:00Z"
}

GET my_index/_search?size=0
{
  "aggs": {
    "by_day": {
      "date_histogram": {
        "field":     "date",
        "interval":  "day",
        "offset":    "+6h"
      }
    }
  }
}

Instead of a single bucket starting at midnight, the above request groups the documents into buckets starting at 6am:

{
  ...
  "aggregations": {
    "by_day": {
      "buckets": [
        {
          "key_as_string": "2015-09-30T06:00:00.000Z",
          "key": 1443592800000,
          "doc_count": 1
        },
        {
          "key_as_string": "2015-10-01T06:00:00.000Z",
          "key": 1443679200000,
          "doc_count": 1
        }
      ]
    }
  }
}
Note
The start offset of each bucket is calculated after time_zone adjustments have been made.
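
The effect of offset on bucket boundaries can be sketched as follows. The sketch assumes fixed-length intervals in UTC and is not Elasticsearch's implementation; offset_bucket_key is a hypothetical name.

```python
DAY_MS = 86_400_000
SIX_HOURS_MS = 6 * 3_600_000

def offset_bucket_key(timestamp_ms, interval_ms, offset_ms=0):
    """Return the start (epoch millis) of the bucket containing a timestamp.

    Bucket boundaries sit at offset_ms plus whole multiples of
    interval_ms (fixed-length intervals in UTC are assumed).
    """
    shifted = timestamp_ms - offset_ms
    return shifted - shifted % interval_ms + offset_ms

doc1 = 1443677400000  # 2015-10-01T05:30:00Z
doc2 = 1443681000000  # 2015-10-01T06:30:00Z

keys = [offset_bucket_key(d, DAY_MS, SIX_HOURS_MS) for d in (doc1, doc2)]
```

The two documents land in the buckets keyed 1443592800000 and 1443679200000, the same keys as in the response above, whereas without an offset both would share the bucket that starts at midnight.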
Keyed Response

Setting the keyed flag to true associates a unique string key with each bucket and returns the buckets as a hash rather than an array:

POST /sales/_search?size=0
{
    "aggs" : {
        "sales_over_time" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "1M",
                "format" : "yyyy-MM-dd",
                "keyed": true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "sales_over_time": {
            "buckets": {
                "2015-01-01": {
                    "key_as_string": "2015-01-01",
                    "key": 1420070400000,
                    "doc_count": 3
                },
                "2015-02-01": {
                    "key_as_string": "2015-02-01",
                    "key": 1422748800000,
                    "doc_count": 2
                },
                "2015-03-01": {
                    "key_as_string": "2015-03-01",
                    "key": 1425168000000,
                    "doc_count": 2
                }
            }
        }
    }
}
Scripts

As with the normal histogram, both document-level scripts and value-level scripts are supported. You can control the order of the returned buckets using the order settings and filter the returned buckets based on a min_doc_count setting (by default all buckets between the first bucket that matches documents and the last one are returned). This histogram also supports the extended_bounds setting, which enables extending the bounds of the histogram beyond the data itself. For more information, see Extended Bounds.

Missing value

The missing parameter defines how to treat documents that are missing a value. By default, they are ignored, but it is also possible to treat them as if they have a value.

POST /sales/_search?size=0
{
    "aggs" : {
        "sale_date" : {
             "date_histogram" : {
                 "field" : "date",
                 "interval": "year",
                 "missing": "2000/01/01" (1)
             }
         }
    }
}
  1. Documents without a value in the date field will fall into the same bucket as documents that have the value 2000-01-01.

Order

By default the returned buckets are sorted by their key ascending, but you can control the order using the order setting. This setting supports the same order functionality as Terms Aggregation.

Deprecated in 6.0.0. Use _key instead of _time to order buckets by their dates/keys.

Using a script to aggregate by day of the week

When you need to aggregate the results by day of the week, use a script that returns the day of the week:

POST /sales/_search?size=0
{
    "aggs": {
        "dayOfWeek": {
            "terms": {
                "script": {
                    "lang": "painless",
                    "source": "doc['date'].value.dayOfWeekEnum.value"
                }
            }
        }
    }
}

Response:

{
  ...
  "aggregations": {
    "dayOfWeek": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "7",
          "doc_count": 4
        },
        {
          "key": "4",
          "doc_count": 3
        }
      ]
    }
  }
}

The response will contain all the buckets having the relative day of the week as key: 1 for Monday, 2 for Tuesday ... 7 for Sunday.
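
The same numbering (1 for Monday through 7 for Sunday) is what Python's isoweekday() returns, so the grouping can be sketched client-side. This is an illustration of the numbering only, evaluated in UTC; day_of_week_counts is a hypothetical helper and the sample timestamps are illustrative.

```python
from collections import Counter
from datetime import datetime, timezone

def day_of_week_counts(timestamps_ms):
    """Count timestamps per ISO day of week (1 = Monday ... 7 = Sunday)."""
    counts = Counter()
    for ts in timestamps_ms:
        day = datetime.fromtimestamp(ts / 1000, tz=timezone.utc).isoweekday()
        counts[str(day)] += 1  # string keys, as in the response above
    return counts

# 2015-01-01 was a Thursday; 2015-02-01 and 2015-03-01 were Sundays.
counts = day_of_week_counts([1420070400000, 1422748800000, 1425168000000])
```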

Date Range Aggregation

A range aggregation that is dedicated for date values. The main difference between this aggregation and the normal range aggregation is that the from and to values can be expressed in Date Math expressions, and it is also possible to specify a date format by which the from and to response fields will be returned. Note that this aggregation includes the from value and excludes the to value for each range.

Example:

POST /sales/_search?size=0
{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyyy",
                "ranges": [
                    { "to": "now-10M/M" }, (1)
                    { "from": "now-10M/M" } (2)
                ]
            }
        }
    }
}
  1. < now minus 10 months, rounded down to the start of the month.

  2. >= now minus 10 months, rounded down to the start of the month.

In the example above, we created two range buckets: the first will "bucket" all documents dated prior to 10 months ago, and the second will "bucket" all documents dated since 10 months ago.

Response:

{
    ...
    "aggregations": {
        "range": {
            "buckets": [
                {
                    "to": 1.4436576E12,
                    "to_as_string": "10-2015",
                    "doc_count": 7,
                    "key": "*-10-2015"
                },
                {
                    "from": 1.4436576E12,
                    "from_as_string": "10-2015",
                    "doc_count": 0,
                    "key": "10-2015-*"
                }
            ]
        }
    }
}
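
The date math expression now-10M/M and the inclusive from / exclusive to rule can be sketched in Python. This is an illustrative sketch, not the date math parser: months_ago_floor and range_key are hypothetical names, the bucket labels are made up, and the value chosen for now is an assumption picked so that the boundary matches the 10-2015 shown in the response above.

```python
from datetime import datetime, timezone

def months_ago_floor(now, months):
    """Evaluate date math like now-10M/M: go back `months` months,
    then round down to the start of that month (UTC)."""
    total = now.year * 12 + (now.month - 1) - months
    return datetime(total // 12, total % 12 + 1, 1, tzinfo=timezone.utc)

def range_key(value, boundary):
    """Pick the bucket for a date: 'to' is exclusive, 'from' is inclusive."""
    return "before" if value < boundary else "since"

now = datetime(2016, 8, 15, 12, 0, tzinfo=timezone.utc)  # illustrative "now"
boundary = months_ago_floor(now, 10)  # 2015-10-01T00:00:00Z
```

A document dated exactly at the boundary falls into the second range, because from is inclusive and to is exclusive.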

Missing Values

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value. This is done by adding a set of fieldname : value mappings to specify default values per field.

POST /sales/_search?size=0
{
   "aggs": {
       "range": {
           "date_range": {
               "field": "date",
               "missing": "1976/11/30",
               "ranges": [
                  {
                    "key": "Older",
                    "to": "2016/02/01"
                  }, (1)
                  {
                    "key": "Newer",
                    "from": "2016/02/01",
                    "to" : "now/d"
                  }
              ]
          }
      }
   }
}
  1. Documents without a value in the date field will be added to the "Older" bucket, as if they had a date value of "1976-11-30".

Date Format/Pattern

Note
This information was copied from JodaDate.

All ASCII letters are reserved as format pattern letters, which are defined as follows:

Symbol  Meaning                      Presentation  Examples
G       era                          text          AD
C       century of era (>=0)         number        20
Y       year of era (>=0)            year          1996
x       weekyear                     year          1996
w       week of weekyear             number        27
e       day of week                  number        2
E       day of week                  text          Tuesday; Tue
y       year                         year          1996
D       day of year                  number        189
M       month of year                month         July; Jul; 07
d       day of month                 number        10
a       halfday of day               text          PM
K       hour of halfday (0~11)       number        0
h       clockhour of halfday (1~12)  number        12
H       hour of day (0~23)           number        0
k       clockhour of day (1~24)      number        24
m       minute of hour               number        30
s       second of minute             number        55
S       fraction of second           number        978
z       time zone                    text          Pacific Standard Time; PST
Z       time zone offset/id          zone          -0800; -08:00; America/Los_Angeles
'       escape for text              delimiter     ''

The count of pattern letters determines the format.

Text

If the number of pattern letters is 4 or more, the full form is used; otherwise a short or abbreviated form is used if available.

Number

The minimum number of digits. Shorter numbers are zero-padded to this amount.

Year

Numeric presentation for year and weekyear fields are handled specially. For example, if the count of 'y' is 2, the year will be displayed as the zero-based year of the century, which is two digits.

Month

3 or over, use text, otherwise use number.

Zone

'Z' outputs offset without a colon, 'ZZ' outputs the offset with a colon, 'ZZZ' or more outputs the zone id.

Zone names

Time zone names ('z') cannot be parsed.

Any characters in the pattern that are not in the ranges of ['a'..'z'] and ['A'..'Z'] will be treated as quoted text. For instance, characters like ':', '.', ' ', '#' and '?' will appear in the resulting time text even if they are not enclosed in single quotes.

Time zone in date range aggregations

Dates can be converted from another time zone to UTC by specifying the time_zone parameter.

Time zones may either be specified as an ISO 8601 UTC offset (e.g. +01:00 or -08:00) or as one of the time zone ids (http://www.joda.org/joda-time/timezones.html) from the TZ database.

The time_zone parameter is also applied to rounding in date math expressions. As an example, to round to the beginning of the day in the CET time zone, you can do the following:

POST /sales/_search?size=0
{
   "aggs": {
       "range": {
           "date_range": {
               "field": "date",
               "time_zone": "CET",
               "ranges": [
                  { "to": "2016/02/01" }, (1)
                  { "from": "2016/02/01", "to" : "now/d" }, (2)
                  { "from": "now/d" }
              ]
          }
      }
   }
}
  1. This date will be converted to 2016-02-01T00:00:00.000+01:00.

  2. now/d will be rounded to the beginning of the day in the CET time zone.

Keyed Response

Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:

POST /sales/_search?size=0
{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyyy",
                "ranges": [
                    { "to": "now-10M/M" },
                    { "from": "now-10M/M" }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "range": {
            "buckets": {
                "*-10-2015": {
                    "to": 1.4436576E12,
                    "to_as_string": "10-2015",
                    "doc_count": 7
                },
                "10-2015-*": {
                    "from": 1.4436576E12,
                    "from_as_string": "10-2015",
                    "doc_count": 0
                }
            }
        }
    }
}

It is also possible to customize the key for each range:

POST /sales/_search?size=0
{
    "aggs": {
        "range": {
            "date_range": {
                "field": "date",
                "format": "MM-yyyy",
                "ranges": [
                    { "from": "01-2015",  "to": "03-2015", "key": "quarter_01" },
                    { "from": "03-2015", "to": "06-2015", "key": "quarter_02" }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "range": {
            "buckets": {
                "quarter_01": {
                    "from": 1.4200704E12,
                    "from_as_string": "01-2015",
                    "to": 1.425168E12,
                    "to_as_string": "03-2015",
                    "doc_count": 5
                },
                "quarter_02": {
                    "from": 1.425168E12,
                    "from_as_string": "03-2015",
                    "to": 1.4331168E12,
                    "to_as_string": "06-2015",
                    "doc_count": 2
                }
            }
        }
    }
}
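A keyed response is convenient to consume on the client side because each bucket can be looked up by name. The sketch below is only an illustration: it hard-codes a trimmed copy of the response above as a Python dict (the "..." part is left out) and uses a made-up helper, millis_to_month, to turn the epoch-millisecond bounds back into MM-yyyy strings.

```python
from datetime import datetime, timezone

# Trimmed copy of the keyed response shown above (only the fields used here).
response = {
    "aggregations": {
        "range": {
            "buckets": {
                "quarter_01": {"from": 1.4200704e12, "to": 1.425168e12, "doc_count": 5},
                "quarter_02": {"from": 1.425168e12, "to": 1.4331168e12, "doc_count": 2},
            }
        }
    }
}


def millis_to_month(millis):
    """Format epoch milliseconds (interpreted in UTC) as MM-yyyy."""
    return datetime.fromtimestamp(millis / 1000, tz=timezone.utc).strftime("%m-%Y")


buckets = response["aggregations"]["range"]["buckets"]
summary = {
    key: (millis_to_month(b["from"]), millis_to_month(b["to"]), b["doc_count"])
    for key, b in buckets.items()
}

for key, (start, end, count) in summary.items():
    print(key, start, end, count)
```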

Diversified Sampler Aggregation

Like the sampler aggregation this is a filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents. The diversified_sampler aggregation adds the ability to limit the number of matches that share a common value such as an "author".

Note
Any good market researcher will tell you that when working with samples of data it is important that the sample represents a healthy variety of opinions rather than being skewed by any single voice. The same is true with aggregations and sampling with these diversify settings can offer a way to remove the bias in your content (an over-populated geography, a large spike in a timeline or an over-active forum spammer).
Example use cases:
  • Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches

  • Removing bias from analytics by ensuring fair representation of content from different sources

  • Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms

A choice of field or script setting is used to provide values used for de-duplication and the max_docs_per_value setting controls the maximum number of documents collected on any one shard which share a common value. The default setting for max_docs_per_value is 1.

The aggregation will throw an error if the choice of field or script produces multiple values for a single document (de-duplication using multi-valued fields is not supported due to efficiency concerns).
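The effect of shard_size and max_docs_per_value on a single shard can be pictured with a small Python sketch. This is a conceptual model only, not how Elasticsearch implements the aggregation; the function name diversified_sample and the sample hits are hypothetical.

```python
from collections import Counter


def diversified_sample(hits, shard_size=100, max_docs_per_value=1):
    """Keep up to shard_size top-scoring hits, allowing at most
    max_docs_per_value hits per de-duplication value.

    hits: iterable of (score, value) pairs for a single shard.
    """
    seen = Counter()
    sample = []
    # Visit hits from highest to lowest relevance score.
    for score, value in sorted(hits, key=lambda h: h[0], reverse=True):
        if len(sample) >= shard_size:
            break
        if seen[value] < max_docs_per_value:
            seen[value] += 1
            sample.append((score, value))
    return sample


# Hypothetical hits: (relevance score, author).
hits = [(9.1, "alice"), (8.7, "alice"), (8.2, "bob"), (7.9, "alice"), (7.5, "carol")]

one_per_author = diversified_sample(hits, shard_size=200)
two_per_author = diversified_sample(hits, shard_size=200, max_docs_per_value=2)
print(one_per_author)
print(two_per_author)
```

With the default of one document per value, the prolific author contributes a single hit to the sample, which is the bias-removal effect described above.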

Example:

We might want to see which tags are strongly associated with #elasticsearch on StackOverflow forum posts but ignoring the effects of some prolific users with a tendency to misspell #Kibana as #Cabana.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:elasticsearch"
        }
    },
    "aggs": {
        "my_unbiased_sample": {
            "diversified_sampler": {
                "shard_size": 200,
                "field" : "author"
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["elasticsearch"]
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "my_unbiased_sample": {
            "doc_count": 151,(1)
            "keywords": {(2)
                "doc_count": 151,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "kibana",
                        "doc_count": 150,
                        "score": 2.213,
                        "bg_count": 200
                    }
                ]
            }
        }
    }
}
  1. 151 documents were sampled in total.

  2. The results of the significant_terms aggregation are not skewed by any single author’s quirks because we asked for a maximum of one post from any one author in our sample.

Scripted example:

In this scenario we might want to diversify on a combination of field values. We can use a script to produce a hash of the multiple values in a tags field to ensure we don’t have a sample that consists of the same repeated combinations of tags.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana"
        }
    },
    "aggs": {
        "my_unbiased_sample": {
            "diversified_sampler": {
                "shard_size": 200,
                "max_docs_per_value" : 3,
                "script" : {
                    "lang": "painless",
                    "source": "doc['tags'].hashCode()"
                }
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana"]
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "my_unbiased_sample": {
            "doc_count": 6,
            "keywords": {
                "doc_count": 6,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "logstash",
                        "doc_count": 3,
                        "score": 2.213,
                        "bg_count": 50
                    },
                    {
                        "key": "elasticsearch",
                        "doc_count": 3,
                        "score": 1.34,
                        "bg_count": 200
                    }
                ]
            }
        }
    }
}

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.

max_docs_per_value

The max_docs_per_value is an optional parameter and limits how many documents are permitted per choice of de-duplicating value. The default setting is "1".

execution_hint

The optional execution_hint setting can influence the management of the values used for de-duplication. Each option will hold up to shard_size values in memory while performing de-duplication but the type of value held can be controlled as follows:

  • hold field values directly (map)

  • hold ordinals of the field as determined by the Lucene index (global_ordinals)

  • hold hashes of the field values - with potential for hash collisions (bytes_hash)

The default setting is to use global_ordinals if this information is available from the Lucene index and reverting to map if not. The bytes_hash setting may prove faster in some cases but introduces the possibility of false positives in de-duplication logic due to the possibility of hash collisions. Please note that Elasticsearch will ignore the choice of execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.

Limitations

Cannot be nested under breadth_first aggregations

Being a quality-based filter the diversified_sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.

Limited de-dup logic

The de-duplication logic applies only at a shard level so will not apply across shards.

No specialized syntax for geo/date fields

Currently the syntax for defining the diversifying values is defined by a choice of field or script - there is no added syntactical sugar for expressing geo or date units such as "7d" (7 days). This support may be added in a later release and users will currently have to create these sorts of values using a script.

Filter Aggregation

Defines a single bucket of all the documents in the current document set context that match a specified filter. Often this will be used to narrow down the current aggregation context to a specific set of documents.

Example:

POST /sales/_search?size=0
{
    "aggs" : {
        "t_shirts" : {
            "filter" : { "term": { "type": "t-shirt" } },
            "aggs" : {
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        }
    }
}

In the above example, we calculate the average price of all the products that are of type t-shirt.

Response:

{
    ...
    "aggregations" : {
        "t_shirts" : {
            "doc_count" : 3,
            "avg_price" : { "value" : 128.33333333333334 }
        }
    }
}

Filters Aggregation

Defines a multi bucket aggregation where each bucket is associated with a filter. Each bucket will collect all documents that match its associated filter.

Example:

PUT /logs/_doc/_bulk?refresh
{ "index" : { "_id" : 1 } }
{ "body" : "warning: page could not be rendered" }
{ "index" : { "_id" : 2 } }
{ "body" : "authentication error" }
{ "index" : { "_id" : 3 } }
{ "body" : "warning: connection timed out" }

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}

In the above example, we analyze log messages. The aggregation will build two collections (buckets) of log messages: one for all those containing an error, and another for all those containing a warning.

Response:

{
  "took": 9,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
  "aggregations": {
    "messages": {
      "buckets": {
        "errors": {
          "doc_count": 1
        },
        "warnings": {
          "doc_count": 2
        }
      }
    }
  }
}

Anonymous filters

The filters field can also be provided as an array of filters, as in the following request:

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "filters" : [
          { "match" : { "body" : "error"   }},
          { "match" : { "body" : "warning" }}
        ]
      }
    }
  }
}

The filtered buckets are returned in the same order as provided in the request. The response for this example would be:

{
  "took": 4,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
  "aggregations": {
    "messages": {
      "buckets": [
        {
          "doc_count": 1
        },
        {
          "doc_count": 2
        }
      ]
    }
  }
}

Other Bucket

The other_bucket parameter can be set to add a bucket to the response which will contain all documents that do not match any of the given filters. The value of this parameter can be as follows:

false

Does not compute the other bucket

true

Returns the other bucket either in a bucket (named other by default) if named filters are being used, or as the last bucket if anonymous filters are being used

The other_bucket_key parameter can be used to set the key for the other bucket to a value other than the default other. Setting this parameter will implicitly set the other_bucket parameter to true.

The following snippet shows a response where the other bucket is requested to be named other_messages.

PUT logs/_doc/4?refresh
{
  "body": "info: user Bob logged out"
}

GET logs/_search
{
  "size": 0,
  "aggs" : {
    "messages" : {
      "filters" : {
        "other_bucket_key": "other_messages",
        "filters" : {
          "errors" :   { "match" : { "body" : "error"   }},
          "warnings" : { "match" : { "body" : "warning" }}
        }
      }
    }
  }
}

The response would be something like the following:

{
  "took": 3,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
  "aggregations": {
    "messages": {
      "buckets": {
        "errors": {
          "doc_count": 1
        },
        "warnings": {
          "doc_count": 2
        },
        "other_messages": {
          "doc_count": 1
        }
      }
    }
  }
}
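The bucketing logic, including the other bucket, can be mimicked in a few lines of Python. This is only a sketch under simplifying assumptions: the matches helper is a crude stand-in for a match query (lowercase token equality, no real analysis), and filters_agg is a made-up name. Note that a document matching several filters is counted in each matching bucket, while the other bucket only receives documents that match none.

```python
import re

# The four log documents indexed in the examples above.
docs = [
    "warning: page could not be rendered",
    "authentication error",
    "warning: connection timed out",
    "info: user Bob logged out",
]


def matches(body, term):
    """Crude stand-in for a match query: lowercase token equality."""
    return term in re.findall(r"\w+", body.lower())


def filters_agg(bodies, named_filters, other_bucket_key=None):
    """Count documents per named filter, plus an optional other bucket."""
    counts = {name: 0 for name in named_filters}
    if other_bucket_key is not None:
        counts[other_bucket_key] = 0
    for body in bodies:
        matched_any = False
        for name, term in named_filters.items():
            if matches(body, term):
                counts[name] += 1
                matched_any = True
        if not matched_any and other_bucket_key is not None:
            counts[other_bucket_key] += 1
    return counts


buckets = filters_agg(
    docs, {"errors": "error", "warnings": "warning"}, other_bucket_key="other_messages"
)
print(buckets)
```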

Geo Distance Aggregation

A multi-bucket aggregation that works on geo_point fields and is conceptually very similar to the range aggregation. The user can define a point of origin and a set of distance range buckets. The aggregation evaluates the distance of each document value from the origin point and determines the buckets it belongs to based on the ranges (a document belongs to a bucket if the distance between the document and the origin falls within the distance range of the bucket).

PUT /museums
{
    "mappings": {
        "_doc": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}

POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
    "aggs" : {
        "rings_around_amsterdam" : {
            "geo_distance" : {
                "field" : "location",
                "origin" : "52.3760, 4.894",
                "ranges" : [
                    { "to" : 100000 },
                    { "from" : 100000, "to" : 300000 },
                    { "from" : 300000 }
                ]
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "rings_around_amsterdam" : {
            "buckets": [
                {
                    "key": "*-100000.0",
                    "from": 0.0,
                    "to": 100000.0,
                    "doc_count": 3
                },
                {
                    "key": "100000.0-300000.0",
                    "from": 100000.0,
                    "to": 300000.0,
                    "doc_count": 1
                },
                {
                    "key": "300000.0-*",
                    "from": 300000.0,
                    "doc_count": 2
                }
            ]
        }
    }
}
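To see why the museums land in these rings, the distances can be approximated outside Elasticsearch. The Python sketch below uses the haversine formula with a mean Earth radius; both are assumptions of this example and may differ slightly from what the arc calculation does internally. Ranges are treated as including from and excluding to.

```python
from math import asin, cos, radians, sin, sqrt

EARTH_RADIUS_M = 6371008.8  # mean Earth radius; an assumption of this sketch


def haversine_m(lat1, lon1, lat2, lon2):
    """Great-circle distance in meters between two lat/lon points."""
    p1, p2 = radians(lat1), radians(lat2)
    dphi = p2 - p1
    dlmb = radians(lon2 - lon1)
    a = sin(dphi / 2) ** 2 + cos(p1) * cos(p2) * sin(dlmb / 2) ** 2
    return 2 * EARTH_RADIUS_M * asin(sqrt(a))


# Museum locations from the bulk request above, as (lat, lon).
museums = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]
origin = (52.3760, 4.894)
ranges = [(None, 100000), (100000, 300000), (300000, None)]


def ring_counts(points, origin, ranges):
    """Count points per distance range (from inclusive, to exclusive)."""
    counts = [0] * len(ranges)
    for lat, lon in points:
        d = haversine_m(origin[0], origin[1], lat, lon)
        for i, (lo, hi) in enumerate(ranges):
            if (lo is None or d >= lo) and (hi is None or d < hi):
                counts[i] += 1
    return counts


counts = ring_counts(museums, origin, ranges)
print(counts)
```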

The specified field must be of type geo_point (which can only be set explicitly in the mappings). It can also hold an array of geo_point values, in which case all of them will be taken into account during aggregation. The origin point can accept all formats supported by the geo_point type:

  • Object format: { "lat" : 52.3760, "lon" : 4.894 } - this is the safest format as it is the most explicit about the lat & lon values

  • String format: "52.3760, 4.894" - where the first number is the lat and the second is the lon

  • Array format: [4.894, 52.3760] - which is based on the GeoJson standard and where the first number is the lon and the second one is the lat

By default, the distance unit is m (meters) but it can also accept: mi (miles), in (inches), yd (yards), km (kilometers), cm (centimeters), mm (millimeters).

POST /museums/_search?size=0
{
    "aggs" : {
        "rings" : {
            "geo_distance" : {
                "field" : "location",
                "origin" : "52.3760, 4.894",
                "unit" : "km", (1)
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 300 },
                    { "from" : 300 }
                ]
            }
        }
    }
}
  1. The distances will be computed in kilometers

There are two distance calculation modes: arc (the default), and plane. The arc calculation is the most accurate. The plane is the fastest but least accurate. Consider using plane when your search context is "narrow", and spans smaller geographical areas (~5km). plane will return higher error margins for searches across very large areas (e.g. cross continent search). The distance calculation type can be set using the distance_type parameter:

POST /museums/_search?size=0
{
    "aggs" : {
        "rings" : {
            "geo_distance" : {
                "field" : "location",
                "origin" : "52.3760, 4.894",
                "unit" : "km",
                "distance_type" : "plane",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 300 },
                    { "from" : 300 }
                ]
            }
        }
    }
}

Keyed Response

Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:

POST /museums/_search?size=0
{
    "aggs" : {
        "rings_around_amsterdam" : {
            "geo_distance" : {
                "field" : "location",
                "origin" : "52.3760, 4.894",
                "ranges" : [
                    { "to" : 100000 },
                    { "from" : 100000, "to" : 300000 },
                    { "from" : 300000 }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "rings_around_amsterdam" : {
            "buckets": {
                "*-100000.0": {
                    "from": 0.0,
                    "to": 100000.0,
                    "doc_count": 3
                },
                "100000.0-300000.0": {
                    "from": 100000.0,
                    "to": 300000.0,
                    "doc_count": 1
                },
                "300000.0-*": {
                    "from": 300000.0,
                    "doc_count": 2
                }
            }
        }
    }
}

It is also possible to customize the key for each range:

POST /museums/_search?size=0
{
    "aggs" : {
        "rings_around_amsterdam" : {
            "geo_distance" : {
                "field" : "location",
                "origin" : "52.3760, 4.894",
                "ranges" : [
                    { "to" : 100000, "key": "first_ring" },
                    { "from" : 100000, "to" : 300000, "key": "second_ring" },
                    { "from" : 300000, "key": "third_ring" }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "rings_around_amsterdam" : {
            "buckets": {
                "first_ring": {
                    "from": 0.0,
                    "to": 100000.0,
                    "doc_count": 3
                },
                "second_ring": {
                    "from": 100000.0,
                    "to": 300000.0,
                    "doc_count": 1
                },
                "third_ring": {
                    "from": 300000.0,
                    "doc_count": 2
                }
            }
        }
    }
}

GeoHash grid Aggregation

A multi-bucket aggregation that works on geo_point fields and groups points into buckets that represent cells in a grid. The resulting grid can be sparse and only contains cells that have matching data. Each cell is labeled using a geohash which is of user-definable precision.

  • High precision geohashes have a long string length and represent cells that cover only a small area.

  • Low precision geohashes have a short string length and represent cells that each cover a large area.

Geohashes used in this aggregation can have a choice of precision between 1 and 12.

Warning
The highest-precision geohash of length 12 produces cells that cover less than a square metre of land and so high-precision requests can be very costly in terms of RAM and result sizes. Please see the example below on how to first filter the aggregation to a smaller geographic area before requesting high-levels of detail.

The specified field must be of type geo_point (which can only be set explicitly in the mappings) and it can also hold an array of geo_point fields, in which case all points will be taken into account during aggregation.

Simple low-precision request

PUT /museums
{
    "mappings": {
        "_doc": {
            "properties": {
                "location": {
                    "type": "geo_point"
                }
            }
        }
    }
}

POST /museums/_doc/_bulk?refresh
{"index":{"_id":1}}
{"location": "52.374081,4.912350", "name": "NEMO Science Museum"}
{"index":{"_id":2}}
{"location": "52.369219,4.901618", "name": "Museum Het Rembrandthuis"}
{"index":{"_id":3}}
{"location": "52.371667,4.914722", "name": "Nederlands Scheepvaartmuseum"}
{"index":{"_id":4}}
{"location": "51.222900,4.405200", "name": "Letterenhuis"}
{"index":{"_id":5}}
{"location": "48.861111,2.336389", "name": "Musée du Louvre"}
{"index":{"_id":6}}
{"location": "48.860000,2.327000", "name": "Musée d'Orsay"}

POST /museums/_search?size=0
{
    "aggregations" : {
        "large-grid" : {
            "geohash_grid" : {
                "field" : "location",
                "precision" : 3
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "large-grid": {
            "buckets": [
                {
                    "key": "u17",
                    "doc_count": 3
                },
                {
                    "key": "u09",
                    "doc_count": 2
                },
                {
                    "key": "u15",
                    "doc_count": 1
                }
            ]
        }
    }
}
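The bucket keys above can be reproduced with a plain geohash encoder. The following Python sketch implements the standard geohash algorithm (alternating longitude and latitude bisection, five bits per base-32 character); it is an illustration of how points map to cells, not Elasticsearch's own code, and the function name geohash_encode is made up.

```python
from collections import Counter

BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet (no a, i, l, o)


def geohash_encode(lat, lon, precision):
    """Encode a lat/lon point as a geohash string of the given length."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    chars = []
    bits = 0
    bit_count = 0
    use_lon = True  # bits alternate, starting with longitude
    while len(chars) < precision:
        if use_lon:
            mid = (lon_lo + lon_hi) / 2
            if lon >= mid:
                bits = (bits << 1) | 1
                lon_lo = mid
            else:
                bits <<= 1
                lon_hi = mid
        else:
            mid = (lat_lo + lat_hi) / 2
            if lat >= mid:
                bits = (bits << 1) | 1
                lat_lo = mid
            else:
                bits <<= 1
                lat_hi = mid
        use_lon = not use_lon
        bit_count += 1
        if bit_count == 5:  # every 5 bits become one base-32 character
            chars.append(BASE32[bits])
            bits = 0
            bit_count = 0
    return "".join(chars)


# Museum locations from the bulk request above, as (lat, lon).
museums = [
    (52.374081, 4.912350),
    (52.369219, 4.901618),
    (52.371667, 4.914722),
    (51.222900, 4.405200),
    (48.861111, 2.336389),
    (48.860000, 2.327000),
]

grid = Counter(geohash_encode(lat, lon, 3) for lat, lon in museums)
print(grid.most_common())
```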

High-precision requests

When requesting detailed buckets (typically for displaying a "zoomed in" map) a filter like geo_bounding_box should be applied to narrow the subject area otherwise potentially millions of buckets will be created and returned.

POST /museums/_search?size=0
{
    "aggregations" : {
        "zoomed-in" : {
            "filter" : {
                "geo_bounding_box" : {
                    "location" : {
                        "top_left" : "52.4, 4.9",
                        "bottom_right" : "52.3, 5.0"
                    }
                }
            },
            "aggregations":{
                "zoom1":{
                    "geohash_grid" : {
                        "field": "location",
                        "precision": 8
                    }
                }
            }
        }
    }
}

The geohashes returned by the geohash_grid aggregation as bucket keys can also be used for "zooming in" by translating them into bounding boxes using one of the available geohash libraries. For example, for JavaScript the node-geohash library can be used:

var geohash = require('ngeohash');

// bbox will contain [ 52.03125, 4.21875, 53.4375, 5.625 ]
//                   [   minlat,  minlon,  maxlat, maxlon]
var bbox = geohash.decode_bbox('u17');
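If no geohash library is at hand, the bounding box can also be derived directly from the geohash string. The Python sketch below is a minimal, library-free decoder that assumes the standard geohash bit layout (longitude bit first, five bits per character); the function name decode_bbox mirrors the JavaScript call above but is otherwise just an illustrative choice.

```python
BASE32 = "0123456789bcdefghjkmnpqrstuvwxyz"  # geohash alphabet


def decode_bbox(geohash):
    """Return [minlat, minlon, maxlat, maxlon] for a geohash cell."""
    lat_lo, lat_hi = -90.0, 90.0
    lon_lo, lon_hi = -180.0, 180.0
    use_lon = True  # bits alternate, starting with longitude
    for ch in geohash:
        value = BASE32.index(ch)
        for shift in range(4, -1, -1):  # most significant bit first
            bit = (value >> shift) & 1
            if use_lon:
                mid = (lon_lo + lon_hi) / 2
                if bit:
                    lon_lo = mid
                else:
                    lon_hi = mid
            else:
                mid = (lat_lo + lat_hi) / 2
                if bit:
                    lat_lo = mid
                else:
                    lat_hi = mid
            use_lon = not use_lon
    return [lat_lo, lon_lo, lat_hi, lon_hi]


bbox = decode_bbox("u17")
print(bbox)
```

The resulting corners can then be fed into a geo_bounding_box filter (maxlat/minlon as top_left, minlat/maxlon as bottom_right) to zoom into that cell.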

Cell dimensions at the equator

The table below shows the metric dimensions for cells covered by various string lengths of geohash. Cell dimensions vary with latitude and so the table is for the worst-case scenario at the equator.

GeoHash length    Area width x height

1                 5,009.4km x 4,992.6km
2                 1,252.3km x 624.1km
3                 156.5km x 156km
4                 39.1km x 19.5km
5                 4.9km x 4.9km
6                 1.2km x 609.4m
7                 152.9m x 152.4m
8                 38.2m x 19m
9                 4.8m x 4.8m
10                1.2m x 59.5cm
11                14.9cm x 14.9cm
12                3.7cm x 1.9cm
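The figures in the table follow from how geohash bits are split between longitude and latitude: each character adds five bits, and longitude receives the extra bit when the total is odd. The Python sketch below recomputes the dimensions under the assumption that the equator length and pole-to-pole distance are derived from the WGS84 radii; these constants were chosen because they reproduce the table, and whether Elasticsearch uses exactly these values is an assumption.

```python
from math import pi

# Assumed constants (WGS84 radii in km); chosen because they match the table.
EQUATOR_KM = 2 * pi * 6378.137         # full circle around the equator
POLE_TO_POLE_KM = pi * 6356.752314245  # half a meridian, pole to pole


def cell_size_km(length):
    """Return (width_km, height_km) of a geohash cell at the equator."""
    total_bits = 5 * length
    lon_bits = (total_bits + 1) // 2  # longitude gets the extra bit
    lat_bits = total_bits // 2
    width = EQUATOR_KM / 2 ** lon_bits
    height = POLE_TO_POLE_KM / 2 ** lat_bits
    return width, height


for length in range(1, 13):
    width, height = cell_size_km(length)
    print(f"{length:2d}  {width:.6g}km x {height:.6g}km")
```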

Options

field

Mandatory. The name of the field indexed with GeoPoints.

precision

Optional. The string length of the geohashes used to define cells/buckets in the results. Defaults to 5. The precision can be defined in terms of the integer precision levels mentioned above; values outside of [1,12] will be rejected. Alternatively, the precision level can be approximated from a distance measure like "1km" or "10m". The precision level is calculated such that cells will not exceed the specified size (diagonal) of the required precision. When this would lead to precision levels higher than the supported 12 levels (e.g. for distances <5.6cm), the value is rejected.

size

Optional. The maximum number of geohash buckets to return (defaults to 10,000). When results are trimmed, buckets are prioritised based on the volumes of documents they contain.

shard_size

Optional. To allow for more accurate counting of the top cells returned in the final result the aggregation defaults to returning max(10,(size x number-of-shards)) buckets from each shard. If this heuristic is undesirable, the number considered from each shard can be overridden using this parameter.

Global Aggregation

Defines a single bucket of all the documents within the search execution context. This context is defined by the indices and the document types you’re searching on, but is not influenced by the search query itself.

Note
Global aggregators can only be placed as top level aggregators because it doesn’t make sense to embed a global aggregator within another bucket aggregator.

Example:

POST /sales/_search?size=0
{
    "query" : {
        "match" : { "type" : "t-shirt" }
    },
    "aggs" : {
        "all_products" : {
            "global" : {}, (1)
            "aggs" : { (2)
                "avg_price" : { "avg" : { "field" : "price" } }
            }
        },
        "t_shirts": { "avg" : { "field" : "price" } }
    }
}
  1. The global aggregation has an empty body

  2. The sub-aggregations that are registered for this global aggregation

The above aggregation demonstrates how one would compute aggregations (avg_price in this example) on all the documents in the search context, regardless of the query (in our example, it will compute the average price over all products in our catalog, not just on the "shirts").

The response for the above aggregation:

{
    ...
    "aggregations" : {
        "all_products" : {
            "doc_count" : 7, (1)
            "avg_price" : {
                "value" : 140.71428571428572 (2)
            }
        },
        "t_shirts": {
            "value" : 128.33333333333334 (3)
        }
    }
}
  1. The number of documents that were aggregated (in our case, all documents within the search context)

  2. The average price of all products in the index

  3. The average price of all t-shirts

Histogram Aggregation

A multi-bucket values source based aggregation that can be applied on numeric values extracted from the documents. It dynamically builds fixed size (a.k.a. interval) buckets over the values. For example, if the documents have a field that holds a price (numeric), we can configure this aggregation to dynamically build buckets with interval 5 (in case of price it may represent $5). When the aggregation executes, the price field of every document will be evaluated and will be rounded down to its closest bucket - for example, if the price is 32 and the bucket size is 5 then the rounding will yield 30 and thus the document will "fall" into the bucket that is associated with the key 30. To make this more formal, here is the rounding function that is used:

bucket_key = Math.floor((value - offset) / interval) * interval + offset

The interval must be a positive decimal, while the offset must be a decimal in [0, interval) (a decimal greater than or equal to 0 and less than interval)
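The rounding function translates directly into Python. The sketch below is a plain transcription with the stated constraints on interval and offset added as checks; the function name bucket_key is taken from the formula, the rest is illustrative.

```python
import math


def bucket_key(value, interval, offset=0.0):
    """Return the key of the histogram bucket that value falls into."""
    if interval <= 0:
        raise ValueError("interval must be positive")
    if not 0 <= offset < interval:
        raise ValueError("offset must be in [0, interval)")
    return math.floor((value - offset) / interval) * interval + offset


print(bucket_key(32, 5))      # price 32 with interval 5
print(bucket_key(-3, 5))      # negative values round down as well
print(bucket_key(7, 10, 5))   # interval 10 shifted by offset 5
```

Because the function uses floor rather than truncation, negative values are also rounded down to the lower bucket boundary.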

The following snippet "buckets" the products based on their price by interval of 50:

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50
            }
        }
    }
}

And the following may be the response:

{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "key": 0.0,
                    "doc_count": 1
                },
                {
                    "key": 50.0,
                    "doc_count": 1
                },
                {
                    "key": 100.0,
                    "doc_count": 0
                },
                {
                    "key": 150.0,
                    "doc_count": 2
                },
                {
                    "key": 200.0,
                    "doc_count": 3
                }
            ]
        }
    }
}

Minimum document count

The response above shows that no documents have a price that falls within the range of [100, 150). By default the response will fill gaps in the histogram with empty buckets. It is possible to change that and request buckets with a higher minimum count thanks to the min_doc_count setting:

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "min_doc_count" : 1
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "prices" : {
            "buckets": [
                {
                    "key": 0.0,
                    "doc_count": 1
                },
                {
                    "key": 50.0,
                    "doc_count": 1
                },
                {
                    "key": 150.0,
                    "doc_count": 2
                },
                {
                    "key": 200.0,
                    "doc_count": 3
                }
            ]
        }
    }
}

By default the histogram returns all the buckets within the range of the data itself, that is, the documents with the smallest values (in the field the histogram is built on) will determine the min bucket (the bucket with the smallest key) and the documents with the highest values will determine the max bucket (the bucket with the highest key). Often, when requesting empty buckets, this causes confusion, specifically when the data is also filtered.

To understand why, let’s look at an example:

Let’s say you’re filtering your request to get all docs with values between 0 and 500, and in addition you’d like to slice the data per price using a histogram with an interval of 50. You also specify "min_doc_count" : 0 as you’d like to get all buckets, even the empty ones. If it happens that all products (documents) have prices higher than 100, the first bucket you’ll get will be the one with 100 as its key. This is confusing, as many times you’d also like to get those buckets between 0 - 100.

With the extended_bounds setting, you can "force" the histogram aggregation to start building buckets on a specific min value and also keep on building buckets up to a max value (even if there are no documents anymore). Using extended_bounds only makes sense when min_doc_count is 0 (the empty buckets will never be returned if min_doc_count is greater than 0).

Note that (as the name suggests) extended_bounds is not filtering buckets. Meaning, if the extended_bounds.min is higher than the values extracted from the documents, the documents will still dictate what the first bucket will be (and the same goes for the extended_bounds.max and the last bucket). For filtering buckets, one should nest the histogram aggregation under a range filter aggregation with the appropriate from/to settings.

Example:

POST /sales/_search?size=0
{
    "query" : {
        "constant_score" : { "filter": { "range" : { "price" : { "to" : "500" } } } }
    },
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "extended_bounds" : {
                    "min" : 0,
                    "max" : 500
                }
            }
        }
    }
}
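The interplay between empty buckets and extended_bounds can be modelled with a short Python sketch. It is a simplified model of the behaviour described above, not Elasticsearch code: the histogram function, its parameters, and the sample prices are all hypothetical, and offsets are ignored.

```python
import math
from collections import Counter


def histogram(values, interval, min_doc_count=0, extended_bounds=None):
    """Simplified histogram: returns {bucket_key: doc_count} in key order."""
    counts = Counter(math.floor(v / interval) * interval for v in values)
    if min_doc_count > 0:
        # Empty buckets are never returned, so bounds have no effect.
        return {k: c for k, c in sorted(counts.items()) if c >= min_doc_count}
    keys = list(counts)
    if extended_bounds is not None:
        lo, hi = extended_bounds
        # Bounds can only widen the key range, never trim it.
        keys.append(math.floor(lo / interval) * interval)
        keys.append(math.floor(hi / interval) * interval)
    if not keys:
        return {}
    result = {}
    key = min(keys)
    while key <= max(keys):
        result[key] = counts.get(key, 0)
        key += interval
    return result


prices = [120, 130, 260, 410]  # hypothetical prices, all above 100

data_only = histogram(prices, 50)
extended = histogram(prices, 50, extended_bounds=(0, 500))
narrower = histogram(prices, 50, extended_bounds=(200, 300))
print(list(data_only))
print(list(extended))
```

In this model, bounds that are narrower than the data (the narrower case) leave the result unchanged, which matches the note that extended_bounds does not filter buckets.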

Order

By default the returned buckets are sorted by their key ascending, though the order behaviour can be controlled using the order setting. Supports the same order functionality as the Terms Aggregation.

Offset

By default the bucket keys start with 0 and then continue in evenly spaced steps of interval, e.g. if the interval is 10, the first three buckets (assuming there is data inside them) will be [0, 10), [10, 20), [20, 30). The bucket boundaries can be shifted by using the offset option.

This can be best illustrated with an example. If there are 10 documents with values ranging from 5 to 14, using interval 10 will result in two buckets with 5 documents each. If an additional offset 5 is used, there will be only one single bucket [5, 15) containing all the 10 documents.
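The example can be checked with a few lines of Python that apply the rounding function from the beginning of this section to the ten values. The helper name histogram_counts is made up for this sketch.

```python
import math
from collections import Counter


def histogram_counts(values, interval, offset=0):
    """Count values per bucket key using the histogram rounding function."""
    keys = (
        math.floor((v - offset) / interval) * interval + offset for v in values
    )
    return dict(sorted(Counter(keys).items()))


values = list(range(5, 15))  # ten documents with values 5 through 14

no_offset = histogram_counts(values, 10)
with_offset = histogram_counts(values, 10, offset=5)
print(no_offset)    # buckets [0, 10) and [10, 20)
print(with_offset)  # single bucket [5, 15)
```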

Response Format

By default, the buckets are returned as an ordered array. It is also possible to request the response as a hash keyed by the bucket keys instead:

POST /sales/_search?size=0
{
    "aggs" : {
        "prices" : {
            "histogram" : {
                "field" : "price",
                "interval" : 50,
                "keyed" : true
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "prices": {
            "buckets": {
                "0.0": {
                    "key": 0.0,
                    "doc_count": 1
                },
                "50.0": {
                    "key": 50.0,
                    "doc_count": 1
                },
                "100.0": {
                    "key": 100.0,
                    "doc_count": 0
                },
                "150.0": {
                    "key": 150.0,
                    "doc_count": 2
                },
                "200.0": {
                    "key": 200.0,
                    "doc_count": 3
                }
            }
        }
    }
}

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

POST /sales/_search?size=0
{
    "aggs" : {
        "quantity" : {
             "histogram" : {
                 "field" : "quantity",
                 "interval": 10,
                 "missing": 0 (1)
             }
         }
    }
}
  1. Documents without a value in the quantity field will fall into the same bucket as documents that have the value 0.
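As a rough model of this behaviour, the Python sketch below substitutes the missing value before bucketing. The helper name histogram_with_missing and the sample quantities are assumptions for the illustration; None stands for a document without a value in the field.

```python
import math
from collections import Counter

def histogram_with_missing(values, interval, missing=None):
    counts = Counter()
    for v in values:
        if v is None:
            if missing is None:
                continue  # default: documents without a value are ignored
            v = missing   # otherwise treat them as if they had this value
        counts[math.floor(v / interval) * interval] += 1
    return dict(counts)

quantities = [3, None, 12, None, 0]  # made-up data

print(histogram_with_missing(quantities, 10))             # the two None entries are skipped
print(histogram_with_missing(quantities, 10, missing=0))  # they join the bucket that holds 0
```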

IP Range Aggregation

Just like the dedicated date range aggregation, there is also a dedicated range aggregation for IP typed fields:

Example:

GET /ip_addresses/_search
{
    "size": 10,
    "aggs" : {
        "ip_ranges" : {
            "ip_range" : {
                "field" : "ip",
                "ranges" : [
                    { "to" : "10.0.0.5" },
                    { "from" : "10.0.0.5" }
                ]
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "ip_ranges": {
            "buckets" : [
                {
                    "key": "*-10.0.0.5",
                    "to": "10.0.0.5",
                    "doc_count": 10
                },
                {
                    "key": "10.0.0.5-*",
                    "from": "10.0.0.5",
                    "doc_count": 260
                }
            ]
        }
    }
}

IP ranges can also be defined as CIDR masks:

GET /ip_addresses/_search
{
    "size": 0,
    "aggs" : {
        "ip_ranges" : {
            "ip_range" : {
                "field" : "ip",
                "ranges" : [
                    { "mask" : "10.0.0.0/25" },
                    { "mask" : "10.0.0.127/25" }
                ]
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "ip_ranges": {
            "buckets": [
                {
                    "key": "10.0.0.0/25",
                    "from": "10.0.0.0",
                    "to": "10.0.0.128",
                    "doc_count": 128
                },
                {
                    "key": "10.0.0.127/25",
                    "from": "10.0.0.0",
                    "to": "10.0.0.128",
                    "doc_count": 128
                }
            ]
        }
    }
}
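Both masks in the request above resolve to the same address range, which is why the two buckets report identical from, to and doc_count values. The Python sketch below shows the conversion using the standard ipaddress module; the helper name mask_to_range is made up, and the code only models the conversion rather than reproducing what Elasticsearch does internally.

```python
import ipaddress

def mask_to_range(mask):
    # strict=False accepts an address with host bits set, such as 10.0.0.127/25.
    network = ipaddress.ip_network(mask, strict=False)
    start = network.network_address
    # The upper bound is exclusive, so it is one past the last address in the block.
    # (A mask covering the whole address space would overflow here; not handled.)
    end_exclusive = network.broadcast_address + 1
    return str(start), str(end_exclusive)

print(mask_to_range("10.0.0.0/25"))
print(mask_to_range("10.0.0.127/25"))
```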

Keyed Response

Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:

GET /ip_addresses/_search
{
    "size": 0,
    "aggs": {
        "ip_ranges": {
            "ip_range": {
                "field": "ip",
                "ranges": [
                    { "to" : "10.0.0.5" },
                    { "from" : "10.0.0.5" }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "ip_ranges": {
            "buckets": {
                "*-10.0.0.5": {
                    "to": "10.0.0.5",
                    "doc_count": 10
                },
                "10.0.0.5-*": {
                    "from": "10.0.0.5",
                    "doc_count": 260
                }
            }
        }
    }
}

It is also possible to customize the key for each range:

GET /ip_addresses/_search
{
    "size": 0,
    "aggs": {
        "ip_ranges": {
            "ip_range": {
                "field": "ip",
                "ranges": [
                    { "key": "infinity", "to" : "10.0.0.5" },
                    { "key": "and-beyond", "from" : "10.0.0.5" }
                ],
                "keyed": true
            }
        }
    }
}

Response:

{
    ...

    "aggregations": {
        "ip_ranges": {
            "buckets": {
                "infinity": {
                    "to": "10.0.0.5",
                    "doc_count": 10
                },
                "and-beyond": {
                    "from": "10.0.0.5",
                    "doc_count": 260
                }
            }
        }
    }
}

Missing Aggregation

A field data based single bucket aggregation, that creates a bucket of all documents in the current document set context that are missing a field value (effectively, missing a field or having the configured NULL value set). This aggregator will often be used in conjunction with other field data bucket aggregators (such as ranges) to return information for all the documents that could not be placed in any of the other buckets due to missing field data values.

Example:

POST /sales/_search?size=0
{
    "aggs" : {
        "products_without_a_price" : {
            "missing" : { "field" : "price" }
        }
    }
}

In the above example, we get the total number of products that do not have a price.

Response:

{
    ...
    "aggregations" : {
        "products_without_a_price" : {
            "doc_count" : 0
        }
    }
}

Nested Aggregation

A special single bucket aggregation that enables aggregating nested documents.

For example, let's say we have an index of products, and each product holds the list of resellers, each having its own price for the product. The mapping could look like:

PUT /products
{
  "mappings": {
    "product" : {
        "properties" : {
            "resellers" : { (1)
                "type" : "nested",
                "properties" : {
                    "reseller" : { "type" : "text" },
                    "price" : { "type" : "double" }
                }
            }
        }
    }
  }
}
  1. resellers is an array that holds nested documents.

The following request adds a product with two resellers:

PUT /products/_doc/0
{
  "name": "LED TV", (1)
  "resellers": [
    {
      "reseller": "companyA",
      "price": 350
    },
    {
      "reseller": "companyB",
      "price": 500
    }
  ]
}
  1. We are using a dynamic mapping for the name attribute.

The following request returns the minimum price a product can be purchased for:

GET /products/_search
{
    "query" : {
        "match" : { "name" : "led tv" }
    },
    "aggs" : {
        "resellers" : {
            "nested" : {
                "path" : "resellers"
            },
            "aggs" : {
                "min_price" : { "min" : { "field" : "resellers.price" } }
            }
        }
    }
}

As you can see above, the nested aggregation requires the path of the nested documents within the top level documents. Then one can define any type of aggregation over these nested documents.

Response:

{
  ...
  "aggregations": {
    "resellers": {
      "doc_count": 2,
      "min_price": {
        "value": 350
      }
    }
  }
}
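Note that doc_count in this response counts nested reseller documents, not products. A minimal Python model of the same computation is sketched below; the function name nested_min_price is hypothetical and the input is simply the product indexed above.

```python
def nested_min_price(products, path="resellers"):
    # Flatten the nested documents found under `path` in every matching product.
    nested_docs = [nested for product in products for nested in product.get(path, [])]
    return {
        "doc_count": len(nested_docs),  # counts nested docs, not products
        "min_price": {"value": min(doc["price"] for doc in nested_docs)},
    }

led_tv = {
    "name": "LED TV",
    "resellers": [
        {"reseller": "companyA", "price": 350},
        {"reseller": "companyB", "price": 500},
    ],
}

print(nested_min_price([led_tv]))
```

The sketch assumes at least one nested document exists; min() raises an error on an empty input.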

Parent Aggregation

A special single bucket aggregation that selects parent documents that have the specified type, as defined in a join field.

This aggregation has a single option:

  • type - The child type that should be selected.

For example, let’s say we have an index of questions and answers. The answer type has the following join field in the mapping:

PUT parent_example
{
  "mappings": {
    "_doc": {
      "properties": {
        "join": {
          "type": "join",
          "relations": {
            "question": "answer"
          }
        }
      }
    }
  }
}

The question documents contain a tags field and the answer documents contain an owner field. With the parent aggregation the owner buckets can be mapped to the tag buckets in a single request even though the two fields exist in two different kinds of documents.

An example of a question document:

PUT parent_example/_doc/1
{
  "join": {
    "name": "question"
  },
  "body": "<p>I have Windows 2003 server and i bought a new Windows 2008 server...",
  "title": "Whats the best way to file transfer my site from server to a newer one?",
  "tags": [
    "windows-server-2003",
    "windows-server-2008",
    "file-transfer"
  ]
}

Examples of answer documents:

PUT parent_example/_doc/2?routing=1
{
  "join": {
    "name": "answer",
    "parent": "1"
  },
  "owner": {
    "location": "Norfolk, United Kingdom",
    "display_name": "Sam",
    "id": 48
  },
  "body": "<p>Unfortunately you're pretty much limited to FTP...",
  "creation_date": "2009-05-04T13:45:37.030"
}

PUT parent_example/_doc/3?routing=1&refresh
{
  "join": {
    "name": "answer",
    "parent": "1"
  },
  "owner": {
    "location": "Norfolk, United Kingdom",
    "display_name": "Troll",
    "id": 49
  },
  "body": "<p>Use Linux...",
  "creation_date": "2009-05-05T13:45:37.030"
}

The following request can be built that connects the two together:

POST parent_example/_search?size=0
{
  "aggs": {
    "top-names": {
      "terms": {
        "field": "owner.display_name.keyword",
        "size": 10
      },
      "aggs": {
        "to-questions": {
          "parent": {
            "type" : "answer" (1)
          },
          "aggs": {
            "top-tags": {
              "terms": {
                "field": "tags.keyword",
                "size": 10
              }
            }
          }
        }
      }
    }
  }
}
  1. The type points to the type / mapping with the name answer.

The above example returns the top answer owners and per owner the top question tags.

Possible response:

{
  "took": 9,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0.0,
    "hits": []
  },
  "aggregations": {
    "top-names": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "Sam",
          "doc_count": 1, (1)
          "to-questions": {
            "doc_count": 1, (2)
            "top-tags": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "file-transfer",
                  "doc_count": 1
                },
                {
                  "key": "windows-server-2003",
                  "doc_count": 1
                },
                {
                  "key": "windows-server-2008",
                  "doc_count": 1
                }
              ]
            }
          }
        },
        {
          "key": "Troll",
          "doc_count": 1,
          "to-questions": {
            "doc_count": 1,
            "top-tags": {
              "doc_count_error_upper_bound": 0,
              "sum_other_doc_count": 0,
              "buckets": [
                {
                  "key": "file-transfer",
                  "doc_count": 1
                },
                {
                  "key": "windows-server-2003",
                  "doc_count": 1
                },
                {
                  "key": "windows-server-2008",
                  "doc_count": 1
                }
              ]
            }
          }
        }
      ]
    }
  }
}
  1. The number of answer documents whose owner is Sam, Troll, etc.

  2. The number of question documents that are related to answer documents whose owner is Sam, Troll, etc.
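The join performed by this request can be modelled in plain Python, as sketched below. This is only an illustration under simplifying assumptions: documents are plain dicts with an explicit id, the owner is reduced to its display name, and the function name top_tags_per_owner is made up.

```python
from collections import Counter, defaultdict

def top_tags_per_owner(docs):
    by_id = {d["id"]: d for d in docs}

    # Outer terms aggregation: group the answer documents by owner.
    answers_by_owner = defaultdict(list)
    for d in docs:
        if d["join"]["name"] == "answer":
            answers_by_owner[d["owner"]].append(d)

    result = {}
    for owner, answers in answers_by_owner.items():
        # Parent aggregation: move from the answers to their distinct parent questions.
        parent_ids = {a["join"]["parent"] for a in answers}
        # Inner terms aggregation: count the tags of those questions.
        tags = Counter(t for pid in parent_ids for t in by_id[pid]["tags"])
        result[owner] = {
            "doc_count": len(answers),
            "to-questions": {"doc_count": len(parent_ids), "top-tags": dict(tags)},
        }
    return result

docs = [
    {"id": "1", "join": {"name": "question"},
     "tags": ["windows-server-2003", "windows-server-2008", "file-transfer"]},
    {"id": "2", "join": {"name": "answer", "parent": "1"}, "owner": "Sam"},
    {"id": "3", "join": {"name": "answer", "parent": "1"}, "owner": "Troll"},
]

print(top_tags_per_owner(docs))
```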

Range Aggregation

A multi-bucket value source based aggregation that enables the user to define a set of ranges, each representing a bucket. During the aggregation process, the values extracted from each document are checked against each bucket range, and the document is placed in the matching buckets. Note that this aggregation includes the from value and excludes the to value for each range.

Example:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 100.0 },
                    { "from" : 100.0, "to" : 200.0 },
                    { "from" : 200.0 }
                ]
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "price_ranges" : {
            "buckets": [
                {
                    "key": "*-100.0",
                    "to": 100.0,
                    "doc_count": 2
                },
                {
                    "key": "100.0-200.0",
                    "from": 100.0,
                    "to": 200.0,
                    "doc_count": 2
                },
                {
                    "key": "200.0-*",
                    "from": 200.0,
                    "doc_count": 3
                }
            ]
        }
    }
}
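The inclusive from and exclusive to rule can be illustrated with a short Python sketch. The function name range_buckets is hypothetical, and the sample prices are made up so that the counts match the response above.

```python
def range_buckets(values, ranges):
    buckets = []
    for r in ranges:
        lo, hi = r.get("from"), r.get("to")
        # from is inclusive, to is exclusive; a missing bound is open-ended.
        count = sum(
            1 for v in values
            if (lo is None or v >= lo) and (hi is None or v < hi)
        )
        key = "{}-{}".format("*" if lo is None else float(lo),
                             "*" if hi is None else float(hi))
        buckets.append({"key": key, "doc_count": count})
    return buckets

prices = [10, 50, 150, 175, 200, 200, 200]  # made-up sample data
ranges = [{"to": 100}, {"from": 100, "to": 200}, {"from": 200}]

for bucket in range_buckets(prices, ranges):
    print(bucket)
```

A price of exactly 200 is counted in the last bucket only, because the to bound of the middle range is exclusive.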

Keyed Response

Setting the keyed flag to true will associate a unique string key with each bucket and return the ranges as a hash rather than an array:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "keyed" : true,
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 200 },
                    { "from" : 200 }
                ]
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "price_ranges" : {
            "buckets": {
                "*-100.0": {
                    "to": 100.0,
                    "doc_count": 2
                },
                "100.0-200.0": {
                    "from": 100.0,
                    "to": 200.0,
                    "doc_count": 2
                },
                "200.0-*": {
                    "from": 200.0,
                    "doc_count": 3
                }
            }
        }
    }
}

It is also possible to customize the key for each range:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "keyed" : true,
                "ranges" : [
                    { "key" : "cheap", "to" : 100 },
                    { "key" : "average", "from" : 100, "to" : 200 },
                    { "key" : "expensive", "from" : 200 }
                ]
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "price_ranges" : {
            "buckets": {
                "cheap": {
                    "to": 100.0,
                    "doc_count": 2
                },
                "average": {
                    "from": 100.0,
                    "to": 200.0,
                    "doc_count": 2
                },
                "expensive": {
                    "from": 200.0,
                    "doc_count": 3
                }
            }
        }
    }
}

Script

The range aggregation accepts a script parameter. This parameter allows you to define an inline script that will be executed during aggregation execution.

The following example shows how to use an inline script with the painless script language and no script parameters:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "script" : {
                    "lang": "painless",
                    "source": "doc['price'].value"
                },
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 200 },
                    { "from" : 200 }
                ]
            }
        }
    }
}

It is also possible to use stored scripts. Here is a simple stored script:

POST /_scripts/convert_currency
{
  "script": {
    "lang": "painless",
    "source": "doc[params.field].value * params.conversion_rate"
  }
}

And this new stored script can be used in the range aggregation like this:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "script" : {
                    "id": "convert_currency", (1)
                    "params": { (2)
                        "field": "price",
                        "conversion_rate": 0.835526591
                    }
                },
                "ranges" : [
                    { "from" : 0, "to" : 100 },
                    { "from" : 100 }
                ]
            }
        }
    }
}
  1. Id of the stored script

  2. Parameters to use when executing the stored script

Value Script

Let's say the product prices are in USD but we would like to get the price ranges in euros. We can use a value script to convert the prices prior to the aggregation (assuming a conversion rate of 0.8):

GET /sales/_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "script" : {
                    "source": "_value * params.conversion_rate",
                    "params" : {
                        "conversion_rate" : 0.8
                    }
                },
                "ranges" : [
                    { "to" : 35 },
                    { "from" : 35, "to" : 70 },
                    { "from" : 70 }
                ]
            }
        }
    }
}

Sub Aggregations

The following example not only places the documents into the different buckets but also computes statistics over the prices in each price range:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 200 },
                    { "from" : 200 }
                ]
            },
            "aggs" : {
                "price_stats" : {
                    "stats" : { "field" : "price" }
                }
            }
        }
    }
}

Response:

{
  ...
  "aggregations": {
    "price_ranges": {
      "buckets": [
        {
          "key": "*-100.0",
          "to": 100.0,
          "doc_count": 2,
          "price_stats": {
            "count": 2,
            "min": 10.0,
            "max": 50.0,
            "avg": 30.0,
            "sum": 60.0
          }
        },
        {
          "key": "100.0-200.0",
          "from": 100.0,
          "to": 200.0,
          "doc_count": 2,
          "price_stats": {
            "count": 2,
            "min": 150.0,
            "max": 175.0,
            "avg": 162.5,
            "sum": 325.0
          }
        },
        {
          "key": "200.0-*",
          "from": 200.0,
          "doc_count": 3,
          "price_stats": {
            "count": 3,
            "min": 200.0,
            "max": 200.0,
            "avg": 200.0,
            "sum": 600.0
          }
        }
      ]
    }
  }
}

If a sub aggregation is also based on the same value source as the range aggregation (like the stats aggregation in the example above) it is possible to leave out the value source definition for it. The following will return the same response as above:

GET /_search
{
    "aggs" : {
        "price_ranges" : {
            "range" : {
                "field" : "price",
                "ranges" : [
                    { "to" : 100 },
                    { "from" : 100, "to" : 200 },
                    { "from" : 200 }
                ]
            },
            "aggs" : {
                "price_stats" : {
                    "stats" : {} (1)
                }
            }
        }
    }
}
  1. We don’t need to specify the price as we "inherit" it by default from the parent range aggregation

Reverse nested Aggregation

A special single bucket aggregation that enables aggregating on parent docs from nested documents. Effectively this aggregation can break out of the nested block structure and link to other nested structures or the root document, which allows nesting other aggregations that aren’t part of the nested object in a nested aggregation.

The reverse_nested aggregation must be defined inside a nested aggregation.

Options:
  • path - Defines which nested object field to join back to. The default is empty, which means that it joins back to the root / main document level. The path cannot contain a reference to a nested object field that falls outside the nested structure of the nested aggregation that the reverse_nested aggregation is in.

For example, let's say we have an index for a ticket system with issues and comments. The comments are inlined into the issue documents as nested documents. The mapping could look like:

PUT /issues
{
    "mappings": {
        "issue" : {
            "properties" : {
                "tags" : { "type" : "keyword" },
                "comments" : { (1)
                    "type" : "nested",
                    "properties" : {
                        "username" : { "type" : "keyword" },
                        "comment" : { "type" : "text" }
                    }
                }
            }
        }
    }
}
  1. comments is an array that holds nested documents under the issue object.

The following aggregation returns the usernames of the top commenters and, for each top commenter, the top tags of the issues the user has commented on:

GET /issues/_search
{
  "query": {
    "match_all": {}
  },
  "aggs": {
    "comments": {
      "nested": {
        "path": "comments"
      },
      "aggs": {
        "top_usernames": {
          "terms": {
            "field": "comments.username"
          },
          "aggs": {
            "comment_to_issue": {
              "reverse_nested": {}, (1)
              "aggs": {
                "top_tags_per_comment": {
                  "terms": {
                    "field": "tags"
                  }
                }
              }
            }
          }
        }
      }
    }
  }
}

As you can see above, the reverse_nested aggregation is placed inside a nested aggregation, as this is the only place in the DSL where the reverse_nested aggregation can be used. Its sole purpose is to join back to a parent doc higher up in the nested structure.

  1. A reverse_nested aggregation that joins back to the root / main document level, because no path has been defined. Via the path option the reverse_nested aggregation can join back to a different level, if multiple layered nested object types have been defined in the mapping

Possible response snippet:

{
  "aggregations": {
    "comments": {
      "doc_count": 1,
      "top_usernames": {
        "doc_count_error_upper_bound" : 0,
        "sum_other_doc_count" : 0,
        "buckets": [
          {
            "key": "username_1",
            "doc_count": 1,
            "comment_to_issue": {
              "doc_count": 1,
              "top_tags_per_comment": {
                "doc_count_error_upper_bound" : 0,
                "sum_other_doc_count" : 0,
                "buckets": [
                  {
                    "key": "tag_1",
                    "doc_count": 1
                  }
                  ...
                ]
              }
            }
          }
          ...
        ]
      }
    }
  }
}
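The step from the nested comment level back to the root issue level can be modelled with a short Python sketch. It is an illustration only: the function name top_tags_per_commenter and the sample issues are made up, and bucket ordering and size limits are ignored.

```python
from collections import Counter, defaultdict

def top_tags_per_commenter(issues):
    comment_counts = Counter()     # doc counts at the nested (comment) level
    issue_ids = defaultdict(set)   # root documents reached by joining back

    for issue_id, issue in enumerate(issues):
        for comment in issue.get("comments", []):
            user = comment["username"]
            comment_counts[user] += 1
            issue_ids[user].add(issue_id)

    result = {}
    for user, ids in issue_ids.items():
        # Tags live on the root document, so they are counted once per issue.
        tags = Counter(t for i in ids for t in issues[i].get("tags", []))
        result[user] = {
            "doc_count": comment_counts[user],
            "comment_to_issue": {
                "doc_count": len(ids),
                "top_tags_per_comment": dict(tags),
            },
        }
    return result

issues = [  # made-up sample data
    {"tags": ["tag_1", "tag_2"],
     "comments": [{"username": "username_1"}, {"username": "username_1"},
                  {"username": "username_2"}]},
    {"tags": ["tag_1"], "comments": [{"username": "username_1"}]},
]

print(top_tags_per_commenter(issues))
```

In this sample username_1 wrote three comments on two issues, so the nested doc_count (3) and the doc_count after joining back to the root (2) differ.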

Sampler Aggregation

A filtering aggregation used to limit any sub aggregations' processing to a sample of the top-scoring documents.

Example use cases:
  • Tightening the focus of analytics to high-relevance matches rather than the potentially very long tail of low-quality matches

  • Reducing the running cost of aggregations that can produce useful results using only samples e.g. significant_terms

Example:

A query on StackOverflow data for the popular term javascript OR the rarer term kibana will match many documents - most of them missing the word Kibana. To focus the significant_terms aggregation on top-scoring documents that are more likely to match the most interesting parts of our query we use a sample.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
        "sample": {
            "sampler": {
                "shard_size": 200
            },
            "aggs": {
                "keywords": {
                    "significant_terms": {
                        "field": "tags",
                        "exclude": ["kibana", "javascript"]
                    }
                }
            }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "sample": {
            "doc_count": 200,(1)
            "keywords": {
                "doc_count": 200,
                "bg_count": 650,
                "buckets": [
                    {
                        "key": "elasticsearch",
                        "doc_count": 150,
                        "score": 1.078125,
                        "bg_count": 200
                    },
                    {
                        "key": "logstash",
                        "doc_count": 50,
                        "score": 0.5625,
                        "bg_count": 50
                    }
                ]
            }
        }
    }
}
  1. 200 documents were sampled in total. The cost of performing the nested significant_terms aggregation was therefore limited rather than unbounded.

Without the sampler aggregation the request query considers the full "long tail" of low-quality matches and therefore identifies less significant terms such as jquery and angular rather than focusing on the more insightful Kibana-related terms.

POST /stackoverflow/_search?size=0
{
    "query": {
        "query_string": {
            "query": "tags:kibana OR tags:javascript"
        }
    },
    "aggs": {
             "low_quality_keywords": {
                "significant_terms": {
                    "field": "tags",
                    "size": 3,
                    "exclude":["kibana", "javascript"]
                }
        }
    }
}

Response:

{
    ...
    "aggregations": {
        "low_quality_keywords": {
            "doc_count": 600,
            "bg_count": 650,
            "buckets": [
                {
                    "key": "angular",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "jquery",
                    "doc_count": 200,
                    "score": 0.02777,
                    "bg_count": 200
                },
                {
                    "key": "logstash",
                    "doc_count": 50,
                    "score": 0.0069,
                    "bg_count": 50
                }
            ]
        }
    }
}

shard_size

The shard_size parameter limits how many top-scoring documents are collected in the sample processed on each shard. The default value is 100.
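Because the limit applies per shard, the total sample can hold up to shard_size documents for each shard searched. The Python sketch below models this; the function name sample_top_scoring and the sample documents are assumptions made for the illustration.

```python
import heapq

def sample_top_scoring(shards, shard_size=100):
    """Keep at most shard_size best-scoring docs from each shard (hypothetical model)."""
    sample = []
    for shard_docs in shards:
        sample.extend(heapq.nlargest(shard_size, shard_docs, key=lambda d: d["score"]))
    return sample

shards = [  # made-up documents on two shards
    [{"id": 1, "score": 0.9}, {"id": 2, "score": 0.1}, {"id": 3, "score": 0.5}],
    [{"id": 4, "score": 0.3}, {"id": 5, "score": 0.8}],
]

sample = sample_top_scoring(shards, shard_size=2)
print([d["id"] for d in sample])  # the lowest-scoring doc on the first shard is dropped
```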

Limitations

Cannot be nested under breadth_first aggregations

Being a quality-based filter the sampler aggregation needs access to the relevance score produced for each document. It therefore cannot be nested under a terms aggregation which has the collect_mode switched from the default depth_first mode to breadth_first as this discards scores. In this situation an error will be thrown.

Significant Terms Aggregation

An aggregation that returns interesting or unusual occurrences of terms in a set.

Example use cases:
  • Suggesting "H5N1" when users search for "bird flu" in text

  • Identifying the merchant that is the "common point of compromise" from the transaction history of credit card owners reporting loss

  • Suggesting keywords relating to stock symbol $ATI for an automated news classifier

  • Spotting the fraudulent doctor who is diagnosing more than his fair share of whiplash injuries

  • Spotting the tire manufacturer who has a disproportionate number of blow-outs

In all these cases the terms being selected are not simply the most popular terms in a set. They are the terms that have undergone a significant change in popularity measured between a foreground and background set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results, that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.

Single-set analysis

In the simplest case, the foreground set of interest is the search results matched by a query and the background set used for statistical comparisons is the index or indices from which the results were gathered.

Example:

GET /_search
{
    "query" : {
        "terms" : {"force" : [ "British Transport Police" ]}
    },
    "aggregations" : {
        "significant_crime_types" : {
            "significant_terms" : { "field" : "crime_type" }
        }
    }
}

Response:

{
    ...
    "aggregations" : {
        "significant_crime_types" : {
            "doc_count": 47347,
            "bg_count": 5064554,
            "buckets" : [
                {
                    "key": "Bicycle theft",
                    "doc_count": 3640,
                    "score": 0.371235374214817,
                    "bg_count": 66799
                }
                ...
            ]
        }
    }
}

When querying an index of all crimes from all police forces, what these results show is that the British Transport Police force stands out as a force dealing with a disproportionately large number of bicycle thefts. Ordinarily, bicycle thefts represent only about 1% of crimes (66799/5064554) but for the British Transport Police, who handle crime on railways and stations, more than 7% of crimes (3640/47347) are bicycle thefts. This is a significant, roughly six-fold increase in frequency and so this anomaly was highlighted as the top crime type.
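The frequencies quoted above follow directly from the counts in the response, as this small Python check shows (the variable names are chosen for the illustration):

```python
def rate(count, total):
    return count / total

background = rate(66799, 5064554)  # bicycle thefts across all forces
foreground = rate(3640, 47347)     # bicycle thefts for the British Transport Police

print(f"background: {background:.1%}")
print(f"foreground: {foreground:.1%}")
print(f"ratio: {foreground / background:.1f}x")
```

This gives a background rate of about 1.3%, a foreground rate of about 7.7% and a ratio of about 5.8.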

The problem with using a query to spot anomalies is it only gives us one subset to use for comparisons. To discover all the other police forces' anomalies we would have to repeat the query for each of the different forces.

This can be a tedious way to look for unusual patterns in an index.

Multi-set analysis

A simpler way to perform analysis across multiple categories is to use a parent-level aggregation to segment the data ready for analysis.

Example using a parent aggregation for segmentation:

GET /_search
{
    "aggregations": {
        "forces": {
            "terms": {"field": "force"},
            "aggregations": {
                "significant_crime_types": {
                    "significant_terms": {"field": "crime_type"}
                }
            }
        }
    }
}

Response:

{
 ...
 "aggregations": {
    "forces": {
        "doc_count_error_upper_bound": 1375,
        "sum_other_doc_count": 7879845,
        "buckets": [
            {
                "key": "Metropolitan Police Service",
                "doc_count": 894038,
                "significant_crime_types": {
                    "doc_count": 894038,
                    "bg_count": 5064554,
                    "buckets": [
                        {
                            "key": "Robbery",
                            "doc_count": 27617,
                            "score": 0.0599,
                            "bg_count": 53182
                        }
                        ...
                    ]
                }
            },
            {
                "key": "British Transport Police",
                "doc_count": 47347,
                "significant_crime_types": {
                    "doc_count": 47347,
                    "bg_count": 5064554,
                    "buckets": [
                        {
                            "key": "Bicycle theft",
                            "doc_count": 3640,
                            "score": 0.371,
                            "bg_count": 66799
                        }
                        ...
                    ]
                }
            }
        ]
    }
  }
}

Now we have anomaly detection for each of the police forces using a single request.
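As a rough illustration of consuming this response, the Python sketch below pulls out the highest-scoring crime type for each force. The response dictionary is a trimmed copy of the example above; the function name top_anomaly_per_force is ours and not part of any client library.

```python
# Trimmed copy of the response shown above (only the fields used here).
response = {
    "aggregations": {
        "forces": {
            "buckets": [
                {
                    "key": "Metropolitan Police Service",
                    "doc_count": 894038,
                    "significant_crime_types": {
                        "buckets": [
                            {"key": "Robbery", "doc_count": 27617,
                             "score": 0.0599, "bg_count": 53182}
                        ]
                    },
                },
                {
                    "key": "British Transport Police",
                    "doc_count": 47347,
                    "significant_crime_types": {
                        "buckets": [
                            {"key": "Bicycle theft", "doc_count": 3640,
                             "score": 0.371, "bg_count": 66799}
                        ]
                    },
                },
            ]
        }
    }
}


def top_anomaly_per_force(resp):
    """Map each force to its highest-scoring significant crime type (or None)."""
    result = {}
    for force in resp["aggregations"]["forces"]["buckets"]:
        terms = force["significant_crime_types"]["buckets"]
        # Pick the maximum by score explicitly rather than relying on bucket order.
        best = max(terms, key=lambda b: b["score"], default=None)
        result[force["key"]] = best["key"] if best else None
    return result


anomalies = top_anomaly_per_force(response)
for force, crime in anomalies.items():
    print(f"{force}: {crime}")
```

A force whose significant_crime_types bucket list is empty maps to None rather than raising an error.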

We can use other forms of top-level aggregations to segment our data, for example segmenting by geographic area to identify unusual hot-spots of a particular crime type:

GET /_search
{
    "aggs": {
        "hotspots": {
            "geohash_grid": {
                "field": "location",
                "precision": 5
            },
            "aggs": {
                "significant_crime_types": {
                    "significant_terms": {"field": "crime_type"}
                }
            }
        }
    }
}

This example uses the geohash_grid aggregation to create result buckets that represent geographic areas, and inside each bucket we can identify anomalous levels of a crime type in these tightly-focused areas e.g.

  • Airports exhibit unusual numbers of weapon confiscations

  • Universities show uplifts of bicycle thefts

At a higher geohash_grid zoom-level with larger coverage areas we would start to see where an entire police-force may be tackling an unusual volume of a particular crime type.

Obviously a time-based top-level segmentation would help identify current trends for each point in time where a simple terms aggregation would typically show the very popular "constants" that persist across all time slots.
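A time-based segmentation could be sketched as follows. This Python snippet only builds a request body; the date field name "date", the monthly interval and the aggregation name "over_time" are assumptions made for illustration, and the sketch assumes the interval parameter of the date_histogram aggregation.

```python
import json


def time_segmented_request(date_field="date", interval="month",
                           term_field="crime_type"):
    """Build a request body that runs significant_terms inside each time bucket."""
    return {
        "aggs": {
            "over_time": {
                # Parent aggregation: one bucket per time slot.
                "date_histogram": {"field": date_field, "interval": interval},
                "aggs": {
                    # Child aggregation: terms unusual for that time slot.
                    "significant_crime_types": {
                        "significant_terms": {"field": term_field}
                    }
                },
            }
        }
    }


body = time_segmented_request()
print(json.dumps(body, indent=4))
```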

How are the scores calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background. The way the terms are ranked can be configured; see the "Parameters" section.

Use on free-text fields

The significant_terms aggregation can be used effectively on tokenized free-text fields to suggest:

  • keywords for refining end-user searches

  • keywords for use in percolator queries

Warning
Picking a free-text field as the subject of a significant terms analysis can be expensive! It will attempt to load every unique word into RAM. It is recommended to only use this on smaller indices.

Use the "like this but not this" pattern

You can spot mis-categorized content by first searching a structured field e.g. category:adultMovie and use significant_terms on the free-text "movie_description" field. Take the suggested words (I’ll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords. You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category.

The significance score from each term can also provide a useful boost setting to sort matches. Using the minimum_should_match setting of the terms query with the keywords will help control the balance of precision/recall in the result set, i.e. a high setting would return a small number of relevant results packed full of keywords, and a setting of "1" would produce a more exhaustive result set with all documents containing any keyword.
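The follow-up search in this pattern might be assembled as in the Python sketch below. It is only one possible shape: it uses a bool query with should clauses to carry the per-keyword boosts and minimum_should_match, and the field names, sample keywords and scores are placeholders rather than real results.

```python
def not_like_this_query(buckets, text_field, category_field, category,
                        minimum_should_match=1):
    """Build a query for documents outside `category` that contain the keywords."""
    should = [
        # One clause per suggested keyword, boosted by its significance score.
        {"term": {text_field: {"value": b["key"], "boost": b["score"]}}}
        for b in buckets
    ]
    return {
        "query": {
            "bool": {
                "should": should,
                "minimum_should_match": minimum_should_match,
                # Exclude the documents already marked with the category.
                "must_not": [{"term": {category_field: category}}],
            }
        }
    }


# Placeholder buckets standing in for a significant_terms response.
suggested = [
    {"key": "keyword_a", "score": 2.5},
    {"key": "keyword_b", "score": 1.2},
]
query_body = not_like_this_query(
    suggested, "movie_description", "category", "adultMovie",
    minimum_should_match=2,
)
print(query_body)
```

Raising minimum_should_match narrows the results to documents containing more of the keywords; a value of 1 accepts any single keyword.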

Tip
Show significant_terms in context

Free-text significant_terms are much more easily understood when viewed in context. Take the results of significant_terms suggestions from a free-text field and use them in a terms query on the same field with a highlight clause to present users with example snippets of documents. When the terms are presented unstemmed, highlighted, with the right case, in the right order and with some context, their significance/meaning is more readily apparent.

Custom background sets

Ordinarily, the foreground set of documents is "diffed" against a background set of all the documents in your index. However, sometimes it may prove useful to use a narrower background set as the basis for comparisons. For example, a query on documents relating to "Madrid" in an index with content from all over the world might reveal that "Spanish" was a significant term. This may be true, but if you want some more focused terms you could use a background_filter on the term "spain" to establish a narrower set of documents as context. With this as a background, "Spanish" would now be seen as commonplace and therefore not as significant as words like "capital" that relate more strongly to Madrid. Note that using a background filter will slow things down: each term’s background frequency must now be derived on-the-fly from filtering posting lists rather than reading the index’s pre-computed count for a term.

Limitations

Significant terms must be indexed values

Unlike the terms aggregation it is currently not possible to use script-generated terms for counting purposes. Because of the way the significant_terms aggregation must consider both foreground and background frequencies it would be prohibitively expensive to use a script on the entire index to obtain background frequencies for comparisons. Also DocValues are not supported as sources of term data for similar reasons.

No analysis of floating point fields

Floating point fields are currently not supported as the subject of significant_terms analysis. While integer or long fields can be used to represent concepts like bank account numbers or category numbers which can be interesting to track, floating point fields are usually used to represent quantities of something. As such, individual floating point terms are not useful for this form of frequency analysis.

Use as a parent aggregation

If there is the equivalent of a match_all query, or no query criteria providing a subset of the index, the significant_terms aggregation should not be used as the top-most aggregation. In this scenario the foreground set is exactly the same as the background set, so there is no difference in document frequencies to observe and from which to make sensible suggestions.

Another consideration is that the significant_terms aggregation produces many candidate results at shard level that are only later pruned on the reducing node once all statistics from all shards are merged. As a result, it can be inefficient and costly in terms of RAM to embed large child aggregations under a significant_terms aggregation that later discards many candidate terms. It is advisable in these cases to perform two searches - the first to provide a rationalized list of significant_terms and then add this shortlist of terms to a second query to go back and fetch the required child aggregations.
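The two-search approach could look roughly like the Python sketch below, which turns the buckets of a first significant_terms response into a second request. Using a terms aggregation with an include list for the second step is one option, not the only one; the aggregation names, the sample response and the child aggregation are illustrative assumptions.

```python
def follow_up_request(first_response, agg_name, field, child_aggs, query=None):
    """Second search: a terms aggregation limited to the shortlisted terms."""
    buckets = first_response["aggregations"][agg_name]["buckets"]
    shortlist = [b["key"] for b in buckets]
    body = {
        "aggs": {
            "shortlisted_terms": {
                # Only the shortlisted terms get buckets (and child aggregations).
                "terms": {"field": field, "include": shortlist},
                "aggs": child_aggs,
            }
        }
    }
    if query is not None:
        body["query"] = query  # reuse the query that defined the foreground set
    return body


# Placeholder for the response of the first (cheap) search.
first = {
    "aggregations": {
        "significant_crime_types": {
            "buckets": [{"key": "Bicycle theft"}, {"key": "Robbery"}]
        }
    }
}
second = follow_up_request(
    first,
    agg_name="significant_crime_types",
    field="crime_type",
    child_aggs={"per_force": {"terms": {"field": "force"}}},
)
print(second)
```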

Approximate counts

The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and as such may be:

  • low if certain shards did not provide figures for a given term in their top sample

  • high when considering the background frequency as it may count occurrences found in deleted documents

Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. However, the size and shard size settings covered in the next section provide tools to help control the accuracy levels.

Parameters

JLH score

The JLH score can be used as a significance score by adding the parameter

	 "jlh": {
	 }

The scores are derived from the doc frequencies in foreground and background sets. The absolute change in popularity (foregroundPercent - backgroundPercent) would favor common terms whereas the relative change in popularity (foregroundPercent/ backgroundPercent) would favor rare terms. Rare vs common is essentially a precision vs recall balance and so the absolute and relative changes are multiplied to provide a sweet spot between precision and recall.
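To make the calculation concrete, here is a minimal Python sketch of the product described above, checked against the crime figures from the multi-set example. Returning zero when a term is no more common in the subset than in the background is an assumption of this sketch, and the function name is illustrative only.

```python
def jlh_score(subset_freq, subset_size, superset_freq, superset_size):
    """Absolute change in popularity multiplied by relative change."""
    if subset_size == 0 or superset_size == 0 or superset_freq == 0:
        return 0.0
    foreground_percent = subset_freq / subset_size
    background_percent = superset_freq / superset_size
    absolute_change = foreground_percent - background_percent
    if absolute_change <= 0:
        # Assumption: terms that are not more popular in the subset score zero.
        return 0.0
    relative_change = foreground_percent / background_percent
    return absolute_change * relative_change


# Figures taken from the police force example earlier in this section.
bicycle_theft = jlh_score(3640, 47347, 66799, 5064554)
robbery = jlh_score(27617, 894038, 53182, 5064554)
print(f"Bicycle theft (British Transport Police): {bicycle_theft:.4f}")
print(f"Robbery (Metropolitan Police Service):    {robbery:.4f}")
```

Both values agree, to the precision shown there, with the score fields in the example response (0.371 and 0.0599).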

Mutual information

Mutual information as described in "Information Retrieval", Manning et al., Chapter 13.5.1 can be used as significance score by adding the parameter

	 "mutual_information": {
	      "include_negatives": true
	 }

Mutual information does not differentiate between terms that are descriptive for the subset or for documents outside the subset. The significant terms can therefore contain terms that appear more or less frequently in the subset than outside the subset. To filter out the terms that appear less often in the subset than in documents outside the subset, include_negatives can be set to false.
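The symmetry described above can be seen in a small Python sketch of the textbook mutual information formula over a 2x2 table (term present or absent, document inside or outside the subset). This is not Elasticsearch's implementation: it assumes the subset is contained in the background, and the appears_less_in_subset helper is only our reading of what include_negatives is meant to filter.

```python
import math


def mutual_information(subset_freq, subset_size, superset_freq, superset_size):
    """Mutual information (in bits) between 'has term' and 'is in subset'."""
    n = superset_size
    outside_with = superset_freq - subset_freq
    # Each cell: (observed count, term margin, subset margin).
    cells = [
        (subset_freq, superset_freq, subset_size),
        (subset_size - subset_freq, n - superset_freq, subset_size),
        (outside_with, superset_freq, n - subset_size),
        ((n - subset_size) - outside_with, n - superset_freq, n - subset_size),
    ]
    total = 0.0
    for count, term_margin, subset_margin in cells:
        if count > 0:  # empty cells contribute nothing
            total += (count / n) * math.log2(n * count / (term_margin * subset_margin))
    return total


def appears_less_in_subset(subset_freq, subset_size, superset_freq, superset_size):
    """True when the term is rarer inside the subset than outside it."""
    outside_size = superset_size - subset_size
    if subset_size == 0 or outside_size == 0:
        return False
    return subset_freq / subset_size < (superset_freq - subset_freq) / outside_size


# A term found in every subset document and nowhere else ...
only_inside = mutual_information(50, 50, 50, 100)
# ... and a term found in every document outside the subset and never inside.
only_outside = mutual_information(0, 50, 50, 100)
print(only_inside, only_outside)
```

Both terms receive the same score even though only the first is descriptive of the subset; the second is the kind of term that setting include_negatives to false is intended to remove.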

By default, the assumption is that the documents in the bucket are also contained in the background. If instead you defined a custom background filter that represents a different set of documents that you want to compare to, set

"background_is_superset": false

Chi square

Chi square as described in "Information Retrieval", Manning et al., Chapter 13.5.2 can be used as significance score by adding the parameter

	 "chi_square": {
	 }

Chi square behaves like mutual information and can be configured with the same parameters include_negatives and background_is_superset.

Google normalized distance

Google normalized distance as described in "The Google Similarity Distance", Cilibrasi and Vitanyi, 2007 (http://arxiv.org/pdf/cs/0412098v3.pdf) can be used as significance score by adding the parameter

	 "gnd": {
	 }

gnd also accepts the background_is_superset parameter.

Percentage

A simple calculation of the number of documents in the foreground sample with a term divided by the number of documents in the background with the term. By default this produces a score greater than zero and no greater than one.

The benefit of this heuristic is that the scoring logic is simple to explain to anyone familiar with a "per capita" statistic. However, for fields with high cardinality there is a tendency for this heuristic to select the rarest terms such as typos that occur only once because they score 1/1 = 100%.

It would be hard for a seasoned boxer to win a championship if the prize was awarded purely on the basis of percentage of fights won - by these rules a newcomer with only one fight under his belt would be impossible to beat. Multiple observations are typically required to reinforce a view so it is recommended in these cases to set both min_doc_count and shard_min_doc_count to a higher value such as 10 in order to filter out the low-frequency terms that otherwise take precedence.
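The effect is easy to reproduce with a few made-up numbers. The Python sketch below ranks three hypothetical terms by the percentage calculation described above, first without and then with a min_doc_count-style threshold on the foreground count; the term names and counts are invented for illustration.

```python
def percentage_score(subset_freq, superset_freq):
    """Foreground document count divided by background document count."""
    if superset_freq == 0:
        return 0.0
    return subset_freq / superset_freq


# Hypothetical statistics: term -> (foreground count, background count).
terms = {
    "typo_once": (1, 1),
    "common_signal": (90, 100),
    "noise": (20, 2000),
}


def rank(term_stats, min_doc_count=1):
    """Rank terms by percentage, dropping those with too few foreground docs."""
    kept = {t: c for t, c in term_stats.items() if c[0] >= min_doc_count}
    return sorted(kept, key=lambda t: percentage_score(*kept[t]), reverse=True)


unfiltered = rank(terms)
filtered = rank(terms, min_doc_count=10)
print(unfiltered)
print(filtered)
```

Without the threshold the one-off typo wins with a score of 1/1; with a threshold of 10 it is removed before ranking.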

	 "percentage": {
	 }

Which one is best?

Roughly, mutual_information prefers high-frequency terms even if they also occur frequently in the background. For example, in an analysis of natural language text this might lead to the selection of stop words. mutual_information is unlikely to select very rare terms like misspellings. gnd prefers terms with a high co-occurrence and avoids selecting stop words. It might be better suited for synonym detection. However, gnd has a tendency to select very rare terms that are, for example, a result of misspelling. chi_square and jlh are somewhat in-between.

It is hard to say which one of the different heuristics will be the best choice as it depends on what the significant terms are used for (see for example Yang and Pedersen, "A Comparative Study on Feature Selection in Text Categorization", 1997 (http://courses.ischool.berkeley.edu/i256/f06/papers/yang97comparative.pdf) for a study on using significant terms for feature selection for text classification).

If none of the above measures suits your use case, then another option is to implement a custom significance measure:

Scripted

Customized scores can be implemented via a script:

	 "script_heuristic": {
	     "script": {
	         "lang": "painless",
	         "source": "params._subset_freq/(params._superset_freq - params._subset_freq + 1)"
	     }
	 }

Scripts can be inline (as in the above example), indexed or stored on disk. For details on the options, see the script documentation.

Available parameters in the script are

_subset_freq

Number of documents in the subset that contain the term.

_superset_freq

Number of documents in the superset that contain the term.

_subset_size

Number of documents in the subset.

_superset_size

Number of documents in the superset.

Size & Shard Size

The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. If the number of unique terms is greater than size, the returned list can be slightly inaccurate: the term counts may be slightly off, and a term that should have been in the top size buckets may not be returned at all.

To ensure better accuracy a multiple of the final size is used as the number of terms to request from each shard (2 * (size * 1.5 + 10)). To take manual control of this setting the shard_size parameter can be used to control the volumes of candidate terms produced by each shard.
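As a quick worked example of the formula quoted above, the following Python sketch evaluates 2 * (size * 1.5 + 10) for a few values of size. It only reproduces the documented arithmetic; truncating the result to an integer is an assumption of the sketch, and it does not attempt to model any dependence on the number of shards.

```python
def default_shard_size(size):
    """Number of terms requested per shard according to the quoted formula."""
    return int(2 * (size * 1.5 + 10))


for size in (5, 10, 100):
    print(f"size={size:>3} -> shard_size={default_shard_size(size)}")
```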

Low-frequency terms can turn out to be the most interesting ones once all results are combined, so the significant_terms aggregation can produce higher-quality results when the shard_size parameter is set to values significantly higher than the size setting. This ensures that a bigger volume of promising candidate terms is given a consolidated review by the reducing node before the final selection. Obviously, large candidate term lists will cause extra network traffic and RAM usage, so this is a quality/cost trade-off that needs to be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.

Note
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.

Minimum document count

It is possible to only return terms that match at least a configured number of hits using the min_doc_count option:

GET /_search
{
    "aggs" : {
        "tags" : {
            "significant_terms" : {
                "field" : "tag",
                "min_doc_count": 10
            }
        }
    }
}

The above aggregation would only return tags which have been found in 10 hits or more. The default value is 3.

Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The min_doc_count criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add the term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing from the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

shard_min_doc_count parameter

The parameter shard_min_doc_count regulates how certain a shard must be, with respect to min_doc_count, that a term should actually be added to the candidate list. Terms will only be considered if their local shard frequency within the set is at least the shard_min_doc_count. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required min_doc_count even after merging the local frequencies. shard_min_doc_count is set to 1 by default and has no effect unless you explicitly set it.
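The interaction between shard_size, min_doc_count and shard_min_doc_count can be illustrated with a toy simulation. The Python sketch below is a deliberately simplified model, not Elasticsearch's code: each shard returns its top shard_size terms by a made-up local score, the coordinating step only sums document counts, and both thresholds are treated as inclusive. All term names and numbers are invented.

```python
from collections import Counter


def shard_candidates(local_terms, shard_size, shard_min_doc_count):
    """Top `shard_size` terms on one shard, ranked by local score."""
    eligible = [
        (term, count, score)
        for term, (count, score) in local_terms.items()
        if count >= shard_min_doc_count  # shard-level filter
    ]
    eligible.sort(key=lambda item: item[2], reverse=True)
    return eligible[:shard_size]


def merged_terms(shards, shard_size, min_doc_count, shard_min_doc_count=1):
    """Merge shard candidates and apply min_doc_count to the summed counts."""
    totals = Counter()
    for local_terms in shards:
        for term, count, _score in shard_candidates(
                local_terms, shard_size, shard_min_doc_count):
            totals[term] += count
    return sorted(t for t, c in totals.items() if c >= min_doc_count)


# term -> (local doc count, local score); rare typos score highest locally.
shards = [
    {"popular": (6, 0.5), "typo_a": (1, 0.9), "typo_b": (1, 0.8)},
    {"popular": (5, 0.5), "typo_c": (1, 0.9), "typo_d": (1, 0.8)},
]

crowded_out = merged_terms(shards, shard_size=2, min_doc_count=3)
with_shard_filter = merged_terms(shards, shard_size=2, min_doc_count=3,
                                 shard_min_doc_count=2)
with_bigger_shard_size = merged_terms(shards, shard_size=3, min_doc_count=3)
print(crowded_out, with_shard_filter, with_bigger_shard_size)
```

In the first call the typos fill every candidate list and are then discarded by min_doc_count, so the globally frequent term never reaches the final result. Either filtering on the shard or requesting more candidates per shard brings it back.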

Warning
Setting min_doc_count to 1 is generally not advised as it tends to return terms that are typos or other bizarre curiosities. Finding more than one instance of a term helps reinforce that, while still rare, the term was not the result of a one-off accident. The default value of 3 is used to provide a minimum weight-of-evidence. Setting shard_min_doc_count too high will cause significant candidate terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.

Custom background context

The default source of statistical information for background term frequencies is the entire index and this scope can be narrowed through the use of a background_filter to focus in on significant terms within a narrower context:

GET /_search
{
    "query" : {
        "match" : {
            "city" : "madrid"
        }
    },
    "aggs" : {
        "tags" : {
            "significant_terms" : {
                "field" : "tag",
                "background_filter": {
                    "term" : { "text" : "spain" }
                }
            }
        }
    }
}

The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing terms like "Spanish" that are unusual in the full index’s worldwide context but commonplace in the subset of documents containing the word "Spain".

Warning
Use of background filters will slow the query as each term’s postings must be filtered to determine a frequency.

Filtering Values

It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the include and exclude parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features described in the terms aggregation documentation.
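As a small illustration, the Python sketch below builds two request bodies, one filtering with regular expression strings and one with an array of exact terms. The field name, patterns and term values are placeholders, and the sketch keeps include and exclude of the same kind within a single request.

```python
import json


def filtered_significant_terms(field, include=None, exclude=None):
    """Build a significant_terms body, adding include/exclude only when given."""
    params = {"field": field}
    if include is not None:
        params["include"] = include
    if exclude is not None:
        params["exclude"] = exclude
    return {"aggs": {"tags": {"significant_terms": params}}}


# Regular expression strings for both parameters.
by_pattern = filtered_significant_terms(
    "tag", include="sport.*", exclude="sport_news.*")
# An array of exact terms.
by_exact_terms = filtered_significant_terms("tag", include=["rugby", "cricket"])

print(json.dumps(by_pattern, indent=4))
print(json.dumps(by_exact_terms, indent=4))
```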

Execution hint

There are different mechanisms by which terms aggregations can be executed:

  • by using field values directly in order to aggregate data per-bucket (map)

  • by using global ordinals of the field and allocating one bucket per global ordinal (global_ordinals)

Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.

global_ordinals is the default option for keyword fields. It uses global ordinals to allocate buckets dynamically, so memory usage is linear in the number of values of the documents that are part of the aggregation scope.

map should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode is significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have ordinals.

GET /_search
{
    "aggs" : {
        "tags" : {
             "significant_terms" : {
                 "field" : "tags",
                 "execution_hint": "map" (1)
             }
         }
    }
}
  1. the possible values are map, global_ordinals

Please note that Elasticsearch will ignore this execution hint if it is not applicable.

Significant Text Aggregation

An aggregation that returns interesting or unusual occurrences of free-text terms in a set. It is like the significant terms aggregation but differs in that:

  • It is specifically designed for use on type text fields

  • It does not require field data or doc-values

  • It re-analyzes text content on-the-fly meaning it can also filter duplicate sections of noisy text that otherwise tend to skew statistics.

Warning
Re-analyzing large result sets will require a lot of time and memory. It is recommended that the significant_text aggregation is used as a child of either the sampler or diversified sampler aggregation to limit the analysis to a small selection of top-matching documents e.g. 200. This will typically improve speed, memory use and quality of results.

Example use cases:

  • Suggesting "H5N1" when users search for "bird flu" to help expand queries

  • Suggesting keywords relating to stock symbol $ATI for use in an automated news classifier

In these cases the words being selected are not simply the most popular terms in results. The most popular words tend to be very boring (and, of, the, we, I, they …​). The significant words are the ones that have undergone a significant change in popularity measured between a foreground and background set. If the term "H5N1" only exists in 5 documents in a 10 million document index and yet is found in 4 of the 100 documents that make up a user’s search results that is significant and probably very relevant to their search. 5/10,000,000 vs 4/100 is a big swing in frequency.
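The size of that swing can be worked out directly from the figures in the paragraph above. The short Python sketch below just does the arithmetic; it compares the two frequencies as a ratio and is not the scoring function used by the aggregation.

```python
# Figures from the "H5N1" example above.
foreground = 4 / 100          # share of the user's results containing the term
background = 5 / 10_000_000   # share of the whole index containing the term
ratio = foreground / background

print(f"foreground frequency: {foreground:.2%}")
print(f"background frequency: {background:.5%}")
print(f"the term is roughly {ratio:,.0f} times more frequent in the results")
```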

Basic use

In the typical use case, the foreground set of interest is a selection of the top-matching search results for a query and the background set used for statistical comparisons is the index or indices from which the results were gathered.

Example:

GET news/article/_search
{
    "query" : {
        "match" : {"content" : "Bird flu"}
    },
    "aggregations" : {
        "my_sample" : {
            "sampler" : {
                "shard_size" : 100
            },
            "aggregations": {
                "keywords" : {
                    "significant_text" : { "field" : "content" }
                }
            }
        }
    }
}

Response:

{
  "took": 9,
  "timed_out": false,
  "_shards": ...,
  "hits": ...,
    "aggregations" : {
        "my_sample": {
            "doc_count": 100,
            "keywords" : {
                "doc_count": 100,
                "buckets" : [
                    {
                        "key": "h5n1",
                        "doc_count": 4,
                        "score": 4.71235374214817,
                        "bg_count": 5
                    }
                    ...
                ]
            }
        }
    }
}

The results show that "h5n1" is one of several terms strongly associated with bird flu. It only occurs 5 times in our index as a whole (see the bg_count) and yet 4 of these were lucky enough to appear in our 100 document sample of "bird flu" results. That suggests a significant word and one which the user can potentially add to their search.

Dealing with noisy data using filter_duplicate_text

Free-text fields often contain a mix of original content and mechanical copies of text (cut-and-paste biographies, email reply chains, retweets, boilerplate headers/footers, page navigation menus, sidebar news links, copyright notices, standard disclaimers, addresses).

In real-world data these duplicate sections of text tend to feature heavily in significant_text results if they aren’t filtered out. Filtering near-duplicate text is a difficult task at index-time but we can cleanse the data on-the-fly at query time using the filter_duplicate_text setting.

First let’s look at an unfiltered real-world example using the Signal media dataset of a million news articles covering a wide variety of news. Here are the raw significant text results for a search for the articles mentioning "elasticsearch":

{
    ...
  "aggregations": {
    "sample": {
      "doc_count": 35,
      "keywords": {
        "doc_count": 35,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 35,
            "score": 28570.428571428572,
            "bg_count": 35
          },
          ...
          {
            "key": "currensee",
            "doc_count": 8,
            "score": 6530.383673469388,
            "bg_count": 8
          },
          ...
          {
            "key": "pozmantier",
            "doc_count": 4,
            "score": 3265.191836734694,
            "bg_count": 4
          },
          ...

}

The uncleansed documents have thrown up some odd-looking terms that are, on the face of it, statistically correlated with appearances of our search term "elasticsearch" e.g. "pozmantier". We can drill down into examples of these documents to see why pozmantier is connected using this query:

GET news/article/_search
{
  "query": {
    "simple_query_string": {
      "query": "+elasticsearch  +pozmantier"
    }
  },
  "_source": [
    "title",
    "source"
  ],
  "highlight": {
    "fields": {
      "content": {}
    }
  }
}

The results show a series of very similar news articles about a judging panel for a number of tech projects:

{
  ...
  "hits": {
    "hits": [
      {
        ...
        "_source": {
          "source": "Presentation Master",
          "title": "T.E.N. Announces Nominees for the 2015 ISE® North America Awards"
        },
        "highlight": {
          "content": [
            "City of San Diego Mike <em>Pozmantier</em>, Program Manager, Cyber Security Division, Department of",
            " Janus, Janus <em>ElasticSearch</em> Security Visualization Engine "
          ]
        }
      },
      {
        ...
        "_source": {
          "source": "RCL Advisors",
          "title": "T.E.N. Announces Nominees for the 2015 ISE(R) North America Awards"
        },
        "highlight": {
          "content": [
            "Mike <em>Pozmantier</em>, Program Manager, Cyber Security Division, Department of Homeland Security S&T",
            "Janus, Janus <em>ElasticSearch</em> Security Visualization Engine"
          ]
        }
      },
      ...

Mike Pozmantier was one of many judges on a panel and elasticsearch was used in one of many projects being judged.

As is typical, this lengthy press release was cut-and-pasted by a variety of news sites and consequently any rare names, numbers or typos it contains become statistically correlated with our matching query.

Fortunately similar documents tend to rank similarly so as part of examining the stream of top-matching documents the significant_text aggregation can apply a filter to remove sequences of any 6 or more tokens that have already been seen. Let’s try this same query now but with the filter_duplicate_text setting turned on:

GET news/article/_search
{
  "query": {
    "match": {
      "content": "elasticsearch"
    }
  },
  "aggs": {
    "sample": {
      "sampler": {
        "shard_size": 100
      },
      "aggs": {
        "keywords": {
          "significant_text": {
            "field": "content",
            "filter_duplicate_text": true
          }
        }
      }
    }
  }
}

The results from analysing our deduplicated text are obviously of higher quality to anyone familiar with the Elastic Stack:

{
  ...
  "aggregations": {
    "sample": {
      "doc_count": 35,
      "keywords": {
        "doc_count": 35,
        "buckets": [
          {
            "key": "elasticsearch",
            "doc_count": 22,
            "score": 11288.001166180758,
            "bg_count": 35
          },
          {
            "key": "logstash",
            "doc_count": 3,
            "score": 1836.648979591837,
            "bg_count": 4
          },
          {
            "key": "kibana",
            "doc_count": 3,
            "score": 1469.3020408163263,
            "bg_count": 5
          }
        ]
      }
    }
  }
}

Mr Pozmantier and other one-off associations with elasticsearch, which were a consequence of copy-and-paste operations or other forms of mechanical repetition, no longer appear in the aggregation results.
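The idea of dropping sequences of 6 or more tokens that have already been seen can be illustrated with the toy Python sketch below. It is not how Elasticsearch implements filter_duplicate_text: it simply remembers every 6-token window from earlier documents in a set, and the token lists are invented to echo the press release example.

```python
def filter_duplicate_sequences(docs, min_run=6):
    """Drop token runs of `min_run` or more tokens seen in earlier documents."""
    seen = set()  # every min_run-token window encountered so far
    cleaned = []
    for tokens in docs:
        windows = [tuple(tokens[i:i + min_run])
                   for i in range(len(tokens) - min_run + 1)]
        duplicate = [False] * len(tokens)
        for start, window in enumerate(windows):
            if window in seen:
                # Mark every token covered by an already-seen window.
                for pos in range(start, start + min_run):
                    duplicate[pos] = True
        cleaned.append([t for t, dup in zip(tokens, duplicate) if not dup])
        seen.update(windows)  # this document's windows count for later documents
    return cleaned


boilerplate = "mike pozmantier program manager cyber security division".split()
docs = [
    ["elasticsearch", "powers", "search"] + boilerplate,
    ["news", "about", "kibana"] + boilerplate,
]
cleaned = filter_duplicate_sequences(docs)
docs_with_name = sum("pozmantier" in tokens for tokens in cleaned)
print(cleaned[1], docs_with_name)
```

The first document is kept whole, while the repeated passage is removed from the second, so the rare name is counted in one document instead of two.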

If your duplicate or near-duplicate content is identifiable via a single-value indexed field (perhaps a hash of the article’s title text or an original_press_release_url field) then it would be more efficient to use a parent diversified sampler aggregation to eliminate these documents from the sample set based on that single key. The less duplicate content you can feed into the significant_text aggregation up front the better in terms of performance.
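A request along these lines might be built as in the Python sketch below. The diversifying field name is taken from the hypothetical original_press_release_url example above, limiting the sample to one document per value is an assumption of the sketch, and the aggregation names simply mirror the earlier requests.

```python
import json


def deduplicated_sample_request(query_text, dedupe_field,
                                text_field="content", shard_size=100):
    """significant_text under a diversified_sampler keyed on `dedupe_field`."""
    return {
        "query": {"match": {text_field: query_text}},
        "aggs": {
            "sample": {
                "diversified_sampler": {
                    "shard_size": shard_size,
                    "field": dedupe_field,
                    # Assumption: keep a single document per distinct key.
                    "max_docs_per_value": 1,
                },
                "aggs": {
                    "keywords": {"significant_text": {"field": text_field}}
                },
            }
        },
    }


request_body = deduplicated_sample_request(
    "elasticsearch", "original_press_release_url")
print(json.dumps(request_body, indent=2))
```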

How are the significance scores calculated?

The numbers returned for scores are primarily intended for ranking different suggestions sensibly rather than something easily understood by end users. The scores are derived from the doc frequencies in foreground and background sets. In brief, a term is considered significant if there is a noticeable difference between the frequency with which it appears in the subset and in the background. The way the terms are ranked can be configured; see the "Parameters" section.

Use the "like this but not this" pattern

You can spot mis-categorized content by first searching a structured field e.g. category:adultMovie and use significant_text on the text "movie_description" field. Take the suggested words (I’ll leave them to your imagination) and then search for all movies NOT marked as category:adultMovie but containing these keywords. You now have a ranked list of badly-categorized movies that you should reclassify or at least remove from the "familyFriendly" category.

The significance score from each term can also provide a useful boost setting to sort matches. Using the minimum_should_match setting of the terms query with the keywords will help control the balance of precision/recall in the result set, i.e. a high setting would return a small number of relevant results packed full of keywords, and a setting of "1" would produce a more exhaustive result set with all documents containing any keyword.

Limitations

No support for child aggregations

The significant_text aggregation intentionally does not support the addition of child aggregations because:

  • It would come with a high memory cost

  • It isn’t a generally useful feature and there is a workaround for those that need it

The volume of candidate terms is generally very high and these are pruned heavily before the final results are returned. Supporting child aggregations would generate additional churn and be inefficient. Clients can always take the heavily-trimmed set of results from a significant_text request and make a subsequent follow-up query using a terms aggregation with an include clause and child aggregations to perform further analysis of selected keywords in a more efficient fashion.

No support for nested objects

The significant_text aggregation currently also cannot be used with text fields in nested objects, because it works with the document JSON source. This makes the feature inefficient when matching nested docs from stored JSON given a matching Lucene docID.

Approximate counts

The counts of how many documents contain a term provided in results are based on summing the samples returned from each shard and as such may be:

  • low if certain shards did not provide figures for a given term in their top sample

  • high when considering the background frequency as it may count occurrences found in deleted documents

Like most design decisions, this is the basis of a trade-off in which we have chosen to provide fast performance at the cost of some (typically small) inaccuracies. However, the size and shard size settings covered in the next section provide tools to help control the accuracy levels.

Parameters

Significance heuristics

This aggregation supports the same scoring heuristics (JLH, mutual_information, gnd, chi_square etc.) as the significant terms aggregation.

Size & Shard Size

The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. If the number of unique terms is greater than size, the returned list can be slightly inaccurate: the term counts may be slightly off, and a term that should have been in the top size buckets may not be returned at all.

To ensure better accuracy a multiple of the final size is used as the number of terms to request from each shard (2 * (size * 1.5 + 10)). To take manual control of this setting the shard_size parameter can be used to control the volumes of candidate terms produced by each shard.

Low-frequency terms can turn out to be the most interesting ones once all results are combined, so the significant_terms aggregation can produce higher-quality results when the shard_size parameter is set to values significantly higher than the size setting. This ensures that a bigger volume of promising candidate terms is given a consolidated review by the reducing node before the final selection. Large candidate term lists will cause extra network traffic and RAM usage, so this is a quality/cost trade-off that needs to be balanced. If shard_size is set to -1 (the default) then shard_size will be automatically estimated based on the number of shards and the size parameter.

Note
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.

Minimum document count

It is possible to only return terms that match at least a configured number of hits using the min_doc_count option. The default value is 3.

Terms that score highly will be collected on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global term frequencies available. The decision whether a term is added to a candidate list depends only on the score computed on the shard using local shard frequencies, not the global frequencies of the word. The min_doc_count criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add the term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency but high-scoring terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

shard_min_doc_count parameter

The parameter shard_min_doc_count regulates how certain a shard must be, with respect to min_doc_count, before it adds a term to the candidate list. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency words and you are not interested in these (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required min_doc_count even after merging the local frequencies. shard_min_doc_count is set to 1 by default and has no effect unless you explicitly set it.

Warning
Setting min_doc_count to 1 is generally not advised as it tends to return terms that are typos or other bizarre curiosities. Finding more than one instance of a term helps reinforce that, while still rare, the term was not the result of a one-off accident. The default value of 3 is used to provide a minimum weight-of-evidence. Setting shard_min_doc_count too high will cause significant candidate terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
Custom background context

The default source of statistical information for background term frequencies is the entire index and this scope can be narrowed through the use of a background_filter to focus in on significant terms within a narrower context:

GET news/article/_search
{
    "query" : {
        "match" : {
            "content" : "madrid"
        }
    },
    "aggs" : {
        "tags" : {
            "significant_text" : {
                "field" : "content",
                "background_filter": {
                    "term" : { "content" : "spain"}
                }
            }
        }
    }
}

The above filter would help focus in on terms that were peculiar to the city of Madrid rather than revealing terms like "Spanish" that are unusual in the full index’s worldwide context but commonplace in the subset of documents containing the word "Spain".

Warning
Use of background filters will slow the query, as each term’s postings must be filtered to determine a frequency.

Dealing with source and index mappings

Ordinarily the indexed field name and the original JSON field being retrieved share the same name. However with more complex field mappings using features like copy_to the source JSON field(s) and the indexed field being aggregated can differ. In these cases it is possible to list the JSON _source fields from which text will be analyzed using the source_fields parameter:

GET news/article/_search
{
    "query" : {
        "match" : {
            "custom_all" : "elasticsearch"
        }
    },
    "aggs" : {
        "tags" : {
            "significant_text" : {
                "field" : "custom_all",
                "source_fields": ["content" , "title"]
            }
        }
    }
}
Filtering Values

It is possible (although rarely required) to filter the values for which buckets will be created. This can be done using the include and exclude parameters which are based on a regular expression string or arrays of exact terms. This functionality mirrors the features described in the terms aggregation documentation.

Terms Aggregation

A multi-bucket value source based aggregation where buckets are dynamically built - one per unique value.

Example:

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : { "field" : "genre" } (1)
        }
    }
}
  1. The field used by the terms aggregation should be of type keyword or any other data type suitable for bucket aggregations. In order to use it with text fields you will need to enable fielddata.

Response:

{
    ...
    "aggregations" : {
        "genres" : {
            "doc_count_error_upper_bound": 0, (1)
            "sum_other_doc_count": 0, (2)
            "buckets" : [ (3)
                {
                    "key" : "electronic",
                    "doc_count" : 6
                },
                {
                    "key" : "rock",
                    "doc_count" : 3
                },
                {
                    "key" : "jazz",
                    "doc_count" : 2
                }
            ]
        }
    }
}
  1. an upper bound of the error on the document counts for each term, see below

  2. when there are lots of unique terms, Elasticsearch only returns the top terms; this number is the sum of the document counts for all buckets that are not part of the response

  3. the list of the top buckets, the meaning of top being defined by the order

By default, the terms aggregation will return the buckets for the top ten terms ordered by the doc_count. One can change this default behaviour by setting the size parameter.

Size

The size parameter can be set to define how many term buckets should be returned out of the overall terms list. By default, the node coordinating the search process will request each shard to provide its own top size term buckets and once all shards respond, it will reduce the results to the final list that will then be returned to the client. This means that if the number of unique terms is greater than size, the returned list is slightly off and not accurate (it could be that the term counts are slightly off and it could even be that a term that should have been in the top size buckets was not returned).

Note
If you want to retrieve all terms or all combinations of terms in a nested terms aggregation, you should use the Composite aggregation, which allows you to paginate over all possible terms, rather than setting a size greater than the cardinality of the field in the terms aggregation. The terms aggregation is meant to return the top terms and does not allow pagination.

Document counts are approximate

As described above, the document counts (and the results of any sub aggregations) in the terms aggregation are not always accurate. This is because each shard provides its own view of what the ordered list of terms should be and these are combined to give a final view. Consider the following scenario:

A request is made to obtain the top 5 terms in the field product, ordered by descending document count from an index with 3 shards. In this case each shard is asked to give its top 5 terms.

GET /_search
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5
            }
        }
    }
}

The terms for each of the three shards are shown below with their respective document counts in brackets:

      Shard A           Shard B           Shard C
 1    Product A (25)    Product A (30)    Product A (45)
 2    Product B (18)    Product B (25)    Product C (44)
 3    Product C (6)     Product F (17)    Product Z (36)
 4    Product D (3)     Product Z (16)    Product G (30)
 5    Product E (2)     Product G (15)    Product E (29)
 6    Product F (2)     Product H (14)    Product H (28)
 7    Product G (2)     Product I (10)    Product Q (2)
 8    Product H (2)     Product Q (6)     Product D (1)
 9    Product I (1)     Product J (6)
10    Product J (1)     Product C (4)

The shards will return their top 5 terms so the results from the shards will be:

      Shard A           Shard B           Shard C
 1    Product A (25)    Product A (30)    Product A (45)
 2    Product B (18)    Product B (25)    Product C (44)
 3    Product C (6)     Product F (17)    Product Z (36)
 4    Product D (3)     Product Z (16)    Product G (30)
 5    Product E (2)     Product G (15)    Product E (29)

Taking the top 5 results from each of the shards (as requested) and combining them to make a final top 5 list produces the following:

 1    Product A (100)
 2    Product Z (52)
 3    Product C (50)
 4    Product G (45)
 5    Product B (43)

Because Product A was returned from all shards we know that its document count value is accurate. Product C was only returned by shards A and C so its document count is shown as 50 but this is not an accurate count. Product C exists on shard B, but its count of 4 was not high enough to put Product C into the top 5 list for that shard. Product Z was also returned only by 2 shards but the third shard does not contain the term. There is no way of knowing, at the point of combining the results to produce the final list of terms, that there is an error in the document count for Product C and not for Product Z. Product H has a document count of 44 across all 3 shards but was not included in the final list of terms because it did not make it into the top five terms on any of the shards.
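
The scenario above can be replayed in a few lines of Python. This is a minimal sketch, not Elasticsearch's implementation: the function names top_terms and merge_shard_results are hypothetical, the per-shard counts are copied from the first table, and ties are assumed to be broken by term in ascending order.

```python
from collections import Counter

# Per-shard document counts, copied from the table above.
SHARDS = {
    "A": {"Product A": 25, "Product B": 18, "Product C": 6, "Product D": 3,
          "Product E": 2, "Product F": 2, "Product G": 2, "Product H": 2,
          "Product I": 1, "Product J": 1},
    "B": {"Product A": 30, "Product B": 25, "Product F": 17, "Product Z": 16,
          "Product G": 15, "Product H": 14, "Product I": 10, "Product Q": 6,
          "Product J": 6, "Product C": 4},
    "C": {"Product A": 45, "Product C": 44, "Product Z": 36, "Product G": 30,
          "Product E": 29, "Product H": 28, "Product Q": 2, "Product D": 1},
}


def top_terms(counts, size):
    """Return the `size` largest (term, count) pairs; ties broken by term, ascending."""
    return sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))[:size]


def merge_shard_results(shards, size, shard_size=None):
    """Merge each shard's local top terms into a final top `size` list."""
    shard_size = size if shard_size is None else shard_size
    merged = Counter()
    for counts in shards.values():
        # Each shard only reports its own top `shard_size` terms.
        for term, count in top_terms(counts, shard_size):
            merged[term] += count
    return top_terms(merged, size)


approximate = merge_shard_results(SHARDS, size=5)             # each shard returns 5 terms
exact = merge_shard_results(SHARDS, size=5, shard_size=10)    # each shard returns all its terms
```

With each shard limited to 5 terms, the sketch reproduces the approximate list above. When every shard reports all of its terms, the merged list shows Product C at 54 and Product H at 44, and Product B drops out of the top five.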

Shard Size

The higher the requested size is, the more accurate the results will be, but also, the more expensive it will be to compute the final results (both due to bigger priority queues that are managed on a shard level and due to bigger data transfers between the nodes and the client).

The shard_size parameter can be used to minimize the extra work that comes with a bigger requested size. When defined, it determines how many terms the coordinating node will request from each shard. Once all the shards have responded, the coordinating node reduces them to a final result based on the size parameter. This way, one can increase the accuracy of the returned terms and avoid the overhead of streaming a big list of buckets back to the client.

Note
shard_size cannot be smaller than size (as it doesn’t make much sense). When it is, Elasticsearch will override it and reset it to be equal to size.

The default shard_size is (size * 1.5 + 10).
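
As a rough illustration of the two rules above, the following sketch computes how many terms would be requested from each shard. The function name effective_shard_size is hypothetical, and truncating the default to an integer is an assumption.

```python
def effective_shard_size(size, shard_size=None):
    """Number of terms requested from each shard for a terms aggregation."""
    if shard_size is None:
        # Default stated above: size * 1.5 + 10.
        # Truncating to an integer is an assumption of this sketch.
        return int(size * 1.5 + 10)
    # A shard_size smaller than size is reset to size.
    return max(shard_size, size)
```

For example, the default size of 10 gives 25 terms per shard, and an explicit shard_size of 3 with a size of 10 is raised to 10.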

Calculating Document Count Error

There are two error values which can be shown on the terms aggregation. The first gives a value for the aggregation as a whole which represents the maximum potential document count for a term which did not make it into the final list of terms. This is calculated as the sum of the document count from the last term returned from each shard. For the example given above the value would be 46 (2 + 15 + 29). This means that in the worst case scenario a term which was not returned could have the 4th highest document count.

{
    ...
    "aggregations" : {
        "products" : {
            "doc_count_error_upper_bound" : 46,
            "sum_other_doc_count" : 79,
            "buckets" : [
                {
                    "key" : "Product A",
                    "doc_count" : 100
                },
                {
                    "key" : "Product Z",
                    "doc_count" : 52
                }
                ...
            ]
        }
    }
}

Per bucket document count error

The second error value can be enabled by setting the show_term_doc_count_error parameter to true:

GET /_search
{
    "aggs" : {
        "products" : {
            "terms" : {
                "field" : "product",
                "size" : 5,
                "show_term_doc_count_error": true
            }
        }
    }
}

This shows an error value for each term returned by the aggregation which represents the 'worst case' error in the document count and can be useful when deciding on a value for the shard_size parameter. This is calculated by summing the document counts for the last term returned by all shards which did not return the term. In the example above the error in the document count for Product C would be 15, as Shard B was the only shard not to return the term and the document count of the last term it did return was 15. The actual document count of Product C was 54, so the document count was only actually off by 4 even though the worst case was that it would be off by 15. Product A, however, has an error of 0 for its document count: since every shard returned it, we can be confident that the count returned is accurate.

{
    ...
    "aggregations" : {
        "products" : {
            "doc_count_error_upper_bound" : 46,
            "sum_other_doc_count" : 79,
            "buckets" : [
                {
                    "key" : "Product A",
                    "doc_count" : 100,
                    "doc_count_error_upper_bound" : 0
                },
                {
                    "key" : "Product Z",
                    "doc_count" : 52,
                    "doc_count_error_upper_bound" : 2
                }
                ...
            ]
        }
    }
}
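
Both error values can be checked by hand against the example. The sketch below is a simplified model rather than Elasticsearch's code: the function names are hypothetical, and RETURNED holds only the five terms each shard returned in the scenario above.

```python
# The five terms each shard returned, in rank order, from the example above.
RETURNED = {
    "A": [("Product A", 25), ("Product B", 18), ("Product C", 6),
          ("Product D", 3), ("Product E", 2)],
    "B": [("Product A", 30), ("Product B", 25), ("Product F", 17),
          ("Product Z", 16), ("Product G", 15)],
    "C": [("Product A", 45), ("Product C", 44), ("Product Z", 36),
          ("Product G", 30), ("Product E", 29)],
}


def aggregation_error_upper_bound(returned):
    """Sum of the document count of the last term returned by every shard."""
    return sum(terms[-1][1] for terms in returned.values())


def term_error_upper_bound(term, returned):
    """Sum of the last returned count over the shards that did not return `term`."""
    error = 0
    for terms in returned.values():
        if term not in dict(terms):
            error += terms[-1][1]
    return error


overall = aggregation_error_upper_bound(RETURNED)
per_term = {t: term_error_upper_bound(t, RETURNED)
            for t in ("Product A", "Product Z", "Product C")}
```

Here overall is 46 (2 + 15 + 29), and per_term gives 0 for Product A, 2 for Product Z and 15 for Product C, matching the values discussed above.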

These errors can only be calculated in this way when the terms are ordered by descending document count. When the aggregation is ordered by the terms values themselves (either ascending or descending) there is no error in the document count since if a shard does not return a particular term which appears in the results from another shard, it must not have that term in its index. When the aggregation is either sorted by a sub aggregation or in order of ascending document count, the error in the document counts cannot be determined and is given a value of -1 to indicate this.

Order

The order of the buckets can be customized by setting the order parameter. By default, the buckets are ordered by their doc_count descending. It is possible to change this behaviour as documented below:

Warning
Sorting by ascending _count or by sub aggregation is discouraged as it increases the error on document counts. It is fine when a single shard is queried, or when the field that is being aggregated was used as a routing key at index time: in these cases results will be accurate since shards have disjoint values. However otherwise, errors are unbounded. One particular case that could still be useful is sorting by min or max aggregation: counts will not be accurate but at least the top buckets will be correctly picked.

Ordering the buckets by their doc_count in an ascending manner:

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "field" : "genre",
                "order" : { "_count" : "asc" }
            }
        }
    }
}

Ordering the buckets alphabetically by their terms in an ascending manner:

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "field" : "genre",
                "order" : { "_key" : "asc" }
            }
        }
    }
}

Deprecated in 6.0.0. Use _key instead of _term to order buckets by their term.

Ordering the buckets by single value metrics sub-aggregation (identified by the aggregation name):

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "field" : "genre",
                "order" : { "max_play_count" : "desc" }
            },
            "aggs" : {
                "max_play_count" : { "max" : { "field" : "play_count" } }
            }
        }
    }
}

Ordering the buckets by multi value metrics sub-aggregation (identified by the aggregation name):

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "field" : "genre",
                "order" : { "playback_stats.max" : "desc" }
            },
            "aggs" : {
                "playback_stats" : { "stats" : { "field" : "play_count" } }
            }
        }
    }
}
Note
Pipeline aggs cannot be used for sorting

Pipeline aggregations are run during the reduce phase after all other aggregations have already completed. For this reason, they cannot be used for ordering.

It is also possible to order the buckets based on a "deeper" aggregation in the hierarchy. This is supported as long as the aggregations in the path are of a single-bucket type, where the last aggregation in the path may be either a single-bucket one or a metrics one. If it’s a single-bucket type, the order will be defined by the number of docs in the bucket (i.e. doc_count); if it’s a metrics one, the same rules as above apply (where the path must indicate the metric name to sort by in case of a multi-value metrics aggregation, and in case of a single-value metrics aggregation the sort will be applied on that value).

The path must be defined in the following form:

AGG_SEPARATOR       =  '>' ;
METRIC_SEPARATOR    =  '.' ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;

GET /_search
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "artist.country",
                "order" : { "rock>playback_stats.avg" : "desc" }
            },
            "aggs" : {
                "rock" : {
                    "filter" : { "term" : { "genre" :  "rock" }},
                    "aggs" : {
                        "playback_stats" : { "stats" : { "field" : "play_count" }}
                    }
                }
            }
        }
    }
}

The above will sort the artist’s countries buckets based on the average play count among the rock songs.
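
To make the path grammar concrete, here is a minimal Python sketch that splits an order path into its parts. The function parse_order_path is hypothetical and assumes that aggregation names contain neither '>' nor '.'.

```python
def parse_order_path(path):
    """Split an order path into (list of aggregation names, metric or None)."""
    # AGG_SEPARATOR '>' separates the aggregation names in the path.
    *parents, last = path.split(">")
    # METRIC_SEPARATOR '.' optionally introduces the metric of a
    # multi-value metrics aggregation at the end of the path.
    name, sep, metric = last.partition(".")
    return parents + [name], (metric if sep else None)


agg_names, metric = parse_order_path("rock>playback_stats.avg")
```

For the request above, agg_names is ["rock", "playback_stats"] and metric is "avg". A single-value path such as "max_play_count" yields no metric.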

Multiple criteria can be used to order the buckets by providing an array of order criteria such as the following:

GET /_search
{
    "aggs" : {
        "countries" : {
            "terms" : {
                "field" : "artist.country",
                "order" : [ { "rock>playback_stats.avg" : "desc" }, { "_count" : "desc" } ]
            },
            "aggs" : {
                "rock" : {
                    "filter" : { "term" : { "genre" : "rock" }},
                    "aggs" : {
                        "playback_stats" : { "stats" : { "field" : "play_count" }}
                    }
                }
            }
        }
    }
}

The above will sort the artist’s countries buckets based on the average play count among the rock songs and then by their doc_count in descending order.

Note
In the event that two buckets share the same values for all order criteria the bucket’s term value is used as a tie-breaker in ascending alphabetical order to prevent non-deterministic ordering of buckets.
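
The combined effect of several order criteria plus the tie-breaker can be pictured as a sort on a composite key. The sketch below uses made-up bucket values and a hypothetical rock_avg field standing in for rock>playback_stats.avg; it only illustrates the ordering, not how Elasticsearch implements it.

```python
# Hypothetical buckets; "rock_avg" stands in for rock>playback_stats.avg.
buckets = [
    {"key": "uk", "doc_count": 7, "rock_avg": 120.0},
    {"key": "de", "doc_count": 7, "rock_avg": 120.0},
    {"key": "us", "doc_count": 9, "rock_avg": 120.0},
    {"key": "fr", "doc_count": 3, "rock_avg": 150.0},
]


def order_buckets(buckets):
    """Sort by rock_avg desc, then doc_count desc, then key asc as tie-breaker."""
    return sorted(buckets, key=lambda b: (-b["rock_avg"], -b["doc_count"], b["key"]))


ordered_keys = [b["key"] for b in order_buckets(buckets)]
```

The buckets for "de" and "uk" agree on both criteria, so the term value decides and "de" comes first.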

Minimum document count

It is possible to only return terms that match at least a configured number of hits using the min_doc_count option:

GET /_search
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "min_doc_count": 10
            }
        }
    }
}

The above aggregation would only return tags which have been found in 10 hits or more. Default value is 1.

Terms are collected and ordered on a shard level and merged with the terms collected from other shards in a second step. However, the shard does not have the information about the global document count available. The decision whether a term is added to a candidate list depends only on the order computed on the shard using local shard frequencies. The min_doc_count criterion is only applied after merging the local term statistics of all shards. In a way, the decision to add the term as a candidate is made without being very certain whether the term will actually reach the required min_doc_count. This might cause many (globally) high-frequency terms to be missing in the final result if low-frequency terms populated the candidate lists. To avoid this, the shard_size parameter can be increased to allow more candidate terms on the shards. However, this increases memory consumption and network traffic.

shard_min_doc_count parameter

The parameter shard_min_doc_count regulates how certain a shard must be, with respect to min_doc_count, before it adds a term to the candidate list. Terms will only be considered if their local shard frequency within the set is higher than the shard_min_doc_count. If your dictionary contains many low-frequency terms and you are not interested in those (for example misspellings), then you can set the shard_min_doc_count parameter to filter out candidate terms on a shard level that will, with reasonable certainty, not reach the required min_doc_count even after merging the local counts. shard_min_doc_count is set to 0 by default and has no effect unless you explicitly set it.

Note
Setting min_doc_count=0 will also return buckets for terms that didn’t match any hit. However, some of the returned terms which have a document count of zero might only belong to deleted documents or documents from other types, so there is no guarantee that a match_all query would find a positive document count for those terms.

Warning
When NOT sorting on doc_count descending, high values of min_doc_count may return a number of buckets which is less than size because not enough data was gathered from the shards. Missing buckets can be brought back by increasing shard_size. Setting shard_min_doc_count too high will cause terms to be filtered out on a shard level. This value should be set much lower than min_doc_count/#shards.
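
The interaction between the two thresholds can be illustrated with a toy model. The helper names are hypothetical, shard_size truncation is ignored, and the sketch assumes a shard reports a term when its local count is at least shard_min_doc_count; the sample values avoid the boundary so the outcome does not depend on that assumption.

```python
def merged_doc_count(local_counts, shard_min_doc_count=0):
    """Sum the local counts of the shards that report the term at all."""
    # Assumption: the shard-level comparison is inclusive.
    return sum(c for c in local_counts if c >= shard_min_doc_count)


def survives(local_counts, min_doc_count, shard_min_doc_count=0):
    """True if the term still reaches min_doc_count after the shard-level filter."""
    return merged_doc_count(local_counts, shard_min_doc_count) >= min_doc_count


# A term with 11 documents spread over 3 shards, and min_doc_count = 10.
local_counts = [2, 3, 6]
kept_with_low_threshold = survives(local_counts, 10, shard_min_doc_count=1)
kept_with_high_threshold = survives(local_counts, 10, shard_min_doc_count=4)
```

With min_doc_count/#shards at roughly 3.3, a shard_min_doc_count of 1 keeps the term, while a value of 4 discards two of the three local counts and the term no longer reaches min_doc_count.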

Script

Generating the terms using a script:

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "script" : {
                    "source": "doc['genre'].value",
                    "lang": "painless"
                }
            }
        }
    }
}

This will interpret the script parameter as an inline script with the default script language and no script parameters. To use a stored script use the following syntax:

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "script" : {
                    "id": "my_script",
                    "params": {
                        "field": "genre"
                    }
                }
            }
        }
    }
}

Value Script

GET /_search
{
    "aggs" : {
        "genres" : {
            "terms" : {
                "field" : "genre",
                "script" : {
                    "source" : "'Genre: ' +_value",
                    "lang" : "painless"
                }
            }
        }
    }
}

Filtering Values

It is possible to filter the values for which buckets will be created. This can be done using the include and exclude parameters which are based on regular expression strings or arrays of exact values. Additionally, include clauses can filter using partition expressions.

Filtering Values with regular expressions
GET /_search
{
    "aggs" : {
        "tags" : {
            "terms" : {
                "field" : "tags",
                "include" : ".*sport.*",
                "exclude" : "water_.*"
            }
        }
    }
}

In the above example, buckets will be created for all the tags that have the word sport in them, except those starting with water_ (so the tag water_sports will not be aggregated). The include regular expression determines which values are "allowed" to be aggregated, while the exclude determines the values that should not be aggregated. When both are defined, the exclude has precedence: the include is evaluated first and only then the exclude.

The syntax is the same as regexp queries.
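
The include-then-exclude evaluation can be sketched as follows. Python's re module is only a stand-in here, since its syntax differs from the regexp query syntax in general; re.fullmatch is used on the assumption that patterns must match the whole term. The function filter_terms and the sample tags are hypothetical.

```python
import re


def filter_terms(terms, include=None, exclude=None):
    """Keep terms matching `include` (if given) and not matching `exclude`."""
    kept = []
    for term in terms:
        # The include pattern decides which values are allowed at all.
        if include is not None and not re.fullmatch(include, term):
            continue
        # The exclude pattern is applied afterwards and takes precedence.
        if exclude is not None and re.fullmatch(exclude, term):
            continue
        kept.append(term)
    return kept


tags = ["sports", "water_sports", "motorsport", "music"]
kept_tags = filter_terms(tags, include=".*sport.*", exclude="water_.*")
```

Here water_sports passes the include pattern but is then removed by the exclude pattern, leaving sports and motorsport.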

Filtering Values with exact values

For matching based on exact values the include and exclude parameters can simply take an array of strings that represent the terms as they are found in the index:

GET /_search
{
    "aggs" : {
        "JapaneseCars" : {
             "terms" : {
                 "field" : "make",
                 "include" : ["mazda", "honda"]
             }
         },
        "ActiveCarManufacturers" : {
             "terms" : {
                 "field" : "make",
                 "exclude" : ["rover", "jensen"]
             }
         }
    }
}
Filtering Values with partitions

Sometimes there are too many unique terms to process in a single request/response pair so it can be useful to break the analysis up into multiple requests. This can be achieved by grouping the field’s values into a number of partitions at query-time and processing only one partition in each request. Consider this request which is looking for accounts that have not logged any access recently:

GET /_search
{
   "size": 0,
   "aggs": {
      "expired_sessions": {
         "terms": {
            "field": "account_id",
            "include": {
               "partition": 0,
               "num_partitions": 20
            },
            "size": 10000,
            "order": {
               "last_access": "asc"
            }
         },
         "aggs": {
            "last_access": {
               "max": {
                  "field": "access_date"
               }
            }
         }
      }
   }
}

This request finds the last logged access date for a subset of customer accounts because we might want to expire some customer accounts that haven’t been seen for a long while. The num_partitions setting requests that the unique account_ids be organized evenly into twenty partitions (0 to 19), and the partition setting in this request filters to only consider account_ids falling into partition 0. Subsequent requests should ask for partitions 1, then 2, and so on to complete the expired-account analysis.

Note that the size setting for the number of results returned needs to be tuned with the num_partitions. For this particular account-expiration example the process for balancing values for size and num_partitions would be as follows:

  1. Use the cardinality aggregation to estimate the total number of unique account_id values

  2. Pick a value for num_partitions to break the number from 1) up into more manageable chunks

  3. Pick a size value for the number of responses we want from each partition

  4. Run a test request

If we have a circuit-breaker error we are trying to do too much in one request and must increase num_partitions. If the request was successful but the last account ID in the date-sorted test response was still an account we might want to expire then we may be missing accounts of interest and have set our numbers too low. We must either

  • increase the size parameter to return more results per partition (could be heavy on memory) or

  • increase the num_partitions to consider fewer accounts per request (could increase overall processing time as we need to make more requests)

Ultimately this is a balancing act between managing the Elasticsearch resources required to process a single request and the volume of requests that the client application must issue to complete a task.
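
A client working through all partitions issues one request per partition. The sketch below only builds the request bodies and leaves sending them to whichever client is in use; partition_requests and expected_terms_per_partition are hypothetical helper names, and the field and aggregation names are taken from the example request above.

```python
import math


def expected_terms_per_partition(cardinality, num_partitions):
    """Rough number of unique terms per partition, assuming an even spread."""
    return math.ceil(cardinality / num_partitions)


def partition_requests(field, num_partitions, size):
    """Yield one search body per partition, from 0 to num_partitions - 1."""
    for partition in range(num_partitions):
        yield {
            "size": 0,
            "aggs": {
                "expired_sessions": {
                    "terms": {
                        "field": field,
                        "include": {
                            "partition": partition,
                            "num_partitions": num_partitions,
                        },
                        "size": size,
                        "order": {"last_access": "asc"},
                    },
                    "aggs": {"last_access": {"max": {"field": "access_date"}}},
                }
            },
        }


bodies = list(partition_requests("account_id", num_partitions=20, size=10000))
```

If the cardinality aggregation estimated, say, 150,000 unique account IDs, expected_terms_per_partition(150000, 20) gives 7,500 terms per partition, which is one way to sanity-check a chosen size.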

Multi-field terms aggregation

The terms aggregation does not support collecting terms from multiple fields in the same document. The reason is that the terms agg doesn’t collect the string term values themselves, but rather uses global ordinals to produce a list of all of the unique values in the field. Global ordinals result in an important performance boost which would not be possible across multiple fields.

There are two approaches that you can use to perform a terms agg across multiple fields:

Script

Use a script to retrieve terms from multiple fields. This disables the global ordinals optimization and will be slower than collecting terms from a single field, but it gives you the flexibility to implement this option at search time.

copy_to field

If you know ahead of time that you want to collect the terms from two or more fields, then use copy_to in your mapping to create a new dedicated field at index time which contains the values from both fields. You can aggregate on this single field, which will benefit from the global ordinals optimization.

Collect mode

Deferring calculation of child aggregations

For fields with many unique terms and a small number of required results it can be more efficient to delay the calculation of child aggregations until the top parent-level aggs have been pruned. Ordinarily, all branches of the aggregation tree are expanded in one depth-first pass and only then any pruning occurs. In some scenarios this can be very wasteful and can hit memory constraints. An example problem scenario is querying a movie database for the 10 most popular actors and their 5 most common co-stars:

GET /_search
{
    "aggs" : {
        "actors" : {
             "terms" : {
                 "field" : "actors",
                 "size" : 10
             },
            "aggs" : {
                "costars" : {
                     "terms" : {
                         "field" : "actors",
                         "size" : 5
                     }
                 }
            }
         }
    }
}

Even though the number of actors may be comparatively small and we want only 50 result buckets there is a combinatorial explosion of buckets during calculation - a single actor can produce n² buckets where n is the number of actors. The sane option would be to first determine the 10 most popular actors and only then examine the top co-stars for these 10 actors. This alternative strategy is what we call the breadth_first collection mode as opposed to the depth_first mode.
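
The difference between the two modes can be pictured with a toy model that counts how many (actor, co-star) child buckets each strategy has to create. This is not how Elasticsearch collects buckets internally; the cast lists and function names are made up for illustration.

```python
from collections import Counter

# Hypothetical cast lists, one per movie document.
MOVIES = [
    ["ann", "bob", "cat"],
    ["ann", "bob", "dan"],
    ["ann", "eve"],
    ["bob", "fay"],
    ["gus", "hal"],
]


def costar_buckets(movies, parents=None):
    """Count distinct (actor, co-star) child buckets, optionally limited to some parents."""
    pairs = set()
    for cast in movies:
        for actor in cast:
            if parents is not None and actor not in parents:
                continue
            # The sub-aggregation runs on the same field, so the actor
            # also appears among its own co-star terms.
            pairs.update((actor, costar) for costar in cast)
    return len(pairs)


def top_actors(movies, size):
    """Return the `size` actors appearing in the most movies."""
    counts = Counter(actor for cast in movies for actor in cast)
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return {actor for actor, _ in ranked[:size]}


# Depth-first: child buckets are built for every actor before pruning.
depth_first = costar_buckets(MOVIES)
# Breadth-first: prune to the top 2 actors first, then build child buckets.
breadth_first = costar_buckets(MOVIES, parents=top_actors(MOVIES, 2))
```

Even on this tiny data set the depth-first pass builds 24 child buckets while the breadth-first pass builds 10, and the gap widens as the number of distinct actors grows.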

Note
breadth_first is the default mode for fields with a cardinality bigger than the requested size or when the cardinality is unknown (numeric fields or scripts for instance). It is possible to override the default heuristic and to provide a collect mode directly in the request:
GET /_search
{
    "aggs" : {
        "actors" : {
             "terms" : {
                 "field" : "actors",
                 "size" : 10,
                 "collect_mode" : "breadth_first" (1)
             },
            "aggs" : {
                "costars" : {
                     "terms" : {
                         "field" : "actors",
                         "size" : 5
                     }
                 }
            }
         }
    }
}
  1. the possible values are breadth_first and depth_first

When using breadth_first mode the set of documents that fall into the uppermost buckets are cached for subsequent replay so there is a memory overhead in doing this which is linear with the number of matching documents. Note that the order parameter can still be used to refer to data from a child aggregation when using the breadth_first setting - the parent aggregation understands that this child aggregation will need to be called first before any of the other child aggregations.

Warning
Nested aggregations such as top_hits which require access to score information under an aggregation that uses the breadth_first collection mode need to replay the query on the second pass but only for the documents belonging to the top buckets.

Execution hint

There are different mechanisms by which terms aggregations can be executed:

  • by using field values directly in order to aggregate data per-bucket (map)

  • by using global ordinals of the field and allocating one bucket per global ordinal (global_ordinals)

Elasticsearch tries to have sensible defaults so this is something that generally doesn’t need to be configured.

global_ordinals is the default option for keyword fields. It uses global ordinals to allocate buckets dynamically, so memory usage is linear in the number of values of the documents that are part of the aggregation scope.

map should only be considered when very few documents match a query. Otherwise the ordinals-based execution mode is significantly faster. By default, map is only used when running an aggregation on scripts, since they don’t have ordinals.

GET /_search
{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "execution_hint": "map" (1)
             }
         }
    }
}
  1. The possible values are map, global_ordinals

Please note that Elasticsearch will ignore this execution hint if it is not applicable and that there is no backward compatibility guarantee on these hints.

Missing value

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored but it is also possible to treat them as if they had a value.

GET /_search
{
    "aggs" : {
        "tags" : {
             "terms" : {
                 "field" : "tags",
                 "missing": "N/A" (1)
             }
         }
    }
}
  1. Documents without a value in the tags field will fall into the same bucket as documents that have the value N/A.
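The effect of the missing parameter can be mimicked in a few lines of Python. This is only a sketch of the counting behaviour described above; `terms_counts` and the sample `docs` are hypothetical and do not correspond to any Elasticsearch API.

```python
from collections import Counter

def terms_counts(docs, field, missing=None):
    """Count documents per term; `missing` optionally stands in for absent values."""
    counts = Counter()
    for doc in docs:
        values = doc.get(field)
        if values is None or values == []:
            if missing is None:
                continue  # default behaviour: the document is ignored
            values = [missing]
        elif not isinstance(values, list):
            values = [values]
        counts.update(set(values))  # each document counts once per term
    return counts

# Hypothetical documents: two tagged, two without a "tags" value.
docs = [{"tags": ["red", "N/A"]}, {"tags": "red"}, {"title": "untagged"}, {}]

print(terms_counts(docs, "tags"))
print(terms_counts(docs, "tags", missing="N/A"))
```

With missing set, the two untagged documents land in the same bucket as the document that really carries the value N/A.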

Mixing field types

Warning
When aggregating on multiple indices, the type of the aggregated field may not be the same in all indices. Some types are compatible with each other (integer and long, or float and double), but when the types are a mix of decimal and non-decimal numbers the terms aggregation will promote the non-decimal numbers to decimal numbers. This can result in a loss of precision in the bucket values.
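The precision loss can be reproduced outside Elasticsearch. The sketch below assumes that the decimal type used for promotion is a 64-bit IEEE 754 double, which is also what a Python float is; the variable names are illustrative.

```python
long_value = 2**53 + 1            # representable exactly as a 64-bit long
promoted = float(long_value)      # assumed promotion target: a 64-bit double
neighbour = float(2**53)

print(long_value, int(promoted))  # the promoted value no longer round-trips
print(promoted == neighbour)      # two distinct longs collapse into one double
```

Two distinct integer keys above 2^53 can therefore end up with the same bucket value once they are promoted.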

Pipeline Aggregations

Pipeline aggregations work on the outputs produced from other aggregations rather than from document sets, adding information to the output tree. There are many different types of pipeline aggregation, each computing different information from other aggregations, but these types can be broken down into two families:

Parent

A family of pipeline aggregations that is provided with the output of its parent aggregation and is able to compute new buckets or new aggregations to add to existing buckets.

Sibling

Pipeline aggregations that are provided with the output of a sibling aggregation and are able to compute a new aggregation which will be at the same level as the sibling aggregation.

Pipeline aggregations can reference the aggregations they need to perform their computation by using the buckets_path parameter to indicate the paths to the required metrics. The syntax for defining these paths can be found in the buckets_path Syntax section below.

Pipeline aggregations cannot have sub-aggregations but, depending on the type, they can reference another pipeline in the buckets_path, allowing pipeline aggregations to be chained. For example, you can chain together two derivatives to calculate the second derivative (i.e. a derivative of a derivative).

Note
Because pipeline aggregations only add to the output, when chaining pipeline aggregations the output of each pipeline aggregation will be included in the final output.

buckets_path Syntax

Most pipeline aggregations require another aggregation as their input. The input aggregation is defined via the buckets_path parameter, which follows a specific format:

AGG_SEPARATOR       =  '>' ;
METRIC_SEPARATOR    =  '.' ;
AGG_NAME            =  <the name of the aggregation> ;
METRIC              =  <the name of the metric (in case of multi-value metrics aggregation)> ;
PATH                =  <AGG_NAME> [ <AGG_SEPARATOR>, <AGG_NAME> ]* [ <METRIC_SEPARATOR>, <METRIC> ] ;

For example, the path "my_bucket>my_stats.avg" points to the avg value in the "my_stats" metric, which is contained in the "my_bucket" bucket aggregation.
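As a rough illustration of the grammar above, the following sketch splits a path into its aggregation names and optional metric. `parse_buckets_path` is a hypothetical helper, not part of any Elasticsearch client, and it deliberately ignores the bracket syntax used for names that contain dots.

```python
def parse_buckets_path(path):
    """Split a buckets_path into (aggregation names, optional metric name)."""
    *parents, last = path.split(">")        # AGG_SEPARATOR
    name, sep, metric = last.partition(".")  # METRIC_SEPARATOR
    return parents + [name], (metric if sep else None)

print(parse_buckets_path("my_bucket>my_stats.avg"))
print(parse_buckets_path("the_sum"))
```

A path without a metric separator yields None for the metric, which matches the single-value case where the aggregation name alone identifies the value.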

Paths are relative from the position of the pipeline aggregation; they are not absolute paths, and the path cannot go back "up" the aggregation tree. For example, this moving average is embedded inside a date_histogram and refers to a "sibling" metric "the_sum":

POST /_search
{
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"timestamp",
                "interval":"day"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "lemmings" } (1)
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } (2)
                }
            }
        }
    }
}
  1. The metric is called "the_sum"

  2. The buckets_path refers to the metric via a relative path "the_sum"

buckets_path is also used for Sibling pipeline aggregations, where the aggregation is "next" to a series of buckets instead of embedded "inside" them. For example, the max_bucket aggregation uses the buckets_path to specify a metric embedded inside a sibling aggregation:

POST /_search
{
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.

Special Paths

Instead of pathing to a metric, buckets_path can use a special "_count" path. This instructs the pipeline aggregation to use the document count as its input. For example, a moving average can be calculated on the document count of each bucket, instead of a specific metric:

POST /_search
{
    "aggs": {
        "my_date_histo": {
            "date_histogram": {
                "field":"timestamp",
                "interval":"day"
            },
            "aggs": {
                "the_movavg": {
                    "moving_avg": { "buckets_path": "_count" } (1)
                }
            }
        }
    }
}
  1. By using _count instead of a metric name, we can calculate the moving average of document counts in the histogram

The buckets_path can also use "_bucket_count" and path to a multi-bucket aggregation, in order to use the number of buckets returned by that aggregation in the pipeline aggregation instead of a metric. For example, a bucket_selector can be used here to filter out buckets which contain no buckets for an inner terms aggregation:

POST /sales/_search
{
  "size": 0,
  "aggs": {
    "histo": {
      "date_histogram": {
        "field": "date",
        "interval": "day"
      },
      "aggs": {
        "categories": {
          "terms": {
            "field": "category"
          }
        },
        "min_bucket_selector": {
          "bucket_selector": {
            "buckets_path": {
              "count": "categories._bucket_count" (1)
            },
            "script": {
              "source": "params.count != 0"
            }
          }
        }
      }
    }
  }
}
  1. By using _bucket_count instead of a metric name, we can filter out histo buckets that contain no buckets for the categories aggregation
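The selection logic of this request can be imitated on an already-computed response. The sketch below uses hypothetical bucket data and a hypothetical `select_buckets` helper; it only mirrors the script condition `params.count != 0`.

```python
# Hypothetical histogram buckets, each with an inner "categories" terms result.
histo_buckets = [
    {"key": "2015-01-01", "categories": {"buckets": [{"key": "hat", "doc_count": 2}]}},
    {"key": "2015-01-02", "categories": {"buckets": []}},
    {"key": "2015-01-03", "categories": {"buckets": [{"key": "bag", "doc_count": 1}]}},
]

def select_buckets(buckets, agg_name):
    """Keep only the buckets whose inner aggregation returned at least one bucket."""
    kept = []
    for bucket in buckets:
        params = {"count": len(bucket[agg_name]["buckets"])}  # plays the role of _bucket_count
        if params["count"] != 0:  # mirrors the script "params.count != 0"
            kept.append(bucket)
    return kept

print([b["key"] for b in select_buckets(histo_buckets, "categories")])
```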

Dealing with dots in agg names

An alternate syntax is supported to cope with aggregations or metrics which have dots in the name, such as the 99.9th percentile. This metric may be referred to as:

"buckets_path": "my_percentile[99.9]"

Dealing with gaps in the data

Data in the real world is often noisy and sometimes contains gaps — places where data simply doesn’t exist. This can occur for a variety of reasons, the most common being:

  • Documents falling into a bucket do not contain a required field

  • There are no documents matching the query for one or more buckets

  • The metric being calculated is unable to generate a value, likely because another dependent bucket is missing a value. Some pipeline aggregations have specific requirements that must be met (e.g. a derivative cannot calculate a metric for the first value because there is no previous value, a HoltWinters moving average needs "warmup" data to begin calculating, etc.)

Gap policies are a mechanism to inform the pipeline aggregation about the desired behavior when "gappy" or missing data is encountered. All pipeline aggregations accept the gap_policy parameter. There are currently two gap policies to choose from:

skip

This option treats missing data as if the bucket does not exist. It will skip the bucket and continue calculating using the next available value.

insert_zeros

This option will replace missing values with a zero (0) and pipeline aggregation computation will proceed as normal.
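The two policies can be compared on a small series with one gap. This is a simplified sketch: `apply_gap_policy` and `mean` are hypothetical helpers, and the series is made-up data in which None marks a bucket without a value.

```python
def apply_gap_policy(values, gap_policy="skip"):
    """Resolve missing values (None) before a pipeline computation runs."""
    if gap_policy == "skip":
        return [v for v in values if v is not None]          # bucket treated as absent
    if gap_policy == "insert_zeros":
        return [0.0 if v is None else v for v in values]     # gap replaced by zero
    raise ValueError(f"unknown gap_policy: {gap_policy}")

def mean(values):
    return sum(values) / len(values)

monthly = [550.0, None, 375.0]  # hypothetical series with one gap

print(mean(apply_gap_policy(monthly, "skip")))
print(mean(apply_gap_policy(monthly, "insert_zeros")))
```

The same input gives different results under the two policies, which is why the choice matters for averages and similar computations.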

Avg Bucket Aggregation

A sibling pipeline aggregation which calculates the (mean) average value of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

An avg_bucket aggregation looks like this in isolation:

{
    "avg_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 4. avg_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the average for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the average of the total monthly sales:

POST /_search
{
  "size": 0,
  "aggs": {
    "sales_per_month": {
      "date_histogram": {
        "field": "date",
        "interval": "month"
      },
      "aggs": {
        "sales": {
          "sum": {
            "field": "price"
          }
        }
      }
    },
    "avg_monthly_sales": {
      "avg_bucket": {
        "buckets_path": "sales_per_month>sales" (1)
      }
    }
  }
}
  1. buckets_path instructs this avg_bucket aggregation that we want the (mean) average value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "avg_monthly_sales": {
          "value": 328.33333333333333
      }
   }
}
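The value of avg_monthly_sales can be checked by hand from the bucket values in the response. The sketch below rebuilds a trimmed copy of the response as a Python dict; `avg_bucket` is a hypothetical helper that only supports the two-part path used here.

```python
# Trimmed copy of the aggregations part of the response above.
aggregations = {
    "sales_per_month": {
        "buckets": [
            {"sales": {"value": 550.0}},
            {"sales": {"value": 60.0}},
            {"sales": {"value": 375.0}},
        ]
    }
}

def avg_bucket(aggs, buckets_path):
    """Average a single-value metric across the buckets of a sibling aggregation."""
    agg_name, metric_name = buckets_path.split(">")
    values = [b[metric_name]["value"] for b in aggs[agg_name]["buckets"]]
    return sum(values) / len(values)

print(avg_bucket(aggregations, "sales_per_month>sales"))
```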

Derivative Aggregation

A parent pipeline aggregation which calculates the derivative of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).

Syntax

A derivative aggregation looks like this in isolation:

"derivative": {
  "buckets_path": "the_sum"
}
Table 5. derivative Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the derivative for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

First Order Derivative

The following snippet calculates the derivative of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales" (1)
                    }
                }
            }
        }
    }
}
  1. buckets_path instructs this derivative aggregation to use the output of the sales aggregation for the derivative

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } (1)
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0 (2)
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2, (3)
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0
               }
            }
         ]
      }
   }
}
  1. No derivative for the first bucket since we need at least 2 data points to calculate the derivative

  2. Derivative value units are implicitly defined by the sales aggregation and the parent histogram so in this case the units would be $/month assuming the price field has units of $.

  3. The number of documents in the bucket is represented by the doc_count
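The derivative values in the response are simply the differences between consecutive bucket values. The following sketch recomputes them on a trimmed copy of the buckets; `add_derivative` is a hypothetical helper and no gap handling is attempted.

```python
# Trimmed copy of the buckets from the response above.
buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 375.0}},
]

def add_derivative(buckets, metric, name):
    """Attach the difference to the previous bucket's metric to each bucket."""
    previous = None
    for bucket in buckets:
        current = bucket[metric]["value"]
        if previous is not None:  # the first bucket has no previous value
            bucket[name] = {"value": current - previous}
        previous = current
    return buckets

for bucket in add_derivative(buckets, "sales", "sales_deriv"):
    print(bucket["key_as_string"], bucket.get("sales_deriv"))
```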

Second Order Derivative

A second order derivative can be calculated by chaining the derivative pipeline aggregation onto the result of another derivative pipeline aggregation as in the following example which will calculate both the first and the second order derivative of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales"
                    }
                },
                "sales_2nd_deriv": {
                    "derivative": {
                        "buckets_path": "sales_deriv" (1)
                    }
                }
            }
        }
    }
}
  1. buckets_path for the second derivative points to the name of the first derivative

And the following may be the response:

{
   "took": 50,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } (1)
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0
               } (1)
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0
               },
               "sales_2nd_deriv": {
                  "value": 805.0
               }
            }
         ]
      }
   }
}
  1. No second derivative for the first two buckets since we need at least 2 data points from the first derivative to calculate the second derivative
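Chaining can be reproduced by feeding the output of one difference computation into another. In this sketch `derivative` is a hypothetical helper working on plain lists, with None standing for a bucket that has no value.

```python
def derivative(values):
    """Difference to the previous value; None where either value is unavailable."""
    out = []
    previous = None
    for current in values:
        if previous is None or current is None:
            out.append(None)
        else:
            out.append(current - previous)
        previous = current
    return out

sales = [550.0, 60.0, 375.0]
sales_deriv = derivative(sales)
sales_2nd_deriv = derivative(sales_deriv)

print(sales_deriv)
print(sales_2nd_deriv)
```

The second derivative only appears from the third bucket onward because the first derivative itself has no value for the first bucket.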

Units

The derivative aggregation allows the units of the derivative values to be specified. This returns an extra field in the response, normalized_value, which reports the derivative value in the desired x-axis units. In the example below we calculate the derivative of the total sales per month but ask for it in units of sales per day:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_deriv": {
                    "derivative": {
                        "buckets_path": "sales",
                        "unit": "day" (1)
                    }
                }
            }
        }
    }
}
  1. unit specifies what unit to use for the x-axis of the derivative calculation
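One way to picture the normalization is to divide the raw difference by the distance between the two bucket keys, measured in the requested unit. The sketch below assumes that this is how normalized_value is obtained; it uses the bucket keys (epoch milliseconds) and sales values from this section's sample data, and `MS_PER_UNIT` and `normalized_derivative` are hypothetical names.

```python
MS_PER_UNIT = {"day": 86_400_000, "hour": 3_600_000}  # illustrative subset of units

def normalized_derivative(prev_key_ms, key_ms, prev_value, value, unit):
    """Return (raw difference, difference per requested x-axis unit)."""
    delta = value - prev_value
    x_distance = (key_ms - prev_key_ms) / MS_PER_UNIT[unit]
    return delta, delta / x_distance

jan, feb, mar = 1420070400000, 1422748800000, 1425168000000  # bucket keys
feb_deriv = normalized_derivative(jan, feb, 550.0, 60.0, "day")
mar_deriv = normalized_derivative(feb, mar, 60.0, 375.0, "day")

print(feb_deriv)
print(mar_deriv)
```

Under this assumption the February value is divided by 31 days and the March value by 28 days, since the keys mark the start of each month.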

And the following may be the response:

{
   "took": 50,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               } (1)
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "sales_deriv": {
                  "value": -490.0, (1)
                  "normalized_value": -15.806451612903226 (2)
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "sales_deriv": {
                  "value": 315.0,
                  "normalized_value": 11.25
               }
            }
         ]
      }
   }
}
  1. value is reported in the original units of 'per month'

  2. normalized_value is reported in the desired units of 'per day'

Max Bucket Aggregation

A sibling pipeline aggregation which identifies the bucket(s) with the maximum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.
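The reason the output carries a list of keys can be seen in a small sketch. The buckets below are hypothetical data with a deliberate tie, and `max_bucket` is an illustrative helper rather than the actual implementation.

```python
# Hypothetical buckets in which two months share the maximum value.
buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "sales": {"value": 550.0}},
    {"key_as_string": "2015/02/01 00:00:00", "sales": {"value": 60.0}},
    {"key_as_string": "2015/03/01 00:00:00", "sales": {"value": 550.0}},
]

def max_bucket(buckets, metric):
    """Return the maximum metric value and the keys of every bucket that holds it."""
    best = max(b[metric]["value"] for b in buckets)
    keys = [b["key_as_string"] for b in buckets if b[metric]["value"] == best]
    return {"keys": keys, "value": best}

print(max_bucket(buckets, "sales"))
```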

Syntax

A max_bucket aggregation looks like this in isolation:

{
    "max_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 6. max_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the maximum for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the maximum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "max_monthly_sales": {
            "max_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this max_bucket aggregation that we want the maximum value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "max_monthly_sales": {
          "keys": ["2015/01/01 00:00:00"], (1)
          "value": 550.0
      }
   }
}
  1. keys is an array of strings since the maximum value may be present in multiple buckets

Min Bucket Aggregation

A sibling pipeline aggregation which identifies the bucket(s) with the minimum value of a specified metric in a sibling aggregation and outputs both the value and the key(s) of the bucket(s). The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A min_bucket aggregation looks like this in isolation:

{
    "min_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 7. min_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the minimum for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the minimum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "min_monthly_sales": {
            "min_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this min_bucket aggregation that we want the minimum value of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "min_monthly_sales": {
          "keys": ["2015/02/01 00:00:00"], (1)
          "value": 60.0
      }
   }
}
  1. keys is an array of strings since the minimum value may be present in multiple buckets

Sum Bucket Aggregation

A sibling pipeline aggregation which calculates the sum across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A sum_bucket aggregation looks like this in isolation:

{
    "sum_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 8. sum_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the sum for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the sum of all the total monthly sales buckets:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "sum_monthly_sales": {
            "sum_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this sum_bucket aggregation that we want the sum of the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "sum_monthly_sales": {
          "value": 985.0
      }
   }
}

Stats Bucket Aggregation

A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A stats_bucket aggregation looks like this in isolation:

{
    "stats_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 9. stats_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the stats for monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "stats_monthly_sales": {
            "stats_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0
      }
   }
}
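The figures in stats_monthly_sales follow directly from the three bucket values. A minimal sketch, with `stats_bucket` as a hypothetical helper operating on a plain list of those values:

```python
def stats_bucket(values):
    """Basic statistics over the per-bucket metric values."""
    total = sum(values)
    return {
        "count": len(values),
        "min": min(values),
        "max": max(values),
        "avg": total / len(values),
        "sum": total,
    }

monthly_sales = [550.0, 60.0, 375.0]  # the "sales" values from the response above
print(stats_bucket(monthly_sales))
```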

Extended Stats Bucket Aggregation

A sibling pipeline aggregation which calculates a variety of stats across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

This aggregation provides a few more statistics (sum of squares, standard deviation, etc) compared to the stats_bucket aggregation.

Syntax

An extended_stats_bucket aggregation looks like this in isolation:

{
    "extended_stats_bucket": {
        "buckets_path": "the_sum"
    }
}
Table 10. extended_stats_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to calculate stats for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null
sigma | The number of standard deviations above/below the mean to display | Optional | 2

The following snippet calculates the extended stats for the monthly sales buckets:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "stats_monthly_sales": {
            "extended_stats_bucket": {
                "buckets_path": "sales_per_month>sales" (1)
            }
        }
    }
}
  1. buckets_path instructs this extended_stats_bucket aggregation that we want to calculate stats for the sales aggregation in the sales_per_month date histogram.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "stats_monthly_sales": {
         "count": 3,
         "min": 60.0,
         "max": 550.0,
         "avg": 328.3333333333333,
         "sum": 985.0,
         "sum_of_squares": 446725.0,
         "variance": 41105.55555555556,
         "std_deviation": 202.74505063146563,
         "std_deviation_bounds": {
           "upper": 733.8234345962646,
           "lower": -77.15676792959795
         }
      }
   }
}
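
The figures in stats_monthly_sales can be reproduced by hand from the three monthly sums. The following Python sketch is illustrative only: the function name is made up for this example, and it assumes population variance and the default sigma of 2, which is what the sample response is consistent with.

```python
import math

def extended_stats(values, sigma=2.0):
    """Compute extended-stats-style figures for a list of bucket values.

    Hypothetical helper for illustration; assumes population variance.
    """
    count = len(values)
    total = sum(values)
    avg = total / count
    sum_of_squares = sum(v * v for v in values)
    # Population variance: mean of the squares minus the square of the mean.
    variance = sum_of_squares / count - avg * avg
    std_deviation = math.sqrt(variance)
    return {
        "count": count,
        "min": min(values),
        "max": max(values),
        "avg": avg,
        "sum": total,
        "sum_of_squares": sum_of_squares,
        "variance": variance,
        "std_deviation": std_deviation,
        "std_deviation_bounds": {
            "upper": avg + sigma * std_deviation,
            "lower": avg - sigma * std_deviation,
        },
    }

# The three monthly "sales" values from the response above.
stats = extended_stats([550.0, 60.0, 375.0])
print(stats)
```

Up to floating-point rounding, the printed values agree with the sample response.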

Percentiles Bucket Aggregation

A sibling pipeline aggregation which calculates percentiles across all buckets of a specified metric in a sibling aggregation. The specified metric must be numeric and the sibling aggregation must be a multi-bucket aggregation.

Syntax

A percentiles_bucket aggregation looks like this in isolation:

{
    "percentiles_bucket": {
        "buckets_path": "the_sum"
    }
}

Table 11. percentiles_bucket Parameters

Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the percentiles for (see buckets_path Syntax for more details) | Required |
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | Format to apply to the output value of this aggregation | Optional | null
percents | The list of percentiles to calculate | Optional | [ 1, 5, 25, 50, 75, 95, 99 ]

The following snippet calculates the percentiles for the total monthly sales buckets:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                }
            }
        },
        "percentiles_monthly_sales": {
            "percentiles_bucket": {
                "buckets_path": "sales_per_month>sales", (1)
                "percents": [ 25.0, 50.0, 75.0 ] (2)
            }
        }
    }
}
  1. buckets_path instructs this percentiles_bucket aggregation that we want to calculate percentiles for the sales aggregation in the sales_per_month date histogram.

  2. percents specifies which percentiles we wish to calculate, in this case, the 25th, 50th and 75th percentiles.

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               }
            }
         ]
      },
      "percentiles_monthly_sales": {
        "values" : {
            "25.0": 375.0,
            "50.0": 375.0,
            "75.0": 550.0
         }
      }
   }
}

Percentiles_bucket implementation

The Percentile Bucket returns the nearest input data point that is not greater than the requested percentile; it does not interpolate between data points.

The percentiles are calculated exactly and are not approximations (unlike the Percentiles Metric). This means the implementation maintains an in-memory, sorted list of your data to compute the percentiles, before discarding the data. You may run into memory pressure issues if you attempt to calculate percentiles over many millions of data-points in a single percentiles_bucket.
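
To make "sorted list, no interpolation" concrete, here is a minimal Python sketch. The index rule (round half up over positions 0 to n-1) is an assumption chosen because it reproduces the sample response above; it is not taken from the Elasticsearch source, and the function name is hypothetical.

```python
import math

def percentiles_bucket(values, percents):
    """Pick an existing data point for each percentile, without interpolating.

    Assumed index rule: round (percent / 100) * (n - 1) half up.
    """
    data = sorted(values)  # the whole series is held in memory and sorted
    if not data:
        return {p: float("nan") for p in percents}
    result = {}
    for p in percents:
        index = math.floor((p / 100.0) * (len(data) - 1) + 0.5)
        result[p] = data[index]
    return result

# Monthly sums from the example response.
picked = percentiles_bucket([550.0, 60.0, 375.0], [25.0, 50.0, 75.0])
print(picked)
```

Every returned value is one of the input data points, which is the property the paragraph above describes.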

Moving Average Aggregation

Deprecated in 6.4.0. The Moving Average aggregation has been deprecated in favor of the more general Moving Function Aggregation. The new Moving Function aggregation provides all the same functionality as the Moving Average aggregation, but also provides more flexibility.

Given an ordered series of data, the Moving Average aggregation will slide a window across the data and emit the average value of that window. For example, given the data [1, 2, 3, 4, 5, 6, 7, 8, 9, 10], we can calculate a simple moving average with a window size of 5 as follows:

  • (1 + 2 + 3 + 4 + 5) / 5 = 3

  • (2 + 3 + 4 + 5 + 6) / 5 = 4

  • (3 + 4 + 5 + 6 + 7) / 5 = 5

  • etc
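
The same arithmetic can be written as a short Python sketch. The function name is illustrative, and this shows only the sliding-window idea, not how the aggregation aligns windows with buckets.

```python
def sliding_means(data, window):
    """Return the mean of every full window of `window` consecutive values."""
    return [
        sum(data[i:i + window]) / window
        for i in range(len(data) - window + 1)
    ]

means = sliding_means([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], 5)
print(means)
```

The first three results are 3, 4 and 5, matching the list above.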

Moving averages are a simple method to smooth sequential data. Moving averages are typically applied to time-based data, such as stock prices or server metrics. The smoothing can be used to eliminate high frequency fluctuations or random noise, which allows the lower frequency trends to be more easily visualized, such as seasonality.

Syntax

A moving_avg aggregation looks like this in isolation:

{
    "moving_avg": {
        "buckets_path": "the_sum",
        "model": "holt",
        "window": 5,
        "gap_policy": "insert_zeros",
        "settings": {
            "alpha": 0.8
        }
    }
}

Table 12. moving_avg Parameters

Parameter Name | Description | Required | Default Value
buckets_path | Path to the metric of interest (see buckets_path Syntax for more details) | Required |
model | The moving average weighting model that we wish to use | Optional | simple
gap_policy | Determines what should happen when a gap in the data is encountered. | Optional | insert_zeros
window | The size of window to "slide" across the histogram. | Optional | 5
minimize | If the model should be algorithmically minimized. See Minimization for more details | Optional | false for most models
settings | Model-specific settings, the contents of which differ depending on the model specified. | Optional |

moving_avg aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                (1)
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" } (2)
                },
                "the_movavg":{
                    "moving_avg":{ "buckets_path": "the_sum" } (3)
                }
            }
        }
    }
}
  1. A date_histogram named "my_date_histo" is constructed on the "date" field, with one-month intervals

  2. A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)

  3. Finally, we specify a moving_avg aggregation which uses "the_sum" metric as its input.

Moving averages are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the moving_avg is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see buckets_path Syntax for a description of the syntax for buckets_path).

An example response from the above aggregation may look like:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "my_date_histo": {
         "buckets": [
             {
                 "key_as_string": "2015/01/01 00:00:00",
                 "key": 1420070400000,
                 "doc_count": 3,
                 "the_sum": {
                    "value": 550.0
                 }
             },
             {
                 "key_as_string": "2015/02/01 00:00:00",
                 "key": 1422748800000,
                 "doc_count": 2,
                 "the_sum": {
                    "value": 60.0
                 },
                 "the_movavg": {
                    "value": 550.0
                 }
             },
             {
                 "key_as_string": "2015/03/01 00:00:00",
                 "key": 1425168000000,
                 "doc_count": 2,
                 "the_sum": {
                    "value": 375.0
                 },
                 "the_movavg": {
                    "value": 305.0
                 }
             }
         ]
      }
   }
}
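
In this response the first bucket has no the_movavg, the second reports 550.0 and the third 305.0. Those numbers are consistent with each bucket averaging the values of the buckets before it. The Python sketch below reproduces them under that assumption; it is inferred from the sample output only, and the function name is hypothetical.

```python
def moving_avg_of_previous(values, window):
    """For each position, average up to `window` values that precede it.

    Returns None where no earlier value exists (no value is emitted).
    """
    out = []
    for i in range(len(values)):
        previous = values[max(0, i - window):i]
        out.append(sum(previous) / len(previous) if previous else None)
    return out

# "the_sum" per month from the response; 5 is the documented default window.
averages = moving_avg_of_previous([550.0, 60.0, 375.0], 5)
print(averages)
```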

Models

The moving_avg aggregation includes five different moving average "models": simple, linear, ewma, holt and holt_winters. The main difference is how the values in the window are weighted. As data-points become "older" in the window, they may be weighted differently. This will affect the final average for that window.

Models are specified using the model parameter. Some models may have optional configurations which are specified inside the settings parameter.

Simple

The simple model calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple model does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg":{
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "simple"
                    }
                }
            }
        }
    }
}

A simple model has no special settings to configure.

The window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

Figure 1. Moving average with window of size 10

In contrast, a simple moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount:

Figure 2. Moving average with window of size 100

Linear

The linear model assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "linear"
                    }
                }
            }
        }
    }
}

A linear model has no special settings to configure.

Like the simple model, window size can change the behavior of the moving average. For example, a small window ("window": 10) will closely track the data and only smooth out small scale fluctuations:

Figure 3. Linear moving average with window of size 10

In contrast, a linear moving average with larger window ("window": 100) will smooth out all higher-frequency fluctuations, leaving only low-frequency, long term trends. It also tends to "lag" behind the actual data by a substantial amount, although typically less than the simple model:

Figure 4. Linear moving average with window of size 100
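
As a rough illustration of linear weighting, the Python sketch below uses the textbook form: the oldest value in the window gets weight 1, the next weight 2, and so on, and the weighted sum is divided by the sum of the weights. This is an assumption made for illustration; the exact normalization used by the aggregation may differ.

```python
def linear_weighted_avg(window_values):
    """Textbook linearly weighted mean; later (newer) values weigh more."""
    weights = range(1, len(window_values) + 1)
    weighted_sum = sum(w * v for w, v in zip(weights, window_values))
    return weighted_sum / sum(weights)

# Oldest value first: (1*1 + 2*2 + 3*3) / (1 + 2 + 3)
result = linear_weighted_avg([1.0, 2.0, 3.0])
print(result)
```

The result sits above the plain mean of 2.0 because the newest value carries the most weight, which is why this model lags less than the simple model.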

EWMA (Exponentially Weighted)

The ewma model (aka "single-exponential") is similar to the linear model, except older data-points become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.

The default value of alpha is 0.3, and the setting accepts any float from 0-1 inclusive.

The EWMA model can be minimized (see Minimization).

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "ewma",
                        "settings" : {
                            "alpha" : 0.5
                        }
                    }
                }
            }
        }
    }
}

Figure 5. EWMA with window of size 10, alpha = 0.2

Figure 6. EWMA with window of size 10, alpha = 0.7
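
The effect of alpha is easier to see in code. The following Python sketch uses the standard single-exponential recurrence, seeded with the first value in the window; the function name and the seeding choice are assumptions made for this example.

```python
def ewma(window_values, alpha):
    """Single-exponential smoothing over one window, oldest value first."""
    smoothed = window_values[0]
    for v in window_values[1:]:
        # Larger alpha: the newest value dominates. Smaller alpha: more smoothing.
        smoothed = alpha * v + (1 - alpha) * smoothed
    return smoothed

data = [10.0, 20.0, 30.0]
print(ewma(data, 0.5))   # halfway between tracking and smoothing
print(ewma(data, 1.0))   # alpha = 1 simply returns the newest value
```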

Holt-Linear

The holt model (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".

The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. The trend calculation looks at the difference between the current and last value (i.e. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.

Values are produced by combining the level and trend components.

The default value of alpha is 0.3 and beta is 0.1. The settings accept any float from 0-1 inclusive.

The Holt-Linear model can be minimized (see Minimization).

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "holt",
                        "settings" : {
                            "alpha" : 0.5,
                            "beta" : 0.5
                        }
                    }
                }
            }
        }
    }
}

In practice, the alpha value behaves very similarly in holt as in ewma: small values produce more smoothing and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger values emphasize short-term trends. This will become more apparent when you are predicting values.

Figure 7. Holt-Linear moving average with window of size 100, alpha = 0.5, beta = 0.2

Figure 8. Holt-Linear moving average with window of size 100, alpha = 0.5, beta = 0.7
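
For readers who want to see how "level" and "trend" interact, here is a Python sketch of the textbook double-exponential (Holt) recurrence, in which a forecast k steps ahead is the level plus k times the trend. The initialization (level = first value, trend = 0) and the function name are assumptions made for this example, not a description of the aggregation's internals.

```python
def holt_forecast(values, alpha, beta, steps=1):
    """Textbook Holt smoothing; returns forecasts for the next `steps` points."""
    level = values[0]
    trend = 0.0
    for v in values[1:]:
        # Level: blend the new value with the previous level projected forward.
        new_level = alpha * v + (1 - alpha) * (level + trend)
        # Trend: exponentially weighted slope of the smoothed series.
        trend = beta * (new_level - level) + (1 - beta) * trend
        level = new_level
    return [level + k * trend for k in range(1, steps + 1)]

# A perfectly linear series with alpha = beta = 1 continues the line.
forecast = holt_forecast([1.0, 2.0, 3.0, 4.0], alpha=1.0, beta=1.0, steps=2)
print(forecast)
```

Unlike the single-exponential model, the forecast keeps rising because the trend term carries the slope forward.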

Holt-Winters

The holt_winters model (aka "triple exponential") incorporates a third exponential term which tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend" and "seasonality".

The level and trend calculation is identical to holt. The seasonal calculation looks at the difference between the current point and the point one period earlier.

Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you would set period: 7. Similarly if there was a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned for future enhancements.

There are two varieties of Holt-Winters: additive and multiplicative.

"Cold Start"

Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This means that your window must always be at least twice the size of your period. An exception will be thrown if it isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm does not backcast.

Figure 9. Holt-Winters showing a "cold" start where no values are emitted

Because the "cold start" obscures what the moving average looks like, the rest of the Holt-Winters images are truncated to not show the "cold start". Just be aware this will always be present at the beginning of your moving averages!
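
The bootstrap rule can be checked before sending a request. The Python helpers below are hypothetical client-side checks written for this example; they only encode the two statements above (window must be at least twice the period, and nothing is emitted for the first 2 * period buckets).

```python
def holt_winters_window_ok(window, period):
    """True if the window holds the two full periods needed to bootstrap."""
    return window >= 2 * period

def cold_start_buckets(period):
    """Number of leading buckets for which no value is emitted."""
    return 2 * period

print(holt_winters_window_ok(30, 7))   # window 30, period 7
print(holt_winters_window_ok(10, 7))   # too small: 10 < 14
print(cold_start_buckets(7))
```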

Additive Holt-Winters

Additive seasonality is the default; it can also be specified by setting "type": "add". This variety is preferred when the seasonal effect is additive to your data. E.g. you could simply subtract the seasonal effect to "de-seasonalize" your data into a flat trend.

The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.

The additive Holt-Winters model can be minimized (see Minimization).

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "holt_winters",
                        "settings" : {
                            "type" : "add",
                            "alpha" : 0.5,
                            "beta" : 0.5,
                            "gamma" : 0.5,
                            "period" : 7
                        }
                    }
                }
            }
        }
    }
}

Figure 10. Holt-Winters moving average with window of size 120, alpha = 0.5, beta = 0.7, gamma = 0.3, period = 30

Multiplicative Holt-Winters

Multiplicative is specified by setting "type": "mult". This variety is preferred when the seasonal effect is multiplied against your data. E.g. if the seasonal effect is 5x the data, rather than simply adding to it.

The default values of alpha and gamma are 0.3 while beta is 0.1. The settings accept any float from 0-1 inclusive. The default value of period is 1.

The multiplicative Holt-Winters model can be minimized (see Minimization).

Warning

Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, the mult Holt-Winters pads all values by a very small amount (1*10^-10) so that all values are non-zero. This affects the result, but only minimally. If your data is non-zero, or you prefer to see NaN when zeros are encountered, you can disable this behavior with pad: false

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "holt_winters",
                        "settings" : {
                            "type" : "mult",
                            "alpha" : 0.5,
                            "beta" : 0.5,
                            "gamma" : 0.5,
                            "period" : 7,
                            "pad" : true
                        }
                    }
                }
            }
        }
    }
}

Prediction

This functionality is experimental and may be changed or removed completely in a future release.

All the moving average models support a "prediction" mode, which will attempt to extrapolate into the future given the current smoothed, moving average. Depending on the model and parameters, these predictions may or may not be accurate.

Predictions are enabled by adding a predict parameter to any moving average aggregation, specifying the number of predictions you would like appended to the end of the series. These predictions will be spaced out at the same interval as your buckets:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "window" : 30,
                        "model" : "simple",
                        "predict" : 10
                    }
                }
            }
        }
    }
}

The simple, linear and ewma models all produce "flat" predictions: they essentially converge on the mean of the last values in the series, producing a flat line:

Figure 11. Simple moving average with window of size 10, predict = 50

In contrast, the holt model can extrapolate based on local or global constant trends. If we set a high beta value, we can extrapolate based on local constant trends (in this case the predictions head down, because the data at the end of the series was heading in a downward direction):

Figure 12. Holt-Linear moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.8

In contrast, if we choose a small beta, the predictions are based on the global constant trend. In this series, the global trend is slightly positive, so the prediction makes a sharp u-turn and begins a positive slope:

Figure 13. Double Exponential moving average with window of size 100, predict = 20, alpha = 0.5, beta = 0.1

The holt_winters model has the potential to deliver the best predictions, since it also incorporates seasonal fluctuations into the model:

Figure 14. Holt-Winters moving average with window of size 120, predict = 25, alpha = 0.8, beta = 0.2, gamma = 0.7, period = 30

Minimization

Some of the models (EWMA, Holt-Linear, Holt-Winters) require one or more parameters to be configured. Parameter choice can be tricky and sometimes non-intuitive. Furthermore, small deviations in these parameters can sometimes have a drastic effect on the output moving average.

For that reason, the three "tunable" models can be algorithmically minimized. Minimization is a process where parameters are tweaked until the predictions generated by the model closely match the output data. Minimization is not foolproof and can be susceptible to overfitting, but it often gives better results than hand-tuning.

Minimization is disabled by default for ewma and holt, while it is enabled by default for holt_winters. Minimization is most useful with Holt-Winters, since it helps improve the accuracy of the predictions. EWMA and Holt-Linear are not great predictors, and are mostly used for smoothing data, so minimization is less useful on those models.

Minimization is enabled/disabled via the minimize parameter:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_avg":{
                        "buckets_path": "the_sum",
                        "model" : "holt_winters",
                        "window" : 30,
                        "minimize" : true,  (1)
                        "settings" : {
                            "period" : 7
                        }
                    }
                }
            }
        }
    }
}
  1. Minimization is enabled with the minimize parameter

When enabled, minimization will find the optimal values for alpha, beta and gamma. The user should still provide appropriate values for window, period and type.

Warning

Minimization works by running a stochastic process called simulated annealing. This process will usually generate a good solution, but is not guaranteed to find the global optimum. It also requires some amount of additional computational power, since the model needs to be re-run multiple times as the values are tweaked. The run-time of minimization is linear in the size of the window being processed: excessively large windows may cause latency.

Finally, minimization fits the model to the last n values, where n = window. This generally produces better forecasts into the future, since the parameters are tuned around the end of the series. It can, however, generate poorer fitting moving averages at the beginning of the series.
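
To illustrate what "tweaking parameters until predictions match the data" means, the Python sketch below fits the alpha of a single-exponential model by minimizing the squared one-step-ahead error. Note that it deliberately swaps in a plain grid search instead of simulated annealing to keep the example deterministic and short; the function names and the error measure are assumptions made for this example.

```python
def one_step_error(values, alpha):
    """Sum of squared errors when each value is predicted from the EWMA so far."""
    smoothed = values[0]
    error = 0.0
    for v in values[1:]:
        error += (v - smoothed) ** 2           # prediction made before seeing v
        smoothed = alpha * v + (1 - alpha) * smoothed
    return error

def fit_alpha(values, steps=100):
    """Grid search over alpha in [0, 1]; not simulated annealing."""
    candidates = [i / steps for i in range(steps + 1)]
    return min(candidates, key=lambda a: one_step_error(values, a))

# Only the last `window` values would be used for fitting, as described above.
series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
best_alpha = fit_alpha(series)
print(best_alpha)
```

For a steadily rising series the search settles on the largest alpha, because any smoothing makes the one-step prediction lag further behind the data.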

Moving Function Aggregation

Given an ordered series of data, the Moving Function aggregation will slide a window across the data and allow the user to specify a custom script that is executed on each window of data. For convenience, a number of common functions are predefined such as min/max, moving averages, etc.

This is conceptually very similar to the Moving Average pipeline aggregation, except it provides more functionality.

Syntax

A moving_fn aggregation looks like this in isolation:

{
    "moving_fn": {
        "buckets_path": "the_sum",
        "window": 10,
        "script": "MovingFunctions.min(values)"
    }
}

Table 13. moving_fn Parameters

Parameter Name | Description | Required | Default Value
buckets_path | Path to the metric of interest (see buckets_path Syntax for more details) | Required |
window | The size of window to "slide" across the histogram. | Required |
script | The script that should be executed on each window of data | Required |

moving_fn aggregations must be embedded inside of a histogram or date_histogram aggregation. They can be embedded like any other metric aggregation:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{                (1)
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" } (2)
                },
                "the_movfn": {
                    "moving_fn": {
                        "buckets_path": "the_sum", (3)
                        "window": 10,
                        "script": "MovingFunctions.unweightedAvg(values)"
                    }
                }
            }
        }
    }
}
  1. A date_histogram named "my_date_histo" is constructed on the "date" field, with one-month intervals

  2. A sum metric is used to calculate the sum of a field. This could be any numeric metric (sum, min, max, etc)

  3. Finally, we specify a moving_fn aggregation which uses "the_sum" metric as its input.

Moving functions are built by first specifying a histogram or date_histogram over a field. You can then optionally add numeric metrics, such as a sum, inside of that histogram. Finally, the moving_fn is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see buckets_path Syntax for a description of the syntax for buckets_path).

An example response from the above aggregation may look like:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "my_date_histo": {
         "buckets": [
             {
                 "key_as_string": "2015/01/01 00:00:00",
                 "key": 1420070400000,
                 "doc_count": 3,
                 "the_sum": {
                    "value": 550.0
                 },
                 "the_movfn": {
                    "value": null
                 }
             },
             {
                 "key_as_string": "2015/02/01 00:00:00",
                 "key": 1422748800000,
                 "doc_count": 2,
                 "the_sum": {
                    "value": 60.0
                 },
                 "the_movfn": {
                    "value": 550.0
                 }
             },
             {
                 "key_as_string": "2015/03/01 00:00:00",
                 "key": 1425168000000,
                 "doc_count": 2,
                 "the_sum": {
                    "value": 375.0
                 },
                 "the_movfn": {
                    "value": 305.0
                 }
             }
         ]
      }
   }
}

Custom user scripting

The Moving Function aggregation allows the user to specify any arbitrary script to define custom logic. The script is invoked each time a new window of data is collected. These values are provided to the script in the values variable. The script should then perform some kind of calculation and emit a single double as the result. Emitting null is not permitted, although NaN and +/- Inf are allowed.

For example, this script will simply return the first value from the window, or NaN if no values are available:

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "return values.length > 0 ? values[0] : Double.NaN"
                    }
                }
            }
        }
    }
}

Pre-built Functions

For convenience, a number of functions have been prebuilt and are available inside the moving_fn script context:

  • max()

  • min()

  • sum()

  • stdDev()

  • unweightedAvg()

  • linearWeightedAvg()

  • ewma()

  • holt()

  • holtWinters()

The functions are available from the MovingFunctions namespace. E.g. MovingFunctions.max()

max Function

This function accepts a collection of doubles and returns the maximum value in that window. null and NaN values are ignored; the maximum is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result.

Table 14. max(double[] values) Parameters
Parameter Name | Description
values | The window of values to find the maximum

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_moving_max": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.max(values)"
                    }
                }
            }
        }
    }
}

min Function

This function accepts a collection of doubles and returns the minimum value in that window. null and NaN values are ignored; the minimum is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result.

Table 15. min(double[] values) Parameters
Parameter Name | Description
values | The window of values to find the minimum

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_moving_min": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.min(values)"
                    }
                }
            }
        }
    }
}

sum Function

This function accepts a collection of doubles and returns the sum of the values in that window. null and NaN values are ignored; the sum is only calculated over the real values. If the window is empty, or all values are null/NaN, 0.0 is returned as the result.

Table 16. sum(double[] values) Parameters
Parameter Name | Description
values | The window of values to find the sum of

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_moving_sum": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.sum(values)"
                    }
                }
            }
        }
    }
}
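
The null/NaN handling described for max, min and sum can be sketched in Python. This is not the Elasticsearch implementation; moving_max, moving_min and moving_sum are hypothetical names, and None stands in for null.

```python
import math

def _real(values):
    """Drop None (null) and NaN entries, keeping only real numbers."""
    return [v for v in values if v is not None and not math.isnan(v)]

def moving_max(values):
    real = _real(values)
    return max(real) if real else math.nan

def moving_min(values):
    real = _real(values)
    return min(real) if real else math.nan

def moving_sum(values):
    # An empty or all-null/NaN window sums to 0.0 rather than NaN.
    return float(sum(_real(values)))

window = [3.0, None, math.nan, 7.5]
print(moving_max(window), moving_min(window), moving_sum(window))
```

Note the asymmetry: an empty window yields NaN for the maximum and minimum but 0.0 for the sum.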

stdDev Function

This function accepts a collection of doubles and an average, then returns the standard deviation of the values in that window. null and NaN values are ignored; the standard deviation is only calculated over the real values. If the window is empty, or all values are null/NaN, 0.0 is returned as the result.

Table 17. stdDev(double[] values, double avg) Parameters
Parameter Name | Description
values | The window of values to find the standard deviation of
avg | The average of the window

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_moving_sum": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.stdDev(values, MovingFunctions.unweightedAvg(values))"
                    }
                }
            }
        }
    }
}

The avg parameter must be provided to the standard deviation function because different styles of averages can be computed on the window (simple, linearly weighted, etc). The various moving averages that are detailed below can be used to calculate the average for the standard deviation function.
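
As a rough illustration of why the average is passed in separately, the Python sketch below computes a standard deviation around a caller-supplied average. It is not the Elasticsearch implementation: std_dev and unweighted_avg are hypothetical names, the population formula (dividing by the number of real values) is an assumption, and the empty-window result follows the description above.

```python
import math

def _real(values):
    """Drop None (null) and NaN entries."""
    return [v for v in values if v is not None and not math.isnan(v)]

def unweighted_avg(values):
    real = _real(values)
    return sum(real) / len(real) if real else math.nan

def std_dev(values, avg):
    """Standard deviation of the real values around the supplied `avg`.

    Assumption: population formula (divide by the count of real values).
    """
    real = _real(values)
    if not real:
        return 0.0  # empty-window result as described in the text
    return math.sqrt(sum((v - avg) ** 2 for v in real) / len(real))

window = [2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0]
print(std_dev(window, unweighted_avg(window)))
```

Swapping unweighted_avg for a differently weighted average changes the centre around which the deviation is measured.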

unweightedAvg Function

The unweightedAvg function calculates the sum of all values in the window, then divides by the size of the window. It is effectively a simple arithmetic mean of the window. The simple moving average does not perform any time-dependent weighting, which means the values from a simple moving average tend to "lag" behind the real data.

null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.

Table 18. unweightedAvg(double[] values) Parameters
Parameter Name | Description
values | The window of values to average

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.unweightedAvg(values)"
                    }
                }
            }
        }
    }
}

linearWeightedAvg Function

The linearWeightedAvg function assigns a linear weighting to points in the series, such that "older" datapoints (e.g. those at the beginning of the window) contribute linearly less to the total average. The linear weighting helps reduce the "lag" behind the data’s mean, since older points have less influence.

If the window is empty, or all values are null/NaN, NaN is returned as the result.

Table 19. linearWeightedAvg(double[] values) Parameters
Parameter Name | Description
values | The window of values to average

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.linearWeightedAvg(values)"
                    }
                }
            }
        }
    }
}
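
The difference between the unweighted and linearly weighted averages can be seen in a small Python sketch. The function names are hypothetical, and the weighting shown (weights 1 to n, normalized by their sum) is the textbook form; the exact normalization used by Elasticsearch may differ.

```python
import math

def _real(values):
    """Drop None (null) and NaN entries."""
    return [v for v in values if v is not None and not math.isnan(v)]

def unweighted_avg(values):
    real = _real(values)
    return sum(real) / len(real) if real else math.nan

def linear_weighted_avg(values):
    """Weights 1, 2, ..., n: the newest value weighs the most."""
    real = _real(values)
    if not real:
        return math.nan
    weights = range(1, len(real) + 1)
    return sum(w * v for w, v in zip(weights, real)) / sum(weights)

window = [10.0, 20.0, 30.0]  # oldest value first
print(unweighted_avg(window), linear_weighted_avg(window))
```

For a rising series the weighted average sits closer to the most recent value, which is the reduced "lag" described above.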

ewma Function

The ewma function (aka "single-exponential") is similar to the linearWeightedAvg function, except older data-points become exponentially less important, rather than linearly less important. The speed at which the importance decays can be controlled with an alpha setting. Small values make the weight decay slowly, which provides greater smoothing and takes into account a larger portion of the window. Larger values make the weight decay quickly, which reduces the impact of older values on the moving average. This tends to make the moving average track the data more closely but with less smoothing.

null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.

Table 20. ewma(double[] values, double alpha) Parameters
Parameter Name | Description
values | The window of values to average
alpha | Exponential decay

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.ewma(values, 0.3)"
                    }
                }
            }
        }
    }
}
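
The effect of alpha is easier to see with numbers. The Python sketch below uses the usual single-exponential recurrence, seeded with the first real value; that seeding, and the function name, are assumptions rather than a statement of how Elasticsearch implements ewma.

```python
import math

def ewma(values, alpha):
    """Exponentially weighted moving average over the real values.

    Assumption: the first real value seeds the average.
    """
    avg = math.nan
    for v in values:
        if v is None or math.isnan(v):
            continue  # null and NaN values are ignored
        avg = v if math.isnan(avg) else alpha * v + (1 - alpha) * avg
    return avg

window = [10.0, 10.0, 10.0, 20.0]  # a jump in the newest value
print(ewma(window, 0.1))  # small alpha: heavy smoothing
print(ewma(window, 0.9))  # large alpha: tracks the jump closely
```

With the small alpha the result barely moves away from 10, while with the large alpha it lands close to 20.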

holt Function

The holt function (aka "double exponential") incorporates a second exponential term which tracks the data’s trend. Single exponential does not perform well when the data has an underlying linear trend. The double exponential model calculates two values internally: a "level" and a "trend".

The level calculation is similar to ewma, and is an exponentially weighted view of the data. The difference is that the previously smoothed value is used instead of the raw value, which allows it to stay close to the original series. The trend calculation looks at the difference between the current and last value (e.g. the slope, or trend, of the smoothed data). The trend value is also exponentially weighted.

Values are produced by multiplying the level and trend components.

null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.

Table 21. holt(double[] values, double alpha, double beta) Parameters
Parameter Name | Description
values | The window of values to smooth
alpha | Level decay value
beta | Trend decay value

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "MovingFunctions.holt(values, 0.3, 0.1)"
                    }
                }
            }
        }
    }
}

In practice, the alpha value behaves very similarly in holt as in ewma: small values produce more smoothing and more lag, while larger values produce closer tracking and less lag. The effect of beta is often difficult to see. Small values emphasize long-term trends (such as a constant linear trend in the whole series), while larger values emphasize short-term trends.
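
The "level" and "trend" recurrences can be sketched in Python using the textbook form of double exponential smoothing. This is an illustration under stated assumptions, not the Elasticsearch code: holt_components is a hypothetical name, the level is seeded with the first real value and the trend with zero, and how the two components are combined into the emitted value is left out.

```python
import math

def holt_components(values, alpha, beta):
    """Return (level, trend) after smoothing the window.

    Assumptions: level starts at the first real value, trend starts at 0.
    """
    real = [v for v in values if v is not None and not math.isnan(v)]
    if not real:
        return math.nan, math.nan
    level, trend = real[0], 0.0
    for v in real[1:]:
        prev_level = level
        # Level: exponentially weighted view of the data, built on the
        # previously smoothed value plus the running trend.
        level = alpha * v + (1 - alpha) * (prev_level + trend)
        # Trend: exponentially weighted slope of the smoothed series.
        trend = beta * (level - prev_level) + (1 - beta) * trend
    return level, trend

print(holt_components([1.0, 2.0, 3.0], 0.5, 0.5))
```

Raising beta makes the trend component react faster to recent changes in slope, matching the short-term emphasis described above.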

holtWinters Function

The holtWinters function (aka "triple exponential") incorporates a third exponential term which tracks the seasonal aspect of your data. This aggregation therefore smooths based on three components: "level", "trend" and "seasonality".

The level and trend calculations are identical to holt. The seasonal calculation looks at the difference between the current point and the point one period earlier.

Holt-Winters requires a little more handholding than the other moving averages. You need to specify the "periodicity" of your data: e.g. if your data has cyclic trends every 7 days, you would set period = 7. Similarly if there was a monthly trend, you would set it to 30. There is currently no periodicity detection, although that is planned for future enhancements.

null and NaN values are ignored; the average is only calculated over the real values. If the window is empty, or all values are null/NaN, NaN is returned as the result. This means that the count used in the average calculation is the count of non-null, non-NaN values.

Table 22. holtWinters(double[] values, double alpha, double beta, double gamma, int period, boolean multiplicative) Parameters
Parameter Name | Description
values | The window of values to smooth
alpha | Level decay value
beta | Trend decay value
gamma | Seasonality decay value
period | The periodicity of the data
multiplicative | True if you wish to use multiplicative Holt-Winters, false to use additive

POST /_search
{
    "size": 0,
    "aggs": {
        "my_date_histo":{
            "date_histogram":{
                "field":"date",
                "interval":"1M"
            },
            "aggs":{
                "the_sum":{
                    "sum":{ "field": "price" }
                },
                "the_movavg": {
                    "moving_fn": {
                        "buckets_path": "the_sum",
                        "window": 10,
                        "script": "if (values.length > 5*2) {MovingFunctions.holtWinters(values, 0.3, 0.1, 0.1, 5, false)}"
                    }
                }
            }
        }
    }
}

Warning

Multiplicative Holt-Winters works by dividing each data point by the seasonal value. This is problematic if any of your data is zero, or if there are gaps in the data (since this results in a divide-by-zero). To combat this, multiplicative Holt-Winters pads all values by a very small amount (1*10^-10) so that all values are non-zero. This affects the result, but only minimally. If your data is non-zero, or you prefer to see NaN when zeros are encountered, you can disable this behavior with pad: false

"Cold Start"

Unfortunately, due to the nature of Holt-Winters, it requires two periods of data to "bootstrap" the algorithm. This means that your window must always be at least twice the size of your period. An exception will be thrown if it isn’t. It also means that Holt-Winters will not emit a value for the first 2 * period buckets; the current algorithm does not backcast.

You’ll notice in the above example we have an if () statement checking the size of values. This checks that we have two periods' worth of data (5 * 2, where 5 is the period specified in the holtWinters function) before calling the Holt-Winters function.

Cumulative Sum Aggregation

A parent pipeline aggregation which calculates the cumulative sum of a specified metric in a parent histogram (or date_histogram) aggregation. The specified metric must be numeric and the enclosing histogram must have min_doc_count set to 0 (default for histogram aggregations).

Syntax

A cumulative_sum aggregation looks like this in isolation:

{
    "cumulative_sum": {
        "buckets_path": "the_sum"
    }
}

Table 23. cumulative_sum Parameters
Parameter Name | Description | Required | Default Value
buckets_path | The path to the buckets we wish to find the cumulative sum for (see buckets_path Syntax for more details) | Required | -
format | format to apply to the output value of this aggregation | Optional | null

The following snippet calculates the cumulative sum of the total monthly sales:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "cumulative_sales": {
                    "cumulative_sum": {
                        "buckets_path": "sales" (1)
                    }
                }
            }
        }
    }
}
  1. buckets_path instructs this cumulative sum aggregation to use the output of the sales aggregation for the cumulative sum

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "sales": {
                  "value": 550.0
               },
               "cumulative_sales": {
                  "value": 550.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "sales": {
                  "value": 60.0
               },
               "cumulative_sales": {
                  "value": 610.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "sales": {
                  "value": 375.0
               },
               "cumulative_sales": {
                  "value": 985.0
               }
            }
         ]
      }
   }
}
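
The running total in this response can be reproduced with a few lines of Python. The sketch simply feeds the three monthly sales values from the response into a running sum; it says nothing about how Elasticsearch computes the value internally.

```python
from itertools import accumulate

# Monthly "sales" values taken from the response above.
monthly_sales = [550.0, 60.0, 375.0]

# Each bucket's cumulative value is its own sales plus all earlier sales.
cumulative_sales = list(accumulate(monthly_sales))
print(cumulative_sales)
```

Each entry corresponds to the cumulative_sales value of the bucket in the same position.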

Bucket Script Aggregation

A parent pipeline aggregation which executes a script which can perform per bucket computations on specified metrics in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a numeric value.

Syntax

A bucket_script aggregation looks like this in isolation:

{
    "bucket_script": {
        "buckets_path": {
            "my_var1": "the_sum", (1)
            "my_var2": "the_value_count"
        },
        "script": "params.my_var1 / params.my_var2"
    }
}
  1. Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to the metrics to use for that variable.

Table 24. bucket_script Parameters
Parameter Name | Description | Required | Default Value
script | The script to run for this aggregation. The script can be inline, file or indexed. (see [modules-scripting] for more details) | Required | -
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip
format | format to apply to the output value of this aggregation | Optional | null

The following snippet calculates t-shirt sales as a percentage of total sales each month:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "t-shirts": {
                  "filter": {
                    "term": {
                      "type": "t-shirt"
                    }
                  },
                  "aggs": {
                    "sales": {
                      "sum": {
                        "field": "price"
                      }
                    }
                  }
                },
                "t-shirt-percentage": {
                    "bucket_script": {
                        "buckets_path": {
                          "tShirtSales": "t-shirts>sales",
                          "totalSales": "total_sales"
                        },
                        "script": "params.tShirtSales / params.totalSales * 100"
                    }
                }
            }
        }
    }
}

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 200.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 36.36363636363637
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "total_sales": {
                   "value": 60.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 10.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 16.666666666666664
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               },
               "t-shirts": {
                   "doc_count": 1,
                   "sales": {
                       "value": 175.0
                   }
               },
               "t-shirt-percentage": {
                   "value": 46.666666666666664
               }
            }
         ]
      }
   }
}
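
The per-bucket evaluation can be mimicked in Python to check the percentages above. This is a simplified model: bucket_script here is a hypothetical function, each bucket is flattened into a dictionary keyed by its buckets_path string, and the script is a Python lambda rather than Painless.

```python
def bucket_script(buckets, buckets_path, script):
    """Evaluate `script` once per bucket with the resolved path values."""
    results = []
    for bucket in buckets:
        params = {name: bucket[path] for name, path in buckets_path.items()}
        results.append(script(params))
    return results

# Values taken from the response above, one dictionary per monthly bucket.
buckets = [
    {"t-shirts>sales": 200.0, "total_sales": 550.0},
    {"t-shirts>sales": 10.0, "total_sales": 60.0},
    {"t-shirts>sales": 175.0, "total_sales": 375.0},
]
paths = {"tShirtSales": "t-shirts>sales", "totalSales": "total_sales"}
percentages = bucket_script(
    buckets, paths, lambda p: p["tShirtSales"] / p["totalSales"] * 100
)
print(percentages)
```

The variable names on the left of buckets_path are the ones the script sees; the paths on the right select which metric feeds each variable.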

Bucket Selector Aggregation

A parent pipeline aggregation which executes a script which determines whether the current bucket will be retained in the parent multi-bucket aggregation. The specified metric must be numeric and the script must return a boolean value. If the script language is expression then a numeric return value is permitted. In this case 0.0 will be evaluated as false and all other values will evaluate to true.

Note
The bucket_selector aggregation, like all pipeline aggregations, executes after all other sibling aggregations. This means that using the bucket_selector aggregation to filter the returned buckets in the response does not save on execution time running the aggregations.

Syntax

A bucket_selector aggregation looks like this in isolation:

{
    "bucket_selector": {
        "buckets_path": {
            "my_var1": "the_sum", (1)
            "my_var2": "the_value_count"
        },
        "script": "params.my_var1 > params.my_var2"
    }
}
  1. Here, my_var1 is the name of the variable for this buckets path to use in the script, the_sum is the path to the metrics to use for that variable.

Table 25. bucket_selector Parameters
Parameter Name | Description | Required | Default Value
script | The script to run for this aggregation. The script can be inline, file or indexed. (see [modules-scripting] for more details) | Required | -
buckets_path | A map of script variables and their associated path to the buckets we wish to use for the variable (see buckets_path Syntax for more details) | Required | -
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip

The following snippet only retains buckets where the total sales for the month is more than 200:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_bucket_filter": {
                    "bucket_selector": {
                        "buckets_path": {
                          "totalSales": "total_sales"
                        },
                        "script": "params.totalSales > 200"
                    }
                }
            }
        }
    }
}
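
The selection logic of this request can be modelled in Python. The sketch is illustrative: bucket_selector is a hypothetical function, the monthly totals (550, 60 and 375) are assumed sample values, and the script is a Python lambda rather than Painless.

```python
def bucket_selector(buckets, buckets_path, script):
    """Keep only the buckets for which `script` returns a truthy value."""
    kept = []
    for bucket in buckets:
        params = {name: bucket[path] for name, path in buckets_path.items()}
        if script(params):
            kept.append(bucket)
    return kept

# Assumed monthly totals, one dictionary per bucket.
buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "total_sales": 550.0},
    {"key_as_string": "2015/02/01 00:00:00", "total_sales": 60.0},
    {"key_as_string": "2015/03/01 00:00:00", "total_sales": 375.0},
]
kept = bucket_selector(
    buckets, {"totalSales": "total_sales"}, lambda p: p["totalSales"] > 200
)
print([b["key_as_string"] for b in kept])
```

All three buckets are computed before the filter runs, which is why the selector does not save execution time.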

And the following may be the response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               }
            },(1)
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               }
            }
         ]
      }
   }
}
  1. Bucket for 2015/02/01 00:00:00 has been removed as its total sales was less than 200

Bucket Sort Aggregation

A parent pipeline aggregation which sorts the buckets of its parent multi-bucket aggregation. Zero or more sort fields may be specified together with the corresponding sort order. Each bucket may be sorted based on its _key, _count or its sub-aggregations. In addition, parameters from and size may be set in order to truncate the result buckets.

Note
The bucket_sort aggregation, like all pipeline aggregations, is executed after all other non-pipeline aggregations. This means the sorting only applies to whatever buckets are already returned from the parent aggregation. For example, if the parent aggregation is terms and its size is set to 10, the bucket_sort will only sort over those 10 returned term buckets.

Syntax

A bucket_sort aggregation looks like this in isolation:

{
    "bucket_sort": {
        "sort": [
            {"sort_field_1": {"order": "asc"}},(1)
            {"sort_field_2": {"order": "desc"}},
            "sort_field_3"
        ],
        "from": 1,
        "size": 3
    }
}
  1. Here, sort_field_1 is the bucket path to the variable to be used as the primary sort and its order is ascending.

Table 26. bucket_sort Parameters
Parameter Name | Description | Required | Default Value
sort | The list of fields to sort on. See sort for more details. | Optional | -
from | Buckets in positions prior to the set value will be truncated. | Optional | 0
size | The number of buckets to return. Defaults to all buckets of the parent aggregation. | Optional | -
gap_policy | The policy to apply when gaps are found in the data (see Dealing with gaps in the data for more details) | Optional | skip

The following snippet returns the buckets corresponding to the 3 months with the highest total sales in descending order:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "total_sales": {
                    "sum": {
                        "field": "price"
                    }
                },
                "sales_bucket_sort": {
                    "bucket_sort": {
                        "sort": [
                          {"total_sales": {"order": "desc"}}(1)
                        ],
                        "size": 3(2)
                    }
                }
            }
        }
    }
}
  1. sort is set to use the values of total_sales in descending order

  2. size is set to 3 meaning only the top 3 months in total_sales will be returned

And the following may be the response:

{
   "took": 82,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/01/01 00:00:00",
               "key": 1420070400000,
               "doc_count": 3,
               "total_sales": {
                   "value": 550.0
               }
            },
            {
               "key_as_string": "2015/03/01 00:00:00",
               "key": 1425168000000,
               "doc_count": 2,
               "total_sales": {
                   "value": 375.0
               }
            },
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2,
               "total_sales": {
                   "value": 60.0
               }
            }
         ]
      }
   }
}
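
The combination of sorting and truncation can be sketched in Python. This is a simplified model, not the Elasticsearch implementation: bucket_sort is a hypothetical function, sort keys are given as (field, order) pairs, and from is spelled from_ because from is a reserved word in Python.

```python
def bucket_sort(buckets, sort=None, from_=0, size=None):
    """Sort buckets by zero or more keys, then truncate with from/size.

    `sort` is a list of (field, order) pairs, primary key first.
    """
    ordered = list(buckets)
    # Apply keys from least to most significant; list.sort is stable,
    # so earlier passes act as tie-breakers for later ones.
    for field, order in reversed(sort or []):
        ordered.sort(key=lambda b, f=field: b[f], reverse=(order == "desc"))
    end = None if size is None else from_ + size
    return ordered[from_:end]

buckets = [
    {"key_as_string": "2015/01/01 00:00:00", "total_sales": 550.0},
    {"key_as_string": "2015/02/01 00:00:00", "total_sales": 60.0},
    {"key_as_string": "2015/03/01 00:00:00", "total_sales": 375.0},
]
top3 = bucket_sort(buckets, sort=[("total_sales", "desc")], size=3)
print([b["total_sales"] for b in top3])
```

Calling the same function without sort, for example bucket_sort(buckets, from_=1, size=1), only truncates and leaves the original bucket order untouched.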

Truncating without sorting

It is also possible to use this aggregation in order to truncate the result buckets without doing any sorting. To do so, just use the from and/or size parameters without specifying sort.

The following example simply truncates the result so that only the second bucket is returned:

POST /sales/_search
{
    "size": 0,
    "aggs" : {
        "sales_per_month" : {
            "date_histogram" : {
                "field" : "date",
                "interval" : "month"
            },
            "aggs": {
                "bucket_truncate": {
                    "bucket_sort": {
                        "from": 1,
                        "size": 1
                    }
                }
            }
        }
    }
}

Response:

{
   "took": 11,
   "timed_out": false,
   "_shards": ...,
   "hits": ...,
   "aggregations": {
      "sales_per_month": {
         "buckets": [
            {
               "key_as_string": "2015/02/01 00:00:00",
               "key": 1422748800000,
               "doc_count": 2
            }
         ]
      }
   }
}

Serial Differencing Aggregation

Serial differencing is a technique where values in a time series are subtracted from themselves at different time lags or periods. For example, the datapoint f(x) = f(x_t) - f(x_{t-n}), where n is the period being used.

A period of 1 is equivalent to a derivative with no time normalization: it is simply the change from one point to the next. Single periods are useful for removing constant, linear trends.

Single periods are also useful for transforming data into a stationary series. In this example, the Dow Jones is plotted over ~250 days. The raw data is not stationary, which would make it difficult to use with some techniques.

By calculating the first-difference, we de-trend the data (e.g. remove a constant, linear trend). We can see that the data becomes a stationary series (e.g. the first difference is randomly distributed around zero, and doesn’t seem to exhibit any pattern/behavior). The transformation reveals that the dataset is following a random-walk; the value is the previous value +/- a random amount. This insight allows selection of further tools for analysis.

Figure 15. Dow Jones plotted and made stationary with first-differencing

Larger periods can be used to remove seasonal / cyclic behavior. In this example, a population of lemmings was synthetically generated with a sine wave + constant linear trend + random noise. The sine wave has a period of 30 days.

The first-difference removes the constant trend, leaving just a sine wave. The 30th-difference is then applied to the first-difference to remove the cyclic behavior, leaving a stationary series which is amenable to other analysis.

Figure 16. Lemmings data plotted and made stationary with 1st and 30th difference

Syntax

A serial_diff aggregation looks like this in isolation:

{
    "serial_diff": {
        "buckets_path": "the_sum",
        "lag": 7
    }
}

Table 27. serial_diff Parameters
Parameter Name | Description | Required | Default Value
buckets_path | Path to the metric of interest (see buckets_path Syntax for more details) | Required | -
lag | The historical bucket to subtract from the current value. E.g. a lag of 7 will subtract the value 7 buckets ago from the current value. Must be a positive, non-zero integer | Optional | 1
gap_policy | Determines what should happen when a gap in the data is encountered. | Optional | insert_zero
format | Format to apply to the output value of this aggregation | Optional | null

serial_diff aggregations must be embedded inside of a histogram or date_histogram aggregation:

POST /_search
{
   "size": 0,
   "aggs": {
      "my_date_histo": {                  (1)
         "date_histogram": {
            "field": "timestamp",
            "interval": "day"
         },
         "aggs": {
            "the_sum": {
               "sum": {
                  "field": "lemmings"     (2)
               }
            },
            "thirtieth_difference": {
               "serial_diff": {                (3)
                  "buckets_path": "the_sum",
                  "lag" : 30
               }
            }
         }
      }
   }
}
  1. A date_histogram named "my_date_histo" is constructed on the "timestamp" field, with one-day intervals

  2. A sum metric is used to calculate the sum of a field. This could be any metric (sum, min, max, etc)

  3. Finally, we specify a serial_diff aggregation which uses "the_sum" metric as its input.

Serial differences are built by first specifying a histogram or date_histogram over a field. You can then optionally add normal metrics, such as a sum, inside of that histogram. Finally, the serial_diff is embedded inside the histogram. The buckets_path parameter is then used to "point" at one of the sibling metrics inside of the histogram (see buckets_path Syntax for a description of the syntax for buckets_path).
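
The differencing itself is a one-line computation, sketched here in Python. The function name is hypothetical, and the sketch assumes that buckets with no predecessor lag buckets back produce no value (None here).

```python
def serial_diff(values, lag=1):
    """Subtract the value `lag` buckets earlier from each value."""
    if lag < 1:
        raise ValueError("lag must be a positive, non-zero integer")
    return [
        None if i < lag else values[i] - values[i - lag]
        for i in range(len(values))
    ]

trend = [0.0, 2.0, 4.0, 6.0, 8.0]  # a constant linear trend
print(serial_diff(trend, lag=1))   # first difference
print(serial_diff(trend, lag=2))   # difference against two buckets back
```

On a constant linear trend the first difference is the same everywhere, which is the de-trending effect described earlier in this section.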

Matrix Aggregations

Warning: This functionality is experimental and may be changed or removed completely in a future release.

The aggregations in this family operate on multiple fields and produce a matrix result based on the values extracted from the requested document fields. Unlike metric and bucket aggregations, this aggregation family does not yet support scripting.

Matrix Stats

The matrix_stats aggregation is a numeric aggregation that computes the following statistics over a set of document fields:

count

Number of per-field samples included in the calculation.

mean

The average value for each field.

variance

Per-field measurement of how spread out the samples are from the mean.

skewness

Per-field measurement quantifying the asymmetry of the distribution around the mean.

kurtosis

Per-field measurement quantifying the shape of the distribution.

covariance

A matrix that quantitatively describes how changes in one field are associated with another.

correlation

The covariance matrix scaled to a range of -1 to 1, inclusive. Describes the relationship between field distributions.

The following example demonstrates the use of matrix stats to describe the relationship between income and poverty.

GET /_search
{
    "aggs": {
        "statistics": {
            "matrix_stats": {
                "fields": ["poverty", "income"]
            }
        }
    }
}

The aggregation type is matrix_stats and the fields setting defines the set of fields (as an array) for computing the statistics. The above request returns the following response:

{
    ...
    "aggregations": {
        "statistics": {
            "doc_count": 50,
            "fields": [{
                "name": "income",
                "count": 50,
                "mean": 51985.1,
                "variance": 7.383377037755103E7,
                "skewness": 0.5595114003506483,
                "kurtosis": 2.5692365287787124,
                "covariance": {
                    "income": 7.383377037755103E7,
                    "poverty": -21093.65836734694
                },
                "correlation": {
                    "income": 1.0,
                    "poverty": -0.8352655256272504
                }
            }, {
                "name": "poverty",
                "count": 50,
                "mean": 12.732000000000001,
                "variance": 8.637730612244896,
                "skewness": 0.4516049811903419,
                "kurtosis": 2.8615929677997767,
                "covariance": {
                    "income": -21093.65836734694,
                    "poverty": 8.637730612244896
                },
                "correlation": {
                    "income": -0.8352655256272504,
                    "poverty": 1.0
                }
            }]
        }
    }
}

The doc_count field indicates the number of documents involved in the computation of the statistics.
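The covariance and correlation entries in the response are related: each field's covariance with itself equals its variance, and the correlation is the covariance scaled by the two standard deviations. The following Python sketch checks that relationship using the numbers copied from the response above. The helper name `correlation` is an assumption for illustration, and the formula is the standard Pearson definition rather than a statement about Elasticsearch internals.

```python
import math

# Values copied from the example response above.
var_income = 7.383377037755103e7
var_poverty = 8.637730612244896
cov_income_poverty = -21093.65836734694
reported_correlation = -0.8352655256272504


def correlation(cov: float, var_a: float, var_b: float) -> float:
    """Pearson correlation: covariance divided by the product of standard deviations."""
    return cov / math.sqrt(var_a * var_b)


corr = correlation(cov_income_poverty, var_income, var_poverty)
# The recomputed value agrees with the reported one up to floating-point error.
print(math.isclose(corr, reported_correlation, rel_tol=1e-9))
```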

Multi Value Fields

The matrix_stats aggregation treats each document field as an independent sample. For array or multi-valued fields, the mode parameter controls which single value the aggregation uses. This parameter can take one of the following:

avg

(default) Use the average of all values.

min

Pick the lowest value.

max

Pick the highest value.

sum

Use the sum of all values.

median

Use the median of all values.
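The effect of each mode on a multi-valued field can be sketched in Python as follows. This only illustrates the semantics listed above. The `MODES` table, the `reduce_multi_value` function, and the sample `grades` values are hypothetical names and data introduced for the example.

```python
from statistics import mean, median
from typing import Callable, Dict, Sequence

# Hypothetical lookup table mirroring the documented mode names.
MODES: Dict[str, Callable[[Sequence[float]], float]] = {
    "avg": mean,
    "min": min,
    "max": max,
    "sum": sum,
    "median": median,
}


def reduce_multi_value(values: Sequence[float], mode: str = "avg") -> float:
    """Collapse the values of a multi-valued field into a single sample."""
    return MODES[mode](values)


# A document whose field holds three values.
grades = [70.0, 80.0, 96.0]
for mode in MODES:
    print(mode, reduce_multi_value(grades, mode))
```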

Missing Values

The missing parameter defines how documents that are missing a value should be treated. By default they will be ignored, but it is also possible to treat them as if they had a value. This is done by adding a set of fieldname : value mappings to specify default values per field.

GET /_search
{
    "aggs": {
        "matrixstats": {
            "matrix_stats": {
                "fields": ["poverty", "income"],
                "missing": {"income" : 50000} (1)
            }
        }
    }
}
  1. Documents without a value in the income field will have the default value 50000.
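The substitution of per-field defaults can be sketched in Python. This is a rough model of the behaviour described above, not the Elasticsearch implementation. The function `apply_missing` is hypothetical, and the sketch assumes that a document lacking any requested field that has no configured default is dropped from the computation entirely.

```python
from typing import Dict, List, Optional


def apply_missing(doc: Dict[str, float], fields: List[str],
                  missing: Dict[str, float]) -> Optional[Dict[str, float]]:
    """Return the per-field sample for `doc`, or None if the document is ignored."""
    sample: Dict[str, float] = {}
    for field in fields:
        if field in doc:
            sample[field] = doc[field]
        elif field in missing:
            # Treat the document as if it had the configured default value.
            sample[field] = missing[field]
        else:
            # Assumption: no value and no default means the document is skipped.
            return None
    return sample


fields = ["poverty", "income"]
missing = {"income": 50000}
print(apply_missing({"poverty": 12.0}, fields, missing))  # income filled with 50000
print(apply_missing({"income": 41000.0}, fields, missing))  # None, no default for poverty
```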

Script

This aggregation family does not yet support scripting.

Caching heavy aggregations

Frequently used aggregations (e.g. for display on the home page of a website) can be cached for faster responses. These cached results are the same results that would be returned by an uncached aggregation — you will never get stale results.

See the shard request cache documentation for more details.

Returning only aggregation results

There are many occasions when aggregations are required but search hits are not. For these cases the hits can be ignored by setting size=0. For example:

GET /twitter/_search
{
  "size": 0,
  "aggregations": {
    "my_agg": {
      "terms": {
        "field": "text"
      }
    }
  }
}

Setting size to 0 avoids executing the fetch phase of the search, making the request more efficient.

Aggregation Metadata

You can associate a piece of metadata with individual aggregations at request time that will be returned in place at response time.

Consider this example where we want to associate the color blue with our terms aggregation.

GET /twitter/_search
{
  "size": 0,
  "aggs": {
    "titles": {
      "terms": {
        "field": "title"
      },
      "meta": {
        "color": "blue"
      }
    }
  }
}

Then that piece of metadata will be returned in place for our titles terms aggregation:

{
    "aggregations": {
        "titles": {
            "meta": {
                "color" : "blue"
            },
            "doc_count_error_upper_bound" : 0,
            "sum_other_doc_count" : 0,
            "buckets": [
            ]
        }
    },
    ...
}
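On the client side, the metadata can be read back from the parsed response. The following Python sketch uses a trimmed copy of the response above with the elided part left out. The variable names are illustrative, and the use of `.get()` reflects the assumption that `meta` may be absent when none was sent with the request.

```python
import json

# Trimmed copy of the response shown above (the elided part is omitted).
response_text = """
{
    "aggregations": {
        "titles": {
            "meta": {"color": "blue"},
            "doc_count_error_upper_bound": 0,
            "sum_other_doc_count": 0,
            "buckets": []
        }
    }
}
"""

response = json.loads(response_text)
titles = response["aggregations"]["titles"]
# Assumption: `meta` may be missing, so fall back to an empty dict.
color = titles.get("meta", {}).get("color")
print(color)  # blue
```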

Returning the type of the aggregation

Sometimes you need to know the exact type of an aggregation in order to parse its results. The typed_keys parameter can be used to change the aggregation’s name in the response so that it will be prefixed by its internal type.

Consider the following date_histogram aggregation named tweets_over_time, which has a top_hits sub-aggregation named top_users:

GET /twitter/_search?typed_keys
{
  "aggregations": {
    "tweets_over_time": {
      "date_histogram": {
        "field": "date",
        "interval": "year"
      },
      "aggregations": {
        "top_users": {
            "top_hits": {
                "size": 1
            }
        }
      }
    }
  }
}

In the response, the aggregation names are changed to date_histogram#tweets_over_time and top_hits#top_users respectively, reflecting the internal type of each aggregation:

{
    "aggregations": {
        "date_histogram#tweets_over_time": { (1)
            "buckets" : [
                {
                    "key_as_string" : "2009-01-01T00:00:00.000Z",
                    "key" : 1230768000000,
                    "doc_count" : 5,
                    "top_hits#top_users" : {  (2)
                        "hits" : {
                            "total" : 5,
                            "max_score" : 1.0,
                            "hits" : [
                                {
                                  "_index": "twitter",
                                  "_type": "_doc",
                                  "_id": "0",
                                  "_score": 1.0,
                                  "_source": {
                                    "date": "2009-11-15T14:12:12",
                                    "message": "trying out Elasticsearch",
                                    "user": "kimchy",
                                    "likes": 0
                                  }
                                }
                            ]
                        }
                    }
                }
            ]
        }
    },
    ...
}
  1. The name tweets_over_time now contains the date_histogram prefix.

  2. The name top_users now contains the top_hits prefix.

Note
For some aggregations, it is possible that the returned type is not the same as the one provided with the request. This is the case for Terms, Significant Terms and Percentiles aggregations, where the returned type also contains information about the type of the targeted field: lterms (for a terms aggregation on a Long field), sigsterms (for a significant terms aggregation on a String field), tdigest_percentiles (for a percentile aggregation based on the TDigest algorithm).
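A client can recover the type and the original name by splitting each typed key on the `#` delimiter shown in the response above. The Python sketch below is one way to do this. The function name `split_typed_key` is hypothetical, and splitting on the first `#` assumes that the type prefix itself never contains that character.

```python
from typing import Optional, Tuple


def split_typed_key(key: str) -> Tuple[Optional[str], str]:
    """Split 'type#name' into (type, name); keys without '#' yield (None, key)."""
    agg_type, sep, name = key.partition("#")
    if not sep:
        # No typed_keys prefix present, so the key is the plain aggregation name.
        return None, key
    return agg_type, name


print(split_typed_key("date_histogram#tweets_over_time"))
print(split_typed_key("top_hits#top_users"))
print(split_typed_key("lterms#my_terms"))  # hypothetical terms aggregation on a long field
```

Because the returned type can differ from the requested one (for example lterms rather than terms), a parser would typically dispatch on the returned prefix rather than on the type used in the request.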