"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/reference/ingest.asciidoc" (29 Dec 2021, 1692 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

Pipeline Definition

A pipeline is a definition of a series of processors that are to be executed in the same order as they are declared. A pipeline consists of two main fields: a description and a list of processors:

{
  "description" : "...",
  "processors" : [ ... ]
}

The description is a special field to store a helpful description of what the pipeline does.

The processors parameter defines a list of processors to be executed in order.

Ingest APIs

The following ingest APIs are available for managing pipelines:

Put Pipeline API

The put pipeline API adds pipelines and updates existing pipelines in the cluster.

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}
Note
The put pipeline API also instructs all ingest nodes to reload their in-memory representation of pipelines, so that pipeline changes take effect immediately.

Get Pipeline API

The get pipeline API returns pipelines based on ID. This API always returns a local reference of the pipeline.

GET _ingest/pipeline/my-pipeline-id

Example response:

{
  "my-pipeline-id" : {
    "description" : "describe pipeline",
    "processors" : [
      {
        "set" : {
          "field" : "foo",
          "value" : "bar"
        }
      }
    ]
  }
}

For each returned pipeline, the source and the version are returned. The version is useful for knowing which version of the pipeline the node has. You can specify multiple IDs to return more than one pipeline. Wildcards are also supported.
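
For example, the following requests (assuming pipelines with these IDs exist) fetch two pipelines by ID and all pipelines whose IDs start with my-:

GET _ingest/pipeline/my-pipeline-id,my-other-pipeline-id
GET _ingest/pipeline/my-*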

Pipeline Versioning

Pipelines can optionally add a version number, which can be any integer value, in order to simplify pipeline management by external systems. The version field is completely optional and it is meant solely for external management of pipelines. To unset a version, simply replace the pipeline without specifying one.

PUT _ingest/pipeline/my-pipeline-id
{
  "description" : "describe pipeline",
  "version" : 123,
  "processors" : [
    {
      "set" : {
        "field": "foo",
        "value": "bar"
      }
    }
  ]
}

To check for the version, you can filter responses using filter_path to limit the response to just the version:

GET /_ingest/pipeline/my-pipeline-id?filter_path=*.version

This should give a small response that makes it both easy and inexpensive to parse:

{
  "my-pipeline-id" : {
    "version" : 123
  }
}

Delete Pipeline API

The delete pipeline API deletes pipelines by ID or wildcard match (my-*, *).

DELETE _ingest/pipeline/my-pipeline-id
DELETE _ingest/pipeline/*

Simulate Pipeline API

The simulate pipeline API executes a specific pipeline against the set of documents provided in the body of the request.

You can either specify an existing pipeline to execute against the provided documents, or supply a pipeline definition in the body of the request.

Here is the structure of a simulate request with a pipeline definition provided in the body of the request:

POST _ingest/pipeline/_simulate
{
  "pipeline" : {
    // pipeline definition here
  },
  "docs" : [
    { "_source": {/** first document **/} },
    { "_source": {/** second document **/} },
    // ...
  ]
}

Here is the structure of a simulate request against an existing pipeline:

POST _ingest/pipeline/my-pipeline-id/_simulate
{
  "docs" : [
    { "_source": {/** first document **/} },
    { "_source": {/** second document **/} },
    // ...
  ]
}

Here is an example of a simulate request with a pipeline defined in the request and its response:

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "field" : "field2",
          "value" : "_value"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

Response:

{
   "docs": [
      {
         "doc": {
            "_id": "id",
            "_index": "index",
            "_type": "_doc",
            "_source": {
               "field2": "_value",
               "foo": "bar"
            },
            "_ingest": {
               "timestamp": "2017-05-04T22:30:03.187Z"
            }
         }
      },
      {
         "doc": {
            "_id": "id",
            "_index": "index",
            "_type": "_doc",
            "_source": {
               "field2": "_value",
               "foo": "rab"
            },
            "_ingest": {
               "timestamp": "2017-05-04T22:30:03.188Z"
            }
         }
      }
   ]
}

Viewing Verbose Results

You can use the simulate pipeline API to see how each processor affects the ingest document as it passes through the pipeline. To see the intermediate results of each processor in the simulate request, you can add the verbose parameter to the request.

Here is an example of a verbose request and its response:

POST _ingest/pipeline/_simulate?verbose
{
  "pipeline" :
  {
    "description": "_description",
    "processors": [
      {
        "set" : {
          "field" : "field2",
          "value" : "_value2"
        }
      },
      {
        "set" : {
          "field" : "field3",
          "value" : "_value3"
        }
      }
    ]
  },
  "docs": [
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "bar"
      }
    },
    {
      "_index": "index",
      "_type": "_doc",
      "_id": "id",
      "_source": {
        "foo": "rab"
      }
    }
  ]
}

Response:

{
   "docs": [
      {
         "processor_results": [
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field2": "_value2",
                     "foo": "bar"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.674Z"
                  }
               }
            },
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field3": "_value3",
                     "field2": "_value2",
                     "foo": "bar"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.675Z"
                  }
               }
            }
         ]
      },
      {
         "processor_results": [
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field2": "_value2",
                     "foo": "rab"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.676Z"
                  }
               }
            },
            {
               "doc": {
                  "_id": "id",
                  "_index": "index",
                  "_type": "_doc",
                  "_source": {
                     "field3": "_value3",
                     "field2": "_value2",
                     "foo": "rab"
                  },
                  "_ingest": {
                     "timestamp": "2017-05-04T22:46:09.677Z"
                  }
               }
            }
         ]
      }
   ]
}

Accessing Data in Pipelines

The processors in a pipeline have read and write access to documents that pass through the pipeline. The processors can access fields in the source of a document and the document’s metadata fields.

Accessing Fields in the Source

Accessing a field in the source is straightforward. You simply refer to fields by their name. For example:

{
  "set": {
    "field": "my_field",
    "value": 582.1
  }
}

On top of this, fields from the source are always accessible via the _source prefix:

{
  "set": {
    "field": "_source.my_field",
    "value": 582.1
  }
}

Accessing Metadata Fields

You can access metadata fields in the same way that you access fields in the source. This is possible because Elasticsearch doesn’t allow fields in the source that have the same name as metadata fields.

The following example sets the _id metadata field of a document to 1:

{
  "set": {
    "field": "_id",
    "value": "1"
  }
}

The following metadata fields are accessible by a processor: _index, _type, _id, _routing.

Accessing Ingest Metadata Fields

Beyond metadata fields and source fields, ingest also adds ingest metadata to the documents that it processes. These metadata properties are accessible under the _ingest key. Currently ingest adds the ingest timestamp under the _ingest.timestamp key of the ingest metadata. The ingest timestamp is the time when Elasticsearch received the index or bulk request to pre-process the document.

Any processor can add ingest-related metadata during document processing. Ingest metadata is transient and is lost after a document has been processed by the pipeline. Therefore, ingest metadata won’t be indexed.

The following example adds a field with the name received. The value is the ingest timestamp:

{
  "set": {
    "field": "received",
    "value": "{{_ingest.timestamp}}"
  }
}

Unlike Elasticsearch metadata fields, the ingest metadata field name _ingest can be used as a valid field name in the source of a document. Use _source._ingest to refer to the field in the source document. Otherwise, _ingest will be interpreted as an ingest metadata field.
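
For example, a minimal sketch (the status subfield here is hypothetical) that writes to a field named _ingest inside the document source rather than to the ingest metadata:

{
  "set": {
    "field": "_source._ingest.status",
    "value": "processed"
  }
}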

Accessing Fields and Metafields in Templates

A number of processor settings also support templating. Settings that support templating can have zero or more template snippets. A template snippet begins with {{ and ends with }}. Accessing fields and metafields in templates is exactly the same as via regular processor field settings.

The following example adds a field named field_c. Its value is a concatenation of the values of field_a and field_b.

{
  "set": {
    "field": "field_c",
    "value": "{{field_a}} {{field_b}}"
  }
}

The following example uses the value of the geoip.country_iso_code field in the source to set the index that the document will be indexed into:

{
  "set": {
    "field": "_index",
    "value": "{{geoip.country_iso_code}}"
  }
}

Dynamic field names are also supported. This example sets the field named after the value of service to the value of the field code:

{
  "set": {
    "field": "{{service}}",
    "value": "{{code}}"
  }
}

Conditional Execution in Pipelines

Each processor allows for an optional if condition to determine if that processor should be executed or skipped. The value of the if is a Painless script that needs to evaluate to true or false.

For example the following processor will drop the document (i.e. not index it) if the input document has a field named network_name and it is equal to Guest.

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "drop": {
        "if": "ctx.network_name == 'Guest'"
      }
    }
  ]
}

Using that pipeline for an index request:

POST test/_doc/1?pipeline=drop_guests_network
{
  "network_name" : "Guest"
}

Results in nothing indexed since the conditional evaluated to true.

{
  "_index": "test",
  "_type": "_doc",
  "_id": "1",
  "_version": -3,
  "result": "noop",
  "_shards": {
    "total": 0,
    "successful": 0,
    "failed": 0
  }
}

Handling Nested Fields in Conditionals

Source documents often contain nested fields. Care should be taken to avoid NullPointerExceptions if the parent object does not exist in the document. For example, ctx.a.b.c can throw a NullPointerException if the source document does not have a top-level a object or a second-level b object.

To help protect against NullPointerExceptions, null safe operations should be used. Fortunately, Painless makes {painless}/painless-operators-reference.html#null-safe-operator[null safe] operations easy with the ?. operator.

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "drop": {
        "if": "ctx.network?.name == 'Guest'"
      }
    }
  ]
}

The following document will get dropped correctly:

POST test/_doc/1?pipeline=drop_guests_network
{
  "network": {
    "name": "Guest"
  }
}

Thanks to the ?. operator the following document will not throw an error. If the pipeline used a . the following document would throw a NullPointerException since the network object is not part of the source document.

POST test/_doc/2?pipeline=drop_guests_network
{
  "foo" : "bar"
}

The source document can also use dot delimited fields to represent nested fields.

For example, instead of the source document defining the fields nested:

{
  "network": {
    "name": "Guest"
  }
}

The source document may have the nested fields flattened as such:

{
  "network.name": "Guest"
}

If this is the case, use the Dot Expand Processor so that the nested fields may be used in a conditional.

PUT _ingest/pipeline/drop_guests_network
{
  "processors": [
    {
      "dot_expander": {
        "field": "network.name"
      }
    },
    {
      "drop": {
        "if": "ctx.network?.name == 'Guest'"
      }
    }
  ]
}

Now the following input document can be used with a conditional in the pipeline.

POST test/_doc/3?pipeline=drop_guests_network
{
  "network.name": "Guest"
}

The ?. operator works well for use in the if conditional because the {painless}/painless-operators-reference.html#null-safe-operator[null safe operator] returns null if the object is null and == is null safe (as well as many other {painless}/painless-operators.html[painless operators]).

However, calling a method such as .equalsIgnoreCase is not null safe and can result in a NullPointerException.

Some situations allow for the same functionality but done in a null safe manner. For example: 'Guest'.equalsIgnoreCase(ctx.network?.name) is null safe because Guest is always non null, but ctx.network?.name.equalsIgnoreCase('Guest') is not null safe since ctx.network?.name can return null.
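
For example, a sketch of a drop processor using the null safe form described above:

{
  "drop": {
    "if": "'Guest'.equalsIgnoreCase(ctx.network?.name)"
  }
}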

Some situations require an explicit null check. In the following example there is no null safe alternative, so an explicit null check is needed.

{
  "drop": {
    "if": "ctx.network?.name != null && ctx.network.name.contains('Guest')"
  }
}

Complex Conditionals

The if condition can be more than a simple equality check. The full power of the Painless Scripting Language is available and running in the {painless}/painless-ingest-processor-context.html[ingest processor context].

Important
The value of ctx is read-only in if conditions.

The following, more complex, if condition drops the document (i.e. does not index it) unless it has a multi-valued tags field with at least one value that contains the characters prod (case insensitive).

PUT _ingest/pipeline/not_prod_dropper
{
  "processors": [
    {
      "drop": {
        "if": "Collection tags = ctx.tags;if(tags != null){for (String tag : tags) {if (tag.toLowerCase().contains('prod')) { return false;}}} return true;"
      }
    }
  ]
}

The conditional needs to be all on one line since JSON does not support new line characters. However, Kibana’s console supports a triple quote syntax to help with writing and debugging scripts like these.

PUT _ingest/pipeline/not_prod_dropper
{
  "processors": [
    {
      "drop": {
        "if": """
            Collection tags = ctx.tags;
            if(tags != null){
              for (String tag : tags) {
                  if (tag.toLowerCase().contains('prod')) {
                      return false;
                  }
              }
            }
            return true;
        """
      }
    }
  ]
}
POST test/_doc/1?pipeline=not_prod_dropper
{
  "tags": ["application:myapp", "env:Stage"]
}

The document is dropped since prod (case insensitive) is not found in the tags.

The following document is indexed (i.e. not dropped) since prod (case insensitive) is found in the tags.

POST test/_doc/2?pipeline=not_prod_dropper
{
  "tags": ["application:myapp", "env:Production"]
}

The Simulate Pipeline API with verbose can be used to help build out complex conditionals. If the conditional evaluates to false it will be omitted from the verbose results of the simulation since the document will not change.
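
For example, a sketch that simulates the not_prod_dropper pipeline defined above against one of the sample documents:

POST _ingest/pipeline/not_prod_dropper/_simulate?verbose
{
  "docs": [
    { "_source": { "tags": ["application:myapp", "env:Stage"] } }
  ]
}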

Care should be taken to avoid overly complex or expensive conditional checks since the condition needs to be checked for each and every document.

Conditionals with the Pipeline Processor

The combination of the if conditional and the Pipeline Processor can result in a simple, yet powerful means to process heterogeneous input. For example, you can define a single pipeline that delegates to other pipelines based on some criteria.

PUT _ingest/pipeline/logs_pipeline
{
  "description": "A pipeline of pipelines for log files",
  "version": 1,
  "processors": [
    {
      "pipeline": {
        "if": "ctx.service?.name == 'apache_httpd'",
        "name": "httpd_pipeline"
      }
    },
    {
      "pipeline": {
        "if": "ctx.service?.name == 'syslog'",
        "name": "syslog_pipeline"
      }
    },
    {
      "fail": {
        "if": "ctx.service?.name != 'apache_httpd' && ctx.service?.name != 'syslog'",
        "message": "This pipeline requires service.name to be either `syslog` or `apache_httpd`"
      }
    }
  ]
}

The above example allows consumers to point to a single pipeline for all log based index requests. Based on the conditional, the correct pipeline will be called to process that type of data.

This pattern works well with a default pipeline defined in an index mapping template for all indexes that hold data that needs pre-index processing.
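
For example, a minimal sketch (the template name and index pattern are hypothetical) that makes logs_pipeline the default pipeline for matching indices via an index template:

PUT _template/logs_template
{
  "index_patterns": ["logs-*"],
  "settings": {
    "index.default_pipeline": "logs_pipeline"
  }
}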

Conditionals with Regular Expressions

The if conditional is implemented as a Painless script, which requires {painless}/painless-examples.html#modules-scripting-painless-regex[explicit support for regular expressions].

script.painless.regex.enabled: true must be set in elasticsearch.yml to use regular expressions in the if condition.

If regular expressions are enabled, operators such as =~ can be used against a /pattern/ for conditions.

For example:

PUT _ingest/pipeline/check_url
{
  "processors": [
    {
      "set": {
        "if": "ctx.href?.url =~ /^http[^s]/",
        "field": "href.insecure",
        "value": true
      }
    }
  ]
}
POST test/_doc/1?pipeline=check_url
{
  "href": {
    "url": "http://www.elastic.co/"
  }
}

Results in:

{
  "_index": "test",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "_seq_no": 60,
  "_primary_term": 1,
  "found": true,
  "_source": {
    "href": {
      "insecure": true,
      "url": "http://www.elastic.co/"
    }
  }
}

Regular expressions can be expensive and should be avoided if viable alternatives exist.

For example in this case startsWith can be used to get the same result without using a regular expression:

PUT _ingest/pipeline/check_url
{
  "processors": [
    {
      "set": {
        "if": "ctx.href?.url != null && ctx.href.url.startsWith('http://')",
        "field": "href.insecure",
        "value": true
      }
    }
  ]
}

Handling Failures in Pipelines

In its simplest use case, a pipeline defines a list of processors that are executed sequentially, and processing halts at the first exception. This behavior may not be desirable when failures are expected. For example, you may have logs that don’t match the specified grok expression. Instead of halting execution, you may want to index such documents into a separate index.

To enable this behavior, you can use the on_failure parameter. The on_failure parameter defines a list of processors to be executed immediately following the failed processor. You can specify this parameter at the pipeline level, as well as at the processor level. If a processor specifies an on_failure configuration, whether it is empty or not, any exceptions that are thrown by the processor are caught, and the pipeline continues executing the remaining processors. Because you can define further processors within the scope of an on_failure statement, you can nest failure handling.

The following example defines a pipeline that renames the foo field in the processed document to bar. If the document does not contain the foo field, the processor attaches an error message to the document for later analysis within Elasticsearch.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "field \"foo\" does not exist, cannot rename to \"bar\""
            }
          }
        ]
      }
    }
  ]
}

The following example defines an on_failure block on a whole pipeline to change the index to which failed documents get sent.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [ ... ],
  "on_failure" : [
    {
      "set" : {
        "field" : "_index",
        "value" : "failed-{{ _index }}"
      }
    }
  ]
}

Alternatively, instead of defining behaviour in case of processor failure, it is also possible to ignore a failure and continue with the next processor by specifying the ignore_failure setting.

If the field foo in the example below doesn't exist, the failure is caught and the pipeline continues to execute, which in this case means that the pipeline does nothing.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "bar",
        "ignore_failure" : true
      }
    }
  ]
}

The ignore_failure setting can be set on any processor and defaults to false.

Accessing Error Metadata From Processors Handling Exceptions

You may want to retrieve the actual error message that was thrown by a failed processor. To do so you can access metadata fields called on_failure_message, on_failure_processor_type, and on_failure_processor_tag. These fields are only accessible from within the context of an on_failure block.

Here is an updated version of the example that you saw earlier. But instead of setting the error message manually, the example leverages the on_failure_message metadata field to provide the error message.

{
  "description" : "my first pipeline with handled exceptions",
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "to" : "bar",
        "on_failure" : [
          {
            "set" : {
              "field" : "error",
              "value" : "{{ _ingest.on_failure_message }}"
            }
          }
        ]
      }
    }
  ]
}

Processors

All processors are defined in the following way within a pipeline definition:

{
  "PROCESSOR_NAME" : {
    ... processor configuration options ...
  }
}

Each processor defines its own configuration parameters, but all processors have the ability to declare tag, on_failure and if fields. These fields are optional.

A tag is simply a string identifier of the specific instantiation of a certain processor in a pipeline. The tag field does not affect the processor’s behavior, but is very useful for bookkeeping and tracing errors to specific processors.
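
For example, a sketch (the tag and field names are illustrative) of a processor carrying a tag:

{
  "set": {
    "tag": "set-environment",
    "field": "environment",
    "value": "production"
  }
}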

The if field must contain a script that returns a boolean value. If the script evaluates to true then the processor will be executed for the given document otherwise it will be skipped. The if field takes an object with the script fields defined in script-options and accesses a read only version of the document via the same ctx variable used by scripts in the Script Processor.

{
  "set": {
    "if": "ctx.foo == 'someValue'",
    "field": "found",
    "value": true
  }
}

See Conditional Execution in Pipelines to learn more about the if field and conditional execution.

See Handling Failures in Pipelines to learn more about the on_failure field and error handling in pipelines.

The node info API can be used to figure out what processors are available in a cluster. The node info API will provide a per node list of what processors are available.
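
For example, the available processors on each node can be listed with the node info API:

GET _nodes/ingest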

Custom processors must be installed on all nodes. The put pipeline API will fail if a processor specified in a pipeline doesn't exist on all nodes. If you rely on custom processor plugins, make sure to mark these plugins as mandatory by adding the plugin.mandatory setting to the config/elasticsearch.yml file, for example:

plugin.mandatory: ingest-attachment

A node will not start if this plugin is not available.

The node stats API can be used to fetch ingest usage statistics, globally and on a per pipeline basis. Useful to find out which pipelines are used the most or spent the most time on preprocessing.
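
For example:

GET _nodes/stats/ingest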

Ingest Processor Plugins

Additional ingest processors can be implemented and installed as Elasticsearch {plugins}/intro.html[plugins]. See {plugins}/ingest.html[Ingest plugins] for information about the available ingest plugins.

Append Processor

Appends one or more values to an existing array if the field already exists and it is an array. Converts a scalar to an array and appends one or more values to it if the field exists and it is a scalar. Creates an array containing the provided values if the field doesn’t exist. Accepts a single value or an array of values.

Table 1. Append Options
Name Required Default Description

field

yes

-

The field to be appended to. Supports template snippets.

value

yes

-

The value to be appended. Supports template snippets.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "append": {
    "field": "tags",
    "value": ["production", "{{app}}", "{{owner}}"]
  }
}

Bytes Processor

Converts a human readable byte value (e.g. 1kb) to its value in bytes (e.g. 1024).

Supported human readable units are "b", "kb", "mb", "gb", "tb", "pb", case insensitive. An error will occur if the field is not in a supported format or the resultant value exceeds 2^63.

Table 2. Bytes Options
Name Required Default Description

field

yes

-

The field to convert

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "bytes": {
    "field": "file.size"
  }
}

Convert Processor

Converts a field in the currently ingested document to a different type, such as converting a string to an integer. If the field value is an array, all members will be converted.

The supported types include: integer, long, float, double, string, boolean, and auto.

Specifying boolean will set the field to true if its string value is equal to true (ignore case), to false if its string value is equal to false (ignore case), or it will throw an exception otherwise.

Specifying auto will attempt to convert the string-valued field into the closest non-string type. For example, a field whose value is "true" will be converted to its respective boolean type: true. Do note that float takes precedence over double in auto. A value of "242.15" will "automatically" be converted to 242.15 of type float. If a provided field cannot be appropriately converted, the Convert Processor will still process successfully and leave the field value as-is. In such a case, target_field will still be updated with the unconverted field value.
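
For example, a sketch (the field names are illustrative) that uses auto and writes the converted value to a separate field:

{
  "convert": {
    "field": "url.port",
    "target_field": "url.port_as_number",
    "type": "auto"
  }
}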

Table 3. Convert Options
Name Required Default Description

field

yes

-

The field whose value is to be converted

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

type

yes

-

The type to convert the existing value to

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

PUT _ingest/pipeline/my-pipeline-id
{
  "description": "converts the content of the id field to an integer",
  "processors" : [
    {
      "convert" : {
        "field" : "id",
        "type": "integer"
      }
    }
  ]
}

Date Processor

Parses dates from fields, and then uses the date or timestamp as the timestamp for the document. By default, the date processor adds the parsed date as a new field called @timestamp. You can specify a different field by setting the target_field configuration parameter. Multiple date formats are supported as part of the same date processor definition. They will be used sequentially to attempt parsing the date field, in the same order they were defined as part of the processor definition.

Table 4. Date options
Name Required Default Description

field

yes

-

The field to get the date from.

target_field

no

@timestamp

The field that will hold the parsed date.

formats

yes

-

An array of the expected date formats. Can be a Joda pattern or one of the following formats: ISO8601, UNIX, UNIX_MS, or TAI64N.

timezone

no

UTC

The timezone to use when parsing the date. Supports template snippets.

locale

no

ENGLISH

The locale to use when parsing the date, relevant when parsing month names or week days. Supports template snippets.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Here is an example that adds the parsed date to the timestamp field based on the initial_date field:

{
  "description" : "...",
  "processors" : [
    {
      "date" : {
        "field" : "initial_date",
        "target_field" : "timestamp",
        "formats" : ["dd/MM/yyyy hh:mm:ss"],
        "timezone" : "Europe/Amsterdam"
      }
    }
  ]
}

The timezone and locale processor parameters are templated. This means that their values can be extracted from fields within documents. The example below shows how to extract the locale/timezone details from existing fields, my_timezone and my_locale, in the ingested document that contain the timezone and locale values.

{
  "description" : "...",
  "processors" : [
    {
      "date" : {
        "field" : "initial_date",
        "target_field" : "timestamp",
        "formats" : ["ISO8601"],
        "timezone" : "{{my_timezone}}",
        "locale" : "{{my_locale}}"
      }
    }
  ]
}

Date Index Name Processor

The purpose of this processor is to point documents to the right time-based index based on a date or timestamp field in a document by using the date math index name support.

The processor sets the _index meta field with a date math index name expression based on the provided index name prefix, a date or timestamp field in the documents being processed and the provided date rounding.

First, this processor fetches the date or timestamp from a field in the document being processed. Optionally, date formatting can be configured to control how the field’s value is parsed into a date. This date, the provided index name prefix, and the provided date rounding are then formatted into a date math index name expression. Optionally, date formatting can also be specified for how the date should be formatted into the date math index name expression.

An example pipeline that points documents to a monthly index that starts with a myindex- prefix based on a date in the date1 field:

PUT _ingest/pipeline/monthlyindex
{
  "description": "monthly date-time index naming",
  "processors" : [
    {
      "date_index_name" : {
        "field" : "date1",
        "index_name_prefix" : "myindex-",
        "date_rounding" : "M"
      }
    }
  ]
}

Using that pipeline for an index request:

PUT /myindex/_doc/1?pipeline=monthlyindex
{
  "date1" : "2016-04-25T12:02:01.789Z"
}

{
  "_index" : "myindex-2016-04-01",
  "_type" : "_doc",
  "_id" : "1",
  "_version" : 1,
  "result" : "created",
  "_shards" : {
    "total" : 2,
    "successful" : 1,
    "failed" : 0
  },
  "_seq_no" : 55,
  "_primary_term" : 1
}

The above request will not index this document into the myindex index, but into the myindex-2016-04-01 index because it was rounded by month. This is because the date-index-name-processor overrides the _index property of the document.

To see the date-math value of the index supplied in the actual index request, which resulted in the above document being indexed into myindex-2016-04-01, we can inspect the effects of the processor using a simulate request.

POST _ingest/pipeline/_simulate
{
  "pipeline" :
  {
    "description": "monthly date-time index naming",
    "processors" : [
      {
        "date_index_name" : {
          "field" : "date1",
          "index_name_prefix" : "myindex-",
          "date_rounding" : "M"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "date1": "2016-04-25T12:02:01.789Z"
      }
    }
  ]
}

and the result:

{
  "docs" : [
    {
      "doc" : {
        "_id" : "_id",
        "_index" : "<myindex-{2016-04-25||/M{yyyy-MM-dd|UTC}}>",
        "_type" : "_type",
        "_source" : {
          "date1" : "2016-04-25T12:02:01.789Z"
        },
        "_ingest" : {
          "timestamp" : "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}

The above example shows that _index was set to <myindex-{2016-04-25||/M{yyyy-MM-dd|UTC}}>. Elasticsearch understands this to mean 2016-04-01, as is explained in the date math index name documentation.

Table 5. Date index name options
Name Required Default Description

field

yes

-

The field to get the date or timestamp from.

index_name_prefix

no

-

A prefix of the index name to be prepended before the printed date. Supports template snippets.

date_rounding

yes

-

How to round the date when formatting the date into the index name. Valid values are: y (year), M (month), w (week), d (day), h (hour), m (minute) and s (second). Supports template snippets.

date_formats

no

yyyy-MM-dd'T'HH:mm:ss.SSSZ

An array of the expected date formats for parsing dates / timestamps in the document being preprocessed. Can be a Joda pattern or one of the following formats: ISO8601, UNIX, UNIX_MS, or TAI64N.

timezone

no

UTC

The timezone to use when parsing the date and when the date math index name expression resolves into a concrete index name.

locale

no

ENGLISH

The locale to use when parsing the date from the document being preprocessed, relevant when parsing month names or week days.

index_name_format

no

yyyy-MM-dd

The format to be used when printing the parsed date into the index name. A valid Joda pattern is expected here. Supports template snippets.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Dissect Processor

Similar to the Grok Processor, dissect also extracts structured fields out of a single text field within a document. However unlike the Grok Processor, dissect does not use Regular Expressions. This allows dissect’s syntax to be simple and for some cases faster than the Grok Processor.

Dissect matches a single text field against a defined pattern.

For example the following pattern:

%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{status} %{size}

will match a log line of this format:

1.2.3.4 - - [30/Apr/1998:22:00:52 +0000] \"GET /english/venues/cities/images/montpellier/18.gif HTTP/1.0\" 200 3171

and result in a document with the following fields:

"doc": {
  "_index": "_index",
  "_type": "_type",
  "_id": "_id",
  "_source": {
    "request": "/english/venues/cities/images/montpellier/18.gif",
    "auth": "-",
    "ident": "-",
    "verb": "GET",
    "@timestamp": "30/Apr/1998:22:00:52 +0000",
    "size": "3171",
    "clientip": "1.2.3.4",
    "httpversion": "1.0",
    "status": "200"
  }
}

A dissect pattern is defined by the parts of the string that will be discarded. In the example above, the first part to be discarded is a single space. Dissect finds this space, then assigns the value of clientip to everything up until that space. Later dissect matches the [ and then ] and then assigns @timestamp to everything in-between [ and ]. Paying special attention to the parts of the string to discard will help build successful dissect patterns.

Successful matches require all keys in a pattern to have a value. If any of the %{keyname} defined in the pattern do not have a value, then an exception is thrown and may be handled by the on_failure directive. An empty key %{} or a named skip key can be used to match values, but exclude the value from the final document. All matched values are represented as string data types. The convert processor may be used to convert to expected data type.

Dissect also supports key modifiers that can change dissect’s default behavior. For example you can instruct dissect to ignore certain fields, append fields, skip over padding, etc. See below for more information.

Table 6. Dissect Options
Name Required Default Description

field

yes

-

The field to dissect

pattern

yes

-

The pattern to apply to the field

append_separator

no

"" (empty string)

The character(s) that separate the appended fields.

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "dissect": {
    "field": "message",
    "pattern" : "%{clientip} %{ident} %{auth} [%{@timestamp}] \"%{verb} %{request} HTTP/%{httpversion}\" %{status} %{size}"
   }
}

Dissect key modifiers

Key modifiers can change the default behavior for dissection. Key modifiers may be found to the left or right of the %{keyname}, always inside the %{ and }. For example %{+keyname ->} has the append and right padding modifiers.

Table 7. Dissect Key Modifiers
Modifier Name Position Example Description Details

->

Skip right padding

(far) right

%{keyname1->}

Skips any repeated characters to the right

link

+

Append

left

%{+keyname} %{+keyname}

Appends two or more fields together

link

+ with /n

Append with order

left and right

%{+keyname/2} %{+keyname/1}

Appends two or more fields together in the order specified

link

?

Named skip key

left

%{?ignoreme}

Skips the matched value in the output. Same behavior as %{}

link

* and &

Reference keys

left

%{*r1} %{&r1}

Sets the output key as value of * and output value of &

link

Right padding modifier (->)

The algorithm that performs the dissection is very strict in that it requires all characters in the pattern to match the source string. For example, the pattern %{fookey} %{barkey} (1 space), will match the string "foo bar" (1 space), but will not match the string "foo  bar" (2 spaces) since the pattern has only 1 space and the source string has 2 spaces.

The right padding modifier helps with this case. Adding the right padding modifier to the pattern, %{fookey->} %{barkey}, it will now match "foo bar" (1 space), "foo  bar" (2 spaces), and even "foo          bar" (10 spaces).

Use the right padding modifier to allow for repetition of the characters after a %{keyname->}.

The right padding modifier may be placed on any key with any other modifiers. It should always be the furthest right modifier. For example: %{+keyname/1->} and %{->}

Right padding modifier example

Pattern

%{ts->} %{level}

Input

1998-08-10T17:15:42,466          WARN

Result

  • ts = 1998-08-10T17:15:42,466

  • level = WARN

The right padding modifier may be used with an empty key to help skip unwanted data. For example, the same input string, but wrapped with brackets requires the use of an empty right padded key to achieve the same result.

Right padding modifier with empty key example

Pattern

[%{ts}]%{->}[%{level}]

Input

[1998-08-10T17:15:42,466]            [WARN]

Result

  • ts = 1998-08-10T17:15:42,466

  • level = WARN

Append modifier (+)

Dissect supports appending two or more results together for the output. Values are appended left to right. An append separator can be specified. In this example the append_separator is defined as a space.

Append modifier example

Pattern

%{+name} %{+name} %{+name} %{+name}

Input

john jacob jingleheimer schmidt

Result

  • name = john jacob jingleheimer schmidt

Append with order modifier (+ and /n)

Dissect supports appending two or more results together for the output. Values are appended based on the order defined (/n). An append separator can be specified. In this example the append_separator is defined as a comma.

Append with order modifier example

Pattern

%{+name/2} %{+name/4} %{+name/3} %{+name/1}

Input

john jacob jingleheimer schmidt

Result

  • name = schmidt,john,jingleheimer,jacob

Named skip key (?)

Dissect supports ignoring matches in the final result. This can be done with an empty key %{}, but for readability it may be desired to give that empty key a name.

Named skip key modifier example

Pattern

%{clientip} %{?ident} %{?auth} [%{@timestamp}]

Input

1.2.3.4 - - [30/Apr/1998:22:00:52 +0000]

Result

  • clientip = 1.2.3.4

  • @timestamp = 30/Apr/1998:22:00:52 +0000

Reference keys (* and &)

Dissect supports using parsed values as the key/value pairings for the structured content. Imagine a system that partially logs in key/value pairs. Reference keys allow you to maintain that key/value relationship.

Reference key modifier example

Pattern

[%{ts}] [%{level}] %{*p1}:%{&p1} %{*p2}:%{&p2}

Input

[2018-08-10T17:15:42,466] [ERR] ip:1.2.3.4 error:REFUSED

Result

  • ts = 2018-08-10T17:15:42,466

  • level = ERR

  • ip = 1.2.3.4

  • error = REFUSED

Dot Expander Processor

Expands a field with dots into an object field. This processor allows fields with dots in the name to be accessible by other processors in the pipeline. Otherwise these fields can’t be accessed by any processor.

Table 8. Dot Expand Options
Name Required Default Description

field

yes

-

The field to expand into an object field

path

no

-

The field that contains the field to expand. Only required if the field to expand is part of another object field, because the field option can only understand leaf fields.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "dot_expander": {
    "field": "foo.bar"
  }
}

For example the dot expand processor would turn this document:

{
  "foo.bar" : "value"
}

into:

{
  "foo" : {
    "bar" : "value"
  }
}

If there is already a bar field nested under foo then this processor merges the foo.bar field into it. If the field is a scalar value then it will turn that field into an array field.

For example, the following document:

{
  "foo.bar" : "value2",
  "foo" : {
    "bar" : "value1"
  }
}

is transformed by the dot_expander processor into:

{
  "foo" : {
    "bar" : ["value1", "value2"]
  }
}

If any field outside of the leaf field conflicts with a pre-existing field of the same name, then that field needs to be renamed first.

Consider the following document:

{
  "foo": "value1",
  "foo.bar": "value2"
}

Then the foo needs to be renamed first before the dot_expander processor is applied. So in order for the foo.bar field to properly be expanded into the bar field under the foo field the following pipeline should be used:

{
  "processors" : [
    {
      "rename" : {
        "field" : "foo",
        "target_field" : "foo.bar""
      }
    },
    {
      "dot_expander": {
        "field": "foo.bar"
      }
    }
  ]
}

The reason for this is that Ingest doesn’t know how to automatically cast a scalar field to an object field.

Drop Processor

Drops the document without raising any errors. This is useful to prevent the document from getting indexed based on some condition.

Table 9. Drop Options
Name Required Default Description

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "drop": {
    "if" : "ctx.network_name == 'Guest'"
  }
}

Fail Processor

Raises an exception. This is useful for when you expect a pipeline to fail and want to relay a specific message to the requester.

Table 10. Fail Options
Name Required Default Description

message

yes

-

The error message thrown by the processor. Supports template snippets.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "fail": {
    "if" : "ctx.tags.contains('production') != true",
    "message": "The production tag is not present, found tags: {{tags}}"
  }
}

Foreach Processor

Processes elements in an array of unknown length.

All processors can operate on elements inside an array, but if all elements of an array need to be processed in the same way, defining a processor for each element becomes cumbersome and tricky because it is likely that the number of elements in an array is unknown. For this reason the foreach processor exists. By specifying the field holding array elements and a processor that defines what should happen to each element, array fields can easily be preprocessed.

A processor inside the foreach processor works in the array element context and puts that element in the ingest metadata under the _ingest._value key. If the array element is a json object, _ingest._value holds all immediate fields of that json object; if the array element is a scalar value, _ingest._value just holds that value. Note that if a processor prior to the foreach processor used the _ingest._value key, then the specified value will not be available to the processor inside the foreach processor. The foreach processor does restore the original value, so that value is available to processors after the foreach processor.

Note that any other fields from the document are accessible and modifiable like with all other processors. This processor just puts the current array element being read into the _ingest._value ingest metadata attribute, so that it may be pre-processed.

If the foreach processor fails to process an element inside the array, and no on_failure processor has been specified, then it aborts the execution and leaves the array unmodified.

Table 11. Foreach Options
Name Required Default Description

field

yes

-

The array field

processor

yes

-

The processor to execute against each field

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Assume the following document:

{
  "values" : ["foo", "bar", "baz"]
}

When this foreach processor operates on this sample document:

{
  "foreach" : {
    "field" : "values",
    "processor" : {
      "uppercase" : {
        "field" : "_ingest._value"
      }
    }
  }
}

Then the document will look like this after preprocessing:

{
  "values" : ["FOO", "BAR", "BAZ"]
}

Let’s take a look at another example:

{
  "persons" : [
    {
      "id" : "1",
      "name" : "John Doe"
    },
    {
      "id" : "2",
      "name" : "Jane Doe"
    }
  ]
}

In this case, the id field needs to be removed, so the following foreach processor is used:

{
  "foreach" : {
    "field" : "persons",
    "processor" : {
      "remove" : {
        "field" : "_ingest._value.id"
      }
    }
  }
}

After preprocessing the result is:

{
  "persons" : [
    {
      "name" : "John Doe"
    },
    {
      "name" : "Jane Doe"
    }
  ]
}

The wrapped processor can have an on_failure definition. For example, the id field may not exist on all person objects. Instead of failing the index request, you can use an on_failure block to send the document to the 'failure_index' index for later inspection:

{
  "foreach" : {
    "field" : "persons",
    "processor" : {
      "remove" : {
        "field" : "_value.id",
        "on_failure" : [
          {
            "set" : {
              "field": "_index",
              "value": "failure_index"
            }
          }
        ]
      }
    }
  }
}

In this example, if the remove processor does fail, then the array elements that have been processed thus far will be updated.

Another advanced example can be found in the {plugins}/ingest-attachment-with-arrays.html[attachment processor documentation].

GeoIP Processor

The geoip processor adds information about the geographical location of IP addresses, based on data from the Maxmind databases. This processor adds this information by default under the geoip field. The geoip processor can resolve both IPv4 and IPv6 addresses.

The ingest-geoip module ships by default with the GeoLite2 City, GeoLite2 Country and GeoLite2 ASN geoip2 databases from Maxmind, made available under the CCA-ShareAlike 4.0 license. For more details see http://dev.maxmind.com/geoip/geoip2/geolite2/

The geoip processor can run with other GeoIP2 databases from Maxmind. The files must be copied into the ingest-geoip config directory, and the database_file option should be used to specify the filename of the custom database. Custom database files must be stored uncompressed. The ingest-geoip config directory is located at $ES_CONFIG/ingest-geoip.
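
For example, a sketch (the database filename is hypothetical) of a geoip processor configured with a custom database copied into the ingest-geoip config directory:

{
  "geoip": {
    "field": "ip",
    "database_file": "GeoIP2-City.mmdb"
  }
}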

Using the geoip Processor in a Pipeline

Table 12. geoip options
Name Required Default Description

field

yes

-

The field to get the ip address from for the geographical lookup.

target_field

no

geoip

The field that will hold the geographical information looked up from the Maxmind database.

database_file

no

GeoLite2-City.mmdb

The database filename in the geoip config directory. The ingest-geoip module ships with the GeoLite2-City.mmdb, GeoLite2-Country.mmdb and GeoLite2-ASN.mmdb files.

properties

no

[continent_name, country_iso_code, region_iso_code, region_name, city_name, location] *

Controls what properties are added to the target_field based on the geoip lookup.

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

*Depends on what is available in database_file:

  • If the GeoLite2 City database is used, then the following fields may be added under the target_field: ip, country_iso_code, country_name, continent_name, region_iso_code, region_name, city_name, timezone, latitude, longitude and location. The fields actually added depend on what has been found and which properties were configured in properties.

  • If the GeoLite2 Country database is used, then the following fields may be added under the target_field: ip, country_iso_code, country_name and continent_name. The fields actually added depend on what has been found and which properties were configured in properties.

  • If the GeoLite2 ASN database is used, then the following fields may be added under the target_field: ip, asn, and organization_name. The fields actually added depend on what has been found and which properties were configured in properties.

Here is an example that uses the default city database and adds the geographical information to the geoip field based on the ip field:

PUT _ingest/pipeline/geoip
{
  "description" : "Add geoip info",
  "processors" : [
    {
      "geoip" : {
        "field" : "ip"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=geoip
{
  "ip": "8.8.8.8"
}
GET my_index/_doc/my_id

Which returns:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 55,
  "_primary_term": 1,
  "_source": {
    "ip": "8.8.8.8",
    "geoip": {
      "continent_name": "North America",
      "country_iso_code": "US",
      "location": { "lat": 37.751, "lon": -97.822 }
    }
  }
}

Here is an example that uses the default country database and adds the geographical information to the geo field based on the ip field. Note that this database is included in the module. So this:

PUT _ingest/pipeline/geoip
{
  "description" : "Add geoip info",
  "processors" : [
    {
      "geoip" : {
        "field" : "ip",
        "target_field" : "geo",
        "database_file" : "GeoLite2-Country.mmdb"
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=geoip
{
  "ip": "8.8.8.8"
}
GET my_index/_doc/my_id

returns this:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 65,
  "_primary_term": 1,
  "_source": {
    "ip": "8.8.8.8",
    "geo": {
      "continent_name": "North America",
      "country_iso_code": "US",
    }
  }
}

Not all IP addresses find geo information from the database. When this occurs, no target_field is inserted into the document.

Here is an example of what documents will be indexed as when information for "80.231.5.0" cannot be found:

PUT _ingest/pipeline/geoip
{
  "description" : "Add geoip info",
  "processors" : [
    {
      "geoip" : {
        "field" : "ip"
      }
    }
  ]
}

PUT my_index/_doc/my_id?pipeline=geoip
{
  "ip": "80.231.5.0"
}

GET my_index/_doc/my_id

Which returns:

{
  "_index" : "my_index",
  "_type" : "_doc",
  "_id" : "my_id",
  "_version" : 1,
  "_seq_no" : 71,
  "_primary_term": 1,
  "found" : true,
  "_source" : {
    "ip" : "80.231.5.0"
  }
}

Recognizing Location as a Geopoint

Although this processor enriches your document with a location field containing the estimated latitude and longitude of the IP address, this field will not be indexed as a {ref}/geo-point.html[geo_point] type in Elasticsearch without explicitly defining it as such in the mapping.

You can use the following mapping for the example index above:

PUT my_ip_locations
{
  "mappings": {
    "_doc": {
      "properties": {
        "geoip": {
          "properties": {
            "location": { "type": "geo_point" }
          }
        }
      }
    }
  }
}

Node Settings

The geoip processor supports the following setting:

ingest.geoip.cache_size

The maximum number of results that should be cached. Defaults to 1000.

Note that these settings are node settings and apply to all geoip processors, i.e. there is one cache for all defined geoip processors.
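
For example, a sketch of raising the cache size in elasticsearch.yml (the value 2000 is arbitrary; the default is 1000):

ingest.geoip.cache_size: 2000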

Grok Processor

Extracts structured fields out of a single text field within a document. You choose which field to extract matched fields from, as well as the grok pattern you expect will match. A grok pattern is like a regular expression that supports aliased expressions that can be reused.

This tool is perfect for syslog logs, apache and other webserver logs, mysql logs, and in general, any log format that is generally written for humans and not computer consumption. This processor comes packaged with many reusable patterns.

If you need help building patterns to match your logs, you will find the {kibana-ref}/xpack-grokdebugger.html[Grok Debugger] tool quite useful! The Grok Debugger is an {xpack} feature under the Basic License and is therefore free to use. The Grok Constructor at http://grokconstructor.appspot.com/ is also a useful tool.

Grok Basics

Grok sits on top of regular expressions, so any regular expressions are valid in grok as well. The regular expression library is Oniguruma, and you can see the full supported regexp syntax on the Oniguruma site.

Grok works by leveraging this regular expression language to allow naming existing patterns and combining them into more complex patterns that match your fields.

The syntax for reusing a grok pattern comes in three forms: %{SYNTAX:SEMANTIC}, %{SYNTAX}, %{SYNTAX:SEMANTIC:TYPE}.

The SYNTAX is the name of the pattern that will match your text. For example, 3.44 will be matched by the NUMBER pattern and 55.3.244.1 will be matched by the IP pattern. The syntax is how you match. NUMBER and IP are both patterns that are provided within the default patterns set.

The SEMANTIC is the identifier you give to the piece of text being matched. For example, 3.44 could be the duration of an event, so you could call it simply duration. Further, a string 55.3.244.1 might identify the client making a request.

The TYPE is the type to which you wish to cast your named field. int and float are currently the only types supported for coercion.

For example, you might want to match the following text:

3.44 55.3.244.1

You may know that the message in the example is a number followed by an IP address. You can match this text by using the following Grok expression.

%{NUMBER:duration} %{IP:client}
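
As a brief illustration of type coercion (a sketch only; the pipeline and sample document are invented), the expression can be run through the simulate API with duration cast to a float:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "grok type coercion example",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{NUMBER:duration:float} %{IP:client}"]
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "3.44 55.3.244.1"
      }
    }
  ]
}

In the simulated result, duration should come back as the number 3.44 rather than the string "3.44".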

Using the Grok Processor in a Pipeline

Table 13. Grok Options
Name Required Default Description

field

yes

-

The field to use for grok expression parsing

patterns

yes

-

An ordered list of grok expressions to match and extract named captures with. Returns on the first expression in the list that matches.

pattern_definitions

no

-

A map of pattern-name and pattern tuples defining custom patterns to be used by the current processor. Patterns matching existing names will override the pre-existing definition.

trace_match

no

false

When true, _ingest._grok_match_index will be inserted into your matched document’s metadata, containing the index of the pattern in patterns that matched.

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Here is an example of using the provided patterns to extract and name structured fields from a string field in a document.

{
  "message": "55.3.244.1 GET /index.html 15824 0.043"
}

The pattern for this could be:

%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}

Here is an example pipeline for processing the above document by using Grok:

{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["%{IP:client} %{WORD:method} %{URIPATHPARAM:request} %{NUMBER:bytes} %{NUMBER:duration}"]
      }
    }
  ]
}

This pipeline will insert these named captures as new fields within the document, like so:

{
  "message": "55.3.244.1 GET /index.html 15824 0.043",
  "client": "55.3.244.1",
  "method": "GET",
  "request": "/index.html",
  "bytes": 15824,
  "duration": "0.043"
}

Custom Patterns

The Grok processor comes pre-packaged with a base set of patterns. These patterns may not always have what you are looking for. Patterns have a very basic format. Each entry has a name and the pattern itself.

You can add your own patterns to a processor definition under the pattern_definitions option. Here is an example of a pipeline specifying custom pattern definitions:

{
  "description" : "...",
  "processors": [
    {
      "grok": {
        "field": "message",
        "patterns": ["my %{FAVORITE_DOG:dog} is colored %{RGB:color}"],
        "pattern_definitions" : {
          "FAVORITE_DOG" : "beagle",
          "RGB" : "RED|GREEN|BLUE"
        }
      }
    }
  ]
}

Providing Multiple Match Patterns

Sometimes one pattern is not enough to capture the potential structure of a field. Let’s assume we want to match all messages that contain your favorite pet breed, either cat or dog. One way to accomplish this is to provide two distinct patterns that can be matched, instead of one really complicated expression capturing the same "or" behavior.

Here is an example of such a configuration executed against the simulate API:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description" : "parse multiple patterns",
    "processors": [
      {
        "grok": {
          "field": "message",
          "patterns": ["%{FAVORITE_DOG:pet}", "%{FAVORITE_CAT:pet}"],
          "pattern_definitions" : {
            "FAVORITE_DOG" : "beagle",
            "FAVORITE_CAT" : "burmese"
          }
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "message": "I love burmese cats!"
      }
    }
  ]
}

response:

{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}

Both patterns will set the field pet with the appropriate match, but what if we want to trace which of our patterns matched and populated our fields? We can do this with the trace_match parameter. Here is the output of that same pipeline, but with "trace_match": true configured:

{
  "docs": [
    {
      "doc": {
        "_type": "_type",
        "_index": "_index",
        "_id": "_id",
        "_source": {
          "message": "I love burmese cats!",
          "pet": "burmese"
        },
        "_ingest": {
          "_grok_match_index": "1",
          "timestamp": "2016-11-08T19:43:03.850+0000"
        }
      }
    }
  ]
}

In the above response, you can see that the index of the pattern that matched was "1". That is, it was the second pattern in patterns that matched (the index starts at zero).

This trace metadata enables debugging which of the patterns matched. This information is stored in the ingest metadata and will not be indexed.

Retrieving patterns from REST endpoint

The Grok Processor comes packaged with its own REST endpoint for retrieving which patterns the processor is packaged with.

GET _ingest/processor/grok

The above request will return a response body containing a key-value representation of the built-in patterns dictionary.

{
  "patterns" : {
    "BACULA_CAPACITY" : "%{INT}{1,3}(,%{INT}{3})*",
    "PATH" : "(?:%{UNIXPATH}|%{WINPATH})",
    ...
  }
}

This can be useful to reference as the built-in patterns change across versions.

Grok watchdog

Grok expressions that take too long to execute are interrupted, and the grok processor then fails with an exception. The grok processor has a watchdog thread that determines when evaluation of a grok expression takes too long; this thread is controlled by the following settings:

Table 14. Grok watchdog settings
Name Default Description

ingest.grok.watchdog.interval

1s

How often to check whether there are grok evaluations that take longer than the maximum allowed execution time.

ingest.grok.watchdog.max_execution_time

1s

The maximum allowed execution time of a grok expression evaluation.
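
As a sketch, both settings could be tuned in elasticsearch.yml on the ingest nodes (the values shown are illustrative only):

ingest.grok.watchdog.interval: 2s
ingest.grok.watchdog.max_execution_time: 2s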

Gsub Processor

Converts a string field by applying a regular expression and a replacement. If the field is not a string, the processor will throw an exception.

Table 15. Gsub Options
Name Required Default Description

field

yes

-

The field to apply the replacement to

pattern

yes

-

The pattern to be replaced

replacement

yes

-

The string to replace the matching patterns with

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "gsub": {
    "field": "field1",
    "pattern": "\.",
    "replacement": "-"
  }
}
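
To make the effect concrete, here is a small sketch using the simulate API; the sample document is invented for illustration:

POST _ingest/pipeline/_simulate
{
  "pipeline": {
    "description": "replace dots with dashes",
    "processors": [
      {
        "gsub": {
          "field": "field1",
          "pattern": "\\.",
          "replacement": "-"
        }
      }
    ]
  },
  "docs": [
    {
      "_source": {
        "field1": "127.0.0.1"
      }
    }
  ]
}

The simulated document would then contain "field1": "127-0-0-1".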

Join Processor

Joins each element of an array into a single string using a separator character between each element. Throws an error when the field is not an array.

Table 16. Join Options
Name Required Default Description

field

yes

-

The field to be separated

separator

yes

-

The separator character

target_field

no

field

The field to assign the joined value to, by default field is updated in-place

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "join": {
    "field": "joined_array_field",
    "separator": "-"
  }
}
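
As a quick illustration (values invented for this sketch), a document containing:

{
  "joined_array_field": ["apple", "banana", "cherry"]
}

would, after this processor runs, look like:

{
  "joined_array_field": "apple-banana-cherry"
}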

JSON Processor

Converts a JSON string into a structured JSON object.

Table 17. Json Options
Name Required Default Description

field

yes

-

The field to be parsed

target_field

no

field

The field to insert the converted structured object into

add_to_root

no

false

Flag that forces the parsed JSON to be added at the top level of the document. target_field must not be set when this option is chosen.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

All JSON-supported types will be parsed (null, boolean, number, array, object, string).

Suppose you provide this configuration of the json processor:

{
  "json" : {
    "field" : "string_source",
    "target_field" : "json_target"
  }
}

If the following document is processed:

{
  "string_source": "{\"foo\": 2000}"
}

after the json processor operates on it, it will look like:

{
  "string_source": "{\"foo\": 2000}",
  "json_target": {
    "foo": 2000
  }
}

If the following configuration is provided, omitting the optional target_field setting:

{
  "json" : {
    "field" : "source_and_target"
  }
}

then after the json processor operates on this document:

{
  "source_and_target": "{\"foo\": 2000}"
}

it will look like:

{
  "source_and_target": {
    "foo": 2000
  }
}

This illustrates that, unless it is explicitly named in the processor configuration, the target_field is the same field provided in the required field configuration.
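
The add_to_root option described above is not shown in these examples; as a sketch, the following configuration parses the string and injects the resulting fields directly into the top level of the document (target_field must be left unset):

{
  "json" : {
    "field" : "string_source",
    "add_to_root" : true
  }
}

With this configuration, processing { "string_source": "{\"foo\": 2000}" } would add "foo": 2000 at the document root rather than under a separate target field.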

KV Processor

This processor helps automatically parse messages (or specific event fields) which are of the foo=bar variety.

For example, if you have a log message which contains ip=1.2.3.4 error=REFUSED, you can parse those automatically by configuring:

{
  "kv": {
    "field": "message",
    "field_split": " ",
    "value_split": "="
  }
}
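
As a sketch of the outcome, a document whose message field holds the log line above would gain top-level ip and error fields:

{
  "message": "ip=1.2.3.4 error=REFUSED",
  "ip": "1.2.3.4",
  "error": "REFUSED"
}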
Table 18. Kv Options
Name Required Default Description

field

yes

-

The field to be parsed

field_split

yes

-

Regex pattern to use for splitting key-value pairs

value_split

yes

-

Regex pattern to use for splitting the key from the value within a key-value pair

target_field

no

null

The field to insert the extracted keys into. Defaults to the root of the document

include_keys

no

null

List of keys to filter and insert into document. Defaults to including all keys

exclude_keys

no

null

List of keys to exclude from document

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

prefix

no

null

Prefix to be added to extracted keys

trim_key

no

null

String of characters to trim from extracted keys

trim_value

no

null

String of characters to trim from extracted values

strip_brackets

no

false

If true, strip brackets (), <>, [] as well as quotes ' and " from extracted values

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Lowercase Processor

Converts a string to its lowercase equivalent.

Table 19. Lowercase Options
Name Required Default Description

field

yes

-

The field to make lowercase

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "lowercase": {
    "field": "foo"
  }
}

Pipeline Processor

Executes another pipeline.

Table 20. Pipeline Options
Name Required Default Description

name

yes

-

The name of the pipeline to execute

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "pipeline": {
    "name": "inner-pipeline"
  }
}

An example of using this processor for nesting pipelines would be:

Define an inner pipeline:

PUT _ingest/pipeline/pipelineA
{
  "description" : "inner pipeline",
  "processors" : [
    {
      "set" : {
        "field": "inner_pipeline_set",
        "value": "inner"
      }
    }
  ]
}

Define another pipeline that uses the previously defined inner pipeline:

PUT _ingest/pipeline/pipelineB
{
  "description" : "outer pipeline",
  "processors" : [
    {
      "pipeline" : {
        "name": "pipelineA"
      }
    },
    {
      "set" : {
        "field": "outer_pipeline_set",
        "value": "outer"
      }
    }
  ]
}

Now indexing a document while applying the outer pipeline will see the inner pipeline executed from the outer pipeline:

PUT /myindex/_doc/1?pipeline=pipelineB
{
  "field": "value"
}

Response from the index request:

{
  "_index": "myindex",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 66,
  "_primary_term": 1,
}

Indexed document:

{
  "field": "value",
  "inner_pipeline_set": "inner",
  "outer_pipeline_set": "outer"
}

Remove Processor

Removes existing fields. If any of the specified fields doesn’t exist, an exception will be thrown.

Table 21. Remove Options
Name Required Default Description

field

yes

-

Fields to be removed. Supports template snippets.

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

Here is an example to remove a single field:

{
  "remove": {
    "field": "user_agent"
  }
}

To remove multiple fields, you can use the following configuration:

{
  "remove": {
    "field": ["user_agent", "url"]
  }
}

Rename Processor

Renames an existing field. If the field doesn’t exist or the new name is already used, an exception will be thrown.

Table 22. Rename Options
Name Required Default Description

field

yes

-

The field to be renamed. Supports template snippets.

target_field

yes

-

The new name of the field. Supports template snippets.

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "rename": {
    "field": "provider",
    "target_field": "cloud.provider"
  }
}

Script Processor

Allows inline and stored scripts to be executed within ingest pipelines.

See How to use scripts to learn more about writing scripts. The Script Processor leverages caching of compiled scripts for improved performance. Since the script specified within the processor is potentially re-compiled per document, it is important to understand how script caching works. To learn more about caching see Script Caching.

Table 23. Script Options
Name Required Default Description

lang

no

"painless"

The scripting language

id

no

-

The stored script id to refer to

source

no

-

An inline script to be executed

params

no

-

Script Parameters

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

One of id or source options must be provided in order to properly reference a script to execute.

You can access the current ingest document from within the script context by using the ctx variable.

The following example sets a new field called field_a_plus_b_times_c to be the sum of two existing numeric fields field_a and field_b multiplied by the parameter param_c:

{
  "script": {
    "lang": "painless",
    "source": "ctx.field_a_plus_b_times_c = (ctx.field_a + ctx.field_b) * params.param_c",
    "params": {
      "param_c": 10
    }
  }
}

It is possible to use the Script Processor to manipulate document metadata like _index and _type during ingestion. Here is an example of an ingest pipeline that routes every document to the my_index index with the _doc type, no matter what was provided in the original index request:

PUT _ingest/pipeline/my_index
{
    "description": "use index:my_index and type:_doc",
    "processors": [
      {
        "script": {
          "source": """
            ctx._index = 'my_index';
            ctx._type = '_doc';
          """
        }
      }
    ]
}

Using the above pipeline, we can attempt to index a document into the any_index index.

PUT any_index/_doc/1?pipeline=my_index
{
  "message": "text"
}

The response from the above index request:

{
  "_index": "my_index",
  "_type": "_doc",
  "_id": "1",
  "_version": 1,
  "result": "created",
  "_shards": {
    "total": 2,
    "successful": 1,
    "failed": 0
  },
  "_seq_no": 89,
  "_primary_term": 1,
}

In the above response, you can see that our document was actually indexed into my_index instead of any_index. This type of manipulation is often convenient in pipelines that have various branches of transformation and, depending on the progress made, index documents into different indices.

Set Processor

Sets one field and associates it with the specified value. If the field already exists, its value will be replaced with the provided one.

Table 24. Set Options
Name Required Default Description

field

yes

-

The field to insert, upsert, or update. Supports template snippets.

value

yes

-

The value to be set for the field. Supports template snippets.

override

no

true

If true, the processor updates fields that already have a non-null value. When set to false, such fields will not be touched.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "set": {
    "field": "host.os.name",
    "value": "{{os}}"
  }
}
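
The override option is not shown above; as a sketch, the following configuration only sets host.os.name when that field is missing or null, and leaves any existing non-null value untouched:

{
  "set": {
    "field": "host.os.name",
    "value": "{{os}}",
    "override": false
  }
}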

Set Security User Processor

Sets user-related details (such as username, roles, email, full_name and metadata) from the currently authenticated user on the current document by pre-processing the ingest request.

Important
Requires an authenticated user for the index request.
Table 25. Set Security User Options
Name Required Default Description

field

yes

-

The field to store the user information into.

properties

no

[username, roles, email, full_name, metadata]

Controls what user related properties are added to the field.

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

The following example adds all user details for the current authenticated user to the user field for all documents that are processed by this pipeline:

{
  "processors" : [
    {
      "set_security_user": {
        "field": "user"
      }
    }
  ]
}

Split Processor

Splits a field into an array using a separator character. Only works on string fields.

Table 26. Split Options
Name Required Default Description

field

yes

-

The field to split

separator

yes

-

A regex which matches the separator, e.g. , or \s+

target_field

no

field

The field to assign the split value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "split": {
    "field": "my_field",
    "separator": "\\s+" (1)
  }
}
  1. Treat all consecutive whitespace characters as a single separator

Sort Processor

Sorts the elements of an array ascending or descending. Homogeneous arrays of numbers will be sorted numerically, while arrays of strings or heterogeneous arrays of strings + numbers will be sorted lexicographically. Throws an error when the field is not an array.

Table 27. Sort Options
Name Required Default Description

field

yes

-

The field to be sorted

order

no

"asc"

The sort order to use. Accepts "asc" or "desc".

target_field

no

field

The field to assign the sorted value to, by default field is updated in-place

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "sort": {
    "field": "array_field_to_sort",
    "order": "desc"
  }
}

Trim Processor

Trims whitespace from a field.

Note
This only works on leading and trailing whitespace.
Table 28. Trim Options
Name Required Default Description

field

yes

-

The string-valued field to trim whitespace from

target_field

no

field

The field to assign the trimmed value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "trim": {
    "field": "foo"
  }
}

Uppercase Processor

Converts a string to its uppercase equivalent.

Table 29. Uppercase Options
Name Required Default Description

field

yes

-

The field to make uppercase

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "uppercase": {
    "field": "foo"
  }
}

URL Decode Processor

URL-decodes a string.

Table 30. URL Decode Options
Name Required Default Description

field

yes

-

The field to decode

target_field

no

field

The field to assign the converted value to, by default field is updated in-place

ignore_missing

no

false

If true and field does not exist or is null, the processor quietly exits without modifying the document

if

no

-

Conditionally execute this processor.

on_failure

no

-

Handle failures for this processor. See Handling Failures in Pipelines.

ignore_failure

no

false

Ignore failures for this processor. See Handling Failures in Pipelines.

tag

no

-

An identifier for this processor. Useful for debugging and metrics.

{
  "urldecode": {
    "field": "my_url_to_decode"
  }
}

User Agent processor

The user_agent processor extracts details from the user agent string a browser sends with its web requests. This processor adds this information by default under the user_agent field.

The ingest-user-agent module ships by default with the regexes.yaml made available by uap-java with an Apache 2.0 license. For more details see https://github.com/ua-parser/uap-core.

Using the user_agent Processor in a Pipeline

Table 31. User-agent options
Name Required Default Description

field

yes

-

The field containing the user agent string.

target_field

no

user_agent

The field that will be filled with the user agent details.

regex_file

no

-

The name of the file in the config/ingest-user-agent directory containing the regular expressions for parsing the user agent string. Both the directory and the file have to be created before starting Elasticsearch. If not specified, ingest-user-agent will use the regexes.yaml from uap-core it ships with (see below).

properties

no

[name, major, minor, patch, build, os, os_name, os_major, os_minor, device]

Controls what properties are added to target_field.

ignore_missing

no

false

If true and field does not exist, the processor quietly exits without modifying the document

ecs

no

false

Whether to return the output in Elastic Common Schema format. NOTE: ECS format will be the default in Elasticsearch 7.0 and non-ECS format is deprecated.

Here is an example that adds the user agent details to the user_agent field based on the agent field:

PUT _ingest/pipeline/user_agent
{
  "description" : "Add user agent information",
  "processors" : [
    {
      "user_agent" : {
        "field" : "agent",
        "ecs" : true
      }
    }
  ]
}
PUT my_index/_doc/my_id?pipeline=user_agent
{
  "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36"
}
GET my_index/_doc/my_id

Which returns:

{
  "found": true,
  "_index": "my_index",
  "_type": "_doc",
  "_id": "my_id",
  "_version": 1,
  "_seq_no": 22,
  "_primary_term": 1,
  "_source": {
    "agent": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
    "user_agent": {
      "name": "Chrome",
      "original": "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_10_5) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/51.0.2704.103 Safari/537.36",
      "version": "51.0.2704",
      "os": {
        "name": "Mac OS X",
        "version": "10.10.5",
        "full": "Mac OS X 10.10.5"
      },
      "device" : {
        "name" : "Other"
      }
    }
  }
}
Using a custom regex file

To use a custom regex file for parsing the user agents, that file has to be put into the config/ingest-user-agent directory and has to have a .yml filename extension. The file has to be present at node startup; any changes to it or any new files added while the node is running will not have any effect.

In practice, it will make most sense for any custom regex file to be a variant of the default file, either a more recent version or a customised version.

The default file included in ingest-user-agent is the regexes.yaml from uap-core: https://github.com/ua-parser/uap-core/blob/master/regexes.yaml
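
As a sketch (my_regexes.yml is a hypothetical file that would have to exist under config/ingest-user-agent on every node before startup), a pipeline could reference a custom regex file like this:

PUT _ingest/pipeline/user_agent_custom
{
  "description" : "Parse user agents with a custom regex file",
  "processors" : [
    {
      "user_agent" : {
        "field" : "agent",
        "regex_file" : "my_regexes.yml"
      }
    }
  ]
}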