"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/plugins/mapper.asciidoc" (29 Dec 2021, 903 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


Mapper Plugins

Mapper plugins allow new field datatypes to be added to Elasticsearch.

Core mapper plugins

The core mapper plugins are:

Mapper Size Plugin

The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original {ref}/mapping-source-field.html[_source] field.

Mapper Murmur3 Plugin

The mapper-murmur3 plugin allows hashes to be computed at index-time and stored in the index for later use with the cardinality aggregation.

Mapper Annotated Text Plugin

The annotated text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).

Mapper Size Plugin

The mapper-size plugin provides the _size meta field which, when enabled, indexes the size in bytes of the original {ref}/mapping-source-field.html[_source] field.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-size

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/mapper-size/mapper-size-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-size

The node must be stopped before removing the plugin.

Using the _size field

In order to enable the _size field, set the mapping as follows:

PUT my_index
{
  "mappings": {
    "_doc": {
      "_size": {
        "enabled": true
      }
    }
  }
}
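As a rough illustration of what the field will hold once enabled, the following Python sketch measures the UTF-8 byte length of a JSON request body. It is not part of the plugin: the helper name `source_size_in_bytes` is hypothetical, and the value Elasticsearch records depends on the exact bytes it receives for _source (whitespace included), so treat the numbers as approximations.

```python
def source_size_in_bytes(raw_body: str) -> int:
    """Return the number of bytes in the UTF-8 encoding of a request body.

    Assumption: this approximates _size only when raw_body is byte-for-byte
    what is sent to Elasticsearch as the document source.
    """
    return len(raw_body.encode("utf-8"))


# A compact body with no extra whitespace.
compact_body = '{"text":"This is a document"}'

# Non-ASCII characters take more than one byte each in UTF-8,
# so the byte count can exceed the character count.
accented_body = '{"text":"caf\u00e9"}'

if __name__ == "__main__":
    print(source_size_in_bytes(compact_body))
    print(len(accented_body), source_size_in_bytes(accented_body))
```

Because the count is in bytes rather than characters, documents containing non-ASCII text will report a larger _size than their character length suggests.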

The value of the _size field is accessible in queries, aggregations, scripts, and when sorting:

# Example documents
PUT my_index/_doc/1
{
  "text": "This is a document"
}

PUT my_index/_doc/2
{
  "text": "This is another document"
}

GET my_index/_search
{
  "query": {
    "range": {
      "_size": { (1)
        "gt": 10
      }
    }
  },
  "aggs": {
    "sizes": {
      "terms": {
        "field": "_size", (2)
        "size": 10
      }
    }
  },
  "sort": [
    {
      "_size": { (3)
        "order": "desc"
      }
    }
  ],
  "script_fields": {
    "size": {
      "script": "doc['_size']"  (4)
    }
  }
}
  1. Querying on the _size field

  2. Aggregating on the _size field

  3. Sorting on the _size field

  4. Accessing the _size field in scripts (inline scripts must be enabled for this example to work; see modules-security-scripting.html#enable-dynamic-scripting)

Mapper Murmur3 Plugin

The mapper-murmur3 plugin provides the ability to compute hashes of field values at index-time and store them in the index. This can sometimes be helpful when running cardinality aggregations on large string fields with high cardinality.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-murmur3

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/mapper-murmur3/mapper-murmur3-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-murmur3

The node must be stopped before removing the plugin.

Using the murmur3 field

The murmur3 field is typically used within a multi-field, so that both the original value and its hash are stored in the index:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "keyword",
          "fields": {
            "hash": {
              "type": "murmur3"
            }
          }
        }
      }
    }
  }
}

Such a mapping allows you to refer to my_field.hash in order to get hashes of the values of the my_field field. This is only useful for running cardinality aggregations:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "This is a document"
}

PUT my_index/_doc/2
{
  "my_field": "This is another document"
}

GET my_index/_search
{
  "aggs": {
    "my_field_cardinality": {
      "cardinality": {
        "field": "my_field.hash" (1)
      }
    }
  }
}
  1. Counting unique values on the my_field.hash field

Running a cardinality aggregation on the my_field field directly would yield the same result. However, using my_field.hash instead might result in a speed-up if the field has a high cardinality. On the other hand, using the murmur3 field is discouraged on numeric fields and on string fields that are not almost unique: it is unlikely to bring significant speed-ups, and it increases the amount of disk space required to store the index.
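To make the idea concrete, here is a minimal Python sketch that hashes each value once and then counts distinct hashes instead of distinct strings. It uses the 32-bit MurmurHash3 variant with seed 0 purely for brevity; which variant and seed the plugin uses internally is not stated here, so the values below should not be expected to match what is stored in the index. The function name `murmur3_32` is illustrative.

```python
def murmur3_32(data: bytes, seed: int = 0) -> int:
    """32-bit MurmurHash3 of data, returned as an unsigned integer."""
    c1, c2, mask = 0xCC9E2D51, 0x1B873593, 0xFFFFFFFF

    def scramble(k: int) -> int:
        # Mix one (possibly partial) little-endian 32-bit block.
        k = (k * c1) & mask
        k = ((k << 15) | (k >> 17)) & mask  # rotate left by 15
        return (k * c2) & mask

    h = seed & mask
    n_blocks = len(data) // 4
    for i in range(n_blocks):
        k = int.from_bytes(data[4 * i:4 * i + 4], "little")
        h ^= scramble(k)
        h = ((h << 13) | (h >> 19)) & mask  # rotate left by 13
        h = (h * 5 + 0xE6546B64) & mask

    tail = data[4 * n_blocks:]
    if tail:
        h ^= scramble(int.from_bytes(tail, "little"))

    # Finalization: fold in the length and avalanche the bits.
    h ^= len(data)
    h ^= h >> 16
    h = (h * 0x85EBCA6B) & mask
    h ^= h >> 13
    h = (h * 0xC2B2AE35) & mask
    h ^= h >> 16
    return h


# Hash once (the "index-time" step), then count distinct hashes later.
values = ["This is a document", "This is another document", "This is a document"]
hashes = [murmur3_32(v.encode("utf-8")) for v in values]
distinct_by_hash = len(set(hashes))
distinct_by_value = len(set(values))

if __name__ == "__main__":
    print(distinct_by_hash, distinct_by_value)
```

Comparing fixed-width integers is cheaper than hashing long strings at query time, which is where the potential speed-up comes from; for short or numeric values there is little to gain.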

Mapper Annotated Text Plugin

experimental[]

The mapper-annotated-text plugin provides the ability to index text that is a combination of free-text and special markup that is typically used to identify items of interest such as people or organisations (see NER or Named Entity Recognition tools).

The elasticsearch markup allows one or more additional tokens to be injected, unchanged, into the token stream at the same position as the underlying text it annotates.

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install mapper-annotated-text

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/mapper-annotated-text/mapper-annotated-text-{version}.zip.

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove mapper-annotated-text

The node must be stopped before removing the plugin.

Using the annotated-text field

The annotated_text field tokenizes text content in the same way as the more common text field (see "Limitations" below) but also injects any marked-up annotation tokens directly into the search index:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_field": {
          "type": "annotated_text"
        }
      }
    }
  }
}

Such a mapping allows marked-up text, e.g. Wikipedia articles, to be indexed as both text and structured tokens. The annotations use a markdown-like syntax in which one or more URL-encoded values are separated by the & symbol.

We can use the "_analyze" API to test how an example annotation would be stored as tokens in the search index:

GET my_index/_analyze
{
  "field": "my_field",
  "text":"Investors in [Apple](Apple+Inc.) rejoiced."
}

Response:

{
  "tokens": [
    {
      "token": "investors",
      "start_offset": 0,
      "end_offset": 9,
      "type": "<ALPHANUM>",
      "position": 0
    },
    {
      "token": "in",
      "start_offset": 10,
      "end_offset": 12,
      "type": "<ALPHANUM>",
      "position": 1
    },
    {
      "token": "Apple Inc.", (1)
      "start_offset": 13,
      "end_offset": 18,
      "type": "annotation",
      "position": 2
    },
    {
      "token": "apple",
      "start_offset": 13,
      "end_offset": 18,
      "type": "<ALPHANUM>",
      "position": 2
    },
    {
      "token": "rejoiced",
      "start_offset": 19,
      "end_offset": 27,
      "type": "<ALPHANUM>",
      "position": 3
    }
  ]
}
  1. Note that the whole annotation token Apple Inc. is placed, unchanged, as a single token in the token stream at the same position (position 2) as the text token (apple) it annotates.

We can now perform searches for annotations using regular term queries that don’t tokenize the provided search values. Annotations are a more precise way of matching, as can be seen in this example where a search for Beck will not match Jeff Beck:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "[Beck](Beck) announced a new tour"(1)
}

PUT my_index/_doc/2
{
  "my_field": "[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"(2)
}

# Example search
GET my_index/_search
{
  "query": {
    "term": {
        "my_field": "Beck" (3)
    }
  }
}
  1. As well as tokenising the plain text into single words e.g. beck, here we inject the single token value Beck at the same position as beck in the token stream.

  2. Note annotations can inject multiple tokens at the same position - here we inject both the very specific value Jeff Beck and the broader term Guitarist. This enables broader positional queries e.g. finding mentions of a Guitarist near to strat.

  3. A benefit of searching with these carefully defined annotation tokens is that a query for Beck will not match document 2, which contains the tokens jeff, beck and Jeff Beck.

Warning
Any use of = signs in annotation values, e.g. [Prince](person=Prince), will cause the document to be rejected with a parse failure. In future we hope to have a use for the equals sign, so we actively reject documents that contain it today.
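The markup rules described above can be sketched in a few lines of Python. This is only an illustration of the syntax, not the plugin's parser: the function name `parse_annotated` and the regular expression are assumptions, and edge cases such as nested brackets are ignored. It strips the markup, URL-decodes each &-separated value, records the character offsets of the annotated text, and rejects values containing an = sign.

```python
import re
from urllib.parse import unquote_plus

# [covered text](value1&value2) with no nested brackets or parentheses.
ANNOTATION = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")


def parse_annotated(text):
    """Return (plain_text, annotations) for a marked-up string.

    Each annotation is (start_offset, end_offset, [decoded values]),
    with offsets measured in the plain text after markup is removed.
    """
    plain_parts = []
    annotations = []
    in_pos = 0    # how far the input has been consumed
    out_len = 0   # length of the plain text produced so far
    for match in ANNOTATION.finditer(text):
        before = text[in_pos:match.start()]
        plain_parts.append(before)
        out_len += len(before)

        covered, raw_values = match.group(1), match.group(2)
        if "=" in raw_values:
            raise ValueError("'=' is not allowed in annotation values")
        values = [unquote_plus(v) for v in raw_values.split("&") if v]

        annotations.append((out_len, out_len + len(covered), values))
        plain_parts.append(covered)
        out_len += len(covered)
        in_pos = match.end()
    plain_parts.append(text[in_pos:])
    return "".join(plain_parts), annotations


if __name__ == "__main__":
    print(parse_annotated("Investors in [Apple](Apple+Inc.) rejoiced."))
    print(parse_annotated("[Jeff Beck](Jeff+Beck&Guitarist) plays a strat"))
```

For the first input, the recovered offsets line up with the start_offset and end_offset of the annotation token in the _analyze response shown earlier.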

Data modelling tips

Use structured and unstructured fields

Annotations are normally a way of weaving structured information into unstructured text for higher-precision search.

Entity resolution is a form of document enrichment undertaken by specialist software or people where references to entities in a document are disambiguated by attaching a canonical ID. The ID is used to resolve any number of aliases or distinguish between people with the same name. The hyperlinks connecting Wikipedia’s articles are a good example of resolved entity IDs woven into text.

These IDs can be embedded as annotations in an annotated_text field but it often makes sense to include them in dedicated structured fields to support discovery via aggregations:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_unstructured_text_field": {
          "type": "annotated_text"
        },
        "my_twitter_handles": {
          "type": "text",
          "fields": {
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}

Applications would then typically provide content and discover it as follows:

# Example documents
PUT my_index/_doc/1
{
  "my_unstructured_text_field": "[Shay](%40kimchy) created elasticsearch",
  "my_twitter_handles": ["@kimchy"] (1)
}

GET my_index/_search
{
  "query": {
    "query_string": {
        "query": "elasticsearch OR logstash OR kibana",(2)
        "default_field": "my_unstructured_text_field"
    }
  },
  "aggregations": {
    "top_people": {
      "significant_terms": { (3)
        "field": "my_twitter_handles.keyword"
      }
    }
  }
}
  1. Note that the my_twitter_handles field contains a list of the annotation values also used in the unstructured text. (Note that the annotated_text syntax requires escaping.) By repeating the annotation values in a structured field, this application has ensured that the tokens discovered in the structured field can be used for search and highlighting in the unstructured field.

  2. In this example we search for documents that talk about components of the Elastic Stack.

  3. We use the my_twitter_handles field here to discover people who are significantly associated with the Elastic Stack.

Avoiding over-matching annotations

By design, the regular text tokens and the annotation tokens co-exist in the same indexed field but in rare cases this can lead to some over-matching.

The value of an annotation often denotes a named entity (a person, place or company). The tokens for these named entities are inserted untokenized, and differ from typical text tokens because they are normally:

  • Mixed case e.g. Madonna

  • Multiple words e.g. Jeff Beck

  • Containing punctuation or numbers e.g. Apple Inc. or @kimchy

This means that, for the most part, a search for a named entity in the annotated text field will not produce false positives. For example, when selecting Apple Inc. from an aggregation result you can drill down to highlight uses in the text without "over-matching" on text tokens such as the word apple in this context:

the apple was very juicy

However, a problem arises if your named entity happens to be a single term and lower-case e.g. the company elastic. In this case, a search on the annotated text field for the token elastic may match a text document such as this:

he fired an elastic band

To avoid such false matches, users should consider prefixing annotation values so that they don’t clash with text tokens e.g.

[elastic](Company_elastic) released version 7.0 of the elastic stack today
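A client application might build such markup with a small helper like the Python sketch below. Both function names (`annotate`, `text_tokens`) are hypothetical, and `text_tokens` is a crude stand-in for real analysis (lower-casing and splitting on non-alphanumeric characters), used only to show why a prefixed value avoids the clash.

```python
import re
from urllib.parse import quote_plus


def annotate(covered_text, *values):
    """Wrap covered_text in annotation markup with URL-encoded values."""
    encoded = "&".join(quote_plus(v) for v in values)
    return f"[{covered_text}]({encoded})"


def text_tokens(plain_text):
    """Toy tokenizer: lower-case, then split on non-alphanumerics.

    Assumption: this only approximates how ordinary text is tokenized.
    """
    return re.findall(r"[a-z0-9]+", plain_text.lower())


if __name__ == "__main__":
    print(annotate("Apple", "Apple Inc."))
    print(annotate("Jeff Beck", "Jeff Beck", "Guitarist"))
    print(annotate("elastic", "Company_elastic"))

    # An unprefixed, lower-case, single-word value collides with a text token.
    unrelated = text_tokens("he fired an elastic band")
    print("elastic" in unrelated, "Company_elastic" in unrelated)
```

The prefix works because ordinary lower-cased text tokens will not contain a value like Company_elastic, whereas the bare value elastic is indistinguishable from the everyday word.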

Using the annotated highlighter

The annotated-text plugin includes a custom highlighter designed to mark up search hits in a way which is respectful of the original markup:

# Example documents
PUT my_index/_doc/1
{
  "my_field": "The cat sat on the [mat](sku3578)"
}

GET my_index/_search
{
  "query": {
    "query_string": {
        "query": "cat"
    }
  },
  "highlight": {
    "fields": {
      "my_field": {
        "type": "annotated", (1)
        "require_field_match": false
      }
    }
  }
}
  1. The annotated highlighter type is designed for use with annotated_text fields

The annotated highlighter is based on the unified highlighter and supports the same settings, but it does not use the pre_tags or post_tags parameters. Rather than using HTML-like markup such as <em>cat</em>, the annotated highlighter uses the same markdown-like syntax used for annotations and injects a key=value annotation where _hit_term is the key and the matched search term is the value e.g.

The [cat](_hit_term=cat) sat on the [mat](sku3578)

The annotated highlighter tries to be respectful of any existing markup in the original text:

  • If the search term matches exactly the location of an existing annotation then the _hit_term key is merged into the URL-like syntax used in the (...) part of the existing annotation.

  • However, if the search term overlaps the span of an existing annotation it would break the markup formatting, so the original annotation is removed in favour of a new annotation with just the search hit information in the results.

  • Any non-overlapping annotations in the original text are preserved in highlighter selections.
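An application consuming these snippets has to pull the hit information back out of the markup. The Python sketch below shows one way to do that; the function name `hit_terms` is hypothetical, and it assumes that hit values are URL-encoded like other annotation values and that a merged annotation separates its parts with &, in whatever order.

```python
import re
from urllib.parse import unquote_plus

# [covered text](part1&part2) with no nested brackets or parentheses.
ANNOTATION = re.compile(r"\[([^\[\]]*)\]\(([^()]*)\)")
HIT_KEY = "_hit_term"


def hit_terms(snippet):
    """Return (covered_text, matched_term) pairs for highlighted hits.

    Annotations without a _hit_term key are ordinary annotations
    from the original document and are skipped.
    """
    hits = []
    for match in ANNOTATION.finditer(snippet):
        covered = match.group(1)
        for part in match.group(2).split("&"):
            key, sep, value = part.partition("=")
            if sep and key == HIT_KEY:
                hits.append((covered, unquote_plus(value)))
    return hits


if __name__ == "__main__":
    print(hit_terms("The [cat](_hit_term=cat) sat on the [mat](sku3578)"))
```

Because each part is inspected independently, the same function handles both a standalone hit annotation and one merged into an existing annotation.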

Limitations

The annotated_text field type supports the same mapping settings as the text field type but with the following exceptions:

  • No support for fielddata or fielddata_frequency_filter

  • No support for index_prefixes or index_phrases indexing