"Fossies" - the Fresh Open Source Software Archive

Member "elasticsearch-6.8.23/docs/plugins/repository-hdfs.asciidoc" (29 Dec 2021, 7915 Bytes) of package /linux/www/elasticsearch-6.8.23-src.tar.gz:


As a special service "Fossies" has tried to format the requested source page into HTML format (assuming AsciiDoc format). Alternatively you can here view or download the uninterpreted source code file. A member file download can also be achieved by clicking within a package contents listing on the according byte size field.

A hint: This file contains one or more very long lines, so maybe it is better readable using the pure text view mode that shows the contents as wrapped lines within the browser window.


Hadoop HDFS Repository Plugin

The HDFS repository plugin adds support for using HDFS File System as a repository for {ref}/modules-snapshots.html[Snapshot/Restore].

Installation

This plugin can be installed using the plugin manager:

sudo bin/elasticsearch-plugin install repository-hdfs

The plugin must be installed on every node in the cluster, and each node must be restarted after installation.

This plugin can be downloaded for offline install from {plugin_url}/repository-hdfs/repository-hdfs-{version}.zip.
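
A minimal sketch of an offline install from a local copy of that zip (the download path below is illustrative):

sudo bin/elasticsearch-plugin install file:///path/to/repository-hdfs-{version}.zip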

Removal

The plugin can be removed with the following command:

sudo bin/elasticsearch-plugin remove repository-hdfs

The node must be stopped before removing the plugin.

Getting started with HDFS

The HDFS snapshot/restore plugin is built against the latest Apache Hadoop 2.x (currently 2.7.1). If the distro you are using is not protocol-compatible with Apache Hadoop, consider replacing the Hadoop libraries inside the plugin folder with your own (you might have to adjust the security permissions required).

Even if Hadoop is already installed on the Elasticsearch nodes, for security reasons, the required libraries need to be placed under the plugin folder. Note that in most cases, if the distro is compatible, one simply needs to configure the repository with the appropriate Hadoop configuration files (see below).
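
A sketch of where the bundled Hadoop libraries live, assuming the default plugin directory layout (jar names vary by version):

$> cd plugins/repository-hdfs
$> ls hadoop-*.jar
$> # replace the bundled hadoop-* jars with the client jars from your distro,
$> # then restart the node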

Windows Users

Using Apache Hadoop on Windows is problematic and thus it is not recommended. For those who really want to use it, make sure you place the elusive winutils.exe under the plugin folder and point the HADOOP_HOME environment variable to it; this should minimize the amount of permissions Hadoop requires (though one would still have to add some more).

Configuration Properties

Once installed, define the configuration for the hdfs repository through the {ref}/modules-snapshots.html[REST API]:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository",
    "conf.dfs.client.read.shortcircuit": "true"
  }
}

The following settings are supported (a combined example follows the list):

uri

The URI address for HDFS, e.g. "hdfs://<host>:<port>/". (Required)

path

The file path within the filesystem where data is stored/loaded, e.g. "path/to/file". (Required)

load_defaults

Whether to load the default Hadoop configuration or not. (Enabled by default)

conf.<key>

Inlined configuration parameter to be added to the Hadoop configuration. (Optional) Only client-oriented properties from the Hadoop core and hdfs configuration files will be recognized by the plugin.

compress

Whether to compress the metadata or not. (Disabled by default)

max_restore_bytes_per_sec

Throttles per node restore rate. Defaults to 40mb per second.

max_snapshot_bytes_per_sec

Throttles per node snapshot rate. Defaults to 40mb per second.

readonly

Makes repository read-only. Defaults to false.

chunk_size

Override the chunk size. (Disabled by default)

security.principal

Kerberos principal to use when connecting to a secured HDFS cluster. If you are using a service principal for your elasticsearch node, you may use the _HOST pattern in the principal name and the plugin will replace the pattern with the hostname of the node at runtime (see Creating the Secure Repository).
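
As an illustration of how several of these settings combine, here is a sketch that registers a compressed, throttled repository (the values are examples only):

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository",
    "compress": true,
    "chunk_size": "10mb",
    "max_snapshot_bytes_per_sec": "20mb"
  }
}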

A Note on HDFS Availability

When you initialize a repository, its settings are persisted in the cluster state. When a node comes online, it will attempt to initialize all repositories for which it has settings. If your cluster has an HDFS repository configured, then all nodes in the cluster must be able to reach HDFS when starting. If not, then the node will fail to initialize the repository at startup and the repository will be unusable. If this happens, you will need to remove and re-add the repository or restart the offending node.
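
If you do need to remove and re-add the repository, both steps go through the same snapshot API; a sketch, reusing the example repository from above:

DELETE _snapshot/my_hdfs_repository

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "elasticsearch/repositories/my_hdfs_repository"
  }
}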

Hadoop Security

The HDFS Repository Plugin integrates seamlessly with Hadoop’s authentication model. The following authentication methods are supported by the plugin:

simple

Also means "no security" and is enabled by default. Uses information from underlying operating system account running Elasticsearch to inform Hadoop of the name of the current user. Hadoop makes no attempts to verify this information.

kerberos

Authenticates to Hadoop through the use of a Kerberos principal and keytab. Interfacing with HDFS clusters secured with Kerberos requires a few additional steps to enable (see Principals and Keytabs and Creating the Secure Repository for more info).

Principals and Keytabs

Before attempting to connect to a secured HDFS cluster, provision the Kerberos principals and keytabs that the Elasticsearch nodes will use for authenticating to Kerberos. For maximum security and to avoid tripping up the Kerberos replay protection, you should create a service principal per node, following the pattern of elasticsearch/hostname@REALM.
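
A minimal sketch of provisioning a per-node service principal and keytab with MIT Kerberos kadmin (the realm, hostname, and admin principal below are placeholders):

$> kadmin -p admin/admin@REALM
kadmin: addprinc -randkey elasticsearch/hostname.example.com@REALM
kadmin: ktadd -k krb5.keytab elasticsearch/hostname.example.com@REALM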

Warning
In some cases, if the same principal is authenticating from multiple clients at once, services may reject authentication for those principals under the assumption that they could be replay attacks. If you are running the plugin in production with multiple nodes, you should use a unique service principal for each node.

On each Elasticsearch node, place the appropriate keytab file in the node’s configuration location under the repository-hdfs directory using the name krb5.keytab:

$> cd elasticsearch/config
$> ls
elasticsearch.yml  jvm.options        log4j2.properties  repository-hdfs/   scripts/
$> cd repository-hdfs
$> ls
krb5.keytab

Note
Make sure you have the correct keytabs! If you are using a service principal per node (like elasticsearch/hostname@REALM) then each node will need its own unique keytab file for the principal assigned to that host!

Creating the Secure Repository

Once your keytab files are in place and your cluster is started, creating a secured HDFS repository is simple. Just add the name of the principal that you will be authenticating as in the repository settings under the security.principal option:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "/user/elasticsearch/repositories/my_hdfs_repository",
    "security.principal": "elasticsearch@REALM"
  }
}

If you are using different service principals for each node, you can use the _HOST pattern in your principal name. Elasticsearch will automatically replace the pattern with the hostname of the node at runtime:

PUT _snapshot/my_hdfs_repository
{
  "type": "hdfs",
  "settings": {
    "uri": "hdfs://namenode:8020/",
    "path": "/user/elasticsearch/repositories/my_hdfs_repository",
    "security.principal": "elasticsearch/_HOST@REALM"
  }
}

Authorization

Once Elasticsearch is connected and authenticated to HDFS, HDFS will infer a username to use for authorizing file access for the client. By default, it picks this username from the primary part of the Kerberos principal used to authenticate to the service. For example, with a principal like elasticsearch@REALM or elasticsearch/hostname@REALM, the username that HDFS extracts for file access checks will be elasticsearch.
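
Consequently, the repository path on HDFS must be accessible to that extracted username. A sketch of preparing it on the HDFS side, run as an HDFS superuser (the path and username mirror the examples above):

$> hdfs dfs -mkdir -p /user/elasticsearch/repositories/my_hdfs_repository
$> hdfs dfs -chown -R elasticsearch /user/elasticsearch/repositories/my_hdfs_repository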

Note
The repository plugin makes no assumptions about what Elasticsearch's principal name is. The main fragment of the Kerberos principal is not required to be elasticsearch. If you have a principal or service name that works better for you or your organization, feel free to use it instead!