Version information
This version is compatible with:
- Puppet Enterprise 2023.8.x, 2023.7.x, 2023.6.x, 2023.5.x, 2023.4.x, 2023.3.x, 2023.2.x, 2023.1.x, 2023.0.x, 2021.7.x, 2021.6.x, 2021.5.x, 2021.4.x, 2021.3.x, 2021.2.x, 2021.1.x, 2021.0.x, 2019.8.x, 2019.7.x, 2019.5.x, 2019.4.x, 2019.3.x, 2019.2.x, 2019.1.x, 2019.0.x, 2018.1.x, 2017.3.x, 2017.2.x, 2017.1.x, 2016.5.x, 2016.4.x
- Puppet >=3.4.0
Start using this module
Add this module to your Puppetfile:
mod 'cesnet-hive', '0.15.0'
Apache Hive Puppet Module
####Table of Contents
- Module Description - What the module does and why it is useful
- Setup - The basics of getting started with Hive
- Usage - Configuration options and additional functionality
- Reference - An under-the-hood peek at what the module is doing and how
- Limitations - OS compatibility, etc.
- Development - Guide for contributing to the module
##Module Description
This module installs and sets up the Apache Hive data warehouse software running on top of a Hadoop cluster. Hive services can be collocated with or separated from other services in the cluster. Optionally, security based on Kerberos can be enabled. Security should be enabled if Hadoop cluster security is enabled.
A Puppet client configured with stringify_facts=false is recommended, but not required (see also the schema_file parameter).
Tested with:
- Debian 7/wheezy, 8/jessie: Cloudera distribution (tested on Hive 0.13.1, 2.1.1)
- RHEL 6 and clones: Cloudera distribution (tested with Hadoop 2.6.0)
##Setup
###What cesnet-hive module affects
- Packages: installs Hive packages (common packages, subsets for requested services, hcatalog, and/or hive client)
- Files modified:
  - /etc/hive/* (or /etc/hive/conf/*)
  - /usr/local/sbin/hivemanager (not needed, only when administrator manager script is requested by features)
- Alternatives:
  - alternatives are used for /etc/hive/conf in Cloudera
  - this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
- Services: only requested Hive services are set up and started
  - metastore
  - server2
- Helper Files:
  - /var/lib/hadoop-hdfs/.puppet-hive-dir-created (created by cesnet-hadoop module)
- Secret Files (keytabs): permissions are modified for hive service keytab (/etc/security/keytab/hive.service.keytab)
- Facts: hive_schemas (stringify_facts=false is needed when using this fact)
- Databases: for supported databases (when not disabled): user created and database schema imported using puppetlabs modules
###Setup Requirements
There are several known or intended limitations in this module.
Be aware of:
- Repositories - see cesnet-hadoop module Setup Requirements for details
- No inter-node dependencies: a running HDFS namenode is required for Hive metastore server startup
- Secure mode: keytabs must be prepared in /etc/security/keytab/ (see realm parameter)
- Database setup: MariaDB/MySQL or PostgreSQL are supported. You need to install the puppetlabs-mysql or puppetlabs-postgresql module, because they are not in dependencies.
- Hadoop: it should be configured locally or you should use the hdfs_hostname parameter (see Module Parameters)
###Beginning with Hive
Let's start with basic examples.
Example: The simplest setup without security or ZooKeeper, with everything on a single machine:

    class{"hive":
      hdfs_hostname => $::fqdn,
      metastore_hostname => $::fqdn,
      server2_hostname => $::fqdn,
    }

    node <HDFS_NAMENODE> {
      # HDFS initialization must be done on the namenode
      # (or /user/hive on HDFS must be created)
      include hive::hdfs
    }

    node default {
      # server
      include ::hive::metastore
      include ::hive::server2
      # client
      include ::hive::frontend
      include ::hive::hcatalog
      # worker nodes
      include ::hive::worker
    }

Modify $::fqdn and node(s) section as needed.
We recommend:
- using ZooKeeper and setting the hive parameter zookeeper_hostnames (the cesnet-zookeeper module can be used to install ZooKeeper)
- if collocated with the HDFS namenode, adding the dependency Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
- if not collocated, making sure the HDFS namenode is running first, or restarting the Hive metastore later
- using the hadoop class plus some other component (or the hadoop::common::config class) - see the hdfs_hostname parameter
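
The recommendations above could be combined roughly as follows. This is only a sketch, assuming a single node that also runs the HDFS namenode: the ZooKeeper hostnames are placeholders, and the hadoop class with its namenode is assumed to be declared elsewhere in the manifest (which is why hdfs_hostname is left out).

```puppet
class { 'hive':
  metastore_hostname  => $::fqdn,
  server2_hostname    => $::fqdn,
  # placeholder hostnames of the ZooKeeper quorum
  zookeeper_hostnames => ['zk1.example.com', 'zk2.example.com', 'zk3.example.com'],
}

node default {
  include ::hive::metastore
  include ::hive::server2

  # only when collocated with the HDFS namenode
  # (assumes the hadoop namenode classes are declared on this node):
  # start the namenode before the Hive metastore
  Class['hadoop::namenode::service'] -> Class['hive::metastore::service']
}
```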
##Usage
It is highly recommended to use a real database backend instead of Derby. Security can also be enabled.
Hive is used together with other components in roles in the cesnet::site_hadoop puppet module.
The following examples show how to use the hive puppet module directly:
Example 1: Setup with security:
Additional permissions in Hadoop cluster are needed: add hive proxy user.

    class{"hadoop":
      ...
      properties => {
        'hadoop.proxyuser.hive.groups' => 'hive,impala,oozie,users',
        'hadoop.proxyuser.hive.hosts' => '*',
      },
      ...
    }

    class{"hive":
      group => 'users',
      metastore_hostname => $::fqdn,
      realm => 'MY.REALM',
    }

Use nodes sections from the initial Example, modify $::fqdn and nodes sections as needed.
Example 2: MySQL database, puppetlabs-mysql puppet module must be installed.
Add this to the initial example:

    class{"hive":
      ...
      db => 'mysql',
      #db => 'mariadb',
      db_password => 'hivepassword',
    }

    node default {
      ...
      class { 'mysql::server':
        root_password => 'strongpassword',
      }
      class { 'mysql::bindings':
        java_enable => true,
        #java_package_name => 'libmariadb-java',
      }
    }

The database is created in the hive::metastore::db class (used by hive::metastore).
Example 3: PostgreSQL database, puppetlabs-postgresql puppet module must be installed.
Add this to the initial example:

    class{"hive":
      ...
      db => 'postgresql',
      db_password => 'hivepassword',
    }

    node default {
      ...
      class { 'postgresql::server':
        postgres_password => 'strongpassword',
      }
      include postgresql::lib::java
      ...
    }

###Enable Security
Security in Hadoop (and Hive) is based on Kerberos. Keytab files need to be prepared in the proper places before enabling security.
The following parameters are used for security (see also the hive class):
- realm
  Enables security and specifies the Kerberos realm to use. An empty string disables security.
  To enable security, the following are required:
  - installed Kerberos client (Debian: krb5-user/heimdal-clients; RedHat: krb5-workstation)
  - configured Kerberos client (/etc/krb5.conf, /etc/krb5.keytab)
  - /etc/security/keytab/hive.service.keytab (on all server nodes)
- sentry_hostname
  Enables usage of the Sentry authorization service. When not specified, Hive server2 impersonation is enabled and authorization works using HDFS permissions.
####Impersonation
Authorization by impersonation of the user. Used when sentry_hostname is not specified.
Hadoop needs to have proxyuser enabled for it:

    # 'users' is the group in the *group* parameter
    hadoop.proxyuser.hive.groups => 'hive,users'
    hadoop.proxyuser.hive.hosts => '*'

Users need to have access to the warehouse directory. The group is set to users by default. Other add-ons (like Impala) need to be in the users group too!
Another way could be to add users to the hive group and use that group instead (simpler, but less secure).
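
To illustrate the simpler alternative, the sketch below uses the hive group both for the warehouse directory and for the proxyuser setting. It assumes users have already been added to the hive group by other means. It reuses the properties parameter of the hadoop class from Example 1; any other hadoop parameters a real cluster needs are left out here.

```puppet
class { 'hadoop':
  # other hadoop parameters are omitted in this sketch
  properties => {
    'hadoop.proxyuser.hive.groups' => 'hive',
    'hadoop.proxyuser.hive.hosts'  => '*',
  },
}

class { 'hive':
  # use the hive group instead of the default 'users'
  group              => 'hive',
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
}
```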
####Sentry
Authorization by Sentry. Used when sentry_hostname is specified.
Hive itself runs under the 'hive' user. Hadoop and Hive must have security enabled.
Warehouse directory must have 'hive' group ownership. It is set by the puppet module by default.
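
A Sentry-based setup might look like the sketch below. The Sentry hostname is a placeholder for an existing external Sentry service, and the realm is the same example value used elsewhere in this document.

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
  # placeholder hostname of an existing Sentry service
  sentry_hostname    => 'sentry.example.com',
  # the group parameter is not set: it defaults to 'hive' with Sentry
}
```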
###Multihome Support
Multihome is supported by Hive out-of-the-box.
<a name="defaultfs"></a>
###Changing defaultFS (converting non-HA cluster, ...)
Changing defaultFS can be needed when, for example:
- changing Hadoop cluster name
- using cluster name because of converting non-HA cluster to High Availability
However, existing objects in the Hive schema still use the old URL with the previous defaultFS and need to be converted.
Getting the old URL:

    hive --service metatool -listFSRoot 2>/dev/null

Convert (you can do a test run first using --dryRun):

    OLD_URL="hdfs://NAMENODE_HOSTNAME:8020"
    NEW_URL="hdfs://CLUSTER_NAME"
    hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL} --dryRun
    hive --service metatool -updateLocation ${NEW_URL} ${OLD_URL}

###Cluster with more HDFS Name nodes
If the Hadoop cluster uses multiple HDFS namenodes (high availability, namespaces, ...), the 'hive' system user must exist on all of them for authorization to work properly. You could install the full Hive client (using hive::frontend::install), but just creating the user is enough (using hive::user).
Note that the hive::hdfs class must be used too, but only on one of the HDFS namenodes. It includes hive::user.
Example:

    node <HDFS_NAMENODE> {
      include hive::hdfs
    }

    node <HDFS_OTHER_NAMENODE> {
      include hive::user
    }

###Upgrade
The best way is to refresh the configuration from the new original (i.e. remove the old one) and rerun Puppet on top of it. The schema also needs to be updated using schematool or the upgrade scripts in /usr/lib/hive/scripts/metastore/upgrade/DATABASE/.
For example (using mysql, from Hive 0.13.0):

    alternative='cluster'
    d='hive'
    mv /etc/${d}/conf.${alternative} /etc/${d}/conf.cdhXXX
    update-alternatives --auto ${d}-conf
    # upgrade
    ...
    # metadata schema upgrade
    mysqldump --opt metastore > metastore-backup.sql
    mysqldump --skip-add-drop-table --no-data metastore > my-schema-backup.mysql.sql
    /usr/lib/hive/bin/schematool -dbType mysql -upgradeSchemaFrom 0.13.0 -userName root -passWord MYSQL_ROOT_PASSWORD
    puppet agent --test
    #or: puppet apply ...

##Reference
###Classes
- hive: The main configuration class for Apache Hive
- hive::hbase: Client Support for HBase
- hive::hdfs: HDFS initializations
- hive::params
- hive::service
- common:
  - hive::common::config
  - hive::common::daemon
  - hive::common::postinstall
- hive::frontend: Hive Client
- hive::frontend::config
- hive::frontend::install
- hive::hcatalog: Hive HCatalog Client
- hive::hcatalog::config
- hive::hcatalog::install
- hive::metastore: Hive Metastore
- hive::metastore::config
- hive::metastore::install
- hive::metastore::db
- hive::metastore::service
- hive::server2: Hive Server
- hive::server2::config
- hive::server2::install
- hive::server2::service
- hive::user: Create hive system user, if needed
- hive::worker: Hive support at the worker node
###Facts
- hive_schemas: database schema file for each database backend
###hive class
####confdir
Hive config directory. Default: '/etc/hive/conf' or '/etc/hive'.
####group
Hive group on HDFS. Default: 'users' (without sentry), 'hive' (with sentry).
For Hive impersonation (without Sentry), all users are expected to belong to the specified group.
The group is not updated on HDFS when this parameter is changed later. When changing it, either remove the /var/lib/hadoop-hdfs/.puppet-hive-dir-created file or update the group of /user/hive on HDFS manually.
####hdfs_hostname
HDFS hostname (or defaultFS value), if different from core-site.xml Hadoop file. Default: undef.
It is recommended to have the core-site.xml file instead. core-site.xml will be created when installing any Hadoop component or if you include hadoop::common::config class.
####keytab
Hive keytab file. Default: '/etc/security/keytab/hive.service.keytab'.
Only used with security (realm parameter).
####keytab_source
Puppet source for keytab file. Default: undef.
When specified, the Hive keytab file is created using this puppet source(s). Otherwise only permissions are set on the keytab file.
Only used with security (realm parameter).
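
As an illustration, the keytab could be distributed by Puppet as sketched below. The source URL is hypothetical (a site-specific module named site_secrets is assumed here), and whether keytabs should be served from the Puppet file server at all is a site policy decision.

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  realm              => 'MY.REALM',
  keytab             => '/etc/security/keytab/hive.service.keytab',
  # hypothetical location on the Puppet file server
  keytab_source      => 'puppet:///modules/site_secrets/hive.service.keytab',
}
```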
####metastore_hostname
Hostname of the metastore server. Default: undef.
When specified, remote mode is activated (recommended).
####principal
Hive Kerberos principal. Default: '::default' (="hive/_HOST@${hive::realm}").
####sentry_hostname
Hostname of the (external) Sentry service. Default: undef.
A non-empty value enables the Hive settings needed to use the Sentry authorization service.
When Sentry is enabled, you will also need the hive user added to allowed.system.users for Hadoop YARN containers.
####server2_hostname
Hostname of the Hive server. Default: undef.
Used only for hivemanager script.
####zookeeper_hostnames
Array of hostnames of the ZooKeeper quorum. Default: undef.
Used for lock management (recommended).
####zookeeper_port
Zookeeper port, if different from the default (2181). Default: undef.
####realm
Kerberos realm. Default: ''.
Empty string disables the security.
When security is enabled, you also need either the Sentry service (sentry_hostname parameter) or proxyuser properties set in the Hadoop cluster for Hive impersonation. See Enable Security.
####properties
Additional properties. Default: undef.
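
A possible use of this parameter is sketched below. It assumes the parameter takes a hash of Hive property names and values, like the properties parameter of the hadoop class shown in the examples above, and that values are passed as strings. hive.exec.parallel is used only as an example property.

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  properties         => {
    # example Hive property; replace with the settings you need
    'hive.exec.parallel' => 'true',
  },
}
```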
####descriptions
Descriptions for the additional properties. Default: undef.
####alternatives
Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.
Use it only when supported (for example with Cloudera distribution).
####database_setup_enable
Enables database setup (if supported). Default: true.
####db
Database behind the metastore. Default: undef.
The default is the embedded database (derby), but it is recommended to use a proper database.
Values:
- derby (default): embedded database
- mysql: MySQL/MariaDB
- postgresql: PostgreSQL
####db_host
Database hostname for mysql, postgresql, and oracle. Default: 'localhost'.
It can be overridden by javax.jdo.option.ConnectionURL property.
####db_name
Database name for mysql and postgresql. Default: 'metastore'.
For oracle 'xe' schema is used. Can be overridden by javax.jdo.option.ConnectionURL property.
####db_user
Database user for mysql, postgresql, and oracle. Default: 'hive'.
####db_password
Database password for mysql, postgresql, and oracle. Default: undef.
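
The database parameters above could be combined as in the following sketch for a metastore database on another host. The database hostname is a placeholder, and the database and user are assumed to be created outside of this module, which is why database_setup_enable is switched off.

```puppet
class { 'hive':
  metastore_hostname    => $::fqdn,
  db                    => 'postgresql',
  # placeholder database server
  db_host               => 'db.example.com',
  db_name               => 'metastore',
  db_user               => 'hive',
  db_password           => 'hivepassword',
  # the database and user are managed outside of this module
  database_setup_enable => false,
}
```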
####features
Enable additional features. Default: {}.
Values:
- manager - script in /usr/local to start/stop Hive daemons relevant for given node
####schema_dir
Hive directory with database schemas. Default: undef (/usr/lib/hive/scripts/metastore/upgrade).
####schema_file
Hive database schema file. Default: undef (autodetect).
Autodetection requires puppet configured with stringify_facts=false. But the value can be set directly instead (for example hive-schema-2.1.1.mysql.sql).
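
For example, when stringify_facts=false cannot be used, the schema file could be set explicitly as in this sketch. The file name is the one mentioned above and has to match the installed Hive version and the chosen database.

```puppet
class { 'hive':
  metastore_hostname => $::fqdn,
  db                 => 'mysql',
  db_password        => 'hivepassword',
  # set explicitly, so autodetection via the hive_schemas fact is not needed
  schema_file        => 'hive-schema-2.1.1.mysql.sql',
}
```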
##Limitations
The idea behind this module is to do only one thing - set up Hive software - and not to limit generic usage of the module by doing other things. You can have your own repository with Hadoop software, and you can select which Kerberos implementation or Java version to use.
On the other hand, this leads to some limitations, as mentioned in the Setup Requirements section, and usage is more complicated - you may need a site-specific puppet module together with this one, like cesnet-site_hadoop.
The puppetlabs-mysql and puppetlabs-postgresql modules are used for the database, but they are not listed as dependencies. You can disable database setup altogether with the database_setup_enable parameter.
##Development
- Repository: https://github.com/MetaCenterCloudPuppet/cesnet-hive
- Tests:
  - basic: see .travis.yml
  - vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests
Dependencies
- puppetlabs/stdlib (>=1.0.0 <5.0.0)
- cesnet/hadoop (>=0.9.4 <4.0.0)
- cesnet/hadoop_lib (>=0.4.0 <1.0.0)
The MIT License (MIT)

Copyright (c) 2014-2020 CESNET

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.