Version information
This version is compatible with:
- Puppet Enterprise 2023.8.x, 2023.7.x, 2023.6.x, 2023.5.x, 2023.4.x, 2023.3.x, 2023.2.x, 2023.1.x, 2023.0.x, 2021.7.x, 2021.6.x, 2021.5.x, 2021.4.x, 2021.3.x, 2021.2.x, 2021.1.x, 2021.0.x, 2019.8.x, 2019.7.x, 2019.5.x, 2019.4.x, 2019.3.x, 2019.2.x, 2019.1.x, 2019.0.x, 2018.1.x, 2017.3.x, 2017.2.x, 2017.1.x, 2016.5.x, 2016.4.x
- Puppet >= 3.4.0
- , , , ,
Start using this module
Add this module to your Puppetfile:
mod 'cesnet-spark', '2.0.0'
Learn more about managing modules with a PuppetfileDocumentation
Apache Spark Puppet Module
Table of Contents
- Module Description - What the module does and why it is useful
- Setup - The basics of getting started with spark
- Usage - Configuration options and additional functionality
- Reference - An under-the-hood peek at what the module is doing and how
- Limitations - OS compatibility, etc.
- Development - Guide for contributing to the module
Module Description
This puppet module installs and setup Apache Spark cluster, optionally with security. YARN and Spark Master cluster modes are supported.
Setup
What spark affects
- Packages: installs Spark packages as needed (core, python, history server, ...)
- Files modified:
- /etc/spark/conf/spark-default.conf
- /etc/spark/conf/spark-env.sh (modified, when environment parameter set)
- /etc/default/spark
- /etc/profile.d/hadoop-spark.csh (frontend)
- /etc/profile.d/hadoop-spark.sh (frontend)
- Permissions modified:
- /etc/security/keytab/spark.service.keytab (historyserver)
- Alternatives:
- alternatives are used for /etc/spark/conf in Cloudera
- this module switches to the new alternative by default, so the Cloudera original configuration can be kept intact
- Services:
- master server (when spark::master or spark::master::service included)
- history server (when spark::historyserver or spark::historyserver::service included)
- worker node (when spark::worker or spark::worker::service included)
- Helper files:
- */var/lib/hadoop-hdfs/.puppet-spark-**
Setup Requirements
There are several known or intended limitations in this module.
Be aware of:
-
Hadoop repositories
-
neither Cloudera nor Hortonworks repositories are configured in this module (for Cloudera you can find list and key files here: http://archive.cloudera.com/cdh5/debian/wheezy/amd64/cdh/)
-
java is not installed by this module (openjdk-7-jre-headless is OK for Debian 7/wheezy)
-
No inter-node dependencies: working HDFS is required before deploying of Spark History Server, dependency of Spark HDFS initialization on HDFS namenode is handled properly (if the class spark::hdfs is included on the HDFS namenode, see examples)
Usage
There are two cluster modes, how to use Spark (these modes can be both enabled):
- YARN mode: Hadoop is used for computing and scheduling
- Spark mode: Spark Master Server and Worker Nodes are used for computing and scheduling
Optionally Spark History Server can be used (for both YARN or Spark modes), which would also require Hadoop HDFS.
The Spark mode doesn't support security, only YARN mode can be used with secured Hadoop cluster.
Puppet classes to include:
- everywhere: spark
- YARN mode (requires Hadoop cluster with YARN, see CESNET Hadoop puppet module):
- client: spark::frontend
- Spark mode:
- master: spark::master
- slaves: spark::worker
- optionally History Server (requires Hadoop cluster with HDFS, see CESNET Hadoop puppet module):
- spark::historyserver
- on HDFS namenode: spark::hdfs
Spark in YARN cluster mode
Example: Apache Spark over Hadoop cluster:
For simplicity one-machine Hadoop cluster is used (everything is on $::fqdn, replication factor 1).
class{'hadoop':
hdfs_hostname => $::fqdn,
yarn_hostname => $::fqdn,
slaves => [ $::fqdn ],
frontends => [ $::fqdn ],
realm => '',
properties => {
'dfs.replication' => 1,
},
}
class{'spark':
# defaultFS is taken from hadoop class
}
node default {
include stdlib
include hadoop::namenode
include hadoop::resourcemanager
include hadoop::historyserver
include hadoop::datanode
include hadoop::nodemanager
include hadoop::frontend
include spark::frontend
# should be collocated with hadoop::namenode
include spark::hdfs
}
Notes:
- if collocated with HDFS namenode, add dependency Class['hadoop::namenode::service'] -> Class['spark::historyserver::service']
- if not collocated, it is needed to have HDFS namenode running first (puppet should be launched later again, if Spark History Server won't start because of HDFS)
- for Spark clients (in YARN mode): user must logout and login again or launch ". /etc/profile.d/hadoop-spark.sh"
Now you can submit spark jobs in the cluster mode over Hadoop YARN:
spark-submit --class org.apache.spark.examples.SparkPi --deploy-mode cluster --master yarn /usr/lib/spark/lib/spark-examples-1.2.0-cdh5.3.1-hadoop2.5.0-cdh5.3.1.jar 10
Spark in Spark Master cluster mode
Example: Apache Spark in Spark cluster mode:
Two-nodes cluster is used here.
$master_hostname='spark-master.example.com'
class{'hadoop':
realm => '',
hdfs_hostname => $master_hostname,
slaves => ['spark1.example.com', 'spark2.example.com'],
}
class{'spark':
master_hostname => $master_hostname,
historyserver_hostname => $master_hostname,
yarn_enable => false,
}
node 'spark-master.example.com' {
include spark::master
include spark::historyserver
include hadoop::namenode
include spark::hdfs
}
node /spark(1|2).example.com/ {
include spark::worker
include hadoop::datanode
}
node 'client.example.com' {
include hadoop::frontend
include spark::frontend
}
Notes:
- there is also enabled Spark History Server (spark::historyserver), which requires HDFS (master: hadoop::namenode, slaves: hadoop::datanode)
- YARN is disabled completely, to enable YARN: include also hadoop::nodemanager on the slave nodes (collocation with spark::worker is not needed) and hadoop::resourcemanager on master (see previous example, or CESNET Hadoop puppet module)
Spark jar file optimization
The spark-assembly.jar file is copied into HDFS on each job submit. It is possible to optimize this by copying it beforehand. Keep in mind the jar file needs to be refreshed on HDFS with each Spark SW update.
...
class{'spark':
jar_enable => true,
}
...
Copy the jar file after installation and deployment (superuser credentials are needed if security in Hadoop is enabled):
hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
Add Spark History Server
Spark History server stores details about Spark jobs. It is provided by the class spark::historyserver. The parameter historiserver_hostname needs to be also specified (replace $::fqdn by real hostname), and HDFS cluster is required:
...
class{'spark':
...
historyserver_hostname => $::fqdn,
}
node default {
...
include spark::historyserver
}
Multihome
Multihome is not supported.
You may also need to set SPARK_LOCAL_IP to bind RPC listen address to the default interface:
environment => {
'SPARK_LOCAL_IP' => '0.0.0.0',
#'SPARK_LOCAL_IP' => $::ipaddress_eth0,
}
Cluster with more HDFS Name nodes
If there are used more HDFS namenodes in the Hadoop cluster (high availability, namespaces, ...), it is needed to have 'spark' system user on all of them to autorization work properly. You could install full Spark client (using spark::frontend::install), but just creating the user is enough (using spark::user).
Note, the spark::hdfs class must be used too, but only on one of the HDFS namenodes. It includes the spark::user.
Example:
node <HDFS_NAMENODE> {
include spark::hdfs
}
node <HDFS_OTHER_NAMENODE> {
include spark::user
}
Upgrade
The best way is to refresh configurations from the new original (=remove the old) and relaunch puppet on top of it. There is also problem with start-up scripts on Debian, which needs to be worked around, where Spark history server is used.
For example:
alternative='cluster'
d='spark'
mv /etc/{d}$/conf.${alternative} /etc/${d}/conf.cdhXXX
update-alternatives --auto ${d}-conf
service spark-history-server stop || :
mv /etc/init.d/spark-history-server /etc/init.d/spark-history-server.prev
# upgrade
...
puppet agent --test
#or: puppet apply ...
# restore start-up script from spark-history-server.dpkg-new or spark-history-server.prev
...
service spark-history-server start
##Reference
###Classes
spark
: Main configuration class for CESNET Apache Spark puppet modulespark::common
:spark::common::config
spark::common::postinstall
spark::frontend
: Apache Spark Clientspark::frontend::config
spark::frontend::install
spark::hdfs
: HDFS initializationspark::historyserver
: Apache Spark History Serverspark::historyserver::config
spark::historyserver::install
spark::historyserver::service
spark::master
: Apache Spark Master Serverspark::master::config
spark::master::install
spark::master::service
spark::worker
: Apache Spark Worker Nodespark::worker::config
spark::worker::install
spark::worker::service
spark::params
spark::user
: Create spark system user
spark
class
####Parameters
#####alternatives
Switches the alternatives used for the configuration. Default: 'cluster' (Debian) or undef.
It can be used only when supported (for example with Cloudera distribution).
#####confdir
Spark config directory. Default: platform specific ('/etc/spark/conf' or '/etc/spark').
#####defaultFS
Filesystem URI. Default: '::default' (from $::hadoop::_defaultFS).
Examples:
- hdfs://hdfs.example.com:8020
- hdfs://mycluster
#####hive_configfile
Hive config file. Default: platform specific ('../../hive/conf/hive-site.xml' or '../etc/hive/hive-site.xml').
#####keytab
Spark Historyserver keytab file. Default: '/etc/security/keytab/spark.service.keytab'.
#####keytab_source
Puppet source for the Spark keytab file. Default: undef.
When specified, the Spark keytab file is created using this puppet source(s). Otherwise only persmissions are set on the keytab file.
#####master_hostname
Spark Master hostname. Default: undef.
#####master_port
Spark Master port. Default: '7077'.
#####master_ui_port
Spark Master Web UI port. Default: '18080'.
#####historyserver_hostname
Spark History server hostname. Default: undef.
#####historyserver_port
Spark History Server Web UI port. Default: '180088'.
Notes:
- the Spark default value is 18080, which conflicts with default for Master server
- no historyserver_ui_port parameter (Web UI port is the same as the RPC port)
#####worker_port
Spark Worker node port. Default: '7078'.
#####worker_ui_port
Spark Worker node Web UI port. Default: '18081'.
#####environment
Environments to set for Apache Spark. Default: undef.
The value is a hash. The '::undef' values will unset the particular variables.
Example: you may need to increase memory in case of big amount of jobs:
environment => {
'SPARK_DAEMON_MEMORY' => '4096m',
}
#####properties
Spark properties to set. Default: undef.
#####realm
Kerberos realm. Default: undef.
Non-empty string enables security.
#####hive_enable
Enable support for Hive metastore. Default: true.
This just create the symlink of the Hive configuration file in the Spark configuration directory on the frontend.
There is required to install also Hive JDBC (or Spark assembly with Hive JDBC) at all worker nodes.
#####jar_enable
Configure Apache Spark to search Spark jar file in $hdfs_hostname/user/spark/share/lib/spark-assembly.jar. Default: false.
The jar needs to be copied to HDFS manually after installation, and also manually updated after each Spark SW update:
hdfs dfs -put /usr/lib/spark/spark-assembly.jar /user/spark/share/lib/spark-assembly.jar
#####yarn_enable
Enable YARN mode. Default: true.
This requires configured Hadoop using CESNET Hadoop puppet module.
Limitations
Tested with Cloudera distribution.
See also Setup requirements.
Development
- Repository: https://github.com/MetaCenterCloudPuppet/cesnet-spark
- Tests:
- basic: see .travis.yml
- vagrant: https://github.com/MetaCenterCloudPuppet/hadoop-tests
- Email: František Dvořák <valtri@civ.zcu.cz>
#Changelog
See https://github.com/MetaCenterCloudPuppet/cesnet-spark/commits/master.
#Incompatible changes
1.0.0:
- environments parameter:
- name changed: environments -> environment
- type remains hash
2.0.0:
- removed hdfs_hostname parameter (replaced by spark::defaultFS and *hadoop::_defaultFS)
Dependencies
- puppetlabs/stdlib (>= 1.0.0 <7.0.0)
- cesnet/hadoop (>= 3.0.0 <4.0.0)
- cesnet/hadoop_lib (<1.0.0)
The MIT License (MIT) Copyright (c) 2014-2020 CESNET Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions: The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software. THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.