SPARK¶
Apache Spark is a large-scale data processing engine that performs its computations in memory. It offers APIs in Java, Scala, Python and R for building parallel applications.
Warning
The Spark web UI is not available on the system.
Modules¶
module load java/1.7.0
module load hadoop/2.7.2
module load hadoop/spark2.0.2
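After loading the modules you can quickly verify the environment, e.g. by asking Spark for its version (a minimal check, assuming the modules above place the Spark tools on your PATH):
spark-submit --version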
Configuration¶
All the configuration files can be found under /apps/applications/hadoop/spark/2.0.2/user-conf/
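This directory is shared between users, so jobs that need to modify the configuration (for example the slaves file or spark-env.sh) should first copy it into a job-local directory and point SPARK_CONF_DIR there, as the batch scripts below do. A minimal sketch, assuming WORKDIR points to your work space as in the examples:
export SPARK_CONF_DIR="${WORKDIR}/${SLURM_JOB_ID}/spark/conf"
mkdir -p ${SPARK_CONF_DIR}
cp /apps/applications/hadoop/spark/2.0.2/user-conf/* ${SPARK_CONF_DIR}/.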
Simple Single Node Usage¶
#!/bin/bash -l
###############################
#SBATCH --job-name=spark
#SBATCH --output=spark.out
#SBATCH --error=spark.err
#SBATCH --nodes=1
#SBATCH --ntasks=1
#SBATCH --cpus-per-task=80
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --account=testproj
#SBATCH --partition=taskp
###############################
module load java/1.7.0
module load python/2.7.10
module load hadoop/2.7.2
module load hadoop/spark2.0.2
export JAVA_HOME="/apps/compilers/java/1.7.0/jdk1.7.0_80"
# Per-job working directory with a private copy of the Spark configuration
export WORK="${WORKDIR}/${SLURM_JOB_ID}"
export SPARK_CONF_DIR="${WORK}/spark/conf"
mkdir -p ${WORK}
mkdir -p ${WORK}/spark
mkdir -p ${SPARK_CONF_DIR}
cp ${SPARK_HOME}/user-conf/* ${SPARK_CONF_DIR}/.
source $SPARK_CONF_DIR/spark-env.sh
export PYSPARK_PYTHON=/apps/applications/python/2.7.10/bin/python
export PATH=/apps/applications/python/2.7.10/bin:$PATH
export LD_LIBRARY_PATH=/apps/applications/python/2.7.10/lib:$LD_LIBRARY_PATH
sed -i "s|.*export JAVA_HOME=.*|export JAVA_HOME=${JAVA_HOME}|g" $SPARK_CONF_DIR/spark-env.sh
# The first node of the allocation acts as the Spark master; use its InfiniBand hostname (-ib suffix)
MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)-ib
MASTER_NODE=spark://$MASTER:$SPARK_MASTER_PORT
export SPARK_MASTER_HOST=$MASTER
export SPARK_LOCAL_IP=$MASTER
echo $SPARK_MASTER_HOST
# Start the master, then start a worker on each allocated node via srun
start-master.sh -h $MASTER
srun start-slave.sh $MASTER_NODE
# Record the allocated nodes (with their -ib hostnames) in the slaves file, used by stop-all.sh
scontrol show hostname $SLURM_NODELIST > ${SPARK_CONF_DIR}/slaves
sed -i 's/$/-ib/' ${SPARK_CONF_DIR}/slaves
echo "Python:"
spark-submit \
--master $MASTER_NODE \
$SPARK_HOME/examples/src/main/python/pi.py \
80
stop-all.sh
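Assuming the script above is saved as spark_single.sh (the file name is only an example), submit and monitor it with the usual Slurm commands:
sbatch spark_single.sh
squeue -u $USER
cat spark.out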
Sbatch Usage¶
A Spark standalone cluster can run on top of a Slurm allocation: the sbatch script first allocates the nodes exclusively, starts the Spark master and worker daemons on them, and then uses the spark-submit script to submit applications to this Spark cluster.
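In outline, such a script does the following (the full example below fills in the configuration details; my_app.py is a placeholder):
start-master.sh -h $MASTER            # start the Spark master on the first allocated node
srun start-slave.sh $MASTER_NODE      # start one worker per allocated node
spark-submit --master $MASTER_NODE my_app.py   # run the application against the standalone cluster
stop-all.sh                           # shut the Spark daemons down before the job ends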
Example¶
Submit a Spark job on ARIS that calculates π using the bundled Python and Java examples; the script below runs the Python version, followed by the Java/Scala SparkPi example.
#!/bin/bash -l
###############################
#SBATCH --job-name=spark
#SBATCH --output=spark.out
#SBATCH --error=spark.err
#SBATCH --nodes=2
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=80
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --account=testproj
#SBATCH --partition=taskp
###############################
module load java/1.7.0
module load python/2.7.10
module load hadoop/2.7.2
module load hadoop/spark2.0.2
# Copy the shared configuration into a per-job directory, since the slaves file will be written into it
export SPARK_CONF_DIR="${WORKDIR}/${SLURM_JOB_ID}/spark/conf"
mkdir -p ${SPARK_CONF_DIR}
cp /apps/applications/hadoop/spark/2.0.2/user-conf/* ${SPARK_CONF_DIR}/.
source $SPARK_CONF_DIR/spark-env.sh
export PYSPARK_PYTHON=/apps/applications/python/2.7.10/bin/python
export PATH=/apps/applications/python/2.7.10/bin:$PATH
export LD_LIBRARY_PATH=/apps/applications/python/2.7.10/lib:$LD_LIBRARY_PATH
MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)-ib
MASTER_NODE=spark://$MASTER:$SPARK_MASTER_PORT
export SPARK_MASTER_HOST=$MASTER
export SPARK_LOCAL_IP=$MASTER
echo $SPARK_MASTER_HOST
start-master.sh -h $MASTER
srun start-slave.sh $MASTER_NODE
scontrol show hostname $SLURM_NODELIST > ${SPARK_CONF_DIR}/slaves
sed -i 's/$/-ib/' ${SPARK_CONF_DIR}/slaves
spark-submit \
--master $MASTER_NODE \
$SPARK_HOME/examples/src/main/python/pi.py \
160
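# The bundled Java/Scala SparkPi example can be submitted in the same way
# (the examples jar name assumes a stock Spark 2.0.2 installation and may differ):
spark-submit \
--class org.apache.spark.examples.SparkPi \
--master $MASTER_NODE \
$SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar \
160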
stop-all.sh