SPARK

Apache Spark is a large-scale data processing engine that performs computations in memory. Spark offers APIs in Java, Scala, Python, and R for building parallel applications.

Warning

No web UI is available.

Modules

module load java/1.7.0
module load hadoop/2.7.2
module load hadoop/spark2.0.2

Configuration

All the configuration files can be found under /apps/applications/hadoop/spark/2.0.2/user-conf/
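To keep per-job changes (such as the slaves file) out of the shared install, you can copy this directory into job scratch space and point SPARK_CONF_DIR at the copy, as the script below does. A minimal sketch of that staging step (here `mktemp -d` stands in for the ${WORKDIR}/${SLURM_JOB_ID} scratch directory used in the script; the copy is a no-op on machines that lack the site path):

```shell
#!/bin/bash
# Stage a private copy of the site Spark configuration so per-job edits
# (the slaves file, spark-env.sh) do not modify the shared install.
SITE_CONF=/apps/applications/hadoop/spark/2.0.2/user-conf   # site path from this page
WORK=$(mktemp -d)                     # stands in for ${WORKDIR}/${SLURM_JOB_ID}
mkdir -p "$WORK/spark/conf"
cp "$SITE_CONF"/* "$WORK/spark/conf"/ 2>/dev/null || true   # no-op off ARIS
export SPARK_CONF_DIR=$WORK/spark/conf
echo "SPARK_CONF_DIR=$SPARK_CONF_DIR"
```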

Simple Single Node Usage

#!/bin/bash

###############################
#SBATCH --job-name=spark
#SBATCH --output=spark.out
#SBATCH --error=spark.err
###SBATCH --nodelist=fat39
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=80
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --account=testproj
#SBATCH --partition=taskp
###############################

module load java/1.7.0
module load python/2.7.10
module load hadoop/2.7.2
module load hadoop/spark2.0.2 

export JAVA_HOME="/apps/compilers/java/1.7.0/jdk1.7.0_80"
export WORK="${WORKDIR}/${SLURM_JOB_ID}"
export SPARK_CONF_DIR="${WORK}/spark/conf"

mkdir -p ${WORK}
mkdir -p ${WORK}/spark
mkdir -p ${SPARK_CONF_DIR}
cp ${SPARK_HOME}/user-conf/* ${SPARK_CONF_DIR}/.

source $SPARK_CONF_DIR/spark-env.sh
export PYSPARK_PYTHON=/apps/applications/python/2.7.10/bin/python

export PATH=/apps/applications/python/2.7.10/bin:$PATH
export LD_LIBRARY_PATH=/apps/applications/python/2.7.10/lib:$LD_LIBRARY_PATH

sed -i "s|.*export JAVA_HOME=.*|export JAVA_HOME=${JAVA_HOME}|g" $SPARK_CONF_DIR/spark-env.sh

MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)-ib
MASTER_NODE=spark://$MASTER:$SPARK_MASTER_PORT
export SPARK_MASTER_HOST=$MASTER
export SPARK_LOCAL_IP=$MASTER

echo $SPARK_MASTER_HOST
start-master.sh -h $MASTER
srun start-slave.sh $MASTER_NODE

scontrol show hostname $SLURM_NODELIST > ${SPARK_CONF_DIR}/slaves
sed -i 's/$/-ib/' ${SPARK_CONF_DIR}/slaves

echo "Python:"
spark-submit \
        --master $MASTER_NODE \
        $SPARK_HOME/examples/src/main/python/pi.py \
        160

stop-all.sh
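The script above derives the master host by expanding $SLURM_NODELIST with `scontrol show hostname`. For the common single-range case, that expansion can be sketched as follows (a simplified stand-in for scontrol, assuming a `name[lo-hi]` pattern with no comma-separated groups):

```shell
#!/bin/bash
# Expand a simple Slurm nodelist such as "node[01-03]" into one hostname
# per line, mimicking `scontrol show hostname` for the single-range case.
expand_nodelist() {
  case $1 in
    *\[*)
      prefix=${1%%\[*}                # text before the bracket
      range=${1#*\[}; range=${range%]}
      lo=${range%-*}; hi=${range#*-}
      for n in $(seq -w "$lo" "$hi"); do   # seq -w keeps zero padding
        printf '%s%s\n' "$prefix" "$n"
      done ;;
    *)
      printf '%s\n' "$1" ;;           # plain hostname, nothing to expand
  esac
}
expand_nodelist 'node[01-03]'
```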

Sbatch Usage

A Spark cluster can run on top of a Slurm cluster: the sbatch script first allocates nodes exclusively and starts the Spark master and worker daemons on them, then uses the spark-submit script to submit jobs to the Spark cluster.

Example

Submit a Spark job on ARIS that calculates Pi using the bundled Python example.
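For reference, the pi.py example estimates Pi by Monte Carlo: sample random points in the unit square and take 4 × (fraction falling inside the unit circle). A serial awk sketch of the same computation (Spark merely distributes this loop across the executors; the seed and sample count here are illustrative):

```shell
#!/bin/bash
# Serial sketch of what examples/src/main/python/pi.py computes:
# pi ~= 4 * (points inside the unit circle) / (total points sampled).
n=100000
pi=$(awk -v n="$n" 'BEGIN {
  srand(42)                              # fixed seed for reproducibility
  for (i = 0; i < n; i++) {
    x = rand()*2 - 1; y = rand()*2 - 1   # point in [-1,1] x [-1,1]
    if (x*x + y*y <= 1) hits++
  }
  printf "%.3f", 4 * hits / n
}')
echo "pi ~= $pi"
```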

#!/bin/bash

###############################
#SBATCH --job-name=spark
#SBATCH --output=spark.out
#SBATCH --error=spark.err
###SBATCH --nodelist=fat39
#SBATCH --ntasks=2
#SBATCH --cpus-per-task=80
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --account=testproj
#SBATCH --partition=taskp
###############################

module load java/1.7.0
module load python/2.7.10
module load hadoop/2.7.2
module load hadoop/spark2.0.2 

export JAVA_HOME="/apps/compilers/java/1.7.0/jdk1.7.0_80"

export SPARK_CONF_DIR=/apps/applications/hadoop/spark/2.0.2/user-conf
source $SPARK_CONF_DIR/spark-env.sh
export PYSPARK_PYTHON=/apps/applications/python/2.7.10/bin/python

export PATH=/apps/applications/python/2.7.10/bin:$PATH
export LD_LIBRARY_PATH=/apps/applications/python/2.7.10/lib:$LD_LIBRARY_PATH

MASTER=$(scontrol show hostname $SLURM_NODELIST | head -n 1)-ib
MASTER_NODE=spark://$MASTER:$SPARK_MASTER_PORT
export SPARK_MASTER_HOST=$MASTER
export SPARK_LOCAL_IP=$MASTER

echo $SPARK_MASTER_HOST
start-master.sh -h $MASTER
srun start-slave.sh $MASTER_NODE

scontrol show hostname $SLURM_NODELIST > ${SPARK_CONF_DIR}/slaves
sed -i 's/$/-ib/' ${SPARK_CONF_DIR}/slaves

spark-submit \
        --master $MASTER_NODE \
        $SPARK_HOME/examples/src/main/python/pi.py \
        160

stop-all.sh
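The same running cluster can also execute the bundled Java/Scala Pi example via spark-submit. A hedged sketch of that submit command (the examples jar name below assumes the stock Spark 2.0.2 / Scala 2.11 build; check $SPARK_HOME/examples/jars on your install):

```shell
#!/bin/bash
# Submit the Scala/Java SparkPi example to the same master, mirroring the
# Python invocation above. Outside an allocation this only prints the command.
EXAMPLES_JAR=$SPARK_HOME/examples/jars/spark-examples_2.11-2.0.2.jar
CMD="spark-submit --master $MASTER_NODE --class org.apache.spark.examples.SparkPi $EXAMPLES_JAR 160"
echo "$CMD"     # run this line (without the echo) inside the sbatch script
```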