ARIS HADOOP CONFIGURATION

This document describes how to configure Hadoop so that you can perform Hadoop MapReduce and Hadoop Distributed File System (HDFS) operations on ARIS supercomputer.

The Apache Hadoop software library is a framework that allows for the distributed processing of large data sets across clusters of computers using simple programming models.

Note

To run Hadoop on ARIS supercomputer we start up a Hadoop cluster each time a Hadoop job is submitted.

Modules

module load java/1.7.0
module load hadoop/2.7.2

Configuration

You can find all the configurations files in the location /apps/applications/hadoop/2.7.2/user-etc/hadoop/

Sbatch Usage

A Hadoop cluster can run on top of a Slurm cluster. In other words, the sbatch script must first exclusively allocate nodes to run the Hadoop master and slaves, then you can submit jobs to the Hadoop cluster.

Example

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

#!/bin/bash 
#################################
# ARIS Hadoop script #
#################################

###############################
#SBATCH --job-name=hadoop
#SBATCH --output=hadoop.out
#SBATCH --error=hadoop.err
#SBATCH --ntasks=2
#SBATCH --nodes=2
#SBATCH --ntasks-per-node=1
#SBATCH --time=00:30:00
#SBATCH --account=testproj
#SBATCH --partition=taskp
#SBATCH --exclusive
###############################

module load java/1.7.0
module load hadoop/2.7.2

# Create Hadoop Configs
export JAVA_HOME="/apps/compilers/java/1.7.0/jdk1.7.0_80"
export WORK="${WORKDIR}/${SLURM_JOB_ID}"
export HADOOP_CONF_DIR="${WORK}/hadoop/conf"
export HADOOP_LOG_DIR="${WORK}/hadoop/logs"
export YARN_LOG_DIR="$HADOOP_LOG_DIR"
export HADOOP_MAPRED_LOG_DIR="$HADOOP_LOG_DIR"
export TMP="${WORK}/tmp"
export HADOOP_TMP_DIR="${TMP}"

export DFS_NAME_DIR="${WORK}/namenode_data"
export DFS_DATA_DIR="${WORK}/hdfs_data"
export DFS_REPLICATION="1"
export DFS_BLOCK_SIZE=""
export MAPRED_LOCAL_DIR="${WORK}/mapred_scratch"
export MAPRED_TASKTRACKER_MAP_TASKS_MAXIMUM=""
export MAPRED_TASKTRACKER_REDUCE_TASKS_MAXIMUM=""
export MAPRED_MAP_TASKS=""
export MAPRED_REDUCE_TASKS=""

mkdir -p ${WORK}
mkdir -p ${WORK}/hadoop
mkdir -p ${HADOOP_LOG_DIR}
mkdir -p ${HADOOP_CONF_DIR}
cp ${HADOOP_HOME}/user-etc/hadoop/* ${HADOOP_CONF_DIR}/.

MASTER=$(hostname)
echo ${MASTER} > ${HADOOP_CONF_DIR}/masters
scontrol show hostname $SLURM_NODELIST > ${HADOOP_CONF_DIR}/slaves

sed -i "s|MASTER|${MASTER}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|MASTER|${MASTER}|g" ${HADOOP_CONF_DIR}/core-site.xml
sed -i "s|HADOOP_TMP_DIR|${HADOOP_TMP_DIR}|g" ${HADOOP_CONF_DIR}/core-site.xml
sed -i "s|MAPRED_LOCAL_DIR|${MAPRED_LOCAL_DIR}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|MAPRED_TASKTRACKER_MAP_TASKS_MAXIMUM|${MAPRED_TASKTRACKER_MAP_TASKS_MAXIMUM}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|MAPRED_TASKTRACKER_REDUCE_TASKS_MAXIMUM|${MAPRED_TASKTRACKER_REDUCE_TASKS_MAXIMUM}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|MAPRED_MAP_TASKS|${MAPRED_MAP_TASKS}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|MAPRED_REDUCE_TASKS|${MAPRED_REDUCE_TASKS}|g" ${HADOOP_CONF_DIR}/mapred-site.xml
sed -i "s|DFS_REPLICATION|${DFS_REPLICATION}|g" ${HADOOP_CONF_DIR}/hdfs-site.xml
sed -i "s|DFS_NAME_DIR|${DFS_NAME_DIR}|g" ${HADOOP_CONF_DIR}/hdfs-site.xml
sed -i "s|DFS_DATA_DIR|${DFS_DATA_DIR}|g" ${HADOOP_CONF_DIR}/hdfs-site.xml

sed -i "s|.*YARN_LOG_DIR=.*|YARN_LOG_DIR=${YARN_LOG_DIR}|g" ${HADOOP_CONF_DIR}/yarn-env.sh
sed -i "s|.*export HADOOP_LOG_DIR.*|export HADOOP_LOG_DIR=${HADOOP_LOG_DIR}|g" ${HADOOP_CONF_DIR}/hadoop-env.sh
sed -i "s|.*export HADOOP_PID_DIR.*|export HADOOP_PID_DIR=${WORKDIR}/${SLURM_JOBID}|g" ${HADOOP_CONF_DIR}/hadoop-env.sh


# Format the filesystem:
hdfs namenode -format -nonInteractive -force

# Start NameNode daemon and DataNode daemon:
start-dfs.sh
start-yarn.sh

# Create the HDFS directories required to execute MapReduce jobs:
hdfs dfs -mkdir  /user
hdfs dfs -mkdir  /user/${USER}

# Copy the input files into the distributed filesystem:
hdfs dfs -put ${HADOOP_HOME}/etc/hadoop input

# Map reduce example

hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep input output 'dfs[a-z.]+'

# Copy the output files from the distributed filesystem to the local filesystem
OUTPUT="${WORK}/output"
hdfs dfs -get output "${OUTPUT}"
hdfs dfs -cat output/*

#Stop Hadoop Service
stop-dfs.sh
stop-yarn.sh

Output

Result will list all the matches in all the input files.

cat hadoop.out

6   dfs.audit.logger
4   dfs.class
3   dfs.server.namenode.
2   dfs.period
2   dfs.audit.log.maxfilesize
2   dfs.audit.log.maxbackupindex
1   dfsmetrics.log
1   dfsadmin
1   dfs.servers
1   dfs.file

Standalone Operation

This script describes how to set up and configure a single-node Hadoop installation so that you can quickly perform simple operations using Hadoop MapReduce and the Hadoop Distributed File System (HDFS).

Info

Hadoop is configured to run in a non-distributed mode, as a single Java process. This is useful for debugging.

Example

The following example copies the unpacked conf directory to use as input and then finds and displays every match of the given regular expression. Output is written to the given output directory.

#!/bin/bash 
#################################
# ARIS Hadoop Script            #
#################################

###############################
#SBATCH --job-name=hadoop
#SBATCH --output=hadoop.out
#SBATCH --error=hadoop.err
#SBATCH --ntasks=1
#SBATCH --nodes=1
#SBATCH --ntasks-per-node=1
#SBATCH --cpus-per-task=40
#SBATCH --time=00:10:00
#SBATCH --account=testproj
#SBATCH --partition=fat
#SBATCH --exclusive
###############################

module load java/1.7.0
module load hadoop/2.7.2

# Create Hadoop Configs
export WORK="${WORKDIR}/${SLURM_JOB_ID}"
#export HADOOP_LOG_DIR="${WORK}/hadoop/logs"
#export YARN_LOG_DIR="$HADOOP_LOG_DIR"
#export HADOOP_MAPRED_LOG_DIR="$HADOOP_LOG_DIR"

mkdir -p ${WORK}

INPUT="${WORK}/input"
mkdir -p ${INPUT}
cp ${HADOOP_HOME}/etc/hadoop/* ${INPUT} 

OUTPUT="${WORK}/output"

hadoop jar ${HADOOP_HOME}/share/hadoop/mapreduce/hadoop-mapreduce-examples-*.jar grep ${INPUT} ${OUTPUT} 'dfs[a-z.]+'

cat ${OUTPUT}/*