Atabak - Hadoop Spark via docker

This document would share an experience on setting up dockerized master-slave hadoop and spark on top of them. Then config your environment and listen to steaming data. It will also help to automate your environment setup for development and production stages.

This experiment case is contain; a) private a master container, and b) provide 2 slave hadoop containers and c) prepare spark on top of hadoop infrustructure.

Requirement

Ubuntu
Internet access

Quick outline

create a hadoop master container.
create slave containers.
create spark container pointed to hadoop.
create spark container pointed to hadoop.
prepare docker environment with docker-compose

prepare a docker for hadoop main node

Creating docker file for hadoop main node:

Create a directory by name hadoop-master and then create a Dockerfile inside that contain following info:

As you can see in Dockerfile, customized bootstrap configuration file is must be located beside Dockerfile. Therefore, inside same directory, create a bash file named bootstrap.sh and then enter the following script within that:

#!/bin/bash

: ${HADOOP_PREFIX:=/usr/local/hadoop}

$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

rm /tmp/*.pid

# installing libraries if any - (resource urls added comma separated to the ACP system variable)
cd $HADOOP_PREFIX/share/hadoop/common ; for cp in ${ACP//,/ }; do  echo == $cp; curl -LO $cp ; done; cd -

# altering the core-site configuration
#sed s/NAMENODE/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/core-site.xml.template > /usr/local/hadoop/etc/hadoop/core-site.xml
#sed s/NAMENODE/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/hdfs-site.xml.template > /usr/local/hadoop/etc/hadoop/hdfs-site.xml
#sed s/RESOURCEMANAGER/$HOSTNAME/ /usr/local/hadoop/etc/hadoop/yarn-site.xml.template.template > /usr/local/hadoop/etc/hadoop/yarn-site.xml.template

service ssh start
nohup $HADOOP_PREFIX/bin/hdfs namenode &
nohup $HADOOP_PREFIX/bin/yarn resourcemanager &
nohup $HADOOP_PREFIX/bin/yarn timelineserver &
nohup $HADOOP_PREFIX/bin/mapred historyserver &

if [[ $1 == "-d" ]]; then
    while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
    /bin/bash
fi

prepare a docker for hadoop slave nodes

Creating docker file for hadoop main node:

Create a directory by name hadoop-slave and then create a Dockerfile inside that contain following info:

As you can see in Dockerfile, slave containers also required customized bootstrap configuration file is must be located beside Dockerfile. Therefore, inside same directory, create a bash file named bootstrap.sh and then enter the following script within that:

#!/bin/bash

: ${HADOOP_PREFIX:=/usr/local/hadoop}

$HADOOP_PREFIX/etc/hadoop/hadoop-env.sh

rm /tmp/*.pid

service ssh start
nohup $HADOOP_PREFIX/bin/hdfs datanode 2>> /var/log/hadoop/datanode.err >> /var/log/hadoop/datanode.out &
nohup $HADOOP_PREFIX/bin/yarn nodemanager 2>> /var/log/hadoop/nodemanager.err >> /var/log/hadoop/nodemanager.out &

if [[ $1 == "-d" ]]; then
    while true; do sleep 1000; done
fi

if [[ $1 == "-bash" ]]; then
    /bin/bash
fi

prepare a docker for spark and config the nodes

Now it is turn to config and create a container for spark including it’s dependencies:

Create a directory by name spark and then create a Dockerfile inside that contain following info:

FROM atabak/hadoop-base:latest
MAINTAINER Atabak

# Update the APT cache 
RUN apt-get update 
RUN apt-get upgrade -y

# Install and setup project dependencies 
RUN apt-get install -y curl wget git

#prepare for Java download 
RUN apt-get install -y python-software-properties 
RUN apt-get install -y software-properties-common

#java
RUN apt-get install -y openjdk-8-jdk && \ 
    apt-get install -y ant && \ 
    apt-get clean && \ 
    rm -rf /var/lib/apt/lists/* && \ 
    rm -rf /var/cache/oracle-jdk8-installer;

RUN mkdir -p /usr/java/default
RUN sudo cp -r /usr/lib/jvm/java-8-openjdk-amd64/* /usr/java/default/

ENV JAVA_HOME /usr/java/default
ENV PATH $PATH:$JAVA_HOME/bin

#install spark
ENV SPARK_HOME /usr/local/spark
ENV SPARK_VERSION 2.1.1
ENV SPARK_BIN_VERSION $SPARK_VERSION-bin-hadoop2.7

RUN curl -s https://www.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_BIN_VERSION.tgz | tar xz -C /usr/local
RUN ln -s /usr/local/spark-$SPARK_BIN_VERSION $SPARK_HOME 
#&& \
#    rm /spark-$SPARK_BIN_VERSION.tgz

#RUN mkdir $SPARK_HOME/yarn-remote-client
#ADD yarn-remote-client $SPARK_HOME/yarn-remote-client

#RUN $BOOTSTRAP && $HADOOP_PREFIX/bin/hadoop dfsadmin -safemode leave && $HADOOP_PREFIX/bin/hdfs dfs -put $SPARK_HOME-2.1.1-bin-hadoop2.7/lib /spark

#ENV YARN_CONF_DIR $HADOOP_PREFIX/etc/hadoop
#ENV PATH $PATH:$SPARK_HOME/bin:$HADOOP_PREFIX/bin

# update boot script
ADD bootstrap.sh /etc/bootstrap.sh
#RUN chown root.root /etc/bootstrap.sh
#RUN chmod 700 /etc/bootstrap.sh

#install scala
ENV SCALA_VERSION 2.12.2 
ENV SCALA_HOME /usr/local/scala 
ENV PATH $PATH:$SPARK_HOME/bin:$SCALA_HOME/bin

RUN wget https://downloads.lightbend.com/scala/$SCALA_VERSION/scala-$SCALA_VERSION.tgz && \ 
    tar -zxf /scala-$SCALA_VERSION.tgz -C /usr/local/ && \ 
    ln -s /usr/local/scala-$SCALA_VERSION $SCALA_HOME && \ 
    rm /scala-$SCALA_VERSION.tgz

ADD start-master.sh /start-master.sh 
ADD start-worker /start-worker.sh 
ADD spark-shell.sh /spark-shell.sh 
ADD spark-defaults.conf /spark-defaults.conf 
ADD remove_alias.sh /remove_alias.sh

ENV SPARK_MASTER_OPTS="-Dspark.driver.port=7001 -Dspark.fileserver.port=7002 -Dspark.broadcast.port=7003 -Dspark.replClassServer.port=7004 -Dspark.blockManager.port=7005 -Dspark.executor.port=7006 -Dspark.ui.port=4040 -Dspark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory" 
ENV SPARK_WORKER_OPTS="-Dspark.driver.port=7001 -Dspark.fileserver.port=7002 -Dspark.broadcast.port=7003 -Dspark.replClassServer.port=7004 -Dspark.blockManager.port=7005 -Dspark.executor.port=7006 -Dspark.ui.port=4040 -Dspark.broadcast.factory=org.apache.spark.broadcast.HttpBroadcastFactory" 
ENV SPARK_MASTER_PORT 7077 
ENV SPARK_MASTER_WEBUI_PORT 8080 
ENV SPARK_WORKER_PORT 8888 
ENV SPARK_WORKER_WEBUI_PORT 8081

EXPOSE 8080 7077 8888 8081 4040 7001 7002 7003 7004 7005 7006

As you can see within docker file few dependencies is there which need to be satisfied including spark configuration and node manager: Therefore we go with creating one by one:

bootstrap

we need to customize the bootstrap, and for that reason we are going to update the current bootstrap file with the file. create a bash file bootstrap.sh contain following content and place beside the dockerfile:

Spark Master

create another bash file start-master.sh to start master node:

Workers

create another bash file start-worker to start workers:

spark shell

create another bash spark-shell.sh file to start spark :

spark configuration

one of the important files we need is to customize configuration by injecting a config file spark-defaults.conf:

clean up

and finaly clean up the host alias and remove the redundancies remove_alias.sh:

prepare a docker compose file to manage the package under same network

Creating docker compose file for whole package:

Within root directory, create a docker-compose.yml file. The file must be contain 4 sections, master service, 2 slaves and finaly spark container. Therefore, the file would be comprised of:

run docker-compose up --build -d

enjoy your environment.

ABOUT ATABAK

Atabak is a Software and Data Engineering Consultant

FOLLOW ATABAK