Apache Spark is an engine for Big Data processing. It can run in distributed mode on a cluster, where there is one master and any number of workers. A cluster manager schedules and divides the resources of the host machines that form the cluster; its prime job is to allocate resources across applications, acting as an external service for acquiring resources on the cluster. The following write-up is simply my experience of using a plain docker-compose setup for a clustered Spark environment.
In what follows, we will create a base Docker image called pysparkbase, which we can use for our master node as well as the workers. We then create and build the master node, followed by as many workers as needed. At the end we execute a simple Python script across the entire cluster.
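For orientation, the folders and files created throughout this post end up in a layout roughly like the one below (the run scripts are the ones referenced by the master and worker Dockerfiles; treat this as a sketch, not a strict requirement):

pyspark_cluster/
├── Dockerfile                # pysparkbase image
├── config/
│   ├── log4j.properties
│   └── spark-defaults.conf
├── pyspark-master/
│   ├── Dockerfile
│   └── runmaster.sh
├── pyspark-worker/
│   ├── Dockerfile
│   └── runworker.sh
├── pyspark-apps/             # PySpark code, shared with the containers as a volume
├── pyspark-data/             # data, shared with the containers as a volume
├── docker-compose.yml
└── build.sh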
create a docker file including pyspark modules
To create a Dockerfile for the PySpark base image, we can start from a lightweight JDK image and then add Spark and PySpark to it. So let's create a folder called pyspark_cluster, then create a Dockerfile
inside it with the following contents:
FROM openjdk:8-jdk-alpine
ENV DAEMON_RUN=true
ENV HADOOP_VERSION 2.7.3
ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop
ENV SPARK_VERSION 2.4.4
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-without-hadoop
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"
ENV PYSPARK_PYTHON python3
ENV PYSPARK_DRIVER_PYTHON=python3
ENV PATH $PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin
# Conda envs
ENV CONDA_DIR=/opt/conda CONDA_VER=4.3.14
ENV PATH=$CONDA_DIR/bin:$PATH SHELL=/bin/bash LANG=C.UTF-8
# Conda
RUN set -ex \
&& apk add --no-cache bash curl \
&& apk add --virtual .fetch-deps --no-cache ca-certificates wget \
&& wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub \
&& wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.28-r0/glibc-2.28-r0.apk \
&& apk add --virtual .conda-deps glibc-2.28-r0.apk \
&& mkdir -p $CONDA_DIR \
&& echo export PATH=$CONDA_DIR/bin:'$PATH' > /etc/profile.d/conda.sh \
&& wget https://repo.continuum.io/miniconda/Miniconda3-${CONDA_VER}-Linux-x86_64.sh -O miniconda.sh \
&& bash miniconda.sh -f -b -p $CONDA_DIR \
&& rm miniconda.sh \
&& conda update conda \
# hadoop
&& curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
| gunzip | tar -x -C /usr/ \
&& rm -rf $HADOOP_HOME/share/doc \
&& chown -R root:root $HADOOP_HOME \
# spark
&& curl -sL --retry 3 "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
| gunzip | tar x -C /usr/ \
&& mv /usr/$SPARK_PACKAGE $SPARK_HOME \
&& chown -R root:root $SPARK_HOME \
&& apk add --update curl gcc g++ \
&& apk add --no-cache python3-dev libstdc++ \
&& rm -rf /var/cache/apk/* \
&& ln -s /usr/include/locale.h /usr/include/xlocale.h \
&& pip install bottle numpy cython pandas \
&& pip3 install numpy \
&& pip3 install pandas \
# cleanup
&& apk del .fetch-deps
# && apk del .conda-deps
COPY config ${SPARK_HOME}/conf
CMD ["/bin/bash"]
As you can see (the COPY config line), the image needs config files as well, so let's create a config folder next to the Dockerfile and add two files to it. The first one is:
log4j.properties
log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark.repl.Main=WARN
log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR
log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR
You can modify it based on your requirements.
The next file is:
spark-defaults.conf
# Example:
# spark.master spark://master:7077
# spark.eventLog.enabled true
# spark.eventLog.dir hdfs://namenode:8021/directory
# spark.serializer org.apache.spark.serializer.KryoSerializer
# spark.driver.memory 5g
# spark.executor.extraJavaOptions -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"
# spark.driver.memory 40g
# spark.executor.memory 40g
# spark.driver.maxResultSize 20g
Now it is time to build the pysparkbase image:
docker build -t pysparkbase:1.0.1 .
Once it is built, we are ready to configure and build our master node. Therefore we will create a folder named pyspark-master
and create a Dockerfile with the following contents:
FROM pysparkbase:1.0.1
ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs
COPY runmaster.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/runmaster.sh
RUN ln -s /usr/local/bin/runmaster.sh / # backwards compat
EXPOSE 8080 7077 6066
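The master Dockerfile copies a runmaster.sh script that is not listed in this post. A minimal sketch, assuming it simply wraps the spark-class command shown further below and logs to the directory set in SPARK_MASTER_LOG, could look like this:

#!/bin/bash
# runmaster.sh -- hypothetical sketch: start the Spark master in the foreground
# and append its output to the log directory defined in the Dockerfile
mkdir -p $SPARK_MASTER_LOG
spark-class org.apache.spark.deploy.master.Master \
    --ip $(hostname) \
    --port $SPARK_MASTER_PORT \
    --webui-port $SPARK_MASTER_WEBUI_PORT \
    >> $SPARK_MASTER_LOG/spark-master.out 2>&1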
Let's build the image for the master node:
docker build -t pysparkmaster:1.0.1 ./pyspark-master
The same goes for the worker image, so we will create a folder called pyspark-worker
and create a Dockerfile
inside it:
FROM pysparkbase:1.0.1
ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs
ENV SPARK_MASTER "spark://pysparkmaster:7077"
COPY runworker.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/runworker.sh
RUN ln -s /usr/local/bin/runworker.sh / # backwards compat
EXPOSE 8081
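Like the master, the worker Dockerfile copies a runworker.sh script that is not listed here. A minimal sketch, assuming it wraps the worker command shown further below and registers with the master address defined in SPARK_MASTER:

#!/bin/bash
# runworker.sh -- hypothetical sketch: start a Spark worker in the foreground
# and register it with the master address defined in the Dockerfile
mkdir -p $SPARK_WORKER_LOG
spark-class org.apache.spark.deploy.worker.Worker \
    --webui-port $SPARK_WORKER_WEBUI_PORT \
    $SPARK_MASTER \
    >> $SPARK_WORKER_LOG/spark-worker.out 2>&1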
The images are ready, so let's use docker-compose
to build and run the entire stack at once. Create a docker-compose.yml file with the following contents:
version: "3.7"
services:
  pysparkmaster:
    image: pysparkmaster:1.0.1
    # build:
    #   context: .
    container_name: pysparkmaster
    hostname: pysparkmaster
    tty: true
    stdin_open: true
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkmaster"
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.2
  pysparkworker1:
    image: pysparkworker:1.0.1
    # build:
    #   context: .
    container_name: pysparkworker1
    hostname: pysparkworker1
    tty: true
    stdin_open: true
    ports:
      - "8081:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker1"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.3
  pysparkworker2:
    image: pysparkworker:1.0.1
    # build:
    #   context: .
    container_name: pysparkworker2
    hostname: pysparkworker2
    tty: true
    stdin_open: true
    ports:
      - "8082:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker2"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.4
  pysparkworker3:
    image: pysparkworker:1.0.1
    # build:
    #   context: .
    container_name: pysparkworker3
    hostname: pysparkworker3
    tty: true
    stdin_open: true
    ports:
      - "8083:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker3"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.5
networks:
  pysparknetwork:
    driver: bridge
    name: pysparknetwork
    ipam:
      driver: default
      config:
        - subnet: 172.26.0.0/16
By running docker-compose up --build -d
the whole stack will be up and running.
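To check that all four containers came up, you can run a quick sanity check:

docker-compose ps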
Now all of the containers are running and it's time to start the Spark master and link the workers to it by running the commands below.
on the master
spark-class org.apache.spark.deploy.master.Master --ip `hostname` --port 7077 --webui-port 8080
on each worker
spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://pysparkmaster:7077
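Since the containers only start a bash shell, these commands have to be executed inside the containers. One way to do that from the host (a sketch, using the container names from the compose file above) is docker exec in detached mode; afterwards the master UI at http://localhost:8080 should list the three workers:

docker exec -d pysparkmaster spark-class org.apache.spark.deploy.master.Master --ip pysparkmaster --port 7077 --webui-port 8080
docker exec -d pysparkworker1 spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://pysparkmaster:7077
docker exec -d pysparkworker2 spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://pysparkmaster:7077
docker exec -d pysparkworker3 spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://pysparkmaster:7077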
hint
You can also use a bash script or a Makefile to make building the whole thing easier and faster. For example, before running docker-compose up, you can have a build.sh
file that builds all the images at once:
#!/bin/bash
set -e
docker build -t pysparkbase:1.0.1 .
docker build -t pysparkmaster:1.0.1 ./pyspark-master
docker build -t pysparkworker:1.0.1 ./pyspark-worker
submit a task
While building and bringing up the stack, we shared some volumes:
volumes:
- ./pyspark-apps:/code/pyspark-apps
- ./pyspark-data:/code/pyspark-data
Therefore, we can put our PySpark code inside the pyspark-apps
directory and keep the data under pyspark-data
.
This makes it easy to submit tasks and to keep track of the data as well.
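For a quick test, here is a minimal PySpark script you could save as pyspark-apps/test.py (a sketch; the original test.py used in this post is not shown, so treat the contents as an assumption) just to verify that work gets distributed across the cluster:

# pyspark-apps/test.py -- hypothetical smoke-test job, not the post's original script
from pyspark.sql import SparkSession

# connect to whatever master spark-submit was pointed at
spark = SparkSession.builder.appName("cluster-smoke-test").getOrCreate()

# spread a trivial dataset over several partitions and aggregate it on the workers
rdd = spark.sparkContext.parallelize(range(1000), numSlices=6)
print("sum of 0..999 =", rdd.sum())

spark.stop()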
To submit a task, we can run the command below from the project root (here $(pwd) is used; if you put this into a Makefile, as mentioned in the hint above, use $(CURDIR) instead):
docker run --rm -v $(pwd):/code --network pysparknetwork pysparkbase:1.0.1 spark-submit --master spark://pysparkmaster:7077 /code/pyspark-apps/test.py
I hope you can set up this environment easily and enjoy using it. You can use it on your end to handle your big data tasks, streaming data, or machine learning data analysis. Good luck!