Apache Spark is an engine for big data processing. It can run in distributed mode on a cluster, where a single master coordinates any number of workers. The cluster manager schedules and divides the resources of the host machines that form the cluster; its prime job is to allocate resources across applications, acting as an external service for acquiring resources on the cluster. The following walkthrough describes my experience of using a simple docker-compose setup to build a clusterized Spark environment.

In the following discussion, we first create a base Docker image called pysparkbase that we can use for the master node as well as the workers. Then we create and build the master node, followed by as many workers as needed. At the end we execute a simple Python script that runs across the entire cluster.
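
By the end of the walkthrough the project folder will look roughly like this (the exact layout is my reconstruction from the build commands and volume mounts used below):

pyspark_cluster/
├── Dockerfile              # pysparkbase image
├── config/
│   ├── log4j.properties
│   └── spark-defaults.conf
├── pyspark-master/
│   ├── Dockerfile
│   └── runmaster.sh
├── pyspark-worker/
│   ├── Dockerfile
│   └── runworker.sh
├── pyspark-apps/           # pyspark code, mounted into the containers
├── pyspark-data/           # data, mounted into the containers
├── build.sh
└── docker-compose.yml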

create a Dockerfile including the pyspark modules

For the pyspark base image we can start from a lightweight JDK image and add Spark and PySpark on top of it. So let's create a folder called pyspark_cluster and create a Dockerfile inside it with the following content:

FROM openjdk:8-jdk-alpine

ENV DAEMON_RUN=true

ENV HADOOP_VERSION 2.7.3
ENV HADOOP_HOME /usr/hadoop-$HADOOP_VERSION
ENV HADOOP_CONF_DIR=$HADOOP_HOME/etc/hadoop

ENV SPARK_VERSION 2.4.4
ENV SPARK_PACKAGE spark-${SPARK_VERSION}-bin-without-hadoop
ENV SPARK_HOME /usr/spark-${SPARK_VERSION}
ENV SPARK_DIST_CLASSPATH="$HADOOP_HOME/etc/hadoop/*:$HADOOP_HOME/share/hadoop/common/lib/*:$HADOOP_HOME/share/hadoop/common/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/hdfs/lib/*:$HADOOP_HOME/share/hadoop/hdfs/*:$HADOOP_HOME/share/hadoop/yarn/lib/*:$HADOOP_HOME/share/hadoop/yarn/*:$HADOOP_HOME/share/hadoop/mapreduce/lib/*:$HADOOP_HOME/share/hadoop/mapreduce/*:$HADOOP_HOME/share/hadoop/tools/lib/*"

ENV PYSPARK_PYTHON python3
ENV PYSPARK_DRIVER_PYTHON=python3

ENV PATH $PATH:$JAVA_HOME/bin:$HADOOP_HOME/bin:$SPARK_HOME/bin

# Conda envs
ENV CONDA_DIR=/opt/conda CONDA_VER=4.3.14
ENV PATH=$CONDA_DIR/bin:$PATH SHELL=/bin/bash LANG=C.UTF-8

# Conda
RUN set -ex \
  && apk add --no-cache bash curl \
  && apk add --virtual .fetch-deps --no-cache ca-certificates wget \
  && wget -q -O /etc/apk/keys/sgerrand.rsa.pub https://alpine-pkgs.sgerrand.com/sgerrand.rsa.pub \
  && wget https://github.com/sgerrand/alpine-pkg-glibc/releases/download/2.28-r0/glibc-2.28-r0.apk \
  && apk add --virtual .conda-deps glibc-2.28-r0.apk \
  && mkdir -p $CONDA_DIR  \
  && echo export PATH=$CONDA_DIR/bin:'$PATH' > /etc/profile.d/conda.sh \
  && wget https://repo.continuum.io/miniconda/Miniconda3-${CONDA_VER}-Linux-x86_64.sh -O miniconda.sh \
  && bash miniconda.sh -f -b -p $CONDA_DIR \
  && rm miniconda.sh \
  && conda update conda \
  # hadoop
  && curl -sL --retry 3 "http://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz" \
  | gunzip | tar -x -C /usr/ \
  && rm -rf $HADOOP_HOME/share/doc \
  && chown -R root:root $HADOOP_HOME \
  # spark
  && curl -sL --retry 3 "https://www.apache.org/dyn/mirrors/mirrors.cgi?action=download&filename=spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
  | gunzip | tar x -C /usr/ \
  && mv /usr/$SPARK_PACKAGE $SPARK_HOME \
  && chown -R root:root $SPARK_HOME \
  && apk add --update curl gcc g++ \
  && apk add --no-cache python3-dev libstdc++ \
  && rm -rf /var/cache/apk/* \
  && ln -s /usr/include/locale.h /usr/include/xlocale.h \
  && pip install bottle numpy cython pandas \
  && pip3 install numpy \
  && pip3 install pandas \
  # cleanup
  && apk del .fetch-deps
  # && apk del .conda-deps

COPY config ${SPARK_HOME}/conf

CMD ["/bin/bash"]

As you can see, the image also needs config files (the COPY config line at the end), so let's create a config folder next to the Dockerfile and add two files to it. The first one is log4j.properties:

log4j.rootCategory=ERROR, console
log4j.appender.console=org.apache.log4j.ConsoleAppender
log4j.appender.console.target=System.err
log4j.appender.console.layout=org.apache.log4j.PatternLayout
log4j.appender.console.layout.ConversionPattern=%d{yy/MM/dd HH:mm:ss} %p %c{1}: %m%n
log4j.logger.org.apache.spark.repl.Main=WARN

log4j.logger.org.spark_project.jetty=WARN
log4j.logger.org.spark_project.jetty.util.component.AbstractLifeCycle=ERROR
log4j.logger.org.apache.spark.repl.SparkIMain$exprTyper=INFO
log4j.logger.org.apache.spark.repl.SparkILoop$SparkILoopInterpreter=INFO
log4j.logger.org.apache.parquet=ERROR
log4j.logger.parquet=ERROR

log4j.logger.org.apache.hadoop.hive.metastore.RetryingHMSHandler=FATAL
log4j.logger.org.apache.hadoop.hive.ql.exec.FunctionRegistry=ERROR

You can modify it based on your requirements. The next file is spark-defaults.conf:

# Example:
# spark.master                     spark://master:7077
# spark.eventLog.enabled           true
# spark.eventLog.dir               hdfs://namenode:8021/directory
# spark.serializer                 org.apache.spark.serializer.KryoSerializer
# spark.driver.memory              5g
# spark.executor.extraJavaOptions  -XX:+PrintGCDetails -Dkey=value -Dnumbers="one two three"

# spark.driver.memory                40g
# spark.executor.memory              40g
# spark.driver.maxResultSize         20g

Now it is time to build the pysparkbase image:

docker build -t pysparkbase:1.0.1 .

Once the base image is built, we are ready to configure and build our master node. We create a folder named pyspark-master and add a Dockerfile with the following content:

FROM pysparkbase:1.0.1

ENV SPARK_MASTER_PORT 7077
ENV SPARK_MASTER_WEBUI_PORT 8080
ENV SPARK_MASTER_LOG /spark/logs

COPY  runmaster.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/runmaster.sh
RUN ln -s /usr/local/bin/runmaster.sh / # backwards compat

EXPOSE 8080 7077 6066
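
The Dockerfile copies a runmaster.sh start-up script that is not shown above. A minimal sketch of it, assuming it simply wraps the spark-class master command we run later in this post together with the SPARK_MASTER_* variables defined above, could look like this:

#!/bin/bash
# pyspark-master/runmaster.sh -- a minimal sketch, not necessarily the original script
set -e

mkdir -p $SPARK_MASTER_LOG

# start the Spark master in the foreground and keep a copy of its log
spark-class org.apache.spark.deploy.master.Master \
  --ip `hostname` --port $SPARK_MASTER_PORT --webui-port $SPARK_MASTER_WEBUI_PORT \
  2>&1 | tee -a $SPARK_MASTER_LOG/spark-master.out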

let’s build the image for the master node:

docker build -t pysparkmaster:1.0.1 ./pyspark-master

The same goes for the worker image, so we create a folder called pyspark-worker and add a Dockerfile inside it:

FROM pysparkbase:1.0.1

ENV SPARK_WORKER_WEBUI_PORT 8081
ENV SPARK_WORKER_LOG /spark/logs
ENV SPARK_MASTER "spark://pysparkmaster:7077"


COPY  runworker.sh /usr/local/bin/
RUN chmod +x /usr/local/bin/runworker.sh
RUN ln -s /usr/local/bin/runworker.sh / # backwards compat

EXPOSE 8081
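
Again, the runworker.sh script referenced by the COPY line is not shown; a minimal sketch, assuming it wraps the spark-class worker command with the SPARK_WORKER_* and SPARK_MASTER variables defined above, could be:

#!/bin/bash
# pyspark-worker/runworker.sh -- a minimal sketch, not necessarily the original script
set -e

mkdir -p $SPARK_WORKER_LOG

# start the worker in the foreground and register it with the master
spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port $SPARK_WORKER_WEBUI_PORT $SPARK_MASTER \
  2>&1 | tee -a $SPARK_WORKER_LOG/spark-worker.out

Then build the worker image (this command also appears in the build.sh script in the hint section below):

docker build -t pysparkworker:1.0.1 ./pyspark-worker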

The images are ready, so let's use docker-compose to bring the entire package up at once. Create a docker-compose.yml in the project root:

version: "3.7"
services:
  pysparkmaster:
    image: pysparkmaster:1.0.1
    #      build:
    #        context: .
    container_name: pysparkmaster
    hostname: pysparkmaster
    tty: true
    stdin_open: true
    ports:
      - "8080:8080"
      - "7077:7077"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkmaster"
    networks:
      pysparknetwork:
         ipv4_address: 172.26.0.2
  pysparkworker1:
    image: pysparkworker:1.0.1
    #      build:
    #        context: .
    container_name: pysparkworker1
    hostname: pysparkworker1
    tty: true
    stdin_open: true
    ports:
      - "8081:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker1"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.3
  pysparkworker2:
    image: pysparkworker:1.0.1
    #      build:
    #        context: .
    container_name: pysparkworker2
    hostname: pysparkworker2
    tty: true
    stdin_open: true
    ports:
      - "8082:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker2"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.4
  pysparkworker3:
    image: pysparkworker:1.0.1
    #      build:
    #        context: .
    container_name: pysparkworker3
    hostname: pysparkworker3
    tty: true
    stdin_open: true
    ports:
      - "8083:8081"
    volumes:
      - ./pyspark-apps:/code/pyspark-apps
      - ./pyspark-data:/code/pyspark-data
    environment:
      - "SPARK_LOCAL_IP=pysparkworker3"
    depends_on:
      - pysparkmaster
    networks:
      pysparknetwork:
        ipv4_address: 172.26.0.5

networks:
  pysparknetwork:
    driver: bridge
    name: pysparknetwork
    ipam:
      driver: default
      config:
        - subnet: 172.26.0.0/16

By running docker-compose up --build -d the whole package will be up and running.
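
If everything went well, docker ps should show the four containers, and the Spark web UIs are published on the host as defined in the port mappings above:

docker ps
# master web UI:  http://localhost:8080
# worker web UIs: http://localhost:8081, http://localhost:8082, http://localhost:8083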

Now all of the containers are running and it is time to start the master and link the workers to it by running:

master

spark-class org.apache.spark.deploy.master.Master --ip `hostname` --port 7077 --webui-port 8080

each worker

spark-class org.apache.spark.deploy.worker.Worker --webui-port 8081 spark://pysparkmaster:7077
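
Since the images do not define an entrypoint and the containers just sit with a bash shell open, one way to run these commands (sketched here as an assumption; you could also wire runmaster.sh and runworker.sh up as the containers' commands) is docker exec from the host:

# start the master inside its container
docker exec -d pysparkmaster spark-class org.apache.spark.deploy.master.Master \
  --ip pysparkmaster --port 7077 --webui-port 8080

# start each worker and point it at the master
docker exec -d pysparkworker1 spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port 8081 spark://pysparkmaster:7077
docker exec -d pysparkworker2 spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port 8081 spark://pysparkmaster:7077
docker exec -d pysparkworker3 spark-class org.apache.spark.deploy.worker.Worker \
  --webui-port 8081 spark://pysparkmaster:7077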

hint

You can also use a bash script or a Makefile to make building the whole thing easier and faster. For example, before running docker-compose up you can have a build.sh file that builds all the images at once:

#!/bin/bash

set -e

docker build -t pysparkbase:1.0.1 .
docker build -t pysparkmaster:1.0.1 ./pyspark-master
docker build -t pysparkworker:1.0.1 ./pyspark-worker
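
With that in place the full stack can be brought up with:

chmod +x build.sh
./build.sh
docker-compose up -d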

submit a task

While bringing up the package we shared some volumes:

volumes:
    - ./pyspark-apps:/code/pyspark-apps
    - ./pyspark-data:/code/pyspark-data

Therefore, we can put our pyspark code inside the pyspark-apps directory and keep the data under pyspark-data.

This makes it easy to submit tasks and to keep track of the data.
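
For example, a minimal PySpark script (a hypothetical test.py saved as pyspark-apps/test.py, just to verify the cluster works; any small job will do) could look like this:

# pyspark-apps/test.py -- a hypothetical smoke-test job
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("cluster-smoke-test").getOrCreate()

# distribute a small range across the workers and sum it
rdd = spark.sparkContext.parallelize(range(1000), numSlices=6)
print("sum of 0..999 =", rdd.sum())

spark.stop()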

To submit a task to the master node we can run the following command from the project root:

docker run --rm -v $(pwd):/code --network pysparknetwork pysparkbase:1.0.1 spark-submit --master spark://pysparkmaster:7077 /code/pyspark-apps/test.py

I hope you can set up this environment easily and enjoy working with it. You can use it on your end to handle your big data tasks, streaming data, or machine learning data analysis. Good luck!

