This document shares an experience of setting up a dockerized master-slave Hadoop cluster with Spark on top, then configuring the environment to listen to streaming data. It will also help you automate your environment setup for development and production stages.

This experiment consists of: a) providing a master container, b) providing two slave hadoop containers, and c) preparing spark on top of the hadoop infrastructure.

Requirements

  • Ubuntu
  • Internet access

Quick outline

  • create a hadoop master container.
  • create two slave hadoop containers.
  • create a spark container pointing to hadoop.
  • prepare the docker environment with docker-compose.

prepare a docker image for the hadoop master node

Creating a Dockerfile for the hadoop master node:

Create a directory named hadoop-master, then create a Dockerfile inside it containing the following:
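A minimal sketch, assuming Hadoop 2.7.7 on Ubuntu 16.04 with passwordless SSH between nodes (the versions, paths, and the config/ directory beside the Dockerfile are all assumptions, not the original file):

```dockerfile
FROM ubuntu:16.04

# Java and SSH are prerequisites for Hadoop
RUN apt-get update && apt-get install -y openjdk-8-jdk ssh rsync curl

ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV HADOOP_VERSION=2.7.7
ENV HADOOP_HOME=/usr/local/hadoop
ENV PATH=$PATH:$HADOOP_HOME/bin:$HADOOP_HOME/sbin

# download and unpack Hadoop, and make JAVA_HOME visible to its daemons
RUN curl -sL https://archive.apache.org/dist/hadoop/common/hadoop-$HADOOP_VERSION/hadoop-$HADOOP_VERSION.tar.gz \
      | tar -xz -C /usr/local/ \
    && mv /usr/local/hadoop-$HADOOP_VERSION $HADOOP_HOME \
    && echo "export JAVA_HOME=$JAVA_HOME" >> $HADOOP_HOME/etc/hadoop/hadoop-env.sh

# passwordless SSH so start-dfs.sh / start-yarn.sh can reach the slaves
RUN mkdir -p ~/.ssh \
    && ssh-keygen -t rsa -f ~/.ssh/id_rsa -P '' \
    && cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys \
    && printf "Host *\n  StrictHostKeyChecking no\n" >> /etc/ssh/ssh_config

# cluster config (core-site.xml, hdfs-site.xml, yarn-site.xml, slaves)
# assumed to live in a config/ directory beside the Dockerfile
COPY config/ $HADOOP_HOME/etc/hadoop/

COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh

EXPOSE 9000 50070 8088
CMD ["/bootstrap.sh"]
```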

As you can see in the Dockerfile, the customized bootstrap script must be located beside it. So, inside the same directory, create a bash file named bootstrap.sh and put the following script in it:
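A minimal sketch of the script (the /hdfs-formatted marker file is a hypothetical guard so HDFS is not reformatted on every restart):

```bash
#!/bin/bash
# start sshd so the Hadoop start scripts can reach this node and the slaves
service ssh start

# format the namenode on first run only
if [ ! -f /hdfs-formatted ]; then
  $HADOOP_HOME/bin/hdfs namenode -format -force
  touch /hdfs-formatted
fi

# bring up HDFS and YARN across the cluster
$HADOOP_HOME/sbin/start-dfs.sh
$HADOOP_HOME/sbin/start-yarn.sh

# keep the container in the foreground
tail -f /dev/null
```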

prepare a docker image for the hadoop slave nodes

Creating a Dockerfile for the hadoop slave nodes:

Create a directory named hadoop-slave, then create a Dockerfile inside it containing the following:
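A minimal sketch; it assumes the master image above was built and tagged hadoop-master, so the slave image reuses that setup and only swaps in its own bootstrap:

```dockerfile
# reuse the master's Java/Hadoop/SSH setup; only the bootstrap differs
FROM hadoop-master:latest

COPY bootstrap.sh /bootstrap.sh
RUN chmod +x /bootstrap.sh

EXPOSE 50010 50075 8042
CMD ["/bootstrap.sh"]
```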

As you can see in the Dockerfile, the slave containers also require a customized bootstrap script located beside the Dockerfile. So, inside the same directory, create a bash file named bootstrap.sh and put the following script in it:
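A minimal sketch, matching the Hadoop 2.x daemon scripts assumed above:

```bash
#!/bin/bash
# sshd so the master's start scripts can manage this node
service ssh start

# a slave runs only the HDFS datanode and the YARN nodemanager
$HADOOP_HOME/sbin/hadoop-daemon.sh start datanode
$HADOOP_HOME/sbin/yarn-daemon.sh start nodemanager

tail -f /dev/null
```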

prepare a docker image for spark and configure the nodes

Now it is time to configure and create a container for spark, including its dependencies:

Create a directory named spark, then create a Dockerfile inside it containing the following:
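A minimal sketch, assuming Spark 2.2.0 prebuilt for Hadoop 2.7 (the version and the script names are assumptions that match the files created in the rest of this section):

```dockerfile
FROM ubuntu:16.04

RUN apt-get update && apt-get install -y openjdk-8-jdk ssh curl

ENV JAVA_HOME=/usr/lib/jvm/java-8-openjdk-amd64
ENV SPARK_VERSION=2.2.0
ENV SPARK_HOME=/usr/local/spark
ENV PATH=$PATH:$SPARK_HOME/bin:$SPARK_HOME/sbin

# download and unpack Spark
RUN curl -sL https://archive.apache.org/dist/spark/spark-$SPARK_VERSION/spark-$SPARK_VERSION-bin-hadoop2.7.tgz \
      | tar -xz -C /usr/local/ \
    && mv /usr/local/spark-$SPARK_VERSION-bin-hadoop2.7 $SPARK_HOME

# the helper scripts and config created below, placed beside this Dockerfile
COPY bootstrap.sh start-master.sh start-worker.sh spark-shell.sh remove_alias.sh /
COPY spark-defaults.conf $SPARK_HOME/conf/
RUN chmod +x /bootstrap.sh /start-master.sh /start-worker.sh /spark-shell.sh /remove_alias.sh

EXPOSE 7077 8080
CMD ["/bootstrap.sh"]
```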

As you can see in the Dockerfile, a few dependencies need to be satisfied, including the spark configuration and the node-management scripts. We will create them one by one:

bootstrap

We need to customize the bootstrap here as well. Create a bash file bootstrap.sh with the following content and place it beside the Dockerfile:
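A minimal sketch that chains the scripts defined below:

```bash
#!/bin/bash
# clean out any stale host aliases left over from a previous run
/remove_alias.sh

service ssh start

# bring up the standalone master, then attach a worker to it
/start-master.sh
/start-worker.sh

tail -f /dev/null
```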

Spark Master

Create another bash file, start-master.sh, to start the master node:
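A minimal sketch; spark-master is the assumed hostname of this container (it matches the compose service defined later):

```bash
#!/bin/bash
# start the Spark standalone master and its web UI
$SPARK_HOME/sbin/start-master.sh --host spark-master --port 7077 --webui-port 8080
```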

Workers

Create another bash file, start-worker.sh, to start the workers:
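A minimal sketch that registers one worker against the master started above (start-slave.sh is the script's name in Spark 2.x):

```bash
#!/bin/bash
# attach a worker to the standalone master
$SPARK_HOME/sbin/start-slave.sh spark://spark-master:7077
```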

spark shell

Create another bash file, spark-shell.sh, to start the spark shell:
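A minimal sketch:

```bash
#!/bin/bash
# open an interactive shell against the standalone master
$SPARK_HOME/bin/spark-shell --master spark://spark-master:7077
```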

spark configuration

One of the important files we need is spark-defaults.conf, which injects the customized configuration:
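A minimal sketch; the master URL and the HDFS event-log path (which must already exist in HDFS) are assumptions:

```
spark.master                   spark://spark-master:7077
spark.eventLog.enabled         true
spark.eventLog.dir             hdfs://hadoop-master:9000/spark-logs
spark.serializer               org.apache.spark.serializer.KryoSerializer
```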

clean up

And finally, clean up the host aliases and remove the redundancies with remove_alias.sh:
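A minimal sketch; the alias names are assumptions matching the compose service names, and the copy-back step is needed because Docker bind-mounts /etc/hosts, so sed -i cannot replace the file in place:

```bash
#!/bin/bash
# drop stale hadoop/spark aliases so restarts don't accumulate duplicates
grep -v -e 'hadoop-master' -e 'hadoop-slave' /etc/hosts > /tmp/hosts.clean
cat /tmp/hosts.clean > /etc/hosts
rm -f /tmp/hosts.clean
```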

prepare a docker-compose file to manage the package under the same network

Creating a docker-compose file for the whole package:

Within the root directory, create a docker-compose.yml file. The file must contain four services: the master, the two slaves, and finally the spark container. So the file would comprise:
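A minimal sketch; the service names, hostnames, and published ports are assumptions consistent with the scripts above:

```yaml
version: "2"

services:
  hadoop-master:
    build: ./hadoop-master
    hostname: hadoop-master
    ports:
      - "50070:50070"   # HDFS namenode web UI
      - "8088:8088"     # YARN resource manager web UI
    networks:
      - cluster

  hadoop-slave1:
    build: ./hadoop-slave
    hostname: hadoop-slave1
    depends_on:
      - hadoop-master
    networks:
      - cluster

  hadoop-slave2:
    build: ./hadoop-slave
    hostname: hadoop-slave2
    depends_on:
      - hadoop-master
    networks:
      - cluster

  spark-master:
    build: ./spark
    hostname: spark-master
    depends_on:
      - hadoop-master
    ports:
      - "8080:8080"     # Spark master web UI
    networks:
      - cluster

networks:
  cluster:
    driver: bridge
```

All four services join the same bridge network, so they can resolve each other by hostname (hadoop-master, spark-master, and so on).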

Run docker-compose up --build -d.

Enjoy your environment.

