Before starting, a quick disclaimer: this post is not a Docker tutorial but a post on how to use Docker to create a cluster on your own machine. If you've never heard of Docker before (which would be surprising) or never had the chance to play with it, I highly recommend watching this great introductory video by Mumshad Mannambeth. I've also made a short and concise cheat sheet that you can refer to.
In addition, I will assume you already know what Dask, Spark, Yarn and Hadoop are all about. Again, this isn't an introductory tutorial but more of a "cookbook", so to speak.
If you're here for the code and want a turnkey cluster on your own machine, don't hesitate to use the code from my Git repo as you please.
When first learning Spark or Dask, we generally run them on our local machine. It is convenient and somewhat simple, but far from a real-life scenario where computations are distributed over many nodes. And sometimes we just want to test one or two things on Spark or Dask, only to remember that we've deleted that standalone folder from last time, or thrown away the Python virtual environment where we had sweetly installed Dask and PySpark, and redoing the configuration is a real pain in the a**.
Docker can be a good tool to make our life a bit easier here: it can simulate a "real-life" environment on your own machine. We will harness its OS-level virtualization to put Yarn, Spark and Dask on a 1-master, 2-slave cluster (more or less, depending on your machine's power). If you're not very familiar with Docker, your first reaction may be to ask: why not use several virtual machines to do the job?! It's a valid question in my opinion, and I have two answers. First, Docker does virtualization too, but at a lower level, which means a big gain in resource consumption; more room for resources means more virtual machines (called containers). Moreover, Docker's virtualization isn't the "objective" but just a means to run contained applications, which brings me to my second point: Docker is a "Platform as a Service", or in a cooler-sounding way a PaaS, which means it offers a consistent API to run and configure our containers without any hassle.
The architecture I have in mind is a cluster with three nodes. One is a master: it won't do any "work" but will act as a sort of manager for the other nodes, and it will host the web interfaces (a Jupyter Notebook server and the Yarn resource manager UI) that we'll use to run and check our work on top of the cluster. The two other nodes are slaves (computer science can be harsh sometimes); they will do the heavy lifting and execute any compute-map-reducy-machine-learny stuff we want them to do.
As presented in the following image, Docker runs inside the host (your own computer, in less fashionable words) and will run three containers corresponding to the three nodes discussed above. This architecture depends on your resources (the number of CPUs and the amount of RAM your machine has): you can go up to 6 nodes if you wish, or down to 2 nodes if your resources are limited.
The project is organized as follows:
├── docker-compose.yml
├── Dockerfile
├── confs
│   ├── config
│   ├── core-site.xml
│   ├── hdfs-site.xml
│   ├── mapred-site.xml
│   ├── requirements.req
│   ├── slaves
│   ├── spark-defaults.conf
│   └── yarn-site.xml
├── datasets
│   ├── alice_in_wonderland.txt
│   └── iris.csv
├── notebooks
│   ├── Bash-Interface.ipynb
│   ├── Dask-Yarn.ipynb
│   ├── Python-Spark.ipynb
│   └── Scala-Spark.ipynb
└── script_files
    └── bootstrap.sh
bootstrap.sh will be executed each time a container starts; I'll explain a bit further down what it does exactly.
Configuration is a huge deal here, because of the resources you have. Remember, we're trying to simulate several nodes on one machine, so if you have, say, 4 GB of RAM and a lonely CPU... it's probably better not to bother with all this and stick to whatever is already running on your machine. But if you happen to have 16 GB of RAM and 4 cores or more, go ahead: this will give you the ability to simulate a cluster if you don't have one within your reach (even 8 GB can do the job, but you'll have to limit yourself to one slave node).
That being said, here is the default configuration I've used to run the three nodes; you can of course change it to suit your needs and your machine's limits.
spark-defaults.conf (Spark):
# We inform Spark that it has to use Yarn
spark.master yarn
# Client mode is important so we can interact with Spark using Jupyter
spark.submit.deployMode client
# Spark driver and executor memory
spark.driver.memory 512m
spark.executor.memory 512m
# This has to be lower than yarn.nodemanager.resource.memory-mb (inside yarn-site.xml)
spark.yarn.am.memory 1G
slaves (Yarn/Hadoop): here we define the names of our slaves; as we will see further on, those names are set in the docker-compose.yml file. Docker's internal DNS takes care of matching those names with their respective containers.
node-slave1
node-slave2
core-site.xml (Yarn/Hadoop): here we define the master node's name ("the host") and the port used for HDFS communication. In this case the master node's name is node-master and the HDFS port is 9000.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>fs.default.name</name>
        <value>hdfs://node-master:9000</value>
    </property>
</configuration>
hdfs-site.xml (Hadoop): here we define some properties related to HDFS. We set the replication factor (dfs.replication) equal to the number of slave nodes; in this case 2 means each file will be duplicated across the cluster. Other than that, we simply set the directories for the namenode and the datanode.
<?xml version="1.0" encoding="UTF-8"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>dfs.namenode.name.dir</name>
        <value>/opt/hadoop/data/nameNode</value>
    </property>
    <property>
        <name>dfs.datanode.data.dir</name>
        <value>/opt/hadoop/data/dataNode</value>
    </property>
    <property>
        <name>dfs.replication</name>
        <value>2</value>
    </property>
</configuration>
yarn-site.xml (Yarn): here we set Yarn's resource consumption and indicate which node is the master. Each slave will then offer only one core (yarn.nodemanager.resource.cpu-vcores) and a maximum of 1536 MB of memory (yarn.nodemanager.resource.memory-mb).
<?xml version="1.0"?>
<configuration>
    <property>
        <name>yarn.acl.enable</name>
        <value>0</value>
    </property>
    <property>
        <name>yarn.resourcemanager.hostname</name>
        <value>node-master</value>
    </property>
    <property>
        <name>yarn.nodemanager.aux-services</name>
        <value>mapreduce_shuffle</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.cpu-vcores</name>
        <value>1</value>
    </property>
    <property>
        <name>yarn.nodemanager.resource.memory-mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.scheduler.minimum-allocation-mb</name>
        <value>256</value>
    </property>
    <property>
        <name>yarn.scheduler.maximum-allocation-mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.nodemanager.vmem-check-enabled</name>
        <value>false</value>
    </property>
</configuration>
mapred-site.xml (Yarn/Hadoop): this time we're configuring the map and reduce processing itself.
<?xml version="1.0"?>
<?xml-stylesheet type="text/xsl" href="configuration.xsl"?>
<configuration>
    <property>
        <name>mapreduce.framework.name</name>
        <value>yarn</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.map.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>mapreduce.reduce.env</name>
        <value>HADOOP_MAPRED_HOME=$HADOOP_HOME</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.resource.mb</name>
        <value>1536</value>
    </property>
    <property>
        <name>yarn.app.mapreduce.am.command-opts</name>
        <value>-Xmx400m</value>
    </property>
    <property>
        <name>mapreduce.map.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.reduce.memory.mb</name>
        <value>256</value>
    </property>
    <property>
        <name>mapreduce.map.java.opts</name>
        <value>-Xmx200m</value>
    </property>
    <property>
        <name>mapreduce.reduce.java.opts</name>
        <value>-Xmx400m</value>
    </property>
</configuration>
If you don't know how to choose the right values for your resources, Cloudera has this article explaining some useful heuristics.
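As a rough sanity check of the defaults above (back-of-the-envelope only: Spark 2.4 adds an overhead of at least 384 MB per container request, and Yarn rounds each request up to a multiple of yarn.scheduler.minimum-allocation-mb): a 512 MB executor ends up asking for about 896 MB, rounded up to 1024 MB, while the 1 GB application master asks for about 1408 MB, rounded up to 1536 MB, which is exactly what a single slave node offers. So with two slaves you can expect the application master to fill one node and a lone executor to land on the other.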
Now we need a special "distribution" (called an image in Docker jargon) to run in our three containers; it must include the necessary applications and configurations so we end up with a sweet little OS with Python, Spark, Hadoop & Yarn and all their friends on it!
To do this, we put the necessary instructions inside the file named Dockerfile. We start by specifying the base OS/image to use; in our case it is Ubuntu Bionic Beaver, also known as Ubuntu 18.04 LTS.
FROM ubuntu:bionic
Right after, we define all the environment variables needed by the (future) running session. The comments explain the role of each line.
# telling hadoop and spark where to find java!
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
# after downloading hadoop (a bit further down) we have to tell any concerned
# app where to find it
ENV HADOOP_HOME /opt/hadoop
# same for the hadoop configuration
ENV HADOOP_CONF_DIR /opt/hadoop/etc/hadoop
# and same for spark
ENV SPARK_HOME /opt/spark
# with this we can run all hadoop and spark scripts and commands directly from the shell
# without using the absolute path
ENV PATH="${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"
# just recording the hadoop version, this isn't strictly necessary
ENV HADOOP_VERSION 2.7.0
# if you happen to run pyspark from the shell, it will be launched inside a Jupyter Notebook
# these are just two fancy lines, not really needed
ENV PYSPARK_DRIVER_PYTHON=jupyter
ENV PYSPARK_DRIVER_PYTHON_OPTS='notebook'
# showing pyspark which "python" command to use
ENV PYSPARK_PYTHON=python3
We will now update the OS and install Java, an SSH server and Python (including pip and other dependencies). You can also see that we install wget (to download from URLs) and nano (to view and edit files); it may seem odd, but the ubuntu:bionic base image used by Docker is very, very lean and contains only what is necessary to define the distribution itself.
RUN apt-get update && \
apt-get install -y wget nano openjdk-8-jdk ssh openssh-server
RUN apt update && apt install -y python3 python3-pip python3-dev build-essential libssl-dev libffi-dev libpq-dev
Now we copy the requirements file into the image and install the Python packages it lists.
COPY /confs/requirements.req /
RUN pip3 install -r requirements.req
RUN pip3 install dask[bag] --upgrade
RUN pip3 install --upgrade toree
RUN python3 -m bash_kernel.install
It is time to download Hadoop and Spark and extract each of them into its own folder.
RUN wget -P /tmp/ https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
RUN tar xvf /tmp/hadoop-2.7.0.tar.gz -C /tmp && \
mv /tmp/hadoop-2.7.0 /opt/hadoop
RUN wget -P /tmp/ http://mirror.ibcp.fr/pub/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar xvf /tmp/spark-2.4.4-bin-hadoop2.7.tgz -C /tmp && \
mv /tmp/spark-2.4.4-bin-hadoop2.7 ${SPARK_HOME}
We need to generate an SSH key to allow password-less communication between our containers, as Hadoop and Spark expect. Just to clarify something: installing Hadoop is necessary to run Yarn and to use HDFS. The config file copied into /root/.ssh isn't shown here, but it typically just relaxes SSH's host-key checking so the nodes can reach each other without interactive prompts.
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 600 ~/.ssh/authorized_keys
COPY /confs/config /root/.ssh
RUN chmod 600 /root/.ssh/config
Now we copy the preset configuration files and the ENTRYPOINT script (a script executed each time a container is started from this image).
COPY /confs/*.xml /opt/hadoop/etc/hadoop/
COPY /confs/slaves /opt/hadoop/etc/hadoop/
COPY /script_files/bootstrap.sh /
COPY /confs/spark-defaults.conf ${SPARK_HOME}/conf
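Note that the final Dockerfile (shown further down) adds two more lines at this point: one registers the Toree kernel with Jupyter, which is what lets the Scala-Spark notebook talk to Spark, and the other persists JAVA_HOME system-wide.
RUN jupyter toree install --spark_home=${SPARK_HOME}
RUN echo "export JAVA_HOME=${JAVA_HOME}" >> /etc/environment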
We then expose the ports the containers should listen on; by default these are TCP ports.
EXPOSE 9000
EXPOSE 7077
EXPOSE 4040
EXPOSE 8020
EXPOSE 22
Finally, we copy the starters (notebooks and datasets) and specify the command to run as the ENTRYPOINT of our image; in our case it simply runs the bootstrap.sh script.
RUN mkdir -p /root/lab
COPY notebooks/*.ipynb /root/lab/
COPY datasets /root/lab/datasets
ENTRYPOINT ["/bin/bash", "bootstrap.sh"]
The bootstrap.sh script does four things: it formats the HDFS namenode, starts the SSH service, then (on the master node only) starts HDFS and Yarn and launches a Jupyter Notebook server, and finally keeps the container running indefinitely:
#!/bin/bash
hdfs namenode -format
service ssh start
if [ "$HOSTNAME" = node-master ]; then
    start-dfs.sh
    start-yarn.sh
    cd /root/lab
    jupyter trust Bash-Interface.ipynb
    jupyter trust Dask-Yarn.ipynb
    jupyter trust Python-Spark.ipynb
    jupyter trust Scala-Spark.ipynb
    jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root --NotebookApp.token='' --NotebookApp.password='' &
fi
# keep the container alive forever: start a background no-op loop, suspend it and wait on it
while :; do :; done & kill -STOP $! && wait $!
Putting all these chunks together, the final Dockerfile looks like this:
FROM ubuntu:bionic
ENV JAVA_HOME /usr/lib/jvm/java-1.8.0-openjdk-amd64/jre
ENV HADOOP_HOME /opt/hadoop
ENV HADOOP_CONF_DIR /opt/hadoop/etc/hadoop
ENV SPARK_HOME /opt/spark
ENV PATH="${HADOOP_HOME}/bin:${HADOOP_HOME}/sbin:${SPARK_HOME}/bin:${SPARK_HOME}/sbin:${PATH}"
ENV HADOOP_VERSION 2.7.0
ENV PYSPARK_DRIVER_PYTHON=jupyter
ENV PYSPARK_DRIVER_PYTHON_OPTS='notebook'
ENV PYSPARK_PYTHON=python3
RUN apt-get update && \
apt-get install -y wget nano openjdk-8-jdk ssh openssh-server
RUN apt update && apt install -y python3 python3-pip python3-dev build-essential libssl-dev libffi-dev libpq-dev
COPY /confs/requirements.req /
RUN pip3 install -r requirements.req
RUN pip3 install dask[bag] --upgrade
RUN pip3 install --upgrade toree
RUN python3 -m bash_kernel.install
RUN wget -P /tmp/ https://archive.apache.org/dist/hadoop/common/hadoop-2.7.0/hadoop-2.7.0.tar.gz
RUN tar xvf /tmp/hadoop-2.7.0.tar.gz -C /tmp && \
mv /tmp/hadoop-2.7.0 /opt/hadoop
RUN wget -P /tmp/ http://mirror.ibcp.fr/pub/apache/spark/spark-2.4.4/spark-2.4.4-bin-hadoop2.7.tgz
RUN tar xvf /tmp/spark-2.4.4-bin-hadoop2.7.tgz -C /tmp && \
mv /tmp/spark-2.4.4-bin-hadoop2.7 ${SPARK_HOME}
RUN ssh-keygen -t rsa -P '' -f ~/.ssh/id_rsa && \
cat ~/.ssh/id_rsa.pub >> ~/.ssh/authorized_keys && \
chmod 600 ~/.ssh/authorized_keys
COPY /confs/config /root/.ssh
RUN chmod 600 /root/.ssh/config
COPY /confs/*.xml /opt/hadoop/etc/hadoop/
COPY /confs/slaves /opt/hadoop/etc/hadoop/
COPY /script_files/bootstrap.sh /
COPY /confs/spark-defaults.conf ${SPARK_HOME}/conf
RUN jupyter toree install --spark_home=${SPARK_HOME}
RUN echo "export JAVA_HOME=${JAVA_HOME}" >> /etc/environment
EXPOSE 9000
EXPOSE 7077
EXPOSE 4040
EXPOSE 8020
EXPOSE 22
RUN mkdir -p /root/lab
COPY notebooks/*.ipynb /root/lab/
COPY datasets /root/lab/datasets
ENTRYPOINT ["/bin/bash", "bootstrap.sh"]
Now we need to build our image; to do that, we simply run the following command from the same folder as the Dockerfile:
docker build . -t cluster-base
It is very important that the image name used here (cluster-base) matches the image name specified in the docker-compose.yml file.
Instead of launching and configuring each node independently, Docker Compose lets us declare the characteristics of all our nodes in one place: the image to use, the hostname to set, the network to join, and more or less anything you could otherwise express with Docker command lines.
Below is the YAML file I've used; I've added comments so you can understand what it does.
version: "3.3"
# Here we will list all nodes/containers!
services:
# First container will be named node-master
node-master:
# We will tell Docker to run the slaves first before running the Master Node
depends_on:
- node-slave1
- node-slave2
# The image to be used is the image we've built before
image: cluster-base:latest
# It explains itself, what it the container name
container_name: node-master
# The hostname for this container
hostname: node-master
# Connecting your terminal to the stdin and stdout stream of the container so you can see
# all the messages it outputs
tty: true
# Ports redirection to Host
ports:
- "8088:8088"
- "50070:50070"
- "50030:50030"
- "8080:8080"
- "8042:8042"
- "8888:8888"
# Linking this container to the following network and defining an alias to be used
networks:
cluster-network:
aliases:
- node-master
# First Slave Node, same thing as previously seen
node-slave1:
image: cluster-base:latest
container_name: node-slave1
hostname: node-slave1
tty: true
ports:
- "7042:8042"
networks:
cluster-network:
aliases:
- node-slave1
# Second Slave Node
node-slave2:
image: cluster-base:latest
container_name: node-slave2
hostname: node-slave2
tty: true
ports:
- "9042:8042"
networks:
cluster-network:
aliases:
- node-slave2
# Defining a new network of Bridge type to be created
networks:
cluster-network:
driver: bridge
ipam:
driver: default
To launch the cluster, all that's needed now is to run the command below from the same folder as the docker-compose.yml file:
docker-compose up
Once your cluster is up and running, you should see something like this in your shell:
You can even access the Yarn resource manager UI to see the slave nodes that are part of your cluster (localhost:8088/cluster/nodes).
And now you can access Jupyter Notebook at localhost:8888. You can use the starter notebooks and start playing around with Spark (Scala or Python) and Dask:
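If you want a quick smoke test before diving into the bundled notebooks, here is a minimal sketch of the kind of cell you could run from a Python notebook. It is not taken from the repo's notebooks: the app name is arbitrary, it assumes pyspark is importable from the kernel (e.g. installed through requirements.req), and the header/inferSchema options assume iris.csv has a header row.

from pyspark.sql import SparkSession

# spark-defaults.conf already sets spark.master=yarn and the client deploy mode,
# so a plain SparkSession is enough to get executors scheduled on the slaves
spark = SparkSession.builder.appName("smoke-test").getOrCreate()

# the datasets folder is baked into the image on every node, so a file:// path works
iris = spark.read.csv("file:///root/lab/datasets/iris.csv", header=True, inferSchema=True)
print(iris.count())

spark.stop()

For Dask, something along these lines should work, assuming dask-yarn is installed on the image (the repo ships a Dask-Yarn notebook, so it should be); the worker sizes below are illustrative and just have to fit inside the 1536 MB / 1 vcore per-node limits set in yarn-site.xml.

from dask_yarn import YarnCluster
from dask.distributed import Client
import dask.array as da

# run the scheduler locally in the notebook and deploy the workers as Yarn containers,
# reusing the system python3 that the image installs on every node
cluster = YarnCluster(environment="python:///usr/bin/python3",
                      deploy_mode="local", n_workers=2,
                      worker_vcores=1, worker_memory="512MiB")
client = Client(cluster)

# quick sanity check: a distributed mean over a random array
x = da.random.random((10000, 1000), chunks=(1000, 1000))
print(x.mean().compute())

client.close()
cluster.shutdown()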
To stop your cluster, use Ctrl + C and then run docker-compose down to completely remove the containers.
That's all folks! Don't hesitate to follow me on Twitter!