October 22, 2020

Dockerize PySpark

Navin Vembar

CTO, Camber


Last time, we discussed how to run PySpark on your machine. But, especially pre-v3.0, running PySpark requires some setup and modification of the execution environment. Let’s resolve that using Docker containers. This will make it easier to run PySpark programs locally - though, at least at the beginning, without the ability to apply distributed processing beyond one machine. It’s also the first step in being able to use a container to, well, contain your dependencies for deployment in an EMR cluster.

There are some Docker containers that support PySpark in Jupyter. But let’s build a container from scratch - in part because that’s safer than pulling arbitrary containers off the dangerous Internet, but also because customizing those containers will be useful as we work to use them in EMR later.

Let’s lay out the Dockerfile first.

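# First stage: download Spark and build the Python virtualenv with poetry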
FROM python:3.7 as builder

ARG SPARK_VERSION=2.4.7
ARG HADOOP_VERSION=2.7
ARG SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ARG SPARK_HOME=/usr/spark-${SPARK_VERSION}
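# Pull down the Spark-with-Hadoop tarball and unpack it into SPARK_HOME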
RUN curl -sL --retry 3 \
  "https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
  | tar xz -C /tmp/ \
 && mv /tmp/$SPARK_PACKAGE $SPARK_HOME \
 && chown -R root:root $SPARK_HOME

RUN pip install -U pip
RUN pip install poetry
RUN mkdir /home/spark
WORKDIR /home/spark

# Get the dependencies for the project locally
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true && poetry install --no-root

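# Second stage: the runtime image; Spark 2.4.7 runs on Java 8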
FROM openjdk:8

ARG SPARK_VERSION=2.4.7
ARG HADOOP_VERSION=2.7
ARG SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ARG SPARK_HOME=/usr/spark-${SPARK_VERSION}

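# Install Python 3.7 (just the interpreter, no recommended extras)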
ENV PYTHON_DEPENDENCY=python3.7
RUN apt-get update && \
    apt-get install -y --no-install-recommends ${PYTHON_DEPENDENCY} && \
    apt-get clean

ENV SPARK_HOME=/usr/spark-${SPARK_VERSION}
ENV PYTHONPATH=/home/spark/.venv

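# Copy the unpacked Spark distribution over from the first stage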
COPY --from=builder /usr/spark-${SPARK_VERSION}/ /usr/spark-${SPARK_VERSION}

RUN adduser --home /home/spark --disabled-password spark

USER spark
WORKDIR /home/spark

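# Copy the virtualenv built in the first stage, owned by the spark user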
COPY --from=builder --chown=spark:spark /home/spark/.venv /home/spark/.venv

# This deals with the fact that `apt-get install` up there installs
# python to /usr/bin/python3.7, which is different than the location
# of python3 in python:3.7-buster
RUN rm /home/spark/.venv/bin/python
RUN ln -s /usr/bin/python3.7 /home/spark/.venv/bin/python

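# Copy in the job and its test, put Spark and the venv scripts on the PATH, and run pytest by default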
COPY adding.py test_add.py ./
ENV PATH=${PYTHONPATH}/bin:${SPARK_HOME}/bin:${PATH}
CMD ["pytest"]

The first thing you’ll notice is that we split this into two stages. Installing pip in the Java container is (1) unnecessary for execution and (2) brings with it a lot of dependencies, which makes for a very large container. We need Java, Python, and Spark with Hadoop to run, so we’ll optimize towards that. But we need our Python dependencies in the resulting container, so let’s make sure to get those. Generally, we want to place any steps that rarely change - especially if they take a long time - as early in the build as possible. This avoids long rebuild waits when code is changing quickly.

The first stage of our build gets all our dependencies, and the second stage copies them into a container that can execute PySpark.

Following the instructions to download Spark with Hadoop, the first thing we do is pull down the tarball we need and unpack it. We do that first because that file is not going to change (unless we choose a different Spark version).

Then we update pip and install poetry into the container. The poetry config virtualenvs.in-project true command ensures that we know exactly where the Python virtual environment will be built - in the .venv directory where the project files are. The --no-root option means that poetry won’t try to install your own code into the virtualenv, so we don’t even copy our code over, just the pyproject.toml and poetry.lock files. Note that we do this last in the builder stage, so any change to the dependencies will not cause a redownload of Spark or another call to apt-get.

We know that Spark 2.4.7 uses Java 8, so our execution stage gets built off the OpenJDK 8 image. Note the redefinition of the ARG values, as they don’t cross build stages. Then we install Python 3.7 - note that this doesn’t come with a lot of other support dependencies, just the basics.

The COPY commands pull the Spark libraries and the Python dependencies over from the first stage. However, there is one thing that’s broken: apt-get installs Python 3.7 at /usr/bin/python3.7, while the python:3.7-buster image used in the builder puts it somewhere else, so the virtualenv’s python symlink points at a binary that doesn’t exist in this image. So, we do a bit of manual work to relink the right Python binary into the virtualenv. This makes the scripts in the virtualenv correctly use the /usr/bin/python3.7 binary installed through apt-get.

By adding $PYTHONPATH/bin and $SPARK_HOME/bin to the PATH environment variable, we ensure that Spark’s binaries and the virtualenv’s scripts (like pytest) are findable when our tests get run.
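As a reminder, adding.py and test_add.py are the small PySpark job and its pytest test from the previous post. If you don’t have them handy, a minimal stand-in pair might look something like this - a sketch only, with hypothetical names and values rather than the exact files from last time:

# adding.py - hypothetical stand-in for the file from the previous post
from pyspark.sql import functions as F


def add_columns(df):
    # Add a "total" column that sums the "a" and "b" columns
    return df.withColumn("total", F.col("a") + F.col("b"))

# test_add.py - hypothetical stand-in for the file from the previous post
from pyspark.sql import SparkSession

from adding import add_columns


def test_add_columns():
    # A single-threaded local session is plenty for a unit test
    spark = SparkSession.builder.master("local[1]").appName("test-add").getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
    totals = [row["total"] for row in add_columns(df).collect()]
    spark.stop()
    assert totals == [3, 7]

Whatever yours look like, the CMD ["pytest"] at the end of the Dockerfile is what picks up the test when the container starts.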

To execute all of this, run the following in the directory with adding.py, test_add.py, and our poetry files from the previous steps.

$ docker build -t dockerized-pyspark .
$ docker run dockerized-pyspark # Run pytest

You should see the same results as before!