Last time, we discussed how to run PySpark on your machine. But, especially pre-v3.0, running PySpark requires some setup and modification of the execution environment. Let’s resolve that using Docker containers. This will make it easier to run PySpark programs locally - though, at least at the beginning, limited to processing on a single machine rather than a distributed cluster. It’s also the first step in being able to use a container to, well, contain your dependencies for deployment in an EMR cluster.
There are some Docker containers that support PySpark in Jupyter. But let’s build a container from scratch - in part because that’s safer to do rather than using arbitrary containers you find on the dangerous Internet, but also because customizing those containers will be useful as we work to use them in EMR later.
Let’s lay out the Dockerfile first.
FROM python:3.7 as builder
ARG SPARK_VERSION=2.4.7
ARG HADOOP_VERSION=2.7
ARG SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ARG SPARK_HOME=/usr/spark-${SPARK_VERSION}
RUN curl -sL --retry 3 \
"https://archive.apache.org/dist/spark/spark-${SPARK_VERSION}/${SPARK_PACKAGE}.tgz" \
| tar xz -C /tmp/ \
&& mv /tmp/$SPARK_PACKAGE $SPARK_HOME \
&& chown -R root:root $SPARK_HOME
RUN pip install -U pip
RUN pip install poetry
RUN mkdir /home/spark
WORKDIR /home/spark
# Get the dependencies for the project locally
COPY pyproject.toml poetry.lock ./
RUN poetry config virtualenvs.in-project true && poetry install --no-root
FROM openjdk:8
ARG SPARK_VERSION=2.4.7
ARG HADOOP_VERSION=2.7
ARG SPARK_PACKAGE=spark-${SPARK_VERSION}-bin-hadoop${HADOOP_VERSION}
ARG SPARK_HOME=/usr/spark-${SPARK_VERSION}
ENV PYTHON_DEPENDENCY=python3.7
RUN apt-get update && \
apt-get install -y --no-install-recommends ${PYTHON_DEPENDENCY} && \
apt-get clean
ENV SPARK_HOME=/usr/spark-${SPARK_VERSION}
ENV PYTHONPATH=/home/spark/.venv
COPY --from=builder /usr/spark-${SPARK_VERSION}/ /usr/spark-${SPARK_VERSION}
RUN adduser --home /home/spark --disabled-password spark
USER spark
WORKDIR /home/spark
COPY --from=builder --chown=spark:spark /home/spark/.venv /home/spark/.venv
# This deals with the fact that `apt-get install` up there installs
# python to /usr/bin/python3.7, which is different than the location
# of python3 in python:3.7-buster
RUN rm /home/spark/.venv/bin/python
RUN ln -s /usr/bin/python3.7 /home/spark/.venv/bin/python
COPY adding.py test_add.py ./
ENV PATH=${PYTHONPATH}/bin:${SPARK_HOME}/bin:${PATH}
CMD ["pytest"]
The first thing you’ll notice is that we split this into two stages. Installing pip in the Java container is (1) unnecessary for execution and (2) brings with it a lot of dependencies, which makes for a very large container. We need Java, Python, and Spark with Hadoop to run, so we’ll optimize toward that. But we need our Python dependencies in the resulting container, so let’s make sure to get those. Generally, we want to push any steps that will rarely change - especially if they take a lot of time - as early in the build as possible. This avoids long waits for rebuilds when code is changing quickly.
The first stage of our build gets all our dependencies, and the second stage copies them over into a container that can execute PySpark.
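If you want to see what the split buys you, Docker lets you build just the first stage with the --target flag and compare image sizes once everything is built later on (the builder tag here is just a name I made up):

$ docker build --target builder -t dockerized-pyspark-builder .
$ docker images | grep dockerized-pyspark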
Following the instructions to download Spark with Hadoop, the first thing we do is pull down the tarball we need and unpack it. We do that first because that file is not going to change (unless we choose a different Spark version).
Then we update pip and install poetry into the container. The poetry config virtualenvs.in-project true command ensures that we know exactly where the Python virtual environment will be built - in the .venv directory alongside the project files. The --no-root option means that poetry won’t try to install our own code into the virtualenv, so we don’t even copy our code over, just the pyproject.toml and poetry.lock files. Note that we do this last in the first stage, so any changes to dependencies will not cause a redownload of Spark or another call to apt-get.
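If you don’t have the pyproject.toml and poetry.lock from the previous post handy, something like the following will generate equivalent files - the exact versions are assumptions on my part, so pin whatever you’re actually targeting (and feel free to put pytest in a dev group instead):

$ poetry init --no-interaction --name dockerized-pyspark
$ poetry add pyspark==2.4.7 pytest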
We know that Spark 2.4.7 uses Java 8, so our execution stage gets built off the OpenJDK 8 image. Note the redefinition of the ARG values, as they don’t carry across build stages. Then we install Python 3.7 - note that this doesn’t come with a lot of other supporting dependencies, just the basics.
The COPY commands pull over the Spark libraries and the Python dependencies from the first stage. However, there is one thing that’s broken: the Python 3.7 installed via apt-get lives at a different path than the Python in the python:3.7-buster image where the virtualenv was built. So, we do a bit of manual work to relink the right Python binary into the virtualenv. This makes the scripts in the virtualenv correctly use the /usr/bin/python3.7 binary installed through apt-get.
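Once the image is built (the build command is below), you can sanity-check the relink by asking the container where the virtualenv’s python points - it should come back as /usr/bin/python3.7:

$ docker run --rm dockerized-pyspark readlink /home/spark/.venv/bin/python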
Adding $PYTHONPATH/bin and $SPARK_HOME/bin to the PATH environment variable ensures that the Spark binaries (and the virtualenv’s pytest) are findable when our tests get run.
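As a reminder, adding.py and test_add.py come from the previous post. If you don’t have them handy, a minimal stand-in might look like this - the add_columns function and the test data are placeholders of my own, not necessarily the originals:

# adding.py - hypothetical stand-in for the module from the previous post
from pyspark.sql import DataFrame, functions as F

def add_columns(df: DataFrame, a: str, b: str, out: str = "total") -> DataFrame:
    """Return df with a new column holding the sum of columns a and b."""
    return df.withColumn(out, F.col(a) + F.col(b))

# test_add.py - hypothetical pytest module exercising add_columns on a local SparkSession
from pyspark.sql import SparkSession
from adding import add_columns

def test_add_columns():
    spark = SparkSession.builder.master("local[1]").appName("test-add").getOrCreate()
    df = spark.createDataFrame([(1, 2), (3, 4)], ["a", "b"])
    result = add_columns(df, "a", "b")
    assert [row.total for row in result.collect()] == [3, 7]
    spark.stop()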
To execute all of this, run the following in the directory with adding.py, test_add.py, and our poetry files from the previous steps.
$ docker build -t dockerized-pyspark .
$ docker run dockerized-pyspark # Run pytest
You should see the same results as before!
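You can also use the image interactively by overriding the default pytest command - for example, to get a PySpark shell or poke around the filesystem (these invocations are just suggestions):

$ docker run -it --rm dockerized-pyspark pyspark
$ docker run -it --rm dockerized-pyspark bash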