Dockerization of a pytest environment for tests that use Spark. See the sample code in the example/ directory and the workflow in .github/workflows/ci.yml for a bare-bones use case.
1.) Create a Dockerfile within the repo under test:

```dockerfile
FROM foundryai/spark-pytest:latest
WORKDIR /usr/src/app
ADD requirements.txt /usr/src/app
ADD requirements-dev.txt /usr/src/app
RUN pip install -r requirements.txt
RUN pip install -r requirements-dev.txt
COPY . /usr/src/app
CMD ["pytest"]
```
2.) Build the latest Docker image of the source code. (See the docker-build target in example/Makefile.)
3.) Run the tests. (See the spark-pytest target in example/Makefile.)
(Note: running make spark-pytest in the example/ directory will complete steps 2 and 3 at once.)
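Steps 2 and 3 boil down to a docker build followed by a docker run. The sketch below shows the equivalent commands driven from Python; the image tag my-project-tests and the helper names are assumptions for illustration, not taken from example/Makefile:

```python
import subprocess

IMAGE_TAG = "my-project-tests"  # illustrative tag, not from example/Makefile


def docker_commands(tag=IMAGE_TAG):
    """Return the build and run commands for steps 2 and 3."""
    return [
        ["docker", "build", "-t", tag, "."],  # step 2: build the image
        ["docker", "run", "--rm", tag],       # step 3: run pytest inside it
    ]


def build_and_test(tag=IMAGE_TAG):
    for cmd in docker_commands(tag):
        subprocess.run(cmd, check=True)  # stop at the first failing step
```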
- Install Maven by running the brew install maven command. If you are not running on macOS, see the manual install steps below.
Complete these steps to prepare for local Python development:
- Install Apache Maven from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-common/apache-maven-3.6.0-bin.tar.gz
- Install the Apache Spark distribution from the following location: https://aws-glue-etl-artifacts.s3.amazonaws.com/glue-1.0/spark-2.4.3-bin-hadoop2.8.tgz
- Export the SPARK_HOME environment variable, setting it to the root location extracted from the Spark archive. For example:

    export SPARK_HOME=/home/$USER/spark-2.4.3-bin-spark-2.4.3-bin-hadoop2.8
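After exporting the variable, a quick sanity check confirms that the path actually points at a Spark distribution before any tests run. A minimal sketch; the helper name check_spark_home is illustrative, and it assumes only that the archive was extracted intact:

```python
import os


def check_spark_home(path=None):
    """Return True if `path` (or $SPARK_HOME) looks like a Spark root."""
    root = path or os.environ.get("SPARK_HOME")
    if not root:
        return False
    # Every Spark distribution ships bin/spark-submit at its root.
    return os.path.isfile(os.path.join(root, "bin", "spark-submit"))
```

Running check_spark_home() right after export catches a mistyped path early, instead of failing later with a confusing JVM launcher error.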
| Utility | Command | Description |
|---|---|---|
| Pytest | ./bin/pytest | Write and run unit tests of your Python code. The pytest module must be installed and available in the PATH. For more information, see the pytest documentation. |
For more information, see Developing and Testing ETL Scripts Locally Using the AWS Glue ETL Library in the AWS Glue documentation.