
aws / sagemaker-spark-container


The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.

License: Apache License 2.0

Makefile 3.46% Python 89.41% Shell 3.73% Java 1.73% Scala 1.67%

sagemaker-spark-container's Introduction

SageMaker Spark Container

Spark Overview

Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

SageMaker Spark Container

The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.

For the list of available Spark images, see Available SageMaker Spark Images.

License

This project is licensed under the Apache-2.0 License.

Usage in the SageMaker Python SDK

The simplest way to get started with the SageMaker Spark Container is to use the pre-built images via the SageMaker Python SDK.

Amazon SageMaker Processing — sagemaker 2.5.3 documentation
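
Below is a minimal sketch of that path, assuming a SageMaker execution role and an S3 bucket already exist; the script name, role placeholder, and instance settings are illustrative, not prescriptive:

from sagemaker.spark.processing import PySparkProcessor

# Selecting a framework_version picks the matching pre-built Spark image.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess-example",
    framework_version="3.1",
    role="<your SageMaker execution role ARN>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Submit a local PySpark script as the processing application.
spark_processor.run(
    submit_app="./preprocess.py",
    arguments=["--input", "s3://<bucket>/input/", "--output", "s3://<bucket>/output/"],
)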

Getting Started With Development

To get started building and testing the SageMaker Spark container, you will need to set up a local development environment.

See instructions in DEVELOPMENT.md

Contributing

To contribute to this project, please read through CONTRIBUTING.md

sagemaker-spark-container's People

Contributors

aapidinomm, amazon-auto, apacker, asumitamazon, can-sun, dependabot[bot], guoqiao1992, larroy, mahendruajay, spoorn, xiaoxshe


sagemaker-spark-container's Issues

Usage of pandas UDFs requires installing additional heavy libraries, which should reside in the Docker image

When using pandas UDFs, additional libraries need to be installed:
"pandas==0.24.2", "requests==2.21.0", "pyarrow==0.15.1", "pytz==2021.1", "six==1.15.0", "python-dateutil==2.8.1", "numpy==1.16.5"
Since these libraries are fairly heavy, this almost always forces you to build a new image instead of using the pre-built one from this repo (or to install them via pip in the container).
Pandas UDFs are the recommended kind of UDF, so including these libraries in the image would be preferable.
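
For context, here is a minimal pandas UDF sketch (Spark 3.x type-hint style, made-up data and column names) that fails with an ImportError on an image where pandas and pyarrow are not installed on every node:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-check").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["group", "value"])

# Series-to-scalar pandas UDF; Spark uses Arrow to move data between the
# JVM and the Python workers, so pyarrow must be importable on each node.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return float(v.mean())

df.groupBy("group").agg(mean_udf(df["value"])).show()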

Support for spark 3.2

Spark >= 3.2 added the pandas API on Spark (covering roughly 90% of pandas), which makes it much easier for pandas users to adopt.

Run bootstrap script?

Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I'm seeing an ImportError when the cluster tries to use PyArrow:

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
    @pandas_udf("float", PandasUDFType.GROUPED_AGG)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

I'm using the latest version of the SageMaker Python SDK, and I have tried using the submit_py_files parameter of PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to install them.

I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
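
As far as I understand, submit_py_files maps to spark-submit's --py-files: the archives are added to the Python path on the driver and executors, but nothing is pip-installed from them, which is why packages with native extensions such as pyarrow still fail to import. A hedged sketch of how the parameter is typically passed (file names are placeholders):

spark_processor.run(
    submit_app="./spark_preprocess.py",
    # Adds the archive to PYTHONPATH on driver and executors;
    # it does not install compiled dependencies like pyarrow.
    submit_py_files=["./dependencies.zip"],
)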

PyArrow and PySpark pandas support

Reading parquet files with PySpark and pandas is common, but the Pipfile does not include pandas and pyarrow, which are needed for reading parquet files and executing pandas_udfs.

missing aws-config folder

I am unable to build a Docker image using the provided Dockerfile.cpu files because the aws-config folder is missing; line 107 in the Dockerfile fails.

Python 3.8 and 3.9 Releases

It would be wonderful if images for Python 3.8 and 3.9 could be released as official AWS images.

The use case is maintaining the same Python version throughout a project or model build, as well as using new language features in 3.8/3.9 and picking up security updates for both of these Python versions.

It seems like a matter of minimally editing the Dockerfile and/or moving the Python version passed to yum install into a build ARG. Happy to submit a PR if needed.

PySparkProcessor Error reading parquet from S3 (may be version compatibility issue)

I'm getting the following error when submitting a processing job with PySparkProcessor running on a VPC subnet with a security group.

py4j.protocol.Py4JJavaError: An error occurred while calling o47.parquet.

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/processing.py", line 100, in <module>
    main()
  File "/opt/ml/processing/input/code/processing.py", line 36, in main
    df = spark.read.parquet("s3://path-to-parquet/")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 316, in parquet
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value

spark_processor = PySparkProcessor(
    base_job_name="one-ckd-poc",
    framework_version="2.4",
    py_version="py37",
    container_version="1.3",
    role=role,
    instance_count=10,
    volume_size_in_gb=100,
    instance_type="ml.m5.2xlarge",
    max_runtime_in_seconds=1200,
    network_config=network,
)

spark_processor.run(
    submit_app="./processing.py",
    spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, s3_sparklog_prefix),
    logs=False,
)

Update to Java 11

Currently the images run in Java 1.8:

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)

This prevents us from writing our jobs in Java 11.

Add support for scipy

I'm trying to use PySparkProcessor along with scientific packages such as scipy. Is there an easier way to install such dependencies, for example by passing a requirements.txt, or should they be installed as part of this code base?

sagemaker-spark-processing:3.1-cpu: update PyArrow version to >= 1.x

Hello team. It seems that the sagemaker-spark-processing:3.1-cpu image is not using an up-to-date version of PyArrow, which is required for pandas_udf functions. The following error is returned during execution of the Processing Job:

2023-02-21T20:06:56.503+01:00  ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.

Can you please update the Spark images? Thank you.

Failed to build on macOS: [Errno 32] Broken pipe

DEVELOPMENT.md does not mention that you must install the jq package to build this repository successfully, and it is not installed by default on macOS. As a result, I received the error below:

Fetching EC2 instance type info to ./spark/processing/3.0/py3/aws-config/ec2-instance-type-info.json ...
./scripts/fetch-ec2-instance-type-info.sh: line 32: jq: command not found
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
make: *** [build-static-config] Error 127

To fix this on macOS, you can install jq using brew:

brew install jq

Release 3.1.1

Hello!

I saw that support for Spark 3.1.1 was recently added to the git repo, but a Docker/ECR release hasn't happened yet.
Is there a plan for this soon? If not, is the image in a state where a client can push it to their own ECR and run with it?

Thanks!

Older version of the python37-sagemaker-pyspark package in newer docker image

The current version of the python37-sagemaker-pyspark package in the 2.4 Docker image (906073651304.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-spark-processing:2.4-cpu) is 1.4.0. However, when I upgrade to the 3.0 Docker image (906073651304.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-spark-processing:3.0-cpu), the version is 1.3.0.

Why is the version of the python37-sagemaker-pyspark package older in the newer docker image?

Remove `/var/cache/yum/` and merge steps to minimize size of container

After looking through the sagemaker-spark-processing:3.1-cpu-py37-v1.1 image with the dive tool, I noticed that installation cache files were not being cleaned up, leading to an unnecessary increase in image size.

In particular, line 13 has no effect because the layer that installs the yum packages on line 6 is already immutable at that point. This leaves around 30-40% of the image size allocated to caches:

[screenshot: dive output before cleanup]

By moving the cleanup to the end of the same layer definition, the cleanup actually takes effect, significantly reducing the size of the image:

[screenshot: dive output after cleanup]

By merging a couple of layers and doing the cleanup there, we are able to shrink the image from 4.4 GB to 2.5 GB, which should lead to faster container spin-up. The changes are available in my fork; if the maintainers agree, I can open a PR with these changes for the Dockerfiles that would benefit.

s3-us-east-1.amazonaws.com’s server IP address could not be found.

While building the Docker image, it needs to configure an EMR yum repo and install packages from that repo.

When I set REGION to us-east-1, docker build cannot resolve the yum repo.

I found that this is caused by the following line:

gpgkey = https://s3-REGION.amazonaws.com/repo.REGION.emr.amazonaws.com/apps-repository/emr-6.3.0/184d7755-d3a2-4c5c-9e1f-c72d4f2b33f1/repoPublicKey.txt

Specifically, when REGION=us-east-1, this URL cannot be resolved. I confirmed this by trying to visit the URL in my browser, where I got "s3-us-east-1.amazonaws.com’s server IP address could not be found."

From the S3 endpoint documentation, it seems that the format of the URL should be https://s3.REGION.amazonaws.com/ .... Is my understanding right?

Error in init

When I run make build, I get the following error:

cp {Pipfile,Pipfile.lock,setup.py} ./spark/processing/3.0/py3
cp: cannot stat '{Pipfile,Pipfile.lock,setup.py}': No such file or directory
make: *** [Makefile:38: init] Error 1

Any idea what I might be doing wrong?

Memory issues on algo-1

I noticed that the algo-1 host was using a lot more memory than the other nodes. At some point, algo-1 was running out of memory and eventually crashing.

[screenshot: per-node memory usage]

I figured that algo-1 was the driver, so I tried to dedicate one instance to the driver by reducing the number of executors from 12 to 11, knowing that the job runs on 12 nodes.

The override goes in /opt/ml/processing/input/conf/configuration.json (11 executors = 12 nodes minus 1 for the driver):

[{
    "Classification": "spark-defaults",
    "Properties": { "spark.executor.instances": "11" }
}]

Sometimes it works as intended: the driver does some work at the beginning and then just checks on the worker nodes. It's a bit of a waste, since that node doesn't do any executor work, but it works.

[screenshot: driver isolated on its own node]

However, sometimes Spark prefers to leave one node idle and put both the driver and an executor on the same node, as before. This defeats the purpose of my configuration. (Here we had 10 nodes instead of 12.)

[screenshot: one idle node, driver and executor co-located]

Do you have any idea how this issue could be solved?
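
As a hedged aside (not necessarily a fix for the placement issue): the same spark-defaults override can also be passed through the configuration argument of PySparkProcessor.run(), which the SDK should end up writing to /opt/ml/processing/input/conf/configuration.json inside the container. In this sketch, spark_processor is an already-constructed PySparkProcessor and the script name is a placeholder:

spark_processor.run(
    submit_app="./processing.py",
    configuration=[
        {
            "Classification": "spark-defaults",
            # 12 nodes minus 1 reserved for the driver
            "Properties": {"spark.executor.instances": "11"},
        }
    ],
)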
