
aws / sagemaker-spark-container


The SageMaker Spark Container is a Docker image used to run data processing workloads with the Spark framework on Amazon SageMaker.

License: Apache License 2.0

Makefile 3.46% Python 89.41% Shell 3.73% Java 1.73% Scala 1.67%

sagemaker-spark-container's Introduction

SageMaker Spark Container

Spark Overview

Apache Spark™ is a unified analytics engine for large-scale data processing. It provides high-level APIs in Scala, Java, Python, and R, and an optimized engine that supports general computation graphs for data analysis. It also supports a rich set of higher-level tools including Spark SQL for SQL and DataFrames, MLlib for machine learning, GraphX for graph processing, and Structured Streaming for stream processing.

SageMaker Spark Container

The SageMaker Spark Container is a Docker image used to run batch data processing workloads on Amazon SageMaker using the Apache Spark framework. The container images in this repository are used to build the pre-built container images that are used when running Spark jobs on Amazon SageMaker using the SageMaker Python SDK. The pre-built images are available in the Amazon Elastic Container Registry (Amazon ECR), and this repository serves as a reference for those wishing to build their own customized Spark containers for use in Amazon SageMaker.

For the list of available Spark images, see Available SageMaker Spark Images.

License

This project is licensed under the Apache-2.0 License.

Usage in the SageMaker Python SDK

The simplest way to get started with the SageMaker Spark Container is to use the pre-built images via the SageMaker Python SDK.

Amazon SageMaker Processing — sagemaker 2.5.3 documentation
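
Below is a minimal sketch of that path, assuming a SageMaker execution role and an S3 bucket already exist; the script name, role placeholder, and instance settings are illustrative, not prescriptive:

from sagemaker.spark.processing import PySparkProcessor

# Selecting a framework_version picks the matching pre-built Spark image.
spark_processor = PySparkProcessor(
    base_job_name="spark-preprocess-example",
    framework_version="3.1",
    role="<your SageMaker execution role ARN>",
    instance_count=2,
    instance_type="ml.m5.xlarge",
    max_runtime_in_seconds=1200,
)

# Submit a local PySpark script as the processing application.
spark_processor.run(
    submit_app="./preprocess.py",
    arguments=["--input", "s3://<bucket>/input/", "--output", "s3://<bucket>/output/"],
)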

Getting Started With Development

To get started building and testing the SageMaker Spark container, you will need to set up a local development environment.

See instructions in DEVELOPMENT.md

Contributing

To contribute to this project, please read through CONTRIBUTING.md

sagemaker-spark-container's People

Contributors

aapidinomm, amazon-auto, apacker, asumitamazon, can-sun, dependabot[bot], guoqiao1992, larroy, mahendruajay, spoorn, xiaoxshe


sagemaker-spark-container's Issues

Usage of pandas UDFs requires installing additional heavy libraries, which should reside in the Docker image

When using pandas UDFs, additional libraries need to be installed:
"pandas==0.24.2", "requests==2.21.0", "pyarrow==0.15.1", "pytz==2021.1", "six==1.15.0", "python-dateutil==2.8.1", "numpy==1.16.5"
Since these libraries are fairly heavy, this almost always forces you to build a new image instead of using the pre-built one from this repo (or to install them via pip in the container).
Pandas UDFs are the recommended kind of UDF, so including these libraries in the image would be preferable.
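
For context, here is a minimal pandas UDF sketch (Spark 3.x type-hint style, made-up data and column names) that fails with an ImportError on an image where pandas and pyarrow are not installed on every node:

import pandas as pd
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = SparkSession.builder.appName("pandas-udf-check").getOrCreate()
df = spark.createDataFrame([(1, 2.0), (1, 3.0), (2, 5.0)], ["group", "value"])

# Series-to-scalar pandas UDF; Spark uses Arrow to move data between the
# JVM and the Python workers, so pyarrow must be importable on each node.
@pandas_udf("double")
def mean_udf(v: pd.Series) -> float:
    return float(v.mean())

df.groupBy("group").agg(mean_udf(df["value"])).show()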

Support for spark 3.2

Spark >= 3.2 added the pandas API on Spark (covering roughly 90% of pandas), which makes it much easier for pandas users to adopt.

Run bootstrap script?

Is there a way to use the SageMaker PySparkProcessor class to execute a bootstrap script when a cluster launches? I'm currently trying to run a processing workload that uses pandas_udfs, and I'm seeing an ImportError when the cluster tries to use PyArrow:

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/spark_preprocess.py", line 35, in <module>
    @pandas_udf("float", PandasUDFType.GROUPED_AGG)
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/udf.py", line 47, in _create_udf
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 149, in require_minimum_pyarrow_version
ImportError: PyArrow >= 0.8.0 must be installed; however, it was not found.

I'm using the latest version of the SageMaker Python SDK, and I have tried using the submit_py_files parameter of PySparkProcessor.run() to submit a .zip file of all my dependencies. However, it doesn't seem to install them.

I know with EMR you can submit a bootstrap action script to install dependencies - is there a similar option here? Thanks!
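
As far as I understand, submit_py_files maps to spark-submit's --py-files: the archives are added to the Python path on the driver and executors, but nothing is pip-installed from them, which is why packages with native extensions such as pyarrow still fail to import. A hedged sketch of how the parameter is typically passed (file names are placeholders):

spark_processor.run(
    submit_app="./spark_preprocess.py",
    # Adds the archive to PYTHONPATH on driver and executors;
    # it does not install compiled dependencies like pyarrow.
    submit_py_files=["./dependencies.zip"],
)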

PyArrow and PySpark pandas support

Reading parquet files with PySpark and pandas is common, but the Pipfile does not include pandas and pyarrow, which are needed for reading parquet files and executing pandas_udfs.

missing aws-config folder

I am unable to build a Docker image using the provided Dockerfile.cpu files because the aws-config folder is missing; line 107 in the Dockerfile fails.

Python 3.8 and 3.9 Releases

It would be wonderful if images for Python 3.8 and 3.9 could be released as official AWS images.

The use case is maintaining the same Python version throughout a project or model build, as well as using new language features in 3.8/3.9 and picking up security updates for both of these Python versions.

It seems like a matter of minimally editing the Dockerfile and/or moving the Python version passed to yum install into a build ARG. Happy to submit a PR if needed.

PySparkProcessor Error reading parquet from S3 (may be version compatibility issue)

I'm getting the following error when submitting a processing job with PySparkProcessor running on a VPC subnet with a security group.

py4j.protocol.Py4JJavaError: An error occurred while calling o47.parquet.

Traceback (most recent call last):
  File "/opt/ml/processing/input/code/processing.py", line 100, in <module>
    main()
  File "/opt/ml/processing/input/code/processing.py", line 36, in main
    df = spark.read.parquet("s3://path-to-parquet/")
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/readwriter.py", line 316, in parquet
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/java_gateway.py", line 1257, in __call__
  File "/usr/lib/spark/python/lib/pyspark.zip/pyspark/sql/utils.py", line 63, in deco
  File "/usr/lib/spark/python/lib/py4j-0.10.7-src.zip/py4j/protocol.py", line 328, in get_return_value

spark_processor = PySparkProcessor(
    base_job_name="one-ckd-poc",
    framework_version="2.4",
    py_version="py37",
    container_version="1.3",
    role=role,
    instance_count=10,
    volume_size_in_gb=100,
    instance_type="ml.m5.2xlarge",
    max_runtime_in_seconds=1200,
    network_config=network,
)

spark_processor.run(
    submit_app="./processing.py",
    spark_event_logs_s3_uri="s3://{}/{}/spark_event_logs".format(bucket, s3_sparklog_prefix),
    logs=False,
)

Update to Java 11

Currently the images run in Java 1.8:

openjdk version "1.8.0_342"
OpenJDK Runtime Environment (build 1.8.0_342-b07)
OpenJDK 64-Bit Server VM (build 25.342-b07, mixed mode)

This prevents us from writing our jobs in Java 11.

Add support for scipy

I'm trying to use PySparkProcessor along with scientific packages such as scipy. Is there an easier way to install such dependencies, for example by passing a requirements.txt, or should they be installed as part of this code base?

sagemaker-spark-processing:3.1-cpu: update PyArrow version to >= 1.x

Hello team. It seems that the sagemaker-spark-processing:3.1-cpu image is not using an up-to-date version of PyArrow, which is required for pandas_udf functions. The following error is returned during execution of the Processing Job:

2023-02-21T20:06:56.503+01:00  ImportError: PyArrow >= 1.0.0 must be installed; however, it was not found.

Can you please update the Spark images? Thank you.

Failed to build on macOS: [Errno 32] Broken pipe

DEVELOPMENT.md does not mention that you must install the jq package to build this repository successfully, and it is not installed by default on macOS. As a result, I received the error below:

Fetching EC2 instance type info to ./spark/processing/3.0/py3/aws-config/ec2-instance-type-info.json ...
./scripts/fetch-ec2-instance-type-info.sh: line 32: jq: command not found
Exception ignored in: <_io.TextIOWrapper name='<stdout>' mode='w' encoding='utf-8'>
BrokenPipeError: [Errno 32] Broken pipe
make: *** [build-static-config] Error 127

To fix this on macOS, you can install jq using brew:

brew install jq

Release 3.1.1

Hello!

I saw that support for Spark 3.1.1 was recently added to the git repo, but a Docker/ECR release hasn't happened yet.
Is there a plan for this soon? If not, is the image in a state where a client can push it to their own ECR and run with it?

Thanks!

Older version of the python37-sagemaker-pyspark package in newer docker image

The current version of the python37-sagemaker-pyspark package in the 2.4 Docker image (906073651304.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-spark-processing:2.4-cpu) is 1.4.0. However, when I upgrade to the 3.0 Docker image (906073651304.dkr.ecr.eu-central-1.amazonaws.com/sagemaker-spark-processing:3.0-cpu), the version is 1.3.0.

Why is the version of the python37-sagemaker-pyspark package older in the newer docker image?

Remove `/var/cache/yum/` and merge steps to minimize size of container

After looking through the sagemaker-spark-processing:3.1-cpu-py37-v1.1 image with the dive tool, I noticed that installation cache files were not being cleaned up, leading to an unnecessary increase in image size.

In particular, line 13 has no effect because the layer that installs the yum packages on line 6 is already immutable at that point. This leaves around 30-40% of the image size allocated to caches:

[screenshot: dive output before cleanup]

By moving the cleanup to the end of the same layer definition, the cleanup actually takes effect, significantly reducing the size of the image:

[screenshot: dive output after cleanup]

By merging a couple of layers and doing the cleanup there, we are able to shrink the image from 4.4 GB to 2.5 GB, which should lead to faster container spin-up. The changes are available in my fork; if the maintainers agree, I can open a PR with these changes for the Dockerfiles that would benefit.

s3-us-east-1.amazonaws.com’s server IP address could not be found.

While building the Docker image, it needs to configure an EMR yum repo and install packages from that repo.

When I set REGION to us-east-1, docker build cannot resolve the yum repo.

I found that this is caused by the following line:

gpgkey = https://s3-REGION.amazonaws.com/repo.REGION.emr.amazonaws.com/apps-repository/emr-6.3.0/184d7755-d3a2-4c5c-9e1f-c72d4f2b33f1/repoPublicKey.txt

Specifically, when REGION=us-east-1, this URL cannot be resolved. I confirmed this by trying to visit the URL in my browser, where I got "s3-us-east-1.amazonaws.com’s server IP address could not be found."

From the S3 endpoint documentation, it seems that the format of the URL should be https://s3.REGION.amazonaws.com/ .... Is my understanding right?

Error in init

When I run make build, I get the following error:

cp {Pipfile,Pipfile.lock,setup.py} ./spark/processing/3.0/py3
cp: cannot stat '{Pipfile,Pipfile.lock,setup.py}': No such file or directory
make: *** [Makefile:38: init] Error 1

Any idea what I might be doing wrong?

Memory issues on algo-1

I noticed that the algo-1 host was using a lot more memory than the other nodes. At some point, algo-1 was running out of memory and eventually crashing.

[screenshot: per-node memory usage]

I figured that algo-1 was the driver, so I tried to dedicate one instance to the driver by reducing the number of executors from 12 to 11, knowing that the job runs on 12 nodes.

The override goes in /opt/ml/processing/input/conf/configuration.json (11 executors = 12 nodes minus 1 for the driver):

[{
    "Classification": "spark-defaults",
    "Properties": { "spark.executor.instances": "11" }
}]

Sometimes it works as intended: the driver does some work at the beginning and then just checks on the worker nodes. It's a bit of a waste, since that node doesn't do any executor work, but it works.

[screenshot: driver isolated on its own node]

However, sometimes Spark prefers to leave one node idle and put both the driver and an executor on the same node, as before. This defeats the purpose of my configuration. (Here we had 10 nodes instead of 12.)

[screenshot: one idle node, driver and executor co-located]

Do you have any idea how this issue could be solved?
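
As a hedged aside (not necessarily a fix for the placement issue): the same spark-defaults override can also be passed through the configuration argument of PySparkProcessor.run(), which the SDK should end up writing to /opt/ml/processing/input/conf/configuration.json inside the container. In this sketch, spark_processor is an already-constructed PySparkProcessor and the script name is a placeholder:

spark_processor.run(
    submit_app="./processing.py",
    configuration=[
        {
            "Classification": "spark-defaults",
            # 12 nodes minus 1 reserved for the driver
            "Properties": {"spark.executor.instances": "11"},
        }
    ],
)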
