aws-samples / amazon-sagemaker-mlflow-fargate

Managing your machine learning lifecycle with MLflow and Amazon SageMaker

License: MIT No Attribution

Languages: Dockerfile 1.21%, Jupyter Notebook 55.37%, Python 41.37%, Shell 2.05%

amazon-sagemaker-mlflow-fargate's Introduction

Manage your machine learning lifecycle with MLflow and Amazon SageMaker

Overview

In this repository we show how to deploy MLflow on AWS Fargate and how to use it during your ML project with Amazon SageMaker. You will use Amazon SageMaker to develop, train, tune and deploy a Scikit-Learn based ML model (Random Forest) and track experiment runs and models with MLflow.

This implementation shows how to do the following:

  • Host a serverless MLflow server on AWS Fargate, with S3 as the artifact store and RDS as the backend store
  • Track experiment runs running on SageMaker with MLflow
  • Register models trained in SageMaker in the MLflow model registry
  • Deploy an MLflow model into a SageMaker endpoint

MLflow tracking server

You can set up a central MLflow tracking server during your ML project. By using this remote MLflow server, data scientists can manage experiments and models collaboratively. An MLflow tracking server has two storage components: a backend store and an artifact store. This implementation uses an Amazon S3 bucket as the artifact store and an Amazon RDS for MySQL instance as the backend store.
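As a rough illustration of how the two stores fit together, the sketch below shows how a container entry point could launch the tracking server against an RDS MySQL backend store and an S3 artifact store. This is a minimal sketch only: the environment variable names are assumptions, not the ones this stack's ECS task definition actually injects (see container/Dockerfile and app.py for those).

# Illustrative launcher for an MLflow tracking server backed by RDS MySQL and S3.
# The environment variable names below are hypothetical placeholders.
import os

db_user = os.environ.get("MLFLOW_DB_USER", "mlflow")
db_password = os.environ.get("MLFLOW_DB_PASSWORD", "")
db_host = os.environ.get("MLFLOW_DB_HOST", "localhost")
db_name = os.environ.get("MLFLOW_DB_NAME", "mlflowdb")
artifact_bucket = os.environ.get("MLFLOW_ARTIFACT_BUCKET", "my-mlflow-artifacts")

backend_store_uri = f"mysql+pymysql://{db_user}:{db_password}@{db_host}:3306/{db_name}"
default_artifact_root = f"s3://{artifact_bucket}/"

# Replace the current process with the MLflow server.
os.execvp("mlflow", [
    "mlflow", "server",
    "--host", "0.0.0.0",
    "--port", "5000",
    "--backend-store-uri", backend_store_uri,
    "--default-artifact-root", default_artifact_root,
])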

Prerequisites

We will use the AWS CDK to deploy the MLflow server.

To go through this example, make sure you have the following:

  • An AWS account where the service will be deployed
  • AWS CDK installed and configured. Make sure to have the credentials and permissions to deploy the stack into your account
  • Docker to build and push the MLflow container image to ECR
  • This GitHub repository cloned into your environment to follow the steps

Deploying the stack

You can view the CDK stack details in app.py. Execute the following commands to install CDK and make sure you have the right dependencies:

npm install -g aws-cdk
python3 -m venv .venv
source .venv/bin/activate
pip3 install -r requirements.txt

Once this is installed, you can execute the following commands to deploy the inference service into your account:

ACCOUNT_ID=$(aws sts get-caller-identity --query Account | tr -d '"')
AWS_REGION=$(aws configure get region)
cdk bootstrap aws://${ACCOUNT_ID}/${AWS_REGION}
cdk deploy --parameters ProjectName=mlflow --require-approval never

The first 2 commands will get your account ID and current AWS region using the AWS CLI on your computer. cdk bootstrap and cdk deploy will build the container image locally, push it to ECR, and deploy the stack.

The stack will take a few minutes to launch the MLflow server on AWS Fargate, with an S3 bucket and a MySQL database on RDS. You can then use the load balancer URI present in the stack outputs to access the MLflow UI.
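If you prefer to retrieve the URI programmatically rather than from the CloudFormation console, something along the lines of the sketch below should work. The stack name matches the DeploymentStack name that appears in the deploy output, while the output key is an assumption; check your own stack outputs for the exact key.

# Hedged sketch: read the load balancer URI from the CloudFormation stack outputs.
# The output key filter ("LoadBalancer") is a guess; check the `cdk deploy` output
# or the CloudFormation console for the real key name.
import boto3

cfn = boto3.client("cloudformation")
outputs = cfn.describe_stacks(StackName="DeploymentStack")["Stacks"][0]["Outputs"]
lb_uri = next(o["OutputValue"] for o in outputs if "LoadBalancer" in o["OutputKey"])
print(f"MLflow UI: http://{lb_uri}")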

N.B.: In this illustrative example stack, the load balancer is launched on a public subnet and is internet-facing. For security purposes, you may want to provision an internal load balancer in your VPC private subnets, with no direct connectivity from the outside world. Here is a blog post explaining how to achieve this: Access Private applications on AWS Fargate using Amazon API Gateway PrivateLink

Managing an ML lifecycle with Amazon SageMaker and MLflow

You now have a remote MLflow tracking server running, accessible through a REST API via the load balancer URI. You can use the MLflow Tracking API to log parameters, metrics, and models when running your machine learning project with Amazon SageMaker. For this, you will need to install the MLflow library when running your code on Amazon SageMaker and set the remote tracking URI to your load balancer address.

The following Python call points your code executing on SageMaker to your remote MLflow server:

import mlflow
mlflow.set_tracking_uri('<YOUR LOAD BALANCER URI>')

Connect to your notebook instance and set the remote tracking URI.
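For example, a notebook cell or training script logging to the remote server could look roughly like the sketch below. The experiment name matches the one used in the lab (boston-house), the registered model name is a placeholder, and the synthetic dataset is only there to keep the snippet self-contained; the lab itself trains on the Boston Housing data.

# Minimal sketch of logging a run, metrics, and a model to the remote tracking server.
import mlflow
import mlflow.sklearn
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

mlflow.set_tracking_uri("<YOUR LOAD BALANCER URI>")
mlflow.set_experiment("boston-house")

# Stand-in data so the snippet runs anywhere; replace with your own dataset.
X, y = make_regression(n_samples=500, n_features=8, noise=0.1, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

params = {"n_estimators": 100, "max_depth": 6}

with mlflow.start_run():
    model = RandomForestRegressor(**params).fit(X_train, y_train)
    mlflow.log_params(params)
    mlflow.log_metric("test_r2", model.score(X_test, y_test))
    # Passing registered_model_name also registers the model in the MLflow model registry.
    mlflow.sklearn.log_model(model, artifact_path="model",
                             registered_model_name="my-random-forest")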

Running an example lab

This lab describes how to develop, train, tune, and deploy a Random Forest model using Scikit-learn with the SageMaker Python SDK. We use the Boston Housing dataset, included in Scikit-learn, and log our machine learning runs into MLflow. You can find the original lab in the SageMaker Examples repository for more details on using custom Scikit-learn scripts with Amazon SageMaker.
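A common pattern for wiring this up (and, as far as we can tell, the approach taken in the lab notebooks) is to pass the tracking URI and experiment name to the training job as SageMaker hyperparameters, so the training script can call mlflow.set_tracking_uri with them. The sketch below assumes an entry-point script named train.py that parses those hyperparameters; the role, framework version, and S3 paths are placeholders.

# Sketch: launch the Scikit-learn training job and hand it the MLflow tracking URI.
# Entry point, framework_version, hyperparameter names, and S3 paths are assumptions
# and must match what your training script expects.
import sagemaker
from sagemaker.sklearn.estimator import SKLearn

role = sagemaker.get_execution_role()  # assumes this runs on a SageMaker notebook

estimator = SKLearn(
    entry_point="train.py",
    role=role,
    instance_type="ml.m5.xlarge",
    instance_count=1,
    framework_version="1.0-1",
    hyperparameters={
        "tracking_uri": "<YOUR LOAD BALANCER URI>",
        "experiment_name": "boston-house",
        "n-estimators": 100,
    },
)

estimator.fit({"train": "s3://<YOUR-BUCKET>/train", "test": "s3://<YOUR-BUCKET>/test"})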

Follow the step-by-step guide by executing the notebooks in the following folders:

  • lab/1_track_experiments.ipynb
  • lab/2_track_experiments_hpo.ipynb
  • lab/3_deploy_model.ipynb (see the deployment sketch after this list)
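For the deployment step covered in lab 3, the outline below shows how a registered model can be pushed to a SageMaker endpoint with MLflow's deployments API. The endpoint name, role ARN, region, and model version are placeholders, the image URI refers to the mlflow-pyfunc image produced by mlflow sagemaker build-and-push-container, and the exact config keys can differ between MLflow versions, so treat this as a sketch rather than the lab's exact code.

# Hedged sketch: deploy a model from the MLflow model registry to a SageMaker endpoint.
# Placeholders throughout; config keys follow the MLflow 2.x deployments API and may
# vary slightly across versions.
import mlflow
from mlflow.deployments import get_deploy_client

mlflow.set_tracking_uri("<YOUR LOAD BALANCER URI>")

client = get_deploy_client("sagemaker")
client.create_deployment(
    name="boston-house-endpoint",
    model_uri="models:/my-random-forest/1",  # registry URI: models:/<name>/<version>
    config={
        "execution_role_arn": "<YOUR SAGEMAKER EXECUTION ROLE ARN>",
        "image_url": "<ACCOUNT_ID>.dkr.ecr.<REGION>.amazonaws.com/mlflow-pyfunc:<TAG>",
        "region_name": "<REGION>",
        "instance_type": "ml.m5.xlarge",
        "instance_count": 1,
    },
)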

Current limitation on user access control

The open source version of MLflow does not currently provide user access control features if you have multiple tenants on your MLflow server. This means any user with access to the MLflow server can modify experiments, model versions, and stages. This can be a challenge for enterprises in regulated industries that need to maintain strong model governance for audit purposes.

Security

See CONTRIBUTING for more information.

License

This library is licensed under the MIT-0 License. See the LICENSE file.

amazon-sagemaker-mlflow-fargate's People

Contributors

amazon-auto, cmosh, dependabot[bot], sofianhamiti

amazon-sagemaker-mlflow-fargate's Issues

mysql version not available causes cdk deploy to fail

In app.py on line 120, the MySQL version is set to 8.0.26, which is not available. This causes the command cdk deploy --parameters ProjectName=mlflow --require-approval never to fail.

This may just be my region, but I thought I would let you know anyway. Changing it to 8.0.33 fixes it.
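For reference, in a CDK stack the MySQL engine version is typically pinned along the lines of the generic sketch below (written against CDK v2; it is not the exact code from app.py, and the construct IDs and instance settings are placeholders):

# Generic CDK v2-style sketch of pinning the RDS MySQL engine version; not the
# actual stack from app.py. Bumping the pinned version to one available in your
# region (for example 8.0.33) is what resolves the failure described above.
from aws_cdk import Stack, aws_ec2 as ec2, aws_rds as rds
from constructs import Construct

class MlflowDatabaseStack(Stack):
    def __init__(self, scope: Construct, construct_id: str, **kwargs) -> None:
        super().__init__(scope, construct_id, **kwargs)

        vpc = ec2.Vpc(self, "Vpc", max_azs=2)

        rds.DatabaseInstance(
            self, "MLflowDatabase",
            # Pin an engine version that exists in your region, e.g. 8.0.33.
            engine=rds.DatabaseInstanceEngine.mysql(
                version=rds.MysqlEngineVersion.of("8.0.33", "8.0")
            ),
            vpc=vpc,
            instance_type=ec2.InstanceType.of(
                ec2.InstanceClass.BURSTABLE3, ec2.InstanceSize.SMALL
            ),
        )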

wrong mlflow version specified in Dockerfile

The MLflow version specified in the Dockerfile is higher than the latest released version, so the container image build fails and, with it, the command cdk deploy --parameters ProjectName=mlflow --require-approval never. As of today, the MLflow version should be 2.5.0.

container/Dockerfile

Experiment entry not found in MLFlow

Issue:

The experiment is not getting tracked in MLflow.

  1. I created the stack by following the instructions provided in the README
  2. I spun up a SageMaker notebook instance
  3. Ran one of the example notebooks

Output from SageMaker Notebook:

algo-1-8lk9t_1  | INFO:root:reading data
algo-1-8lk9t_1  | INFO:root:building training and testing datasets
algo-1-8lk9t_1  | INFO: 'boston-house' does not exist. Creating a new experiment
algo-1-8lk9t_1  | INFO:root:training model
algo-1-8lk9t_1  | INFO:root:evaluating model
algo-1-8lk9t_1  | INFO:root:AE-at-10th-percentile: 0.3148059956709947
algo-1-8lk9t_1  | INFO:root:AE-at-50th-percentile: 1.5540873015873053
algo-1-8lk9t_1  | INFO:root:AE-at-90th-percentile: 4.351567142857139
algo-1-8lk9t_1  | INFO:root:saving model in MLflow
algo-1-8lk9t_1  | 2020-12-05 01:46:05,404 sagemaker-training-toolkit INFO     Reporting training SUCCESS

I couldn't find the experiment entry in the MLflow UI (screenshot attached).

Running 1_track_experiments notebook in a SM Studio notebook throws various docker related exceptions

It is unclear how to run the lab notebooks. Is there a specific kernel they should be run on? We have tried various SageMaker kernels and get different errors raised.

It would be great to have a bit more guidance in the README to aid in running the lab examples.

Here is one of them.

/opt/conda/lib/python3.7/subprocess.py in _execute_child(self, args, executable, preexec_fn, close_fds, pass_fds, cwd, env, startupinfo, creationflags, shell, p2cread, p2cwrite, c2pread, c2pwrite, errread, errwrite, restore_signals, start_new_session)
   1549                         if errno_num == errno.ENOENT:
   1550                             err_msg += ': ' + repr(err_filename)
-> 1551                     raise child_exception_type(errno_num, err_msg, err_filename)
   1552                 raise child_exception_type(err_msg)
   1553 

FileNotFoundError: [Errno 2] No such file or directory: 'docker': 'docker'

Here is another

/opt/conda/envs/sagemaker-soln/lib/python3.7/site-packages/sagemaker/local/image.py in __init__(self, instance_type, instance_count, image, sagemaker_session, container_entrypoint, container_arguments)
     90         if find_executable("docker-compose") is None:
     91             raise ImportError(
---> 92                 "'docker-compose' is not installed. "
     93                 "Local Mode features will not work without docker-compose. "
     94                 "For more information on how to install 'docker-compose', please, see "

ImportError: 'docker-compose' is not installed. Local Mode features will not work without docker-compose. For more information on how to install 'docker-compose', please, see https://docs.docker.com/compose/install/

URLs loading endless

I have successfully deployed the stack (screenshot of the endless loading attached).

But when trying to open the URI from the load balancer present in the stack outputs, it loads endlessly and does not open the UI.
Any suggestions?

Lab 3: build and push container

The !mlflow sagemaker build-and-push-container step seems to look for a Dockerfile. Which Dockerfile should be used for this lab?

!pwd
!mlflow sagemaker build-and-push-container
/root/amazon-sagemaker-mlflow-fargate/lab
2022/09/28 04:48:36 INFO mlflow.models.docker_utils: Building docker image with name mlflow-pyfunc
/tmp/tmprxjci4u7/
/tmp/tmprxjci4u7/Dockerfile
Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/conda/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/opt/conda/lib/python3.7/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
FileNotFoundError: [Errno 2] No such file or directory

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 499, in send
    timeout=timeout,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 788, in urlopen
    method, url, error=e, _pool=self, _stacktrace=sys.exc_info()[2]
  File "/opt/conda/lib/python3.7/site-packages/urllib3/util/retry.py", line 550, in increment
    raise six.reraise(type(error), error, _stacktrace)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/packages/six.py", line 769, in reraise
    raise value.with_traceback(tb)
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 710, in urlopen
    chunked=chunked,
  File "/opt/conda/lib/python3.7/site-packages/urllib3/connectionpool.py", line 398, in _make_request
    conn.request(method, url, **httplib_request_kw)
  File "/opt/conda/lib/python3.7/http/client.py", line 1277, in request
    self._send_request(method, url, body, headers, encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1323, in _send_request
    self.endheaders(body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1272, in endheaders
    self._send_output(message_body, encode_chunked=encode_chunked)
  File "/opt/conda/lib/python3.7/http/client.py", line 1032, in _send_output
    self.send(msg)
  File "/opt/conda/lib/python3.7/http/client.py", line 972, in send
    self.connect()
  File "/opt/conda/lib/python3.7/site-packages/docker/transport/unixconn.py", line 30, in connect
    sock.connect(self.unix_socket)
urllib3.exceptions.ProtocolError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 214, in _retrieve_server_version
    return self.version(api_version=False)["ApiVersion"]
  File "/opt/conda/lib/python3.7/site-packages/docker/api/daemon.py", line 181, in version
    return self._result(self._get(url), json=True)
  File "/opt/conda/lib/python3.7/site-packages/docker/utils/decorators.py", line 46, in inner
    return f(self, *args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 237, in _get
    return self.get(url, **self._set_request_timeout(kwargs))
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 600, in get
    return self.request("GET", url, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 587, in request
    resp = self.send(prep, **send_kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/sessions.py", line 701, in send
    r = adapter.send(request, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/requests/adapters.py", line 547, in send
    raise ConnectionError(err, request=request)
requests.exceptions.ConnectionError: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/opt/conda/bin/mlflow", line 8, in <module>
    sys.exit(cli())
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1130, in __call__
    return self.main(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1055, in main
    rv = self.invoke(ctx)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1657, in invoke
    return _process_result(sub_ctx.command.invoke(sub_ctx))
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 1404, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "/opt/conda/lib/python3.7/site-packages/click/core.py", line 760, in invoke
    return __callback(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/mlflow/sagemaker/cli.py", line 606, in build_and_push_container
    env_manager=env_manager,
  File "/opt/conda/lib/python3.7/site-packages/mlflow/models/docker_utils.py", line 191, in _build_image
    _build_image_from_context(context_dir=cwd, image_name=image_name)
  File "/opt/conda/lib/python3.7/site-packages/mlflow/models/docker_utils.py", line 197, in _build_image_from_context
    client = docker.from_env()
  File "/opt/conda/lib/python3.7/site-packages/docker/client.py", line 101, in from_env
    **kwargs_from_env(**kwargs)
  File "/opt/conda/lib/python3.7/site-packages/docker/client.py", line 45, in __init__
    self.api = APIClient(*args, **kwargs)
  File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 197, in __init__
    self._version = self._retrieve_server_version()
  File "/opt/conda/lib/python3.7/site-packages/docker/api/client.py", line 222, in _retrieve_server_version
    f'Error while fetching server API version: {e}'
docker.errors.DockerException: Error while fetching server API version: ('Connection aborted.', FileNotFoundError(2, 'No such file or directory'))

Docker push error

`[100%] fail: docker push 031342435657.dkr.ecr.us-east-1.amazonaws.com/cdk-hnb659fds-container-assets-031342435657-us-east-1:df0c2dd130fbc264a70d9331a9a4367b68d888ce0d1a851cc223b5f794dbc7f1 exited with error code 1: EOF

โŒ DeploymentStack failed: Error: Failed to publish one or more assets. See the error messages above for more information.
at Object.publishAssets (/Users/amittimalsina/.nvm/versions/node/v16.3.0/lib/node_modules/aws-cdk/lib/util/asset-publishing.ts:25:11)
at processTicksAndRejections (node:internal/process/task_queues:96:5)
at CloudFormationDeployments.publishStackAssets (/Users/amittimalsina/.nvm/versions/node/v16.3.0/lib/node_modules/aws-cdk/lib/api/cloudformation-deployments.ts:424:7)
at CloudFormationDeployments.deployStack (/Users/amittimalsina/.nvm/versions/node/v16.3.0/lib/node_modules/aws-cdk/lib/api/cloudformation-deployments.ts:317:5)
at CdkToolkit.deploy (/Users/amittimalsina/.nvm/versions/node/v16.3.0/lib/node_modules/aws-cdk/lib/cdk-toolkit.ts:201:24)
at initCommandLine (/Users/amittimalsina/.nvm/versions/node/v16.3.0/lib/node_modules/aws-cdk/bin/cdk.ts:281:9)
Failed to publish one or more assets. See the error messages above for more information.`

Can you please help me figure out how to solve this?
