Blueprint for ML Projects

This blueprint for ML projects covers the pieces that are usually required:

  • MLflow experiment tracking
  • MLflow model tracking and staging
  • CI pipeline

System Design

  • Model observability (training, lineage, deployment staging, ...) is ensured via tracking with MLflow
  • The GitHub CI/CD pipeline runs tests on PR/merge to master and, on success, pushes the image of the inference server to GitHub's registry; see the packages
  • Training is triggered manually (developers running code locally), and could be added to CI
    • if the training computations could not fit or would take too long on a laptop and had to be run on k8s (for instance, to scan a large grid of hyperparameters), we'd add a cli.py that accepts such hyperparameters, so the necessary runs could be triggered in parallel (see the sketch after this list)
  • The inference server has two modes: 1) use locally stored weights, or 2) fetch model weights from an MLflow server
  • The model is served by a custom Flask server instead of relying on mlflow serve, to ensure that
    1. appropriate security could easily be added if this service were to face the public internet
    2. we can adhere to the required API specification (/stream and /batch)
  • Notable simplifications
    • No load balancer when serving the model
    • No k8s deployment. We do push images to GitHub's registry in CI, though, which facilitates deployment to k8s
    • No model staging (we promote directly to prod if performance checks pass). We would usually set a model to staging, test its performance by serving a small fraction of clients (1%) through a load balancer, and gradually increase that fraction (5%, 10%, etc.). Once we are sure there are no problems (performance regressions, crashes, system overloads, etc.), we'd promote to prod.
    • Artifacts are stored on the local FS. We'd usually set up a remote MLflow server with artifacts stored in S3
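
For instance, a hypothetical cli.py could look like the sketch below (the hyperparameter names and the train() stub are illustrative, not the project's actual API); each k8s job would then invoke it with a different hyperparameter combination:

# cli.py -- hypothetical entry point for triggering training runs in parallel
import argparse

def train(learning_rate: float, n_estimators: int) -> None:
    # placeholder: call the project's actual training routine here
    print(f"training with learning_rate={learning_rate}, n_estimators={n_estimators}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser(description="Run training with the given hyperparameters")
    parser.add_argument("--learning-rate", type=float, default=0.1)
    parser.add_argument("--n-estimators", type=int, default=100)
    args = parser.parse_args()
    train(args.learning_rate, args.n_estimators)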

Training Pipeline

training_pipeline.png

Inference

inference.png

Running for local dev (docker)

Build the project images:

docker-compose build

Run the tests (includes detection of linting issues):

docker-compose run test

Run inference

Start the stack (the Flask server has autoreload):

docker-compose up web -d

You can verify that the stack is running by checking the Flask server's health endpoint at http://localhost:8000/health; the MLflow server runs at http://localhost:5000/.
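
For example, a quick check from Python (a trivial sketch; the exact response body depends on the server's implementation):

import requests

# health endpoint of the Flask inference server started above
r = requests.get("http://localhost:8000/health")
print(r.status_code, r.text)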

You can also verify that the stack is working correctly by requesting a prediction:

import requests

# test /stream
X_request = {'X': 0.2}
r = requests.post("http://localhost:8000/stream", json=X_request)
print(r.json())

# test /batch
X_request = {'X': [0.2, 0.5, -0.1]}
r = requests.post("http://localhost:8000/batch", json=X_request)
print(r.json())

Run training

While the stack is running (we need MLflow up to log models and metrics), training can be run as follows:

# run stack if not already running
docker-compose up web -d

docker-compose run train

A new model will be trained, and if the performance checks pass, it will be promoted to production:

mlflow_ui.png

Model Registry:

mlflow_model_registry.png
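
For reference, a minimal sketch of what such a train-then-promote step could look like with the MLflow client. The registered-model name, the RMSE threshold, and the toy data are hypothetical stand-ins, not the project's actual values:

import numpy as np
import mlflow
import mlflow.sklearn
from mlflow.tracking import MlflowClient
from sklearn.linear_model import LinearRegression

mlflow.set_tracking_uri("http://localhost:5000")  # the MLflow server from the stack
MODEL_NAME = "ml_e2e_model"  # hypothetical registered-model name
RMSE_THRESHOLD = 0.5         # hypothetical performance check

# toy data standing in for the real training set
X = np.linspace(-1, 1, 100).reshape(-1, 1)
y = 2 * X.ravel() + np.random.normal(scale=0.1, size=100)

with mlflow.start_run():
    model = LinearRegression().fit(X, y)
    rmse = float(np.sqrt(np.mean((model.predict(X) - y) ** 2)))
    mlflow.log_metric("rmse", rmse)
    mlflow.sklearn.log_model(model, "model", registered_model_name=MODEL_NAME)

client = MlflowClient()
version = client.get_latest_versions(MODEL_NAME, stages=["None"])[0].version
if rmse < RMSE_THRESHOLD:
    client.transition_model_version_stage(MODEL_NAME, version, stage="Production")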

Running for local dev (no docker)

It is recommended to work with the Docker setup above. However, the project can also be installed as an editable package for quick local iterations:

pip install -e .

Deployment

A production image is created as part of CI/CD. We have two deployment options.

Option 1: run using locally stored weights:

docker run -e MODEL_SERVED=local -p 8000:8000 --rm ghcr.io/robin-vjc/endeavour_e2e_ml:latest

Option 2: if a running MLflow server is available (adjust the hostname in the command accordingly), the most recent production model can be loaded from it:

docker run -e MODEL_SERVED=mlflow -e MLFLOW_SERVER_HOST=http://mlflow:5000 --network=ml_e2e_ml_e2e -p 8000:8000 --rm ghcr.io/robin-vjc/endeavour_e2e_ml:latest
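
A minimal sketch of how the server could dispatch between the two modes, assuming the local weights are an exported MLflow model directory and a hypothetical registered-model name:

import os
import mlflow
import mlflow.pyfunc

MODEL_NAME = "ml_e2e_model"  # hypothetical registered-model name

def load_model():
    if os.environ.get("MODEL_SERVED", "local") == "local":
        # option 1: weights shipped inside the image (hypothetical path)
        return mlflow.pyfunc.load_model("model/")
    # option 2: fetch the latest Production model from the MLflow server
    mlflow.set_tracking_uri(os.environ["MLFLOW_SERVER_HOST"])
    return mlflow.pyfunc.load_model(f"models:/{MODEL_NAME}/Production")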
