
dominodatalab / domino-research


Projects developed by Domino's R&D team

License: Apache License 2.0

Dockerfile 1.35% Python 62.36% HCL 2.24% Shell 4.20% Jupyter Notebook 9.13% JavaScript 1.15% Less 0.08% SCSS 0.11% TypeScript 18.16% HTML 1.24%
python mlops data-science mlflow sagemaker

domino-research's Introduction

Domino Research

This repo contains projects under active development by the Domino R&D team. We build tools that help data scientists and ML engineers train and deploy ML models.

Active Projects

Here’s what we’re working on:

  • 🌉 Bridge - deploy directly from your registry, turning it into a declarative source of truth for your model hosting.

  • 🛂 Checkpoint - adds 'Pull Requests' to your registry to create a better process for promoting models to production.

  • 🎇 Flare - monitor models and get alerts without capturing, storing or processing production inference data.

domino-research's People

Contributors

adp312, ajbosco, ddl-kevin, joshbroomberg, katedk

domino-research's Issues

Create IAM for Bridge run

Use bridge init to create an IAM user that Bridge uses at runtime. This would simplify the user configuration.

Optimize endpoint create/update times

Endpoints start very slowly. Consider the following optimizations:

  • Larger default instances to run conda install faster
  • Use a requirements.txt file instead of a conda YAML file in model packaging
  • Use a multi-model endpoint that doesn't require new instances/restarts to update
  • Use a custom image to achieve smaller image size

Teardown/destroy is incomplete

  • Block until all endpoints are deletable to avoid leaving orphaned endpoints behind
  • Empty all artifacts from bucket so that it can be deleted
  • Separate the model-level and infra-level teardown for use in tests
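
The "block until all endpoints are deletable" item could be sketched as a generic polling loop. This is only an illustration under stated assumptions: `wait_until` and the predicate passed to it are hypothetical names, and the actual check of endpoint status against the cloud provider is left to the caller.

```python
import time


def wait_until(predicate, timeout_s=600.0, interval_s=5.0,
               sleep=time.sleep, clock=time.monotonic):
    """Poll `predicate` until it returns True or `timeout_s` elapses.

    `predicate` is a hypothetical zero-argument callable, e.g. one that
    returns True once no endpoint is still being created or updated.
    `sleep` and `clock` are injectable so the loop can be tested quickly.
    Returns True on success and False on timeout.
    """
    deadline = clock() + timeout_s
    while True:
        if predicate():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A teardown routine could call `wait_until(all_endpoints_deletable)` before issuing deletes, and treat a `False` result as a reason to abort rather than leave resources half-removed.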

500s during model update in localhost target

This looks like the error:

Traceback (most recent call last):
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 2070, in wsgi_app
    response = self.full_dispatch_request()
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1516, in full_dispatch_request
    return self.finalize_request(rv)
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1535, in finalize_request
    response = self.make_response(rv)
  File "/opt/conda/lib/python3.9/site-packages/flask/app.py", line 1727, in make_response
    raise TypeError(
TypeError: The view function did not return a valid response. The return type must be a string, dict, tuple, Response instance, or WSGI callable, but it was a int.
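
The traceback shows a view returning a bare int, which Flask rejects. One possible guard is sketched below, assuming the int was meant as an HTTP status code (the issue does not confirm this). `as_flask_return_value` is a hypothetical helper, not part of Bridge or Flask; it relies only on the fact that Flask accepts a `(body, status)` tuple as a view return value.

```python
def as_flask_return_value(rv):
    """Wrap a bare int status code in a (body, status) tuple.

    Assumption: an int returned from the view was intended as an HTTP
    status code. Any other value is passed through unchanged.
    """
    # bool is a subclass of int, so exclude it explicitly.
    if isinstance(rv, int) and not isinstance(rv, bool):
        return ("", rv)
    return rv
```

With this in place, a view that computes a status code can end with `return as_flask_return_value(status)` instead of `return status`.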

Deploy to Heroku button for Checkpoint

This could be tricky given our use of Python and React (requiring two buildpacks). Heroku has a Dockerfile option, but the startup time would be 5+ minutes. Is there an easier way for us to do this?

Handle race conditions

NOOP race condition

  • PR opens to move version to archive
  • Version moves to archive by another means (manually or via another PR)
  • Original PR errors out when approving

Target Version race condition

  • You are targeting some change to production.
  • Someone else makes a change to production.
  • Your diff is now out of date and you're an idiot
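
Both race conditions above could be caught with an optimistic check at approval time. The sketch below is one way to do it, not Checkpoint's actual logic: `PullRequest`, `decide_on_approval`, and `StalePullRequestError` are hypothetical names, and it assumes the PR records which version occupied the target stage when it was opened.

```python
from dataclasses import dataclass
from typing import Optional


class StalePullRequestError(Exception):
    """Raised when the target stage changed after the PR was opened."""


@dataclass
class PullRequest:
    model_name: str
    version: str
    target_stage: str
    # Version in the target stage when the PR was opened (None if empty).
    base_version_in_target: Optional[str]


def decide_on_approval(pr, current_stage_of_version, current_version_in_target):
    """Return "noop" or "apply", or raise if the PR's diff is stale."""
    # NOOP race: the version already reached the target stage by other means.
    if current_stage_of_version == pr.target_stage:
        return "noop"
    # Target version race: someone else changed the target stage meanwhile.
    if current_version_in_target != pr.base_version_in_target:
        raise StalePullRequestError(
            f"{pr.target_stage} now holds version {current_version_in_target}, "
            f"expected {pr.base_version_in_target}"
        )
    return "apply"
```

Treating the NOOP case as a success rather than an error means the original PR can simply be closed as already applied.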

Assert the version of Python interpreter

I accidentally used an older version of Python (3.8), and this caused the following exception in bridge run:

AttributeError: 'str' object has no attribute 'removesuffix'

The function in question was, indeed, introduced in Python 3.9.

I will add an assertion in the beginning of the call stack to guarantee that the interpreter version is kosher (>=3.9).
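
A minimal sketch of such a check, using only `sys.version_info` from the standard library. The function name and the error message are illustrative assumptions, not the code that was actually added to Bridge.

```python
import sys

MIN_PYTHON = (3, 9)  # str.removesuffix was added in Python 3.9


def check_python_version(version_info=None, minimum=MIN_PYTHON):
    """Raise RuntimeError if the interpreter is older than `minimum`.

    `version_info` defaults to the running interpreter; passing a tuple
    such as (3, 8) makes the check easy to test.
    """
    if version_info is None:
        version_info = sys.version_info
    current = tuple(version_info[:2])
    if current < minimum:
        raise RuntimeError(
            "Python %d.%d or newer is required, found %d.%d"
            % (minimum[0], minimum[1], current[0], current[1])
        )
```

Calling `check_python_version()` at the top of the CLI entry point would turn the `AttributeError` above into an immediate, explicit failure.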

Model update/synchronization bug

Replication:

  1. Add a couple of model versions to the registry.
  2. Init Bridge in a fresh account.
  3. Bridge creates an endpoint/model.
  4. Before Bridge is done creating the endpoint, add and tag a new version destined for the endpoint.
  5. Bridge skips the update.
  6. After some time (perhaps when the endpoint becomes updateable), Bridge attempts the update and then fails due to the existing model.

Query ALL versions in a stage in MLflow

Our logic currently only fetches the most recent version in each stage.

Consider fetching all the versions in a stage so that we can enable A/B testing by deploying N versions to the endpoint for a stage.
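
As a rough sketch of the grouping step, assuming the full list of versions has already been fetched (for example via MLflow's `MlflowClient.search_model_versions`): the `Version` dataclass is a hypothetical stand-in that mirrors the `version` and `current_stage` attributes of MLflow's model version objects, and `versions_by_stage` is not an existing Bridge function.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    """Hypothetical stand-in for an MLflow model version."""
    name: str
    version: str        # MLflow reports version numbers as strings
    current_stage: str  # e.g. "Staging" or "Production"


def versions_by_stage(versions):
    """Group every version by stage, newest version first in each stage."""
    grouped = defaultdict(list)
    for v in versions:
        grouped[v.current_stage].append(v)
    for stage_versions in grouped.values():
        stage_versions.sort(key=lambda v: int(v.version), reverse=True)
    return dict(grouped)
```

Taking the first N entries of a stage's list would then give the N versions to deploy behind that stage's endpoint.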

Better current_stage handling

  • Stop recording current_stage at create time and remove it from the DB

In the endpoint that views a single PR:

  • Query current stage and add it
  • Query the current version in the target stage
  • Query all the other info we need for the view

Optional per-registry name prefix

If you have more than one MLflow registry and run a Bridge worker against each, all deploying to the same AWS account and region, then the models from each registry may clobber each other.

Solution: add a per-registry prefix to AWS resources so that multiple registries can cohabit the same account and region.
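
One way the optional prefix could be applied when naming resources is sketched below. `prefixed_name` is a hypothetical helper, and the sanitizing rule (letters, digits, and hyphens only, lowercased) is an assumption chosen to stay within typical AWS naming constraints, not a rule taken from Bridge.

```python
import re


def prefixed_name(base_name, registry_prefix=""):
    """Return `base_name` with an optional per-registry prefix.

    With no prefix the name is unchanged, so existing single-registry
    deployments keep their current resource names.
    """
    if not registry_prefix:
        return base_name
    # Collapse anything that is not a letter or digit into a single hyphen.
    cleaned = re.sub(r"[^A-Za-z0-9]+", "-", registry_prefix).strip("-").lower()
    if not cleaned:
        raise ValueError("registry_prefix has no usable characters")
    return f"{cleaned}-{base_name}"
```

For example, two workers configured with the prefixes `team-a` and `team-b` would produce distinct names for a model that exists in both registries.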
