caraml-dev / merlin

Kubernetes-friendly ML model management, deployment, and serving.

License: Apache License 2.0

Dockerfile 0.34% Makefile 0.27% Go 61.71% HTML 0.04% Shell 0.49% Python 25.83% JavaScript 11.22% SCSS 0.09% Mustache 0.02%
machine-learning mlops

merlin's Introduction

Overview

Merlin is a platform for deploying and serving machine learning models. The project was born of the belief that model deployment should be:

  • Easy and self-serve: Humans should not become the bottleneck for deploying models into production.
  • Scalable: Deployed models should be able to handle Gojek scale and beyond.
  • Fast: The framework should let users iterate quickly.
  • Cost-efficient: It should provide all of the benefits above in a cost-efficient manner.

Merlin attempts to do so by:

  • Abstracting infrastructure: Merlin uses familiar concepts such as Project, Model, Version, and Endpoint as its core components and abstracts away the complexity of deploying and serving ML services from the user.
  • Autoscaling: Merlin is built on top of Knative and KFServing to provide a production-ready serverless solution.

Getting Started

To install Merlin on your local machine, see the Local Development guide.

Documentation

Go to the docs folder for the full documentation and guides.

Python SDK Documentation

Click here to get started with the Python SDK.

API Documentation

To explore the API documentation, run:

make swagger-ui

Client Libraries

We use Swagger Codegen to automatically generate Golang and Python clients for the Merlin API. To generate the client libraries, run:

make generate-client

Notice

Merlin is a community project and is still under active development. Your feedback and contributions are important to us. Please have a look at our contributing guide for details.

merlin's People

Contributors

aemzayn, ariefrahmansyah, ashwinath, bthari, davidheryanto, deadlycoconuts, eric-lidong, haveaqiupill, imjuanleonard, jials, karzuo, khorshuheng, krithika369, leonlnj, mbruner, numb3r33, pradithya, romanwozniak, shydefoo, terryyylim, tiopramayudi, tkpd-hafizhan, zenovore

merlin's Issues

Panic on janitor - image building jobs deletion

Given a list of image building jobs on the cluster with one failed job:

$ kubectl get job
NAME                                                 COMPLETIONS   DURATION   AGE
merlin-e2e-ariefrahmansyah-std-transformer-4         0/1           6h31m      6h31m
merlin-e2e-ariefrahmansyah-std-transformer-5         1/1           3m43s      5h38m

$ kubectl get job merlin-e2e-ariefrahmansyah-std-transformer-4 -o json
...
    "status": {
        "conditions": [
            {
                "lastProbeTime": "2021-08-21T19:17:51Z",
                "lastTransitionTime": "2021-08-21T19:17:51Z",
                "message": "Job has reached the specified backoff limit",
                "reason": "BackoffLimitExceeded",
                "status": "True",
                "type": "Failed"
            }
        ],
        "failed": 4,
        "startTime": "2021-08-21T19:08:40Z"
    }
...

The Image Builder Janitor cron job panics with the following error:

2021/08/21 22:00:00 cron: panic running job: runtime error: invalid memory address or nil pointer dereference
goroutine 3629 [running]:
github.com/robfig/cron.(*Cron).runWithRecovery.func1(0xc000511180)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:161 +0x9e
panic(0x203aa20, 0x3aef850)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/gojek/merlin/imagebuilder.(*Janitor).getExpiredJobs(0xc00118cf90, 0xc000582f00, 0x23f0f3a, 0x2d, 0x0, 0x0)
	/src/api/imagebuilder/janitor.go:81 +0x184
github.com/gojek/merlin/imagebuilder.(*Janitor).CleanJobs(0xc00118cf90)
	/src/api/imagebuilder/janitor.go:56 +0x7c
github.com/robfig/cron.FuncJob.Run(0xc001094b70)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:92 +0x25
github.com/robfig/cron.(*Cron).runWithRecovery(0xc000511180, 0x2829060, 0xc001094b70)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:165 +0x59
created by github.com/robfig/cron.(*Cron).run
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:199 +0x747
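
The failed job's status above carries no completionTime field; Kubernetes sets that field only for Jobs that finish successfully. One plausible cause of the panic, which the trace alone does not confirm, is that the expiry check dereferences this missing value. The janitor itself is written in Go; the following is only a minimal Python sketch of an expiry check that tolerates the missing field. The names `finished_at` and `get_expired_jobs`, the fallback to the Failed condition's lastTransitionTime, and the retention value are assumptions for illustration, not Merlin's actual logic.

```python
from datetime import datetime, timedelta, timezone


def parse_time(value):
    """Parse a Kubernetes timestamp such as 2021-08-21T19:17:51Z."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def finished_at(status):
    """Return when a job finished, or None if it has not finished yet."""
    # Successful jobs carry completionTime; failed jobs do not.
    if status.get("completionTime"):
        return parse_time(status["completionTime"])
    # Assumption: for failed jobs, fall back to the Failed condition's transition time.
    for cond in status.get("conditions") or []:
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return parse_time(cond["lastTransitionTime"])
    return None


def get_expired_jobs(jobs, now, retention):
    """Return names of finished jobs that are older than the retention period."""
    expired = []
    for job in jobs:
        done = finished_at(job.get("status") or {})
        if done is not None and now - done > retention:
            expired.append(job["metadata"]["name"])
    return expired
```

With the failed job from this report and a hypothetical one-hour retention, `get_expired_jobs` would list the job for deletion instead of raising, while a job that only has a startTime is skipped.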

Problem In Setting Up on Minikube

Specifications

I was trying to set up Merlin locally following this doc: https://github.com/gojek/merlin/blob/main/docs/getting-started/deploying-merlin/local_development.md but was not able to configure it properly.
With MERLIN_VERSION=v0.9.0, pods crash in the mlp namespace, but after changing the version to v0.10.0 all pods deploy.
I am able to access merlin on http://merlin.mlp.127.0.0.1.nip.io/merlin and mlp on http://mlp.mlp.127.0.0.1.nip.io/merlin but only the first page. When I try to authenticate, it errors out with a redirect URL mismatch. As mentioned in the docs, we need to enter the Authorised JavaScript origins and Authorised redirect URIs with http, but only https is supported for this type of URL (https://merlin.mlp.127.0.0.1.nip.io/merlin).

  • Version:
  • Platform:
  • Subsystem:

Possible Solution

No validation for Custom Transformer Docker Image Name

Expected Behavior

Prevent users from submitting an invalid custom Docker image name.

Current Behavior

There's no validation for the Custom Transformer Docker image name in the input form, so the user can submit an invalid name and continue with the deployment. This leads to an error when creating the inference service.

Steps to reproduce

  1. On the model deployment page, choose Custom Transformer
  2. In the Docker Image name field, enter an invalid Docker image name. For example, add a space between the name and tag: my-image-name: my-tag
  3. Click Deploy
  4. The UI will display success, but after waiting, the deployment fails with the error message: error creating inference service

Possible Solution

Add validation.

*We may also need to check whether other input components need validation.
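
A minimal sketch of such a check, written in Python for illustration (the Merlin UI itself is JavaScript). The pattern is a simplified approximation of the Docker image reference grammar: it covers an optional registry host, lowercase path components, and an optional tag, and deliberately leaves out digests. `is_valid_image_name` is a hypothetical helper, not an existing Merlin function.

```python
import re

# Simplified pieces of an image reference; digests (@sha256:...) are not handled.
_HOST = r"[A-Za-z0-9](?:[A-Za-z0-9.-]*[A-Za-z0-9])?(?::[0-9]+)?"
_COMPONENT = r"[a-z0-9]+(?:(?:\.|_|__|-+)[a-z0-9]+)*"
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"

_IMAGE_RE = re.compile(
    rf"(?:{_HOST}/)?{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?"
)


def is_valid_image_name(name):
    """Return True if name looks like [host/]path[:tag] with no stray characters."""
    return _IMAGE_RE.fullmatch(name) is not None


print(is_valid_image_name("my-image-name:my-tag"))   # valid reference
print(is_valid_image_name("my-image-name: my-tag"))  # space after the colon
```

Running the same kind of check in the form before submission would reject the example from the reproduction steps before an inference service is ever created.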

Who to contact for security issues

Hello 👋

I run a security community that finds and fixes vulnerabilities in OSS. A researcher (@Catman426) has found a potential issue, which I would be eager to share with you.

Could you add a SECURITY.md file with an e-mail address for me to send further details to? GitHub recommends a security policy to ensure issues are responsibly disclosed, and it would help direct researchers in the future.

Looking forward to hearing from you 👍

(cc @huntr-helper)

Model version detail page

As discussed here, it will be great to have a dedicated page to display the configuration of a model version.

The page should have the following information:

  1. Model version specific info (name, MLflow link, created and updated time)
  2. List of model version endpoints across environments
  • environment name, endpoint URL, status, transformer (if any)

The page will consume the model version API.

Docker images still exist after model version deletion

Expected Behavior

Docker images of a pyfunc model should be deleted after the model version is deleted from the UI.

Current Behavior

Docker images are not deleted after the model version is deleted from the UI.

Steps to reproduce

Delete a model version, then upload the pyfunc model again. It will use the existing images.

Specifications

  • Version: 0.33.0
  • Platform: linux
  • Subsystem:

Possible Solution

Enable Docker image deletion after a model version is deleted.

Rework how batch-predictor & pyfunc-server depend on merlin-sdk

Is your feature request related to a problem? Please describe.

Currently, we need to publish merlin-sdk to PyPI before batch-predictor and pyfunc-server can release new Docker image versions. This is because, instead of depending on the local merlin-sdk code, they specify merlin-sdk in requirements.txt, which makes them resolve it from PyPI.

Describe the solution you'd like

batch-predictor & pyfunc-server should depend on the local code of merlin-sdk.

Publish latest tag for Merlin Docker images.

Describe the solution you'd like

Currently, all Merlin Docker images are tagged with the Merlin release version and there's no latest tag.

Describe alternatives you've considered

To publish the latest tag for all Merlin Docker images on every release.

Additional context

#514 (comment)

Do not display Version Endpoint URL value if empty in Version UI page

Current Behavior

image

Steps to reproduce

  1. Deploy a model version
  2. Go to model version UI page
  3. While the endpoint is still pending, the displayed endpoint URL is :predict instead of -

Specifications

  • Version:
  • Platform:
  • Subsystem:

Possible Solution

Do not display Version Endpoint URL value if empty in Version UI page
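
The :predict value suggests that the suffix is appended even when the base URL is empty; this is inferred from the symptom, not confirmed from the UI source. A minimal sketch of the guard, in Python for illustration (the UI itself is JavaScript, and `display_endpoint_url` is a hypothetical helper):

```python
def display_endpoint_url(endpoint_url, placeholder="-"):
    """Return the predict URL, or a placeholder while the endpoint has no URL yet."""
    if not endpoint_url:
        # Pending endpoints have no URL; show the placeholder instead of ":predict".
        return placeholder
    return f"{endpoint_url}:predict"


print(display_endpoint_url(""))
print(display_endpoint_url("http://example.invalid/v1/models/my-model-1"))
```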

Add E2E test that runs the Python SDK tests

The Python SDK contains the test files that interact with the running Merlin server and its model cluster. Having it running in GitHub Actions will enable the developers to be more confident in introducing their changes.

The current E2E itself only deploys the Merlin server to the KinD cluster but there's no testing for model management, deployment, and serving.

These components need to be installed in E2E KinD cluster:

  • Istio
  • Knative
  • KFServing
  • Vault
  • Spark Operator
  • MinIO (as a GCS replacement for the MLflow storage backend)

We also need to add the pytest-dependency library and update the decorators to make sure the tests run in the correct order.

Refactor storage package to focus on how to update instead of what to update

Is your feature request related to a problem? Please describe.

Model Endpoint Storage's Save method in the storage package contains logic for both what to update and how to update it, for the model endpoint as well as the model version endpoint. This could lead to confusion about how to implement the service and storage layers in the future.

Related discussion: #483 (comment)

Describe the solution you'd like

  • A clear separation of, and guideline for, the service and storage layers' responsibilities
  • Refactor the storage layer to be responsible for how to update the data, and let the service layer decide what to save
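
One way to picture the proposed split, as a minimal Python sketch rather than Merlin's Go code. All class names, the in-memory storage, and the status values ("serving", "running", "terminated") are hypothetical and only illustrate where each decision lives.

```python
from dataclasses import dataclass, field


@dataclass
class VersionEndpoint:
    id: str
    status: str


@dataclass
class ModelEndpoint:
    id: str
    status: str
    version_endpoints: list = field(default_factory=list)


class ModelEndpointStorage:
    """Storage layer: knows HOW to persist, never decides WHAT changes."""

    def __init__(self):
        self.model_endpoints = {}
        self.version_endpoints = {}

    def save(self, model_endpoint):
        # Stage copies first so both collections are swapped in together,
        # standing in for a single database transaction.
        staged_versions = dict(self.version_endpoints)
        for ve in model_endpoint.version_endpoints:
            staged_versions[ve.id] = ve
        staged_models = dict(self.model_endpoints)
        staged_models[model_endpoint.id] = model_endpoint
        self.model_endpoints, self.version_endpoints = staged_models, staged_versions


class ModelEndpointService:
    """Service layer: decides WHAT to change, then hands the result to storage."""

    def __init__(self, storage):
        self.storage = storage

    def stop_serving(self, model_endpoint):
        # Hypothetical business rule: terminate the model endpoint and move
        # its version endpoints from "serving" back to "running".
        model_endpoint.status = "terminated"
        for ve in model_endpoint.version_endpoints:
            ve.status = "running"
        self.storage.save(model_endpoint)
```

In this shape, changing a status rule touches only the service class, and changing the persistence mechanism touches only the storage class.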

Rework Fetching Model Version List

Is your feature request related to a problem? Please describe.

The current implementation of the List Model Version API uses GORM preloading without any limit. This becomes an issue when the list of model versions is large enough to exceed Postgres' max_stack_depth configuration value. In most cases, users don't need all model versions, so filtering or a limit is needed here.

It also affects the model version UI page, since it ends up fetching all of the model versions and displaying them at once.

Describe the solution you'd like

On the API side, it would be good to use plain SQL instead of relying on GORM. It would also be a huge improvement to let users specify their own filter query.

On the UI side, we can add pagination to the model version table.
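
A minimal sketch of a paginated, filterable query using plain SQL. It uses Python's built-in sqlite3 so it can run standalone; Merlin uses Postgres, where the placeholder syntax differs. The `versions` table, its columns, and `list_versions` are hypothetical names for illustration.

```python
import sqlite3


def list_versions(conn, model_id, limit=50, offset=0, mlflow_run_id=None):
    """Return one page of versions for a model, newest first."""
    query = "SELECT id, model_id, mlflow_run_id FROM versions WHERE model_id = ?"
    params = [model_id]
    # Optional filter; values are always bound as parameters, never concatenated.
    if mlflow_run_id is not None:
        query += " AND mlflow_run_id = ?"
        params.append(mlflow_run_id)
    query += " ORDER BY id DESC LIMIT ? OFFSET ?"
    params += [limit, offset]
    return conn.execute(query, params).fetchall()


conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE versions (id INTEGER PRIMARY KEY, model_id INTEGER, mlflow_run_id TEXT)"
)
conn.executemany(
    "INSERT INTO versions (model_id, mlflow_run_id) VALUES (?, ?)",
    [(1, f"run-{i}") for i in range(1, 6)],
)
first_page = list_versions(conn, model_id=1, limit=2, offset=0)
second_page = list_versions(conn, model_id=1, limit=2, offset=2)
```

The UI table can then request one page at a time by passing its page size as `limit` and the page start as `offset`.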

Safer quick_install.sh

Is your feature request related to a problem? Please describe.

We want to avoid accidentally running quick_install.sh on an existing Kubernetes cluster, e.g. a production cluster with running infrastructure (Istio, Knative, KFServing) and Merlin already installed.

Describe the solution you'd like

  1. Add a confirmation prompt if quick_install.sh is run on a cluster that is not minikube or kind (context-aware script).
  2. Use commands that will not overwrite already installed components. The script should stop if a component already exists.
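
The context check in item 1 can be sketched as follows, in Python for illustration (quick_install.sh itself is a shell script). The sketch assumes the default context names: kind prefixes its contexts with kind-, and minikube's default context is named minikube; custom profile names would not be recognized. The function names are hypothetical.

```python
import subprocess


def is_local_context(context):
    """Return True for default minikube and kind context names."""
    return context == "minikube" or context.startswith("kind-")


def current_context():
    """Ask kubectl for the active context name."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def confirm_install(context, ask=input):
    """Proceed silently on local clusters; otherwise require an explicit yes."""
    if is_local_context(context):
        return True
    answer = ask(
        f"Context '{context}' does not look like minikube or kind. Continue? [y/N] "
    )
    return answer.strip().lower() in ("y", "yes")
```

An installer would call `confirm_install(current_context())` and exit when it returns False. The `ask` parameter only exists so the prompt can be replaced in tests.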
