caraml-dev / merlin

Kubernetes-friendly ML model management, deployment, and serving.

License: Apache License 2.0

Dockerfile 0.34% Makefile 0.27% Go 61.71% HTML 0.04% Shell 0.49% Python 25.83% JavaScript 11.22% SCSS 0.09% Mustache 0.02%

machine-learning mlops

merlin's Introduction

Overview

Merlin is a platform for deploying and serving machine learning models. The project was born of the belief that model deployment should be:

Easy and self-serve: Humans should not become the bottleneck for deploying models into production.
Scalable: Deployed models should be able to handle Gojek scale and beyond.
Fast: The framework should let users iterate quickly.
Cost efficient: It should provide all of the benefits above in a cost-efficient manner.

Merlin attempts to do so by:

Abstracting infrastructure: Merlin uses familiar concepts such as Project, Model, Version, and Endpoint as its core components, and abstracts away the complexity of deploying and serving ML services from the user.
Autoscaling: Merlin is built on top of Knative and KFServing to provide a production-ready serverless solution.

Getting Started

To install Merlin on your local machine, see Local Development.

Documentation

Go to the docs folder for the full documentation and guides.

Python SDK Documentation

Click here to get started with the Python SDK.

API Documentation

To explore the API documentation, run:

make swagger-ui

Client Libraries

We use Swagger Codegen to automatically generate Golang and Python clients for the Merlin API. To generate the client libraries, run:

make generate-client

Notice

Merlin is a community project and is still under active development. Your feedback and contributions are important to us. Please have a look at our contributing guide for details.


merlin's Issues

Panic on janitor - image building jobs deletion

Given a list of image building jobs on the cluster with one failed job:

$ kubectl get job
NAME                                                 COMPLETIONS   DURATION   AGE
merlin-e2e-ariefrahmansyah-std-transformer-4         0/1           6h31m      6h31m
merlin-e2e-ariefrahmansyah-std-transformer-5         1/1           3m43s      5h38m

$ kubectl get job merlin-e2e-ariefrahmansyah-std-transformer-4 -o json
...
    "status": {
        "conditions": [
            {
                "lastProbeTime": "2021-08-21T19:17:51Z",
                "lastTransitionTime": "2021-08-21T19:17:51Z",
                "message": "Job has reached the specified backoff limit",
                "reason": "BackoffLimitExceeded",
                "status": "True",
                "type": "Failed"
            }
        ],
        "failed": 4,
        "startTime": "2021-08-21T19:08:40Z"
    }
...

The Image Builder Janitor cron job panics with the following error:

2021/08/21 22:00:00 cron: panic running job: runtime error: invalid memory address or nil pointer dereference
goroutine 3629 [running]:
github.com/robfig/cron.(*Cron).runWithRecovery.func1(0xc000511180)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:161 +0x9e
panic(0x203aa20, 0x3aef850)
	/usr/local/go/src/runtime/panic.go:969 +0x166
github.com/gojek/merlin/imagebuilder.(*Janitor).getExpiredJobs(0xc00118cf90, 0xc000582f00, 0x23f0f3a, 0x2d, 0x0, 0x0)
	/src/api/imagebuilder/janitor.go:81 +0x184
github.com/gojek/merlin/imagebuilder.(*Janitor).CleanJobs(0xc00118cf90)
	/src/api/imagebuilder/janitor.go:56 +0x7c
github.com/robfig/cron.FuncJob.Run(0xc001094b70)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:92 +0x25
github.com/robfig/cron.(*Cron).runWithRecovery(0xc000511180, 0x2829060, 0xc001094b70)
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:165 +0x59
created by github.com/robfig/cron.(*Cron).run
	/go/pkg/mod/github.com/robfig/cron@v1.2.0/cron.go:199 +0x747
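
The status of the failed job above has no completionTime field; Kubernetes sets that field only when a job finishes successfully. One plausible cause of the panic, which the report does not confirm, is that the janitor reads the completion time of every job without checking whether it is set. The sketch below shows the defensive logic in Python for illustration only (the janitor itself is written in Go). The function names and the retention parameter are hypothetical, and falling back to the Failed condition's lastTransitionTime is an assumption about the desired behaviour.

```python
from datetime import datetime, timedelta, timezone


def parse_k8s_time(value):
    """Parse a Kubernetes timestamp such as 2021-08-21T19:17:51Z (UTC)."""
    return datetime.strptime(value, "%Y-%m-%dT%H:%M:%SZ").replace(tzinfo=timezone.utc)


def job_finished_at(status):
    """Return when the job finished, or None if it has not finished.

    `status` is the "status" object of a Job, as printed by
    `kubectl get job <name> -o json`.
    """
    completion = status.get("completionTime")
    if completion is not None:
        # Only successful jobs carry a completionTime.
        return parse_k8s_time(completion)
    # Assumption: treat the Failed condition's transition time as the finish time.
    for cond in status.get("conditions", []):
        if cond.get("type") == "Failed" and cond.get("status") == "True":
            return parse_k8s_time(cond["lastTransitionTime"])
    return None


def is_expired(status, now, retention):
    """A job is expired once it finished more than `retention` ago."""
    finished = job_finished_at(status)
    return finished is not None and now - finished > retention
```

With this guard, a failed job is either cleaned up based on its failure time or skipped, instead of crashing the whole cleanup run.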

Problem In Setting Up on Minikube

Specifications

I was trying to set up Merlin locally by following this doc: https://github.com/gojek/merlin/blob/main/docs/getting-started/deploying-merlin/local_development.md but was not able to configure it properly.
With MERLIN_VERSION=v0.9.0, pods crash in the mlp namespace, but after changing the version to v0.10.0 all pods deploy.
I am able to access merlin on http://merlin.mlp.127.0.0.1.nip.io/merlin and mlp on http://mlp.mlp.127.0.0.1.nip.io/merlin, but only the first page. When I try to authenticate, it fails with a redirect URL mismatch error. As mentioned in the docs, we need to enter the Authorised JavaScript origins and Authorised redirect URIs using http, but only https is supported for this type of URL (https://merlin.mlp.127.0.0.1.nip.io/merlin).

Version:
Platform:
Subsystem:

Possible Solution

No validation for Custom Transformer Docker Image Name

Expected Behavior

Prevent the user from submitting an invalid custom Docker image name.

Current Behavior

There's no validation for the Custom Transformer Docker Image name in the input form, so the user can submit an invalid name and continue with the deployment. This leads to an error when creating the inference service.

Steps to reproduce

On the model deployment page, choose Custom Transformer
In the Docker Image name field, enter an invalid Docker image name. For example, add a space between the name and tag: my-image-name: my-tag
Click Deploy
The UI will display success, but after some time the deployment fails with the error message: error creating inference service

Possible Solution

Add validation.
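
As an illustration of what such validation could look like, here is a minimal sketch in Python. The pattern is a simplified approximation of the Docker image reference format (optional registry host and port, lowercase path components, optional tag); it does not handle digests and is not the official grammar. The function name is hypothetical, and the real check would live in the UI form and ideally also in the API.

```python
import re

# One path component: lowercase alphanumerics joined by '.', '_', '__' or dashes.
_COMPONENT = r"[a-z0-9]+(?:(?:\.|_|__|-+)[a-z0-9]+)*"
# Optional registry host with an optional port, e.g. localhost:5000.
_HOST = r"[a-zA-Z0-9](?:[a-zA-Z0-9.-]*[a-zA-Z0-9])?(?::[0-9]+)?"
# Tag: up to 128 characters, must not start with '.' or '-'.
_TAG = r"[A-Za-z0-9_][A-Za-z0-9_.-]{0,127}"

IMAGE_RE = re.compile(rf"(?:{_HOST}/)?{_COMPONENT}(?:/{_COMPONENT})*(?::{_TAG})?")


def is_valid_image_name(name: str) -> bool:
    """Return True if `name` looks like a valid image reference (simplified)."""
    return IMAGE_RE.fullmatch(name) is not None
```

With this check, the example from the steps above (my-image-name: my-tag) is rejected because of the space, while my-image-name:my-tag is accepted.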

*We may also need to check whether other input components need validation.

Investigate why validation is not working as expected in config struct

Expected Behavior

Any validate struct tag should be respected and validated. Currently, only the parent struct is validated.

Current Behavior

The validate struct tag in a child struct of the config struct is ignored.

Who to contact for security issues

Hello 👋

I run a security community that finds and fixes vulnerabilities in OSS. A researcher (@Catman426) has found a potential issue, which I would be eager to share with you.

Could you add a SECURITY.md file with an e-mail address for me to send further details to? GitHub recommends a security policy to ensure issues are responsibly disclosed, and it would help direct researchers in the future.

Looking forward to hearing from you 👍

(cc @huntr-helper)

Model version detail page

As discussed here, it would be great to have a dedicated page to display the configuration of a model version.

The page should have the following information:

Model version specific info (name, MLflow link, created and updated time)
List of model version endpoints across environments: environment name, endpoint URL, status, transformer (if it exists)

The page will consume the model version API.

Docker images still exist after model version deletion

Expected Behavior

Docker images of a pyfunc model should be deleted after the model version is deleted from the UI.

Current Behavior

Docker images are not deleted after the model version is deleted from the UI.

Steps to reproduce

Delete a model version, then upload the pyfunc model again. It will use the existing images.

Specifications

Version: 0.33.0
Platform: linux
Subsystem:

Possible Solution

Enable Docker image deletion after a model version is deleted.

Rework how batch-predictor & pyfunc-server depend on merlin-sdk

Is your feature request related to a problem? Please describe.

Currently, we need to publish merlin-sdk to PyPI first before batch-predictor and pyfunc-server can release their new Docker image versions. This is because, instead of depending on the local code of merlin-sdk, they specify merlin-sdk in requirements.txt, which makes them resolve it from PyPI.

Describe the solution you'd like

batch-predictor & pyfunc-server should depend on the local code of merlin-sdk.

Publish latest tag for Merlin Docker images.

Describe the solution you'd like

Currently, all Merlin Docker images are tagged with the Merlin release version and there's no latest tag.

Describe alternatives you've considered

Publish the latest tag for all Merlin Docker images on every release.

Additional context

#514 (comment)

Do not display Version Endpoint URL value if empty in Version UI page

Current Behavior

Steps to reproduce

Deploy a model version
Go to the model version UI page
While the endpoint is still pending, the displayed endpoint URL is :predict instead of -

Specifications

Version:
Platform:
Subsystem:

Possible Solution

Do not display Version Endpoint URL value if empty in Version UI page
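
The symptom suggests that the :predict suffix is appended even when the endpoint URL is still empty. A minimal sketch of the intended display logic follows, written in Python for illustration (the UI itself is JavaScript). The function name and parameters are hypothetical.

```python
def display_endpoint_url(url, suffix=":predict", placeholder="-"):
    """Return the text to show for a version endpoint URL.

    Only append the suffix when a URL is actually available; otherwise
    show the placeholder so a pending endpoint renders as "-".
    """
    if not url:
        return placeholder
    return f"{url}{suffix}"
```

The key point is that the emptiness check happens before the suffix is appended, not after.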

Allow user to configure autoscaling

Right now, we only use the queue.sidecar.serving.knative.dev/resourcePercentage annotation to configure autoscaling, and its value is configured globally per environment.

As described here, we also cannot specify an autoscaling policy for the predictor (the model) and the transformer individually.

This issue tracks how to enable users to specify the autoscaling configuration for their models.

Add E2E test that runs Python SDK tests

The Python SDK contains test files that interact with a running Merlin server and its model cluster. Running them in GitHub Actions will let developers be more confident when introducing changes.

The current E2E test only deploys the Merlin server to the KinD cluster; there is no testing for model management, deployment, and serving.

These components need to be installed in the E2E KinD cluster:

Istio
Knative
KFServing
Vault
Spark Operator
MinIO (as a GCS replacement for the MLflow storage backend)

We also need to add the pytest-dependency library and update the decorators to make sure the tests run in the correct order.

Refactor storage package to focus on how to update instead of what to update

Is your feature request related to a problem? Please describe.

The Save method of Model Endpoint Storage in the storage package contains logic for both what to update and how to update, for both the model endpoint and the model version endpoint. This could lead to confusion about how to implement the service and storage layers in the future.

Related discussion: #483 (comment)

Describe the solution you'd like

A clear separation of, and guideline for, the responsibilities of the service and storage layers
Refactor the storage layer to be responsible for how to update the data, and let the service layer decide what to save

Rework Fetching Model Version List

Is your feature request related to a problem? Please describe.

The current implementation of the List Model Version API uses GORM preloading without any limit. This becomes an issue when the list of model versions is large enough to exceed Postgres' max_stack_depth configuration value. In most cases, users don't need to get all model versions, so filtering or a limit is needed here.

It also affects the model version UI page, since it ends up fetching all of the model versions and displaying them all at once.

Describe the solution you'd like

On the API side, it would be good to use plain SQL instead of relying on GORM. It would also be a huge improvement to allow users to specify their own filter query.

On the UI side, we can add pagination to the model version table.
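
A minimal sketch of a limited, paginated query in plain SQL is shown below. It uses Python's built-in sqlite3 module so that it is runnable on its own; the actual API is written in Go and backed by Postgres. The table name, column names, default page size, and maximum page size are all assumptions made for illustration.

```python
import sqlite3

MAX_LIMIT = 100  # assumed upper bound on page size


def list_versions(conn, model_id, limit=50, offset=0):
    """Return one page of (id, model_id) rows for a model, newest first."""
    # Clamp caller-supplied values so a single request can never fetch everything.
    limit = max(1, min(limit, MAX_LIMIT))
    offset = max(0, offset)
    return conn.execute(
        "SELECT id, model_id FROM versions "
        "WHERE model_id = ? ORDER BY id DESC LIMIT ? OFFSET ?",
        (model_id, limit, offset),
    ).fetchall()
```

The UI table could then request one page at a time by passing its page size as limit and (page number - 1) * page size as offset.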

Samples links are broken

Expected Behavior

Links to the correct updated examples.

Current Behavior

404, since the files have been moved to here

Steps to reproduce

The hyperlinks are broken.
https://github.com/gojek/merlin/tree/main/python/sdk#getting-started

Specifications

Version: -
Platform: -
Subsystem: -

Possible Solution

Are these the right links? https://github.com/gojek/merlin/tree/main/examples

Safer quick_install.sh

Is your feature request related to a problem? Please describe.

To avoid accidentally running quick_install.sh on an existing Kubernetes cluster, e.g. a production cluster with running infrastructure (istio, knative, kfserving) and Merlin installed.

Describe the solution you'd like

Add a confirmation question if the quick_install.sh script is run on a cluster that is not minikube or kind (context-aware script)
Use commands that will not overwrite existing installed components. The script should stop running if a component already exists.
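
The context check in the first point could be sketched as follows. This is written in Python for illustration; quick_install.sh itself is a shell script. It relies on the default context names that the tools create ("minikube" for minikube and "kind-<cluster name>" for kind), so clusters with custom profile or context names would need a different check. All function names are hypothetical.

```python
import subprocess


def current_context():
    """Read the active kubectl context (requires kubectl on PATH)."""
    result = subprocess.run(
        ["kubectl", "config", "current-context"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def is_local_context(context_name):
    """True for the default minikube context and for kind-created contexts."""
    return context_name == "minikube" or context_name.startswith("kind-")


def should_proceed(context_name, confirm=input):
    """Ask for confirmation unless the context looks like a local cluster."""
    if is_local_context(context_name):
        return True
    answer = confirm(
        f"Context '{context_name}' does not look like minikube or kind. "
        "Continue? [y/N] "
    )
    # Anything other than an explicit yes aborts the installation.
    return answer.strip().lower() in ("y", "yes")
```

An installer would call should_proceed(current_context()) before touching the cluster and exit if it returns False.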
