cap-ntu / ml-model-ci

MLModelCI is a complete MLOps platform for managing, converting, profiling, and deploying MLaaS (Machine Learning-as-a-Service), bridging the gap between current ML training and serving systems.

Home Page: https://mlmodelci.com

License: Apache License 2.0

Languages: Python 74.84%, Shell 1.29%, Dockerfile 2.12%, JavaScript 0.08%, HTML 0.07%, TypeScript 8.19%, CSS 0.42%, SCSS 0.13%, Jupyter Notebook 12.87%
Topics: continuous-integration, convert-models, deep-learning, dispatcher, inference, mlops, onnx, profiler, pytorch, serving, tensorflow-serving, tensorrt, tensorrt-inference-server

ml-model-ci's People

Contributors

dependabot[bot], dixing0908, huaizhengzhang, huangyz0918, lionjump0723, univerone, yuanmingleee


ml-model-ci's Issues

Current profiler fails in profiling on CPU

The current profiler fails when profiling on CPU. @huangyz0918:
https://github.com/YuanmingLeee/ML-Model-CI/blob/0cb19ee10d05666f07fc4d3197e5bd0ada61f9b5/modelci/metrics/benchmark/metric.py#L132
This line fails because there is no key named 'accelerators' when running on CPU. Please help fix this after the merge (a possible guard is sketched after the logs below).

Originally posted by @YuanmingLeee in #124

Full error logs:

Traceback (most recent call last):
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/controller/executor.py", line 83, in run
    dpr = profiler.diagnose(device=job.device)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/profiler.py", line 80, in diagnose
    result = self.inspector.run_model(server_name=self.server_name, device=device)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 132, in run_model
    val_stats = [x for x in stats[-int(SLEEP_TIME + all_data_latency):] if
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 133, in <listcomp>
    x['accelerators'][0]['duty_cycle'] is not 0]
KeyError: 0
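A minimal sketch of a possible guard, assuming the stats samples are dicts as in metric.py; it skips the GPU duty-cycle filter when no accelerator data is present (e.g. on CPU). Note also that the original comparison uses is not 0, which should be != 0.

def filter_valid_stats(stats, window):
    """Keep samples with a busy accelerator; fall back to all samples on CPU."""
    recent = stats[-window:]
    if not recent or not recent[0].get('accelerators'):
        # CPU profiling: there is no accelerator metric to filter on.
        return recent
    return [x for x in recent if x['accelerators'][0]['duty_cycle'] != 0]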

Add contribution guide and modify `CONTRIBUTING.md`

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

The Setup Environment section of CONTRIBUTING.md is out of date, and we could add a detailed contribution guide so that new contributors with little Git experience have a starting point.

Steps to Reproduce the Problem

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

README.md updates required

  • Issue 1: the Docker command in README.md is missing run.
## 2. Install MongoDB service
docker --rm -d -p 27017:27017 --name modelci-mongo mongo

It should be: docker run --rm -d -p 27017:27017 --name modelci-mongo mongo

  • Issue 2: README.md lacks necessary steps for setting up the environment.

After setting up Mongo according to the home page README.md, the user needs to source an .env file to load PORT into the system environment. Something like:

set -o allexport; source modelci/env-mongodb.env; set +o allexport

Otherwise, the user will get an error while importing the package. This should be indicated below the MongoDB setup part of the documentation.

  • Issue 3: if users want to run scripts inside modelci, they should set PYTHONPATH to the root of this project (see the sketch below). README.md should add a Contribution or Development section to document this.
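If exporting PYTHONPATH is inconvenient, a hedged alternative is to prepend the project root to sys.path at the top of the script (the number of .parent calls is an assumption about where the script lives relative to the repository root):

import sys
from pathlib import Path

# Make the modelci package importable when running a script directly.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))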

Converter improvement

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

We may try to adopt https://github.com/microsoft/hummingbird in our system.

Steps to Reproduce the Problem

Expected Behavior

Support for more conversions.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

ONNX converter using ONNXML tools

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Add an ONNX converter; may refer to https://github.com/onnx/onnxmltools.

Steps to Reproduce the Problem

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

[Doc] Tutorial

Roadmap

  • Register Model in the Model Database
  • Converting Model to Different Frameworks
  • Profiling Model Automatically
  • Retrieve and Deploy Model to Specific Device

Refactor Scripts

  • Replace all the shell scripts with Python scripts.
    • Docker control -> Docker Python SDK (see the sketch below)
    • MongoDB -> PyMongo
  • Remove all the DAO/DTO-related code (the DTO pattern doesn't make sense on top of an already highly encapsulated DB library like MongoDB).
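A rough sketch of the Docker Python SDK direction, equivalent to the docker run command for MongoDB used in the README (container name and port follow the README; this is not the project's actual controller code):

import docker

client = docker.from_env()
# Equivalent of: docker run --rm -d -p 27017:27017 --name modelci-mongo mongo
client.containers.run(
    'mongo',
    name='modelci-mongo',
    ports={'27017/tcp': 27017},
    detach=True,
    auto_remove=True,
)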

Duplicate key error when registering the same model more than once

This occurs when I'm retrieving the model.

Traceback (most recent call last):
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/document.py", line 412, in save
    object_id = self._save_create(doc, force_insert, write_concern)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/document.py", line 477, in _save_create
    object_id = wc_collection.insert_one(doc).inserted_id
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 698, in insert_one
    session=session),
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 597, in _insert_command
    _check_write_command_response(result)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 221, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 202, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: modelci.model_p_o index: engine_1_name_1_framework_1_version_1 dup key: { engine: 2, name: "ResNet50", framework: 1, version: 1 }

We should throw a clearer exception when trying to register an already-saved model. Lower priority; @YuanmingLeee, fix this when you are free.
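A minimal sketch of one possible fix, assuming the DuplicateKeyError in the traceback propagates to the registration call (ModelService.post_model as in the project's tracebacks); the exact exception type to raise is a project decision:

from pymongo.errors import DuplicateKeyError

def post_model_safely(model):
    """Register a model, raising a clear error if it is already registered."""
    try:
        return ModelService.post_model(model)
    except DuplicateKeyError as e:
        raise ValueError(
            f'Model {model.name} (framework={model.framework}, engine={model.engine}, '
            f'version={model.version}) is already registered.'
        ) from e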

updating the database table is required

Add new attributes or modify existing ones (a sketch follows the list below).

  • number of test batches
  • test batch size
  • overall latency
  • overall throughput
  • 50th-percentile latency
  • 95th-percentile latency
  • 99th-percentile latency
  • average GPU memory usage percentage
  • average GPU memory used
  • average GPU utilization
  • completed time
  • total GPU memory

Originally posted by @huangyz0918 in #59 (comment)
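A minimal mongoengine sketch of how these fields might look (class and field names are illustrative only, not the project's actual schema):

from mongoengine import DateTimeField, Document, FloatField, IntField

class DynamicProfileResultPO(Document):
    # Illustrative field names; the real schema lives in the persistence layer.
    batch_num = IntField()                 # number of test batches
    batch_size = IntField()                # test batch size
    latency = FloatField()                 # overall latency (s)
    throughput = FloatField()              # overall throughput (req/s)
    p50_latency = FloatField()
    p95_latency = FloatField()
    p99_latency = FloatField()
    gpu_memory_total = FloatField()        # total GPU memory (bytes)
    gpu_memory_used = FloatField()         # average GPU memory used (bytes)
    gpu_memory_utilization = FloatField()  # average GPU memory usage percentage
    gpu_utilization = FloatField()         # average GPU utilization (%)
    completed_time = DateTimeField()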

Documentation in Chinese

  • Need a Chinese version of the documentation
  • The documentation is out of date and needs to be updated.
  • Add a roadmap

Torch Serve Dockerfile Related

  • Missing package: grpcio-tools
root@21d6bab63eef:/content# python pytorch_serve.py
/miniconda//bin/python: Error while finding module specification for 'grpc_tools.protoc' (ModuleNotFoundError: No module named 'grpc_tools')
  Found existing installation: grpcio 1.16.1

MongoDB initial pwd is not set

Software and Hardware Versions

ModelCI v1.x.x,
CUDA Version vx.x.x,
GPU device used...

Problem description

After starting the service, MongoDB failed to start due to a null pwd argument.

2020-11-05 15:09:07,969 - ml-modelci Docker Container Manager - ERROR - Exception during starting MongoDB: "pwd" had the wrong type. Expected string, found null, full error: {'ok': 0.0, 'errmsg': '"pwd" had the wrong type. Expected string, found null', 'code': 14, 'codeName': 'TypeMismatch'}
2020-11-05 15:09:09,650 - ml-modelci Docker Container Manager - INFO - Container name=cadvisor-81293 started.
2020-11-05 15:09:10,795 - ml-modelci Docker Container Manager - INFO - Container name=dcgm-exporter-76327 started.
2020-11-05 15:09:11,527 - ml-modelci Docker Container Manager - INFO - gpu-metrics-exporter-86973 stared
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "~/project/github/ML-Model-CI/modelci/cli/__init__.py", line 36, in start
    __start__(gpu)
  File "~/project/github/ML-Model-CI/modelci/cli/__init__.py", line 28, in __start__
    if not container_conn.start():
  File "~/project/github/ML-Model-CI/modelci/utils/docker_container_manager.py", line 115, in start
    return self.connect()
  File "~/project/github/ML-Model-CI/modelci/utils/docker_container_manager.py", line 128, in connect
    self.mongo_port = all_labels[MODELCI_DOCKER_PORT_LABELS['mongo']]
KeyError: 'modelci.mongo.port'

Steps to Reproduce the Problem

import modelci.cli 
modelci.cli.start()

Expected Behavior

The program starts without error.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Setting a default password on line 7 of modelci/config.py makes this error disappear:

MONGO_PASSWORD = os.getenv('MONGO_PASSWORD')
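A hedged sketch of that workaround; the default value here is only a placeholder, not the project's actual credential:

import os

# Fall back to a non-null default so MongoDB user creation never receives a null pwd.
MONGO_PASSWORD = os.getenv('MONGO_PASSWORD', 'modelci-default-password')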

A more accurate GPU utilization algorithm

  • Make sure the batch number and batch size keep the testing process running for over 1 minute (the best way, but it is hard to control memory in our TorchScript and ONNX custom gRPC servers when increasing the amount of test data).
  • Use time.sleep() to capture the changes in GPU utilization as requests decline (see the sketch below).
  • Take the valid data from a 1-minute window (the current method, but in experiments, if batch inference lasts a very short time, like 1 to 5 seconds, cAdvisor's result is lower than the actual GPU utilization).
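A rough sketch of the time.sleep() sampling idea using pynvml rather than the project's cAdvisor pipeline (sampling window, interval, and device index are assumptions):

import time

import pynvml

def sample_gpu_utilization(duration_s=60, interval_s=0.5, device_index=0):
    """Poll GPU utilization for a fixed window and return the non-zero samples."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return [u for u in samples if u > 0]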

[CI] Unit Test Quickfix

Problem description

After #59, some of the APIs have changed; we should fix the unit tests in the CI checks.

Steps to Reproduce the Problem

Run pytest.

Expected Behavior

The CI builds should be successful.

Other Information

Change the Mongo update methods used in the unit tests.

Key error while registering a model using the template

I use the same template @YuanmingLeee put in /example, and I can confirm the model exists at the path given in the YAML.

The code I use

   model_path = '../resnet50_explicit_path.yml'
   register_model_from_yaml(model_path)

the error I get

Traceback (most recent call last):
  File "main.py", line 46, in <module>
    register_model_from_yaml(model_path)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 150, in register_model_from_yaml
    framework = Framework[framework.upper()]
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/enum.py", line 352, in __getitem__
    return cls._member_map_[name]
KeyError: 'PYTORCH,'
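The trailing comma in 'PYTORCH,' suggests the framework string read from the YAML is not cleaned before the enum lookup; a minimal sketch of a defensive parse (Framework is the enum from the traceback):

def parse_framework(raw: str):
    """Strip whitespace and stray punctuation before looking up the Framework enum."""
    name = raw.strip().strip(',').upper()
    try:
        return Framework[name]
    except KeyError:
        raise ValueError(f'Unknown framework {raw!r}; expected one of {[m.name for m in Framework]}')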

Start profiling only when Client is ready

Software and Hardware Versions

Problem description

The container starts, but the profiler fails to interact with the Docker container because the model needs some time to load.
Calling the profiler immediately after calling the server raises an error:

Traceback (most recent call last):
  File "/home/lym/.pycharm_helpers/pydev/pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/lym/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 166, in <module>
    args.func(args)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 136, in export_model
    ModelExporter.ResNet50(framework)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 56, in ResNet50
    convert=export_trt
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/manager.py", line 134, in register_model
    result = profiler.diagnose()
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/profiler.py", line 48, in diagnose
    self.inspector.run_model(self.server_name)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 87, in run_model
    self.start_infer_with_time(batch)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 148, in start_infer_with_time
    self.make_request(batch_input)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/client/tfs_client.py", line 29, in make_request
    channel = grpc.insecure_channel(FLAGS.server)
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/platform/flags.py", line 84, in __getattr__
    wrapped(_sys.argv)
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/site-packages/absl/flags/_flagvalues.py", line 633, in __call__
    name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'model'

Steps to Reproduce the Problem

client = CVTFSClient(repeat_data=test_img_bytes, batch_size=32, batch_num=100, asynchronous=False)
container = serve(save_path=model_dir, device='cuda')
profiler = Profiler(model_info=model, server_name=container.name, inspector=client)
result = profiler.diagnose()
container.stop()

Expected Behavior

Profiling succeeds.

Other Information

We should add a check in the profiler's diagnose function to see whether the model has finished loading (a sketch follows below).
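A rough sketch of such a readiness check for the gRPC-based clients, assuming the serving container exposes a gRPC endpoint; note that a ready channel only proves the server accepts connections, not that the model itself has finished loading, so a warm-up request may still be needed:

import grpc

def wait_until_ready(address: str, timeout_s: float = 60.0) -> bool:
    """Block until the serving container accepts gRPC connections, or time out."""
    channel = grpc.insecure_channel(address)
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout_s)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()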

Rebuild PyTorch and ONNX Serve Docker Images

For the fix in #129, we should update the Docker images on Docker Hub.

Fixed CUDA Version

We should also pin the CUDA version in those two Docker images. Currently our benchmarking uses CUDA 10.0.130 in TensorFlow Serving and TensorRT Inference Server; the ONNX and PyTorch serving Docker images should keep the same version.

You can build FROM the base 10.0-cudnn7-tensorrt7-devel-ubuntu16.04 image (CUDA 10.0.130, NCCL 2.6.4, cuDNN 7.6.5.32, TensorRT 7.0.0.11).

Proto File Generation

Problem description

For the ONNX Runtime and TorchScript clients, we need to implement the protocol and generate Python code from it; the generation step should be added to the installation script. Otherwise, after installation, users will get an import error from the proto module when they call the ONNX or TorchScript profiler.

Solution

Add

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. modelci/hub/deployer/onnx/proto/service.proto

or

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. modelci/hub/deployer/pytorch/proto/service.proto

in the installation script.

Failed to start with GPU enabled

Software and Hardware Versions

ModelCI v1.x.x,
CUDA Version v10.2
GPU device used: True

Problem description

Failed to start the program with the GPU option.

Traceback (most recent call last):
  File "~/project/github/ML-Model-CI/test.py", line 4, in <module>
    modelci.cli.start(gpu=True)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 781, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 698, in make_context
    ctx = Context(self, info_name=info_name, parent=parent, **extra)
TypeError: __init__() got an unexpected keyword argument 'gpu'

Steps to Reproduce the Problem

import modelci.cli
modelci.cli.start(gpu=True)

Expected Behavior

The program starts without error.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

After commenting out line 25 of modelci/cli/__init__.py, the error disappears.

@cli.command()
@click.option('--gpu', default=False, type=click.BOOL, is_flag=True)
def start(gpu=False):
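Since start is a click command, it cannot be called with keyword arguments directly; a hedged sketch of two ways to invoke it programmatically instead of removing the option:

from click.testing import CliRunner

import modelci.cli

# Option 1: call the undecorated function behind the command.
modelci.cli.start.callback(gpu=True)

# Option 2: invoke it the way click would, with CLI-style arguments.
runner = CliRunner()
result = runner.invoke(modelci.cli.start, ['--gpu'])
print(result.output)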

start script run error

Software and Hardware Versions

  • ModelCI v1.x.x
  • CUDA Version vx.x.x
  • GPU device used...

Problem description

There are no install.pull_docker_images.sh and install.start_service.sh files under the scripts subfolder.

Generating gRPC code...Pulling Docker images...bash: scripts/install.pull_docker_images.sh: No such file or directory
FAIL
Starting services...bash: scripts/install.start_service.sh: No such file or directory
FAIL

Steps to Reproduce the Problem

bash scripts/install.sh

Expected Behavior

Runs without error and starts the service.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Add the relevant bash script files or modify scripts/install.sh.

Paper Website Update Required

The paper website needs to be improved before the conference.

  • Links at the top (e.g., code, paper, slides) should be correct.
  • The model list for switching between profiling results currently only supports ResNet50. For JSON-format profiling data, turn to @huangyz0918.

Pip Installation

Software and Hardware Versions

N.A.

Problem description

Request for pip installation of modelci:

pip install modelci

Steps to Reproduce the Problem

N.A.

Expected Behavior

N.A.

Other Information

N.A.

Failed to install pip package

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Failed to install the pip package of ML-Model-CI.
Here is the output:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/cap-ntu/ML-Model-CI.git@master
  Cloning https://github.com/cap-ntu/ML-Model-CI.git (to revision master) to /tmp/pip-req-build-kz7ui669
    ERROR: Command errored out with exit status 1:
     command: ~/miniconda3/envs/modelci/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-kz7ui669/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-kz7ui669/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-lj6ykhe5
         cwd: /tmp/pip-req-build-kz7ui669/
    Complete output (11 lines):
    When trying to extract ~/tmp/tensorrtserver/tritonis.client.tar.gz, an exception raised: file could not be opened successfully
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-kz7ui669/setup.py", line 98, in <module>
        triton_client_package = install_triton_client()
      File "/tmp/pip-req-build-kz7ui669/setup.py", line 81, in install_triton_client
        tar_file = tarfile.open(save_name, mode='r')
      File "~/miniconda3/envs/modelci/lib/python3.7/tarfile.py", line 1578, in open
        raise ReadError("file could not be opened successfully")
    tarfile.ReadError: file could not be opened successfully
    Re-download from https://github.com/triton-inference-server/server/releases/download/v1.8.0/v1.8.0_ubuntu2012.clients.tar.gz
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Steps to Reproduce the Problem

# need to install requests package first
pip install setuptools requests==2.23.0
# then install modelci
pip install git+https://github.com/cap-ntu/ML-Model-CI.git@master

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

It seems this link, https://github.com/triton-inference-server/server/releases/download/v1.8.0/v1.8.0_ubuntu2012.clients.tar.gz, has no file to download; I think this may be the main cause.
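A hedged sketch of how the setup step could validate the download before extracting, so a broken release link fails with a clear message instead of a tarfile.ReadError (function name and chunk size are illustrative):

import tarfile

import requests

def download_triton_client(url, save_name):
    """Download the Triton client tarball and verify it is a readable tar file."""
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()  # fail fast on 404s and broken release links
    with open(save_name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    if not tarfile.is_tarfile(save_name):
        raise RuntimeError(f'{url} did not return a valid tar archive')
    return tarfile.open(save_name, mode='r')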

Issues with the ONNX and Torch Dockerfiles

  • conda not found issue

This issue appears in both the ONNX GPU and TorchScript GPU images; CPU works fine.

ONNX

Setting up manpages-dev (4.04-2) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Removing intermediate container 2578e1707b90
 ---> 2ca135fa916c
Step 14/18 : RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && chmod +x /miniconda.sh  && /miniconda.sh -b -p /miniconda  && rm /miniconda.sh
 ---> Running in 7517cb331d4c
Removing intermediate container 7517cb331d4c
 ---> 6a934d6c7863
Step 15/18 : RUN conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch  && pip install onnxruntime-gpu==0.5.0  && conda clean -ya  && rm -rf ~/.cache/pip
 ---> Running in b005a9115f30
/bin/sh: 1: conda: not found
The command '/bin/sh -c conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch  && pip install onnxruntime-gpu==0.5.0  && conda clean -ya  && rm -rf ~/.cache/pip' returned a non-zero code: 127

Torch

Setting up curl (7.47.0-1ubuntu2.14) ...
Removing intermediate container 132b5b26abb2
 ---> 85f3482379c7
Step 14/17 : RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && chmod +x /miniconda.sh  && /miniconda.sh -b -p /miniconda  && rm /miniconda.sh
 ---> Running in 6048b6dbf121
Removing intermediate container 6048b6dbf121
 ---> cdf464dd44f5
Step 15/17 : RUN conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch-nightly -c conda-forge  && conda clean -ya  && rm -rf ~/.cache/pip
 ---> Running in e4ed8a340cfd
/bin/sh: 1: conda: not found
The command '/bin/sh -c conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch-nightly -c conda-forge  && conda clean -ya  && rm -rf ~/.cache/pip' returned a non-zero code: 127
  • No class file in the config folder

Check out here and your README.md.

Migrate Metrics

After PR #87 is merged, we should start migrating the metrics here, and continue working here.

MongoDB connect error: Authentication failed

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

There is always a pymongo authentication failure when I run pytest; here is part of the error message:

__________________________________________ test_delete_model ___________________________________________

    def test_delete_model():
>       model = ModelService.get_models('ResNet50')[0]

tests/test_model_service.py:180: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modelci/persistence/service.py:45: in get_models
    model_pos = cls.__model_DAO.get_models(**kwargs)
modelci/persistence/model_dao.py:66: in get_models
    return ModelDO.objects(**kwargs).order_by('name', 'framework', 'engine', '-version')
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/queryset/manager.py:37: in __get__
    queryset = queryset_class(owner, owner._get_collection())
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/document.py:211: in _get_collection
    db = cls._get_db()
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/document.py:189: in _get_db
    return get_db(cls._meta.get("db_alias", DEFAULT_CONNECTION_NAME))
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/connection.py:369: in get_db
    conn_settings["username"], conn_settings["password"], **auth_kwargs
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/database.py:1495: in authenticate
    connect=True)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/mongo_client.py:781: in _cache_credentials
    sock_info.authenticate(credentials)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/pool.py:810: in authenticate
    auth.authenticate(credentials, self)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:673: in authenticate
    auth_func(credentials, sock_info)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:589: in _authenticate_default
    return _authenticate_scram(credentials, sock_info, 'SCRAM-SHA-256')
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:333: in _authenticate_scram
    res = sock_info.command(source, cmd)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/pool.py:694: in command
    exhaust_allowed=exhaust_allowed)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/network.py:162: in command
    parse_write_concern_error=parse_write_concern_error)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = {'code': 18, 'codeName': 'AuthenticationFailed', 'errmsg': 'Authentication failed.', 'ok': 0.0}
max_wire_version = 9, msg = '%s', allowable_errors = None, parse_write_concern_error = False

    def _check_command_response(response, max_wire_version, msg=None,
                                allowable_errors=None,
                                parse_write_concern_error=False):
        """Check the response to a command for errors.
        """
        if "ok" not in response:
            # Server didn't recognize our message as a command.
            raise OperationFailure(response.get("$err"),
                                   response.get("code"),
                                   response,
                                   max_wire_version)
    
        if parse_write_concern_error and 'writeConcernError' in response:
            _raise_write_concern_error(response['writeConcernError'])
    
        if not response["ok"]:
    
            details = response
            # Mongos returns the error details in a 'raw' object
            # for some errors.
            if "raw" in response:
                for shard in itervalues(response["raw"]):
                    # Grab the first non-empty raw error from a shard.
                    if shard.get("errmsg") and not shard.get("ok"):
                        details = shard
                        break
    
            errmsg = details["errmsg"]
            if (allowable_errors is None
                    or (errmsg not in allowable_errors
                        and details.get("code") not in allowable_errors)):
    
                code = details.get("code")
                # Server is "not master" or "recovering"
                if code in _NOT_MASTER_CODES:
                    raise NotMasterError(errmsg, response)
                elif ("not master" in errmsg
                      or "node is recovering" in errmsg):
                    raise NotMasterError(errmsg, response)
    
                # Server assertion failures
                if errmsg == "db assertion failure":
                    errmsg = ("db assertion failure, assertion: '%s'" %
                              details.get("assertion", ""))
                    raise OperationFailure(errmsg,
                                           details.get("assertionCode"),
                                           response,
                                           max_wire_version)
    
                # Other errors
                # findAndModify with upsert can raise duplicate key error
                if code in (11000, 11001, 12582):
                    raise DuplicateKeyError(errmsg, code, response,
                                            max_wire_version)
                elif code == 50:
                    raise ExecutionTimeout(errmsg, code, response,
                                           max_wire_version)
                elif code == 43:
                    raise CursorNotFound(errmsg, code, response,
                                         max_wire_version)
    
                msg = msg or "%s"
                raise OperationFailure(msg % errmsg, code, response,
>                                      max_wire_version)
E               pymongo.errors.OperationFailure: Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18, 'codeName': 'AuthenticationFailed'}

../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/helpers.py:168: OperationFailure
=========================================== warnings summary ===========================================
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import (Mapping, MutableMapping, KeysView,

test_converter.py::test_xgboost_to_onnx
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/onnx/helper.py:220: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    is_iterable = isinstance(value, collections.Iterable)

test_converter.py::test_xgboost_to_torch
test_converter.py::test_onnx_to_pytorch
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/torch/nn/modules/container.py:434: UserWarning: Setting attributes on ParameterList is not supported.
    warnings.warn("Setting attributes on ParameterList is not supported.")

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================= short test summary info ========================================
FAILED tests/test_model_service.py::test_register_model - pymongo.errors.OperationFailure: Authentica...
FAILED tests/test_model_service.py::test_get_model_by_name - pymongo.errors.OperationFailure: Authent...
FAILED tests/test_model_service.py::test_get_model_by_task - pymongo.errors.OperationFailure: Authent...
FAILED tests/test_model_service.py::test_get_model_by_id - pymongo.errors.OperationFailure: Authentic...
FAILED tests/test_model_service.py::test_update_model - pymongo.errors.OperationFailure: Authenticati...
FAILED tests/test_model_service.py::test_register_static_profiling_result - pymongo.errors.OperationF...
FAILED tests/test_model_service.py::test_register_dynamic_profiling_result - pymongo.errors.Operation...
FAILED tests/test_model_service.py::test_update_dynamic_profiling_result - pymongo.errors.OperationFa...
FAILED tests/test_model_service.py::test_delete_dynamic_profiling_result - pymongo.errors.OperationFa...
FAILED tests/test_model_service.py::test_delete_model - pymongo.errors.OperationFailure: Authenticati...
============================== 10 failed, 9 passed, 9 warnings in 20.78s ===============================

Steps to Reproduce the Problem

source scripts/setup_env.sh
python -m pytest tests/

Expected Behavior

All the tests should pass.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Actually, some MongoDB connection tests pass, but I cannot locate the database connection part in the ModelDAO class, so I'm still stuck on it.
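For reference, a minimal sketch of how a mongoengine connection with authentication is typically established; the environment variable names follow the env-mongodb.env convention and are assumptions, not the project's exact code:

import os

from mongoengine import connect

connect(
    db=os.getenv('MONGO_DB', 'modelci'),
    host=os.getenv('MONGO_HOST', 'localhost'),
    port=int(os.getenv('MONGO_PORT', 27017)),
    username=os.getenv('MONGO_USERNAME'),
    password=os.getenv('MONGO_PASSWORD'),
    authentication_source=os.getenv('MONGO_AUTH_SOURCE', 'admin'),
)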

CI Required

  • Unit Tests Check
  • Build Check
  • Slack Integration

Model version inconsistency issue

Traceback (most recent call last):
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_objects.py", line 69, in __init__
    ver = int(ver_string)
ValueError: invalid literal for int() with base 10: '1.zip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 48, in <module>
    diagnoser.init_model_info(architecture_name='ResNet50', framework=Framework.PYTORCH, engine=Engine.TORCHSCRIPT)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/diagnoser.py", line 91, in init_model_info
    framework=framework, engine=engine)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 265, in retrieve_model_by_name
    return get_remote_model_weight(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 225, in get_remote_model_weight
    save_path = generate_path(model.name, model.framework, model.engine, model.weight.filename)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/utils.py", line 45, in generate_path
    version = ModelVersion(str(version))
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_objects.py", line 71, in __init__
    raise ValueError('invalid value for version string, expected a number, got {}'.format(ver_string))
ValueError: invalid value for version string, expected a number, got 1.zip

I get this error after loading the model from the template YAML successfully and registering it successfully, but when I call

    diagnoser.init_model_info(architecture_name='ResNet50', framework=Framework.PYTORCH, engine=Engine.TORCHSCRIPT)
    print(diagnoser.model_info)

this error occurs. It seems we have mishandled '1.zip', which should be converted to '1' as the model version.

    def init_model_info(self, architecture_name, framework, engine):
        """
        init the model information before testing, should be called before calling diagnose.
        By model name and optionally filtered by model framework and(or) model engine
        """
        self.model_path, self.model_info = retrieve_model_by_name(architecture_name=architecture_name, 
                                                    framework=framework, engine=engine)
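A minimal sketch of the likely fix: strip the file extension from the weight filename before constructing ModelVersion (ModelVersion is the project class from the traceback):

from pathlib import Path

def parse_version_from_filename(filename: str):
    """'1.zip' -> ModelVersion('1'); a plain '1' passes through unchanged."""
    return ModelVersion(Path(filename).stem)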

Auto-script issues

  • On a machine without MongoDB installed beforehand, the first run of sh scripts/start_service.sh gets a connection refused error; it needs to be run again to succeed.
  • Not sure why, but executing bash scripts/setup_env.sh has no effect. I'm using zsh, but this issue exists in bash as well.

Update Dynamic Profile Result Attributes

Problem description

Please refer to here; we should update the database structure according to the data we can get from the profiler.

here is an example:

batch size: 8
tested device: Nvidia P4
model: ResNet50
serving engine: TensorFlow Serving

all_batch_latency:  37.82002019882202 sec
all_batch_throughput:  169.2225431492324  req/sec
overall 50th-percentile latency: 0.04665029048919678 s
overall 95th-percentile latency: 0.0504256248474121 s
overall 99th-percentile latency: 0.052218921184539795 s
total GPU memory: 7981694976.0 bytes
average GPU memory usage percentile: 0.9726
average GPU memory used: 7763132416.0 bytes
average GPU utilization: 66.6216%

Some attributes are missing now; I can update the DB in the profiler after this improvement.

Generate cloud service k8s deployment script

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Generate a cloud service deployment YAML file for a k8s cluster from a template, based on the current dispatcher.

Steps to Reproduce the Problem

Expected Behavior

Generate correct yaml file:

  • init container for retrieving model from cloud storage. e.g. S3 bucket
  • container with correct image based on model framework and device info
  • correct environment variables

Other Information

Use Jinja2 as the framework for rendering the template (a sketch follows below).
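A rough sketch of the Jinja2 direction; the template body, image names, and variables are illustrative, not the dispatcher's actual output:

from jinja2 import Template

DEPLOYMENT_TEMPLATE = Template('''
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ model_name }}-serving
spec:
  replicas: {{ replicas }}
  template:
    spec:
      initContainers:
        - name: model-downloader          # pulls weights from cloud storage, e.g. S3
          image: amazon/aws-cli
          args: ["s3", "cp", "{{ model_uri }}", "/models/"]
      containers:
        - name: serving                   # image chosen from framework and device info
          image: {{ serving_image }}
          env:
            - name: MODEL_NAME
              value: "{{ model_name }}"
''')

yaml_text = DEPLOYMENT_TEMPLATE.render(
    model_name='resnet50',
    replicas=1,
    model_uri='s3://my-bucket/resnet50/1.zip',
    serving_image='tensorflow/serving:latest-gpu',
)
print(yaml_text)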

onnxruntime: how to specify a GPU device?

I'm afraid the issue is that we cannot specify which GPU device to test on. Currently, we limit GPU usage by setting os.environ["CUDA_VISIBLE_DEVICES"] = "0" in the server, but I don't think that's a good idea: the CPU joins the serving, and I don't see any task when running nvidia-smi, which means no task is bound to the GPU.

reference
microsoft/onnxruntime#331
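A hedged sketch of per-session device selection available in more recent onnxruntime releases, which may be a cleaner alternative to CUDA_VISIBLE_DEVICES; whether device_id provider options are supported depends on the onnxruntime-gpu version in use:

import onnxruntime as ort

# Bind this session to GPU 1 via CUDAExecutionProvider options, with CPU as fallback.
session = ort.InferenceSession(
    'model.onnx',
    providers=[('CUDAExecutionProvider', {'device_id': 1}), 'CPUExecutionProvider'],
)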

Issues with init_data.py, ONNX and TensorFlow

Issue 1: using TensorFlow

python init_data.py export --model ResNet50 --framework TensorFlow

log

Traceback (most recent call last):
  File "init_data.py", line 149, in <module>
    args.func(args)
  File "init_data.py", line 124, in export_model
    ModelExporter.ResNet50(framework)
  File "init_data.py", line 44, in ResNet50
    version=ModelVersion(version)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 96, in register_model
    ModelService.post_model(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/service.py", line 82, in post_model
    model_po = model.to_model_po()
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_bo.py", line 94, in to_model_po
    self.weight.weight, filename=self.weight.filename, content_type=self.weight.content_type
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1758, in put
    self.grid_id = self.fs.put(file_obj, **kwargs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/gridfs/__init__.py", line 130, in put
    grid_file.write(data)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/gridfs/grid_file.py", line 386, in write
    to_write = read(self.chunk_size)
ValueError: read of closed file

Issue 2: using PyTorch

command

python init_data.py export --model ResNet50 --framework PyTorch

log

Traceback (most recent call last):
  File "init_data.py", line 149, in <module>
    args.func(args)
  File "init_data.py", line 124, in export_model
    ModelExporter.ResNet50(framework)
  File "init_data.py", line 57, in ResNet50
    version=ModelVersion(version)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 77, in register_model
    outputs=outputs
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 123, in _generate_model_family
    ONNXConverter.from_torch_module(model, onnx_dir, inputs, max_batch_size)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/converter.py", line 88, in from_torch_module
    **export_kwargs
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/__init__.py", line 148, in export
    strip_doc_string, dynamic_axes, keep_initializers_as_inputs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/utils.py", line 66, in export
    dynamic_axes=dynamic_axes, keep_initializers_as_inputs=keep_initializers_as_inputs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/utils.py", line 406, in _export
    _set_opset_version(opset_version)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/symbolic_helper.py", line 437, in _set_opset_version
    raise ValueError("Unsupported ONNX opset version: " + str(opset_version))
ValueError: Unsupported ONNX opset version: -1
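For issue 2, the 'Unsupported ONNX opset version: -1' error suggests an invalid opset value reaches torch.onnx.export; a minimal sketch of an explicit export with a supported opset (opset 11 is an assumption, not the project's configured value):

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
# Passing a supported opset explicitly avoids the invalid default reaching _set_opset_version.
torch.onnx.export(model, dummy_input, 'resnet50.onnx', opset_version=11)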

pymongo.errors.OperationFailure: Authentication failed. ERROR

Traceback (most recent call last):
  File "main.py", line 46, in <module>
    register_model_from_yaml(model_path)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 167, in register_model_from_yaml
    no_generate=no_generate,
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 103, in register_model
    ModelService.post_model(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/service.py", line 82, in post_model
    model_po = model.to_model_po()
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_bo.py", line 94, in to_model_po
    self.weight.weight, filename=self.weight.filename, content_type=self.weight.content_type
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1758, in put
    self.grid_id = self.fs.put(file_obj, **kwargs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1729, in fs
    self._fs = gridfs.GridFS(get_db(self.db_alias), self.collection_name)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/connection.py", line 369, in get_db
    conn_settings["username"], conn_settings["password"], **auth_kwargs
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/database.py", line 1471, in authenticate
    connect=True)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 755, in _cache_credentials
    sock_info.authenticate(credentials)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/pool.py", line 730, in authenticate
    auth.authenticate(credentials, self)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 564, in authenticate
    auth_func(credentials, sock_info)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 539, in _authenticate_default
    return _authenticate_scram(credentials, sock_info, 'SCRAM-SHA-1')
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 263, in _authenticate_scram
    res = sock_info.command(source, cmd)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/pool.py", line 613, in command
    user_fields=user_fields)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/network.py", line 167, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 159, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Authentication failed.

make sure that you have started a MongoDB service and configured the MongoDB environment

Yes, I have executed the two scripts in README.md; the same issue occurs when using register_model.
