cap-ntu / ml-model-ci

MLModelCI is a complete MLOps platform for managing, converting, profiling, and deploying MLaaS (Machine Learning-as-a-Service), bridging the gap between current ML training and serving systems.

Home Page: https://mlmodelci.com

License: Apache License 2.0

Languages: Python 74.84%, Shell 1.29%, Dockerfile 2.12%, JavaScript 0.08%, HTML 0.07%, TypeScript 8.19%, CSS 0.42%, SCSS 0.13%, Jupyter Notebook 12.87%
Topics: continuous-integration, convert-models, deep-learning, dispatcher, inference, mlops, onnx, profiler, pytorch, serving, tensorflow-serving, tensorrt, tensorrt-inference-server

ml-model-ci's People

Contributors

dependabot[bot], dixing0908, huaizhengzhang, huangyz0918, lionjump0723, univerone, yuanmingleee


ml-model-ci's Issues

Current profiler fails in profiling on CPU

The current profiler fails when profiling on CPU. @huangyz0918:
https://github.com/YuanmingLeee/ML-Model-CI/blob/0cb19ee10d05666f07fc4d3197e5bd0ada61f9b5/modelci/metrics/benchmark/metric.py#L132
This line fails because there is no key named 'accelerators' when running on CPU. Please help fix this after the merge (a possible guard is sketched after the logs below).

Originally posted by @YuanmingLeee in #124

Full error logs:

Traceback (most recent call last):
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/threading.py", line 926, in _bootstrap_inner
    self.run()
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/controller/executor.py", line 83, in run
    dpr = profiler.diagnose(device=job.device)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/profiler.py", line 80, in diagnose
    result = self.inspector.run_model(server_name=self.server_name, device=device)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 132, in run_model
    val_stats = [x for x in stats[-int(SLEEP_TIME + all_data_latency):] if
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 133, in <listcomp>
    x['accelerators'][0]['duty_cycle'] is not 0]
KeyError: 0
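A minimal sketch of a possible guard, assuming the stats samples are dicts as in metric.py; it skips the GPU duty-cycle filter when no accelerator data is present (e.g. on CPU). Note also that the original comparison uses is not 0, which should be != 0.

def filter_valid_stats(stats, window):
    """Keep samples with a busy accelerator; fall back to all samples on CPU."""
    recent = stats[-window:]
    if not recent or not recent[0].get('accelerators'):
        # CPU profiling: there is no accelerator metric to filter on.
        return recent
    return [x for x in recent if x['accelerators'][0]['duty_cycle'] != 0]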

Add contribution guide and modify `CONTRIBUTING.md`

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

The Setup Environment section of CONTRIBUTING.md is out of date, and we could add a detailed contribution guide so that new contributors with little Git experience have a starting point.

Steps to Reproduce the Problem

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

README.md updates required

  • Issue 1: the Docker command in README.md is missing run.
## 2. Install MongoDB service
docker --rm -d -p 27017:27017 --name modelci-mongo mongo

It should be: docker run --rm -d -p 27017:27017 --name modelci-mongo mongo

  • Issue 2: README.md lacks necessary steps for setting up the environment.

After setting up Mongo according to the home page README.md, the user needs to source an .env file to load PORT into the system environment. Something like:

set -o allexport; source modelci/env-mongodb.env; set +o allexport

Otherwise, the user will get an error while importing the package. This should be indicated below the MongoDB setup part of the documentation.

  • Issue 3: if users want to run scripts inside modelci, they should set PYTHONPATH to the root of this project (see the sketch below). README.md should add a Contribution or Development section to document this.
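If exporting PYTHONPATH is inconvenient, a hedged alternative is to prepend the project root to sys.path at the top of the script (the number of .parent calls is an assumption about where the script lives relative to the repository root):

import sys
from pathlib import Path

# Make the modelci package importable when running a script directly.
PROJECT_ROOT = Path(__file__).resolve().parent.parent
sys.path.insert(0, str(PROJECT_ROOT))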

Converter improvement

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

We may try to adopt https://github.com/microsoft/hummingbird in our system.

Steps to Reproduce the Problem

Expected Behavior

Support for more conversions.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

ONNX converter using ONNXML tools

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Add an ONNX converter; may refer to https://github.com/onnx/onnxmltools.

Steps to Reproduce the Problem

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

[Doc] Tutorial

Roadmap

  • Register Model in the Model Database
  • Converting Model to Different Frameworks
  • Profiling Model Automatically
  • Retrieve and Deploy Model to Specific Device

Refactor Scripts

  • Replace all the shell scripts with Python scripts.
    • Docker control -> Docker Python SDK (see the sketch below)
    • MongoDB -> PyMongo
  • Remove all the DAO/DTO-related code (the DTO pattern doesn't make sense on top of an already highly encapsulated DB library like MongoDB).
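A rough sketch of the Docker Python SDK direction, equivalent to the docker run command for MongoDB used in the README (container name and port follow the README; this is not the project's actual controller code):

import docker

client = docker.from_env()
# Equivalent of: docker run --rm -d -p 27017:27017 --name modelci-mongo mongo
client.containers.run(
    'mongo',
    name='modelci-mongo',
    ports={'27017/tcp': 27017},
    detach=True,
    auto_remove=True,
)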

Duplicate key error when registering the same model more than once

This occurs when I'm retrieving the model.

Traceback (most recent call last):
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/document.py", line 412, in save
    object_id = self._save_create(doc, force_insert, write_concern)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/document.py", line 477, in _save_create
    object_id = wc_collection.insert_one(doc).inserted_id
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 698, in insert_one
    session=session),
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 612, in _insert
    bypass_doc_val, session)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 600, in _insert_one
    acknowledged, _insert_command, session)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1491, in _retryable_write
    return self._retry_with_session(retryable, func, s, None)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 1384, in _retry_with_session
    return func(session, sock_info, retryable)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/collection.py", line 597, in _insert_command
    _check_write_command_response(result)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 221, in _check_write_command_response
    _raise_last_write_error(write_errors)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 202, in _raise_last_write_error
    raise DuplicateKeyError(error.get("errmsg"), 11000, error)
pymongo.errors.DuplicateKeyError: E11000 duplicate key error collection: modelci.model_p_o index: engine_1_name_1_framework_1_version_1 dup key: { engine: 2, name: "ResNet50", framework: 1, version: 1 }

We should throw a clearer exception when trying to register an already-saved model. Lower priority; @YuanmingLeee, fix this when you are free.
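A minimal sketch of one possible fix, assuming the DuplicateKeyError in the traceback propagates to the registration call (ModelService.post_model as in the project's tracebacks); the exact exception type to raise is a project decision:

from pymongo.errors import DuplicateKeyError

def post_model_safely(model):
    """Register a model, raising a clear error if it is already registered."""
    try:
        return ModelService.post_model(model)
    except DuplicateKeyError as e:
        raise ValueError(
            f'Model {model.name} (framework={model.framework}, engine={model.engine}, '
            f'version={model.version}) is already registered.'
        ) from e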

updating the database table is required

Add new attributes or modify existing ones (a sketch follows the list below).

  • number of test batches
  • test batch size
  • overall latency
  • overall throughput
  • 50th-percentile latency
  • 95th-percentile latency
  • 99th-percentile latency
  • average GPU memory usage percentage
  • average GPU memory used
  • average GPU utilization
  • completed time
  • total GPU memory

Originally posted by @huangyz0918 in #59 (comment)
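A minimal mongoengine sketch of how these fields might look (class and field names are illustrative only, not the project's actual schema):

from mongoengine import DateTimeField, Document, FloatField, IntField

class DynamicProfileResultPO(Document):
    # Illustrative field names; the real schema lives in the persistence layer.
    batch_num = IntField()                 # number of test batches
    batch_size = IntField()                # test batch size
    latency = FloatField()                 # overall latency (s)
    throughput = FloatField()              # overall throughput (req/s)
    p50_latency = FloatField()
    p95_latency = FloatField()
    p99_latency = FloatField()
    gpu_memory_total = FloatField()        # total GPU memory (bytes)
    gpu_memory_used = FloatField()         # average GPU memory used (bytes)
    gpu_memory_utilization = FloatField()  # average GPU memory usage percentage
    gpu_utilization = FloatField()         # average GPU utilization (%)
    completed_time = DateTimeField()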

Documentation in Chinese

  • Need a Chinese version of the documentation
  • The documentation is out of date and needs to be updated.
  • Add a roadmap

Torch Serve Dockerfile Related

  • Missing package: grpcio-tools
root@21d6bab63eef:/content# python pytorch_serve.py
/miniconda//bin/python: Error while finding module specification for 'grpc_tools.protoc' (ModuleNotFoundError: No module named 'grpc_tools')
  Found existing installation: grpcio 1.16.1

MongoDB initial pwd is not set

Software and Hardware Versions

ModelCI v1.x.x,
CUDA Version vx.x.x,
GPU device used...

Problem description

After starting the service, MongoDB failed to start due to a null pwd argument.

2020-11-05 15:09:07,969 - ml-modelci Docker Container Manager - ERROR - Exception during starting MongoDB: "pwd" had the wrong type. Expected string, found null, full error: {'ok': 0.0, 'errmsg': '"pwd" had the wrong type. Expected string, found null', 'code': 14, 'codeName': 'TypeMismatch'}
2020-11-05 15:09:09,650 - ml-modelci Docker Container Manager - INFO - Container name=cadvisor-81293 started.
2020-11-05 15:09:10,795 - ml-modelci Docker Container Manager - INFO - Container name=dcgm-exporter-76327 started.
2020-11-05 15:09:11,527 - ml-modelci Docker Container Manager - INFO - gpu-metrics-exporter-86973 stared
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 782, in main
    rv = self.invoke(ctx)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 1066, in invoke
    return ctx.invoke(self.callback, **ctx.params)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 610, in invoke
    return callback(*args, **kwargs)
  File "~/project/github/ML-Model-CI/modelci/cli/__init__.py", line 36, in start
    __start__(gpu)
  File "~/project/github/ML-Model-CI/modelci/cli/__init__.py", line 28, in __start__
    if not container_conn.start():
  File "~/project/github/ML-Model-CI/modelci/utils/docker_container_manager.py", line 115, in start
    return self.connect()
  File "~/project/github/ML-Model-CI/modelci/utils/docker_container_manager.py", line 128, in connect
    self.mongo_port = all_labels[MODELCI_DOCKER_PORT_LABELS['mongo']]
KeyError: 'modelci.mongo.port'

Steps to Reproduce the Problem

import modelci.cli 
modelci.cli.start()

Expected Behavior

The program starts without error.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Setting a default password on line 7 of modelci/config.py makes this error disappear:

MONGO_PASSWORD = os.getenv('MONGO_PASSWORD')
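A hedged sketch of that workaround; the default value here is only a placeholder, not the project's actual credential:

import os

# Fall back to a non-null default so MongoDB user creation never receives a null pwd.
MONGO_PASSWORD = os.getenv('MONGO_PASSWORD', 'modelci-default-password')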

A more accurate GPU utilization algorithm

  • Make sure the batch number and batch size keep the testing process running for over 1 minute (the best way, but it is hard to control memory in our TorchScript and ONNX custom gRPC servers when increasing the amount of test data).
  • Use time.sleep() to capture the changes in GPU utilization as requests decline (see the sketch below).
  • Take the valid data from a 1-minute window (the current method, but in experiments, if batch inference lasts a very short time, like 1 to 5 seconds, cAdvisor's result is lower than the actual GPU utilization).
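A rough sketch of the time.sleep() sampling idea using pynvml rather than the project's cAdvisor pipeline (sampling window, interval, and device index are assumptions):

import time

import pynvml

def sample_gpu_utilization(duration_s=60, interval_s=0.5, device_index=0):
    """Poll GPU utilization for a fixed window and return the non-zero samples."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples = []
    end = time.time() + duration_s
    while time.time() < end:
        samples.append(pynvml.nvmlDeviceGetUtilizationRates(handle).gpu)
        time.sleep(interval_s)
    pynvml.nvmlShutdown()
    return [u for u in samples if u > 0]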

[CI] Unit Test Quickfix

Problem description

After #59, some of the APIs have changed; we should fix the unit tests in the CI checks.

Steps to Reproduce the Problem

Run pytest.

Expected Behavior

The CI builds should be successful.

Other Information

Change the Mongo update methods used in the unit tests.

Key error while registering a model using the template

I use the same template @YuanmingLeee put in /example, and I can confirm the model exists at the path given in the YAML.

The code I use

   model_path = '../resnet50_explicit_path.yml'
   register_model_from_yaml(model_path)

the error I get

Traceback (most recent call last):
  File "main.py", line 46, in <module>
    register_model_from_yaml(model_path)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 150, in register_model_from_yaml
    framework = Framework[framework.upper()]
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/enum.py", line 352, in __getitem__
    return cls._member_map_[name]
KeyError: 'PYTORCH,'
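The trailing comma in 'PYTORCH,' suggests the framework string read from the YAML is not cleaned before the enum lookup; a minimal sketch of a defensive parse (Framework is the enum from the traceback):

def parse_framework(raw: str):
    """Strip whitespace and stray punctuation before looking up the Framework enum."""
    name = raw.strip().strip(',').upper()
    try:
        return Framework[name]
    except KeyError:
        raise ValueError(f'Unknown framework {raw!r}; expected one of {[m.name for m in Framework]}')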

Start profiling only when Client is ready

Software and Hardware Versions

Problem description

The container starts, but the profiler fails to interact with the Docker container because the model needs some time to load.
Calling the profiler immediately after calling the server raises an error:

Traceback (most recent call last):
  File "/home/lym/.pycharm_helpers/pydev/pydevd.py", line 1438, in _exec
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/home/lym/.pycharm_helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 166, in <module>
    args.func(args)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 136, in export_model
    ModelExporter.ResNet50(framework)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/init_data.py", line 56, in ResNet50
    convert=export_trt
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/manager.py", line 134, in register_model
    result = profiler.diagnose()
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/profiler.py", line 48, in diagnose
    self.inspector.run_model(self.server_name)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 87, in run_model
    self.start_infer_with_time(batch)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/metrics/benchmark/metric.py", line 148, in start_infer_with_time
    self.make_request(batch_input)
  File "/home/lym/Documents/PyCharmProjects/ML-Model-CI/modelci/hub/client/tfs_client.py", line 29, in make_request
    channel = grpc.insecure_channel(FLAGS.server)
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/platform/flags.py", line 84, in __getattr__
    wrapped(_sys.argv)
  File "/home/lym/anaconda3/envs/modelci/lib/python3.7/site-packages/absl/flags/_flagvalues.py", line 633, in __call__
    name, value, suggestions=suggestions)
absl.flags._exceptions.UnrecognizedFlagError: Unknown command line flag 'model'

Steps to Reproduce the Problem

client = CVTFSClient(repeat_data=test_img_bytes, batch_size=32, batch_num=100, asynchronous=False)
container = serve(save_path=model_dir, device='cuda')
profiler = Profiler(model_info=model, server_name=container.name, inspector=client)
result = profiler.diagnose()
container.stop()

Expected Behavior

Profiling succeeds.

Other Information

We should add a check in the profiler's diagnose function to see whether the model has finished loading (a sketch follows below).
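A rough sketch of such a readiness check for the gRPC-based clients, assuming the serving container exposes a gRPC endpoint; note that a ready channel only proves the server accepts connections, not that the model itself has finished loading, so a warm-up request may still be needed:

import grpc

def wait_until_ready(address: str, timeout_s: float = 60.0) -> bool:
    """Block until the serving container accepts gRPC connections, or time out."""
    channel = grpc.insecure_channel(address)
    try:
        grpc.channel_ready_future(channel).result(timeout=timeout_s)
        return True
    except grpc.FutureTimeoutError:
        return False
    finally:
        channel.close()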

Rebuild PyTorch and ONNX Serve Docker Images

For the fix in #129, we should update the Docker images on Docker Hub.

Fixed CUDA Version

We should also pin the CUDA version in those two Docker images. Currently our benchmarking uses CUDA 10.0.130 in TensorFlow Serving and TensorRT Inference Server; the ONNX and PyTorch serving Docker images should keep the same version.

You can build FROM the base 10.0-cudnn7-tensorrt7-devel-ubuntu16.04 image (CUDA 10.0.130, NCCL 2.6.4, cuDNN 7.6.5.32, TensorRT 7.0.0.11).

Proto File Generation

Problem description

For the ONNX Runtime and TorchScript clients, we need to implement the protocol and generate Python code from it; the generation step should be added to the installation script. Otherwise, after installation, users will get an import error from the proto module when they call the ONNX or TorchScript profiler.

Solution

Add

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. modelci/hub/deployer/onnx/proto/service.proto

or

python -m grpc_tools.protoc -I. --python_out=. --grpc_python_out=. modelci/hub/deployer/pytorch/proto/service.proto

in the installation script.

Failed to start with GPU enabled

Software and Hardware Versions

ModelCI v1.x.x,
CUDA Version v10.2
GPU device used: True

Problem description

Failed to start the program with the GPU option.

Traceback (most recent call last):
  File "~/project/github/ML-Model-CI/test.py", line 4, in <module>
    modelci.cli.start(gpu=True)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 829, in __call__
    return self.main(*args, **kwargs)
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 781, in main
    with self.make_context(prog_name, args, **extra) as ctx:
  File "~/miniconda3/envs/modelci/lib/python3.7/site-packages/click/core.py", line 698, in make_context
    ctx = Context(self, info_name=info_name, parent=parent, **extra)
TypeError: __init__() got an unexpected keyword argument 'gpu'

Steps to Reproduce the Problem

import modelci.cli
modelci.cli.start(gpu=True)

Expected Behavior

The program starts without error.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

After commenting out line 25 of modelci/cli/__init__.py, the error disappears.

@cli.command()
@click.option('--gpu', default=False, type=click.BOOL, is_flag=True)
def start(gpu=False):
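Since start is a click command, it cannot be called with keyword arguments directly; a hedged sketch of two ways to invoke it programmatically instead of removing the option:

from click.testing import CliRunner

import modelci.cli

# Option 1: call the undecorated function behind the command.
modelci.cli.start.callback(gpu=True)

# Option 2: invoke it the way click would, with CLI-style arguments.
runner = CliRunner()
result = runner.invoke(modelci.cli.start, ['--gpu'])
print(result.output)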

start script run error

Software and Hardware Versions

  • ModelCI v1.x.x
  • CUDA Version vx.x.x
  • GPU device used...

Problem description

There are no install.pull_docker_images.sh and install.start_service.sh files under the scripts subfolder.

Generating gRPC code...Pulling Docker images...bash: scripts/install.pull_docker_images.sh: No such file or directory
FAIL
Starting services...bash: scripts/install.start_service.sh: No such file or directory
FAIL

Steps to Reproduce the Problem

bash scripts/install.sh

Expected Behavior

Runs without error and starts the service.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Add the relevant bash script files or modify scripts/install.sh.

Paper Website Update Required

The paper website needs to be improved before the conference.

  • Links at the top (e.g., code, paper, slides) should be correct.
  • The model list for switching between profiling results currently only supports ResNet50. For JSON-format profiling data, turn to @huangyz0918.

Pip Installation

Software and Hardware Versions

N.A.

Problem description

Request for pip installation of modelci:

pip install modelci

Steps to Reproduce the Problem

N.A.

Expected Behavior

N.A.

Other Information

N.A.

Failed to install pip package

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Failed to install the pip package of ML-Model-CI.
Here is the output:

Looking in indexes: https://pypi.tuna.tsinghua.edu.cn/simple, https://pypi.ngc.nvidia.com
Collecting git+https://github.com/cap-ntu/ML-Model-CI.git@master
  Cloning https://github.com/cap-ntu/ML-Model-CI.git (to revision master) to /tmp/pip-req-build-kz7ui669
    ERROR: Command errored out with exit status 1:
     command: ~/miniconda3/envs/modelci/bin/python3.7 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-req-build-kz7ui669/setup.py'"'"'; __file__='"'"'/tmp/pip-req-build-kz7ui669/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-pip-egg-info-lj6ykhe5
         cwd: /tmp/pip-req-build-kz7ui669/
    Complete output (11 lines):
    When trying to extract ~/tmp/tensorrtserver/tritonis.client.tar.gz, an exception raised: file could not be opened successfully
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-req-build-kz7ui669/setup.py", line 98, in <module>
        triton_client_package = install_triton_client()
      File "/tmp/pip-req-build-kz7ui669/setup.py", line 81, in install_triton_client
        tar_file = tarfile.open(save_name, mode='r')
      File "~/miniconda3/envs/modelci/lib/python3.7/tarfile.py", line 1578, in open
        raise ReadError("file could not be opened successfully")
    tarfile.ReadError: file could not be opened successfully
    Re-download from https://github.com/triton-inference-server/server/releases/download/v1.8.0/v1.8.0_ubuntu2012.clients.tar.gz
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Steps to Reproduce the Problem

# need to install requests package first
pip install setuptools requests==2.23.0
# then install modelci
pip install git+https://github.com/cap-ntu/ML-Model-CI.git@master

Expected Behavior

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

It seems this link, https://github.com/triton-inference-server/server/releases/download/v1.8.0/v1.8.0_ubuntu2012.clients.tar.gz, has no file to download; I think this may be the main cause.
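A hedged sketch of how the setup step could validate the download before extracting, so a broken release link fails with a clear message instead of a tarfile.ReadError (function name and chunk size are illustrative):

import tarfile

import requests

def download_triton_client(url, save_name):
    """Download the Triton client tarball and verify it is a readable tar file."""
    response = requests.get(url, stream=True, timeout=60)
    response.raise_for_status()  # fail fast on 404s and broken release links
    with open(save_name, 'wb') as f:
        for chunk in response.iter_content(chunk_size=8192):
            f.write(chunk)
    if not tarfile.is_tarfile(save_name):
        raise RuntimeError(f'{url} did not return a valid tar archive')
    return tarfile.open(save_name, mode='r')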

Issues with the ONNX and Torch Dockerfiles

  • conda not found issue

This issue appears in both the ONNX GPU and TorchScript GPU images; CPU works fine.

ONNX

Setting up manpages-dev (4.04-2) ...
Processing triggers for libc-bin (2.23-0ubuntu11) ...
Removing intermediate container 2578e1707b90
 ---> 2ca135fa916c
Step 14/18 : RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && chmod +x /miniconda.sh  && /miniconda.sh -b -p /miniconda  && rm /miniconda.sh
 ---> Running in 7517cb331d4c
Removing intermediate container 7517cb331d4c
 ---> 6a934d6c7863
Step 15/18 : RUN conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch  && pip install onnxruntime-gpu==0.5.0  && conda clean -ya  && rm -rf ~/.cache/pip
 ---> Running in b005a9115f30
/bin/sh: 1: conda: not found
The command '/bin/sh -c conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch  && pip install onnxruntime-gpu==0.5.0  && conda clean -ya  && rm -rf ~/.cache/pip' returned a non-zero code: 127

Torch

Setting up curl (7.47.0-1ubuntu2.14) ...
Removing intermediate container 132b5b26abb2
 ---> 85f3482379c7
Step 14/17 : RUN curl -so /miniconda.sh https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh  && chmod +x /miniconda.sh  && /miniconda.sh -b -p /miniconda  && rm /miniconda.sh
 ---> Running in 6048b6dbf121
Removing intermediate container 6048b6dbf121
 ---> cdf464dd44f5
Step 15/17 : RUN conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch-nightly -c conda-forge  && conda clean -ya  && rm -rf ~/.cache/pip
 ---> Running in e4ed8a340cfd
/bin/sh: 1: conda: not found
The command '/bin/sh -c conda env update --name base -f /content/environment.yml  && conda install -y pytorch torchvision cudatoolkit=${CUDA} -c pytorch-nightly -c conda-forge  && conda clean -ya  && rm -rf ~/.cache/pip' returned a non-zero code: 127
  • No class file in the config folder

Check out here and your README.md.

Migrate Metrics

After PR #87 is merged, we should start migrating the metrics here, and continue working here.

MongoDB connect error: Authentication failed

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

There is always a pymongo authentication failure when I run pytest; here is part of the error message:

__________________________________________ test_delete_model ___________________________________________

    def test_delete_model():
>       model = ModelService.get_models('ResNet50')[0]

tests/test_model_service.py:180: 
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 
modelci/persistence/service.py:45: in get_models
    model_pos = cls.__model_DAO.get_models(**kwargs)
modelci/persistence/model_dao.py:66: in get_models
    return ModelDO.objects(**kwargs).order_by('name', 'framework', 'engine', '-version')
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/queryset/manager.py:37: in __get__
    queryset = queryset_class(owner, owner._get_collection())
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/document.py:211: in _get_collection
    db = cls._get_db()
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/document.py:189: in _get_db
    return get_db(cls._meta.get("db_alias", DEFAULT_CONNECTION_NAME))
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/mongoengine/connection.py:369: in get_db
    conn_settings["username"], conn_settings["password"], **auth_kwargs
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/database.py:1495: in authenticate
    connect=True)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/mongo_client.py:781: in _cache_credentials
    sock_info.authenticate(credentials)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/pool.py:810: in authenticate
    auth.authenticate(credentials, self)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:673: in authenticate
    auth_func(credentials, sock_info)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:589: in _authenticate_default
    return _authenticate_scram(credentials, sock_info, 'SCRAM-SHA-256')
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/auth.py:333: in _authenticate_scram
    res = sock_info.command(source, cmd)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/pool.py:694: in command
    exhaust_allowed=exhaust_allowed)
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/network.py:162: in command
    parse_write_concern_error=parse_write_concern_error)
_ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ _ 

response = {'code': 18, 'codeName': 'AuthenticationFailed', 'errmsg': 'Authentication failed.', 'ok': 0.0}
max_wire_version = 9, msg = '%s', allowable_errors = None, parse_write_concern_error = False

    def _check_command_response(response, max_wire_version, msg=None,
                                allowable_errors=None,
                                parse_write_concern_error=False):
        """Check the response to a command for errors.
        """
        if "ok" not in response:
            # Server didn't recognize our message as a command.
            raise OperationFailure(response.get("$err"),
                                   response.get("code"),
                                   response,
                                   max_wire_version)
    
        if parse_write_concern_error and 'writeConcernError' in response:
            _raise_write_concern_error(response['writeConcernError'])
    
        if not response["ok"]:
    
            details = response
            # Mongos returns the error details in a 'raw' object
            # for some errors.
            if "raw" in response:
                for shard in itervalues(response["raw"]):
                    # Grab the first non-empty raw error from a shard.
                    if shard.get("errmsg") and not shard.get("ok"):
                        details = shard
                        break
    
            errmsg = details["errmsg"]
            if (allowable_errors is None
                    or (errmsg not in allowable_errors
                        and details.get("code") not in allowable_errors)):
    
                code = details.get("code")
                # Server is "not master" or "recovering"
                if code in _NOT_MASTER_CODES:
                    raise NotMasterError(errmsg, response)
                elif ("not master" in errmsg
                      or "node is recovering" in errmsg):
                    raise NotMasterError(errmsg, response)
    
                # Server assertion failures
                if errmsg == "db assertion failure":
                    errmsg = ("db assertion failure, assertion: '%s'" %
                              details.get("assertion", ""))
                    raise OperationFailure(errmsg,
                                           details.get("assertionCode"),
                                           response,
                                           max_wire_version)
    
                # Other errors
                # findAndModify with upsert can raise duplicate key error
                if code in (11000, 11001, 12582):
                    raise DuplicateKeyError(errmsg, code, response,
                                            max_wire_version)
                elif code == 50:
                    raise ExecutionTimeout(errmsg, code, response,
                                           max_wire_version)
                elif code == 43:
                    raise CursorNotFound(errmsg, code, response,
                                         max_wire_version)
    
                msg = msg or "%s"
                raise OperationFailure(msg % errmsg, code, response,
>                                      max_wire_version)
E               pymongo.errors.OperationFailure: Authentication failed., full error: {'ok': 0.0, 'errmsg': 'Authentication failed.', 'code': 18, 'codeName': 'AuthenticationFailed'}

../../../miniconda3/envs/modelci/lib/python3.7/site-packages/pymongo/helpers.py:168: OperationFailure
=========================================== warnings summary ===========================================
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/tensorflow_core/python/pywrap_tensorflow_internal.py:15: DeprecationWarning: the imp module is deprecated in favour of importlib; see the module's documentation for alternative uses
    import imp

../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
../../../miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/h5py/_hl/base.py:19: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    from collections import (Mapping, MutableMapping, KeysView,

test_converter.py::test_xgboost_to_onnx
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/onnx/helper.py:220: DeprecationWarning: Using or importing the ABCs from 'collections' instead of from 'collections.abc' is deprecated since Python 3.3,and in 3.9 it will stop working
    is_iterable = isinstance(value, collections.Iterable)

test_converter.py::test_xgboost_to_torch
test_converter.py::test_onnx_to_pytorch
  ~/miniconda3/envs/modelci/lib/python3.7/site-packages/torch/nn/modules/container.py:434: UserWarning: Setting attributes on ParameterList is not supported.
    warnings.warn("Setting attributes on ParameterList is not supported.")

-- Docs: https://docs.pytest.org/en/stable/warnings.html
======================================= short test summary info ========================================
FAILED tests/test_model_service.py::test_register_model - pymongo.errors.OperationFailure: Authentica...
FAILED tests/test_model_service.py::test_get_model_by_name - pymongo.errors.OperationFailure: Authent...
FAILED tests/test_model_service.py::test_get_model_by_task - pymongo.errors.OperationFailure: Authent...
FAILED tests/test_model_service.py::test_get_model_by_id - pymongo.errors.OperationFailure: Authentic...
FAILED tests/test_model_service.py::test_update_model - pymongo.errors.OperationFailure: Authenticati...
FAILED tests/test_model_service.py::test_register_static_profiling_result - pymongo.errors.OperationF...
FAILED tests/test_model_service.py::test_register_dynamic_profiling_result - pymongo.errors.Operation...
FAILED tests/test_model_service.py::test_update_dynamic_profiling_result - pymongo.errors.OperationFa...
FAILED tests/test_model_service.py::test_delete_dynamic_profiling_result - pymongo.errors.OperationFa...
FAILED tests/test_model_service.py::test_delete_model - pymongo.errors.OperationFailure: Authenticati...
============================== 10 failed, 9 passed, 9 warnings in 20.78s ===============================

Steps to Reproduce the Problem

source scripts/setup_env.sh
python -m pytest tests/

Expected Behavior

All the tests should pass.

Other Information

Things you tried, stack traces, related issues, suggestions on how to fix it...

Actually, some MongoDB connection tests pass, but I cannot locate the database connection part in the ModelDAO class, so I'm still stuck on it.
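For reference, a minimal sketch of how a mongoengine connection with authentication is typically established; the environment variable names follow the env-mongodb.env convention and are assumptions, not the project's exact code:

import os

from mongoengine import connect

connect(
    db=os.getenv('MONGO_DB', 'modelci'),
    host=os.getenv('MONGO_HOST', 'localhost'),
    port=int(os.getenv('MONGO_PORT', 27017)),
    username=os.getenv('MONGO_USERNAME'),
    password=os.getenv('MONGO_PASSWORD'),
    authentication_source=os.getenv('MONGO_AUTH_SOURCE', 'admin'),
)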

CI Required

  • Unit Tests Check
  • Build Check
  • Slack Integration

Model version inconsistency issue

Traceback (most recent call last):
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_objects.py", line 69, in __init__
    ver = int(ver_string)
ValueError: invalid literal for int() with base 10: '1.zip'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "main.py", line 48, in <module>
    diagnoser.init_model_info(architecture_name='ResNet50', framework=Framework.PYTORCH, engine=Engine.TORCHSCRIPT)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/diagnoser.py", line 91, in init_model_info
    framework=framework, engine=engine)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 265, in retrieve_model_by_name
    return get_remote_model_weight(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 225, in get_remote_model_weight
    save_path = generate_path(model.name, model.framework, model.engine, model.weight.filename)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/utils.py", line 45, in generate_path
    version = ModelVersion(str(version))
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_objects.py", line 71, in __init__
    raise ValueError('invalid value for version string, expected a number, got {}'.format(ver_string))
ValueError: invalid value for version string, expected a number, got 1.zip

I get this error after loading the model from the template YAML successfully and registering it successfully, but when I call

    diagnoser.init_model_info(architecture_name='ResNet50', framework=Framework.PYTORCH, engine=Engine.TORCHSCRIPT)
    print(diagnoser.model_info)

this error occurs. It seems we have mishandled '1.zip', which should be converted to '1' as the model version.

    def init_model_info(self, architecture_name, framework, engine):
        """
        init the model information before testing, should be called before calling diagnose.
        By model name and optionally filtered by model framework and(or) model engine
        """
        self.model_path, self.model_info = retrieve_model_by_name(architecture_name=architecture_name, 
                                                    framework=framework, engine=engine)
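A minimal sketch of the likely fix: strip the file extension from the weight filename before constructing ModelVersion (ModelVersion is the project class from the traceback):

from pathlib import Path

def parse_version_from_filename(filename: str):
    """'1.zip' -> ModelVersion('1'); a plain '1' passes through unchanged."""
    return ModelVersion(Path(filename).stem)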

Auto-script issues

  • On a machine without MongoDB installed beforehand, the first run of sh scripts/start_service.sh gets a connection refused error; it needs to be run again to succeed.
  • Not sure why, but executing bash scripts/setup_env.sh has no effect. I'm using zsh, but this issue exists in bash as well.

Update Dynamic Profile Result Attributes

Problem description

Please refer to here; we should update the database structure according to the data we can get from the profiler.

here is an example:

batch size: 8
tested device: Nvidia P4
model: ResNet50
serving engine: TensorFlow Serving

all_batch_latency:  37.82002019882202 sec
all_batch_throughput:  169.2225431492324  req/sec
overall 50th-percentile latency: 0.04665029048919678 s
overall 95th-percentile latency: 0.0504256248474121 s
overall 99th-percentile latency: 0.052218921184539795 s
total GPU memory: 7981694976.0 bytes
average GPU memory usage percentile: 0.9726
average GPU memory used: 7763132416.0 bytes
average GPU utilization: 66.6216%

Some attributes are missing now; I can update the DB in the profiler after this improvement.

Generate cloud service k8s deployment script

Software and Hardware Versions

ModelCI v1.x.x, CUDA Version vx.x.x, GPU device used...

Problem description

Generate a cloud service deployment YAML file for a k8s cluster from a template, based on the current dispatcher.

Steps to Reproduce the Problem

Expected Behavior

Generate correct yaml file:

  • init container for retrieving model from cloud storage. e.g. S3 bucket
  • container with correct image based on model framework and device info
  • correct environment variables

Other Information

Use Jinja2 as the framework for rendering the template (a sketch follows below).
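A rough sketch of the Jinja2 direction; the template body, image names, and variables are illustrative, not the dispatcher's actual output:

from jinja2 import Template

DEPLOYMENT_TEMPLATE = Template('''
apiVersion: apps/v1
kind: Deployment
metadata:
  name: {{ model_name }}-serving
spec:
  replicas: {{ replicas }}
  template:
    spec:
      initContainers:
        - name: model-downloader          # pulls weights from cloud storage, e.g. S3
          image: amazon/aws-cli
          args: ["s3", "cp", "{{ model_uri }}", "/models/"]
      containers:
        - name: serving                   # image chosen from framework and device info
          image: {{ serving_image }}
          env:
            - name: MODEL_NAME
              value: "{{ model_name }}"
''')

yaml_text = DEPLOYMENT_TEMPLATE.render(
    model_name='resnet50',
    replicas=1,
    model_uri='s3://my-bucket/resnet50/1.zip',
    serving_image='tensorflow/serving:latest-gpu',
)
print(yaml_text)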

onnxruntime: how to specify a GPU device?

I'm afraid the issue is that we cannot specify which GPU device to test on. Currently, we limit GPU usage by setting os.environ["CUDA_VISIBLE_DEVICES"] = "0" in the server, but I don't think that's a good idea: the CPU joins the serving, and I don't see any task when running nvidia-smi, which means no task is bound to the GPU.

reference
microsoft/onnxruntime#331
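A hedged sketch of per-session device selection available in more recent onnxruntime releases, which may be a cleaner alternative to CUDA_VISIBLE_DEVICES; whether device_id provider options are supported depends on the onnxruntime-gpu version in use:

import onnxruntime as ort

# Bind this session to GPU 1 via CUDAExecutionProvider options, with CPU as fallback.
session = ort.InferenceSession(
    'model.onnx',
    providers=[('CUDAExecutionProvider', {'device_id': 1}), 'CPUExecutionProvider'],
)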

Issues with init_data.py, ONNX and TensorFlow

Issue 1: using TensorFlow

python init_data.py export --model ResNet50 --framework TensorFlow

log

Traceback (most recent call last):
  File "init_data.py", line 149, in <module>
    args.func(args)
  File "init_data.py", line 124, in export_model
    ModelExporter.ResNet50(framework)
  File "init_data.py", line 44, in ResNet50
    version=ModelVersion(version)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 96, in register_model
    ModelService.post_model(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/service.py", line 82, in post_model
    model_po = model.to_model_po()
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_bo.py", line 94, in to_model_po
    self.weight.weight, filename=self.weight.filename, content_type=self.weight.content_type
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1758, in put
    self.grid_id = self.fs.put(file_obj, **kwargs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/gridfs/__init__.py", line 130, in put
    grid_file.write(data)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/gridfs/grid_file.py", line 386, in write
    to_write = read(self.chunk_size)
ValueError: read of closed file

Issue 2: using PyTorch

command

python init_data.py export --model ResNet50 --framework PyTorch

log

Traceback (most recent call last):
  File "init_data.py", line 149, in <module>
    args.func(args)
  File "init_data.py", line 124, in export_model
    ModelExporter.ResNet50(framework)
  File "init_data.py", line 57, in ResNet50
    version=ModelVersion(version)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 77, in register_model
    outputs=outputs
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 123, in _generate_model_family
    ONNXConverter.from_torch_module(model, onnx_dir, inputs, max_batch_size)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/converter.py", line 88, in from_torch_module
    **export_kwargs
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/__init__.py", line 148, in export
    strip_doc_string, dynamic_axes, keep_initializers_as_inputs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/utils.py", line 66, in export
    dynamic_axes=dynamic_axes, keep_initializers_as_inputs=keep_initializers_as_inputs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/utils.py", line 406, in _export
    _set_opset_version(opset_version)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/torch/onnx/symbolic_helper.py", line 437, in _set_opset_version
    raise ValueError("Unsupported ONNX opset version: " + str(opset_version))
ValueError: Unsupported ONNX opset version: -1
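For issue 2, the 'Unsupported ONNX opset version: -1' error suggests an invalid opset value reaches torch.onnx.export; a minimal sketch of an explicit export with a supported opset (opset 11 is an assumption, not the project's configured value):

import torch
import torchvision

model = torchvision.models.resnet50(pretrained=True).eval()
dummy_input = torch.randn(1, 3, 224, 224)
# Passing a supported opset explicitly avoids the invalid default reaching _set_opset_version.
torch.onnx.export(model, dummy_input, 'resnet50.onnx', opset_version=11)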

pymongo.errors.OperationFailure: Authentication failed. ERROR

Traceback (most recent call last):
  File "main.py", line 46, in <module>
    register_model_from_yaml(model_path)
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 167, in register_model_from_yaml
    no_generate=no_generate,
  File "/home/hyz/workspace/ML-Model-CI/modelci/hub/manager.py", line 103, in register_model
    ModelService.post_model(model)
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/service.py", line 82, in post_model
    model_po = model.to_model_po()
  File "/home/hyz/workspace/ML-Model-CI/modelci/persistence/bo/model_bo.py", line 94, in to_model_po
    self.weight.weight, filename=self.weight.filename, content_type=self.weight.content_type
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1758, in put
    self.grid_id = self.fs.put(file_obj, **kwargs)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/fields.py", line 1729, in fs
    self._fs = gridfs.GridFS(get_db(self.db_alias), self.collection_name)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/mongoengine/connection.py", line 369, in get_db
    conn_settings["username"], conn_settings["password"], **auth_kwargs
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/database.py", line 1471, in authenticate
    connect=True)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/mongo_client.py", line 755, in _cache_credentials
    sock_info.authenticate(credentials)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/pool.py", line 730, in authenticate
    auth.authenticate(credentials, self)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 564, in authenticate
    auth_func(credentials, sock_info)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 539, in _authenticate_default
    return _authenticate_scram(credentials, sock_info, 'SCRAM-SHA-1')
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/auth.py", line 263, in _authenticate_scram
    res = sock_info.command(source, cmd)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/pool.py", line 613, in command
    user_fields=user_fields)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/network.py", line 167, in command
    parse_write_concern_error=parse_write_concern_error)
  File "/home/hyz/local/anaconda3/envs/hy/lib/python3.7/site-packages/pymongo/helpers.py", line 159, in _check_command_response
    raise OperationFailure(msg % errmsg, code, response)
pymongo.errors.OperationFailure: Authentication failed.

make sure that you have started a MongoDB service and configured the MongoDB environment

Yes, I have executed the two scripts in README.md; the same issue occurs when using register_model.
