cornerstone-ondemand / modelkit
Toolkit for developing and maintaining ML models
Home Page: http://cornerstone-ondemand.github.io/modelkit/
License: MIT License
After diving into the code, modelkit lists the following regex for asset names:
GENERIC_ASSET_NAME_RE = (
r"(([A-Z]:\\)|/)?[a-zA-Z0-9]([a-zA-Z0-9\-\_\.\/\\]*[a-zA-Z0-9])?"
)
However, I uploaded a file named something like file__id==version, and == is invalid according to this regex. The file could still be uploaded, but when the asset is specified in a model config, modelkit freezes while trying to resolve it and no error is thrown. The asset also cannot be downloaded via code.
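As an illustration, a minimal check of a name against the regex quoted above (the helper function is made up; only the regex comes from modelkit's source) shows the mismatch:

```python
import re

# the asset-name regex quoted above (copied from modelkit's source)
GENERIC_ASSET_NAME_RE = (
    r"(([A-Z]:\\)|/)?[a-zA-Z0-9]([a-zA-Z0-9\-\_\.\/\\]*[a-zA-Z0-9])?"
)

def is_valid_asset_name(name: str) -> bool:
    """Hypothetical helper: a name must fully match the regex,
    which is effectively what asset resolution requires."""
    return re.fullmatch(GENERIC_ASSET_NAME_RE, name) is not None

is_valid_asset_name("my-asset/file.bin")   # allowed characters only
is_valid_asset_name("file__id==version")   # '==' is not in the character class
```

Validating names with such a check at upload time would turn the silent freeze into an early, explicit error.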
Having both is confusing; I don't see why a single one wouldn't be enough.
If two models have the same configuration name, modelkit will silently use one of them. It should at least warn the user, but maybe we just want to raise an exception?
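A sketch of duplicate detection at library build time (not modelkit's actual internals; the function and class names are made up):

```python
# Hypothetical sketch: raise on duplicate configuration names instead of
# silently keeping one of the duplicates.
def collect_configurations(models):
    configurations = {}
    for model in models:
        for name in model.CONFIGURATIONS:
            if name in configurations:
                raise ValueError(
                    f"Configuration {name!r} is defined by both "
                    f"{configurations[name].__name__} and {model.__name__}"
                )
            configurations[name] = model
    return configurations

# two illustrative model classes that collide on a configuration name
class ModelA:
    CONFIGURATIONS = {"shared_name": {}}

class ModelB:
    CONFIGURATIONS = {"shared_name": {}}
```

A warning instead of a ValueError would be the softer variant mentioned above.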
During the asset-creation development process, or if you want to embed assets in a container at build time, modelkit can be configured to use a StorageProvider with a LocalStorageDriver and an assets dir pointing to the same location as the local storage. This allows using only local files, without duplicating them on the local file system. However, modelkit still requires write permission on the filesystem, because lock-file management is not disabled in this configuration. The idea is to disable lock-file management in this case: that is safe, since we are guaranteed not to download anything in this configuration.
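A sketch of such a configuration via environment variables (the variable names match those used elsewhere in these issues; the `local` provider value is an assumption):

```shell
# point the assets dir and the local storage bucket at the same directory
export MODELKIT_STORAGE_PROVIDER=local
export MODELKIT_STORAGE_BUCKET=/opt/assets
export MODELKIT_ASSETS_DIR=/opt/assets
```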
As of today, assets can only be versioned with numbers (e.g. 1.4):
https://cornerstone-ondemand.github.io/modelkit/assets/managing_assets/
modelkit should support date versioning (e.g. 2021-10-16) to support organisations where models are trained on a monthly/weekly/daily basis.
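A sketch of what accepting both schemes could look like (the function and the exact patterns are assumptions, not modelkit's code):

```python
import datetime
import re

# hypothetical: accept either the current numeric scheme or an ISO date
VERSION_RE = re.compile(r"^\d+(\.\d+)?$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def parse_version(version: str):
    if VERSION_RE.match(version):
        return ("numeric", version)
    if DATE_RE.match(version):
        # fromisoformat also validates that the date actually exists
        return ("date", datetime.date.fromisoformat(version))
    raise ValueError(f"Unsupported version: {version!r}")
```

Date versions sort naturally as strings (ISO 8601), which keeps "latest version" resolution simple.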
It would help reduce the number of dependencies.
Modelkit pinned pydantic < 2 a few months ago, since the 2.x versions break BaseSettings. A previous issue written by @victorbenichoux (see #189) and the corresponding PR were merged to pin pydantic < 2 as a quick fix. Nonetheless, as stated by @victorbenichoux (again) and backed by several profiling initiatives throughout the years, MODELKIT_ENABLE_VALIDATION does not work with complex data types (it attempts to recursively and manually recreate the object without validating it in the first place), yet still comes with the reconstruction overhead. As such, migrating modelkit to pydantic 2 seems like a no-brainer, especially since it is now at 2.4.x, with a whole summer of patches, fixes and improvements!
It would be a good opportunity to bump a major version as well, to celebrate the huge step forward in terms of speed (finally!).
@victorbenichoux would you still be happy to propose a PR, or to discuss it? :-)
When configuring LocalStorage with MODELKIT_ASSETS_DIR = MODELKIT_STORAGE_BUCKET/MODELKIT_STORAGE_PREFIX, _fetch will delete the local asset without being able to re-download it, since the two configurations overlap.
Bumping pydantic to version >= 2.0 breaks modelkit: BaseSettings has been moved to a separate package:
pydantic.errors.PydanticImportError: `BaseSettings` has been moved to the `pydantic-settings` package. See https://docs.pydantic.dev/2.0.2/migration/#basesettings-has-moved-to-pydantic-settings for more details.
Installing pydantic-settings as suggested is not enough, and it will keep failing with other errors.
In fact, pydantic >= 2.0 comes with a large set of breaking changes, and since we use relatively advanced features during payload validation (e.g. create_model, etc.), the migration will likely be quite hard. OTOH, pydantic 2.0 is substantially faster than 1.x, and input validation accounts for most of the overhead at predict time, so the migration is very much worth it for modelkit.
Do you have any plans to organize the migration? If not, I am happy to propose a PR.
In the meantime, we should pin pydantic<2.0 in setup.cfg.
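The interim pin could look like this in setup.cfg (a sketch; the exact section layout depends on modelkit's existing setup.cfg):

```ini
[options]
install_requires =
    pydantic<2.0
```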
modelkit does not seem to be installable on M1 Macs with Apple silicon. Installation fails because the numpy version is somehow pinned to 1.19.3, which cannot be installed on M1 Macs.
Currently we can only push from local to remote. The CLI could also support pushing from remote to local, or from remote to remote (restricted to remotes with the same credentials).
The current code suffers from a method resolution order (MRO) issue. In the Asset class, the __init__ method does not call super().__init__(), so the superclass is never properly initialized, leading to potential errors or undesired behavior when inheriting from Asset. To ensure that the MRO is respected and the superclass is initialized correctly, super().__init__() must be called in the __init__ method of the Asset class.
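A generic Python sketch (not modelkit's actual classes) of why the cooperative call matters: without super().__init__(), classes later in the MRO never run their initializers.

```python
class Mixin:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.mixin_ready = True

class Asset:
    def __init__(self, **kwargs):
        # without this call, Mixin.__init__ below would never run
        super().__init__(**kwargs)
        self.asset_ready = True

class Combined(Asset, Mixin):
    """MRO: Combined -> Asset -> Mixin -> object."""

combined = Combined()
```

Dropping the super() call in Asset.__init__ would leave combined without the mixin_ready attribute, which is exactly the class of bug described above.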
Here is an asset that has a model in its dependencies. When I get the asset, ModelLibrary doesn't run the model's _load
method.
class MyModel(Model):
    CONFIGURATIONS = {"my_model": {}}

    def _load(self):
        self.data = "my model data"

    def _predict(self, item):
        return self.data


class MyAsset(Asset):
    CONFIGURATIONS = {"my_asset": {"model_dependencies": {"my_model"}}}

    def _load(self):
        model = self.model_dependencies["my_model"]
        self.data = model("")


dataset_library = ModelLibrary(
    required_models={"my_asset": {}},
    models=[MyModel, MyAsset],
    settings={"lazy_loading": True},
)
dataset_library.get("my_asset").data
This returns the following error: AttributeError: 'MyModel' object has no attribute 'data' (the attribute is created in the model's _load method).
Currently, asset new or asset update pushes data from local to MODELKIT_STORAGE_PROVIDER:
https://cornerstone-ondemand.github.io/modelkit/assets/managing_assets/#create-a-new-asset
which means the asset must be in local storage to be pushed. It could be interesting to be able to push an asset directly from remote to remote without writing to the local disk.
Note: the feature seems to be partially implemented for GCS, but it just downloads the asset locally and re-pushes it:
modelkit/modelkit/assets/cli.py
Line 201 in 6e71fe7
(not sure whether it still works)
So maybe at least implement an automatic re-download + push for S3, or better, find a way to do it directly (not sure whether that is possible, at least between two remotes sharing the same credentials).
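A sketch of the driver-level idea, streaming an object between two remotes without materializing it on local disk. The driver interface and the in-memory stand-in are made up for illustration:

```python
import io

class MemoryDriver:
    """In-memory stand-in for a remote storage driver (illustration only)."""
    def __init__(self):
        self.objects = {}

    def open_read(self, name):
        return io.BytesIO(self.objects[name])

    def upload_stream(self, name, stream):
        self.objects[name] = stream.read()

def push_remote_to_remote(source, dest, name):
    # stream the object from one remote to the other, no local file involved
    with source.open_read(name) as stream:
        dest.upload_stream(name, stream)
```

With S3 specifically, a server-side copy between buckets reachable with the same credentials would avoid even the streaming hop.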
Currently, when one example in a prediction batch raises, the whole call breaks and the error is raised. We may want another option, such as returning all the batch results except the failing ones (set to None or to the exception), together with a mask or something similar.
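A sketch of the proposed behavior (this is not a modelkit API, just an illustration of the failing-items-plus-mask idea):

```python
# Run the batch item by item, keep successful results, replace failures
# with the exception, and return a mask of which items succeeded.
def predict_batch_masked(predict, items):
    results, mask = [], []
    for item in items:
        try:
            results.append(predict(item))
            mask.append(True)
        except Exception as exc:
            results.append(exc)
            mask.append(False)
    return results, mask

# e.g. with a predictor that fails on zero
results, mask = predict_batch_masked(lambda x: 1 / x, [1, 0, 2])
```

The caller can then decide whether to re-raise, drop, or log the masked failures.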
Is there an easy way to override an asset in a config, specifically for development / non-production purposes? Ideally, a way to do this on the local machine by copying an existing config and setting the asset path to a local directory, even if it previously pointed to a remote path.
Sometimes it makes sense to group helper utilities together in the library search path. However, we don't always want these files to be loaded if, for example, the model itself is not used. This could be done with a common naming convention, e.g. prefixing files that should be ignored with a standard prefix: library._my_module would be ignored and skipped over (following the private-name convention, with the underscore indicating a module that should not be exposed publicly), while library.my_module would be traversed and loaded.
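A sketch of the skipping rule using pkgutil (the helper is hypothetical; it is applied to the stdlib importlib package here purely for demonstration):

```python
import importlib
import pkgutil

def iter_public_modules(package):
    """Hypothetical convention: skip submodules whose name starts with
    an underscore, mirroring Python's private-name convention."""
    for mod in pkgutil.iter_modules(package.__path__):
        if mod.name.startswith("_"):
            continue
        yield mod.name

# e.g. listing importlib's public submodules skips _bootstrap and friends
public = list(iter_public_modules(importlib))
```

modelkit's model discovery could apply the same filter before importing modules from the configured library path.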
When loading a pretrained saved_model from TensorFlow Hub with TensorflowModel, some variables seem to be garbage-collected by TensorFlow. Predictions may then lead to errors like:
FailedPreconditionError: Could not find variable _AnonymousVar523. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status=Not found: Resource localhost/_AnonymousVar523/N10tensorflow3VarE does not exist.
Keeping a reference to the saved model seems to solve the issue.
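A plain-Python analogy of the symptom (no TensorFlow involved): if no strong reference to the loaded object is kept, it can be collected, and later lookups fail. Keeping the loaded SavedModel on the model instance is the corresponding fix.

```python
import gc
import weakref

class Variable:
    """Stand-in for an object owned by a loaded model."""

def load_without_keeping_reference():
    variable = Variable()
    return weakref.ref(variable)  # only a weak reference escapes

ref = load_without_keeping_reference()
gc.collect()
ref()  # the "variable" is gone once nothing strongly references it
```

The fix in the model would be along the lines of assigning the loaded object to an attribute in _load (e.g. self._saved_model = ...), so the instance itself keeps it alive.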
I had this error running pytest in python 3.9.5
File "/Users/clemat/.pyenv/versions/3.9.5/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "/Users/clemat/.pyenv/versions/3.9.5/envs/modelkit/lib/python3.9/site-packages/_pytest/assertion/rewrite.py", line 170, in exec_module
exec(co, module.__dict__)
File "/Users/clemat/dev/modelkit/tests/assets/conftest.py", line 18, in <module>
from modelkit.assets.manager import AssetsManager
File "/Users/clemat/dev/modelkit/modelkit/__init__.py", line 3, in <module>
from modelkit.core.library import ModelLibrary, load_model # NOQA
File "/Users/clemat/dev/modelkit/modelkit/core/__init__.py", line 1, in <module>
from modelkit.core.library import ModelLibrary, load_model
File "/Users/clemat/dev/modelkit/modelkit/core/library.py", line 35, in <module>
from modelkit.core.model import Asset, AsyncModel, Model
File "/Users/clemat/dev/modelkit/modelkit/core/model.py", line 31, in <module>
from modelkit.utils.cache import Cache, CacheItem
File "/Users/clemat/dev/modelkit/modelkit/utils/cache.py", line 15, in <module>
class CacheItem(NamedTuple, Generic[ItemType]):
File "/Users/clemat/.pyenv/versions/3.9.5/lib/python3.9/typing.py", line 1881, in _namedtuple_mro_entries
raise TypeError("Multiple inheritance with NamedTuple is not supported")
TypeError: Multiple inheritance with NamedTuple is not supported
======================================================================================= short test summary info ========================================================================================
ERROR - TypeError: Multiple inheritance with NamedTuple is not supported
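A hypothetical workaround sketch: on Python 3.9/3.10, `class CacheItem(NamedTuple, Generic[ItemType])` is rejected with exactly the TypeError above, while a frozen generic dataclass provides similar immutable-record semantics without the restriction (field names here are illustrative):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

ItemType = TypeVar("ItemType")

# A frozen dataclass can be generic on all supported Python versions,
# unlike a NamedTuple with extra base classes on 3.9/3.10.
@dataclass(frozen=True)
class CacheItem(Generic[ItemType]):
    item: Optional[ItemType] = None

cached = CacheItem(item="value")
```

The trade-off is losing tuple unpacking, which callers of the original NamedTuple may rely on.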
Installing with
pip install modelkit spacy
results in click 8.1.0 and spacy 3.2.3 being installed. But click 8.1.0 contains a breaking change and is incompatible with spaCy 3.2.3 (explosion/spaCy#10564).
Until an updated spaCy is released, you need to install with
pyenv exec pip install modelkit spacy "click==8.0.4"
This try/except is quite large and excepts KeyError. When a model raises a KeyError in _load(), the exception is caught and re-raised with the wrong error message, e.g.:
import modelkit

class SomeModel(modelkit.Model):
    CONFIGURATIONS = {"model": {}}

    def _load(self):
        data = {}
        data[0]

    def _predict(self, item):
        return item

library = modelkit.ModelLibrary(
    models=SomeModel,
    settings={"lazy_loading": True},
)
library.get("model")
leads to a strange exception message:
KeyError: 'Model model not loaded. (loaded models: model).'
instead of:
KeyError: 0
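A minimal sketch of the masking mechanism (not modelkit's actual code): a broad `except KeyError` around loading cannot distinguish "unknown model name" from a KeyError raised inside the model's own _load().

```python
def load_and_get(loaders, name):
    try:
        loader = loaders[name]  # KeyError here means "unknown model"...
        return loader()
    except KeyError:
        # ...but a KeyError raised inside loader() is swallowed too
        raise KeyError(f"Model {name} not loaded.")

def bad_load():
    return {}[0]  # the model's real bug: raises KeyError: 0

try:
    load_and_get({"model": bad_load}, "model")
except KeyError as exc:
    message = str(exc)  # the misleading "not loaded" message, not 0
```

Narrowing the try block to the dictionary lookup alone would let the model's own KeyError propagate unchanged.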
The Edit on GitHub button redirects to the main page of the GitHub account, which is publicly available, instead of the source file, e.g.:
==> https://github.com/Cornerstone-OnDemand/modelkit/edit/master/docs/index.md
For example, PyTorch uses the following solution:
I would like to propose a simple feature to analyze resource consumption during model prediction. The usage would look as follows:
model = modelkit.load_model(...)
profiler = Profiler(model)  # Profiler could be customized
result = model(item)
profiler.summary()  # outputs a DataFrame or str
# if there is no profiler, everything works as usual
The output of profiler.summary() also contains the profiling results of each sub-model (defined in self.model_dependencies, recursively). This requires only a few lines of change in modelkit/core/model.py.
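A self-contained sketch of the proposed API on a dummy model (the real implementation would hook into modelkit's predict pipeline and recurse into model_dependencies; all names here are illustrative):

```python
import time
from collections import defaultdict

class Profiler:
    """Hypothetical minimal profiler: wraps predict and records durations."""
    def __init__(self, model):
        self.durations = defaultdict(list)
        original_predict = model.predict

        def timed_predict(item):
            start = time.perf_counter()
            result = original_predict(item)
            self.durations[model.name].append(time.perf_counter() - start)
            return result

        model.predict = timed_predict

    def summary(self):
        # total seconds spent per model name
        return {name: sum(ds) for name, ds in self.durations.items()}

class DummyModel:
    """Stand-in for a modelkit Model."""
    name = "my_model"

    def predict(self, item):
        return item

model = DummyModel()
profiler = Profiler(model)
model.predict("item")
```

Because the wrapper only decorates predict, a model without a profiler behaves exactly as before, matching the "everything works as usual" requirement.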
2022-03-31 09:32.50 [info ] [modelkit.assets.manager] Fetching asset
EnvironmentName=test force_download=None return_info=True spec=<modelkit.assets.settings.AssetSpec object at 0x15dd02ad0>
spec is a modelkit.assets.settings.AssetSpec instance, which is not displayed in a readable way in the logs.
The describe functions (for a library or a model) are not in the documentation. We should add them.
Running pytest with the env var MODELKIT_ENABLE_SIMPLE_TRACEBACK set to false leads to the following failures:
FAILED tests/test_errors.py::test_prediction_error[model0] - AssertionError: assert 6 <= 3
FAILED tests/test_errors.py::test_prediction_error[model1] - AssertionError: assert 4 <= 3
FAILED tests/test_errors.py::test_prediction_error_composition - AssertionError: assert 11 <= 4
FAILED tests/test_errors.py::test_prediction_error_async[model0] - AssertionError: assert 6 <= 3
FAILED tests/test_errors.py::test_prediction_error_async[model1] - AssertionError: assert 4 <= 3
Currently, requirements.txt is not up-to-date with requirements-dev.txt. For example, the version of pydantic in requirements-dev.txt is pydantic==1.9.0, while requirements.txt has pydantic==1.8.2.
The definitions of "Duration per call (s)" and "Net duration per call (s)" can be tricky when dynamic models or memory caching are involved. For instance, we might see a huge duration difference across calls of the same model: [5.0 s, 0.01 s, 0.017 s, ...] (the first call is longer). In these cases, "Duration per call (s)" is the average duration of all calls. Hence, "Net duration per call (s)" should take this into account and be calculated based on the actual function call chain.
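A toy illustration of the call-chain-based calculation (the data and structure are made up): the net duration of a call is its own duration minus the durations of the sub-calls it actually made, rather than an average over all calls.

```python
# made-up call records: each entry is one model's total duration and the
# sub-models it actually invoked
calls = {
    "pipeline": {"duration": 5.0, "children": ["model_a", "model_b"]},
    "model_a": {"duration": 3.0, "children": []},
    "model_b": {"duration": 1.5, "children": []},
}

def net_duration(name):
    info = calls[name]
    return info["duration"] - sum(calls[c]["duration"] for c in info["children"])
```

With per-call records like these, a cached second call (near-zero duration, no children) contributes its own near-zero net duration instead of dragging an average around.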
test_simple_profiler is not idempotent
=================================== FAILURES ===================================
_____________________________ test_simple_profiler _____________________________
Traceback (most recent call last):
File "/Users/runner/work/modelkit/modelkit/tests/test_simple_profiler.py", line 295, in test_simple_profiler
assert math.isclose(net_durations["model_d"], 0.1, rel_tol=rel_tol)
AssertionError: assert False
+ where False = <built-in function isclose>(0.4682511319999776, 0.1, rel_tol=0.7)
+   where <built-in function isclose> = math.isclose