cornerstone-ondemand / modelkit
Toolkit for developing and maintaining ML models
Home Page: http://cornerstone-ondemand.github.io/modelkit/
License: MIT License
After diving into the code, modelkit lists the following regex for asset names:
GENERIC_ASSET_NAME_RE = (
r"(([A-Z]:\\)|/)?[a-zA-Z0-9]([a-zA-Z0-9\-\_\.\/\\]*[a-zA-Z0-9])?"
)
However, I uploaded a file named something like file__id==version, and == is invalid according to this regex. The file could still be uploaded, but when the asset is specified in a model config, modelkit freezes while trying to resolve it and no error is thrown. The asset also cannot be downloaded via code.
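As an illustration, a minimal check of a name against the regex quoted above (the helper function is made up; only the regex comes from modelkit's source) shows the mismatch:

```python
import re

# the asset-name regex quoted above (copied from modelkit's source)
GENERIC_ASSET_NAME_RE = (
    r"(([A-Z]:\\)|/)?[a-zA-Z0-9]([a-zA-Z0-9\-\_\.\/\\]*[a-zA-Z0-9])?"
)

def is_valid_asset_name(name: str) -> bool:
    """Hypothetical helper: a name must fully match the regex,
    which is effectively what asset resolution requires."""
    return re.fullmatch(GENERIC_ASSET_NAME_RE, name) is not None

is_valid_asset_name("my-asset/file.bin")   # allowed characters only
is_valid_asset_name("file__id==version")   # '==' is not in the character class
```

Validating names with such a check at upload time would turn the silent freeze into an early, explicit error.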
Having both is confusing; I don't see why a single one wouldn't be enough.
If two models have the same configuration name, modelkit will silently use one of them. It should at least warn the user, but maybe we just want to raise an exception?
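A sketch of duplicate detection at library build time (not modelkit's actual internals; the function and class names are made up):

```python
# Hypothetical sketch: raise on duplicate configuration names instead of
# silently keeping one of the duplicates.
def collect_configurations(models):
    configurations = {}
    for model in models:
        for name in model.CONFIGURATIONS:
            if name in configurations:
                raise ValueError(
                    f"Configuration {name!r} is defined by both "
                    f"{configurations[name].__name__} and {model.__name__}"
                )
            configurations[name] = model
    return configurations

# two illustrative model classes that collide on a configuration name
class ModelA:
    CONFIGURATIONS = {"shared_name": {}}

class ModelB:
    CONFIGURATIONS = {"shared_name": {}}
```

A warning instead of a ValueError would be the softer variant mentioned above.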
During the asset-creation development process, or if you want to embed assets in a container at build time, modelkit can be configured to use a StorageProvider with a LocalStorageDriver and an assets dir pointing to the same location as the local storage. This allows using only local files, without duplicating them on the local file system. However, modelkit still requires write permission on the filesystem, because lock-file management is not disabled in this configuration. The idea is to disable lock-file management in this case: that is safe, since we are guaranteed not to download anything in this configuration.
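A sketch of such a configuration via environment variables (the variable names match those used elsewhere in these issues; the `local` provider value is an assumption):

```shell
# point the assets dir and the local storage bucket at the same directory
export MODELKIT_STORAGE_PROVIDER=local
export MODELKIT_STORAGE_BUCKET=/opt/assets
export MODELKIT_ASSETS_DIR=/opt/assets
```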
As of today, assets can only be versioned with numbers (e.g. 1.4):
https://cornerstone-ondemand.github.io/modelkit/assets/managing_assets/
modelkit should support date versioning (e.g. 2021-10-16) to support organisations where models are trained on a monthly/weekly/daily basis.
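A sketch of what accepting both schemes could look like (the function and the exact patterns are assumptions, not modelkit's code):

```python
import datetime
import re

# hypothetical: accept either the current numeric scheme or an ISO date
VERSION_RE = re.compile(r"^\d+(\.\d+)?$")
DATE_RE = re.compile(r"^\d{4}-\d{2}-\d{2}$")

def parse_version(version: str):
    if VERSION_RE.match(version):
        return ("numeric", version)
    if DATE_RE.match(version):
        # fromisoformat also validates that the date actually exists
        return ("date", datetime.date.fromisoformat(version))
    raise ValueError(f"Unsupported version: {version!r}")
```

Date versions sort naturally as strings (ISO 8601), which keeps "latest version" resolution simple.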
It would help reduce the number of dependencies.
Modelkit pinned pydantic < 2 a few months ago, since the 2.x versions break BaseSettings. A previous issue written by @victorbenichoux (see #189) and the corresponding PR were merged to pin pydantic < 2 as a quick fix. Nonetheless, as stated by @victorbenichoux (again) and backed by several profiling initiatives throughout the years, MODELKIT_ENABLE_VALIDATION does not work with complex data types (it attempts to recursively and manually recreate the object without validating it in the first place), yet still comes with the reconstruction overhead. As such, migrating modelkit to pydantic 2 seems like a no-brainer, especially since it is now at 2.4.x, with a whole summer of patches, fixes and improvements!
It would be a good opportunity to bump a major version as well, to celebrate the huge step forward in terms of speed (finally!).
@victorbenichoux would you still be happy to propose a PR, or to discuss it? :-)
When configuring LocalStorage with MODELKIT_ASSETS_DIR = MODELKIT_STORAGE_BUCKET/MODELKIT_STORAGE_PREFIX, _fetch will delete the local asset without being able to re-download it, since the two configurations overlap.
Bumping pydantic to version >= 2.0 breaks modelkit: BaseSettings has been moved to a separate package:
pydantic.errors.PydanticImportError: `BaseSettings` has been moved to the `pydantic-settings` package. See https://docs.pydantic.dev/2.0.2/migration/#basesettings-has-moved-to-pydantic-settings for more details.
Installing pydantic-settings as suggested is not enough, and it will keep failing with other errors.
In fact, pydantic >= 2.0 comes with a large set of breaking changes, and since we use relatively advanced features during payload validation (e.g. create_model, etc.), the migration will likely be quite hard. OTOH, pydantic 2.0 is substantially faster than 1.x, and input validation accounts for most of the overhead at predict time, so the migration is very much worth it for modelkit.
Do you have any plans to organize the migration? If not, I am happy to propose a PR.
In the meantime, we should pin pydantic<2.0 in setup.cfg.
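The interim pin could look like this in setup.cfg (a sketch; the exact section layout depends on modelkit's existing setup.cfg):

```ini
[options]
install_requires =
    pydantic<2.0
```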
modelkit does not seem to be installable on M1 Macs with Apple silicon. Installation fails because the numpy version is somehow pinned to 1.19.3, which cannot be installed on M1 Macs.
Currently we can only push from local to remote. The CLI could also support pushing from remote to local, or from remote to remote (restricted to remotes with the same credentials).
The current code suffers from a method resolution order (MRO) issue. In the Asset class, the __init__ method does not call super().__init__(), so the superclass is never properly initialized, leading to potential errors or undesired behavior when inheriting from Asset. To ensure that the MRO is respected and the superclass is initialized correctly, super().__init__() must be called in the __init__ method of the Asset class.
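A generic Python sketch (not modelkit's actual classes) of why the cooperative call matters: without super().__init__(), classes later in the MRO never run their initializers.

```python
class Mixin:
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.mixin_ready = True

class Asset:
    def __init__(self, **kwargs):
        # without this call, Mixin.__init__ below would never run
        super().__init__(**kwargs)
        self.asset_ready = True

class Combined(Asset, Mixin):
    """MRO: Combined -> Asset -> Mixin -> object."""

combined = Combined()
```

Dropping the super() call in Asset.__init__ would leave combined without the mixin_ready attribute, which is exactly the class of bug described above.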
Here is an asset that has a model in its dependencies. When I get the asset, ModelLibrary doesn't run the model's _load
method.
class MyModel(Model):
    CONFIGURATIONS = {"my_model": {}}

    def _load(self):
        self.data = "my model data"

    def _predict(self, item):
        return self.data


class MyAsset(Asset):
    CONFIGURATIONS = {"my_asset": {"model_dependencies": {"my_model"}}}

    def _load(self):
        model = self.model_dependencies["my_model"]
        self.data = model("")


dataset_library = ModelLibrary(
    required_models={"my_asset": {}},
    models=[MyModel, MyAsset],
    settings={"lazy_loading": True},
)
dataset_library.get("my_asset").data
This returns the following error: AttributeError: 'MyModel' object has no attribute 'data' (the attribute is created in the model's _load method).
Currently, asset new or asset update pushes data from local to MODELKIT_STORAGE_PROVIDER:
https://cornerstone-ondemand.github.io/modelkit/assets/managing_assets/#create-a-new-asset
which means the asset must be in local storage to be pushed. It could be interesting to be able to push an asset directly from remote to remote without writing to the local disk.
Note: the feature seems to be partially implemented for GCS, but it just downloads the asset locally and re-pushes it:
modelkit/modelkit/assets/cli.py
Line 201 in 6e71fe7
(not sure whether it still works)
So maybe at least implement an automatic re-download + push for S3, or better, find a way to do it directly (not sure whether that is possible, at least between two remotes sharing the same credentials).
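A sketch of the driver-level idea, streaming an object between two remotes without materializing it on local disk. The driver interface and the in-memory stand-in are made up for illustration:

```python
import io

class MemoryDriver:
    """In-memory stand-in for a remote storage driver (illustration only)."""
    def __init__(self):
        self.objects = {}

    def open_read(self, name):
        return io.BytesIO(self.objects[name])

    def upload_stream(self, name, stream):
        self.objects[name] = stream.read()

def push_remote_to_remote(source, dest, name):
    # stream the object from one remote to the other, no local file involved
    with source.open_read(name) as stream:
        dest.upload_stream(name, stream)
```

With S3 specifically, a server-side copy between buckets reachable with the same credentials would avoid even the streaming hop.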
Currently, when one example in a prediction batch raises, the whole call breaks and the error is raised. We may want another option, such as returning all the batch results except the failing ones (set to None or to the exception), together with a mask or something similar.
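A sketch of the proposed behavior (this is not a modelkit API, just an illustration of the failing-items-plus-mask idea):

```python
# Run the batch item by item, keep successful results, replace failures
# with the exception, and return a mask of which items succeeded.
def predict_batch_masked(predict, items):
    results, mask = [], []
    for item in items:
        try:
            results.append(predict(item))
            mask.append(True)
        except Exception as exc:
            results.append(exc)
            mask.append(False)
    return results, mask

# e.g. with a predictor that fails on zero
results, mask = predict_batch_masked(lambda x: 1 / x, [1, 0, 2])
```

The caller can then decide whether to re-raise, drop, or log the masked failures.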
Is there an easy way to override an asset in a config, specifically for development / non-production purposes? Ideally, a way to do this on the local machine by copying an existing config and setting the asset path to a local directory, even if it previously pointed to a remote path.
Sometimes it makes sense to group helper utilities together in the library search path. However, we don't always want these files to be loaded if, for example, the model itself is not used. This could be done with a common naming convention, e.g. prefixing files that should be ignored with a standard prefix: library._my_module would be ignored and skipped over (following the private-name convention, with the underscore indicating a module that should not be exposed publicly), while library.my_module would be traversed and loaded.
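A sketch of the skipping rule using pkgutil (the helper is hypothetical; it is applied to the stdlib importlib package here purely for demonstration):

```python
import importlib
import pkgutil

def iter_public_modules(package):
    """Hypothetical convention: skip submodules whose name starts with
    an underscore, mirroring Python's private-name convention."""
    for mod in pkgutil.iter_modules(package.__path__):
        if mod.name.startswith("_"):
            continue
        yield mod.name

# e.g. listing importlib's public submodules skips _bootstrap and friends
public = list(iter_public_modules(importlib))
```

modelkit's model discovery could apply the same filter before importing modules from the configured library path.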
When loading a pretrained saved_model from TensorFlow Hub with TensorflowModel, some variables seem to be garbage-collected by TensorFlow. Predictions may then lead to errors like:
FailedPreconditionError: Could not find variable _AnonymousVar523. This could mean that the variable has been deleted. In TF1, it can also mean the variable is uninitialized. Debug info: container=localhost, status=Not found: Resource localhost/_AnonymousVar523/N10tensorflow3VarE does not exist.
Keeping a reference to the saved model seems to solve the issue.
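A plain-Python analogy of the symptom (no TensorFlow involved): if no strong reference to the loaded object is kept, it can be collected, and later lookups fail. Keeping the loaded SavedModel on the model instance is the corresponding fix.

```python
import gc
import weakref

class Variable:
    """Stand-in for an object owned by a loaded model."""

def load_without_keeping_reference():
    variable = Variable()
    return weakref.ref(variable)  # only a weak reference escapes

ref = load_without_keeping_reference()
gc.collect()
ref()  # the "variable" is gone once nothing strongly references it
```

The fix in the model would be along the lines of assigning the loaded object to an attribute in _load (e.g. self._saved_model = ...), so the instance itself keeps it alive.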
I had this error running pytest in python 3.9.5
File "/Users/clemat/.pyenv/versions/3.9.5/lib/python3.9/importlib/__init__.py", line 127, in import_module
return _bootstrap._gcd_import(name[level:], package, level)
File "<frozen importlib._bootstrap>", line 1030, in _gcd_import
File "<frozen importlib._bootstrap>", line 1007, in _find_and_load
File "<frozen importlib._bootstrap>", line 986, in _find_and_load_unlocked
File "<frozen importlib._bootstrap>", line 680, in _load_unlocked
File "/Users/clemat/.pyenv/versions/3.9.5/envs/modelkit/lib/python3.9/site-packages/_pytest/assertion/rewrite.py", line 170, in exec_module
exec(co, module.__dict__)
File "/Users/clemat/dev/modelkit/tests/assets/conftest.py", line 18, in <module>
from modelkit.assets.manager import AssetsManager
File "/Users/clemat/dev/modelkit/modelkit/__init__.py", line 3, in <module>
from modelkit.core.library import ModelLibrary, load_model # NOQA
File "/Users/clemat/dev/modelkit/modelkit/core/__init__.py", line 1, in <module>
from modelkit.core.library import ModelLibrary, load_model
File "/Users/clemat/dev/modelkit/modelkit/core/library.py", line 35, in <module>
from modelkit.core.model import Asset, AsyncModel, Model
File "/Users/clemat/dev/modelkit/modelkit/core/model.py", line 31, in <module>
from modelkit.utils.cache import Cache, CacheItem
File "/Users/clemat/dev/modelkit/modelkit/utils/cache.py", line 15, in <module>
class CacheItem(NamedTuple, Generic[ItemType]):
File "/Users/clemat/.pyenv/versions/3.9.5/lib/python3.9/typing.py", line 1881, in _namedtuple_mro_entries
raise TypeError("Multiple inheritance with NamedTuple is not supported")
TypeError: Multiple inheritance with NamedTuple is not supported
======================================================================================= short test summary info ========================================================================================
ERROR - TypeError: Multiple inheritance with NamedTuple is not supported
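A hypothetical workaround sketch: on Python 3.9/3.10, `class CacheItem(NamedTuple, Generic[ItemType])` is rejected with exactly the TypeError above, while a frozen generic dataclass provides similar immutable-record semantics without the restriction (field names here are illustrative):

```python
from dataclasses import dataclass
from typing import Generic, Optional, TypeVar

ItemType = TypeVar("ItemType")

# A frozen dataclass can be generic on all supported Python versions,
# unlike a NamedTuple with extra base classes on 3.9/3.10.
@dataclass(frozen=True)
class CacheItem(Generic[ItemType]):
    item: Optional[ItemType] = None

cached = CacheItem(item="value")
```

The trade-off is losing tuple unpacking, which callers of the original NamedTuple may rely on.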
Installing with
pip install modelkit spacy
results in click 8.1.0 and spacy 3.2.3 being installed. But click 8.1.0 contains a breaking change and is incompatible with spaCy 3.2.3 (explosion/spaCy#10564).
Until an updated spaCy is released, you need to install with
pyenv exec pip install modelkit spacy "click==8.0.4"
This try/except is quite large and excepts KeyError. When a model raises a KeyError in _load(), the exception is caught and re-raised with the wrong error message, e.g.:
import modelkit

class SomeModel(modelkit.Model):
    CONFIGURATIONS = {"model": {}}

    def _load(self):
        data = {}
        data[0]

    def _predict(self, item):
        return item

library = modelkit.ModelLibrary(
    models=SomeModel,
    settings={"lazy_loading": True},
)
library.get("model")
leads to a strange exception message:
KeyError: 'Model model not loaded. (loaded models: model).'
instead of:
KeyError: 0
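A minimal sketch of the masking mechanism (not modelkit's actual code): a broad `except KeyError` around loading cannot distinguish "unknown model name" from a KeyError raised inside the model's own _load().

```python
def load_and_get(loaders, name):
    try:
        loader = loaders[name]  # KeyError here means "unknown model"...
        return loader()
    except KeyError:
        # ...but a KeyError raised inside loader() is swallowed too
        raise KeyError(f"Model {name} not loaded.")

def bad_load():
    return {}[0]  # the model's real bug: raises KeyError: 0

try:
    load_and_get({"model": bad_load}, "model")
except KeyError as exc:
    message = str(exc)  # the misleading "not loaded" message, not 0
```

Narrowing the try block to the dictionary lookup alone would let the model's own KeyError propagate unchanged.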
The Edit on GitHub button redirects to the main page of the GitHub account, which is publicly available, instead of the source file, e.g.:
==> https://github.com/Cornerstone-OnDemand/modelkit/edit/master/docs/index.md
For example, PyTorch uses the following solution:
I would like to propose a simple feature to analyze resource consumption during model prediction. The usage would look as follows:
model = modelkit.load_model(...)
profiler = Profiler(model)  # Profiler could be customized
result = model(item)
profiler.summary()  # outputs a DataFrame or str
# if there is no profiler, everything works as usual
The output of profiler.summary() also contains the profiling results of each sub-model (defined in self.model_dependencies, recursively). This requires only a few lines of change in modelkit/core/model.py.
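A self-contained sketch of the proposed API on a dummy model (the real implementation would hook into modelkit's predict pipeline and recurse into model_dependencies; all names here are illustrative):

```python
import time
from collections import defaultdict

class Profiler:
    """Hypothetical minimal profiler: wraps predict and records durations."""
    def __init__(self, model):
        self.durations = defaultdict(list)
        original_predict = model.predict

        def timed_predict(item):
            start = time.perf_counter()
            result = original_predict(item)
            self.durations[model.name].append(time.perf_counter() - start)
            return result

        model.predict = timed_predict

    def summary(self):
        # total seconds spent per model name
        return {name: sum(ds) for name, ds in self.durations.items()}

class DummyModel:
    """Stand-in for a modelkit Model."""
    name = "my_model"

    def predict(self, item):
        return item

model = DummyModel()
profiler = Profiler(model)
model.predict("item")
```

Because the wrapper only decorates predict, a model without a profiler behaves exactly as before, matching the "everything works as usual" requirement.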
2022-03-31 09:32.50 [info ] [modelkit.assets.manager] Fetching asset
EnvironmentName=test force_download=None return_info=True spec=<modelkit.assets.settings.AssetSpec object at 0x15dd02ad0>
spec is a modelkit.assets.settings.AssetSpec instance, which is not displayed in a readable way in the logs.
The describe functions (for a library or a model) are not in the documentation. We should add them.
Running pytest with the env var MODELKIT_ENABLE_SIMPLE_TRACEBACK set to false leads to the following failures:
FAILED tests/test_errors.py::test_prediction_error[model0] - AssertionError: assert 6 <= 3
FAILED tests/test_errors.py::test_prediction_error[model1] - AssertionError: assert 4 <= 3
FAILED tests/test_errors.py::test_prediction_error_composition - AssertionError: assert 11 <= 4
FAILED tests/test_errors.py::test_prediction_error_async[model0] - AssertionError: assert 6 <= 3
FAILED tests/test_errors.py::test_prediction_error_async[model1] - AssertionError: assert 4 <= 3
Currently, requirements.txt is not up-to-date with requirements-dev.txt. For example, the version of pydantic in requirements-dev.txt is pydantic==1.9.0, while requirements.txt has pydantic==1.8.2.
The definitions of "Duration per call (s)" and "Net duration per call (s)" can be tricky when dynamic models or memory caching are involved. For instance, we might see a huge duration difference across calls of the same model: [5.0 s, 0.01 s, 0.017 s, ...] (the first call is longer). In these cases, "Duration per call (s)" is the average duration of all calls. Hence, "Net duration per call (s)" should take this into account and be calculated based on the actual function call chain.
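A toy illustration of the call-chain-based calculation (the data and structure are made up): the net duration of a call is its own duration minus the durations of the sub-calls it actually made, rather than an average over all calls.

```python
# made-up call records: each entry is one model's total duration and the
# sub-models it actually invoked
calls = {
    "pipeline": {"duration": 5.0, "children": ["model_a", "model_b"]},
    "model_a": {"duration": 3.0, "children": []},
    "model_b": {"duration": 1.5, "children": []},
}

def net_duration(name):
    info = calls[name]
    return info["duration"] - sum(calls[c]["duration"] for c in info["children"])
```

With per-call records like these, a cached second call (near-zero duration, no children) contributes its own near-zero net duration instead of dragging an average around.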
test_simple_profiler is not idempotent
=================================== FAILURES ===================================
_____________________________ test_simple_profiler _____________________________
Traceback (most recent call last):
File "/Users/runner/work/modelkit/modelkit/tests/test_simple_profiler.py", line 295, in test_simple_profiler
assert math.isclose(net_durations["model_d"], 0.1, rel_tol=rel_tol)
AssertionError: assert False
+ where False = <built-in function isclose>(0.4682511319999776, 0.1, rel_tol=0.7)
+   where <built-in function isclose> = math.isclose