src-d / modelforge

Python library to share machine learning models easily and reliably.

License: Apache License 2.0

Languages: Python 99.18%, HTML 0.49%, Dockerfile 0.33%
Topics: model, machine-learning, git, sharing, registry

modelforge's Issues

ALWAYS_SIGNOFF env variable is not taken into account

Even when the MODELFORGE_ALWAYS_SIGNOFF environment variable is set to True, the index is committed without a DCO sign-off:

➜  ~ modelforge publish -f .modelforge/bot_detection/bot_detection.asdf --meta .modelforge/bot_detection/template_meta.json
INFO:21f9:GitIndex:Cached index is not up to date, pulling warenlg/models
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
INFO:21f9:generic:Reading /home/waren/.modelforge/bot_detection/bot_detection.asdf (100.0 kB)...
INFO:21f9:gcs-backend:Connecting to the bucket...
INFO:21f9:gcs-backend:Uploading bot-detection from /home/waren/.modelforge/bot_detection/bot_detection.asdf...
[################################] 98304/100278 - 00:00:00
INFO:21f9:publish_model:Uploaded as https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbot-detection%2F599cf161-8e51-44ad-a576-3dd1518afb80.asdf
INFO:21f9:publish_model:Updating the models index...
INFO:21f9:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_model.md.jinja2
INFO:21f9:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_readme.md.jinja2
INFO:21f9:GitIndex:Added /home/waren/.modelforge/cache/source{d}/warenlg/models/bot-detection/599cf161-8e51-44ad-a576-3dd1518afb80.md
INFO:21f9:GitIndex:Updated /home/waren/.modelforge/cache/source{d}/warenlg/models/README.md
INFO:21f9:GitIndex:Writing the new index.json ...
INFO:21f9:GitIndex:Committing the index without DCO
INFO:21f9:GitIndex:Pushing the updated index ...
Push to ssh://[email protected]/warenlg/models successful.
INFO:21f9:publish_model:Successfully published

Indeed, the signoff value can never be None here https://github.com/src-d/modelforge/blob/master/modelforge/index.py#L51 because it is parsed from the arguments with action="store_true" https://github.com/src-d/modelforge/blob/master/modelforge/__main__.py#L37, which makes it default to False when the flag is absent.
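
To make the argparse behaviour concrete, here is a minimal sketch. The `--signoff` flag name, the `resolve_signoff` helper and the accepted truthy strings are assumptions for illustration, not modelforge's actual code; only `MODELFORGE_ALWAYS_SIGNOFF` comes from this report.

```python
import argparse


def make_parser(**flag_kwargs):
    """Build a tiny parser with a hypothetical --signoff flag."""
    parser = argparse.ArgumentParser()
    parser.add_argument("--signoff", action="store_true", **flag_kwargs)
    return parser


def resolve_signoff(cli_value, environ):
    """Fall back to the environment only when the flag was not given (None)."""
    if cli_value is None:
        raw = environ.get("MODELFORGE_ALWAYS_SIGNOFF", "")
        # Assumption: these strings count as "enabled".
        return raw.strip().lower() in ("1", "true", "yes")
    return cli_value


env = {"MODELFORGE_ALWAYS_SIGNOFF": "True"}

# With plain store_true an absent flag parses as False, never None,
# so an "is None" check cannot trigger the environment fallback.
current = make_parser().parse_args([]).signoff
# With default=None an absent flag stays None and the fallback applies.
proposed = make_parser(default=None).parse_args([]).signoff

print(current, resolve_signoff(current, env))
print(proposed, resolve_signoff(proposed, env))
```

One possible fix along these lines would be to pass `default=None` to the flag, so that an explicit flag still wins while an absent flag defers to the environment.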

Trailing dots in log messages raise AssertionErrors

A function decorator was added in 9111b35 to check whether log messages end with a dot.
It works as intended: it raises an error whenever such a log message appears:

➜  ~ modelforge publish -f .modelforge/bot_detection/bot_detection.asdf --meta .modelforge/bot_detection/template_meta.json
INFO:0189:generic:Reading /home/waren/.modelforge/bot_detection/bot_detection.asdf (100.0 kB)...
INFO:0189:gcs-backend:Connecting to the bucket...
INFO:0189:gcs-backend:Uploading bot-detection from /home/waren/.modelforge/bot_detection/bot_detection.asdf...
[################################] 98304/100278 - 00:00:00
INFO:0189:publish_model:Uploaded as https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbot-detection%2F599cf161-8e51-44ad-a576-3dd1518afb80.asdf
INFO:0189:publish_model:Updating the models index...
INFO:0189:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_model.md.jinja2
INFO:0189:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_readme.md.jinja2
INFO:0189:GitIndex:Added /home/waren/.modelforge/cache/source{d}/warenlg/models/bot-detection/599cf161-8e51-44ad-a576-3dd1518afb80.md
INFO:0189:GitIndex:Updated /home/waren/.modelforge/cache/source{d}/warenlg/models/README.md
INFO:0189:GitIndex:Writing the new index.json ...
INFO:0189:GitIndex:Committing the index without DCO
INFO:0189:GitIndex:Pushing the updated index ...
Push to ssh://[email protected]/warenlg/models successful.
Traceback (most recent call last):
  File "/home/waren/.local/bin/modelforge", line 11, in <module>
    sys.exit(main())
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/__main__.py", line 122, in main
    return handler(args)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/backends.py", line 93, in wrapped_supply_backend
    return func(args, backend, log)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/registry.py", line 84, in publish_model
    log.info("Successfully published.")
  File "/usr/lib/python3.6/logging/__init__.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/slogging.py", line 75, in decorated_with_check_trailing_dot
    (record.name, msg))
AssertionError: Log message is not allowed to have a trailing dot: publish_model: "Successfully published."

So, let's fix those log messages once and for all.
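
For reference, a check of this kind can be sketched as a decorator around a handler's `emit`. This is not the code from slogging.py: the `check_trailing_dot` and `ListHandler` names are hypothetical, and tolerating a trailing "..." is an assumption inferred from the messages that pass in the log above.

```python
import functools
import logging


def check_trailing_dot(emit):
    """Wrap Handler.emit and reject messages ending with a single dot."""
    @functools.wraps(emit)
    def decorated(self, record):
        msg = record.msg
        # Assumption: an ellipsis ("...") is allowed, a lone final dot is not.
        if isinstance(msg, str) and msg.endswith(".") and not msg.endswith(".."):
            raise AssertionError(
                'Log message is not allowed to have a trailing dot: %s: "%s"'
                % (record.name, msg))
        return emit(self, record)
    return decorated


class ListHandler(logging.Handler):
    """Hypothetical handler that collects messages in memory."""

    def __init__(self):
        super().__init__()
        self.messages = []

    @check_trailing_dot
    def emit(self, record):
        self.messages.append(record.getMessage())


handler = ListHandler()
log = logging.getLogger("publish_model_sketch")
log.setLevel(logging.INFO)
log.propagate = False
log.addHandler(handler)

log.info("Pushing the updated index ...")
log.info("Successfully published")
try:
    log.info("Successfully published.")
    raised = False
except AssertionError:
    raised = True
```

A handler like this could also be attached in the test suite, so that offending messages are caught before a release rather than at the end of a publish.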

Problems with pickling loaded model containing numpy.ndarray

I ran into problems with pickling and multiprocessing on a loaded modelforge model. This happens in both lazy and non-lazy load modes.

Here is a code sample that reproduces the error:

from multiprocessing import Pool
import pickle
import traceback
import numpy
from modelforge import Model


class NumpyArray(Model):
    NAME = "numpy_array"
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.array = numpy.random.normal(size=(4, 4))
    
    def _generate_tree(self):
        tree = self.__dict__.copy()
        for key in vars(Model()):
            del tree[key]
        return tree
    
    def _load_tree(self, tree):
        self.__dict__.update(tree)

    def pickle_test(self, path: str):
        with open(path, "wb") as out:
            pickle.dump(self, out)
            
    def mult(self, coeff: float):
        return self.array * coeff
    
    def multithread_test(self):
        coeffs = numpy.random.normal(size=16)
        with Pool(4) as pool:
            results = pool.map(self.mult, coeffs)
        return sum(results)

Test non-lazy mode:

arr_obj = NumpyArray()
arr_obj.save("numpy_array.asdf")

new_arr_obj = NumpyArray()
new_arr_obj.load("numpy_array.asdf", lazy=False)
new_arr_obj.pickle_test("array.pkl")

Here is the output:

TypeError                                 Traceback (most recent call last)
<ipython-input-148-3f72419a5166> in <module>()
----> 1 new_arr_obj.pickle_test("array.pkl")

<ipython-input-142-d11f51c103b9> in pickle_test(self, path)
     24     def pickle_test(self, path: str):
     25         with open(path, "wb") as out:
---> 26             pickle.dump(self, out)
     27 
     28     def mult(self, coeff: float):

TypeError: cannot serialize '_io.BufferedReader' object

The same happens with multiprocessing:

new_arr_obj.multithread_test()

This gives:

TypeError                                 Traceback (most recent call last)
<ipython-input-149-e6fc3a006712> in <module>()
      4 new_arr_obj = NumpyArray()
      5 new_arr_obj.load("numpy_array.asdf", lazy=False)
----> 6 new_arr_obj.multithread_test()

<ipython-input-142-d11f51c103b9> in multithread_test(self)
     32         coeffs = numpy.random.normal(size=16)
     33         with Pool(4) as pool:
---> 34             results = pool.map(self.mult, coeffs)
     35         return sum(results)
     36 

/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/usr/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/usr/lib/python3.5/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

/usr/lib/python3.5/multiprocessing/reduction.py in dumps(cls, obj, protocol)
     48     def dumps(cls, obj, protocol=None):
     49         buf = io.BytesIO()
---> 50         cls(buf, protocol).dump(obj)
     51         return buf.getbuffer()
     52 

TypeError: cannot serialize '_io.BufferedReader' object

The same happens in lazy mode. Calling these functions on the original class instance works fine.

The following works around the problem locally (it can also be done inside _load_tree()):

new_arr_obj.array = numpy.array(new_arr_obj.array)
new_arr_obj.multithread_test()
new_arr_obj.pickle_test("array.pkl")

This passes, but non-lazy loading of numpy arrays looks like it is meant to work out of the box.
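
The mechanism can be illustrated without modelforge. The sketch below is only an analogy under the assumption that the loaded array keeps a reference to the open file it was read from: `FileBackedValues` and `materialize_tree` are hypothetical stand-ins, and copying into a plain list plays the role of `numpy.array(...)` in the workaround above.

```python
import os
import pickle
import tempfile


class FileBackedValues:
    """Hypothetical stand-in for a loaded array that still holds its file."""

    def __init__(self, path):
        self._fh = open(path, "rb")

    def read_all(self):
        self._fh.seek(0)
        return list(self._fh.read())

    def close(self):
        self._fh.close()


def materialize_tree(tree):
    """Return a copy of the tree with file-backed entries copied into memory."""
    result = {}
    for key, value in tree.items():
        if isinstance(value, FileBackedValues):
            result[key] = value.read_all()  # in-memory copy, no file handle
            value.close()
        else:
            result[key] = value
    return result


def can_pickle(obj):
    """Report whether pickle accepts the object."""
    try:
        pickle.dumps(obj)
        return True
    except (TypeError, pickle.PicklingError):
        return False


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(bytes([1, 2, 3, 4]))

tree = {"array": FileBackedValues(tmp.name)}
before = can_pickle(tree)                    # open file handle inside
clean_tree = materialize_tree(tree)
after = can_pickle(clean_tree)               # plain data only
os.unlink(tmp.name)
```

If the assumption holds, doing the equivalent copy for every array in `_load_tree()` when `lazy=False` would make the loaded model picklable, at the cost of holding all arrays in memory.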

Bad GitHub credentials leave the cache in an unstable state

When GitHub authentication fails (for example, when given bad credentials), modelforge's cache is left in an unstable state. Further attempts to upload a model fail, even with the right credentials, with the following error:

Traceback (most recent call last):
 File "/home/tristan/.pyenv/versions/3.6.0/bin/modelforge", line 10, in <module>
   sys.exit(main())
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/__main__.py", line 122, in main
   return handler(args)
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/registry.py", line 97, in list_models
   password=args.password, cache=args.cache, log_level=args.log_level)
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 83, in __init__
   self.fetch()
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 110, in fetch
   if self._are_local_and_remote_heads_different():
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 250, in _are_local_and_remote_heads_different
   local_head = Repo(self.cached_repo).head()
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dulwich/repo.py", line 459, in head
   return self.refs[b'HEAD']
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dulwich/refs.py", line 284, in __getitem__
   raise KeyError(name)
KeyError: b'HEAD'

Manually deleting the cache solves the problem.
Version: 0.12.1
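
Until this is fixed, the manual workaround can be automated. The sketch below is an assumption-laden illustration: `open_with_cache_reset` is a hypothetical helper, and `fake_open_index` is a stand-in for whatever opens the cached index, so no real modelforge or dulwich call is used.

```python
import os
import shutil
import tempfile


def open_with_cache_reset(cache_dir, open_index):
    """Try to open the index; on KeyError wipe the cache and retry once."""
    try:
        return open_index(cache_dir)
    except KeyError:
        # Same effect as deleting the cache by hand.
        shutil.rmtree(cache_dir, ignore_errors=True)
        return open_index(cache_dir)


def fake_open_index(cache_dir):
    """Stand-in opener: re-creates a missing cache, rejects one without HEAD."""
    head = os.path.join(cache_dir, "HEAD")
    if not os.path.isdir(cache_dir):
        os.makedirs(cache_dir)
        with open(head, "w") as fout:
            fout.write("ref: refs/heads/master\n")
    if not os.path.exists(head):
        raise KeyError(b"HEAD")
    return cache_dir


root = tempfile.mkdtemp()
broken_cache = os.path.join(root, "cache")
os.makedirs(broken_cache)  # exists but has no HEAD, as after a failed auth

result = open_with_cache_reset(broken_cache, fake_open_index)
head_exists = os.path.exists(os.path.join(broken_cache, "HEAD"))
shutil.rmtree(root, ignore_errors=True)
```

A more thorough fix would presumably clean up the partially created cache at the point where authentication fails, rather than recovering on the next run.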

DocumentFrequencies model from sourced/ml breaks when imported in the IPython environment

Using modelforge-0.5.1a0 in Jupyter Notebook 5.0.0.

When importing the DocumentFrequencies model from sourced/ml in a Jupyter notebook:

from sourced.ml.models.df import DocumentFrequencies
DocumentFrequencies()

I got the following error:

AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    691                 type_pprinters=self.type_printers,
    692                 deferred_pprinters=self.deferred_printers)
--> 693             printer.pretty(obj)
    694             printer.flush()
    695             return stream.getvalue()

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in pretty(self, obj)
    378                             if callable(meth):
    379                                 return meth(obj, self, cycle)
--> 380             return _default_pprint(obj, self, cycle)
    381         finally:
    382             self.end_group()

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    493     if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
    494         # A user-provided repr. Find newlines and replace them with p.break_()
--> 495         _repr_pprint(obj, p, cycle)
    496         return
    497     p.begin_group(1, '<')

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    691     """A pprint that just redirects to the normal repr function."""
    692     # Find newlines and replace them with p.break_()
--> 693     output = repr(obj)
    694     for idx,output_line in enumerate(output.splitlines()):
    695         if idx:

/usr/local/lib/python3.5/dist-packages/modelforge/model.py in __repr__(self)
    150     created_at = metaprop("created_at")
    151     version = metaprop("version")
--> 152     parent = metaprop("parent")
    153     license = metaprop("license")
    154 

AttributeError: module '__main__' has no attribute '__file__'
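
The error message says that `__main__` has no `__file__` attribute in this environment, so an unguarded lookup fails. A defensive lookup could look like the sketch below; `module_file` and the "<interactive>" placeholder are hypothetical names, not part of modelforge.

```python
import types


def module_file(module, default="<interactive>"):
    """Return the module's __file__, or a placeholder when it is not set."""
    return getattr(module, "__file__", None) or default


# A freshly created module object has no __file__, like __main__ in a notebook.
interactive_main = types.ModuleType("__main__")

# A module started from a script has __file__ set.
script_main = types.ModuleType("__main__")
script_main.__file__ = "/path/to/script.py"

print(module_file(interactive_main))
print(module_file(script_main))
```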

Errors in the model's description

When running style-analyzer's tests with modelforge 0.11.0, I hit some errors due to the model's description, such as:

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/lookout/style/typos/tests/test_ranking.py", line 69, in test_save_load
    print(ranker)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/modelforge/model.py", line 274, in __str__
    " ".join("%s==%s" % tuple(p) for p in self.environment["packages"])
TypeError: 'NoneType' object is not subscriptable

and

  File "lookout/style/format/tests/test_analyzer.py", line 151, in test_train_cutoff_labels
    self.assertIn("javascript", model1, str(model1))
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/model.py", line 270, in __str__
    meta["created_at"] = format_datetime(meta["created_at"])
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/meta.py", line 68, in format_datetime
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")
AttributeError: 'NoneType' object has no attribute 'strftime'
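
Both tracebacks come from formatting metadata that is None. A minimal sketch of None-tolerant formatting follows; the format strings are copied from the tracebacks, while the `format_packages` helper and the "unknown" placeholder are assumptions, not modelforge's actual behaviour.

```python
from datetime import datetime


def format_datetime(dt, placeholder="unknown"):
    """Format a datetime, tolerating a missing value."""
    if dt is None:
        return placeholder
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")


def format_packages(environment):
    """Render 'name==version' pairs, tolerating a missing environment."""
    packages = (environment or {}).get("packages") or []
    return " ".join("%s==%s" % tuple(p) for p in packages)


print(format_datetime(None))
print(format_datetime(datetime(2018, 1, 2, 3, 4, 5)))
print(format_packages(None))
print(format_packages({"packages": [("numpy", "1.15.0"), ("asdf", "2.1.0")]}))
```

Whether `__str__` should print a placeholder or the metadata should always be initialized is a design choice; the sketch only shows the first option.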

Proposal todo list

Spitballing some ideas, @vmarkovtsev WDYT:

  • Structure
    • Model stuff could be in a directory
    • Backend stuff could be in a directory
    • Logging could be in a directory
    • Command line should be in a directory, and split properly (eg tools.py, registry.py)
    • All the rest should be in utils
  • Command line
    • Add command to create configuration (modelforgecfg.py file) or edit a given pre-existing bash/text file
    • Add command to edit only the registry (amend), by uploading a modified meta.json. It should be able to edit either a specific model, or a series of models
    • Add command to add external files to the registry, e.g. tests, travis.yaml file, etc
    • Include gitbooks, update templates
  • Code
    • Switch to Pathlib entirely
    • Update the code, we are not even using the new functionalities of slogging
    • Simplify the code as much as possible. For instance, Model().load( ... ) is overly complex, and some features are useless imo, eg loading a model from the model
  • Documentation
    • Explain configuration, depending on the utilization. Right now it's quite opaque
    • Explain how metadata can be added on the command line, there's no real explanation apart from the default template and a link to src-d/models
    • We should talk about slogging since we use it, and it's part of the API
    • Add doc for new commands
    • Add an index of the API
    • List internal methods that should not be overridden in Model
