modelforge issues from src-d

Modernize Modelforge CI

Build and push the PyPI package on tags (working example: https://github.com/src-d/minhashcuda/blob/master/.travis.yml#L15)
Use flake8 as in sourced-ml

Test Python 3.7 in Travis

Example: https://github.com/src-d/style-analyzer/blob/master/.travis.yml

We should not drop 3.4 here.

Documentation

As noted in https://github.com/src-d/backlog/issues/1205#issuecomment-400283991 we need to properly document how Modelforge currently works, and how to use it both internally and externally.

ALWAYS_SIGNOFF env variable is not taken into account

Even when the MODELFORGE_ALWAYS_SIGNOFF environment variable is set to True, the index is committed without a DCO:

➜  ~ modelforge publish -f .modelforge/bot_detection/bot_detection.asdf --meta .modelforge/bot_detection/template_meta.json
INFO:21f9:GitIndex:Cached index is not up to date, pulling warenlg/models
Total 0 (delta 0), reused 0 (delta 0), pack-reused 0
INFO:21f9:generic:Reading /home/waren/.modelforge/bot_detection/bot_detection.asdf (100.0 kB)...
INFO:21f9:gcs-backend:Connecting to the bucket...
INFO:21f9:gcs-backend:Uploading bot-detection from /home/waren/.modelforge/bot_detection/bot_detection.asdf...
[################################] 98304/100278 - 00:00:00
INFO:21f9:publish_model:Uploaded as https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbot-detection%2F599cf161-8e51-44ad-a576-3dd1518afb80.asdf
INFO:21f9:publish_model:Updating the models index...
INFO:21f9:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_model.md.jinja2
INFO:21f9:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_readme.md.jinja2
INFO:21f9:GitIndex:Added /home/waren/.modelforge/cache/source{d}/warenlg/models/bot-detection/599cf161-8e51-44ad-a576-3dd1518afb80.md
INFO:21f9:GitIndex:Updated /home/waren/.modelforge/cache/source{d}/warenlg/models/README.md
INFO:21f9:GitIndex:Writing the new index.json ...
INFO:21f9:GitIndex:Committing the index without DCO
INFO:21f9:GitIndex:Pushing the updated index ...
Push to ssh://[email protected]/warenlg/models successful.
INFO:21f9:publish_model:Successfully published

Indeed, the signoff value can never be None here https://github.com/src-d/modelforge/blob/master/modelforge/index.py#L51 because it is parsed from the arguments with action="store_true" https://github.com/src-d/modelforge/blob/master/modelforge/__main__.py#L37, which defaults to False when the flag is omitted.
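A minimal sketch of one possible fix, not the actual modelforge code. The flag name --signoff, the helper resolve_signoff and the accepted truthy strings are assumptions made for illustration. The idea is to pass default=None to the store_true action so that "flag not given" can be told apart from an explicit choice, and only then fall back to the environment variable:

```python
import argparse


def resolve_signoff(argv, environ):
    """Return the effective signoff setting (hypothetical helper)."""
    parser = argparse.ArgumentParser()
    # default=None replaces the implicit False of store_true,
    # so an omitted flag is distinguishable from an explicit one.
    parser.add_argument("--signoff", action="store_true", default=None)
    args = parser.parse_args(argv)
    if args.signoff is not None:
        return args.signoff
    # Assumed convention: these strings count as "true".
    value = environ.get("MODELFORGE_ALWAYS_SIGNOFF", "")
    return value.strip().lower() in ("1", "true", "yes")


# For comparison: plain store_true yields False, never None.
plain = argparse.ArgumentParser()
plain.add_argument("--signoff", action="store_true")
plain_default = plain.parse_args([]).signoff
```

In real use, environ would be os.environ and argv would be sys.argv[1:].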

Trailing dots in log message raise AssertionError-s

A function decorator was added in 9111b35 to check whether log messages end with a dot.
It works as intended: it raises an error whenever such a log message appears:

➜  ~ modelforge publish -f .modelforge/bot_detection/bot_detection.asdf --meta .modelforge/bot_detection/template_meta.json
INFO:0189:generic:Reading /home/waren/.modelforge/bot_detection/bot_detection.asdf (100.0 kB)...
INFO:0189:gcs-backend:Connecting to the bucket...
INFO:0189:gcs-backend:Uploading bot-detection from /home/waren/.modelforge/bot_detection/bot_detection.asdf...
[################################] 98304/100278 - 00:00:00
INFO:0189:publish_model:Uploaded as https://storage.googleapis.com/models.cdn.sourced.tech/models%2Fbot-detection%2F599cf161-8e51-44ad-a576-3dd1518afb80.asdf
INFO:0189:publish_model:Updating the models index...
INFO:0189:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_model.md.jinja2
INFO:0189:GitIndex:Loaded /home/waren/.local/lib/python3.6/site-packages/modelforge/templates/template_readme.md.jinja2
INFO:0189:GitIndex:Added /home/waren/.modelforge/cache/source{d}/warenlg/models/bot-detection/599cf161-8e51-44ad-a576-3dd1518afb80.md
INFO:0189:GitIndex:Updated /home/waren/.modelforge/cache/source{d}/warenlg/models/README.md
INFO:0189:GitIndex:Writing the new index.json ...
INFO:0189:GitIndex:Committing the index without DCO
INFO:0189:GitIndex:Pushing the updated index ...
Push to ssh://[email protected]/warenlg/models successful.
Traceback (most recent call last):
  File "/home/waren/.local/bin/modelforge", line 11, in <module>
    sys.exit(main())
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/__main__.py", line 122, in main
    return handler(args)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/backends.py", line 93, in wrapped_supply_backend
    return func(args, backend, log)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/registry.py", line 84, in publish_model
    log.info("Successfully published.")
  File "/usr/lib/python3.6/logging/__init__.py", line 1308, in info
    self._log(INFO, msg, args, **kwargs)
  File "/usr/lib/python3.6/logging/__init__.py", line 1444, in _log
    self.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1454, in handle
    self.callHandlers(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 1516, in callHandlers
    hdlr.handle(record)
  File "/usr/lib/python3.6/logging/__init__.py", line 865, in handle
    self.emit(record)
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/slogging.py", line 75, in decorated_with_check_trailing_dot
    (record.name, msg))
AssertionError: Log message is not allowed to have a trailing dot: publish_model: "Successfully published."

So, let's fix those log messages once and for all.
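One way to find the offending messages before they fail at runtime is a static scan of the sources. The sketch below is only an illustration: the function name find_trailing_dot_logs is hypothetical, and the rule "a single trailing dot is forbidden, an ellipsis is allowed" is an assumption inferred from the logs above, where messages ending in "..." pass the check.

```python
import ast

# Logger method names considered by the scan.
LOG_METHODS = {"debug", "info", "warning", "error", "critical", "exception"}


def find_trailing_dot_logs(source):
    """Return sorted (line, message) pairs for suspicious log calls."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.Call):
            continue
        if not isinstance(node.func, ast.Attribute):
            continue
        if node.func.attr not in LOG_METHODS or not node.args:
            continue
        first = node.args[0]
        # Only literal string messages can be checked statically.
        if isinstance(first, ast.Constant) and isinstance(first.value, str):
            msg = first.value
            # Assumed rule: single trailing dot is bad, "..." is fine.
            if msg.endswith(".") and not msg.endswith(".."):
                hits.append((node.lineno, msg))
    return sorted(hits)


sample = '''
log.info("Successfully published.")
log.info("Updating the models index...")
log.info("Committing the index without DCO")
'''
hits = find_trailing_dot_logs(sample)
```

Running such a scan over the package in a unit test would catch new trailing dots in CI instead of in production.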

Problems with pickling loaded model containing numpy.ndarray

I ran into problems using multiprocessing with a loaded modelforge model. It happens in both lazy and non-lazy load modes.

Here is a code sample that reproduces the error:

from multiprocessing import Pool
import pickle
import traceback
import numpy
from modelforge import Model


class NumpyArray(Model):
    NAME = "numpy_array"
    
    def __init__(self, **kwargs):
        super().__init__(**kwargs)
        self.array = numpy.random.normal(size=(4, 4))
    
    def _generate_tree(self):
        tree = self.__dict__.copy()
        for key in vars(Model()):
            del tree[key]
        return tree
    
    def _load_tree(self, tree):
        self.__dict__.update(tree)

    def pickle_test(self, path: str):
        with open(path, "wb") as out:
            pickle.dump(self, out)
            
    def mult(self, coeff: float):
        return self.array * coeff
    
    def multithread_test(self):
        coeffs = numpy.random.normal(size=16)
        with Pool(4) as pool:
            results = pool.map(self.mult, coeffs)
        return sum(results)

Test non-lazy mode:

arr_obj = NumpyArray()
arr_obj.save("numpy_array.asdf")

new_arr_obj = NumpyArray()
new_arr_obj.load("numpy_array.asdf", lazy=False)
new_arr_obj.pickle_test("array.pkl")

Here is the output:

TypeError                                 Traceback (most recent call last)
<ipython-input-148-3f72419a5166> in <module>()
----> 1 new_arr_obj.pickle_test("array.pkl")

<ipython-input-142-d11f51c103b9> in pickle_test(self, path)
     24     def pickle_test(self, path: str):
     25         with open(path, "wb") as out:
---> 26             pickle.dump(self, out)
     27 
     28     def mult(self, coeff: float):

TypeError: cannot serialize '_io.BufferedReader' object

Same with multiprocessing:

new_arr_obj.multithread_test()

This produces:

TypeError                                 Traceback (most recent call last)
<ipython-input-149-e6fc3a006712> in <module>()
      4 new_arr_obj = NumpyArray()
      5 new_arr_obj.load("numpy_array.asdf", lazy=False)
----> 6 new_arr_obj.multithread_test()

<ipython-input-142-d11f51c103b9> in multithread_test(self)
     32         coeffs = numpy.random.normal(size=16)
     33         with Pool(4) as pool:
---> 34             results = pool.map(self.mult, coeffs)
     35         return sum(results)
     36 

/usr/lib/python3.5/multiprocessing/pool.py in map(self, func, iterable, chunksize)
    258         in a list that is returned.
    259         '''
--> 260         return self._map_async(func, iterable, mapstar, chunksize).get()
    261 
    262     def starmap(self, func, iterable, chunksize=None):

/usr/lib/python3.5/multiprocessing/pool.py in get(self, timeout)
    606             return self._value
    607         else:
--> 608             raise self._value
    609 
    610     def _set(self, i, obj):

/usr/lib/python3.5/multiprocessing/pool.py in _handle_tasks(taskqueue, put, outqueue, pool, cache)
    383                         break
    384                     try:
--> 385                         put(task)
    386                     except Exception as e:
    387                         job, ind = task[:2]

/usr/lib/python3.5/multiprocessing/connection.py in send(self, obj)
    204         self._check_closed()
    205         self._check_writable()
--> 206         self._send_bytes(ForkingPickler.dumps(obj))
    207 
    208     def recv_bytes(self, maxlength=None):

/usr/lib/python3.5/multiprocessing/reduction.py in dumps(cls, obj, protocol)
     48     def dumps(cls, obj, protocol=None):
     49         buf = io.BytesIO()
---> 50         cls(buf, protocol).dump(obj)
     51         return buf.getbuffer()
     52 

TypeError: cannot serialize '_io.BufferedReader' object

The same happens in lazy mode. Calling these functions on the original class instance works fine.

This fixes the problem locally (can be done inside _load_tree()):

new_arr_obj.array = numpy.array(new_arr_obj.array)
new_arr_obj.multithread_test()
new_arr_obj.pickle_test("array.pkl")

It passes, but non-lazy loading of numpy arrays looks like it is meant to work out of the box.
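The TypeError above suggests that the loaded array still holds a reference to the open file it was read from. That is an assumption, not something verified against the modelforge or asdf sources. The standard-library sketch below only illustrates the general mechanism: an object that references an open binary file cannot be pickled, while an in-memory copy of its data can. LazyBlock is a hypothetical stand-in for the loaded array.

```python
import os
import pickle
import tempfile


class LazyBlock:
    """Hypothetical stand-in for data still tied to its source file."""

    def __init__(self, fobj):
        self.fobj = fobj  # open binary file handle

    def materialize(self):
        """Copy the data into memory; the result needs no file handle."""
        self.fobj.seek(0)
        return self.fobj.read()


def can_pickle(obj):
    """Report whether pickle accepts the object."""
    try:
        pickle.dumps(obj)
        return True
    except TypeError:
        return False


with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"\x00" * 16)
    path = tmp.name

with open(path, "rb") as fobj:
    block = LazyBlock(fobj)
    lazy_ok = can_pickle(block)     # fails: holds a BufferedReader
    data = block.materialize()
    copied_ok = can_pickle(data)    # works: plain bytes
os.remove(path)
```

The numpy.array(...) copy in the workaround plays the role of materialize() here.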

Bad github credentials leaves cache in an unstable state

When GitHub authentication fails (for example, when given bad credentials), modelforge's cache is left in an unstable state. Further attempts to upload a model fail, even with the right credentials, with the following error:

Traceback (most recent call last):
 File "/home/tristan/.pyenv/versions/3.6.0/bin/modelforge", line 10, in <module>
   sys.exit(main())
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/__main__.py", line 122, in main
   return handler(args)
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/registry.py", line 97, in list_models
   password=args.password, cache=args.cache, log_level=args.log_level)
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 83, in __init__
   self.fetch()
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 110, in fetch
   if self._are_local_and_remote_heads_different():
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/modelforge/index.py", line 250, in _are_local_and_remote_heads_different
   local_head = Repo(self.cached_repo).head()
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dulwich/repo.py", line 459, in head
   return self.refs[b'HEAD']
 File "/home/tristan/.pyenv/versions/3.6.0/lib/python3.6/site-packages/dulwich/refs.py", line 284, in __getitem__
   raise KeyError(name)
KeyError: b'HEAD'

Manually deleting the cache solves the problem.
Version: 0.12.1
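A minimal sketch of how the manual cleanup could be automated, under the assumption that a KeyError while reading HEAD means the cached clone is unusable. The helper reset_cache_if_broken and its read_head parameter are hypothetical. With dulwich, read_head could be lambda path: Repo(path).head(), the call that fails in the traceback above.

```python
import os
import shutil
import tempfile


def reset_cache_if_broken(cache_dir, read_head):
    """Delete cache_dir if its HEAD cannot be read; return True if deleted."""
    try:
        read_head(cache_dir)
    except KeyError:
        # Same failure mode as in the traceback: KeyError: b'HEAD'.
        shutil.rmtree(cache_dir, ignore_errors=True)
        return True
    return False


def broken_head(path):
    """Simulates a half-initialized clone."""
    raise KeyError(b"HEAD")


# Demonstration on a throwaway directory.
demo_dir = tempfile.mkdtemp()
was_reset = reset_cache_if_broken(demo_dir, broken_head)
still_there = os.path.exists(demo_dir)
```

After the reset, the next fetch would start from a fresh clone instead of failing on the stale one.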

DocumentFrequencies model from sourced/ml broke when imported in the IPython environment

Using modelforge-0.5.1a0 in Jupyter Notebook 5.0.0.

When importing the DocumentFrequencies model from sourced/ml in a Jupyter notebook:

from sourced.ml.models.df import DocumentFrequencies
DocumentFrequencies()

the following error is raised:

AttributeError                            Traceback (most recent call last)
/usr/local/lib/python3.5/dist-packages/IPython/core/formatters.py in __call__(self, obj)
    691                 type_pprinters=self.type_printers,
    692                 deferred_pprinters=self.deferred_printers)
--> 693             printer.pretty(obj)
    694             printer.flush()
    695             return stream.getvalue()

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in pretty(self, obj)
    378                             if callable(meth):
    379                                 return meth(obj, self, cycle)
--> 380             return _default_pprint(obj, self, cycle)
    381         finally:
    382             self.end_group()

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in _default_pprint(obj, p, cycle)
    493     if _safe_getattr(klass, '__repr__', None) is not object.__repr__:
    494         # A user-provided repr. Find newlines and replace them with p.break_()
--> 495         _repr_pprint(obj, p, cycle)
    496         return
    497     p.begin_group(1, '<')

/usr/local/lib/python3.5/dist-packages/IPython/lib/pretty.py in _repr_pprint(obj, p, cycle)
    691     """A pprint that just redirects to the normal repr function."""
    692     # Find newlines and replace them with p.break_()
--> 693     output = repr(obj)
    694     for idx,output_line in enumerate(output.splitlines()):
    695         if idx:

/usr/local/lib/python3.5/dist-packages/modelforge/model.py in __repr__(self)
    150     created_at = metaprop("created_at")
    151     version = metaprop("version")
--> 152     parent = metaprop("parent")
    153     license = metaprop("license")
    154 

AttributeError: module '__main__' has no attribute '__file__'
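In interactive sessions such as IPython, the __main__ module has no __file__ attribute, which is what the error above reports. A defensive lookup could look like the sketch below. The helper module_file and the placeholder string are hypothetical and not taken from modelforge.

```python
import sys
import types


def module_file(module, default="<interactive>"):
    """Return the module's file path, or a placeholder if it has none."""
    return getattr(module, "__file__", None) or default


# A freshly created module object has no __file__, like __main__ in IPython.
fake_main = types.ModuleType("__main__")
interactive_path = module_file(fake_main)

# Typical use: the real __main__ module of the running process.
current_path = module_file(sys.modules["__main__"])
```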

Errors in the model's description

When running style-analyzer's tests with modelforge 0.11.0, I hit some errors caused by the model's description, such as:

Traceback (most recent call last):
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/lookout/style/typos/tests/test_ranking.py", line 69, in test_save_load
    print(ranker)
  File "/home/travis/virtualenv/python3.5.6/lib/python3.5/site-packages/modelforge/model.py", line 274, in __str__
    " ".join("%s==%s" % tuple(p) for p in self.environment["packages"])
TypeError: 'NoneType' object is not subscriptable

and

  File "lookout/style/format/tests/test_analyzer.py", line 151, in test_train_cutoff_labels
    self.assertIn("javascript", model1, str(model1))
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/model.py", line 270, in __str__
    meta["created_at"] = format_datetime(meta["created_at"])
  File "/home/waren/.local/lib/python3.6/site-packages/modelforge/meta.py", line 68, in format_datetime
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")
AttributeError: 'NoneType' object has no attribute 'strftime'
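Both tracebacks come from metadata fields that are None when the description is built. A None-tolerant version of the two formatting steps could look like this sketch. The function names and the "unknown" placeholder are assumptions for illustration. Only the format strings are copied from the tracebacks above.

```python
from datetime import datetime


def describe_packages(environment):
    """Format the package list; tolerate a missing environment."""
    packages = (environment or {}).get("packages") or []
    return " ".join("%s==%s" % tuple(p) for p in packages)


def safe_format_datetime(dt, placeholder="unknown"):
    """Format a datetime, or return a placeholder when it is None."""
    if dt is None:
        return placeholder
    return dt.strftime("%Y-%m-%d %H:%M:%S%z")


empty_env = describe_packages(None)
one_pkg = describe_packages({"packages": [["numpy", "1.15.0"]]})
missing_date = safe_format_datetime(None)
some_date = safe_format_datetime(datetime(2018, 1, 2, 3, 4, 5))
```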

Proposal todo list

Spitballing some ideas, @vmarkovtsev WDYT:
