
feedly / transfer-nlp

291 stars · 16 forks · 2.88 MB

NLP library designed for reproducible experimentation management

License: MIT

Languages: Python 98.93% · Shell 1.07%
Topics: framework, language-model, natural-language-understanding, nlp, playground, pytorch, transfer-learning

transfer-nlp's People

Contributors

0xflotus, brettkoonce, kireet, louistrezzini, mathieu4141, petermartigny


transfer-nlp's Issues

Pytorch Lightning as a back-end

Hi! Check out Pytorch Lightning as an option for your backend! We're looking for awesome projects implemented in Lightning.

https://github.com/williamFalcon/pytorch-lightning

additional parameters don't cause config errors

something like this won't cause an error:

{
    "item": {
        "_name": "Foo",
        "param": "$bar"
    }
}

even if Foo.__init__ doesn't have an input parameter named param. this can lead to hard-to-find config typo bugs.
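
A quick signature check at build time could catch this. A minimal sketch, not the library's actual validation logic (check_extra_params is a hypothetical helper):

import inspect

def check_extra_params(cls, config_params: dict) -> None:
    # Compare configured keys against the parameters accepted by cls.__init__
    sig = inspect.signature(cls.__init__)
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in sig.parameters.values()):
        return  # **kwargs legitimately absorbs anything, so skip the check
    accepted = set(sig.parameters) - {'self'}
    extra = set(config_params) - accepted - {'_name'}
    if extra:
        raise ValueError(f'{cls.__name__} got unexpected config parameters: {sorted(extra)}')

With the config above, check_extra_params(Foo, {'_name': 'Foo', 'param': '$bar'}) would raise instead of silently ignoring param.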

refactoring/config

The config.py file, which manages the creation of objects, is getting quite complicated and hard to maintain or change.

It is a good time to refactor this logic.

Uninstantiated registrables in experiment config files

Currently, all registrable objects defined in experiment files are instantiated. However, sometimes we would like to have objects that are not instantiated. For example, we might have a list of functions, e.g.:

metrics:
  - $accuracy_score
  - $precision_score
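
For context, the consuming object would just hold the callables and invoke them at evaluation time. A minimal sketch of the desired end state (the Evaluator class is hypothetical; the metrics are sklearn's):

from typing import Callable, Dict, List

from sklearn.metrics import accuracy_score, precision_score

class Evaluator:
    # The config layer would pass the metric functions through
    # uninstantiated; they are only called at evaluation time
    def __init__(self, metrics: List[Callable]):
        self.metrics = metrics

    def evaluate(self, y_true, y_pred) -> Dict[str, float]:
        return {metric.__name__: metric(y_true, y_pred) for metric in self.metrics}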

None default causes problems

this config didn't work:

@register_plugin
class DemoDefaults:

    def __init__(self, strval: str, intval1: int = 5, intval2: int = None):
        self.strval = strval
        self.intval1 = intval1
        self.intval2 = intval2

@register_plugin
class DemoComplexDefaults:

    def __init__(self, strval: str, obj: DemoDefaults = None):  # use different param and property names as additional check
        self.simple = strval
        self.complex = obj

experiment = {
    'demo': {
        '_name': 'DemoComplexDefaults',
        'strval': 'foo'
    },
    'obj': {
        '_name': 'DemoDefaults',
        'strval': 'bar',
        'intval1': 20
    }
}

it's because intval2's default is None, which causes it to be treated as unconfigured in the mode-1 parameter step. both objects are then configured using all default values in mode 2.
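
A fix presumably needs to distinguish "no default" from "default is None", which inspect already does. A minimal sketch of the distinction, using DemoDefaults from above:

import inspect

sig = inspect.signature(DemoDefaults.__init__)
for name, param in sig.parameters.items():
    if name == 'self':
        continue
    # inspect.Parameter.empty marks a truly missing default;
    # an explicit default of None is still a usable default
    if param.default is inspect.Parameter.empty:
        print(f'{name}: required')
    else:
        print(f'{name}: default = {param.default!r}')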

upgrade feedly client

upgrade to the latest feedly client to be compatible with the rest of the feedly code

use toml instead of ini format for parameters

right now the code uses configparser/ini format for parameters. toml could be a better fit, as it's still quite readable and allows for some structure. one use case i had recently was enabling/disabling featurizers. with the ini format, i had to create a dumb enabled flag in the featurizer and then set all the featurizer enabled flags for every experiment. it worked, but it was a hassle.

with toml, we could just have

featurizers=["$f1", "$f2"] 

as config. we'll need to provide some clear documentation though, as in this case the json experiment file will refer to the $featurizers parameter in the config file while the config file will refer to $f1 in the experiment file...
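
For illustration, the params file could look like this (a hypothetical sketch; the section and key names are made up):

[training]
lr = 0.001

[features]
featurizers = ["$f1", "$f2"]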

have the possibility to build object with a function instead of a class

When you want to experiment with someone else's code, you don't want to copy-paste it.

If you want to use a class AwesomeClass from an awesome github repo, you can do:

from transfer_nlp.plugins.config import register_plugin
from awesome_repo.module import AwesomeClass

register_plugin(AwesomeClass)

and then use it in your experiments.

However, when reusing complex objects, it can be complicated to configure them.
An example is the pre-trained model from the pytorch-pretrained-bert repo, where you can build complex models with nice one-liners such as model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

It's possible to encapsulate these into other classes and have Transfer NLP build them, but it can feel awkward and adds unnecessary complexity / lines of code compared to the initial one-liner.
An alternative is to build these objects with a function; in the previous example we would only write:

@register_function
def bert_classifier(bert_version: str='bert-base-uncased', num_labels: int=4):
    return BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path=bert_version, num_labels=num_labels)

and we could then use functions just like classes in the config loading.
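
The experiment file could then reference the registered function by name, just like a class. A hypothetical sketch of the resulting config, assuming the same _name lookup:

{
    "model": {
        "_name": "bert_classifier",
        "bert_version": "bert-base-uncased",
        "num_labels": 4
    }
}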

support nested list config

sklearn pipelines are created using nested lists, but this doesn't seem to work with our code:

from sklearn.pipeline import Pipeline

from transfer_nlp.plugins.config import register_plugin, ExperimentConfig
from sklearn.feature_selection import SelectKBest

if __name__ == '__main__':
    register_plugin(Pipeline)
    register_plugin(SelectKBest)
    cfg = {
        'pipeline': {
            '_name': 'Pipeline',
            'steps': [
                ['first', '$first'],
                ['second', '$second'],
            ]
        },
        'first': {
            '_name': 'SelectKBest'
        },
        'second': {
            '_name': 'SelectKBest'
        }
    }
    cfg = ExperimentConfig(cfg)

results in TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '$first' (type <class 'str'>) doesn't

so it looks like $first and $second didn't get substituted
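
A fix would presumably make the substitution step recurse into lists. A minimal sketch of the idea, not the library's actual code:

def resolve(value, objects: dict):
    # Recursively substitute '$name' references inside nested lists and dicts
    if isinstance(value, str) and value.startswith('$'):
        return objects[value[1:]]
    if isinstance(value, list):
        return [resolve(item, objects) for item in value]
    if isinstance(value, dict):
        return {key: resolve(item, objects) for key, item in value.items()}
    return value

With that, [['first', '$first'], ['second', '$second']] would resolve to the list of (name, transformer) pairs that Pipeline expects.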

caching objects in experiment runner

some read-only objects can take a while to load in experiments (embeddings, datasets, etc). The current ExperimentRunner always recreates the entire experiment. It would be nice if we could keep some objects in memory...

Proposal

add an experiment_cache parameter to run_all

    def run_all(experiment: Union[str, Path, Dict],
                experiment_cache: Union[str, Path, Dict],
                experiment_config: Union[str, Path],
                report_dir: Union[str, Path],
                trainer_config_name: str = 'trainer',
                reporter_config_name: str = 'reporter',
                **env_vars) -> None:

The cache is just another experiment json. it would be loaded only once, at the very beginning, using only the env_vars. any resulting objects would then be added to env_vars when running each experiment. objects can optionally extend a Resettable class that has a reset method that would be called once before each experiment.

incorrect usage of this feature could lead to non-reproducibility issues, but through docs we could make it clear this should only be used for read-only objects. i think it would be worth doing...
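
A minimal sketch of the proposed contract (names are hypothetical; run_single stands in for whatever runs a single experiment today):

from typing import Any, Dict, List

class Resettable:
    # Cached objects that carry state can opt in to being reset
    # once before each experiment
    def reset(self) -> None:
        raise NotImplementedError

def run_cached(cache_objects: Dict[str, Any], experiments: List[dict], **env_vars) -> None:
    env_vars = {**env_vars, **cache_objects}  # cached objects become env vars
    for experiment in experiments:
        for obj in cache_objects.values():
            if isinstance(obj, Resettable):
                obj.reset()  # clear per-run state on stateful cached objects
        run_single(experiment, **env_vars)  # hypothetical per-experiment entry point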

feature/return-cached-experiment

The Runner class runs each config file sequentially and outputs reporting for each of them.
We would like to be able to aggregate some metrics over several config experiments.

To do that, we need to return the cached experiments after the sequential run.

YAML / TOML files for experiments

Some people don't like json files (readability, unnecessary punctuation, lack of comments, etc.).
It would be nice to be able to load an experiment defined in a YAML file instead.

Similarly, TOML files can come in handy especially with simple use cases.

Registrable selector

When using the sequential Runner, we might be interested in running ablation studies, where we have several instances of possible registrables and want to enable only a few of them.

A nice-to-have would be to add selector functions to the library, something like:

from typing import Any, List, Tuple

def multiple_choices_selector(choices: List[Tuple[bool, Any]]) -> List[Any]:
    return [obj for enabled, obj in choices if enabled]

def unique_selector(choices: List[Tuple[bool, Any]]) -> Any:
    res = multiple_choices_selector(choices)

    if len(res) == 1:
        return res[0]

    raise ValueError(f'Unique selector found {len(res)} objects enabled instead of 1')
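
In an experiment file, this could then drive ablations like the following (a hypothetical sketch, assuming nested list configs are supported):

{
    "featurizers": {
        "_name": "multiple_choices_selector",
        "choices": [[true, "$tfidf"], [false, "$embeddings"]]
    }
}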

allow list configurations

lots of times we need to configure lists. it would be nice if this were possible, e.g.

{
    "pipeline": {
        "_name": "MyPipeline",
        "preprocessors": ["$pp1", "$pp2", ...]
    }
}

Check that all registrables are registered

Currently, objects are built one by one, and when one fails an error is thrown.

It would be great to have a quick pass before instantiating objects to check that all registrable names / aliases are actually registered, and throw an error at that point.
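
A minimal sketch of such a pass, assuming a registry dict that maps registered names to plugins:

def check_all_registered(experiment: dict, registry: dict) -> None:
    # Walk the experiment tree, collect every '_name' value, and fail
    # fast if any of them was never registered
    missing = set()

    def walk(node) -> None:
        if isinstance(node, dict):
            name = node.get('_name')
            if name is not None and name not in registry:
                missing.add(name)
            for value in node.values():
                walk(value)
        elif isinstance(node, list):
            for value in node:
                walk(value)

    walk(experiment)
    if missing:
        raise ValueError(f'Unregistered names in experiment: {sorted(missing)}')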

Generalized "Runner"

An experiment is defined by a json file. If you want to run many variations of your model, it may be hard to manage if you have to create almost identical copies of the json file for each variant.

Instead we could imagine having a single general, parameterized json file and then multiple configurations in order to run experiments. To do so effectively, the json structure should remain the same across all variants.

This may mean that pre-built components should have an enabled/disabled flag, so that even if they are included in an experiment, they can easily be disabled via a parameter.

Additionally we can have a "runner" class that can take in a json and config file and run the experiment once for each config with some reports. For now the runner can run the experiments sequentially on a machine, but we could imagine people writing more advanced launchers to do things in parallel.
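
A minimal sketch of that runner loop (names are hypothetical, and it assumes ExperimentConfig resolves '$param' placeholders from keyword overrides the way run_all's env_vars do):

from pathlib import Path
from typing import Any, Dict, List

from transfer_nlp.plugins.config import ExperimentConfig

def run_experiments(template: Path, configs: List[Dict[str, Any]]) -> None:
    # Run the same parameterized experiment template once per configuration
    for params in configs:
        experiment = ExperimentConfig(template, **params)  # assumed override mechanism
        experiment['trainer'].train()  # assumed access pattern and entry point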

List of custom objects doesn't work on first level objects

Currently, it is possible to use lists of custom objects, but when such a list is a first-level object in the experiment file, it fails to instantiate.

e.g.:

my_objects:
  - _name: my_first_object
  - _name: my_second_object

fails

Tensorboard options

It would be nice to have options to enable / disable tensorboard handlers in the PyTorch trainer.
For example, we might always be interested in visualizing the metrics, but not necessarily the embeddings, the gradients, or other quantities.
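
A hypothetical sketch of what the trainer config could expose (the flag and trainer names are made up):

{
    "trainer": {
        "_name": "BasicTrainer",
        "tensorboard_metrics": true,
        "tensorboard_embeddings": false,
        "tensorboard_gradients": false
    }
}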

Downloader Plugin

From the talk today, one good point was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that can download data from URLs and save it locally to disk.

  • If the files exist, the downloader can skip the download
  • the downloader should calculate checksums for downloaded files. it should produce a checksums.cfg file to simplify reusing these in configuration later
  • the downloader should allow checksums to be configured in the experiment file. when set, the downloader would verify the downloaded file is the same as the one specified in the experiment.

so an example json config could be:

{
  "_name": "Downloader",
  "local_dir": "$my_path",
  "checksums": "$WORK_DIR/checksums_2019_05_23.cfg", <-- produced by a previous download 
  "sentences.txt.gz": {
    "url": "$BASE_URL/sentences.txt.gz",
    "decompress": true
  },
  "word_embeddings.npy": {
    "url": "$BASE_URL/word_embeddings.npy"
  }
}
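
A minimal sketch of what such a component could do (not a real API; the function and parameter names are made up):

import gzip
import hashlib
import shutil
import urllib.request
from pathlib import Path

def download(url: str, local_dir: Path, expected_sha256: str = None, decompress: bool = False) -> Path:
    local_dir.mkdir(parents=True, exist_ok=True)
    target = local_dir / url.rsplit('/', 1)[-1]
    if not target.exists():  # skip the download when the file already exists
        urllib.request.urlretrieve(url, target)
    # Verify against the checksum pinned in the experiment file, if any
    digest = hashlib.sha256(target.read_bytes()).hexdigest()
    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(f'Checksum mismatch for {target}: got {digest}')
    if decompress and target.suffix == '.gz':
        with gzip.open(target, 'rb') as src, open(target.with_suffix(''), 'wb') as dst:
            shutil.copyfileobj(src, dst)
    return target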
