feedly / transfer-nlp
NLP library designed for reproducible experimentation management
License: MIT License
Hi! Check out PyTorch Lightning as an option for your backend! We're looking for awesome projects implemented in Lightning.
https://github.com/williamFalcon/pytorch-lightning
Hey!
Not sure if you've seen:
https://github.com/williamFalcon/pytorch-lightning
The fastest-growing PyTorch front-end project.
We're also now venture funded, so we have a full-time team working on this and will be around for a very long time :)
Something like this won't cause an error:

{
    "item": {
        "_name": "Foo",
        "param": "$bar"
    }
}

even if Foo.__init__ doesn't have an input parameter named param. This can lead to hard-to-find typo bugs in configs.
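One way a config loader could catch such typos is by checking the supplied keys against the target class's __init__ signature before building anything. A minimal sketch using the standard-library inspect module (this is an illustration, not the actual transfer-nlp implementation):

```python
import inspect

class Foo:
    def __init__(self, value: str):
        self.value = value

def validate_params(cls, params: dict) -> None:
    """Raise if any config key is not an accepted __init__ parameter."""
    sig = inspect.signature(cls.__init__)
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in sig.parameters.values())
    if accepts_kwargs:
        return  # **kwargs swallows anything, nothing to check
    allowed = set(sig.parameters) - {'self'}
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f'{cls.__name__} got unknown config keys: {sorted(unknown)}')

validate_params(Foo, {'value': 'ok'})       # passes silently
# validate_params(Foo, {'param': '$bar'})   # would raise ValueError
```

Running this pass over the whole experiment dict before instantiation would surface the typo at load time instead of producing a silently misconfigured object.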
The config.py file, which manages the creation of objects, is starting to get quite complicated and hard to maintain / change.
It is a good time to refactor the logic.
Currently, all registrable objects defined in experiment files are instantiated. However, sometimes we would like to have objects that are not instantiated. For example, we might have a list of functions, e.g.:

metrics:
  - $accuracy_score
  - $precision_score
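One way this could work: treat $name values as references to resolve against the registry without calling them. A hedged sketch of the idea (REGISTRY, register_plugin, and the metric functions here are illustrative stand-ins, not the actual transfer-nlp API):

```python
# Illustrative registry of callables; transfer-nlp's real registry differs.
REGISTRY = {}

def register_plugin(fn):
    REGISTRY[fn.__name__] = fn
    return fn

@register_plugin
def accuracy_score(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

@register_plugin
def precision_score(preds, golds):
    tp = sum(p == g == 1 for p, g in zip(preds, golds))
    predicted_pos = sum(p == 1 for p in preds)
    return tp / predicted_pos if predicted_pos else 0.0

def resolve(value):
    """Turn '$name' into the registered object itself, without calling it."""
    if isinstance(value, str) and value.startswith('$'):
        return REGISTRY[value[1:]]
    if isinstance(value, list):
        return [resolve(v) for v in value]
    return value

# The metrics list resolves to the functions themselves, uninstantiated.
metrics = resolve(['$accuracy_score', '$precision_score'])
scores = [m([1, 0, 1], [1, 1, 1]) for m in metrics]
```

The caller then decides when (or whether) to invoke the resolved objects.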
This config didn't work:

@register_plugin
class DemoDefaults:
    def __init__(self, strval: str, intval1: int = 5, intval2: int = None):
        self.strval = strval
        self.intval1 = intval1
        self.intval2 = intval2

@register_plugin
class DemoComplexDefaults:
    def __init__(self, strval: str, obj: DemoDefaults = None):  # use different param and property names as an additional check
        self.simple = strval
        self.complex = obj

experiment = {
    'demo': {
        '_name': 'DemoComplexDefaults',
        'strval': 'foo'
    },
    'obj': {
        '_name': 'DemoDefaults',
        'strval': 'bar',
        'intval1': 20
    }
}

It's because intval2's default is None, which causes it to be left unconfigured in the mode-1 params step; both objects are then configured using all default values in mode 2.
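A common remedy for this class of bug is a sentinel default, so the loader can tell "never configured" apart from "explicitly configured as None". A sketch of the idea (not the actual fix applied to config.py):

```python
_UNSET = object()  # sentinel: distinct from every real value, including None

def resolve_param(config: dict, name: str, default):
    """Treat a missing key as unset rather than conflating it with None."""
    value = config.get(name, _UNSET)
    return default if value is _UNSET else value

cfg = {'strval': 'bar', 'intval2': None}
resolve_param(cfg, 'intval1', 5)   # absent -> falls back to the default 5
resolve_param(cfg, 'intval2', 7)   # present -> keeps the explicit None
```

With a sentinel, a parameter whose default happens to be None no longer gets misclassified as unconfigured.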
Upgrade to the latest feedly client to be compatible with the rest of the feedly code.
Right now the code uses configparser/ini format for parameters. TOML could be a better fit, as it's still quite readable and allows for some structure. One use case I had recently was enabling/disabling featurizers. With the ini format, I had to create a dumb enabled flag in the featurizer and then set all the featurizer enabled flags for every experiment. It worked but was a hassle.
With TOML, we could just have

featurizers = ["$f1", "$f2"]

as config. We'll need to provide some clear documentation though, as in this case the json experiment file will refer to the $featurizers parameter in the config file while the config file will refer to $f1 in the experiment file...
When you want to experiment with someone else's code, you don't want to copy-paste their code.
If you want to use a class AwesomeClass from an awesome github repo, you can do:

from transfer_nlp.plugins.config import register_plugin
from awesome_repo.module import AwesomeClass

register_plugin(AwesomeClass)

and then use it in your experiments.
However, when reusing complex objects, it might be complicated to configure them.
An example is the pre-trained model from the pytorch-pretrained-bert repo, where you can build complex models with nice one-liners such as:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

It's possible to encapsulate these into other classes and have Transfer NLP build them, but it can feel awkward and adds unnecessary complexity / lines of code compared to the initial one-liner.
An alternative is to build these objects with a function; in the previous example we would only write:

@register_function
def bert_classifier(bert_version: str = 'bert-base-uncased', num_labels: int = 4):
    return BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path=bert_version, num_labels=num_labels)

and we could use registered functions just like classes in the config loading.
Can we incorporate the new runner code into one or more examples?
Something like this won't cause a problem:

{
    "item": {
        "_name": "foo",
        "param": "$bar"
    }
}

even if we don't set a value for bar anywhere. This can lead to easily misconfigured objects.
sklearn pipelines are created using nested lists, but this doesn't seem to work with our code:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest

from transfer_nlp.plugins.config import register_plugin, ExperimentConfig

if __name__ == '__main__':
    register_plugin(Pipeline)
    register_plugin(SelectKBest)

    cfg = {
        'pipeline': {
            '_name': 'Pipeline',
            'steps': [
                ['first', '$first'],
                ['second', '$second'],
            ]
        },
        'first': {
            '_name': 'SelectKBest'
        },
        'second': {
            '_name': 'SelectKBest'
        }
    }
    cfg = ExperimentConfig(cfg)

This results in:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '$first' (type <class 'str'>) doesn't

So it looks like $first and $second didn't get substituted.
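A fix would be to recurse into nested lists (and dicts) when substituting $ references. A simplified sketch of the idea, using plain strings as stand-ins for built objects (this is not the actual ExperimentConfig code):

```python
def substitute(node, objects):
    """Recursively replace '$name' strings with built objects,
    descending into nested lists and dicts."""
    if isinstance(node, str) and node.startswith('$'):
        return objects[node[1:]]
    if isinstance(node, list):
        return [substitute(item, objects) for item in node]
    if isinstance(node, dict):
        return {k: substitute(v, objects) for k, v in node.items()}
    return node

built = {'first': 'SelectKBest_1', 'second': 'SelectKBest_2'}  # stand-ins
steps = substitute([['first', '$first'], ['second', '$second']], built)
# steps == [['first', 'SelectKBest_1'], ['second', 'SelectKBest_2']]
```

With the inner lists visited, the sklearn Pipeline would receive the built transformers rather than the raw '$first' strings.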
Some read-only objects can take a while to load in experiments (embeddings, datasets, etc.). The current ExperimentRunner always recreates the entire experiment. It would be nice if we could keep some objects in memory...
Add an experiment_cache parameter to run_all:

def run_all(experiment: Union[str, Path, Dict],
            experiment_cache: Union[str, Path, Dict],
            experiment_config: Union[str, Path],
            report_dir: Union[str, Path],
            trainer_config_name: str = 'trainer',
            reporter_config_name: str = 'reporter',
            **env_vars) -> None:

The cache is just another experiment json. It would be loaded only once at the very beginning, using only the env_vars. Any resulting objects would then be added to env_vars when running each experiment. Objects can optionally implement a Resettable class that has a reset method, which would be called once before each experiment.
Incorrect usage of this feature could lead to non-reproducibility issues, but through docs we could make it clear this should only be used for read-only objects. I think it would be worth doing...
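The Resettable contract could be as small as an abstract base class. A sketch under the assumptions in this issue (the names Resettable, reset, and CachedEmbeddings are proposals, not existing transfer-nlp API):

```python
from abc import ABC, abstractmethod

class Resettable(ABC):
    """Cached objects opt in to being reset between experiments."""
    @abstractmethod
    def reset(self) -> None:
        ...

class CachedEmbeddings(Resettable):
    def __init__(self):
        self.lookups = 0  # cheap stand-in for expensively loaded state
    def reset(self) -> None:
        self.lookups = 0

def reset_cache(env_vars: dict) -> None:
    """Called once before each experiment: reset only objects that opted in."""
    for obj in env_vars.values():
        if isinstance(obj, Resettable):
            obj.reset()

cache = {'embeddings': CachedEmbeddings(), 'seed': 42}
cache['embeddings'].lookups = 10
reset_cache(cache)  # embeddings reset; plain values left alone
```

Keeping the contract to a single method makes it easy for third-party objects to opt in without touching the rest of the runner.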
The Runner class runs each config file sequentially and outputs reporting for each of them.
We would like to be able to aggregate some metrics over several config experiments.
To do that, we need to return the cached experiments after the sequential run.
Some people don't like json files (for readability, unnecessary punctuation, lack of comments, etc.).
It would be nice to be able to load an experiment defined in a YAML file instead.
Similarly, TOML files can come in handy, especially for simple use cases.
When using the sequential Runner, we might be interested in running ablation studies, where we have several instances of possible registrables and we want to authorize only a few of them.
A nice-to-have would be to add a selector class to the library, something like:

from typing import Any, List, Tuple

def multiple_choices_selector(choices: List[Tuple[bool, Any]]) -> List[Any]:
    return [obj for enabled, obj in choices if enabled]

def unique_selector(choices: List[Tuple[bool, Any]]) -> Any:
    res = multiple_choices_selector(choices)
    if len(res) == 1:
        return res[0]
    raise ValueError(f'Unique selector found {len(res)} enabled objects instead of 1')
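Usage in an ablation study might look like this (a self-contained example repeating the proposed selectors; the featurizer and model names are hypothetical stand-ins for registered objects):

```python
from typing import Any, List, Tuple

def multiple_choices_selector(choices: List[Tuple[bool, Any]]) -> List[Any]:
    return [obj for enabled, obj in choices if enabled]

def unique_selector(choices: List[Tuple[bool, Any]]) -> Any:
    res = multiple_choices_selector(choices)
    if len(res) == 1:
        return res[0]
    raise ValueError(f'Unique selector found {len(res)} enabled objects instead of 1')

# Hypothetical ablation: toggle featurizers per experiment via config booleans.
featurizers = multiple_choices_selector([(True, 'ngram_featurizer'),
                                         (False, 'embedding_featurizer'),
                                         (True, 'length_featurizer')])
model = unique_selector([(False, 'logistic_regression'),
                         (True, 'random_forest')])
```

The booleans would come from the config file, so each variant of the study only flips flags instead of restructuring the experiment.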
Lots of times we need to configure lists. It would be nice if this were possible, e.g.:

{
    "pipeline": {
        "_name": "MyPipeline",
        "preprocessors": ["$pp1", "$pp2", ...]
    }
}
Currently, objects are built one by one, and when one fails it throws an error.
It would be great to have a quick pass before instantiating objects to check that all registrable names / aliases are actually registered, and throw an error at that moment.
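Such a pre-pass could walk the experiment dict, collect every _name, and report all unregistered ones at once instead of failing on the first build. A sketch (REGISTRY here is an illustrative set of registered names, not the real registry):

```python
# Illustrative set of registered names; the real registry maps names to classes.
REGISTRY = {'Pipeline', 'SelectKBest'}

def find_unregistered(node, missing=None):
    """Walk the experiment tree and collect every unknown '_name'."""
    if missing is None:
        missing = []
    if isinstance(node, dict):
        name = node.get('_name')
        if name is not None and name not in REGISTRY:
            missing.append(name)
        for value in node.values():
            find_unregistered(value, missing)
    elif isinstance(node, list):
        for item in node:
            find_unregistered(item, missing)
    return missing

experiment = {'model': {'_name': 'Pipeline',
                        'steps': [{'_name': 'SelectKBest'}, {'_name': 'Typo'}]}}
missing = find_unregistered(experiment)  # ['Typo']
```

Raising one error listing every missing name gives the user a single fix-it list rather than an instantiate-fail-repeat loop.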
An experiment is defined by a json file. If you want to run many variations of your model, it may be hard to manage if you have to create almost identical copies of the json file for each variant.
Instead we could imagine having a single general, parameterized json file and then multiple configurations in order to run experiments. To do so effectively, the json structure should remain the same across all variants.
This may mean that pre-built components should have an enabled/disabled flag, so that even if they are included in an experiment, they can be disabled easily via a parameter.
Additionally we can have a "runner" class that can take in a json and config file and run the experiment once for each config with some reports. For now the runner can run the experiments sequentially on a machine, but we could imagine people writing more advanced launchers to do things in parallel.
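The runner loop described above can be sketched in a few lines; run_one below is a stand-in stub for a real single-experiment run, not a transfer-nlp function:

```python
import copy

def run_one(experiment: dict) -> dict:
    """Stub standing in for a real run: just echoes the params it saw."""
    return {'params': {k: v for k, v in experiment.items()
                       if not isinstance(v, dict)}}

def run_sequentially(base_experiment: dict, configs: list) -> list:
    """Run one parameterized experiment once per config, keeping the
    json structure identical across all variants."""
    reports = []
    for cfg in configs:
        experiment = copy.deepcopy(base_experiment)
        experiment.update(cfg)  # fill the parameterized slots for this variant
        reports.append(run_one(experiment))
    return reports

base = {'lr': 0.01, 'dropout': 0.5}
reports = run_sequentially(base, [{'lr': 0.1}, {'lr': 0.001}])
```

Because each variant only overrides values, never structure, the reports stay directly comparable, and a parallel launcher could later replace the for-loop without changing the config format.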
Describe the bug
ExperimentRunner.run_all fails if experiment_cache is None.
The issue comes from line 109, where the default value for the experiment cache (None) is not handled correctly: https://github.com/feedly/transfer-nlp/blob/master/transfer_nlp/runner/experiment_runner.py#L109
Currently, it is possible to use lists of custom objects, but when such a list is the first-level object in the experiment file, it fails to instantiate.
E.g.:

my_objects:
  - _name: my_first_object
  - _name: my_second_object

fails.
It would be nice to have options to disable / enable tensorboard handlers in the PyTorch trainer.
For example, we might always be interested in visualizing the metrics, but not necessarily the embeddings, the gradients, or other quantities.
From the talk today, one good point was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that can download data from URLs and save it locally to disk.
So an example json config could be:

{
    "_name": "Downloader",
    "local_dir": "$my_path",
    "checksums": "$WORK_DIR/checksums_2019_05_23.cfg",  <-- produced by a previous download
    "sentences.txt.gz": {
        "url": "$BASE_URL/sentences.txt.gz",
        "decompress": true
    },
    "word_embeddings.npy": {
        "url": "$BASE_URL/word_embeddings.npy"
    }
}
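A minimal sketch of such a component, standard library only; the class name, parameters, and checksum format are proposals, not existing transfer-nlp API:

```python
import hashlib
import urllib.request
from pathlib import Path

class DataDownloader:
    """Download files to a local dir and verify them against known checksums."""

    def __init__(self, local_dir: str, checksums: dict = None):
        self.local_dir = Path(local_dir)
        self.local_dir.mkdir(parents=True, exist_ok=True)
        # filename -> expected sha256 hex digest, e.g. loaded from a
        # checksums file produced by a previous download
        self.checksums = checksums or {}

    def download(self, filename: str, url: str) -> Path:
        target = self.local_dir / filename
        if not target.exists():
            urllib.request.urlretrieve(url, target)
        self._verify(filename, target)
        return target

    def _verify(self, filename: str, path: Path) -> None:
        expected = self.checksums.get(filename)
        if expected is None:
            return  # no known checksum yet (e.g. first download)
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f'{filename}: checksum mismatch, data may have changed')
```

Failing loudly on a checksum mismatch is the point: an experiment that silently trains on drifted data is exactly the reproducibility problem raised in the talk.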