feedly / transfer-nlp
NLP library designed for reproducible experimentation management
License: MIT License
Hi! Check out PyTorch Lightning as an option for your backend! We're looking for awesome projects implemented in Lightning.
https://github.com/williamFalcon/pytorch-lightning
Hey!
Not sure if you've seen:
https://github.com/williamFalcon/pytorch-lightning
The fastest-growing PyTorch front-end project.
We're also now venture funded, so we have a full-time team working on this and will be around for a very long time :)
Something like this won't cause an error:

{
    "item": {
        "_name": "Foo",
        "param": "$bar"
    }
}

even if Foo.__init__ doesn't have an input parameter named param. This can lead to hard-to-find typo bugs in configs.
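One way a config loader could catch such typos is by checking the supplied keys against the target class's __init__ signature before building anything. A minimal sketch using the standard-library inspect module (this is an illustration, not the actual transfer-nlp implementation):

```python
import inspect

class Foo:
    def __init__(self, value: str):
        self.value = value

def validate_params(cls, params: dict) -> None:
    """Raise if any config key is not an accepted __init__ parameter."""
    sig = inspect.signature(cls.__init__)
    accepts_kwargs = any(p.kind is inspect.Parameter.VAR_KEYWORD
                         for p in sig.parameters.values())
    if accepts_kwargs:
        return  # **kwargs swallows anything, nothing to check
    allowed = set(sig.parameters) - {'self'}
    unknown = set(params) - allowed
    if unknown:
        raise ValueError(f'{cls.__name__} got unknown config keys: {sorted(unknown)}')

validate_params(Foo, {'value': 'ok'})       # passes silently
# validate_params(Foo, {'param': '$bar'})   # would raise ValueError
```

Running this pass over the whole experiment dict before instantiation would surface the typo at load time instead of producing a silently misconfigured object.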
The config.py file, which manages the creation of objects, is starting to get quite complicated and hard to maintain / change.
It is a good time to refactor the logic.
Currently, all registrable objects defined in experiment files are instantiated. However, sometimes we would like to have objects that are not instantiated. For example, we might have a list of functions, e.g.:

metrics:
  - $accuracy_score
  - $precision_score
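One way this could work: treat $name values as references to resolve against the registry without calling them. A hedged sketch of the idea (REGISTRY, register_plugin, and the metric functions here are illustrative stand-ins, not the actual transfer-nlp API):

```python
# Illustrative registry of callables; transfer-nlp's real registry differs.
REGISTRY = {}

def register_plugin(fn):
    REGISTRY[fn.__name__] = fn
    return fn

@register_plugin
def accuracy_score(preds, golds):
    return sum(p == g for p, g in zip(preds, golds)) / len(golds)

@register_plugin
def precision_score(preds, golds):
    tp = sum(p == g == 1 for p, g in zip(preds, golds))
    predicted_pos = sum(p == 1 for p in preds)
    return tp / predicted_pos if predicted_pos else 0.0

def resolve(value):
    """Turn '$name' into the registered object itself, without calling it."""
    if isinstance(value, str) and value.startswith('$'):
        return REGISTRY[value[1:]]
    if isinstance(value, list):
        return [resolve(v) for v in value]
    return value

# The metrics list resolves to the functions themselves, uninstantiated.
metrics = resolve(['$accuracy_score', '$precision_score'])
scores = [m([1, 0, 1], [1, 1, 1]) for m in metrics]
```

The caller then decides when (or whether) to invoke the resolved objects.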
This config didn't work:

@register_plugin
class DemoDefaults:
    def __init__(self, strval: str, intval1: int = 5, intval2: int = None):
        self.strval = strval
        self.intval1 = intval1
        self.intval2 = intval2

@register_plugin
class DemoComplexDefaults:
    def __init__(self, strval: str, obj: DemoDefaults = None):  # use different param and property names as an additional check
        self.simple = strval
        self.complex = obj

experiment = {
    'demo': {
        '_name': 'DemoComplexDefaults',
        'strval': 'foo'
    },
    'obj': {
        '_name': 'DemoDefaults',
        'strval': 'bar',
        'intval1': 20
    }
}

It's because intval2's default is None, which causes it to be left unconfigured in the mode-1 params step; both objects are then configured using all default values in mode 2.
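A common remedy for this class of bug is a sentinel default, so the loader can tell "never configured" apart from "explicitly configured as None". A sketch of the idea (not the actual fix applied to config.py):

```python
_UNSET = object()  # sentinel: distinct from every real value, including None

def resolve_param(config: dict, name: str, default):
    """Treat a missing key as unset rather than conflating it with None."""
    value = config.get(name, _UNSET)
    return default if value is _UNSET else value

cfg = {'strval': 'bar', 'intval2': None}
resolve_param(cfg, 'intval1', 5)   # absent -> falls back to the default 5
resolve_param(cfg, 'intval2', 7)   # present -> keeps the explicit None
```

With a sentinel, a parameter whose default happens to be None no longer gets misclassified as unconfigured.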
Upgrade to the latest feedly client to be compatible with the rest of the feedly code.
Right now the code uses configparser/ini format for parameters. TOML could be a better fit, as it's still quite readable and allows for some structure. One use case I had recently was enabling/disabling featurizers. With the ini format, I had to create a dumb enabled flag in the featurizer and then set all the featurizer enabled flags for every experiment. It worked but was a hassle.
With TOML, we could just have

featurizers = ["$f1", "$f2"]

as config. We'll need to provide some clear documentation though, as in this case the json experiment file will refer to the $featurizers parameter in the config file while the config file will refer to $f1 in the experiment file...
When you want to experiment with someone else's code, you don't want to copy-paste their code.
If you want to use a class AwesomeClass from an awesome github repo, you can do:

from transfer_nlp.plugins.config import register_plugin
from awesome_repo.module import AwesomeClass

register_plugin(AwesomeClass)

and then use it in your experiments.
However, when reusing complex objects, it might be complicated to configure them.
An example is the pre-trained model from the pytorch-pretrained-bert repo, where you can build complex models with nice one-liners such as:

model = BertForSequenceClassification.from_pretrained('bert-base-uncased', num_labels=4)

It's possible to encapsulate these into other classes and have Transfer NLP build them, but it can feel awkward and adds unnecessary complexity / lines of code compared to the initial one-liner.
An alternative is to build these objects with a function; in the previous example we would only write:

@register_function
def bert_classifier(bert_version: str = 'bert-base-uncased', num_labels: int = 4):
    return BertForSequenceClassification.from_pretrained(pretrained_model_name_or_path=bert_version, num_labels=num_labels)

and we could use registered functions just like classes in the config loading.
Can we incorporate the new runner code into one or more examples?
Something like this won't cause a problem:

{
    "item": {
        "_name": "foo",
        "param": "$bar"
    }
}

even if we don't set a value for bar anywhere. This can lead to easily misconfigured objects.
sklearn pipelines are created using nested lists, but this doesn't seem to work with our code:

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest

from transfer_nlp.plugins.config import register_plugin, ExperimentConfig

if __name__ == '__main__':
    register_plugin(Pipeline)
    register_plugin(SelectKBest)

    cfg = {
        'pipeline': {
            '_name': 'Pipeline',
            'steps': [
                ['first', '$first'],
                ['second', '$second'],
            ]
        },
        'first': {
            '_name': 'SelectKBest'
        },
        'second': {
            '_name': 'SelectKBest'
        }
    }
    cfg = ExperimentConfig(cfg)

This results in:

TypeError: All intermediate steps should be transformers and implement fit and transform or be the string 'passthrough' '$first' (type <class 'str'>) doesn't

So it looks like $first and $second didn't get substituted.
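A fix would be to recurse into nested lists (and dicts) when substituting $ references. A simplified sketch of the idea, using plain strings as stand-ins for built objects (this is not the actual ExperimentConfig code):

```python
def substitute(node, objects):
    """Recursively replace '$name' strings with built objects,
    descending into nested lists and dicts."""
    if isinstance(node, str) and node.startswith('$'):
        return objects[node[1:]]
    if isinstance(node, list):
        return [substitute(item, objects) for item in node]
    if isinstance(node, dict):
        return {k: substitute(v, objects) for k, v in node.items()}
    return node

built = {'first': 'SelectKBest_1', 'second': 'SelectKBest_2'}  # stand-ins
steps = substitute([['first', '$first'], ['second', '$second']], built)
# steps == [['first', 'SelectKBest_1'], ['second', 'SelectKBest_2']]
```

With the inner lists visited, the sklearn Pipeline would receive the built transformers rather than the raw '$first' strings.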
Some read-only objects can take a while to load in experiments (embeddings, datasets, etc.). The current ExperimentRunner always recreates the entire experiment. It would be nice if we could keep some objects in memory...
Add an experiment_cache parameter to run_all:

def run_all(experiment: Union[str, Path, Dict],
            experiment_cache: Union[str, Path, Dict],
            experiment_config: Union[str, Path],
            report_dir: Union[str, Path],
            trainer_config_name: str = 'trainer',
            reporter_config_name: str = 'reporter',
            **env_vars) -> None:

The cache is just another experiment json. It would be loaded only once at the very beginning, using only the env_vars. Any resulting objects would then be added to env_vars when running each experiment. Objects can optionally implement a Resettable class that has a reset method, which would be called once before each experiment.
Incorrect usage of this feature could lead to non-reproducibility issues, but through docs we could make it clear this should only be used for read-only objects. I think it would be worth doing...
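The Resettable contract could be as small as an abstract base class. A sketch under the assumptions in this issue (the names Resettable, reset, and CachedEmbeddings are proposals, not existing transfer-nlp API):

```python
from abc import ABC, abstractmethod

class Resettable(ABC):
    """Cached objects opt in to being reset between experiments."""
    @abstractmethod
    def reset(self) -> None:
        ...

class CachedEmbeddings(Resettable):
    def __init__(self):
        self.lookups = 0  # cheap stand-in for expensively loaded state
    def reset(self) -> None:
        self.lookups = 0

def reset_cache(env_vars: dict) -> None:
    """Called once before each experiment: reset only objects that opted in."""
    for obj in env_vars.values():
        if isinstance(obj, Resettable):
            obj.reset()

cache = {'embeddings': CachedEmbeddings(), 'seed': 42}
cache['embeddings'].lookups = 10
reset_cache(cache)  # embeddings reset; plain values left alone
```

Keeping the contract to a single method makes it easy for third-party objects to opt in without touching the rest of the runner.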
The Runner class runs each config file sequentially and outputs reporting for each of them.
We would like to be able to aggregate some metrics over several config experiments.
To do that, we need to return the cached experiments after the sequential run.
Some people don't like json files (for readability, unnecessary punctuation, lack of comments, etc.).
It would be nice to be able to load an experiment defined in a YAML file instead.
Similarly, TOML files can come in handy, especially for simple use cases.
When using the sequential Runner, we might be interested in running ablation studies, where we have several instances of possible registrables and we want to authorize only a few of them.
A nice-to-have would be to add a selector class to the library, something like:

from typing import Any, List, Tuple

def multiple_choices_selector(choices: List[Tuple[bool, Any]]) -> List[Any]:
    return [obj for enabled, obj in choices if enabled]

def unique_selector(choices: List[Tuple[bool, Any]]) -> Any:
    res = multiple_choices_selector(choices)
    if len(res) == 1:
        return res[0]
    raise ValueError(f'Unique selector found {len(res)} enabled objects instead of 1')
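Usage in an ablation study might look like this (a self-contained example repeating the proposed selectors; the featurizer and model names are hypothetical stand-ins for registered objects):

```python
from typing import Any, List, Tuple

def multiple_choices_selector(choices: List[Tuple[bool, Any]]) -> List[Any]:
    return [obj for enabled, obj in choices if enabled]

def unique_selector(choices: List[Tuple[bool, Any]]) -> Any:
    res = multiple_choices_selector(choices)
    if len(res) == 1:
        return res[0]
    raise ValueError(f'Unique selector found {len(res)} enabled objects instead of 1')

# Hypothetical ablation: toggle featurizers per experiment via config booleans.
featurizers = multiple_choices_selector([(True, 'ngram_featurizer'),
                                         (False, 'embedding_featurizer'),
                                         (True, 'length_featurizer')])
model = unique_selector([(False, 'logistic_regression'),
                         (True, 'random_forest')])
```

The booleans would come from the config file, so each variant of the study only flips flags instead of restructuring the experiment.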
Lots of times we need to configure lists. It would be nice if this were possible, e.g.:

{
    "pipeline": {
        "_name": "MyPipeline",
        "preprocessors": ["$pp1", "$pp2", ...]
    }
}
Currently, objects are built one by one, and when one fails it throws an error.
It would be great to have a quick pass before instantiating objects to check that all registrable names / aliases are actually registered, and throw an error at that moment.
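Such a pre-pass could walk the experiment dict, collect every _name, and report all unregistered ones at once instead of failing on the first build. A sketch (REGISTRY here is an illustrative set of registered names, not the real registry):

```python
# Illustrative set of registered names; the real registry maps names to classes.
REGISTRY = {'Pipeline', 'SelectKBest'}

def find_unregistered(node, missing=None):
    """Walk the experiment tree and collect every unknown '_name'."""
    if missing is None:
        missing = []
    if isinstance(node, dict):
        name = node.get('_name')
        if name is not None and name not in REGISTRY:
            missing.append(name)
        for value in node.values():
            find_unregistered(value, missing)
    elif isinstance(node, list):
        for item in node:
            find_unregistered(item, missing)
    return missing

experiment = {'model': {'_name': 'Pipeline',
                        'steps': [{'_name': 'SelectKBest'}, {'_name': 'Typo'}]}}
missing = find_unregistered(experiment)  # ['Typo']
```

Raising one error listing every missing name gives the user a single fix-it list rather than an instantiate-fail-repeat loop.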
An experiment is defined by a json file. If you want to run many variations of your model, it may be hard to manage if you have to create almost identical copies of the json file for each variant.
Instead we could imagine having a single general, parameterized json file and then multiple configurations in order to run experiments. To do so effectively, the json structure should remain the same across all variants.
This may mean that pre-built components should have an enabled/disabled flag, so that even if they are included in an experiment, they can be disabled easily via a parameter.
Additionally we can have a "runner" class that can take in a json and config file and run the experiment once for each config with some reports. For now the runner can run the experiments sequentially on a machine, but we could imagine people writing more advanced launchers to do things in parallel.
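The runner loop described above can be sketched in a few lines; run_one below is a stand-in stub for a real single-experiment run, not a transfer-nlp function:

```python
import copy

def run_one(experiment: dict) -> dict:
    """Stub standing in for a real run: just echoes the params it saw."""
    return {'params': {k: v for k, v in experiment.items()
                       if not isinstance(v, dict)}}

def run_sequentially(base_experiment: dict, configs: list) -> list:
    """Run one parameterized experiment once per config, keeping the
    json structure identical across all variants."""
    reports = []
    for cfg in configs:
        experiment = copy.deepcopy(base_experiment)
        experiment.update(cfg)  # fill the parameterized slots for this variant
        reports.append(run_one(experiment))
    return reports

base = {'lr': 0.01, 'dropout': 0.5}
reports = run_sequentially(base, [{'lr': 0.1}, {'lr': 0.001}])
```

Because each variant only overrides values, never structure, the reports stay directly comparable, and a parallel launcher could later replace the for-loop without changing the config format.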
Describe the bug
ExperimentRunner.run_all fails if experiment_cache is None.
The issue comes from line 109, where the default value for the experiment cache (None) is not handled correctly: https://github.com/feedly/transfer-nlp/blob/master/transfer_nlp/runner/experiment_runner.py#L109
Currently, it is possible to use lists of custom objects, but when such a list is the first-level object in the experiment file, it fails to instantiate.
E.g.:

my_objects:
  - _name: my_first_object
  - _name: my_second_object

fails.
It would be nice to have options to disable / enable tensorboard handlers in the PyTorch trainer.
For example, we might always be interested in visualizing the metrics, but not necessarily the embeddings, the gradients, or other quantities.
From the talk today, one good point was that reproducibility problems often stem from data inconsistencies. To that end, I think we should have a DataDownloader component that can download data from URLs and save it locally to disk.
So an example json config could be:

{
    "_name": "Downloader",
    "local_dir": "$my_path",
    "checksums": "$WORK_DIR/checksums_2019_05_23.cfg",  <-- produced by a previous download
    "sentences.txt.gz": {
        "url": "$BASE_URL/sentences.txt.gz",
        "decompress": true
    },
    "word_embeddings.npy": {
        "url": "$BASE_URL/word_embeddings.npy"
    }
}
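A minimal sketch of such a component, standard library only; the class name, parameters, and checksum format are proposals, not existing transfer-nlp API:

```python
import hashlib
import urllib.request
from pathlib import Path

class DataDownloader:
    """Download files to a local dir and verify them against known checksums."""

    def __init__(self, local_dir: str, checksums: dict = None):
        self.local_dir = Path(local_dir)
        self.local_dir.mkdir(parents=True, exist_ok=True)
        # filename -> expected sha256 hex digest, e.g. loaded from a
        # checksums file produced by a previous download
        self.checksums = checksums or {}

    def download(self, filename: str, url: str) -> Path:
        target = self.local_dir / filename
        if not target.exists():
            urllib.request.urlretrieve(url, target)
        self._verify(filename, target)
        return target

    def _verify(self, filename: str, path: Path) -> None:
        expected = self.checksums.get(filename)
        if expected is None:
            return  # no known checksum yet (e.g. first download)
        actual = hashlib.sha256(path.read_bytes()).hexdigest()
        if actual != expected:
            raise ValueError(f'{filename}: checksum mismatch, data may have changed')
```

Failing loudly on a checksum mismatch is the point: an experiment that silently trains on drifted data is exactly the reproducibility problem raised in the talk.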