allenai / allennlp

An open-source NLP research library, built on PyTorch.
Home Page: http://www.allennlp.org
License: Apache License 2.0
We currently give all OOV tokens the same embedding at both training and test time. It'd be nice to be able to have some different options here.
There are probably some other options I'm forgetting right now. These would be pretty tricky to implement in our current data pipeline, though.
After ACL, we can modify the handout they used. We may have a promotion for adding models to AllenNLP.
Just ran a BiDAF training run, and I got log messages saying "best validation performance so far" at every epoch, even when that was not true.
We have blocks like this in several places:
allennlp/allennlp/data/dataset_readers/snli.py
Lines 78 to 85 in 166809c
These should all be put in one spot, probably something like TokenIndexer.dict_from_params (not thrilled with that name, but something similar).
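A minimal sketch of what that could look like (the function name is just a placeholder, and the Params iteration API is an assumption; TokenIndexer.from_params on individual indexers already exists). It would live on TokenIndexer as a classmethod, but is shown standalone here:

from typing import Dict

from allennlp.common import Params
from allennlp.data.token_indexers import TokenIndexer


def dict_from_params(params: Params) -> Dict[str, TokenIndexer]:
    # Builds the {name: TokenIndexer} dict in one place, so that every
    # DatasetReader doesn't have to repeat this loop itself.
    return {name: TokenIndexer.from_params(indexer_params)
            for name, indexer_params in params.items()}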
The underscores were from a time when the serialization_prefix wasn't necessarily a directory. They should be removed. This applies to the following files: _model_params.json, _stdout.log, _stderr.log, and _python_logging.log.
These iterators by definition change the order of your data.
Removes the need to squash the labels before using them:
https://github.com/allenai/allennlp/blob/master/allennlp/models/simple_tagger.py#L88
The culprit seems to be that mask is on the GPU but count is on the CPU.
I can think of a few ways to fix this, but I'm not sure which is least hacky, nor whether this problem exists elsewhere.
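For example, one of the less hacky options might be to move the CPU tensor over before combining the two. A sketch, using the modern PyTorch device API; the variable names just mirror the description above:

import torch

mask = torch.ones(3, 4, device="cuda")  # lives on the GPU
count = torch.tensor(2.0)               # lives on the CPU

# Mixing devices raises, so move `count` to wherever `mask` lives
# before doing arithmetic with the two:
count = count.to(mask.device)
result = mask.sum() / count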
Please pin all dependency versions in https://github.com/allenai/allennlp/blob/master/requirements.txt.
I would rather not have a Docker image doing one thing, and the same (on paper) Docker image doing something else just because the two were built with different dependency versions. For example:

pandas==0.19.2    [good]
awscli>=1.11.91   [may break on upstream updates]
scikit-learn      [I have no idea what code is running]
Currently, if you want to load a model, you first need to load the vocab, then construct the model with from_params, then load the state dict, etc. We should just have a method that does this, given the base serialization directory.
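A sketch of what that method might look like. The helper name, the file names inside the serialization directory, and the exact from_params / Params.from_file signatures are assumptions here, not the library's settled API:

import os

import torch

from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.models import Model


def load_model(serialization_dir: str, weights_file: str = "best.th") -> Model:
    # One call that bundles the current three-step dance.
    params = Params.from_file(os.path.join(serialization_dir, "model_params.json"))
    vocab = Vocabulary.from_files(os.path.join(serialization_dir, "vocabulary"))
    model = Model.from_params(vocab, params.pop("model"))
    model.load_state_dict(torch.load(os.path.join(serialization_dir, weights_file)))
    return model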
I installed AllenNLP from source, and when I followed the steps on the Getting Started page to run the command python -m allennlp.run, I got the following errors:
Traceback (most recent call last):
File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/run.py", line 10, in <module>
from allennlp.commands import main # pylint: disable=wrong-import-position
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/__init__.py", line 3, in <module>
from allennlp.commands.serve import add_subparser as add_serve_subparser
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/serve.py", line 27, in <module>
from allennlp.service import server_sanic
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/service/server_sanic.py", line 19, in <module>
from allennlp.models.archival import load_archive
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/__init__.py", line 6, in <module>
from allennlp.models.archival import archive_model, load_archive
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/archival.py", line 10, in <module>
from allennlp.models.model import Model, _DEFAULT_WEIGHTS
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/model.py", line 12, in <module>
from allennlp.data import Vocabulary
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/__init__.py", line 1, in <module>
from allennlp.data.dataset import Dataset
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/dataset.py", line 13, in <module>
from allennlp.data.instance import Instance
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/instance.py", line 3, in <module>
from allennlp.data.fields.field import DataArray, Field
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/__init__.py", line 12, in <module>
from allennlp.data.fields.text_field import TextField
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/text_field.py", line 11, in <module>
from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/__init__.py", line 6, in <module>
from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/token_characters_indexer.py", line 10, in <module>
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/__init__.py", line 7, in <module>
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 13, in <module>
class WordTokenizer(Tokenizer):
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 39, in WordTokenizer
word_splitter: WordSplitter = SpacyWordSplitter(),
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_splitter.py", line 144, in __init__
import spacy
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/__init__.py", line 5, in <module>
from .deprecated import resolve_model_name
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/deprecated.py", line 8, in <module>
from .cli import download
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/__init__.py", line 5, in <module>
from .train import train, train_config
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/train.py", line 8, in <module>
from ..scorer import Scorer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/scorer.py", line 4, in <module>
from .gold import tags_to_entities
ImportError: dlopen: cannot load any more object with static TLS
I googled the error message dlopen: cannot load any more object with static TLS and found that this issue seems to be related to the import of the spacy package; changing the import order might help. But when I looked into the source code, I found nothing obvious to fix. So far I haven't successfully run any demo on my machine. Has anybody else encountered the same problem?
The Predictors currently have to pull out the tokenizer and the token indexers from the DatasetReader and recreate what the DatasetReader does internally for every instance. This means that if we want to change (or add parameters to) what the DatasetReader does, we have to change the Predictors to match. Instead, we should just add a text_to_instance method on the DatasetReader itself, so that the Predictor can just keep the DatasetReader around and pass off all processing to it.
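A minimal sketch of the proposed hook. The reader class and field names are hypothetical; the point is just that all per-instance processing lives on the reader:

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField


class SentenceReader(DatasetReader):
    def __init__(self, tokenizer, token_indexers) -> None:
        self._tokenizer = tokenizer
        self._token_indexers = token_indexers

    def text_to_instance(self, sentence: str) -> Instance:
        # Tokenization and indexing happen in exactly one place.
        tokens = self._tokenizer.tokenize(sentence)
        return Instance({"tokens": TextField(tokens, self._token_indexers)})

    def read(self, file_path: str):
        with open(file_path) as data_file:
            return [self.text_to_instance(line.strip()) for line in data_file]

# A Predictor then just delegates:
#     instance = self._dataset_reader.text_to_instance(json_dict["sentence"])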
And other places where we load models. Actually, because this goes through load_archive, that method itself has to allow overrides, and any entry point that reaches load_archive needs a way to pass in overrides.
Like our Init wrapper. This will allow us to remove the type checks that were necessary in Trainer, by handling the API differences between PyTorch's LRSchedulers with separate wrappers (or just a single wrapper that does some inspection of the wrapped object...).
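For instance, a single wrapper that inspects the wrapped object could look something like this (a sketch; the class name is a placeholder). The real API difference is that ReduceLROnPlateau.step() wants the validation metric, while the other PyTorch schedulers take no metric:

import torch


class LearningRateScheduler:
    # Hides the step() signature difference between PyTorch's
    # ReduceLROnPlateau and its other LR schedulers.
    def __init__(self, scheduler) -> None:
        self._scheduler = scheduler

    def step(self, metric: float = None) -> None:
        if isinstance(self._scheduler,
                      torch.optim.lr_scheduler.ReduceLROnPlateau):
            self._scheduler.step(metric)
        else:
            self._scheduler.step()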
In three places now we've implemented appending and/or prepending tokens to what goes into a TextField (a null token for the SNLI reader, a stop token for SQuAD, and sentence boundary tokens for the language model reader). This should just be basic functionality of the tokenizer.
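Something like the following sketch would cover all three cases (start_tokens and end_tokens are hypothetical parameter names; the split_words return shape follows the test elsewhere in this page):

from typing import List


class WordTokenizer:
    # Sketch: the tokenizer itself owns boundary-token handling,
    # instead of each DatasetReader re-implementing it.
    def __init__(self,
                 word_splitter,
                 start_tokens: List[str] = None,
                 end_tokens: List[str] = None) -> None:
        self._word_splitter = word_splitter
        self._start_tokens = start_tokens or []
        self._end_tokens = end_tokens or []

    def tokenize(self, text: str) -> List[str]:
        tokens, _ = self._word_splitter.split_words(text)
        return self._start_tokens + tokens + self._end_tokens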
We only use it when passing something into the constructor of Params - we should just put it inside the constructor and not make the caller call this method every time.
We should work out whether these are actually the correct type annotations for many of our functions. For many functions we are actually only ever passing torch.autograd.Variables, including some which actually require this, e.g.:

# raises
torch.nn.functional.softmax(torch.rand([3, 4]))
# fine
torch.nn.functional.softmax(torch.autograd.Variable(torch.rand([3, 4])))

I think we can still keep the tensor types by doing something like torch.autograd.Variable[torch.FloatTensor], etc.
This will use one or more Embedding layers. We should also consider moving to a dictionary for the TokenIndexers, to make things easier.
It's getting too big; we should have one file per metric.
This will let us get rid of the nasty offset return value, because it will just be a field on the Token, and it will let us include POS tags, for POS tag embeddings. It's probably easiest to just return spacy's token representation directly, rather than trying to roll our own, and have other word splitters mimic spacy's API. Or we could just have them crash; I'm not sure we really need the other word splitters at this point - we could simplify things a lot by putting spacy directly into WordTokenizer. Anybody have any thoughts on that?
This will let us use variable type annotations, and remove all of the unused-imports in the code.
I don't think there are any big issues with just changing the python version in our images and build settings, so I'm labeling this as easy, but it's possible there is some library we're using that's not compatible and it will end up being hard.
Instead, just have a default_model() method, or similar, that loads the configuration from experiment_configs/.
The success criterion is having a state-of-the-art model within 1% of the model presented in https://arxiv.org/abs/1611.01603.
Codebase: https://github.com/allenai/bi-att-flow
Web demo: http://35.165.153.16:1995/
It'd be nice to have some way of managing config file changes, so that, e.g., if we add a new required parameter, or change the name of a flag, config files don't break mysteriously for an end user. Even better if we can make things backwards compatible when they change. Not sure at all how to make this happen in a reasonable way, though.
It's grown to be a ~150 line method that's pretty hard to reason about. It's a bit tricky to decompose, though, because of all of the dependencies between different parts of the method, but we should be able to pull out a bunch of it into separate methods.
Instead of copying the input parameter file when archiving a model, we should parse the log file to get the actual parameters that were used (including defaults). This will make model archiving more robust to changes in default parameters (as recently happened with the tokenizer). Seems like quite a bit of work to be sure the parameters are parsed out correctly, though, and it's not super high priority.
And fix it, if possible.
If you just import something from the library, like from allennlp.data import Vocabulary, there's a several-second delay. Not sure what the cause is, but it seems like some __init__.py somewhere is doing more than it should, or something is getting run on import when it shouldn't be.
This would remove the need for all of the special casing and reflection that I did to pass the metadata through correctly. Basically, we take all of the metadata-related code from this PR and replace it with a MetadataField. There will still be a little bit of special casing, unless we also move the array creation code into a class method on the Field objects (probably on Field itself, overridden by MetadataField). In particular, I mean this code:
allennlp/allennlp/data/dataset.py
Lines 150 to 162 in fb73633
This issue was brought up to me by Nikket.
Hi Michael,
I followed the steps in the readme, and got stuck on step 4:

1. Download and install Conda.
2. Create a Conda environment with Python 3: conda create -n allennlp python=3.5
3. Activate the environment: source activate allennlp
4. Install the requirements: INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh
5. Visit http://pytorch.org/ and install the relevant pytorch package.
6. Set the PYTHONHASHSEED for repeatable experiments: export PYTHONHASHSEED=2157
I get stuck on step 4, for which the remedy seems to be https://stackoverflow.com/questions/1449396/how-to-install-setuptools, but I just wanted to be sure that I am not doing anything wrong.
(allennlp) nikett:allennlp nikett$ INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh
Collecting git+git://github.com/mkorpela/overrides.git@40f8bd1fae7a3364a1 (from -r requirements.txt (line 23))
Cloning git://github.com/mkorpela/overrides.git (to 40f8bd1fae7a3364a1) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-rc43o34j-build
Could not find a tag or branch '40f8bd1fae7a3364a1', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'nltk.downloader' (ImportError: No module named 'nltk')
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'spacy.en.download' (ImportError: No module named 'spacy')
Collecting git+git://github.com/PyCQA/pylint.git@2561f539d60a3563d6507e7a22e226fb10b58210 (from -r requirements_test.txt (line 6))
Cloning git://github.com/PyCQA/pylint.git (to 2561f539d60a3563d6507e7a22e226fb10b58210) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-gd8xgqm9-build
Could not find a tag or branch '2561f539d60a3563d6507e7a22e226fb10b58210', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.
- Vocabulary.
- Data API: Fields, Instances, Dataset.
- Iterators and training a model.
- Tokens -> Tokenizers -> TokenIndexers abstraction.
- Writing a DatasetReader example.
- Writing a Model, and differences from torch.nn.Module.
- Why have Params? How to build things which run from JSON.
- TokenEmbedders -> TextFields -> representation abstraction.
- Seq2SeqEncoders and Seq2VecEncoders and how to use them.
- How to make your model Servable and deploy a server via Docker.
I installed Python 3.6 into an Anaconda environment and installed all the requirements.txt and requirements_test.txt packages, pytorch, etc. When I run pytest -v, the tests fail: pytest is picking up the Python 2.7 packages (not the Python 3.6 ones):
(py36) David-Laxers-MacBook-Pro:allennlp davidlaxer$ pytest -v
============================= test session starts ==============================
platform darwin -- Python 2.7.13, pytest-3.1.1, py-1.4.33, pluggy-0.4.0 -- /Users/davidlaxer/anaconda/bin/python
cachedir: .cache
rootdir: /Users/davidlaxer/allennlp, inifile: pytest.ini
collected 0 items / 68 errors
==================================== ERRORS ====================================
___________________ ERROR collecting tests/notebooks_test.py ___________________
../anaconda/lib/python2.7/site-packages/_pytest/python.py:408: in _importtestmodule
mod = self.fspath.pyimport(ensuresyspath=importmode)
../anaconda/lib/python2.7/site-packages/py/_path/local.py:662: in pyimport
__import__(modname)
E File "/Users/davidlaxer/allennlp/tests/notebooks_test.py", line 17
E def execute_notebook(notebook_path: str):
E ^
E SyntaxError: invalid syntax
[...]
ImportError while importing test module '/Users/davidlaxer/allennlp/tests/training/metrics/span_based_f1_measure_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/training/metrics/span_based_f1_measure_test.py:2: in <module>
import torch
E ImportError: No module named torch
!!!!!!!!!!!!!!!!!!! Interrupted: 68 errors during collection !!!!!!!!!!!!!!!!!!!
=========================== 68 error in 5.49 seconds ===========================
We need this so we can configure the vocabulary properly from an experiment config. We need a method that takes a Dataset and a Params object and instantiates the Vocabulary. I'm not sure what the sanest way to do this is - should we just only allow certain ways of constructing the vocab?
SQuAD has plenty of paragraphs that have wiki notes, formatting like "This was a protest.[note 4]". Spacy for some reason does not tokenize these strings correctly, giving "protest.[note" as a single token. We should be able to improve performance on SQuAD at least a little bit by fixing these issues, as it affects a fair number of our training examples, and some of the dev set.
A test that currently fails, but should pass (it goes in word_splitter_test.py):
def test_tokenize_handles_wiki_notes(self):
passage = "McWhorter writes of Lee, \"for a white person from the South to write a " +\
"book like this in the late 1950s is really unusual\u2014by its very existence " +\
"an act of protest.\"[note 4] Author James McBride calls Lee brilliant but " +\
"stops short of calling her brave: \"I think by calling Harper Lee brave you " +\
"kind of absolve yourself of your own racism.\""
tokens, offsets = self.word_splitter.split_words(passage)
assert "protest" in tokens
Because of this issue, the optimizer might have state that's initialized from the model parameters, and that state needs to be on the right device. This means we should either:

1. have cuda_device be a top-level key in the experiment config, so we can move the model over in commands.train() before constructing the optimizer, or
2. move model.cuda() into Trainer.from_params().

I could go either way. Calling model.cuda() inside of from_params() in the second option is a little bit more logic than we like to have in those methods, but not much. The optimizer conceptually seems like it's part of the trainer, so having the optimizer params inside of the trainer params makes sense.
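A sketch of the first option (the config key names and the Optimizer.from_params signature are assumptions):

from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.training.optimizers import Optimizer


def build_model_and_optimizer(params: Params, vocab: Vocabulary):
    # Read cuda_device as a top-level key and move the model over
    # *before* constructing the optimizer, so any optimizer state
    # initialized from the parameters lands on the right device.
    cuda_device = params.pop_int("cuda_device", -1)
    model = Model.from_params(vocab, params.pop("model"))
    if cuda_device >= 0:
        model.cuda(cuda_device)
    optimizer = Optimizer.from_params(model.parameters(), params.pop("optimizer"))
    return model, optimizer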
Reported by @schmmd. I'm not really sure what could be causing this, because I didn't think there was any randomness in model.forward() after model.eval() has been called. But here are steps to reproduce:
$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 2157
$ allennlp/run serve

1. Navigate to http://localhost:8000.
2. Click the MC Model tab.
3. Submit the last example (The Millennium Falcon…)

> “spaceship”

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 4563123
$ allennlp/run serve

1. Navigate to http://localhost:8000.
2. Click the MC Model tab.
3. Submit the last example (The Millennium Falcon…)

> “variety of Star Wars expanded …”
ReferenceError: event is not defined[Learn More] demo.allennlp.org:686:13
onClick http://demo.allennlp.org/:686:13
onClick self-hosted:987:17
[55]</ReactErrorUtils.invokeGuardedCallback http://demo.allennlp.org/lib/react-dom.js:9036:7
executeDispatch http://demo.allennlp.org/lib/react-dom.js:2996:5
executeDispatchesInOrder http://demo.allennlp.org/lib/react-dom.js:3019:5
executeDispatchesAndRelease http://demo.allennlp.org/lib/react-dom.js:2427:5
executeDispatchesAndReleaseTopLevel http://demo.allennlp.org/lib/react-dom.js:2438:10
forEach self-hosted:267:13
forEachAccumulated http://demo.allennlp.org/lib/react-dom.js:15456:5
processEventQueue http://demo.allennlp.org/lib/react-dom.js:2638:7
runEventQueueInBatch http://demo.allennlp.org/lib/react-dom.js:9060:3
handleTopLevel http://demo.allennlp.org/lib/react-dom.js:9070:5
handleTopLevelImpl http://demo.allennlp.org/lib/react-dom.js:9147:5
perform http://demo.allennlp.org/lib/react-dom.js:14760:13
batchedUpdates http://demo.allennlp.org/lib/react-dom.js:8825:14
batchedUpdates http://demo.allennlp.org/lib/react-dom.js:12895:10
dispatchEvent http://demo.allennlp.org/lib/react-dom.js:9222:7
dispatchEvent self-hosted:987:17
The success criterion is having a state-of-the-art model within 1% of the model presented in https://www.semanticscholar.org/paper/A-Decomposable-Attention-Model-for-Natural-Languag-Parikh-T%C3%A4ckstr%C3%B6m/07a9478e87a8304fc3267fa16e83e9f3bbd98b27.
This would be more portable to other SQuAD models, though we still need BiDAF to have a way to construct this metric, so it doesn't give us that much. It does seem conceptually cleaner to have that live in the metric, though.
For things like Instance, it seems not very idiomatic to have def fields(self): return self._fields. It might be better to just make fields public.