allenai / allennlp

An open-source NLP research library, built on PyTorch.
Home Page: http://www.allennlp.org
License: Apache License 2.0
We currently give all OOV tokens the same embedding at both training and test time. It'd be nice to be able to have some different options here.
There are probably some other options I'm forgetting right now. These would be pretty tricky to implement in our current data pipeline, though.
After ACL, we can modify the handout they used. We may have a promotion for adding models to AllenNLP.
Just ran a BiDAF training run, and I got log messages saying "best validation performance so far" at every epoch, even when that was not true.
We have blocks like this in several places:
allennlp/allennlp/data/dataset_readers/snli.py
Lines 78 to 85 in 166809c
These should all be put in one spot, probably something like TokenIndexer.dict_from_params (not thrilled with that name, but something similar).
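A minimal sketch of what that could look like (the function name is just a placeholder, and the Params iteration API is an assumption; TokenIndexer.from_params on individual indexers already exists). It would live on TokenIndexer as a classmethod, but is shown standalone here:

from typing import Dict

from allennlp.common import Params
from allennlp.data.token_indexers import TokenIndexer


def dict_from_params(params: Params) -> Dict[str, TokenIndexer]:
    # Builds the {name: TokenIndexer} dict in one place, so that every
    # DatasetReader doesn't have to repeat this loop itself.
    return {name: TokenIndexer.from_params(indexer_params)
            for name, indexer_params in params.items()}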
The underscores were from a time when the serialization_prefix wasn't necessarily a directory. They should be removed. This applies to the following files: _model_params.json, _stdout.log, _stderr.log, and _python_logging.log.
These iterators by definition change the order of your data.
Removes the need to squash the labels before using them:
https://github.com/allenai/allennlp/blob/master/allennlp/models/simple_tagger.py#L88
The culprit seems to be that mask is on the GPU but count is on the CPU.
I can think of a few ways to fix this, but I'm not sure which is least hacky, nor whether this problem exists elsewhere.
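For example, one of the less hacky options might be to move the CPU tensor over before combining the two. A sketch, using the modern PyTorch device API; the variable names just mirror the description above:

import torch

mask = torch.ones(3, 4, device="cuda")  # lives on the GPU
count = torch.tensor(2.0)               # lives on the CPU

# Mixing devices raises, so move `count` to wherever `mask` lives
# before doing arithmetic with the two:
count = count.to(mask.device)
result = mask.sum() / count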
Please pin all dependency versions in https://github.com/allenai/allennlp/blob/master/requirements.txt.
I would rather not have a Docker image doing one thing, and the same (on paper) Docker image doing something else just because the two were built with different dependency versions. For example:

pandas==0.19.2    [good]
awscli>=1.11.91   [may break on upstream updates]
scikit-learn      [I have no idea what code is running]
Currently, if you want to load a model, you first need to load the vocab, then construct the model with from_params, then load the state dict, etc. We should just have a method that does this, given the base serialization directory.
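A sketch of what that method might look like. The helper name, the file names inside the serialization directory, and the exact from_params / Params.from_file signatures are assumptions here, not the library's settled API:

import os

import torch

from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.models import Model


def load_model(serialization_dir: str, weights_file: str = "best.th") -> Model:
    # One call that bundles the current three-step dance.
    params = Params.from_file(os.path.join(serialization_dir, "model_params.json"))
    vocab = Vocabulary.from_files(os.path.join(serialization_dir, "vocabulary"))
    model = Model.from_params(vocab, params.pop("model"))
    model.load_state_dict(torch.load(os.path.join(serialization_dir, weights_file)))
    return model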
I installed AllenNLP from source, and when I followed the steps on the Getting Started page to run the command python -m allennlp.run, I got the following errors:
Traceback (most recent call last):
File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
"__main__", mod_spec)
File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/run.py", line 10, in <module>
from allennlp.commands import main # pylint: disable=wrong-import-position
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/__init__.py", line 3, in <module>
from allennlp.commands.serve import add_subparser as add_serve_subparser
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/serve.py", line 27, in <module>
from allennlp.service import server_sanic
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/service/server_sanic.py", line 19, in <module>
from allennlp.models.archival import load_archive
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/__init__.py", line 6, in <module>
from allennlp.models.archival import archive_model, load_archive
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/archival.py", line 10, in <module>
from allennlp.models.model import Model, _DEFAULT_WEIGHTS
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/model.py", line 12, in <module>
from allennlp.data import Vocabulary
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/__init__.py", line 1, in <module>
from allennlp.data.dataset import Dataset
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/dataset.py", line 13, in <module>
from allennlp.data.instance import Instance
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/instance.py", line 3, in <module>
from allennlp.data.fields.field import DataArray, Field
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/__init__.py", line 12, in <module>
from allennlp.data.fields.text_field import TextField
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/text_field.py", line 11, in <module>
from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/__init__.py", line 6, in <module>
from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/token_characters_indexer.py", line 10, in <module>
from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/__init__.py", line 7, in <module>
from allennlp.data.tokenizers.word_tokenizer import WordTokenizer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 13, in <module>
class WordTokenizer(Tokenizer):
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 39, in WordTokenizer
word_splitter: WordSplitter = SpacyWordSplitter(),
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_splitter.py", line 144, in __init__
import spacy
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/__init__.py", line 5, in <module>
from .deprecated import resolve_model_name
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/deprecated.py", line 8, in <module>
from .cli import download
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/__init__.py", line 5, in <module>
from .train import train, train_config
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/train.py", line 8, in <module>
from ..scorer import Scorer
File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/scorer.py", line 4, in <module>
from .gold import tags_to_entities
ImportError: dlopen: cannot load any more object with static TLS
I googled the error message dlopen: cannot load any more object with static TLS and found that this issue seems to be related to the import of the spacy package; changing the import order might help. But when I looked into the source code, I found nothing obvious to fix. So far I haven't successfully run any demo on my machine. Has anybody else encountered the same problem?
The Predictors currently have to pull out the tokenizer and the token indexers from the DatasetReader and recreate what the DatasetReader does internally for every instance. This means that if we want to change (or add parameters to) what the DatasetReader does, we have to change the Predictors to match. Instead, we should just add a text_to_instance method on the DatasetReader itself, so that the Predictor can just keep the DatasetReader around and pass off all processing to it.
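A minimal sketch of the proposed hook. The reader class and field names are hypothetical; the point is just that all per-instance processing lives on the reader:

from allennlp.data import Instance
from allennlp.data.dataset_readers import DatasetReader
from allennlp.data.fields import TextField


class SentenceReader(DatasetReader):
    def __init__(self, tokenizer, token_indexers) -> None:
        self._tokenizer = tokenizer
        self._token_indexers = token_indexers

    def text_to_instance(self, sentence: str) -> Instance:
        # Tokenization and indexing happen in exactly one place.
        tokens = self._tokenizer.tokenize(sentence)
        return Instance({"tokens": TextField(tokens, self._token_indexers)})

    def read(self, file_path: str):
        with open(file_path) as data_file:
            return [self.text_to_instance(line.strip()) for line in data_file]

# A Predictor then just delegates:
#     instance = self._dataset_reader.text_to_instance(json_dict["sentence"])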
And other places where we load models. Actually, because this goes through load_archive, that method itself has to allow overrides, and any entry point that reaches load_archive needs a way to pass in overrides.
Like our Init wrapper. This will allow us to remove the type checks that were necessary in Trainer, by handling the API differences between PyTorch's LRSchedulers with separate wrappers (or just a single wrapper that does some inspection of the wrapped object...).
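For instance, a single wrapper that inspects the wrapped object could look something like this (a sketch; the class name is a placeholder). The real API difference is that ReduceLROnPlateau.step() wants the validation metric, while the other PyTorch schedulers take no metric:

import torch


class LearningRateScheduler:
    # Hides the step() signature difference between PyTorch's
    # ReduceLROnPlateau and its other LR schedulers.
    def __init__(self, scheduler) -> None:
        self._scheduler = scheduler

    def step(self, metric: float = None) -> None:
        if isinstance(self._scheduler,
                      torch.optim.lr_scheduler.ReduceLROnPlateau):
            self._scheduler.step(metric)
        else:
            self._scheduler.step()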
In three places now we've implemented appending and/or prepending tokens to what goes into a TextField (a null token for the SNLI reader, a stop token for SQuAD, and sentence boundary tokens for the language model reader). This should just be basic functionality of the tokenizer.
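Something like the following sketch would cover all three cases (start_tokens and end_tokens are hypothetical parameter names; the split_words return shape follows the test elsewhere in this page):

from typing import List


class WordTokenizer:
    # Sketch: the tokenizer itself owns boundary-token handling,
    # instead of each DatasetReader re-implementing it.
    def __init__(self,
                 word_splitter,
                 start_tokens: List[str] = None,
                 end_tokens: List[str] = None) -> None:
        self._word_splitter = word_splitter
        self._start_tokens = start_tokens or []
        self._end_tokens = end_tokens or []

    def tokenize(self, text: str) -> List[str]:
        tokens, _ = self._word_splitter.split_words(text)
        return self._start_tokens + tokens + self._end_tokens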
We only use it when passing something into the constructor of Params - we should just put it inside the constructor and not make the caller call this method every time.
We should work out whether these are actually the correct type annotations for many of our functions. For many functions we are actually only ever passing torch.autograd.Variables, including some which actually require this, e.g.:

# raises
torch.nn.functional.softmax(torch.rand([3, 4]))
# fine
torch.nn.functional.softmax(torch.autograd.Variable(torch.rand([3, 4])))

I think we can still keep the tensor types by doing something like torch.autograd.Variable[torch.FloatTensor], etc.
This will use one or more Embedding layers. We should also consider moving to a dictionary for the TokenIndexers, to make things easier.
It's getting too big; we should have one file per metric.
This will let us get rid of the nasty offset return value, because it will just be a field on the Token, and it will let us include POS tags, for POS tag embeddings. It's probably easiest to just return spacy's token representation directly, rather than trying to roll our own, and have other word splitters mimic spacy's API. Or we could just have them crash; I'm not sure we really need the other word splitters at this point - we could simplify things a lot by putting spacy directly into WordTokenizer. Anybody have any thoughts on that?
This will let us use variable type annotations, and remove all of the unused-imports in the code.
I don't think there are any big issues with just changing the python version in our images and build settings, so I'm labeling this as easy, but it's possible there is some library we're using that's not compatible and it will end up being hard.
Instead, just have a default_model() method, or similar, that loads the configuration from experiment_configs/.
The success criterion is having a state-of-the-art model within 1% of the model presented in https://arxiv.org/abs/1611.01603.
Codebase: https://github.com/allenai/bi-att-flow
Web demo: http://35.165.153.16:1995/
It'd be nice to have some way of managing config file changes, so that, e.g., if we add a new required parameter, or change the name of a flag, config files don't break mysteriously for an end user. Even better if we can make things backwards compatible when they change. Not sure at all how to make this happen in a reasonable way, though.
It's grown to be a ~150 line method that's pretty hard to reason about. It's a bit tricky to decompose, though, because of all of the dependencies between different parts of the method, but we should be able to pull out a bunch of it into separate methods.
Instead of copying the input parameter file when archiving a model, we should parse the log file to get the actual parameters that were used (including defaults). This will make model archiving more robust to changes in default parameters (as recently happened with the tokenizer). Seems like quite a bit of work to be sure the parameters are parsed out correctly, though, and it's not super high priority.
And fix it, if possible.
If you just import something from the library, like from allennlp.data import Vocabulary, there's a several-second delay. Not sure what the cause is, but it seems like some __init__.py somewhere is doing more than it should, or something is getting run on import when it shouldn't be.
This would remove the need for all of the special casing and reflection that I did to pass the metadata through correctly. Basically, we take all of the metadata-related code from this PR and replace it with a MetadataField. There will still be a little bit of special casing, unless we also move the array creation code into a class method on the Field objects (probably on Field itself, overridden by MetadataField). In particular, I mean this code:
allennlp/allennlp/data/dataset.py
Lines 150 to 162 in fb73633
This issue was brought up to me by Nikket.
Hi Michael,
I followed the steps in the readme, and got stuck on step 4:

1. Download and install Conda.
2. Create a Conda environment with Python 3: conda create -n allennlp python=3.5
3. Activate the environment: source activate allennlp
4. Install the requirements: INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh
5. Visit http://pytorch.org/ and install the relevant pytorch package.
6. Set the PYTHONHASHSEED for repeatable experiments: export PYTHONHASHSEED=2157
I get stuck on step 4, for which the remedy seems to be https://stackoverflow.com/questions/1449396/how-to-install-setuptools, but I just wanted to be sure that I am not doing anything wrong.
(allennlp) nikett:allennlp nikett$ INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh
Collecting git+git://github.com/mkorpela/overrides.git@40f8bd1fae7a3364a1 (from -r requirements.txt (line 23))
Cloning git://github.com/mkorpela/overrides.git (to 40f8bd1fae7a3364a1) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-rc43o34j-build
Could not find a tag or branch '40f8bd1fae7a3364a1', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'nltk.downloader' (ImportError: No module named 'nltk')
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'spacy.en.download' (ImportError: No module named 'spacy')
Collecting git+git://github.com/PyCQA/pylint.git@2561f539d60a3563d6507e7a22e226fb10b58210 (from -r requirements_test.txt (line 6))
Cloning git://github.com/PyCQA/pylint.git (to 2561f539d60a3563d6507e7a22e226fb10b58210) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-gd8xgqm9-build
Could not find a tag or branch '2561f539d60a3563d6507e7a22e226fb10b58210', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.
- Vocabulary.
- Data API: Fields, Instances, Dataset.
- Iterators and training a model.
- Tokens -> Tokenizers -> TokenIndexers abstraction.
- Writing a DatasetReader example.
- Writing a Model, and differences from torch.nn.Module.
- Why have Params? How to build things which run from JSON.
- TokenEmbedders -> TextFields -> representation abstraction.
- Seq2SeqEncoders and Seq2VecEncoders and how to use them.
- How to make your model Servable and deploy a server via Docker.
I installed Python 3.6 into an Anaconda environment and installed all the requirements.txt and requirements_test.txt packages, pytorch, etc. When I run pytest -v, the tests fail: pytest is picking up the Python 2.7 packages (not the Python 3.6 ones):
(py36) David-Laxers-MacBook-Pro:allennlp davidlaxer$ pytest -v
============================= test session starts ==============================
platform darwin -- Python 2.7.13, pytest-3.1.1, py-1.4.33, pluggy-0.4.0 -- /Users/davidlaxer/anaconda/bin/python
cachedir: .cache
rootdir: /Users/davidlaxer/allennlp, inifile: pytest.ini
collected 0 items / 68 errors
==================================== ERRORS ====================================
___________________ ERROR collecting tests/notebooks_test.py ___________________
../anaconda/lib/python2.7/site-packages/_pytest/python.py:408: in _importtestmodule
mod = self.fspath.pyimport(ensuresyspath=importmode)
../anaconda/lib/python2.7/site-packages/py/_path/local.py:662: in pyimport
__import__(modname)
E File "/Users/davidlaxer/allennlp/tests/notebooks_test.py", line 17
E def execute_notebook(notebook_path: str):
E ^
E SyntaxError: invalid syntax
[...]
ImportError while importing test module '/Users/davidlaxer/allennlp/tests/training/metrics/span_based_f1_measure_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/training/metrics/span_based_f1_measure_test.py:2: in <module>
import torch
E ImportError: No module named torch
!!!!!!!!!!!!!!!!!!! Interrupted: 68 errors during collection !!!!!!!!!!!!!!!!!!!
=========================== 68 error in 5.49 seconds ===========================
We need this so we can configure the vocabulary properly from an experiment config. We need a method that takes a Dataset and a Params object and instantiates the Vocabulary. I'm not sure what the sanest way to do this is - should we just only allow certain ways of constructing the vocab?
SQuAD has plenty of paragraphs that have wiki notes, formatting like "This was a protest.[note 4]". Spacy for some reason does not tokenize these strings correctly, giving "protest.[note" as a single token. We should be able to improve performance on SQuAD at least a little bit by fixing these issues, as it affects a fair number of our training examples, and some of the dev set.
A test that currently fails, but should pass (it goes in word_splitter_test.py):
def test_tokenize_handles_wiki_notes(self):
passage = "McWhorter writes of Lee, \"for a white person from the South to write a " +\
"book like this in the late 1950s is really unusual\u2014by its very existence " +\
"an act of protest.\"[note 4] Author James McBride calls Lee brilliant but " +\
"stops short of calling her brave: \"I think by calling Harper Lee brave you " +\
"kind of absolve yourself of your own racism.\""
tokens, offsets = self.word_splitter.split_words(passage)
assert "protest" in tokens
Because of this issue, the optimizer might have state that's initialized from the model parameters, and that state needs to be on the right device. This means we should either:

1. have cuda_device be a top-level key in the experiment config, so we can move the model over in commands.train() before constructing the optimizer, or
2. move model.cuda() into Trainer.from_params().

I could go either way. Calling model.cuda() inside of from_params() in the second option is a little bit more logic than we like to have in those methods, but not much. The optimizer conceptually seems like it's part of the trainer, so having the optimizer params inside of the trainer params makes sense.
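A sketch of the first option (the config key names and the Optimizer.from_params signature are assumptions):

from allennlp.common import Params
from allennlp.data import Vocabulary
from allennlp.models import Model
from allennlp.training.optimizers import Optimizer


def build_model_and_optimizer(params: Params, vocab: Vocabulary):
    # Read cuda_device as a top-level key and move the model over
    # *before* constructing the optimizer, so any optimizer state
    # initialized from the parameters lands on the right device.
    cuda_device = params.pop_int("cuda_device", -1)
    model = Model.from_params(vocab, params.pop("model"))
    if cuda_device >= 0:
        model.cuda(cuda_device)
    optimizer = Optimizer.from_params(model.parameters(), params.pop("optimizer"))
    return model, optimizer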
Reported by @schmmd. I'm not really sure what could be causing this, because I didn't think there was any randomness in model.forward() after model.eval() has been called. But here are steps to reproduce:
$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 2157
$ allennlp/run serve

1. Navigate to http://localhost:8000.
2. Click the MC Model tab.
3. Submit the last example (The Millennium Falcon…)

> “spaceship”

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 4563123
$ allennlp/run serve

1. Navigate to http://localhost:8000.
2. Click the MC Model tab.
3. Submit the last example (The Millennium Falcon…)

> “variety of Star Wars expanded …”
ReferenceError: event is not defined[Learn More] demo.allennlp.org:686:13
onClick http://demo.allennlp.org/:686:13
onClick self-hosted:987:17
[55]</ReactErrorUtils.invokeGuardedCallback http://demo.allennlp.org/lib/react-dom.js:9036:7
executeDispatch http://demo.allennlp.org/lib/react-dom.js:2996:5
executeDispatchesInOrder http://demo.allennlp.org/lib/react-dom.js:3019:5
executeDispatchesAndRelease http://demo.allennlp.org/lib/react-dom.js:2427:5
executeDispatchesAndReleaseTopLevel http://demo.allennlp.org/lib/react-dom.js:2438:10
forEach self-hosted:267:13
forEachAccumulated http://demo.allennlp.org/lib/react-dom.js:15456:5
processEventQueue http://demo.allennlp.org/lib/react-dom.js:2638:7
runEventQueueInBatch http://demo.allennlp.org/lib/react-dom.js:9060:3
handleTopLevel http://demo.allennlp.org/lib/react-dom.js:9070:5
handleTopLevelImpl http://demo.allennlp.org/lib/react-dom.js:9147:5
perform http://demo.allennlp.org/lib/react-dom.js:14760:13
batchedUpdates http://demo.allennlp.org/lib/react-dom.js:8825:14
batchedUpdates http://demo.allennlp.org/lib/react-dom.js:12895:10
dispatchEvent http://demo.allennlp.org/lib/react-dom.js:9222:7
dispatchEvent self-hosted:987:17
The success criterion is having a state-of-the-art model within 1% of the model presented in https://www.semanticscholar.org/paper/A-Decomposable-Attention-Model-for-Natural-Languag-Parikh-T%C3%A4ckstr%C3%B6m/07a9478e87a8304fc3267fa16e83e9f3bbd98b27.
This would be more portable to other SQuAD models, though we still need BiDAF to have a way to construct this metric, so it doesn't give us that much. It does seem conceptually cleaner to have that live in the metric, though.
For things like Instance, it seems not very idiomatic to have def fields(self): return self._fields. It might be better to just make fields public.