
allenai / allennlp

11.7K stars · 279 watchers · 2.2K forks · 72.96 MB

An open-source NLP research library, built on PyTorch.

Home Page: http://www.allennlp.org

License: Apache License 2.0

Python 98.43% Shell 0.06% Perl 0.01% C 1.10% Makefile 0.15% Scilab 0.11% Dockerfile 0.03% Jsonnet 0.12%
pytorch nlp natural-language-processing deep-learning data-science python

allennlp's People

Contributors

akshitab, arjunsubramonian, bratao, brendan-ai2, bryant1410, danieldeutsch, deneutoy, dependabot-preview[bot], dependabot[bot], dirkgr, eladsegal, epwalsh, eric-wallace, harshtrivedi, joelgrus, johngiorgi, kl2806, maksymdel, matt-gardner, matt-peters, nafitzgerald, nelson-liu, nicola-decao, oyvindtafjord, pdasigi, sai-prasanna, scarecrow1123, schmmd, wrran, zhaofengwu


allennlp's Issues

Allow for different handling of OOV words

We currently give all OOV tokens the same embedding at both training and test time. It'd be nice to be able to have some different options here:

  • At test time, see if the OOV token is in GloVe, and use that embedding instead
  • At both training and test, use random vectors for each unique OOV token, as suggested here

There are probably some other options I'm forgetting right now. These would be pretty tricky to implement in our current data pipeline, though.
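
For the second option, here is a minimal sketch of "random vectors for each unique OOV token" (the class and attribute names are purely illustrative, not anything in the library):

import numpy

class RandomOOVEmbeddings:
    def __init__(self, known_embeddings: dict, embedding_dim: int, seed: int = 13370) -> None:
        self._known = known_embeddings          # token -> vector for in-vocabulary words
        self._dim = embedding_dim
        self._rng = numpy.random.RandomState(seed)
        self._oov_vectors = {}                  # populated lazily, one vector per unique OOV token

    def lookup(self, token: str) -> numpy.ndarray:
        if token in self._known:
            return self._known[token]
        if token not in self._oov_vectors:
            # Each unique OOV token gets its own fixed random vector, instead of
            # every OOV token sharing a single unknown embedding.
            self._oov_vectors[token] = self._rng.randn(self._dim).astype('float32')
        return self._oov_vectors[token]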

Simplify / centralize `TokenIndexer.from_params()`

We have blocks like this in several places:

token_indexers = {}
token_indexer_params = params.pop('token_indexers', Params({}))
for name, indexer_params in token_indexer_params.items():
    token_indexers[name] = TokenIndexer.from_params(indexer_params)
# The default parameters are contained within the class,
# so if no parameters are given we must pass None.
if token_indexers == {}:
    token_indexers = None

These should all be put in one spot, probably something like TokenIndexer.dict_from_params (not thrilled with that name, but something similar).
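
A rough sketch of what that classmethod could look like, using only the calls already present in the copies above (neither the name nor the method exists yet):

@classmethod
def dict_from_params(cls, params: Params) -> Optional[Dict[str, 'TokenIndexer']]:
    token_indexers = {name: cls.from_params(indexer_params)
                      for name, indexer_params in params.items()}
    # The default indexers live inside the consuming class, so an empty dict
    # still has to become None, exactly as in the copies above.
    return token_indexers or None

Call sites would then shrink to a single line: token_indexers = TokenIndexer.dict_from_params(params.pop('token_indexers', Params({}))).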

Add a `Model.load_from_file` method (or similar)

Currently, if you want to load a model, you first need to load the vocab, then construct the model from_params, then load the state dict, etc. We should just have a method that does this, given the base serialization directory.
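
A hedged sketch of what such a method might wrap, following the three steps above; the config filename, the vocabulary subdirectory, and the exact loading calls are placeholders for whatever we already do today, not confirmed API:

import os
import torch

@classmethod
def load_from_file(cls, serialization_dir: str, weights_file: str = 'best.th') -> 'Model':
    # The two loads below stand in for "load the vocab" and "read the config".
    vocab = Vocabulary.from_files(os.path.join(serialization_dir, 'vocabulary'))
    params = Params.from_file(os.path.join(serialization_dir, 'model_params.json'))
    model = cls.from_params(vocab, params.pop('model'))
    model.load_state_dict(torch.load(os.path.join(serialization_dir, weights_file)))
    return model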

ImportError: dlopen: cannot load any more object with static TLS

I installed AllenNLP from source, and when I followed the steps on the Getting Started page and ran the command python -m allennlp.run, I got the following errors:

Traceback (most recent call last):
  File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 184, in _run_module_as_main
    "__main__", mod_spec)
  File "/data/bo718.wang/anaconda3/lib/python3.5/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/run.py", line 10, in <module>
    from allennlp.commands import main  # pylint: disable=wrong-import-position
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/__init__.py", line 3, in <module>
    from allennlp.commands.serve import add_subparser as add_serve_subparser
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/commands/serve.py", line 27, in <module>
    from allennlp.service import server_sanic
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/service/server_sanic.py", line 19, in <module>
    from allennlp.models.archival import load_archive
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/__init__.py", line 6, in <module>
    from allennlp.models.archival import archive_model, load_archive
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/archival.py", line 10, in <module>
    from allennlp.models.model import Model, _DEFAULT_WEIGHTS
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/models/model.py", line 12, in <module>
    from allennlp.data import Vocabulary
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/__init__.py", line 1, in <module>
    from allennlp.data.dataset import Dataset
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/dataset.py", line 13, in <module>
    from allennlp.data.instance import Instance
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/instance.py", line 3, in <module>
    from allennlp.data.fields.field import DataArray, Field
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/__init__.py", line 12, in <module>
    from allennlp.data.fields.text_field import TextField
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/fields/text_field.py", line 11, in <module>
    from allennlp.data.token_indexers.token_indexer import TokenIndexer, TokenType
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/__init__.py", line 6, in <module>
    from allennlp.data.token_indexers.token_characters_indexer import TokenCharactersIndexer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/token_indexers/token_characters_indexer.py", line 10, in <module>
    from allennlp.data.tokenizers.character_tokenizer import CharacterTokenizer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/__init__.py", line 7, in <module>
    from allennlp.data.tokenizers.word_tokenizer import WordTokenizer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 13, in <module>
    class WordTokenizer(Tokenizer):
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_tokenizer.py", line 39, in WordTokenizer
    word_splitter: WordSplitter = SpacyWordSplitter(),
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/allennlp/data/tokenizers/word_splitter.py", line 144, in __init__
    import spacy
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/__init__.py", line 5, in <module>
    from .deprecated import resolve_model_name
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/deprecated.py", line 8, in <module>
    from .cli import download
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/__init__.py", line 5, in <module>
    from .train import train, train_config
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/cli/train.py", line 8, in <module>
    from ..scorer import Scorer
  File "/data/bo718.wang/anaconda3/lib/python3.5/site-packages/spacy/scorer.py", line 4, in <module>
    from .gold import tags_to_entities
ImportError: dlopen: cannot load any more object with static TLS

I googled the error message dlopen: cannot load any more object with static TLS and found that this issue seems to be related to importing the spacy package, and that changing the import order might help. But when I looked into the source code, I couldn't find anything obvious to change. So far I haven't been able to run any demo on my machine. Has anybody else encountered the same problem?
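
For anyone hitting the same thing: this glibc error usually means too many libraries needing static TLS were dlopen'ed late. One workaround that has helped elsewhere (not a confirmed fix for this report) is to import the heavy C-extension packages first, before allennlp pulls them in indirectly:

# at the very top of your entry script (or a local copy of allennlp/run.py)
import torch   # pylint: disable=unused-import  (imported early only so its shared libraries load first)
import spacy   # pylint: disable=unused-import  (likewise; both imports are otherwise unused here)

from allennlp.commands import main  # the entry point shown in the traceback above
# ... rest of the script unchanged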

Add a `text_to_instance` method on `DatasetReader`

The Predictors currently have to pull out the tokenizer and the token indexers from the DatasetReader and recreate what the DatasetReader does internally for every instance. This means that if we want to change (or add parameters to) what the DatasetReader does, we have to change the Predictors to match. Instead, we should just add a text_to_instance method on the DatasetReader itself, so that the Predictor can just keep around the DatasetReader, and pass off all processing to it.
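
A minimal sketch of the shape this could take for a reader over single pieces of text (the field name and the private attributes are illustrative):

def text_to_instance(self, text: str) -> Instance:
    tokens = self._tokenizer.tokenize(text)
    return Instance({'text': TextField(tokens, self._token_indexers)})

The Predictor would then just hold on to the DatasetReader and call reader.text_to_instance(...), instead of re-doing the tokenization and indexing itself.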

Allow config file overrides in `evaluate`

And other places where we load models. Actually, because this is going through to load_archive, that method itself has to allow overrides, and any entry point that reaches load_archive needs a way to pass in overrides.
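
A self-contained sketch of just the merge step that load_archive (and every entry point reaching it) would need to apply; the function name and the JSON-string interface are assumptions, not existing API:

import json
from typing import Any, Dict

def merge_overrides(config: Dict[str, Any], overrides_json: str) -> Dict[str, Any]:
    def merge(base: Dict[str, Any], extra: Dict[str, Any]) -> Dict[str, Any]:
        merged = dict(base)
        for key, value in extra.items():
            if isinstance(value, dict) and isinstance(merged.get(key), dict):
                merged[key] = merge(merged[key], value)   # recurse into nested config blocks
            else:
                merged[key] = value
        return merged
    return merge(config, json.loads(overrides_json) if overrides_json else {})

# e.g. merge_overrides(archived_config, '{"trainer": {"cuda_device": -1}}')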

Make explicit wrappers for LearningRateSchedulers

Like our Init wrapper. This will allow us to remove the type checks that were necessary in Trainer by handling the API differences between pytorch's LRSchedulers with separate wrappers (or just a single wrapper that does some inspection of the wrapped object...).
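
A sketch of the "single wrapper that does some inspection" variant, hiding the one API difference pytorch's schedulers have (ReduceLROnPlateau wants a metric in step(), the others do not):

import torch

class LearningRateScheduler:
    def __init__(self, lr_scheduler) -> None:
        self.lr_scheduler = lr_scheduler

    def step(self, metric: float = None) -> None:
        if isinstance(self.lr_scheduler, torch.optim.lr_scheduler.ReduceLROnPlateau):
            self.lr_scheduler.step(metric)   # plateau scheduler needs the validation metric
        else:
            self.lr_scheduler.step()         # the rest just advance on their own schedule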

Add ability to append or prepend tokens to the `Tokenizers`

In three places now we've implemented appending and/or prepending tokens to what goes into a TextField (a null token for the SNLI reader, a stop token for SQuAD, and sentence boundary tokens for the language model reader). This should just be a basic functionality of the tokenizer.
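
A minimal sketch of the proposed behavior, with start_tokens / end_tokens as suggested parameter names (not existing API):

from typing import Callable, List

class SimpleTokenizer:
    def __init__(self,
                 split: Callable[[str], List[str]] = str.split,
                 start_tokens: List[str] = None,
                 end_tokens: List[str] = None) -> None:
        self._split = split
        self._start_tokens = start_tokens or []
        self._end_tokens = end_tokens or []

    def tokenize(self, text: str) -> List[str]:
        # The tokenizer itself prepends/appends the boundary tokens, so each
        # reader no longer has to.
        return self._start_tokens + self._split(text) + self._end_tokens

# e.g. SimpleTokenizer(end_tokens=['@@STOP@@']).tokenize("a passage")  # SQuAD-style stop token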

Move `replace_none` into `Params`

We only use it when passing something into the constructor of Params - we should just put it inside the constructor and not make the caller have to call this method every time.
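
In other words (the _replace_none body below is just a stand-in for whatever the real function does; the point is only where it gets called):

from typing import Any, Dict

def _replace_none(dictionary: Dict[str, Any]) -> Dict[str, Any]:
    # Stand-in for the existing replace_none logic, shown only so the sketch runs.
    return {key: None if value == "None" else value for key, value in dictionary.items()}

class Params:
    def __init__(self, params: Dict[str, Any]) -> None:
        # Callers can now just write Params(some_dict) instead of Params(replace_none(some_dict)).
        self.params = _replace_none(params)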

torch.Tensor type annotations

We should work out whether these are actually the correct type annotations for many of our functions. In many cases we only ever pass torch.autograd.Variables, and some functions actually require this, e.g.:

# raises
torch.nn.functional.softmax(torch.rand([3,4]))

# fine
torch.nn.functional.softmax(torch.autograd.Variable(torch.rand([3,4])))

I think we can still keep the tensor types by doing something like:
torch.autograd.Variable[torch.FloatTensor] etc.

Have Tokenizers return a Token object

This will let us get rid of the nasty offset return value, because it will just be a field on the Token, and it will let us include POS tags, for POS tag embeddings.

It's probably easiest to just return spacy's token representation directly, rather than trying to roll our own, and have other word splitters mimic spacy's API. Or we could just have them crash; not sure we really need the other word splitters at this point - we could just simplify things a lot by putting spacy directly into WordTokenizer. Anybody have any thoughts on that?
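
A rough sketch of a minimal Token type if we do roll our own (field names are illustrative):

from typing import NamedTuple, Optional

Token = NamedTuple('Token', [('text', str),
                             ('idx', Optional[int]),    # character offset, replacing the separate offsets return value
                             ('pos', Optional[str])])   # POS tag, for POS tag embeddings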

Upgrade to python 3.6

This will let us use variable type annotations, and remove all of the unused-imports in the code.

I don't think there are any big issues with just changing the python version in our images and build settings, so I'm labeling this as easy, but it's possible there is some library we're using that's not compatible and it will end up being hard.
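
For reference, the variable-annotation difference this is about:

from typing import Dict, List  # on 3.5 these look unused to pylint when only referenced in type comments

# Python 3.5: the type can only go in a comment, so the import above gets flagged.
indexers = {}  # type: Dict[str, List[int]]

# Python 3.6: the same thing as a variable annotation, so the import is visibly used.
indexers: Dict[str, List[int]] = {}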

Have some parameter versioning, or something

It'd be nice to have some way of managing config file changes, so that, e.g., if we add a new required parameter, or change the name of a flag, config files don't break mysteriously for an end user. Even better if we can make things backwards compatible when they change. Not sure at all how to make this happen in a reasonable way, though.

Decompose `Trainer.train()` into smaller methods

It's grown to be a ~150 line method that's pretty hard to reason about. It's a bit tricky to decompose, though, because of all of the dependencies between different parts of the method, but we should be able to pull out a bunch of it into separate methods.

Parse the log file for _actual_ parameters used to save in the model archive

Instead of copying the input parameter file when archiving a model, we should parse the log file to get the actual parameters that were used (including defaults). This will make model archiving more robust to changes in default parameters (as recently happened with the tokenizer). Seems like quite a bit of work to be sure the parameters are parsed out correctly, though, and it's not super high priority.

Figure out cause of slow imports

And fix it, if possible.

If you just import something from the library, like from allennlp.data import Vocabulary, there's a several second delay. Not sure what the cause is, but it seems like some __init__.py somewhere is doing more than it should, or something is getting run on import when it shouldn't be.
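
A generic way to see where the time is going (plain cProfile, not a known fix); run it in a fresh interpreter so nothing is already cached:

import cProfile

cProfile.run('from allennlp.data import Vocabulary', sort='cumtime')
# Sort by cumulative time and look for an __init__.py (or a spacy/nltk model load)
# near the top of the listing.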

Change `Instance.metadata` into a `MetadataField`

This would remove the need for all of the special casing and reflection that I did to pass the metadata through correctly. Basically we take all of the metadata-related code from this PR and replace it with a MetadataField. There will still be a little bit of special casing, unless we also move the array creation code into a class method on the Field objects (probably on Field itself, overriden by MetadataField). In particular, I mean this code:

if field_name == 'metadata':
    continue
if isinstance(field_array_list[0], dict):
    # This is creating a dict of {token_indexer_key: batch_array} for each
    # token indexer used to index this field. This is mostly utilised by TextFields.
    token_indexer_key_to_batch_dict = defaultdict(list)  # type: Dict[str, List[numpy.ndarray]]
    for namespace_dict in field_array_list:
        for indexer_name, array in namespace_dict.items():
            token_indexer_key_to_batch_dict[indexer_name].append(array)
    field_arrays[field_name] = {indexer_name: numpy.asarray(array_list)  # type: ignore
                                for indexer_name, array_list in token_indexer_key_to_batch_dict.items()}
else:
    field_arrays[field_name] = numpy.asarray(field_array_list)
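
A very rough sketch of the field itself (method names follow the "class method on the Field objects" idea above; none of this is existing API):

class MetadataField:
    """A Field whose 'array' is just the raw metadata value, so batching keeps a plain list."""
    def __init__(self, metadata) -> None:
        self.metadata = metadata

    def get_padding_lengths(self) -> dict:
        return {}

    def as_array(self, padding_lengths: dict):
        return self.metadata

    @classmethod
    def batch_arrays(cls, arrays: list) -> list:
        # Metadata is never stacked into a numpy array; just keep the per-instance list.
        return arrays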

Error when installing requirements in a Conda environment.

This issue was brought up to me by Nikket.

Hi Michael,

I followed the steps in the readme, and got stuck in step 4.

  1. Download and install Conda.

  2. Create a Conda environment with Python 3.

conda create -n allennlp python=3.5

  3. Now activate the Conda environment.

source activate allennlp

  4. Install the required dependencies.

INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh

  5. Visit http://pytorch.org/ and install the relevant pytorch package.

  6. Set the PYTHONHASHSEED for repeatable experiments.

export PYTHONHASHSEED=2157

I get stuck on step 4, whose remedy seems to be https://stackoverflow.com/questions/1449396/how-to-install-setuptools, but I just wanted to be sure that I am not doing anything wrong.

(allennlp) nikett:allennlp nikett$ INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh

 Collecting git+git://github.com/mkorpela/overrides.git@40f8bd1fae7a3364a1 (from -r requirements.txt (line 23))
  Cloning git://github.com/mkorpela/overrides.git (to 40f8bd1fae7a3364a1) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-rc43o34j-build
  Could not find a tag or branch '40f8bd1fae7a3364a1', assuming commit.
Could not import setuptools which is required to install from a source distribution.


Please install setuptools.
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'nltk.downloader' (ImportError: No module named 'nltk')
/Users/nikett/anaconda/envs/allennlp/bin/python: Error while finding module specification for 'spacy.en.download' (ImportError: No module named 'spacy')


Collecting git+git://github.com/PyCQA/pylint.git@2561f539d60a3563d6507e7a22e226fb10b58210 (from -r requirements_test.txt (line 6))
  Cloning git://github.com/PyCQA/pylint.git (to 2561f539d60a3563d6507e7a22e226fb10b58210) to /private/var/folders/hj/kby3swx56l9bf_1v93z87b840000gp/T/pip-gd8xgqm9-build
  Could not find a tag or branch '2561f539d60a3563d6507e7a22e226fb10b58210', assuming commit.
Could not import setuptools which is required to install from a source distribution.
Please install setuptools.
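
A hedged remedy, per the linked StackOverflow question: make sure pip and setuptools exist inside the activated environment before re-running the script; the nltk and spacy download steps should then find their packages.

source activate allennlp
pip install --upgrade pip setuptools
INSTALL_TEST_REQUIREMENTS="true" ./scripts/install_requirements.sh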

Notebook Checklist

  • Vocabulary

  • Data API - Fields, Instances, Dataset.

  • Iterators and Training a model.

  • Tokens -> Tokenizers -> TokenIndexers Abstraction.

  • Writing a DatasetReader example.

  • Writing a Model, differences between torch.nn.Module.

  • Why have Params? How to build things which run from JSON.

  • TokenEmbedders -> TextFields -> representation Abstraction.

  • Seq2SeqEncoders and Seq2VecEncoders and how to use them.

  • How to make your model Servable and deploy a server via Docker.

pytest -v Issues

I installed Python 3.6 into an Anaconda environment and installed all of the requirements.txt and requirements_test.txt packages, pytorch, etc.

When I run pytest -v, the tests are failing.
It is accessing the python 2.7 packages (not the python 3.6 ones):

(py36) David-Laxers-MacBook-Pro:allennlp davidlaxer$ pytest -v
============================= test session starts ==============================
platform darwin -- Python 2.7.13, pytest-3.1.1, py-1.4.33, pluggy-0.4.0 -- /Users/davidlaxer/anaconda/bin/python
cachedir: .cache
rootdir: /Users/davidlaxer/allennlp, inifile: pytest.ini
collected 0 items / 68 errors 

==================================== ERRORS ====================================
___________________ ERROR collecting tests/notebooks_test.py ___________________
../anaconda/lib/python2.7/site-packages/_pytest/python.py:408: in _importtestmodule
    mod = self.fspath.pyimport(ensuresyspath=importmode)
../anaconda/lib/python2.7/site-packages/py/_path/local.py:662: in pyimport
    __import__(modname)
E     File "/Users/davidlaxer/allennlp/tests/notebooks_test.py", line 17
E       def execute_notebook(notebook_path: str):
E                                         ^
E   SyntaxError: invalid syntax
[...]
ImportError while importing test module '/Users/davidlaxer/allennlp/tests/training/metrics/span_based_f1_measure_test.py'.
Hint: make sure your test modules/packages have valid Python names.
Traceback:
tests/training/metrics/span_based_f1_measure_test.py:2: in <module>
    import torch
E   ImportError: No module named torch
!!!!!!!!!!!!!!!!!!! Interrupted: 68 errors during collection !!!!!!!!!!!!!!!!!!!
=========================== 68 error in 5.49 seconds ===========================
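
This looks like pytest resolving to the base Anaconda Python 2.7 rather than the py36 environment. Some generic checks (not project-specific):

source activate py36
which python pytest            # both should point into .../envs/py36/bin
pip install pytest             # if pytest lives only in the base environment, install it in py36
python -m pytest -v            # or bypass the wrapper and run pytest with the env's interpreter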

Figure out the right way to instantiate Vocabulary `from_params`

We need this so we can configure the vocabulary properly from an experiment config. We need a method that takes a Dataset and a Params and instantiates the object. Not sure what the most sane way is to do this - should we just only allow certain ways of constructing the vocab?
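
One possible shape for this, assuming we keep the existing from_dataset constructor and only allow its keyword arguments to come from the config (the parameter names below are illustrative):

@classmethod
def from_params(cls, params: Params, dataset: Dataset) -> 'Vocabulary':
    min_count = params.pop('min_count', 1)
    max_vocab_size = params.pop('max_vocab_size', None)
    # from_dataset stands in for however we construct the vocab from data today.
    return cls.from_dataset(dataset, min_count=min_count, max_vocab_size=max_vocab_size)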

Figure out how to get spacy to tokenize wiki text correctly

SQuAD has plenty of paragraphs that have wiki notes, formatting like "This was a protest.[note 4]". Spacy for some reason does not tokenize these strings correctly, giving "protest.[note" as a single token. We should be able to improve performance on SQuAD at least a little bit by fixing these issues, as it affects a fair number of our training examples, and some of the dev set.

A test that currently fails, but should pass (goes in word_splitter_test.py):

 def test_tokenize_handles_wiki_notes(self):
     passage = "McWhorter writes of Lee, \"for a white person from the South to write a " +\
             "book like this in the late 1950s is really unusual\u2014by its very existence " +\
             "an act of protest.\"[note 4] Author James McBride calls Lee brilliant but " +\
             "stops short of calling her brave: \"I think by calling Harper Lee brave you " +\
             "kind of absolve yourself of your own racism.\""
     tokens, offsets = self.word_splitter.split_words(passage)
     assert "protest" in tokens

Move the call to `model.cuda()` to before optimizer creation

Because of this issue. The optimizer might have state that's initialized from the model parameters, and needs to be on the right device.

This means we should either:

  1. Have cuda_device be a top-level key in the experiment config, so we can move the model over in commands.train() before constructing the optimizer.
  2. Move the optimizer creation and the call to model.cuda() into Trainer.from_params().

I could go either way. Calling model.cuda() inside of from_params() in the second option is a little bit more logic than we like to have in those methods, but not much. The optimizer conceptually seems like it's part of the trainer, so having the optimizer params inside of the trainer params makes sense.
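
A minimal illustration of the ordering itself, in plain pytorch rather than the Trainer code: move the model first, then build the optimizer from its (now on-device) parameters.

import torch

model = torch.nn.Linear(10, 10)
cuda_device = 0  # whatever the experiment config says
if cuda_device >= 0 and torch.cuda.is_available():
    model.cuda(cuda_device)
# Only now construct the optimizer, so any state it initializes from the
# parameters lives on the same device as the model.
optimizer = torch.optim.Adam(model.parameters(), lr=0.001)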

Figure out non-determinism due to PYTHONHASHSEED

Reported by @schmmd. I'm not really sure what could be causing this, because I didn't think there was any randomness in model.forward() after model.eval() has been called. But here are steps to reproduce:

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 2157
$ allennlp/run serve

1.  Navigate to http://localhost:8000.
2.  Click the MC Model tab.
3.  Submit the last example (The Millennium Falcon…)

> “spaceship”

$ git checkout schmmd/weird-bug
$ set -x PYTHONHASHSEED 4563123
$ allennlp/run serve

1.  Navigate to http://localhost:8000.
2.  Click the MC Model tab.
3.  Submit the last example (The Millennium Falcon…)

> “variety of Star Wars expanded …”
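
A quick check that the hash seed really changes string hashing, and therefore the iteration order of any set of strings, which is a plausible (not confirmed) source of the differing answers above:

import os
import subprocess

for seed in ('2157', '4563123'):
    env = dict(os.environ, PYTHONHASHSEED=seed)
    out = subprocess.check_output(
        ['python', '-c', "print(sorted({'spaceship', 'falcon', 'wars'}, key=hash))"], env=env)
    print(seed, out.decode().strip())   # the orderings typically differ between the two seeds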

Web demo does not work on Firefox

ReferenceError: event is not defined[Learn More] demo.allennlp.org:686:13
    onClick http://demo.allennlp.org/:686:13
    onClick self-hosted:987:17
    [55]</ReactErrorUtils.invokeGuardedCallback http://demo.allennlp.org/lib/react-dom.js:9036:7
    executeDispatch http://demo.allennlp.org/lib/react-dom.js:2996:5
    executeDispatchesInOrder http://demo.allennlp.org/lib/react-dom.js:3019:5
    executeDispatchesAndRelease http://demo.allennlp.org/lib/react-dom.js:2427:5
    executeDispatchesAndReleaseTopLevel http://demo.allennlp.org/lib/react-dom.js:2438:10
    forEach self-hosted:267:13
    forEachAccumulated http://demo.allennlp.org/lib/react-dom.js:15456:5
    processEventQueue http://demo.allennlp.org/lib/react-dom.js:2638:7
    runEventQueueInBatch http://demo.allennlp.org/lib/react-dom.js:9060:3
    handleTopLevel http://demo.allennlp.org/lib/react-dom.js:9070:5
    handleTopLevelImpl http://demo.allennlp.org/lib/react-dom.js:9147:5
    perform http://demo.allennlp.org/lib/react-dom.js:14760:13
    batchedUpdates http://demo.allennlp.org/lib/react-dom.js:8825:14
    batchedUpdates http://demo.allennlp.org/lib/react-dom.js:12895:10
    dispatchEvent http://demo.allennlp.org/lib/react-dom.js:9222:7
    dispatchEvent self-hosted:987:17
