
argilla-io / biome-text


Custom Natural Language Processing with big and small models 🌲🌱

Home Page: https://recognai.github.io/biome-text/

License: Other

Makefile 0.22% Python 99.78%
allennlp data-science natural-language-processing nlp pytorch

biome-text's People

Contributors

dvsrepo, frascuchon, ignacioct, javispp, leire-a, leiyre

biome-text's Issues

Process output metrics calculation fails when training on GPU

When testing the TextClassifier on GPU, the following error is shown:

/usr/local/lib/python3.6/dist-packages/biome/text/modules/heads/classification/defs.py in process_output(self, output)
     90 
     91         if not isinstance(probs_batch, numpy.ndarray):
---> 92             probs_batch = probs_batch.data.numpy()
     93 
     94         output_map_probs = []

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

There are two alternatives to solve this (see the sketch below):

  1. Use .cpu() to copy the tensor to the CPU, as suggested by the error message.
  2. Use torch methods instead of numpy's for calculating the metrics (e.g., https://pytorch.org/docs/stable/torch.html?highlight=argmax#torch.argmax). We need to check if this is feasible.
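
A minimal sketch of both alternatives, assuming probs_batch is the tensor that process_output receives (the stand-in tensor below is hypothetical):

import numpy
import torch

# Hypothetical stand-in for the probabilities tensor inside process_output:
probs_batch = torch.rand(4, 2, device="cuda" if torch.cuda.is_available() else "cpu")

# Alternative 2: stay in torch, so no conversion is needed at all
# (torch.argmax works on both CPU and CUDA tensors).
predictions = torch.argmax(probs_batch, dim=-1)

# Alternative 1: copy the tensor to host memory before calling .numpy().
if not isinstance(probs_batch, numpy.ndarray):
    probs_batch = probs_batch.detach().cpu().numpy()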

Logging control

The allennlp logs are back again. The idea is to show verbose logging only during the training process, and only when the verbose flag is enabled.

Allow batch prediction

The default batch prediction method defined in allennlp.Predictor does not handle errors in the text-to-instance transformation: if a single input cannot be predicted, the whole batch fails with a global error.

We need to handle those cases to properly support batch prediction. We have two alternatives (see the sketch below):

  1. Implement a custom predict_batch_json that handles empty instances.
  2. Make the model pipeline (reader + architecture) robust to void instances (we need to check whether creating void instances is possible with the allennlp API).
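
A minimal sketch of alternative 1, assuming the AllenNLP 0.x Predictor API (predict_batch_json, _json_to_instance and predict_instance are existing Predictor methods; the skip-on-error policy is illustrative, and a real implementation would still batch the valid instances):

from typing import List, Optional

from allennlp.common.util import JsonDict
from allennlp.predictors import Predictor


class RobustPredictor(Predictor):
    """Predict a batch, skipping inputs that fail text-to-instance."""

    def predict_batch_json(self, inputs: List[JsonDict]) -> List[Optional[JsonDict]]:
        results: List[Optional[JsonDict]] = []
        for json_dict in inputs:
            try:
                instance = self._json_to_instance(json_dict)
                results.append(self.predict_instance(instance))
            except Exception:
                # One bad input no longer fails the whole batch;
                # callers get None at the failing position instead.
                results.append(None)
        return results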

Default mapping in _DataSourceReader does not get updated when using set_head method

Testing pretraining + fine-tuning (in examples/4.language model/fine_tune classifier), we use datasource definitions with no mapping but a dataset with text and label columns.

Training after changing the head from LM to TextClassification fails, because default_mapping in _DataSourceReader keeps the values from the previous head (LM): text: text.

Reader for iterable values that aren't lists

Passing numpy arrays or other iterable structures to the text_to_instance method results in an error:

  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
    self.apply_series_generator()
  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
    results[i] = self.f(v)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/commands/explore/explore.py", line 154, in <lambda>
    lambda x: pipeline.predict_json(x.to_dict()), axis=1, meta=(None, object)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 268, in predict_json
    instance = self._json_to_instance(inputs)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 179, in _json_to_instance
    return self.reader.text_to_instance(**json_dict)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/sequence_classifier_dataset_reader.py", line 35, in text_to_instance
    tokens_field = self.build_textfield(tokens)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 135, in build_textfield
    for value in data
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 136, in <listcomp>
    if value
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 539')
INFO:allennlp.models.archival:removing temporary unarchived model dir at /tmp/tmpjyg60096
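
The root cause is the bare if value truthiness check in build_textfield, which is ambiguous for numpy arrays. A minimal sketch of an explicit check (the helper name is hypothetical):

import numpy

def is_non_empty(value) -> bool:
    # `if value` raises ValueError for arrays with more than one
    # element, so check emptiness explicitly instead.
    if value is None:
        return False
    if isinstance(value, numpy.ndarray):
        return value.size > 0
    if hasattr(value, "__len__"):
        return len(value) > 0
    return bool(value)

# In build_textfield, `if value` would become `if is_non_empty(value)`.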

Review of vocab and extend_vocab related params

I have been testing the vocab-related params and have a few comments:

Params

  • vocab_config: used when creating a Pipeline with a vocabulary via Pipeline.from_config and Pipeline.from_file, using VocabularyConfiguration. Useful for filtering unusual words (min_count) and other things.

  • vocab: passed to pipeline.train to launch a training experiment with an instantiated pipeline. Useful for reusing an existing vocabulary, avoiding its creation during training. From my tests, this param seems to be ignored. For example, if I create a pipeline without vocab_config, I have an empty vocab within the pipeline. If I then pass a pre-generated vocab like this:

# vocab creation with:
...
vocab_config=VocabularyConfiguration(sources=[train, validation], min_count={'words': 10}),
...
pl.save_vocab('vocabulary_min_count_10')

pipe = pl.train(
    output='experiment_pretraining_preproc',
    trainer=TrainerConfiguration(optimizer='adam', num_epochs=5, patience=3),
    training='data/tweets-spanish/train.yml',
    validation='data/tweets-spanish/validation.yml',
    vocab='vocabulary_min_count_10/',
    extend_vocab=False
)

The vocab param is ignored and my training run uses an empty vocab (containing only @unknown and other predefined tokens).

  • extend_vocab: This param defaults to False everywhere except in the entry method (train). I am not sure if there is a reason for it to default to True there, but I would vote for having it default to False in pipeline.train as well. In my tests this can cause strange behaviours if not taken care of, like increasing the size of an existing model's embeddings and failing only after training, during the pipeline.__class__( pretrained_path=os.path.join(config.output, "model.tar.gz"), config=pipeline.config, ) step.

macOS Mojave - installing jsonnet fails

On macOS Mojave, the jsonnet install fails:

  1 warning generated.
  clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
  ld: library not found for -lstdc++
  clang: error: linker command failed with exit code 1 (use -v to see invocation)
  error: Setup script exited with error: command 'g++' failed with exit status 1
  make: *** [dev] Error 1 

A working solution can be found here: google/jsonnet#573 (comment)

It's basically:

$ xcode-select --install
$ open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

Cleanup branches

@frascuchon Can we do a cleanup of the branches? We have a lot of them, and removing stale ones keeps the git workspace cleaner...

InputFeaturizer initialization

Create InputFeaturizer objects by passing word and char specs directly, instead of configuration dicts:

InputFeaturizer(
    words=WordsFeaturesSpecs(embedding_dim=50, lowercase_tokens=True),
    chars=CharsFeaturesSpec(embedding_dim=10, encoder={"type": ...}, ...),
)

Also, align the spec naming and make them public classes:

_CharacterFeaturesSpec -> CharsFeaturesSpec
_WordFeaturesSpecs -> WordsFeaturesSpecs

Common datasource configuration format

In order to standardize the datasource configuration format, I suggest a new YAML schema, explained in the following example:

# Define the datasource source (implicitly gives the ds format)
source: my/path/to/file.csv

# Extra attributes related to how to read the source
attributes:
   encoding: utf-8
   delimiter: ,

For file-based datasources, we could define the source as a list of paths:

source: 
- a/path/to/file.csv
- another/folder/containing/*.csv

This definition even extends to non file-based sources. An example would be:

source: elasticsearch

attributes:
  es_host: https://elasticsearch.service:9200
  es_index: my-cool-index
  es_type: _docs
  query: { "query" : {"match_all" : {} }}
...
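
A minimal sketch of how a loader could interpret this schema (the function name is hypothetical; only the YAML layout comes from the proposal above):

import glob

import yaml  # PyYAML


def load_datasource_config(path: str) -> dict:
    # Read the proposed schema: a `source` plus optional `attributes`.
    with open(path) as f:
        config = yaml.safe_load(f)

    source = config["source"]
    attributes = config.get("attributes", {})

    # File-based sources may be a single path or a list of paths/globs;
    # non file-based sources (e.g. "elasticsearch") pass through as-is.
    if isinstance(source, list):
        source = [p for pattern in source for p in glob.glob(pattern)]

    return {"source": source, "attributes": attributes}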

Check setuptools minimum version

Some installations raise a setuptools error related to the find_namespace_packages method:

Obtaining file:///Users/lea/Documents/biome-text
    ERROR: Command errored out with exit status 1:
     command: /Users/lea/.biome/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/Users/lea/Documents/biome-text/setup.py'"'"'; __file__='"'"'/Users/leirea/Documents/biome-text/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/fg/mftly61x1dvbs5dp3p6v5_vh0000gp/T/pip-pip-egg-info-sd6_7k75
         cwd: /Users/leirea/Documents/biome-text/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/lea/Documents/biome-text/setup.py", line 6, in <module>
        from setuptools import setup, find_namespace_packages
    ImportError: cannot import name 'find_namespace_packages'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
make: *** [dev] Error 1

The installation should detect the installed setuptools version and check whether it is compatible, as in the sketch below.
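
A minimal sketch for the top of setup.py, assuming setuptools >= 40.1.0 is the actual requirement (find_namespace_packages was added in setuptools 40.1.0):

import setuptools

try:
    from setuptools import setup, find_namespace_packages
except ImportError as error:
    raise ImportError(
        f"setuptools {setuptools.__version__} is too old; "
        "find_namespace_packages needs setuptools >= 40.1.0. "
        "Upgrade with: pip install -U 'setuptools>=40.1.0'"
    ) from error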

Cannot set NaN's as empty string for column path

Lately I get a lot of these warnings when creating a DataSource:

WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path
WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path

I want to have a look at this!

Unify `metadata_file` and `values_mapping`

The functionality of metadata_file should go into values_mapping (see the ExamplePreparator).
I also want to change the format a bit to get rid of the target definition (see #3).

Implement a learning rate finder

With our new config.json we lose the ability to use the allennlp find-lr command:

(biome) macbook-pro-de-recognai:1 david$ allennlp find-lr -s find-lr --include-package=biome.text config.json
2020-05-18 16:42:20,680 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,551 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,561 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,956 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,959 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,961 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,963 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 258, in pop
    value = self.params.pop(key)
KeyError: 'dataset_reader'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/bin/allennlp", line 8, in <module>
    sys.exit(run())

 File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 127, in find_learning_rate_from_args
    force=args.force)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 174, in find_learning_rate_model
    all_datasets = datasets_from_params(params)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/training/util.py", line 158, in datasets_from_params
    dataset_reader_params = params.pop('dataset_reader')
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 260, in pop
    raise ConfigurationError("key \"{}\" is required at location \"{}\"".format(key, self.history))
allennlp.common.checks.ConfigurationError: 'key "dataset_reader" is required at location ""'

We should implement either our own learning rate finder or a wrapper around the allennlp command. A sketch of a plain LR range test follows.
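
A minimal sketch of an LR range test in plain PyTorch (not the biome.text API): the learning rate grows exponentially each step, and the recorded (lr, loss) curve shows where training starts to diverge.

import torch


def find_learning_rate(model, loss_fn, batches,
                       start_lr=1e-7, end_lr=10.0, num_steps=100):
    # Increase the learning rate exponentially each step and record
    # the loss; a good LR usually sits an order of magnitude below
    # the point where the loss curve reaches its minimum.
    optimizer = torch.optim.Adam(model.parameters(), lr=start_lr)
    gamma = (end_lr / start_lr) ** (1.0 / num_steps)
    scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

    history = []
    for _, (inputs, targets) in zip(range(num_steps), batches):
        optimizer.zero_grad()
        loss = loss_fn(model(inputs), targets)
        loss.backward()
        optimizer.step()
        history.append((optimizer.param_groups[0]["lr"], loss.item()))
        scheduler.step()
    return history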

Unfix pyyaml version

I had a quick look at awscli, and it should also work with the latest pyyaml version.
I think we can remove the version pin in setup.py.

The downside could be that pip will complain about incompatibilities.

Trainer configurations

@frascuchon @dvsrepo I propose to expose these two options in the TrainerConfiguration class:

num_serialized_models_to_keep: 1
should_log_learning_rate: true

Or maybe hardcode them as defaults; I always include them in my training runs.

How AllenNLP deals with empty Fields.

This issue just describes how AllenNLP deals with empty Fields, that is, Fields (TextField, LabelField) created from empty strings. How we currently deal with empty strings is described in #59.

An empty string in the LabelField results in an empty-string class when creating the vocab for the labels.

An empty TextField passed on directly to the model results in an error:

Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 78, in <module>
    main()
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 71, in main
    args.func(args)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 128, in learn_from_args
    workers=args.workers,
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 195, in learn
    force=True,
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/sequence_classifier.py", line 41, in forward
    encoded_text = self.forward_tokens(tokens)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/base_model_classifier.py", line 164, in forward_tokens
    tokens, mask=mask, num_wrapping_dims=self._num_wrapping_dims
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 35, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in forward
    reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in <listcomp>
    reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 67, in _reshape_tensor
    raise RuntimeError(f"No dimension to distribute: {input_size}")
RuntimeError: No dimension to distribute: torch.Size([1, 0])

A completely empty ListField([]) results in the following error:

  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/data/fields/list_field.py", line 30, in __init__
    str(field_class_set)
AssertionError: ('ListFields must contain a single field type, found set()', 'occurred at index 0')

Empty TextFields get padded in a ListField. For example, if there is a ListField with 3 TextFields (the first not empty, the last two empty), the mask looks like:

[[[1, ...], [0], [0]]]

Also, if you specify sorting by "num_fields" in the trainer, ListFields with different lengths get padded. For example, if you have 2 ListFields with 3 and 2 TextFields, the mask for the second ListField looks like:

[[[1, ...], [1, ...], [0]]]
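
A minimal reproduction sketch of the two failure modes above (AllenNLP 0.x-era API; exact module paths may differ across versions):

from allennlp.data.fields import ListField, TextField
from allennlp.data.token_indexers import SingleIdTokenIndexer

indexers = {"tokens": SingleIdTokenIndexer()}

# An empty string tokenizes to zero tokens, giving an empty TextField;
# passing it directly to a model triggers the "No dimension to
# distribute" RuntimeError during forward.
empty_text = TextField([], indexers)

# A completely empty ListField fails already at construction time:
ListField([])  # AssertionError: ListFields must contain a single field type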

Number of `trainable_parameters` not the same

The number of trainable parameters shown in the training log (verbose=True) differs slightly from the number in Pipeline.trainable_parameters.

In my example the difference is 267170 vs. 264226.
