argilla-io / biome-text


Custom Natural Language Processing with big and small models 🌲🌱

Home Page: https://recognai.github.io/biome-text/

License: Other

Makefile 0.22% Python 99.78%

allennlp data-science natural-language-processing nlp pytorch

biome-text's People

arunadevikaruppasamy dcfidalgo radovankavicky gapdata ignacioct javispp trendingtechnology

biome-text's Issues

DataFrame MultiIndex for tokens containing multiple columns of the DataFrame

Just a reminder issue for me.

Maybe we could use the multiindex feature of the DataFrames to "save" the content of a token, instead of duplicating the information in a dict like we do now. Have to check in more detail!

Process output metrics calculation fails when training on GPU

Testing the TextClassifier on GPU, the following error is shown:

/usr/local/lib/python3.6/dist-packages/biome/text/modules/heads/classification/defs.py in process_output(self, output)
     90 
     91         if not isinstance(probs_batch, numpy.ndarray):
---> 92             probs_batch = probs_batch.data.numpy()
     93 
     94         output_map_probs = []

TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.

There are two alternatives to solve this:

Use .cpu() to copy the tensor to the CPU, as suggested by the error.
Use torch methods instead of numpy's for calculating the metrics (e.g., https://pytorch.org/docs/stable/torch.html?highlight=argmax#torch.argmax). We need to check if this is feasible.

Logging control

The allennlp logs are back again. The idea is to show verbose logging only during the train process, and only when the verbose flag is enabled.
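A minimal sketch of that idea, using only the standard logging module. The helper name configure_logging and its verbose parameter are hypothetical; the only assumption about allennlp is that its loggers live under the "allennlp" namespace, as the allennlp.models.archival log lines elsewhere in these issues suggest.

```python
import logging


def configure_logging(verbose: bool = False) -> None:
    """Show allennlp INFO logs only when `verbose` is set (hypothetical helper)."""
    level = logging.INFO if verbose else logging.WARNING
    # Child loggers such as "allennlp.models.archival" inherit this level
    # unless they explicitly set their own.
    logging.getLogger("allennlp").setLevel(level)


configure_logging(verbose=False)
quiet = logging.getLogger("allennlp.training").isEnabledFor(logging.INFO)

configure_logging(verbose=True)
loud = logging.getLogger("allennlp.training").isEnabledFor(logging.INFO)
```

Calling the helper once at the start of the train command, and leaving the default elsewhere, would keep prediction and serving quiet.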

Refreshing the web UI results in an internal server error

Problems related to the biome ui command and a manual page refresh (F5)

The default batch prediction method defined in allennlp.Predictor does not handle errors in the text-to-instance transformation, so the whole batch fails if a single input cannot be predicted.

We need to tackle those cases to apply batch prediction properly. We have 2 alternatives:

Implement a custom predict_batch_json that handles empty instances
Make the model pipeline (reader + architecture) robust to void instances (we need to check whether creating void instances is possible with the allennlp API)
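The first alternative could look roughly like the sketch below. It is not the allennlp API: predict_batch_safe and toy_predict are hypothetical names, and predict_one stands in for whatever turns one JSON dict into a prediction.

```python
from typing import Any, Callable, Dict, List, Optional


def predict_batch_safe(
    inputs: List[Dict[str, Any]],
    predict_one: Callable[[Dict[str, Any]], Dict[str, Any]],
) -> List[Optional[Dict[str, Any]]]:
    """Predict each input separately so one bad input cannot fail the batch.

    Inputs that raise are reported as None, keeping results aligned
    with the input order.
    """
    results: List[Optional[Dict[str, Any]]] = []
    for json_dict in inputs:
        try:
            results.append(predict_one(json_dict))
        except Exception:  # broad on purpose: any failure is isolated
            results.append(None)
    return results


def toy_predict(json_dict: Dict[str, Any]) -> Dict[str, Any]:
    # Toy predictor (illustration only): rejects empty text.
    if not json_dict["text"]:
        raise ValueError("empty input")
    return {"label": "ok", "length": len(json_dict["text"])}


batch = [{"text": "hello"}, {"text": ""}, {"text": "bye"}]
predictions = predict_batch_safe(batch, toy_predict)
```

Predicting one input at a time gives up the speed of true batching. A real implementation would more likely filter out the failing instances first and then run the remaining ones as a batch.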

Copy over issues from GitLab!

issues in GitLab:
https://gitlab.com/recognai-team/biome/biome-allennlp/issues

Disable restore by default and allow loading vocab data when enabled

Restoring by default could cause some weird problems when training. We will disable it as the default behaviour and require the user to enable it when needed.

Also, vocabulary will be restored from the training folder if present.

Default mapping in _DataSourceReader does not get updated when using set_head method

This came up while testing pretraining + fine-tuning (in examples/4.language model/fine_tune classifier), where we use datasource definitions with no mapping but a dataset with text and label.

Training after changing the head from LM to TextClassification fails as default_mapping in _DataSourceReader keeps the values from the previous head (LM): text: text.

Use fbeta metrics for Classification heads

Refactor metrics calculation to include macro, micro and per-label metrics with fbeta metrics
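For reference, a plain-Python sketch of what per-label, macro and micro F-beta would compute for single-label classification. The function names are hypothetical, and this is not how allennlp or biome-text implement their metrics.

```python
from typing import Dict, List


def fbeta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; returns 0.0 when the score is undefined."""
    b2 = beta * beta
    denom = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denom if denom else 0.0


def fbeta_report(gold: List[str], pred: List[str], beta: float = 1.0) -> Dict[str, object]:
    """Per-label, macro-averaged and micro-averaged F-beta."""
    labels = sorted(set(gold) | set(pred))
    counts = {label: {"tp": 0, "fp": 0, "fn": 0} for label in labels}
    for g, p in zip(gold, pred):
        if g == p:
            counts[g]["tp"] += 1
        else:
            counts[p]["fp"] += 1  # predicted label gets a false positive
            counts[g]["fn"] += 1  # gold label gets a false negative
    per_label = {label: fbeta(beta=beta, **c) for label, c in counts.items()}
    # Macro: unweighted mean of the per-label scores.
    macro = sum(per_label.values()) / len(labels) if labels else 0.0
    # Micro: pool the counts over all labels, then compute one score.
    totals = {k: sum(c[k] for c in counts.values()) for k in ("tp", "fp", "fn")}
    micro = fbeta(beta=beta, **totals)
    return {"per_label": per_label, "macro": macro, "micro": micro}


report = fbeta_report(gold=["a", "a", "b", "b"], pred=["a", "b", "b", "b"])
```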

Reader for iterable values that aren't lists

Passing numpy arrays or other iterable structures to the text_to_instance method results in an error:

  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
    return self.apply_standard()
  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
    self.apply_series_generator()
  File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
    results[i] = self.f(v)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/commands/explore/explore.py", line 154, in <lambda>
    lambda x: pipeline.predict_json(x.to_dict()), axis=1, meta=(None, object)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 268, in predict_json
    instance = self._json_to_instance(inputs)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 179, in _json_to_instance
    return self.reader.text_to_instance(**json_dict)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/sequence_classifier_dataset_reader.py", line 35, in text_to_instance
    tokens_field = self.build_textfield(tokens)
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 135, in build_textfield
    for value in data
  File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 136, in <listcomp>
    if value
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 539')
INFO:allennlp.models.archival:removing temporary unarchived model dir at /tmp/tmpjyg60096
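The traceback points at a truth test (if value) applied to an array. One possible way around it, sketched here with a hypothetical helper as_text_list, is to convert the input to a plain list first and to compare against None and "" explicitly. The sketch only covers flat iterables; nested arrays would need extra handling.

```python
from array import array
from collections.abc import Iterable
from typing import Any, List


def as_text_list(data: Any) -> List[str]:
    """Normalize a string or a flat iterable to a list of non-empty strings."""
    if data is None:
        return []
    if isinstance(data, str):
        data = [data]
    elif hasattr(data, "tolist"):
        # numpy arrays (and array.array) expose tolist(); converting first
        # avoids truth-testing array objects later on.
        data = data.tolist()
        if not isinstance(data, list):
            data = [data]
    elif isinstance(data, Iterable):
        data = list(data)
    else:
        data = [data]
    # Explicit checks instead of `if value`, so 0 is kept and arrays never
    # get truth-tested.
    return [str(value) for value in data if value is not None and str(value) != ""]


from_tuple = as_text_list(("a", "", "b"))
from_generator = as_text_list(x for x in ["a", None])
from_array = as_text_list(array("i", [1, 0, 2]))
```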

notebook usage - stdout in same cell as train command

When running a training in a cell:

trained_pl = pl.train(..., verbose=True)

all subsequent cell outputs go into the cell output of the above command ...

Allow tracing requested predictions for retraining purposes

Enable a trace mode in the pipeline that keeps all inputs and prediction results for a given serving model.

Review of vocab and extend_vocab related params

I have been testing vocab related params and have a few comments:

Params

vocab_config: Used when creating a Pipeline with a vocabulary via Pipeline.from_config and pipeline.from_file, using VocabularyConfiguration. Useful for filtering rare words (min_count) and other things.
vocab: Used in pipeline.train to launch a training experiment with an instantiated pipeline. Useful for reusing an existing vocabulary and avoiding its creation during training. From my tests, this param seems to be ignored. For example, say I create a pipeline without vocab_config, so the pipeline has an empty vocab, and then pass a pre-generated vocab like this:

# vocab creation with:
...
vocab_config=VocabularyConfiguration(sources=[train, validation], min_count={'words': 10}),
..
pl.save_vocab('vocabulary_min_count_10')

pipe = pl.train(
    output='experiment_pretraining_preproc',
    trainer=TrainerConfiguration(optimizer='adam',num_epochs=5, patience=3),
    training='data/tweets-spanish/train.yml',
    validation='data/tweets-spanish/validation.yml',
    vocab='vocabulary_min_count_10/', 
    extend_vocab=False
)

The vocab param will be ignored and my train run will use an empty vocab (only with @unknown and other predefined tokens).

extend_vocab: This param has default=False everywhere except in the entry method (train). I am not sure if there's a reason to have default=True there, but I would vote for making it False in pipeline.train as well. In my tests, this can cause strange behaviour if you are not careful, such as increasing the size of an existing model's embeddings and then failing only after training, during the pipeline.__class__( pretrained_path=os.path.join(config.output, "model.tar.gz"), config=pipeline.config, ) step.

macOS Mojave - installing jsonnet fails

On macOS Mojave, the jsonnet install fails:

  1 warning generated.
  clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
  ld: library not found for -lstdc++
  clang: error: linker command failed with exit code 1 (use -v to see invocation)
  error: Setup script exited with error: command 'g++' failed with exit status 1
  make: *** [dev] Error 1

A working solution can be found here: google/jsonnet#573 (comment)

It's basically

$ xcode-select --install
$ open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg

Check pipeline init arguments

We need a minimal check in the pipeline init method for the mutually exclusive arguments pretrained_path and config.
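A sketch of such a check, under the assumption that "exclusive" means exactly one of the two arguments must be given. The function name check_init_args is hypothetical, and the rule would need adjusting if passing both is meant to stay valid.

```python
from typing import Any, Dict, Optional


def check_init_args(
    pretrained_path: Optional[str] = None,
    config: Optional[Dict[str, Any]] = None,
) -> None:
    """Require exactly one of `pretrained_path` and `config` (assumed rule)."""
    # Both None, or both set, are rejected alike.
    if (pretrained_path is None) == (config is None):
        raise ValueError("Specify exactly one of 'pretrained_path' or 'config'.")


check_init_args(pretrained_path="model.tar.gz")  # passes
check_init_args(config={"name": "my-pipeline"})  # passes
```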

Update installation steps in main readme.md

pypi and test.pypi integration

Cleanup branches

@frascuchon Can we make a cleanup regarding the branches? We have a lot of them and it keeps the git workspace cleaner ...

InputFeaturizer initialization

Create InputFeaturizer objects passing directly words and chars specs, instead of configuration dicts.

InputFeaturizer(
  words=WordsFeaturesSpecs(embedding_dim=50,lowercase_tokens=True),
  chars=CharsFeaturesSpec(embedding_dim=10, encoder={ "type": ....}...)
)

Also, align spec naming and make them public classes

_CharacterFeaturesSpec -> CharsFeaturesSpec
_WordFeaturesSpecs -> WordsFeaturesSpecs

Cache pipeline predictions

Include functionality for cacheable pipeline inference (depending on prediction input)
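One way this could work, sketched with a hypothetical CachedPredictor wrapper: key an in-memory cache on the JSON-serialized input, so identical inputs skip the model. It assumes inputs are JSON-serializable dicts.

```python
import json
from typing import Any, Callable, Dict


class CachedPredictor:
    """Wrap a prediction callable with an in-memory cache keyed on the input."""

    def __init__(self, predict: Callable[[Dict[str, Any]], Any]):
        self._predict = predict
        self._cache: Dict[str, Any] = {}
        self.misses = 0  # how often the wrapped callable actually ran

    def __call__(self, inputs: Dict[str, Any]) -> Any:
        # sort_keys makes the key independent of dict insertion order.
        key = json.dumps(inputs, sort_keys=True)
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._predict(inputs)
        return self._cache[key]


cached = CachedPredictor(lambda inputs: {"length": len(inputs["text"])})
first = cached({"text": "hello", "lang": "en"})
second = cached({"lang": "en", "text": "hello"})  # same content, other order
```

The cache here is unbounded; a serving setup would probably want a size limit or an eviction policy.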

Log INFO:biome.text._model:Serialization directory already exists even with fresh experiments

We need to review this log message:

INFO:biome.text._model:Serialization directory (fine_tuning_boe_avg_0.001) already exists and is not empty.

It is shown every time when running pipeline.train even with fresh output directories.

Include tokenization results as part of explain result for all models

datasource forward transformation for model prediction

Implement label smoothing

If we want to implement label smoothing (a regularization method used for example in the original transformer paper), this link is helpful:
https://discuss.pytorch.org/t/cross-entropy-with-one-hot-targets/13580/2

I think it is a good method if we are not sure that the gold labels are 100% correct!
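To make the idea concrete, here is a plain-Python sketch of one common variant: the one-hot target is mixed with a uniform distribution over all classes. The function names and the epsilon value are illustrative choices, not taken from the linked thread.

```python
import math
from typing import List


def smooth_labels(gold_index: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Mix the one-hot target with a uniform distribution over all classes."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if i == gold_index else uniform
        for i in range(num_classes)
    ]


def cross_entropy(target: List[float], probs: List[float]) -> float:
    """Cross entropy between a soft target and predicted probabilities."""
    return -sum(t * math.log(p) for t, p in zip(target, probs) if t > 0)


target = smooth_labels(gold_index=0, num_classes=4, epsilon=0.1)
probs = [0.7, 0.1, 0.1, 0.1]
smoothed_loss = cross_entropy(target, probs)
hard_loss = -math.log(probs[0])  # loss with the plain one-hot target
```

With smoothing, the model is also penalized for putting too little mass on the non-gold classes, which is what discourages overconfident predictions.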

Common datasource configuration format

To standardize the datasource configuration format, I suggest a new YAML schema, illustrated by the following example:

# Define the datasource source (implicitly gives the datasource format)
source: my/path/to/file.csv

# Extra attributes related to how to read the source
attributes:
   encoding: utf-8
   delimiter: ,

For file-based datasources, we could define the source as a list of paths:

source: 
- a/path/to/file.csv
- another/folder/containing/*.csv

This definition also extends to non-file-based sources. For example:

source: elasticsearch

attributes:
  es_host: https://elasticsearch.service:9200
  es_index: my-cool-index
  es_type: _docs
  query: { "query" : {"match_all" : {} }}
...

Passing test dataset to pipeline train has no effect in metrics

Missing init method for prediction logging and cache configuration

Pipeline methods related with prediction logging or cache initialization are missing.

Check setuptools min version

Some installations raise a setuptools error related to the find_namespace_packages function:

Obtaining file:///Users/lea/Documents/biome-text
    ERROR: Command errored out with exit status 1:
     command: /Users/lea/.biome/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/Users/lea/Documents/biome-text/setup.py'"'"'; __file__='"'"'/Users/leirea/Documents/biome-text/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/fg/mftly61x1dvbs5dp3p6v5_vh0000gp/T/pip-pip-egg-info-sd6_7k75
         cwd: /Users/leirea/Documents/biome-text/
    Complete output (5 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/Users/lea/Documents/biome-text/setup.py", line 6, in <module>
        from setuptools import setup, find_namespace_packages
    ImportError: cannot import name 'find_namespace_packages'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
make: *** [dev] Error 1

The installation should detect the installed setuptools version and check whether it is compatible.
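A sketch of such a check. The minimum version below is an assumption (find_namespace_packages appears to have been added in setuptools 40.1.0, but please verify against the setuptools changelog), and the helper names are hypothetical.

```python
import re
from typing import Tuple

# Assumption: find_namespace_packages first shipped in setuptools 40.1.0;
# check the setuptools changelog before relying on this value.
MIN_SETUPTOOLS = (40, 1, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Parse the leading numeric part of a version, e.g. '40.0.0.post1'."""
    match = re.match(r"\d+(?:\.\d+)*", version)
    if match is None:
        raise ValueError(f"Unrecognized version: {version!r}")
    parts = tuple(int(part) for part in match.group().split("."))
    # Pad to three components so that "40.1" compares equal to (40, 1, 0).
    return parts + (0,) * (3 - len(parts))


def is_compatible(installed: str, minimum: Tuple[int, ...] = MIN_SETUPTOOLS) -> bool:
    """True if the installed version is at least the required minimum."""
    return parse_version(installed) >= minimum


old_ok = is_compatible("39.0.1")
new_ok = is_compatible("46.1.3")
```

In setup.py the installed version would come from setuptools.__version__, and a failed check could raise a clear error message instead of the ImportError above.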

Remove verbose and enable allennlp logging

The minimal logging info is not enough for tracking a training run. Removing the verbose param and enabling allennlp logging will help to visualize training progress.

missing dependency

xlrd is necessary for the excel reader!

Create documentation for installation process

Loss weights parameter in classification heads

Let's bring our old implementation of loss weights into the classification heads.

Text Transform component

Define a common cleaning pipeline that converts HTML content into extracted text.

Fix the usage of a metadata_file

This does not work at the moment. I think the ExamplePreparator needs to be updated.

Support gradual unfreezing for fine-tuning

Add a training configuration parameter that makes it easy to define groups of model layers and to freeze/unfreeze those groups sequentially.
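As an illustration of the scheduling part only, here is a sketch that decides which groups are trainable at a given epoch, unfreezing one more group per epoch from the output side. The function and the group names are hypothetical and not part of any existing configuration.

```python
from typing import Dict, List


def trainable_groups(groups: List[str], epoch: int) -> Dict[str, bool]:
    """Unfreeze one more group per epoch, starting from the last group.

    `groups` is ordered from input to output (e.g. embeddings first,
    head last). At epoch 0 only the last group is trainable.
    """
    num_unfrozen = min(len(groups), epoch + 1)
    first_trainable = len(groups) - num_unfrozen
    return {name: index >= first_trainable for index, name in enumerate(groups)}


layer_groups = ["embeddings", "encoder", "head"]  # hypothetical group names
schedule = [trainable_groups(layer_groups, epoch) for epoch in range(4)]
```

In PyTorch, the resulting flags could be applied at the start of each epoch by setting requires_grad on the parameters of each group.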

Set up a test bench!

Remove old docs folder

Can we safely remove the old docs folder @dcfidalgo ?

Cannot set NaN's as empty string for column path

Lately I get a lot of these warnings when creating a DataSource:

WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path
WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path

Want to have a look at this!

Unify `metadata_file` and `values_mapping`

The functionality of the metadata_file should go into values_mapping (see the ExamplePreparator).
I also want to change the format a bit to get rid of the target definition (see #3).

Define internal howtos documentation

Add some HOWTO guides for the core development team.

Segment sentences for record data

Allow applying the tokenizer's segment_sentences option to record-like input data.

implement a learning rate finder

With our new config.json we lose the ability to use the allennlp find-lr command:

(biome) macbook-pro-de-recognai:1 david$ allennlp find-lr -s find-lr --include-package=biome.text config.json
2020-05-18 16:42:20,680 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,551 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,561 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,956 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,959 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,961 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,963 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 258, in pop
    value = self.params.pop(key)
KeyError: 'dataset_reader'

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/bin/allennlp", line 8, in <module>
    sys.exit(run())

 File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
    main(prog="allennlp")
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
    args.func(args)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 127, in find_learning_rate_from_args
    force=args.force)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 174, in find_learning_rate_model
    all_datasets = datasets_from_params(params)
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/training/util.py", line 158, in datasets_from_params
    dataset_reader_params = params.pop('dataset_reader')
  File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 260, in pop
    raise ConfigurationError("key \"{}\" is required at location \"{}\"".format(key, self.history))
allennlp.common.checks.ConfigurationError: 'key "dataset_reader" is required at location ""'

We should implement either our own lr finder or a wrapper for the allennlp command.
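If we go for our own finder, the core loop is small. The sketch below shows a learning rate range test in plain Python: sweep the learning rate on a log scale, record the loss, and stop once it blows up. All names are hypothetical, and loss_at(lr) stands in for "train one batch with this learning rate and return the loss".

```python
import math
from typing import Callable, List, Tuple


def lr_range_test(
    loss_at: Callable[[float], float],
    start_lr: float = 1e-5,
    end_lr: float = 10.0,
    num_steps: int = 50,
    stop_factor: float = 4.0,
) -> List[Tuple[float, float]]:
    """Sweep learning rates exponentially and record (lr, loss) pairs.

    Stops early once the loss exceeds `stop_factor` times the best loss
    seen so far.
    """
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    history: List[Tuple[float, float]] = []
    best = float("inf")
    lr = start_lr
    for _ in range(num_steps):
        loss = loss_at(lr)
        history.append((lr, loss))
        best = min(best, loss)
        if loss > stop_factor * best:
            break  # loss is diverging; higher rates are not worth trying
        lr *= ratio
    return history


def toy_loss(lr: float) -> float:
    # Toy loss curve (illustration only) with its minimum at lr = 1e-2.
    return (math.log10(lr) + 2) ** 2 + 0.1


history = lr_range_test(toy_loss)
best_lr, best_loss = min(history, key=lambda pair: pair[1])
```

A common rule of thumb is to pick a learning rate somewhat below the one with the lowest loss, where the curve is still falling steeply.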

Unfix pyyaml version

I had a quick look at awscli, and it should also work with the latest pyyaml version.
I think we can remove the version pin in setup.py.

The downside could be that pip will complain about incompatibilities.

Trainer configurations

@frascuchon @dvsrepo I propose to expose these two options in the TrainerConfiguration class:

num_serialized_models_to_keep: 1
should_log_learning_rate: true

Or maybe hardcode them as defaults; I always include them in my training runs.

Allow disabling tokenization for data featurization and hide backbone features

In some cases, datasets and models require that the input text is already tokenized.

The main entry point in the library for data featurization is the backbone.featurize method. We must hide the backbone features dictionary and force API users to go through featurize, without tokenization when the input is already tokenized.

How AllenNLP deals with empty Fields.

This issue just describes how AllenNLP deals with empty Fields, that is Fields (TextField, LabelField) created with empty strings. How we currently deal with empty strings is described in #59 .

An empty string in the LabelField results in an empty string class when creating the vocab for the labels.

An empty TextField passed on directly to the model results in an error:

    Traceback (most recent call last):
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 78, in <module>
    main()
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 71, in main
    args.func(args)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 128, in learn_from_args
    workers=args.workers,
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 195, in learn
    force=True,
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/commands/train.py", line 252, in train_model
    metrics = trainer.train()
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 478, in train
    train_metrics = self._train_epoch(epoch)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
    loss = self.batch_loss(batch_group, for_training=True)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
    output_dict = self.model(**batch)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/sequence_classifier.py", line 41, in forward
    encoded_text = self.forward_tokens(tokens)
  File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/base_model_classifier.py", line 164, in forward_tokens
    tokens, mask=mask, num_wrapping_dims=self._num_wrapping_dims
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
    token_vectors = embedder(*tensors, **forward_params_values)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 35, in forward
    return self._dropout(self._encoder(self._embedding(token_characters), mask))
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
    result = self.forward(*input, **kwargs)
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in forward
    reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in <listcomp>
    reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 67, in _reshape_tensor
    raise RuntimeError(f"No dimension to distribute: {input_size}")
RuntimeError: No dimension to distribute: torch.Size([1, 0])

A completely empty ListField([]) results in the following error:

  File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/data/fields/list_field.py", line 30, in __init__
    str(field_class_set)
AssertionError: ('ListFields must contain a single field type, found set()', 'occurred at index 0')

Empty TextFields get padded in a ListField. For example, if there is a ListField with 3 TextFields (the first not empty, the last two empty) the mask looks like

[[[1, ...], [0], [0]]]

Also, if you specify sorting by "num_fields" in the trainer, ListFields with different lengths get padded. For example, if you have 2 ListFields with 3 and 2 TextFields, the mask for the second TextField looks like

[[[1, ...], [1, ...], [0]]]