argilla-io / biome-text
Custom Natural Language Processing with big and small models
Home Page: https://recognai.github.io/biome-text/
License: Other
Just a reminder issue for me.
Maybe we could use the multiindex feature of DataFrames to "save" the content of a token, instead of duplicating the information in a dict like we do now. Have to check in more detail!
Testing the TextClassifier on GPU, the following error is shown:
/usr/local/lib/python3.6/dist-packages/biome/text/modules/heads/classification/defs.py in process_output(self, output)
90
91 if not isinstance(probs_batch, numpy.ndarray):
---> 92 probs_batch = probs_batch.data.numpy()
93
94 output_map_probs = []
TypeError: can't convert cuda:0 device type tensor to numpy. Use Tensor.cpu() to copy the tensor to host memory first.
There are two alternatives to solve this:
The allennlp logs are back again. The idea is to show verbose logging only during the training process, and only when the verbose flag is enabled.
Problems related to the biome ui command and manual page refresh (F5).
The default batch prediction method defined in allennlp.Predictor does not handle errors in the text-to-instance transformation, so the whole batch fails if a single input cannot be predicted.
We need to tackle those cases to apply batch prediction properly. We have 2 alternatives:
predict_batch_json handling empty instances
allennlp api creating void instances
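A rough, library-agnostic sketch of the first alternative: convert the inputs one by one, remember which ones failed, and run the batch prediction only on the valid instances. The names `safe_predict_batch`, `to_instance` and `predict_instances` are hypothetical stand-ins for the text-to-instance step and the batch forward pass; they are not part of the allennlp API.

```python
from typing import Any, Callable, Dict, List, Optional


def safe_predict_batch(
    inputs: List[Dict[str, Any]],
    to_instance: Callable[[Dict[str, Any]], Any],
    predict_instances: Callable[[List[Any]], List[Any]],
) -> List[Optional[Any]]:
    """Predict a batch, returning None for inputs that cannot be converted.

    `to_instance` and `predict_instances` are hypothetical callables supplied
    by the caller; the output keeps the order and length of `inputs`.
    """
    instances = []
    valid_positions = []
    for position, item in enumerate(inputs):
        try:
            instance = to_instance(item)
        except Exception:
            # A single broken input must not fail the whole batch.
            continue
        if instance is None:
            continue
        instances.append(instance)
        valid_positions.append(position)

    results: List[Optional[Any]] = [None] * len(inputs)
    if instances:
        predictions = predict_instances(instances)
        for position, prediction in zip(valid_positions, predictions):
            results[position] = prediction
    return results
```

Returning None at the failed positions keeps the output aligned with the input, so callers can still join predictions back to their records.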
issues in GitLab:
https://gitlab.com/recognai-team/biome/biome-allennlp/issues
Restoring by default could cause some weird problems when training. We will disable it as the default behaviour and require the user to enable it when needed.
Also, vocabulary will be restored from the training folder if present.
Testing pretraining+fine_tuning (in examples/4.language model/fine_tune classifier), where we use datasource definitions with no mapping but a dataset with text and label.
Training after changing the head from LM to TextClassification fails, as default_mapping in _DataSourceReader keeps the values from the previous head (LM): text: text.
Refactor the metrics calculation to include macro, micro and per-label F-beta metrics.
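As a reference for what the three aggregations mean, here is a minimal pure-Python sketch for single-label classification. The function names `fbeta` and `fbeta_report` are made up for illustration and do not reflect the library's metric classes.

```python
from collections import Counter
from typing import Dict, List


def fbeta(tp: int, fp: int, fn: int, beta: float = 1.0) -> float:
    """F-beta from raw counts; returns 0.0 when the score is undefined."""
    b2 = beta * beta
    denominator = (1 + b2) * tp + b2 * fn + fp
    return (1 + b2) * tp / denominator if denominator else 0.0


def fbeta_report(gold: List[str], predicted: List[str], beta: float = 1.0) -> Dict[str, object]:
    """Per-label, macro and micro F-beta for single-label predictions."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for gold_label, predicted_label in zip(gold, predicted):
        if gold_label == predicted_label:
            tp[gold_label] += 1
        else:
            fp[predicted_label] += 1
            fn[gold_label] += 1

    labels = sorted(set(gold) | set(predicted))
    per_label = {label: fbeta(tp[label], fp[label], fn[label], beta) for label in labels}
    # Macro: unweighted mean of the per-label scores.
    macro = sum(per_label.values()) / len(labels) if labels else 0.0
    # Micro: pool the counts over all labels before computing the score.
    micro = fbeta(sum(tp.values()), sum(fp.values()), sum(fn.values()), beta)
    return {"per_label": per_label, "macro": macro, "micro": micro}
```

Macro averaging treats every label equally, while micro averaging is dominated by frequent labels, which is why reporting both next to the per-label values is useful for imbalanced datasets.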
Passing numpy arrays or other iterable structures to the text_to_instance method results in an error:
File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 186, in get_result
return self.apply_standard()
File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 292, in apply_standard
self.apply_series_generator()
File "/opt/biome/lib/python3.7/site-packages/pandas/core/apply.py", line 321, in apply_series_generator
results[i] = self.f(v)
File "/opt/biome/lib/python3.7/site-packages/biome/text/commands/explore/explore.py", line 154, in <lambda>
lambda x: pipeline.predict_json(x.to_dict()), axis=1, meta=(None, object)
File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 268, in predict_json
instance = self._json_to_instance(inputs)
File "/opt/biome/lib/python3.7/site-packages/biome/text/pipelines/pipeline.py", line 179, in _json_to_instance
return self.reader.text_to_instance(**json_dict)
File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/sequence_classifier_dataset_reader.py", line 35, in text_to_instance
tokens_field = self.build_textfield(tokens)
File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 135, in build_textfield
for value in data
File "/opt/biome/lib/python3.7/site-packages/biome/text/dataset_readers/mixins.py", line 136, in <listcomp>
if value
ValueError: ('The truth value of an array with more than one element is ambiguous. Use a.any() or a.all()', 'occurred at index 539')
INFO:allennlp.models.archival:removing temporary unarchived model dir at /tmp/tmpjyg60096
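The traceback points at a plain truthiness test (`if value`) being applied to an array. One possible way around it, sketched here with hypothetical helpers `to_plain` and `is_empty`, is to convert array-like values to plain Python objects first and to test emptiness by length instead of by truth value. The sketch relies only on the fact that numpy arrays and pandas Series expose a `tolist()` method.

```python
from typing import Any


def to_plain(value: Any) -> Any:
    """Convert array-like values (anything exposing tolist()) to plain Python objects."""
    tolist = getattr(value, "tolist", None)
    return tolist() if callable(tolist) else value


def is_empty(value: Any) -> bool:
    """Emptiness check that never calls bool() on a multi-element array."""
    value = to_plain(value)
    if value is None:
        return True
    if isinstance(value, (str, bytes, list, tuple, dict, set)):
        return len(value) == 0
    # Numbers and other scalars count as present, even when they equal 0.
    return False
```

A filter such as `[to_plain(v) for v in data if not is_empty(v)]` would then accept lists, tuples and arrays alike.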
Enable a trace mode in the pipeline to keep all inputs and prediction results for a given serving model.
I have been testing vocab related params and have a few comments:
vocab_config: Creating a Pipeline with a vocabulary with Pipeline.from_config and pipeline.from_file using VocabularyConfiguration. Useful for filtering unusual words (min_count) and other things.
vocab: in pipeline.train to launch a training experiment with an instantiated pipeline. Useful for reusing an existing vocabulary, avoiding its creation during training. From my tests, this param seems to be ignored. For example, if I create a pipeline without vocab_config, I have an empty vocab within the pipeline. Then, if I want to pass a pre-generated vocab like this:
# vocab creation with:
...
vocab_config=VocabularyConfiguration(sources=[train, validation], min_count={'words': 10}),
..
pl.save_vocab('vocabulary_min_count_10')
pipe = pl.train(
output='experiment_pretraining_preproc',
trainer=TrainerConfiguration(optimizer='adam',num_epochs=5, patience=3),
training='data/tweets-spanish/train.yml',
validation='data/tweets-spanish/validation.yml',
vocab='vocabulary_min_count_10/',
extend_vocab=False
)
The vocab param will be ignored and my train run will use an empty vocab (only with @unknown and other predefined tokens).
extend_vocab: This param has default=False everywhere except in the entry method (train). I am not sure if there's a reason to have it as default=True, but I would vote for having it False in pipeline.train as well. In my tests, this can cause strange behaviour if not handled with care, like increasing the size of an existing model's embeddings and failing only after training, during the pipeline.__class__(pretrained_path=os.path.join(config.output, "model.tar.gz"), config=pipeline.config) step.
On macOS Mojave, the jsonnet install fails:
1 warning generated.
clang: warning: libstdc++ is deprecated; move to libc++ with a minimum deployment target of OS X 10.9 [-Wdeprecated]
ld: library not found for -lstdc++
clang: error: linker command failed with exit code 1 (use -v to see invocation)
error: Setup script exited with error: command 'g++' failed with exit status 1
make: *** [dev] Error 1
A working solution can be found here: google/jsonnet#573 (comment)
It's basically
$ xcode-select --install
$ open /Library/Developer/CommandLineTools/Packages/macOS_SDK_headers_for_macOS_10.14.pkg
The pipeline init method must include a minimal check for the mutually exclusive arguments pretrained_path and config.
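Such a check could look roughly like the sketch below. It assumes that exactly one of the two arguments has to be given, which is an interpretation of "exclusive" and not something confirmed by the code base; `check_exclusive_args` is a hypothetical helper name.

```python
from typing import Any, Optional


def check_exclusive_args(pretrained_path: Optional[str] = None, config: Optional[Any] = None) -> None:
    """Raise a ValueError unless exactly one of the two arguments is provided.

    Assumption: the arguments are mutually exclusive and one of them is required.
    """
    if (pretrained_path is None) == (config is None):
        raise ValueError("Pass exactly one of 'pretrained_path' or 'config'.")
```

Calling it as the first statement of the init method would turn a confusing late failure into an immediate, readable error.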
@frascuchon Can we clean up the branches? We have a lot of them, and a cleanup would keep the git workspace cleaner ...
Create InputFeaturizer objects by passing the words and chars specs directly, instead of configuration dicts.
InputFeaturizer(
words=WordsFeaturesSpecs(embedding_dim=50,lowercase_tokens=True),
chars=CharsFeaturesSpec(embedding_dim=10, encoder={ "type": ....}...)
)
Also, align the spec naming and make them public classes:
_CharacterFeaturesSpec -> CharsFeaturesSpec
_WordFeaturesSpecs -> WordsFeaturesSpecs
Include functionality for cacheable pipeline inference (depending on prediction input)
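One simple way to get input-dependent caching, assuming the prediction inputs are JSON-serializable, is to key an LRU cache on the serialized input. The wrapper below is only a sketch; `cached_predictor` and its `predict` argument are hypothetical names, not pipeline methods.

```python
import json
from functools import lru_cache
from typing import Any, Callable, Dict


def cached_predictor(predict: Callable[[Dict[str, Any]], Any], max_size: int = 1024):
    """Wrap a predict function so that identical inputs are only predicted once."""

    @lru_cache(maxsize=max_size)
    def _predict_from_key(key: str) -> Any:
        return predict(json.loads(key))

    def wrapper(inputs: Dict[str, Any]) -> Any:
        # sort_keys makes the cache key independent of the dict ordering.
        key = json.dumps(inputs, sort_keys=True)
        return _predict_from_key(key)

    # Expose the cache statistics of the inner function.
    wrapper.cache_info = _predict_from_key.cache_info
    return wrapper
```

Note that cached results are shared objects, so callers should not mutate a returned prediction in place.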
We need to review this log message:
INFO:biome.text._model:Serialization directory (fine_tuning_boe_avg_0.001) already exists and is not empty.
It is shown every time pipeline.train is run, even with fresh output directories.
If we want to implement label smoothing (a regularization method used for example in the original transformer paper), this link is helpful:
https://discuss.pytorch.org/t/cross-entropy-with-one-hot-targets/13580/2
I think it is a good method if we are not sure that the gold labels are 100% correct!
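To make the idea concrete, here is a small framework-free sketch of one common variant: the gold class gets 1 - epsilon plus its share of epsilon spread uniformly over all classes, and the loss is the cross entropy against that soft target. The function names and the default epsilon of 0.1 are illustrative choices, not taken from the linked thread.

```python
import math
from typing import List


def smooth_targets(label_index: int, num_classes: int, epsilon: float = 0.1) -> List[float]:
    """Soft target: (1 - epsilon) on the gold class plus epsilon spread uniformly."""
    uniform = epsilon / num_classes
    return [
        (1.0 - epsilon) + uniform if index == label_index else uniform
        for index in range(num_classes)
    ]


def soft_cross_entropy(logits: List[float], targets: List[float]) -> float:
    """Cross entropy between a target distribution and softmax(logits)."""
    max_logit = max(logits)
    # Numerically stable log of the softmax normalizer.
    log_norm = max_logit + math.log(sum(math.exp(x - max_logit) for x in logits))
    return -sum(t * (x - log_norm) for t, x in zip(targets, logits))
```

With epsilon set to 0 this reduces to the usual cross entropy with one-hot targets, so the smoothing strength can be tuned without changing the rest of the training code.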
In order to standardize the datasource configuration format, I suggest a new yaml schema, illustrated in the following example:
# Define the datasource source (implicitly gives the ds format)
source: my/path/to/file.csv
# Extra attributes related to how to read the source
attributes:
  encoding: utf-8
  delimiter: ","
For file-based datasources, we could define the source as a list of paths:
source:
- a/path/to/file.csv
- another/folder/containing/*.csv
This definition also extends to non-file-based sources. An example would be:
source: elasticsearch
attributes:
  es_host: https://elasticsearch.service:9200
  es_index: my-cool-index
  es_type: _docs
  query: { "query" : {"match_all" : {} }}
...
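On the reading side, the proposed schema could be normalized so that the rest of the code always sees a list of sources and a dict of attributes. The helper below is a hypothetical sketch that works on the already parsed yaml content (a plain dict); it does not exist in the library.

```python
from typing import Any, Dict, List


def normalize_datasource(config: Dict[str, Any]) -> Dict[str, Any]:
    """Return the datasource definition with 'source' as a list and 'attributes' as a dict."""
    if "source" not in config:
        raise ValueError("A datasource definition needs a 'source' key.")
    source = config["source"]
    # A single path or source name becomes a one-element list.
    sources: List[str] = [source] if isinstance(source, str) else list(source)
    return {"source": sources, "attributes": dict(config.get("attributes") or {})}
```

For example, `normalize_datasource({"source": "my/path/to/file.csv"})` gives a one-element source list and empty attributes, so single-file and multi-file definitions can share the same code path.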
Pipeline methods related with prediction logging or cache initialization are missing.
Some installations raise a setuptools error related to the find_namespace_packages function:
Obtaining file:///Users/lea/Documents/biome-text
ERROR: Command errored out with exit status 1:
command: /Users/lea/.biome/bin/python3 -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/Users/lea/Documents/biome-text/setup.py'"'"'; __file__='"'"'/Users/leirea/Documents/biome-text/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /private/var/folders/fg/mftly61x1dvbs5dp3p6v5_vh0000gp/T/pip-pip-egg-info-sd6_7k75
cwd: /Users/leirea/Documents/biome-text/
Complete output (5 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/Users/lea/Documents/biome-text/setup.py", line 6, in <module>
from setuptools import setup, find_namespace_packages
ImportError: cannot import name 'find_namespace_packages'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
make: *** [dev] Error 1
The installation should detect the installed setuptools version and check whether it is compatible.
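A possible shape for that check, assuming find_namespace_packages first shipped in setuptools 40.1.0 (worth confirming against the setuptools changelog). The helper names and the version parsing are illustrative only.

```python
import re
from typing import Tuple

# Assumption: first setuptools release that provides find_namespace_packages.
MIN_SETUPTOOLS = (40, 1, 0)


def parse_version(version: str) -> Tuple[int, ...]:
    """Leading numeric components of a version string, padded to three parts."""
    numbers = []
    for part in version.split("."):
        match = re.match(r"\d+", part)
        if not match:
            break
        numbers.append(int(match.group()))
    numbers += [0] * (3 - len(numbers))
    return tuple(numbers)


def supports_namespace_packages(version: str) -> bool:
    """True if the given setuptools version is at least MIN_SETUPTOOLS."""
    return parse_version(version) >= MIN_SETUPTOOLS
```

In setup.py this could be called with `setuptools.__version__` before the import, printing a hint to upgrade setuptools instead of failing with the ImportError above.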
Minimal logging info is not quite enough for tracking training. Removing the verbose param and enabling allennlp logging will help to visualize training progress.
xlrd is necessary for the excel reader!
Let's bring our old implementation of loss weights into the classification heads.
Define a common cleaning pipeline that converts html content into extracted textual data.
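As a starting point, the text extraction step could be done with the standard library alone. The sketch below drops script and style content and collapses whitespace; `html_to_text` is a hypothetical name, and a real cleaning pipeline would probably need more rules (boilerplate removal, handling of block versus inline tags, and so on).

```python
from html.parser import HTMLParser
from typing import List


class _TextExtractor(HTMLParser):
    """Collect the text nodes of a document, skipping script and style content."""

    SKIPPED_TAGS = {"script", "style"}

    def __init__(self) -> None:
        super().__init__(convert_charrefs=True)
        self._chunks: List[str] = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth:
            self._skip_depth -= 1

    def handle_data(self, data):
        if not self._skip_depth:
            self._chunks.append(data)

    def text(self) -> str:
        # Join the chunks and collapse all runs of whitespace.
        return " ".join(" ".join(self._chunks).split())


def html_to_text(html: str) -> str:
    """Return the visible text of an html string."""
    parser = _TextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

Because chunks are joined with a space, a word split by inline markup (for example `wor<b>ld</b>`) comes out as two tokens; that trade-off keeps the sketch short.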
Does not work at the moment; I think the ExamplePreparator needs to be updated.
Bring in a training configuration parameter to easily define groups of model layers and freeze/unfreeze the grouped layers sequentially.
Can we safely remove the old docs folder @dcfidalgo?
Lately I get a lot of these warnings when creating a DataSource:
WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path
WARNING:biome.text.data.datasource:Cannot set NaN's as empty string for column path
Want to have a look at this!
The functionality of the metadata_file should go into values_mapping (see the ExamplePreparator).
I also want to change the format a bit to get rid of the target definition (see #3).
Add some HOWTO guides for the core development team.
Allow applying the tokenizer's segment_sentences option to record-like input data.
With our new config.json we lose the ability to use the allennlp find-lr command:
(biome) macbook-pro-de-recognai:1 david$ allennlp find-lr -s find-lr --include-package=biome.text config.json
2020-05-18 16:42:20,680 - INFO - pytorch_pretrained_bert.modeling - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,551 - INFO - pytorch_transformers.modeling_bert - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,561 - INFO - pytorch_transformers.modeling_xlnet - Better speed can be achieved with apex installed from https://www.github.com/nvidia/apex .
2020-05-18 16:42:21,956 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,959 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,961 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
2020-05-18 16:42:21,963 - INFO - allennlp.common.registrable - instantiating registered subclass relu of <class 'allennlp.nn.activations.Activation'>
Traceback (most recent call last):
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 258, in pop
value = self.params.pop(key)
KeyError: 'dataset_reader'
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "/Users/david/anaconda3/envs/biome/bin/allennlp", line 8, in <module>
sys.exit(run())
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/run.py", line 18, in run
main(prog="allennlp")
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/__init__.py", line 102, in main
args.func(args)
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 127, in find_learning_rate_from_args
force=args.force)
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/commands/find_learning_rate.py", line 174, in find_learning_rate_model
all_datasets = datasets_from_params(params)
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/training/util.py", line 158, in datasets_from_params
dataset_reader_params = params.pop('dataset_reader')
File "/Users/david/anaconda3/envs/biome/lib/python3.7/site-packages/allennlp/common/params.py", line 260, in pop
raise ConfigurationError("key \"{}\" is required at location \"{}\"".format(key, self.history))
allennlp.common.checks.ConfigurationError: 'key "dataset_reader" is required at location ""'
We should implement either our own lr finder or a wrapper for the allennlp command.
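If we go for our own finder, the core of a learning rate range test is small: sweep exponentially spaced learning rates, record the loss at each step, and pick a rate where the loss drops fastest. The sketch below only covers that bookkeeping; the training loop that produces the losses is left out, and the function names and the "steepest drop" heuristic are illustrative assumptions.

```python
from typing import List


def exponential_lrs(start_lr: float, end_lr: float, num_steps: int) -> List[float]:
    """Exponentially spaced learning rates from start_lr to end_lr (inclusive)."""
    if num_steps < 2:
        raise ValueError("num_steps must be at least 2")
    ratio = (end_lr / start_lr) ** (1.0 / (num_steps - 1))
    return [start_lr * ratio ** step for step in range(num_steps)]


def steepest_lr(lrs: List[float], losses: List[float]) -> float:
    """Learning rate at which the loss decreased most towards the next step."""
    if len(lrs) != len(losses) or len(losses) < 2:
        raise ValueError("need matching lists with at least two entries")
    changes = [losses[i + 1] - losses[i] for i in range(len(losses) - 1)]
    best = min(range(len(changes)), key=changes.__getitem__)
    return lrs[best]
```

In practice the recorded losses are noisy, so smoothing them (for example with a running average) before calling `steepest_lr` would likely give a more stable suggestion.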
I had a quick look at awscli, and it should also work with the latest pyyaml version.
I think we can remove the fixture in the setup.py.
The downside could be that pip will complain about incompatibilities.
@frascuchon @dvsrepo I propose to expose these two options in the TrainerConfiguration class:
num_serialized_models_to_keep: 1
should_log_learning_rate: true
or maybe hardcode them as defaults; I always include them in my training runs.
In some cases, datasets and models require the input text to be already tokenized.
The main entry point in the library for data featurization is the backbone.featurize method. We must hide the backbone features dictionary and force API users to use featurize without tokenization.
This issue just describes how AllenNLP deals with empty Fields, that is, Fields (TextField, LabelField) created with empty strings. How we currently deal with empty strings is described in #59.
An empty string in the LabelField results in an empty string class when creating the vocab for the labels.
An empty TextField passed on directly to the model results in an error:
Traceback (most recent call last):
File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 193, in _run_module_as_main
"__main__", mod_spec)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/runpy.py", line 85, in _run_code
exec(code, run_globals)
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 78, in <module>
main()
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/__main__.py", line 71, in main
args.func(args)
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 128, in learn_from_args
workers=args.workers,
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/commands/learn/learn.py", line 195, in learn
force=True,
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/commands/train.py", line 252, in train_model
metrics = trainer.train()
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 478, in train
train_metrics = self._train_epoch(epoch)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 320, in _train_epoch
loss = self.batch_loss(batch_group, for_training=True)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/training/trainer.py", line 261, in batch_loss
output_dict = self.model(**batch)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/sequence_classifier.py", line 41, in forward
encoded_text = self.forward_tokens(tokens)
File "/Users/david/recognai/biome/biome-text/src/biome/allennlp/models/base_model_classifier.py", line 164, in forward_tokens
tokens, mask=mask, num_wrapping_dims=self._num_wrapping_dims
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/text_field_embedders/basic_text_field_embedder.py", line 131, in forward
token_vectors = embedder(*tensors, **forward_params_values)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/token_embedders/token_characters_encoder.py", line 35, in forward
return self._dropout(self._encoder(self._embedding(token_characters), mask))
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/torch/nn/modules/module.py", line 493, in __call__
result = self.forward(*input, **kwargs)
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in forward
reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 34, in <listcomp>
reshaped_inputs = [self._reshape_tensor(input_tensor) for input_tensor in inputs]
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/modules/time_distributed.py", line 67, in _reshape_tensor
raise RuntimeError(f"No dimension to distribute: {input_size}")
RuntimeError: No dimension to distribute: torch.Size([1, 0])
A completely empty ListField([]) results in the following error:
File "/Users/david/anaconda3/envs/biome/lib/python3.6/site-packages/allennlp/data/fields/list_field.py", line 30, in __init__
str(field_class_set)
AssertionError: ('ListFields must contain a single field type, found set()', 'occurred at index 0')
Empty TextFields get padded in a ListField. For example, if there is a ListField with 3 TextFields (the first not empty, the last two empty), the mask looks like
[[[1, ...], [0], [0]]]
Also, if you specify sorting by "num_fields" in the trainer, ListFields with different lengths get padded. For example, if you have 2 ListFields with 3 and 2 TextFields, the mask for the second ListField looks like
[[[1, ...], [1, ...], [0]]]
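The two padding effects can be reproduced with plain lists. The function below is an illustration of the padding logic only, not AllenNLP's implementation: every instance is a list of token lists, short token lists are padded to the longest one, and missing TextFields are filled with all-zero rows.

```python
from typing import List


def padded_mask(batch: List[List[List[str]]]) -> List[List[List[int]]]:
    """Build a 0/1 mask of shape (batch, max_fields, max_tokens)."""
    max_fields = max((len(fields) for fields in batch), default=0)
    max_tokens = max((len(tokens) for fields in batch for tokens in fields), default=0)
    mask = []
    for fields in batch:
        # 1 for real tokens, 0 for token padding.
        rows = [[1] * len(tokens) + [0] * (max_tokens - len(tokens)) for tokens in fields]
        # All-zero rows for TextFields that only exist as padding.
        rows += [[0] * max_tokens for _ in range(max_fields - len(fields))]
        mask.append(rows)
    return mask
```

For instance, `padded_mask([[["a", "b"], [], []]])` gives `[[[1, 1], [0, 0], [0, 0]]]`, which corresponds to the first case above with the rows written out in full.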
The number of trainable parameters shown in the training log (verbose=True) differs slightly from the number in Pipeline.trainable_parameters.
In my example the difference is: 267170 <-> 264226
The file readers (from_csv, from_excel, from_...) should return a Dask DataFrame instead of a Dask Bag.