
xashru / punctuation-restoration


Punctuation Restoration using Transformer Models for High- and Low-Resource Languages

License: MIT License

Languages: Python 99.18%, Shell 0.82%
Topics: bangla, bert, deep-learning, punctuation-marks, punctuation-restoration, pytorch, transformer-models

punctuation-restoration's Issues

CUDA memory allocation error!

RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.17 GiB total capacity; 10.41 GiB already allocated; 5.81 MiB free; 10.54 GiB reserved in total by PyTorch)

How can I set a limit on the memory allocated by CUDA?

I have tried this too: export PYTORCH_CUDA_ALLOC_CONF=max_split_size_mb:128

But it is still failing.

StackTrace:

Traceback (most recent call last):
  File "src/train.py", line 264, in <module>
    train()
  File "src/train.py", line 211, in train
    y_predict = deep_punctuation(x, att)
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/punctuation-restoration/src/model.py", line 28, in forward
    x = self.bert_layer(x, attention_mask=attn_masks)[0]
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/banglaDariComma/lib/python3.7/site-packages/transformers/modeling_bert.py", line 734, in forward
    encoder_attention_mask=encoder_extended_attention_mask,
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/banglaDariComma/lib/python3.7/site-packages/transformers/modeling_bert.py", line 408, in forward
    hidden_states, attention_mask, head_mask[i], encoder_hidden_states, encoder_attention_mask
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/banglaDariComma/lib/python3.7/site-packages/transformers/modeling_bert.py", line 369, in forward
    self_attention_outputs = self.attention(hidden_states, attention_mask, head_mask)
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/banglaDariComma/lib/python3.7/site-packages/transformers/modeling_bert.py", line 315, in forward
    hidden_states, attention_mask, head_mask, encoder_hidden_states, encoder_attention_mask
  File "/root/banglaDariComma/lib/python3.7/site-packages/torch/nn/modules/module.py", line 1102, in _call_impl
    return forward_call(*input, **kwargs)
  File "/root/banglaDariComma/lib/python3.7/site-packages/transformers/modeling_bert.py", line 236, in forward
    attention_scores = attention_scores / math.sqrt(self.attention_head_size)
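
A hedged note, not an answer from the repo author: PYTORCH_CUDA_ALLOC_CONF only changes how the allocator splits cached blocks, so it rarely helps when the model plus activations genuinely exceed the ~11 GiB card. The usual knobs are a smaller batch size or a shorter sequence length when launching src/train.py (check which arguments train.py actually exposes), or freezing the BERT layers if the script supports it. If the goal is specifically to cap how much memory PyTorch may claim, a minimal sketch (PyTorch 1.8+) is:

    import torch

    # Cap this process at roughly 80% of GPU 0's memory; note this only makes the
    # OOM happen earlier and more predictably, it does not reduce actual usage.
    torch.cuda.set_per_process_memory_fraction(0.8, device=0)

    # Release cached (but unused) blocks, e.g. between runs in the same process.
    torch.cuda.empty_cache()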

I have created my own training data and I got this error

2021-04-23 21:22:11.699111: I tensorflow/stream_executor/platform/default/dso_loader.cc:49] Successfully opened dynamic library libcudart.so.11.0
Traceback (most recent call last):
  File "src/train.py", line 41, in <module>
    token_style=token_style, is_train=True, augment_rate=ar, augment_type=aug_type)
  File "/content/drive/My Drive/punctuation-restoration/src/dataset.py", line 77, in __init__
    self.data = parse_data(files, tokenizer, sequence_len, token_style)
  File "/content/drive/My Drive/punctuation-restoration/src/dataset.py", line 45, in parse_data
    y.append(punctuation_dict[punc])
KeyError: '0'

HELP PLEASE.
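
A hedged guess at the cause: parse_data looks each label up in punctuation_dict, whose keys appear to be the letter 'O' (no punctuation) plus COMMA, PERIOD and QUESTION, so KeyError: '0' usually means the custom data uses the digit zero (or numeric codes) as the label. A quick check, assuming the tab-separated "word<TAB>label" format and that label set (the file path below is just a placeholder):

    # Scan a custom data file for labels that parse_data would not recognise.
    VALID_LABELS = {'O', 'COMMA', 'PERIOD', 'QUESTION'}  # assumed label set

    with open('data/train_custom.txt', encoding='utf-8') as f:  # placeholder path
        for line_no, line in enumerate(f, 1):
            parts = line.strip().split('\t')
            if len(parts) == 2 and parts[1] not in VALID_LABELS:
                print(f'line {line_no}: unexpected label {parts[1]!r}')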

ERROR: Failed building wheel for tokenizers

Hello. Thank you very much for the installation instructions and examples. However, I am getting this error. Any idea how to solve it?

I am using Windows 10 and pip


C:\punctuation-restoration>pip install -r requirements.txt
Collecting transformers==v2.11.0
Using cached transformers-2.11.0-py3-none-any.whl (674 kB)
Collecting pytorch-crf
Using cached pytorch_crf-0.7.2-py3-none-any.whl (9.5 kB)
Requirement already satisfied: sentencepiece in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (0.1.97)
Requirement already satisfied: requests in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (2.21.0)
Collecting sacremoses
Using cached sacremoses-0.0.53.tar.gz (880 kB)
Preparing metadata (setup.py) ... done
Requirement already satisfied: packaging in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (21.3)
Requirement already satisfied: filelock in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (3.8.0)
Requirement already satisfied: numpy in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (1.23.3)
Requirement already satisfied: tqdm>=4.27 in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (4.64.1)
Requirement already satisfied: regex!=2019.12.17 in c:\python399\lib\site-packages (from transformers==v2.11.0->-r requirements.txt (line 1)) (2022.9.13)
Collecting tokenizers==0.7.0
Using cached tokenizers-0.7.0.tar.gz (81 kB)
Installing build dependencies ... done
Getting requirements to build wheel ... done
Preparing metadata (pyproject.toml) ... done
Requirement already satisfied: colorama in c:\python399\lib\site-packages (from tqdm>=4.27->transformers==v2.11.0->-r requirements.txt (line 1)) (0.4.5)
Requirement already satisfied: pyparsing!=3.0.5,>=2.0.2 in c:\python399\lib\site-packages (from packaging->transformers==v2.11.0->-r requirements.txt (line 1)) (3.0.9)
Requirement already satisfied: chardet<3.1.0,>=3.0.2 in c:\python399\lib\site-packages (from requests->transformers==v2.11.0->-r requirements.txt (line 1)) (3.0.4)
Requirement already satisfied: urllib3<1.25,>=1.21.1 in c:\python399\lib\site-packages (from requests->transformers==v2.11.0->-r requirements.txt (line 1)) (1.24.3)
Requirement already satisfied: idna<2.9,>=2.5 in c:\python399\lib\site-packages (from requests->transformers==v2.11.0->-r requirements.txt (line 1)) (2.8)
Requirement already satisfied: certifi>=2017.4.17 in c:\python399\lib\site-packages (from requests->transformers==v2.11.0->-r requirements.txt (line 1)) (2022.9.14)
Requirement already satisfied: six in c:\python399\lib\site-packages (from sacremoses->transformers==v2.11.0->-r requirements.txt (line 1)) (1.12.0)
Collecting click
Using cached click-8.1.3-py3-none-any.whl (96 kB)
Collecting joblib
Using cached joblib-1.2.0-py3-none-any.whl (297 kB)
Building wheels for collected packages: tokenizers
Building wheel for tokenizers (pyproject.toml) ... error
error: subprocess-exited-with-error

× Building wheel for tokenizers (pyproject.toml) did not run successfully.
│ exit code: 1
╰─> [46 lines of output]
running bdist_wheel
running build
running build_py
creating build
creating build\lib.win-amd64-cpython-39
creating build\lib.win-amd64-cpython-39\tokenizers
copying tokenizers\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers
creating build\lib.win-amd64-cpython-39\tokenizers\models
copying tokenizers\models\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\models
creating build\lib.win-amd64-cpython-39\tokenizers\decoders
copying tokenizers\decoders\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\decoders
creating build\lib.win-amd64-cpython-39\tokenizers\normalizers
copying tokenizers\normalizers\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\normalizers
creating build\lib.win-amd64-cpython-39\tokenizers\pre_tokenizers
copying tokenizers\pre_tokenizers\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\pre_tokenizers
creating build\lib.win-amd64-cpython-39\tokenizers\processors
copying tokenizers\processors\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\processors
creating build\lib.win-amd64-cpython-39\tokenizers\trainers
copying tokenizers\trainers\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\trainers
creating build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\base_tokenizer.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\bert_wordpiece.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\byte_level_bpe.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\char_level_bpe.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\sentencepiece_bpe.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\implementations\__init__.py -> build\lib.win-amd64-cpython-39\tokenizers\implementations
copying tokenizers\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers
copying tokenizers\models\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\models
copying tokenizers\decoders\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\decoders
copying tokenizers\normalizers\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\normalizers
copying tokenizers\pre_tokenizers\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\pre_tokenizers
copying tokenizers\processors\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\processors
copying tokenizers\trainers\__init__.pyi -> build\lib.win-amd64-cpython-39\tokenizers\trainers
running build_ext
running build_rust
error: can't find Rust compiler

  If you are using an outdated pip version, it is possible a prebuilt wheel is available for this package but pip is not able to install from it. Installing from the wheel would avoid the need for a Rust compiler.

  To update pip, run:

      pip install --upgrade pip

  and then retry package installation.

  If you did intend to build this package from source, try installing a Rust compiler from your system package manager and ensure it is on the PATH during installation. Alternatively, rustup (available at https://rustup.rs) is the recommended way to download and update the Rust compiler toolchain.
  [end of output]

note: This error originates from a subprocess, and is likely not a problem with pip.
ERROR: Failed building wheel for tokenizers
Failed to build tokenizers
ERROR: Could not build wheels for tokenizers, which is required to install pyproject.toml-based projects

What is the 5th index value in the Precision, Recall and F1 lists?

Hi,
I got the below values for one of the tests.

Precision: [0.98068136 0.71008241 0.87272477 0.83011424 0.81440205]
Recall: [0.98558426 0.67705532 0.84840781 0.84434041 0.79418293]
F1 score: [0.9831267 0.69317568 0.86039451 0.83716689 0.80416542]
Accuracy:0.9535017031737457
Confusion Matrix
[[297677 2467 1236 651]
[ 3773 12839 1806 545]
[ 1366 2295 25364 871]
[ 725 480 657 10100]]

On manually calculating, the first four indexes of Precision, Recall and F1 score correspond to the 4 punctuation classes tried. But what does the 5th one refer to? It's definitely not the average of the first 4 representing an "overall" value.
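
A sketch that recomputes the per-class numbers from the confusion matrix above and checks one candidate interpretation of the 5th entry. Treating rows as true labels and columns as predictions in the (assumed) order O, COMMA, PERIOD, QUESTION reproduces the first four precision/recall values; pooling the three punctuation classes (everything except 'O') into a single micro-average gives roughly 0.8144 precision and 0.7942 recall, which matches the 5th entries above, so the 5th value looks like an overall score over the punctuation classes only. The repo's test script is the authority; this is just a check:

    import numpy as np

    # Confusion matrix from the report above: rows = true class, cols = predicted.
    cm = np.array([
        [297677,  2467,  1236,   651],
        [  3773, 12839,  1806,   545],
        [  1366,  2295, 25364,   871],
        [   725,   480,   657, 10100],
    ], dtype=float)

    tp = np.diag(cm)
    precision = tp / cm.sum(axis=0)   # reproduces the first four precision values
    recall = tp / cm.sum(axis=1)      # reproduces the first four recall values
    f1 = 2 * precision * recall / (precision + recall)

    # Candidate for the 5th entry: micro-average over the punctuation classes only.
    tp_punct = tp[1:].sum()
    p_all = tp_punct / cm[:, 1:].sum()
    r_all = tp_punct / cm[1:, :].sum()
    print(p_all, r_all, 2 * p_all * r_all / (p_all + r_all))  # ~0.8144 0.7942 0.8042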

Unable to predict using inference.py file

I tried using the inference.py file to predict a sample text with a sequence length of 100, but I get the following runtime error. Kindly assist me on how this could be solved.

python inference.py --pretrained-model=roberta-large --weight-path=roberta-large-en.pt --language=en --in-file=data/test_en.txt --out-file=data/test_en_out.txt

Traceback (most recent call last):
  File "src/inference.py", line 106, in <module>
    inference()
  File "src/inference.py", line 42, in inference
    deep_punctuation.load_state_dict(torch.load(model_save_path))
  File "C:\Users\SMVP\Anaconda3\envs\my_torch\lib\site-packages\torch\nn\modules\module.py", line 1052, in load_state_dict
    self.__class__.__name__, "\n\t".join(error_msgs)))
RuntimeError: Error(s) in loading state_dict for DeepPunctuation:
        Missing key(s) in state_dict: "bert_layer.embeddings.position_ids".
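
A hedged workaround, not an official fix: the published roberta-large-en.pt weights were presumably saved with the transformers release pinned in requirements.txt (2.11.0), while newer transformers versions register an extra position_ids buffer on the embeddings, so the checkpoint is missing that key. Since position_ids is not a learned weight, loading non-strictly is usually safe; the cleaner fix is installing transformers==2.11.0. A minimal sketch for the load call inside inference.py:

    import torch

    # Load the checkpoint and ignore the buffer key that newer transformers expects.
    state_dict = torch.load(model_save_path, map_location='cpu')
    missing, unexpected = deep_punctuation.load_state_dict(state_dict, strict=False)
    print('missing keys:', missing)        # should only list position_ids-style buffers
    print('unexpected keys:', unexpected)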

FileNotFoundError

While running the test script:
python src/test.py --pretrained-model=roberta-large --lstm-dim=-1 --use-crf=False --data-path=data/test --weight-path=weights/roberta-large-en.pt --sequence-length=256 --save-path=out

I'm getting the following error:

Traceback (most recent call last):
  File "src/test.py", line 33, in <module>
    test_files = os.listdir(args.data_path)
FileNotFoundError: [Errno 2] No such file or directory: 'data/test'

RuntimeError: Error(s) in loading state_dict for DeepPunctuation: Missing key(s) in state_dict: "bert_layer.embeddings.position_ids".

I have tried with my current installation and here is the error:

C:\punctuation-restoration\src>python inference.py --pretrained-model=roberta-large --weight-path=roberta-large-en.pt --language=en --in-file=data/test_en.txt --out-file=data/test_en_out.txt
C:\Python399\lib\site-packages\torchaudio\backend\utils.py:62: UserWarning: No audio backend is available.
warnings.warn("No audio backend is available.")
loading file vocab.json from cache at C:\Users\King/.cache\huggingface\hub\models--roberta-large\snapshots\5069d8a2a32a7df4c69ef9b56348be04152a2341\vocab.json
loading file merges.txt from cache at C:\Users\King/.cache\huggingface\hub\models--roberta-large\snapshots\5069d8a2a32a7df4c69ef9b56348be04152a2341\merges.txt
loading file added_tokens.json from cache at None
loading file special_tokens_map.json from cache at None
loading file tokenizer_config.json from cache at None
loading configuration file config.json from cache at C:\Users\King/.cache\huggingface\hub\models--roberta-large\snapshots\5069d8a2a32a7df4c69ef9b56348be04152a2341\config.json
Model config RobertaConfig {
"_name_or_path": "roberta-large",
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.22.1",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}

loading configuration file config.json from cache at C:\Users\King/.cache\huggingface\hub\models--roberta-large\snapshots\5069d8a2a32a7df4c69ef9b56348be04152a2341\config.json
Model config RobertaConfig {
"architectures": [
"RobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"classifier_dropout": null,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"hidden_size": 1024,
"initializer_range": 0.02,
"intermediate_size": 4096,
"layer_norm_eps": 1e-05,
"max_position_embeddings": 514,
"model_type": "roberta",
"num_attention_heads": 16,
"num_hidden_layers": 24,
"pad_token_id": 1,
"position_embedding_type": "absolute",
"transformers_version": "4.22.1",
"type_vocab_size": 1,
"use_cache": true,
"vocab_size": 50265
}

loading weights file pytorch_model.bin from cache at C:\Users\King/.cache\huggingface\hub\models--roberta-large\snapshots\5069d8a2a32a7df4c69ef9b56348be04152a2341\pytorch_model.bin
Some weights of the model checkpoint at roberta-large were not used when initializing RobertaModel: ['lm_head.dense.weight', 'lm_head.layer_norm.weight', 'lm_head.bias', 'lm_head.dense.bias', 'lm_head.layer_norm.bias', 'lm_head.decoder.weight']

  • This IS expected if you are initializing RobertaModel from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
  • This IS NOT expected if you are initializing RobertaModel from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).
All the weights of RobertaModel were initialized from the model checkpoint at roberta-large.
If your task is similar to the task the model of the checkpoint was trained on, you can already use RobertaModel for predictions without further training.

Traceback (most recent call last):
  File "C:\punctuation-restoration\src\inference.py", line 105, in <module>
    inference()
  File "C:\punctuation-restoration\src\inference.py", line 41, in inference
    deep_punctuation.load_state_dict(torch.load(model_save_path))
  File "C:\Python399\lib\site-packages\torch\nn\modules\module.py", line 1604, in load_state_dict
    raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
RuntimeError: Error(s) in loading state_dict for DeepPunctuation:
        Missing key(s) in state_dict: "bert_layer.embeddings.position_ids".

C:\punctuation-restoration\src>

Wrong shape for input_ids (shape torch.Size([1, 256])) or attention_mask (shape torch.Size([256]))

I am trying to use the pretrained RoBERTa-large model for English, but I get the following error:

Traceback (most recent call last):
  File "/usr/local/lib/python3.6/site-packages/uvicorn/protocols/http/h11_impl.py", line 396, in run_asgi
    result = await app(self.scope, self.receive, self.send)
  File "/usr/local/lib/python3.6/site-packages/uvicorn/middleware/proxy_headers.py", line 45, in __call__
    return await self.app(scope, receive, send)
  File "/usr/local/lib/python3.6/site-packages/fastapi/applications.py", line 201, in __call__
    await super().__call__(scope, receive, send)  # pragma: no cover
  File "/usr/local/lib/python3.6/site-packages/starlette/applications.py", line 111, in __call__
    await self.middleware_stack(scope, receive, send)
  File "/usr/local/lib/python3.6/site-packages/starlette/middleware/errors.py", line 181, in __call__
    raise exc from None
  File "/usr/local/lib/python3.6/site-packages/starlette/middleware/errors.py", line 159, in __call__
    await self.app(scope, receive, _send)
  File "/usr/local/lib/python3.6/site-packages/starlette/exceptions.py", line 82, in __call__
    raise exc from None
  File "/usr/local/lib/python3.6/site-packages/starlette/exceptions.py", line 71, in __call__
    await self.app(scope, receive, sender)
  File "/usr/local/lib/python3.6/site-packages/starlette/routing.py", line 566, in __call__
    await route.handle(scope, receive, send)
  File "/usr/local/lib/python3.6/site-packages/starlette/routing.py", line 227, in handle
    await self.app(scope, receive, send)
  File "/usr/local/lib/python3.6/site-packages/starlette/routing.py", line 41, in app
    response = await func(request)
  File "/usr/local/lib/python3.6/site-packages/fastapi/routing.py", line 202, in app
    dependant=dependant, values=values, is_coroutine=is_coroutine
  File "/usr/local/lib/python3.6/site-packages/fastapi/routing.py", line 148, in run_endpoint_function
    return await dependant.call(**values)
  File "./src/main.py", line 58, in predict
    p = inference(deep_punctuation, use_crf, sequence_length, tokenizer, token_idx, model_save_path, device, text)
  File "./src/punctuation/src/inference.py", line 49, in inference
    y_predict = deep_punctuation(x, attn_mask)
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "./src/punctuation/src/model.py", line 28, in forward
    x = self.bert_layer(x, attention_mask=attn_masks)[0]
  File "/usr/local/lib/python3.6/site-packages/torch/nn/modules/module.py", line 889, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python3.6/site-packages/transformers/modeling_bert.py", line 706, in forward
    extended_attention_mask: torch.Tensor = self.get_extended_attention_mask(attention_mask, input_shape, device)
  File "/usr/local/lib/python3.6/site-packages/transformers/modeling_utils.py", line 204, in get_extended_attention_mask
    input_shape, attention_mask.shape
ValueError: Wrong shape for input_ids (shape torch.Size([1, 256])) or attention_mask (shape torch.Size([256]))

Python 3.6, torch.device("cpu"), and requirements.txt:

python-multipart==0.0.5
-f https://download.pytorch.org/whl/torch_stable.html
torch==1.8.1+cpu
torchaudio==0.8.1
transformers==v2.11.0
pytorch-crf==0.7.2
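
A hedged reading of the error: input_ids has a batch dimension ([1, 256]) but the attention mask does not ([256]), so get_extended_attention_mask rejects the pair. Whatever builds the tensors before calling deep_punctuation needs to give the mask the same leading batch dimension as the input ids. A minimal sketch, with token_ids and mask as placeholder Python lists:

    import torch

    x = torch.tensor(token_ids).view(1, -1)       # shape [1, seq_len]
    attn_mask = torch.tensor(mask).view(1, -1)    # shape [1, seq_len], not [seq_len]
    # equivalently, if the mask is already a 1-D tensor:
    # attn_mask = attn_mask.unsqueeze(0)

    y_predict = deep_punctuation(x, attn_mask)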

What if I want to load a pre-trained model from local disk?

Hello,

I would like to run this model on a local GPU server with no internet access.
After downloading one of the Hugging Face PyTorch models on another machine and moving it onto the local disk, I do not know how to change the path information. Please give me a clue.

Thanks,
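
A hedged sketch of the standard Hugging Face offline workflow, not something the repo documents: save the model and tokenizer to a directory on a machine with internet access, copy that directory to the server, and pass the local path wherever the code currently passes the model name (e.g. 'roberta-large'). The directory /models/roberta-large below is just an example.

    from transformers import AutoModel, AutoTokenizer

    # On a machine with internet access: download once and save to a directory.
    AutoTokenizer.from_pretrained('roberta-large').save_pretrained('/models/roberta-large')
    AutoModel.from_pretrained('roberta-large').save_pretrained('/models/roberta-large')

    # On the offline GPU server: point from_pretrained at the copied directory.
    tokenizer = AutoTokenizer.from_pretrained('/models/roberta-large')
    bert_layer = AutoModel.from_pretrained('/models/roberta-large')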
