The lightning-text-classification from ricardorei

CUDA out of memory

I'm getting a CUDA out of memory error even with --batch_size 1, but the error is raised only after the first epoch. Any ideas or advice on how to solve this?

Error during training: Can't pickle local object

Hi,

I really like how you combined PyTorch Lightning with the Transformers library. I tried to upgrade all dependencies to their latest versions. There were some smaller issues, such as renamed keyword arguments in PyTorch Lightning. Eventually, I ran into the following error message.

Error:
AttributeError: Can't pickle local object 'LayerSummary._register_hook.<locals>.hook'

Environment:
python = "^3.7"
torch = "^1.5.1"
torchvision = "^0.6.1"
transformers = "^2.11.0"
pytorch-lightning = "^0.8.1"
pytorch-nlp = "^0.5.0"
test-tube = "^0.7.5"
pandas = "^1.0.5"
sklearn = "^0.0"

Do you have any idea what the error might be? Is it a Lightning problem or a transformers problem? Any way to fix it?

Best
Dominique
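
The message itself is standard Python pickle behaviour: a function defined inside another function cannot be pickled, because pickle stores functions by importable name. The sketch below reproduces that with the standard library only. The function names are hypothetical stand-ins, not Lightning's code, and what triggers the pickling in this training run (for example a process-spawning backend) is an assumption the thread does not confirm.

```python
import pickle


def register_hook():
    # `hook` is defined inside another function, so its qualified name
    # contains '<locals>' and pickle cannot look it up by name.
    def hook(module, inputs, outputs):
        return outputs

    return hook


def module_level_hook(module, inputs, outputs):
    # Defined at module level, so pickle can store a reference by name.
    return outputs


def try_pickle(obj):
    """Return None on success, or the error message if pickling fails."""
    try:
        pickle.dumps(obj)
        return None
    except (AttributeError, pickle.PicklingError) as exc:
        return str(exc)


local_error = try_pickle(register_hook())
module_error = try_pickle(module_level_hook)
print("local hook:", local_error)
print("module-level hook:", module_error)
```

If the hook comes from the model summary, as the class name LayerSummary suggests, disabling that summary or avoiding a backend that pickles the model might sidestep the error, but that is untested here.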

Multi GPU half precision training enhancement request

Currently an error is thrown when using multiple GPUs with 16-bit precision.

To set this up, the following hparams are added to the trainer:

    parser.add_argument("--gpus", type=str, default='0,1,2', help="Which gpus")
    parser.add_argument('--precision', default=16, type=int)

    trainer = Trainer(
        ...
        precision=hparams.precision,
    )

Error trace:

  File "/path_to/lightning-text-classification/classifier.py", line 183, in forward
    word_embeddings = self.bert(tokens, mask)[0]
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/transformers/modeling_bert.py", line 824, in forward
    embedding_output = self.embeddings(
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/transformers/modeling_bert.py", line 207, in forward
    inputs_embeds = self.word_embeddings(input_ids)
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/torch/nn/modules/module.py", line 722, in _call_impl
    result = self.forward(*input, **kwargs)
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/torch/nn/modules/sparse.py", line 124, in forward
    return F.embedding(
  File "/home/me/.virtualenvs/sbert-env/lib/python3.8/site-packages/torch/nn/functional.py", line 1814, in embedding
    return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
RuntimeError: arguments are located on different GPUs at /pytorch/aten/src/THC/generic/THCTensorIndex.cu:403

Checking this at the call word_embeddings = self.bert(tokens, mask)[0], both tokens and mask are on the same GPU. It appears to be an issue with the embeddings, but I haven't yet been able to isolate exactly where.

Running with multiple GPUs and 32-bit precision works fine, as does 1 GPU with 16-bit precision. The error occurs with both the dp and ddp distributed modes.
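
Since tokens and mask share a device, one way to narrow this down is to compare the device of every parameter against the device of the inputs. The helpers below are a rough sketch, not part of the repository: they rely only on each parameter exposing a `.device` attribute, as PyTorch tensors do, and the demo uses made-up stand-in objects and parameter names instead of a real model.

```python
from collections import defaultdict
from types import SimpleNamespace


def params_by_device(named_params):
    """Group parameter names by the string form of their `.device`."""
    groups = defaultdict(list)
    for name, param in named_params:
        groups[str(param.device)].append(name)
    return dict(groups)


def off_device_params(named_params, expected_device):
    """List parameter names whose device differs from `expected_device`."""
    expected = str(expected_device)
    return [
        name for name, param in named_params
        if str(param.device) != expected
    ]


# Stand-ins for (name, tensor) pairs; names and devices are illustrative only.
fake_params = [
    ("embeddings.word_embeddings.weight", SimpleNamespace(device="cuda:0")),
    ("encoder.layer.0.weight", SimpleNamespace(device="cuda:1")),
]
mismatched = off_device_params(fake_params, "cuda:1")
grouped = params_by_device(fake_params)
print(mismatched)
print(grouped)
```

Inside forward, a call such as off_device_params(list(self.bert.named_parameters()), tokens.device) would list any weights that are not on the same GPU as the inputs.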
