Comments (5)
Yep, that'd be a lot faster.
You can run the following code once:
from torchtext import data
from torchtext import datasets
import json

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# vars(example) gives a dict with 'text' (token list) and 'label' keys
train_examples = [vars(t) for t in train_data]
test_examples = [vars(t) for t in test_data]

# Write one JSON object per line (JSON Lines format)
with open('.data/train.json', 'w+') as f:
    for example in train_examples:
        json.dump(example, f)
        f.write('\n')

with open('.data/test.json', 'w+') as f:
    for example in test_examples:
        json.dump(example, f)
        f.write('\n')
This will use spaCy to tokenize all of the examples and then save each one in JSON Lines format. Inside the tutorials we can then replace loading the IMDB dataset with:
TEXT = data.Field()
LABEL = data.LabelField()
fields = {'text': ('text', TEXT), 'label': ('label', LABEL)}
train_data, test_data = data.TabularDataset.splits(
    path = '.data',
    train = 'train.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)
As the dataset has already been tokenized this should be very quick (~2 seconds on my machine, compared to the original which took ~5 minutes).
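To make the JSON Lines format concrete, here is a minimal standard-library sketch (the tiny `examples` dicts are made up for illustration, but they have the same `text`/`label` shape that `vars(example)` produces above): each line of the file is one independent JSON object.

```python
import json

# Hypothetical mini-examples in the same shape as vars(example) above
examples = [
    {'text': ['a', 'great', 'film'], 'label': 'pos'},
    {'text': ['a', 'dull', 'film'], 'label': 'neg'},
]

# Write one JSON object per line (JSON Lines format)
with open('sample.json', 'w') as f:
    for example in examples:
        json.dump(example, f)
        f.write('\n')

# Read it back line by line
with open('sample.json') as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]['label'])  # pos
```

Because every line parses on its own, `TabularDataset` can stream the file without loading it all at once.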
from pytorch-sentiment-analysis.
Questions are always welcome!
I believe the issue is due to spaCy being very slow. You can change the tokenization from the spaCy tokenizer to a very basic custom tokenizer, which will be a lot faster, though I believe you will get worse results. Custom tokenizers have to be functions that take in a string and return a list of strings (the tokenized example).
For example, we can change:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
to:
def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype=torch.float)
This tokenizer simply splits the string on spaces. It won't be as good as the spaCy one, but will be a lot faster!
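To illustrate (a minimal sketch, with a made-up input sentence): the whitespace tokenizer keeps punctuation attached to the preceding word, which is part of why it underperforms spaCy's tokenizer.

```python
def tokenize(s):
    # Split on single spaces only; no punctuation handling
    return s.split(' ')

print(tokenize('This movie was great!'))
# ['This', 'movie', 'was', 'great!']  -- note 'great!' stays one token
```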
I see.
So it would be better for me to tokenize offline and save the data, then load it with the simple split(' ') tokenizer. Since most of the experiments are about developing the neural network itself, preprocessing the data separately is a good way of splitting up the whole process.
I really appreciate your kind answer.
Wow, this is great. I was simply trying to run spaCy separately to save the result, but had no idea of using the JSON format through TabularDataset.
Thanks a lot.
in
7 start_time = time.time()
8
----> 9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
10 # valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
11
in train(model, iterator, optimizer, criterion)
12 predictions = model(batch.text).squeeze(1)
13
---> 14 loss = criterion(predictions, batch.label)
15
16 acc = binary_accuracy(predictions, batch.label)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
599 self.weight,
600 pos_weight=self.pos_weight,
--> 601 reduction=self.reduction)
602
603
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2122
2123 if not (target.size() == input.size()):
-> 2124 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2125
2126 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([64])) must be the same as input size (torch.Size([970]))
Why do I get this error?
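For context on the error message itself: `binary_cross_entropy_with_logits` refuses predictions and targets of different shapes, since each prediction must pair with exactly one label. A pure-Python sketch of the same check (a simplified, numerically naive stand-in for the PyTorch function, written here only to illustrate the size requirement):

```python
import math

def bce_with_logits(predictions, targets):
    # Mirrors the size check in binary_cross_entropy_with_logits:
    # both arguments must have exactly the same length.
    if len(predictions) != len(targets):
        raise ValueError(
            f"Target size ({len(targets)}) must be the same "
            f"as input size ({len(predictions)})"
        )
    # Naive binary cross-entropy on raw logits
    total = 0.0
    for logit, target in zip(predictions, targets):
        p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(predictions)

try:
    bce_with_logits([0.0] * 970, [1.0] * 64)  # mismatched sizes
except ValueError as e:
    print(e)
```

So the traceback says the model produced 970 predictions for a batch of 64 labels; the model's output shape has to be fixed so that it emits one logit per example in the batch.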