
Comments (5)

bentrevett commented on May 9, 2024

Yep, that'd be a lot faster.

You can run the following code once:

from torchtext import data
from torchtext import datasets
import json

# tokenize once with spaCy - this is the slow step
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()
train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# each example becomes a dict with 'text' (a list of tokens) and 'label' keys
train_examples = [vars(t) for t in train_data]
test_examples = [vars(t) for t in test_data]

# write one JSON object per line (JSON Lines format)
with open('.data/train.json', 'w+') as f:
    for example in train_examples:
        json.dump(example, f)
        f.write('\n')

with open('.data/test.json', 'w+') as f:
    for example in test_examples:
        json.dump(example, f)
        f.write('\n')

This uses spaCy to tokenize all of the examples and then saves each one in the JSON Lines format. Inside the tutorials, we can then replace loading the IMDB dataset with:

TEXT = data.Field()
LABEL = data.LabelField()

# map each JSON key to an (attribute name, Field) pair; since the stored
# 'text' values are already lists of tokens, no tokenizer is applied
fields = {'text': ('text', TEXT), 'label': ('label', LABEL)}

train_data, test_data = data.TabularDataset.splits(
    path = '.data',
    train = 'train.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)

As the dataset has already been tokenized, this should be very quick (~2 seconds on my machine, compared to the original, which took ~5 minutes).
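Each line of the saved files is a standalone JSON object along the lines of {"text": ["This", "film", ...], "label": "pos"}, which is why the reload needs no tokenization. For completeness, here is a minimal sketch (assuming the same legacy torchtext API and variables as above) of the steps that follow the reload in the tutorials, building the vocabularies and iterators:

import torch

# build vocabularies from the reloaded (already tokenized) training set
TEXT.build_vocab(train_data, max_size=25_000)
LABEL.build_vocab(train_data)

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# bucket examples of similar length together to minimise padding
train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=64,
    sort_key=lambda ex: len(ex.text),
    device=device)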

from pytorch-sentiment-analysis.

bentrevett commented on May 9, 2024

Questions are always welcome!

I believe the issue is due to spaCy being very, very slow. You can change the tokenization of the sentences from the spaCy tokenizer to a very basic custom tokenizer, which will be a lot faster, but I believe you will get worse results. A custom tokenizer has to be a function that takes in a string and returns a list of strings (the tokenized example).

For example, we can change:

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)

to:

def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype=torch.float)

This tokenizer simply splits the string on spaces. It won't be as good as the spaCy one, but will be a lot faster!
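If you want a middle ground, a small regex tokenizer (a hypothetical sketch, not from the tutorials) splits punctuation off from words while remaining far faster than spaCy:

import re
import torch
from torchtext import data

def tokenize(s):
    # runs of word characters become tokens; each punctuation mark is its own token
    return re.findall(r"\w+|[^\w\s]", s)

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype=torch.float)

For example, tokenize("It's great!") returns ['It', "'", 's', 'great', '!'] instead of ["It's", 'great!'].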

from pytorch-sentiment-analysis.

yongduek commented on May 9, 2024

I see.
So it would be better for me to tokenize offline and save the data, then load it with the simple split(' ') tokenizer. Since most of the experimentation is on developing the neural network, preprocessing the data separately is a good way to split up the whole process.

I really appreciate your kind answer.

from pytorch-sentiment-analysis.

yongduek commented on May 9, 2024

Wow, this is great. I was simply going to run spaCy separately to save the result, but had no idea about using the JSON format through TabularDataset.
Thanks a lot.

from pytorch-sentiment-analysis.

wildannajah commented on May 9, 2024

in
      7 start_time = time.time()
      8
----> 9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
     10 # valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
     11

in train(model, iterator, optimizer, criterion)
     12     predictions = model(batch.text).squeeze(1)
     13
---> 14     loss = criterion(predictions, batch.label)
     15
     16     acc = binary_accuracy(predictions, batch.label)

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
    530         result = self._slow_forward(*input, **kwargs)
    531     else:
--> 532         result = self.forward(*input, **kwargs)
    533     for hook in self._forward_hooks.values():
    534         hook_result = hook(self, input, result)

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
    599             self.weight,
    600             pos_weight=self.pos_weight,
--> 601             reduction=self.reduction)
    602
    603

C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
   2122
   2123     if not (target.size() == input.size()):
-> 2124         raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
   2125
   2126     return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)

ValueError: Target size (torch.Size([64])) must be the same as input size (torch.Size([970]))

Why do I get this error?
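For context, this error comes from nn.BCEWithLogitsLoss, which requires the predictions and the targets to have identical shapes. A minimal sketch (with standalone hypothetical shapes taken from the traceback above) that reproduces it:

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()
predictions = torch.randn(970)  # e.g. one logit per token instead of one per example
labels = torch.rand(64)         # one label per example in the batch of 64

# raises: ValueError: Target size (torch.Size([64])) must be the same as
# input size (torch.Size([970]))
loss = criterion(predictions, labels)

A mismatch like this usually means the model's output has the wrong shape, often because the [sentence length] and [batch size] dimensions were swapped somewhere before the squeeze(1).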

from pytorch-sentiment-analysis.
