Comments (5)
Yep, that'd be a lot faster.
You can run the following code once:
from torchtext import data
from torchtext import datasets
import json

TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField()

train_data, test_data = datasets.IMDB.splits(TEXT, LABEL)

# vars(example) gives a dict with 'text' (token list) and 'label' keys
train_examples = [vars(t) for t in train_data]
test_examples = [vars(t) for t in test_data]

# Write one JSON object per line (JSON Lines format)
with open('.data/train.json', 'w+') as f:
    for example in train_examples:
        json.dump(example, f)
        f.write('\n')

with open('.data/test.json', 'w+') as f:
    for example in test_examples:
        json.dump(example, f)
        f.write('\n')
This will use spaCy to tokenize all of the examples and then save each one in JSON Lines format. Inside the tutorials we can then replace loading the IMDB dataset with:
TEXT = data.Field()
LABEL = data.LabelField()
fields = {'text': ('text', TEXT), 'label': ('label', LABEL)}
train_data, test_data = data.TabularDataset.splits(
    path = '.data',
    train = 'train.json',
    test = 'test.json',
    format = 'json',
    fields = fields
)
As the dataset has already been tokenized this should be very quick (~2 seconds on my machine, compared to the original which took ~5 minutes).
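To make the JSON Lines format concrete, here is a minimal standard-library sketch (the tiny `examples` dicts are made up for illustration, but they have the same `text`/`label` shape that `vars(example)` produces above): each line of the file is one independent JSON object.

```python
import json

# Hypothetical mini-examples in the same shape as vars(example) above
examples = [
    {'text': ['a', 'great', 'film'], 'label': 'pos'},
    {'text': ['a', 'dull', 'film'], 'label': 'neg'},
]

# Write one JSON object per line (JSON Lines format)
with open('sample.json', 'w') as f:
    for example in examples:
        json.dump(example, f)
        f.write('\n')

# Read it back line by line
with open('sample.json') as f:
    loaded = [json.loads(line) for line in f]

print(loaded[0]['label'])  # pos
```

Because every line parses on its own, `TabularDataset` can stream the file without loading it all at once.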
from pytorch-sentiment-analysis.
Questions are always welcome!
I believe the issue is due to spaCy being very slow. You can change the tokenization from the spaCy tokenizer to a very basic custom tokenizer, which will be a lot faster, though I believe you will get worse results. Custom tokenizers have to be functions that take in a string and return a list of strings (the tokenized example).
For example, we can change:
TEXT = data.Field(tokenize='spacy')
LABEL = data.LabelField(dtype=torch.float)
to:
def tokenize(s):
    return s.split(' ')

TEXT = data.Field(tokenize=tokenize)
LABEL = data.LabelField(dtype=torch.float)
This tokenizer simply splits the string on spaces. It won't be as good as the spaCy one, but will be a lot faster!
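To illustrate (a minimal sketch, with a made-up input sentence): the whitespace tokenizer keeps punctuation attached to the preceding word, which is part of why it underperforms spaCy's tokenizer.

```python
def tokenize(s):
    # Split on single spaces only; no punctuation handling
    return s.split(' ')

print(tokenize('This movie was great!'))
# ['This', 'movie', 'was', 'great!']  -- note 'great!' stays one token
```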
I see.
So it would be better for me to tokenize offline and save the data, then load it with the simple split(' ') tokenizer. Since most of the experiments are about developing the neural network itself, preprocessing the data separately is a good way of splitting up the whole process.
I really appreciate your kind answer.
Wow, this is great. I was simply trying to run spaCy separately to save the result, but had no idea of using the JSON format through TabularDataset.
Thanks a lot.
in
7 start_time = time.time()
8
----> 9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
10 # valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
11
in train(model, iterator, optimizer, criterion)
12 predictions = model(batch.text).squeeze(1)
13
---> 14 loss = criterion(predictions, batch.label)
15
16 acc = binary_accuracy(predictions, batch.label)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\module.py in __call__(self, *input, **kwargs)
530 result = self._slow_forward(*input, **kwargs)
531 else:
--> 532 result = self.forward(*input, **kwargs)
533 for hook in self._forward_hooks.values():
534 hook_result = hook(self, input, result)
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\modules\loss.py in forward(self, input, target)
599 self.weight,
600 pos_weight=self.pos_weight,
--> 601 reduction=self.reduction)
602
603
C:\ProgramData\Anaconda3\lib\site-packages\torch\nn\functional.py in binary_cross_entropy_with_logits(input, target, weight, size_average, reduce, reduction, pos_weight)
2122
2123 if not (target.size() == input.size()):
-> 2124 raise ValueError("Target size ({}) must be the same as input size ({})".format(target.size(), input.size()))
2125
2126 return torch.binary_cross_entropy_with_logits(input, target, weight, pos_weight, reduction_enum)
ValueError: Target size (torch.Size([64])) must be the same as input size (torch.Size([970]))
Why do I get this error?
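For context on the error message itself: `binary_cross_entropy_with_logits` refuses predictions and targets of different shapes, since each prediction must pair with exactly one label. A pure-Python sketch of the same check (a simplified, numerically naive stand-in for the PyTorch function, written here only to illustrate the size requirement):

```python
import math

def bce_with_logits(predictions, targets):
    # Mirrors the size check in binary_cross_entropy_with_logits:
    # both arguments must have exactly the same length.
    if len(predictions) != len(targets):
        raise ValueError(
            f"Target size ({len(targets)}) must be the same "
            f"as input size ({len(predictions)})"
        )
    # Naive binary cross-entropy on raw logits
    total = 0.0
    for logit, target in zip(predictions, targets):
        p = 1.0 / (1.0 + math.exp(-logit))  # sigmoid
        total += -(target * math.log(p) + (1 - target) * math.log(1 - p))
    return total / len(predictions)

try:
    bce_with_logits([0.0] * 970, [1.0] * 64)  # mismatched sizes
except ValueError as e:
    print(e)
```

So the traceback says the model produced 970 predictions for a batch of 64 labels; the model's output shape has to be fixed so that it emits one logit per example in the batch.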