bentrevett / pytorch-sentiment-analysis


Tutorials on getting started with PyTorch and TorchText for sentiment analysis.

License: MIT License

Jupyter Notebook 100.00%
Topics: pytorch, sentiment-analysis, tutorial, rnn, lstm, fasttext, torchtext, sentiment-classification, cnn, cnn-text-classification

pytorch-sentiment-analysis's Introduction

PyTorch Sentiment Analysis

This repo contains tutorials covering understanding and implementing sequence classification models using PyTorch, with Python 3.9. Specifically, we'll train models to predict sentiment from movie reviews.

If you find any mistakes or disagree with any of the explanations, please do not hesitate to submit an issue. I welcome any feedback, positive or negative!

Getting Started

Install the required dependencies with: pip install -r requirements.txt --upgrade.

Tutorials

  • 1 - Neural Bag of Words Open In Colab

    This tutorial covers the workflow of a sequence classification project with PyTorch. We'll cover the basics of sequence classification using a simple, but effective, neural bag-of-words model, and how to use the datasets/torchtext libraries to simplify data loading/preprocessing.

  • 2 - Recurrent Neural Networks Open In Colab

    Now that we have the basic sequence classification workflow covered, this tutorial will focus on improving our results by switching to a recurrent neural network (RNN) model. We'll cover the theory behind RNNs and look at an implementation of the long short-term memory (LSTM) RNN, one of the most common RNN variants.

  • 3 - Convolutional Neural Networks Open In Colab

    Next, we'll cover convolutional neural networks (CNNs) for sentiment analysis. This model will be an implementation of Convolutional Neural Networks for Sentence Classification.

  • 4 - Transformers Open In Colab

    Finally, we'll show how to use the transformers library to load a pre-trained transformer model, specifically the BERT model from BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding, and use it for sequence classification.
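A minimal sketch, assuming the Hugging Face transformers library, of loading a pre-trained BERT model with a sequence classification head as described above (the notebook's exact model and training loop may differ):

import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

inputs = tokenizer("This film was great!", return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits  # [1, 2], one logit per sentiment class
print(logits.softmax(dim=-1))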

Legacy Tutorials

Previous versions of these tutorials used features from the torchtext library which are no longer available. These are stored in the legacy directory.

References

Here are some things I looked at while making these tutorials. Some of it may be out of date.

pytorch-sentiment-analysis's People

Contributors

bentrevett, spiderpig86, zdepablo


pytorch-sentiment-analysis's Issues

3-fasttext, no bigrams in function predict_sentiment

Update predict_sentiment function at the end to include bigrams.

In addition, since most bigrams do not have pretrained word embeddings, they are initialized as all zeros, which is bad practice. Try initializing them randomly instead (e.g. with a Xavier initializer).
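A hedged sketch of both suggestions, using the legacy torchtext API and the notebook's names (generate_bigrams, TEXT, model, and device are assumed to be defined as in the tutorial):

import torch
import spacy

nlp = spacy.load('en_core_web_sm')

def predict_sentiment(model, sentence):
    model.eval()
    tokens = [tok.text for tok in nlp.tokenizer(sentence)]
    tokens = generate_bigrams(tokens)  # apply the same bigram preprocessing used in training
    indexed = [TEXT.vocab.stoi[t] for t in tokens]
    tensor = torch.LongTensor(indexed).unsqueeze(1).to(device)
    prediction = torch.sigmoid(model(tensor))
    return prediction.item()

# re-initialize all-zero embedding rows (e.g. bigrams without pretrained vectors) randomly
with torch.no_grad():
    weights = model.embedding.weight
    zero_rows = weights.abs().sum(dim=1) == 0
    new_rows = torch.empty(int(zero_rows.sum()), weights.shape[1], device=weights.device)
    torch.nn.init.xavier_uniform_(new_rows)
    weights[zero_rows] = new_rows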

A - Using TorchText with Your Own Datasets

Hi, I tried to use my own dataset and was getting an error. Suppose my code is the following:

from torchtext import data
from torchtext import datasets
import torch

NAME = data.Field()
SAYING = data.Field()
PLACE = data.Field()

fields = {'name': ('n', NAME), 'location': ('p', PLACE), 'quote': ('s', SAYING)}

train_data, test_data = data.TabularDataset.splits(
                            path = 'sample_data',
                            train = 'sample_data_train.json',
                            test = 'sample_data_test.json',
                            format = 'json',
                            fields = fields
)

BATCH_SIZE = 64
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')


NAME.build_vocab(train_data)
SAYING.build_vocab(train_data)
PLACE.build_vocab(train_data)

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data), 
    batch_size=BATCH_SIZE,
    device=device)


for batch in train_iterator:
	print(batch)
for batch in test_iterator:
	print(batch)

Where both sample_data_train.json and sample_data_test.json refer to json file:

{"name": "John", "location": "United Kingdom", "age": 42, "quote": ["i", "love", "the", "united kingdom"]}
{"name": "Mary", "location": "United States", "age": 36, "quote": ["i", "want", "more", "telescopes"]}

So this is the basic code which was supposed to run, but I get the following error:

[torchtext.data.batch.Batch of size 2]
	[.p]:[torch.LongTensor of size 2x2]
	[.n]:[torch.LongTensor of size 1x2]
	[.s]:[torch.LongTensor of size 4x2]
Traceback (most recent call last):
  File "test5.py", line 35, in <module>
    for batch in test_iterator:
  File "/home/akhan/.local/lib/python3.5/site-packages/torchtext/data/iterator.py", line 142, in __iter__
    self.init_epoch()
  File "/home/akhan/.local/lib/python3.5/site-packages/torchtext/data/iterator.py", line 118, in init_epoch
    self.create_batches()
  File "/home/akhan/.local/lib/python3.5/site-packages/torchtext/data/iterator.py", line 242, in create_batches
    self.batches = batch(self.data(), self.batch_size,
  File "/home/akhan/.local/lib/python3.5/site-packages/torchtext/data/iterator.py", line 103, in data
    xs = sorted(self.dataset, key=self.sort_key)
TypeError: unorderable types: Example() < Example()

So the code can iterate through the train data, but not through the test data. Can you spot the reason? Thanks!
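A hedged guess at a fix (legacy torchtext API): BucketIterator needs a sort_key to order Examples for bucketing, and the JSON fields don't provide one by default, so it can be passed explicitly, e.g.:

train_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, test_data),
    batch_size=BATCH_SIZE,
    sort_key=lambda x: len(x.s),  # sort by the quote field ('s')
    device=device)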

TypeError: __init__() got an unexpected keyword argument 'dtype'


TypeError Traceback (most recent call last)
in
10
11 TEXT = data.Field(tokenize='spacy', preprocessing=generate_bigrams)
---> 12 LABEL = data.LabelField(dtype=torch.float)

d:\progrom\python\python\python3\lib\site-packages\torchtext-0.2.3-py3.6.egg\torchtext\data\field.py in __init__(self, **kwargs)
691 kwargs['unk_token'] = None
692
--> 693 super(LabelField, self).__init__(**kwargs)

TypeError: __init__() got an unexpected keyword argument 'dtype'

custom word embedding learning

You could also add a feature for saving the final embeddings after training, and then loading those custom embeddings in the following part of the code:

TEXT.build_vocab(train_data, max_size=25000, vectors="glove.twitter.27B.100d", unk_init=torch.Tensor.normal_)

This might be useful when a related (distant) dataset exists.
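A hedged sketch of that idea: dump the fine-tuned embedding weights to a plain word-vector text file after training, then load them back through torchtext's Vectors class (legacy API; TEXT, model, and train_data follow the notebook's names):

import torch
from torchtext.vocab import Vectors

# after training: one "<token> <floats>" line per vocabulary entry
# (multi-word tokens such as bigrams would need their spaces replaced first)
with open('custom_embeddings.txt', 'w') as f:
    for token, idx in TEXT.vocab.stoi.items():
        vector = ' '.join(f'{v:.5f}' for v in model.embedding.weight[idx].tolist())
        f.write(f'{token} {vector}\n')

# later: load them like any other pretrained vectors
custom_vectors = Vectors(name='custom_embeddings.txt', cache='./')
TEXT.build_vocab(train_data, max_size=25000, vectors=custom_vectors, unk_init=torch.Tensor.normal_)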


Full link to the paper:
http://www.aclweb.org/anthology/S17-2094

Question regarding BucketIterator

Hi, I'm approaching sentiment analysis with torchtext and I've recently been studying the concept of an Iterator. From what I understand, it is used to automatically convert strings into vectors, batch them (that is, get the set of vectors that will be used for training), and then move them to the computing device.

I saw that BucketIterator tries to build batches in which all the sentences have similar lengths, to reduce the amount of padding. My question is: if a sentence is shorter than the fixed length it is padded, but what if a sentence is longer? Is it truncated? If so, how exactly?

Thanks in advance.

Using pretrained ELMo with TorchText

Hi, thanks for the tutorial. I have one question, though.
I want to use ELMo instead of GloVe or other embeddings, and suppose I already have an ELMo representation for every sentence with shape (seq_len, elmo_dimension).
I want to either:

  1. concatenate this representation with the output of the embedding layer, or
  2. use this representation directly before passing it to the RNN/CNN.

Do you have any idea how to do this with TorchText? I am not sure how to add the ELMo sentence representation to the batch and pass it to my model together with the input (which has been converted to indices).

Any advice/pointer would be greatly appreciated.

Thanks.

No need for the embedding layer to be included in backprop

Since you are loading already-trained vectors into the model, we can set the requires_grad parameter to False for the embedding layer, right? There is no need to propagate gradients to them.
This could create an issue, especially if you encounter a word in your validation data that you haven't seen in the training data.
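A minimal sketch of freezing the embedding layer (assuming the notebook's model with an embedding attribute):

import torch

model.embedding.weight.requires_grad = False  # keep the pretrained vectors fixed

# only pass trainable parameters to the optimizer
optimizer = torch.optim.Adam((p for p in model.parameters() if p.requires_grad), lr=1e-3)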

can not download glove.6B.zip

Hi, I have a problem when running this code:
TEXT.build_vocab(train, max_size=25000, vectors="glove.6B.100d")
It outputs the following:
.vector_cache/glove.6B.zip: 0.00B [00:00, ?B/s]
Traceback (most recent call last):
File "/home/hang/PycharmProjects/pytorch/test.py", line 20, in <module>
TEXT.build_vocab(train, max_size=20000, vectors="glove.6B.100d")
File "/usr/local/lib/python2.7/dist-packages/torchtext/data/field.py", line 257, in build_vocab
self.vocab = self.vocab_cls(counter, specials=specials, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torchtext/vocab.py", line 80, in __init__
self.load_vectors(vectors, unk_init=unk_init, cache=vectors_cache)
File "/usr/local/lib/python2.7/dist-packages/torchtext/vocab.py", line 139, in load_vectors
vectors[idx] = pretrained_aliases[vector](**kwargs)
File "/usr/local/lib/python2.7/dist-packages/torchtext/vocab.py", line 347, in __init__
super(GloVe, self).__init__(name, url=url, **kwargs)
File "/usr/local/lib/python2.7/dist-packages/torchtext/vocab.py", line 236, in __init__
self.cache(name, cache, url=url)
File "/usr/local/lib/python2.7/dist-packages/torchtext/vocab.py", line 268, in cache
with zipfile.ZipFile(dest, "r") as zf:
File "/usr/lib/python2.7/zipfile.py", line 770, in __init__
self._RealGetContents()
File "/usr/lib/python2.7/zipfile.py", line 811, in _RealGetContents
raise BadZipfile, "File is not a zip file"
zipfile.BadZipfile: File is not a zip file

My system is Ubuntu with Python 2.7.
Can I download "glove.6B.zip" manually?
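A hedged sketch of the manual route: download and unzip the file yourself (the usual Stanford NLP URL), then point torchtext at the local copy via the Vectors class instead of the "glove.6B.100d" alias:

# in a shell:
#   wget http://nlp.stanford.edu/data/glove.6B.zip
#   unzip glove.6B.zip -d .vector_cache/

from torchtext.vocab import Vectors

vectors = Vectors(name='glove.6B.100d.txt', cache='.vector_cache')
TEXT.build_vocab(train, max_size=25000, vectors=vectors)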

ValueError: Target size (torch.Size([2, 1])) must be the same as input size (torch.Size([1, 1]))

Hi,

Below is my error screenshot. I followed everything as you described in the updated sentiment analysis notebook, just using my own dataset, which has labels categorized into 5 types.

[error screenshot]

Below is my code for fields definition and iterator.

TEXT = data.Field(tokenize = 'spacy', include_lengths = True)
LABEL = data.Field(dtype=torch.float)

fields = [('text', TEXT), ('label', LABEL)]

train_data, valid_data, test_data = data.TabularDataset.splits(
    path = r'H:************',
    train = 'train.csv',
    validation = 'valid.csv',
    test = 'test.csv',
    format = 'csv',
    fields = fields)

TEXT.build_vocab(train_data,
    vectors = "glove.6B.100d",
    unk_init = torch.Tensor.normal_)
LABEL.build_vocab(train_data)

import torch

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

BATCH_SIZE = 1

train_iterator, valid_iterator, test_iterator = data.BucketIterator.splits(
    (train_data, valid_data, test_data),
    sort_key = lambda x: x.text, #sort by s attribute (quote)
    sort_within_batch = True,
    batch_size=BATCH_SIZE,
    device=device)

I tried to Google it but had no luck. I'm sorry, I am new to the field; any help will be much appreciated. Could you please have a look and let me know what could be causing this error?

5th Tutorial: categorical_accuracy error

Hi,

I got an error telling me to change the data type of the denominator from torch.LongTensor([y.shape[0]]) to torch.cuda.LongTensor([y.shape[0]]), but after doing that the accuracies are calculated as zeros.

Requesting help with this case.

A solid reference material overall!!

Thanks,
GR
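A hedged, device-agnostic sketch of the accuracy calculation described above, which avoids the torch.LongTensor / torch.cuda.LongTensor distinction entirely (preds are [batch size, n classes] logits, y are integer labels):

import torch

def categorical_accuracy(preds, y):
    top_pred = preds.argmax(dim=1)
    correct = (top_pred == y).float()
    return correct.sum() / y.shape[0]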

"cannot re-initialize cuda in forked subprocess. " + msg)

I am getting this error
"cannot re-initialize cuda in forked subprocess. " + msg)
when trying to iterate over the train iterator in the train function.
I tried changing the device to CPU, but nothing happened.

Do you know where the error is stemming from?
I tried googling but wasn't able to solve it

Tips on improving multi-class classification on custom dataset

I have a custom text dataset with a little more than 13k data points and 5 classes. I tried to use your multi-class classification notebook to classify my dataset, but it doesn't seem to work.

Here's what I have tried -

  1. Played around with the learning rates, optimizers and momentum.
  2. Tried increasing and decreasing the number of conv layers.
  3. Tried changing GloVe word vectors of dimensions 50, 100 and 200.

However, nothing seems to work and, in fact, the validation loss steadily increases. I suspect some improper embedding of my dataset. Do you have any suggestions for improvement? Any help is appreciated.

Difference between Fast-Text model and Fast-Text embeddings

In the third Jupyter notebook it is shown how to implement the bigram-based FastText system to perform sentiment analysis. Looking at the different embeddings that TorchText offers, I found that you can also use pre-trained FastText vectors. How is this different from the implemented system? Would it be a nice idea to use the FastText architecture with FastText embeddings? Or maybe they could be used with other architectures, like an RNN or LSTM?
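For reference, a hedged one-liner showing how pre-trained FastText word vectors can back any of the architectures, independent of the FastText classifier itself (the alias name 'fasttext.en.300d' is an assumption about torchtext's pretrained aliases):

TEXT.build_vocab(train_data, max_size=25000, vectors='fasttext.en.300d', unk_init=torch.Tensor.normal_)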

Freezing glove word embeddings for few epochs

I saw that in some papers people freeze the embedding weights until the later layers converge to something meaningful. After that, they unfreeze them and train the whole network together. What do you think about adding this feature, say as an appendix?
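A hedged sketch of that schedule, reusing the notebook's names (model, train, evaluate, iterators, optimizer, criterion):

FREEZE_EPOCHS = 3
N_EPOCHS = 10

model.embedding.weight.requires_grad = False  # start with frozen embeddings

for epoch in range(N_EPOCHS):
    if epoch == FREEZE_EPOCHS:
        model.embedding.weight.requires_grad = True  # unfreeze and fine-tune end-to-end
    train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
    valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)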

Question about Tutorial 4

Hi Ben,

Thanks a lot for this awesome tutorials!

I have a question about tutorial 4 on CNNs:

  • In the 3rd picture, max pooling is applied over 3 values (0.3, 0.6, 0.9), and you wrote that we pick 0.9 as the maximum, but in the picture 0.3 is selected. Maybe it is a typo?

  • Also in class implementation of CNN, after applying convolutional filters, there is a comment about conv_n shape:
    #conv_n = [batch size, n_filters, sent len - filter_sizes[n]]
    And in class implementation of CNN1d, there is a comment about the shape of the same output:
    #conv_n = [batch size, n_filters, sent len - filter_sizes[n] - 1]
    Should the 3rd dimension be "sent len - filter_sizes[n] + 1" in both cases? (A quick numeric check is sketched after this message.)

Thank you.
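A quick numeric check of the convolution output length (stride 1, no padding), which comes out as sent len - filter size + 1:

import torch
import torch.nn as nn

sent_len, emb_dim, n_filters, filter_size = 10, 8, 4, 3
x = torch.randn(1, emb_dim, sent_len)            # [batch size, emb dim, sent len]
conv = nn.Conv1d(emb_dim, n_filters, filter_size)
print(conv(x).shape)                             # torch.Size([1, 4, 8]), i.e. 10 - 3 + 1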

invalid argument 5: kernel size should be greater than zero, but got kH:

Hi, I'm new to PyTorch.
I followed your code and tried to make a Korean version of the sentiment analysis.
But I got this error:

RuntimeError Traceback (most recent call last)
in
8
9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion,)
---> 10 valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
11
12 end_time = time.time()

in evaluate(model, iterator, criterion)
10 for batch in iterator:
11
---> 12 predictions = model(batch.text).squeeze(1)
13
14 loss = criterion(predictions, batch.label)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

in forward(self, text)
21 #embedded = [batch size, sent len, emb dim]
22
---> 23 pooled = F.avg_pool2d(embedded, (embedded.shape[1], 1)).squeeze(1)
24 #pooled = [batch size, embedding_dim]
25

RuntimeError: invalid argument 5: kernel size should be greater than zero, but got kH: 0 kW: 1 at /Users/soumith/mc3build/conda-bld/pytorch_1549312653646/work/aten/src/THNN/generic/SpatialAveragePooling.c:14


I changed the code a little to adapt it to Korean, like this:
from soynlp.tokenizer import MaxScoreTokenizer
from soynlp.normalizer import *
from konlpy.tag import Okt
import re

def tokenizer(text):  # create a tokenizer function
    okt = Okt()
    review_text = re.sub("[^가-힣ㄱ-ㅎㅏ-ㅣ\\s]", "", text)
    x = okt.morphs(review_text, stem=True)
    return x

def generate_bigrams(x):
    n_grams = set(zip(*[x[i:] for i in range(2)]))
    for n_gram in n_grams:
        x.append(' '.join(n_gram))
    return x

TEXT = data.Field(tokenize = tokenizer, preprocessing = generate_bigrams, stop_words = stop_words)
LABEL = data.LabelField(dtype = torch.float)

I think the problem is in the validation, because when I commented out the code related to validation and ran it again, it worked well.

non consistent dropout after embeddings

In [2 - Updated Sentiment Analysis] you have dropout after the embedding layer.
In [4 - Convolutional Sentiment Analysis] you don't have dropout after the embedding layer.
Not sure whether you need it, but this is inconsistent: you either do or don't need it in both cases. I guess this is a mistake. So many mistakes, dude, hah?

Squeeze(0)

In tutorial 2 in the RNN.forward function you return self.fc(hidden.squeeze(0)).

However, when I test this with my own data, where I have a dataset of length x with x % batch_size == 1, it fails due to the removal of the dimension.

I printed out the hidden shape just before the return statement and noticed that the squeeze(0) is not needed, as the hidden shape is [batch_size, hidden_dim * num_directions].

My question is as follows: is it correct that the squeeze(0) is essentially not needed, because it will always remove the batch_size dimension when it is 1?

Also awesome tutorial!
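For reference, a quick shape check with a plain single-layer unidirectional nn.RNN: hidden comes out as [num layers * num directions, batch size, hidden dim], so squeeze(0) removes that leading 1 rather than the batch dimension:

import torch
import torch.nn as nn

rnn = nn.RNN(input_size=8, hidden_size=16)
x = torch.randn(5, 3, 8)           # [sent len, batch size, emb dim]
output, hidden = rnn(x)
print(output.shape)                # torch.Size([5, 3, 16])
print(hidden.shape)                # torch.Size([1, 3, 16])
print(hidden.squeeze(0).shape)     # torch.Size([3, 16])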

use pre-trained word embeddings

Hi @bentrevett,
Q1. Can we also use the pre-trained word embedding weights for non-sentiment-analysis problems, such as machine translation or any other text-related problem?
Q2. I want to work on search with deep learning. In your opinion, if I have pre-trained word embeddings and a user passes in a word, can the system return similar words from the pre-trained word embedding matrix?
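Regarding Q2, a hedged sketch of a nearest-neighbour lookup in a pretrained embedding matrix using cosine similarity (a legacy torchtext vocab with loaded vectors, e.g. TEXT.vocab, is assumed):

import torch.nn.functional as F

def most_similar(word, vocab, k=5):
    vectors = vocab.vectors                            # [vocab size, emb dim]
    query = vectors[vocab.stoi[word]].unsqueeze(0)     # [1, emb dim]
    sims = F.cosine_similarity(query, vectors, dim=1)  # [vocab size]
    top = sims.topk(k + 1).indices.tolist()            # +1 so we can skip the word itself
    return [vocab.itos[i] for i in top if vocab.itos[i] != word][:k]

# most_similar('good', TEXT.vocab)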

Not able to run the code on CPU.

I tried to run the code on CPU, but it was throwing some errors. Is there an easy way to fix this? For now I just disabled these three lines.

torch.manual_seed(SEED) torch.cuda.manual_seed(SEED) torch.backends.cudnn.deterministic = True

which I suppose are setting the CUDA random seed. But after doing that I was always getting the accuracy as zero. Any help in this matter?
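A hedged sketch of keeping the reproducibility lines while staying CPU-safe, by guarding the CUDA-specific calls (SEED as in the notebook):

import torch

torch.manual_seed(SEED)
if torch.cuda.is_available():
    torch.cuda.manual_seed(SEED)
    torch.backends.cudnn.deterministic = True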

typo

A small typo in Tutorial 1: Simple Sentiment Analysis

The RNN returns 2 tensors, output of size [sentence length, batch size, hidden dim] and hidden of size [1, batch size, embedding dim].

I think embedding dim should be hidden dim.

RuntimeError: 'lengths' array has to be sorted in decreasing order

Hi!
I have tried to do multi-class sentiment analysis.
You used a CNN, but I'd like to use other methods, so I coded it with a Bi-LSTM.
But I got this error:

RuntimeError Traceback (most recent call last)
in
7 start_time = time.time()
8
----> 9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
10 valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
11

in train(model, iterator, optimizer, criterion)
12 text, text_lengths = batch.text
13
---> 14 predictions = model(text, text_lengths).squeeze(1)
15
16 loss = criterion(predictions, batch.label)

/anaconda3/lib/python3.7/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

in forward(self, text, text_lengths)
28
29 #pack sequence
---> 30 packed_embedded = nn.utils.rnn.pack_padded_sequence(embedded, text_lengths)
31
32 packed_output, (hidden, cell) = self.rnn(packed_embedded)

/anaconda3/lib/python3.7/site-packages/torch/nn/utils/rnn.py in pack_padded_sequence(input, lengths, batch_first)
146 category=torch.jit.TracerWarning, stacklevel=2)
147 lengths = torch.as_tensor(lengths, dtype=torch.int64)
--> 148 return PackedSequence(torch._C._VariableFunctions._pack_padded_sequence(input, lengths, batch_first))
149
150

RuntimeError: 'lengths' array has to be sorted in decreasing order

Test model after saving with torch.save by loading glove weights

I have saved the model with torch.save, and while testing on a new set of data, I loaded the model from the checkpoint created earlier. But how do I load the word embeddings into the model? Is a step such as model.embedding.weight.data.copy_(TEXT.vocab.vectors) required?
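A hedged note in code form: if the whole state_dict was saved, the embedding weights (GloVe-initialized, then fine-tuned) are part of it, so loading the checkpoint already restores them; the copy_ step is only needed when initializing a fresh, untrained model:

import torch

torch.save(model.state_dict(), 'model.pt')      # embedding.weight is included in the state dict

# later, after re-creating the model with the same hyperparameters:
model.load_state_dict(torch.load('model.pt'))   # embedding weights restored here
model.eval()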

AttributeError: 'tuple' object has no attribute 'fields'

The following is my code:

tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
def token_split(data):
    return[text  for text in tokenizer.tokenize(data)]
Text = torchtext.data.Field(tokenize=token_split , pad_token= '[PAD]')
train_data = torchtext.data.TabularDataset.splits(path='./' ,train ='enron.csv' ,format ='csv',fields = [('text',Text)])
train_iterator = torchtext.data.Iterator(train_data,batch_size = 10,sort_key =lambda x: len(x.text), shuffle= True)
for example in train_iterator:
     print(example)

When I run it, I get the following error. I am not able to find the reason; can anyone help me?

AttributeError: 'tuple' object has no attribute 'fields'

And when I use

train_iterator = torchtext.data.Iterator(train_data,batch_size = 10, shuffle= True)

I also get this error:

AttributeError: 'tuple' object has no attribute 'sort_key'

Thanks in advance.

Move loss function to GPU

Hi,
Could you please explain the intuition behind why we need to move the loss function to the GPU?
criterion = criterion.to(device)

I have almost never seen it in other code.

Thanks for the great tutorial. I found your explanation quite intuitive.

Can we save the vocabulary generated by torchtext for later prediction?

Suppose I have already saved the model and now want to use it only for inference, say to predict the sentiment of a sentence. We will need the vocabulary generated by torchtext to get the indices of the words in the sentence we want to predict.
indexed = [TEXT.vocab.stoi[t] for t in tokenized]

So is there a way to save the torchtext vocabulary and re-load it (without the need to generate it again from the training data)?
To be clear, my question is how to properly save the model (including everything it needs for inference), and then use it for 'independent' prediction.

Thanks!
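A hedged sketch (the legacy torchtext Vocab is a plain Python object, so it can usually be serialized directly):

import torch

torch.save(TEXT.vocab, 'text_vocab.pt')      # at training time, after build_vocab

vocab = torch.load('text_vocab.pt')          # at inference time, no training data needed
indexed = [vocab.stoi[t] for t in tokenized]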

Multi-Class Classification. Alternative for torch.stack. Finding it expensive and time consuming while training.

CNN(
(embedding): Embedding(82672, 50)
(convs): ModuleList(
(0): Conv2d(1, 50, kernel_size=(2, 50), stride=(1, 1), padding=(1, 0))
(1): Conv2d(1, 50, kernel_size=(3, 50), stride=(1, 1), padding=(2, 0))
(2): Conv2d(1, 50, kernel_size=(4, 50), stride=(1, 1), padding=(3, 0))
(3): Conv2d(1, 50, kernel_size=(5, 50), stride=(1, 1), padding=(4, 0))
)
(fc): Linear(in_features=200, out_features=13, bias=True)
(dropout): Dropout(p=0.5)
)

I have about 13 target fields. During training, I'm passing the batch data with model(batch.data), where data is my text field. While calculating the loss, I'm stacking 13 one-hot encoded vectors with torch.stack([batch.target1, batch.target2.........,batch.target13], dim=1). Is there an alternative way to pass all the target fields of a particular batch to get faster computation?
My loss is BCEWithLogitsLoss. I am passing the output of the fully connected layer to this function and doing backprop.

TypeError: __init__() got an unexpected keyword argument 'dtype'

TypeError Traceback (most recent call last)
in ()
9
10 TEXT = data.Field(tokenize='spacy')
---> 11 LABEL = data.LabelField(dtype=torch.float)

~\AppData\Local\Programs\Python\Python36\lib\site-packages\torchtext\data\field.py in __init__(self, **kwargs)
691 None by default.
692 """
--> 693
694 def __init__(self, **kwargs):
695 # whichever value is set for sequential and unk_token will be overwritten

TypeError: __init__() got an unexpected keyword argument 'dtype'

I have read the previous issue and I updated my torchtext, but the problem still exists.
Any suggestions?

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

I've loaded the file into a colab notebook. When I try to run the first cell of the notebook I get the following:
OSError Traceback (most recent call last)
in ()
12 torch.cuda.manual_seed(SEED)
13
---> 14 TEXT = data.Field(tokenize='spacy')
15 LABEL = data.LabelField(tensor_type=torch.FloatTensor)
16

/usr/local/lib/python3.6/dist-packages/torchtext/data/field.py in __init__(self, sequential, use_vocab, init_token, eos_token, fix_length, tensor_type, preprocessing, postprocessing, lower, tokenize, include_lengths, batch_first, pad_token, unk_token, pad_first, truncate_first)
148 self.postprocessing = postprocessing
149 self.lower = lower
--> 150 self.tokenize = get_tokenizer(tokenize)
151 self.include_lengths = include_lengths
152 self.batch_first = batch_first

/usr/local/lib/python3.6/dist-packages/torchtext/data/utils.py in get_tokenizer(tokenizer)
10 try:
11 import spacy
---> 12 spacy_en = spacy.load('en')
13 return lambda s: [tok.text for tok in spacy_en.tokenizer(s)]
14 except ImportError:

/usr/local/lib/python3.6/dist-packages/spacy/__init__.py in load(name, **overrides)
13 if depr_path not in (True, False, None):
14 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 15 return util.load_model(name, **overrides)
16
17

/usr/local/lib/python3.6/dist-packages/spacy/util.py in load_model(name, **overrides)
117 elif hasattr(name, 'exists'): # Path or Path-like to model data
118 return load_model_from_path(name, **overrides)
--> 119 raise IOError(Errors.E050.format(name=name))
120
121

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Using it for Multilabel classification?

Thanks for the awesome tutorial.

4 - Convolutional Sentiment Analysis.ipynb
Do you think it is possible to use this part of the tutorial for a multilabel classification case (e.g. the Kaggle toxic comment challenge)?
It seems that just changing the output dimension to 6 does not work. Or do I have to use nn.Softmax with BCELoss()?
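For reference, a hedged sketch of the usual multilabel setup (one output per label, with BCEWithLogitsLoss applying the sigmoid internally rather than a softmax over classes); model and batch follow the notebook's names, and a multi-hot labels field is assumed:

import torch
import torch.nn as nn

criterion = nn.BCEWithLogitsLoss()

logits = model(batch.text)                      # [batch size, 6], fc layer with out_features=6
loss = criterion(logits, batch.labels.float())  # labels: multi-hot [batch size, 6]
probs = torch.sigmoid(logits)                   # per-label probabilities at inference time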

Dimension out of range in the last batch

I used my own data, and in the last batch (which contained only 1 example), model(batch.text).squeeze(1) causes a dimension out of range error.
predictions = model(batch.text).squeeze(1)
I solved it by not squeezing in the case where the batch contains only one example.
But I'm wondering if it has happened to anyone else?

AttributeError: 'Batch' object has no attribute 'text'

Hi, while training the model I am getting the error:

AttributeError: 'Batch' object has no attribute 'text'

I am working with the Upgraded Sentiment Analysis notebook and followed everything as mentioned. However, I am using my own datasets, which have one field for the cleaned text and the other for the label.
Below is my error while executing the code:

[error screenshot]

Could you please help?
Thank you

<pad> and <unk> not in vocab

Thanks for the great code and explanation.
In 1 - Simple Sentiment Analysis.ipynb, you claim that the size of the vocabulary is 25002 because <unk> and <pad> are added to the vocabulary. But in fact this is not the case, as these tokens, as well as other tokens such as <eos> and <sos>, are not being added to the vocabulary in build_vocab.

embedding(): argument 'indices' (position 2) must be Tensor, not tuple

from Tutorials : 1 - Simple Sentiment Analysis
---------------------------------------------------------------------------
TypeError Traceback (most recent call last)
in ()
7 start_time = time.time()
8
----> 9 train_loss, train_acc = train(model, train_iterator, optimizer, criterion)
10 valid_loss, valid_acc = evaluate(model, valid_iterator, criterion)
11

in train(model, iterator, optimizer, criterion)
10 optimizer.zero_grad()
11
---> 12 predictions = model(batch.text).squeeze(1)
13
14 loss = criterion(predictions, batch.label)

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

in forward(self, text)
14 #text = [sent len, batch size]
15
---> 16 embedded = self.embedding(text)
17
18 #embedded = [sent len, batch size, emb dim]

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/module.py in __call__(self, *input, **kwargs)
487 result = self._slow_forward(*input, **kwargs)
488 else:
--> 489 result = self.forward(*input, **kwargs)
490 for hook in self._forward_hooks.values():
491 hook_result = hook(self, input, result)

/opt/conda/lib/python3.6/site-packages/torch/nn/modules/sparse.py in forward(self, input)
116 return F.embedding(
117 input, self.weight, self.padding_idx, self.max_norm,
--> 118 self.norm_type, self.scale_grad_by_freq, self.sparse)
119
120 def extra_repr(self):

/opt/conda/lib/python3.6/site-packages/torch/nn/functional.py in embedding(input, weight, padding_idx, max_norm, norm_type, scale_grad_by_freq, sparse)
1452 # remove once script supports set_grad_enabled
1453 no_grad_embedding_renorm(weight, input, max_norm, norm_type)
-> 1454 return torch.embedding(weight, input, padding_idx, scale_grad_by_freq, sparse)
1455
1456

TypeError: embedding(): argument 'indices' (position 2) must be Tensor, not tuple

You may notice that this tensor should have another dimension due to the one-hot vectors; however, PyTorch conveniently stores a one-hot vector as its index value, i.e. the tensor representing a sentence is just a tensor of the indexes for each token in that sentence.

Hi, does this mean that when we train the model, we are feeding in the one-hot vectors, and not the value of the index?

I am feeding my array of characters into a 2d conv layer with filter size 20x256. Torchtext gives me (100, 128) - 100 is the sequence length, 128 is the batch size. I was just wondering how to change the char vectors into one-hot.
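A hedged sketch of expanding those index tensors into one-hot vectors on the fly (a character vocabulary of 256 is assumed, matching the 20x256 filter):

import torch
import torch.nn.functional as F

indices = torch.randint(0, 256, (100, 128))               # [sent len, batch size], as torchtext provides
one_hot = F.one_hot(indices, num_classes=256).float()     # [100, 128, 256]
x = one_hot.permute(1, 0, 2).unsqueeze(1)                 # [batch size, 1 channel, sent len, 256]
print(x.shape)                                            # torch.Size([128, 1, 100, 256])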

Very slow data loading

Not sure if I can ask a question here.
When I tried the first tutorial, the data loading took quite a long time (not the downloading over the internet).

  • Is this simply because of the file size?
  • Is there a way of doing it faster?

Thanks for sharing the code. I have been trying to learn pytorch and this site is helping me a lot.
