
cnn_sentence's People

Contributors

yoonkim


cnn_sentence's Issues

Getting class probability vectors from the intermediate layers

Hi, I am trying to use CNN_sentence to classify tweets into one of K predefined topics. Although I am able to get the final output class for each tweet, I am more interested in the probability vector in the layer right before the output layer, based on which it is decided which class the tweet should belong to. E.g. if I have 3 classes and the output class for a given tweet is [2], then I assume the previous layer would be dealing with class-wise probabilities, something like [0.3, 0.78, 0.1] for classes 1, 2, 3 respectively (just an example).

[test_loss,y_pred] = test_model_all(test_set_x,test_set_y)
the variable "y_pred" gives me the final output but not the class probabilities but not class probabilities.
Can you suggest a way to get these probs ?
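Note: a possible approach, as a hedged sketch. The final layer built in train_conv_net is an MLPDropout classifier whose softmax output can be compiled into its own Theano function. The names below (classifier, test_layer1_input, x, and especially predict_p) are assumptions to verify against conv_net_sentence.py and conv_net_classes.py; if predict_p does not exist, applying T.nnet.softmax to the last layer's linear output gives the same probabilities.

    # alongside test_model_all in train_conv_net (hypothetical names, verify locally)
    test_probs = classifier.predict_p(test_layer1_input)        # softmax class probabilities
    get_probs = theano.function([x], test_probs,
                                allow_input_downcast=True)
    probs = get_probs(test_set_x)    # shape (n_examples, n_classes); each row sums to 1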

How about the size of the feature map?

Hello Kim. I have a question about the size of the feature map from reading your paper.
In the paper, you used two filters with window sizes of 2 and 3, and the representation matrix of the sentence is 9 x k. Under the usual convolution operation, the size of the feature map equals 9 - 2 + 1 = 8 (for the smaller filter), while the size in your paper is 7. Can you give some details?

Error when I have more than two classes

Hi,

I was trying to change the code for a classification task where there are more than two output classes (rather than just positive and negative). I changed preprocess.py and it worked fine. But when I tried to run conv_net_sentence.py, it reported the following error:

Traceback (most recent call last):
  File "conv_net_sentence.py", line 327, in <module>
    dropout_rate=[0.5],img_w=wordim)
  File "conv_net_sentence.py", line 170, in train_conv_net
    cost_epoch = train_model(minibatch_index)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 606, in __call__
    storage_map=self.fn.storage_map)
  File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 595, in __call__
    outputs = self.fn()
ValueError: y_i >= dx dimensions[1]
Apply node that caused the error: CrossentropySoftmax1HotWithBiasDx(Alloc.0, SoftmaxWithBias.0, Elemwise{Cast{int32}}.0)

According to this, it seems that the code defines a model with binary output classes while my data has more than two classes. I am new to Theano and I am wondering where I should change the code to adapt it to more than two output classes.

Thanks
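Note: the number of output classes is controlled by the last entry of the hidden_units argument passed to train_conv_net in conv_net_sentence.py (it is [100, 2] for the binary MR task). A hedged sketch of the change, assuming the labels in your pickled data are integers 0..K-1:

    # conv_net_sentence.py -- in the call to train_conv_net near the bottom of the file
    num_classes = 5                        # example: a 5-way classification task
    hidden_units = [100, num_classes]      # was hidden_units=[100, 2]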

How much memory do I need to process the bin file (i.e. GoogleNews-vectors-negative300.bin)?

Hi everyone,

I tried to run the "process_data.py" file with the same word2vec binary file (i.e. GoogleNews-vectors-negative300.bin), but it didn't work. The process got killed after approximately 30 minutes.

At first I thought it might be a memory problem, but I also tried on a server (256 GB RAM and a 16 GB GPU) and unfortunately got the same result (the program got killed after running for about 30 minutes).

What could be the possible reasons?

Your response will be highly appreciated.
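Note: one hedged workaround is to load the vectors with gensim, whose load_word2vec_format accepts a limit argument so that only the most frequent vectors are kept in memory, and then feed those into the rest of process_data.py. This assumes gensim is installed and is only a sketch, not part of the original code:

    from gensim.models import KeyedVectors

    # load only the first 500k vectors (most frequent words) instead of all ~3M
    w2v = KeyedVectors.load_word2vec_format(
        "GoogleNews-vectors-negative300.bin", binary=True, limit=500000)
    print(w2v["sentence"].shape)    # (300,)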

MLPDropout seems to have no hidden layer?

Hi, recently I have been reading the code and it seems that no hidden layer is added in MLPDropout.
First, hidden_units is defined here:

conv_net_sentence.py, line 311: hidden_units=[100,2]

and changed here:

conv_net_sentence.py, line 93: hidden_units[0] = feature_maps*len(filter_hs)

so it still has just 2 elements.

And in conv_net_classes.py:
line 96: self.weight_matrix_sizes = zip(layer_sizes, layer_sizes[1:])
line 103: next_dropout_layer_input = _dropout_from_layer(rng, input, p=dropout_rates[0])
line 105: for n_in, n_out in self.weight_matrix_sizes[:-1]:

and here self.weight_matrix_sizes[:-1] is an empty list, so there seems to be no hidden layer defined?
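Note: the observation can be reproduced with a tiny sketch of the same zip logic; with the default two-element hidden_units only the softmax output layer is built, while a three-element list such as [100, 50, 2] would add one hidden layer:

    layer_sizes = [300, 2]    # hidden_units after line 93 sets element 0 to feature_maps*len(filter_hs)
    print(list(zip(layer_sizes, layer_sizes[1:]))[:-1])    # [] -> loop body never runs, no hidden layer
    print(list(zip([300, 50, 2], [50, 2]))[:-1])           # [(300, 50)] -> one hidden layer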

other datasets

When the code is used with a dataset other than the provided one, it crashes if the longest sentence in the new dataset is longer than the longest sentence in the provided dataset. This can be fixed easily; the max length is set in the code manually.

Word Embeddings in Non-static Mode

Hi @yoonkim,

Thanks a lot for the code and the useful comments. I have one question about the word embeddings after the training process in non-static mode. Is there a way to export the back-propagated word vectors in your code? Or, a more general question: is it possible to just back-propagate word vectors with a labelled dataset to see how their dimensions change (especially for sentiment classification and the polarity of words)?

Regards,
Nader
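Note: a hedged sketch of one way to do this. In train_conv_net the embeddings live in a Theano shared variable (named Words in conv_net_sentence.py, as far as I can tell); after training, its current value can be read back with get_value() and pickled together with the word_idx_map produced by process_data.py. Treat the exact names as assumptions to verify.

    import cPickle

    # after the training loop in train_conv_net
    fine_tuned_W = Words.get_value()       # numpy array, shape (vocab_size + 1, 300)
    cPickle.dump(fine_tuned_W, open("fine_tuned_vectors.p", "wb"))

    # later: vec = fine_tuned_W[word_idx_map["good"]]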

Dealing with unknown words

Does anyone know how words that are present in the dataset but absent from the pre-trained word2vec embeddings were treated? It is not clear in the paper.
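Note: process_data.py contains an add_unknown_words function which, if I read it correctly, gives words that are missing from the pre-trained vectors a small random uniform vector; the sketch below reproduces that idea and should be checked against the actual file.

    import numpy as np

    def add_unknown_words(word_vecs, vocab, min_df=1, k=300):
        """Words in the dataset but not in word2vec get a random vector
        drawn uniformly from [-0.25, 0.25]."""
        for word in vocab:
            if word not in word_vecs and vocab[word] >= min_df:
                word_vecs[word] = np.random.uniform(-0.25, 0.25, k)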

How to get the sentence label?

How do I get the sentence label? I can't get the label from a Theano TensorVariable; what should I do to solve this problem?

How to obtain altered word vectors from non-static model?

Hi everyone,

I want to do an analysis similar to the one Yoon Kim did in Table 3 of his paper. He compares word vectors that are close to each other before and after fine-tuning by the CNN. My question is, how can I save the "new" word vectors which were adapted by the model? I presume the object is somewhere in the "define model architecture" part of the code, but I am not sure which object contains the final vectors.

Any help is highly appreciated!

test_model in file conv_net_sentence.py

Why do we need to define the same Theano function twice in conv_net_sentence.py (test_model and train_model)? Do we need to replace train_set_x with test_set_x?

a pickle file problem

Hi, @yoonkim
I am a beginner in natural language processing and machine learning. Since the 'GoogleNews-vectors-negative300.bin' file is quite large, all of my attempts at making the pickle file ('mr.p') have failed. Could you give me some advice on making 'mr.p' with 16-32 GB of RAM, if you don't mind?

Also, I wonder whether making 'mr.p' needs a chunked process to work around the memory problem. (I know little about pickle files.)

Thank you

What does '<PAD>' stand for?

It seems like all texts are padded with PAD tokens to a fixed length. What is the meaning of this string? Why not pad with spaces?

success with CUDA 7.5?

Hello. I was able to run this code on a machine that uses CUDA 6.5 and Theano 0.7.

However, on a machine that uses CUDA 7.5 and Theano 0.7 I get the following error. Has anyone had success using CUDA 7.5? If so, are there other possible causes of the error? Again, I was able to get the code to work on a machine running CUDA 6.5.

Exception: ('The following error happened while compiling the node', GpuDnnPoolDesc{ws=(60, 1), stride=(60, 1), mode='max', pad=(0, 0)}(), '\n', 'nvcc return status', 2, 'for cmd', 'nvcc -shared -O3 -arch=sm_52 -m64 -Xcompiler -fno-math-errno,-Wno-unused-label,-Wno-unused-variable,-Wno-write-strings,-DCUDA_NDARRAY_CUH=11b90075e2397c684f9dc0f7276eab8f,-D NPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION,-fPIC -Xlinker -rpath,/home/ahandler/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.3.1611-Core-x86_64-2.7.12-64/cuda_ndarray -I/home/ahandler/.virtualenvs/cnn/lib/python2.7/site-packages/theano/sandbox/cuda -I/home/ahandler/.virtualenvs/cnn/lib/python2.7/site-packages/numpy/core/include -I/cm/shared/apps/python/2.7.12/include/python2.7 -o /home/ahandler/.theano/compiledir_Linux-3.10-el7.x86_64-x86_64-with-centos-7.3.1611-Core-x86_64-2.7.12-64/tmpq2e2WV/a63d06730a401da926137f78c7ff838d.so mod.cu -L/cm/shared/apps/python/2.7.12/lib -lpython2.7 -lcudnn -lcudart', "[GpuDnnPoolDesc{ws=(60, 1), stride=(60, 1), mode='max', pad=(0, 0)}()]")

about the size of the filter

I am still confused about the size of the filter and how the convolution works.
What do filter_h = 5 and filter_hs = [3,4,5] mean? Is filter_h the maximum length among the filter_hs values?
To get the image shape, the longest sentence is 56, so 56 + 2 * (5-1) = 64. What does the number 2 mean? Where does it come from?
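Note: a hedged sketch of the arithmetic as I understand it from get_idx_from_sent: each sentence is padded with filter_h - 1 zero indices on each side, so the widest input is max_l + 2*(filter_h - 1); the 2 simply counts the two sides (left and right).

    max_l = 56        # longest sentence in the MR data
    filter_h = 5      # largest window in filter_hs = [3, 4, 5]

    pad = filter_h - 1               # 4 zero-padding positions per side
    image_h = max_l + 2 * pad        # 56 + 2*(5-1) = 64, the 'image shape' height
    print(image_h)                   # 64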

about the val perf

Hi, Kim. Thank you for sharing your code with us. I am using the same code and corpus as you and running the models on CPU; the command is THEANO_FLAGS=mode=FAST_RUN,device=cpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec. But the val perf always stays around 75% and never reaches 80% or more, so I want to know what factors could cause this result.

model's result:
loading data... data loaded!
model architecture: CNN-non-static
using: word2vec vectors
[('image shape', 64, 300), ('filter shape', [(100, 1, 3, 300), (100, 1, 4, 300), (100, 1, 5, 300)]), ('hidden_units', [100, 2]), ('dropout', [0.5]), ('batch_size', 50), ('non_static', True), ('learn_decay', 0.95), ('conv_non_linear', 'relu'), ('non_static', True), ('sqr_norm_lim', 9), ('shuffle_batch', True)]
... training
epoch: 1, training time: 336.80 secs, train perf: 54.66 %, val perf: 54.95 %
epoch: 2, training time: 336.82 secs, train perf: 65.70 %, val perf: 65.79 %
epoch: 3, training time: 336.72 secs, train perf: 66.45 %, val perf: 63.26 %
epoch: 4, training time: 336.60 secs, train perf: 72.57 %, val perf: 68.21 %
epoch: 5, training time: 336.53 secs, train perf: 74.38 %, val perf: 68.74 %

NotImplementedError: The image and the kernel must have the same type.inputs

ubgpu@ubgpu:/github/CNN_sentence$ sudo python conv_net_sentence.py -nonstatic -word2vec
Using gpu device 0: GeForce GTX 970
loading data... data loaded!
model architecture: CNN-non-static
using: word2vec vectors
[('image shape', 64, 300), ('filter shape', [(100, 1, 3, 300), (100, 1, 4, 300), (100, 1, 5, 300)]), ('hidden_units', [100, 2]), ('dropout', [0.5]), ('batch_size', 50), ('non_static', True), ('learn_decay', 0.95), ('conv_non_linear', 'relu'), ('non_static', True), ('sqr_norm_lim', 9), ('shuffle_batch', True)]
Traceback (most recent call last):
  File "conv_net_sentence.py", line 317, in <module>
    dropout_rate=[0.5])
  File "conv_net_sentence.py", line 88, in train_conv_net
    filter_shape=filter_shape, poolsize=pool_size, non_linear=conv_non_linear)
  File "conv_net_classes.py", line 390, in __init__
    conv_out = conv.conv2d(input=input, filters=self.W, filter_shape=self.filter_shape, image_shape=self.image_shape)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/nnet/conv.py", line 151, in conv2d
    return op(input, filters)
  File "/usr/local/lib/python2.7/dist-packages/theano/gof/op.py", line 507, in __call__
    node = self.make_node(*inputs, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/theano/tensor/nnet/conv.py", line 628, in make_node
    "inputs(%s), kerns(%s)" % (_inputs.dtype, _kerns.dtype))
NotImplementedError: The image and the kernel must have the same type.inputs(float64), kerns(float32)
ubgpu@ubgpu:/github/CNN_sentence$
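Note: a hedged workaround, assuming the mismatch comes from part of the graph being built in float64 while the filters are float32: force 32-bit floats for the whole run via THEANO_FLAGS, or cast the word-vector matrix explicitly after unpickling mr.p (the name U for the matrix passed to train_conv_net is an assumption to verify).

    # option 1: force 32-bit floats for the whole run
    #   THEANO_FLAGS=mode=FAST_RUN,device=gpu,floatX=float32 python conv_net_sentence.py -nonstatic -word2vec

    # option 2: cast the word-vector matrix before it is handed to train_conv_net
    import numpy as np
    import theano
    U = np.asarray(U, dtype=theano.config.floatX)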

question regarding datasets

Hello,
This is not an issue but rather a question -
Where can I get all the datasets you reported on in the paper?
Do you think that training on ALL datasets together would improve the results?
What about training for various languages: do you think a model trained on text in mixed languages would behave better or worse than models handling each language separately?

And another question regarding phrases: Google's pretrained word2vec vectors also include phrases; were they taken into account as well?

Confused about dropout_cost_p and cost_p?

I am not familiar with Theano, but it seems that the train_model function outputs cost_p, while the sgd_updates_adadelta function optimizes over dropout_cost_p? I am confused about this. Could you please explain it to me if you have time?

Thanks in advance.

sentence length and padding

Why do you pad all sentences to the same length, currently fixed at 56?
It should not be necessary, since in the paper you say that the "pooling scheme naturally deals with variable sentence lengths".
Shouldn't padding depend on filter size?
Right now it is fixed at 5 in the call to
make_idx_data_cv(revs, word_idx_map, i, max_l=56, k=300, filter_h=5)
BTW: k is not used.

Dealing with overfitting

Does anyone know how overfitting was dealt with? I read something about early stopping in section 3.1, but I get overfitting in the second epoch of training using the hyperparameters specified in the article. Is that correct?

Confused about vocab in process_data.py, need help

I'm a newbie in sentiment analysis, and recently I have been trying to apply CNNs to sentiment analysis. Yoon's paper has helped me a lot and I really appreciate that.

I want to understand every piece of code in this repo, but I ran into some trouble when reading process_data.py. The variable vocab is a dictionary that should store the frequency of each word occurring in the MR data, i.e. {word: word_frequency}. But in the function build_data_cv, Yoon used a set to store the words in each line, which means duplicate words are removed; in that case, how can we count the number of times each word occurs?

    vocab = defaultdict(float)   # dict to store words with their frequencies
    with open(pos_file, "rb") as f:
        for line in f:       
            rev = []
            rev.append(line.strip())
            if clean_string:
                orig_rev = clean_str(" ".join(rev))
            else:
                orig_rev = " ".join(rev).lower()
            words = set(orig_rev.split()) # use set to store words, which means duplicate words will be removed in current line

Can anybody help me? Thanks a lot!
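Note: a sketch of what I believe the lines immediately after the quoted snippet do; vocab is incremented once per distinct word in each review, so it stores the number of reviews a word appears in (document frequency) rather than a raw term count. Verify against process_data.py.

    words = set(orig_rev.split())
    for word in words:
        vocab[word] += 1    # at most once per review -> document frequency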

The following error happened while compiling the node', GpuAlloc

I ran into this error, as follows:
Using gpu device 0: GeForce GTX 960
loading data... data loaded!
model architecture: CNN-static
using: word2vec vectors
[('image shape', 64, 300), ('filter shape', [(100, 1, 3, 300), (100, 1, 4, 300), (100, 1, 5, 300)]), ('hidden_units', [100, 2]), ('dropout', [0.5]), ('batch_size', 50), ('non_static', False), ('learn_decay', 0.95), ('conv_non_linear', 'relu'), ('non_static', False), ('sqr_norm_lim', 9), ('shuffle_batch', True)]
Traceback (most recent call last):
  File "conv_net_sentence.py", line 322, in <module>
    dropout_rate=[0.5])
  File "conv_net_sentence.py", line 133, in train_conv_net
    allow_input_downcast=True)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/compile/function.py", line 266, in function
    profile=profile)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/compile/pfunc.py", line 511, in pfunc
    on_unused_input=on_unused_input)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 1466, in orig_function
    defaults)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/compile/function_module.py", line 1324, in create
    input_storage=input_storage_lists)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/link.py", line 519, in make_thunk
    output_storage=output_storage)[:3]
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/vm.py", line 897, in make_all
    no_recycling))
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/__init__.py", line 259, in make_thunk
    compute_map, no_recycling)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/op.py", line 739, in make_thunk
    output_storage=node_output_storage)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1073, in make_thunk
    keep_lock=keep_lock)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1015, in compile
    keep_lock=keep_lock)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1442, in cthunk_factory
    key=key, lnk=self, keep_lock=keep_lock)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cmodule.py", line 1076, in module_from_key
    module = lnk.compile_cmodule(location)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cc.py", line 1354, in compile_cmodule
    preargs=preargs)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/sandbox/cuda/nvcc_compiler.py", line 434, in compile_str
    return dlimport(lib_filename)
  File "/home/gallup/anaconda2/lib/python2.7/site-packages/theano/gof/cmodule.py", line 293, in dlimport
    rval = __import__(module_name, {}, {}, [module_name])
ImportError: ('The following error happened while compiling the node', GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, Shape_i{0}.0, Shape_i{0}.0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0, Elemwise{Composite{(((i0 - i1) // i2) + i2)}}[(0, 1)].0), '\n', '/home/gallup/.theano/compiledir_Linux-3.13--generic-x86_64-with-debian-jessie-sid-x86_64-2.7.11-64/tmp8GhxaW/87d76708312aab82a90a5274df9a9cc6.so: undefined symbol: _Z17CudaNdarray_SIZEtPK11CudaNdarray', '[GpuAlloc{memset_0=True}(CudaNdarrayConstant{0.0}, <TensorType(int64, scalar)>, <TensorType(int64, scalar)>, <TensorType(int64, scalar)>, <TensorType(int64, scalar)>)]')

word embedding normalization

I didn't see any normalization of the word vectors (maybe I missed it). Should we normalize the word embeddings, or should we use the same vector representation as Google's word2vec?
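Note: as far as I can see the code uses the raw word2vec vectors as-is. If you do want to normalize, a minimal sketch is to L2-normalize each row of the embedding matrix W before training:

    import numpy as np

    norms = np.linalg.norm(W, axis=1, keepdims=True)
    norms[norms == 0] = 1.0          # keep the all-zero padding row untouched
    W_normalized = W / norms         # each word vector now has unit L2 norm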

AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'

Hello,
I'm sorry to write this. When I imported this project into my workspace, there were red underlines in two places, LeNetConvPoolLayer and MLPDropout, at lines 88 and 95 in the conv_net_classes.py file; after I added the theano. prefix, the underlines disappeared. But when I run the conv_net_classes.py file, it fails with AttributeError: 'module' object has no attribute 'LeNetConvPoolLayer'. How can I fix this?

License?

Hi -
Do you have a license that you are releasing your code under?

Save trained model

Hi, could you add a way to save the trained model for future prediction?
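Note: a hedged, generic Theano sketch in the meantime: dump the current values of all shared parameters after training and set them back into an identically constructed network later. The collections classifier.params and conv_layer.params are how train_conv_net appears to gather the parameters, but verify the names (and add Words for the non-static model).

    import cPickle

    # gather every trainable shared variable (assumed layout; check train_conv_net)
    all_params = list(classifier.params)
    for conv_layer in conv_layers:
        all_params += conv_layer.params

    # save
    cPickle.dump([p.get_value() for p in all_params], open("cnn_model.p", "wb"))

    # restore later, after rebuilding the same architecture
    saved = cPickle.load(open("cnn_model.p", "rb"))
    for p, value in zip(all_params, saved):
        p.set_value(value)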

multilabel classification

Hi, what changes need to be made in this code to allow multi-label classification? Is there any resource I can refer to in order to extend your code to multi-label classification?

Results change when running multiple time

Hi Kim,

I ran into one problem when using your code.
I fixed the random seed; when excluding the embeddings from the parameter set, the results remain the same across runs (which is what I expected). But when including the embeddings for fine-tuning, the results change slightly between runs. Can you explain why?

Thank you

set sentence max length automatically

If you are using a dataset with sentences longer than 65 words, you have to set the max_l variable manually. You can fix this little issue by replacing the second-to-last line in process_data.py with:

cPickle.dump([revs, W, W2, word_idx_map, vocab, max_l], open("mr.p", "wb"))

and in conv_net_sentence.py after loading the pickled file:

revs, W, W2, word_idx_map, vocab, max_l = x[0], x[1], x[2], x[3], x[4], x[5]

Now you only have to replace the make_idx_data_cv function call with:
make_idx_data_cv(revs, word_idx_map, i, max_l=max_l, k=300, filter_h=5)

It drove me crazy to find that the max sentence length limitation was the cause of the error

ValueError: setting an array element with a sequence.

When will Torch version come up?

Hi

In the readme file you say that a Torch version will be available soon. Is this still planned? If yes, approximately when will it come out?

Regards, Felix

Why initialize W[0] with all 0s?

Hi, I'm having some trouble understanding the process_data.py file, especially saving a special W[0] entry and initializing the idx_map starting at 1. What is the purpose of doing that?

How do you feel about me creating a new project based on your code?

Hi, yoonkim,

Sincerely, thanks a lot for sharing this code.
I am a beginner with CNNs, and your code has helped me a lot during my practice.

I have rewritten some of the structure of this code and implemented saving and loading of the parameters.

There are many differences between your code and my new code, so I want to create a project called CNN_sentence_cm, where cm means Chinese comments and model-parameter saving added.

Sincerely, if this would cause any trouble for you, I will not do it.

question about clean_str in process_data.py

Hi @yoonkim. Recently I have been reading the dennybritz/cnn-text-classification-tf implementation based on your original code, and I found some regex patterns in the function clean_str(string), such as:

    string = re.sub(r"\'d", " \'d", string)
    string = re.sub(r"\'ll", " \'ll", string)
    string = re.sub(r",", " , ", string)
    string = re.sub(r"!", " ! ", string)
    string = re.sub(r"\(", " \( ", string)

When the program finds these patterns, it looks to me like it just replaces the found pattern with the same thing.
So my question is: what is the purpose of those re.sub calls? I'm confused. Could you give me some clues? Thanks a lot.
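Note: they are not no-ops; each substitution inserts spaces so that punctuation and contraction suffixes become separate whitespace-delimited tokens when the string is later split. A tiny demonstration:

    import re

    s = "I'll go, won't you?!"
    s = re.sub(r"\'ll", " \'ll", s)
    s = re.sub(r",", " , ", s)
    s = re.sub(r"!", " ! ", s)
    print(s.split())    # ['I', "'ll", 'go', ',', "won't", 'you?', '!']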

Does the model vocabulary come from both the train data and the test data?

Hi, Kim. Thank you for sharing your code, but I have a question about your model. In your implementation (shown below), the embedding matrix contains all words from both the train set and the test set. But I think it should contain only words from the train set, because in a real scenario you cannot see the test data, which may contain OOV words (words outside the train-set vocabulary). In static mode (CNN-static) this is not a problem, but in non-static mode (CNN-non-static), how do you handle this OOV problem, i.e. how do you update the embedding parameters of words not present in the model vocabulary? In brief, for words that are present in the word2vec model but not in the original model vocabulary, how do you handle them? Sorry, my English is poor and my expression may not be clear. Thank you.

def get_W(word_vecs, k=300):
    """
    Get word matrix. W[i] is the vector for word indexed by i
    """
    vocab_size = len(word_vecs)
    word_idx_map = dict()
    W = np.zeros(shape=(vocab_size+1, k), dtype='float32')            
    W[0] = np.zeros(k, dtype='float32')
    i = 1
    for word in word_vecs:
        W[i] = word_vecs[word]
        word_idx_map[word] = i
        i += 1
    return W, word_idx_map
