thunlp / nsc

Neural Sentiment Classification

License: MIT License

Python 100.00%

nlp

nsc's Introduction

Neural Sentiment Classification

Neural Sentiment Classification aims to classify the sentiment of a document with neural models, which are the state-of-the-art methods for sentiment classification. In this project, we provide our implementations of NSC, NSC+LA and NSC+UPA [Chen et al., 2016], in which user and product information is incorporated via attention over different semantic levels.

Evaluation Results

Evaluation results on document-level sentiment classification. Acc. (accuracy) and RMSE are the evaluation metrics.

In the above table, baseline models including Majority, Trigram, TextFeature, UPF, AvgWordvec, SSWE, RNTN + RNN, Paragraph Vector, JMARS and UPNN are reported in [Tang et al., 2015].

Data

We provide the IMDB, Yelp13 and Yelp14 datasets we used for sentiment classification in [Download]. The dataset should be decompressed and placed in the folder NSC/, NSC+LA/ or NSC+UPA/.

We preprocess the original data so that it satisfies the input format of our code. The original datasets were released with the paper [Tang et al., 2015]. [Download]

Pre-trained word vectors are learned on each dataset (IMDB, Yelp13, Yelp14) separately.

The dataset in each domain contains seven files, using the following format:

train.txt: training file; each line has the fields userid, productid, class and document, separated by '\t'.
dev.txt: dev file, same format as train.txt.
test.txt: test file, same format as train.txt.
wordlist.txt: the words, one per line, in the same order as the pre-trained word vectors.
usrlist.txt: user ids in each dataset, one per line.
prdlist.txt: product ids in each dataset, one per line.
embinit.save: the pre-trained word embedding file, saved with pickle; loading it yields numpy arrays.
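The line format of train.txt, dev.txt and test.txt described above can be read with a few lines of Python. This is a minimal sketch, not code from the repository: the names Review and parse_review_line are hypothetical, the sample line is made up, and it assumes a single '\t' between fields and an integer class label.

```python
from typing import NamedTuple


class Review(NamedTuple):
    """One example from train.txt / dev.txt / test.txt (hypothetical container)."""
    user_id: str
    product_id: str
    label: int
    document: str


def parse_review_line(line: str) -> Review:
    """Split one line into userid, productid, class and document."""
    # Assumption: fields are separated by a single tab, and the document
    # is everything after the third tab.
    user_id, product_id, label, document = line.rstrip("\n").split("\t", 3)
    # Assumption: the class field is an integer label.
    return Review(user_id, product_id, int(label), document)


# Made-up sample line, only to show the field order.
example = parse_review_line("u42\tp7\t8\tgreat movie , would watch again\n")
print(example.label, example.document)
```

If the downloaded files use a different separator, only the split call needs to change.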

The trained model can be found at this link.

Codes

The source code of the models is in the folders NSC/src, NSC+LA/src and NSC+UPA/src.

Train

For training, run the following command in the folder src/ of each model:

THEANO_FLAGS="floatX=float32,device=gpu" python train.py $dataset $class

where dataset is the corresponding dataset folder and class is the number of classes in the corresponding domain.

For example, we use the following command when classifying the IMDB documents:

THEANO_FLAGS="floatX=float32,device=gpu" python train.py IMDB 10

The trained model file will be saved in the folder model/bestmodel/ of each model.
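When several datasets or models need to be run in sequence, the same command can be assembled from Python. The sketch below only builds the argument list and environment shown in the command above. The helper name theano_command is hypothetical and not part of the repository.

```python
import os


def theano_command(script, dataset, num_classes, device="gpu"):
    """Build argv and environment for train.py or test.py (hypothetical helper)."""
    env = dict(os.environ)
    # Same flags as in the README command; only the device value is a parameter.
    env["THEANO_FLAGS"] = "floatX=float32,device={}".format(device)
    argv = ["python", script, dataset, str(num_classes)]
    return argv, env


argv, env = theano_command("train.py", "IMDB", 10)
print(" ".join(argv))
# To launch it from the repository root, one option is:
# subprocess.run(argv, env=env, cwd="NSC/src", check=True)
```

The subprocess call is left as a comment so that the sketch can be inspected without starting a training run.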

Test

For testing, run the following command in the folder src/ of each model:

THEANO_FLAGS="floatX=float32,device=gpu" python test.py $dataset $class

where dataset is the corresponding dataset folder and class is the number of classes in the corresponding domain.

For example, we use the following command when classifying the IMDB documents:

THEANO_FLAGS="floatX=float32,device=gpu" python test.py IMDB 10

The test result, which reports the Accuracy and RMSE, is printed to the screen.
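For readers unfamiliar with the two metrics, the sketch below computes accuracy and RMSE from gold and predicted class labels using their standard definitions. It is an illustration only: the function names and the sample labels are made up, and how test.py computes the metrics internally is not confirmed here.

```python
import math


def accuracy(gold, predicted):
    """Fraction of documents whose predicted class equals the gold class."""
    correct = sum(1 for g, p in zip(gold, predicted) if g == p)
    return correct / len(gold)


def rmse(gold, predicted):
    """Root mean squared error between gold and predicted class labels."""
    squared = sum((g - p) ** 2 for g, p in zip(gold, predicted))
    return math.sqrt(squared / len(gold))


# Made-up labels for four documents.
gold = [1, 2, 3, 4]
predicted = [1, 2, 4, 2]
print(accuracy(gold, predicted), rmse(gold, predicted))
```

Accuracy treats every wrong class the same, while RMSE penalizes predictions more the further they are from the gold rating.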

Cite

If you use the code, please cite the following paper:

[Chen et al., 2016] Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin and Zhiyuan Liu. Neural Sentiment Classification with User and Product Attention. In Proceedings of EMNLP. [pdf]

Reference

[Chen et al., 2016] Huimin Chen, Maosong Sun, Cunchao Tu, Yankai Lin and Zhiyuan Liu. Neural Sentiment Classification with User and Product Attention. In Proceedings of EMNLP. [pdf]

[Tang et al., 2015] Duyu Tang, Bing Qin, Ting Liu. Learning Semantic Representations of Users and Products for Document Level Sentiment Classification. In Proceedings of EMNLP.


nsc's Issues

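Access to the dataset download link is restricted

Hello, access to the dataset download link is restricted. Is there another download address?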

About the results

I ran the code with the settings described, but could not reproduce the results reported in the paper.
Could you point out whether I made a mistake?

What does 'self.output = outs[1]' in "LSTMLayer.py" mean?

Sorry if I have misunderstood something.

When I run THEANO_FLAGS="floatX=float32,device=gpu" python train.py IMDB 10, a ValueError occurs:

wangj@liutl:~/Work/NSC-master/NSC/src$ THEANO_FLAGS="floatX=float32,device=gpu" python train.py IMDB 10
Using gpu device 0: GeForce GTX TITAN Black (CNMeM is disabled, cuDNN Version is too old. Update to v5, was 4004.)
data loaded.
Traceback (most recent call last):
File "train.py", line 15, in
model = LSTMModel(voc.size,trainset, devset, dataname, classes, None)
File "/home/wangj/Work/NSC-master/NSC/src/LSTMModel.py", line 61, in init
updates=updates,
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function.py", line 320, in function
output_keys=output_keys)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/pfunc.py", line 479, in pfunc
output_keys=output_keys)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 1777, in orig_function
defaults)
File "/usr/local/lib/python2.7/dist-packages/theano/compile/function_module.py", line 1641, in create
input_storage=input_storage_lists, storage_map=storage_map)
File "/usr/local/lib/python2.7/dist-packages/theano/gof/link.py", line 690, in make_thunk
storage_map=storage_map)[:3]
File "/usr/local/lib/python2.7/dist-packages/theano/gof/vm.py", line 1003, in make_all
no_recycling))
File "/usr/local/lib/python2.7/dist-packages/theano/scan_module/scan_op.py", line 913, in make_thunk
from . import scan_perform_ext
File "/usr/local/lib/python2.7/dist-packages/theano/scan_module/scan_perform_ext.py", line 141, in
from scan_perform.scan_perform import *
File "init.pxd", line 155, in init theano.scan_module.scan_perform (/home/wangj/.theano/compiledir_Linux-3.13--generic-x86_64-with-Ubuntu-14.04-trusty-x86_64-2.7.6-64/scan_perform/mod.cpp:9984)
ValueError: ('The following error happened while compiling the node', forall_inplace,gpu,scan_fn}(Elemwise{minimum,no_inplace}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, Wf1, Wf2, Wc1, Wc2, Wi1, Wi2, Wo1, Wo2, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0), '\n', 'numpy.dtype has the wrong size, try recompiling')
@huimchen
thx

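The code crashes easily

Hello! I downloaded the code and ran train.py directly on a Titan GPU with 12 GB of memory, but at around iteration 700 it crashed with an out-of-memory error.
I tried adjusting batch_size and other settings, but the problem persists; it only runs a little longer.
I am not very familiar with Theano. Why is memory usage so unstable, and why does it keep growing? Is my GPU weaker than yours, so that training cannot finish?

Also, could you share the model results? I reimplemented the model in TensorFlow with exactly the same parameters, but it only reaches 49% on IMDB, and it overfits the training set severely.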

Can this be used on a Windows computer?

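Question about training

Hello,
During training, why are windows selected at random?
In the method def train(iters) of LSTMModel.py, lst appears to hold iters randomly chosen window indices, and training then runs on the data in those windows. My question is why training does not go through every window. Doesn't random selection affect the training results?
I see that test does evaluate every window of data. Could you explain?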

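Word vector training

Hello, according to the paper, the word vectors provided with the code were trained with skip-gram. What were the exact training parameters, and which software package was used?
Also, have you tried SSWE?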

Pre-train the 200-dimensional word embeddings on each dataset

In your paper, you pre-train 200-dimensional word embeddings on each dataset. Which data did you use to train the word vectors: only the model's training dataset, such as IMDB, or other data such as wiki as well?
Thank you!

Why is embinit.save one entry longer than the wordlist?

Hi, I am glad the code is open source.
I ran into the problem described in the title. Is it because one embedding stands for "UNK", like the id "-1" in the wordlist? If so, which one is "UNK": the first or the last embedding in embinit.save?

out of memory

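Hi,
I tried training with the NSC code you provide, and it ran out of memory. My GPU has 4 GB of memory. What hardware configuration did you use? Roughly how long does training take?
Can I reduce memory usage by adjusting parameters? Any guidance would be appreciated, thanks.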

Is Chinese supported?

How can I train on Chinese samples?

Question about LSTMModel.py

params = []
for layer in layers:
    params += layer.params
L2_rate = numpy.float32(1e-5)
for param in params[5:]:
    cost += T.sum(L2_rate * (param * param), acc_dtype='float32')
gparams = [T.grad(cost, param) for param in params]

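The printed params are: [E, U, P, Wi1, Wi2, bi, Wo1, Wo2, bo, Wf1, Wf2, bf, Wc1, Wc2, bc, W, v, Wu, Wp, b, Wi1, Wi2, bi, Wo1, Wo2, bo, Wf1, Wf2, bf, Wc1, Wc2, bc, W, v, Wu, Wp, b, W, b, W, b]
Why does this start from the fifth one? The fifth should be bi.
As I understand it, this cost is the regularization term, so why does it start from the fifth?
Also, are the values of E, U and P updated during training as well? Thank you.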

Hello, how is the dictionary or matrix for the user and product attention built? I don't quite understand it. Thank you for your guidance!

Training keeps failing with an error. Could someone help?

17 cost: 0.267990171909
18 cost: 0.298284947872
19 cost: 0.266352206469
Traceback (most recent call last):
File 'E:/ymg/NSC-LA/main/train.py', line 27, in
model.train(30)
File 'E:\ymg\NSC-LA\main\LSTMModel.py', line 87, in train
out = self.train_model(self.trainset.docs[i], self.trainset.label[i], self.trainset.wordmask[i], self.trainset.sentencemask[i], self.trainset.maxsentencenum[i])
File 'D:\Anaconda2\lib\site-packages\theano\compile\function_module.py', line 898, in call
storage_map=getattr(self.fn, 'storage_map', None))
File 'D:\Anaconda2\lib\site-packages\theano\gof\link.py', line 325, in raise_with_op
reraise(exc_type, exc_value, exc_trace)
File 'D:\Anaconda2\lib\site-packages\theano\compile\function_module.py', line 884, in call
self.fn() if output_subset is None else
File 'D:\Anaconda2\lib\site-packages\theano\scan_module\scan_op.py', line 989, in rval
r = p(n, [x[0] for x in i], o)
File 'D:\Anaconda2\lib\site-packages\theano\scan_module\scan_op.py', line 978, in p
self, node)
File 'theano/scan_module/scan_perform.pyx', line 522, in theano.scan_module.scan_perform.perform (C:\Users\Administrator\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-Intel64_Family_6_Model_94_Stepping_3_GenuineIntel-2.7.12-64\scan_perform\mod.cpp:6173)
RuntimeError: CudaNdarray_ZEROS: allocation failed.
Apply node that caused the error: forall_inplace,gpu,grad_of_scan_fn}(Elemwise{minimum,no_inplace}.0, GpuDimShuffle{0,2,1}.0, GpuDimShuffle{0,2,1}.0, GpuElemwise{tanh,no_inplace}.0, GpuDimShuffle{0,1,x}.0, GpuElemwise{Composite{(i0 - sqr(i1))},no_inplace}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuSubtensor{int64:int64:int64}.0, GpuAlloc{memset_0=True}.0, GpuSubtensor{::int64}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, GpuAlloc{memset_0=True}.0, Elemwise{minimum,no_inplace}.0, Elemwise{minimum,no_inplace}.0, Elemwise{minimum,no_inplace}.0, Elemwise{minimum,no_inplace}.0, Elemwise{minimum,no_inplace}.0, Wf1, Wf2, Wc1, Wc2, Wi1, Wi2, Wo1, Wo2, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{1,0}.0, GpuDimShuffle{1,0}.0)
Toposort index: 879
Inputs types: [TensorType(int64, scalar), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, (False, False, True)), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, 3D), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), TensorType(int64, scalar), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, row), CudaNdarrayType(float32, matrix), CudaNdarrayType(float32, matrix)]
Inputs shapes: [(), (106, 200, 6016), (106, 200, 6016), (106, 6016, 200), (106, 6016, 1), (106, 6016, 200), (106, 6016, 200), (106, 6016, 200), (106, 6016, 200), (107, 6016, 200), (107, 6016, 200), (2, 200), (2, 200), (2, 200), (2, 200), (), (), (), (), (), (200, 200), (200, 200), (200, 200), (200, 200), (200, 200), (200, 200), (200, 200), (200, 200), (1, 200), (200, 200), (200, 200), (1, 200), (200, 200), (200, 200), (1, 200), (200, 200), (200, 200), (1, 200), (200, 200), (200, 200)]
Inputs strides: [(), (-1203200, 1, 200), (-1203200, 1, 200), (1203200, 200, 1), (-6016, 1, 0), (1203200, 200, 1), (-1203200, 200, 1), (-1203200, 200, 1), (-1203200, 200, 1), (1203200, 200, 1), (-1203200, 200, 1), (200, 1), (200, 1), (200, 1), (200, 1), (), (), (), (), (), (200, 1), (200, 1), (200, 1), (200, 1), (200, 1), (200, 1), (200, 1), (200, 1), (0, 1), (1, 200), (1, 200), (0, 1), (1, 200), (1, 200), (0, 1), (1, 200), (1, 200), (0, 1), (1, 200), (1, 200)]
Inputs values: [array(106L, dtype=int64), 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', array(106L, dtype=int64), array(106L, dtype=int64), array(106L, dtype=int64), array(106L, dtype=int64), array(106L, dtype=int64), 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown', 'not shown']
Outputs clients: [[], [], [GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.2, ScalarFromTensor.0)], [GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.3, ScalarFromTensor.0)], [GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.4, ScalarFromTensor.0)], [GpuSubtensor{int64}(forall_inplace,gpu,grad_of_scan_fn}.5, ScalarFromTensor.0)], [GpuSubtensor{::int64}(forall_inplace,gpu,grad_of_scan_fn}.6, Constant{-1})], [GpuReshape{2}(forall_inplace,gpu,grad_of_scan_fn}.7, MakeVector{dtype='int64'}.0)], [GpuReshape{2}(forall_inplace,gpu,grad_of_scan_fn}.8, MakeVector{dtype='int64'}.0)], [GpuReshape{2}(forall_inplace,gpu,grad_of_scan_fn}.9, MakeVector{dtype='int64'}.0)], [GpuReshape{2}(forall_inplace,gpu,grad_of_scan_fn}.10, MakeVector{dtype='int64'}.0)]]

HINT: Re-running with most Theano optimization disabled could give you a back-trace of when this node was created. This can be done with by setting the Theano flag 'optimizer=fast_compile'. If that does not work, Theano optimizations can be disabled with 'optimizer=None'.
HINT: Use the Theano flag 'exception_verbosity=high' for a debugprint and storage map footprint of this apply node.

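Training keeps hitting this error after anywhere from 0 to 7 iterations. I am using an 8 GB GPU. I also searched online and tried training in FAST_RUN mode, but the same error occurs. Could anyone help?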

Will it work without GPU?

Hey, I was recently trying to run this code on my computer, but I don't have a GPU. Will it still work? If not, what changes should I make?

i need help

I am using Theano on CPU, and the following problem appears:

File "d:\Users\compiler1\Anaconda2\lib\site-packages\theano\gof\graph.py", line 381, in init
self.tag = utils.scratchpad()

NameError: global name 'utils' is not defined

How can I fix it?

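NSC+UPA question

I have run the NSC+UPA experiment many times, but on the IMDB dataset it never reaches 53.3%; the best I get is 52%. Has the code been changed? How many times did you run it? Any answer would be appreciated, thanks.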

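Memory error when using the GPU. Which Theano version do you use?

Hello, when I run THEANO_FLAGS="floatX=float32,device=gpu" python train.py IMDB 10, a memory error is reported after a short while. The error is as follows: Error when tring to find the memory information on the GPU: an illegal memory access was encountered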
Error freeing device pointer 0xb09aa0000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total
Error when tring to find the memory information on the GPU: an illegal memory access was encountered
Error freeing device pointer 0xb59a00000 (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total
device_free: cudaFree() returned an error, but there is already an Python error set. This happen during the clean up when there is a first error and the CUDA driver is in a so bad state that it don't work anymore. We keep the previous error set to help debugging it.CudaNdarray_uninit: error freeing self->devdata. (self=0x7f113c04aef0, self->devata=0xb59a00000)
Error when trying to find the memory information on the GPU: an illegal memory access was encountered
Error allocating 84299200 bytes of device memory (an illegal memory access was encountered). Driver report 0 bytes free and 0 bytes total
Traceback (most recent call last):
File "train.py", line 16, in
model.train(100)
File "/home/wangj/Work/NSC-master/NSC/src/LSTMModel.py", line 74, in train
out = self.train_model(self.trainset.docs[i], self.trainset.label[i], self.trainset.length[i],self.trainset.sentencenum[i],self.trainset.wordmask[i],self.trainset.sentencemask[i],self.trainset.maxsentencenum[i])
File "/home/liutl/anaconda/lib/python2.7/site-packages/theano/compile/function_module.py", line 588, in call
self.fn.thunks[self.fn.position_of_error])
File "/home/liutl/anaconda/lib/python2.7/site-packages/theano/compile/function_module.py", line 579, in call
outputs = self.fn()
MemoryError: Error allocating 84299200 bytes of device memory (an illegal memory access was encountered).
Apply node that caused the error: GpuElemwise{Composite{[mul(i0, sqr(i1))]},no_inplace}(CudaNdarrayConstant{[[ -1.76432998e-28]]}, GpuAdvancedIncSubtensor1_dev20{inplace,inc}.0)
Inputs shapes: [(1, 1), (105374, 200)]
Inputs strides: [(0, 0), (200, 1)]
Inputs types: [CudaNdarrayType(float32, (True, True)), CudaNdarrayType(float32, matrix)]
Use the Theano flag 'exception_verbosity=high' for a debugprint of this apply node.
I looked into it, and it may be related to my Theano version; my server has version 0.6 installed. Which version did you use for your experiments? Thanks! @huimchen

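In the IMDB data from data.zip, why are many word vectors zero?

@huimchen Hello, I downloaded the data from the link you provided (data.zip). While running the IMDB experiment, I found that in embinit.save under the IMDB folder, the word vectors at larger indices (for example, above 15000) all appear to be zero vectors. Why are the vectors of these later words set to zero?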

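In the NSC+UPA model, why are the user and product representations different at the sentence level and the word level?

In the NSC+UPA model, GetuEmbLayer.py contains the following lines, which distinguish whether the user is used in sentence-level attention or in word-level attention: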
if self.name == 'uemb_sentence':
    ualloc = T.alloc(u,maxsentencesum,T.shape(u)[0])
    uflatten = ualloc.T.flatten()
else:
    uflatten = u
self.output = Uemb[uflatten]
I would like to know why the two cases need to be handled separately. Thanks!
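The index arithmetic in the 'uemb_sentence' branch can be traced with plain Python lists. This is a sketch under stated assumptions, not an authoritative reading of the repository: it assumes u is a 1-D vector with one user id per document, and that T.alloc(u, maxsentencesum, T.shape(u)[0]) broadcasts u to shape (maxsentencesum, len(u)). The function name repeat_per_sentence is hypothetical.

```python
def repeat_per_sentence(user_ids, max_sentences):
    """Mimic alloc -> transpose -> flatten on a list of user ids."""
    # alloc: copy the id vector max_sentences times, shape (max_sentences, batch).
    alloc = [list(user_ids) for _ in range(max_sentences)]
    # .T: transpose to shape (batch, max_sentences).
    transposed = [list(column) for column in zip(*alloc)]
    # flatten(): row-major flattening, so each id appears max_sentences times in a row.
    return [uid for row in transposed for uid in row]


print(repeat_per_sentence([3, 7], 3))
```

Under these assumptions, the branch repeats each document's user id once per sentence position before the lookup in Uemb, while the other branch looks up one embedding per entry of u.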

My GPU is 64-bit, but the training command is configured for 32-bit: THEANO_FLAGS="floatX=float32,device=gpu" python train.py IMDB 10

Could this be what causes the ValueError?
ValueError: ('The following error happened while compiling the node', forall_inplace,gpu,scan_fn}(Elemwise{minimum,no_inplace}.0, GpuDimShuffle{0,1,x}.0, GpuSubtensor{int64:int64:int8}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, GpuIncSubtensor{InplaceSet;:int64:}.0, Wf1, Wf2, Wc1, Wc2, Wi1, Wi2, Wo1, Wo2, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0, GpuDimShuffle{x,0}.0), '\n', 'numpy.dtype has the wrong size, try recompiling')
There is a lot I don't understand. Please advise, thanks!
