chilynn / sequence-labeling


Python 100.00%


sequence-labeling's Issues

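Could you provide a theoretical tutorial for the CRF and Viterbi parts of your implementation?

Could you provide a theoretical tutorial for the CRF and Viterbi parts of your implementation? I am having some difficulty reading the code. Thank you!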

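About stop words being removed

Hello, after getting the model to run successfully, I found that stop words and punctuation are removed during word segmentation. How is this handled, and where exactly in the code does it happen?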

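1) The first problem is in BILSTM_CRF.py. I am using the latest version of TensorFlow, and the error points to the line from tensorflow.models.rnn import rnn, rnn_cell with the message "This module is deprecated. Use tf.nn.rnn_* instead." Following the hint in rnn.py, I changed it to from tensorflow.python.ops import rnn, rnn_cell and the error went away. Is this change correct? I am new to TensorFlow, so I am not quite sure.
2) The second problem is an error raised when reading data with the pandas read_csv() function: parser_f() got an unexpected keyword argument 'skip_blank_lines'. It is raised in helper.py at calls such as df_train = pd.read_csv(). I checked, and my pandas version does have this parameter; also, the code runs if I delete the argument skip_blank_lines=False. Do you know roughly how to solve this, or what might be causing it?
Thanks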

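About data preprocessing

Hello, since your .in file does not show the input format of your training data, I would like to ask: for a long sentence containing several clauses, do you feed it in as one complete sentence, or split it at punctuation marks into several short sentences that are fed in separately?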

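Which papers or blog posts did you refer to when implementing the CRF part of the code?

There are many lines of code that I really cannot understand...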

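Question about the code changes needed after the TF version update

Hello, I am a TF beginner. After I upgraded TF to 0.11, quite a few of the library's APIs changed. The main problem is in BILSTM_CRF.py:
the package behind from tensorflow.models.rnn import rnn, rnn_cell has changed:
raise ImportError("This module is deprecated. Use tf.nn.rnn_* instead.")
For me, without a thorough understanding of the TF model, changing the code by hand is rather difficult...
Could you find some time to point out the code changes needed after the TF upgrade? :)
Thanks~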

Could you give a few more examples of the data?

@chilynn Thanks a lot!

About datasets

First of all, thank you for the good implementation. I have some trouble understanding the data in your training and testing files. Could you explain the data in those files in more detail? I don't understand why all of them look the same:
N B
B M
A E
D O

Z O
Z O
Z O
Z O
Z O

In my opinion, you only showed the format and did not focus on the values, right?
I would be most grateful if you could create and share a small data example (in English) of the input and output files. It would make things easier for me to understand.
Thank you in advance!
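
For readers with the same question, here is a minimal sketch of how a file in this layout could be read. It assumes each non-empty line holds one character and its label separated by whitespace, and that a blank line ends a sentence. The function name read_tagged_sentences is hypothetical and is not part of the repository.

```python
def read_tagged_sentences(text):
    """Split 'char label' lines into sentences; a blank line ends a sentence."""
    sentences, current = [], []
    for line in text.splitlines():
        parts = line.split()
        if not parts:                  # blank line: sentence boundary
            if current:
                sentences.append(current)
                current = []
            continue
        char, label = parts[0], parts[-1]
        current.append((char, label))
    if current:                        # flush the last sentence
        sentences.append(current)
    return sentences


# The sample quoted in the question above
sample = "N B\nB M\nA E\nD O\n\nZ O\nZ O\nZ O\nZ O\nZ O\n"
sentences = read_tagged_sentences(sample)
```

Under these assumptions the sample yields two sentences: one of four characters tagged B, M, E, O and one of five characters all tagged O.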

What do path and W refer to in the HMM predict function?

A question for @chilynn:
https://github.com/chilynn/sequence-labeling/blob/master/code/hmm/hmm.py#L79
Which algorithm from the book does predict implement?
Thanks a lot
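
Decoding in an HMM is usually done with the Viterbi algorithm, so a generic sketch may help readers here. This is not the code at the linked line, and whether hmm.py follows the same structure is an assumption. The names viterbi, score, and back are illustrative; in implementations of this kind, a table of best scores and a record of best predecessors or paths often play the roles that names like W and path suggest.

```python
import math


def viterbi(obs, states, log_start, log_trans, log_emit):
    """Return the most likely state sequence for obs (scores are log-probabilities)."""
    # score[t][s]: best log-probability of any path that ends in state s at time t
    score = [{s: log_start[s] + log_emit[s][obs[0]] for s in states}]
    back = []  # back[t - 1][s]: best predecessor of state s at time t
    for t in range(1, len(obs)):
        score.append({})
        back.append({})
        for s in states:
            prev = max(states, key=lambda p: score[t - 1][p] + log_trans[p][s])
            score[t][s] = score[t - 1][prev] + log_trans[prev][s] + log_emit[s][obs[t]]
            back[t - 1][s] = prev
    # Trace back from the best final state
    path = [max(states, key=lambda s: score[-1][s])]
    for pointers in reversed(back):
        path.append(pointers[path[-1]])
    path.reverse()
    return path


def to_log(table):
    """Convert a dict of probabilities to log-probabilities."""
    return {key: math.log(p) for key, p in table.items()}


# Toy model with made-up probabilities
states = ["X", "Y"]
log_start = to_log({"X": 0.6, "Y": 0.4})
log_trans = {"X": to_log({"X": 0.7, "Y": 0.3}), "Y": to_log({"X": 0.4, "Y": 0.6})}
log_emit = {"X": to_log({"a": 0.9, "b": 0.1}), "Y": to_log({"a": 0.2, "b": 0.8})}
best_path = viterbi(["a", "b", "a"], states, log_start, log_trans, log_emit)
```

For this toy model the best path is X, Y, X.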

Hierarchical BiLSTM + CRF

Hi, I have a question: your implementation only has char-level embeddings. How could one implement in TensorFlow the Hierarchical BiLSTM + CRF from the original paper, with char level + word level?

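Has the author run any NLP tasks with this program? How were the results?

@chilynn, hello. I tried your program and ran word segmentation on 20,000 data samples; the F-score is around 93% (a plain CRF alone reaches 95%). I have only just started with deep learning, so there are probably some tricks I am missing. Chen Liren (陈利人) recently published a blog post in which BiLSTM + CRF reaches 97.5% accuracy.
The post is titled "97.5%准确率的深度学习中文分词（字嵌入+Bi-LSTM+CRF）" (Deep-learning Chinese word segmentation with 97.5% accuracy: character embeddings + Bi-LSTM + CRF).
Has the author run any NLP tasks? How do the results compare with the state of the art?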

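About having more than four labels

I now want to recognize six labels. I tried it, and it seems I cannot simply change the trailing 6, 6 in [self.batch_size, 6, 6] to 8, 8. Where should I start if I want to modify the code?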

confused about the task

Hi, I'm reading your code but I'm confused about the task it deals with.

The test output format example is

NBAD<@>NBA

ZZZZZ<@>

Does this mean the task is to delete the characters labeled O? So it is used for NER only?

And if I want to use it to segment sequences, i.e., tag the characters as B M E S and then use B (begin), E (end), and S (single) to segment the sentence, where should I modify the code? Just BILSTM_CRF.test?

Thanks a lot!
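
As a rough illustration of the segmentation step described in this question, the sketch below turns a predicted B/M/E/S tag sequence into words. It only covers post-processing of tags; it does not show what would need to change inside BILSTM_CRF.py. The function name segment_by_tags is hypothetical.

```python
def segment_by_tags(chars, tags):
    """Group characters into words using B (begin), M (middle), E (end), S (single)."""
    words, current = [], ""
    for ch, tag in zip(chars, tags):
        if tag in ("B", "S") and current:
            words.append(current)      # a new word starts: close any unfinished one
            current = ""
        current += ch
        if tag in ("E", "S"):
            words.append(current)      # the word ends at this character
            current = ""
    if current:                        # tolerate a sequence that ends mid-word
        words.append(current)
    return words


words = segment_by_tags("ABCDE", ["B", "E", "S", "B", "E"])
```

Here the five characters are split into the words AB, C, and DE. Malformed tag sequences (for example two B tags in a row) are closed off rather than rejected, which is a design choice of this sketch.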

About y_train_weight_batch

for iteration in range(num_iterations):
    # train
    X_train_batch, y_train_batch = helper.nextBatch(X_train, y_train, start_index=iteration * self.batch_size, batch_size=self.batch_size)
    y_train_weight_batch = 1 + np.array((y_train_batch == label2id['B']) | (y_train_batch == label2id['E']), float)
    transition_batch = helper.getTransition(y_train_batch)

Hello, what is the meaning of y_train_weight_batch at line 173 of BILSTM_CRF.py?
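
Read literally, the expression yields 2.0 at positions whose label is B or E and 1.0 everywhere else. Presumably this makes boundary tags count more in the loss, but how the weights are used later is an assumption not confirmed by the snippet. The sketch below mirrors the element-wise computation with plain Python lists instead of NumPy; the label ids and the helper name boundary_weights are made up for illustration.

```python
label2id = {"B": 1, "M": 2, "E": 3, "O": 4}   # hypothetical ids, for illustration only


def boundary_weights(y_batch, label2id):
    """Mirror 1 + float(y == B or y == E) element-wise on rows of label ids."""
    boundary = {label2id["B"], label2id["E"]}
    return [[1.0 + float(y in boundary) for y in row] for row in y_batch]


y_train_batch = [[1, 2, 3, 4], [4, 4, 1, 3]]
weights = boundary_weights(y_train_batch, label2id)
# weights == [[2.0, 1.0, 2.0, 1.0], [1.0, 1.0, 2.0, 2.0]]
```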

Question about computing the point score

self.point_score = tf.gather(tf.reshape(self.tags_scores, [-1]), tf.range(0, self.batch_size * self.num_steps) * self.num_classes + tf.reshape(self.targets,[self.batch_size * self.num_steps]))
self.point_score *= self.mask

Here the labels in targets are counted from 1, while the classes in tags_scores are counted from 0... Won't that cause a problem? Or did I miss something?
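
To make the indexing concrete, here is a small pure-Python sketch of the same flat-index arithmetic: position * num_classes + target selects column target of that position's row. Labels counted from 1 would then be consistent only if column 0 is an extra class (for example padding) that num_classes already includes; whether that holds in the repository is not confirmed here. The helper name gather_point_scores is hypothetical.

```python
def gather_point_scores(tags_scores, targets, num_classes):
    """Pick one score per position from a flattened [positions, num_classes] table."""
    flat = [score for row in tags_scores for score in row]
    indices = [pos * num_classes + target for pos, target in enumerate(targets)]
    return [flat[i] for i in indices]


# Two positions, three classes; each score encodes (position, class) as pos * 10 + class
tags_scores = [[0, 1, 2], [10, 11, 12]]
picked = gather_point_scores(tags_scores, [1, 2], num_classes=3)
# picked == [1, 12]: target t selects column t of its own row

# A target equal to num_classes spills into column 0 of the next row
spilled = gather_point_scores(tags_scores, [3, 0], num_classes=3)
# spilled == [10, 10]
```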

Why does my loss drop below 0?

The loss is supposed to be close to, and greater than, 0. Since self.loss = -(self.target_path_score - self.total_path_score), target_path_score should be smaller than total_path_score.
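
The reasoning can be checked by brute force on a tiny example: the total score is a log-sum-exp over all tag paths, and the target path is one of them, so the total can never be smaller than the target score. A negative loss therefore often points to the two scores being computed inconsistently (for example different masking or padding), though that diagnosis is a guess. The emission and transition numbers below are made up, and the function names are illustrative.

```python
import itertools
import math


def path_score(emissions, transitions, path):
    """Sum of emission scores plus transition scores along one tag path."""
    score = sum(emissions[t][tag] for t, tag in enumerate(path))
    score += sum(transitions[a][b] for a, b in zip(path, path[1:]))
    return score


def log_partition(emissions, transitions):
    """log of the summed exp(score) over every possible path (brute force)."""
    num_tags = len(transitions)
    scores = [path_score(emissions, transitions, p)
              for p in itertools.product(range(num_tags), repeat=len(emissions))]
    top = max(scores)                  # subtract the max for numerical stability
    return top + math.log(sum(math.exp(s - top) for s in scores))


emissions = [[1.0, -0.5], [0.2, 0.7], [-1.0, 2.0]]   # 3 steps, 2 tags
transitions = [[0.5, -1.0], [0.3, 0.8]]
gold = (0, 1, 1)
loss = log_partition(emissions, transitions) - path_score(emissions, transitions, gold)
```

With a consistent pair of scores, loss is non-negative for every choice of gold path.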

word-embedding file

If I don't import a word-embedding file, are the word vectors generated randomly?

TF-implemented CRF compared with the API's CRF

Hello, I have recently been studying CRFs and tried to use TensorFlow to implement a linear-chain CRF.

I have read your blog post about this repo. I also checked TensorFlow's API, which has a module for constructing a linear-chain CRF.

Reading your source code and TensorFlow's source code, I found that both use tag_score for CRF decoding. I was wondering whether there is a performance gap between the two methods?

word embedding

Hello, may I ask whether word-level training data like the following can be constructed:
你好 o
** u

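A question about the structure of this BI-LSTM-CRF model

Hello, is the output of the BI-LSTM hidden layer fed directly into the CRF linear layer, or is my understanding wrong? How are the two connected?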

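Setting is_crf=False at initialization leaves loss uninitialized and causes an error

I set is_crf=False at initialization; as a result, loss was never initialized and an error occurred.
Is it possible to turn off the CRF in this code?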

Why is my loss very small and below 0?

Is there an error in the code,
or is something else wrong?
If you know how to fix it, please tell me.
Thanks

    @staticmethod
    def argmax(vec):
        _, idx = torch.max(vec, 1)
        return idx.item()

    def log_sum_exp(self, vec):
        max_score = vec[0, self.argmax(vec)]
        max_score_broadcast = max_score.view(1, -1).expand(1, vec.size()[1])
        return max_score + torch.log(torch.sum(torch.exp(vec - max_score_broadcast)))

    def _forward_alg(self, feats, mask):
        batch_size, step_num, tag_size = feats.size()
        lengths = mask.sum(1).tolist()
        alpha = torch.FloatTensor([0]).to(self.device)
        for b in range(batch_size):
            init_alphas = torch.full((1, self.tag_size), -10000.).to(self.device)
            init_alphas[0][self.tag_to_ix[START_TAG]] = 0.

            forward_var = init_alphas

            for k in range(lengths[b]):
                alphas_t = []  # The forward tensors at this timestep
                for next_tag in range(self.tag_size):
                    emit_score = feats[b][k][next_tag].view(1, -1).expand(1, self.tag_size)
                    trans_score = self.transitions[next_tag].view(1, -1)
                    next_tag_var = forward_var + trans_score + emit_score
                    alphas_t.append(self.log_sum_exp(next_tag_var).view(1))
                forward_var = torch.cat(alphas_t).view(1, -1)
            terminal_var = forward_var + self.transitions[self.tag_to_ix[STOP_TAG]]
            alpha += self.log_sum_exp(terminal_var)
        return alpha

    def _score_sentence(self, features, arguments, mask):
        # Gives the score of a provided tag sequence
        batch_size, step_num, tag_size = features.size()
        lengths = mask.sum(1).tolist()
        score = torch.zeros(1).to(self.device)
        for b in range(batch_size):
            tags = torch.Tensor(arguments[b]).long().to(self.device)
            tags = torch.cat([torch.Tensor([self.tag_to_ix[START_TAG]]).long().to(self.device), tags])
            feats = features[b][:lengths[b], :]
            for i, feat in enumerate(feats):
                score = score + self.transitions[tags[i + 1], tags[i]] + feat[tags[i + 1]]
            score = score + self.transitions[self.tag_to_ix[STOP_TAG], tags[-1]]
        return score

    def neg_log_likelihood(self, inps):
        sent, mask, tags = inps
        feats = self._get_lstm_features(sent, mask)
        forward_score = self._forward_alg(feats, mask)
        gold_score = self._score_sentence(feats, tags, mask)
        return forward_score - gold_score

train---epoch: 8, learn rate: 0.001000, global step: 1525
loss: -13789.84375000
macro arg---P: 0.592268, R: 0.567828, F: 0.579791
---------------------------------------
train---epoch: 8, learn rate: 0.001000, global step: 1526
loss: -370.70312500
macro arg---P: 0.573383, R: 0.599768, F: 0.586279
---------------------------------------
train---epoch: 8, learn rate: 0.001000, global step: 1527
loss: -9303.57812500
macro arg---P: 0.578675, R: 0.598240, F: 0.588295
---------------------------------------
train---epoch: 8, learn rate: 0.001000, global step: 1528
loss: 718.17187500
macro arg---P: 0.585216, R: 0.577113, F: 0.581136
---------------------------------------
train---epoch: 8, learn rate: 0.001000, global step: 1529
loss: -2091.90625000
macro arg---P: 0.601942, R: 0.591456, F: 0.596653
---------------------------------------
train---epoch: 8, learn rate: 0.001000, global step: 1530
loss: -12369.98437500
macro arg---P: 0.602512, R: 0.591223, F: 0.596814
---------------------------------------

ImportError: No module named models.rnn

Hi:
python BILSTM_CRF.py
Traceback (most recent call last):
File "BILSTM_CRF.py", line 5, in
from tensorflow.models.rnn import rnn, rnn_cell
ImportError: No module named models.rnn

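With the NBA example bundled with the code, no saved model was found after training

Hello, I trained the model using just the train.in data but did not see any saved model output, and running test.py raised an error. My guess is that the model file was not found.
Also, roughly how long does training take? On my laptop, using only the CPU, 100 epochs take half an hour, and the final accuracy is 0.00. Is this normal?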

An error occurs during model initialization; what is the cause?

All parameters are left at their defaults.

Training data:
放推荐动词_S
一 O_S
首 O_S
我歌曲名_B
们歌曲名_M
的歌曲名_M
世歌曲名_M
代歌曲名_E
。 O_S

适 O_B
合 O_E
老其他实体_B
人其他实体_E
的 O_S
爱歌曲名_B
情歌曲名_M
的歌曲名_M
结歌曲名_M
束歌曲名_E
。 O_S

The exact error:
File "train.py", line 47, in <module>
    model = BILSTM_CRF(num_chars=num_chars, num_classes=num_classes, num_steps=num_steps, num_epochs=num_epochs, embedding_matrix=embedding_matrix, is_training=True)
File "/home/lirang/workspace/sequence-labeling/code/bilstm_crf/BILSTM_CRF.py", line 103, in __init__
    self.total_path_score, self.max_scores, self.max_scores_pre = self.forward(self.observations, self.transitions, self.length)
File "/home/lirang/workspace/sequence-labeling/code/bilstm_crf/BILSTM_CRF.py", line 121, in forward
    transitions = tf.reshape(tf.concat(0, [transitions] * self.batch_size), [self.batch_size, 6, 6])
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/gen_array_ops.py", line 1092, in reshape
    name=name)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/op_def_library.py", line 655, in apply_op
    op_def=op_def)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 2156, in create_op
    set_shapes_for_outputs(ret)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/framework/ops.py", line 1612, in set_shapes_for_outputs
    shapes = shape_func(op)
File "/usr/local/lib/python2.7/dist-packages/tensorflow/python/ops/array_ops.py", line 1115, in _ReshapeShape
    % (num_elements.value, new_shape, np.prod(new_shape)))
ValueError: Cannot reshape a tensor with 512 elements to shape [128, 6, 6] (4608 elements)

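Does the word embedding support Chinese?

I built a Chinese word-vector file in which each line holds a word, single character, or phrase obtained from word segmentation, together with its vector. When I used it, I got a segmentation fault (core dumped). Is this usage supported, or is there something wrong with the word-vector file I built?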

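A few small questions about the CRF layer in the model

Hello:
I have a few questions about parts of the code in the implementation and hope you can answer them:
When implementing the batched CRF operation, are the sentences in a batch concatenated and treated as one sentence?

I ask because I see this operation where point_score is computed:
        self.point_score = tf.gather(tf.reshape(self.tags_scores, [-1]), tf.range(0, self.batch_size * self.num_steps) * self.num_classes + tf.reshape(self.targets,[self.batch_size * self.num_steps]))
Is my understanding correct?

Also, I rewrote part of the code, and it now runs on a dataset with multi-class labels, but the loss becomes negative. Is this related to the constants set in the CRF layer?
According to the derivation, p(y|X) is a value below 1, so its log is negative, and negating it to obtain the loss should give a positive number.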

Problem with the transitions reshape

Hello, I get the following error when running the code:

ValueError: Cannot reshape a tensor with 512 elements to shape [128,6,6] (4608 elements) for 'model/Reshape_10' (op: 'Reshape') with input shapes: [256,2], [3].

transitions = tf.reshape(tf.concat(0, [transitions] * self.batch_size), [self.batch_size, 6, 6])

My input data follows the format described in the documentation.
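
The numbers in the error message can be decoded with a little arithmetic: 512 = 128 * 2 * 2 and 4608 = 128 * 6 * 6, and the reported input shape [256, 2] is consistent with 128 stacked 2 x 2 matrices. So the transitions tensor appears to be 2 x 2 for this data, while the reshape hard-codes 6 x 6. One plausible direction is to derive the reshape size from the actual transition matrix instead of the literal 6, though that is an untested suggestion. The helper name square_side_per_item is hypothetical.

```python
import math


def square_side_per_item(num_elements, batch_size):
    """Return n if num_elements == batch_size * n * n, else None."""
    per_item, remainder = divmod(num_elements, batch_size)
    if remainder:
        return None
    side = math.isqrt(per_item)
    return side if side * side == per_item else None


actual_side = square_side_per_item(512, 128)      # size implied by the real tensor
expected_side = square_side_per_item(4608, 128)   # size the reshape asks for
```

Here actual_side is 2 and expected_side is 6, which is exactly the mismatch the ValueError reports.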