
lstm-crf-pytorch's Introduction

LSTM-CRF in PyTorch

A minimal PyTorch (1.7.1) implementation of bidirectional LSTM-CRF for sequence labelling.

Supported features:

  • Mini-batch training with CUDA
  • Lookup, CNNs, RNNs and/or self-attention in the embedding layer
  • Hierarchical recurrent encoding (HRE)
  • A PyTorch implementation of conditional random field (CRF)
  • Vectorized computation of CRF loss (a sketch follows this list)
  • Vectorized Viterbi decoding
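
The CRF loss is the log-partition function of the tag lattice minus the score of the gold path. A minimal sketch of the vectorized forward recursion that computes the log-partition term, assuming emission scores of shape [batch, time, tags], a transition matrix trans[i, j] holding the score of moving from tag j to tag i, and a 0/1 padding mask (names are illustrative, not this repository's exact API):

    import torch

    def crf_log_partition(emissions, transitions, mask):
        # emissions: [B, T, K], transitions: [K, K] with transitions[i, j] =
        # score of moving from tag j to tag i, mask: [B, T] (1 = real token)
        mask = mask.float()
        B, T, K = emissions.size()
        score = emissions[:, 0]  # [B, K] scores of paths ending in each tag at t = 0
        for t in range(1, T):
            # [B, K, K]: previous score (last dim) + transition + current emission
            broadcast = score.unsqueeze(1) + transitions.unsqueeze(0) + emissions[:, t].unsqueeze(2)
            next_score = torch.logsumexp(broadcast, dim=2)  # marginalize over previous tags
            m = mask[:, t].unsqueeze(1)
            score = next_score * m + score * (1 - m)  # keep old score on padding
        return torch.logsumexp(score, dim=1)  # [B] log-partition per sequence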

Usage

Training data should be formatted as below:

token/tag token/tag token/tag ...
token/tag token/tag token/tag ...
...

For more detail, see README.md in each subdirectory.

To prepare data:

python3 prepare.py training_data

To train:

python3 train.py model char_to_idx word_to_idx tag_to_idx training_data.csv (validation_data) num_epoch

To predict:

python3 predict.py model.epochN word_to_idx tag_to_idx test_data

To evaluate:

python3 evaluate.py model.epochN word_to_idx tag_to_idx test_data

References

Zhiheng Huang, Wei Xu, Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991.

Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, Sachindra Joshi. 2018. Dialogue Act Sequence Labeling Using Hierarchical Encoder with CRF. In AAAI.

Xuezhe Ma, Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv:1603.01354.

Shotaro Misawa, Motoki Taniguchi, Yasuhide Miura, Tomoko Ohkuma. 2017. Character-based Bidirectional LSTM-CRF with Words and Characters for Japanese Named Entity Recognition. In Proceedings of the 1st Workshop on Subword and Character Level Models in NLP.

Yan Shao, Christian Hardmeier, Jörg Tiedemann, Joakim Nivre. 2017. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF. arXiv:1704.01314.

Slav Petrov, Dipanjan Das, Ryan McDonald. 2011. A Universal Part-of-Speech Tagset. arXiv:1104.2086.

Nils Reimers, Iryna Gurevych. 2017. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. arXiv:1707.06799.

Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. Neural Models for Sequence Chunking. In AAAI.

Zenan Zhai, Dat Quoc Nguyen, Karin Verspoor. 2018. Comparing CNN and LSTM Character-level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition. arXiv:1808.08450.

lstm-crf-pytorch's People

Contributors

threelittlemonkeys

lstm-crf-pytorch's Issues

example training data?

Thank you for the instructions on training data formatting.

Could you include small examples of this for your 3 use cases? I want to design my own tagset.

embedding

How can I use my own pretrained embedding text?

Question about rnn-crf and lstm-crf

Hi,

I saw that the name of the project is "lstm-crf-pytorch", but while running the code I found that you use an rnn-crf structure. I am wondering why that is.

Thank you in advance!

When using lstm_crf, the loss is erratic: sometimes positive, sometimes negative, and sometimes a very large number. Why?

    def _get_lstm_feature(self, sent, mask):
        sent_embed = self.word_lookup(sent)
        sent_embed = fun.dropout(sent_embed, self.dropout)

        lstm_ip, id_unsort = self.get_pack_inputs(sent_embed, mask.sum(1))
        lstm_out, _ = self.lstm(lstm_ip)
        lstm_out = self.get_pad_outputs(lstm_out, sent_embed.size(1), id_unsort)

        lstm_out = fun.relu(lstm_out)
        hidden = self.hidden_linear(lstm_out)
        hidden = fun.relu(fun.dropout(hidden, self.dropout))
        feats = self.tag_linear(hidden)

        return feats

    @staticmethod
    def log_sum_exp(x, axis):
        x_max, _ = torch.max(x, axis, keepdim=True)
        x_max_expand = x_max.expand(x.size())

        return x_max + torch.log(torch.sum(torch.exp(x - x_max_expand), axis, keepdim=True))

    def forward(self, emits, masks):
        batch_size, step_num, tag_size = emits.size()
        lengths = masks.sum(1).tolist()

        _mask = torch.zeros(masks.size()).to(self.device)
        for b in range(batch_size):
            _mask[b][lengths[b]-1] = 1
        _mask = _mask.byte()

        _mask = _mask.view(batch_size, step_num, 1).expand_as(emits)
        batch_trans = torch.cat([self.transitions for _ in range(batch_size)], 0).contiguous()
        batch_trans = batch_trans.view(batch_size, tag_size, tag_size)

        forward_var = torch.cat([self.alpha_0 for _ in range(batch_size)], 0).contiguous()
        forward_var = forward_var.view(batch_size, tag_size, 1)
        forward_var = forward_var + emits[:, 0, :].view(batch_size, tag_size, 1)

        alpha = [forward_var]
        max_scores = [torch.squeeze(forward_var)]
        max_scores_pre = []
        for t in range(1, step_num):
            forward_var = forward_var.view(batch_size, tag_size, 1).expand(batch_size, tag_size, tag_size)
            current = emits[:, t, :].view(batch_size, 1, tag_size).contiguous()
            # alpha_t[i, j] = pre_score[j](col) + score(j->i) + score[i](row)
            alpha_t = forward_var + current.expand(batch_size, tag_size, tag_size) + batch_trans

            # cur_max_score score[i, :], cur_max_idx is j
            cur_max_score, cur_max_idx = torch.max(alpha_t, 1, keepdim=True)
            max_scores.append(torch.squeeze(cur_max_score, 1))
            max_scores_pre.append(torch.squeeze(cur_max_idx, 1))

            log_alpha_t = self.log_sum_exp(alpha_t, 1).view(batch_size, tag_size, 1)
            forward_var = log_alpha_t
            alpha.append(log_alpha_t)

        alphas = torch.cat(alpha, 0).view(batch_size, step_num, tag_size)
        last_alphas = torch.masked_select(alphas, _mask).view(batch_size, tag_size, 1)
        # forward var max value is add.....
        alpha_z = torch.sum(self.log_sum_exp(last_alphas, 1))

        return alpha_z, max_scores, max_scores_pre

    def score_path(self, emits, tags, mask):
        sent_len = mask.sum(1).tolist()
        batch_size, step_num = tags.size()

        scores = torch.FloatTensor([0]).to(self.device)

        for b in range(batch_size):
            cur_tag = tags[b][0].item()
            scores += self.alpha_0[cur_tag][0] + emits[b][0][cur_tag]
            for step in range(1, step_num):
                pre_tag = cur_tag
                cur_tag = tags[b][step].item()
                if step < sent_len[b]:
                    scores += (self.transitions[pre_tag][cur_tag] + emits[b][step][cur_tag])
                else:
                    break

        return scores

    @staticmethod
    def viterbi(max_scores, max_score_pre, mask):
        sent_lenth = mask.sum(1).tolist()

        best_paths = []
        batch_size = mask.size(0)
        for b in range(batch_size):
            cur_path = []
            _, last_max_node = torch.max(max_scores[sent_lenth[b]-1][b], 0, keepdim=True)
            last_max_node = last_max_node.item()
            cur_path.append(last_max_node)
            for t in range(sent_lenth[b]-2, -1, -1):
                last_max_node = max_score_pre[t][b][last_max_node]
                last_max_node = last_max_node.item()
                cur_path.append(last_max_node)

            cur_path = cur_path[::-1]
            best_paths.append(cur_path)

        return best_paths

    def get_arg(self, inps):
        sent, mask, _ = inps
        feats = self._get_lstm_feature(sent, mask)
        _, max_scores, max_scores_pre = self.forward(feats, mask)
        best_paths = self.viterbi(max_scores, max_scores_pre, mask)
        return best_paths

    def get_loss(self, inps):
        sent, mask, args = inps
        feats = self._get_lstm_feature(sent, mask)
        forward_score, _, _ = self.forward(feats, mask)
        gold_score = self.score_path(feats, args, mask)

        return forward_score - gold_score
train---epoch: 5, learn rate: 0.001000, global step: 1009
loss: -382.26562500
macro arg---P: 0.560069, R: 0.111111, F: 0.079778
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1010
loss: -1400.46484375
macro arg---P: 0.559974, R: 0.111111, F: 0.079770
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1011
loss: 1773.91503906
macro arg---P: 0.559594, R: 0.111111, F: 0.079735
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1012
loss: 807.89257812
macro arg---P: 0.559650, R: 0.111111, F: 0.079740
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1013
loss: 1946.49560547
macro arg---P: 0.559642, R: 0.111111, F: 0.079739
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1014
loss: -3450.06152344
macro arg---P: 0.559863, R: 0.111111, F: 0.079760
---------------------------------------

Are there any publicly available training-data datasets?

Hello,
I'm a novice coming from image processing, but now I want to learn how to use CRF and run your code. However, I don't know much about NLP datasets. Do you know of any available datasets that can be used directly?
Thank you very much!

Different results per run

I have added precision and recall calculations for different named entity types, and every time I run predict.py I get slightly different results. I am sure that there is no randomness in the data or in the calculation of the metrics. Is it OK that results differ slightly across runs (by at most 1%)? What could be the reason for that?
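
One common source of run-to-run variation at prediction time is leaving the model in training mode so that dropout stays active, combined with unseeded random number generators. A minimal, hypothetical sketch of how one would typically rule this out in PyTorch (not this repository's own code):

    import random
    import torch

    def make_deterministic(model, seed=0):
        model.eval()             # disable dropout and other train-only behaviour
        torch.manual_seed(seed)  # pin PyTorch's RNG
        random.seed(seed)        # pin Python's RNG

    # call make_deterministic(model) once before prediction and run inference
    # under torch.no_grad() so repeated runs produce identical outputs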

sentence is not padded with SOS, but labels are

Hello,
Thank you for your nice code. I am confused regarding a few things:

  1. Input sentences have an EOS marker but no SOS marker, while labels are marked with both SOS and EOS. Why are we not marking SOS in the sentences as well?
  2. Say we have a test/dev sentence with an actual length of 5. The sentence will then carry an EOS marker, so its length is 6, while the gold label sequence carries both SOS and EOS markers, so its length is 7. But the generated label sequence has length 6. Does the generated label sequence include only the EOS or only the SOS? Depending on that, during evaluation I need to ignore either the 0th tag or the last tag.

Could you kindly give some insight.

data format (slash in tokens?)

What if I am using my own custom tag system and want to train on the token/tag pair 3/4/NUM? The word is 3/4 and the custom tag is NUM. How do I include a forward slash?

This seems to break prepare.py, with this traceback:

Traceback (most recent call last):
  File "../../prepare.py", line 62, in <module>
    data, cti, wti, tti = load_data()
  File "../../prepare.py", line 29, in load_data
    x, y = load_line(line, cti, wti, tti)
  File "../../prepare.py", line 43, in load_line
    w, tag = (w, None) if HRE else re.split("/(?=[^/]+$)", w)
ValueError: not enough values to unpack (expected 2, got 1)

EDIT: The error may be due to something else about my training data, I am not sure what. Does this format look correct to you?

3/4/QTY cup/UNIT unsalted/NAME butter/NAME at/NOTE room/NOTE temperature/NOTE
half/QTY teaspoon/UNIT vanilla/NAME extract/NAME
...
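
For reference, the re.split("/(?=[^/]+$)", w) call shown in the traceback splits each token on its last slash only, so a word that itself contains slashes (like 3/4) is handled; the unpack error is raised only for a token with no slash at all. A small standalone check:

    import re

    for w in ["3/4/QTY", "cup/UNIT", "unsalted/NAME", "room"]:
        print(w, "->", re.split("/(?=[^/]+$)", w))  # split on the last "/" only
    # 3/4/QTY -> ['3/4', 'QTY'], cup/UNIT -> ['cup', 'UNIT'],
    # but room (no slash) -> ['room'], which is what triggers
    # "not enough values to unpack (expected 2, got 1)"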

the size of batch_size

I want to run sentence_classification.
My GPU is a 1050 Ti with 4 GB of memory.
It only runs when I set batch_size to 1, and even then it is very slow. I do not know why the efficiency is so low.

the code was running on the GPU.

I also found that it was running very slowly even though the code was running on the GPU, like issue 9.
I do not know the reason. Could it be that the data size is too big? It is only 2400.

train error

After successfully using prepare.py, train.py gives this error.

Traceback (most recent call last):
  File "train.py", line 67, in <module>
    train()
  File "train.py", line 35, in train
    batch, cti, wti, itt = load_data()
  File "train.py", line 19, in load_data
    xc, xw = zip(*[(list(map(int, xc.split("+"))), int(xw)) for xc, xw in x])
  File "train.py", line 19, in <listcomp>
    xc, xw = zip(*[(list(map(int, xc.split("+"))), int(xw)) for xc, xw in x])
ValueError: invalid literal for int() with base 10: ''

The code runs very slowly on GPU

When I used the code to perform the named entity recognition task, I found that it was running very slowly and the code was running on the GPU.
Is there anything else that needs attention?
Thanks for your help.

transition from last word to <END> tag not added in the score function

Hi, maybe I am wrong, but it is confusing that in the following function SOS_IDX is added to the beginning of the tag sequence while EOS_IDX is not added to the end. As such, the tag sequence looks like <START> A B C D. When calculating the transition score, the loop stops at self.trans[3+1, 3] (i.e. the transition score C->D), so the transition score from D to <END> is never added.

    def score(self, y, y0, mask): # calculate the score of a given sequence
        score = Tensor(BATCH_SIZE).fill_(0.)
        y0 = torch.cat([LongTensor(BATCH_SIZE, 1).fill_(SOS_IDX), y0], 1)
        for t in range(y.size(1)): # iterate through the sequence
            mask_t = mask[:, t]
            emit = torch.cat([y[b, t, y0[b, t + 1]].unsqueeze(0) for b in range(BATCH_SIZE)])
            trans = torch.cat([self.trans[seq[t + 1], seq[t]].unsqueeze(0) for seq in y0]) * mask_t
            score = score + emit + trans
        return score
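
If that final transition really is missing, a hedged sketch of one way to add it after the loop, assuming the same conventions as the quoted code (mask marks valid positions, self.trans[i, j] is the score of moving from tag j to tag i, and EOS_IDX is the index of the <EOS> tag); this is an illustration, not the repository's actual fix:

        # transition from each sequence's last real tag to EOS; indices into y0
        # are offset by 1 because SOS was prepended above
        last_idx = mask.sum(1).long() - 1                                # [B] position of last real token
        last_tag = y0.gather(1, (last_idx + 1).unsqueeze(1)).squeeze(1)  # [B] tag at that position
        score = score + self.trans[EOS_IDX, last_tag]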

the meaning of self.rnn

in embedding.py

    if HRE:
        self.sent_embed = self.rnn(EMBED_SIZE, EMBED_SIZE, True)

Why is vocab_size EMBED_SIZE? Isn't it supposed to be the size of the vocabulary, e.g. 10000 or more?
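
For what it's worth, in a hierarchical setup the sentence-level RNN usually consumes word embedding vectors rather than token indices, so its first argument is an input feature size (EMBED_SIZE), not a vocabulary size; only the lookup table depends on the vocabulary. A minimal, hypothetical sketch of that general pattern (not this repository's exact embedding.py):

    import torch
    import torch.nn as nn

    EMBED_SIZE = 300
    VOCAB_SIZE = 10000  # the embedding lookup, not the RNN, depends on vocabulary size

    word_embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
    sent_encoder = nn.LSTM(input_size=EMBED_SIZE, hidden_size=EMBED_SIZE, batch_first=True)

    tokens = torch.randint(0, VOCAB_SIZE, (2, 7))  # [batch, seq_len] word indices
    vectors = word_embed(tokens)                   # [batch, seq_len, EMBED_SIZE]
    _, (h, _) = sent_encoder(vectors)              # h: [1, batch, EMBED_SIZE]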
