
lstm-crf-pytorch's Introduction

LSTM-CRF in PyTorch

A minimal PyTorch (1.7.1) implementation of bidirectional LSTM-CRF for sequence labelling.

Supported features:

  • Mini-batch training with CUDA
  • Lookup, CNNs, RNNs and/or self-attention in the embedding layer
  • Hierarchical recurrent encoding (HRE)
  • A PyTorch implementation of conditional random field (CRF)
  • Vectorized computation of CRF loss (a sketch follows this list)
  • Vectorized Viterbi decoding
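
The CRF loss is the log-partition function of the tag lattice minus the score of the gold path. A minimal sketch of the vectorized forward recursion that computes the log-partition term, assuming emission scores of shape [batch, time, tags], a transition matrix trans[i, j] holding the score of moving from tag j to tag i, and a 0/1 padding mask (names are illustrative, not this repository's exact API):

    import torch

    def crf_log_partition(emissions, transitions, mask):
        # emissions: [B, T, K], transitions: [K, K] with transitions[i, j] =
        # score of moving from tag j to tag i, mask: [B, T] (1 = real token)
        mask = mask.float()
        B, T, K = emissions.size()
        score = emissions[:, 0]  # [B, K] scores of paths ending in each tag at t = 0
        for t in range(1, T):
            # [B, K, K]: previous score (last dim) + transition + current emission
            broadcast = score.unsqueeze(1) + transitions.unsqueeze(0) + emissions[:, t].unsqueeze(2)
            next_score = torch.logsumexp(broadcast, dim=2)  # marginalize over previous tags
            m = mask[:, t].unsqueeze(1)
            score = next_score * m + score * (1 - m)  # keep old score on padding
        return torch.logsumexp(score, dim=1)  # [B] log-partition per sequence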

Usage

Training data should be formatted as below:

token/tag token/tag token/tag ...
token/tag token/tag token/tag ...
...

For more detail, see README.md in each subdirectory.

To prepare data:

python3 prepare.py training_data

To train:

python3 train.py model char_to_idx word_to_idx tag_to_idx training_data.csv (validation_data) num_epoch

To predict:

python3 predict.py model.epochN word_to_idx tag_to_idx test_data

To evaluate:

python3 evaluate.py model.epochN word_to_idx tag_to_idx test_data

References

Zhiheng Huang, Wei Xu, Kai Yu. 2015. Bidirectional LSTM-CRF Models for Sequence Tagging. arXiv:1508.01991.

Harshit Kumar, Arvind Agarwal, Riddhiman Dasgupta, Sachindra Joshi. 2018. Dialogue Act Sequence Labeling Using Hierarchical Encoder with CRF. In AAAI.

Xuezhe Ma, Eduard Hovy. 2016. End-to-end Sequence Labeling via Bi-directional LSTM-CNNs-CRF. arXiv:1603.01354.

Shotaro Misawa, Motoki Taniguchi, Yasuhide Miura, Tomoko Ohkuma. 2017. Character-based Bidirectional LSTM-CRF with Words and Characters for Japanese Named Entity Recognition. In Proceedings of the 1st Workshop on Subword and Character Level Models in NLP.

Yan Shao, Christian Hardmeier, Jörg Tiedemann, Joakim Nivre. 2017. Character-based Joint Segmentation and POS Tagging for Chinese using Bidirectional RNN-CRF. arXiv:1704.01314.

Slav Petrov, Dipanjan Das, Ryan McDonald. 2011. A Universal Part-of-Speech Tagset. arXiv:1104.2086.

Nils Reimers, Iryna Gurevych. 2017. Optimal Hyperparameters for Deep LSTM-Networks for Sequence Labeling Tasks. arXiv:1707.06799.

Feifei Zhai, Saloni Potdar, Bing Xiang, Bowen Zhou. 2017. Neural Models for Sequence Chunking. In AAAI.

Zenan Zhai, Dat Quoc Nguyen, Karin Verspoor. 2018. Comparing CNN and LSTM Character-level Embeddings in BiLSTM-CRF Models for Chemical and Disease Named Entity Recognition. arXiv:1808.08450.

lstm-crf-pytorch's People

Contributors

threelittlemonkeys

lstm-crf-pytorch's Issues

example training data?

Thank you for the instructions on training data formatting.

Could you include small examples of this for your 3 use cases? I want to design my own tagset.

embedding

How can I use my own pretrained embedding text?

Question about rnn-crf and lstm-crf

Hi,

I saw that the name of the project is "lstm-crf-pytorch", but while running the code I found that you use an rnn-crf structure. I am wondering why that is.

Thank you in advance!

When using lstm_crf, the loss is erratic: sometimes positive, sometimes negative, and sometimes a very large number. Why?

    def _get_lstm_feature(self, sent, mask):
        sent_embed = self.word_lookup(sent)
        sent_embed = fun.dropout(sent_embed, self.dropout)

        lstm_ip, id_unsort = self.get_pack_inputs(sent_embed, mask.sum(1))
        lstm_out, _ = self.lstm(lstm_ip)
        lstm_out = self.get_pad_outputs(lstm_out, sent_embed.size(1), id_unsort)

        lstm_out = fun.relu(lstm_out)
        hidden = self.hidden_linear(lstm_out)
        hidden = fun.relu(fun.dropout(hidden, self.dropout))
        feats = self.tag_linear(hidden)

        return feats

    @staticmethod
    def log_sum_exp(x, axis):
        x_max, _ = torch.max(x, axis, keepdim=True)
        x_max_expand = x_max.expand(x.size())

        return x_max + torch.log(torch.sum(torch.exp(x - x_max_expand), axis, keepdim=True))

    def forward(self, emits, masks):
        batch_size, step_num, tag_size = emits.size()
        lengths = masks.sum(1).tolist()

        _mask = torch.zeros(masks.size()).to(self.device)
        for b in range(batch_size):
            _mask[b][lengths[b]-1] = 1
        _mask = _mask.byte()

        _mask = _mask.view(batch_size, step_num, 1).expand_as(emits)
        batch_trans = torch.cat([self.transitions for _ in range(batch_size)], 0).contiguous()
        batch_trans = batch_trans.view(batch_size, tag_size, tag_size)

        forward_var = torch.cat([self.alpha_0 for _ in range(batch_size)], 0).contiguous()
        forward_var = forward_var.view(batch_size, tag_size, 1)
        forward_var = forward_var + emits[:, 0, :].view(batch_size, tag_size, 1)

        alpha = [forward_var]
        max_scores = [torch.squeeze(forward_var)]
        max_scores_pre = []
        for t in range(1, step_num):
            forward_var = forward_var.view(batch_size, tag_size, 1).expand(batch_size, tag_size, tag_size)
            current = emits[:, t, :].view(batch_size, 1, tag_size).contiguous()
            # alpha_t[i, j] = pre_score[j](col) + score(j->i) + score[i](row)
            alpha_t = forward_var + current.expand(batch_size, tag_size, tag_size) + batch_trans

            # cur_max_score score[i, :], cur_max_idx is j
            cur_max_score, cur_max_idx = torch.max(alpha_t, 1, keepdim=True)
            max_scores.append(torch.squeeze(cur_max_score, 1))
            max_scores_pre.append(torch.squeeze(cur_max_idx, 1))

            log_alpha_t = self.log_sum_exp(alpha_t, 1).view(batch_size, tag_size, 1)
            forward_var = log_alpha_t
            alpha.append(log_alpha_t)

        alphas = torch.cat(alpha, 0).view(batch_size, step_num, tag_size)
        last_alphas = torch.masked_select(alphas, _mask).view(batch_size, tag_size, 1)
        # forward var max value is add.....
        alpha_z = torch.sum(self.log_sum_exp(last_alphas, 1))

        return alpha_z, max_scores, max_scores_pre

    def score_path(self, emits, tags, mask):
        sent_len = mask.sum(1).tolist()
        batch_size, step_num = tags.size()

        scores = torch.FloatTensor([0]).to(self.device)

        for b in range(batch_size):
            cur_tag = tags[b][0].item()
            scores += self.alpha_0[cur_tag][0] + emits[b][0][cur_tag]
            for step in range(1, step_num):
                pre_tag = cur_tag
                cur_tag = tags[b][step].item()
                if step < sent_len[b]:
                    scores += (self.transitions[pre_tag][cur_tag] + emits[b][step][cur_tag])
                else:
                    break

        return scores

    @staticmethod
    def viterbi(max_scores, max_score_pre, mask):
        sent_lenth = mask.sum(1).tolist()

        best_paths = []
        batch_size = mask.size(0)
        for b in range(batch_size):
            cur_path = []
            _, last_max_node = torch.max(max_scores[sent_lenth[b]-1][b], 0, keepdim=True)
            last_max_node = last_max_node.item()
            cur_path.append(last_max_node)
            for t in range(sent_lenth[b]-2, -1, -1):
                last_max_node = max_score_pre[t][b][last_max_node]
                last_max_node = last_max_node.item()
                cur_path.append(last_max_node)

            cur_path = cur_path[::-1]
            best_paths.append(cur_path)

        return best_paths

    def get_arg(self, inps):
        sent, mask, _ = inps
        feats = self._get_lstm_feature(sent, mask)
        _, max_scores, max_scores_pre = self.forward(feats, mask)
        best_paths = self.viterbi(max_scores, max_scores_pre, mask)
        return best_paths

    def get_loss(self, inps):
        sent, mask, args = inps
        feats = self._get_lstm_feature(sent, mask)
        forward_score, _, _ = self.forward(feats, mask)
        gold_score = self.score_path(feats, args, mask)

        return forward_score - gold_score
train---epoch: 5, learn rate: 0.001000, global step: 1009
loss: -382.26562500
macro arg---P: 0.560069, R: 0.111111, F: 0.079778
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1010
loss: -1400.46484375
macro arg---P: 0.559974, R: 0.111111, F: 0.079770
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1011
loss: 1773.91503906
macro arg---P: 0.559594, R: 0.111111, F: 0.079735
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1012
loss: 807.89257812
macro arg---P: 0.559650, R: 0.111111, F: 0.079740
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1013
loss: 1946.49560547
macro arg---P: 0.559642, R: 0.111111, F: 0.079739
---------------------------------------
train---epoch: 5, learn rate: 0.001000, global step: 1014
loss: -3450.06152344
macro arg---P: 0.559863, R: 0.111111, F: 0.079760
---------------------------------------

Are there any publicly available training-data datasets?

Hello,
I'm a novice coming from image processing, but now I want to learn how to use CRF and run your code. However, I don't know much about NLP datasets. Do you know of any available datasets that can be used directly?
Thank you very much!

Different results per run

I have added precision and recall calculations for different named entity types, and every time I run predict.py I get slightly different results. I am sure that there is no randomness in the data or in the calculation of the metrics. Is it OK that results differ slightly across runs (by at most 1%)? What could be the reason for that?
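
One common source of run-to-run variation at prediction time is leaving the model in training mode so that dropout stays active, combined with unseeded random number generators. A minimal, hypothetical sketch of how one would typically rule this out in PyTorch (not this repository's own code):

    import random
    import torch

    def make_deterministic(model, seed=0):
        model.eval()             # disable dropout and other train-only behaviour
        torch.manual_seed(seed)  # pin PyTorch's RNG
        random.seed(seed)        # pin Python's RNG

    # call make_deterministic(model) once before prediction and run inference
    # under torch.no_grad() so repeated runs produce identical outputs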

sentence is not padded with SOS, but labels are

Hello,
Thank you for your nice code. I am confused regarding a few things:

  1. Input sentences have an EOS marker but no SOS marker, while labels are marked with both SOS and EOS. Why are we not marking SOS in the sentences as well?
  2. Say we have a test/dev sentence with an actual length of 5. The sentence will then carry an EOS marker, so its length is 6, while the gold label sequence carries both SOS and EOS markers, so its length is 7. But the generated label sequence has length 6. Does the generated label sequence include only the EOS or only the SOS? Depending on that, during evaluation I need to ignore either the 0th tag or the last tag.

Could you kindly give some insight.

data format (slash in tokens?)

What if I am using my own custom tag system and want to train on the token/tag pair 3/4/NUM? The word is 3/4 and the custom tag is NUM. How do I include a forward slash?

This seems to break prepare.py, with this traceback:

Traceback (most recent call last):
  File "../../prepare.py", line 62, in <module>
    data, cti, wti, tti = load_data()
  File "../../prepare.py", line 29, in load_data
    x, y = load_line(line, cti, wti, tti)
  File "../../prepare.py", line 43, in load_line
    w, tag = (w, None) if HRE else re.split("/(?=[^/]+$)", w)
ValueError: not enough values to unpack (expected 2, got 1)

EDIT: The error may be due to something else about my training data, I am not sure what. Does this format look correct to you?

3/4/QTY cup/UNIT unsalted/NAME butter/NAME at/NOTE room/NOTE temperature/NOTE
half/QTY teaspoon/UNIT vanilla/NAME extract/NAME
...
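
For reference, the re.split("/(?=[^/]+$)", w) call shown in the traceback splits each token on its last slash only, so a word that itself contains slashes (like 3/4) is handled; the unpack error is raised only for a token with no slash at all. A small standalone check:

    import re

    for w in ["3/4/QTY", "cup/UNIT", "unsalted/NAME", "room"]:
        print(w, "->", re.split("/(?=[^/]+$)", w))  # split on the last "/" only
    # 3/4/QTY -> ['3/4', 'QTY'], cup/UNIT -> ['cup', 'UNIT'],
    # but room (no slash) -> ['room'], which is what triggers
    # "not enough values to unpack (expected 2, got 1)"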

the size of batch_size

I want to run sentence_classification.
My GPU is a 1050 Ti with 4 GB of memory.
It only runs when I set batch_size to 1, and even then it is very slow. I do not know why the efficiency is so low.

the code was running on the GPU.

I also found that it was running very slowly even though the code was running on the GPU, like issue 9.
I do not know the reason. Could it be that the data size is too big? It is only 2400.

train error

After successfully using prepare.py, train.py gives this error.

Traceback (most recent call last):
  File "train.py", line 67, in <module>
    train()
  File "train.py", line 35, in train
    batch, cti, wti, itt = load_data()
  File "train.py", line 19, in load_data
    xc, xw = zip(*[(list(map(int, xc.split("+"))), int(xw)) for xc, xw in x])
  File "train.py", line 19, in <listcomp>
    xc, xw = zip(*[(list(map(int, xc.split("+"))), int(xw)) for xc, xw in x])
ValueError: invalid literal for int() with base 10: ''

The code runs very slowly on GPU

When I used the code to perform the named entity recognition task, I found that it was running very slowly and the code was running on the GPU.
Is there anything else that needs attention?
Thanks for your help.

transition from last word to <END> tag not added in the score function

Hi, maybe I am wrong, but it is confusing that in the following function SOS_IDX is added to the beginning of the tag sequence while EOS_IDX is not added to the end. As such, the tag sequence looks like <START> A B C D. When calculating the transition score, the loop stops at self.trans[3+1, 3] (i.e. the transition score C->D), so the transition score from D to <END> is never added.

    def score(self, y, y0, mask): # calculate the score of a given sequence
        score = Tensor(BATCH_SIZE).fill_(0.)
        y0 = torch.cat([LongTensor(BATCH_SIZE, 1).fill_(SOS_IDX), y0], 1)
        for t in range(y.size(1)): # iterate through the sequence
            mask_t = mask[:, t]
            emit = torch.cat([y[b, t, y0[b, t + 1]].unsqueeze(0) for b in range(BATCH_SIZE)])
            trans = torch.cat([self.trans[seq[t + 1], seq[t]].unsqueeze(0) for seq in y0]) * mask_t
            score = score + emit + trans
        return score
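
If that final transition really is missing, a hedged sketch of one way to add it after the loop, assuming the same conventions as the quoted code (mask marks valid positions, self.trans[i, j] is the score of moving from tag j to tag i, and EOS_IDX is the index of the <EOS> tag); this is an illustration, not the repository's actual fix:

        # transition from each sequence's last real tag to EOS; indices into y0
        # are offset by 1 because SOS was prepended above
        last_idx = mask.sum(1).long() - 1                                # [B] position of last real token
        last_tag = y0.gather(1, (last_idx + 1).unsqueeze(1)).squeeze(1)  # [B] tag at that position
        score = score + self.trans[EOS_IDX, last_tag]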

the meaning of self.rnn

in embedding.py

    if HRE:
        self.sent_embed = self.rnn(EMBED_SIZE, EMBED_SIZE, True)

Why is vocab_size EMBED_SIZE? Isn't it supposed to be the size of the vocabulary, e.g. 10000 or more?
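
For what it's worth, in a hierarchical setup the sentence-level RNN usually consumes word embedding vectors rather than token indices, so its first argument is an input feature size (EMBED_SIZE), not a vocabulary size; only the lookup table depends on the vocabulary. A minimal, hypothetical sketch of that general pattern (not this repository's exact embedding.py):

    import torch
    import torch.nn as nn

    EMBED_SIZE = 300
    VOCAB_SIZE = 10000  # the embedding lookup, not the RNN, depends on vocabulary size

    word_embed = nn.Embedding(VOCAB_SIZE, EMBED_SIZE)
    sent_encoder = nn.LSTM(input_size=EMBED_SIZE, hidden_size=EMBED_SIZE, batch_first=True)

    tokens = torch.randint(0, VOCAB_SIZE, (2, 7))  # [batch, seq_len] word indices
    vectors = word_embed(tokens)                   # [batch, seq_len, EMBED_SIZE]
    _, (h, _) = sent_encoder(vectors)              # h: [1, batch, EMBED_SIZE]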
