jidasheng / bi-lstm-crf



A PyTorch implementation of the BI-LSTM-CRF model.

License: MIT License

Python 100.00%

crf bilstm bilstm-crf crf-model bi-lstm-crf lstm-crf pytorch nlp ner sequence-labeling


bi-lstm-crf's Issues

How to calculate accuracy, recall and F-score using your project?

Hello, it's me again.

I trained my model and I want to calculate its performance using the F-Score metric. How could I calculate it using your project? Thanks in advance. :)
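
As a rough illustration of the metrics being asked about, here is a minimal sketch that computes token-level accuracy and per-tag precision, recall and F1 from gold and predicted tag sequences. It is not part of the project: the function name `token_level_scores` and the input layout (one list of tags per sentence) are assumptions made for this example. NER results are often reported at the entity level instead, which this simpler token-level version does not do.

```python
from collections import Counter


def token_level_scores(gold, pred):
    """Return (accuracy, {tag: (precision, recall, f1)}) for aligned tag sequences.

    `gold` and `pred` are lists of sentences; each sentence is a list of tags.
    This is a hypothetical helper, not an API of bi_lstm_crf.
    """
    tp, fp, fn = Counter(), Counter(), Counter()
    correct = total = 0
    for gold_seq, pred_seq in zip(gold, pred):
        if len(gold_seq) != len(pred_seq):
            raise ValueError("gold and predicted sequences differ in length")
        for g, p in zip(gold_seq, pred_seq):
            total += 1
            if g == p:
                correct += 1
                tp[g] += 1
            else:
                fp[p] += 1  # p was predicted but is wrong
                fn[g] += 1  # g was expected but missed

    scores = {}
    for tag in set(tp) | set(fp) | set(fn):
        predicted = tp[tag] + fp[tag]
        actual = tp[tag] + fn[tag]
        precision = tp[tag] / predicted if predicted else 0.0
        recall = tp[tag] / actual if actual else 0.0
        denom = precision + recall
        f1 = 2 * precision * recall / denom if denom else 0.0
        scores[tag] = (precision, recall, f1)

    accuracy = correct / total if total else 0.0
    return accuracy, scores
```

The predicted tags would come from running the trained model on held-out sentences; the gold tags come from the labelled test data.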

Question about running prediction

After training my model, I wanted to test it but ran into a problem.

My code:
from bi_lstm_crf.app import WordsTagger

model_dir= './model'
model = WordsTagger(model_dir=model_dir)


sentence='國家外匯管理局公佈，截至2019年11月末，中國外匯儲備規模為30,956億美元'

tags, sequences = model([sentence])  # CHAR-based model
print(tags)

print(sequences)

The result:

Traceback (most recent call last):
  File "/Users/marcusau/PycharmProjects/jidasheng/test.py", line 10, in <module>
    tags, sequences = model([sentence])  # CHAR-based model
ValueError: not enough values to unpack (expected 2, got 1)

Note: I trained a char-based model.

Hello, what is the Mask in the CRF model used for, and how do I create it?
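
In BiLSTM-CRF models generally, sentences in a batch are padded to the same length, and the mask marks which positions hold real tokens so that the CRF can skip the padding when scoring and decoding. The sketch below shows the idea in plain Python. The helper name `pad_with_mask` and the choice of 0 as the padding id are assumptions for illustration; how this project builds its mask is not shown on this page.

```python
def pad_with_mask(sequences, pad_id=0):
    """Pad id sequences to a common length and build a matching 0/1 mask.

    Hypothetical helper: 1 marks a real token, 0 marks padding.
    """
    max_len = max((len(seq) for seq in sequences), default=0)
    padded = [list(seq) + [pad_id] * (max_len - len(seq)) for seq in sequences]
    mask = [[1] * len(seq) + [0] * (max_len - len(seq)) for seq in sequences]
    return padded, mask


padded, mask = pad_with_mask([[5, 3, 8], [7]])
# padded is [[5, 3, 8], [7, 0, 0]] and mask is [[1, 1, 1], [1, 0, 0]]
```

If the padding id is reserved and never used for a real token, the same mask can be derived directly from the padded tensor in PyTorch with a comparison such as `xs.gt(0)`.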

Is it feasible to use this for part-of-speech tagging?

For example:

jieba POS tags: 'nr' = person name, 'nt' = organization name, 'w' = punctuation

Using different tags and unseen words

Interesting. If I change the tags, can I use it for finding technical attributes?
Also, can it predict unseen words and sentences? Or does it only predict the words in its vocabulary?
I see you have said that "chars/words that not in the vocabulary will be replaced by UNKNOWN". But does that apply only to training, or to prediction too?
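
For context, the usual way such a replacement works is a dictionary lookup with a fallback index, applied the same way whenever text is converted to ids. The sketch below illustrates that pattern only. The token string `"<UNK>"`, the vocabulary layout and the function name `encode` are assumptions, not taken from the project.

```python
UNKNOWN = "<UNK>"  # hypothetical name for the unknown-token entry


def encode(tokens, vocab, unknown_token=UNKNOWN):
    """Map tokens to ids, sending anything outside the vocabulary to the unknown id."""
    unknown_id = vocab[unknown_token]
    return [vocab.get(token, unknown_id) for token in tokens]


vocab = {"<PAD>": 0, "<UNK>": 1, "a": 2, "b": 3}
ids = encode(["a", "b", "z"], vocab)  # "z" is not in vocab, so it gets id 1
```

With a scheme like this the model can still tag an unseen word, but it only sees the shared unknown id plus the surrounding context, so predictions for such words depend on context alone.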

Feedback about your module

Very good and user-friendly.

I trained my model on a PC without a GPU, and training on a corpus of 660,000 sentences took less than 30 minutes.

Amazing.

My only concern is accuracy: 20 epochs is not enough for my corpus, maybe because of its size.

The loss and val_loss are both still around 50.

I am now increasing the epochs from 20 to 100 to see the result.

Thanks for your nice work.

Is this a bug in your corpus build?

Example:

In my dataset.txt, the first row is:
作者根據和比利案件有關的醫院人員、律師、警員提供的第一手資料，和比利偷偷寫下的內心筆記，揭露了醫院內對待精神病人的黑幕，和比利既要面對人格融合，及醫院內精神及肉體的不人道對待的矛盾。 ["B","E","B","E","S","B","E","B","E","B","E","S","B","E","B","E","S","B","E","S","B","E","B","E","S","B","M","M","M","E","S","S","B","E","B","E","B","E","S","B","E","B","E","S","B","E","S","B","E","S","B","E","B","M","M","E","S","B","E","S","S","B","E","B","E","B","E","B","E","B","E","S","S","B","E","S","B","E","S","B","E","S","B","M","E","B","E","S","B","E","S"]

The first part (the sentence) is a string.
The second part (the BMES labels) is a list.

config ./data/vocab.json loaded
config ./data/tags.json loaded
tag dict file => ./model/tags.json
tag dict file => ./model/vocab.json
parsing ./data/dataset.txt: 37271it [00:01, 29734.05it/s]
Traceback (most recent call last):
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/app/preprocessing/preprocess.py", line 123, in __build_corpus
    sentence = json.loads(sentence)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/__init__.py", line 354, in loads
    return _default_decoder.decode(s)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 339, in decode
    obj, end = self.raw_decode(s, idx=_w(s, 0).end())
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/json/decoder.py", line 357, in raw_decode
    raise JSONDecodeError("Expecting value", s, err.value) from None
json.decoder.JSONDecodeError: Expecting value: line 1 column 2 (char 1)

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 193, in _run_module_as_main
    "__main__", mod_spec)
  File "/Library/Frameworks/Python.framework/Versions/3.6/lib/python3.6/runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/__main__.py", line 3, in <module>
    main()
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/app/train.py", line 119, in main
    train(args)
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/app/train.py", line 46, in train
    args.corpus_dir, args.val_split, args.test_split, max_seq_len=args.max_seq_len)
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/app/preprocessing/preprocess.py", line 69, in load_dataset
    xs, ys = self.__build_corpus(corpus_dir, max_seq_len)
  File "/Users/marcusau/jidasheng/lib/python3.6/site-packages/bi_lstm_crf/app/preprocessing/preprocess.py", line 131, in __build_corpus
    raise ValueError("exception raised when parsing line {}\n {}".format(idx + 1, e))
ValueError: exception raised when parsing line 37514
 Expecting value: line 1 column 2 (char 1)
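
The message "Expecting value: line 1 column 2 (char 1)" is what `json.loads` reports for text that starts with `[` but does not continue as a JSON value, so one plausible cause is a sentence on line 37514 that begins with a literal `[`. A small checker can locate such lines before training. The sketch below is not part of the project, and it rests on two assumptions: that each line is `sentence<TAB>tags`, and that a sentence starting with `[` is treated as a JSON list of words. The function name `find_bad_lines` is made up for this example.

```python
import json


def find_bad_lines(lines, sep="\t"):
    """Return (line_number, reason) for every line that fails to parse.

    Hypothetical checker; the field separator is an assumption.
    """
    problems = []
    for number, line in enumerate(lines, start=1):
        parts = line.rstrip("\n").split(sep)
        if len(parts) != 2:
            problems.append((number, "expected 2 fields, got {}".format(len(parts))))
            continue
        sentence, tags = parts
        try:
            # Assumption: a sentence starting with "[" is meant to be a JSON list.
            if sentence.startswith("["):
                sentence = json.loads(sentence)
            tags = json.loads(tags)
        except json.JSONDecodeError as error:
            problems.append((number, "invalid JSON: {}".format(error)))
            continue
        if len(sentence) != len(tags):
            problems.append(
                (number, "{} tokens but {} tags".format(len(sentence), len(tags)))
            )
    return problems
```

It could be run over the file with `find_bad_lines(open("dataset.txt", encoding="utf-8"))`, after which the reported line numbers can be inspected by hand.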

Error: Corpus_dir/vocab.json not found

Hi @jidasheng

I have been trying to run your implementation of the BiLSTM-CRF a couple of times but I keep getting an error.
On my terminal I executed
>$ python -m bi_lstm_crf corpus_dir --model_dir "model xxx"

and got the following error (See attached image below for full details)
ValueError: "corpus_dir/vocab.json" file does not exist

I've also tried >$ python3 -m bi_lstm_crf corpus_dir --model_dir "model xxx" but the same error persists.
I have not modified anything in the code and when I check inside "sample_corpus" the file "vocab.json" is there.

I am new to BiLSTM-CRF code but I have read the paper [2] and a few others on NER using BiLSTM-CRF. Am I doing something wrong? Please advise. Thanks!
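
The error message names "corpus_dir/vocab.json", which suggests the literal string corpus_dir was taken as the directory path. If the files actually live in sample_corpus, passing that directory in place of corpus_dir may resolve it. A quick way to check a directory before training is sketched below. The function name `missing_corpus_files` is made up for this example, and the three file names are taken from log output quoted elsewhere on this page rather than from the project's documentation.

```python
from pathlib import Path

# File names as they appear in the training log quoted in another issue.
EXPECTED_FILES = ("vocab.json", "tags.json", "dataset.txt")


def missing_corpus_files(corpus_dir, expected=EXPECTED_FILES):
    """Return the expected file names that are absent from corpus_dir."""
    root = Path(corpus_dir)
    return [name for name in expected if not (root / name).is_file()]
```

An empty result for `missing_corpus_files("sample_corpus")` would mean the directory looks complete. Note that a relative path is resolved against the directory the command is run from.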

Error when launching WordsTagger with device='cpu'

When WordsTagger is created with device='cpu', the class throws an error:

WordsTagger(basepath, device='cpu')

File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\models.py", line 613, in __init__
    self.tagger = WordsTagger(
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\bi_lstm_crf\app\predict.py", line 15, in __init__
    self.model = build_model(self.args, self.preprocessor, load=True, verbose=False)
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\bi_lstm_crf\app\utils.py", line 24, in build_model
    state_dict = torch.load(model_path)
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 789, in load
    return _load(opened_zipfile, map_location, pickle_module, **pickle_load_args)
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 1131, in _load
    result = unpickler.load()
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 1101, in persistent_load
    load_tensor(dtype, nbytes, key, _maybe_decode_ascii(location))
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 1083, in load_tensor
    wrap_storage=restore_location(storage, location),
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 215, in default_restore_location
    result = fn(storage, location)
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 182, in _cuda_deserialize
    device = validate_cuda_device(location)
  File "C:\Users\MarcoOdore\agilelab\MultiLegalSBD-master\venv\lib\site-packages\torch\serialization.py", line 166, in validate_cuda_device
    raise RuntimeError('Attempting to deserialize object on a CUDA '
RuntimeError: Attempting to deserialize object on a CUDA device but torch.cuda.is_available() is False. If you are running on a CPU-only machine, please use torch.load with map_location=torch.device('cpu') to map your storages to the CPU.

The reason is that build_model in app/utils.py does not take the device passed as input into account:

class WordsTagger:
    def __init__(self, model_dir, device=None):
        args_ = load_json_file(arguments_filepath(model_dir))
        args = argparse.Namespace(**args_)
        args.model_dir = model_dir
        self.args = args

        self.preprocessor = Preprocessor(config_dir=model_dir, verbose=False)
        self.model = build_model(self.args, self.preprocessor, load=True, verbose=False) # here
        self.device = running_device(device)
        self.model.to(self.device)

        self.model.eval()

def build_model(args, processor, load=True, verbose=False):
    model = BiRnnCrf(len(processor.vocab), len(processor.tags),
                     embedding_dim=args.embedding_dim, hidden_dim=args.hidden_dim, num_rnn_layers=args.num_rnn_layers)

    # weights
    model_path = model_filepath(args.model_dir)
    if exists(model_path) and load:
        state_dict = torch.load(model_path)  # here
        model.load_state_dict(state_dict)
        if verbose:
            print("load model weights from {}".format(model_path))
    return model

I think the problem could be solved by also passing the device to build_model and forwarding it to torch.load as map_location:

def build_model(args, processor, load=True, verbose=False, device=None):
    model = BiRnnCrf(len(processor.vocab), len(processor.tags),
                     embedding_dim=args.embedding_dim, hidden_dim=args.hidden_dim, num_rnn_layers=args.num_rnn_layers)

    # weights
    model_path = model_filepath(args.model_dir)
    if exists(model_path) and load:
        # map_location remaps tensors that were saved on CUDA to the requested
        # device; None keeps torch.load's default behaviour.
        # The caller (WordsTagger.__init__) would need to resolve the device
        # first and pass it in here.
        state_dict = torch.load(model_path, map_location=device)
        model.load_state_dict(state_dict)
        if verbose:
            print("load model weights from {}".format(model_path))
    return model

I would like to train my model on my office desktop, which has no GPU.

Thanks a lot.

list index out of range when trying to predict

Hello,

My code raises a 'list index out of range' error when I try to predict a sentence with the model. I analyzed your code but couldn't figure out what the cause could be.

Here is my code:

    model = WordsTagger(model_dir='name_model')
    tags, sequences = model([["meu", "amigo", "e", "senhor", "."]])
    print(tags)  
    print(sequences)

Here is the log:

Traceback (most recent call last):
  File "C:\Users\guilh\OneDrive\Textos e Documentação\workspace\BI-LSTM-CRF Cartas\bilstmprocessor.py", line 30, in <module>
    predict_text()
  File "C:\Users\guilh\OneDrive\Textos e Documentação\workspace\BI-LSTM-CRF Cartas\bilstmprocessor.py", line 26, in predict_text
    tags, sequences = model([["meu", "amigo", "e", "senhor", "."]])
  File "C:\Users\guilh\anaconda3\envs\bilstm_cartas\lib\site-packages\bi_lstm_crf\app\predict.py", line 40, in __call__
    return tags, self.tokens_from_tags(sentences, tags, begin_tags=begin_tags)
  File "C:\Users\guilh\anaconda3\envs\bilstm_cartas\lib\site-packages\bi_lstm_crf\app\predict.py", line 64, in tokens_from_tags
    tokens_list = [_tokens(sentence, ts) for sentence, ts in zip(sentences, tags_list)]
  File "C:\Users\guilh\anaconda3\envs\bilstm_cartas\lib\site-packages\bi_lstm_crf\app\predict.py", line 64, in <listcomp>
    tokens_list = [_tokens(sentence, ts) for sentence, ts in zip(sentences, tags_list)]
  File "C:\Users\guilh\anaconda3\envs\bilstm_cartas\lib\site-packages\bi_lstm_crf\app\predict.py", line 56, in _tokens
    begins = [b for idx, b in enumerate(begins) if idx == 0 or ts[idx] != "O" or ts[idx - 1] != "O"]
  File "C:\Users\guilh\anaconda3\envs\bilstm_cartas\lib\site-packages\bi_lstm_crf\app\predict.py", line 56, in <listcomp>
    begins = [b for idx, b in enumerate(begins) if idx == 0 or ts[idx] != "O" or ts[idx - 1] != "O"]
IndexError: list index out of range

I'd appreciate any help. Thanks in advance. Best regards.