glample / tagger
Named Entity Recognition Tool
License: Apache License 2.0
Hi,
Good work!! Could you also provide the train.txt, dev.txt and test.txt files used to train the model in your paper? Thanks a lot!!
Why does the LSTM implementation in nn.py have the forget gate code commented out? Even in the tagging code there seems to be no forget gate usage. Shouldn't an LSTM use one?
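For context, here is a sketch of one plausible reading (an assumption about the design, not the author's confirmed rationale): some LSTM variants couple the input and forget gates, setting f = 1 - i (the "CIFG" variant studied by Greff et al.), so no separate forget-gate parameters are needed even though forgetting still happens:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def cifg_lstm_step(x, h_prev, c_prev,
                   W_i, U_i, b_i, W_c, U_c, b_c, W_o, U_o, b_o):
    """One step of a coupled input-forget gate (CIFG) LSTM:
    the forget gate is tied to the input gate as f = 1 - i,
    so the separate forget-gate weights can be dropped."""
    i = sigmoid(W_i @ x + U_i @ h_prev + b_i)        # input gate
    c_tilde = np.tanh(W_c @ x + U_c @ h_prev + b_c)  # candidate cell state
    c = (1.0 - i) * c_prev + i * c_tilde             # coupled forget gate
    o = sigmoid(W_o @ x + U_o @ h_prev + b_o)        # output gate
    h = o * np.tanh(c)
    return h, c
```

Under that reading the commented-out forget gate would be intentional rather than a bug, but only the author can confirm.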
Hi,
When I use tagger.py
I sometimes get the following error:
Floating point exception (core dumped)
The error appears if there's one of the following lines in the document (for instance):
_ _ _
| _ |
_ | _
_ | |
| | |
On the other hand, the following lines "work":
A A A
L L L
X L L
L | L
I have no line number, but the error happens after line 59. A dirty fix (eliminating lines containing only one-character words) solved the problem, but I have no idea where it is coming from.
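A stopgap in the spirit of that dirty fix (an assumption about the trigger, not a root-cause diagnosis): the crashing lines above consist solely of single non-alphanumeric tokens, while the working ones contain at least one letter, so a pre-filter could look like:

```python
def drop_symbol_only_lines(lines):
    """Skip lines whose tokens are all single non-alphanumeric characters
    (e.g. '_ _ _' or '| _ |'), which appear to trigger the crash."""
    kept = []
    for line in lines:
        tokens = line.split()
        if tokens and all(len(t) == 1 and not t.isalnum() for t in tokens):
            continue  # drop the suspicious line
        kept.append(line)
    return kept
```

This keeps lines like "A A A" and "L | L" (which work) while dropping the purely symbolic ones.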
Hi!
Running on my data causes this error (after the first epoch):
Traceback (most recent call last):
File "./train.py", line 220, in <module>
dev_data, id_to_tag, dico_tags)
File "/home/tagger/utils.py", line 282, in evaluate
return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range
The format of the data is:
<sent0_unicode_word><space><iob_tag>
<sent0_unicode_word><space><iob_tag>
<sent0_unicode_word><space><iob_tag>
<sent1_unicode_word><space><iob_tag>
<sent1_unicode_word><space><iob_tag>
<sent1_unicode_word><space><iob_tag>
The IOB tags are in the set {B-PER, I-PER}, and the data was validated with this script:
def is_valid(conll):
    for line in conll:
        if line != "\n":
            spl = line.strip().split()
            if spl[-1] not in ["B-PER", "I-PER", "O"]:
                return False
    return True
Would you help me find out where and why this exception is raised?
Hi, great work to begin with!
I'm wondering whether there might be a bug at https://github.com/glample/tagger/blob/master/model.py#L244: it breaks when I turn off the char LSTM (by setting char_dim=0, I think). The reason is that the input then has only one element, which passes the check at L244 as a list and causes an error at L252. I think the right fix is either removing the check or replacing it with:
inputs = T.concatenate(inputs, axis=1) if len(inputs)!=1 else inputs[0]
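For illustration only (NumPy shown here; the repo's check operates on Theano tensors): the guarded expression skips concatenation when there is a single input block, though concatenate itself accepts a one-element list:

```python
import numpy as np

# e.g. word embeddings only, with char_dim=0 (hypothetical shapes)
inputs = [np.ones((4, 3))]

# Proposed guard: concatenate along the feature axis only when
# there is more than one input block; otherwise pass it through.
x = np.concatenate(inputs, axis=1) if len(inputs) != 1 else inputs[0]
assert x.shape == (4, 3)
```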
Hi, how much time will be saved by running this program on GPU rather than CPU?
Hi,
I was using the CoNLL-2002 dataset in Spanish, and when it computed the F1 score it failed like this:
Traceback (most recent call last):
File "train.py", line 220, in <module>
dev_data, id_to_tag, dico_tags)
File "D:\deepner\tagger\utils.py", line 282, in evaluate
return float(eval_lines[1].strip().split()[-1])
IndexError: list index out of range
The dataset is this one: http://www.cnts.ua.ac.be/conll2002/ner/data/
And I run the following command: python train.py --train dataset\esp.train --dev dataset\esp.testa --test dataset\esp.testb
Thank you for sharing this implementation!
Hi.
I'm a novice in NER and I'm trying to use the code for a Chinese NER task. However, after running train.py, the only files I got were "parameters.pkl" and "mappings.pkl". So when I tried to run tagger.py, it said it needed "word_layer.mat".
How can I generate the "*.mat" files?
Thanks a lot.
@glample - Sir, if I am not wrong, the dataset you provided, namely eng.train, eng.testa and eng.testb, is free but copyrighted by Reuters.
I would like to suggest adding this warning to that particular commit so that people take proper precautions before using it.
Just wanted to clarify: does it support minibatch training for the LSTM+CRF model ?
Thanks.
Hi,
Currently, running the tagger on my 24-core CPU uses 8-12 cores instead of the full 24.
gensim gives an option to utilize all CPU cores.
Is a similar option possible for tagger as well? It would really speed things up.
Hi @glample,
I'm trying to overfit the LSTM without the CRF layer on a small dataset, but the dev/test scores were only about 2%. I used the --crf 0 option; do I need to take any further steps?
Hi,
I am trying to train a new model on my dataset, but during training an error occurred: "MemoryError: alloc failed". I am using 64 GB of RAM to train my model.
When running train.py on my own dataset, an exception happened.
I don't know how to locate the exception; please give me some suggestions.
The output looks like this:
Model location: ./models/tag_scheme=iob,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=sgd-lr_.005
Found 107161 unique words (1038059 in total)
Found 5360 unique characters
Found 27 unique named entity tags
348944 / 349324 / 350017 sentences in train / dev / test.
Saving the mappings to disk...
Compiling...
Starting epoch 0...
50, cost average: 9.245738
100, cost average: 9.109227
150, cost average: 8.605217
200, cost average: 7.075110
250, cost average: 7.567110
300, cost average: 6.226210
350, cost average: 6.570437
400, cost average: 6.829048
450, cost average: 6.786608
500, cost average: 6.878631
550, cost average: 6.116019
600, cost average: 5.965092
650, cost average: 6.305119
700, cost average: 6.422719
750, cost average: 5.528729
800, cost average: 5.219546
850, cost average: 6.256534
900, cost average: 5.553740
950, cost average: 6.382971
Floating point exception (core dumped)
It seems the CRF likelihood loss differs slightly from the TensorFlow implementation: here the unary potentials are inside the log_sum_exp, while in TF they are outside.
Good work!
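To make the observation concrete, here is a minimal NumPy sketch of a linear-chain CRF forward recursion (hypothetical variable names, not this repo's code). Because the unary term does not vary along the axis being summed, placing it inside or outside the log_sum_exp yields the same partition function:

```python
import numpy as np

def log_sum_exp(x, axis):
    # Numerically stable log(sum(exp(x))) along the given axis.
    m = np.max(x, axis=axis)
    return m + np.log(np.sum(np.exp(x - np.expand_dims(m, axis)), axis=axis))

def crf_log_partition(unary, trans):
    """unary: (T, K) emission scores; trans: (K, K) transition scores."""
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        # unary[t] inside the log_sum_exp; pulling it outside gives the
        # same result because it is constant over the summed (previous-tag) axis.
        alpha = log_sum_exp(alpha[:, None] + trans + unary[t][None, :], axis=0)
    return log_sum_exp(alpha, axis=0)
```

So the two formulations should be mathematically equivalent, differing only in where the constant is added.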
I get an AttributeError when using learning methods other than SGD. Does the error happen when trying to update the word embeddings?
Please let me know if there is a TensorFlow version.
Hi,
Nice work on the implementation. I had a question. I am trying to train my LSTM-CRF model with external word2vec embeddings + char bi-LSTM features + word LSTM features + a few gazetteer features, and the additional gazetteer features produce no change in accuracy on the test set. So I wanted to know whether the code currently supports additional gazetteer features, or only w2v embeddings + char LSTM + word LSTM as features?
Is CoNLL-2003 the format to use?
Hi,
LSTM-CRF internally uses vector representations created with the Word2Vec algorithm, right?
How can I feed LSTM-CRF my home-baked Word2Vec vectors?
Thanks
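For what it's worth, the text format the --pre_emb loader appears to expect is one entry per line: the word followed by its space-separated values. A sketch with made-up vectors (the words, values, and dimension here are hypothetical):

```python
# Hypothetical tiny vocabulary; in practice these vectors would come
# from your own Word2Vec training run.
vectors = {
    "the": [0.1, 0.2, 0.3],
    "cat": [0.4, 0.5, 0.6],
}

# Write "word v1 v2 ... v_dim" per line, space-separated.
with open("my_embeddings.txt", "w") as f:
    for word, vec in vectors.items():
        f.write(word + " " + " ".join("%.6f" % v for v in vec) + "\n")

# Then (assuming word_dim matches the vector length), something like:
#   ./train.py --train ... --pre_emb my_embeddings.txt --word_dim 3
```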
Kewl
Hi,
I wonder if you could add the German model and data, or the URLs of those resources? That would be very helpful for me.
Thanks.
Fantastic work! Is there an open source license for tagger or any plan to add one?
Dear Glample,
When I tried to train the model using the data you provided, I got error information like this:
IOError: [Errno 2] No such file or directory: './evaluation/temp/eval.1851915.scores'
By the way, I've checked the output: every result was the O tag, which confuses me.
Here is my command line:
./train.py --train dataset/eng.train --dev dataset/eng.testa --test dataset/eng.testb
With Regards,
A du.
Dear Glample,
When I have tried to train the model it renders the following error: FileNotFoundError: [WinError 3] The system cannot find the path specified: './models\tag_scheme=iobes,lower=False,zeros=False,char_dim=25,char_lstm_dim=25,char_bidirect=True,word_dim=100,word_lstm_dim=100,word_bidirect=True,pre_emb=,all_emb=False,cap_dim=0,crf=True,dropout=0.5,lr_method=sgd-lr_.005'
Could you kindly advise how to fix this error and train the model with a custom dataset? Thank you very much for your consideration and support.
With Regards,
Kidane W.
Is there some way to trade prediction quality for tagging speed? Right now it gives very good results but runs extremely slowly. What could be done to make the architecture more speed-oriented?
hi Glample,
You said it works on 4 CoNLL datasets (English, Spanish, German and Dutch). Can I use this tool to train on other language datasets, like Chinese or Vietnamese?
Thank you!
I am trying to reproduce the reported result for the LSTM-CRF model on the English dataset included in the repository, but haven't yet succeeded in getting the reported F1 score (90.94).
I used all the default parameter values, which seem to match what is in the paper, and I used eng.train for training, eng.testa for validation, and eng.testb for testing. After the program stopped (100 epochs on the training set), I got the best dev score to be 89.74, and the corresponding test score is 83.55. Am I missing something?
Also, in train.py
, why is "New best score on test" reported? Shouldn't one only report the test score of the model that does the best on validation?
I will keep checking whether I have anything misconfigured, but in the meantime any help or insight would be appreciated! Thanks again for the awesome paper + code!
Dear Mr. Lample.
I tried to put my data into your code but something went wrong, so I tried the CoNLL-2002 data instead. But during training an error occurred. The terminal shows the error line: KeyError: u'S-PRT'.
I tried to print all the str_words and I cannot find this key. I think this key comes from the model. How can I fix it? Thank you!
Hi, could you please let me know whether it is possible to get the prediction probability of the predicted label sequence for a sentence, as with a CRF model?
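In principle, yes, for any linear-chain CRF-style model: p(y|x) = exp(score(x, y) - log Z(x)). A NumPy sketch under that assumption (the helper names are hypothetical, not functions exposed by this repo):

```python
import numpy as np

def log_sum_exp(x, axis):
    # Numerically stable log(sum(exp(x))) along the given axis.
    m = np.max(x, axis=axis)
    return m + np.log(np.sum(np.exp(x - np.expand_dims(m, axis)), axis=axis))

def sequence_log_prob(unary, trans, tags):
    """log p(tags | x) = score(tags) - log Z, for unary (T, K), trans (K, K)."""
    # Path score: emissions along the path plus transitions between tags.
    score = unary[np.arange(len(tags)), tags].sum()
    score += trans[tags[:-1], tags[1:]].sum()
    # Forward recursion for the log partition function Z.
    alpha = unary[0]
    for t in range(1, unary.shape[0]):
        alpha = log_sum_exp(alpha[:, None] + trans + unary[t][None, :], axis=0)
    return score - log_sum_exp(alpha, axis=0)
```

Exponentiating the returned value gives the probability of the chosen (e.g. Viterbi) sequence.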
Hi @glample,
Do you use any pre-trained embeddings for languages other than English? If so, where can I download these embeddings?
Thanks,
Dung Thai
Does the tagger use PoS tags available in the Conll dataset as an input feature for NER ?
What are the input features, apart from the word embeddings, used by the tagger?
Hi @glample, why is the LSTM's forget gate not used? Would the forget gate hurt performance? Any explanation for that?
Hi,
What is the FB1 score term used in the output stats?
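For reference (the conlleval convention, stated as my understanding): FB1 is the F-measure with beta = 1, i.e. the harmonic mean of precision and recall over predicted chunks:

```python
def fb1(precision, recall, beta=1.0):
    """F-measure as reported in conlleval's FB1 column (beta = 1)."""
    if precision + recall == 0:
        return 0.0
    b2 = beta * beta
    return (1 + b2) * precision * recall / (b2 * precision + recall)
```

For example, with precision 80 and recall 40, FB1 is about 53.33.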
Does this code support training with mini-batch?
Hi
Could you please help clarify my doubt.
I understand that the function below loads the pretrained embeddings; its docstring says it augments the dictionary with words that have a pretrained embedding:
def augment_with_pretrained(dictionary, ext_emb_path, words):
"""
Augment the dictionary with words that have a pretrained embedding.
If `words` is None, we add every word that has a pretrained embedding
to the dictionary, otherwise, we only add the words that are given by
`words` (typically the words in the development and test sets.)
"""
print 'Loading pretrained embeddings from %s...' % ext_emb_path
assert os.path.isfile(ext_emb_path)
My doubt is: I have train, dev and test sets in CoNLL-2003 format, which is very clear. How should the pretrained embedding file be saved?
I am planning to use word2vec or GloVe models, which take each word in a sentence as input and give a vector representation of each word.
How am I supposed to input these vectors to the model? Could you please direct me to the code section which reads these vector representations?
What should be the file format of the pretrained embedding file?
How will the word id pick up the vector representation during training, and which part of the code handles this?
Should the pretrained embedding file look like
word_id <vector representation of word> ?
Many thanks for clarifying the doubt in advance
with regards
Raghav
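To the format question above, a sketch of how such a plain-text embedding file can be parsed (this reflects my reading of the expected format, i.e. "word v1 v2 ... v_dim" per line keyed by the word string, not a word id):

```python
import numpy as np

def load_pretrained(path, word_dim):
    """Parse 'word v1 v2 ... v_word_dim' lines into a dict keyed by word."""
    emb = {}
    with open(path) as f:
        for line in f:
            parts = line.rstrip().split()
            if len(parts) == word_dim + 1:  # skip malformed or header lines
                emb[parts[0]] = np.array([float(x) for x in parts[1:]])
    return emb
```

At training time the loader looks up each word string in this table; words without an entry keep their randomly initialized vectors.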
Hi,
I think there is a small bug in the preprocessing of tagger.py:
words = line.rstrip().split()
if line:
# Lowercase sentence
if parameters['lower']:
line = line.lower()
# Replace all digits with zeros
if parameters['zeros']:
line = zero_digits(line)
You do the lower() and zero_digits() preprocessing after the input line has already been split into words. Since the variable words is what gets passed to the neural network, the two preprocessing steps lower and zeros are never applied when you run tagger.py.
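A sketch of the suggested ordering (zero_digits is re-implemented here for self-containment; the parameters dict mirrors the repo's): apply the string-level preprocessing before splitting into words.

```python
import re

def zero_digits(s):
    # Mirrors utils.zero_digits: replace every digit with 0.
    return re.sub(r"\d", "0", s)

def preprocess(line, parameters):
    """Lowercase and zero out digits BEFORE tokenizing, then split."""
    if parameters.get("lower"):
        line = line.lower()
    if parameters.get("zeros"):
        line = zero_digits(line)
    return line.rstrip().split()
```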
Sir, thank you for providing us with such a great NER tagger. I was trying to run the tagger in the Anaconda Prompt and cmd, but I got the following error on Windows 7 32-bit:
[Theano's generated lazylinker_ext/mod.cpp source listing, echoed before the compilation error, omitted.]
Problem occurred during compilation with the command line below:
"C:\MinGW\bin\g++.exe" -shared -g -DNPY_NO_DEPRECATED_API=NPY_1_7_API_VERSION -m32 -I"C:\ProgramData\Anaconda2\lib\site-packages\numpy\core\include" -I"C:\ProgramData\Anaconda2\include" -I"C:\ProgramData\Anaconda2\lib\site-packages\theano\gof" -L"C:\ProgramData\Anaconda2\libs" -L"C:\ProgramData\Anaconda2" -o C:\Users\Rabia Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\lazylinker_ext.pyd C:\Users\Rabia Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\mod.cpp -lpython27
g++.exe: error: Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\lazylinker_ext.pyd: No such file or directory
g++.exe: error: C:\Users\Rabia: No such file or directory
g++.exe: error: Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\mod.cpp: No such file or directory
Traceback (most recent call last):
  File "./tagger.py", line 8, in <module>
    from loader import prepare_sentence
  File "C:\ProgramData\Anaconda2\tagger-master\tagger-master\loader.py", line 4, in <module>
    from utils import create_dico, create_mapping, zero_digits
  File "C:\ProgramData\Anaconda2\tagger-master\tagger-master\utils.py", line 5, in <module>
    import theano
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\__init__.py", line 66, in <module>
    from theano.compile import (
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\compile\__init__.py", line 10, in <module>
    from theano.compile.function_module import *
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\compile\function_module.py", line 21, in <module>
    import theano.compile.mode
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\compile\mode.py", line 10, in <module>
    import theano.gof.vm
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\gof\vm.py", line 662, in <module>
    from . import lazylinker_c
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\gof\lazylinker_c.py", line 127, in <module>
    preargs=args)
  File "C:\ProgramData\Anaconda2\lib\site-packages\theano\gof\cmodule.py", line 2316, in compile_str
    (status, compile_stderr.replace('\n', '. ')))
Exception: Compilation failed (return status=1): g++.exe: error: Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\lazylinker_ext.pyd: No such file or directory. g++.exe: error: Noureen\AppData\Local\Theano\compiledir_Windows-7-6.1.7601-SP1-x86_Family_6_Model_42_Stepping_7_GenuineIntel-2.7.13-32\lazylinker_ext\mod.cpp: No such file or directory
Kindly respond ASAP, as I am stuck on this issue.
Hi.
I have some problems training my own model on a Persian dataset. It gives an error at the beginning of the training phase. My dataset is in UTF-8. Does the tagger support UTF-8? If yes, what else could the problem be? My dataset is in CoNLL-2003 format.
The error:
File "loader.py", line 43, in update_tag_scheme
    'Please check sentence %i:\n%s' % (i, s_str))
Exception: <exception str() failed>
Thanks
Where can I get a Chinese training dataset?
Line 168 of train.py reads
# Create a dictionary and a mapping for words / POS tags / tags
But in fact, POS tags never seem to be used.
Just want to make sure: the README says the input file for tagger.py should contain one sentence per line and that sentences have to be tokenized. Does 'tokenized' here mean that the sentences are split into tokens and the tokens are separated by spaces?
hi, @glample
Recently I modified your code in model.py and train.py to adapt it to my sequence labeling task, punctuation prediction, without modifying your kernel code in nn.py.
The setup for both LSTM and LSTM-CRF is word_dim=100, word_lstm_dim=100, dropout=0.5, lr_method=adadelta, char_dim=0, i.e. without char embeddings.
I got unexpected results: LSTM-CRF is worse than plain LSTM.
I'm wondering whether this is the influence of the dropout layer. Since I don't use char embeddings, do I need the dropout layer at all?
Or maybe LSTM-CRF is not suitable for my specific task?
Do you have any suggestions?
Dear @glample ,
Can you explain that? Many thanks.
I am getting an empty file when I run tagger.py. I first generated my own model, and the generated model seems fine. But when I launch tagger.py with my model, I get no labels: all words are labeled O. Even when I tried your own "english" model with simple tokenized sentences, I get no results, all words with "O" labels. Am I missing something?