elmoformanylangs's People

Contributors

alongwy, angledluffa, bozheng-hit, frankier, oneplus, strubell, voidism


elmoformanylangs's Issues

How to get the embedding for each word in the sentence?

Hi,

I am struggling to get the embedding for individual words. I used this command:

python -m elmoformanylangs test --input_format conll --input input.conllu --model ar.model --output_prefix ./output/ --output_format hdf5 --output_layer -1

And it dumps an hdf5-encoded file onto the disk, as expected. However, as far as I understand, the file encodes a dict where each key is a tab-separated sentence and the value is its representation.

But when I print the key:


import h5py

f = h5py.File(filename, 'r')

for key in list(f.keys()):
    print(key)

I can see that f.keys() contains only a single string key covering all sentences in the input file. 1) Why, and how do I get individual sentence representations? 2) How do I get individual word representations?

This is an example of my input with 2 sentences:

1	ik	ik	PRON	VNW|pers|pron|nomin|vol|1|ev	Case=Nom|Person=1|PronType=Prs	2	nsubj	2:nsubj	_
2	zie	zien	VERB	WW|pv|tgw|ev	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	hem	hem	PRON	VNW|pers|pron|obl|vol|3|ev|masc	Case=Acc|Person=3|PronType=Prs	2	obj	2:obj|4:nsubj:xsubj	_
4	fietsen	fietsen	VERB	WW|inf|vrij|zonder	VerbForm=Inf	2	xcomp	2:xcomp	_
1	Jan	Jan	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	nsubj	2:nsubj	_
2	komt	komen	VERB	WW|pv|tgw|met-t	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	vandaag	vandaag	ADV	BW	_	2	advmod	2:advmod	_
4	en	en	CCONJ	VG|neven	_	5	cc	5.1:cc	_
5	Piet	Piet	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	conj	5.1:nsubj	_ 
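For what it's worth, if each hdf5 key really were a tab-separated sentence with a (num_words, dim) value, per-word vectors would come from ordinary row indexing. A minimal sketch under that assumption (the key layout and the file path below are inferred from the description above, not confirmed):

import h5py

filename = './output/output.hdf5'   # hypothetical path under the --output_prefix used above
with h5py.File(filename, 'r') as f:
    for key in f.keys():
        words = key.split('\t')     # assumption: the key is the tab-separated sentence
        vectors = f[key][...]       # assumption: shape (num_words, dim)
        for i, word in enumerate(words):
            print(word, vectors[i][:5])   # first 5 dimensions of this word's vector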

Help Provide The Original Training Commands

@Oneplus Thanks for your great work! I wonder if you could share the original commands for training, especially for training the simplified-Chinese ELMo model?

Training Your Own ELMo

Please run

python -m elmoformanylangs.biLM train -h

Question about fine-tuning

Hello,
I want to confirm one thing: this model currently has no fine-tuning functionality, right? (Or have I just not read the code carefully enough?)
If not, I will try to write one on top of your work. Thanks a lot!

Relative paths in config.json

Thank you so much for sharing this repo. Just two questions:
1- config.json has many relative paths; do we have to change them all?
2- The English language zip file is named '144'; I am renaming it to 'English' and copying it inside the ELMO folder.

Now I am getting this encoding error, any ideas?
NB: I am using Python 3.6.8

from elmoformanylangs import Embedder

e = Embedder('/home/malrawi/Desktop/My Programs/ELMO/English/')
Traceback (most recent call last):

  File "<ipython-input-15-3e3369c99ad2>", line 1, in <module>
    e = Embedder('/home/malrawi/Desktop/My Programs/ELMO/English/')

  File "/home/malrawi/Documents/ELMO/elmoformanylangs/elmo.py", line 107, in __init__

  File "/home/malrawi/Documents/ELMO/elmoformanylangs/elmo.py", line 115, in get_model

  File "/home/malrawi/anaconda3/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

  File "/home/malrawi/anaconda3/lib/python3.6/json/__init__.py", line 344, in loads
    s, 0)

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig)
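The exception message itself points at the fix: the file starts with a UTF-8 BOM. A possible workaround (a sketch, assuming the offending file is the model's config.json) is to re-save it BOM-free:

import json

path = '/home/malrawi/Desktop/My Programs/ELMO/English/config.json'
with open(path, encoding='utf-8-sig') as f:    # 'utf-8-sig' strips the BOM on read
    config = json.load(f)
with open(path, 'w', encoding='utf-8') as f:   # write back without a BOM
    json.dump(config, f, indent=2)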

Question about the training loss

I am training on my own Chinese corpus. Could you tell me roughly what the training loss was during training of the released pre-trained Chinese model? Thanks!

typo in readme.md

The Python module name in the "Training Your Own ELMo" section,

python -m elmoformanylang.biLM train -h

should be

python -m elmoformanylangs.biLM train -h

Different results for the same input

e = Embedder(path)
sents = [['Sam', 'goes', 'to', 'the', 'bank', 'every', 'day', '.']]

res1 = e.sents2elmo(sents)
res2 = e.sents2elmo(sents)

Getting all layers

Hi,

I was wondering whether there is a simple way to modify your code to return the 3 layers of the biLSTM, as in Peters et al. (2018), so as to train the task-specific weights for the weighted average.

Thank you
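If the sents2elmo signature matches the README, no code modification should be needed: its output_layer argument reportedly accepts -2 to return all three layers rather than their average (hedged; verify against your copy of elmo.py):

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['this', 'is', 'a', 'sentence']]
# output_layer: 0 = word encoder, 1/2 = LSTM layers, -1 = average, -2 = all 3 layers
all_layers = e.sents2elmo(sents, output_layer=-2)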

How to use elmo embedding?

Hi, I tried to dump the ELMo embeddings to txt and hdf5 files, but the files are huge!
My validation data has about 70k sentences; the txt file is nearly 20 GB and the hdf5 nearly 15 GB. How should I use these files?

Thanks

How to use ELMo in a classification task

Hello, I have downloaded the simplified-Chinese ELMo model and run the basic operations from this GitHub page.
However, I don't quite understand how to use the PyTorch ELMo pre-trained model in a downstream classification task. Some issues suggest following the word2vec usage pattern, but word2vec gives static word vectors, and I have also seen issues saying that HIT's ELMo currently cannot be combined with allennlp.

Could you provide some PyTorch usage examples? Many thanks.
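One pattern that does not depend on allennlp (a hedged sketch, not an official example from the authors) is to mean-pool the per-token vectors from sents2elmo into a sentence vector and train an ordinary PyTorch classifier on top:

import numpy as np
import torch
import torch.nn as nn
from elmoformanylangs import Embedder

e = Embedder('/path/to/zhs.model/')
sents = [['今天', '天气', '真', '好'], ['这', '部', '电影', '很', '差']]

# sents2elmo returns one (num_words, dim) array per sentence (dim is 1024
# for the released models); mean-pool over words for a fixed-size feature.
feats = torch.tensor(np.stack([v.mean(axis=0) for v in e.sents2elmo(sents)]))

classifier = nn.Linear(feats.shape[1], 2)   # 2 classes; train with CrossEntropyLoss
logits = classifier(feats)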

Is there a pretrained English model with word_dim=512?

Hi,

I downloaded the pretrained English model and found that word_dim is set to 100. Since the config file has a default word_dim of 512, I am wondering whether a version with word_dim=512 is available. Thanks!

Installation on Google Cloud Computing Engine

Hi!
I have problems with the installation on Google Cloud.
I followed the instructions in a console and it succeeded; importing this module from Python in the console works.
BUT
when I try to import it in Jupyter, I get an import error.
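A frequent cause of "works in the console, fails in Jupyter" is that the notebook kernel runs a different Python environment than the shell. A quick, generic check (not specific to this repo):

import sys
print(sys.executable)   # compare with `which python` in the console
# If the two differ, install the package into the kernel's environment, e.g.:
#   /path/to/kernel/python -m pip install -e /path/to/ELMoForManyLangs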

Is it necessary to add <BOS> <EOS> tokens?

Hi,
While using ELMoForManyLangs programmatically, do I need to add <bos> <eos> tokens manually to the sentences? Specifically, could you say which of the usages below is correct?

from elmoformanylangs import Embedder

# Option-1 
e = Embedder('/path/to/your/model/')
sents = [['this','is','the','first','sentence'],['another','sentence']]
e.sents2elmo(sents)

# Option-2
e = Embedder('/path/to/your/model/')
sents = [['<bos>','this','is','the','first','sentence','<eos>'],['<bos>','another','sentence','<eos>']]
e.sents2elmo(sents)

Input file demonstration

Hi,

Thanks for maintaining this repo.

Could you provide an input file example?
The example in the README doesn't explain itself well,
and it is still hard for me to fit my input into the format.

Thanks.

The English model is a bit strange

I used your pre-trained models for several languages in my own model and they all worked very well, giving large performance gains. But the English model did not: it caused a large performance drop. Is there an explanation for this?

cuda out of memory

When using the simplified-Chinese model, a slightly longer sentence or a slightly larger batch triggers this error:
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.c
The error occurs even with CUDA_VISIBLE_DEVICES set to multiple GPUs, so it does not feel like the model is simply too large.

Cannot find the output hdf5 file

Hello, my command is: python -m elmoformanylangs test --input_format conll --input input.conllu --model zhs.model --output_prefix ./out/ --output_format hdf5 --output_layer -1
Why can't I find the generated hdf5 file? Thanks.

Lemma column needed in conllu input?

The example in the readme shows conllu code with the ID, form, and lemma columns populated. However, none of the read_conll_*corpus functions in __main__.py read the lemma column. Do you read the lemma column anywhere else, or do you plan to use it in the near future? Do I need a lemmatiser?

Of course, udpipe users can get the lemmatisation from udpipe, and one will want to use udpipe with the offered models since udpipe's tokenisation differs from other popular tokenisers. The lemma column would, however, be a difficulty if I used my own tokeniser and trained my own models (as described in the readme).

The number of training instances looks wrong

Hello, my training command is
python -m elmoformanylangs.biLM train --train_path /home/data/peter/intent_classification/c2_intent_ltp_sent.txt --config_path /home/data/peter/ELMoForManyLangs/pretrained_model/configs/cnn_50_100_512_4096_sample.json --model /home/data/peter/ELMoForManyLangs/elmo_train_model

My training data has 64,629 lines, but I see that ELMo actually used only about 9,000+ instances during training.

2019-01-28 12:48:07,723 INFO: training instance: 9376, training tokens: 459418.
2019-01-28 12:48:07,785 INFO: Truncated word count: 0.
2019-01-28 12:48:07,785 INFO: Original vocabulary size: 14878.
2019-01-28 12:48:07,919 INFO: Word embedding size: 14880
2019-01-28 12:48:08,145 INFO: Char embedding size: 3292
2019-01-28 12:48:24,806 INFO: 293 batches, avg len: 50.0
2019-01-28 12:48:24,807 INFO: Evaluate every 293 batches.
2019-01-28 12:48:24,807 INFO: vocab size: 14880

Zero Division Error

Hello,

I am running the sents2elmo function with a custom dataset (Reddit data) and I am running into a ZeroDivisionError. The error seems to happen at line 202 in elmo.py. What is happening, or what in my input is causing this?

Thanks in advance
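Without the input it is hard to be certain, but a zero division around batch statistics suggests an empty sentence may have slipped in. Filtering out empty token lists before embedding is a cheap check (a guess, not a confirmed diagnosis):

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['a', 'tokenized', 'sentence'], []]   # note the empty sentence
sents = [s for s in sents if len(s) > 0]       # drop empties before embedding
embeddings = e.sents2elmo(sents)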

Some questions about training params and times

Hi, I am trying to train my own ELMo using your training script, but it takes a lot of time.
I notice that "The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU."
Can you tell me the size of one language's corpus? Does the whole training process mean 100 epochs?

Also, can you share your params, such as learning rate, learning rate decay, batch_size, and optimizer?

Thanks

Error when training my own ELMo

Hello, my development environment is Python 3.6 and I am using the latest version. The training data is a txt file with 130,000 lines, one sentence per line, tokenized by spaces. The config file is cnn_50_100_512_4096_sample.json. During training, an error is raised at line 113 of classify_layer.py. The error message is as follows:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 696, in <module>
    train()
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 613, in train
    train, valid, test, best_train, best_valid, test_result)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 340, in train_model
    loss_forward, loss_backward = model.forward(w, c, masks)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 241, in forward
    self.classify_layer.update_negative_samples(word_inp, chars_inp, mask_package[0])
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/modules/classify_layer.py", line 113, in update_negative_samples
    word = word_inp[i][j].tolist()
AttributeError: 'int' object has no attribute 'tolist'

Thanks for taking the time to answer my question!

Unable to download pretrained models

First of all, thanks for making this repo! However, when I try to download a pre-trained model, the download quits halfway because the connection to the server is lost. Could you maybe host the models somewhere else? Thanks!

Cannot load embedder

I installed ELMoForManyLangs as instructed with python setup.py install and downloaded a custom language model (Swedish).

I extracted the files from the Swedish language model to a folder called elmo_sv.

from elmoformanylangs import Embedder
e = Embedder('elmo_sv/')

Error:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/yijialiu/work/projects/conll2018/models/word_elmo/cnn_50_100_512_4096_sample.json'

Something seems to be hardcoded to the user yijialiu - that's not me!
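The usual fix reported for this is that the downloaded model's config.json carries the author's absolute config_path; repointing it at the config file shipped alongside the model makes loading work. A sketch (the config_path field name is inferred from the error above; verify it in your config.json):

import json

with open('elmo_sv/config.json') as f:
    cfg = json.load(f)
cfg['config_path'] = 'cnn_50_100_512_4096_sample.json'   # point at the local copy
with open('elmo_sv/config.json', 'w') as f:
    json.dump(cfg, f, indent=2)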

Different output vectors for same sentences

Hi, I am using ELMo for Japanese. Here is my code:

from elmoformanylangs import Embedder
e = Embedder('/Users/tanh/Desktop/alt/JapaneseElmo')

if __name__ == '__main__':
    sents = [
        ['今'],
        ['今'],
        ['潮水', '退']
    ]
    print(e.sents2elmo(sents))
    print(e.sents2elmo(sents))

And here is the console output:

2018-11-14 10:33:26,441 INFO: 1 batches, avg len: 3.3
[array([[-0.23187001, -0.09699917,  0.46900252, ..., -0.33114347,
          0.18502058, -0.27423012]], dtype=float32),
 array([[-0.23187001, -0.09699917,  0.46900252, ..., -0.33114347,
          0.18502058, -0.27423012]], dtype=float32),
 array([[-0.11759937, -0.04552874,  0.22546595, ...,  0.21812831,
         -0.33964303, -0.33022305],
        [-0.26380852, -0.27671477, -0.33576807, ...,  0.15142155,
         -0.04612424, -0.74970037]], dtype=float32)]

2018-11-14 10:33:26,734 INFO: 1 batches, avg len: 3.3
[array([[-0.25601366, -0.10413959,  0.45184097, ..., -0.34171066,
          0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.25601366, -0.10413959,  0.45184097, ..., -0.34171066,
          0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.12085894, -0.05347676,  0.18303208, ...,  0.22256255,
         -0.37257898, -0.39672664],
        [-0.21205096, -0.31738985, -0.34304047, ...,  0.24654591,
         -0.07900852, -0.710617  ]], dtype=float32)]
So as you can see, the output differs when I run sents2elmo twice. Is this normal or a bug? If it's normal, how can I prevent it from happening?
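Run-to-run differences like this usually mean dropout is still active at inference time. If the Embedder exposes its underlying PyTorch module (an assumption about the internals; the attribute name below is a guess), switching it to eval mode should make repeated calls identical:

from elmoformanylangs import Embedder

e = Embedder('/Users/tanh/Desktop/alt/JapaneseElmo')
e.model.eval()   # assumed attribute: puts the biLM in eval mode, disabling dropout

sents = [['今'], ['潮水', '退']]
res1 = e.sents2elmo(sents)
res2 = e.sents2elmo(sents)   # should now match res1 element-wise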

A bug on elmo.py

Hi, I tried to train an ELMo model by running python src/train.py -h, but I got a SyntaxError: invalid syntax in elmo.py, line 134.

Training problem with the original bilm-tf

Hello, I want to compare your toolkit against the original Allen tensorflow ELMo training package, but when running that toolkit's unit tests I get an error saying the process cannot access a file because it is being used by another program. Have you run into this as well? Thanks.

Multi-GPU training problem

Hi, I tried multi-GPU training by wrapping the model and optimizer:
model = nn.DataParallel(model, device_ids=gpu_ids)
optimizer = nn.DataParallel(optimizer, device_ids=gpu_ids)

But I ran into this error:
terminate called after throwing an instance of 'std::runtime_error'
what(): cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCStorage.c:184

Single-GPU runs are fine. Do you have any clue about this bug?
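One general PyTorch note that may be relevant here (not specific to this repo): nn.DataParallel wraps nn.Module instances only; an optimizer is not a module, and wrapping one is unsupported. The usual pattern wraps just the model and builds the optimizer over its parameters:

import torch.nn as nn
import torch.optim as optim

# `model` and `gpu_ids` as in the snippet above.
model = nn.DataParallel(model, device_ids=gpu_ids).cuda()
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # do NOT wrap the optimizer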

Problem for using pre-trained representations and possible solutions

Hi,

I followed the instructions as given and had the hdf5-encoded dict dumped onto the disk. Everything worked well until I loaded the model using AllenNLP ELMo, as in the code presented here: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "path/to/my/option/json"
weight_file = "path/to/the/pretrained/hdf5/in/last/step"
elmo = Elmo(options_file, weight_file, 2, dropout=0)
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

My options file is:

{"lstm": {"use_skip_connections": true, "projection_dim": 512, "cell_clip": 3, "proj_clip": 3, "dim": 4096, "n_layers": 2}, "char_cnn": {"activation": "relu", "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "n_highway": 2, "embedding": {"dim": 16}, "n_characters": 262, "max_characters_per_token": 50}}

which is the same as in the original paper (not entirely sure; please let me know if I made a mistake), because

We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN

However, when I execute this script, the following error shows up:

Traceback (most recent call last):
  File "output_elmo_vec.py", line 15, in <module>
    elmo = Elmo(options_file, weight_file, 2, dropout=0)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 99, in __init__
    vocab_to_cache=vocab_to_cache)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 499, in __init__
    self._token_embedder = _ElmoCharacterEncoder(options_file, weight_file, requires_grad=requires_grad)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 285, in __init__
    self._load_weights()
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 373, in _load_weights
    self._load_char_embedding()
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 380, in _load_char_embedding
    char_embed_weights = fin['char_embed'][...]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'char_embed' doesn't exist)"

I searched on Google and found that this particular 'char_embed doesn't exist' error is actually caused by customized biLM training:
allenai/allennlp#1521 (comment)

According to this post, it seems AllenNLP will not officially support customized pre-trained models without the 'char_cnn' params. May I suggest keeping this parameter during training, so that the pretrained model has this part?

Thank you very much for your reply. ;)
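A quick way to confirm this diagnosis is to list the weight file's top-level keys and see whether 'char_embed' is indeed absent (plain h5py, nothing repo-specific):

import h5py

with h5py.File('path/to/the/pretrained/hdf5/in/last/step', 'r') as f:
    print(list(f.keys()))   # AllenNLP's loader expects 'char_embed' among these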

Installable pip package

Hello,

Thank you for these great embeddings, really nice to have the possibility to use them, especially the french one in my case.

I wonder if you are planning to create an installable pip package, or whether you would like someone, possibly me, to try?

This could be very useful for using these embeddings in other libraries such as Flair.

Thank you in advance for your answer.

Amaury

tokenization and fine tuning for Japanese

How do I use the Japanese ELMo on my own corpus? The paper indicates that the SCIR tokenizer is used. Is a script provided for this? I would also like to fine-tune the model on my corpus. What is the best way to do that?

Tokenization details

In the readme file it is written "Do remember tokenization!". What type of tokenization is needed? Do we need to give case-sensitive or case-insensitive input to the model? Is there any normalization involved?
