elmoformanylangs's People

Contributors

alongwy, angledluffa, bozheng-hit, frankier, oneplus, strubell, voidism


elmoformanylangs's Issues

How to get the embedding for each word in the sentence?

Hi,

I am struggling to get the embedding for individual words. I used this command:

python -m elmoformanylangs test --input_format conll --input input.conllu --model ar.model --output_prefix ./output/ --output_format hdf5 --output_layer -1

And it dumps an hdf5-encoded file onto the disk, as expected. However, as far as I understand, the file encodes a dict where each key is a tab-separated sentence and the value is its representation.

But when I print the key:


import h5py

f = h5py.File(filename, 'r')

for key in list(f.keys()):
    print(key)

I can see that f.keys() contains only a single string key covering all sentences in the input file. 1) Why, and how do I get individual sentence representations? 2) How do I get individual word representations?

This is an example of my input with 2 sentences:

1	ik	ik	PRON	VNW|pers|pron|nomin|vol|1|ev	Case=Nom|Person=1|PronType=Prs	2	nsubj	2:nsubj	_
2	zie	zien	VERB	WW|pv|tgw|ev	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	hem	hem	PRON	VNW|pers|pron|obl|vol|3|ev|masc	Case=Acc|Person=3|PronType=Prs	2	obj	2:obj|4:nsubj:xsubj	_
4	fietsen	fietsen	VERB	WW|inf|vrij|zonder	VerbForm=Inf	2	xcomp	2:xcomp	_
1	Jan	Jan	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	nsubj	2:nsubj	_
2	komt	komen	VERB	WW|pv|tgw|met-t	Number=Sing|Tense=Pres|VerbForm=Fin	0	root	0:root	_
3	vandaag	vandaag	ADV	BW	_	2	advmod	2:advmod	_
4	en	en	CCONJ	VG|neven	_	5	cc	5.1:cc	_
5	Piet	Piet	PROPN	N|eigen|ev|basis|zijd|stan	Gender=Com|Number=Sing	2	conj	5.1:nsubj	_ 
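For what it's worth, if each hdf5 key really were a tab-separated sentence with a (num_words, dim) value, per-word vectors would come from ordinary row indexing. A minimal sketch under that assumption (the key layout and the file path below are inferred from the description above, not confirmed):

import h5py

filename = './output/output.hdf5'   # hypothetical path under the --output_prefix used above
with h5py.File(filename, 'r') as f:
    for key in f.keys():
        words = key.split('\t')     # assumption: the key is the tab-separated sentence
        vectors = f[key][...]       # assumption: shape (num_words, dim)
        for i, word in enumerate(words):
            print(word, vectors[i][:5])   # first 5 dimensions of this word's vector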

Help Provide The Original Training Commands

@Oneplus Thanks for your great work! I wonder if you could share the original commands for training, especially for training the simplified-Chinese ELMo model?

Training Your Own ELMo

Please run

python -m elmoformanylangs.biLM train -h

Question about fine-tuning

Hello,
I want to confirm one thing: this model currently has no fine-tuning functionality, right? (Or have I just not read the code carefully enough?)
If not, I will try to write one on top of your work. Thanks a lot!

Relative paths in config.json

Thank you so much for sharing this repo. Just two questions:
1- config.json has many relative paths; do we have to change them all?
2- The English language zip file is named '144'; I am renaming it to 'English' and copying it inside the ELMO folder.

Now I am getting this encoding error, any ideas?
NB: I am using Python 3.6.8

from elmoformanylangs import Embedder

e = Embedder('/home/malrawi/Desktop/My Programs/ELMO/English/')
Traceback (most recent call last):

  File "<ipython-input-15-3e3369c99ad2>", line 1, in <module>
    e = Embedder('/home/malrawi/Desktop/My Programs/ELMO/English/')

  File "/home/malrawi/Documents/ELMO/elmoformanylangs/elmo.py", line 107, in __init__

  File "/home/malrawi/Documents/ELMO/elmoformanylangs/elmo.py", line 115, in get_model

  File "/home/malrawi/anaconda3/lib/python3.6/json/__init__.py", line 299, in load
    parse_constant=parse_constant, object_pairs_hook=object_pairs_hook, **kw)

  File "/home/malrawi/anaconda3/lib/python3.6/json/__init__.py", line 344, in loads
    s, 0)

JSONDecodeError: Unexpected UTF-8 BOM (decode using utf-8-sig)
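The exception message itself points at the fix: the file starts with a UTF-8 BOM. A possible workaround (a sketch, assuming the offending file is the model's config.json) is to re-save it BOM-free:

import json

path = '/home/malrawi/Desktop/My Programs/ELMO/English/config.json'
with open(path, encoding='utf-8-sig') as f:    # 'utf-8-sig' strips the BOM on read
    config = json.load(f)
with open(path, 'w', encoding='utf-8') as f:   # write back without a BOM
    json.dump(config, f, indent=2)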

Question about the training loss

I am training on my own Chinese corpus. Could you tell me roughly what the training loss was during training of the released pre-trained Chinese model? Thanks!

typo in readme.md

The Python module name in the "Training Your Own ELMo" section,

python -m elmoformanylang.biLM train -h

should be

python -m elmoformanylangs.biLM train -h

Different results for the same input

e = Embedder(path)
sents = [['Sam', 'goes', 'to', 'the', 'bank', 'every', 'day', '.']]

res1 = e.sents2elmo(sents)
res2 = e.sents2elmo(sents)

Getting all layers

Hi,

I was wondering whether there is a simple way to modify your code to return the 3 layers of the biLSTM, as in Peters et al. (2018), so as to train the task-specific weights for the weighted average.

Thank you
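If the sents2elmo signature matches the README, no code modification should be needed: its output_layer argument reportedly accepts -2 to return all three layers rather than their average (hedged; verify against your copy of elmo.py):

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['this', 'is', 'a', 'sentence']]
# output_layer: 0 = word encoder, 1/2 = LSTM layers, -1 = average, -2 = all 3 layers
all_layers = e.sents2elmo(sents, output_layer=-2)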

How to use elmo embedding?

Hi, I tried to dump the ELMo embeddings to txt and hdf5 files, but the files are huge!
My validation data has about 70k sentences; the txt file is nearly 20 GB and the hdf5 nearly 15 GB. How should I use these files?

Thanks

How to use ELMo in a classification task

Hello, I have downloaded the simplified-Chinese ELMo model and run the basic operations from this GitHub page.
However, I don't quite understand how to use the PyTorch ELMo pre-trained model in a downstream classification task. Some issues suggest following the word2vec usage pattern, but word2vec gives static word vectors, and I have also seen issues saying that HIT's ELMo currently cannot be combined with allennlp.

Could you provide some PyTorch usage examples? Many thanks.
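One pattern that does not depend on allennlp (a hedged sketch, not an official example from the authors) is to mean-pool the per-token vectors from sents2elmo into a sentence vector and train an ordinary PyTorch classifier on top:

import numpy as np
import torch
import torch.nn as nn
from elmoformanylangs import Embedder

e = Embedder('/path/to/zhs.model/')
sents = [['今天', '天气', '真', '好'], ['这', '部', '电影', '很', '差']]

# sents2elmo returns one (num_words, dim) array per sentence (dim is 1024
# for the released models); mean-pool over words for a fixed-size feature.
feats = torch.tensor(np.stack([v.mean(axis=0) for v in e.sents2elmo(sents)]))

classifier = nn.Linear(feats.shape[1], 2)   # 2 classes; train with CrossEntropyLoss
logits = classifier(feats)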

Is there a pretrained English model with word_dim=512?

Hi,

I downloaded the pretrained English model and found that word_dim is set to 100. Since the config file has a default word_dim of 512, I am wondering whether a version with word_dim=512 is available. Thanks!

Installation on Google Cloud Computing Engine

Hi!
I have problems with the installation on Google Cloud.
I followed the instructions in a console and it succeeded; importing this module from Python in the console works.
BUT
when I try to import it in Jupyter, I get an import error.
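A frequent cause of "works in the console, fails in Jupyter" is that the notebook kernel runs a different Python environment than the shell. A quick, generic check (not specific to this repo):

import sys
print(sys.executable)   # compare with `which python` in the console
# If the two differ, install the package into the kernel's environment, e.g.:
#   /path/to/kernel/python -m pip install -e /path/to/ELMoForManyLangs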

Is it necessary to add <BOS> <EOS> tokens?

Hi,
While using ELMoForManyLangs programmatically, do I need to add <bos> <eos> tokens manually to the sentences? Specifically, could you say which of the usages below is correct?

from elmoformanylangs import Embedder

# Option-1 
e = Embedder('/path/to/your/model/')
sents = [['this','is','the','first','sentence'],['another','sentence']]
e.sents2elmo(sents)

# Option-2
e = Embedder('/path/to/your/model/')
sents = [['<bos>','this','is','the','first','sentence','<eos>'],['<bos>','another','sentence','<eos>']]
e.sents2elmo(sents)

Input file demonstration

Hi,

Thanks for maintaining this repo.

Could you provide an input file example?
The example in the README doesn't explain itself well,
and it is still hard for me to fit my input into the format.

Thanks.

The English model is a bit strange

I used your pre-trained models for several languages in my own model and they all worked very well, giving large performance gains. But the English model did not: it caused a large performance drop. Is there an explanation for this?

cuda out of memory

When using the simplified-Chinese model, a slightly longer sentence or a slightly larger batch triggers this error:
RuntimeError: cuda runtime error (2) : out of memory at /opt/conda/conda-bld/pytorch_1524584710464/work/aten/src/THC/generic/THCStorage.c
The error occurs even with CUDA_VISIBLE_DEVICES set to multiple GPUs, so it does not feel like the model is simply too large.

Cannot find the output hdf5 file

Hello, my command is: python -m elmoformanylangs test --input_format conll --input input.conllu --model zhs.model --output_prefix ./out/ --output_format hdf5 --output_layer -1
Why can't I find the generated hdf5 file? Thanks.

Lemma column needed in conllu input?

The example in the readme shows conllu code with the ID, form, and lemma columns populated. However, none of the read_conll_*corpus functions in __main__.py read the lemma column. Do you read the lemma column anywhere else, or do you plan to use it in the near future? Do I need a lemmatiser?

Of course, udpipe users can get the lemmatisation from udpipe, and one will want to use udpipe with the offered models since udpipe's tokenisation differs from other popular tokenisers. The lemma column would, however, be a difficulty if I used my own tokeniser and trained my own models (as described in the readme).

The number of training instances looks wrong

Hello, my training command is
python -m elmoformanylangs.biLM train --train_path /home/data/peter/intent_classification/c2_intent_ltp_sent.txt --config_path /home/data/peter/ELMoForManyLangs/pretrained_model/configs/cnn_50_100_512_4096_sample.json --model /home/data/peter/ELMoForManyLangs/elmo_train_model

My training data has 64,629 lines, but I see that ELMo actually used only about 9,000+ instances during training.

2019-01-28 12:48:07,723 INFO: training instance: 9376, training tokens: 459418.
2019-01-28 12:48:07,785 INFO: Truncated word count: 0.
2019-01-28 12:48:07,785 INFO: Original vocabulary size: 14878.
2019-01-28 12:48:07,919 INFO: Word embedding size: 14880
2019-01-28 12:48:08,145 INFO: Char embedding size: 3292
2019-01-28 12:48:24,806 INFO: 293 batches, avg len: 50.0
2019-01-28 12:48:24,807 INFO: Evaluate every 293 batches.
2019-01-28 12:48:24,807 INFO: vocab size: 14880

Zero Division Error

Hello,

I am running the sents2elmo function with a custom dataset (Reddit data) and I am running into a ZeroDivisionError. The error seems to happen at line 202 in elmo.py. What is happening, or what in my input is causing this?

Thanks in advance
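Without the input it is hard to be certain, but a zero division around batch statistics suggests an empty sentence may have slipped in. Filtering out empty token lists before embedding is a cheap check (a guess, not a confirmed diagnosis):

from elmoformanylangs import Embedder

e = Embedder('/path/to/your/model/')
sents = [['a', 'tokenized', 'sentence'], []]   # note the empty sentence
sents = [s for s in sents if len(s) > 0]       # drop empties before embedding
embeddings = e.sents2elmo(sents)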

Some questions about training params and times

Hi, I am trying to train my own ELMo using your training script, but it takes a lot of time.
I notice that "The training of ELMo on one language takes roughly 3 days on an NVIDIA P100 GPU."
Can you tell me the size of one language's corpus? Does the whole training process mean 100 epochs?

Also, can you share your params, such as learning rate, learning rate decay, batch_size, and optimizer?

Thanks

Error when training my own ELMo

Hello, my development environment is Python 3.6 and I am using the latest version. The training data is a txt file with 130,000 lines, one sentence per line, tokenized by spaces. The config file is cnn_50_100_512_4096_sample.json. During training, an error is raised at line 113 of classify_layer.py. The error message is as follows:
Traceback (most recent call last):
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1664, in <module>
    main()
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1658, in main
    globals = debugger.run(setup['file'], None, None, is_module)
  File "/Applications/PyCharm.app/Contents/helpers/pydev/pydevd.py", line 1068, in run
    pydev_imports.execfile(file, globals, locals)  # execute the script
  File "/Applications/PyCharm.app/Contents/helpers/pydev/_pydev_imps/_pydev_execfile.py", line 18, in execfile
    exec(compile(contents+"\n", file, 'exec'), glob, loc)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 696, in <module>
    train()
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 613, in train
    train, valid, test, best_train, best_valid, test_result)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 340, in train_model
    loss_forward, loss_backward = model.forward(w, c, masks)
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/biLM.py", line 241, in forward
    self.classify_layer.update_negative_samples(word_inp, chars_inp, mask_package[0])
  File "/Users/chenmeng/PycharmProjects/ELMoForManyLangs-master/elmoformanylangs/modules/classify_layer.py", line 113, in update_negative_samples
    word = word_inp[i][j].tolist()
AttributeError: 'int' object has no attribute 'tolist'

Thanks for taking the time to answer my question!

Unable to download pretrained models

First of all, thanks for making this repo! However, when I try to download a pre-trained model, the download quits halfway because the connection to the server is lost. Could you maybe host the models somewhere else? Thanks!

Cannot load embedder

I installed ELMoForManyLangs as instructed with python setup.py install and downloaded a custom language model (Swedish).

I extracted the files from the Swedish language model to a folder called elmo_sv.

from elmoformanylangs import Embedder
e = Embedder('elmo_sv/')

Error:

FileNotFoundError: [Errno 2] No such file or directory: '/Users/yijialiu/work/projects/conll2018/models/word_elmo/cnn_50_100_512_4096_sample.json'

Something seems to be hardcoded to the user yijialiu - that's not me!
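The usual fix reported for this is that the downloaded model's config.json carries the author's absolute config_path; repointing it at the config file shipped alongside the model makes loading work. A sketch (the config_path field name is inferred from the error above; verify it in your config.json):

import json

with open('elmo_sv/config.json') as f:
    cfg = json.load(f)
cfg['config_path'] = 'cnn_50_100_512_4096_sample.json'   # point at the local copy
with open('elmo_sv/config.json', 'w') as f:
    json.dump(cfg, f, indent=2)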

Different output vectors for same sentences

Hi, I am using ELMo for Japanese. Here is my code:

from elmoformanylangs import Embedder
e = Embedder('/Users/tanh/Desktop/alt/JapaneseElmo')

if __name__ == '__main__':
    sents = [
        ['今'],
        ['今'],
        ['潮水', '退']
    ]
    print(e.sents2elmo(sents))
    print(e.sents2elmo(sents))

And here is the console output:

2018-11-14 10:33:26,441 INFO: 1 batches, avg len: 3.3
[array([[-0.23187001, -0.09699917,  0.46900252, ..., -0.33114347,
          0.18502058, -0.27423012]], dtype=float32),
 array([[-0.23187001, -0.09699917,  0.46900252, ..., -0.33114347,
          0.18502058, -0.27423012]], dtype=float32),
 array([[-0.11759937, -0.04552874,  0.22546595, ...,  0.21812831,
         -0.33964303, -0.33022305],
        [-0.26380852, -0.27671477, -0.33576807, ...,  0.15142155,
         -0.04612424, -0.74970037]], dtype=float32)]

2018-11-14 10:33:26,734 INFO: 1 batches, avg len: 3.3
[array([[-0.25601366, -0.10413959,  0.45184097, ..., -0.34171066,
          0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.25601366, -0.10413959,  0.45184097, ..., -0.34171066,
          0.18976462, -0.2817447 ]], dtype=float32),
 array([[-0.12085894, -0.05347676,  0.18303208, ...,  0.22256255,
         -0.37257898, -0.39672664],
        [-0.21205096, -0.31738985, -0.34304047, ...,  0.24654591,
         -0.07900852, -0.710617  ]], dtype=float32)]
So as you can see, the output differs when I run sents2elmo twice. Is this normal or a bug? If it's normal, how can I prevent it from happening?
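Run-to-run differences like this usually mean dropout is still active at inference time. If the Embedder exposes its underlying PyTorch module (an assumption about the internals; the attribute name below is a guess), switching it to eval mode should make repeated calls identical:

from elmoformanylangs import Embedder

e = Embedder('/Users/tanh/Desktop/alt/JapaneseElmo')
e.model.eval()   # assumed attribute: puts the biLM in eval mode, disabling dropout

sents = [['今'], ['潮水', '退']]
res1 = e.sents2elmo(sents)
res2 = e.sents2elmo(sents)   # should now match res1 element-wise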

A bug on elmo.py

Hi, I tried to train an ELMo model by running python src/train.py -h, but I got a SyntaxError: invalid syntax in elmo.py, line 134.

Training problem with the original bilm-tf

Hello, I want to compare your toolkit against the original Allen tensorflow ELMo training package, but when running that toolkit's unit tests I get an error saying the process cannot access a file because it is being used by another program. Have you run into this as well? Thanks.

Multi-GPU training problem

Hi, I tried multi-GPU training by wrapping the model and optimizer:
model = nn.DataParallel(model, device_ids=gpu_ids)
optimizer = nn.DataParallel(optimizer, device_ids=gpu_ids)

But I ran into this error:
terminate called after throwing an instance of 'std::runtime_error'
what(): cuda runtime error (59) : device-side assert triggered at /pytorch/aten/src/THC/generic/THCStorage.c:184

Single-GPU runs are fine. Do you have any clue about this bug?
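One general PyTorch note that may be relevant here (not specific to this repo): nn.DataParallel wraps nn.Module instances only; an optimizer is not a module, and wrapping one is unsupported. The usual pattern wraps just the model and builds the optimizer over its parameters:

import torch.nn as nn
import torch.optim as optim

# `model` and `gpu_ids` as in the snippet above.
model = nn.DataParallel(model, device_ids=gpu_ids).cuda()
optimizer = optim.Adam(model.parameters(), lr=1e-3)   # do NOT wrap the optimizer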

Problem for using pre-trained representations and possible solutions

Hi,

I followed the instructions as given and had the hdf5-encoded dict dumped onto the disk. Everything worked well until I loaded the model using AllenNLP ELMo, as in the code presented here: https://github.com/allenai/allennlp/blob/master/tutorials/how_to/elmo.md

from allennlp.modules.elmo import Elmo, batch_to_ids

options_file = "path/to/my/option/json"
weight_file = "path/to/the/pretrained/hdf5/in/last/step"
elmo = Elmo(options_file, weight_file, 2, dropout=0)
sentences = [['First', 'sentence', '.'], ['Another', '.']]
character_ids = batch_to_ids(sentences)
embeddings = elmo(character_ids)

My options file is:

{"lstm": {"use_skip_connections": true, "projection_dim": 512, "cell_clip": 3, "proj_clip": 3, "dim": 4096, "n_layers": 2}, "char_cnn": {"activation": "relu", "filters": [[1, 32], [2, 32], [3, 64], [4, 128], [5, 256], [6, 512], [7, 1024]], "n_highway": 2, "embedding": {"dim": 16}, "n_characters": 262, "max_characters_per_token": 50}}

which is the same as in the original paper (not entirely sure; please let me know if I made a mistake), because

We use the same hyperparameter settings as Peters et al. (2018) for the biLM and the character CNN

However, when I execute this script, the following error shows up:

Traceback (most recent call last):
  File "output_elmo_vec.py", line 15, in <module>
    elmo = Elmo(options_file, weight_file, 2, dropout=0)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 99, in __init__
    vocab_to_cache=vocab_to_cache)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 499, in __init__
    self._token_embedder = _ElmoCharacterEncoder(options_file, weight_file, requires_grad=requires_grad)
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 285, in __init__
    self._load_weights()
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 373, in _load_weights
    self._load_char_embedding()
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/allennlp/modules/elmo.py", line 380, in _load_char_embedding
    char_embed_weights = fin['char_embed'][...]
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "/user01/data/anaconda2/envs/allennlpenv/lib/python3.6/site-packages/h5py/_hl/group.py", line 177, in __getitem__
    oid = h5o.open(self.id, self._e(name), lapl=self._lapl)
  File "h5py/_objects.pyx", line 54, in h5py._objects.with_phil.wrapper
  File "h5py/_objects.pyx", line 55, in h5py._objects.with_phil.wrapper
  File "h5py/h5o.pyx", line 190, in h5py.h5o.open
KeyError: "Unable to open object (object 'char_embed' doesn't exist)"

I searched on Google and found that this particular 'char_embed doesn't exist' error is actually caused by customized biLM training:
allenai/allennlp#1521 (comment)

According to this post, it seems AllenNLP will not officially support customized pre-trained models without the 'char_cnn' params. May I suggest keeping this parameter during training, so that the pretrained model has this part?

Thank you very much for your reply. ;)
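A quick way to confirm this diagnosis is to list the weight file's top-level keys and see whether 'char_embed' is indeed absent (plain h5py, nothing repo-specific):

import h5py

with h5py.File('path/to/the/pretrained/hdf5/in/last/step', 'r') as f:
    print(list(f.keys()))   # AllenNLP's loader expects 'char_embed' among these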

Installable pip package

Hello,

Thank you for these great embeddings, really nice to have the possibility to use them, especially the french one in my case.

I wonder if you are planning to create an installable pip package, or whether you would like someone, possibly me, to try?

This could be very useful for using these embeddings in other libraries such as Flair.

Thank you in advance for your answer.

Amaury

tokenization and fine tuning for Japanese

How do I use the Japanese ELMo on my own corpus? The paper indicates that the SCIR tokenizer is used. Is a script provided for this? I would also like to fine-tune the model on my corpus. What is the best way to do that?

Tokenization details

In the readme file it is written "Do remember tokenization!". What type of tokenization is needed? Do we need to give case-sensitive or case-insensitive input to the model? Is there any normalization involved?
