awslabs / mlm-scoring
Python library & examples for Masked Language Model Scoring (ACL 2020)
Home Page: https://www.aclweb.org/anthology/2020.acl-main.240/
License: Apache License 2.0
I cloned the repo locally, then ran pip install -e . and pip install mxnet-mkl, but I get the error:
ERROR: Could not find a version that satisfies the requirement mxnet-mkl (from versions: none)
ERROR: No matching distribution found for mxnet-mkl
How can I fix it?
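Not an official answer, but a likely cause: the mxnet-mkl wheels stopped being published, so on newer Python versions pip finds no matching distribution. Since MXNet 1.6 the MKL-DNN (oneDNN) backend is enabled in the standard build, so installing the plain package may be all you need:

pip install mxnet  # MKL-DNN is on by default in MXNet >= 1.6, so the -mkl variant is unnecessary

If the repo's setup pins mxnet-mkl explicitly, editing that requirement to mxnet before re-running pip install -e . is a possible workaround (assuming nothing else depends on the -mkl package name).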
Hi there,
I'm using the community model 'bert-base-chinese' from HuggingFace to fine-tune masked LMs, and I get the following error:
ValueError:
Model 'BertForMaskedLMOptimized' is not supported by the scorer 'RegressionFinetuner'.
What can I do to solve this issue?
Thanks!
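In case it helps anyone hitting the same wall: the error suggests the fine-tuner only accepts the MXNet-backed model classes, not the PyTorch BertForMaskedLMOptimized wrapper that HuggingFace names resolve to. A sketch of a workaround under that assumption (the MXNet model name below is a guess; substitute whichever GluonNLP BERT variant fits Chinese text):

import mxnet as mx
from mlm.models import get_pretrained

# Assumption: RegressionFinetuner supports MXNet (GluonNLP) models only, so load
# an MXNet checkpoint instead of a HuggingFace community model for fine-tuning.
ctxs = [mx.gpu(0)]
model, vocab, tokenizer = get_pretrained(ctxs, 'bert-base-multi-cased')  # hypothetical model name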
Hi
I am trying to use the DistilRoBERTa model with this code. I can see there is a class for DistilBERT, loaded from the transformers library, here. Is there a way I could also use it for DistilRoBERTa, given that it has no dedicated class in the transformers source and only a model card?
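One thing worth trying (not verified): DistilRoBERTa shares RoBERTa's architecture, just with fewer layers, so if get_pretrained forwards unrecognized names to transformers, the hub id may load through the existing RoBERTa path:

import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import MLMScorerPT

ctxs = [mx.cpu()]
# Assumption: 'distilroberta-base' loads via the RoBERTa classes; untested against this repo.
model, vocab, tokenizer = get_pretrained(ctxs, 'distilroberta-base')
scorer = MLMScorerPT(model, vocab, tokenizer, ctxs)
print(scorer.score_sentences(["Hello world!"]))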
I'd quite like to use this library to score the output from my RoBERTa model, but it's implemented with huggingface transformers version 4.x and this library requires 3.3.1 (and that also ended up installing tokenizers-0.8.1rc2 for some reason).
It would be nice if this library could be upgraded to the latest version.
I don't want to download the vocab file because I want to work offline, so I would like to pass a parameter to get_pretrained that points at local files. From reading the code, I don't think this is currently possible. Would you fix it?
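Until such a parameter exists, one possible workaround: transformers' own from_pretrained accepts a filesystem path, so if get_pretrained forwards the model name unchanged, pointing it at a pre-downloaded directory may avoid any network access (untested sketch; the path is a placeholder):

import mxnet as mx
from mlm.models import get_pretrained

ctxs = [mx.cpu()]
# Assumption: get_pretrained passes the name through to transformers.from_pretrained,
# which accepts a local directory holding config.json, the weights, and the vocab files.
model, vocab, tokenizer = get_pretrained(ctxs, '/path/to/local/bert-base-uncased')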
Hi,
I need to score a rather large number of sentences for a downstream task. I'm experimenting with models supported by HuggingFace, with no fine-tuning, e.g.:
import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import MLMScorerPT
from mlm.loaders import Corpus  # module path assumed
ctxs = [mx.gpu(0)]
mlms_model, vocab, tokenizer = get_pretrained(ctxs, 'albert-base-v2')
scorer = MLMScorerPT(mlms_model, vocab, tokenizer, ctxs)
sentences = ...  # 1847 sentences
corpus = Corpus.from_text(sentences)
scores = scorer.score(corpus, 1.0, 500)  # split size lowered to avoid GPU out-of-memory errors
Depending on the model and scorer I get wildly different runtimes on my computer when encoding the 1847 sentences.
I expected, perhaps naively, that ALBERT and DistilBERT would be much faster due to reduced dimensionality and number of layers.
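For anyone wanting to reproduce the comparison, a minimal timing harness along the lines of the snippet above (the model names are just examples; substitute your own sentence list):

import time
import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import MLMScorerPT

ctxs = [mx.gpu(0)]
sentences = ["Hello world!"] * 100  # substitute the 1847 sentences here
for name in ['albert-base-v2', 'distilbert-base-cased', 'bert-base-uncased']:
    model, vocab, tokenizer = get_pretrained(ctxs, name)
    scorer = MLMScorerPT(model, vocab, tokenizer, ctxs)
    start = time.perf_counter()
    scorer.score_sentences(sentences)
    print(name, round(time.perf_counter() - start, 1), 'seconds')

Note that PLL scoring masks each token in turn, so the number of forward passes scales with the total token count; a smaller model lowers the per-pass cost, but the pass count dominates.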
I am trying to use this package's command-line interface in a similar fashion to the README's example:
mlm score \
--mode hyp \
--model bert-base-en-uncased \
--gpus 0 \
examples/asr-librispeech-espnet/data/dev-other.am.json \
> examples/demo/dev-other-3.lm.json
However, I see that it uses only around 601 MB of GPU memory, which is much less than what the GPU supports (12 GB). Is there any way to increase the batch size when using mlm score? It seems that the --split-size argument would do something like this; is that right?
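If --split-size is indeed the per-device batch knob, something like the following should raise utilization (2000 is an arbitrary starting point to tune against a 12 GB card; this is inferred from the flag name, not confirmed from the docs):

mlm score \
--mode hyp \
--model bert-base-en-uncased \
--gpus 0 \
--split-size 2000 \
examples/asr-librispeech-espnet/data/dev-other.am.json \
> examples/demo/dev-other-3.lm.json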
Thank you for the amazing work. I am trying to use XLM models for scoring, but I hit the bug below when using xlm-roberta-base/large.
(base) bill@ink-molly:~/MickeyProbes$ python probe_generation/sent_scoring.py
/home/bill/anaconda3/lib/python3.7/site-packages/mxnet/optimizer/optimizer.py:167: UserWarning: WARNING: New optimizer gluonnlp.optimizer.lamb.LAMB is overriding existing optimizer mxnet.optimizer.optimizer.LAMB
Optimizer.opt_registry[name].__name__))
WARNING:root:Model 'xlm-roberta-large' not recognized as an MXNet model; treating as PyTorch model
Downloading: 100%|█████████████████████████████████████████████████████████████████████████████████████████████████████████| 513/513 [00:00<00:00, 257kB/s]
Can't set hidden_size with value 1024 for XLMConfig {
"architectures": [
"XLMRobertaForMaskedLM"
],
"attention_probs_dropout_prob": 0.1,
"bos_token_id": 0,
"eos_token_id": 2,
"hidden_act": "gelu",
"hidden_dropout_prob": 0.1,
"model_type": "xlm",
"pad_token_id": 1
}
Traceback (most recent call last):
File "probe_generation/sent_scoring.py", line 9, in <module>
model, vocab, tokenizer = get_pretrained(ctxs, 'xlm-roberta-large')
File "/home/bill/MickeyProbes/mlm-scoring/src/mlm/models/__init__.py", line 126, in get_pretrained
model, loading_info = transformers.XLMWithLMHeadModel.from_pretrained(model_fullname, output_loading_info=True)
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/modeling_utils.py", line 854, in from_pretrained
**kwargs,
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 316, in from_pretrained
return cls.from_dict(config_dict, **kwargs)
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 403, in from_dict
config = cls(**config_dict)
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_xlm.py", line 195, in __init__
super().__init__(pad_token_id=pad_token_id, bos_token_id=bos_token_id, **kwargs)
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 215, in __init__
raise err
File "/home/bill/anaconda3/lib/python3.7/site-packages/transformers/configuration_utils.py", line 212, in __init__
setattr(self, key, value)
AttributeError: can't set attribute
I am not sure whether it is a version issue. Would you please provide an example of running XLM models with this code? Thanks!
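A quick way to separate the two possible causes, using only standard transformers API (nothing specific to this repo):

# If this fails too, the installed transformers version is too old for XLM-R.
# If it succeeds, the bug is in mlm-scoring mapping 'xlm-roberta-*' names onto
# the incompatible XLMWithLMHeadModel/XLMConfig classes, as the traceback shows.
from transformers import XLMRobertaForMaskedLM, XLMRobertaTokenizer

tokenizer = XLMRobertaTokenizer.from_pretrained('xlm-roberta-large')
model = XLMRobertaForMaskedLM.from_pretrained('xlm-roberta-large')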
Hi there,
I'm using the PyTorch implementation with bert-base-uncased and I get the following error when the sentence contains only one token:
Traceback (most recent call last):
File "bert.py", line 28, in <module>
print(scorer.score_sentences(["Hello"]))
File ".../mlm-scoring/src/mlm/scorers.py", line 167, in score_sentences
return self.score(corpus, **kwargs)[0]
File ".../mlm-scoring/src/mlm/scorers.py", line 757, in score
out = out[list(range(split_size)), token_masked_ids]
IndexError: too many indices for tensor of dimension 1
It works fine with MXNet MLMs, but I need to use a community model from HuggingFace.
Thanks!
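Until this is fixed, a caller-side guard avoids the crash (a sketch reusing the scorer and HuggingFace tokenizer from the setup above; the single-token threshold just reflects the failing case):

# Skip inputs that tokenize to a single token, which triggers the IndexError above.
sentences = ["Hello", "Hello world!"]
scorable = [s for s in sentences if len(tokenizer.tokenize(s)) > 1]
print(scorer.score_sentences(scorable))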
Dear authors,
I have tried to change the pre-trained model to 'xlm-roberta-large', but I got this OSError message:
Can't load tokenizer for 'xlm-roberta-large'. If you were trying to load it from 'https://huggingface.co/models', make sure you don't have a local directory with the same name. Otherwise, make sure 'xlm-roberta-large' is the correct path to a directory containing all relevant files for a XLMTokenizer tokenizer.
Could you guide me on how to solve this problem?
I tried scoring sentences with the models mentioned here. Every model works fine except for gpt2-117m-en-cased and gpt2-345m-en-cased. The following error pops up:
Traceback (most recent call last):
File "sample.py", line 16, in <module>
print(scorer.score_sentences(["Hello world!"]))
File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 148, in score_sentences
return self.score(corpus, **kwargs)[0]
File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 396, in score
dataset = self.corpus_to_dataset(corpus)
File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 364, in corpus_to_dataset
ids_masked = self._ids_to_masked(ids_original)
File "/home/pandramish.vinay/mlm-scoring/src/mlm/scorers.py", line 329, in _ids_to_masked
mask_token_id = self._vocab.token_to_idx[self._vocab.mask_token]
AttributeError: 'Vocab' object has no attribute 'mask_token'
Any fixes?
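Probably expected behavior rather than a bug: GPT-2 is a causal LM with no mask token, which is exactly what the missing mask_token attribute reflects. If the README's split between scorers is the intended usage, the GPT-2 checkpoints should go through LMScorer instead of MLMScorer (a sketch following the README pattern):

import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import LMScorer

ctxs = [mx.cpu()]
model, vocab, tokenizer = get_pretrained(ctxs, 'gpt2-117m-en-cased')
scorer = LMScorer(model, vocab, tokenizer, ctxs)  # causal scorer; no [MASK] involved
print(scorer.score_sentences(["Hello world!"]))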
When trying to follow the steps stated in the maskless fine-tuning section (I even tried to use the exact model stated in the steps), I always receive:
60 @staticmethod
61 def _check_support(model) -> bool:
---> 62 raise NotImplementedError
Is the regression fine-tuner implemented for BERT models?
Hi,
It seems that support for PyTorch models is currently limited to BERT and XLM. Would it be possible to add support for lighter models, e.g. DistilBERT or ALBERT?
Do you think that using these models would hurt the performance of the scorers significantly?
Thanks!
Hello,
This is probably a silly question, but I'm having a hard time adapting mlm-scoring to use other public PyTorch RoBERTa models that are not on the list of supported models. Do you have any tutorials or materials on how to use self-trained or other public RoBERTa models with mlm-scoring? Any help would be much appreciated, and I apologize in advance in case this information is in the repository and I missed it.
Kind regards,
Danielly
Hi there,
As I understand your library, it works with models that are available from HuggingFace or GluonNLP.
Question: for a model that is not available in the model zoos of those two frameworks, e.g. a model I trained myself, how can I get this to work with a config.json, a pytorch_model.bin, and a vocab.txt file?
Best,
Phillip
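Not from the maintainers, but since HuggingFace checkpoints are just directories, one approach that may work is to put those three files in one folder and pass its path wherever a model id is accepted (assuming get_pretrained forwards paths to transformers.from_pretrained, as with the offline question above; untested):

import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import MLMScorerPT

ctxs = [mx.gpu(0)]
# Assumption: a directory containing config.json, pytorch_model.bin, and vocab.txt
# is accepted anywhere a HuggingFace model id is; '/path/to/my-model' is a placeholder.
model, vocab, tokenizer = get_pretrained(ctxs, '/path/to/my-model')
scorer = MLMScorerPT(model, vocab, tokenizer, ctxs)
print(scorer.score_sentences(["Hello world!"]))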
Hi, I am a little confused about rescoring for ASR and NMT.
Is the model further pre-trained on a domain corpus before rescoring (i.e., MLM training applied to domain data), or do you just use an open-source pre-trained model (RoBERTa or BERT trained on the Wikipedia/BookCorpus data)?
Hi,
I read the README and don't see how I can train my own RoBERTa or BERT model from scratch (with HuggingFace), save it to a checkpoint, and then integrate it with mlm-scoring.
Thanks.
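The HuggingFace half of the bridge is standard: save_pretrained writes the config, weights, and tokenizer files into one directory. Whether get_pretrained then accepts that directory as a model name is an assumption (see the local-files discussion above), but the saving step itself looks like this:

from transformers import RobertaConfig, RobertaForMaskedLM, RobertaTokenizerFast

config = RobertaConfig(vocab_size=32000)  # example config; adjust to your tokenizer
model = RobertaForMaskedLM(config)        # ... run your training loop here ...
tokenizer = RobertaTokenizerFast.from_pretrained('roberta-base')  # or your own tokenizer

model.save_pretrained('/path/to/my-roberta')      # writes config.json + pytorch_model.bin
tokenizer.save_pretrained('/path/to/my-roberta')  # writes the vocab/tokenizer files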
Hi there,
I'm facing an issue with your PyTorch implementation and some input sentences. E.g.
s = 'RT @HISPANlCPROBS : When u walk straight into the kitchen to eat & ur mom hits u with the " ya saludaste " #ThanksgivingWithHispanics https://…'
print(scorer.score_sentences([s]))
gives the following error:
RuntimeError: CUDA out of memory. Tried to allocate 20.00 MiB (GPU 0; 11.91 GiB total capacity; 451.65 MiB already allocated; 12.12 MiB free; 40.35 MiB cached)
I'm working on a server with three GPUs and tried setting ctxs = [mx.gpu(0)], ctxs = [mx.gpu(1)], ctxs = [mx.gpu(2)], and ctxs = [mx.cpu()], but I always get the same error about GPU 0. I'm wondering if this is hardcoded somewhere in your code? Changing the ctxs variable seems to have no effect.
Thanks.
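Even if device 0 really is hardcoded somewhere, you can usually sidestep it from outside the process: CUDA_VISIBLE_DEVICES remaps which physical GPU the process sees as device 0 (standard CUDA behavior, independent of this library; score.py stands in for your script):

CUDA_VISIBLE_DEVICES=1 python score.py  # the process now sees physical GPU 1 as gpu(0)

It won't explain the out-of-memory error itself, but it at least lets you test whether a less-loaded GPU makes it go away.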
I wanted to use this library to compute scores for MuRIL, which is based on BERT's MLM. It's not on HuggingFace as yet. How can I bridge the gap?
Hey,
Is there some way to just use a pre-trained model, give a sentence as input, and get its perplexity as the output?
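The scorers return a pseudo-log-likelihood (PLL): the sum of each token's log-probability with that token masked. The paper's pseudo-perplexity is then exp(-PLL / number of tokens). A sketch (the model choice and counting tokens via the HuggingFace tokenizer are my assumptions):

import math
import mxnet as mx
from mlm.models import get_pretrained
from mlm.scorers import MLMScorerPT

ctxs = [mx.cpu()]
model, vocab, tokenizer = get_pretrained(ctxs, 'bert-base-uncased')
scorer = MLMScorerPT(model, vocab, tokenizer, ctxs)

sentence = "Hello world!"
pll = scorer.score_sentences([sentence])[0]     # pseudo-log-likelihood of the sentence
num_tokens = len(tokenizer.tokenize(sentence))  # token count, excluding special tokens
print(math.exp(-pll / num_tokens))              # pseudo-perplexity (PPPL) from the paper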