simonepri / lm-scorer
📃 Language Model based sentences scoring library
License: MIT License
I noticed that the probability of the final <|endoftext|> token is included in scoring. For the purposes of scoring sentences, it seems to me that it would be more correct (for most use cases) to omit that one, because it doesn't really matter whether or not more text follows a given sentence. Doesn't the probability of an <|endoftext|> token following a sentence depend on the (somewhat arbitrary) details of how text was broken up for training?
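For anyone wanting to experiment with omitting it, here is a minimal sketch, assuming the (scores, ids, tokens) triple that tokens_score returns (the same format as the French/English examples further down this page):
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=1)

# tokens_score returns per-token log-probabilities, ids, and token strings.
log_probs, ids, tokens = scorer.tokens_score("The cat is on the roof.", log=True)

# Drop the trailing <|endoftext|> entry before reducing to a sentence score.
sentence_log_prob = sum(log_probs[:-1])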
Hi! Thanks for making this amazing package.
When I do:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 100
scorer = LMScorer.from_pretrained("distilgpt2", device=device, batch_size=batch_size)
And then:
scorer.sentence_score(["Sasgdkjlasdjglakjsdg", "Sentence 2"], log=True)
(or with any batch of sentences of different lengths), I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
714 if not is_tensor(value):
--> 715 tensor = as_tensor(value)
716
ValueError: expected sequence of length 237 at dim 1 (got 232)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
<ipython-input-15-08e953660864> in <module>
71 j+=1
72
---> 73 passages_relevance_scores = scorer.sentence_score(passages_relevance_text, reduce='hmean')
74 passages_trans_scores = scorer.sentence_score(passages_trans_text, reduce='hmean')
75 for i, passage in enumerate(passage_texts):
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/base.py in sentence_score(self, text, log, reduce)
31 return scores
32
---> 33 outputs = self._tokens_log_prob(sentences)
34 for output in outputs:
35 log_probs = output[0]
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/batch.py in _tokens_log_prob(self, text)
26 for i in range(0, len(text), self.batch_size):
27 batch = text[i : i + self.batch_size]
---> 28 outputs.extend(self._tokens_log_prob_for_batch(batch))
29 return outputs
30
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/gpt2.py in _tokens_log_prob_for_batch(self, text)
45 text = list(map(self._add_special_tokens, text))
46 encoding: BatchEncoding = self.tokenizer.batch_encode_plus(
---> 47 text, return_tensors="pt",
48 )
49 with torch.no_grad():
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2780 return_length=return_length,
2781 verbose=verbose,
-> 2782 **kwargs,
2783 )
2784
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
164 )
165
--> 166 return super()._batch_encode_plus(*args, **kwargs)
167
168 def _encode_plus(self, *args, **kwargs) -> BatchEncoding:
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
475 for input_ids in sanitized_tokens["input_ids"]:
476 self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
--> 477 return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
478
479 def _encode_plus(
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
208 self._n_sequences = n_sequences
209
--> 210 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
211
212 @property
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
730 )
731 raise ValueError(
--> 732 "Unable to create tensor, you should probably activate truncation and/or padding with"
733 " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"
734 f" features (`{key}` in this case) have excessive nesting (inputs type `list` where type `int` is"
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Would love any help whatsoever!
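In case it helps while this is open: a simple, if slow, workaround is to score the sentences one at a time instead of in a batch, which sidesteps the padding error. Patching the batch_encode_plus call in lm_scorer/models/gpt2.py to pass padding=True might also work, but I haven't verified that.
# Workaround sketch: avoid mixed-length batches by scoring sentences individually.
sentences = ["Sasgdkjlasdjglakjsdg", "Sentence 2"]
scores = [scorer.sentence_score(s, log=True) for s in sentences]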
Hi
I had a couple of queries.
I was wondering if you could direct me to the relevant part of the code, and recommend any changes, so that I can also calculate this score with my own fine-tuned GPT-2 model (which is saved at its own local path).
I was also wondering about tokenization: GPT-2 uses BPE encoding, yet the returned probability score is always for the complete word (not the subword units). As far as I understand BPE, it divides a token into sub-pieces and assigns IDs to those sub-pieces. Do you know how this works internally, such that the library is able to assign a probability to the complete word?
Thanks
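A possible direction for the first query, as an untested sketch: AutoLMScorer validates model names against a supported list, so a local checkpoint path may need to bypass it and go through the GPT-2 scorer class directly (class and argument names are taken from the tracebacks elsewhere on this page; the checkpoint path is a placeholder).
import torch
from lm_scorer.models.gpt2 import GPT2LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Placeholder path to a fine-tuned GPT-2 checkpoint saved with save_pretrained().
scorer = GPT2LMScorer("/path/to/my-finetuned-gpt2", device=device, batch_size=1)
On the BPE question: the model assigns a probability to each sub-piece, and by the chain rule the probability of the complete word is the product of its sub-piece probabilities (equivalently, the sum of their log-probabilities).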
It gives an error while loading gpt2:
File "/home/debanjan/anaconda3/envs/py_3/lib/python3.7/site-packages/lm_scorer/models/gpt2.py", line 84, in _supported_model_names
return GPT2LMHeadModel.pretrained_model_archive_map.keys()
AttributeError: type object 'GPT2LMHeadModel' has no attribute 'pretrained_model_archive_map'
Can you kindly check?
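If I recall correctly, the pretrained_model_archive_map attribute was removed from the model classes in newer transformers releases (around 2.11), so this looks like a version mismatch. Pinning an older transformers, for example pip install "transformers<2.11" (assuming lm-scorer's other requirements allow it), may be a stopgap until the library is updated.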
Running the CLI command
!lm-scorer -lp --cuda 0 sentences.txt
I got the following error:
2020-04-11 07:48:41.170455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: init() got an unexpected keyword argument 'device'
and without --cuda (!lm-scorer -lp sentences.txt) I got the following error:
2020-04-11 07:52:48.219135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: Device index must be non-negative, got -1
Please help.
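Until the --cuda flag is fixed, the Python API's device argument appears to work (assuming a recent enough lm-scorer; the init() error above suggests an older version may be installed, so upgrading first could help). A sketch mirroring the CLI behaviour:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

scorer = LMScorer.from_pretrained("gpt2", device="cuda:0", batch_size=1)

with open("sentences.txt") as f:
    for line in f:
        # The CLI's -lp prints log-probabilities; log=True does the same here.
        print(scorer.sentence_score(line.strip(), log=True))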
Is there a reason 3.8 isn't supported on PyPI? (I'm running 3.8 and pip doesn't have a compatible version.)
lm-scorer does not support Python versions >3.8. Since 3.10 and even 3.11 are out, with torch and transformers supporting 3.9, it may make sense to bump the versions up.
I would like to adapt this library to work with user-contributed multilingual models from the transformers library.
I tried to add another model class in a fork to handle AutoModelWithLMHead models here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py, just substituting the transformers model class (GPT2LMHeadModel -> AutoModelWithLMHead).
I am running into two (possibly related) issues with this approach.
First, it errors out on the line sent_logits[:, self.tokenizer.pad_token_id] = float("-inf"), with what seems to be an off-by-one indexing error.
/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
66 # logits.shape = [len(text[sent_index]) + 1, vocab_size]
67 sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68 sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
69 # ids_scores.shape = [seq_len + 1]
70 sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001
If I comment out this line and let it continue, I get back probabilities, but they seem odd: the probabilities of the first token and the <|endoftext|> token are both very low compared to the English model on a matched sentence. For example, compare French
([-13.103885650634766,
-7.141622066497803,
-2.2347683906555176,
-6.366621017456055,
-1.1687631607055664,
-3.626580238342285,
-10.760506629943848],
[2532, 5985, 327, 375, 295, 7536, 50257],
['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])
vs. English
([-2.4790897369384766,
-9.218439102172852,
-2.2219443321228027,
-5.678627967834473,
-0.41474056243896484,
-4.27750301361084,
-2.19716739654541,
-5.7754011154174805],
[464, 3797, 318, 319, 262, 9753, 13, 50256],
['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])
The same also holds for German (i.e. it follows the pattern of French), so I don't think it's a model-specific problem.
Any help appreciated figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel!
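Not a confirmed diagnosis, but the IndexError is suggestive: index 52001 is exactly the size of the logits dimension (valid indices 0..52000), which is what you would see if a pad token was added to the tokenizer without the model's output layer being resized to match. A sketch of the usual remedy (the model id and pad token below are placeholders):
from transformers import AutoModelWithLMHead, AutoTokenizer

model_name = "some-multilingual-gpt2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# Assumption: a pad token is added that the checkpoint doesn't already have.
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# Grow the embedding/output layers so pad_token_id is a valid logits index.
model.resize_token_embeddings(len(tokenizer))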
I am using this code:
import torch
from transformers import *
from lm_scorer.models.auto import AutoLMScorer as LMScorer
However, I am unable to get lm-scorer to work due to this error:
AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file
Based on feedback located here: huggingface/transformers#4513
I was told:
This is likely to be a problem with the LMScorer rather than with this transformers library. Looking at the source code, it does not pass the keyword arguments down to the model init. I suggest that you make an issue over at the library that you used.
Any suggestions would be helpful! Thank you!
Please note: it seems to work in the terminal. I am using Vagrant, CGI, and Apache2.
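One possible angle, offered as a guess rather than a confirmed fix: CGI processes under Apache typically run with a minimal environment, so proxy settings and a writable cache directory that exist in your terminal session may be missing. Setting them explicitly before transformers is imported might help (the path and proxy below are placeholders):
import os

# Writable cache location for the Apache user (placeholder path).
os.environ["TRANSFORMERS_CACHE"] = "/var/www/.cache/transformers"
# Uncomment if the server reaches the internet through a proxy (placeholder URL).
# os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer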
I have only just tried loading GPT-2 and haven't tried to score a sentence yet. Here is the code:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
However, when I run it, I get this error:
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 239, in get_config_dict
local_files_only=local_files_only,
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\file_utils.py", line 267, in cached_path
raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file gpt2\config.json not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scorer.py", line 6, in <module>
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\auto.py", line 24, in from_pretrained
return model_class(model_name, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\abc\base.py", line 11, in __init__
self._build(model_name, kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\gpt2.py", line 19, in _build
model_name, use_fast=True, add_special_tokens=False
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\tokenization_auto.py", line 195, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_auto.py", line 196, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 252, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'gpt2'. Make sure that:
- 'gpt2' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'gpt2' is the correct path to a directory containing a config.json file
Thank you for your work! This seems to be exactly what I need, if only I could get it to work!
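Two things that might be worth checking, neither confirmed: the traceback shows transformers resolving 'gpt2' as a relative path (gpt2\config.json), which can happen if a directory named gpt2 exists in the working directory and shadows the model identifier; and it may help to test whether transformers alone can fetch the model, to separate lm-scorer from the environment:
# Isolation test: if this also fails, the problem lies with transformers or the
# environment (network, cache, a local "gpt2" folder) rather than with lm-scorer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")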
Is it possible to get top most probable missing words from a sentence using GPT-2?
For example, we have a sentence "The doctor ran to the emergency room to see [MASK] patient." and we want to get the most probable words which could be at [MASK].
Thanks.
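GPT-2 is a causal (left-to-right) model, so it has no native [MASK] token the way BERT does. One workaround is to substitute candidate words into the gap and rank the resulting sentences by their score; a sketch (the candidate list is purely illustrative):
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=1)

template = "The doctor ran to the emergency room to see {} patient."
candidates = ["the", "a", "his", "her", "another"]  # illustrative candidates

# Score each filled-in sentence and sort from most to least probable.
ranked = sorted(
    ((w, scorer.sentence_score(template.format(w), log=True)) for w in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, score)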