simonepri / lm-scorer
📃 Language Model based sentences scoring library
License: MIT License
I noticed that the probability of the final <|endoftext|> token is included in scoring. For the purposes of scoring sentences, it seems to me that it would be more correct (for most use cases) to omit that one, because it doesn't really matter whether or not more text follows a given sentence. Doesn't the probability of an <|endoftext|> token following a sentence depend on the (somewhat arbitrary) details of how text was broken up for training?
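For anyone wanting to experiment with omitting it, here is a minimal sketch, assuming the (scores, ids, tokens) triple that tokens_score returns (the same format as the French/English examples further down this page):
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=1)

# tokens_score returns per-token log-probabilities, ids, and token strings.
log_probs, ids, tokens = scorer.tokens_score("The cat is on the roof.", log=True)

# Drop the trailing <|endoftext|> entry before reducing to a sentence score.
sentence_log_prob = sum(log_probs[:-1])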
Hi! Thanks for making this amazing package.
When I do:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 100
scorer = LMScorer.from_pretrained("distilgpt2", device=device, batch_size=batch_size)
And then:
scorer.sentence_score(["Sasgdkjlasdjglakjsdg", "Sentence 2"], log=True)
(or with any batch of sentences of different lengths), I get this error:
---------------------------------------------------------------------------
ValueError Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
714 if not is_tensor(value):
--> 715 tensor = as_tensor(value)
716
ValueError: expected sequence of length 237 at dim 1 (got 232)
During handling of the above exception, another exception occurred:
ValueError Traceback (most recent call last)
8 frames
<ipython-input-15-08e953660864> in <module>
71 j+=1
72
---> 73 passages_relevance_scores = scorer.sentence_score(passages_relevance_text, reduce='hmean')
74 passages_trans_scores = scorer.sentence_score(passages_trans_text, reduce='hmean')
75 for i, passage in enumerate(passage_texts):
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/base.py in sentence_score(self, text, log, reduce)
31 return scores
32
---> 33 outputs = self._tokens_log_prob(sentences)
34 for output in outputs:
35 log_probs = output[0]
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/batch.py in _tokens_log_prob(self, text)
26 for i in range(0, len(text), self.batch_size):
27 batch = text[i : i + self.batch_size]
---> 28 outputs.extend(self._tokens_log_prob_for_batch(batch))
29 return outputs
30
/usr/local/lib/python3.7/dist-packages/lm_scorer/models/gpt2.py in _tokens_log_prob_for_batch(self, text)
45 text = list(map(self._add_special_tokens, text))
46 encoding: BatchEncoding = self.tokenizer.batch_encode_plus(
---> 47 text, return_tensors="pt",
48 )
49 with torch.no_grad():
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
2780 return_length=return_length,
2781 verbose=verbose,
-> 2782 **kwargs,
2783 )
2784
/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
164 )
165
--> 166 return super()._batch_encode_plus(*args, **kwargs)
167
168 def _encode_plus(self, *args, **kwargs) -> BatchEncoding:
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
475 for input_ids in sanitized_tokens["input_ids"]:
476 self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
--> 477 return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
478
479 def _encode_plus(
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
208 self._n_sequences = n_sequences
209
--> 210 self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
211
212 @property
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
730 )
731 raise ValueError(
--> 732 "Unable to create tensor, you should probably activate truncation and/or padding with"
733 " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"
734 f" features (`{key}` in this case) have excessive nesting (inputs type `list` where type `int` is"
ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).
Would love any help whatsoever!
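In case it helps while this is open: a simple, if slow, workaround is to score the sentences one at a time instead of in a batch, which sidesteps the padding error. Patching the batch_encode_plus call in lm_scorer/models/gpt2.py to pass padding=True might also work, but I haven't verified that.
# Workaround sketch: avoid mixed-length batches by scoring sentences individually.
sentences = ["Sasgdkjlasdjglakjsdg", "Sentence 2"]
scores = [scorer.sentence_score(s, log=True) for s in sentences]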
Hi
I had a couple of queries.
I was wondering if you could direct me to the relevant part of the code, and recommend any changes, so that I can also calculate this score with my own fine-tuned GPT-2 model (which is saved at its own local path).
I was also wondering about tokenization: GPT-2 uses BPE encoding, yet the returned probability score is always for the complete word (not the subword units). As far as I understand BPE, it divides a token into sub-pieces and assigns IDs to those sub-pieces. Do you know how this works internally, such that the library is able to assign a probability to the complete word?
Thanks
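A possible direction for the first query, as an untested sketch: AutoLMScorer validates model names against a supported list, so a local checkpoint path may need to bypass it and go through the GPT-2 scorer class directly (class and argument names are taken from the tracebacks elsewhere on this page; the checkpoint path is a placeholder).
import torch
from lm_scorer.models.gpt2 import GPT2LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
# Placeholder path to a fine-tuned GPT-2 checkpoint saved with save_pretrained().
scorer = GPT2LMScorer("/path/to/my-finetuned-gpt2", device=device, batch_size=1)
On the BPE question: the model assigns a probability to each sub-piece, and by the chain rule the probability of the complete word is the product of its sub-piece probabilities (equivalently, the sum of their log-probabilities).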
It gives an error while loading gpt2:
File "/home/debanjan/anaconda3/envs/py_3/lib/python3.7/site-packages/lm_scorer/models/gpt2.py", line 84, in _supported_model_names
return GPT2LMHeadModel.pretrained_model_archive_map.keys()
AttributeError: type object 'GPT2LMHeadModel' has no attribute 'pretrained_model_archive_map'
Can you kindly check?
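If I recall correctly, the pretrained_model_archive_map attribute was removed from the model classes in newer transformers releases (around 2.11), so this looks like a version mismatch. Pinning an older transformers, for example pip install "transformers<2.11" (assuming lm-scorer's other requirements allow it), may be a stopgap until the library is updated.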
Running the CLI command
!lm-scorer -lp --cuda 0 sentences.txt
I got the following error:
2020-04-11 07:48:41.170455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: init() got an unexpected keyword argument 'device'
and without --cuda (!lm-scorer -lp sentences.txt) I got the following error:
2020-04-11 07:52:48.219135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: Device index must be non-negative, got -1
Please help.
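Until the --cuda flag is fixed, the Python API's device argument appears to work (assuming a recent enough lm-scorer; the init() error above suggests an older version may be installed, so upgrading first could help). A sketch mirroring the CLI behaviour:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

scorer = LMScorer.from_pretrained("gpt2", device="cuda:0", batch_size=1)

with open("sentences.txt") as f:
    for line in f:
        # The CLI's -lp prints log-probabilities; log=True does the same here.
        print(scorer.sentence_score(line.strip(), log=True))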
Is there a reason 3.8 isn't supported on PyPI? (I'm running 3.8 and pip doesn't have a compatible version.)
lm-scorer does not support Python versions >3.8. Since 3.10 and even 3.11 are out, with torch and transformers supporting 3.9, it may make sense to bump the versions up.
I would like to adapt this library to work with user-contributed multilingual models from the transformers library.
I tried to add another model class in a fork to handle AutoModelWithLMHead models here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py, just substituting the transformers model class (GPT2LMHeadModel -> AutoModelWithLMHead).
I am running into two (possibly related) issues with this approach.
First, it errors out on the line sent_logits[:, self.tokenizer.pad_token_id] = float("-inf"), with what seems to be an off-by-one indexing error.
/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
66 # logits.shape = [len(text[sent_index]) + 1, vocab_size]
67 sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68 sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
69 # ids_scores.shape = [seq_len + 1]
70 sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001
If I comment out this line and let it continue, I get back probabilities, but they seem odd: the probabilities of the first token and the <|endoftext|> token are both very low compared to the English model on a matched sentence. For example, compare French
([-13.103885650634766,
-7.141622066497803,
-2.2347683906555176,
-6.366621017456055,
-1.1687631607055664,
-3.626580238342285,
-10.760506629943848],
[2532, 5985, 327, 375, 295, 7536, 50257],
['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])
vs. English
([-2.4790897369384766,
-9.218439102172852,
-2.2219443321228027,
-5.678627967834473,
-0.41474056243896484,
-4.27750301361084,
-2.19716739654541,
-5.7754011154174805],
[464, 3797, 318, 319, 262, 9753, 13, 50256],
['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])
The same also holds for German (i.e. it follows the pattern of French), so I don't think it's a model-specific problem.
Any help appreciated figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel!
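Not a confirmed diagnosis, but the IndexError is suggestive: index 52001 is exactly the size of the logits dimension (valid indices 0..52000), which is what you would see if a pad token was added to the tokenizer without the model's output layer being resized to match. A sketch of the usual remedy (the model id and pad token below are placeholders):
from transformers import AutoModelWithLMHead, AutoTokenizer

model_name = "some-multilingual-gpt2"  # placeholder model id
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelWithLMHead.from_pretrained(model_name)

# Assumption: a pad token is added that the checkpoint doesn't already have.
tokenizer.add_special_tokens({"pad_token": "<|pad|>"})
# Grow the embedding/output layers so pad_token_id is a valid logits index.
model.resize_token_embeddings(len(tokenizer))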
I am using this code:
import torch
from transformers import *
from lm_scorer.models.auto import AutoLMScorer as LMScorer
However, I am unable to get lm-scorer to work due to this error:
AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file
Based on feedback located here: huggingface/transformers#4513
I was told:
This is likely to be a problem with the LMScorer rather than with this transformers library. Looking at the source code, it does not pass the keyword arguments down to the model init. I suggest that you make an issue over at the library that you used.
Any suggestions would be helpful! Thank you!
Please note: it seems to work in the terminal. I am using Vagrant, CGI, and Apache2.
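One possible angle, offered as a guess rather than a confirmed fix: CGI processes under Apache typically run with a minimal environment, so proxy settings and a writable cache directory that exist in your terminal session may be missing. Setting them explicitly before transformers is imported might help (the path and proxy below are placeholders):
import os

# Writable cache location for the Apache user (placeholder path).
os.environ["TRANSFORMERS_CACHE"] = "/var/www/.cache/transformers"
# Uncomment if the server reaches the internet through a proxy (placeholder URL).
# os.environ["HTTPS_PROXY"] = "http://proxy.example.com:8080"

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer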
I have only just tried loading GPT-2 and haven't tried to score a sentence yet. Here is the code:
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
However, when I run it, I get this error:
Traceback (most recent call last):
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 239, in get_config_dict
local_files_only=local_files_only,
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\file_utils.py", line 267, in cached_path
raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file gpt2\config.json not found
During handling of the above exception, another exception occurred:
Traceback (most recent call last):
File "scorer.py", line 6, in <module>
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\auto.py", line 24, in from_pretrained
return model_class(model_name, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\abc\base.py", line 11, in __init__
self._build(model_name, kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\gpt2.py", line 19, in _build
model_name, use_fast=True, add_special_tokens=False
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\tokenization_auto.py", line 195, in from_pretrained
config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_auto.py", line 196, in from_pretrained
config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 252, in get_config_dict
raise EnvironmentError(msg)
OSError: Can't load config for 'gpt2'. Make sure that:
- 'gpt2' is a correct model identifier listed on 'https://huggingface.co/models'
- or 'gpt2' is the correct path to a directory containing a config.json file
Thank you for your work! This seems to be exactly what I need, if only I could get it to work!
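Two things that might be worth checking, neither confirmed: the traceback shows transformers resolving 'gpt2' as a relative path (gpt2\config.json), which can happen if a directory named gpt2 exists in the working directory and shadows the model identifier; and it may help to test whether transformers alone can fetch the model, to separate lm-scorer from the environment:
# Isolation test: if this also fails, the problem lies with transformers or the
# environment (network, cache, a local "gpt2" folder) rather than with lm-scorer.
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")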
Is it possible to get top most probable missing words from a sentence using GPT-2?
For example, we have a sentence "The doctor ran to the emergency room to see [MASK] patient." and we want to get the most probable words which could be at [MASK].
Thanks.
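GPT-2 is a causal (left-to-right) model, so it has no native [MASK] token the way BERT does. One workaround is to substitute candidate words into the gap and rank the resulting sentences by their score; a sketch (the candidate list is purely illustrative):
import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=1)

template = "The doctor ran to the emergency room to see {} patient."
candidates = ["the", "a", "his", "her", "another"]  # illustrative candidates

# Score each filled-in sentence and sort from most to least probable.
ranked = sorted(
    ((w, scorer.sentence_score(template.format(w), log=True)) for w in candidates),
    key=lambda pair: pair[1],
    reverse=True,
)
for word, score in ranked:
    print(word, score)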