lm-scorer's Issues

Not really correct to include <|endoftext|> token in scoring?

I noticed that the probability of the final <|endoftext|> token is included in scoring. For the purposes of scoring sentences, it seems to me that it would be more correct (for most use cases) to omit that one, because it doesn't really matter whether or not more text follows a given sentence. Doesn't the probability of an <|endoftext|> token following a sentence depend on the (somewhat arbitrary) details of how text was broken up for training?
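One way to see the effect is to compare the summed log-probability with and without the last token. This is a minimal sketch, assuming per-token log-probabilities and token strings are already available (the values below are copied from the English example reported in another issue on this page). The helper name `sentence_log_prob` is hypothetical and not part of lm-scorer.

```python
from typing import List

EOS = "<|endoftext|>"


def sentence_log_prob(
    log_probs: List[float], tokens: List[str], include_eos: bool = True
) -> float:
    """Sum token log-probabilities, optionally dropping a trailing end-of-text token."""
    if not include_eos and tokens and tokens[-1] == EOS:
        log_probs = log_probs[:-1]
    return sum(log_probs)


# Values copied from the English example reported elsewhere on this page.
log_probs = [
    -2.4790897369384766,
    -9.218439102172852,
    -2.2219443321228027,
    -5.678627967834473,
    -0.41474056243896484,
    -4.27750301361084,
    -2.19716739654541,
    -5.7754011154174805,
]
tokens = ["The", "Ġcat", "Ġis", "Ġon", "Ġthe", "Ġroof", ".", "<|endoftext|>"]

with_eos = sentence_log_prob(log_probs, tokens)
without_eos = sentence_log_prob(log_probs, tokens, include_eos=False)
```

The difference between the two sums is exactly the log-probability assigned to `<|endoftext|>`, which is the term the question is about.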

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Hi! Thanks for making this amazing package.

When I do:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 100
scorer = LMScorer.from_pretrained("distilgpt2", device=device, batch_size=batch_size)

And then:

scorer.sentence_score(["Sasgdkjlasdjglakjsdg", "Sentence 2"], log=True)

(or any batch containing sentences of different lengths), I get this error:

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714                 if not is_tensor(value):
--> 715                     tensor = as_tensor(value)
    716 

ValueError: expected sequence of length 237 at dim 1 (got 232)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
8 frames
<ipython-input-15-08e953660864> in <module>
     71       j+=1
     72 
---> 73   passages_relevance_scores = scorer.sentence_score(passages_relevance_text, reduce='hmean')
     74   passages_trans_scores = scorer.sentence_score(passages_trans_text, reduce='hmean')
     75   for i, passage in enumerate(passage_texts):

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/base.py in sentence_score(self, text, log, reduce)
     31             return scores
     32 
---> 33         outputs = self._tokens_log_prob(sentences)
     34         for output in outputs:
     35             log_probs = output[0]

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/batch.py in _tokens_log_prob(self, text)
     26         for i in range(0, len(text), self.batch_size):
     27             batch = text[i : i + self.batch_size]
---> 28             outputs.extend(self._tokens_log_prob_for_batch(batch))
     29         return outputs
     30 

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/gpt2.py in _tokens_log_prob_for_batch(self, text)
     45         text = list(map(self._add_special_tokens, text))
     46         encoding: BatchEncoding = self.tokenizer.batch_encode_plus(
---> 47             text, return_tensors="pt",
     48         )
     49         with torch.no_grad():

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2780             return_length=return_length,
   2781             verbose=verbose,
-> 2782             **kwargs,
   2783         )
   2784 

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
    164         )
    165 
--> 166         return super()._batch_encode_plus(*args, **kwargs)
    167 
    168     def _encode_plus(self, *args, **kwargs) -> BatchEncoding:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    475         for input_ids in sanitized_tokens["input_ids"]:
    476             self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
--> 477         return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
    478 
    479     def _encode_plus(

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    208         self._n_sequences = n_sequences
    209 
--> 210         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    211 
    212     @property

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    730                     )
    731                 raise ValueError(
--> 732                     "Unable to create tensor, you should probably activate truncation and/or padding with"
    733                     " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"
    734                     f" features (`{key}` in this case) have excessive nesting (inputs type `list` where type `int` is"

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Would love any help whatsoever!
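The error text says a tensor was requested from id lists of unequal length. As a rough illustration of what padding does, the sketch below right-pads plain Python lists and builds the matching attention mask. The function name `pad_batch`, the token ids, and `pad_id=0` are made up for the example. In transformers, passing `padding=True` to the tokenizer call is the switch the error message points to; whether and where lm-scorer should pass it is not confirmed here.

```python
from typing import List, Tuple


def pad_batch(
    batch: List[List[int]], pad_id: int
) -> Tuple[List[List[int]], List[List[int]]]:
    """Right-pad every id list to the longest one and build an attention mask."""
    max_len = max((len(ids) for ids in batch), default=0)
    padded, mask = [], []
    for ids in batch:
        n_pad = max_len - len(ids)
        padded.append(ids + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding to be ignored by the model
        mask.append([1] * len(ids) + [0] * n_pad)
    return padded, mask


# Hypothetical token ids for two sentences of different lengths
input_ids, attention_mask = pad_batch([[11, 12, 13, 14, 15], [21, 22]], pad_id=0)
```

After padding, every row has the same length, so the batch can be turned into a rectangular tensor.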

Couple of queries: 1) Fine tuned GPT2 2) BPE Encoding

Hi
I had a couple of queries.

  1. Could you point me to the relevant part of the code and recommend changes so that I can also calculate this score with my own fine-tuned GPT-2 model (which is saved at its own path)?

  2. GPT-2 uses BPE encoding, yet the probability score you return always seems to be for the complete word, not the sub-units. As far as I understand, BPE splits a word into sub-pieces and assigns an id to each piece. Do you know how this works internally, so that a probability can be assigned to a complete word?

Thanks
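On the second query: by the chain rule, the log-probability of a word split into several BPE pieces is the sum of the log-probabilities of its pieces. Below is a minimal sketch, assuming GPT-2 style tokens in which a leading "Ġ" marks the start of a new word (as in the token lists shown in other issues on this page). The function name, the example tokens, and the example scores are hypothetical. In this simple scheme, punctuation and special tokens attach to the preceding word.

```python
from typing import List, Tuple


def word_log_probs(
    tokens: List[str], log_probs: List[float]
) -> List[Tuple[str, float]]:
    """Merge sub-word pieces into words, summing their log-probabilities."""
    words: List[List] = []
    for token, log_prob in zip(tokens, log_probs):
        if token.startswith("Ġ") or not words:
            # A leading "Ġ" (or the very first token) starts a new word
            text = token[1:] if token.startswith("Ġ") else token
            words.append([text, log_prob])
        else:
            # Continuation piece: extend the current word and add its log-prob
            words[-1][0] += token
            words[-1][1] += log_prob
    return [(text, log_prob) for text, log_prob in words]


# Hypothetical pieces and scores, chosen only to illustrate the merge
merged = word_log_probs(
    ["The", "Ġun", "believ", "able", "Ġcat"],
    [-2.0, -5.0, -1.0, -0.5, -3.0],
)
```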

couldn't load pretrained gpt2

It gives an error while loading gpt2
File "/home/debanjan/anaconda3/envs/py_3/lib/python3.7/site-packages/lm_scorer/models/gpt2.py", line 84, in _supported_model_names
return GPT2LMHeadModel.pretrained_model_archive_map.keys()
AttributeError: type object 'GPT2LMHeadModel' has no attribute 'pretrained_model_archive_map'

Can you kindly check?
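The traceback shows the lookup failing on a class attribute (`pretrained_model_archive_map`) that the installed transformers version apparently does not define. A defensive lookup, sketched here with stand-in classes so it runs without transformers, would avoid the crash. The function name and the fallback list are assumptions for illustration, not the library's actual fix.

```python
from typing import Iterable, List


def supported_model_names(model_cls: type, fallback: Iterable[str]) -> List[str]:
    """Return the archive map keys if the attribute exists, else a fallback list."""
    archive_map = getattr(model_cls, "pretrained_model_archive_map", None)
    if archive_map is None:
        return list(fallback)
    return list(archive_map.keys())


class ModelWithoutMap:
    """Stand-in for a model class that lacks the attribute."""


class ModelWithMap:
    """Stand-in for a model class that still has the attribute."""

    pretrained_model_archive_map = {"gpt2": "hypothetical-url"}


names_without_map = supported_model_names(ModelWithoutMap, ["gpt2", "distilgpt2"])
names_with_map = supported_model_names(ModelWithMap, ["unused"])
```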

Unable to run lm-scorer

Running the CLI command

!lm-scorer -lp --cuda 0 sentences.txt,

I got the following error:

2020-04-11 07:48:41.170455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: __init__() got an unexpected keyword argument 'device'

And without --cuda (!lm-scorer -lp sentences.txt), I got the following error:

2020-04-11 07:52:48.219135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: Device index must be non-negative, got -1

Please help.

Python 3.8 support?

Is there a reason 3.8 isn't supported on PyPI? (I'm running 3.8 and pip doesn't find a compatible version.)

Support for Python 3.8

lm-scorer does not support Python versions >3.8. Since 3.10 and even 3.11 are out, and torch and transformers support 3.9, it may make sense to bump the supported versions.

Support for AutoModelWithLMHead

I would like to adapt this library to work with user-contributed multilingual models from the transformers library.

I tried to add another model class in a fork to handle AutoModelWithLMHead models here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py, just substituting the transformer model class (GPT2LMHeadModel -> AutoModelWithLMHead)

I am running into two (possibly related) issues with this approach.

First, it errors out on this line: sent_logits[:, self.tokenizer.pad_token_id] = float("-inf"), with what seems to be an off-by-one indexing error.

/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
     66             # logits.shape = [len(text[sent_index]) + 1, vocab_size]
     67             sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68             sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
     69             # ids_scores.shape = [seq_len + 1]
     70             sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001
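The message says the pad token id (52001) equals the size of the vocabulary dimension (52001), so the logits have no column for that id. One possible reading, stated as an assumption, is that the tokenizer has an extra pad token that the model's output layer does not cover. The sketch below uses plain Python lists and a hypothetical helper name to show a guard that only masks the pad column when it exists. In transformers, `resize_token_embeddings` is the usual way to make a model's vocabulary dimension match a tokenizer with added tokens, though whether that is the right fix here is untested.

```python
import math
from typing import List, Optional


def mask_pad_column(logits: List[List[float]], pad_token_id: Optional[int]) -> bool:
    """Set the pad column to -inf in place; return False if the id has no column."""
    vocab_size = len(logits[0]) if logits else 0
    if pad_token_id is None or not 0 <= pad_token_id < vocab_size:
        return False
    for row in logits:
        row[pad_token_id] = -math.inf
    return True


# Toy logits with a vocabulary of size 3 (values are made up)
logits = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
out_of_range = mask_pad_column(logits, 3)  # id equals vocab size, so nothing to mask
in_range = mask_pad_column(logits, 2)      # last valid column gets masked
```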

If I comment out this line and let it continue, I get back probabilities, but they seem to be odd. Probabilities of the first token and the endoftext token are both very low compared to the English model on a matched sentence. For example, compare French

([-13.103885650634766,
  -7.141622066497803,
  -2.2347683906555176,
  -6.366621017456055,
  -1.1687631607055664,
  -3.626580238342285,
  -10.760506629943848],
 [2532, 5985, 327, 375, 295, 7536, 50257],
 ['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])

vs. English

([-2.4790897369384766,
  -9.218439102172852,
  -2.2219443321228027,
  -5.678627967834473,
  -0.41474056243896484,
  -4.27750301361084,
  -2.19716739654541,
  -5.7754011154174805],
 [464, 3797, 318, 319, 262, 9753, 13, 50256],
 ['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])

The same also holds for German (i.e. it follows the pattern of French), so I don't think it's a model-specific problem.

Any help figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel would be appreciated!

AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file

I am using this code:

import torch
from transformers import *
from lm_scorer.models.auto import AutoLMScorer as LMScorer

However, I am unable to get lm-scorer to work due to this error:

AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file

Based on feedback located here: huggingface/transformers#4513

I was told:

This is likely to be a problem with the LMScorer rather than with this transformers library. Looking at the source code, it does not pass the keyword arguments down to the model init. I suggest that you open an issue over at the library that you used.

Any suggestions would be helpful! Thank you!

Please note: It seems to work in the terminal. And I am using vagrant, cgi, and apache2.

Can't load config for 'gpt2'

I have only just tried loading GPT-2; I haven't tried to score a sentence yet. Here is the code:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)

However, when I run it, I get this error:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 239, in get_config_dict
    local_files_only=local_files_only,
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\file_utils.py", line 267, in cached_path
    raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file gpt2\config.json not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scorer.py", line 6, in <module>
    scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\auto.py", line 24, in from_pretrained
    return model_class(model_name, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\abc\base.py", line 11, in __init__
    self._build(model_name, kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\gpt2.py", line 19, in _build
    model_name, use_fast=True, add_special_tokens=False
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\tokenization_auto.py", line 195, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_auto.py", line 196, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 252, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'gpt2'. Make sure that:

- 'gpt2' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'gpt2' is the correct path to a directory containing a config.json file

Thank you for your work! This seems to be exactly what I need, if only I could get it to work!
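The first error in the traceback (`file gpt2\config.json not found`) indicates that 'gpt2' was treated as a local path. One possible cause, offered as an assumption rather than a confirmed diagnosis, is a folder named gpt2 in the working directory that shadows the model identifier. A small check along these lines can rule that out; the function name is hypothetical.

```python
import os


def shadowing_directory(model_name: str, cwd: str = ".") -> bool:
    """True if a local directory named like the model exists but lacks config.json."""
    candidate = os.path.join(cwd, model_name)
    has_config = os.path.isfile(os.path.join(candidate, "config.json"))
    return os.path.isdir(candidate) and not has_config


# Check the current working directory for a folder that could shadow "gpt2"
is_shadowed = shadowing_directory("gpt2")
```

If the check returns True, running the script from another directory or renaming the folder would be a quick way to test the hypothesis.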

Possible missing words

Is it possible to get the most probable missing words for a sentence using GPT-2?
For example, given the sentence "The doctor ran to the emergency room to see [MASK] patient.", we want the most probable words that could fill [MASK].

Thanks.
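Since GPT-2 is a left-to-right model with no mask token, one hedged approach is to fill the slot with each candidate word and rank the resulting sentences by their score. The sketch below takes the scoring function as a parameter. With lm-scorer, that could be something like a wrapper around `scorer.sentence_score(sentence, log=True)`, but a toy scorer is used here so the example runs on its own. The function name `rank_fillers` and the candidate list are hypothetical.

```python
from typing import Callable, List


def rank_fillers(
    template: str,
    candidates: List[str],
    score_fn: Callable[[str], float],
    mask: str = "[MASK]",
) -> List[str]:
    """Rank candidate words by the score of the sentence with the slot filled in."""
    return sorted(
        candidates,
        key=lambda word: score_fn(template.replace(mask, word)),
        reverse=True,  # higher score (e.g. log-probability) ranks first
    )


def toy_score(sentence: str) -> float:
    """Toy stand-in for a language-model score: shorter sentences score higher."""
    return -float(len(sentence))


ranked = rank_fillers(
    "The doctor ran to the emergency room to see [MASK] patient.",
    ["the", "a", "every"],
    toy_score,
)
```

This only ranks words from a supplied candidate list; it does not search the whole vocabulary.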
