lm-scorer

📃 Language-model-based sentence scoring library

Synopsis

This package provides a simple programming interface to score sentences using different ML language models.

A simple CLI is also available for quick prototyping.
You can run it locally or directly on Colab using this notebook.

Do you believe that this is useful? Has it saved you time? Or maybe you simply like it?
If so, support this work with a Star ⭐️.

Install

pip install lm-scorer

Usage

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

# Available models
list(LMScorer.supported_model_names())
# => ["gpt2", "gpt2-medium", "gpt2-large", "gpt2-xl", distilgpt2"]

# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained("gpt2", device=device, batch_size=batch_size)

# Return token probabilities (provide log=True to return log probabilities)
scorer.tokens_score("I like this package.")
# => (scores, ids, tokens)
# scores = [0.018321, 0.0066431, 0.080633, 0.00060745, 0.27772, 0.0036381]
# ids    = [40,       588,       428,      5301,       13,      50256]
# tokens = ["I",      "Ġlike",   "Ġthis",  "Ġpackage", ".",     "<|endoftext|>"]

# Compute sentence score as the product of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="prod")
# => 6.0231e-12

# Compute sentence score as the mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="mean")
# => 0.064593

# Compute sentence score as the geometric mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="gmean")
# => 0.013489

# Compute sentence score as the harmonic mean of tokens' probabilities
scorer.sentence_score("I like this package.", reduce="hmean")
# => 0.0028008

# Get the log of the sentence score.
scorer.sentence_score("I like this package.", log=True)
# => -25.835

# Score multiple sentences.
scorer.sentence_score(["Sentence 1", "Sentence 2"])
# => [1.1508e-11, 5.6645e-12]

# NB: Computations are done in log space so they should be numerically stable.
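
As a quick sanity check, the product-reduced sentence score can be reproduced from the per-token probabilities returned by tokens_score (a minimal sketch based on the API above; the values agree up to floating-point error from the log-space computation):

import math

scores, ids, tokens = scorer.tokens_score("I like this package.")
manual = math.prod(scores)  # product of the token probabilities
auto = scorer.sentence_score("I like this package.", reduce="prod")
assert math.isclose(manual, auto, rel_tol=1e-4)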

CLI

The pip package includes a CLI that you can use to score sentences.

usage: lm-scorer [-h] [--model-name MODEL_NAME] [--tokens] [--log-prob]
                 [--reduce REDUCE] [--batch-size BATCH_SIZE]
                 [--significant-figures SIGNIFICANT_FIGURES] [--cuda CUDA]
                 [--debug]
                 sentences-file-path

Get sentences probability using a language model.

positional arguments:
  sentences-file-path   A file containing sentences to score, one per line. If
                        - is given as the filename, it reads from stdin instead.

optional arguments:
  -h, --help            show this help message and exit
  --model-name MODEL_NAME, -m MODEL_NAME
                        The pretrained language model to use. Can be one of:
                        gpt2, gpt2-medium, gpt2-large, gpt2-xl, distilgpt2.
  --tokens, -t          If provided, the probability of each token of each
                        sentence is returned.
  --log-prob, -lp       If provided, log probabilities are returned instead.
  --reduce REDUCE, -r REDUCE
                        Reduce strategy applied on token probabilities to get
                        the sentence score. Available strategies are: prod,
                        mean, gmean, hmean.
  --batch-size BATCH_SIZE, -b BATCH_SIZE
                        Number of sentences to process in parallel.
  --significant-figures SIGNIFICANT_FIGURES, -sf SIGNIFICANT_FIGURES
                        Number of significant figures to use when printing
                        numbers.
  --cuda CUDA           If provided, the model is run on the given CUDA
                        device.
  --debug               If provided, additional logging is shown in case of
                        errors.
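
For example, to score a sentence piped from stdin with log probabilities (an illustrative invocation based on the flags above; the exact output format may vary between versions):

echo "I like this package." | lm-scorer --log-prob -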

Development

You can install this library locally for development using the commands below. If you don't have it already, you need to install poetry first.

# Clone the repo
git clone https://github.com/simonepri/lm-scorer
# CD into the created folder
cd lm-scorer
# Create a virtualenv and install the required dependencies using poetry
poetry install

You can then run commands inside the virtualenv by using poetry run COMMAND.
Alternatively, you can open a shell inside the virtualenv using poetry shell.

If you wish to contribute to this project, run the following commands locally before opening a PR and check that no error is reported (warnings are fine).

# Run the code formatter
poetry run task format
# Run the linter
poetry run task lint
# Run the static type checker
poetry run task types
# Run the tests
poetry run task test

Authors

See also the list of contributors who participated in this project.

License

This project is licensed under the MIT License - see the license file for details.

lm-scorer's Issues

Possible missing words

Is it possible to get top most probable missing words from a sentence using GPT-2?
For example, we have a sentence "The doctor ran to the emergency room to see [MASK] patient." and we want to get the most probable words which could be at [MASK].

Thanks.
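
The library does not expose masked-word prediction directly, but one workable approximation (a sketch using only the scoring API shown above; the candidate list is a hypothetical input, not something the library generates) is to substitute each candidate word into the sentence and rank the completed sentences by score:

template = "The doctor ran to the emergency room to see {} patient."
candidates = ["the", "a", "his", "her", "their"]  # hypothetical candidates
scored = [
    (word, scorer.sentence_score(template.format(word), log=True))
    for word in candidates
]
# A higher log probability suggests a more plausible fill-in.
scored.sort(key=lambda pair: pair[1], reverse=True)
print(scored[:3])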

AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file

I am using this code:

import torch
from transformers import *
from lm_scorer.models.auto import AutoLMScorer as LMScorer

However, I am unable to get lm-scorer to work due to this error:

AH01215: OSError: Couldn't reach server at 'https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json' to download pretrained model configuration file

Based on feedback located here: huggingface/transformers#4513

I was told:

This is likely to be a problem with the LMScorer rather than with the transformers library. Looking at the source code, it does not pass the keyword arguments down to the model init. I suggest that you make an issue over at the library that you used.

Any suggestions would be helpful! Thank you!

Please note: It seems to work in the terminal. And I am using vagrant, cgi, and apache2.
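
A common cause when code works in a terminal but fails under Apache/CGI is that the web server's user cannot reach the network or cannot write to the default HuggingFace cache directory. One possible workaround (a sketch; the cache path is a hypothetical example and must be writable by the Apache user) is to point the cache somewhere accessible before anything imports transformers:

import os

# Hypothetical path; pick a directory the Apache user can read and write.
os.environ["TRANSFORMERS_CACHE"] = "/var/www/.cache/huggingface"

from lm_scorer.models.auto import AutoLMScorer as LMScorer
scorer = LMScorer.from_pretrained("gpt2", device="cpu", batch_size=1)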

Unable to run lm-scorer

Running the CLI command

!lm-scorer -lp --cuda 0 sentences.txt

I got the following error:

2020-04-11 07:48:41.170455: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: init() got an unexpected keyword argument 'device'

and without --cuda, running !lm-scorer -lp sentences.txt,

I got the following error:

2020-04-11 07:52:48.219135: I tensorflow/stream_executor/platform/default/dso_loader.cc:44] Successfully opened dynamic library libcudart.so.10.1
Error: Device index must be non-negative, got -1

Please help.

Couple of queries: 1) Fine tuned GPT2 2) BPE Encoding

Hi
I had a couple of queries.

  1. I was wondering if you could direct me to the part of the code, and recommend changes I could make, so that I can also calculate this score on my own fine-tuned GPT-2 model (which is saved at its own path).

  2. I was also thinking about the fact that GPT-2 uses BPE encoding, yet the returned probability score is always for the complete word (not the subword units). As far as I understand BPE, it divides a token into sub-pieces and assigns ids to those sub-pieces. So do you know how it works internally, such that it is able to assign a probability to a complete word?

Thanks
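
For what it's worth on query 2, here is a sketch of how per-word probabilities can be derived from the subword scores that tokens_score returns (this aggregation is my own illustration, not something the library does): GPT-2's BPE marks tokens that start a new word with a "Ġ" prefix, and by the chain rule the probability of a word given its context is the product of its pieces' conditional probabilities.

scores, ids, tokens = scorer.tokens_score("I like this package.")

words, word_probs = [], []
for prob, token in zip(scores, tokens):
    if token == "<|endoftext|>":
        continue  # skip the end-of-text marker
    if token.startswith("Ġ") or not words:
        words.append(token.lstrip("Ġ"))  # token starts a new word
        word_probs.append(prob)
    else:
        # Token continues the current word (this simple heuristic also
        # attaches trailing punctuation to the preceding word).
        words[-1] += token
        word_probs[-1] *= prob
print(list(zip(words, word_probs)))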

Support for AutoModelWithLMHead

I would like to adapt this library to work with user-contributed multilingual models from the transformers library.

I tried to add another model class in a fork to handle AutoModelWithLMHead models here: https://github.com/smeylan/lm-scorer/blob/master/lm_scorer/models/automodel.py, just substituting the transformers model class (GPT2LMHeadModel -> AutoModelWithLMHead).

I am running into two (possibly related) issues with this approach.

First, it errors out on this line: sent_logits[:, self.tokenizer.pad_token_id] = float("-inf"), with what seems to be an off-by-one indexing error.

/content/drive/MyDrive/Repos/lm-scorer/lm_scorer/models/automodel.py in _tokens_log_prob_for_batch(self, text)
     66             # logits.shape = [len(text[sent_index]) + 1, vocab_size]
     67             sent_logits = logits[sent_index, sent_nopad_mask][:-1, :]
---> 68             sent_logits[:, self.tokenizer.pad_token_id] = float("-inf")
     69             # ids_scores.shape = [seq_len + 1]
     70             sent_ids_scores = sent_logits.gather(1, sent_ids.unsqueeze(1)).squeeze(1)
IndexError: index 52001 is out of bounds for dimension 1 with size 52001

If I comment out this line and let it continue, I get back probabilities, but they seem odd: the probabilities of the first token and the <|endoftext|> token are both very low compared to the English model on a matched sentence. For example, compare French

([-13.103885650634766,
  -7.141622066497803,
  -2.2347683906555176,
  -6.366621017456055,
  -1.1687631607055664,
  -3.626580238342285,
  -10.760506629943848],
 [2532, 5985, 327, 375, 295, 7536, 50257],
 ['Le', 'Ġchat', 'Ġest', 'Ġsur', 'Ġle', 'Ġtoit', '<|endoftext|>'])

vs. English

([-2.4790897369384766,
  -9.218439102172852,
  -2.2219443321228027,
  -5.678627967834473,
  -0.41474056243896484,
  -4.27750301361084,
  -2.19716739654541,
  -5.7754011154174805],
 [464, 3797, 318, 319, 262, 9753, 13, 50256],
 ['The', 'Ġcat', 'Ġis', 'Ġon', 'Ġthe', 'Ġroof', '.', '<|endoftext|>'])

The same also holds for German (i.e. it follows the pattern of French), so I don't think it's a model-specific problem.

Any help figuring out how AutoModelWithLMHead might differ from GPT2LMHeadModel is appreciated!
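
One plausible explanation for the IndexError (an assumption on my part, not verified against the fork): the scorer adds a pad token to the tokenizer, which for this model gets id 52001, while the model's output layer only produces 52001 logits (valid indices 0..52000). If that is the case, resizing the model's embeddings after adding the token should make pad_token_id a valid index:

from transformers import AutoModelWithLMHead, AutoTokenizer

# "model-name" is a placeholder for the multilingual checkpoint in use.
tokenizer = AutoTokenizer.from_pretrained("model-name")
tokenizer.add_special_tokens({"pad_token": "<pad>"})

model = AutoModelWithLMHead.from_pretrained("model-name")
# Grow the embedding/output matrices to cover the newly added token.
model.resize_token_embeddings(len(tokenizer))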

Support for Python 3.8

lm-scorer does not support Python versions > 3.8. Since 3.10 and even 3.11 are out, and torch and transformers support 3.9+, it may make sense to bump the supported versions.

Not really correct to include <|endoftext|> token in scoring?

I noticed that the probability of the final <|endoftext|> token is included in scoring. For the purposes of scoring sentences, it seems to me that it would be more correct (for most use cases) to omit that one, because it doesn't really matter whether or not more text follows a given sentence. Doesn't the probability of an <|endoftext|> token following a sentence depend on the (somewhat arbitrary) details of how text was broken up for training?
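
For anyone who wants to experiment with this, the per-token output makes it easy to recompute a sentence score that skips the trailing <|endoftext|> token (a sketch built on the API from the Usage section; math.prod requires Python 3.8+):

import math

scores, ids, tokens = scorer.tokens_score("I like this package.")
# Drop the trailing <|endoftext|> probability before reducing.
body = scores[:-1] if tokens and tokens[-1] == "<|endoftext|>" else scores
score_without_eos = math.prod(body)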

Python 3.8 support?

Is there a reason 3.8 isn't supported on PyPI? (I'm running 3.8 and pip can't find a compatible version.)

Can't load config for 'gpt2'

I have only just tried loading GPT-2 and haven't tried to score a sentence yet. Here is the code:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 1
scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)

However, when I run it, I get this error:

Traceback (most recent call last):
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 239, in get_config_dict
    local_files_only=local_files_only,
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\file_utils.py", line 267, in cached_path
    raise EnvironmentError("file {} not found".format(url_or_filename))
OSError: file gpt2\config.json not found

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "scorer.py", line 6, in <module>
    scorer = LMScorer.from_pretrained('gpt2', device=device, batch_size=batch_size)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\auto.py", line 24, in from_pretrained
    return model_class(model_name, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\abc\base.py", line 11, in __init__
    self._build(model_name, kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\lm_scorer\models\gpt2.py", line 19, in _build
    model_name, use_fast=True, add_special_tokens=False
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\tokenization_auto.py", line 195, in from_pretrained
    config = AutoConfig.from_pretrained(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_auto.py", line 196, in from_pretrained
    config_dict, _ = PretrainedConfig.get_config_dict(pretrained_model_name_or_path, **kwargs)
  File "C:\Users\User\AppData\Local\Programs\Python\Python37\lib\site-packages\transformers\configuration_utils.py", line 252, in get_config_dict
    raise EnvironmentError(msg)
OSError: Can't load config for 'gpt2'. Make sure that:

- 'gpt2' is a correct model identifier listed on 'https://huggingface.co/models'

- or 'gpt2' is the correct path to a directory containing a config.json file

Thank you for your work! This seems to be exactly what I need, if only I could get it to work!
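
One thing worth checking (a guess, since the traceback suggests the model id fell back to a local-path lookup): whether transformers itself can download the model on this machine. If the snippet below succeeds, the files end up in the same cache that lm-scorer reuses.

from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Downloading once with plain transformers warms the local cache.
GPT2LMHeadModel.from_pretrained("gpt2")
GPT2Tokenizer.from_pretrained("gpt2")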

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Hi! Thanks for making this amazing package.

When I do:

import torch
from lm_scorer.models.auto import AutoLMScorer as LMScorer

# Load model to cpu or cuda
device = "cuda:0" if torch.cuda.is_available() else "cpu"
batch_size = 100
scorer = LMScorer.from_pretrained("distilgpt2", device=device, batch_size=batch_size)

And then:

scorer.sentence_score(["Sasgdkjlasdjglakjsdg", "Sentence 2"], log=True)

(Or any sentences of different length while trying to batch, I get this error):

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    714                 if not is_tensor(value):
--> 715                     tensor = as_tensor(value)
    716 

ValueError: expected sequence of length 237 at dim 1 (got 232)

During handling of the above exception, another exception occurred:

ValueError                                Traceback (most recent call last)
8 frames
<ipython-input-15-08e953660864> in <module>
     71       j+=1
     72 
---> 73   passages_relevance_scores = scorer.sentence_score(passages_relevance_text, reduce='hmean')
     74   passages_trans_scores = scorer.sentence_score(passages_trans_text, reduce='hmean')
     75   for i, passage in enumerate(passage_texts):

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/base.py in sentence_score(self, text, log, reduce)
     31             return scores
     32 
---> 33         outputs = self._tokens_log_prob(sentences)
     34         for output in outputs:
     35             log_probs = output[0]

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/abc/batch.py in _tokens_log_prob(self, text)
     26         for i in range(0, len(text), self.batch_size):
     27             batch = text[i : i + self.batch_size]
---> 28             outputs.extend(self._tokens_log_prob_for_batch(batch))
     29         return outputs
     30 

/usr/local/lib/python3.7/dist-packages/lm_scorer/models/gpt2.py in _tokens_log_prob_for_batch(self, text)
     45         text = list(map(self._add_special_tokens, text))
     46         encoding: BatchEncoding = self.tokenizer.batch_encode_plus(
---> 47             text, return_tensors="pt",
     48         )
     49         with torch.no_grad():

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding, truncation, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose, **kwargs)
   2780             return_length=return_length,
   2781             verbose=verbose,
-> 2782             **kwargs,
   2783         )
   2784 

/usr/local/lib/python3.7/dist-packages/transformers/models/gpt2/tokenization_gpt2_fast.py in _batch_encode_plus(self, *args, **kwargs)
    164         )
    165 
--> 166         return super()._batch_encode_plus(*args, **kwargs)
    167 
    168     def _encode_plus(self, *args, **kwargs) -> BatchEncoding:

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_fast.py in _batch_encode_plus(self, batch_text_or_text_pairs, add_special_tokens, padding_strategy, truncation_strategy, max_length, stride, is_split_into_words, pad_to_multiple_of, return_tensors, return_token_type_ids, return_attention_mask, return_overflowing_tokens, return_special_tokens_mask, return_offsets_mapping, return_length, verbose)
    475         for input_ids in sanitized_tokens["input_ids"]:
    476             self._eventual_warn_about_too_long_sequence(input_ids, max_length, verbose)
--> 477         return BatchEncoding(sanitized_tokens, sanitized_encodings, tensor_type=return_tensors)
    478 
    479     def _encode_plus(

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in __init__(self, data, encoding, tensor_type, prepend_batch_axis, n_sequences)
    208         self._n_sequences = n_sequences
    209 
--> 210         self.convert_to_tensors(tensor_type=tensor_type, prepend_batch_axis=prepend_batch_axis)
    211 
    212     @property

/usr/local/lib/python3.7/dist-packages/transformers/tokenization_utils_base.py in convert_to_tensors(self, tensor_type, prepend_batch_axis)
    730                     )
    731                 raise ValueError(
--> 732                     "Unable to create tensor, you should probably activate truncation and/or padding with"
    733                     " 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your"
    734                     f" features (`{key}` in this case) have excessive nesting (inputs type `list` where type `int` is"

ValueError: Unable to create tensor, you should probably activate truncation and/or padding with 'padding=True' 'truncation=True' to have batched tensors with the same length. Perhaps your features (`input_ids` in this case) have excessive nesting (inputs type `list` where type `int` is expected).

Would love any help whatsoever!
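
Until batching works with newer transformers releases (which require explicit padding options that this library does not pass), one workaround sketch is to score sentences one at a time, which avoids building a ragged batch entirely:

sentences = ["Sasgdkjlasdjglakjsdg", "Sentence 2"]
# Per-sentence scoring sidesteps the padding problem, at some speed cost.
scores = [scorer.sentence_score(s, log=True) for s in sentences]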

couldn't load pretrained gpt2

It gives an error while loading gpt2:
File "/home/debanjan/anaconda3/envs/py_3/lib/python3.7/site-packages/lm_scorer/models/gpt2.py", line 84, in _supported_model_names
return GPT2LMHeadModel.pretrained_model_archive_map.keys()
AttributeError: type object 'GPT2LMHeadModel' has no attribute 'pretrained_model_archive_map'

Can you kindly check?
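
For context: the pretrained_model_archive_map attribute was removed from the transformers model classes in the 3.x line (to the best of my knowledge; worth verifying), so pinning an older release may help until lm-scorer is updated:

pip install "transformers<3.0"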
