minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models

This repo is a wrapper around the transformers library from Hugging Face 🤗

Installation

Install from Pypi using:

pip install minicons

Supported Functionality

Extract word representations from Contextualized Word Embeddings
Score sequences using language model scoring techniques, including masked language models following Salazar et al. (2020).

Examples

Extract word representations from contextualized word embeddings:

from minicons import cwe

model = cwe.CWE('bert-base-uncased')

context_words = [("I went to the bank to withdraw money.", "bank"), 
                 ("i was at the bank of the river ganga!", "bank")]

print(model.extract_representation(context_words, layer = 12))

''' 
tensor([[ 0.5399, -0.2461, -0.0968,  ..., -0.4670, -0.5312, -0.0549],
        [-0.8258, -0.4308,  0.2744,  ..., -0.5987, -0.6984,  0.2087]],
       grad_fn=<MeanBackward1>)
'''

# if model is seq2seq:
model = cwe.EncDecCWE('t5-small')

print(model.extract_representation(context_words))

'''(last layer, by default)
tensor([[-0.0895,  0.0758,  0.0753,  ...,  0.0130, -0.1093, -0.2354],
        [-0.0695,  0.1142,  0.0803,  ...,  0.0807, -0.1139, -0.2888]])
'''

Compute sentence acceptability measures (surprisals) using Word Prediction Models:

from minicons import scorer

mlm_model = scorer.MaskedLMScorer('bert-base-uncased', 'cpu')
ilm_model = scorer.IncrementalLMScorer('distilgpt2', 'cpu')
s2s_model = scorer.Seq2SeqScorer('t5-base', 'cpu')

stimuli = ["The keys to the cabinet are on the table.",
           "The keys to the cabinet is on the table."]

# use sequence_score with different reduction options: 
# Sequence Surprisal - lambda x: -x.sum(0).item()
# Sequence Log-probability - lambda x: x.sum(0).item()
# Sequence Surprisal, normalized by number of tokens - lambda x: -x.mean(0).item()
# Sequence Log-probability, normalized by number of tokens - lambda x: x.mean(0).item()
# and so on...

print(ilm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))

'''
[39.879737854003906, 42.75846481323242]
'''

# MLM scoring, inspired by Salazar et al., 2020
print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item()))
'''
[13.962685585021973, 23.415111541748047]
'''

# Seq2seq scoring
## Blank source sequence, target sequence specified in `stimuli`
print(s2s_model.sequence_score(stimuli, source_format = 'blank'))
## Source sequence is the same as the target sequence in `stimuli`
print(s2s_model.sequence_score(stimuli, source_format = 'copy'))
'''
[-7.910910129547119, -7.835635185241699]
[-10.555519104003906, -9.532546997070312]
'''

A better version of MLM Scoring by Kauf and Ivanova

This version leverages a locally-autoregressive scoring strategy to avoid the overestimation of probabilities of tokens in multi-token words (e.g., "ostrich" -> "ostr" + "#ich"). In particular, tokens probabilities are estimated using the bidirectional context, excluding any future tokens that belong to the same word as the current target token.

For more details, refer to Kauf and Ivanova, 2023

from minicons import scorer
mlm_model = scorer.MaskedLMScorer('bert-base-uncased', 'cpu')

stimuli = ['The traveler lost the souvenir.']

# un-normalized sequence score
print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item(), PLL_metric='within_word_l2r'))
'''
[32.77983617782593]
'''

# original metric, for comparison:
print(mlm_model.sequence_score(stimuli, reduction = lambda x: -x.sum(0).item(), PLL_metric='original'))
'''
[18.014726161956787]
'''

print(mlm_model.token_score(stimuli, PLL_metric='within_word_l2r'))
'''
[[('the', -0.07324600219726562), ('traveler', -9.668401718139648), ('lost', -6.955361366271973),
('the', -1.1923179626464844), ('so', -7.776356220245361), ('##uven', -6.989711761474609),
('##ir', -0.037807464599609375), ('.', -0.08663368225097656)]]
'''

# original values, for comparison (notice the 'souvenir' tokens):

print(mlm_model.token_score(stimuli, PLL_metric='original'))
'''
[[('the', -0.07324600219726562), ('traveler', -9.668402671813965), ('lost', -6.955359935760498), ('the', -1.192317008972168), ('so', -3.0517578125e-05), ('##uven', -0.0009250640869140625), ('##ir', -0.03780937194824219), ('.', -0.08663558959960938)]]
'''

Tutorials

Recent Updates

November 6, 2021: MLM scoring has been fixed! You can now use model.token_score() and model.sequence_score() with MaskedLMScorers as well!
June 4, 2022: Added support for Seq2seq models. Thanks to Aaron Mueller 🥳
June 13, 2023: Added support for within_word_l2r, a better way to do MLM scoring, thanks to Carina Kauf (https://github.com/carina-kauf) 🥳

Citation

If you use minicons, please cite the following paper:

@article{misra2022minicons,
    title={minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models},
    author={Kanishka Misra},
    journal={arXiv preprint arXiv:2203.13112},
    year={2022}
}

If you use Kauf and Ivanova's PLL scoring technique, please additionally also cite the following paper:

@inproceedings{kauf2023better,
  title={A Better Way to Do Masked Language Model Scoring},
  author={Kauf, Carina and Ivanova, Anna},
  booktitle={Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers)},
  year={2023}
}

wwt17 / minicons Goto Github PK

minicons's Introduction

minicons: Enabling Flexible Behavioral and Representational Analyses of Transformer Language Models

Installation

Supported Functionality

Examples

A better version of MLM Scoring by Kauf and Ivanova

Tutorials

Recent Updates

Citation

minicons's People

Contributors

Recommend Projects

React

Vue.js

Typescript

TensorFlow

Django

Laravel

D3

Recommend Topics

javascript

web

server

Machine learning

Visualization

Game

Recommend Org

Facebook

Microsoft

Google

Alibaba

D3

Tencent