
License: MIT License


CheckList

This repository contains code for testing NLP models as described in the following paper:

Beyond Accuracy: Behavioral Testing of NLP models with CheckList
Marco Tulio Ribeiro, Tongshuang Wu, Carlos Guestrin, Sameer Singh
Association for Computational Linguistics (ACL), 2020

Bibtex for citations:

 @inproceedings{checklist:acl20,  
 author = {Marco Tulio Ribeiro and Tongshuang Wu and Carlos Guestrin and Sameer Singh},  
 title = {Beyond Accuracy: Behavioral Testing of NLP models with CheckList},  
 booktitle = {Association for Computational Linguistics (ACL)},  
 year = {2020}
 }

Table of Contents

  • Installation
  • Tutorials
  • Paper tests
  • Code snippets
  • API reference
  • Code of Conduct

Installation

From pypi:

pip install checklist
jupyter nbextension install --py --sys-prefix checklist.viewer
jupyter nbextension enable --py --sys-prefix checklist.viewer

Note: --sys-prefix installs into Python's sys.prefix, which is useful in virtual environments such as conda or virtualenv. If you are not in such an environment, switch to --user to install into the user's home Jupyter directories.
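For example, outside a virtual environment the viewer would be installed with:

jupyter nbextension install --py --user checklist.viewer
jupyter nbextension enable --py --user checklist.viewer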

From source:

git clone [email protected]:marcotcr/checklist.git
cd checklist
pip install -e .

Either way, you need to install PyTorch or TensorFlow if you want to use masked language model suggestions:

pip install torch

For most tutorials, you also need to download a spaCy model:

python -m spacy download en_core_web_sm

Tutorials

Please note that the visualizations are implemented as ipywidgets and do not work on Google Colab or JupyterLab (use Jupyter Notebook instead). Everything else should work on those platforms.

  1. Generating data
  2. Perturbing data
  3. Test types, expectation functions, running tests
  4. The CheckList process

Paper tests

Notebooks: how we created the tests in the paper

  1. Sentiment analysis
  2. QQP
  3. SQuAD

Replicating paper tests, or running them with new models

For all of these, you need to unpack the release data (in the main repo folder after cloning):

tar xvzf release_data.tar.gz

Sentiment Analysis

Loading the suite:

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/sentiment/sentiment_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed BERT predictions (replace bert in pred_path with amazon, google, microsoft, or roberta for the other models):

pred_path = 'release_data/sentiment/predictions/bert'
suite.run_from_file(pred_path, overwrite=True)
suite.summary() # or suite.visual_summary_table()

To test your own model, get predictions for the texts in release_data/sentiment/tests_n500 and save them in a file where each line has 4 numbers: the prediction (0 for negative, 1 for neutral, 2 for positive) and the prediction probabilities for (negative, neutral, positive).
Then, update pred_path with this file and run the lines above.
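A minimal sketch of producing such a prediction file (here clf stands for your own 3-class sentiment classifier with a scikit-learn-style predict_proba, and the output path is illustrative; neither is part of CheckList):

import numpy as np

texts = [line.strip() for line in open('release_data/sentiment/tests_n500')]
probs = np.asarray(clf.predict_proba(texts))  # hypothetical model call; columns: (negative, neutral, positive)
preds = probs.argmax(axis=1)                  # 0 = negative, 1 = neutral, 2 = positive
with open('release_data/sentiment/predictions/mymodel', 'w') as f:
    for p, pr in zip(preds, probs):
        f.write('%d %f %f %f\n' % (p, pr[0], pr[1], pr[2]))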

QQP

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/qqp/qqp_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed BERT predictions (replace bert in pred_path with roberta if you want):

pred_path = 'release_data/qqp/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='binary_conf')
suite.visual_summary_table()

To test your own model, get predictions for pairs in release_data/qqp/tests_n500 (format: tsv) and output them in a file where each line has a single number: the probability that the pair is a duplicate.
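A minimal sketch, assuming each TSV line holds the two questions and that duplicate_prob is your own scoring function (both names are illustrative, not CheckList API):

import csv

with open('release_data/qqp/tests_n500') as f_in, \
     open('release_data/qqp/predictions/mymodel', 'w') as f_out:
    for row in csv.reader(f_in, delimiter='\t'):
        q1, q2 = row[0], row[1]                       # assumed column order: question1, question2
        f_out.write('%f\n' % duplicate_prob(q1, q2))  # probability that the pair is a duplicate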

SQuAD

import checklist
from checklist.test_suite import TestSuite
suite_path = 'release_data/squad/squad_suite.pkl'
suite = TestSuite.from_file(suite_path)

Running tests with precomputed bert predictions:

pred_path = 'release_data/squad/predictions/bert'
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')
suite.visual_summary_table()

To test your own model, get predictions for pairs in release_data/squad/squad.jsonl (format: jsonl) or release_data/squad/squad.json (format: json, like SQuAD dev) and output them in a file where each line has a single string: the prediction span.
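A minimal sketch for the jsonl format, assuming each line is a JSON object containing the passage and question (the field names below are guesses; inspect the file first) and that answer_span is your own QA model:

import json

with open('release_data/squad/squad.jsonl') as f_in, \
     open('release_data/squad/predictions/mymodel', 'w') as f_out:
    for line in f_in:
        ex = json.loads(line)
        span = answer_span(ex['passage'], ex['question'])  # hypothetical QA call
        f_out.write(span.replace('\n', ' ') + '\n')        # one prediction span per line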

Testing huggingface transformer pipelines

See this notebook.

Code snippets

Templates

See 1. Generating data for more details.

import checklist
from checklist.editor import Editor
import numpy as np
editor = Editor()
ret = editor.template('{first_name} is {a:profession} from {country}.',
                       profession=['lawyer', 'doctor', 'accountant'])
np.random.choice(ret.data, 3)

['Mary is a doctor from Afghanistan.',
'Jordan is an accountant from Indonesia.',
'Kayla is a lawyer from Sierra Leone.']

RoBERTa suggestions

See 1. Generating data for more details.
In template:

ret = editor.template('This is {a:adj} {mask}.',  
                      adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]

['This is a good idea.',
'This is a good sign.',
'This is a good thing.']

Multiple masks:

ret = editor.template('This is {a:adj} {mask} {mask}.',
                      adj=['good', 'bad', 'great', 'terrible'])
ret.data[:3]

['This is a good history lesson.',
'This is a good chess move.',
'This is a good news story.']

Getting suggestions rather than filling out templates:

editor.suggest('This is {a:adj} {mask}.',
               adj=['good', 'bad', 'great', 'terrible'])[:5]

['idea', 'sign', 'thing', 'example', 'start']

Getting suggestions for replacements (only a single text allowed, no templates):

editor.suggest_replace('This is a good movie.', 'good')[:5]

['great', 'horror', 'bad', 'terrible', 'cult']

Getting suggestions through jupyter visualization:

editor.visual_suggest('This is {a:mask} movie.')

visual suggest

Multilingual suggestions

Just initialize the editor with the language argument (it should work with language names and ISO 639-1 codes):

import checklist
from checklist.editor import Editor
import numpy as np
# in Portuguese
editor = Editor(language='portuguese')
ret = editor.template('O João é um {mask}.',)
ret.data[:3]

['O João é um português.',
'O João é um poeta.',
'O João é um brasileiro.']

# in Chinese
editor = Editor(language='chinese')
ret = editor.template('西游记的故事很{mask}。',)
ret.data[:3]

['西游记的故事很精彩。',
'西游记的故事很真实。',
'西游记的故事很经典。']

We're using FlauBERT for French, German BERT for German, and XLM-RoBERTa for everything else (click the link for a list of supported languages). We can't vouch for the quality of the suggestions in other languages, but they seem to work reasonably well for the languages we speak (although not as well as for English).

Lexicons (somewhat multilingual)

editor.lexicons is a dictionary, which can be used in templates. For example:

import checklist
from checklist.editor import Editor
import numpy as np
# Default: English
editor = Editor()
ret = editor.template('{male1} went to see {male2} in {city}.', remove_duplicates=True)
list(np.random.choice(ret.data, 3))

['Dan went to see Hugh in Riverside.',
'Stephen went to see Eric in Omaha.',
'Patrick went to see Nick in Kansas City.']

Person names and location (country, city) names are multilingual, depending on the editor language. We got the data from Wikidata, so there is a bias towards names that appear on Wikipedia.

editor = Editor(language='german')
ret = editor.template('{male1} went to see {male2} in {city}.', remove_duplicates=True)
list(np.random.choice(ret.data, 3))

['Rolf went to see Klaus in Leipzig.',
'Richard went to see Jörg in Marl.',
'Gerd went to see Fritz in Schwerin.']

List of available lexicons:

editor.lexicons.keys()

dict_keys(['male', 'female', 'first_name', 'first_pronoun', 'last_name', 'country', 'nationality', 'city', 'religion', 'religion_adj', 'sexual_adj', 'country_city', 'male_from', 'female_from', 'last_from'])

Some of these cannot be used directly in templates because they are themselves dictionaries. For example, male_from, female_from, last_from and country_city are dictionaries from country to male names, female names, last names and most populous cities.
You can call editor.lexicons.male_from.keys() for a list of country names. Example usage:

import numpy as np
countries = ['France', 'Germany', 'Brazil']
for country in countries:
    ts = editor.template('{male} {last} is from {city}',
                male=editor.lexicons.male_from[country],
                last=editor.lexicons.last_from[country],
                city=editor.lexicons.country_city[country],
               )
    print('Country: %s' % country)
    print('\n'.join(np.random.choice(ts.data, 3)))
    print()

Country: France
Jean-Jacques Brun is from Avignon
Bruno Deschamps is from Vitry-sur-Seine
Ernest Picard is from Chambéry

Country: Germany
Rainer Braun is from Schwerin
Markus Brandt is from Gera
Reinhard Busch is from Erlangen

Country: Brazil
Gilberto Martins is from Anápolis
Alfredo Guimarães is from Indaiatuba
Jorge Barreto is from Fortaleza

Perturbing data for INVs and DIRs

See 2. Perturbing data for more details.
Custom perturbation function:

import re
import checklist
from checklist.perturb import Perturb
def replace_john_with_others(x, *args, **kwargs):
    # Returns None (if John is not present) or a list of strings with John replaced by Luke and Mark
    if not re.search(r'\bJohn\b', x):
        return None
    return [re.sub(r'\bJohn\b', n, x) for n in ['Luke', 'Mark']]

dataset = ['John is a man', 'Mary is a woman', 'John is an apostle']
ret = Perturb.perturb(dataset, replace_john_with_others)
ret.data

[['John is a man', 'Luke is a man', 'Mark is a man'],
['John is an apostle', 'Luke is an apostle', 'Mark is an apostle']]

General purpose perturbations (see tutorial for more):

import spacy
nlp = spacy.load('en_core_web_sm')
pdataset = list(nlp.pipe(dataset))
ret = Perturb.perturb(pdataset, Perturb.change_names, n=2)
ret.data

[['John is a man', 'Ian is a man', 'Robert is a man'],
['Mary is a woman', 'Katherine is a woman', 'Alexandra is a woman'],
['John is an apostle', 'Paul is an apostle', 'Gabriel is an apostle']]

ret = Perturb.perturb(pdataset, Perturb.add_negation)
ret.data

[['John is a man', 'John is not a man'],
['Mary is a woman', 'Mary is not a woman'],
['John is an apostle', 'John is not an apostle']]

Creating and running tests

See 3. Test types, expectation functions, running tests for more details.

MFT:

import checklist
from checklist.editor import Editor
from checklist.perturb import Perturb
from checklist.test_types import MFT, INV, DIR
editor = Editor()

t = editor.template('This is {a:adj} {mask}.',  
                      adj=['good', 'great', 'excellent', 'awesome'])
test1 = MFT(t.data, labels=1, name='Simple positives',
           capability='Vocabulary', description='')

INV:

dataset = ['This was a very nice movie directed by John Smith.',
           'Mary Keen was brilliant.',
          'I hated everything about this.',
          'This movie was very bad.',
          'I really liked this movie.',
          'just bad.',
          'amazing.',
          ]
t = Perturb.perturb(dataset, Perturb.add_typos)
test2 = INV(**t)

DIR:

from checklist.expect import Expect
def add_negative(x):
    phrases = ['Anyway, I thought it was bad.', 'Having said this, I hated it', 'The director should be fired.']
    return ['%s %s' % (x, p) for p in phrases]

t = Perturb.perturb(dataset, add_negative)
monotonic_decreasing = Expect.monotonic(label=1, increasing=False, tolerance=0.1)
test3 = DIR(**t, expect=monotonic_decreasing)

Running tests directly:

from checklist.pred_wrapper import PredictorWrapper
# wrapped_pp returns a tuple with (predictions, softmax confidences)
wrapped_pp = PredictorWrapper.wrap_softmax(model.predict_proba)
test.run(wrapped_pp)
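Here model is your own classifier. A minimal sketch of the kind of function wrap_softmax expects, with a dummy body standing in for a real model:

import numpy as np
from checklist.pred_wrapper import PredictorWrapper

def predict_proba(texts):
    # placeholder: uniform probabilities over 3 classes for each input text;
    # replace with a call to your own model
    return np.ones((len(texts), 3)) / 3

wrapped_pp = PredictorWrapper.wrap_softmax(predict_proba)  # pass wrapped_pp to test.run as above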

Running from a file:

# One line per example
test.to_raw_file('/tmp/raw_file.txt')
# each line has prediction probabilities (softmax)
test.run_from_file('/tmp/softmax_preds.txt', file_format='softmax', overwrite=True)

Summary of results:

test.summary(n=1)

Test cases: 400
Fails (rate): 200 (50.0%)

Example fails:
0.2 This is a good idea

Visual summary:

test.visual_summary()

visual summary

Saving and loading individual tests:

# save
test.save(path)
# load
test = MFT.from_file(path)

Custom expectation functions

See 3. Test types, expectation functions, running tests for more details.

If you are writing a custom expectation function, it must return a float or bool for each example, such that:

  • > 0 (or True) means the example passed,
  • <= 0 (or False) means it failed, and (optionally) the magnitude of the failure is indicated by the distance from 0, e.g. -10 is worse than -1 (see the sketch after this list), and
  • None means the test does not apply to this example, and it should not be counted.
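As a sketch of the float case, assuming a 3-class softmax where class 2 is the expected prediction (an illustration, not a built-in; wrap it with Expect.single as in the next snippet):

import numpy as np

def margin_to_class_2(x, pred, conf, label=None, meta=None):
    # Signed margin: positive (pass) when class 2 has the highest confidence,
    # negative (fail) otherwise; larger magnitude means a larger margin.
    return float(conf[2] - np.delete(conf, 2).max())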

Expectation on a single example:

def high_confidence(x, pred, conf, label=None, meta=None):
    return conf.max() > 0.95
expect_fn = Expect.single(high_confidence)

Expectation on pairs of (orig, new) examples (for INV and DIR):

def changed_pred(orig_pred, pred, orig_conf, conf, labels=None, meta=None):
    return pred != orig_pred
expect_fn = Expect.pairwise(changed_pred)

There's also Expect.testcase and Expect.test, amongst many others.
Check out expect.py for more details.

Test Suites

See 4. The CheckList process for more details.

Adding tests:

from checklist.test_suite import TestSuite
suite = TestSuite()
# assuming a test created as in the examples above:
suite.add(test)

Running a suite is the same as running an individual test, either directly or through a file:

from checklist.pred_wrapper import PredictorWrapper
# wrapped_pp returns a tuple with (predictions, softmax confidences)
wrapped_pp = PredictorWrapper.wrap_softmax(model.predict_proba)
suite.run(wrapped_pp)
# or suite.run_from_file, see examples above

To visualize results, you can call suite.summary() (same as test.summary), or suite.visual_summary_table(). This is what the latter looks like for BERT on sentiment analysis:

suite.visual_summary_table()

visual summary table

Finally, it's easy to save, load, and share a suite:

# save
suite.save(path)
# load
suite = TestSuite.from_file(path)

API reference

On readthedocs

Code of Conduct

Microsoft Open Source Code of Conduct

checklist's People

Contributors

abinayam02, avenge-prc777, bharatr21, bilalsal, dependabot[bot], ecederstrand, helpmefindaname, jlema, marcotcr, muhtasham, prrao87, ramji-c, tongshuangwu


checklist's Issues

ssahar

@marcotcr
Hi, I have tried to install checklist on Windows 10, but I received an error that I cannot resolve. First of all, it gives me an error for this line in the setup file: check_call([f"{sys.executable} -m pip install jupyter"], shell=True). After removing this line, I got the following error:
error: symbolic link privilege not held
I was wondering whether your package works on Windows, and if so, what I missed here.

Thanks


Does the Perturb module support multilingual data?

Thank you for sharing such a great tool, it‘s really useful and amazing!

When I used Perturb.change_names to get an INV test case, it could not perturb the names in my Chinese dataset.
Does this module currently support multiple languages? Can you tell me how to use it?

RuntimeError: selected index k out of range

When trying the "Multilingual suggestions" example an error occurs:

/usr/local/lib/python3.6/dist-packages/checklist/text_generation.py in unmask(self, text_with_mask, beam_size, candidates)
180 else:
181 if forbid:
--> 182 v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
183 top_preds = self.with_space[top_preds]
184 else:

RuntimeError: selected index k out of range

pip install fails

pip install keeps failing with the following message:

ERROR: Command errored out with exit status 1:
command: 'd:\python\vir37v1\scripts\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"'; file='"'"'D:\Users\CAIZ1\AppData\Local\Temp\pip-install-by_ojkpx\checklist\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' egg_info --egg-base 'D:\Users\CAIZ~1\AppData\Local\Temp\pip-pip-egg-info-hi8iwmvg'

Choice behind RoBERTa and BERT

Just a curiosity.
Is there a specific reason why the fill-ins for the templates ({mask}) produced by CheckList are created from suggestions of RoBERTa and BERT, given that these same two models are analysed and their shortcomings are highlighted in the paper? Why rely, in some way, on "imperfect" models?

Thanks a lot

Possible to install this without mysql?

When trying to install checklist, I run into the error: OSError: mysql_config not found. This happens during pip install mysqlclient, which is a dependency of the pattern package.

The simple solution of course is to install mysql. Unfortunately, this is not possible in many university shared computing environments due to security risks. As far as I can tell, mysql is not used anywhere in this tool. Is there any workaround to install it in an environment without mysql?

Roberta-specific tokenization?

self.space_prefix = self.tokenizer.tokenize(' John')[0].split('John')[0]

This line seems to assume Roberta-style tokenization, where there is a special character (G-dot) marking a token that occurs at the beginning of a word, but it fails for BERT-style tokenization, which uses a special sequence (##) to mark tokens not at the beginning of a word. It would also fail if the tokenizer is uncased (John -> john). I can't really see a way to fix it without knowing something about different model names, though.

NER tests

Hi,
First thanks for your great work and this very useful library.

I am looking to test NER models (Transformer- and LSTM-based), and I would like to know if you have any example code showing how to test such models.
I haven't found any, even in notebook 5. Testing transformer pipelines.

I guess the key is to be able to write an expectation function at the token level? Maybe you have already explored something?

Many thanks!

Extra `__init__.py`

Hi,

In the root directory of this repo, there exists an __init__.py file. I'm wondering if this is intentional, because it causes Python to think that the repo itself is a Python package when it's not. For example, if we have a directory such as the following,

.
├── checklist (repo)
└── foo.py

and I import checklist.editor in foo.py, Python looks for editor in the repo, which it cannot find, triggering a ModuleNotFoundError.

RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long

I have to say it's really a good job! I ran into a problem when running the demo (multilingual suggestions with RoBERTa). I tried setting the language to English and to Chinese, but both fail to run with "RuntimeError: Expected tensor for argument #1 'indices' to have scalar type Long; but got torch.IntTensor instead (while checking arguments for embedding)". I have no idea how to solve it and it has been bothering me for a long time; can you give me some suggestions? I would appreciate it. By the way, my environment is as follows:

  1. torch==1.2.0
  2. torchvision==0.4.0
  3. tensorflow-gpu==1.13.2
  4. Python==3.6.1

In the Jupyter viewer, INV and DIR tests show the predictions of the two sentences in reverse order.

  • In INV and DIR tests, every example consists of two sentences, one original and one changed. If I use Suite.summary() the output is correct, but in Jupyter, where I use Suite.Summarizer() to show the results, the predicted label and confidence of the two sentences are in reversed order for every INV and DIR test.
  • For one example (original: "I'm the guy.", model prediction 1), I use add_typos on this sentence and get (changed: "I'm eth guy.", model prediction 0).

    If I use Suite.summary(), I get the correct result:
    Example fails: 1 (0.8) I'm the guy. 0 (0.9) I'm eth guy.
    But with Suite.Summarizer(), the result shown in Jupyter is:
    I'm the guy. → I'm eth guy. Pred: 0(0.9)→1(0.8)
    I can't find where the bug happens, so please help me debug it. Thanks!

SQuAD test doesn't work on Windows

The following line from the tutorial causes an error:
suite.run_from_file(pred_path, overwrite=True, file_format='pred_only')

UnicodeDecodeError: 'charmap' codec can't decode byte 0x81 in position 6884: character maps to <undefined>

I fixed the issue by changing abstract_test.py line 15 to:
f = open(path, 'r', encoding='utf-8')

Labelled data

Hi,
First of all, thank you for sharing your work.
How can I get labelled data (text, ground truth) to calculate uncertainty in these tasks? I couldn't find any function for it. In the tests_n500 file, there are only sample texts without any labels.

Thanks in advance
Best

Very useful product, I want to get some advice.

Thanks for releasing this useful product; I recently adapted it to Chinese. If I want to use it to check/test other models that do not use a softmax for the last layer (such as an NER model), or that do not produce one probability vector per example, what changes should I make to adapt to these models?

NER example

Hi!
In the paper you describe some invariance tests for NER models; however, in the library I can't find any code for this. Could you share it as well? Thanks :)

Win10 type errors with tensors in unmask() text_generation.py

Encountered a couple of type errors with Win10 Python/Torch on tensors in the unmask() function from text_generation.py while running the examples from the introduction page. These required fixing through explicit type casting with .to(torch.int64).

Diff to fix is as follows:

PS C:\src\checklist> git diff .\checklist\text_generation.py
diff --git a/checklist/text_generation.py b/checklist/text_generation.py
index b0ad20e..9a6a5ed 100644
--- a/checklist/text_generation.py
+++ b/checklist/text_generation.py
@@ -163,7 +163,7 @@ class TextGenerator(object):
             # print('ae')
             # print('\n'.join([tokenizer.decode(x) for x in to_pred]))
             # print()
-            to_pred = torch.tensor(to_pred, device=self.device)
+            to_pred = torch.tensor(to_pred, device=self.device).to(torch.int64) # fix for int32 / int64 type mismatch on win10
             with torch.no_grad():
                 outputs = model(to_pred)[0]
             for i, current in enumerate(current_beam):
@@ -179,7 +179,7 @@ class TextGenerator(object):
                     new = [(current[0] + [int(x[0])], float(x[1]) + current[1]) for x in zip(cands_to_use, scores)]
                 else:
                     if forbid:
-                        v, top_preds = torch.topk(outputs[i, masked[size], self.with_space], beam_size + 10)
+                        v, top_preds = torch.topk(outputs[i, masked[size], self.with_space.to(torch.int64)], beam_size + 10) # fix for int32 / int64 type mismatch on win10
                         top_preds = self.with_space[top_preds]
                     else:
                         v, top_preds = torch.topk(outputs[i, masked[size]], beam_size + 10)

Change the data in tests_n500

Hi!
First of all, thanks for your work, it is very inspiring.

I would like to conduct tests on another test set, semantically different from airline related tweets (in particular, I would like to use data from this competition https://amievalita2018.wordpress.com/, which collects misogynistic tweets, in order to explore the fairness of the models).

To do this, I just have to replace the file tests_n500 and insert in the folder "predictions/" the file containing the predictions in the usual format (i.e. the label 0/1/2 and the three probabilities), right?

Excuse the beginner's question :) Thanks a lot!

Checklist viewer installation fails on macOS

jupyter nbextension install --py --user checklist.viewer

Results in:

Installing test/lib/python3.7/site-packages/checklist/viewer/static -> viewer
Making directory: user/Library/Jupyter/nbextensions/viewer/
Traceback (most recent call last):
  File "test/bin/jupyter-nbextension", line 8, in <module>
    sys.exit(main())
  File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 270, in launch_instance
    return super(JupyterApp, cls).launch_instance(argv=argv, **kwargs)
  File "test/lib/python3.7/site-packages/traitlets/config/application.py", line 664, in launch_instance
    app.start()
  File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 983, in start
    super(NBExtensionApp, self).start()
  File "test/lib/python3.7/site-packages/jupyter_core/application.py", line 259, in start
    self.subapp.start()
  File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 711, in start
    self.install_extensions()
  File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 690, in install_extensions
    **kwargs
  File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 220, in install_nbextension_python
    destination=dest, logger=logger
  File "test/lib/python3.7/site-packages/notebook/nbextensions.py", line 187, in install_nbextension
    os.makedirs(dest_dir)
  File "test/bin/../lib/python3.7/os.py", line 221, in makedirs
    mkdir(name, mode)
FileNotFoundError: [Errno 2] No such file or directory: 'user/Library/Jupyter/nbextensions/viewer/'

In a macOS virtualenv for Python 3.

Loading a finetuned model

Hi,

What's the easiest way to load a finetuned BERT model in the CheckList process? Seems like models are wrapped in a ModelWrapper class, and that wrapper is not available.

Thanks.

Labels cannot be list of int

The following code gives me an error:

from checklist.editor import Editor
editor = Editor()

exs = ['hello', 'good bye']
labels = [0, 1]
ret = editor.template('{ex}', ex=exs, labels=labels, save=True, meta=True)

Based on the documentation I expect labels to accept a list of ints.

Feature request: static HTML output

Is it possible to output CheckList results in some structured format, e.g. HTML or XML?

We'd like to integrate CheckList as a CI step and want to be able to persist the model performance.

Irony & Sarcasm as new capability

Hi there :) Just wondering if you have any suggestions, in the context of sentiment analysis, on how to create a new capability concerning irony and sarcasm.
Given the three types of tests, the most suitable one to start with would be MFT, but what about the labels? If "irony datasets" contain tweets only marked as ironic or not, they could be either negative or positive with respect to sentiment, so what I thought of is to use the expectation function is_not_1, expecting the label NOT to be neutral... but I am not convinced by this solution.

Squad model source

Hi,

What is the source of the SQuAD model used with this API? (taken from the SQuAD tutorial)

model = bert_squad_model.BertSquad()
invert = lambda a: model.predict_pairs([(x[1], x[0]) for x in a])
new_pp = PredictorWrapper.wrap_predict(invert)

How to perturb data by mixing two independent functions?

Hi,
As in your tutorials, Perturb.perturb usually takes one function as input (example: Perturb.change_names). But in my scenario, I want to pass two functions to Perturb.perturb, so that it returns sentences which are a mixture of func1 and func2.
Ex:
func1: change_names
func2: change_city
I would like it to return a sentence in which the name is changed by func1 and/or the city is changed by func2 (something like Perturb.perturb(data, Perturb.change_names, Perturb.change_city)). Otherwise, I can still use Perturb.perturb(data, Perturb.change_names) and Perturb.perturb(data, Perturb.change_city) independently.
How can I do that?

CUDA OOM for beam search in perturbation

When using perturb functions to replace words with synonyms, this line causes a CUDA OOM error:

ret = self.unmask(masked, beam_size=100000000, candidates=options)

The beam size is unbounded. Is it possible to make it configurable by users when calling the antonyms/synonyms API, so that the memory cost is more controllable?
By the way, thank you for such great work!

Missing documentation for Editor and Perturb classes (readthedocs)

The editor and perturb modules are currently missing documentation - is this still a work in progress? It would be very nice to be able to browse through the various methods and types for these modules, just like the expect module allows us to do currently, so I just wanted to point this out.

Thanks for this awesome library!

Could you give a more detailed environment?

There is a lot of troubleshooting involved when following your guide in README.md; none of the demo code can be replicated on my computer. I don't know if there are hidden bugs or if the issue is only on my side. Could you give us a more detailed environment specification, including the required libraries and their versions?

Editor.suggest() ignores nsamples parameter

Code to reproduce

import checklist
from checklist.editor import Editor

editor = Editor()
ret = editor.suggest('I am a {mask} {mask}.', nsamples=5)
print(ret)

Expected behavior

It should generate 5 samples.

Actual behavior

It generates 11000 samples.

Note that editor.template() correctly handles nsamples, only editor.suggest() is broken.

Perturb.strip_punctuation possible endless loop for some cases

The while loop in the Perturb.strip_punctuation function becomes an endless loop for inputs like :.

while doc[-1].pos_ == 'PUNCT':

To reproduce it:

import spacy
from checklist.perturb import Perturb
model_path = ''  # Spacy model path
nlp = spacy.load(model_path)
sent = nlp(':')
Perturb.strip_punctuations(sent)  # Endless loop!

I checked the code and found that doc[-1].pos_ after stripping the last token : was always PUNCT... it seems like a spaCy bug.

To avoid this, I suggest checking the length of doc in the while condition and returning when the length of doc becomes 0.

while len(doc) and doc[-1].pos_ == 'PUNCT':

error: BadZipFile: File is not a zip file

Thank you for your useful work!
If I run:
from checklist.perturb import Perturb
I get the following output:
error: BadZipFile: File is not a zip file
Can you tell me how to solve this problem?

Custom aggregate function results in attribute error.

Writing a custom expectation aggregate function for test cases results in an attribute error.

The following example demonstrates this:


import checklist
import torch
import numpy as np

from transformers import AutoTokenizer, AutoModel
from sklearn.metrics.pairwise import cosine_similarity
from checklist.pred_wrapper import PredictorWrapper
from checklist.expect import Expect
from checklist.test_types import INV
from checklist.perturb import Perturb

dataset = [
    'I am checking the checklist',
    'There is a bug in the code',
]


class Model(object):
    THRESHOLD = 0.9

    def __init__(self):
        self.tokenizer = AutoTokenizer.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")
        self.model = AutoModel.from_pretrained("sentence-transformers/bert-base-nli-mean-tokens")

    def _mean_pooling(self, model_output, attention_mask):
        token_embeddings = model_output[0] #First element of model_output contains all token embeddings
        input_mask_expanded = attention_mask.unsqueeze(-1).expand(token_embeddings.size()).float()
        sum_embeddings = torch.sum(token_embeddings * input_mask_expanded, 1)
        sum_mask = torch.clamp(input_mask_expanded.sum(1), min=1e-9)
        return sum_embeddings / sum_mask

    def get_encoding(self, sentences):
        encoded_input = self.tokenizer(sentences, padding=True, truncation=True, max_length=128, return_tensors='pt')
        with torch.no_grad():
            model_output = self.model(**encoded_input)
            return self._mean_pooling(model_output, encoded_input['attention_mask'])

    def get_similarities(self, sentence1, other_sentences):    
        e1 = self.get_encoding(str(sentence1))
        e2 = self.get_encoding([str(x) for x in other_sentences])
        return np.squeeze(cosine_similarity(e1, e2))


def similarity_score(inputs):
    all_preds = list()
    for sentence1, other_sentences in inputs:
        scores = model.get_similarities(sentence1, other_sentences)
        all_preds.append(scores)
    return np.array(all_preds)


def all_similar(x, pred, conf, label=None, meta=None):
    """if any of the results is is below threshold testcase doesn't pass"""
    ret = np.sum(pred < Model.THRESHOLD) == 0
    print(f'pred = {pred}, ret = {ret}')
    return ret

def add_typos(sentence, n=5):
    typos = []
    for i in range(n):
        typos.append(Perturb.perturb([sentence], Perturb.add_typos, keep_original=False))
    return sentence, typos


wrapped_pp = PredictorWrapper.wrap_predict(similarity_score)
expect_all_similar = Expect.single(all_similar)

model = Model()

t = Perturb.perturb(dataset, add_typos, nsamples=200, keep_original=False)
test = INV(**t, name='add typos', capability='typo',
           description='', expect=expect_all_similar, agg_fn=expect_all_similar)

test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
test.summary()

This results in the following exception:

---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-38-0b1b8f7c7467> in <module>
      1 test = INV(**t, name='add typos', capability='typo',
      2            description='', expect=expect_all_similar, agg_fn=expect_all_similar)
----> 3 test.run(predict_and_confidence_fn=wrapped_pp, overwrite=True, verbose=True)
      4 test.summary()

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run(self, predict_and_confidence_fn, overwrite, verbose, n, seed)
    351             print('Predicting %d examples' % len(examples))
    352         preds, confs = predict_and_confidence_fn(examples)
--> 353         self.run_from_preds_confs(preds, confs, overwrite=overwrite)
    354 
    355     def fail_idxs(self):

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in run_from_preds_confs(self, preds, confs, overwrite)
    291         self._check_create_results(overwrite)
    292         self.update_results_from_preds(preds, confs)
--> 293         self.update_expect()
    294 
    295     def run_from_file(self, path, file_format=None, format_fn=None, ignore_header=False, overwrite=False):

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/abstract_test.py in update_expect(self)
    128         self._check_results()
    129         self.results.expect_results = self.expect(self)
--> 130         self.results.passed = Expect.aggregate(self.results.expect_results, self.agg_fn)
    131 
    132     def example_list_and_indices(self, n=None, seed=None):

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate(data, agg_fn)
    145         # data is a list of lists or list of np.arrays
    146         # import pdb; pdb.set_trace()
--> 147         return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
    148 
    149     @staticmethod

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in <listcomp>(.0)
    145         # data is a list of lists or list of np.arrays
    146         # import pdb; pdb.set_trace()
--> 147         return np.array([Expect.aggregate_testcase(x, agg_fn) for x in data])
    148 
    149     @staticmethod

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in aggregate_testcase(expect_results, agg_fn)
    160             return None
    161         else:
--> 162             return agg_fn(np.array(r))
    163 
    164     @staticmethod

~/.virtualenvs/test-demo-TklxO9OB/lib/python3.8/site-packages/checklist/expect.py in expect(self)
     75         """
     76         def expect(self):
---> 77             zipped = iter_with_optional(self.data, self.results.preds, self.results.confs, self.labels, self.meta, self.run_idxs)
     78             return [fn(x, pred, confs, labels, meta) for x, pred, confs, labels, meta in zipped]
     79         return expect

AttributeError: 'numpy.ndarray' object has no attribute 'results'

question about prediction rendering

Should the old prediction be on the left side of the arrow?

code

          predTag = <Tag style={{verticalAlign: "middle"}}>
                Pred: <span className="example-token rewrite-remove">{newobj.pred}{confStr}</span>
                {replaceArrow}
                <span className="example-token rewrite-add">{oldobj.pred}{confStrOld}</span>
            </Tag>

pipeline step in 4. The CheckList process tutorial fails if there is no GPU

The code snippet pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer, framework="pt", device=0) in cell #2 of checklist/notebooks/tutorials/4. The CheckList process.ipynb fails on macOS and other systems that do not have a GPU, with the error AssertionError: Torch not compiled with CUDA enabled. GPU devices are 0-indexed, and changing the parameter device=0 to device=-1 resolves the problem. As there is no explicit requirement to have a GPU, it is perhaps better to change this parameter value so that the notebook runs on all systems, including those without a GPU.
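A minimal sketch of a device-agnostic variant of that cell (model and tokenizer are the objects loaded earlier in the notebook):

import torch
from transformers import pipeline

device = 0 if torch.cuda.is_available() else -1  # first GPU if available, otherwise CPU
pipe = pipeline("sentiment-analysis", model=model, tokenizer=tokenizer,
                framework="pt", device=device)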

JSON serialization error when loading examples in visual summary

When trying to use the visual summary functionality on a TestSuite, I ran into an issue with loading examples: I get the error message ValueError: Can't clean for JSON: array([1.]). I get this both when using suite.visual_summary_table() and when using suite.visual_summary_by_test().

However, when I try suite.summary() it works fine and I get something like this:

NER test
Test cases:      100
Fails (rate):    4 (4.0%)

Example fails:
0.0 0.0 1.0 Ian Young cooked the burgers in some broth.
----
0.0 0.0 1.0 George Rogers cooked the meats in some broth.
----
0.0 0.0 1.0 Paul Brown cooked the chicken al dente.
----

where the three numbers before every sample are the probability scores (in the case of my model, these are always 1.0 or 0.0).

Is this expected behavior (am I doing something wrong?) or is it a bug?

See the traceback from the visualization widget below; note that the error is raised not when initially loading the widget, but only once example fails are being loaded.

ValueError                                Traceback (most recent call last)
~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in handle_events(self, _, content, buffers)
     46         elif content.get('event', '') == 'switch_test':
     47             testname = content.get("testname", "")
---> 48             self.on_select_test(testname)
     49 
     50     def on_select_test(self, testname: str) -> None:

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/suite_summarizer.py in on_select_test(self, testname)
     54             summary, testcases = self.select_test_fn(testname)
     55         self.reset_summary(summary)
---> 56         self.reset_testcases(testcases)

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in reset_testcases(self, testcases)
     46         self.filtered_testcases = testcases if testcases else []
     47         self.tokenize_testcases()
---> 48         self.search(filter_tags=[], is_fail_case=True)
     49 
     50     def handle_events(self, _, content, buffers):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in search(self, filter_tags, is_fail_case)
    118         self.compute_stats_result(candidate_testcases_not_fail)
    119         self.to_slice_idx = 0
--> 120         self.fetch_example()
    121 
    122     def fetch_example(self):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/checklist/viewer/test_summarizer.py in fetch_example(self)
    126             new_examples = self.candidate_testcases[self.to_slice_idx : self.to_slice_idx+self.max_return]
    127             self.to_slice_idx += len(new_examples)
--> 128             self.testcases = [e for e in new_examples]

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in __set__(self, obj, value)
    583             raise TraitError('The "%s" trait is read-only.' % self.name)
    584         else:
--> 585             self.set(obj, value)
    586 
    587     def _validate(self, obj, value):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in set(self, obj, value)
    572             # we explicitly compare silent to True just in case the equality
    573             # comparison above returns something other than True/False
--> 574             obj._notify_trait(self.name, old_value, new_value)
    575 
    576     def __set__(self, obj, value):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/traitlets/traitlets.py in _notify_trait(self, name, old_value, new_value)
   1137             new=new_value,
   1138             owner=self,
-> 1139             type='change',
   1140         ))
   1141 

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in notify_change(self, change)
    603             if name in self.keys and self._should_send_property(name, getattr(self, name)):
    604                 # Send new state to front-end
--> 605                 self.send_state(key=name)
    606         super(Widget, self).notify_change(change)
    607 

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in send_state(self, key)
    487             state, buffer_paths, buffers = _remove_buffers(state)
    488             msg = {'method': 'update', 'state': state, 'buffer_paths': buffer_paths}
--> 489             self._send(msg, buffers=buffers)
    490 
    491 

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipywidgets/widgets/widget.py in _send(self, msg, buffers)
    735         """Sends a message to the model in the front-end."""
    736         if self.comm is not None and self.comm.kernel is not None:
--> 737             self.comm.send(data=msg, buffers=buffers)
    738 
    739     def _repr_keys(self):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in send(self, data, metadata, buffers)
    121         """Send a message to the frontend-side version of this comm"""
    122         self._publish_msg('comm_msg',
--> 123             data=data, metadata=metadata, buffers=buffers,
    124         )
    125 

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/comm/comm.py in _publish_msg(self, msg_type, data, metadata, buffers, **keys)
     63         data = {} if data is None else data
     64         metadata = {} if metadata is None else metadata
---> 65         content = json_clean(dict(data=data, comm_id=self.comm_id, **keys))
     66         self.kernel.session.send(self.kernel.iopub_socket, msg_type,
     67             content,

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    175 
    176     if isinstance(obj, list):
--> 177         return [json_clean(x) for x in obj]
    178 
    179     if isinstance(obj, dict):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
    175 
    176     if isinstance(obj, list):
--> 177         return [json_clean(x) for x in obj]
    178 
    179     if isinstance(obj, dict):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    175 
    176     if isinstance(obj, list):
--> 177         return [json_clean(x) for x in obj]
    178 
    179     if isinstance(obj, dict):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in <listcomp>(.0)
    175 
    176     if isinstance(obj, list):
--> 177         return [json_clean(x) for x in obj]
    178 
    179     if isinstance(obj, dict):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    189         out = {}
    190         for k,v in iteritems(obj):
--> 191             out[unicode_type(k)] = json_clean(v)
    192         return out
    193     if isinstance(obj, datetime):

~/anaconda3/envs/frameid-checks/lib/python3.7/site-packages/ipykernel/jsonutil.py in json_clean(obj)
    195 
    196     # we don't understand it, it's probably an unserializable object
--> 197     raise ValueError("Can't clean for JSON: %r" % obj)

ValueError: Can't clean for JSON: array([1.])
