microsoft / promptbench

A unified evaluation framework for large language models

Home Page: http://aka.ms/promptbench

License: MIT License

Language: Python (100%)
Topics: adversarial-attacks, chatgpt, evaluation, large-language-models, robustness, prompt, prompt-engineering, benchmark

promptbench's Introduction


PromptBench: A Unified Library for Evaluating and Understanding Large Language Models.
Paper · Documentation · Leaderboard · More papers

Table of Contents
  1. News and Updates
  2. Introduction
  3. Installation
  4. Usage
  5. Datasets and Models
  6. Benchmark Results
  7. Acknowledgments

News and Updates

  • [13/03/2024] Add support for multi-modal models and datasets.
  • [05/01/2024] Add support for BigBench Hard, DROP, ARC datasets.
  • [16/12/2023] Add support for Gemini, Mistral, Mixtral, Baichuan, Yi models.
  • [15/12/2023] Add detailed instructions for users to add new modules (models, datasets, etc.); see examples/add_new_modules.md.
  • [05/12/2023] Published promptbench 0.0.1.

Introduction

PromptBench is a PyTorch-based Python package for evaluating Large Language Models (LLMs). It provides user-friendly APIs for researchers to conduct evaluations of LLMs. See the technical report: https://arxiv.org/abs/2312.07910.

Code Structure

What does promptbench currently provide?

  1. Quick model performance assessment: we offer a user-friendly interface that allows for quick model building, dataset loading, and evaluation of model performance.
  2. Prompt engineering: we implement several prompt engineering methods, for example Few-shot Chain-of-Thought [1], EmotionPrompt [2], and ExpertPrompting [3].
  3. Evaluating adversarial prompts: promptbench integrates prompt attacks [4], enabling researchers to simulate black-box adversarial prompt attacks on models and evaluate their robustness (see details here).
  4. Dynamic evaluation to mitigate potential test data contamination: we integrate the dynamic evaluation framework DyVal [5], which generates evaluation samples on the fly with controlled complexity.

Installation

Install via pip

We provide a Python package promptbench for users who want to start evaluation quickly. Simply run:

pip install promptbench

Note that the pip release may lag behind the latest updates, so if you want to use the latest features or develop on top of our code, you should install via GitHub.

Install via GitHub

First, clone the repo:

git clone git@github.com:microsoft/promptbench.git

Then,

cd promptbench

To install the required packages, you can create a conda environment:

conda create --name promptbench python=3.9

then use pip to install required packages:

pip install -r requirements.txt

Note that this installs only the basic Python packages. For prompt attacks, you will also need to install TextAttack.
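For example, assuming TextAttack's standard PyPI package name:

pip install textattack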

Usage

promptbench is easy to use and extend. Going through the examples below will help you get familiar with promptbench, whether you want a quick start, to evaluate existing datasets and LLMs, or to create your own datasets and models.

Please see Installation to install promptbench first.

If promptbench is installed via pip, you can simply do:

import promptbench as pb

If you installed promptbench from git and want to use it in other projects:

import sys

# Add the directory of promptbench to the Python path
sys.path.append('/home/xxx/promptbench')

# Now you can import promptbench by name
import promptbench as pb

We provide tutorials for:

  1. Evaluating models on existing benchmarks: please refer to examples/basic.ipynb for constructing your evaluation pipeline (a minimal end-to-end sketch also follows this list). For a multi-modal evaluation pipeline, please refer to examples/multimodal.ipynb.
  2. Testing the effects of different prompting techniques.
  3. Examining robustness against prompt attacks: please refer to examples/prompt_attack.ipynb to construct the attacks.
  4. Using DyVal for evaluation: please refer to examples/dyval.ipynb to construct DyVal datasets.
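As a quick orientation, here is a minimal end-to-end classification pipeline in the spirit of examples/basic.ipynb. The model, dataset, and prompt choices are illustrative, and the exact API may differ slightly across versions:

import promptbench as pb

# Load a dataset and a model (illustrative choices).
dataset = pb.DatasetLoader.load_dataset("sst2")
model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)

prompt = ("As a sentiment classifier, determine whether the following text is "
          "'positive' or 'negative'. Please classify: \nQuestion: {content}\nAnswer: ")

def proj_func(raw_pred):
    # Map free-form model output onto the SST-2 label space; -1 means unparsable.
    text = raw_pred.lower()
    return 1 if "positive" in text else (0 if "negative" in text else -1)

preds, labels = [], []
for data in dataset:
    input_text = pb.InputProcess.basic_format(prompt, data)  # fill {content} from the sample
    raw_pred = model(input_text)                              # run inference
    preds.append(pb.OutputProcess.cls(raw_pred, proj_func))   # project raw text to a label
    labels.append(data["label"])

print(pb.Eval.compute_cls_accuracy(preds, labels))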

Implemented Components

PromptBench currently supports different datasets, models, prompt engineering methods, adversarial attacks, and more. You are welcome to add more.

Datasets

  • Language datasets:
    • GLUE: SST-2, CoLA, QQP, MRPC, MNLI, QNLI, RTE, WNLI
    • MMLU
    • BIG-Bench Hard (Bool logic, valid parentheses, date...)
    • Math
    • GSM8K
    • SQuAD V2
    • IWSLT 2017
    • UN Multi
    • CSQA (CommonSense QA)
    • Numersense
    • QASC
    • Last Letter Concatenate
  • Multi-modal datasets:
    • VQAv2
    • NoCaps
    • MMMU
    • MathVista
    • AI2D
    • ChartQA
    • ScienceQA
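Each dataset above is loaded by its name string. A small sketch (the field names may vary by task; 'content' and 'label' are what the classification examples in this repository use):

import promptbench as pb

dataset = pb.DatasetLoader.load_dataset("sst2")
sample = dataset[0]                        # samples behave like dicts
print(sample["content"], sample["label"])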

Models

Language models:

  • Open-source models:
    • google/flan-t5-large
    • databricks/dolly-v1-6b
    • Llama2 series
    • vicuna-13b, vicuna-13b-v1.3
    • Cerebras/Cerebras-GPT-13B
    • EleutherAI/gpt-neox-20b
    • Google/flan-ul2
    • phi-1.5 and phi-2
  • Proprietary models
    • PaLM 2
    • GPT-3.5
    • GPT-4
    • Gemini Pro

Multi-modal models:

  • Open-source models:
    • BLIP2
    • LLaVA
    • Qwen-VL, Qwen-VL-Chat
    • InternLM-XComposer2-VL
  • Proprietary models
    • GPT-4v
    • Gemini Pro Vision
    • Qwen-VL-Max, Qwen-VL-Plus

Prompt Engineering

  • Chain-of-Thought (CoT) [1]
  • EmotionPrompt [2]
  • Expert prompting [3]
  • Zero-shot chain-of-thought
  • Generated knowledge [6]
  • Least to most [7]

Adversarial Attacks

  • Character-level attack
    • DeepWordBug
    • TextBugger
  • Word-level attack
    • TextFooler
    • BertAttack
  • Sentence-level attack
    • CheckList
    • StressTest
  • Semantic-level attack
    • Human-crafted attack
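As a rough illustration of how these attacks are driven (based on examples/prompt_attack.ipynb and the usage shown in the issues below; the import path, the eval_func signature, and unmodifiable_words are assumptions you should check against the notebook):

import promptbench as pb
from promptbench.prompt_attack import Attack  # assumed import path

model = pb.LLMModel(model="google/flan-t5-large", max_new_tokens=10, temperature=0.0001)
dataset = pb.DatasetLoader.load_dataset("sst2")

prompt = ("As a sentiment classifier, determine whether the following text is "
          "'positive' or 'negative'. Please classify: \nQuestion: {content}\nAnswer: ")

def eval_func(prompt, dataset, model):
    # Score a candidate prompt: accuracy of the model on the dataset under that prompt.
    preds, labels = [], []
    for data in dataset:
        input_text = pb.InputProcess.basic_format(prompt, data)
        raw_pred = model(input_text)
        preds.append(pb.OutputProcess.cls(raw_pred, lambda x: 1 if "positive" in x.lower() else 0))
        labels.append(data["label"])
    return pb.Eval.compute_cls_accuracy(preds, labels)

unmodifiable_words = ["positive'", "negative'", "content"]  # tokens the attack must not perturb

attack = Attack(model, "textfooler", dataset, prompt, eval_func, unmodifiable_words, verbose=True)
print(attack.attack())  # prints the adversarial prompt together with its scores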

Protocols and Analysis

  • Standard evaluation
  • Dynamic evaluation
  • Semantic evaluation
  • Benchmark results
  • Visualization analysis
  • Transferability analysis
  • Word frequency analysis

Benchmark Results

Please refer to our benchmark website for benchmark results on Prompt Attacks, Prompt Engineering and Dynamic Evaluation DyVal.

Acknowledgements

  • TextAttack
  • README Template
  • We thank the volunteers Hanyuan Zhang, Lingrui Li, and Yating Zhou for conducting the semantic-preserving experiment in the Prompt Attack benchmark.

Reference

[1] Jason Wei, et al. "Chain-of-Thought Prompting Elicits Reasoning in Large Language Models." arXiv preprint arXiv:2201.11903 (2022).

[2] Cheng Li, et al. "Emotionprompt: Leveraging psychology for large language models enhancement via emotional stimulus." arXiv preprint arXiv:2307.11760 (2023).

[3] Benfeng Xu, et al. "ExpertPrompting: Instructing Large Language Models to be Distinguished Experts." arXiv preprint arXiv:2305.14688 (2023).

[4] Zhu, Kaijie, et al. "PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts." arXiv preprint arXiv:2306.04528 (2023).

[5] Zhu, Kaijie, et al. "DyVal: Graph-informed Dynamic Evaluation of Large Language Models." arXiv preprint arXiv:2309.17167 (2023).

[6] Jiacheng Liu, et al. "Generated Knowledge Prompting for Commonsense Reasoning." arXiv preprint arXiv:2110.08387 (2021).

[7] Denny Zhou, et al. "Least-to-Most Prompting Enables Complex Reasoning in Large Language Models." arXiv preprint arXiv:2205.10625 (2022).

Citing promptbench and other research papers

Please cite us if you find this project helpful for your project/paper:

@article{zhu2023promptbench2,
  title={PromptBench: A Unified Library for Evaluation of Large Language Models},
  author={Zhu, Kaijie and Zhao, Qinlin and Chen, Hao and Wang, Jindong and Xie, Xing},
  journal={arXiv preprint arXiv:2312.07910},
  year={2023}
}

@article{zhu2023promptbench,
  title={PromptBench: Towards Evaluating the Robustness of Large Language Models on Adversarial Prompts},
  author={Zhu, Kaijie and Wang, Jindong and Zhou, Jiaheng and Wang, Zichen and Chen, Hao and Wang, Yidong and Yang, Linyi and Ye, Wei and Gong, Neil Zhenqiang and Zhang, Yue and others},
  journal={arXiv preprint arXiv:2306.04528},
  year={2023}
}

@article{zhu2023dyval,
  title={DyVal: Graph-informed Dynamic Evaluation of Large Language Models},
  author={Zhu, Kaijie and Chen, Jiaao and Wang, Jindong and Gong, Neil Zhenqiang and Yang, Diyi and Xie, Xing},
  journal={arXiv preprint arXiv:2309.17167},
  year={2023}
}

@article{chang2023survey,
  title={A survey on evaluation of large language models},
  author={Chang, Yupeng and Wang, Xu and Wang, Jindong and Wu, Yuan and Zhu, Kaijie and Chen, Hao and Yang, Linyi and Yi, Xiaoyuan and Wang, Cunxiang and Wang, Yidong and others},
  journal={arXiv preprint arXiv:2307.03109},
  year={2023}
}

Contributing

This project welcomes contributions and suggestions. Most contributions require you to agree to a Contributor License Agreement (CLA) declaring that you have the right to, and actually do, grant us the rights to use your contribution. For details, visit https://cla.opensource.microsoft.com.

When you submit a pull request, a CLA bot will automatically determine whether you need to provide a CLA and decorate the PR appropriately (e.g., status check, comment). Simply follow the instructions provided by the bot. You will only need to do this once across all repos using our CLA.

This project has adopted the Microsoft Open Source Code of Conduct. For more information see the Code of Conduct FAQ or contact [email protected] with any additional questions or comments.

If you have a suggestion that would make promptbench better, please fork the repo and create a pull request. You can also simply open an issue with the tag "enhancement". Don't forget to give the project a star! Thanks again!

  1. Fork the project
  2. Create your branch (git checkout -b your_name/your_branch)
  3. Commit your changes (git commit -m 'Add some features')
  4. Push to the branch (git push origin your_name/your_branch)
  5. Open a Pull Request

Trademarks

This project may contain trademarks or logos for projects, products, or services. Authorized use of Microsoft trademarks or logos is subject to and must follow Microsoft's Trademark & Brand Guidelines. Use of Microsoft trademarks or logos in modified versions of this project must not cause confusion or imply Microsoft sponsorship. Any use of third-party trademarks or logos is subject to those third parties' policies.

promptbench's People

Contributors

dependabot[bot], django-jiang, eltociear, hhhhhhao, icecream-and-tea, immortalise, jindongwang, madhavmathur, microsoftopensource, mingxuanxia, nabbisen, nadatelwazane, zhimin-z


promptbench's Issues

typo

Check https://llm-eval.github.io/pages/leaderboard/dyval.html

About Semantic Attacks Against Vicuna

Hi!

Thanks for your nice work!

One small question: when I visited your demo site and chose "Vicuna" + "MNLI" + "Semantic" + "zero-shot task", the site returned nothing. Could you please add these adversarial prompts? Many thanks!

Best wishes


Llama2 adversarial prompts

The prompts for Llama 2 have not been provided in prompts/adv_prompts, so running load_adv_prompt doesn't work when using Llama 2. Could these be added, please? Thanks!

examples/basic.ipynb

from tqdm import tqdm

for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        label = data['label']
        print(type(input_text))
        raw_pred = model(input_text)

        # process output
        pred = pb.OutputProcess.cls(raw_pred, proj_func)
        preds.append(pred)
        labels.append(label)

    # evaluate
    score = pb.Eval.compute_cls_accuracy(preds, labels)
    print(f"{score:.3f}, {prompt}")

but I have a question about the following error:
TypeError Traceback (most recent call last)
Cell In[22], line 10
8 label = data['label']
9 print(type(input_text))
---> 10 raw_pred = model(input_text)
12 # process output
13 pred = pb.OutputProcess.cls(raw_pred, proj_func)
end in

Some datasets are not available

I'm not sure if I'm missing something, but when I try to load MMLU or SQuAD V2, I always get an error (the JSON file is empty).

It does seem to fail to fetch from the URL provided here, since the data folder does not exist (a defensive tweak is sketched after the snippet):

        # check if the dataset exists, if not, download it
        self.filepath = os.path.join(self.data_dir, f"{dataset_name}.json")
        self.filepath2 = os.path.join(self.data_dir, f"{dataset_name}.jsonl")
        if not os.path.exists(self.filepath):
            if os.path.exists(self.filepath2):
                self.filepath = self.filepath2
            else:
                url = f'https://wjdcloud.blob.core.windows.net/dataset/promptbench/dataset/{dataset_name}.json'
                print(f"Downloading {dataset_name} dataset...")
                response = requests.get(url)
                with open(self.filepath, 'wb') as f:
                    f.write(response.content)
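A minimal defensive tweak (a sketch of a possible change, not the library's current behavior) would be to create the data folder and check the HTTP response before writing, so a failed download does not leave an empty or invalid JSON file behind:

                url = f'https://wjdcloud.blob.core.windows.net/dataset/promptbench/dataset/{dataset_name}.json'
                print(f"Downloading {dataset_name} dataset...")
                os.makedirs(self.data_dir, exist_ok=True)  # make sure the data folder exists
                response = requests.get(url)
                response.raise_for_status()                # fail loudly instead of writing an error payload
                with open(self.filepath, 'wb') as f:
                    f.write(response.content)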

Parsing the output of a model

The output of the model largely depends on the prompt. When performing text classification tasks such as SST-2, which require the model to answer 'negative' or 'positive', the model may answer in various forms, and the form of the answer may differ across models. Is there a good way to parse these results?
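One common approach (a sketch, not an official promptbench utility) is to make the projection function passed to pb.OutputProcess.cls defensive, so it tolerates the different answer formats models produce:

def proj_func(raw_pred: str) -> int:
    # Map varied outputs ('positive', 'Positive.', '1', ...) to SST-2 labels; -1 if unparsable.
    text = raw_pred.strip().lower().rstrip(".")
    if "positive" in text or text == "1":
        return 1
    if "negative" in text or text == "0":
        return 0
    return -1  # counted as incorrect during evaluation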

MMLU subset

There are 800+ samples in the MMLU subset (data/mmlu.json). How was this subset constructed?

How does it align with the 564 samples introduced in the paper?

Enhancement Proposal for Dynamic Evaluation Framework Integration

Dear PromptBench Contributors,

I trust this message finds you in good health and high spirits. I am writing to propose an enhancement to the PromptBench library that I believe could significantly augment its utility for researchers and practitioners alike.

Having perused the documentation and utilised the library extensively, I have found the dynamic evaluation framework, DyVal, to be an invaluable asset for assessing the performance of Large Language Models (LLMs) in a more robust and realistic manner. However, I have observed that the integration of DyVal could be further refined to provide a more seamless user experience and to extend its capabilities.

To wit, I propose the following enhancements:

  1. Streamlined Integration: Simplifying the process of setting up and running DyVal within the PromptBench environment. This could involve automating the setup process and providing a more intuitive API that abstracts away some of the complexities involved in configuring the dynamic evaluation parameters.

  2. Extended Dataset Support: Expanding the range of datasets compatible with DyVal. While the current selection is commendable, incorporating additional datasets, particularly from niche domains, could broaden the scope of dynamic evaluation and provide deeper insights into model performance across diverse contexts.

  3. Custom Evaluation Metrics: Facilitating the integration of custom evaluation metrics within the DyVal framework. This would allow researchers to define and utilise metrics that are specifically tailored to their unique research questions or application requirements.

  4. Real-time Performance Monitoring: Introducing functionality for real-time monitoring of model performance during dynamic evaluation. This could include visual dashboards or logging mechanisms that provide immediate feedback on the model's performance trends over time.

  5. Guidance on Best Practices: Providing comprehensive documentation and examples that elucidate best practices for leveraging DyVal in various research scenarios. This could greatly assist users in maximising the potential of dynamic evaluation for their specific use cases.

I am keen to engage in further dialogue regarding these suggestions and explore how we might collaborate to bring these enhancements to fruition. I am convinced that by enriching the DyVal integration, PromptBench can offer even greater value to the community by enabling more nuanced and insightful evaluation of LLMs.

Thank you for considering my proposal. I eagerly anticipate your response and am hopeful for the opportunity to contribute to the evolution of PromptBench.

Best regards,
yihong1120

The ‘temperature’ parameter in pb.LLMModel needs to be of type float


"""
A class providing an interface for various language models.

This class supports creating and interfacing with different language models, handling prompt engineering, and performing model inference.

Parameters:
-----------
model : str
    The name of the model to be used.
max_new_tokens : int, optional
    The maximum number of new tokens to be generated (default is 20).
temperature : float, optional
    The temperature for text generation (default is 0).
model_dir : str or None, optional
    The directory containing the model files (default is None).
system_prompt : str or None, optional
    The system prompt to be used (default is None).
openai_key : str or None, optional
    The OpenAI API key, if required (default is None).
sleep_time : int, optional
    The sleep time between inference calls (default is 3).
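For reference, a construction call consistent with the docstring above (the model name and key are placeholders) passes temperature explicitly as a float:

import promptbench as pb

model = pb.LLMModel(
    model="gpt-3.5-turbo",       # illustrative model name
    max_new_tokens=20,
    temperature=0.0,             # pass a float (0.0), not the int 0
    openai_key="YOUR_OPENAI_KEY",
    sleep_time=3,
)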

How to use visualize.py with llama-2-7b?

Hi, great work!

I want to know how to use visualize.py with llama-2-7b. I see that the code visualizes by computing gradients rather than using the attention matrix. So for llama-2-7b, if I want to observe which words in the input sentence the model weights more heavily, what should I do? Do I need ground-truth labels? Can this be achieved by modifying the visualize.py code? Could you please give me some advice? Thank you very much!

ChatGPT

I want to replicate the results in the paper using the ChatGPT model, but the predict_by_openai_api method raises a NotImplementedError.

Pip's dependency conflicts :datasets==2.15.0

I want to use this project to evaluate my models. However, when I use the command:

pip install promptbench

It raises the error in the title: promptbench requires datasets==2.15.0, but another package requires datasets==2.18.0.
So would you add support for datasets==2.18.0 or other versions? Thanks.

About black-box attacks

Hey, I want to know the details of the GPT attack.

Is your attack like this: for a sample whose original prediction is correct, if the output becomes wrong after you add an attack prompt, the attack is counted as successful?

Because in a score-based attack, the output probabilities decrease gradually, like 1.0 -> 0.8 -> 0.5 -> 0.4.
Is your method instead more like 1.0 -> 0?
Thanks

Interpretation of attack results

Thanks for this amazing repo on evaluating different prompts, and I am keen to apply it to more LLMs 🤗

I tried to evaluate the prompt attack following your example, and obtained the following output:
(0.0, "Aѕ a sentiment classifier, determine whether the following text is 'positive' or 'negative'. Please classify: \nQuestion: {content}\nAnswer: ", 0.0, 0.0)

I would like to know how to interpret these numbers and, more importantly, how I can obtain evaluation results like those in promptbench/prompt_attack/README.md, since I haven't found related information in your documents. Am I missing something?

Thank you very much!

SSL error when running the bertattack attack: is the BERT model download failing? The base LLM model is already stored locally.


SSLCertVerificationError Traceback (most recent call last)
File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:1346, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
1345 try:
-> 1346 h.request(req.get_method(), req.selector, req.data, headers,
1347 encode_chunked=req.has_header('Transfer-encoding'))
1348 except OSError as err: # timeout error

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:1285, in HTTPConnection.request(self, method, url, body, headers, encode_chunked)
1284 """Send a complete request to the server."""
-> 1285 self._send_request(method, url, body, headers, encode_chunked)

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:1331, in HTTPConnection._send_request(self, method, url, body, headers, encode_chunked)
1330 body = _encode(body, 'body')
-> 1331 self.endheaders(body, encode_chunked=encode_chunked)

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:1280, in HTTPConnection.endheaders(self, message_body, encode_chunked)
1279 raise CannotSendHeader()
-> 1280 self._send_output(message_body, encode_chunked=encode_chunked)

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:1040, in HTTPConnection._send_output(self, message_body, encode_chunked)
1039 del self._buffer[:]
-> 1040 self.send(msg)
1042 if message_body is not None:
1043
1044 # create a consistent interface to message_body

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:980, in HTTPConnection.send(self, data)
979 if self.auto_open:
--> 980 self.connect()
981 else:

File ~/miniconda3/envs/promptbench/lib/python3.9/http/client.py:1454, in HTTPSConnection.connect(self)
1452 server_hostname = self.host
-> 1454 self.sock = self._context.wrap_socket(self.sock,
1455 server_hostname=server_hostname)

File ~/miniconda3/envs/promptbench/lib/python3.9/ssl.py:501, in SSLContext.wrap_socket(self, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, session)
495 def wrap_socket(self, sock, server_side=False,
496 do_handshake_on_connect=True,
497 suppress_ragged_eofs=True,
498 server_hostname=None, session=None):
499 # SSLSocket class handles server_hostname encoding before it calls
500 # ctx._wrap_socket()
--> 501 return self.sslsocket_class._create(
502 sock=sock,
503 server_side=server_side,
504 do_handshake_on_connect=do_handshake_on_connect,
505 suppress_ragged_eofs=suppress_ragged_eofs,
506 server_hostname=server_hostname,
507 context=self,
508 session=session
509 )

File ~/miniconda3/envs/promptbench/lib/python3.9/ssl.py:1074, in SSLSocket._create(cls, sock, server_side, do_handshake_on_connect, suppress_ragged_eofs, server_hostname, context, session)
1073 raise ValueError("do_handshake_on_connect should not be specified for non-blocking sockets")
-> 1074 self.do_handshake()
1075 except (OSError, ValueError):

File ~/miniconda3/envs/promptbench/lib/python3.9/ssl.py:1343, in SSLSocket.do_handshake(self, block)
1342 self.settimeout(None)
-> 1343 self._sslobj.do_handshake()
1344 finally:

SSLCertVerificationError: [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1133)

During handling of the above exception, another exception occurred:

URLError Traceback (most recent call last)
Cell In[17], line 5
3 attack = Attack(model_t5, "bertattack", train_dataset, prompt, eval_func, unmodifiable_words, verbose=True)
4 # print attack result
----> 5 print(attack.attack())

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:234, in Attack.attack(self)
231 return results
233 else:
--> 234 return self.prompt_attack.attack(self.prompt)

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:660, in AdvPromptAttack.attack(self, example)
658 return SkippedAttackResult(goal_function_result)
659 else:
--> 660 result = self._attack(goal_function_result)
661 return result

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:602, in AdvPromptAttack._attack(self, initial_result)
591 def _attack(self, initial_result):
592 """Calls the SearchMethod to perturb the AttackedText stored in
593 initial_result.
594
(...)
600 or MaximizedAttackResult.
601 """
--> 602 final_result = self.search_method(initial_result)
603 self.clear_cache()
604 if final_result.goal_status == GoalFunctionResultStatus.SUCCEEDED:

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/search_methods/search_method.py:35, in SearchMethod.__call__(self, initial_result)
30 if not hasattr(self, "filter_transformations"):
31 raise AttributeError(
32 "Search Method must have access to filter_transformations method"
33 )
---> 35 result = self.perform_search(initial_result)
36 # ensure that the number of queries for this GoalFunctionResult is up-to-date
37 result.num_queries = self.goal_function.num_queries

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/search_methods/greedy_word_swap_wir.py:141, in GreedyWordSwapWIR.perform_search(self, initial_result)
139 results = None
140 while i < len(index_order) and not search_over:
--> 141 transformed_text_candidates = self.get_transformations(
142 cur_result.attacked_text,
143 original_text=initial_result.attacked_text,
144 indices_to_modify=[index_order[i]],
145 )
146 i += 1
147 if len(transformed_text_candidates) == 0:

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:519, in AdvPromptAttack.get_transformations(self, current_text, original_text, **kwargs)
514 else:
515 transformed_texts = self._get_transformations_uncached(
516 current_text, original_text, **kwargs
517 )
--> 519 return self.filter_transformations(
520 transformed_texts, current_text, original_text
521 )

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:584, in AdvPromptAttack.filter_transformations(self, transformed_texts, current_text, original_text)
582 if self.constraints_cache[(current_text, transformed_text)]:
583 filtered_texts.append(transformed_text)
--> 584 filtered_texts += self._filter_transformations_uncached(
585 uncached_texts, current_text, original_text=original_text
586 )
587 # Sort transformations to ensure order is preserved between runs
588 filtered_texts.sort(key=lambda t: t.text)

File ~/autodl-tmp/promptbench/promptbench/prompt_attack/attack.py:544, in AdvPromptAttack._filter_transformations_uncached(self, transformed_texts, current_text, original_text)
539 if not original_text:
540 raise ValueError(
541 f"Missing original_text argument when constraint {type(C)} is set to compare against original_text"
542 )
--> 544 filtered_texts = C.call_many(filtered_texts, original_text)
545 else:
546 filtered_texts = C.call_many(filtered_texts, current_text)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/constraints/constraint.py:50, in Constraint.call_many(self, transformed_texts, reference_text)
46 except KeyError:
47 raise KeyError(
48 "transformed_text must have last_transformation attack_attr to apply constraint"
49 )
---> 50 filtered_texts = self._check_constraint_many(
51 compatible_transformed_texts, reference_text
52 )
53 return list(filtered_texts) + incompatible_transformed_texts

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/constraints/semantics/sentence_encoders/sentence_encoder.py:179, in SentenceEncoder._check_constraint_many(self, transformed_texts, reference_text)
175 def _check_constraint_many(self, transformed_texts, reference_text):
176 """Filters the list transformed_texts so that the similarity
177 between the reference_text and the transformed text is greater than
178 the self.threshold."""
--> 179 scores = self._score_list(reference_text, transformed_texts)
181 for i, transformed_text in enumerate(transformed_texts):
182 # Optionally ignore similarity score for sentences shorter than the
183 # window size.
184 if (
185 self.skip_text_shorter_than_window
186 and len(transformed_text.words) < self.window_size
187 ):

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/constraints/semantics/sentence_encoders/sentence_encoder.py:152, in SentenceEncoder._score_list(self, starting_text, transformed_texts)
142 starting_text_windows.append(
143 starting_text.text_window_around_index(
144 modified_index, self.window_size
145 )
146 )
147 transformed_text_windows.append(
148 transformed_text.text_window_around_index(
149 modified_index, self.window_size
150 )
151 )
--> 152 embeddings = self.encode(starting_text_windows + transformed_text_windows)
153 if not isinstance(embeddings, torch.Tensor):
154 embeddings = torch.tensor(embeddings)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/textattack/constraints/semantics/sentence_encoders/universal_sentence_encoder/universal_sentence_encoder.py:30, in UniversalSentenceEncoder.encode(self, sentences)
28 def encode(self, sentences):
29 if not self.model:
---> 30 self.model = hub.load(self._tfhub_url)
31 encoding = self.model(sentences)
33 if isinstance(encoding, dict):

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/module_v2.py:100, in load(handle, tags, options)
98 if not isinstance(handle, str):
99 raise ValueError("Expected a string, got %s" % handle)
--> 100 module_path = resolve(handle)
101 is_hub_module_v1 = tf.io.gfile.exists(_get_module_proto_path(module_path))
102 if tags is None and is_hub_module_v1:

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/module_v2.py:55, in resolve(handle)
31 def resolve(handle):
32 """Resolves a module handle into a path.
33
34 This function works both for plain TF2 SavedModels and the legacy TF1 Hub
(...)
53 A string representing the Module path.
54 """
---> 55 return registry.resolver(handle)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/registry.py:49, in MultiImplRegister.__call__(self, *args, **kwargs)
47 for impl in reversed(self._impls):
48 if impl.is_supported(*args, **kwargs):
---> 49 return impl(*args, **kwargs)
50 else:
51 fails.append(type(impl).__name__)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/compressed_module_resolver.py:81, in HttpCompressedFileResolver.__call__(self, handle)
77 response = self._call_urlopen(request)
78 return resolver.DownloadManager(handle).download_and_uncompress(
79 response, tmp_dir)
---> 81 return resolver.atomic_download(handle, download, module_dir,
82 self._lock_file_timeout_sec())

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/resolver.py:421, in atomic_download(handle, download_fn, module_dir, lock_file_timeout_sec)
419 logging.info("Downloading TF-Hub Module '%s'.", handle)
420 tf.compat.v1.gfile.MakeDirs(tmp_dir)
--> 421 download_fn(handle, tmp_dir)
422 # Write module descriptor to capture information about which module was
423 # downloaded by whom and when. The file stored at the same level as a
424 # directory in order to keep the content of the 'model_dir' exactly as it
(...)
429 # module caching protocol and no code in the TF-Hub library reads its
430 # content.
431 _write_module_descriptor_file(handle, module_dir)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/compressed_module_resolver.py:77, in HttpCompressedFileResolver.__call__.<locals>.download(handle, tmp_dir)
71 return resolver.DownloadManager(handle).download_and_uncompress(
72 response, tmp_dir
73 )
75 request = urllib.request.Request(
76 self._append_compressed_format_query(handle))
---> 77 response = self._call_urlopen(request)
78 return resolver.DownloadManager(handle).download_and_uncompress(
79 response, tmp_dir)

File ~/miniconda3/envs/promptbench/lib/python3.9/site-packages/tensorflow_hub/resolver.py:528, in HttpResolverBase._call_urlopen(self, request)
526 return urllib.request.urlopen(request)
527 else:
--> 528 return urllib.request.urlopen(request, context=self._context)

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:214, in urlopen(url, data, timeout, cafile, capath, cadefault, context)
212 else:
213 opener = _opener
--> 214 return opener.open(url, data, timeout)

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:517, in OpenerDirector.open(self, fullurl, data, timeout)
514 req = meth(req)
516 sys.audit('urllib.Request', req.full_url, req.data, req.headers, req.get_method())
--> 517 response = self._open(req, data)
519 # post-process response
520 meth_name = protocol+"_response"

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:534, in OpenerDirector._open(self, req, data)
531 return result
533 protocol = req.type
--> 534 result = self._call_chain(self.handle_open, protocol, protocol +
535 '_open', req)
536 if result:
537 return result

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:494, in OpenerDirector._call_chain(self, chain, kind, meth_name, *args)
492 for handler in handlers:
493 func = getattr(handler, meth_name)
--> 494 result = func(*args)
495 if result is not None:
496 return result

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:1389, in HTTPSHandler.https_open(self, req)
1388 def https_open(self, req):
-> 1389 return self.do_open(http.client.HTTPSConnection, req,
1390 context=self._context, check_hostname=self._check_hostname)

File ~/miniconda3/envs/promptbench/lib/python3.9/urllib/request.py:1349, in AbstractHTTPHandler.do_open(self, http_class, req, **http_conn_args)
1346 h.request(req.get_method(), req.selector, req.data, headers,
1347 encode_chunked=req.has_header('Transfer-encoding'))
1348 except OSError as err: # timeout error
-> 1349 raise URLError(err)
1350 r = h.getresponse()
1351 except:

URLError: <urlopen error [SSL: CERTIFICATE_VERIFY_FAILED] certificate verify failed: self-signed certificate in certificate chain (_ssl.c:1133)>

ChatGPT version

Could you please confirm the exact ChatGPT version you used in the paper? Thanks.

About Runtime

Hi! Thanks for your awesome work! When I run the command python main.py --model llama-13b --dataset mnli --attack bertattack --shot 0 --generate_len 20 on one NVIDIA 3090 (24 GB), it has already taken 20 hours and I don't know when it will finish. The log looks like:

2023-07-26 08:31:41,068 - INFO - gt: 2
2023-07-26 08:31:41,068 - INFO - Pred: -1
2023-07-26 08:31:41,068 - INFO - sentence: Assess the connection between the following sentences and count it as 'entailment', 'neutral', or 'contradiction':Premise: He hadn't seen even pictures of such things since the few silent movies run in some of the little art theaters. Hypothesis: He had recently seen pictures depicting those things. Answer: 

but the results folder only contains a file named mnli, which is also empty. Could you please give a rough time estimate based on your machines? Thanks a lot!

About Checklist attack method

Hi,

I was wondering why your implementation of the CheckList attack method differs from the one in the TextAttack package. Thank you for your time!

Numerous datasets exhibit a label of -1

Hello,

Attempting to load multiple datasets, I observed that several of them have all their elements labeled as -1.

For example:

import promptbench as pb
from tqdm import tqdm
dataset = pb.DatasetLoader.load_dataset("sst2")

print([e for e in tqdm(dataset) if e["label"] != -1])

Results: []
I'm using promptbench version 0.0.2.

Thanks!

Access to per-sample evaluation results

Hi,
Thanks for the great work! For my current project, I am looking to use the sample-wise evaluation results of VLMs for the experiments you have conducted.

If you could provide me with the sample-wise evaluation logs on the multi-modal datasets mentioned (VQAv2, NoCaps, MMMU, MathVista, AI2D, ChartQA, ScienceQA) for the models evaluated (BLIP2, LLaVA, Qwen-VL, Qwen-VL-Chat, InternLM-XComposer2-VL, GPT-4v, Gemini Pro Vision, Qwen-VL-Max, Qwen-VL-Plus), I would greatly appreciate it. If I missed a dataset or model, please feel free to include them.

Running time

Hello, I was running the command below. It has not finished after a couple of hours. Do you have an estimate of how long it takes?

python main.py --model google/flan-t5-large --dataset mnli --attack textfooler --shot 0 --generate_len 20

Compatibility Request: Upgrade openai Dependency to Support >=1.10.0

Environment

  • promptbench version: 0.0.2
  • langchain-openai version: 0.1.1
  • openai version requested: 1.14.3
  • Python version: [Your Python version]
  • Operating System: [Your OS]

Description

I am currently working on a project that relies on promptbench along with langchain-openai. However, I've encountered a dependency conflict with the openai package. promptbench requires openai version 1.3.7, while langchain-openai is compatible with versions before 2.0.0 and at least 1.10.0. For my use case, I need features that are available in openai version 1.10.0 or later.

Issue

The current version constraint of promptbench for the openai package prevents me from using the newer features of the openai library that are necessary for my project. This version conflict is causing installation and compatibility issues in my development environment.

Request

Would it be possible to update the promptbench dependency on the openai library to be compatible with version 1.10.0 or later? This update would greatly benefit users who need the newer features from the openai library.

Can you share a prompt to test LLaMA with MMLU data?

I tried this prompt but it didn't work well:

The following are multiple choice questions (with answers) about  abstract algebra.
========example====== 
In order to make the title of this discourse generally intelligible, I have translated the term "Protoplasm," which is the scientific name of the substance of which I am about to speak, by the words "the physical basis of life." I suppose that, to many, the idea that there is such a thing as a physical basis, or matter, of life may be novel—so widely spread is the conception of life as something which works through matter. … Thus the matter of life, so far as we know it (and we have no right to speculate on any other), breaks up, in consequence of that continual death which is the condition of its manifesting vitality, into carbonic acid, water, and nitrogenous compounds, which certainly possess no properties but those of ordinary matter.
Thomas Henry Huxley, "The Physical Basis of Life," 1868
From the passage, one may infer that Huxley argued that "life" was
----here are the choices,you need to select one answer,return the order number----:
['a force that works through matter', 'essentially a philosophical notion', 'merely a property of a certain kind of matter', 'a supernatural phenomenon']
Answer:2
========example====== 
Read the the following quotation to answer questions.
The various modes of worship which prevailed in the Roman world were all considered by the people as equally true; by the philosopher as equally false; and by the magistrate as equally useful.
Edward Gibbon, The Decline and Fall of the Roman Empire, 1776–1788
Gibbon's interpretation of the state of religious worship in ancient Rome could be summarized as
----here are the choices,you need to select one answer,return the order number----:
["In ancient Rome, religious worship was decentralized and tended to vary with one's social position.", 'In ancient Rome, religious worship was the source of much social tension and turmoil.', 'In ancient Rome, religious worship was homogeneous and highly centralized.', 'In ancient Rome, religious worship was revolutionized by the introduction of Christianity.']
Answer:0
========example====== 
The following quote is from Voltaire in response to the 1755 Lisbon earthquake.
My dear sir, nature is very cruel. One would find it hard to imagine how the laws of movement cause such frightful disasters in the best of possible worlds. A hundred thousand ants, our fellows, crushed all at once in our ant-hill, and half of them perishing, no doubt in unspeakable agony, beneath the wreckage from which they cannot be drawn. Families ruined all over Europe, the fortune of a hundred businessmen, your compatriots, swallowed up in the ruins of Lisbon. What a wretched gamble is the game of human life! What will the preachers say, especially if the palace of the Inquisition is still standing? I flatter myself that at least the reverend father inquisitors have been crushed like others. That ought to teach men not to persecute each other, for while a few holy scoundrels burn a few fanatics, the earth swallows up one and all.
—Voltaire, in a letter, 1755
The ideas expressed by Voltaire, above, best illustrate which of the following characteristics of Enlightenment intellectuals?
----here are the choices,you need to select one answer,return the order number----:
['Many were accomplished scientists, who added important pieces to human understanding of the universe.', 'They utilized new methods of communicating their ideas, such as salons and inexpensive printed pamphlets.', 'Most rejected religion altogether and adopted atheism as the only credo of a rational man.', 'Many believed that the new scientific discoveries justified a more tolerant and objective approach to social and cultural issues.']
Answer:3
==========answer this question below===========
{question}
----here are the choices,you need to select one answer,return the order number----:
{choices}
Answer:

Then I followed the demo code @jindongwang @madhavMathur (a small answer-parsing sketch follows the snippet):

dataset = pb.DatasetLoader.load_dataset("mmlu")
...

for prompt in prompts:
    preds = []
    labels = []
    for data in tqdm(dataset[5:]):
        # process input
        input_text = pb.InputProcess.basic_format(prompt, data)
        label = data['answer']
        raw_pred = model(input_text)
        print(raw_pred)
        break
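A simple post-processing sketch (plain Python, not a promptbench utility) for turning the raw answer into a choice index when the prompt asks the model to return the order number:

import re

def parse_choice(raw_pred: str) -> int:
    # Take the first digit in the model output as the chosen option index; -1 if none is found.
    match = re.search(r"\d", raw_pred)
    return int(match.group()) if match else -1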

LLaMa 2 inference

I tried to benchmark the LLaMA 2 chat model from Hugging Face and got a ValueError:
ValueError: temperature(=0) has to be a strictly positive float, otherwise your next token scores will be invalid.
It is caused by this line

Potentially a bug in `LabelConstraint`

Hello,
Firstly, thanks for your contribution and for open-sourcing your code!

I got the following error trying to reproduce an attack for the SST2 dataset:

  File "<PATH>/promptbench/prompt_attack/label_constraint.py", line 12, in <listcomp>
    self.labels = [label.lower() for label in labels]
AttributeError: 'int' object has no attribute 'lower'

Which makes sense, since in the config the labels include integers:

    'sst2': ['positive', 'negative', 'positive\'', 'negative\'', '0', '1', '0\'', '1\'',0, 1], 

But in the LabelConstraint constructor we have:

    def __init__(self, labels=[]):
        self.labels = [label.lower() for label in labels]

The changes were made in this PR.

To make it run, we can simply delete the integer labels 0 and 1, though I'm afraid the ground-truth labels might be integers.
So another solution could be to apply .lower() only to string labels (which sounds like the correct approach to me), as sketched below.
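A minimal sketch of that second option (an illustration of the proposed change, not the current library code):

    def __init__(self, labels=[]):
        # Lowercase only string labels; keep integer labels untouched so that
        # ground-truth labels expressed as ints (e.g. 0, 1) still match.
        self.labels = [label.lower() if isinstance(label, str) else label for label in labels]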

What do you guys think?
