
Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models

🤗 Dataset | 🏆 Leaderboard | 📖 arXiv | GitHub

⭐️ We welcome PRs adding new VLMs. Please do not hesitate to send them. We'll update the 🏆 Leaderboard with your favorite VLMs! ⭐️

1The University of Tokyo  2S-Lab, Nanyang Technological University  3Duke University  4University of Wisconsin-Madison  
5LY Corporation  6Tokyo University of Science 

News

  • 2024.05: We have created our 🏆 Leaderboard with the support of the Hugging Face team.
  • 2024.05: MM-UPD has been integrated into the LMMs-eval codebase.
  • 2024.05: We have refactored the structure of the codebase to evaluate both UPD and Standard problems at once.
  • 2024.03: The short version (4p.) of this paper has been accepted by ICLR 2024 R2-FM Workshop.

Introduction

This paper introduces a novel and significant challenge for Vision Language Models (VLMs), termed Unsolvable Problem Detection (UPD). UPD examines a VLM's ability to withhold answers when faced with unsolvable problems in the context of Visual Question Answering (VQA) tasks. UPD encompasses three distinct settings: Absent Answer Detection (AAD), Incompatible Answer Set Detection (IASD), and Incompatible Visual Question Detection (IVQD). To investigate the UPD problem in depth, we conduct extensive experiments, which indicate that most VLMs, including GPT-4V and LLaVA-Next-34B, struggle with our benchmarks to varying extents, highlighting significant room for improvement. To address UPD, we explore both training-free and training-based solutions, offering new insights into their effectiveness and limitations. We hope our insights, together with future efforts within the proposed UPD settings, will enhance the broader understanding and development of more practical and reliable VLMs.

Requirements

Installation

We mainly follow LLaVA's environment for the installation.
For our implementation, we use NVIDIA A100 GPUs with 80GB memory. We use a single GPU for VLM inference and two GPUs for instruction tuning.

conda create -n upd_en python=3.10 -y
conda activate upd_en
pip install --upgrade pip  # enable PEP 660 support
pip install -e .
pip install -e ".[train]"
pip install flash-attn --no-build-isolation --no-cache-dir
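As a quick sanity check that the editable install succeeded, you can try importing the package; the package name llava is an assumption based on the LLaVA environment this setup follows:

python -c "import llava; print('install ok')"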

Data

MM-UPD Bench (no manual download needed)

(Figure: overview of the UPD settings)

We provide all benchmarks via Hugging Face 🤗. All benchmarks are downloaded automatically when you run the scripts. Please refer to our Hugging Face Dataset for more details.

If you encounter a DatasetGenerationError, please delete the cached data from the previous version.
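A minimal sketch for clearing the cache, assuming the default Hugging Face cache location (adjust the path if you set HF_DATASETS_CACHE):

# removes all locally cached datasets; they are re-downloaded on the next run
rm -rf ~/.cache/huggingface/datasets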

Instruction tuning data (Optional)

We provide the instruction tuning data (.json) via this url.
Please download it, unzip it, and put it in ~/data.
As for the images for the instruction tuning data, we used the images from LLaVA's official instruction tuning. Please download the images from the constituent datasets (COCO, GQA, OCR-VQA, TextVQA, and Visual Genome; see the file structure below).

We also provide the checkpoints for instruction tuning via this url. Please download, unzip, and put them in ~/checkpoints.
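A hedged download sketch, where <DATA_URL> and <CKPT_URL> stand for the two links above and the archive names are assumptions:

mkdir -p ~/data ~/checkpoints
# instruction tuning data (.json) into ~/data
wget <DATA_URL> -O inst_tuning.zip && unzip inst_tuning.zip -d ~/data
# LoRA checkpoints into ~/checkpoints
wget <CKPT_URL> -O checkpoints.zip && unzip checkpoints.zip -d ~/checkpoints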

The overall file structure is as follows:

UPD
├── checkpoints
│   ├── llava-v1.6-vicuna-13b-task-lora
│   └── llava-v1.6-34b-task-lora
└── data
    └── inst_tuning
        ├── upd_tuning_data_20240303.json
        ├── coco
        │   └── train2017
        ├── gqa
        │   └── images
        ├── ocr_vqa
        │   └── images
        ├── textvqa
        │   └── train_images
        └── vg
            ├── VG_100K
            └── VG_100K_2

Note that setting up these files is not necessary if you just want to run inference with VLMs.

API KEY

We need to create API keys for OpenAI and Gemini. Once created, set them with the following commands:

export OPENAI_API_KEY='your-api-key-here'
export GEMINI_API_KEY='your-api-key-here'

For details on setting up API keys, please refer to the official OpenAI page.
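To persist the keys across shell sessions, you can also append the exports to your shell profile (a bash example; adapt the file for your shell):

# write the exports once, then reload the profile
echo "export OPENAI_API_KEY='your-api-key-here'" >> ~/.bashrc
echo "export GEMINI_API_KEY='your-api-key-here'" >> ~/.bashrc
source ~/.bashrc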

Quick Start

1. Inference of VLMs

We put each script in ~/scripts/inference/<VLM name>/<UPD>. For example, to run LLaVA-1.5 13B for the base setting, run the following command for each of the AAD and Standard scenarios:

base

bash scripts/inference/llava1.5/aad/base.sh

Running the above command automatically creates the result under output/aad/answers_upload/llava1.5/base/mmaad_base/llava1.5-13b_<time_stamp>.xlsx.
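The other UPD settings follow the same ~/scripts/inference/<VLM name>/<UPD> pattern. A hedged example for the remaining two settings (check scripts/inference/llava1.5/ for the actual file names):

bash scripts/inference/llava1.5/iasd/base.sh
bash scripts/inference/llava1.5/ivqd/base.sh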

2. Evaluation

We put each evaluation script in ~/scripts/evaluation/<UPD>.

For example, to evaluate the performance of LLaVA-1.5 13B for the base setting, run the following command:

bash scripts/evaluation/aad/eval_base.sh <RESULT_PATH>
  • <RESULT_PATH> is output/aad/answers_upload/llava1.5/base/mmaad_base/llava1.5-13b_<time_stamp>.xlsx in this example.

Running the above command automatically creates the evaluation results in the <RESULT_PATH> folder.
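Putting steps 1 and 2 together, here is a minimal end-to-end sketch for this example; the ls -t pipe simply picks the newest timestamped result file:

# run inference, then evaluate the most recent result file
bash scripts/inference/llava1.5/aad/base.sh
RESULT_PATH=$(ls -t output/aad/answers_upload/llava1.5/base/mmaad_base/*.xlsx | head -n 1)
bash scripts/evaluation/aad/eval_base.sh "$RESULT_PATH"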

3. Instruction Tuning

We put each script in ~/scripts/inst_tuning. For example, to tune LLaVA-1.6 34B, run the following command:

bash scripts/inst_tuning/llava1.6_34b_lora_tuning.sh

As of March 2024, LLaVA-1.6 has not yet released official LoRA tuning code. Therefore, please be aware that our instruction tuning code may differ from the official LLaVA implementation.

Model Results

We provide a Google Sheet with the detailed results for each scenario (Figs. 3, 4, 5, and 6 in the paper). You can access the sheet here and easily draw radar charts!

How to Add New VLMs

You can add your favorite VLMs in a very easy way!

  1. Create vlms/<your_vlm>/<your_vlm>_vqa_updbench.py, using the existing files as a reference.

  2. Create script files in scripts/inference/<your_vlm>, using the existing files as a reference.

After checking the performance, let's send a PR!
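For step 2, a hypothetical sketch of what scripts/inference/<your_vlm>/aad/base.sh might look like; the flag names are illustrative assumptions, not the repository's actual interface, so mirror an existing VLM's script for the real arguments:

#!/bin/bash
# run your inference entry point for the AAD base setting (flags are hypothetical)
python vlms/<your_vlm>/<your_vlm>_vqa_updbench.py \
    --upd-type aad \
    --output-dir output/aad/answers_upload/<your_vlm>/base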

Leaderboard Submission

We are now accepting submissions to our 🏆 Leaderboard. If you finish your evaluation with the evaluation script in scripts/evaluation and obtain a result file (_dual_detail_submission.json),
please submit your result file to the leaderboard.

Acknowledgement

We built on these codebases to create this repository.

We thank the Hugging Face team for their kind assistance in creating and promoting our Leaderboard.

Contact

If you have questions, please open an issue mentioning @AtsuMiyai or send an email to miyai[at]hal.t.u-tokyo.ac.jp

Ads

If you are interested in this work, please refer to our other projects.

Citation

If you find our work interesting or use our code/models, please consider citing:

@article{miyai2024upd,
  title={Unsolvable Problem Detection: Evaluating Trustworthiness of Vision Language Models},
  author={Miyai, Atsuyuki and Yang, Jingkang and Zhang, Jingyang and Ming, Yifei and Yu, Qing and Irie, Go and Li, Yixuan and Li, Hai and Liu, Ziwei and Aizawa, Kiyoharu},
  journal={arXiv preprint arXiv:2403.20331},
  year={2024}
}
