yiyangzhou / LURE
[ICLR 2024] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Hi Yiyang,
Thank you for your excellent work. I have a query regarding the datasets used in your study. In your paper, you mention using the LLaVA-150k dataset for training, and the COCO2014 training dataset for testing. However, the GitHub description suggests that the images for the test dataset can be directly downloaded from the COCO2014 training dataset link. Could you please clarify this discrepancy?
Hi, guys,
Thank you for providing the clean code for your amazing method of mitigating hallucination in VLMs. I am wondering whether you could also provide the weights of the revisor.
Thank you very much!
Hi! I have read your paper and it is really amazing work!
However, I encountered some issues during the Model Inference stage. I followed the steps below for model inference:
git clone https://github.com/YiyangZhou/LURE.git
cd LURE
conda env create -f environment.yml
conda activate LURE
python output_LURE.py --mode inference --cfg-path /eval_configs/minigpt4_eval.yaml --gpu-id 0 --input_image /path/to/image_file --output_file /output/output.jsonl
Note that for each image, I modified the prompt to a specific question. Here are some of the output results. A strange phenomenon occurred: for some images, only part of the sentence was output as the caption.
{"id": "64f51a50c1aff6f7.png", "question": "What color is the umbrella?", "caption": "The umbrella is purple, pink, and yellow.", "objs": ["umbrella", "yellow"], "plist": [0.393798828125], "p_all": {"The": [0.5888671875], "umbrella": [0.70556640625], "is": [0.404296875], "purple": [0.214111328125], ",": [], "pink": [0.1790771484375], "and": [0.568359375], "yellow": [0.393798828125], ".": []}, "model": "MiniGPT-4_13b"}
{"id": "01a4b22028ed8133.png", "question": "Describe the image in detail.", "caption": "The image is of a television with an old man in glasses and a", "objs": ["image", "television", "man", "glasses"], "plist": [0.374267578125, 0.384765625], "p_all": {"The": [0.61328125], "image": [0.6396484375], "is": [0.426025390625], "of": [0.27197265625], "a": [0.84814453125, 0.84814453125], "television": [0.374267578125], "with": [0.176513671875], "an": [0.1585693359375], "old": [0.19677734375], "man": [0.87646484375], "in": [0.412109375], "glasses": [0.384765625], "and": [0.58349609375]}, "model": "MiniGPT-4_13b"}
{"id": "856c000622d66bd9.png", "question": "Describe the image in detail.", "caption": "The image shows a row of shops on a", "objs": ["image", "row", "shops"], "plist": [0.10723876953125], "p_all": {"The": [0.4541015625], "image": [0.7734375], "shows": [0.56005859375], "a": [0.86865234375, 0.86865234375], "row": [0.10723876953125], "of": [1.0], "shops": [0.7548828125], "on": [0.2744140625]}, "model": "MiniGPT-4_13b"}
{"id": "50ba2afa85a784cb.png", "question": "In the image, which truck (left or right) has its door open?", "caption": "The red truck on the left has its door open.\n\n </Img>", "objs": ["truck", "left", "door", "/Img", ">"], "plist": [0.355712890625], "p_all": {"The": [0.406494140625], "red": [0.3154296875], "truck": [0.355712890625], "on": [0.57861328125], "the": [1.0], "left": [0.80078125], "has": [1.0], "its": [1.0], "door": [1.0], "open": [1.0], ".": [], "<": [], "/Img": [], ">": []}, "model": "MiniGPT-4_13b"}
{"id": "071af503f95aee59.png", "question": "How many oysters can you see in the photo?", "caption": "I can see 5 oysters on the plate", "objs": ["oysters", "plate"], "plist": [], "p_all": {"I": [0.368896484375], "can": [0.52734375], "see": [1.0], "5": [0.1383056640625], "oysters": [1.0], "on": [0.4609375], "the": [1.0], "plate": [1.0]}, "model": "MiniGPT-4_13b"}
python generate_IDK.py --input_file /output/output.jsonl --output_file /output/idk_caption_file.jsonl
The following are the corresponding outputs in JSON format for the 5 entries:
{"id": "64f51a50c1aff6f7.png", "question": "What color is the umbrella?", "caption": "The umbrella is purple, pink, [IDK].", "objs": ["umbrella", "yellow"], "plist": [0.393798828125], "p_all": {"The": [0.5888671875], "umbrella": [0.70556640625], "is": [0.404296875], "purple": [0.214111328125], ",": [], "pink": [0.1790771484375], "and": [0.568359375], "yellow": [0.393798828125], ".": []}, "model": "MiniGPT-4_13b"}
{"id": "01a4b22028ed8133.png", "question": "Describe the image in detail.", "caption": "The image is of [IDK] an old man [IDK] a", "objs": ["image", "television", "man", "glasses"], "plist": [0.374267578125, 0.384765625], "p_all": {"The": [0.61328125], "image": [0.6396484375], "is": [0.426025390625], "of": [0.27197265625], "a": [0.84814453125, 0.84814453125], "television": [0.374267578125], "with": [0.176513671875], "an": [0.1585693359375], "old": [0.19677734375], "man": [0.87646484375], "in": [0.412109375], "glasses": [0.384765625], "and": [0.58349609375]}, "model": "MiniGPT-4_13b"}
{"id": "856c000622d66bd9.png", "question": "Describe the image in detail.", "caption": "The image shows [IDK] shops on a", "objs": ["image", "row", "shops"], "plist": [0.10723876953125], "p_all": {"The": [0.4541015625], "image": [0.7734375], "shows": [0.56005859375], "a": [0.86865234375, 0.86865234375], "row": [0.10723876953125], "of": [1.0], "shops": [0.7548828125], "on": [0.2744140625]}, "model": "MiniGPT-4_13b"}
{"id": "50ba2afa85a784cb.png", "question": "In the image, which truck (left or right) has its door open?", "caption": "The [IDK] the left has its door open. </Img>", "objs": ["truck", "left", "door", "/Img", ">"], "plist": [0.355712890625], "p_all": {"The": [0.406494140625], "red": [0.3154296875], "truck": [0.355712890625], "on": [0.57861328125], "the": [1.0], "left": [0.80078125], "has": [1.0], "its": [1.0], "door": [1.0], "open": [1.0], ".": [], "<": [], "/Img": [], ">": []}, "model": "MiniGPT-4_13b"}
{"id": "071af503f95aee59.png", "question": "How many oysters can you see in the photo?", "caption": "I can see 5 oysters on [IDK]", "objs": ["oysters", "plate"], "plist": [], "p_all": {"I": [0.368896484375], "can": [0.52734375], "see": [1.0], "5": [0.1383056640625], "oysters": [1.0], "on": [0.4609375], "the": [1.0], "plate": [1.0]}, "model": "MiniGPT-4_13b"}
python output_LURE.py --mode rewrite --cfg-path /eval_configs/minigpt4_eval.yaml --gpu-id 0 --input_caption /output/idk_caption_file.jsonl --input_image /path/to/image_file --output_file /output/answer.jsonl
The final model output results are as follows. There are even more errors in these captions: for example, some outputs contain only a \u200b (zero-width space), and others are responses like "I don't see the image", etc.
{"id": "64f51a50c1aff6f7.png", "question": "According to the picture, remove the information that does not exist in the following description: The umbrella is purple, pink, [IDK].", "caption": "Here is the corrected image description:", "model": "LURE"}
{"id": "01a4b22028ed8133.png", "question": "According to the picture, remove the information that does not exist in the following description: The image is of [IDK] an old man [IDK] a", "caption": "\u200b", "model": "LURE"}
{"id": "856c000622d66bd9.png", "question": "According to the picture, remove the information that does not exist in the following description: The image shows [IDK] shops on a", "caption": "autumn day, which is well kept with the colours of autumn leaves", "model": "LURE"}
{"id": "50ba2afa85a784cb.png", "question": "According to the picture, remove the information that does not exist in the following description: The [IDK] the left has its door open. </Img>", "caption": "I'm sorry, I don't see the image you provided. Can you please try again or provide more context?", "model": "LURE"}
{"id": "071af503f95aee59.png", "question": "According to the picture, remove the information that does not exist in the following description: I can see 5 oysters on [IDK]", "caption": "What type of food is on the plate?", "model": "LURE"}
So, which step in my model inference process went wrong, leading to the incorrect output?
Hi! I read the paper and found the work to be quite interesting!
I saw that you need the inference data to be in the following format:
{"id": "image_path", "answer": "caption of LLVM", "p_all": {"word1": [probs, ...], "word2": [probs,...], ...}, "objs": ["obj1", "obj2", ...]}
Can you share the object detection code you used for LVLMs other than MiniGPT-4?
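In the meantime I am approximating the objs field myself by taking the nouns of the caption with spaCy (my own stopgap, not the authors' method; the noun-only filter is an assumption):

# Hypothetical stopgap: take the nouns of a caption as candidate objects.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_objects(caption: str) -> list:
    # Keep every token tagged as a noun; these become the "objs" entries.
    return [tok.text for tok in nlp(caption) if tok.pos_ == "NOUN"]

print(extract_objects("The image shows a row of shops on a street."))
# ['image', 'row', 'shops', 'street']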
hi Zhou and the team, thanks for bringing us LURE!
In the paper, LURE uses CHAIR-s and CHAIR-i as the main metrics to evaluate hallucinations. I'm interested in calculating CHAIR on my own outputs, but the repo does not seem to host the code.
Could you please share the code you used for evaluating CHAIR? Thanks a lot!
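For anyone else who needs this in the meantime, the metric itself is simple to compute once you have, per caption, the objects it mentions and the image's ground-truth objects. A rough sketch under those assumptions, not the official evaluation code:

def chair(results):
    # results: one (mentioned_objects, ground_truth_objects) pair per caption.
    total_mentions = hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, gt in results:
        gt = set(gt)
        halluc = [obj for obj in mentioned if obj not in gt]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(halluc)
        if halluc:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)  # per-object rate
    chair_s = hallucinated_captions / max(len(results), 1)    # per-caption rate
    return chair_i, chair_s

# Example: one caption mentions a hallucinated "dog".
print(chair([(["umbrella", "dog"], ["umbrella", "person"])]))  # (0.5, 1.0)

Note that the standard CHAIR evaluation also maps caption words onto the 80 MSCOCO categories through a synonym list, which this sketch omits.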
To output probabilities, we modify the generation/utils.py file in the Transformers library to generate probabilities for each token.
Hi authors, I would like to know how you changed transformers' generation/utils.py to get the output probabilities. Could you release the implementation?
Thanks!
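In the meantime, newer versions of the Transformers library can expose per-token probabilities at generation time without patching generation/utils.py, via output_scores and compute_transition_scores. A rough sketch, using gpt2 as a stand-in model; the authors' modification may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The umbrella is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8,
                     return_dict_in_generate=True, output_scores=True)

# compute_transition_scores returns the log-probability of each generated token.
scores = model.compute_transition_scores(out.sequences, out.scores,
                                         normalize_logits=True)
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
p_all = {tok.decode(t).strip(): [float(torch.exp(s))]
         for t, s in zip(gen_tokens, scores[0])}
print(p_all)  # roughly the {"word": [prob], ...} shape that LURE expects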
I have read your paper and it is truly wonderful work! However, I have a question about the revisor. Is the revisor a MiniGPT-4 fine-tuned on the constructed training data described in your paper? If so, it seems the revisor does not use any visual information as input when it works. Why not just use a normal LLM?
Hi Yiyang,
Amazing work! As mentioned in your paper, to train the hallucination revisor, you randomly selected 5,000 image-text pairs from LLaVA-150k to construct the hallucination dataset, then fine-tuned an LVLM on it and used it as the revisor. I am wondering whether it would be possible for you to share this constructed hallucination dataset.
Hi, great work!
Thanks for releasing the checkpoints. However, I ran into problems reproducing the LURE corrections of MiniGPT-4's hallucinated generations using generate_IDK.py and output_LURE.py. The LURE outputs are quite bad :(
Could you please provide samples of caption_file.jsonl, idk_caption_file.jsonl, and output.jsonl for reference?
btw, check your code again lol (there are typos).
Hi, Yiyang,
Thanks for your great work. The checkpoint you provide in the repository is based on MiniGPT-4 7B ("The ckpt we trained based on MiniGPT-4 7B as a baseline is available at Hugging Face"), but in the paper you said you used MiniGPT-4 (Vicuna 13B). Which version of MiniGPT-4 should I download?
Best regards,
Wenbin