yiyangzhou / LURE
[ICLR 2024] Analyzing and Mitigating Object Hallucination in Large Vision-Language Models
Hi Yiyang,
Thank you for your excellent work. I have a query regarding the datasets used in your study. In your paper, you mention using the LLaVA-150k dataset for training, and the COCO2014 training dataset for testing. However, the GitHub description suggests that the images for the test dataset can be directly downloaded from the COCO2014 training dataset link. Could you please clarify this discrepancy?
Hi, guys,
Thank you for providing the clean code for your amazing method of mitigating hallucination in VLMs. I am wondering whether you could also provide the weights of the revisor.
Thank you very much!
Hi! I have read your paper and it is really amazing work!
However, I encountered some issues during the Model Inference stage. I followed the steps below for model inference:
git clone https://github.com/YiyangZhou/LURE.git
cd LURE
conda env create -f environment.yml
conda activate LURE
python output_LURE.py --mode inference --cfg-path /eval_configs/minigpt4_eval.yaml --gpu-id 0 --input_image /path/to/image_file --output_file /output/output.jsonl
Note that for each image, I modified the prompt to a specific question. Here are some of the output results. A strange phenomenon occurred: for some images, only part of the sentence was output as the caption.
{"id": "64f51a50c1aff6f7.png", "question": "What color is the umbrella?", "caption": "The umbrella is purple, pink, and yellow.", "objs": ["umbrella", "yellow"], "plist": [0.393798828125], "p_all": {"The": [0.5888671875], "umbrella": [0.70556640625], "is": [0.404296875], "purple": [0.214111328125], ",": [], "pink": [0.1790771484375], "and": [0.568359375], "yellow": [0.393798828125], ".": []}, "model": "MiniGPT-4_13b"}
{"id": "01a4b22028ed8133.png", "question": "Describe the image in detail.", "caption": "The image is of a television with an old man in glasses and a", "objs": ["image", "television", "man", "glasses"], "plist": [0.374267578125, 0.384765625], "p_all": {"The": [0.61328125], "image": [0.6396484375], "is": [0.426025390625], "of": [0.27197265625], "a": [0.84814453125, 0.84814453125], "television": [0.374267578125], "with": [0.176513671875], "an": [0.1585693359375], "old": [0.19677734375], "man": [0.87646484375], "in": [0.412109375], "glasses": [0.384765625], "and": [0.58349609375]}, "model": "MiniGPT-4_13b"}
{"id": "856c000622d66bd9.png", "question": "Describe the image in detail.", "caption": "The image shows a row of shops on a", "objs": ["image", "row", "shops"], "plist": [0.10723876953125], "p_all": {"The": [0.4541015625], "image": [0.7734375], "shows": [0.56005859375], "a": [0.86865234375, 0.86865234375], "row": [0.10723876953125], "of": [1.0], "shops": [0.7548828125], "on": [0.2744140625]}, "model": "MiniGPT-4_13b"}
{"id": "50ba2afa85a784cb.png", "question": "In the image, which truck (left or right) has its door open?", "caption": "The red truck on the left has its door open.\n\n </Img>", "objs": ["truck", "left", "door", "/Img", ">"], "plist": [0.355712890625], "p_all": {"The": [0.406494140625], "red": [0.3154296875], "truck": [0.355712890625], "on": [0.57861328125], "the": [1.0], "left": [0.80078125], "has": [1.0], "its": [1.0], "door": [1.0], "open": [1.0], ".": [], "<": [], "/Img": [], ">": []}, "model": "MiniGPT-4_13b"}
{"id": "071af503f95aee59.png", "question": "How many oysters can you see in the photo?", "caption": "I can see 5 oysters on the plate", "objs": ["oysters", "plate"], "plist": [], "p_all": {"I": [0.368896484375], "can": [0.52734375], "see": [1.0], "5": [0.1383056640625], "oysters": [1.0], "on": [0.4609375], "the": [1.0], "plate": [1.0]}, "model": "MiniGPT-4_13b"}
python generate_IDK.py --input_file /output/output.jsonl --output_file /output/idk_caption_file.jsonl
The following are the corresponding outputs in JSON format for the 5 entries:
{"id": "64f51a50c1aff6f7.png", "question": "What color is the umbrella?", "caption": "The umbrella is purple, pink, [IDK].", "objs": ["umbrella", "yellow"], "plist": [0.393798828125], "p_all": {"The": [0.5888671875], "umbrella": [0.70556640625], "is": [0.404296875], "purple": [0.214111328125], ",": [], "pink": [0.1790771484375], "and": [0.568359375], "yellow": [0.393798828125], ".": []}, "model": "MiniGPT-4_13b"}
{"id": "01a4b22028ed8133.png", "question": "Describe the image in detail.", "caption": "The image is of [IDK] an old man [IDK] a", "objs": ["image", "television", "man", "glasses"], "plist": [0.374267578125, 0.384765625], "p_all": {"The": [0.61328125], "image": [0.6396484375], "is": [0.426025390625], "of": [0.27197265625], "a": [0.84814453125, 0.84814453125], "television": [0.374267578125], "with": [0.176513671875], "an": [0.1585693359375], "old": [0.19677734375], "man": [0.87646484375], "in": [0.412109375], "glasses": [0.384765625], "and": [0.58349609375]}, "model": "MiniGPT-4_13b"}
{"id": "856c000622d66bd9.png", "question": "Describe the image in detail.", "caption": "The image shows [IDK] shops on a", "objs": ["image", "row", "shops"], "plist": [0.10723876953125], "p_all": {"The": [0.4541015625], "image": [0.7734375], "shows": [0.56005859375], "a": [0.86865234375, 0.86865234375], "row": [0.10723876953125], "of": [1.0], "shops": [0.7548828125], "on": [0.2744140625]}, "model": "MiniGPT-4_13b"}
{"id": "50ba2afa85a784cb.png", "question": "In the image, which truck (left or right) has its door open?", "caption": "The [IDK] the left has its door open. </Img>", "objs": ["truck", "left", "door", "/Img", ">"], "plist": [0.355712890625], "p_all": {"The": [0.406494140625], "red": [0.3154296875], "truck": [0.355712890625], "on": [0.57861328125], "the": [1.0], "left": [0.80078125], "has": [1.0], "its": [1.0], "door": [1.0], "open": [1.0], ".": [], "<": [], "/Img": [], ">": []}, "model": "MiniGPT-4_13b"}
{"id": "071af503f95aee59.png", "question": "How many oysters can you see in the photo?", "caption": "I can see 5 oysters on [IDK]", "objs": ["oysters", "plate"], "plist": [], "p_all": {"I": [0.368896484375], "can": [0.52734375], "see": [1.0], "5": [0.1383056640625], "oysters": [1.0], "on": [0.4609375], "the": [1.0], "plate": [1.0]}, "model": "MiniGPT-4_13b"}
python output_LURE.py --mode rewrite --cfg-path /eval_configs/minigpt4_eval.yaml --gpu-id 0 --input_caption /output/idk_caption_file.jsonl --input_image /path/to/image_file --output_file /output/answer.jsonl
The final model output results are as follows. There are even more errors in these captions: for example, some outputs contain only a \u200b (zero-width space), and others are responses like "I don't see the image", etc.
{"id": "64f51a50c1aff6f7.png", "question": "According to the picture, remove the information that does not exist in the following description: The umbrella is purple, pink, [IDK].", "caption": "Here is the corrected image description:", "model": "LURE"}
{"id": "01a4b22028ed8133.png", "question": "According to the picture, remove the information that does not exist in the following description: The image is of [IDK] an old man [IDK] a", "caption": "\u200b", "model": "LURE"}
{"id": "856c000622d66bd9.png", "question": "According to the picture, remove the information that does not exist in the following description: The image shows [IDK] shops on a", "caption": "autumn day, which is well kept with the colours of autumn leaves", "model": "LURE"}
{"id": "50ba2afa85a784cb.png", "question": "According to the picture, remove the information that does not exist in the following description: The [IDK] the left has its door open. </Img>", "caption": "I'm sorry, I don't see the image you provided. Can you please try again or provide more context?", "model": "LURE"}
{"id": "071af503f95aee59.png", "question": "According to the picture, remove the information that does not exist in the following description: I can see 5 oysters on [IDK]", "caption": "What type of food is on the plate?", "model": "LURE"}
So, which step in my model inference process went wrong, leading to the incorrect output?
Hi! I read the paper and found the work to be quite interesting!
I saw that you need the inference data to be in the following format:
{"id": "image_path", "answer": "caption of LLVM", "p_all": {"word1": [probs, ...], "word2": [probs,...], ...}, "objs": ["obj1", "obj2", ...]}
Can you share the object detection code you used for LVLMs other than MiniGPT-4?
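In the meantime I am approximating the objs field myself by taking the nouns of the caption with spaCy (my own stopgap, not the authors' method; the noun-only filter is an assumption):

# Hypothetical stopgap: take the nouns of a caption as candidate objects.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")

def extract_objects(caption: str) -> list:
    # Keep every token tagged as a noun; these become the "objs" entries.
    return [tok.text for tok in nlp(caption) if tok.pos_ == "NOUN"]

print(extract_objects("The image shows a row of shops on a street."))
# ['image', 'row', 'shops', 'street']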
hi Zhou and the team, thanks for bringing us LURE!
In the paper, LURE uses CHAIR-s and CHAIR-i as the main metrics to evaluate hallucinations. I'm interested in calculating CHAIR on my own outputs, but the repo does not seem to host the code.
Could you please share the code you used for evaluating CHAIR? Thanks a lot!
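For anyone else who needs this in the meantime, the metric itself is simple to compute once you have, per caption, the objects it mentions and the image's ground-truth objects. A rough sketch under those assumptions, not the official evaluation code:

def chair(results):
    # results: one (mentioned_objects, ground_truth_objects) pair per caption.
    total_mentions = hallucinated_mentions = 0
    hallucinated_captions = 0
    for mentioned, gt in results:
        gt = set(gt)
        halluc = [obj for obj in mentioned if obj not in gt]
        total_mentions += len(mentioned)
        hallucinated_mentions += len(halluc)
        if halluc:
            hallucinated_captions += 1
    chair_i = hallucinated_mentions / max(total_mentions, 1)  # per-object rate
    chair_s = hallucinated_captions / max(len(results), 1)    # per-caption rate
    return chair_i, chair_s

# Example: one caption mentions a hallucinated "dog".
print(chair([(["umbrella", "dog"], ["umbrella", "person"])]))  # (0.5, 1.0)

Note that the standard CHAIR evaluation also maps caption words onto the 80 MSCOCO categories through a synonym list, which this sketch omits.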
To output probabilities, we modify the generation/utils.py file in the Transformers library to generate probabilities for each token.
Hi authors, I would like to know how you changed transformers' generation/utils.py to get the output probabilities. Could you release the implementation?
Thanks!
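In the meantime, newer versions of the Transformers library can expose per-token probabilities at generation time without patching generation/utils.py, via output_scores and compute_transition_scores. A rough sketch, using gpt2 as a stand-in model; the authors' modification may differ:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("gpt2")            # stand-in model
model = AutoModelForCausalLM.from_pretrained("gpt2")

inputs = tok("The umbrella is", return_tensors="pt")
out = model.generate(**inputs, max_new_tokens=8,
                     return_dict_in_generate=True, output_scores=True)

# compute_transition_scores returns the log-probability of each generated token.
scores = model.compute_transition_scores(out.sequences, out.scores,
                                         normalize_logits=True)
gen_tokens = out.sequences[0, inputs["input_ids"].shape[1]:]
p_all = {tok.decode(t).strip(): [float(torch.exp(s))]
         for t, s in zip(gen_tokens, scores[0])}
print(p_all)  # roughly the {"word": [prob], ...} shape that LURE expects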
I have read your paper and it is truly wonderful work! However, I have a question about the revisor. Is the revisor a MiniGPT-4 fine-tuned on the constructed training data described in your paper? If so, it seems the revisor does not use any visual information as input when it works. Why not just use a normal LLM?
Hi Yiyang,
Amazing work! As mentioned in your paper, to train the hallucination revisor, you randomly selected 5,000 image-text pairs from LLaVA-150k to construct the hallucination dataset, then fine-tuned an LVLM on it and used it as the revisor. I am wondering whether it would be possible for you to share this constructed hallucination dataset.
Hi, great work!
Thanks for releasing the checkpoints. However, I ran into problems reproducing the LURE corrections of MiniGPT-4's hallucinated generations using generate_IDK.py and output_LURE.py. The LURE outputs are quite bad :(
Could you please provide samples of caption_file.jsonl, idk_caption_file.jsonl, and output.jsonl for reference?
btw, check your code again lol (there are typos).
Hi, Yiyang,
Thanks for your great work. The checkpoint you provide in the repository is based on MiniGPT-4 7B ("The ckpt we trained based on MiniGPT-4 7B as a baseline is available at Hugging Face"), but in the paper you said you used MiniGPT-4 (Vicuna 13B). Which version of MiniGPT-4 should I download?
Best regards,
Wenbin