
open-eqa's Introduction

OpenEQA: Embodied Question Answering in the Era of Foundation Models

[paper] [project] [dataset] [bibtex]

[teaser video: open-eqa-teaser.mp4]

Abstract

We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA – the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.

Dataset

The OpenEQA dataset consists of 1600+ question-answer pairs $(Q,A^*)$ and corresponding episode histories $H$.

The question-answer pairs are available in data/open-eqa-v0.json and the episode histories can be downloaded by following the instructions here.

Preview: A simple tool to view samples in the dataset is provided here.
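
For a quick look at the data, the snippet below (a minimal sketch, not part of the repository) loads data/open-eqa-v0.json and prints its size and the fields of the first entry; the only assumption is that the file is a JSON list of dictionaries.

# Sketch: inspect the question-answer pairs (assumes a JSON list of dictionaries).
import json

with open("data/open-eqa-v0.json") as f:
    dataset = json.load(f)

print(f"{len(dataset)} question-answer pairs")
print("fields per entry:", sorted(dataset[0].keys()))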

Baselines and Automatic Evaluation

Installation

The code requires a python>=3.9 environment. We recommend using conda:

conda create -n openeqa python=3.9
conda activate openeqa
pip install -r requirements.txt
pip install -e .

Running baselines

Several baselines are implemented in openeqa/baselines. In general, baselines are run as follows:

# set an environment variable to your personal API key for the baseline
python openeqa/baselines/<baseline>.py --dry-run  # remove --dry-run to process the full benchmark

See openeqa/baselines/README.md for more details.

Running evaluations

Automatic evaluation is implemented with GPT-4 using the prompts found here and here.

# set the OPENAI_API_KEY environment variable to your personal API key
python evaluate-predictions.py <path/to/results/file.json> --dry-run  # remove --dry-run to evaluate on the full benchmark
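
Before launching a full evaluation run, a small pre-flight check can catch problems early. The sketch below is not part of the repository and makes no assumption about the schema of the results file; it only verifies that OPENAI_API_KEY is set and that the file parses as JSON.

# Sketch: pre-flight check before running evaluate-predictions.py.
import json
import os
import sys

results_path = sys.argv[1]  # path/to/results/file.json

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
with open(results_path) as f:
    results = json.load(f)
print(f"loaded {len(results)} predictions from {results_path}")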

License

OpenEQA is released under the MIT License.

Contributors

Arjun Majumdar*, Anurag Ajay*, Xiaohan Zhang*, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran

Citing OpenEQA

@inproceedings{majumdar2023openeqa,
  author={Arjun Majumdar and Anurag Ajay and Xiaohan Zhang and Pranav Putta and Sriram Yenamandra and Mikael Henaff and Sneha Silwal and Paul Mcvay and Oleksandr Maksymets and Sergio Arnaud and Karmesh Yadav and Qiyang Li and Ben Newman and Mohit Sharma and Vincent Berges and Shiqi Zhang and Pulkit Agrawal and Yonatan Bisk and Dhruv Batra and Mrinal Kalakrishnan and Franziska Meier and Chris Paxton and Sasha Sax and Aravind Rajeswaran},
  title={{OpenEQA: Embodied Question Answering in the Era of Foundation Models}},
  booktitle={{CVPR}},
  year={2024},
}

open-eqa's Issues

Request for baseline code for GPT-4+LLaVA

Hi, thanks for the awesome work!

I saw that Table 2 of the paper includes an important baseline, namely "Socratic LLMs w/ Frame Captions (GPT-4 w/ LLaVA-1.5)". I am wondering whether you could consider providing the implementation for this baseline, especially the LLaVA captions and the GPT-4 prompts.

Any help would be highly appreciated!

Wrong path specified during frames extraction

Hey, I noticed the following issue when I tried to run python data/hm3d/extract-frames.py:
[screenshot: Screenshot 2024-06-06 at 14:29:57]

In the folder data/frames/hm3d-v0/021-hm3d-LT9Jq6dN3Ea, the file that should specify the scene (00000.pkl) has the attribute scene_id: datasets01/hm3d/090121/val/00862-LT93q63E/LT9Jq60N3Ea.basis.glb instead of the correct data/scene_datasets/hm3d/val/00862-LT93q63E/LT9Jq60N3Ea.basis.glb. This may affect other files too; I haven't checked.
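
A quick way to check whether other frames are affected is to print the stored scene path for every pickle in the folder. The snippet below is only a sketch; whether the pickle holds a dictionary with a scene_id key or an object with a scene_id attribute is an assumption, so both cases are tried.

# Sketch: list the scene path stored in each frame pickle (structure assumed).
import glob
import pickle

for path in sorted(glob.glob("data/frames/hm3d-v0/021-hm3d-LT9Jq6dN3Ea/*.pkl")):
    with open(path, "rb") as f:
        data = pickle.load(f)
    # handle either a dict entry or an object attribute, depending on the actual format
    scene_id = data.get("scene_id") if isinstance(data, dict) else getattr(data, "scene_id", None)
    print(path, scene_id)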

Potentially incorrect camera pose in HM3D

Hello,

I am trying to project the HM3D part of the dataset into a 3D point cloud for visualization using the extracted RGB frames, depth, poses, and intrinsics. The code I used generates a normal-looking visualization on the ScanNet part, while the result looks rather weird on the HM3D part. I wonder whether there might be an error in the HM3D camera poses?

The correct render result on ScanNet: [image]

The weird render result on HM3D (I am aware the axis directions in HM3D are different, but the point cloud still looks wrong even ignoring that): [image]

Here is the code used to generate the 3D point cloud:

import random

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from tqdm import tqdm

pixel_sample_ratio = 0.05  # fraction of pixels back-projected per frame

def load_matrix_from_txt(path, shape=(4, 4)):
    # parse a whitespace-separated matrix from a text file
    with open(path) as f:
        values = [float(v) for v in f.read().split()]
    return np.array(values).reshape(shape)

def load_image(path):
    return np.array(Image.open(path))

def convert_from_uvd(u, v, d, intr, pose):
    # back-project pixel (u, v) with depth d into world coordinates
    if d == 0:
        return None, None, None

    fx, fy = intr[0, 0], intr[1, 1]
    cx, cy = intr[0, 2], intr[1, 2]
    depth_scale = 6553.5

    z = d / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    world = pose @ np.array([x, y, z, 1])
    return world[:3] / world[3]

def plot_3d(xdata, ydata, zdata, color=None, b_min=0, b_max=8, view=(45, 45)):
    fig, ax = plt.subplots(subplot_kw={"projection": "3d"}, dpi=200)
    ax.view_init(view[0], view[1])

    ax.set_xlim(b_min, b_max)
    ax.set_ylim(b_min, b_max)
    ax.set_zlim(b_min, b_max)

    # explicit RGB colors are passed, so no colormap is needed
    ax.scatter3D(xdata, ydata, zdata, c=color, s=0.1)

root = '/data/frames/hm3d-v0/084-hm3d-zt1RVoi7PcG'
intrinsic_depth = np.loadtxt(root + '/intrinsic_depth.txt')

x_data, y_data, z_data, c_data = [], [], [], []

for idx in tqdm(range(10)):
    rgb_image_path = root + '/{:05d}-rgb.png'.format(idx)
    depth_image_path = root + '/{:05d}-depth.png'.format(idx)

    p = load_matrix_from_txt(root + '/{:05d}.txt'.format(idx))  # camera pose
    c = load_image(rgb_image_path)
    d = load_image(depth_image_path)

    # randomly subsample pixels and back-project them
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            if random.random() < pixel_sample_ratio:
                x, y, z = convert_from_uvd(j, i, d[i, j], intrinsic_depth, p)
                if x is None:
                    continue

                x_data.append(x)
                y_data.append(y)
                z_data.append(z)

                # look up the color at the corresponding RGB pixel
                ci = int(i * c.shape[0] / d.shape[0])
                cj = int(j * c.shape[1] / d.shape[1])
                c_data.append(c[ci, cj] / 255.0)

plot_3d(x_data, y_data, z_data, color=c_data)
plt.show()
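
One possible culprit worth checking (an assumption on my part, not a confirmed fix): Habitat-style poses often use the OpenGL camera convention (looking down -z with +y up), while the pinhole back-projection above uses the usual computer-vision convention (+z forward, +y down). If the HM3D poses follow the OpenGL convention, flipping the y and z camera axes before applying the pose may give a consistent point cloud:

# Hedged sketch: back-projection with a y/z axis flip, in case the HM3D poses
# are camera-to-world matrices in the OpenGL/Habitat convention.
import numpy as np

CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])  # flip the y and z camera axes

def convert_from_uvd_gl(u, v, d, intr, pose, depth_scale=6553.5):
    """Same back-projection as convert_from_uvd above, with an axis flip."""
    if d == 0:
        return None, None, None
    fx, fy = intr[0, 0], intr[1, 1]
    cx, cy = intr[0, 2], intr[1, 2]
    z = d / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    world = pose @ CV_TO_GL @ np.array([x, y, z, 1.0])
    return world[:3] / world[3]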

Prompt for LLM + Scene Graph

Congratulations on the awesome work! Could you also share your prompt for the LLM when using a scene graph as reference information? Thanks!

Baselines for open source models

Hello!
Thank you for open-sourcing the code base. I was wondering whether it would be possible to release the code for open-source models such as LLaMA and LLaVA. Unfortunately, I do not have access to the Gemini/Claude/GPT APIs, so it would be great if I could run the open-source baselines.
Kalyani

Active EQA

Thank you so much for open-sourcing the project. I was very interested in the Active EQA task in your paper, but I don't see an Active EQA baseline in this repository. Would it be possible to provide the Active EQA-related code?
