
open-eqa's Introduction

OpenEQA: Embodied Question Answering in the Era of Foundation Models

[paper] [project] [dataset] [bibtex]

[teaser video: open-eqa-teaser.mp4]

Abstract

We present a modern formulation of Embodied Question Answering (EQA) as the task of understanding an environment well enough to answer questions about it in natural language. An agent can achieve such an understanding by either drawing upon episodic memory, exemplified by agents on smart glasses, or by actively exploring the environment, as in the case of mobile robots. We accompany our formulation with OpenEQA – the first open-vocabulary benchmark dataset for EQA supporting both episodic memory and active exploration use cases. OpenEQA contains over 1600 high-quality human-generated questions drawn from over 180 real-world environments. In addition to the dataset, we also provide an automatic LLM-powered evaluation protocol that has excellent correlation with human judgement. Using this dataset and evaluation protocol, we evaluate several state-of-the-art foundation models, including GPT-4V, and find that they significantly lag behind human-level performance. Consequently, OpenEQA stands out as a straightforward, measurable, and practically relevant benchmark that poses a considerable challenge to the current generation of foundation models. We hope this inspires and stimulates future research at the intersection of Embodied AI, conversational agents, and world models.

Dataset

The OpenEQA dataset consists of 1600+ question-answer pairs $(Q,A^*)$ and corresponding episode histories $H$.

The question-answer pairs are available in data/open-eqa-v0.json and the episode histories can be downloaded by following the instructions here.

Preview: A simple tool to view samples in the dataset is provided here.
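
For a quick look at the data, the snippet below (a minimal sketch, not part of the repository) loads data/open-eqa-v0.json and prints its size and the fields of the first entry; the only assumption is that the file is a JSON list of dictionaries.

# Sketch: inspect the question-answer pairs (assumes a JSON list of dictionaries).
import json

with open("data/open-eqa-v0.json") as f:
    dataset = json.load(f)

print(f"{len(dataset)} question-answer pairs")
print("fields per entry:", sorted(dataset[0].keys()))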

Baselines and Automatic Evaluation

Installation

The code requires a python>=3.9 environment. We recommend using conda:

conda create -n openeqa python=3.9
conda activate openeqa
pip install -r requirements.txt
pip install -e .

Running baselines

Several baselines are implemented in openeqa/baselines. In general, baselines are run as follows:

# set an environment variable to your personal API key for the baseline
python openeqa/baselines/<baseline>.py --dry-run  # remove --dry-run to process the full benchmark

See openeqa/baselines/README.md for more details.

Running evaluations

Automatic evaluation is implemented with GPT-4 using the prompts found here and here.

# set the OPENAI_API_KEY environment variable to your personal API key
python evaluate-predictions.py <path/to/results/file.json> --dry-run  # remove --dry-run to evaluate on the full benchmark
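
Before launching a full evaluation run, a small pre-flight check can catch problems early. The sketch below is not part of the repository and makes no assumption about the schema of the results file; it only verifies that OPENAI_API_KEY is set and that the file parses as JSON.

# Sketch: pre-flight check before running evaluate-predictions.py.
import json
import os
import sys

results_path = sys.argv[1]  # path/to/results/file.json

assert os.environ.get("OPENAI_API_KEY"), "OPENAI_API_KEY is not set"
with open(results_path) as f:
    results = json.load(f)
print(f"loaded {len(results)} predictions from {results_path}")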

License

OpenEQA is released under the MIT License.

Contributors

Arjun Majumdar*, Anurag Ajay*, Xiaohan Zhang*, Pranav Putta, Sriram Yenamandra, Mikael Henaff, Sneha Silwal, Paul Mcvay, Oleksandr Maksymets, Sergio Arnaud, Karmesh Yadav, Qiyang Li, Ben Newman, Mohit Sharma, Vincent Berges, Shiqi Zhang, Pulkit Agrawal, Yonatan Bisk, Dhruv Batra, Mrinal Kalakrishnan, Franziska Meier, Chris Paxton, Sasha Sax, Aravind Rajeswaran

Citing OpenEQA

@inproceedings{majumdar2023openeqa,
  author={Arjun Majumdar and Anurag Ajay and Xiaohan Zhang and Pranav Putta and Sriram Yenamandra and Mikael Henaff and Sneha Silwal and Paul Mcvay and Oleksandr Maksymets and Sergio Arnaud and Karmesh Yadav and Qiyang Li and Ben Newman and Mohit Sharma and Vincent Berges and Shiqi Zhang and Pulkit Agrawal and Yonatan Bisk and Dhruv Batra and Mrinal Kalakrishnan and Franziska Meier and Chris Paxton and Sasha Sax and Aravind Rajeswaran},
  title={{OpenEQA: Embodied Question Answering in the Era of Foundation Models}},
  booktitle={{CVPR}},
  year={2024},
}

open-eqa's Issues

Request for baseline code for GPT-4+LLaVA

Hi, thanks for the awesome work!

I saw that Table 2 of the paper includes an important baseline, namely "Socratic LLMs w/ Frame Captions (GPT-4 w/ LLaVA-1.5)". I am wondering whether you could consider providing the implementation for this baseline, especially the LLaVA captions and the GPT-4 prompts.

Any help would be highly appreciated!

Wrong path specified during frames extraction

Hey, I noticed the following issue when I tried to run python data/hm3d/extract-frames.py:
[screenshot: Screenshot 2024-06-06 at 14:29:57]

In the folder data/frames/hm3d-v0/021-hm3d-LT9Jq6dN3Ea, the file that should specify the scene (00000.pkl) has the attribute scene_id: datasets01/hm3d/090121/val/00862-LT93q63E/LT9Jq60N3Ea.basis.glb instead of the correct data/scene_datasets/hm3d/val/00862-LT93q63E/LT9Jq60N3Ea.basis.glb. This may affect other files too; I haven't checked.
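
A quick way to check whether other frames are affected is to print the stored scene path for every pickle in the folder. The snippet below is only a sketch; whether the pickle holds a dictionary with a scene_id key or an object with a scene_id attribute is an assumption, so both cases are tried.

# Sketch: list the scene path stored in each frame pickle (structure assumed).
import glob
import pickle

for path in sorted(glob.glob("data/frames/hm3d-v0/021-hm3d-LT9Jq6dN3Ea/*.pkl")):
    with open(path, "rb") as f:
        data = pickle.load(f)
    # handle either a dict entry or an object attribute, depending on the actual format
    scene_id = data.get("scene_id") if isinstance(data, dict) else getattr(data, "scene_id", None)
    print(path, scene_id)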

Potentially incorrect camera pose in HM3D

Hello,

I am trying to project the HM3D part of the dataset into a 3D point cloud for visualization using the extracted RGB frames, depth, poses, and intrinsics. The code I used generates a normal-looking visualization on the ScanNet part, while the result looks rather weird on the HM3D part. I wonder whether there might be an error in the HM3D camera poses?

The correct render result on ScanNet: [image]

The weird render result on HM3D (I am aware the axis directions in HM3D are different, but the point cloud still looks wrong even ignoring that): [image]

Here is the code used to generate the 3D point cloud:

import random

import matplotlib.pyplot as plt
import numpy as np
from PIL import Image
from tqdm import tqdm

pixel_sample_ratio = 0.05  # fraction of pixels back-projected per frame

def load_matrix_from_txt(path, shape=(4, 4)):
    # parse a whitespace-separated matrix from a text file
    with open(path) as f:
        values = [float(v) for v in f.read().split()]
    return np.array(values).reshape(shape)

def load_image(path):
    return np.array(Image.open(path))

def convert_from_uvd(u, v, d, intr, pose):
    # back-project pixel (u, v) with depth d into world coordinates
    if d == 0:
        return None, None, None

    fx, fy = intr[0, 0], intr[1, 1]
    cx, cy = intr[0, 2], intr[1, 2]
    depth_scale = 6553.5

    z = d / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy

    world = pose @ np.array([x, y, z, 1])
    return world[:3] / world[3]

def plot_3d(xdata, ydata, zdata, color=None, b_min=0, b_max=8, view=(45, 45)):
    fig, ax = plt.subplots(subplot_kw={"projection": "3d"}, dpi=200)
    ax.view_init(view[0], view[1])

    ax.set_xlim(b_min, b_max)
    ax.set_ylim(b_min, b_max)
    ax.set_zlim(b_min, b_max)

    # explicit RGB colors are passed, so no colormap is needed
    ax.scatter3D(xdata, ydata, zdata, c=color, s=0.1)

root = '/data/frames/hm3d-v0/084-hm3d-zt1RVoi7PcG'
intrinsic_depth = np.loadtxt(root + '/intrinsic_depth.txt')

x_data, y_data, z_data, c_data = [], [], [], []

for idx in tqdm(range(10)):
    rgb_image_path = root + '/{:05d}-rgb.png'.format(idx)
    depth_image_path = root + '/{:05d}-depth.png'.format(idx)

    p = load_matrix_from_txt(root + '/{:05d}.txt'.format(idx))  # camera pose
    c = load_image(rgb_image_path)
    d = load_image(depth_image_path)

    # randomly subsample pixels and back-project them
    for i in range(d.shape[0]):
        for j in range(d.shape[1]):
            if random.random() < pixel_sample_ratio:
                x, y, z = convert_from_uvd(j, i, d[i, j], intrinsic_depth, p)
                if x is None:
                    continue

                x_data.append(x)
                y_data.append(y)
                z_data.append(z)

                # look up the color at the corresponding RGB pixel
                ci = int(i * c.shape[0] / d.shape[0])
                cj = int(j * c.shape[1] / d.shape[1])
                c_data.append(c[ci, cj] / 255.0)

plot_3d(x_data, y_data, z_data, color=c_data)
plt.show()
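
One possible culprit worth checking (an assumption on my part, not a confirmed fix): Habitat-style poses often use the OpenGL camera convention (looking down -z with +y up), while the pinhole back-projection above uses the usual computer-vision convention (+z forward, +y down). If the HM3D poses follow the OpenGL convention, flipping the y and z camera axes before applying the pose may give a consistent point cloud:

# Hedged sketch: back-projection with a y/z axis flip, in case the HM3D poses
# are camera-to-world matrices in the OpenGL/Habitat convention.
import numpy as np

CV_TO_GL = np.diag([1.0, -1.0, -1.0, 1.0])  # flip the y and z camera axes

def convert_from_uvd_gl(u, v, d, intr, pose, depth_scale=6553.5):
    """Same back-projection as convert_from_uvd above, with an axis flip."""
    if d == 0:
        return None, None, None
    fx, fy = intr[0, 0], intr[1, 1]
    cx, cy = intr[0, 2], intr[1, 2]
    z = d / depth_scale
    x = (u - cx) * z / fx
    y = (v - cy) * z / fy
    world = pose @ CV_TO_GL @ np.array([x, y, z, 1.0])
    return world[:3] / world[3]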

Prompt for LLM + Scene Graph

Congratulations on the awesome work! Could you also share your prompt for the LLM when using a scene graph as reference information? Thanks!

Baselines for open source models

Hello!
Thank you for open-sourcing the code base. I was wondering whether it would be possible to release the code for open-source models such as LLaMA and LLaVA. Unfortunately, I do not have access to the Gemini/Claude/GPT APIs, so it would be great if I could run the open-source baselines.
Kalyani

Active EQA

Thank you so much for open-sourcing the project. I was very interested in the Active EQA task in your paper, but I don't see an Active EQA baseline in this repository. Would it be possible to provide the Active EQA-related code?
