This project forked from wenhuchen/theoremqa

The dataset and code for paper: TheoremQA: A Theorem-driven Question Answering dataset

License: MIT License

TheoremQA

The dataset and code for paper: TheoremQA: A Theorem-driven Question Answering dataset (https://arxiv.org/abs/2305.12524).

Introduction

We propose the first question-answering dataset driven by STEM theorems. We annotated 800 QA pairs covering 350+ theorems spanning Math, EE&CS, Physics, and Finance. The dataset was collected and annotated by human experts to ensure high quality. We provide it as a new benchmark to test how well large language models can apply theorems to solve challenging university-level questions. Below, we provide a pipeline to prompt LLMs and evaluate their outputs with WolframAlpha.

The dataset covers a wide range of topics listed below:

Examples

Huggingface

Our dataset is on Huggingface now: https://huggingface.co/datasets/wenhu/TheoremQA

from datasets import load_dataset
dataset = load_dataset("wenhu/TheoremQA")
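Once the records are loaded, a quick summary can help when choosing which questions to evaluate. The following is a minimal sketch, not code from this repository: the helper `count_by_field` and the field names `Question`, `Answer`, and `Answer_type` are assumptions made for illustration, so check the actual column names of the loaded dataset before relying on them.

```python
from collections import Counter
from typing import Any, Iterable, Mapping


def count_by_field(records: Iterable[Mapping[str, Any]], field: str) -> Counter:
    """Count how many records share each value of `field`.

    Records missing the field are counted under "unknown".
    """
    return Counter(str(record.get(field, "unknown")) for record in records)


# Hypothetical records; the field names are assumptions, not the confirmed schema.
sample = [
    {"Question": "What is 2 + 2?", "Answer": 4, "Answer_type": "integer"},
    {"Question": "Is 7 prime?", "Answer": True, "Answer_type": "bool"},
    {"Question": "How many edges does K4 have?", "Answer": 6, "Answer_type": "integer"},
    {"Question": "A record without a type", "Answer": 1.5},
]

if __name__ == "__main__":
    for value, count in count_by_field(sample, "Answer_type").most_common():
        print(f"{value}: {count}")
```

The same helper should work on any iterable of dict-like rows, for example a split of the object returned by `load_dataset`, provided the field name matches the real schema.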

Files

  • theoremqa_test.json: contains all the annotated question-answer pairs.
  • theoremqa_visual_subset_test.json: contains the subset of visual questions, in case you want to test on those specifically.
  • all_theorems.json: contains the textual descriptions of all the theorems covered.
  • error_analysis/*: this folder contains the error analysis results on the 180-question subset.
  • solutions/*: this folder contains solutions for roughly 180 questions, which correspond to the problems used in error_analysis/.
  • outputs/*.json.corrected: these files contain all the model outputs.

Visualize the GPT-4 output at https://github.com/wenhuchen/TheoremQA/blob/main/visualize.ipynb.

Running Instructions

Dependencies

  • openai == 0.27.6
  • wolframalpha == 5.0.0
  • pytorch == py3.8_cuda11.8_cudnn8.7.0_0
  • sympy == 1.11.1
  • transformers == 4.29.1
  • accelerate == 0.19.0
  • anthropic == 0.2.9

Chain-of-Thoughts Prompting

python run_gpt4.py

This will write output to outputs/GPT4_s0...

Program-of-Thoughts Prompting

python run_gpt4_pot.py

This will write output to outputs/GPT4_PoT_s0...
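In Program-of-Thoughts prompting, the model writes a short program instead of a free-text derivation, and the answer is obtained by executing that program. The sketch below illustrates only the execution step under stated assumptions: the helper `run_program` and the convention that the result is stored in a variable named `ans` are hypothetical and are not taken from run_gpt4_pot.py.

```python
from typing import Any, Optional


def run_program(program: str, result_var: str = "ans") -> Optional[Any]:
    """Execute model-generated Python source and return the value of `result_var`.

    Returns None if the program raises or never assigns `result_var`.
    WARNING: exec() runs arbitrary code; only use it inside a sandbox.
    """
    namespace: dict = {}
    try:
        exec(program, namespace)
    except Exception:
        return None
    return namespace.get(result_var)


# A hypothetical model output for "How many ways are there to choose 2 items from 5?"
generated = "import math\nans = math.comb(5, 2)"

if __name__ == "__main__":
    print(run_program(generated))
```

Returning None on failure lets a caller treat a crashed or incomplete program as a wrong answer rather than stopping the whole evaluation run.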

Evaluate model output

You need to register a Wolfram|Alpha account to use their free API. Check out https://products.wolframalpha.com/api to register. Once you are done, you should receive an API_KEY.

export OPENAI_KEY=[YOUR_KEY]
export WOLFRAM_KEY=[YOUR_KEY]
python predict_accuracy.py outputs/[YOUR_FILE]

This will write an evaluation output as outputs/[YOUR_FILE].corrected
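Scoring comes down to deciding whether a predicted answer matches the reference answer. The sketch below shows one simple way to make that comparison locally. It is an assumption-laden illustration, not the logic of predict_accuracy.py: it does not call WolframAlpha, and the helper `is_correct` and its 1% relative tolerance are hypothetical choices.

```python
import math
from typing import Any


def is_correct(prediction: Any, reference: Any, rel_tol: float = 1e-2) -> bool:
    """Compare a predicted answer with a reference answer.

    Booleans must match exactly, numbers are compared within a relative
    tolerance, and everything else falls back to a case-insensitive
    string comparison.
    """
    # Handle booleans first, because bool is a subclass of int in Python.
    if isinstance(prediction, bool) or isinstance(reference, bool):
        return (
            isinstance(prediction, bool)
            and isinstance(reference, bool)
            and prediction == reference
        )
    try:
        return math.isclose(
            float(prediction), float(reference), rel_tol=rel_tol, abs_tol=1e-9
        )
    except (TypeError, ValueError):
        # At least one side is not numeric, so compare as text.
        return str(prediction).strip().lower() == str(reference).strip().lower()


if __name__ == "__main__":
    print(is_correct("3.14", 3.1416))
    print(is_correct(2.0, 3.0))
    print(is_correct("(a)", " (A) "))
```

A tolerance is used because numerical answers produced by a model or a program rarely match the reference digit for digit.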

Analyze the model output

python analyze_results.py outputs/[YOUR_FILE].corrected

Leaderboard

| Model | Method | Accuracy (%) |
|---|---|---|
| GPT-4 | PoT | 52.4 |
| GPT-4 | CoT | 43.8 |
| ChatGPT | PoT | 35.6 |
| PaLM-2 (unicorn) | CoT | 31.8 |
| ChatGPT | CoT | 30.2 |
| GPT-3.5 (text-davinci-003) | PoT | 27.8 |
| Claude-v1 | PoT | 25.9 |
| Claude-v1 | CoT | 24.9 |
| Claude-v2 | CoT | 24.6 |
| Claude-instant | CoT | 23.6 |
| Codex (code-davinci-002) | PoT | 23.9 |
| GPT-3.5 (text-davinci-003) | CoT | 22.8 |
| PaLM-2 (bison) | CoT | 21.0 |
| GPT-3 (text-davinci-002) | PoT | 20.6 |
| GPT-3 (text-davinci-002) | CoT | 16.6 |
| Alpaca | CoT | 13.5 |
| Vicuna | CoT | 12.9 |
| MOSS | CoT | 12.2 |
| StarChat | PoT | 12.2 |
| InstructCodeT5+ | PoT | 11.6 |
| OpenAssistant | CoT | 10.7 |
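For the models evaluated with both methods, the PoT and CoT rows above can be compared directly. The short sketch below computes the difference in accuracy points using only the numbers from the leaderboard; the helper name `pot_gain` is hypothetical.

```python
from typing import Dict

# Accuracy values copied from the leaderboard, restricted to models that
# have both a PoT and a CoT entry.
scores: Dict[str, Dict[str, float]] = {
    "GPT-4": {"PoT": 52.4, "CoT": 43.8},
    "ChatGPT": {"PoT": 35.6, "CoT": 30.2},
    "GPT-3.5 (text-davinci-003)": {"PoT": 27.8, "CoT": 22.8},
    "Claude-v1": {"PoT": 25.9, "CoT": 24.9},
    "GPT-3 (text-davinci-002)": {"PoT": 20.6, "CoT": 16.6},
}


def pot_gain(table: Dict[str, Dict[str, float]]) -> Dict[str, float]:
    """Return PoT accuracy minus CoT accuracy for each model, in points."""
    return {
        model: round(entry["PoT"] - entry["CoT"], 1)
        for model, entry in table.items()
    }


if __name__ == "__main__":
    for model, gain in pot_gain(scores).items():
        print(f"{model}: {gain:+.1f}")
```

In this subset, PoT scores higher than CoT for every model, with the gap ranging from 1.0 point (Claude-v1) to 8.6 points (GPT-4).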

Cite our Work

@article{chen2023theoremqa,
  title={TheoremQA: A Theorem-driven Question Answering dataset},
  author={Chen, Wenhu and Yin, Ming and Ku, Max and Wan, Elaine and Ma, Xueguang and Xu, Jianyu and Xia, Tony and Wang, Xinyi and Lu, Pan},
  journal={arXiv preprint arXiv:2305.12524},
  year={2023}
}
