
The official repository for paper "It is AI’s Turn to Ask Humans a Question: Question-Answer Pair Generation for Children’s Story Books" accepted to ACL 2022



FairytaleQA_QAG_System

For paper It is AI’s Turn to Ask Humans a Question: Question-Answer Pair Generation for Children’s Story Books [accepted to ACL 2022]

We have a separate repository for the FairytaleQA Dataset, introduced in "Fantastic Questions and Where to Find Them: FairytaleQA -- An Authentic Dataset for Narrative Comprehension" [accepted to ACL 2022]:

https://github.com/uci-soe/FairytaleQAData (this repo might not be available; if so, use the following one instead)

https://github.com/WorkInTheDark/FairytaleQA_Dataset

We also have a separate repository for the StoryBuddy storytelling system, which is built upon this QAG framework and described in "StoryBuddy: A Human-AI Collaborative Chatbot for Parent-Child Interactive Storytelling with Flexible Parental Involvement" [accepted to CHI 2022]:

https://github.com/roryzhengzhang/storytelling-QA-system

What this repo is for

We developed an automated QA-pair generation (QAG) system for an education scenario: given a story book as input, our system can automatically generate QA-pairs that are capable of testing a variety of dimensions of a student's comprehension skills. We are using a new expert-annotated FairytaleQA dataset, which focuses on narrative comprehension for elementary to middle school students and contains 10,580 QA-pairs labeled by education experts from 278 classic fairytales.

For the fine-tuning process and the end-to-end generation pipeline, we have used the same version of transformers since the start of the project to avoid version conflicts, and that version is included in this repo. You may find the latest version here: https://github.com/huggingface/transformers

QA-pair Generation System Diagram


There are three sub-modules in our QAG pipeline:

  1. An answer generation (AG) module that uses the spaCy English model to extract named entities and noun chunks, and a PropBank-based semantic role labeler to extract descriptions of action events, as candidate answers
  2. A BART-based question generation (QG) module fine-tuned on the FairytaleQA dataset
  3. A ranking module that ranks the generated QA-pairs and selects the top N. We fine-tune a DistilBERT model on a classification task that distinguishes QA-pairs generated by our QAG system from ground-truth QA-pairs in the FairytaleQA dataset
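The way the three sub-modules fit together can be sketched as follows. This is only an illustrative skeleton, not the code in the notebooks: `QAPair`, `generate_qa_pairs`, and the three callables are hypothetical names, and the toy stand-ins at the bottom replace the spaCy/SRL, BART, and DistilBERT components so that the sketch runs without any model.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class QAPair:
    question: str
    answer: str
    score: float = 0.0


def generate_qa_pairs(
    section: str,
    extract_answers: Callable[[str], List[str]],
    generate_question: Callable[[str, str], str],
    score_pair: Callable[[str, str, str], float],
    top_n: int = 3,
) -> List[QAPair]:
    """Run AG -> QG -> ranking on one story section and keep the top-N pairs."""
    pairs = []
    for answer in extract_answers(section):            # 1. answer generation
        question = generate_question(section, answer)  # 2. question generation
        score = score_pair(section, question, answer)  # 3. ranking score
        pairs.append(QAPair(question, answer, score))
    # Highest score first, then keep only the best top_n pairs.
    pairs.sort(key=lambda pair: pair.score, reverse=True)
    return pairs[:top_n]


# Toy stand-ins (assumptions for illustration only, NOT the real modules).
def toy_extract(section: str) -> List[str]:
    # Treat capitalized words as candidate answers.
    return [word.strip(".,") for word in section.split() if word[0].isupper()]


def toy_question(section: str, answer: str) -> str:
    return f"What does the story say about {answer}?"


def toy_score(section: str, question: str, answer: str) -> float:
    # Longer answers score higher; a real ranker would use a trained classifier.
    return float(len(answer))


section = "Once upon a time Cinderella lived with Anastasia."
top_pairs = generate_qa_pairs(section, toy_extract, toy_question, toy_score, top_n=2)
```

Passing the three stages in as callables keeps the orchestration independent of any particular model, so each stage can be swapped or tested on its own.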

What's here

We provide separate Jupyter Notebooks for the following tasks:

  • (You can load the dataset from Huggingface Hub, SEE BELOW)

    0_Pre_processing_the_original_data.ipynb --> Pre-processing the original story dataset into desired fine-tuning format. You may acquire the original story dataset from https://github.com/uci-soe/FairytaleQAData. Remember to put question and story files into one folder before using this notebook, so that the script can directly find the story file and question file for the same story.

  • 1_Train_BART_model.ipynb --> fine-tune a BART QG model

  • 2_Generate_QA_pairs_with_our_QAG_system.ipynb --> end-to-end QAG

  • 3_RANK_QA_on_test_val.ipynb --> Ranking module after generating QA-pairs with the previous Notebook
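The pre-processing notebook expects the story file and the question file of each story to sit in the same folder. A minimal sketch of how such a folder could be checked before running it is shown below. The `-story.csv` and `-questions.csv` suffixes and the function name are assumptions made for illustration; adjust them to the actual file names in your copy of the dataset.

```python
import tempfile
from pathlib import Path
from typing import Dict, Tuple

STORY_SUFFIX = "-story.csv"          # assumed naming, adjust to your files
QUESTION_SUFFIX = "-questions.csv"   # assumed naming, adjust to your files


def pair_story_and_question_files(folder) -> Dict[str, Tuple[Path, Path]]:
    """Map each story name to its (story file, question file) pair.

    Stories without a matching question file are skipped.
    """
    folder = Path(folder)
    pairs = {}
    for story_path in sorted(folder.glob(f"*{STORY_SUFFIX}")):
        name = story_path.name[: -len(STORY_SUFFIX)]
        question_path = folder / f"{name}{QUESTION_SUFFIX}"
        if question_path.exists():
            pairs[name] = (story_path, question_path)
    return pairs


# Small demonstration on a temporary folder with made-up file names.
with tempfile.TemporaryDirectory() as tmp:
    for file_name in ("cinderella-story.csv", "cinderella-questions.csv",
                      "orphan-story.csv"):
        (Path(tmp) / file_name).write_text("placeholder\n", encoding="utf-8")
    demo_pairs = pair_story_and_question_files(tmp)
    demo_names = sorted(demo_pairs)
```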

[2023 UPDATE] We have uploaded the dataset to Huggingface Hub, so you can load it much more easily for NLP tasks: https://huggingface.co/datasets/WorkInTheDark/FairytaleQA

from datasets import load_dataset
dataset = load_dataset("WorkInTheDark/FairytaleQA")

'''
DatasetDict({
    train: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 8548
    })
    validation: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 1025
    })
    test: Dataset({
        features: ['story_name', 'story_section', 'question', 'answer1', 'answer2', 'local-or-sum', 'attribute', 'ex-or-im', 'ex-or-im2'],
        num_rows: 1007
    })
})
'''

To load a single split (train, validation, or test):

from datasets import load_dataset
dataset = load_dataset("WorkInTheDark/FairytaleQA", split='train')
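Each row carries the columns listed in the printout above. As a rough sketch, a row could be turned into a source/target pair for a question generation model like this. The `answer </s> section` layout, the `separator` parameter, and the function name are assumptions for illustration and are not necessarily the format used in the notebooks.

```python
from typing import Dict, Tuple


def row_to_qg_example(row: Dict[str, str], separator: str = " </s> ") -> Tuple[str, str]:
    """Build (source, target) for question generation from one dataset row.

    source = first annotated answer + separator + story section
    target = the annotated question
    """
    source = f"{row['answer1']}{separator}{row['story_section']}"
    target = row["question"]
    return source, target


# A made-up row with the same column names as the dataset.
toy_row = {
    "story_name": "toy-story",
    "story_section": "The fox hid the key under a stone.",
    "question": "Where did the fox hide the key?",
    "answer1": "under a stone",
}
source, target = row_to_qg_example(toy_row)
```

With a loaded split, the same function can be applied to every row, for example `[row_to_qg_example(row) for row in dataset]`.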

[ORIGINAL CONTENT] To make things easy, we have pre-processed the original stories from the FairytaleQA Dataset for QAG and stored them under ./QAG_Generation_E2E/data/input_for_QAG. In each pre-processed story file, each line is one section of the story. (Sections were determined by human coders, and a section may contain multiple paragraphs.)

Thus, you may directly run 2_Generate_QA_pairs_with_our_QAG_system.ipynb without the need to pre-process original story books by yourself if you just wish to view the generation results on FairytaleQA Dataset. (But you still need to get the model checkpoint below). Also, you may directly use the pre-processed story data to test your own QAG systems.
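If you want to feed the pre-processed stories to your own system, a minimal reader could look like the sketch below. It assumes UTF-8 text with one section per line, as described above; the function name and the demo file are hypothetical.

```python
import tempfile
from pathlib import Path
from typing import List


def read_story_sections(path) -> List[str]:
    """Return the sections of one pre-processed story, one per non-empty line."""
    with open(path, encoding="utf-8") as handle:
        return [line.strip() for line in handle if line.strip()]


# Demonstration on a temporary file with made-up content.
with tempfile.TemporaryDirectory() as tmp:
    demo_file = Path(tmp) / "demo_story.txt"
    demo_file.write_text(
        "First section of the story.\n\nSecond section of the story.\n",
        encoding="utf-8",
    )
    sections = read_story_sections(demo_file)
```

To process every story, the same function can be applied to each file found under ./QAG_Generation_E2E/data/input_for_QAG.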

Here are the model checkpoints used in the end-to-end QAG Notebook and the Ranking Module Notebook:

Tips

  • We suggest using Google Colab so that you can copy the model to your own drive and mount it to the Colab instance directly, since downloading such a large BART model from Google Drive is quite slow.
  • To run 2_Generate_QA_pairs_with_our_QAG_system.ipynb, you need a system with more than 16 GB of RAM, preferably with GPU support.
  • If you are using Google Colab, remember to restart the runtime after installing the dependencies (Colab will also prompt you automatically).
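Before starting the generation notebook, it may help to check the machine against the RAM and GPU tips above. The following is a small sketch with hypothetical helper names. The RAM check relies on `os.sysconf`, which is assumed to be available (Linux and macOS), and the GPU check assumes PyTorch is the framework in use.

```python
import os
from typing import Optional


def total_ram_gib() -> Optional[float]:
    """Return physical RAM in GiB, or None if the platform does not report it."""
    try:
        page_size = os.sysconf("SC_PAGE_SIZE")
        page_count = os.sysconf("SC_PHYS_PAGES")
    except (AttributeError, ValueError, OSError):
        return None
    return page_size * page_count / 1024 ** 3


def gpu_available() -> bool:
    """Return True if PyTorch is installed and can see a CUDA device."""
    try:
        import torch
    except ImportError:
        return False
    return bool(torch.cuda.is_available())


ram = total_ram_gib()
if ram is not None and ram <= 16:
    print(f"Only {ram:.1f} GiB of RAM detected; the notebook may run out of memory.")
if not gpu_available():
    print("No CUDA GPU detected; generation will run on CPU and may be slow.")
```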

Citation

Our paper was accepted to ACL 2022; you may cite it as:

@inproceedings{yao2022storybookqag,
    author = {Yao, Bingsheng and Wang, Dakuo and Wu, Tongshuang and Zhang, Zheng and Li, Toby Jia-Jun and Yu, Mo and Xu, Ying},
    title = {It is AI's Turn to Ask Humans a Question: Question-Answer Pair Generation for Children's Story books},
    publisher = {Association for Computational Linguistics},
    year = {2022}
}
