Corresponding code repo for the paper at COLING 2020 - ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset"

Home Page: https://huggingface.co/datasets/Hellisotherpeople/DebateSum

DebateSum

Corresponding code repo for the paper at ARGMIN 2020: "DebateSum: A large-scale argument mining and summarization dataset"

Arxiv pre-print available here: https://arxiv.org/abs/2011.07251

Check out the presentation date and time here: https://argmining2020.i3s.unice.fr/node/9

Full paper as presented by the ACL is here: https://www.aclweb.org/anthology/2020.argmining-1.1/

Video of presentation at COLING 2020: https://underline.io/lecture/6461-debatesum-a-large-scale-argument-mining-and-summarization-dataset

The dataset is hosted on the Hugging Face Datasets Hub (including a preview of the dataset): https://huggingface.co/datasets/Hellisotherpeople/DebateSum

The dataset is distributed as CSV files.

A search engine over DebateSum (as well as some additional evidence not included in DebateSum) is available as debate.cards. It is high quality and displays the evidence in the format that debaters use.

Data

DebateSum consists of 187,328 debate documents, arguments (which can also be thought of as abstractive summaries or queries), word-level extractive summaries, citations, and associated metadata, organized by topic-year. This data is ready for analysis by NLP systems.
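
Since the data ships as CSV files, a quick way to see what a file contains is to read its header and count its rows. The following is a minimal sketch using only the Python standard library. The function name `summarize_csv` is made up for illustration, and the sketch assumes nothing about the actual column names: it reports whatever header the file has.

```python
import csv
import sys
from pathlib import Path


def summarize_csv(path):
    """Return (column_names, row_count) for one CSV file."""
    # Debate documents can be long, so raise the default per-field size limit.
    csv.field_size_limit(2**31 - 1)
    with Path(path).open(newline="", encoding="utf-8") as handle:
        reader = csv.reader(handle)
        header = next(reader, [])           # first row holds the column names
        row_count = sum(1 for _ in reader)  # remaining rows are records
    return header, row_count


if __name__ == "__main__":
    # Pass one or more CSV paths on the command line.
    for name in sys.argv[1:]:
        columns, rows = summarize_csv(name)
        print(f"{name}: {rows} rows, columns = {columns}")
```

Run it against any of the downloaded files to check the columns before loading the data into a larger pipeline.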

Download

All data is accessible in a parsed format, organized by topic year, here

Additionally, the trained word vectors for debate2vec are also found in that folder.

Regenerating it yourself

This is useful as the debaters who produce the evidence release their work every year. Soon enough I will update to include the 2020-2021 topic.

Step 1: Download all open evidence files from Open Evidence and unzip them into a directory. The links are as follows:

  • 2019 - Resolved: The United States federal government should substantially reduce Direct Commercial Sales and/or Foreign Military Sales of arms from the United States.
  • 2018 - Resolved: The United States federal government should substantially reduce its restrictions on legal immigration to the United States.
  • 2017 - Resolved: The United States federal government should substantially increase its funding and/or regulation of elementary and/or secondary education in the United States.
  • 2016 - Resolved: The United States federal government should substantially increase its economic and/or diplomatic engagement with the People’s Republic of China.
  • 2015 - Resolved: The United States federal government should substantially curtail its domestic surveillance.
  • 2014 - Resolved: The United States federal government should substantially increase its non-military exploration and/or development of the Earth’s oceans.
  • 2013 - Resolved: The United States federal government should substantially increase its economic engagement toward Cuba, Mexico or Venezuela.

Step 2: Convert all evidence from docx files to html5 files using pandoc with this command:

for f in *.docx; do pandoc "$f" -s -o "${f%.docx}.html5"; done
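
If you would rather drive the conversion from Python (for example, to collect the files that fail to convert), the shell loop above can be sketched as follows. This is an illustrative equivalent, not part of the repo: `pandoc_command` and `convert_directory` are hypothetical helper names, and the sketch assumes `pandoc` is installed and on your `PATH` (otherwise `subprocess.run` raises `FileNotFoundError`).

```python
import subprocess
from pathlib import Path


def pandoc_command(docx_path):
    """Build the pandoc argument list for one .docx file."""
    docx_path = Path(docx_path)
    # Mirrors "${f%.docx}.html5": swap the final extension only.
    output_path = docx_path.with_suffix(".html5")
    return ["pandoc", str(docx_path), "-s", "-o", str(output_path)]


def convert_directory(directory="."):
    """Convert every .docx file in `directory`; return the paths that failed."""
    failed = []
    for docx_path in sorted(Path(directory).glob("*.docx")):
        result = subprocess.run(pandoc_command(docx_path))
        if result.returncode != 0:
            failed.append(docx_path)
    return failed


if __name__ == "__main__":
    for path in convert_directory("."):
        print(f"pandoc failed on {path}")
```

Unlike the shell loop, this version keeps going after a failure and lists the problem files at the end.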

Step 3: Install the dependencies for make_debate_dataset.py:

pip install -r requirements.txt

Step 4: Modify the folder and file locations as needed for your system, then run make_debate_dataset.py:

python3 make_debate_dataset.py

Credits

Huge thanks to Arvind Balaji for making debate.cards and being second author on this paper!

debatesum's Issues

Training the transformer

Hi @Hellisotherpeople ,

as you wrote in your paper, you trained several transformer models, including BERT-large and Longformer-base. You also mentioned the usage of simple-transformer library. Could you share a short code snippet on how you trained the model for extractive summarization, please?

Thanks in advance!
