
PeerRead

Data and code for "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications" by Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy and Roy Schwartz, NAACL 2018

The PeerRead dataset

PeerRead is a dataset of scientific peer reviews, available to help researchers study this important artifact. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions from top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.

We structured the dataset into sections, each corresponding to a venue or an arXiv category, e.g., ./data/acl_2017 and ./data/arxiv.cs.cl_2007-2017. Each section is further divided into train/dev/test splits (the same splits used in the paper). Due to licensing constraints, for some sections we provide instructions for downloading the data instead of including it in this repository, e.g., ./data/nips_2013-2017/README.md.
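For orientation, the sketch below shows one way to iterate over a section's review files. It assumes each paper's reviews live in a JSON file under <section>/<split>/reviews/ and carry a top-level "reviews" list; the exact schema may vary by venue.

    import glob
    import json
    import os

    # Hypothetical walk over one section/split; the reviews/ layout follows
    # the repository structure, but the JSON field names are assumptions.
    section_dir = os.path.join("data", "acl_2017", "train")
    for path in sorted(glob.glob(os.path.join(section_dir, "reviews", "*.json"))):
        with open(path) as f:
            paper = json.load(f)
        # Each file is expected to bundle a paper's metadata with its reviews.
        print(path, len(paper.get("reviews", [])))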

Models

To experiment with (and hopefully improve) our models for aspect prediction and for predicting whether a paper will be accepted, see ./code/README.md.

Setup Configuration

Run ./setup.sh at the root of this repository to install dependencies and download some of the larger data files not included in this repo.

Citation

@inproceedings{kang18naacl,
  title = {A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications},
  author = {Dongyeop Kang and Waleed Ammar and Bhavana Dalvi and Madeleine van Zuylen and Sebastian Kohlmeier and Eduard Hovy and Roy Schwartz},
  booktitle = {Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  address = {New Orleans, USA},
  month = {June},
  url = {https://arxiv.org/abs/1804.09635},
  year = {2018}
}

Acknowledgement

  • We use some of the code from CanaanShen for web crawling.
  • We use some of the code from jiegzhan for our aspect prediction experiments.
  • This work would not have been possible without the efforts of Rich Gerber and Paolo Gai (developers of the softconf.com conference management system), Stefan Riezler and Yoav Goldberg (chairs of CoNLL 2016), and Min-Yen Kan and Regina Barzilay (chairs of ACL 2017), who allowed authors and reviewers to opt in to this dataset during the official review process.
  • We thank the openreview.net, arxiv.org and semanticscholar.org teams for their commitment to promoting transparency and openness in scientific communication.


Issues

Download NIPS dataset

Line 40 in NIPS_crawl.py:
str(data).split("\n")
This returns a list with only one element (the whole string), so the splitting code does not seem to work.
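A likely cause (an assumption, not verified against the crawler): if data is a bytes object, str(data) returns its repr, so newlines appear as literal \n escape sequences and there is nothing for split("\n") to split on. Decoding the bytes first behaves as expected:

    # str() on bytes yields the repr ("b'a\\nb'"), so "\n" never matches;
    # decoding first splits as intended.
    data = b"line one\nline two"              # stand-in for the crawled payload
    print(str(data).split("\n"))              # one element: ["b'line one\\nline two'"]
    print(data.decode("utf-8").split("\n"))   # ['line one', 'line two']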

Couldn't get the NIPS data

I couldn't get the NIPS data because the download URL in the program is invalid.
How can I get this data?

Increase in test result in the master branch

Hey,

When I run the script to calculate the acceptance/rejection percentage, I get 68.42% on 15 examples, but the paper reports an accuracy of 65.3% on the ICLR data. Can you please let me know whether you added some new features, or whether I am doing something wrong?

Thanks!

Huggingface Dataset reviews missing data?

I downloaded the 'reviews' subset of this dataset from the Hugging Face Hub, and all of the review data is missing!

    from datasets import load_dataset
    rev_dataset = load_dataset("peer_read", "reviews", split="train")

The 'reviews' column contains only missing values/empty lists, e.g. {'date': [], 'title': [], 'other_keys': [], 'o...

Is this an error in the uploaded dataset? Thanks!
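As a quick sanity check (a sketch; it assumes the 'reviews' column is a dict of lists, as the printout above suggests), one can count how many rows carry no review data at all:

    from datasets import load_dataset

    rev_dataset = load_dataset("peer_read", "reviews", split="train")

    # A row counts as "empty" if every field in its reviews dict is an empty list.
    empty = sum(
        1 for row in rev_dataset
        if all(not v for v in row["reviews"].values())
    )
    print(f"{empty}/{len(rev_dataset)} rows have no review data")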

Reviewer information

Hi, thanks for the nice work (the first of its kind). I see there are anonymous reviewer comments in your dataset, but how can I find out the identity of a reviewer? My research topic requires training a model on a reviewer's previous work to predict whether the comments attributed to a reviewer were actually written by that reviewer or by someone else (perhaps a student or colleague). Please suggest how this dataset could help in this scenario (any other advice or links are also welcome).

Discrepancy between "annotation_full.tsv" and the actual reviews

While the article identifiers in "annotation_full.tsv" are in the range 0-426, the identifiers in the reviews (/data/iclr_2017/train/reviews/) fall between 300 and 700. As a result, running "assign_annot_iclr_2017.py" ends up matching annotations to the wrong reviews.
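One quick way to confirm the mismatch (a sketch; the first TSV column is assumed to hold the article id, and review filenames are assumed to be "<id>.json"):

    import csv
    import glob
    import os

    # Ids referenced by the annotation file.
    with open("annotation_full.tsv") as f:
        annot_ids = {row[0] for row in csv.reader(f, delimiter="\t")}

    # Ids of the actual review files.
    review_ids = {
        os.path.splitext(os.path.basename(p))[0]
        for p in glob.glob("data/iclr_2017/train/reviews/*.json")
    }

    print("overlap:", len(annot_ids & review_ids))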

Extracting "intermediate", parsed data as a flat Table?

Hi! Great dataset!
I am interested in experimenting on it for my own work, as well as comparing ML approaches on it. I want to get the data in the form of a table (amenable to pandas and the like), while keeping the "raw" data: the raw text, labels, a column marking each row's source, dates, the reviewer number as an ID column, etc.
I know the pipeline munges these features, but its output is too processed for my purposes. Where in the code should I look in order to get the intermediate outputs?

For example: a CSV of all reviews and texts, with the raw variables (each in its own column), across all the datasets and train/test splits?

Thanks!
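In the meantime, a minimal flattening sketch (assuming the per-paper JSON files under <section>/<split>/reviews/ carry a top-level "reviews" list; the field names here are guesses) might look like:

    import glob
    import json
    import os

    import pandas as pd

    rows = []
    for split in ("train", "dev", "test"):
        pattern = os.path.join("data", "iclr_2017", split, "reviews", "*.json")
        for path in glob.glob(pattern):
            with open(path) as f:
                paper = json.load(f)
            # One row per review, keeping raw fields plus provenance columns.
            for i, review in enumerate(paper.get("reviews", [])):
                rows.append({
                    "paper_id": os.path.splitext(os.path.basename(path))[0],
                    "split": split,
                    "reviewer_idx": i,                   # position stands in for an id
                    "comments": review.get("comments"),  # raw review text, if present
                })

    df = pd.DataFrame(rows)
    df.to_csv("iclr_2017_reviews_flat.csv", index=False)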

Improve documentation

The readme file at the root of this repository is a bit rough. It should be an enjoyable, friction-free read. For example:

  • Replace the detailed directory structure with a high level description of what is provided in the repo.
  • Add pointers to other readme files in the repository with more detailed instructions (e.g., for downloading NIPS data, or running experiments).
  • In addition to providing a link to the paper, add a brief description of what it is about.
