
PeerRead

Data and code for "A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications" by Dongyeop Kang, Waleed Ammar, Bhavana Dalvi, Madeleine van Zuylen, Sebastian Kohlmeier, Eduard Hovy and Roy Schwartz, NAACL 2018

The PeerRead dataset

PeerRead is a dataset of scientific peer reviews, available to help researchers study this important artifact. The dataset consists of over 14K paper drafts and the corresponding accept/reject decisions from top-tier venues including ACL, NIPS and ICLR, as well as over 10K textual peer reviews written by experts for a subset of the papers.

We structured the dataset into sections, each corresponding to a venue or an arXiv category, e.g., ./data/acl_2017 and ./data/arxiv.cs.cl_2007-2017. Each section is further divided into train/dev/test splits (the same splits used in the paper). Due to licensing constraints, for some sections we provide instructions for downloading the data instead of including it in this repository, e.g., ./data/nips_2013-2017/README.md.
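For orientation, the sketch below shows one way to iterate over a section's review files. It assumes each paper's reviews live in a JSON file under <section>/<split>/reviews/ and carry a top-level "reviews" list; the exact schema may vary by venue.

    import glob
    import json
    import os

    # Hypothetical walk over one section/split; the reviews/ layout follows
    # the repository structure, but the JSON field names are assumptions.
    section_dir = os.path.join("data", "acl_2017", "train")
    for path in sorted(glob.glob(os.path.join(section_dir, "reviews", "*.json"))):
        with open(path) as f:
            paper = json.load(f)
        # Each file is expected to bundle a paper's metadata with its reviews.
        print(path, len(paper.get("reviews", [])))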

Models

To experiment with (and hopefully improve) our models for aspect prediction and for predicting whether a paper will be accepted, see ./code/README.md.

Setup Configuration

Run ./setup.sh at the root of this repository to install dependencies and download some of the larger data files not included in this repo.

Citation

@inproceedings{kang18naacl,
  title = {A Dataset of Peer Reviews (PeerRead): Collection, Insights and NLP Applications},
  author = {Dongyeop Kang and Waleed Ammar and Bhavana Dalvi and Madeleine van Zuylen and Sebastian Kohlmeier and Eduard Hovy and Roy Schwartz},
  booktitle = {Meeting of the North American Chapter of the Association for Computational Linguistics (NAACL)},
  address = {New Orleans, USA},
  month = {June},
  url = {https://arxiv.org/abs/1804.09635},
  year = {2018}
}

Acknowledgement

  • We use some of the code from CanaanShen for web crawling.
  • We use some of the code from jiegzhan for our aspect prediction experiments.
  • This work would not have been possible without the efforts of Rich Gerber and Paolo Gai (developers of the softconf.com conference management system), Stefan Riezler and Yoav Goldberg (chairs of CoNLL 2016), and Min-Yen Kan and Regina Barzilay (chairs of ACL 2017), who allowed authors and reviewers to opt in to this dataset during the official review process.
  • We thank the openreview.net, arxiv.org and semanticscholar.org teams for their commitment to promoting transparency and openness in scientific communication.


Issues

Download NIPS dataset

Line 40 in NIPS_crawl.py:
str(data).split("\n")
This returns a list with only one element (the whole string), so the splitting code does not seem to work.
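A likely cause (an assumption, not verified against the crawler): if data is a bytes object, str(data) returns its repr, so newlines appear as literal \n escape sequences and there is nothing for split("\n") to split on. Decoding the bytes first behaves as expected:

    # str() on bytes yields the repr ("b'a\\nb'"), so "\n" never matches;
    # decoding first splits as intended.
    data = b"line one\nline two"              # stand-in for the crawled payload
    print(str(data).split("\n"))              # one element: ["b'line one\\nline two'"]
    print(data.decode("utf-8").split("\n"))   # ['line one', 'line two']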

Couldn't get the NIPS data

I couldn't get the NIPS data because the download URL in the program is invalid.
How can I get this data?

Increase in test result in the master branch

Hey,

When I run the script to calculate the acceptance/rejection percentage, I get 68.42% on 15 examples, but the paper reports an accuracy of 65.3% on the ICLR data. Can you please let me know whether you added some new features, or whether I am doing something wrong?

Thanks!

Huggingface Dataset reviews missing data?

I downloaded the 'reviews' subset of this dataset from the Hugging Face Hub, and all of the review data is missing!

    from datasets import load_dataset
    rev_dataset = load_dataset("peer_read", "reviews", split="train")

The 'reviews' column contains only missing values/empty lists, e.g. {'date': [], 'title': [], 'other_keys': [], 'o...

Is this an error in the uploaded dataset? Thanks!
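As a quick sanity check (a sketch; it assumes the 'reviews' column is a dict of lists, as the printout above suggests), one can count how many rows carry no review data at all:

    from datasets import load_dataset

    rev_dataset = load_dataset("peer_read", "reviews", split="train")

    # A row counts as "empty" if every field in its reviews dict is an empty list.
    empty = sum(
        1 for row in rev_dataset
        if all(not v for v in row["reviews"].values())
    )
    print(f"{empty}/{len(rev_dataset)} rows have no review data")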

Reviewer information

Hi, thanks for the nice work (the first of its kind). I see there are anonymous reviewer comments in your dataset, but how can I find out the identity of a reviewer? My research topic requires training a model on a reviewer's previous work to predict whether the comments attributed to a reviewer were actually written by that reviewer or by someone else (perhaps a student or colleague). Please suggest how this dataset could help in this scenario (any other advice or links are also welcome).

Discrepancy between "annotation_full.tsv" and the actual reviews

While the article identifiers in "annotation_full.tsv" are in the range 0-426, the identifiers in the reviews (/data/iclr_2017/train/reviews/) fall between 300 and 700. As a result, running "assign_annot_iclr_2017.py" ends up matching annotations to the wrong reviews.
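One quick way to confirm the mismatch (a sketch; the first TSV column is assumed to hold the article id, and review filenames are assumed to be "<id>.json"):

    import csv
    import glob
    import os

    # Ids referenced by the annotation file.
    with open("annotation_full.tsv") as f:
        annot_ids = {row[0] for row in csv.reader(f, delimiter="\t")}

    # Ids of the actual review files.
    review_ids = {
        os.path.splitext(os.path.basename(p))[0]
        for p in glob.glob("data/iclr_2017/train/reviews/*.json")
    }

    print("overlap:", len(annot_ids & review_ids))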

Extracting "intermediate", parsed data as a flat Table?

Hi! Great dataset!
I am interested in experimenting on it for my own work, as well as comparing ML approaches on it. I want to get the data in the form of a table (amenable to pandas and the like), while keeping the "raw" data: the raw text, labels, a column marking each row's source, dates, the reviewer number as an ID column, etc.
I know the pipeline munges these features, but its output is too processed for my purposes. Where in the code should I look in order to get the intermediate outputs?

For example: a CSV of all reviews and texts, with the raw variables (each in its own column), across all the datasets and train/test splits?

Thanks!
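In the meantime, a minimal flattening sketch (assuming the per-paper JSON files under <section>/<split>/reviews/ carry a top-level "reviews" list; the field names here are guesses) might look like:

    import glob
    import json
    import os

    import pandas as pd

    rows = []
    for split in ("train", "dev", "test"):
        pattern = os.path.join("data", "iclr_2017", split, "reviews", "*.json")
        for path in glob.glob(pattern):
            with open(path) as f:
                paper = json.load(f)
            # One row per review, keeping raw fields plus provenance columns.
            for i, review in enumerate(paper.get("reviews", [])):
                rows.append({
                    "paper_id": os.path.splitext(os.path.basename(path))[0],
                    "split": split,
                    "reviewer_idx": i,                   # position stands in for an id
                    "comments": review.get("comments"),  # raw review text, if present
                })

    df = pd.DataFrame(rows)
    df.to_csv("iclr_2017_reviews_flat.csv", index=False)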

Improve documentation

The readme file at the root of this repository is a bit rough. It should be an enjoyable, friction-free read. For example:

  • Replace the detailed directory structure with a high level description of what is provided in the repo.
  • Add pointers to other readme files in the repository with more detailed instructions (e.g., for downloading NIPS data, or running experiments).
  • In addition to providing a link to the paper, add a brief description of what it is about.
