Coder Social home page Coder Social logo

harishgovardhandamodar / arxiv-public-datasets Goto Github PK

View Code? Open in Web Editor NEW

This project forked from mattbierbaum/arxiv-public-datasets

0.0 0.0 0.0 230 KB

A set of scripts to grab public datasets from resources related to arXiv

Home Page: https://arxiv.org/abs/1905.00075

License: MIT License

Python 89.80% Jupyter Notebook 10.20%

arxiv-public-datasets's Introduction

arXiv public datasets

This project is part of a submission to an ICLR 2019 workshop, RLGM Representation Learning on Graphs and Manifolds. The manuscript can be found on arXiv:1905.00075. Our primary purpose is to develop a set of tools to standardize and facilitate use of the arXiv as a dataset. Due to licensing and distribution issues, our work is primarily a set of scripts which builds the dataset from various public data sources. There are additional cleaning, organization, and aggregation functions that it performs as well.

This project is under development as we try to best fit the needs of the community. We have adopted semver and as such will denote major releases with the first numeral in the tagged version.

Currently, the project grabs or generates:

  • Article metadata -- title, authors string, category, doi, abstract, submitter
  • PDFs -- all PDFs available through arXiv bulk download
  • Plain text -- PDFs converted to UTF-8 encoded plain text
  • Citation graph -- intra-arXiv citation graph between arXiv IDs only (generated from plain text)
  • Author string parsing -- convert metadata author strings into standardized list of name, affiliations

We are able to host certain generated portions of this dataset as released snapshots. The iterations can be found under the releases tab: Releases. However, the rest of it must be generated locally.

Kaggle hosting arXiv data

Kaggle has agreed to cover the cost of hosting the arXiv PDFs, and has created some challenges and tasks associated with the arXiv data. You can use our tools from their data source: download it from Google cloud using their instructions, storing the PDFs in your local $ARXIV_DATA/tarpdfs directory. Skip the downloading stage of downloading the PDFs below.

Contributing

Thank you for your interest in contributing! We welcome many forms of input -- GitHub issues, email, pull requests to name a few. Let us know how if there are ways we can better present and preserve the dataset.

Citing

If you wish to cite the dataset in your own work, a suggested bibtex entry is:

@misc{clement2019arxiv,
    title={On the Use of ArXiv as a Dataset},
    author={Colin B. Clement and Matthew Bierbaum and Kevin P. O'Keeffe and Alexander A. Alemi},
    year={2019},
    eprint={1905.00075},
    archivePrefix={arXiv},
    primaryClass={cs.IR}
}

Setup on Linux

Install the required system packages (or use an alternative Python distribution of your choice). For Debian / Ubuntu / similar:

sudo apt install python3 python3-pip python3-virtualenv poppler-utils

Download the code and prepare the python environment:

git clone https://github.com/mattbierbaum/arxiv-public-datasets
cd arxiv-public-datasets

virtualenv venv
. venv/bin/activate

pip3 install -e .
pip3 install -r requirements.txt

Decide where the data should live and modify the config.json file. This directory needs to have adequate space to hold ~ 1TB of pdfs and ~ 70GB of text if you so choose to retrieve them:

cp config.json.example config.json
[set ARXIV_DATA in config.json to your own directory]

The scripts in bin will then create any of the three subdirectories:

$ARXIV_DATA/tarpdfs   # raw pdf files from Amazon AWS bucket
$ARXIV_DATA/fulltext  # .txt from raw .pdf
$ARXIV_DATA/output    # co-citation network, parsed author strings, etc

Article metadata

Run OAI metadata harvester

python bin/metadata.py [OPTIONAL filepath.json.gz]

This will download the entire ArXiv metadata set, saving it as a series of gzip-compressed JSON entries. The default save location is $ARXIV_DATA/arxiv-metadata-oai-<date>.json.gz. This process will take at least 6 hours, as the OAI server only sends 1000 entries every 15 seconds. A resumption token is saved, so the process can be restarted by running again.

PDFs (AWS download only)

Prepare credentials

In addition to the setup above, you need to prepare your AWS credentials for use with boto3, the Python AWS library. A long explanation is available here while the quick method is to:

apt install awscli
aws configure

Bulk download of ArXiv PDFs

This download costs about 100 USD (and is 1.1TB) at the time of writing, as the ArXiv bulk download only allows requester-pays AWS S3 downloads. Ensure that you have at least 1TB of free space in the directory specified in config.json:

python bin/pdfdownload.py [OPTIONAL manifest_file.json.gz]

Plain text

The tool works for both AWS download (.tar files) and Kaggle download (plain PDFs) by adding --PLAIN_PDFS optional argument.

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] --PLAIN_PDFS [OPTIONAL, use if plain PDFs are dowloaded]

Bulk PDF conversion for AWS download

To use our tool for text conversion of all the PDFs from the ArXiv bulk download described above, execute the following. NOTE: if you have not already downloaded the PDFs, this tool will do so. If you have downloaded them, be sure to not change the $ARXIV_DATA so that it will not re-download the tars.

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] 

At the time of writing, converting 1.39 million articles requires over 400 core-hours using two Intel Xeon E5-2600 CPUs.

Bulk PDF conversion for plain PDFs, e.g. downdloaded from Kaggle (Google Cloud)

Download from AWS are .tar files. If plains PDFs are downloaded, e.g. from Kaggle, use the optional argument --PLAIN_PDFS:

python bin/fulltext.py -N [OPTIONAL number_of_processes, default cpu_count] --PLAIN_PDFS

Cocitation network

To generate the cocitation network, you first must have the full text. Then, with the directories still set up, run:

python bin/cocitations.py [OPTIONAL number_of_processes, default cpu_count]

The cocitation network will by default be saved in $ARXIV_DATA/output/internal-citations.json.gz.

Author string split

The OAI metadata from the ArXiv features author strings as submitted by article authors. In order to use them in a principled way, theses strings must be parsed and split. To generate and save these author splittings, run:

python bin/authorsplit.py [OPTIONAL number_of_processes, default cpu_count]

The split author strings will by default be saved in $ARXIV_DATA/output/authors-parsed.json.gz.

Dataset Metadata

The following table is necessary for this dataset to be indexed by search engines such as Google Dataset Search.

property value
name arXiv Public Datasets
url
sameAs https://github.com/mattbierbaum/arxiv-public-datasets
description The arXiv pre-print service is the de facto venue for publishing in many scientific disciplines. This repository provides tools for using all the publicly available information provided by the arXiv to download all of the publications and their metadata, extract fulltext from PDFs, and build a co-citation graph. For each publication the tools provide access to: * **Article metadata** -- title, authors string, category, doi, abstract, submitter * **PDFs** -- all PDFs available through arXiv bulk download * **Plain text** -- PDFs converted to UTF-8 encoded plain text * **Citation graph** -- intra-arXiv citation graph between arXiv IDs only (generated from plain text) * **Author string parsing** -- convert metadata author strings into standardized list of name, affiliations
citation https://arxiv.org/abs/1905.00075
license https://raw.githubusercontent.com/mattbierbaum/arxiv-public-datasets/master/LICENSE

arxiv-public-datasets's People

Contributors

colinclement avatar quynhneo avatar mattbierbaum avatar khev avatar bluetyson avatar guumaster avatar varshakishore avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.