arXiv Collector allows you to export your arXiv searches as neatly formatted BibTeX files for easy import into most common scientific reference managers (like Zotero or EndNote).

License: GNU General Public License v3.0

Python 100.00%

arXivCollector

arXivCollector allows you to export your arXiv searches as neatly formatted BibTeX files for easy import into most common scientific reference managers (like Zotero or EndNote). It does not require much prior programming knowledge. A particularly useful feature is that the resulting file includes DOIs and direct links to article PDFs. The references can also be saved as a CSV file.

Installation

  1. Have Python installed (it can be downloaded from python.org).
  2. Clone the repository by running the following command in a terminal:
git clone https://github.com/koenraijer/arxivcollector.git
  3. Navigate to the cloned repository:
cd path/to/arxivcollector

Getting started

arXivCollector can be used in two ways:

  • By importing the arXivCollector() class;
  • By executing the arxivcollector.py script from the command line.

Step 1: obtain an arXiv search results URL

To obtain an arXiv search results URL for your search query, go to https://arxiv.org/ or to the advanced search page and construct your search query. Press the big blue "Search" button and wait until the page that displays the search results has loaded. Now copy the entire URL as is, and you're done ✅.
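
If you would rather assemble the URL in code than copy it from the browser, the sketch below rebuilds the example search URL used in this README from its query parameters. The parameter names and values are copied from that example URL; whether arXiv accepts other combinations is not covered here, so treat this as an illustration rather than a reference for the search form.

```python
from urllib.parse import urlencode

# Query parameters copied from the example search URL in this README.
# Empty strings are kept because the copied URL contains them too.
params = [
    ("advanced", ""),
    ("terms-0-operator", "AND"),
    ("terms-0-term", "stochastic parrot"),  # the space is encoded as "+"
    ("terms-0-field", "title"),
    ("classification-physics_archives", "all"),
    ("classification-include_cross_list", "include"),
    ("date-filter_by", "all_dates"),
    ("date-year", ""),
    ("date-from_date", ""),
    ("date-to_date", ""),
    ("date-date_type", "submitted_date"),
    ("abstracts", "show"),
    ("size", "50"),
    ("order", "-announced_date_first"),
]

search_url = "https://arxiv.org/search/advanced?" + urlencode(params)
print(search_url)
```

Changing the value of terms-0-term is the simplest way to adapt the sketch to your own title search.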

Step 2: use arXivCollector in one of two ways

In Python

Run the following Python code (e.g., in a script or from a Jupyter notebook).

from arxivcollector import arXivCollector

# Initiate a new instance of the arXivCollector class
collector = arXivCollector()
# Set the title of the exported file (optional)
collector.set_title("Parrots")
# Pass the search URL to the run method
collector.run('https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first')

After running this with your own search URL and title, a new file should appear in the parent directory of arXivCollector.

From the command line

The first argument after arxivcollector.py is the search URL; the second argument is your title. Quote the URL so that the shell does not interpret the & characters it contains.

python arxivcollector.py "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND&terms-0-term=stochastic+parrot&terms-0-field=title&classification-physics_archives=all&classification-include_cross_list=include&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date=&date-date_type=submitted_date&abstracts=show&size=50&order=-announced_date_first" "Parrots"
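
The search URL contains characters such as & and ?, which a POSIX shell treats specially when they are unquoted. As a small sketch of how to produce a safely quoted command line, the snippet below uses Python's standard shlex module. It only builds and prints the command string; it does not run arxivcollector.py. Windows shells quote differently, so this applies to POSIX-style shells only.

```python
import shlex

# The example search URL from this README, split over several lines for readability.
search_url = (
    "https://arxiv.org/search/advanced?advanced=&terms-0-operator=AND"
    "&terms-0-term=stochastic+parrot&terms-0-field=title"
    "&classification-physics_archives=all"
    "&classification-include_cross_list=include"
    "&date-filter_by=all_dates&date-year=&date-from_date=&date-to_date="
    "&date-date_type=submitted_date&abstracts=show&size=50"
    "&order=-announced_date_first"
)

# Script name first, then the search URL, then the title.
args = ["python", "arxivcollector.py", search_url, "Parrots"]

# shlex.join quotes every argument that needs it for a POSIX shell.
command = shlex.join(args)
print(command)
```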

Special thanks

Fatima et al. served as the main inspiration for this code. To see their paper, go to: https://doi.org/10.1016/j.infsof.2023.107251.

Full reference:

Fatima, R., Yasin, A., Liu, L., Wang, J., & Afzal, W. (2023). Retrieving arXiv, SocArXiv, and SSRN metadata for initial review screening. Information and Software Technology, 161, 107251. https://doi.org/10.1016/j.infsof.2023.107251

API

Class: arXivCollector

This class is used to collect metadata from the arXiv website and save it in either BibTeX or CSV format.

__init__(self, user_agent, num_abstracts, arxiv_doi_prefix, default_item_type, verbose, mode) -> None

Initializes an instance of the arXivCollector class.

Parameters:
  • user_agent (str): The User-Agent header to use when sending requests. Default is a common User-Agent string for a Chrome browser.
  • num_abstracts (int): The number of abstracts you want displayed per page (on the arXiv website). Default is 50.
  • arxiv_doi_prefix (str): The prefix for the DOI of arXiv papers. Default is "https://doi.org/10.48550".
  • default_item_type (str): The default item type for the BibTeX entries. Default is "ARTICLE".
  • verbose (bool): Whether to print verbose output. Default is False.
  • mode (str): The mode to use when saving the collected data. Can be either "bibtex" or "csv". Default is "bibtex".
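
To make the roles of arxiv_doi_prefix and default_item_type more concrete, here is a minimal sketch of how one record could be rendered as a BibTeX entry. It is not arXivCollector's actual code: the function to_bibtex, the citation key format, and the choice of fields are hypothetical. The two defaults come from the parameter list above, and the sketch assumes the usual arXiv conventions of DOIs ending in arXiv.<identifier> and PDFs served under https://arxiv.org/pdf/. The record values are placeholders.

```python
ARXIV_DOI_PREFIX = "https://doi.org/10.48550"  # default listed for arxiv_doi_prefix
DEFAULT_ITEM_TYPE = "ARTICLE"                  # default listed for default_item_type


def to_bibtex(arxiv_id, title, authors, year):
    """Render one record as a BibTeX entry (hypothetical layout)."""
    fields = {
        "title": title,
        "author": " and ".join(authors),  # BibTeX separates authors with "and"
        "year": str(year),
        "doi": f"{ARXIV_DOI_PREFIX}/arXiv.{arxiv_id}",
        "url": f"https://arxiv.org/pdf/{arxiv_id}",
    }
    body = ",\n".join(f"  {key} = {{{value}}}" for key, value in fields.items())
    return f"@{DEFAULT_ITEM_TYPE}{{arxiv_{arxiv_id},\n{body}\n}}"


# Placeholder values only; this is not a real paper.
entry = to_bibtex("0000.00000", "Placeholder title", ["A. Author", "B. Author"], 2023)
print(entry)
```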

set_title(self, title: str)

Sets the title of the output file.

Parameters:
  • title (str): The title to set.

run(self, url)

Starts the collection process for the specified URL.

Parameters:
  • url (str): The URL to start the collection process for.
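
A search can return more results than fit on one page, so a collector has to visit several result pages. The sketch below shows one way to derive the per-page URLs from a search URL. It assumes arXiv's search pages accept a start offset next to the size parameter seen in the example URL; the helper page_urls is hypothetical and is not part of arXivCollector, whose run method may work differently.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def page_urls(search_url, total_results, page_size=50):
    """Yield one URL per results page by setting `size` and `start` (assumed names)."""
    parts = urlsplit(search_url)
    # Drop existing paging parameters so they are not duplicated below.
    query = [
        (key, value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
        if key not in ("size", "start")
    ]
    for start in range(0, total_results, page_size):
        paged = query + [("size", str(page_size)), ("start", str(start))]
        yield urlunsplit(parts._replace(query=urlencode(paged)))


# Illustrative URL and result count; 120 results at 50 per page give 3 pages.
example = "https://arxiv.org/search/?query=stochastic+parrot&searchtype=all&size=50"
for url in page_urls(example, total_results=120):
    print(url)
```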
