arxiv.py's Introduction

arxiv.py

Python wrapper for the arXiv API.

About arXiv

arXiv is a project by the Cornell University Library that provides open access to 1,000,000+ articles in Physics, Mathematics, Computer Science, Quantitative Biology, Quantitative Finance, and Statistics.

Usage

Installation

$ pip install arxiv

In your Python script, include the line:

import arxiv

Search

A Search specifies a search of arXiv's database.

arxiv.Search(
  query: str = "",
  id_list: List[str] = [],
  max_results: float = float('inf'),
  sort_by: SortCriterion = SortCriterion.Relevance,
  sort_order: SortOrder = SortOrder.Descending
)
  • query: An arXiv query string. Advanced query formats are documented in the arXiv API User's Manual.
  • id_list: A list of arXiv record IDs (typically of the format "0710.5765v1"). See the arXiv API User's Manual for documentation of the interaction between query and id_list.
  • max_results: The maximum number of results to be returned in an execution of this search. To fetch every result available, set max_results=float('inf') (default); to fetch up to 10 results, set max_results=10. The API's limit is 300,000 results.
  • sort_by: The sort criterion for results: relevance, lastUpdatedDate, or submittedDate.
  • sort_order: The sort order for results: 'descending' or 'ascending'.
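As a rough illustration of the query parameter, the sketch below assembles a field-prefixed query string of the kind described in the arXiv API User's Manual (for example, ti for title and au for author). The build_query helper is hypothetical and not part of this package; it only produces a string that could be passed as query.

```python
def build_query(**fields: str) -> str:
    """Join field-prefixed terms with AND, quoting multi-word phrases.

    Hypothetical helper for illustration; keys are assumed to be arXiv
    field prefixes such as "ti" (title), "au" (author), or "cat" (category).
    """
    parts = []
    for prefix, term in fields.items():
        if " " in term:
            # Wrap phrases in double quotes so they are matched as a unit.
            term = f'"{term}"'
        parts.append(f"{prefix}:{term}")
    return " AND ".join(parts)


query = build_query(au="del_maestro", ti="checkerboard")
print(query)  # au:del_maestro AND ti:checkerboard
```

The resulting string could then be used as arxiv.Search(query=query).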

To fetch arXiv records matching a Search, use (Search).results() or (Client).results(search); both return a generator yielding Results.

Example: fetching results

Print the titles of the 10 most recent articles related to the keyword "quantum":

import arxiv

search = arxiv.Search(
  query = "quantum",
  max_results = 10,
  sort_by = arxiv.SortCriterion.SubmittedDate
)

for result in search.results():
  print(result.title)

Fetch and print the title of the paper with ID "1605.08386v1":

import arxiv

search = arxiv.Search(id_list=["1605.08386v1"])
paper = next(search.results())
print(paper.title)

Result

The Result objects yielded by (Search).results() include metadata about each paper and some helper functions for downloading their content.

The meaning of the underlying raw data is documented in the arXiv API User's Manual: Details of Atom Results Returned.

  • result.entry_id: A URL of the form http://arxiv.org/abs/{id}.
  • result.updated: When the result was last updated.
  • result.published: When the result was originally published.
  • result.title: The title of the result.
  • result.authors: The result's authors, as arxiv.Authors.
  • result.summary: The result's abstract.
  • result.comment: The authors' comment if present.
  • result.journal_ref: A journal reference if present.
  • result.doi: A URL for the resolved DOI to an external resource if present.
  • result.primary_category: The result's primary arXiv category. See arXiv: Category Taxonomy.
  • result.categories: All of the result's categories. See arXiv: Category Taxonomy.
  • result.links: Up to three URLs associated with this result, as arxiv.Links.
  • result.pdf_url: A URL for the result's PDF if present. Note: this URL also appears among result.links.

They also expose helper methods for downloading papers: (Result).download_pdf() and (Result).download_source().
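Because result.entry_id is a URL of the form http://arxiv.org/abs/{id}, the ID can be recovered with plain string handling. The sketch below is a minimal illustration under that assumption; short_id and strip_version are hypothetical helpers written for this example, not functions documented above.

```python
import re


def short_id(entry_id: str) -> str:
    """Return the part of an entry URL after "/abs/" (hypothetical helper)."""
    return entry_id.split("/abs/", 1)[-1]


def strip_version(arxiv_id: str) -> str:
    """Drop a trailing version suffix such as "v1" (hypothetical helper)."""
    return re.sub(r"v\d+$", "", arxiv_id)


entry_id = "http://arxiv.org/abs/1605.08386v1"
print(short_id(entry_id))                 # 1605.08386v1
print(strip_version(short_id(entry_id)))  # 1605.08386
```

In practice, the string passed to short_id would be a Result's entry_id rather than a literal.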

Example: downloading papers

To download a PDF of the paper with ID "1605.08386v1", run a Search and then use (Result).download_pdf():

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the PDF to the PWD with a default filename.
paper.download_pdf()
# Download the PDF to the PWD with a custom filename.
paper.download_pdf(filename="downloaded-paper.pdf")
# Download the PDF to a specified directory with a custom filename.
paper.download_pdf(dirpath="./mydir", filename="downloaded-paper.pdf")

The same interface is available for downloading .tar.gz files of the paper source:

import arxiv

paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
# Download the archive to the PWD with a default filename.
paper.download_source()
# Download the archive to the PWD with a custom filename.
paper.download_source(filename="downloaded-paper.tar.gz")
# Download the archive to a specified directory with a custom filename.
paper.download_source(dirpath="./mydir", filename="downloaded-paper.tar.gz")

Client

A Client specifies a strategy for fetching results from arXiv's API; it hides pagination and retry logic from the caller.

For most use cases the default client should suffice. You can construct it explicitly with arxiv.Client(), or use it via the (Search).results() method.

arxiv.Client(
  page_size: int = 100,
  delay_seconds: int = 3,
  num_retries: int = 3
)
  • page_size: The number of papers to fetch from arXiv per page of results. Smaller pages can be retrieved faster, but may require more round-trips. The API's limit is 2000 results per page.
  • delay_seconds: The number of seconds to wait between requests for pages. arXiv's Terms of Use ask that you "make no more than one request every three seconds."
  • num_retries: The number of times the client will retry a request that fails, either with a non-200 HTTP status code or with an unexpected number of results given the search parameters.
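To get a feel for the trade-off between page_size and delay_seconds, the following back-of-the-envelope sketch estimates how many page requests a fetch needs and how long the client would spend waiting between them. It assumes the delay applies only between consecutive page requests and ignores retries and network latency; plan_requests is a hypothetical helper, not part of this package.

```python
import math


def plan_requests(num_results: int, page_size: int = 100, delay_seconds: int = 3):
    """Estimate (page requests, seconds spent waiting) for a fetch.

    Assumption: one delay between each pair of consecutive page requests,
    with no retries and no time spent on the requests themselves.
    """
    pages = math.ceil(num_results / page_size)
    waiting = max(pages - 1, 0) * delay_seconds
    return pages, waiting


print(plan_requests(1000))                                   # (10, 27)
print(plan_requests(2500, page_size=1000, delay_seconds=10)) # (3, 20)
```

Under these assumptions, larger pages reduce the number of enforced delays, at the cost of slower individual requests.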

Example: fetching results with a custom client

(Search).results() uses the default client settings. If you want to use a client you've defined instead of the defaults, use (Client).results(...):

import arxiv

big_slow_client = arxiv.Client(
  page_size = 1000,
  delay_seconds = 10,
  num_retries = 5
)

# Prints 1000 titles before needing to make another request.
for result in big_slow_client.results(arxiv.Search(query="quantum")):
  print(result.title)

Example: logging

To inspect this package's network behavior and API logic, configure an INFO-level logger.

>>> import logging, arxiv
>>> logging.basicConfig(level=logging.INFO)
>>> paper = next(arxiv.Search(id_list=["1605.08386v1"]).results())
INFO:arxiv.arxiv:Requesting 100 results at offset 0
INFO:arxiv.arxiv:Requesting page of results
INFO:arxiv.arxiv:Got first page; 1 of inf results available
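If INFO-level output from other libraries is unwanted, one option is to raise the level only for this package's logger. The sketch below uses the standard logging module and assumes the logger name "arxiv", inferred from the "arxiv.arxiv" prefix in the output above; treat the name as an assumption.

```python
import logging

# Keep the root logger quiet: only warnings and above from other libraries.
logging.basicConfig(level=logging.WARNING)

# Assumed logger name, inferred from the "INFO:arxiv.arxiv:..." lines above.
# Child loggers such as "arxiv.arxiv" inherit this level.
arxiv_logger = logging.getLogger("arxiv")
arxiv_logger.setLevel(logging.INFO)
```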
