arxiv-miner's Introduction

ArXiv Miner

This is a tool to mine arXiv, calculate summary statistics, and classify papers by topic.

  • src includes the original code by @dormaayan to run the analysis on a small data sample.
  • test is the modified code intended to run in an HPC environment, specifically the Sherlock cluster at Stanford. This first pass was a test analysis and is incomplete, as it wasn't run on all the data.
  • analysis is a (hopefully) more complete run of the analysis, which also takes an inventory of files.

arxiv-miner's People

Contributors

dormaayan, vsoch

arxiv-miner's Issues

Where did you derive number of pages?

Hey @dormaayan ! You might be sleeping, but I have another question for you when you are back! I see that we have a number of pages file, and we read from there with csvreader:

npagesFile = "npages.csv"

but this was derived outside of the notebook (and I don't see where). Did you get this from the metadata via the arxiv API? E.g.: here is the arxiv_comment:

{'affiliation': 'None',
 'arxiv_comment': '4 pages, 1 figure, presented at the 37th International Symposium on\n  Multiparticle Dynamics in Berkeley',
 'arxiv_primary_category': {'scheme': 'http://arxiv.org/schemas/atom',
  'term': 'nucl-th'},

Or did you derive it via some length metric of the actual LaTeX (or something else)? I'm going to do the entire thing (on a per-paper basis) in one schwoop, so I won't have any summary CSV file (and it wouldn't be feasible, given the number of papers!)
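If the page count does come from the metadata, it could be pulled out of arxiv_comment one paper at a time, with no summary file needed. Here is a minimal sketch, assuming the comment follows the common free-text "N pages" convention. The helper name pages_from_comment is hypothetical, and since many papers have no comment or no page count in it, the function returns None in those cases:

```python
import re

# Matches e.g. "4 pages" or "1 page". arxiv_comment is free text,
# so this is a heuristic, not a guaranteed field.
_PAGES_RE = re.compile(r"(\d+)\s*pages?\b", re.IGNORECASE)


def pages_from_comment(comment):
    """Return the first page count mentioned in an arxiv_comment, or None."""
    if not comment:
        return None
    match = _PAGES_RE.search(comment)
    return int(match.group(1)) if match else None


# The comment string from the metadata example above.
comment = ('4 pages, 1 figure, presented at the 37th International Symposium on\n'
           '  Multiparticle Dynamics in Berkeley')
print(pages_from_comment(comment))
```

Anything that comes back as None would still need a fallback (e.g. a count derived from the PDF or the LaTeX).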

Getting full arXiv

Now that we have Stanford servers available, we might want to ask Paul (who provided what we have now) for the full arXiv, including the actual figures, which are currently missing.
It may be in our interest to explore the figures themselves in the near future.

Tex representation of a figure

Is the following TeX the only way that a figure can be represented?

thing = "\\begin{tikzpicture}"

I have used the same (with figure) in some papers, and I saw this commented out:

#thing = "\\begin{figure}"

@dormaayan are you only interested in the first tag? There are definitely figure tags in there. If it's the case that the figures (graphics) aren't included this would not represent the true number of figures in the paper, no?
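To make the question concrete, here is a rough sketch of counting several figure markers at once instead of a single tag. The list of patterns is an assumption and not exhaustive, the function names are hypothetical, and the comment stripping is a simple heuristic (it does not handle verbatim blocks or a literal backslash right before a %):

```python
import re

# Environments/commands that commonly indicate a figure in LaTeX source.
# This list is an assumption; papers can draw figures in other ways too.
FIGURE_PATTERNS = {
    "figure": re.compile(r"\\begin\{figure\*?\}"),
    "tikzpicture": re.compile(r"\\begin\{tikzpicture\}"),
    "includegraphics": re.compile(r"\\includegraphics\b"),
}


def strip_comments(tex):
    """Drop everything from an unescaped % to the end of each line."""
    return re.sub(r"(?<!\\)%.*", "", tex)


def count_figure_markers(tex):
    """Count each marker separately in the comment-stripped source."""
    tex = strip_comments(tex)
    return {name: len(pattern.findall(tex))
            for name, pattern in FIGURE_PATTERNS.items()}


sample = r"""
\begin{figure}
  \includegraphics[width=\linewidth]{plot.pdf}
\end{figure}
% \begin{figure} commented out
\begin{figure*}
  \begin{tikzpicture}
    \draw (0,0) -- (1,1);
  \end{tikzpicture}
\end{figure*}
"""
print(count_figure_markers(sample))
```

The counts are kept separate on purpose: a tikzpicture or \includegraphics often sits inside a figure environment, so summing them would double count.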

Filtering down to subset of papers

heyo @dormaayan ! I noticed that we have a function to return a boolean if it's a math paper, based on finding the tag:

#Given a paper name, return whether it is a math paper or not
def isMathPaper(paperName):
    return subjects[paperName].count("math.") > 0

Is there any reason to limit our set at this point? The categories are pretty fuzzy overall, maybe just look at all of them?
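As an alternative to filtering, the top-level category could be tallied for every paper and the subsetting decided later. This is only a sketch: it assumes each entry in subjects is a whitespace-separated string of arXiv category terms (the real structure may differ), and the paper names and function names below are hypothetical:

```python
from collections import Counter


def top_level_categories(subject_string):
    """Return the set of archive prefixes, e.g. 'math.AG cs.LG' -> {'math', 'cs'}."""
    return {term.split(".")[0] for term in subject_string.split()}


def tally_categories(subjects):
    """Count how many papers fall under each top-level category."""
    counts = Counter()
    for paper_subjects in subjects.values():
        # A set is used so a paper with math.AG and math.NT counts once for math.
        counts.update(top_level_categories(paper_subjects))
    return counts


# Hypothetical data, for illustration only.
subjects = {
    "paper-a": "math.AG math.NT",
    "paper-b": "nucl-th",
    "paper-c": "cs.LG math.ST",
}
print(tally_categories(subjects))
```

A paper cross-listed in several archives is counted once per archive, which matches how fuzzy the categories are.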

Download PDF from arxiv

  • to derive number of pages
  • can we count figures from it?
  • what other things can we parse (possibly better than from the LaTeX)?
