penrose / arxiv-miner

analysis of arxiv equations and figures to guide Penrose development

TeX 37.36% Jupyter Notebook 62.41% Python 0.24%

arxiv topics penrose

arxiv-miner's Introduction

ArXiv Miner

This is a tool to mine ArXiv, calculate summary statistics, and classify papers by topic.

src includes the original code by @dormaayan to run the analysis on a small data sample.
test is the modified code, intended to be run in an HPC environment (specifically the Sherlock cluster at Stanford). This first pass was a test analysis and is incomplete, since it wasn't run on all the data.
analysis is a (hopefully) more complete run of the analysis, which also takes an inventory of files.

arxiv-miner's Issues

Where did you derive number of pages?

Hey @dormaayan ! You might be sleeping, but I have another question for you when you are back! I see that we have a number of pages file, and we read from there with csvreader:

npagesFile = "npages.csv"

but this was derived outside of the notebook (and I don't see where). Did you get this from the metadata via the arxiv API? E.g.: here is the arxiv_comment:

{'affiliation': 'None',
 'arxiv_comment': '4 pages, 1 figure, presented at the 37th International Symposium on\n  Multiparticle Dynamics in Berkeley',
 'arxiv_primary_category': {'scheme': 'http://arxiv.org/schemas/atom',
  'term': 'nucl-th'},

Or did you derive it via some length metric of the actual LaTeX (or something else)? I'm going to do the entire thing (on a per-paper basis) in one swoop, so I won't have any summary csv file (and it wouldn't be feasible, given the number of papers!)
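For reference, here is a minimal sketch of the first option raised above: pulling a page count out of the arxiv_comment text on a per-paper basis. The function name pages_from_comment and the regular expression are assumptions made for illustration, not code from the repository, and many papers state no page count in the comment at all, so the result can be None.

```python
import re
from typing import Optional

# Hypothetical pattern (not from the repository): matches "4 pages", "1 page", "10pages".
_PAGES_RE = re.compile(r"(\d+)\s*pages?\b", re.IGNORECASE)


def pages_from_comment(comment: Optional[str]) -> Optional[int]:
    """Return the page count stated in an arxiv_comment string, or None if absent."""
    if not comment:
        return None
    match = _PAGES_RE.search(comment)
    return int(match.group(1)) if match else None


# Metadata fragment copied from the example above.
entry = {
    "arxiv_comment": "4 pages, 1 figure, presented at the 37th International Symposium on\n  Multiparticle Dynamics in Berkeley",
}
print(pages_from_comment(entry.get("arxiv_comment")))
```

Because the comment field is free text written by the authors, this only recovers what they chose to state; it says nothing about whether npages.csv was actually built this way.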

Getting full arXiv

Now that we have Stanford servers available to us, we might want to ask Paul (who provided what we have now) for the full arXiv, including the actual figures, which are currently missing.
It may be in our interest to also explore the figures themselves in the near future.

Tex representation of a figure

Is the following tex the only way that a figure can be represented?

thing = "\\begin{tikzpicture}"

I have used the same (with figure) in some papers, and I saw this commented out:

#thing = "\\begin{figure}"

@dormaayan are you only interested in the first tag? There are definitely figure tags in there. If it's the case that the figures (graphics) aren't included this would not represent the true number of figures in the paper, no?
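To make the question concrete, here is a rough sketch that counts several figure-like environments at once rather than a single tag. The environment list in FIGURE_ENVS and the function name count_figure_envs are assumptions for illustration; the repository code only shows the single string search quoted above.

```python
import re
from collections import Counter

# Environments treated as figures here; this list is an assumption, not the repository's.
FIGURE_ENVS = ("figure", "figure*", "tikzpicture")

_BEGIN_RE = re.compile(r"\\begin\{([^}]+)\}")
# Strip TeX comments: an unescaped % through the end of the line.
_COMMENT_RE = re.compile(r"(?<!\\)%.*")


def count_figure_envs(tex: str) -> Counter:
    """Count \\begin{...} occurrences per figure-like environment, ignoring commented-out lines."""
    uncommented = _COMMENT_RE.sub("", tex)
    counts = Counter(m.group(1) for m in _BEGIN_RE.finditer(uncommented))
    return Counter({env: counts[env] for env in FIGURE_ENVS if counts[env]})


sample = r"""
\begin{figure}
  \begin{tikzpicture}
  \end{tikzpicture}
\end{figure}
% \begin{figure}
\begin{figure*}
\end{figure*}
"""
print(count_figure_envs(sample))
```

Note that a tikzpicture nested inside a figure shows up under both keys, so summing the counts would double-count it; keeping the tallies separate leaves that decision to the analysis step.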

Filtering down to subset of papers

heyo @dormaayan ! I noticed that we have a function to return a boolean if it's a math paper, based on finding the tag:

#Given a paper name, return whether it is a math paper or not
def isMathPaper(paperName):
    return subjects[paperName].count("math.") > 0

Is there any reason to limit our set at this point? The categories are pretty fuzzy overall, maybe just look at all of them?
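One way to look at all categories instead of filtering is to tally papers per top-level archive. The sketch below assumes that subjects maps a paper name to a space-separated string of category tags (such as "math.AG cs.LG"); that structure, the sample data, and the helper names are guesses for illustration, not taken from the notebook.

```python
from collections import Counter

# Hypothetical sample data; the real `subjects` structure is assumed, not confirmed.
subjects = {
    "paper-a": "math.AG math.NT",
    "paper-b": "cs.LG stat.ML",
    "paper-c": "nucl-th",
}


def top_level_categories(paper_name: str) -> set:
    """Return the archive part of each category tag, e.g. 'math' for 'math.AG'."""
    return {cat.split(".")[0] for cat in subjects[paper_name].split()}


def papers_per_archive() -> Counter:
    """Count how many papers carry at least one tag from each archive."""
    counts = Counter()
    for name in subjects:
        counts.update(top_level_categories(name))
    return counts


print(papers_per_archive())
```

A paper tagged with several archives is counted once in each, which fits the point that the categories are fuzzy.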

Download PDF from arxiv

to derive number of pages
can we count figures from it?
what other things can we parse (possibly better than the latex?)
