
Paper analyser

The goal of this project is to enable quantitative analysis of academic papers.

To achieve the goal, the project contains logic for

  1. Parsing academic white papers into structured representation
  2. Doing analysis on the structured representations

Paper dependency graph

In its current form, the project focuses on taking a list of arbitrary papers as PDFs and creating a dependency graph of the citations among them. The graph shows which papers in the set reference which others. Paper analyser achieves this through the following steps:

  1. Parse the papers to extract relevant data
    1. Read the PDF files to a format usable in Python
    2. Extract title of a given paper
    3. Extract the raw data of the "References" section
  2. Parse the raw "References" section into individual references:
    1. Extract the title and authors of the citation
    2. Normalise the data of the extracted citations
  3. Do dependency analysis based on the above citation extractions
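
The last two steps can be illustrated with a minimal sketch. It is not the project's actual implementation: the function names `split_references`, `normalise_title` and `build_dependency_graph` are hypothetical, the sketch assumes references are numbered with "[1]"-style markers, and it assumes the title of each citation has already been extracted. Reading the PDFs and extracting titles from reference entries are left out.

```python
import re
from collections import defaultdict


def split_references(raw: str) -> list[str]:
    """Split a raw "References" section into entries.

    Assumption: entries are introduced by "[1]", "[2]", ... markers.
    """
    entries = re.split(r"\[\d+\]", raw)
    return [" ".join(entry.split()) for entry in entries if entry.strip()]


def normalise_title(title: str) -> str:
    """Lowercase, drop punctuation and collapse whitespace so titles compare equal."""
    cleaned = re.sub(r"[^a-z0-9 ]+", " ", title.lower())
    return " ".join(cleaned.split())


def build_dependency_graph(papers: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each paper to the set of papers in the data set that it cites.

    `papers` maps a paper title to the titles extracted from its references.
    Citations of papers outside the data set are ignored.
    """
    # Index the papers in the data set by their normalised title.
    known = {normalise_title(title): title for title in papers}
    graph = defaultdict(set)
    for title, cited_titles in papers.items():
        graph[title]  # make sure every paper appears as a node
        for cited in cited_titles:
            match = known.get(normalise_title(cited))
            if match is not None and match != title:
                graph[title].add(match)
    return dict(graph)


# Hypothetical data set: titles and extracted citation titles are made up.
papers = {
    "Paper A": ["paper b.", "Some External Work"],
    "Paper B": ["Paper  A"],
    "Paper C": [],
}
graph = build_dependency_graph(papers)
```

Matching on normalised titles is what lets "paper b." in a reference list resolve to "Paper B" in the data set. A real pipeline would likely need fuzzier matching to cope with PDF extraction noise.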

Usage

For a simple usage example, see simple_example.md.

Paper analyser takes PDF files of academic papers as input and outputs data about these papers. For convenience, we maintain a list of links to software analysis papers focused on software security in our sister repository, software-security-paper-list.

For an example of analysing many papers, see large_example.md.

Example visualisation

We have also created visualisations for the output of the paper analyser, which make it easy to quickly understand the relationships between the academic papers in the data set.

An example of the visualisations is available at https://adalogics.com/software-security-research-citations-visualiser

These visualisations will be open sourced in the near future.

Citation graph example:

[Image: citation graph]

Wordcloud of 85 fuzzing papers

Example of a wordcloud generated from the papers in the "Fuzzing" section of software-security-paper-list. The wordcloud excludes the 100 most common English words (https://www.espressoenglish.net/the-100-most-common-words-in-english/).

[Image: wordcloud of 85 fuzzing papers]

Wordcount of 85 fuzzing papers

A bar plot of word counts across the papers in the "Fuzzing" section of software-security-paper-list. The plot excludes the 100 most common English words (https://www.espressoenglish.net/the-100-most-common-words-in-english/).

[Image: word count bar plot of 85 fuzzing papers]
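
The counting behind both plots can be sketched as follows. This is an illustrative sketch, not the project's code: `count_words` is a hypothetical helper, the input is assumed to be plain text already extracted from the papers, and `COMMON_WORDS` is a short stand-in for the full list of 100 common English words.

```python
import re
from collections import Counter

# Stand-in stop-word set; the real analysis excludes a list of 100 words.
COMMON_WORDS = {"the", "be", "to", "of", "and", "a", "in", "that", "have", "i"}


def count_words(texts, stop_words=COMMON_WORDS):
    """Count word occurrences across all texts, skipping the stop words."""
    counts = Counter()
    for text in texts:
        # Treat any run of ASCII letters as a word, ignoring case.
        words = re.findall(r"[a-z]+", text.lower())
        counts.update(word for word in words if word not in stop_words)
    return counts


# Made-up example sentences, not text from real papers.
counts = count_words(["The fuzzer found a bug", "A bug in the parser"])
top_words = counts.most_common(3)
```

The resulting `Counter` can feed either visualisation: its most frequent entries give the bars of a bar plot, and its full word-to-frequency mapping is the kind of input a wordcloud generator typically expects.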

Installation

git clone https://github.com/AdaLogics/paper-analyser
cd paper-analyser
./install.sh

Contribute

We welcome contributions.

Paper analyser is maintained by:

We are particularly interested in features for:

  1. Improved parsing of the PDF files to get better structured output
  2. Additional data analyses in the project

Feature suggestions

If you would like to contribute but don't have a feature in mind, please see the list below for suggestions:

  • Extraction of authors from papers
  • Extraction of the actual text from the papers. The extracted text could support many further data analyses

paper-analyser's People

Contributors

davidkorczynski, gchers


paper-analyser's Issues

Package(s)

At some point it'd be good to refactor the code base and convert it into a Python package (or multiple packages). I'd suggest we do this at a later stage, when we have a better idea of the overall structure.

Avoid duplicate work effort

Grobid: A machine learning software for extracting information from scholarly documents

https://github.com/kermitt2/grobid

We should avoid doing double work with other projects and instead focus on collaborating and solving new problems. Grobid seems to be a super relevant project to paper-analyser. This issue is to keep track of clarifying overlap between the projects and may have an effect on how we progress with paper-analyser.

@gchers

Unclear how to reproduce Alexandria

The steps to reproduce alexandria are not entirely clear. Could they be added? We are using some grobid_client, although it's not clear which module this is (https://github.com/kermitt2/grobid_client_python ?).

Could we perhaps add a script/doc to reproduce?

I think it should be something along the lines of:

git clone https://github.com/AdaLogics/paper-analyser
cd paper-analyser/alexandria
python3 -m venv venv
. venv/bin/activate

docker pull lfoppiano/grobid:0.6.2
docker run -t --rm --init lfoppiano/grobid:0.6.2
pip install pymongo bs4 pygrobid
git clone https://github.com/kermitt2/grobid_client_python
python ./runner.py ../example-papers

Grobid doc for reference:
https://grobid.readthedocs.io/en/latest/Grobid-docker/

I believe the above steps should be close to enough, but they didn't do it entirely for me.

@gchers ?

Set up CI

Tasks:

  • verify that every PR adheres to coding standards (e.g., use black).
  • run tests to verify that every PR doesn't break stuff.
