edco95 / scientific-paper-summarisation

Machine learning models to automatically summarise scientific papers

automatic-summarization machine-learning natural-language-processing python scientific-papers

scientific-paper-summarisation's Introduction

Automatic Summarisation of Scientific Papers

Have you ever had to do a literature review as part of a research project and thought "I wish there was a quicker way of doing this"? This code aims to provide that quicker way: a supervised-learning-based extractive summarisation system for scientific papers.

For more information on the project, please see:

Ed Collins, Isabelle Augenstein, Sebastian Riedel. A Supervised Approach to Extractive Summarisation of Scientific Papers. To appear in Proceedings of CoNLL, July 2017.

Ed Collins. A supervised approach to extractive summarisation of scientific papers. UCL MEng thesis, May 2017.

Code Description

The various code files and folders are described here. Note that the data used is not uploaded here, but the repository is nonetheless over 1 GB in size.

  • Analysis - A folder containing code used to analyse the generated summaries and create various pretty graphs. It is not essential to the functioning of the summarisers and will not work without the data.
  • Data - Where all data should be stored. The folder Utility_Data contains things such as stopword lists, permitted titles, and a count of how many different papers each word occurs in (used for TF-IDF; calculated automatically by DataTools/DataPreprocessing/cspubsumext_creator.py).
  • DataTools - Contains files for manipulating and preprocessing the data. There are two particularly important files in this folder: useful_functions.py contains many important functions used to run the system, and DataPreprocessing/cspubsumext_creator.py automatically preprocesses the parsed papers produced by the code in DataDownloader into the form used to train the models in the research.
  • Evaluation - Contains code to evaluate summaries and calculate the ROUGE-L metric, with thanks to hapribot.
  • Models - Contains the code which constructs and trains each of the supervised learning modules that form the core of the summarisation system. All written in TensorFlow.
  • Summarisers - Contains the code which takes the trained models and uses them to actually create summaries of papers.
  • Visualisations - Contains code which visualises summaries by colouring them and saving them as HTML files. This is not essential to run the system.
  • Word2Vec - Contains the code necessary to train the Word2Vec model used for word embeddings. The actual trained Word2Vec model is not uploaded because it is too large.
  • DataDownloader - Contains code to download and parse the original XML paper files into the format currently used by this system, in which each section title is delineated by "@&#". Each paper can then be read as a single string and split into its constituent sections on this symbol, which is very unlikely to ever occur in the text. The important file is acquire_data.py. A minimal sketch of this splitting step is given after this list.
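The following is a minimal sketch of that splitting step, assuming a parsed paper stored as a plain-text file; the file name and the alternating title/body layout are illustrative assumptions, not guarantees about the repository's exact on-disk format.

```python
# Minimal sketch: split a parsed paper on the "@&#" delimiter.
# The file name and the alternating title/body layout are assumptions for illustration.
SECTION_DELIMITER = "@&#"

def read_sections(path):
    with open(path) as f:
        raw = f.read()
    # Splitting on the delimiter yields the section titles and bodies in order.
    parts = [p.strip() for p in raw.split(SECTION_DELIMITER) if p.strip()]
    # Assume titles and bodies alternate: title, body, title, body, ...
    return zip(parts[0::2], parts[1::2])

if __name__ == "__main__":
    for title, body in read_sections("Data/Papers/Full/Papers_With_Section_Titles/example_paper.txt"):
        print("%s: %d words" % (title, len(body.split())))
```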

Running the Code

Before attempting to run this code you should set up a suitable virtualenv using Python 2.7. Install all of the requirements listed in requirements.txt with pip install -r requirements.txt.

To download the dataset and preprocess it into the form used to train the models in the paper, first run DataDownloader/acquire_data.py. This will download all of the papers and parse them into the format described above, with sections separated by the special symbol "@&#", so that each paper can be read as a string and split into titles and sections on this symbol.

To turn these downloaded papers into training data, run DataTools/DataPreprocessing/cspubsumext_creator.py. This will take a while to run depending on your machine and number of cores (~2 hours on a late-2016 MacBook Pro with a dual-core i7), but it will create all of the necessary files to train the models. These are stored by default in Data/Training_Data/: there is an individual JSON file for each paper and a single JSON file called all_data.json which is a list of all of the individual items of training data. This code now uses the ultra-fast uJSON library, which reads the data much faster than the previous version, which used pickle.
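For reference, here is a minimal sketch of loading the preprocessed training data with uJSON, assuming the default output path above; the structure of each training item is not shown and should be checked against the preprocessing code.

```python
# Minimal sketch: load the preprocessed training data with ujson.
# Assumes the default output path given above; the structure of each
# training item should be checked against the preprocessing code.
import ujson

with open("Data/Training_Data/all_data.json") as f:
    all_data = ujson.load(f)

print("Loaded %d training items" % len(all_data))
```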

All of the models and summarisers should then be usable.

Be sure to check that all of the paths are correctly set! These are in DataDownloader/acquire_data.py for downloading papers, and in DataTools/useful_functions.py otherwise.

NOTE: The code in DataTools/DataPreprocessing/AbstractNetPreprocessor.py is still unpleasantly inefficient and is currently used in the summarisers themselves. The next code update will fix this and streamline the process of running the trained summarisers.

Other Notes

If you have read or are reading the MEng thesis or CoNLL paper corresponding to this code, then SAFNet = SummariserNet, SFNet = SummariserNetV2, SNet = LSTM, SAF+F Ens = EnsembleSummariser, S+F Ens = EnsembleV2Summariser.

scientific-paper-summarisation's People

Contributors

edco95, isabelleaugenstein

scientific-paper-summarisation's Issues

Issue downloading Highlights in the XML

Hi, I tried to download the dataset with the steps written on your GitHub, but none of the papers has a "Highlights" section, which is basically the backbone of this paper. I was wondering if you could guide me on where I am going wrong. I have already set up the project correctly, I have the API key, and I am running the code while connected to the university network, but still no luck.

Python 3?

In the description it says to create a virtual environment using Python 2.7.
However, the code seems to have been written as Python 3 code.
urllib.request cannot be used with Python 2.7; there it is just urllib.
There are also things like print(something, end="\r") instead of print something.

In short: why use Python 2.7?
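For anyone hitting this, below is a minimal compatibility sketch for the urllib difference mentioned above; it is a general Python 2/3 pattern, not code from this repository, and the URL is a placeholder.

```python
# General Python 2/3 compatibility pattern for urlopen; not repository code.
try:
    from urllib.request import urlopen  # Python 3
except ImportError:
    from urllib import urlopen          # Python 2

response = urlopen("https://example.com/")  # placeholder URL
print(response.getcode())
```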

acquire_data.py does not work

Hi, I tried to construct the dataset with DataDownloader/acquire_data.py. However, this script does not work.
The error message is below.

urllib.error.HTTPError: HTTP Error 429: Too Many Requests

Can you fix this, or are you planning to publish the dataset?
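A possible workaround for the HTTP 429 error above is to back off and retry when the API rate limit is hit. The sketch below is a general pattern, not part of this repository, and the URL argument is a placeholder.

```python
# General retry-with-backoff sketch for HTTP 429 (Too Many Requests).
# Not part of this repository; pass in whichever API URL is being fetched.
import time

try:
    from urllib.request import urlopen
    from urllib.error import HTTPError
except ImportError:  # Python 2
    from urllib2 import urlopen, HTTPError

def fetch_with_backoff(url, retries=5, delay=10):
    for attempt in range(retries):
        try:
            return urlopen(url).read()
        except HTTPError as e:
            if e.code != 429:
                raise
            # Wait longer after each rate-limited response before retrying.
            time.sleep(delay * (attempt + 1))
    raise RuntimeError("Giving up after %d rate-limited attempts" % retries)
```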

Papers_With_Section_Titles 404

Ok, so I'm trying to get the code to run - it looks like a promising library. After a few hours of fixing encoding errors and the like, and after a correspondence with the Elsevier support team (who had some issues with their API service), I have finally managed to download the data. However, after following the instructions and first running acquire_data.py, I get the following error when running cspubsumext_creator.py:
"path not found: .../Data/Papers/Full/Papers_With_Section_Titles/'"

And sure enough, the only things in the directory are:
Parsed_Papers/ Utility_Data/ XML_Papers/

Any idea where things have gone wrong?
Any help would be much appreciated!

Issue with pip install -r requirements.txt

Hello,

I have tried to execute: 'pip install -r requirements.txt'
The error is:

Could not find a version that satisfies the requirement bonjour-py==0.3 (from -r requirements.txt (line 5)) (from versions: )
No matching distribution found for bonjour-py==0.3 (from -r requirements.txt (line 5))

How should I proceed to fix this?
