Coder Social home page Coder Social logo

pke's Introduction

pke - python keyphrase extraction

This is a fork of the original pke library, which can be found here. The build and unit testing steps of the original library have been fixed.

pke is an open source python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset.

Build Status

Table of Contents

Installation

Clone the repo and create a conda environment with a Python install >= 3.6 and, after cding to the project directory containing the requirements.txt file, activate the environment, and install all dependencies via pip

conda create -n keyphrase_extraction python==3.7
conda activate keyphrase_extraction
pip install -r requirements.txt

Then run

python -m nltk.downloader stopwords
python -m nltk.downloader universal_tagset

Install the pytest module separately via

pip install -U pytest

Usually, one downloads spacy models via python -m spacy download etc. But we'll be manually installing the models. First download en_core_web_sm-2.3.1.tar.gz from here. Untar the folder wherever you'd like to store your spacy models:

tar -xvf en_core_web_sm-2.3.1.tar.gz 

Next, wtih the conda environment created above activated, run the following command:

 python -m spacy link directory_with_your_spacy_model_folder/en_core_web_sm-2.3.1/en_core_web_sm en
 --force

Next, make sure everything works by running the unit tests via

pytest tests

Minimal example

pke provides a standardized API for extracting keyphrases from a document. Start by typing the 5 lines below. For using another model, simply replace pke.unsupervised.TopicRank with another model (list of implemented models).

import pke

# initialize keyphrase extraction model, here TopicRank
extractor = pke.unsupervised.TopicRank()

# load the content of the document, here document is expected to be in raw
# format (i.e. a simple text file) and preprocessing is carried out using spacy
extractor.load_document(input='/path/to/input.txt', language='en')

# keyphrase candidate selection, in the case of TopicRank: sequences of nouns
# and adjectives (i.e. `(Noun|Adj)*`)
extractor.candidate_selection()

# candidate weighting, in the case of TopicRank: using a random walk algorithm
extractor.candidate_weighting()

# N-best selection, keyphrases contains the 10 highest scored candidates as
# (keyphrase, score) tuples
keyphrases = extractor.get_n_best(n=10)

A detailed example is provided in the examples/ directory.

Getting started

Tutorials and code documentation are available at https://boudinfl.github.io/pke/.

Implemented models

pke currently implements the following keyphrase extraction models:

Citing pke

If you use pke, please cite the following paper:

@InProceedings{boudin:2016:COLINGDEMO,
  author    = {Boudin, Florian},
  title     = {pke: an open source python-based keyphrase extraction toolkit},
  booktitle = {Proceedings of COLING 2016, the 26th International Conference on Computational Linguistics: System Demonstrations},
  month     = {December},
  year      = {2016},
  address   = {Osaka, Japan},
  pages     = {69--73},
  url       = {http://aclweb.org/anthology/C16-2015}
}

pke's People

Contributors

boudinfl avatar brunoberisso avatar limberc avatar paul-mannino avatar poulain-tim avatar sp1thas avatar suhasmohan avatar theorm avatar timrepke avatar ygorg avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.