aayushpatel007 / topicrankpy

A Python package to get useful information from documents using TopicRank Algorithm.

Python 100.00%
textrank nlp data-preprocessing topicrank graph-algorithms hierarchical-clustering pagerank-python network-x spacy keyphrase-extraction

topicrankpy's Introduction

Important topics/phrases extraction using TopicRank algorithm.

Overview

TopicRank is an unsupervised method that extracts keyphrases from the most important topics of a document, where a topic is a cluster of similar keyphrase candidates. It improves on TextRank as applied to keyphrase extraction (Mihalcea and Tarau, 2004). In TextRank, a document is represented by a graph in which words are vertices and edges represent co-occurrence relations; a graph-based ranking model derived from PageRank (Brin and Page, 1998) then assigns a significance score to each word. TopicRank instead represents the document as a complete graph whose vertices are topics rather than words, each topic being a cluster of similar single-word and multi-word expressions.
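To make the TextRank representation concrete, here is a minimal sketch of a word co-occurrence graph. The function name `cooccurrence_graph` and the `window` parameter are illustrative assumptions, not part of topicrankpy, and a real pipeline would first filter the words by part of speech.

```python
from collections import defaultdict

def cooccurrence_graph(words, window=2):
    """Build an undirected word graph: an edge links two distinct words
    that appear at most `window` positions apart (hypothetical helper)."""
    graph = defaultdict(set)
    for i, word in enumerate(words):
        # Look ahead at the next `window` words only; edges are symmetric.
        for other in words[i + 1 : i + 1 + window]:
            if other != word:
                graph[word].add(other)
                graph[other].add(word)
    return dict(graph)

words = "graph based ranking assigns scores to graph vertices".split()
graph = cooccurrence_graph(words, window=2)
print(sorted(graph["graph"]))
```

Each key is a vertex and each set holds its neighbours. A ranking model such as PageRank would then be run over this structure.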

1. Topic Identification and Clustering:

This project follows Wan and Xiao (2008) and extracts the longest sequences of nouns and adjectives from the document as keyphrase candidates. Other methods use syntactically filtered n-grams, which are likely to yield more candidates that match the reference keyphrases, but the fixed n-gram length is a limitation: n-grams do not always capture as much information as the longest noun phrases, and they are less likely to be grammatically correct.
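The candidate selection described above can be sketched as follows. The tokens are tagged by hand to stand in for a real part-of-speech tagger, and counting proper nouns as nouns is an assumption. The helper name `longest_noun_adj_sequences` is hypothetical and is not the package's API.

```python
from itertools import groupby

# Hand-tagged tokens stand in for a POS tagger's output (assumption);
# tag names follow the Universal POS convention.
tagged = [
    ("TopicRank", "PROPN"), ("extracts", "VERB"), ("important", "ADJ"),
    ("keyphrase", "NOUN"), ("candidates", "NOUN"), ("from", "ADP"),
    ("a", "DET"), ("long", "ADJ"), ("document", "NOUN"), (".", "PUNCT"),
]

# Tags treated as "nouns and adjectives"; including PROPN is an assumption.
KEEP = {"NOUN", "PROPN", "ADJ"}

def longest_noun_adj_sequences(tagged_tokens):
    """Return each maximal run of consecutive noun/adjective tokens."""
    candidates = []
    for keep, run in groupby(tagged_tokens, key=lambda tok: tok[1] in KEEP):
        if keep:
            candidates.append(" ".join(word for word, _ in run))
    return candidates

candidates = longest_noun_adj_sequences(tagged)
print(candidates)
```

Because `groupby` splits the token stream wherever the tag class changes, every candidate is as long as possible. This is the property that distinguishes the approach from fixed-length n-grams.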

To automatically group similar noun phrases into a single entity, this project uses the Hierarchical Agglomerative Clustering (HAC) algorithm. The phrases are vectorized, and the similarity between phrases is measured with the Jaccard coefficient.
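As an illustration of the clustering step, the sketch below implements agglomerative clustering with Jaccard distance in plain Python. The average-linkage criterion, the 0.75 distance threshold, and the function names are assumptions chosen for the example; the package's actual settings may differ.

```python
def jaccard_distance(a, b):
    """1 - |A & B| / |A | B| over the word sets of two non-empty phrases."""
    a, b = set(a.split()), set(b.split())
    return 1.0 - len(a & b) / len(a | b)

def average_linkage(c1, c2):
    """Mean pairwise distance between the members of two clusters."""
    total = sum(jaccard_distance(p, q) for p in c1 for q in c2)
    return total / (len(c1) * len(c2))

def cluster_phrases(phrases, threshold=0.75):
    """Merge the two closest clusters until none is within `threshold`."""
    clusters = [[p] for p in phrases]
    while len(clusters) > 1:
        pairs = [
            (average_linkage(clusters[i], clusters[j]), i, j)
            for i in range(len(clusters))
            for j in range(i + 1, len(clusters))
        ]
        dist, i, j = min(pairs)
        if dist > threshold:
            break
        merged = clusters.pop(j)  # j > i, so index i stays valid
        clusters[i].extend(merged)
    return clusters

phrases = [
    "machine learning", "deep learning", "machine learning intern",
    "apache kafka", "kafka streams",
]
clusters = cluster_phrases(phrases)
print(clusters)
```

Each resulting cluster plays the role of one topic. Lowering `threshold` makes the merge condition stricter and produces more, smaller topics.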

2. Graph-Based Ranking:

TextRank, a graph-based ranking model, is used to assign a significance score to each topic. To understand how the TextRank algorithm works, please refer to: https://web.eecs.umich.edu/~mihalcea/papers/mihalcea.emnlp04.pdf
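The ranking step can be sketched as a PageRank-style power iteration over a small complete graph of topics. The topic names are borrowed from the sample output, the edge weights are made up for illustration, and the damping factor of 0.85 is the conventional default rather than a value confirmed by this project.

```python
def pagerank(weights, damping=0.85, iterations=50):
    """Power iteration on an undirected weighted graph.

    `weights[u][v]` is the weight of the edge between u and v and must be
    symmetric (weights[u][v] == weights[v][u]).
    """
    nodes = list(weights)
    scores = {n: 1.0 / len(nodes) for n in nodes}
    strength = {n: sum(weights[n].values()) for n in nodes}
    for _ in range(iterations):
        # Each node receives a share of its neighbours' scores,
        # proportional to the weight of the connecting edge.
        scores = {
            n: (1 - damping) / len(nodes)
            + damping * sum(
                scores[m] * weights[m][n] / strength[m] for m in weights[n]
            )
            for n in nodes
        }
    return scores

# Hypothetical edge weights between three topics (complete graph).
weights = {
    "data engineering": {"machine learning": 3.0, "cloud services": 2.0},
    "machine learning": {"data engineering": 3.0, "cloud services": 1.0},
    "cloud services": {"data engineering": 2.0, "machine learning": 1.0},
}
scores = pagerank(weights)
ranking = sorted(scores, key=scores.get, reverse=True)
print(ranking)
```

Topics with heavier connections to the rest of the graph accumulate higher scores, which is what places them at the top of the ranking.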

Getting Started

Using this library, you will be able to extract meaningful information from documents like:

  • Top N phrases
  • URLs
  • Email IDs
  • Phone numbers
  • Important names
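For the pattern-based items in this list, a rough sketch with regular expressions is shown below. The patterns, the 10-digit phone format, and the `extract_contacts` helper are simplified assumptions for illustration. They are not the expressions topicrankpy uses internally.

```python
import re

# Deliberately simple patterns (assumptions); real-world formats vary widely.
URL_RE = re.compile(r"https?://[^\s<>\"']+")
EMAIL_RE = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")
PHONE_RE = re.compile(r"\b\d{10}\b")  # bare 10-digit numbers only

def extract_contacts(text):
    """Return the URLs, email addresses and phone numbers found in text."""
    return {
        "urls": URL_RE.findall(text),
        "emails": EMAIL_RE.findall(text),
        "phones": PHONE_RE.findall(text),
    }

sample = (
    "Reach me at jane.doe@example.com or 5551234567. "
    "Profile: https://example.com/jane"
)
contacts = extract_contacts(sample)
print(contacts)
```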

Installation

pip3 install topicrankpy

from topicrankpy import extractinformation as t

# Pass the path to your document and the number of top phrases to return.
t.extract_all('path_of_document', no_of_phrases)

Output: For testing purposes, I used my resume.

{
  'Top_Phrases_With_Ranking': [
    ('data engineering',
    0.03882171811465683),
    ('machine learning',
    0.0231421447805223),
    ('technologies',
    0.01656229201773112),
    ('algorithms',
    0.015179556679089493),
    ('python',
    0.014202240623362651),
    ('android application',
    0.013784183422746128),
    ('deep learning',
    0.012663419387693997),
    ('cloud services',
    0.012062811163957745),
    ('kafka',
    0.011780856748625147),
    ('elasticsearch',
    0.011594082728116736)
  ],
  'Phone_Numbers': [
    '4168328255'
  ],
  'Email_address': [
    '[email protected]'
  ],
  'Important Names': [
    'Aayush Patel',
    'AWS Certified Solutions Architect',
    'Award Machine Learning Artificial Intelligence',
    'Advance Data Science',
    'Google Play Store',
    'Chahal Academy',
    'Apache Spark Hadoop',
    'Kafka',
    'Kafka Streams',
    'Apache Cassandra',
    'Flume',
    'Amazon Kinesis',
    'Amazon EMR',
    'Elastic Map Reduce',
    'Machine Learning Deep',
    'Data Preprocessing',
    'Keras',
    'Open CV',
    'Python',
    'Amazon Web Services',
    'Google Cloud Platform',
    'System',
    'Linux Windows',
    'Gujarat',
    'Python',
    'Cloud',
    'Teksun Lab Pvt',
    'Ltd',
    'Kinesis',
    'Collect',
    'Applied',
    'Python',
    'Data',
    'Machine Learning Intern',
    'Experts Hub',
    'Keras',
    'Sardar Vallabhbhai Patel Institute Technology',
    'Android',
    'Kinesis',
    'Cognito',
    'Desktop Application',
    'Python',
    'Apache Kafka',
    'Apache Cassandra Elasticsearch',
    'Twitter API',
    'Elastic Load Transform',
    'Kafka Connector Sink',
    'Cassandra',
    'Inspector',
    'Ontario Fire Code',
    'Build Log Analytics Solutions',
    'Google Play Store',
    'Trent University'
  ],
  'URLS': [
    'https://www.linkedin.com/in/aayushpatel678/',
    'https://github.com/Aayushpatel007',
    'https://www.youtube.com/watch?v=tvBZz7L5EBI'
  ]
}
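The name list in the output above contains repeated entries (for example, 'Python' appears several times). Assuming the result is an ordinary dictionary with the keys shown, a small post-processing sketch might look like this. The `result` literal is a shortened copy of the sample output and `unique_in_order` is a hypothetical helper.

```python
# Shortened stand-in for the dictionary shown above (assumed structure).
result = {
    "Top_Phrases_With_Ranking": [
        ("data engineering", 0.03882171811465683),
        ("machine learning", 0.0231421447805223),
    ],
    "Important Names": ["Kafka", "Python", "Keras", "Python", "Kafka"],
}

def unique_in_order(items):
    """Drop duplicates while keeping the first occurrence of each item."""
    return list(dict.fromkeys(items))

names = unique_in_order(result["Important Names"])
top_phrase, top_score = max(
    result["Top_Phrases_With_Ranking"], key=lambda pair: pair[1]
)
print(names)
print(top_phrase, top_score)
```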

topicrankpy's Issues

ModuleNotFoundError: No module named 'pke'

Importing the package raises the following error:


ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-19-01fbd239b3d2> in <module>()
----> 1 from topicrankpy import extractinformation as t

/usr/local/lib/python3.6/dist-packages/topicrankpy/__init__.py in <module>()
      1 from __future__ import absolute_import
      2 
----> 3 from pke.data_structures import Candidate, Document, Sentence
      4 from pke.readers import MinimalCoreNLPReader, RawTextReader
      5 from pke.base import LoadFile

ModuleNotFoundError: No module named 'pke'
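The traceback shows that topicrankpy imports `pke` when the package loads, so the import fails if `pke` is missing from the environment. A quick way to check for the dependency before importing is sketched below. The helper name `is_installed` is an assumption, and the sketch only detects the missing module; it does not install it.

```python
import importlib.util

def is_installed(module_name):
    """Return True if a top-level module can be found on the import path."""
    return importlib.util.find_spec(module_name) is not None

if not is_installed("pke"):
    print("pke is not installed; topicrankpy imports it at load time.")
```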
