
jgtextrank: Yet another Python implementation of TextRank

License: MIT License



This is a parallelisable and highly customisable implementation of the TextRank algorithm [Mihalcea and Tarau, 2004]. You can define your own co-occurrence context, syntactic categories (choose either "closed" or "open" filters) and stop words, feed in your own pre-segmented or pre-tagged data, and more. You can also build the co-occurrence graph directly from your text for visual analytics, debugging and fine-tuning of your custom settings. This implementation can be applied to large corpora for terminology extraction. It can also be applied to short texts in supervised learning, where it can provide more informative features than a conventional TF-IDF vectorizer.

The TextRank algorithm looks at the structure of word co-occurrence networks, where nodes are word types and edges are word co-occurrences.

Important words can be thought of as being endorsed by other words, and this leads to an interesting phenomenon. Words that are most important, viz. keywords, emerge as the most central words in the resulting network, with high degree and PageRank. The final important step is post-filtering. Extracted phrases are disambiguated and normalized for morpho-syntactic variations and lexical synonymy (Csomai and Mihalcea 2007). Adjacent words are also sometimes collapsed into phrases, for a more readable output.
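The ranking step described above can be sketched in a few lines of plain Python. This is a minimal illustration, not the code used inside jgtextrank: the function name `pagerank`, the toy edge list and the fixed iteration count are assumptions made for the example. It uses the normalised PageRank form with teleport term (1 - d) / N and the damping factor d = 0.85 suggested in the TextRank paper.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank on an undirected, unweighted graph.

    `adjacency` maps each node to the set of its neighbours.
    Every node is assumed to have at least one neighbour.
    """
    nodes = list(adjacency)
    n = len(nodes)
    scores = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        updated = {}
        for node in nodes:
            # Each neighbour shares its score equally among its own neighbours.
            incoming = sum(scores[nb] / len(adjacency[nb]) for nb in adjacency[node])
            updated[node] = (1.0 - damping) / n + damping * incoming
        scores = updated
    return scores


# Toy co-occurrence graph (illustrative): an edge links two co-occurring words.
edges = [("linear", "constraints"), ("linear", "diophantine"),
         ("diophantine", "equations"), ("linear", "equations")]
adjacency = {}
for a, b in edges:
    adjacency.setdefault(a, set()).add(b)
    adjacency.setdefault(b, set()).add(a)

scores = pagerank(adjacency)
for word, score in sorted(scores.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{word}: {score:.3f}")
```

In this toy graph "linear" has the most neighbours and therefore ends up with the highest score, which is the "endorsement" effect described above.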


Mihalcea, R., & Tarau, P. (2004, July). TextRank: Bringing order into texts. Association for Computational Linguistics.

Usage

Simple examples

Extract weighted keywords with an undirected graph:

>>> from jgtextrank import keywords_extraction
>>> example_abstract = "Compatibility of systems of linear constraints over the set of natural numbers. " \
                       "Criteria of compatibility of a system of linear Diophantine equations, strict inequations, " \
                       "and nonstrict inequations are considered. Upper bounds for components of a minimal set of " \
                       "solutions and algorithms of construction of minimal generating sets of solutions for all " \
                       "types of systems are given. These criteria and the corresponding algorithms for " \
                       "constructing a minimal supporting set of solutions can be used in solving all the " \
                       "considered types systems and systems of mixed types."
>>> keywords_extraction(example_abstract, top_p = 1, directed=False)[0][:15]
[('linear diophantine equations', 0.18059), ('minimal supporting set', 0.16649), ('minimal set', 0.13201), ('types systems', 0.1194), ('linear constraints', 0.10997), ('strict inequations', 0.08832), ('systems', 0.08351), ('corresponding algorithms', 0.0767), ('nonstrict inequations', 0.07276), ('mixed types', 0.07178), ('set', 0.06674), ('minimal', 0.06527), ('natural numbers', 0.06466), ('algorithms', 0.05479), ('solutions', 0.05085)]
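Multi-word terms such as 'linear diophantine equations' come from collapsing adjacent candidate words into phrases. The sketch below shows one simple way to do this, under stated assumptions: the helper `collapse_keywords` is hypothetical, the word scores are made up, and summing the word scores is only one of several possible phrase weighting schemes. It is not necessarily how jgtextrank combines weights.

```python
def collapse_keywords(tokens, word_scores):
    """Merge runs of adjacent scored words into phrases.

    A phrase is scored as the sum of its word scores (an assumption
    made for this sketch; other combinations are possible).
    """
    phrases = {}
    run = []
    for token in tokens + [None]:  # None is a sentinel that flushes the last run
        if token is not None and token in word_scores:
            run.append(token)
        elif run:
            phrases[" ".join(run)] = sum(word_scores[w] for w in run)
            run = []
    return sorted(phrases.items(), key=lambda kv: kv[1], reverse=True)


tokens = "criteria of compatibility of a system of linear diophantine equations".split()
# Made-up word scores, for illustration only.
word_scores = {"linear": 0.3, "diophantine": 0.2, "equations": 0.25,
               "system": 0.15, "compatibility": 0.1}

ranked_phrases = collapse_keywords(tokens, word_scores)
print(ranked_phrases)
```

Words without a score (here "criteria", "of" and "a") act as boundaries, so only uninterrupted runs of scored words are merged.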

Change the syntactic filters so that only nouns are added to the graph as vertices:

>>> custom_categories = {'NNS', 'NNP', 'NN'}
>>> keywords_extraction(example_abstract, top_p = 1, top_t=None,
                        directed=False, syntactic_categories=custom_categories)[0][:15]
[('types systems', 0.17147), ('diophantine equations', 0.15503), ('supporting set', 0.14256), ('solutions', 0.13119), ('systems', 0.12452), ('algorithms', 0.09188), ('set', 0.09188), ('compatibility', 0.0892), ('construction', 0.05068), ('criteria', 0.04939), ('sets', 0.04878), ('types', 0.04696), ('system', 0.01163), ('constraints', 0.01163), ('components', 0.01163)]

You can provide an additional stop word list to filter unwanted candidate terms:

>>> stop_list={'set', 'mixed', 'corresponding', 'supporting'}
>>> keywords_extraction(example_abstract, top_p = 1, top_t=None,
                        directed=False,
                        syntactic_categories=custom_categories, stop_words=stop_list)[0][:15]
[('types systems', 0.20312), ('diophantine equations', 0.18348), ('systems', 0.1476), ('algorithms', 0.11909), ('solutions', 0.11909), ('compatibility', 0.10522), ('sets', 0.06439), ('construction', 0.06439), ('criteria', 0.05863), ('types', 0.05552), ('system', 0.01377), ('constraints', 0.01377), ('components', 0.01377), ('numbers', 0.01377), ('upper', 0.01377)]

You can also use lemmatization (disabled by default) to increase the weight for terms appearing with various inflectional variations:

>>> keywords_extraction(example_abstract, top_p = 1, top_t=None,
                        directed=False,
                        syntactic_categories=custom_categories,
                        stop_words=stop_list, lemma=True)[0][:15]
[('type system', 0.2271), ('diophantine equation', 0.20513), ('system', 0.16497), ('algorithm', 0.14999), ('compatibility', 0.11774), ('construction', 0.07885), ('solution', 0.07885), ('criterion', 0.06542), ('type', 0.06213), ('component', 0.01538), ('constraint', 0.01538), ('upper', 0.01538), ('inequations', 0.01538), ('number', 0.01538)]

The co-occurrence window size is 2 by default. You can try a different value for your data:

>>> keywords_extraction(example_abstract,  window=5,
                        top_p = 1, top_t=None, directed=False,
                        stop_words=stop_list, lemma=True)[0][:15]
[('linear diophantine equation', 0.19172), ('linear constraint', 0.13484), ('type system', 0.1347), ('strict inequations', 0.12532), ('system', 0.10514), ('nonstrict inequations', 0.09483), ('solution', 0.06903), ('natural number', 0.06711), ('minimal', 0.06346), ('algorithm', 0.05762), ('compatibility', 0.05089), ('construction', 0.04541), ('component', 0.04418), ('criterion', 0.04086), ('type', 0.02956)]
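To see why the window size changes the result, it helps to look at which edges a given window produces. The following is a minimal sketch, assuming that a window of N means two words are linked when they fall within N consecutive tokens (so 2 links only adjacent words). The helper `cooccurrence_edges` is hypothetical and is not part of the jgtextrank API.

```python
def cooccurrence_edges(tokens, window=2):
    """Return undirected edges between words that co-occur
    within `window` consecutive tokens."""
    edges = set()
    for i, word in enumerate(tokens):
        # Look ahead at the next (window - 1) tokens.
        for other in tokens[i + 1 : i + window]:
            if other != word:
                edges.add(frozenset((word, other)))
    return edges


tokens = ["minimal", "generating", "sets", "solutions"]
print(len(cooccurrence_edges(tokens, window=2)))  # 3 edges: adjacent pairs only
print(len(cooccurrence_edges(tokens, window=3)))  # 5 edges: also pairs one token apart
```

A larger window makes the graph denser, which tends to spread weight over more words and can surface longer phrases, as in the output above.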

Try a different centrality measure:

>>> keywords_extraction(example_abstract, solver="current_flow_betweenness",
                        window=5, top_p = 1, top_t=None,
                        directed=False, stop_words=stop_list,
                        lemma=True)[0][:15]
[('type system', 0.77869), ('system', 0.77869), ('solution', 0.32797), ('linear diophantine equation', 0.30657), ('linear constraint', 0.30657), ('minimal', 0.26052), ('algorithm', 0.21463), ('criterion', 0.19821), ('strict inequations', 0.19651), ('nonstrict inequations', 0.19651), ('compatibility', 0.1927), ('natural number', 0.11111), ('component', 0.11111), ('type', 0.10718), ('construction', 0.10039)]

Tuning your graph model as a black box can be problematic. You can try to visualize your co-occurrence network with your sample dataset in order to manually validate your custom parameters:

>>> from jgtextrank import preprocessing, build_cooccurrence_graph
>>> import networkx as nx
>>> import matplotlib.pyplot as plt
>>> preprocessed_context = preprocessing(example_abstract, stop_words=stop_list, lemma=True)
>>> cooccurrence_graph, context_tokens = build_cooccurrence_graph(preprocessed_context, window=2)
>>> pos = nx.spring_layout(cooccurrence_graph,k=0.20,iterations=20)
>>> nx.draw_networkx(cooccurrence_graph, pos=pos, arrows=True, with_labels=True)
>>> plt.savefig("my_sample_cooccurrence_graph.png")
>>> plt.show()

For more examples (e.g., using a custom co-occurrence context, extracting from a corpus of text files, or feeding in your own pre-segmented/pre-tagged data), please see the jgTextRank wiki.

Documentation

For jgtextrank documentation, see:

Installation

To install from PyPI:

pip install jgtextrank

To install from GitHub:

pip install git+git://github.com/jerrygaoLondon/jgtextrank.git

or

pip install git+https://github.com/jerrygaoLondon/jgtextrank.git

To install from source:

python setup.py install

Dependencies

Status

  • Beta release (update)

    • Python implementation of TextRank algorithm for keywords extraction

    • Support directed/undirected and unweighted graph

    • More than 12 multi-word term (MWT) weighting methods

    • 3 PageRank implementations and more than 15 additional graph ranking algorithms

    • Parallelised computation of vertex co-occurrences (the number of worker instances can be set)

    • Support various custom settings and parameters (e.g., use of lemmatization, co-occurrence window size, options for two co-occurrence context strategies, use of custom syntactic filters, use of custom stop words)

    • Keywords extraction from pre-processed (pre-segmented or pre PoS tagged) corpus/context

    • Keywords extraction from a given corpus directory of raw text files

    • Export ranked result into 'csv' or 'json' file

    • Support visual analytics of vertices network
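The library's own export options are not shown in this README. As a stand-alone alternative, a ranked result (a list of `(term, weight)` tuples, as returned in the usage examples) can be written out with the standard library alone. The helper `export_keywords` below is hypothetical and only sketches the idea.

```python
import csv
import io
import json


def export_keywords(keywords, stream, fmt="csv"):
    """Write (term, weight) pairs to an open text stream as CSV or JSON."""
    if fmt == "csv":
        writer = csv.writer(stream)
        writer.writerow(["term", "weight"])
        writer.writerows(keywords)
    elif fmt == "json":
        json.dump([{"term": t, "weight": w} for t, w in keywords], stream, indent=2)
    else:
        raise ValueError(f"unsupported format: {fmt}")


# Two entries copied from the first usage example.
keywords = [("linear diophantine equations", 0.18059),
            ("minimal supporting set", 0.16649)]

buffer = io.StringIO()
export_keywords(keywords, buffer, fmt="csv")
csv_text = buffer.getvalue()
print(csv_text)
```

To write to disk instead, pass a file opened with `open(path, "w", newline="")` for CSV, or `open(path, "w")` for JSON.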

Contributions

This project welcomes contributions, feature requests and suggestions. Please feel free to create issues or send me your pull requests.

Important: By submitting a patch, you agree to allow the project owners to license your work under the MIT license.

To Cite

Here's a Bibtex entry if you need to cite jgTextRank in your research paper:

@Misc{jgTextRank,
author =   {Gao, Jie},
title =    {jgTextRank: Yet another Python implementation of TextRank},
howpublished = {\url{https://github.com/jerrygaoLondon/jgtextrank/}},
year = {2017}
}

Who do I talk to?

History

  • 0.1.2 Beta version - Aug 2018
    • bug fixes
    • 15 additional graph ranking algorithms
  • 0.1.1 Alpha version - 1st Jan 2018
  • 0.1.3 Beta version - March, 2019
    • minor fixes and documentation improvement


jgtextrank's Issues

An attempt has been made to start a new process before the current process has finished its bootstrapping phase

RuntimeError: 
        An attempt has been made to start a new process before the
        current process has finished its bootstrapping phase.

        This probably means that you are not using fork to start your
        child processes and you have forgotten to use the proper idiom
        in the main module:

            if __name__ == '__main__':
                freeze_support()
                ...

        The "freeze_support()" line can be omitted if the program
        is not going to be frozen to produce an executable.
Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\python3\lib\multiprocessing\spawn.py", line 105, in spawn_main
    exitcode = _main(fd)
  File "C:\python3\lib\multiprocessing\spawn.py", line 114, in _main
    prepare(preparation_data)
  File "C:\python3\lib\multiprocessing\spawn.py", line 225, in prepare
    _fixup_main_from_path(data['init_main_from_path'])
  File "C:\python3\lib\multiprocessing\spawn.py", line 277, in _fixup_main_from_path
    run_name="__mp_main__")
  File "C:\python3\lib\runpy.py", line 263, in run_path
    pkg_name=pkg_name, script_name=fname)
  File "C:\python3\lib\runpy.py", line 96, in _run_module_code
    mod_name, mod_spec, pkg_name, script_name)
  File "C:\python3\lib\runpy.py", line 85, in _run_code
    exec(code, run_globals)
  File "C:\Users\ashis\Documents\infoassistants\TA backend\tests_.py", line 7, in <module>
    keywords = keywords_extraction("\n".join(text), top_p=1, directed=False)[0][:15]
  File "C:\python3\lib\site-packages\jgtextrank-0.1.3-py3.7.egg\jgtextrank\core.py", line 1103, in keywords_extraction
    weight_comb=weight_comb, mu=mu)
  File "C:\python3\lib\site-packages\jgtextrank-0.1.3-py3.7.egg\jgtextrank\core.py", line 796, in _keywords_extraction_from_preprocessed_context
    window=window)
  File "C:\python3\lib\site-packages\jgtextrank-0.1.3-py3.7.egg\jgtextrank\core.py", line 510, in build_cooccurrence_graph
    conn_with_original_ctx=conn_with_original_ctx, window_size=window)
  File "C:\python3\lib\site-packages\jgtextrank-0.1.3-py3.7.egg\jgtextrank\core.py", line 432, in _build_vertices_representations
    with Pool(processes=MAX_PROCESSES) as pool:
  File "C:\python3\lib\multiprocessing\context.py", line 119, in Pool
    context=self.get_context())
  File "C:\python3\lib\multiprocessing\pool.py", line 176, in __init__
    self._repopulate_pool()
  File "C:\python3\lib\multiprocessing\pool.py", line 241, in _repopulate_pool
    w.start()
  File "C:\python3\lib\multiprocessing\process.py", line 112, in start
    self._popen = self._Popen(self)
  File "C:\python3\lib\multiprocessing\context.py", line 322, in _Popen
    return Popen(process_obj)
  File "C:\python3\lib\multiprocessing\popen_spawn_win32.py", line 33, in __init__
    prep_data = spawn.get_preparation_data(process_obj._name)
  File "C:\python3\lib\multiprocessing\spawn.py", line 143, in get_preparation_data
    _check_not_importing_main()
  File "C:\python3\lib\multiprocessing\spawn.py", line 136, in _check_not_importing_main
    is not going to be frozen to produce an executable.''')
RuntimeError: 

Python version 3.7.2
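The traceback shows `keywords_extraction` being called at module level (tests_.py, line 7) while the library opens a `multiprocessing.Pool` internally. On Windows, child processes are started with the spawn method, which re-imports the main module; any unguarded top-level call is then run again in each child and triggers this error. The usual remedy is the main-module guard named in the error message. The sketch below shows the idiom using only the standard library; `count_tokens` and `main` are hypothetical stand-ins, and in the reporter's script the `keywords_extraction(...)` call would move into `main()`.

```python
from multiprocessing import Pool, freeze_support


def count_tokens(text):
    """Hypothetical worker: stands in for any work done in a child process."""
    return len(text.split())


def main():
    documents = ["linear diophantine equations",
                 "minimal supporting set of solutions"]
    # Code that creates worker processes belongs here, not at module level.
    # A call such as keywords_extraction(...) would go in this function too.
    with Pool(processes=2) as pool:
        print(pool.map(count_tokens, documents))


if __name__ == "__main__":
    freeze_support()  # only needed when freezing into a Windows executable
    main()
```

With the guard in place, a re-import of the module in a child process only defines the functions and does not start new processes.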

Couldn't find where to change the damping factor.

The original paper on TextRank by Mihalcea, R., & Tarau, P. mentions the damping factor and that it should be set to 0.85.

I couldn't really find any place to fix the damping factor in your implementation.

Pip install jgtextrank doesn't work

Collecting jgtextrank
  Downloading https://files.pythonhosted.org/packages/b3/bd/91d6b9590e8e471138d40b23c60da7db8acf576422977d43646566c0b4a9/jgtextrank-0.1.3.tar.gz
    Complete output from command python setup.py egg_info:
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\ashis\AppData\Local\Temp\pip-install-hsk486_a\jgtextrank\setup.py", line 29, in <module>
        long_description=readme(),
      File "C:\Users\ashis\AppData\Local\Temp\pip-install-hsk486_a\jgtextrank\setup.py", line 7, in readme
        with open('readme.txt') as f:
    FileNotFoundError: [Errno 2] No such file or directory: 'readme.txt'

    ----------------------------------------
Command "python setup.py egg_info" failed with error code 1 in C:\Users\ashis\AppData\Local\Temp\pip-install-hsk486_a\jgtextrank\
