derwenai / pytextrank

Python implementation of TextRank algorithms ("textgraphs") for phrase extraction

Home Page: https://derwen.ai/docs/ptr/

License: MIT License

Languages: Python 56.23%, Jupyter Notebook 43.51%, Shell 0.26%
Topics: textrank, summarization, natural-language-processing, nlp, machine-learning, graph-algorithms, spacy, spacy-extension, natural-language, textgraphs

pytextrank's Introduction

PyTextRank


PyTextRank is a Python implementation of TextRank as a spaCy pipeline extension, for graph-based natural language work -- and related knowledge graph practices. This includes the family of textgraph algorithms: TextRank, PositionRank, Biased TextRank, and TopicRank.

Popular use cases for this library include:

  • phrase extraction: get the top-ranked phrases from a text document
  • low-cost extractive summarization of a text document
  • help infer concepts from unstructured text into a more structured representation

See our full documentation at: https://derwen.ai/docs/ptr/

Getting Started

See the "Getting Started" section of the online documentation.

To install from PyPI:

python3 -m pip install pytextrank
python3 -m spacy download en_core_web_sm

If you work directly from this Git repo, be sure to install the dependencies as well:

python3 -m pip install -r requirements.txt

Alternatively, to install dependencies using conda:

conda env create -f environment.yml
conda activate pytextrank

Then to use the library with a simple use case:

import spacy
import pytextrank

# example text
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# load a spaCy model, depending on language, scale, etc.
nlp = spacy.load("en_core_web_sm")

# add PyTextRank to the spaCy pipeline
nlp.add_pipe("textrank")
doc = nlp(text)

# examine the top-ranked phrases in the document
for phrase in doc._.phrases:
    print(phrase.text)
    print(phrase.rank, phrase.count)
    print(phrase.chunks)
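
For the low-cost extractive summarization use case, the same pipeline component exposes a summary() method; a minimal sketch, following the call shown in the online documentation (the limit values here are arbitrary):

# summarize the document: rank sentences by how well they cover the
# top-ranked phrases, then emit the strongest few
tr = doc._.textrank

for sent in tr.summary(limit_phrases=15, limit_sentences=3):
    print(sent)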

See the tutorial notebooks in the examples subdirectory for sample code and patterns to use in integrating PyTextRank with related libraries in Python: https://derwen.ai/docs/ptr/tutorial/

Contributing Code

We welcome people getting involved as contributors to this open source project!

For detailed instructions please see: CONTRIBUTING.md

Build Instructions

Note: unless you are contributing code and updates, in most use cases you won't need to build this package locally.

Instead, simply install from PyPI or use conda.

To set up the build environment locally, see the "Build Instructions" section of the online documentation.

Semantic Versioning

Generally speaking, the major release number of PyTextRank will track with the major release number of the associated spaCy version.

See: CHANGELOG.md

thanks noam!

License and Copyright

Source code for PyTextRank plus its logo, documentation, and examples have an MIT license which is succinct and simplifies use in commercial applications.

All materials herein are Copyright © 2016-2024 Derwen, Inc.

Attribution

Please use the following BibTeX entry for citing PyTextRank if you use it in your research or software:

@software{PyTextRank,
  author = {Paco Nathan},
  title = {{PyTextRank, a Python implementation of TextRank for phrase extraction and summarization of text documents}},
  year = 2016,
  publisher = {Derwen},
  doi = {10.5281/zenodo.4637885},
  url = {https://github.com/DerwenAI/pytextrank}
}

Citations are helpful for the continued development and maintenance of this library. For example, see our citations listed on Google Scholar.

Kudos

Many thanks to our open source sponsors; and to our contributors: @ceteri, @louisguitton, @Ankush-Chander, @tomaarsen, @CaptXiong, @Lord-V15, @anna-droid-beep, @dvsrepo, @clabornd, @dayalstrub-cma, @kavorite, @0dB, @htmartin, @williamsmj, @mattkohl, @vanita5, @HarshGrandeur, @mnowotka, @kjam, @SaiThejeshwar, @laxatives, @dimmu, @JasonZhangzy1757, @jake-aft, @junchen1992, @shyamcody, @chikubee; also to @mihalcea who leads outstanding NLP research work, encouragement from the wonderful folks at Explosion who develop spaCy, plus general support from Derwen, Inc.


pytextrank's Issues

Add support for unicode characters

Currently, while decoding a file, an 'ascii' codec can't decode byte error shows up. It would be good to have unicode character support. Thanks :)

Running text rank using nlp.pipe

I add PyTextRank to the nlp pipeline as a component, as shown in the example snippet, but when I use it as in the following, doc._.phrases always contains the phrases for the last document.

texts = [list of strings]
docs = list(nlp.pipe(texts))
doc_terms = []
for doc in docs:
        textrank_result = [p.text for p in doc._.phrases]
        doc_terms.append(textrank_result)

In the for loop, doc_terms always contains the terms for the last doc object, even though the doc objects are different. Any idea what I might be doing wrong here? Thanks.
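
A minimal workaround sketch, assuming the phrase lists are copied out immediately rather than read later from cached Doc objects (in older releases the "phrases" extension default was bound to the pipeline component itself, so cached docs could all reflect the last run):

doc_terms = []

for text in texts:
    doc = nlp(text)
    # copy the phrase text out right away, before the next document is processed
    doc_terms.append([p.text for p in doc._.phrases])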

Error on spacy 2.0

I'm using
pip install -U pytextrank
to install, and tried to execute stage1.py, but then an error appears during execution:

doc = spacy_nlp(graf_text, parse=True)
TypeError: __call__() got an unexpected keyword argument 'parse'

I tried to remove the parse=True arg, but then another error is raised.

Is there any compatibility issue with pytextrank on spaCy version 2.0?

thank you

AttributeError: module 'pytextrank' has no attribute 'parse_doc'

I installed all the libraries; I installed using Git and the Git URL provided, as I didn't know how to install from requirements.txt, so I installed the dependencies individually instead.
I then tried to run your example:

import pytextrank
import time
import sys

path_stage0 = "in.json"
path_stage1 = "o1.json"
path_stage2 = "o2.json"
path_stage3 = "o3.json"

start = time.time()

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))

graph, ranks = pytextrank.text_rank(path_stage1)
pytextrank.render_ranks(graph, ranks)

with open(path_stage2, 'w') as f:
    for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
        f.write("%s\n" % pytextrank.pretty_print(rl._asdict()))

kernel = pytextrank.rank_kernel(path_stage2)

with open(path_stage3, 'w') as f:
    for s in pytextrank.top_sentences(kernel, path_stage1):
        f.write(pytextrank.pretty_print(s._asdict()))
        f.write("\n")

phrases = ", ".join(set([p for p in pytextrank.limit_keyphrases(path_stage2, phrase_limit=3)]))
sent_iter = sorted(pytextrank.limit_sentences(path_stage3, word_limit=250), key=lambda x: x[1])
s = []

for sent_text, idx in sent_iter:
    s.append(pytextrank.make_sentence(sent_text))

graf_text = " ".join(s)
#print("excerpts: %s\n\nkeywords: %s" % (graf_text, phrases,))
print('Time Taken: ', time.time() - start)

And I keep getting this error:
AttributeError: module 'pytextrank' has no attribute 'parse_doc'

Ignore tokens and enrich the lemma graph

Hi everyone!

It is mentioned in the project's description that enriching the lemma graph would improve TextRank's performance. I saw that showing examples of this was in the todo list of the project but I was wondering if it worked by simply adding entities to the doc before summarising? Or is it more complicated? I am particularly interested in adding hyponymy.

And what about ignoring tokens? Some tokens are ignored depending on their POS tag in your implementation. Is it possible to ignore tokens specific to our application by tagging them? With what?

Thanks in advance for your answers!!

And thank you for this project, it is great!

Error in example.ipynb

I tried running the example associated with PyTextRank:
https://github.com/DerwenAI/pytextrank/blob/master/example.ipynb

And I got the following error:

File "C:\Users\lee.williams\AppData\Local\Continuum\anaconda3\lib\site-packages\pytextrank\pytextrank.py", line 315, in build_graph
for pair in get_tiles(map(WordNode._make, meta["graf"])):

TypeError: string indices must be integers

Any help would be appreciated. As the example output looks fine on the URL above

Regards

Lee

How to use this?

Hi Team,
It would be of great help if you could guide in this regard.

  1. Is there a way to use the functionality of this package with a simple text document?
  2. If not, I have documents and paragraphs in text (.txt) format. What would be the input JSON format for the functions as given in the examples (stages)?

Keyphrases get messed up when enumerate_chunks() re-parses them

When running pytextrank on real-world data, I often end up with keyphrases like 'm and 've.

This seems to be because information is lost when enumerate_chunks() runs the output of spaCy through spaCy again.

Here's an example:

(.env) vagrant@web:/vagrant/pytextrank$ cat dat/err.json
{"id":"231", "text":"im sure its great"}
(.env) vagrant@web:/vagrant/pytextrank$ ./run.sh dat/err.json 
+ '[' 1 -ne 1 ']'
+ ./stage1.py dat/err.json

real	0m5.270s
user	0m3.776s
sys	0m0.992s
+ ./stage2.py o1.json

real	0m5.460s
user	0m3.984s
sys	0m0.948s
+ ./stage3.py o1.json o2.json

real	0m2.408s
user	0m1.272s
sys	0m0.512s
+ ./stage4.py o2.json o3.json

real	0m3.402s
user	0m1.680s
sys	0m0.760s
+ cat o4.md
**excerpts:** i m sure its great

**keywords:** m
(.env) vagrant@web:/vagrant/pytextrank$

Adding multilingual support with spaCy models

Idea

Using spaCy as core NLP library for pytextrank opens up the possibility of supporting new
languages other than English.

Initial analysis

Currently, as of spaCy 1.8.x there are four official languages supported: en, de, fr and more recently es.

I have performed an initial analysis and testing with two new languages: (1) German and (2) Spanish.
Of course, as with the English models, the user would need to run python -m spacy download de or python -m spacy download es.

According to my local tests executing the example notebook for German and Spanish, the following would be needed in pytextrank to support a new language:

  1. Make lang configurable and have https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L187 load the language identified by its ISO code.

  2. [CAVEAT] If the language is available in spaCy but does not include any of the required features ((1) POS, (2) NER, and (3) a noun chunking method; anything else?), pytextrank should show a warning/error message. E.g., here: https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L423 for noun_chunking, or here https://github.com/ceteri/pytextrank/blob/master/pytextrank/pytextrank.py#L480 for NER.

Current status

Of the other three official languages, German would be supported out of the box, and Spanish would be supported in the next release (Spanish only lacks noun_chunking in 1.8.2, which is currently implemented on the master branch and will in principle be included in the next release; see https://github.com/explosion/spaCy/pull/1096/commits/5b385e7d78fd955d97b59024645d2592bdbc0949).
French would need NER and noun_chunking to be implemented.

I would be happy to contribute code and examples if needed :-)

Dani
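
For reference, a minimal sketch with the current spaCy 3.x API (which postdates this issue), assuming the German model has been downloaded via python -m spacy download de_core_news_sm:

import spacy
import pytextrank

# load a German pipeline and attach PyTextRank, exactly as for English
nlp = spacy.load("de_core_news_sm")
nlp.add_pipe("textrank")

doc = nlp("Ein kurzer deutscher Beispieltext über Graphen, Algorithmen und Schlüsselwörter.")

for phrase in doc._.phrases[:5]:
    print(phrase.rank, phrase.text)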

remove stopwords

hi,

It seems the function load_stopwords in pytextrank.py is never called.
Would you consider adding an argument at initialization to enable or disable removing stopwords? And another argument to choose between pos and tag?

Besides, is it possible to make the phrase similar to a span, to allow accessing token.pos_ for each token in a phrase?
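
For reference, recent releases accept a stopwords entry in the component config; a minimal sketch following the pattern in the online documentation (the lemma/POS pair here is only an example):

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")

# ignore the lemma "word" whenever it appears as a NOUN
nlp.add_pipe("textrank", config={"stopwords": {"word": ["NOUN"]}})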

Errors importing from pytextrank

Hi! I'm working on a project connected with NLP and was happy to find out that there is such a tool as PyTextRank. However, I've encountered an issue at the very beginning, trying to just import the package to run the example code given here.
The error that I get is the following:

----> from pytextrank import json_iter, parse_doc, pretty_print
ImportError: cannot import name 'json_iter'
----> from pytextrank import parse_doc
ImportError: cannot import name 'parse_doc'

I've tried running it in an IPython console and a Jupyter Notebook, both with the same result. I've installed PyTextRank with pip; the Python version that I have is 3.5.4, spacy 2.1.8, networkx 2.4, graphviz 0.13.2.

at stage2.py

While running, I got this error:

Traceback (most recent call last):
  File "/newvolume/var/www/krackin_ai/krackin_ai/app/Scripts/stage2.py", line 17, in <module>
    render_ranks(graph, ranks)
  File "/usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py", line 336, in render_ranks
    write_dot(graph, ranks, path=dot_file)
  File "/usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py", line 327, in write_dot
    with open(path, 'w') as f:
PermissionError: [Errno 13] Permission denied: 'graph.dot'


Singular vs plural across documents

Question: I am new at this and working from the example script to extract key phrases from about 800 separate documents. The tool is great! I notice in the results that some documents will produce a singular word in a key phrase (like "donor") while other documents in the set produce the plural of the same word ("donors"). In the ideal case I would like to collapse those into just one version, either singular or plural. Can you suggest an approach to do that? Is there any feature that would convert the found keywords to their singular, for example?
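
One workaround is to key the results on spaCy lemmas rather than surface text; a minimal sketch, assuming the phrases come from doc._.phrases as in the README example (this is an illustration, not a built-in feature):

# collapse singular/plural variants by joining the lemmas of each phrase's
# first chunk, so "donor" and "donors" map to the same key
normalized = {}

for phrase in doc._.phrases:
    key = " ".join(tok.lemma_.lower() for tok in phrase.chunks[0])
    best = normalized.get(key)
    if best is None or phrase.rank > best.rank:
        normalized[key] = phrase

for key, phrase in normalized.items():
    print(key, phrase.rank)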

Expected input JSON format is not documented / clear

I'm currently working my way through the examples, and I found the documentation around the input formats to be quite unclear. Specifically, the schema isn't documented (in my understanding), and is only hinted at in the docs.

From looking at the example input, and reviewing the code, I've deduced that the input should consist of one JSON document per line, however I don't know how to structure my content to best suit this format. For example should I break a larger document into paragraphs, then include one per line in the JSON input? Or should the complete text to be summarised be included in a single 'text' element within the JSON.

Thanks,
J
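
For illustration only: the legacy stage scripts read one JSON object per line, each with "id" and "text" fields, matching the dat/err.json and mih.json samples quoted in other issues on this page. Whether a larger document should be split into paragraphs (one object per line) or kept as a single "text" value is exactly the open question here.

{"id": "231", "text": "im sure its great"}
{"id": "777", "text": "Compatibility of systems of linear constraints over the set of natural numbers. ..."}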

pos_family = tok_tag.lower()[0] IndexError: string index out of range

Hi
I am getting the following error while running pytextrank on around 300+ documents:
Traceback (most recent call last):
  File "........../TextRank/driver_pytxtrank.py", line 10, in <module>
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
  File ".........../anaconda3/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 263, in parse_doc
    grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
  File ".................../anaconda3/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 225, in parse_graf
    pos_family = tok_tag.lower()[0]
IndexError: string index out of range

Also, I encountered an error which said that it can't handle text having more than 1,000,000 characters.
This was the error:

Traceback (most recent call last):
  File "........../TextRank/driver_pytxtrank.py", line 10, in <module>
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
  File "........../anaconda3/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 261, in parse_doc
    grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
  File ".........../anaconda3/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 193, in parse_graf
    doc = spacy_nlp(graf_text)
  File ".........../anaconda3/lib/python3.6/site-packages/spacy/language.py", line 345, in __call__
    max_length=self.max_length))
ValueError: [E088] Text of length 1926491 exceeds maximum of 1000000. The v2.x parser and NER models require roughly 1GB of temporary memory per 100,000 characters in the input. This means long texts may cause memory allocation errors. If you're not using the parser or NER, it's probably safe to increase the nlp.max_length limit. The limit is in number of characters, so you can check whether your inputs are too long by checking len(text).

To fix that I made the following change to pytextrank.py:

if len(graf_text) >= 999999:
    graf_text = graf_text[:999999]

grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)

which was then followed by the errors mentioned above.

Text rank summarization-- sentence score feature request[duplicate]

In the text rank extension, we have a summarization method; but that option would be more effective if you exposed the sentence score as a tuple with the sentences that come out as summarization output. An input parameter could control whether the user just wants a summary or also the sentence scores. Let me know if there is a way to get the sentence scores in the current settings too.
If it is not possible, then I would like to try and add a PR for the same, if allowed.
Thanks.

Example file throws KeyError: 1255

Have not been able to get either the long form (from wiki) or short form (from github readme) files to work successfully.

The file @ https://github.com/DerwenAI/pytextrank/blob/master/example.py throws a KeyError: 1255 when run. Output for this is below.

I have been able to get the example from the github page working but only for very small strings. Anything larger than a few words throws a KeyError with varying number depending on the length of the string.

Can't figure out the issue even using all input (txt files) from the example on the wiki page and changing the spacy version to various releases from 2.0.0 to present.


KeyError Traceback (most recent call last)
in ()
31 text = f.read()
32
---> 33 doc = nlp(text)
34
35 print("pipeline", nlp.pipe_names)

/home/pete/.local/lib/python3.5/site-packages/spacy/language.py in __call__(self, text, disable, component_cfg)
    433         if not hasattr(proc, "__call__"):
    434             raise ValueError(Errors.E003.format(component=type(proc), name=name))
--> 435         doc = proc(doc, **component_cfg.get(name, {}))
    436         if doc is None:
    437             raise ValueError(Errors.E005.format(name=name))

/usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in PipelineComponent(self, doc)
530 """
531 self.doc = doc
--> 532 Doc.set_extension("phrases", force=True, default=self.calc_textrank())
533 Doc.set_extension("textrank", force=True, default=self)
534

/usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in calc_textrank(self)
389
390 for chunk in self.doc.noun_chunks:
--> 391 self.collect_phrases(chunk)
392
393 for ent in self.doc.ents:

/usr/local/lib/python3.5/dist-packages/pytextrank/pytextrank.py in collect_phrases(self, chunk)
345 if key in self.seen_lemma:
346 node_id = list(self.seen_lemma.keys()).index(key)
--> 347 rank = self.ranks[node_id]
348 phrase.sq_sum_rank += rank
349 compound_key.add(key)

KeyError: 1255

IndexError: list index out of range

Hi,

I'm getting the following error when trying to run pytextrank with my own data. Is there a way to fix this?

app_1 | Traceback (most recent call last):
app_1 | File "index.py", line 26, in
app_1 | for rl in pytextrank.normalize_key_phrases(path_stage1, ranks):
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 581, in normalize_key_phrases
app_1 | for rl in collect_entities(sent, ranks, stopwords, spacy_nlp):
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 485, in collect_entities
app_1 | w_ranks, w_ids = find_entity(sent, ranks, ent.text.split(" "), 0)
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
app_1 | return find_entity(sent, ranks, ent, i + 1)
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
app_1 | return find_entity(sent, ranks, ent, i + 1)
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 454, in find_entity
app_1 | return find_entity(sent, ranks, ent, i + 1)
app_1 | [Previous line repeated 137 more times]
app_1 | File "/usr/local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 451, in find_entity
app_1 | w = sent[i + j]
app_1 | IndexError: list index out of range

How to work with large texts for phrase ranking?

I want to rank phrases from many documents at once, but spaCy's 1 million character limit for the nlp object is preventing me from doing so. Also, if I set nlp.max_length higher, then it gets stuck. What would be an efficient way to do this?
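
A minimal sketch of one pragmatic approach, assuming the corpus can be split into pieces that each stay under nlp.max_length and that merging phrase ranks across pieces is acceptable (ranks from separate docs are not strictly comparable, so this is only an approximation):

import spacy
import pytextrank

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("textrank")

def rank_phrases(chunks, top_k=20):
    # run each piece through the pipeline and keep the best rank seen per phrase
    merged = {}
    for doc in nlp.pipe(chunks):
        for phrase in doc._.phrases:
            merged[phrase.text] = max(merged.get(phrase.text, 0.0), phrase.rank)
    return sorted(merged.items(), key=lambda kv: kv[1], reverse=True)[:top_k]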

nlp.evaluate conflicts ?

Hi, thanks a lot for this awesome work 🚀.

I have the following pipeline

>>> nlp.pipeline
[('tagger', <spacy.pipeline.pipes.Tagger at 0x12f31ba10>),
 ('parser', <spacy.pipeline.pipes.DependencyParser at 0x12d2d8bb0>),
 ('ner', <spacy.pipeline.pipes.EntityRecognizer at 0x12ed9a8a0>),
 ('textcat', <spacy.pipeline.pipes.TextCategorizer at 0x12f1d9a90>),
 ('textrank',
  <bound method TextRank.PipelineComponent of <pytextrank.pytextrank.TextRank object at 0x2dd192290>>)]

and when I run

text = "hello world"
annots = {"cats": {"POSITIVE": 1.0, "NEGATIVE": 0.0}}
doc = Doc(nlp.vocab, words=text.split(" "))
scorer = nlp.evaluate([(doc, annots)])

it just hangs.
When I remove the textrank from the pipeline, then it runs fine.

Any idea what I'm doing wrong ?

using "noun_chunks" from custom extension

I wanted to use pytextrank together with spacy_udpipe to get keywords from texts in other languages (see https://stackoverflow.com/questions/59824405/spacy-udpipe-with-pytextrank-to-extract-keywords-from-non-english-text), but I realized that spacy-udpipe somehow "overrides" the original spaCy pipeline, so the noun_chunks are not generated. (Btw: the noun_chunks are created in lang/en/syntax_iterators.py, but that doesn't exist for all languages, so even if it is called it doesn't work, e.g. for the Slovak language.)

Pytextrank keywords are taken from the spacy doc.noun_chunks, but if the noun_chunks are not generated, pytextrank doesn't work.

Sample code:

import spacy_udpipe, spacy, pytextrank
spacy_udpipe.download("en") # download English model
text = "Compatibility of systems of linear constraints over the set of natural numbers. Criteria of compatibility of a system of linear Diophantine equations, strict inequations, and nonstrict inequations are considered. Upper bounds for components of a minimal set of solutions and algorithms of construction of minimal generating sets of solutions for all types of systems are given. These criteria and the corresponding algorithms for constructing a minimal supporting set of solutions can be used in solving all the considered types systems and systems of mixed types."

# using spacy_udpipe
nlp_udpipe = spacy_udpipe.load("en")
tr = pytextrank.TextRank(logger=None)
nlp_udpipe.add_pipe(tr.PipelineComponent, name="textrank", last=True)
doc_udpipe = nlp_udpipe(text)

print("keywords from udpipe processing:")
for phrase in doc_udpipe._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

# loading original spacy model
nlp_spacy = spacy.load("en_core_web_sm")
tr2 = pytextrank.TextRank(logger=None)
nlp_spacy.add_pipe(tr2.PipelineComponent, name="textrank", last=True)
doc_spacy = nlp_spacy(text)

print("keywords from spacy processing:")
for phrase in doc_spacy._.phrases:
    print("{:.4f} {:5d}  {}".format(phrase.rank, phrase.count, phrase.text))
    print(phrase.chunks)

Would it be possible that pytextrank processes the "noun_chunks" (candidates for keywords) from a custom extension (function which uses a Matcher and the result is available e.g. as a doc._.custom_noun_chunks - see explosion/spaCy#3856 )?

Slow while execution

It is pretty slow when executed on medium-sized content.
Are there any tips to improve the performance? Kindly let me know.
It takes around 30 seconds for 512 KB of content in both Python 2 and 3.
Using the latest spaCy as well.

how to use pytextrank for entity linking

The README states that pytextrank can be used for three tasks: 1. phrase extraction, 2. summarization, 3. entity linking.
I see that examples and usage are available for 1 and 2, but not 3.
Can someone share a reference on how it can be used, how its lemma graph can be enriched with domain knowledge, etc.?

pytextrank.text_rank -> 'DiGraph' object has no attribute 'edge'

When trying out the examples, the line graph, ranks = pytextrank.text_rank(path_stage1) throws the following error:

Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/user/.local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 349, in text_rank
    graph = build_graph(json_iter(path))
  File "/home/user/.local/lib/python3.6/site-packages/pytextrank/pytextrank.py", line 308, in build_graph
    graph.edge[pair[0]][pair[1]]["weight"] += 1.0
AttributeError: 'DiGraph' object has no attribute 'edge'

https://github.com/ceteri/pytextrank/blob/181ea41375d29922eb96768cf6550e57a77a0c95/pytextrank/pytextrank.py#L308

Red Hat Enterprise Linux Server release 7.3 (Maipo)
core-4.1-amd64:core-4.1-noarch
python 3.6.4
pytextrank==1.1.0
spacy==2.0.11
graphviz==0.8.3
networkx==2.1
decorator==4.3.0

path_stage1 file contents:

["777", "7b982e54fa330a6854a0ed5397d49223fdc70645", [[1, "Compatibility", "compatibility", "NN", 1, 0], [0, "of", "of", "IN", 0, 1], [2, "systems", "system", "NNS", 1, 2], [0, "of", "of", "IN", 0, 3], [3, "linear", "linear", "JJ", 1, 4], [4, "constraints", "constraint", "NNS", 1, 5], [0, "over", "over", "IN", 0, 6], [0, "the", "the", "DT", 0, 7], [5, "set", "set", "NN", 1, 8], [0, "of", "of", "IN", 0, 9], [6, "natural", "natural", "JJ", 1, 10], [7, "numbers", "number", "NNS", 1, 11], [0, ".", ".", ".", 0, 12]]]
834
["777", "c1cd88e3fec11b0772c6be12b7504f332c0b3d8f", [[8, "Criteria", "criterion", "NNS", 1, 13], [0, "of", "of",
"IN", 0, 14], [1, "compatibility", "compatibility", "NN", 1, 15], [0, "of", "of", "IN", 0, 16], [0, "a", "a", "DT", 0, 17], [2, "system", "system", "NN", 1, 18], [0, "of", "of", "IN", 0, 19], [3, "linear", "linear", "JJ", 1, 20], [9, "Diophantine", "diophantine", "NNP", 1, 21], [10, "equations", "equation", "NNS", 1, 22], [0, ",", ",", ".", 0, 23], [11, "strict", "strict", "JJ", 1, 24], [12, "inequations", "inequation", "NNS", 1, 25], [0, ",", ",", ".", 0, 26], [0, "and", "and", "CC", 0, 27], [13, "nonstrict", "nonstrict", "NN", 1, 28], [12, "inequations", "inequation", "NNS", 1, 29], [14, "are", "be", "VBP", 1, 30], [15, "considered", "consider", "VBN", 1, 31], [0, ".", ".", ".", 0, 32]]]
1093
["777", "cb9235d7c8b21321b88462fca3a0480e29aa8ec7", [[16, "Upper", "upper", "NNP", 1, 33], [17, "bounds", "bound", "VBZ", 1, 34], [0, "for", "for", "IN", 0, 35], [18, "components", "component", "NNS", 1, 36], [0, "of", "of", "IN", 0, 37], [0, "a", "a", "DT", 0, 38], [19, "minimal", "minimal", "JJ", 1, 39], [5, "set", "set", "NN", 1, 40], [0, "of", "of", "IN", 0, 41], [20, "solutions", "solution", "NNS", 1, 42], [0, "and", "and", "CC", 0, 43], [21,
"algorithms", "algorithm", "NNS", 1, 44], [0, "of", "of", "IN", 0, 45], [22, "construction", "construction", "NN", 1, 46], [0, "of", "of", "IN", 0, 47], [19, "minimal", "minimal", "JJ", 1, 48], [23, "generating", "generating", "NN", 1, 49], [5, "sets", "set", "NNS", 1, 50], [0, "of", "of", "IN", 0, 51], [20, "solutions", "solution", "NNS", 1, 52], [0, "for", "for", "IN", 0, 53], [0, "all", "all", "DT", 0, 54], [24, "types", "type", "NNS", 1, 55], [0, "of", "of", "IN", 0, 56], [2, "systems", "system", "NNS", 1, 57], [14, "are", "be", "VBP", 1, 58], [25, "given", "give", "VBN", 1, 59], [0, ".", ".", ".", 0, 60]]]
1179
["777", "ae690a522dbc83b7c8447b5bf863ffa95d681ce0", [[0, "These", "these", "DT", 0, 61], [8, "criteria", "criterion", "NNS", 1, 62], [0, "and", "and", "CC", 0, 63], [0, "the", "the", "DT", 0, 64], [26, "corresponding", "correspond", "VBG", 1, 65], [21, "algorithms", "algorithm", "NNS", 1, 66], [0, "for", "for", "IN", 0, 67], [27, "constructing", "construct", "VBG", 1, 68], [0, "a", "a", "DT", 0, 69], [19, "minimal", "minimal", "JJ", 1, 70], [28, "supporting", "support", "VBG", 1, 71], [5, "set", "set", "NN", 1, 72], [0, "of", "of", "IN", 0, 73], [20, "solutions", "solution", "NNS", 1, 74], [0, "can", "can", "MD", 0, 75], [14, "be", "be", "VB", 1, 76], [29, "used", "use", "VBN", 1, 77], [0, "in", "in", "IN", 0, 78], [30, "solving", "solve", "VBG", 1, 79], [0, "all", "all", "PDT", 0, 80], [0, "the", "the", "DT", 0, 81], [15, "considered", "consider", "VBN", 1, 82], [24, "types", "type", "NNS", 1, 83], [2, "systems", "system", "NNS", 1, 84], [0, "and", "and", "CC", 0, 85], [2, "systems", "system", "NNS",
1, 86], [0, "of", "of", "IN", 0, 87], [31, "mixed", "mixed", "JJ", 1, 88], [24, "types", "type", "NNS", 1, 89], [0, ".", ".", ".", 0, 90]]]

Can't use PyTextRank in parallel?

I am trying to summarize many documents and am having trouble improving performance because I can't figure out how to use the PyTextRank extension for document summarization in parallel among multiple processes. The only method for parallel processing listed in the docs shows the use of nlp.pipe() (Language.pipe). However, you can't call PyTextRank on a document inside nlp.pipe(), it only allows itself to be used via Language.add_pipe(). I've looked at the code for pytextrank.pytextrank to verify this.

How do you use a Pipeline in parallel without using nlp.pipe()? How do you make nlp.add_pipe() parallel?
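
A minimal sketch of one way around this, assuming per-process pipelines are acceptable: each worker builds its own spaCy pipeline with PyTextRank attached (shown with the current add_pipe("textrank") registration, which postdates this issue), so no pipeline state is shared between processes:

import multiprocessing as mp
import spacy
import pytextrank

_nlp = None

def init_worker():
    # each worker process gets its own pipeline instance
    global _nlp
    _nlp = spacy.load("en_core_web_sm")
    _nlp.add_pipe("textrank")

def top_phrases(text):
    doc = _nlp(text)
    return [p.text for p in doc._.phrases[:10]]

if __name__ == "__main__":
    texts = ["first document ...", "second document ..."]
    with mp.Pool(processes=2, initializer=init_worker) as pool:
        print(pool.map(top_phrases, texts))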

I get an error in the example "Stage 1"

Trying to run "stage 1" of the example (Python 3.6.4), I get error "call() got an unexpected keyword argument 'parse'".

Even the following gives me the error:
path_stage0 = 'dat/mih.json'

for graf in parse_doc(json_iter(path_stage0)):
    print(graf)
Error details:
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File ".../pytextrank-master/pytextrank/pytextrank.py", line 261, in parse_doc
    grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
  File ".../pytextrank-master/pytextrank/pytextrank.py", line 193, in parse_graf
    doc = spacy_nlp(graf_text, parse=True)
TypeError: __call__() got an unexpected keyword argument 'parse'

why does the algorithm filter out the verbs?

I do not know why the algorithm filters out the verbs.

def limit_keyphrases (json_iter, phrase_limit=10):
    .....
    for rl in lex:
        if rl.pos[0] != "v":
            if (used > phrase_limit) or (rl.rank < rank_thresh):
                return

summarization as a pipeline task

Hi, so thanks for writing this awesome package. But is there any direct way to use summarization, as there are pipelines for other tasks? The only thing I have found about summarization is the explain_summ.ipynb notebook. Please mention it if there is one.

min_phrases are not minimum spans for each phrase

Hi,

The explain_algo.ipynb mentioned:

Since noun chunks can be expressed in different ways (e.g., they may have articles or prepositions), we need to find a minimum span for each phrase based on combinations of lemmas...

But the code below does not do what it states.

import operator

min_phrases = {}

for compound_key, rank_tuples in phrases.items():
    l = list(rank_tuples)
    l.sort(key=operator.itemgetter(1), reverse=True)
    
    phrase, rank = l[0]
    count = counts[compound_key]
    
    min_phrases[phrase] = (rank, count)

The min_phrases did not seem to change the phrases at all, let alone find a minimum span for each phrase.

sentence scores for summarization

In the text rank extension, we have a summarization method; but that option would be more effective if you exposed the sentence score as a tuple with the sentences that come out as summarization output. An input parameter could control whether the user just wants a summary or also the sentence scores. Let me know if there is a way to get the sentence scores in the current settings too.
If it is not possible, then I would like to try and add a PR for the same, if allowed.
Thanks.

How to use this?

Hi there, I've been looking at your code and example for a long time and I still have no idea how to use this.

I have documents in string format, what JSON format should they have if I want to use the stages as in the examples?

I find there's a crucial piece of information missing in the documentation, which is how to use the functionality of this package with a simple document in string format (or a list of strings representing sentences), as I don't know beforehand what JSON format I have to convert my text to in order to use the stage pipeline.

Cheers

AttributeError: 'DiGraph' object has no attribute 'edge'

Fixed by changing the code in pytextrank.py (line 307) from:

try:
    graph.edge[pair[0]][pair[1]]["weight"] += 1.0
except KeyError:
    graph.add_edge(pair[0], pair[1], weight=1.0)

to:

if "edge" in dir(graph):
    graph.edge[pair[0]][pair[1]]["weight"] += 1.0
else:
    graph.add_edge(pair[0], pair[1], weight=1.0)

KeyError: 'graf'

Hi guys,

I'd like to use pytextrank for keywords extraction from single sentences (questions).

After the installation and downloading a language model I ran pytextrank in IPython with success.

Next, I decided to run Stage 2 from your example.ipynb with the normalize_key_phrases feature.

After that I received KeyError: 'graf' exception.

Details:

In [4]: graph, ranks = pytextrank.text_rank('mih.json')
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-4-ce8762751470> in <module>()
----> 1 graph, ranks = pytextrank.text_rank('mih.json')

/Users/katharsis/workspace/keywords_extractor/src/pytextrank.pyc in text_rank(path)
    349     run the TextRank algorithm
    350     """
--> 351     graph = build_graph(json_iter(path))
    352     ranks = nx.pagerank(graph)
    353

/Users/katharsis/workspace/keywords_extractor/src/pytextrank.pyc in build_graph(json_iter)
    299         print meta.keys()
    300
--> 301         for pair in get_tiles(map(WordNode._make, meta["graf"])):
    302             if DEBUG:
    303                 print(pair)

KeyError: 'graf'

I saw that mih.json does not contain a "graf" key.
It includes "id" and "text" only.
Is that related somehow?

PS. I'm using Python 2.7 on Mac OSx.

Keyword extraction

Hi there, I'm working on a project extracting keywords from a German text. Is there a tutorial on how to extract keywords using pytextrank?

Best regards,

Serious bug in get_tiles?

I believe there is a serious bug at line 284 in pytextrank.py.

Indeed once the unneeded words are filtered out (keeps = list(filter(lambda w: w.word_id > 0, graf))) the word indices have holes.

The corrected line should be

if (j - i) < size:
    yield (w0.root, w1.root)

spacy call error: __call__() got an unexpected keyword argument 'parse'

~/.local/lib/python3.6/site-packages/pytextrank/pytextrank.py in parse_graf(doc_id, graf_text, base_idx, spacy_nlp)
    191     markup = []
    192     new_base_idx = base_idx
--> 193     doc = spacy_nlp(graf_text, parse=True)
    194 
    195     for span in doc.sents:

TypeError: __call__() got an unexpected keyword argument 'parse'

spacy.__version__ == 2.0.11

It seems that spacy_nlp(graf) no longer has the parse option (it now only has disable).

doc._.textrank.summary bug when used on multiple docs

Hi,

I want to produce summaries of a list of spacy doc objects.

for doc in docs:
    summary = [str(sent) for sent in doc._.textrank.summary(limit_phrases=15, limit_sentences=1)]
    summary = " ".join(summary).replace("\n", " ")
    print(summary)

I would expect the code above to print a summary of each separate doc object. Instead, it only prints the same summary several times (the summary from the last doc in the iteration).

I can solve this bug with the following change:

for doc in docs:
    intermediate_doc = nlp(doc.text)
    summary = [str(sent) for sent in intermediate_doc._.textrank.summary(limit_phrases=15, limit_sentences=1)]
    summary = " ".join(summary).replace("\n", " ")
    print(summary)

When I run this changed code, it prints a separate summary for each doc object (the expected behaviour).

I think that this bug comes from how the summary method is attached to the doc objects:

def PipelineComponent (self, doc):
    ...
    Doc.set_extension("textrank", force=True, default=self)
    ...

(see pytextrank.py https://github.com/DerwenAI/pytextrank/blob/master/pytextrank/pytextrank.py)

If I understand correctly, the entire "self" object might be written to the doc extension "textrank" at each iteration. This might mean that it gets overwritten every time.

I might be wrong though, would be interested in what you think!

Dependency on Textblob-0.11.1?

Tried to install and run the example as detailed in the README, and encountered this error:

Traceback (most recent call last):
  File "stage1.py", line 5, in <module>
    import textrank
  File "/Users/mattkohl/Development/pytextrank/textrank.py", line 13, in <module>
    import textblob_aptagger as tag
ImportError: No module named 'textblob_aptagger'

Fixed by updating textblob as detailed here

pip install -U git+https://github.com/sloria/textblob-aptagger.git@dev

Error while running example.py

While running the example, the following error occurred; any help will be appreciated.

doc = nlp(text)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/vikas/PycharmProjects/NLP-NMT/venv/lib/python3.5/site-packages/spacy/language.py", line 435, in __call__
    doc = proc(doc, **component_cfg.get(name, {}))
  File "/home/vikas/github_projects/pytextrank/pytextrank/pytextrank.py", line 532, in PipelineComponent
    Doc.set_extension("phrases", force=True, default=self.calc_textrank())
  File "/home/vikas/github_projects/pytextrank/pytextrank/pytextrank.py", line 391, in calc_textrank
    self.collect_phrases(chunk)
  File "/home/vikas/github_projects/pytextrank/pytextrank/pytextrank.py", line 347, in collect_phrases
    rank = self.ranks[node_id]
KeyError: 29

Temporarily disable NER by default

We're seeing problems with spaCy NER use in PyTextRank, an intermittent bug that appears to be infinite recursion.

Will temporarily disable NER use, by default -- leaving an option for people who want to override the settings in the source code.

This needs much more debugging, and probably an overhaul to use spaCy spans in lieu of the named tuples for tracking keyphrases.

ZeroDivisionError in summary method

In line 487 of the pytextrank.py file in the package, the unit vector is normalized by summing all its components and dividing each element by the sum. For some reason, with my content a unit vector occurs as [0,0...0], i.e. a zero vector. In such a case, a zero division error occurs. I think this is a rare situation, but it should be handled in the code; i.e. the normalization of the unit vector should guard against a zero sum (e.g. with a try-except). Correct me if I am wrong.
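
A minimal sketch of the guard being suggested (illustrative only, not the library's actual code; unit_vector stands for the list of component values described above):

sum_ranks = sum(unit_vector)

if sum_ranks > 0.0:
    unit_vector = [rank / sum_ranks for rank in unit_vector]
else:
    # zero vector: leave the components at 0.0 rather than dividing by zero
    unit_vector = [0.0 for _ in unit_vector]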

Can't find model 'en'.

Windows 10
Python 3.6
running Jupyter notebook in virtualenv
installed pytextrank with pip successfully
ran -m spacy download en
...
Linking successful
c:\ml\env\lib\site-packages\en_core_web_sm -->
c:\ml\env\lib\site-packages\spacy\data\en
You can now load the model via spacy.load('en')

Trying to follow example:

with open(path_stage1, 'w') as f:
    for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
        ...

I get:
OSError Traceback (most recent call last)
in ()
3
4 with open(path_stage1, 'w') as f:
----> 5 for graf in pytextrank.parse_doc(pytextrank.json_iter(path_stage0)):
6 f.write("%s\n" % pytextrank.pretty_print(graf._asdict()))
7 # to view output in this notebook

c:\ml\env\lib\site-packages\pytextrank\pytextrank.py in parse_doc(json_iter)
259 print("graf_text:", graf_text)
260
--> 261 grafs, new_base_idx = parse_graf(meta["id"], graf_text, base_idx)
262 base_idx = new_base_idx
263

c:\ml\env\lib\site-packages\pytextrank\pytextrank.py in parse_graf(doc_id, graf_text, base_idx, spacy_nlp)
185 if not spacy_nlp:
186 if not SPACY_NLP:
--> 187 SPACY_NLP = spacy.load("en")
188
189 spacy_nlp = SPACY_NLP

c:\ml\env\lib\site-packages\spacy\__init__.py in load(name, **overrides)
19 if depr_path not in (True, False, None):
20 deprecation_warning(Warnings.W001.format(path=depr_path))
---> 21 return util.load_model(name, **overrides)
22
23

c:\ml\env\lib\site-packages\spacy\util.py in load_model(name, **overrides)
117 elif hasattr(name, 'exists'): # Path or Path-like to model data
118 return load_model_from_path(name, **overrides)
--> 119 raise IOError(Errors.E050.format(name=name))
120
121

OSError: [E050] Can't find model 'en'. It doesn't seem to be a shortcut link, a Python package or a valid path to a data directory.

Is parser a relevant pipeline component beyond noun chunking?

Hello,

I'm using pytextrank with texts in Portuguese. Thanks to issue #54 I'm able to use POS information to produce some basic noun chunking, instead of syntactic information from the parser.

My question is: in this case where I'm producing chunks from POS, am I losing something if I disable the parser and create a new pipeline component just for chunking? Is there other relevant information from the parser being used?
