Cannot use stopwords in PyTextRank about pytextrank HOT 3 CLOSED

phongtnit commented on June 18, 2024 1

Cannot use stopwords in PyTextRank

Comments (3)

Ankush-Chander commented on June 18, 2024 2

Hi @phongtnit
Stopword settings are matched on the lemma and POS tag of each token, not on the exact word form or on whole phrases.
So listing the lemma forms in the stopword config should get you the desired outcome:

nlp.add_pipe("textrank", config={"stopwords": {
                                                "help": ["NOUN"],
                                                "error": ["NOUN"],
                                                "message": ["NOUN"],
                                                "difference": ["NOUN"],
                                                "need": ["NOUN"]
                                               }})
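
To make the matching rule concrete, here is a minimal sketch of a lemma-plus-POS check in plain Python. It is an illustration of the rule described above, not pytextrank's actual code; the function name `is_stopword` and the lowercasing of the lemma are assumptions made for the example.

```python
from typing import Dict, List


def is_stopword(lemma: str, pos: str, stopwords: Dict[str, List[str]]) -> bool:
    """Return True only if the lemma is listed AND the POS tag is one of its tags."""
    key = lemma.lower().strip()  # assumption: lemmas are compared case-insensitively
    return key in stopwords and pos in stopwords[key]


# Same shape as the "stopwords" config above: lemma -> list of POS tags.
stopwords = {"help": ["NOUN"], "need": ["NOUN"]}

print(is_stopword("help", "NOUN", stopwords))    # True: lemma and tag both match
print(is_stopword("helpful", "ADJ", stopwords))  # False: "helpful" is a different lemma
print(is_stopword("need", "VERB", stopwords))    # False: lemma matches, tag does not
```

Under this rule an entry only filters a token when both the lemma and the tag line up, which is why a word can slip through even though a similar-looking entry is in the config.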

Ankush-Chander commented on June 18, 2024 2

@phongtnit
Sometimes the model performs POS tagging and lemmatization in unexpected ways.
For example:
"helpful" may not be lemmatized to "help".
"need" may be tagged as VERB instead of NOUN.

Logging the lemma and POS tag of each token will help:

doc = nlp(source_text)
print([(token.lemma_, token.pos_) for token in doc])

Depending on what you see, you can either switch to the latest spaCy model or adjust the stopword config to match the logged lemmas and tags.
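
One way to adjust the config is to build it from the logged pairs instead of guessing the tags. The sketch below assumes you already have a list of `(lemma, pos)` tuples like the one printed above; the helper name `build_stopword_config` and the sample pairs are hypothetical, and the real tags depend on the model you load.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_stopword_config(
    tagged: Iterable[Tuple[str, str]], unwanted: Set[str]
) -> Dict[str, List[str]]:
    """Collect, for each unwanted lemma, the POS tags the model actually assigned."""
    found = defaultdict(set)
    for lemma, pos in tagged:
        key = lemma.lower().strip()
        if key in unwanted:
            found[key].add(pos)
    # Convert sets to sorted lists so the result is plain, JSON-like data.
    return {lemma: sorted(tags) for lemma, tags in found.items()}


# Hypothetical tagger output, made up for illustration only.
tagged = [
    ("need", "VERB"),
    ("helpful", "ADJ"),
    ("error", "NOUN"),
    ("message", "NOUN"),
    ("Apache", "PROPN"),
]
unwanted = {"need", "helpful", "error", "message"}

config = build_stopword_config(tagged, unwanted)
print(config)
# {'need': ['VERB'], 'helpful': ['ADJ'], 'error': ['NOUN'], 'message': ['NOUN']}
```

The resulting dict has the same lemma-to-tags shape as the "stopwords" value passed to nlp.add_pipe("textrank", config=...) earlier in this thread.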

phongtnit commented on June 18, 2024

@Ankush-Chander Thanks for the information.

I changed the stopwords line to:

nlp.add_pipe("textrank", config={"stopwords": {
                                                "help": ["NOUN"],
                                                "error": ["NOUN"],
                                                "message": ["NOUN"],
                                                "difference": ["NOUN"],
                                                "need": ["NOUN"]
                                               }})

The result still includes the word "Need" and the phrase "helpful error message". The full output is:

ic| phrase: Phrase(text='mod_proxy_fcgi', chunks=[mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi], count=7, rank=0.18637394284616607)
ic| phrase: Phrase(text='Apache httpd', chunks=[Apache httpd], count=1, rank=0.1717270122274852)
ic| phrase: Phrase(text='Unix domain sockets', chunks=[Unix domain sockets], count=1, rank=0.15063099262272117)
ic| phrase: Phrase(text='mod_fcgi', chunks=[mod_fcgi], count=1, rank=0.1427428470026225)
ic| phrase: Phrase(text='Need', chunks=[Need], count=1, rank=0.13539946891655025)
ic| phrase: Phrase(text='mod_perl', chunks=[mod_perl], count=1, rank=0.12859556242529588)
ic| phrase: Phrase(text='helpful error message', chunks=[helpful error message], count=1, rank=0.1268035421291434)
ic| phrase: Phrase(text='mod_fcgid', chunks=[mod_fcgid, mod_fcgid], count=2, rank=0.1097222239822189)
ic| phrase: Phrase(text='Apache', chunks=[Apache], count=1, rank=0.10926117011819174)
ic| phrase: Phrase(text='mod_fastcgi', chunks=[mod_fastcgi, mod_fastcgi], count=2, rank=0.07731336160506616)
['mod_proxy_fcgi', 'Apache httpd', 'Unix domain sockets', 'mod_fcgi', 'Need', 'mod_perl', 'helpful error message', 'mod_fcgid', 'Apache', 'mod_fastcgi']
