Comments (3)
Hi @phongtnit
Stopword settings are matched on the lemma and POS tag of each token,
not on the exact word form or on exact phrases.
So using lemma forms in the stopword config should get you the desired outcome:
nlp.add_pipe("textrank", config={"stopwords": {
    "help": ["NOUN"],
    "error": ["NOUN"],
    "message": ["NOUN"],
    "difference": ["NOUN"],
    "need": ["NOUN"]
}})
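To make the matching rule concrete, here is a minimal sketch in plain Python. It is not pytextrank's actual code: `is_stopword` is a hypothetical helper, and lowercasing the lemma before the lookup is an assumption. It only illustrates that an entry applies when both the lemma and the POS tag match.

```python
from typing import Dict, List

# Same shape as the "stopwords" config passed to nlp.add_pipe("textrank", ...).
STOPWORDS: Dict[str, List[str]] = {
    "help": ["NOUN"],
    "error": ["NOUN"],
    "message": ["NOUN"],
    "difference": ["NOUN"],
    "need": ["NOUN"],
}


def is_stopword(lemma: str, pos: str,
                stopwords: Dict[str, List[str]] = STOPWORDS) -> bool:
    """Hypothetical helper: True only if the lemma AND its POS tag both match."""
    key = lemma.lower().strip()  # assumption: lemmas are compared in lowercase
    return key in stopwords and pos in stopwords[key]


print(is_stopword("help", "NOUN"))    # True: lemma and tag both listed
print(is_stopword("helpful", "ADJ"))  # False: "helpful" is a different lemma
print(is_stopword("need", "VERB"))    # False: lemma listed, but only as NOUN
```

Under this rule, a word is filtered only when the tagger and lemmatizer produce exactly one of the configured (lemma, POS) combinations.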
from pytextrank.
@phongtnit
Sometimes the model performs POS tagging and lemmatization in unexpected ways.
For example, "helpful" may not be lemmatized to "help", and "need" may be tagged as VERB instead of NOUN.
Logging the lemma and POS tag of each token will help:
doc = nlp(source_text)
print([(token.lemma_, token.pos_) for token in doc])
Based on what you see, you can either switch to the latest spaCy model or modify the stopword config to match the lemmas and tags the model actually produces.
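One way to modify the config is to build it from the logged pairs, so that each unwanted lemma is listed with every POS tag the model assigned to it. The sketch below assumes a hypothetical helper, `build_stopwords`, and uses made-up (lemma, POS) pairs for illustration; the real pairs come from the print(...) call above.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, Set, Tuple


def build_stopwords(pairs: Iterable[Tuple[str, str]],
                    unwanted: Set[str]) -> Dict[str, List[str]]:
    """Hypothetical helper: collect every POS tag observed for each unwanted lemma."""
    tags = defaultdict(set)
    for lemma, pos in pairs:
        key = lemma.lower().strip()  # assumption: config keys are lowercase lemmas
        if key in unwanted:
            tags[key].add(pos)
    return {lemma: sorted(pos_tags) for lemma, pos_tags in tags.items()}


# Illustrative pairs only, not real model output.
observed = [
    ("need", "VERB"),
    ("helpful", "ADJ"),
    ("error", "NOUN"),
    ("message", "NOUN"),
    ("Apache", "PROPN"),
]

config = {"stopwords": build_stopwords(observed, {"need", "helpful", "error", "message"})}
print(config)
```

The resulting `config` dictionary has the same shape as the one passed to nlp.add_pipe("textrank", config=...), so it can be used in its place once the pairs come from your own document.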
@Ankush-Chander Thanks for the information.
I changed the stopwords line to:
nlp.add_pipe("textrank", config={"stopwords": {
    "help": ["NOUN"],
    "error": ["NOUN"],
    "message": ["NOUN"],
    "difference": ["NOUN"],
    "need": ["NOUN"]
}})
The result still includes the word "Need" and the phrase "helpful error message". The full output is:
ic| phrase: Phrase(text='mod_proxy_fcgi', chunks=[mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi, mod_proxy_fcgi], count=7, rank=0.18637394284616607)
ic| phrase: Phrase(text='Apache httpd', chunks=[Apache httpd], count=1, rank=0.1717270122274852)
ic| phrase: Phrase(text='Unix domain sockets', chunks=[Unix domain sockets], count=1, rank=0.15063099262272117)
ic| phrase: Phrase(text='mod_fcgi', chunks=[mod_fcgi], count=1, rank=0.1427428470026225)
ic| phrase: Phrase(text='Need', chunks=[Need], count=1, rank=0.13539946891655025)
ic| phrase: Phrase(text='mod_perl', chunks=[mod_perl], count=1, rank=0.12859556242529588)
ic| phrase: Phrase(text='helpful error message', chunks=[helpful error message], count=1, rank=0.1268035421291434)
ic| phrase: Phrase(text='mod_fcgid', chunks=[mod_fcgid, mod_fcgid], count=2, rank=0.1097222239822189)
ic| phrase: Phrase(text='Apache', chunks=[Apache], count=1, rank=0.10926117011819174)
ic| phrase: Phrase(text='mod_fastcgi', chunks=[mod_fastcgi, mod_fastcgi], count=2, rank=0.07731336160506616)
['mod_proxy_fcgi', 'Apache httpd', 'Unix domain sockets', 'mod_fcgi', 'Need', 'mod_perl', 'helpful error message', 'mod_fcgid', 'Apache', 'mod_fastcgi']