liaad / yake

Single-document unsupervised keyword extraction

Home Page: https://liaad.github.io/yake

License: Other

Languages: Python 92.73%, Dockerfile 1.21%, Shell 6.06%
Topics: keyword-extraction, unsupervised-approach, corpus-independent, domain-and-language-independent, single-document

yake's Introduction

Yet Another Keyword Extractor (Yake)

Unsupervised Approach for Automatic Keyword Extraction using Text Features.

YAKE! is a lightweight, unsupervised, automatic keyword extraction method that relies on statistical text features extracted from single documents to select the most important keywords of a text. Our system does not need to be trained on a particular set of documents, nor does it depend on dictionaries, external corpora, text size, language, or domain. To demonstrate the merits and significance of our proposal, we compare it against ten state-of-the-art unsupervised approaches (TF.IDF, KP-Miner, RAKE, TextRank, SingleRank, ExpandRank, TopicRank, TopicalPageRank, PositionRank and MultipartiteRank) and one supervised method (KEA). Experimental results carried out on twenty datasets (see the Benchmark section below) show that our method significantly outperforms the state of the art on collections of different sizes, languages, and domains. In addition to the Python package described here, we also make available a demo, an API and a mobile app.

Main Features

  • Unsupervised approach
  • Corpus-Independent
  • Domain and Language Independent
  • Single-Document

Benchmark

For benchmark results, check out our paper published in the Information Sciences journal (see the References section below).

Rationale

Extracting keywords from texts has become a challenge for individuals and organizations as information grows in complexity and size. The need to automate this task so that texts can be processed in a timely and adequate manner has led to the emergence of automatic keyword extraction tools. Despite these advances, there is a clear lack of multilingual online tools to automatically extract keywords from single documents. YAKE! is a novel feature-based system for multilingual keyword extraction which supports texts of different sizes, domains, and languages. Unlike other approaches, YAKE! does not rely on dictionaries or thesauri, nor is it trained on any corpus. Instead, it follows an unsupervised approach built upon features extracted from the text itself, making it applicable to documents written in different languages without the need for further knowledge. This is beneficial for a large number of tasks and for situations where access to training corpora is either limited or restricted.

Where can I find YAKE!?

YAKE! is available online [http://yake.inesctec.pt], on Google Play, as an open source Python package [https://github.com/LIAAD/yake] and as an API.

Installing YAKE!

There are three installation alternatives.

  • To run YAKE! from the command line (say, to integrate it into a script) without an HTTP server on top, use our simple YAKE! Docker image. This container runs a text-extraction command and then exits.
  • To run YAKE! as an HTTP server featuring a RESTful API (say, to integrate it into a web application or to host your own YAKE!), use our RESTful API server image. This container/server runs in the background.
  • To install YAKE! straight "on the metal", or to integrate it into your Python app, install the package and its dependencies.

Option 1. YAKE as a CLI utility inside a Docker container

First, install Docker. Ubuntu users will find a complete installation script in the "How to install Docker" section below.

Then, run:

docker run liaad/yake:latest -ti "Caffeine is a central nervous system (CNS) stimulant of the methylxanthine class.[10] It is the world's most widely consumed psychoactive drug. Unlike many other psychoactive substances, it is legal and unregulated in nearly all parts of the world. There are several known mechanisms of action to explain the effects of caffeine. The most prominent is that it reversibly blocks the action of adenosine on its receptor and consequently prevents the onset of drowsiness induced by adenosine. Caffeine also stimulates certain portions of the autonomic nervous system."

Example text from Wikipedia

Option 2. REST API Server in a Docker container

This installation provides a mirror of the original YAKE! REST API.

docker run -p 5000:5000 -d liaad/yake-server:latest

After it starts up, the container will run in the background, at http://127.0.0.1:5000. To access the YAKE! API documentation, go to http://127.0.0.1:5000/apidocs/.

You can test the RESTful API using curl:

curl -X POST "http://localhost:5000/yake/" -H "accept: application/json" -H "Content-Type: application/json" \
-d @- <<'EOF'
{
  "language": "en",
  "max_ngram_size": 3,
  "number_of_keywords": 10,
  "text": "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning competitions. Details about the transaction remain somewhat vague , but given that Google is hosting its Cloud Next conference in San Francisco this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, was founded by Goldbloom and Ben Hamner in 2010. The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, it has managed to stay well ahead of them by focusing on its specific niche. The service is basically the de facto home for running data science and machine learning competitions. With Kaggle, Google is buying one of the largest and most active communities for data scientists ..."
}
EOF

Example text from TechCrunch
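
The same request can be made from Python; the sketch below mirrors the curl call above using the third-party requests library, assumes the Option 2 server is running on localhost:5000, and shortens the sample text for brevity:

import requests

# Mirror of the curl example above; assumes the yake-server container
# from Option 2 is listening on localhost:5000.
payload = {
    "language": "en",
    "max_ngram_size": 3,
    "number_of_keywords": 10,
    "text": "Sources tell us that Google is acquiring Kaggle, a platform "
            "that hosts data science and machine learning competitions.",
}
response = requests.post("http://localhost:5000/yake/", json=payload)
print(response.json())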

Option 3. Standalone Installation (for development or integration)

Requirements

Python3

Installation

To install Yake using pip:

pip install git+https://github.com/LIAAD/yake

To upgrade using pip:

pip install git+https://github.com/LIAAD/yake --upgrade

Usage (Command line)

How to use YAKE! from the command line:

Usage: yake [OPTIONS]

Options:
	-ti, --text_input TEXT          Input text, surrounded by single quotes (')
	-i, --input_file TEXT           Input file
	-l, --language TEXT             Language
	-n, --ngram-size INTEGER        Max size of the ngram
	-df, --dedup-func [leve|jaro|seqm]
	                                Deduplication function
	-dl, --dedup-lim FLOAT          Deduplication threshold
	-ws, --window-size INTEGER      Window size
	-t, --top INTEGER               Number of keyphrases to extract
	-v, --verbose                   Print detailed information (such as the score)
	--help                          Show this message and exit
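
For example, to extract the ten best 3-gram keyphrases from a file (input.txt is a hypothetical file name here), with the deduplication and window options spelled out explicitly:

yake -i input.txt -l en -n 3 -df seqm -dl 0.9 -ws 1 -t 10 -v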

Usage (Python)

How to use it in Python:

import yake

text = "Sources tell us that Google is acquiring Kaggle, a platform that hosts data science and machine learning "\
"competitions. Details about the transaction remain somewhat vague, but given that Google is hosting its Cloud "\
"Next conference in San Francisco this week, the official announcement could come as early as tomorrow. "\
"Reached by phone, Kaggle co-founder CEO Anthony Goldbloom declined to deny that the acquisition is happening. "\
"Google itself declined 'to comment on rumors'. Kaggle, which has about half a million data scientists on its platform, "\
"was founded by Goldbloom  and Ben Hamner in 2010. "\
"The service got an early start and even though it has a few competitors like DrivenData, TopCoder and HackerRank, "\
"it has managed to stay well ahead of them by focusing on its specific niche. "\
"The service is basically the de facto home for running data science and machine learning competitions. "\
"With Kaggle, Google is buying one of the largest and most active communities for data scientists - and with that, "\
"it will get increased mindshare in this community, too (though it already has plenty of that thanks to Tensorflow "\
"and other projects). Kaggle has a bit of a history with Google, too, but that's pretty recent. Earlier this month, "\
"Google and Kaggle teamed up to host a $100,000 machine learning competition around classifying YouTube videos. "\
"That competition had some deep integrations with the Google Cloud Platform, too. Our understanding is that Google "\
"will keep the service running - likely under its current name. While the acquisition is probably more about "\
"Kaggle's community than technology, Kaggle did build some interesting tools for hosting its competition "\
"and 'kernels', too. On Kaggle, kernels are basically the source code for analyzing data sets and developers can "\
"share this code on the platform (the company previously called them 'scripts'). "\
"Like similar competition-centric sites, Kaggle also runs a job board, too. It's unclear what Google will do with "\
"that part of the service. According to Crunchbase, Kaggle raised $12.5 million (though PitchBook says it's $12.75) "\
"since its   launch in 2010. Investors in Kaggle include Index Ventures, SV Angel, Max Levchin, Naval Ravikant, "\
"Google chief economist Hal Varian, Khosla Ventures and Yuri Milner "

Assuming default parameters:

kw_extractor = yake.KeywordExtractor()
keywords = kw_extractor.extract_keywords(text)

for kw in keywords:
	print(kw)

Specifying parameters:

language = "en"
max_ngram_size = 3
deduplication_threshold = 0.9
deduplication_algo = 'seqm'
windowSize = 1
numOfKeywords = 20

custom_kw_extractor = yake.KeywordExtractor(lan=language, n=max_ngram_size, dedupLim=deduplication_threshold, dedupFunc=deduplication_algo, windowsSize=windowSize, top=numOfKeywords, features=None)
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)

Output

The lower the score, the more relevant the keyword is.

('google', 0.026580863364597897)
('kaggle', 0.0289005976239829)
('ceo anthony goldbloom', 0.029946071606210194)
('san francisco', 0.048810837074825336)
('anthony goldbloom declined', 0.06176910090701819)
('google cloud platform', 0.06261974476422487)
('co-founder ceo anthony', 0.07357749587020043)
('acquiring kaggle', 0.08723571551039863)
('ceo anthony', 0.08915156857226395)
('anthony goldbloom', 0.09123482372372106)
('machine learning', 0.09147989238151344)
('kaggle co-founder ceo', 0.093805063905847)
('data', 0.097574333771058)
('google cloud', 0.10260128641464673)
('machine learning competitions', 0.10773000650607861)
('francisco this week', 0.11519915079240485)
('platform', 0.1183512305596321)
('conference in san', 0.12392066376108138)
('service', 0.12546743261462942)
('goldbloom', 0.14611408778815776)
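
Since lower scores indicate more relevant keywords, a simple post-processing step (a sketch, not part of the YAKE! API; the 0.1 cut-off is an arbitrary illustration) can keep only the strongest keywords:

# Keep only keywords whose YAKE! score falls below a chosen threshold
# (lower score = more relevant). The 0.1 cut-off is arbitrary.
threshold = 0.1
relevant = [(kw, score) for kw, score in keywords if score < threshold]
for kw, score in relevant:
    print(kw, score)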

Highlighting Feature

The highlighting feature tags every keyword in the text with the default tag <kw>.

from yake.highlight import TextHighlighter

th = TextHighlighter(max_ngram_size = 3)
th.highlight(text, keywords)

Output

By default, keywords will be highlighted using the tag 'kw'.

Sources tell us that <kw>google</kw> is <kw>acquiring kaggle</kw>, a platform that <kw>hosts data science</kw> and <kw>machine learning</kw> competitions. Details about the transaction remain somewhat vague , but given that <kw>google</kw> is hosting its Cloud Next conference in <kw>san francisco</kw> this week, the official announcement could come as early as tomorrow.  Reached by phone, Kaggle co-founder <kw>ceo anthony goldbloom</kw> declined to deny that the acquisition is happening. <kw>google</kw> itself declined 'to comment on rumors'.
.....
.....

Custom Highlighting Feature

Besides tagging a text with the default tag, users can also specify their own custom highlight. In the following example, the tag <span class='my_class' > makes use of a hypothetical CSS class my_class whose purpose would be to highlight the relevant keywords (for instance, in a particular colour).

from yake.highlight import TextHighlighter
th = TextHighlighter(max_ngram_size = 3, highlight_pre = "<span class='my_class' >", highlight_post= "</span>")
th.highlight(text, keywords)

Output

Sources tell us that <span class='my_class' >google</span> is <span class='my_class' >acquiring kaggle</span>, a platform that <span class='my_class' >hosts data science</span> and <span class='my_class' >machine learning</span> competitions. Details about the transaction remain somewhat vague, but given that <span class='my_class' >google</span> is hosting its Cloud Next conference in <span class='my_class' >san francisco</span> this week, the official announcement could come as early as tomorrow. Reached by phone, Kaggle co-founder <span class='my_class' >ceo anthony goldbloom</span> declined to deny that the acquisition is happening. <span class='my_class' >google</span> itself declined 'to comment on rumors'.
.....
.....

Languages other than English

While English ("en") is the default language, users can use YAKE! to extract keywords from any language they want by specifying the corresponding universal language code. The example below shows how to extract keywords from a Portuguese text.

text = '''
"Conta-me Histórias." Xutos inspiram projeto premiado. A plataforma "Conta-me Histórias" foi distinguida com o Prémio Arquivo.pt, atribuído a trabalhos inovadores de investigação ou aplicação de recursos preservados da Web, através dos serviços de pesquisa e acesso disponibilizados publicamente pelo Arquivo.pt . Nesta plataforma em desenvolvimento, o utilizador pode pesquisar sobre qualquer tema e ainda executar alguns exemplos predefinidos. Como forma de garantir a pluralidade e diversidade de fontes de informação, esta são utilizadas 24 fontes de notícias eletrónicas, incluindo a TSF. Uma versão experimental (beta) do "Conta-me Histórias" está disponível aqui.
A plataforma foi desenvolvida por Ricardo Campos investigador do LIAAD do INESC TEC e docente do Instituto Politécnico de Tomar, Arian Pasquali e Vitor Mangaravite, também investigadores do LIAAD do INESC TEC, Alípio Jorge, coordenador do LIAAD do INESC TEC e docente na Faculdade de Ciências da Universidade do Porto, e Adam Jatwot docente da Universidade de Kyoto.
'''

custom_kw_extractor = yake.KeywordExtractor(lan="pt")
keywords = custom_kw_extractor.extract_keywords(text)

for kw in keywords:
    print(kw)

Output

('conta-me histórias', 0.006225012963810038)
('liaad do inesc', 0.01899063587015275)
('inesc tec', 0.01995432290332246)
('conta-me', 0.04513273690417472)
('histórias', 0.04513273690417472)
('prémio arquivo.pt', 0.05749361520927859)
('liaad', 0.07738867367929901)
('inesc', 0.07738867367929901)
('tec', 0.08109398065524037)
('xutos inspiram projeto', 0.08720742489353424)
('inspiram projeto premiado', 0.08720742489353424)
('adam jatwot docente', 0.09407053486771558)
('arquivo.pt', 0.10261392141666957)
('alípio jorge', 0.12190479662535166)
('ciências da universidade', 0.12368384021490342)
('ricardo campos investigador', 0.12789997272332762)
('politécnico de tomar', 0.13323587141127738)
('arian pasquali', 0.13323587141127738)
('vitor mangaravite', 0.13323587141127738)
('preservados da web', 0.13596322680882506)

Related projects

YAKE! Mobile APP

YAKE! is now available on Google Play

pke - python keyphrase extraction

https://github.com/boudinfl/pke - pke is an open-source, Python-based keyphrase extraction toolkit. It provides an end-to-end keyphrase extraction pipeline in which each component can be easily modified or extended to develop new models. pke also allows for easy benchmarking of state-of-the-art keyphrase extraction models, and ships with supervised models trained on the SemEval-2010 dataset (http://aclweb.org/anthology/S10-1004).

Credits to https://github.com/boudinfl

SparkNLP - State of the Art Natural Language Processing framework

https://github.com/JohnSnowLabs/spark-nlp - Spark NLP from John Snow Labs is an open-source framework with full Python, Scala, and Java support. Check their documentation, demo and Google Colab. A video on how to use Spark NLP with YAKE can be found here: https://events.johnsnowlabs.com/john-snow-labs-nlu-become-a-data-science-superhero-with-one-line-of-python-code

General Index by Archive.org

https://archive.org/details/GeneralIndex - A catalogue of 19 billion YAKE! keywords extracted from 107 million papers. An article about the General Index project can also be found in Nature.

textacy - NLP, before and after spaCy

https://github.com/chartbeat-labs/textacy - textacy is a Python library for performing a variety of natural language processing (NLP) tasks, built on the high-performance spaCy library. Among other features, it supports keyword extraction using YAKE.

Credits to https://github.com/chartbeat-labs

Annif - Tool for automated subject indexing and classification

https://github.com/NatLibFi/Annif/ - Annif is a multi-algorithm automated subject indexing tool for libraries, archives and museums. This repository is used for developing a production version of the system, based on ideas from the initial prototype. Official website http://annif.org/.

Portulan Clarin - Services and data for researchers, innovators, students and language professionals

https://portulanclarin.net/workbench/liaad-yake/ - Portulan Clarin is a Research Infrastructure for the Science and Technology of Language, belonging to the Portuguese National Roadmap of Research Infrastructures of Strategic Relevance, and part of the international research infrastructure CLARIN ERIC. It includes a demo of YAKE! among many other language technologies. Official website https://portulanclarin.net/.

How to install Docker

Here is a "just copy and paste" installation script for Docker on Ubuntu. Enjoy.

# Install dependencies
sudo apt-get update
sudo apt-get install \
    apt-transport-https \
    ca-certificates \
    curl \
    software-properties-common

# Add Docker repo
curl -fsSL https://download.docker.com/linux/ubuntu/gpg | sudo apt-key add -
sudo apt-key fingerprint 0EBFCD88
sudo add-apt-repository \
   "deb [arch=amd64] https://download.docker.com/linux/ubuntu \
   $(lsb_release -cs) \
   stable"
sudo apt-get update

# Install Docker
sudo apt-get install -y docker-ce

# Start Docker Daemon
sudo service docker start

# Add yourself to the Docker user group, otherwise docker will complain that
# it does not know if the Docker Daemon is running
sudo usermod -aG docker ${USER}

# Install docker-compose
sudo curl -L "https://github.com/docker/compose/releases/download/1.23.1/docker-compose-$(uname -s)-$(uname -m)" -o /usr/local/bin/docker-compose
sudo chmod +x /usr/local/bin/docker-compose
source ~/.bashrc
docker-compose --version
echo "Done!"

Credits to https://github.com/silvae86 for the Docker scripts.

References

Please cite the following works when using YAKE

In-depth journal paper at Information Sciences Journal

Campos, R., Mangaravite, V., Pasquali, A., Jorge, A., Nunes, C. and Jatowt, A. (2020). YAKE! Keyword Extraction from Single Documents using Multiple Local Features. Information Sciences, Elsevier, Vol 509, pp 257-289. pdf

ECIR'18 Best Short Paper

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). A Text Feature Based Automatic Keyword Extraction Method for Single Documents. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 684 - 691. pdf

Campos R., Mangaravite V., Pasquali A., Jorge A.M., Nunes C., and Jatowt A. (2018). YAKE! Collection-independent Automatic Keyword Extractor. In: Pasi G., Piwowarski B., Azzopardi L., Hanbury A. (eds). Advances in Information Retrieval. ECIR 2018 (Grenoble, France. March 26 – 29). Lecture Notes in Computer Science, vol 10772, pp. 806 - 810. pdf

Awards

ECIR'18 Best Short Paper

yake's People

Contributors

andefined, andriyor, arianpasquali, erip, jakecowton, jmendes1995, martijnbroekman, moty66, patrickjae, readie-nf, rncampos, silvae86, taaahaaa, vitordouzi


yake's Issues

ModuleNotFoundError: No module named 'regex._regex_core'; 'regex' is not a package

Problems installing on my macbook pro: running python 3.6

---------------------------------------------------------------------------
ModuleNotFoundError                       Traceback (most recent call last)
<ipython-input-3-7cd11dc808be> in <module>
----> 1 import yake

~/anaconda3/lib/python3.6/site-packages/yake/__init__.py in <module>
      7 __version__ = '0.4.0'
      8 
----> 9 from yake.yake import KeywordExtractor

~/anaconda3/lib/python3.6/site-packages/yake/yake.py in <module>
      8 from .Levenshtein import Levenshtein
      9 
---> 10 from .datarepresentation import DataCore
     11 
     12 class KeywordExtractor(object):

~/anaconda3/lib/python3.6/site-packages/yake/datarepresentation.py in <module>
----> 1 from segtok.segmenter import split_multi
      2 from segtok.tokenizer import web_tokenizer, split_contractions
      3 
      4 import networkx as nx
      5 import numpy as np

~/anaconda3/lib/python3.6/site-packages/segtok/segmenter.py in <module>
     28 from __future__ import absolute_import, unicode_literals
     29 import codecs
---> 30 from regex import compile, DOTALL, UNICODE, VERBOSE
     31 
     32 

~/anaconda3/lib/python3.6/site-packages/regex.py in <module>
    398 # Internals.
    399 
--> 400 import regex._regex_core as _regex_core
    401 import regex._regex as _regex
    402 from threading import RLock as _RLock

ModuleNotFoundError: No module named 'regex._regex_core'; 'regex' is not a package

New stopwords file

I am using the project, but I miss a file that merges all the stopwords into a single one, since there are articles that mix languages.

Not identifying keywords with hyphens or numbers

The extract_keywords method is not identifying terms with hyphens or numbers. For example, the terms COVID-19 and SARS-CoV-2 are ignored during processing.

The abstract for the paper "Mechanisms of SARS-CoV-2 Transmission and Pathogenesis" (https://pubmed.ncbi.nlm.nih.gov/33132005/) provides a good test to try.

With the default processing, this yields:
severe acute respiratory 0.0113
acute respiratory syndrome 0.0113
respiratory syndrome coronavirus 0.0356
human population 0.0374
emergence of severe 0.0486
severe acute 0.0486
acute respiratory 0.0486
respiratory syndrome 0.0486
highly pathogenic coronavirus 0.0647
highly pathogenic 0.0819
Pathogenesis 0.1135
syndrome coronavirus 0.1433
pathogenic coronavirus 0.1433
tissue tropism 0.1460
Mechanisms 0.1677
marks 0.1677
population 0.1677
highly 0.1760
Transmission 0.1773
coronavirus 0.1942

problem installing yake on linux

Hello, I've tried to install yake using pip on Linux and I get the following error:

pip install yake
Collecting yake
Requirement already satisfied: networkx in ./anaconda3/lib/python3.7/site-packages (from yake) (2.3)
Requirement already satisfied: Click>=6.0 in ./anaconda3/lib/python3.7/site-packages (from yake) (7.0)
Collecting jellyfish (from yake)
Requirement already satisfied: scipy in ./anaconda3/lib/python3.7/site-packages (from yake) (1.3.1)
Collecting unidecode>=0.4.19 (from yake)
Using cached https://files.pythonhosted.org/packages/d0/42/d9edfed04228bacea2d824904cae367ee9efd05e6cce7ceaaedd0b0ad964/Unidecode-1.1.1-py2.py3-none-any.whl
Requirement already satisfied: nltk>=3.1 in ./anaconda3/lib/python3.7/site-packages (from yake) (3.4.4)
Requirement already satisfied: numpy in ./anaconda3/lib/python3.7/site-packages (from yake) (1.16.4)
Requirement already satisfied: fire in ./anaconda3/lib/python3.7/site-packages (from yake) (0.2.1)
Requirement already satisfied: pandas in ./anaconda3/lib/python3.7/site-packages (from yake) (0.25.0)
Collecting segtok (from yake)
Requirement already satisfied: decorator>=4.3.0 in ./anaconda3/lib/python3.7/site-packages (from networkx->yake) (4.4.0)
Requirement already satisfied: six in ./anaconda3/lib/python3.7/site-packages (from nltk>=3.1->yake) (1.12.0)
Requirement already satisfied: termcolor in ./anaconda3/lib/python3.7/site-packages (from fire->yake) (1.1.0)
Requirement already satisfied: pytz>=2017.2 in ./anaconda3/lib/python3.7/site-packages (from pandas->yake) (2019.1)
Requirement already satisfied: python-dateutil>=2.6.1 in ./anaconda3/lib/python3.7/site-packages (from pandas->yake) (2.8.0)
Collecting regex (from segtok->yake)
Using cached https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd9792681240657a4c0a599c10a81/regex-2019.08.19.tar.gz
Building wheels for collected packages: regex
Building wheel for regex (setup.py) ... error
ERROR: Complete output from command /home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-phwl6li7/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-kz_s22lg --python-tag cp37:
ERROR: BASE_DIR is /tmp/pip-install-phwl6li7/regex
/home/ec2-user/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
normalized_version,
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/init.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/regex.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/_regex_core.py -> build/lib.linux-x86_64-3.7/regex
creating build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/init.py -> build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/test_regex.py -> build/lib.linux-x86_64-3.7/regex/test
running build_ext
building 'regex._regex' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/regex_3
gcc -pthread -B /home/ec2-user/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/ec2-user/anaconda3/include/python3.7m -c regex_3/_regex.c -o build/temp.linux-x86_64-3.7/regex_3/_regex.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1


ERROR: Failed building wheel for regex
Running setup.py clean for regex

Failed to build regex
Installing collected packages: jellyfish, unidecode, regex, segtok, yake
Running setup.py install for regex ... error
ERROR: Complete output from command /home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-phwl6li7/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-x34h1jjm/install-record.txt --single-version-externally-managed --compile:
ERROR: BASE_DIR is /tmp/pip-install-phwl6li7/regex
/home/ec2-user/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
normalized_version,
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/init.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/regex.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/_regex_core.py -> build/lib.linux-x86_64-3.7/regex
creating build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/init.py -> build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/test_regex.py -> build/lib.linux-x86_64-3.7/regex/test
running build_ext
building 'regex._regex' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/regex_3
gcc -pthread -B /home/ec2-user/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/ec2-user/anaconda3/include/python3.7m -c regex_3/_regex.c -o build/temp.linux-x86_64-3.7/regex_3/_regex.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------
ERROR: Command "/home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-phwl6li7/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-x34h1jjm/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-phwl6li7/regex/
(base) [ec2-user@ip-172-31-21-203 ~]$ pip install git+https://github.com/LIAAD/yake
Collecting git+https://github.com/LIAAD/yake
Cloning https://github.com/LIAAD/yake to /tmp/pip-req-build-dejjnuzx
Running command git clone -q https://github.com/LIAAD/yake /tmp/pip-req-build-dejjnuzx
Collecting tabulate (from yake==0.4.2)
Downloading https://files.pythonhosted.org/packages/66/d4/977fdd5186b7cdbb7c43a7aac7c5e4e0337a84cb802e154616f3cfc84563/tabulate-0.8.5.tar.gz (45kB)
|████████████████████████████████| 51kB 24.4MB/s
Requirement already satisfied: click>=6.0 in ./anaconda3/lib/python3.7/site-packages (from yake==0.4.2) (7.0)
Requirement already satisfied: numpy in ./anaconda3/lib/python3.7/site-packages (from yake==0.4.2) (1.16.4)
Collecting segtok (from yake==0.4.2)
Requirement already satisfied: networkx in ./anaconda3/lib/python3.7/site-packages (from yake==0.4.2) (2.3)
Requirement already satisfied: jellyfish in ./anaconda3/lib/python3.7/site-packages (from yake==0.4.2) (0.7.2)
Collecting regex (from segtok->yake==0.4.2)
Using cached https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd9792681240657a4c0a599c10a81/regex-2019.08.19.tar.gz
Requirement already satisfied: decorator>=4.3.0 in ./anaconda3/lib/python3.7/site-packages (from networkx->yake==0.4.2) (4.4.0)
Building wheels for collected packages: yake, tabulate, regex
Building wheel for yake (setup.py) ... done
Stored in directory: /tmp/pip-ephem-wheel-cache-gqhzo3j2/wheels/be/35/27/e4ebd54b78c1806ed8b0271ce247fcd91e2bedde35889fbc9b
Building wheel for tabulate (setup.py) ... done
Stored in directory: /home/ec2-user/.cache/pip/wheels/e1/41/5e/e201f95d90fc84f93aa629b6638adacda680fe63aac47174ab
Building wheel for regex (setup.py) ... error
ERROR: Complete output from command /home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-pqiwlp28/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' bdist_wheel -d /tmp/pip-wheel-3pkjcr_k --python-tag cp37:
ERROR: BASE_DIR is /tmp/pip-install-pqiwlp28/regex
/home/ec2-user/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
normalized_version,
running bdist_wheel
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/init.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/regex.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/_regex_core.py -> build/lib.linux-x86_64-3.7/regex
creating build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/init.py -> build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/test_regex.py -> build/lib.linux-x86_64-3.7/regex/test
running build_ext
building 'regex._regex' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/regex_3
gcc -pthread -B /home/ec2-user/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/ec2-user/anaconda3/include/python3.7m -c regex_3/_regex.c -o build/temp.linux-x86_64-3.7/regex_3/_regex.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1


ERROR: Failed building wheel for regex
Running setup.py clean for regex
Successfully built yake tabulate
Failed to build regex
Installing collected packages: tabulate, regex, segtok, yake
Running setup.py install for regex ... error
ERROR: Complete output from command /home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-pqiwlp28/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-aot88mu6/install-record.txt --single-version-externally-managed --compile:
ERROR: BASE_DIR is /tmp/pip-install-pqiwlp28/regex
/home/ec2-user/anaconda3/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
normalized_version,
running install
running build
running build_py
creating build
creating build/lib.linux-x86_64-3.7
creating build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/init.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/regex.py -> build/lib.linux-x86_64-3.7/regex
copying regex_3/regex/_regex_core.py -> build/lib.linux-x86_64-3.7/regex
creating build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/init.py -> build/lib.linux-x86_64-3.7/regex/test
copying regex_3/regex/test/test_regex.py -> build/lib.linux-x86_64-3.7/regex/test
running build_ext
building 'regex._regex' extension
creating build/temp.linux-x86_64-3.7
creating build/temp.linux-x86_64-3.7/regex_3
gcc -pthread -B /home/ec2-user/anaconda3/compiler_compat -Wl,--sysroot=/ -Wsign-compare -DNDEBUG -g -fwrapv -O3 -Wall -Wstrict-prototypes -fPIC -I/home/ec2-user/anaconda3/include/python3.7m -c regex_3/_regex.c -o build/temp.linux-x86_64-3.7/regex_3/_regex.o
unable to execute 'gcc': No such file or directory
error: command 'gcc' failed with exit status 1
----------------------------------------

ERROR: Command "/home/ec2-user/anaconda3/bin/python -u -c 'import setuptools, tokenize;file='"'"'/tmp/pip-install-pqiwlp28/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(file);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, file, '"'"'exec'"'"'))' install --record /tmp/pip-record-aot88mu6/install-record.txt --single-version-externally-managed --compile" failed with error code 1 in /tmp/pip-install-pqiwlp28/regex/

Can you help in some way?

Taking too long to train

For numOfKeywords > 1K (~5K/10K), the extraction speed drops drastically. Is there any way to make it more efficient/faster?
I am passing the whole corpus of text as a single entity (tried as a string and from a file.txt).

Why does YAKE miss the COVID-19 keyword in the output?

Hi,
Why would YAKE not return the COVID-19 in any of the keywords in the following example:

occupational stress and mental health among anesthetists during the COVID-19 pandemic.

with default parameters, the output looks like this:

pandemic 0.04491197687864554
occupational stress 0.04940384002065631
stress and mental 0.09700399286574239
mental health 0.09700399286574239
health among anesthetists 0.09700399286574239
occupational 0.15831692877998726
stress 0.29736558256021506
mental 0.29736558256021506
health 0.29736558256021506
anesthetists 0.29736558256021506

Read from the standard input

Hi all,

It would be interesting to read the text from the standard input instead of only from a file; perhaps a -t argument?

Cheers,
JRS
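
As a workaround until such an option exists, a small wrapper script (a sketch, not part of YAKE!; extract_stdin.py is a hypothetical file name) can read the text from standard input and call the Python API directly:

import sys
import yake

# Read the whole document from standard input, e.g.:
#   cat document.txt | python extract_stdin.py
text = sys.stdin.read()

kw_extractor = yake.KeywordExtractor()
for kw, score in kw_extractor.extract_keywords(text):
    print(kw, score)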

YAKE at PyPI

Dear authors, thank you for the great method and its implementation as a Python package.

We want to use Yake as a part of a package which will also be published at PyPI. We have an issue with Yake not being published at PyPI, since it cannot be used as the dependency of another package (when the package is installed with pip it cannot have the dependency from GitHub). I know you decided to deprecate the PyPI version of this package but I would ask if you can rethink this decision. Publishing the package at PyPI takes just two additional commands to be executed on the release.

I do not know what exactly led to this decision but if there is any technical reason I would be happy to help.

IndexError: list index out of range

Hello! The pke version of the YAKE extractor fails when weighting the candidates for this text file:
data.txt

Error:
.env/lib/python3.6/site-packages/pke/unsupervised/statistical/yake.py", line 368, in candidate_weighting term_right = tokens[j+1] IndexError: list index out of range
Please view yake.py.

Code:
extractor = pke.unsupervised.YAKE()
extractor.load_document(input='path/data.txt', language='en')
extractor.candidate_selection()
extractor.candidate_weighting()
keyphrases = extractor.get_n_best(n=10)

Unfortunately, we failed to fix this issue ourselves. The out-of-range error is raised for the token "allowed" / "allow" (line 30 of the text file).
This is a critical issue, since we depend on your great package in two of our products.

Thank you!

AttributeError: module 'yake' has no attribute 'KeywordExtractor'

Hi, I'm trying to use yake on python. I installed it using pip. I copy/paste the code from readme into my script file and got the following error:

Traceback (most recent call last):
  File "./yake.py", line 57, in <module>
    main()
  File "./yake.py", line 16, in main
    tmp = assuming_default_parameters(data)
  File "./yake.py", line 26, in assuming_default_parameters
    kw_extractor = yake.KeywordExtractor()
AttributeError: module 'yake' has no attribute 'KeywordExtractor'

Could someone help me?

Default Keyword Extraction Includes Phone Numbers

After reading the paper, I thought that numbers would be discarded when extracting keywords, however when running the keyword extractor on the 20newsgroup dataset, specifically the document

20news_home/20news-bydate-train/misc.forsale/75935.txt

The following keywords are extracted (I removed the scores and joined the keyword phrases with underscores)

excellent_condition tom accord excellent honda offer_call offer model honda_accord miles 795-5636 653-0638 highway accord_for_sale sale loaded white offer_call_tom highway_miles condition

There are other instances of numbers with hyphens extracted as well.

Inconsistent return types for extract_keywords depending on dedupLim

When dedupLim is below 1, the function extract_keywords(text) returns tuples in the format (keyword, score).

When dedupLim is 1, the same function returns tuples in the format (score, keyword).

This is very confusing, and especially problematic if you're testing multiple values of dedupLim in a loop.
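
Until this is fixed, a defensive normalization step (a sketch; the position of the string versus the float in each tuple is detected at runtime) can make downstream code independent of dedupLim:

def normalize(results):
    # Return (keyword, score) tuples regardless of the order
    # extract_keywords happened to produce for this dedupLim.
    return [(a, b) if isinstance(a, str) else (b, a) for a, b in results]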

Allow custom stopwords

Thanks for an awesome package! I'd love to be able to provide a custom stopwords list, but it doesn't seem like this is currently possible since the stopwords list is loaded from a file.

One option would be to add an optional stopwords list with a default None to the KeywordExtractor constructor and load the file in the default case. If this seems reasonable, I'd be happy to send a PR along.
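
For reference, the KeywordExtractor constructor signature quoted in a later issue on this page does include a stopwords parameter; assuming a version where that parameter is available, a custom set can be passed directly:

import yake

# Assumes a yake version whose KeywordExtractor accepts `stopwords`
# (see the constructor signature quoted elsewhere on this page).
custom_stopwords = {"platform", "service", "data"}
kw_extractor = yake.KeywordExtractor(lan="en", stopwords=custom_stopwords)
keywords = kw_extractor.extract_keywords("Some text about platforms and services.")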

Truncating long documents

Hi,
I found out that when using YAKE for long documents, it can be advantageous to truncate them in advance.

We have a test set of theses and dissertations (766 documents of, on average, 196k characters and 22k words), and when those documents are used as a gold standard for evaluation of YAKE (or its integration in our application), an F1@5 score of 0.29 is reached. However, if the documents are first truncated to a fixed length of 15000 characters, a better score of 0.33 is reached.

Being such a simple way to possibly improve results, maybe a parameter/option for truncating input text could be added directly to YAKE? Or, better yet, could the term position feature be tuned to be better suited for long texts, somehow giving even more importance to the beginning part?
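
Until such an option exists, truncation is trivial to apply before calling the extractor; a sketch using the 15,000-character cut-off mentioned above (that value is specific to this test set, not a general recommendation):

import yake

MAX_CHARS = 15000  # cut-off that worked well on the thesis test set above

def extract_truncated(text):
    # Truncate long documents before extraction; per the issue above,
    # this improved F1@5 from 0.29 to 0.33 on one test set.
    kw_extractor = yake.KeywordExtractor()
    return kw_extractor.extract_keywords(text[:MAX_CHARS])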

Is multi-language possible?

I installed the software after its three dependency packages installed successfully,
but so far I have only used it for English keyword extraction.
Would Chinese work as well?

Error with upgrade command

I get an error when I run the YAKE upgrade command.

Command:
pip install git+https://github.com/LIAAD/yake –upgrade

Error:
ERROR: Invalid requirement: '\u2013upgrade'

Any ideas, why?

Unsupported YAKE languages

Looks like the following languages are unsupported by YAKE:

TAGALOG
VIETNAMESE
BENGALI
BOKMAL
YORUBA
CZECH
SOTHO
URDU
PUNJABI
SWAHILI
ALBANIAN
BELARUSIAN
MACEDONIAN
AZERBAIJANI
AFRIKAANS
XHOSA
ICELANDIC
TAMIL
KAZAKH
MONGOLIAN
CATALAN
GEORGIAN
LATIN
MAORI
MALAY
NYNORSK
GUJARATI
TSWANA
BOSNIAN
ZULU
TELUGU
ESPERANTO
SERBIAN
SOMALI
TSONGA
GANDA
BASQUE
HEBREW
WELSH
THAI
IRISH
SHONA
KOREAN
MARATHI

Is there any particular reason why they are unsupported?

Problem with installing yake: clang

...$ pip install git+https://github.com/LIAAD/yake
Processing /Users/khrystynas/Library/Caches/pip/wheels/90/7d/56/63a3a3b064e9214e29616880dc4170a300ca68643728ce24ba/yake-0.3.7-py2.py3-none-any.whl
Processing /Users/khrystynas/Library/Caches/pip/wheels/15/ee/a8/6112173f1386d33eebedb3f73429cfa41a4c3084556bcee254/segtok-1.5.7-cp37-none-any.whl
Requirement already satisfied: nltk>=3.1 in /usr/local/lib/python3.7/site-packages (from yake) (3.4.5)
Requirement already satisfied: fire in /usr/local/lib/python3.7/site-packages (from yake) (0.2.1)
Requirement already satisfied: Click>=6.0 in /usr/local/lib/python3.7/site-packages (from yake) (7.0)
Requirement already satisfied: pandas in /usr/local/lib/python3.7/site-packages (from yake) (0.25.2)
Requirement already satisfied: networkx in /usr/local/lib/python3.7/site-packages (from yake) (2.4)
Requirement already satisfied: scipy in /usr/local/lib/python3.7/site-packages (from yake) (1.3.1)
Requirement already satisfied: numpy in /usr/local/lib/python3.7/site-packages (from yake) (1.17.2)
Requirement already satisfied: unidecode>=0.4.19 in /usr/local/lib/python3.7/site-packages (from yake) (1.1.1)
Requirement already satisfied: jellyfish in /usr/local/lib/python3.7/site-packages (from yake) (0.7.2)
Collecting regex
  Using cached https://files.pythonhosted.org/packages/6f/a6/99eeb5904ab763db87af4bd71d9b1dfdd9792681240657a4c0a599c10a81/regex-2019.08.19.tar.gz
Requirement already satisfied: six in /usr/local/lib/python3.7/site-packages (from nltk>=3.1->yake) (1.12.0)
Requirement already satisfied: termcolor in /usr/local/lib/python3.7/site-packages (from fire->yake) (1.1.0)
Requirement already satisfied: python-dateutil>=2.6.1 in /usr/local/lib/python3.7/site-packages (from pandas->yake) (2.8.0)
Requirement already satisfied: pytz>=2017.2 in /usr/local/lib/python3.7/site-packages (from pandas->yake) (2019.3)
Requirement already satisfied: decorator>=4.3.0 in /usr/local/lib/python3.7/site-packages (from networkx->yake) (4.4.0)
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... error
  ERROR: Command errored out with exit status 1:
   command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"'; __file__='"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' bdist_wheel -d /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-wheel-vzqsgm04 --python-tag cp37
       cwd: /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/
  Complete output (22 lines):
  BASE_DIR is /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex
  /usr/local/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
    normalized_version,
  running bdist_wheel
  running build
  running build_py
  creating build
  creating build/lib.macosx-10.14-x86_64-3.7
  creating build/lib.macosx-10.14-x86_64-3.7/regex
  copying regex_3/regex/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/regex
  copying regex_3/regex/regex.py -> build/lib.macosx-10.14-x86_64-3.7/regex
  copying regex_3/regex/_regex_core.py -> build/lib.macosx-10.14-x86_64-3.7/regex
  creating build/lib.macosx-10.14-x86_64-3.7/regex/test
  copying regex_3/regex/test/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/regex/test
  copying regex_3/regex/test/test_regex.py -> build/lib.macosx-10.14-x86_64-3.7/regex/test
  running build_ext
  building 'regex._regex' extension
  creating build/temp.macosx-10.14-x86_64-3.7
  creating build/temp.macosx-10.14-x86_64-3.7/regex_3
  clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c regex_3/_regex.c -o build/temp.macosx-10.14-x86_64-3.7/regex_3/_regex.o
  xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
  error: command 'clang' failed with exit status 1
  ----------------------------------------
  ERROR: Failed building wheel for regex
  Running setup.py clean for regex
Failed to build regex
Installing collected packages: regex, segtok, yake
    Running setup.py install for regex ... error
    ERROR: Command errored out with exit status 1:
     command: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"'; __file__='"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-record-j4mtel58/install-record.txt --single-version-externally-managed --compile
         cwd: /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/
    Complete output (22 lines):
    BASE_DIR is /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex
    /usr/local/lib/python3.7/site-packages/setuptools/dist.py:472: UserWarning: Normalizing '2019.08.19' to '2019.8.19'
      normalized_version,
    running install
    running build
    running build_py
    creating build
    creating build/lib.macosx-10.14-x86_64-3.7
    creating build/lib.macosx-10.14-x86_64-3.7/regex
    copying regex_3/regex/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/regex
    copying regex_3/regex/regex.py -> build/lib.macosx-10.14-x86_64-3.7/regex
    copying regex_3/regex/_regex_core.py -> build/lib.macosx-10.14-x86_64-3.7/regex
    creating build/lib.macosx-10.14-x86_64-3.7/regex/test
    copying regex_3/regex/test/__init__.py -> build/lib.macosx-10.14-x86_64-3.7/regex/test
    copying regex_3/regex/test/test_regex.py -> build/lib.macosx-10.14-x86_64-3.7/regex/test
    running build_ext
    building 'regex._regex' extension
    creating build/temp.macosx-10.14-x86_64-3.7
    creating build/temp.macosx-10.14-x86_64-3.7/regex_3
    clang -Wno-unused-result -Wsign-compare -Wunreachable-code -fno-common -dynamic -DNDEBUG -g -fwrapv -O3 -Wall -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/usr/include -I/Applications/Xcode.app/Contents/Developer/Platforms/MacOSX.platform/Developer/SDKs/MacOSX10.14.sdk/System/Library/Frameworks/Tk.framework/Versions/8.5/Headers -I/usr/local/include -I/usr/local/opt/openssl/include -I/usr/local/opt/sqlite/include -I/usr/local/Cellar/python/3.7.3/Frameworks/Python.framework/Versions/3.7/include/python3.7m -c regex_3/_regex.c -o build/temp.macosx-10.14-x86_64-3.7/regex_3/_regex.o
    xcrun: error: invalid active developer path (/Library/Developer/CommandLineTools), missing xcrun at: /Library/Developer/CommandLineTools/usr/bin/xcrun
    error: command 'clang' failed with exit status 1
    ----------------------------------------
ERROR: Command errored out with exit status 1: /usr/local/opt/python/bin/python3.7 -u -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"'; __file__='"'"'/private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-install-v2c23sz4/regex/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' install --record /private/var/folders/bl/dzs3gyqj0nv6s4gbwrdk7by00000gn/T/pip-record-j4mtel58/install-record.txt --single-version-externally-managed --compile Check the logs for full command output.

Any help will be much appreciated!

Discarding keywords/-phrases with certain POS tag

Dear authors,
Firstly, I want to thank you for the great work you're doing!

I wonder what the best practice would be to detect and discard keyphrases that are (not) of a specific POS tag when using YAKE.

More specifically, I need to discard names of people and numbers for a project. I could do that after YAKE has extracted them from my corpus, but I assume it would be more efficient to not even build/include the keyphrases when they're of a specific POS tag.

Thanks in advance for any hints/ideas!

Deduplication threshold changes the order of the response tuples

I've noticed the following behavior of the .extract_keywords function:

When using a deduplication threshold (dedupLim) lower than 1, the response tuples are of the form (word, score). e.g.:

('non-profit', 0.18087033619667015)
('social', 0.21178928326651927)
('media', 0.21178928326651927)
('handle', 0.28189161752425324)

However, when equal or greater than 1, becomes:

(0.18087033619667015, 'non-profit')
(0.21178928326651927, 'social')
(0.21178928326651927, 'media')
(0.28189161752425324, 'handle')

Below the sample code which produces the above outputs:

import yake

text = 'I handle social media for a non-profit. Should I start going to social media networking events? Are there any good ones in the bay area?'

kw_extractor = yake.KeywordExtractor(lan="en", n=1, dedupLim=1, top=4, features=None)
keywords = kw_extractor.extract_keywords(text)
for kw in keywords:
    print(kw)

The issue seems to stem from the difference between these two lines: yake.py#L71 and yake.py#L85

Happy to submit a PR to fix it if it is of any help.

How to speed up the application to 100k documents?

Hi,
It works well with one document; however, if I apply this kw_extractor to 100k rows of documents with pandas apply, it takes more than 2 days to complete. Is there any way to speed up this process?

CODE:
import yake
from nltk.corpus import stopwords

st = set(stopwords.words('japanese'))

def keywords_yake(sample_post):
    # extract keywords for each post and turn them into a text string "sentence"
    simple_kwextractor = yake.KeywordExtractor(n=3,
                                               lan='ja',
                                               dedupLim=.99,
                                               dedupFunc='seqm',
                                               windowsSize=1,
                                               top=1000,
                                               features=None,
                                               stopwords=st)

    post_keywords = simple_kwextractor.extract_keywords(sample_post)

    # join the extracted keywords into a single space-separated string
    return " ".join(word for word, number in post_keywords)


df['keywords'] = df['docs'].apply(keywords_yake)
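
One common speed-up (a sketch, not an official YAKE! feature) is to build the extractor once instead of once per document, and to spread the documents over several processes:

import yake
import pandas as pd
from multiprocessing import Pool

df = pd.DataFrame({"docs": ["first document ...", "second document ..."]})  # stand-in for the real 100k rows

# Build the extractor once; re-creating it for every document, as in
# the snippet above, wastes a lot of time.
kw_extractor = yake.KeywordExtractor(n=3, lan='ja', top=1000)

def keywords_yake(post):
    return " ".join(word for word, score in kw_extractor.extract_keywords(post))

if __name__ == "__main__":
    with Pool() as pool:
        # Distribute the documents across all available CPU cores.
        df['keywords'] = pool.map(keywords_yake, df['docs'].tolist())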

Available languages list

Where can I find a list of the languages available in this library?

yake.KeywordExtractor(lan=.....)

in parameter lan, I can pass "en" for English.
What other languages are available ? any documentation for that ?

MIT License

Would it be possible to use a different license apart from GPL v3, like MIT/BSD? We would like to use this library in our project, but GPL v3-licensed code is not allowed for legal reasons.

The dependency on click

Can the dependency on click be changed to use lowercase: https://github.com/LIAAD/yake/blob/v0.4.1/setup.py#L13

There was only a single release of Click with uppercase (7.0) and that was a mistake. See: pallets/click@0d95fb0

I just asked for a hotfix release of click that would fix that: pallets/click#1284

In the meantime, if you could change the dependency on the line 13 in setup.py as:

'click>=6.0'

that would allow users of this package to pick up other versions and hotfixes of click during build. With the current setting of Click>=6.0, the build always picks up just Click-7.0, which may conflict with another package resolving this to click-x.y.z in some build environments.

Pip installation; is git preferred over PyPI?

Summary

I recommend changing the README to either:

  1. Suggest pip install yake instead of pip install git+https://github.com/LIAAD/yake
  2. Explain that pip installing the git repo is preferred to PyPI

Details

The README currently recommends a pip installation targeting the git repository:

Installation

To install Yake using pip:

pip install git+https://github.com/LIAAD/yake

But PyPI also has packages, and there are other issues regarding PyPI.

I generally prefer targeting specific versions (e.g. pip install yake==0.4.8) for reproducible builds. Is there a reason to prefer the git target instead of PyPI? If so, I recommend adding a line to the README so that it is clear.

Thanks!

How does yake do it?

If it is convenient, could you show how yake does it (extracts keywords from a single document)?

YAKE score(relevance )

Hi!

Could you please explain how to properly understand what the YAKE score (relevance) means?

For example:

focused information workers 0.0075
premium management tool 0.0147
Royal TSX 0.0255
server admins 0.0286
tool for server 0.0373
focused information 0.0373
information workers 0.0373
Royal 0.0588
system engineers 0.0639
premium management 0.0710
management tool 0.0710
Royal TSD 0.0815
TSX 0.0827
TSX is compatible 0.0986
powerful connections management 0.1123
management 0.1165
admins 0.1467
credential management 0.1504
features powerful connections 0.1562
Royal TSi 0.1583

What are the score boundaries? Is it 0-1 or something else? Is lower or higher a better relevance?

From the example above:

focused information workers 0.0075
Royal TSi 0.1583

Which phrase has a better relevance to the document according to YAKE?

Thanks!

Deduplication Threshold and Function as parameters to docker server?

I see that we cannot change the deduplication parameters when using the API:

https://github.com/LIAAD/yake/blob/master/docker/Dockerfiles/yake-server/yake-rest-api.py#L117

So it just uses the defaults:

def __init__(self, lan="en", n=3, dedupLim=0.9, dedupFunc='seqm', windowsSize=1, top=20, features=None, stopwords=None):

Could we provide those when invoking the API server, please?
Thank you very much!

Have acronym keywords returned in upper case

Hi,

Would it be possible to have acronym keywords found in the input text returned in upper case by the keyword parser? Currently, all keywords are lower-cased in the results, I am sure for some good reason. We feed parsed keywords back to a search engine which distinguishes abbreviations from regular words when ranking matches, so for now I would post-process the keywords and restore acronyms from the input text to upper case as a workaround, but maybe it would be useful enough to have this handled by the tool natively.

Thank you for the great and promising library you have implemented.
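
Such a post-processing step might look like the sketch below, under the assumption that an acronym appears fully upper-cased somewhere in the source text:

import re

def restore_acronyms(keywords, source_text):
    # Upper-case a keyword again when it occurs in the source text as
    # an all-caps token (e.g. 'nasa' -> 'NASA'); other keywords are
    # left untouched.
    restored = []
    for kw, score in keywords:
        if re.search(r'\b' + re.escape(kw.upper()) + r'\b', source_text):
            kw = kw.upper()
        restored.append((kw, score))
    return restored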

Deduplication function

Hi

Can you please explain what the [leve|jaro|seqm] options are and how they work?

Is there some documentation link I could use to know how to choose among them?

Thanks

import: no module named yake

Hello,

I just installed yake, but when I want to use it, I get an error message following the import yake command:

File "", line 1, in
File "/home/sylwia/.local/lib/python2.7/site-packages/yake/__init__.py", line 9, in
from yake.yake import KeywordExtractor
ImportError: No module named yake

I would appreciate your help with that.
Cheers
Sylwia

Reproducing the results for Inspec and Semeval 2017 data

Thank you for making this work open source.

Are the evaluation scripts available online? I have been trying to reproduce the numbers from the paper but am unable to match the performance for the given optimal hyperparameter setting.

How to use lemmatization together with YAKE?

Dear authors, thank you for the great keyword extracting methods.

When extracting keywords with YAKE it often happens that among the keywords there are two words with the same lemma (e.g. tree, trees). Since it would be difficult (and error-prone) to lemmatize the text before using YAKE, I was thinking of lemmatizing the words after keyword extraction. I am wondering what the best way is to aggregate scores after lemmatization. Should I use the sum, mean or max function for aggregating the scores of words with the same lemma, or do you suggest an alternative method? Do you have any experience with this issue?
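
One possible post-hoc aggregation is sketched below; since lower YAKE scores mean higher relevance, keeping the minimum per lemma is one defensible choice, and `lemmatize` stands in for any hypothetical str -> str lemmatizer (e.g. from spaCy or NLTK):

def aggregate_by_lemma(keywords, lemmatize):
    # Group (keyword, score) pairs by their lemmatized form and keep
    # the best (lowest) score per lemma, since lower = more relevant.
    best = {}
    for kw, score in keywords:
        lemma = " ".join(lemmatize(token) for token in kw.split())
        if lemma not in best or score < best[lemma]:
            best[lemma] = score
    return sorted(best.items(), key=lambda item: item[1])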

Wrong code in calculating T_rel of a single_word

self.PL = self.WDL / maxTF
self.PR = self.WDR / maxTF
I think the right code is:
self.PL = self.WDL / self.G.in_degree(self.id, weight='TF')
self.PR = self.WDR / self.G.out_degree(self.id, weight='TF')

Infinite loop in Text Highlighter

Hello,

I'm having an issue processing this sentence:

Rugby league-Australian rugby league results .

The highlighter never finishes marking the sentence; from what I can see, the issue is in the method format_n_gram_text, where the value y never increments.

This is how I call the processing:

self.__extractor = yake.KeywordExtractor(lan="en")
self.__highlighter = TextHighlighter(max_ngram_size=3)

UnboundLocalError: local variable 'block_of_word_obj' referenced before assignment

What does this error mean?

Traceback (most recent call last):
  File "/root/src/streamer/consumer.py", line 188, in callback
    keywords = set([_[1].upper() for _ in self._kw_extractor.extract_keywords(message["text"])])
  File "/opt/conda/lib/python3.7/site-packages/yake/yake.py", line 52, in extract_keywords
    dc = DataCore(text=text, stopword_set=self.stopword_set, windowsSize=self.windowsSize, n=self.n)
  File "/opt/conda/lib/python3.7/site-packages/yake/datarepresentation.py", line 30, in __init__
    self._build(text, windowsSize, n)
  File "/opt/conda/lib/python3.7/site-packages/yake/datarepresentation.py", line 93, in _build
    if len(block_of_word_obj) > 0:
UnboundLocalError: local variable 'block_of_word_obj' referenced before assignment
