
PYthon Automated Term Extraction

Home Page: https://kevinlu1248.github.io/pyate/

License: MIT License


pyate's Introduction

PyATE: Python Automated Term Extraction


Python implementation of term extraction algorithms such as C-Value, Basic, Combo Basic, Weirdness and Term Extractor using spaCy POS tagging.

NEW: Documentation can be found at https://kevinlu1248.github.io/pyate/. The documentation is still missing two algorithms and details about the TermExtraction class, but I will have it done soon.

NEW: Try the algorithms out at https://pyate-demo.herokuapp.com/, a web app for demonstrating PyATE!

NEW: spaCy V3 is supported! For spaCy V2, use pyate==0.4.3 and view the spaCy V2 README.md file

If you have a suggestion for another ATE algorithm you would like implemented in this package, feel free to file it as an issue along with the paper the algorithm is based on.

For ATE packages implemented in Scala and Java, see ATR4S and JATE, respectively.

🎉 Installation

Using pip:

pip install pyate 
spacy download en_core_web_sm

🚀 Quickstart

To get started, simply call one of the implemented algorithms. According to Astrakhantsev 2016, combo_basic is the most precise of the five algorithms, though basic and cvalues are not far behind (see Precision). The same study shows that PU-ATR and KeyConceptRel have higher precision than combo_basic, but they are not implemented here, and PU-ATR takes significantly more time since it uses machine learning.

from pyate import combo_basic

# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U
connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
logical framework                  0.693147
sufficient growth                  0.693147
death signals                      0.693147
many aspects                       0.693147
inflammatory response              0.693147
tumor promotion                    0.693147
ancillary processes                0.693147
tumor environment                  0.693147
reflexive relationship             0.693147
particular focus                   0.693147
physiologic processes              0.693147
tissue homeostasis                 0.693147
cancer development                 0.693147
dtype: float64
"""

If you would like to add this to a spaCy pipeline, simply use spaCy's add_pipe method.

import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe("combo_basic")
doc = nlp(string)
print(doc._.combo_basic.sort_values(ascending=False).head(5))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
dtype: float64
"""

Also, TermExtractionPipeline.__init__ is defined as follows

__init__(
  self,
  func: Callable[..., pd.Series] = combo_basic,
  *args,
  **kwargs
)

where func is essentially your term-extraction algorithm: it takes a corpus (either a string or an iterator of strings) and outputs a Pandas Series of term-value pairs mapping each term to its respective termhood. func defaults to combo_basic. args and kwargs let you override the function's default parameters, which you can find by running help (these might be documented later on).
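As a rough illustration of this pattern (a sketch, not the actual PyATE source), a pipeline component that stores a default algorithm plus overriding args/kwargs might look like the following; dummy_extractor and PipelineSketch are hypothetical names:

```python
import pandas as pd

def dummy_extractor(corpus, multiplier=1.0):
    # Hypothetical stand-in for combo_basic: maps terms to termhood scores.
    return pd.Series({"term extraction": 1.0 * multiplier})

class PipelineSketch:
    # Minimal sketch of the TermExtractionPipeline pattern: keep the
    # algorithm and any overriding arguments, then apply them per document.
    def __init__(self, func=dummy_extractor, *args, **kwargs):
        self.func = func
        self.args = args
        self.kwargs = kwargs

    def __call__(self, text):
        return self.func(text, *self.args, **self.kwargs)

pipe = PipelineSketch(multiplier=2.0)  # override a default keyword argument
print(pipe("some technical text"))
```

The real pipeline additionally registers the resulting Series as a custom Doc extension, but the func/args/kwargs plumbing is the same idea.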

Summary of functions

Each of cvalues, basic, combo_basic, weirdness and term_extractor takes in a string or an iterator of strings and outputs a Pandas Series of term-value pairs, where higher values indicate a higher chance of being a domain-specific term. Furthermore, weirdness and term_extractor take a general_corpus keyword argument, which must be an iterator of strings and defaults to the General Corpus described below.

All functions take only the string from which you would like to extract terms (the technical_corpus) as mandatory input, along with other tweakable settings, including general_corpus (the contrasting corpus for weirdness and term_extractor), general_corpus_size, verbose (whether to print a progress bar), weights, smoothing, have_single_word (whether a single word can count as a phrase) and threshold. If you have not read the papers and are unfamiliar with the algorithms, I recommend just using the default settings. Again, use help to find the details of each algorithm, since they all differ.
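For intuition, the core idea behind weirdness is a ratio of relative frequencies between the technical and general corpora. This is a simplified sketch with word-level counting and +1 smoothing, not PyATE's exact implementation:

```python
from collections import Counter

def weirdness_sketch(term, technical_corpus, general_corpus):
    # Relative frequency in the technical corpus divided by relative
    # frequency in the general corpus; +1 smoothing avoids division by zero.
    tech = Counter(technical_corpus.lower().split())
    gen = Counter(general_corpus.lower().split())
    tech_rel = tech[term] / sum(tech.values())
    gen_rel = (gen[term] + 1) / sum(gen.values())
    return tech_rel / gen_rel

# "tumor" is frequent in the technical text but absent from the general one,
# so it scores much higher than everyday vocabulary would.
print(weirdness_sketch("tumor", "tumor tumor cells", "the cat sat"))  # 2.0
```

Terms that are common in both corpora get a score near 1, which is why a contrasting general corpus is required.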

General Corpus

Under path/to/site-packages/pyate/default_general_domain.en.zip, there is a CSV file containing a general corpus, specifically 3000 random sentences from Wikipedia. Its source can be found at https://www.kaggle.com/mikeortman/wikipedia-sentences. Access it using the following after installing pyate.

import pandas as pd
from distutils.sysconfig import get_python_lib
df = pd.read_csv(get_python_lib() + "/pyate/default_general_domain.en.zip")["SECTION_TEXT"]
print(df.head())
""" (Output)
0    '''Anarchism''' is a political philosophy that...
1    The term ''anarchism'' is a compound word comp...
2    ===Origins===\nWoodcut from a Diggers document...
3    Portrait of philosopher Pierre-Joseph Proudhon...
4    consistent with anarchist values is a controve...
Name: SECTION_TEXT, dtype: object
"""

Other Languages

To switch languages, simply run TermExtraction.set_language({language}, {model_name}), where model_name defaults to language. For example, use TermExtraction.set_language("it", "it_core_news_sm") for Italian. By default, the language is English. So far, the list of supported languages is:

  • English (en)
  • Dutch (nl)
  • French (fr)
  • German (de)
  • Italian (it)
  • Portuguese (pt)
  • Russian (ru)
  • Spanish (es)

To add more languages, file an issue with a corpus of at least 3000 paragraphs of a general domain in the desired language (preferably from Wikipedia), named default_general_domain.{lang}.zip, replacing lang with the ISO-639-1 code of the language, or the ISO-639-2 code if the language has no ISO-639-1 code (these can be found at https://www.loc.gov/standards/iso639-2/php/code_list.php). The file should have the following form to be parsable by Pandas.

,SECTION_TEXT
0,"{paragraph_0}"
1,"{paragraph_1}"
...
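For instance, a file in this format could be produced with Pandas; the language code `xx` and the two paragraphs below are placeholders, and a real submission needs at least 3000 paragraphs:

```python
import pandas as pd

# Hypothetical two-paragraph general-domain corpus.
paragraphs = ["First general-domain paragraph.", "Second general-domain paragraph."]

# Writing with the default index produces the ",SECTION_TEXT" layout above.
pd.DataFrame({"SECTION_TEXT": paragraphs}).to_csv(
    "default_general_domain.xx.zip", compression="zip"
)

# Round-trip check, reading it back the same way as the English corpus above.
df = pd.read_csv("default_general_domain.xx.zip")["SECTION_TEXT"]
print(df.tolist())
```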

Alternatively, place the file in src/pyate and file a pull request.

Models

Warning: The model only works with spaCy v2.

Though this package was originally intended for symbolic AI algorithms (non-machine-learning), I realized a spaCy model for term extraction can reach significantly higher performance, and thus decided to include the model here.

For a comparison with the symbolic AI algorithms, see Precision. Note that F-score, precision and recall were measured for the model, while AvP was measured for the algorithms, so the metrics cannot be compared directly.

URL F-Score (%) Precision (%) Recall (%)
https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz 94.71 95.41 94.03

The model was trained and evaluated on the ACL dataset, a computer-science-oriented dataset where the terms were manually picked. It has not yet been tested on other fields, however.

This model does not come with PyATE. To install, run

pip install https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz

To extract terms,

import spacy

nlp = spacy.load("en_acl_terms_sm")
doc = nlp("Hello world, I am a term extraction algorithm.")
print(doc.ents)
"""
(term extraction, algorithm)
"""

🎯 Precision

Here is the average precision of some of the implemented algorithms using the Average Precision (AvP) metric on seven distinct databases, as tested in Astrakhantsev 2016.

🌠 Motivation

This project was planned as a tool to be connected to a Google Chrome extension that highlights and defines key terms that the reader probably does not know. Furthermore, term extraction is an area with relatively little focused research compared to other areas of NLP, and it has recently been viewed as impractical next to the more general tool of NER tagging. However, modern NER tagging usually incorporates some combination of memorized words and deep learning, which is spatially and computationally heavy. Furthermore, to generalize an algorithm to recognize terms in the ever-growing areas of medical and AI research, a list of memorized words will not do.

None of the five implemented algorithms is expensive; in fact, the bottleneck in both space and computation is the spaCy model and spaCy POS tagging. This is because the algorithms mostly rely on POS patterns, word frequencies, and the existence of embedded term candidates. For example, the term candidate "breast cancer" implies that "malignant breast cancer" is probably not a term but simply a "malignant" form of "breast cancer" (this idea is implemented in C-Value).
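The embedded-candidate idea can be sketched with the C-value formula from the original paper: log2|a| * f(a) when a candidate is not nested in longer candidates, otherwise log2|a| * (f(a) - mean frequency of the candidates containing it). The numbers below are illustrative, not from the package:

```python
import math

def c_value_sketch(length, freq, container_freqs=()):
    # C-value(a) = log2|a| * f(a) if a is not nested in longer candidates,
    # else log2|a| * (f(a) - mean frequency of the candidates containing a).
    if not container_freqs:
        return math.log2(length) * freq
    return math.log2(length) * (freq - sum(container_freqs) / len(container_freqs))

# "breast cancer" (2 words, 10 occurrences) appears 8 times inside
# "malignant breast cancer", so most of its frequency is explained away.
print(c_value_sketch(2, 10, [8]))  # 2.0
```

A candidate whose occurrences are mostly inside a longer candidate thus ends up with a low score, capturing the "breast cancer" vs. "malignant breast cancer" intuition above.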

📌 Todo

  • Add other languages and data encapsulation for set language
  • Add automated tests and CI/CD
  • Add a brief CLI
  • Make NER version of this using the datasets from the sources
  • Add PU-ATR algorithm since its precision is a lot higher, though more computationally expensive
  • Page Rank algorithm
  • Add sources
  • Add voting algorithm and capabilities
  • Optimize, perhaps using Cython; however, the bottleneck is POS tagging by spaCy and word counting with Pandas and NumPy, which are already at C level, so this will not help much
  • Clearer documentation
  • Allow GPU acceleration with CuPy

📑 Sources

I cannot seem to find the original Basic and Combo Basic papers, but I found papers that reference them. "ATR4S: Toolkit with State-of-the-art Automatic Terms Recognition Methods in Scala" more or less summarizes everything and incorporates several algorithms not in this package.

📕 Academia

Citing

If you publish work that uses PyATE, please let me know at [email protected] and cite as:

Lu, Kevin. (2021, June 28). kevinlu1248/pyate: Python Automated Term Extraction (Version v0.5.3). Zenodo. http://doi.org/10.5281/zenodo.5039289

or equivalently with BibTeX:

@software{pyate,
	title        = {kevinlu1248/pyate: Python Automated Term Extraction},
	author       = {Lu, Kevin},
	year         = 2021,
	month        = {Jun},
	publisher    = {Zenodo},
	doi          = {10.5281/zenodo.5039289}
}

Influences on Academia

This package was used in the paper "Unsupervised Technical Domain Terms Extraction using Term Extractor" (Dowlagar and Mamidi, 2021).

☕ Buy Me a Coffee

If my work helped you, please consider buying me a coffee at https://www.buymeacoffee.com/kevinlu1248.

pyate's People

Contributors

binherunning, bluetyson, bohyunjung, deepsource-autofix[bot], deepsourcebot, dependabot[bot], imgbotapp, kevinlu1248, pandawhocodes, restyled-commits, stelmath


pyate's Issues

Crash on short input; len(technical_counts) == 0 after filtering

In combo_basic.py:

    if len(technical_counts) == 0:
        return pd.Series()

    order = sorted(
        list(technical_counts.keys()), key=TermExtraction.word_length, reverse=True
    )

    if not have_single_word:
        order = list(filter(lambda s: TermExtraction.word_length(s) > 1, order))

    technical_counts = technical_counts[order]

    df = pd.DataFrame(
        {
            "xlogx_score": technical_counts.reset_index()
            .apply(
                lambda s: math.log(TermExtraction.word_length(s["index"])) * s[0],
                axis=1,
            )
            .values,
            "times_subset": 0,
            "times_superset": 0,
        },
        index=technical_counts.index,
    )

The call to pd.DataFrame() can fail if technical_counts is empty after technical_counts = technical_counts[order]. This can be avoided with a second check for an empty Series, e.g.:

    technical_counts = technical_counts[order]

    if len(technical_counts) == 0:
        return pd.Series()

Minimal working example:

import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline
nlp = spacy.load("en")
nlp.add_pipe(TermExtractionPipeline())
text = "This sentence is short."
nlp(text)

may be error in combo_basic.py

Hi, I have read the source code in combo_basic.py, and there is one place that confuses me. Could you help?
In the ATR4S paper, the Combo Basic formula is:
ComboBasic(t) = |t| log f(t) + α e_t + β e′_t
but line 65 of combo_basic.py has:
lambda s: math.log(TermExtraction.word_length(s["index"])) * s[0]
I think it should be:
lambda s: TermExtraction.word_length(s["index"]) * math.log(s[0])
since |t| = TermExtraction.word_length(s["index"]) and f(t) = s[0].
Am I right? I hope you can help.
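For reference, the ATR4S formula ComboBasic(t) = |t| log f(t) + α e_t + β e′_t can be sketched directly; the α and β values and the example numbers below are illustrative, not the package defaults:

```python
import math

def combo_basic_sketch(term_length, freq, num_containers, num_contained,
                       alpha=0.75, beta=0.1):
    # ComboBasic(t) = |t| * log f(t) + alpha * e_t + beta * e'_t,
    # where e_t counts candidate terms that contain t and e'_t counts
    # candidate terms contained in t (alpha/beta here are illustrative).
    return term_length * math.log(freq) + alpha * num_containers + beta * num_contained

# A two-word term seen 3 times, contained in 1 longer candidate:
print(combo_basic_sketch(2, 3, 1, 0))
```

Under this reading, |t| multiplies log f(t), which matches the questioner's proposed lambda rather than the one currently in combo_basic.py.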

UnicodeDecodeError: 'charmap' Error in pip install Windows

While trying to install pyate on windows using pip install pyate gives the below error:

Collecting pyate
Using cached pyate-0.3.2.tar.gz (9.2 kB)
    ERROR: Command errored out with exit status 1:
     command: 'D:\anaconda3\envs\KPIAlgo\python.exe' -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'C:\\Users\\prince\\AppData\\Local\\Temp\\pip-install-d1tq5ym0\\pyate\\setup.py'"'"'; __file__='"'"'C:\\Users\\prince\\AppData\\Local\\Temp\\pip-install-d1tq5ym0\\pyate\\setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base 'C:\Users\prince\AppData\Local\Temp\pip-pip-egg-info-4plhxjxh'
         cwd: C:\Users\prince\AppData\Local\Temp\pip-install-d1tq5ym0\pyate\
    Complete output (7 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "C:\Users\prince\AppData\Local\Temp\pip-install-d1tq5ym0\pyate\setup.py", line 4, in <module>
        long_description = f.read()
      File "D:\anaconda3\envs\KPIAlgo\lib\encodings\cp1252.py", line 23, in decode
        return codecs.charmap_decode(input,self.errors,decoding_table)[0]
    UnicodeDecodeError: 'charmap' codec can't decode byte 0x9d in position 213: character maps to <undefined>
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

Conda Info:

conda version : 4.8.2
conda-build version : 3.18.11
python version : 3.7.6.final.0
platform : Win10-x64

However, the same installation worked fine on my Ubuntu system.
Conda Info:

conda version : 4.7.10
conda-build version : 3.15.1
python version : 3.7.3.final.0
platform : linux-x64

README out of date?

I installed spaCy 3.0.0 and per the README, ran

pip install pyate https://github.com/explosion/spacy-models/releases/download/en_core_web_sm-2.2.5/en_core_web_sm-2.2.5.tar.gz

Running your combo_basic() example resulted in an exception due to an apparent language model incompatibility. Downloading en_core_web_sm fixed the problem.

Maybe I missed something?

Environment: MacOs 12.2.1, Apple M1 chip, Python 3.8.11

prep != det

I believe you will see an improvement in the patterns if you change the definition of prep to {"POS": "ADP", "IS_PUNCT": False}.

pyATE for other languages?

It would be very nice if pyATE could be used on texts in other languages.
To this aim I have added a static method to the TermExtraction class:
(but I still have to find a good file of random sentences in Italian as a general domain)

    @staticmethod
    def set_language(language: str):
        TermExtraction.nlp     = spacy.load(language)
        TermExtraction.matcher = Matcher(TermExtraction.nlp.vocab)
        TermExtraction.DEFAULT_GENERAL_DOMAIN = pd.read_csv(pkg_resources.resource_stream(__name__, f'default_general_domain.{language}.csv'))

FileNotFoundError lexemes.bin

First, I think the model link in the README is not updated, which throws a 404 error; the README link still has version 2.0.3.
I downloaded and installed with

pip install https://github.com/kevinlu1248/pyate/releases/download/v0.4.2/en_acl_terms_sm-2.0.4.tar.gz
nlp = spacy.load('en_acl_terms_sm')
nlp = en_acl_terms_sm.load()

both are throwing
FileNotFoundError: [Errno 2] No such file or directory: '/usr/local/lib/python3.6/dist-packages/en_acl_terms_sm/en_acl_terms_sm-2.0.4/vocab/lexemes.bin'

term_extractor reference

Hi there!

Where did you get the specs for the term_extractor component? Is it only from TermExtractor: a Web Application to Learn the Shared Terminology of Emergent Web Communities? If there's anything else, can you please indicate the reference?
It's not included in the benchmark from Astrakhantsev 2016, is it? Which one would be the closest approach to it?

Thanks, and great work!

Problem with set_language()

Hi,

when I try to run the following test code from master/tests/test_algs.py:

def test_lang_change():
    TermExtraction.set_language("it", "it_core_news_sm")  # italian
    for func in ALGORITHMS:
        func(CORPUS)

I get the following error:

Traceback (most recent call last):
  File "pyate_test.py", line 28, in <module>
    test_lang_change()
  File "pyate_test.py", line 23, in test_lang_change
    TermExtraction.set_language("it", "it_core_news_sm")  # italian
TypeError: set_language() takes 1 positional argument but 2 were given

Thank you in advance for looking into this.

No default_general_domain.en.zip

After installation, there is no default_general_domain.en.zip file in the pyate folder. Without it, it does not run.
Kaggle provides a txt file with sentences, not a csv.
Please add a link and update the instructions so that people can download default_general_domain.en.zip.

About c values calculation

    df = pd.DataFrame(
        {
            "frequency": technical_counts.values,
            "times_nested": technical_counts.values,
            "number_of_nested": 1,
            "has_been_evaluated": False,
        },
        index=technical_counts.index,
    )
    # print(df)
    output = []
    indices = set(df.index)
    iterator = tqdm(df.iterrows()) if verbose else df.iterrows()
    for candidate, row in iterator:
        f, t, n, h = row
        length = TermExtraction.word_length(candidate)
        if length == TermExtraction.config["MAX_WORD_LENGTH"]:
            c_val = math.log2(length + smoothing) * f
        else:
            c_val = math.log2(length + smoothing) * f
            if h:
                c_val -= t / n
        if c_val >= threshold:
            output.append((candidate, c_val))
            for substring in helper_get_subsequences(candidate):
                if substring in indices:
                    df.loc[substring, "times_nested"] += 1
                    df.loc[substring, "number_of_nested"] += f
                    df.loc[substring, "has_been_evaluated"] = True

Hi, I want to thank you for your hard work implementing so many ATE methods; it has helped me a lot.
I have studied the original C-value paper, and I am confused because I found some differences in the initialization and calculation of the C-value.
In the initialization, why are "times_nested" and "number_of_nested" not both set to zero?
In the update, shouldn't "times_nested" be increased by the candidate frequency and "number_of_nested" by 1?
In the C-value calculation, your implementation uses the formula \log_2(|a|) * f(a) - \frac{t(a)}{c(a)} (using the paper's notation), which differs from \log_2(|a|) * (f(a) - \frac{t(a)}{c(a)}) in the original paper.

All of these points confused me. Maybe I misunderstood the paper or the code; please feel free to correct me.

TypeError: __init__() missing 1 required positional argument: 'nlp'

An error occurred when I tested pyate (pyate v0.5.5, spacy 3.4.4, en_core 3.4.1, python 3.7.6).
The source code is from the demo:
from pyate import combo_basic

string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U
connection between the inflammatory response and cancer."""
import spacy
from pyate.term_extraction_pipeline import TermExtractionPipeline

nlp = spacy.load("en_core_web_sm")
nlp.add_pipe(TermExtractionPipeline())
print(doc._.combo_basic.sort_values(ascending=False).head(5))

The error info:

nlp.add_pipe(TermExtractionPipeline())
TypeError: __init__() missing 1 required positional argument: 'nlp'

How can I debug the problem? Many thanks!

Error associated with multiprocessing and counting terms

When TermExtraction.count_terms_from_documents() is run, I get the error Error: Can't pickle local object 'Tok2Vec.predict.<locals>.<lambda>', an error only occurring when multiprocessing is turned on. This leads to weirdness and term_extraction both yielding strange results. A temporary fix has been done by disabling multiprocessing.

FileNotFoundError - default_general_domain.csv

I get this error after doing a pip install pyate.

FileNotFoundError: [Errno 2] File /usr/lib/python3/dist-packages/pyate/default_general_domain.csv does not exist: '/usr/lib/python3/dist-packages/pyate/default_general_domain.csv'

Bug report: TypeError: load() got an unexpected keyword argument 'parser'

Hello,

I am trying to run the following example program from the github main page:


# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U
connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))
""" (Output)
dysfunctional tumor                1.443147
tumor suppressors                  1.443147
genetic changes                    1.386294
cancer cells                       1.386294
dysfunctional tumor suppressors    1.298612
logical framework                  0.693147
sufficient growth                  0.693147
death signals                      0.693147
many aspects                       0.693147
inflammatory response              0.693147
tumor promotion                    0.693147
ancillary processes                0.693147
tumor environment                  0.693147
reflexive relationship             0.693147
particular focus                   0.693147
physiologic processes              0.693147
tissue homeostasis                 0.693147
cancer development                 0.693147
dtype: float64
"""

However, this error appears:


from pyate import combo_basic...
---------------------------------------------------------------------------
TypeError                                 Traceback (most recent call last)
~/Developer/.../LegalParse/nlp.py in <module>
     12 connection between the inflammatory response and cancer."""
     13 
---> 14 print(combo_basic(string).sort_values(ascending=False))
     15 """ (Output)
     16 dysfunctional tumor                1.443147

~/Developer/...env/lib/python3.8/site-packages/pyate/combo_basic.py in combo_basic(technical_corpus, smoothing, verbose, have_single_word, technical_counts, weights)
     35     if technical_counts is None:
     36         technical_counts = (
---> 37             TermExtraction(technical_corpus)
     38             .count_terms_from_documents(verbose=verbose)
     39             .reindex()

~/Developer/.../env/lib/python3.8/site-packages/pyate/term_extraction.py in __init__(self, corpus, vocab, patterns, do_parallelize, language, nlp, default_domain, default_domain_size, max_word_length, dtype)
    126             ]
    127         if self.nlp is None:
--> 128             self.nlp = TermExtraction.get_nlp(self.language)
    129         if self.default_domain is None:
    130             self.default_domain = TermExtraction.get_general_domain(self.language)

~/Developer/.../env/lib/python3.8/site-packages/pyate/term_extraction.py in get_nlp(language)
     63             language = TermExtraction.language
     64         if language not in TermExtraction.nlps:
---> 65             TermExtraction.nlps[language] = spacy.load(
     66                 TermExtraction.config["spacy_model"], parser=False, entity=False
     67             )

ATE not behaving as in documentation

After installing pyATE as per the documentation, running the examples from the documentation gives different and worse results.

Running:

from pyate import combo_basic

# source: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC1994795/
string = """Central to the development of cancer are genetic changes that endow these “cancer cells” with many of the
hallmarks of cancer, such as self-sufficient growth and resistance to anti-growth and pro-death signals. However, while the
genetic changes that occur within cancer cells themselves, such as activated oncogenes or dysfunctional tumor suppressors,
are responsible for many aspects of cancer development, they are not sufficient. Tumor promotion and progression are
dependent on ancillary processes provided by cells of the tumor environment but that are not necessarily cancerous
themselves. Inflammation has long been associated with the development of cancer. This review will discuss the reflexive
relationship between cancer and inflammation with particular focus on how considering the role of inflammation in physiologic
processes such as the maintenance of tissue homeostasis and repair may provide a logical framework for understanding the U
connection between the inflammatory response and cancer."""

print(combo_basic(string).sort_values(ascending=False))

I obtain:

aspects of cancer                               3.348612
many aspects of cancer                          2.336294
aspects of cancer development                   2.336294
development of cancer                           2.197225
cancer development                              2.193147
many aspects                                    2.193147
cells of the tumor                              2.136294
many aspects of cancer development              2.109438
maintenance of tissue                           1.848612
cells of the tumor environment                  1.809438
connection between the inflammatory response    1.709438
maintenance of tissue homeostasis               1.586294
inflammation with particular focus              1.486294
tissue homeostasis                              1.443147
tumor environment                               1.443147
particular focus                                1.443147
dysfunctional tumor                             1.443147
tumor suppressors                               1.443147
inflammatory response                           1.443147
cancer cells                                    1.386294
genetic changes                                 1.386294
dysfunctional tumor suppressors                 1.298612
role of inflammation                            1.098612
relationship between cancer                     1.098612
hallmarks of cancer                             1.098612
death signals                                   0.693147
sufficient growth                               0.693147
tumor promotion                                 0.693147
ancillary processes                             0.693147
logical framework                               0.693147
dtype: float64

Does the package need a special config to be set?

Adding more languages

Hi, and thanks a lot for this great package.
I am thinking of adding support for more languages (for example: fr, es, de, it, ar, pt, etc.).
I have been looking at the OPUS Wikipedia corpus available here.
You can easily download a zip containing a huge list of sentences extracted from Wikipedia.
The sentences are located in an XML file:

<s id="1">L'algèbre générale, ou algèbre abstraite, est la branche des mathématiques qui porte principalement sur l'étude des structures algébriques et de leurs relations.</s>
<s id="2">Elle maintient son activité dans les deux Irlandes (État libre d'Irlande, indépendant, et Irlande du Nord, britannique), mais concentre son action sur les intérêts britanniques, surtout en Irlande du Nord.</s>
<s id="3">Il a formé toute une génération de linguistes français, parmi lesquels Émile Benveniste, Marcel Cohen, Georges Dumézil, André Martinet, Aurélien Sauvageot, Lucien Tesnière, Joseph Vendryes, ainsi que le japonisant Charles Haguenauer.</s>
<s id="4">En conséquence, Meillet présente Parry à Matija Murko, savant originaire de Slovénie qui avait longuement écrit sur la tradition héroïque épique dans les Balkans, surtout en Bosnie-Herzégovine.</s>

But I was wondering whether these sentences are too short to be considered "paragraphs".
Apparently the paragraphs used in the English corpus are much longer.
Do you think it is worth using this corpus? Maybe I could group a set of sentences (10?) together to build fake paragraphs.
What would be your advice here?

Best regards

Olivier Terrier
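A minimal sketch of the grouping idea above, assuming the `<s id="...">` XML format shown in the question and a hypothetical chunk size of 10 sentences per pseudo-paragraph (adapt both to the real Opus files):

```python
# Sketch: group Opus Wikipedia <s> sentences into pseudo-paragraphs.
# The XML layout and the chunk size of 10 are assumptions taken from the
# discussion above, not a pyate API.
import xml.etree.ElementTree as ET
from typing import List

def sentences_to_paragraphs(xml_text: str, chunk_size: int = 10) -> List[str]:
    """Join every `chunk_size` consecutive <s> elements into one paragraph."""
    root = ET.fromstring(xml_text)
    sentences = [s.text.strip() for s in root.iter("s") if s.text]
    return [
        " ".join(sentences[i : i + chunk_size])
        for i in range(0, len(sentences), chunk_size)
    ]

# Tiny synthetic corpus: 25 one-sentence <s> elements.
sample = "<doc>" + "".join(
    f'<s id="{i}">Sentence number {i}.</s>' for i in range(1, 26)
) + "</doc>"
paragraphs = sentences_to_paragraphs(sample, chunk_size=10)
# 25 sentences grouped in tens -> 3 paragraphs (10, 10, 5 sentences)
```

Each resulting paragraph string could then be fed to combo_basic (or any of the other algorithms) in place of a real paragraph.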

Could not read config file from en_acl_terms_sm-2.0.4\config.cfg

Hello, Kevin.
I downloaded en_acl_terms_sm-2.0.4.tar.gz with the IDM downloader and installed the package successfully with pip in cmd.
However, when I run:

import spacy

nlp = spacy.load("en_acl_terms_sm")
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\__init__.py", line 51, in load
    return util.load_model(
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\util.py", line 420, in load_model
    return load_model_from_package(name, **kwargs)  # type: ignore[arg-type]
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\util.py", line 453, in load_model_from_package
    return cls.load(vocab=vocab, disable=disable, exclude=exclude, config=config)  # type: ignore[attr-defined]
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\en_acl_terms_sm\__init__.py", line 12, in load
    return load_model_from_init_py(__file__, **overrides)
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\util.py", line 615, in load_model_from_init_py
    return load_model_from_path(
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\util.py", line 487, in load_model_from_path
    config = load_config(config_path, overrides=overrides)
  File "C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\spacy\util.py", line 646, in load_config
    raise IOError(Errors.E053.format(path=config_path, name="config file"))
OSError: [E053] Could not read config file from C:\Users\WcisapoorguY\AppData\Local\Programs\Python\Python310\lib\site-packages\en_acl_terms_sm\en_acl_terms_sm-2.0.4\config.cfg

Then I checked the path, but there is no "config.cfg" file in that directory.
I also tried reinstalling en_acl_terms_sm-2.0.4.tar.gz, but I still get the same result.

Here is my pip packages list:
C:\WINDOWS\system32>pip list
Package Version


blis 0.7.7
catalogue 2.0.7
certifi 2021.10.8
charset-normalizer 2.0.12
click 8.0.4
colorama 0.4.4
cymem 2.0.6
en-acl-terms-sm 2.0.4
en-core-web-sm 3.2.0
idna 3.3
Jinja2 3.1.1
langcodes 3.3.0
MarkupSafe 2.1.1
murmurhash 1.0.6
numpy 1.22.3
packaging 21.3
pandas 1.4.2
pathy 0.6.1
pip 22.0.4
preshed 3.0.6
pyahocorasick 1.4.4
pyate 0.5.3
pydantic 1.8.2
pyparsing 3.0.7
python-dateutil 2.8.2
pytz 2022.1
requests 2.27.1
setuptools 62.0.0
six 1.16.0
smart-open 5.2.1
spacy 3.2.4
spacy-legacy 3.0.9
spacy-loggers 1.0.2
srsly 2.4.2
thinc 8.0.15
tqdm 4.64.0
typer 0.4.1
typing_extensions 4.1.1
urllib3 1.26.9
wasabi 0.9.1
wheel 0.37.1

(Sorry for my poor English; I am a beginner with Python.)

en_acl_terms_sm-2.0.4.tar.gz

Performance/speed issue using combo_basic

Hello,

I am using combo_basic to extract the top 5 keywords from a collection of texts, one text at a time. My corpus is a list of strings, each around 8k characters long, and I have around 63k of these texts. As I loop through the list, I use tqdm for a progress bar. I notice that the processing time for each document increases as time passes; similarly, tqdm shows the iterations per second dropping, to the point where each document takes more than 1 second. Here is the code snippet I am using — I have commented out the rest of the code inside the loop and kept only the top_keywords = ... line, and the issue still persists:

# loop over the documents
for document in tqdm.tqdm(text_list):  # a list of strings
    top_keywords = combo_basic(document).sort_values(ascending=False).head(5).index.tolist()

Is there some sort of caching behind the scenes that makes this slow down as the documents get processed? How can I have a "clean" call of combo_basic every time I use it? With the current setup, it will take me 2-3 days to process the 63k documents, as it starts to really slow down after the first 2k documents or so. Thanks!
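One way to check whether hidden per-process state is behind the slowdown is to watch traced memory across repeated calls. This is only a diagnostic sketch: `process` below is a hypothetical stand-in with a deliberately growing cache, not pyate's actual implementation; substitute the real combo_basic call when testing locally.

```python
# Diagnostic sketch: if traced memory keeps growing across calls of the same
# shape, some internal state (e.g. a cache) is accumulating between calls.
import tracemalloc

_cache = {}  # stands in for hypothetical hidden state inside the library

def process(doc: str) -> str:
    _cache[doc] = doc.upper()  # grows on every call, as a leak/cache would
    return _cache[doc]

tracemalloc.start()
process("warm-up document")
baseline, _ = tracemalloc.get_traced_memory()
for i in range(2000):
    process(f"document {i}")
current, _ = tracemalloc.get_traced_memory()
tracemalloc.stop()

memory_grew = current > baseline  # True here, because _cache keeps growing
```

If memory does grow with the real call, isolating batches of documents in fresh worker processes would discard that state between batches.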

Precision is not as good as ATR4S?

I ran this on the ACL RD-TEC 2.0 corpus with combo_basic and got a precision of around 50%, whereas ATR4S reports around 70% precision on the same corpus.

spaCy >= 2.3.2

ERROR: pyate 0.3.9 has requirement spacy==2.2.4, but you'll have spacy 2.3.2 which is incompatible.

Is there a pyATE release that works with newer versions of spaCy?

Configuring n-grams

Hi,
Can you please guide me on how to test this with different n-gram configurations? I'm using the TermExtractionPipeline with spaCy.

PyATE 0.5.3: general_corpus_size parameter for weirdness is ignored

In weirdness.py the parameter general_corpus_size is ignored.
This means you have to write

pyate.term_extraction.TermExtraction.config["DEFAULT_GENERAL_DOMAIN_SIZE"] = 5000
pyate.weirdness(text)

instead of just writing

pyate.weirdness(text, general_corpus_size=5000)

The same issue appears to be present in term_extractor.py.

Bug in combo_basic: helper_get_subsequences() needs to allow length one subsequences when have_single_word = True

In combo_basic.py, we have

def helper_get_subsequences(s: str) -> List[str]:
    """Helper function to get all subsequences of a string."""
    sequence = s.split()
    if len(sequence) <= 2:
        return []

But this means that length-1 candidate subsequences are not included as subsets of length-2 terms, and length-2 terms are not included as supersets of length-1 terms.
This is incorrect when have_single_word == True.

Thanks!
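A sketch of one possible fix for the issue above. The `have_single_word` parameter on the helper, and the idea of threading it down from the caller, are assumptions about how the flag could be plumbed through — not pyate's current API:

```python
# Sketch of a fixed helper: when have_single_word is True, keep length-1
# subsequences so single-word candidates count as subsets of longer terms.
# (The have_single_word parameter here is an assumed extension, not the
# signature currently in combo_basic.py.)
from typing import List

def helper_get_subsequences(s: str, have_single_word: bool = False) -> List[str]:
    """Return all proper contiguous word subsequences of a term."""
    sequence = s.split()
    min_len = 1 if have_single_word else 2
    if len(sequence) < min_len + 1:
        return []  # no proper subsequence of at least min_len words exists
    subsequences = []
    for left in range(len(sequence)):
        for right in range(left + 1, len(sequence) + 1):
            length = right - left
            # keep proper subsequences only (shorter than the full term)
            if min_len <= length < len(sequence):
                subsequences.append(" ".join(sequence[left:right]))
    return subsequences
```

With have_single_word=False this matches the current behavior (two-word terms yield no subsequences), while with have_single_word=True a term like "cancer cells" yields "cancer" and "cells".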

could not read config.cfg

I need help with this error:
Could not read config.cfg from /home/dzmfg/.local/lib/python3.9/site-packages/en_core_web_sm/en_core_web_sm-2.2.5/config.cfg
