kennethenevoldsen / asent


Asent is a Python library for performing efficient and transparent sentiment analysis using spaCy.

Home Page: https://kennethenevoldsen.github.io/asent/

License: MIT License

Python 99.78% Makefile 0.22%

interpretability natural-language-processing nlp python3 sentiment-analysis spacy spacy-extensions

asent's Introduction

Kenneth Enevoldsen

Researcher, scholar, teacher

Projects

The following are projects I am actively maintaining or contributing to. More may have been added since this list was written.

Logo	Name	Description
	MTEB	The Massive Text Embedding Benchmark for evaluating document embeddings e.g. for RAG systems.
	Scandinavian Embedding Benchmark	A Scandinavian Benchmark for evaluating document embeddings
	DaCy	The State of the Art Danish NLP pipeline for SpaCy
	tomsup	Theory of Mind Simulation using Python. A package that allows for easy agent-based modeling of recursive Theory of Mind agents
	Augmenty	A structured augmentation library for augmenting both the texts and the annotations
	TextDescriptives	A Python library for calculating a large variety of metrics from text
	timeseriesflattener	Converts irregularly spaced time series, such as electronic health records, into statically shaped data frames.
	Asent	An educational library for performing transparent sentiment analysis
	ScandEval	An evaluation benchmark for the Scandinavian and Germanic language models evaluating natural language understanding and generation.
	swift-python-cookiecutter	The cookie-cutter template I actively use for my packages
	UD_Danish-DDT	The Danish Universal Dependencies Treebank, a high quality linguistic resource

Contributions:

A selection of contributions to open-source libraries, besides the ones to which I am actively contributing.

Library	Contribution
Transformers	Multiple bugfixes for training masked language models using flax
SpaCy core libraries:
spacy-transformers	Allow passing arguments to the transformer backend and forward
confection	Fixed an issue where the config could not be filled
spacy-curated-transformers	Added support for ELECTRA tokenizers
curated-transformers	Added ELECTRA

asent's People

hlasse martinbernstorff emilstenstrom asehmi tomaarsen ankush-chander techthiyanes tarekrahman3 aascode aditya1001001 alouca markusbansky chrissiecodes wangcj05 marcoschaarbr

asent's Issues

Add automatic formatting to black

Add a workflow which automatically formats to black.

Create functionality for value estimation on a training set

One way to do this is either to use scikit-learn optimize or to backpropagate through PyTorch tensor operations (or potentially to use thinc).
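As a rough illustration of the gradient-based idea, the sketch below fits per-word valence values by stochastic gradient descent in plain Python. It assumes, purely for illustration, that a document's score is the plain sum of its word valences, which is far simpler than Asent's actual rules. The function name `fit_valences` and the toy data are made up for the example.

```python
from collections import defaultdict


def fit_valences(examples, lr=0.1, epochs=200):
    """Fit word valences so that the sum over a document matches its target score.

    Assumption: score(document) = sum of the valences of its words.
    """
    valences = defaultdict(float)  # unseen words start at 0.0
    for _ in range(epochs):
        for words, target in examples:
            prediction = sum(valences[w] for w in words)
            error = prediction - target
            # The gradient of 0.5 * error**2 w.r.t. each word's valence is `error`.
            for w in words:
                valences[w] -= lr * error
    return dict(valences)


# Toy training set: (tokens, human-rated score). Purely illustrative.
train = [
    (["good"], 2.0),
    (["bad"], -2.0),
    (["good", "bad"], 0.0),
]
learned = fit_valences(train)
print({word: round(value, 3) for word, value in learned.items()})
```

A real version would need to differentiate through negation and intensifier rules as well, which is where PyTorch or thinc would come in.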

Issue with asent running spacy pipeline in multiprocessing

I get an issue trying to run the spacy pipeline with asent component using multiprocessing.

How to reproduce the behaviour

import pandas as pd
import spacy
import asent

documents = pd.Series(['I am trying to run this....', 'Also this one.'])

model = spacy.load('en_core_web_sm')
model.add_pipe('sentencizer', first=True)
model.add_pipe('asent_en_v1')

for doc in model.pipe(documents, batch_size=16, n_process=2):
    pass

As a result I get

Traceback (most recent call last):
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 1694, in _multiprocessing_pipe
    self.default_error_handler(
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/util.py", line 1724, in raise_error
    raise e
ValueError: [E871] Error encountered in nlp.pipe with multiprocessing:

Traceback (most recent call last):
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 2303, in _apply_pipes
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 2303, in <listcomp>
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "spacy/tokens/doc.pyx", line 1348, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1411, in spacy.tokens.doc.Doc.to_dict
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/util.py", line 1352, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1408, in spacy.tokens.doc.Doc.to_dict.lambda20
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'DocPolarityOutput' object

python-BaseException

Process finished with exit code 1

Your Environment

asent Version Used: 0.7.6
Operating System: Ubuntu 20.04.6 LTS
Python Version Used: 3.10.0
spaCy Version Used: 3.6.1
Environment Information: en-core-web-sm==3.6.0 pandas==1.5.3

I would be very grateful for any help with this issue.
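The `TypeError` at the bottom of the traceback is raised by msgpack, which only encodes built-in types. The sketch below reproduces the same class of failure with the standard library's `json` module standing in for msgpack (a substitution made so the example runs without spaCy) and shows that plain built-in values serialize fine. `PolarityResult` is a hypothetical stand-in, not Asent's `DocPolarityOutput`.

```python
import json


class PolarityResult:
    """Hypothetical stand-in for a custom polarity output object."""

    def __init__(self, neg, neu, pos, compound):
        self.neg = neg
        self.neu = neu
        self.pos = pos
        self.compound = compound


result = PolarityResult(neg=0.29, neu=0.71, pos=0.0, compound=-0.9477)

# Encoders such as json (and msgpack) only understand built-in types,
# so passing the custom object directly fails.
try:
    json.dumps(result)
    failed = False
except TypeError:
    failed = True

# Converting the attributes to a plain dict first makes the value serializable.
payload = json.dumps(vars(result))
restored = json.loads(payload)
print(failed, restored)
```

By analogy, storing only plain numbers or dicts on the `Doc`, or running with a single process so that no `Doc` has to be serialized, would sidestep the error. Whether Asent offers a built-in option for this is not stated in the issue.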

Add sentence sentiment visualizer

Add a sentence visualizer.

I have a problem with Asent and spaCy in a Kaggle notebook (https://www.kaggle.com/code/smoestergaard/some-s22-reddit-toolbox/notebook)

Hi all,
I am a researcher and I use this toolbox on Kaggle for linguistic analysis:
https://www.kaggle.com/code/smoestergaard/some-s22-reddit-toolbox/notebook
When I run the second cell (the one titled "#Load packages"), I receive the error below:

Can anybody help me solve this problem?

Publicity: Make sure people know about Asent

Compound not taking into account all span polarities

I am trying to understand the scores for the following text:

The IRA will provide about $370bn of subsidies for clean energy, marking America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies.

The document polarity scores are the following:
neg=0.29 neu=0.71 pos=0.0 compound=-0.9477

The positive score is 0.0, despite clean energy and most ambitious having positive polarities.

Changing the text slightly (replacing ", marking" with ". It marked") changed the compound score, even though the underlying span polarities did not change:
neg=0.064 neu=0.751 pos=0.185 compound=0.1559

Code to reproduce:

!pip install -U spacy
!pip install asent
!python -m spacy download en_core_web_lg

import spacy
import asent

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('asent_en_v1')

text = "The IRA will provide about $370bn of subsidies for clean energy, marking America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies."
print(nlp(text)._.polarity)
asent.visualize(nlp(text))

text = "The IRA will provide about $370bn of subsidies for clean energy. It marked America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies."
print(nlp(text)._.polarity)
asent.visualize(nlp(text))

spacy==3.7.2
asent==0.8.0
Python=3.10
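For context on how a compound score can swing this sharply, here is a sketch of the normalisation used by VADER. That Asent applies exactly this formula and constant (`alpha=15`) is an assumption here, not something the issue states, and the valence numbers are invented.

```python
import math


def normalize(score, alpha=15.0):
    """Squash an unbounded sum of valences into the interval (-1, 1).

    This is the normalisation from VADER; alpha=15 is VADER's default.
    """
    return score / math.sqrt(score * score + alpha)


# Invented token valences: two mildly positive spans and two strongly
# negative ones, loosely in the spirit of the sentence above.
valences = [1.5, 1.2, -3.0, -2.5]
compound = normalize(sum(valences))
print(round(compound, 4))
```

Because the function saturates quickly, a change in which valences get negated or boosted can move the compound score a long way even when the per-span polarities look similar.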

Create a to_dict extension for polarity outputs

Create a to_dict extension for polarity outputs to allow for easy conversion into python formats.
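A minimal sketch of what such a conversion could look like, using `dataclasses.asdict` on hypothetical stand-in classes. `TokenPolarity` and `SpanPolarity` are not Asent's real classes, and Asent's outputs may not be dataclasses at all.

```python
from dataclasses import dataclass, field, asdict
from typing import List


@dataclass
class TokenPolarity:
    token: str
    polarity: float


@dataclass
class SpanPolarity:
    neg: float
    neu: float
    pos: float
    compound: float
    polarities: List[TokenPolarity] = field(default_factory=list)

    def to_dict(self) -> dict:
        """Recursively convert the output, including nested token results."""
        return asdict(self)


output = SpanPolarity(
    neg=0.616,
    neu=0.384,
    pos=0.0,
    compound=-0.4964,
    polarities=[TokenPolarity("very", 0.0), TokenPolarity("happy", -2.215)],
)
as_plain = output.to_dict()
print(as_plain)
```

The resulting dict contains only strings, floats and lists, so it can be passed on to pandas or written out as JSON.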

Add 'positive', 'negative', 'neutral' label to sentence and document polarity outputs. For documents add 'mixed' as well. Similar to:

Add extraction function for positive and negative descriptors

Add extraction function for positive and negative descriptors of e.g. entities, nouns, or noun chunks. These should simply be dependency patterns such as [POS/NEG > SUBJ].

Where the use case is aspect-based sentiment extractions.

Question - Set minimum threshold

Great library! Apologies for combining several questions in one issue.

This is what we get and see in the docs:

Is there a way to exclude anything whose absolute value is below 1.5, for example (i.e. Math.abs(score) < 1.5)? In that example, we would ignore both labels. The main idea is to identify n-grams (e.g., 3+ words) that carry a single label as very positive or very negative feedback.

Besides, from a style perspective, we saw this one as well. Assuming this is an old theme, is there an option to set it?

The idea would be to merge it with displacy entities rendering and the style is much more similar.
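One way to approach the threshold question outside the visualizer is to post-filter the token polarities yourself. The sketch below works on plain `(token, polarity)` pairs rather than Asent objects. The helper name `strong_tokens`, the sample scores and the 1.5 cutoff are illustrative assumptions.

```python
def strong_tokens(token_polarities, threshold=1.5):
    """Keep only tokens whose absolute polarity reaches the threshold."""
    return [
        (token, polarity)
        for token, polarity in token_polarities
        if abs(polarity) >= threshold
    ]


# Invented scores for illustration.
scored = [
    ("service", 0.0),
    ("was", 0.0),
    ("great", 3.1),
    ("but", 0.0),
    ("slow", -1.1),
]
print(strong_tokens(scored))
```

With Asent, such pairs could presumably be built from the `TokenPolarityOutput` objects, whose printed form in other issues shows `token` and `polarity` fields.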

Danish model scores are all zeroes

How to reproduce the behaviour

Here is the code

import asent
import spacy

nlp = spacy.load("da_core_news_lg")
nlp.add_pipe("asent_da_v1")

doc = nlp("den blev vidst taget men så startede entemusikken igen?")
for sentence in doc.sents:
    print(sentence._.polarity)

output

neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=den blev vidst taget men så startede entemusikken igen?

Your Environment

asent Version Used: 0.6.0
Operating System: Ubuntu 18.04
Python Version Used: 3.8.9
spaCy Version Used: 3.4.4
Environment Information:

NotImplementedError: [E111] Pickling a token is not supported

What is this error?

Option to save output from `asent.visualize`

It would be neat to have the option to save the output of a visualization as an SVG/PNG file for use in presentations/papers. Currently, the methods return None, and screenshotting the output tends to produce quite low-resolution images.

v. 1.0.0 to do list

Create a test suite to allow for comparisons with other sentiment models.

Create a series of tests to apply the model to for each of the languages.

When the test suite is done test the following models:

Only lexicon (try with AFINNs also - not lemmatized)
Tree-based negations
POS-based negations

intended usage:

import asent
import spacy
from asent.benchmarks import benchmark


nlp = ...
nlp.add_pipe("asen...

grid = {"valence": [...],
        "is_negated": [...]}

perf = benchmark(lang="da", grid=grid)
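The `benchmark` function above is proposed rather than existing API. As a sketch of how such a grid could be expanded into individual configurations, `itertools.product` is enough. The option values below are invented placeholders.

```python
from itertools import product


def expand_grid(grid):
    """Yield one dict per combination of the grid's parameter values."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        yield dict(zip(keys, values))


# Placeholder option values, chosen only to show the expansion.
grid = {
    "valence": ["lexicon_a", "lexicon_b"],
    "is_negated": ["tree_based", "pos_based"],
}
configs = list(expand_grid(grid))
for config in configs:
    print(config)
```

Each resulting dict would then correspond to one pipeline configuration to evaluate on the test suite.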

Obtain test coverage >95%

Obtain test coverage above 95%, including at least one test for each language.

Improve the speed of the implementation

While the current implementation is okay, it could likely be made faster, either by using NumPy for some of the computation or by using setters (as opposed to getters) for the extensions.

Edit: It is probably better to use spaCy's matchers and replace the dictionaries with match patterns. These would also be relevant for #4

Add 'prediction-detailed' visualiser to asent using the span visualiser

This should allow overlapping spans to be labeled even though they are nested e.g. the word, word+intensifier, and word+intensifier+negation can denote three overlapping spans.

Streamlit demo application

It would be convenient to have a Streamlit application for testing the models on some data.

Additional language resources to add

Duplicated text in `prediction` visualization when there are overlapping spans

Hello!

Bug details

The Token polarity attribute will perform a look-back of (by default) 3 tokens, and the span of the resulting TokenPolarityOutput may thus be larger than the token itself. This causes potential span overlaps, such as in the example below. This is problematic with the current visualize implementation for the prediction style, as it duplicates the overlapping text.

You may have already been aware of this, given #52, but I figured I would make this report regardless.
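To make the overlap concrete, the sketch below models the look-back as index arithmetic only. As a simplification, it assumes every sentiment-bearing token's span always reaches back the full look-back distance. The real spans depend on which modifiers Asent actually finds.

```python
def lookback_span(index, lookback=3):
    """Return (start, end) token indices for a token's reported span.

    Simplification: the span always reaches back `lookback` tokens,
    clamped at the start of the text. The end index is exclusive.
    """
    return max(0, index - lookback), index + 1


def overlapping_pairs(spans):
    """Return index pairs of spans that share at least one token."""
    pairs = []
    for i in range(len(spans)):
        for j in range(i + 1, len(spans)):
            (start_a, end_a), (start_b, end_b) = spans[i], spans[j]
            if start_a < end_b and start_b < end_a:
                pairs.append((i, j))
    return pairs


tokens = ["I", "am", "not", "pretty", "quite", "unhappy"]
# Assume "pretty" (index 3) and "unhappy" (index 5) carry sentiment.
spans = [lookback_span(3), lookback_span(5)]
print(spans, overlapping_pairs(spans))
```

A renderer that prints each span's text in full would emit the shared tokens twice, which matches the duplication described above.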

How to reproduce the behaviour

import asent
import spacy

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("I am not pretty quite unhappy")
# doc = nlp("I am not great, unhappy is how I would describe myself.")

asent.visualize(doc, style="prediction")

Bugged results

(See also #58 for a secondary bug related to the second image, i.e. the unhappy section being regarded as positive)

My Environment

asent version: 0.4.3
spaCy version: 3.4.1
Platform: Windows-10-10.0.19043-SP0
Python version: 3.10.1
Pipelines: en_core_web_lg (3.4.0)

Tom Aarsen

`nlp` gets incorrectly overwritten in Getting Started

Hello!

The documentation page Getting started has the following snippet:

import asent
import spacy

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp = nlp.add_pipe("asent_en_v1")

As you likely know, add_pipe returns the pipeline component, and thus nlp is overwritten to asent.component.Asent, causing the remaining code snippets to fail.

Which page or section is this issue related to?

https://kennethenevoldsen.github.io/asent/introduction.html, each of the 4 language tabs

Tom Aarsen

Another problem with spaCy on https://www.kaggle.com/code/smoestergaard/some-s22-reddit-toolbox/notebook

When I run the second cell (the one titled "#Load packages") on https://www.kaggle.com/code/smoestergaard/some-s22-reddit-toolbox/notebook, I receive a new error:

What should I do?

Polarity of span uses sentiment of tokens prior to the span, and commas/end-of-sentence markers are ignored

Hello!

Bug details

This bug report encompasses two related bugs. I'll group them together in this report, as they might both be solvable with the same fix.

The polarity of spans will take into consideration the tokens prior to the span, causing large issues in the produced sentiment.
Commas and end-of-sentence markers are disregarded.

How to reproduce the behaviour for bug 1

Sample code

import asent
import spacy
from pprint import pprint

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("I am not very happy.")

print(doc[3:])
print(doc[3:]._.polarity)
pprint(doc[3:]._.polarity.polarities)

Bugged results

very happy.
neg=0.616 neu=0.384 pos=0.0 compound=-0.4964 span=very happy.
[TokenPolarityOutput(polarity=0.0, token=very, span=very),
 TokenPolarityOutput(polarity=-2.215, token=happy, span=not very happy),
 TokenPolarityOutput(polarity=0.0, token=., span=.)]

Despite stating that the span is very happy., you can see that the sentiment is very negative, as it considers the tokens prior to the start of the span as well. Note the TokenPolarityOutput(polarity=-2.215, token=happy, span=not very happy). I discovered this bug while working on #52, as the polarity of the not very happy, very happy and happy spans were all reported to be the same.
You could argue that this is not that big of a deal, as most people are interested in per-sentence sentiment. However, this can also cause issues between sentences, as can be seen below:

Sample code

import asent
import spacy
from pprint import pprint

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("Would you do that? I would not. Very stupid is what that is.")

for sent in doc.sents:
    print(f"{sent.text:<30} - {sent._.polarity}")

pprint(list(doc.sents)[-1]._.polarity.polarities)

Bugged results

Would you do that?             - neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=Would you do that?
I would not.                   - neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=I would not.
Very stupid is what that is.   - neg=0.0 neu=0.667 pos=0.333 compound=0.4575 span=Very stupid is what that is.
[TokenPolarityOutput(polarity=0.0, token=Very, span=Very),
 TokenPolarityOutput(polarity=1.993, token=stupid, span=not. Very stupid),
 TokenPolarityOutput(polarity=0.0, token=is, span=is),
 TokenPolarityOutput(polarity=0.0, token=what, span=what),
 TokenPolarityOutput(polarity=0.0, token=that, span=that),
 TokenPolarityOutput(polarity=0.0, token=is, span=is),
 TokenPolarityOutput(polarity=0.0, token=., span=.)]

Note the TokenPolarityOutput(polarity=1.993, token=stupid, span=not. Very stupid).

How to reproduce the behaviour for bug 2

See the second sample from this issue, as well as the commented-out example from #57. In these examples, tokens from prior sentence (segments) are used to modify the polarity of tokens in the new sentence (segment). For example, in the second example from #57, the total polarity of the entire document is DocPolarityOutput(neg=0.198, neu=0.662, pos=0.14, compound=-0.2411). This is higher than expected, because the unhappy part becomes positive. Rewriting the input to be "I am not great, I would describe myself as unhappy." will cause the polarity to become DocPolarityOutput(neg=0.379, neu=0.621, pos=0.0, compound=-0.7264), with the following visualization:

This is the behaviour I would expect, even before rewriting the input.
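One possible fix direction, sketched with plain lists rather than spaCy objects, is to stop the backwards search for negations at the start of the current sentence or span. The negation set and the function name are simplifying assumptions. Asent's real scoring also handles intensifiers and scaling.

```python
NEGATIONS = {"not", "never", "no"}  # tiny illustrative set


def is_negated(tokens, index, boundary_start, lookback=3):
    """Check for a negation in the look-back window, never crossing boundary_start."""
    window_start = max(boundary_start, index - lookback)
    return any(tok.lower() in NEGATIONS for tok in tokens[window_start:index])


tokens = ["I", "would", "not", ".", "Very", "stupid", "is", "what", "that", "is", "."]
stupid = tokens.index("stupid")

# An unbounded look-back reaches into the previous sentence and finds "not".
crosses = is_negated(tokens, stupid, boundary_start=0)
# Clamped at the sentence start (index 4), the negation is no longer seen.
clamped = is_negated(tokens, stupid, boundary_start=4)
print(crosses, clamped)
```

The same clamp, applied with the span's own start index, would address the first bug as well, at the cost of ignoring negations that sit just outside a user-selected span.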

My Environment

asent version: 0.4.3
spaCy version: 3.4.1
Platform: Windows-10-10.0.19043-SP0
Python version: 3.10.1
Pipelines: en_core_web_lg (3.4.0)

Tom Aarsen

Add conditional dicts

While the current dictionaries are simple, one could imagine much more extensive rules, ranging from simple ones such as POS tag restrictions to more complex ones that might include word sense disambiguation.

The idea here is to create a dictionary object which takes in a spaCy Token (which allows one to extract all of the above) and returns the desired values.

This, however, requires a more general restructuring of the codebase.

Edit: seems like this can be done with spaCy's matchers
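A sketch of the dictionary-object idea, with a small dataclass standing in for a spaCy `Token`: the lexicon is a callable that inspects the token and returns a valence, here restricted by part-of-speech tag. The class names, the example entry and its value are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, Optional, Set, Tuple


@dataclass
class FakeToken:
    """Stand-in for a spaCy Token with just the attributes used below."""
    text: str
    pos_: str


class ConditionalLexicon:
    """Maps a token to a valence, optionally restricted to certain POS tags."""

    def __init__(self, entries: Dict[str, Tuple[float, Optional[Set[str]]]]):
        self.entries = entries

    def __call__(self, token: FakeToken) -> float:
        entry = self.entries.get(token.text.lower())
        if entry is None:
            return 0.0
        valence, allowed_pos = entry
        if allowed_pos is not None and token.pos_ not in allowed_pos:
            return 0.0
        return valence


# "like" only carries sentiment as a verb ("I like it"), not as a preposition.
lexicon = ConditionalLexicon({"like": (1.5, {"VERB"})})
print(lexicon(FakeToken("like", "VERB")), lexicon(FakeToken("like", "ADP")))
```

Because the callable receives the whole token, richer conditions (dependency relation, word sense) could be added without changing the calling code.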

Add option to include both lemmatized and non-lemmatized words.

For the lemmatize flag, create an option for both, where the model first looks up the non-lemmatized word and falls back to the lemmatized word.
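The fallback could be as simple as the following sketch, which uses a plain dict as the lexicon. The function name and the valence values are invented for illustration.

```python
def lookup_valence(text, lemma, lexicon, default=0.0):
    """Look up the surface form first and fall back to the lemma."""
    key = text.lower()
    if key in lexicon:
        return lexicon[key]
    return lexicon.get(lemma.lower(), default)


# Invented lexicon values for illustration.
lexicon = {"love": 3.2, "loved": 2.9}
print(lookup_valence("loved", "love", lexicon))   # surface form is present
print(lookup_valence("loving", "love", lexicon))  # falls back to the lemma
```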

Negation, intensifiers visualizers

Using the dependency visualizer from spaCy, create a visualizer which reproduces the negation and intensifier relations for greater transparency. This will need a rework of the Polarity output classes.

Basically this workflow:

from spacy import displacy
ex = {
    "words": [
        {"text": "I", "tag": "0"},
        {"text": "'m", "tag": "0"},
        {"text": "not", "tag": "0"},
        {"text": "very", "tag": "0"},
        {"text": "happy", "tag": "2.7"}
    ],
    "arcs": [
        {"start": 2, "end": 4, "label": "negated by", "dir": "left"},
        {"start": 3, "end": 4, "label": "intensified by", "dir": "left"}
    ]
}
html = displacy.render(ex, style="dep", manual=True)

resulting in:
