kennethenevoldsen / asent

Asent is a Python library for performing efficient and transparent sentiment analysis using spaCy.

Home Page: https://kennethenevoldsen.github.io/asent/

License: MIT License

Languages: Python 99.78%, Makefile 0.22%
Topics: interpretability, natural-language-processing, nlp, python3, sentiment-analysis, spacy, spacy-extensions

asent's Introduction

Kenneth Enevoldsen

Researcher, scholar, teacher

 kennethcenevoldsen


Projects

The following are projects I am actively maintaining or contributing to. More may have been added since this list was last updated.

  • MTEB: The Massive Text Embedding Benchmark for evaluating document embeddings, e.g. for RAG systems.
  • Scandinavian Embedding Benchmark: A Scandinavian benchmark for evaluating document embeddings.
  • DaCy: The state-of-the-art Danish NLP pipeline for spaCy.
  • tomsup: Theory of Mind simulation using Python. A package that allows for easy agent-based modeling of recursive Theory of Mind agents.
  • Augmenty: A structured augmentation library for augmenting both texts and annotations.
  • TextDescriptives: A Python library for calculating a large variety of metrics from text.
  • timeseriesflattener: A package for converting irregularly spaced time series, such as electronic health records, into statically shaped data frames.
  • Asent: An educational library for performing transparent sentiment analysis.
  • ScandEval: An evaluation benchmark for Scandinavian and Germanic language models, covering natural language understanding and generation.
  • swift-python-cookiecutter: The cookiecutter template I actively use for my packages.
  • UD_Danish-DDT: The Danish Universal Dependencies Treebank, a high-quality linguistic resource.

Contributions:

A selection of contributions to open-source libraries, besides those I actively maintain or contribute to.

  • Transformers: Multiple bugfixes for training masked language models using Flax
  • spaCy core libraries:
    • spacy-transformers: Allowed passing arguments to the transformer backend and its forward pass
    • confection: Fixed an issue where a config could not be filled
    • spacy-curated-transformers: Added support for ELECTRA tokenizers
    • curated-transformers: Added ELECTRA

asent's People

Contributors

alouca, dependabot[bot], emilstenstrom, kennethenevoldsen, martinbernstorff, pre-commit-ci[bot], tomaarsen


asent's Issues

Issue with asent when running a spaCy pipeline with multiprocessing

I get an error when trying to run a spaCy pipeline with the asent component using multiprocessing.

How to reproduce the behaviour

import asent  # registers the asent_en_v1 component
import pandas as pd
import spacy

documents = pd.Series(['I am trying to run this....', 'Also this one.'])

model = spacy.load('en_core_web_sm')
model.add_pipe('sentencizer', first=True)
model.add_pipe('asent_en_v1')

for doc in model.pipe(documents, batch_size=16, n_process=2):
    pass

As a result I get

Traceback (most recent call last):
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 1694, in _multiprocessing_pipe
    self.default_error_handler(
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/util.py", line 1724, in raise_error
    raise e
ValueError: [E871] Error encountered in nlp.pipe with multiprocessing:

Traceback (most recent call last):
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 2303, in _apply_pipes
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/language.py", line 2303, in <listcomp>
    byte_docs = [(doc.to_bytes(), doc._context, None) for doc in docs]
  File "spacy/tokens/doc.pyx", line 1348, in spacy.tokens.doc.Doc.to_bytes
  File "spacy/tokens/doc.pyx", line 1411, in spacy.tokens.doc.Doc.to_dict
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/spacy/util.py", line 1352, in to_dict
    serialized[key] = getter()
  File "spacy/tokens/doc.pyx", line 1408, in spacy.tokens.doc.Doc.to_dict.lambda20
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/srsly/_msgpack_api.py", line 14, in msgpack_dumps
    return msgpack.dumps(data, use_bin_type=True)
  File "/home/username/.pyenv/versions/my_test_repo/lib/python3.10/site-packages/srsly/msgpack/__init__.py", line 55, in packb
    return Packer(**kwargs).pack(o)
  File "srsly/msgpack/_packer.pyx", line 285, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 291, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 288, in srsly.msgpack._packer.Packer.pack
  File "srsly/msgpack/_packer.pyx", line 264, in srsly.msgpack._packer.Packer._pack
  File "srsly/msgpack/_packer.pyx", line 282, in srsly.msgpack._packer.Packer._pack
TypeError: can not serialize 'DocPolarityOutput' object

python-BaseException

Process finished with exit code 1

Your Environment

  • asent Version Used: 0.7.6
  • Operating System: Ubuntu 20.04.6 LTS
  • Python Version Used: 3.10.0
  • spaCy Version Used: 3.6.1
  • Environment Information: en-core-web-sm==3.6.0 pandas==1.5.3

I would be very glad for any help with this issue.
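One possible workaround (a sketch, not from the original thread): keep the pipe single-process and extract plain floats, so the custom DocPolarityOutput never has to cross a process boundary via Doc.to_bytes().

import asent  # registers the asent_en_v1 component
import pandas as pd
import spacy

documents = pd.Series(['I am trying to run this....', 'Also this one.'])

model = spacy.load('en_core_web_sm')
model.add_pipe('sentencizer', first=True)
model.add_pipe('asent_en_v1')

# With the default n_process=1, docs stay in one process and are never
# serialized, sidestepping the msgpack TypeError above.
scores = [doc._.polarity.compound for doc in model.pipe(documents, batch_size=16)]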

Compound not taking into account all span polarities

I am trying to understand the scores for the following text:

The IRA will provide about $370bn of subsidies for clean energy, marking America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies.

The document polarity scores are the following:
neg=0.29 neu=0.71 pos=0.0 compound=-0.9477

The positive score is 0.0, despite clean energy and most ambitious having positive polarities.

Changing the text just a bit (replacing ", marking" with ". It marked") changes the compound score, despite not changing the underlying span polarities:
neg=0.064 neu=0.751 pos=0.185 compound=0.1559

Code to reproduce:

!pip install -U spacy
!pip install asent
!python -m spacy download en_core_web_lg

import spacy
import asent

nlp = spacy.load('en_core_web_lg')
nlp.add_pipe('asent_en_v1')

text = "The IRA will provide about $370bn of subsidies for clean energy, marking America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies."
print(nlp(text)._.polarity)
asent.visualize(nlp(text))

text = "The IRA will provide about $370bn of subsidies for clean energy. It marked America’s most ambitious effort to tackle climate change, but has triggered furious criticism in Brussels and allegations that the US is discriminating against EU companies."
print(nlp(text)._.polarity)
asent.visualize(nlp(text))

  • spacy==3.7.2
  • asent==0.8.0
  • Python==3.10
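For context (not part of the original report): asent's compound score appears to be modeled on VADER's, which sums the token valences within a sentence and squashes the sum into (-1, 1). Under that assumption, the squashing is sensitive to how the text splits into sentences, which would explain why re-punctuating the text changes the compound even though the span polarities stay the same. A minimal sketch of the VADER-style normalization:

import math

def compound(summed_valence: float, alpha: float = 15.0) -> float:
    """VADER-style squashing of a summed valence into (-1, 1)."""
    return summed_valence / math.sqrt(summed_valence**2 + alpha)

# One long sentence with a strongly negative valence sum squashes to a
# large negative compound; splitting it in two yields two smaller sums
# that squash (and aggregate) quite differently.
print(compound(-8.0))                  # ≈ -0.90
print(compound(-4.0), compound(1.5))   # ≈ -0.72, 0.36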

Question - Set minimum threshold

Great library! Apologies for bundling several questions into one.

This is what we get and see in the docs:
[screenshot of the visualization from the docs]

Is there a way to exclude anything with an absolute polarity below 1.5 (i.e. Math.abs(polarity) < 1.5), for example? In that example, we would ignore both labels. The main idea is to identify n-grams (e.g., 3+ words) with a single label as very positive or very negative feedback.
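A minimal sketch of such a filter (hypothetical, not a built-in asent option; it uses the span-level `.polarities` attribute that appears in other issues below):

import asent  # registers the asent_en_v1 component
import spacy

nlp = spacy.load("en_core_web_lg")
nlp.add_pipe("asent_en_v1")

doc = nlp("The support team was absolutely fantastic and very helpful.")
threshold = 1.5
for sent in doc.sents:
    # keep only token polarities whose magnitude clears the threshold
    strong = [p for p in sent._.polarity.polarities if abs(p.polarity) >= threshold]
    for p in strong:
        print(p.token, p.polarity)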

Also, from a style perspective, we came across the rendering below. Assuming this is an older theme, is there an option to select it?
[screenshot of the older visualization theme]

The idea would be to merge it with displaCy's entity rendering, since the styles are much more similar.

Danish model scores are all zeroes

How to reproduce the behaviour

Here is the code

import asent
import spacy

# load the Danish spaCy pipeline
nlp = spacy.load("da_core_news_lg")

# add the Danish rule-based sentiment model
nlp.add_pipe("asent_da_v1")

doc = nlp("den blev vidst taget men så startede entemusikken igen?")
for sentence in doc.sents:
    print(sentence._.polarity)

output

neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=den blev vidst taget men så startede entemusikken igen?
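A quick way to narrow this down (a sketch, assuming the token-level `._.polarity` extension set by the component): inspect the per-token polarities. Tokens absent from the Danish lexicon come out as 0.0, which would explain an all-zero sentence score.

import asent
import spacy

nlp = spacy.load("da_core_news_lg")
nlp.add_pipe("asent_da_v1")

doc = nlp("den blev vidst taget men så startede entemusikken igen?")

# Zeros throughout would suggest the tokens (e.g. the misspellings
# "vidst" and "entemusikken") are simply not covered by the lexicon.
for token in doc:
    print(token.text, token._.polarity)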

Your Environment

  • asent Version Used: 0.6.0
  • Operating System: Ubuntu 18.04
  • Python Version Used: 3.8.9
  • spaCy Version Used: 3.4.4
  • Environment Information:

Option to save output from `asent.visualize`

It would be neat to have the option to save the output of a visualization to an SVG/PNG for use in presentations/papers. Currently, the method returns None, and screenshotting the output tends to produce quite low-resolution images.
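Until then, a possible workaround (a sketch, assuming the visualizer is built on displaCy, as the negation-visualizer issue further down suggests): render the markup manually with jupyter=False, which returns it as a string that can be written to disk.

from pathlib import Path
from spacy import displacy

# Manual payload in the format displaCy expects (mirrors the example in
# the negation-visualizer issue below).
ex = {
    "words": [
        {"text": "not", "tag": "0"},
        {"text": "very", "tag": "0"},
        {"text": "happy", "tag": "2.7"},
    ],
    "arcs": [
        {"start": 0, "end": 2, "label": "negated by", "dir": "left"},
        {"start": 1, "end": 2, "label": "intensified by", "dir": "left"},
    ],
}

# With jupyter=False, displacy.render returns the SVG markup as a string.
svg = displacy.render(ex, style="dep", manual=True, jupyter=False)
Path("sentiment.svg").write_text(svg, encoding="utf-8")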

v. 1.0.0 to do list

  • Documentation workflow
  • Custom issues
  • testing workflow
    • coverage comment
  • pypi workflow
  • Add and fix tests
  • Logo
  • Polish readme
  • Tutorials
    • Get started
    • Customizing the pipeline
  • Final check on documentation
    • Add list of components to component API section
    • Add example of visualizer to visualizers in the API section
  • Code review by Lasse and Martin
  • #13

Create a test suite to allow for comparisons with other sentiment models.

Create a series of tests to apply the model to, for each of the languages.

When the test suite is done test the following models:

  • Only lexicon (try with AFINNs also - not lemmatized)
  • Tree-based negations
  • POS-based negations

Intended usage:

import asent
import spacy
from asent.benchmarks import benchmark  # proposed module, does not exist yet


nlp = ...
nlp.add_pipe("asent_da_v1")

grid = {"valence": [...],
        "is_negated": [...]}

perf = benchmark(grid, lang="da")

Improve the speed of the implementation

While the current implementation is okay, it could likely become faster, either by using NumPy for some of the computation or by using setters (as opposed to getters) for the extensions.

Edit: Probably better to use spaCy's Matcher and replace the dictionaries with match patterns. These would also be relevant for #4.
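A minimal sketch of the Matcher idea (hypothetical, not asent's implementation): compile lexicon entries into match patterns so lookups run inside spaCy's optimized matching loop rather than in per-token Python dictionary lookups.

import spacy
from spacy.matcher import Matcher

nlp = spacy.blank("en")
matcher = Matcher(nlp.vocab)

# One pattern set per rule type; a lowercase IN-list stands in for the
# dictionary lookup.
matcher.add("NEGATION", [[{"LOWER": {"IN": ["not", "never", "n't"]}}]])

doc = nlp("This is not bad.")
for match_id, start, end in matcher(doc):
    print(nlp.vocab.strings[match_id], doc[start:end])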

Duplicated text in `prediction` visualization when there are overlapping spans

Hello!

Bug details

The Token polarity attribute will perform a look-back of (by default) 3 tokens, and the span of the resulting TokenPolarityOutput may thus be larger than the token itself. This causes potential span overlaps, such as in the example below. This is problematic with the current visualize implementation for the prediction style, as it duplicates the overlapping text.

You may have already been aware of this, given #52, but I figured I would make this report regardless.

How to reproduce the behaviour

import asent
import spacy

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("I am not pretty quite unhappy")
# doc = nlp("I am not great, unhappy is how I would describe myself.")

asent.visualize(doc, style="prediction")

Bugged results

[screenshots of the bugged visualizations, showing the overlapping text duplicated]

(See also #58 for a secondary bug related to the second image, i.e. the unhappy section being regarded as positive)

My Environment

  • asent version: 0.4.3
  • spaCy version: 3.4.1
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.10.1
  • Pipelines: en_core_web_lg (3.4.0)

  • Tom Aarsen

`nlp` gets incorrectly overwritten in Getting Started

Hello!

The documentation page Getting started has the following snippet:

import asent
import spacy

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp = nlp.add_pipe("asent_en_v1")

As you likely know, add_pipe returns the pipeline component, and thus nlp is overwritten to asent.component.Asent, causing the remaining code snippets to fail.
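For reference, the corrected snippet (matching the pattern used in the other issues here):

import asent
import spacy

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model; add_pipe modifies nlp in place,
# so its return value (the component) should not be assigned to nlp
nlp.add_pipe("asent_en_v1")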

Which page or section is this issue related to?

https://kennethenevoldsen.github.io/asent/introduction.html, each of the 4 language tabs

  • Tom Aarsen

Polarity of span uses sentiment of tokens prior to the span, and commas/end-of-sentence markers are ignored

Hello!

Bug details

This bug report encompasses two related bugs. I'll group them together in this report, as they might both be solvable with the same fix.

  1. The polarity of spans will take into consideration the tokens prior to the span, causing large issues in the produced sentiment.
  2. Commas and end-of-sentence markers are disregarded.

How to reproduce the behaviour for bug 1

Sample code

import asent
import spacy
from pprint import pprint

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("I am not very happy.")

print(doc[3:])
print(doc[3:]._.polarity)
pprint(doc[3:]._.polarity.polarities)

Bugged results

very happy.
neg=0.616 neu=0.384 pos=0.0 compound=-0.4964 span=very happy.
[TokenPolarityOutput(polarity=0.0, token=very, span=very),
 TokenPolarityOutput(polarity=-2.215, token=happy, span=not very happy),
 TokenPolarityOutput(polarity=0.0, token=., span=.)]

Despite stating that the span is very happy., you can see that the sentiment is very negative, as it considers the tokens prior to the start of the span as well. Note the TokenPolarityOutput(polarity=-2.215, token=happy, span=not very happy). I discovered this bug while working on #52, as the polarities of the not very happy, very happy and happy spans were all reported to be the same.
You could argue that this is not that big of a deal, as most people are interested in per-sentence sentiment. However, this can also cause issues between sentences, as can be seen below:

Sample code

import asent
import spacy
from pprint import pprint

# load spacy pipeline
nlp = spacy.load("en_core_web_lg")

# add the rule-based sentiment model
nlp.add_pipe("asent_en_v1")

doc = nlp("Would you do that? I would not. Very stupid is what that is.")

for sent in doc.sents:
    print(f"{sent.text:<30} - {sent._.polarity}")

pprint(list(doc.sents)[-1]._.polarity.polarities)

Bugged results

Would you do that?             - neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=Would you do that?
I would not.                   - neg=0.0 neu=0.0 pos=0.0 compound=0.0 span=I would not.
Very stupid is what that is.   - neg=0.0 neu=0.667 pos=0.333 compound=0.4575 span=Very stupid is what that is.
[TokenPolarityOutput(polarity=0.0, token=Very, span=Very),
 TokenPolarityOutput(polarity=1.993, token=stupid, span=not. Very stupid),
 TokenPolarityOutput(polarity=0.0, token=is, span=is),
 TokenPolarityOutput(polarity=0.0, token=what, span=what),
 TokenPolarityOutput(polarity=0.0, token=that, span=that),
 TokenPolarityOutput(polarity=0.0, token=is, span=is),
 TokenPolarityOutput(polarity=0.0, token=., span=.)]

Note the TokenPolarityOutput(polarity=1.993, token=stupid, span=not. Very stupid).

How to reproduce the behaviour for bug 2

See the second sample from this issue, as well as the commented-out example from #57. In these examples, tokens from prior sentences (or sentence segments) are used to modify the polarity of tokens in the new sentence (or segment). For example, in the second example from #57, the total polarity of the entire document is DocPolarityOutput(neg=0.198, neu=0.662, pos=0.14, compound=-0.2411). This is higher than expected, because the unhappy part becomes positive. Rewriting the input to be "I am not great, I would describe myself as unhappy." will cause the polarity to become DocPolarityOutput(neg=0.379, neu=0.621, pos=0.0, compound=-0.7264), with the following visualization:
[screenshot of the expected visualization]
This is the behaviour I would expect, even before rewriting the input.
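A possible direction for a fix (a sketch, not asent's actual code): clamp the look-back window used for negation/intensification so it never crosses the start of the token's sentence, which would address both bugs above.

from spacy.tokens import Token

def lookback_window(token: Token, n: int = 3):
    """Tokens considered for negation/intensification of `token`,
    clipped so the window never crosses the sentence start."""
    start = max(token.i - n, token.sent.start)
    return token.doc[start : token.i]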

My Environment

  • asent version: 0.4.3
  • spaCy version: 3.4.1
  • Platform: Windows-10-10.0.19043-SP0
  • Python version: 3.10.1
  • Pipelines: en_core_web_lg (3.4.0)

  • Tom Aarsen

Add conditional dicts

While the current dictionaries are simple, one could imagine much more extensive rules: simple ones might be POS-tag restrictions, while more complex ones might include word-sense disambiguation.

The idea here is to create a dictionary object which takes in a spaCy Token (which allows one to extract all of the above) and returns the desired values.

This, however, requires a more general restructuring of the codebase.
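A minimal sketch of the idea (hypothetical, not an asent API): a lexicon whose entries are callables taking a spaCy Token, so valence can be conditioned on token attributes such as the POS tag.

from spacy.tokens import Token

def valence_of_like(token: Token) -> float:
    # "like" carries positive valence as a verb ("I like it"),
    # but not as an adposition ("it looks like rain").
    return 1.5 if token.pos_ == "VERB" else 0.0

# entries map a lowercase form to either a plain float or a callable
conditional_lexicon = {"like": valence_of_like}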

Edit: it seems like this can be done with spaCy's Matcher.

Negation, intensifiers visualizers

Using the dependency visualizer from spaCy, create a visualizer which reproduces the negation and intensification relations for easier transparency. This will need a rework of the polarity output classes.

Basically this workflow:

from spacy import displacy
ex = {
    "words": [
        {"text": "I", "tag": "0"},
        {"text": "'m", "tag": "0"},
        {"text": "not", "tag": "0"},
        {"text": "very", "tag": "0"},
        {"text": "happy", "tag": "2.7"}
    ],
    "arcs": [
        {"start": 2, "end": 4, "label": "negated by", "dir": "left"},
        {"start": 3, "end": 4, "label": "intensified by", "dir": "left"}
    ]
}
html = displacy.render(ex, style="dep", manual=True)

resulting in:
[screenshot of the rendered dependency-style sentiment visualization]

Add workflow sketch

Create a sketch which outlines the workflow of the model, to allow people to better understand how the model works.
