
stansent's Introduction

Project Status: Inactive – The project has reached a stable, usable state but is no longer being actively developed; support/maintenance will be provided as time allows.

stansent wraps Stanford's coreNLP sentiment tagger, making the tagger easier to set up and use from R. The output is designed to look and behave like the objects from the sentimentr package; plotting and the sentimentr::highlight functionality work similarly to the sentiment/sentiment_by objects from sentimentr. This reduces the learning needed to work between the two packages.

In addition to sentimentr and stansent, Matthew Jockers has created the syuzhet package, which uses dictionary lookups for the Bing, NRC, and Afinn methods. Similarly, Subhasree Bose has contributed RSentiment, which uses dictionary lookup and attempts to address negation and sarcasm. Click here for a comparison between stansent, sentimentr, syuzhet, and RSentiment; note the accuracy and run times of the packages.
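For reference, a minimal sketch of the dictionary-lookup approach those packages take (this assumes the syuzhet package is installed; the method names are syuzhet's own):

library(syuzhet)

txt <- c("I hate really bad dogs", "I am the best friend.")

# dictionary-based scores; one value per element of txt
get_sentiment(txt, method = "bing")
get_sentiment(txt, method = "afinn")
get_sentiment(txt, method = "nrc")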

Installation

To download the development version of stansent:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/coreNLPsetup", "trinker/stansent")

After installing, run the following to ensure Java and coreNLP are installed correctly:

check_setup()

This checks that your Java installation meets the minimum required version and that coreNLP is set up in the expected location.
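If check_setup() reports that coreNLP is missing, the helper below (assumed here to come from the coreNLPsetup dependency; it appears again in the issues further down) will offer to download and install coreNLP for you. This is a sketch of one possible route, not the only one:

library(coreNLPsetup)
check_stanford_installed()   # offers to download coreNLP if it is not found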


Functions

There are two main functions in stansent along with a few helper functions. The main functions, task category, & descriptions are summarized in the table below:

Function                 Task Category    Description
sentiment_stanford       sentiment        Sentiment at the sentence level
sentiment_stanford_by    sentiment        Aggregated sentiment by group(s)
uncombine                reshaping        Extract sentence level sentiment from sentiment_by
get_sentences            reshaping        Regex based string to sentence parser (or get sentences from sentiment/sentiment_by)
highlight                                 Highlight positive/negative sentences as an HTML document
check_setup              initial set-up   Make sure Java and coreNLP are set up correctly

Contact

You are welcome to submit suggestions and bug reports via the project's GitHub issue tracker or to send a pull request.

Demonstration

Load the Packages/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(c("trinker/stansent", "trinker/sentimentr"))
pacman::p_load(dplyr)

mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)

data(presidential_debates_2012, cannon_reviews)
set.seed(100)
dat <- presidential_debates_2012[sample(1:nrow(presidential_debates_2012), 100), ]

sentiment_stanford

out1 <- sentiment_stanford(mytext) 
out1[["text"]] <- unlist(get_sentences(out1))
out1

##    element_id sentence_id word_count sentiment                       text
## 1:          1           1          4       0.0            do you like it?
## 2:          1           2          6      -0.5 But I hate really bad dogs
## 3:          2           1          5       0.5      I am the best friend.
## 4:          3           1          5       0.0     Do you really like it?
## 5:          3           2          4      -0.5              I'm not a fan

sentiment_stanford_by: Aggregation

To aggregate by element (column cell or vector element) use sentiment_stanford_by with by = NULL.

out2 <- sentiment_stanford_by(mytext) 
out2[["text"]] <- mytext
out2

##    element_id word_count        sd ave_sentiment
## 1:          1         10 0.3535534         -0.25
## 2:          2          5        NA          0.50
## 3:          3          9 0.3535534         -0.25
##                                           text
## 1: do you like it?  But I hate really bad dogs
## 2:                       I am the best friend.
## 3:       Do you really like it?  I'm not a fan

To aggregate by grouping variables, use sentiment_stanford_by with the by argument.

(out3 <- with(dat, sentiment_stanford_by(dialogue, list(person, time))))

##        person   time word_count        sd ave_sentiment
##  1:     OBAMA time 2        207 0.4042260     0.1493099
##  2:     OBAMA time 1         34 0.7071068     0.0000000
##  3:    LEHRER time 1          2        NA     0.0000000
##  4:  QUESTION time 2          7 0.7071068     0.0000000
##  5: SCHIEFFER time 3         47 0.5000000     0.0000000
##  6:     OBAMA time 3        129 0.4166667    -0.1393260
##  7:   CROWLEY time 2         72 0.4166667    -0.1393260
##  8:    ROMNEY time 3        321 0.3746794    -0.1508172
##  9:    ROMNEY time 2        323 0.3875534    -0.2293311
## 10:    ROMNEY time 1         95 0.2236068    -0.4138598

Recycling

Note that the Stanford coreNLP functionality takes considerable time to compute (~14.5 seconds to compute out3 above). The output from sentiment_stanford/sentiment_stanford_by can be recycled inside of sentiment_stanford_by, reusing the raw scoring and saving a new call to Java.

with(dat, sentiment_stanford_by(out3, list(role, time)))

##         role   time word_count        sd ave_sentiment
## 1: candidate time 1        129 0.3933979   -0.29271628
## 2: candidate time 2        530 0.4154046   -0.06751165
## 3: candidate time 3        450 0.3796283   -0.15455530
## 4: moderator time 1          2        NA    0.00000000
## 5: moderator time 2         72 0.4166667   -0.13932602
## 6: moderator time 3         47 0.5000000    0.00000000
## 7:     other time 2          7 0.7071068    0.00000000

Plotting

Plotting at Aggregated Sentiment

The possible sentiment values in the output are {-1, -0.5, 0, 0.5, 1}. The raw number of occurrences at each sentiment level is plotted as a bubble version of Cleveland's dot plot. The red cross represents the mean sentiment score (grouping variables are ordered by this score by default).

plot(out3)

Plotting at the Sentence Level

The plot method for the sentiment class uses syuzhet's get_transformed_values combined with ggplot2 to make a reasonable, smoothed plot of the sentiment over the duration of the text, based on percentage, allowing for comparison between plots of different texts. This plot gives the overall shape of the text's sentiment. See syuzhet::get_transformed_values for more details.

plot(uncombine(out3))
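A rough sketch of what this amounts to under the hood (not the exact plot code; it assumes syuzhet is installed and uses base graphics rather than ggplot2):

sent  <- uncombine(out3)                       # sentence level scores
shape <- syuzhet::get_transformed_values(sent[["sentiment"]], scale_range = TRUE)
plot(shape, type = "l", xlab = "Narrative time (percent)", ylab = "Transformed sentiment")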

Text Highlighting

The user may wish to see the output from sentiment_stanford_by line by line with positive/negative sentences highlighted. The sentimentr::highlight function wraps a sentiment_by output to produce a highlighted HTML file (positive = green; negative = pink). Here we look at three random reviews from Hu and Liu's (2004) Cannon G3 Camera Amazon product reviews.

set.seed(2)
highlight(with(subset(cannon_reviews, number %in% sample(unique(number), 3)), sentiment_stanford_by(review, number)))


stansent's Issues

stansent::check_setup() confirms Java and coreNLP installed but sentiment_stanford() returns error

library(stansent)
check_setup()
#> 
#> checking if Java is installed...

#> java version "1.8.0_171"
#> Java(TM) SE Runtime Environment (build 1.8.0_171-b11)
#> Java HotSpot(TM) Client VM (build 25.171-b11, mixed mode, sharing)

#> checking minimal Java version...

#> Java appears to be installed and at least of the minimal version.

#> checking if coreNLP is installed...

#> Stanford coreNLP appears to be installed.

#> ...Let the NLP tagging begin!
mytext <- c(
    'do you like it?  But I hate really bad dogs',
    'I am the best friend.',
    'Do you really like it?  I\'m not a fan'
)

out1 <- sentiment_stanford(mytext)
#> Warning message:
#> running command 'java -cp "P:/stanford-corenlp-full-2017-06-09/*" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline 
#> -stdin' had status 1 

I tried installing stanford-corenlp-full-2017-06-09 in the R library stansent folder, but to no avail.

I look forward to your advice.

Thanks!
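One way to diagnose this (a hedged suggestion, reusing the command from the warning above) is to rerun the java call directly from R without suppressing stderr, so the underlying Java error becomes visible:

cmd <- paste0('java -cp "P:/stanford-corenlp-full-2017-06-09/*" -mx5g ',
              'edu.stanford.nlp.sentiment.SentimentPipeline -stdin')
system(cmd, input = "This is a test sentence.", intern = TRUE)  # ignore.stderr left FALSE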

Improve sentiment_stanford

I also asked a question on Stack Overflow about keeping phrases from being chopped into sentences (each phrase should be treated as a single sentence): http://stackoverflow.com/q/34483978/1000343

This code doesn't change the working directory:

sentiment_stanford <- function (text.var, stanford.tagger = file.path(strsplit(getwd(), 
    "(/|\\\\)+")[[1]][1], "stanford-corenlp-full-2015-12-09")) {

    if (!file.exists(stanford.tagger)) {
        check_stanford_installed()
    }

    message("\nAnalyzing text for sentiment...\n")
    flush.console()

    cmd <- sprintf(
        "java -cp \"%s/*\" -mx5g edu.stanford.nlp.sentiment.SentimentPipeline -stdin", 
        stanford.tagger
    )

    results <- system(cmd, input = text.var, intern = TRUE, ignore.stderr = TRUE)

    as.numeric(.mgsub(c(".*Very negative", ".*Negative", ".*Neutral", 
        ".*Positive", ".*Very positive"), seq(-1, 1, by = 0.5), 
        results, fixed = FALSE))
}


sentiment_stanford(c("I hate canfdies", "But chester likes it"))

how to install OFFLINE?

Hello trinker,

Thanks for this beautiful package.

I am trying to install it, but the firewall at work does not allow me to connect directly to GitHub. I can download anything I want manually, but in R I can only install from files already on my computer (no remote connections).

I tried to download the package zip, but when I install locally I get an error:

install.packages("P:/R/trinker-stansent-e4975e7.zip", repos = NULL, type = "win.binary")
Installing package into ‘C:/Users/john/Documents/R/win-library/3.4’
(as ‘lib’ is unspecified)
> library(`trinker-stansent-e4975e7`)
Error in library(`trinker-stansent-e4975e7`) : 
  there is no package called ‘trinker-stansent-e4975e7’

More generally, could you please tell me how to install this package properly?

Thanks!!!
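For reference, one way to install from local files (a hedged sketch; file names and paths are illustrative): the GitHub zip is a source package, not a Windows binary, so it must be installed as source, and the coreNLPsetup dependency has to be installed the same way first.

# install downloaded source tarballs locally (paths illustrative)
install.packages("P:/R/coreNLPsetup_0.0.1.tar.gz", repos = NULL, type = "source")
install.packages("P:/R/stansent_0.2.0.tar.gz",     repos = NULL, type = "source")
library(stansent)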

[stansent in Parallel?]

Dear Mr. Rinker:

First of all, I would like to express my sincere appreciation for 'stansent' as well as 'sentimentr'. Your packages make it much easier for me to do sentiment analysis. Again, I do appreciate that.

Now, I just wonder if there is any way to use stansent with Stanford CoreNLP in parallel.
So far, I have successfully run the analysis with my laptop as follows:

tagger_path <- 'INSTALLED LOCATION'
sentiment_stanford_by(REVIEW, stanford.tagger = tagger_path)

I am currently testing with a sample of 10,000 reviews to get sentiment scores, but I soon plan to increase to 500,000 reviews on AWS.

Thank you for your time.
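One possible approach (a hedged sketch, not an official stansent feature): split the reviews into chunks and score each chunk in its own R worker with the parallel package. Note that every worker launches its own Java/coreNLP process, so memory use grows with the number of workers. REVIEW and tagger_path are the objects from the message above.

library(parallel)
library(stansent)

n_workers <- 4
chunks <- split(REVIEW, cut(seq_along(REVIEW), n_workers, labels = FALSE))

cl <- makeCluster(n_workers)
clusterEvalQ(cl, library(stansent))
clusterExport(cl, "tagger_path")
res <- parLapply(cl, chunks, function(x)
    sentiment_stanford_by(x, stanford.tagger = tagger_path))
stopCluster(cl)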

sentiment_stanford on corpus?

Dear Mr. Trinker,

Is it also possible to use the sentiment_stanford and sentiment_stanford_by functions on a corpus? I have a corpus of about 25,000 comments on which I would like to perform sentiment analysis. Right now it is running, but it is taking a very long time...
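A hedged sketch of one way to do this, assuming a tm VCorpus of plain-text documents (my_corpus is illustrative): both functions expect a character vector, so extract the raw text first. With 25,000 comments the Stanford tagger will still be slow, so consider scoring a sample first.

# collapse each document to a single string, then score the vector
txt <- unlist(lapply(my_corpus, function(d) paste(as.character(d), collapse = " ")))
out <- sentiment_stanford_by(txt)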

Error with Text Highlighting in Stanford

The error that comes with the highlight function is:

Error in `[.data.table`(y, , list(sentiment = attributes(x)["averaging.function"], :
attempt to apply non-function

The highlight function previously worked fine with both stansent and sentimentr, but now it only works with sentimentr. The sentimentr package used is version 2.8.1.

error in sentiment_stanford_by()

I get an error when using sentiment_stanford_by() on rows of tweets. When I tried the same code on 1,000 rows of tweets it worked flawlessly, but when I later ran it on 9K tweets I got the error below. I did not clean the text before running the command, and the object I am working with is a data frame. What am I doing wrong here?

error

Error in `[.data.table`(sent_dat[, list(sentences = unlist(sentences)), : 
Supplied 19238 items to be assigned to 19239 items of column 'sentiment'.
 If you wish to 'recycle' the RHS please use rep() to make this intent clear to readers of your code.
checking java

checking if Java is installed.

java version "15.0.2" 2021-01-19
Java(TM) SE Runtime Environment (build 15.0.2+7-27)
Java HotSpot(TM) 64-Bit Server VM (build 15.0.2+7-27, mixed mode, sharing)

checking minimal Java version...

Java appears to be installed and at least of the minimal version.

checking if coreNLP is installed...

Stanford coreNLP appears to be installed.

...Let the NLP tagging begin!
session info

R version 4.1.1 (2021-08-10)
Platform: x86_64-w64-mingw32/x64 (64-bit)
Running under: Windows 10 x64 (build 19043)

Matrix products: default

locale:
[1] LC_COLLATE=English_United Kingdom.1252 
[2] LC_CTYPE=English_United Kingdom.1252   
[3] LC_MONETARY=English_United Kingdom.1252
[4] LC_NUMERIC=C                           
[5] LC_TIME=English_United Kingdom.1252    

attached base packages:
[1] stats     graphics  grDevices utils     datasets  methods   base     

other attached packages:
[1] dplyr_1.0.7      sentimentr_2.7.1 stansent_0.2.0   pacman_0.5.1    

loaded via a namespace (and not attached):
 [1] pillar_1.6.3       compiler_4.1.1     tools_4.1.1        digest_0.6.28     
 [5] evaluate_0.14      lifecycle_1.0.1    tibble_3.1.5       pkgconfig_2.0.3   
 [9] rlang_0.4.11       rstudioapi_0.13    DBI_1.1.1          cli_3.0.1         
[13] yaml_2.2.1         xfun_0.26          fastmap_1.1.0      coreNLPsetup_0.0.1
[17] knitr_1.36         generics_0.1.0     vctrs_0.3.8        syuzhet_1.0.6     
[21] tidyselect_1.1.1   glue_1.4.2         data.table_1.14.2  R6_2.5.1          
[25] qdapRegex_0.7.2    fansi_0.5.0        rmarkdown_2.11     lexicon_1.2.1     
[29] textclean_0.9.3    purrr_0.3.4        magrittr_2.0.1     ellipsis_0.3.2    
[33] htmltools_0.5.2    assertthat_0.2.1   textshape_1.7.3    utf8_1.2.2        
[37] stringi_1.7.5      crayon_1.4.1   
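The off-by-one in the error above (19238 items supplied for 19239) suggests that one element of the text may be scoring to nothing, for example an empty or whitespace-only tweet, or one containing stray line breaks. A hedged guess at a workaround, not a confirmed fix (tweets is an illustrative name for the text vector):

tweets <- gsub("[\r\n]+", " ", tweets)   # strip embedded line breaks
tweets <- trimws(tweets)
tweets <- tweets[nzchar(tweets)]         # drop empty elements
out    <- sentiment_stanford_by(tweets)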

Not able to install

Hello,

I have tried to install the library in multiple versions of R (3.3.0, 3.4.0, 3.5.2 and 3.6.0), both using the "auto install" and using the zip package, but unfortunately it is not working.

Can you please suggest a fix?

Below are the errors:
-AUTO:

pacman::p_load_gh("trinker/coreNLPsetup", "trinker/stansent")

  • checking for file 'C:\Users\flaviudan\AppData\Local\Temp\RtmpkF1OSL\remotes288037191d4f\trinker-coreNLPsetup-0fc06d4/DESCRIPTION' ... OK
  • preparing 'coreNLPsetup':
  • checking DESCRIPTION meta-information ... OK
  • checking for LF line-endings in source and make files and shell scripts
  • checking for empty or unneeded directories
    Removed empty directory 'coreNLPsetup/tools/coreNLPsetup_logo'
    Removed empty directory 'coreNLPsetup/tools'
  • building 'coreNLPsetup_0.0.1.tar.gz'
    Error in strptime(xx, f, tz = tz) :
    (converted from warning) unable to identify current timezone 'C':
    please set environment variable 'TZ'
    In R CMD INSTALL
    Failed with error: ‘‘stansent’ is not a valid installed package’
    Warning message:
    In pacman::p_load_gh("trinker/coreNLPsetup", "trinker/stansent") :
    Failed to install/load:
    trinker/coreNLPsetup, trinker/stansent

-MANUAL:
Nothing happens.
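Based on the strptime error in the AUTO output above, one thing to try (a hedged suggestion) is to set the TZ environment variable before retrying the install:

Sys.setenv(TZ = "UTC")   # any valid timezone name should work
if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh("trinker/coreNLPsetup", "trinker/stansent")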

Permission denied

Dear Mr. Rinker,

when I used check_stanford_installed(), I received:

"Stanford coreNLP does not appear to be installed in root.
Would you like me to try to install it there?"

I accepted and it downloaded the file. However, there seems to be an installation problem:

In file.copy(stan, file.path(root, "/"), , TRUE) :
Error with creating directory
/var/folders/k4/f38ndgfd6s7d40z7t8mmzm3m0000gn/T//Rtmp7QiEOq/stanford-corenlp-full-2017-06-09: Permission denied

I am on a MacBook Pro 2015.

Do you know how to solve this issue?
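One possible workaround (a hedged sketch): unzip a manually downloaded copy of coreNLP into a folder you own rather than the root location, and point stansent at it through the stanford.tagger argument (the path below is illustrative):

tagger_path <- "~/stanford-corenlp-full-2017-06-09"
sentiment_stanford_by(mytext, stanford.tagger = tagger_path)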

error in calling sentiment_stanford_by

I installed the stansent package successfully, but when I try to run the function sentiment_stanford_by, I get the error below:

"Error in [.data.table(sent_dat[, list(sentences = unlist(sentences)), :
RHS of assignment to new column 'sentiment' is zero length but not empty list(). For new columns the RHS must either be empty list() to create an empty list column, or, have length > 0; e.g. NA_integer_, 0L, etc."

In fact, I just tried using the same example 'mytext' as given on the GitHub page. Am I doing something wrong?
