collab-uniba / senti4sd

An emotion-polarity classifier specifically trained on developers' communication channels

Home Page: http://collab.di.uniba.it/research

License: MIT License

R 81.76% Shell 18.24%

sentiment-analysis polarity sentiment-polarity sentiment-classifier sentiment-analyser sentiment sentiment-classification emotion

senti4sd's Introduction

Senti4SD

Senti4SD is an emotion polarity classifier specifically trained to support sentiment analysis in developers' communication channels. Senti4SD is trained and evaluated on a gold standard of over 4K posts extracted from Stack Overflow. It is part of the Collab Emotion Mining Toolkit (EMTk).

Fair Use Policy

Please cite the following paper if you intend to use our tool for your own research:

Calefato, F., Lanubile, F., Maiorano, F., Novielli, N. (2018) "Sentiment Polarity Detection for Software Development," Empirical Software Engineering, 23(3), pp. 1352-1382, doi: https://doi.org/10.1007/s10664-017-9546-9.

NOTE: You will need to install the Git LFS extension to check out this project. Once it is installed and initialized, simply run:

$ git lfs clone https://github.com/collab-uniba/Senti4SD.git

How do I get set up?

To set up the tool, simply run the following script from the command line:

$ sh requirements.sh

To run the script you need:

Java 8
R

The script will also install three R packages if they are not already present.

Running

To classify your data using Senti4SD, execute the following instruction from the command line:

$ cd ClassificationTask
$ sh classificationTask.sh inputCorpus.csv outputPredictions.csv

where inputCorpus.csv is a file containing the data you want to classify, with one document per line, and outputPredictions.csv is the file where the predictions will be saved. The second parameter is optional; if it is omitted, the predictions are saved in a file called predictions.csv.

To see how the tool works, you can execute the following example:

$ cd ClassificationTask
$ sh classificationTask.sh Sample.csv

This produces a CSV file called predictions.csv.

Who do I talk to?


senti4sd's Issues

How can I run Senti4SD on Windows?

Hi, I'm trying to run Senti4SD on Windows 8.1 and I get errors.
Could you explain, step by step, how to run it on Windows?
Thank you

Cloning the repository fails because of a GitHub bandwidth issue.

If I try to clone the repo using Git LFS I get the following error:

Error downloading object: Senti4SD/ClassificationTask/Senti4SD-fast.jar batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Is it possible to upload the models somewhere else? Also, Senti4SD-fast.jar is tracked with Git LFS, so the jar itself cannot be downloaded either. This probably relates to #15 as well.

Script missing for the training task

The bash script to run the classification task is available, but a script to run the training task is missing.

Improve tool handling of very large input files

We need to rewrite our script to work around the fact that R, by default, tries to load an entire file into memory.
The easiest alternative is to use the ff library, which works with data frames containing heterogeneous data; if the data were homogeneous (e.g., a numeric matrix), the bigmemory library would also do, but that doesn't appear to be our case.
The most general solutions are instead to use Hadoop and map-reduce to split a complex task into smaller, faster subtasks [2] or, alternatively, to use a database for storing and then querying the data [3].

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii

requirements.sh reports that R is not installed, but R is working.
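
For reference, a minimal sketch of a lookup that does not depend on how a path is split into words. It is not taken from requirements.sh; `has_cmd` and `check_prereqs` are hypothetical helper names. It uses the POSIX `command -v` builtin with quoted arguments and checks `Rscript` separately from `R`, since either one can be missing from PATH on its own.

```shell
#!/bin/sh
# Hypothetical helpers, not part of requirements.sh.

# Return 0 if the named command can be resolved on PATH, 1 otherwise.
has_cmd() {
    command -v "$1" >/dev/null 2>&1
}

# Report the status of every tool passed as an argument.
# Returns 1 if at least one of them is missing.
check_prereqs() {
    missing=0
    for tool in "$@"; do
        if has_cmd "$tool"; then
            echo "$tool is installed"
        else
            echo "$tool was not found on PATH" >&2
            missing=1
        fi
    done
    return "$missing"
}

# Example: check_prereqs java R Rscript
```

If `R` is found but `Rscript` is not, adding the R `bin` directory to PATH is one thing to try.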

Input file and output file row counts don't match

testinput.xlsx
testoutput.xlsx

The CSV versions of these files were my input and output. The input file has 1826 rows but the output file has 1829 rows, and I have no way to tell which output row belongs to which input row. I just followed the procedure explained in the documentation. Can you please tell me what the problem is?

Furthermore, I don't think it's a good design that the output file contains the labels without the associated comments or other information. The t0, t1 values don't help me at all, and I am not sure what they mean.

Could you please address this problem quickly? I am trying to use this impressive tool in my research and need to run it on over 1 million texts. I am short on time. If this problem persists, I cannot proceed.
@fedemaiorano
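
A quick way to confirm such a mismatch before opening the files in a spreadsheet is to compare raw line counts. The sketch below is only a diagnostic with hypothetical helper names. It assumes plain-text CSV files, and since this page does not say whether predictions.csv carries a header row, a difference of exactly one line may be expected. A quoted field containing a line break is counted as several lines, which is one possible (unconfirmed) source of extra rows.

```shell
#!/bin/sh
# Hypothetical diagnostic helpers.

# Print the number of newline-terminated lines in a file.
count_lines() {
    wc -l < "$1" | tr -d ' '
}

# Print both counts; succeed only when they are equal.
same_line_count() {
    in_count=$(count_lines "$1")
    out_count=$(count_lines "$2")
    echo "input: $in_count lines, output: $out_count lines"
    [ "$in_count" -eq "$out_count" ]
}

# Example: same_line_count inputCorpus.csv predictions.csv
```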

Creating a small example

As suggested, I created an input.csv file and tried to run the tool to get the predictions.csv results, but I get this:

$ sh requirements.sh
Java is installed
which: no C:Program in (/c/Users/Mary/bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/c/Users/Mary/bin:/c/Perl64/site/bin:/c/Perl64/bin:/c/ProgramData/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files/Java/jre1.8.0_151/bin:/c/Program Files/Java/jre1.8.0_151:/c/WINDOWS/System32/OpenSSH:/c/Program Files (x86)/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/WiFi/bin:/c/Program Files/Common Files/Intel/WirelessCommon:/c/Program Files/Git LFS:/cmd:/mingw64/bin:/usr/bin:/c/Program Files/R/R-3.5.2:/c/Program Files/R/R-3.5.2/bin:/c/Program Files/R/R-3.5.2/bin/Rscript.exe:/c/Users/Mary/AppData/Local/Microsoft/WindowsApps:/c/Program Files/Docker Toolbox:/usr/bin/vendor_perl:/usr/bin/core_perl)
which: no FilesRR-3.5.2binR in (/c/Users/Mary/bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/c/Users/Mary/bin:/c/Perl64/site/bin:/c/Perl64/bin:/c/ProgramData/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files/Java/jre1.8.0_151/bin:/c/Program Files/Java/jre1.8.0_151:/c/WINDOWS/System32/OpenSSH:/c/Program Files (x86)/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/WiFi/bin:/c/Program Files/Common Files/Intel/WirelessCommon:/c/Program Files/Git LFS:/cmd:/mingw64/bin:/usr/bin:/c/Program Files/R/R-3.5.2:/c/Program Files/R/R-3.5.2/bin:/c/Program Files/R/R-3.5.2/bin/Rscript.exe:/c/Users/Mary/AppData/Local/Microsoft/WindowsApps:/c/Program Files/Docker Toolbox:/usr/bin/vendor_perl:/usr/bin/core_perl)
R is installed
Warning in install.packages(c("caret"), dependencies = c("Imports", "Depends"), :
'lib = "C:/Program Files/R/R-3.5.2/library"' is not writable
Error in install.packages(c("caret"), dependencies = c("Imports", "Depends"), :
unable to install packages
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'caret'
Execution halted

Any idea what I missed?
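
The "is not writable" warning in the log refers to the system-wide R library. One possible workaround, sketched here under assumptions and not tested on Windows, is to point R at a per-user library before installing packages. `R_LIBS_USER` is a standard R environment variable; the helper name and the default directory are arbitrary choices for illustration.

```shell
#!/bin/sh
# Hypothetical helper: create a per-user R library directory and export
# R_LIBS_USER so that R adds it to its library paths on the next start.
use_user_r_lib() {
    lib_dir=${1:-"$HOME/R/senti4sd-library"}   # illustrative default
    mkdir -p "$lib_dir" || return 1
    R_LIBS_USER=$lib_dir
    export R_LIBS_USER
    echo "R_LIBS_USER=$R_LIBS_USER"
}

# Example (requires Rscript on PATH):
#   use_user_r_lib
#   Rscript -e 'install.packages("caret", lib = Sys.getenv("R_LIBS_USER"), repos = "https://cloud.r-project.org")'
```

Running the shell with administrator rights would be another option, at the cost of installing packages system-wide.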

Documentation for Senti4SD-fast.jar

I'm trying to use Senti4SD on a large dataset (~100M lines of text) and would like to instrument most of it from R to improve performance. In particular, I'm trying to avoid the creation of the large CSV file containing the features.

For that, I want to run Senti4SD on chunks of the data. However, this considerably slows down the whole process because each time the script is called, Senti4SD-fast.jar needs to reload dsm.bin. To overcome that problem, I want to use rJava to load the JVM from R itself, load the dsm.bin and run the feature extraction on chunks without storing the result in a file.

Is there any documentation available that would allow me to call the feature extraction through rJava without creating files?

Exception handling in R scripts for input arguments

Print an error message when a file is not found
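
The request concerns the R scripts, but the same guard can be illustrated at the shell-wrapper level. This is a minimal sketch with a hypothetical function name, not code from the repository:

```shell
#!/bin/sh
# Hypothetical guard: print a message on stderr and return non-zero
# when the given path is not a readable regular file.
require_file() {
    if [ ! -f "$1" ] || [ ! -r "$1" ]; then
        echo "Error: input file '$1' not found or not readable" >&2
        return 1
    fi
}

# Example use at the top of a wrapper script:
#   require_file "$1" || exit 1
```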

Futures timed out after [24 hours]

I have a large file with almost 200k lines. When I run Senti4SD on it, it takes more than 24 hours and then displays the error message "Futures timed out after [24 hours]".
Could you please help me solve this problem?
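
One workaround to try, sketched here under assumptions, is to split the corpus into smaller files and classify them one at a time. The sketch relies on the README's one-document-per-line input format and on the documented two-argument form of classificationTask.sh. The function names, chunk size, and output file names are hypothetical. Chunks are processed sequentially because a log in another issue shows the script working on a fixed extractedFeatures.csv file. As noted in the Senti4SD-fast.jar issue, every run reloads the model, so very small chunks will be slow.

```shell
#!/bin/sh
# Split a one-document-per-line corpus ($1) into files of at most
# $2 lines each, written to directory $3 as chunk_aa, chunk_ab, ...
split_corpus() {
    input=$1
    lines_per_chunk=$2
    out_dir=$3
    mkdir -p "$out_dir" || return 1
    split -l "$lines_per_chunk" "$input" "$out_dir/chunk_"
}

# Classify every chunk in turn. Run this from ClassificationTask,
# as in the README. Output names (pred_<chunk>.csv) are an assumption.
classify_chunks() {
    out_dir=$1
    for chunk in "$out_dir"/chunk_*; do
        name=$(basename "$chunk")
        sh classificationTask.sh "$chunk" "$out_dir/pred_$name.csv" || return 1
    done
}

# Example:
#   split_corpus inputCorpus.csv 10000 chunks
#   classify_chunks chunks
```

How the per-chunk predictions should be merged depends on the output format, which is not documented on this page.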

Different sentiment results

Hello! @bateman
I ran Senti4SD on the statements given in appendix section A, table B, of your paper titled "Sentiment Polarity Detection for Software Development". I found that the tool assigns the sentiment 'Neutral' to statements that are labeled 'Negative' in Table B.
What could be the reason? Why does it mislabel statements on which it was trained?

Title of the EMSE paper seems to be missing

Please add "Sentiment Polarity Detection for Software Development" 😃

Error: Invalid or corrupt jarfile Senti4SD-fast.jar

I ran the following commands:
% sh /Users/apple/Senti4SD/ClassificationTask/requirements.sh
Java is installed
R is installed
/Users/apple/Senti4SD/ClassificationTask/requirements.sh: line 7: Rscript: command not found

% sh /Users/apple/Senti4SD/ClassificationTask/classificationTask.sh /Users/apple/Senti4SD/ClassificationTask/Sample.csv
Error: Invalid or corrupt jarfile /Users/apple/Senti4SD/ClassificationTask/Senti4SD-fast.jar
/Users/apple/Senti4SD/ClassificationTask/classificationTask.sh: line 33: Rscript: command not found
rm: /Users/apple/Senti4SD/ClassificationTask/extractedFeatures.csv: No such file or directory
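
The report does not say how the repository was obtained. One possible explanation is that Senti4SD-fast.jar on disk is a Git LFS pointer file, meaning a small text stub left behind when the LFS download did not happen, and not the real archive. Java would reject such a file as corrupt. The sketch below checks for the pointer signature line defined by the Git LFS pointer format; the function name is hypothetical.

```shell
#!/bin/sh
# Return 0 when the file starts with the Git LFS pointer signature,
# i.e. it is a text stub and not the real binary.
is_lfs_pointer() {
    # Strip NUL bytes so a real binary does not upset the shell.
    first_line=$(head -n 1 "$1" 2>/dev/null | tr -d '\000')
    case $first_line in
        "version https://git-lfs.github.com/spec/v1"*) return 0 ;;
        *) return 1 ;;
    esac
}

# Example:
#   is_lfs_pointer Senti4SD-fast.jar && echo "pointer file, not the real jar"
```

If the file turns out to be a pointer, running `git lfs pull` inside the clone is the usual way to fetch the real content, provided the repository's LFS quota allows it. The `Rscript: command not found` lines are a separate problem: Rscript is not on PATH.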