collab-uniba / senti4sd

An emotion-polarity classifier specifically trained on developers' communication channels

Home Page: http://collab.di.uniba.it/research

License: MIT License

R 81.76% Shell 18.24%
sentiment-analysis polarity sentiment-polarity sentiment-classifier sentiment-analyser sentiment sentiment-classification emotion

senti4sd's Introduction

Senti4SD

Senti4SD is an emotion polarity classifier specifically trained to support sentiment analysis in developers' communication channels. Senti4SD is trained and evaluated on a gold standard of over 4K posts extracted from Stack Overflow. It is part of the Collab Emotion Mining Toolkit (EMTk).

Fair Use Policy

Please cite the following paper if you intend to use our tool for your own research:

Calefato, F., Lanubile, F., Maiorano, F., Novielli, N. (2018) "Sentiment Polarity Detection for Software Development," Empirical Software Engineering, 23(3), pp. 1352-1382, doi: https://doi.org/10.1007/s10664-017-9546-9. (BibTeX)

NOTE: You will need to install the Git LFS extension to check out this project. Once it is installed and initialized, simply run:

$ git lfs clone https://github.com/collab-uniba/Senti4SD.git

How do I get set up?

To set up the tool, simply run the following script from the command line:

$ sh requirements.sh

To run the script you need:

  • Java 8
  • R

The script will also install, if not already present, three R packages:

Running

To classify your data using Senti4SD, execute the following instruction from the command line:

$ cd ClassificationTask
$ sh classificationTask.sh inputCorpus.csv outputPredictions.csv

where inputCorpus.csv is the file containing the data you want to classify, with one document per line, and outputPredictions.csv is the file where the predictions will be saved. The second parameter is optional: if it is omitted, the output of the classification is saved in a file called predictions.csv.
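As a minimal sketch of the input format described above, the following builds a three-document corpus and passes it to the script. The file names and the sentences are made up for illustration, and the sketch assumes that plain text lines without a header are acceptable input, which the text above suggests but does not spell out.

```shell
# Work in a scratch directory so the sketch does not touch the repository.
work_dir=$(mktemp -d)
corpus="$work_dir/my_corpus.csv"   # hypothetical file name

# One document per line, as the usage text above describes.
printf '%s\n' \
  "Thanks for the patch. It fixed my build!" \
  "This API is terrible and the docs are useless." \
  "The method returns a list of strings." > "$corpus"

# Count the documents that will be classified.
doc_count=$(wc -l < "$corpus" | tr -d ' ')
echo "documents: $doc_count"

# Run the classifier only when the script is in the current directory
# (that is, after changing into ClassificationTask).
if [ -f classificationTask.sh ]; then
  sh classificationTask.sh "$corpus" "$work_dir/my_predictions.csv"
fi
```

Counting the input lines up front gives a number to compare against the row count of the predictions file afterwards.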

To see how the tool works, you can execute the following example:

$ cd ClassificationTask
$ sh classificationTask.sh Sample.csv

This produces a CSV file called predictions.csv as output.

Who do I talk to?

senti4sd's People

Contributors

bateman, ciccio86, fedemaiorano, lanubile, nnovielli

senti4sd's Issues

Cloning the repository fails because of a GitHub bandwidth issue.

If I try to clone the repo using Git LFS I get the following error:

Error downloading object: Senti4SD/ClassificationTask/Senti4SD-fast.jar batch response: This repository is over its data quota. Account responsible for LFS bandwidth should purchase more data packs to restore access.

Would it be possible to upload the models somewhere else? In addition, Senti4SD-fast.jar is also tracked with Git LFS, so the jar itself cannot be downloaded either. This probably relates to #15 as well.

Improve tool handling of very large input files

We need to rewrite our script to work around the fact that R, by default, tries to load an entire file into memory.
The easiest alternative is to use the ff library, which works with data frames containing heterogeneous data; if the data are homogeneous (e.g., a numeric matrix), then the bigmemory library would also do, but this doesn't appear to be our case.
The most general solutions are instead to use Hadoop and map-reduce to parallelize the complex task into smaller, faster subtasks [2] or, alternatively, to leverage a database for storing and then querying the data [3].

[1] https://rpubs.com/msundar/large_data_analysis
[2] http://www.bytemining.com/2010/08/taking-r-to-the-limit-part-ii-large-datasets-in-r/
[3] https://www.datasciencecentral.com/profiles/blogs/postgresql-monetdb-and-too-big-for-memory-data-in-r-part-ii
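A simpler workaround than the options listed in this issue, and not one the issue itself proposes, is to split the input into chunks before classification so that no single run has to hold the whole file. The sketch below assumes that every document occupies exactly one physical line and that classificationTask.sh accepts arbitrary input and output paths; the function name, chunk size, and file names are illustrative.

```shell
# Split a one-document-per-line corpus into chunks of at most $2 lines.
# $1 = input file, $2 = lines per chunk, $3 = prefix for the chunk files
split_corpus() {
  split -l "$2" "$1" "$3"
}

# Demo on a generated 10-line corpus (placeholder documents).
demo_dir=$(mktemp -d)
awk 'BEGIN { for (i = 1; i <= 10; i++) print "document " i }' > "$demo_dir/corpus.csv"
split_corpus "$demo_dir/corpus.csv" 4 "$demo_dir/chunk_"

chunk_count=$(ls "$demo_dir"/chunk_* | wc -l | tr -d ' ')
total_lines=$(cat "$demo_dir"/chunk_* | wc -l | tr -d ' ')
echo "chunks: $chunk_count, lines after splitting: $total_lines"

# Classify each chunk on its own; every chunk gets its own predictions file.
for chunk in "$demo_dir"/chunk_*; do
  if [ -f classificationTask.sh ]; then
    sh classificationTask.sh "$chunk" "$chunk.predictions.csv"
  fi
done
```

The per-chunk prediction files are kept separate here on purpose: merging them would need to account for any header row and for row identifiers that restart in each file.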

Input file and output file row counts don't match

testinput.xlsx
testoutput.xlsx

The CSV versions of these files were my input and output. The input file has 1826 rows, but the output file has 1829 rows, and I have no way to tell which output row belongs to which input row. I just followed the procedure explained in the documentation. Can you please tell me what the problem is?

Furthermore, I don't think it is good design that the output file does not include the associated comments or other information along with the labels. The t0, t1 identifiers don't help me; I am not sure what they mean.

Could you address this problem quickly? I am trying to use this impressive tool in my research, and I need to run it on over 1 million texts. I am short on time. If this problem persists, I cannot proceed.
@fedemaiorano
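The cause of the mismatch is not established in this issue. One possibility worth ruling out is that some documents span several physical lines, for example a quoted field with an embedded line break, since the tool expects one document per line. The sketch below is a heuristic check under that assumption: it lists lines with an odd number of double quotes. The function name and the demo data are made up.

```shell
# Print "line_number: text" for every line with an odd number of double quotes.
# With '"' as the field separator, a line with k quotes has k + 1 fields,
# so an even field count means an odd number of quotes.
odd_quote_lines() {
  awk -F'"' 'NF > 0 && NF % 2 == 0 { print NR ": " $0 }' "$1"
}

# Demo: four physical lines, but the quoted document spans two of them.
demo=$(mktemp)
printf '%s\n' \
  'plain document' \
  '"quoted document that continues' \
  'on the next line"' \
  'another plain document' > "$demo"

physical_lines=$(wc -l < "$demo" | tr -d ' ')
suspects=$(odd_quote_lines "$demo" | wc -l | tr -d ' ')
echo "physical lines: $physical_lines, lines with unbalanced quotes: $suspects"
```

Running the same function on the real input file and comparing `wc -l` for the input and the predictions file shows whether the extra rows line up with such multi-line documents.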

Creating a small example

As suggested, I created an input.csv file and tried to run the tool to get the prediction.csv results, but I received this:


$ sh requirements.sh
Java is installed
which: no C:Program in (/c/Users/Mary/bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/c/Users/Mary/bin:/c/Perl64/site/bin:/c/Perl64/bin:/c/ProgramData/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files/Java/jre1.8.0_151/bin:/c/Program Files/Java/jre1.8.0_151:/c/WINDOWS/System32/OpenSSH:/c/Program Files (x86)/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/WiFi/bin:/c/Program Files/Common Files/Intel/WirelessCommon:/c/Program Files/Git LFS:/cmd:/mingw64/bin:/usr/bin:/c/Program Files/R/R-3.5.2:/c/Program Files/R/R-3.5.2/bin:/c/Program Files/R/R-3.5.2/bin/Rscript.exe:/c/Users/Mary/AppData/Local/Microsoft/WindowsApps:/c/Program Files/Docker Toolbox:/usr/bin/vendor_perl:/usr/bin/core_perl)
which: no FilesRR-3.5.2binR in (/c/Users/Mary/bin:/mingw64/bin:/usr/local/bin:/usr/bin:/bin:/mingw64/bin:/usr/bin:/c/Users/Mary/bin:/c/Perl64/site/bin:/c/Perl64/bin:/c/ProgramData/Oracle/Java/javapath:/c/WINDOWS/system32:/c/WINDOWS:/c/WINDOWS/System32/Wbem:/c/WINDOWS/System32/WindowsPowerShell/v1.0:/c/Program Files/Java/jre1.8.0_151/bin:/c/Program Files/Java/jre1.8.0_151:/c/WINDOWS/System32/OpenSSH:/c/Program Files (x86)/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/Intel(R) Management Engine Components/DAL:/c/Program Files/Intel/WiFi/bin:/c/Program Files/Common Files/Intel/WirelessCommon:/c/Program Files/Git LFS:/cmd:/mingw64/bin:/usr/bin:/c/Program Files/R/R-3.5.2:/c/Program Files/R/R-3.5.2/bin:/c/Program Files/R/R-3.5.2/bin/Rscript.exe:/c/Users/Mary/AppData/Local/Microsoft/WindowsApps:/c/Program Files/Docker Toolbox:/usr/bin/vendor_perl:/usr/bin/core_perl)
R is installed
Warning in install.packages(c("caret"), dependencies = c("Imports", "Depends"), :
'lib = "C:/Program Files/R/R-3.5.2/library"' is not writable
Error in install.packages(c("caret"), dependencies = c("Imports", "Depends"), :
unable to install packages
In addition: Warning message:
In library(package, lib.loc = lib.loc, character.only = TRUE, logical.return = TRUE, :
there is no package called 'caret'
Execution halted


Any idea what I missed?
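The log above reports that the R library under C:/Program Files is not writable, so the caret package cannot be installed there. One common workaround, sketched here, is to point R at a per-user library through the standard R_LIBS_USER environment variable before running the setup script. This assumes that requirements.sh installs into R's default library and is started from the same shell; the function name and the directory are illustrative.

```shell
# Create a library directory and expose it to R through R_LIBS_USER.
# $1 = directory to use as the per-user R library
use_r_library() {
  mkdir -p "$1" && [ -w "$1" ] && export R_LIBS_USER="$1"
}

# Demo with a scratch directory; a real setup would pass something like
# "$HOME/R/library" (an illustrative path, not one required by Senti4SD).
demo_lib="$(mktemp -d)/R/library"
if use_r_library "$demo_lib"; then
  echo "R_LIBS_USER=$R_LIBS_USER"
fi
```

R only uses the directory named by R_LIBS_USER if it already exists, which is why the function creates it first. Running the shell with administrator rights is an alternative that leaves the system library in use.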

Documentation for Senti4SD-fast.jar

I'm trying to use Senti4SD on a large dataset (~100M lines of text) and would like to instrument most of it from R to improve performance. In particular, I'm trying to avoid the creation of the large CSV file containing the features.

For that, I want to run Senti4SD on chunks of the data. However, this considerably slows down the whole process because each time the script is called, Senti4SD-fast.jar needs to reload dsm.bin. To overcome that problem, I want to use rJava to load the JVM from R itself, load the dsm.bin and run the feature extraction on chunks without storing the result in a file.

Is there any documentation available that would allow me to call the feature extraction from rJava without creating files?

Futures timed out after [24 hours]

I have a large file with almost 200k lines. When I run Senti4SD on it, it runs for more than 24 hours and then displays the error message "Futures timed out after [24 hours]".
Could you please help me solve this problem?

Different sentiment results

Hello! @bateman
I ran Senti4SD on the statements given in appendix section A, table B, of your paper titled "Sentiment Polarity Detection for Software Development". I found that the tool assigns the sentiment 'Neutral' to statements that are labeled 'Negative' in Table B.
What could be the reason? Why does it mislabel statements on which it was trained?

Error: Invalid or corrupt jarfile 4SD-fast.jar

The commands I ran and their output are as follows:
% sh /Users/apple/Senti4SD/ClassificationTask/requirements.sh
Java is installed
R is installed
/Users/apple/Senti4SD/ClassificationTask/requirements.sh: line 7: Rscript: command not found

% sh /Users/apple/Senti4SD/ClassificationTask/classificationTask.sh /Users/apple/Senti4SD/ClassificationTask/Sample.csv
Error: Invalid or corrupt jarfile /Users/apple/Senti4SD/ClassificationTask/Senti4SD-fast.jar
/Users/apple/Senti4SD/ClassificationTask/classificationTask.sh: line 33: Rscript: command not found
rm: /Users/apple/Senti4SD/ClassificationTask/extractedFeatures.csv: No such file or directory
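"Invalid or corrupt jarfile" can have several causes. One that fits the Git LFS note in the README is a clone made without LFS content, which leaves a small text pointer in place of Senti4SD-fast.jar. The sketch below checks for that case; it is a diagnostic under that assumption, not a confirmed explanation of this report, and the function name and demo files are made up.

```shell
# Succeeds when $1 looks like a Git LFS pointer instead of the real file.
is_lfs_pointer() {
  head -c 200 "$1" | grep -q '^version https://git-lfs\.github\.com/spec/'
}

demo_dir=$(mktemp -d)
# A pointer as Git LFS writes it; the oid and size are made-up placeholders.
printf 'version https://git-lfs.github.com/spec/v1\noid sha256:0000\nsize 12345\n' \
  > "$demo_dir/pointer.jar"
# Real jars are ZIP archives, which begin with the bytes "PK".
printf 'PK\003\004 stand-in for archive bytes' > "$demo_dir/real.jar"

for jar in "$demo_dir/pointer.jar" "$demo_dir/real.jar"; do
  if is_lfs_pointer "$jar"; then
    echo "$jar: LFS pointer, the real content was not downloaded"
  else
    echo "$jar: not an LFS pointer"
  fi
done
```

If the check flags ClassificationTask/Senti4SD-fast.jar, running `git lfs pull` inside the clone fetches the real file, provided the repository's LFS quota allows it. The separate "Rscript: command not found" lines indicate that Rscript is not on the PATH, which has to be fixed independently.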
