abdulfatir / twitter-sentiment-analysis

Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc.

License: MIT License

Python 100.00%
cnn deeplearning keras lstm machine-learning python sentiment-analysis sentiment-classification

twitter-sentiment-analysis's Introduction

Sentiment Analysis on Tweets

Update (21 Sept. 2018): I don't actively maintain this repository. This work was done for a course project, and the dataset cannot be released because I don't own the copyright. However, everything in this repository can be easily modified to work with other datasets. I recommend reading the sloppily written project report, which can be found in docs/.

Dataset Information

We compare several methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a CSV file with rows of the form tweet_id,sentiment,tweet, where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in "". Similarly, the test dataset is a CSV file with rows of the form tweet_id,tweet. CSV headers are not expected and should be removed from both the training and test datasets.
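As a rough illustration of the expected row layout, the sketch below reads headerless rows with Python's standard csv module. It is not the parser used in preprocess.py; the function name read_tweets, the has_labels flag, and the sample rows are assumptions made for this example.

```python
import csv
import io


def read_tweets(handle, has_labels=True):
    """Yield (tweet_id, sentiment, tweet) tuples from a headerless CSV.

    For test data (has_labels=False) the sentiment slot is None.
    """
    for row in csv.reader(handle):
        if not row:
            continue  # skip blank lines
        if has_labels:
            tweet_id, sentiment, tweet = row[0], int(row[1]), ",".join(row[2:])
        else:
            tweet_id, sentiment, tweet = row[0], None, ",".join(row[1:])
        yield int(tweet_id), sentiment, tweet


# Hypothetical sample rows in the tweet_id,sentiment,tweet layout.
sample_train = io.StringIO('1,1,"good day, everyone"\n2,0,"so tired"\n')
rows = list(read_tweets(sample_train))
print(rows)
```

Because the tweet is quoted, a comma inside the tweet text does not split the row into extra columns.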

Requirements

There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.

  • numpy
  • scikit-learn
  • scipy
  • nltk

The library requirements specific to some methods are:

  • keras with TensorFlow backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
  • xgboost for XGBoost.

Note: It is recommended to use the Anaconda distribution of Python.

Usage

Preprocessing

  1. Run preprocess.py <raw-csv-path> on both train and test data. This will generate a preprocessed version of the dataset.
  2. Run stats.py <preprocessed-csv-path> where <preprocessed-csv-path> is the path of the csv generated by preprocess.py. This prints general statistical information about the dataset and generates two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.

After the above steps, you should have four files in total: <preprocessed-train-csv>, <preprocessed-test-csv>, <freqdist>, and <freqdist-bi> which are preprocessed train dataset, preprocessed test dataset, frequency distribution of unigrams and frequency distribution of bigrams respectively.
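To make the two frequency-distribution files more concrete, here is a minimal sketch of counting unigrams and bigrams and round-tripping the counts through pickle. It uses collections.Counter rather than whatever stats.py uses internally, and the helper name ngram_counts and the sample tweets are assumptions for illustration only.

```python
import pickle
from collections import Counter


def ngram_counts(tweets, n=1):
    """Count word n-grams over an iterable of whitespace-tokenised tweets."""
    counts = Counter()
    for tweet in tweets:
        words = tweet.split()
        for i in range(len(words) - n + 1):
            # Unigrams are keyed by the word itself, longer n-grams by a tuple.
            gram = words[i] if n == 1 else tuple(words[i:i + n])
            counts[gram] += 1
    return counts


# Hypothetical preprocessed tweets.
tweets = ["i love this", "i love that"]
unigrams = ngram_counts(tweets, 1)
bigrams = ngram_counts(tweets, 2)

# Pickle round trip, analogous to saving and later loading a freqdist file.
restored = pickle.loads(pickle.dumps(unigrams))
print(unigrams.most_common(2), bigrams.most_common(1))
```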

For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE in the respective files to your own paths. Wherever applicable, the values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using the different feature types described in the report.

Baseline

  1. Run baseline.py. With TRAIN = True, it shows the accuracy on the training dataset.

Naive Bayes

  1. Run naivebayes.py. With TRAIN = True, it shows the accuracy on a 10% validation split.

Maximum Entropy

  1. Run logistic.py to run the logistic regression model, OR run maxent-nltk.py <> to run the MaxEnt model of NLTK. With TRAIN = True, it shows the accuracy on a 10% validation split.

Decision Tree

  1. Run decisiontree.py. With TRAIN = True, it shows the accuracy on a 10% validation split.

Random Forest

  1. Run randomforest.py. With TRAIN = True, it shows the accuracy on a 10% validation split.

XGBoost

  1. Run xgboost.py. With TRAIN = True, it shows the accuracy on a 10% validation split.

SVM

  1. Run svm.py. With TRAIN = True, it shows the accuracy on a 10% validation split.

Multi-Layer Perceptron

  1. Run neuralnet.py. This validates on 10% of the data and saves the best model to best_mlp_model.h5.

Recurrent Neural Networks

  1. Run lstm.py. This validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running lstm.py.)
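Since the scripts expect ./models/ to exist already, one way to avoid a "no such file or directory" failure is to create the directory before training. This is a small sketch, not part of the repository; the helper name ensure_dir is an assumption.

```python
import os


def ensure_dir(path):
    """Create the directory (and any missing parents) if it does not exist."""
    os.makedirs(path, exist_ok=True)  # no error if the directory is already there
    return path


if __name__ == "__main__":
    # Create the checkpoint directory the training scripts write to.
    ensure_dir("./models")
```

Running this once from the directory where lstm.py or cnn.py is launched should be enough, because exist_ok=True makes repeated calls harmless.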

Convolutional Neural Networks

  1. Run cnn.py. This runs the 4-Conv-NN model (a neural network with 4 conv layers) described in the report. To run other versions of the CNN, comment out or remove the lines where Conv layers are added. This validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running cnn.py.)

Majority Vote Ensemble

  1. To extract penultimate layer features for the training dataset, run extract-cnn-feats.py <saved-model>. This will generate 3 files, train-feats.npy, train-labels.txt and test-feats.npy.
  2. Run cnn-feats-svm.py which uses files from the previous step to perform SVM classification on features extracted from CNN model.
  3. Place all prediction CSV files for which you want to take majority vote in ./results/ and run majority-voting.py. This will generate majority-voting.csv.
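The voting step can be sketched as follows. This is an illustrative version only, not the logic of majority-voting.py: it assumes each model's predictions have already been loaded into a dict mapping tweet id to a 0/1 label, and the name majority_vote and the sample predictions are hypothetical.

```python
from collections import Counter


def majority_vote(predictions):
    """Combine per-model predictions by taking the most common label per tweet.

    predictions: list of dicts, one per model, each mapping tweet_id -> label.
    All dicts are assumed to cover the same tweet ids.
    """
    voted = {}
    for tweet_id in predictions[0]:
        votes = Counter(model[tweet_id] for model in predictions)
        voted[tweet_id] = votes.most_common(1)[0][0]
    return voted


# Hypothetical predictions from three models for three tweets.
model_a = {1: 1, 2: 0, 3: 1}
model_b = {1: 1, 2: 1, 3: 0}
model_c = {1: 0, 2: 0, 3: 1}
final = majority_vote([model_a, model_b, model_c])
print(final)
```

With binary labels, using an odd number of models avoids ties; with an even number, the tie-breaking rule would need to be chosen explicitly.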

Information about other files

  • dataset/positive-words.txt: List of positive words.
  • dataset/negative-words.txt: List of negative words.
  • dataset/glove-seeds.txt: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings.
  • Plots.ipynb: IPython notebook used to generate the plots in the report.

twitter-sentiment-analysis's Issues

Chinese text

Can this cnn.py be used for Sentiment Analysis of Chinese Text after processing the text by Word embedding?
THANKS A LOT.

getting issue after running stats.py file

File "stats.py", line 63, in <module>
t_id, if_pos, tweet = line.strip().split(',')
ValueError: not enough values to unpack (expected 3, got 2)

I am getting this error, if_pos value isnt calculated with the preprocess.py file, in test.csv, there are only 2 columns, how should I get it to 3 as the error wants.?
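Judging from the traceback, stats.py unpacks three comma-separated fields per line, which matches the training layout (tweet_id,sentiment,tweet) but not the two-column test layout. A likely cause is therefore that stats.py was run on the preprocessed test file; the README describes it as producing frequency distributions for the training dataset. The sketch below is not code from the repository. It only illustrates how the two layouts differ, and parse_processed_line and has_labels are hypothetical names.

```python
def parse_processed_line(line, has_labels=True):
    """Split one preprocessed CSV line into (tweet_id, sentiment, tweet).

    Training lines have three fields; test lines have two, so their
    sentiment is returned as None.
    """
    expected = 3 if has_labels else 2
    # Limit the number of splits so commas inside the tweet stay in the tweet.
    parts = line.strip().split(',', expected - 1)
    if len(parts) != expected:
        raise ValueError(
            f"expected {expected} columns, got {len(parts)}: {line!r}"
        )
    if has_labels:
        tweet_id, sentiment, tweet = parts
        return tweet_id, int(sentiment), tweet
    tweet_id, tweet = parts
    return tweet_id, None, tweet


train_row = parse_processed_line("7,1,good day\n")
test_row = parse_processed_line("8,so tired\n", has_labels=False)
print(train_row, test_row)
```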

Need GPU??

Can this be done in a normal laptops with basic 4gb ram 64x? Without GPU

problem in majority voting.py

majority voting.py is giving an array how to get the accuracy percentage for the same can you please help me thanks in advance

preprocesser.py

I am unable to follow this code.Run preprocesser.py but no file test and train dataset is generated.
error

(base) c:\Twitter\twitter-sentiment-analysis-all algos (1)\twitter-sentiment-analysis-master>python preprocess.py C:\Twitter\twitter-sentiment-analysis-all algos (1)\twitter-sentiment-analysis-master\processed
Usage: python preprocess.py

Custom Dataset

I have twitter data set to classify in 3 classes. I have tweet text, ID, class. What are the changes needed in preprocess.py file thanks in advance

problem in first step

i am just a beginner so sorry if i am asking a very basic question..
i have my training and test data sets saved in csv format....in your first step it says run preprocess.py on training and test dataset...how to do that? does that mean i have to specify the path of csv files somewhere within the code(preprocess.py)? because i am not able to find that,,please help..thank you:))

Arabic Dialect DS

How can get Arabic Data Set for Facebook posts especially Syrian Dialect?!

Permission to read file denied in preprocessing step.

I am getting PermissionError while running the command preprocess.py <raw-csv-path>.
Can someone help me with it?

Traceback (most recent call last):
  File "preprocess.py", line 107, in <module>
    preprocess_csv(csv_file_name, processed_file_name, test_file=False)
  File "preprocess.py", line 74, in preprocess_csv
    with open(csv_file_name, 'r') as csv:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\kashi\\OneDrive\\Desktop\\twitter-sentiment-analysis-master\\dataset'

Dataset

Hey man, can u plz upload the dataset

data set

please share me the URL for data set

ValueError: invalid literal for int() with base 10: 'label'

Preprocessing still doesn't happen since there is an error in this line:
positive = int(line[:line.find(',')])

Here is the warning I get: ValueError: invalid literal for int() with base 10: 'label'

I looked up how to fix this and it looks like the string cannot be converted to int, so I tried int(float(...)) but didn't work

Any ideas?
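The README states that CSV headers are not expected, and the value 'label' in the error message suggests the input file still contains a header row. A minimal sketch of dropping such a row before preprocessing is shown below. It is not part of the repository, strip_header is a hypothetical helper, and it assumes that a first field which is not an integer marks a header.

```python
def strip_header(lines):
    """Return the lines without a leading header row, if one is present.

    A row is treated as a header when its first field is not an integer,
    since data rows start with an integer tweet_id.
    """
    lines = list(lines)
    if lines:
        first_field = lines[0].split(',', 1)[0].strip()
        if not first_field.lstrip('-').isdigit():
            return lines[1:]
    return lines


with_header = ["id,label,tweet\n", '1,0,"hi"\n']
cleaned = strip_header(with_header)
print(cleaned)
```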

Some Questions

Can you please give the link of the dataset used?
Also, don't you think that, apart from the positive and negative sentiment labels, a neutral label should also be applied to the sentiment classification problem?

Looking to measure sentiments of every tweet

I set up your repository and retrieved the results using python baseline.py TRAIN = True

I had to enter my own dataset to train the bot where i had to add sentiments manually as 0 or 1 ... but when i added a file to test which included tweet_id and tweet, and run the above command, then the result i got was Correct = 100.00%

Well I am looking for a code where the sentiments will be predicted as positive or negative. how do I go about it.?

REGARDING COLUMNS IN DATA SET

Respected sir, can you please tell me the dataset columns., only the column names so i can download same kind of data set for further work. Thankyou

seeds file and arabic sentiment analysis

Hello, i'm working on Arabic sentiment analysis, and i'm wondering if these codes will work for me ? and i'm looking from where you got seeds file ? and how can i make one ?
and from where you get this files :

FREQ_DIST_FILE = '../train-processed-freqdist.pkl'
BI_FREQ_DIST_FILE = '../train-processed-freqdist-bi.pkl'
TRAIN_PROCESSED_FILE = '../train-processed.csv'
TEST_PROCESSED_FILE = '../test-processed.csv'
GLOVE_FILE = './dataset/glove-seeds.txt'

error in xgboost file

hello sir,
sir xgboost file is showing error

File "C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgboost.py", line 4, in <module>
from xgboost import XGBClassifier

File "C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgboost.py", line 4, in <module>
from xgboost import XGBClassifier

ImportError: cannot import name 'XGBClassifier' from partially initialized module 'xgboost' (most likely due to a circular import) (C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgboost.py)

and when i change the file name its showing error

File "C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgbost.py", line 4, in <module>
from xgboost import XGBClassifier

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\__init__.py", line 11, in <module>
from .core import DMatrix, Booster

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\core.py", line 115, in <module>
_LIB = _load_lib()

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\core.py", line 109, in _load_lib
lib = ctypes.cdll.LoadLibrary(lib_path[0])

File "C:\Users\win10\anaconda3\lib\ctypes\__init__.py", line 451, in LoadLibrary
return self._dlltype(name)

File "C:\Users\win10\anaconda3\lib\ctypes\__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)

OSError: [WinError 193] %1 is not a valid Win32 application

After stats.py

For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE to your own paths in the respective files. Wherever applicable, values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using different types of features as described in report.

basically what we have to do actually??

csv data set

how to insert dataset in preprocessing,py file

hdf5 unable to created

(unable to open file: name= './models/lstm-01-0.519-0.744-0.490-0.771.hdf5, error=2, message='no such file or directory', flags=13, o_flags = 302)

How to create hdf5 file and how to use it?

Upload CNN Model?

Hello,

Can I ask you if you can upload your trained model of your convolutional neural network?

Thank you!

Dataset

Where can I get the test and trained dataset you used(in csv format)?

Tidy up preprocess.py with pandas

In the preprocess_csv function in preprocess.py, pandas can be used to parse the csv more efficiently and with far less code. The machine I was using while developing the project did not have pandas installed.

Multi classification issue

Hello
If we have multi class classification (Positive, Negative, Neutral) could you please explain how your code will change regarding deep learning models

Separate dataset

Do we need separate datasets for training and testing? Or can we split a single one? If so, how? I am using Sentiment140.
