abdulfatir / twitter-sentiment-analysis

Sentiment analysis on tweets using Naive Bayes, SVM, CNN, LSTM, etc.

License: MIT License

Python 100.00%

cnn deeplearning keras lstm machine-learning python sentiment-analysis sentiment-classification

twitter-sentiment-analysis's Introduction

Sentiment Analysis on Tweets

Update (21 Sept. 2018): I don't actively maintain this repository. This work was done for a course project and the dataset cannot be released because I don't own the copyright. However, everything in this repository can be easily modified to work with other datasets. I recommend reading the (sloppily written) project report, which can be found in docs/.

Dataset Information

We compare several methods for sentiment analysis on tweets (a binary classification problem). The training dataset is expected to be a CSV file with rows of the form tweet_id,sentiment,tweet, where tweet_id is a unique integer identifying the tweet, sentiment is either 1 (positive) or 0 (negative), and tweet is the tweet text enclosed in "". Similarly, the test dataset is a CSV file with rows of the form tweet_id,tweet. CSV headers are not expected and should be removed from both the training and test datasets.
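As a quick sanity check before preprocessing, the expected training format can be validated with a few lines of Python. This is only a sketch: read_training_rows is a hypothetical helper that is not part of this repository, and the sample rows are made up for illustration.

```python
import csv
import io


def read_training_rows(handle):
    """Yield (tweet_id, sentiment, tweet) tuples from a headerless CSV.

    Hypothetical helper for illustration; it is not part of the repository.
    """
    for line_no, row in enumerate(csv.reader(handle), start=1):
        if len(row) != 3:
            raise ValueError(f"line {line_no}: expected 3 fields, got {len(row)}")
        tweet_id, sentiment, tweet = row
        if sentiment not in ("0", "1"):
            # A leftover header row such as tweet_id,sentiment,tweet ends up here.
            raise ValueError(f"line {line_no}: sentiment must be 0 or 1, got {sentiment!r}")
        yield int(tweet_id), int(sentiment), tweet


# Made-up sample rows; the csv module handles the comma inside the quoted tweet.
sample = io.StringIO('1,1,"what a great day, loving it"\n2,0,"worst service ever"\n')
rows = list(read_training_rows(sample))
```

To check a real file, pass an open file handle instead of the in-memory sample, for example open(path, newline="").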

Requirements

There are some general library requirements for the project and some which are specific to individual methods. The general requirements are as follows.

numpy
scikit-learn
scipy
nltk

The library requirements specific to some methods are:

keras with TensorFlow backend for Logistic Regression, MLP, RNN (LSTM), and CNN.
xgboost for XGBoost.

Note: It is recommended to use the Anaconda distribution of Python.

Usage

Preprocessing

Run preprocess.py <raw-csv-path> on both train and test data. This will generate a preprocessed version of the dataset.
Run stats.py <preprocessed-csv-path>, where <preprocessed-csv-path> is the path of the CSV generated by preprocess.py. This prints general statistical information about the dataset and generates two pickle files containing the frequency distributions of unigrams and bigrams in the training dataset.

After the above steps, you should have four files in total: <preprocessed-train-csv>, <preprocessed-test-csv>, <freqdist>, and <freqdist-bi> which are preprocessed train dataset, preprocessed test dataset, frequency distribution of unigrams and frequency distribution of bigrams respectively.

For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE to your own paths in the respective files. Wherever applicable, the values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using the different feature types described in the report.

Baseline

Run baseline.py. With TRAIN = True it will show the accuracy results on the training dataset.

Naive Bayes

Run naivebayes.py. With TRAIN = True it will show the accuracy results on 10% validation dataset.

Maximum Entropy

Run logistic.py to run the logistic regression model, or run maxent-nltk.py <> to run NLTK's MaxEnt model. With TRAIN = True it will show the accuracy results on 10% validation dataset.

Decision Tree

Run decisiontree.py. With TRAIN = True it will show the accuracy results on 10% validation dataset.

Random Forest

Run randomforest.py. With TRAIN = True it will show the accuracy results on 10% validation dataset.

XGBoost

Run xgboost.py. With TRAIN = True it will show the accuracy results on 10% validation dataset.

SVM

Run svm.py. With TRAIN = True it will show the accuracy results on 10% validation dataset.

Multi-Layer Perceptron

Run neuralnet.py. This validates on 10% of the data and saves the best model to best_mlp_model.h5.

Recurrent Neural Networks

Run lstm.py. This validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running lstm.py.)

Convolutional Neural Networks

Run cnn.py. This will run the 4-Conv-NN (4 conv layers neural network) model as described in the report. To run other versions of the CNN, comment out or remove the lines where Conv layers are added. This validates on 10% of the data and saves a model for each epoch in ./models/. (Please make sure this directory exists before running cnn.py.)

Majority Vote Ensemble

To extract penultimate layer features for the training dataset, run extract-cnn-feats.py <saved-model>. This will generate 3 files, train-feats.npy, train-labels.txt and test-feats.npy.
Run cnn-feats-svm.py which uses files from the previous step to perform SVM classification on features extracted from CNN model.
Place all prediction CSV files for which you want to take majority vote in ./results/ and run majority-voting.py. This will generate majority-voting.csv.
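The idea behind the majority vote can be sketched in a few lines. This is an illustration of the technique only, not the contents of majority-voting.py: majority_vote is a hypothetical function, the three prediction lists are invented, and reading the prediction CSV files from ./results/ is left out because their exact format is not described here.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return the most common label for each sample across all models.

    Hypothetical helper for illustration; each inner list holds one model's
    0/1 predictions, in the same sample order.
    """
    voted = []
    for labels in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the top label.
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Invented predictions from three models on four tweets.
model_a = [1, 0, 1, 1]
model_b = [1, 1, 0, 1]
model_c = [0, 0, 1, 1]
combined = majority_vote([model_a, model_b, model_c])
```

With binary labels, an odd number of models avoids ties; with an even number, this sketch silently picks whichever tied label appeared first.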

Information about other files

dataset/positive-words.txt: List of positive words.
dataset/negative-words.txt: List of negative words.
dataset/glove-seeds.txt: GloVe word vectors from StanfordNLP which match our dataset, used for seeding word embeddings.
Plots.ipynb: IPython notebook used to generate plots present in report.


twitter-sentiment-analysis's Issues

Chinese text

Can this cnn.py be used for Sentiment Analysis of Chinese Text after processing the text by Word embedding?
THANKS A LOT.

getting issue after running stats.py file

File "stats.py", line 63, in
t_id, if_pos, tweet = line.strip().split(',')
ValueError: not enough values to unpack (expected 3, got 2)

I am getting this error. The if_pos value isn't produced by preprocess.py: test.csv has only 2 columns. How should I get it to the 3 columns the error asks for?
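One possible reading of this traceback, going by the README: the frequency distributions are built from the training dataset, so stats.py is probably meant to be run on the preprocessed training file, whose lines have three fields, rather than on the two-field test file. If you do need code that accepts both layouts, a sketch could look like the following. parse_processed_line is a hypothetical helper, not the repository's code, and it assumes the preprocessed tweet text contains no commas.

```python
def parse_processed_line(line):
    """Split one preprocessed line into (tweet_id, sentiment, tweet).

    Hypothetical helper for illustration. Training lines carry three fields;
    test lines carry two, in which case sentiment is returned as None.
    Assumes the preprocessed tweet text itself contains no commas.
    """
    parts = line.strip().split(',')
    if len(parts) == 3:
        tweet_id, sentiment, tweet = parts
        return tweet_id, int(sentiment), tweet
    if len(parts) == 2:
        tweet_id, tweet = parts
        return tweet_id, None, tweet
    raise ValueError(f"expected 2 or 3 fields, got {len(parts)}")


train_row = parse_processed_line("7,1,good day\n")
test_row = parse_processed_line("8,bad day\n")
```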

missing glove-seeds.txt file

Hello Abdul Fatir,
Thanks for sharing the excellent work. I do not see the glove-seeds.txt file in the dataset/ directory. Any suggestion on where I can get it from?
I do see one on github at http://nlp.stanford.edu/data/wordvecs/glove.twitter.27B.zip, it seems too big, would it work?
Thanks in advance

Alim

Need GPU??

Can this be done on a normal 64-bit laptop with a basic 4 GB of RAM, without a GPU?

No module found error. Can you please tell the steps to be followed.

problem in majority voting.py

majority voting.py is giving an array. How do I get the accuracy percentage for it? Can you please help me? Thanks in advance.
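Accuracy can only be computed if you have the true labels for the tweets you predicted on (for example, a held-out labelled split). Assuming you can load the predicted and true labels as two equally long lists, a minimal sketch is shown below. accuracy_percent is a hypothetical helper and the label lists are invented.

```python
def accuracy_percent(predicted, actual):
    """Return the percentage of positions where predicted equals actual.

    Hypothetical helper for illustration; both arguments are sequences of labels.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("cannot compute accuracy on empty input")
    correct = sum(p == a for p, a in zip(predicted, actual))
    return 100.0 * correct / len(actual)


# Invented labels: three of the four predictions match.
score = accuracy_percent([1, 0, 1, 1], [1, 0, 0, 1])
```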

Need to add Neutral result into the sentiment

that's good to give result between positive (1) and negative (0), but im looking for how to add neutral into the code. Anyone can help?

preprocesser.py

I am unable to follow this code. I ran preprocesser.py, but no test or train dataset file is generated.

(base) c:\Twitter\twitter-sentiment-analysis-all algos (1)\twitter-sentiment-analysis-master>python preprocess.py C:\Twitter\twitter-sentiment-analysis-all algos (1)\twitter-sentiment-analysis-master\processed
Usage: python preprocess.py

No needed files of FREQ_DIST_FILE

How can we find the FREQ_DIST_FILE for training the models?

Refer to https://github.com/abdulfatir/twitter-sentiment-analysis/blob/master/extract-cnn-feats.py#L9 .

FREQ_DIST_FILE = '../train-processed-freqdist.pkl'

Can you upload these necessary files so that everyone can train the models easily?

Custom Dataset

I have a Twitter dataset to classify into 3 classes. I have the tweet text, ID, and class. What changes are needed in the preprocess.py file? Thanks in advance.

problem in first step

I am just a beginner, so sorry if I am asking a very basic question.
I have my training and test datasets saved in CSV format. Your first step says to run preprocess.py on the training and test datasets. How do I do that? Does that mean I have to specify the path of the CSV files somewhere within the code (preprocess.py)? I am not able to find where. Please help, thank you :)

Arabic Dialect DS

How can get Arabic Data Set for Facebook posts especially Syrian Dialect?!

Permission to read file denied in preprocessing step.

I am getting PermissionError while running the command preprocess.py <raw-csv-path>.
Can someone help me with it?

Traceback (most recent call last):
  File "preprocess.py", line 107, in <module>
    preprocess_csv(csv_file_name, processed_file_name, test_file=False)
  File "preprocess.py", line 74, in preprocess_csv
    with open(csv_file_name, 'r') as csv:
PermissionError: [Errno 13] Permission denied: 'C:\\Users\\kashi\\OneDrive\\Desktop\\twitter-sentiment-analysis-master\\dataset'
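The path in the last line of this traceback ends in \dataset, which looks like a folder rather than a CSV file. On Windows, calling open() on a directory raises PermissionError, so passing the full path of the CSV file itself may resolve this. A small check along these lines can confirm what a path points to before running the script. check_csv_path is a hypothetical helper, and the temporary folder and train.csv file exist only for the demonstration.

```python
import os
import tempfile


def check_csv_path(path):
    """Classify a path as 'directory', 'missing', or 'ok' (a regular file).

    Hypothetical helper for illustration; not part of the repository.
    """
    if os.path.isdir(path):
        return "directory"
    if not os.path.isfile(path):
        return "missing"
    return "ok"


# Demonstration with a throwaway folder and file.
with tempfile.TemporaryDirectory() as folder:
    csv_path = os.path.join(folder, "train.csv")
    with open(csv_path, "w") as handle:
        handle.write('1,1,"hello"\n')
    folder_status = check_csv_path(folder)
    file_status = check_csv_path(csv_path)
```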

Dataset

Hey man, can u plz upload the dataset

data set

please share me the URL for data set

Glove-seed.txt file

where can we get Glove-seed.txt file.

ValueError: invalid literal for int() with base 10: 'label'

Preprocessing still doesn't happen since there is an error in this line:
positive = int(line[:line.find(',')])

Here is the warning I get: ValueError: invalid literal for int() with base 10: 'label'

I looked up how to fix this and it looks like the string cannot be converted to int, so I tried int(float(...)) but didn't work

Any ideas?
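The value 'label' in the error suggests the CSV still has a header row, and the README states that headers are not expected and should be removed. Deleting the first line of the file by hand is enough. If you would rather skip it in code, one way is sketched below. drop_header is a hypothetical helper that treats the first line as a header whenever its first field is not an integer, which is an assumption about your file.

```python
def drop_header(lines):
    """Yield lines, skipping the first one if its first field is not an integer.

    Hypothetical helper for illustration; not part of the repository.
    """
    lines = iter(lines)
    first = next(lines, None)
    if first is None:
        return  # empty input, nothing to yield
    first_field = first.split(',', 1)[0].strip()
    if first_field.lstrip('-').isdigit():
        yield first  # first line is data, keep it
    yield from lines


with_header = list(drop_header(["label,tweet\n", "1,good\n", "0,bad\n"]))
without_header = list(drop_header(["1,good\n", "0,bad\n"]))
```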

Some Questions

Can you please give the link of the dataset used?
Also, don't you think that, apart from the positive and negative sentiment labels, a neutral label should also be applied to the sentiment classification problem?

Looking to measure sentiments of every tweet

I set up your repository and retrieved the results using python baseline.py TRAIN = True

I had to enter my own dataset to train the bot where i had to add sentiments manually as 0 or 1 ... but when i added a file to test which included tweet_id and tweet, and run the above command, then the result i got was Correct = 100.00%

Well I am looking for a code where the sentiments will be predicted as positive or negative. how do I go about it.?

REGARDING COLUMNS IN DATA SET

Respected sir, can you please tell me the dataset columns? Only the column names, so I can download the same kind of dataset for further work. Thank you.

seeds file and arabic sentiment analysis

Hello, I'm working on Arabic sentiment analysis, and I'm wondering whether this code will work for me. Also, where did you get the seeds file, and how can I make one?
And where do these files come from:

FREQ_DIST_FILE = '../train-processed-freqdist.pkl'
BI_FREQ_DIST_FILE = '../train-processed-freqdist-bi.pkl'
TRAIN_PROCESSED_FILE = '../train-processed.csv'
TEST_PROCESSED_FILE = '../test-processed.csv'
GLOVE_FILE = './dataset/glove-seeds.txt'

error in xgboost file

Hello sir,
the xgboost file is showing an error:

File "C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgboost.py", line 4, in
from xgboost import XGBClassifier

ImportError: cannot import name 'XGBClassifier' from partially initialized module 'xgboost' (most likely due to a circular import) (C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgboost.py)

And when I change the file name, it shows this error:

File "C:\Users\win10\Desktop\twitter-sentiment-analysis-master\code\xgbost.py", line 4, in
from xgboost import XGBClassifier

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\__init__.py", line 11, in
from .core import DMatrix, Booster

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\core.py", line 115, in
_LIB = _load_lib()

File "C:\Users\win10\anaconda3\lib\site-packages\xgboost\core.py", line 109, in _load_lib
lib = ctypes.cdll.LoadLibrary(lib_path[0])

File "C:\Users\win10\anaconda3\lib\ctypes\__init__.py", line 451, in LoadLibrary
return self._dlltype(name)

File "C:\Users\win10\anaconda3\lib\ctypes\__init__.py", line 373, in __init__
self._handle = _dlopen(self._name, mode)

OSError: [WinError 193] %1 is not a valid Win32 application

After stats.py

For all the methods that follow, change the values of TRAIN_PROCESSED_FILE, TEST_PROCESSED_FILE, FREQ_DIST_FILE, and BI_FREQ_DIST_FILE to your own paths in the respective files. Wherever applicable, values of USE_BIGRAMS and FEAT_TYPE can be changed to obtain results using different types of features as described in report.

Basically, what do we actually have to do here?

csv data set

How do I insert my dataset into the preprocessing.py file?

Unable to create hdf5 file

(unable to open file: name= './models/lstm-01-0.519-0.744-0.490-0.771.hdf5, error=2, message='no such file or directory', flags=13, o_flags = 302)

How to create hdf5 file and how to use it?
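The message says ./models/ does not exist, and the README asks you to make sure this directory exists before running lstm.py or cnn.py, which save a model there for each epoch. Creating the folder by hand, or with a snippet like the one below run from the directory where you start the script, should let the .hdf5 files be written. ensure_model_dir is a hypothetical helper, and the temporary base folder is only there so the demonstration cleans up after itself.

```python
import os
import tempfile


def ensure_model_dir(path):
    """Create the directory if it is missing and report whether it now exists.

    Hypothetical helper for illustration; not part of the repository.
    """
    os.makedirs(path, exist_ok=True)  # no error if the directory already exists
    return os.path.isdir(path)


# Demonstration inside a throwaway folder; in practice the path would be "./models".
with tempfile.TemporaryDirectory() as base:
    model_dir = os.path.join(base, "models")
    created = ensure_model_dir(model_dir)
    created_again = ensure_model_dir(model_dir)
```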

Upload CNN Model?

Hello,

Could you upload the trained model of your convolutional neural network?

Thank you!