Coder Social home page Coder Social logo

cs-433-project-2-rainbow-triangle's Introduction

Twitter Sentiment Analysis - EPFL course challenge

Authors (rainbow-triangle ๐ŸŒˆ)

  • Giorgio Mannarini
  • Maria Pandele
  • Francesco Posa

Introduction

This project performs supervised classification of tweets. It predicts if a tweet message used to contain a positive :) or negative :( smiley, by considering only the remaining text.We implement various methods to represent tweets (TF-IDF, Glove embeddings) and different machine learning algorithms to classify them, from more classical ones to recurrent neural networks and deep learning.

In short, we compared: K-Nearest Neighbors, Naive Bayes, Logistic Regression, Support Vector Machines (linear), Random Forest, Multi-layer Perceptron, Gated Recurrent Unit, Bert. Moreover, we also make an ensemble based on voting between all of them.

For more details, read the report.pdf.

Results at a glance

Our best model was based on Bert (large-uncased) and had a 0.902 accuracy and 0.901 F1 score on AIcrowd.

Dependencies

To properly run our code you will have to install some dependencies. Our suggestion is to use a Python environment (we used Anaconda). GRU and Bert are built on TensorFlow, with Keras as a wrapper, while the baseline has been done in scikit-learn. In alphabetical order, you should have:

  • joblib 0.17 pip install joblib
  • nltk 3.5 pip install nltk
  • numpy 1.18.5 pip install numpy
  • pandas 1.1.2 pip install pandas
  • tensorflow 2.3.1 pip install --upgrade tensorflow
  • transformers 3.4.0 pip install transformers
  • scikit-learn 0.23.2 pip install -U scikit-learn
  • setuptools 50.3 pip install setuptools
  • symspellpy 6.7 pip install symspellpy
  • vaderSentiment 3.3.2 pip install vaderSentiment

Project structure

This is scheleton we used when developing this project. We recommend this structure since all the files' locations are based on it.

classes/: contains all our implementation

logs/: contains outputed logs during training

preprocessed_data/: we are saving/loading the preprocessed data here/from here

submissions/: contains AIcrowd submissions

utility/: contains helpful resources for preprocessing the tweets

weights/: contains saved weights

Extract_emoticons.ipynb: extracts emoticons from full dataset of tweets which are later manually processed and translated to Glove specific tags

constants.py: defines constants used throughout preprocessing, training and inference

run.py: main script, more details on how to use it in the next section

How to run

There are several ways to run it. You can either re-run everything from data preprocessing to training and inference. Or you can just load our already trained models and make predictions. Note: all the requirements in terms of hardware are in the README in the weights folder. If you just want to reproduce our best submission then skip to Best submission on AIcrowd section.

Step 1. Download the raw data

Skip this section if you only want to make predictions.

Download the raw data from https://www.aicrowd.com/challenges/epfl-ml-text-classification and put it in a new top level folder called data. So you should have something like this:

โ”œโ”€โ”€ data
โ”‚   โ”œโ”€โ”€ train_pos.txt
โ”‚   โ”œโ”€โ”€ train_neg.txt
โ”‚   โ”œโ”€โ”€ train_pos_full.txt
โ”‚   โ”œโ”€โ”€ train_neg_full.txt
โ”‚   โ””โ”€โ”€ test_data.txt

Step 2. Download the GloVe File.

For our Recurrent Neural Network based on GRU, we use a Pre-Trained Embedding Layer, where each 100-dimensional GloVe vector has been obtained by the Stanford University on twitter data. Please download the file and put it in the data folder. If you didn't skip the previous step, you should have a structure like this:

โ”œโ”€โ”€ data
โ”‚   โ”œโ”€โ”€ train_pos.txt
โ”‚   โ”œโ”€โ”€ train_neg.txt
โ”‚   โ”œโ”€โ”€ train_pos_full.txt
โ”‚   โ”œโ”€โ”€ train_neg_full.txt
โ”‚   |โ”€โ”€ test_data.txt
|   โ””โ”€โ”€ glove.twitter.27B.100d.txt

Otherwise, you should have only the glove.twitter.27B.100d.txt in the data folder. This file is necessary even if you do not want to train the model again. Download:

Total required space: 974 MB

Step 3. Download the already preprocessed tweets

Skip this section if you did Step 1 and want to do your own preprocessing.

If you want to download the preprocessed tweets then download them from this Drive link and save them into the top level preprocessed_data/ folder.
Total required space: 365 MB
So you should have something like this:

โ”œโ”€โ”€ preprocessed_data
โ”‚   โ”œโ”€โ”€ baseline
โ”‚   โ”‚   โ”œโ”€โ”€ test_preprocessed.csv   
โ”‚   โ”‚   โ””โ”€โ”€ train_preprocessed.csv
โ”‚   โ”œโ”€โ”€ bert
โ”‚   โ”‚   โ”œโ”€โ”€ test_preprocessed.csv   
โ”‚   โ”‚   โ””โ”€โ”€ train_preprocessed.csv
โ”‚   โ”œโ”€โ”€ gru
โ”‚   โ”‚   โ”œโ”€โ”€ test_preprocessed.csv   
โ”‚   โ”‚   โ””โ”€โ”€ train_preprocessed.csv
โ”‚   โ””โ”€โ”€ README.md

Step 4. Download the models

Skip this section if you want to re-train the models.

If you want to download the pretrained models (HIGHLY RECOMMENDED for the deep learning models) then download them from this Drive link and save them into the top level weights/ folder.
Total required space: 6.21 GB
So you should have something like this:

โ”œโ”€โ”€ weights
โ”‚   โ”œโ”€โ”€ baseline
โ”‚   โ”‚   โ”œโ”€โ”€ model-KNN.joblib   
โ”‚   โ”‚   โ”œโ”€โ”€ model-Logistic-Regression.joblib   
โ”‚   โ”‚   ...
โ”‚   โ”‚   โ””โ”€โ”€ model-SVM.joblib
โ”‚   โ”œโ”€โ”€ bert
โ”‚   โ”‚   โ””โ”€โ”€ model
โ”‚   โ”‚       โ”œโ”€โ”€ config.json
โ”‚   โ”‚       โ””โ”€โ”€ tf_model.h5
โ”‚   โ”œโ”€โ”€ gru
โ”‚   โ””โ”€โ”€ README.md

Step 5. The actual run

run.py is the main script which performs the data preprocessing, training (with hyperparameter tuning) and inference.

A detailed help can be found by running:

python3 run.py -h

There are 3 types of options to keep in mind -lp (load preprocessed_data), -lt (load trained models), bert/gru/mlp...and so on. For example, if you did Step 1 and want to re-train a Naive Bayer Classifier, then run:

python3 run.py nbc

If you downloaded any intermediary data (preprocessed data or model) then run:

python3 run.py nbc -lp -lt

If you downloaded preprocessed tweets but want to retrain the Naive Bayes classifier then run:

python3 run.py nbc -lp

In all cases, the script will make a submission file and save it in the submissions/.

Best submission on AIcrowd

Our best submission on AIcrowd was a model based on Bert. Since this is a highly computationally expensive model, we recommend to download the preprocessed tweets and trained model.

python3 run.py bert -lp -lt

This will take between 30 minutes and one hour on a normal laptop.

cs-433-project-2-rainbow-triangle's People

Contributors

mapaaa avatar giorgiomannarini avatar francesco98 avatar

Watchers

James Cloos avatar Matteo Pagliardini avatar gortizji avatar  avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.