
Smiley prediction on Twitter data :)

In this paper, we apply machine learning methods to Twitter data to predict if a message has a positive or a negative smiley.

We present four different types of models: a set of simple machine learning baseline models; two long short-term memory (LSTM) models using word2vec and GloVe embeddings respectively; transformer models; and a few-shot learning model using TARS.

Our proposed model uses the CT-BERT language model, which achieves 0.906 accuracy and 0.905 F1-score on the test set and placed third in the respective AIcrowd competition (submission ID: 107963).

Our pre-trained model can be found here.

Colab

For a step-by-step guide to running all the experiments of the project, please take a look at this notebook:

Open In Colab

We strongly advise running the project with the above Colab notebook, which offers free GPUs.

Step-by-step guide for local deployment

Getting started

Install

Clone and enter the repository

git clone https://<YOUR USER>:<YOUR PASSWORD>@github.com/CS-433/cs-433-project-2-mlakes MLProject2
cd MLProject2

We recommend installing the dependencies inside a Python virtual environment so they don't conflict with other packages installed on the machine. You can use virtualenv, pyenv or conda to do that.

pyenv virtualenv mlproject2
pyenv activate mlproject2

Project dependencies are located in the requirements.txt file.
To install them, run:

pip install -r requirements.txt

To install the spaCy English model, please run the following:

python -m spacy download en_core_web_sm

Data

The raw data can be downloaded from the webpage of the AIcrowd challenge:
https://www.aicrowd.com/challenges/epfl-ml-text-classification/dataset_files.
The data files should be placed in the data/ directory.

To do this, move the zip file to the data directory and run

unzip data/twitter-datasets.zip -d data/

mv data/twitter-datasets/train_neg.txt data/train_neg.txt 
mv data/twitter-datasets/train_pos.txt data/train_pos.txt 
mv data/twitter-datasets/train_neg_full.txt data/train_neg_full.txt 
mv data/twitter-datasets/train_pos_full.txt data/train_pos_full.txt 
mv data/twitter-datasets/test_data.txt data/test_data.txt
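
After moving the files, a quick sanity check confirms they are in place. The sketch below is illustrative only; the `load_tweets` helper is hypothetical and not part of the repository:

```python
from pathlib import Path

def load_tweets(path):
    """Read one tweet per line, dropping empty lines and trailing whitespace."""
    with open(path, encoding="utf-8") as f:
        return [line.rstrip("\n") for line in f if line.strip()]

if __name__ == "__main__":
    data_dir = Path("data")
    for name in ["train_pos.txt", "train_neg.txt", "test_data.txt"]:
        path = data_dir / name
        if path.exists():
            print(f"{name}: {len(load_tweets(path))} tweets")
        else:
            print(f"{name}: missing")
```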

Modeling

Embeddings

The BiLSTM can be trained with GloVe and word2vec embeddings. In order to run these models, you need to create the vocabulary (word2vec) or download pre-trained embeddings (GloVe).

Word2vec

Constructs a vocabulary list of words appearing at least 5 times:

src/preprocessing_glove/build_vocab.sh
src/preprocessing_glove/cut_vocab.sh
python src/preprocessing_glove/pickle_vocab.py
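
The scripts above count word occurrences and keep the frequent ones. As a rough, pure-Python sketch of the same idea (not the repository's exact logic; the file paths and the `build_vocab` helper are assumptions):

```python
import pickle
from collections import Counter

def build_vocab(tweets, min_count=5):
    """Map each word appearing at least `min_count` times to an integer id."""
    counts = Counter(word for tweet in tweets for word in tweet.split())
    frequent = [word for word, count in counts.items() if count >= min_count]
    return {word: idx for idx, word in enumerate(frequent)}

if __name__ == "__main__":
    tweets = []
    for path in ["data/train_pos.txt", "data/train_neg.txt"]:
        try:
            with open(path, encoding="utf-8") as f:
                tweets.extend(f)
        except FileNotFoundError:
            pass  # run from the repository root after downloading the data
    if tweets:
        with open("data/vocab.pkl", "wb") as f:
            pickle.dump(build_vocab(tweets), f)
```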

GloVe

Download the pretrained embeddings from here, or use wget:

wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
mv glove.twitter.27B.zip data/embeddings/glove.twitter.27B.zip
unzip data/embeddings/glove.twitter.27B.zip -d data/embeddings
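
The unzipped archive contains plain-text files, one word per line followed by its vector components. A hedged sketch for loading them into a dictionary (the file name in the comment assumes the 25-dimensional variant):

```python
import numpy as np

def load_glove(path):
    """Parse a GloVe text file into {word: np.ndarray} pairs."""
    embeddings = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            embeddings[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return embeddings

# Example (once the file is in place):
# glove = load_glove("data/embeddings/glove.twitter.27B.25d.txt")
```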

TARS zero shot

wget https://nlp.informatik.hu-berlin.de/resources/models/tars-base/tars-base.pt
mv tars-base.pt saved_models/tars-base.pt

Training

To train the model, you can run

cd src
python run.py --pipeline training 

To run a particular model, pass its name as a parameter:

cd src
python run.py --pipeline training \
              --model glove 

The following models can be trained:

  • tfidf : Term Frequency-Inverse Document Frequency
  • word2vec : BiLSTM using word2vec embeddings
  • glove : BiLSTM using GloVe embeddings
  • bert : Bidirectional Encoder Representations from Transformers (CT-BERT)
  • zero : Few-shot learning model (TARS)
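
The flags above suggest a small argparse interface; the sketch below mirrors the documented options but is not the repository's actual parser:

```python
import argparse

def parse_args(argv=None):
    parser = argparse.ArgumentParser(description="Train and/or test a smiley classifier.")
    parser.add_argument("--pipeline", choices=["training", "testing"], default=None,
                        help="Pipeline stage to run; by default the full pipeline runs.")
    parser.add_argument("--model", choices=["tfidf", "word2vec", "glove", "bert", "zero"],
                        default="bert", help="Model to use (default: bert).")
    return parser.parse_args(argv)

if __name__ == "__main__":
    args = parse_args()
    print(args.pipeline, args.model)
```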

To learn more, read the report :D

Testing

To create the predictions, you can run

python src/run.py --testing
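
The predictions are written in the CSV format expected by AIcrowd, i.e. one Id,Prediction row per test tweet. The sketch below is an assumption about that format (the column names, the -1/1 label convention, and the `write_predictions` helper are not confirmed by this README):

```python
import csv

def write_predictions(ids, labels, path="predictions/submission.csv"):
    """Write an AIcrowd-style submission file: one Id,Prediction row per tweet."""
    with open(path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["Id", "Prediction"])
        for tweet_id, label in zip(ids, labels):
            writer.writerow([tweet_id, label])
```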

Complete pipeline

If no parameters are passed, the bert model is trained and then predictions on the test data are made.

python src/run.py 

Running with Docker

The project can easily be run on any virtual machine, without installing any dependencies, using our Docker container.

  1. Make sure you have Docker and Git installed and running.

  2. Declare the global variables. The image is available on Docker Hub as paolamedo/bert_notebook:latest:

REPO_URL=paolamedo/bert_notebook:latest
BUILD_DIR=<location of the cloned repo>   # e.g. /home/paola/Documents/EPFL/MLProject2

  3. Run Docker:

docker run --rm -it -e GRANT_SUDO=yes \
--user root \
-p 8888:8888 \
-e JUPYTER_TOKEN="easy" \
-v $BUILD_DIR:/home/jovyan/work $REPO_URL

  4. You will now be able to open Jupyter and run notebooks/MLProject2_GAP.ipynb:

http://localhost:8888/?token=easy

or run from the terminal

python src/run.py 

Testing the code

To test the data transformation code, please run:

cd src
python test_preprocessing.py 
python test_data_cleaning.py 
python test_embeddings.py 
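
Each test file follows the standard unittest pattern. As a minimal illustration (the `normalize_whitespace` helper here is hypothetical, not taken from the repository):

```python
import unittest

def normalize_whitespace(text):
    """Collapse runs of whitespace into single spaces and strip the ends."""
    return " ".join(text.split())

class TestDataCleaning(unittest.TestCase):
    def test_normalize_whitespace(self):
        self.assertEqual(normalize_whitespace("  so   happy \t today "),
                         "so happy today")

if __name__ == "__main__":
    unittest.main(argv=["tests"], exit=False)
```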

Project Architecture

Report

Our paper describing the methodology and the experiments of the proposed model is located under the report/ directory in PDF format.

Folder structure

The source code of this project is structured in the following manner.

project
├── README.md
├── requirements.txt
├── Dockerfile-notebook
├── docs/                      # report and project description
│
├── data/                      # the data directory
│   ├── embeddings/            # directory where embeddings will be stored
│   └── twitter-datasets.zip   # this is where the data should be placed
├── models/                    # directory where models are saved
├── predictions/               # directory where the predictions are saved
├── notebooks
│   └── MLProject2_GAP.ipynb
├── src
│   ├── models/                # directory with models' code   
│   ├── preprocessing_glove/   # directory with files to preprocess corpus for glove
│   ├── data_cleaning.py
│   ├── data_loading.py
│   ├── embeddings.py
│   ├── evaluate.py
│   ├── model_selection.py
│   ├── preprocessing.py
│   └── run.py
└── test                       # unit tests
   ├── test_data_cleaning.py
   ├── test_embeddings.py
   └── test_preprocessing.py

Authors

  • Angeliki Romanou @agromanou
  • George Fotiadis @geofot96
  • Paola Mejia @paola-md

To see the development of the project and the interesting discussions we had in each pull request, you can visit our development repository: https://github.com/geofot96/MLProject2/
