In this paper, we apply machine learning methods to Twitter data to predict whether a message originally contained a positive or a negative smiley.
We present four different types of models: a set of simple machine learning baselines; two long short-term memory (LSTM) models using word2vec and GloVe embeddings, respectively; transformer models; and a few-shot learning model using TARS.
Our proposed model, based on the CT-BERT language model, achieves 0.906 accuracy and 0.905 F1-score on the test set and placed third in the respective AIcrowd competition (submission ID: 107963).
Our pre-trained model can be found here.
For a step-by-step guide to running all the experiments of the project, please take a look at this notebook:
We strongly advise running the project with the above Colab notebook, which offers free GPUs.
Clone and enter the repository
git clone https://<YOUR USER>:<YOUR PASSWORD>@github.com/CS-433/cs-433-project-2-mlakes MLProject2
cd MLProject2
We recommend installing the dependencies inside a Python virtual environment so they don't conflict with other packages installed on the machine. You can use virtualenv, pyenv, or conda to do that.
pyenv virtualenv mlproject2
pyenv activate mlproject2
Project dependencies are listed in the requirements.txt file. To install them, run:
pip install -r requirements.txt
To install spacy dependencies please run the following:
python -m spacy download en_core_web_sm
The raw data can be downloaded from the webpage of the AIcrowd challenge:
https://www.aicrowd.com/challenges/epfl-ml-text-classification/dataset_files.
The data files should be placed in the data/ directory. To do this, move the zip file to the data directory and run:
unzip data/twitter-datasets.zip -d data/
mv data/twitter-datasets/train_neg.txt data/train_neg.txt
mv data/twitter-datasets/train_pos.txt data/train_pos.txt
mv data/twitter-datasets/train_neg_full.txt data/train_neg_full.txt
mv data/twitter-datasets/train_pos_full.txt data/train_pos_full.txt
mv data/twitter-datasets/test_data.txt data/test_data.txt
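After extraction you may want to sanity-check the test file. Assuming the challenge's usual `id,tweet` layout for `test_data.txt` (one tweet per line, a numeric id before the first comma), a small illustrative parser, not part of the project code, looks like this:

```python
def parse_test_line(line):
    """Split a test_data.txt line into (id, tweet).

    The tweet itself may contain commas, so we split only on the first one.
    """
    tweet_id, tweet = line.rstrip("\n").split(",", 1)
    return int(tweet_id), tweet

# Toy example line in the assumed format
tid, text = parse_test_line("1,vinco tresorpack 6 ( difficulty 10 of 10\n")
```

The training files (`train_pos.txt`, `train_neg.txt`) carry no ids; the label is implied by which file a tweet comes from.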
The BiLSTM can be trained with GloVe and word2vec embeddings. In order to run these models, you need to create the vocabulary (word2vec) or download pre-trained embeddings (GloVe).
The following commands construct a vocabulary list of words appearing at least 5 times:
src/preprocessing_glove/build_vocab.sh
src/preprocessing_glove/cut_vocab.sh
python src/preprocessing_glove/pickle_vocab.py
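Conceptually, the vocabulary step above boils down to a frequency cut-off over the corpus tokens. A stdlib sketch of the idea (not the project's actual implementation, which the shell scripts above drive):

```python
from collections import Counter

def build_vocab(tweets, min_count=5):
    """Keep only tokens that occur at least `min_count` times across the corpus."""
    counts = Counter(token for tweet in tweets for token in tweet.split())
    return {token for token, count in counts.items() if count >= min_count}

# Tokens "a" and "b" appear 5 times each; "c" appears once and is dropped.
vocab = build_vocab(["a b"] * 5 + ["c"])
```

Rare tokens are dropped because their embeddings cannot be estimated reliably from so few occurrences.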
You must download the pre-trained embeddings from here, or use wget:
wget http://nlp.stanford.edu/data/glove.twitter.27B.zip
mv glove.twitter.27B.zip data/embeddings/glove.twitter.27B.zip
unzip data/embeddings/glove.twitter.27B.zip -d data/embeddings
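Once unzipped, each line of a GloVe text file holds a token followed by its vector components, so loading the embeddings is a simple parse. An illustrative sketch using a toy two-line sample in place of the real file:

```python
def load_glove(lines):
    """Parse GloVe text lines ('token v1 v2 ...') into a token -> vector dict."""
    embeddings = {}
    for line in lines:
        parts = line.rstrip().split(" ")
        embeddings[parts[0]] = [float(x) for x in parts[1:]]
    return embeddings

# Toy sample standing in for a glove.twitter.27B.*d.txt file
sample = ["hello 0.1 0.2 0.3", "world -0.4 0.5 0.6"]
vectors = load_glove(sample)
```

The glove.twitter.27B archive ships vectors in several dimensionalities (25d, 50d, 100d, 200d); pick one file and keep the dimension consistent with the model configuration.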
wget https://nlp.informatik.hu-berlin.de/resources/models/tars-base/tars-base.pt
mv tars-base.pt saved_models/tars-base.pt
To train the model, you can run
cd src
python run.py --pipeline training
To run a particular model, pass its name as a parameter:
cd src
python run.py --pipeline training \
--model glove
The following models can be trained:
- tfidf : Term Frequency-Inverse Document Frequency
- word2vec : BiLSTM using word2vec embeddings
- glove : BiLSTM using glove embeddings
- bert : Bidirectional Encoder Representations from Transformers (CT-BERT)
- zero : Few-shot learning model (TARS)
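As a rough illustration of how such `--pipeline`/`--model` flags are typically wired up with argparse (a hypothetical sketch; the actual argument handling lives in `src/run.py` and may differ):

```python
import argparse

def make_parser():
    parser = argparse.ArgumentParser(description="Train a tweet sentiment model.")
    parser.add_argument("--pipeline", choices=["training", "testing"],
                        default="training")
    parser.add_argument("--model",
                        choices=["tfidf", "word2vec", "glove", "bert", "zero"],
                        default="bert")  # bert (CT-BERT) as the default model
    return parser

# Mirrors: python run.py --pipeline training --model glove
args = make_parser().parse_args(["--pipeline", "training", "--model", "glove"])
```

Restricting `--model` with `choices` makes argparse reject unknown model names with a clear error message.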
To learn more, read the report :D
To create the predictions, you can run
python src/run.py --testing
If no parameters are passed, the bert model is trained and then the predictions on the test data are made:
python src/run.py
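The submission file written to predictions/ follows the AIcrowd two-column CSV layout, assuming the challenge's usual `Id,Prediction` header with labels -1 (negative) and 1 (positive). A minimal stdlib sketch of that format (the real logic lives in the project's prediction code):

```python
import csv
import io

def write_submission(ids, labels, fileobj):
    """Write predictions as 'Id,Prediction' rows; labels are in {-1, 1}."""
    writer = csv.writer(fileobj)
    writer.writerow(["Id", "Prediction"])
    for tweet_id, label in zip(ids, labels):
        writer.writerow([tweet_id, label])

buf = io.StringIO()
write_submission([1, 2, 3], [1, -1, 1], buf)
```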
The project can easily be run in any virtual machine, without installing any dependencies, using our Docker container.
- Make sure you have Docker and git installed and running.
- Declare the global variables. The Docker image is available on Dockerhub: paolamedo/bert_notebook:latest
REPO_URL=paolamedo/bert_notebook:latest
BUILD_DIR=/home/paola/Documents/EPFL/MLProject2 <location of the cloned repo>
- Run docker
docker run --rm -it -e GRANT_SUDO=yes \
--user root \
-p 8888:8888 \
-e JUPYTER_TOKEN="easy" \
-v $BUILD_DIR:/home/jovyan/work $REPO_URL
- You will now be able to open Jupyter and run notebooks/MLProject2_GAP.ipynb:
http://localhost:8888/?token=easy
or run from the terminal
python src/run.py
To test the data transformation code, please run:
cd src
python test_preprocessing.py
python test_data_cleaning.py
python test_embeddings.py
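The tests above are plain Python scripts. If you want to add your own, an illustrative example in the same spirit, built around a toy URL-stripping function, not one of the project's actual tests:

```python
import re

def remove_urls(tweet):
    """Strip http(s) URLs from a tweet and normalize whitespace,
    a typical tweet-cleaning step."""
    return " ".join(re.sub(r"https?://\S+", " ", tweet).split())

def test_remove_urls():
    assert remove_urls("check this https://t.co/abc out") == "check this out"
    assert remove_urls("no urls here") == "no urls here"

test_remove_urls()
```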
Our paper describing the methodology and the experiments of the proposed model is located in the report/ directory in PDF format.
The source code of this project is structured in the following manner.
project
├── README.md
├── requirements.txt
├── Dockerfile-notebook
├── docs/ # report and project description
├── data/ # the data directory
│   ├── embeddings/ # directory where embeddings will be stored
│   └── twitter-datasets.zip # this is where the data should be loaded
├── models/ # directory where models are saved
├── predictions/ # directory where the predictions are saved
├── notebooks
│ └── MLProject2_GAP.ipynb
├── src
│ ├── models/ # directory with models' code
│ ├── preprocessing_glove/ # directory with files to preprocess corpus for glove
│ ├── data_cleaning.py
│ ├── data_loading.py
│ ├── embeddings.py
│ ├── evaluate.py
│ ├── model_selection.py
│ ├── preprocessing.py
│ └── run.py
└── test # unit tests
├── test_data_cleaning.py
├── test_embeddings.py
└── test_preprocessing.py
- Angeliki Romanou @agromanou
- George Fotiadis @geofot96
- Paola Mejia @paola-md
To see the development of the project and the interesting discussions we had in each pull request, you can visit our development repository: https://github.com/geofot96/MLProject2/