42-ai / sentimentalbb Goto Github PK

View Code? Open in Web Editor NEW

3.0 3.0 0.0 26.45 MB

License: MIT License

Shell 0.44% Python 2.46% Dockerfile 0.03% Makefile 0.14% Jupyter Notebook 96.92%

sentimentalbb's People

Stargazers

Watchers

sentimentalbb's Issues

feat (github-pages): exploring embedded W&B embedded link

📖 Describe what you want

Trying to incorporate an interactive Weight and Biases graph inside a markdown to make github-pages more user friendly

✔️ Definition of done

Just an exploration so no commit

docs (github pages): update the index page

📖 Describe what you want

Temporary set the SBB README as the website index page

✔️ Definition of done

Website index page = SBB README

docs (Setup.md): additional details

Suggested fix

Specify the path by changing echo "source venv/bin/activate" > .envrc to echo "source venv/bin/activate" > PATH TO VENV/ .envrc
Add source venv/bin/activate to activate venv

feat (data): 2 dowload several tweets which mention @user

📖 Describe what you want

Create python code under src/data which is called from python -m src data --download user

Download them under data/raw/user/. We keep ALL the data received by twitter API.

✔️ Definition of done

When tweeets are dowloaded to data/raw/user

docs (contrib): add clemedon as contributor

📖 Describe what you want

Add my github and 42 logins to README.md

✔️ Definition of done

Edit README.md

feat (data): script to request tweets from twitter API

Objective: Build the database and having data on as much days as possible.

📖 Describe what you want

Update the script about dataset to request specific tweets from the API of twitter based on its date or ID.

The script MUST save ALL the tweets received into csv files in the data/raw/twitter directory, with the date and ID of first and last tweet specified (in the name ?). Possible format:

data/raw/twitter/[candidat_name]_[startdate]_[enddate].csv
data/raw/twitter/[candidat_name]_[first_id_tweet]_[last_id_tweet].csv
data/raw/twitter/candidat_name/[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv
data/raw/twitter/week_#x/[candidat_name]_[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv

A particular point must be considered: the script should collect small chunks of results in order to save all the results little by little to avoid issue related to cache memory, disk memory or whatever: Create a tmp directory where the little portions are stored and after that the script

This script should be designed to be launched periodically, every week (or every day?) and collect specified amount of tweets about each candidate. These amounts of tweets per day and per candidate are yet to be determined.

✔️ Definition of done

a functioning script is written.
a format for the filename is chosen,
the script create a tmp directory where it saves small chunks of the total results.
the script concatenate all the chunks into a final csv file.

This script should be FIRST and ONLY tested with small amounts of tweet requested to the API in order to save the amount of tweet we can request: for instance 1k tweets for 2 or 3 candidates. The person testing the script should be careful to check the above points.

This script will be used for larger amounts after the pull request is validated.

docs (contrib): add agilmet as contributor

📖 Add a contributor

Add my name and 42 login to the README.md

✔️ Definition of done

Name and login added to README.md

test (CICD auto attribution project and kaban col): test

This is a test

feat (data): fix error parsing YAML file

📖 Describe what you want

Fix the error that happens when executing the script for downloading tweets about macron:
cannot read file ~/.twitter_keys.yaml Error parsing YAML file; searching for valid environment variables

✔️ Definition of done

No error happens when executing python -m src data --download macron

docs (github page): make the github page show-able

📖 Describe what you want

Update the website template

✔️ Definition of done

Change "Areg Sarkissian" to "Sentimental Big Bro" (top)
Remove "© 2019 Areg Sarkissian" (bottom)
Set a margin
Separate the Readme and the Vizualization into different pages.

docs (contrib): add guillaume-salle as contributor

Describe what you want

Add my name and 42 login to the readme.md

Describe what are the benefits

Update the list of contributors

Describe a test that your feature should pass before merging to master

readme.md should have my name and login

docs (data): Need for documentation for src/data/make_dataset/macron.py

📖 Documentation needed

Write a very small documentation of the function src/data/make_dataset/macron.py

✔️ Definition of done

Documentation is written.

model (deeplearning): Camembert baseline

📖 Describe what you want

Make a deeplearning baseline with Hugging Face and Pytorch.

Relevant tutorials to follow:

Most relevant (using camembert): https://ledatascientist.com/analyse-de-sentiments-avec-camembert/
https://www.kaggle.com/chayan8/sentiment-analysis-using-bert-pytorch
https://skimai.com/fine-tuning-bert-for-sentiment-analysis/

Relevant documentation:

HuggingFace: https://huggingface.co/course/chapter1/1
(Not yet necessary, but still cool to read) PyTorch: https://pytorch.org/tutorials/beginner/basics/intro.html

✔️ Definition of done

Being able to train the model then save it's weights poetry run python -n src models --model camembert --output-weights models/xxxxx --train-split xxx.csv
Save the weights in remote s3 dvc add models/xxxxx git add && git commit dvc push -r s3-remote
Being able to load model weights and predict on a given dataset
Create a unit test which train and predict from a csv of 100 examples which is push to github (in src/tests/test_dataset.csv)
If necessary update the .github/workflows/cicd.yaml and the .42AI/pre-commit.git to pass the new unit test

docs (setup): remove envrc requirement and direnv

📖 Describe what you want

Update the Setup.md file to remove the use of direnv and .envrc file.
Also remove the error when executing .42AI/init.sh after cloning the repo.

✔️ Definition of done

Setup.md is updated and .42AI/init.sh raises no error.

doc (Milstone: Make Readme Great Again): Table of content

📖 Describe what you want

Readme should have a table of content see here

✔️ Definition of done

Readme has a table of content

fix (setup.md): formating visual aspect

📖 Describe what you want

The describa encapsulate everything

✔️ Definition of done

A clean setup.md when viewed on github

feat (github pages): Set up GitHub Pages

📖 Describe what you want

Create a directory dedicated to GitHub Pages + index template

✔️ Definition of done

Available online index template page

doc (Millestone: Make Readme Great Again): Description section

📖 Describe what you want

Readme should have a short description of the project

✔️ Definition of done

Readme has a short description of the project

doc (Milstone: Make Readme Great Again): Description

📖 Describe what you want

Translation of the description section

✔️ Definition of done

There is a translation of the description section

feat (visualization): create first figures from data

📖 Describe what you want

While waiting for our collected twitter dataset to contain the datetime of each tweet we would like to plot our predicted sentiment for each candidate.

There are multiple otpions:

Pie Chart: easy to setup and straight to the point tutorial

RegPlot: More complex but swag tutorial

Wayyyyy more possibilites: for later

✔️ Definition of done

For each candidate, any figure from the top list saved under reports/figures/candidate/type_of_chart_and_name_question_answered.png

ie: if it's a Pie Chart for Macron reports/figures/Macron/pie_chart_sentiment_prediction.png

dvc (data): save tweets to build database & push to DVC

📖 Describe what you want

This issue must be completed only after issue #75 is completly done.

Run the feature built for issue #75 to download locally tweets from twitter API.
Make sure you have enough place available and your machine works correctly before doing so.
Then push these tweets with DVC to the AWS S3 storage service.

Determine the amount of tweet to request per day and per candidate after consultation with the others members of the project.
You should download tweets for one candidate with the start_date: 7 days ago and end_date: 6 days ago.

The .csv file must have at least this start_date and the last tweet ID in its name, like advised in issue #75, in order to be able to complete the data for this day and candidate later if possible.

This dataset should be reproductible : you must save somewhere the query sent to the Twitter API, either in the .csv file or in another file dedicated to it. The idea is to be able to publish this dataset on thr Huggingface Hub, and it requires reproductibility.

✔️ Definition of done

Tweets for ONE candidate and ONE day in the right amount regarding our request capacity are saved in the data/raw/twitter directory (or data/raw/twitter/week_#x/) and pushed to the DVC remote storage.

The script should suggest the user to perform the commands to add and push the data obtained to the dvc remote.

Every data obtained from the twitter API should be pushed to the dvc remote.

The query used to obtain the tweet must be saved.

Request data !!!!!

feat (data): 4 dvc push to remote s3

📖 Describe what you want

When adding dvc parts: add the folders under data/raw and data/processed separately

Able to dvc push the latest version of data to the remote s3 AND being able to pull it

✔️ Definition of done

Able to use DVC data in github action:

ie:

adding dvc pull to .42AI/int.sh
Having a unit_test which needs to have data.

docs (contrib): add athirion as contributor

📖 Add a contributor

Add my github and 42 logins to the README.md

✔️ Definition of done

Edit the README.md

docs (readme): Make Readme great again

📖 Describe what you want

Readme should have a short description of the project
Readme should have a table of content see here
Readme should have the following parts
- What the project is
- How normal people should interact with it
  - Where is the website: SentimentalBB
  - Where they can see the last availble results (reports/figures/)
  - How they can setup their environment (poetry install) such as to predict on tweets (poetry run ...)
- How people can contribute (Setup.md and CONTRIBUTING.md)
  - Associated details to explain things if necessary (what's currently on the readme)
- Contributors

✔️ Definition of done

All the listed element appears on the Readme.md when visiting https://github.com/42-AI/SentimentalBB

docs (dvc): how to setup and use

📖 Describe what you want

In Setup.md add steps to connect to remote s3, and how to add/push/pull data

In .42AI/init.sh

Verify that dvc remote is setup
Verify that aws is installed
Verify that aws is configured
Add a dvc pull

✔️ Definition of done

When steps in Setup.md allow new collaborators to participate, and when .42AI/init.sh gets the latest version of data available

docs (contributor): adding madvid as contributor

📖 Add of contributor

Adding of my login github and 42 in the README.md

✔️ Definition of done

Edition of the README.md.

docs (issues templates): add default project and status

📖 Describe what you want

In .github/ISSUE_TEMPLATE add in the top .yaml part project: SentimentalBB status: Backlog

✔️ Definition of done

When a new issue is created, it does not have the status "No status" and it is linked by default to SentimentalBB

setup (poetry): poetry location

📖 Describe what you want

This issue is due to 42 workspace consideration. B y default poetry is install in ~/HOME/.poetry.
Due to the limitation of the space, it could be interesting to be able to install Poetry in the sgoinfre in our personal repository /sgoinfre/. We could also create a repository 42AI_sbb into the sgoinfre to share the setup of the project.

✔️ Definition of done

be able to run poetry run python -m src ... withpoetry installled into the sgoinfre
Update the documentation to describe how to do this
(bonus) script to install poetry into the sgoinfre

feat (model): 1 fit and predict on the training dataset

📖 Describe what you want

While waiting for the allocine training dataset from the Milestone 3 let's train a model with our temporary dataset.

The temporary dataset is the aclImdb Standford datasaet.

It can be download with sh scripts/download_dataset.sh
The training dataset is then constructed by running python -m src data --download aclImdb

The 1st model:

You should fit() and predict() a sklearn classifier (for example Naive Bayes) on this train dataset.

✔️ Definition of done

When you have your predicted results saved under data/processed/aclImdb/results/classifier_name.csv

Your results should at least contain: y_pred and y_true (as we are using the training dataset, we have access to the correct label associated with each X)

There should should be a unit test which launches the train and predict pipeline, and then calculate the accuracy from this saved file.

data (inference): save tweet datetime info

📖 Describe what you want

For each tweet saved, add the datetime information

✔️ Definition of done

When the dataset created under data/raw/twitter/macron/twitter_Macron_None_None.csv will contain datetime information

feature (data): get the count of tweet written per day per candidate

📖 Describe what you want

Add a feature that requests the number of tweets published on twitter for a given candidate and a given day.
You may also request in addition the number of tweets for each hour of that day.

You should maybe use the granularity parameter of the function that makes querys in searchtweet-v2.

These data will help us decide how much tweet per candidate we should request, and have an overview of the political activity on twitter that we can display on a graph.

This data should be saved the somewhere data folder, maybe add a new data/metrics directory. Then push it to the dvc remote.

In ordre for this data to be reproductible, you should also save the request you made to obtain these answers from twitter API. You can either write the query in the same file or in another file.

✔️ Definition of done

The feature is added.
Obtain locally the answer from twitter for one day 7 days ago on one candidate.
Save also the query you sent to the twitter API.
Push this data in the correct folder to the remote dvc.

feature (data): add a feature to get the output of a model on given raw data

📖 Describe what you want

Add a feature to apply a model defined in src/models to a .csv file with raw data stored in the data/raw/ directory.
Save the output in a .csv file in data/processed/, from the command line.
The new file should contain the same name as the raw data file.

✔️ Definition of done

Feature is done.
Then apply this feature to the data already collected.

feat (model): predict with model on csv

📖 Describe what you want

I want to predict results with a given model on a given csv

✔️ Definition of done

saved results under path/results.csv

data (allocine): use allocine dataset

📖 Describe what you want

Train Model Baseline with HuggingFace allocine dataset https://huggingface.co/datasets/allocine

✔️ Definition of done

Able to pass naive-bayes unit test

doc (Milestone: Make Readme Great Again): Reports directory + template

📖 Describe what you want

Project should have a directory reports/figures/ and a template to describe the result of the experiments.

✔️ Definition of done

Project has a directory reports/figures/
a template is available in reports/

refactor (python dependencies): replacing pip and requirements.txt by poetry

📖 Describe what you want

Change to git hook so it doesnt take 5 minutes (currently donloads dataset everytime)
reinstall python dependencies with poetry
update Setup.md such as there is the correct steps to setup the environment

✔️ Definition of done

git hook take less than 5 seconds to run, except for the 1st time ;)
we can execute pyhton -m src data --download twitter --mention Macron
same as 2.

docs (Setup.md): alternatives commands for 42 ubuntu

Expected Behavior

Create a virtualenv

Current Behavior

python3.8 -m venv venv
The virtual environment was not created successfully because ensurepip is not
available.  On Debian/Ubuntu systems, you need to install the python3-venv
package using the following command.

Suggested fix

python3.8 -m virtualenv venv

feat (website): github pages

📖 Describe what you want

Having a github pages which diplay one image from the repository (for example: .42AI/assets/Step_1.png)

✔️ Definition of done

Able to send an url to another person which shows a webpage with an image in it

fix (setup): template changes

Changes to do:

init.sh .envrc use good python
setup uvicorn
models in .gitignore + create in init.sh

doc (Milestone: Make Readme great again): Section about contribution

📖 Describe what you want

project should have a Contributing.md or similar where one can find how to contribute.

✔️ Definition of done

project has a Contributing.md or similar where one can find how to contribute.

cicd (issue template): testing update issue template

📖 Describe what you want

Update the dev issue template to add the project SentimentalBB by default, into the BackLog status by default also.

✔️ Definition of done

When creating a issue with the dev template, the project and status are set by default.

doc (data collection): Write documentation about data pipeline

📖 Describe what you want

creation and redaction of a documentation about the data collection pipeline

✔️ Definition of done

There is a documentation about the data collection pipeline

docs (spec): Create Specification Document

📖 Describe what you want

Create a specification document detailing everything that the program contained in src is supposed to do.

✔️ Definition of done

doc is created in the root folder
every subpart of the program is defined: the inputs they expect, what they do, the output they output.
every type of dataset in data is defined

feat (data): 3 Download the tweets mentioning a user in a given range

📖 Describe what you want

Create python code under src/data which is called from python -m src data --download --user username --date-from 01/01/2020 --date-to 01/02/2020

Download them under data/raw/user/. We keep ALL the data received by twitter API.

✔️ Definition of done

When tweeets are dowloaded to data/raw/user

feat(data): 1 download a tweet mentioning Emmanuel Macron with a script

📖 Describe what you want

Register a file in src/data, that will download a tweet which mention Emmanuel macron and saves it in data/raw/twitter/user/

✔️ Definition of done

When a tweet will be registered in above folder by running a script in src/data

doc (Milestone: Make Readme Great Again): website /github page

📖 Describe what you want

Project should have a website: SentimentalBB

✔️ Definition of done

Project has a website: SentimentalBB

testing (issue template): testing the project and

Prerequisites

Please answer the following questions for yourself before submitting an issue. YOU MAY DELETE THE PREREQUISITES SECTION.

I am running the latest version
I checked the documentation and found no answer
I checked to make sure that this issue has not already been filed
I'm reporting the issue to the correct repository

Expected Behavior

Please describe the behavior you are expecting

Current Behavior

What is the current behavior?

Failure Information (for bugs)

Please help provide information about the failure if this is a bug.
If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue.

step 1
step 2
you get it...

Context

Please provide any relevant information about your setup.
This is important in case the issue is not reproducible except for under certain conditions.

Data used:
Firmware Version:
Hardware:
Operating System:

Failure Logs

Please include any relevant log snippets or files here.

fix (github actions): install poetry

📖 Describe what you want

install poetry

✔️ Definition of done

worflow completes on master

model (deeplearning): twitter-xlm-roberta model

📖 Describe what you want

Add model twitter-xlm-roberta-base-sentiment from carfifnlp shared on Huggingface.

✔️ Definition of done

Being able to load model and predict on a given dataset
Create a unit test which predicts from a csv of 100 examples which is push to github (in src/tests/test_dataset.csv).
If necessary update the .github/workflows/cicd.yaml end the .42AI/precommit.git to pass the new unit test.