
sentimentalbb's Issues

docs (Setup.md): additional details

Suggested fix

  • Specify the path by changing echo "source venv/bin/activate" > .envrc to echo "source venv/bin/activate" > PATH_TO_VENV/.envrc
  • Add source venv/bin/activate to activate the venv

feat (data): 2 download several tweets which mention @user

📖 Describe what you want

Create Python code under src/data which is called from python -m src data --download user

Download the tweets under data/raw/user/. We keep ALL the data received from the Twitter API.

✔️ Definition of done

When tweets are downloaded to data/raw/user

feat (data): script to request tweets from twitter API

Objective: build the database and have data for as many days as possible.

📖 Describe what you want

Update the dataset script to request specific tweets from the Twitter API based on date or ID.

The script MUST save ALL the tweets received into CSV files in the data/raw/twitter directory, with the date and ID of the first and last tweet specified (in the filename?). Possible formats:

  • data/raw/twitter/[candidat_name]_[startdate]_[enddate].csv
  • data/raw/twitter/[candidat_name]_[first_id_tweet]_[last_id_tweet].csv
  • data/raw/twitter/candidat_name/[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv
  • data/raw/twitter/week_#x/[candidat_name]_[startdate]_[enddate]_[first_id_tweet]_[last_id_tweet].csv
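Whichever format wins, building the name is a one-liner. A minimal sketch of the third proposed format; the helper name and the ISO-8601 date formatting are illustrative choices, not decided in the issue:

```python
from datetime import date

def chunk_filename(candidate: str, start: date, end: date,
                   first_id: int, last_id: int) -> str:
    # Third proposed format: one directory per candidate, dates and
    # tweet IDs in the file name so the range is self-describing.
    return (f"data/raw/twitter/{candidate}/"
            f"{start.isoformat()}_{end.isoformat()}_{first_id}_{last_id}.csv")
```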

A particular point must be considered: the script should collect small chunks of results and save them little by little, to avoid issues related to cache memory, disk memory, or whatever. Create a tmp directory where the small portions are stored; afterwards the script concatenates them into the final file.

This script should be designed to be launched periodically, every week (or every day?), and to collect a specified amount of tweets about each candidate. The amounts of tweets per day and per candidate are yet to be determined.
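The chunk-then-concatenate idea can be sketched as below. The function name and the batch format (an iterable of lists of dicts with identical keys) are assumptions, not the project's actual API:

```python
import csv
import os
import tempfile

def save_in_chunks(tweet_batches, out_path):
    """Write each batch to its own small CSV in a tmp directory,
    then concatenate every chunk into the final CSV file."""
    tmp_dir = tempfile.mkdtemp(prefix="tweets_")
    chunk_paths = []
    fieldnames = None
    for i, batch in enumerate(tweet_batches):
        if not batch:
            continue
        fieldnames = fieldnames or list(batch[0].keys())
        chunk = os.path.join(tmp_dir, f"chunk_{i:05d}.csv")
        with open(chunk, "w", newline="") as f:
            writer = csv.DictWriter(f, fieldnames=fieldnames)
            writer.writeheader()
            writer.writerows(batch)
        chunk_paths.append(chunk)  # this batch is now safe on disk
    # Concatenation pass: one header, then every chunk's rows in order.
    with open(out_path, "w", newline="") as out:
        writer = csv.DictWriter(out, fieldnames=fieldnames)
        writer.writeheader()
        for chunk in chunk_paths:
            with open(chunk, newline="") as f:
                writer.writerows(csv.DictReader(f))
    return out_path
```

Because each chunk hits the disk before the next API call, a crash mid-run loses at most one batch.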

✔️ Definition of done

  • a functioning script is written,
  • a format for the filename is chosen,
  • the script creates a tmp directory where it saves small chunks of the total results,
  • the script concatenates all the chunks into a final CSV file.

This script should FIRST and ONLY be tested with small amounts of tweets requested from the API, in order to conserve the number of tweets we can request: for instance 1k tweets for 2 or 3 candidates. The person testing the script should be careful to check the above points.

This script will be used for larger amounts after the pull request is validated.

feat (data): fix error parsing YAML file

📖 Describe what you want

Fix the error that happens when executing the script for downloading tweets about Macron:
cannot read file ~/.twitter_keys.yaml Error parsing YAML file; searching for valid environment variables

✔️ Definition of done

No error happens when executing python -m src data --download macron

docs (github page): make the GitHub page presentable

📖 Describe what you want

Update the website template

✔️ Definition of done

  • Change "Areg Sarkissian" to "Sentimental Big Bro" (top)
  • Remove "© 2019 Areg Sarkissian" (bottom)
  • Set a margin
  • Separate the Readme and the Visualization into different pages.

docs (contrib): add guillaume-salle as contributor

Describe what you want

Add my name and 42 login to the readme.md

Describe what are the benefits

Update the list of contributors

Describe a test that your feature should pass before merging to master

readme.md should have my name and login

model (deeplearning): CamemBERT baseline

📖 Describe what you want

Make a deep learning baseline with Hugging Face and PyTorch.

Relevant tutorials to follow:

Relevant documentation:

✔️ Definition of done

  1. Being able to train the model then save its weights: poetry run python -m src models --model camembert --output-weights models/xxxxx --train-split xxx.csv
  2. Save the weights in the remote S3: dvc add models/xxxxx, git add && git commit, dvc push -r s3-remote
  3. Being able to load the model weights and predict on a given dataset
  4. Create a unit test which trains and predicts from a CSV of 100 examples pushed to GitHub (in src/tests/test_dataset.csv)
  5. If necessary, update .github/workflows/cicd.yaml and .42AI/pre-commit.git to pass the new unit test

docs (setup): remove envrc requirement and direnv

📖 Describe what you want

Update the Setup.md file to remove the use of direnv and .envrc file.
Also remove the error when executing .42AI/init.sh after cloning the repo.

✔️ Definition of done

Setup.md is updated and .42AI/init.sh raises no error.

feat (github pages): Set up GitHub Pages

📖 Describe what you want

Create a directory dedicated to GitHub Pages + index template

✔️ Definition of done

Index template page available online

feat (visualization): create first figures from data

📖 Describe what you want

While waiting for our collected Twitter dataset to contain the datetime of each tweet, we would like to plot our predicted sentiment for each candidate.

There are multiple options:

  • Pie Chart: easy to set up and straight to the point tutorial

  • RegPlot: more complex but swag tutorial

✔️ Definition of done

For each candidate, any figure from the list above saved under reports/figures/candidate/type_of_chart_and_name_question_answered.png

e.g. if it's a pie chart for Macron: reports/figures/Macron/pie_chart_sentiment_prediction.png
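A minimal pie-chart sketch with matplotlib, saving under the path convention above. The function name and signature are assumptions; `predictions` stands in for whatever label list the model produces:

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so this also runs in CI
import matplotlib.pyplot as plt
from collections import Counter
from pathlib import Path

def save_sentiment_pie(predictions, candidate, out_dir="reports/figures"):
    """Save a pie chart of predicted sentiment labels for one candidate."""
    counts = Counter(predictions)
    fig, ax = plt.subplots()
    ax.pie(counts.values(), labels=list(counts.keys()), autopct="%1.1f%%")
    ax.set_title(f"Sentiment prediction for {candidate}")
    path = Path(out_dir) / candidate / "pie_chart_sentiment_prediction.png"
    path.parent.mkdir(parents=True, exist_ok=True)  # create reports/figures/<candidate>/
    fig.savefig(path)
    plt.close(fig)
    return path
```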

dvc (data): save tweets to build database & push to DVC

📖 Describe what you want

This issue must be completed only after issue #75 is completely done.

Run the feature built for issue #75 to download tweets locally from the Twitter API.
Make sure you have enough space available and your machine works correctly before doing so.
Then push these tweets with DVC to the AWS S3 storage service.

Determine the number of tweets to request per day and per candidate after consultation with the other members of the project.
You should download tweets for one candidate with start_date: 7 days ago and end_date: 6 days ago.

The .csv file must have at least this start_date and the last tweet ID in its name, as advised in issue #75, in order to be able to complete the data for this day and candidate later if possible.

This dataset should be reproducible: you must save the query sent to the Twitter API somewhere, either in the .csv file or in a dedicated file. The idea is to be able to publish this dataset on the Hugging Face Hub, and that requires reproducibility.
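One way to satisfy the reproducibility requirement is a JSON sidecar file next to each CSV. This is a sketch of the "another file dedicated to it" option; the `.query.json` suffix and function name are assumptions:

```python
import json
from pathlib import Path

def save_query_sidecar(csv_path, query_params):
    """Write the Twitter API query parameters next to the CSV they
    produced, so the dataset can be re-created later."""
    sidecar = Path(csv_path).with_suffix(".query.json")
    sidecar.write_text(json.dumps(query_params, indent=2, sort_keys=True))
    return sidecar
```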

✔️ Definition of done

Tweets for ONE candidate and ONE day, in the right amount given our request capacity, are saved in the data/raw/twitter directory (or data/raw/twitter/week_#x/) and pushed to the DVC remote storage.

The script should prompt the user to run the commands to add and push the obtained data to the DVC remote.

All data obtained from the Twitter API should be pushed to the DVC remote.

The query used to obtain the tweets must be saved.

feat (data): 4 dvc push to remote s3

📖 Describe what you want

When adding DVC parts: add the folders under data/raw and data/processed separately.

Be able to dvc push the latest version of data to the remote S3 AND be able to pull it.

✔️ Definition of done

Able to use DVC data in GitHub Actions:

e.g.:

  • adding dvc pull to .42AI/init.sh
  • having a unit test which needs data.

docs (readme): Make Readme great again

📖 Describe what you want

  • Readme should have a short description of the project

  • Readme should have a table of contents see here

  • Readme should have the following parts

    • What the project is
    • How normal people should interact with it
      • Where the website is: SentimentalBB
      • Where they can see the latest available results (reports/figures/)
      • How they can set up their environment (poetry install) so as to predict on tweets (poetry run ...)
    • How people can contribute (Setup.md and CONTRIBUTING.md)
      • Associated details to explain things if necessary (what's currently in the readme)
    • Contributors

✔️ Definition of done

All the listed elements appear on the Readme.md when visiting https://github.com/42-AI/SentimentalBB

docs (dvc): how to setup and use

📖 Describe what you want

In Setup.md, add the steps to connect to the remote S3, and how to add/push/pull data

In .42AI/init.sh

  • Verify that dvc remote is setup
  • Verify that aws is installed
  • Verify that aws is configured
  • Add a dvc pull

✔️ Definition of done

When the steps in Setup.md allow new collaborators to participate, and when .42AI/init.sh gets the latest version of the data available

docs (issues templates): add default project and status

📖 Describe what you want

In .github/ISSUE_TEMPLATE, add project: SentimentalBB and status: Backlog to the top YAML part
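The requested front matter might look like the fragment below. The project/status keys are taken verbatim from the issue; name and about values are placeholders, and whether GitHub honors each key depends on the template type, so check the template documentation:

```yaml
---
name: dev
about: Development issue template
project: SentimentalBB
status: Backlog
---
```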

✔️ Definition of done

When a new issue is created, it does not have the status "No status" and it is linked by default to SentimentalBB

setup (poetry): poetry location

📖 Describe what you want

This issue is due to 42 workspace considerations. By default, Poetry is installed in $HOME/.poetry.
Due to the space limitation, it could be interesting to be able to install Poetry in the sgoinfre, in our personal directory /sgoinfre/. We could also create a directory 42AI_sbb in the sgoinfre to share the setup of the project.

✔️ Definition of done

  • be able to run poetry run python -m src ... with Poetry installed into the sgoinfre
  • update the documentation to describe how to do this
  • (bonus) a script to install Poetry into the sgoinfre

feat (model): 1 fit and predict on the training dataset

📖 Describe what you want

While waiting for the Allocine training dataset from Milestone 3, let's train a model with our temporary dataset.

The temporary dataset is the aclImdb Stanford dataset.

  • It can be downloaded with sh scripts/download_dataset.sh
  • The training dataset is then constructed by running python -m src data --download aclImdb

The 1st model:

  • You should fit() and predict() a sklearn classifier (for example Naive Bayes) on this training dataset.
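A minimal sketch of such a baseline: a bag-of-words Naive Bayes fit on the training texts, with predictions written out as requested. The helper name and the toy data are illustrative stand-ins for the aclImdb CSVs:

```python
import csv
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

def fit_predict_to_csv(texts, labels, out_path):
    """Fit a Naive Bayes classifier, predict on the training set,
    and save y_pred alongside y_true."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(texts, labels)
    y_pred = model.predict(texts)
    with open(out_path, "w", newline="") as f:
        writer = csv.writer(f)
        writer.writerow(["y_pred", "y_true"])
        writer.writerows(zip(y_pred, labels))
    return out_path
```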

✔️ Definition of done

When you have your predicted results saved under data/processed/aclImdb/results/classifier_name.csv

Your results should at least contain y_pred and y_true (as we are using the training dataset, we have access to the correct label associated with each X).

There should be a unit test which launches the train and predict pipeline, and then calculates the accuracy from this saved file.

data (inference): save tweet datetime info

📖 Describe what you want

For each tweet saved, add the datetime information

✔️ Definition of done

When the dataset created under data/raw/twitter/macron/twitter_Macron_None_None.csv contains datetime information

feature (data): get the count of tweets written per day per candidate

📖 Describe what you want

Add a feature that requests the number of tweets published on Twitter for a given candidate and a given day.
You may also request, in addition, the number of tweets for each hour of that day.

You should probably use the granularity parameter of the function that makes queries in searchtweets-v2.

These data will help us decide how many tweets per candidate we should request, and give an overview of the political activity on Twitter that we can display on a graph.

This data should be saved somewhere in the data folder; maybe add a new data/metrics directory. Then push it to the DVC remote.

In order for this data to be reproducible, you should also save the request you made to obtain these answers from the Twitter API. You can write the query either in the same file or in another file.
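For reference, the per-day aggregation the API would do with day granularity can be reproduced offline from tweet `created_at` timestamps, which is also a handy sanity check on the returned counts. The function name is illustrative:

```python
from collections import Counter
from datetime import datetime

def tweets_per_day(created_at_timestamps):
    """Count tweets per calendar day from ISO 8601 created_at strings,
    e.g. "2022-03-01T10:00:00Z" -> bucket "2022-03-01"."""
    days = Counter()
    for ts in created_at_timestamps:
        # fromisoformat (before 3.11) rejects the trailing "Z", so map it
        # to an explicit UTC offset first.
        day = datetime.fromisoformat(ts.replace("Z", "+00:00")).date()
        days[day.isoformat()] += 1
    return dict(days)
```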

✔️ Definition of done

  1. The feature is added.
  2. Obtain locally the answer from Twitter for one day, 7 days ago, for one candidate.
  3. Also save the query you sent to the Twitter API.
  4. Push this data, in the correct folder, to the DVC remote.

feature (data): add a feature to get the output of a model on given raw data

📖 Describe what you want

Add a feature to apply a model defined in src/models to a .csv file of raw data stored in the data/raw/ directory.
Save the output, from the command line, in a .csv file in data/processed/.
The new file should have the same name as the raw data file.
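Keeping the same file name while moving the output under data/processed/ amounts to swapping one path component. A small sketch, assuming the input always lives under data/raw/:

```python
from pathlib import Path

def processed_path(raw_csv):
    """Map a data/raw/... CSV path to its data/processed/ counterpart,
    keeping the same file name."""
    parts = list(Path(raw_csv).parts)
    i = parts.index("raw")  # raises ValueError if not under a raw/ directory
    parts[i] = "processed"
    return Path(*parts)
```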

✔️ Definition of done

Feature is done.
Then apply this feature to the data already collected.

feat (model): predict with model on csv

📖 Describe what you want

I want to predict results with a given model on a given csv

✔️ Definition of done

Results are saved under path/results.csv

refactor (python dependencies): replacing pip and requirements.txt by poetry

📖 Describe what you want

  1. Change the git hook so it doesn't take 5 minutes (currently downloads the dataset every time)
  2. Reinstall the Python dependencies with poetry
  3. Update Setup.md so that it contains the correct steps to set up the environment

✔️ Definition of done

  1. The git hook takes less than 5 seconds to run, except for the 1st time ;)
  2. We can execute python -m src data --download twitter --mention Macron
  3. Same as 2.

docs (Setup.md): alternatives commands for 42 ubuntu

Expected Behavior

Create a virtualenv

Current Behavior

python3.8 -m venv venv
The virtual environment was not created successfully because ensurepip is not
available.  On Debian/Ubuntu systems, you need to install the python3-venv
package using the following command.

Suggested fix

python3.8 -m virtualenv venv

feat (website): github pages

📖 Describe what you want

Have a GitHub Pages site which displays one image from the repository (for example: .42AI/assets/Step_1.png)

✔️ Definition of done

Able to send a URL to another person which shows a webpage with an image in it

cicd (issue template): test the updated issue template

📖 Describe what you want

Update the dev issue template to add the project SentimentalBB by default, and the Backlog status by default as well.

✔️ Definition of done

When creating an issue with the dev template, the project and status are set by default.

docs (spec): Create Specification Document

📖 Describe what you want

Create a specification document detailing everything that the program contained in src is supposed to do.

✔️ Definition of done

  • the doc is created in the root folder
  • every subpart of the program is defined: the inputs they expect, what they do, the outputs they produce
  • every type of dataset in data is defined

feat (data): 3 Download the tweets mentioning a user in a given range

📖 Describe what you want

Create Python code under src/data which is called from python -m src data --download --user username --date-from 01/01/2020 --date-to 01/02/2020

Download the tweets under data/raw/user/. We keep ALL the data received from the Twitter API.
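The command-line interface described above can be sketched with argparse. Flag names are taken from the issue; the subcommand layout and the string date handling are assumptions:

```python
import argparse

def build_parser():
    """Parser for `python -m src data --download --user ... --date-from ... --date-to ...`."""
    parser = argparse.ArgumentParser(prog="src")
    sub = parser.add_subparsers(dest="command")
    data = sub.add_parser("data")
    data.add_argument("--download", action="store_true")
    data.add_argument("--user")
    # argparse maps --date-from/--date-to to date_from/date_to via dest
    data.add_argument("--date-from", dest="date_from")
    data.add_argument("--date-to", dest="date_to")
    return parser
```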

✔️ Definition of done

When tweets are downloaded to data/raw/user

testing (issue template): testing the project and

Prerequisites

Please answer the following questions for yourself before submitting an issue. YOU MAY DELETE THE PREREQUISITES SECTION.

  • I am running the latest version
  • I checked the documentation and found no answer
  • I checked to make sure that this issue has not already been filed
  • I'm reporting the issue to the correct repository

Expected Behavior

Please describe the behavior you are expecting

Current Behavior

What is the current behavior?

Failure Information (for bugs)

Please help provide information about the failure if this is a bug.
If it is not a bug, please remove the rest of this template.

Steps to Reproduce

Please provide detailed steps for reproducing the issue.

  1. step 1
  2. step 2
  3. you get it...

Context

Please provide any relevant information about your setup.
This is important in case the issue is not reproducible except for under certain conditions.

  • Data used:
  • Firmware Version:
  • Hardware:
  • Operating System:

Failure Logs

Please include any relevant log snippets or files here.

model (deeplearning): twitter-xlm-roberta model

📖 Describe what you want

Add the model twitter-xlm-roberta-base-sentiment from cardiffnlp, shared on Hugging Face.

✔️ Definition of done

  1. Being able to load the model and predict on a given dataset
  2. Create a unit test which predicts from a CSV of 100 examples pushed to GitHub (in src/tests/test_dataset.csv).
  3. If necessary, update .github/workflows/cicd.yaml and .42AI/pre-commit.git to pass the new unit test.
