code-charity / sponsorblock-ml

This project forked from xenova/sponsorblock-ml


Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders.

Home Page: https://xenova.github.io/sponsorblock-ml/

License: GNU General Public License v3.0

Python 100.00%

sponsorblock-ml's Introduction

title: Sponsorblock ML
emoji: 🤖
colorFrom: yellow
colorTo: indigo
sdk: streamlit
app_file: app.py
pinned: true

SponsorBlock-ML

Automatically detect in-video YouTube sponsorships, self/unpaid promotions, and interaction reminders. The model was trained using the SponsorBlock database, which is used under the CC BY-NC-SA 4.0 license.

Check out the online demo application at https://xenova.github.io/sponsorblock-ml/, or follow the instructions below to run it locally.


Installation

  1. Download the repository:

    git clone https://github.com/xenova/sponsorblock-ml.git
    cd sponsorblock-ml
  2. Install the necessary dependencies:

    pip install -r requirements.txt
  3. Run the application:

    streamlit run app.py

Predicting

  • Predict for a single video using the --video_id argument. For example:

     python src/predict.py --video_id zo_uoFI1WXM
  • Predict for multiple videos using the --video_ids argument. For example:

     python src/predict.py --video_ids IgF3OX8nT0w ao2Jfm35XeE
  • Predict for a whole channel using the --channel_id argument. For example:

     python src/predict.py --channel_id UCHnyfMqiRRG1u-2MsSQLbXA

Note that on the first run, the program will download the necessary models (which may take some time).
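When predictions are driven from another Python script, the three invocation styles above can be assembled programmatically. The following is a minimal sketch, not part of the repository: the helper name build_predict_command is hypothetical, and it only uses the --video_id, --video_ids and --channel_id flags shown above.

```python
import sys


def build_predict_command(video_ids=None, channel_id=None, script="src/predict.py"):
    """Return the argument list for one predict.py invocation.

    Hypothetical helper: it only assembles the flags documented above
    and does not run anything itself.
    """
    # Exactly one of the two selectors must be given.
    if bool(video_ids) == bool(channel_id):
        raise ValueError("pass exactly one of video_ids or channel_id")

    cmd = [sys.executable, script]
    if channel_id:
        cmd += ["--channel_id", channel_id]
    elif len(video_ids) == 1:
        cmd += ["--video_id", video_ids[0]]
    else:
        cmd += ["--video_ids", *video_ids]
    return cmd


# To execute the command from the repository root:
#   import subprocess
#   subprocess.run(build_predict_command(video_ids=["zo_uoFI1WXM"]), check=True)
```

Building the command as a list (rather than a single string) avoids shell quoting issues when IDs contain characters such as "-" or "_".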


Evaluating

Measuring Accuracy

The evaluation script is primarily used to measure the accuracy (and other metrics) of the model (defaults to Xenova/sponsorblock-small).

python src/evaluate.py

In addition to the calculated metrics, missing and incorrect segments are output, allowing for improvements to be made to the database:

  • Missing segments: Segments which were predicted by the model, but are not in the database.
  • Incorrect segments: Segments which are in the database, but the model did not predict (meaning that the model thinks those segments are incorrect).
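The two categories above can be illustrated with a small sketch. It assumes segments are (start, end) pairs in seconds and that two segments "match" when they overlap at all; both the representation and the matching rule are assumptions made for illustration, and src/evaluate.py may use different criteria.

```python
def overlaps(a, b):
    """Return True if segments a and b, given as (start, end), share any time."""
    return a[0] < b[1] and b[0] < a[1]


def compare_segments(predicted, database):
    """Split disagreements into 'missing' and 'incorrect' segments.

    missing:   predicted by the model, but no overlapping database segment.
    incorrect: present in the database, but no overlapping prediction.
    """
    missing = [p for p in predicted if not any(overlaps(p, d) for d in database)]
    incorrect = [d for d in database if not any(overlaps(d, p) for p in predicted)]
    return missing, incorrect


# Illustrative values only, not real SponsorBlock data.
predicted = [(10.0, 30.0), (100.0, 120.0)]
database = [(12.0, 28.0), (200.0, 220.0)]
missing, incorrect = compare_segments(predicted, database)
print("missing:", missing)
print("incorrect:", incorrect)
```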

Moderation

This can also be used to moderate parts of the database. To moderate the whole database, first run:

python src/preprocess.py --do_process_database --processed_database whole_database.json --min_votes -1 --min_views 0 --min_date 01/01/2000 --max_date 01/01/9999 --keep_duplicate_segments

followed by

python src/evaluate.py --processed_file data/whole_database.json

The --video_ids and --channel_id arguments can also be used here. Remember to keep your database and processed database file up-to-date before running evaluations.
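The two moderation commands share one file name: the value passed to --processed_database must match the file later passed to --processed_file. The sketch below keeps them in sync. The helper name moderation_commands is hypothetical, and the assumption that the processed database is written under data/ is inferred only from the example paths above.

```python
import sys


def moderation_commands(processed_name="whole_database.json", data_dir="data"):
    """Return (preprocess_cmd, evaluate_cmd) as argument lists.

    Hypothetical helper; flags and values are copied from the commands above.
    """
    preprocess = [
        sys.executable, "src/preprocess.py",
        "--do_process_database",
        "--processed_database", processed_name,
        "--min_votes", "-1",
        "--min_views", "0",
        "--min_date", "01/01/2000",
        "--max_date", "01/01/9999",
        "--keep_duplicate_segments",
    ]
    evaluate = [
        sys.executable, "src/evaluate.py",
        # Assumption: preprocess.py writes the processed file into data_dir.
        "--processed_file", f"{data_dir}/{processed_name}",
    ]
    return preprocess, evaluate


# To run both steps in order from the repository root:
#   import subprocess
#   for cmd in moderation_commands():
#       subprocess.run(cmd, check=True)
```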


Training

Preprocessing

  1. Download the SponsorBlock database:

    python src/preprocess.py --update_database
  2. Preprocess the database and generate training, testing and validation data:

    python src/preprocess.py --do_transcribe --do_create --do_generate --do_split --model_name_or_path Xenova/sponsorblock-small
    1. --do_transcribe - Download and parse the transcripts from YouTube.
    2. --do_create - Process the database (removing unwanted and duplicate segments) and create the labelled dataset.
    3. --do_generate - Using the downloaded transcripts and labelled segment data, extract positive (sponsors, unpaid/self-promos and interaction reminders) and negative (normal video content) text segments and create large lists of input and target texts.
    4. --do_split - Using the generated positive and negative segments, split them into training, validation and testing sets (according to the specified ratios).

    Each of the above steps can be run independently (as separate commands, e.g. python src/preprocess.py --do_transcribe), but should be performed in order.

    For more advanced preprocessing options, run python src/preprocess.py --help
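As a rough illustration of what the --do_split step does conceptually, the sketch below shuffles a list of segments and cuts it into training, validation and testing sets by ratio. The function name, the 80/10/10 ratios and the fixed seed are hypothetical; the actual defaults and implementation live in src/preprocess.py and may differ.

```python
import random


def split_dataset(items, train=0.8, valid=0.1, seed=0):
    """Shuffle items and split them into (train, valid, test) lists.

    The test set receives whatever remains after train and valid are taken.
    """
    items = list(items)
    random.Random(seed).shuffle(items)  # seeded so the split is reproducible
    n_train = int(len(items) * train)
    n_valid = int(len(items) * valid)
    return (
        items[:n_train],
        items[n_train:n_train + n_valid],
        items[n_train + n_valid:],
    )


# Illustrative data: 100 placeholder segment identifiers.
train_set, valid_set, test_set = split_dataset(range(100))
print(len(train_set), len(valid_set), len(test_set))
```

Shuffling before cutting helps keep consecutive segments from the same video from all landing in one set, although a per-video split would be needed to rule out leakage entirely.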

Transformer

The transformer is used to extract relevant segments from the transcript and apply a preliminary classification to the extracted text. To start finetuning from the current checkpoint, run:

python src/train.py --model_name_or_path Xenova/sponsorblock-small

If you wish to finetune an original transformer model, use one of the supported models (t5-small, t5-base, t5-large, t5-3b, t5-11b, google/t5-v1_1-small, google/t5-v1_1-base, google/t5-v1_1-large, google/t5-v1_1-xl, google/t5-v1_1-xxl) as the --model_name_or_path. For more information, check out the relevant documentation (t5 or t5v1.1).

Classifier

The classifier is used to add probabilities to the category predictions. Train the classifier using:

python src/train.py --train_classifier --skip_train_transformer
