
cta-zero9-zaic2022-lyric-alignment's Introduction

Zalo AI Challenge 2022 - Lyric Alignment

Introduction | Dataset | Evaluation Metric | Solutions < MFA | Wav2vec > | Leaderboard

Team members:

Introduction

Problem statement

Many of us love to sing along with songs the way our favorite singers perform them on their albums (karaoke style). To make that possible, we need to remove the singer's vocals from the song and then display the lyrics in time with the accompaniment. There are various tools for removing vocals, but aligning the lyrics with the song is hard.

In this challenge, participants build a model that aligns lyrics with a music audio segment.

  • Input: a music segment (including vocal) and its lyrics.

  • Output: start-time and end-time of each word in the lyrics.

Dataset

[Training data]: 1057 music segments from ~ 480 songs.

Each segment comes with an audio file in WAV format and a ground-truth JSON file that contains the lyrics and the aligned time frame of each word.

[Testing data]:

  • Public test: 264 music segments from ~ 120 songs without aligned lyric files.

  • Private test: 464 music segments from ~ 200 songs without aligned lyric files.

Evaluation Metric

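Prediction accuracy is evaluated using Intersection over Union (IoU).

The IoU between the prediction and the ground truth of an audio segment ($s_i$) is computed by the following formula:

$$\mathrm{IoU}(s_i) = \frac{1}{m} \sum_{j=1}^{m} \frac{|G_j \cap P_j|}{|G_j \cup P_j|}$$

where $m$ is the number of tokens of $s_i$, and $G_j$ and $P_j$ are the ground-truth and predicted time intervals of the $j$-th token.

[Figure: IoU]

The final IoU across all $n$ audio segments is the average of their corresponding IoUs:

$$\mathrm{IoU}_{\mathrm{final}} = \frac{1}{n} \sum_{i=1}^{n} \mathrm{IoU}(s_i)$$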
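The metric can be sketched in a few lines of Python. This is a minimal illustration, not the organizers' scoring script: it assumes each token is a (start, end) pair on a shared time axis, that prediction and ground truth list the same tokens in the same order, and that the per-segment score is the mean of the per-token IoUs. The function names are hypothetical.

```python
def interval_iou(gt, pred):
    """IoU of two (start, end) time intervals."""
    gt_start, gt_end = gt
    pred_start, pred_end = pred
    # Overlap length is zero when the intervals are disjoint.
    intersection = max(0.0, min(gt_end, pred_end) - max(gt_start, pred_start))
    union = (gt_end - gt_start) + (pred_end - pred_start) - intersection
    return intersection / union if union > 0 else 0.0


def segment_iou(gt_tokens, pred_tokens):
    """Mean per-token IoU over the m tokens of one audio segment."""
    if len(gt_tokens) != len(pred_tokens):
        raise ValueError("prediction and ground truth must have the same tokens")
    if not gt_tokens:
        return 0.0
    scores = [interval_iou(g, p) for g, p in zip(gt_tokens, pred_tokens)]
    return sum(scores) / len(scores)


def final_iou(segments):
    """Average of the segment IoUs over all n (gt_tokens, pred_tokens) pairs."""
    scores = [segment_iou(g, p) for g, p in segments]
    return sum(scores) / len(scores) if scores else 0.0


# Two tokens: the first is half covered, the second matches exactly.
gt = [(0, 1000), (1000, 2000)]
pred = [(0, 500), (1000, 2000)]
print(segment_iou(gt, pred))  # 0.75
```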

Solutions

Montreal Forced Aligner

The Montreal Forced Aligner is a command line utility for performing forced alignment of speech datasets using Kaldi (http://kaldi-asr.org/).

For details: MFA folder

Baseline on Public Test

  • Use pre-trained model.
  • Use upper case for split sequence.
  • Add ending comma in each sequence.

Result:

Wav2Vec

The model first processes the raw waveform of the speech audio with a multilayer convolutional neural network to get latent audio representations of 25ms each. These representations are then fed into a quantizer as well as a transformer. The quantizer chooses a speech unit for the latent audio representation from an inventory of learned units. About half the audio representations are masked before being fed into the transformer. The transformer adds information from the entire audio sequence. Finally, the output of the transformer is used to solve a contrastive task. This task requires the model to identify the correct quantized speech units for the masked positions.

Source: https://ai.facebook.com/blog/wav2vec-20-learning-the-structure-of-speech-from-raw-audio/

For details: Vietnamese-wav2vec2 folder

Our pipeline

#1. Separate vocal
  • We use Demucs, from the Facebook Research team, to separate the vocals and remove noise from the raw audio.
  • Use the pre-trained model mdx_extra.
#2. Pre-process

  • Resample the raw audio to a 16 kHz sample rate and convert it to mono channel.
  • Strip special characters and convert the lyrics to lower case.
  • Convert numbers in the lyrics to their spelled-out text.
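The text side of this pre-processing can be sketched with the standard library alone. This is an assumption-laden illustration, not the repository's code: `normalize_lyric` and the `number_words` lookup table are hypothetical, and a real pipeline would need a complete number-to-words converter for Vietnamese.

```python
import re
import unicodedata


def normalize_lyric(text, number_words=None):
    """Lower-case a lyric line, drop special characters, spell out numbers."""
    # NFC keeps Vietnamese letters with diacritics as single code points.
    text = unicodedata.normalize("NFC", text).lower()
    # Replace everything that is not a letter, digit or whitespace.
    text = re.sub(r"[^\w\s]", " ", text).replace("_", " ")
    tokens = text.split()
    if number_words:
        # Hypothetical lookup table, e.g. {"2": "hai"}.
        tokens = [number_words.get(token, token) for token in tokens]
    return " ".join(tokens)


print(normalize_lyric("Xin chào, Việt Nam!"))  # xin chào việt nam
```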

#3. Model MFA | wav2vec

  • We convert the inference output of both models into the same format: a list of dictionaries with 3 keys (word, start time, and end time).
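One way to picture the shared format is a pair of small adapters, one per model. The sketch below is illustrative only: the key names (`word`, `start`, `end`), the input shapes, and the 20 ms frame duration are assumptions, not the repository's actual interfaces.

```python
def from_intervals(intervals):
    """Adapt (start_sec, end_sec, label) word intervals, such as those in a
    forced-aligner word tier, skipping empty (silence) labels."""
    return [
        {"word": label, "start": start, "end": end}
        for start, end, label in intervals
        if label.strip()
    ]


def from_frames(frame_spans, seconds_per_frame=0.02):
    """Adapt (word, first_frame, last_frame) spans from a frame-level model.
    seconds_per_frame is an assumed frame duration."""
    return [
        {
            "word": word,
            "start": first * seconds_per_frame,
            "end": (last + 1) * seconds_per_frame,
        }
        for word, first, last in frame_spans
    ]


words = from_intervals([(0.0, 0.5, "xin"), (0.5, 0.6, ""), (0.6, 1.0, "chào")])
print(words)
```

With both adapters producing the same list of dictionaries, the post-processing step does not need to know which model produced the alignment.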

#4. Post-process

  • Merge the time-step between 2 consecutive words.
  • Map the output onto the ground-truth label format (JSON).
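The merge step is not spelled out here, so the following is only one plausible reading: close the silent gap between two consecutive words by extending the first word's end time to the next word's start time. The function name, the dictionary keys, and the optional `max_gap` threshold are all hypothetical.

```python
def merge_gaps(words, max_gap=None):
    """Extend each word's end to the next word's start when there is a gap.

    words: list of {"word", "start", "end"} dictionaries sorted by start time.
    max_gap: if given, gaps longer than this are left untouched.
    """
    merged = [dict(word) for word in words]  # copy, leave the input unchanged
    for current, following in zip(merged, merged[1:]):
        gap = following["start"] - current["end"]
        if gap > 0 and (max_gap is None or gap <= max_gap):
            current["end"] = following["start"]
    return merged


aligned = [
    {"word": "xin", "start": 0.0, "end": 0.4},
    {"word": "chào", "start": 0.5, "end": 1.0},
]
print(merge_gaps(aligned)[0]["end"])  # 0.5
```

The intuition behind this reading is that karaoke-style labels usually run from one word straight into the next, so leaving silence unassigned would lower the per-token IoU.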

Running pipeline with bash script

sh ./predict.sh

Improvement

  • Separate the vocals to remove noise.
  • Merge the time-step between 2 consecutive words.
  • Fine-tune hyper-parameters for the Vietnamese dataset.
  • Adapt a new dataset from Zing MP3.

Leaderboard

Our final ranking is 9th out of 129 teams in this challenge.

Link: https://challenge.zalo.ai/portal/lyric-alignment/leaderboard
