

audacitorch

This package contains utilities for prepping PyTorch audio models for use in Audacity. More specifically, it provides abstract classes for you to wrap your waveform-to-waveform and waveform-to-labels models (see the Deep Learning for Audacity website to learn more about deep learning models for Audacity).


Our work has not yet been merged to the main build of Audacity, though it will be soon. You can keep track of its progress by viewing our pull request. In the meantime, you can download an alpha version of Audacity + Deep Learning here.

Installing

You can install audacitorch using pip:

pip install audacitorch

Supported Torch versions

audacitorch requires that your model be able to run in Torch 1.9.0, as that is the version used by Audacity's TorchScript interpreter.
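
If you want to double-check which Torch version your environment will serialize against, a one-line check (just a sanity check, not an audacitorch requirement) is:

import torch
print(torch.__version__)  # the Audacity TorchScript interpreter targets 1.9.0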

Deep Learning Effect and Analyzer

Audacity is equipped with a wrapper framework for deep learning models written in PyTorch. Audacity contains two deep learning tools: Deep Learning Effect and Deep Learning Analyzer.
Deep Learning Effect performs waveform-to-waveform processing and is useful for audio-in, audio-out tasks (such as source separation, voice conversion, style transfer, amplifier emulation, etc.), while Deep Learning Analyzer performs waveform-to-labels processing and is useful for annotation tasks (such as sound event detection, musical instrument recognition, automatic speech recognition, etc.). audacitorch provides two abstract classes for serializing these two types of models: WaveformToWaveformBase for waveform-to-waveform models and WaveformToLabelsBase for waveform-to-labels models.

As shown in the effect diagram, waveform-to-waveform models receive a single multichannel audio track as input and may write to a variable number of new audio tracks as output.

Example models for waveform-to-waveform effects include source separation, neural upsampling, guitar amplifier emulation, generative models, etc. Output tensors for waveform-to-waveform models must be multichannel waveform tensors with shape (num_output_channels, num_samples). For every audio waveform in the output tensor, a new audio track is created in the Audacity project.
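
For example, a hypothetical two-stem source separator applied to one second of 48 kHz audio would produce an output shaped as follows (a sketch with placeholder values, not audacitorch code):

import torch

# hypothetical output of a two-stem separator: one row per output source
num_output_channels, num_samples = 2, 48000
output = torch.zeros(num_output_channels, num_samples)  # shape: (2, 48000)
# Audacity creates one new audio track per row of this tensor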

As shown in the analyzer diagram, waveform-to-labels models receive a single multichannel audio track as input and write their predictions to an output label track. The waveform-to-labels tool can be used for many audio analysis applications, such as voice activity detection, sound event detection, musical instrument recognition, automatic speech recognition, etc. The output of a waveform-to-labels model must be a tuple of two tensors. The first tensor contains the class indexes for each label present in the waveform and has shape (num_timesteps,). The second tensor contains the start and stop times for each label and has shape (num_timesteps, 2).
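
As a sketch, a waveform-to-labels output describing three detected events might look like this (the values and the number of events are placeholders):

import torch

class_indexes = torch.tensor([0, 2, 1])      # shape: (num_timesteps,); indexes into the model's label list
timestamps = torch.tensor([[0.0, 1.5],
                           [1.5, 3.0],
                           [3.0, 4.2]])      # shape: (num_timesteps, 2); start and stop time for each label
output = (class_indexes, timestamps)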

What If My Model Uses a Spectrogram as Input/Output?

If your model uses a spectrogram as input/output, you'll need to wrap your forward pass with some TorchScript-compatible preprocessing/postprocessing. We recommend using torchaudio, writing your own preprocessing transforms in their own nn.Module, or writing your PyTorch-only preprocessing and placing it in WaveformToWaveformBase.do_forward_pass or WaveformToLabelsBase.do_forward_pass. See the compatibility section for more info.
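
As a rough sketch (not audacitorch's own code), a spectrogram-domain model could be wrapped like this, assuming self.model maps a magnitude spectrogram to a magnitude spectrogram and that these operations script cleanly in your Torch version:

import torch
from audacitorch import WaveformToWaveformBase

class SpectrogramModelWrapper(WaveformToWaveformBase):
    # hypothetical wrapper: the STFT preprocessing and inverse-STFT postprocessing
    # live inside do_forward_pass, so the serialized model stays waveform-to-waveform
    def do_forward_pass(self, x: torch.Tensor) -> torch.Tensor:
        window = torch.hann_window(1024)
        spec = torch.stft(x, n_fft=1024, hop_length=256, window=window, return_complex=True)
        mag, phase = spec.abs(), spec.angle()
        mag = self.model(mag)  # spectrogram-domain processing
        return torch.istft(torch.polar(mag, phase), n_fft=1024, hop_length=256,
                           window=window, length=x.shape[-1])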

Once you have chosen the appropriate audacitorch base class for your model (Deep Learning Effect or Deep Learning Analyzer), you will need to create a model file for use in Audacity. This model file allows your trained model to be executed in Audacity.

There are several ways to create a file for an executable deep learning model. The purpose of serializing the model into a file is to enable our C++ code to execute your model. To serialize a model, our framework uses files generated by TorchScript. Note that TorchScript does not support model training. When investigating TorchScript, you may also come across LibTorch, PyTorch's C++ API. LibTorch contains the core components of PyTorch and allows TorchScript files to be executed in C++; however, you do not need to interact with LibTorch directly to serialize your model.

TorchScript enables the serialization of PyTorch code and is included with the PyTorch module; no additional packages are required. Note that the deep learning tools for Audacity do not currently support models running on the GPU. More information on TorchScript can be found in the PyTorch documentation.
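
Since the Audacity tools run models on the CPU, it is good practice (general PyTorch advice, not an audacitorch-specific API) to move your model to the CPU and put it in evaluation mode before serializing:

# `model` is a placeholder for your trained nn.Module
model = model.eval().cpu()  # Audacity runs serialized models on the CPU only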

TorchScript features a JIT module, where JIT stands for Just-In-Time Compiler. The TorchScript JIT analyzes PyTorch code and translates it into TorchScript. There are two methods for converting PyTorch code into TorchScript:

  • Tracing: torch.jit.trace constructs the computational graph of your model by tracing the path of sample inputs through your model.

  • Scripting: torch.jit.script parses your PyTorch code to compile a graph for TorchScript. Scripting is the more robust method for generating a TorchScript version of your model, as tracing can overlook logic embedded in the Python code of your model (such as data-dependent control flow).

These two approaches can be combined, with more information available in the TorchScript documentation. We recommend using TorchScript scripting whenever possible for more robust model serialization.
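
To see why scripting is generally more robust, consider this small sketch (not from audacitorch) of a module with data-dependent control flow; tracing records only the branch taken for the example input, while scripting preserves both branches:

import torch
import torch.nn as nn

class Gate(nn.Module):
    # attenuate loud signals, pass quiet ones through unchanged
    def forward(self, x: torch.Tensor) -> torch.Tensor:
        if bool(x.abs().mean() > 0.5):
            return x * 0.5
        return x

scripted = torch.jit.script(Gate())                      # keeps the if/else
traced = torch.jit.trace(Gate(), torch.zeros(1, 1024))   # records only the "quiet" branch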

Serializing a model can be a challenging task with many unique edge cases. To help you navigate this process, we have provided several examples.

Model Metadata

Certain details about the model, such as its sample rate, tool type (e.g. waveform-to-waveform or waveform-to-labels), list of labels, etc., must be provided by the model contributor in a separate metadata.json file. To help users choose the correct model for their task, model contributors are asked to provide a short and a long description of the model, the target domain of the model (e.g. speech, music, environmental, etc.), and a list of tags or keywords as part of the metadata. Note that you do not need to create the metadata file manually; we provide utility functions to automatically create and test metadata files from a Python dictionary. For an example of creating a metadata file from a Python dictionary, see here.

Metadata Spec

required fields:

  • sample_rate (int)
    • range (0, 396000)
    • Model sample rate. Input tracks will be resampled to this value.
  • domains (List[str])
    • List of data domains for the model. The list should contain any of the following strings (any others will be ignored): ["music", "speech", "environmental", "other"]
  • short_description (str)
    • max 60 chars
    • Short description of the model. It should contain a brief message describing the model's purpose, e.g. "Use me for separating vocals from the background!".
  • long_description (str)
    • max 280 chars
    • Long description of the model. Shown in the detailed view of the model UI.
  • tags (List[str])
    • List of tags (shown in the detailed view)
    • Each tag should be at most 15 characters
    • Max 5 tags per model
  • labels (List[str])
    • Output labels for the model. Depending on the effect type, this field means different things:
    • waveform-to-waveform
      • Name of each output source (e.g. drums, bass, vocals). Each label is appended to the mixture track's name to create the track name for the corresponding output source.
    • waveform-to-labels
      • The class list for the model. The class indexes output by the model during a forward pass are used to index into this class list.
  • effect_type (str)
    • Target effect for this model. Must be one of ["waveform-to-waveform", "waveform-to-labels"].
  • multichannel (bool)
    • If multichannel is set to true, stereo tracks are passed to the model as multichannel audio tensors, with shape (2, n). Note that this means the input could be either a mono track with shape (1, n) or a stereo track with shape (2, n).
    • If multichannel is set to false, stereo tracks are downmixed, meaning that the input audio tensor will always be shape (1, n).

By default, users have to click the Add From HuggingFace button in the Audacity Model Manager and enter the desired repo's ID to install a community-contributed model. If you would instead like your community-contributed model to show up in Audacity's Model Manager by default, please open a request here.

Here's a minimal example for a model that simply boosts volume by multiplying the incoming audio by a factor of 2.

We can sum up the whole process into 4 steps:

  1. Developing your model
  2. Wrapping your model using audacitorch
  3. Creating a metadata document
  4. Exporting to HuggingFace

First, we create our model. There are no constraints on the internal model architecture, as long as you can serialize it with torch.jit.script or torch.jit.trace and it meets the input-output constraints specified for waveform-to-waveform and waveform-to-labels models.

import torch
import torch.nn as nn

class MyVolumeModel(nn.Module):

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # do the neural net magic!
        x = x * 2

        return x

PyTorch makes it easy to deploy your Python models in C++ using TorchScript, an intermediate representation format for torch models that can be called from C++. Many of Python's built-in functions are supported by TorchScript. However, not all Python operations are supported in the TorchScript environment, meaning that you may only use a subset of Python operations in your model code. See the torch.jit docs to learn more about writing TorchScript-compatible code.

If your model computes spectrograms (or requires any other kind of preprocessing/postprocessing), make sure those operations are compatible with TorchScript, as torchaudio's operations are.
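
For instance, a TorchScript-friendly log-mel front end could be built entirely from torchaudio transforms (a sketch; the parameter values here are arbitrary):

import torch
import torch.nn as nn
import torchaudio

class LogMel(nn.Module):
    # torchaudio transforms are nn.Modules, so they can live inside your model
    # and be scripted along with it
    def __init__(self, sample_rate: int = 48000):
        super().__init__()
        self.melspec = torchaudio.transforms.MelSpectrogram(sample_rate=sample_rate, n_mels=80)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return torch.log(self.melspec(x) + 1e-5)

scripted = torch.jit.script(LogMel())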


Now, we create a wrapper class for our model. Because our model returns an audio waveform as output, we'll use WaveformToWaveformBase as our parent class. For both WaveformToWaveformBase and WaveformToLabelsBase, we need to implement the do_forward_pass method with our processing code. See the docstrings for more details.

from audacitorch import WaveformToWaveformBase

class MyVolumeModelWrapper(WaveformToWaveformBase):
    
    def do_forward_pass(self, x: torch.Tensor) -> torch.Tensor:
        
        # do any preprocessing here! 
        # expect x to be a waveform tensor with shape (n_channels, n_samples)

        output = self.model(x)

        # do any postprocessing here!
        # the return value should be a multichannel waveform tensor with shape (n_channels, n_samples)
    
        return output

Audacity models need a metadata file. See the metadata spec to learn about the required fields.

metadata = {
    'sample_rate': 48000, 
    'domain_tags': ['music', 'speech', 'environmental'],
    'short_description': 'Use me to boost volume by 6 dB :).',
    'long_description':  'This description can be a max of 280 characters.',
    'tags': ['volume boost'],
    'labels': ['boosted'],
    'effect_type': 'waveform-to-waveform',
    'multichannel': False,
}

All set! We can now proceed to serialize the model to torchscript and save the model, along with its metadata.

from pathlib import Path
from audacitorch.utils import save_model, validate_metadata, \
                              get_example_inputs, test_run

# create a root dir for our model
root = Path('booster-net')
root.mkdir(exist_ok=True, parents=True)

# get our model
model = MyVolumeModel()

# wrap it
wrapper = MyVolumeModelWrapper(model)

# serialize it using torch.jit.script, torch.jit.trace,
# or a combination of both. 

# option 1: torch.jit.script 
# using torch.jit.script is preferred for most cases, 
# but may require changing a lot of source code
serialized_model = torch.jit.script(wrapper)

# option 2: torch.jit.trace
# using torch.jit.trace is typically easier, but you
# need to be extra careful that your serialized model behaves 
# properly after tracing
example_inputs = get_example_inputs()
serialized_model = torch.jit.trace(wrapper, example_inputs[0], 
                                    check_inputs=example_inputs)

# take your model for a test run!
test_run(serialized_model)

# check that we created our metadata correctly
success, msg = validate_metadata(metadata)
assert success

# save!
save_model(serialized_model, metadata, root)

At this point, your directory structure should look like this:

/booster-net/
/booster-net/model.pt
/booster-net/metadata.json

This will become the HuggingFace repository for your Audacity model.

Inside your HuggingFace repository, make sure to create a README.md file. After doing this, your directory structure should now look like:

/booster-net/
/booster-net/model.pt
/booster-net/metadata.json
/booster-net/README.md

In your HuggingFace repository's README.md file, you'll need to add an audacity tag in the YAML metadata. This ensures that your model will appear under the "Explore" tab in Audacity's Deep Learning Tools. To add the audacity tag, insert the following lines at the top of your README.md file:

---
tags:
- audacity
---

Great job! Now it's time to push your changes to HuggingFace. For more information on adding a model to the HuggingFace model hub, check out their documentation.

After serializing, you may need to debug your model inside Audacity to make sure that it handles inputs correctly, doesn't crash while processing, and produces the correct output. While debugging, make sure your model isn't available to other users through the Explore HuggingFace button by temporarily removing the audacity tag from your README file. If your model fails internally while processing audio, Audacity will display an error message.

To debug, you can access the error logs through the Help menu, in Help->Diagnostics->Show Log.... Any torchscript errors that may occur during the forward pass will be redirected here.

  • Demucs Denoiser: In this example, we guide you through implementing the do_forward_pass method of the WaveformToWaveformBase class, serializing the Demucs denoiser using the TorchScript scripting method, creating the model metadata (covered above), and uploading to HuggingFace. We also illustrate how, in some instances, you may need to modify the original model code to serialize it properly; this is done in the notebooks/denoiser/denoiser.py file. The model's source code is included for your reference.

  • FCNF0++ Pitch Estimation: In this example, we guide you through implementing the get_timestamps and do_forward_pass methods of the WaveformToLabelsBase class, serializing the FCNF0++ pitch estimator using the TorchScript scripting method, and creating the model metadata. The model's source code is also provided for your reference, located in notebooks/pitch/pitch.py.

  • Asteroid Source Separation Model: For this example, we download a pretrained model from the Asteroid Python module, create metadata for the model, inherit from the WaveformToWaveformBase class, show you how to trace the model with dummy inputs, and demonstrate how to script the model.

  • S2T-MEDIUM-LIBRISPEECH-ASR by Changhan Wang, Yun Tang, Xutai Ma, Anne Wu, Dmytro Okhonko, and Juan Pino: In this last example, we guide you through wrapping a speech-to-text model in the WaveformToLabelsBase class, creating the model metadata, and tracing the model, since scripting is not feasible in this case.


audacitorch's People

Contributors

hugofloresgarcia, aldo-aguilar, cwitkowitz


audacitorch's Issues

tweaks to labeler_example.ipynb

Hey Aldo,

Good job getting the labeler example together. I just have a couple of small suggested edits. The first is to add a section at the end that points readers to how to upload a model to HuggingFace. The second boils down to adding a couple of hyperlinks for more information. I'll note below where it would be cool to add a link.

In "Audacity WaveformToLabels Example"
In this notebook we will load in [a speech to text model](hyperlink to model)

In "Wraping the model"
torchaudacity provides a WaveformToLabels class. We will use this as a base class for our pretrained model's wrapper. The WaveformToLabels class provides us with tests to ensure that our model receives properly sized input and outputs the expected tensor shapes for Audacity's [Deep Learning Analyzer](hyperlink or explanation).

In "Saving Our Model & Metadata'
We will now save the wrapped model locally by scripting it with TorchScript, generating a ScriptModule or ScriptFunction using torch.jit.script. We can then use torchaudacity's utility function save_model to save the model and metadata easily.

Audacitorch install prevents import of torchaudio

I've trained a model that acts on MFCCs I get using torchaudio. When I installed audacitorch to wrap my model (in Colab), it prevents me from importing torchaudio with the message OSError: libtorch_cuda_cpp.so: cannot open shared object file: No such file or directory. It looks like installing audacitorch uninstalls and re-installs torch - might that have something to do with it?

Increase flexibility for output of waveform-to-labels models

Waveform-to-label models have little flexibility when it comes to the output predictions. As an example, I am trying to export an automatic music transcription model which produces multiple labels for each time step. This is not possible with the current version of audacitorch, and there does not seem to be a straightforward workaround.

Output structure for frame-level predictions

The output structure of waveform-to-label models seems to be more targeted towards intervallic predictions, such as music tagging or instrument recognition. It would be nice to have a more straightforward way to output frame-level predictions, such as for frame-wise music transcription.

README notes

I (hugo) have some edits to make to the readme that I have on an email chain. Placing an issue here to make sure it gets completed.

Channel Behavior & Metadata

To avoid crashes and undefined behaviors, I think it's worth revising the API / metadata to cover expected input and output channel counts for waveform-to-waveform models. To illustrate the current behaviors, I wrote a simple "pass-through" effect that returns whatever audio is provided (via torch.nn.Identity). Here are the results:

  • 2 input channels, multichannel = False, 1 output label: sums the stereo track to a single mono output track (via sum, not average)
  • 1 input channel, multichannel = False, 1 output label: works as expected (creates one output track)
  • 2 input channels, multichannel = False, 2 output labels: sums the stereo track to a single mono output track (via sum, not average) and creates an additional empty track
  • 1 input channel, multichannel = False, 2 output labels: creates one output track with the input data and one empty track
  • 2 input channels, multichannel = True, 1 output label: crashes upon applying the effect
  • 1 input channel, multichannel = True, 1 output label: works as expected (creates one output track)
  • 2 input channels, multichannel = True, 2 output labels: works as expected (creates two output tracks); however, in some situations a user may wish to map a stereo track to a single stereo track without splitting
  • 1 input channel, multichannel = True, 2 output labels: creates one output track with the input data and one empty track

To confirm the summing behavior, I doubled a mono track to stereo, set the effect gain to 0.5, inverted the summed output track, and made sure it canceled the stereo inputs.
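
For reference, a pass-through wrapper along the lines described above might look like this (a sketch, not the exact code used for these tests):

import torch
from audacitorch import WaveformToWaveformBase

class PassThrough(WaveformToWaveformBase):
    # returns the input unchanged, so the resulting tracks show how Audacity
    # routes channels for each multichannel / labels combination
    def do_forward_pass(self, x: torch.Tensor) -> torch.Tensor:
        return self.model(x)  # self.model is torch.nn.Identity()

wrapper = PassThrough(torch.nn.Identity())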

I see a few issues with the current behaviors:

  • the interaction between the multichannel flag and labels can cause crashes, even though the former is only ostensibly responsible for downmixing stereo inputs to mono
  • labels implicitly determines the number of output tracks created, which is probably something that should be determined explicitly
  • in many instances a user may wish to map stereo inputs to stereo outputs rather than multiple mono tracks; however, the current setup does not distinguish between output tracks and channels, meaning that users have to upmix mono outputs to stereo manually
  • finally, the metadata contains a mix of attributes that determine actual model behavior (sample_rate, multichannel, etc.) and descriptors. Thus, developers have to keep track of the metadata and model behavior simultaneously and make sure the two match correctly

Some possible fixes:

  • In the Audacitorch API, remove the multichannel field and add metadata fields that specify (1) the input / output channel counts and (2) whether to automatically upmix the outputs to a single track, assuming this is feasible within Audacity
  • Make behavior-determining attributes (sample_rate, etc.) constructor arguments to WaveformToWaveformBase and store them there.
  • Generate metadata directly from a WaveformToWaveformBase model with a utility function or class method that takes the remaining descriptors as arguments, ensuring that metadata and model are matched. This could also be folded into a single serialization / scripting utility function if we really want to streamline.
  • If we can catch Torchscript exceptions within Audacity and pass informative error messages, perform checks on input channel dimensions at runtime based on attributes stored within the scripted model or in the metadata
  • During validation, trim or auto-generate the contents of labels to match the effect outputs, so that labels does not determine any behavior beyond track naming
