thomeou / salsa

This is the public repository for eigenvector-based SALSA features for polyphonic sound event localization and detection.

License: MIT License

Python 99.24% Makefile 0.76%
sound-event-detection sound-event-localization microphone-array first-order-ambisonics feature-extraction

salsa's Introduction

SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection


Visualization of SALSA features of a 16-second audio clip in multi-source scenarios for first-order ambisonics microphone (FOA) (left) and 4-channel microphone array (MIC) (right).

Official implementation for SALSA: Spatial Cue-Augmented Log-Spectrogram Features for polyphonic sound event localization and detection.

Update: SALSA-Lite feature has been added to the repo.

Thi Ngoc Tho Nguyen, Karn N. Watcharasupat, Ngoc Khanh Nguyen, Douglas L. Jones, Woon-Seng Gan. SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection, IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2022. [arXiv] [IEEE Xplore]

Thi Ngoc Tho Nguyen, Douglas L. Jones, Karn N. Watcharasupat, Huy Phan, Woon-Seng Gan. SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays, in Proceedings of the 2022 IEEE International Conference on Acoustics, Speech and Signal Processing, 2022. [arXiv] [IEEE Xplore]

Introduction to sound event localization and detection

Sound event localization and detection (SELD) is an emerging research field that unifies the tasks of sound event detection (SED) and direction-of-arrival estimation (DOAE) by jointly recognizing the sound classes and estimating the directions of arrival (DOA), onsets, and offsets of detected sound events. While sound event detection mainly relies on time-frequency patterns to distinguish different sound classes, direction-of-arrival estimation uses amplitude and/or phase differences between microphones to estimate source directions. Because the two subtasks rely on such different cues, it is often difficult to optimize them jointly.

What is SALSA?

We propose a novel feature called Spatial Cue-Augmented Log-Spectrogram (SALSA) with exact time-frequency mapping between the signal power and the source directional cues, which is crucial for resolving overlapping sound sources. The SALSA feature consists of multichannel log-linear spectrograms stacked with the normalized principal eigenvector of the spatial covariance matrix at each corresponding time-frequency bin. Depending on the type of microphone array, the principal eigenvector can be normalized differently to extract amplitude and/or phase differences between the microphones. As a result, SALSA features are applicable to different microphone array formats such as first-order ambisonics (FOA) and multichannel microphone arrays (MIC).
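As a rough illustration only (my sketch, not the repository's implementation; the full method additionally restricts the directional cues to a frequency range given by fmin_doa/fmax_doa, averages the spatial covariance matrix over neighbouring frames, and applies the noise-floor and coherence tests mentioned further down this page), a MIC-format SALSA-like feature could be assembled as follows:

import numpy as np
import librosa

def salsa_sketch(audio, n_fft=512, hop_len=300):
    """Hedged sketch: stack log-linear spectrograms with per-bin directional
    cues taken from the principal eigenvector of a (single-snapshot) spatial
    covariance matrix. audio has shape (n_channels, n_samples)."""
    # Multichannel STFT: (n_channels, n_freq, n_frames)
    X = np.stack([librosa.stft(ch, n_fft=n_fft, hop_length=hop_len) for ch in audio])
    n_ch, n_freq, n_frames = X.shape

    log_spec = np.log(np.abs(X) ** 2 + 1e-12)        # multichannel log-linear spectrograms

    eig_feat = np.zeros((n_ch - 1, n_freq, n_frames))
    for f in range(n_freq):
        for t in range(n_frames):
            x = X[:, f, t][:, None]                  # (n_ch, 1) snapshot
            R = x @ x.conj().T                       # spatial covariance (rank-1 here;
                                                     # the paper averages over nearby frames)
            _, v = np.linalg.eigh(R)
            u = v[:, -1]                             # principal eigenvector
            u = u / (u[0] + 1e-12)                   # normalize by the reference channel
            eig_feat[:, f, t] = np.angle(u[1:])      # MIC: keep inter-channel phase;
                                                     # FOA would extract intensity-like cues instead

    # 4-channel input -> 4 spectrogram channels + 3 eigenvector channels = 7 channels
    return np.concatenate([log_spec, eig_feat], axis=0)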

Experimental results on the TAU-NIGENS Spatial Sound Events (TNSSE) 2021 dataset with directional interferences showed that SALSA features outperformed other state-of-the-art features. Specifically, the use of SALSA features in the FOA format increased the F1 score and localization recall by 6% each, compared to the multichannel log-mel spectrograms with intensity vectors. For the MIC format, using SALSA features increased F1 score and localization recall by 16% and 7%, respectively, compared to using multichannel log-mel spectrograms with generalized cross-correlation spectra.

SELD performance

Our ensemble model trained on SALSA features ranked second in the team category of the SELD task in the 2021 DCASE SELD Challenge.

Network architecture

We use a convolutional recurrent neural network (CRNN) in this code. The network consists of a CNN encoder based on the ResNet22 architecture for audio tagging, a two-layer BiGRU, and fully connected (FC) layers. The network can be adapted to different input features simply by setting the number of input channels of the first convolutional layer to match the number of channels of the input feature.
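A minimal PyTorch sketch of this kind of CRNN is shown below. It is an illustration only: the small convolutional stack is a placeholder for the actual PannResNet22 encoder, and the layer sizes and output heads (SED activity plus regressed x/y/z per class) are assumptions based on the configs quoted later on this page.

import torch
import torch.nn as nn

class CrnnSketch(nn.Module):
    """Sketch of the pipeline: CNN encoder -> two-layer BiGRU -> FC heads."""
    def __init__(self, n_input_channels=7, n_classes=12, gru_size=256):
        super().__init__()
        # Adapting to a different feature only requires changing n_input_channels
        # (e.g. 7 for SALSA-FOA, 10 for linspecgcc).
        self.encoder = nn.Sequential(
            nn.Conv2d(n_input_channels, 64, 3, padding=1), nn.BatchNorm2d(64), nn.ReLU(),
            nn.AvgPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.BatchNorm2d(128), nn.ReLU(),
            nn.AdaptiveAvgPool2d((None, 1)),            # pool away the frequency axis
        )
        self.gru = nn.GRU(128, gru_size, num_layers=2,
                          batch_first=True, bidirectional=True)
        self.sed_head = nn.Linear(2 * gru_size, n_classes)      # event activity
        self.doa_head = nn.Linear(2 * gru_size, 3 * n_classes)  # x, y, z per class

    def forward(self, x):                    # x: (batch, channels, time, freq)
        h = self.encoder(x).squeeze(-1)      # (batch, 128, time')
        h, _ = self.gru(h.transpose(1, 2))   # (batch, time', 2 * gru_size)
        return torch.sigmoid(self.sed_head(h)), torch.tanh(self.doa_head(h))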

Visualization of SELD output

Visualization of the ground truth and predicted azimuth for test clip fold6_room2_mix041 of the TAU-NIGENS Spatial Sound Events 2021 dataset. The legend lists the ground truth events in chronological order. Sound classes are color-coded.

SELD output_visualization

Comparison with state-of-the-art SELD systems

Simple CRNN models trained on SALSA features have been shown to achieve SELD performance similar to, or even better than, that of many more complex state-of-the-art systems on the 2020 and 2021 TNSSE datasets. We compared models trained with the proposed SALSA features against other state-of-the-art SELD systems; for these comparisons and more results, please refer to the papers listed above.

Prepare dataset and environment

This code was tested on Ubuntu 18.04 with Python 3.7, CUDA 11.0, and PyTorch 1.7.

  1. Install the dependencies with pip install -r requirements.txt, or manually install these modules:

    • numpy
    • scipy
    • pandas
    • scikit-learn
    • h5py
    • librosa
    • tqdm
    • pytorch 1.7
    • pytorch-lightning
    • tensorboardx
    • pyyaml
    • munch
  2. Download the TAU-NIGENS Spatial Sound Events 2021 dataset here. This code also works with the TAU-NIGENS Spatial Sound Events 2020 dataset here.

  3. Extract everything into the same folder.

  4. Data file structure should look like this:

./
├── feature_extraction.py
├── ...
└── data/
    ├──foa_dev
    │   ├── fold1_room1_mix001.wav
    │   ├── fold1_room1_mix002.wav  
    │   └── ...
    ├──foa_eval
    ├──metadata_dev
    ├──metadata_eval (might not be available yet)
    ├──mic_dev
    └──mic_eval

For the TAU-NIGENS Spatial Sound Events 2021 dataset, please move the wav files from the subfolders dev_train, dev_val, and dev_test to the outer folder foa_dev or mic_dev.

Feature extraction

*Note: under contrib, you can find functionality to run SALSA and SALSA-Lite on the fly (CPU), and for an arbitrary number of microphones. This may be useful if you are interested in more flexible setups (e.g. involving data augmentation and real-time processing).

Our code supports the following features:

Name       | Format | Component                                                                               | Number of channels
melspeciv  | FOA    | multichannel log-mel spectrograms + intensity vector                                    | 7
linspeciv  | FOA    | multichannel log-linear spectrograms + intensity vector                                 | 7
melspecgcc | MIC    | multichannel log-mel spectrograms + GCC-PHAT                                            | 10
linspecgcc | MIC    | multichannel log-linear spectrograms + GCC-PHAT                                         | 10
SALSA      | FOA    | multichannel log-linear spectrograms + eigenvector-based intensity vector (EIV)         | 7
SALSA      | MIC    | multichannel log-linear spectrograms + eigenvector-based phase vector (EPV)             | 7
SALSA-IPD  | MIC    | multichannel log-linear spectrograms + interchannel phase difference (IPD)              | 7
SALSA-Lite | MIC    | multichannel log-linear spectrograms + normalized interchannel phase difference (NIPD)  | 7

Note: the numbers of channels above are calculated for four-channel input.

To extract SALSA features, edit the data and feature directories accordingly in tnsse_2021_salsa_feature_config.yml in the dataset/configs/ folder, then run make salsa.

To extract SALSA-Lite features, edit the data and feature directories accordingly in tnsse_2021_salsa_lite_feature_config.yml in the dataset/configs/ folder, then run make salsa-lite.

To extract linspeciv, melspeciv, linspecgcc, or melspecgcc features, edit the data and feature directories accordingly in tnsse_2021_feature_config.yml in the dataset/configs/ folder, then run make feature.

Training and inference

To train a SELD model with the SALSA feature, edit feature_root_dir and gt_meta_root_dir in the experiment config experiments/configs/seld.yml, then run make train.

To train a SELD model with the SALSA-Lite feature, edit feature_root_dir and gt_meta_root_dir in the experiment config experiments/configs/seld_salsa_lite.yml, then run make train.

To run inference, run make inference. To evaluate the output, edit the Makefile accordingly and run make evaluate.

DCASE2021 Sound Event Localization and Detection Challenge

We participated in the DCASE 2021 Sound Event Localization and Detection Challenge. Our model ensemble ranked 2nd in the team ranking category. The models in the ensemble were trained on a variant of SALSA for the FOA format. This variant has an additional channel for the direct-to-reverberant ratio (DRR). For more information, please check out our technical report. An ablation study on the TAU-NIGENS Spatial Sound Events 2021 dataset shows that adding the DRR channel does not improve SELD performance.

We applied three data augmentation techniques, namely channel swapping (CS), frequency shifting (FS), and random cutout (RC), while training models for the DCASE challenge. However, a later ablation study showed that, for the FOA format of the TAU-NIGENS Spatial Sound Events 2021 dataset, the combination of CS and FS alone is better than the combination of CS, FS, and RC.

SALSA_DRR

Citation

Please consider citing our papers if you find this code useful for your research. Thank you!!!

SALSA

@article{nguyen2021salsa,
  title={SALSA: Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection},
  author={Nguyen, Thi Ngoc Tho and Watcharasupat, Karn N and Nguyen, Ngoc Khanh and Jones, Douglas L and Gan, Woon-Seng},
  journal={arXiv preprint arXiv:2110.00275},
  year={2021}
}

SALSA-Lite

@inproceedings{nguyen2022salsa_lite,
  title={SALSA-Lite: A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays},
  author={Nguyen, Thi Ngoc Tho and Jones, Douglas L and Watcharasupat, Karn N and Phan, Huy and Gan, Woon-Seng},
  booktitle={2022 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)},
  year={2022}
}

DCASE 2021 Technical Report

@techreport{nguyen2021dcase,
  title={DCASE 2021 Task 3: Spectrotemporally-aligned Features for Polyphonic Sound Event Localization and Detection},
  author={Nguyen, Thi Ngoc Tho and Watcharasupat, Karn and Nguyen, Ngoc Khanh and Jones, Douglas L and Gan, Woon Seng},
  institution={IEEE AASP Challenge on Detection and Classification of Acoustic Scenes and Events 2021},
  year={2021}
}

salsa's People

Contributors

andres-fr, kwatcharasupat, thomeou


salsa's Issues

Minor issue on the condition to add leftover chunk data @ database.get_segment_idxes()

Hi!

I've recently started working on SELD, and your work greatly helped me with it! Thanks a lot :)

I think that at database.py line 115, the condition should be

(n_crop_frames - self.chunk_len // downsample_ratio) % (self.chunk_hop_len // downsample_ratio) != 0

while currently it is

(n_crop_frames - self.chunk_len // downsample_ratio) % self.chunk_hop_len // downsample_ratio != 0

Considering that the purpose of this code is to add the leftover chunk in case the feature length is not an exact multiple of chunk_hop_len, I think a pair of parentheses was missed. If I'm wrong, please let me know!
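(For context, a small illustration of my own with made-up numbers, not code from the repo: in Python, % and // share the same precedence and associate left to right, so a % b // c parses as (a % b) // c rather than a % (b // c).)

# Illustration of the precedence difference with made-up numbers.
n_crop_frames, chunk_len, chunk_hop_len, downsample_ratio = 100, 40, 30, 2

without_parens = (n_crop_frames - chunk_len // downsample_ratio) % chunk_hop_len // downsample_ratio
with_parens = (n_crop_frames - chunk_len // downsample_ratio) % (chunk_hop_len // downsample_ratio)

print(without_parens)  # (80 % 30) // 2 = 10
print(with_parens)     # 80 % (30 // 2) = 80 % 15 = 5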

Thanks for sharing your great work again!

Best Regards,
Fred

Problems on model visualization

Hello, thank you very much for sharing such excellent work with us. I have done similar work recently, but I have encountered difficulties with model visualization. I would like to know how you drew such an excellent visualization diagram. Could you share that part of the work?

Very low Scores

@thomeou @karnwatcharasupat

When training the SALSA model, I get very low scores that are nowhere near the scores presented in the paper. Here are the configs that I used:

NB: I cannot use more FFT points because the concatenated per-file arrays do not fit in my RAM.

  1. For extracting the salsa feature:
data:
  format: 'foa'
  fs: 24000
  n_fft: 256
  win_len: 256
  hop_len: 150
  fmin_doa: 50
  fmax_doa: 9000
  2. For training:
# MAP config
name: 'map'
feature_root_dir: 'dataset/features/salsa/foa/24000fs_256nfft_150nhop_5cond_9000fmaxdoa'
feature_type: 'salsa'
gt_meta_root_dir: 'dataset/data'
split_meta_dir: 'dataset/meta/dcase2021/original'
seed: 2021
mode: 'crossval'
data:
  fs: 24000
  n_fft: 256
  hop_len: 300
  n_mels: 200
  audio_format: 'foa' 
  label_rate: 10
  train_chunk_len_s: 8
  train_chunk_hop_len_s: 0.5
  test_chunk_len_s: 60.0
  test_chunk_hop_len_s: 60.1
  n_classes: 12
  train_fraction: 1.0
  val_fraction: 1.0
  output_format: 'reg_xyz'
model:
  encoder:
    name: 'PannResNet22'
    n_input_channels: 7
  decoder:
    name: 'SeldDecoder'
    decoder_type: 'bigru'
    decoder_size: 256
    freq_pool: 'avg'
training:
  train_batch_size: 32
  val_batch_size: 32
  optimizer: 'adam'
  lr_scheduler:
    milestones:
      - 0.0
      - 0.1
      - 0.7
      - 1.0
    lrs:
      - 3.e-4
      - 3.e-4
      - 3.e-4
      - 1.e-4
    moms:
      - 0.9
      - 0.9
      - 0.9
      - 0.9
  loss_weight:
    - 0.3
    - 0.7
  max_epochs: 50
  val_interval: 1
sed_threshold: 0.3
doa_threshold: 20
eval_version: '2021'

And these are the results I obtained with these settings:

2021 SELD metrics: SELD error: 0.794 - SED ER: 1.027 - SED F1: 0.045 - DOA LE: 68.594 - DOA LR: 0.186

and

DATALOADER:0 TEST RESULTS
{'valER': 1.0272366522366523,
 'valF1': 0.0449490538573508,
 'valLE': 68.59412559532619,
 'valLR': 0.18617164379876244,
 'valSeld': 0.7942986075275322}

On the other hand, I trained with the linspeciv feature using the same configs, and the scores were very good:

2021 SELD metrics: SELD error: 0.308 - SED ER: 0.472 - SED F1: 0.656 - DOA LE: 12.650 - DOA LR: 0.652

DATALOADER:0 TEST RESULTS
{'valER': 0.4717712842712843,
 'valF1': 0.6562263960924518,
 'valLE': 12.649613618929243,
 'valLR': 0.6522505064740597,
 'valSeld': 0.30839250323026157}

So, what exactly is the problem? What am I doing wrong?

@andres-fr what scores did you obtain when you trained the model?

A tiny correction in data augmentation

Hi. Thank you all for this amazing work and for sharing it with us. I was going through your code to understand how you're doing data augmentation, specifically channel swapping, and I noticed something in the code. I think the two "elif" conditions at line 582 and line 591 in the "apply()" method of the "GccRandomSwapChannelMic" class in "utilities/transforms.py" should both be "if" conditions instead, since I think you want to apply the three channel swapping methods independently, each with 50% probability. Thoughts?
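(To illustrate the difference, a sketch of my own in which swap_a/swap_b/swap_c are made-up stand-ins for the actual swapping operations: with an if/elif chain at most one swap fires per call, whereas independent if statements apply each swap with its own 50% probability.)

import numpy as np

# Hypothetical stand-ins for three channel-swapping operations on a 4-channel signal.
def swap_a(x): return x[[1, 0, 2, 3]]
def swap_b(x): return x[[0, 1, 3, 2]]
def swap_c(x): return x[[2, 3, 0, 1]]

def augment_elif(x, rng):
    """With elif, at most one of the three swaps is applied per call."""
    if rng.random() < 0.5:
        x = swap_a(x)
    elif rng.random() < 0.5:
        x = swap_b(x)
    elif rng.random() < 0.5:
        x = swap_c(x)
    return x

def augment_independent(x, rng):
    """With separate ifs, each swap is applied independently with probability 0.5,
    so all 2**3 = 8 combinations of swaps are reachable."""
    if rng.random() < 0.5:
        x = swap_a(x)
    if rng.random() < 0.5:
        x = swap_b(x)
    if rng.random() < 0.5:
        x = swap_c(x)
    return x

rng = np.random.default_rng(0)
channels = np.arange(4)                       # stand-in for the channel order
print(augment_elif(channels, rng), augment_independent(channels, rng))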

Trained models

Hi, thank you for sharing your work.

Would it be possible to share the trained models?

LSTM and BI-LSTM hidden size errors

In decoders.py, you will find that both the lstm and bilstm hidden sizes are assigned the value self.gru_size instead of self.lstm_size:

elif self.decoder_type == 'lstm':
    self.lstm_input_size = n_output_channels
    self.lstm_size = decoder_size
    self.fc_size = self.lstm_size

    self.lstm = nn.LSTM(input_size=self.lstm_input_size, hidden_size=self.gru_size,
                        num_layers=2, batch_first=True, bidirectional=False, dropout=0.3)
    init_gru(self.lstm)
elif self.decoder_type == 'bilstm':
    self.lstm_input_size = n_output_channels
    self.lstm_size = decoder_size
    self.fc_size = self.lstm_size * 2

    self.lstm = nn.LSTM(input_size=self.lstm_input_size, hidden_size=self.gru_size,
                        num_layers=2, batch_first=True, bidirectional=True, dropout=0.3)

which will cause AttributeError: 'SeldDecoder' object has no attribute 'gru_size'. Both hidden sizes need to be hidden_size=self.lstm_size.
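(A minimal, self-contained sketch of my own showing the reported fix applied; the default sizes and constructor arguments here are assumptions, not the repo's exact signature.)

import torch.nn as nn

class SeldDecoderSketch(nn.Module):
    """Sketch of the decoder branch with the suggested fix: the LSTM hidden
    size comes from self.lstm_size rather than the non-existent self.gru_size."""
    def __init__(self, n_output_channels=512, decoder_size=256, decoder_type='bilstm'):
        super().__init__()
        bidirectional = (decoder_type == 'bilstm')
        self.lstm_input_size = n_output_channels
        self.lstm_size = decoder_size
        self.fc_size = self.lstm_size * (2 if bidirectional else 1)
        self.lstm = nn.LSTM(input_size=self.lstm_input_size,
                            hidden_size=self.lstm_size,    # fix: was self.gru_size
                            num_layers=2, batch_first=True,
                            bidirectional=bidirectional, dropout=0.3)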

Using Windows 10, Chrome Browser v101.0.4951.67

Need help in running the code

I need help running the code. I have already extracted the SALSA features, but I have problems with training and evaluation, even after applying the instructions in the README file. Can someone help me, please?

about tta

Hi, is there no code for TTA? I can't find it.

On-the-fly parallelized+refactored SALSA and SALSA-Lite

Hi there,

Me again. I reimplemented the SALSA and SALSA-Lite features to work on-the-fly with any number of microphones. I wanted to share it with you and the community with the hope it will help advance the field. The implementation can be found in this gist (I'm happy to make a PR if appropriate):

https://gist.github.com/andres-fr/d923e1df7de4dd6e0af34b28a2a7ef04

It should be just plug-and-play (the docstrings contain usage examples), and it depends only on numpy and librosa. I'd love some feedback! I did some sandbox tests and the results were identical (up to the eigenvector's +/- sign), but no rigorous unit tests. The data looks fine though; here is an example from the DCASE 2021 SELD dataset with the default parameters (wav file fold1_room1_mix001.wav):

LogMel excerpt

snippet_logmel

SALSA excerpt

snippet_salsa

SALSA-Lite excerpt

snippet_salsa_lite

Note that a main visual difference is due to the fact that SALSA has the noise floor + coherence filters, which remove a lot of clutter (for readers: the SALSA-Lite paper kind of shows that these are not really needed).

The parallelized SALSA version yielded a decent speedup that could make it particularly useful for data augmentation techniques that are applied on the waveform. Also (hopefully) the refactoring helps coming up with further exciting variations of SALSA. For the record I like salsa brava.

Cheers!
Andres

Optimizing the RAM consumption when preparing data for training

The load_chunk_data method is aggressively consuming huge amounts of RAM when concatenating np arrays.

I am currently trying to implement something that will reduce the RAM consumption.

@karnwatcharasupat @thomeou I am happy to open a PR when I am done, if that is acceptable to you.

PS: I noticed that the previous method never worked, and I apologize for not properly testing it; I am trying something new now.

@karnwatcharasupat The splitting idea didn't work, even after I fixed it to actually concatenate the chunks, because in the end I am still going to concatenate np arrays that will eventually reach the shape (7, 1920000, 200), which is unmanageable anyway. I had the idea of not concatenating them at all, but exporting them to db_data in the get_split method, for example like this:

db_data = {
    'features': features,
    'features_2': features_2,
    'features_3': features_3,
    'features_4': features_4,
    'sed_targets': sed_targets,
    'doa_targets': doa_targets,
    'feature_chunk_idxes': feature_chunk_idxes,
    'gt_chunk_idxes': gt_chunk_idxes,
    'filename_list': filename_list,
    'test_batch_size': test_batch_size,
    'feature_chunk_len': self.chunk_len,
    'gt_chunk_len': self.chunk_len // self.label_upsample_ratio
}

where features, features_2, features_3, and features_4 are just features split into 4 chunks. The use of features throughout the project would then be adjusted to include the other feature arrays sequentially. I have already developed such a method to export 4 arrays, but I am still exploring the code to better understand it before changing how it works. Currently, I can see that the get_split method is called during training in the datamodule.py file, specifically in

train_db = self.feature_db.get_split(split=self.train_split, split_meta_dir=self.split_meta_dir, stage='fit')

and in

val_db = self.feature_db.get_split(split=self.val_split, split_meta_dir=self.split_meta_dir, stage='inference')

The call that creates the train_db variable is currently my problem.
If you have an idea of how to add the chunking to the code, please let me know.

Problem on phase_vector

Hello, I have a question about how phase_vector is calculated.
Why is it computed as "phase_vector = np.angle(X[:, :, 1:] * np.conj(X[:, :, 0, None]))"?
Thank you very much!
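(A note from my own reading, not an official answer: multiplying each channel's STFT bin by the conjugate of the reference channel and taking the angle yields the interchannel phase difference, i.e. angle(X_m) - angle(X_0), wrapped to (-pi, pi], without subtracting possibly-wrapped phases directly.)

import numpy as np

# Two complex STFT bins: a reference channel and one other channel.
x0 = 0.8 * np.exp(1j * 0.3)     # reference channel, phase 0.3 rad
x1 = 0.5 * np.exp(1j * 1.1)     # other channel, phase 1.1 rad

# angle(x1 * conj(x0)) equals the phase difference angle(x1) - angle(x0).
ipd = np.angle(x1 * np.conj(x0))
print(ipd, np.angle(x1) - np.angle(x0))   # both are approximately 0.8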

problems on data recording for test

I am working on sound source localization. I have read your papers entitled "Spatial Cue-Augmented Log-Spectrogram Features for Polyphonic Sound Event Localization and Detection" and "A Fast and Effective Feature for Polyphonic Sound Event Localization and Detection with Microphone Arrays", and I found that you tested the approach on 60-second sound files. I have tested your implementation ("https://github.com/thomeou/SALSA") on the TAU-NIGENS Spatial Sound Events 2021 dataset, and everything runs without any error, but the output of the network is correct only for the audio in the dataset. For data I recorded myself (with a 4-channel microphone array, a ReSpeaker USB Mic Array), the output is completely wrong. I am just wondering what is wrong with my data. It is 4-channel data, fp16, with the same PCM coding.
Which device did you use for recording?
My data has some kind of echo in it. Is it possible that a small amount of echo degrades the performance of your localization algorithm significantly?

Thanks for your help.

Minor issues and questions running code (salsa+salsa_lite)

Hi! Many congratulations on this outstanding line of work, and thanks a lot for sharing it.

I am running this repo on Ubuntu 20.04 + CUDA and gathered a few notes on the process, in the hope that they are helpful to others.
I also encountered a few minor issues, and I have some open questions that I couldn't answer by reading the paper and docs; I was wondering if someone could take a look at them.

As for the changes I propose, I'll be happy to submit a PR if appropriate.
Cheers!
Andres


Installation:

Although mentioned in the README, there is no pip-compatible requirements.txt file, and the requirements.yml imposes more constraints than needed. The following minimal list worked for me:

# requirements.txt
scipy==1.5.2
pandas==1.1.3
scikit-learn==0.23.2
h5py==2.10.0
librosa==0.8.0
tqdm==4.54.1
torch==1.7.0+cu110
torchvision==0.8.1+cu110
pytorch-lightning==1.1.6
tensorboardx==2.1
pyyaml==5.3.1
munch==2.5.0
fire==0.3.1
ipython==7.19.0

Then, the environment can be initialized as follows (inside of <REPO_ROOT>):

conda create -n salsa python=3.7
conda activate salsa
pip install -r requirements.txt -f https://download.pytorch.org/whl/torch_stable.html

Precomputing SALSA features:

The readme specifies that the dataset should be found inside of <REPO_ROOT>/dataset/data. For that reason, we can get rid of the absolute paths in config files and replace them with the following relative paths. In tnsse2021_salsa_feature_config.yml:

data_dir: 'dataset/data'
feature_dir: 'dataset/features'

Then, running make salsa from <REPO_ROOT> (with the env activated) works perfectly, and yields results inside <REPO_ROOT>/dataset/features. In my case, both data and features were softlinks to an external memory drive, and it still worked fine.

Computing the SALSA features for the 600 MIC wav files (1 minute each, 4 channels, 24kHz, 6.9GB total) on an [email protected] CPU took ca. 35 minutes and 21.5GB with the default settings:

  n_fft: 512
  win_len: 512
  hop_len: 300  # 300 for 12.5ms for n_fft = 512; 150 for n_fft = 256
  fmin_doa: 50
  fmax_doa: 4000  # 'foa': 9000; 'mic': 4000

Precomputing SALSA-Lite features:

Analogous remarks as with SALSA. Computation took 2 minutes and 20.5GB.
Here, the question is how the dedicated SALSA-Lite repo interacts with this one. Will both be maintained, or is this the "main" one and the other was created for publication purposes?


Training:

Question: Any pretrained models available? I couldn't find any upon a brief search

Regarding config, here we can also replace absolute with relative paths:

feature_root_dir: 'dataset/features/salsa/mic/24000fs_512nfft_300nhop_5cond_4000fmaxdoa'  # for SALSA
feature_root_dir: 'dataset/features/salsa_lite/mic/24000fs_512nfft_300nhop_2000fmaxdoa'  # for SALSA-Lite
gt_meta_root_dir: 'dataset/data'

Training setup had a couple of minor issues:

In the README we can currently see the following instruction:

For TAU-NIGENS Spatial Sound Events 2021 dataset, please move wav files from subfolders dev_train, dev_val, dev_test to outer folder.

This should be updated, because the training script also expects the metadata .csv files to be in the outer folder, so those have to be moved as well. Otherwise we get a file-not-found error.

Side note for further readers: The "train/val split" information gets lost when mixing all files, but the repo actually has this information in the form of CSV files, stored at dataset/meta/dcase2021/original. So mixing is fine; still, it is probably not a bad idea to make a backup of the original dev metadata before mixing everything together (it is not very large).

As for make train, it is currently hardcoded to salsa, and the instructions to train on salsa-lite didn't work for me. I changed the "Training and inference" section in the Makefile to the following, so that we can train on both via either make train-salsa or make train-salsa-lite:

# Training and inference
CONFIG_DIR=./experiments/configs
OUTPUT=./outputs   # Directory to save output
EXP_SUFFIX=_test   # the experiment name = CONFIG_NAME + EXP_SUFFIX
RESUME=False
GPU_NUM=0  # Set to -1 if there is no GPU

.phony: train-salsa
train-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.phony: inference-salsa
inference-salsa:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/sedl.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

.phony: train-salsa-lite
train-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/train.py --exp_config="${CONFIG_DIR}/seld_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX) --resume=$(RESUME)

.phony: inference-salsa-lite
inference-salsa-lite:
	PYTHONPATH=$(shell pwd) CUDA_VISIBLE_DEVICES="${GPU_NUM}" python experiments/inference.py --exp_config="${CONFIG_DIR}/sedl_salsa_lite.yml" --exp_group_dir=$(OUTPUT) --exp_suffix=$(EXP_SUFFIX)

After a few epochs, the models seemed to converge well, so I believe all the above modifications were successful. Let me know if I am missing something!

Understanding database.py

Hello.

We recently started working on the SELD task and found your work to be very helpful. Thank you for publishing this.

We would like to make it work with the 2022 dataset and are having some problems understanding the preprocessing logic in database.py. Specifically:

  1. What is each function load_chunk_data, get_segment_idxes, and load_classwise_gt supposed to do?
  2. What is the difference between a chunk and a segment?

Any help is appreciated.
