fschmid56 / efficientat

This repository aims at providing efficient CNNs for Audio Tagging. We provide AudioSet pre-trained models ready for downstream training and extraction of audio embeddings.

License: MIT License

Python 100.00%
audio-tagging convolutional-neural-networks knowledge-distillation mobilenetv3 pytorch

efficientat's People

Contributors

eltociear, fschmid56, hendriks73, joemgu7, turian


efficientat's Issues

dcase20 dataset loader

Hi there,

thanks for sharing the code.
I was particularly interested in the dcase20 dataset loader.
However, something in the MixupDataset class confused me.

See this link: https://github.com/fschmid56/EfficientAT/blob/main/datasets/dcase20.py#L101

SimpleSelectionDataset is returning x, label, device, city, self.available_indices[index].
It is then interpreted as x1, f1, y1, d1, c1.
It turns out that during training, x1 and f1 are used.
However, I think the goal is to return a one-hot encoded and weighted version of the label (y1 * l + y2 * (1. - l)).

Maybe not relevant for this training, but in case someone stumbles on this or wants to reuse it, they might find it useful!
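For reference, a minimal sketch of how a mixup wrapper could mix the waveforms and return the weighted one-hot label suggested above (hypothetical code, not the repository's MixupDataset; it assumes the wrapped dataset yields an equal-length waveform and an integer label):

import numpy as np
import torch
from torch.utils.data import Dataset as TorchDataset

class MixupDatasetSketch(TorchDataset):
    """Hypothetical mixup wrapper: mixes waveforms and returns a weighted one-hot label."""

    def __init__(self, dataset, num_classes, beta=2.0, rate=0.5):
        self.dataset = dataset
        self.num_classes = num_classes
        self.beta = beta
        self.rate = rate

    def __getitem__(self, index):
        x1, y1, *_ = self.dataset[index]               # assumes (waveform, integer label, ...)
        if torch.rand(1).item() < self.rate:
            idx2 = torch.randint(len(self.dataset), (1,)).item()
            x2, y2, *_ = self.dataset[idx2]
            l = np.random.beta(self.beta, self.beta)
            l = max(l, 1.0 - l)
            x = x1 * l + x2 * (1.0 - l)                 # waveforms assumed to have the same length
            y = torch.zeros(self.num_classes)
            y[y1] = l                                   # weighted one-hot label, as suggested above
            y[y2] += 1.0 - l
            return x, y
        y = torch.zeros(self.num_classes)
        y[y1] = 1.0
        return x1, y

    def __len__(self):
        return len(self.dataset)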

Is the Fname to index file correct?

Hey Florian,
I checked out your new fname_to_index file and trained some models, but performance is extremely bad.

Then I just proceeded to print some scores from your provided fname_to_idx.pkl file and check the ground truth.

I used this simple code to map your indexes to the fnames:

import torch
import numpy as np
import pandas as pd

clmaps = pd.read_csv('./class_labels_indices.csv').set_index('index')['display_name'].to_dict()
data = np.load('./passt_enemble_logits_mAP_495.npy', allow_pickle=True)
fnames_to_idx = np.load('./fname_to_index.pkl', allow_pickle=True)

idx_to_fnames = {v: k for k, v in fnames_to_idx.items()}

for idx, fname in idx_to_fnames.items():
    values, idxs = torch.as_tensor(data[idx], dtype=torch.float32).sigmoid().topk(5)
    print(f" ==== {fname} ==== ")
    names = [clmaps[i] for i in idxs.numpy()]
    for score, clname in zip(values, names):
        print(f"{clname:<10} {score:<.3f}")

Some of the outputs are:

 ==== 09c885WMtMw ==== 
Animal     0.914
Dog        0.887
Domestic animals, pets 0.847
Bark       0.643
Bow-wow    0.293

The ground truth for that file, however, is "Music"; you can check the source at:

https://youtu.be/09c885WMtMw?t=80

Another sample is:

 ==== 09bFB0X-8QY ==== 
Speech     0.917
Female speech, woman speaking 0.598
Narration, monologue 0.406
Child speech, kid speaking 0.050
Inside, small room 0.028

Which can be viewed here: https://youtu.be/09bFB0X-8QY?t=16

I'm reasonably confident that your fname_to_index is somewhat wrong; could you maybe check whether that's the case?

EDIT:

In the above code snippet, the following will throw an error, which means there are some duplicate indexes:

assert len(fnames_to_idx) == len(idx_to_fnames)
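To see which indices are affected, a quick follow-up to the snippet above (same fnames_to_idx variable):

from collections import defaultdict

idx_to_all_fnames = defaultdict(list)
for fname, idx in fnames_to_idx.items():
    idx_to_all_fnames[idx].append(fname)

duplicates = {idx: names for idx, names in idx_to_all_fnames.items() if len(names) > 1}
print(f"{len(duplicates)} indices are shared by more than one file name")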

Kind Regards,
Heinrich

error when evaluating with "python ex_audioset.py --cuda --model_name=mn10_as"

Warning: FMAX is None setting to 15000
Dataset from /data/xiaoshengchang/audioset/mp3/eval_segments_mp3.hdf with length 18887.
Running AudioSet evaluation for model 'mn10_as' on device 'cuda'
69%|███████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▎ | 109/158 [00:29<00:07, 6.86it/s]Failed to read frame size: Could not seek to 1026.
82%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████▉ | 129/158 [00:35<00:07, 3.66it/s]
Traceback (most recent call last):
File "/data/xiaoshengchang/EfficientAT-main/ex_audioset.py", line 348, in
evaluate(args)
File "/data/xiaoshengchang/EfficientAT-main/ex_audioset.py", line 266, in evaluate
for batch in tqdm(dl):
File "/opt/conda/lib/python3.9/site-packages/tqdm/std.py", line 1195, in iter
for obj in iterable:
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 628, in next
data = self._next_data()
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1313, in _next_data
return self._process_data(data)
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/dataloader.py", line 1359, in _process_data
data.reraise()
File "/opt/conda/lib/python3.9/site-packages/torch/_utils.py", line 542, in reraise
raise RuntimeError(msg) from None
RuntimeError: Caught ValueError in DataLoader worker process 9.
Original Traceback (most recent call last):
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/conda/lib/python3.9/site-packages/torch/utils/data/_utils/fetch.py", line 58, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/data/xiaoshengchang/EfficientAT-main/datasets/audioset.py", line 154, in getitem
waveform = decode_mp3(self.dataset_file['mp3'][index])
File "/data/xiaoshengchang/EfficientAT-main/datasets/audioset.py", line 37, in decode_mp3
container = av.open(io.BytesIO(mp3_arr.tobytes()))
File "av/container/core.pyx", line 401, in av.container.core.open
File "av/container/core.pyx", line 272, in av.container.core.Container.cinit
File "av/container/core.pyx", line 292, in av.container.core.Container.err_check
File "av/error.pyx", line 336, in av.error.err_check
av.error.ValueError: [Errno 22] Invalid argument: ''; last error log: [mp3] Failed to read frame size: Could not seek to 1026.
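One way to narrow this down is to scan the HDF5 file for entries PyAV cannot open, mirroring the decode_mp3 call from the traceback (a rough sketch; it assumes the same 'mp3' dataset layout as in datasets/audioset.py, and the path is taken from the log above):

import io
import av
import h5py

hdf_path = "/data/xiaoshengchang/audioset/mp3/eval_segments_mp3.hdf"
bad_indices = []
with h5py.File(hdf_path, "r") as f:
    mp3s = f["mp3"]
    for i in range(len(mp3s)):
        try:
            av.open(io.BytesIO(mp3s[i].tobytes())).close()
        except Exception as e:
            bad_indices.append(i)
            print(f"index {i}: {e}")
print(f"{len(bad_indices)} clips could not be decoded")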

Model input shape

Thanks for the great work. I would like to ask about the input (i.e., x) shape of the mobilenet model: is it (batch_size, 1, time_steps, mel_bins) or (batch_size, 1, mel_bins, time_steps)?

x = _mel_forward(x, mel)
y_hat, _ = model(x)
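For reference, a quick empirical check built on the two lines above: printing the tensor shape right after the mel frontend shows exactly what the model receives. The (1, 1, 128, 210) dummy input used in the PyTorch Mobile issue further down this page suggests mel bins come before time steps, but the print settles it.

x = _mel_forward(x, mel)
print(x.shape)  # expected to show (batch_size, 1, mel_bins, time_steps)
y_hat, _ = model(x)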

More Compatible Teacher Pred

Hi,
I tried to run the training script, but some data was lost during downloading, so the teacher pred indices are not right and cannot be used.
I am wondering whether you have time to provide a dict version of the teacher preds with the audio name as key, because it is otherwise quite involved for a newcomer to obtain the ensemble logits; a rough version of what I mean is sketched after the snippet below.

# here the teacher_preds is indexed by dataset index
if args.kd_lambda > 0:
      y_soft_teacher = teacher_preds[i]
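For illustration, such a name-keyed mapping could be built once from the files mentioned in the fname_to_index issue above and then looked up by file name during training (a hypothetical sketch; audio_name stands for whatever file-name field the dataset yields):

import numpy as np

teacher_logits = np.load('passt_enemble_logits_mAP_495.npy', allow_pickle=True)
fname_to_index = np.load('fname_to_index.pkl', allow_pickle=True)

# one-time conversion: key the teacher predictions by audio file name
teacher_preds_by_name = {fname: teacher_logits[idx] for fname, idx in fname_to_index.items()}

# during training, look up by file name instead of dataset index
y_soft_teacher = teacher_preds_by_name[audio_name]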

Thanks

onnx

How can this model be converted to ONNX format?
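A minimal tracing-based export sketch (the import paths and input size are assumptions, model loading follows the other snippets on this page, and the dynamic DyMN variants may need extra care):

import torch
from models.mn.model import get_model as get_mobilenet   # import path is an assumption
from helpers.utils import NAME_TO_WIDTH                   # import path is an assumption

model = get_mobilenet(width_mult=NAME_TO_WIDTH('mn10_as'), pretrained_name='mn10_as')
model.eval()
dummy = torch.rand(1, 1, 128, 1000)   # (batch, 1, mel_bins, time_steps); length is illustrative
torch.onnx.export(model, dummy, 'mn10_as.onnx',
                  input_names=['mel'], output_names=['logits', 'features'],
                  opset_version=13)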

Baseline Model

Hi,

Would you mind uploading the pre-trained baseline model without KD? I need it to do the model comparison experiments on our self-collected dataset. Thanks.

clipwise_output

Hello, how can I get frame-level output results similar to clipwise_output in PANNs?

Thanks

How to accurately identify the sound event offset?

Hi fschmid, I have a question: how can the sound event offset be identified accurately? As is well known, the sound events in everyday life are endless, and some of them are not among the sound event categories in AudioSet. How can these be identified accurately?

Add feature maps

In lucidrains/audiolm-pytorch#177 I describe an approach for using EfficientAT as a discriminator for GAN training.

The only code change needed is this:


    def _forward_impl(self, x: Tensor, return_fmaps: bool = False) -> Union[Tuple[Tensor, Tensor], Tuple[Tensor, List[Tensor]]]:
        fmaps = []
        
        for i, layer in enumerate(self.features):
            x = layer(x)
            if return_fmaps:
                fmaps.append(x)
        
        features = F.adaptive_avg_pool2d(x, (1, 1)).squeeze()
        x = self.classifier(x).squeeze()
        
        if features.dim() == 1 and x.dim() == 1:
            # squeezed batch dimension
            features = features.unsqueeze(0)
            x = x.unsqueeze(0)
        
        if return_fmaps:
            return x, fmaps
        else:
            return x, features

    def forward(self, x: Tensor) -> Union[Tuple[Tensor, Tensor], Tuple[Tensor, List[Tensor]]]:
        return self._forward_impl(x)

to optionally return feature maps for learning feature matching in the generator.
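A possible usage sketch for the change above (mel_spec is a placeholder for the preprocessed spectrogram batch; forward would also need to accept and pass through return_fmaps, or _forward_impl can be called directly):

logits, fmaps = model._forward_impl(mel_spec, return_fmaps=True)
for i, fm in enumerate(fmaps):
    print(i, fm.shape)   # one feature map per block in self.features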

Pretrained model with 1-second frame width?

I'm curious if you could release a pretrained model with a much shorter receptive length?

This would be useful for fine-grained tasks like music transcription and event transcription (with a smaller hop size).

How to use mn10_as as a pre-trained model, and fine-tune on a new dataset

Hi,

I'm trying to use mn10_as as a pre-trained model, and want to fine-tune it to fit my collected dataset (3 classes, 50 ten-second clips per class, sampling rate: 16 kHz).

I prepared my dataset in a DCASE20-like format, but only filename and scene_label are used. Then I modified ex_dcase20.py into ex_my_dataset.py. The following points out what I modified, but errors occurred, and I hope you could give me some hints. Many thanks.

In ex_my_dataset.py,

def train(args):
    # Train Models for Acoustic Scene Classification

    # logging is done using wandb
    wandb.init(
        project="my_dataset",
        notes="Fine-tune Models for Acoustic Scene Classification.",
        tags=[ "Acoustic Scene Classification", "Fine-Tuning"],
        config=args,
        name=args.experiment_name
    )

    device = torch.device('cuda') if args.cuda and torch.cuda.is_available() else torch.device('cpu')

    # model to preprocess waveform into mel spectrograms
    mel = AugmentMelSTFT(n_mels=args.n_mels,
                         sr=args.resample_rate,
                         win_length=args.window_size,
                         hopsize=args.hop_size,
                         n_fft=args.n_fft,
                         freqm=args.freqm,
                         timem=args.timem,
                         fmin=args.fmin,
                         fmax=args.fmax,
                         fmin_aug_range=args.fmin_aug_range,
                         fmax_aug_range=args.fmax_aug_range
                         )
    mel.to(device)

    # load prediction model
    pretrained_name = args.pretrained_name
    if pretrained_name:
        model = get_mobilenet(width_mult=NAME_TO_WIDTH(pretrained_name), pretrained_name=pretrained_name,
                              head_type=args.head_type, se_dims=args.se_dims, num_classes=3)
    else:
        model = get_mobilenet(width_mult=args.model_width, head_type=args.head_type, se_dims=args.se_dims,
                              num_classes=3)

parser.add_argument('--pretrained_name', type=str, default='mn10_as')

In datasets/my_dataset.py

sr=16000
resample_rate=sr

dataset_config = {
    "dataset_name": "my_dataset",
    "meta_csv": os.path.join(dataset_dir, "meta.csv"),
    "train_files_csv": os.path.join(dataset_dir, "evaluation_setup", "fold1_train.csv"),
    "test_files_csv": os.path.join(dataset_dir, "evaluation_setup", "fold1_evaluate.csv")

#
class BasicDCASE22Dataset(TorchDataset):

    def __init__(self, meta_csv, sr=sr, cache_path=None):
        """
        @param meta_csv: meta csv file for the dataset
        @param sr: specify sampling rate
        @param cache_path: specify cache path to store resampled waveforms
        return: waveform, name of the file, label, device and cities
        """
        df = pd.read_csv(meta_csv, sep="\t")
        le = preprocessing.LabelEncoder()
        self.labels = torch.from_numpy(le.fit_transform(df[['scene_label']].values.reshape(-1)))
        #self.devices = le.fit_transform(df[['source_label']].values.reshape(-1))
        #self.cities = le.fit_transform(df['identifier'].apply(lambda loc: loc.split("-")[0]).values.reshape(-1))
        self.files = df[['filename']].values.reshape(-1)
        self.sr = sr
        if cache_path is not None:
            self.cache_path = os.path.join(cache_path, dataset_config["dataset_name"] + f"_r{self.sr}", "files_cache")
            os.makedirs(self.cache_path, exist_ok=True)
        else:
            self.cache_path = None

    def __getitem__(self, index):
        if self.cache_path:
            cpath = os.path.join(self.cache_path, str(index) + ".pt")
            try:
                sig = torch.load(cpath)
            except FileNotFoundError:
                sig, _ = librosa.load(os.path.join(dataset_dir, self.files[index]), sr=self.sr, mono=True)
                sig = torch.from_numpy(sig[np.newaxis])
                torch.save(sig, cpath)
        else:
            sig, _ = librosa.load(os.path.join(dataset_dir, self.files[index]), sr=self.sr, mono=True)
            sig = torch.from_numpy(sig[np.newaxis])
        #return sig, self.labels[index], self.devices[index], self.cities[index]
        return sig, self.labels[index]

    def __len__(self):
        return len(self.files)

class SimpleSelectionDataset(TorchDataset):
    """A dataset that selects a subsample from a dataset based on a set of sample ids.
        Supporting integer indexing in range from 0 to len(self) exclusive.
    """

    def __init__(self, dataset, available_indices):
        """
        @param dataset: dataset to load data from
        @param available_indices: available indices of samples for 'training', 'testing'
        return: x, label, device, city, index
        """
        self.available_indices = available_indices
        self.dataset = dataset

    def __getitem__(self, index):
        #x, label, device, city = self.dataset[self.available_indices[index]]
        x, label = self.dataset[self.available_indices[index]]
        #return x, label, device, city, self.available_indices[index]
        return x, label, self.available_indices[index]

    def __len__(self):
        return len(self.available_indices)

Error messages:

  File C:\Anaconda3\envs\EfficientAT\lib\site-packages\spyder_kernels\py3compat.py:356 in compat_exec
    exec(code, globals, locals)

  File d:\users\2023-efficientat-main\ex_my_dataset.py:241
    train(args)

  File d:\users\2023-efficientat-main\ex_my_dataset.py:100 in train
    for batch in pbar:

  File C:\Anaconda3\envs\EfficientAT\lib\site-packages\tqdm\std.py:1195 in __iter__
    for obj in iterable:

  File C:\Anaconda3\envs\EfficientAT\lib\site-packages\torch\utils\data\dataloader.py:435 in __iter__
    return self._get_iterator()

  File C:\Anaconda3\envs\EfficientAT\lib\site-packages\torch\utils\data\dataloader.py:381 in _get_iterator
    return _MultiProcessingDataLoaderIter(self)

  File C:\Anaconda3\envs\EfficientAT\lib\site-packages\torch\utils\data\dataloader.py:1034 in __init__
    w.start()

  File C:\Anaconda3\envs\EfficientAT\lib\multiprocessing\process.py:121 in start
    self._popen = self._Popen(self)

  File C:\Anaconda3\envs\EfficientAT\lib\multiprocessing\context.py:224 in _Popen
    return _default_context.get_context().Process._Popen(process_obj)

  File C:\Anaconda3\envs\EfficientAT\lib\multiprocessing\context.py:336 in _Popen
    return Popen(process_obj)

  File C:\Anaconda3\envs\EfficientAT\lib\multiprocessing\popen_spawn_win32.py:93 in __init__
    reduction.dump(process_obj, to_child)

  File C:\Anaconda3\envs\EfficientAT\lib\multiprocessing\reduction.py:60 in dump
    ForkingPickler(file, protocol).dump(obj)

AttributeError: Can't pickle local object 'get_roll_func.<locals>.roll_func'
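For what it's worth, the final AttributeError is a Windows multiprocessing issue: DataLoader workers are started via spawn and must pickle the dataset, which fails for the locally defined roll_func closure inside get_roll_func. Two common workarounds, sketched below (train_ds and args stand in for the script's dataset and parsed arguments, and the roll parameters are illustrative):

from torch.utils.data import DataLoader

# Option 1: disable worker processes so nothing needs to be pickled
dl = DataLoader(train_ds, batch_size=args.batch_size, shuffle=True, num_workers=0)

# Option 2 (in the dataset module): replace the nested roll_func with a picklable top-level callable
class RollFunc:
    """Top-level callable holding the body of the existing roll_func."""
    def __init__(self, axis=1, shift_range=4000):   # parameters are illustrative
        self.axis = axis
        self.shift_range = shift_range

    def __call__(self, batch):
        ...  # same logic as the current nested roll_func goes here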

AudioSet temporally-strong labels

First, I would like to say that this repo is great: so many models, all in PyTorch, and getting them to work on my machine was very easy.

Have you tried fine-tuning the models on the temporally-strong labeled subset of the AudioSet dataset?

Hack to get 1 dimensional output

I'm not sure why this was needed, but I had to add this hack to get num_classes=1 to work:

<             #num_classes = state_dict['classifier.1.bias'].size(0)
<             num_classes = state_dict['classifier.2.bias'].size(0)
---
>             num_classes = state_dict['classifier.1.bias'].size(0)
313,315d299
<             if "classifier.2.weight" in state_dict:
<                 del state_dict['classifier.2.weight']
<                 del state_dict['classifier.2.bias']

I won't push a fix because I don't understand the impact of this on other users. Perhaps it should only be used when num_classes is 1?

error in wandb report DyMN on OpenMic

The documentation for fine-tuning DyMN on OpenMic includes a reported example run, accessible via the following link: https://api.wandb.ai/links/florians/qo32vrgl.

In this report, the mean Average Precision (mAP) is stated as 91.6, whereas the paper reports a mAP of 84.4. Interestingly, the mAP value I obtained from fine-tuning on my personal computer aligns closely with the paper's reported value.

Upon closer inspection, I noticed that all mAP values correspond exactly to the values of the ROC statistics. This raises the question: could incorrect statistics be logged to W&B?

train problem

When I run ex_audioset.py, there is a problem:

in line 111: x, f, y, i = batch

problem: not enough values to unpack (expected 4, got 3)

I think the variable i does not receive a value, and I want to know how to use the teacher preds.

Thanks!

Tag audio at a higher resolution

Thank you for your great work and sharing it!

Do you have any recommendation to use your models to label audio at a higher resolution, say 1 sec or lower? Or even mel frame level?

I've tried applying your models on short windows, but below 5 seconds the results deteriorate a lot (for 1 sec it seems to fail completely). I guess it's because the training AudioSet samples are ~10 seconds long.

I've also tried to modify the model to obtain frame-level predictions, but it seems that they all use the "mlp" head, and getting rid of the adaptive pooling would require a full retrain?

Thank you in advance!

Obtain audio embeddings

Hello!
Congratulations on the model, very impressive results. This is more of a question than it is an issue.
I was wondering if there is a configured way to extract the scene embeddings without performing the classification step in this repo.
Much appreciated,
Caio
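For reference, the forward pass shown in the "Add feature maps" issue above returns both the logits and the pooled features, so the second return value can already serve as a clip-level embedding (sketch; mel_spec is a placeholder for the preprocessed spectrogram batch):

import torch

model.eval()
with torch.no_grad():
    logits, embedding = model(mel_spec)
print(embedding.shape)   # pooled feature vector per clip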

Cannot load dymn20_as

model = get_dymn(pretrained_name="dymn20_as")

gives

In [8]: get_dymn(pretrained_name="dymn20_as")
/opt/miniconda3/lib/python3.9/site-packages/torchvision/ops/misc.py:120: UserWarning: Don't use ConvNormActivation directly, please use Conv2dNormActivation and Conv3dNormActivation instead.
  warnings.warn(
---------------------------------------------------------------------------
RuntimeError                              Traceback (most recent call last)
Cell In[8], line 1
----> 1 get_dymn(pretrained_name="dymn20_as")

File ~/dev/goodvibes/vocaltechnique/EfficientAT/models/dymn/model.py:321, in get_model(num_classes, pretrained_name, width_mult, strides, context_ratio, max_context_size, min_context_size, dyrelu_k, no_dyrelu, dyconv_k, no_dyconv, T_max, T0_slope, T1_slope, T_min, pretrain_final_temp, no_ca, use_dy_blocks)
    317     T_max = pretrain_final_temp
    319 temp_schedule = (T_max, T_min, T0_slope, T1_slope)
--> 321 m = dymn(num_classes=num_classes,
    322          pretrained_name=pretrained_name,
    323          block=block,
    324          width_mult=width_mult,
    325          strides=strides,
    326          context_ratio=context_ratio,
    327          max_context_size=max_context_size,
    328          min_context_size=min_context_size,
    329          dyrelu_k=dyrelu_k,
    330          dyconv_k=dyconv_k,
    331          no_dyrelu=no_dyrelu,
    332          no_dyconv=no_dyconv,
    333          no_ca=no_ca,
    334          temp_schedule=temp_schedule,
    335          use_dy_blocks=use_dy_blocks
    336          )
    337 print(m)
    338 return m

File ~/dev/goodvibes/vocaltechnique/EfficientAT/models/dymn/model.py:263, in dymn(pretrained_name, **kwargs)
    261 def dymn(pretrained_name: str = None, **kwargs: Any):
    262     inverted_residual_setting, last_channel = _dymn_conf(**kwargs)
--> 263     return _dymn(inverted_residual_setting, last_channel, pretrained_name, **kwargs)

File ~/dev/goodvibes/vocaltechnique/EfficientAT/models/dymn/model.py:257, in _dymn(inverted_residual_setting, last_channel, pretrained_name, **kwargs)
    255         model.load_state_dict(state_dict, strict=False)
    256     else:
--> 257         model.load_state_dict(state_dict)
    258 return model

File /opt/miniconda3/lib/python3.9/site-packages/torch/nn/modules/module.py:2152, in Module.load_state_dict(self, state_dict, strict, assign)
   2147         error_msgs.insert(
   2148             0, 'Missing key(s) in state_dict: {}. '.format(
   2149                 ', '.join(f'"{k}"' for k in missing_keys)))
   2151 if len(error_msgs) > 0:
-> 2152     raise RuntimeError('Error(s) in loading state_dict for {}:\n\t{}'.format(
   2153                        self.__class__.__name__, "\n\t".join(error_msgs)))
   2154 return _IncompatibleKeys(missing_keys, unexpected_keys)

RuntimeError: Error(s) in loading state_dict for DyMN:
	size mismatch for layers.0.depth_conv.weight: copying a param with shape torch.Size([1, 1, 4, 288]) from checkpoint, the shape in current model is torch.Size([1, 1, 4, 144]).
	size mismatch for layers.0.depth_conv.residuals.0.weight: copying a param with shape torch.Size([4, 64]) from checkpoint, the shape in current model is torch.Size([4, 32]).
	size mismatch for layers.0.depth_norm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.depth_norm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.depth_norm.running_mean: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.depth_norm.running_var: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.depth_act.coef_net.0.weight: copying a param with shape torch.Size([128, 64]) from checkpoint, the shape in current model is torch.Size([64, 32]).
	size mismatch for layers.0.depth_act.coef_net.0.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
	size mismatch for layers.0.proj_conv.weight: copying a param with shape torch.Size([1, 1, 4, 1024]) from checkpoint, the shape in current model is torch.Size([1, 1, 4, 256]).
	size mismatch for layers.0.proj_conv.residuals.0.weight: copying a param with shape torch.Size([4, 64]) from checkpoint, the shape in current model is torch.Size([4, 32]).
	size mismatch for layers.0.proj_norm.weight: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.proj_norm.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.proj_norm.running_mean: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.proj_norm.running_var: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.context_gen.joint_conv.weight: copying a param with shape torch.Size([64, 32, 1, 1]) from checkpoint, the shape in current model is torch.Size([32, 16, 1, 1]).
	size mismatch for layers.0.context_gen.joint_norm.weight: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
	size mismatch for layers.0.context_gen.joint_norm.bias: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
	size mismatch for layers.0.context_gen.joint_norm.running_mean: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
	size mismatch for layers.0.context_gen.joint_norm.running_var: copying a param with shape torch.Size([64]) from checkpoint, the shape in current model is torch.Size([32]).
	size mismatch for layers.0.context_gen.conv_f.weight: copying a param with shape torch.Size([32, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([16, 32, 1, 1]).
	size mismatch for layers.0.context_gen.conv_f.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.0.context_gen.conv_t.weight: copying a param with shape torch.Size([32, 64, 1, 1]) from checkpoint, the shape in current model is torch.Size([16, 32, 1, 1]).
	size mismatch for layers.0.context_gen.conv_t.bias: copying a param with shape torch.Size([32]) from checkpoint, the shape in current model is torch.Size([16]).
	size mismatch for layers.1.exp_conv.weight: copying a param with shape torch.Size([1, 1, 4, 4096]) from checkpoint, the shape in current model is torch.Size([1, 1, 4, 1024]).
	size mismatch for layers.1.exp_conv.residuals.0.weight: copying a param with shape torch.Size([4, 64]) from checkpoint, the shape in current model is torch.Size([4, 32]).
	size mismatch for layers.1.exp_norm.weight: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
	size mismatch for layers.1.exp_norm.bias: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).
	size mismatch for layers.1.exp_norm.running_mean: copying a param with shape torch.Size([128]) from checkpoint, the shape in current model is torch.Size([64]).

etc
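For what it's worth, every mismatched tensor in the traceback is exactly twice as large in the checkpoint as in the freshly constructed model, which points to a width-multiplier mismatch. Passing the matching width, as the PyTorch Mobile snippet further down this page does, may already fix it (sketch; the NAME_TO_WIDTH import path is an assumption):

from models.dymn.model import get_model as get_dymn   # module path taken from the traceback
from helpers.utils import NAME_TO_WIDTH                # import path is an assumption

model = get_dymn(width_mult=NAME_TO_WIDTH('dymn20_as'), pretrained_name='dymn20_as')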

Config detail request

Could you provide all the config details needed to reproduce the paper results? For example, for the models with and without pretraining. By the way, must I use audio resampled to 32 kHz?

Why is the mobile output so different from Python?

Hi,
I converted dymn10_as to PyTorch Mobile, but the mobile output is very different from the Python output.
I checked both torch version and the model file. What may be the problem? Thanks :)

Expected Behavior:
Same or similar outputs

Actual Behavior:
Very different outputs

Reproduce:
1. Convert dymn10_as to PyTorch Mobile (Android)
2. Compare the Python and mobile (Android) outputs

if __name__ == '__main__':
    model_name = 'dymn10_as'
    model_input = torch.rand(1, 1, 128, 210)
    ptmobile_name = 'eat_' + model_name + '_ptmobile.ptl'

    if model_name.startswith('dymn'):
        model = get_dymn(width_mult=NAME_TO_WIDTH(model_name), pretrained_name=model_name, strides=[2, 2, 2, 2])
    else:
        model = get_mn(width_mult=NAME_TO_WIDTH(model_name), pretrained_name=model_name, strides=[2, 2, 2, 2])
    model.to(torch.device('cpu'))
    model.eval()
    model = torch.jit.trace(model, model_input)
    print(model.code)

    # https://github.com/pytorch/pytorch/issues/96639
    # model = mobile_optimizer.optimize_for_mobile(model,
    #                                                  {
    #                                                      MobileOptimizerType.CONV_BN_FUSION,
    #                                                      # I'm only disabling CONV_BN_FUSION
    #                                                      # MobileOptimizerType.FUSE_ADD_RELU,
    #                                                      # MobileOptimizerType.HOIST_CONV_PACKED_PARAMS,
    #                                                      # MobileOptimizerType.INSERT_FOLD_PREPACK_OPS,
    #                                                      # MobileOptimizerType.REMOVE_DROPOUT
    #                                                  })
    model._save_for_lite_interpreter(ptmobile_name)
    print('android model save success!')


The size of tensor a (17248) must match the size of tensor b (64000) at non-singleton dimension 1

I run python ex_dcase20.py --cuda --pretrained --model_name=dymn04_as --cache_path=cache for a custom dataset (an urban-sound-style CSV), but it still fails:
File "/root/miniconda3/envs/EfficientAT/lib/python3.10/site-packages/torch/utils/data/_utils/worker.py", line 302, in _worker_loop
data = fetcher.fetch(index)
File "/root/miniconda3/envs/EfficientAT/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in fetch
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/root/miniconda3/envs/EfficientAT/lib/python3.10/site-packages/torch/utils/data/_utils/fetch.py", line 58, in
data = [self.dataset[idx] for idx in possibly_batched_index]
File "/opt/EfficientAT/datasets/dcase20.py", line 129, in getitem
x = (x1 * l + x2 * (1. - l))
RuntimeError: The size of tensor a (17248) must match the size of tensor b (64000) at non-singleton dimension 1
Can you help me?
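The failing line mixes two waveforms element-wise, so both clips must have the same number of samples; this run expects 64000 samples, while the custom files yield 17248. A padding/cropping sketch that could be applied in the dataset's __getitem__ before mixup (illustrative, not the repository's code):

import torch
import torch.nn.functional as F

def pad_or_truncate(sig: torch.Tensor, target_len: int = 64000) -> torch.Tensor:
    """Force a mono waveform of shape (1, samples) to exactly target_len samples."""
    n = sig.shape[-1]
    if n < target_len:
        return F.pad(sig, (0, target_len - n))
    return sig[..., :target_len]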
