sheldontsui / pseudobinaural_cvpr2021

Codebase for the paper "Visually Informed Binaural Audio Generation without Binaural Audios" (CVPR 2021)

License: Creative Commons Attribution 4.0 International

Python 95.21% Shell 4.79%
pseudobinaural

pseudobinaural_cvpr2021's People

Contributors

sheldontsui


pseudobinaural_cvpr2021's Issues

ValueError('need at least one array to stack')

When I run the code with APNet_train_crop.sh or APNet_train.sh, I get the following error:

Traceback (most recent call last):
  File "train.py", line 210, in <module>
    for i, data in enumerate(data_loader):
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Augment_dataset.py", line 66, in __getitem__
    data_ret, data_ret_sep = self._get_pseudo_item(index)
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Pseudo_dataset.py", line 263, in _get_pseudo_item
    stereo = self.construct_stereo_ambi(pst_sources)
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Pseudo_dataset.py", line 120, in construct_stereo_ambi
    signals = np.stack([src.signal for src in pst_sources], axis=1)  # signals shape: [Len, n_signals]
  File "<__array_function__ internals>", line 5, in stack
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/numpy/core/shape_base.py", line 423, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

Before running, I rebuilt the hdf5 file and train_FAIR_data.txt, both ordered according to the contents of ./new_splits/split1. For example, the first line of train_FAIR_data.txt is "/home/yzx/lunwen/datasets/FAIR-PLAY/binaural_audios/000383.wav,/home/yzx/lunwen/datasets/FAIR-PLAY/frames/000383". The first field is the path of a binaural audio file, and the second is the path of the frames corresponding to that audio. For each video I extracted 10 frames per second, as in your code, because I found that in the function "_get_pseudo_item" in Pseudo_dataset.py each line is split into audio_file and img_folder. But I don't know why this problem keeps occurring. I am eager for your help.
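For reference, this error means np.stack received an empty list, i.e. pst_sources contained no sources for that sample, which usually points to paths in the data files not resolving. A minimal sketch reproducing the error and guarding against it (construct_stereo_safe is a hypothetical helper, not from the repo):

```python
import numpy as np

# np.stack on an empty list reproduces the exact error in the traceback,
# so pst_sources must have been empty for that sample:
try:
    np.stack([], axis=1)
except ValueError as e:
    print(e)  # prints: need at least one array to stack

def construct_stereo_safe(signals):
    """Hypothetical guard around the failing np.stack call (the real
    code stacks [src.signal for src in pst_sources]). An empty list
    usually means the audio/frame paths in train_FAIR_data.txt or the
    hdf5 file did not resolve to any loadable source."""
    if len(signals) == 0:
        return None  # let the caller skip or report the bad sample
    return np.stack(signals, axis=1)  # shape: [Len, n_signals]
```

A guard like this turns the worker crash into a diagnosable skipped sample, which makes it easier to find which index has bad paths.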

'new_patches' folder

Hello, I see that you mentioned a 'new_patches' folder, but I cannot find it in this repo. Should I create it myself, or could you provide it? Thank you!

Evaluation Metric

Hi, I have some confusion about the evaluation metric used for the paper.

  1. Are the STFT and ENV numbers reported in the paper obtained from the functions STFT_L2_distance and Envelope_distance? If so, the two metrics differ: STFT uses a squared L2 distance while ENV uses a direct L2 distance. Is this done to follow prior works?
  2. Could you point to the function that computes the Mag metric reported in the evaluation tables? There are several MSE calculation functions and I am not sure which one was used.
  3. For the phase error, the paper says an L2 loss is used, but the code appears to use an L1 loss. Could you clarify which one was actually used? Also, the phase error calculation clips the values; it would be helpful if you could explain the rationale behind that.

Thanks.
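For what it's worth, the metrics contrasted in question 1 are usually defined as below in prior binaural-audio work; the function names and the exact normalization (mean vs. sum) are assumptions here, not the repo's code:

```python
import numpy as np
from scipy.signal import hilbert

def stft_l2_distance(stft_pred, stft_gt):
    """Squared L2 distance between complex STFTs, as the question
    suggests the STFT metric uses (the exact normalization is an
    assumption -- check the repo's evaluation code)."""
    return np.mean(np.abs(stft_pred - stft_gt) ** 2)

def envelope_distance(wav_pred, wav_gt):
    """Direct (non-squared) L2 distance between Hilbert envelopes,
    the usual ENV definition in prior work on binaural generation."""
    env_pred = np.abs(hilbert(wav_pred))
    env_gt = np.abs(hilbert(wav_gt))
    return np.sqrt(np.sum((env_pred - env_gt) ** 2))
```

Note the asymmetry the question points out: one metric is squared, the other is not, which matters when comparing absolute numbers across papers.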

Will there be big difference using different HRIR?

Hi, I have a question regarding the HRIR used in the experiment. Is the HRIR data from the CIPIC HRTF Database? It seems you are using only subject 3's data. I am wondering whether there would be a big difference if HRIRs from other subjects were used. Thank you.
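For context, swapping subjects only changes the pair of impulse responses used in the rendering step, which in its simplest form is a per-ear convolution. A minimal sketch (apply_hrir is hypothetical; the repo's actual rendering pipeline may differ):

```python
import numpy as np

def apply_hrir(mono, hrir_left, hrir_right):
    """Render a mono source to binaural audio by convolving it with
    the left/right HRIRs of one subject (e.g. one CIPIC measurement).
    Using a different subject only swaps these two filters, so the
    rest of the pipeline is subject-agnostic."""
    left = np.convolve(mono, hrir_left, mode="full")[: len(mono)]
    right = np.convolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right], axis=0)  # shape: [2, len(mono)]
```

Since HRIRs differ across subjects mainly in fine spectral detail, the open question is whether a learned model is sensitive to that detail, which is exactly what the issue asks.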

How to calculate virtual array in the code

Thank you for your great research. I have read your paper and am now reading the code. I have a question about how the "virtual speakers" are calculated when making pseudo binaural sound.
In your paper, the "virtual array" is calculated in equation (7).
(screenshot of equation (7) omitted)
I think the same calculation is performed by the construct_stereo_ambi function in data/Pseudo_dataset.py.
I understand that the variables in this function correspond to equation (7) as follows.

Here, in data/Pseudo_dataset.py line 123, we have
array_speakers_sound = np.dot(ambisonic, self.sph_mat.T)
but I cannot understand how the term shown in the second screenshot (omitted) is taken into account.
I would like to know how the calculation of equation (7) is reflected in your code.
Regards.
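One possible reading, sketched below: if self.sph_mat stores the spherical harmonics evaluated at the virtual-speaker directions, then the single np.dot already applies the spherical-harmonic term of equation (7). The harmonic convention (first-order W/Y/Z/X, real-valued, constant factors dropped) is an assumption about the repo, not a statement of its actual code:

```python
import numpy as np

def first_order_sh(azimuths, elevations):
    """Real first-order spherical harmonics evaluated at each speaker
    direction (ACN-style ordering W, Y, Z, X up to constant factors --
    an assumption; the repo may use another convention)."""
    w = np.ones_like(azimuths)
    y = np.sin(azimuths) * np.cos(elevations)
    z = np.sin(elevations)
    x = np.cos(azimuths) * np.cos(elevations)
    return np.stack([w, y, z, x], axis=1)  # shape: [n_speakers, 4]

# Hypothetical 4-speaker horizontal array; sph_mat plays the role of
# the decoding matrix in equation (7), so multiplying the ambisonic
# signals by sph_mat.T already includes the spherical-harmonic term.
azi = np.deg2rad(np.array([45.0, 135.0, 225.0, 315.0]))
ele = np.zeros(4)
sph_mat = first_order_sh(azi, ele)            # [n_speakers, 4]
ambisonic = np.random.randn(1000, 4)          # [Len, 4] W/Y/Z/X signals
array_speakers_sound = np.dot(ambisonic, sph_mat.T)  # [Len, n_speakers]
```

Under this reading, the answer to the question would be that the spherical-harmonic evaluation is precomputed into self.sph_mat rather than appearing explicitly at line 123.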

script to run for results in Table-1

Hi, thanks for sharing the code. Could you please clarify which script produces the results in Table 1? The scripts folder contains two types of scripts, one with the simple AudioNet backbone and the other with the APNet backbone. According to the paper, the APNet backbone is used only for the separation task, while the stereo task does not use APNet, but I could not find a setting that matches this condition. Please clarify.

about the result

Hi. I tested the model that you have listed in the Readme, but the result is as follows:

  • ./eval_demo/sepstereo_Augment/Augment_AudioNet_sepstereo_crop_1_best
    STFT L2 Distance: 1.2610043297237892
    Average Envelope Distance: 0.16047628236511588
    MSE Distance: 0.008410530806834097
    STFT MSE Distance: 2.5220085946633852
    Mag diff Distance: 18.28601552585867
    Average Envelope diff Distance: 1.5206958858724584
    Snr: 5.251474191632674
    L1 Distance: 0.4755515200011233

This result is not as good as that in the paper.

And I trained a model myself with ./scripts/sepstereo_Augment/audionet_train_crop.sh, but the result is even worse, as shown below:

  • ./eval_demo/sepstereo_Augment/Augment_AudioNet_sepstereo_crop_1_best
    STFT L2 Distance: 1.3221603853498551
    Average Envelope Distance: 0.16612554087451462
    MSE Distance: 0.008804865650220072
    STFT MSE Distance: 2.6443207939678337
    Mag diff Distance: 17.008346246525566
    Average Envelope diff Distance: 1.4829492619968354
    Snr: 5.008793861205568
    L1 Distance: 0.4820237934270645

Could you advise me on how to improve the result? I would sincerely appreciate your help.
