sheldontsui / pseudobinaural_cvpr2021

Codebase for the paper "Visually Informed Binaural Audio Generation without Binaural Audios" (CVPR 2021)

License: Creative Commons Attribution 4.0 International

Python 95.21% Shell 4.79%
pseudobinaural

pseudobinaural_cvpr2021's People

Contributors

sheldontsui


pseudobinaural_cvpr2021's Issues

ValueError('need at least one array to stack')

When I run the code with APNet_train_crop.sh or APNet_train.sh, I get the following error:

Traceback (most recent call last):
  File "train.py", line 210, in <module>
    for i, data in enumerate(data_loader):
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 517, in __next__
    data = self._next_data()
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1199, in _next_data
    return self._process_data(data)
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/dataloader.py", line 1225, in _process_data
    data.reraise()
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/_utils.py", line 429, in reraise
    raise self.exc_type(msg)
ValueError: Caught ValueError in DataLoader worker process 0.
Original Traceback (most recent call last):
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/worker.py", line 202, in _worker_loop
    data = fetcher.fetch(index)
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in fetch
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/torch/utils/data/_utils/fetch.py", line 44, in <listcomp>
    data = [self.dataset[idx] for idx in possibly_batched_index]
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Augment_dataset.py", line 66, in __getitem__
    data_ret, data_ret_sep = self._get_pseudo_item(index)
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Pseudo_dataset.py", line 263, in _get_pseudo_item
    stereo = self.construct_stereo_ambi(pst_sources)
  File "/home/yzx/lunwen/PseudoBinaural_CVPR2021-master/data/Pseudo_dataset.py", line 120, in construct_stereo_ambi
    signals = np.stack([src.signal for src in pst_sources], axis=1)  # signals shape: [Len, n_signals]
  File "<__array_function__ internals>", line 5, in stack
  File "/home/yzx/anaconda3/envs/pytorch/lib/python3.8/site-packages/numpy/core/shape_base.py", line 423, in stack
    raise ValueError('need at least one array to stack')
ValueError: need at least one array to stack

Before running, I rebuilt the hdf5 file and train_FAIR_data.txt, both ordered according to the contents of ./new_splits/split1. For example, the first line of train_FAIR_data.txt is "/home/yzx/lunwen/datasets/FAIR-PLAY/binaural_audios/000383.wav,/home/yzx/lunwen/datasets/FAIR-PLAY/frames/000383". The first field is the path of a binaural audio file, and the second is the path of the frames corresponding to that audio. For each video I extracted 10 frames per second, as in your code, because I found that in the function "_get_pseudo_item" in Pseudo_dataset.py each line is split into audio_file and img_folder. But I don't know why this problem keeps occurring. I am eager for your help.
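For reference, this error means np.stack received an empty list, i.e. pst_sources contained no sources for that sample, which usually points to paths in the data files not resolving. A minimal sketch reproducing the error and guarding against it (construct_stereo_safe is a hypothetical helper, not from the repo):

```python
import numpy as np

# np.stack on an empty list reproduces the exact error in the traceback,
# so pst_sources must have been empty for that sample:
try:
    np.stack([], axis=1)
except ValueError as e:
    print(e)  # prints: need at least one array to stack

def construct_stereo_safe(signals):
    """Hypothetical guard around the failing np.stack call (the real
    code stacks [src.signal for src in pst_sources]). An empty list
    usually means the audio/frame paths in train_FAIR_data.txt or the
    hdf5 file did not resolve to any loadable source."""
    if len(signals) == 0:
        return None  # let the caller skip or report the bad sample
    return np.stack(signals, axis=1)  # shape: [Len, n_signals]
```

A guard like this turns the worker crash into a diagnosable skipped sample, which makes it easier to find which index has bad paths.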

'new_patches' folder

Hello, I see that you mentioned a 'new_patches' folder, but I cannot find it in this repo. Should I create it myself, or could you provide it? Thank you!

Evaluation Metric

Hi, I have some confusion about the evaluation metric used for the paper.

  1. Are the STFT and ENV numbers reported in the paper obtained from the functions STFT_L2_distance and Envelope_distance? If so, the two metrics differ: STFT uses a squared L2 distance while ENV uses a direct L2 distance. Is this done to follow prior works?
  2. Could you point to the function that computes the Mag metric reported in the evaluation tables? There are several MSE calculation functions and I am not sure which one was used.
  3. For the phase error, the paper says an L2 loss is used, but the code appears to use an L1 loss. Could you clarify which one was actually used? Also, the phase error calculation clips the values; it would be helpful if you could explain the rationale behind that.

Thanks.
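For what it's worth, the metrics contrasted in question 1 are usually defined as below in prior binaural-audio work; the function names and the exact normalization (mean vs. sum) are assumptions here, not the repo's code:

```python
import numpy as np
from scipy.signal import hilbert

def stft_l2_distance(stft_pred, stft_gt):
    """Squared L2 distance between complex STFTs, as the question
    suggests the STFT metric uses (the exact normalization is an
    assumption -- check the repo's evaluation code)."""
    return np.mean(np.abs(stft_pred - stft_gt) ** 2)

def envelope_distance(wav_pred, wav_gt):
    """Direct (non-squared) L2 distance between Hilbert envelopes,
    the usual ENV definition in prior work on binaural generation."""
    env_pred = np.abs(hilbert(wav_pred))
    env_gt = np.abs(hilbert(wav_gt))
    return np.sqrt(np.sum((env_pred - env_gt) ** 2))
```

Note the asymmetry the question points out: one metric is squared, the other is not, which matters when comparing absolute numbers across papers.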

Will there be big difference using different HRIR?

Hi, I have a question regarding the HRIR used in the experiment. Is the HRIR data from the CIPIC HRTF Database? It seems you are using only subject 3's data. I am wondering whether there would be a big difference if HRIRs from other subjects were used. Thank you.
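For context, swapping subjects only changes the pair of impulse responses used in the rendering step, which in its simplest form is a per-ear convolution. A minimal sketch (apply_hrir is hypothetical; the repo's actual rendering pipeline may differ):

```python
import numpy as np

def apply_hrir(mono, hrir_left, hrir_right):
    """Render a mono source to binaural audio by convolving it with
    the left/right HRIRs of one subject (e.g. one CIPIC measurement).
    Using a different subject only swaps these two filters, so the
    rest of the pipeline is subject-agnostic."""
    left = np.convolve(mono, hrir_left, mode="full")[: len(mono)]
    right = np.convolve(mono, hrir_right, mode="full")[: len(mono)]
    return np.stack([left, right], axis=0)  # shape: [2, len(mono)]
```

Since HRIRs differ across subjects mainly in fine spectral detail, the open question is whether a learned model is sensitive to that detail, which is exactly what the issue asks.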

How to calculate virtual array in the code

Thank you for your great research. I have read your paper and am now reading the code. I have a question about how the "virtual speakers" are calculated when making pseudo binaural sound.
In your paper, the "virtual array" is calculated in equation (7).
(screenshot of equation (7) omitted)
I think the same calculation is performed by the construct_stereo_ambi function in data/Pseudo_dataset.py.
I understand that the variables in this function correspond to equation (7) as follows.

Here, in data/Pseudo_dataset.py line 123, we have
array_speakers_sound = np.dot(ambisonic, self.sph_mat.T)
but I cannot understand how the term shown in the second screenshot (omitted) is taken into account.
I would like to know how the calculation of equation (7) is reflected in your code.
Regards.
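One possible reading, sketched below: if self.sph_mat stores the spherical harmonics evaluated at the virtual-speaker directions, then the single np.dot already applies the spherical-harmonic term of equation (7). The harmonic convention (first-order W/Y/Z/X, real-valued, constant factors dropped) is an assumption about the repo, not a statement of its actual code:

```python
import numpy as np

def first_order_sh(azimuths, elevations):
    """Real first-order spherical harmonics evaluated at each speaker
    direction (ACN-style ordering W, Y, Z, X up to constant factors --
    an assumption; the repo may use another convention)."""
    w = np.ones_like(azimuths)
    y = np.sin(azimuths) * np.cos(elevations)
    z = np.sin(elevations)
    x = np.cos(azimuths) * np.cos(elevations)
    return np.stack([w, y, z, x], axis=1)  # shape: [n_speakers, 4]

# Hypothetical 4-speaker horizontal array; sph_mat plays the role of
# the decoding matrix in equation (7), so multiplying the ambisonic
# signals by sph_mat.T already includes the spherical-harmonic term.
azi = np.deg2rad(np.array([45.0, 135.0, 225.0, 315.0]))
ele = np.zeros(4)
sph_mat = first_order_sh(azi, ele)            # [n_speakers, 4]
ambisonic = np.random.randn(1000, 4)          # [Len, 4] W/Y/Z/X signals
array_speakers_sound = np.dot(ambisonic, sph_mat.T)  # [Len, n_speakers]
```

Under this reading, the answer to the question would be that the spherical-harmonic evaluation is precomputed into self.sph_mat rather than appearing explicitly at line 123.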

script to run for results in Table-1

Hi, thanks for sharing the code. Could you please clarify which script produces the results in Table 1? The scripts folder contains two types of scripts, one with the simple AudioNet backbone and the other with the APNet backbone. According to the paper, the APNet backbone is used only for the separation task, while the stereo task does not use APNet, but I could not find a setting that matches this condition. Please clarify.

about the result

Hi. I tested the model that you have listed in the Readme, but the result is as follows:

  • ./eval_demo/sepstereo_Augment/Augment_AudioNet_sepstereo_crop_1_best
    STFT L2 Distance: 1.2610043297237892
    Average Envelope Distance: 0.16047628236511588
    MSE Distance: 0.008410530806834097
    STFT MSE Distance: 2.5220085946633852
    Mag diff Distance: 18.28601552585867
    Average Envelope diff Distance: 1.5206958858724584
    Snr: 5.251474191632674
    L1 Distance: 0.4755515200011233

This result is not as good as that in the paper.

And I trained a model myself with ./scripts/sepstereo_Augment/audionet_train_crop.sh, but the result is even worse, as shown below:

  • ./eval_demo/sepstereo_Augment/Augment_AudioNet_sepstereo_crop_1_best
    STFT L2 Distance: 1.3221603853498551
    Average Envelope Distance: 0.16612554087451462
    MSE Distance: 0.008804865650220072
    STFT MSE Distance: 2.6443207939678337
    Mag diff Distance: 17.008346246525566
    Average Envelope diff Distance: 1.4829492619968354
    Snr: 5.008793861205568
    L1 Distance: 0.4820237934270645

Could you advise me on how to improve the result? I would sincerely appreciate your help.
