andrewowens / multisensory


Code for the paper: Audio-Visual Scene Analysis with Self-Supervised Multisensory Features

Home Page: http://andrewowens.com/multisensory/

License: Apache License 2.0

Shell 0.09% Python 99.91%


multisensory's Issues

How do I run source separation on a different video?

I get this when I run it on my video:

Writing to: ../results/
Traceback (most recent call last):
  File "sep_video.py", line 442, in <module>
    ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg))
  File "~/multisensory/src/aolib/util.py", line 3156, in make_video
    write_ims = (type(im_fnames[0]) != type(''))
IndexError: list index out of range

Do I have to run something else before sep_video.py?

Question about the original audio waveform input

Hi Andrew,
Thanks for your contributions!
In your paper, you said you applied a series of strided 1D convolutions to the input waveform.
So the input waveform you refer to here (before fusion) is the raw audio signal, without an STFT, right?
Why and how do you process the 1D signal? Could you kindly explain this point for me?
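
For reference, here is a minimal sketch of what a stack of strided 1D convolutions applied directly to a raw waveform (no STFT) can look like, written with tf.keras; the layer widths, kernel sizes, and strides are illustrative only and are not the ones used in the multisensory model.

import numpy as np
import tensorflow as tf

# Toy raw-waveform input: (batch, samples, channels). No spectrogram is computed.
waveform = np.random.randn(1, 44144, 1).astype(np.float32)

x = tf.keras.Input(shape=(44144, 1))
h = tf.keras.layers.Conv1D(64, kernel_size=65, strides=4, padding='same', activation='relu')(x)
h = tf.keras.layers.Conv1D(128, kernel_size=15, strides=4, padding='same', activation='relu')(h)
h = tf.keras.layers.Conv1D(128, kernel_size=15, strides=4, padding='same', activation='relu')(h)
model = tf.keras.Model(x, h)

# Each stride-4 convolution shrinks the time axis by 4, so the waveform is
# progressively downsampled by the network itself rather than by an STFT.
print(model(waveform).shape)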

Improvement on using pretrained model

Thanks for the great paper. I am trying to use the pre-trained model, but my results are not great. Could you please suggest any prerequisites (e.g. video quality, audio quality, sampling rate)? I am working on recorded videos with only two speakers in them.

duration_mult flag

Could you provide any explanation of how --duration_mult works for audio-visual source separation? While --duration_mult 4 works well, --duration_mult 10 seems to give a worse result, and the program reports an error if I use --duration_mult 12.
If I only use --duration 20, the separated audio is almost the same as the source.

My goal is to do audio-visual separation on a 28-second video.

Thanks!
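
One possible workaround for clips longer than the supported duration, sketched below purely as an illustration (it is not something the authors describe, and chunk boundaries may sound discontinuous): split the video into shorter pieces with ffmpeg and run sep_video.py on each piece, using the same flags shown in this thread.

import os
import subprocess

video = '../data/my_28s_video.mp4'   # hypothetical input path
chunk_dur = 7.0                      # seconds per chunk (arbitrary choice)
total_dur = 28.0

os.makedirs('chunks', exist_ok=True)
for i in range(int(total_dur // chunk_dur)):
    start = i * chunk_dur
    chunk = 'chunks/chunk_%02d.mp4' % i
    # Cut out one chunk; -c copy avoids re-encoding but snaps cuts to keyframes.
    subprocess.check_call(['ffmpeg', '-y', '-loglevel', 'error',
                           '-ss', str(start), '-i', video,
                           '-t', str(chunk_dur), '-c', 'copy', chunk])
    # Separate the chunk; outputs are named after the chunk file.
    subprocess.check_call(['python', 'sep_video.py', chunk,
                           '--model', 'full', '--duration_mult', '4',
                           '--out', '../results/'])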

Question about the test in Table 2 GRID transfer

Hello Andrew,
I have one small question about how to run your model on the GRID dataset. The audio clips in GRID are shorter than 2 s, and I find that the model in "/results/nets/sep/full/" can't run on videos shorter than 2.135 s. So how did you conduct the GRID transfer experiments?

RuntimeError: Command failed! ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4"

Hello, thanks for the script.
When I run the following command to visualize the locations of sound sources:
python sep_video.py ../data/translator.mp4 --model full --cam --out ../results/
I get this error:

Start time: 0.0
GPU = 0
Spectrogram samples: 128
2.145 2.135
100.0% complete, total time: 0:00:00. 0:00:00 per iteration. (01:57 PM Fri)
Struct(alg=sourcesep, augment_audio=False, augment_ims=True, augment_rms=False, base_lr=0.0001, batch_size=6, bn_last=True, bn_scale=True, both_videos_in_batch=True, cam=False, check_iters=1000, crop_im_dim=224, dilate=False, do_shift=False, dset_seed=None, fix_frame=False, fps=29.97, frame_length_ms=64, frame_sample_delta=74, frame_step_ms=16, freq_len=1024, full_im_dim=256, full_model=False, full_samples_len=105000, gamma=0.1, gan_weight=0.0, grad_clip=10.0, im_split=False, im_type=jpeg, init_path=../results/nets/shift/net.tf-650000, init_type=shift, input_rms=0.141421356237, l1_weight=1.0, log_spec=True, loss_types=['fg-bg'], model_path=../results/nets/sep/full/net.tf-160000, mono=False, multi_shift=False, net_style=full, normalize_rms=True, num_dbs=None, num_samples=44144, opt_method=adam, pad_stft=False, phase_type=pred, phase_weight=0.01, pit_weight=0.0, predict_bg=True, print_iters=10, profile_iters=None, resdir=/multisensory-master/results/nets/sep/full, samp_sr=21000.0, sample_len=None, sampled_frames=63, samples_per_frame=700.700700701, show_iters=None, show_videos=False, slow_check_iters=10000, spec_len=128, spec_max=80.0, spec_min=-100.0, step_size=120000, subsample_frames=None, summary_iters=10, test_batch=10, test_list=../data/celeb-tf-v6-full/test/tf, total_frames=149, train_iters=160000, train_list=../data/celeb-tf-v6-full/train/tf, use_3d=True, use_sound=True, use_wav_gan=False, val_list=../data/celeb-tf-v6-full/val/tf, variable_frame_count=False, vid_dur=2.135, weight_decay=1e-05)
ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -r 29.97 -vf scale=256:256 "/tmp/tmpVEitNC/small_%04d.png"
ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -r 29.97 -vf "scale=-2:'min(600,ih)'" "/tmp/tmpVEitNC/full_%04d.png"
ffmpeg -loglevel error -ss 0.0 -i "../data/translator.mp4" -safe 0 -t 2.185 -ar 21000.0 -ac 2 "/tmp/tmpVEitNC/sound.wav"
Running on: /gpu:0
2018-06-15 13:57:11.657961: I tensorflow/core/platform/cpu_feature_guard.cc:137] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2018-06-15 13:57:12.523259: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1030] Found device 0 with properties:
name: Tesla K40m major: 3 minor: 5 memoryClockRate(GHz): 0.745
pciBusID: 0000:02:00.0
totalMemory: 11.92GiB freeMemory: 11.84GiB
2018-06-15 13:57:12.523316: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0, compute capability: 3.5)
Raw spec length: [1, 128, 1025]
Truncated spec length: [1, 128, 1025]
bn scale: True
arg_scope train = False
sf/conv1_1 -> [1, 11036, 1, 64]
sf/conv2_1_short -> [1, 690, 1, 128]
sf/conv2_1_1 -> [1, 690, 1, 128]
sf/conv2_1_2 -> [1, 690, 1, 128]
sf/conv3_1_1 -> [1, 173, 1, 128]
sf/conv3_1_2 -> [1, 173, 1, 128]
sf/conv4_1_short -> [1, 44, 1, 256]
sf/conv4_1_1 -> [1, 44, 1, 256]
sf/conv4_1_2 -> [1, 44, 1, 256]
im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3]
pool -> [1, 32, 56, 56, 64]
im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
pool -> [1, 16, 28, 28, 64]
im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64]
im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64]
frac: 2.6875
sf/conv5_1 -> [1, 16, 1, 128]
sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256]
im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192]
im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512]
im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
time_stride = 1
im/conv5_1_short -> [1, 8, 7, 7, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_1 -> [1, 8, 7, 7, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_2 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512]
im/conv5_2_1 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512]
im/conv5_2_2 -> [1, 8, 7, 7, 512] before: [1, 8, 7, 7, 512]
joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512]
joint/logits -> [1, 8, 7, 7, 1] before: [1, 8, 7, 7, 512]
gen/conv1 [1, 128, 1024, 2] -> [1, 128, 512, 64]
gen/conv2 [1, 128, 512, 64] -> [1, 128, 256, 128]
gen/conv3 [1, 128, 256, 128] -> [1, 64, 128, 256]
Video net before merge: [1, 16, 1, 64] After: [1, 64, 1, 64]
gen/conv4 [1, 64, 128, 320] -> [1, 32, 64, 512]
Video net before merge: [1, 16, 1, 128] After: [1, 32, 1, 128]
gen/conv5 [1, 32, 64, 640] -> [1, 16, 32, 512]
Video net before merge: [1, 8, 1, 512] After: [1, 16, 1, 512]
gen/conv6 [1, 16, 32, 1024] -> [1, 8, 16, 512]
gen/conv7 [1, 8, 16, 512] -> [1, 4, 8, 512]
gen/conv8 [1, 4, 8, 512] -> [1, 2, 4, 512]
gen/conv9 [1, 2, 4, 512] -> [1, 1, 2, 512]
gen/deconv1 [1, 1, 2, 512] -> [1, 2, 4, 512]
gen/deconv2 [1, 2, 4, 1024] -> [1, 4, 8, 512]
gen/deconv3 [1, 4, 8, 1024] -> [1, 8, 16, 512]
gen/deconv4 [1, 8, 16, 1024] -> [1, 16, 32, 512]
gen/deconv5 [1, 16, 32, 1536] -> [1, 32, 64, 512]
gen/deconv6 [1, 32, 64, 1152] -> [1, 64, 128, 256]
gen/deconv7 [1, 64, 128, 576] -> [1, 128, 256, 128]
gen/deconv8 [1, 128, 256, 256] -> [1, 128, 512, 64]
gen/fg [1, 128, 512, 128] -> [1, 128, 1024, 2]
gen/bg [1, 128, 512, 128] -> [1, 128, 1024, 2]
Restoring from: ../results/nets/sep/full/net.tf-160000
predict
samples shape: (1, 44144, 2)
samples pred shape: (1, 44144, 2)
(128, 1025)
Running on: 0
2018-06-15 13:57:18.753499: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1120] Creating TensorFlow device (/device:GPU:0) -> (device: 0, name: Tesla K40m, pci bus id: 0000:02:00.0, compute capability: 3.5)
bn scale: False
arg_scope train = True
sf/conv1_1 -> [1, 11036, 1, 64]
sf/conv2_1_short -> [1, 690, 1, 128]
sf/conv2_1_1 -> [1, 690, 1, 128]
sf/conv2_1_2 -> [1, 690, 1, 128]
sf/conv3_1_1 -> [1, 173, 1, 128]
sf/conv3_1_2 -> [1, 173, 1, 128]
sf/conv4_1_short -> [1, 44, 1, 256]
sf/conv4_1_1 -> [1, 44, 1, 256]
sf/conv4_1_2 -> [1, 44, 1, 256]
im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3]
pool -> [1, 32, 56, 56, 64]
im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
pool -> [1, 16, 28, 28, 64]
im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64]
im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64]
frac: 2.6875
sf/conv5_1 -> [1, 16, 1, 128]
sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256]
im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192]
im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512]
im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
time_stride = 1
im/conv5_1_short -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
im/conv5_2_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
im/conv5_2_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512]
joint/logits -> [1, 8, 14, 14, 1] before: [1, 8, 14, 14, 512]
bn scale: False
arg_scope train = True
sf/conv1_1 -> [1, 11036, 1, 64]
sf/conv2_1_short -> [1, 690, 1, 128]
sf/conv2_1_1 -> [1, 690, 1, 128]
sf/conv2_1_2 -> [1, 690, 1, 128]
sf/conv3_1_1 -> [1, 173, 1, 128]
sf/conv3_1_2 -> [1, 173, 1, 128]
sf/conv4_1_short -> [1, 44, 1, 256]
sf/conv4_1_1 -> [1, 44, 1, 256]
sf/conv4_1_2 -> [1, 44, 1, 256]
im/conv1 -> [1, 32, 112, 112, 64] before: [1, 63, 224, 224, 3]
pool -> [1, 32, 56, 56, 64]
im/conv2_1_1 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
im/conv2_1_2 -> [1, 32, 56, 56, 64] before: [1, 32, 56, 56, 64]
pool -> [1, 16, 28, 28, 64]
im/conv2_2_1 -> [1, 16, 28, 28, 64] before: [1, 32, 56, 56, 64]
im/conv2_2_2 -> [1, 16, 28, 28, 64] before: [1, 16, 28, 28, 64]
frac: 2.6875
sf/conv5_1 -> [1, 16, 1, 128]
sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256]
im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192]
im/merge2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 512]
im/conv3_1_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_1_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_1 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv3_2_2 -> [1, 16, 28, 28, 128] before: [1, 16, 28, 28, 128]
im/conv4_1_short -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_1 -> [1, 8, 14, 14, 256] before: [1, 16, 28, 28, 128]
im/conv4_1_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_1 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
im/conv4_2_2 -> [1, 8, 14, 14, 256] before: [1, 8, 14, 14, 256]
time_stride = 1
im/conv5_1_short -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 256]
im/conv5_1_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
im/conv5_2_1 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
im/conv5_2_2 -> [1, 8, 14, 14, 512] before: [1, 8, 14, 14, 512]
joint/logits -> [1, 1, 1, 1, 1] before: [1, 1, 1, 1, 512]
joint/logits -> [1, 8, 14, 14, 1] before: [1, 8, 14, 14, 512]
Writing to: ../results/
ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4"
Guessed Channel Layout for Input Stream #0.0 : mono
[concat @ 0x382d700] DTS -230584300921369 < 0 out of order
[h264_v4l2m2m @ 0x385f500] Could not find a valid device
[h264_v4l2m2m @ 0x385f500] can't configure encoder
Error initializing output stream 0:0 -- Error while opening encoder for output stream #0:0 - maybe incorrect parameters such as bit_rate, rate, width or height
Traceback (most recent call last):
File "sep_video.py", line 442, in
ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg))
File "/multisensory-master/src/aolib/util.py", line 3169, in make_video
% (sound_flags_in, fps, input_file, sound_flags_out, flags, out_fname))
File "/multisensory-master/src/aolib/util.py", line 915, in sys_check
fail('Command failed! %s' % cmd)
File "/multisensory-master/src/aolib/util.py", line 12, in fail
def fail(s = ''): raise RuntimeError(s)
RuntimeError: Command failed! ffmpeg -i "/tmp/ao_M0QAze.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_cnpblR.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "../results/fg_cam_translator.mp4"

I would like to know what went wrong and what I should do.
Any suggestions would be appreciated! Thanks.
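
The ffmpeg lines above ("h264_v4l2m2m ... Could not find a valid device" and the encoder failing to open) usually point to an ffmpeg build without a usable software H.264 encoder rather than to the model itself; that is a guess, not a confirmed diagnosis. A quick way to check what the local build provides, and a possible workaround of switching the codec in the command built by util.make_video:

import subprocess

# List the encoders that this ffmpeg build actually ships with.
encoders = subprocess.check_output(['ffmpeg', '-hide_banner', '-encoders']).decode()
for name in ('libx264', 'h264_v4l2m2m', 'mpeg4'):
    print(name, 'available' if name in encoders else 'missing')

# If libx264 is present, replacing "-vcodec h264" with "-vcodec libx264" in the
# ffmpeg command string (or installing an ffmpeg build compiled with libx264)
# may avoid this failure. This is a workaround sketch, not the authors' fix.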

About the input file for shift model training

In shift_params.py it seems like you are using 'audioset-vid-v21/small_train.txt', which presumably lists many *.tf files, right?

  • Could you please provide an example .txt file, or an example TFRecord file?
  • It would be very helpful if you could provide the script you used to create the .tf files (one purely illustrative guess at the format is sketched after this list).
    Thanks a lot!
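
The exact features that shift_dset.py expects are not shown in this thread, so the following is only an illustrative guess at how a clip (JPEG frames plus raw audio samples) could be packed into a TFRecord; every feature key here is hypothetical and may differ from what the repository actually reads.

import numpy as np
import tensorflow as tf

def bytes_feature(v):
    return tf.train.Feature(bytes_list=tf.train.BytesList(value=[v]))

def write_example(writer, jpeg_frames, audio_samples):
    # jpeg_frames: list of JPEG-encoded byte strings; audio_samples: float32 array.
    # Keys such as 'samples', 'num_frames', 'im_0' are HYPOTHETICAL placeholders.
    feats = {'samples': bytes_feature(audio_samples.astype(np.float32).tobytes()),
             'num_frames': tf.train.Feature(int64_list=tf.train.Int64List(value=[len(jpeg_frames)]))}
    for i, jpg in enumerate(jpeg_frames):
        feats['im_%d' % i] = bytes_feature(jpg)
    ex = tf.train.Example(features=tf.train.Features(feature=feats))
    writer.write(ex.SerializeToString())

with tf.io.TFRecordWriter('example.tf') as writer:
    frames = [tf.io.encode_jpeg(np.zeros((256, 256, 3), np.uint8)).numpy() for _ in range(63)]
    write_example(writer, frames, np.zeros(105000, np.float32))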

Supported on Linux

Just wondering if this project will run on Linux. If not, is a Mac required, and which versions of OS X are supported?

Thanks!

file input for blind audio source separation

Thanks for the amazing implementation and pre-trained models.

I want to use this for blind audio source separation on audio files only, but it gives me the following error:
ffmpeg -loglevel error -ss 0.0 -i "../../pyAudioAnalysis/90131M02_.wav" -safe 0 -t 8.338 -r 29.97 -vf scale=256:256 "/var/folders/kh/2zcggyvx7gs6l10p8pq7jkgw_0bxjx/T/tmpKTCMbU/small_%04d.png"
Output file #0 does not contain any stream
Traceback (most recent call last):
File "sep_video.py", line 398, in
ret = run(arg.vid_file, t, arg.clip_dur, pr, gpus[0], mask = arg.mask, arg = arg, net = net)
File "sep_video.py", line 254, in run
'ffmpeg -loglevel error -ss %(start_time)s -i "%(vid_file)s" -safe 0 '
File "/Volumes/workspace/multisensory/src/aolib/util.py", line 915, in sys_check
fail('Command failed! %s' % cmd)
File "/Volumes/workspace/multisensory/src/aolib/util.py", line 12, in fail
def fail(s = ''): raise RuntimeError(s)
RuntimeError: Command failed! ffmpeg -loglevel error -ss 0.0 -i "../../pyAudioAnalysis/90131M02_.wav" -safe 0 -t 8.338 -r 29.97 -vf scale=256:256 "/var/folders/kh/2zcggyvx7gs6l10p8pq7jkgw_0bxjx/T/tmpKTCMbU/small_%04d.png"

What is the correct way to run it on audio files?

Cheers!
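
Since sep_video.py extracts frames from the input with ffmpeg, a bare .wav has no video stream to read, which is what the "Output file #0 does not contain any stream" message means. One possible workaround, sketched below (not an officially supported mode; the separation model is audio-visual, so without meaningful frames the results may be poor), is to mux the audio with a synthetic black video stream first:

import subprocess

wav = '../../pyAudioAnalysis/90131M02_.wav'
out = '90131M02_black.mp4'

# Pair the .wav with a black 256x256, 29.97 fps video stream of the same length.
subprocess.check_call(['ffmpeg', '-y',
                       '-f', 'lavfi', '-i', 'color=c=black:s=256x256:r=29.97',
                       '-i', wav,
                       '-shortest', '-c:v', 'libx264', '-pix_fmt', 'yuv420p',
                       '-c:a', 'aac', out])

# Then: python sep_video.py 90131M02_black.mp4 --model full --out ../results/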

Questions about the models

Great work! The idea is very interesting, and thank you for providing the code.

After running the script download_models.sh, I found several pretrained models in the folder: cam, sep, and shift. I am a little confused about which model serves which purpose. For example, which is the model for,

Self-supervised audio-visual features: a pretrained 3D CNN that can be used for downstream tasks (e.g. action recognition, source separation).

Thank you.

make_video_helper() missing 3 required positional arguments: 'x', 'in_dir', and 'tmp_ext'

When I run the code I get the following error; how can I solve it? (Ubuntu, tensorflow-gpu 1.8)
multiprocessing.pool.RemoteTraceback:
"""
Traceback (most recent call last):
File "/usr/lib/python3.5/multiprocessing/pool.py", line 119, in worker
result = (True, func(*args, **kwds))
File "/usr/lib/python3.5/multiprocessing/pool.py", line 44, in mapstar
return list(map(*args))
TypeError: make_video_helper() missing 3 required positional arguments: 'x', 'in_dir', and 'tmp_ext'
"""
The above exception was the direct cause of the following exception:

Traceback (most recent call last):
File "/media/george/multisensory/src/sep_video.py", line 450, in
ig.show(table)
File "/media/george/multisensory/src/aolib/img.py", line 13, in show
return imtable.show(*args, **kwargs)
File "/media/george/multisensory/src/aolib/imtable.py", line 72, in show_table
html_rows = html_from_rows(table, output_dir)
File "/media/george/multisensory/src/aolib/imtable.py", line 413, in html_from_rows
html_rows.append("" + "".join(html_from_cell(x, output_dir) for x in row))
File "/media/george/multisensory/src/aolib/imtable.py", line 413, in
html_rows.append("" + "".join(html_from_cell(x, output_dir) for x in row))
File "/media/george/studyProjects/multisensory/src/aolib/imtable.py", line 308, in html_from_cell
return x.make_html(output_dir)
File "/media/george/studyProjects/multisensory/src/aolib/imtable.py", line 587, in make_html
make_video(fname, self.ims, self.fps, sound = self.sound)
File "/media/george/studyProjects/multisensory/src/aolib/imtable.py", line 498, in make_video
[(i, x, in_dir, tmp_ext) for i, x in enumerate(ims)])
File "/media/george/multisensory/src/aolib/util.py", line 2726, in parmap
ret = pool.map_async(f, xs).get(10000000)
File "/usr/lib/python3.5/multiprocessing/pool.py", line 608, in get
raise self._value
TypeError: make_video_helper() missing 3 required positional arguments: 'x', 'in_dir', and 'tmp_ext'
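
This error usually means the multiprocessing worker received a single argument tuple where the function expects separate positional arguments; since the repository targets Python 2 and the report is from Python 3.5, a Python 2/3 mismatch in how the arguments are passed through util.parmap is a likely suspect (an assumption, not a confirmed diagnosis). A self-contained demonstration of the failure and of one possible fix, a module-level wrapper that unpacks the tuple:

import multiprocessing

def worker(i, x, in_dir, tmp_ext):
    # Stand-in for make_video_helper: expects four separate arguments.
    return '%d:%s:%s:%s' % (i, x, in_dir, tmp_ext)

def worker_unpacked(args):
    # Wrapper that unpacks the tuple before calling the real worker.
    return worker(*args)

if __name__ == '__main__':
    jobs = [(i, 'frame%d' % i, '/tmp', '.png') for i in range(4)]
    pool = multiprocessing.Pool(2)
    print(pool.map(worker_unpacked, jobs))   # works
    # pool.map(worker, jobs) would raise the same TypeError as above;
    # on Python 3, pool.starmap(worker, jobs) is another option.
    pool.close()

The same pattern could be applied to make_video_helper in aolib/imtable.py (or to the call passed into util.parmap) if that turns out to be the cause.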

Issue on Large Videos

Thanks for the great paper. When working with long videos, the maximum duration for separation seems to be 4 minutes. Can this be applied to a whole video for separation at once?

Question about fine-tune for full sep model

Really nice job! I have noticed that in the self-supervised shift model there is no gamma variable in slim.batch_norm for the conv layers (there is no 'bn_scale' in shift_params.py), but the full speech-separation model does have a gamma in slim.batch_norm for each conv layer ('bn_scale = True' in sep_params.py). So how can the full model be fine-tuned from the shift model when one has a batch-norm gamma and the other does not? And if the weights in the shift model and the corresponding weights in the full model are the same, does the fine-tuning make sense?

RuntimeError: Command failed! ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4"

python sep_video.py data/translator.mp4 --model unet_pit --duration_mult 4 --out results/
Start time: 0.0
GPU = 0
Spectrogram samples: 512
(8.298, 8.288)
100.0% complete, total time: 0:00:00. 0:00:00 per iteration. (11:29 AM Tue)
Struct(alg=sourcesep, augment_audio=False, augment_ims=True, augment_rms=False, base_lr=0.0001, batch_size=24, bn_last=True, bn_scale=True, both_videos_in_batch=False, cam=False, check_iters=1000, crop_im_dim=224, dilate=False, do_shift=False, dset_seed=None, fix_frame=False, fps=29.97, frame_length_ms=64, frame_sample_delta=74.5, frame_step_ms=16, freq_len=1024, full_im_dim=256, full_model=False, full_samples_len=105000, gamma=0.1, gan_weight=0.0, grad_clip=10.0, im_split=False, im_type=jpeg, init_path=None, init_type=shift, input_rms=0.14142135623730953, l1_weight=1.0, log_spec=True, loss_types=['pit'], model_path=results/nets/sep/unet-pit/net.tf-160000, mono=False, multi_shift=False, net_style=no-im, normalize_rms=True, num_dbs=None, num_samples=173774, opt_method=adam, pad_stft=False, phase_type=pred, phase_weight=0.01, pit_weight=1.0, predict_bg=True, print_iters=10, profile_iters=None, resdir=/home/study/PycharmProjects/results/nets/sep/unet-pit, samp_sr=21000.0, sample_len=None, sampled_frames=248, samples_per_frame=700.7007007007007, show_iters=None, show_videos=False, slow_check_iters=10000, spec_len=512, spec_max=80.0, spec_min=-100.0, step_size=120000, subsample_frames=None, summary_iters=10, test_batch=10, test_list=../data/celeb-tf-v6-full/test/tf, total_frames=149, train_iters=160000, train_list=../data/celeb-tf-v6-full/train/tf, use_3d=True, use_sound=True, use_wav_gan=False, val_list=../data/celeb-tf-v6-full/val/tf, variable_frame_count=False, vid_dur=8.288, weight_decay=1e-05)
ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -r 29.97 -vf scale=256:256 "/tmp/tmpw4889ppn/small_%04d.png"
ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -r 29.97 -vf "scale=-2:'min(600,ih)'" "/tmp/tmpw4889ppn/full_%04d.png"
ffmpeg -loglevel error -ss 0.0 -i "data/translator.mp4" -safe 0 -t 8.338000000000001 -ar 21000.0 -ac 2 "/tmp/tmpw4889ppn/sound.wav"
Running on:
2019-05-14 11:29:30.212532: I tensorflow/core/platform/cpu_feature_guard.cc:140] Your CPU supports instructions that this TensorFlow binary was not compiled to use: SSE4.1 SSE4.2 AVX AVX2 FMA
2019-05-14 11:29:30.329825: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:898] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2019-05-14 11:29:30.330229: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1356] Found device 0 with properties:
name: GeForce RTX 2070 major: 7 minor: 5 memoryClockRate(GHz): 1.62
pciBusID: 0000:01:00.0
totalMemory: 7.77GiB freeMemory: 7.19GiB
2019-05-14 11:29:30.330244: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1435] Adding visible gpu devices: 0
2019-05-14 11:29:30.547596: I tensorflow/core/common_runtime/gpu/gpu_device.cc:923] Device interconnect StreamExecutor with strength 1 edge matrix:
2019-05-14 11:29:30.547627: I tensorflow/core/common_runtime/gpu/gpu_device.cc:929] 0
2019-05-14 11:29:30.547632: I tensorflow/core/common_runtime/gpu/gpu_device.cc:942] 0: N
2019-05-14 11:29:30.547797: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1053] Created TensorFlow device (/job:localhost/replica:0/task:0/device:GPU:0 with 6920 MB memory) -> physical GPU (device: 0, name: GeForce RTX 2070, pci bus id: 0000:01:00.0, compute capability: 7.5)
Raw spec length: [1, 514, 1025]
Truncated spec length: [1, 512, 1025]
('gen/conv1', [1, 512, 1024, 2], '->', [1, 512, 512, 64])
('gen/conv2', [1, 512, 512, 64], '->', [1, 512, 256, 128])
('gen/conv3', [1, 512, 256, 128], '->', [1, 256, 128, 256])
('gen/conv4', [1, 256, 128, 256], '->', [1, 128, 64, 512])
('gen/conv5', [1, 128, 64, 512], '->', [1, 64, 32, 512])
('gen/conv6', [1, 64, 32, 512], '->', [1, 32, 16, 512])
('gen/conv7', [1, 32, 16, 512], '->', [1, 16, 8, 512])
('gen/conv8', [1, 16, 8, 512], '->', [1, 8, 4, 512])
('gen/conv9', [1, 8, 4, 512], '->', [1, 4, 2, 512])
('gen/deconv1', [1, 4, 2, 512], '->', [1, 8, 4, 512])
('gen/deconv2', [1, 8, 4, 1024], '->', [1, 16, 8, 512])
('gen/deconv3', [1, 16, 8, 1024], '->', [1, 32, 16, 512])
('gen/deconv4', [1, 32, 16, 1024], '->', [1, 64, 32, 512])
('gen/deconv5', [1, 64, 32, 1024], '->', [1, 128, 64, 512])
('gen/deconv6', [1, 128, 64, 1024], '->', [1, 256, 128, 256])
('gen/deconv7', [1, 256, 128, 512], '->', [1, 512, 256, 128])
('gen/deconv8', [1, 512, 256, 256], '->', [1, 512, 512, 64])
('gen/fg', [1, 512, 512, 128], '->', [1, 512, 1024, 2])
('gen/bg', [1, 512, 512, 128], '->', [1, 512, 1024, 2])
Restoring from: results/nets/sep/unet-pit/net.tf-160000
predict
samples shape: (1, 173774, 2)
samples pred shape: (1, 173774, 2)
(512, 1025)
Writing to: results/
ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4"
[wav @ 0x558b3f868b40] Estimating duration from bitrate, this may be inaccurate
[wav @ 0x558b3f868b40] Could not find codec parameters for stream 0 (Audio: none, 1065353216 Hz, 16256 channels, 9481256 kb/s): unknown codec
Consider increasing the value for the 'analyzeduration' and 'probesize' options
Unknown encoder 'h264'
Traceback (most recent call last):
File "sep_video.py", line 455, in
ut.make_video(full_ims, pr.fps, pj(arg.out, 'fg%s.mp4' % name), snd(full_samples_fg))
File "/home/study/PycharmProjects/untitled/util.py", line 3176, in make_video
% (sound_flags_in, fps, input_file, sound_flags_out, flags, out_fname))
File "/home/study/PycharmProjects/untitled/util.py", line 917, in sys_check
fail('Command failed! %s' % cmd)
File "/home/study/PycharmProjects/untitled/util.py", line 14, in fail
def fail(s = ''): raise RuntimeError(s)
RuntimeError: Command failed! ffmpeg -i "/tmp/ao_wmjz0ezg.wav" -r 29.970000 -loglevel warning -safe 0 -f concat -i "/tmp/ao_i2pwi0b8.txt" -pix_fmt yuv420p -vcodec h264 -strict -2 -y -acodec aac "results/fg_translator.mp4"

I ran into this problem. I ran the code as you describe, but what happened here? I am running the code with Python 3. Thank you for your prompt reply!

Questions about the entrance of the training function

Great job! I tried to train this model myself; however, I encountered some problems and am anxious to know the solutions.
a. Could you point out a detailed method for calling the training function?
b. How do I feed the 'Kinetics-Sounds' dataset into the model for training?
c. I noticed that you mentioned 'rewriting the read_data(pr, gpus) function'. What does the variable 'pr' stand for?
Looking forward to your reply!
Thanks!
@andrewowens

Questions about sourcesep.py

Hello Andrew, thanks for your great work again. I have some questions about the training of the source separation model.
(1) I'm not sure what these losses mean; could you please explain them?
Iteration 13140, lr = 1e-04, total:gen: 0.169 gen:reg: 0.002 diff-fg: 0.080 phase-fg: 0.004 diff-bg: 0.080 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 1.241

(2) I noticed the code contains "import sep_eval", but I could not find 'sep_eval.py'. Could you share this part of the code? I guess it is used for evaluation and testing.

Thanks a lot 😁~

Question about the test in Table 3

Did you train a special model on short (200 ms) videos? I ask because the model in "/results/nets/sep/full/" can't run on videos shorter than 2.135 s.

What GPU used?

Hello,
Which GPU did you use in your experiments: 1080 Ti, P40, P100, or something else?
Thank you ~

difference between "large" and "full" sep models

Hi Andrew,

Thanks for publicly releasing your code and models.

  • Could you please tell me the difference between "large" and "full" models for separation?

  • Have you released a model corresponding to "Large-scale training" (Sec. 6.3 in the paper)? Does the large model refer to this?

Thanks,
Sanjeel

Where is the sep_module (class or function) in sourcesep.py?

Really nice job! I found that "sep_module" (a class or a function?) is used in sourcesep.py, but I could not find its definition. "sep_module" is used as follows:

"
spec_mix, phase_mix = sep_module(pr).stft(samples_trunc[:, :, 0], pr)
spec_mix = crop_spec(spec_mix)
phase_mix = crop_spec(phase_mix)

    self.specgram_op, phase = map(crop_spec, sep_module(pr).stft(samples_trunc[:, :, 0], pr))
    self.auto_op = sep_module(pr).istft(self.specgram_op, phase, pr)

    self.net = sep_module(pr).make_net(
      self.ims_ph, samples_trunc, spec_mix, phase_mix, 
      pr, reuse = False, train = False) "

Question about sourcesep training results on a new dataset

I tried to train sourcesep.py on a new dataset. The dataset contains 12,000 videos, and I trained for about 2,000 iterations. The training results are as follows:

Iteration 0, lr = 1e-04, total:gen: 1.038 gen:reg: 0.155 diff-fg: 0.556 phase-fg: 0.006 diff-bg: 0.316 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 105.432
Iteration 1, lr = 1e-04, total:gen: 1.037 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 104.403
Iteration 2, lr = 1e-04, total:gen: 1.036 gen:reg: 0.155 diff-fg: 0.555 phase-fg: 0.006 diff-bg: 0.315 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 103.402
Iteration 3, lr = 1e-04, total:gen: 1.035 gen:reg: 0.155 diff-fg: 0.554 phase-fg: 0.006 diff-bg: 0.314 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 102.648
Iteration 4, lr = 1e-04, total:gen: 1.033 gen:reg: 0.155 diff-fg: 0.553 phase-fg: 0.006 diff-bg: 0.313 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 101.729
Iteration 5, lr = 1e-04, total:gen: 1.030 gen:reg: 0.155 diff-fg: 0.551 phase-fg: 0.006 diff-bg: 0.312 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.953
Iteration 6, lr = 1e-04, total:gen: 1.028 gen:reg: 0.155 diff-fg: 0.550 phase-fg: 0.006 diff-bg: 0.311 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 100.038
Iteration 7, lr = 1e-04, total:gen: 1.024 gen:reg: 0.155 diff-fg: 0.547 phase-fg: 0.006 diff-bg: 0.310 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 99.307
Iteration 8, lr = 1e-04, total:gen: 1.021 gen:reg: 0.155 diff-fg: 0.545 phase-fg: 0.006 diff-bg: 0.309 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 98.419
Iteration 9, lr = 1e-04, total:gen: 1.017 gen:reg: 0.155 diff-fg: 0.542 phase-fg: 0.006 diff-bg: 0.308 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 97.764
Iteration 10, lr = 1e-04, total:gen: 1.013 gen:reg: 0.155 diff-fg: 0.539 phase-fg: 0.006 diff-bg: 0.307 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 96.905
Iteration 20, lr = 1e-04, total:gen: 0.967 gen:reg: 0.155 diff-fg: 0.507 phase-fg: 0.006 diff-bg: 0.294 phase-bg: 0.006 total:discrim: 0.000 discrim:reg: 0.000, time: 89.464
Iteration 30, lr = 1e-04, total:gen: 0.922 gen:reg: 0.154 diff-fg: 0.475 phase-fg: 0.006 diff-bg: 0.281 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000, time: 82.757
Iteration 40, lr = 1e-04, total:gen: 0.877 gen:reg: 0.153 diff-fg: 0.444 phase-fg: 0.006 diff-bg: 0.268 phase-bg: 0.005 total:discrim: 0.000 discrim:reg: 0.000,
.....
Iteration 1800, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.358
Iteration 1810, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.319
Iteration 1820, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.505
Iteration 1830, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.447
Iteration 1840, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.346
Iteration 1850, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.312
Iteration 1860, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.403
Iteration 1870, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.097 phase-fg: 0.004 diff-bg: 0.097 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.404
Iteration 1880, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.202
Iteration 1890, lr = 1e-04, total:gen: 0.205 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.469
Iteration 1900, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.318
Iteration 1910, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.159
Iteration 1920, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.241
Iteration 1930, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 21.028
Iteration 1940, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.877
Iteration 1950, lr = 1e-04, total:gen: 0.204 gen:reg: 0.005 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.739
Iteration 1960, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.555
Iteration 1970, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.00
Iteration 1980, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.283
Iteration 1990, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 20.177
Checkpoint: /home/zhang/xiao/multisensory-master/data/traing/sep_2s_test/net.tf-2000
Iteration 2000, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.973
Iteration 2010, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.958
Iteration 2020, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.821
Iteration 2030, lr = 1e-04, total:gen: 0.203 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.806
Iteration 2040, lr = 1e-04, total:gen: 0.204 gen:reg: 0.004 diff-fg: 0.096 phase-fg: 0.004 diff-bg: 0.096 phase-bg: 0.004 total:discrim: 0.000 discrim:reg: 0.000, time: 19.881
As shown in the results, the training loss decreases. However, when the trained model is used to separate a video with sep_video.py, we only get noise. Could you give me some advice?

How to train the "shift" and "cam" model for sound source location?

First of all, thank you for your earlier reply! Now I have two more questions about your great work.
I've noticed that there are three models: "shift", "cam" and "sep". To my understanding, the "sep" model is for source separation, and the "cam" model is for localization. There are pretrained model files for these models, such as:
model_file = '../results/nets/shift/net.tf-650000'
model_file = '../results/nets/cam/net.tf-675000'
Now I wonder how to train the "shift" and "cam" models for sound source localization. Could you give the detailed method for calling the training function in shift_net.py? Which dataset should I use?
Looking forward to your reply :)

Download sample-data.zip NOT FOUND

Hi,

I am having trouble downloading sample-data.zip; it seems the link is broken, and I get a Not Found error when running the sh file. Any chance you could provide the correct link?

Thank you!

Issue on datasets

Hello, thanks for your great work. Here are some questions that came up while reading your paper and implementing the work. In the paper, you said you "trained your model on a dataset of approximately 750,000 videos sampled from AudioSet." As we know, AudioSet consists of an expanding ontology of 632 audio event classes and a collection of 2,084,320 human-labeled 10-second sound clips drawn from YouTube videos.

  • So the AudioSet you used for training is the human-labeled 10-second clips drawn from YouTube videos?
  • Can we download the full videos according to the YouTube IDs provided in AudioSet and split them into clips for training? We have actually trained several models with this kind of dataset, but some of them always predict the label "1" (aligned), and others always predict "0" (not aligned).

model architecture

Can anyone briefly explain how the audio and video features are fused together? (Please refer to the architecture figure in the original paper.)
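
From the paper's description and the merge shapes printed in the logs earlier on this page (e.g. "sf_net shape before merge: [1, 44, 1, 256], and after merge: [1, 16, 1, 256]" followed by "im/merge1 -> [1, 16, 28, 28, 512] before: [1, 16, 28, 28, 192]"), the fusion appears to resample the audio features to the video feature map's time axis, tile them over the spatial grid, concatenate them with the visual features along the channel axis, and then apply further convolutions. A rough, shapes-only illustration of that tile-and-concatenate step (the numbers are invented for clarity):

import numpy as np

video_feats = np.random.randn(1, 16, 28, 28, 64)   # (batch, time, h, w, channels)
audio_feats = np.random.randn(1, 16, 128)          # (batch, time, channels),
                                                   # already resampled to the video time axis

# Tile the audio features over the 28x28 spatial grid so each spatial cell
# sees the same audio descriptor for its time step.
audio_tiled = np.tile(audio_feats[:, :, None, None, :], (1, 1, 28, 28, 1))

fused = np.concatenate([video_feats, audio_tiled], axis=-1)
print(fused.shape)   # (1, 16, 28, 28, 192); the im/merge1 and im/merge2 layers
                     # in the logs appear to process a tensor like this next.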

Question about training

It's really amazing work. It seems you didn't share the code for training, such as getting CAMs, action recognition, and audio-visual separation. I don't know how to train the models; could you add the training code?

TypeError: convolution() got multiple values for argument 'weights_regularizer'

I got an error like this; what happened? Please help me fix it.
Traceback (most recent call last):
File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 398, in
ret = run(arg.vid_file, t, arg.clip_dur, pr, gpus[0], mask = arg.mask, arg = arg, net = net)
File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 294, in run
net.init()
File "D:/Workspace/PythonProjects/studyProjects/multisensory/src/sep_video.py", line 42, in init
pr, reuse = False, train = False)
File "D:\Workspace\PythonProjects\studyProjects\multisensory\src\sourcesep.py", line 953, in make_net
vid_net_full = shift_net.make_net(ims, sfs, pr, None, reuse, train)
File "D:\Workspace\PythonProjects\studyProjects\multisensory\src\shift_net.py", line 419, in make_net
sf_net = conv2d(sf_net,num_outputs= 64, kernel_size= [65, 1], scope = 'sf/conv1_1', stride = [4, 1], padding='SAME', reuse = reuse) # by lg 8.20
File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\layers\python\layers\layers.py", line 1154, in convolution2d
conv_dims=2)
File "C:\Anaconda3\envs\tensorflow-gpu\lib\site-packages\tensorflow\contrib\framework\python\ops\arg_scope.py", line 183, in func_with_args
return func(*args, **current_args)
TypeError: convolution() got multiple values for argument 'weights_regularizer'

Why doesn't acc change during shift model training?

Hello,
When I train the shift_lowfps model, the loss decreases slowly, but acc doesn't change (0.500).
Could you give me some advice?

  • Is it because the training time is too short?
  • Should acc increase when the loss decreases?
    By the way, what does the total loss mean?

[grad norm:][0.0125109516]
Iteration 5500, lr = 1e-03, total:loss: 1.246 reg: 0.041 loss:label: 0.705 acc:label: 0.500, time: 2.978
Iteration 5510, lr = 1e-03, total:loss: 1.244 reg: 0.040 loss:label: 0.704 acc:label: 0.500, time: 2.974
Iteration 5520, lr = 1e-03, total:loss: 1.241 reg: 0.039 loss:label: 0.703 acc:label: 0.500, time: 2.953
Iteration 5530, lr = 1e-03, total:loss: 1.239 reg: 0.037 loss:label: 0.702 acc:label: 0.500, time: 2.960
Iteration 5540, lr = 1e-03, total:loss: 1.238 reg: 0.036 loss:label: 0.701 acc:label: 0.500, time: 2.971
Iteration 5550, lr = 1e-03, total:loss: 1.236 reg: 0.035 loss:label: 0.700 acc:label: 0.500, time: 2.965
Iteration 5560, lr = 1e-03, total:loss: 1.234 reg: 0.034 loss:label: 0.700 acc:label: 0.500, time: 2.961
Iteration 5570, lr = 1e-03, total:loss: 1.232 reg: 0.033 loss:label: 0.699 acc:label: 0.500, time: 2.957
Iteration 5580, lr = 1e-03, total:loss: 1.231 reg: 0.032 loss:label: 0.699 acc:label: 0.500, time: 2.952
Iteration 5590, lr = 1e-03, total:loss: 1.229 reg: 0.031 loss:label: 0.698 acc:label: 0.500, time: 2.967
[grad norm:][0.00501754601]
Iteration 5600, lr = 1e-03, total:loss: 1.228 reg: 0.030 loss:label: 0.698 acc:label: 0.500, time: 2.968
Iteration 5610, lr = 1e-03, total:loss: 1.227 reg: 0.030 loss:label: 0.697 acc:label: 0.500, time: 2.960
Iteration 5620, lr = 1e-03, total:loss: 1.225 reg: 0.029 loss:label: 0.697 acc:label: 0.500, time: 2.951
Iteration 5630, lr = 1e-03, total:loss: 1.224 reg: 0.028 loss:label: 0.696 acc:label: 0.500, time: 2.977
Iteration 5640, lr = 1e-03, total:loss: 1.223 reg: 0.027 loss:label: 0.696 acc:label: 0.500, time: 2.973
Iteration 5650, lr = 1e-03, total:loss: 1.222 reg: 0.026 loss:label: 0.696 acc:label: 0.500, time: 2.981

About the input format

In the source separation model it seems you are using *.tf files as input (rec_files_from_path in sep_dset.py). Could you please provide the format for creating those TFRecord files?

Test set used in paper

I have a quick question about the test set used in your paper for the alignment task. The training set is reported as 750,000 videos. What is the size of the test set for this task?

question about using 'sep_example.tf'

When using 'sep_example.tf' to test the training procedure of shift_net,

(the tf file is from "http://people.eecs.berkeley.edu/~owens/multisensory/sep_example.tf")

I got a message like:

File "train_shift.py", line 3, in
shift_net.train(shift_params.shift_v1(num_gpus=3), [0,1,2], restore = False)
File "multisensory-master/src/shift_net.py", line 315, in train
model.make_model()
File "/multisensory-master/src/shift_net.py", line 155, in make_model
self.inputs = read_data(pr, self.gpus)
File "/multisensory-master/src/shift_net.py", line 17, in read_data
lambda : shift_dset.make_db_reader(
File "/multisensory-master/src/tfutil.py", line 484, in on_cpu
return f()
File "/multisensory-master/src/shift_net.py", line 19, in
num_db_files = pr.num_dbs))
File "/multisensory-master/src/shift_dset.py", line 283, in make_db_reader
ims, flows, samples, sfs, labels, ytids = tf.train.batch(example_list[0], batch_size)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 927, in batch
name=name)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/training/input.py", line 722, in _batch
dequeued = queue.dequeue_many(batch_size, name=name)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/data_flow_ops.py", line 464, in dequeue_many
self._queue_ref, n=n, component_types=self._dtypes, name=name)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/ops/gen_data_flow_ops.py", line 2418, in _queue_dequeue_many_v2
component_types=component_types, timeout_ms=timeout_ms, name=name)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/op_def_library.py", line 787, in _apply_op_helper
op_def=op_def)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 2956, in create_op
op_def=op_def)
File "/data/wen/anaconda2/lib/python2.7/site-packages/tensorflow/python/framework/ops.py", line 1470, in init
self._traceback = self._graph._extract_stack() # pylint: disable=protected-access

OutOfRangeError (see above for traceback): FIFOQueue '_1_batch/fifo_queue' is closed and has insufficient elements (requested 15, current size 0)

Is there some setting in shift_params.py that needs to be changed?

thanks.

Issue with sound source localization

When I use the sound source visualization, the heat map is always in the same part of the video (on the right). The size and shape change a little (it's usually a rectangle), but it is always on the right side. I tried different videos, and only the translator video you provided works correctly (the heat map is on his face). The videos I'm using are 2560x1440 and have binaural sound tracks. Thanks.

Question about the shift_net.py' training

It's really nice work. However, I ran into some problems while reading shift_net.py, described as follows:
ims = self.inputs[i]['ims']
samples_ex = self.inputs[i]['samples']
assert pr.both_examples
assert not pr.small_augment
labels = tf.random_uniform(
  [shape(ims, 0)], 0, 2, dtype = tf.int64, name = 'labels_sample')
samples0 = tf.where(tf.equal(labels, 1), samples_ex[:, 1], samples_ex[:, 0])
samples1 = tf.where(tf.equal(labels, 0), samples_ex[:, 1], samples_ex[:, 0])
labels1 = 1 - labels

net0 = make_net(ims, samples0, pr, reuse = reuse, train = self.is_training)
net1 = make_net(None, samples1, pr, im_net = net0.im_net, reuse = True, train = self.is_training)
labels = tf.concat([labels, labels1], 0)

My understanding is that samples_ex is the stereo audio with shape batch_size x N x 2 (where N is the length of the audio signal). However, why are the labels variable? Shouldn't they be constant (0 meaning not synchronized and 1 meaning synchronized)? I'm looking forward to your reply.

Some questions about training and testing shift model

Hello, I have some questions about training and testing that have been bothering me.

  • Are the parameters you used for training the shift model the default parameters in the code you provided?
  • You said: "In my experiments, the model took something like 2K iterations to reach chance performance (loss:label = 0.693), and 11K iterations to do better than chance (loss:label = 0.692). So, for a long time it looked like the model was stuck at chance."
    So my question is: once the model does better than chance, does "loss:label" decrease faster?
  • For testing, I mean calculating the accuracy on the test set as mentioned in your paper. You mentioned that the model obtained 59.9% accuracy on held-out videos for its alignment task (chance = 50%). My question is, should the parameter "do_shift" be set to False or True? When I set it to True, the accuracy is 0.50633484; set to False, I get 0.43133482. Both are quite different from the 0.599 reported in your paper. By the way, I use the same dataset-reading code and the pre-trained model you provided, and the dataset is generated from AudioSet.

Here is the code for testing, I only add a function in "class NetClf" in the "shift_net.py".

    def test_accuracy(self, reset=True):
        gpus = mu.set_gpus(self.gpu)
        print('Loading Model')
        if self.sess is None:
            print 'Running on:', gpus

            with tf.device(gpus[0]):
                if reset:
                    tf.reset_default_graph()
                    tf.Graph().as_default()

                pr = self.pr
                pr_test = pr.copy()
                self.sess = tf.Session()
                pr_test.augment_ims = False
                print 'pr_test ='
                print pr_test

                print('loading dataset...')
                with tf.device('/cpu:0'):

                    rec_files = shift_dset.rec_files_from_path(pr_test.test_list)
                    total_examples = len(rec_files)*8841
                    total_batch = int(total_examples/pr.test_batch)
                    print('the number of total examples:',total_examples)
                    print('the number of total batch:',total_batch)

                    self.test_ims, self.test_samples = mu.on_cpu(
                        lambda: shift_dset.make_db_reader(
                            pr_test.test_list, pr_test, pr.test_batch, ['im', 'samples'], one_pass=True))
                    print 'sample shape:', shape(self.test_samples) # [10, 87587, 2]

                    if pr_test.do_shift:
                        print('do shifting...')
                        self.test_labels = tf.random_uniform([shape(self.test_ims, 0)], 0, 2, dtype=tf.int64)
                        self.test_samples = tf.where(tf.equal(self.test_labels, 1), self.test_samples[:, 1],
                                                     self.test_samples[:, 0])

                    else:
                        self.test_labels = tf.ones(shape(self.test_ims, 0), dtype=tf.int64)
                        # self.test_samples = tf.where(tf.equal(self.test_labels, 1), self.test_samples[:, 1], self.test_samples[:, 0])

                print('make net')
                self.test_net = make_net(self.test_ims, self.test_samples, pr_test, reuse=False, train=False)


                self.coord = tf.train.Coordinator()
                self.init_op = tf.group(tf.global_variables_initializer(), tf.local_variables_initializer())
                self.sess.run(self.init_op)

                tf.train.Saver().restore(self.sess, self.model_path)    
                tf.get_default_graph().finalize()

                print('Start testing...')
                tf.train.start_queue_runners(self.sess, coord=self.coord)
                self.total_acc = []
                i = 0
                try:
                    while not self.coord.should_stop():
                        start = ut.now_sec()

                        predict_logits = self.sess.run(self.test_net.logits)  
                        predict_logits = np.squeeze(predict_logits) 

                        predict_labels = np.array(predict_logits > 0).astype(np.int64)  

                        labels = self.sess.run(self.test_labels)

                        correct_list = (predict_labels == labels)
                        acc = np.mean(np.array(correct_list).astype(np.float32))
                        self.total_acc.append(acc)
                        i += 1

                        print 'Iter: %d/%d, Accuracy: %s, time: %.3f' % (i, total_batch, acc,ut.now_sec() - start)
                except tf.errors.OutOfRangeError:  
                    print('Test Done!')

        return np.mean(np.array(self.total_acc)) 

And I run the test like this:

import shift_net, shift_params, numpy as np
import time

pr = shift_params.shift_v1()

model_file = '../results/nets/shift/net.tf-650000'

gpu = '3'

start_time = time.time()

clf = shift_net.NetClf(pr, model_file, gpu=gpu)
accuray = clf.test_accuracy()

end_time = time.time()
print('pr.test_list:',pr.test_list)
print('model_file:',model_file)
print('pr.do_shift:',pr.do_shift)
print('accuray:',accuray)
print('cost time: {} s'.format(end_time-start_time))

Could you provide the dataset?

Hello, thanks for your great work!
I want to reproduce your work, but I don't see where the dataset is provided. Could you please share your dataset?
Thanks again.

Questions about VoxCeleb2 dataset

Hello, thanks for your great work!
I have been working on this model for a while, but I haven't gotten results as good as those reported in your paper. After checking videos in the VoxCeleb2 dataset, I found that some of them contain audible background noise and are of low quality, while clean reference speech segments are necessary to compute the SDR metric.
I'm wondering whether you selected high-quality videos for the training and test phases, and how?

Download pretrain models

Hello Andrew,

When I run the bash script 'download_models.sh', it fails and returns a Not Found error. Has the link to the pretrained models changed? If so, could you give me the correct link? Thanks a lot!

Questions about the files in ".txt" format used to train the "shift" model

I noticed videos from AudioSet were used to train the "shift" model. In "shift_params.py", there are params such as:
train_list = '/data/ssd1/owens/audioset-vid-v21/small_train.txt',
test_list = '/data/ssd1/owens/audioset-vid-v21/small_train.txt',
train_list = '/data/scratch/owens/audioset-vid-v21/train_tfs.txt',
test_list = '/data/scratch/owens/audioset-vid-v21/test_tfs.txt',
However, I didn't find any such ".txt" files on the website https://research.google.com/audioset/.
Could you provide those ".txt" files used to train the "shift" model? Thank you for your help! Looking forward to your reply :)
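
For what it's worth, those .txt files do not come from the AudioSet site; judging from how they are used in shift_params.py, they appear to be plain lists of the authors' own pre-processed TFRecord (.tf) paths, one per line (this is an assumption about the format). If you build your own .tf files, generating such a list is straightforward:

import glob

# Assumed format: one TFRecord path per line. Point the glob at your own .tf files.
tf_files = sorted(glob.glob('/path/to/my-audioset-tfs/*.tf'))
with open('train_tfs.txt', 'w') as f:
    f.write('\n'.join(tf_files) + '\n')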
