flownet2-pytorch's Introduction

flownet2-pytorch

Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks.

Multiple-GPU training is supported, and the code provides examples for training and inference on the MPI-Sintel clean and final datasets. The same commands can be used with other datasets. See below for more details.

Inference using fp16 (half-precision) is also supported.

For more help, type

python main.py --help

Network architectures

Below are the different FlowNet neural network architectures that are provided; a minimal instantiation sketch follows the list.
A batch-norm version of each network is also available.

  • FlowNet2S
  • FlowNet2C
  • FlowNet2CS
  • FlowNet2CSS
  • FlowNet2SD
  • FlowNet2
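
For reference, here is a minimal instantiation sketch. The args fields shown (rgb_max, fp16) and the (batch, 3, 2, height, width) input layout are assumptions read off models.py, so check the source for the exact signatures:

    import argparse
    import torch
    import models  # models.py from this repository

    # Hypothetical args namespace; the constructors read fields such as these.
    args = argparse.Namespace(fp16=False, rgb_max=255.0)
    net = models.FlowNet2(args).cuda()
    net.eval()

    # A pair of RGB frames stacked along dim 2: (batch, 3, 2, height, width).
    pair = torch.randn(1, 3, 2, 384, 512).cuda()
    flow = net(pair)  # predicted flow, roughly (batch, 2, height, width)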

Custom layers

The FlowNet2 and FlowNet2C* architectures rely on the custom layers Resample2d and Correlation.
A PyTorch implementation of these layers with CUDA kernels is available in ./networks; a usage sketch follows below.
Note: half-precision kernels are currently not available for these layers.
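
A hedged usage sketch (the constructor arguments are assumptions based on how FlowNetC.py appears to call these layers; the definitive signatures live in ./networks):

    import torch
    from networks.correlation_package.modules.correlation import Correlation
    from networks.resample2d_package.modules.resample2d import Resample2d

    # Correlation between two feature maps -> a matching cost volume.
    feat_a = torch.randn(1, 256, 48, 64).cuda()
    feat_b = torch.randn(1, 256, 48, 64).cuda()
    corr = Correlation(pad_size=20, kernel_size=1, max_displacement=20,
                       stride1=1, stride2=2, corr_multiply=1).cuda()
    cost_volume = corr(feat_a, feat_b)

    # Resample2d warps an image by a flow field.
    img = torch.randn(1, 3, 384, 512).cuda()
    flow = torch.randn(1, 2, 384, 512).cuda()
    warped = Resample2d().cuda()(img, flow)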

Data Loaders

Data loaders for FlyingChairs, FlyingThings, ChairsSDHom, and ImagesFromFolder are available in datasets.py; a hedged instantiation sketch follows.
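
A minimal sketch (the args fields crop_size and inference_size, and the return layout, are assumptions based on datasets.py, so verify against the source):

    import argparse
    import datasets  # datasets.py from this repository

    # Hypothetical args; the loaders appear to read crop_size / inference_size.
    args = argparse.Namespace(crop_size=[384, 512], inference_size=[-1, -1])
    data = datasets.ImagesFromFolder(args, is_cropped=False,
                                     root='/path/to/frames', iext='png')
    images, flow = data[0]  # a frame pair and a (dummy) flow target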

Loss Functions

L1 and L2 losses with multi-scale support are available in losses.py.
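
For intuition, here is a minimal sketch of a multi-scale L1 flow loss (illustrative only; the weights and downsampling scheme are not taken from losses.py):

    import torch
    import torch.nn.functional as F

    def multiscale_l1(flow_preds, flow_gt,
                      weights=(0.32, 0.08, 0.02, 0.01, 0.005)):
        # Downsample the ground-truth flow to each prediction's resolution
        # and accumulate weighted L1 terms over the pyramid.
        total = 0.0
        for w, pred in zip(weights, flow_preds):
            gt = F.adaptive_avg_pool2d(flow_gt, pred.shape[-2:])
            total = total + w * (pred - gt).abs().mean()
        return total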

Installation

# get flownet2-pytorch source
git clone https://github.com/NVIDIA/flownet2-pytorch.git
cd flownet2-pytorch

# install custom layers
bash install.sh

Python requirements

Currently, the code supports Python 3.

  • numpy
  • PyTorch (== 0.4.1; for <= 0.4.0 see branch python36-PyTorch0.4)
  • scipy
  • scikit-image
  • tensorboardX
  • colorama, tqdm, setproctitle

Converted Caffe Pre-trained Models

We've included converted Caffe pre-trained models. Should you use these pre-trained weights, please adhere to the license agreements.
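
Restoring a converted checkpoint looks roughly like this (the 'state_dict' key and the args fields are assumptions based on main.py and models.py, so verify against the source):

    import argparse
    import torch
    import models

    args = argparse.Namespace(fp16=False, rgb_max=255.0)  # hypothetical
    net = models.FlowNet2(args)
    ckpt = torch.load('FlowNet2_checkpoint.pth.tar')
    net.load_state_dict(ckpt['state_dict'])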

Inference

# Example on MPISintel Clean   
python main.py --inference --model FlowNet2 --save_flow --inference_dataset MpiSintelClean \
--inference_dataset_root /path/to/mpi-sintel/clean/dataset \
--resume /path/to/checkpoints 

Training and validation

# Example on MPISintel Final and Clean, with L1Loss on FlowNet2 model
python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 \
--training_dataset MpiSintelFinal --training_dataset_root /path/to/mpi-sintel/final/dataset  \
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset

# Example on MPISintel Final and Clean, with MultiScale loss on FlowNet2C model 
python main.py --batch_size 8 --model FlowNet2C --optimizer=Adam --optimizer_lr=1e-4 --loss=MultiScale --loss_norm=L1 \
--loss_numScales=5 --loss_startScale=4 --crop_size 384 512 \
--training_dataset FlyingChairs --training_dataset_root /path/to/flying-chairs/dataset  \
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset

Results on MPI-Sintel

Predicted flows on MPI-Sintel

Reference

If you find this implementation useful in your work, please acknowledge it appropriately and cite the paper:

@InProceedings{IMKDB17,
  author       = "E. Ilg and N. Mayer and T. Saikia and M. Keuper and A. Dosovitskiy and T. Brox",
  title        = "FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks",
  booktitle    = "IEEE Conference on Computer Vision and Pattern Recognition (CVPR)",
  month        = "Jul",
  year         = "2017",
  url          = "http://lmb.informatik.uni-freiburg.de//Publications/2017/IMKDB17"
}
@misc{flownet2-pytorch,
  author = {Fitsum Reda and Robert Pottorff and Jon Barker and Bryan Catanzaro},
  title = {flownet2-pytorch: Pytorch implementation of FlowNet 2.0: Evolution of Optical Flow Estimation with Deep Networks},
  year = {2017},
  publisher = {GitHub},
  journal = {GitHub repository},
  howpublished = {\url{https://github.com/NVIDIA/flownet2-pytorch}}
}

Related Optical Flow Work from Nvidia

Code (in Caffe and PyTorch): PWC-Net
Paper: PWC-Net: CNNs for Optical Flow Using Pyramid, Warping, and Cost Volume.

Acknowledgments

Parts of this code were derived, as noted in the code, from ClementPinard/FlowNetPytorch.

flownet2-pytorch's People

Contributors

bigrobinson, fitsumreda, hellock, huangjunjie2017, jiapei100, jishnujayakumar, liuzhian, mkolod, thirakawa, wandering007, warbean, zsameem


flownet2-pytorch's Issues

L1 loss and EPE become extremely large

Hi, @fitsumreda, thanks for your excellent work.
I trained FlowNet2 under your default settings with 4 GPUs on the Sintel dataset, but the L1 loss and EPE become much larger after several epochs. What is wrong with the training process?
The training command is
"python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 --training_dataset MpiSintelFinal --training_dataset_root data/sintel/training --validation_dataset MpiSintelClean --validation_dataset_root data/sintel/training"

Confused about your inference mode

Sorry to impose; I was confused when trying to extract flow images from my custom dataset (or just a few frames). After I figured out the details of your code, it seems only a limited set of datasets is supported? That is not very friendly to use.
Would you consider writing a test.py as a friendly API usage example?

Pretrained model

Hello!

Thanks for your great implementation!

Could you publish pretrained models, please, if possible?

About the MPI-Sintel datasets

In your command:

# Example on MPISintel Final and Clean, with MultiScale loss on FlowNet2C model 
python main.py --batch_size 8 --model FlowNet2C --optimizer=Adam --optimizer_lr=1e-4 --loss=MultiScale --loss_norm=L1 \
--loss_numScales=5 --loss_startScale=4 --optimizer_lr=1e-4 --crop_size 384 512 \
--training_dataset FlyingChairs --training_dataset_root /path/to/flying-chairs/dataset  \
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset

We need to pass the path of the MPI-Sintel dataset, so I used the absolute path to the clean folder under the train folder. But the image files and flow files are stored separately.

Here is the original folder tree of the MPI-Sintel dataset. I am wondering whether I need to rearrange the files, since the images and .flo files are stored separately and your code does not seem to handle the original tree structure. How should I change this?

(screenshot of the MPI-Sintel folder tree)

I keep getting an error saying "list index out of range"

When I try to run flownet2 with this command

python main.py --batch_size 8 --model FlowNet2C --optimizer=Adam --optimizer_lr=1e-4 --loss=MultiScale --loss_norm=L1
--loss_numScales=5 --loss_startScale=4 --optimizer_lr=1e-4 --crop_size 384 512
--training_dataset FlyingChairs --training_dataset_root /home/projects/flownet2-pytorch/FlyingChairs/FlyingChairs_release/data
--validation_dataset MpiSintelClean --validation_dataset_root /home/projects/flownet2-pytorch/MPI-Sintel-complete/training/clean

But I keep getting an error saying

File "main.py", line 140, in
validation_dataset = args.validation_dataset_class(args, True, **tools.kwargs_from_args(args, 'validation_dataset'))
File "/home/projects/flownet2-pytorch/datasets.py", line 108, in init
super(MpiSintelClean, self).init(args, is_cropped = is_cropped, root = root, dstype = 'clean', replicates = replicates)
File "/home/projects/flownet2-pytorch/datasets.py", line 66, in init
self.frame_size = frame_utils.read_gen(self.image_list[0][0]).shape
IndexError: list index out of range

How should I resolve this issue?

How to understand "training for multiple flows" and "div_flow"

issue 1

In FlowNetC and FlowNetS, we return a list of flows during training:

        if self.training:
            return flow2,flow3,flow4,flow5,flow6

So do we need to compute the EPE loss for all five flow predictions? If so, could you point out where we do this for the five predicted flow maps, and where we scale the ground-truth flow to match the scales of the five predictions?

As far as I can tell, main.py only has a class like the one below:

        class ModelAndLoss(nn.Module):
            def __init__(self, args):
                super(ModelAndLoss, self).__init__()
                kwargs = tools.kwargs_from_args(args, 'model')
                self.model = args.model_class(args, **kwargs)
                kwargs = tools.kwargs_from_args(args, 'loss')
                self.loss = args.loss_class(args, **kwargs)
                
            def forward(self, data, target, inference=False ):
                output = self.model(data)

                loss_values = self.loss(output, target)

                if not inference :
                    return loss_values
                else :
                    return loss_values, output

        model_and_loss = ModelAndLoss(args)

issue 2

I find that there is a factor "div_flow" set to 20. Could you explain how to understand this value and why 20 was chosen?

Thanks for your kind help!

forward() takes exactly 2 arguments (3 given)

The error information is like this:

TypeError: forward() takes exactly 2 arguments (3 given)


Besides, I am wondering about one line (line #142) in main.py

flownets2_flow2 = self.flownets_2(concat2, '2')[0]

What does '2' represent?

Torch multiprocessing in docker

I encountered the following error:

Error: RuntimeError: unable to write to file </torch_18693_1954506624> at /pytorch/torch/lib/TH/THAllocator.c:271

Adding the flag --ipc=host to the last line of launch-docker.sh, right after sudo nvidia-docker run, fixed the issue.

See the PyTorch README.md:

Please note that PyTorch uses shared memory to share data between processes, so if torch multiprocessing is used (e.g. for multithreaded data loaders) the default shared memory segment size that container runs with is not enough, and you should increase shared memory size either with --ipc=host or --shm-size command line options to nvidia-docker run.
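
For example (the image name below is hypothetical; adapt it to whatever launch-docker.sh builds):

sudo nvidia-docker run --ipc=host --rm -ti flownet2-pytorch /bin/bash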

Performance?

Hi, thanks for sharing the code.

I wonder if you can also share the performance numbers of the pre-trained models on the benchmarks (e.g. KITTI, Sintel, etc.). Seems that you did not use data augmentation, so I am actually quite curious about the performance.

multi-gpu training

Hi,

Thanks a lot for providing the open-source code!
It seems that multi-GPU training (I set the argument num_gpus to 4, so 4 GPUs are used) is not faster than a single GPU.
Could you provide some suggestions?
I highly appreciate your time and help!

Best,
Lili

gpumemusage [error for multi-gpu]

Thanks for sharing your code.
I noticed in the README you said "Multiple GPU training is supported".
But I think this function only supports a single-GPU machine. I added some code for my personal use, but I think there are better solutions.

import math
import subprocess


def gpumemusage():
    # Parse `nvidia-smi` output into an overall GPU memory utilization string.
    gpu_mem = subprocess.check_output(
        "nvidia-smi | grep MiB | cut -f 3 -d '|'", shell=True).decode()
    gpu_mem = gpu_mem.replace(' ', '').replace('\n', '').replace('i', '') \
                     .replace('MB', 'MB/').replace('//', '/')
    gpu_mem = gpu_mem[:-1]
    # Alternating used/total values (in MB), one pair per GPU.
    gpu_info = [float(a[:-2]) for a in gpu_mem.split('/')]
    curr = sum(gpu_info[0::2])
    tot = sum(gpu_info[1::2])
    util = "%1.2f" % (100 * curr / tot) + '%'
    cmem = str(int(math.ceil(curr / 1024.))) + 'GB'
    gmem = str(int(math.ceil(tot / 1024.))) + 'GB'
    return util + '--' + cmem + '/' + gmem

Extremely slow for training Flownet2C

I'm trying to train the FlowNet2 components separately. Training flownet2C is very slow, around 22 s/batch with batch_size=32 on 2 P40 GPUs, while flownet2S takes 1.5 s/batch. I don't know if there is a problem; I would like to know how fast you can train FlowNet2C or FlowNet2.

trying to run in inference mode

Trying to run:
main.py --inference --model FlowNet2 --save_flow --inference_dataset MpiSintelClean --inference_dataset_root /../frame_00 --resume ./checkpoints/flownet2

Results in error:
File "main.py", line 396, in
stats = inference(args=args, epoch=epoch - 1, data_loader=inference_loader, model=model_and_loss, offset=offset)
NameError: name 'inference_loader' is not defined

where is inference_loader definded?

ImportError: /flownet2-pytorch/networks/resample2d_package/_ext/resample2d/_resample2d.so: undefined symbol: _Py_Dealloc

root@c38c6ed6d9fa:/flownet2-pytorch# python main.py --help
Traceback (most recent call last):
File "main.py", line 16, in <module>
import models, losses, datasets
File "/flownet2-pytorch/models.py", line 8, in <module>
from networks.resample2d_package.modules.resample2d import Resample2d
File "/flownet2-pytorch/networks/resample2d_package/modules/resample2d.py", line 3, in <module>
from ..functions.resample2d import Resample2dFunction
File "/flownet2-pytorch/networks/resample2d_package/functions/resample2d.py", line 3, in <module>
from .._ext import resample2d
File "/flownet2-pytorch/networks/resample2d_package/_ext/resample2d/__init__.py", line 3, in <module>
from ._resample2d import lib as _lib, ffi as _ffi
ImportError: /flownet2-pytorch/networks/resample2d_package/_ext/resample2d/_resample2d.so: undefined symbol: _Py_Dealloc

Can I remove the "grad_output.is_contiguous() == True"?

I hit a bug reporting that grad_output.is_contiguous() is False within correlation.py. Can I comment out this line? What is the effect of this line?


def backward(self, grad_output):
    input1, input2 = self.saved_tensors
    assert(grad_output.is_contiguous() == True)
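
A common workaround (an assumption on my part, not a confirmed fix for this repo) is to make the incoming gradient contiguous rather than deleting the assert:

    def backward(self, grad_output):
        # Make the gradient contiguous before the custom CUDA kernel sees it.
        grad_output = grad_output.contiguous()
        input1, input2 = self.saved_tensors
        assert grad_output.is_contiguous()
        # ... rest of the original backward unchanged ...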

Inconsistent tensor sizes

There is a bug when the input image size is not a power of two.

concat4 = torch.cat((out_conv4, out_deconv4, flow5_up), 1)

RuntimeError: inconsistent tensor sizes at /opt/conda/conda-bld/pytorch_1503966894950/work/torch/lib/THC/generic/THCTensorMath.cu:141

Python3 version

I ran into some problems when I tried to run the code under Python 3.

If the code only supports Python 2 right now, how can I upgrade it?

$ python3 main.py
Traceback (most recent call last):
  File "main.py", line 13, in <module>
    import FlowNetC
  File "/notebooks/data/vinet/FlowNetC.py", line 8, in <module>
    from correlation_package.modules.correlation import Correlation
  File "/notebooks/data/vinet/correlation_package/modules/correlation.py", line 3, in <module>
    from ..functions.correlation import CorrelationFunction
  File "/notebooks/data/vinet/correlation_package/functions/correlation.py", line 3, in <module>
    from .._ext import correlation
  File "/notebooks/data/vinet/correlation_package/_ext/correlation/__init__.py", line 3, in <module>
    from ._correlation import lib as _lib, ffi as _ffi
ImportError: /notebooks/data/vinet/correlation_package/_ext/correlation/_correlation.so: undefined symbol: PyInt_FromLong

Resample2D and ChannelNorm packages

Hello,

When reading your packages, it seemed to me that two of them are already implemented in the standard PyTorch library.
Specifically, is channelnorm different from computing torch.norm(input, p=2, dim=1, keepdim=True)? (Maybe it's faster?)

And is resample2d different from grid_sample? I think that one is a recent feature (since 0.2.0). It obviously doesn't take flow as input, but it could be used with a combination of affine_grid (just put a zero affine matrix) and easy normalization.
It might be a little slower, though.

Thanks!
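
For what it's worth, here is a minimal sketch of flow-based warping built on grid_sample (my own illustration of the suggested equivalence, not the repo's Resample2d; boundary handling and precision may differ):

    import torch
    import torch.nn.functional as F

    def warp(x, flow):
        # Warp an image batch x (N, C, H, W) by flow (N, 2, H, W).
        n, _, h, w = x.size()
        # Base sampling grid in pixel coordinates.
        xs = torch.arange(0, w, device=x.device).view(1, 1, w).expand(n, h, w).float()
        ys = torch.arange(0, h, device=x.device).view(1, h, 1).expand(n, h, w).float()
        # Displace by the flow (channel 0 horizontal, channel 1 vertical).
        xs = xs + flow[:, 0]
        ys = ys + flow[:, 1]
        # Normalize to [-1, 1], the coordinate range grid_sample expects.
        xs = 2.0 * xs / max(w - 1, 1) - 1.0
        ys = 2.0 * ys / max(h - 1, 1) - 1.0
        grid = torch.stack((xs, ys), dim=3)  # (N, H, W, 2)
        return F.grid_sample(x, grid)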

Size of tensors does not match at the inference step

The command I used for the inference step:

python main.py --crop_size 384 512 --inference --model FlowNet2C --save_flow --inference_dataset MpiSintelClean --inference_dataset_root /home/tushar/data/tushar/MPI-Sintel-post/training --resume ./work/FlowNet2C_train-checkpoint.pth.tar

And a terminal screenshot is attached.

Should we convert the input image from RGB to BGR?

The official Caffe version implementation converts it.
img0 = misc.imread(args.img0)
if len(img0.shape) < 3:
    input_data.append(img0[np.newaxis, np.newaxis, :, :])
else:
    input_data.append(img0[np.newaxis, :, :, :].transpose(0, 3, 1, 2)[:, [2, 1, 0], :, :])
But this pytorch implementation doesn't.
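
If the conversion turns out to be needed, the channel swap is a one-liner on a batched tensor (variable names illustrative):

    import torch

    rgb = torch.randn(1, 3, 384, 512)  # (N, C, H, W)
    bgr = rgb[:, [2, 1, 0], :, :]      # swap channel order RGB <-> BGR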

Is the resample2d function implemented correctly?

Dear author.
I happened to see the resample2d implementation in "flownet2-pytorch/networks/resample2d_package/src/Resample2d_kernel.cu". At line #64, the output is assigned the sum of values inside a region. Would a weighted sum be more reasonable?

correlation function runs very slow

Hi, in training FlowNetC, the correlation layer runs very slowly, about 5.5 s per batch (batch size = 8) on a Titan X. I don't know whether there is a problem; could you give me some advice for faster training? Thanks!

main.py: error: argument insta--schedule_lr_frequency: invalid int value: 'MpiSintelClean'

When I try to run the code as instructed:

python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4
--training_dataset MpiSintelFinal --training_dataset_root /path/to/mpi-sintel/final/dataset
--validation_dataset MpiSintelClean --validation_dataset_root /path/to/mpi-sintel/clean/dataset

with the paths changed to my dataset paths,

I get an error saying

main.py: error: argument insta--schedule_lr_frequency: invalid int value: 'MpiSintelClean'

which prevents me from proceeding.
Can anyone help me solve this issue?

performance about flownet2

Hi, I converted the pretrained caffemodel to PyTorch and tested with it. The visual results from Caffe are better than those from PyTorch. Have you seen similar problems? Thank you.

Inference with FlowNetC model

Hi, I have trained the FlowNet2C model and want to test it, but I run into the following errors.

Here is the command I am executing:

python main.py --inference --model FlowNet2C --save_flow --inference_dataset MpiSintelClean \
--inference_dataset_root /path/to/mpi-sintel/clean/dataset \
--resume /path/to/checkpoints 

Could you help me?

(error screenshot attached)

Models

I downloaded models from "Converted Caffe Pre-trained Models", but I cannot open them.
When I extract the file, I get "An error occurred while loading the archive".
Has anyone else run into this issue?

Could you provide the visualization code for the *.flo files?

Hi, thanks a lot for your help. I used to have a copy of visualization code, but its color distribution is not exactly the same as your demo's.

So I am wondering whether you could share the visualization code for your demo pictures, like the ones in your YouTube demo.

Error when running bash install.sh

Compiling correlation kernels by nvcc...
In file included from /home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/utils/ffi/../../lib/include/THC/THC.h:4:0,
from _correlation.c:493:
/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/utils/ffi/../../lib/include/THC/THCGeneral.h:9:18: fatal error: cuda.h: No such file or directory
#include "cuda.h"
^
compilation terminated.
Traceback (most recent call last):
File "build.py", line 31, in <module>
ffi.build()
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/utils/ffi/__init__.py", line 167, in build
_build_extension(ffi, cffi_wrapper_name, target_dir, verbose)
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/utils/ffi/__init__.py", line 103, in _build_extension
ffi.compile(tmpdir=tmpdir, verbose=verbose, target=libname)
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/cffi/api.py", line 697, in compile
compiler_verbose=verbose, debug=debug, **kwds)
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/cffi/recompiler.py", line 1520, in recompile
compiler_verbose, debug)
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/cffi/ffiplatform.py", line 22, in compile
outputfilename = _build(tmpdir, ext, compiler_verbose, debug)
File "/home/zhuyisheng/anaconda3/envs/py27/lib/python2.7/site-packages/cffi/ffiplatform.py", line 58, in _build
raise VerificationError('%s: %s' % (e.__class__.__name__, e))
cffi.error.VerificationError: CompileError: command 'gcc' failed with exit status 1

error in Correlation_forward_cuda_kernel: invalid device function

When I tried to run python main.py for inference, I got the following error:

error in Correlation_forward_cuda_kernel: invalid device function
Traceback (most recent call last):
  File "main.py", line 397, in <module>
    stats = inference(args=args, epoch=epoch - 1, data_loader=inference_loader, model=model_and_loss, offset=offset)
  File "main.py", line 361, in inference
    losses, output = model(data[0], target[0], inference=True)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/parallel/data_parallel.py", line 66, in forward
    return self.module(*inputs[0], **kwargs[0])
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "main.py", line 165, in forward
    output = self.model(data)
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ekwan/flownet2-pytorch/models.py", line 118, in forward
    flownetc_flow2 = self.flownetc(x)[0]
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ekwan/flownet2-pytorch/networks/FlowNetC.py", line 86, in forward
    out_corr = self.corr(out_conv3a, out_conv3b) # False
  File "/usr/local/lib/python2.7/dist-packages/torch/nn/modules/module.py", line 325, in __call__
    result = self.forward(*input, **kwargs)
  File "/home/ekwan/flownet2-pytorch/networks/correlation_package/modules/correlation.py", line 17, in forward
    result = CorrelationFunction(self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)(input1, input2)
  File "/home/ekwan/flownet2-pytorch/networks/correlation_package/functions/correlation.py", line 30, in forward
    self.pad_size, self.kernel_size, self.max_displacement,self.stride1, self.stride2, self.corr_multiply)
  File "/usr/local/lib/python2.7/dist-packages/torch/utils/ffi/__init__.py", line 180, in safe_call
    result = torch._C._safe_call(*args, **kwargs)
torch.FatalError: aborting at /home/ekwan/flownet2-pytorch/networks/correlation_package/src/correlation_cuda.c:88

It turns out that in the three make.sh scripts the architecture is hardcoded to -arch=sm_52, and in my case it needs to be -arch=sm_37. It's probably a good idea to update the documentation to save people time on troubleshooting.

Why use only the first loss value for weight updating?

According to this line, only the first (largest-scale) loss is selected for the weight update. Is this done on purpose for some special reason? Or is this how the original paper implemented it? If so, what's the point of calculating the loss at the other scales? Thanks!

UPDATE: sorry, I misunderstood the comment on that line. The losses at multiple scales actually have been added together. This issue is resolved.

training and validation problem

Hello, in the process of training and validation I ran into a problem: as the epochs go on, the L1 loss and EPE become larger and larger, i.e. training mathematically diverges, which confuses me.
I don't know where the problem is. When I run the inference process it works well, so can you help me?

1:
My training command is 😀

python main.py --batch_size 8 --model FlowNet2 --loss=L1Loss --optimizer=Adam --optimizer_lr=1e-4 \
--training_dataset MpiSintelFinal --training_dataset_root MPI-Sintel-complete/training/ \
--validation_dataset MpiSintelClean --validation_dataset_root MPI-Sintel-complete/training/

2:
Is the command wrong? Is the path not right?

3:
Excerpts from the training and validation logs:
Training Epoch 16 L1: 20.014, EPE: 32.409, lr: 1.0e-04
...
Training Epoch 16 L1: 12.046, EPE: 18.704, lr: 1.0e-04

Training Epoch 23 L1: 1120820.625, EPE: 1789642.750, lr: 1.0e-04
...
Training Epoch 23 L1: 1172614.375, EPE: 1871375.375, lr: 1.0e-04

Training Epoch 30 L1: 7526267.000, EPE: 11999885.000, lr: 1.0e-04
...
Training Epoch 30 L1: 16797780.000, EPE: 26775512.000, lr: 1.0e-04

Validating Epoch 30 L1: 17581342.000, EPE: 28024272.000
...
Validating Epoch 30 L1: 17461886.000, EPE: 27834034.000

RuntimeError: Given groups=1, weight[64, 3, 7, 7], so expected input[1, 2, 436, 1024] to have 3 channels, but got 2 channels instead

Hello,

I was trying to run the inference code of FlowNet 2.0, using the FlowNet2C pretrained model you provided, on the MPI-Sintel dataset with the following command (from PyCharm): /data/users/milan/flownet2-pytorch/main.py --inference --model FlowNet2C --save_flow --inference_dataset MpiSintelClean --inference_dataset_root /data/users/milan/MPI-Sintel/training --resume /data/users/milan/flownet2-pytorch/FlowNet2-C_checkpoint.pth.tar

But eventually I get the error shown in the attached screenshots. Any advice on this?

Many thanks in advance!

(screenshots flownet_err_1 and flownet_err_2 attached)

About the Super Huge Training Cost

My device is :

Ubuntu16.04
SSD 850EVO
GPU: 2 x TITANX Pascal

The forward&backward cost for each iteration is 1.5s
It seems that your method will continue for 10,000 epoch (each epoch takes 2,859 iterations), so the estimated training cost will be 10,000 x 2,859 x 1.5s = 4.3 x 10^7 s = 12,000 hours = 497 days.

Roughly speaking, it still takes one hour for every epoch even with SSD and 2xTITAN X Pascal GPU.

Which is almost impossible for me to retrain the FlowNet2 from scratch !

It seems the orginal flownet will train for 1.7 x 10^6 iteraions with the S_long method while in your settings it will train for 2.9 x 10^7 iterations.

Could you help me check whether I am missing something? Besides, I am wondering whether you could advise me on how to accelerate training.

Below are my training settings:

(screenshot of my training settings attached)

About "Convert Official Caffe Pre-trained Models to PyTorch"

I have followed your instructions to convert the models but failed!
So I am wondering whether you could share the already-converted models. Besides, I am wondering whether you have implemented the evaluation code to get the EPE performance on the different datasets.

Regards

Why only consider data[0] and target[0] in the forward computation in FlowNet?

It seems that you only consider data[0] and target[0] in the loop of the forward computation:

line #260 to line #265 in main.py

optimizer.zero_grad() if not is_validate else None
losses = model(data[0], target[0])
losses = [torch.mean(loss_value) for loss_value in losses]
loss_val = losses[0]  # Collect first loss for weight update
total_loss += loss_val.data[0]
loss_values = [v.data[0] for v in losses]

Shouldn't we consider all the data and their flows?

Augmentation issues?

I find that in the original version of FlowNet / FlowNet2, the data augmentation techniques include:

translation
rotation
scaling
gaussian noise
contrast
multiplicative color changes

But it seems that only the center-crop augmentation op exists in the current implementation.

Fail to adapt to Pytorch Version 0.3

I am trying to adapt the code to a newer PyTorch version and a higher Python version. I modified the code based on the PyTorch source; in particular, I modified the customized layers as follows.
functions:

class ChannelNormFunction(Function):

    @staticmethod
    def forward(ctx, input1, norm_deg=2):
        # self.save_for_backward(input1)
        ctx.norm_deg = norm_deg
        assert(input1.is_contiguous() == True)
        with torch.cuda.device_of(input1):
            b, _, h, w = input1.size()
            output = input1.new().resize_(b, 1, h, w).zero_()
            ChannelNorm_cuda_forward(input1, output, ctx.norm_deg)
        ctx.save_for_backward(input1, output)
        return output

    @staticmethod
    def backward(ctx, gradOutput):
        input1, output = ctx.saved_tensors
        with torch.cuda.device_of(input1):
            b, c, h, w = input1.size()
            gradInput1 = input1.new().resize_(b,c,h,w).zero_()
            ChannelNorm_cuda_backward(input1, output, gradOutput, gradInput1, ctx.norm_deg)

        return gradInput1, None

modules:

class ChannelNorm(Module):
    def __init__(self, norm_deg=2):
        super(ChannelNorm, self).__init__()
        self.norm_deg = norm_deg

    def forward(self, input1):
        return ChannelNormFunction.apply(input1, self.norm_deg)

With other syntactic modifications, I am able to run the code in inference mode. But I fail to train, because backpropagation is not working correctly. I get this error:

File "main.py", line 434, in
train_loss, iterations = train(args=args, epoch=epoch, start_iteration=global_iteration, data_loader=train_loader, model=model_and_loss, optimizer=optimizer, logger=train_logger, offset=offset)
File "main.py", line 301, in train
loss_val.backward()
File "/home/cdeng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/autograd/variable.py", line 167, in backward
torch.autograd.backward(self, gradient, retain_graph, create_graph, retain_variables)
File "/home/cdeng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/autograd/init.py", line 99, in backward
variables, grad_variables, retain_graph)
File "/home/cdeng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/autograd/function.py", line 91, in apply
return self._forward_cls.backward(self, *args)
File "/home/cdeng/Thesis/flownet2-pytorch/networks/channelnorm_package/functions/channelnorm.py", line 32, in backward
ChannelNorm_cuda_backward(input1, output, gradOutput, gradInput1, ctx.norm_deg)
File "/home/cdeng/anaconda3/envs/py27/lib/python2.7/site-packages/torch/utils/ffi/init.py", line 180, in safe_call
result = torch._C._safe_call(*args, **kwargs)
TypeError: initializer for ctype 'struct THCudaTensor *' must be a cdata pointer, not Variable

Could someone take a look and help me locate and fix the bug?
Many thanks!

Confirmed: correlation layer backward is extremely slow

I tried three experiments:
A: model without any correlation function
B: model A + only the correlation layer forward pass
C: model A + both the correlation layer forward and backward passes

On Ubuntu 14 + Titan Xp with CUDA 8.0, time cost per iteration (including forward and backward), in seconds:
A: 0.6
B: 0.63
C: 4.18

So it seems the backward pass of the correlation layer is extremely slow, and I don't get why, since the forward pass seems fine. Does anyone know how to accelerate the backward pass of the correlation layer?

Is there a plan to port the data augmentation layer from caffe to pytorch?

Hello. Thanks for sharing the code. I am training FlowNet 1.0 on FlyingChairs using PyTorch. However, without data augmentation, overfitting happens. Is there a plan to port the data augmentation layer from Caffe to PyTorch? As there are so many parameters for the different kinds of noise in the augmentation, I still can't find a good way to do it.

Is there any plan to develop 1D correlation on x-axis?

Hi. I found this work excellent; it solves most of our problems.
I just wonder whether you have any plan to develop a 1D correlation along the x-axis, as well as a 1D resample? It would be very useful for depth estimation.
Thank you very much.

Tips for using for the task of Action detection

I want to use FlowNet 2.0 to train a two-stream model for action detection (spatio-temporal localization), and I want to achieve real-time performance. I have 2 questions:

  1. Which network do you recommend using, FlowNetS or FlowNetSD? I assume SD should be better, since it was designed to give better results on realistic datasets such as UCF101, which I will be using for my task.

  2. If I want to fine-tune the network for my task, what learning rate do you think would be reasonable (I was thinking around 1e-7)?

Is cropping implemented correctly?

Hi, I found that random cropping is implemented such that each call of the function crops a different region (datasets.py -> StaticRandomCrop). That seems unreasonable to me. For each example we should use random cropping, but the left/right/flow images should be cropped in the same way, as in the sketch below.
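
For reference, a crop that samples its offsets once (per example) and then slices every array identically would look roughly like this (a sketch with a hypothetical class name, not necessarily what datasets.py does):

    import random

    class PairedRandomCrop(object):
        # Sample the crop offset once, then crop images and flow identically.
        def __init__(self, image_size, crop_size):
            h, w = image_size
            self.th, self.tw = crop_size
            self.h1 = random.randint(0, h - self.th)
            self.w1 = random.randint(0, w - self.tw)

        def __call__(self, img):  # img: H x W x C array
            return img[self.h1:self.h1 + self.th, self.w1:self.w1 + self.tw]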
