Hi lltcggie, Could you please add an option in the interface so that


Specify the device in multiple GPUs system about waifu2x-caffe HOT 10 CLOSED

lltcggie commented on August 28, 2024

Specify the device in multiple GPUs system

from waifu2x-caffe.

Comments (10)

lltcggie commented on August 28, 2024

I've made it possible to specify which GPU to use.


chungexcy commented on August 28, 2024

@lltcggie Speaking of video processing, is it possible to merge several input images into the four-dimensional data blob structure during inference? Would this improve efficiency compared to processing a single image at a time?


lltcggie commented on August 28, 2024

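So you mean it would be more efficient to process multiple images at the same time, right?
This is a guess, but I don't think it would differ much from increasing crop_size.
In particular, when processing multiple images at once, crop_size has to be reduced because of the VRAM capacity.
In other words, speeding things up by processing multiple images and speeding up each image by increasing crop_size are in a trade-off, so I don't think it is worth the effort to implement.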


chungexcy commented on August 28, 2024

Yes, processing multiple images doesn't make much sense. And combining it into their framework might be a totally different story.

As for the crop size you mentioned, it is true that a larger size gives faster speed. I ran a denoising experiment on three generated 3072*3072 images. VRAM used by other applications plus the Caffe model is around 300MB. The table below shows the elapsed time and total VRAM on my GTX960.

             ------- cuDNN -------    ------- CUDA --------
crop-size    VRAM       TIME          VRAM       TIME
128           403MB     14.394s        658MB     23.173s
256           499MB     12.693s       1537MB     20.579s
384           654MB     12.391s       2976MB     20.110s
512           870MB     12.297s
1024         2340MB     12.235s

Although we can get an improvement by increasing the crop size, the improvement levels off. What's really important is that real images have different widths and heights, especially at video resolutions. In most cases, the crop size does not divide both the width and the height evenly, so a very large crop can waste a lot of computation on some data blobs (the rightmost and bottommost patches).

For example, 384 is the best for denoising 1920-by-1080 (which requires processing 1920-by-1152), and 432 is the best for denoising 3840-by-2160 (which requires processing 3888-by-2160, better than 384 here, which requires 3840-by-2304). I believe setting the crop size for width and height separately might be the best solution to this problem.
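The padded sizes quoted above can be reproduced with a few lines of arithmetic. The sketch below is only an illustration: it assumes the image is tiled with square crops starting at the top-left corner and ignores any extra border each tile may need for the network, and the function names are made up for this example rather than taken from waifu2x-caffe.

```python
def padded_size(width, height, crop):
    """Size actually processed when the image is tiled with square crops.

    Each dimension is rounded up to the next multiple of `crop`
    (integer ceiling division, so no floating-point rounding is involved).
    """
    padded_w = -(-width // crop) * crop
    padded_h = -(-height // crop) * crop
    return padded_w, padded_h


def wasted_fraction(width, height, crop):
    """Fraction of the processed area that is padding rather than image."""
    padded_w, padded_h = padded_size(width, height, crop)
    return 1 - (width * height) / (padded_w * padded_h)


# Resolutions and crop sizes taken from the comment above.
cases = {(1920, 1080): (384, 512), (3840, 2160): (384, 432)}
for (w, h), crops in cases.items():
    for crop in crops:
        pw, ph = padded_size(w, h, crop)
        waste = wasted_fraction(w, h, crop)
        print(f"{w}x{h} crop={crop}: processes {pw}x{ph}, {waste:.1%} padding")
```

Under these assumptions, a crop size that divides both dimensions evenly has zero padding, which is the motivation for choosing the width and height crops independently.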

Another thought is still about the batch size.

In another Caffe CNN project of mine, FCN training (forward+backward on 500-by-300 images) on a dual Titan X server, I observed a 20% speed-up with batch-size=10 compared to batch-size=2 (that is, batch-size=5 versus batch-size=1 on each card). This improvement might be more pronounced in a dual-GPU system and during training, but a single GPU may still benefit even in forward-only mode. This might be one reason why AlexNet and SRCNN use batch-size=128 in training.

Since you already crop the image into smaller patches, and each patch only needs 300MB, I believe it's worth trying batch-size>1 by combining 2 or 4 patches into one four-dimensional data blob, running the forward pass, and recovering them from the four-dimensional output. This way the VRAM stays under control, and we don't need multiple images.
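The idea of packing patches into one four-dimensional blob can be sketched with NumPy. This is a minimal illustration, not waifu2x-caffe's implementation: `split_into_patches`, `run_in_batches`, and `fake_forward` are hypothetical names, and `fake_forward` is an identity stand-in for the real network's forward pass.

```python
import numpy as np


def split_into_patches(image, crop):
    """Pad a C x H x W image up to a multiple of `crop` and cut it into tiles."""
    c, h, w = image.shape
    ph = -(-h // crop) * crop  # height rounded up to a multiple of crop
    pw = -(-w // crop) * crop  # width rounded up to a multiple of crop
    padded = np.pad(image, ((0, 0), (0, ph - h), (0, pw - w)), mode="edge")
    return [padded[:, y:y + crop, x:x + crop]
            for y in range(0, ph, crop)
            for x in range(0, pw, crop)]


def run_in_batches(patches, forward, batch_size):
    """Stack patches into N x C x H x W blobs, run `forward`, then unstack."""
    outputs = []
    for i in range(0, len(patches), batch_size):
        blob = np.stack(patches[i:i + batch_size])  # shape (n, C, crop, crop)
        result = forward(blob)                      # one forward pass per blob
        outputs.extend(result[j] for j in range(result.shape[0]))
    return outputs


def fake_forward(blob):
    """Hypothetical stand-in for the network: returns the input unchanged."""
    return blob * 1.0


image = np.random.rand(3, 100, 150).astype(np.float32)
patches = split_into_patches(image, crop=64)
outputs = run_in_batches(patches, fake_forward, batch_size=4)
print(len(patches), "patches processed in blobs of up to 4")
```

The last blob may hold fewer patches than `batch_size`; the loop handles that by slicing, so VRAM use is bounded by `batch_size` patches at a time.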

I believe the efficiency of waifu2x-caffe can be further optimized, since I have seen other CNNs drawing more power on my GPU ;)


chungexcy commented on August 28, 2024

Maybe Caffe and cuDNN themselves do a better job along the batch-size dimension than along the width and height dimensions. Anyway, it's possible to try optimizing it this way :)


nagadomi commented on August 28, 2024

cudnnConvolutionForward has been optimized for batch size = 1 (from cuDNN v4 Release Notes).

But in my experiments, the -batch_size option improves the processing time by 0~13% on cuDNN v4 and Torch7.


lltcggie commented on August 28, 2024

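I see, so by "four-dims data blob" you meant the batch.
I had forgotten about the channel axis, so I mistakenly thought you were referring to something else.
The command-line version lets you specify the batch size, so please give it a try if you like.
Incidentally, when I experimented on a GTX 660 (Kepler generation) in the past, increasing the batch size did not improve the processing speed, so I decided the GUI version did not need a batch size field.
However, the result may differ depending on the GPU generation or the cuDNN version.
If increasing the batch size does make it faster, I would like to make the batch size configurable in the GUI as well.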


chungexcy commented on August 28, 2024

I tried the CUI version. I have to say that batch-size sometimes helps and sometimes doesn't. Also, this benchmark on the CUI version is not reliable or comparable, because it is much slower than the GUI version (maybe the Net initialization and Net teardown on every run cost too much time) and the elapsed time fluctuates a little in the CUI version.

3 pictures of 3072*3072 => 6144x6144
                CUI version
crop-size = 256
batch-size= 1   Duration : 00:00:52,36
batch-size= 2   Duration : 00:00:51,16**
batch-size= 4   Duration : 00:00:51,22
batch-size= 8   Duration : 00:00:51,16**

crop-size = 512
batch-size= 1   Duration : 00:00:50,89
batch-size= 2   Duration : 00:00:50,86
batch-size= 3   Duration : 00:00:50,70**
batch-size= 4   Duration : 00:00:51,00
batch-size= 6   Duration : 00:00:51,25

crop-size = 1024
batch-size= 1   Duration : 00:00:50,75**

10 pictures of 720x480 => 1440x960
                CUI version                 GUI version
crop-size = 240
batch-size= 1   Duration : 00:00:15,72      00:00:06.156

crop-size = 320 (1600 x  960 = 1536000)
batch-size= 1   Duration : 00:00:16,47      00:00:06.687

crop-size = 360 (1440 x 1080 = 1555200)
batch-size= 1   Duration : 00:00:16,38      00:00:06.703

crop-size = 480
batch-size= 1   Duration : 00:00:15,27**    00:00:05.875**
batch-size= 2   Duration : 00:00:15,72
batch-size= 3   Duration : 00:00:16,32

10 pictures of 1920x1080, denoising at 1920x1080
                CUI version                 GUI version
crop-size = 384 (1920 x 1152)
batch-size= 1   Duration : 00:00:19,79**    00:00:09.750**
batch-size= 2   Duration : 00:00:20,28

crop-size = 120
batch-size= 1   Duration : 00:00:20,41      00:00:10.610
batch-size= 2   Duration : 00:00:19,91
batch-size= 3   Duration : 00:00:19,19
batch-size= 4   Duration : 00:00:19,14**
batch-size= 6   Duration : 00:00:19,36
batch-size= 8   Duration : 00:00:19,52
batch-size=12   Duration : 00:00:19,69

In the last example (1920x1080), if the crop size does not divide the width or height evenly, we waste a lot of calculation. Processing a single large image may not care about this, but for the thousands of frames in video processing, this amount of time adds up. crop-size=120 is too small and cannot fully utilize the GPU with batch-size=1. Yet even crop-size=120, with some waste from padding(?), is faster than crop-size=384 once batch-size=4 is used. crop-size=384 wastes about 1/16 of the calculation (72 of the 1152 processed rows are padding).
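To put the CUI timings above on a common scale, the durations can be converted to seconds and compared as a percentage. The helper names below are hypothetical and exist only for this sketch; the duration strings are copied from the listing above.

```python
def parse_duration(text):
    """Convert 'HH:MM:SS,ff' or 'HH:MM:SS.fff' into seconds.

    Trailing '**' markers (used above to flag the best time) are ignored,
    and a decimal comma is treated like a decimal point.
    """
    hours, minutes, seconds = text.strip("*").replace(",", ".").split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)


def speedup_percent(baseline, candidate):
    """Percentage of time saved by `candidate` relative to `baseline`."""
    b = parse_duration(baseline)
    c = parse_duration(candidate)
    return (b - c) / b * 100


# 1920x1080 denoising, crop-size=120: batch-size=1 versus batch-size=4.
print(f"{speedup_percent('00:00:20,41', '00:00:19,14**'):.1f}% less time")

# crop-size=384 with batch-size=1 versus crop-size=120 with batch-size=4.
print(f"{speedup_percent('00:00:19,79**', '00:00:19,14**'):.1f}% less time")
```

Because the CUI timings fluctuate from run to run, differences of a few percent computed this way should be read as indicative only.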

So, could you create APIs for separate width crop and height crop for @HolyWu's project? It doesn't need to be a GUI option. That way we can set the sizes separately and not waste too much.
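How separate width and height crops might be chosen can be sketched as a small search. This is an assumption-laden illustration, not an existing waifu2x-caffe API: `best_crop` and `best_crop_pair` are hypothetical names, the 128 to 512 search range is arbitrary, and the sketch assumes any integer crop size is acceptable to the network.

```python
def best_crop(length, min_crop, max_crop):
    """Return the crop in [min_crop, max_crop] that pads `length` the least.

    Ties are broken in favour of the larger crop, since fewer and larger
    tiles were faster in the benchmarks above.
    """
    def padded(crop):
        return -(-length // crop) * crop  # length rounded up to a multiple

    return min(range(min_crop, max_crop + 1), key=lambda c: (padded(c), -c))


def best_crop_pair(width, height, min_crop=128, max_crop=512):
    """Choose the width crop and the height crop independently."""
    return (best_crop(width, min_crop, max_crop),
            best_crop(height, min_crop, max_crop))


for w, h in [(1920, 1080), (3840, 2160)]:
    crop_w, crop_h = best_crop_pair(w, h)
    print(f"{w}x{h}: crop width {crop_w}, crop height {crop_h}")
```

With independent crops, both example resolutions can be tiled with no padding at all, which a single square crop in the same range cannot achieve for 1920x1080.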


lltcggie commented on August 28, 2024

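If you give a folder as the input, I think you can measure the CUI and the GUI under the same conditions.
Incidentally, the reason I am reluctant to add a batch size setting to the GUI is that the GUI is used by relatively inexperienced users, and it is easy to predict that adding one would increase the number of users whose software crashes, which would become a hassle.
If it turns out that specifying the batch size makes things clearly faster, I would be fine with making it configurable in the GUI, though...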


chungexcy commented on August 28, 2024

Yeah, I totally agree with you that adding batch-size to the GUI is not a good choice for average users. And the results didn't show clear evidence of a benefit from batch-size, which suggests that your current design of batch-size=1 is actually the best choice in practice and saves users a lot of trouble.
Now I'm just wondering about the crop size for video processing. And yes, adding too many choices for users is not always a good decision...

