Comments (14)

corenel commented on May 16, 2024

I used test instead of demo to process images, and got this output:

# ubuntu @ tegra-ubuntu in ~/Github/darkflow on git:master x [20:34:45]
$ ./flow --model cfg/v1/tiny-yolo.train.cfg --load bin/tiny-yolo_201611160227.weights --gpu 1.0
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Parsing cfg/v1/tiny-yolo.train.cfg
Loading bin/tiny-yolo_201611160227.weights ...
Successfully identified 113568356 bytes
Finished in 0.0157389640808s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 448, 448, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 224, 224, 16)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 224, 224, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 32)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 112, 112, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 56, 56, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 128)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 28, 28, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 256)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 14, 14, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 7, 7, 512)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 1024)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Load  |  Yep!  | full 12544 x 1573  linear        | (?, 1573)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] ARM has no NUMA node, hardcoding to return zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.23GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.90G (4188778496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.51G (3769900544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.16G (3392910336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.84G (3053619200 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.56G (2748257280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.30G (2473431552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.07G (2226088448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Finished in 14.0993700027s

Forwarding 12 inputs ...
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 7.80G (8377556992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Total time = 27.4352328777s / 12 inps = 0.437393772215 ips
Forwarding 12 inputs ...
Total time = 1.80970215797s / 12 inps = 6.63092539682 ips
Forwarding 12 inputs ...
Total time = 0.644158124924s / 12 inps = 18.6289662983 ips
Forwarding 12 inputs ...
Total time = 0.64394903183s / 12 inps = 18.6350152059 ips
Forwarding 12 inputs ...
Total time = 0.652648925781s / 12 inps = 18.3866080613 ips
Forwarding 12 inputs ...
Total time = 0.644076824188s / 12 inps = 18.6313178015 ips
Forwarding 12 inputs ...
Total time = 0.830475091934s / 12 inps = 14.4495603981 ips
Forwarding 12 inputs ...
Total time = 0.595324039459s / 12 inps = 20.1570895926 ips
Forwarding 12 inputs ...
Total time = 0.637261152267s / 12 inps = 18.8305845371 ips

It seems that the speed is low at first, but it gets up to ~18 ips later.
Maybe it is my OpenCV that slows the whole thing down?

thtrieu commented on May 16, 2024

@corenel My few thoughts:

  1. You are forwarding a batch of 12 inputs at a time. That exposes a lot of parallelism for the CPU, and even more so for a GPU, to take advantage of, as opposed to the camera demo, where only one example is processed at a time.

  2. The ips measurement does not include post-processing of the output tensor (transforming the 7x7x13 tensor into bounding box coordinates and associated classes). In the case of YOLO this post-processing can be quite expensive: for a single-image forward pass of yolo-tiny, the ratio post_process/forward_time can be as big as 0.15. The ratio can be even bigger (up to 0.25) with newer versions of YOLO, because they predict 845 boxes per image compared with the 98 boxes of YOLO v1. In short, 18-20 ips can easily become 13-15 fps during demo (a rough sketch of this decode step follows right after this list).
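
Roughly, that decode step looks like the sketch below. This is only an illustration of a YOLO v1-style decode (class probabilities times box confidences, thresholding, coordinate conversion; NMS omitted), not darkflow's actual code, and the S/B/C defaults and the squaring of w, h are assumptions based on the usual v1 cfg.

    # Rough sketch of YOLO v1-style post-processing, NOT darkflow's implementation.
    # Assumed flat output layout (as in darknet v1):
    #   [S*S*C class probs | S*S*B box confidences | S*S*B*4 coordinates]
    import numpy as np

    def decode(net_out, S=7, B=2, C=3, threshold=0.2):
        probs = net_out[:S * S * C].reshape(S, S, C)
        confs = net_out[S * S * C:S * S * (C + B)].reshape(S, S, B)
        coords = net_out[S * S * (C + B):].reshape(S, S, B, 4)
        boxes = []
        for row in range(S):
            for col in range(S):
                for b in range(B):
                    scores = confs[row, col, b] * probs[row, col]  # class-specific confidence
                    cls = int(np.argmax(scores))
                    if scores[cls] < threshold:
                        continue
                    x, y, w, h = coords[row, col, b]
                    cx, cy = (col + x) / S, (row + y) / S  # cell-relative -> image-relative centre
                    # squaring assumes the cfg predicts sqrt(w), sqrt(h), as v1 usually does
                    boxes.append((cls, float(scores[cls]), cx, cy, float(w ** 2), float(h ** 2)))
        return boxes  # NMS would follow here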

I can try to optimize this post-processing; it has been on my mind since the beginning, but I haven't improved anything substantially so far.

Anyway, it would be great if you can post here the result of your demo on CPU :)

thtrieu commented on May 16, 2024

Thanks for testing out the code.

The first thing I noticed is that your YOLO version is not something I am familiar with: it has a size of 113568356 bytes and its last volume has size 1573 = 11 x 11 x 13 instead of the usual 7 x 7 x 13. I cannot find either the cfg or the weights for this 1573 config in the original darknet, so maybe what I say below does not apply to your particular configuration.
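
For what it's worth, 13 values per cell matches my reading of the YOLO v1 output size S x S x (B*5 + C) with B = 2 boxes and C = 3 classes; that is an inference from the numbers, not something I have verified against your cfg.

    # YOLO v1 final-layer size: S*S*(B*5 + C).
    # With B = 2 and C = 3 (assumed from the 13 values per cell),
    # side 7 gives 637 and side 11 gives 1573, matching the table above.
    def yolo_v1_output_size(S, B=2, C=3):
        return S * S * (B * 5 + C)

    print(yolo_v1_output_size(7))   # 637
    print(yolo_v1_output_size(11))  # 1573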

Second, while you specify the model as yolo-tiny.cfg, darkflow indicates that it is parsing yolo-tiny.train.cfg. I don't see how this is possible; did you modify the code?

Some relevant remarks: did you try running on CPU? When I test the 180MB yolo-tiny on my CPU (2.80GHz × 4, 7.7 GB memory), it easily reaches 4 FPS. So 2.72 FPS with a 113MB YOLO on GPU is unacceptable; if you test this on a CPU and get better performance than 2.72 FPS, then there is something wrong with tensorflow's GPU usage. In fact, tensorflow's inefficient use of memory was still being discussed a few days ago (tensorflow/tensorflow#492).
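
On that note, here is a minimal sketch of capping the memory fraction in raw TF 1.x, assuming the --gpu flag maps onto per_process_gpu_memory_fraction (which I have not re-checked against darkflow's code). On a Tegra, where CPU and GPU share the same 4 GB, requesting the full 1.0 is what produces the CUDA_ERROR_OUT_OF_MEMORY retries in your log:

    # TF 1.x: cap the GPU memory fraction instead of requesting 100% on a shared-memory Tegra.
    import tensorflow as tf

    gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.6,  # e.g. --gpu 0.6 instead of 1.0
                                allow_growth=True)                    # allocate on demand
    sess = tf.Session(config=tf.ConfigProto(gpu_options=gpu_options))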

I suggest not running the demo but using the --savepb option and checking the size of the .pb graph file; that is the kind of number I can try to reduce, though not below 113568356 bytes of course.
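
(Checking that number is just a file-size lookup; the filename below is a placeholder, use whatever .pb file darkflow actually writes:)

    # Placeholder path: substitute the .pb file that --savepb produces.
    import os
    print('%d bytes' % os.path.getsize('graph-tiny-yolo.pb'))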

I don't have access to a GPU at the moment, so I cannot reproduce your problem and check whether there are any problems with your particular config. But there is not much to be done about tensorflow's GPU usage at the moment ...

corenel commented on May 16, 2024

Thanks for your comment.
First, I trained with both side=11 and side=7, and found that side=11 gives better object detection performance in the darknet framework, so I use this config. Maybe I can try the original side=7 model later.
Second, I'm sorry, I accidentally deleted the .train part; I'm actually using the yolo-tiny.train.cfg model.
I'll try to observe the CPU and GPU usage while running the code.

corenel commented on May 16, 2024

Thanks for your comment.
13-15 fps in total is enough for me to deploy it on my robot.

And here is my result from running test on CPU; I only get ~3 ips:

# ubuntu @ tegra-ubuntu in ~/Github/darkflow on git:master x [12:06:41]
$ ./flow --model cfg/v1/tiny-yolo.train.cfg --load bin/tiny-yolo_201611160227.weights --gpu 0.0
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Parsing cfg/v1/tiny-yolo.train.cfg
Loading bin/tiny-yolo_201611160227.weights ...
Successfully identified 113568356 bytes
Finished in 0.0162470340729s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 448, 448, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 224, 224, 16)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 224, 224, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 32)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 112, 112, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 56, 56, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 128)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 28, 28, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 256)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 14, 14, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 7, 7, 512)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 1024)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Load  |  Yep!  | full 12544 x 1573  linear        | (?, 1573)
-------+--------+----------------------------------+---------------
Running entirely on CPU
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] ARM has no NUMA node, hardcoding to return zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 1.88GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
Finished in 6.36866903305s

Forwarding 12 inputs ...
Total time = 4.92158579826s / 12 inps = 2.43823850521 ips
Forwarding 12 inputs ...
Total time = 4.38407516479s / 12 inps = 2.73717934774 ips
Forwarding 12 inputs ...
Total time = 4.34986495972s / 12 inps = 2.75870633023 ips
Forwarding 12 inputs ...
Total time = 4.24099993706s / 12 inps = 2.8295213813 ips
Forwarding 12 inputs ...
Total time = 4.34422206879s / 12 inps = 2.76228972875 ips
Forwarding 12 inputs ...
Total time = 4.34757304192s / 12 inps = 2.76016064234 ips
Forwarding 12 inputs ...
Total time = 4.33888602257s / 12 inps = 2.76568684625 ips
Forwarding 12 inputs ...
Total time = 4.29483699799s / 12 inps = 2.79405248805 ips
Forwarding 12 inputs ...
Total time = 4.28924703598s / 12 inps = 2.79769383748 ips

Still trying to figure out why it only gets ~4 fps when running the demo.
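
One way to narrow this down (just a sketch, not darkflow code; the camera index 0 is an assumption) is to time the capture/display loop alone and compare it with the ~18 ips forwarding figure above:

    # Time OpenCV capture + display with no network at all, to see whether
    # camera I/O is the demo bottleneck on the TX1.
    import time
    import cv2

    cap = cv2.VideoCapture(0)          # assumed camera index
    n, start = 0, time.time()
    while n < 100:
        ok, frame = cap.read()
        if not ok:
            break
        cv2.imshow('capture-only', frame)
        cv2.waitKey(1)
        n += 1
    print('capture+display only: %.1f fps' % (n / (time.time() - start)))
    cap.release()
    cv2.destroyAllWindows()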

corenel commented on May 16, 2024

I pulled the newest repo and retried.
Using --gpu 1.0:

# ubuntu @ tegra-ubuntu in ~/Github/darkflow on git:master x [12:29:22]
$ ./flow --model cfg/v1/tiny-yolo.train.cfg --load bin/tiny-yolo_201611160227.weights --gpu 1.0
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Parsing cfg/v1/tiny-yolo.train.cfg
Loading bin/tiny-yolo_201611160227.weights ...
Successfully identified 113568356 bytes
Finished in 0.0150399208069s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 448, 448, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 224, 224, 16)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 224, 224, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 32)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 112, 112, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 56, 56, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 128)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 28, 28, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 256)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 14, 14, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 7, 7, 512)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 1024)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Load  |  Yep!  | full 12544 x 1573  linear        | (?, 1573)
-------+--------+----------------------------------+---------------
GPU mode with 1.0 usage
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] ARM has no NUMA node, hardcoding to return zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.32GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.90G (4188778496 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.51G (3769900544 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 3.16G (3392910336 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.84G (3053619200 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.56G (2748257280 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.30G (2473431552 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 2.07G (2226088448 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
Finished in 9.15090894699s

Forwarding 12 inputs ...
E tensorflow/stream_executor/cuda/cuda_driver.cc:965] failed to allocate 7.80G (8377556992 bytes) from device: CUDA_ERROR_OUT_OF_MEMORY
W tensorflow/core/common_runtime/bfc_allocator.cc:213] Ran out of memory trying to allocate 2.22GiB. The caller indicates that this is not a failure, but may mean that there could be performance gains if more memory is available.
Total time = 24.7553770542s / 12 inps = 0.484743172108 ips
Post processing 12 inputs ...
Total time = 2.11361193657s / 12 inps = 5.67748496892 ips
Forwarding 12 inputs ...
Total time = 1.26511907578s / 12 inps = 9.48527314921 ips
Post processing 12 inputs ...
Total time = 0.619744062424s / 12 inps = 19.3628317358 ips
Forwarding 12 inputs ...
Total time = 0.658466100693s / 12 inps = 18.224172797 ips
Post processing 12 inputs ...
Total time = 0.681007862091s / 12 inps = 17.6209419421 ips
Forwarding 12 inputs ...
Total time = 0.65850186348s / 12 inps = 18.2231830546 ips
Post processing 12 inputs ...
Total time = 0.596123933792s / 12 inps = 20.1300422945 ips
Forwarding 12 inputs ...
Total time = 0.599046945572s / 12 inps = 20.031819023 ips
Post processing 12 inputs ...
Total time = 0.822525978088s / 12 inps = 14.5892048636 ips
Forwarding 12 inputs ...
Total time = 0.631103038788s / 12 inps = 19.0143277127 ips
Post processing 12 inputs ...
Total time = 0.674789905548s / 12 inps = 17.7833128524 ips
Forwarding 12 inputs ...
Total time = 0.607489824295s / 12 inps = 19.7534172921 ips
Post processing 12 inputs ...
Total time = 0.779205083847s / 12 inps = 15.4003101992 ips
Forwarding 12 inputs ...
Total time = 0.592466831207s / 12 inps = 20.2542984146 ips
Post processing 12 inputs ...
Total time = 0.610358953476s / 12 inps = 19.6605619229 ips
Forwarding 12 inputs ...
Total time = 0.628900051117s / 12 inps = 19.0809334149 ips
Post processing 12 inputs ...
Total time = 0.595747947693s / 12 inps = 20.1427466875 ips

Using --gpu 0.0, i.e. pure CPU:

# ubuntu @ tegra-ubuntu in ~/Github/darkflow on git:master x [12:34:56]
$ ./flow --model cfg/v1/tiny-yolo.train.cfg --load bin/tiny-yolo_201611160227.weights --gpu 0.0
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcublas.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcudnn.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcufft.so locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcuda.so.1 locally
I tensorflow/stream_executor/dso_loader.cc:111] successfully opened CUDA library libcurand.so locally
Parsing cfg/v1/tiny-yolo.train.cfg
Loading bin/tiny-yolo_201611160227.weights ...
Successfully identified 113568356 bytes
Finished in 0.0160660743713s

Building net ...
Source | Train? | Layer description                | Output size
-------+--------+----------------------------------+---------------
       |        | input                            | (?, 448, 448, 3)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 448, 448, 16)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 224, 224, 16)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 224, 224, 32)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 112, 112, 32)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 112, 112, 64)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 56, 56, 64)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 56, 56, 128)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 28, 28, 128)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 28, 28, 256)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 14, 14, 256)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 14, 14, 512)
 Load  |  Yep!  | maxp 2x2p0_2                     | (?, 7, 7, 512)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 1024)
 Load  |  Yep!  | conv 3x3p1_1  +bnorm  leaky      | (?, 7, 7, 256)
 Load  |  Yep!  | flat                             | (?, 12544)
 Load  |  Yep!  | full 12544 x 1573  linear        | (?, 1573)
-------+--------+----------------------------------+---------------
Running entirely on CPU
I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:892] ARM has no NUMA node, hardcoding to return zero
I tensorflow/core/common_runtime/gpu/gpu_device.cc:951] Found device 0 with properties:
name: NVIDIA Tegra X1
major: 5 minor: 3 memoryClockRate (GHz) 0.072
pciBusID 0000:00:00.0
Total memory: 3.90GiB
Free memory: 2.36GiB
I tensorflow/core/common_runtime/gpu/gpu_device.cc:972] DMA: 0
I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] 0:   Y
I tensorflow/core/common_runtime/gpu/gpu_device.cc:1041] Creating TensorFlow device (/gpu:0) -> (device: 0, name: NVIDIA Tegra X1, pci bus id: 0000:00:00.0)
Finished in 6.37051415443s

Forwarding 12 inputs ...
Total time = 4.89332389832s / 12 inps = 2.45232080471 ips
Post processing 12 inputs ...
Total time = 1.27489995956s / 12 inps = 9.41250323994 ips
Forwarding 12 inputs ...
Total time = 4.34556508064s / 12 inps = 2.76143603359 ips
Post processing 12 inputs ...
Total time = 0.599526882172s / 12 inps = 20.0157830397 ips
Forwarding 12 inputs ...
Total time = 4.32573008537s / 12 inps = 2.77409818994 ips
Post processing 12 inputs ...
Total time = 0.667248010635s / 12 inps = 17.9843173883 ips
Forwarding 12 inputs ...
Total time = 4.26882100105s / 12 inps = 2.81108062321 ips
Post processing 12 inputs ...
Total time = 0.592815160751s / 12 inps = 20.2423972842 ips
Forwarding 12 inputs ...
Total time = 4.25824594498s / 12 inps = 2.81806174539 ips
Post processing 12 inputs ...
Total time = 0.819761037827s / 12 inps = 14.6384122278 ips
Forwarding 12 inputs ...
Total time = 4.33560299873s / 12 inps = 2.76778109147 ips
Post processing 12 inputs ...
Total time = 0.680369853973s / 12 inps = 17.6374657547 ips
Forwarding 12 inputs ...
Total time = 4.31595182419s / 12 inps = 2.78038321298 ips
Post processing 12 inputs ...
Total time = 0.776940822601s / 12 inps = 15.4451917713 ips
Forwarding 12 inputs ...
Total time = 4.26998996735s / 12 inps = 2.81031105266 ips
Post processing 12 inputs ...
Total time = 0.611348152161s / 12 inps = 19.6287499318 ips
Forwarding 12 inputs ...
Total time = 4.32538700104s / 12 inps = 2.77431822797 ips
Post processing 12 inputs ...
Total time = 0.601390123367s / 12 inps = 19.9537696642 ips

It seems that post-processing is always ~19 ips, while forwarding is ~2 ips on CPU and ~20 ips on GPU.
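
A rough back-of-envelope, assuming the two stages run one after the other as the alternating log lines suggest: at ~20 ips forwarding and ~19 ips post-processing, the end-to-end GPU rate would be about 1 / (1/20 + 1/19) ≈ 9.7 images/s.

    # Back-of-envelope combined throughput, assuming sequential forward + post-process.
    forward_ips, post_ips = 20.0, 19.0
    combined = 1.0 / (1.0 / forward_ips + 1.0 / post_ips)
    print('~%.1f images/s end to end' % combined)  # ~9.7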

kinhunt commented on May 16, 2024

My results with the latest code:

8 inputs with one Tesla K80 GPU and tiny-yolo v2:
Forwarding 5.501616737 ips
Post processing 4.61140247141 ips

8 inputs with CPU and yolo v2:
Forwarding 1.09248647978 ips
Post processing 4.61676005141 ips

Are these results reasonable?

nottug commented on May 16, 2024

@corenel Did you ever find a good solution to the FPS issues on the JTX1? I'm having a similar issue; I max out at about 5 FPS.

corenel commented on May 16, 2024

@traw1234 No.
In the end I chose to use the original darknet and rewrote it in C++.

Dhruv-Mohan commented on May 16, 2024

@corenel Hello, what's the fps that you get on the original Darknet?

corenel commented on May 16, 2024

@traw1234 About 12 fps in the webcam demo, and even higher without the GUI.

pabloapast commented on May 16, 2024

Hi there! Any improvement on this topic? I'm also getting low performance from darkflow on a Jetson TX2 :(

Malouke commented on May 16, 2024

Hello, how can you control the batch size? I mean, what is the right batch size per CPU, and is there a flag for it?

For example, with a Core i7 at 3.3 GHz and 8 physical cores, how many inputs per batch should I use (8, one per core, or more)?
Thanks

abhishek-s-jha commented on May 16, 2024

@Malouke Batch size is heavily dependent on RAM available.
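
As a very rough sketch of the kind of estimate that implies (all numbers below are assumptions for illustration, not darkflow measurements): the per-image footprint is dominated by the largest activation map, so an upper bound on the batch size is roughly the free RAM divided by that footprint times some overhead factor.

    # Illustrative only: ballpark a batch size from available RAM (assumed numbers).
    bytes_per_float = 4
    largest_activation = 448 * 448 * 16 * bytes_per_float  # first conv output of tiny-yolo, ~12.8 MB
    overhead = 4                                            # fudge factor for other layers and buffers
    free_ram = 4 * 1024 ** 3                                # e.g. 4 GB free

    max_batch = free_ram // (largest_activation * overhead)
    print(max_batch)  # ballpark only; profile to confirm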
