Comments (7)
@JinRanYAO Is the data you're testing a picture or a video?
from tensorrt-alpha.
@JinRanYAO Try the following command to build an FP16 engine; it improves inference performance by about 100%:

```shell
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-fp16.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --fp16
```
【FP32】:
[04/07/2024-09:15:16] [I] preprocess time = 0.841472; infer time = 5.80734; postprocess time = 0.186192
[04/07/2024-09:15:16] [I] preprocess time = 0.837504; infer time = 5.76032; postprocess time = 0.13976
[04/07/2024-09:15:16] [I] preprocess time = 0.845184; infer time = 5.75726; postprocess time = 0.209248
[04/07/2024-09:15:16] [I] preprocess time = 0.839952; infer time = 5.76222; postprocess time = 0.170016
[04/07/2024-09:15:16] [I] preprocess time = 0.844816; infer time = 5.76472; postprocess time = 0.146288
[04/07/2024-09:15:16] [I] preprocess time = 0.838784; infer time = 5.76434; postprocess time = 0.203216
[04/07/2024-09:15:16] [I] preprocess time = 0.808864; infer time = 5.5223; postprocess time = 0.150368
[04/07/2024-09:15:16] [I] preprocess time = 0.811856; infer time = 5.52139; postprocess time = 0.184
[04/07/2024-09:15:16] [I] preprocess time = 0.80856; infer time = 5.52371; postprocess time = 0.20792
[04/07/2024-09:15:16] [I] preprocess time = 0.809776; infer time = 5.51814; postprocess time = 0.168032
[04/07/2024-09:15:16] [I] preprocess time = 0.810064; infer time = 5.5215; postprocess time = 0.208496
[04/07/2024-09:15:16] [I] preprocess time = 0.811216; infer time = 5.51797; postprocess time = 0.201968
[04/07/2024-09:15:16] [I] preprocess time = 0.809136; infer time = 5.51658; postprocess time = 0.179296
【FP16】:
[04/07/2024-09:15:26] [I] preprocess time = 0.84056; infer time = 2.59362; postprocess time = 0.177744
[04/07/2024-09:15:26] [I] preprocess time = 0.84752; infer time = 2.43448; postprocess time = 0.132512
[04/07/2024-09:15:26] [I] preprocess time = 0.840256; infer time = 2.42754; postprocess time = 0.206288
[04/07/2024-09:15:26] [I] preprocess time = 0.841216; infer time = 2.43272; postprocess time = 0.160144
[04/07/2024-09:15:26] [I] preprocess time = 0.840736; infer time = 2.42774; postprocess time = 0.137648
[04/07/2024-09:15:26] [I] preprocess time = 0.841296; infer time = 2.4313; postprocess time = 0.194464
[04/07/2024-09:15:26] [I] preprocess time = 0.840992; infer time = 2.43011; postprocess time = 0.149072
[04/07/2024-09:15:26] [I] preprocess time = 0.83664; infer time = 2.43083; postprocess time = 0.184176
[04/07/2024-09:15:26] [I] preprocess time = 0.841136; infer time = 2.4283; postprocess time = 0.20736
[04/07/2024-09:15:26] [I] preprocess time = 0.844864; infer time = 2.4312; postprocess time = 0.165424
[04/07/2024-09:15:26] [I] preprocess time = 0.842; infer time = 2.42846; postprocess time = 0.207552
[04/07/2024-09:15:26] [I] preprocess time = 0.8444; infer time = 2.43054; postprocess time = 0.203488
[04/07/2024-09:15:26] [I] preprocess time = 0.84024; infer time = 2.43106; postprocess time = 0.179952
@FeiYull Thank you for your quick reply!
- My project is built on ROS, so I don't use utils::InputStream. I call yolov8.init() once at startup, and then run the following code for each received frame. I think this is equivalent to using utils::InputStream::IMAGE. Is this code reasonable, or can it be improved anywhere?
```cpp
imgs_batch.emplace_back(frame.clone());
yolov8.copy(imgs_batch);
utils::DeviceTimer d_t1; yolov8.preprocess(imgs_batch);  float t1 = d_t1.getUsedTime();
utils::DeviceTimer d_t2; yolov8.infer();                 float t2 = d_t2.getUsedTime();
utils::DeviceTimer d_t3; yolov8.postprocess(imgs_batch); float t3 = d_t3.getUsedTime();
float avg_times[3] = { t1, t2, t3 };
sample::gLogInfo << "preprocess time = " << avg_times[0] << "; "
                    "infer time = " << avg_times[1] << "; "
                    "postprocess time = " << avg_times[2] << std::endl;
yolov8.reset();
imgs_batch.clear();
```
- Thanks, I tried FP16 and the inference time decreased from 40 ms to 30 ms, but preprocessing still takes about 20 ms. Could I use INT8 to get even faster?
- Additionally, my raw image size is 1920x1080. Is too much of that time spent on resize?
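For reference, INT8 requires a calibration step before the engine build; a hedged sketch of the corresponding trtexec invocation, assuming a calibration cache has already been generated (`calib.cache` is a placeholder file name, not something from this thread):

```shell
# Hypothetical INT8 build; --calib points at a previously generated
# INT8 calibration cache (file name is a placeholder).
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.trt --buildOnly \
  --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 \
  --int8 --calib=calib.cache
```

Note that INT8 accuracy depends on how representative the calibration images are, so the pose results should be re-validated after quantization.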
@JinRanYAO
It is recommended to step into YOLOv8Pose::preprocess and measure the time overhead of each internal stage:
`void YOLOv8Pose::preprocess(const std::vector<cv::Mat>& imgsBatch)`
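The repo's utils::DeviceTimer uses CUDA events for GPU timing; purely to illustrate the same bracketing pattern, here is a minimal host-side analogue (HostTimer and time_stage are hypothetical helpers, not part of TensorRT-Alpha):

```cpp
#include <chrono>
#include <thread>

// Hypothetical host-side analogue of utils::DeviceTimer: reports the
// milliseconds elapsed between construction and getUsedTime().
struct HostTimer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    float getUsedTime() const {
        return std::chrono::duration<float, std::milli>(
            std::chrono::steady_clock::now() - start).count();
    }
};

// Bracket one preprocessing stage the way the repo brackets
// preprocess/infer/postprocess; the sleep stands in for a real stage
// such as resizeDevice or bgr2rgbDevice.
float time_stage() {
    HostTimer t;
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return t.getUsedTime();
}
```

For kernels running on the GPU, the repo's event-based DeviceTimer is the right tool, since host timers would only measure launch latency unless the stream is synchronized first.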
@FeiYull It seems that resize, bgr2rgb, norm, and hwc2chw each cost almost the same time, about 5 ms per stage. Could I use the similar functions in OpenCV when I receive the image, instead of running these stages here?
@JinRanYAO You can merge the following operations into one kernel:
- resizeDevice
- bgr2rgbDevice
- normDevice

Inside the CUDA kernel that resizeDevice launches, modify the following:
[before modification]
TensorRT-Alpha/utils/kernel_function.cu, line 142 (commit bca9575)
[after modification]
```cpp
//pdst[0] = c0;
//pdst[1] = c1;
//pdst[2] = c2;
// bgr2rgb
pdst[0] = c2;
pdst[1] = c1;
pdst[2] = c0;
// normalization
// float scale = 255.f;
// float means[3] = { 0.f, 0.f, 0.f };
// float stds[3]  = { 1.f, 1.f, 1.f };
pdst[0] = (pdst[0] / scale - means[0]) / stds[0];
pdst[1] = (pdst[1] / scale - means[1]) / stds[1];
pdst[2] = (pdst[2] / scale - means[2]) / stds[2];
```
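To make the fused arithmetic easy to check, a CPU reference of the merged per-pixel body (illustrative only; fuse_bgr2rgb_norm is not a repo function): swap BGR to RGB, then apply (x / scale - mean) / std per channel.

```cpp
#include <array>

// CPU reference for the fused kernel body: channel swap plus normalization.
// With scale = 255, means = 0, stds = 1 this reduces to RGB values in [0, 1].
std::array<float, 3> fuse_bgr2rgb_norm(const unsigned char bgr[3], float scale,
                                       const float means[3], const float stds[3]) {
    std::array<float, 3> rgb;
    rgb[0] = (bgr[2] / scale - means[0]) / stds[0]; // R comes from the B-ordered slot
    rgb[1] = (bgr[1] / scale - means[1]) / stds[1]; // G is unchanged
    rgb[2] = (bgr[0] / scale - means[2]) / stds[2]; // B comes from the R-ordered slot
    return rgb;
}
```

Fusing the three passes matters because each separate kernel reads and writes the whole image from global memory; one pass does a single read and a single write per pixel.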
@FeiYull Thanks for your advice; the preprocess time decreased to 8 ms after merging resize, bgr2rgb, and norm into one kernel. Then I resize the image to the engine's input size as soon as it is received, so src_size and dst_size are the same in yolov8-pose. Finally, I simplified the preprocessing code by removing the affine matrix and interpolation to save more time. Here is my code now:
```cpp
__global__
void resize_rgb_padding_device_kernel(unsigned char* src, int src_width, int src_height, int src_area, int src_volume,
                                      float* dst, int dst_width, int dst_height, int dst_area, int dst_volume,
                                      int batch_size, float padding_value, float inv_scale)
{
    int dx = blockDim.x * blockIdx.x + threadIdx.x;
    int dy = blockDim.y * blockIdx.y + threadIdx.y;
    if (dx < dst_area && dy < batch_size)
    {
        int dst_y = dx / dst_width;
        int dst_x = dx % dst_width;
        unsigned char* v = src + dy * src_volume + dst_y * src_width * 3 + dst_x * 3;
        float* pdst = dst + dy * dst_volume + dst_y * dst_width * 3 + dst_x * 3;
        // swap BGR -> RGB and scale to [0, 1]
        pdst[0] = (v[2] + 0.5f) * inv_scale;
        pdst[1] = (v[1] + 0.5f) * inv_scale;
        pdst[2] = (v[0] + 0.5f) * inv_scale;
    }
}
```
After simplifying, the preprocess time decreases to about 6 ms and the inference result is still correct. Is this code all right, or can anything be improved?
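One thing worth double-checking with a kernel shaped like this is the launch configuration: the x dimension must cover dst_area pixels and the y dimension must cover batch_size. A minimal sketch of the ceil-divide math, assuming a block of 256x1 threads (grid_for and its default block size are illustrative assumptions, not repo code):

```cpp
// Illustrative grid-size computation for a kernel whose x dimension walks
// dst_area pixels and whose y dimension walks the batch.
struct GridDim { unsigned x, y; };

GridDim grid_for(int dst_area, int batch_size, int threads_per_block_x = 256) {
    GridDim g;
    // ceil-divide so the last partial block still covers the trailing pixels;
    // the kernel's `if (dx < dst_area)` guard discards the overshoot.
    g.x = static_cast<unsigned>((dst_area + threads_per_block_x - 1) / threads_per_block_x);
    g.y = static_cast<unsigned>(batch_size);
    return g;
}
```

The in-kernel bounds check is what makes the overshoot in the last block safe, so it should stay even after the affine-matrix and interpolation paths are removed.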