Comments (7)
@JinRanYAO Is the data you're testing a picture or a video?
from tensorrt-alpha.
@JinRanYAO Try the following command to build an FP16 engine; it improves inference performance by about 100%:

```shell
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-fp16.trt --buildOnly --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 --fp16
```
【FP32】:
[04/07/2024-09:15:16] [I] preprocess time = 0.841472; infer time = 5.80734; postprocess time = 0.186192
[04/07/2024-09:15:16] [I] preprocess time = 0.837504; infer time = 5.76032; postprocess time = 0.13976
[04/07/2024-09:15:16] [I] preprocess time = 0.845184; infer time = 5.75726; postprocess time = 0.209248
[04/07/2024-09:15:16] [I] preprocess time = 0.839952; infer time = 5.76222; postprocess time = 0.170016
[04/07/2024-09:15:16] [I] preprocess time = 0.844816; infer time = 5.76472; postprocess time = 0.146288
[04/07/2024-09:15:16] [I] preprocess time = 0.838784; infer time = 5.76434; postprocess time = 0.203216
[04/07/2024-09:15:16] [I] preprocess time = 0.808864; infer time = 5.5223; postprocess time = 0.150368
[04/07/2024-09:15:16] [I] preprocess time = 0.811856; infer time = 5.52139; postprocess time = 0.184
[04/07/2024-09:15:16] [I] preprocess time = 0.80856; infer time = 5.52371; postprocess time = 0.20792
[04/07/2024-09:15:16] [I] preprocess time = 0.809776; infer time = 5.51814; postprocess time = 0.168032
[04/07/2024-09:15:16] [I] preprocess time = 0.810064; infer time = 5.5215; postprocess time = 0.208496
[04/07/2024-09:15:16] [I] preprocess time = 0.811216; infer time = 5.51797; postprocess time = 0.201968
[04/07/2024-09:15:16] [I] preprocess time = 0.809136; infer time = 5.51658; postprocess time = 0.179296
【FP16】:
[04/07/2024-09:15:26] [I] preprocess time = 0.84056; infer time = 2.59362; postprocess time = 0.177744
[04/07/2024-09:15:26] [I] preprocess time = 0.84752; infer time = 2.43448; postprocess time = 0.132512
[04/07/2024-09:15:26] [I] preprocess time = 0.840256; infer time = 2.42754; postprocess time = 0.206288
[04/07/2024-09:15:26] [I] preprocess time = 0.841216; infer time = 2.43272; postprocess time = 0.160144
[04/07/2024-09:15:26] [I] preprocess time = 0.840736; infer time = 2.42774; postprocess time = 0.137648
[04/07/2024-09:15:26] [I] preprocess time = 0.841296; infer time = 2.4313; postprocess time = 0.194464
[04/07/2024-09:15:26] [I] preprocess time = 0.840992; infer time = 2.43011; postprocess time = 0.149072
[04/07/2024-09:15:26] [I] preprocess time = 0.83664; infer time = 2.43083; postprocess time = 0.184176
[04/07/2024-09:15:26] [I] preprocess time = 0.841136; infer time = 2.4283; postprocess time = 0.20736
[04/07/2024-09:15:26] [I] preprocess time = 0.844864; infer time = 2.4312; postprocess time = 0.165424
[04/07/2024-09:15:26] [I] preprocess time = 0.842; infer time = 2.42846; postprocess time = 0.207552
[04/07/2024-09:15:26] [I] preprocess time = 0.8444; infer time = 2.43054; postprocess time = 0.203488
[04/07/2024-09:15:26] [I] preprocess time = 0.84024; infer time = 2.43106; postprocess time = 0.179952
@FeiYull Thank you for your quick reply!
- My project is built on ROS, so I don't use utils::InputStream. I call yolov8.init() once at startup, and then run the following code for each received frame. I think this is equivalent to using utils::InputStream::IMAGE. Is this code reasonable, or can it be improved anywhere?
```cpp
imgs_batch.emplace_back(frame.clone());
yolov8.copy(imgs_batch);
utils::DeviceTimer d_t1; yolov8.preprocess(imgs_batch);  float t1 = d_t1.getUsedTime();
utils::DeviceTimer d_t2; yolov8.infer();                 float t2 = d_t2.getUsedTime();
utils::DeviceTimer d_t3; yolov8.postprocess(imgs_batch); float t3 = d_t3.getUsedTime();
float avg_times[3] = { t1, t2, t3 };
sample::gLogInfo << "preprocess time = " << avg_times[0] << "; "
                    "infer time = " << avg_times[1] << "; "
                    "postprocess time = " << avg_times[2] << std::endl;
yolov8.reset();
imgs_batch.clear();
```
- Thanks, I tried FP16 and the inference time decreased from 40 ms to 30 ms, but preprocessing still takes about 20 ms. Could I use INT8 to get even faster?
- Additionally, my raw image size is 1920x1080. Is too much of that time spent on resize?
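For reference, INT8 requires a calibration step before the engine build; a hedged sketch of the corresponding trtexec invocation, assuming a calibration cache has already been generated (`calib.cache` is a placeholder file name, not something from this thread):

```shell
# Hypothetical INT8 build; --calib points at a previously generated
# INT8 calibration cache (file name is a placeholder).
./trtexec --onnx=yolov8n-pose.onnx --saveEngine=yolov8n-pose-int8.trt --buildOnly \
  --minShapes=images:1x3x640x640 --optShapes=images:2x3x640x640 --maxShapes=images:4x3x640x640 \
  --int8 --calib=calib.cache
```

Note that INT8 accuracy depends on how representative the calibration images are, so the pose results should be re-validated after quantization.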
@JinRanYAO
It is recommended to step into YOLOv8Pose::preprocess and measure the time overhead of each internal stage:
`void YOLOv8Pose::preprocess(const std::vector<cv::Mat>& imgsBatch)`
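The repo's utils::DeviceTimer uses CUDA events for GPU timing; purely to illustrate the same bracketing pattern, here is a minimal host-side analogue (HostTimer and time_stage are hypothetical helpers, not part of TensorRT-Alpha):

```cpp
#include <chrono>
#include <thread>

// Hypothetical host-side analogue of utils::DeviceTimer: reports the
// milliseconds elapsed between construction and getUsedTime().
struct HostTimer {
    std::chrono::steady_clock::time_point start = std::chrono::steady_clock::now();
    float getUsedTime() const {
        return std::chrono::duration<float, std::milli>(
            std::chrono::steady_clock::now() - start).count();
    }
};

// Bracket one preprocessing stage the way the repo brackets
// preprocess/infer/postprocess; the sleep stands in for a real stage
// such as resizeDevice or bgr2rgbDevice.
float time_stage() {
    HostTimer t;
    std::this_thread::sleep_for(std::chrono::milliseconds(5));
    return t.getUsedTime();
}
```

For kernels running on the GPU, the repo's event-based DeviceTimer is the right tool, since host timers would only measure launch latency unless the stream is synchronized first.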
@FeiYull It seems that resize, bgr2rgb, norm, and hwc2chw each cost almost the same time, about 5 ms per stage. Could I use the similar functions in OpenCV when I receive the image, instead of running these stages here?
@JinRanYAO You can merge the following operations into one kernel:
- resizeDevice
- bgr2rgbDevice
- normDevice

Inside the CUDA kernel that resizeDevice launches, modify the following:
[before modification]
TensorRT-Alpha/utils/kernel_function.cu, line 142 (commit bca9575)
[after modification]
```cpp
//pdst[0] = c0;
//pdst[1] = c1;
//pdst[2] = c2;
// bgr2rgb
pdst[0] = c2;
pdst[1] = c1;
pdst[2] = c0;
// normalization
// float scale = 255.f;
// float means[3] = { 0.f, 0.f, 0.f };
// float stds[3]  = { 1.f, 1.f, 1.f };
pdst[0] = (pdst[0] / scale - means[0]) / stds[0];
pdst[1] = (pdst[1] / scale - means[1]) / stds[1];
pdst[2] = (pdst[2] / scale - means[2]) / stds[2];
```
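To make the fused arithmetic easy to check, a CPU reference of the merged per-pixel body (illustrative only; fuse_bgr2rgb_norm is not a repo function): swap BGR to RGB, then apply (x / scale - mean) / std per channel.

```cpp
#include <array>

// CPU reference for the fused kernel body: channel swap plus normalization.
// With scale = 255, means = 0, stds = 1 this reduces to RGB values in [0, 1].
std::array<float, 3> fuse_bgr2rgb_norm(const unsigned char bgr[3], float scale,
                                       const float means[3], const float stds[3]) {
    std::array<float, 3> rgb;
    rgb[0] = (bgr[2] / scale - means[0]) / stds[0]; // R comes from the B-ordered slot
    rgb[1] = (bgr[1] / scale - means[1]) / stds[1]; // G is unchanged
    rgb[2] = (bgr[0] / scale - means[2]) / stds[2]; // B comes from the R-ordered slot
    return rgb;
}
```

Fusing the three passes matters because each separate kernel reads and writes the whole image from global memory; one pass does a single read and a single write per pixel.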
@FeiYull Thanks for your advice; the preprocess time decreased to 8 ms after merging resize, bgr2rgb, and norm into one kernel. Then I resize the image to the engine's input size as soon as it is received, so src_size and dst_size are the same in yolov8-pose. Finally, I simplified the preprocessing code by removing the affine matrix and interpolation to save more time. Here is my code now:
```cpp
__global__
void resize_rgb_padding_device_kernel(unsigned char* src, int src_width, int src_height, int src_area, int src_volume,
                                      float* dst, int dst_width, int dst_height, int dst_area, int dst_volume,
                                      int batch_size, float padding_value, float inv_scale)
{
    int dx = blockDim.x * blockIdx.x + threadIdx.x;
    int dy = blockDim.y * blockIdx.y + threadIdx.y;
    if (dx < dst_area && dy < batch_size)
    {
        int dst_y = dx / dst_width;
        int dst_x = dx % dst_width;
        unsigned char* v = src + dy * src_volume + dst_y * src_width * 3 + dst_x * 3;
        float* pdst = dst + dy * dst_volume + dst_y * dst_width * 3 + dst_x * 3;
        // swap BGR -> RGB and scale to [0, 1]
        pdst[0] = (v[2] + 0.5f) * inv_scale;
        pdst[1] = (v[1] + 0.5f) * inv_scale;
        pdst[2] = (v[0] + 0.5f) * inv_scale;
    }
}
```
After simplifying, the preprocess time decreases to about 6 ms and the inference result is still correct. Is this code all right, or can anything be improved?
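One thing worth double-checking with a kernel shaped like this is the launch configuration: the x dimension must cover dst_area pixels and the y dimension must cover batch_size. A minimal sketch of the ceil-divide math, assuming a block of 256x1 threads (grid_for and its default block size are illustrative assumptions, not repo code):

```cpp
// Illustrative grid-size computation for a kernel whose x dimension walks
// dst_area pixels and whose y dimension walks the batch.
struct GridDim { unsigned x, y; };

GridDim grid_for(int dst_area, int batch_size, int threads_per_block_x = 256) {
    GridDim g;
    // ceil-divide so the last partial block still covers the trailing pixels;
    // the kernel's `if (dx < dst_area)` guard discards the overshoot.
    g.x = static_cast<unsigned>((dst_area + threads_per_block_x - 1) / threads_per_block_x);
    g.y = static_cast<unsigned>(batch_size);
    return g;
}
```

The in-kernel bounds check is what makes the overshoot in the last block safe, so it should stay even after the affine-matrix and interpolation paths are removed.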