ARM Mali: a special scheduling strategy for buffers
There is a question on Zhihu about why TNN performs well on Mali, where the answerer and a user went into some depth:
- Q: Could you share the reason for the buffer/image performance gap on ARM Mali, and the rough idea behind the heterogeneous scheduling strategy?
- A: It is a driver-design issue; clBuffer does not suffer from it, and Huawei has a set of private APIs that can solve the clImage (performance gap) problem. Later exploration found that having the CPU periodically call clFlush to force-flush the command queue works around this issue (the image/buffer performance gap on ARM Mali).
- Reference: https://www.zhihu.com/question/400955777
Below we analyze TNN's scheduling strategy for ARM Mali. The answer above mentions clFlush; the OpenCL documentation describes clFlush as follows:

- clFlush issues the commands already enqueued in the command queue bound to the device; it only guarantees that all enqueued commands are issued, not that the queue has finished computing by the time clFlush returns.
- Its return codes are: CL_SUCCESS (success), CL_INVALID_COMMAND_QUEUE (invalid command queue), and CL_OUT_OF_HOST_MEMORY (the host failed to allocate OpenCL resources).

Any blocking command implicitly performs a clFlush on the command queue. Four examples (a sketch follows this list):

- clEnqueueReadImage and clEnqueueReadBuffer with blocking_read=CL_TRUE;
- clEnqueueWriteImage and clEnqueueWriteBuffer with blocking_write=CL_TRUE;
- clEnqueueMapImage and clEnqueueMapBuffer with blocking_map=CL_TRUE;
- clWaitForEvents.
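To make the rule concrete, here is a minimal sketch using the OpenCL C++ wrapper (queue, kernel, and buffer setup omitted; names are illustrative): the blocking read needs no explicit clFlush or clFinish before it.

#include <CL/cl2.hpp>

// Minimal sketch: a blocking read implicitly flushes the queue and waits, so
// the kernel enqueued before it is guaranteed to be issued and finished.
void RunAndReadBlocking(cl::CommandQueue &queue, cl::Kernel &kernel,
                        cl::Buffer &out, size_t bytes, void *host_dst) {
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, cl::NDRange(1024));
    // blocking_read=CL_TRUE: returns only after the data is in host_dst.
    queue.enqueueReadBuffer(out, CL_TRUE, 0, bytes, host_dst);
}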
The above is a bit convoluted, so to summarize: only blocking Enqueue operations and WaitForEvents implicitly clFlush the command queue (that is, push its commands into the issued state); non-blocking Enqueue calls give no such guarantee.

If there are two command queues with an execution dependency between them, i.e. command queue B depends on command queue A, and you want an Event from queue A to serve as a condition for B, then you must flush implicitly (via a blocking command) or explicitly (via clFlush) to ensure the enqueued work in A reaches the issued (start) state.
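A minimal sketch of that dependency (illustrative names, not TNN code), assuming both queues share one context and the kernels are already built:

#include <CL/cl2.hpp>
#include <vector>

// queue_b consumes an event produced on queue_a. Without the explicit flush,
// a non-blocking enqueue on queue_a might never be issued, and the consumer
// on queue_b could wait indefinitely.
void CrossQueueDependency(cl::CommandQueue &queue_a, cl::CommandQueue &queue_b,
                          cl::Kernel &producer, cl::Kernel &consumer,
                          const cl::NDRange &gws) {
    cl::Event produced;
    queue_a.enqueueNDRangeKernel(producer, cl::NullRange, gws, cl::NullRange,
                                 nullptr, &produced);
    queue_a.flush();  // push the producer into the issued state

    std::vector<cl::Event> wait_list = {produced};
    queue_b.enqueueNDRangeKernel(consumer, cl::NullRange, gws, cl::NullRange,
                                 &wait_list, nullptr);
    queue_b.flush();
}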
arm官方文档在flush上的说明
Avoid application processor and GPU interactions in the middle of processing
- Enqueue all the kernels first, and call clFinish() at the end if possible.
- Call clFlush() after one or more clEnqueueNDRange() calls, and call clFinish() before checking the final result.

Avoid blocking calls in the submission thread
- Avoid clFinish() or clWaitForEvent() or any other blocking calls in the submission thread.
- If possible, wait for an asynchronous callback if you want to check the result while computations are in progress (see the sketch after this list).
- Try double buffering, if you are using blocking operations in your submission thread.
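A sketch of the asynchronous-callback advice, again with the C++ wrapper (function names are illustrative):

#include <CL/cl2.hpp>
#include <cstdio>

// Register a completion callback instead of blocking the submission thread.
void CL_CALLBACK OnKernelDone(cl_event /*ev*/, cl_int status, void * /*user*/) {
    if (status == CL_COMPLETE) {
        printf("kernel finished, result can be consumed now\n");
    }
}

void SubmitNonBlocking(cl::CommandQueue &queue, cl::Kernel &kernel,
                       const cl::NDRange &gws) {
    cl::Event done;
    queue.enqueueNDRangeKernel(kernel, cl::NullRange, gws, cl::NullRange,
                               nullptr, &done);
    done.setCallback(CL_COMPLETE, OnKernelDone, nullptr);
    queue.flush();  // set the callback first, then issue (see the note below)
}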
Batching kernels submission
From version r17p0 onwards, the OpenCL driver batches kernels that are flushed together for submission to the hardware. Batching kernels can significantly reduce the runtime overheads and cache maintenance costs. For example, this reduction is useful when the application is accessing multiple sub-buffers created from a buffer imported using clImportMemoryARM in separate kernels.
The application should flush kernels in groups as large as possible. When the GPU is idle though, reaching optimal performance requires the application to flush an initial batch of kernels early so that the GPU execution overlaps the queuing of further kernels.
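A sketch of what batched flushing can look like on the host side (the batch size here is a made-up example, not a value from the docs):

#include <CL/cl2.hpp>
#include <vector>

// Flush kernels in groups so the driver can batch them, and flush the first
// group early so GPU execution overlaps the host-side enqueueing.
void EnqueueBatched(cl::CommandQueue &queue, std::vector<cl::Kernel> &kernels,
                    const cl::NDRange &gws) {
    const size_t kBatch = 8;  // hypothetical batch size
    for (size_t i = 0; i < kernels.size(); ++i) {
        queue.enqueueNDRangeKernel(kernels[i], cl::NullRange, gws);
        if ((i + 1) % kBatch == 0) {
            queue.flush();  // issue this batch while we keep enqueueing
        }
    }
    queue.finish();  // block only once, before reading the final result
}

This is essentially the shape of TNN's NeedFlush() strategy shown further below.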
Execution optimizations
- If you use callbacks to prompt the processor to continue processing data resulting from the execution of a kernel, ensure that the callbacks are set before you flush the queue. If you do not do this, the callbacks might occur at the end of a larger batch of work, later than they might have based on actual completion of work.
Wondering when I should use clFlush or clFinish.
Reference: https://community.khronos.org/t/wondering-when-i-should-use-clflush-or-clfinish/3157
tnn/device/opencl/acc/opencl_layer_acc.cc
https://github.com/Tencent/TNN/blob/4b9ffbecc22f5ea4ba6bc4fdacff85475a59d08d/source/tnn/device/opencl/acc/opencl_layer_acc.cc#L160
Status OpenCLLayerAcc::Forward(const std::vector<Blob *> &inputs, const std::vector<Blob *> &outputs) {
    Status ret   = TNN_OK;
    int unit_idx = 0;
    for (auto execute_unit : execute_units_) {
        // Non-blocking enqueue of every kernel in this layer.
        ret = RunKernel(execute_unit.ocl_kernel, execute_unit.global_work_size, execute_unit.local_work_size,
                        ocl_context_->CommandQueue(), op_name_);
        if (ret != TNN_OK) {
            return ret;
        }
        unit_idx++;
    }

    // Periodically force the queue to issue, so the Mali driver does not
    // accumulate a long backlog of un-issued commands.
    if (NeedFlush()) {
        ocl_context_->CommandQueue()->flush();
    }
    return TNN_OK;
}

bool OpenCLLayerAcc::NeedFlush() {
    // flush by magic number: once every 10 enqueued layers
    if (0 == ocl_context_->AddAndGetFlushCount() % 10) {
        return true;
    }
    return false;
}
OpenCLContext
// https://github.com/Tencent/TNN/blob/a315d2acfb327014721b308359a6d534470289ba/source/tnn/device/opencl/opencl_context.cc
// opencl kernel flush strategy: some devices (Huawei devices in particular) have serious latency.
unsigned int OpenCLContext::AddAndGetFlushCount() {
    flush_count_++;
    return flush_count_;
}
// https://github.com/Tencent/TNN/blob/a315d2acfb327014721b308359a6d534470289ba/source/tnn/device/opencl/opencl_context.h#L88
class OpenCLContext : public Context {
public:
    OpenCLContext();
    ~OpenCLContext();

    // @brief get tnn command queue
    // @param command_queue device command queue for forward
    Status GetCommandQueue(void **command_queue) override;

    // @brief share tnn command queue with another context
    Status ShareCommandQueue(Context *context) override;

    /**
     * @brief get CommandQueue
     */
    cl::CommandQueue *CommandQueue();
    cl::CommandQueue *TuneCommandQueue();

    // load library
    virtual Status LoadLibrary(std::vector<std::string> path) override;

    /**
     * @brief before instance forward
     * @param instance instance
     */
    virtual Status OnInstanceForwardBegin() override;

    /**
     * @brief after instance forward
     * @param instance instance
     */
    virtual Status OnInstanceForwardEnd() override;

    // @brief before instance Reshape
    virtual Status OnInstanceReshapeBegin() override;

    // @brief after instance Reshape
    virtual Status OnInstanceReshapeEnd() override;

    // @brief wait for jobs in the current context to complete
    virtual Status Synchronize() override;

    // @brief increment flush_count_ and return the new value
    unsigned int AddAndGetFlushCount();

    std::map<std::string, std::vector<uint32_t>> &GetLocalSizeTuneMap();

    Status StoreLocalSizeTuneMap();

public:
    /**
     * @brief initialize opencl env
     */
    Status Init();

private:
    std::shared_ptr<cl::CommandQueue> command_queue_ = nullptr;
    std::shared_ptr<cl::CommandQueue> tune_command_queue_ = nullptr;
    std::shared_ptr<cl::CommandQueue> GetCommandQueue();
    OpenCLRuntime *opencl_runtime_ = nullptr;
    unsigned int flush_count_ = 0;
    cl_command_queue_properties properties_ = 0;
    bool ReadStatusCheck(std::ifstream &is);
    std::map<std::string, std::vector<uint32_t>> local_size_tune_map_;
    uint32_t tune_map_size_;
    static std::mutex s_mutex_;
};
from opencl-101.
magic number for workgroup
https://github.com/Tencent/TNN/blob/aedc6c849e711a6386a8d2cd4ebb0bc94c7b9285/source/tnn/device/opencl/opencl_runtime.cc#L341
// magic numbers: maximum sub-group size per Adreno model
static std::map<int, int> AdrenoSubGroup{
    {640, 128}, {630, 128}, {616, 128}, {612, 64}, {610, 64}, {540, 32}, {530, 32},
    {512, 32}, {510, 32}, {509, 32}, {506, 32}, {505, 32}, {405, 32}, {330, 16},
};

// OpenCL 2.0+ can query the sub-group size directly; otherwise fall back to the table above.
uint32_t OpenCLRuntime::GetSubGroupSize(const cl::Kernel &kernel, const cl::NDRange &range) {
    uint32_t sub_group_size = 0;
    if (ADRENO == gpu_info_.type) {
#if CL_HPP_TARGET_OPENCL_VERSION >= 200 && CL_TARGET_OPENCL_VERSION >= 210 && defined(CL_HPP_USE_CL_SUB_GROUPS_KHR)
        cl_int cl_ret;
        sub_group_size = kernel.getSubGroupInfo<CL_KERNEL_MAX_SUB_GROUP_SIZE_FOR_NDRANGE>(*device_, range, &cl_ret);
        if (cl_ret != CL_SUCCESS) {
            CHECK_CL_SUCCESS(cl_ret)
            sub_group_size = 0;
        }
#else
        if (AdrenoSubGroup.find(gpu_info_.model_num) != AdrenoSubGroup.end()) {
            sub_group_size = AdrenoSubGroup[gpu_info_.model_num];
        }
#endif
    }
    return sub_group_size;
}
from opencl-101.
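For illustration, a hypothetical use of the queried value (not TNN code, reusing the OpenCLRuntime above): a local width that is a multiple of the sub-group size tends to keep Adreno waves fully occupied.

#include <CL/cl2.hpp>
#include <cstdint>

// Hypothetical helper: use the sub-group size as the first local dimension,
// with an arbitrary fallback when the query returns 0.
cl::NDRange PickLocalSize(OpenCLRuntime *runtime, const cl::Kernel &kernel,
                          const cl::NDRange &gws) {
    uint32_t sgs     = runtime->GetSubGroupSize(kernel, gws);
    uint32_t local_x = (sgs > 0) ? sgs : 16;  // 16 is an arbitrary fallback
    return cl::NDRange(local_x, 1);
}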
CL/GL interop
Sharing is disabled in the default build; it must be enabled via CMake (it is guarded by the SHARING_MEM_WITH_OPENGL macro below).
// https://github.com/Tencent/TNN/blob/aedc6c849e711a6386a8d2cd4ebb0bc94c7b9285/source/tnn/device/opencl/opencl_runtime.cc#L341
#ifdef SHARING_MEM_WITH_OPENGL
#include <EGL/egl.h>
#endif

// Init gets platform info, gets device info, and creates the OpenCL context.
Status OpenCLRuntime::Init() {
    // ....
#if defined(SHARING_MEM_WITH_OPENGL) && (CL_HPP_TARGET_OPENCL_VERSION >= 120)
    // create the context from the current GL context
    LOGI("Create special opencl context to share with OpenGL\n");
    LOGI("eglGetCurrentContext(): 0x%x\n", eglGetCurrentContext());
    cl_context_properties context_prop[] = {CL_GL_CONTEXT_KHR, (cl_context_properties)eglGetCurrentContext(),
                                            CL_EGL_DISPLAY_KHR, (cl_context_properties)eglGetCurrentDisplay(), 0};
    context_ = std::shared_ptr<cl::Context>(new cl::Context(*device_, context_prop, nullptr, nullptr, &err));

    if (err != CL_SUCCESS) {
        LOGE(
            "Create special opencl context failed, create common opencl "
            "context instead.\n");
        context_ = std::shared_ptr<cl::Context>(new cl::Context(*device_, nullptr, nullptr, nullptr, &err));
    }
#else
    LOGI("Create common opencl context\n");
    context_ = std::shared_ptr<cl::Context>(new cl::Context(*device_, nullptr, nullptr, nullptr, &err));
#endif
    // ....
}
from opencl-101.
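What the shared context enables, as a hedged sketch (not TNN code, names illustrative; requires the cl_khr_gl_sharing extension): an existing GL texture can be wrapped as a CL image without a copy.

#include <CL/cl2.hpp>
#include <GLES2/gl2.h>
#include <vector>

// Wrap a GL texture as a CL image and hand it back and forth between APIs.
void UseGLTexture(cl::Context &context, cl::CommandQueue &queue, GLuint gl_tex) {
    cl_int err = CL_SUCCESS;
    cl::ImageGL image(context, CL_MEM_READ_WRITE, GL_TEXTURE_2D, 0, gl_tex, &err);

    std::vector<cl::Memory> objs = {image};
    queue.enqueueAcquireGLObjects(&objs);  // GL must be done with the texture first
    // ... enqueue kernels that read/write `image` ...
    queue.enqueueReleaseGLObjects(&objs);  // return ownership to OpenGL
    queue.finish();
}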
tune
// https://github.com/Tencent/TNN/blob/4b9ffbecc22f5ea4ba6bc4fdacff85475a59d08d/source/tnn/device/opencl/acc/opencl_layer_acc.cc#L160
Status OpenCLLayerAcc::Forward(const std::vector<Blob *> &inputs, const std::vector<Blob *> &outputs) {
#if defined(LOCAL_SIZE_FINE_TUNE) && TNN_PROFILE
    auto execute_unit_org = execute_units_[0];
    auto max_wgs          = execute_unit_org.workgroupsize_max;
    // candidate local work sizes for 3D global work sizes
    std::vector<std::vector<uint32_t>> local_size_list_3d = {
        {16, 4, 1}, {8, 8, 1}, {4, 16, 1}, {2, 32, 1}, {1, 64, 1}, {2, 64, 1}, {4, 64, 1},
        {8, 64, 1}, {16, 64, 1}, {8, 64, 2}, {4, 64, 4}, {2, 64, 8}, {2, 64, 4}, {},
    };
    // candidate local work sizes for 2D global work sizes, derived from the max workgroup size
    std::vector<std::vector<uint32_t>> local_size_list_2d = {
        {2, max_wgs / 2}, {4, max_wgs / 4}, {8, max_wgs / 8},
        {16, max_wgs / 16}, {max_wgs / 2, 2}, {max_wgs / 4, 4},
        {max_wgs / 8, 8}, {max_wgs / 16, 16}, {},
    };

    std::vector<uint32_t> local_size_default;
    if (execute_unit_org.global_work_size.size() == 2) {
        local_size_default = LocalWS2DDefault(execute_unit_org);
    } else if (execute_unit_org.global_work_size.size() == 3) {
        local_size_default = LocalWS3DDefault(execute_unit_org);
    }
    OpenCLExecuteUnit exec_unit_default = execute_unit_org;
    exec_unit_default.local_work_size   = local_size_default;
    execute_units_.push_back(exec_unit_default);

    // duplicate the execute unit once per candidate, so each candidate gets profiled
    if (execute_unit_org.global_work_size.size() == 2) {
        for (auto local_size : local_size_list_2d) {
            OpenCLExecuteUnit exec_unit_temp = execute_unit_org;
            exec_unit_temp.local_work_size   = local_size;
            execute_units_.push_back(exec_unit_temp);
        }
    } else if (execute_unit_org.global_work_size.size() == 3) {
        for (auto local_size : local_size_list_3d) {
            OpenCLExecuteUnit exec_unit_temp = execute_unit_org;
            exec_unit_temp.local_work_size   = local_size;
            execute_units_.push_back(exec_unit_temp);
        }
    }
#endif
    // ... (the kernels are then run and profiled as in the Forward shown earlier)
}
from opencl-101.
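A sketch of how such candidates could be compared (assumed logic, simplified from what the TNN_PROFILE path presumably measures; 2D candidates only, for brevity): run each local size on a profiling-enabled queue and keep the fastest.

#include <CL/cl2.hpp>
#include <cstdint>
#include <vector>

// Time each candidate local work size and return the best one. An empty
// candidate means "let the driver choose" (a null NDRange).
std::vector<uint32_t> PickBestLocalSize(
        cl::CommandQueue &profiling_queue,  // created with CL_QUEUE_PROFILING_ENABLE
        cl::Kernel &kernel, const cl::NDRange &gws,
        const std::vector<std::vector<uint32_t>> &candidates) {
    double best_us = 1e30;
    std::vector<uint32_t> best;
    for (const auto &lws : candidates) {
        cl::NDRange local = lws.empty() ? cl::NDRange() : cl::NDRange(lws[0], lws[1]);
        cl::Event ev;
        profiling_queue.enqueueNDRangeKernel(kernel, cl::NullRange, gws, local,
                                             nullptr, &ev);
        ev.wait();
        cl_ulong start = ev.getProfilingInfo<CL_PROFILING_COMMAND_START>();
        cl_ulong end   = ev.getProfilingInfo<CL_PROFILING_COMMAND_END>();
        double us      = static_cast<double>(end - start) / 1e3;  // ns -> us
        if (us < best_us) {
            best_us = us;
            best    = lws;
        }
    }
    return best;
}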