Tengine 使用 CUDA 进行部署
编译
参考 源码编译(CUDA) 章节生成部署所需要的以下库文件:
待补充
运行
模型格式
CUDA 当前仅支持加载 Float32 tmfile。
推理精度设置
CUDA 支持 Float32 一种精度模型进行网络模型推理,需要在执行 prerun_graph_multithread(graph_t graph, struct options opt)
之前通过 struct options opt
显式设置推理精度。
Enable GPU FP32 mode
/* set runtime options */
struct options opt;
opt.num_thread = num_thread;
opt.cluster = TENGINE_CLUSTER_ALL;
opt.precision = TENGINE_MODE_FP32;
opt.affinity = 0;
后端硬件绑定
在加载模型前,需要显式指定 CUDA 硬件后端 context,并在调用 graph_t create_graph(context_t context, const char* model_format, const char* fname, ...)
时传入该参数。
/* create NVIDIA CUDA backend */
context_t cuda_context = create_context("cuda", 1);
add_context_device(cuda_context, "CUDA");
/* create graph, load tengine model xxx.tmfile */
create_graph(cuda_context, "tengine", model_file);
参考 Demo
源码请参考 tm_classification_tensorrt.c
执行结果
nvidia@xaiver:~/tengine-lite-tq/build-linux-cuda$ ./tm_classification_cuda -m mobilenet_v1.tmfile -i cat.jpg -g 224,224 -s 0.017,0.017,0.017 -w 104.007,116.669,122.679 -r 10
Tengine plugin allocator CUDA is registered.
tengine-lite library version: 1.2-dev
model file : /home/nvidia/tengine-test/models/mobilenet_v1.tmfile
image file : /home/nvidia/tengine-test/images/cat.jpg
img_h, img_w, scale[3], mean[3] : 224 224 , 0.017 0.017 0.017, 104.0 116.7 122.7
Repeat 10 times, thread 1, avg time 4.58 ms, max_time 5.72 ms, min_time 4.24 ms
--------------------------------------
8.574145, 282
7.880118, 277
7.812578, 278
7.286452, 263
6.357486, 281
--------------------------------------