架构详解


设计背景

先行者 Tengine

最早的 Tengine 是使用 C++ 分层设计的,编译采用 make,也就是嵌入系统常常采用的 makefile,编写之初支持的硬件是 CPU。可扩展性是 Tengine 和现在的 Lite 版本的设计核心,不仅仅是 Operator 能够很方便的通过注册宏扩展,各个主要模块也是扩展的。

推出 Tengine 发展一段时间后,市场反馈需要 Tengine 支持 MCUVLIW 架构的 DSP;在一些项目中,编译工具链对 C++ 较新标准支持较差,甚至是没有 C++ 编译器,这就对 Tengine 的架构产生了新的要求。

在这种背景下,项目团队提出重新设计代号 LiteTengineC 语言的新主要分支,经过一段时间的开发后,后续项目陆续切换到 Tengine Lite 上。从能力上,Lite 和原分支是一样的。经过一段时间后,全部 Tengine 项目维护周期终结后,Tengine 分支将会进入存档冻结状态,不再维护。

继任者 Tengine Lite

“薪火相传 砥砺前行”,Tengine Lite 的早期版本主题设计和 Tengine 高度相似,较早的 Tengine Lite 版本注册机制依赖 GNU C 扩展,而 GNU C 扩展并不是标准 C 的内容。当社区呼唤需要扩展支持到非 GCC 系列的场景,如 Microsoft Visual Studio 上时,遇到了较多的困难。另一方面,纯 C 的设计使得 Tengine Lite 的入手难度较高,社区和项目反馈需要提高易用性,“Tengine 从入门到‘放弃’”的时间要“显著”缩短。

经过一段时间的磨合后,团队决定重新设计主要模块,重点是解决这几方面的问题。

架构设计

重新设计的设计目标之一是,采用纯 C 设计 TengineLite 分支(以下简称 Tengine,不再强调 Lite 分支)最小 Runtime,复杂的功能和逻辑可以使用 C++ 开发。这样保持纯 C Runtime 在局限场景下优势的同时,使用 C++ 开发复杂模块,使注意力集中在开发算法本身上。

架构概览

../_images/architecture-1.png

重要模块介绍

网络描述 graph 和相关结构体

source/graph 目录下是 Convolution Neural Network 的描述结构体和函数。

/*!
 * @struct ir_graph_t
 * @brief  Abstract graph intermediate representation
 */
typedef struct graph
{
    struct tensor** tensor_list;            //!< the tensor list of a graph
    struct node**   node_list;              //!< the node list of a graph
    int16_t* input_nodes;                   //!< input nodes index array of a graph
    int16_t* output_nodes;                  //!< output nodes index array of a graph

    uint16_t tensor_num;                    //!< the count of all graph tensor
    uint16_t node_num;                      //!< the count of all graph node
    uint16_t input_num;                     //!< input nodes index count of a graph
    uint16_t output_num;                    //!< input nodes index count of a graph

    int8_t   graph_layout;                  //!< the data layout of a graph
    int8_t   model_layout;                  //!< model layout of graph source model
    int8_t   model_format;                  //!< model format of graph source model

    uint8_t  status;                        //!< the status of graph

    struct   serializer* serializer;        //!< serializer of graph
    void*    serializer_privacy;            //!< privacy data of serializer

    struct   device* device;                //!< assigned nn_device for this graph
    void*    device_privacy;                //!< privacy data of device

    struct   attribute*  attribute;         //<! attribute of graph

    struct vector* subgraph_list;           //!< subgraph list of this graph
} ir_graph_t;

网络的主体 DAGgraph 和其中的 node 节点共同描述;全部需要的 nodenode->node_list 中存储;node 主要描述了该节点的依赖 tensoroperator 情况。所有需要的 tensorgraph->tensor_list 中存储。graph->node_numgraph->tensor_num 分别描述了 graphnodetensor 数量。数据(tensor) + 操作(operator)构成的节点(node)最后构成一个完整的图(graph),根据调度器(scheduler)模块的分配,形成一个完整的运行图运行在 CPU 设备(device)上,这也是典型的 CPUCNN 模型的应用场景。

/*!
 * @struct ir_node_t
 * @brief  Abstract node intermediate representation
 */
typedef struct node
{
    uint16_t  index;            //!< the index of a node
    uint8_t   dynamic_shape;    //!< flag of dynamic shape
    uint8_t   input_num;        //!< count of input tensor
    uint8_t   output_num;       //!< count of output tensor
    uint8_t   node_type;        //!< type of node: { input, output, intermediate }
    int8_t    subgraph_idx;     //!< id of the owner subgraph

    uint16_t* input_tensors;    //!< id array of input tensor
    uint16_t* output_tensors;   //!< id array of output tensor

    char* name;                 //!< name of a node

    struct op op;               //!< operator of a node
    struct graph* graph;        //!< pointer of the related graph
} ir_node_t;

node 是网络的节点,不同功能的 node 协助完成数据准备和计算的工作。node->input_tensorsnode->output_tensors 分别描述了一个 node 的输入输出 tensor 的索引 index,结合 node->input_numnode->output_num 就可以完成节点的遍历。实际的 node 是存储在 graph->node_list 中的。

/*!
 * @struct ir_tensor_t
 * @brief  Abstract tensor intermediate representation
 */
typedef struct tensor
{
    uint16_t index;                          //!< the index of a tensor
    int16_t  producer;                       //!< node id, '-1' means no producer
    int16_t  consumer[TE_MAX_CONSUMER_NUM];  //!< consumer nodes array

    uint8_t  reshaped;                       //!< the tensor's shape has changed
    uint8_t  consumer_num;                   //!< count of consumer nodes
    uint8_t  tensor_type;                    //!< tensor_type: { const, input, var, dep }
    uint8_t  data_type;                      //!< data_type: { int8, uint8, fp32, fp16, int32 }
    uint8_t  dim_num;                        //!< count of dimensions
    uint8_t  elem_size;                      //!< size of single element
    uint8_t  subgraph_num;                   //!< count of all subgraph those will waiting this tensor ready
    uint8_t  free_host_mem;                  //!< should free host memory?
    uint8_t  internal_allocated;             //!< how memory is allocated?
    uint8_t  layout;                         //!< tensor layout: { TENGINE_LAYOUT_NCHW, TENGINE_LAYOUT_NHWC }

    uint16_t quant_param_num;                //!< quantization dimension
    uint32_t elem_num;                       //!< count of total elements
    int dims[TE_MAX_SHAPE_DIM_NUM];          //!< shape dimensions

    /*!
     * @union anonymity data pointer
     * @brief give useful pointer pointer
     */
    union
    {
        void*    data;
        int8_t*    i8;
        uint8_t*   u8;
        float*    f32;
        uint16_t*   f16;
        int32_t*  i32;
    };

    char* name;                             //!< tensor name

    /*!
     * @union anonymity quantization scale union
     * @brief scale or its array
     */
    union
    {
        float* scale_list;
        float  scale;
    };

    /*!
     * @union anonymity quantization zero point union
     * @brief zero point or its array
     */
    union
    {
        int  zero_point;
        int* zp_list;
    };

    struct dev_mem* dev_mem;
    uint8_t* subgraph_list;                 //!< subgraph index list of those subgraph will waiting this tensor ready
} ir_tensor_t;

tensornode 完成功能的数据基础。tensor->dims 描述了 tensorshapetensor->quant_param_num, tensor->scaletensor->scale_listtensor->zero_pointtensor->zp_list 三者共同描述了 tensor 的量化情况,具体的数值类型由 tensor->data_type 描述。 当用户通过 API 对函数进行修改后,tensor->free_host_memtensor->internal_allocated 会进行变化,用来区分是由内部释放还是用户手动释放。

/*!
 * @struct ir_op_t
 * @brief  Abstract operator intermediate representation
 */
typedef struct op
{
    uint16_t type;                          //!< the type of a operator
    uint8_t  version;                       //!< the version of a operator
    uint8_t  same_shape;                    //!< the flag of weather the operator will keep shape
    uint16_t param_size;                    //!< size of parameter memory buffer
    void* param_mem;                        //!< parameter memory buffer
    int (*infer_shape)(struct node*);       //!< infer(or broadcast) the shape from input to output(s)
} ir_op_t;

operatornode 完成功能的行为基础。op->type 描述了类型,这是这个 op 的行为基础。

/*!
 * @struct ir_subgraph_t
 * @brief  Abstract subgraph intermediate representation
 */
typedef struct subgraph
{
    uint8_t   index;                //!< the index of a subgraph
    uint8_t   input_ready_count;    //!< the count of all in ready input tensors
    uint8_t   input_wait_count;     //!< the count of all out of ready input tensors
    uint8_t   input_num;            //!< the count of input tensors
    uint8_t   output_num;           //!< the count of output tensors
    uint8_t   status;               //!< the execution status of subgraph

    uint16_t  node_num;             //!< the count of nodes in subgraph
    uint16_t* node_list;            //!< all nodes index list of subgraph

    uint16_t* input_tensor_list;    //!< input tensors index list of subgraph
    uint16_t* output_tensor_list;   //!< output tensors index list of subgraph

    struct graph*  graph;           //!< the pointer of the related graph

    struct device* device;          //!< the device which will the subgraph running on
    void*  device_graph;            //!< the related device graph
} ir_subgraph_t;

subgraph 是设备有关的,一个 subgraph 只工作于一个单一 device。当网络的全部 node 完全运行于单一 device 时,subgraph 就包含全部 graph 中的 node。当运行于多 device 时,根据实际 deviceoperator 支持情况,划分不同的 node 到多个不同的 subgraph 中。最小的 subgraph 只包含一个 node 节点;最大的 subgraph 有全部的 node

设备描述 device 和相关结构体

source/device 目录下是 Convolution Neural Network 的描述结构体和函数,子目录中是各种 device 的实现。其中 source/device/cpuTengine 支持的全部 CPU 的相关代码。

/*!
 * @struct nn_device_t
 * @brief  Abstract neural network runnable device description struct
 */
typedef struct device
{
    const char* name;
    struct interface* interface;      //!< device scheduler operation interface
    struct allocator* allocator;      //!< device allocation operation interface
    struct optimizer* optimizer;      //!< device optimizer operation interface
    struct scheduler* scheduler;      //!< device scheduler
    void*  privacy;                   //!< device privacy data
} ir_device_t;

device 是由嵌套的结构体构成的,结合描述了三类主要接口。关于详细细节和如何添加 device 还可以参考 扩展硬件后端 文档。

每当实现一个 device 的时候,都需要调用 int register_device(struct device* device); 函数,将 device 注册到 Tengine 中。运行时,可以通过 int add_context_device(context_t context, const char* dev_name);int set_context_device(context_t context, const char* dev_name, const void* dev_option, size_t dev_opt_size); 设置需要使用的 device,对于带有 optiondevice,更推荐用后一个接口一并设置 option。比如,在 TensorRT device 中,需要初始化时指定多个 GPU 以及推理精度时特别有用。CPU device 被假设为总是可用。

模型解析 serializer 和相关结构体

/*!
 * @struct serializer_t
 * @brief  Abstract serializer
 */
typedef struct serializer
{
    const char* (*get_name)(struct serializer*);

    //!< load model from file
    int (*load_model)(struct serializer*, struct graph*, const char* fname, va_list ap);

    //!< load model from memory
    int (*load_mem)(struct serializer*, struct graph*, const void* addr, int size, va_list ap);

    //!< unload model, free serializer and device related resource
    int (*unload_graph)(struct serializer*, struct graph*, void* s_priv, void* dev_priv);

    //!< interface exposed for register operator extension
    int (*register_op_loader)(struct serializer*, int op_type, int op_ver, void* op_load_func, void* op_type_map_func, void* op_ver_map_func);

    //!< interface exposed for register operator extension
    int (*unregister_op_loader)(struct serializer*, int op_type, int op_ver, void* op_load_func);

    //!< interface exposed for initialize serializer
    int (*init)(struct serializer*);

    //!< interface exposed for release serializer
    int (*release)(struct serializer*);
} serializer_t;

支持新的模型格式解析,只需要填充并注册一个 serializer 即可完成。Tengine 正计划通过此模块将主流的模型支持增加进来。

模块注册

Tengine 能够工作有赖于注册的各个模块,这些注册接口和位置可以参考 source/module 目录下的相关代码。

static vector_t* internal_serializer_registry = NULL;   //!< registry of model serializer
static vector_t* internal_device_registry     = NULL;   //!< registry of runnable neural network device
static vector_t* internal_op_method_registry  = NULL;   //!< registry of operators
static vector_t* internal_op_name_registry    = NULL;   //!< registry of operators name

source/module/module.c 中的 static struct vector 中可以知道, serializerdeviceoperaor 都被注册为 vector 中,在 module.h 中提供了注册和查找的函数。

/*!
 * @brief Register a serializer.
 *
 * @param [in]  serializer: The pointer to a struct of serializer.
 *
 * @return statue value, 0 success, other value failure.
 */
int register_serializer(struct serializer* serializer);


/*!
 * @brief Find the serializer via its name.
 *
 * @param [in]  name: The name of serializer.
 *
 * @return  The pointer of the serializer.
 */
struct serializer* find_serializer_via_name(const char* name);


/*!
 * @brief Find the serializer via its registered index.
 *
 * @param [in]  index: The index of serializer.
 *
 * @return  The pointer of the serializer.
 */
struct serializer* find_serializer_via_index(int index);


/*!
 * @brief Get count of all registered serializer.
 *
 * @return  The count of registered serializer.
 */
int get_serializer_count();


/*!
 * @brief Unregister a serializer.
 *
 * @param [in]  serializer: The pointer to a struct of serializer.
 *
 * @return statue value, 0 success, other value failure.
 */
int unregister_serializer(struct serializer* serializer);


/*!
 * @brief Release all serializer.
 *
 * @return statue value, 0 success, other value failure.
 */
int release_serializer_registry();


/*!
 * @brief Register a device.
 *
 * @param [in]  device: The pointer to a struct of device.
 *
 * @return statue value, 0 success, other value failure.
 */
int register_device(struct device* device);


/*!
 * @brief Find the device via its name.
 *
 * @param [in]  name: The name of device.
 *
 * @return  The pointer of the device.
 */
struct device* find_device_via_name(const char* name);


/*!
 * @brief Find the default device.
 *
 * @return  The pointer of the device.
 */
struct device* find_default_device();


/*!
 * @brief Find the device via its registered index.
 *
 * @param [in]  name: The index of device.
 *
 * @return  The pointer of the device.
 */
struct device* find_device_via_index(int index);


/*!
 * @brief Get count of all registered device.
 *
 * @return  The count of registered device.
 */
int get_device_count();


/*!
 * @brief Register a device.
 *
 * @param [in]  device: The pointer to a struct of device.
 *
 * @return statue value, 0 success, other value failure.
 */
int unregister_device(struct device* device);


/*!
 * @brief Release all device.
 *
 * @return statue value, 0 success, other value failure.
 */
int release_device_registry();


/*!
 * @brief Register an operator method.
 *
 * @param [in]  type: The type of an operator.
 * @param [in]  type: The name of an operator.
 * @param [in]  type: The method of an operator.
 *
 * @return statue value, 0 success, other value failure.
 */
int register_op(int type, const char* name, struct method* method);


/*!
 * @brief Find an operator method.
 *
 * @param [in]  type: The type of an operator method.
 * @param [in]  version: The version of an operator method.
 *
 * @return  The pointer of the method.
 */
struct method* find_op_method(int type, int version);


/*!
 * @brief Find an operator method via its registered index.
 *
 * @param [in]  index: The index of operator method.
 *
 * @return  The pointer of the operator method.
 */
struct method* find_op_method_via_index(int index);


/*!
 * @brief Find an operator name.
 *
 * @param [in]  type: The type of an operator method.
 *
 * @return  The char array of the method.
 */
const char* find_op_name(int type);


/*!
 * @brief Get count of all registered operator method.
 *
 * @return  The count of registered operator method.
 */
int get_op_method_count();


/*!
 * @brief Register an operator.
 *
 * @param [in]  type: The type of an operator method.
 * @param [in]  version: The version of an operator method.
 *
 * @return statue value, 0 success, other value failure.
 */
int unregister_op(int type, int version);


/*!
 * @brief Release all operator.
 *
 * @return statue value, 0 success, other value failure.
 */
int release_op_registry();

C 语言是没有 C++ 构造函数概念的,那么如果不做处理,就需要一套运行时注册和反注册的机制。在 Tengine v1.4 以后,注册的另一部分通过 CMake 完成。 通过一定规则扫描和收集到的文件,可以通过 cmake/registry.cmake 文件中添加的 GENERATE_REGISTER_HEADER_FILE(_REG_LEAD_STRING _DEL_LEAD_STRING _BACK_STRING _CONFIG_FILE _TARGET_FILE) 函数进行文件生成。

# generate needed registry
FUNCTION (GENERATE_REGISTER_HEADER_FILE _REG_LEAD_STRING _DEL_LEAD_STRING _BACK_STRING _CONFIG_FILE _TARGET_FILE)
    SET (_BGN_NOTICE_STR "// code generation start\n")
    SET (_END_NOTICE_STR "// code generation finish\n")

    SET (_GEN_REG_DEF_STR "${_BGN_NOTICE_STR}")
    SET (_GEN_REG_CAL_STR "${_BGN_NOTICE_STR}")
    SET (_GEN_DEL_DEF_STR "${_BGN_NOTICE_STR}")
    SET (_GEN_DEL_CAL_STR "${_BGN_NOTICE_STR}")

    FOREACH (_VAR ${ARGN})
        STRING(REGEX REPLACE ".+/(.+)\\..*" "\\1" _NAME ${_VAR})

        SET (_REG_FUNC "${_REG_LEAD_STRING}${_NAME}${_BACK_STRING}()")
        SET (_DEL_FUNC "${_DEL_LEAD_STRING}${_NAME}${_BACK_STRING}()")

        SET (_GEN_REG_DEF_STR "${_GEN_REG_DEF_STR}extern int ${_REG_FUNC};\n")

        SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}    ret = ${_REG_FUNC};\n")
        SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}    if(0 != ret)\n")
        SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}    {\n")
        SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}        TLOG_ERR(\"Tengine FATAL: Call %s failed(%d).\\n\", \"${_REG_FUNC}\", ret);\n")
        SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}    }\n")

        SET (_GEN_DEL_DEF_STR "${_GEN_DEL_DEF_STR}extern int ${_DEL_FUNC};\n")

        SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}    ret = ${_DEL_FUNC};\n")
        SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}    if(0 != ret)\n")
        SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}    {\n")
        SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}        TLOG_ERR(\"Tengine FATAL: Call %s failed(%d).\\n\", \"${_REG_FUNC}\", ret);\n")
        SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}    }\n")
    ENDFOREACH()

    SET (_GEN_REG_DEF_STR "${_GEN_REG_DEF_STR}${_END_NOTICE_STR}")
    SET (_GEN_REG_CAL_STR "${_GEN_REG_CAL_STR}    ${_END_NOTICE_STR}")
    SET (_GEN_DEL_DEF_STR "${_GEN_DEL_DEF_STR}${_END_NOTICE_STR}")
    SET (_GEN_DEL_CAL_STR "${_GEN_DEL_CAL_STR}    ${_END_NOTICE_STR}")

    CONFIGURE_FILE(${_CONFIG_FILE} ${_TARGET_FILE})
ENDFUNCTION()

这个函数的核心就是根据输入的文件列表,产生两个字符串 _GEN_REG_CAL_STR_GEN_DEL_CAL_STR,这两个字符串将会在生成过程 CONFIGURE_FILE(${_CONFIG_FILE} ${_TARGET_FILE}) 中,替换掉 header_name.h.in 中的 @_GEN_REG_CAL_STR@@_GEN_DEL_CAL_STR@ 字符串部分,分别完成函数的声明、注册函数的调用、反注册函数的调用,生成的文件一般会保存在 ${CMAKE_BINARY_DIR} 中的对应位置,这由函数的 _TARGET_FILE 指定。 以 serializerCMakeLists.txt 为例,生成函数调用如下:

# generate all serializer
GENERATE_REGISTER_HEADER_FILE("register_" "unregister_" "" "${_SRL_SRC_ROOT}/register.h.in" "${_SRL_BIN_ROOT}/register.h" "${_SRL_TM2_SRL_SOURCE}")

这样就完成了配置过程,生成的头文件进一步的在后续的编译过程中发挥作用。当用户使用静态分析功能的 IDE 审阅代码时,由于相关头文件没有生成,所以可能会发生无法跳转的情况。经过编译配置的 Microsoft Visual Studio Code 等在打开文件夹后,会启动 CMake 进行配置和生成,这时的头文件就会生成,也能进行跳转了。其他 Microsoft Visual StudioJetBrains CLionIDE 也可以完成配置和生成过程,推荐使用。