Technical Notes on ONNX Inference Acceleration


0. Preface

Taking advantage of the Dragon Boat Festival holiday, I'm tidying up some notes I took earlier. As the saying goes, the palest ink beats the best memory; writing articles is both output and input~

1. Model File Conversion

1.1 Converting .pth to ONNX

The PyTorch framework ships with an integrated onnx module, so the export path is officially supported, and ONNX covers most of PyTorch's operators. Converting a .pth model file to an .onnx file is therefore very simple; a code example follows. Note that before converting, you need to freeze the input size of the .pth model, for example:

batch_size = 1
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)

Once the input is frozen, the model has a fixed batch_size, and when you run inference with the converted onnx file, the input batch_size must match the one used at freezing time. For this example, you can only run inference with batch_size=1. If you need a different batch_size at inference time, say 10, you have to change the frozen input node before saving the onnx model, like so:

batch_size = 10
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)

With that, you have an onnx model with batch_size=10. Exporting the onnx file only takes a call to torch.onnx.export():

    model_name = model_path.split("/")[-1].split(".")[0]
    model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"

    dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
    # dummy_input = torch.randn(1, 3, 480, 640).to("cuda")  # if the input size is 640*480
    torch.onnx.export(net, dummy_input, model_path,
                      verbose=False, input_names=['input'],
                      output_names=['scores', 'boxes'])
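
Alternatively, if you want a single model that accepts arbitrary batch sizes, torch.onnx.export also takes a dynamic_axes argument that marks the batch dimension as dynamic instead of freezing it. A minimal sketch, assuming the same net, device, and input size as in the script below (the output file name is made up for illustration):

dummy_input = torch.randn(1, 3, 240, 320).to(device)
torch.onnx.export(net, dummy_input, "ulfd-dynamic-batch.onnx",  # hypothetical path
                  input_names=['input'],
                  output_names=['scores', 'boxes'],
                  # mark dim 0 of the input and outputs as a dynamic 'batch' axis
                  dynamic_axes={'input': {0: 'batch'},
                                'scores': {0: 'batch'},
                                'boxes': {0: 'batch'}})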

The full conversion script:

# -*- coding: utf-8 -*-
"""
This code is used to convert the pytorch model into an onnx format model.
"""
import argparse
import sys
import torch.onnx
from models.ulfd.lib.ssd.config.fd_config import define_img_size

input_img_size = 320  # define input size; optional values: 128/160/320/480/640/1280
define_img_size(input_img_size)
from models.ulfd.lib.ssd.mb_tiny_RFB_fd import create_Mb_Tiny_RFB_fd
from models.ulfd.lib.ssd.mb_tiny_fd import create_mb_tiny_fd


def get_args():
    parser = argparse.ArgumentParser(description='convert model to onnx')
    parser.add_argument("--net", dest='net_type', default="RFB",
                        type=str, help='net type.')
    parser.add_argument('--batch', dest='batch_size', default=1,
                        type=int, help='batch size for input.')
    args_ = parser.parse_args()

    return args_


if __name__ == '__main__':

    # net_type = "slim"  # faster inference, lower precision
    args = get_args()

    net_type = args.net_type  # slower inference, higher precision
    batch_size = args.batch_size

    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    label_path = "models/ulfd/voc-model-labels.txt"
    class_names = [name.strip() for name in open(label_path).readlines()]
    num_classes = len(class_names)

    if net_type == 'slim':
        model_path = "baseline/ulfd/version-slim-320.pth"
        # model_path = "models/pretrained/version-slim-640.pth"
        net = create_mb_tiny_fd(len(class_names), is_test=True, device=device)
    elif net_type == 'RFB':
        model_path = "baseline/ulfd/version-RFB-320.pth"
        # model_path = "models/pretrained/version-RFB-640.pth"
        net = create_Mb_Tiny_RFB_fd(len(class_names), is_test=True, device=device)

    else:
        print("unsupport network type.")
        sys.exit(1)
    net.load(model_path)
    net.eval()
    net.to(device)

    model_name = model_path.split("/")[-1].split(".")[0]
    model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"

    dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
    # dummy_input = torch.randn(1, 3, 480, 640).to("cuda")  # if the input size is 640*480
    torch.onnx.export(net, dummy_input, model_path,
                      verbose=False, input_names=['input'],
                      output_names=['scores', 'boxes'])
    print('onnx model saved ', model_path)

    """
    PYTHONPATH=. python3 inference/ulfd/pth_to_onnx.py --net RFB --batch 16

    PYTHONPATH=. python inference/ulfd/pth_to_onnx.py --net RFB --batch 3
    """

1.2 Converting .pb to ONNX

A TensorFlow .pb file can be converted to ONNX format with the tf2onnx library, though it must be said that TensorFlow does not officially support ONNX; tf2onnx is a third-party library (reference: tensorrt-cubelab-docs). Install tf2onnx:

pip install tf2onnx

Conversion command:

python -m tf2onnx.convert \
    --input ./checkpoints/new_model.pb \
    --inputs intent_network/inputs:0,intent_network/seq_len:0 \
    --outputs logits:0 \
    --output ./pb_models/model.onnx \
    --fold_const

# SavedModel
# save the graph as a SavedModel first (here via the TF-TRT converter):
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(input_saved_model_dir=input_saved_model_dir)
converter.convert()
converter.save(output_saved_model_dir)

python -m tf2onnx.convert --saved_model saved_model_dir --output model.onnx

# from a frozen .pb file
python -m tf2onnx.convert --input frozen_graph.pb  --inputs X:0,X1:0 --outputs output:0 --output model.onnx --fold_const

# from a .ckpt checkpoint
python -m tf2onnx.convert --checkpoint checkpoint.meta  --inputs X:0 --outputs output:0 --output model.onnx --fold_const

1.3 Converting ONNX to .pb

Sometimes we need to convert a model across frameworks, for instance when you've trained a model in PyTorch but need to integrate it into TensorFlow to stay consistent with other models and simplify deployment. In that case, you can convert the .pth to onnx and then convert the onnx to a .pb file; if the conversion succeeds, you can run inference in TensorFlow with the .pb file. I stress "if" because TensorFlow has no official ONNX support, and operator incompatibilities may break the converted .pb file at TF inference time. Converting onnx to .pb can be done with the onnx-tf library. Install it:

pip install onnx-tf

The full conversion script:

# -*- coding: utf-8 -*-
"""
@File  : onnx_to_pb.py
@Author: qiuyanjun
@Date  : 2020-01-10 19:22
@Desc  : 
"""
import cv2
import numpy as np
import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf


model = onnx.load('models/onnx/version-RFB-320.onnx')
tf_rep = prepare(model)

img = cv2.imread('imgs/1.jpg')
image = cv2.resize(img, (320, 240))
# sanity-check that inference works
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
image = np.transpose(image, [2, 0, 1])
image = np.expand_dims(image, axis=0)
image = image.astype(np.float32)
output = tf_rep.run(image)

print("output mat: \n", output)
print("output type ", type(output))

# create a Session and look up the input/output tensors
with tf.Session() as persisted_sess:
    print("load graph")
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    inp = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    print('input_name: ', tf_rep.tensor_dict[tf_rep.inputs[0]].name)
    print('input_names: ', tf_rep.inputs)
    out = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    print('output_name_0: ', tf_rep.tensor_dict[tf_rep.outputs[0]].name)
    print('output_name_1: ', tf_rep.tensor_dict[tf_rep.outputs[1]].name)
    print('output_names: ', tf_rep.outputs)
    res = persisted_sess.run(out, {inp: image})
    print("result is ", res)

# export as a .pb file
tf_rep.export_graph('version-RFB-320.pb')
print('onnx to pb done.')

"""cmd
PYTHONPATH=. python3 onnx_to_pb.py
"""

2. Running Inference Directly from ONNX

An onnx file can be used for inference directly; at that point the code is framework-agnostic and decoupled from the training stage. However, for inference to actually run, you still need to choose a backend for onnx. Taking TensorFlow as an example:

import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf

...
# wrap a TF backend around the onnx model
predictor = onnx.load(onnx_path)
onnx.checker.check_model(predictor)
onnx.helper.printable_graph(predictor.graph)
tf_rep = prepare(predictor, device="CUDA:0")  # default CPU

# run prediction with TF
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)  # default 0.5
tfconfig = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)

... 
with tf.Session(config=tfconfig) as persisted_sess:
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    tf_input = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    tf_scores = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    tf_boxes = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[1]].name
    )

    for file_path in listdir:
        ...
        confidences, boxes = persisted_sess.run([tf_scores, tf_boxes], {tf_input: image})
        ...

3. Accelerating Inference with onnxruntime

In fact, we can use onnx more efficiently. onnxruntime is a library that provides accelerated inference for onnx models, supporting both CPU and GPU; the GPU-accelerated package is onnxruntime-gpu, while the default package is CPU-only. Installation:

pip install onnxruntime  # CPU
pip install onnxruntime-gpu # GPU

Accelerating an onnx model with onnxruntime is very simple and takes only a few lines of code. Here is an example:

from abc import ABCMeta

import numpy as np
import onnxruntime as ort


class NLFDOnnxCpuInferBase:
    """only supports CPU, accelerated with onnxruntime."""

    __metaclass__ = ABCMeta
    ...

    def __init__(self,
                 onnx_path=ONNX_PATH):  # ONNX_PATH is defined elsewhere in the project
        """pytorch and onnx integrate well together
        :param onnx_path: path to the .onnx file
        """
        self._onnx_path = onnx_path
        # initialize an ort session from the onnx model
        self._ort_session = ort.InferenceSession(self._onnx_path)
        self._input_img = self._ort_session.get_inputs()[0].name
    ...

    # run inference via session.run
    def _detect_img_utils(self, img: np.ndarray):
        """batched input is fine."""
        feed_dict = {self._input_img: img}
        scores_before_nms, rois_before_nms = \
            self._ort_session.run(None, input_feed=feed_dict)
        return rois_before_nms, scores_before_nms
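
For reference, here is the same flow outside of any class, as a minimal standalone sketch (the model path is a hypothetical example; with onnxruntime-gpu installed, recent onnxruntime versions also let you pass a providers list to select the CUDA backend):

import numpy as np
import onnxruntime as ort

# hypothetical model path; replace with your exported onnx file
session = ort.InferenceSession("inference/ulfd/onnx/version-RFB-320-batch-1.onnx")
# e.g. with onnxruntime-gpu: ort.InferenceSession(path, providers=['CUDAExecutionProvider'])

input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 3, 240, 320).astype(np.float32)

# None -> return every output ('scores' and 'boxes' for this model)
scores, boxes = session.run(None, {input_name: dummy})
print(scores.shape, boxes.shape)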

onnxruntime automatically checks the onnx graph for unused nodes and removes them, and it also applies graph optimizations from several acceleration libraries, which speeds up inference. Some logs:

 python3 inference/ulfd/onnx_cpu_infer.py
2020-01-16 12:03:49.259044 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259478 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259492 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259501 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.1.num_batches_tracked'. It is not used by any node and should be removed from the model.

4. Experimental Results

On the ULFD face detection model, without onnxruntime acceleration, a 320×240 image takes 50~60 ms per run on CPU; with onnxruntime acceleration, it takes 8~11 ms on CPU.
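
The notes above don't include the timing code; here is a simple sketch of how such a measurement can be reproduced, assuming a session built as in section 3 (the model path is again a hypothetical example; warm up once, then average over repeated runs):

import time
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("inference/ulfd/onnx/version-RFB-320-batch-1.onnx")  # hypothetical path
input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 3, 240, 320).astype(np.float32)

session.run(None, {input_name: dummy})  # warm-up run
n = 100
t0 = time.perf_counter()
for _ in range(n):
    session.run(None, {input_name: dummy})
t1 = time.perf_counter()
print('mean inference time: {0:.2f} ms'.format((t1 - t0) * 1000 / n))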

5. Please Use NumPy Elegantly

Normalization shows up constantly in image processing, and it's needed at inference time too, where performance matters. I recently noticed that different ways of expressing the same tensor computation in numpy can have a big impact on performance. If the mean you subtract is the same for every channel, e.g. 127:

# The common approach (please don't do this):
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
# numpy's broadcasting here actually costs extra time
# Use this instead (keep dtypes consistent; subtracting a scalar constant is faster):
image = (image - 127.) / 128.
  • Benchmark code: same mean for all channels
# coding: utf-8
import cv2
import time
import numpy as np


if __name__ == '__main__':
    test_w, test_h = 500, 500

    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))


    test_count = 1000
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))

    t1 = time.time()
    image_mean = np.array([127, 127, 127])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2-t1), (t2-t1)*1000/test_count
    ))
    t3 = time.time()
    for _ in range(test_count):
        image = (resize_img - 127.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))

Experimental results: (screenshot omitted)

But what if you really do need a different mean for each channel? Please still do it this way; below is another benchmark result.

  • Benchmark code: different mean per channel
# coding: utf-8
import cv2
import time
import numpy as np


if __name__ == '__main__':
    test_w, test_h = 100, 100

    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))


    test_count = 100
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))
    print('-'*100)
    t1 = time.time()
    image_mean = np.array([127, 120, 107])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2 - t1), (t2 - t1) * 1000 / test_count
    ))
    t3 = time.time()
    image = np.zeros_like(resize_img, dtype=np.float32)  # use a float dtype, otherwise uint8 truncates the results
    for _ in range(test_count):
        image[:, :, 0] = (resize_img[:, :, 0] - 127.) / 128.
        image[:, :, 1] = (resize_img[:, :, 1] - 120.) / 128.
        image[:, :, 2] = (resize_img[:, :, 2] - 107.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))

Experimental results: (screenshot omitted)

Simply put, as long as you're willing to change a few lines of code, you can gain 5ms~15ms in performance. That is far simpler than adopting acceleration tools like TensorRT or ONNX.

6. Other References

TensorFlow+TensorRT C++
