0. Preface
Taking advantage of the Dragon Boat Festival holiday, I'm tidying up some notes I took earlier. As the saying goes, the palest ink beats the best memory; writing articles is both output and input~
1. Model File Conversion
1.1 Converting .pth to .onnx
PyTorch ships with an integrated onnx module, so export is officially supported, and ONNX covers most of PyTorch's operators. Converting a .pth model file to an .onnx file is therefore very simple. Below is a code example. Note that before converting, you need to freeze the input size of the .pth model, for example:
batch_size = 1
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
Once the input is frozen, the batch_size is fixed: when you run inference with the converted onnx file, the input batch_size must match the one used at freezing time. For this example, you can only run inference with batch_size=1. If you need a different batch_size at inference time, say 10, you have to change the frozen input node before saving the onnx model, like this:
batch_size = 10
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
This gives you an onnx model with batch_size=10. To export the onnx file, you only need to call torch.onnx.export(), as follows:
model_name = model_path.split("/")[-1].split(".")[0]
model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"
dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
# dummy_input = torch.randn(1, 3, 480, 640).to("cuda") #if input size is 640*480
torch.onnx.export(net, dummy_input, model_path,
                  verbose=False, input_names=['input'],
                  output_names=['scores', 'boxes'])
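As an aside (this is not part of the original script): if you would rather not re-export a separate onnx file for every batch size, torch.onnx.export also accepts a dynamic_axes argument that marks a dimension as variable. A minimal sketch reusing the names from the call above:
# Hedged sketch: declare dim 0 of the input and outputs as a dynamic batch axis,
# so a single onnx file can accept any batch size at inference time.
torch.onnx.export(net, dummy_input, model_path,
                  verbose=False, input_names=['input'],
                  output_names=['scores', 'boxes'],
                  dynamic_axes={'input': {0: 'batch_size'},
                                'scores': {0: 'batch_size'},
                                'boxes': {0: 'batch_size'}})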
The complete conversion code:
# -*- coding: utf-8 -*-
"""
This code is used to convert the pytorch model into an onnx format model.
"""
import argparse
import sys

import torch.onnx

from models.ulfd.lib.ssd.config.fd_config import define_img_size

input_img_size = 320  # define the input size; options: 128/160/320/480/640/1280
define_img_size(input_img_size)

from models.ulfd.lib.ssd.mb_tiny_RFB_fd import create_Mb_Tiny_RFB_fd
from models.ulfd.lib.ssd.mb_tiny_fd import create_mb_tiny_fd


def get_args():
    parser = argparse.ArgumentParser(description='convert model to onnx')
    parser.add_argument("--net", dest='net_type', default="RFB",
                        type=str, help='net type.')
    parser.add_argument('--batch', dest='batch_size', default=1,
                        type=int, help='batch size for input.')
    args_ = parser.parse_args()
    return args_


if __name__ == '__main__':
    # net_type = "slim"  # inference faster, lower precision
    args = get_args()
    net_type = args.net_type  # inference slower, higher precision
    batch_size = args.batch_size
    device = torch.device("cuda:0" if torch.cuda.is_available() else "cpu")

    label_path = "models/ulfd/voc-model-labels.txt"
    class_names = [name.strip() for name in open(label_path).readlines()]
    num_classes = len(class_names)

    if net_type == 'slim':
        model_path = "baseline/ulfd/version-slim-320.pth"
        # model_path = "models/pretrained/version-slim-640.pth"
        net = create_mb_tiny_fd(len(class_names), is_test=True, device=device)
    elif net_type == 'RFB':
        model_path = "baseline/ulfd/version-RFB-320.pth"
        # model_path = "models/pretrained/version-RFB-640.pth"
        net = create_Mb_Tiny_RFB_fd(len(class_names), is_test=True, device=device)
    else:
        print("unsupported network type.")
        sys.exit(1)

    net.load(model_path)
    net.eval()
    net.to(device)

    model_name = model_path.split("/")[-1].split(".")[0]
    model_path = f"inference/ulfd/onnx/{model_name}-batch-{batch_size}.onnx"
    dummy_input = torch.randn(batch_size, 3, 240, 320).to(device)
    # dummy_input = torch.randn(1, 3, 480, 640).to("cuda")  # if input size is 640*480
    torch.onnx.export(net, dummy_input, model_path,
                      verbose=False, input_names=['input'],
                      output_names=['scores', 'boxes'])
    print('onnx model saved ', model_path)

"""
PYTHONPATH=. python3 inference/ulfd/pth_to_onnx.py --net RFB --batch 16
PYTHONPATH=. python inference/ulfd/pth_to_onnx.py --net RFB --batch 3
"""
1.2 Converting .pb to .onnx
A .pb file can be converted to ONNX with the tf2onnx library. It must be said, however, that TensorFlow does not officially support ONNX; tf2onnx is a third-party library that converts a TensorFlow .pb file into an ONNX-format file (reference: the tf2onnx installation section of tensorrt-cubelab-docs). Install tf2onnx first:
pip install tf2onnx
Conversion command (a short sketch for finding the tensor names needed by --inputs/--outputs follows the command variants below):
python -m tf2onnx.convert \
    --input ./checkpoints/new_model.pb \
    --inputs intent_network/inputs:0,intent_network/seq_len:0 \
    --outputs logits:0 \
    --output ./pb_models/model.onnx \
    --fold_const
# saved_model
Save the model as a saved_model:
from tensorflow.python.compiler.tensorrt import trt_convert as trt
converter = trt.TrtGraphConverter(input_saved_model_dir=input_saved_model_dir)
converter.convert()
converter.save(output_saved_model_dir)
python -m tf2onnx.convert --saved_model saved_model_dir --output model.onnx
# .pb file
python -m tf2onnx.convert --input frozen_graph.pb --inputs X:0,X1:0 --outputs output:0 --output model.onnx --fold_const
# .ckpt file
python -m tf2onnx.convert --checkpoint checkpoint.meta --inputs X:0 --outputs output:0 --output model.onnx --fold_const
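The --inputs/--outputs flags require exact tensor names (such as intent_network/inputs:0 above). If you are not sure what they are in your own .pb file, here is a minimal sketch for listing the node names, assuming a TF 1.x frozen graph; the path reuses the example above:
import tensorflow as tf

# Hedged sketch: load a frozen graph and print its node names so you can pick
# the values to pass to tf2onnx's --inputs/--outputs (append ':0' to get tensor names).
with tf.gfile.GFile('./checkpoints/new_model.pb', 'rb') as f:
    graph_def = tf.GraphDef()
    graph_def.ParseFromString(f.read())
for node in graph_def.node:
    print(node.op, node.name)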
1.3 Converting .onnx to .pb
Sometimes we need to convert a model across frameworks. For example, you trained a model in PyTorch but need to integrate it into TensorFlow to stay consistent with other models and simplify deployment. In that case you can convert the .pth to onnx and then convert the onnx to a .pb file; if the conversion succeeds, you can run inference with the .pb file in TensorFlow. I stress "if" because TensorFlow does not officially support ONNX, and operator incompatibilities may cause the converted .pb file to misbehave at TF inference time. Converting onnx to a .pb file can be done with the onnx-tf library. Install it:
pip install onnx-tf
The complete conversion code:
# -*- coding: utf-8 -*-
"""
@File   : onnx_to_pb.py
@Author : qiuyanjun
@Date   : 2020-01-10 19:22
@Desc   :
"""
import cv2
import numpy as np
import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf

model = onnx.load('models/onnx/version-RFB-320.onnx')
tf_rep = prepare(model)

img = cv2.imread('imgs/1.jpg')
image = cv2.resize(img, (320, 240))

# test that the onnx-tf backend can actually run inference
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
image = np.transpose(image, [2, 0, 1])
image = np.expand_dims(image, axis=0)
image = image.astype(np.float32)

output = tf_rep.run(image)
print("output mat: \n", output)
print("output type ", type(output))

# create a Session and look up the input/output node names
with tf.Session() as persisted_sess:
    print("load graph")
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    inp = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    print('input_name: ', tf_rep.tensor_dict[tf_rep.inputs[0]].name)
    print('input_names: ', tf_rep.inputs)
    out = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    print('output_name_0: ', tf_rep.tensor_dict[tf_rep.outputs[0]].name)
    print('output_name_1: ', tf_rep.tensor_dict[tf_rep.outputs[1]].name)
    print('output_names: ', tf_rep.outputs)
    res = persisted_sess.run(out, {inp: image})
    print(res)
    print("result is ", res)

# export as a .pb file
tf_rep.export_graph('version-RFB-320.pb')
print('onnx to pb done.')
"""cmd
PYTHONPATH=. python3 onnx_to_pb.py
"""
2. Inference Directly with an ONNX Model
An onnx file can be used for inference directly; at that point the code is framework-agnostic and decoupled from the training stage. However, for inference to run you still need to pick a backend for onnx. Take TensorFlow as an example.
import onnx
import tensorflow as tf
from onnx_tf.backend import prepare
import onnx_tf
...
# wrap a TF backend around the onnx model
predictor = onnx.load(onnx_path)
onnx.checker.check_model(predictor)
onnx.helper.printable_graph(predictor.graph)
tf_rep = prepare(predictor, device="CUDA:0")  # default CPU

# run prediction with TF
gpu_options = tf.GPUOptions(per_process_gpu_memory_fraction=0.7)  # default 0.5
tfconfig = tf.ConfigProto(allow_soft_placement=True, gpu_options=gpu_options)
...
with tf.Session(config=tfconfig) as persisted_sess:
    persisted_sess.graph.as_default()
    tf.import_graph_def(tf_rep.graph.as_graph_def(), name='')
    tf_input = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.inputs[0]].name
    )
    tf_scores = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[0]].name
    )
    tf_boxes = persisted_sess.graph.get_tensor_by_name(
        tf_rep.tensor_dict[tf_rep.outputs[1]].name
    )
    for file_path in listdir:
        ...
        confidences, boxes = persisted_sess.run([tf_scores, tf_boxes], {tf_input: image})
        ...
3. Accelerating Inference with onnxruntime
In fact, we can use onnx more efficiently. onnxruntime is a library that accelerates inference for onnx models; it supports both CPU and GPU acceleration. The GPU-accelerated build is onnxruntime-gpu, and the default package is the CPU build. Installation:
pip install onnxruntime # CPU
pip install onnxruntime-gpu # GPU
Accelerating an onnx model with onnxruntime is very simple and takes only a few lines of code. Here is an example:
from abc import ABCMeta

import numpy as np
import onnxruntime as ort


class NLFDOnnxCpuInferBase:
    """only supports CPU, accelerated with onnxruntime."""
    __metaclass__ = ABCMeta
    ...

    def __init__(self,
                 onnx_path=ONNX_PATH):
        """pytorch and onnx work well together
        :param onnx_path: path to the .onnx file
        """
        self._onnx_path = onnx_path
        # initialize an onnxruntime session from the onnx model
        self._ort_session = ort.InferenceSession(self._onnx_path)
        self._input_img = self._ort_session.get_inputs()[0].name
    ...

    # run inference with session.run
    def _detect_img_utils(self, img: np.ndarray):
        """batched input is ok."""
        feed_dict = {self._input_img: img}
        scores_before_nms, rois_before_nms = \
            self._ort_session.run(None, input_feed=feed_dict)
        return rois_before_nms, scores_before_nms
onnxruntime automatically checks the onnx graph for nodes that are not needed and removes them, and it also uses several acceleration libraries to optimize the inference graph, which speeds up inference (a SessionOptions sketch for controlling this follows the logs below). Some logs:
python3 inference/ulfd/onnx_cpu_infer.py
2020-01-16 12:03:49.259044 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259478 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.9.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259492 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.4.num_batches_tracked'. It is not used by any node and should be removed from the model.
2020-01-16 12:03:49.259501 [W:onnxruntime:, graph.cc:2412 CleanUnusedInitializers] Removing initializer 'base_net.8.1.num_batches_tracked'. It is not used by any node and should be removed from the model.
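If you want explicit control over this behavior, onnxruntime exposes it through SessionOptions. A minimal sketch, not from the original post (the level shown is already the default, and the file paths are placeholders):
import onnxruntime as ort

# Hedged sketch: set the graph optimization level explicitly and optionally dump
# the optimized graph to disk for inspection.
sess_options = ort.SessionOptions()
sess_options.graph_optimization_level = ort.GraphOptimizationLevel.ORT_ENABLE_ALL
sess_options.optimized_model_filepath = "inference/ulfd/onnx/optimized.onnx"  # placeholder path
session = ort.InferenceSession("version-RFB-320-batch-1.onnx", sess_options)  # placeholder path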
4. Experimental Results
On the ULFD face-detection model with 320×240 input images, inference on CPU takes 50~60 ms without onnxruntime acceleration; with onnxruntime it takes 8~11 ms on CPU.
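For reference, a minimal sketch (not from the original post) of how such CPU latencies can be measured with onnxruntime; the model path is a placeholder following the naming used above, and the input shape matches the 320×240 examples:
import time
import numpy as np
import onnxruntime as ort

# Hedged sketch: average CPU inference latency over repeated runs.
session = ort.InferenceSession("version-RFB-320-batch-1.onnx")  # placeholder path
input_name = session.get_inputs()[0].name
dummy = np.random.randn(1, 3, 240, 320).astype(np.float32)

session.run(None, {input_name: dummy})  # warm-up
runs = 100
start = time.time()
for _ in range(runs):
    session.run(None, {input_name: dummy})
print('mean latency: {0:.2f} ms'.format((time.time() - start) * 1000 / runs))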
5. Please Use NumPy Elegantly
Mean normalization comes up constantly in image processing, and it is still needed at inference time, where performance matters. I recently noticed that different ways of writing the same NumPy tensor computation can have a big impact on performance. If the mean you subtract is the same for every channel, e.g. 127:
# The usual approach (please don't do this):
image_mean = np.array([127, 127, 127])
image = (image - image_mean) / 128
# numpy's broadcasting actually makes this cost more time.
# Do this instead (consistent dtypes; subtracting a scalar constant is more efficient):
image = (image - 127.) / 128.
- Test code: same mean for every channel
# coding: utf-8
import cv2
import time
import numpy as np

if __name__ == '__main__':
    test_w, test_h = 500, 500
    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))
    test_count = 1000
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))

    t1 = time.time()
    image_mean = np.array([127, 127, 127])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2 - t1), (t2 - t1) * 1000 / test_count
    ))

    t3 = time.time()
    for _ in range(test_count):
        image = (resize_img - 127.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))
Results:
But what if you really do need a different mean for each channel? Please still do it this way; below is another test.
- Test code: different means per channel
# coding: utf-8
import cv2
import time
import numpy as np

if __name__ == '__main__':
    test_w, test_h = 100, 100
    test_path = 'logs/test0.jpg'
    test_img = cv2.imread(test_path)
    resize_img = cv2.resize(test_img, (test_w, test_h))
    test_count = 100
    print('width: {0}, height: {1}, test_count: {2}'.format(test_w, test_h, test_count))
    print('-' * 100)

    t1 = time.time()
    image_mean = np.array([127, 120, 107])
    for _ in range(test_count):
        image = (resize_img - image_mean) / 128
    t2 = time.time()
    print('total_time_ugly: {0}s, mean_time_ugly: {1}ms'.format(
        (t2 - t1), (t2 - t1) * 1000 / test_count
    ))

    t3 = time.time()
    # use a float buffer so the normalized values are not truncated back to uint8
    image = np.zeros_like(resize_img, dtype=np.float32)
    for _ in range(test_count):
        image[:, :, 0] = (resize_img[:, :, 0] - 127.) / 128.
        image[:, :, 1] = (resize_img[:, :, 1] - 120.) / 128.
        image[:, :, 2] = (resize_img[:, :, 2] - 107.) / 128.
    t4 = time.time()
    print('total_time_elegant: {0}s, mean_time_elegant: {1}ms'.format(
        (t4 - t3), (t4 - t3) * 1000 / test_count
    ))
Results
In short, if you are willing to change a few lines of code, you can gain a 5ms~15ms performance improvement. That is far simpler than bringing in acceleration tools such as TensorRT or ONNX.
6. Other References
TensorFlow+TensorRT C++
- [1] A very valuable reference
- [2] Reference 1
- [3] Building TensorFlow from source and enabling TensorRT
- [4] Calling TensorFlow from C++
- [5] Calling a .pb model from C++
- [6] Standalone C++ deployment: using the C API
- [7] TF C++ build documentation
- [8] A GitHub wrapper around the TF C API