OpenCV dnn module examples (15): CUDA acceleration for dnn in OpenCV 4.2 (with a fix for the VS2015 build errors)

Original post, last modified 2024-06-11 17:54:06

License: CC 4.0 BY-SA

Tags: #opencv #dnn #cuda #object-detection #yolo

First published 2019-12-31 14:00:43

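This post covers the OpenCV 4.2.0 release, focusing on how to configure and build the DNN module with its new CUDA backend, how to work around the build errors, and the speedup CUDA brings, in particular for YOLOv3 object detection.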

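OpenCV 4.2.0 has been officially released, and its DNN (deep neural network) module now integrates the CUDA backend developed as a Google Summer of Code project (see the detailed changelog).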

1. Build

This is a standard build. The hardware and software environment used here:

nvidia gtx1080ti
cuda 10.1
cudnn 7.6.3
vs2015

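A newer version of Visual Studio is recommended. With VS2015, the CUDA device code fails to compile; the author of that code has already fixed the problem, so the next release should build normally. A workaround for VS2015 is given later in this post.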

1.1 CMake configuration

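Make sure WITH_CUDA, WITH_CUDNN, and OPENCV_DNN_CUDA are all checked.
If CUDA and cuDNN are installed correctly, clicking Configure fills in the CUDA and cuDNN settings automatically, including the header and library paths.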

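Then click Generate and wait for the third-party dependencies to finish downloading.

If the downloads are slow, open CMakeDownloadLog.txt in the build directory, find the download paths, fetch the files manually, and place them in the directory the log indicates, usually source/.cache.

(If CMake asks for CUDA_ARCH_BIN, enter the compute capability of your GPU. I use a GTX 1080 Ti, whose compute capability is 6.1, and set "6.1 7.0 7.5"; the 7.0 and 7.5 entries additionally cover Volta and Turing GPUs.)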
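To illustrate the value format, the sketch below builds a CUDA_ARCH_BIN string from a list of compute capabilities. `makeCudaArchBin` is a hypothetical helper, not part of OpenCV or CMake; it only shows that the option takes a space-separated list of major.minor values, and it assumes you already know which capabilities you want to target.

```cpp
#include <algorithm>
#include <string>
#include <utility>
#include <vector>

// Hypothetical helper (not part of OpenCV or CMake): turns a list of
// (major, minor) compute capabilities into the space-separated string
// that the CUDA_ARCH_BIN option expects, e.g. "6.1 7.0 7.5".
std::string makeCudaArchBin(std::vector<std::pair<int, int>> caps) {
    // Sort and drop duplicates so each architecture is listed once.
    std::sort(caps.begin(), caps.end());
    caps.erase(std::unique(caps.begin(), caps.end()), caps.end());

    std::string out;
    for (const auto& cap : caps) {
        if (!out.empty()) out += ' ';
        out += std::to_string(cap.first) + "." + std::to_string(cap.second);
    }
    return out;
}
```

Passing the capabilities 7.5, 6.1, 7.0 in any order yields the string used in this post.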

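On Linux, a command like the following can be used (the trailing .. assumes it is run from a build directory directly under the source tree):
cmake -DCMAKE_BUILD_TYPE=Release -DCMAKE_INSTALL_PREFIX=/home/opencv/opencv-4.2.0/build/install -DOPENCV_EXTRA_MODULES_PATH=/home/opencv/opencv-4.2.0/opencv_contrib-4.2.0/modules -DOPENCV_DNN_CUDA=True -DWITH_CUDA=True -DCUDA_ARCH_BIN="6.1 7.0 7.5" -DBUILD_TESTS=False ..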

Once Generate succeeds, open the project and build it.

1.2 Building the OpenCV libraries with VS2015

With VS2015, only the opencv_dnn module fails to build; everything else builds normally.
The errors are as follows:

4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(16): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getGridDim<0>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(17): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getGridDim<1>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(18): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getGridDim<2>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(21): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockDim<0>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(22): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockDim<1>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(23): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockDim<2>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(26): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockIdx<0>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(27): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockIdx<1>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(28): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getBlockIdx<2>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(31): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getThreadIdx<0>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(32): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getThreadIdx<1>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda\grid_stride_range.hpp(33): error C2912: explicit specialization 'unsigned int cv::dnn::cuda4dnn::csl::device::detail::getThreadIdx<2>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda4dnn\csl\cudnn/cudnn.hpp(41): error C2912: explicit specialization 'cudnnDataType_t cv::dnn::cuda4dnn::csl::cudnn::detail::get_data_type<__half>(void)' is not a specialization of a function template
4>d:\opencv\opencv4.2.0\sources\modules\dnn\src\cuda4dnn\csl\cudnn/cudnn.hpp(42): error C2912: explicit specialization 'cudnnDataType_t cv::dnn::cuda4dnn::csl::cudnn::detail::get_data_type(void)' is not a specialization of a function template

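The errors come from the explicit template specializations at the reported lines of these two headers.
Fix:
(1) Edit sources\modules\dnn\src\cuda\grid_stride_range.hpp
and replace lines 15-33 with: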

using dim3_member_type = decltype(dim3::x);

template <int>  __device__ dim3_member_type getGridDim();
template <> inline __device__ dim3_member_type getGridDim<0>() { return gridDim.x; }
template <> inline __device__ dim3_member_type getGridDim<1>() { return gridDim.y; }
template <> inline __device__ dim3_member_type getGridDim<2>() { return gridDim.z; }

template <int> __device__ dim3_member_type getBlockDim();
template <> inline __device__ dim3_member_type getBlockDim<0>() { return blockDim.x; }
template <> inline __device__ dim3_member_type getBlockDim<1>() { return blockDim.y; }
template <> inline __device__ dim3_member_type getBlockDim<2>() { return blockDim.z; }

using uint3_member_type = decltype(uint3::x);

template <int> __device__ uint3_member_type getBlockIdx();
template <> inline __device__ uint3_member_type getBlockIdx<0>() { return blockIdx.x; }
template <> inline __device__ uint3_member_type getBlockIdx<1>() { return blockIdx.y; }
template <> inline __device__ uint3_member_type getBlockIdx<2>() { return blockIdx.z; }

template <int> __device__ uint3_member_type getThreadIdx();
template <> inline __device__ uint3_member_type getThreadIdx<0>() { return threadIdx.x; }
template <> inline __device__ uint3_member_type getThreadIdx<1>() { return threadIdx.y; }
template <> inline __device__ uint3_member_type getThreadIdx<2>() { return threadIdx.z; }

(2) Edit sources\modules\dnn\src\cuda4dnn\csl\cudnn\cudnn.hpp
and replace lines 40-42 with:

using cudnn_data_enum_type = decltype(CUDNN_DATA_FLOAT);
template <class> cudnn_data_enum_type get_data_type();
template <> inline cudnn_data_enum_type get_data_type<half>() { return CUDNN_DATA_HALF; }
template <> inline cudnn_data_enum_type get_data_type<float>() { return CUDNN_DATA_FLOAT; }

Save both files and rebuild; the build should now succeed.
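Both replacements follow the same pattern: the return type is named once through a `using` alias built with `decltype`, and that alias is reused in the primary template declaration and in every explicit specialization, so the signatures match exactly. The host-only sketch below reproduces that pattern so it can be compiled without nvcc. `FakeDim3`, `fakeGridDim`, `FakeDataType`, and the `getFake...` functions are stand-ins invented for illustration; they are assumptions and not the real CUDA or cuDNN types.

```cpp
#include <type_traits>

// Stand-in for CUDA's dim3 so the sketch compiles without nvcc (assumption).
struct FakeDim3 { unsigned int x, y, z; };

// Stand-in for the built-in gridDim variable (assumption).
static const FakeDim3 fakeGridDim = {8, 4, 2};

// Name the member type once, then reuse the alias in the primary template
// and in every explicit specialization so the signatures match exactly.
using fake_dim3_member_type = decltype(FakeDim3::x);

template <int> fake_dim3_member_type getFakeGridDim();
template <> inline fake_dim3_member_type getFakeGridDim<0>() { return fakeGridDim.x; }
template <> inline fake_dim3_member_type getFakeGridDim<1>() { return fakeGridDim.y; }
template <> inline fake_dim3_member_type getFakeGridDim<2>() { return fakeGridDim.z; }

// Same idea for a type-to-enum mapping, mirroring get_data_type.
enum FakeDataType { FAKE_DATA_FLOAT, FAKE_DATA_DOUBLE };
using fake_data_enum_type = decltype(FAKE_DATA_FLOAT);

template <class> fake_data_enum_type getFakeDataType();
template <> inline fake_data_enum_type getFakeDataType<float>() { return FAKE_DATA_FLOAT; }
template <> inline fake_data_enum_type getFakeDataType<double>() { return FAKE_DATA_DOUBLE; }

// The alias resolves to the plain member type.
static_assert(std::is_same<fake_dim3_member_type, unsigned int>::value,
              "alias should be unsigned int");
```

In the real headers the functions are additionally marked `__device__` and read the CUDA built-ins gridDim, blockDim, blockIdx, and threadIdx.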

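2. Benchmark

Detailed performance figures are in https://github.com/opencv/opencv/pull/14827; a few related screenshots are listed here.
(1) Layers with CUDA support in dnn
(2) Inference speed of common models
These tests were run on an NVIDIA GTX 1080 Ti.

(3) Inference speed of OpenCV CUDA versus TensorFlow CUDA

(4) YOLOv3 speed comparison
These tests were run on an NVIDIA RTX 2080 Ti.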

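3. YOLOv3 comparison test

This test is compared against the earlier post "OpenCV dnn module examples (3): object detection, object_detection (2) YOLO object", using the same scene, the same model, and the same code.

With OpenCV 4.2, lines 34 and 35 of the code in that post need to be changed: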

//int backendId = cv::dnn::DNN_BACKEND_OPENCV;
//int targetId = cv::dnn::DNN_TARGET_CPU;

int backendId = cv::dnn::DNN_BACKEND_CUDA;
int targetId = cv::dnn::DNN_TARGET_CUDA;

With OpenCV as the backend, inference takes about 360 ms.
With CUDA as the backend, inference takes about 16 ms, a speedup of roughly 22x.
[Screenshot: resource usage]
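When reproducing timings like these, it helps to exclude the first few runs, since the first forward pass usually includes one-time initialization and allocation. The sketch below is one minimal way to do that in plain C++; `meanRuntimeMs` and `speedup` are hypothetical helpers, not OpenCV functions, and the callable passed in stands for whatever inference call you want to measure.

```cpp
#include <cassert>
#include <chrono>
#include <functional>

// Hypothetical helper: runs `fn` a few times untimed (warm-up), then returns
// the mean wall-clock time in milliseconds over `iters` timed runs.
double meanRuntimeMs(const std::function<void()>& fn, int warmup, int iters) {
    assert(iters > 0 && warmup >= 0);
    for (int i = 0; i < warmup; ++i) fn();  // excluded from the measurement

    using clock = std::chrono::steady_clock;
    const auto start = clock::now();
    for (int i = 0; i < iters; ++i) fn();
    const std::chrono::duration<double, std::milli> elapsed = clock::now() - start;
    return elapsed.count() / iters;
}

// Speedup of `fastMs` relative to `baselineMs`.
double speedup(double baselineMs, double fastMs) {
    assert(fastMs > 0.0);
    return baselineMs / fastMs;
}
```

In practice `fn` would wrap the forward call from the referenced post; with the figures above, `speedup(360.0, 16.0)` evaluates to 22.5.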

15 comments

weixin_44811466 2024.07.31
Hi, I looked up my GPU's compute capability and it is 8.6. How should I set cuda_arch_bin?

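dfgfdhdfd 2020.09.24
Hello! When I use opencv dnn to run a TensorFlow .pb model (with CUDA acceleration configured, i.e. forward inference on the GPU), the first image takes over 1700 ms, while every later inference takes only a few ms. Is this normal? Do you have any experience with this? (My guess is that the GPU needs time to start up, but I am not sure.)
- aworkholic replying to dfgfdhdfd 2020.09.29
  On the first run, the network's inputs and outputs are initialized and their buffers are allocated once; later forward calls only update them. You can step through it in a debugger to confirm.

KeepHoper 2020.07.16
Why is the inference time around 20 ms while net.forward(outs, getOutputsNames(net)); takes 300 ms? Can this be fixed?
- KeepHoper replying to aworkholic 2020.07.18
  The code I use is similar to the program in this post of yours: https://blog.youkuaiyun.com/wanggao_1990/article/details/86713634 The inference time is the "inference time" shown in the top-left corner of the image, and both backend and target are set to CUDA.
- aworkholic replying to KeepHoper 2020.07.17
  Please clarify what you mean by inference time: another framework, or the code in OpenCV? Assuming you mean OpenCV dnn inference, which code does the 20 ms cover? Finally, check that the two timings were measured the same way, including parameters, GPU usage, and debug versus release builds.

回来了。。。 2020.04.10
I built it following your steps and the build succeeded. CUDA itself can be called, but the DNN module cannot use CUDA. What could be the reason?
- sophialucky replying to aworkholic 2020.09.29
  Release also takes more than 2 s, which is still slow.
- aworkholic replying to sophialucky 2020.09.29
  Use release mode.
- sophialucky replying to aworkholic 2020.09.29
  [ INFO:0] global E:\work\opencv440\sources\modules\dnn\src\dnn.cpp (2401) cv::dnn::dnn4_v20200609::Net::Impl::initCUDABackend CUDA backend will fallback to the CPU implementation for the layer "_input" of type __NetInputLayer__
- aworkholic replying to 回来了。。。 2020.04.10
  Check the error message.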