ICP算法加速优化--多线程和GPU

原创已于 2024-11-20 23:05:17 修改 · 4.7k 阅读

41 ·

CC 4.0 BY-SA版权

文章标签：

#点云 #ICP算法 #OpenMP #CUDA

于 2022-11-25 19:12:39 首次发布

PCL 同时被 3 个专栏收录

26 篇文章

订阅专栏

CUDA

24 篇文章

订阅专栏

6D pose estimation

17 篇文章

订阅专栏

本文介绍了作者对ICP算法的C++实现进行了重构，移除第三方库并优化了多线程和GPU加速。实验结果显示，多核CPU和GPU在处理大规模点云时有显著优势，但在少量点云时GPU性能受限于数据传输。

LZ之前的文章ICP算法实现（C++）用C++实现了基础的ICP算法，由于该算法是一种迭代的优化算法，里面含有大量循环操作以及矩阵运算，可以通过使用多线程或者GPU硬件来进行加速，具体分别可以通过OpenMP和CUDA编程实现。这里给出的代码是根据github地址：https://github.com/alex-van-vliet/icp的代码改写的。原作者的代码质量还是不错的，有许多值得借签和学习的地方。但是考虑到使用的第三方库太多不便于配置和使用，LZ把这份代码重构了一下。原作者在代码里造了很多轮子，比如自己实现了Point3D、matrix以及vp-tree（也是一种搜索树，比PCL库中的kd-tree出现时间略晚）的数据结构，但是SVD分解还是调用了Eigen库（可能他也觉得底层实现太麻烦了吧~），LZ把这里面的矩阵结构统一用Eigen库实现了。另外去掉了一些不方便编译的第三方库，并简化了CmakeLists的内容以及程序的结构。如果只编译libcpu库只需依赖Eigen，编译libgpu库的话需要CUDA。

该工程的运行效果如下所示：
lib_icp.exe line1.pcd line2.pcd --cpu 100 1e-6 32 （CPU单核，0ms）
lib_icp.exe line1.pcd line2.pcd --cpu 100 1e-6 32 （CPU多核，2ms）
lib_icp.exe line1.pcd line2.pcd --gpu 100 1e-6 1024 （GPU，81ms）
lib_icp.exe bunny1.pcd bunny2.pcd --cpu 100 1e-6 32 （CPU单核，1407ms）
lib_icp.exe bunny1.pcd bunny2.pcd --cpu 100 1e-6 32 （CPU多核，246ms）
lib_icp.exe bunny1.pcd bunny2.pcd --gpu 100 1e-6 1024 （GPU，156ms）
lib_icp.exe horse1.pcd horse2.pcd --cpu 100 1e-6 32 （CPU单核，12585ms）
lib_icp.exe horse1.pcd horse2.pcd --cpu 100 1e-6 32 （CPU多核，1603ms）
lib_icp.exe horse1.pcd horse2.pcd --gpu 100 1e-6 1024 （GPU，363ms）
Release模式下编译，测试平台为Windows10系统，内存32G，CPU是i7-12700（20线程），GPU是NVIDIA GeForce RTX 3070 Laptop GPU（显存8G，每个block最大线程数为1024），计算的运行时间为三次求平均值。修改后的代码在linux系统也能编译，本人在WSL（Ubuntu20.04）的docker中测试过和Windows本地差别不大，但是调用CUDA版本代码计算结果有误。line1.pcd和line2.pcd点数为10，bunny1.pcd和bunny2.pcd点数为35947，horse1.pcd和horse2.pcd点数为193940；另外LZ调用PCL库计算这三组点云ICP配准耗时分别为0ms，758ms，30550ms。可以看出多核和GPU的加速作用随着点云点数增加优势还是非常明显的，但是点数非常少的话，GPU由于需要和CPU进行数据传输，此时运算速度会不太理想。

自己写了测试main函数，可以直接读取.pcd格式的点云文件（读取二进制储存的pcd文件更为高效，原作者代码是读.txt文件），并添加了时间计算部分和可视化部分，需要另外配置PCL库。

#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/common/transforms.h>

#ifdef _WIN32
	#include <ctime>
	#include <pcl/visualization/pcl_visualizer.h>
#endif

#ifdef __linux__
	#include <sys/time.h>
#endif

#include "libcpu/icp.h"
#include "libgpu/icp.h"


int main(int argc, char* argv[])
{
	std::cout << "\033[31mUsage:	.exe <source .pcd> <target .pcd> --cpu|gpu <iterations> <error> <capacity>\n"
		<< "for example:	./lib_icp.exe bunny1.pcd bunny2.pcd  --gpu 100 1e-6 1024\033[0m" << std::endl;

	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_src(new pcl::PointCloud<pcl::PointXYZ>);
	pcl::io::loadPCDFile(argv[1], *cloud_src);

	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_tgt(new pcl::PointCloud<pcl::PointXYZ>);
	pcl::io::loadPCDFile(argv[2], *cloud_tgt);

	bool use_gpu = (std::string(argv[3]) == "--gpu");
	if (use_gpu)	std::cout << "use gpu" << std::endl;
	else	std::cout << "use cpu" << std::endl;

	size_t iterations = std::atoi(argv[4]);
	std::cout << "max iterations:" << iterations << std::endl;

	float error = std::atof(argv[5]);
	std::cout << "error:" << error << std::endl;

	int capacity = std::atoi(argv[6]);
	std::cout << "capacity:" << capacity << std::endl;

	std::vector<Eigen::Vector3f> p(cloud_src->size());
	std::vector<Eigen::Vector3f> q(cloud_tgt->size());
	for (size_t i = 0; i < cloud_src->size(); i++)
	{
		p[i] = cloud_src->points[i].getVector3fMap();
	}
	for (size_t i = 0; i < cloud_tgt->size(); i++)
	{
		q[i] = cloud_tgt->points[i].getVector3fMap();
	}

#ifdef _WIN32
	clock_t start = clock();
#endif

#ifdef __linux__
	struct timeval start, end;
	gettimeofday(&start, 0);
#endif

	auto [transformation, new_p] = (use_gpu ? libgpu::icp(q, p, iterations, error, capacity): libcpu::icp(q, p, iterations, error, capacity));


#ifdef _WIN32
	clock_t end = clock();
	std::cout << "time cost:" << end - start << "ms" << std::endl;
#endif

#ifdef __linux__
	gettimeofday(&end, 0);
	std::cout << "time cost:" << (1000000.0 * (end.tv_sec - start.tv_sec) + end.tv_usec - start.tv_usec) / 1000.0 << "ms" << std::endl;
#endif
	
	std::cout << "transformation matrix: \n" << transformation << std::endl;

	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_icp(new pcl::PointCloud<pcl::PointXYZ>);
	pcl::transformPointCloud(*cloud_src, *cloud_icp, transformation);
	pcl::io::savePCDFile("cloud_icp.pcd", *cloud_icp);

#ifdef _WIN32
	pcl::visualization::PCLVisualizer viewer("registration Viewer");
	pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZ> src_h(cloud_src, 0, 255, 0); 	//原始点云绿色
	pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZ> tgt_h(cloud_tgt, 255, 0, 0); 	//目标点云红色
	pcl::visualization::PointCloudColorHandlerCustom<pcl::PointXYZ> final_h(cloud_icp, 0, 0, 255); 	//匹配好的点云蓝色

	viewer.setBackgroundColor(0, 0, 0);
	viewer.addPointCloud(cloud_src, src_h, "source cloud");
	viewer.addPointCloud(cloud_tgt, tgt_h, "target cloud");
	viewer.addPointCloud(cloud_icp, final_h, "result cloud");
	while (!viewer.wasStopped())
	{
		viewer.spinOnce(100);
	}
#endif

	return 0;
}

工程结构如下：
在这里插入图片描述

运行结果截图：
在这里插入图片描述
附：调用PCL库ICP配准API的测试代码：

#include <iostream>
#include <pcl/io/pcd_io.h>
#include <pcl/point_cloud.h>
#include <pcl/point_types.h>
#include <pcl/registration/icp.h>


int main(int argc, char* argv[])
{
	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_src(new pcl::PointCloud<pcl::PointXYZ>);
	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_tgt(new pcl::PointCloud<pcl::PointXYZ>);

	pcl::io::loadPCDFile(argv[1], *cloud_src);
	pcl::io::loadPCDFile(argv[2], *cloud_tgt);

	clock_t start = clock();

	pcl::IterativeClosestPoint<pcl::PointXYZ, pcl::PointXYZ> icp;
	pcl::PointCloud<pcl::PointXYZ>::Ptr cloud_icp_registration(new pcl::PointCloud<pcl::PointXYZ>);
	pcl::search::KdTree<pcl::PointXYZ>::Ptr tree1(new pcl::search::KdTree<pcl::PointXYZ>);
	pcl::search::KdTree<pcl::PointXYZ>::Ptr tree2(new pcl::search::KdTree<pcl::PointXYZ>);
	tree1->setInputCloud(cloud_src);
	tree2->setInputCloud(cloud_tgt);
	icp.setSearchMethodSource(tree1);
	icp.setSearchMethodTarget(tree2);
	icp.setInputSource(cloud_src);
	icp.setInputTarget(cloud_tgt);
	icp.setEuclideanFitnessEpsilon(1e-6);
	icp.setMaximumIterations(100);
	Eigen::Matrix4f transformation = Eigen::MatrixXf::Identity(4, 4);
	icp.align(*cloud_icp_registration, transformation);

	clock_t end = clock();
	std::cout << end - start << "ms" << std::endl;

	Eigen::Matrix4f transformation_icp = icp.getFinalTransformation();
	std::cout << transformation_icp << std::endl;
	std::cout << "ICP has converged:" << icp.hasConverged() << " score: " << icp.getFitnessScore() << std::endl;

	pcl::transformPointCloud(*cloud_src, *cloud_icp_registration, transformation_icp);
	pcl::io::savePCDFileBinary("icp.pcd", *cloud_icp_registration);

	return 0;
}

代码和数据集下载地址如下：
工程下载链接

18 条评论

xl__-__lx 2024.08.30
请问一下博主，有ICP用OMP加速的代码么？
- xl__-__lx回复给算法爸爸上香 2024.08.30
  是23年5月1号那篇点到面的ICP算法文章么？博主，请问一下，测试的加速效果咋样，可以告知下么？谢谢
- 给算法爸爸上香回复xl__-__lx 2024.08.30
  有啊，你翻下我以前写的

weixin_45661482 2024.08.06
博主您好，请问vptree的阈值作用是什么

lukixn_xixi 2024.04.12
博主你好，请问下eigen库是3点几的呢，CPU版本输出的点云也是压扁的状态。

—NULL—— 2024.01.11
请问libgpu库怎么下载？
- 给算法爸爸上香回复—NULL—— 2024.01.11
  里面有链接

zzzxc10 2023.12.13
博主你好，运行你给的工程代码最后结果的点云被压扁了，并且变换矩阵的 0 0 0 1到了第二行，请问这是什么原因呢？
- zzzxc10回复给算法爸爸上香 2023.12.14
  博主用的是哪个版本的CUDA，我用11.3和12.3试的都没成功
- 给算法爸爸上香回复zzzxc10 2023.12.14
  gpu版本的吗？应该是cuda或者eigen的版本有问题，之前有人好像遇到过

qq_34258179 2023.12.06
真不错，我很希望有这个源代码

bigfishlove 2023.09.04
感觉这个精度比PCL的差很多，这是为啥
- 给算法爸爸上香回复bigfishlove 2023.09.04
  你要是用的我的数据，我之前测试过误差和PCL基本上差不多的哈。要是你自己的数据，可能需要调节一下参数，

囧到没边际 2023.08.22
有尝试过nvidia官方的cuda-pcl库吗？github链接：https://github.com/NVIDIA-AI-IOT/cuPCL
- 囧到没边际回复weixin_39329492 2023.11.09
  确实，都是linux上的，因为平时都是linux开发所以没注意windows是没有的
- weixin_39329492回复囧到没边际 2023.11.08
  您好 cuda-pcl我怎么只看到.so的库呢？没有windows下的库啊！
- 囧到没边际回复给算法爸爸上香 2023.08.26
  我在台式机上用x86_64_lib分支里的成功运行了，博主也可以试试
- 给算法爸爸上香回复囧到没边际 2023.08.23
  我看了下这个官方做的性能应该不会差。但没有成功运行，不知道他这个是不是要在特定平台上，如果你有相应的环境条件可以试试。