如何调试该死的Rcpp代码？如何复用Rcpp的so共享库文件？| 向量/矩阵运算

原创已于 2024-10-23 21:21:18 修改 · 853 阅读

7 ·

CC 4.0 BY-SA版权

文章标签：

#矩阵 #c++ #r语言

于 2024-10-20 20:52:02 首次发布

R 同时被 3 个专栏收录

83 篇文章

订阅专栏

C与C++

16 篇文章

订阅专栏

数学与统计

11 篇文章

订阅专栏

目的：本文是为了在R / Rcpp环境下调试c++函数。包括如何编写，及在R中编译、运行。总不能c++函数编译成R包才能调试吧？

本文不包括：后续把C++代码加入到R包中。可参考R包中使用Rcpp。

复用Rcpp的so文件见: 2(5)

1. 相关背景

(1) 教训

不要用Rstudio，用shell R。否则会有莫名奇妙的错误。
参数里不要有注释，防止莫名其妙的报错。比如mat参数前的这个注释浪费了我好几个小时:


// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
void SNN_SmallestNonzero_Dist2(
    Eigen::SparseMatrix<double> snn,
    //Eigen::MatrixXd mat,
    int n,
    std::vector<double> nearest_dist
) {
}

解决方法：该参数整行删掉，或者仅删除注释符号。

(2) 编译环境变量

矩阵运算属于高性能运算，默认使用服务器了。我的环境是

CentOS7.9
g++ (GCC) 12.3.0
R version 4.3.3 (2024-02-29)
Rcpp v1.0.12
RcppEigen v0.3.3.9.4

$ cat ~/.R/Makevars
#CXX17 = g++ -std=c++17 -fPIC
CXX11=/home/wangjl/soft/gcc-12.3.0/bin/g++ -std=c++11 -fPIC
CXX14=/home/wangjl/soft/gcc-12.3.0/bin/g++ -std=c++14 -fPIC
CXX17=/home/wangjl/soft/gcc-12.3.0/bin/g++ -std=c++17 -fPIC
gcc=/home/wangjl/soft/gcc-12.3.0/bin/gcc

2. 先跑起来Rcpp代码

这个例子很容易找到：

(1) 写cpp文件

$ cat test_mat1.cpp
#include <Rcpp.h>

using namespace Rcpp;

// [[Rcpp::export]]
NumericVector timesTwo(NumericVector x) {
  return x * 2;
}

文件解释：

第1行：头文件
第2行：命名空间，下文的 NumericVector 类型全名是 Rcpp::NumericVector
第3行：// [[Rcpp::export]] 表示该函数暴漏到R中，可以在R中直接调用。
接着是函数体，就是标准的C++代码了：
返回类型函数名(参数类型虚参数){
函数体
return 返回值;
}

(2) 编译c++

shell 进入c++文件所在文件夹：

$ cd /path/to/cpp_file/

$ mkdir tmp #新建文件夹，作为编译目录

直接在当前文件夹内的shell中打开 shell R
$ R

编译和加载函数
> library(Rcpp)
> sourceCpp("test_mat1.cpp", cacheDir="./tmp/", verbose=T, rebuild=T)

参数解释：

第一个参数是 c++文件名
cacheDir 是指定编译文件夹
verbose=T 输出详细日志
rebuild=T 前置重新编译。这个很重要，默认是F，修改代码不会重新编译，看不到修改的效果。

最后几行如果没有报错，则表示正常编译。

(3) 测试

> timesTwo(c(-10,3, 20))
[1] -20   6  40
确实乘以2了

修改cpp文件，把函数体改为 *3，再次编译，运行：
> timesTwo(c(-10,3, 20))
[1] -30   9  60
确实乘以3了。

(4) 查看编译产物

$ ls -lth tmp/sourceCpp-x86_64-pc-linux-gnu-1.0.12/sourcecpp_108c6c552e9/
total 1.5M
-rwxrwxrwx+ 1 wangjl jinlab 550K Oct 20 19:33 sourceCpp_3.so  #这个_3数字随着编译次数递增
-rw-rw-rw-+ 1 wangjl jinlab 945K Oct 20 19:33 test_mat1.o
-rw-r--r--+ 1 wangjl jinlab  643 Oct 20 19:33 test_mat1.cpp
-rw-rw-rw-+ 1 wangjl jinlab  302 Oct 20 19:33 test_mat1.cpp.R

(5) 复用动态链接库 so 文件

另找一个shell，运行R环境：

$ R 
> library(Rcpp)  #这个很重要，否则报错
> dyn.load("tmp/sourceCpp-x86_64-pc-linux-gnu-1.0.12/sourcecpp_108c6c552e9/sourceCpp_28.so")
> timesTwo(c(-10,3, 20))
Error in timesTwo(c(-10, 3, 20)) : could not find function "timesTwo"
貌似不能调用了？！

> .C("timesTwo", c(-10, 3, 20))
Error in .C("timesTwo", c(-10, 3, 20)) : 
  C symbol name "timesTwo" not in load table
> .Call("timesTwo", c(-10, 3, 20))
Error in .Call("timesTwo", c(-10, 3, 20)) : 
  C symbol name "timesTwo" not in load table


使用命令行工具检查 .so 文件是否包含正确的符号。
$ nm -D tmp/sourceCpp-x86_64-pc-linux-gnu-1.0.12/sourcecpp_108c6c552e9/sourceCpp_28.so  | grep -in two
62:0000000000005b10 T sourceCpp_1_timesTwo
71:0000000000005860 T _Z8timesTwoN4Rcpp6VectorILi14ENS_15PreserveStorageEEE

原来是函数名不对，重新调用：
> library(Rcpp)
> .Call("sourceCpp_1_timesTwo", c(-10, 3, 20))
[1] -30   9  60

包装为R函数：
> timesTwo=function(arr){
	.Call("sourceCpp_1_timesTwo", arr)
}

> timesTwo(c(-10, 30)) 
[1] -30  90

复用动态链接库.so文件的优势：

方便包装成R包，给其他用户使用，掩盖掉复杂细节。
或发给别人自己的动态链接库so文件，方便保密代码的实现。

3. 矩阵运算：outerSize() 是行还是列？

矩阵包括：常规矩阵和稀疏矩阵。
R中和c++中有对应类型，调用时要保持类型配对。

We may need to supply:
• header location via -I,
• library location via -L,
• library via -llibraryname

(1) 写函数：如何输入正确的参数？

$ cat test_mat2.cpp
#include <Rcpp.h>
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
void SNN_SmallestNonzero_Dist2(
    Eigen::SparseMatrix<double> snn,
    Eigen::MatrixXd mat,
    int n,
    std::vector<double> nearest_dist
) {
  
  for (int i=0; i < snn.outerSize(); ++i){
    Rcpp::Rcout << i <<  std::endl;
  }
}

(2) 编译

> Rcpp::sourceCpp("test_mat2.cpp", cacheDir="./tmp/", verbose=T, rebuild=T)

该函数已经加载到R环境中：
> SNN_SmallestNonzero_Dist2
function (snn, mat, n, nearest_dist) 
invisible(.Call(<pointer: 0x7f9ca25137f0>, snn, mat, n, nearest_dist))

(3) 测试

这个调用不容易，参数的几个类型都不能出错。

> library(Matrix);
> sparse_matrix <- Matrix(c(0, 0, 3, 0,   4, 0, 0, 0), nrow = 2, ncol = 4, byrow=T, sparse = TRUE);sparse_matrix
[1,] . . 3 .
[2,] 4 . . .
> class(sparse_matrix)
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

转变格式为 "dgCMatrix"
> sparse_matrix_dgC <- as(sparse_matrix, "dgCMatrix");print(class(sparse_matrix_dgC))
[1] "dgCMatrix"
attr(,"package")
[1] "Matrix"

一般矩阵形式
> mat = as.matrix(sparse_matrix); mat
     [,1] [,2] [,3] [,4]
[1,]    0    0    3    0
[2,]    4    0    0    0

调用1：第二个参数不是一般矩阵，是稀疏矩阵（"dgCMatrix"），就报错！
> SNN_SmallestNonzero_Dist2(sparse_matrix_dgC, sparse_matrix, 8, c(1,2,3))
Error: Not a matrix.

调用: 正确
> SNN_SmallestNonzero_Dist2(sparse_matrix_dgC, mat, 8, c(1,2,3))
0
1
2
3

结论：可见，outerSize() 是列数。

教科书的数学上，矩阵是列向量组成的。向量默认指的是列向量。

4. 对矩阵的非零元素进行遍历

(1) 写函数：如何对稀疏矩阵遍历

$ cat test_mat3.cpp
#include <Rcpp.h>
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
void SNN_SmallestNonzero_Dist3(
    Eigen::SparseMatrix<double> snn
) {
  
  for (int j=0; j < snn.outerSize(); ++j){ //遍历列
    for (Eigen::SparseMatrix<double>::InnerIterator it(snn, j); it; ++it) { //对非0行遍历
      int row = it.row();
      double value = it.value();
      Rcpp::Rcout << "Row: " << row << ", Column: " << j << ", Value: " << value << std::endl;
    }
  }
}

(2) 编译

> Rcpp::sourceCpp("test_mat3.cpp", cacheDir="./tmp/", verbose=T, rebuild=T)

> SNN_SmallestNonzero_Dist3
function (snn) 
invisible(.Call(<pointer: 0x7f9ca18c9f70>, snn))

(3) 调用

> mat = Matrix(c(0, 0, 3, 0,   4, 0, 0, 0), nrow = 2, ncol = 4, byrow=T); mat
2 x 4 sparse Matrix of class "dgCMatrix"
            
[1,] . . 3 .
[2,] 4 . . .

> SNN_SmallestNonzero_Dist3(mat)
Row: 1, Column: 0, Value: 4
Row: 0, Column: 2, Value: 3
看着别扭，是因为外循环是列（输出的第二列）！行是内循环。

注意：在C++中下标从0开始计数。R中则从1开始计数。

5. 求向量的欧氏距离(L2范数)：.norm() 方法

(1) cpp函数

$ cat test_mat4.cpp
#include <Rcpp.h>
#include <RcppEigen.h>

using namespace Rcpp;

// [[Rcpp::depends(RcppEigen)]]
// [[Rcpp::export]]
void SNN_SmallestNonzero_Dist2(
    Eigen::SparseMatrix<double> snn,
    Eigen::MatrixXd mat,
    int n,
    std::vector<double> nearest_dist
) {
  
  for (int i=0; i < snn.outerSize(); ++i){
    Rcpp::Rcout << i <<  std::endl;
  }
  
  //第一行
  size_t cell = 0;
  int i=1;
  double res = (mat.row(cell) - mat.row(i)).norm();
  Rcpp::Rcout << "mat.row(0): " << mat.row(cell) <<  std::endl;
  Rcpp::Rcout << "mat.row(1): " << mat.row(i) <<  std::endl;
  Rcpp::Rcout << "(mat.row(0) - mat.row(1)).norm() = " << res <<  std::endl;
}

(2) 编译

> Rcpp::sourceCpp("test_mat4.cpp", cacheDir="./tmp/", verbose=T, rebuild=T)

(3) 运行

> sparse_matrix <- Matrix(c(0, 0, 3, 0,   4, 0, 0, 0), nrow = 2, ncol = 4, byrow=T, sparse = TRUE);sparse_matrix
> mat = as.matrix(sparse_matrix); mat
     [,1] [,2] [,3] [,4]
[1,]    0    0    3    0
[2,]    4    0    0    0

> SNN_SmallestNonzero_Dist2(sparse_matrix, mat, 3, c(1,2))
0
1
2
3
mat.row(0): 0 0 3 0
mat.row(1): 4 0 0 0
(mat.row(0) - mat.row(1)).norm() = 5

结论：