It is well known that data.table::fread is much faster than base::read.csv, but there is one pitfall to watch out for.
I stumbled on it while testing ranger:
library(ranger)
#library(bit64)
library(data.table)
# Read the same file three ways: base R, fread defaults, and fread with integer64 = 'numeric'
traindata1 <- read.csv('input/train.csv', header = TRUE)
traindata2 <- fread('input/train.csv', header = TRUE, data.table = FALSE, verbose = TRUE)
traindata3 <- fread('input/train.csv', header = TRUE, data.table = FALSE, verbose = TRUE, integer64 = 'numeric')
The train.csv used here is the data from a Kaggle competition: https://www.kaggle.com/c/santander-customer-satisfaction
traindata1$ID <- NULL
traindata2$ID <- NULL
traindata3$ID <- NULL
rg <- ranger(TARGET ~ ., data = traindata2, write.forest = TRUE)
Error in na.fail.default(list(TARGET = c(0L, 0L, 0L, 0L, 0L, 0L, 0L, 0L, :
missing values in object
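# Print the index of every column that contains a missing value.
# seq_along() avoids hard-coding the column count.
for (i in seq_along(traindata1)) if (!all(complete.cases(traindata1[i]))) print(i)
for (i in seq_along(traindata2)) if (!all(complete.cases(traindata2[i]))) print(i)
for (i in seq_along(traindata3)) if (!all(complete.cases(traindata3[i]))) print(i)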
##dplyr::all_equal(traindata1[203], traindata3[203])
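A quicker way to narrow down the offending columns is to compare the column classes produced by the two readers. The following is only a minimal sketch: the inline CSV is a hypothetical stand-in for train.csv, and the names base_read, fread_read and mismatch are made up for illustration.

```r
library(data.table)

# Hypothetical stand-in for train.csv: var2 holds a value beyond the 32-bit integer range
lines <- c("ID,var1,var2,TARGET", "1,2,9999999999,0", "2,3,10,1")
tmp <- tempfile(fileext = ".csv")
writeLines(lines, tmp)

base_read  <- read.csv(tmp, header = TRUE)
fread_read <- fread(tmp, header = TRUE, data.table = FALSE)

# First class of each column under each reader
classes <- data.frame(
  column      = names(base_read),
  base_class  = vapply(base_read,  function(x) class(x)[1], character(1)),
  fread_class = vapply(fread_read, function(x) class(x)[1], character(1)),
  stringsAsFactors = FALSE
)

# Columns the two readers disagree on
mismatch <- classes[classes$base_class != classes$fread_class, ]
print(mismatch)
```

On the real data the same comparison could be run on traindata1 and traindata2 directly.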
After the checks above I reread the help page and found that
traindata3 <- fread('input/train.csv', header = TRUE, data.table = FALSE, verbose = TRUE, integer64 = 'numeric')
is the correct call.
The cause is the integer64 argument. From the help page:
"integer64" (default) reads columns detected as containing integers larger than 2^31 as type bit64::integer64. Alternatively, "double"|"numeric" reads as base::read.csv does; i.e., possibly with loss of precision and if so silently. Or, "character".
The documentation states this clearly. Noted here for the record.
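The effect of the integer64 argument can be seen on a tiny inline example. This is a sketch under assumptions, not the original data: the three-line CSV and the names default_read and numeric_read are hypothetical, and fread(text = ...) assumes a reasonably recent data.table.

```r
library(data.table)

# Hypothetical input: the 'big' column exceeds the 32-bit integer range
csv_lines <- c("id,big", "1,3000000000", "2,4000000000")

# Default: large integer columns are read as bit64::integer64
default_read <- fread(text = csv_lines, data.table = FALSE)

# integer64 = "numeric": large integer columns are read as double, as read.csv does
numeric_read <- fread(text = csv_lines, data.table = FALSE, integer64 = "numeric")

print(class(default_read$big))
print(class(numeric_read$big))
```

Code that does not know about the integer64 class can misinterpret such a column, which is why reading it as double avoided the ranger error above.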