I. Data preprocessing
1. Data cleaning
(1) Missing-value handling
The data set has no missing values.
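A check along the following lines is one way to confirm that a data frame has no missing values. It is a minimal sketch on a small made-up data frame (`toy` and its values are hypothetical, not the German credit data); the same calls would apply to the imported data.

```r
# Hypothetical toy data frame standing in for the real data set
toy <- data.frame(
  V2 = c(6, 48, NA, 42),
  V5 = c(1169, 5951, 2096, 7882),
  y  = c(1, 2, 1, 1)
)

# Number of missing values in each column
na_per_column <- colSums(is.na(toy))
print(na_per_column)

# TRUE only if no cell in the data frame is missing
no_missing <- !any(is.na(toy))
print(no_missing)
```

If every entry of `na_per_column` is zero, no imputation or row removal is needed.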
(2) Noise removal
(Not yet investigated.)
(3) Outlier removal
?
(4) Handling collinear variables (pairwise correlations)
VIF (variance inflation factor); not yet investigated.
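For reference, the variance inflation factor of a predictor is 1 / (1 - R^2), where R^2 comes from regressing that predictor on all the other predictors. The sketch below computes it in base R on simulated data. `vif_manual` and the simulated columns `x1`, `x2`, `x3` are hypothetical names, and this was not part of the analysis described here.

```r
# Manual VIF: regress each numeric predictor on all the others
vif_manual <- function(df) {
  sapply(names(df), function(v) {
    fit <- lm(reformulate(setdiff(names(df), v), response = v), data = df)
    1 / (1 - summary(fit)$r.squared)
  })
}

set.seed(1)                       # arbitrary seed, for a repeatable illustration
x1 <- rnorm(200)
x2 <- x1 + rnorm(200, sd = 0.1)   # nearly a copy of x1, so highly collinear
x3 <- rnorm(200)                  # independent of the other two
vifs <- vif_manual(data.frame(x1, x2, x3))
print(round(vifs, 1))
```

A common rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic collinearity. Note that this sketch only covers numeric predictors.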
2. Data integration
The data come from a single source with a consistent structure, so no integration is needed.
II. Importing the data
Analysis:
Data source | https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
Predictors, continuous | V2, V5, V8, V11, V13, V16, V18
Predictors, categorical | V1, V3, V4, V6, V7, V9, V10, V12, V14, V15, V17, V19, V20
Response variable y | V21
Variable definitions | https://archive.ics.uci.edu/ml/datasets/Statlog+(German+Credit+Data)
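R code:
rawdata <- read.table("D:/personal/knowledge/dataMining/dataset/german/german.data",
                      header = FALSE,
                      stringsAsFactors = TRUE)  # on R >= 4.0 this is needed for the A-coded columns to load as factors
colnames(rawdata)[21] <- "y"  # rename the response variable
str(rawdata)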
III. Data partitioning
Analysis:
Training data | 600 rows sampled from the full data set
Validation data | the remaining 400 rows
R code:
trainIdx <- sample(nrow(rawdata), round(0.6 * nrow(rawdata)))
traindata <- rawdata[trainIdx, ]
validdata <- rawdata[-trainIdx, ]
nrow(traindata)  # result: 600
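Because `sample()` draws rows at random, the split changes on every run unless a seed is fixed. The sketch below shows one way to make a 60/40 split repeatable and to sanity-check it. The data frame `toy`, its 700/300 class mix, and the seed value are assumptions made for illustration only.

```r
# Hypothetical stand-in with 1000 rows and a two-class response
toy <- data.frame(id = 1:1000, y = rep(c(1, 2), times = c(700, 300)))

set.seed(123)  # arbitrary seed, chosen only for illustration
train_idx <- sample(nrow(toy), round(0.6 * nrow(toy)))
train <- toy[train_idx, ]
valid <- toy[-train_idx, ]

# The two parts should be disjoint and together cover every row
print(c(train = nrow(train), valid = nrow(valid)))
print(length(intersect(train$id, valid$id)))

# Compare the class mix of each part with the full data
print(prop.table(table(train$y)))
print(prop.table(table(valid$y)))
```

If the class proportions in the two parts differ noticeably, sampling within each class separately (a stratified split) is a possible refinement.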
IV. Interactive grouping (discretization)
1. Discretizing continuous variables
(1) Binning by an optimality criterion (based on conditional inference trees)
R code:
# smbinning requires a 0/1 response, so recode y from 1/2 to 1/0 (the value 2 becomes 0)
replace2to0 <- function(x) {
  x[x[, 21] == 2, 21] <- 0
  x
}
updtraindata <- replace2to0(traindata)

# Compute the binning cut-offs (the "smbinning" package must be installed)
library(smbinning)
V2bin <- smbinning(df = updtraindata, y = "y", x = "V2", p = 0.05)
V2bin$ivtable
V2bin$bands
Result:
<= 11, <= 26, <= 72
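The `ivtable` summarizes each bin by its weight of evidence (WoE) and its contribution to the information value (IV). As a rough illustration of what those quantities measure, the sketch below computes them by hand with the usual textbook definitions. The vectors `duration` and `good` are made-up values, and smbinning's exact conventions may differ in detail.

```r
# Made-up example: loan duration in months and a 0/1 outcome (1 = good)
duration <- c(6, 9, 10, 11, 12, 18, 24, 24, 36, 48, 60, 72)
good     <- c(1, 1, 1,  0,  1,  1,  0,  1,  0,  1,  0,  0)

# Same cut points as the result above: (0,11], (11,26], (26,72]
bins <- cut(duration, breaks = c(0, 11, 26, 72))

cnt_good <- as.vector(tapply(good == 1, bins, sum))
cnt_bad  <- as.vector(tapply(good == 0, bins, sum))

dist_good <- cnt_good / sum(cnt_good)  # share of all goods that fall in each bin
dist_bad  <- cnt_bad  / sum(cnt_bad)   # share of all bads that fall in each bin

woe <- log(dist_good / dist_bad)           # weight of evidence per bin
iv  <- sum((dist_good - dist_bad) * woe)   # information value of the variable

print(data.frame(bin = levels(bins), cnt_good, cnt_bad, woe = round(woe, 3)))
print(round(iv, 3))
```

A positive WoE marks a bin with relatively more good cases than the data set as a whole, a negative WoE the opposite. A bin with zero goods or zero bads would make the logarithm infinite, so real data may need bins merged first.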
R code:
# Binning: return 1 where cutoffmin < x <= cutoffmax, otherwise 0
bin <- function(x, cutoffmin, cutoffmax) {
  as.integer(cutoffmin < x & x <= cutoffmax)
}
V2bin1 <- bin(updtraindata$V2, 0, 11)
V2bin2 <- bin(updtraindata$V2, 11, 26)
V2bin3 <- bin(updtraindata$V2, 26, 72)
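An alternative to building one 0/1 column per interval is base R's `cut()`, which assigns every value to its interval in a single step. The sketch below uses made-up durations (`duration` is a hypothetical vector) and swaps in a single factor for the three indicator columns, so it is a variation on the approach here, not a copy of it.

```r
duration <- c(6, 11, 12, 26, 48)  # made-up values

# Intervals are open on the left and closed on the right: (0,11], (11,26], (26,72]
duration_bin <- cut(duration, breaks = c(0, 11, 26, 72))
print(table(duration_bin))

# If separate 0/1 columns are still wanted, they can be derived from the factor
indicators <- sapply(levels(duration_bin),
                     function(l) as.integer(duration_bin == l))
print(indicators)
```

Each row of `indicators` contains exactly one 1, which makes it easy to verify that the intervals cover every value without overlap.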
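That covers V2. Other variables such as V5 and V13 are handled the same way, as follows:
R code:
V5bin <- smbinning(df = updtraindata, y = "y", x = "V5", p = 0.05)
V5bin$ivtable
V5bin$bands
# Lower bound -Inf instead of 250: the comparison is strict, so a value equal to 250 would otherwise fall in neither bin
V5bin1 <- bin(updtraindata$V5, -Inf, 6110)
V5bin2 <- bin(updtraindata$V5, 6110, 15945)

V13bin <- smbinning(df = updtraindata, y = "y", x = "V13", p = 0.05)
V13bin  # the result is "No Bins"
Surprisingly, V13 comes back as "No Bins". I am not sure whether this is because its values are spread too evenly to be split, and I could not find an explanation online. One likely reading is that the conditional inference tree found no statistically significant split for V13. V13 is therefore left unbinned.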
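Since the call returned a plain message for V13 instead of a list, later code can branch on the type of the result before reading the cut points. In the sketch below, `get_bands` is a hypothetical helper and both inputs are mock stand-ins for smbinning results, so the example runs without the package installed.

```r
# Hypothetical helper: return the bands if binning succeeded, otherwise NULL
get_bands <- function(res) {
  if (is.list(res) && !is.null(res$bands)) res$bands else NULL
}

mock_binned  <- list(bands = c(4, 11, 26, 72))  # mock of a successful result (made-up values)
mock_no_bins <- "No Bins"                       # mock of the message seen for V13

print(get_bands(mock_binned))
print(is.null(get_bands(mock_no_bins)))
```

A `NULL` return can then be used as the signal to keep the variable in its original, unbinned form.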
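The remaining variables V8, V11, V16, and V18 are in fact categorical, since each takes only a few integer values. For example:
R code:
summary(updtraindata$V8)
Result:
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max.
  1.000   2.000   3.000   3.042   4.000   4.000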
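Counting distinct values is a quick way to confirm that an integer column behaves like a categorical variable. The sketch below uses made-up values in the range 1 to 4 (`v8` is a hypothetical vector), and treating the levels as ordered is an assumption, not something established here.

```r
v8 <- c(4, 2, 2, 2, 3, 2, 3, 2, 2, 4, 1, 4)  # made-up values between 1 and 4

# Only a handful of distinct values suggests a categorical variable
n_distinct <- length(unique(v8))
print(n_distinct)
print(table(v8))

# Convert to a factor; ordered = TRUE is an assumption about the variable's meaning
v8_factor <- factor(v8, levels = 1:4, ordered = TRUE)
print(levels(v8_factor))
```

Converting such columns to factors makes modelling functions treat each value as a level instead of fitting a single linear slope.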
Merging the variables, R code:
# Append the new V2 and V5 indicator columns
updtraindata <- cbind(updtraindata, V2bin1, V2bin2, V2bin3, V5bin1, V5bin2)
# Convert the indicators to factors
for (col in c("V2bin1", "V2bin2", "V2bin3", "V5bin1", "V5bin2")) {
  updtraindata[[col]] <- as.factor(updtraindata[[col]])
}
# Drop the original V2 and V5
updtraindata$V2 <- NULL
updtraindata$V5 <- NULL
str(updtraindata)
Result:
# Structure of updtraindata
'data.frame': 600 obs. of 24 variables:
 $ V1 : Factor w/ 4 levels "A11","A12","A13",..: 1 4 2 2 3 1 4 4 4 2 ...
 $ V3 : Factor w/ 5 levels "A30","A31","A32",..: 2 5 4 3 5 3 5 5 3 5 ...
 $ V4 : Factor w/ 10 levels "A40","A41","A410",..: 1 5 2 5 1 5 4 1 6 2 ...
 $ V6 : Factor w/ 5 levels "A61","A62","A63",..: 5 1 5 1 5 1 1 1 1 2 ...
 $ V7 : Factor w/ 5 levels "A71","A72","A73",..: 3 5 4 3 4 4 1 5 2 1 ...
 $ V8 : int 4 4 3 2 4 1 2 4 1 3 ...
 $ V9 : Factor w/ 4 levels "A91","A92","A93",..: 3 3 3 2 3 4 3 3 2 3 ...