目录
导入数据
library(tidyverse)
library(lubridate)
df = read.csv("data/DSynth_Output_100c_3m_v3.csv")
数据准备
The dataset contains 12043 transactions for 100 customers who have one bank account each. <dataframe描述> The transactional period is from 01/08/2018 - 31/10/2018 (92 days duration). The data entries are unique and have consistent formats for analysis. <数据唯一性> For each record/row, information is complete for majority of columns. <数据完整性> Some columns contain missing data (blank or NA cells), which is likely due to the nature of transaction. (i.e. merchants are not involved for InterBank transfers or Salary payments) <对空值的解释> It is also noticed that there is only 91 unique dates in the dataset, suggesting the transaction records for one day are missing (turned out to be 2018-08-16). <对于缺失值的描述> The range of each feature should also be examined which shows that there is one customer that resides outside Australia.
# examine the summary of the dataset
summary(df)
str(df) # structure,紧凑的显示对象内部结构, head()显示不了太长的value
改变数据格式,检查唯一性
# change the format of date column
df$date<- as.Date(df$date,format = "%d/%m/%Y")
# check unique
unique(df$date)
# confirm the one -to -one link of account_id and customer_id
df %>% select(account,customer_id) %>%
unique() %>%
nrow()
分析目标:时间,地点
时间完整性检查
# Create a sequence from min of the date to max of the date named DateRange
DateRange <- seq(min(df$date), max(df$date), by = 1)
# check the values in DataRange but not in date to fine missing values
> DateRange[!DateRange %in% df$date] # %in% means match
[1] "2018-08-16"
# 2018-08-16 transactions are missing
从交易时间中分出星期和小时
# derive weekday and hour data of each transaction
df$extraction = as.character(df$extraction)
df$hour = hour(as.POSIXct(substr(df$extraction,12,19),format="%H:%M:%S"))
df$weekday = weekdays(df$date)
从location中分出经度和纬度
# split customer & merchant lat_long into individual columns for analysis
dfloc = df[,c("long_lat","merchant_long_lat")]
dfloc<- dfloc %>% separate("long_lat", c("c_long", "c_lat"),sep=' ')
dfloc<- dfloc %>% separate("merchant_long_lat", c("m_long", "m_lat"),sep=' ')
# Convert format to numeric
dfloc<- data.frame(sapply(dfloc, as.numeric))
# conbine the two dataframes
# cbind是根据列进行合并,合并的前提是所有数据行数相等。
df <- cbind(df,dfloc)
# sapply(x,FUN)中,第一个参数x是需要处理的数据,FUN是处理数据x的Funtion;
检查居住在海外的
# check the range of customer location
# filtering out transactions for those who don't reside in Australia
# Reference: http://www.ga.gov.au/scientific-topics/national-location-information/dimensions/con
tinental-extremities
df_temp <- df %>%
filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))
此处有报错: dplyr包filter函数 Error in filter(., ) : 找不到对象'c_long'
原因:引用的多个包中都含有某个函数
解决:我们只需要在使用filter函数时在前面加上包的名字和两个冒号,来告诉R我们具体使用的是哪个包的函数即可,如下
df_temp <- df %>%
dplyr::filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))
筛选出的结果如何计数?(dataframe, vectors(list) and factors)
For dataframe: nrow()
For vectors and factors: length()
检查缺失值(na&blank&..)
apply(df, 2, function(x) sum(is.na(x)| x == '')) # 2 means check by columns