ANZ数据分析笔记

最新推荐文章于 2021-12-08 17:39:07 发布

gaocuo111

最新推荐文章于 2021-12-08 17:39:07 发布

阅读量351

点赞数

分类专栏： R

本文链接：https://blog.youkuaiyun.com/gaocuo111/article/details/114379146

版权

R 专栏收录该内容

4 篇文章

订阅专栏

导入数据

library(tidyverse)
library(lubridate)

df = read.csv("data/DSynth_Output_100c_3m_v3.csv")

数据准备

The dataset contains 12043 transactions for 100 customers who have one bank account each. <dataframe描述> The transactional period is from 01/08/2018 - 31/10/2018 (92 days duration). The data entries are unique and have consistent formats for analysis. <数据唯一性> For each record/row, information is complete for majority of columns. <数据完整性> Some columns contain missing data (blank or NA cells), which is likely due to the nature of transaction. (i.e. merchants are not involved for InterBank transfers or Salary payments) <对空值的解释> It is also noticed that there is only 91 unique dates in the dataset, suggesting the transaction records for one day are missing (turned out to be 2018-08-16). <对于缺失值的描述> The range of each feature should also be examined which shows that there is one customer that resides outside Australia.

# examine the summary of the dataset
summary(df)
str(df) # structure,紧凑的显示对象内部结构, head()显示不了太长的value

改变数据格式，检查唯一性

# change the format of date column
df$date<- as.Date(df$date,format = "%d/%m/%Y")

# check unique
unique(df$date)

# confirm the one -to -one link of account_id and customer_id
df %>% select(account,customer_id) %>%
unique() %>%
nrow()

分析目标：时间，地点

时间完整性检查

# Create a sequence from min of the date to max of the date named DateRange
DateRange <- seq(min(df$date), max(df$date), by = 1)

# check the values in DataRange but not in date to fine missing values
> DateRange[!DateRange %in% df$date] # %in% means match
[1] "2018-08-16"
# 2018-08-16 transactions are missing

从交易时间中分出星期和小时

# derive weekday and hour data of each transaction
df$extraction = as.character(df$extraction)
df$hour = hour(as.POSIXct(substr(df$extraction,12,19),format="%H:%M:%S"))
df$weekday = weekdays(df$date)

从location中分出经度和纬度

# split customer & merchant lat_long into individual columns for analysis
dfloc = df[,c("long_lat","merchant_long_lat")]
dfloc<- dfloc %>% separate("long_lat", c("c_long", "c_lat"),sep=' ')
dfloc<- dfloc %>% separate("merchant_long_lat", c("m_long", "m_lat"),sep=' ')

# Convert format to numeric
dfloc<- data.frame(sapply(dfloc, as.numeric))

# conbine the two dataframes
# cbind是根据列进行合并，合并的前提是所有数据行数相等。
df <- cbind(df,dfloc)

# sapply(x,FUN)中，第一个参数x是需要处理的数据，FUN是处理数据x的Funtion；

检查居住在海外的

# check the range of customer location
# filtering out transactions for those who don't reside in Australia
# Reference: http://www.ga.gov.au/scientific-topics/national-location-information/dimensions/con
tinental-extremities
df_temp <- df %>%
filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))

此处有报错： dplyr包filter函数 Error in filter(., ) : 找不到对象'c_long'
原因：引用的多个包中都含有某个函数
解决：我们只需要在使用filter函数时在前面加上包的名字和两个冒号，来告诉R我们具体使用的是哪个包的函数即可，如下

df_temp <- df %>%
  dplyr::filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))

筛选出的结果如何计数？(dataframe, vectors(list) and factors)

For dataframe: nrow()

For vectors and factors: length()

检查缺失值（na&blank&..)

apply(df, 2, function(x) sum(is.na(x)| x == '')) # 2 means check by columns