ANZ数据分析笔记

目录

导入数据

数据准备

改变数据格式,检查唯一性

分析目标:时间,地点

时间完整性检查

从交易时间中分出星期和小时

从location中分出经-度和纬度


 

导入数据

library(tidyverse)
library(lubridate)

df = read.csv("data/DSynth_Output_100c_3m_v3.csv")

数据准备

The dataset contains 12043 transactions for 100 customers who have one bank account each. <dataframe描述> The transactional period is from 01/08/2018 - 31/10/2018 (92 days duration). The data entries are unique and have consistent formats for analysis. <数据唯一性> For each record/row, information is complete for majority of columns. <数据完整性> Some columns contain missing data (blank or NA cells), which is likely due to the nature of transaction. (i.e. merchants are not involved for InterBank transfers or Salary payments) <对空值的解释> It is also noticed that there is only 91 unique dates in the dataset, suggesting the transaction records for one day are missing (turned out to be 2018-08-16). <对于缺失值的描述> The range of each feature should also be examined which shows that there is one customer that resides outside Australia.


# examine the summary of the dataset
summary(df)
str(df)    # structure,紧凑的显示对象内部结构, head()显示不了太长的value

改变数据格式,检查唯一性

# change the format of date column
df$date<- as.Date(df$date,format = "%d/%m/%Y")

# check unique
unique(df$date)

# confirm the one -to -one link of account_id and customer_id
df %>% select(account,customer_id) %>%
  unique() %>%
  nrow()
 

分析目标:时间,地点

时间完整性检查

# Create a sequence from min of the date to max of the date named DateRange
DateRange <- seq(min(df$date), max(df$date), by = 1)

# check the values in DataRange but not in date to fine missing values
> DateRange[!DateRange %in% df$date]    # %in% means match
[1] "2018-08-16"
# 2018-08-16 transactions are missing

从交易时间中分出星期和小时

# derive weekday and hour data of each transaction
df$extraction = as.character(df$extraction)
df$hour = hour(as.POSIXct(substr(df$extraction,12,19),format="%H:%M:%S"))
df$weekday = weekdays(df$date)

从location中分出经度和纬度

# split customer & merchant lat_long into individual columns for analysis
dfloc = df[,c("long_lat","merchant_long_lat")]
dfloc<- dfloc %>% separate("long_lat", c("c_long", "c_lat"),sep=' ')
dfloc<- dfloc %>% separate("merchant_long_lat", c("m_long", "m_lat"),sep=' ')


# Convert format to numeric
dfloc<- data.frame(sapply(dfloc, as.numeric))

# conbine the two dataframes
# cbind是根据列进行合并,合并的前提是所有数据行数相等。
df <- cbind(df,dfloc)

# sapply(x,FUN)中,第一个参数x是需要处理的数据,FUN是处理数据x的Funtion;

检查居住在海外的

# check the range of customer location
# filtering out transactions for those who don't reside in Australia
# Reference: http://www.ga.gov.au/scientific-topics/national-location-information/dimensions/con
tinental-extremities
df_temp <- df %>%
  filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))

此处有报错: dplyr包filter函数 Error in filter(., ) : 找不到对象'c_long'
原因:引用的多个包中都含有某个函数
解决:我们只需要在使用filter函数时在前面加上包的名字和两个冒号,来告诉R我们具体使用的是哪个包的函数即可,如下
 

df_temp <- df %>%
  dplyr::filter (!(c_long >113 & c_long <154 & c_lat > (-44) & c_lat < (-10)))
length(unique(df_temp$customer_id))

筛选出的结果如何计数?(dataframe, vectors(list) and factors)

For dataframe: nrow()

For vectors and factors: length()

检查缺失值(na&blank&..)

apply(df, 2, function(x) sum(is.na(x)| x == '')) # 2 means check by columns

评论 1
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值