> getwd()
[1] "C:/Users/Administrator/Documents"
> setwd('C:/Users/Administrator/Downloads')
> yo <- read.csv('yogurt.csv')
> str('yo')
chr "yo"
> str(yo)
'data.frame': 2380 obs. of 9 variables:
$ obs : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : int 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 2100081 ...
$ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
$ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
$ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
$ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
$ plain : int 0 0 0 0 0 0 0 0 0 0 ...
$ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
$ price : num 59 59 65 65 49 ...
Here id needs to be converted to a factor, so that each household ID is treated as a category rather than a number:
> yo$id <- factor(yo$id)
> str(yo)
'data.frame': 2380 obs. of 9 variables:
$ obs : int 1 2 3 4 5 6 7 8 9 10 ...
$ id : Factor w/ 332 levels "2100081","2100370",..: 1 1 1 1 1 1 1 1 1 1 ...
$ time : int 9678 9697 9825 9999 10015 10029 10036 10042 10083 10091 ...
$ strawberry : int 0 0 0 0 1 1 0 0 0 0 ...
$ blueberry : int 0 0 0 0 0 0 0 0 0 0 ...
$ pina.colada: int 0 0 0 0 1 2 0 0 0 0 ...
$ plain : int 0 0 0 0 0 0 0 0 0 0 ...
$ mixed.berry: int 1 1 1 1 1 1 1 1 1 1 ...
$ price : num 59 59 65 65 49 ...
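The effect of the conversion can be checked on a small vector. This is a minimal sketch, not part of the original session: the two ID values come from the output above, but the `ids` vector and its repetition pattern are made up for illustration.

```r
# Toy vector of household IDs; the repetition pattern is invented for illustration
ids <- c(2100081, 2100081, 2100370, 2100081, 2100370)

id_factor <- factor(ids)

levels(id_factor)   # the distinct IDs, stored as character labels
nlevels(id_factor)  # number of distinct households in the toy vector
table(id_factor)    # number of rows (purchase occasions) per household
```

On the real data, `nlevels(yo$id)` would give the household count that `str(yo)` reports as the number of levels.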
library(ggplot2)
ggplot(aes(x=price),data = yo)+
geom_histogram()
qplot(data=yo,x=price,fill=I('#F79420'))
qplot(data=yo,x=price,fill=I('#F79420'),binwidth=10)
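Changing the bin width alters the plot considerably. With binwidth=10 the histogram hides the many gaps between the observed prices, so it is a poor choice of plot for studying price sensitivity.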
> summary(yo)
obs id time strawberry blueberry pina.colada plain
Min. : 1.0 2132290: 74 Min. : 9662 Min. : 0.0000 Min. : 0.0000 Min. : 0.0000 Min. :0.0000
1st Qu.: 696.5 2130583: 59 1st Qu.: 9843 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.: 0.0000 1st Qu.:0.0000
Median :1369.5 2124073: 50 Median :10045 Median : 0.0000 Median : 0.0000 Median : 0.0000 Median :0.0000
Mean :1367.8 2149500: 50 Mean :10050 Mean : 0.6492 Mean : 0.3571 Mean : 0.3584 Mean :0.2176
3rd Qu.:2044.2 2101790: 47 3rd Qu.:10255 3rd Qu.: 1.0000 3rd Qu.: 0.0000 3rd Qu.: 0.0000 3rd Qu.:0.0000
Max. :2743.0 2129528: 39 Max. :10459 Max. :11.0000 Max. :12.0000 Max. :10.0000 Max. :6.0000
(Other):2061
mixed.berry price
Min. :0.0000 Min. :20.00
1st Qu.:0.0000 1st Qu.:50.00
Median :0.0000 Median :65.04
Mean :0.3887 Mean :59.25
3rd Qu.:0.0000 3rd Qu.:68.96
Max. :8.0000 Max. :68.96
> unique(yo$price)
[1] 58.96 65.04 48.96 68.96 39.04 24.96 50.00 45.04 33.04 44.00 33.36 55.04 62.00 20.00 49.60 49.52 33.28 63.04 33.20
[20] 33.52
> length(unique(yo$price))
[1] 20
> table(yo$price)
20 24.96 33.04 33.2 33.28 33.36 33.52 39.04 44 45.04 48.96 49.52 49.6 50 55.04 58.96 62 63.04 65.04 68.96
2 11 54 1 1 22 1 234 21 11 81 1 1 205 6 303 15 2 799 609
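The table makes the gaps concrete: only 20 distinct prices occur. As a rough sketch (not part of the original session), the price vector can be rebuilt from these counts and the number of empty histogram bins compared at bin widths 1 and 10. `empty_bins` is a hypothetical helper, and base `hist()` is used here instead of ggplot2 only because its bin counts are easy to inspect; ggplot2 may align its bins differently.

```r
# Prices and counts copied from the table(yo$price) output above
prices <- c(20, 24.96, 33.04, 33.2, 33.28, 33.36, 33.52, 39.04, 44, 45.04,
            48.96, 49.52, 49.6, 50, 55.04, 58.96, 62, 63.04, 65.04, 68.96)
counts <- c(2, 11, 54, 1, 1, 22, 1, 234, 21, 11,
            81, 1, 1, 205, 6, 303, 15, 2, 799, 609)
price <- rep(prices, times = counts)

# Hypothetical helper: number of empty bins for a given bin width
empty_bins <- function(x, width) {
  lo <- floor(min(x) / width) * width
  hi <- ceiling(max(x) / width) * width
  h <- hist(x, breaks = seq(lo, hi, by = width), right = FALSE, plot = FALSE)
  sum(h$counts == 0)
}

length(price)          # should equal the 2380 rows reported by str(yo)
empty_bins(price, 1)   # narrow bins: gaps between prices show up as empty bins
empty_bins(price, 10)  # wide bins: the gaps are absorbed into neighbouring bins
```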
Add a new variable (the two statements below are equivalent; either one is enough):
yo$all.purchases <- yo$strawberry+yo$blueberry+yo$pina.colada+yo$plain+yo$mixed.berry
yo <- transform(yo,all.purchases=strawberry+blueberry+pina.colada+plain+mixed.berry)
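To see that the two forms agree, here is a small sketch on a toy data frame. The three rows mirror observations 4 to 6 of the `str(yo)` output; the names `toy`, `flavors`, `by_dollar`, `by_transform`, and `by_rowsums` are introduced only for this illustration, and `rowSums()` is shown as a third option that the original notes do not use.

```r
# Toy data: flavor counts for observations 4-6 as shown by str(yo)
toy <- data.frame(
  strawberry  = c(0, 1, 1),
  blueberry   = c(0, 0, 0),
  pina.colada = c(0, 1, 2),
  plain       = c(0, 0, 0),
  mixed.berry = c(1, 1, 1)
)
flavors <- c("strawberry", "blueberry", "pina.colada", "plain", "mixed.berry")

# 1. Explicit sum with $
by_dollar <- toy$strawberry + toy$blueberry + toy$pina.colada +
  toy$plain + toy$mixed.berry

# 2. transform() evaluates the expression inside the data frame
by_transform <- transform(
  toy,
  all.purchases = strawberry + blueberry + pina.colada + plain + mixed.berry
)$all.purchases

# 3. rowSums() over the flavor columns (unname drops the row names it attaches)
by_rowsums <- unname(rowSums(toy[, flavors]))

all.equal(by_dollar, by_transform)
all.equal(by_dollar, by_rowsums)
```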
ggplot(aes(x=time,y=price),data=yo)+
geom_point()
> set.seed(4230)
> sample.ids <- sample(levels(yo$id),16)
> sample.ids
[1] "2107953" "2123463" "2167320" "2127605" "2124750" "2133066" "2134676" "2141341" "2107706" "2151829" "2119693"
[12] "2122705" "2115006" "2143271" "2101980" "2101758"
> ggplot(aes(x=time,y=price),data=subset(yo,id %in% sample.ids))+
+ facet_wrap(~id)+
+ geom_line()+
+ geom_point(aes(size=all.purchases),pch=1)
Note: x %in% y returns a logical vector of the same length as x. For each element of x, it indicates whether that element also appears in y.
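This lets us subset the data to get every purchase occasion for the households in the sample. We then plot price against time as a scatterplot, faceted by the sampled IDs.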
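The behaviour of %in% and the resulting subset can be tried on a toy data frame. This is only a sketch: the IDs appear in the output above, but the rows of `toy` and the shortened `sample_ids` vector are made up for illustration.

```r
# Toy purchase log; the rows are invented, the IDs appear in the output above
toy <- data.frame(
  id    = factor(c("2107953", "2100081", "2123463", "2107953")),
  price = c(65.04, 58.96, 68.96, 50.00)
)
sample_ids <- c("2107953", "2123463")

# One TRUE/FALSE per row of toy: is this row's id among the sampled IDs?
keep <- toy$id %in% sample_ids
keep

# Keep only the rows of sampled households, as in subset(yo, id %in% sample.ids)
subset(toy, id %in% sample_ids)
```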
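When plotting points, use the pch or shape argument to specify the symbol. Scroll down to the "Plotting Points" section of QuickR's Graphical Parameters page.
QuickR Graphical Parameters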