课程讲解中的Lesson6.rmd
Lesson 6
========================================================
### 1.Welcome
Notes:
***
### 3.Scatterplot Review
Notes:
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.
# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is
# reasonable or a rip-off.
# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).
# Limit the x-axis and y-axis to omit the top 1% of values.
# ENTER YOUR CODE BELOW THIS LINE
# ================================================================
```{r Scatterplot Review}
library(ggplot2)
ggplot(aes(x=carat,y=price),data=diamondsinfo) +
geom_point()
```
***
### 4.Price and Carat Relationship
Response:
根据上图可以得到:正相关的关系,随着克拉重量则增加,相应的价格也会上涨。
***
### 5.Frances Gerety
Notes:
到目前为止,我们仅仅考虑了价格和克拉的双变关系,这个数据集来自于2008年收的5万多颗砖石。
#### A diamonds is forever
***
### 6.The Rise of Diamonds
Notes:
***
### 7.ggpairs Function
Notes:
该函数的params参数用于改变在图形矩阵中绘制的点的形状,以便更容易查看这些点。,以成对变量的形式绘制每两个变量之间的关系。
```{r ggpairs Function}
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')
# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
# library(scales)
library(memisc)
# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
lower = list(continuous = wrap('points',shape=I('.'))),
upper = list(combo = wrap('box',outlier.shape=I('.'))))
```
What are some things you notice in the ggpairs output?
Response:
***
### 8.The Demand of Diamonds
Notes:
# Create two histograms of the price variable
# and place them side by side on one output image.
# We’ve put some code below to get you started.
# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.
# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.
# You can self-assess your work with the plots
# in the solution video.
# ALTER THE CODE BELOW THIS LINE
# ==============================================
library(gridExtra)
plot1 <- qplot() +
ggtitle('Price')
plot2 <- qplot() +
ggtitle('Price (log10)')
grid.arrange()
```{r The Demand of Diamonds}
library(gridExtra)
price1 <- ggplot(aes(x=price),data=diamondsinfo) +
geom_histogram(binwidth=30,color='black',fill='#099DD9') +
ggtitle('Price')
price2 <- ggplot(aes(x=price),data=transform(diamondsinfo,price=log10(price))) +
geom_histogram(binwidth=.001,color='orange',fill='#099DD9') +
ggtitle('price(log10)')
grid.arrange(price1, price2, ncol=1)
```
***
### 9.Connecting Demand and Price Distributions
Notes:
原始的价格曲线被严重扭曲,经过对数变换之后的接近于钟形的正态分布。对数变换之后的图像会呈现双峰形状,更符合穷人和富人买砖石的现状。
***
### 10.Scatterplot Transformation
Notes:
在对数坐标轴中,价格在克拉大小和价格的高端离散度较小
```{r Scatterplot Transformation}
ggplot(aes(x=carat,y=price),data=transform(diamondsinfo,price=log10(price))) +
geom_point()
ggtitle('Price(log10) by carat')
```
### Create a new function to transform the carat variable
对价格进行立方根的转换
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
inverse = function(x) x^3)
cuberoot = function(x){
x^(1/3)
}
ggplot(aes(cuberoot(carat), log10(price)), data = diamondsinfo) +
geom_point() +
ggtitle('Price (log10) by Cube-Root of Carat')
```
#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamondsinfo) +
geom_point() +
scale_x_continuous(trans = cuberoot(diamondsinfo$price), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### 11.Overplotting Revisited 过度绘制
可以通过计算每个数值出现的次数,来判断重复值出现的严重与否,head()仅仅显示前面的6个结果,而在pandas中head()函数仅仅展示前5个结果。
## 必须考虑重复值的问题,否则会掩盖我们正常的稀疏度和稠密度的衡量。
解决办法:通过添加抖动和透明度来让重复点变小一些。
```{r Sort and Head Tables}
head(sort(table(diamondsinfo$carat),decreasing = T))
head(sort(table(diamondsinfo$price),decreasing = T))
```
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.
# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.
# ALTER THE CODE BELOW THIS LINE
# =======================================================================
ggplot(aes(carat, price), data = diamonds) +
geom_point() +
scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
breaks = c(0.2, 0.5, 1, 2, 3)) +
scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
breaks = c(350, 1000, 5000, 10000, 15000)) +
ggtitle('Price (log10) by Cube-Root of Carat')
解决办法:通过添加抖动和透明度来让重复点变小一些。
```{r Overplotting Revisited}
cuberoot = function(x){
x^(1/3)
}
ggplot(aes(cuberoot(carat), log10(price)), data = diamondsinfo) +
geom_point(alpha=0.5, size=0.75, position= 'jitter') + # alpha是透明度参数;jitter是抖动参数
ggtitle('Price (log10) by Cube-Root of Carat')
```
***
### 12.Other Qualitative Factors 定型因子
Notes:
在经过了转换之后,克拉重量和价值之间呈现几乎线性的关系,纯净度也是影响价格的一个重要因素。
***
### 13.Price vs. Carat and Clarity
Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)
ggplot(aes(x = cuberoot(carat), y = log10(price), colour=clarity), data = diamondsinfo) + # 仅仅colour=clarity即可实现变色
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'Clarity', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```
***
### 14.Clarity and Price(克拉重量、纯净度与价格的关系)
Response:
从上面的可视化图可以看出:克拉重量、纯净度与价格之间的关系:克拉重量越大,其相应的价格也会更高;对于同一克拉重量的砖石,纯净度越高,相应的砖石价格也会越高。
***
### Price vs. Carat and Cut(克拉重量、切工与价格之间的关系)
Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = cuberoot(carat), y = log10(price), colour=cut), data = diamondsinfo) + # 仅仅colour=clarity即可实现变色
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'cut', reverse = T,
override.aes = list(alpha = 1, size = 2))) +
ggtitle('Price (log10) by Cube-Root of Carat and cut')
```
***
### 16.Cut and Price
Response:
从上图可以看出:砖石的价格与切割没有很大的关系,因为它们几乎都是理想状态下的切割。
***
### 17. vs. Carat and Color(克拉重量、颜色与价格之间的关系)
Alter the code below.
```{r Price vs. Carat and Color}
ggplot(aes(x = cuberoot(carat), y = log10(price), colour=color), data = diamondsinfo) + # 仅仅colour=clarity即可实现变色
geom_point(alpha = 0.5, size = 1, position = 'jitter') +
scale_color_brewer(type = 'div',
guide = guide_legend(title = 'color', reverse = FALSE, # 与前面相比较,去掉了图形的反参数
override.aes = list(alpha = 1, size = 2))) +
ggtitle('Price (log10) by Cube-Root of Carat and color')
```
***
### 18.Color and Price
Response:
根据上图可以得到:颜色与价格之间存在着一定的关系。
***
总结:上面整体而言,考虑了如下三对的三变量之间的关系:
克拉重量、纯度与价格之间的关系
克拉重量、切工与价格之间的关系
克拉重量、颜色与价格之间的关系
### 19.Linear Models in R
Notes:
克拉重量的立方根与价格的对数之间存在线性关系。
Response:
***
### 20.Building the Linear Model 为价格建立线性模型
Notes:
```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```
Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with out exploration earlier.
***
### 21.Model Problems
Video Notes:
suppose our model is:
Ln(price) = 0.415 + 9.144 * carat*(1/3) - 1.093*carat + (A*cut + B*color + C*clarity) + epslin
ANd the problem is :()2008-2014
1) inflation
2)2008 global recession
3)diamond market in china weating up
4) uneven price increase across different carat weight.
Research:
(Take some time to come up with 2-4 problems for the model)
(You should 10-20 min on this)
Response:
***
### 22.A Bigger, Better Data Set
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.
# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).
# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!
# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.
# This exercise WILL BE automatically graded.
# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)
# Your task is to write the code to create the models.
# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
#===========================================================================================
diamondsBigSample <- read.csv('diamondsBigSample.csv')
# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
#===========================================================================================
# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
#===========================================================================================
suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
models <- mtable(m1, m2, m3, m4, m5)
```{r A Bigger, Better Data Set}
install.package('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')
diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
```
The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data
## Building a Model Using the Big Diamonds Data Set
Notes:
this code need much times
```{r Building a Model Using the Big Diamonds Data Set}
diamondsbig$logprice = log(diamondsbig$price) # 砖石价格相对数
m1 <- lm(I(logprice) ~ I(carat^(1/3)),
data = diamondsbig[diamondbig$price < 10000 &
diamondbig$cert == 'GIA',]) # 筛选价格低于10000美元且认证机构是GIA
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```
***
## 23.Predictions
Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601
```{r}
#Be sure you鈥檝e loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
interval="prediction", level = .95)
```
Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.
***
```{r additional try}
dat = data.frame(m4$model,m4$residuals)
with(dat,sd(m4.residuals))
with(subset(dat,carat > .9 & carat<1.1),
sd(m4.residuals))
dat$resid <- as.numeric(data$m4.residuals)
ggplot(aes(y=resid,x=round(carat,2)),data=dat) +
geom_line(stat = 'summary',fun.y=sd)
```
## 24.Final Thoughts
Notes:
***
Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!
`