数据科学之第三章中lesson6的砖石价格预测 (R语言)

课程讲解中的Lesson6.rmd

Lesson 6
========================================================

### 1.Welcome
Notes:

***

### 3.Scatterplot Review

Notes:
# Let's start by examining two variables in the data set.
# The scatterplot is a powerful tool to help you understand
# the relationship between two continuous variables.

# We can quickly see if the relationship is linear or not.
# In this case, we can use a variety of diamond
# characteristics to help us figure out whether
# the price advertised for any given diamond is 
# reasonable or a rip-off.

# Let's consider the price of a diamond and it's carat weight.
# Create a scatterplot of price (y) vs carat weight (x).

# Limit the x-axis and y-axis to omit the top 1% of values.

# ENTER YOUR CODE BELOW THIS LINE
# ================================================================

```{r Scatterplot Review}
library(ggplot2)
ggplot(aes(x=carat,y=price),data=diamondsinfo) + 
  geom_point()    
```

***

### 4.Price and Carat Relationship
Response:
 根据上图可以得到:正相关的关系,随着克拉重量则增加,相应的价格也会上涨。
***

### 5.Frances Gerety
Notes:
到目前为止,我们仅仅考虑了价格和克拉的双变关系,这个数据集来自于2008年收的5万多颗砖石。

#### A diamonds is forever


***

### 6.The Rise of Diamonds
Notes:

***

### 7.ggpairs Function
Notes:
该函数的params参数用于改变在图形矩阵中绘制的点的形状,以便更容易查看这些点。,以成对变量的形式绘制每两个变量之间的关系。

```{r ggpairs Function}
# install these if necessary
install.packages('GGally')
install.packages('scales')
install.packages('memisc')
install.packages('lattice')
install.packages('MASS')
install.packages('car')
install.packages('reshape')
install.packages('plyr')

# load the ggplot graphics package and the others
library(ggplot2)
library(GGally)
# library(scales)
library(memisc)

# sample 10,000 diamonds from the data set
set.seed(20022012)
diamond_samp <- diamonds[sample(1:length(diamonds$price), 10000), ]
ggpairs(diamond_samp,
        lower = list(continuous = wrap('points',shape=I('.'))),
        upper = list(combo = wrap('box',outlier.shape=I('.'))))

```

What are some things you notice in the ggpairs output?
Response:

***

### 8.The Demand of Diamonds
Notes:
# Create two histograms of the price variable
# and place them side by side on one output image.

# We’ve put some code below to get you started.

# The first plot should be a histogram of price
# and the second plot should transform
# the price variable using log10.

# Set appropriate bin widths for each plot.
# ggtitle() will add a title to each histogram.

# You can self-assess your work with the plots
# in the solution video.

# ALTER THE CODE BELOW THIS LINE
# ==============================================

library(gridExtra)

plot1 <- qplot() + 
  ggtitle('Price')

plot2 <- qplot() +
  ggtitle('Price (log10)')

grid.arrange()

```{r The Demand of Diamonds}
library(gridExtra)

price1 <- ggplot(aes(x=price),data=diamondsinfo) + 
  geom_histogram(binwidth=30,color='black',fill='#099DD9') + 
  ggtitle('Price')

price2 <- ggplot(aes(x=price),data=transform(diamondsinfo,price=log10(price))) + 
  geom_histogram(binwidth=.001,color='orange',fill='#099DD9') +
  ggtitle('price(log10)')
grid.arrange(price1, price2, ncol=1)
  
```

***

### 9.Connecting Demand and Price Distributions
Notes:
原始的价格曲线被严重扭曲,经过对数变换之后的接近于钟形的正态分布。对数变换之后的图像会呈现双峰形状,更符合穷人和富人买砖石的现状。
***

### 10.Scatterplot Transformation
Notes:
在对数坐标轴中,价格在克拉大小和价格的高端离散度较小
```{r Scatterplot Transformation}
ggplot(aes(x=carat,y=price),data=transform(diamondsinfo,price=log10(price))) + 
  geom_point()
  ggtitle('Price(log10) by carat')
```


### Create a new function to transform the carat variable
对价格进行立方根的转换
```{r cuberoot transformation}
cuberoot_trans = function() trans_new('cuberoot', transform = function(x) x^(1/3),
                                      inverse = function(x) x^3)
cuberoot = function(x){
	x^(1/3)
}

ggplot(aes(cuberoot(carat), log10(price)), data = diamondsinfo) + 
  geom_point() + 
  ggtitle('Price (log10) by Cube-Root of Carat')
```

#### Use the cuberoot_trans function
```{r Use cuberoot_trans}
ggplot(aes(carat, price), data = diamondsinfo) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot(diamondsinfo$price), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
```

***

### 11.Overplotting Revisited 过度绘制
可以通过计算每个数值出现的次数,来判断重复值出现的严重与否,head()仅仅显示前面的6个结果,而在pandas中head()函数仅仅展示前5个结果。
## 必须考虑重复值的问题,否则会掩盖我们正常的稀疏度和稠密度的衡量。
解决办法:通过添加抖动和透明度来让重复点变小一些。
```{r Sort and Head Tables}
head(sort(table(diamondsinfo$carat),decreasing = T))
head(sort(table(diamondsinfo$price),decreasing = T))

```
# Add a layer to adjust the features of the
# scatterplot. Set the transparency to one half,
# the size to three-fourths, and jitter the points.

# If you need hints, see the Instructor Notes.
# There are three hints so scroll down slowly if
# you don’t want all the hints at once.

# ALTER THE CODE BELOW THIS LINE
# =======================================================================

ggplot(aes(carat, price), data = diamonds) + 
  geom_point() + 
  scale_x_continuous(trans = cuberoot_trans(), limits = c(0.2, 3),
                     breaks = c(0.2, 0.5, 1, 2, 3)) + 
  scale_y_continuous(trans = log10_trans(), limits = c(350, 15000),
                     breaks = c(350, 1000, 5000, 10000, 15000)) +
  ggtitle('Price (log10) by Cube-Root of Carat')
  
 解决办法:通过添加抖动和透明度来让重复点变小一些。 

```{r Overplotting Revisited}
cuberoot = function(x){
	x^(1/3)
}

ggplot(aes(cuberoot(carat), log10(price)), data = diamondsinfo) + 
  geom_point(alpha=0.5, size=0.75, position= 'jitter') +     # alpha是透明度参数;jitter是抖动参数
  ggtitle('Price (log10) by Cube-Root of Carat')


```

***

### 12.Other Qualitative Factors 定型因子
Notes:
在经过了转换之后,克拉重量和价值之间呈现几乎线性的关系,纯净度也是影响价格的一个重要因素。
***

### 13.Price vs. Carat and Clarity

Alter the code below.
```{r Price vs. Carat and Clarity}
# install and load the RColorBrewer package
install.packages('RColorBrewer')
library(RColorBrewer)

ggplot(aes(x = cuberoot(carat), y = log10(price), colour=clarity), data = diamondsinfo) +  # 仅仅colour=clarity即可实现变色
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'Clarity', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  ggtitle('Price (log10) by Cube-Root of Carat and Clarity')
```

***

### 14.Clarity and Price(克拉重量、纯净度与价格的关系)
Response:
从上面的可视化图可以看出:克拉重量、纯净度与价格之间的关系:克拉重量越大,其相应的价格也会更高;对于同一克拉重量的砖石,纯净度越高,相应的砖石价格也会越高。
***

### Price vs. Carat and Cut(克拉重量、切工与价格之间的关系)

Alter the code below.
```{r Price vs. Carat and Cut}
ggplot(aes(x = cuberoot(carat), y = log10(price), colour=cut), data = diamondsinfo) +  # 仅仅colour=clarity即可实现变色
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'cut', reverse = T,
    override.aes = list(alpha = 1, size = 2))) +  
  ggtitle('Price (log10) by Cube-Root of Carat and cut')
```

***

### 16.Cut and Price
Response:
从上图可以看出:砖石的价格与切割没有很大的关系,因为它们几乎都是理想状态下的切割。
***

### 17. vs. Carat and Color(克拉重量、颜色与价格之间的关系)

Alter the code below.
```{r Price vs. Carat and Color}

ggplot(aes(x = cuberoot(carat), y = log10(price), colour=color), data = diamondsinfo) +  # 仅仅colour=clarity即可实现变色
  geom_point(alpha = 0.5, size = 1, position = 'jitter') +
  scale_color_brewer(type = 'div',
    guide = guide_legend(title = 'color', reverse = FALSE,   # 与前面相比较,去掉了图形的反参数
    override.aes = list(alpha = 1, size = 2))) +  
  ggtitle('Price (log10) by Cube-Root of Carat and color')
```

***

### 18.Color and Price
Response:
根据上图可以得到:颜色与价格之间存在着一定的关系。

***
总结:上面整体而言,考虑了如下三对的三变量之间的关系:
克拉重量、纯度与价格之间的关系
克拉重量、切工与价格之间的关系
克拉重量、颜色与价格之间的关系

### 19.Linear Models in R
Notes:
克拉重量的立方根与价格的对数之间存在线性关系。
Response:

***

### 20.Building the Linear Model 为价格建立线性模型
Notes:

```{r Building the Linear Model}
m1 <- lm(I(log(price)) ~ I(carat^(1/3)), data = diamonds)
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```

Notice how adding cut to our model does not help explain much of the variance
in the price of diamonds. This fits with out exploration earlier.

***

### 21.Model Problems
Video Notes:
suppose our model is:
Ln(price) = 0.415 + 9.144 * carat*(1/3) - 1.093*carat + (A*cut + B*color + C*clarity) + epslin

ANd the problem is :()2008-2014
1) inflation
2)2008 global recession
3)diamond market in china weating up 
4) uneven price increase across different carat weight.


Research:
(Take some time to come up with 2-4 problems for the model)
(You should 10-20 min on this)

Response:

***

### 22.A Bigger, Better Data Set
Notes:
# Your task is to build five linear models like Solomon
# did for the diamonds data set only this
# time you'll use a sample of diamonds from the
# diamondsbig data set.

# Be sure to make use of the same variables
# (logprice, carat, etc.) and model
# names (m1, m2, m3, m4, m5).

# To get the diamondsbig data into RStudio
# on your machine, copy, paste, and run the
# code in the Instructor Notes. There's
# 598,024 diamonds in this data set!

# Since the data set is so large,
# you are going to use a sample of the
# data set to compute the models. You can use
# the entire data set on your machine which
# will produce slightly different coefficients
# and statistics for the models.

# This exercise WILL BE automatically graded.

# You can leave off the code to load in the data.
# We've sampled the data for you.
# You also don't need code to create the table output of the models.
# We'll do that for you and check your model summaries (R^2 values, AIC, etc.)

# Your task is to write the code to create the models.

# DO NOT ALTER THE CODE BELOW THIS LINE (Reads in a sample of the diamondsbig data set)
#===========================================================================================
diamondsBigSample <- read.csv('diamondsBigSample.csv')


# ENTER YOUR CODE BELOW THIS LINE. (Create the five models)
#===========================================================================================




# DO NOT ALTER THE CODE BELOW THIS LINE (Tables your models and pulls out the statistics)
#===========================================================================================
suppressMessages(library(lattice))
suppressMessages(library(MASS))
suppressMessages(library(memisc))
models <- mtable(m1, m2, m3, m4, m5)



```{r A Bigger, Better Data Set}
install.package('bitops')
install.packages('RCurl')
library('bitops')
library('RCurl')

diamondsurl = getBinaryURL("https://raw.github.com/solomonm/diamonds-data/master/BigDiamonds.Rda")
load(rawConnection(diamondsurl))
```

The code used to obtain the data is available here:
https://github.com/solomonm/diamonds-data

## Building a Model Using the Big Diamonds Data Set
Notes:
this code need much times
```{r Building a Model Using the Big Diamonds Data Set}
diamondsbig$logprice = log(diamondsbig$price)  # 砖石价格相对数

m1 <- lm(I(logprice) ~ I(carat^(1/3)), 
         data = diamondsbig[diamondbig$price < 10000 &
                              diamondbig$cert == 'GIA',])   # 筛选价格低于10000美元且认证机构是GIA
m2 <- update(m1, ~ . + carat)
m3 <- update(m2, ~ . + cut)
m4 <- update(m3, ~ . + color)
m5 <- update(m4, ~ . + clarity)
mtable(m1, m2, m3, m4, m5)
```


***

## 23.Predictions

Example Diamond from BlueNile:
Round 1.00 Very Good I VS1 $5,601

```{r}
#Be sure you鈥檝e loaded the library memisc and have m5 saved as an object in your workspace.
thisDiamond = data.frame(carat = 1.00, cut = "V.Good",
                         color = "I", clarity="VS1")
modelEstimate = predict(m5, newdata = thisDiamond,
                        interval="prediction", level = .95)
```

Evaluate how well the model predicts the BlueNile diamond's price. Think about the fitted point estimate as well as the 95% CI.

***
```{r additional try}
dat = data.frame(m4$model,m4$residuals)

with(dat,sd(m4.residuals))

with(subset(dat,carat > .9 & carat<1.1),
     sd(m4.residuals))

dat$resid <- as.numeric(data$m4.residuals)
ggplot(aes(y=resid,x=round(carat,2)),data=dat) + 
  geom_line(stat = 'summary',fun.y=sd)

```


## 24.Final Thoughts
Notes:

***

Click **KnitHTML** to see all of your hard work and to have an html
page of this lesson, your answers, and your notes!


`

 

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值