This is a final review of Introduction of Data Analysis with R.
This is the way!!!
Basic notation
command | explanation |
---|---|
Inf | infinity, e.g. 1/0=Inf |
NaN | not a number, e.g. 0/0=NaN |
NULL | empty |
basic command | explanation |
---|---|
rm(list=ls()) , rm(a) | remove all objects, or a single object in workspace |
class() #class type , mode(a) #storage mode , typeof(a) #internal type of a single value | show the variable type, e.g. c<-factor(data.frame(1,2,3)); class="factor"; typeof="integer"; mode="numeric" |
length(a) | show the number of elements in a vector |
%% , %/% , %*% , %in% , ! , t() | modulus, integer divide, matrix multiply, is a in b?, not, transpose |
options(digits=x) | control number of displayed digits |
source('xx.r', echo=T) | run an R script file; echo=T shows commands and output, echo=F shows output only |
Sys.setlocale("LC_ALL","English") | set system language |
setwd('xx') , dir() | set path, show files in current path |
print(xxx) , cat(numeric, character...,'hello world \n') | show variable value |
set.seed(#) | fix the random seed so that random results are reproducible |
Variable
Numeric
Important type | generation command | related functions |
---|---|---|
integer/ complex/ double | b<-1 , a<-c(1,2,3) | mode(a) , length(a) , a<2 (generate a logical vector), as.numeric() (characters are converted to NA), ceiling , floor , trunc (ignore decimals), round(x, digits=#) (round(3.5)=4, round(2.5)=2) |
sequence | x1:xn , c(x1,..,xn) , seq(x1,xn,step) | x<-c(1,2,3,4) (x[4] is 4, x[-4] is 1,2,3), cumsum , cumprod (cumulative product), range , rep (a repeating sequence of numbers) |
random sample | sample(c(..),size=, replace=T/F, prob=c(...)) | used with set.seed() |
matrix (every entry should be of the same type) | matrix(sequence, nrow=, ncol=, byrow=T/F) | dim() , nrow() , ncol() , m[row,col] (indexing), rbind() (row bind), cbind() , x%*%x : matrix (inner) product, x%o%x : outer product, diag(seq) : create a diagonal matrix or retrieve the diagonal entries of a matrix, solve(A,b) : solve Ax=b, solve(A) : find inverse |
List
just list
every entry of a list can be anything: w<-list(a,b,c)
related functions | explanation |
---|---|
names(w) or names(w)<-c('x1','x2','x3') | show the names of w’s entries, or assign names to the entries |
w[[1]] or w$x1 | indexing |
unlist() | flatten a list into a vector |
max(x) | mx<-value, for (xxx) mx<-max(mx, value) : a handy running-maximum comparison pattern |
dataframe (a special class built on top of list)
entries (columns) can be of different types: integer/numeric/factor: d<-data.frame(...) , d<-read.csv('xx.csv')
related functions | explanation |
---|---|
names(d) or names(d)<-c('xxx') | show or set the column names of d |
head(d, n=4L) , tail(d) | display the first (last) rows; n=4L shows 4 rows, the default is 6 |
d[row,col] or d$xx or d[d$xx=='xxx',] | indexing or select data |
dim() , nrow() , ncol() | basic info of dataframe |
mean(x, na.rm=T) , colMeans(x, na.rm=T) , rowMeans() | overall mean, column means, or row means; na.rm is FALSE by default (NAs are not removed) |
var(x, na.rm=T) , sd(x, na.rm=T) | sample variance, sample sd |
median(x, na.rm=T) | sample median |
apply(x, MARGIN=,FUN=) | margin=1 by row, 2 by col; function can be user-defined |
d[order(d$xx),] | sort rows by column xx; order() returns the ordering permutation |
d[complete.cases(d),] or na.omit(d) | select complete cases without NA; does not work on factor rows/columns |
merge(target, origin, by='Column name') | combine data with unique key in by='' |
by(just data part in dataset, classification standard in dataset, function) | apply function on data using classification, and generate a summary table |
time series data | |
date<-as.Date(d$Date,"%d/%m/%Y") | convert Date in dataframe d to a date object using the specified format |
wkday<-weekdays(date) or wkday<- ordered(weekdays(date), c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) | find the corresponding weekday; the latter creates an ordered factor (note: ordered(), not order()) |
unique(wkday) | display the unique levels in wkday |
ted<-as.ts(d$ClosePrice) | change ClosePrice in d to time series data |
lag(ted) | shift the time series 1 unit to the left and return the next value of the time series. |
by(ted, wkday, mean) | just a summary |
write file | |
write.csv(d, file='xxx.csv', row.names=F) | row.names=F suppresses writing the row-number index as the first column |
Others
variable type | generation command | related functions |
---|---|---|
logical | TRUE FALSE | & , \| (vectorized and/or); && , \|\| (control-flow and/or, length 1) |
character | z <- c('a','b',"c") | combining the above types coerces: c(num,logic)->num, c(num,char)->char |
factor | factor(x, levels=c(...)) | levels() ; as.integer(factor) gives the underlying level codes |
Graph
related functions | explanation |
---|---|
par(mfrow=c(row, col)) | define a row×col multi-panel graph, filled in by row; needs to be reset afterwards |
par(mar=c(x1,x2,x3,x4)) | define the margin widths of a graph: bottom=x1, left=x2, top=x3, right=x4 |
plot(x,y,main='title', type='', xlab='', ylab='', ylim=c(y1,y2)) , plot(x, y, pch=21, bg=c('red', 'blue')[binary]) | type: "p" for points, "l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both ‘overplotted’, "h" for ‘histogram’-like (or ‘high-density’) vertical lines, "s" for stair steps, "S" for other steps, "n" for no plotting. pch selects the point symbol; bg=c('red','blue')[binary] colours points red when binary=1 and blue when binary=2 |
plot(d$"classification standard column"~d$"data column") , plot(d) | in plot(d) , d is a dataframe; R plots it by applying plot(d$"classification standard column"~d$"data column") over multiple pairs of columns |
abline(a=, b=, h=, v=) | a is intercept, b is slope, h is horizontal line, v is vertical line |
lines(c(x0, x1), c(y0, y1)) , lines(x_seq, y_seq, lty=) | draw a line segment from $(x_0,y_0)$ to $(x_1,y_1)$, or a line through multiple points. lty : line style |
curve(y, x_start, x_end) | draw a curve ⇔ plot(x, y, type='l') |
barplot(a table variable, beside=T, horiz=F, legend.text=c('xx'), main='xx', args.legend=list(horiz=T, bty='n', cex=0.6), ylim=c(0.2)) | log-transformation is frequently used. If beside=F , stacked bars are used. horiz=T draws horizontal bars |
boxplot() | box-and-whisker plot of the data |
hist(x, freq=F, main='...') | freq=F produce histogram with density instead of frequency |
qqnorm(x, main='..') , qqline(x) | QQ-plot |
Function
Create
name <- function(x,y,..., OR no input) {
...a bunch of computation...
output
}
name <- function(...){
arg <- list(...)
...}
or define own operator:
"%+-%" <- function(miu,criticalRegion){c(miu-criticalRegion,miu+criticalRegion)}
input
: if an argument is not supplied, it is missing (commonly a default of NA is given), so we can use if with is.na() or missing() to test for it. If the input is ... , the number of arguments is not fixed.
output
can be a single variable (unlike Python, an explicit return is not required; the last evaluated expression is returned), or c(x1=var1, x2=var2,...)
or outp<-list(var); names(outp)<-c('x')
or list(x1=var1, x2=var2,...)
as multiple outputs in one list.
Modify
fix(name)
can modify a function ⇔ edit(file='name') & source('name') to take effect
Speed test
proc.time()
: return the user/system/elapsed time of the current R session; take the difference of two calls to time a piece of code
Flow control
for loop
for (i in 1:len) {
a[i] <- xxx
}
while loop
i <- 1
while (i <= len){
a[i] <- xxx
i <- i+1
}
repeat
repeat {
if (i > len) break
a[i] <- xxx
i <- i+1
}
if
if (condition){
} else if (condition){
} else if (condition){
}
or
ifelse(condition, expression 1 given T, expression 2 given F)
ifelse(condition, ifelse(condition, ...), ifelse(condition, ...))
switch
good for classification into several cases
switch(expression, 'expr I'=one type of computation, 'expr II'=another, ...)
next
skips the rest of the current iteration and moves on to the next one
return(expression)
exits the function and returns expression (or the function name can be called again for recursion)
stop(message)
stops execution and prints an error message
readline()
get user’s input as a char.
Probability
simulation: set.seed() ⇒ sample()
distribution: d, p, q, r + distribution_name(parameters) show the density, cdf, quantiles, and random numbers of a certain distribution. E.g., dnorm(0,0,1) , rnorm(#numbers,0,1)
code | Distribution | code | Distribution | code | Distribution |
---|---|---|---|---|---|
beta(x, α \alpha α, β \beta β) | beta | binom(x, size, prob) | binomial | cauchy(x, location, scale) | Cauchy |
chisq(x, df) | chi-squared | exp(x, rate) | exponential | f(x, df1, df2) | F |
gamma(x, shape, scale) | gamma | geom(x, prob) | geometric | hyper(x, m, n, k) | hypergeometric |
lnorm(x, meanlog, sdlog) | log-normal | logis(x, location, scale) | logistic | nbinom(x, size, prob) | negative binomial |
norm(x, mean, sd) | normal | pois(x, λ \lambda λ) | Poisson | t(x, df) | Student’s t |
unif(x, min, max) | uniform | weibull(x, shape, scale) | Weibull | wilcox(x, m, n) | Wilcoxon |
1-pXXX(...)
: find $P(X>x)$
Statistics
Hypothesis testing
one sample t test: t.test(x, mu=#, alt='less'/'greater'/'two.sided')
. two.sided
is default.
two sample t test: t.test(x, y, alt='', var.eq=F/T, paired=F/T)
. var.eq=T
gives the pooled two-sample t test, F
gives Welch’s t test (default). paired=T
gives the paired t test, F
is the default.
Construct testing table
one-way table
cut()
: slice the x sequence into several parts and label each part
result <- cut(x_seq, breaks=c(n1,n2,...), labels=c('...', '...',...))
table(result)
table(d$Name)
: summary of a factor/character column
two-way table
table(d$Name, result)
: Cartesian product of two one-way tables
prop.table(table(d$Name, result), margin=)
: create a frequency (proportion) table. margin=1
means each row sums to 1, i.e. $P(\text{row})=1$ (referred to the row dimension); 2
means each column sums to 1, i.e. $P(\text{column})=1$. Probability tables can also be created using apply()
, rowMeans()
, colMeans()
.
Testing table
chi-square goodness of fit test: chisq.test(x, p=)
: p=sequence
is the assumed distribution.
contingency table test of independence: chisq.test(x,y)
: y is used when x is a factor; with two factors, the independence of the two classifications is tested
Regression
Simple linear regression
reg1 <- lm(y~x); summary(reg1)
plot(x,y);abline(reg1) #use reg1$coef
lm
is the linear regression function. In the output, y - reg1$fit
= reg1$resid
. And the multiple R-squared equals cor(x,y,use='complete')^2
.
Check the 4 assumptions of the linear regression model and ways to check the data: linearity (residual vs fitted value, residual vs index), normality (QQ-plot), independence (residual[i] vs residual[i-1]), constant variance (residual vs fitted value, residual vs index).
# produce 4 residual plots
# input residual vector and fitted values
#
residplot<-function(resid,fit) {
par(mfrow=c(2,2)) # define 2x2 multiframe graphic
n<-length(resid) # get no of points
plot(fit,resid); abline(h=0) # plot e(i) vs fit(i), add x-axis to plot
plot(1:n,resid); abline(h=0) # plot e(i) vs i, add x-axis to plot
plot(resid[1:(n-1)],resid[2:n]); abline(h=0) # plot e(i) vs e(i-1), add x-axis to plot
qqnorm(resid); qqline(resid) # QQ-normal plot of e(i)
par(mfrow=c(1,1)) # reset multiframe graphic
}
If the residuals shrink toward zero as the index grows, they are not random, so the linear form is not appropriate.
Then test $H_0: \alpha=0$ and $H_0: \beta=0$ on $y_i=\alpha+\beta \times x_i+\epsilon_i$, where $\epsilon \sim N(0, \sigma^2)$:
We have test statistics $T=\frac{\hat{\alpha}}{\hat{\sigma} \times \sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}}$ and $T=\frac{\hat{\beta}}{\hat{\sigma}/\sqrt{S_{xx}}}$, where $S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2$.
Multiple linear regression
For $\begin{bmatrix} y_1\\ \vdots \\ y_n \end{bmatrix}_{n\times 1} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p}\\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}_{n\times (p+1)} \times \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{bmatrix}_{(p+1)\times 1} + \begin{bmatrix} \epsilon_1\\ \vdots\\ \epsilon_n \end{bmatrix}_{n\times 1}$, where $\epsilon\sim N(0, \sigma^2)$:
We want to find the Least Squares Estimates of $\beta$ so that the Sum of Squared Errors $=\sum_{i=1}^{n}\epsilon_i^2=\sum_{i=1}^{n}(y_i-x_i'\beta)^2$ is minimized.
By minimizing SSE $=\epsilon'\epsilon$, we can find $\hat{\beta}=(X'X)^{-1}X'Y$ with fitted value $\hat{y}=Xb$ and $\hat{\sigma}^2=\frac{(y-Xb)'(y-Xb)}{n-p-1}$.
Code: reg<- lm(y~x1+x2+x3+..., data=d)
, x1,x2,… are column names in d.
Choose model: step(reg)
, then insignificant $\beta_i$ will be dropped.
Logistic regression
$Y$ is a binary variable, then $\pi_i=P(Y_i=1\mid x_i)$ is the probability of success. Then the logistic model is:
$\ln\left[\frac{\pi_i}{1-\pi_i}\right]=\beta_0+\beta_1 x_{i1}+...+\beta_{p}x_{ip}$. Assume the LHS is linear in $x_i$, then
$\pi_i=\frac{e^{x_{i}'\beta}}{1+e^{x_{i}'\beta}}=\frac{1}{1+e^{-x_{i}'\beta}}\in[0,1]$ (logit transformation).
The likelihood function is $L(\beta)=\prod_{i=1}^{n}[\pi_{i}^{y_i}(1-\pi_{i})^{1-y_i}]$, then do a log-transformation and find the MLE.
Code: reg2 <- glm(y~x1+x2+...+xn, data=d, binomial)
, binomial
is needed for logistic regression, where y is a binary variable. Also use step()
to drop insignificant regressors.
prediction:
b <- reg2$coef
X <- as.matrix(cbind(1,x1,x2,...))
c <- X%*%b
ps <- exp(c)/(1+exp(c)) # or directly use ps<-reg2$fit
pred <- ifelse(ps>0.5, "A", "B")
table(pred, original vector of "A"&"B") # show the classification table
Nonlinear curve fitting
nls(y~f(x;a1,a2,a3,...), start=c(a1=#, a2=#, ...))
: nonlinear least squares with parameters $a_1, a_2, a_3, ...$; starting values of all parameters are required for searching the model.
nlsout <- summary(nls)
: to get the coefficients, use nlsout$coefficients
, or use y - nlsout$res
to get the fitted values.
If the form of y is unclear, try a log-transformation: log(y) or log(x). lm(log(y)~x)
may fit a good nonlinear form.