[Cheatsheet] Introduction to Data Analysis with R

This is a final review of Introduction of Data Analysis with R. T h i s   i s   t h e   w a y ! ! ! ^{^{This~is~the~way!!!}} This is the way!!!

Basic notation

commandexplanation
Infinfinity, e.g. 1/0=Inf
NaNnot a number, e.g. 0/0=NaN
NULLempty
basic commandexplanation
rm(list=ls()), rm(a)remove all objects, or a single object in workspace
class() #creat type, mode(a) #storing type > typeof( ) #for single valueshow the variable type, e.g. c<-factor( data.frame(1,2,3))
class=“factor”; typeof=“integer”; mode=“numeric”
length(a)show the length of the character
%%, %/%, %*%, %in%, !, t()modulus, integer divide, matrix multiply, is a in b?, not, transpose
options(digits=x)control number of displayed digits
source('xx.r', echo=T)load History file, echo=T show command and output/ echo=F only show output
Sys.setlocale("LC_ALL","English")set system language
setwd('xx'), dir()set path, show files in current path
print(xxx), cat(numeric, character...,'hello world \n')show variable value
set.seed()

Variable

Numeric

Important typegeneration commandralated functions
integer/ complex/ doubleb<-1,
a<-c(1,2,3)
mode(a), length(a), a<2 (generate a sequence of logic vector), as.numeric() (where char converted into NA), ceiling, floor, trunc (ignore decimal), round(x, digits=#) (round(3.5)=4, round(2.5)=2)
sequencex1:x_n,
c(x1,..,xn),
seq(x1,xn,step)
x<-c(1,2,3,4) (x[4] is 4, x[-4] is 1,2,3), cumsum, cumprod (cumulative product), range, rep (a repeating sequence of number)
random samplesample(c(..),size=, replace=T/F, prob=c(...))used with set.seed()
matrix (every entry should be of same type)matrix(sequence, nrow=, ncol=, byrow=T/F)dim(), nrow(), ncol(), m[row,col](indexing), rbind()(row bind), cbind(), x%*%x: inter product, x%o%x: outer product, diag(seq): create diagonal matrix or retrieve diagonal entry in a matrix, solve(A,b): find solution, solve(A): find inverse

List

just list

every entry of a list can be anything: w<-list(a,b,c)

related functionsexplanation
names(w) or names(w)<-c('x1,'x2,'x3')show name of w’s index, or give w entry name
w[[1]] or w$x1indexing
unlist()
max(x)mx<-value, for (xxx) mx<-max(mx, value): nice comparison method

dataframe (one mother class of list)

entry can be of different type: integer/numeric/factor: d<-data.frame(), d<-read.csv('xx.csv')

related functionsexplanation
names(d) or names(d)<-c('xxx')
head(d, n=4L), tail(d)display first(last) 4 lines(10 lines as default)
d[row,col] or d$xx or d[d$xx=='xxx',]indexing or select data
dim(), nrow(), ncol()basic info of dataframe
mean(x, na.rm=T), colMeans(x, na.rm=T), rowMeans()total mean or column mean, romove NA is false as default
var(x, na.rm=T), sd(x, na.rm=T)sample variance, sample sd
median(x, na.rm=T)
apply(x, MARGIN=,FUN=)margin=1 by row, 2 by col; function can be user-defined
d[order(d$xx),]sort by row xx, order return a sequence
d[complete.cases(d),] or na.omit(d)select non-NA complete cases, not work on factor row/col
merge(target, origin, by='Column name')combine data with unique key in by=''
by(just data part in dataset, classification standard in dataset, function)apply function on data using classification, and generate a summary table
time series data
date<-as.Date(d$Date,"$d/$m/$Y")change Date in dataframe d to date object in special fromat
wkday<-weekdays(date) or wkday<- ordered(weekdays(date), c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday"))find the corresponding weekday, the later code is ordered, instead of order
unique(wkday)display the unique levels in wkday
ted<-as.ts(d$ClosePrice)change ClosePrice in d to time series data
lag(ted)shift the time series 1 unit to the left and return the next value of the time series.
by(ted, wkday, mean)just a summary
write file
write.csv(d, file='xxx.scv', row.names=Frow.names=F suppress write number index in the first column

Others

variable typegeneration commandrelated functions
logicalTRUE FALSE& | (vec and/or), && || (control and/or, len=1)
charcaterz <- c('a','b',"c")combine above types: c(num,logic)->num,c(num,char)->char
factor

Graph

related functionsexplanation
par(mfrow=c(row, col))define row × \times ×col multiple graph, fill-in by row; need to be reset next time
par(mar=c(x1,x2,x3,x4))define marginal width of graph, bottom=x1, left=x2, top=x3, right=x4
plot(x,y,main='title', type='',xlab='',ylab'', ylim=c(y1,y2)), plot(x, y, pch=21, bg=c('red', 'blue')[binary])type: “p” for points, “l” for lines, “b” for both, “c” for the lines part alone of “b”, “o” for both ‘overplotted’, “h” for ‘histogram’ like (or ‘high-density’) vertical lines, “s” for stair steps, “S” for other steps, see ‘Details’ below, “n” for no plotting.//pchrepresent dot type, bgmeans use red for binary=1 and blue for binary=2
plot(d$"classification standard column"~d$"data column"), plot(d)in plot(d), d is a dataframe, R will plot it by applying plot(d$"classification standard column"~d$"data column" multiple times
abline(a=, b=, h=, v=)a is intercept, b is slope, h is horizontal line, v is vertical line
lines(c(x0, x1), c(y0, y1)), lines(x_seq, y_seq), lty=)draw line from ( x 0 , y 0 ) (x_0,y_0) (x0,y0) to ( x 1 , y 1 ) (x_1,y_1) (x1,y1), or just a line through multiple points. lty: line style
curve(y, x_start, x_end)draw curve ⇔ \Leftrightarrow plot(x,y, type='l')
barplot(a table variable, beside=T, horiz=F, legend.text=c('xx'), main='xx', args.legend=list(horiz=T, bty='n', cex=0.6), ylim=c(0.2))log-transformation is frequently used. If beside=F, stack bar will be used. horiz=T will transpose the barplot
boxplot()
hist(x, freq=F, main='...')freq=F produce histogram with density instead of frequency
qqnorm(x, main='..'), qqline(x)QQ-plot

Function

Create

name <- function(x,y,..., OR no input) {
	...a bunch of cumputation...
	output
}
name <- function(...){
	arg <- list(...)
	...}

or define own operator:

"%+-%" <- function(miu,criticalRegion){c(miu-criticalRegion,miu+criticalRegion)}

input: if a variable has no input value, it will be automatically assigned NA, so we can use if to test. if input is ..., then numbers of input is undefined.
output can be a single variable (not like in python, we have to write return output) or c(x1=var1, x2=var2,...) or outp<-list(var); names(outp)<-c('x') or list(x1=var1, x2=var2,...) as multiple output in one list.

Modify

fix(name) can modify function ⇔ \Leftrightarrow edit(file='name') & source(‘name’) to make effect

Speed test

proc.time(): show current time

Flow control

for loop

for (i in 1:len) {
	a[i] <- xxx
}

while loop

i <- 1
while (i <= Len){
	a[i] <- xxx
	i <- I+1
}

repeat

repeat {
	if (i>Len) break
	a[i] <- xxx
	i<- I+1
}

if

if (condition){
} else if (condition){
} else if (condition){
}

or

ifelse(condition, expression 1 given T, expression 2 given F)
ifelse(condition, ifelse(condition, ...), ifelse(condition, ...))

switch

good for classification into several interval

swith(expression, 'expr I'=one type of computation, 'expr II'=another, ...)

next

terminates current loop and move to next

return(expression)

terminates current loop and return expression( or func name in recursion)

stop(message)

terminates current loop and give a warning message

readline()

get user’s input as a char.

Probability

simulation: set.seed() ⇒ \Rightarrow sample()
distribution: d, p, q, r+distribution_name(parameters), show density, cdf, quantiles, random number of certain distribution. E.g., dnorm(0,0,1), rnorm(#numbers,0,1)

codeDistributioncodeDistributioncodeDistribution
beta(x, α \alpha α, β \beta β)betabinom(x, size, prob)binomialcauchy(x, location, scale)Cauchy
chisq(x, df)chi-squaredexp(x, rate)exponentialf(x, df1, df2)F
gamma(x, shape, sclae)gammageom(x, prob)geometrichyper(x, m, n, k)hypergeometric
lnorm(x, meanlog, sdlog)log-normallogis(x, location, scale)logistic/ multinomialnbinom(x, size, prob)negative binomial
norm(x, mean, sd)normalpois(x, λ \lambda λ)Poissont(x, df)Student’s t
unif(x, min, max)uniformweibullWeibullwilcox(x, m, n)Wilcoxon

1-pXXX(...): find P ( X > x ) P(X>x) P(X>x)

Statistics

Hypothesis testing

one sample t test: t.test(x, mu=#, alt='less'/'greater'/two.sided'). two.sided is default.
two sample t test: t.test(x, y, alt='', var.eq=F/T, paired=F/T). var.eq=T is two sample t test, F is Welch’s t test(default). paired=T is paired t test, F is default.

Construct testing table

one-way table
cut(): slice x sequence into several part and specify each part with labels

result <- cut(x_seq, breaks=c(n1,n2,...), lables=c('...', '...',...))
table(result)

table(d$Name): summary of a factor/character column
two-way table
table(d$Name, result): Cartesian product of two one-way tables
prob.table(table(d$Name, result), margin=): create frequency table. margin=1 means P ( r o w ) = 1 P(row)=1 P(row)=1 (referred to row dimension), 2 means P ( c o l u m n ) = 1 P(column)=1 P(column)=1. Probability table can also be created using apply(), rowMeans(), colMeans().

Testing table

chi-square goodness of fit test: chisq.test(x, p=): p=sequence is the assumed distribution.
contingenty table on independence: chisq.test(x,y): y is considered when x is a factor

Regression

Simple linear regression

reg1 <- lm(y~x); summary(reg1)
plot(x,y);abline(reg1) #use reg1$coef

lm is linear regression function. In output, y-reg1$fit=reg1$resid. And multiple R-squared measures is cor(x,y,use='complete').
Check 4 assumptions of linear regression model and ways to check data: linearity(residual vs fitted value, residual vs index), normality(QQ-plot), independence(residual[i] vs residual[i-1]), constant variance(residual vs fitted value, residual vs index).

# produce 4 residual plots
# input residual vector and fitted values
#
residplot<-function(resid,fit) {
  par(mfrow=c(2,2))			# define 2x2 multiframe grahpic	
  n<-length(resid)			# get no of points
  plot(fit,resid); abline(h=0)		# plot e(i) vs fit(i), add x-axis to plot
  plot(1:n,resid); abline(h=0)		# plot e(i) vs i, add x-axis to plot
  plot(resid[1:(n-1)],resid[2:n]); abline(h=0)	# plot e(i) vs e(i-1), add x-axis to plot
  qqnorm(resid); qqline(resid)		# QQ-normal plot of e(i)
  par(mfrow=c(1,1))			# reset multiframe graphic
}

If residual converges when n becomes larger, the randomness is false, then it’s not a linear form.
Then H 0 : α = 0 H_0: \alpha=0 H0:α=0, H 0 : β = 0 H_0:\beta=0 H0:β=0 testing on y i = α + β × x i + ϵ i y_i=\alpha+\beta \times x_i+\epsilon_i yi=α+β×xi+ϵi, where ϵ ∼ N ( 0 , σ 2 ) \epsilon \sim N(0, \sigma^2) ϵN(0,σ2):
We have test statistics: , T = a ^ σ ^ × ( 1 n + x ˉ 2 S x x ) T=\frac{\hat{a}}{\hat{\sigma} \times \sqrt(\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}})} T=σ^×( n1+Sxxxˉ2)a^, T = b ^ σ ^ S x x T=\frac{\hat{b}}{\frac{\hat{\sigma}}{\sqrt{S_{xx}}}} T=Sxx σ^b^, where S x x = ∑ i = 1 n ( x i − x ˉ ) S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x}) Sxx=i=1n(xixˉ)

Multiple linear regression

For [ y 1 . . . y n ] n × 1 \begin{bmatrix} y_1\\.\\.\\.\\y_n \end{bmatrix}_{n\times 1} y1...ynn×1 = = = [ 1 x 11 . . . x 1 p . . . . . . . . . 1 x n 1 . . . x n p ] n × ( p + 1 ) \begin{bmatrix} 1 & x_{11} & ... & x_{1p}\\. & . & & . \\. & . & & . \\. & . & & . \\1 & x_{n1} & ... & x_{np} \end{bmatrix}_{n\times (p+1)} 1...1x11...xn1......x1p...xnpn×(p+1) × \times × [ β 0 β 1 . . β p ] ( p + 1 ) × 1 \begin{bmatrix} \beta_0\\\beta_1\\.\\.\\\beta_p \end{bmatrix}_{(p+1)\times 1} β0β1..βp(p+1)×1 + + + [ ϵ 1 . . . ϵ n ] n × 1 \begin{bmatrix} \epsilon_1\\ . \\. \\. \\ \epsilon_{n} \end{bmatrix}_{n\times 1} ϵ1...ϵnn×1, where ϵ ∼ N ( 0 , σ 2 ) \epsilon\sim N(0, \sigma^2) ϵN(0,σ2):
We want to find Least Square Estimates of β \beta β so that Sum of Squared Error = ∑ i = 1 n ϵ 2 = ∑ i = 1 n ( y − X × β ) 2 =\sum_{i=1}^{n}{\epsilon^2}=\sum_{i=1}^{n}(y-X \times\beta)^2 =i=1nϵ2=i=1n(yX×β)2 is minimized.
By minimizing SSE= ϵ ′ ϵ \epsilon'\epsilon ϵϵ, we can find β ^ = ( X ′ X ) − 1 X ′ Y \hat{\beta}=(X'X)^{-1}X'Y β^=(XX)1XY with fitted value y ^ = X b \hat{y}=Xb y^=Xb and σ ^ 2 = ( y − X b ) ′ ( y − X b ) n − p − 1 \hat{\sigma}^2=\frac{(y-Xb)'(y-Xb)}{n-p-1} σ^2=np1(yXb)(yXb)

Code: reg<- lm(y~x1+x2+x3+..., data=d), x1,x2,… are column names in d.
Choose model: step(reg), then insignificant β i \beta_i βi will be dropped.

Logistic regression

Y Y Y is binary variable, then π i = P ( Y i = 1 ∣ x i ) \pi_i=P(Y_i=1|x_i) πi=P(Yi=1xi) is probability of success. Then logistic model is: l n [ π i 1 − π i ] = β 0 + β 1 x i 1 + . . . + β p x i p ln[\frac{\pi_i}{1-\pi_i}]=\beta_0+\beta_1 x_{i1}+...+\beta_{p}x_{ip} ln[1πiπi]=β0+β1xi1+...+βpxip. Assume LHS is linear on x i x_i xi, then π i = e x i ′ β 1 + e x i ′ β = 1 1 + e − x i ′ β ∈ [ 0 , 1 ] \pi_i=\frac{e^{x_{i}'\beta}}{1+e^{x_{i}'\beta}}=\frac{1}{1+e^{-x_{i}'\beta}}\in[0,1] πi=1+exiβexiβ=1+exiβ1[0,1](logit transformation).
The likelihood function is L ( β ) = ∏ i = 1 n [ π i y i ( 1 − π i ) ( 1 − y i ) ] L(\beta)=\prod_{i=1}^{n}[\pi_{i}^{y_i}(1-\pi_{i})^{(1-y_i)}] L(β)=i=1n[πiyi(1πi)(1yi)], then do log-transformation and find MLE.
Code: reg2 <- glm(y~x1+x2+...+xn, data=d, binomial), binomial is needed for logistic regression, where y is binomial variable. Also use step() to drop insignificant regressor.
prediction:

b <- reg2$coef
X <- as.matrix(cbind(1,x1,x2,...))
c <- X%*%b
ps <- exp(c)/(1+exp(c)) # or directly use ps<-reg2$fit
pred <- ifelse(ps>0.5, "A", "B")
table(pred, oringinal list of "A"&"B") #show classification table

Nonlinear curve fitting

nls(y~f(x;a1,a2,a3,...), start=c(a1=#, a2=#, ...): nonlinear least square with parameters a 1 , a 2 , a 3 , . . . a_1, a_2, a_3,... a1,a2,a3,... and start value of all parameters are required to be used in searching the model.
nlsout <- summary(nls): to get coefficients, use nlsout$coefficients, or use y-nlsout$res to get fitted value.
If unclear form of y, then try log-transformation: log(y) or log(x). lm(log(y)~x may fit a good nonlinear form.

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值