This is a final review of Introduction of Data Analysis with R.
This is the way!!!
Basic notation
command | explanation |
---|---|
Inf | infinity, e.g. 1/0=Inf |
NaN | not a number, e.g. 0/0=NaN |
NULL | empty |
basic command | explanation |
---|---|
rm(list=ls()) , rm(a) | remove all objects, or a single object in workspace |
class() #class type , mode(a) #storage mode , typeof(a) #internal type of a single value | show the variable type, e.g. c<-factor(data.frame(1,2,3)); class="factor"; typeof="integer"; mode="numeric" |
length(a) | show the number of elements in a vector |
%% , %/% , %*% , %in% , ! , t() | modulus, integer divide, matrix multiply, is a in b?, not, transpose |
options(digits=x) | control number of displayed digits |
source('xx.r', echo=T) | run an R script file; echo=T shows commands and output, echo=F shows output only |
Sys.setlocale("LC_ALL","English") | set system language |
setwd('xx') , dir() | set path, show files in current path |
print(xxx) , cat(numeric, character...,'hello world \n') | show variable value |
set.seed(#) | fix the random seed so that random results are reproducible |
Variable
Numeric
Important type | generation command | related functions |
---|---|---|
integer/ complex/ double | b<-1 , a<-c(1,2,3) | mode(a) , length(a) , a<2 (generate a logical vector), as.numeric() (characters are converted to NA), ceiling , floor , trunc (ignore decimals), round(x, digits=#) (round(3.5)=4, round(2.5)=2) |
sequence | x1:xn , c(x1,..,xn) , seq(x1,xn,step) | x<-c(1,2,3,4) (x[4] is 4, x[-4] is 1,2,3), cumsum , cumprod (cumulative product), range , rep (a repeating sequence of numbers) |
random sample | sample(c(..),size=, replace=T/F, prob=c(...)) | used with set.seed() |
matrix (every entry should be of the same type) | matrix(sequence, nrow=, ncol=, byrow=T/F) | dim() , nrow() , ncol() , m[row,col] (indexing), rbind() (row bind), cbind() , x%*%x : matrix (inner) product, x%o%x : outer product, diag(seq) : create a diagonal matrix or retrieve the diagonal entries of a matrix, solve(A,b) : solve Ax=b, solve(A) : find inverse |
List
just list
every entry of a list can be anything: w<-list(a,b,c)
related functions | explanation |
---|---|
names(w) or names(w)<-c('x1','x2','x3') | show the names of w’s entries, or assign names to the entries |
w[[1]] or w$x1 | indexing |
unlist() | flatten a list into a vector |
max(x) | mx<-value, for (xxx) mx<-max(mx, value) : a handy running-maximum comparison pattern |
dataframe (a special class built on top of list)
entries (columns) can be of different types: integer/numeric/factor: d<-data.frame(...) , d<-read.csv('xx.csv')
related functions | explanation |
---|---|
names(d) or names(d)<-c('xxx') | show or set the column names of d |
head(d, n=4L) , tail(d) | display the first (last) rows; n=4L shows 4 rows, the default is 6 |
d[row,col] or d$xx or d[d$xx=='xxx',] | indexing or select data |
dim() , nrow() , ncol() | basic info of dataframe |
mean(x, na.rm=T) , colMeans(x, na.rm=T) , rowMeans() | overall mean, column means, or row means; na.rm is FALSE by default (NAs are not removed) |
var(x, na.rm=T) , sd(x, na.rm=T) | sample variance, sample sd |
median(x, na.rm=T) | sample median |
apply(x, MARGIN=,FUN=) | margin=1 by row, 2 by col; function can be user-defined |
d[order(d$xx),] | sort rows by column xx; order() returns the ordering permutation |
d[complete.cases(d),] or na.omit(d) | select complete cases without NA; does not work on factor rows/columns |
merge(target, origin, by='Column name') | combine data with unique key in by='' |
by(just data part in dataset, classification standard in dataset, function) | apply function on data using classification, and generate a summary table |
time series data | |
date<-as.Date(d$Date,"%d/%m/%Y") | convert Date in dataframe d to a date object using the specified format |
wkday<-weekdays(date) or wkday<- ordered(weekdays(date), c( "Monday", "Tuesday", "Wednesday", "Thursday", "Friday")) | find the corresponding weekday; the latter creates an ordered factor (note: ordered(), not order()) |
unique(wkday) | display the unique levels in wkday |
ted<-as.ts(d$ClosePrice) | change ClosePrice in d to time series data |
lag(ted) | shift the time series 1 unit to the left and return the next value of the time series. |
by(ted, wkday, mean) | just a summary |
write file | |
write.csv(d, file='xxx.csv', row.names=F) | row.names=F suppresses writing the row-number index as the first column |
Others
variable type | generation command | related functions |
---|---|---|
logical | TRUE FALSE | & , \| (vectorized and/or); && , \|\| (control-flow and/or, length 1) |
character | z <- c('a','b',"c") | combining the above types coerces: c(num,logic)->num, c(num,char)->char |
factor | factor(x, levels=c(...)) | levels() ; as.integer(factor) gives the underlying level codes |
Graph
related functions | explanation |
---|---|
par(mfrow=c(row, col)) | define a row×col multi-panel graph, filled in by row; needs to be reset afterwards |
par(mar=c(x1,x2,x3,x4)) | define the margin widths of a graph: bottom=x1, left=x2, top=x3, right=x4 |
plot(x,y,main='title', type='', xlab='', ylab='', ylim=c(y1,y2)) , plot(x, y, pch=21, bg=c('red', 'blue')[binary]) | type: "p" for points, "l" for lines, "b" for both, "c" for the lines part alone of "b", "o" for both ‘overplotted’, "h" for ‘histogram’-like (or ‘high-density’) vertical lines, "s" for stair steps, "S" for other steps, "n" for no plotting. pch selects the point symbol; bg=c('red','blue')[binary] colours points red when binary=1 and blue when binary=2 |
plot(d$"classification standard column"~d$"data column") , plot(d) | in plot(d) , d is a dataframe; R plots it by applying plot(d$"classification standard column"~d$"data column") over multiple pairs of columns |
abline(a=, b=, h=, v=) | a is intercept, b is slope, h is horizontal line, v is vertical line |
lines(c(x0, x1), c(y0, y1)) , lines(x_seq, y_seq, lty=) | draw a line segment from $(x_0,y_0)$ to $(x_1,y_1)$, or a line through multiple points. lty : line style |
curve(y, x_start, x_end) | draw a curve ⇔ plot(x, y, type='l') |
barplot(a table variable, beside=T, horiz=F, legend.text=c('xx'), main='xx', args.legend=list(horiz=T, bty='n', cex=0.6), ylim=c(0.2)) | log-transformation is frequently used. If beside=F , stacked bars are used. horiz=T draws horizontal bars |
boxplot() | box-and-whisker plot of the data |
hist(x, freq=F, main='...') | freq=F produce histogram with density instead of frequency |
qqnorm(x, main='..') , qqline(x) | QQ-plot |
Function
Create
name <- function(x,y,..., OR no input) {
...a bunch of computation...
output
}
name <- function(...){
arg <- list(...)
...}
or define own operator:
"%+-%" <- function(miu,criticalRegion){c(miu-criticalRegion,miu+criticalRegion)}
input
: if an argument is not supplied, it is missing (commonly a default of NA is given), so we can use if with is.na() or missing() to test for it. If the input is ... , the number of arguments is not fixed.
output
can be a single variable (unlike Python, an explicit return is not required; the last evaluated expression is returned), or c(x1=var1, x2=var2,...)
or outp<-list(var); names(outp)<-c('x')
or list(x1=var1, x2=var2,...)
as multiple outputs in one list.
Modify
fix(name)
can modify a function ⇔ edit(file='name') & source('name') to take effect
Speed test
proc.time()
: return the user/system/elapsed time of the current R session; take the difference of two calls to time a piece of code
Flow control
for loop
for (i in 1:len) {
a[i] <- xxx
}
while loop
i <- 1
while (i <= len){
a[i] <- xxx
i <- i+1
}
repeat
repeat {
if (i > len) break
a[i] <- xxx
i <- i+1
}
if
if (condition){
} else if (condition){
} else if (condition){
}
or
ifelse(condition, expression 1 given T, expression 2 given F)
ifelse(condition, ifelse(condition, ...), ifelse(condition, ...))
switch
good for classification into several cases
switch(expression, 'expr I'=one type of computation, 'expr II'=another, ...)
next
skips the rest of the current iteration and moves on to the next one
return(expression)
exits the function and returns expression (or the function name can be called again for recursion)
stop(message)
stops execution and prints an error message
readline()
get user’s input as a char.
Probability
simulation: set.seed() ⇒ sample()
distribution: d, p, q, r + distribution_name(parameters) show the density, cdf, quantiles, and random numbers of a certain distribution. E.g., dnorm(0,0,1) , rnorm(#numbers,0,1)
code | Distribution | code | Distribution | code | Distribution |
---|---|---|---|---|---|
beta(x, α \alpha α, β \beta β) | beta | binom(x, size, prob) | binomial | cauchy(x, location, scale) | Cauchy |
chisq(x, df) | chi-squared | exp(x, rate) | exponential | f(x, df1, df2) | F |
gamma(x, shape, scale) | gamma | geom(x, prob) | geometric | hyper(x, m, n, k) | hypergeometric |
lnorm(x, meanlog, sdlog) | log-normal | logis(x, location, scale) | logistic | nbinom(x, size, prob) | negative binomial |
norm(x, mean, sd) | normal | pois(x, λ \lambda λ) | Poisson | t(x, df) | Student’s t |
unif(x, min, max) | uniform | weibull(x, shape, scale) | Weibull | wilcox(x, m, n) | Wilcoxon |
1-pXXX(...)
: find $P(X>x)$
Statistics
Hypothesis testing
one sample t test: t.test(x, mu=#, alt='less'/'greater'/'two.sided')
. two.sided
is default.
two sample t test: t.test(x, y, alt='', var.eq=F/T, paired=F/T)
. var.eq=T
gives the pooled two-sample t test, F
gives Welch’s t test (default). paired=T
gives the paired t test, F
is the default.
Construct testing table
one-way table
cut()
: slice the x sequence into several parts and label each part
result <- cut(x_seq, breaks=c(n1,n2,...), labels=c('...', '...',...))
table(result)
table(d$Name)
: summary of a factor/character column
two-way table
table(d$Name, result)
: Cartesian product of two one-way tables
prop.table(table(d$Name, result), margin=)
: create a frequency (proportion) table. margin=1
means each row sums to 1, i.e. $P(\text{row})=1$ (referred to the row dimension); 2
means each column sums to 1, i.e. $P(\text{column})=1$. Probability tables can also be created using apply()
, rowMeans()
, colMeans()
.
Testing table
chi-square goodness of fit test: chisq.test(x, p=)
: p=sequence
is the assumed distribution.
contingency table test of independence: chisq.test(x,y)
: y is used when x is a factor; with two factors, the independence of the two classifications is tested
Regression
Simple linear regression
reg1 <- lm(y~x); summary(reg1)
plot(x,y);abline(reg1) #use reg1$coef
lm
is the linear regression function. In the output, y - reg1$fit
= reg1$resid
. And the multiple R-squared equals cor(x,y,use='complete')^2
.
Check the 4 assumptions of the linear regression model and ways to check the data: linearity (residual vs fitted value, residual vs index), normality (QQ-plot), independence (residual[i] vs residual[i-1]), constant variance (residual vs fitted value, residual vs index).
# produce 4 residual plots
# input residual vector and fitted values
#
residplot<-function(resid,fit) {
par(mfrow=c(2,2)) # define 2x2 multiframe graphic
n<-length(resid) # get no of points
plot(fit,resid); abline(h=0) # plot e(i) vs fit(i), add x-axis to plot
plot(1:n,resid); abline(h=0) # plot e(i) vs i, add x-axis to plot
plot(resid[1:(n-1)],resid[2:n]); abline(h=0) # plot e(i) vs e(i-1), add x-axis to plot
qqnorm(resid); qqline(resid) # QQ-normal plot of e(i)
par(mfrow=c(1,1)) # reset multiframe graphic
}
If the residuals shrink toward zero as the index grows, they are not random, so the linear form is not appropriate.
Then test $H_0: \alpha=0$ and $H_0: \beta=0$ on $y_i=\alpha+\beta \times x_i+\epsilon_i$, where $\epsilon \sim N(0, \sigma^2)$:
We have test statistics $T=\frac{\hat{\alpha}}{\hat{\sigma} \times \sqrt{\frac{1}{n}+\frac{\bar{x}^2}{S_{xx}}}}$ and $T=\frac{\hat{\beta}}{\hat{\sigma}/\sqrt{S_{xx}}}$, where $S_{xx}=\sum_{i=1}^{n}(x_i-\bar{x})^2$.
Multiple linear regression
For $\begin{bmatrix} y_1\\ \vdots \\ y_n \end{bmatrix}_{n\times 1} = \begin{bmatrix} 1 & x_{11} & \dots & x_{1p}\\ \vdots & \vdots & & \vdots \\ 1 & x_{n1} & \dots & x_{np} \end{bmatrix}_{n\times (p+1)} \times \begin{bmatrix} \beta_0\\ \beta_1\\ \vdots\\ \beta_p \end{bmatrix}_{(p+1)\times 1} + \begin{bmatrix} \epsilon_1\\ \vdots\\ \epsilon_n \end{bmatrix}_{n\times 1}$, where $\epsilon\sim N(0, \sigma^2)$:
We want to find the Least Squares Estimates of $\beta$ so that the Sum of Squared Errors $=\sum_{i=1}^{n}\epsilon_i^2=\sum_{i=1}^{n}(y_i-x_i'\beta)^2$ is minimized.
By minimizing SSE $=\epsilon'\epsilon$, we can find $\hat{\beta}=(X'X)^{-1}X'Y$ with fitted value $\hat{y}=Xb$ and $\hat{\sigma}^2=\frac{(y-Xb)'(y-Xb)}{n-p-1}$.
Code: reg<- lm(y~x1+x2+x3+..., data=d)
, x1,x2,… are column names in d.
Choose model: step(reg)
, then insignificant $\beta_i$ will be dropped.
Logistic regression
$Y$ is a binary variable, then $\pi_i=P(Y_i=1\mid x_i)$ is the probability of success. Then the logistic model is:
$\ln\left[\frac{\pi_i}{1-\pi_i}\right]=\beta_0+\beta_1 x_{i1}+...+\beta_{p}x_{ip}$. Assume the LHS is linear in $x_i$, then
$\pi_i=\frac{e^{x_{i}'\beta}}{1+e^{x_{i}'\beta}}=\frac{1}{1+e^{-x_{i}'\beta}}\in[0,1]$ (logit transformation).
The likelihood function is $L(\beta)=\prod_{i=1}^{n}[\pi_{i}^{y_i}(1-\pi_{i})^{1-y_i}]$, then do a log-transformation and find the MLE.
Code: reg2 <- glm(y~x1+x2+...+xn, data=d, binomial)
, binomial
is needed for logistic regression, where y is a binary variable. Also use step()
to drop insignificant regressors.
prediction:
b <- reg2$coef
X <- as.matrix(cbind(1,x1,x2,...))
c <- X%*%b
ps <- exp(c)/(1+exp(c)) # or directly use ps<-reg2$fit
pred <- ifelse(ps>0.5, "A", "B")
table(pred, original vector of "A"&"B") # show the classification table
Nonlinear curve fitting
nls(y~f(x;a1,a2,a3,...), start=c(a1=#, a2=#, ...))
: nonlinear least squares with parameters $a_1, a_2, a_3, ...$; starting values of all parameters are required for searching the model.
nlsout <- summary(nls)
: to get the coefficients, use nlsout$coefficients
, or use y - nlsout$res
to get the fitted values.
If the form of y is unclear, try a log-transformation: log(y) or log(x). lm(log(y)~x)
may fit a good nonlinear form.