# load dplyr
```{r}
rm(list=ls())
# install dplyr only if it is missing (the package name must be a quoted string)
if (!require(dplyr)) install.packages("dplyr")
library(dplyr)
vignette(package = "dplyr")
```
The dplyr package provides several functions for data manipulation, such as

* **group_by**
* **filter**
* **distinct**
* **arrange**
* **left_join**

and others. The full list is in the data transformation cheatsheet:

* [github](https://github.com/rstudio/cheatsheets/blob/main/data-transformation)
# select
**Use `select()` to choose which variables (columns) of a dataset to keep.**
**Prefix a variable with a minus sign (`-`) to exclude it.**
The examples below use the `diamonds` dataset that ships with ggplot2 to show how `select()` works.
```{r}
library(ggplot2)
data(diamonds, package = "ggplot2")
# show the first six rows
head(diamonds)
# choose every variable from carat to price (the colon selects a range)
diamonds_se1 <- diamonds %>%
  select(carat:price)
# same range, but without clarity
diamonds_se2 <- diamonds %>%
  select(carat:price, -clarity)
# other helpers select variables by matching patterns in their names:
# starts_with: starts with a prefix
# ends_with: ends with a suffix
# contains: contains a literal string
# matches: matches a regular expression
# num_range: matches a numerical range like x01 x02 x03
# (diamonds has no such columns, so num_range adds nothing here)
diamonds_se3 <- diamonds %>%
  select(starts_with("d") | ends_with("y") | contains("i") |
           matches("c.+t") | num_range("x", 1:3, suffix = "t"))
```
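Since `diamonds` has no numbered columns, `num_range()` and the minus sign are easier to see on a small table. The sketch below uses a made-up tibble `toy`; its column names are assumptions chosen only to exercise the helpers.

```{r}
library(dplyr)

# hypothetical toy data, not part of diamonds
toy <- tibble(id = 1:2, x1 = c(0.1, 0.2), x2 = c(1.5, 2.5),
              x3 = c(3, 4), total_cost = c(10, 20))

# num_range("x", 1:2) matches x1 and x2, but not x3
toy_range <- toy %>%
  select(num_range("x", 1:2))

# the minus sign drops every column whose name ends with "cost"
toy_drop <- toy %>%
  select(-ends_with("cost"))

names(toy_range)
names(toy_drop)
```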
# filter
**Use `filter()` to keep the observations (rows) that meet the given conditions.**
```{r}
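# conditions separated by commas are combined with AND
diamonds_fl1 <- diamonds %>%
  filter(cut %in% c("Very Good", "Good"), price > 3100)
# column names and thresholds stored in vectors;
# .data[[name]] looks a column up by its name given as a string
vars <- c("depth", "price")
cond <- c(60, 3000)
diamonds_fl2 <- diamonds %>%
  filter(y > mean(y, na.rm = TRUE),
         .data[[vars[[1]]]] > cond[[1]],
         .data[[vars[[2]]]] > cond[[2]])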
```
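The `.data[[...]]` lookup is most useful inside a function, where the column name arrives as a string. A minimal sketch, assuming a hypothetical helper `filter_above()` and a made-up tibble `toy`:

```{r}
library(dplyr)

# hypothetical helper: keep rows where the column named `var` exceeds `threshold`
filter_above <- function(data, var, threshold) {
  data %>%
    filter(.data[[var]] > threshold)
}

# made-up data for illustration
toy <- tibble(depth = c(59, 61, 63), price = c(2500, 3200, 4000))

deep <- filter_above(toy, "depth", 60)      # rows with depth 61 and 63
pricey <- filter_above(toy, "price", 3500)  # row with price 4000
```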
# mutate
**Use `mutate()` to compute new variables from existing ones.**
```{r}
diamonds_mu1 <- diamonds %>%
  mutate(price1 = price * 0.9,
         cprice = carat * price + price,
         type = if_else(price > mean(price, na.rm = TRUE), ">mean", "<=mean"))
```
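Within a single `mutate()` call, a later expression can refer to a column created earlier in the same call. A small sketch on a made-up tibble `toy` (the column names `discounted` and `saving` are assumptions for illustration):

```{r}
library(dplyr)

# made-up data for illustration
toy <- tibble(price = c(100, 200, 300))

toy_mu <- toy %>%
  mutate(discounted = price * 0.9,
         saving = price - discounted,   # uses the column defined one line above
         type = if_else(price > mean(price), ">mean", "<=mean"))
toy_mu
```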
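# summarise
**Usually combined with `group_by()`. Useful summary functions:**

* center: `mean()`, `median()`
* spread: `sd()`, `IQR()`, `mad()`
* range: `min()`, `max()`, `quantile()`
* position: `first()`, `last()`, `nth()`
* count: `n()`, `n_distinct()`
* logical: `any()`, `all()`
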
```{r}
diamonds_groupby <- diamonds %>%
  group_by(color) %>%
  summarise(nrow = n(),
            mean_price = round(mean(price, na.rm = TRUE), 2),
            std_price = round(sd(price, na.rm = TRUE), 2))
```
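Several of the summary functions listed above can be combined in one `summarise()` call. The sketch below uses a made-up tibble `toy` so the results are easy to check by hand; the output column names are assumptions.

```{r}
library(dplyr)

# made-up data: group "a" has values 1, 2, 6 and group "b" has 10, 10
toy <- tibble(group = c("a", "a", "a", "b", "b"),
              value = c(1, 2, 6, 10, 10))

toy_summary <- toy %>%
  group_by(group) %>%
  summarise(n = n(),                        # count
            median_value = median(value),   # center
            max_value = max(value),         # range
            first_value = first(value),     # position
            n_unique = n_distinct(value),   # count of distinct values
            any_above_5 = any(value > 5))   # logical
toy_summary
```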
# arrange
**Use `arrange()` to sort the rows of a dataset.**
```{r}
# arrange() sorts in ascending order by default
# wrap a variable in desc() to sort it in descending order
diamonds_arrange1 <- diamonds %>%
  group_by(color) %>%
  summarise(nrow = n(), mean_price = mean(price), std_price = sd(price)) %>%
  arrange(nrow, desc(mean_price))
```
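When several variables are given, later ones only break ties in the earlier ones. A short sketch on a made-up tibble `toy`:

```{r}
library(dplyr)

# made-up data for illustration
toy <- tibble(g = c("b", "a", "a"), v = c(1, 3, 2))

# sort by g ascending, then by v descending within equal values of g
toy_sorted <- toy %>%
  arrange(g, desc(v))
toy_sorted
```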
# join
Mutating joins:

* inner_join: keep only the observations found in both x and y
    * base R equivalent: `merge(x, y)`
    * SQL equivalent: a select from an inner join
* left_join: keep all observations in x
    * base R equivalent: `merge(x, y, all.x = TRUE)`
    * SQL equivalent: a select from a left join
* right_join: keep all observations in y
    * base R equivalent: `merge(x, y, all.y = TRUE)`
    * SQL equivalent: a select from a right join
* full_join: keep all observations in x and y
    * base R equivalent: `merge(x, y, all.x = TRUE, all.y = TRUE)`
    * SQL equivalent: a select from a full join

Filtering joins:

* semi_join: keep the observations in x that have a match in y
* anti_join: discard the observations in x that have a match in y

```{r}
# use the band_members and band_instruments datasets that ship with dplyr
# to show the four mutating joins and the two filtering joins
inner <- band_members %>%
  inner_join(band_instruments, by = "name")
left <- band_members %>%
  left_join(band_instruments, by = "name")
right <- band_members %>%
  right_join(band_instruments, by = "name")
full <- band_members %>%
  full_join(band_instruments, by = "name")
semi <- band_members %>%
  semi_join(band_instruments, by = "name")
anti <- band_members %>%
  anti_join(band_instruments, by = "name")
```
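Comparing row counts is a quick way to see how the joins differ. The sketch below recomputes each join on `band_members` (x) and `band_instruments` (y) and also checks the base R `merge()` equivalent of the left join; the names `join_rows` and `left_base` are introduced here only for illustration.

```{r}
library(dplyr)

# number of rows returned by each join
join_rows <- c(
  inner = nrow(inner_join(band_members, band_instruments, by = "name")),
  left  = nrow(left_join(band_members, band_instruments, by = "name")),
  right = nrow(right_join(band_members, band_instruments, by = "name")),
  full  = nrow(full_join(band_members, band_instruments, by = "name")),
  semi  = nrow(semi_join(band_members, band_instruments, by = "name")),
  anti  = nrow(anti_join(band_members, band_instruments, by = "name"))
)
join_rows

# base R equivalent of the left join
left_base <- merge(band_members, band_instruments, by = "name", all.x = TRUE)
nrow(left_base)
```

Only John and Paul appear in both tables, so the inner and semi joins return two rows, while the anti join returns the single unmatched member (Mick).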