tidyverse学习笔记——Data tidying篇

Data tidying

Rules that make a dataset clean

  1. Each variable is a column; each column is a variable.
  2. Each observation is a row; each row is an observation.
  3. Each value is a cell; each cell is a single value.

pivot

pivot helps to tidy the datasets.

Lengthening data

pivot_longer

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

Data in column names
artistdate.enteredwk1wk2wk3wk4wk5
2 Pac2000-02-268782727787
2Ge+her2000-09-02918792NANA
3 Doors Down2000-04-088170686766
3 Doors Down2000-10-217676726967
504 Boyz2000-04-155734251717
98^02000-08-195139342626
billboard |> 
    pivot_longer(
        cols = starts_with("wk"), 
        names_to = "week", 
        values_to = "rank",
        values_drop_na = TRUE
    )
Variables in column names
countryyearsp_m_014sp_m_1524sp_f_2534sp_m_3544sp_m_4554
Afghanistan1980NANANANANA
Afghanistan1981NANANANANA
Afghanistan1982NANANANANA
Afghanistan1983NANANANANA
Afghanistan1984NANANANANA
Afghanistan1985NANANANANA

sp” refers to diagnosis, “m” and “f” denote gender. “014” indicates age between 0 and 14.

who2 |> 
    pivot_longer(
        cols = !(country:year),
        names_to = c("diagnosis", "gender", "age"), 
        names_sep = "_",
        values_to = "count"
    )

Widening data

cms_patient_experience |> 
    pivot_wider(
        names_from = measure_cd,
        values_from = prf_rate
    )

Missing values

Identifying NA values
During import
read_csv(path, na = )
  • na specifies which strings in the data should be interpreted as missing values (NA). By default, na = c("")(i.e., empty strings are treated as NA).
na_if

na_if(x, y) modifies the vector x by replacing any elements that are equal to y with NA.

na_if(x, y)
  1. x: The vector to modify.

  2. y: The value or vector to compare against. y must either be a single value or have the same length as x.

  • If y is a single value, it is recycled to match the length of x before comparison.

  • If x and y have the same length, corresponding elements are compared.

Implicit missing value

An explicit missing value is the presence of an absence. An implicit missing value is the absence of a presence.

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)
  • The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.

  • The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

pivot

Use pivoting to make implicit missings explicit:

stocks |>
  pivot_wider(
    names_from = qtr, 
    values_from = price
  )
#> # A tibble: 2 × 5
#>    year   `1`   `2`   `3`   `4`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2020  1.88  0.59  0.35 NA   
#> 2  2021 NA     0.92  0.17  2.66
complete

complete() turns implicit missing values into explicit missing values. It generates all possible combinations of specified columns including those absent in the original tibble.

complete(data, ..., fill = list(), explicit = TRUE)
  1. data: A data frame.

  2. ...:

<data-masking>Specification of columns to expand or complete. Columns can be atomic vectors or lists.

  • To find all unique combinations of x, y and z, including those not present in the data, supply each variable as a separate argument: expand(df, x, y, z) or complete(df, x, y, z).

  • To find only the combinations that occur in the data, use nesting: ``expand(df, nesting(school_id, student_id), date)` would produce a row for each present school-student combination for all possible dates.

  • When used with factors, complete() use the full set of levels, not just those that appear in the data. To retain only the factor levels present in the data, use forcats::fct_drop().

  1. fill:

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

  1. explicit

Should both implicit (newly created) and explicit (pre-existing) missing values be filled by fill? By default, this is TRUE, but if set to FALSE this will limit the fill to only implicit missing values.

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)
df
#> # A tibble: 4 × 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <dbl>  <int>
#> 1     1       1 a              1      4
#> 2     2       2 a             NA      5
#> 3     1       2 b              3      6
#> 4     2       3 b              4      7

# Limit the fill to only the newly created (i.e. previously implicit)
# missing values with `explicit = FALSE`
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99),
    explicit = FALSE
  )
#> # A tibble: 8 × 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <dbl>  <int>
#> 1     1       1 a              1      4
#> 2     1       2 a              0     99
#> 3     1       2 b              3      6
#> 4     1       3 b              0     99
#> 5     2       1 a              0     99
#> 6     2       2 a             NA      5
#> 7     2       2 b              0     99
#> 8     2       3 b              4      7
Fill missing values
LOCF—Last observation carried forward

fill() takes a set of columns and fills missing values in the selected columns using the next or previous entry.

fill(data, ..., .direction = c("down", "up", "downup", "updown"))
  1. data: A data frame.

  2. ...: <tidy-select> Columns to fill.

  3. .direction: Direction in which to fill missing values.

  • "down": Fills missing values using the nearest non-missing value from above.
  • "up": Fills missing values using the nearest non-missing value from below.
  • "downup": First applies down fill, then fills any remaining missing values using up fill.
  • "updown": First applies up fill, then fills any remaining missing values using down fill.
Fixed values
coalesce

coalesce() combines multiple vectors by returning the first non-missing value at each position. If all values at a position are missing (NA), the result will also be NA.

coalesce(...)
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
#> [1] 1 2 3 4 5
replace_na

replace_na() replaces NA values in a vector or data frame with a specified replacement value.

replace_na(data, replace)
  1. data: A data frame or vector.

  2. replace

  • If data is a data frame, replace takes a named list of values, with one value for each column that has missing values to be replaced.

  • If data is a vector, replace takes a single value. This single value replaces all of the missing values in the vector.

# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#> # A tibble: 3 × 2
#>       x y      
#>   <dbl> <chr>  
#> 1     1 a      
#> 2     2 unknown
#> 3     0 b      

# Replace NAs in a vector
df %>% mutate(x = replace_na(x, 0))
#> # A tibble: 3 × 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     2 NA   
#> 3     0 b  

The replacement value must match the type of the vector or column. For example:

# Replace NULLs in a list: NULLs are the list-col equivalent of NAs
df_list <- tibble(z = list(1:5, NULL, 10:20))
df_list %>% replace_na(list(z = list(5)))
#> # A tibble: 3 × 1
#>   z         
#>   <list>    
#> 1 <int [5]> 
#> 2 <dbl [1]> 
#> 3 <int [11]>
# if:
df_list %>% replace_na(list(z = 5))
# Error in `vec_assign()`:
# ! Can't convert `replace$z` <double> to match type of `data$z` <list>.
# Run `rlang::last_trace()` to see where the error occurred.

References

R4DS

评论
添加红包

请填写红包祝福语或标题

红包个数最小为10个

红包金额最低5元

当前余额3.43前往充值 >
需支付:10.00
成就一亿技术人!
领取后你会自动成为博主和红包主的粉丝 规则
hope_wisdom
发出的红包
实付
使用余额支付
点击重新获取
扫码支付
钱包余额 0

抵扣说明:

1.余额是钱包充值的虚拟货币,按照1:1的比例进行支付金额的抵扣。
2.余额无法直接购买下载,可以购买VIP、付费专栏及课程。

余额充值