tidyverse学习笔记——Data tidying篇_lengthening data-优快云博客

Data tidying

Rules that make a dataset clean

Each variable is a column; each column is a variable.
Each observation is a row; each row is an observation.
Each value is a cell; each cell is a single value.

pivot

pivot helps to tidy the datasets.

Lengthening data

pivot_longer

pivot_longer() “lengthens” data, increasing the number of rows and decreasing the number of columns.

Data in column names

artist	date.entered	wk1	wk2	wk3	wk4	wk5
2 Pac	2000-02-26	87	82	72	77	87
2Ge+her	2000-09-02	91	87	92	NA	NA
3 Doors Down	2000-04-08	81	70	68	67	66
3 Doors Down	2000-10-21	76	76	72	69	67
504 Boyz	2000-04-15	57	34	25	17	17
98^0	2000-08-19	51	39	34	26	26

billboard |> 
    pivot_longer(
        cols = starts_with("wk"), 
        names_to = "week", 
        values_to = "rank",
        values_drop_na = TRUE
    )

Variables in column names

country	year	sp_m_014	sp_m_1524	sp_f_2534	sp_m_3544	sp_m_4554
Afghanistan	1980	NA	NA	NA	NA	NA
Afghanistan	1981	NA	NA	NA	NA	NA
Afghanistan	1982	NA	NA	NA	NA	NA
Afghanistan	1983	NA	NA	NA	NA	NA
Afghanistan	1984	NA	NA	NA	NA	NA
Afghanistan	1985	NA	NA	NA	NA	NA

“sp” refers to diagnosis, “m” and “f” denote gender. “014” indicates age between 0 and 14.

who2 |> 
    pivot_longer(
        cols = !(country:year),
        names_to = c("diagnosis", "gender", "age"), 
        names_sep = "_",
        values_to = "count"
    )

Widening data

cms_patient_experience |> 
    pivot_wider(
        names_from = measure_cd,
        values_from = prf_rate
    )

Missing values

Identifying NA values

During import

read_csv(path, na = )

na specifies which strings in the data should be interpreted as missing values (NA). By default, na = c("")(i.e., empty strings are treated as NA).

na_if

na_if(x, y) modifies the vector x by replacing any elements that are equal to y with NA.

na_if(x, y)

x: The vector to modify.
y: The value or vector to compare against. y must either be a single value or have the same length as x.

If y is a single value, it is recycled to match the length of x before comparison.
If x and y have the same length, corresponding elements are compared.

Implicit missing value

An explicit missing value is the presence of an absence. An implicit missing value is the absence of a presence.

stocks <- tibble(
  year  = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
  qtr   = c(   1,    2,    3,    4,    2,    3,    4),
  price = c(1.88, 0.59, 0.35,   NA, 0.92, 0.17, 2.66)
)

The price in the fourth quarter of 2020 is explicitly missing, because its value is NA.
The price for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.

pivot

Use pivoting to make implicit missings explicit:

stocks |>
  pivot_wider(
    names_from = qtr, 
    values_from = price
  )
#> # A tibble: 2 × 5
#>    year   `1`   `2`   `3`   `4`
#>   <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1  2020  1.88  0.59  0.35 NA   
#> 2  2021 NA     0.92  0.17  2.66

complete

complete() turns implicit missing values into explicit missing values. It generates all possible combinations of specified columns including those absent in the original tibble.

complete(data, ..., fill = list(), explicit = TRUE)

data: A data frame.
...:

<data-masking>Specification of columns to expand or complete. Columns can be atomic vectors or lists.

To find all unique combinations of x, y and z, including those not present in the data, supply each variable as a separate argument: expand(df, x, y, z) or complete(df, x, y, z).
To find only the combinations that occur in the data, use nesting: ``expand(df, nesting(school_id, student_id), date)` would produce a row for each present school-student combination for all possible dates.
When used with factors, complete() use the full set of levels, not just those that appear in the data. To retain only the factor levels present in the data, use forcats::fct_drop().

fill:

A named list that for each variable supplies a single value to use instead of NA for missing combinations.

explicit

Should both implicit (newly created) and explicit (pre-existing) missing values be filled by fill? By default, this is TRUE, but if set to FALSE this will limit the fill to only implicit missing values.

df <- tibble(
  group = c(1:2, 1, 2),
  item_id = c(1:2, 2, 3),
  item_name = c("a", "a", "b", "b"),
  value1 = c(1, NA, 3, 4),
  value2 = 4:7
)
df
#> # A tibble: 4 × 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <dbl>  <int>
#> 1     1       1 a              1      4
#> 2     2       2 a             NA      5
#> 3     1       2 b              3      6
#> 4     2       3 b              4      7

# Limit the fill to only the newly created (i.e. previously implicit)
# missing values with `explicit = FALSE`
df %>%
  complete(
    group,
    nesting(item_id, item_name),
    fill = list(value1 = 0, value2 = 99),
    explicit = FALSE
  )
#> # A tibble: 8 × 5
#>   group item_id item_name value1 value2
#>   <dbl>   <dbl> <chr>      <dbl>  <int>
#> 1     1       1 a              1      4
#> 2     1       2 a              0     99
#> 3     1       2 b              3      6
#> 4     1       3 b              0     99
#> 5     2       1 a              0     99
#> 6     2       2 a             NA      5
#> 7     2       2 b              0     99
#> 8     2       3 b              4      7

Fill missing values

LOCF—Last observation carried forward

fill() takes a set of columns and fills missing values in the selected columns using the next or previous entry.

fill(data, ..., .direction = c("down", "up", "downup", "updown"))

data: A data frame.
...: <tidy-select> Columns to fill.
.direction: Direction in which to fill missing values.

"down": Fills missing values using the nearest non-missing value from above.
"up": Fills missing values using the nearest non-missing value from below.
"downup": First applies down fill, then fills any remaining missing values using up fill.
"updown": First applies up fill, then fills any remaining missing values using down fill.

Fixed values

coalesce

coalesce() combines multiple vectors by returning the first non-missing value at each position. If all values at a position are missing (NA), the result will also be NA.

coalesce(...)

y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
#> [1] 1 2 3 4 5

replace_na

replace_na() replaces NA values in a vector or data frame with a specified replacement value.

replace_na(data, replace)

data: A data frame or vector.
replace

If data is a data frame, replace takes a named list of values, with one value for each column that has missing values to be replaced.
If data is a vector, replace takes a single value. This single value replaces all of the missing values in the vector.

# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#> # A tibble: 3 × 2
#>       x y      
#>   <dbl> <chr>  
#> 1     1 a      
#> 2     2 unknown
#> 3     0 b      

# Replace NAs in a vector
df %>% mutate(x = replace_na(x, 0))
#> # A tibble: 3 × 2
#>       x y    
#>   <dbl> <chr>
#> 1     1 a    
#> 2     2 NA   
#> 3     0 b

The replacement value must match the type of the vector or column. For example:

# Replace NULLs in a list: NULLs are the list-col equivalent of NAs
df_list <- tibble(z = list(1:5, NULL, 10:20))
df_list %>% replace_na(list(z = list(5)))
#> # A tibble: 3 × 1
#>   z         
#>   <list>    
#> 1 <int [5]> 
#> 2 <dbl [1]> 
#> 3 <int [11]>
# if:
df_list %>% replace_na(list(z = 5))
# Error in `vec_assign()`:
# ! Can't convert `replace$z` <double> to match type of `data$z` <list>.
# Run `rlang::last_trace()` to see where the error occurred.

References

R4DS