Data tidying
Rules that make a dataset clean
- Each variable is a column; each column is a variable.
- Each observation is a row; each row is an observation.
- Each value is a cell; each cell is a single value.
pivot
pivot
helps to tidy the datasets.
Lengthening data
pivot_longer
pivot_longer()
“lengthens” data, increasing the number of rows and decreasing the number of columns.
Data in column names
artist | date.entered | wk1 | wk2 | wk3 | wk4 | wk5 |
---|---|---|---|---|---|---|
2 Pac | 2000-02-26 | 87 | 82 | 72 | 77 | 87 |
2Ge+her | 2000-09-02 | 91 | 87 | 92 | NA | NA |
3 Doors Down | 2000-04-08 | 81 | 70 | 68 | 67 | 66 |
3 Doors Down | 2000-10-21 | 76 | 76 | 72 | 69 | 67 |
504 Boyz | 2000-04-15 | 57 | 34 | 25 | 17 | 17 |
98^0 | 2000-08-19 | 51 | 39 | 34 | 26 | 26 |
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
Variables in column names
country | year | sp_m_014 | sp_m_1524 | sp_f_2534 | sp_m_3544 | sp_m_4554 |
---|---|---|---|---|---|---|
Afghanistan | 1980 | NA | NA | NA | NA | NA |
Afghanistan | 1981 | NA | NA | NA | NA | NA |
Afghanistan | 1982 | NA | NA | NA | NA | NA |
Afghanistan | 1983 | NA | NA | NA | NA | NA |
Afghanistan | 1984 | NA | NA | NA | NA | NA |
Afghanistan | 1985 | NA | NA | NA | NA | NA |
“sp” refers to diagnosis, “m” and “f” denote gender. “014” indicates age between 0 and 14.
who2 |>
pivot_longer(
cols = !(country:year),
names_to = c("diagnosis", "gender", "age"),
names_sep = "_",
values_to = "count"
)
Widening data
cms_patient_experience |>
pivot_wider(
names_from = measure_cd,
values_from = prf_rate
)
Missing values
Identifying NA values
During import
read_csv(path, na = )
na
specifies which strings in the data should be interpreted as missing values (NA
). By default,na = c("")
(i.e., empty strings are treated asNA
).
na_if
na_if(x, y)
modifies the vector x
by replacing any elements that are equal to y
with NA
.
na_if(x, y)
-
x
: The vector to modify. -
y
: The value or vector to compare against.y
must either be a single value or have the same length asx
.
-
If
y
is a single value, it is recycled to match the length ofx
before comparison. -
If
x
andy
have the same length, corresponding elements are compared.
Implicit missing value
An explicit missing value is the presence of an absence. An implicit missing value is the absence of a presence.
stocks <- tibble(
year = c(2020, 2020, 2020, 2020, 2021, 2021, 2021),
qtr = c( 1, 2, 3, 4, 2, 3, 4),
price = c(1.88, 0.59, 0.35, NA, 0.92, 0.17, 2.66)
)
-
The
price
in the fourth quarter of 2020 is explicitly missing, because its value isNA
. -
The
price
for the first quarter of 2021 is implicitly missing, because it simply does not appear in the dataset.
pivot
Use pivoting to make implicit missings explicit:
stocks |>
pivot_wider(
names_from = qtr,
values_from = price
)
#> # A tibble: 2 × 5
#> year `1` `2` `3` `4`
#> <dbl> <dbl> <dbl> <dbl> <dbl>
#> 1 2020 1.88 0.59 0.35 NA
#> 2 2021 NA 0.92 0.17 2.66
complete
complete()
turns implicit missing values into explicit missing values. It generates all possible combinations of specified columns including those absent in the original tibble.
complete(data, ..., fill = list(), explicit = TRUE)
-
data
: A data frame. -
...
:
<data-masking>
Specification of columns to expand or complete. Columns can be atomic vectors or lists.
-
To find all unique combinations of
x
,y
andz
, including those not present in the data, supply each variable as a separate argument:expand(df, x, y, z)
orcomplete(df, x, y, z)
. -
To find only the combinations that occur in the data, use
nesting
: ``expand(df, nesting(school_id, student_id), date)` would produce a row for each present school-student combination for all possible dates. -
When used with factors,
complete()
use the full set of levels, not just those that appear in the data. To retain only the factor levels present in the data, useforcats::fct_drop()
.
fill
:
A named list that for each variable supplies a single value to use instead of NA
for missing combinations.
explicit
Should both implicit (newly created) and explicit (pre-existing) missing values be filled by fill
? By default, this is TRUE
, but if set to FALSE
this will limit the fill to only implicit missing values.
df <- tibble(
group = c(1:2, 1, 2),
item_id = c(1:2, 2, 3),
item_name = c("a", "a", "b", "b"),
value1 = c(1, NA, 3, 4),
value2 = 4:7
)
df
#> # A tibble: 4 × 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <chr> <dbl> <int>
#> 1 1 1 a 1 4
#> 2 2 2 a NA 5
#> 3 1 2 b 3 6
#> 4 2 3 b 4 7
# Limit the fill to only the newly created (i.e. previously implicit)
# missing values with `explicit = FALSE`
df %>%
complete(
group,
nesting(item_id, item_name),
fill = list(value1 = 0, value2 = 99),
explicit = FALSE
)
#> # A tibble: 8 × 5
#> group item_id item_name value1 value2
#> <dbl> <dbl> <chr> <dbl> <int>
#> 1 1 1 a 1 4
#> 2 1 2 a 0 99
#> 3 1 2 b 3 6
#> 4 1 3 b 0 99
#> 5 2 1 a 0 99
#> 6 2 2 a NA 5
#> 7 2 2 b 0 99
#> 8 2 3 b 4 7
Fill missing values
LOCF—Last observation carried forward
fill()
takes a set of columns and fills missing values in the selected columns using the next or previous entry.
fill(data, ..., .direction = c("down", "up", "downup", "updown"))
-
data
: A data frame. -
...
:<tidy-select>
Columns to fill. -
.direction
: Direction in which to fill missing values.
"down"
: Fills missing values using the nearest non-missing value from above."up"
: Fills missing values using the nearest non-missing value from below."downup"
: First applies down fill, then fills any remaining missing values using up fill."updown"
: First applies up fill, then fills any remaining missing values using down fill.
Fixed values
coalesce
coalesce()
combines multiple vectors by returning the first non-missing value at each position. If all values at a position are missing (NA
), the result will also be NA
.
coalesce(...)
y <- c(1, 2, NA, NA, 5)
z <- c(NA, NA, 3, 4, 5)
coalesce(y, z)
#> [1] 1 2 3 4 5
replace_na
replace_na()
replaces NA
values in a vector or data frame with a specified replacement value.
replace_na(data, replace)
-
data
: A data frame or vector. -
replace
-
If
data
is a data frame,replace
takes a named list of values, with one value for each column that has missing values to be replaced. -
If
data
is a vector,replace
takes a single value. This single value replaces all of the missing values in the vector.
# Replace NAs in a data frame
df <- tibble(x = c(1, 2, NA), y = c("a", NA, "b"))
df %>% replace_na(list(x = 0, y = "unknown"))
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 unknown
#> 3 0 b
# Replace NAs in a vector
df %>% mutate(x = replace_na(x, 0))
#> # A tibble: 3 × 2
#> x y
#> <dbl> <chr>
#> 1 1 a
#> 2 2 NA
#> 3 0 b
The replacement value must match the type of the vector or column. For example:
# Replace NULLs in a list: NULLs are the list-col equivalent of NAs
df_list <- tibble(z = list(1:5, NULL, 10:20))
df_list %>% replace_na(list(z = list(5)))
#> # A tibble: 3 × 1
#> z
#> <list>
#> 1 <int [5]>
#> 2 <dbl [1]>
#> 3 <int [11]>
# if:
df_list %>% replace_na(list(z = 5))
# Error in `vec_assign()`:
# ! Can't convert `replace$z` <double> to match type of `data$z` <list>.
# Run `rlang::last_trace()` to see where the error occurred.