This vignette can be referred to by citing the following:
Patil et al., (2022). datawizard: An R Package for Easy Data Preparation and Statistical Transformations. Journal of Open Source Software, 7(78), 4684, https://doi.org/10.21105/joss.04684
Introduction
datawizard package aims to make basic data wrangling
easier than with base R. The data wrangling workflow it supports is
similar to the one supported by the tidyverse package combination of
dplyr and tidyr. However, one of its main
features is that it has a very few dependencies: {stats}
and {utils}
(included in base R) and
insight, which is the core package of the
easystats ecosystem. This package grew organically to
simultaneously satisfy the “0 non-base hard dependency” principle of
easystats and the data wrangling needs of the constituent
packages in this ecosystem. It is also important to note that
datawizard was designed to avoid namespace collisions
with tidyverse packages.
In this article, we will see how to go through basic data wrangling steps with datawizard. We will also compare it to the tidyverse syntax for achieving the same. This way, if you decide to make the switch, you can easily find the translations here. This vignette is largely inspired from dplyr’s Getting started vignette.
Note: In this vignette, we use the native pipe-operator,
|>
, which was introduced in R 4.1. Users of R version
3.6 or 4.0 should replace the native pipe by magrittr’s one
(%>%
) so that examples work.
Workhorses
Before we look at their tidyverse equivalents, we can first have a look at datawizard’s key functions for data wrangling:
Note that there are a few functions in datawizard that
have no strict equivalent in dplyr or
tidyr (e.g data_rotate()
), and so we won’t
discuss them in the next section.
Equivalence with {dplyr}
/ {tidyr}
Before we look at them individually, let’s first have a look at the summary table of this equivalence.
Filtering
data_filter()
is a wrapper around subset()
.
However, if you want to have several filtering conditions, you can
either use &
(as in subset()
) or
,
(as in dplyr::filter()
).
# ---------- datawizard -----------
starwars |>
data_filter(
skin_color == "light",
eye_color == "brown"
)
# or
starwars |>
data_filter(
skin_color == "light" &
eye_color == "brown"
)
# ---------- tidyverse -----------
starwars |>
filter(
skin_color == "light",
eye_color == "brown"
)
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Padmé Am… 185 45 brown light brown 46 fema… femin…
## 4 Cordé 157 NA brown light brown NA NA NA
## 5 Dormé 165 NA brown light brown NA fema… femin…
## 6 Raymus A… 188 79 brown light brown NA male mascu…
## 7 Poe Dame… NA NA brown light brown NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
## # A tibble: 7 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Leia Org… 150 49 brown light brown 19 fema… femin…
## 2 Biggs Da… 183 84 black light brown 24 male mascu…
## 3 Padmé Am… 185 45 brown light brown 46 fema… femin…
## 4 Cordé 157 NA brown light brown NA NA NA
## 5 Dormé 165 NA brown light brown NA fema… femin…
## 6 Raymus A… 188 79 brown light brown NA male mascu…
## 7 Poe Dame… NA NA brown light brown NA male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Selecting
data_select()
is the equivalent of
dplyr::select()
. The main difference between these two
functions is that data_select()
uses two arguments
(select
and exclude
) and requires quoted
column names if we want to select several variables, while
dplyr::select()
accepts any unquoted column names.
# ---------- datawizard -----------
starwars |>
data_select(select = c("hair_color", "skin_color", "eye_color"))
# ---------- tidyverse -----------
starwars |>
select(hair_color, skin_color, eye_color)
## # A tibble: 6 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 NA gold yellow
## 3 NA white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
# ---------- datawizard -----------
starwars |>
data_select(select = -ends_with("color"))
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
# ---------- datawizard -----------
starwars |>
data_select(select = -(hair_color:eye_color))
# ---------- tidyverse -----------
starwars |>
select(!(hair_color:eye_color))
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
# ---------- datawizard -----------
starwars |>
data_select(exclude = regex("color$"))
## # A tibble: 6 × 11
## name height mass birth_year sex gender homeworld species films vehicles
## <chr> <int> <dbl> <dbl> <chr> <chr> <chr> <chr> <lis> <list>
## 1 Luke Sk… 172 77 19 male mascu… Tatooine Human <chr> <chr>
## 2 C-3PO 167 75 112 none mascu… Tatooine Droid <chr> <chr>
## 3 R2-D2 96 32 33 none mascu… Naboo Droid <chr> <chr>
## 4 Darth V… 202 136 41.9 male mascu… Tatooine Human <chr> <chr>
## 5 Leia Or… 150 49 19 fema… femin… Alderaan Human <chr> <chr>
## 6 Owen La… 178 120 52 male mascu… Tatooine Human <chr> <chr>
## # ℹ 1 more variable: starships <list>
# ---------- datawizard -----------
starwars |>
data_select(select = is.numeric)
## # A tibble: 6 × 3
## height mass birth_year
## <int> <dbl> <dbl>
## 1 172 77 19
## 2 167 75 112
## 3 96 32 33
## 4 202 136 41.9
## 5 150 49 19
## 6 178 120 52
You can find a list of all the select helpers with
?data_select
.
Modifying
data_modify()
is a wrapper around
base::transform()
but has several additional benefits:
- it allows us to use newly created variables in the following expressions;
- it works with grouped data;
- it preserves variable attributes such as labels;
- it accepts expressions as character vectors so that it is easy to program with it
This last point is also the main difference between
data_modify()
and dplyr::mutate()
.
# ---------- datawizard -----------
efc |>
data_modify(
c12hour_c = center(c12hour),
c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
c12hour_z2 = standardize(c12hour)
)
# ---------- tidyverse -----------
efc |>
mutate(
c12hour_c = center(c12hour),
c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE),
c12hour_z2 = standardize(c12hour)
)
## c12hour e16sex e42dep c172code neg_c_7 c12hour_c c12hour_z c12hour_z2
## 1 16 2 3 2 12 -67.6 -0.9420928 -0.9420928
## 2 148 2 3 2 20 64.4 0.8974967 0.8974967
## 3 70 2 3 1 11 -13.6 -0.1895335 -0.1895335
## 4 NA 2 <NA> 2 10 NA NA NA
## 5 168 2 4 2 12 84.4 1.1762224 1.1762224
## 6 16 2 4 2 19 -67.6 -0.9420928 -0.9420928
data_modify()
accepts expressions as strings:
new_exp <- c(
"c12hour_c = center(c12hour)",
"c12hour_z = c12hour_c / sd(c12hour, na.rm = TRUE)"
)
data_modify(efc, new_exp)
## c12hour e16sex e42dep c172code neg_c_7 c12hour_c c12hour_z
## 1 16 2 3 2 12 -67.6 -0.9420928
## 2 148 2 3 2 20 64.4 0.8974967
## 3 70 2 3 1 11 -13.6 -0.1895335
## 4 NA 2 <NA> 2 10 NA NA
## 5 168 2 4 2 12 84.4 1.1762224
## 6 16 2 4 2 19 -67.6 -0.9420928
This makes it easy to use it in custom functions:
miles_to_km <- function(data, var) {
data_modify(
data,
paste0("km = ", var, "* 1.609344")
)
}
distance <- data.frame(miles = c(1, 8, 233, 88, 9))
distance
## miles
## 1 1
## 2 8
## 3 233
## 4 88
## 5 9
miles_to_km(distance, "miles")
## miles km
## 1 1 1.609344
## 2 8 12.874752
## 3 233 374.977152
## 4 88 141.622272
## 5 9 14.484096
Sorting
data_arrange()
is the equivalent of
dplyr::arrange()
. It takes two arguments: a data frame, and
a vector of column names used to sort the rows. Note that contrary to
most other functions in datawizard, it is not possible to
use select helpers such as starts_with()
in
data_arrange()
.
# ---------- datawizard -----------
starwars |>
data_arrange(c("hair_color", "height"))
# ---------- tidyverse -----------
starwars |>
arrange(hair_color, height)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 2 Leia Org… 150 49 brown light brown 19 fema… femin…
## 3 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 4 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 5 R2-D2 96 32 NA white, bl… red 33 none mascu…
## 6 C-3PO 167 75 NA gold yellow 112 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
You can also sort variables in descending order by putting a
"-"
in front of their name, like below:
# ---------- datawizard -----------
starwars |>
data_arrange(c("-hair_color", "-height"))
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Darth Va… 202 136 none white yellow 41.9 male mascu…
## 2 Owen Lars 178 120 brown, gr… light blue 52 male mascu…
## 3 Leia Org… 150 49 brown light brown 19 fema… femin…
## 4 Luke Sky… 172 77 blond fair blue 19 male mascu…
## 5 C-3PO 167 75 NA gold yellow 112 none mascu…
## 6 R2-D2 96 32 NA white, bl… red 33 none mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
Extracting
Although we mostly work on data frames, it is sometimes useful to
extract a single column as a vector. This can be done with
data_extract()
, which reproduces the behavior of
dplyr::pull()
:
# ---------- datawizard -----------
starwars |>
data_extract(gender)
# ---------- tidyverse -----------
starwars |>
pull(gender)
## [1] "masculine" "masculine" "masculine" "masculine" "feminine" "masculine"
We can also specify several variables in select
. In this
case, data_extract()
is equivalent to
data_select()
:
starwars |>
data_extract(select = contains("color"))
## # A tibble: 6 × 3
## hair_color skin_color eye_color
## <chr> <chr> <chr>
## 1 blond fair blue
## 2 NA gold yellow
## 3 NA white, blue red
## 4 none white yellow
## 5 brown light brown
## 6 brown, grey light blue
Renaming
data_rename()
is the equivalent of
dplyr::rename()
but the syntax between the two is
different. While dplyr::rename()
takes new-old pairs of
column names, data_rename()
requires a vector of column
names to rename, and then a vector of new names for these columns that
must be of the same length.
# ---------- datawizard -----------
starwars |>
data_rename(
pattern = c("sex", "hair_color"),
replacement = c("Sex", "Hair Color")
)
# ---------- tidyverse -----------
starwars |>
rename(
Sex = sex,
"Hair Color" = hair_color
)
## # A tibble: 6 × 14
## name height mass `Hair Color` skin_color eye_color birth_year Sex gender
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
## 1 Luke S… 172 77 blond fair blue 19 male mascu…
## 2 C-3PO 167 75 NA gold yellow 112 none mascu…
## 3 R2-D2 96 32 NA white, bl… red 33 none mascu…
## 4 Darth … 202 136 none white yellow 41.9 male mascu…
## 5 Leia O… 150 49 brown light brown 19 fema… femin…
## 6 Owen L… 178 120 brown, grey light blue 52 male mascu…
## # ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
The way data_rename()
is designed makes it easy to apply
the same modifications to a vector of column names. For example, we can
remove underscores and use TitleCase with the following code:
to_rename <- names(starwars)
starwars |>
data_rename(
pattern = to_rename,
replacement = tools::toTitleCase(gsub("_", " ", to_rename, fixed = TRUE))
)
## # A tibble: 6 × 14
## Name Height Mass `Hair Color` `Skin Color` `Eye Color` `Birth Year` Sex
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr>
## 1 Luke Sk… 172 77 blond fair blue 19 male
## 2 C-3PO 167 75 NA gold yellow 112 none
## 3 R2-D2 96 32 NA white, blue red 33 none
## 4 Darth V… 202 136 none white yellow 41.9 male
## 5 Leia Or… 150 49 brown light brown 19 fema…
## 6 Owen La… 178 120 brown, grey light blue 52 male
## # ℹ 6 more variables: Gender <chr>, Homeworld <chr>, Species <chr>,
## # Films <list>, Vehicles <list>, Starships <list>
It is also possible to add a prefix or a suffix to all or a subset of
variables with data_addprefix()
and
data_addsuffix()
. The argument select
accepts
all select helpers that we saw above with
data_select()
:
starwars |>
data_addprefix(
pattern = "OLD.",
select = contains("color")
) |>
data_addsuffix(
pattern = ".NEW",
select = -contains("color")
)
## # A tibble: 6 × 14
## name.NEW height.NEW mass.NEW OLD.hair_color OLD.skin_color OLD.eye_color
## <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skywalker 172 77 blond fair blue
## 2 C-3PO 167 75 NA gold yellow
## 3 R2-D2 96 32 NA white, blue red
## 4 Darth Vader 202 136 none white yellow
## 5 Leia Organa 150 49 brown light brown
## 6 Owen Lars 178 120 brown, grey light blue
## # ℹ 8 more variables: birth_year.NEW <dbl>, sex.NEW <chr>, gender.NEW <chr>,
## # homeworld.NEW <chr>, species.NEW <chr>, films.NEW <list>,
## # vehicles.NEW <list>, starships.NEW <list>
Relocating
Sometimes, we want to relocate one or a small subset of columns in
the dataset. Rather than typing many names in
data_select()
, we can use data_relocate()
,
which is the equivalent of dplyr::relocate()
. Just like
data_select()
, we can specify a list of variables we want
to relocate with select
and exclude
. Then, the
arguments before
and after
1 specify where the
selected columns should be relocated:
# ---------- datawizard -----------
starwars |>
data_relocate(sex:homeworld, before = "height")
# ---------- tidyverse -----------
starwars |>
relocate(sex:homeworld, .before = height)
## # A tibble: 6 × 14
## name sex gender homeworld height mass hair_color skin_color eye_color
## <chr> <chr> <chr> <chr> <int> <dbl> <chr> <chr> <chr>
## 1 Luke Skyw… male mascu… Tatooine 172 77 blond fair blue
## 2 C-3PO none mascu… Tatooine 167 75 NA gold yellow
## 3 R2-D2 none mascu… Naboo 96 32 NA white, bl… red
## 4 Darth Vad… male mascu… Tatooine 202 136 none white yellow
## 5 Leia Orga… fema… femin… Alderaan 150 49 brown light brown
## 6 Owen Lars male mascu… Tatooine 178 120 brown, gr… light blue
## # ℹ 5 more variables: birth_year <dbl>, species <chr>, films <list>,
## # vehicles <list>, starships <list>
In addition to column names, before
and
after
accept column indices. Finally, one can use
before = -1
to relocate the selected columns just before
the last column, or after = -1
to relocate them after the
last column.
# ---------- datawizard -----------
starwars |>
data_relocate(sex:homeworld, after = -1)
## # A tibble: 6 × 14
## name height mass hair_color skin_color eye_color birth_year species films
## <chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <lis>
## 1 Luke Sk… 172 77 blond fair blue 19 Human <chr>
## 2 C-3PO 167 75 NA gold yellow 112 Droid <chr>
## 3 R2-D2 96 32 NA white, bl… red 33 Droid <chr>
## 4 Darth V… 202 136 none white yellow 41.9 Human <chr>
## 5 Leia Or… 150 49 brown light brown 19 Human <chr>
## 6 Owen La… 178 120 brown, gr… light blue 52 Human <chr>
## # ℹ 5 more variables: vehicles <list>, starships <list>, sex <chr>,
## # gender <chr>, homeworld <chr>
Reshaping
Longer
Reshaping data from wide to long or from long to wide format can be
done with data_to_long()
and data_to_wide()
.
These functions were designed to match
tidyr::pivot_longer()
and tidyr::pivot_wider()
arguments, so that the only thing to do is to change the function name.
However, not all of tidyr::pivot_longer()
and
tidyr::pivot_wider()
features are available yet.
We will use the relig_income
dataset, as in the {tidyr}
vignette.
relig_income
## # A tibble: 18 × 11
## religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
## <chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 Agnostic 27 34 60 81 76 137 122
## 2 Atheist 12 27 37 52 35 70 73
## 3 Buddhist 27 21 30 34 33 58 62
## 4 Catholic 418 617 732 670 638 1116 949
## 5 Don’t k… 15 14 15 11 10 35 21
## 6 Evangel… 575 869 1064 982 881 1486 949
## 7 Hindu 1 9 7 9 11 34 47
## 8 Histori… 228 244 236 238 197 223 131
## 9 Jehovah… 20 27 24 24 21 30 15
## 10 Jewish 19 19 25 25 30 95 69
## 11 Mainlin… 289 495 619 655 651 1107 939
## 12 Mormon 29 40 48 51 56 112 85
## 13 Muslim 6 7 9 10 9 23 16
## 14 Orthodox 13 17 23 32 32 47 38
## 15 Other C… 9 7 11 13 13 14 18
## 16 Other F… 20 33 40 46 49 63 46
## 17 Other W… 5 2 3 4 2 7 3
## 18 Unaffil… 217 299 374 365 341 528 407
## # ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
## # `Don't know/refused` <dbl>
We would like to reshape this dataset to have 3 columns: religion,
count, and income. The column “religion” doesn’t need to change, so we
exclude it with -religion
. Then, each remaining column
corresponds to an income category. Therefore, we want to move all these
column names to a single column called “income”. Finally, the values
corresponding to each of these columns will be reshaped to be in a
single new column, called “count”.
# ---------- datawizard -----------
relig_income |>
data_to_long(
-religion,
names_to = "income",
values_to = "count"
)
# ---------- tidyverse -----------
relig_income |>
pivot_longer(
!religion,
names_to = "income",
values_to = "count"
)
## # A tibble: 180 × 3
## religion income count
## <chr> <chr> <dbl>
## 1 Agnostic <$10k 27
## 2 Agnostic $10-20k 34
## 3 Agnostic $20-30k 60
## 4 Agnostic $30-40k 81
## 5 Agnostic $40-50k 76
## 6 Agnostic $50-75k 137
## 7 Agnostic $75-100k 122
## 8 Agnostic $100-150k 109
## 9 Agnostic >150k 84
## 10 Agnostic Don't know/refused 96
## # ℹ 170 more rows
To explore a bit more the arguments of data_to_long()
,
we will use another dataset: the billboard
dataset.
billboard
## # A tibble: 317 × 79
## artist track date.entered wk1 wk2 wk3 wk4 wk5 wk6 wk7 wk8
## <chr> <chr> <date> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 2 Pac Baby… 2000-02-26 87 82 72 77 87 94 99 NA
## 2 2Ge+her The … 2000-09-02 91 87 92 NA NA NA NA NA
## 3 3 Doors D… Kryp… 2000-04-08 81 70 68 67 66 57 54 53
## 4 3 Doors D… Loser 2000-10-21 76 76 72 69 67 65 55 59
## 5 504 Boyz Wobb… 2000-04-15 57 34 25 17 17 31 36 49
## 6 98^0 Give… 2000-08-19 51 39 34 26 26 19 2 2
## 7 A*Teens Danc… 2000-07-08 97 97 96 95 100 NA NA NA
## 8 Aaliyah I Do… 2000-01-29 84 62 51 41 38 35 35 38
## 9 Aaliyah Try … 2000-03-18 59 53 38 28 21 18 16 14
## 10 Adams, Yo… Open… 2000-08-26 76 76 74 69 68 67 61 58
## # ℹ 307 more rows
## # ℹ 68 more variables: wk9 <dbl>, wk10 <dbl>, wk11 <dbl>, wk12 <dbl>,
## # wk13 <dbl>, wk14 <dbl>, wk15 <dbl>, wk16 <dbl>, wk17 <dbl>, wk18 <dbl>,
## # wk19 <dbl>, wk20 <dbl>, wk21 <dbl>, wk22 <dbl>, wk23 <dbl>, wk24 <dbl>,
## # wk25 <dbl>, wk26 <dbl>, wk27 <dbl>, wk28 <dbl>, wk29 <dbl>, wk30 <dbl>,
## # wk31 <dbl>, wk32 <dbl>, wk33 <dbl>, wk34 <dbl>, wk35 <dbl>, wk36 <dbl>,
## # wk37 <dbl>, wk38 <dbl>, wk39 <dbl>, wk40 <dbl>, wk41 <dbl>, wk42 <dbl>, …
# ---------- datawizard -----------
billboard |>
data_to_long(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
# ---------- tidyverse -----------
billboard |>
pivot_longer(
cols = starts_with("wk"),
names_to = "week",
values_to = "rank",
values_drop_na = TRUE
)
## # A tibble: 5,307 × 5
## artist track date.entered week rank
## <chr> <chr> <date> <chr> <dbl>
## 1 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk1 87
## 2 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk2 82
## 3 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk3 72
## 4 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk4 77
## 5 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk5 87
## 6 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk6 94
## 7 2 Pac Baby Don't Cry (Keep... 2000-02-26 wk7 99
## 8 2Ge+her The Hardest Part Of ... 2000-09-02 wk1 91
## 9 2Ge+her The Hardest Part Of ... 2000-09-02 wk2 87
## 10 2Ge+her The Hardest Part Of ... 2000-09-02 wk3 92
## # ℹ 5,297 more rows
Wider
Once again, we use an example in the tidyr vignette to
show how close data_to_wide()
and
pivot_wider()
are:
fish_encounters
## # A tibble: 114 × 3
## fish station seen
## <fct> <fct> <int>
## 1 4842 Release 1
## 2 4842 I80_1 1
## 3 4842 Lisbon 1
## 4 4842 Rstr 1
## 5 4842 Base_TD 1
## 6 4842 BCE 1
## 7 4842 BCW 1
## 8 4842 BCE2 1
## 9 4842 BCW2 1
## 10 4842 MAE 1
## # ℹ 104 more rows
# ---------- datawizard -----------
fish_encounters |>
data_to_wide(
names_from = "station",
values_from = "seen",
values_fill = 0
)
# ---------- tidyverse -----------
fish_encounters |>
pivot_wider(
names_from = station,
values_from = seen,
values_fill = 0
)
## # A tibble: 19 × 12
## fish Release I80_1 Lisbon Rstr Base_TD BCE BCW BCE2 BCW2 MAE MAW
## <fct> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
## 1 4842 1 1 1 1 1 1 1 1 1 1 1
## 2 4843 1 1 1 1 1 1 1 1 1 1 1
## 3 4844 1 1 1 1 1 1 1 1 1 1 1
## 4 4845 1 1 1 1 1 0 0 0 0 0 0
## 5 4847 1 1 1 0 0 0 0 0 0 0 0
## 6 4848 1 1 1 1 0 0 0 0 0 0 0
## 7 4849 1 1 0 0 0 0 0 0 0 0 0
## 8 4850 1 1 0 1 1 1 1 0 0 0 0
## 9 4851 1 1 0 0 0 0 0 0 0 0 0
## 10 4854 1 1 0 0 0 0 0 0 0 0 0
## 11 4855 1 1 1 1 1 0 0 0 0 0 0
## 12 4857 1 1 1 1 1 1 1 1 1 0 0
## 13 4858 1 1 1 1 1 1 1 1 1 1 1
## 14 4859 1 1 1 1 1 0 0 0 0 0 0
## 15 4861 1 1 1 1 1 1 1 1 1 1 1
## 16 4862 1 1 1 1 1 1 1 1 1 0 0
## 17 4863 1 1 0 0 0 0 0 0 0 0 0
## 18 4864 1 1 0 0 0 0 0 0 0 0 0
## 19 4865 1 1 1 0 0 0 0 0 0 0 0
Joining
In datawizard, joining datasets is done with
data_join()
(or its alias data_merge()
).
Contrary to dplyr, this unique function takes care of all
types of join, which are then specified inside the function with the
argument join
(by default, join = "left"
).
Below, we show how to perform the four most common joins: full, left,
right and inner. We will use the datasets band_members
and
band_instruments
provided by dplyr:
band_members
## # A tibble: 3 × 2
## name band
## <chr> <chr>
## 1 Mick Stones
## 2 John Beatles
## 3 Paul Beatles
band_instruments
## # A tibble: 3 × 2
## name plays
## <chr> <chr>
## 1 John guitar
## 2 Paul bass
## 3 Keith guitar
Full join
# ---------- datawizard -----------
band_members |>
data_join(band_instruments, join = "full")
# ---------- tidyverse -----------
band_members |>
full_join(band_instruments)
## # A tibble: 4 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 Mick Stones NA
## 2 John Beatles guitar
## 3 Paul Beatles bass
## 4 Keith NA guitar
Left and right joins
# ---------- datawizard -----------
band_members |>
data_join(band_instruments, join = "left")
# ---------- tidyverse -----------
band_members |>
left_join(band_instruments)
## # A tibble: 3 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 Mick Stones NA
## 2 John Beatles guitar
## 3 Paul Beatles bass
# ---------- datawizard -----------
band_members |>
data_join(band_instruments, join = "right")
# ---------- tidyverse -----------
band_members |>
right_join(band_instruments)
## # A tibble: 3 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
## 3 Keith NA guitar
Inner join
# ---------- datawizard -----------
band_members |>
data_join(band_instruments, join = "inner")
# ---------- tidyverse -----------
band_members |>
inner_join(band_instruments)
## # A tibble: 2 × 3
## name band plays
## * <chr> <chr> <chr>
## 1 John Beatles guitar
## 2 Paul Beatles bass
Uniting
Uniting variables is useful e.g to create unique indices by combining
several variables or to gather years, months, and days into a single
date. data_unite()
offers an interface very close to
tidyr::unite()
:
test <- data.frame(
year = 2002:2004,
month = c("02", "03", "09"),
day = c("11", "22", "28"),
stringsAsFactors = FALSE
)
test
## year month day
## 1 2002 02 11
## 2 2003 03 22
## 3 2004 09 28
# ---------- datawizard -----------
test |>
data_unite(
new_column = "date",
select = c("year", "month", "day"),
separator = "-"
)
# ---------- tidyverse -----------
test |>
unite(
col = "date",
year, month, day,
sep = "-"
)
## date
## 1 2002-02-11
## 2 2003-03-22
## 3 2004-09-28
# ---------- datawizard -----------
test |>
data_unite(
new_column = "date",
select = c("year", "month", "day"),
separator = "-",
append = TRUE
)
# ---------- tidyverse -----------
test |>
unite(
col = "date",
year, month, day,
sep = "-",
remove = FALSE
)
## year month day date
## 1 2002 02 11 2002-02-11
## 2 2003 03 22 2003-03-22
## 3 2004 09 28 2004-09-28
Separating
Separating variables is the counterpart to uniting variables and is
useful to split values into multiple columns, e.g. when splitting a date
into values for years, months and days. data_separate()
offers an interface very close to tidyr::separate()
:
test <- data.frame(
date_arrival = c("2002-02-11", "2003-03-22", "2004-09-28"),
date_departure = c("2002-03-15", "2003-03-28", "2004-09-30"),
stringsAsFactors = FALSE
)
test
## date_arrival date_departure
## 1 2002-02-11 2002-03-15
## 2 2003-03-22 2003-03-28
## 3 2004-09-28 2004-09-30
# ---------- datawizard -----------
test |>
data_separate(
select = "date_arrival",
new_columns = c("Year", "Month", "Day")
)
## date_departure Year Month Day
## 1 2002-03-15 2002 02 11
## 2 2003-03-28 2003 03 22
## 3 2004-09-30 2004 09 28
Unlike tidyr::separate()
, you can separate multiple
columns in one step with data_separate()
.
test |>
data_separate(
new_columns = list(
date_arrival = c("Arr_Year", "Arr_Month", "Arr_Day"),
date_departure = c("Dep_Year", "Dep_Month", "Dep_Day")
)
)
## Arr_Year Arr_Month Arr_Day Dep_Year Dep_Month Dep_Day
## 1 2002 02 11 2002 03 15
## 2 2003 03 22 2003 03 28
## 3 2004 09 28 2004 09 30
Other useful functions
datawizard contains other functions that are not
necessarily included in dplyr or tidyr or
do not directly modify the data. Some of them are inspired from the
package janitor
.
Work with rownames
We can convert a column in rownames and move rownames to a new column
with rownames_as_column()
and
column_as_rownames()
:
mtcars <- head(mtcars)
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars2 <- mtcars |>
rownames_as_column(var = "model")
mtcars2
## model mpg cyl disp hp drat wt qsec vs am gear carb
## 1 Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## 2 Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## 3 Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## 4 Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## 5 Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## 6 Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
mtcars2 |>
column_as_rownames(var = "model")
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
Work with row ids
rowid_as_column()
is close but not identical to
tibble::rowid_to_column()
. The main difference is when we
use it with grouped data. While tibble::rowid_to_column()
uses one distinct rowid for every row in the dataset,
rowid_as_column()
creates one id for every row in each
group. Therefore, two rows in different groups can have the same
row id.
This means that rowid_as_column()
is closer to using
n()
in mutate()
, like the following:
test <- data.frame(
group = c("A", "A", "B", "B"),
value = c(3, 5, 8, 1),
stringsAsFactors = FALSE
)
test
## group value
## 1 A 3
## 2 A 5
## 3 B 8
## 4 B 1
test |>
data_group(group) |>
tibble::rowid_to_column()
## rowid group value
## 1 1 A 3
## 2 2 A 5
## 3 3 B 8
## 4 4 B 1
test |>
data_group(group) |>
rowid_as_column()
## # A tibble: 4 × 3
## # Groups: group [2]
## rowid group value
## <int> <chr> <dbl>
## 1 1 A 3
## 2 2 A 5
## 3 1 B 8
## 4 2 B 1
test |>
data_group(group) |>
mutate(id = seq_len(n()))
## # A tibble: 4 × 3
## # Groups: group [2]
## group value id
## <chr> <dbl> <int>
## 1 A 3 1
## 2 A 5 2
## 3 B 8 1
## 4 B 1 2
Work with column names
When dealing with messy data, it is sometimes useful to use a row as
column names, and vice versa. This can be done with
row_to_colnames()
and colnames_to_row()
.
x <- data.frame(
X_1 = c(NA, "Title", 1:3),
X_2 = c(NA, "Title2", 4:6)
)
x
## X_1 X_2
## 1 <NA> <NA>
## 2 Title Title2
## 3 1 4
## 4 2 5
## 5 3 6
x2 <- x |>
row_to_colnames(row = 2)
x2
## Title Title2
## 1 <NA> <NA>
## 3 1 4
## 4 2 5
## 5 3 6
x2 |>
colnames_to_row()
## x1 x2
## 1 Title Title2
## 11 <NA> <NA>
## 3 1 4
## 4 2 5
## 5 3 6
Take a quick look at the data
# ---------- datawizard -----------
data_peek(iris)
# ---------- tidyverse -----------
glimpse(iris)
## Data frame with 150 rows and 5 variables
##
## Variable | Type | Values
## -----------------------------------------------------------------------
## Sepal.Length | numeric | 5.1, 4.9, 4.7, 4.6, 5, 5.4, 4.6, 5, 4.4, ...
## Sepal.Width | numeric | 3.5, 3, 3.2, 3.1, 3.6, 3.9, 3.4, 3.4, 2.9, ...
## Petal.Length | numeric | 1.4, 1.4, 1.3, 1.5, 1.4, 1.7, 1.4, 1.5, ...
## Petal.Width | numeric | 0.2, 0.2, 0.2, 0.2, 0.2, 0.4, 0.3, 0.2, ...
## Species | factor | setosa, setosa, setosa, setosa, setosa, ...