This function "lengthens" data, increasing the number of rows and decreasing
the number of columns. This is a dependency-free base-R equivalent of
tidyr::pivot_longer()
.
Usage
data_to_long(
data,
select = "all",
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
values_to = "value",
values_drop_na = FALSE,
rows_to = NULL,
ignore_case = FALSE,
regex = FALSE,
...,
cols
)
reshape_longer(
data,
select = "all",
names_to = "name",
names_prefix = NULL,
names_sep = NULL,
names_pattern = NULL,
values_to = "value",
values_drop_na = FALSE,
rows_to = NULL,
ignore_case = FALSE,
regex = FALSE,
...,
cols
)
Arguments
- data
A data frame to convert to long format, so that it has more rows and fewer columns after the operation.
- select
Variables that will be included when performing the required tasks. Can be either
a variable specified as a literal variable name (e.g.,
column_name
),a string with the variable name (e.g.,
"column_name"
), a character vector of variable names (e.g.,c("col1", "col2", "col3")
), or a character vector of variable names including ranges specified via:
(e.g.,c("col1:col3", "col5")
),for some functions, like
data_select()
ordata_rename()
,select
can be a named character vector. In this case, the names are used to rename the columns in the output data frame. See 'Details' in the related functions to see where this option applies.a formula with variable names (e.g.,
~column_1 + column_2
),a vector of positive integers, giving the positions counting from the left (e.g.
1
orc(1, 3, 5)
),a vector of negative integers, giving the positions counting from the right (e.g.,
-1
or-1:-3
),one of the following select-helpers:
starts_with()
,ends_with()
,contains()
, a range using:
, orregex()
.starts_with()
,ends_with()
, andcontains()
accept several patterns, e.gstarts_with("Sep", "Petal")
.regex()
can be used to define regular expression patterns.a function testing for logical conditions, e.g.
is.numeric()
(oris.numeric
), or any user-defined function that selects the variables for which the function returnsTRUE
(like:foo <- function(x) mean(x) > 3
),ranges specified via literal variable names, select-helpers (except
regex()
) and (user-defined) functions can be negated, i.e. return non-matching elements, when prefixed with a-
, e.g.-ends_with()
,-is.numeric
or-(Sepal.Width:Petal.Length)
. Note: Negation means that matches are excluded, and thus, theexclude
argument can be used alternatively. For instance,select=-ends_with("Length")
(with-
) is equivalent toexclude=ends_with("Length")
(no-
). In case negation should not work as expected, use theexclude
argument instead.
If
NULL
, selects all columns. Patterns that found no matches are silently ignored, e.g.extract_column_names(iris, select = c("Species", "Test"))
will just return"Species"
.- names_to
The name of the new column (variable) that will contain the names from columns in
select
as values, to identify the source of the values.names_to
can be a character vector with more than one column name, in which casenames_sep
ornames_pattern
must be provided in order to identify which parts of the column names go into newly created columns. See also 'Examples'.- names_prefix
A regular expression used to remove matching text from the start of each variable name.
- names_sep, names_pattern
If
names_to
contains multiple values, this argument controls how the column name is broken up.names_pattern
takes a regular expression containing matching groups, i.e. "()".- values_to
The name of the new column that will contain the values of the columns in
select
.- values_drop_na
If
TRUE
, will drop rows that contain onlyNA
in thevalues_to
column. This effectively converts explicit missing values to implicit missing values, and should generally be used only when missing values in data were created by its structure.- rows_to
The name of the column that will contain the row names or row numbers from the original data. If
NULL
, will be removed.- ignore_case
Logical, if
TRUE
and when one of the select-helpers or a regular expression is used inselect
, ignores lower/upper case in the search pattern when matching against variable names.- regex
Logical, if
TRUE
, the search pattern fromselect
will be treated as regular expression. Whenregex = TRUE
, select must be a character string (or a variable containing a character string) and is not allowed to be one of the supported select-helpers or a character vector of length > 1.regex = TRUE
is comparable to using one of the two select-helpers,select = contains()
orselect = regex()
, however, since the select-helpers may not work when called from inside other functions (see 'Details'), this argument may be used as workaround.- ...
Currently not used.
- cols
Identical to
select
. This argument is here to ensure compatibility withtidyr::pivot_longer()
. If bothselect
andcols
are provided,cols
is used.
Value
If a tibble was provided as input, reshape_longer()
also returns a
tibble. Otherwise, it returns a data frame.
Details
Reshaping data into long format usually means that the input data frame is
in wide format, where multiple measurements taken on the same subject are
stored in multiple columns (variables). The long format stores the same
information in a single column, with each measurement per subject stored in
a separate row. The values of all variables that are not in select
will
be repeated.
The necessary information for data_to_long()
is:
The columns that contain the repeated measurements (
select
).The name of the newly created column that will contain the names of the columns in
select
(names_to
), to identify the source of the values.names_to
can also be a character vector with more than one column name, in which casenames_sep
ornames_pattern
must be provided to specify which parts of the column names go into the newly created columns.The name of the newly created column that contains the values of the columns in
select
(values_to
).
In other words: repeated measurements that are spread across several columns
will be gathered into a single column (values_to
), with the original column
names, that identify the source of the gathered values, stored in one or more
new columns (names_to
).
See also
Add a prefix or suffix to column names:
data_addprefix()
,data_addsuffix()
Functions to reorder or remove columns:
data_reorder()
,data_relocate()
,data_remove()
Functions to reshape, pivot or rotate data frames:
data_to_long()
,data_to_wide()
,data_rotate()
Functions to recode data:
rescale()
,reverse()
,categorize()
,recode_values()
,slide()
Functions to standardize, normalize, rank-transform:
center()
,standardize()
,normalize()
,ranktransform()
,winsorize()
Split and merge data frames:
data_partition()
,data_merge()
Functions to find or select columns:
data_select()
,extract_column_names()
Functions to filter rows:
data_match()
,data_filter()
Examples
wide_data <- setNames(
data.frame(replicate(2, rnorm(8))),
c("Time1", "Time2")
)
wide_data$ID <- 1:8
wide_data
#> Time1 Time2 ID
#> 1 -1.6895557 0.50381245 1
#> 2 1.2394959 2.52833655 2
#> 3 -0.1089660 0.54909674 3
#> 4 -0.1172420 0.23821292 4
#> 5 0.1830826 -1.04889314 5
#> 6 1.2805549 1.29476325 6
#> 7 -1.7272706 0.82553984 7
#> 8 1.6901844 -0.05568601 8
# Default behaviour (equivalent to tidyr::pivot_longer(wide_data, cols = 1:3))
# probably doesn't make much sense to mix "time" and "id"
data_to_long(wide_data)
#> name value
#> 1 Time1 -1.68955566
#> 2 Time2 0.50381245
#> 3 ID 1.00000000
#> 4 Time1 1.23949589
#> 5 Time2 2.52833655
#> 6 ID 2.00000000
#> 7 Time1 -0.10896597
#> 8 Time2 0.54909674
#> 9 ID 3.00000000
#> 10 Time1 -0.11724196
#> 11 Time2 0.23821292
#> 12 ID 4.00000000
#> 13 Time1 0.18308261
#> 14 Time2 -1.04889314
#> 15 ID 5.00000000
#> 16 Time1 1.28055488
#> 17 Time2 1.29476325
#> 18 ID 6.00000000
#> 19 Time1 -1.72727063
#> 20 Time2 0.82553984
#> 21 ID 7.00000000
#> 22 Time1 1.69018435
#> 23 Time2 -0.05568601
#> 24 ID 8.00000000
# Customizing the names
data_to_long(
wide_data,
select = c("Time1", "Time2"),
names_to = "Timepoint",
values_to = "Score"
)
#> ID Timepoint Score
#> 1 1 Time1 -1.68955566
#> 2 1 Time2 0.50381245
#> 3 2 Time1 1.23949589
#> 4 2 Time2 2.52833655
#> 5 3 Time1 -0.10896597
#> 6 3 Time2 0.54909674
#> 7 4 Time1 -0.11724196
#> 8 4 Time2 0.23821292
#> 9 5 Time1 0.18308261
#> 10 5 Time2 -1.04889314
#> 11 6 Time1 1.28055488
#> 12 6 Time2 1.29476325
#> 13 7 Time1 -1.72727063
#> 14 7 Time2 0.82553984
#> 15 8 Time1 1.69018435
#> 16 8 Time2 -0.05568601
# Reshape multiple columns into long format.
mydat <- data.frame(
age = c(20, 30, 40),
sex = c("Female", "Male", "Male"),
score_t1 = c(30, 35, 32),
score_t2 = c(33, 34, 37),
score_t3 = c(36, 35, 38),
speed_t1 = c(2, 3, 1),
speed_t2 = c(3, 4, 5),
speed_t3 = c(1, 8, 6)
)
# The column names are split into two columns: "type" and "time". The
# pattern for splitting column names is provided in `names_pattern`. Values
# of all "score_*" and "speed_*" columns are gathered into a single column
# named "count".
data_to_long(
mydat,
select = 3:8,
names_to = c("type", "time"),
names_pattern = "(score|speed)_t(\\d+)",
values_to = "count"
)
#> age sex type time count
#> 1 20 Female score 1 30
#> 2 20 Female score 2 33
#> 3 20 Female score 3 36
#> 4 20 Female speed 1 2
#> 5 20 Female speed 2 3
#> 6 20 Female speed 3 1
#> 7 30 Male score 1 35
#> 8 30 Male score 2 34
#> 9 30 Male score 3 35
#> 10 30 Male speed 1 3
#> 11 30 Male speed 2 4
#> 12 30 Male speed 3 8
#> 13 40 Male score 1 32
#> 14 40 Male score 2 37
#> 15 40 Male score 3 38
#> 16 40 Male speed 1 1
#> 17 40 Male speed 2 5
#> 18 40 Male speed 3 6
# Full example
# ------------------
data <- psych::bfi # Wide format with one row per participant's personality test
# Pivot long format
very_long_data <- data_to_long(data,
select = regex("\\d"), # Select all columns that contain a digit
names_to = "Item",
values_to = "Score",
rows_to = "Participant"
)
head(very_long_data)
#> gender education age Participant Item Score
#> 1 1 NA 16 61617 A1 2
#> 2 1 NA 16 61617 A2 4
#> 3 1 NA 16 61617 A3 3
#> 4 1 NA 16 61617 A4 4
#> 5 1 NA 16 61617 A5 4
#> 6 1 NA 16 61617 C1 2
even_longer_data <- data_to_long(
tidyr::who,
select = new_sp_m014:newrel_f65,
names_to = c("diagnosis", "gender", "age"),
names_pattern = "new_?(.*)_(.)(.*)",
values_to = "count"
)
head(even_longer_data)
#> # A tibble: 6 × 8
#> country iso2 iso3 year diagnosis gender age count
#> <chr> <chr> <chr> <dbl> <chr> <chr> <chr> <dbl>
#> 1 Afghanistan AF AFG 1980 sp m 014 NA
#> 2 Afghanistan AF AFG 1980 sp m 1524 NA
#> 3 Afghanistan AF AFG 1980 sp m 2534 NA
#> 4 Afghanistan AF AFG 1980 sp m 3544 NA
#> 5 Afghanistan AF AFG 1980 sp m 4554 NA
#> 6 Afghanistan AF AFG 1980 sp m 5564 NA