Creates data partitions (for instance, a training and a test set) based on a
data frame. Partitions can also be stratified (i.e., a given factor is spread
evenly across them) using the group argument.
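To make the idea of stratification concrete, here is a minimal base-R sketch of a stratified 70/30 split. It only illustrates the concept and is not the implementation used by data_partition(); the names idx_by_group, train_idx, train and test are chosen for this example.

```r
# Illustrative only: stratified 70/30 split in base R (not datawizard internals)
set.seed(42)

# Row indices of iris, grouped by the stratification factor
idx_by_group <- split(seq_len(nrow(iris)), iris$Species)

# Sample 70% of the rows within each group, then combine the indices
train_idx <- unlist(lapply(idx_by_group, function(i) {
  sample(i, size = round(0.7 * length(i)))
}))

train <- iris[train_idx, ]
test <- iris[-train_idx, ]

# Each species contributes the same number of rows to the training set
table(train$Species)
```

Because sampling happens within each species, the training and test sets keep the class balance of the original data.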
Usage
data_partition(
data,
proportion = 0.7,
group = NULL,
seed = NULL,
row_id = ".row_id",
verbose = TRUE,
training_proportion = proportion,
...
)
Arguments
- data
A data frame, or an object that can be coerced to a data frame.
- proportion
Scalar (between 0 and 1) or numeric vector, indicating the proportion(s) of the training set(s). The sum of proportion must not be greater than 1. The remaining part will be used for the test set.
- group
A character vector indicating the name(s) of the column(s) used for stratified partitioning.
- seed
A random number generator seed. Enter an integer (e.g. 123) so that the random sampling will be the same each time you run the function.
- row_id
Character string, indicating the name of the column that contains the row IDs.
- verbose
Toggle messages and warnings.
- training_proportion
Deprecated, please use proportion.
- ...
Other arguments passed to or from other functions.
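As a small sketch of how the seed argument might be used, the following calls data_partition() twice with the same seed and compares the row IDs of the resulting test sets. It assumes the datawizard package is installed and that the default row_id column name ".row_id" is kept; the object names a and b are chosen for this example.

```r
library(datawizard)

# Same data, same proportion, same seed
a <- data_partition(iris, proportion = 0.8, seed = 123, verbose = FALSE)
b <- data_partition(iris, proportion = 0.8, seed = 123, verbose = FALSE)

# With a fixed seed, both calls should select the same rows for the test set
identical(a$test$.row_id, b$test$.row_id)
```

Leaving seed = NULL should instead produce a different random split on each call.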
Value
A list of data frames. The list includes one training set per given
proportion and the remaining data as test set. List elements of training
sets are named after the given proportions (e.g., $p_0.7); the test set
is named $test.
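The structure of the returned list can be inspected as in the sketch below. It assumes the datawizard package is installed and the default row_id = ".row_id"; the variable ids is introduced only for this example.

```r
library(datawizard)

out <- data_partition(iris, proportion = c(0.3, 0.5), seed = 1, verbose = FALSE)

# One element per training proportion, plus the test set
names(out)

# Collect the row IDs of all partitions; if the partitions do not overlap,
# no row ID should occur more than once
ids <- unlist(lapply(out, function(x) x$.row_id))
anyDuplicated(ids) == 0
```

The row ID column makes it possible to trace each row in a partition back to its position in the original data frame.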
See also
Functions to rename stuff:
data_rename(), data_rename_rows(), data_addprefix(), data_addsuffix()
Functions to reorder or remove columns:
data_reorder(), data_relocate(), data_remove()
Functions to reshape, pivot or rotate dataframes:
data_to_long(), data_to_wide(), data_rotate()
Functions to recode data:
data_rescale(), data_reverse(), data_cut(), data_recode(), data_shift()
Functions to standardize, normalize, rank-transform:
center(), standardize(), normalize(), ranktransform(), winsorize()
Split and merge dataframes:
data_partition(), data_merge()
Functions to find or select columns:
data_select(), data_find()
Functions to filter rows:
data_match(), data_filter()
Examples
data(iris)
out <- data_partition(iris, proportion = 0.9)
out$test
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1 4.9 3.1 1.5 0.1 setosa 10
#> 2 5.8 4.0 1.2 0.2 setosa 15
#> 3 5.1 3.5 1.4 0.3 setosa 18
#> 4 5.2 4.1 1.5 0.1 setosa 33
#> 5 5.1 3.8 1.9 0.4 setosa 45
#> 6 5.3 3.7 1.5 0.2 setosa 49
#> 7 5.7 2.8 4.5 1.3 versicolor 56
#> 8 5.0 2.0 3.5 1.0 versicolor 61
#> 9 7.1 3.0 5.9 2.1 virginica 103
#> 10 5.7 2.5 5.0 2.0 virginica 114
#> 11 6.5 3.0 5.5 1.8 virginica 117
#> 12 6.3 2.7 4.9 1.8 virginica 124
#> 13 6.7 3.3 5.7 2.1 virginica 125
#> 14 6.9 3.1 5.4 2.1 virginica 140
#> 15 6.7 3.0 5.2 2.3 virginica 146
nrow(out$p_0.9)
#> [1] 135
# Stratify by group (equal proportions of each species)
out <- data_partition(iris, proportion = 0.9, group = "Species")
out$test
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1 4.7 3.2 1.3 0.2 setosa 3
#> 2 4.6 3.1 1.5 0.2 setosa 4
#> 3 5.4 3.4 1.5 0.4 setosa 32
#> 4 4.5 2.3 1.3 0.3 setosa 42
#> 5 5.3 3.7 1.5 0.2 setosa 49
#> 6 6.9 3.1 4.9 1.5 versicolor 53
#> 7 5.0 2.0 3.5 1.0 versicolor 61
#> 8 6.1 2.8 4.0 1.3 versicolor 72
#> 9 6.3 2.3 4.4 1.3 versicolor 88
#> 10 5.6 3.0 4.1 1.3 versicolor 89
#> 11 7.3 2.9 6.3 1.8 virginica 108
#> 12 7.2 3.6 6.1 2.5 virginica 110
#> 13 5.7 2.5 5.0 2.0 virginica 114
#> 14 7.7 2.6 6.9 2.3 virginica 119
#> 15 6.0 2.2 5.0 1.5 virginica 120
# Create multiple partitions
out <- data_partition(iris, proportion = c(0.3, 0.3))
lapply(out, head)
#> $p_0.3
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1 5.4 3.9 1.7 0.4 setosa 6
#> 2 5.4 3.7 1.5 0.2 setosa 11
#> 3 4.8 3.4 1.6 0.2 setosa 12
#> 4 5.7 4.4 1.5 0.4 setosa 16
#> 5 5.1 3.7 1.5 0.4 setosa 22
#> 6 5.0 3.0 1.6 0.2 setosa 26
#>
#> $p_0.3
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1 4.7 3.2 1.3 0.2 setosa 3
#> 2 5.0 3.6 1.4 0.2 setosa 5
#> 3 4.6 3.4 1.4 0.3 setosa 7
#> 4 4.4 2.9 1.4 0.2 setosa 9
#> 5 4.9 3.1 1.5 0.1 setosa 10
#> 6 4.8 3.0 1.4 0.1 setosa 13
#>
#> $test
#> Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1 5.1 3.5 1.4 0.2 setosa 1
#> 2 4.9 3.0 1.4 0.2 setosa 2
#> 3 4.6 3.1 1.5 0.2 setosa 4
#> 4 5.0 3.4 1.5 0.2 setosa 8
#> 5 5.8 4.0 1.2 0.2 setosa 15
#> 6 5.4 3.9 1.3 0.4 setosa 17
#>
# Create multiple partitions, stratified by group - 30% equally sampled
# from species in first training set, 50% in second training set and
# remaining 20% equally sampled from each species in test set.
out <- data_partition(iris, proportion = c(0.3, 0.5), group = "Species")
lapply(out, function(i) table(i$Species))
#> $p_0.3
#>
#> setosa versicolor virginica
#> 15 15 15
#>
#> $p_0.5
#>
#> setosa versicolor virginica
#> 25 25 25
#>
#> $test
#>
#> setosa versicolor virginica
#> 10 10 10
#>