Creates data partitions (for instance, a training and a test set) from a data frame. Partitions can optionally be stratified, i.e., a given factor is spread evenly across them, using the group argument.

Usage

data_partition(
  data,
  proportion = 0.7,
  group = NULL,
  seed = NULL,
  row_id = ".row_id",
  verbose = TRUE,
  training_proportion = proportion,
  ...
)

Arguments

data

A data frame, or an object that can be coerced to a data frame.

proportion

A single number between 0 and 1, or a numeric vector of such numbers, giving the proportion of rows in each training set. The proportions must not sum to more than 1. The remaining rows form the test set.

group

A character vector indicating the name(s) of the column(s) used for stratified partitioning.

seed

A random number generator seed. Enter an integer (e.g. 123) so that the random sampling will be the same each time you run the function.

row_id

Character string, indicating the name of the column that contains the row IDs.

verbose

Toggle messages and warnings.

training_proportion

Deprecated, please use proportion.

...

Other arguments passed to or from other functions.
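
To make the idea behind the group argument concrete, here is a minimal base R sketch of stratified sampling for a single training proportion. It is only an illustration of the technique, not the implementation used by data_partition(); the helper name stratified_split and its return structure are hypothetical.

```r
# Hypothetical helper, not part of any package: a base R illustration of
# stratified sampling for one training proportion.
stratified_split <- function(data, proportion, group, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Row indices for each level of the grouping column
  idx_by_group <- split(seq_len(nrow(data)), data[[group]])
  # Draw the same share of rows from every group
  train_idx <- unlist(lapply(idx_by_group, function(idx) {
    idx[sample.int(length(idx), size = round(proportion * length(idx)))]
  }), use.names = FALSE)
  list(
    training = data[sort(train_idx), , drop = FALSE],
    test = data[-train_idx, , drop = FALSE]
  )
}

parts <- stratified_split(iris, proportion = 0.9, group = "Species", seed = 123)
# Each species contributes the same number of rows to the test set
table(parts$test$Species)
```

Because sampling happens within each group, the share of every species is the same in the training and the test set. The sketch assumes a proportion greater than 0, so that at least one row is drawn for training.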

Value

A list of data frames. The list includes one training set per given proportion and the remaining data as the test set. Training sets are named after the given proportions (e.g., $p_0.7); the test set is named $test.
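
As a sketch of how the returned list can be inspected, the snippet below uses the default row-ID column (.row_id) to check that the partitions do not overlap and together cover the input. It assumes the datawizard package is installed and that iris is used as input. The checks express what one would expect from the description above and are not a guarantee taken from the package documentation.

```r
library(datawizard)

# Two training sets (30% and 50%); the remaining 20% becomes the test set.
out <- data_partition(iris, proportion = c(0.3, 0.5), seed = 123)
names(out)

# Collect the row IDs from every partition.
ids <- unlist(lapply(out, function(part) part[[".row_id"]]), use.names = FALSE)

# Expected: no row appears in more than one partition, and no row is lost.
anyDuplicated(ids) == 0
setequal(ids, seq_len(nrow(iris)))

# Re-running with the same seed should reproduce the same split.
again <- data_partition(iris, proportion = c(0.3, 0.5), seed = 123)
identical(out, again)
```

Keeping the row IDs in each partition makes it possible to trace any row in a training or test set back to its position in the original data.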

Examples

data(iris)
out <- data_partition(iris, proportion = 0.9)
out$test
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species .row_id
#> 1           4.9         3.1          1.5         0.1     setosa      10
#> 2           5.8         4.0          1.2         0.2     setosa      15
#> 3           5.1         3.5          1.4         0.3     setosa      18
#> 4           5.2         4.1          1.5         0.1     setosa      33
#> 5           5.1         3.8          1.9         0.4     setosa      45
#> 6           5.3         3.7          1.5         0.2     setosa      49
#> 7           5.7         2.8          4.5         1.3 versicolor      56
#> 8           5.0         2.0          3.5         1.0 versicolor      61
#> 9           7.1         3.0          5.9         2.1  virginica     103
#> 10          5.7         2.5          5.0         2.0  virginica     114
#> 11          6.5         3.0          5.5         1.8  virginica     117
#> 12          6.3         2.7          4.9         1.8  virginica     124
#> 13          6.7         3.3          5.7         2.1  virginica     125
#> 14          6.9         3.1          5.4         2.1  virginica     140
#> 15          6.7         3.0          5.2         2.3  virginica     146
nrow(out$p_0.9)
#> [1] 135

# Stratify by group (equal proportions of each species)
out <- data_partition(iris, proportion = 0.9, group = "Species")
out$test
#>    Sepal.Length Sepal.Width Petal.Length Petal.Width    Species .row_id
#> 1           4.7         3.2          1.3         0.2     setosa       3
#> 2           4.6         3.1          1.5         0.2     setosa       4
#> 3           5.4         3.4          1.5         0.4     setosa      32
#> 4           4.5         2.3          1.3         0.3     setosa      42
#> 5           5.3         3.7          1.5         0.2     setosa      49
#> 6           6.9         3.1          4.9         1.5 versicolor      53
#> 7           5.0         2.0          3.5         1.0 versicolor      61
#> 8           6.1         2.8          4.0         1.3 versicolor      72
#> 9           6.3         2.3          4.4         1.3 versicolor      88
#> 10          5.6         3.0          4.1         1.3 versicolor      89
#> 11          7.3         2.9          6.3         1.8  virginica     108
#> 12          7.2         3.6          6.1         2.5  virginica     110
#> 13          5.7         2.5          5.0         2.0  virginica     114
#> 14          7.7         2.6          6.9         2.3  virginica     119
#> 15          6.0         2.2          5.0         1.5  virginica     120

# Create multiple partitions
out <- data_partition(iris, proportion = c(0.3, 0.3))
lapply(out, head)
#> $p_0.3
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1          5.4         3.9          1.7         0.4  setosa       6
#> 2          5.4         3.7          1.5         0.2  setosa      11
#> 3          4.8         3.4          1.6         0.2  setosa      12
#> 4          5.7         4.4          1.5         0.4  setosa      16
#> 5          5.1         3.7          1.5         0.4  setosa      22
#> 6          5.0         3.0          1.6         0.2  setosa      26
#> 
#> $p_0.3
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1          4.7         3.2          1.3         0.2  setosa       3
#> 2          5.0         3.6          1.4         0.2  setosa       5
#> 3          4.6         3.4          1.4         0.3  setosa       7
#> 4          4.4         2.9          1.4         0.2  setosa       9
#> 5          4.9         3.1          1.5         0.1  setosa      10
#> 6          4.8         3.0          1.4         0.1  setosa      13
#> 
#> $test
#>   Sepal.Length Sepal.Width Petal.Length Petal.Width Species .row_id
#> 1          5.1         3.5          1.4         0.2  setosa       1
#> 2          4.9         3.0          1.4         0.2  setosa       2
#> 3          4.6         3.1          1.5         0.2  setosa       4
#> 4          5.0         3.4          1.5         0.2  setosa       8
#> 5          5.8         4.0          1.2         0.2  setosa      15
#> 6          5.4         3.9          1.3         0.4  setosa      17
#> 

# Create multiple partitions, stratified by group: 30% sampled equally
# from each species for the first training set, 50% for the second
# training set, and the remaining 20% for the test set.
out <- data_partition(iris, proportion = c(0.3, 0.5), group = "Species")
lapply(out, function(i) table(i$Species))
#> $p_0.3
#> 
#>     setosa versicolor  virginica 
#>         15         15         15 
#> 
#> $p_0.5
#> 
#>     setosa versicolor  virginica 
#>         25         25         25 
#> 
#> $test
#> 
#>     setosa versicolor  virginica 
#>         10         10         10 
#>