This function describes a distribution by a set of indices (e.g., measures of centrality, dispersion, range, skewness, kurtosis).
Usage
describe_distribution(x, ...)
# S3 method for class 'numeric'
describe_distribution(
  x,
  centrality = "mean",
  dispersion = TRUE,
  iqr = TRUE,
  range = TRUE,
  quartiles = FALSE,
  ci = NULL,
  iterations = 100,
  threshold = 0.1,
  verbose = TRUE,
  ...
)
# S3 method for class 'factor'
describe_distribution(x, dispersion = TRUE, range = TRUE, verbose = TRUE, ...)
# S3 method for class 'data.frame'
describe_distribution(
  x,
  select = NULL,
  exclude = NULL,
  centrality = "mean",
  dispersion = TRUE,
  iqr = TRUE,
  range = TRUE,
  quartiles = FALSE,
  include_factors = FALSE,
  ci = NULL,
  iterations = 100,
  threshold = 0.1,
  ignore_case = FALSE,
  regex = FALSE,
  verbose = TRUE,
  ...
)
Arguments
- x
A numeric vector, a character vector, a data frame, or a list. See Details.
- ...
Additional arguments to be passed to or from methods.
- centrality
The point-estimates (centrality indices) to compute. Character (vector) or list with one or more of these options: "median", "mean", "MAP" (see map_estimate()), "trimmed" (which is just mean(x, trim = threshold)), "mode" or "all".
- dispersion
Logical, if TRUE, computes indices of dispersion related to the estimate(s) (SD and MAD for mean and median, respectively). Dispersion is not available for "MAP" or "mode" centrality indices.
- iqr
Logical, if TRUE, the interquartile range is calculated (based on stats::IQR(), using type = 6).
- range
Return the range (min and max).
- quartiles
Return the first and third quartiles (25th and 75th percentiles).
- ci
Confidence Interval (CI) level. Default is NULL, i.e. no confidence intervals are computed. If not NULL, confidence intervals are based on bootstrap replicates (see iterations). If centrality = "all", the bootstrapped confidence interval refers to the first centrality index (which is typically the median).
- iterations
The number of bootstrap replicates for computing confidence intervals. Only applies when ci is not NULL.
- threshold
For centrality = "trimmed" (i.e. trimmed mean), indicates the fraction (0 to 0.5) of observations to be trimmed from each end of the vector before the mean is computed.
- verbose
Toggle warnings and messages.
- select
Variables that will be included when performing the required tasks. Can be either
  - a variable specified as a literal variable name (e.g., column_name),
  - a string with the variable name (e.g., "column_name"), a character vector of variable names (e.g., c("col1", "col2", "col3")), or a character vector of variable names including ranges specified via : (e.g., c("col1:col3", "col5")),
  - a formula with variable names (e.g., ~column_1 + column_2),
  - a vector of positive integers, giving the positions counting from the left (e.g., 1 or c(1, 3, 5)),
  - a vector of negative integers, giving the positions counting from the right (e.g., -1 or -1:-3),
  - one of the following select-helpers: starts_with(), ends_with(), contains(), a range using : or regex(""). starts_with(), ends_with(), and contains() accept several patterns, e.g. starts_with("Sep", "Petal").
  - or a function testing for logical conditions, e.g. is.numeric() (or is.numeric), or any user-defined function that selects the variables for which the function returns TRUE (like: foo <- function(x) mean(x) > 3),
  - ranges specified via literal variable names, select-helpers (except regex()) and (user-defined) functions can be negated, i.e. return non-matching elements, when prefixed with a -, e.g. -ends_with(""), -is.numeric or -(Sepal.Width:Petal.Length). Note: Negation means that matches are excluded, and thus, the exclude argument can be used alternatively. For instance, select=-ends_with("Length") (with -) is equivalent to exclude=ends_with("Length") (no -). In case negation should not work as expected, use the exclude argument instead.
If NULL, selects all columns. Patterns that found no matches are silently ignored, e.g. extract_column_names(iris, select = c("Species", "Test")) will just return "Species".
- exclude
See select, however, column names matched by the pattern from exclude will be excluded instead of selected. If NULL (the default), excludes no columns.
- include_factors
Logical, if TRUE, factors are included in the output, however, only columns for range (first and last factor levels) as well as n and missing will contain information.
- ignore_case
Logical, if TRUE and when one of the select-helpers or a regular expression is used in select, ignores lower/upper case in the search pattern when matching against variable names.
- regex
Logical, if TRUE, the search pattern from select will be treated as regular expression. When regex = TRUE, select must be a character string (or a variable containing a character string) and is not allowed to be one of the supported select-helpers or a character vector of length > 1. regex = TRUE is comparable to using one of the two select-helpers, select = contains("") or select = regex(""), however, since the select-helpers may not work when called from inside other functions (see 'Details'), this argument may be used as workaround.
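As a rough illustration of what the threshold, iqr, ci and iterations arguments refer to, the following base R sketch computes a trimmed mean, a type-6 interquartile range and a simple percentile bootstrap interval by hand. The data vector is made up for illustration, and the percentile bootstrap is a generic stand-in: the procedure the package uses internally may differ, so the interval here is not expected to match the output of describe_distribution().

```r
# Toy data with one large outlier (hypothetical values for illustration)
x <- c(2, 4, 4, 5, 7, 9, 30)

# centrality = "trimmed" with threshold = 0.2: drop floor(7 * 0.2) = 1
# observation from each end of the sorted vector, then average the rest
trimmed_mean <- mean(x, trim = 0.2)

# iqr = TRUE: interquartile range based on quantile type 6
iqr_type6 <- stats::IQR(x, type = 6)

# Generic percentile bootstrap for the mean (an assumption for
# illustration; the package's own bootstrap procedure may differ)
set.seed(123)
iterations <- 100
boot_means <- replicate(iterations, mean(sample(x, replace = TRUE)))
ci_95 <- stats::quantile(boot_means, probs = c(0.025, 0.975))

print(trimmed_mean)
print(iqr_type6)
print(ci_95)
```

With only 100 replicates the interval bounds vary noticeably between seeds; a larger iterations value gives more stable bounds at the cost of run time.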
Details
If x is a data frame, only numeric variables are kept and will be displayed in the summary.
If x is a list, the behavior depends on whether x is a stored list. If x is stored (for example, describe_distribution(mylist) where mylist was created before), artificial variable names are used in the summary (Var_1, Var_2, etc.). If x is an unstored list (for example, describe_distribution(list(mtcars$mpg))), then "mtcars$mpg" is used as variable name.
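The difference between stored and unstored lists can be checked with a short sketch like the one below. It assumes the datawizard package is installed and that the returned summary is a data frame with a Variable column, as the printed tables in the Examples suggest. The object names mylist, stored and unstored are made up for illustration.

```r
library(datawizard)

# Stored list: created before the call, so the original expressions
# (mtcars$mpg, mtcars$cyl) are no longer visible to the function
mylist <- list(mtcars$mpg, mtcars$cyl)
stored <- describe_distribution(mylist)

# Unstored list: built inside the call, so the expressions can be
# used as variable names
unstored <- describe_distribution(list(mtcars$mpg, mtcars$cyl))

# Compare the variable names used in the two summaries
print(as.character(stored$Variable))
print(as.character(unstored$Variable))
```

If meaningful names are needed for a stored list, passing the columns as a data frame instead is usually the simpler route.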
Note
There is also a plot()-method implemented in the see-package.
Examples
describe_distribution(rnorm(100))
#> Mean | SD | IQR | Range | Skewness | Kurtosis | n | n_Missing
#> ---------------------------------------------------------------------------
#> -0.03 | 1.09 | 1.69 | [-3.31, 2.71] | -0.19 | 0.11 | 100 | 0
data(iris)
describe_distribution(iris)
#> Variable | Mean | SD | IQR | Range | Skewness | Kurtosis | n | n_Missing
#> ----------------------------------------------------------------------------------------
#> Sepal.Length | 5.84 | 0.83 | 1.30 | [4.30, 7.90] | 0.31 | -0.55 | 150 | 0
#> Sepal.Width | 3.06 | 0.44 | 0.52 | [2.00, 4.40] | 0.32 | 0.23 | 150 | 0
#> Petal.Length | 3.76 | 1.77 | 3.52 | [1.00, 6.90] | -0.27 | -1.40 | 150 | 0
#> Petal.Width | 1.20 | 0.76 | 1.50 | [0.10, 2.50] | -0.10 | -1.34 | 150 | 0
describe_distribution(iris, include_factors = TRUE, quartiles = TRUE)
#> Variable | Mean | SD | IQR | Range | Quartiles | Skewness
#> -------------------------------------------------------------------------------
#> Sepal.Length | 5.84 | 0.83 | 1.30 | [4.3, 7.9] | 5.10, 6.40 | 0.31
#> Sepal.Width | 3.06 | 0.44 | 0.52 | [2, 4.4] | 2.80, 3.30 | 0.32
#> Petal.Length | 3.76 | 1.77 | 3.52 | [1, 6.9] | 1.60, 5.10 | -0.27
#> Petal.Width | 1.20 | 0.76 | 1.50 | [0.1, 2.5] | 0.30, 1.80 | -0.10
#> Species | | | | [setosa, virginica] | | 0.00
#>
#> Variable | Kurtosis | n | n_Missing
#> -----------------------------------------
#> Sepal.Length | -0.55 | 150 | 0
#> Sepal.Width | 0.23 | 150 | 0
#> Petal.Length | -1.40 | 150 | 0
#> Petal.Width | -1.34 | 150 | 0
#> Species | -1.51 | 150 | 0
describe_distribution(list(mtcars$mpg, mtcars$cyl))
#> Variable | Mean | SD | IQR | Range | Skewness | Kurtosis | n | n_Missing
#> ----------------------------------------------------------------------------------------
#> mtcars$mpg | 20.09 | 6.03 | 7.53 | [10.40, 33.90] | 0.67 | -0.02 | 32 | 0
#> mtcars$cyl | 6.19 | 1.79 | 4.00 | [4.00, 8.00] | -0.19 | -1.76 | 32 | 0
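The examples above do not exercise select, regex or a non-default centrality. A minimal sketch of those arguments, assuming the datawizard package is installed and that the summary has Variable and Median columns, might look like this. Output is omitted because it depends on the installed package version.

```r
library(datawizard)
data(iris)

# Select columns by name and report the median (with MAD as dispersion)
by_name <- describe_distribution(
  iris,
  select = c("Sepal.Length", "Sepal.Width"),
  centrality = "median"
)

# Select the same columns with a regular expression; with regex = TRUE,
# select must be a single character string
by_regex <- describe_distribution(
  iris,
  select = "^Sepal",
  regex = TRUE,
  centrality = "median"
)

print(by_name)
print(by_regex)
```

Character-based selection is used here rather than select-helpers such as starts_with(), since the argument description notes that the helpers may not work when called from inside other functions.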