
Check variables for within- and/or between-group variation
Source:R/check_group_variation.R
check_group_variation.Rd
Checks if variables vary within and/or between levels of grouping variables.
This function can be used to infer the hierarchical Design of a given
dataset, or detect any predictors that might cause heterogeneity bias (Bell
and Jones, 2015). Use summary()
on the output if you are mainly interested
if and which predictors are possibly affected by heterogeneity bias.
Usage
check_group_variation(x, ...)
# Default S3 method
check_group_variation(x, ...)
# S3 method for class 'data.frame'
check_group_variation(
x,
select = NULL,
by = NULL,
include_by = FALSE,
numeric_as_factor = FALSE,
tolerance_numeric = 1e-04,
tolerance_factor = "crossed",
...
)
# S3 method for class 'check_group_variation'
summary(object, flatten = FALSE, ...)
Arguments
- x
A data frame or a mixed model. See details and examples.
- ...
Arguments passed to other methods
- select
Character vector (or formula) with names of variables to select that should be checked. If
NULL
, selects all variables (except those inby
).- by
Character vector (or formula) with the name of the variable that indicates the group- or cluster-ID. For cross-classified or nested designs,
by
can also identify two or more variables as group- or cluster-IDs.- include_by
When there is more than one grouping variable, should they be check against each other?
- numeric_as_factor
Should numeric variables be tested as factors?
- tolerance_numeric
The minimal percent of variation (observed icc) that is tolerated to indicate no within- or no between-effect.
- tolerance_factor
How should a non-numeric variable be identified as varying only "within" a grouping variable? Options are:
"crossed"
- if all groups have all unique values of X."balanced"
- if all groups have all unique values of X, with equal frequency.
- object
result from
check_group_variation()
- flatten
Logical, if
TRUE
, the values are returned as character vector, not as list. Duplicated values are removed.
Details
This function attempt to identify the variability of a set of variables
(select
) with respect to one or more grouping variables (by
). If x
is a
(mixed effect) model, the variability of the fixed effects predictors are
checked with respect to the random grouping variables.
Generally, a variable is considered to vary between groups if is correlated
with those groups, and to vary within groups if it not a constant within at
least one group.
Numeric variables
Numeric variables are partitioned via datawizard::demean()
to their
within- and between-group components. Then, the variance for each of these
two component is calculated. Variables with within-group variance larger than
tolerance_numeric
are labeled as within, variables with a between-group
variance larger than tolerance_numeric
are labeled as between, and
variables with both variances larger than tolerance_numeric
are labeled as
both.
Setting numeric_as_factor = TRUE
causes numeric variables to be tested
using the following criteria.
Non-numeric variables
These variables can have one of the following three labels:
between - the variable is correlated with the groups, and is fixed within each group (each group has exactly one unique, constant value)
within - the variable is crossed with the grouping variable, such that all possible values appear within each group. The
tolerance_factor
argument controls if full balance is also required.both - the variable is correlated with the groups, but also varies within each group but is not fully crossed (or, when
tolerance_factor = "balanced"
the variable is fully crossed, but not perfectly balanced).
Additionally, the design of non-numeric variables is also checked to see if
they are nested within the groups or is they are crossed. This is
indicated by the Design
column.
Heterogeneity bias
Variables that vary both within and between groups can cause a heterogeneity
bias (Bell and Jones, 2015). It is recommended to center (person-mean
centering) those variables to avoid this bias. See datawizard::demean()
for further details. Use summary()
to get a short text result that
indicates if and which predictors are possibly affected by heterogeneity
bias.
References
Bell A, Jones K. 2015. Explaining Fixed Effects: Random Effects Modeling of Time-Series Cross-Sectional and Panel Data. Political Science Research and Methods, 3(1), 133–153.
See also
For further details, read the vignette
https://easystats.github.io/parameters/articles/demean.html and also
see documentation for datawizard::demean()
.
Examples
data(npk)
check_group_variation(npk, by = "block")
#> Check block variation
#>
#> Variable | Variation | Design
#> ------------------------------
#> N | within | crossed
#> P | within | crossed
#> K | within | crossed
#> yield | both |
data(iris)
check_group_variation(iris, by = "Species")
#> Check Species variation
#>
#> Variable | Variation | Design
#> ---------------------------------
#> Sepal.Length | both |
#> Sepal.Width | both |
#> Petal.Length | both |
#> Petal.Width | both |
data(ChickWeight)
check_group_variation(ChickWeight, by = "Chick")
#> Check Chick variation
#>
#> Variable | Variation | Design
#> -----------------------------
#> weight | both |
#> Time | both |
#> Diet | between |
# A subset of mlmRev::egsingle
egsingle <- data.frame(
schoolid = factor(rep(c("2020", "2820"), times = c(18, 6))),
lowinc = rep(c(TRUE, FALSE), times = c(18, 6)),
childid = factor(rep(
c("288643371", "292020281", "292020361", "295341521"),
each = 6
)),
female = rep(c(TRUE, FALSE), each = 12),
year = rep(1:6, times = 4),
math = c(
-3.068, -1.13, -0.921, 0.463, 0.021, 2.035,
-2.732, -2.097, -0.988, 0.227, 0.403, 1.623,
-2.732, -1.898, -0.921, 0.587, 1.578, 2.3,
-2.288, -2.162, -1.631, -1.555, -0.725, 0.097
)
)
result <- check_group_variation(
egsingle,
by = c("schoolid", "childid"),
include_by = TRUE
)
result
#> Check schoolid variation
#>
#> Variable | Variation | Design
#> -----------------------------
#> childid | both | nested
#> lowinc | between | nested
#> female | both |
#> year | within |
#> math | both |
#>
#> Check childid variation
#>
#> Variable | Variation | Design
#> -----------------------------
#> schoolid | between |
#> lowinc | between |
#> female | between |
#> year | within |
#> math | both |
summary(result)
#> Possible heterogeneity bias due to following predictors:
#> - childid:
#> math
#> - schoolid: childid, female, math
data(sleepstudy, package = "lme4")
check_group_variation(sleepstudy, select = "Days", by = "Subject")
#> Check Subject variation
#>
#> Variable | Variation | Design
#> -----------------------------
#> Days | within |
# Or
mod <- lme4::lmer(Reaction ~ Days + (Days | Subject), data = sleepstudy)
result <- check_group_variation(mod)
result
#> Check Subject variation
#>
#> Variable | Variation | Design
#> -----------------------------
#> Days | within |
summary(result)
#> No predictor found that could cause heterogeneity bias.