
Estimate the average value of the response variable at each factor level or at representative values of a predictor, that is, at the values defined in a "data grid" or "reference grid". For plotting, check the examples in visualisation_recipe(). See also other related functions such as estimate_contrasts() and estimate_slopes().

Usage

estimate_means(
  model,
  by = "auto",
  predict = NULL,
  ci = 0.95,
  marginalize = "average",
  backend = getOption("modelbased_backend", "marginaleffects"),
  transform = NULL,
  verbose = TRUE,
  ...
)

Arguments

model

A statistical model.

by

The (focal) predictor variable(s) at which to evaluate the desired effect / mean / contrasts. Other predictors of the model that are not included here will be collapsed and "averaged" over (the effect will be estimated across them). The by argument is used to create a "reference grid" or "data grid" with representative values for the focal predictors. by can be a character (vector) naming the focal predictors (and optionally, representative values or levels), or a list of named elements. See details in insight::get_datagrid() to learn more about how to create data grids for predictors of interest.
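To make the idea of a reference grid concrete, here is a minimal base-R sketch that builds one by hand and predicts at each row. It only illustrates the concept and is not the package's internal code. The object names grid and Predicted are illustrative, and the grid roughly mirrors what by = c("Species", "Sepal.Width = c(2, 4)") asks for. Both predictors are focal here, so nothing is averaged over.

```r
# Hand-built reference grid (base R only); an illustrative stand-in for what
# `by` requests, not the package's internal code.
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

# Focal predictors: every Species level crossed with two chosen Sepal.Width values
grid <- expand.grid(
  Species = levels(iris$Species),
  Sepal.Width = c(2, 4)
)

# One model-based prediction per row of the grid
grid$Predicted <- predict(model, newdata = grid)
grid
```

In practice, insight::get_datagrid() builds such grids for you, including representative values for numeric predictors.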

predict

Is passed to the type argument in emmeans::emmeans() (when backend = "emmeans") or in marginaleffects::avg_predictions() (when backend = "marginaleffects"). For emmeans, see also this vignette. Valid options for predict are:

  • backend = "emmeans": predict can be "response", "link", "mu", "unlink", or "log". If predict = NULL (default), the most appropriate transformation is selected (which usually is "response").

  • backend = "marginaleffects": predict can be "response", "link" or any valid type option supported by the predict() method of the model's class (e.g., for zero-inflation models from package glmmTMB, you can choose predict = "zprob" or predict = "conditional" etc., see glmmTMB::predict.glmmTMB). By default, when predict = NULL, the most appropriate transformation is selected, which usually returns predictions or contrasts on the response scale.

"link" leaves the values on the scale of the linear predictor. "response" (or NULL) transforms them to the scale of the response variable. Thus, for a logistic model, "link" gives estimates expressed in log-odds (probabilities on the logit scale) and "response" gives probabilities. To predict distributional parameters (called "dpar" in other packages), for instance when using complex formulae in brms models, the predict argument can take the name of the parameter you want to estimate, for instance "sigma", "kappa", etc.

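The difference between the "link" and "response" scales can be checked with base R alone. The sketch below fits a logistic model with glm() rather than calling estimate_means(). The model and the object names new_data, on_link and on_response are chosen only for illustration.

```r
# Logistic regression on a built-in dataset (illustrative model choice)
model <- glm(vs ~ wt, data = mtcars, family = binomial())
new_data <- data.frame(wt = c(2.5, 3.5))

# Predictions in log-odds (scale of the linear predictor)
on_link <- predict(model, newdata = new_data, type = "link")

# Predictions as probabilities (scale of the response)
on_response <- predict(model, newdata = new_data, type = "response")

# Applying the inverse logit to the log-odds recovers the probabilities
all.equal(plogis(on_link), on_response)
```
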
ci

Confidence Interval (CI) level. Defaults to 0.95 (95%).

marginalize

Character string, indicating the type of marginalization. This dictates how the predictions are "averaged" over the non-focal predictors, i.e. those variables that are not specified in by or contrast.

  • "average" (default): Takes the mean value for non-focal numeric predictors and marginalizes over the factor levels of non-focal terms, which computes a kind of "weighted average" for the values at which these terms are held constant. These predictions are a good representation of the sample, because all possible values and levels of the non-focal predictors are considered. It answers the question, "What is the predicted value for an 'average' observation in my data?". It refers to randomly picking a subject of your sample and the result you get on average. This approach is the one taken by default in the emmeans package.

  • "population": Non-focal predictors are marginalized over the observations in the sample. The sample is replicated multiple times to produce "counterfactuals", and the predicted values are then averaged (aggregated/grouped by the focal terms). It can be considered as extrapolation to a hypothetical target population. Counterfactual predictions are useful, insofar as the results can also be transferred to other contexts (Dickerman and Hernan, 2020). It answers the question, "What is the predicted response value for the 'average' observation in the broader target population?". It does not only refer to the actual data in your observed sample, but also "what would be if" we had more data, or if we had data from a different sample.

In other words, the distinction between marginalization types resides in whether the predictions are made for:

  • A specific "individual" from the sample (i.e., a specific combination of predictor values): this is what is obtained when using estimate_relation() and the other prediction functions.

  • An average individual from the sample: obtained with estimate_means(..., marginalize = "average")

  • The broader, hypothetical target population: obtained with estimate_means(..., marginalize = "population")
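The two marginalization ideas can be imitated by hand in base R. This is a rough sketch under simplifying assumptions, not the package's implementation. It uses a Poisson model with one focal factor (am) and one non-focal numeric predictor (hp), so "average" reduces to holding hp at its sample mean. All object names are illustrative.

```r
# Illustrative data and model: a count outcome with a log link
dat <- transform(mtcars, am = factor(am, labels = c("automatic", "manual")))
model <- glm(carb ~ hp + am, data = dat, family = poisson())
am_levels <- levels(dat$am)

# "average"-style: hold the non-focal numeric predictor at its sample mean
avg_grid <- data.frame(
  am = factor(am_levels, levels = am_levels),
  hp = mean(dat$hp)
)
avg_means <- predict(model, newdata = avg_grid, type = "response")
names(avg_means) <- am_levels

# "population"-style: copy the whole sample once per focal level, set the
# focal predictor to that level for everyone, predict, then average
pop_means <- sapply(am_levels, function(level) {
  counterfactual <- dat
  counterfactual$am <- factor(level, levels = am_levels)
  mean(predict(model, newdata = counterfactual, type = "response"))
})

rbind(average = avg_means, population = pop_means)
```

Because the log link is non-linear, averaging the predictions over the observed hp values differs from predicting at the average hp. In a plain linear model without such a transformation, the two approaches would coincide for a numeric non-focal predictor.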

backend

Whether to use "emmeans" or "marginaleffects" as a backend. Results are usually very similar. The major difference will be found for mixed models, where backend = "marginaleffects" will also average across random effects levels, producing "marginal predictions" (instead of "conditional predictions", see Heiss 2022).

You can set a default backend via options(), e.g. use options(modelbased_backend = "emmeans") to use the emmeans package or options(modelbased_backend = "marginaleffects") to set marginaleffects as default backend.
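As a small sketch of how this option is resolved, the snippet below sets it temporarily, reads it back with the same getOption() call that appears in the Usage section, and then restores the previous state. The names old and active are illustrative.

```r
# Temporarily switch the default backend; `old` stores the previous setting
old <- options(modelbased_backend = "emmeans")

# Same lookup as in the function signature: falls back to "marginaleffects"
# only when the option is not set
active <- getOption("modelbased_backend", "marginaleffects")
active

# Restore whatever was set before
options(old)
```
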

transform

Deprecated, please use predict instead.

verbose

Use FALSE to silence messages and warnings.

...

Other arguments passed, for instance, to insight::get_datagrid(), to functions from the emmeans or marginaleffects package, or to process Bayesian models via bayestestR::describe_posterior(). Examples:

  • insight::get_datagrid(): Arguments such as length or range can be used to control the (number of) representative values.

  • marginaleffects: Internally used functions are avg_predictions() for means and contrasts, and avg_slopes() for slopes. Therefore, arguments such as vcov, transform, equivalence or slope can be passed to those functions.

  • emmeans: Internally used functions are emmeans() and emtrends(). Additional arguments can be passed to these functions.

  • Bayesian models: For Bayesian models, parameters are cleaned using describe_posterior(), thus, arguments like, for example, centrality, rope_range, or test are passed to that function.

Value

A data frame of estimated marginal means.

Details

The estimate_slopes(), estimate_means() and estimate_contrasts() functions form a group, as they are all based on marginal estimations (estimations based on a model). All three are built on the emmeans or marginaleffects package (depending on the backend argument), so reading the documentation of these packages (for instance emmeans::emmeans(), emmeans::emtrends() or this website) is recommended to understand the idea behind these types of procedures.

  • Model-based predictions are the basis for all that follows. Indeed, the first thing to understand is how models can be used to make predictions (see estimate_link()). A prediction is the predicted response (or "outcome variable") given specific values of the predictors (i.e., given a specific data configuration). This is why the concept of a reference grid is so important for direct predictions.

  • Marginal "means", obtained via estimate_means(), are an extension of such predictions, allowing you to "average" (collapse) some of the predictors, to obtain the average response value at a specific configuration of the predictors. This is typically used when some of the predictors of interest are factors. Indeed, the parameters of the model will usually give you the intercept value and then the "effect" of each factor level (how different it is from the intercept). Marginal means can be used to directly give you the mean value of the response variable at all the levels of a factor. Moreover, they can also be used to control for, or average over, predictors, which is useful in the case of multiple predictors with or without interactions.

  • Marginal contrasts, obtained via estimate_contrasts(), are themselves an extension of marginal means, in that they allow you to investigate the difference (i.e., the contrast) between the marginal means. This is, again, often used to get all pairwise differences between all levels of a factor. It also works for continuous predictors, for instance one could be interested in whether the difference between two extremes of a continuous predictor is significant.

  • Finally, marginal effects, obtained via estimate_slopes(), are different in that their focus is not values on the response variable, but the model's parameters. The idea is to assess the effect of a predictor at a specific configuration of the other predictors. This is relevant in the case of interactions or non-linear relationships, when the effect of a predictor variable changes depending on the other predictors. Moreover, these effects can also be "averaged" over other predictors, to get for instance the "general trend" of a predictor over different factor levels.

Example: Let's imagine the following model lm(y ~ condition * x) where condition is a factor with 3 levels A, B and C and x is a continuous variable (like age for example). One idea is to see how this model performs, and compare the actual response y to the one predicted by the model (using estimate_expectation()). Another idea is to evaluate the mean at each of the condition's levels (using estimate_means()), which can be useful to visualize them. Another possibility is to evaluate the difference between these levels (using estimate_contrasts()). Finally, one could also estimate the effect of x averaged over all conditions, or instead within each condition (using estimate_slopes()).
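The three quantities from this example can be approximated by hand in base R. The sketch below is only meant to convey the ideas and does not reproduce the package's computations (no standard errors or tests). It uses iris in place of the hypothetical data, with Species playing the role of condition and Sepal.Width the role of x. All object names are illustrative.

```r
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)
species <- levels(iris$Species)

# Means: prediction for each Species, with Sepal.Width held at its sample mean
grid <- data.frame(
  Species = factor(species, levels = species),
  Sepal.Width = mean(iris$Sepal.Width)
)
means <- predict(model, newdata = grid)
names(means) <- species

# Contrasts: pairwise differences between those means (row minus column)
differences <- outer(means, means, "-")

# Slopes: change in prediction per unit of Sepal.Width within each Species.
# A one-unit step gives the exact slope only because the model is linear in
# Sepal.Width.
grid_plus_one <- transform(grid, Sepal.Width = Sepal.Width + 1)
slopes <- predict(model, newdata = grid_plus_one) - means
names(slopes) <- species

means
differences
slopes
```

The slope for the reference level (setosa) equals the Sepal.Width coefficient of the model, while the slopes for the other levels add the corresponding interaction coefficients.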

References

Dickerman, Barbra A., and Miguel A. Hernán. 2020. Counterfactual Prediction Is Not Only for Causal Inference. European Journal of Epidemiology 35 (7): 615–17. doi:10.1007/s10654-020-00659-8

Heiss, A. (2022). Marginal and conditional effects for GLMMs with marginaleffects. Andrew Heiss. doi:10.59350/xwnfm-x1827

Examples

library(modelbased)

# Frequentist models
# -------------------
model <- lm(Petal.Length ~ Sepal.Width * Species, data = iris)

estimate_means(model)
#> We selected `by=c("Species")`.
#> Estimated Marginal Means
#> 
#> Species    | Mean |   SE |       95% CI | t(144)
#> ------------------------------------------------
#> setosa     | 1.43 | 0.08 | [1.28, 1.58] |  18.70
#> versicolor | 4.50 | 0.07 | [4.35, 4.65] |  60.64
#> virginica  | 5.61 | 0.06 | [5.50, 5.72] |  99.61
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species
#> Predictors averaged: Sepal.Width (3.1)
#> 

# the `length` argument is passed to `insight::get_datagrid()` and modulates
# the number of representative values to return for numeric predictors
estimate_means(model, by = c("Species", "Sepal.Width"), length = 2)
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI | t(144)
#> --------------------------------------------------------------
#> setosa     |        2.00 | 1.35 | 0.21 | [0.92, 1.77] |   6.28
#> versicolor |        2.00 | 3.61 | 0.15 | [3.33, 3.90] |  24.81
#> virginica  |        2.00 | 4.88 | 0.17 | [4.54, 5.23] |  27.92
#> setosa     |        4.40 | 1.54 | 0.15 | [1.24, 1.84] |  10.19
#> versicolor |        4.40 | 5.63 | 0.29 | [5.05, 6.20] |  19.34
#> virginica  |        4.40 | 6.53 | 0.25 | [6.04, 7.02] |  26.19
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species, Sepal.Width
#> 

# an alternative way to set up your data grid is to specify the values directly
estimate_means(model, by = c("Species", "Sepal.Width = c(2, 4)"))
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI | t(144)
#> --------------------------------------------------------------
#> setosa     |           2 | 1.35 | 0.21 | [0.92, 1.77] |   6.28
#> versicolor |           2 | 3.61 | 0.15 | [3.33, 3.90] |  24.81
#> virginica  |           2 | 4.88 | 0.17 | [4.54, 5.23] |  27.92
#> setosa     |           4 | 1.51 | 0.10 | [1.31, 1.70] |  15.19
#> versicolor |           4 | 5.29 | 0.22 | [4.85, 5.73] |  23.78
#> virginica  |           4 | 6.26 | 0.18 | [5.89, 6.62] |  34.11
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species, Sepal.Width = c(2, 4)
#> 

# or use one of the many predefined "tokens" that help you create a useful
# data grid - to learn more about creating data grids, see help in
# `?insight::get_datagrid`.
estimate_means(model, by = c("Species", "Sepal.Width = [fivenum]"))
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI | t(144)
#> --------------------------------------------------------------
#> setosa     |        2.00 | 1.35 | 0.21 | [0.92, 1.77] |   6.28
#> versicolor |        2.00 | 3.61 | 0.15 | [3.33, 3.90] |  24.81
#> virginica  |        2.00 | 4.88 | 0.17 | [4.54, 5.23] |  27.92
#> setosa     |        2.80 | 1.41 | 0.11 | [1.20, 1.62] |  13.28
#> versicolor |        2.80 | 4.29 | 0.05 | [4.18, 4.39] |  78.28
#> virginica  |        2.80 | 5.43 | 0.06 | [5.31, 5.56] |  87.55
#> setosa     |        3.00 | 1.43 | 0.08 | [1.26, 1.59] |  17.27
#> versicolor |        3.00 | 4.45 | 0.07 | [4.32, 4.59] |  65.68
#> virginica  |        3.00 | 5.57 | 0.05 | [5.46, 5.68] | 101.89
#> setosa     |        3.30 | 1.45 | 0.06 | [1.34, 1.57] |  25.21
#> versicolor |        3.30 | 4.70 | 0.11 | [4.49, 4.92] |  43.66
#> virginica  |        3.30 | 5.78 | 0.08 | [5.62, 5.93] |  74.17
#> setosa     |        4.40 | 1.54 | 0.15 | [1.24, 1.84] |  10.19
#> versicolor |        4.40 | 5.63 | 0.29 | [5.05, 6.20] |  19.34
#> virginica  |        4.40 | 6.53 | 0.25 | [6.04, 7.02] |  26.19
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species, Sepal.Width = [fivenum]
#> 

# same for factors: filter by specific levels
estimate_means(model, by = "Species=c('versicolor', 'setosa')")
#> Estimated Marginal Means
#> 
#> Species    | Mean |   SE |       95% CI | t(144)
#> ------------------------------------------------
#> versicolor | 4.50 | 0.07 | [4.35, 4.65] |  60.64
#> setosa     | 1.43 | 0.08 | [1.28, 1.58] |  18.70
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species=c('versicolor', 'setosa')
#> Predictors averaged: Sepal.Width (3.1)
#> 
estimate_means(model, by = c("Species", "Sepal.Width=0"))
#> Estimated Marginal Means
#> 
#> Species    | Sepal.Width | Mean |   SE |       95% CI | t(144)
#> --------------------------------------------------------------
#> setosa     |           0 | 1.18 | 0.50 | [0.19, 2.17] |   2.36
#> versicolor |           0 | 1.93 | 0.49 | [0.97, 2.90] |   3.96
#> virginica  |           0 | 3.51 | 0.51 | [2.50, 4.52] |   6.88
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species, Sepal.Width=0
#> 

# estimate marginal average of response at values for numeric predictor
estimate_means(model, by = "Sepal.Width", length = 5)
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI | t(144)
#> -------------------------------------------------
#> 2.00        | 3.28 | 0.10 | [3.07, 3.49] |  31.48
#> 2.60        | 3.60 | 0.06 | [3.49, 3.71] |  64.21
#> 3.20        | 3.92 | 0.04 | [3.84, 4.01] |  89.81
#> 3.80        | 4.25 | 0.08 | [4.08, 4.41] |  50.21
#> 4.40        | 4.57 | 0.14 | [4.30, 4.84] |  33.25
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Sepal.Width
#> Predictors averaged: Species
#> 
estimate_means(model, by = "Sepal.Width=c(2, 4)")
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI | t(144)
#> -------------------------------------------------
#> 2           | 3.28 | 0.10 | [3.07, 3.49] |  31.48
#> 4           | 4.35 | 0.10 | [4.15, 4.55] |  42.81
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Sepal.Width=c(2, 4)
#> Predictors averaged: Species
#> 

# or provide the definition of the data grid as list
estimate_means(
  model,
  by = list(Sepal.Width = c(2, 4), Species = c("versicolor", "setosa"))
)
#> Estimated Marginal Means
#> 
#> Sepal.Width | Species    | Mean |   SE |       95% CI | t(144)
#> --------------------------------------------------------------
#> 2           | versicolor | 3.61 | 0.15 | [3.33, 3.90] |  24.81
#> 4           | versicolor | 5.29 | 0.22 | [4.85, 5.73] |  23.78
#> 2           | setosa     | 1.35 | 0.21 | [0.92, 1.77] |   6.28
#> 4           | setosa     | 1.51 | 0.10 | [1.31, 1.70] |  15.19
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Sepal.Width = c(2, 4), Species = c('versicolor', 'setosa')
#> 

# Methods that can be applied to it:
means <- estimate_means(model, by = c("Species", "Sepal.Width=0"))

plot(means) # which runs visualisation_recipe()

standardize(means)
#> Estimated Marginal Means (standardized)
#> 
#> Species    | Sepal.Width |  Mean |   SE |         95% CI | t(144)
#> -----------------------------------------------------------------
#> setosa     |       -7.01 | -1.46 | 0.28 | [-2.02, -0.90] |   2.36
#> versicolor |       -7.01 | -1.03 | 0.28 | [-1.58, -0.49] |   3.96
#> virginica  |       -7.01 | -0.14 | 0.29 | [-0.71,  0.43] |   6.88
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species, Sepal.Width=0
#> 

# \donttest{
data <- iris
data$Petal.Length_factor <- ifelse(data$Petal.Length < 4.2, "A", "B")

model <- lme4::lmer(
  Petal.Length ~ Sepal.Width + Species + (1 | Petal.Length_factor),
  data = data
)
estimate_means(model)
#> We selected `by=c("Species")`.
#> Estimated Marginal Means
#> 
#> Species    | Mean |   SE |       95% CI | t(144)
#> ------------------------------------------------
#> setosa     | 1.67 | 0.34 | [1.00, 2.35] |   4.88
#> versicolor | 4.27 | 0.34 | [3.61, 4.94] |  12.69
#> virginica  | 5.25 | 0.34 | [4.58, 5.92] |  15.45
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Species
#> Predictors averaged: Sepal.Width (3.1), Petal.Length_factor
#> 
estimate_means(model, by = "Sepal.Width", length = 3)
#> Estimated Marginal Means
#> 
#> Sepal.Width | Mean |   SE |       95% CI | t(144)
#> -------------------------------------------------
#> 2.00        | 3.40 | 0.35 | [2.72, 4.09] |   9.84
#> 3.20        | 3.78 | 0.33 | [3.12, 4.43] |  11.35
#> 4.40        | 4.15 | 0.35 | [3.45, 4.85] |  11.70
#> 
#> Variable predicted: Petal.Length
#> Predictors modulated: Sepal.Width
#> Predictors averaged: Species, Petal.Length_factor
#> 
# }