The get_predicted() function is a robust, flexible and user-friendly alternative to the base R predict() function. Additional features and advantages include the availability of uncertainty intervals (CI), bootstrapping, a more intuitive API and support for more models than base R's predict() function. However, although the interface is simplified, it is still very important to read the documentation of the arguments. This is because making "predictions" (a loose term for a variety of things) is a non-trivial process, with lots of caveats and complications. Read the 'Details' section for more information.

get_predicted_ci() returns the confidence (or prediction) interval (CI) associated with predictions made by a model. This function can be called separately on a vector of predicted values. get_predicted() usually returns confidence intervals (included as an attribute, and accessible via the as.data.frame() method) by default. Prefer get_predicted() for standard errors and confidence intervals, and use get_predicted_ci() only if standard errors and confidence intervals are not available otherwise.

Usage

get_predicted(x, ...)

# Default S3 method
get_predicted(
  x,
  data = NULL,
  predict = "expectation",
  ci = NULL,
  ci_type = "confidence",
  ci_method = NULL,
  dispersion_method = "sd",
  vcov = NULL,
  vcov_args = NULL,
  verbose = TRUE,
  ...
)

# S3 method for class 'lm'
get_predicted(
  x,
  data = NULL,
  predict = "expectation",
  ci = NULL,
  iterations = NULL,
  verbose = TRUE,
  ...
)

# S3 method for class 'stanreg'
get_predicted(
  x,
  data = NULL,
  predict = "expectation",
  iterations = NULL,
  ci = NULL,
  ci_method = NULL,
  include_random = "default",
  include_smooth = TRUE,
  verbose = TRUE,
  ...
)

# S3 method for class 'gam'
get_predicted(
  x,
  data = NULL,
  predict = "expectation",
  ci = NULL,
  include_random = TRUE,
  include_smooth = TRUE,
  iterations = NULL,
  verbose = TRUE,
  ...
)

# S3 method for class 'lmerMod'
get_predicted(
  x,
  data = NULL,
  predict = "expectation",
  ci = NULL,
  ci_method = NULL,
  include_random = "default",
  iterations = NULL,
  verbose = TRUE,
  ...
)

# S3 method for class 'principal'
get_predicted(x, data = NULL, ...)

Arguments

x

A statistical model (can also be a data.frame, in which case the second argument has to be a model).

...

Other arguments to be passed, for instance to get_predicted_ci().

data

An optional data frame in which to look for variables with which to predict. If omitted, the data used to fit the model is used. Visualization matrices can be generated using get_datagrid().

predict

string or NULL

  • "link" returns predictions on the model's link-scale (for logistic models, that means the log-odds scale) with a confidence interval (CI).

  • "expectation" (default) also returns confidence intervals, but this time the output is on the response scale (for logistic models, that means probabilities).

  • "prediction" also gives an output on the response scale, but this time associated with a prediction interval (PI), which is larger than a confidence interval (though it mostly makes sense for linear models).

  • "classification" only differs from "prediction" for binomial models where it additionally transforms the predictions into the original response's type (for instance, to a factor).

  • Other strings are passed directly to the type argument of the predict() method supplied by the modelling package.

  • Specifically for models of class brmsfit (package brms), the predict argument can be any valid option for the dpar argument, to predict distributional parameters (such as "sigma", "beta", "kappa", "phi" and so on, see ?brms::brmsfamily).

  • When predict = NULL, alternative arguments such as type will be captured by the ... ellipsis and passed directly to the predict() method supplied by the modelling package. Note that this might result in conflicts with multiple matching type arguments - thus, the recommendation is to use the predict argument for those values.

  • Notes: You can see the 4 options for predictions as lying on a gradient from "close to the model" to "close to the response data": "link", "expectation", "prediction", "classification". The predict argument modulates two things: the scale of the output and the type of uncertainty interval. Read more about this in the Details section below.
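The difference between the link scale and the response scale can be illustrated with base R alone. The following is a minimal sketch using stats::predict() on a logistic model, not a description of how get_predicted() works internally; the object names (dat, model, link_pred, resp_pred) are chosen here for illustration.

```r
# Base R sketch: link scale vs. response scale for a logistic model
data(iris)
dat <- droplevels(iris[1:100, ])
model <- glm(Species ~ Sepal.Length, data = dat, family = "binomial")

# Log-odds, comparable to predict = "link"
link_pred <- predict(model, type = "link")

# Probabilities, comparable to predict = "expectation"
resp_pred <- predict(model, type = "response")

# Applying the inverse link function to the log-odds recovers the probabilities
back_transformed <- model$family$linkinv(link_pred)
all.equal(unname(back_transformed), unname(resp_pred))
```

The two scales carry the same information; the response scale is simply the inverse-link transformation of the link scale.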

ci

The interval level. Default is NULL, to be fast even for larger models. Set the interval level to an explicit value (e.g., 0.95 for a 95% CI) to compute intervals.

ci_type

Can be "prediction" or "confidence". Prediction intervals show the range that likely contains the value of a new observation (in what range it would fall), whereas confidence intervals reflect the uncertainty around the estimated parameters (and give the range of the link; for instance, of the regression line in a linear regression). Prediction intervals account for both the uncertainty in the model's parameters and the random variation of the individual values. Thus, prediction intervals are always wider than confidence intervals. Moreover, prediction intervals will not necessarily become narrower as the sample size increases (as they do not reflect only the quality of the fit). This applies mostly to "simple" linear models (like lm), as for other models (e.g., glm), prediction intervals are somewhat useless (for instance, for a binomial model for which the dependent variable is a vector of 1s and 0s, the prediction interval is... [0, 1]).
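For a linear model, the relationship between the two interval types can be checked with base R. The sketch below uses the interval argument of stats::predict() for lm objects rather than get_predicted(); new_data, conf_int and pred_int are names introduced here for illustration.

```r
# Base R sketch: confidence vs. prediction intervals for a linear model
data(mtcars)
model <- lm(mpg ~ cyl + hp, data = mtcars)
new_data <- data.frame(cyl = c(4, 6, 8), hp = c(90, 120, 200))

# Uncertainty around the regression line
conf_int <- predict(model, newdata = new_data, interval = "confidence", level = 0.95)

# Uncertainty around a new observation (adds the residual variation)
pred_int <- predict(model, newdata = new_data, interval = "prediction", level = 0.95)

ci_width <- conf_int[, "upr"] - conf_int[, "lwr"]
pi_width <- pred_int[, "upr"] - pred_int[, "lwr"]

# Point predictions are identical; only the interval width differs
cbind(fit = conf_int[, "fit"], ci_width, pi_width)
```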

ci_method

The method for computing p values and confidence intervals. Possible values depend on model type.

  • NULL uses the default method, which varies based on the model type.

  • Most frequentist models: "wald" (default), "residual" or "normal".

  • Bayesian models: "quantile" (default), "hdi", "eti", and "spi".

  • Mixed effects lme4 models: "wald" (default), "residual", "normal", "satterthwaite", and "kenward-roger".

See get_df() for details.

dispersion_method

Bootstrap dispersion and Bayesian posterior summary: "sd" or "mad".

vcov

Variance-covariance matrix used to compute uncertainty estimates (e.g., for robust standard errors). This argument accepts a covariance matrix, a function which returns a covariance matrix, or a string which identifies the function to be used to compute the covariance matrix.

  • A covariance matrix

  • A function which returns a covariance matrix (e.g., stats::vcov())

  • A string which indicates the kind of uncertainty estimates to return.

    • Heteroskedasticity-consistent: "HC", "HC0", "HC1", "HC2", "HC3", "HC4", "HC4m", "HC5". See ?sandwich::vcovHC

    • Cluster-robust: "CR", "CR0", "CR1", "CR1p", "CR1S", "CR2", "CR3". See ?clubSandwich::vcovCR

    • Bootstrap: "BS", "xy", "residual", "wild", "mammen", "fractional", "jackknife", "norm", "webb". See ?sandwich::vcovBS

    • Other sandwich package functions: "HAC", "PC", "CL", "OPG", "PL".

    • Kenward-Roger approximation: "kenward-roger". See ?pbkrtest::vcovAdj.
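To make the role of the covariance matrix concrete, the following base R sketch computes standard errors of fitted values from a given covariance matrix, first with the classical stats::vcov() and then with a hand-computed HC0 sandwich estimator. It is an illustration of the idea only: se_from_vcov() is a hypothetical helper written for this example, and the HC0 matrix is built manually rather than through the sandwich package.

```r
# Base R sketch: how a covariance matrix translates into SEs of predictions
data(mtcars)
model <- lm(mpg ~ cyl + hp, data = mtcars)
X <- model.matrix(model)
e <- residuals(model)

# Hypothetical helper: SE of each fitted value, sqrt(diag(X V X'))
se_from_vcov <- function(X, V) sqrt(rowSums((X %*% V) * X))

# Classical standard errors (same as predict(model, se.fit = TRUE)$se.fit)
se_classical <- se_from_vcov(X, vcov(model))

# HC0 sandwich estimator: (X'X)^-1 X' diag(e^2) X (X'X)^-1
bread <- solve(crossprod(X))
meat <- crossprod(X * e)
vcov_hc0 <- bread %*% meat %*% bread
se_hc0 <- se_from_vcov(X, vcov_hc0)

head(cbind(se_classical, se_hc0))
```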

vcov_args

List of arguments to be passed to the function identified by the vcov argument. This function is typically supplied by the sandwich or clubSandwich packages. Please refer to their documentation (e.g., ?sandwich::vcovHAC) to see the list of available arguments. If no estimation type (argument type) is given, the default type for "HC" equals the default from the sandwich package; for type "CR", the default is set to "CR3".

verbose

Toggle warnings.

iterations

For Bayesian models, this corresponds to the number of posterior draws. If NULL, will return all the draws (one for each iteration of the model). For frequentist models, if not NULL, will generate bootstrapped draws, from which bootstrapped CIs will be computed. Iterations can be accessed by running as.data.frame(..., keep_iterations = TRUE) on the output.
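The general idea behind bootstrapped draws for a frequentist model can be sketched in base R as follows. This assumes a simple case-resampling bootstrap; the resampling scheme that get_predicted() actually uses may differ, and n_iter, boot_draws and boot_summary are names introduced for this example.

```r
# Base R sketch: bootstrapped predictions via case resampling (an assumption)
set.seed(123)
data(mtcars)
n_iter <- 200

# Each column is one bootstrap iteration, each row one observation
boot_draws <- replicate(n_iter, {
  rows <- sample(nrow(mtcars), replace = TRUE)
  refit <- lm(mpg ~ cyl + hp, data = mtcars[rows, ])
  predict(refit, newdata = mtcars)
})

# Summarize the draws: point estimate and percentile interval per observation
boot_ci <- t(apply(boot_draws, 1, quantile, probs = c(0.025, 0.975)))
boot_summary <- data.frame(
  Predicted = rowMeans(boot_draws),
  CI_low = boot_ci[, 1],
  CI_high = boot_ci[, 2]
)
head(boot_summary)
```

With only a handful of iterations, as in iterations = 4, the summary is very noisy; a few hundred draws or more give more stable intervals.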

include_random

If "default", include all random effects in the prediction, unless random effect variables are not in the data. If TRUE, include all random effects in the prediction (in this case, it will be checked if actually all random effect variables are in data). If FALSE, don't take them into account. Can also be a formula to specify which random effects to condition on when predicting (passed to the re.form argument). If include_random = TRUE and data is provided, make sure to include the random effect variables in data as well.

include_smooth

For Generalized Additive Models (GAMs). If FALSE, will fix the value of the smooth to its average, so that the predictions do not depend on it.

Value

The fitted values (i.e. predictions for the response). For Bayesian or bootstrapped models (when iterations != NULL), iterations (as columns, with observations as rows) can be accessed via as.data.frame().

Details

In insight::get_predicted(), the predict argument jointly modulates two separate concepts, the scale and the uncertainty interval.

Confidence Interval (CI) vs. Prediction Interval (PI)

  • Linear models - lm(): For linear models, prediction intervals (predict="prediction") show the range that likely contains the value of a new observation (in what range it is likely to fall), whereas confidence intervals (predict="expectation" or predict="link") reflect the uncertainty around the estimated parameters (and give the range of uncertainty of the regression line). In general, Prediction Intervals (PIs) account for both the uncertainty in the model's parameters and the random variation of the individual values. Thus, prediction intervals are always wider than confidence intervals. Moreover, prediction intervals will not necessarily become narrower as the sample size increases (as they do not reflect only the quality of the fit, but also the variability within the data).

  • Generalized Linear models - glm(): For binomial models, prediction intervals are somewhat useless (for instance, for a binomial (Bernoulli) model for which the dependent variable is a vector of 1s and 0s, the prediction interval is... [0, 1]).

When users set the predict argument to "expectation", the predictions are returned on the response scale, which is arguably the most convenient way to understand and visualize relationships of interest. When users set the predict argument to "link", predictions are returned on the link scale, and no transformation is applied. For instance, for a logistic regression model, the response scale corresponds to the predicted probabilities, whereas the link-scale makes predictions of log-odds (probabilities on the logit scale). Note that when users select predict="classification" in binomial models, the get_predicted() function will first calculate predictions as if the user had selected predict="expectation". Then, it will round the responses in order to return the most likely outcome.
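The steps described above (intervals on the link scale, transformation to the response scale, then rounding for classification) can be sketched in base R. This is an illustration under the assumption of Wald intervals based on the normal distribution, not a copy of the package's implementation; out and predicted_class are names introduced here.

```r
# Base R sketch: link-scale CI, back-transformation, and classification
data(iris)
dat <- droplevels(iris[1:100, ])
model <- glm(Species ~ Sepal.Length, data = dat, family = "binomial")

# Predictions and standard errors on the link (log-odds) scale
link <- predict(model, type = "link", se.fit = TRUE)
z <- qnorm(0.975) # assumed: normal-based 95% Wald interval
inv <- model$family$linkinv

# Back-transforming the interval bounds keeps them inside [0, 1]
out <- data.frame(
  Predicted = inv(link$fit),
  CI_low = inv(link$fit - z * link$se.fit),
  CI_high = inv(link$fit + z * link$se.fit)
)

# Classification: round the probability, then map 0/1 to the factor levels
predicted_class <- factor(
  levels(dat$Species)[round(out$Predicted) + 1],
  levels = levels(dat$Species)
)
head(cbind(out, predicted_class))
```

Because the interval is built on the link scale and then transformed, it is asymmetric around the predicted probability on the response scale.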

Heteroscedasticity consistent standard errors

The arguments vcov and vcov_args can be used to calculate robust standard errors for confidence intervals of predictions. These arguments, when provided in get_predicted(), are passed down to get_predicted_ci(). See the documentation of get_predicted_ci() for more details.

Bayesian and Bootstrapped models and iterations

For predictions based on multiple iterations, for instance in the case of Bayesian models and bootstrapped predictions, the function used to compute the centrality (point-estimate predictions) can be modified via the centrality_function argument. For instance, get_predicted(model, centrality_function = stats::median). The default is mean. Individual draws can be accessed by running iter <- as.data.frame(get_predicted(model)), and their iterations can be reshaped into a long format by bayestestR::reshape_iterations(iter).
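What a centrality function and a dispersion method do to a set of draws can be shown on a small matrix. The values below are made up for illustration (3 observations as rows, 5 iterations as columns), and the sketch uses plain apply() rather than any insight function.

```r
# Base R sketch: summarizing iterations per observation (hypothetical draws)
draws <- matrix(
  c(10, 11, 12, 13, 30,
    20, 21, 22, 23, 24,
     5,  5,  6,  7,  7),
  nrow = 3, byrow = TRUE,
  dimnames = list(NULL, paste0("iter_", 1:5))
)

# Centrality: mean (the documented default) vs. median
centrality_mean <- apply(draws, 1, mean)
centrality_median <- apply(draws, 1, stats::median)

# Dispersion: "sd" vs. "mad"
dispersion_sd <- apply(draws, 1, stats::sd)
dispersion_mad <- apply(draws, 1, stats::mad)

data.frame(centrality_mean, centrality_median, dispersion_sd, dispersion_mad)
```

In the first row, the single outlying draw (30) pulls the mean and the standard deviation upwards, while the median and the MAD are much less affected.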

Examples

data(mtcars)
x <- lm(mpg ~ cyl + hp, data = mtcars)

predictions <- get_predicted(x, ci = 0.95)
predictions
#> Predicted values:
#> 
#>  [1] 21.21678 21.21678 26.07124 21.21678 15.44448 21.31239 14.10597 26.66401
#>  [9] 26.03299 20.96820 20.96820 15.34888 15.34888 15.34888 14.87083 14.67962
#> [17] 14.39279 26.58752 26.85523 26.60665 25.99475 15.92253 15.92253 14.10597
#> [25] 15.44448 26.58752 26.10948 25.68880 13.74265 19.97387 12.38501 25.76529
#> 
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
#> 

# Options and methods ---------------------
get_predicted(x, predict = "prediction")
#> Predicted values:
#> 
#>  [1] 21.21678 21.21678 26.07124 21.21678 15.44448 21.31239 14.10597 26.66401
#>  [9] 26.03299 20.96820 20.96820 15.34888 15.34888 15.34888 14.87083 14.67962
#> [17] 14.39279 26.58752 26.85523 26.60665 25.99475 15.92253 15.92253 14.10597
#> [25] 15.44448 26.58752 26.10948 25.68880 13.74265 19.97387 12.38501 25.76529
#> 
#> NOTE: Confidence intervals, if available, are stored as attributes and can be accessed using `as.data.frame()` on this output.
#> 

# Get CI
as.data.frame(predictions)
#>    Predicted        SE    CI_low  CI_high
#> 1   21.21678 0.7281647 19.727518 22.70605
#> 2   21.21678 0.7281647 19.727518 22.70605
#> 3   26.07124 0.9279509 24.173366 27.96911
#> 4   21.21678 0.7281647 19.727518 22.70605
#> 5   15.44448 0.9200310 13.562810 17.32616
#> 6   21.31239 0.7777664 19.721680 22.90310
#> 7   14.10597 1.0080670 12.044237 16.16769
#> 8   26.66401 0.9225132 24.777260 28.55076
#> 9   26.03299 0.9362657 24.118117 27.94787
#> 10  20.96820 0.6234320 19.693139 22.24326
#> 11  20.96820 0.6234320 19.693139 22.24326
#> 12  15.34888 0.8862558 13.536280 17.16147
#> 13  15.34888 0.8862558 13.536280 17.16147
#> 14  15.34888 0.8862558 13.536280 17.16147
#> 15  14.87083 0.8057154 13.222961 16.51871
#> 16  14.67962 0.8206255 13.001249 16.35798
#> 17  14.39279 0.8911693 12.570146 16.21544
#> 18  26.58752 0.9099596 24.726448 28.44860
#> 19  26.85523 0.9695585 24.872258 28.83820
#> 20  26.60665 0.9127445 24.739874 28.47342
#> 21  25.99475 0.9454598 24.061069 27.92843
#> 22  15.92253 1.1490264 13.572504 18.27255
#> 23  15.92253 1.1490264 13.572504 18.27255
#> 24  14.10597 1.0080670 12.044237 16.16769
#> 25  15.44448 0.9200310 13.562810 17.32616
#> 26  26.58752 0.9099596 24.726448 28.44860
#> 27  26.10948 0.9205392 24.226768 27.99220
#> 28  25.68880 1.0474287 23.546572 27.83104
#> 29  13.74265 1.2011595 11.286007 16.19930
#> 30  19.97387 0.7635547 18.412227 21.53552
#> 31  12.38501 2.1153615  8.058613 16.71141
#> 32  25.76529 1.0175965 23.684073 27.84651

# Bootstrapped
as.data.frame(get_predicted(x, iterations = 4))
#>    Predicted   iter_1   iter_2    iter_3   iter_4
#> 1   21.40432 21.39969 21.82240 20.970542 21.42465
#> 2   21.40432 21.39969 21.82240 20.970542 21.42465
#> 3   26.68824 27.17221 28.54269 23.659855 27.37819
#> 4   21.40432 21.39969 21.82240 20.970542 21.42465
#> 5   15.13788 15.17414 14.88754 16.075708 14.41413
#> 6   21.50666 21.44688 21.84475 21.200284 21.53475
#> 7   13.70504 14.51349 14.57464 12.859324 12.87272
#> 8   27.32278 27.46478 28.68126 25.084254 28.06082
#> 9   26.64730 27.15333 28.53375 23.567958 27.33415
#> 10  21.13822 21.27699 21.76429 20.373213 21.13838
#> 11  21.13822 21.27699 21.76429 20.373213 21.13838
#> 12  15.03554 15.12695 14.86519 15.845966 14.30403
#> 13  15.03554 15.12695 14.86519 15.845966 14.30403
#> 14  15.03554 15.12695 14.86519 15.845966 14.30403
#> 15  14.52381 14.89101 14.75344 14.697258 13.75353
#> 16  14.31912 14.79663 14.70874 14.237774 13.53333
#> 17  14.01208 14.65506 14.64169 13.548549 13.20302
#> 18  27.24090 27.42703 28.66338 24.900460 27.97274
#> 19  27.52747 27.55916 28.72596 25.543737 28.28102
#> 20  27.26137 27.43647 28.66785 24.946409 27.99476
#> 21  26.60636 27.13446 28.52481 23.476061 27.29011
#> 22  15.64961 15.41009 14.99929 17.224417 14.96464
#> 23  15.64961 15.41009 14.99929 17.224417 14.96464
#> 24  13.70504 14.51349 14.57464 12.859324 12.87272
#> 25  15.13788 15.17414 14.88754 16.075708 14.41413
#> 26  27.24090 27.42703 28.66338 24.900460 27.97274
#> 27  26.72917 27.19108 28.55163 23.751752 27.42223
#> 28  26.27885 26.98345 28.45329 22.740888 26.93779
#> 29  13.31613 14.33417 14.48970 11.986305 12.45434
#> 30  20.07382 20.78622 21.53184 17.983899 19.99334
#> 31  11.86282 13.66407 14.17233  8.723972 10.89091
#> 32  26.36073 27.02120 28.47117 22.924681 27.02587
# Same as as.data.frame(..., keep_iterations = FALSE)
summary(get_predicted(x, iterations = 4))
#>    Predicted
#> 1   20.79277
#> 2   20.79277
#> 3   25.80472
#> 4   20.79277
#> 5   15.02221
#> 6   20.87179
#> 7   13.91592
#> 8   26.29465
#> 9   25.77311
#> 10  20.58731
#> 11  20.58731
#> 12  14.94319
#> 13  14.94319
#> 14  14.94319
#> 15  14.54809
#> 16  14.39005
#> 17  14.15298
#> 18  26.23143
#> 19  26.45269
#> 20  26.24724
#> 21  25.74151
#> 22  15.41732
#> 23  15.41732
#> 24  13.91592
#> 25  15.02221
#> 26  26.23143
#> 27  25.83633
#> 28  25.48864
#> 29  13.61564
#> 30  19.76550
#> 31  12.49355
#> 32  25.55186

# Different prediction types ------------------------
data(iris)
data <- droplevels(iris[1:100, ])

# Fit a logistic model
x <- glm(Species ~ Sepal.Length, data = data, family = "binomial")

# Expectation (default): response scale + CI
pred <- get_predicted(x, predict = "expectation", ci = 0.95)
head(as.data.frame(pred))
#>    Predicted         SE      CI_low    CI_high
#> 1 0.16579367 0.05943589 0.078854431 0.31573138
#> 2 0.06637193 0.03625646 0.022083989 0.18286787
#> 3 0.02479825 0.01843411 0.005675609 0.10175666
#> 4 0.01498061 0.01261461 0.002839122 0.07513285
#> 5 0.10623680 0.04779474 0.042437982 0.24173444
#> 6 0.48159935 0.07901420 0.333158095 0.63336131

# Prediction: response scale + PI
pred <- get_predicted(x, predict = "prediction", ci = 0.95)
head(as.data.frame(pred))
#>    Predicted       CI_low      CI_high
#> 1 0.16579367 2.220446e-16 1.000000e+00
#> 2 0.06637193 2.220446e-16 1.000000e+00
#> 3 0.02479825 2.220446e-16 2.220446e-16
#> 4 0.01498061 2.220446e-16 2.220446e-16
#> 5 0.10623680 2.220446e-16 1.000000e+00
#> 6 0.48159935 2.220446e-16 1.000000e+00

# Link: link scale + CI
pred <- get_predicted(x, predict = "link", ci = 0.95)
head(as.data.frame(pred))
#>     Predicted        SE     CI_low    CI_high
#> 1 -1.61573668 0.4297415 -2.4580146 -0.7734588
#> 2 -2.64380391 0.5850960 -3.7905709 -1.4970369
#> 3 -3.67187114 0.7622663 -5.1658856 -2.1778567
#> 4 -4.18590475 0.8548690 -5.8614172 -2.5103923
#> 5 -2.12977030 0.5033646 -3.1163467 -1.1431939
#> 6 -0.07363584 0.3164854 -0.6939359  0.5466642

# Classification: classification "type" + PI
pred <- get_predicted(x, predict = "classification", ci = 0.95)
head(as.data.frame(pred))
#>   Predicted CI_low    CI_high
#> 1    setosa setosa versicolor
#> 2    setosa setosa versicolor
#> 3    setosa setosa     setosa
#> 4    setosa setosa     setosa
#> 5    setosa setosa versicolor
#> 6    setosa setosa versicolor