Skip to contents

Most functions to fit multilevel and mixed effects models only allow to specify frequency weights, but not design (i.e. sampling or probability) weights, which should be used when analyzing complex samples and survey data. rescale_weights() implements an algorithm proposed by Asparouhov (2006) and Carle (2009) to rescale design weights in survey data to account for the grouping structure of multilevel models, which then can be used for multilevel modelling.

Usage

rescale_weights(data, by, probability_weights, nest = FALSE)

Arguments

data

A data frame.

by

Variable names (as character vector, or as formula), indicating the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group variables; in such cases, each created weighting variable will be suffixed by the name of the group variable.

probability_weights

Variable indicating the probability (design or sampling) weights of the survey data (level-1-weight).

nest

Logical, if TRUE and by indicates at least two group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables in by.

Value

data, including the new weighting variables: pweights_a and pweights_b, which represent the rescaled design weights to use in multilevel models (use these variables for the weights argument).

Details

Rescaling is based on two methods: For pweights_a, the sample weights probability_weights are adjusted by a factor that represents the proportion of group size divided by the sum of sampling weights within each group. The adjustment factor for pweights_b is the sum of sample weights within each group divided by the sum of squared sample weights within each group (see Carle (2009), Appendix B). In other words, pweights_a "scales the weights so that the new weights sum to the cluster sample size" while pweights_b "scales the weights so that the new weights sum to the effective cluster size".

Regarding the choice between scaling methods A and B, Carle suggests that "analysts who wish to discuss point estimates should report results based on weighting method A. For analysts more interested in residual between-group variance, method B may generally provide the least biased estimates". In general, it is recommended to fit a non-weighted model and weighted models with both scaling methods and when comparing the models, see whether the "inferential decisions converge", to gain confidence in the results.

Though the bias of scaled weights decreases with increasing group size, method A is preferred when insufficient or low group size is a concern.

The group ID and probably PSU may be used as random effects (e.g. nested design, or group and PSU as varying intercepts), depending on the survey design that should be mimicked.

References

  • Carle A.C. (2009). Fitting multilevel models in complex survey data with design weights: Recommendations. BMC Medical Research Methodology 9(49): 1-13

  • Asparouhov T. (2006). General Multi-Level Modeling with Sampling Weights. Communications in Statistics - Theory and Methods 35: 439-460

Examples

if (require("lme4")) {
  data(nhanes_sample)
  head(rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR"))

  # also works with multiple group-variables
  head(rescale_weights(nhanes_sample, c("SDMVSTRA", "SDMVPSU"), "WTINT2YR"))

  # or nested structures.
  x <- rescale_weights(
    data = nhanes_sample,
    by = c("SDMVSTRA", "SDMVPSU"),
    probability_weights = "WTINT2YR",
    nest = TRUE
  )
  head(x)

  nhanes_sample <- rescale_weights(nhanes_sample, "SDMVSTRA", "WTINT2YR")

  glmer(
    total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)) + (1 | SDMVPSU),
    family = poisson(),
    data = nhanes_sample,
    weights = pweights_a
  )
}
#> Loading required package: lme4
#> Loading required package: Matrix
#> Generalized linear mixed model fit by maximum likelihood (Laplace
#>   Approximation) [glmerMod]
#>  Family: poisson  ( log )
#> Formula: total ~ factor(RIAGENDR) * (log(age) + factor(RIDRETH1)) + (1 |  
#>     SDMVPSU)
#>    Data: nhanes_sample
#> Weights: pweights_a
#>       AIC       BIC    logLik  deviance  df.resid 
#>  78844.27  78920.47 -39409.14  78818.27      2582 
#> Random effects:
#>  Groups  Name        Std.Dev.
#>  SDMVPSU (Intercept) 0.1018  
#> Number of obs: 2595, groups:  SDMVPSU, 2
#> Fixed Effects:
#>                         (Intercept)                    factor(RIAGENDR)2  
#>                            2.491806                            -1.021303  
#>                            log(age)                    factor(RIDRETH1)2  
#>                            0.838727                            -0.088629  
#>                   factor(RIDRETH1)3                    factor(RIDRETH1)4  
#>                           -0.013332                             0.722511  
#>                   factor(RIDRETH1)5           factor(RIAGENDR)2:log(age)  
#>                           -0.106522                            -1.012698  
#> factor(RIAGENDR)2:factor(RIDRETH1)2  factor(RIAGENDR)2:factor(RIDRETH1)3  
#>                           -0.009089                             0.732982  
#> factor(RIAGENDR)2:factor(RIDRETH1)4  factor(RIAGENDR)2:factor(RIDRETH1)5  
#>                            0.275962                             0.542074