Most functions to fit multilevel and mixed effects models only
allow the user to specify frequency weights, but not design (i.e., sampling
or probability) weights, which should be used when analyzing complex samples
(e.g., probability samples). rescale_weights()
implements two algorithms,
one proposed by Asparouhov (2006) and Carle (2009), to rescale
design weights in survey data to account for the grouping structure of
multilevel models, and one based on the design effect proposed by
Kish (1965), to rescale weights by the design effect to account for
additional sampling error introduced by weighting.
Usage
rescale_weights(
data,
probability_weights = NULL,
by = NULL,
nest = FALSE,
method = "carle"
)
Arguments
- data
A data frame.
- probability_weights
Variable indicating the probability (design or sampling) weights of the survey data (level-1-weight), provided as character string or formula.
- by
Variable names (as character vector, or as formula), indicating the grouping structure (strata) of the survey data (level-2-cluster variable). It is also possible to create weights for multiple group variables; in such cases, each created weighting variable will be suffixed by the name of the group variable. This argument is required for
method = "carle"
, but optional formethod = "kish"
.- nest
Logical, if
TRUE
andby
indicates at least two group variables, then groups are "nested", i.e. groups are now a combination from each group level of the variables inby
. This argument is not used whenmethod = "kish"
.- method
String, indicating which rescale-method is used for rescaling weights. Can be either
"carle"
(default) or"kish"
. See 'Details'. Ifmethod = "carle"
, theby
argument is required.
Value
data
, including the new weighting variable(s). For method = "carle"
, new
columns rescaled_weights_a
and rescaled_weights_b
are returned, and for
method = "kish"
, the returned data contains a column rescaled_weights
.
These represent the rescaled design weights to use in multilevel models (use
these variables for the weights
argument).
Details
method = "carle"
Rescaling is based on two methods: For
rescaled_weights_a
, the sample weightsprobability_weights
are adjusted by a factor that represents the proportion of group size divided by the sum of sampling weights within each group. The adjustment factor forrescaled_weights_b
is the sum of sample weights within each group divided by the sum of squared sample weights within each group (see Carle (2009), Appendix B). In other words,rescaled_weights_a
"scales the weights so that the new weights sum to the cluster sample size" whilerescaled_weights_b
"scales the weights so that the new weights sum to the effective cluster size".Regarding the choice between scaling methods A and B, Carle suggests that "analysts who wish to discuss point estimates should report results based on weighting method A. For analysts more interested in residual between-group variance, method B may generally provide the least biased estimates". In general, it is recommended to fit a non-weighted model and weighted models with both scaling methods and when comparing the models, see whether the "inferential decisions converge", to gain confidence in the results.
Though the bias of scaled weights decreases with increasing group size, method A is preferred when insufficient or low group size is a concern.
The group ID and probably PSU may be used as random effects (e.g. nested design, or group and PSU as varying intercepts), depending on the survey design that should be mimicked.
method = "kish"
Rescaling is based on scaling the sample weights so the mean value is 1, which means the sum of all weights equals the sample size. Next, the design effect (Kish 1965) is calculated, which is the mean of the squared weights divided by the squared mean of the weights. The scaled sample weights are then divided by the design effect. This method is most appropriate when weights are based on additional variables beyond the grouping variables in the model (e.g., other demographic characteristics), but may also be useful in other contexts.
Some tests on real-world survey-data suggest that, in comparison to the Carle-method, the Kish-method comes closer to estimates from a regular survey-design using the survey package. Note that these tests are not representative and it is recommended to check your results against a standard survey-design.
References
Asparouhov T. (2006). General Multi-Level Modeling with Sampling Weights. Communications in Statistics - Theory and Methods 35: 439-460
Carle A.C. (2009). Fitting multilevel models in complex survey data with design weights: Recommendations. BMC Medical Research Methodology 9(49): 1-13
Kish, L. (1965) Survey Sampling. London: Wiley.
Examples
data(nhanes_sample)
head(rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA"))
#> total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a
#> 1 1 2.20 1 3 2 31 97593.68 1.5733612
#> 2 7 2.08 2 3 1 29 39599.36 0.6231745
#> 3 3 1.48 2 1 2 42 26619.83 0.8976966
#> 4 4 1.32 2 4 2 33 34998.53 0.7083628
#> 5 1 2.00 2 1 1 41 14746.45 0.4217782
#> 6 6 2.20 2 4 1 38 28232.10 0.6877550
#> rescaled_weights_b
#> 1 1.2005159
#> 2 0.5246593
#> 3 0.5439111
#> 4 0.5498944
#> 5 0.3119698
#> 6 0.5155503
# also works with multiple group-variables
head(rescale_weights(nhanes_sample, "WTINT2YR", c("SDMVSTRA", "SDMVPSU")))
#> total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR pweight_a_SDMVSTRA
#> 1 1 2.20 1 3 2 31 97593.68 1.5733612
#> 2 7 2.08 2 3 1 29 39599.36 0.6231745
#> 3 3 1.48 2 1 2 42 26619.83 0.8976966
#> 4 4 1.32 2 4 2 33 34998.53 0.7083628
#> 5 1 2.00 2 1 1 41 14746.45 0.4217782
#> 6 6 2.20 2 4 1 38 28232.10 0.6877550
#> pweight_b_SDMVSTRA pweight_a_SDMVPSU pweight_b_SDMVPSU
#> 1 1.2005159 1.8458164 1.3699952
#> 2 0.5246593 0.8217570 0.5780808
#> 3 0.5439111 0.5034683 0.3736824
#> 4 0.5498944 0.6619369 0.4913004
#> 5 0.3119698 0.3060151 0.2152722
#> 6 0.5155503 0.5858662 0.4121388
# or nested structures.
x <- rescale_weights(
data = nhanes_sample,
probability_weights = "WTINT2YR",
by = c("SDMVSTRA", "SDMVPSU"),
nest = TRUE
)
head(x)
#> total age RIAGENDR RIDRETH1 SDMVPSU SDMVSTRA WTINT2YR rescaled_weights_a
#> 1 1 2.20 1 3 2 31 97593.68 1.6532834
#> 2 7 2.08 2 3 1 29 39599.36 0.5492655
#> 3 3 1.48 2 1 2 42 26619.83 0.8341084
#> 4 4 1.32 2 4 2 33 34998.53 0.7937824
#> 5 1 2.00 2 1 1 41 14746.45 0.4285939
#> 6 6 2.20 2 4 1 38 28232.10 0.6215579
#> rescaled_weights_b
#> 1 1.3583967
#> 2 0.4875798
#> 3 0.5891700
#> 4 0.5941006
#> 5 0.3068214
#> 6 0.4759935
# \donttest{
# compare different methods, using multilevel-Poisson regression
d <- rescale_weights(nhanes_sample, "WTINT2YR", "SDMVSTRA")
result1 <- lme4::glmer(
total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
family = poisson(),
data = d,
weights = rescaled_weights_a
)
result2 <- lme4::glmer(
total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
family = poisson(),
data = d,
weights = rescaled_weights_b
)
d <- rescale_weights(
nhanes_sample,
"WTINT2YR",
method = "kish"
)
result3 <- lme4::glmer(
total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
family = poisson(),
data = d,
weights = rescaled_weights
)
d <- rescale_weights(
nhanes_sample,
"WTINT2YR",
"SDMVSTRA",
method = "kish"
)
result4 <- lme4::glmer(
total ~ factor(RIAGENDR) + log(age) + factor(RIDRETH1) + (1 | SDMVPSU),
family = poisson(),
data = d,
weights = rescaled_weights
)
parameters::compare_parameters(
list(result1, result2, result3, result4),
exponentiate = TRUE,
column_names = c("Carle (A)", "Carle (B)", "Kish", "Kish (grouped)")
)
#> Number of weighted observations differs from number of unweighted
#> observations.
#> Parameter | Carle (A) | Carle (B)
#> ----------------------------------------------------------
#> (Intercept) | 12.20 (10.52, 14.14) | 11.95 (10.27, 13.92)
#> RIAGENDR [2] | 0.41 ( 0.40, 0.42) | 0.42 ( 0.40, 0.43)
#> age [log] | 1.69 ( 1.63, 1.75) | 1.66 ( 1.60, 1.73)
#> RIDRETH1 [2] | 0.90 ( 0.84, 0.97) | 0.90 ( 0.83, 0.98)
#> RIDRETH1 [3] | 1.19 ( 1.14, 1.24) | 1.21 ( 1.16, 1.27)
#> RIDRETH1 [4] | 2.16 ( 2.07, 2.26) | 2.16 ( 2.06, 2.28)
#> RIDRETH1 [5] | 1.01 ( 0.95, 1.07) | 1.05 ( 0.97, 1.12)
#> ----------------------------------------------------------
#> Observations | 2617 | 1965
#>
#> Parameter | Kish | Kish (grouped)
#> ---------------------------------------------------------
#> (Intercept) | 11.72 (9.89, 13.87) | 11.95 (10.27, 13.92)
#> RIAGENDR [2] | 0.42 (0.41, 0.43) | 0.42 ( 0.40, 0.43)
#> age [log] | 1.66 (1.59, 1.73) | 1.66 ( 1.60, 1.73)
#> RIDRETH1 [2] | 0.98 (0.90, 1.07) | 0.90 ( 0.83, 0.98)
#> RIDRETH1 [3] | 1.23 (1.17, 1.29) | 1.21 ( 1.16, 1.27)
#> RIDRETH1 [4] | 2.11 (1.99, 2.23) | 2.16 ( 2.06, 2.28)
#> RIDRETH1 [5] | 1.09 (1.01, 1.18) | 1.05 ( 0.97, 1.12)
#> ---------------------------------------------------------
#> Observations | 1903 | 1965
# }