This example demonstrates estimating a univariate distribution with standard errors from balanced repeated replication (BRR).
Workflow:
For each replicate: 4. MCMC Model 5. Simulate Usual Intakes 6. Summary Statistics
The dataset used for this analysis is derived from the 2005-2010
NHANES data. A subset of six strata (SDMVSTRA) will be used
to reduce computation time and allow this example to run in real
time.
The NHANES dataset strata are masked pseudo-strata that hide the confidential information used to make the true design strata. These pseudo-strata are designed to produce the same variance estimates as the true strata.
Individuals subjects are identified by SEQN. Each
subject has up to two dietary recalls which are identified by
DAY.
The nutrient being analyzed is sodium (TSODI).
The covariates being examined are smoking status
(SMK_REC), age (RIDAGEYR), sex
(RIAGENDR). Two nuisance covariates are being accounted
for, whether the recall was on a weekend (Weekend) and
whether the recall is on day 2 (Day2). They are considered
nuisance covariates because they are not necessarily of interest to
study but must be accounted for to get an accurate distribution.
The WTDRD1 variable is the survey weighting of each
observation.
The functions in this package expect data in a tall format where each row represents one observation of a subject. Not all subjects need to have the same number of observations as long as there are sufficient subjects that have two or more observations. A good rule is to have at least 50 subjects with two or more observations where each variable of interest is non-zero.
Looking at the first 10 observations shows the structure of the data.
#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]
#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)
#First 10 observations
head(input.dataset[,c("SEQN", "DAY", "TSODI", "SMK_REC", "RIDAGEYR", "RIAGENDR", "Day2", "Weekend", "WTDRD1")], 10)
#> SEQN DAY TSODI SMK_REC RIDAGEYR RIAGENDR Day2 Weekend WTDRD1
#> 7 31155 1 5720 3 38 0 FALSE 1 22335.658
#> 8 31155 2 4552 3 38 0 TRUE 0 22335.658
#> 16 31186 1 3453 1 46 1 FALSE 1 2969.506
#> 17 31186 2 4350 1 46 1 TRUE 0 2969.506
#> 18 31187 1 3260 3 22 1 FALSE 1 13055.354
#> 19 31187 2 4026 3 22 1 TRUE 0 13055.354
#> 34 31214 1 282 2 46 1 FALSE 1 21388.889
#> 35 31214 2 3801 2 46 1 TRUE 0 21388.889
#> 42 31228 1 2654 2 34 0 FALSE 1 5018.875
#> 43 31228 2 441 2 34 0 TRUE 1 5018.875The data should now be prepared for analysis.
All subjects used in the analysis must be complete cases (i.e, each subject must have all of the covariates, nutrients, and foods that are being analyzed). For this example, subjects who are missing one or more variables being analyzed will be removed from the dataset.
#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$TSODI)
input.dataset <- input.dataset[!missing.covariates & !missing.variables,]
#rename sodium variable for readability
input.dataset$Sodium <- input.dataset$TSODITo calculate standard errors, sets of BRR replicate weights must be generated based on the base survey weights. BRR is similar to bootstrap, but it is performed using a structured set of weights generated using a Hadamard matrix. For standard BRR, half of the sample has double weight and the other half has zero weight for each set of weights.
The number of weight sets is determined by dimension of the Hadamard matrix, which must be 1, 2, or a multiple of 4 and greater than the number of strata. Since this example uses a subset of the NHANES data with 6 Strata, a Hadamard matrix of dimension 8 is used, and 8 BRR replicates are required. If the full dataset were used, there are 46 strata so 48 BRR replicates would be needed. Since ordinary bootstrap usually requires 200 or more replicates, BRR is often far more efficient for datasets where it can be applied.
Post-stratification adjustment will also be performed so that the sum of the weights in each post-stratification cell is the same as for the base weights.
To use BRR, each strata in the dataset must have exactly two primary sampling units (PSUs). Most NHANES data, including the dataset used in this example, meets this criteria.
Some NHANES cycles contain unusual strata that do not have exactly
two PSUs. These strata can be corrected the
fix_nhanes_strata() utility. The function call is
demonstrated below even though all strata in the dataset used in this
example have 2 PSUs.
BRR replicate weight sets can be generated using the
brr_weights() function.
The brr_weights() function takes in the strata
(SDMVSTRA), PSU (SDMVPSU), base weight
(WTDRD1), and post-stratification cell
(PSCELL) variables.
The output of the function will contain the base weight variable
(RepWt_0) and a RepWt_ variable for each BRR
weight set. The output weights are integerized by rounding them to the
nearest integer and distributing the remainders to minimize rounding
error.
In this tutorial, the Fay method of BRR will be used. This variation of BRR adjusts the weights using a specified value called the Fay factor (f), which must be between 0 and 1. The weights in half of the sample are multiplied by 1 + f and the weights in the other half are multiplied by 1 - f for every replicate weight set. Using a Fay factor below 1 ensures that every observation will be used in every weight set. A Fay factor of 1 is the same as standard BRR.
For this example, a Fay factor of 0.7 will be used. This multiplies the weights in half of the sample by 1.7 (1 + 0.7) and multiplies the weights in the other half by 0.3 (1 - 0.7) for every replicate weight set. It is important to record the Fay factor used to generate the weights since it will become important when calculating the BRR variance and standard error.
fay.factor <- 0.7
input.dataset <- brr_weights(input.data=input.dataset,
id="SEQN",
strata="SDMVSTRA",
psu="SDMVPSU",
cell="PSCELL",
weight="WTDRD1",
fay.factor=fay.factor)The next step is to Winsorize extreme values of the sodium variable.
The boxcox_survey() function has a helpful utility to
suggest cutoffs for Winsorization.
By default, extreme values are defined as being beyond the range
three times the interquartile range (IQR) below the 25th or above the
75th percentile on the transformed scale. This can be changed using the
iqr.multiple parameter which sets how many times the IQR
away from 25th and 75th percentiles a value can be before being
considered an outlier. Lower IQR multiples lead to stricter
Winsorization of the data since more values are identified as outliers.
For this example analysis, a cutoff of twice the IQR is used to
illustrate the structure of the Winsorization report. For an actual
analysis, the ideal Winsorization cutoff should be the least strict
value that still removes problematic outliers. This will have to be
found through experimentation, though the default value of three times
the IQR will usually suffice.
Note that the suggested Winsorized values are found using values in
the first recall but are applied to the entire dataset. Additionally,
only positive values can be used since the Box-Cox transformation is
invalid for zero and negative values. The boxcox_survey()
function will only use positive values to find the best lambda. Since
sodium is a daily consumed nutrient, values are expected to be above
zero for all subjects.
For Winsorization, the id and repeat.obs
parameters are needed to specify the subject and observation of each row
of the dataset. This allows the algorithm to identify which rows are in
need of Winsorization and produce a report.
#run Box-Cox survey on sodium variable to get suggested Winsorization cutoffs using the weighting for the point estimate
wins.sodium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Sodium",
weight="RepWt_0",
do.winsorization=TRUE,
iqr.multiple=2,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Sodium
#> SEQN DAY Sodium Sodium.winsorized
#> 31214 1 282 387.7924
#> 31403 2 55 387.7924
#> 31435 1 338 387.7924
#> 35957 1 335 387.7924
#> 37945 2 151 387.7924
#> 40879 1 18053 13598.5605
#> 42997 1 20165 13598.5605
#> 43528 1 308 387.7924
#> 44183 1 17687 13598.5605
#> 46795 1 21248 13598.5605
#> 47865 1 315 387.7924
#> 50316 1 15337 13598.5605
#> 51510 2 115 387.7924
#> 57179 2 163 387.7924
#> 57963 2 60 387.7924
#> 59838 1 207 387.7924The Winsorization report has identified extreme values of
Sodium as being below 387.79 mg or above 13598.56 mg. Using
these values, it is now possible to Winsorize the Sodium
variable.
input.dataset$Sodium <- ifelse(input.dataset$Sodium < 387.7924, 387.7924, input.dataset$Sodium)
input.dataset$Sodium <- ifelse(input.dataset$Sodium > 13598.5605, 13598.5605, input.dataset$Sodium)The NCI method assumes that food and nutrient amounts have a normal
distribution. The Sodium variable can be normalized using a
Box-Cox transformation. A suitable transformation parameter can be found
using the same boxcox_survey() function used to find the
Winsorization cutoffs.
Unlike with Winsorization, the function will be run using the
covariates that will be used in the model. The covariates are specified
as a vector in the covariates argument. Categorical
covariates are specified in cat.covariates and their
corresponding reference values in cat.reference. Smoking
status (SMK_REC) will be categorical using never-smokers
(3) as the reference.
Setting do.plots to TRUE generates Q-Q
plots of the original scale and Box-Cox transformed variable against the
standard normal distribution. The solid line represents the hypothetical
quantiles of the variable if it were normally distributed. The points
will be close to the line if the actual distribution of the variable is
close to normal.
#run Box-Cox survey with covariates to find best lambda parameter
boxcox.lambda.data <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Sodium",
weight="RepWt_0",
covariates=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
do.plots=TRUE)Next, the minimum intake for sodium is calculated using the
calculate_minimum_amount() function. The minimum amount is
calculated as half the smallest non-zero intake in the dataset. This
value becomes the minimum allowed usual intake when simulating usual
intakes later in the workflow. As with the Box-Cox lambda, the minimum
amount is found using the first recall.
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
daily.variables="Sodium")Now, the nci_multivar_preprocessor() function can be
used to generate the input dataset for the MCMC model. The preprocessor
function will apply the Box-Cox transformation from
boxcox_survey() to the sodium variable. The transformed
variable is then standardized to a mean of zero and a variance of 2 as
required by the MCMC algorithm.
Continuous covariates (such as age) should be standardized to a mean
of zero and a variance of 1 for best results in the MCMC algorithm. The
covariates to be standardized should be given in
continuous.covariates as a vector. The standardized
covariates created by nci_multivar_preprocessor() have the
prefix std..
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
daily.variables="Sodium",
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)Before fitting the MCMC model, it is worthwhile to explore the parts of the MCMC input.
The mcmc.input element is the input dataset with the
transformed and standardized variables and covariates added.
The backtransformation element contains the data used to
transform the sodium variable. This data is necessary to interpret the
MCMC results and will be utilized later to calculate the true usual
intakes on the original scale.
It is also important to be familiar with the
backtransformation data columns. tran_lambda:
The Box-Cox lambda used to transform the variable.
minamount: The minimum allowed usual intake, defined as
half of the smallest non-zero intake in the observed data.
tran_center: The mean of the Box-Cox transformed variable
before standardization. tran_scale: The standard deviation
of the Box-Cox transformed variable before standardization divided by
sqrt(2). biomarker: Logical flag of whether
the variable is a biomarker assumed to be unbiased on the transformed
scale. If FALSE, a bias correction factor will be added and
a 9-point approximation will be used for backtransformation. If
TRUE, an exact backtransformation will be used with no
correction.
pre.mcmc.data$backtransformation
#> variable tran_lambda minamount tran_center tran_scale biomarker
#> 1 Sodium 0.31 193.8962 35.57676 4.594267 FALSEMCMC models are fit using the base weight and each set of replicate weights.
Covariates to be used for every variable in the model are specified
in default.covariates as a vector. Note that the
standardized version of the age covariate generated by the preprocessor
(std.RIDAGEYR) is used in the MCMC model. Like
boxcox_survey(), categorical covariates are specified in
cat.covariates and their reference values in
cat.reference.
The number of iterations to run the MCMC must also be selected. The iteration parameters are:
num.mcmc.iterations: The total number of iterations to
run the MCMC including burn-in num.burn: The number of
burn-in iterations that will be discarded when calculating posterior
means num.thin: The spacing between iterations that will be
used to calculate posterior means to ensure independence between
iterations
The num.post argument specifies how many conditional
random effect (U) matrix draws should be made after the MCMC is
finished. These U matrices are conditional on the recalls and MCMC
parameters for the subjects in the MCMC.
Conditional U matrices can be used for generating usual intakes if the same population is used to fit the MCMC model and generate usual intakes. Using conditional U matrices does not substantially change the results for distributions, but it is required for regression calibration. It can be helpful to output conditional U matrices even for distributions to make it easier to add regression calibration later on without re-running the MCMC.
Drawing more conditional U matrices will give a more stable distribution at the cost of using more memory to store the U matrices. A good starting point is between 200-500 conditional U matrix draws.
The random seed for the MCMC is given in mcmc.seed.
Re-running the MCMC model using the same seed will produce the same
results each time. The seed should be changed with each iteration to
avoid biasing the results.
For BRR to work properly, nearly all BRR replicates should be used in variance calculation. It is good practice to verify that all of the BRR replicates have converged using techniques such as trace plots and the Gelman-Rubin test. Replicates that did not converge should not be included in the variance calculation.
num.brr <- 8
mcmc.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
mcmc.brr[[brr.rep + 1]] <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight=paste0("RepWt_", brr.rep),
daily.variables="Sodium",
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=1000,
num.thin=2,
num.post=500,
mcmc.seed=(9999 + brr.rep))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"The MCMC model estimates the following parameters:
Beta: Coefficients of the covariates and intercept in the model Sigma-u: Between-person variance-covariance matrix Sigma-e: Within-person variance-covariance matrix
Each iteration of the MCMC algorithm represents a sample from the distribution of the parameters. The posterior means are the means of the parameter samples at the thinned iterations after burn-in. They are used as the point estimates of the MCMC parameters.
The posterior means of each parameter can be viewed in the model output.
mcmc.brr[[1]]$beta.mean
#> $amt.Sodium
#> intercept SMK_REC_CAT1 SMK_REC_CAT2 std.RIDAGEYR RIAGENDR Day2
#> 0.60576545 -0.07033579 0.18798043 -0.20353054 -0.92828305 -0.13471077
#> Weekend
#> 0.02509548
mcmc.brr[[1]]$sigma.u.mean
#> amt.Sodium
#> amt.Sodium 0.4928532
mcmc.brr[[1]]$sigma.e.mean
#> amt.Sodium
#> amt.Sodium 1.187985The next step is simulating usual intakes using the MCMC model parameters.
A population-based dataset must be created that contains all of the subjects and additional variables that will be used for generating usual intakes and downstream analysis. For this example, the population base will be the first instance of each subject in the MCMC input dataset.
#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)Once the initial population base is created, nuisance variables must
be accounted for. To factor out the effect of the recall being conducted
on day 2, the Day2 variable will be set to zero for all
subjects.
To account for weekend vs weekday consumption, the simulated usual
intakes for weekends and weekdays will be weighted and averaged for each
subject. To accomplish this, a repeat of each subject will be created in
the population base. The first instance of each subject corresponds to
weekday consumption (Weekend = 0) and the second instance
corresponds to weekend consumption (Weekend = 1). Since
weekends are defined as three days in this dataset (Friday, Saturday,
and Sunday), weekend consumption will be given a weight of 3 while
weekday consumption will be given a weight of 4.
#create a repeat of each subject
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
#set weekend indicators and weights for the two instances of each subject
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)The population base and MCMC parameters are used to create a
distribution of simulated usual intakes using
nci_multivar_distrib().
The subjects in the MCMC and population base are the same, so the
conditional U matrices from the MCMC can be used for generating usual
intakes. This done by setting use.mcmc.u.matrices to
TRUE. Since there are 500 conditional U matrices, 500 usual
intakes will be generated for each subject in the population base.
The id and weight parameters specify the
subject identifier and weight from the population base to include in the
output. For population bases with multiple nuisance variable levels
(such as Weekend), nuisance.weight specifies
the weight for each level.
The random seed for usual intake simulation is given by
distrib.seed. Re-simulating the usual intakes with the same
seeds and the same MCMC models will produce the same simulated values.
As with the MCMC, the seed should be changed for each call to simulate
usual intakes in the analysis to avoid biasing the results.
Summary statistics can then be calculated from the simulated usual
intakes using the nci_multivar_summary() function. This is
done in the same loop as the usual intake simulation so that the usual
intake distribution dataset can be overwritten each time to save
memory.
The proportions are specified as named lists for the lower and upper intake thresholds. The name(s) of the list represent the variable(s) to calculate proportions for while the value(s) represent the lower or upper intake limits.
The lower.thresholds parameter is used for calculating
the proportions of intakes above a lower limit. Likewise,
upper.thresholds is used for calculating the proportions of
intakes below an upper limit.
This process is repeated for the base weight and each set of replicate weights.
summary.brr <- vector(mode="list", length=num.brr + 1)
for(brr.rep in 0:num.brr) {
print(paste0("Starting Iteration ", brr.rep))
#simulate usual intakes using conditional U matrices from MCMC
distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.brr[[brr.rep + 1]],
distrib.population=distrib.pop,
id="SEQN",
weight=paste0("RepWt_", brr.rep),
nuisance.weight="Weekend.Weight",
use.mcmc.u.matrices=TRUE,
distrib.seed=(99999 + brr.rep))
#compute means, quantiles, and proportions for simulated sodium intakes
summary.brr[[brr.rep + 1]] <- nci_multivar_summary(input.data=distrib.data,
variables="usual.intake.Sodium",
weight=paste0("RepWt_", brr.rep),
do.means=TRUE,
do.quantiles=TRUE,
quantiles=c(0.05, 0.25, 0.5, 0.75, 0.95),
do.proportions=TRUE,
lower.thresholds=list(usual.intake.Sodium=2200),
upper.thresholds=list(usual.intake.Sodium=3600))
}
#> [1] "Starting Iteration 0"
#> [1] "Starting Iteration 1"
#> [1] "Starting Iteration 2"
#> [1] "Starting Iteration 3"
#> [1] "Starting Iteration 4"
#> [1] "Starting Iteration 5"
#> [1] "Starting Iteration 6"
#> [1] "Starting Iteration 7"
#> [1] "Starting Iteration 8"With a point estimate and BRR replicates for every summary statistic, standard errors and confidence intervals can now be calculated. With the data set up as one column per replicate, this can be done easily and efficiently using vectorized code.
If using Fay’s method, the variance of the BRR replicates is divided by the square of the Fay factor. It is important to verify the Fay factor used to generate the BRR weights so that the correct variance can be calculated. If ordinary bootstrapping is used instead of BRR, the same standard error formula can be used with the Fay factor set to 1.
The number of degrees of freedom for confidence intervals is the total number of PSUs minus the number of strata. Since BRR uses exactly two PSUs per strata, the number of degrees of freedom is simply the number of strata.
#calculate degrees of freedom
df <- length(unique(input.dataset$SDMVSTRA))
#extract point estimate and BRR replicates
point <- summary.brr[[1]]$value
reps <- vapply(summary.brr[-1], function(brr.i) brr.i$value, point)
#calculate BRR standard error
std.error <- sqrt(rowSums((reps - point)^2)/(num.brr*fay.factor^2))
#95% confidence intervals
confidence.lower <- point + qt(0.025, df)*std.error
confidence.upper <- point + qt(0.975, df)*std.error
#create summary report
summary.report <- data.frame(summary.brr[[1]],
std.error=std.error,
confidence.lower=confidence.lower,
confidence.upper=confidence.upper)
summary.report
#> population variable statistic value std.error
#> 1 All usual.intake.Sodium Mean 3591.8093561 95.54554719
#> 2 All usual.intake.Sodium 5% 1993.0782251 48.04958460
#> 3 All usual.intake.Sodium 25% 2770.1334863 64.72568805
#> 4 All usual.intake.Sodium 50% 3455.9045075 89.20604931
#> 5 All usual.intake.Sodium 75% 4280.5607409 120.16823128
#> 6 All usual.intake.Sodium 95% 5636.5106893 198.97064533
#> 7 All usual.intake.Sodium > 2200 Proportion 0.9127940 0.01016531
#> 8 All usual.intake.Sodium < 3600 Proportion 0.5505691 0.03256666
#> confidence.lower confidence.upper
#> 1 3358.0178244 3825.6008879
#> 2 1875.5051271 2110.6513231
#> 3 2611.7554332 2928.5115395
#> 4 3237.6251682 3674.1838467
#> 5 3986.5196717 4574.6018102
#> 6 5149.6470592 6123.3743194
#> 7 0.8879204 0.9376676
#> 8 0.4708814 0.6302569