This example demonstrates the use of trace plots and statistical testing to assess MCMC models.
Workflow:
When performing an analysis with the NCI method, it is important to assess the measurement error model to ensure that it properly represents the distribution of the true parameters. The NCI method uses a Markov Chain Monte Carlo (MCMC) method to fit the measurement error model, so it worthwhile to become familiar with common methods used to assess MCMC models. These methods can be used to diagnose issues in the model and can help guide refinements to it.
A good model as well as models with some common problems will be shown to build familiarity with diagnosing errors.
The dataset used for this analysis is derived from the 2005-2010
NHANES data. A subset of six strata (SDMVSTRA) will be used
to reduce computation time and allow this example to run in real
time.
Intakes of total grain (G_TOTAL) and refined grain
(G_REFINED) will be modeled to demonstrate trace plots for
different scenarios.
The covariates of interest are smoking status (SMK_REC),
age (RIDAGEYR), and sex (RIAGENDR). The
nuisance covariates will be whether the recall was on a weekend
(Weekend) or whether the recall was on day 2
(Day2).
The base weight variable (WTDRD1) will be used for all
models.
For more information on the specifics of the data processing, please see the food and nutrient analysis vignette.
Subjects with missing values are removed, and variables are defined for the analysis.
#subset data
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78),]
#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)
#remove subjects that are missing any covariates or variables
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$G_TOTAL) | is.na(input.dataset$G_REFINED)
input.dataset <- input.dataset[!missing.covariates & !missing.variables,]
#rename variables for readability
input.dataset$Total.Grain <- input.dataset$G_TOTAL
input.dataset$Refined.Grain <- input.dataset$G_REFINEDThe variables will now be transformed and standardized for use in the MCMC models.
#Winsorize extreme values
wins.total.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Total.Grain",
weight="WTDRD1",
do.winsorization=TRUE,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Total.Grain
#> SEQN DAY Total.Grain Total.Grain.winsorized
#> 40879 1 55.92 51.05915
wins.refined.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Refined.Grain",
weight="WTDRD1",
do.winsorization=TRUE,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Refined.Grain
#> [1] SEQN DAY Refined.Grain
#> [4] Refined.Grain.winsorized
#> <0 rows> (or 0-length row.names)
input.dataset$Total.Grain <- ifelse(input.dataset$Total.Grain > 51.05915, 51.05915, input.dataset$Total.Grain)
#Find best Box-Cox lambdas for each variable
boxcox.total.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Total.Grain",
weight="WTDRD1",
covariates=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3)
boxcox.refined.grain <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Refined.Grain",
weight="WTDRD1",
covariates=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3)
boxcox.lambda.data <- rbind(boxcox.total.grain, boxcox.refined.grain)
#calculate minimum amounts
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
daily.variables=c("Total.Grain", "Refined.Grain"))
#run pre-processor to get MCMC input data
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
daily.variables=c("Total.Grain", "Refined.Grain"),
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)The first example will be of a model that has mixed well.
The trace_plots() utility is used to generate trace
plots for an MCMC model.
An ideal trace plot looks like random movement around an imaginary horizontal line. The y-value of the line represents the mean of the distribution and the amplitude of the movement represents the variance.
Each trace plot has a vertical red line at the number of burn-in iterations. The area before the line represents the burn-in period where the parameter values are expected to be unstable.
The values of the mean and variance do not matter for assessing the plot. As long as the mean and variance appear stable over time, the model is said to mix well.
Beta parameters tend to stabilize more easily than the variance components Sigma-u and Sigma-e. The convergence of all parameters should be investigated when assessing a model.
good.model <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables="Total.Grain",
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=1000,
num.thin=2,
mcmc.seed=9999)
trace_plots(multivar.mcmc.model=good.model)#> NULL
The following example demonstrates a model with too few burn-in iterations.
Some of the parameter values early on in the chain will not be representative of the parameter distributions. The inclusion of these values causes the estimates of the posterior means to become less stable.
If a plot does not appear to have settled around a value after the burn-in period, it may indicate that more burn-in iterations are needed. Multivariate models and episodic variables generally require more burn-in than univariate models and continous variables. The variance components Sigma-u and Sigma-e often have longer burn-in periods than the beta parameters.
low.burn.in <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables="Total.Grain",
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=50,
num.thin=2,
mcmc.seed=9999)
trace_plots(multivar.mcmc.model=low.burn.in)#> NULL
Another potential issue is an insufficient number of subjects with more than one recall.
The effect of the low sample size can be seen in all of the trace plots, but it is especially apparent in the variance components (Sigma-e and Sigma-u). While there isn’t a clear trend up or down, the center of the trace appears choppy which shows that the mean is not able to settle on a value even well after the burn-in period.
Note that the number of subjects with one recall does not have any bearing on model convergence. The limiting factor is the number of subjects that have two or more observations. As a general rule, at least 50 subjects should have two or more observations were consumption is non-zero for proper mixing. More complicated relationships in the data (such as high correlation between different variance components) will require more subjects to fit.
#subset input dataset so that only the first 10 subjects have a second recall
first.recall <- input.dataset[input.dataset$Day2 == 0,]
second.recall <- input.dataset[input.dataset$Day2 == 1,][1:10,]
input.subset <- rbind(first.recall, second.recall)
pre.mcmc.subset <- nci_multivar_preprocessor(input.data=input.subset,
daily.variables="Total.Grain",
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)
small.sample <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.subset,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables="Total.Grain",
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=1000,
num.thin=2,
mcmc.seed=9999)
trace_plots(multivar.mcmc.model=small.sample)#> NULL
Variables that are highly correlated or subsets of each other will affect the model fit significantly. A common cause of this is not properly disaggregating variables during pre-processing.
The trace plots have a long burn-in and have trouble settling around a value. This can sometimes resemble the plots from having a low sample size. If the sample size is sufficient, then this pattern could indicate an issue with disaggregation.
correlated.model <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables=c("Total.Grain", "Refined.Grain"),
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=1000,
num.thin=2,
mcmc.seed=9999)
trace_plots(multivar.mcmc.model=correlated.model)#> NULL
The Gelman-Rubin test is an alternative method assessing convergence that gives numerical values instead of graphs. The test works by creating multiple MCMC chains with different random seeds and calculating the within-chain and between-chain variation (Gelman and Rubin, 1992). If the model parameters converge, there should be little to no difference between different chains which will cause the between-chain variance to fall to zero.
The Gelman-Rubin statistic is the square root of the ratio between the total variance and the within-chain variance of a parameter. Convergence is indicated by the Gelman-Rubin statistics for all of the parameters being close to 1. There is no exact rule for how close to 1 the statistics must be, though an upper bound of 1.1 for convergence has been suggested by Gelman, et al. (2004). In general, iteration parameters should be chosen that get the Gelman-Rubin statistics as close to 1 as possible while balancing computation time.
The gelman_rubin() function computes Gelman-Rubin
statistics for each MCMC parameter. The Gelman-Rubin test can be used as
evidence that the good model shown earlier in the vignette has
converged.
gr.statistics <- gelman_rubin(num.chains=5,
pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables="Total.Grain",
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=3000,
num.burn=1000,
num.thin=2,
initial.mcmc.seed=9999)
gr.statistics
#> beta.amt.Total.Grain.intercept beta.amt.Total.Grain.SMK_REC_CAT1
#> 1.000844 1.001135
#> beta.amt.Total.Grain.SMK_REC_CAT2 beta.amt.Total.Grain.std.RIDAGEYR
#> 1.001358 1.001057
#> beta.amt.Total.Grain.RIAGENDR beta.amt.Total.Grain.Day2
#> 1.000885 1.001208
#> beta.amt.Total.Grain.Weekend sigma.u.amt.Total.Grain.amt.Total.Grain
#> 1.000697 1.000695
#> sigma.e.amt.Total.Grain.amt.Total.Grain
#> 1.000683The advantage to the Gelman-Rubin test is that it gives a numerical result as opposed to a subjective assessment of a graph. On the other hand, trace plots are much faster to create and can be used on a model that has already been fit. Additionally, the specific patterns in trace plots can give clues about why the model has not converged. The best practice is to use both methods for a more thorough assessment.