This example demonstrates an NCI method workflow using the low-memory
features of the ncimultivar package.
Workflow:
The NCI method can use a lot of computer memory, especially for MCMC
model parameters and simulated usual intake datasets. Fortunately, the
ncimultivar package functions have built-in ways to reduce
the amount of data held in memory. These features often come at the cost
of computation speed, but they can be useful for large models or when
running the software on systems with fewer resources.
The effect of the ratio of sodium (TSODI) to potassium
(TPOTA) on systolic blood pressure (BPSY_AVG)
will be measured from 2005-2010 NHANES data. To calculate the
sodium/potassium ratio, a bivariate NCI method model of sodium and
potassium will be fit. A subset of six strata (SDMVSTRA)
will be used to reduce computation time and allow this example to run in
real time.
The covariates being examined are smoking status
(SMK_REC), age (RIDAGEYR), and sex
(RIAGENDR). Two nuisance covariates will be adjusted for,
whether the recall was on a weekend (Weekend) and and
whether the recall is on day 2 (Day2).
The WTDRD1 variable is the survey weighting of each
subject.
A simple way to conserve memory is subsetting the data to only the variables needed for the analysis. Subjects with missing values are removed, and variables are defined for the analysis.
#subset data
analysis.variables <- c("SEQN", "DAY",
"TSODI", "TPOTA",
"SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend",
"BPSY_AVG",
"WTDRD1")
input.dataset <- nhcvd[nhcvd$SDMVSTRA %in% c(48, 54, 60, 66, 72, 78), analysis.variables]
#Define indicator for Day 2
input.dataset$Day2 <- (input.dataset$DAY == 2)
#remove subjects that are missing any covariates, variables, or outcomes
missing.covariates <- is.na(input.dataset$SMK_REC) | is.na(input.dataset$RIDAGEYR) | is.na(input.dataset$RIAGENDR) |
is.na(input.dataset$Weekend) | is.na(input.dataset$Day2)
missing.variables <- is.na(input.dataset$TSODI) | is.na(input.dataset$TPOTA)
missing.outcomes <- is.na(input.dataset$BPSY_AVG)
input.dataset <- input.dataset[!missing.covariates & !missing.variables & !missing.outcomes,]
#rename sodium and potassium variables for readability
input.dataset$Sodium <- input.dataset$TSODI
input.dataset$Potassium <- input.dataset$TPOTAThe sodium and potassium variables can now be transformed and standardized for use in the MCMC algorithm.
#Winsorize extreme values of sodium and potassium intake
wins.sodium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Sodium",
weight="WTDRD1",
do.winsorization=TRUE,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Sodium
#> SEQN DAY Sodium Sodium.winsorized
#> 31403 2 55 113.1129
#> 57963 2 60 113.1129
wins.potassium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Potassium",
weight="WTDRD1",
do.winsorization=TRUE,
id="SEQN",
repeat.obs="DAY")
#> Outliers and Suggested Winsorized values for Potassium
#> SEQN DAY Potassium Potassium.winsorized
#> 57963 2 0 42.45263
input.dataset$Sodium <- ifelse(input.dataset$Sodium < 113.1129, 113.1129, input.dataset$Sodium)
input.dataset$Potassium <- ifelse(input.dataset$Potassium < 42.45263, 42.45263, input.dataset$Potassium)
#run Box-Cox survey and create Box-Cox lambda data using the first recall
boxcox.sodium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Sodium",
covariates=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
weight="WTDRD1")
boxcox.potassium <- boxcox_survey(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
variable="Potassium",
covariates=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
weight="WTDRD1")
boxcox.lambda.data <- rbind(boxcox.sodium, boxcox.potassium)
#Calculate minimum amount of sodium and potassium intake in the first recall
minimum.amount.data <- calculate_minimum_amount(input.data=input.dataset,
row.subset=(input.dataset$Day2 == 0),
daily.variables=c("Sodium", "Potassium"))
#Run MCMC pre-preprocessor
pre.mcmc.data <- nci_multivar_preprocessor(input.data=input.dataset,
daily.variables=c("Sodium", "Potassium"),
continuous.covariates="RIDAGEYR",
boxcox.lambda.data=boxcox.lambda.data,
minimum.amount.data=minimum.amount.data)The MCMC measurement error model can now be fit using
nci_multivar_mcmc(), drawing 500 conditional U
matrices.
The MCMC parameters from each iteration and U matrices are saved in memory by default. Covariance matrix size increases with the square of the number of variables, and U matrix size increases with the number of subjects. This means that memory requirements can become very high for multivariate models and large sample sizes.
The low.memory.mode option in
nci_multivar_mcmc() can be used to reduce the memory
requirements of the algorithm. Setting low.memory.mode to
TRUE writes parameters and U matrices to CSV files at each
iteration instead of holding them in memory. The memory usage of the
algorithm is significantly reduced at the cost of slower run time due to
the extra I/O overhead.
mcmc.output <- nci_multivar_mcmc(pre.mcmc.data=pre.mcmc.data,
id="SEQN",
repeat.obs="DAY",
weight="WTDRD1",
daily.variables=c("Sodium", "Potassium"),
default.covariates=c("SMK_REC", "std.RIDAGEYR", "RIAGENDR", "Day2", "Weekend"),
cat.covariates="SMK_REC",
cat.reference=3,
num.mcmc.iterations=4000,
num.burn=2000,
num.thin=2,
num.post=500,
mcmc.seed=9999,
low.memory.mode=TRUE)The population base for simulating usual intakes is generated from the input data.
#get first instance of each subject
mcmc.input.data <- pre.mcmc.data$mcmc.input
distrib.pop <- mcmc.input.data[!duplicated(mcmc.input.data$SEQN),]
num.subjects <- nrow(distrib.pop)
#Set Day 2 to zero to factor out the effect of Day 2 recalls
distrib.pop$Day2 <- 0
#create repeats of each subject for weekday and weekend consumption
distrib.pop <- distrib.pop[rep(seq_len(num.subjects), each=2),]
distrib.pop$Weekend <- rep(c(0,1), num.subjects)
distrib.pop$Weekend.Weight <- rep(c(4,3), num.subjects)The nci_multivar_distrib() function can now be used to
generate 500 usual intakes for each subject using the conditional U
matrices from the MCMC. The outcome variable and covariates for the
regression are included in the output dataset using
additional.output.
Just like the MCMC model, the size of the simulated usual intake dataset increases with the number of subjects and variables.
The nci_multivar_distrib() function also contains a
low.memory.mode option to reduce memory use. When
low.memory.mode is TRUE, each simulated usual
intake replicate is written to a CSV file instead of keeping it in
memory. As with the MCMC algorithm, memory usage is reduced
significantly in exchange for a slower run time.
#simulate usual intakes for each subject using MCMC U matrices
distrib.data <- nci_multivar_distrib(multivar.mcmc.model=mcmc.output,
distrib.population=distrib.pop,
id="SEQN",
weight="WTDRD1",
nuisance.weight="Weekend.Weight",
use.mcmc.u.matrices=TRUE,
additional.output=c("SMK_REC", "RIDAGEYR", "RIAGENDR", "BPSY_AVG"),
distrib.seed=99999,
low.memory.mode=TRUE)The usual intakes of sodium and potassium are used to calculate the sodium/potassium ratio.
Since the simulated usual intakes are stored in a CSV file, they must
be loaded into memory to perform the calculations. The
apply_csv() function is used to manipulate data in CSV
files without loading the entire dataset.
Using apply_csv() is very similar to the
apply() family of functions in base R. A function that
operates on a data.frame is passed into
apply_csv() along with the input CSV file. This function is
applied to each block of data that is loaded in from the file.
In general, the logic of the function will be the same as operating on the data in memory. This means that code from in-memory workflows can easily be adapted for low-memory workflows.
#Function to calculate sodium/potassium ratio
calculate_ratio <- function(x) {
x$Sodium.Potassium.Ratio <- x$usual.intake.Sodium/x$usual.intake.Potassium
return(x)
}
#Apply the function to dataset of simulated usual intakes
distrib.data <- apply_csv(distrib.data, calculate_ratio)The distribution of the sodium/potassium ratio is summarized by using
nci_multivar_summary(). The summary statistics are
calculated without loading the full dataset into memory.
Calculating quantiles for a variable requires loading the full column
into memory. As a result, memory requirements are higher if quantile
calculation is enabled. If quantiles are not needed,
do.quantiles can be set to FALSE.
#compute means and quantiles
summary.stats <- nci_multivar_summary(input.data=distrib.data,
variables="Sodium.Potassium.Ratio",
weight="WTDRD1",
do.means=TRUE,
do.quantiles=TRUE,
quantiles=c(0.05, 0.25, 0.5, 0.75, 0.95))
summary.stats
#> population variable statistic value
#> 1 All Sodium.Potassium.Ratio Mean 1.3525331
#> 2 All Sodium.Potassium.Ratio 5% 0.8134275
#> 3 All Sodium.Potassium.Ratio 25% 1.0881832
#> 4 All Sodium.Potassium.Ratio 50% 1.3115268
#> 5 All Sodium.Potassium.Ratio 75% 1.5717261
#> 6 All Sodium.Potassium.Ratio 95% 2.0291058For regression calibration, the sodium/potassium ratio is averaged for each subject to calculate the proxy intakes.
The group_means_csv() function is used to calculate the
average usual intake by subject for data in a CSV file without loading
the entire dataset. The response variable, weights, and other covariates
for the regression are passed into the output using
additional.output.
#Average sodium/potassium ratio for each subject
regression.data <- group_means_csv(distrib.data,
variables="Sodium.Potassium.Ratio",
id="SEQN",
additional.output=c("BPSY_AVG", "WTDRD1", "SMK_REC", "RIDAGEYR", "RIAGENDR"))The linear model is fit on the dataset of proxy intakes.
#convert smoking status to factor variable for regression function
regression.data$SMK_REC.f <- factor(regression.data$SMK_REC,
levels=c(3, 1, 2),
labels=c("Never", "Current", "Former"))
#fit linear model
bp.model <- lm(BPSY_AVG ~ Sodium.Potassium.Ratio + SMK_REC.f + RIDAGEYR + RIAGENDR,
data=regression.data,
weights=regression.data$WTDRD1)
#summary dataset of model parameters
bp.parameters <- summary_coefficients(bp.model)
bp.parameters
#> population variable statistic value
#> 1 All BPSY_AVG Coefficient for (Intercept) 102.34801833
#> 2 All BPSY_AVG Coefficient for Sodium.Potassium.Ratio 2.29999642
#> 3 All BPSY_AVG Coefficient for SMK_REC.fCurrent -0.31028522
#> 4 All BPSY_AVG Coefficient for SMK_REC.fFormer -0.05772115
#> 5 All BPSY_AVG Coefficient for RIDAGEYR 0.35855471
#> 6 All BPSY_AVG Coefficient for RIAGENDR -4.71033919