/*
Title: Efficient Coding for Comprehensive Analysis
Author: NCI/Information Management Services
Date: 3/25/2025
*/

/*
This example demonstrates using the NCI method to calculate summary statistics of distributions for different subgroups along with standard errors.
It is designed to resemble an NCI method workflow that might be used in a publication.

NOTE: THIS EXAMPLE MAY HAVE LONG COMPUTATION TIMES; IT IS RECOMMENDED TO RUN THIS PRIOR TO THE BREAKOUT SESSION.
*/

libname indata "./ncimultivar/data";

%include "./ncimultivar/macros/ncimultivar.sas";

/*
A study may require many pieces of information about a distribution that will be put into tables and figures, along with standard errors.
It is important to understand the best practices for performing these complex analyses to minimize computation time and reduce errors.

This example will calculate summary statistics and fit models on the distribution of sodium (TSODI) from 2005-2010 NHANES data.
A subset of six strata (SDMVSTRA) will be used to reduce computation time and allow this example to run in real time.

The mean, quantiles, and proportions of the usual intake distribution will be calculated.
Regression calibration will also be used to fit models of systolic blood pressure (BPSY_AVG), hypertension (HTN_BIN), and cardiovascular disease (CVDSTAT).
A linear model will be fit for systolic blood pressure, and logistic models will be fit for hypertension and cardiovascular disease.

The covariates being examined are age (RIDAGEYR) and sex (RIAGENDR). 
Two nuisance covariates will also be included: whether the recall was on a weekend (Weekend) and whether the recall was on day 1 or day 2 (DAY).

The WTDRD1 variable is the base weight for each observation.

The analysis will be stratified by smoking status (SMK_REC).
Results will be calculated for current smokers, former smokers, and never-smokers and for the overall population.
The differences in summary statistics between current and former smokers versus never-smokers will also be calculated. 

Standard errors for the summary statistics and differences will be estimated using balanced repeated replication (BRR).

Subjects missing any covariate, intake, or outcome value are removed.
*/

**subset data;
data input_dataset;
	set indata.nhcvd;
	if SDMVSTRA in (48 54 60 66 72 78);
	
	**Define indicator for Day 2;
	Day2 = (DAY = 2);
run;

data input_dataset;
	set input_dataset;
	
	**remove subjects that are missing any covariates, variables, or outcomes;
	if not missing(SMK_REC)  and 
		 not missing(RIDAGEYR) and 
		 not missing(RIAGENDR) and
		 not missing(Weekend)  and
		 not missing(Day2)		 and
		 not missing(TSODI)    and
		 not missing(BPSY_AVG) and
		 not missing(HTN_BIN)  and
		 not missing(CVDSTAT);
	
	**rename sodium variable for readability;
	Sodium = TSODI;
run;

/*
BRR weights are generated for the dataset. 
A Fay factor of 0.7 is used.
*/

%let fay_factor = 0.7;

%brr_weights(input_data=input_dataset,
						 id=SEQN,
						 strata=SDMVSTRA,
						 psu=SDMVPSU,
						 cell=PSCELL,
						 weight=WTDRD1,
						 fay_factor=&fay_factor.,
						 outname=input_dataset);
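
/*
A quick look at the replicate weights can catch problems before the long-running steps.
This is a minimal sketch that assumes %brr_weights() adds columns named RepWt_0, RepWt_1, and so on to the output dataset, as the later steps in this example imply.
Missing or negative replicate weights would suggest a problem with the strata or PSU variables.
*/

proc means data=input_dataset n nmiss min max sum;
	var RepWt_:;
	title "Check of BRR replicate weights";
run;
title;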
						 
/*
Since this analysis will be stratified by smoking status, the pre-processing will need to be done separately for each subgroup. 
A macro is used here to avoid repeating code for each subgroup.
This tutorial assumes some familiarity with writing SAS macros.
*/

%macro process_strata();

	%do smk = 1 %to 3;
	
		data strata;
			set input_dataset;
			where SMK_REC = &smk.;
		run;
		
		**get suggested Winsorization values;
		%boxcox_survey(input_data=strata,
									 row_subset=%quote(Day2 = 0),
									 variable=Sodium,
									 weight=RepWt_0,
									 do_winsorization=Y,
									 id=SEQN,
									 repeat_obs=DAY);
									 
		**Winsorize using suggested values;
		proc sort data=strata; by SEQN DAY Sodium; run;
		proc sort data=win_Sodium; by SEQN DAY Sodium; run;
		data strata;
			merge strata (in=in_strata)
						win_Sodium;
			by SEQN DAY Sodium;
			if in_strata = 1;
			
			if not missing(Sodium_win) then Sodium = Sodium_win;
		run;
		
		**Find best Box-Cox lambda in the presence of covariates - smoking status is omitted because it is the stratification variable;
		%boxcox_survey(input_data=strata,
									 row_subset=%quote(Day2 = 0),
									 variable=Sodium,
									 covariates=RIDAGEYR RIAGENDR Weekend,
									 weight=RepWt_0);
									 
		**Calculate minimum consumption amount;
		%calculate_minimum_amount(input_data=strata,
															row_subset=%quote(Day2 = 0),
															daily_variables=Sodium);
															
		**Run pre-processor to standardize variables and covariates;
		%nci_multivar_preprocessor(input_data=strata,
															 daily_variables=Sodium,
															 continuous_covariates=RIDAGEYR,
															 boxcox_lambda_data=bc_Sodium,
															 minimum_amount_data=minimum_amount_data,
															 outname=smk&smk.);
	%end;
%mend process_strata;

%process_strata();
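
/*
Before fitting models, it can help to confirm that each subgroup has a reasonable number of subjects and rows.
This is a minimal sketch; check_strata is a hypothetical helper and is not part of the ncimultivar package.
It assumes the pre-processor output datasets are named smk1_mcmc_in, smk2_mcmc_in, and smk3_mcmc_in, as they are used later in this example.
*/

%macro check_strata();

	%local smk;
	
	%do smk = 1 %to 3;
	
		proc sql;
			title "Subjects and rows in smk&smk._mcmc_in";
			select count(distinct SEQN) as num_subjects,
						 count(*) as num_rows
			from smk&smk._mcmc_in;
		quit;
		title;
	%end;
%mend check_strata;

%check_strata();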

/*
Models must now be fit for the base weight and each set of replicate weights. 
Since regression calibration will be performed, 500 conditional U matrix draws should be made using the num_post parameter.

Since this is a stratified analysis, a separate MCMC model must be fit for each stratification level within each replicate.
As such, it is highly recommended to use loops to automate the replicates.
If setting the random seed using mcmc_seed, it is important to change the seed each time the MCMC is run.
*/

%let num_brr = 8; **See the Complex Survey vignette for further detail on why 8 BRR replicates are used;

%macro mcmc_brr(num_brr=);
	
	%do brr_rep = 0 %to &num_brr.;
	
		**Run MCMC for each strata level;
		%do smk = 1 %to 3;
		
			%nci_multivar_mcmc(pre_mcmc_data=smk&smk.,
												 id=SEQN,
												 repeat_obs=DAY,
												 weight=RepWt_&brr_rep.,
												 daily_variables=Sodium,
												 default_covariates=std_RIDAGEYR RIAGENDR Day2 Weekend,
												 num_mcmc_iterations=3000,
												 num_burn=1000,
												 num_thin=2,
												 num_post=500,
												 mcmc_seed=%eval(9999 + 3*&brr_rep. + &smk.),
												 outname=mcmc_brr&brr_rep._smk&smk.);
		%end;
	%end;
%mend mcmc_brr;

%mcmc_brr(num_brr=&num_brr.);

/*
The next step is to simulate usual intakes for each stratum in every BRR replicate.
The usual intakes should be simulated using the conditional U matrix draws from the MCMC for regression calibration.
As with the MCMC, it is important to make sure that distrib_seed is different each time the %nci_multivar_distrib() macro is called.

Summary statistics can then be calculated for the overall population as well as by strata.
A publication often has multiple tables and figures as well as supplemental material that each need different statistics. 

In this example, the distribution of usual intakes will be summarized using means, quantiles, and proportions.
In addition, regression calibration will be done for three outcome variables.
Differences in summary statistics between current or former smokers and never-smokers will be calculated using the %summary_difference() utility.

Since running replicates is very time-consuming, it is best practice to compute all statistics that may be needed and combine them into one summary dataset for BRR variance calculation. 
The summary utilities in the package make this convenient by using the same columns and format for all summary datasets. 
The full summary dataset can then be subset to statistics that are eventually needed for publication.

The summary statistics will be calculated in the same loop as simulating the usual intakes to save memory.
*/

**Create distrib populations for each strata level;
**this only needs to be done once;
%macro make_distrib_populations();

	%do smk = 1 %to 3;
	
		proc sort data=smk&smk._mcmc_in; by SEQN; run;
		data distrib_pop_smk&smk.;
			set smk&smk._mcmc_in;
			by SEQN;
			
			**get first instance of each subject;
			if first.SEQN then do;
	
				**Set Day 2 to zero to factor out the effect of Day 2 recalls;
				Day2 = 0;
	
				**create repeats of each subject for weekday and weekend consumption;
				Weekend = 0;
				Weekend_Weight = 4;
				output;
	
				Weekend = 1;
				Weekend_Weight = 3;
				output;
			end;
		run;
	%end;
%mend make_distrib_populations;

%make_distrib_populations();

**datasets for lower and upper thresholds for proportions;
data lower;

	variable = "usl_Sodium";
	threshold = 2200;
run;

data upper;

	variable = "usl_Sodium";
	threshold = 3600;
run;

**Simulate usual intakes and calculate summary statistics for each BRR replicate;
%macro summary_brr(num_brr=);

	%do brr_rep = 0 %to &num_brr.;
	
		**simulate usual intakes for all strata using conditional U matrices from MCMC;
		%do smk = 1 %to 3;
		
			%nci_multivar_distrib(multivar_mcmc_model=mcmc_brr&brr_rep._smk&smk.,
														distrib_population=distrib_pop_smk&smk.,
														id=SEQN,
														weight=RepWt_&brr_rep.,
														nuisance_weight=Weekend_Weight,
														additional_output=RIDAGEYR RIAGENDR BPSY_AVG HTN_BIN CVDSTAT,
														use_mcmc_u_matrices=Y,
														distrib_seed=%eval(99999 + 3*&brr_rep. + &smk.),
														outname=distrib_smk&smk.);
		%end;
		
		data distrib_all;
			set %do smk = 1 %to 3;
						distrib_smk&smk.
					%end;
					;
		run;
		
		**compute summary statistics for overall population and all strata;
		**use names for strata to differentiate them in output;
		%let strata_names = Overall Current_Smoker Former_Smoker Never_Smoker;
		
		%do smk = 0 %to 3;
		
			data distrib_strata;
				%if &smk. = 0 %then %do;
					set distrib_all;
				%end;
				%else %do;
					set distrib_smk&smk.;
				%end;
			run;
			
			%let strata_name = %sysfunc(scan(&strata_names., %eval(&smk. + 1), %str( )));
			
			**Calculate distribution summary statistics;
			%nci_multivar_summary(input_data=distrib_strata,
														variables=usl_Sodium,
														population_name=&strata_name.,
														weight=RepWt_&brr_rep.,
														do_means=Y,
														do_quantiles=Y,
														quantiles=20 40 60 80,
														do_proportions=Y,
														lower_thresholds=lower,
														upper_thresholds=upper,
														outname=summary_dist);
														
			**Average usual intakes per subject for regression calibration;
			proc sort data=distrib_strata; by SEQN RepWt_&brr_rep. RIDAGEYR RIAGENDR BPSY_AVG HTN_BIN CVDSTAT; run;
			
			proc univariate data=distrib_strata noprint;
				by SEQN RepWt_&brr_rep. RIDAGEYR RIAGENDR BPSY_AVG HTN_BIN CVDSTAT;
				
				var usl_Sodium;
				
				output out=regression_data mean=usl_Sodium;
			run;
														
			**scale down sodium usual intake by 1000 to show the effect per 1,000 mg of sodium;
			data regression_data;
				set regression_data;
				
				usl_Sodium = usl_Sodium/1000;
			run;
			
			ods select none;
			
			**Linear model of systolic blood pressure;
			proc surveyreg data=regression_data;
			
				model BPSY_AVG = usl_Sodium RIDAGEYR RIAGENDR;
				
				weight RepWt_&brr_rep.;
				
				ods output ParameterEstimates=bp_model;
			run;
			
			%summary_coef_surveyreg(parameter_estimates=bp_model,
															response=BPSY_AVG,
															population_name=&strata_name.,
															outname=bp_parameters);
			
			**Logistic model of hypertension;
			proc surveylogistic data=regression_data;
			
				model HTN_BIN(event='1') = usl_Sodium RIDAGEYR RIAGENDR;
				
				weight RepWt_&brr_rep.;
				
				ods output ParameterEstimates=htn_model;
			run;
			
			%summary_coef_surveylogistic(parameter_estimates=htn_model,
																	 response=HTN_BIN,
																	 population_name=&strata_name.,
																	 outname=htn_parameters);
			
			**Logistic model of cardiovascular disease;
			proc surveylogistic data=regression_data;
			
				model CVDSTAT(event='1') = usl_Sodium RIDAGEYR RIAGENDR;
				
				weight RepWt_&brr_rep.;
				
				ods output ParameterEstimates=cvd_model;
			run;
			
			%summary_coef_surveylogistic(parameter_estimates=cvd_model,
																	 response=CVDSTAT,
																	 population_name=&strata_name.,
																	 outname=cvd_parameters);
																	 
			ods select all;
																	 
			**Concatenate summary output into one dataset;
			data summary_smk&smk.;
				set summary_dist
						bp_parameters
						htn_parameters
						cvd_parameters;
			run;
		%end;
		
		**compute differences of current smokers (1) and former smokers (2) vs. never-smokers (3);
		%summary_difference(population1=summary_smk1,
												population2=summary_smk3,
												outname=summary_diff1);
												
		%summary_difference(population1=summary_smk2,
												population2=summary_smk3,
												outname=summary_diff2);
												
		**concatenate summary and difference datasets into a single large dataset;
		data summary_brr&brr_rep.;
			set summary_smk0
					summary_smk1
					summary_smk2
					summary_smk3
					summary_diff1
					summary_diff2;
		run;
	%end;
	
	**extract point estimate and BRR replicates;
	data summary_brr_data;
		set summary_brr0;
		%do brr_rep = 1 %to &num_brr.;
			set summary_brr&brr_rep. (keep = value rename=(value = brr&brr_rep.));
		%end;
	run;
%mend summary_brr;

%summary_brr(num_brr=&num_brr.);
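
/*
The replicate columns are combined with one SET statement per replicate, which matches rows by position and stops at the shortest dataset.
The following minimal sketch confirms that every replicate summary has the same number of rows as the point estimate.
check_brr_rows is a hypothetical helper and is not part of the ncimultivar package.
*/

%macro check_brr_rows(num_brr=);

	%local brr_rep dsid rc num_rows base_rows;
	
	%do brr_rep = 0 %to &num_brr.;
	
		%let dsid = %sysfunc(open(summary_brr&brr_rep.));
		
		%if &dsid. = 0 %then %do;
			%put WARNING: summary_brr&brr_rep. could not be opened.;
		%end;
		%else %do;
			%let num_rows = %sysfunc(attrn(&dsid., nlobs));
			%let rc = %sysfunc(close(&dsid.));
			
			%if &brr_rep. = 0 %then %let base_rows = &num_rows.;
			%else %if &num_rows. ne &base_rows. %then %do;
				%put WARNING: summary_brr&brr_rep. has &num_rows. rows but summary_brr0 has &base_rows. rows.;
			%end;
		%end;
	%end;
%mend check_brr_rows;

%check_brr_rows(num_brr=&num_brr.);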

/*
With a point estimate and BRR replicates for every summary statistic, standard errors and confidence intervals can now be calculated. 
With the data set up as one column per replicate, this can be done efficiently by looping over an array of the replicate columns.
The BRR replicate weights in this dataset used a Fay factor of 0.7, so the sum of squared deviations from the point estimate is divided by the number of replicates times the squared Fay factor when calculating the variance. 

When calculating confidence intervals, it is important to use the correct number of degrees of freedom. 
This is equal to the total number of PSUs across all strata minus the number of strata. 
Since BRR uses exactly two PSUs per stratum, the total number of PSUs is twice the number of strata, so the degrees of freedom are simply equal to the number of strata.
*/
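
/*
A toy illustration of the standard error formula used in this example, with made-up numbers that are not from NHANES.
It assumes the same convention as the summary report code: the squared deviations of the replicates from the point estimate are summed and divided by (number of replicates)*(Fay factor)**2.
The names toy_reps, toy_value, and toy_fay are hypothetical and exist only for this illustration.
*/

data _NULL_;
	array toy_reps{4} _temporary_ (9 11 10 12);
	toy_value = 10;
	toy_fay = 0.5;
	
	/* squared deviations are 1, 1, 0, and 4 */
	sum_sq_diff = 0;
	do i = 1 to dim(toy_reps);
		sum_sq_diff = sum_sq_diff + (toy_reps{i} - toy_value)**2;
	end;
	
	/* the divisor is 4*0.5**2 = 1, so the variance equals the sum of squared deviations */
	std_error = sqrt(sum_sq_diff/(dim(toy_reps)*toy_fay**2));
	
	put "Toy BRR standard error: " std_error;
run;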

**calculate degrees of freedom by counting number of strata;
proc sort data=input_dataset; by SDMVSTRA; run;

data _NULL_;
	set input_dataset end=last;
	by SDMVSTRA;
	
	retain num_strata 0;
	
	if first.SDMVSTRA then num_strata = num_strata + 1;
	
	if last = 1 then call symputx("df", num_strata);
run;
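
/*
As a cross-check, the number of strata can also be counted with PROC SQL.
This is a minimal sketch; df_check is a hypothetical macro variable introduced only for this comparison.
The two values written to the log should agree.
*/

proc sql noprint;
	select count(distinct SDMVSTRA) into :df_check trimmed
	from input_dataset;
quit;

%put NOTE: degrees of freedom from DATA step = &df., from PROC SQL = &df_check.;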

**create summary report;
data summary_report (keep = population variable statistic value std_error confidence_lower confidence_upper);
	set summary_brr_data;
	
	array reps{&num_brr.} brr1-brr&num_brr.;
	
	**calculate BRR standard error;
	sum_sq_diff = 0;
	do i = 1 to &num_brr.;
	
		sum_sq_diff = sum_sq_diff + (reps{i} - value)**2;
	end;
	
	std_error = sqrt(sum_sq_diff/(&num_brr.*&fay_factor.**2));
	
	**95% confidence intervals;
	confidence_lower = value + tinv(0.025, &df.)*std_error;
	confidence_upper = value + tinv(0.975, &df.)*std_error;
run;

proc print data=summary_report; 

	title "Sodium Intake by Smoking Status";
run;
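
/*
The full report can be subset to the rows needed for a particular table or figure.
This is a minimal sketch; report_overall is a hypothetical dataset name.
It assumes the population column holds the value passed to population_name earlier in this example, such as Overall.
*/

data report_overall;
	set summary_report;
	where population = "Overall";
run;

proc print data=report_overall;

	title "Sodium Intake: Overall Population";
run;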