library(Hmisc)
library(survival)
library(rms)
library(scales)
library(ggplot2)
library(latex2exp)
library(knitr)
library(kableExtra)
library(modelsummary)
getHdata(prostate)
Cox model case study
We analyze the prostate cancer dataset using a Cox model. The primary goal is to discover patterns in survival and to perform an analysis of covariance that assesses the effect of treatment while adjusting for patient heterogeneity. Prediction is a secondary goal.
Data
We consider the 506-patient prostate cancer dataset from Byar and Green. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration.
We will follow guidelines given in the book Regression Modeling Strategies by Frank Harrell and his general approach to this dataset in Chapter 21. The analysis is expanded compared to what was done in the book.
There are 354 deaths among the 502 patients, and we will use a multivariable survival model to explain the time until death (of any cause).
Loading data
We start by showing a summary of all the variables in the dataset.
# Preprocess data for the function datasummary_skim
prostate_summary <- prostate

# Convert 'labelled' columns to numeric
prostate_summary[] <- lapply(prostate_summary, function(x) {
  if (inherits(x, "labelled")) {
    as.numeric(x)
  } else {
    x
  }
})

prostate_summary$bm <- as.factor(prostate$bm)
prostate_summary$stage <- as.factor(prostate$stage)
prostate_summary$hx <- as.factor(prostate$hx)

prostate_summary <- prostate_summary[
  , !names(prostate) %in% c("patno", "sdate")
]

factors <- sapply(prostate_summary, is.factor)
Summary of factor variables in data
datasummary_skim(prostate_summary[factors])
| Variable | Level | N | % |
|---|---|---:|---:|
| stage | 3 | 289 | 57.6 |
|  | 4 | 213 | 42.4 |
| rx | placebo | 127 | 25.3 |
|  | 0.2 mg estrogen | 124 | 24.7 |
|  | 1.0 mg estrogen | 126 | 25.1 |
|  | 5.0 mg estrogen | 125 | 24.9 |
| status | alive | 148 | 29.5 |
|  | dead - prostatic ca | 130 | 25.9 |
|  | dead - heart or vascular | 96 | 19.1 |
|  | dead - cerebrovascular | 31 | 6.2 |
|  | dead - pulmonary embolus | 14 | 2.8 |
|  | dead - other ca | 25 | 5.0 |
|  | dead - respiratory disease | 16 | 3.2 |
|  | dead - other specific non-ca | 28 | 5.6 |
|  | dead - unspecified non-ca | 7 | 1.4 |
|  | dead - unknown cause | 7 | 1.4 |
| pf | normal activity | 450 | 89.6 |
|  | in bed < 50% daytime | 37 | 7.4 |
|  | in bed > 50% daytime | 13 | 2.6 |
|  | confined to bed | 2 | 0.4 |
| hx | 0 | 289 | 57.6 |
|  | 1 | 213 | 42.4 |
| ekg | normal | 168 | 33.5 |
|  | benign | 23 | 4.6 |
|  | rhythmic disturb & electrolyte ch | 51 | 10.2 |
|  | heart block or conduction def | 26 | 5.2 |
|  | heart strain | 150 | 29.9 |
|  | old MI | 75 | 14.9 |
|  | recent MI | 1 | 0.2 |
| bm | 0 | 420 | 83.7 |
|  | 1 | 82 | 16.3 |
Summary of continuous variables in data
datasummary_skim(prostate_summary[!factors])
| Variable | Unique | Missing Pct. | Mean | SD | Min | Median | Max |
|---|---:|---:|---:|---:|---:|---:|---:|
| dtime | 76 | 0 | 36.1 | 23.3 | 0.0 | 34.0 | 76.0 |
| age | 42 | 0 | 71.5 | 7.1 | 48.0 | 73.0 | 89.0 |
| wt | 68 | 0 | 99.0 | 13.4 | 69.0 | 98.0 | 152.0 |
| sbp | 18 | 0 | 14.4 | 2.4 | 8.0 | 14.0 | 30.0 |
| dbp | 12 | 0 | 8.1 | 1.5 | 4.0 | 8.0 | 18.0 |
| hg | 91 | 0 | 13.4 | 2.0 | 5.9 | 13.7 | 21.2 |
| sz | 56 | 1 | 14.6 | 12.3 | 0.0 | 11.0 | 69.0 |
| sg | 12 | 2 | 10.3 | 2.0 | 5.0 | 10.0 | 15.0 |
| ap | 128 | 0 | 12.2 | 62.2 | 0.1 | 0.7 | 999.9 |
We see that the variable ekg has only one observation in its last category (recent MI) and pf has only two in its last (confined to bed). So we combine recent MI with old MI and call it MI, and we combine confined to bed with in bed \(>50\%\) daytime.
levels(prostate$ekg)[levels(prostate$ekg) %in% c("old MI", "recent MI")] <- "MI"
levels(prostate$pf) <- c(
levels(prostate$pf)[1:3],
levels(prostate$pf)[3]
)
Imputing missing values
We have 27 missing values in total, as seen below, so we impute them.
The proportion of missing values is only about \(0.05\), so we use single imputation, which keeps plotting and validation simple. If the proportion were higher, multiple imputation should be used instead (a sketch of that approach is shown after the single imputation below).
sum(is.na(prostate))
[1] 27
colSums(is.na(prostate))
patno stage rx dtime status age wt pf hx sbp dbp
0 0 0 0 0 1 2 0 0 0 0
ekg hg sz sg ap bm sdate
8 0 5 11 0 0 0
w <- transcan(
  ~ sz + sg + ap + sbp + dbp + age +
    wt + hg + ekg + pf + bm + hx,
  imputed = TRUE,
  data = prostate, pl = FALSE, pr = FALSE
)

attach(prostate)
sz <- impute(w, sz, data = prostate)
sg <- impute(w, sg, data = prostate)
age <- impute(w, age, data = prostate)
wt <- impute(w, wt, data = prostate)
ekg <- impute(w, ekg, data = prostate)
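For completeness, here is a minimal sketch of how multiple imputation could look with Hmisc, had the proportion of missing values been higher. It is not run in this analysis; the predictor list mirrors the transcan call above, the model formula matches the full Cox model fit below, and n.impute = 5 is an arbitrary choice.

# Sketch only (not run): multiple imputation with aregImpute,
# combining the fits over completed datasets with fit.mult.impute
mi <- aregImpute(
  ~ sz + sg + ap + sbp + dbp + age + wt + hg + ekg + pf + bm + hx,
  data = prostate, n.impute = 5
)
f_mi <- fit.mult.impute(
  Surv(dtime, status != "alive") ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx +
    rcs(sbp, 4) + rcs(dbp, 4) + ekg + rcs(hg, 4) +
    rcs(sg, 4) + rcs(sz, 4) + rcs(log(ap), 4) + bm,
  cph, mi, data = prostate
)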
Specifying the model
A Cox proportional hazards model specifies the hazard function as \[\lambda(t\vert X)=\lambda_0(t)\exp(X\beta),\] where \(\lambda_0(t)\) is an unspecified baseline hazard. We fit a Cox proportional hazards model with the predictors and their degrees of freedom specified in the table below. All continuous predictors are modeled with restricted cubic splines with 4 knots.
| Predictor | Name | d.f. | Levels |
|---|---|---:|---|
| Dose of estrogen | rx | 3 | placebo, 0.2, 1.0, 5.0 mg estrogen |
| Age in years | age | 3 | |
| Weight index | wt | 3 | |
| Performance rating | pf | 2 | normal, in bed < 50% of time, in bed > 50% |
| History of cardiovascular disease | hx | 1 | present/absent |
| Systolic blood pressure/10 | sbp | 3 | |
| Diastolic blood pressure/10 | dbp | 3 | |
| Electrocardiogram code | ekg | 5 | normal, benign, rhythm disturb., block, strain, myocardial infarction |
| Serum hemoglobin (g/100ml) | hg | 3 | |
| Tumor size (cm^2) | sz | 3 | |
| Stage/histologic grade combination | sg | 3 | |
| Serum prostatic acid phosphatase | ap | 3 | |
| Bone metastasis | bm | 1 | present/absent |
We see that we have 36 degrees of freedom in our model, about one tenth of the 354 deaths, so there is some hope that our model will validate.
dd <- datadist(prostate)
options(datadist = "dd")
units(prostate$dtime) <- "Month"
s <- Surv(prostate$dtime, prostate$status != "alive")

f <- cph(s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx +
           rcs(sbp, 4) + rcs(dbp, 4) + ekg + rcs(hg, 4) +
           rcs(sg, 4) + rcs(sz, 4) + rcs(log(ap), 4) + bm, data = prostate)
print(f, html = TRUE, coefs = FALSE)
Frequencies of Missing Values Due to Each Variable
s rx age wt pf hx sbp dbp ekg hg sg sz ap bm
0 0 1 2 0 0 0 0 8 0 11 5 0 0
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx + rcs(sbp,
4) + rcs(dbp, 4) + ekg + rcs(hg, 4) + rcs(sg, 4) + rcs(sz,
4) + rcs(log(ap), 4) + bm, data = prostate)
| | Model Tests | Discrimination Indexes |
|---|---|---|
| Obs 475 | LR chi2 136.31 | R2 0.250 |
| Events 338 | d.f. 36 | R2(36,475) 0.190 |
| Center -1.9598 | Pr(> chi2) 0.0000 | R2(36,338) 0.257 |
| | Score chi2 145.22 | Dxy 0.344 |
| | Pr(> chi2) 0.0000 | |
The likelihood ratio test for removing all predictors is highly significant, so modelling is warranted. AIC on the \(\chi^2\) scale is \(136.31-2\cdot 36=64.31\). A rough shrinkage estimate is \((136.31-36)/136.31\approx 0.74\), so we estimate that about 0.26 of the model fit is noise, especially with regard to calibration accuracy. One approach would be to multiply the coefficients of the final model by this shrinkage estimate if it is used for prediction (see the sketch below). Instead we try to do some data reduction using domain knowledge.
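The same heuristic shrinkage can be computed directly from the fit object; a small sketch, assuming the statistic names stored by rms in f$stats:

# Rough heuristic shrinkage: (LR chi2 - d.f.) / LR chi2
lr <- f$stats["Model L.R."]
df <- f$stats["d.f."]
unname((lr - df) / lr)  # roughly 0.74 for this fit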
Data reduction using domain knowledge
We make the following changes:

- Combine systolic blood pressure (sbp) and diastolic blood pressure (dbp) into mean arterial pressure (map).
- Combine history of cardiovascular disease (hx) and the electrocardiogram code (ekg) into a single heart disease code, assumed linear: 2 if the ekg is neither normal nor benign and hx is present, 1 if exactly one of these holds, and 0 if neither.
- The weight index is not that important, so use only 3 knots.
- Take the log of serum prostatic acid phosphatase (ap) because of its heavy right tail, and allow one extra knot to compensate.
- The tumor variables, tumor size (sz) and stage/histologic grade combination (sg), should be important, but since we have both we keep them with only 3 knots each.
- Assume performance rating (pf) is linear.
prostate$map <- (2 * prostate$dbp + prostate$sbp) / 3
label(prostate$map) <- "Mean Arterial Pressure/10"
prostate$heart_d <- prostate$hx + prostate$ekg %nin% c("normal", "benign")
label(prostate$heart_d) <- "Heart Disease Code"
prostate$pf_linear <- as.numeric(prostate$pf)
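As a quick, purely illustrative check of the new heart disease code (0, 1, or 2 risk indicators), we can tabulate it against its two components; note that %nin% treats a missing ekg as not normal/benign, so heart_d itself has no missing values.

# Illustrative check of the 0/1/2 coding of heart_d
with(prostate, table(heart_d, useNA = "ifany"))
with(prostate, table(hx, abnormal_ekg = ekg %nin% c("normal", "benign")))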
So all in all we save 4 degrees of freedom by defining map, 5 by defining the heart disease code, 3 by dropping knots (one each for wt, sg, and sz), and 1 by assuming performance rating is linear, while spending 1 extra degree of freedom on ap. That is a total reduction of 12 degrees of freedom, from 36 to 24.
dd <- datadist(prostate)
options(datadist = "dd")

f <- cph(
  s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
    + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) + rcs(log(ap), 5) + bm,
  x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 * 12, data = prostate
)
print(f, html = TRUE, coefs = 3)
Frequencies of Missing Values Due to Each Variable
s rx age wt pf_linear heart_d map hg
0 0 1 2 0 0 0 0
sg sz ap bm
11 5 0 0
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear +
heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz,
3) + rcs(log(ap), 5) + bm, data = prostate, x = TRUE, y = TRUE,
surv = TRUE, time.inc = 5 * 12)
| | Model Tests | Discrimination Indexes |
|---|---|---|
| Obs 483 | LR chi2 118.48 | R2 0.218 |
| Events 344 | d.f. 24 | R2(24,483) 0.178 |
| Center -2.2006 | Pr(> chi2) 0.0000 | R2(24,344) 0.240 |
| | Score chi2 125.95 | Dxy 0.323 |
| | Pr(> chi2) 0.0000 | |
Coef S.E. Wald Z Pr(>|Z|)
rx=0.2 mg estrogen -0.0131 0.1518 -0.09 0.9314
rx=1.0 mg estrogen -0.3867 0.1675 -2.31 0.0210
rx=5.0 mg estrogen -0.0916 0.1595 -0.57 0.5656
. . .
The LR \(\chi^2\) is 118.48 with 24 degrees of freedom. AIC on the \(\chi^2\) scale is \(118.48-2 \cdot 24=70.48\). The rough shrinkage estimate is now better, at \((118.48-24)/118.48\approx 0.80\).
anova(f, html = TRUE)
Wald Statistics Response: s
Factor Chi-Square d.f. P
rx 6.61 3 0.0856
age 12.30 3 0.0064
Nonlinear 7.69 2 0.0214
wt 8.48 2 0.0144
Nonlinear 3.39 1 0.0655
pf_linear 3.27 1 0.0706
heart_d 23.90 1 <.0001
map 0.28 2 0.8683
Nonlinear 0.21 1 0.6479
hg 8.89 3 0.0308
Nonlinear 3.95 2 0.1386
sg 2.39 2 0.3031
Nonlinear 0.00 1 0.9974
sz 12.42 2 0.0020
Nonlinear 0.04 1 0.8443
ap 6.95 4 0.1388
Nonlinear 6.77 3 0.0795
bm 0.17 1 0.6772
TOTAL NONLINEAR 20.66 11 0.0371
TOTAL 119.17 24 <.0001
We see that the p-value for the treatment, dose of estrogen (rx), is 0.086, so the evidence for an overall treatment effect is only suggestive, although the coefficient for the 1.0 mg dose was individually significant above. The relative contributions of the predictors can also be visualized, as sketched below.
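A small sketch using the plot method for rms anova objects, which by default ranks the terms by their Wald \(\chi^2\) minus degrees of freedom:

# Dot chart ranking predictors by Wald chi-square minus d.f.
plot(anova(f))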
Checking Proportional Hazards
A Cox model assumes proportional hazards: if we look at two individuals with covariate values \(\boldsymbol{X}\) and \(\boldsymbol{X}^*\), the ratio of their hazard rates is \[\frac{h(t \vert \boldsymbol{X})}{h(t \vert \boldsymbol{X}^*)}=\exp\Big[\sum_{k=1}^p \beta_k(X_k-X_k^*)\Big],\] which does not depend on \(t\). We can test this assumption with scaled Schoenfeld residuals, separately for each predictor, using the correlation-with-time test. Smoothed trends in the residuals are also plotted.
z <- predict(f, type = "terms", data = prostate)
f_short <- cph(s ~ z, x = TRUE, y = TRUE, data = prostate)
phtest <- cox.zph(f_short, transform = "identity", terms = FALSE)
phtest
chisq df p
rx 3.63e+00 1 0.057
age 1.63e+00 1 0.201
wt 1.28e-02 1 0.910
pf_linear 1.15e-02 1 0.915
heart_d 3.58e-01 1 0.550
map 1.61e+00 1 0.205
hg 1.18e-01 1 0.731
sg 1.28e-02 1 0.910
sz 2.38e-05 1 0.996
ap 1.41e-01 1 0.707
bm 2.25e-01 1 0.635
GLOBAL 8.89e+00 11 0.632
We see that the smallest p-value in the tests for proportional hazards is 0.057, for dose of estrogen, indicating that its relative hazard may not be constant over time. We plot its scaled Schoenfeld residuals and don't find anything too worrisome, and the global test of proportional hazards across all 11 predictors gives a p-value of 0.63. So we accept our model and continue.
plot(phtest, var = "rx")
Describing Effects
We plot how each predictor is related to the log hazard of death, with \(95\%\) confidence bands.
ggplot(Predict(f),
sepdiscrete = "vertical", nlevels = 4,
vnames = "names"
)
Presenting the final model for inference
To present point and interval estimates of the predictor effects we draw a hazard ratio chart. Since the ap relationship is so non-monotonic, we use a 20:1 ratio (ap = 20 vs. ap = 1) for this variable; the other predictors use the default interquartile-range comparison.
plot(summary(f, ap = c(1, 20)), log = TRUE, main = "")
The plot shows, for each variable, the hazard ratio when going from one value to another, for example from age 70 to age 76. \(95\%\), \(97.5\%\) and \(99\%\) confidence bands are added. One such comparison is reproduced numerically below.
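A sketch using rms's contrast() for the age 76 versus age 70 comparison; unspecified predictors are held at their datadist reference values, and since the model has no interactions the result depends only on age:

# Log hazard difference for age 76 vs. 70, with confidence limits
k <- contrast(f, list(age = 76), list(age = 70))
print(k)
exp(k$Contrast)  # the corresponding hazard ratio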
Finally we will also show a nomogram.
surv <- Survival(f)
surv3 <- function(x) surv(3 * 12, lp = x)
surv5 <- function(x) surv(5 * 12, lp = x)
quan <- Quantile(f)
med <- function(x) quan(lp = x) / 12
ss <- c(.05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95)

nom <- nomogram(f,
  ap = c(.1, .5, 1, 2, 3, 4, 5, 10, 20, 30, 40),
  fun = list(surv3, surv5, med),
  funlabel = c(
    "3-year Survival", "5-year Survival",
    "Median Survival Time (years)"
  ),
  fun.at = list(ss, ss, c(.5, 1:6))
)
plot(nom, cex.axis = 0.8, cex.var = 0.7, xfrac = .65, lmgp = .35)
This is interpreted by taking a value for each predictor and reading off how many points that value corresponds to. The total number of points can then be used to read off, for example, the probability of 5-year survival. The same quantities can also be computed directly, as sketched below.
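A sketch using survest() for a hypothetical patient; the covariate values below are made up purely for illustration:

# Hypothetical patient (illustrative values only)
new_pt <- data.frame(
  rx = "placebo", age = 70, wt = 100, pf_linear = 1, heart_d = 1,
  map = 10, hg = 13, sg = 10, sz = 10, ap = 0.7, bm = 0
)
# Estimated probability of surviving 5 years (60 months)
survest(f, newdata = new_pt, times = 5 * 12)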
Model for prediction
If the model is to be used for prediction, it should be validated. We start by doing that.
Validating the model
We will use bootstrapping for validation. We first validate this model for Somers’ \(D_{xy}\) rank correlation between predicted log hazard and observed survival time, and for slope shrinkage. The bootstrap is used (with 1000 resamples) to penalize for possible overfitting.
v <- validate(f, B = 1000)
options(prType = "html")
v
| Index | Original Sample | Training Sample | Test Sample | Optimism | Corrected Index | Successful Resamples |
|---|---:|---:|---:|---:|---:|---:|
| Dxy | 0.3234 | 0.3514 | 0.2985 | 0.053 | 0.2705 | 1000 |
| R2 | 0.2176 | 0.2561 | 0.1814 | 0.0747 | 0.1429 | 1000 |
| Slope | 1 | 1 | 0.781 | 0.219 | 0.781 | 1000 |
| D | 0.0303 | 0.0368 | 0.0247 | 0.0121 | 0.0182 | 1000 |
| U | -5e-04 | -5e-04 | 0.0027 | -0.0032 | 0.0027 | 1000 |
| Q | 0.0308 | 0.0373 | 0.022 | 0.0153 | 0.0155 | 1000 |
| g | 0.7298 | 0.8155 | 0.6369 | 0.1786 | 0.5512 | 1000 |
Here “training” refers to accuracy when evaluated on the bootstrap sample used to fit the model, and “test” refers to the accuracy when this model is applied without modification to the original sample. The apparent \(D_{xy}\) is 0.32, but a better estimate of how well the model will discriminate prognoses in the future is \(D_{xy} = 0.27\). The bootstrap estimate of slope shrinkage is \(0.78\), close to the simple heuristic estimate of \(0.80\). The corrected \(D_{xy}\) also translates into a corrected concordance index, as shown below.
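Since \(D_{xy} = 2(c - 0.5)\), the optimism-corrected \(D_{xy}\) converts directly into a corrected concordance index:

# Convert the corrected Dxy to a corrected c-index
dxy_corrected <- v["Dxy", "index.corrected"]
dxy_corrected / 2 + 0.5  # about 0.64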
Validate the model for predicting the probability of surviving five years
We will use the unshrunken estimates to show that this model has some overfitting, visible as regression to the mean.
cal <- calibrate(f, B = 1000, u = 5 * 12)
Using Cox survival estimates at 60 Months
Convergence problems.... stopping addition
Convergence problems.... stopping addition
plot(cal, subtitles = FALSE)
The line nearer the ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to bootstrap-corrected estimates. We clearly see some regression to the mean.
Final model for prediction
For a better prediction model, one should multiply the coefficients of the model by the shrinkage estimate.
v
| Index | Original Sample | Training Sample | Test Sample | Optimism | Corrected Index | Successful Resamples |
|---|---:|---:|---:|---:|---:|---:|
| Dxy | 0.3234 | 0.3514 | 0.2985 | 0.053 | 0.2705 | 1000 |
| R2 | 0.2176 | 0.2561 | 0.1814 | 0.0747 | 0.1429 | 1000 |
| Slope | 1 | 1 | 0.781 | 0.219 | 0.781 | 1000 |
| D | 0.0303 | 0.0368 | 0.0247 | 0.0121 | 0.0182 | 1000 |
| U | -5e-04 | -5e-04 | 0.0027 | -0.0032 | 0.0027 | 1000 |
| Q | 0.0308 | 0.0373 | 0.022 | 0.0153 | 0.0155 | 1000 |
| g | 0.7298 | 0.8155 | 0.6369 | 0.1786 | 0.5512 | 1000 |
shrinkage_est <- v[3, 5] # Extract bootstrap estimate of shrinkage (the corrected slope)
f_prediction <- f
f_prediction$coefficients <- f$coefficients * shrinkage_est

print(f_prediction, html = TRUE, coefs = 3)
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) + rcs(log(ap), 5) + bm, data = prostate, x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 * 12)
| | Model Tests | Discrimination Indexes |
|---|---|---|
| Obs 483 | LR χ2 118.48 | R2 0.218 |
| Events 344 | d.f. 24 | R2(24,483) 0.178 |
| Center -2.2006 | Pr(>χ2) 0.0000 | R2(24,344) 0.240 |
| | Score χ2 125.95 | Dxy 0.323 |
| | Pr(>χ2) 0.0000 | |
| | β | S.E. | Wald Z | Pr(>\|Z\|) |
|---|---:|---:|---:|---:|
| rx=0.2 mg estrogen | -0.0102 | 0.1518 | -0.07 | 0.9464 |
| rx=1.0 mg estrogen | -0.3021 | 0.1675 | -1.80 | 0.0713 |
| rx=5.0 mg estrogen | -0.0716 | 0.1595 | -0.45 | 0.6536 |
| … | | | | |
Approximating the full model
If the client doesn't want to collect all of these variables for prediction, we can do model approximation. The goal is to approximate the linear predictor of f_prediction, trading off the number of parameters against \(R^2\). This can be done by backwards selection.
z <- predict(f_prediction) # compute linear predictor from full model

# Force sigma to be 1 since we have a perfect fit
a <- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
           + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3)
           + rcs(log(ap), 5) + bm, sigma = 1, data = prostate)
backwards <- fastbw(a, aics = 10000)
print(backwards, html = TRUE, digits = 3)
Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
map 0.42 2 0.8124 0.42 2 0.8124 -3.58 0.998
bm 0.22 1 0.6428 0.63 3 0.8894 -5.37 0.997
sg 3.40 2 0.1826 4.03 5 0.5449 -5.97 0.980
pf_linear 4.77 1 0.0289 8.81 6 0.1848 -3.19 0.956
rx 12.93 3 0.0048 21.73 9 0.0098 3.73 0.892
age 14.73 3 0.0021 36.46 12 0.0003 12.46 0.819
wt 13.66 2 0.0011 50.12 14 0.0000 22.12 0.751
ap 20.68 4 0.0004 70.80 18 0.0000 34.80 0.648
hg 31.13 3 0.0000 101.93 21 0.0000 59.93 0.493
sz 44.95 2 0.0000 146.88 23 0.0000 100.88 0.269
heart_d 54.02 1 0.0000 200.90 24 0.0000 152.90 0.000
Approximate Estimates after Deleting Factors
Coef S.E. Wald Z P
[1,] 1.8e-11 0.0455 3.95e-10 1
Factors in Final Model
None
Now the client can choose the trade-off between predictive ability and the number of predictors. We visualize this in the following plot.
result <- as.data.frame(backwards$result)
total_predictors <- nrow(result)
r_squared <- result$R2
predictors <- (total_predictors - 1):0  # predictors remaining after each deletion
ggplot(data.frame(predictors, r_squared), aes(x = predictors, y = r_squared)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(x = "Predictors", y = TeX("$R^2$")) +
ggtitle("R-squared vs. Predictors") +
scale_y_continuous(breaks = pretty_breaks()) +
theme_minimal()
A reasonable trade-off would be to remove map, bm, sg, and pf_linear from our model, giving an approximate model that still has a high \(R^2\) of \(0.96\) for explaining the full model.
f_approx <- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d
                + rcs(hg, 4) + rcs(sz, 3) + rcs(log(ap), 5),
                x = TRUE, data = prostate)

print(f_approx, html = TRUE, coefs = 3)
Linear Regression Model
ols(formula = z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d + rcs(hg, 4) + rcs(sz, 3) + rcs(log(ap), 5), data = prostate, x = TRUE)
| | Model Likelihood Ratio Test | Discrimination Indexes |
|---|---|---|
| Obs 483 | LR χ2 1510.49 | R2 0.956 |
| σ 0.1378 | d.f. 18 | R2 adj 0.954 |
| d.f. 464 | Pr(>χ2) 0.0000 | g 0.715 |
Residuals

| Min | 1Q | Median | 3Q | Max |
|---:|---:|---:|---:|---:|
| -0.44623 | -0.08724 | -0.02054 | 0.05327 | 0.63271 |
| | β | S.E. | t | Pr(>\|t\|) |
|---|---:|---:|---:|---:|
| Intercept | 3.9360 | 0.2263 | 17.39 | <0.0001 |
| rx=0.2 mg estrogen | -0.0030 | 0.0178 | -0.17 | 0.8652 |
| rx=1.0 mg estrogen | -0.4058 | 0.0179 | -22.69 | <0.0001 |
| … | | | | |
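Because the approximate model predicts the (shrunken) linear predictor of the full model, its output can be pushed through the survival function estimated from the full fit f, in the spirit of the model approximation idea in Regression Modeling Strategies. A rough sketch, reusing surv5 from the nomogram section; the covariate values are made up purely for illustration and the baseline survival is still the one from f:

# Hypothetical patient (illustrative values only)
new_pt <- data.frame(
  rx = "placebo", age = 70, wt = 100, heart_d = 1,
  hg = 13, sz = 10, ap = 0.7
)
lp_approx <- predict(f_approx, newdata = new_pt)  # approximate linear predictor
surv5(lp_approx)  # estimated 5-year survival probability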
Conclusion
We developed a Cox proportional hazards model to explain time until death (of any cause) for this dataset, and described the effects of the predictors in the model with various plots.
We validated and calibrated the model for the case where one wishes to use it for prediction, accounting for issues such as regression to the mean. We then developed a simplified model which dropped 4 predictors but could still reproduce the full model with high accuracy (\(R^2 = 0.96\)).
Throughout we used the guidelines of Regression Modeling Strategies to follow sound statistical practice when building and validating the model.