library(Hmisc)
library(survival)
library(rms)
library(scales)
library(ggplot2)
library(latex2exp)
library(knitr)
library(kableExtra)
library(modelsummary)
getHdata(prostate)
Cox model case study
Analyzing prostate cancer dataset using a Cox model. The primary goal is to discover patterns in survival, and to do an analysis of covariance to assess the effect of treatment while adjusting for patient heterogeneity. Prediction is a secondary goal.
Data
We consider the 506-patient prostate cancer dataset from Byar and Green. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration.
We will follow guidelines given in the book Regression Modeling Strategies by Frank Harrell and his general approach to this dataset in Chapter 21. The analysis is expanded compared to what was done in the book.
There are 354 deaths among the 502 patients and we will use a multivariate survival model to explain the time until death (of any cause).
Loading data
We start by showing summary of all the variables in the dataset
Click here to see output
Preprocess data for function datasummary_skim
.
<- prostate
prostate_summary
# Convert 'labelled' columns to numeric
<- lapply(prostate_summary, function(x) {
prostate_summary[] if (inherits(x, "labelled")) {
as.numeric(x)
else {
}
x
}
})
$bm <- as.factor(prostate$bm)
prostate_summary$stage <- as.factor(prostate$stage)
prostate_summary$hx <- as.factor(prostate$hx)
prostate_summary
<- prostate_summary[
prostate_summary
,!names(prostate) %in% c("patno", "sdate")
]
<- sapply(prostate_summary, is.factor) factors
Summary of factor variables in data
datasummary_skim(prostate_summary[factors])
N | % | ||
---|---|---|---|
stage | 3 | 289 | 57.6 |
4 | 213 | 42.4 | |
rx | placebo | 127 | 25.3 |
0.2 mg estrogen | 124 | 24.7 | |
1.0 mg estrogen | 126 | 25.1 | |
5.0 mg estrogen | 125 | 24.9 | |
status | alive | 148 | 29.5 |
dead - prostatic ca | 130 | 25.9 | |
dead - heart or vascular | 96 | 19.1 | |
dead - cerebrovascular | 31 | 6.2 | |
dead - pulmonary embolus | 14 | 2.8 | |
dead - other ca | 25 | 5.0 | |
dead - respiratory disease | 16 | 3.2 | |
dead - other specific non-ca | 28 | 5.6 | |
dead - unspecified non-ca | 7 | 1.4 | |
dead - unknown cause | 7 | 1.4 | |
pf | normal activity | 450 | 89.6 |
in bed < 50% daytime | 37 | 7.4 | |
in bed > 50% daytime | 13 | 2.6 | |
confined to bed | 2 | 0.4 | |
hx | 0 | 289 | 57.6 |
1 | 213 | 42.4 | |
ekg | normal | 168 | 33.5 |
benign | 23 | 4.6 | |
rhythmic disturb & electrolyte ch | 51 | 10.2 | |
heart block or conduction def | 26 | 5.2 | |
heart strain | 150 | 29.9 | |
old MI | 75 | 14.9 | |
recent MI | 1 | 0.2 | |
bm | 0 | 420 | 83.7 |
1 | 82 | 16.3 |
Summary of continuous variables in data
datasummary_skim(prostate_summary[!factors])
Unique | Missing Pct. | Mean | SD | Min | Median | Max | Histogram | |
---|---|---|---|---|---|---|---|---|
dtime | 76 | 0 | 36.1 | 23.3 | 0.0 | 34.0 | 76.0 | |
age | 42 | 0 | 71.5 | 7.1 | 48.0 | 73.0 | 89.0 | |
wt | 68 | 0 | 99.0 | 13.4 | 69.0 | 98.0 | 152.0 | |
sbp | 18 | 0 | 14.4 | 2.4 | 8.0 | 14.0 | 30.0 | |
dbp | 12 | 0 | 8.1 | 1.5 | 4.0 | 8.0 | 18.0 | |
hg | 91 | 0 | 13.4 | 2.0 | 5.9 | 13.7 | 21.2 | |
sz | 56 | 1 | 14.6 | 12.3 | 0.0 | 11.0 | 69.0 | |
sg | 12 | 2 | 10.3 | 2.0 | 5.0 | 10.0 | 15.0 | |
ap | 128 | 0 | 12.2 | 62.2 | 0.1 | 0.7 | 999.9 |
We see that the variable ekg only has one observation in the last category and pf has two. So, we combine recent MI with old MI and call it MI, and we combine confined to bed with \(>50\%\) daytime.
levels(prostate$ekg)[levels(prostate$ekg) %in% c("old MI", "recent MI")] <- "MI"
levels(prostate$pf) <- c(
levels(prostate$pf)[1:3],
levels(prostate$pf)[3]
)
Imputing missing values
We have 27 patients with missing values as seen below, so we do imputation on these.
The proportion of missing values is \(0.05\), so we just do single imputation, since simpler to make plots and validation. However if the proportion were higher one should do multiple imputation.
sum(is.na(prostate))
[1] 27
colSums(is.na(prostate))
patno stage rx dtime status age wt pf hx sbp dbp
0 0 0 0 0 1 2 0 0 0 0
ekg hg sz sg ap bm sdate
8 0 5 11 0 0 0
<- transcan(
w ~ sz + sg + ap + sbp + dbp + age +
+ hg + ekg + pf + bm + hx,
wt imputed = TRUE,
data = prostate, pl = FALSE, pr = FALSE
)
attach(prostate)
<- impute(w, sz, data = prostate)
sz <- impute(w, sg, data = prostate)
sg <- impute(w, age, data = prostate)
age <- impute(w, wt, data = prostate)
wt <- impute(w, ekg, data = prostate) kg
Specifying the model
A Cox proportional hazards model is in terms of the hazard function given by: \[\lambda(t\vert X)=\lambda(t)\exp(X\beta).\] We fit a Cox proportional hazards model with the predictors and their degrees of freedom specified in the table below. All continuous predictors will be modeled with a restricted cubic spline with 4 knots.
Predictor | Name | d.f. | Levels |
---|---|---|---|
Dose of estrogen | rx | 3 | placebo, 0.2, 1.0, 5.0 mg estrogen |
Age in years | age | 3 | |
Weight index | wt | 3 | |
Performance rating | pf | 2 | normal, in bed < 50% of time, in bed > 50% |
History of cardiovascular disease | hx | 1 | present/absent |
Systolic blood pressure/10 | sbp | 3 | |
Diastolic blood pressure/10 | dbp | 3 | |
Electrocardiogram code | ekg | 5 | normal, benign, rhythm disturb., block, strain, myocardial infarction |
Serum hemoglobin (g/100ml) | hg | 3 | |
Tumor size (cm^2) | sz | 3 | |
Stage/histologic grade combination | sg | 3 | |
Serum prostatic acid phosphatase | ap | 3 | |
Bone metastasis | bm | 1 | present/absent |
We see that we have 36 degrees of freedom in our model, so about 1/10 of the amount of deaths there are. So there is some hope our model will validate.
<- datadist(prostate)
dd options(datadist = "dd")
units(dtime) <- "Month"
<- Surv(dtime, status != "alive")
s
<- cph(s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx +
f rcs(sbp, 4) + rcs(dbp, 4) + ekg + rcs(hg, 4) +
rcs(sg, 4) + rcs(sz, 4) + rcs(log(ap), 4) + bm)
print(f, html = TRUE, coefs = FALSE)
Frequencies of Missing Values Due to Each Variable
s rx age wt pf hx sbp dbp ekg hg sg sz ap bm
0 0 0 0 0 0 0 0 8 0 0 0 0 0
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx + rcs(sbp,
4) + rcs(dbp, 4) + ekg + rcs(hg, 4) + rcs(sg, 4) + rcs(sz,
4) + rcs(log(ap), 4) + bm)
Model Tests Discrimination
Indexes
Obs 494 LR chi2 136.04 R2 0.241
Events 348 d.f. 36 R2(36,494)0.183
Center -2.0435 Pr(> chi2) 0.0000 R2(36,348)0.250
Score chi2 144.95 Dxy 0.338
Pr(> chi2) 0.0000
Test for removing all predictors is highly significant, so modelling is warranted. AIC on the \(\chi^2\)-scale is given by \(136.04-2\cdot 36=64.04\). A rough shrinkage estimate is thus given by \((136.04-36)/136.04=0.74\). So we estimate that 0.26 of the model fitting will be noise, especially with regard to calibration accuracy. One approach would just be to multiply coefficients of final model if used for prediction by the shrinkage estimate. Instead we try to do some data reduction using domain knowledge.
Data reduction using domain knowledge
We do the following changes
Combine systolic blood pressure (sbp) and diastolic blood pressure (dbp) into Mean Arterial Pressure.
Combine history of cardiovascular disease (hx) and electrocardiogram code ekg and assume linear. We code it as 2 if ekg are not normal or benign and hx, 1 if either, and 0 if none.
Weight index not that important, so only use 3 knots.
Take log of Serum prostatic acid phosphatase (ap) for numeric stability (heavy right-tail) and allow 1 more knot for compensation
The tumor variables tumor size (sz) and stage/histologic grade combination (sg) should be important variables, but since we have both we keep them but give them only 3 knots each.
Assume performance rating (pf) is linear.
<- (2 * dbp + sbp) / 3
map label(map) <- "Mean Arterial Pressure/10"
<- hx + ekg %nin% c("normal", "benign")
heart_d label(heart_d) <- "Heart Disease Code"
<- as.numeric(pf) pf_linear
So all in all we save 3 degrees of freedom from defining map, 5 for defining Heart Disease Code, and 3 for dropping knots, 1 for assuming performance rating is linear and gain 1 from ap. For a total reduction of 11 degrees of freedom.
<- datadist(dd, heart_d, map, pf_linear)
dd
<- cph(
f ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
s + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) + rcs(log(ap), 5) + bm,
x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 * 12
)
print(f, html = TRUE, coefs = 3)
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear +
heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz,
3) + rcs(log(ap), 5) + bm, x = TRUE, y = TRUE, surv = TRUE,
time.inc = 5 * 12)
Model Tests Discrimination
Indexes
Obs 502 LR chi2 119.86 R2 0.212
Events 354 d.f. 24 R2(24,502)0.174
Center -2.3792 Pr(> chi2) 0.0000 R2(24,354)0.237
Score chi2 126.89 Dxy 0.323
Pr(> chi2) 0.0000
Coef S.E. Wald Z Pr(>|Z|)
rx=0.2 mg estrogen 0.0005 0.1492 0.00 0.9975
rx=1.0 mg estrogen -0.4136 0.1659 -2.49 0.0127
rx=5.0 mg estrogen -0.1044 0.1572 -0.66 0.5066
. . .
LR 119.86 with 24 degrees of freedom. AIC is \(119.86-2 \cdot 24=71.86\). Rough shrinkage estimate of this is now better at \((119.86-24)/119.86=0.80\).
anova(f, html = TRUE)
Wald Statistics Response: s
Factor Chi-Square d.f. P
rx 7.92 3 0.0476
age 13.65 3 0.0034
Nonlinear 8.94 2 0.0115
wt 8.23 2 0.0163
Nonlinear 2.57 1 0.1092
pf_linear 3.50 1 0.0615
heart_d 24.91 1 <.0001
map 0.05 2 0.9753
Nonlinear 0.05 1 0.8232
hg 12.49 3 0.0059
Nonlinear 8.10 2 0.0174
sg 1.70 2 0.4269
Nonlinear 0.05 1 0.8318
sz 12.43 2 0.0020
Nonlinear 0.07 1 0.7888
ap 6.46 4 0.1672
Nonlinear 6.16 3 0.1042
bm 0.03 1 0.8523
TOTAL NONLINEAR 23.58 11 0.0146
TOTAL 120.27 24 <.0001
We see that the p-value for the treatment, doses of estrogen (rx), is below \(0.05\), so it seems like it has some effect.
Checking Proportional Hazards
A Cox model should have proportional hazards, since if we look at two individuals with covariate values \(\boldsymbol{X}\) and \(\boldsymbol{X}^*\) then the ratio of their hazard rates is \[\frac{h(t \vert \boldsymbol{X})}{h(t \vert \boldsymbol{X}^*)}=\exp\Big[\sum_{k=1}^p \beta_k(X_k-X_{k^*})\Big],\] which is a constant. We can test this assumption using scaled Schoenfeld residuals separately for each predictor and test the PH assumption using the “correlation with time” test. Smoothed trends in the residuals are also plotted.
<- predict(f, type = "terms")
z <- cph(s ~ z, x = TRUE, y = TRUE)
f_short <- cox.zph(f_short, transform = "identity", terms = FALSE)
phtest phtest
chisq df p
rx 3.906254 1 0.048
age 1.218347 1 0.270
wt 0.016868 1 0.897
pf_linear 0.190007 1 0.663
heart_d 0.360714 1 0.548
map 0.046079 1 0.830
hg 0.018155 1 0.893
sg 0.904761 1 0.342
sz 0.218858 1 0.640
ap 0.009228 1 0.923
bm 0.000329 1 0.986
GLOBAL 7.813453 11 0.730
We see that the p-value of the test for proportional hazards is below \(0.05\) for dose of estrogen, indicating that it may not have constant relative hazard. We plot it and don’t find anything too worry-some, and the global test of proportional hazard by adjusting for the 11 tests gives a p-value of \(0.78\). So we accept our model and continue.
plot(phtest, var = "rx")
Describing Effects
We plot how each predictor is related to the log hazard of death with \(95\%\)-confidence bands.
ggplot(Predict(f),
sepdiscrete = "vertical", nlevels = 4,
vnames = "names"
)
Presenting the final model for inference
To present point and interval estimates of predictor effects we draw a hazard ratio chart. Since the ap relationship is so non-monotonic we use a 20:1 hazard ratio for this variable. The others use the default.
plot(summary(f, ap = c(1, 20)), log = TRUE, main = "")
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
This plot is interpreted as the hazard ratio for each variable if we went from for example age 70 to age 76. \(95\%, 97.5\%\) and \(99\%\) confidence bands are added.
Finally we will also show a nomogram.
<- Survival(f)
surv <- function(x) surv(3 * 12, lp = x)
surv3 <- function(x) surv(5 * 12, lp = x)
surv5 <- Quantile(f)
quan <- function(x) quan(lp = x) / 12
med <- c(.05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95)
ss <- nomogram(f,
nom ap = c(.1, .5, 1, 2, 3, 4, 5, 10, 20, 30, 40),
fun = list(surv3, surv5, med),
funlabel = c(
"3-year Survival", "5-year Survival",
"Median Survival Time (years)"
),fun.at = list(ss, ss, c(.5, 1:6))
)
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
plot(nom, cex.axis = 0.8, cex.var = 0.7, xfrac = .65, lmgp = .35)
This is interpreted by taking a value for each predictor and seeing how many points each value correspond to. Then the total amount of points can be used to find the probability of for example 5-year survival.
Model for prediction
For a prediction model one should validate the model. We start by doing that
Validating the model
We will use bootstrapping for validation. We first validate this model for Somers’ \(D_{xy}\) rank correlation between predicted log hazard and observed survival time, and for slope shrinkage. The bootstrap is used (with 1000 resamples) to penalize for possible overfitting.
<- validate(f, B = 1000)
v options(prType = "html")
v
Index | Original Sample |
Training Sample |
Test Sample |
Optimism | Corrected Index |
Successful Resamples |
---|---|---|---|---|---|---|
Dxy | 0.3228 | 0.3514 | 0.298 | 0.0533 | 0.2695 | 1000 |
R2 | 0.2125 | 0.2513 | 0.1779 | 0.0735 | 0.139 | 1000 |
Slope | 1 | 1 | 0.7844 | 0.2156 | 0.7844 | 1000 |
D | 0.0296 | 0.036 | 0.0242 | 0.0118 | 0.0178 | 1000 |
U | -5e-04 | -5e-04 | 0.0025 | -0.003 | 0.0025 | 1000 |
Q | 0.0301 | 0.0365 | 0.0217 | 0.0148 | 0.0152 | 1000 |
g | 0.7245 | 0.8108 | 0.6345 | 0.1763 | 0.5482 | 1000 |
Here “training” refers to accuracy when evaluated on the bootstrap sample used to fit the model, and “test” refers to the accuracy when this model is applied without modification to the original sample. The apparent \(D_{xy}\) is 0.32, but a better estimate of how well the model will discriminate prognoses in the future is \(D_{xy} = 0.27\). The bootstrap estimate of slope shrinkage is \(0.79\), close to the simple heuristic estimate.
Validate the model for predicting the probability of surviving five years
We will use the non-shrinked estimates to show that this model will have some overfitting consistent with regression to the mean.
<- calibrate(f, B = 1000, u = 5 * 12) cal
Using Cox survival estimates at 60 Months
Convergence problems.... stopping addition
Convergence problems.... stopping addition
Convergence problems.... stopping addition
Convergence problems.... stopping addition
plot(cal, subtitles = FALSE)
The line nearer the ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to bootstrap-corrected estimates. We clearly see some regression to the mean.
Final model for prediction
For a better model for prediction one should multiply the coefficients in the model by the shrinkage estimate.
v
Index | Original Sample |
Training Sample |
Test Sample |
Optimism | Corrected Index |
Successful Resamples |
---|---|---|---|---|---|---|
Dxy | 0.3228 | 0.3514 | 0.298 | 0.0533 | 0.2695 | 1000 |
R2 | 0.2125 | 0.2513 | 0.1779 | 0.0735 | 0.139 | 1000 |
Slope | 1 | 1 | 0.7844 | 0.2156 | 0.7844 | 1000 |
D | 0.0296 | 0.036 | 0.0242 | 0.0118 | 0.0178 | 1000 |
U | -5e-04 | -5e-04 | 0.0025 | -0.003 | 0.0025 | 1000 |
Q | 0.0301 | 0.0365 | 0.0217 | 0.0148 | 0.0152 | 1000 |
g | 0.7245 | 0.8108 | 0.6345 | 0.1763 | 0.5482 | 1000 |
<- v[3, 5] # Extract bootstrap estimate of shrinkage
shrinkage_est <- f
f_prediction $coefficients <- f$coefficients * shrinkage_est
f_prediction
print(f_prediction, html = TRUE, coef = 3)
Cox Proportional Hazards Model
cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) + rcs(log(ap), 5) + bm, x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 * 12)
Model Tests | Discrimination Indexes |
|
---|---|---|
Obs 502 | LR χ2 119.86 | R2 0.212 |
Events 354 | d.f. 24 | R224,502 0.174 |
Center -2.3792 | Pr(>χ2) 0.0000 | R224,354 0.237 |
Score χ2 126.89 | Dxy 0.323 | |
Pr(>χ2) 0.0000 |
β | S.E. | Wald Z | Pr(>|Z|) | |
---|---|---|---|---|
rx=0.2 mg estrogen | 0.0004 | 0.1492 | 0.00 | 0.9980 |
rx=1.0 mg estrogen | -0.3244 | 0.1659 | -1.96 | 0.0506 |
rx=5.0 mg estrogen | -0.0819 | 0.1572 | -0.52 | 0.6024 |
… |
Approximating the full model
If the client doesn’t want to collect all these variables for prediction we can do model approximation. The goal is to explain f_prediction
with a trade off between parameters and \(R^2\). This can be done by backwards selection.
<- predict(f_prediction) # compute linear predictor from full model z
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
Consider formula(paste(x, collapse = " ")) instead.
# Force sigma to be 1 since perfect fit
<- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
a + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3)
+ rcs(log(ap), 5) + bm, sigma = 1)
<- fastbw(a, aics = 10000)
backwards print(backwards, html = TRUE, digits = 3)
Deleted Chi-Sq d.f. P Residual d.f. P AIC R2
map 0.07 2 0.9641 0.07 2 0.9641 -3.93 1.000
bm 0.04 1 0.8321 0.12 3 0.9896 -5.88 0.999
sg 2.50 2 0.2861 2.62 5 0.7582 -7.38 0.987
pf_linear 4.88 1 0.0271 7.51 6 0.2767 -4.49 0.964
rx 15.66 3 0.0013 23.16 9 0.0058 5.16 0.888
wt 12.41 2 0.0020 35.57 11 0.0002 13.57 0.828
age 16.50 3 0.0009 52.08 14 0.0000 24.08 0.748
ap 19.61 4 0.0006 71.69 18 0.0000 35.69 0.653
hg 36.53 3 0.0000 108.22 21 0.0000 66.22 0.475
sz 42.83 2 0.0000 151.04 23 0.0000 105.04 0.268
heart_d 55.27 1 0.0000 206.31 24 0.0000 158.31 0.000
Approximate Estimates after Deleting Factors
Coef S.E. Wald Z P
[1,] 7.58e-11 0.0446 1.7e-09 1
Factors in Final Model
None
Now the client can choose the trade-off between predictive ability and number of predictors. We visualize this in the following plot
<- as.data.frame(backwards$result)
result <- nrow(result)
total_predictors <- result$R2
r_squared <- c((total_predictors - 1):0)
predictors
ggplot(data.frame(predictors, r_squared), aes(x = predictors, y = r_squared)) +
geom_line(color = "blue") +
geom_point(color = "red") +
labs(x = "Predictors", y = TeX("$R^2$")) +
ggtitle("R-squared vs. Predictors") +
scale_y_continuous(breaks = pretty_breaks()) +
theme_minimal()
A reasonable approach trade off would be to remove map, bm, sg and pf_linear from our model to get an approximate model which still has high \(R^2\) of \(0.96\) for explaining the full model.
<- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d
f_approx + rcs(hg, 4) + rcs(sz, 3) + rcs(log(ap), 5), x = TRUE)
print(f_approx, html = TRUE, coefs = 3)
Linear Regression Model
ols(formula = z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d + rcs(hg, 4) + rcs(sz, 3) + rcs(log(ap), 5), x = TRUE)
Model Likelihood Ratio Test |
Discrimination Indexes |
|
---|---|---|
Obs 502 | LR χ2 1663.53 | R2 0.964 |
σ 0.1247 | d.f. 18 | R2adj 0.962 |
d.f. 483 | Pr(>χ2) 0.0000 | g 0.713 |
Residuals
Min 1Q Median 3Q Max -0.30962 -0.07092 -0.02333 0.04420 0.63137
β | S.E. | t | Pr(>|t|) | |
---|---|---|---|---|
Intercept | 3.7212 | 0.1976 | 18.83 | <0.0001 |
rx=0.2 mg estrogen | 0.0070 | 0.0158 | 0.44 | 0.6600 |
rx=1.0 mg estrogen | -0.4338 | 0.0159 | -27.32 | <0.0001 |
… |
Conclusion
We devolved a Cox proportional hazards model to explain time until death (of any cause) for this dataset. We described the effects the predictors in the model had by various plots.
We did validation and calibration if one wished to use the model for prediction, to avoid issues such as regression to the mean. We then developed a simplified model which dropped 4 predictors, but could still predict the full model with high accuracy (\(R^2=0.964)\).
Throughout we used the guidelines of Regression Modelling Strategies to follow sounds statistical strategies when building and validating a model.