Cox model case study

Survival analysis
R
Published

July, 2024

Analyzing prostate cancer dataset using a Cox model. The primary goal is to discover patterns in survival, and to do an analysis of covariance to assess the effect of treatment while adjusting for patient heterogeneity. Prediction is a secondary goal.

Data

We consider the 506-patient prostate cancer dataset from Byar and Green. These data were from a randomized trial comparing four treatments for stage 3 and 4 prostate cancer, with almost equal numbers of patients on placebo and each of three doses of estrogen. Four patients had missing values on all of the following variables: wt, pf, hx, sbp, dbp, ekg, hg, bm; two of these patients were also missing sz. These patients are excluded from consideration.

We will follow guidelines given in the book Regression Modeling Strategies by Frank Harrell and his general approach to this dataset in Chapter 21. The analysis is expanded compared to what was done in the book.

There are 354 deaths among the 502 patients and we will use a multivariate survival model to explain the time until death (of any cause).

Loading data

library(Hmisc)
library(survival)
library(rms)
library(scales)
library(ggplot2)
library(latex2exp)
library(knitr)
library(kableExtra)
library(modelsummary)
getHdata(prostate)

We start by showing summary of all the variables in the dataset

Click here to see output

Preprocess data for function datasummary_skim.

prostate_summary <- prostate

# Convert 'labelled' columns to numeric
prostate_summary[] <- lapply(prostate_summary, function(x) {
  if (inherits(x, "labelled")) {
    as.numeric(x)
  } else {
    x
  }
})

prostate_summary$bm <- as.factor(prostate$bm)
prostate_summary$stage <- as.factor(prostate$stage)
prostate_summary$hx <- as.factor(prostate$hx)

prostate_summary <- prostate_summary[
  ,
  !names(prostate) %in% c("patno", "sdate")
]

factors <- sapply(prostate_summary, is.factor)

Summary of factor variables in data

datasummary_skim(prostate_summary[factors])
N %
stage 3 289 57.6
4 213 42.4
rx placebo 127 25.3
0.2 mg estrogen 124 24.7
1.0 mg estrogen 126 25.1
5.0 mg estrogen 125 24.9
status alive 148 29.5
dead - prostatic ca 130 25.9
dead - heart or vascular 96 19.1
dead - cerebrovascular 31 6.2
dead - pulmonary embolus 14 2.8
dead - other ca 25 5.0
dead - respiratory disease 16 3.2
dead - other specific non-ca 28 5.6
dead - unspecified non-ca 7 1.4
dead - unknown cause 7 1.4
pf normal activity 450 89.6
in bed < 50% daytime 37 7.4
in bed > 50% daytime 13 2.6
confined to bed 2 0.4
hx 0 289 57.6
1 213 42.4
ekg normal 168 33.5
benign 23 4.6
rhythmic disturb & electrolyte ch 51 10.2
heart block or conduction def 26 5.2
heart strain 150 29.9
old MI 75 14.9
recent MI 1 0.2
bm 0 420 83.7
1 82 16.3

Summary of continuous variables in data

datasummary_skim(prostate_summary[!factors])
Unique Missing Pct. Mean SD Min Median Max Histogram
dtime 76 0 36.1 23.3 0.0 34.0 76.0
age 42 0 71.5 7.1 48.0 73.0 89.0
wt 68 0 99.0 13.4 69.0 98.0 152.0
sbp 18 0 14.4 2.4 8.0 14.0 30.0
dbp 12 0 8.1 1.5 4.0 8.0 18.0
hg 91 0 13.4 2.0 5.9 13.7 21.2
sz 56 1 14.6 12.3 0.0 11.0 69.0
sg 12 2 10.3 2.0 5.0 10.0 15.0
ap 128 0 12.2 62.2 0.1 0.7 999.9

We see that the variable ekg only has one observation in the last category and pf has two. So, we combine recent MI with old MI and call it MI, and we combine confined to bed with \(>50\%\) daytime.

levels(prostate$ekg)[levels(prostate$ekg) %in% c("old MI", "recent MI")] <- "MI"

levels(prostate$pf) <- c(
  levels(prostate$pf)[1:3],
  levels(prostate$pf)[3]
)

Imputing missing values

We have 27 patients with missing values as seen below, so we do imputation on these.

The proportion of missing values is \(0.05\), so we just do single imputation, since simpler to make plots and validation. However if the proportion were higher one should do multiple imputation.

sum(is.na(prostate))
[1] 27
colSums(is.na(prostate))
 patno  stage     rx  dtime status    age     wt     pf     hx    sbp    dbp 
     0      0      0      0      0      1      2      0      0      0      0 
   ekg     hg     sz     sg     ap     bm  sdate 
     8      0      5     11      0      0      0 
w <- transcan(
  ~ sz + sg + ap + sbp + dbp + age +
    wt + hg + ekg + pf + bm + hx,
  imputed = TRUE,
  data = prostate, pl = FALSE, pr = FALSE
)

attach(prostate)
sz <- impute(w, sz, data = prostate)
sg <- impute(w, sg, data = prostate)
age <- impute(w, age, data = prostate)
wt <- impute(w, wt, data = prostate)
kg <- impute(w, ekg, data = prostate)

Specifying the model

A Cox proportional hazards model is in terms of the hazard function given by: \[\lambda(t\vert X)=\lambda(t)\exp(X\beta).\] We fit a Cox proportional hazards model with the predictors and their degrees of freedom specified in the table below. All continuous predictors will be modeled with a restricted cubic spline with 4 knots.

Predictor Name d.f. Levels
Dose of estrogen rx 3 placebo, 0.2, 1.0, 5.0 mg estrogen
Age in years age 3
Weight index wt 3
Performance rating pf 2 normal, in bed < 50% of time, in bed > 50%
History of cardiovascular disease hx 1 present/absent
Systolic blood pressure/10 sbp 3
Diastolic blood pressure/10 dbp 3
Electrocardiogram code ekg 5 normal, benign, rhythm disturb., block, strain, myocardial infarction
Serum hemoglobin (g/100ml) hg 3
Tumor size (cm^2) sz 3
Stage/histologic grade combination sg 3
Serum prostatic acid phosphatase ap 3
Bone metastasis bm 1 present/absent

We see that we have 36 degrees of freedom in our model, so about 1/10 of the amount of deaths there are. So there is some hope our model will validate.

dd <- datadist(prostate)
options(datadist = "dd")
units(dtime) <- "Month"
s <- Surv(dtime, status != "alive")

f <- cph(s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx +
           rcs(sbp, 4) + rcs(dbp, 4) + ekg + rcs(hg, 4) +
           rcs(sg, 4) + rcs(sz, 4) + rcs(log(ap), 4) + bm)
print(f, html = TRUE, coefs = FALSE)
Frequencies of Missing Values Due to Each Variable
  s  rx age  wt  pf  hx sbp dbp ekg  hg  sg  sz  ap  bm 
  0   0   0   0   0   0   0   0   8   0   0   0   0   0 

Cox Proportional Hazards Model

cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 4) + pf + hx + rcs(sbp, 
    4) + rcs(dbp, 4) + ekg + rcs(hg, 4) + rcs(sg, 4) + rcs(sz, 
    4) + rcs(log(ap), 4) + bm)

                        Model Tests     Discrimination    
                                               Indexes    
Obs        494    LR chi2    136.04     R2       0.241    
Events     348    d.f.           36    R2(36,494)0.183    
Center -2.0435    Pr(> chi2) 0.0000    R2(36,348)0.250    
                  Score chi2 144.95     Dxy      0.338    
                  Pr(> chi2) 0.0000                       

Test for removing all predictors is highly significant, so modelling is warranted. AIC on the \(\chi^2\)-scale is given by \(136.04-2\cdot 36=64.04\). A rough shrinkage estimate is thus given by \((136.04-36)/136.04=0.74\). So we estimate that 0.26 of the model fitting will be noise, especially with regard to calibration accuracy. One approach would just be to multiply coefficients of final model if used for prediction by the shrinkage estimate. Instead we try to do some data reduction using domain knowledge.

Data reduction using domain knowledge

We do the following changes

  • Combine systolic blood pressure (sbp) and diastolic blood pressure (dbp) into Mean Arterial Pressure.

  • Combine history of cardiovascular disease (hx) and electrocardiogram code ekg and assume linear. We code it as 2 if ekg are not normal or benign and hx, 1 if either, and 0 if none.

  • Weight index not that important, so only use 3 knots.

  • Take log of Serum prostatic acid phosphatase (ap) for numeric stability (heavy right-tail) and allow 1 more knot for compensation

  • The tumor variables tumor size (sz) and stage/histologic grade combination (sg) should be important variables, but since we have both we keep them but give them only 3 knots each.

  • Assume performance rating (pf) is linear.

map <- (2 * dbp + sbp) / 3
label(map) <- "Mean Arterial Pressure/10"

heart_d <- hx + ekg %nin% c("normal", "benign")
label(heart_d) <- "Heart Disease Code"

pf_linear <- as.numeric(pf)

So all in all we save 3 degrees of freedom from defining map, 5 for defining Heart Disease Code, and 3 for dropping knots, 1 for assuming performance rating is linear and gain 1 from ap. For a total reduction of 11 degrees of freedom.

dd <- datadist(dd, heart_d, map, pf_linear)

f <- cph(
  s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
  + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3) + rcs(log(ap), 5) + bm,
  x = TRUE, y = TRUE, surv = TRUE, time.inc = 5 * 12
)

print(f, html = TRUE, coefs = 3)
Cox Proportional Hazards Model

cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + 
    heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 
    3) + rcs(log(ap), 5) + bm, x = TRUE, y = TRUE, surv = TRUE, 
    time.inc = 5 * 12)

                        Model Tests     Discrimination    
                                               Indexes    
Obs        502    LR chi2    119.86     R2       0.212    
Events     354    d.f.           24    R2(24,502)0.174    
Center -2.3792    Pr(> chi2) 0.0000    R2(24,354)0.237    
                  Score chi2 126.89     Dxy      0.323    
                  Pr(> chi2) 0.0000                       

                   Coef     S.E.    Wald Z Pr(>|Z|)
rx=0.2 mg estrogen   0.0005  0.1492  0.00  0.9975  
rx=1.0 mg estrogen  -0.4136  0.1659 -2.49  0.0127  
rx=5.0 mg estrogen  -0.1044  0.1572 -0.66  0.5066  
. . .                                              

LR 119.86 with 24 degrees of freedom. AIC is \(119.86-2 \cdot 24=71.86\). Rough shrinkage estimate of this is now better at \((119.86-24)/119.86=0.80\).

anova(f, html = TRUE)
                Wald Statistics          Response: s 

 Factor          Chi-Square d.f. P     
 rx                7.92      3   0.0476
 age              13.65      3   0.0034
  Nonlinear        8.94      2   0.0115
 wt                8.23      2   0.0163
  Nonlinear        2.57      1   0.1092
 pf_linear         3.50      1   0.0615
 heart_d          24.91      1   <.0001
 map               0.05      2   0.9753
  Nonlinear        0.05      1   0.8232
 hg               12.49      3   0.0059
  Nonlinear        8.10      2   0.0174
 sg                1.70      2   0.4269
  Nonlinear        0.05      1   0.8318
 sz               12.43      2   0.0020
  Nonlinear        0.07      1   0.7888
 ap                6.46      4   0.1672
  Nonlinear        6.16      3   0.1042
 bm                0.03      1   0.8523
 TOTAL NONLINEAR  23.58     11   0.0146
 TOTAL           120.27     24   <.0001

We see that the p-value for the treatment, doses of estrogen (rx), is below \(0.05\), so it seems like it has some effect.

Checking Proportional Hazards

A Cox model should have proportional hazards, since if we look at two individuals with covariate values \(\boldsymbol{X}\) and \(\boldsymbol{X}^*\) then the ratio of their hazard rates is \[\frac{h(t \vert \boldsymbol{X})}{h(t \vert \boldsymbol{X}^*)}=\exp\Big[\sum_{k=1}^p \beta_k(X_k-X_{k^*})\Big],\] which is a constant. We can test this assumption using scaled Schoenfeld residuals separately for each predictor and test the PH assumption using the “correlation with time” test. Smoothed trends in the residuals are also plotted.

z <- predict(f, type = "terms")
f_short <- cph(s ~ z, x = TRUE, y = TRUE)
phtest <- cox.zph(f_short, transform = "identity", terms = FALSE)
phtest
             chisq df     p
rx        3.906254  1 0.048
age       1.218347  1 0.270
wt        0.016868  1 0.897
pf_linear 0.190007  1 0.663
heart_d   0.360714  1 0.548
map       0.046079  1 0.830
hg        0.018155  1 0.893
sg        0.904761  1 0.342
sz        0.218858  1 0.640
ap        0.009228  1 0.923
bm        0.000329  1 0.986
GLOBAL    7.813453 11 0.730

We see that the p-value of the test for proportional hazards is below \(0.05\) for dose of estrogen, indicating that it may not have constant relative hazard. We plot it and don’t find anything too worry-some, and the global test of proportional hazard by adjusting for the 11 tests gives a p-value of \(0.78\). So we accept our model and continue.

plot(phtest, var = "rx")

Describing Effects

We plot how each predictor is related to the log hazard of death with \(95\%\)-confidence bands.

ggplot(Predict(f),
  sepdiscrete = "vertical", nlevels = 4,
  vnames = "names"
)

Presenting the final model for inference

To present point and interval estimates of predictor effects we draw a hazard ratio chart. Since the ap relationship is so non-monotonic we use a 20:1 hazard ratio for this variable. The others use the default.

plot(summary(f, ap = c(1, 20)), log = TRUE, main = "")
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.

This plot is interpreted as the hazard ratio for each variable if we went from for example age 70 to age 76. \(95\%, 97.5\%\) and \(99\%\) confidence bands are added.

Finally we will also show a nomogram.

surv <- Survival(f)
surv3 <- function(x) surv(3 * 12, lp = x)
surv5 <- function(x) surv(5 * 12, lp = x)
quan <- Quantile(f)
med <- function(x) quan(lp = x) / 12
ss <- c(.05, .1, .2, .3, .4, .5, .6, .7, .8, .9, .95)
nom <- nomogram(f,
  ap = c(.1, .5, 1, 2, 3, 4, 5, 10, 20, 30, 40),
  fun = list(surv3, surv5, med),
  funlabel = c(
    "3-year Survival", "5-year Survival",
    "Median Survival Time (years)"
  ),
  fun.at = list(ss, ss, c(.5, 1:6))
)
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
plot(nom, cex.axis = 0.8, cex.var = 0.7, xfrac = .65, lmgp = .35)

This is interpreted by taking a value for each predictor and seeing how many points each value correspond to. Then the total amount of points can be used to find the probability of for example 5-year survival.

Model for prediction

For a prediction model one should validate the model. We start by doing that

Validating the model

We will use bootstrapping for validation. We first validate this model for Somers’ \(D_{xy}\) rank correlation between predicted log hazard and observed survival time, and for slope shrinkage. The bootstrap is used (with 1000 resamples) to penalize for possible overfitting.

v <- validate(f, B = 1000)
options(prType = "html")
v
Index Original
Sample
Training
Sample
Test
Sample
Optimism Corrected
Index
Successful
Resamples
Dxy 0.3228 0.3514 0.298 0.0533 0.2695 1000
R2 0.2125 0.2513 0.1779 0.0735 0.139 1000
Slope 1 1 0.7844 0.2156 0.7844 1000
D 0.0296 0.036 0.0242 0.0118 0.0178 1000
U -5e-04 -5e-04 0.0025 -0.003 0.0025 1000
Q 0.0301 0.0365 0.0217 0.0148 0.0152 1000
g 0.7245 0.8108 0.6345 0.1763 0.5482 1000

Here “training” refers to accuracy when evaluated on the bootstrap sample used to fit the model, and “test” refers to the accuracy when this model is applied without modification to the original sample. The apparent \(D_{xy}\) is 0.32, but a better estimate of how well the model will discriminate prognoses in the future is \(D_{xy} = 0.27\). The bootstrap estimate of slope shrinkage is \(0.79\), close to the simple heuristic estimate.

Validate the model for predicting the probability of surviving five years

We will use the non-shrinked estimates to show that this model will have some overfitting consistent with regression to the mean.

cal <- calibrate(f, B = 1000, u = 5 * 12)
Using Cox survival estimates at 60 Months
Convergence problems.... stopping addition
Convergence problems.... stopping addition
Convergence problems.... stopping addition
Convergence problems.... stopping addition
plot(cal, subtitles = FALSE)

The line nearer the ideal line corresponds to apparent predictive accuracy. The blue curve corresponds to bootstrap-corrected estimates. We clearly see some regression to the mean.

Final model for prediction

For a better model for prediction one should multiply the coefficients in the model by the shrinkage estimate.

v
Index Original
Sample
Training
Sample
Test
Sample
Optimism Corrected
Index
Successful
Resamples
Dxy 0.3228 0.3514 0.298 0.0533 0.2695 1000
R2 0.2125 0.2513 0.1779 0.0735 0.139 1000
Slope 1 1 0.7844 0.2156 0.7844 1000
D 0.0296 0.036 0.0242 0.0118 0.0178 1000
U -5e-04 -5e-04 0.0025 -0.003 0.0025 1000
Q 0.0301 0.0365 0.0217 0.0148 0.0152 1000
g 0.7245 0.8108 0.6345 0.1763 0.5482 1000
shrinkage_est <- v[3, 5] # Extract bootstrap estimate of shrinkage
f_prediction <- f
f_prediction$coefficients <- f$coefficients * shrinkage_est

print(f_prediction, html = TRUE, coef = 3)

Cox Proportional Hazards Model

cph(formula = s ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + 
    heart_d + rcs(map, 3) + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 
    3) + rcs(log(ap), 5) + bm, x = TRUE, y = TRUE, surv = TRUE, 
    time.inc = 5 * 12)
Model Tests Discrimination
Indexes
Obs 502 LR χ2 119.86 R2 0.212
Events 354 d.f. 24 R224,502 0.174
Center -2.3792 Pr(>χ2) 0.0000 R224,354 0.237
Score χ2 126.89 Dxy 0.323
Pr(>χ2) 0.0000
β S.E. Wald Z Pr(>|Z|)
rx=0.2 mg estrogen   0.0004   0.1492 0.00 0.9980
rx=1.0 mg estrogen   -0.3244   0.1659 -1.96 0.0506
rx=5.0 mg estrogen   -0.0819   0.1572 -0.52 0.6024

Approximating the full model

If the client doesn’t want to collect all these variables for prediction we can do model approximation. The goal is to explain f_predictionwith a trade off between parameters and \(R^2\). This can be done by backwards selection.

z <- predict(f_prediction) # compute linear predictor from full model
Warning in formula.character(object, env = baseenv()): Using formula(x) is deprecated when x is a character vector of length > 1.
  Consider formula(paste(x, collapse = " ")) instead.
# Force sigma to be 1 since perfect fit
a <- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + pf_linear + heart_d + rcs(map, 3)
         + rcs(hg, 4) + rcs(sg, 3) + rcs(sz, 3)
         + rcs(log(ap), 5) + bm, sigma = 1)

backwards <- fastbw(a, aics = 10000)
print(backwards, html = TRUE, digits = 3)

 Deleted   Chi-Sq d.f. P      Residual d.f. P      AIC    R2   
 map        0.07  2    0.9641   0.07    2   0.9641  -3.93 1.000
 bm         0.04  1    0.8321   0.12    3   0.9896  -5.88 0.999
 sg         2.50  2    0.2861   2.62    5   0.7582  -7.38 0.987
 pf_linear  4.88  1    0.0271   7.51    6   0.2767  -4.49 0.964
 rx        15.66  3    0.0013  23.16    9   0.0058   5.16 0.888
 wt        12.41  2    0.0020  35.57   11   0.0002  13.57 0.828
 age       16.50  3    0.0009  52.08   14   0.0000  24.08 0.748
 ap        19.61  4    0.0006  71.69   18   0.0000  35.69 0.653
 hg        36.53  3    0.0000 108.22   21   0.0000  66.22 0.475
 sz        42.83  2    0.0000 151.04   23   0.0000 105.04 0.268
 heart_d   55.27  1    0.0000 206.31   24   0.0000 158.31 0.000

Approximate Estimates after Deleting Factors

         Coef   S.E.  Wald Z P
[1,] 7.58e-11 0.0446 1.7e-09 1

Factors in Final Model

None

Now the client can choose the trade-off between predictive ability and number of predictors. We visualize this in the following plot

result <- as.data.frame(backwards$result)
total_predictors <- nrow(result)
r_squared <- result$R2
predictors <- c((total_predictors - 1):0)

ggplot(data.frame(predictors, r_squared), aes(x = predictors, y = r_squared)) +
  geom_line(color = "blue") +
  geom_point(color = "red") +
  labs(x = "Predictors", y = TeX("$R^2$")) +
  ggtitle("R-squared vs. Predictors") +
  scale_y_continuous(breaks = pretty_breaks()) +
  theme_minimal()

A reasonable approach trade off would be to remove map, bm, sg and pf_linear from our model to get an approximate model which still has high \(R^2\) of \(0.96\) for explaining the full model.

f_approx <- ols(z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d
                + rcs(hg, 4) + rcs(sz, 3) + rcs(log(ap), 5), x = TRUE)

print(f_approx, html = TRUE, coefs = 3)

Linear Regression Model

ols(formula = z ~ rx + rcs(age, 4) + rcs(wt, 3) + heart_d + rcs(hg, 
    4) + rcs(sz, 3) + rcs(log(ap), 5), x = TRUE)
Model Likelihood
Ratio Test
Discrimination
Indexes
Obs 502 LR χ2 1663.53 R2 0.964
σ 0.1247 d.f. 18 R2adj 0.962
d.f. 483 Pr(>χ2) 0.0000 g 0.713

Residuals

     Min       1Q   Median       3Q      Max 
-0.30962 -0.07092 -0.02333  0.04420  0.63137 
β S.E. t Pr(>|t|)
Intercept   3.7212  0.1976 18.83 <0.0001
rx=0.2 mg estrogen   0.0070  0.0158 0.44 0.6600
rx=1.0 mg estrogen   -0.4338  0.0159 -27.32 <0.0001

Conclusion

We devolved a Cox proportional hazards model to explain time until death (of any cause) for this dataset. We described the effects the predictors in the model had by various plots.

We did validation and calibration if one wished to use the model for prediction, to avoid issues such as regression to the mean. We then developed a simplified model which dropped 4 predictors, but could still predict the full model with high accuracy (\(R^2=0.964)\).

Throughout we used the guidelines of Regression Modelling Strategies to follow sounds statistical strategies when building and validating a model.