Cox Regression

Proportional Hazards Model

Proposed by Cox (1972, JRSS-B), primarily to model the relationship between the hazard function and covariates. The most cited paper in statistics (about 41,000 citations as of April 2016), and one of the most cited in science.

Several extensions exist for more complex data structures, e.g., clustered failure time data, recurrent event data, etc.

※ Data Structure

Observed data: $(X_i, \Delta_i)$, $i = 1, \ldots, n$, where $X_i = \min(T_i, C_i)$ and $\Delta_i = I(T_i \le C_i)$.

In addition, $Z_i$ = covariate vector (possibly time-dependent).

Cox PH Model

Semiparametric model:

$$\lambda(t \mid Z) = \lambda_0(t) \exp(\beta' Z)$$

  • $\exp(\beta' Z)$: parametric assumption on covariate effects
  • multiplicative model
  • $\beta$: $p \times 1$ vector of regression coefficients
  • $\lambda_0(t)$: nonparametric baseline hazard; infinite-dimensional
  • shape of hazard function is unspecified

Due to the nonparametric component, standard maximum likelihood theory does not apply.

Let $\beta_j$ be the $j$-th element of $\beta$.

  • $\beta_j$ = difference in log hazards per unit increase in $Z_j$

  • $e^{\beta_j}$ = ratio of hazards; assumed constant for all $t$

  • $\lambda_0(t)$: baseline hazard; common to all subjects, $\lambda_0(t) = \lambda(t \mid Z = 0)$

The hazard ratio, $e^{\beta_j}$, is sometimes referred to as a relative risk.

  • risk = probability, not a rate
  • hazard is a rate, not a probability
  • in ratio of hazards, time dimension cancels out

Direction of effect: $\beta_j > 0$ means the hazard increases with $Z_j$; $\beta_j < 0$ means it decreases.

Magnitude of effect is easy to interpret w.r.t. the hazard ratio $e^{\beta_j}$.

Cumulative hazard function: $\Lambda(t \mid Z) = \Lambda_0(t)\, e^{\beta' Z}$, where $\Lambda_0(t) = \int_0^t \lambda_0(u)\,du$.

Survival function: $S(t \mid Z) = \exp\{-\Lambda_0(t)\, e^{\beta' Z}\} = S_0(t)^{\exp(\beta' Z)}$.

By fitting a Cox model, one can readily interpret the multiplicative effect on the hazard:

  • ex) randomized trial: treatment ($Z = 1$) versus placebo ($Z = 0$); $e^{\beta} = 1.5$
  • the hazard for treated patients is 50% higher than that of the controls
  • irrespective of $t$

Nevertheless, $\lambda_0(t)$ is required in order to determine $Z$'s effect on the survival function, e.g., $S(t \mid Z = 1) = S_0(t)^{1.5}$.
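The two points above can be sketched numerically. A minimal example, assuming a hypothetical exponential baseline with rate 0.1 (the baseline, coefficient, and time points are illustrative choices, not from the notes):

```python
import math

BETA = math.log(1.5)  # hypothetical coefficient: e^beta = 1.5, as in the trial example

def s0(t):
    """Hypothetical baseline survival: exponential with rate 0.1."""
    return math.exp(-0.1 * t)

def survival(t, z):
    """Cox model survival: S(t | Z) = S0(t)^exp(beta * Z)."""
    return s0(t) ** math.exp(BETA * z)

def hazard(t, z):
    """Hazard under the exponential baseline: 0.1 * exp(beta * z)."""
    return 0.1 * math.exp(BETA * z)

# The hazard ratio is 1.5 at every t (proportional hazards), but computing
# S(t | Z = 1) = S0(t)^1.5 requires knowing the baseline.
for t in (1.0, 5.0, 10.0):
    assert abs(hazard(t, 1) / hazard(t, 0) - 1.5) < 1e-12
print(survival(5.0, 0), survival(5.0, 1))
```

The loop shows that the hazard ratio does not depend on $t$, while the last line shows survival probabilities that do depend on the (here assumed) baseline.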

Cox Model: Independent Censoring

The independent censoring assumption is less stringent than in nonparametric estimation.

The assumption is often written as: $T \perp C \mid Z$.

※ Note: $C$ is allowed to depend on $Z$

Semiparametric PH Model: General

  • General expression for the multiplicative proportional hazards model:

$$\lambda(t \mid Z) = \lambda_0(t)\, g(\beta' Z)$$

$g(\cdot)$ is a specified link function, with $g(\cdot) \ge 0$ and $g(0) = 1$; in the special case $g(x) = e^x$, we recover the Cox model.

  • Other choices for the link function are possible (e.g., Self & Prentice, 1983).

※ Notes:

  • not all choices of $g$ lead to a clear interpretation of $\beta$
  • certain choices of $g$ lead to numerical issues; e.g., the likelihood is flat, local maxima, etc.
  • the general link $g$ has received little attention in the literature

Multiplicative Model

The Cox model is a multiplicative model, i.e., covariates are assumed to affect survival by multiplying the baseline hazard.

  • Additive models have also been proposed

Proportional Hazards Regression and Multiplicative Intensity Model

  • Recall the counting process / martingale representation:
  1. intensity $l(t)$; the integrated form is the cumulative intensity $L(t) = \int_0^t l(u)\,du$.
  • Multiplicative Intensity Model: $l(t) = Y(t)\, \lambda_0(t)\, e^{\beta' Z(t)}$
  • Counting process: $N(t)$ = number of events of a specified type that have occurred by time $t$

    • $N(t)$ may take more than one jump
    • multiple infections, repeated breakdowns, hospital admissions
  • At-risk process: $Y(t)$ = left-continuous process; $Y(t) = 1$ if failure can be observed at time $t$, otherwise $Y(t) = 0$.

    • $Y(t)$ can be used to represent situations in which a subject enters and exits risk sets several times
    • $Y(t)$ may be 1 even after an observed failure
  • Covariate process: $Z(t)$ = (bounded) predictable process

    • time-dependent treatment, risk factors
    • model checking and relaxing PH assumption
  • Baseline hazard function: $\lambda_0(t)$ = an arbitrary deterministic function

  • Filtration: $\mathcal F_t = \sigma\{N(u), Y(u), Z(u) : 0 \le u \le t\}$

  • Martingale: $M(t) = N(t) - \int_0^t l(u)\,du$

  • Intensity function: $E\{dN(t) \mid \mathcal F_{t-}\} = l(t)\,dt$

  • Data: $n$ independent observations on $(N_i(\cdot), Y_i(\cdot), Z_i(\cdot))$, $i = 1, \ldots, n$
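To make the observed-data structure concrete, here is a minimal simulation of $(X_i, \Delta_i, Z_i)$ triples from a Cox model with a constant baseline hazard (all rates and the coefficient are hypothetical choices for illustration):

```python
import math
import random

# Hypothetical setup: baseline hazard 0.2 (constant), beta = 0.7,
# independent censoring C ~ Exp(0.1).
LAMBDA0, BETA, CENS_RATE = 0.2, 0.7, 0.1

def simulate(n, seed=1):
    """Return (X_i, Delta_i, Z_i) triples: T ~ Exp(lambda0 * exp(beta * Z)),
    C ~ Exp(cens_rate), X = min(T, C), Delta = I(T <= C)."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        z = rng.randint(0, 1)                               # binary covariate
        t = rng.expovariate(LAMBDA0 * math.exp(BETA * z))   # event time
        c = rng.expovariate(CENS_RATE)                      # censoring time
        data.append((min(t, c), int(t <= c), z))            # (X, Delta, Z)
    return data

data = simulate(2000)
# Subjects with Z = 1 have a higher hazard, so their observed times are
# stochastically shorter than those with Z = 0.
```

The censoring mechanism here depends on neither $T$ nor $Z$, so independent censoring holds by construction.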

Likelihood; conditional, marginal and partial likelihoods

  • $X$: vector of observations; density $f_X(x; \theta)$

  • $\theta = (\varphi, \eta)$: vector parameter

  • $\varphi$: parameter of interest; $\eta$: nuisance parameter

  • likelihood: $L(\theta) = f_X(x; \theta)$

    • $\eta$ may be infinite-dimensional
    • if, writing $X = (V, W)$, the conditional density of $V$ given $W$ does not involve $\eta$: use it (conditional likelihood)
    • if the marginal density of $W$ does not involve $\eta$: use it (marginal likelihood)

If a factor of the full likelihood is free of $\eta$: use it (partial likelihood)

Partial & Marginal Likelihoods

Focus on the Proportional Hazards Model: $\lambda(t \mid Z) = \lambda_0(t)\, e^{\beta' Z}$, with $n$ independent triplets $(X_i, \Delta_i, Z_i)$.

Above, $\lambda_0(\cdot)$ is unspecified.

  • Partial Likelihood: assume no ties (absolutely continuous failure distribution)

Suppose there are $L$ observed failures, at times $\tau_1 < \tau_2 < \cdots < \tau_L$ (set $\tau_0 = 0$ and $\tau_{L+1} = \infty$).


Let $(i)$ be the label of the individual failing at $\tau_i$, $i = 1, \ldots, L$.

Covariates for failures: $Z_{(1)}, \ldots, Z_{(L)}$. (Hereafter, condition on the covariates.)

Censoring times in $[\tau_i, \tau_{i+1})$: recorded together with the covariates of the censored items, i.e., each label identifies an item censored in that interval.


The data can be divided into sets describing, for each $i$: (a) which individual in the risk set fails at $\tau_i$, and (b) the censoring times and labels in $[\tau_i, \tau_{i+1})$.


GOAL: Build a likelihood on a subset of the full data set

  • carrying most of the information about $\beta$
  • carrying no information on the nuisance parameter $\lambda_0(\cdot)$

PROPOSAL: Generate the likelihood of the failure labels $(1), \ldots, (L)$, conditional on the risk sets and failure times.

JUSTIFICATION (WHY?):

  • The timing of events can be explained by the nuisance parameter $\lambda_0(\cdot)$.
  • Censoring times and labels can be ignored if we assume non-informative censorship (independent censoring).

So this is a partial likelihood in the sense that it is only part of the likelihood of the observed data.

If censoring is independent and there are no ties, the $i$-th partial likelihood factor is the conditional probability that individual $(i)$ fails at $\tau_i$, given the risk set at $\tau_i$ and given that one event occurs at $\tau_i$.

Denote $R(\tau_i)$ as the risk set at $\tau_i$. Then, by the assumption of independent censoring,

$$P\{(i) \text{ fails at } \tau_i \mid R(\tau_i), \text{ one failure at } \tau_i\} = \frac{e^{\beta' Z_{(i)}}}{\sum_{j \in R(\tau_i)} e^{\beta' Z_j}}$$

Thus, the Partial Likelihood is

$$L(\beta) = \prod_{i=1}^{L} \frac{e^{\beta' Z_{(i)}}}{\sum_{j \in R(\tau_i)} e^{\beta' Z_j}}$$

Note: with $\lambda_0(\cdot)$ unspecified and censoring noninformative, the discarded factors contain little or no information about $\beta$.

  • Counting process notation: $L(\beta) = \prod_{i=1}^{n} \prod_{t \ge 0} \left[ \frac{Y_i(t)\, e^{\beta' Z_i(t)}}{\sum_j Y_j(t)\, e^{\beta' Z_j(t)}} \right]^{dN_i(t)}$
  • Maximum partial likelihood estimator (MPLE): $\hat\beta$ (computed using the Newton-Raphson (NR) algorithm)

    • Specifically, the log partial likelihood is
$$l(\beta) = \sum_{i=1}^{n} \int_0^\infty \left[ \beta' Z_i(t) - \log \left\{ \sum_j Y_j(t)\, e^{\beta' Z_j(t)} \right\} \right] dN_i(t)$$
    • The score vector, $U(\beta)$, can be obtained by differentiating $l(\beta)$ w.r.t. $\beta$:
$$U(\beta) = \sum_{i=1}^{n} \int_0^\infty \left\{ Z_i(t) - \bar Z(\beta, t) \right\} dN_i(t)$$
    • where $\bar Z(\beta, t) = \dfrac{\sum_j Y_j(t)\, e^{\beta' Z_j(t)} Z_j(t)}{\sum_j Y_j(t)\, e^{\beta' Z_j(t)}}$ is a weighted mean of $Z$ over those observations still at risk at time $t$.

    • The information matrix, $I(\beta) = -\partial^2 l(\beta) / \partial\beta\, \partial\beta'$, is
$$I(\beta) = \sum_{i=1}^{n} \int_0^\infty V(\beta, t)\, dN_i(t)$$
    • where $V(\beta, t) = \dfrac{\sum_j Y_j(t)\, e^{\beta' Z_j(t)} \{Z_j(t) - \bar Z(\beta, t)\}\{Z_j(t) - \bar Z(\beta, t)\}'}{\sum_j Y_j(t)\, e^{\beta' Z_j(t)}}$ is the weighted variance of $Z$ at time $t$.

Then, the MPLE, $\hat\beta$, is found by solving the partial likelihood equation: $U(\beta) = 0$.

Under some regularity conditions, $\hat\beta$ is consistent and asymptotically normally distributed with mean $\beta$ and variance $I(\beta)^{-1}$ (will be shown later).

The NR algorithm solves the partial likelihood equation by computing $\beta^{(k+1)} = \beta^{(k)} + I(\beta^{(k)})^{-1} U(\beta^{(k)})$ iteratively until convergence (requires an initial value $\beta^{(0)}$).

※ Note:

  1. (incredibly) robust algorithm!
  2. $\beta^{(0)} = 0$ usually works.

Cox Proportional Hazards Model

Cox model: $\lambda(t \mid Z) = \lambda_0(t)\, e^{\beta' Z}$

※ Note:

  • $e^{\beta'(Z_1 - Z_0)}$ is the relative risk of the hazard of death comparing covariate values $Z_1$ to $Z_0$

Interpreting Cox Model Coefficients: $\beta_j$ is the log RR (hazard ratio) for a unit change in $Z_j$, given all other covariates remain constant.

The RR comparing two sets of values for the covariates, $Z^*$ vs. $Z$:

$$\text{RR} = \frac{\lambda(t \mid Z^*)}{\lambda(t \mid Z)} = e^{\beta'(Z^* - Z)}$$
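This RR is a one-line computation. A sketch with hypothetical coefficients for two covariates (the names and values are illustrative, not from the notes):

```python
import math

def relative_risk(beta, z_star, z):
    """RR = exp(beta' (z* - z)) under the Cox model."""
    return math.exp(sum(b * (a - c) for b, a, c in zip(beta, z_star, z)))

# Hypothetical coefficients for (age, treatment):
beta = [0.03, -0.5]
# RR for a treated 65-year-old vs. an untreated 60-year-old:
rr = relative_risk(beta, [65, 1], [60, 0])  # exp(0.03 * 5 - 0.5)
```

A unit change in a single covariate reduces to $e^{\beta_j}$, matching the coefficient interpretation above.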


Comparison of Nested Models

  • Full model: $\lambda(t \mid Z) = \lambda_0(t) \exp(\beta_1' Z_1 + \beta_2' Z_2)$

To test: $H_0: \beta_2 = 0$

  • Reduced model: $\lambda(t \mid Z) = \lambda_0(t) \exp(\beta_1' Z_1)$

Use the partial likelihood ratio statistic, $X^2_{Cox} = 2\{l(\hat\beta_{\text{full}}) - l(\hat\beta_{\text{reduced}})\}$.

Under $H_0$ (reduced model), and when $n$ is large:

$$X^2_{Cox} \sim \chi^2_{k-p}, \qquad k - p = \text{number of parameters set to 0 by } H_0$$
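A small numeric sketch of the test with hypothetical log partial likelihood values and one parameter under test (for df = 1 the chi-square tail probability follows from the standard-normal identity $\chi^2_1 = Z^2$, computable with the stdlib):

```python
import math

def lrt_pvalue_df1(x2):
    """P(chi^2_1 > x2), using chi^2_1 = Z^2 so the tail is erfc(sqrt(x2/2))."""
    return math.erfc(math.sqrt(x2 / 2.0))

# Hypothetical log partial likelihoods from nested fits (full model has one
# extra parameter):
ll_full, ll_reduced = -180.2, -183.4
x2 = 2 * (ll_full - ll_reduced)   # partial likelihood ratio statistic
p = lrt_pvalue_df1(x2)            # compare to the chi^2_1 reference
```

Here $X^2 = 6.4$, so $H_0$ would be rejected at the usual 5% level; for general df one would use a chi-square survival function (e.g., `scipy.stats.chi2.sf`).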


Stratification

Two Ways to Stratify. Suppose a confounder has 3 levels on which we would like to stratify when comparing the two treatment groups. How?


  • Which Way to Stratify?
  1. Under the dummy variable stratification model, the proportional stratum-to-stratum hazards assumption may not be correct. If not, the confounder may be inadequately controlled.
  2. The proportionality assumption can be checked using time-dependent covariates.
  3. True stratification is a more thorough adjustment, as long as observations within each level are homogeneous. If the confounder can be measured continuously and the strata were formed by grouping values of it, better control might be achieved with continuous (possibly time-dependent) covariate adjustment.
  4. If the confounder is controlled using true stratification, there is no way to estimate one summary relative risk comparing two levels of the exposure. However, we can estimate the effect within each stratum, and thus estimate a RR function.
  5. True stratification generally requires more data to obtain the same precision in coefficient estimates.



Test statistics

The standard asymptotic likelihood inference tests (Wald, score, and likelihood ratio (LR)) can still be applied to the Cox partial likelihood.


Their finite-sample properties may differ; in general, the LRT is the most reliable and the Wald test the least.


When $p = 1$ and the single covariate is categorical, the score test is identical to the log-rank test.


Handling ties

Real data sets often contain tied event times.

  • When do we have ties?
  1. Continuous event times are grouped into intervals.
  2. Event time scale is discrete.

Four commonly used ways of handling ties: 1) Breslow approximation, 2) Efron approximation, 3) Exact partial likelihood, and 4) Averaged likelihood.

When the underlying time is continuous but ties are generated due to grouping, the contribution to the partial likelihood of the tied events at a given time depends on the unobserved order of failures, and must be approximated.

Two commonly used methods are

  1. Breslow approximation
  2. Efron approximation

Example: Assume 5 subjects are at risk of dying at time $\tau$ and two die at the same time (because of the grouping of time). If the time data had been more precise, the first two terms in the likelihood would be either

$$\frac{r_1}{r_1 + r_2 + r_3 + r_4 + r_5} \cdot \frac{r_2}{r_2 + r_3 + r_4 + r_5} \quad \text{or} \quad \frac{r_2}{r_1 + r_2 + r_3 + r_4 + r_5} \cdot \frac{r_1}{r_1 + r_3 + r_4 + r_5}$$

where $r_j = e^{\beta' Z_j}$.
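The two approximations can be compared numerically on this example (the risk scores $r_j$ below are hypothetical values chosen for illustration):

```python
import math

def breslow_terms(r, tied, risk):
    """Breslow: reuse the full risk-set denominator for every tied event."""
    denom = sum(r[j] for j in risk)
    prod = 1.0
    for j in tied:
        prod *= r[j] / denom
    return prod

def efron_terms(r, tied, risk):
    """Efron: step the denominator down by the average tied contribution."""
    denom = sum(r[j] for j in risk)
    tied_sum = sum(r[j] for j in tied)
    d = len(tied)
    prod = 1.0
    for k, j in enumerate(tied):
        prod *= r[j] / (denom - (k / d) * tied_sum)
    return prod

# Hypothetical risk scores r_j = exp(beta' Z_j) for the 5 subjects at risk;
# subjects 0 and 1 die at the same (grouped) time.
r = [1.2, 0.8, 1.0, 1.5, 0.5]
tied, risk = [0, 1], [0, 1, 2, 3, 4]

# Exact average over the two possible orderings of the tied deaths:
S = sum(r)
exact = 0.5 * (r[0] / S * r[1] / (S - r[0]) + r[1] / S * r[0] / (S - r[1]))
```

On these numbers the Efron product is much closer to the exact ordering-averaged term than the Breslow product, which is why Efron is usually the preferred default when ties are common.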

30.png