What are joint models for longitudinal and survival data?
In this post we will introduce in layman's terms the framework of joint models for longitudinal and time-to-event data. These models are applied in settings where the sample units are followed-up in time, for example, we may be interest in patients suffering from a specific disease who are followed-up in time to monitor their progress. In this context typically different types of outcomes are collected for the sample units at hand. In general, these can be categorized as longitudinal outcomes, i.e., the same outcome repeatedly measured in time for the same sample unit, and event-time outcomes, i.e., the time until a specific event of interest occurs for the sample units. Often our research questions require analyzing each of these outcomes separately. For instance, does a new treatment have a beneficial effect in the survival of the patients or is there any evidence for a difference in the average longitudinal evolutions between males and females. However, there are also research questions of more complex nature for which a joint modeling approach of these outcomes is required. Joint models for longitudinal and survival data constitute an attractive paradigm for the analysis of such data, and they are mainly applicable in two settings: First, when focus is on a survival outcome
and we wish to account for the effect of endogenous time-varying covariates measured with error, and second, when focus is on the
longitudinal outcome and we wish to correct for non-random dropout.
What are endogenous time-varying covariates (in the setting described above)?
A
time-varying covariate is termed endogenous if its value at any time point t is can be affected by an event occurring
at an earlier time point s < t, and is exogenous if its value at t is not affected by an earlier event.
Examples
of exogenous covariates are the nurse that took the blood sample or performed
the echo, the period of the year (e.g., winter versus summer) and environmental
factors (e.g., pollution levels). On the other hand all covariates measured on
the patient (e.g., biomarkers) are endogenous. To make this distinction
more clear let’s consider two time-varying covariates in the context of a study on asthma, namely, a biomarker for asthma that has been measured
during follow up for the patients in the study, and the air pollution levels in the neighborhood where each patients live. Suppose
that a particular patient has an asthma attack after s = 5 months from the start of the study. It is directly evident that at a future time point, say t = 5.2 months, the level of the biomarker will be affected from the fact that this patient had an asthma attack, whereas air pollution levels at the same time point t = 5.2 months will not be affected by
this attack.
What is non-random dropout (in the setting described above)?
A common and important problem in the analysis of longitudinal outcomes is missing data. Namely, if though measurements are planned at specific time points, often and for a variety of reasons the subjects under study do not adhere to these scheduled visit times or they even completely dropout from the study. When the reasons (or more accurately the probability) for dropping out depends on unobserved longitudinal responses, the process that generates these missing data cannot be ignored, even if the main focus is on the longitudinal outcome. In this case the dropout process is termed non-random.
How joint models work?
The intuitive idea behind joint models for longitudinal and survival outcomes is given in the following figure:
In the top panel the hazard process is depicted, which describes how the instantaneous risk (hazard function) of an event changes in time; for example, the hazard of patient having a relapse. In the bottom panel the asterisk denote the observed longitudinal responses and the green line the underlying longitudinal process; for example, the values of a biomarker for the disease under study. Joint models postulate a relative risk (proportional hazards) model for the event time outcome, which is directly associated with the longitudinal process denoted by the green line. This green line is recovered from the observed data (asterisks) using a mixed effects model. This model contains fixed effects, describing the average longitudinal evolution in time, and random effects that describe how each patient deviates from this average evolution. In their basic form joint models assume that the hazard function at any particular time point t, denoted by the vertical dashed line, is associated with the value of the longitudinal process (green line) at the same time point. The blue line represents the assumption behind the time-dependent Cox model, which posits that the value of the longitudinal outcome remain constant in between the observation times. Estimation of the model is based on the joint distribution of the two outcomes and can be done either under maximum likelihood or under a Bayesian approach. The framework of joint models can be used to account for both endogenous time-varying covariates and non-random dropout.
Fitting joint models using package JMbayes
There are several packages available on CRAN for fitting this type of joint models, namely JM, JMbayes, joineR and lcmm among others. In this post we will show how joint models can be fitted using using package JMbayes that fits joint models under the Bayesian paradigm. For this illustration we will be using the Primary Biliary Cirrhosis (PBC) data set (available in the package and also in the survival package) collected by the Mayo Clinic from 1974 to 1984. For our analysis we will consider 312 patients who have been randomized to D-penicillamine and placebo. During follow-up several biomarkers associated with PBC have been collected for these patients. Here we focus on serum bilirubin levels, which is considered one of the most important ones associated with disease progression. In package JMbayes the PBC data are available in the data frames pbc2 and pbc2.id containing the longitudinal and survival information, respectively (i.e., the former is in the long format while the latter contains a single row per patient).
We start by loading the JMbayes and lattice packages and defining the indicator status2 for the composite event, namely transplantation or death:
library("JMbayes")
library("lattice")
pbc2$status2 <- as.numeric(pbc2$status != "alive")
pbc2.id$status2 <- as.numeric(pbc2.id$status != "alive")
The design followed by package
JMbayes requires to first separately fit a linear mixed model for the longitudinal outcome and a Cox model for the survival one. The aim of the linear mixed model is to describe/recover the subject-specific longitudinal trajectories. Close inspection of the shapes of the log serum bilirubin profiles indicates that for some individuals these seem to be nonlinear.
set.seed(1)
## we take a sample of patients with more than six
## measurements
long_ids <- names(which(table(pbc2$id) > 6))
ids <- sample(long_ids, 16)
xyplot(log(serBilir) ~ year | id, data = pbc2,
subset = id %in% ids, type = c("p", "smooth"),
lwd = 2, layout = c(4, 4))
Hence, to allow for flexibility in the specification of these profiles we include natural cubic splines in both the fixed- and random-effects parts of the mixed model. This model can be fitted using the following calls to functions
lme() and
ns() (the latter from package
splines):
lmeFit <- lme(log(serBilir) ~ ns(year, 2), data = pbc2,
random = ~ ns(year, 2) | id)
Analogously, in the Cox model we control for treatment and age, and also allow for their interaction:
coxFit <- coxph(Surv(years, status2) ~ drug * age,
data = pbc2.id, x = TRUE)
In the call to coxph() argument x is set to TRUE such that the design matrix is also included in the resulting model object. Using as main arguments the lmeFit and coxFit objects, the corresponding joint model is fitted using the code:
jointFit <- jointModelBayes(lmeFit, coxFit, timeVar = "year")
summary(jointFit)
Call:
jointModelBayes(lmeObject = lmeFit.pbc1, survObject = coxFit.pbc1,
timeVar = "year")
Data Descriptives:
Longitudinal Process Event Process
Number of Observations: 1945 Number of Events: 169 (54.2%)
Number of subjects: 312
Joint Model Summary:
Longitudinal Process: Linear mixed-effects model
Event Process: Relative risk model with penalized-spline-approximated
baseline risk function
Parameterization: Time-dependent value
LPML DIC pD
-3169.172 6116.463 939.5384
Variance Components:
StdDev Corr
(Intercept) 1.0097 (Intr) n(,2)1
ns(year, 2)1 2.3555 0.3640
ns(year, 2)2 2.2430 0.3609 0.5701
Residual 0.3023
Coefficients:
Longitudinal Process
Value Std.Err Std.Dev 2.5% 97.5% P
(Intercept) 0.4918 0.0212 0.0699 0.3676 0.6316 <0.001
ns(year, 2)1 2.4117 0.1032 0.2337 1.9423 2.7918 <0.001
ns(year, 2)2 2.3611 0.0895 0.2906 1.7765 2.8609 <0.001
Event Process
Value Std.Err Std.Dev 2.5% 97.5% P
drugD-penicil -0.9648 0.1272 0.7543 -2.4530 0.4693 0.199
age 0.0370 0.0018 0.0107 0.0135 0.0571 <0.001
drugD-penicil:age 0.0179 0.0025 0.0145 -0.0096 0.0470 0.211
Assoct 1.4181 0.0062 0.0953 1.2367 1.6041 <0.001
Bs.gammas1 -6.6205 0.1009 0.5992 -7.7762 -5.3132 <0.001
Bs.gammas2 -6.5832 0.1031 0.5962 -7.7639 -5.2468 <0.001
...
MCMC summary:
iterations: 20000
adapt: 3000
burn-in: 3000
thinning: 10
time: 2.6 min
Argument timeVar is a character string that specifies the name of the time variable in the mixed model (the scale of time (e.g., days, months, years) must be the same in both the mixed and Cox models). The default call to jointModelBayes() includes in the linear predictor of the relative risk model the subject-specific linear predictor of the mixed model, which in this case represents the average patient-specific log serum bilirubin level. The output of the summary() method is rather self-explanatory and contains model summary statistics, namely LPML (the log pseudo marginal likelihood value), DIC (deviance information criterion), and pD (the effective number of parameters component of DIC), posterior means for all parameters, and standard errors (effective sample size estimated using time series methodology), standard deviations, 95% credibility intervals and tail probabilities for all regression coefficients in the two submodels. The association parameter, denoted in the output as Assoct, is the parameter that measures how strongly associated is the longitudinal outcome at any particular time point t with the hazard of an event at the same time point. The results suggest that serum bilirubin is strongly related with the risk for the composite event, with a doubling of serum bilirubin levels, resulting in a 2.7-fold (95% CI: 2.3; 3.1) increase of the risk.