Wednesday, December 17, 2014

Joint Models with Categorical Longitudinal Outcomes

Extending the Standard Joint Model

In a previous post we have briefly introduced the framework of joint models for longitudinal and time-to-event data, and we have shown how models of this type can be fitted in R using package JMbayes. In that post we have presented what can be called the 'standard' joint model, in which we have one continuous longitudinal outcome and one survival endpoint. Starting from this standard joint model different extensions have been proposed in the literature to accommodate different types of longitudinal and survival data. The aim of this post is to present one of these extensions, namely joint models with categorical longitudinal responses, and how such models can be fitted using JMbayes.

Generalized Linear Mixed Effects Models

Before discussing joint models with categorical longitudinal outcomes we need first to introduce the framework of Generalized Linear Mixed Models (GLMMs). GLMMs are an extension of linear mixed models to allow response variables from different distributions, such as binary responses. Alternatively, you could think of GLMMs as an extension of generalized linear models (e.g., logistic regression) to include both fixed and random effects (hence mixed models). Contrary to linear mixed models, fitting GLMMs under maximum likelihood is computationally a much more challenging task. This is due to the fact that in the definition of the observed data log-likelihood there is an integral over the (normally distributed) random effects, which does not have, in general, a closed-form solution (for linear mixed models this integral does have a closed-form solution which facilitates computations). Under the Bayesian approach there is essentially no difference between linear mixed models and GLMMs because in both the random effects are considered parameters for which we wish to derive their posterior distribution. In R there are several packages that can fit GLMMs under either maximum likelihood or the Bayesian approach, such as lme4 and MCMCglmm, among others. However, in order to fit joint models with categorical longitudinal response using package JMbayes the user needs first to fit the GLMMs using function glmmPQL() from package MASS. Even though the PQL algorithm has been shown not to work that optimally, especially for dichotomous longitudinal outcomes, the purpose of fitting the GLMMs with glmmPQL() is to extract from the resulting model object  the response vector and the design matrices for the fixed and random effects, and initial values for the parameters. The following code illustrates the use of glmmPQL() for fitting a mixed effects logistic regression using the Primary Biliary Cirrhosis (PBC) data set. Since in the PBC data there was no binary biomarker recorded during follow-up, we artificially create one by dichotomizing serum bilirubin at the threshold value of 1.8 mg/dL.



The syntax of glmmPQL() is similar to the one of function lme() from package nlme (the PQL algorithm actually requires fitting linear mixed effects model for a transformed response variable).

A Joint Model with a Binary Longitudinal Biomarker

To fit a joint model with a binary longitudinal outcome we need to appropriately define argument densLong of the jointModelBayes() function. This argument accepts a function that calculates the probability density function (and its natural logarithm) of the longitudinal outcome. The arguments of this function are y the vector of longitudinal responses, eta.y the subject-specific linear predictor (i.e., including both the fixed and random effects), scale a potential scale parameter (e.g., the measurement error standard deviation in the case of linear mixed models), log a logical argument denoting whether the pdf or the logarithm of the pdf is returned by the function, and data a data.frame that contains variables potentially used in the definition of densLong. For our example, the function we need to define is:



Following we fit a Cox model that represents the survival submodel of the joint model and we put all the above as arguments to jointModelBayes(), i.e.,




In the output the association parameter Assoct denotes the log hazard ratio, at any time point t, for a unit increase in the log odds of having serum bilirubin above the threshold value of 1.8 mg/dL, at the same time point.

Thursday, September 4, 2014

Joint Models for Longitudinal and Survival Data

What are joint models for longitudinal and survival data?

In this post we will introduce in layman's terms the framework of joint models for longitudinal and time-to-event data. These models are applied in settings where the sample units are followed-up in time, for example, we may be interest in patients suffering from a specific disease who are followed-up in time to monitor their progress. In this context typically different types of outcomes are collected for the sample units at hand. In general, these can be categorized as longitudinal outcomes, i.e., the same outcome repeatedly measured in time for the same sample unit, and event-time outcomes, i.e., the time until a specific event of interest occurs for the sample units. Often our research questions require analyzing each of these outcomes separately. For instance, does a new treatment have a beneficial effect in the survival of the patients or is there any evidence for a difference in the average longitudinal evolutions between males and females. However, there are also research questions of more complex nature for which a joint modeling approach of these outcomes is required. Joint models for longitudinal and survival data

What are endogenous time-varying covariates (in the setting described above)?

A time-varying covariate is termed endogenous if its value at any time point t is can be affected by an event occurring at an earlier time point s < t, and is exogenous if its value at t is not affected by an earlier event.

Examples of exogenous covariates are the nurse that took the blood sample or performed the echo, the period of the year (e.g., winter versus summer) and environmental factors (e.g., pollution levels). On the other hand all covariates measured on the patient (e.g., biomarkers) are endogenous. To make this distinction more clear let’s consider two time-varying covariates in the context of  a study on asthma, namely, a biomarker for asthma that has been measured during follow up for the patients in the study, and the air pollution levels in the neighborhood where each patients live. Suppose that a particular patient has an asthma attack after s = 5 months from the start of the study. It is directly evident that at a future time point, say t = 5.2 months, the level of the biomarker will be affected from the fact that this patient had an asthma attack, whereas air pollution levels at the same time point t = 5.2 months will not be affected by this attack.

What is non-random dropout (in the setting described above)?

A common and important problem in the analysis of longitudinal outcomes is missing data. Namely, if though measurements are planned at specific time points, often and for a variety of reasons the subjects under study do not adhere to these scheduled visit times or they even completely dropout from the study. When the reasons (or more accurately the probability) for dropping out depends on unobserved longitudinal responses, the process that generates these missing data cannot be ignored, even if the main focus is on the longitudinal outcome. In this case the dropout process is termed non-random.

How joint models work?

The intuitive idea behind joint models for longitudinal and survival outcomes is given in the following figure:

In the top panel the hazard process is depicted, which describes how the instantaneous risk (hazard function) of an event changes in time; for example, the hazard of patient having a relapse. In the bottom panel the asterisk denote the observed longitudinal responses and the green line the underlying longitudinal process; for example, the values of a biomarker for the disease under study. Joint models postulate a relative risk (proportional hazards) model for the event time outcome, which is directly associated with the longitudinal process denoted by the green line. This green line is recovered from the observed data (asterisks) using a mixed effects model. This model contains fixed effects, describing the average longitudinal evolution in time, and random effects that describe how each patient deviates from this average evolution. In their basic form joint models assume that the hazard function at any particular time point t, denoted by the vertical dashed line, is associated with the value of the longitudinal process (green line) at the same time point. The blue line represents the assumption behind the time-dependent Cox model, which posits that the value of the longitudinal outcome remain constant in between the observation times. Estimation of the model is based on the joint distribution of the two outcomes and can be done either under maximum likelihood or under a Bayesian approach. The framework of joint models can be used to account for both endogenous time-varying covariates and non-random dropout.

Fitting joint models using package JMbayes

There are several packages available on CRAN for fitting this type of joint models, namely JM, JMbayes, joineR and lcmm among others. In this post we will show how joint models can be fitted using using package JMbayes that fits joint models under the Bayesian paradigm. For this illustration we will be using the Primary Biliary Cirrhosis (PBC) data set (available in the package and also in the survival package) collected by the Mayo Clinic from 1974 to 1984. For our analysis we will consider 312 patients who have been randomized to D-penicillamine and placebo. During follow-up several biomarkers associated with PBC have been collected for these patients. Here we focus on serum bilirubin levels, which is considered one of the most important ones associated with disease progression. In package JMbayes the PBC data are available in the data frames pbc2 and pbc2.id containing the longitudinal and survival information, respectively (i.e., the former is in the long format while the latter contains a single row per patient).

We start by loading the JMbayes and lattice packages and defining the indicator status2 for the composite event, namely transplantation or death:

library("JMbayes")
library("lattice")
pbc2$status2 <- as.numeric(pbc2$status != "alive")
pbc2.id$status2 <- as.numeric(pbc2.id$status != "alive")



The design followed by package JMbayes requires to first separately fit a linear mixed model for the longitudinal outcome and a Cox model for the survival one. The aim of the linear mixed model is to describe/recover the subject-specific longitudinal trajectories. Close inspection of the shapes of the log serum bilirubin profiles indicates that for some individuals these seem to be nonlinear.

set.seed(1)
## we take a sample of patients with more than six 
## measurements
long_ids <- names(which(table(pbc2$id) > 6))
ids <- sample(long_ids, 16)
xyplot(log(serBilir) ~ year | id, data = pbc2, 

       subset = id %in% ids, type = c("p", "smooth"), 
       lwd = 2, layout = c(4, 4))



Hence, to allow for flexibility in the specification of these profiles we include natural cubic splines in both the fixed- and random-effects parts of the mixed model. This model can be fitted using the following calls to functions lme() and ns() (the latter from package splines): 

lmeFit <- lme(log(serBilir) ~ ns(year, 2), data = pbc2,
              random = ~ ns(year, 2) | id) 

Analogously, in the Cox model we control for treatment and age, and also allow for their interaction:

coxFit <- coxph(Surv(years, status2) ~ drug * age, 
                data = pbc2.id, x = TRUE)

In the call to coxph() argument x is set to TRUE such that the design matrix is also included in the resulting model object. Using as main arguments the lmeFit and coxFit objects, the corresponding joint model is fitted using the code:

jointFit <- jointModelBayes(lmeFit, coxFit, timeVar = "year")

summary(jointFit)



Call:
jointModelBayes(lmeObject = lmeFit.pbc1, survObject = coxFit.pbc1, 
    timeVar = "year")

Data Descriptives:
Longitudinal Process             Event Process
Number of Observations: 1945     Number of Events: 169 (54.2%)
Number of subjects: 312

Joint Model Summary:
Longitudinal Process: Linear mixed-effects model
Event Process: Relative risk model with penalized-spline-approximated 
  baseline risk function
Parameterization: Time-dependent value 

      LPML      DIC       pD
 -3169.172 6116.463 939.5384

Variance Components:
              StdDev    Corr        
(Intercept)   1.0097  (Intr)  n(,2)1
ns(year, 2)1  2.3555  0.3640        
ns(year, 2)2  2.2430  0.3609  0.5701
Residual      0.3023                

Coefficients:
Longitudinal Process
              Value Std.Err Std.Dev   2.5%  97.5%      P
(Intercept)  0.4918  0.0212  0.0699 0.3676 0.6316 <0.001
ns(year, 2)1 2.4117  0.1032  0.2337 1.9423 2.7918 <0.001
ns(year, 2)2 2.3611  0.0895  0.2906 1.7765 2.8609 <0.001

Event Process
                     Value Std.Err  Std.Dev    2.5%    97.5%      P
drugD-penicil      -0.9648  0.1272   0.7543 -2.4530   0.4693  0.199
age                 0.0370  0.0018   0.0107  0.0135   0.0571 <0.001
drugD-penicil:age   0.0179  0.0025   0.0145 -0.0096   0.0470  0.211
Assoct              1.4181  0.0062   0.0953  1.2367   1.6041 <0.001
Bs.gammas1         -6.6205  0.1009   0.5992 -7.7762  -5.3132 <0.001
Bs.gammas2         -6.5832  0.1031   0.5962 -7.7639  -5.2468 <0.001
... 

MCMC summary:
iterations: 20000 
adapt: 3000 
burn-in: 3000 
thinning: 10 
time: 2.6 min
 
Argument timeVar is a character string that specifies the name of the time variable in the mixed model (the scale of time (e.g., days, months, years) must be the same in both the mixed and Cox models). The default call to jointModelBayes() includes in the linear predictor of the relative risk model the subject-specific linear predictor of the mixed model, which in this case represents the average patient-specific log serum bilirubin level. The output of the summary() method is rather self-explanatory and contains model summary statistics, namely LPML (the log pseudo marginal likelihood value), DIC (deviance information criterion), and pD (the effective number of parameters component of DIC), posterior means for all parameters, and standard errors (effective sample size estimated using time series methodology), standard deviations, 95% credibility intervals and tail probabilities for all regression coefficients in the two submodels. The association parameter, denoted in the output as Assoct, is the parameter that measures how strongly associated is the longitudinal outcome at any particular time point t with the hazard of an event at the same time point. The results suggest that serum bilirubin is strongly related with the risk for the composite event, with a doubling of serum bilirubin levels, resulting in a 2.7-fold (95% CI: 2.3; 3.1) increase of the risk.