Monday, October 24, 2016

Multivariate Joint Models for Multiple Longitudinal Outcomes and a Time-to-Event

A new, development version, of package JMbayes has been rolled-out on the dedicated GitHub repo. The major addition in this version is a set of new functions that can fit multivariate joint models for multiple longitudinal outcomes and a time-to-event. As with other GitHub packages, it can be easily installed using the install_github() function from package devtools. The new function that fits multivariate joint models is called mvJointModelBayes() and has a very similar syntax as the jointModelBayes() function described in a previous post. However, contrary to jointModelBayes() that is entirely written in R, the main bulk of computations of mvJointModelBayes() are based on C++ code building upon the excellent Rcpp and RcppArmadillo packages.

To make the connection between the two functions from a practical viewpoint more clear, let's fit first a univariate joint model using mvJointModelBayes(). As for jointModelBayes(), we first need to separately fit a mixed-effects model for the longitudinal outcome, and a Cox model for the survival outcome. However, contrary to jointModelBayes(), which required the mixed model to be fitted with function lme() from the nlme package, for mvJointModelBayes() the mixed model needs to be fitted with the new function mvlgmer(). This follows the syntax of lmer() from package lme4. For example, the code below fits a linear mixed model for the longitudinal biomarker serum bilirubin from the Mayo Clinic PBC data set:

The main arguments of this function are: formulas a list of lme4-like formulas (a formula per outcome), data a data.frame that contains all the variables specified in formulas (NAs allowed), and families a list of family objects specifying the type of each outcome (currently the gaussian, binomial and poisson families are allowed). Following, we fit a Cox model using function coxph() from the survival package; argument model of coxph() needs to be set to TRUE:

Then the univariate joint model is fitted by the following simple call to mvJointModelBayes():

To fit a multivariate joint model for multiple longitudinal outcomes, the only thing that needs to be adapted from the procedure described above, is the call to mvglmer(). For example, to fit a bivariate joint model for serum bilirubin and presence of spiders, we use the syntax:

In addition, similarly to jointModelBayes(), mvJointModelBayes() allows to specify different functional forms for the longitudinal outcomes that are included in the Cox model.  As an example, we extend model JMFit2 by including the value and slope term for bilirubin, and the area term for spiders (in the log-odds scale). To include these terms into the model we specify the Formulas argument. This is specified in a similar manner as the derivForms argument of jointModelBayes(). In particular, it should be a list of lists. Each component of the outer list should have as name the name of the corresponding outcome variable. Then in the inner list we need to specify four components, namely, the fixed & random R formulas specifying the fixed and random effects part of the term to be included, and indFixed & indRandom integer indices specifying which of the original fixed and random effects are involved in the calculation of the new term. In the inner list you can also optionally specify a name for the term you want to include. A couple of notes:
  1. For terms not specified in the Formulas list, the default value functional form is used.
  2. If for a particular outcome you want to include both the value functional form and an extra term, then you need to specify that in the Formulas using two entries. To include the value functional form you only need to set the corresponding to "value", and for the second term to specify the inner list. See example below on how to include the value and slope for serum bilirubin (for example, if the list below the entry "log(serBilir)" = "value" was not given, then only the slope term would have been included in the survival submodel).

We further extend the previous model by including the interactions terms between the terms specified in Formulas and the randomized treatment indicator drug. The names specified in the list that defined the interaction factors should match with the names of the output from JMFit3. Because of the many association parameters we have, we place a shrinkage prior on the association coefficients 'alpha'. In particular, if we have K association parameters, we assume that alpha_k ~ N(0, tau * phi_k), k = 1, ..., K. The precision parameters tau and phi_k are given Gamma priors. Parameter tau is global shrinkage parameter, and phi_k a specific shrinkage parameter per alpha coefficient:

Future updates of the package will include additional features, including among others
  • a method for the generic functions survfitJM() and predict() to compute dynamic predictions for multivariate longitudinal data setting for both the longitudinal and survival outcomes.
  • allow for competing risks.
  • make the C++ code (even) faster.

Friday, March 18, 2016

An Integrated Shiny App for a Course on Repeated Measurements Analysis (completed)

Repeated Measurements Analysis

Repeated measurements analysis, and in particular longitudinal data analysis, is one of the two most frequently used types of analysis in my field (Biostatistics) - the other being survival analysis. Starting from this year I will be teaching in my university a new course on regression model for repeated measurements data that is primarily focused on applied statisticians, epidemiologists and clinicians. In general, this type of audience often finds this topic quite difficult, mainly due to the fact that one has to carefully consider the two levels of such data, namely, how to model longitudinal evolutions, and how to model correlations. On top of that, many of the researchers following this course have been primarily exposed to SPSS, making the transition to R that I will be using in the course somewhat more demanding.

Shiny app

Based on the considerations mentioned above, when I was developing the course I was thinking of ways to facilitate both the understanding of the key concepts of repeated measurements analysis, and how to effectively explain the use of R to analyze such data. The answer to both questions was to utilize the great capabilities of shiny. I have created an app the replays all analyses done in the course - a snapshot shown below.

The students can select a chapter and a section, see the code used in that section in the 'Code' tab, and examine the output in the 'Output' tab. The slides of the course are also integrated in the app, and can been seen in the 'Slides' tab. The 'Help' tab explains the basic usage of the main functions used in the selected chapter. To further enhance understanding of some key concepts, such as how random effects capture correlations and how longitudinal evolutions are affected by the levels of baseline covariates, the app allows to interactively change values for some parameters that control these features. The app also includes four practicals aimed at Chapter 2 that introduces marginal models for continuous data, Chapter 3 that explains linear mixed effects models, Chapter 4 the presents the framework of generalized estimating equations, and Chapter 5 the presents generalized linear mixed effects models, respectively. Chapter 6 focuses on explaining the issues with incomplete data in longitudinal studies. For each practical the students may reveal the answer to specific questions they have trouble solving, or download a whole R markdown report with a detailed explanation of the solutions.

The app is based on some popular packages for repeated measurements analysis (nlme, lme4, MCMCglmm, geepack), and some additional utilities packages (lattice, MASS, corrplot).

The app is available in my dedicated GitHub repository for this course, and can be invoked using the command (assuming that you have the aforementioned packages installed):

shiny::runGitHub("Repeated_Measurements", "drizopoulos")

Friday, March 4, 2016

Dynamic Predictions using Joint Models

What are Dynamic Predictions

In this post we will explain the concept of dynamic predictions and illustrate how these can be computed using the framework of joint models for longitudinal and survival data, and the R package JMbayes. The type of dynamic predictions we will discuss here are calculated in follow-up studies in which some sample units (e.g., patients) who are followed-up in time provide a set of longitudinal measurements. These longitudinal measurements are expected to be associated to events that the sample units may experience during follow-up (e.g., death, onset of disease, getting a child, dropout from the study, etc.). In this context, we would like to utilize the longitudinal information we have available up to  particular time point t to predict the risk of an event after t. For example, for a particular patient we would like to use his available blood values up to year 5 to predict the chance that he will develop a disease before year 7 (i.e., within two years from his last available measurement). The dynamic nature of these predictions stems from the fact that each time we obtain a new a longitudinal measurement we can update the prediction we have previously calculated.

Joint models for longitudinal and survival data have been shown to be a valuable tool for obtaining such predictions. They allow to investigate which features of the longitudinal profiles are most predictive, while appropriately accounting for the complex correlations in the longitudinal measurements.

Fit a Joint Model

For this illustration we will be using the Primary Biliary Cirrhosis (PBC) data set collected by the Mayo Clinic from 1974 to 1984. For our analysis we will consider 312 patients who have been randomized to D-penicillamine and placebo. During follow-up several biomarkers associated with PBC have been collected for these patients. Here we focus on serum bilirubin levels, which is considered one of the most important ones associated with disease progression. In package JMbayes the PBC data are available in the data frames pbc2 and containing the longitudinal and survival information, respectively (i.e., the former is in the long format while the latter contains a single row per patient).

We start by fitting a joint model to the PBC data set. For the log-transformed serum bilirubin we use a linear mixed effects models with natural cubic splines in the fixed and random effects for time, and also correct in the fixed part for age and sex. For the time-to-death we use a Cox model with baseline covariates age, sex and their interaction and the underlying level of serum bilirubin as estimated from the mixed model. This joint model is fitted using the following piece of code:

Calculate Dynamic Predictions

In package JMbayes these subject-specific predictions are calculated using function survfitJM(), respectively. As an illustration, we show how this function can be utilized to derive predictions for Patient 2 from the PBC data set using our fitted joint model jointFit. We first extract the data of this patient in a separate data frame and then we call survfitJM()

The last available measurement of this patient was in year 8.83, and survfitJM() will by default produce estimates of event-free probabilities starting from this last time point to the end of the follow-up. The calculation of these probabilities is based on a Monte Carlo procedure, and in the output we obtain as estimates the mean and median over the Monte Carlo samples along with the 95% pointwise credible intervals. Hence, the probability that this patient will survive up to year 11.2 is 60%. A plot of these probabilities can be obtained using the plot() method for objects returned by survfitJM()

Shiny app for Dynamic Predictions

To facilitate the use of dynamic predictions in practice, a web interface has been written using package shiny. This is available in the demo folder of the package and can be invoked with a call to the runDynPred() function. With this interface users may load an R workspace with the fitted joint model(s), load the data of the new subject, and subsequently obtain dynamic estimates of survival probabilities and future longitudinal measurements (i.e., an estimate after each longitudinal measurement). Several additional options are provided to calculate predictions based on different joint models (if the R workspace contains more than one model), to obtain estimates at specific horizon times, and to extract the data set with the estimated conditional survival probabilities. A detailed description of the options of this app is provided in the 'Help' tab within the app.

Friday, October 16, 2015

An Integrated Shiny App for a Course on Repeated Measurements Analysis

I will be teaching a course on statistical regression models for repeated measurements data, and I thought of creating a shiny app to let students run the code used in the course and examine the output in real time.

The app is still in development mode (Chapter 5 is still missing), but you may give it a try, and let me know of any feedback/comments you may have.

More information on installation requirements and how to invoke is available on my GitHub page.

Sunday, August 9, 2015

Two Presentations about Joint Models

Packages JM and JMbayes @ JSM2015

This year JSM features an interesting invited session about fitting joint models in different software packages -- if you're interested drop by...

Here are my slides in which I give a short intro in packages JM and JMbayes:

Personalized Screening Intervals for Longitudinal Biomarkers using Joint Models

This presentation shows a new methodological framework for optimizing screening intervals for longitudinal biomarkers using the framework of joint models for longitudinal and time-to-event data; more information can be found on arXiv.

Here are the slides:

Wednesday, December 17, 2014

Joint Models with Categorical Longitudinal Outcomes

Extending the Standard Joint Model

In a previous post we have briefly introduced the framework of joint models for longitudinal and time-to-event data, and we have shown how models of this type can be fitted in R using package JMbayes. In that post we have presented what can be called the 'standard' joint model, in which we have one continuous longitudinal outcome and one survival endpoint. Starting from this standard joint model different extensions have been proposed in the literature to accommodate different types of longitudinal and survival data. The aim of this post is to present one of these extensions, namely joint models with categorical longitudinal responses, and how such models can be fitted using JMbayes.

Generalized Linear Mixed Effects Models

Before discussing joint models with categorical longitudinal outcomes we need first to introduce the framework of Generalized Linear Mixed Models (GLMMs). GLMMs are an extension of linear mixed models to allow response variables from different distributions, such as binary responses. Alternatively, you could think of GLMMs as an extension of generalized linear models (e.g., logistic regression) to include both fixed and random effects (hence mixed models). Contrary to linear mixed models, fitting GLMMs under maximum likelihood is computationally a much more challenging task. This is due to the fact that in the definition of the observed data log-likelihood there is an integral over the (normally distributed) random effects, which does not have, in general, a closed-form solution (for linear mixed models this integral does have a closed-form solution which facilitates computations). Under the Bayesian approach there is essentially no difference between linear mixed models and GLMMs because in both the random effects are considered parameters for which we wish to derive their posterior distribution. In R there are several packages that can fit GLMMs under either maximum likelihood or the Bayesian approach, such as lme4 and MCMCglmm, among others. However, in order to fit joint models with categorical longitudinal response using package JMbayes the user needs first to fit the GLMMs using function glmmPQL() from package MASS. Even though the PQL algorithm has been shown not to work that optimally, especially for dichotomous longitudinal outcomes, the purpose of fitting the GLMMs with glmmPQL() is to extract from the resulting model object  the response vector and the design matrices for the fixed and random effects, and initial values for the parameters. The following code illustrates the use of glmmPQL() for fitting a mixed effects logistic regression using the Primary Biliary Cirrhosis (PBC) data set. Since in the PBC data there was no binary biomarker recorded during follow-up, we artificially create one by dichotomizing serum bilirubin at the threshold value of 1.8 mg/dL.

The syntax of glmmPQL() is similar to the one of function lme() from package nlme (the PQL algorithm actually requires fitting linear mixed effects model for a transformed response variable).

A Joint Model with a Binary Longitudinal Biomarker

To fit a joint model with a binary longitudinal outcome we need to appropriately define argument densLong of the jointModelBayes() function. This argument accepts a function that calculates the probability density function (and its natural logarithm) of the longitudinal outcome. The arguments of this function are y the vector of longitudinal responses, eta.y the subject-specific linear predictor (i.e., including both the fixed and random effects), scale a potential scale parameter (e.g., the measurement error standard deviation in the case of linear mixed models), log a logical argument denoting whether the pdf or the logarithm of the pdf is returned by the function, and data a data.frame that contains variables potentially used in the definition of densLong. For our example, the function we need to define is:

Following we fit a Cox model that represents the survival submodel of the joint model and we put all the above as arguments to jointModelBayes(), i.e.,

In the output the association parameter Assoct denotes the log hazard ratio, at any time point t, for a unit increase in the log odds of having serum bilirubin above the threshold value of 1.8 mg/dL, at the same time point.

Thursday, September 4, 2014

Joint Models for Longitudinal and Survival Data

What are joint models for longitudinal and survival data?

In this post we will introduce in layman's terms the framework of joint models for longitudinal and time-to-event data. These models are applied in settings where the sample units are followed-up in time, for example, we may be interest in patients suffering from a specific disease who are followed-up in time to monitor their progress. In this context typically different types of outcomes are collected for the sample units at hand. In general, these can be categorized as longitudinal outcomes, i.e., the same outcome repeatedly measured in time for the same sample unit, and event-time outcomes, i.e., the time until a specific event of interest occurs for the sample units. Often our research questions require analyzing each of these outcomes separately. For instance, does a new treatment have a beneficial effect in the survival of the patients or is there any evidence for a difference in the average longitudinal evolutions between males and females. However, there are also research questions of more complex nature for which a joint modeling approach of these outcomes is required. Joint models for longitudinal and survival data

What are endogenous time-varying covariates (in the setting described above)?

A time-varying covariate is termed endogenous if its value at any time point t is can be affected by an event occurring at an earlier time point s < t, and is exogenous if its value at t is not affected by an earlier event.

Examples of exogenous covariates are the nurse that took the blood sample or performed the echo, the period of the year (e.g., winter versus summer) and environmental factors (e.g., pollution levels). On the other hand all covariates measured on the patient (e.g., biomarkers) are endogenous. To make this distinction more clear let’s consider two time-varying covariates in the context of  a study on asthma, namely, a biomarker for asthma that has been measured during follow up for the patients in the study, and the air pollution levels in the neighborhood where each patients live. Suppose that a particular patient has an asthma attack after s = 5 months from the start of the study. It is directly evident that at a future time point, say t = 5.2 months, the level of the biomarker will be affected from the fact that this patient had an asthma attack, whereas air pollution levels at the same time point t = 5.2 months will not be affected by this attack.

What is non-random dropout (in the setting described above)?

A common and important problem in the analysis of longitudinal outcomes is missing data. Namely, if though measurements are planned at specific time points, often and for a variety of reasons the subjects under study do not adhere to these scheduled visit times or they even completely dropout from the study. When the reasons (or more accurately the probability) for dropping out depends on unobserved longitudinal responses, the process that generates these missing data cannot be ignored, even if the main focus is on the longitudinal outcome. In this case the dropout process is termed non-random.

How joint models work?

The intuitive idea behind joint models for longitudinal and survival outcomes is given in the following figure:

In the top panel the hazard process is depicted, which describes how the instantaneous risk (hazard function) of an event changes in time; for example, the hazard of patient having a relapse. In the bottom panel the asterisk denote the observed longitudinal responses and the green line the underlying longitudinal process; for example, the values of a biomarker for the disease under study. Joint models postulate a relative risk (proportional hazards) model for the event time outcome, which is directly associated with the longitudinal process denoted by the green line. This green line is recovered from the observed data (asterisks) using a mixed effects model. This model contains fixed effects, describing the average longitudinal evolution in time, and random effects that describe how each patient deviates from this average evolution. In their basic form joint models assume that the hazard function at any particular time point t, denoted by the vertical dashed line, is associated with the value of the longitudinal process (green line) at the same time point. The blue line represents the assumption behind the time-dependent Cox model, which posits that the value of the longitudinal outcome remain constant in between the observation times. Estimation of the model is based on the joint distribution of the two outcomes and can be done either under maximum likelihood or under a Bayesian approach. The framework of joint models can be used to account for both endogenous time-varying covariates and non-random dropout.

Fitting joint models using package JMbayes

There are several packages available on CRAN for fitting this type of joint models, namely JM, JMbayes, joineR and lcmm among others. In this post we will show how joint models can be fitted using using package JMbayes that fits joint models under the Bayesian paradigm. For this illustration we will be using the Primary Biliary Cirrhosis (PBC) data set (available in the package and also in the survival package) collected by the Mayo Clinic from 1974 to 1984. For our analysis we will consider 312 patients who have been randomized to D-penicillamine and placebo. During follow-up several biomarkers associated with PBC have been collected for these patients. Here we focus on serum bilirubin levels, which is considered one of the most important ones associated with disease progression. In package JMbayes the PBC data are available in the data frames pbc2 and containing the longitudinal and survival information, respectively (i.e., the former is in the long format while the latter contains a single row per patient).

We start by loading the JMbayes and lattice packages and defining the indicator status2 for the composite event, namely transplantation or death:

pbc2$status2 <- as.numeric(pbc2$status != "alive")$status2 <- as.numeric($status != "alive")

The design followed by package JMbayes requires to first separately fit a linear mixed model for the longitudinal outcome and a Cox model for the survival one. The aim of the linear mixed model is to describe/recover the subject-specific longitudinal trajectories. Close inspection of the shapes of the log serum bilirubin profiles indicates that for some individuals these seem to be nonlinear.

## we take a sample of patients with more than six 
## measurements
long_ids <- names(which(table(pbc2$id) > 6))
ids <- sample(long_ids, 16)
xyplot(log(serBilir) ~ year | id, data = pbc2, 

       subset = id %in% ids, type = c("p", "smooth"), 
       lwd = 2, layout = c(4, 4))

Hence, to allow for flexibility in the specification of these profiles we include natural cubic splines in both the fixed- and random-effects parts of the mixed model. This model can be fitted using the following calls to functions lme() and ns() (the latter from package splines): 

lmeFit <- lme(log(serBilir) ~ ns(year, 2), data = pbc2,
              random = ~ ns(year, 2) | id) 

Analogously, in the Cox model we control for treatment and age, and also allow for their interaction:

coxFit <- coxph(Surv(years, status2) ~ drug * age, 
                data =, x = TRUE)

In the call to coxph() argument x is set to TRUE such that the design matrix is also included in the resulting model object. Using as main arguments the lmeFit and coxFit objects, the corresponding joint model is fitted using the code:

jointFit <- jointModelBayes(lmeFit, coxFit, timeVar = "year")


jointModelBayes(lmeObject = lmeFit.pbc1, survObject = coxFit.pbc1, 
    timeVar = "year")

Data Descriptives:
Longitudinal Process             Event Process
Number of Observations: 1945     Number of Events: 169 (54.2%)
Number of subjects: 312

Joint Model Summary:
Longitudinal Process: Linear mixed-effects model
Event Process: Relative risk model with penalized-spline-approximated 
  baseline risk function
Parameterization: Time-dependent value 

      LPML      DIC       pD
 -3169.172 6116.463 939.5384

Variance Components:
              StdDev    Corr        
(Intercept)   1.0097  (Intr)  n(,2)1
ns(year, 2)1  2.3555  0.3640        
ns(year, 2)2  2.2430  0.3609  0.5701
Residual      0.3023                

Longitudinal Process
              Value Std.Err Std.Dev   2.5%  97.5%      P
(Intercept)  0.4918  0.0212  0.0699 0.3676 0.6316 <0.001
ns(year, 2)1 2.4117  0.1032  0.2337 1.9423 2.7918 <0.001
ns(year, 2)2 2.3611  0.0895  0.2906 1.7765 2.8609 <0.001

Event Process
                     Value Std.Err  Std.Dev    2.5%    97.5%      P
drugD-penicil      -0.9648  0.1272   0.7543 -2.4530   0.4693  0.199
age                 0.0370  0.0018   0.0107  0.0135   0.0571 <0.001
drugD-penicil:age   0.0179  0.0025   0.0145 -0.0096   0.0470  0.211
Assoct              1.4181  0.0062   0.0953  1.2367   1.6041 <0.001
Bs.gammas1         -6.6205  0.1009   0.5992 -7.7762  -5.3132 <0.001
Bs.gammas2         -6.5832  0.1031   0.5962 -7.7639  -5.2468 <0.001

MCMC summary:
iterations: 20000 
adapt: 3000 
burn-in: 3000 
thinning: 10 
time: 2.6 min
Argument timeVar is a character string that specifies the name of the time variable in the mixed model (the scale of time (e.g., days, months, years) must be the same in both the mixed and Cox models). The default call to jointModelBayes() includes in the linear predictor of the relative risk model the subject-specific linear predictor of the mixed model, which in this case represents the average patient-specific log serum bilirubin level. The output of the summary() method is rather self-explanatory and contains model summary statistics, namely LPML (the log pseudo marginal likelihood value), DIC (deviance information criterion), and pD (the effective number of parameters component of DIC), posterior means for all parameters, and standard errors (effective sample size estimated using time series methodology), standard deviations, 95% credibility intervals and tail probabilities for all regression coefficients in the two submodels. The association parameter, denoted in the output as Assoct, is the parameter that measures how strongly associated is the longitudinal outcome at any particular time point t with the hazard of an event at the same time point. The results suggest that serum bilirubin is strongly related with the risk for the composite event, with a doubling of serum bilirubin levels, resulting in a 2.7-fold (95% CI: 2.3; 3.1) increase of the risk.