7.5 Survival analysis dataset structure

The R functions we will use for survival analysis require a dataset with a specific structure. There must be a numeric event time variable and a binary event indicator variable, coded as numeric, with values of 1 for events, 0 for censored event times, and no other non-missing values. There can also be additional variables representing predictors. In the basic set-up, the predictors do not vary over time and so there is one row per individual. Later, we will discuss time-varying predictors (Section 7.14) which require a dataset with multiple rows per individual.
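As a minimal sketch of this structure, the code below builds a small made-up dataset with one row per individual. The data frame `toy` and all of its variable names and values are hypothetical and are used only for illustration.

```r
# Hypothetical toy data: one row per individual, time-invariant predictors
toy <- data.frame(
  id    = 1:5,
  time  = c(12.0, 30.5, 7.2, 30.5, 18.9),     # numeric event time
  event = c(1, 0, 1, 0, 1),                   # 1 = event, 0 = censored
  age   = c(54, 61, 47, 70, 58),              # numeric predictor
  group = factor(c("A", "B", "A", "B", "A"))  # categorical predictor
)

# Inspect the variable types
str(toy)

# One row per individual: TRUE if no id appears more than once
anyDuplicated(toy$id) == 0
```

Here the second and fourth individuals are censored at time 30.5, and the other three experienced the event.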

Example 7.1 (continued): The first five rows of the Natality teaching dataset, shown below, include the event time (gestage37), the indicator of preterm birth (preterm01), and a few time-invariant demographic variables and other risk factors – mother’s age (MAGER), mother’s race/Hispanic origin (MRACEHISP), previous preterm birth (RF_PPTERM), and previous Cesarean birth (RF_CESAR). Four of the five births have a gestational age censored at 37 weeks (preterm01 = 0), and one was preterm at gestational age 31 weeks (preterm01 = 1).

library(tidyverse) # provides %>% and select()
load("Data/natality2018_rmph.Rdata")
natality %>%
  select(gestage37, preterm01, MAGER, MRACEHISP, RF_PPTERM, RF_CESAR) %>% 
  head(5)
## # A tibble: 5 × 6
##   gestage37 preterm01 MAGER MRACEHISP RF_PPTERM RF_CESAR
##       <dbl>     <dbl> <dbl> <fct>     <fct>     <fct>   
## 1        37         0    35 Hispanic  No        Yes     
## 2        31         1    28 NH White  Yes       No      
## 3        37         0    22 NH Black  No        No      
## 4        37         0    35 NH White  No        No      
## 5        37         0    30 NH White  No        No

Verify the event time variable (gestage37) is numeric using is.numeric() and summarize the event times using summary().

is.numeric(natality$gestage37)
## [1] TRUE
summary(natality$gestage37)
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##    17.0    37.0    37.0    36.4    37.0    37.0

Similarly, verify the event indicator (preterm01) is numeric and use table() to verify its only non-missing values are 0 and 1.

is.numeric(natality$preterm01)
## [1] TRUE
table(natality$preterm01, useNA = "ifany")
## 
##    0    1 
## 1748  252
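If you need to run these checks on several datasets, they can be bundled into one function. The following is a minimal sketch, not a definitive implementation: `check_surv_vars()` is a hypothetical helper (it is not part of any package), and the small data frames `ok` and `bad` are made up to stand in for a real dataset.

```r
# Hypothetical helper: stops with an error if the event time or the
# event indicator is not in the required format
check_surv_vars <- function(data, time, event) {
  t <- data[[time]]
  e <- data[[event]]
  if (!is.numeric(t)) stop("event time must be numeric")
  if (!is.numeric(e)) stop("event indicator must be numeric")
  # Only 0 and 1 are allowed as non-missing values
  if (!all(e %in% c(0, 1, NA))) {
    stop("event indicator must contain only 0, 1, or NA")
  }
  invisible(TRUE)
}

# Made-up data in the correct format: the check passes silently
ok <- data.frame(time = c(37, 31, 37), event = c(0, 1, 0))
check_surv_vars(ok, "time", "event")

# Made-up data with a character event indicator: the check fails
bad <- data.frame(time = c(37, 31, 37), event = c("No", "Yes", "No"))
msg <- tryCatch(check_surv_vars(bad, "time", "event"),
                error = function(e) conditionMessage(e))
msg
```

A function like this only confirms the coding of the two variables. It is still worth looking at summary() and table() output to see whether the values themselves are plausible.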

In Example 7.1, the event indicator variable was already in the correct format. What if it is not?

Example 7.2: The Digitalis Investigation Group (DIG) teaching dataset (dig_rmph.Rdata, see Appendix A.6) contains data from a clinical trial investigating the safety and efficacy of digoxin for treating congestive heart failure. One of the endpoints measured was toxicity (DIG). Check whether this event indicator variable is numeric with values 0 and 1 and, if it is not, convert it to that form.

load("Data/dig_rmph.Rdata")
is.numeric(dig$DIG)
## [1] FALSE
table(dig$DIG, useNA = "ifany")
## 
##    No Event First Event 
##        6702          98

The variable is not numeric, but it does have just two values. Create a numeric event indicator variable that is 1 when the original variable is “First Event”, and use table() to check the derivation. The syntax dig$DIG == "First Event" creates a logical vector of TRUE and FALSE values, and as.numeric() converts that logical vector to numeric, converting TRUE to 1 and FALSE to 0.

dig$DIG_event <- as.numeric(dig$DIG == "First Event")
table(dig$DIG, dig$DIG_event, useNA = "ifany")
##              
##                  0    1
##   No Event    6702    0
##   First Event    0   98
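The comparison step matters if the original variable is stored as a factor, which is an assumption here since the text above does not show the class of dig$DIG. Applying as.numeric() directly to a factor returns the internal level codes (1, 2, ...) rather than 0 and 1. The sketch below uses a made-up factor `x` with the same two labels, plus one missing value, to show the difference.

```r
# Hypothetical toy factor with the same two labels as dig$DIG and one NA
x <- factor(c("No Event", "First Event", NA, "No Event"),
            levels = c("No Event", "First Event"))

# Direct conversion returns level codes: 1, 2, NA, 1 (not a valid indicator)
codes <- as.numeric(x)
codes

# Comparing first gives a 0/1 indicator: 0, 1, NA, 0 (missing stays missing)
event <- as.numeric(x == "First Event")
event
```

Missing values are carried through as NA in both cases, so the derived indicator never turns a missing value into a 0.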

The datasets used in this text all include an event time variable. However, in your future work you may encounter datasets for which you have to compute the event time. For example, you may be given the date each individual started being observed and the date their event occurred (or their observation was censored). Computing the event times (the elapsed time between those dates) is easier in R if you store the dates as date-formatted variables and use functions designed to count time units between them. See, for example, the chapter “Dates and Times” in R for Data Science (Wickham, Çetinkaya-Rundel, and Grolemund 2023).
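As a minimal sketch of such a computation, the code below uses only base R functions (as.Date() and difftime()). The data frame `d`, its variable names, and its dates are all made up for illustration, and the choice of days as the time unit is an assumption. The lubridate package described in the chapter cited above offers alternatives.

```r
# Hypothetical toy data: start of observation and event/censoring dates
d <- data.frame(
  start_date = as.Date(c("2020-01-15", "2020-03-01", "2020-06-10")),
  end_date   = as.Date(c("2020-01-25", "2021-03-01", "2020-07-10")),
  event      = c(1, 0, 1)  # 1 = event, 0 = censored
)

# Event time in days, converted from a difftime object to plain numeric
d$time_days <- as.numeric(difftime(d$end_date, d$start_date, units = "days"))

# Optional: approximate years, using an average year length of 365.25 days
d$time_years <- d$time_days / 365.25

d
```

Converting the difftime result with as.numeric() is what gives the numeric event time variable that the required dataset structure calls for.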

References

Wickham, H., M. Çetinkaya-Rundel, and G. Grolemund. 2023. R for Data Science: Import, Tidy, Transform, Visualize, and Model Data. 2nd ed. Sebastopol, CA: O’Reilly Media.