Life histories are sequences of transitions between states of existence [@aalen2008survival,@willekens2014]. In stochastic simulation of life histories, ages at transition are obtained by sampling time-to-event or waiting time distributions.The time-to-event is the duration between a reference event,e.g. birth, and the event of interest. In survival analysis, it is known as the survival time. A number of parametric models of survival times have been proposed in the literature. They include the exponential distribution, the Gompertz distribution and the Weibull distribution. The piecewise-exponential distribution is an extension of the exponential distribution. The exponential waiting time distribution has a single parameter, the constant rate of transition, hazard rate or incidence rate. In the piecewise-exponential model the transition rates are piecewise-constant. They are constant during time or age intervals and vary between intervals. Piecewise-constant transition rates are common in demography because transition rates are often published by age or age group.
The piecewise-exponential waiting time distribution is the subject of this vignette. By sampling the distribution, individual waiting times can be generated. In the application, two states of existence are considered: alive and dead. The event of interest is death. The transition rate is the death rate. The multistate model, discussed in the vignette “Simulation of life histories”, extends the two-state model to multiple states of existence. The basic approach to sampling waiting time distributions remains valid in the multistate model.
The generic approach to generating waiting times consists of two steps [@rubinstein2017simulation]. In the first step, a random number is drawn from the standard uniform distribution U(0,1). The random draw is a value between zero and one, interpreted as a probability and denoted by u. The second step is to determine the duration at which the probability of a transition is u. The duration depends on the waiting time distribution.
The vignette consists of two sections. The first is a brief theoretical background. The key concept is the inverse cumulative distribution function or quantile function. Concepts are illustrated using the exponential distribution and the piecewise-exponential distribution, and a few numerical illustrations should clarify the sampling of probability distributions. The second section is an application. Individual ages at death (lifespans) are generated from period age-specific death rates. The death rates are retrieved from the Human Mortality Database (HMD) (https://www.mortality.org). The retrieval of data from the HMD is described in the tutorial, which is the third vignette that comes with the package \(VirtualPop\). The vignette requires the packages survival, eha, survminer,knitr, kableExtra, ggplot2.
Let X be a continuous random variable denoting the waiting time to a transition (or time-to-transition) and let x be a realization of X. The cumulative distribution function (cdf) of X is $${F_X(x)=Pr{X \le x}=1-exp\left[-\int_0^x\mu(\tau) d\tau \right]=1-exp \left[-\Lambda(x) \right]}\tag{1} $$
with \(\mu(\tau)\) the instantaneous hazard rate at duration \(\tau\) and \(\Lambda(x)=\int_0^x\mu(\tau) d\tau\) the cumulative or integrated hazard at x. The cumulative hazard depends on \(\mu(\tau) \ \ \ \ \tau<x\) and the exposure time \(d\tau\), which measures the duration of exposure between \(\tau\) and \(\tau+d\tau\). The complement of the cumulative distribution function is the survival function: \(S_X(x)=exp \left[-\Lambda(x) \right]\).
\(F_X(x)\) is the probability that the value of X is smaller than or equal to a given value x. It is a non-decreasing function of x. The inverse cumulative distribution function \(F_X^{-1}\) gives the value of x that makes \(F_X(x)\) equal to a given probability, u say. That value is denoted by \(x_u\). The inverse cumulative distribution function is also known as the quantile function. The quantile function is the workhorse of stochastic simulation. If the function \(F_X (x)=u\) has no analytical solution, \(x_u\) must be determined numerically, which involves iteration.
Since \(F_X (x)\) is a nondecreasing function of x, the inverse cumulative distribution or quantile function may be defined as
$$ x_u=F_X^{-1} (u)=min{x:F_X (x) \ge u} \tag{2} $$ It is the lowest value of x for which the cumulative distribution function is equal to or exceeds u.
First, we consider a simple example. The general case is covered next. Assume the waiting time to an event is exponentially distributed. The exponential distribution has a single parameter \(\mu\), the hazard rate or transition rate. The rate is constant. The survival function is \(S_X(x)=exp\left [ - \mu x\right]\) and the cumulative distribution function is \(F_X(x)=1-S_X(x)\). The value of x at which \(F_X(x)=u\) is
$$x_u=F_X^{-1}(u)=-\frac{ln(1-u)}{\mu} \tag{3}$$ A random draw from the exponential distribution with parameter \(\mu\) is equivalent to a random draw from the standard uniform distribution, followed by the computation of x_u using equation (3). The method is the most fundamental approach for sampling from a desired distribution. @rubinstein2017simulation [p. 55] refer to the method as the inverse-transform method. See also @taimre2019monte.
Now consider the general case. Recall that \(x=x_u\) if the following condition is satisfied: \(\Lambda(x)=-ln(1-u)\) or
$$g(x)=\Lambda(x)+ln(1-u)=\int_{0}^{x}{\mu(\tau) d\tau}+ln(1-u)=0 \tag{4}$$
The value of x that satisfies the condition is the root of equation \(g(x)=0\). \(x_u\) is the root of \(g(x)\).
The function \(runif(n)\) of base R draws n random numbers from the standard uniform distribution. The function \(rexp(n,rate)\) of base R draws n random waiting times from the exponential distribution with parameter \(rate\) ($rate=\mu$). The following code draws 10 random waiting times from the exponential distribution: \(rexp(n=10,rate=0.1)\).
The generic function uniroot() is often used to obtain the root of the equation \(g(x)=0\).
A piecewise constant hazard rate is a sequence of constant hazard rates. The hazard rate is constant in an interval and varies between intervals. The cumulative hazard at the end of an interval is the hazard rate during that interval times the length of the interval. Let \(\mu(t_0,t_1 )=\mu_1\) be the hazard rate during the first interval which starts at \(t_0\) and ends at \(t_1\). \(\mu(t_1,t_2 )=\mu_2\) is the hazard rate during the second interval from \(t_1\) to \(t_2\). In general, the hazard rate during the interval \((t_i, t_{i+1})\) is \(\mu_{i+1}\). A slightly different notation is also used. The hazard rate during the interval from \(t_i\) to \(t_{i+1}\) is then denoted by \(\mu(t_i, t_{i+1})\).
The time at which the hazard rate changes is a break point. The cumulative hazard at some time x during the third interval is
$$\Lambda(x)=\mu_1*(t_1-t_0 )+\mu_2*(t_2-t_1 )+\mu_3*( x-t_2)$$ with \(\mu_i\) the hazard rate during interval \(t_{i-1}\) and \(t_i\) and \(t_i - t_{i-1}\) the length of interval i. Since the end-point x is somewhere in the third interval, the exposure during that interval is part of the interval.
The probability of an event before x is \(F_X (x)=1-exp[-\Lambda(x)]\). The duration x at which the event has occurred with probability u is \(x_u\). At that time point \(F_X(x)=1-exp[-\Lambda(x)]=u\). That condition is satisfied when the cumulative hazard \(\Lambda(x)=-ln(1-u)\). If u is a random draw from a standard uniform distribution, then x_u is the value of x at which \(\Lambda(x)=-ln(1-u)\).
Consider a period of total length 60 time units. The minimum duration is 0 and the maximum duration is 60. Assume the period is divided into four intervals: 0-10, 10-20, 20-30, and 30-60. The starting point is 0, the ending point 60 and the break points are 10, 20 and 30. Let the hazard rates be 0.1, 0.2, 0.4, 0.15. The function Rate_pw() retrieves the event rates at selected time points:
Rate_pw <- function (t,breakpoints,rates)
{ int <- findInterval(t,breakpoints,all.inside=TRUE)
z <- rates[int]
sojournInt <- t-breakpoints[int]
h <- data.frame(t=t,rate=z,interval=int,startInt=breakpoints[int],
sojourn=sojournInt)
return(h)
}
findInterval() is a function of base R. Consider a single point in time, e.g. \(t=18.3\). The hazard rate at time \(t=18.3\) is
breakpoints <- c(0, 10, 20, 30, 60)
rates <- c(0.01,0.02,0.04,0.15)
Rate_pw(t=18.3,breakpoints,rates)
t rate interval startInt sojourn
1 18.3 0.02 2 10 8.3
The function Rate_pw() returns a data frame with five columns: the time point(s) t, the rate at t, the interval in which t is located, the starting time of the interval, and the sojourn time in the (current) interval. The piecewise-constant rates are shown in Table 1 and Figure 1.
If t is a vector, i.e. t =c(10,18.3,23.6,54.7), then
res <- Rate_pw(t=c(10,18.3,23.6,54.7),breakpoints,rates)
out <- knitr::kable(res,
caption = "Simulated waiting times: piecewise-exponential distribution",
format = 'latex', booktabs = T,linesep = "")
kableExtra::kable_styling(out,latex_options="HOLD_position")
\begin{table}[H]
\caption{\label{tab:unnamed-chunk-3}Simulated waiting times: piecewise-exponential distribution} \centering \begin{tabular}[t]{rrrrr} \toprule t & rate & interval & startInt & sojourn\ \midrule 10.0 & 0.02 & 2 & 10 & 0.0\ 18.3 & 0.02 & 2 & 10 & 8.3\ 23.6 & 0.04 & 3 & 20 & 3.6\ 54.7 & 0.15 & 4 & 30 & 24.7\ \bottomrule \end{tabular} \end{table}
max <- max(breakpoints)
xx <- seq(0,max, by=0.1)
## xx<- 18:25 # 25 is maximum (24 is last time interval)
h <- Rate_pw(xx, breakpoints,rates=rates)
require (ggplot2)
#> Loading required package: ggplot2
p <- ggplot (data=h,aes(x=t,y=rate))
p <- p + geom_line(aes(group=interval))
p <- p + xlab("Time")+ylab("Hazard hazard")
p <- p + scale_x_continuous(breaks=seq(0,60,by=10))
p <- p + scale_y_continuous (breaks=seq(0,0.16,by=0.04))
p
If the hazard rate is piecewise constant, then the cumulative hazard increases linearly in duration intervals. The function H_pw() of the VirtualPop package computes the cumulative hazard at time t. The following function call produces cumulative hazards at time 10, 18.3, 23.6 and 54.7:
z <- VirtualPop::H_pw(t=c(10,18.3,23.6,54.7),breakpoints,rates)
print(z)
[1] 0.100 0.266 0.444 4.405
Suppose we want to determine the duration at which the probability that the event has occurred is 35 percent ($u=0.35$). The cumulative hazard at that point in time is \(\Lambda(x)=-ln(1-0.35)=0.4308\). The cumulative hazard is 0.4308 if x is equal to \(\int_0^{x_u}\mu(\tau) d\tau=23.27\) (Figure 2). The cumulative hazard at duration (quantile) 23.27 is H_pw(t=23.27,breakpoints,rates)=0.4308. Hence \(x_u=23.27\). The cumulative distribution function at duration 23.27 is \(1-exp(-0.4308)=0.35\). Figure 2 shows the cumulative hazard function between durations 0 and 40.
t <- 0:40
H <- VirtualPop::H_pw(t=t,breakpoints,rates)
data <- data.frame(x=0:40,y=H)
require (ggplot2)
p <- ggplot (data=data,aes(x=x,y=y))
p <- p + geom_line()
yy <- -log(1-0.35)
z <- which.min(abs(H-yy))
p <- p + geom_segment(aes(x = 0, y = yy, xend = t[z], yend = yy),
arrow = arrow(length = unit(0.5, "cm")),
linetype="dashed")
p <- p + geom_segment(aes(x = t[z], y = yy, xend = t[z], yend = 0),
arrow = arrow(length = unit(0.5, "cm")),
linetype="dashed")
p <- p + xlab("Time")+ylab("Cumulative hazard")
p <- p + annotate(geom="text", x=0, y=yy+0.05,
label=as.character (round(yy,4)))
p <- p + annotate(geom="text", x=t[z]+1.4, y=0, label="23.27")
p
Since the survival function can be derived unambiguously from the cumulative hazard function, one could (in principle) use the survival function to derive \(x_u\). The survival function is \(S_X (x)=exp[-\Lambda(x)]\). The survival probability that is consistent with u=0.35 is \(S_X (x_u )=1-0.35=0.65\) (Figure 3).
t <- 0:40
H <- VirtualPop::H_pw(t=t,breakpoints,rates)
data <- data.frame(x=0:40,y=exp(-H))
require (ggplot2)
p <- ggplot (data=data,aes(x=x,y=y))
p <- p + geom_line()
yy <- 0.65
z <- which.min(abs(exp(-H)-yy))
p <- p + geom_segment(aes(x = 0, y = yy, xend = t[z], yend = yy),
arrow = arrow(length = unit(0.5, "cm")),linetype="dashed")
p <- p + geom_segment(aes(x = t[z], y = yy, xend = t[z], yend = 0),
arrow = arrow(length = unit(0.5, "cm")),linetype="dashed")
p <- p + xlab("Time")+ylab("Cumulative hazard")
p <- p + annotate(geom="text", x=0, y=yy+0.05, label=as.character (round(yy,4)))
p <- p + annotate(geom="text", x=t[z]+1.4, y=0, label="23.27")
p
In mathematics and computing, a root-finding algorithm is an algorithm for finding zeroes, also called “roots”, of continuous functions. Let’s get the root of g(x) (equation4). The R code that finds the root of the function includes the following components:
The following function defines equation \(g(x)=0\). Note that the survival function is used here, hence u is the survival probability and NOT the cumulative distribution.
pw_root <- function(t, breakpoints,rates, uu){
aa <- VirtualPop::H_pw(t, breakpoints,rates) + log(uu)
# Cum hazard rate should be equal to - log(u),
# with u a value of survival function
return(aa)
}
pw_root (t= c(10,18.3,23.6,54.7),breakpoints,rates,uu=0.43)
[1] -0.7439701 -0.5779701 -0.3999701 3.5610299
The function pw_root() is based on \(pw.root\) written by Zinn for root finding in the package MicSim [@Zinn2014]. MicSim is an R package for the simulation of individual life histories in the context of population projection. It is a simplified R version of the MicCore [@zinn2009mic] package, developed as part of the MicMac project [@gampe2007population].
Define the endpoints of the interval and call uniroot():
interval=c(breakpoints[1],breakpoints[length(breakpoints)])
uniroot(f=pw_root,interval=interval,breakpoints,rates,uu=0.65)
$root
[1] 23.26957
$f.root
[1] -0.00000000000000005551115
$iter
[1] 7
$init.it
[1] NA
$estim.prec
[1] 0.00006103516
Now we have the tool to create a sample of piecewise-exponentially distributed random waiting times. It consists of two steps: (a) draw a random number, u say, from a standard uniform distribution and (b) solve equation (4). The function r.pw_exp() of VirtualPop draws a sample of n random waiting times from a piecewise-exponential waiting time distribution with endpoints 0 and 60, breakpoints 10, 20 and 30, and hazard rates 0.01, 0.02, 0.04 and 0.15:
pw_sample <- VirtualPop::r.pw_exp (n=1000, breakpoints, rates=rates)
The object \(pw\_sample\) is a vector of waiting times.
Note that \(uniroot()\) only works when the value of \(g(x)\) at the lowest possible value of x has a different sign then the value of \(g(x)\) at the highest possible value of x. If the function values at the endpoints are not of opposite signs (or zero), \(uniroot()\) gives the error message “Error in uniroot(f = pw_root, interval = interval),: \(f()\) values at end points not of opposite sign”, where \(f()\) is \(g(x)\). The error is caused by a very low survival probability u, which generates a time to transition that is beyond the interval, i.e. beyond the highest breakpoint. If such an error message occurs, the highest breakpoint should be increased. An alternative is to add \(extendInt="yes"\) as an argument of the uniroot() function. In the implementation of uniroot() in computation of waiting times, the computation does not stop when an error occurs, but the waiting time is given the value NA. To prevent to programme from crashing, the function r.pw_exp() uses the tryCatch() wrapper. The wrapper generates a value of 5000 if the endpoints are not of opposite signs. It allows the user to take appropriate action.
An alternative approach to sampling a piecewise-constant exponential distribution is to use the function rpexp() of the \(msm\) package [@jackson2011multi]. The package includes a set of functions (pexp()) to compute the density, the cumulative distribution and the quantile function of the piecewise-exponential distribution. It also includes a function to generate random numbers. The function qpexp() computes the duration at which the probability that the event has occurred is u and the survival probability is 1-u. For instance, the duration at which the survival probability is 65 percent is 23.27 units of time. It is computed using the code:
msm::qpexp((1-0.65),rates,breakpoints[-length(breakpoints)])
#> [1] 23.26957
To generate 1000 random waiting times, use the code:
nsample <- 10000
pw.sample.msm <- msm::rpexp (n=nsample,
rate=rates,
t=breakpoints[-length(breakpoints)])
m <- mean(pw.sample.msm)
s <- sd(pw.sample.msm)
The mean and standard deviation of the waiting time are also computed. The mean is 27.35 and the standard deviation is 11.96.
Figure 4 shows the frequency distribution of the simulated waiting times. The figure displays a sequence of exponential survival functions. The survival functions are exponential within duration intervals.
require (msm)
# x <- msm::rpexp(n=nsample,rates,breakpoints[-length(breakpoints)])
my.lower <- breakpoints[-length(breakpoints)]
my.upper <- breakpoints[-1]
hist (pw.sample.msm,breaks=100,freq=FALSE,xlim=c(0,max(my.upper)),las=1,
ylim=c(0,0.07),
xlab="Duration",cex.axis=0.8,cex.main=0.8,main="")
curve (dpexp(x,rates,breakpoints[-length(breakpoints)]),
add=TRUE,lwd=2,col="red")
box()
Consider the 2021 period death rates of the population of the United States, by single years of age and sex. The data are included in Human Mortality Database. They are also included in the VirtualPop package as data object \(rates\). Consider a virtual population of 2000 individuals, 1000 males and 1000 females. To generate lifespans that are consistent with the empirical age-specific death rates, the highest age possible must be defined. The maximum age is set to be 120. The death rate for persons aged 110-120 applies to all survivors at ages above 110. The following code simulates lifespans and displays the lifespans of the first six individuals in the virtual population.
# Get the data object "rates" from the VirtualPop package
rates <- NULL
data(rates,package="VirtualPop")
# paramters
countrycode <- attr(rates,"country")
refyear <- attr(rates,"year")
z <- rownames(rates$ASDR)
ages <- as.numeric(z)
breakpoints <- c(ages,120)
# rates
ratesm <- rates$ASDR[,1]
ratesf <- rates$ASDR[,2]
# Create dataframe with individual ages at death
nsample <- 2000 # sample size
d <- data.frame(sex=sample(x=c(1,2),size=nsample,replace=TRUE,prob=c(0.5,0.5)))
d$sex <-factor(d$sex,levels=c(1,2),labels=c("Male","Female"))
nmales <- length(d$sex[d$sex=="Male"])
nfemales <- length(d$sex[d$sex=="Female"])
d$x_D <- NA
d$x_D[d$sex=="Male"] <- VirtualPop::r.pw_exp (n=nmales, breakpoints,
rates=ratesm)
d$x_D[d$sex=="Female"] <- VirtualPop::r.pw_exp (n=nfemales, breakpoints,
rates=ratesf)
The object d may also be obtained using the function Lifespan() of VirtualPop.
nsample <- 20000
dd <- data.frame(ID=1:nsample)
dd$bdated <- 2000
dd$sex <- sample(x=c(1,2),size=nsample,replace=TRUE,prob=c(0.5,0.5))
dd$sex <- factor(dd$sex,levels=c(1,2),labels=c("Male","Female"))
dd <- VirtualPop::Lifespan(data=dd, ASDR=rates$ASDR, mort = NULL)
The distribution of the simulated ages at death is shown in Figure 5.
dd$x_D[dd$x_D>110] <- 110
binwidth <- (max(dd$x_D,na.rm=TRUE)-min(dd$x_D,na.rm=TRUE))/60
require (ggplot2)
p <- ggplot() +
geom_histogram(data=dd,aes(x=x_D,color=sex,fill=sex,y=after_stat(density)),
alpha=0.5,position="dodge",binwidth=binwidth)
xmin <- 0
xmax <- 110
p <- p + xlab("Age")+ylab("Density")
p <- p + scale_x_continuous(breaks=seq(xmin,xmax,by=10))
p <- p + scale_y_continuous (breaks=seq(0,0.04,by=0.005))
p <- p + theme(legend.position = c(0.1, 0.85))
p
The simulated waiting times may be used for further analysis. To illustrate the process, let’s estimate the survival function shown in Figure 3. The Kaplan-Meier estimator of the survival function and its 95 percent confidence interval are shown in Figure 6. The figure also shows the survival function obtained directly from the cumulative hazard: \(S_{X}(x) = exp\left\lbrack - \Lambda(x) \right\rbrack\) and the exponential survival function with parameter the mean event rate \(S_{X}(x) = exp\left\lbrack - \mu^{*}\ x \right\rbrack\) with \(\mu^{*}\) the mean event rate (computed from the simulated waiting times). The code is
nsample <- 1000
breakpoints <- c(0, 10, 20, 30, 60)
rates <- c(0.01,0.02,0.04,0.15)
pw_sample <- VirtualPop::r.pw_exp (n=nsample, breakpoints, rates=rates)
KM <- survival::survfit(survival::Surv(pw_sample)~1)
The survival function is KM$surv and the standard error is KM$std.err. For detailed results, type summary(KM) and str(KM).
The code to produce Figure 6 is
max <- max(breakpoints)
x <- seq(0,max, by=0.1)
# Cumulative hazard
H <- VirtualPop::H_pw(x, breakpoints,rates)
dd <- data.frame(x=x,y=exp(-H))
# Plot KM estimator
p <- survminer::ggsurvplot(KM,data=data.frame(pw_sample),conf.int = TRUE,
ggtheme = theme_bw())
p <- p$plot + geom_step(data=dd,mapping=aes(x=x,y=y),linetype="dashed",
color="blue")
p <- p + xlab("Duration") + ylab("Survival function")
p <- p + geom_text(x=45, y=0.8, label="KM estimate of survival function",
color="red")
p <- p + geom_text(x=45, y=0.7, label="Survival function",color="blue")
p <- p + theme(legend.position="none")
p
The simulated waiting time data may be used to estimate piecewise-constant hazard rates and reproduce the input rates. The package eha for event history analysis [@brostrom2021event] has a function that does the job. It is the function piecewise^[The function pch() of the eha package yields several distributions of the piecewise-constant hazards (pch) distribution: the density, distribution function, quantile function, cumulative hazard function, and random number generation.]. Let’s use the waiting times generated by the rpexp function of the msm package. The code to compute the hazard rates is shown below and the estimated rates are shown in Table 2, together with the rates used in the simulation.
breakp2 <- breakpoints[2:(length(breakpoints)-1)]
x <- seq(0,max, by=0.2)
## waiting time = pw.sample or
pw <- msm::rpexp(n=nsample,rate=rates,t=breakpoints[-length(breakpoints)])
pw_rates <- eha::piecewise (enter=rep(0,nsample),exit=pw,
event=rep(1,nsample),cutpoints=breakp2)
aa <-data.frame(From=breakpoints[-length(breakpoints)],
To=breakpoints[-1],
Events=pw_rates$events,
Exposure=round(pw_rates$exposure,0),
Rates_estimated=round(pw_rates$intensity,4),
Rates_input=rates)
out2 <- knitr::kable(aa,
caption = paste0("Computation of hazard rates in piecewise-exponential",
"model with simulated data"),
format = 'latex', booktabs = T,linesep = "")
kableExtra::kable_styling(out2,latex_options="HOLD_position")
\begin{table}[H]
\caption{\label{tab:2_test17}Computation of hazard rates in piecewise-exponentialmodel with simulated data} \centering \begin{tabular}[t]{rrrrrr} \toprule From & To & Events & Exposure & Rates_estimated & Rates_input\ \midrule 0 & 10 & 102 & 9530 & 0.0107 & 0.01\ 10 & 20 & 138 & 8231 & 0.0168 & 0.02\ 20 & 30 & 257 & 6199 & 0.0415 & 0.04\ 30 & 60 & 503 & 3406 & 0.1477 & 0.15\ \bottomrule \end{tabular} \end{table}
The object \(pw\_rates\) has three components: (a) the estimated number of events by duration category, (b) the estimated exposure time by duration category, and © the hazard rate by duration category. The input data (rates) are also added. The measures are shown in Table \ref{tab:2_test17}. The last column shows the piecewise-constant hazard rate function used as input to generate waiting times.