ADPO: automatic-differentiation-assisted parametric optimization

Appendix A Finite difference

FD requires a properly chosen step size h to obtain approximate gradients. Despite the existence of high-accuracy FD formulas [50] whose error can be smaller than \(\mathcal{O}(h^2)\), forward difference and central difference are commonly used.

Taking the lth element of the gradient \(\varvec{g}_i\) in Eq. 1 as an example (neglecting the negative sign for convenience), i.e., the derivative \(\frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}}\), from the Taylor expansion [20] we obtain

$$\begin{aligned} \frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}} = \frac{l_i(..., \eta_{i,l} + h, ...) - l_i(\varvec{\eta}_i)}{h} + \mathcal{O}(h). \end{aligned}$$

(A1)

For forward difference, \(\frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}}\) is approximated by the first term in Eq. A1,

$$\begin{aligned} \frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}} \approx \frac{l_i(..., \eta_{i,l} + h, ...) - l_i(\varvec{\eta}_i)}{h}. \end{aligned}$$

(A2)

Usually \(l_i(\varvec{\eta}_i)\) is already calculated, so one additional function evaluation \(l_i(..., \eta_{i,l} + h, ...)\) is needed to obtain the approximated derivative \(\frac{l_i(..., \eta_{i,l} + h, ...) - l_i(\varvec{\eta}_i)}{h}\) in forward difference. To obtain the n derivatives in \(\varvec{g}_i\), n additional function evaluations are required, so FD can be computationally expensive.
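As a concrete illustration of this cost, the following sketch (Python; a toy stand-in, not the Phoenix® NLME™ implementation) computes a forward-difference gradient by reusing the already-available \(l_i(\varvec{\eta}_i)\); the quadratic toy log-likelihood and the step size are illustrative assumptions.

```python
import numpy as np

def fd_gradient(f, eta, h=1e-6):
    """Forward-difference gradient of f at eta.

    f(eta) is computed once and reused, so approximating all n partial
    derivatives costs n *additional* function evaluations (Eq. A2).
    """
    f0 = f(eta)                       # usually already available
    grad = np.empty_like(eta)
    for l in range(eta.size):         # one extra evaluation per element
        eta_h = eta.copy()
        eta_h[l] += h
        grad[l] = (f(eta_h) - f0) / h
    return grad

# toy joint log-likelihood standing in for l_i(eta_i)
l_i = lambda eta: -0.5 * np.sum(eta**2)
print(fd_gradient(l_i, np.array([0.3, -1.2, 0.7])))   # ~ -eta
```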

The difference between the approximated derivative and the true derivative \(\frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}}\) is \(\mathcal{O}(h)\), the truncation error coming from the truncated Taylor expansion in Eq. A1. For gradient-based algorithms such as BFGS, which rely heavily on the gradients to guide the search for the optimal parameters, it is non-trivial to dynamically pick a proper step size h within each iteration. An improper choice of h makes the gradients obtained from FD suffer from either round-off or truncation errors [15], described as follows.

Truncation error:

If the step size h is too big, the truncation error \(\mathcal{O}(h)\) cannot be neglected, and the approximated derivative no longer represents the true derivative; the derivative computed using FD is then inaccurate due to truncation error.

Round-off error:

This error is introduced by the limits of machine precision: a computer cannot represent real numbers with arbitrary accuracy. If the step size h is too small, the truncation error \(\mathcal{O}(h)\) becomes negligible, but the subtraction of two nearly equal values, \(l_i(..., \eta_{i,l} + h, ...) - l_i(\varvec{\eta}_i)\), followed by the division by the very small h, cannot be computed accurately; the derivative computed using FD is then inaccurate due to round-off error.

For central difference, the derivative \(\frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}}\) is approximated by

$$\begin{aligned} \frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}} \approx \frac{l_i(..., \eta_{i,l} + h, ...) - l_i(..., \eta_{i,l} - h, ...)}{2h}, \end{aligned}$$

(A3)

which comes from the Taylor expansion

$$\begin{aligned} \frac{\partial l_i(\varvec{\eta}_i)}{\partial \eta_{i,l}}&= \frac{l_i(..., \eta_{i,l} + h, ...) - l_i(..., \eta_{i,l} - h, ...)}{2h} \nonumber \\&+ \mathcal{O}(h^2). \end{aligned}$$

(A4)

Compared with forward difference, the truncation error of central difference is \(\mathcal{O}(h^2)\) instead of \(\mathcal{O}(h)\), so central difference can be more accurate. However, central difference requires an additional evaluation of \(l_i(..., \eta_{i,l} - h, ...)\), making it twice as expensive as forward difference. Considering further that truncation and round-off errors still exist, forward difference is the first choice of FD in Phoenix® NLME™; central difference is used only when forward difference fails.
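The two error regimes and the \(\mathcal{O}(h)\) versus \(\mathcal{O}(h^2)\) behavior can be observed with a short numerical experiment; the test function \(\sin\) and the sweep of step sizes below are arbitrary illustrative choices:

```python
import numpy as np

f = np.sin                     # test function with a known derivative
x, true = 1.0, np.cos(1.0)

for h in [10.0**(-k) for k in range(1, 13)]:
    fwd = (f(x + h) - f(x)) / h              # O(h) truncation error
    cen = (f(x + h) - f(x - h)) / (2 * h)    # O(h^2) truncation error
    print(f"h={h:.0e}  fwd err={abs(fwd - true):.2e}  "
          f"cen err={abs(cen - true):.2e}")
```

As h decreases, both errors first shrink (truncation-dominated) and then grow again (round-off-dominated), with central difference reaching its minimum error at a larger h than forward difference.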

In addition to FD, there are other step-size-based methods, such as the complex-step method [16]. Although it can bypass the round-off and truncation errors of FD and achieve accurate gradients, the use of complex arithmetic incurs extra overhead.
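A minimal sketch of the complex-step idea (illustrative; not part of Phoenix® NLME™): for a real-analytic f, \(f(x+ih) \approx f(x) + ihf'(x)\), so the derivative is recovered from the imaginary part with no subtraction of nearby values:

```python
import numpy as np

def complex_step(f, x, h=1e-20):
    # No difference of nearly equal numbers, so no round-off
    # cancellation; h can be made tiny without losing accuracy.
    return np.imag(f(x + 1j * h)) / h

print(complex_step(np.sin, 1.0) - np.cos(1.0))   # ~0, machine precision
```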

Appendix B Hardware and software

The ADPO and FOCE-FD benchmarks are performed on an Amazon WorkSpaces (AWS) machine (operating system: Windows Server 2019 Datacenter) configured with 4 CPU cores of an Intel Xeon Platinum 8488C CPU running at 2.4 GHz and 32 GB of memory. For all the FOCE-FD and ADPO runs, we use MPI (Message Passing Interface) with 4 CPU cores and the default ODE solver settings in Phoenix® NLME™.

Phoenix® NLME™ 8.6 is used. We test the performance of ADPO and FOCE-FD using all the available ODE solvers provided in Phoenix® NLME™ 8.6, namely the:

‘Matrix Exponent’ solver.

‘Auto-Detect’ solver.

‘Non-Stiff DVERK’ solver.

‘Non-Stiff DOPRI5’ solver.

‘Stiff’ solver.

Since these solvers are used in this paper, we briefly summarize their background and main features in Appendix C to help readers better understand the results. More details can be found in [51].

We also use the random-pause sampling method [13] to identify what the source code is doing when paused (see Section “Confirming the bottleneck of FOCE-FD”). This was achieved by using the debugging tools in Microsoft Visual Studio Enterprise 2019 to show the call-stack map, on a laptop (operating system: Windows 10 Pro) with an Intel Core i7-1260P CPU running at 3.4 GHz and 40 GB of memory.

The same laptop is also used to run the modified Phoenix® NLME™ 8.6 source code to which additional performance-related statistics are added (see Section “Methods for performance evaluations”): the total numbers of outer-loop BFGS iterations, outer-loop backtrackings, inner-loop BFGS iterations, and inner-loop backtrackings.

Appendix C Brief summary of the ODE solvers in Phoenix® NLME™

Since we test the performance of ADPO and FOCE-FD using all the available ODE solvers provided in Phoenix® NLME™, we list each ODE solver below and briefly summarize its background and main features to help readers better understand them. More details can be found in [51].

‘Matrix Exponent’. It is based on the DGPADM (Double-precision General Padé Approximation for Matrix) program [52], which solves a system of linear ODEs by computing matrix exponentials with the ‘scaling and squaring’ method combined with Padé approximation. It essentially solves the ODEs in closed form. Therefore, whenever the system of ODEs is linear, it should be the first choice: it is faster and more accurate than the other numerical ODE solvers in Phoenix® NLME™, which incur time-step errors, and it does not suffer from stiffness of the ODEs.
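To illustrate what solving linear ODEs in closed form by matrix exponentials means, the sketch below uses SciPy's `expm` (also a scaling-and-squaring Padé implementation) as a stand-in for DGPADM; the two-compartment rate matrix is an arbitrary example, not one of the models in this paper:

```python
import numpy as np
from scipy.linalg import expm

# Linear system dA/dt = M @ A with constant coefficients
# (illustrative two-compartment rates)
M = np.array([[-0.5,  0.0],
              [ 0.5, -0.2]])
A0 = np.array([100.0, 0.0])        # bolus into the first compartment

# Closed-form solution A(t) = expm(M t) @ A(0): no time-step error
for t in (1.0, 6.0, 24.0):
    print(t, expm(M * t) @ A0)
```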

‘Auto-Detect’. It is based on the LSODA (Livermore Solver for Ordinary Differential Equations with Automatic method switching) solver [39], a robust adaptive-step solver that automatically switches between non-stiff and stiff methods: Adams (predictor-corrector) methods in the non-stiff case and Backward Differentiation Formula (BDF, i.e., Gear) methods in the stiff case. Its speed may be slower than that of the ‘Non-Stiff DVERK’ solver.

‘Non-Stiff DVERK’. It is based on the DVERK (Double-precision Variable-step Runge-Kutta) solver [53], an explicit Runge-Kutta method built on Verner's 5th- and 6th-order pair of formulas, with an adaptive step-size mechanism to control the error and improve efficiency. Due to its fast and robust performance, it is one of the widely used ODE solvers for non-stiff problems, and it is the default solver in Phoenix® NLME™ when the matrix exponent solver is not applicable. For stiff problems, however, DVERK is not efficient, and integration errors are expected.

‘Non-Stiff DOPRI5’. It is based on the widely used DOPRI5 (Dormand-Prince 5) solver [54] for non-stiff problems. It implements the Dormand-Prince 5(4) method, an explicit Runge-Kutta method of order 5 with an embedded 4th-order method for adaptive step-size control. For certain stiff problems, DOPRI5 may be more reliable and efficient than the ‘Non-Stiff DVERK’ solver.

‘Stiff’. It is based on the LSODE (Livermore Solver for Ordinary Differential Equations) solver [39], a general-purpose solver that can handle both stiff and non-stiff ODE systems. Unlike LSODA, it does not automatically switch between Adams and BDF methods. In Phoenix® NLME™, it is configured to use BDF methods to solve stiff problems, hence the name ‘Stiff’.
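The practical difference between explicit non-stiff solvers and BDF-type stiff solvers can be reproduced with SciPy analogues (an illustrative sketch, not the Phoenix® NLME™ solvers themselves): `RK45` is SciPy's Dormand-Prince 5(4) pair, comparable to ‘Non-Stiff DOPRI5’, while `LSODA` and `BDF` play the roles of ‘Auto-Detect’ and ‘Stiff’. The toy system below is stiff because its rate constants differ by about three orders of magnitude; note how many more steps the explicit method needs:

```python
import numpy as np
from scipy.integrate import solve_ivp

# Toy stiff linear system: eigenvalues near -1000 and -2
def rhs(t, y):
    return [-1000.0 * y[0] + y[1], y[0] - 2.0 * y[1]]

y0, t_span = [1.0, 0.0], (0.0, 10.0)

for method in ("RK45", "LSODA", "BDF"):
    sol = solve_ivp(rhs, t_span, y0, method=method)
    print(f"{method:6s} steps={sol.t.size:5d}  y(10)={sol.y[:, -1]}")
```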

Appendix D Model definitions and datasets

D.1 Model 1: one-compartment model with first-order clearance

We report the results of a one-compartment model with first-order clearance, where absorption is governed by a transit compartment model consisting of three compartments. In this model, the fixed effects parameter vector \(\varvec{\theta}\) is

$$\begin{aligned} \varvec{\theta} = (\theta_{V},\theta_{Cl},\theta_{T_{\text{tr}}})^T, \end{aligned}$$

(D7)

where the superscript T on the RHS means ‘transpose’, so that \(\varvec{\theta}\) is a column vector. The covariance matrix of random effects \(\varvec{\Omega}\) is a diagonal matrix,

$$\begin{aligned} \varvec{\Omega} = \text{diag}(\omega_{V},\omega_{Cl},\omega_{T_{\text{tr}}}), \end{aligned}$$

(D8)

which is the covariance matrix of the random effect vector \(\varvec{\eta}\), where

$$\begin{aligned} \varvec{\eta} = (\eta_{V},\eta_{Cl},\eta_{T_{\text{tr}}})^T. \end{aligned}$$

(D9)

This model is governed by a system of linear ODEs with constant coefficients, so all the ODE solvers provided in Phoenix® NLME™ can be used for this model. For subject i, the ODEs of the model are listed below:

$$\begin{aligned} \frac{dA_{\text{tr}_1}}{dt}&= -K_{\text{tr}} \times A_{\text{tr}_1}, \end{aligned}$$

(D10)

$$\begin{aligned} \frac{dA_{\text{tr}_2}}{dt}&= K_{\text{tr}} \times (A_{\text{tr}_1} - A_{\text{tr}_2}), \end{aligned}$$

(D11)

$$\begin{aligned} \frac{dA_{\text{tr}_3}}{dt}&= K_{\text{tr}} \times (A_{\text{tr}_2} - A_{\text{tr}_3}), \end{aligned}$$

(D12)

$$\begin{aligned} \frac{dA_1}{dt}&= K_{\text{tr}} \times A_{\text{tr}_3} - Cl \times C, \end{aligned}$$

(D13)

where

$$\begin{aligned} K_{\text{tr}}&= \frac{3}{T_{\text{tr}}}, \end{aligned}$$

(D14)

$$\begin{aligned} V&= \theta_{V} \exp(\eta_{V}), \end{aligned}$$

(D15)

$$\begin{aligned} Cl&= \theta_{Cl} \exp(\eta_{Cl}), \end{aligned}$$

(D16)

$$\begin{aligned} T_{\text{tr}}&= \theta_{T_{\text{tr}}} \exp(\eta_{T_{\text{tr}}}), \end{aligned}$$

(D17)

$$\begin{aligned} C&= \frac{A_1}{V}. \end{aligned}$$

(D18)

At time \(t=0\), \(A_{\text{tr}_2}\), \(A_{\text{tr}_3}\), and \(A_1\) are all zero, while the value of \(A_{\text{tr}_1}\) is given by the input bolus dose it receives.

The concentration for subject i at time \(t_j\) is given by,

$$\begin{aligned} C_{ij} = C(t_j) (1 + \epsilon), \end{aligned}$$

(D19)

where \(C(t_j)\) is the concentration C in Eq. D18 at time \(t_j\), and the noise \(\epsilon\) is assumed to be a Gaussian random variable with mean 0 and standard deviation \(\sigma\).

For the simulated dataset, we use 80 subjects. For each subject, we set the observation times (in hours) to 2, 4, 6, 8, ..., 24. At time 0, a bolus dose of 100 mg is added to the transit compartment \(A_{\text{tr}_1}\). The values of \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\) used to generate the simulated dataset are listed in the ‘True Value’ column of Table 2.
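For readers who want to reproduce the flavor of this dataset, the sketch below simulates one typical subject of Eqs. D10-D19 with SciPy (random effects set to zero). The parameter values and \(\sigma\) are placeholders, not the ‘True Value’ column of Table 2, and \(K_{\text{tr}} = 3/T_{\text{tr}}\) follows Eq. D14:

```python
import numpy as np
from scipy.integrate import solve_ivp

rng = np.random.default_rng(0)

V, Cl, T_tr, sigma = 5.0, 1.0, 4.0, 0.1   # illustrative values only
K_tr = 3.0 / T_tr                          # Eq. D14

def rhs(t, A):
    A_tr1, A_tr2, A_tr3, A1 = A
    C = A1 / V                             # Eq. D18
    return [-K_tr * A_tr1,                 # Eq. D10
            K_tr * (A_tr1 - A_tr2),        # Eq. D11
            K_tr * (A_tr2 - A_tr3),        # Eq. D12
            K_tr * A_tr3 - Cl * C]         # Eq. D13

t_obs = np.arange(2.0, 25.0, 2.0)          # observations at 2, 4, ..., 24 h
sol = solve_ivp(rhs, (0.0, 24.0), [100.0, 0.0, 0.0, 0.0],  # 100 mg in A_tr1
                t_eval=t_obs, rtol=1e-8, atol=1e-10)
C_obs = (sol.y[3] / V) * (1.0 + sigma * rng.standard_normal(t_obs.size))  # Eq. D19
print(np.round(C_obs, 3))
```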

D.2 Model 2: one-compartment quasi-equilibrium target-mediated drug disposition (QE-TMDD) model

We report the results of a one-compartment quasi-equilibrium target-mediated drug disposition (QE-TMDD) model, assuming a constant total receptor concentration. In this model, the fixed effects parameter vector \(\varvec{\theta}\) is

$$\begin{aligned} \varvec{\theta} = (\theta_{V},\theta_{K_{\text{el}}},\theta_{R_{\text{tot}}},\theta_{K_{\text{int}}},\theta_{K_D})^T. \end{aligned}$$

(D20)

The covariance matrix of random effects \(\varvec{\Omega}\) is a diagonal matrix,

$$\begin{aligned} \varvec{\Omega} = \text{diag}(\omega_{V},\omega_{K_{\text{el}}},\omega_{K_{\text{int}}}), \end{aligned}$$

(D21)

which is the covariance matrix of the random effect vector \(\varvec{\eta}\), where

$$\begin{aligned} \varvec{\eta} = (\eta_{V},\eta_{K_{\text{el}}},\eta_{K_{\text{int}}})^T. \end{aligned}$$

(D22)

This model is described by non-linear ODEs, so the matrix exponential ODE solver cannot be used. For subject i, the ODE of the model is:

$$\begin{aligned} \frac{dA_{\text{tot}}}{dt}&= -K_{\text{int}} \times A_{\text{tot}} - (K_{\text{el}}-K_{\text{int}}) \times C \times V, \end{aligned}$$

(D23)

where

$$\begin{aligned} V&= \theta_{V} \exp(\eta_{V}), \end{aligned}$$

(D24)

$$\begin{aligned} K_{\text{el}}&= \theta_{K_{\text{el}}} \exp(\eta_{K_{\text{el}}}), \end{aligned}$$

(D25)

$$\begin{aligned} R_{\text{tot}}&= \theta_{R_{\text{tot}}}, \end{aligned}$$

(D26)

$$\begin{aligned} K_{\text{int}}&= \theta_{K_{\text{int}}} \exp(\eta_{K_{\text{int}}}), \end{aligned}$$

(D27)

$$\begin{aligned} K_D&= \theta_{K_D}, \end{aligned}$$

(D28)

$$\begin{aligned} C_{\text{tot}}&= \frac{A_{\text{tot}}}{V}, \end{aligned}$$

(D29)

$$\begin{aligned} C&= \frac{(C_{\text{tot}}-R_{\text{tot}}-K_D)}{2} \nonumber \\&+ \frac{\sqrt{(C_{\text{tot}}-R_{\text{tot}}-K_D)^2 + 4\times K_D \times C_{\text{tot}}}}{2}. \end{aligned}$$

(D30)

At time \(t=0\), the value of \(A_{\text{tot}}\) is given by the input bolus dose it receives.
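The free concentration in Eq. D30 is simply the positive root of the quasi-equilibrium binding quadratic; a minimal sketch with illustrative numbers:

```python
import numpy as np

def free_conc(C_tot, R_tot, K_D):
    """Free drug concentration C of Eq. D30 (quasi-equilibrium binding
    with constant total receptor R_tot)."""
    b = C_tot - R_tot - K_D
    return 0.5 * b + 0.5 * np.sqrt(b**2 + 4.0 * K_D * C_tot)

# limiting behavior (illustrative numbers):
print(free_conc(10.0, 1e-6, 1.0))    # ~10: negligible receptor, C ~ C_tot
print(free_conc(10.0, 100.0, 1.0))   # ~0.1: receptor excess, most drug bound
```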

The concentration for subject i at time \(t_j\) is given by \(C_{ij} = C(t_j) (1 + \epsilon)\), where \(C(t_j)\) is the concentration C in Eq. D30 at time \(t_j\), and the noise \(\epsilon\) is assumed to be a Gaussian random variable with mean 0 and standard deviation \(\sigma\).

For the simulated dataset, we use 120 subjects. Each subject has multiple observation times (in hours) between 0 and 80. At time 0, a bolus dose (in mg) is added to \(A_{\text{tot}}\). The observation times and the dose amounts can be found in the dataset in the corresponding Phoenix project file. The values of \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\) used to generate the simulated dataset are listed in the ‘True Value’ column of Table 3.

D.3 Model 3: two-compartment model with first-order clearance

We report the results of a two-compartment model with first-order clearance. In this model, the fixed effects parameter vector \(\varvec{\theta}\) is

$$\begin{aligned} \varvec{\theta} = (\theta_{V},\theta_{Cl},\theta_{K_a},\theta_{V_2},\theta_{Cl_2},\theta_{F_{\text{oral}}})^T. \end{aligned}$$

(D31)

The covariance matrix of random effects \(\varvec{\Omega}\) is a diagonal matrix,

$$\begin{aligned} \varvec{\Omega} = \text{diag}(\omega_{V},\omega_{Cl},\omega_{K_a},\omega_{F_{\text{oral}}}), \end{aligned}$$

(D32)

which is the covariance matrix of the random effect vector \(\varvec{\eta}\), where

$$\begin{aligned} \varvec{\eta} = (\eta_{V},\eta_{Cl},\eta_{K_a},\eta_{F_{\text{oral}}})^T. \end{aligned}$$

(D33)

This model is governed by a system of linear ODEs with constant coefficients, so all the ODE solvers provided in Phoenix® NLME™ can be used. For subject i, the ODEs of the model are:

$$\begin{aligned} \frac{dA_a}{dt}&= -A_a \times K_a, \end{aligned}$$

(D34)

$$\begin{aligned} \frac{dA_1}{dt}&= - Cl \times C + A_a \times K_a - Cl_2 \times (C-C_2), \end{aligned}$$

(D35)

$$\begin{aligned} \frac{dA_2}{dt}&= Cl_2 \times (C-C_2), \end{aligned}$$

(D36)

where

$$\begin{aligned} V&= \theta_{V} \exp(\eta_{V}), \end{aligned}$$

(D37)

$$\begin{aligned} Cl&= \theta_{Cl} \exp(\eta_{Cl}), \end{aligned}$$

(D38)

$$\begin{aligned} K_a&= \theta_{K_a} \exp(\eta_{K_a}), \end{aligned}$$

(D39)

$$\begin{aligned} V_2&= \theta_{V_2}, \end{aligned}$$

(D40)

$$\begin{aligned} Cl_2&= \theta_{Cl_2}, \end{aligned}$$

(D41)

$$\begin{aligned} F_{\text{oral}}&= \theta_{F_{\text{oral}}} + \eta_{F_{\text{oral}}}, \end{aligned}$$

(D42)

$$\begin{aligned} C_2&= \frac{A_2}{V_2}, \end{aligned}$$

(D43)

$$\begin{aligned} C&= \frac{A_1}{V}. \end{aligned}$$

(D44)

The unit of time is hours. At time \(t=0\), both \(A_1\) and \(A_2\) are zero, while the value of \(A_a\) is given by the input oral bolus dose multiplied by the bioavailability \(\text{ilogit}(F_{\text{oral}})\), where \(\text{ilogit}(\cdot)\) is the inverse logit function, i.e., \(\text{ilogit}(F_{\text{oral}}) = \frac{\exp(F_{\text{oral}})}{1+\exp(F_{\text{oral}})}\). At time \(t=72\), \(A_1\) receives an intravenous (IV) bolus. Absorption is assumed to follow first-order kinetics.
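A small sketch of this dosing rule (the logit-scale value of \(F_{\text{oral}}\) below is illustrative, not a fitted estimate):

```python
import numpy as np

def ilogit(x):
    # inverse logit keeps the bioavailability within (0, 1)
    return np.exp(x) / (1.0 + np.exp(x))

F_oral = 1.3                        # some logit-scale value of Eq. D42
A_a0 = 100.0 * ilogit(F_oral)       # oral bolus of 100, scaled at t = 0
print(A_a0)                         # about 78.6
```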

The concentration for subject i at time \(t_j\) is given by \(C_{ij} = C(t_j) (1 + \epsilon)\), where \(C(t_j)\) is the concentration C in Eq. D44 at time \(t_j\), and the noise \(\epsilon\) is assumed to be a Gaussian random variable with mean 0 and standard deviation \(\sigma\).

For the simulated dataset, we use 100 subjects. Each subject has multiple observation times (in hours) between 0 and 100. All subjects receive an oral bolus at time 0, followed by an intravenous (IV) bolus at time 72. The dataset can be found in the corresponding Phoenix project file. The values of \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\) used to generate the simulated dataset are listed in the ‘True Value’ column of Table 4.

D.4 Model 4: two-compartment voriconazole model with Michaelis-Menten kinetics

We report the results of a two-compartment voriconazole model with Michaelis-Menten kinetics. This model has been used in realistic cases and in various references [33, 41]. In this model, the fixed effects parameter vector \(\varvec{\theta}\) is

$$\begin{aligned} \varvec{\theta} = (\theta_{K_a},\theta_{V_{\text{max,std}}},\theta_{K_m},\theta_{V_{\text{std}}},\theta_{F_{\text{oral}}},\theta_{K_{12}},\theta_{K_{21}})^T. \end{aligned}$$

(D45)

The covariance matrix of random effects \(\varvec{\Omega}\) is a diagonal matrix,

$$\begin{aligned} \varvec{\Omega} = \text{diag}(\omega_{K_a},\omega_{V_{\text{max,std}}},\omega_{K_m},\omega_{V_{\text{std}}},\omega_{F_{\text{oral}}},\omega_{K_{12}},\omega_{K_{21}}), \end{aligned}$$

(D46)

which is the covariance matrix of the random effect vector \(\varvec{\eta}\),

$$\begin{aligned} \varvec{\eta} = (\eta_{K_a},\eta_{V_{\text{max,std}}},\eta_{K_m},\eta_{V_{\text{std}}},\eta_{F_{\text{oral}}},\eta_{K_{12}},\eta_{K_{21}})^T. \end{aligned}$$

(D47)

This model is governed by a system of non-linear ODEs, so the matrix exponential ODE solver cannot be used. For subject i, the ODEs of the model are:

$$\begin{aligned} \frac{dA_a}{dt}&= -K_a \times A_a, \end{aligned}$$

(D48)

$$\begin{aligned} \frac{dA_1}{dt}&= K_a \times A_a + r_{\text{IV}}(t) - \frac{V_{\max} \times A_1}{K_m \times V+A_1} \nonumber \\&-K_{12}\times A_1+K_{21}\times A_2, \end{aligned}$$

(D49)

$$\begin{aligned} \frac{dA_2}{dt}&= K_{12}\times A_1 - K_{21}\times A_2, \end{aligned}$$

(D50)

where

$$\begin{aligned} K_a&= \theta_{K_a} \exp(\eta_{K_a}), \end{aligned}$$

(D51)

$$\begin{aligned} V_{\text{max,std}}&= \theta_{V_{\text{max,std}}} \exp(\eta_{V_{\text{max,std}}}), \end{aligned}$$

(D52)

$$\begin{aligned} K_m&= \theta_{K_m} \exp(\eta_{K_m}), \end{aligned}$$

(D53)

$$\begin{aligned} V_{\text{std}}&= \theta_{V_{\text{std}}} \exp(\eta_{V_{\text{std}}}), \end{aligned}$$

(D54)

$$\begin{aligned} F_{\text{oral}}&= \theta_{F_{\text{oral}}} \exp(\eta_{F_{\text{oral}}}), \end{aligned}$$

(D55)

$$\begin{aligned} K_{12}&= \theta_{K_{12}} \exp(\eta_{K_{12}}), \end{aligned}$$

(D56)

$$\begin{aligned} K_{21}&= \theta_{K_{21}} \exp(\eta_{K_{21}}), \end{aligned}$$

(D57)

$$\begin{aligned} V_{\max}&= V_{\text{max,std}} \times wt^{0.75}, \end{aligned}$$

(D58)

$$\begin{aligned} V&= V_{\text{std}} \times wt, \end{aligned}$$

(D59)

$$\begin{aligned} C&= \frac{A_1}{V}. \end{aligned}$$

(D60)

The unit of time is hours. The rate \(r_{\text{IV}}(t)\) is the ratio between the IV dose and its infusion duration; it is zero unless there is an IV dose with a non-zero duration. At time \(t=0\), both \(A_a\) and \(A_2\) are zero, while \(A_1\) receives an IV dose with a non-zero duration, so \(r_{\text{IV}}(t)\) is non-zero during that duration. At a later time, \(A_a\) receives a bolus dose with zero duration, so its value is incremented by the bolus dose multiplied by the bioavailability \(F_{\text{oral}}\). The covariate weight is denoted by wt.

The concentration for subject i at time \(t_j\) is given by,

$$\begin{aligned} C_{ij} = C(t_j) + \epsilon_{ij}, \end{aligned}$$

(D61)

where \(C(t_j)\) is the concentration C in Eq. D60 at time \(t_j\), and the noise \(\epsilon_{ij}\) is assumed to be a Gaussian random variable whose standard deviation \(\sigma_{ij}\) has the following additive and multiplicative form [55]:

$$\begin{aligned} \sigma_{ij} = c_{\text{mult}} \times [ c_0 + c_1 \times C(t_j) ], \end{aligned}$$

(D62)

where \(c_{\text{mult}}\) is the residual error parameter \(\varvec{\sigma}\) for this model, and \(c_0\) and \(c_1\) are fixed at 0.02 and 0.1, respectively.
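A minimal sketch of this residual model (Eqs. D61-D62); the predicted concentrations and \(c_{\text{mult}}\) are illustrative:

```python
import numpy as np

rng = np.random.default_rng(1)

def observe(C_pred, c_mult, c0=0.02, c1=0.1):
    """Additive-plus-multiplicative residual model of Eqs. D61-D62:
    the standard deviation grows linearly with the prediction."""
    sigma = c_mult * (c0 + c1 * C_pred)                            # Eq. D62
    return C_pred + sigma * rng.standard_normal(np.shape(C_pred))  # Eq. D61

C_pred = np.array([0.5, 2.0, 8.0])   # illustrative model predictions
print(observe(C_pred, c_mult=1.0))
```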

For the simulated dataset, we use 50 subjects. Each subject has 24 observation times (in hours), i.e., t = 2, 4, 6, 8, ..., 48. All subjects receive an IV dose of 180 mg at time 0 with an infusion duration of 2 hours, followed by a bolus of 180 mg at time 24 with zero duration. The covariate wt (in kg) is Gaussian distributed with mean 16.5 and standard deviation 1.65. The dataset can be found in the corresponding Phoenix project file. The values of \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\) used to generate the simulated dataset are listed in the ‘True Value’ column of Table 5.

Appendix E The differences and connections among the FOCE ELS, Laplacian, and AGQ algorithms

In this paper, we use FOCE ELS to illustrate how ADPO works in the inner loop, because FOCE ELS is the most frequently used method in Phoenix® NLME™. Nevertheless, in Phoenix® NLME™ 8.6, the Laplacian method and the AGQ algorithm also support ADPO. Therefore, to help readers understand how ADPO works with the Laplacian method and AGQ, Appendices E.1 and E.2 briefly review the general ideas of the two and describe the differences and connections among FOCE ELS, the Laplacian method, and AGQ. More details can be found in [7, 24, 25].

E.1 The main difference between the Laplacian method and FOCE ELS

The Laplacian method [24] is very similar to FOCE ELS. It also benefits from ADPO, because the inner loops of the two are the same. The only difference lies in the outer loop's treatment of the Hessian matrix, which is defined as

$$\begin{aligned} \varvec{H}_i(\varvec{\eta}_i^*) \equiv \nabla^2 l_i(\varvec{\eta}_i^*), \end{aligned}$$

(E63)

where \(l_i(\varvec{\eta}_i)\) and \(\varvec{\eta}_i^*\) are the same as those defined in Section “BFGS for the inner loop”, i.e., the former is subject i's joint log-likelihood, and the latter is the corresponding optimal \(\varvec{\eta}^*_i\) such that \(-l_i(\varvec{\eta}_i^*)\) is minimized.

In the outer loop, when calculating the approximated overall data log-likelihood (Eq. (18) in [9]), the Laplacian method calculates \(\varvec{H}_i(\varvec{\eta}_i^*)\) from Eq. E63 via FD, which means \(\nabla^2 l_i(\varvec{\eta}_i^*)\) is calculated using FD for 2nd-order derivatives. In FOCE ELS, however, \(\nabla^2 l_i(\varvec{\eta}_i^*)\) is approximated, so that the elements of the Hessian matrix \(\varvec{H}_i(\varvec{\eta}_i^*)\) are calculated by Eq. (14) in [9], which avoids the FD calculations for 2nd-order derivatives. Therefore, the Laplacian method is slower than FOCE ELS, due to the additional function calls required by the FD calculations for 2nd-order derivatives.
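To see where the extra function calls come from, the sketch below builds \(\varvec{H}_i(\varvec{\eta}_i^*)\) with generic central second differences (an illustrative scheme, not necessarily the exact stencil used in Phoenix® NLME™); it needs \(\mathcal{O}(n^2)\) additional evaluations of \(l_i\):

```python
import numpy as np

def fd_hessian(f, x, h=1e-4):
    """Hessian of f at x by central second differences.

    Requires O(n^2) extra function evaluations, which is why the
    Laplacian method is slower than FOCE ELS."""
    n = x.size
    H = np.empty((n, n))
    f0 = f(x)
    for k in range(n):
        ek = np.zeros(n); ek[k] = h
        H[k, k] = (f(x + ek) - 2.0 * f0 + f(x - ek)) / h**2
        for l in range(k + 1, n):
            el = np.zeros(n); el[l] = h
            H[k, l] = H[l, k] = (f(x + ek + el) - f(x + ek - el)
                                 - f(x - ek + el) + f(x - ek - el)) / (4.0 * h**2)
    return H

l_i = lambda eta: -0.5 * eta @ np.diag([1.0, 2.0]) @ eta   # toy log-likelihood
print(fd_hessian(l_i, np.array([0.3, -0.1])))              # ~ -diag(1, 2)
```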

However, the approximated Hessian matrix in FOCE ELS makes it suitable only for Gaussian data (e.g., when the predicted concentration and the observed concentration differ by a Gaussian random variable), while the Laplacian method is applicable to both Gaussian and non-Gaussian data (e.g., count, categorical, and time-to-event data).

E.2 The connection between AGQ and the Laplacian method

The adaptive Gaussian quadrature (AGQ) algorithm [25] is a generalization of the Laplacian method; in other words, the Laplacian method is a special case of AGQ. AGQ does not rely on any assumption about the distribution of residual errors and is therefore applicable to both Gaussian and non-Gaussian data.

For the given \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\), in the outer loop the overall data likelihood is calculated using Eq. (9) in [9], which requires evaluating the integral \(\int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i\) for each subject i. The difference between the Laplacian method and AGQ is the way they evaluate this integral.

Since \(\varvec{\eta}_i\) contains n elements, \(\int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i\) is an n-dimensional integral. Generally speaking, AGQ uses Gauss-Hermite quadrature [26] along each of the n dimensions to numerically approximate the integral, such that

$$\begin{aligned}&\int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i \nonumber \\ \approx&\sum_{q_1=1}^{N_{\text{AGQ}}} ... \sum_{q_{n-1}=1}^{N_{\text{AGQ}}} \sum_{q_n=1}^{N_{\text{AGQ}}} f_{q_1 \ldots q_n}(\varvec{\eta}_i^*, \varvec{\theta},\varvec{\Omega},\varvec{\sigma}), \end{aligned}$$

(E64)

where \(N_{\text{AGQ}}\) is the number of sample points used along each dimension (the integer in the ‘N AGQ’ box in Fig. 1c). The function \(f_{q_1 \ldots q_n}(\varvec{\eta}_i^*, \varvec{\theta},\varvec{\Omega},\varvec{\sigma})\) is determined by the corresponding Gauss-Hermite quadrature.

Given the population parameters \(\varvec{\theta}\), \(\varvec{\Omega}\), and \(\varvec{\sigma}\), the value of the function \(f_{q_1 \ldots q_n}(\varvec{\eta}_i^*, \varvec{\theta},\varvec{\Omega},\varvec{\sigma})\) depends on the corresponding \(\varvec{\eta}_i^*\) that minimizes \(-l_i(\varvec{\eta}_i)\), so the inner loop BFGS optimization is needed to find \(\varvec{\eta}_i^*\). Therefore, AGQ can also benefit from the ADPO implemented in the inner loop.

The integral \(\int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i\) on the left-hand side (LHS) of Eq. E64 is called a marginal probability density function (PDF), while the RHS is the approximated marginal PDF. As the value of \(N_{\text{AGQ}}\) increases, the RHS gets closer and closer to the LHS. So, in principle, the approximated marginal PDF in AGQ can be made arbitrarily close to the true one, such that the overall data likelihood can also be accurately evaluated.

However, as shown on the RHS of Eq. E64, the number of functions \(f_{q_1 \ldots q_n}(\varvec{\eta}_i^*, \varvec{\theta},\varvec{\Omega},\varvec{\sigma})\) to be evaluated grows exponentially with n, the dimension of the integral. The complexity of the AGQ algorithm is \(\mathcal{O}(N_{\text{AGQ}}^n)\), which means AGQ may suffer from the ‘curse of dimensionality’ [56]. Therefore, AGQ is most suitable for models whose number of random effects parameters n is small.

On the other hand, AGQ is designed such that, when \(N_{\text{AGQ}}=1\), the RHS of Eq. E64 reduces to the approximation used in the Laplacian method, i.e., the integral \(\int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i\) is approximated by

$$\begin{aligned} \int e^{l_i(\varvec{\eta}_i)} d\varvec{\eta}_i \approx e^{l_i(\varvec{\eta}_i^*)} \left| \frac{-\varvec{H}_i(\varvec{\eta}_i^*)}{2\pi} \right|^{-\frac{1}{2}}, \end{aligned}$$

(E65)

where the Hessian matrix \(\varvec{H}_i(\varvec{\eta}_i^*)\) is the same one defined in Eq. E63. In other words, when \(N_{\text{AGQ}}=1\), AGQ is the Laplacian method; the latter is a special case of the former.
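The reduction of AGQ to the Laplacian method at \(N_{\text{AGQ}}=1\) can be checked in one dimension. The sketch below (a generic adaptive Gauss-Hermite layout, not necessarily Phoenix® NLME™'s) centers the nodes at the mode and scales them by the Laplace standard deviation; with \(N_{\text{AGQ}}=1\) it returns exactly the \(e^{l_i(\eta^*)}|-H/(2\pi)|^{-1/2}\) of Eq. E65, and larger \(N_{\text{AGQ}}\) converges to the exact integral:

```python
import numpy as np
from numpy.polynomial.hermite import hermgauss
from scipy.special import k0

def agq_1d(l, eta_star, H, n_agq):
    """1-D adaptive Gauss-Hermite approximation of int exp(l(eta)) d(eta),
    centered at the mode eta_star with curvature H = l''(eta_star) < 0."""
    s = 1.0 / np.sqrt(-H)                  # Laplace standard deviation
    x, w = hermgauss(n_agq)                # nodes/weights for weight e^{-u^2}
    eta = eta_star + np.sqrt(2.0) * s * x
    return np.sqrt(2.0) * s * np.sum(w * np.exp(x**2 + l(eta)))

# toy non-Gaussian joint log-likelihood: l(eta) = -cosh(eta),
# mode eta* = 0, Hessian l''(0) = -1, exact integral = 2*K0(1)
l = lambda eta: -np.cosh(eta)
print("exact   ", 2.0 * k0(1.0))
for n in (1, 3, 7, 15):
    print(f"N_AGQ={n:2d}", agq_1d(l, 0.0, -1.0, n))   # N_AGQ=1 is Laplace
```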
