From 5716fbc4ed33779d765020c4408587ed5d50ad06 Mon Sep 17 00:00:00 2001 From: Steve Bronder Date: Wed, 11 Feb 2026 15:01:39 -0500 Subject: [PATCH 1/5] update options for laplace --- src/functions-reference/embedded_laplace.qmd | 58 ++++++++++++-------- src/reference-manual/laplace.qmd | 4 +- 2 files changed, 37 insertions(+), 25 deletions(-) diff --git a/src/functions-reference/embedded_laplace.qmd b/src/functions-reference/embedded_laplace.qmd index 63346dbfb..6adb73751 100644 --- a/src/functions-reference/embedded_laplace.qmd +++ b/src/functions-reference/embedded_laplace.qmd @@ -4,23 +4,20 @@ pagetitle: Embedded Laplace Approximation # Embedded Laplace Approximation -The embedded Laplace approximation can be used to approximate certain -marginal and conditional distributions that arise in latent Gaussian models. -A latent Gaussian model observes the following hierarchical structure: +The embedded Laplace approximation can be used to approximate certain marginal and conditional distributions that arise in latent Gaussian models. +Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation at their conditional posterior mode, producing an approximation to the marginal likelihood. +Embedded Laplace allows a sampler to search a lower-dimensional marginal posterior over non-latent parameters instead of jointly sampling all latent effects. +The embedded Laplace approximation in Stan is best suited for latent Gaussian models when full joint sampling is expensive and the latent conditional posterior is reasonably close to Gaussian. + +For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model observes the following hierarchical structure: \begin{eqnarray} \phi &\sim& p(\phi), \\ \theta &\sim& \text{MultiNormal}(0, K(\phi)), \\ y &\sim& p(y \mid \theta, \phi). 
\end{eqnarray}
-In this formulation, $y$ represents the
-observed data, and $p(y \mid \theta, \phi)$ is the likelihood function that
-specifies how observations are generated conditional on the latent Gaussian
-variables $\theta$ and hyperparameters $\phi$.
-$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables
-$\theta$ and is parameterized by $\phi$.
-The prior $p(\theta \mid \phi)$ is restricted to be a multivariate normal
-centered at 0. That said, we can always pick a likelihood that offsets $\theta$,
-which is equivalently to specifying a prior mean.
+In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that
+specifies how observations are generated conditional on $\theta$ and $\phi$.
+$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables $\theta$ and is parameterized by $\phi$. To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either use a standard method, such as Markov chain Monte Carlo, or we can follow @@ -83,12 +80,14 @@ The signature of the function is: Which returns an approximation to the log marginal likelihood $p(y \mid \phi)$. {{< since 2.37 >}}
-This function takes in the following arguments.
+The embedded Laplace functions accept two functors whose user-defined arguments are passed in as tuples to `laplace_marginal`.
-1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables `theta`
-2. `likelihood_arguments` - A tuple of the log likelihood arguments whose internal members will be passed to the covariance function
-3. `covariance_function` - Prior covariance function
-4. `covariance_arguments` A tuple of the arguments whose internal members will be passed to the the covariance function
+1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables `theta`; the remaining arguments are user defined.
+ - `real likelihood_function(vector theta, likelihood_arguments_1, likelihood_arguments_2, ...)`
+2. `likelihood_arguments` - A tuple of the log likelihood arguments whose internal members will be passed to the log likelihood function.
+3. `covariance_function` - Prior covariance function.
+ - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)`
+4. `covariance_arguments` - A tuple of the arguments whose internal members will be passed to the covariance function.
 Below we go over each argument in more detail. @@ -198,12 +197,13 @@ It is also possible to specify control parameters, which can help improve the optimization that underlies the Laplace approximation, using `laplace_marginal_tol` with the following signature:
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage}
-
-
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): real}|hyperpage}
-
-`real` **`laplace_marginal_tol`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline
+```stan
+real laplace_marginal_tol(function likelihood_function, tuple(...),
+                          function covariance_function, tuple(...),
+                          tuple(vector theta_init, real tol, int max_steps,
+                                int hessian_block_size, int solver,
+                                int max_steps_linesearch, int allow_fallback))
+```
 Returns an approximation to the log marginal likelihood $p(y \mid \phi)$ and allows the user to tune the control parameters of the approximation. @@ -244,6 +244,18 @@ the step is repeatedly halved until the objective function decreases or the maximum number of steps in the linesearch is reached. By default, `max_steps_linesearch=0`, meaning no linesearch is performed.
+* `allow_fallback`: If the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition fails, the optimizer will try `solver=2` instead.
+
+The embedded Laplace approximation's options have a helper function `generate_laplace_options(int theta_size)` that generates the tuple for you. This can be useful for quickly setting up the control parameters in the `transformed data` block to reuse within the model.
+
+```stan
+tuple(vector[theta_size], real, int, int, int, int, int) laplace_ops = generate_laplace_options(theta_size);
+// Modify solver type
+laplace_ops.5 = 2;
+// Turn off fallback
+laplace_ops.7 = 0;
+```
+
 {{< since 2.37 >}} ## Sample from the approximate conditional $\hat{p}(\theta \mid y, \phi)$ diff --git a/src/reference-manual/laplace.qmd b/src/reference-manual/laplace.qmd index 092b8c479..06ac96432 100644 --- a/src/reference-manual/laplace.qmd +++ b/src/reference-manual/laplace.qmd @@ -14,14 +14,14 @@ to the constrained space before outputting them. Given the estimate of the mode $\widehat{\theta}$, the Hessian $H(\widehat{\theta})$ is computed using
-central finite differences of the model functor.
+central finite differences of the model functor.
Next the algorithm computes the Cholesky factor of the negative inverse Hessian: $R^{-1} = \textrm{chol}(-H(\widehat{\theta})) \backslash \mathbf{1}$. Each draw is generated on the unconstrained scale by sampling -$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$ +$\theta^{\textrm{std}(m)} \sim \textrm{normal}(0, \textrm{I})$ and defining draw $m$ to be From 24dfe788494935edd01ef5327c69d809c5180b48 Mon Sep 17 00:00:00 2001 From: Charles Margossian Date: Wed, 18 Feb 2026 09:51:24 -0800 Subject: [PATCH 2/5] update documentation for embedded Laplace approximation. --- src/functions-reference/embedded_laplace.qmd | 100 ++++++++++++------- src/functions-reference/functions_index.qmd | 5 - src/stan-users-guide/gaussian-processes.qmd | 71 ++++++++----- 3 files changed, 108 insertions(+), 68 deletions(-) diff --git a/src/functions-reference/embedded_laplace.qmd b/src/functions-reference/embedded_laplace.qmd index 6adb73751..dddb0a683 100644 --- a/src/functions-reference/embedded_laplace.qmd +++ b/src/functions-reference/embedded_laplace.qmd @@ -5,9 +5,10 @@ pagetitle: Embedded Laplace Approximation # Embedded Laplace Approximation The embedded Laplace approximation can be used to approximate certain marginal and conditional distributions that arise in latent Gaussian models. -Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation at their conditional posterior mode, producing an approximation to the marginal likelihood. -Embedded Laplace allows a sampler to search a lower-dimensional marginal posterior over non-latent parameters instead of jointly sampling all latent effects. -The embedded Laplace approximation in Stan is best suited for latent Gaussian models when full joint sampling is expensive and the latent conditional posterior is reasonably close to Gaussian. 
+Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation.
+In doing so, it marginalizes out the latent Gaussian variables.
+Inference can then be performed on the remaining, often low-dimensional, parameters.
+The embedded Laplace approximation in Stan is best suited for latent Gaussian models when jointly sampling over all model parameters is expensive and the conditional posterior of the Gaussian latent variables is reasonably close to Gaussian.
 For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model observes the following hierarchical structure: \begin{eqnarray} @@ -18,6 +19,7 @@ For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $ In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that specifies how observations are generated conditional on $\theta$ and $\phi$. $K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables $\theta$ and is parameterized by $\phi$.
+The prior on $\theta$ is centered at 0; however, an offset can always be added when specifying the likelihood function $p(y \mid \theta, \phi)$.
 To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either use a standard method, such as Markov chain Monte Carlo, or we can follow @@ -31,7 +33,7 @@ are typically available in closed form and so they must be approximated. The marginal posterior can be written as $p(\phi \mid y) \propto p(y \mid \phi) p(\phi)$, where $p(y \mid \phi) = \int p(y \mid \phi, \theta) p(\theta) \text{d}\theta$ is called the marginal likelihood. The Laplace method approximates
-$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at
+$p(y \mid \phi, \theta) p(\theta)$ with a normal distribution centered at the mode,
 $$ \theta^* = \underset{\theta}{\text{argmax}} \ \log p(\theta \mid y, \phi), $$ @@ -50,7 +52,7 @@ using one of Stan's algorithms.
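Written out, with $\hat{p}_\mathcal{L}(\theta \mid y, \phi)$ denoting the normal approximation centered at the mode $\theta^*$, the resulting approximation to the marginal likelihood takes the standard Laplace form (stated here for orientation; it parallels the analogous formula for $\hat p_\mathcal{L}(y \mid \rho, \alpha, a)$ in the Gaussian-processes chapter of the users guide):

$$
\hat{p}_\mathcal{L}(y \mid \phi)
  = \frac{p(y \mid \theta^*, \phi) \, p(\theta^* \mid \phi)}
         {\hat{p}_\mathcal{L}(\theta^* \mid y, \phi)}.
$$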
The marginal posterior is lower dimensional and likely to have a simpler geometry leading to more efficient inference. On the other hand, each marginal likelihood computation is more costly, and the combined change in efficiency
-depends on the case.
+depends on the application.
 To obtain posterior draws for $\theta$, we sample from the normal approximation to $p(\theta \mid y, \phi)$ in `generated quantities`. @@ -58,8 +60,9 @@ The process of iteratively sampling from $p(\phi \mid y)$ (say, with MCMC) and then $p(\theta \mid y, \phi)$ produces samples from the joint posterior $p(\theta, \phi \mid y)$.
-The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is
-log-concave. Stan's embedded Laplace approximation is restricted to the case
+The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is log-concave, e.g., Poisson, binomial, negative-binomial, and Bernoulli.
+(The normal distribution is also log-concave; however, when the likelihood is normal, marginalization can be performed exactly and does not require an approximation.)
+Stan's embedded Laplace approximation is restricted to the case
 where the prior $p(\theta \mid \phi)$ is multivariate normal. Furthermore, the likelihood $p(y \mid \phi, \theta)$ must be computed using only operations which support higher-order derivatives @@ -77,15 +80,17 @@ In the `model` block, we increment `target` with `laplace_marginal`, a function that approximates the log marginal likelihood $\log p(y \mid \phi)$. The signature of the function is: `real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments)`
-Which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
+which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
 {{< since 2.37 >}} The embedded Laplace functions accept two functors whose user-defined arguments are passed in as tuples to `laplace_marginal`.
-1.
`likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables `theta`; the remaining arguments are user defined.
+1. `likelihood_function` - user-specified log likelihood whose first argument is the vector of latent Gaussian variables $\theta$.
+The subsequent arguments are user defined.
 - `real likelihood_function(vector theta, likelihood_arguments_1, likelihood_arguments_2, ...)`
-2. `likelihood_arguments` - A tuple of the log likelihood arguments whose internal members will be passed to the log likelihood function.
-3. `covariance_function` - Prior covariance function.
+2. `likelihood_arguments` - A tuple of arguments whose internal members are passed to the log likelihood function.
+This tuple does NOT include the latent variable $\theta$.
+3. `covariance_function` - A function that returns the covariance matrix of the multivariate normal prior on $\theta$.
 - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)` 4. `covariance_arguments` - A tuple of the arguments whose internal members will be passed to the covariance function. Below we go over each argument in more detail. @@ -94,12 +99,12 @@ ## Specifying the log likelihood function {#laplace-likelihood_spec} The first step to use the embedded Laplace approximation is to write down a
-function in the `functions` block which returns the log joint likelihood
+function in the `functions` block which returns the log likelihood
 $\log p(y \mid \theta, \phi)$. There are a few constraints on this function:
-1. The function return type must be `real`
+1. The function return type must be `real`.
 2. The first argument must be the latent Gaussian variable $\theta$ and must have type `vector`. @@ -123,7 +128,7 @@ as data or parameter. The tuple after `likelihood_function` contains the arguments that get passed to `likelihood_function` *excluding $\theta$*.
For instance, if a user defined
-likelihood uses a real and a matrix the likelihood function's signature would
+likelihood uses a real and a matrix, the likelihood function's signature would
 first have a vector and then a real and matrix argument. ```stan @@ -208,43 +213,52 @@ real laplace_marginal_tol(function likelihood_function, tuple(...), Returns an approximation to the log marginal likelihood $p(y \mid \phi)$ and allows the user to tune the control parameters of the approximation.
-* `theta_init`: the initial guess for the Newton solver when finding the mode
+* `theta_init`: the initial guess for a Newton solver when finding the mode
 of $p(\theta \mid y, \phi)$. By default, it is a zero-vector. * `tol`: the tolerance $\epsilon$ of the optimizer. Specifically, the optimizer stops when $||\nabla \log p(\theta \mid y, \phi)|| \le \epsilon$. By default,
-the value is $\epsilon = 10^{-6}$.
+the value is $\epsilon \approx 1.49 \times 10^{-8}$, which is the square-root of machine precision.
 * `max_num_steps`: the maximum number of steps taken by the optimizer before it gives up (in which case the Metropolis proposal gets rejected). The default
-is 100 steps.
+is 500 steps.
-* `hessian_block_size`: the size of the blocks, assuming the Hessian
-$\partial \log p(y \mid \theta, \phi) \ \partial \theta$ is block-diagonal.
+* `hessian_block_size`: the size of the blocks, assuming the Hessian of the log likelihood,
+$\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$, is block-diagonal.
 The structure of the Hessian is determined by the dependence structure of $y$ on $\theta$. By default, the Hessian is treated as diagonal (`hessian_block_size=1`). If the Hessian is not block diagonal, then set `hessian_block_size=n`, where `n` is the size of $\theta$.
-* `solver`: choice of Newton solver. The optimizer used to compute the
+* `solver`: choice of Newton solver.
The optimizer underlying the Laplace approximation does one of three matrix decompositions to compute a
-Newton step. The problem determines which decomposition is numerical stable.
-By default (`solver=1`), the solver makes a Cholesky decomposition of the
-negative Hessian, $- \partial \log p(y \mid \theta, \phi) / \partial \theta$.
-If `solver=2`, the solver makes a Cholesky decomposition of the covariance
-matrix $K(\phi)$.
-If the Cholesky decomposition cannot be computed for neither the negative
-Hessian nor the covariance matrix, use `solver=3` which uses a more expensive
-but less specialized approach.
+Newton step. The problem determines which decomposition is numerically stable.
+By default (`solver=1`), the solver attempts a Cholesky decomposition of the
+negative Hessian of the log likelihood,
+$- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+This operation is legal if the negative Hessian is positive-definite,
+which will always be true when the likelihood is log-concave.
+If `solver=2`, the solver makes a Cholesky decomposition of the covariance matrix $K(\phi)$.
+Since a covariance matrix is always positive-definite, computing its
+Cholesky decomposition is always a legal operation, at least in theory.
+In practice, we may not be able to compute the Cholesky decomposition of the
+negative Hessian or the covariance matrix, either because it does not exist or
+because of numerical issues.
+In that case, we can use `solver=3`, which uses a more expensive but less
+specialized approach to compute a Newton step.
 * `max_steps_linesearch`: maximum number of steps in linesearch. The linesearch
-method tries to insure that the Newton step leads to a decrease in the
-objective function. If the Newton step does not improve the objective function,
-the step is repeatedly halved until the objective function decreases or the
-maximum number of steps in the linesearch is reached. By default,
-`max_steps_linesearch=0`, meaning no linesearch is performed.
+adjusts the step size to ensure that a Newton step leads to an increase in
+the objective function (i.e., $f(\theta) = \log p(\theta \mid \phi, y)$).
+If a standard Newton step does not improve the objective function,
+the step is adjusted iteratively until the objective function increases
+or the maximum number of steps in the linesearch is reached.
+By default, `max_steps_linesearch=1000`.
+Setting `max_steps_linesearch=0` results in no linesearch.
-* `allow_fallback`: If the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition fails, the optimizer will try `solver=2` instead.
+* `allow_fallback`: If the user-specified solver fails, this flag determines whether to fall back to the next solver. For example, if the user specifies `solver=1` but the Cholesky decomposition of the negative Hessian $- \partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$ fails, the optimizer will try `solver=2` instead.
+By default, `allow_fallback = 1` (TRUE).
 The embedded Laplace approximation's options have a helper function `generate_laplace_options(int theta_size)` that generates the tuple for you. This can be useful for quickly setting up the control parameters in the `transformed data` block to reuse within the model. @@ -256,6 +270,17 @@ laplace_ops.5 = 2; laplace_ops.7 = 0; ```
+The arguments stored in the `laplace_ops` tuple are:
+```
+laplace_ops = {theta_init,
+               tol,
+               max_num_steps,
+               hessian_block_size,
+               solver,
+               max_steps_linesearch,
+               allow_fallback}
+```
+
 {{< since 2.37 >}} ## Sample from the approximate conditional $\hat{p}(\theta \mid y, \phi)$ @@ -270,13 +295,14 @@ the signature for `laplace_marginal`: `vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...))`
\newline
-Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$.
+Draws samples from the Laplace approximation to the conditional posterior $p(\theta \mid y, \phi)$.
 {{< since 2.37 >}} Once again, it is possible to specify control parameters:
-\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch): vector}|hyperpage}
+\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch, int allow\_fallback): vector}|hyperpage}
-`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch)`
\newline +`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch, int allow_fallback)`
\newline Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$ and allows the user to tune the control parameters of the approximation. {{< since 2.37 >}} diff --git a/src/functions-reference/functions_index.qmd b/src/functions-reference/functions_index.qmd index 870df9d52..9538e7157 100644 --- a/src/functions-reference/functions_index.qmd +++ b/src/functions-reference/functions_index.qmd @@ -1781,11 +1781,6 @@ pagetitle: Alphabetical Index -
[`(array[] int y | array[] int y_index, vector m, function covariance_function, tuple(...)) : real`](embedded_laplace.qmd#index-entry-c092314a5f45deef27ce0e8930a7d28c87ca601d) (embedded_laplace.html)
-**laplace_marginal_tol**: - - -
[`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : real`](embedded_laplace.qmd#index-entry-0f4bd0330deef2db7884dc5a4c933f181e1f2a8c) (embedded_laplace.html)
- - **laplace_marginal_tol_bernoulli_logit**: -
[distribution statement](embedded_laplace.qmd#index-entry-92d43f7c3643c7d85966bbfc0b2a71546facb37c) (embedded_laplace.html)
diff --git a/src/stan-users-guide/gaussian-processes.qmd b/src/stan-users-guide/gaussian-processes.qmd index 58f05db3b..96ff38d37 100644 --- a/src/stan-users-guide/gaussian-processes.qmd +++ b/src/stan-users-guide/gaussian-processes.qmd @@ -476,7 +476,19 @@ parameters { vector[N] eta; } model {
-  // ...
+  vector[N] f;
+  {
+    matrix[N, N] L_K;
+    matrix[N, N] K = gp_exp_quad_cov(x, alpha, rho);
+
+    // diagonal elements
+    for (n in 1:N) {
+      K[n, n] = K[n, n] + delta;
+    }
+
+    L_K = cholesky_decompose(K);
+    f = L_K * eta;
+  }
   rho ~ inv_gamma(5, 5); alpha ~ std_normal(); a ~ std_normal(); @@ -510,7 +522,7 @@ the marginal likelihood as follows: $$ \hat p_\mathcal{L}(y \mid \rho, \alpha, a) = \frac{p(f^* \mid \alpha, \rho) p(y \mid f^*, a)}{
-  \hat p_\mathcal{L}(f \mid \rho, \alpha, a, y)},
+  \hat p_\mathcal{L}(f^* \mid \rho, \alpha, a, y)},
 $$ where $f^*$ is the mode of $p(f \mid \rho, \alpha, a, y)$, obtained via numerical optimization. @@ -556,44 +568,51 @@ generated quantities { Users can set the control parameters of the embedded Laplace approximation, via `laplace_marginal_tol` and `laplace_latent_tol_rng`. When using these
-functions, the user must set *all* the control parameters.
+functions, the user must set *all* the control options and store them in a tuple.
+These control parameters mostly concern the numerical optimizer used to find
+the mode $f^*$ of $p(f \mid \rho, \alpha, a, y)$.
 ```stan transformed data {
-// ...
-
- vector[N] f_init = rep_vector(0, N); // starting point for optimizer.
- real tol = 1e-6; // optimizer's tolerance for Laplace approx.
- int max_num_steps = 1e3; // maximum number of steps for optimizer.
- int hessian_block_size = 1; // when hessian of log likelihood is block
- // diagonal, size of block (here 1).
- int solver = 1; // which Newton optimizer to use; default is 1,
- // use 2 and 3 only for special cases.
- max_steps_linesearch = 0; // if >= 1, optimizer does a lineseach with
-specified number of steps.
+ tuple(vector[N], real, int, int, int, int, int) laplace_ops;
+ laplace_ops.1 = rep_vector(0, N); // starting point for Laplace optimizer
+ laplace_ops.2 = 1.49e-8; // tolerance for optimizer
+ laplace_ops.3 = 500; // maximum number of steps for optimizer.
+ laplace_ops.4 = 1; // hessian block size, assuming the Hessian of the
+ // log likelihood is block-diagonal.
+ laplace_ops.5 = 1; // solver type being used.
+ laplace_ops.6 = 500; // max number of steps for linesearch.
+ laplace_ops.7 = 1; // allow_fallback (1: TRUE, 0: FALSE)
+}
+```
+If users want to depart from the defaults for only some of the control
+parameters, a tuple with the default values (as above) can be created with the
+helper function `generate_laplace_options()`, and the specific control
+parameter can then be modified,
+```stan
+transformed data {
+  tuple(vector[N], real, int, int, int, int, int) laplace_ops =
+    generate_laplace_options(N);
+
+  laplace_ops.2 = 1e-6; // make tolerance of the optimizer less strict.
 }
-
-// ...
-
+```
+The tuple `laplace_ops` is then passed to `laplace_marginal_tol`
+and `laplace_latent_tol_rng`.
+```stan
 model { // ...
- target += laplace_marginal(ll_function, (a, y),
- cov_function, (rho, alpha, x, N, delta),
- f_init, tol, max_num_steps, hessian_block_size,
- solver, max_steps_linesearch);
+ target += laplace_marginal_tol(ll_function, (a, y),
+                                cov_function, (rho, alpha, x, N, delta),
+                                laplace_ops);
 } generated quantities {
- vector[N] f = laplace_latent_rng(ll_function, (a, y),
- cov_function, (rho, alpha, x, N, delta),
- f_init, tol, max_num_steps,
- hessian_block_size, solver,
- max_steps_linesearch);
+ vector[N] f = laplace_latent_tol_rng(ll_function, (a, y),
+                                      cov_function, (rho, alpha, x, N, delta),
+                                      laplace_ops);
 } ```
-For details about the control parameters, see @Margossian:2023.
-
 Stan also provides support for a limited menu of built-in functions, including the Poisson distribution with a log link and prior mean $m$.
When using such From afab737d31dcfe2b87d1177252f8361edf968053 Mon Sep 17 00:00:00 2001 From: Charles Margossian Date: Wed, 18 Feb 2026 09:56:52 -0800 Subject: [PATCH 3/5] adjust max line steps in gp section. --- src/stan-users-guide/gaussian-processes.qmd | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/src/stan-users-guide/gaussian-processes.qmd b/src/stan-users-guide/gaussian-processes.qmd index 96ff38d37..75b8f130d 100644 --- a/src/stan-users-guide/gaussian-processes.qmd +++ b/src/stan-users-guide/gaussian-processes.qmd @@ -580,7 +580,7 @@ transformed data { laplace_ops.4 = 1; // hessian block size, assuming the prior // covariance $K$ is block-diagonal. laplace_ops.5 = 1; // solver type being used. - laplace_ops.6 = 500; // max number of steps for linesearch. + laplace_ops.6 = 1000; // max number of steps for linesearch. laplace_ops.7 = 1; // allow_fallback (1: TRUE, 0: FALSE) ``` If users want to depart from the defaults for only some of the control From 3ff96fa84954b38a283035f430647386f94dc1f3 Mon Sep 17 00:00:00 2001 From: Charles Margossian Date: Thu, 12 Mar 2026 17:13:12 -0700 Subject: [PATCH 4/5] update documentation on hessian block size for embedded Laplace approx. --- src/functions-reference/embedded_laplace.qmd | 88 +++++++++----------- src/functions-reference/functions_index.qmd | 7 +- src/stan-users-guide/gaussian-processes.qmd | 55 ++++++++---- 3 files changed, 80 insertions(+), 70 deletions(-) diff --git a/src/functions-reference/embedded_laplace.qmd b/src/functions-reference/embedded_laplace.qmd index dddb0a683..247d19b44 100644 --- a/src/functions-reference/embedded_laplace.qmd +++ b/src/functions-reference/embedded_laplace.qmd @@ -4,11 +4,16 @@ pagetitle: Embedded Laplace Approximation # Embedded Laplace Approximation -The embedded Laplace approximation can be used to approximate certain marginal and conditional distributions that arise in latent Gaussian models. 
-Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation. +The embedded Laplace approximation can be used to approximate certain marginal +and conditional distributions that arise in latent Gaussian models. +Embedded Laplace replaces explicit sampling of high-dimensional Gaussian latent +variables with a local Gaussian approximation. In doing so, it marginalizes out the latent Gaussian variables. -Inference can then be performed on the remaining, often low-dimensional, parameters. -The embedded Laplace approximation in Stan is best suited for latent Gaussian models when jointly sampling over all model parameters is expensive and the conditional posterior of the Gaussian latent variables is reasonably close to Gaussian. +Inference can then be performed on the remaining, often low-dimensional, +parameters. The embedded Laplace approximation in Stan is best suited for +latent Gaussian models when jointly sampling over all model parameters is +expensive and the conditional posterior of the Gaussian latent variables is +reasonably close to Gaussian. For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $\phi$, a latent Gaussian model observes the following hierarchical structure: \begin{eqnarray} @@ -17,9 +22,11 @@ For observed data $y$, latent Gaussian variables $\theta$, and hyperparameters $ y &\sim& p(y \mid \theta, \phi). \end{eqnarray} In this formulation, $p(y \mid \theta, \phi)$ is the likelihood function that -specifies how observations are generated conditional on the $\theta$ and $\phi$. -$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables $\theta$ and is parameterized by $\phi$. -The prior on $\theta$ is centered at 0, however an offset can always be added when specifying the likelihood function $p(y \mid \theta, \phi)$. +specifies how observations are generated conditional on $\theta$ and $\phi$. 
+$K(\phi)$ denotes the prior covariance matrix for the latent Gaussian variables
+$\theta$ and is parameterized by $\phi$. The prior on $\theta$ is centered at 0;
+however, an offset can always be added when specifying the likelihood function
+$p(y \mid \theta, \phi)$.
 To sample from the joint posterior $p(\phi, \theta \mid y)$, we can either use a standard method, such as Markov chain Monte Carlo, or we can follow @@ -60,8 +67,11 @@ The process of iteratively sampling from $p(\phi \mid y)$ (say, with MCMC) and then $p(\theta \mid y, \phi)$ produces samples from the joint posterior $p(\theta, \phi \mid y)$.
-The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is log-concave, e.g., Poisson, binomial, negative-binomial, and Bernoulli.
-(The normal distribution is also log-concave; however, when the likelihood is normal, marginalization can be performed exactly and does not require an approximation.)
+The Laplace approximation is especially useful if $p(y \mid \phi, \theta)$ is
+log-concave, e.g., Poisson, binomial, negative-binomial, and Bernoulli.
+(The normal distribution is also log-concave; however, when the likelihood is
+normal, marginalization can be performed exactly and does not require an
+approximation.)
 Stan's embedded Laplace approximation is restricted to the case where the prior $p(\theta \mid \phi)$ is multivariate normal. Furthermore, the likelihood $p(y \mid \phi, \theta)$ must be computed using only operations which support higher-order derivatives @@ -74,11 +84,9 @@ In the `model` block, we increment `target` with `laplace_marginal`, a function that approximates the log marginal likelihood $\log p(y \mid \phi)$. The signature of the function is:
-\index{{\tt \bfseries laplace\_marginal\_tol }!{\tt (function likelihood\_function, tuple(...) likelihood\_arguments, function covariance\_function, tuple(...), vector theta\_init covariance\_arguments): real}|hyperpage}
+\index{{\tt \bfseries laplace\_marginal }!{\tt (function likelihood\_function, tuple(...)
likelihood\_arguments, int hessian\_block\_size, function covariance\_function, tuple(...)): real}|hyperpage}
-
-
-`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments)`
+`real` **`laplace_marginal`**`(function likelihood_function, tuple(...) likelihood_arguments, int hessian_block_size, function covariance_function, tuple(...) covariance_arguments)`
 
 which returns an approximation to the log marginal likelihood $p(y \mid \phi)$.
 {{< since 2.37 >}}
@@ -90,9 +98,10 @@ The subsequent arguments are user defined.
 
 - `real likelihood_function(vector theta, likelihood_arguments_1, likelihood_arguments_2, ...)`
 
 2. `likelihood_arguments` - A tuple of arguments whose internal members are passed to the log likelihood function. This tuple does NOT include the latent variable $\theta$.
-3. `covariance_function` - A function that returns the covariance matrix of the multivariate normal prior on $\theta$.
+3. `hessian_block_size` - the block size of the Hessian of the log likelihood, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+4. `covariance_function` - A function that returns the covariance matrix of the multivariate normal prior on $\theta$.
   - `matrix covariance_function(covariance_argument_1, covariance_argument_2, ...)`
-4. `covariance_arguments` A tuple of the arguments whose internal members will be passed to the the covariance function.
+5. `covariance_arguments` - A tuple of the arguments whose internal members will be passed to the covariance function.
 
 Below we go over each argument in more detail.
 
@@ -153,6 +162,13 @@ for example,
 real likelihood_function(vector theta, data vector x, ...)
 ```
 
+In addition to the likelihood function, users must specify the block size
+of the Hessian, $\partial^2 \log p(y \mid \theta, \phi) / \partial \theta^2$.
+The Hessian is often block diagonal and this structure can be taken advantage of for fast computation. 
+For example, if $y_i$ only depends on $\theta_i$, then the Hessian is diagonal and `hessian_block_size=1`.
+On the other hand, if the Hessian is not block diagonal, we can always set
+`hessian_block_size=n` where $n$ is the size of $\theta$.
+
 ## Specifying the covariance function
 
 The argument `covariance_function` returns the prior covariance matrix
@@ -163,20 +179,6 @@ It's return type must be a matrix of size $n \times n$ where $n$ is the size of
 matrix covariance_function(...)
 ```
-
-
 The `...` represents a set of optional variadic arguments. There are no type
 restrictions for the variadic arguments `...` and each argument can be
 passed as data or parameter. The variables
@@ -204,9 +206,9 @@ with the following signature:
 
 ```stan
 real laplace_marginal_tol(function likelihood_function, tuple(...),
+                          int hessian_block_size,
                           function covariance_function, tuple(...),
-                          tuple(vector theta_init, real tol, int max_steps,
-                                int hessian_block_size, int solver,
+                          tuple(vector theta_init, real tol, int max_steps, int solver,
                                 int max_steps_linesearch, int allow_fallback))
 ```
 
@@ -224,20 +226,13 @@ the value is $\epsilon \approx 1.49 \times 10^{-8}$, which is the square-root of
 it gives up (in which case the Metropolis proposal gets rejected). The default
 is 500 steps.
 
-* `hessian_block_size`: the size of the blocks, assuming the Hessian of the log likelihood,
-$\partial \log p(y \mid \theta, \phi) / \partial \theta$ is block-diagonal.
-The structure of the Hessian is determined by the dependence structure of $y$
-on $\theta$. By default, the Hessian is treated as diagonal
-(`hessian_block_size=1`). If the Hessian is not block diagonal, then set
-`hessian_block_size=n`, where `n` is the size of $\theta$.
-
 * `solver`: choice of Newton solver. The optimizer underlying the Laplace
 approximation does one of three matrix decompositions to compute a Newton
 step. The problem determines which decomposition is numerically stable. 
By default (`solver=1`), the solver attempts a Cholesky decomposition of the negative Hessian of the log likelihood, $- \partial^2 \log p(y \mid \theta, \phi) / \partial^2 \theta$. -This operation is legal if the negative Hessian is positive-definite, +This operation is legal if the negative Hessian is positive-definite, which will always be true when the likelihood is log concave. If `solver=2`, the solver makes a Cholesky decomposition of the covariance matrix $K(\phi)$. Since a covariance matrix is always positive-definite, compute its @@ -245,7 +240,7 @@ Cholesky decomposition is always a legal operation, at least in theory. In practice, we may not be able to compute the Cholesky decomposition of the negative Hessian or the covariance matrix, either because it does not exist or because of numerical issues. -In that case, we can use `solver=3` which uses a more expensive but less +In that case, we can use `solver=3` which uses a more expensive but less specialized approach to compute a Newton step. * `max_steps_linesearch`: maximum number of steps in linesearch. The linesearch @@ -253,7 +248,7 @@ adjusts to step size to ensure that a Newton step leads to an increase in the objective function (i.e., $f(\theta) = p(\theta \mid \phi, y)$). If a standard Newton step does not improve the objective function, the step is adjusted iteratively until the objective function increases -or the maximum number of steps in the linesearch is reached. +or the maximum number of steps in the linesearch is reached. By default, `max_steps_linesearch=1000`. Setting `max_steps_linesearch=0` results in no linesearch. @@ -290,19 +285,18 @@ approximation of $p(\theta \mid \phi, y)$ using `laplace_latent_rng`. 
The signature for `laplace_latent_rng` follows closely
 the signature for `laplace_marginal`:
-
-\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init): vector}|hyperpage}
+
+\index{{\tt \bfseries laplace\_latent\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...)): vector}|hyperpage}
 
-`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...))`
\newline
+`vector` **`laplace_latent_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...))`
\newline
 Draws samples from the Laplace approximation to the conditional posterior
 $p(\theta \mid y, \phi)$.
 {{< since 2.37 >}}
 
 Once again, it is possible to specify control parameters:
 
-\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), function covariance\_function, tuple(...), vector theta\_init, real tol, int max\_steps, int hessian\_block\_size, int solver, int max\_steps\_linesearch),
-int allow\_fallback: vector}|hyperpage}
+\index{{\tt \bfseries laplace\_latent\_tol\_rng }!{\tt (function likelihood\_function, tuple(...), int hessian\_block\_size, function covariance\_function, tuple(...), tuple(...) laplace\_ops): vector}|hyperpage}
 
-`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch, int allow_fallback)`
\newline +`vector` **`laplace_latent_tol_rng`**`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...), tuple(...) laplace_ops)`
\newline Draws approximate samples from the conditional posterior $p(\theta \mid y, \phi)$ and allows the user to tune the control parameters of the approximation. {{< since 2.37 >}} diff --git a/src/functions-reference/functions_index.qmd b/src/functions-reference/functions_index.qmd index 9538e7157..ad18c3d42 100644 --- a/src/functions-reference/functions_index.qmd +++ b/src/functions-reference/functions_index.qmd @@ -1713,7 +1713,7 @@ pagetitle: Alphabetical Index **laplace_latent_rng**: - -
[`(function likelihood_function, tuple(...), function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-6d0685309664591fc32d3e2a2304af7aa5459e1c) (embedded_laplace.html)
+ -
[`(function likelihood_function, tuple(...), int hessian_block_size, function covariance_function, tuple(...)) : vector`](embedded_laplace.qmd#index-entry-cacb1d5344e246ec89c460e00a6f1065f4a8a1c1) (embedded_laplace.html)
**laplace_latent_tol_bernoulli_logit_rng**: @@ -1731,11 +1731,6 @@ pagetitle: Alphabetical Index -
[`(array[] int y, array[] int y_index, vector m, function covariance_function, tuple(...), vector theta_init, real tol, int max_steps, int hessian_block_size, int solver, int max_steps_linesearch) : vector`](embedded_laplace.qmd#index-entry-97f692748ebfabf0574b7a8e2edaceb940ee4b6b) (embedded_laplace.html)
-**laplace_marginal**: - - -
[`(function likelihood_function, tuple(...) likelihood_arguments, function covariance_function, tuple(...) covariance_arguments) : real`](embedded_laplace.qmd#index-entry-6da15f0ed076016d814cdc278127896f99d29633) (embedded_laplace.html)
- - **laplace_marginal_bernoulli_logit**: -
[distribution statement](embedded_laplace.qmd#index-entry-1d93ab799518d0aac88e63c01f9655f36c7cbeb6) (embedded_laplace.html)
diff --git a/src/stan-users-guide/gaussian-processes.qmd b/src/stan-users-guide/gaussian-processes.qmd index 75b8f130d..ffbde0880 100644 --- a/src/stan-users-guide/gaussian-processes.qmd +++ b/src/stan-users-guide/gaussian-processes.qmd @@ -545,7 +545,24 @@ functions { } ``` -We then increment `target` in the model block with the approximation to +The embedded Laplace relies on calculations of the log likelihood's Hessian, +$\partial^2 \log p(y \mid f, a, \rho, \alpha) / \partial f^2$, and these +calculations can be much faster when the Hessian is sparse. In particular, +it is expected that the Hessian is block diagonal. In the `transformed data` +block we can specify the block size of the Hessian. +```stan +transformed data { + int hessian_block_size = 1; +} +``` +For example, if $y_i$ depends only on $f_i$, then the Hessian of the log +likelihood is diagonal and the block size is 1. On the other hand, if the +Hessian is not sparse, then we set the hessian block size +to $N$, where $N$ is the dimension of $f$. Currently, Stan does not check +the block size of the Hessian and so the user is responsible for correctly +specifying the block size. + +Finally, we increment `target` in the model block with the approximation to $\log p(y \mid \rho, \alpha, a)$. ```stan model { @@ -553,7 +570,7 @@ model { alpha ~ std_normal(); sigma ~ std_normal(); - target += laplace_marginal(ll_function, (a, y), + target += laplace_marginal(ll_function, (a, y), hessian_block_size, cov_function, (rho, alpha, x, N, delta)); } ``` @@ -561,7 +578,7 @@ Notice that we do not need to construct $f$ explicitly, since it is marginalized out. 
Instead, we recover the GP function in `generated quantities`:
 ```stan
 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, y),
+  vector[N] f = laplace_latent_rng(ll_function, (a, y), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta));
 }
 ```
@@ -573,25 +590,23 @@ These control parameters mostly concern the numerical optimizer used to find
 the mode $f^*$ of $p(f \mid \rho, \alpha, a)$.
 ```stan
 transformed data {
-  tuple(vector[N], real, int, int, int, int, int) laplace_ops;
+  tuple(vector[N], real, int, int, int, int) laplace_ops;
   laplace_ops.1 = rep_vector(0, N);  // starting point for Laplace optimizer
   laplace_ops.2 = 1.49e-8;  // tolerance for optimizer
-  laplace_ops.2 = 500;  // maximum number of steps for optimizer.
-  laplace_ops.4 = 1;  // hessian block size, assuming the prior
-                      // covariance $K$ is block-diagonal.
-  laplace_ops.5 = 1;  // solver type being used.
-  laplace_ops.6 = 1000;  // max number of steps for linesearch.
-  laplace_ops.7 = 1;  // allow_fallback (1: TRUE, 0: FALSE)
+  laplace_ops.3 = 500;  // maximum number of steps for optimizer.
+  laplace_ops.4 = 1;  // solver type being used.
+  laplace_ops.5 = 1000;  // max number of steps for linesearch.
+  laplace_ops.6 = 1;  // allow_fallback (1: TRUE, 0: FALSE)
 ```
 
 If users want to depart from the defaults for only some of the control
-parameters, a tuple with the default values (as above) can be created with the
-helper callable `generate_laplace_options()`, and the specific control
+parameters, a tuple with the default values (as above) can be created with the
+helper callable `generate_laplace_options()`, and the specific control
 parameter can then be modified,
 ```stan
 transformed data {
-  tuple(vector[N], real, int, int, int, int, int) laplace_ops
+  tuple(vector[N], real, int, int, int, int) laplace_ops
     = generate_laplace_options(N);
-
+  laplace_ops.2 = 1e-6;  // make tolerance of the optimizer less strict.
 }
 ```
@@ -601,13 +616,13 @@ and `laplace_rng_tol`.
 model {
   // ... 
-  target += laplace_marginal_tol(ll_function, (a, y),
+  target += laplace_marginal_tol(ll_function, (a, y), hessian_block_size,
                                  cov_function, (rho, alpha, x, N, delta),
                                  laplace_ops);
 }
 
 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, y),
+  vector[N] f = laplace_latent_tol_rng(ll_function, (a, y), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta),
                                    laplace_ops);
 }
@@ -715,13 +730,19 @@ functions {
 
 // ...
 
+transformed data {
+  int hessian_block_size = 1;
+}
+
+// ...
+
 model {
-  target += laplace_marginal(ll_function, (a, z),
+  target += laplace_marginal(ll_function, (a, z), hessian_block_size,
                              cov_function, (rho, alpha, x, N, delta));
 }
 
 generated quantities {
-  vector[N] f = laplace_latent_rng(ll_function, (a, z),
+  vector[N] f = laplace_latent_rng(ll_function, (a, z), hessian_block_size,
                                    cov_function, (rho, alpha, x, N, delta));
 }
 ```

From b5e6f7868429d895d932e990aeca91c3eddc4ff4 Mon Sep 17 00:00:00 2001
From: Steve Bronder
Date: Thu, 9 Apr 2026 17:53:55 -0400
Subject: [PATCH 5/5] fixup wording for embedded laplace reference doc

---
 src/reference-manual/laplace_embedded.qmd | 220 ++++++++++++----------
 1 file changed, 119 insertions(+), 101 deletions(-)

diff --git a/src/reference-manual/laplace_embedded.qmd b/src/reference-manual/laplace_embedded.qmd
index 3912ccc8a..5a9fdd542 100644
--- a/src/reference-manual/laplace_embedded.qmd
+++ b/src/reference-manual/laplace_embedded.qmd
@@ -4,173 +4,191 @@ pagetitle: Embedded Laplace Approximation
 
 # Embedded Laplace Approximation
 
-Stan provides functions to perform an embedded Laplace
-approximation for latent Gaussian models, following the procedure described
-by @RasmussenWilliams:2006 and @Rue:2009. This approach is often refered to
-as the integrated or nested Laplace approximation, although the exact details
-of the method can vary. The details of Stan's implementation can be found in
-references [@Margossian:2020; @Margossian:2023]. 
- -A standard approach to fit a latent Gaussian model would be to perform inference -jointly over the latent Gaussian variables and the hyperparameters. -Instead, the embedded Laplace approximation can be used to do *approximate* -marginalization of the latent Gaussian variables; we can then -use any inference over the remaining hyperparameters, for example Hamiltonian -Monte Carlo sampling. - -Formally, consider a latent Gaussian model, +The embedded Laplace approximation replaces explicit sampling of high-dimensional Gaussian latent variables with a local Gaussian approximation, marginalizing them out so that inference proceeds over the remaining hyperparameters alone. +This approach is often referred to as the integrated or nested Laplace approximation, although the exact details of the method can vary. +The details of Stan's implementation can be found in references [@Margossian:2020; @Margossian:2023]. + +A standard approach to fit a latent Gaussian model would be to perform inference jointly over the latent Gaussian variables and the hyperparameters. +Instead, the embedded Laplace approximation can be used to do *approximate* marginalization of the latent Gaussian variables; we can then use any inference over the remaining hyperparameters. +By marginalizing out the latent variables, the sampler explores a lower-dimensional, better-behaved marginal posterior. +Individual iterations are more expensive (each requires an inner optimization), but the sampler typically needs far fewer iterations to achieve the same effective sample size. + +For complete function signatures and the built-in likelihood wrappers (Poisson, Negative Binomial, Bernoulli), see the [Embedded Laplace functions reference](../functions-reference/embedded_laplace.qmd). +For worked examples with full data blocks, see the [Gaussian Processes chapter](../stan-users-guide/gaussian-processes.qmd#poisson-gp-using-an-embedded-laplace-approximation). 
+ + +## Latent Gaussian models + +The embedded Laplace approximation is used for latent Gaussian models. A latent Gaussian model is defined by three main components: + +- $\phi$: hyperparameters (e.g., GP kernel length-scale and magnitude, or variance components in a hierarchical model), +- $\theta$: latent Gaussian variables (the high-dimensional quantity to be marginalized out), +- $y$: observed data. + +These components are related through a hierarchical structure. +The hyperparameters $\phi$ are given a prior $p(\phi)$. +The latent variables $\theta$ have a multivariate normal prior with covariance matrix $K(\phi)$. +The observations $y$ are generated according to a likelihood $p(y \mid \theta, \phi)$. +The prior on $\theta$ is centered at zero; an offset can be incorporated into the likelihood function if a non-zero mean is needed. + \begin{eqnarray*} \phi & \sim & p(\phi) \\ \theta & \sim & \text{Multi-Normal}(0, K(\phi)) \\ y & \sim & p(y \mid \theta, \phi). \end{eqnarray*} -The motivation for marginalization is to bypass the challenging geometry of the joint -posterior $p(\phi, \theta \mid y)$. This geometry (e.g. funnels) often frustrates -inference algorithms, including Hamiltonian Monte Carlo sampling and approximate -methods such as variational inference. On the other hand, the marginal posterior -$p(\phi \mid y)$ is often well-behaved and in many cases low-dimensional. -Furthermore, the conditional posterior $p(\theta \mid \phi, y)$ can be well -approximated by a normal distribution, if the likelihood $p(y \mid \theta, \phi)$ -is log concave. +The generative model above defines a joint distribution over all three quantities, $p(\phi, \theta, y) = p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$. +After observing data $y$, Bayes' theorem gives the joint posterior $p(\phi, \theta \mid y) \propto p(\phi) \, p(\theta \mid \phi) \, p(y \mid \theta, \phi)$. 
+Sampling directly from the joint posterior $p(\phi, \theta \mid y)$ of this model is often difficult.
+Challenging geometries (e.g., funnels) frustrate inference algorithms, including Hamiltonian Monte Carlo and variational inference.
+However, the marginal posterior $p(\phi \mid y)$ is often well-behaved and low-dimensional, making it much easier to sample.
+The Laplace approximation allows the conditional posterior $p(\theta \mid \phi, y)$ to be approximated by a normal distribution when the likelihood $p(y \mid \theta, \phi)$ is log concave.
 
 ## Approximation of the conditional posterior and marginal likelihood
 
-The Laplace approximation is the normal distribution that matches the mode
-and curvature of the conditional posterior $p(\theta \mid y, \phi)$.
-The mode,
+The two-step inference strategy for using the embedded Laplace approximation in a latent Gaussian model requires approximations to both the conditional posterior $p(\theta \mid y, \phi)$ and the marginal likelihood $p(y \mid \phi)$.
+The Laplace approximation is the normal distribution that matches the mode and curvature of the conditional posterior $p(\theta \mid y, \phi)$.
+The mode, defined as the value of $\theta$ that maximizes the conditional posterior, is estimated by a Newton solver,
 $$
 \theta^* = \underset{\theta}{\text{argmax}} \ p(\theta \mid y, \phi),
 $$
-is estimated by a Newton solver. Since the approximation is normal,
-the curvature is matched by setting the covariance to the negative Hessian
-of the log conditional posterior, evaluated at the mode,
+
+Since the approximation is normal, the curvature is matched by setting the covariance to the negative Hessian of the log conditional posterior, evaluated at the mode,
+
 $$
 \Sigma^* = - \left . \frac{\partial^2}{\partial \theta^2}
 \log p (\theta \mid \phi, y) \right |_{\theta =\theta^*}. 
$$ -The resulting Laplace approximation is then, + +The resulting Laplace approximation is a multivariate normal centered at the mode with covariance given by the inverse curvature, + $$ \hat p_\mathcal{L} (\theta \mid y, \phi) = \text{Multi-Normal}(\theta^*, \Sigma^*) \approx p(\theta \mid y, \phi). $$ -This approximation implies another approximation for the marginal likelihood, + +This approximation also yields an approximation to the marginal likelihood, obtained by evaluating the prior, likelihood, and approximate posterior at the mode $\theta^*$, + $$ \hat p_\mathcal{L}(y \mid \phi) := \frac{p(\theta^* \mid \phi) \ p(y \mid \theta^*, \phi) }{ \hat p_\mathcal{L} (\theta^* \mid \phi, y) } \approx p(y \mid \phi). $$ -Hence, a strategy to approximate the posterior of the latent Gaussian model -is to first estimate the marginal posterior -$\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$ + +Hence, a strategy to approximate the posterior of the latent Gaussian model is to first estimate the marginal posterior $\hat p_\mathcal{L}(\phi \mid y) \propto p(\phi) p_\mathcal{L} (y \mid \phi)$ using any algorithm supported by Stan. -Approximate posterior draws for the latent Gaussian variables are then -obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and -then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$. +Approximate posterior draws for the latent Gaussian variables are then obtained by first drawing $\phi \sim \hat p_\mathcal{L}(\phi \mid y)$ and then $\theta \sim \hat p_\mathcal{L}(\theta \mid \phi, y)$. ## Trade-offs of the approximation -The embedded Laplace approximation presents several trade-offs with standard -inference over the joint posterior $p(\theta, \phi \mid y)$. The main -advantage of the embedded Laplace approximation is that it side-steps the -intricate geometry of hierarchical models. 
The marginal posterior
-$p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling
-without extensive tuning or reparameterization, and the mixing time is faster,
-meaning we can run shorter chains to achieve a desired precision. In some cases,
-approximate methods, e.g. variational inference, which
-work poorly on the joint $p(\theta, \phi \mid y)$ work well on the marginal
-posterior $p(\phi \mid y)$.
-
-On the other hand, the embedded Laplace approximation presents certain
-disadvantages. First, we need to perform a Laplace approximation each time
-the log marginal likelihood is evaluated, meaning each iteration
-can be expensive. Secondly, the approximation can introduce non-negligible
-error, especially with non-conventional likelihoods (note the prior
-is always multivariate normal). How these trade-offs are resolved depends on
-the application; see @Margossian:2020 for some examples.
+The embedded Laplace approximation presents several trade-offs with standard inference over the joint posterior $p(\theta, \phi \mid y)$.
+The main advantage of the embedded Laplace approximation is that it side-steps the intricate geometry of hierarchical models.
+The marginal posterior $p(\phi \mid y)$ can then be handled by Hamiltonian Monte Carlo sampling without extensive tuning or reparameterization, and mixing is faster, meaning we can run shorter chains to achieve a desired precision.
+In addition, approximate methods, e.g., variational inference, that work poorly on the joint $p(\theta, \phi \mid y)$ can work well on the marginal posterior $p(\phi \mid y)$.
+
+On the other hand, the embedded Laplace approximation presents certain disadvantages.
+First, we need to perform a Laplace approximation each time the log marginal likelihood is evaluated, meaning each iteration can be expensive. 
+Secondly, the approximation can introduce non-negligible error, especially with non-conventional likelihoods (note the prior is always multivariate normal).
+How these trade-offs are resolved depends on the application; see @Margossian:2020 for some examples.
 
+### When the approximation is appropriate
+
+The quality of the Laplace approximation depends on how close the true conditional posterior $p(\theta \mid y, \phi)$ is to Gaussian.
+
+**Works well.** Log-concave likelihoods, e.g., a Poisson with log link or a negative binomial with log link.
+These produce unimodal conditional posteriors when combined with a Gaussian prior.
+The approximation error is typically negligible for these likelihoods, especially with moderate-to-large counts [@Kuss:2005; @Vanhatalo:2010; @Cseke:2011; @Vehtari:2016].
+
+**Works adequately.** Bernoulli with logit link.
+The curvature information from binary observations is weaker, so the Gaussian approximation is less accurate than for count data.
+The embedded Laplace approximation is still useful when $\theta$ is high-dimensional and joint sampling is infeasible; see @Vehtari:2016 and @Margossian:2020 for discussion.
+
+**Not appropriate.** The normal distribution is log-concave, but when the likelihood is normal, marginalization can be performed exactly and no approximation is needed.
+For likelihoods that are not log-concave in $\theta$, the conditional posterior may be multimodal and the Laplace approximation will miss modes.
+When $\theta$ is low-dimensional (a few dozen or fewer), the overhead of the inner optimization may not pay for itself and standard joint HMC sampling is often adequate.
 
 ## Details of the approximation
 
+When the embedded Laplace approximation does not converge or produces unexpected results, the solver configuration may need adjustment.
+This section describes the internals of the Newton solver and the options available for tuning it. 
+
 
 ### Tuning the Newton solver
 
-A critical component of the embedded Laplace approximation is the Newton solver
-used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$. The objective
-function being maximized is
+A critical component of the embedded Laplace approximation is the Newton solver used to estimate the mode $\theta^*$ of $p(\theta \mid \phi, y)$.
+The objective function being maximized is the log joint density of the prior and likelihood,
+
 $$
 \Psi(\theta) = \log p(\theta \mid \phi) + \log p(y \mid \theta, \phi),
 $$
-and convergence is declared if the change in the objective is sufficiently
-small between two iterations
+
+Convergence is declared when the change in the objective between successive iterations falls below a *tolerance* $\Delta$,
+
 $$
 | \Psi (\theta^{(i + 1)}) - \Psi (\theta^{(i)}) | \le \Delta,
 $$
-for some *tolerance* $\Delta$. The solver also stops after reaching a
-pre-specified *maximum number of steps*: in that case, Stan throws an exception
-and rejects the current proposal. This is not a problem, as
-long as these exceptions are rare and confined to early phases of the warmup.
-The Newton iteration can be augmented with a linesearch step to insure that
-at each iteration the objective function $\Psi$ decreases. Specifically,
-suppose that
+The solver also stops after reaching a pre-specified *maximum number of steps*.
+In that case, Stan throws a warning, but still returns the last iteration's parameters.
+If you see this warning, you should check the diagnostics to understand why the solver failed to converge.
+
+To help with cases where a Newton step does not lead to an increase in the objective function, the Newton iteration is augmented with a Wolfe line search to ensure that at each iteration the objective function $\Psi$ increases.
+Specifically, suppose the objective decreases after a Newton step, indicating the step overshot the region of improvement,
+
 $$
 \Psi (\theta^{(i + 1)}) < \Psi (\theta^{(i)}). 
$$
+
+This can indicate that the Newton step at iteration $i$ is too large and that it skipped over a region where the objective function increases.
+In that case, the line search seeks a step size $\alpha_k$ along the Newton direction $p_i$ that satisfies the Wolfe conditions.
+Writing $f = -\Psi$ for the objective to be minimized, the conditions require that the accepted step sufficiently decreases $f$ (i.e., increases $\Psi$) and that the slope at the accepted point is flatter than at the previous position; together these push the algorithm towards the mode,
+
+$$
+\begin{aligned}
+f(x_i + \alpha_k p_i) &\le f(x_i) + c_1 \alpha_k \nabla f(x_i)^T p_i, \\
+-p_i^T \nabla f(x_i + \alpha_k p_i) &\le -c_2 p_i^T \nabla f(x_i),
+\end{aligned}
+$$
+
+for constants $0 < c_1 < c_2 < 1$.
+In practice, the step length is reduced by averaging the proposal with the previous iterate,
+
 $$
 \theta^{(i + 1)} \leftarrow \frac{\theta^{(i + 1)} + \theta^{(i)}}{2}.
 $$
-We repeat this halving of steps until
-$\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number
-of linesearch steps is reached. By default, this maximum is set to 0, which
-means the Newton solver performs no linesearch. For certain problems, adding
-a linsearch can make the optimization more stable.
+We repeat this halving of steps until $\Psi (\theta^{(i + 1)}) \ge \Psi (\theta^{(i)})$, or until a maximum number of linesearch steps is reached.
+For certain problems, adding a linesearch can make the optimization and solver more stable.
+
+### Solver Strategies
 
-The embedded Laplace approximation uses a custom Newton solver, specialized
-to find the mode of $p(\theta \mid \phi, y)$.
-A keystep for efficient optimization is to insure all matrix inversions are
-numerically stable. 
This can be done using the Woodburry-Sherman-Morrison
-formula and requires one of three matrix decompositions:
+The embedded Laplace approximation uses a custom Newton solver, specialized to find the mode of $p(\theta \mid \phi, y)$.
+A key step for efficient optimization is to ensure all matrix inversions are numerically stable.
+This can be done using the Woodbury-Sherman-Morrison formula and requires one of three matrix decompositions:
 
-1. Cholesky decomposition of the Hessian of the negative log likelihood
-$W = - \partial^2_\theta \log p(y \mid \theta, \phi)$
+1. Cholesky decomposition of the Hessian of the negative log likelihood, $W = - \partial^2_\theta \log p(y \mid \theta, \phi)$.
 2. Cholesky decomposition of the prior covariance matrix $K(\phi)$.
 3. LU-decomposition of $I + KW$, where $I$ is the identity matrix.
 
-The first solver (1) should be used if the negative log likelihood is
-positive-definite. Otherwise the user should rely on (2). In rarer cases where
-it is not numerically safe to invert the covariance matrix $K$, users can
-use the third solver as a last-resort option.
+The first solver (1) should be used if the Hessian of the negative log likelihood, $W$, is positive-definite, which will always be true when the likelihood is log concave.
+Otherwise the user should rely on (2).
+In rarer cases where it is not numerically safe to invert the covariance matrix $K$, users can use the third solver as a last-resort option.
 
 ### Sparse Hessian of the log likelihood
 
-A key step to speed up computation is to take advantage of the sparsity of
-the Hessian of the log likelihood,
+A key step to speed up computation is to take advantage of the sparsity of $H$, the Hessian of the log likelihood with respect to the latent variables,
 $$
 H = \frac{\partial^2}{\partial \theta^2} \log p(y \mid \theta, \phi). 
$$
 
-For example, if the observations $(y_1, \cdots, y_N)$ are conditionally
-independent and each depends on only depend on one component of $\theta$,
-such that
+For example, if the observations $(y_1, \cdots, y_N)$ are conditionally independent and each depends on only one component of $\theta$, the log likelihood decomposes into a sum of per-observation terms,
 $$
 \log p(y \mid \theta, \phi) = \sum_{i = 1}^N \log p(y_i \mid \theta_i, \phi),
 $$
-then the Hessian is diagonal. This leads to faster calculations of the Hessian
-and subsequently sparse matrix operations. This case is common in Gaussian
-process models, and certain hierarchical models.
+and the Hessian is diagonal.
+This leads to faster calculations of the Hessian and subsequently sparse matrix operations.
+This case is common in Gaussian process models, and certain hierarchical models.
 
-Stan's suite of functions for the embedded Laplace approximation are not
-equipped to handle arbitrary sparsity structures; instead, they work on
-block-diagonal Hessians, and the user can specify the size $B$ of these blocks.
-The user is responsible for working out what $B$ is. If the Hessian is dense,
-then we simply set $B = N$.
+Stan's suite of functions for the embedded Laplace approximation exploits block-diagonal structure in the Hessian, where the user specifies the block size $B$.
+The user is responsible for working out what $B$ is. If the Hessian is dense, then we simply set $B = N$.
+The diagonal case above corresponds to $B = 1$.
+Arbitrary sparsity patterns beyond block-diagonal structure are not currently supported.
 
-NOTE: currently, there is no support for sparse prior covariance matrix.
-We expect this to be supported in future versions of Stan.
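+
+As a minimal sketch (assuming a conditionally independent Poisson likelihood; the function name is illustrative), a likelihood function whose Hessian is diagonal, so that $B = 1$, looks like:
+
+```stan
+functions {
+  // each y[i] depends only on theta[i], so the Hessian
+  // d^2 log p(y | theta) / d theta^2 is diagonal (block size B = 1)
+  real ll_function(vector theta, array[] int y) {
+    return poisson_log_lpmf(y | theta);
+  }
+}
+```
+
+The block size is then communicated to the solver through the `hessian_block_size` argument of the embedded Laplace functions.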