Model Fitting¶

Single Value¶

Consider an experiment designed to measures a value \(a\). Suppose that the experiment is performed \(N\) times, finding results

\[\begin{gather*} y_n = a + e_n \end{gather*}\]

where \(e_n\) is a random variable (see Random Variables). What do we learn about the parameter \(a\) from these experiments \(\vect{y} = (y_1, y_2, \dots, y_{N})\)? Key to this is the likelihood function \(\mathcal{L}(a|\vect{y}) = p(\vect{y}|a)\) which is the probability or likelihood of obtaining the data \(\vect{y}\) if the actual parameter value were \(a\).

In its full glory, if the errors are distributed according to the probability distribution function (PDF) \(p_e(\vect{e})\), then

\[\begin{gather*} \mathcal{L}(a|\vect{y}) = p(\vect{y}|a) = p_e(\overbrace{\vect{y}-a}^{\vect{e}}). \end{gather*}\]

Note that this is not a probability distribution for the parameter \(a\). To obtain such a distribution, called the posterior distribution, we must use Bayes’ theorem, and assume some prior distribution \(p(a)\) for \(a\) representing our prior knowledge about \(a\):

\[\begin{gather*} p(a|\vect{y}) = \frac{\mathcal{L}(a|\vect{y})p(a)}{p(\vect{y})} \end{gather*}\]

The denominator \(p(\vect{y})\) is here is the model evidence and is generally regarded simply as a normalization factor to ensure that \(\int p(a|\vect{y}) \d{a} = 1\).

This is what we can learn about the true value of \(a\) from the measurements. The resulting posterior distribution combines our prior knowledge – what we know about \(a\) before the experiment – with the experimental results, and normalizes these. All of what follows in model fitting comes from this result, using various approximations, or simplifications.

Curve Fitting¶

Expanding on this, consider a model \(y = f(x, \vect{a})\) depending on \(M\) parameters \(\vect{a}\), and a collection of measurements \(\vect{y} = f(\vect{x}, \vect{a}) + \vect{e}\) where again \(e_{n}\) are random variables with overall PDF \(p_e(\vect{e})\). Bayes’ theorem says the same thing:

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{\mathcal{L}(\vect{a}|\vect{y})p(\vect{a})}{p(\vect{y})} \end{gather*}\]

but the likelihood function is slightly more complicated

\[\begin{gather*} \mathcal{L}(\vect{a}|\vect{y}) = p(\vect{y}|\vect{a}) = p_e\bigr(\vect{y} - f(\vect{x}, \vect{a})\bigr). \end{gather*}\]

The Main Point¶

The entire point of model fitting is to determine and characterize the posterior distribution \(p(\vect{a}|\vect{y})\). In principle, using Bayes’ theorem is straightforward – simply multiply the distributions. The computational challenges arise as follows:

To explore the posterior, finding, for example, the maximum, and characterizing the shape.
To marginalize over nuisance parameters by integrating the posterior over these parameters.

In general, these tasks are not feasible analytically, but can be easily performed using simple (but often slow) Monte Carlo techniques. One case where analytic solutions are available is where the posterior is gaussian (a multivariate normal distribution):

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{ \exp\Bigl(-\frac{1}{2}(\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}})\Bigr) }{\sqrt{\det{(2\pi)\mat{C}}}}. \end{gather*}\]

Here \(\bar{\vect{a}}\) are the “best-fit” parameters, and \(\mat{C}\) is the covariance matrix. This will be the case if the errors \(p_e(\vect{e})\) and priors \(p(\vect{a})\) are gaussian, and the model \(f(\vect{x}, \vect{a}) \approx f(\vect{x}, \bar{\vect{a}}) + \mat{J}\cdot(\vect{a}-\bar{\vect{a}})\) is sufficiently linear close to the best-fit value.

In most cases, the posterior is not gaussian. However, the maximum of any generic function can still be expressed as a quadratic form. Inspired by the multivariate normal distribution, we consider instead the negative logarithm of the posterior, whose minimum is an exact quadratic form:

\[\begin{gather*} -\ln p(\vect{a}|\vect{y}) + \text{const} \approx \frac{1}{2} \overbrace{(\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}})}^{\Delta\chi^2}. \end{gather*}\]

This form is valid for virtually any form of posterior, as long as the final result is sufficiently well constrained.

Note

The quadratic portion denoted \(\Delta\chi^2\) by the brace here corresponds to the change in \(\chi^2\) when performing the usual least-squares fitting. We will discuss this more below.

Confidence regions are properly expressed in terms of a percentage \(p\), but often expressed in terms of \(n\sigma\) for a gaussian distribution where fraction \(p_{n}\) of the probability lies between \(\mu \pm n\sigma\) with mean \(\mu\) and standard deviation \(\sigma\). The relationship can be computed using the CDF of the normal gaussian distribution:

\[\begin{align*} c(x) &= \int_0^{x}\!\!\!\d{x}\; \frac{e^{-x^2/2}}{\sqrt{2\pi}},\\ p_{n} &= \int_{-n}^{n}\!\!\!\d{x}\;\frac{e^{-x^2/2}}{\sqrt{2\pi}},\\ &= c(n) - c(-n). \end{align*}\]

\(n\sigma\)	\(p_n\)
\(1\sigma\)	\(68.27%\)
\(2\sigma\)	\(95.45%\)
\(3\sigma\)	\(99.73%\)
\(4\sigma\)	\(99.99%\)

Confidence Regions¶

If your results are such that this form is valid, then to you can report \(\bar{\vect{a}}\) and \(\mat{C}\), but you need one more thing: confidence regions – regions in parameter space that contain a fraction \(p\) of the parameters. These can be expressed in terms of contours (ellipsoids) of this function:

\[\begin{gather*} \DeclareMathOperator{\CR}{CR} \vect{a} \in \CR(p) = \Bigl\{ \vect{a} \Big| (\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}}) \leq \Delta \chi^2(p) \Bigr\}, \end{gather*}\]

but we must also express how the contour levels of \(\Delta \chi^2(p)\) depend on the confidence level \(p\):

\[\begin{gather*} \int_{\vect{a} \in \CR(p)} p(\vect{a}|\vect{y}) = p. \end{gather*}\]

In the Gaussian case, this is the chi-squared distribution \(P_{\chi^2, \nu}(\chi^2)\) with \(\nu\) degrees of freedom corresponding to \(\nu\) parameters \(\vect{a}\) (see Chi-Squared Distribution for details):

\[\begin{gather*} p = \int_{0}^{\Delta\chi^2(p)}\d{\chi^2}\;P_{\chi^2, \nu}(\chi^2). \end{gather*}\]

In general, however, the tails of your posterior distribution will not be gaussian, and so you will also need to report appropriate contours \(\Delta\chi^2(p)\) for your desired set of confidence levels \(p\).

If your posterior is not well approximated by a quadratic over the regions of interest, then the confidence regions will not be ellipsoids, and you must work harder to characterize the final distribution. Markov-chain Monte Carlo MCMC is the tool for this job.

Marginalization¶

To marginalize over nuisance parameters, one integrates the posterior \(p(\vect{a}|\vect{y})\) over these, leaving the final posterior for the relevant parameters. This integration extends over the full parameter range, and so is sensitive to the tails of the distribution, which may have a different form for different parameter sets. For this reason, your may need to quote values for the contours \(\Delta\chi^2(p)\) for each set of relevant parameters.

I.e.: even if you provide the confidence level for a set of \(N>2\) relevant parameters, to obtain the pairwise confidence regions, you must still integrate the posterior over the full range of parameters, so the \(\nu=2\) contour \(\Delta\chi^2(p)\) may have no simple relationship to the full \(\nu = N\) contour.

Again, in the gaussian case, analytic results exist. Once you determine the posterior distribution for your complete set of parameters, marginalization is straightforward:

If needed, first transform your variables

\[\begin{gather*} \vect{a} \rightarrow \mat{S}\vect{a}, \qquad \mat{C} \rightarrow \mat{S}\mat{C}\mat{S}^{T}, \end{gather*}\]

so that the nuisance parameters you wish to marginalize over are orthogonal to the others. We will assume this has been done.
Simply project \(\vect{a}\) and \(\mat{C}\) (not \(\mat{C}^{-1}\)) onto the subspace of remaining parameters. I.e. just keep the appropriate rows and columns of \(\mat{C}\) and the remaining variables.

The contours are again given by the appropriate chi-squared distribution \(P_{\chi^2, \nu}(\chi^2)\) where \(\nu\) is the number of remaining variables.

import scipy.stats, scipy.integrate
sp = scipy

# Here we numerically check this by comparing the pdf 

N = 3
rng = np.random.default_rng(seed=1)
a = rng.random(size=N) - 0.5
C = rng.random(size=(N, N)) - 0.5
C = C @ C.T

rv1 = sp.stats.multivariate_normal(mean=a, cov=C)
rv2 = sp.stats.multivariate_normal(mean=a[1:], cov=C[1:, :][:, 1:])

# Tests that the PDFs are equivalent at some random points
a2 = rng.random(size=N-1) - 0.5

res, err = sp.integrate.quad(lambda x: rv1.pdf([x] + a2.tolist()),
                             -np.inf, np.inf)
assert err < 1e-8
assert np.allclose(res, rv2.pdf(a2), atol=2*err)

# Print table for margin.
c = sp.stats.norm().cdf
print(r"| $n\sigma$ | $p_n$     |")
print(r"|-----------|-----------|")
for n in range(1, 4):
    print(fr"| ${n}\sigma$ | ${100*(c(n) - c(-n)):.2f}%$ |")

| $n\sigma$ | $p_n$     |
|-----------|-----------|
| $1\sigma$ | $68.27%$ |
| $2\sigma$ | $95.45%$ |
| $3\sigma$ | $99.73%$ |

Summary¶

To fit a model \(y=f(x, \vect{a})\) depending on \(M\) parameters \(\vect{a}\) to a collection of measurements \(\vect{y} = f(\vect{x}, \vect{a}) + \vect{e}\) with errors \(\vect{e}\) distributed as \(p_e(\vect{e})\) we must characterize the posterior distribution:

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{\mathcal{L}(\vect{a}|\vect{y})p(\vect{a})}{p(\vect{y})}, \qquad \mathcal{L}(\vect{a}|\vect{y}) = p(\vect{y}|\vect{a}) = p_e\bigr(\vect{y} - f(\vect{x}, \vect{a})\bigr). \end{gather*}\]

If the variance in the parameters is small enough, the posterior will be well-approximated by a quadratic near its mode (maximum). We use the model

\[\begin{gather*} -2\ln p(\vect{a}|\vect{y}) + \text{const} \approx (\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}}) \leq \Delta\chi^2(p) \end{gather*}\]

to characterize the confidence regions corresponding to fraction \(p\) of the total distribution. The best fit parameters \(\vect{a}\) which maximize the posterior, the covariance matrix \(\mat{C}\) and the appropriate contour levels \(\Delta\chi^2(p)\) for your desired confidence levels \(p\) should be reported for all sets of relevant parameters.

If (an only if) your posterior is gaussian, then you may use the standard chi-squared distribution to obtain these contours \(\DeclareMathOperator{\CDF}{CDF}\Delta\chi^2(p) = \CDF^{-1}_{\chi^2,\nu}(p)\) from the inverse cumulative distribution function for the Chi-Squared Distribution of \(\nu\) parameters.

For more details see Model Fitting Details including a discussion of how this works when your errors are gaussian, how the Bayesian approach reproduces the standard least-squares approach with an appropriate choice of likelihood function and prior.

Physics 581: Physics Inspired Computational Techniques

Model Fitting

Contents