Model Fitting

Model Fitting#

These notes closely follow Chapter 15.6 “Confidence Limits on Estimated Model Parameters” of [Press et al., 2007] which I highly recommend you read in parallel. In addition, we have a series of explicit examples following the discussion here in sec:ModelFittingEg.

TL;DR#

Here is a brief overview. We consider an experiment measuring a quantity \(a\) (which could be a vector of quantities) and a dataset \(y_n = a+e_n\) where \(e_n\) are some sort of errors. What can we say about \(a\) from the data \(\vect{y} = (y_1, y_2, \dots, y_N)\) in various cases? In particular:

If we have a good model for the distribution of the errors \(e_n\): i.e. we know the probability density function (PDF) \(p_e(\vect{e}; a)\). Here we will use Bayes’ theorem, discuss maximum likelihood estimation, and least squares fitting. We characterize what we learn about \(a\) in terms of PDFs and the confidence regions/intervals by using simple Monte Carlo techniques to calibrate various estimators \(A(\vect{y})\) for the parameter \(a\).
What if we do not know \(p_e(e)\) but believe they are identical and independently distributed (idd):

\[\begin{gather*} p_e(\vect{e}) = \prod_{n=1}^{N}p(e_n). \end{gather*}\]

Here we use bootstrapping.

The general idea will be to find some estimator \(A(\vect{y})\) for the quantity \(a\), then to caibrate this estimator so we can provide a reliable estimate for \(a\) including confidence regions. While in some cases this can be done analytically, this is generally complicated, so here we focus on establishing these results using simple Monte Carlo techniques.

Single Value#

We start with the simplest example: Consider an experiment designed to measures a single value \(a\). Suppose that the experiment is performed \(N\) times, finding results

\[\begin{gather*} y_n = a + e_n \end{gather*}\]

where \(e_n\) are random variable (see Random Variables). What can we learn about the parameter \(a\) from these experiments \(\vect{y} = (y_1, y_2, \dots, y_{N})\)?

To proceed, we need an estimator \(A(\vect{y})\): some function of the data \(\vect{y}\) that provides an “estimate” for the true parameter \(a\). A common example might be the mean \(A(\vect{y}) = \sum_{n}y_n/N\).

If we know how the errors are distributed, i.e according to the probability distribution function (PDF) \(p_e(\vect{e};a)\), then we can compute or simulate the distribution of the estimator \(p_A\bigl(A;a\bigr)\). This distribution, and the dependence on \(a\) forms a calibration of the estimator. Once this calibration is known, then from a given data set, we can invert the calibration and find an estimate with appropriate confidence regions for the true parameter \(a\).

Key to this is the likelihood function \(\mathcal{L}(a|\vect{y}) = p(\vect{y}|a)\) which is the probability or likelihood of obtaining the data \(\vect{y}\) if the actual parameter value were \(a\).

In its full glory, if the errors are distributed according to the probability distribution function (PDF) \(p_e(\vect{e};a)\), then

\[\begin{gather*} \mathcal{L}(a|\vect{y}) = p(\vect{y}|a) = p_e(\overbrace{\vect{y}-a}^{\vect{e}};a). \end{gather*}\]

Note that this is not a probability distribution for the parameter \(a\). To obtain such a distribution, called the posterior distribution, we must use Bayes’ theorem, and assume some prior distribution \(p(a)\) for \(a\) representing our prior knowledge about \(a\):

\[\begin{gather*} p(a|\vect{y}) = \frac{\mathcal{L}(a|\vect{y})p(a)}{p(\vect{y})} \end{gather*}\]

The denominator \(p(\vect{y})\) is here is the model evidence and is generally regarded simply as a normalization factor to ensure that \(\int p(a|\vect{y}) \d{a} = 1\).

This is what we can learn about the true value of \(a\) from the measurements. The resulting posterior distribution combines our prior knowledge – what we know about \(a\) before the experiment – with the experimental results, and normalizes these. All of what follows in model fitting comes from this result, using various approximations, or simplifications.

Curve Fitting#

Expanding on this, consider a model \(y = f(x, \vect{a})\) depending on \(M\) parameters \(\vect{a}\), and a collection of measurements \(\vect{y} = f(\vect{x}, \vect{a}) + \vect{e}\) where again \(e_{n}\) are random variables with overall PDF \(p_e(\vect{e};\vect{a})\). Bayes’ theorem says the same thing:

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{\mathcal{L}(\vect{a}|\vect{y})p(\vect{a})}{p(\vect{y})} \end{gather*}\]

but the likelihood function is slightly more complicated

\[\begin{gather*} \mathcal{L}(\vect{a}|\vect{y}) = p(\vect{y}|\vect{a}) = p_e\bigr(\vect{y} - f(\vect{x}, \vect{a});\vect{a}\bigr). \end{gather*}\]

The Main Point#

The entire point of model fitting is to determine and characterize the posterior distribution \(p(\vect{a}|\vect{y})\). In principle, using Bayes’ theorem is straightforward – simply multiply the distributions. The computational challenges arise as follows:

To explore the posterior, finding, for example, the maximum, and characterizing the shape.
To marginalize over nuisance parameters by integrating the posterior over these parameters.

In general, these tasks are not feasible analytically, but can be easily performed using simple (but often slow) Monte Carlo techniques. One case where analytic solutions are available is where the posterior is gaussian (a multivariate normal distribution):

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{ \exp\Bigl(-\frac{1}{2}(\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}})\Bigr) }{\sqrt{\det{(2\pi)\mat{C}}}}. \end{gather*}\]

Here \(\bar{\vect{a}}\) are the “best-fit” parameters, and \(\mat{C}\) is the covariance matrix. This will be the case if the errors \(p_e(\vect{e})\) and priors \(p(\vect{a})\) are gaussian, and the model \(f(\vect{x}, \vect{a}) \approx f(\vect{x}, \bar{\vect{a}}) + \mat{J}\cdot(\vect{a}-\bar{\vect{a}})\) is sufficiently linear close to the best-fit value.

In most cases, the posterior is not gaussian. However, the maximum of any generic function can still be expressed as a quadratic form. Inspired by the multivariate normal distribution, we consider instead the negative logarithm of the posterior, whose minimum is an exact quadratic form:

\[\begin{gather*} -\ln p(\vect{a}|\vect{y}) + \text{const} \approx \frac{1}{2} \overbrace{(\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}})}^{\Delta\chi^2}. \end{gather*}\]

This form is valid for virtually any form of posterior, as long as the final result is sufficiently well constrained.

Note

The quadratic portion denoted \(\Delta\chi^2\) by the brace here corresponds to the change in \(\chi^2\) when performing the usual least-squares fitting. We will discuss this more below.

Confidence regions are properly expressed in terms of a percentage \(p\), but often expressed in terms of \(n\sigma\) for a gaussian distribution where fraction \(p_{n}\) of the probability lies between \(\mu \pm n\sigma\) with mean \(\mu\) and standard deviation \(\sigma\). The relationship can be computed using the CDF of the normal gaussian distribution:

\[\begin{align*} c(x) &= \int_0^{x}\!\!\!\d{x}\; \frac{e^{-x^2/2}}{\sqrt{2\pi}},\\ p_{n} &= \int_{-n}^{n}\!\!\!\d{x}\;\frac{e^{-x^2/2}}{\sqrt{2\pi}},\\ &= c(n) - c(-n). \end{align*}\]

\(n\sigma\)	\(p_n\)
\(1\sigma\)	\(68.27%\)
\(2\sigma\)	\(95.45%\)
\(3\sigma\)	\(99.73%\)
\(4\sigma\)	\(99.99%\)

Confidence Regions#

If your results are such that this form is valid, then to you can report \(\bar{\vect{a}}\) and \(\mat{C}\), but you need one more thing: confidence regions – regions in parameter space that contain a fraction \(p\) of the parameters. These can be expressed in terms of contours (ellipsoids) of this function:

\[\begin{gather*} \DeclareMathOperator{\CR}{CR} \vect{a} \in \CR(p) = \Bigl\{ \vect{a} \Big| (\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}}) \leq \Delta \chi^2(p) \Bigr\}, \end{gather*}\]

but we must also express how the contour levels of \(\Delta \chi^2(p)\) depend on the confidence level \(p\):

\[\begin{gather*} \int_{\vect{a} \in \CR(p)} p(\vect{a}|\vect{y}) = p. \end{gather*}\]

In the Gaussian case, this is the chi-squared distribution \(P_{\chi^2, \nu}(\chi^2)\) with \(\nu\) degrees of freedom corresponding to \(\nu\) parameters \(\vect{a}\) (see Chi-Squared Distribution for details):

\[\begin{gather*} p = \int_{0}^{\Delta\chi^2(p)}\d{\chi^2}\;P_{\chi^2, \nu}(\chi^2). \end{gather*}\]

In general, however, the tails of your posterior distribution will not be gaussian, and so you will also need to report appropriate contours \(\Delta\chi^2(p)\) for your desired set of confidence levels \(p\).

If your posterior is not well approximated by a quadratic over the regions of interest, then the confidence regions will not be ellipsoids, and you must work harder to characterize the final distribution. Markov-chain Monte Carlo MCMC is the tool for this job.

Marginalization#

To marginalize over nuisance parameters, one integrates the posterior \(p(\vect{a}|\vect{y})\) over these, leaving the final posterior for the relevant parameters. This integration extends over the full parameter range, and so is sensitive to the tails of the distribution, which may have a different form for different parameter sets. For this reason, your may need to quote values for the contours \(\Delta\chi^2(p)\) for each set of relevant parameters.

I.e.: even if you provide the confidence level for a set of \(N>2\) relevant parameters, to obtain the pairwise confidence regions, you must still integrate the posterior over the full range of parameters, so the \(\nu=2\) contour \(\Delta\chi^2(p)\) may have no simple relationship to the full \(\nu = N\) contour.

Again, in the gaussian case, analytic results exist. Once you determine the posterior distribution for your complete set of parameters, marginalization is straightforward:

If needed, first transform your variables

\[\begin{gather*} \vect{a} \rightarrow \mat{S}\vect{a}, \qquad \mat{C} \rightarrow \mat{S}\mat{C}\mat{S}^{T}, \end{gather*}\]

so that the nuisance parameters you wish to marginalize over are orthogonal to the others. We will assume this has been done.
Simply project \(\vect{a}\) and \(\mat{C}\) (not \(\mat{C}^{-1}\)) onto the subspace of remaining parameters. I.e. just keep the appropriate rows and columns of \(\mat{C}\) and the remaining variables.

The contours are again given by the appropriate chi-squared distribution \(P_{\chi^2, \nu}(\chi^2)\) where \(\nu\) is the number of remaining variables.

Summary#

To fit a model \(y=f(x, \vect{a})\) depending on \(M\) parameters \(\vect{a}\) to a collection of measurements \(\vect{y} = f(\vect{x}, \vect{a}) + \vect{e}\) with errors \(\vect{e}\) distributed as \(p_e(\vect{e})\) we must characterize the posterior distribution:

\[\begin{gather*} p(\vect{a}|\vect{y}) = \frac{\mathcal{L}(\vect{a}|\vect{y})p(\vect{a})}{p(\vect{y})}, \qquad \mathcal{L}(\vect{a}|\vect{y}) = p(\vect{y}|\vect{a}) = p_e\bigr(\vect{y} - f(\vect{x}, \vect{a})\bigr). \end{gather*}\]

If the variance in the parameters is small enough, the posterior will be well-approximated by a quadratic near its mode (maximum). We use the model

\[\begin{gather*} -2\ln p(\vect{a}|\vect{y}) + \text{const} \approx (\vect{a} - \bar{\vect{a}})^T\cdot \mat{C}^{-1}\cdot (\vect{a} - \bar{\vect{a}}) \leq \Delta\chi^2(p) \end{gather*}\]

to characterize the confidence regions corresponding to fraction \(p\) of the total distribution. The best fit parameters \(\vect{a}\) which maximize the posterior, the covariance matrix \(\mat{C}\) and the appropriate contour levels \(\Delta\chi^2(p)\) for your desired confidence levels \(p\) should be reported for all sets of relevant parameters.

If (an only if) your posterior is gaussian, then you may use the standard chi-squared distribution to obtain these contours \(\DeclareMathOperator{\CDF}{CDF}\Delta\chi^2(p) = \CDF^{-1}_{\chi^2,\nu}(p)\) from the inverse cumulative distribution function for the Chi-Squared Distribution of \(\nu\) parameters.

For more details see Model Fitting Details including a discussion of how this works when your errors are gaussian, how the Bayesian approach reproduces the standard least-squares approach with an appropriate choice of likelihood function and prior.