Goodness of Fit

In this post, we discussed the Goodness of Fit for a simple regression problem.

In nuclear physics, the data we collect are usually plotted as a histogram, and we fit the histogram to find the energies and resolutions. This is another class of Goodness of Fit problem.

In general, it depends on how we treat the data. We can

  • treat it as a regression problem, in which the error of each data point follows the Poisson distribution, or
  • treat it as a hypothesis test: what is the chance of observing the experimental distribution, given that the model distribution is true?

For the regression problem, there are 2 main metrics: one is the R-squared, and the other is the chi-squared.

For the hypothesis test, the experimental data is tested against the null hypothesis, which is the model distribution. A common method is Pearson’s chi-squared test.

Note that the chi-squared and the chi-squared test are two different things, although the calculations are very similar.

Please also keep in mind that many details and underlying assumptions are not discussed below.

The R-squared is very simple: it measures how well the model captures the data. In a standard ANOVA (analysis of variance, which can be imagined as the partition of the SSTotal), we have

\displaystyle SSTotal = \sum_{i} (Y_i - \bar{Y} )^2 = \left( \sum_i Y_i^2 \right) - n \bar{Y}^2

\displaystyle SSR = \sum_{i} (Y_i - \hat{Y_i})^2

The SSTotal is the sum of squares of the “residuals” against the mean value, i.e., if the data has no feature and no peak, the SSTotal will be at its minimum. SSTotal represents the “size” of the features in the data. The larger the SSTotal, the more the data differs from the mean value, and the more feature-rich it is.

The SSR is the sum of squares of the residuals against the fitted function. If the model captures the data very well, SSR will be at its minimum. Since the data can have some natural fluctuation, MSR = SSR/ndf is the unbiased estimator of the “sample variance”, where ndf is the number of degrees of freedom.

Here n is the size of the data and p is the number of parameters of the model, or the size of the model, so that ndf = n - p. The SSTotal is the “size” of the features, and SSR is the “size” of what was not captured by the model. The basis of ANOVA is the study of the partition of the data.

The difference between SSTotal and SSR is the “size” of the data captured by the model.

\displaystyle R^2 = \frac{SSTotal - SSR}{SSTotal} \leq 1

The idea of R-squared is that the numerator is the size that the model captured from the data, and the denominator is the size of the features in the data. If the data is featureless, it comes mainly from the natural fluctuation, so there is nothing a model can capture, and R-squared will be small.
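As a minimal sketch of the definitions above, the following computes SSTotal, SSR and R-squared for a small made-up data set. The data values and the line y = 2x + 1 used as the "fitted" function are assumptions for illustration only; the line is not the result of an actual fit.

```python
def r_squared(y, y_fit):
    """Return (SSTotal, SSR, R^2) for data y and fitted values y_fit."""
    y_mean = sum(y) / len(y)
    ss_total = sum((yi - y_mean) ** 2 for yi in y)          # residuals against the mean
    ssr = sum((yi - fi) ** 2 for yi, fi in zip(y, y_fit))   # residuals against the model
    return ss_total, ssr, (ss_total - ssr) / ss_total


# Hypothetical data that roughly follow y = 2x + 1.
x = [0, 1, 2, 3, 4]
y = [1.1, 2.9, 5.2, 6.8, 9.0]
y_fit = [2 * xi + 1 for xi in x]

ss_total, ssr, r2 = r_squared(y, y_fit)
print(ss_total, ssr, r2)
```

Here the model captures almost all of the "size" of the data, so R-squared comes out close to 1.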

Chi-squared includes the sample variance,

\displaystyle \chi^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2}

We can see that the \chi^2 is (roughly) the SSR divided by the sample variance \sigma_i^2 . As discussed before, for a good model the SSR is the size of the natural fluctuation, i.e. the sample variance. So, if we divide the SSR by the sample variance, each term \frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2} is about 1 on average when the fit is good. If the fit is not good, the SSR will be larger than the sample variance (as there is something the model did not capture); if the fit is too good, the SSR will be much smaller than the sample variance.

For histogram data, the count in each bin follows Poisson statistics, so the variance of each bin is \sigma_i^2 = Y_i .

The reduced chi-squared is

\displaystyle \bar{\chi^2} = \frac{\chi^2}{ndf}

The n data points live in an “n-dimensional” space. Suppose there is a model that generates these data points, and the model has p parameters. Thus, the model “occupies” a “p-dimensional” space. The remaining “(n-p)-dimensional” space is generated by the natural fluctuation. Thus, the “size” of the \chi^2 should be (n-p), the degree of freedom. For a good fit

\displaystyle \bar{\chi^2} \approx 1
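The chi-squared and reduced chi-squared for histogram data can be sketched as follows, using the Poisson variance \sigma_i^2 = Y_i . The bin counts and the flat model are hypothetical, and skipping empty bins (where \sigma_i^2 = 0) is a simplification assumed here, not something discussed in this post.

```python
def chi_squared(counts, model, n_params):
    """Return (chi2, ndf, reduced chi2) with Poisson errors sigma_i^2 = Y_i."""
    chi2 = 0.0
    n_used = 0
    for y, f in zip(counts, model):
        if y <= 0:
            continue  # sigma_i^2 = 0 cannot be used; empty bins are skipped (assumption)
        chi2 += (y - f) ** 2 / y
        n_used += 1
    ndf = n_used - n_params  # degrees of freedom: n - p
    return chi2, ndf, chi2 / ndf


# Hypothetical histogram and a flat model with 1 fitted parameter (the level).
counts = [90, 110, 105, 95, 100, 98]
model = [100.0] * len(counts)

chi2, ndf, chi2_reduced = chi_squared(counts, model, n_params=1)
print(chi2, ndf, chi2_reduced)
```

A reduced chi-squared well below 1, as in this toy example, would suggest that the fluctuation is smaller than the Poisson errors assumed.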

Pearson’s chi-squared test asks whether the data distribution can be captured by a model distribution. More technically, what is the probability of observing the experimental distribution, given that the model distribution is true? One limitation of this test is that the number of counts in each bin cannot be too small.

The test calculates the following quantity,

\displaystyle X^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\hat{Y_i}}

The X^2 follows the \chi^2-distribution with ndf degrees of freedom.

As in all hypothesis tests, we need to know the probability of the tail. For the \chi^2-distribution, only the upper tail is used, and its probability is usually called the p-value.

(Figure: an example of the \chi^2-distribution.)

The meaning of the p-value is:

given that the null hypothesis, i.e. the model distribution, is true, the probability of obtaining an X^2 at least as large as the one from the experimental data.

When the p-value is less than 5% (a conventional threshold), the null hypothesis is rejected. In other words, the model distribution cannot represent the experimental distribution: if the model distribution were true, there would be less than a 5% chance of observing data that deviate this much.
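The test can be sketched with the standard library alone. The helper `chi2_sf` below is a hypothetical name; it evaluates the upper-tail probability of the \chi^2-distribution through the power series of the incomplete gamma function. The observed counts are made up, and taking ndf as the number of bins minus 1 assumes that only the total count of the model was fixed to the data.

```python
import math


def chi2_sf(x, ndf):
    """Upper-tail probability P(chi^2 >= x) for ndf degrees of freedom."""
    if x <= 0:
        return 1.0
    a, z = ndf / 2.0, x / 2.0
    # Series for the lower incomplete gamma function:
    # gamma(a, z) = z^a e^-z * sum_n z^n / (a (a+1) ... (a+n))
    term = 1.0 / a
    total = term
    for n in range(1, 1000):
        term *= z / (a + n)
        total += term
        if term < 1e-16 * total:
            break
    lower = total * math.exp(-z + a * math.log(z) - math.lgamma(a))
    return max(0.0, 1.0 - lower)


def pearson_test(observed, expected, ndf):
    """Return (X^2, p-value) of Pearson's chi-squared test."""
    x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
    return x2, chi2_sf(x2, ndf)


# Hypothetical counts tested against a flat model distribution.
observed = [18, 22, 25, 15, 20]
expected = [20.0] * 5
x2, p_value = pearson_test(observed, expected, ndf=len(observed) - 1)
print(x2, p_value)
```

In this toy case the p-value is far above 5%, so the flat model would not be rejected.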

Multi-dimension Linear Regression

In science, collecting data and fitting it with a model is essential. The most common type of fitting is 1-dimensional fitting, where there is only one independent variable. By fitting, we usually mean the least-squares method.

Suppose we want to find the n parameters in a linear function

f(x_1, x_2,\cdots, x_n) = \sum_{i=1}^{n} a_i x_i

with m observed experimental data

Y_j = f(x_{1j}, x_{2j}, \cdots, x_{nj}) + \epsilon_j= \sum_{i=1}^{n} a_i x_{ij}+ \epsilon_j

Thus, we have a matrix equation

Y=X \cdot A + \epsilon

where Y is an m-dimensional data column vector, A is an n-dimensional parameter column vector, and X is an m \times n matrix, which in general is non-square.

In order to get the n parameters, the number of data points must satisfy m >= n. When m = n, it is not really a fit, because the degree of freedom is DF = m-n = 0, so the fitting error cannot be estimated.

The least-squares method in matrix algebra looks like an ordinary calculation. Multiply both sides from the left by the transpose of X,

X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon

(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon

Since the expectation of \epsilon is zero, the expected parameter is

A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y

The unbiased variance is

\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF

where DF is the degree of freedom, which is the number of values that are free to vary. Many people are confused by the “-1” issue. In fact, if you only want to calculate the sum of squares of the residuals SSR, the degree of freedom is always m - n.

The covariance of the estimated parameters is

Var(A) = \sigma^2 (X^T\cdot X)^{-1}
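The formulas above can be sketched in plain Python. The small matrix helpers (`transpose`, `matmul`, `inverse`) are written here only for illustration and are fine for small, well-conditioned problems; a numerical library would normally be used instead. The data are hypothetical, with x_1 = 1 acting as a constant column so that the fit is a straight line.

```python
def transpose(m):
    return [list(row) for row in zip(*m)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def inverse(m):
    """Gauss-Jordan inverse with partial pivoting."""
    n = len(m)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(n)]
           for i, row in enumerate(m)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col:
                f = aug[r][col]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def least_squares(X, Y):
    """Return (A, sigma^2, Var(A)) for Y = X.A + epsilon, with X an m x n matrix."""
    m, n = len(X), len(X[0])
    Xt = transpose(X)
    XtX_inv = inverse(matmul(Xt, X))
    A = [row[0] for row in matmul(XtX_inv, matmul(Xt, [[y] for y in Y]))]
    residuals = [y - sum(a * x for a, x in zip(A, row)) for y, row in zip(Y, X)]
    sigma2 = sum(r * r for r in residuals) / (m - n)   # DF = m - n
    cov = [[sigma2 * v for v in row] for row in XtX_inv]
    return A, sigma2, cov


# Hypothetical data: Y = a_1 * 1 + a_2 * x + noise.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
Y = [1.1, 2.9, 5.2, 6.8, 9.0]
X = [[1.0, x] for x in xs]

A, sigma2, cov = least_squares(X, Y)
print(A, sigma2, cov)
```

The diagonal of `cov` gives the variances of the fitted parameters, and the off-diagonal element gives their covariance.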

These are only fast-food notes on linear regression. There is a geometrical meaning: the column vectors of the matrix X form the basis of a sub-space, and Y lies a bit outside this sub-space. Linear regression is a method to find the shortest distance from Y to the sub-space spanned by X.

The form of the variance can be understood using a Taylor series, or using the variance in matrix notation Var(A) = E\left[ (A - E(A)) \cdot (A - E(A))^T \right] .




Goodness of Fit

We assume each data point is taken from a distribution with mean \mu and variance \sigma^2

Y\sim D(\mu, \sigma^2)

in which, the mean can be a function of X.

For example, we have data Y_i that are related to an independent variable X_i. We would like to know the relationship between Y_i and X_i, so we fit a function y = f(x).

After the fitting (least-squares method), we will have some residual for each data point

e_i = y_i - Y_i

This residual should follow the distribution

e \sim D(0, \sigma_e^2)

The goodness of fit is a measure of whether the distribution of the residuals agrees with the experimental error of each point, i.e. \sigma .

Thus, we divide the residual by \sigma and define the chi-squared

\chi^2 = \sum_i e_i^2 / \sigma_{e_i}^2 .

We can see that the distribution of

e/\sigma_e \sim D(0, 1)

and, when D is a normal distribution, the sum of the squares of these quantities follows the chi-squared distribution. It has a mean equal to the degree of freedom DF. Note that the mean and the peak of the chi-squared distribution are not the same: the peak is at DF-2 (for DF \geq 2).
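As a rough numerical check, assuming the normalized residuals are standard normal, one can simulate sums of DF squared residuals and compare the sample mean with DF. The sketch also locates the peak of the chi-squared density on a grid; for DF \geq 2 the density peaks at DF - 2. The choice DF = 5, the seed and the sample size are arbitrary.

```python
import math
import random


def chi2_pdf(x, df):
    """Density of the chi-squared distribution with df degrees of freedom (x > 0)."""
    half = df / 2.0
    return math.exp((half - 1) * math.log(x) - x / 2
                    - half * math.log(2) - math.lgamma(half))


df = 5
rng = random.Random(1)  # fixed seed so the run is repeatable

# Each sample is the sum of df squared standard-normal residuals e / sigma.
samples = [sum(rng.gauss(0, 1) ** 2 for _ in range(df)) for _ in range(20000)]
sample_mean = sum(samples) / len(samples)

# Locate the peak of the density on a fine grid from 0.01 to 20.
grid = [0.01 * k for k in range(1, 2001)]
mode = max(grid, key=lambda x: chi2_pdf(x, df))

print(sample_mean)  # close to df
print(mode)         # close to df - 2
```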

In the case where we don’t know the error, the sample variance of the residuals is our best estimator of the true variance. The unbiased sample variance is

\sigma_s^2 = \frac{1}{DF} \sum_i e_i^2 ,

where DF is the degree of freedom. In the case of f(x) = a x + b, DF = n-2 , because each of the two fitted parameters, a and b, uses up 1 degree of freedom.

on angular momentum adding & rotation operator


Angular momentum comes in 2 kinds: the orbital angular momentum L , which is caused by a particle executing orbital motion in 3-dimensional space, and the spin S , which is an internal degree of freedom, as if the particle were “orbiting” in place.

Thus, a general quantum state for a particle should contain not just the spatial part and the time part, but also the spin, since a complete state should contain all degrees of freedom.

\left| \Psi \right> = \left| x,t \right> \bigotimes \left| s \right>

When we “add” the orbital angular momentum and the spin together, we are actually doing:

J = L \bigotimes 1 + 1 \bigotimes S

where the 1 next to L is the identity of the spin space and the 1 next to S is the identity of the 3-D space.
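The tensor-product construction can be made concrete with small matrices. The sketch below assumes an l = 1 orbital space (3 x 3) and an s = 1/2 spin space (2 x 2) in units of \hbar , which is an illustrative choice and not something fixed by the text. It builds J_z and also checks that, in this construction, an operator of the form L \otimes 1 commutes with one of the form 1 \otimes S, since they act on different factors.

```python
def kron(a, b):
    """Kronecker (tensor) product of two matrices given as nested lists."""
    return [[x * y for x in row_a for y in row_b] for row_a in a for row_b in b]


def identity(n):
    return [[1 if i == j else 0 for j in range(n)] for i in range(n)]


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def add(a, b):
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def sub(a, b):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


# Assumed example: l = 1 and s = 1/2, in units of hbar.
Lz = [[1, 0, 0], [0, 0, 0], [0, 0, -1]]
Sz = [[0.5, 0], [0, -0.5]]
Sx = [[0, 0.5], [0.5, 0]]

# J_z = L_z (x) 1 + 1 (x) S_z acts on the 6-dimensional product space.
Jz = add(kron(Lz, identity(2)), kron(identity(3), Sz))
jz_values = [Jz[i][i] for i in range(6)]  # diagonal entries m_l + m_s

# Commutator of L_z (x) 1 with 1 (x) S_x.
A = kron(Lz, identity(2))
B = kron(identity(3), Sx)
commutator = sub(matmul(A, B), matmul(B, A))
is_zero = all(v == 0 for row in commutator for v in row)

print(jz_values)
print(is_zero)
```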

The above is discussed in J. J. Sakurai’s book.

The mathematics of L and S is completely the same as far as the rotation operator is concerned.

R_J (\theta) = Exp( - \frac {i}{\hbar} \theta J)

where J can be either L or S.

L can only act on the spatial state, while S can only act on the spin state, i.e.:

R_L(\theta) \left| s \right> = \left| s\right>

R_S(\theta) \left| x \right> = \left| x\right>

L_z can only have integer values, but S_z can have both half-integer and integer values. The half-integer values of S_z mean that the spin state has to be rotated by 2 full cycles in order to be the same again.
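The 2-cycle property can be checked directly. For an eigenstate of J_z with eigenvalue m \hbar , the rotation operator about the z-axis reduces to the phase Exp( - i \theta m ). The function name `rotation_phase` is a hypothetical helper for this sketch.

```python
import cmath
import math


def rotation_phase(theta, m):
    """Phase picked up by a J_z eigenstate (eigenvalue m * hbar) under R(theta)."""
    return cmath.exp(-1j * theta * m)


for m in (1, 0.5):
    one_cycle = rotation_phase(2 * math.pi, m)
    two_cycles = rotation_phase(4 * math.pi, m)
    print(m, one_cycle, two_cycles)
```

For integer m the state returns to itself after one cycle, while for m = 1/2 it picks up a factor of -1 after one cycle and only returns to itself after two.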

Thus, what if the difference between L and S is just man-made? The degree of freedom in the spin space may actually come from some real geometry in a higher dimension, and the orbital angular momentum may actually change the spin state:

L \left| s \right> = \left | s' \right > = c \left| s \right>

but the effect is so small that

R_L (\theta) \left| s\right > = Exp( - \frac {i}{\hbar} \theta c )\left| s \right>

Since c is very small, the effect can only be seen by rotating the state through a very large angle and comparing it with the rotation by spin.

\left < R_L(\omega t) + R_S(\omega t) \right> = 2 ( 1+ \cos ( \omega ( c -1 ) t ) )

The experiment can be done as follows. We apply a rotating magnetic field at the Larmor frequency. At a very low temperature, the spin is isolated, and T_1 and T_2 are effectively \infty . The difference due to c will show up in a very long measurement as an interference pattern.

If c is a complex number, it will cause a decay, and this will be reflected in the interference pattern.

If we find this c, then we can reveal the other spatial dimension!


The problem is: how can we act with the orbital angular momentum on the spin without the effect of the spin angular momentum, since L and S are always coupled?

One possibility is to make S zero. In a system of an electron and a positron, the total spin can be zero.

Another possibility is to act with S on the spatial part, which would change the energy level.


A more fundamental problem is: why do L and S commute? The possibility of writing

\left| \Psi \right> = \left| x,t \right> \bigotimes \left| s \right>

is due to the operators commuting with each other. But why?

If we break L down into the position operator x and the momentum operator p, the question becomes: why do x and S commute, or p and S commute?

[x,S]=0 ?

[p,S]=0 ?

[p_x, S_y] \ne 0 ?

I will prove it later.


Another problem is how to evaluate the commutator, since L and S do not act on spaces of the same dimension. Maybe we can write the eigenket in vector form:

\begin {pmatrix} \left|x, t \right> \\ \left|s\right> \end {pmatrix}

I am not sure.



Any vector operator must satisfy the following equation, due to rotation symmetry.

[V_i, J_j] = i \hbar V_k   (i, j, k cyclic)


where J is the generator of rotation. But I am not sure whether it is restricted to real-space rotation. Anyway, spin is a vector operator, thus

[S_x, L_y] = i \hbar S_z = - [S_y, L_x]

So, if this relation applies with J = L, then L and S do not commute.