## Goodness of Fit

In this post, we discuss the goodness of fit for a simple regression problem.

In nuclear physics, the data we collect are plotted in a histogram, and we have to fit the histogram to find the energies and resolutions. This is another class of goodness-of-fit problem.

In general, depending on how we treat the data, we can either:

• treat it as a regression problem, where the error of each data point follows a Poisson distribution, or
• treat it as a hypothesis test: what is the chance of observing the experimental distribution, given that the model distribution is the truth?

For the regression problem, there are two main metrics: one is the R-squared, and the other is the chi-squared.

For hypothesis testing, the experimental data is tested against the null hypothesis, which is the model distribution. A common method is Pearson's chi-squared test.

Note that the chi-squared and the chi-squared test are two different things, although their algorithms are very similar.

Another note: please keep in mind that many things are not discussed below; there are many details and assumptions behind them.

The R-squared is very simple; it measures how well the model captures the data. In a standard ANOVA (analysis of variance, which can be imagined as a partition of the SSTotal), we have $\displaystyle SSTotal = \sum_{i} (Y_i - \bar{Y} )^2 = \left( \sum_i Y_i^2 \right) - n \bar{Y}^2$ $\displaystyle SSR = \sum_{i} (Y_i - \hat{Y_i})^2$

The SSTotal is the sum of squares of "residuals" against the mean value, i.e., if the data has no feature, no peak, the SSTotal will be minimal. SSTotal represents the "size" of the features in the data: the larger the SSTotal, the more the data differs from the mean, and the more feature-rich it is.

The SSR is the sum of squares of residuals against the fitted function. If the model captures the data very well, the SSR will be minimal. Since the data can have some natural fluctuation, $MSR = SSR/ndf$ is an unbiased estimator of the "sample variance", where $ndf = n - p$ is the number of degrees of freedom, $n$ is the size of the data, and $p$ is the number of parameters of the model, or the size of the model. The SSTotal is the "size" of the features; the SSR is the "size" of what the model did not capture. The basis of ANOVA is the study of this partition of the data.

The difference between SSTotal and SSR is the "size" of the data captured by the model: $\displaystyle R^2 = \frac{SSTotal - SSR}{SSTotal} \le 1$

The idea of R-squared is that the numerator is the size the model captured from the data, and the denominator is the size of the features in the data. If the data is featureless, it comes mainly from natural fluctuation, so there is nothing for a model to capture, and the R-squared will be small.
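As a concrete illustration, here is a minimal Python sketch with made-up data (a noisy $y = x^2$ trend, fitted by the assumed model $\hat{y} = x^2$), computing SSTotal, SSR, and the R-squared by hand:

```python
# Toy example (hypothetical data): compute SSTotal, SSR, and R-squared.

xs = [0, 1, 2, 3, 4, 5]
ys = [0.1, 1.2, 3.9, 9.2, 15.8, 25.3]   # roughly y = x^2 plus noise

def model(x):
    return x * x  # assumed fitted model: y-hat = x^2

y_bar = sum(ys) / len(ys)

# SSTotal: residuals against the mean value (size of the features).
ss_total = sum((y - y_bar) ** 2 for y in ys)

# SSR: residuals against the fitted function (size of what the model missed).
ss_r = sum((y - model(x)) ** 2 for x, y in zip(xs, ys))

r_squared = (ss_total - ss_r) / ss_total
print(f"SSTotal = {ss_total:.2f}, SSR = {ss_r:.2f}, R^2 = {r_squared:.4f}")
```

Since the data has a strong feature (the quadratic rise) and the model captures it, the R-squared comes out close to 1.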

Chi-squared includes the sample variance, $\displaystyle \chi^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2}$

We can see that the $\chi^2$ is (roughly) the SSR divided by the sample variance $\sigma_i^2$. As discussed before, the SSR is the size of the natural fluctuation, i.e., the sample variance. If we divide the SSR by the sample variance, then $\frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2} \rightarrow 1$ when the fit is good. If the fit is not good, the SSR will be larger than the sample variance (as there is something the model did not capture); if the fit is "too good", the SSR will be much smaller than the sample variance.

For histogram data, the sample variance of each bin should follow Poisson statistics, so that $\sigma_i^2 = Y_i$.

The reduced chi-squared is $\displaystyle \bar{\chi^2} = \frac{\chi^2}{ndf}$

For $n$ data points, they live in an "n-dimensional" space. Suppose there is a model that generates these data points, and the model has $p$ parameters; the model then "occupies" a "p-dimensional" subspace. The remaining "(n-p)-dimensional" space is generated by the natural fluctuation. Thus, the "size" of the $\chi^2$ should be $(n-p)$, the degrees of freedom. For a good fit, $\displaystyle \bar{\chi^2} \approx 1$
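A minimal sketch with invented bin counts, assuming a 3-parameter peak model has already produced the fitted values, and using the Poisson errors $\sigma_i^2 = Y_i$ from above:

```python
# Sketch (hypothetical data): reduced chi-squared for a histogram fit,
# with Poisson errors sigma_i^2 = Y_i for each bin.

counts = [98, 205, 310, 395, 290, 210, 102]                 # observed bin counts
fitted = [100.0, 200.0, 300.0, 400.0, 300.0, 200.0, 100.0]  # model prediction per bin

n_params = 3                      # e.g. a Gaussian peak: amplitude, mean, width
ndf = len(counts) - n_params      # degrees of freedom, n - p

chi2 = sum((y - f) ** 2 / y for y, f in zip(counts, fitted))
chi2_reduced = chi2 / ndf

print(f"chi2 = {chi2:.3f}, ndf = {ndf}, reduced chi2 = {chi2_reduced:.3f}")
```

A reduced chi-squared far above 1 would signal that the model misses a feature; far below 1, that the errors are overestimated or the model overfits.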

Pearson's chi-squared test is a hypothesis test asking whether the data distribution can be captured by a model distribution; more technically, what is the probability of observing the experimental distribution, given that the model distribution is the truth. One limitation of this test is that the number of counts in each bin cannot be too small.

The test calculates the following quantity, $\displaystyle X^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\hat{Y_i}}$

The $X^2$ follows the $\chi^2$-distribution with $ndf$ degrees of freedom.

As in all hypothesis testing, we need to know the tail probability. For the $\chi^2$-distribution, only a one-tailed test is available, and this tail probability is usually called the p-value.

The meaning of the p-value is:

given that the null hypothesis, or the model distribution, is the truth, the probability that the experimental data happens.

When the p-value is less than 5%, the null hypothesis should be rejected. In other words, the model distribution cannot represent the experimental distribution; or, if the model distribution is the truth, there is less than a 5% chance of observing the experimental data.
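The test can be sketched in a few lines of Python. The observed and expected counts below are made up; the p-value uses the closed-form survival function of the $\chi^2$-distribution, which exists when the number of degrees of freedom is even:

```python
import math

# Sketch (hypothetical counts): Pearson's chi-squared test of observed bin
# counts against an expected model distribution. For an even number of
# degrees of freedom k, the chi-squared survival function has the closed form
#   P(X > x) = exp(-x/2) * sum_{i=0}^{k/2 - 1} (x/2)^i / i!

def chi2_sf_even(x, k):
    """Survival function (tail probability) of the chi-squared distribution, k even."""
    assert k % 2 == 0 and k > 0
    half = x / 2.0
    return math.exp(-half) * sum(half ** i / math.factorial(i) for i in range(k // 2))

observed = [8, 14, 20, 26, 20, 14, 8]          # experimental bin counts (invented)
expected = [10.0, 15.0, 20.0, 20.0, 20.0, 15.0, 10.0]  # model distribution, same total

x2 = sum((o - e) ** 2 / e for o, e in zip(observed, expected))
ndf = len(observed) - 1   # one constraint: the totals are matched; no fitted parameter
p_value = chi2_sf_even(x2, ndf)

print(f"X^2 = {x2:.3f}, ndf = {ndf}, p-value = {p_value:.3f}")
```

Here the p-value is far above 5%, so the null hypothesis is not rejected: the model distribution is compatible with the observed counts.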

## Multi-dimension Linear Regression

In science, collecting data and fitting it with a model is essential. The most common type of fitting is 1-dimensional fitting, where there is only one independent variable. By fitting, we usually mean the least-squares method.

Suppose we want to find the $n$ parameters in a linear function $f(x_1, x_2,\cdots, x_n) = \sum_{i=1}^{n} a_i x_i$

with $m$ observed experimental data points $Y_j = f(x_{1j}, x_{2j}, \cdots, x_{nj}) + \epsilon_j = \sum_{i=1}^{n} a_i x_{ij} + \epsilon_j$

Thus, we have a matrix equation $Y=X \cdot A + \epsilon$

where $Y$ is an m-dimensional data column vector, $A$ is an n-dimensional parameter column vector, and $X$ is an $m \times n$ non-square matrix.

In order to get the $n$ parameters, the number of data points must satisfy $m \ge n$. When $m = n$, it is not really a fitting, because the degrees of freedom $DF = m - n = 0$, so the fitting error is infinite.

The least-squares method in matrix algebra works like the following calculation. Multiply both sides by the transpose of $X$: $X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon$ $(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon$

Since the expectation of $\epsilon$ is zero, the expected parameter is $A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y$

The unbiased variance is $\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF$

where $DF$ is the degrees of freedom, the number of values that are free to vary. Many people are confused by the "−1" issue. In fact, if you only want to calculate the sum of squares of residuals, SSR, the degrees of freedom is always $m - n$.

The covariance of the estimated parameters is $Var(A) = \sigma^2 (X^T\cdot X)^{-1}$
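The whole recipe, $A = (X^T X)^{-1} X^T Y$ with residual variance over $DF = m - n$, can be sketched in Python with made-up data for a two-parameter model, so the $2 \times 2$ inverse can be written out explicitly:

```python
# Sketch (invented data): least-squares fit A = (X^T X)^{-1} X^T Y for a
# two-parameter linear model f(x1, x2) = a1*x1 + a2*x2, with the normal
# equations written out by hand (x1 is a constant column, i.e. an intercept).

# m = 5 observations, n = 2 parameters; each row of X is one observation.
X = [[1.0, 0.0],
     [1.0, 1.0],
     [1.0, 2.0],
     [1.0, 3.0],
     [1.0, 4.0]]
Y = [0.9, 3.1, 5.0, 6.9, 9.2]   # roughly a1 = 1, a2 = 2, plus noise

m, n = len(X), len(X[0])

# Normal equations: (X^T X) A = X^T Y; for n = 2 the inverse is explicit.
xtx = [[sum(X[k][i] * X[k][j] for k in range(m)) for j in range(n)] for i in range(n)]
xty = [sum(X[k][i] * Y[k] for k in range(m)) for i in range(n)]

det = xtx[0][0] * xtx[1][1] - xtx[0][1] * xtx[1][0]
a1 = (xtx[1][1] * xty[0] - xtx[0][1] * xty[1]) / det
a2 = (xtx[0][0] * xty[1] - xtx[1][0] * xty[0]) / det

# Unbiased residual variance: sigma^2 = (Y - X A)^T (Y - X A) / DF, DF = m - n.
residuals = [Y[k] - (a1 * X[k][0] + a2 * X[k][1]) for k in range(m)]
sigma2 = sum(r * r for r in residuals) / (m - n)

print(f"a1 = {a1:.3f}, a2 = {a2:.3f}, sigma^2 = {sigma2:.4f}")
```

The covariance of the parameters would then be $\sigma^2 (X^T X)^{-1}$, i.e. `sigma2` times the explicit inverse of `xtx` above.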

This is only a fast-food note on linear regression. It has a geometrical meaning: the column vectors of $X$ form the basis of a parameter subspace, and $Y$ lies a bit outside this subspace. Linear regression is a method to find the shortest distance from $Y$ to the subspace spanned by $X$.

The form of the variance can be understood using a Taylor series, or using the variance in matrix notation: $Var(A) = E\left( (A - E(A)) \cdot (A - E(A))^T \right)$.

## Goodness of Fit

We assume each data point is taken from a distribution with mean $\mu$ and variance $\sigma^2$: $Y\sim D(\mu, \sigma^2)$

in which, the mean can be a function of X.

For example, we have data $Y_i$ that has a relation with an independent variable $X_i$. We would like to know the relationship between $Y_i$ and $X_i$, so we fit a function $y = f(x)$.

After the fitting (least-squares method), we have a residual for each data point: $e_i = y_i - Y_i$

This residual should follow the distribution $e \sim D(0, \sigma_e^2)$

The goodness of fit is a measure of whether the distribution of the residuals agrees with the experimental error of each point, i.e. $\sigma$.

Thus, we divide the residuals by $\sigma$ and define the chi-squared $\chi^2 = \sum_i e_i^2/\sigma_{e_i}^2$.

We can see that the distribution of $e/\sigma_e \sim D(0, 1)$,

and the sum of the squares of these normalized residuals follows the chi-squared distribution (when $D$ is the normal distribution). It has a mean equal to the degrees of freedom $DF$. Note that the mean and the peak of the chi-squared distribution are not the same: the peak is at $DF - 2$.

In case we don't know the error, the sample variance of the residuals is our best estimator of the true variance. The unbiased sample variance is $\sigma_s^2 = \left( \sum_i e_i^2 \right)/DF$,

where $DF$ is the degrees of freedom. In the case of $f(x) = a x + b$, the $DF = n - 2$, because two degrees of freedom are used by the two fitted parameters $a$ and $b$.

## on angular momentum adding & rotation operator

Angular momentum comes in 2 kinds: the orbital angular momentum $L$, which is caused by a particle executing orbital motion in 3-dimensional space, and the spin $S$, which is an internal degree of freedom that lets the particle "orbit" in place.

Thus, a general quantum state for a particle should contain not just the spatial part and the time part, but also the spin, since a complete state should contain all degrees of freedom: $\left| \Psi \right> = \left| x,t \right> \bigotimes \left| s \right>$

When we "add" the orbital angular momentum and the spin together, we are actually doing: $J = L \bigotimes 1 + 1 \bigotimes S$

where the 1 with L is the identity of the spin-space and the 1 with S is the identity of the 3-D space.

The above is discussed in J.J. Sakurai's book.

The mathematics of $L$ and $S$ are completely the same under the rotation operator: $R_J (\theta) = \exp\left( - \frac {i}{\hbar} \theta J \right)$

where $J$ can be either $L$ or $S$.

The $L$ can only act on the spatial state, while $S$ can only act on the spin state, i.e.: $R_L(\theta) \left| s \right> = \left| s\right>$ $R_S(\theta) \left| x \right> = \left| x\right>$

The $L_z$ can only have integer eigenvalues, but $S_z$ can have both half-integer and integer eigenvalues. The half-integer values of $S_z$ make the spin state rotate 2 cycles in order to return to itself.
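This factor of two can be seen directly from the rotation operator: for a spin-1/2 state with $S_z$ eigenvalue $\hbar/2$,

```latex
R_S(2\pi)\left|s\right>
  = \exp\!\left(-\frac{i}{\hbar}\, 2\pi\, \frac{\hbar}{2}\right)\left|s\right>
  = e^{-i\pi}\left|s\right> = -\left|s\right>,
\qquad
R_S(4\pi)\left|s\right> = e^{-2\pi i}\left|s\right> = \left|s\right>,
```

so only a $4\pi$ (two-cycle) rotation brings the spin state back to itself.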

Thus, what if the difference between $L$ and $S$ is just man-made? The degrees of freedom in the spin space might actually come from some real geometry in a higher dimension, and the orbital angular momentum might actually change the spin state: $L \left| s \right> = \left | s' \right > = c \left| s \right>$

but the effect is so small, and $R_L (\theta) \left| s\right > = \exp\left( - \frac {i}{\hbar} \theta c \right)\left| s \right>$

The $c$ is very small, but if we rotate the state by a very large angle, its effect can be seen by comparing with the rotation by spin: $\left < R_L(\omega t) + R_S(\omega t) \right> = 2 ( 1+ \cos ( \omega ( c -1 ) t))$

The experiment can be done as follows: we apply a rotating magnetic field at the Larmor frequency. At very low temperature, the spin is isolated and $T_1$ and $T_2$ are effectively $\infty$. The difference due to $c$ will show up in a very long measurement, exhibiting an interference pattern.

If $c$ is a complex number, it will cause a decay, which will be reflected in the interference pattern.

If we can find this $c$, then we can reveal the other spatial dimension!

___________________________________

The problem is: how can we act with the orbital angular momentum on the spin without the effect of the spin angular momentum, since $L$ and $S$ are always coupled?

One possibility is to make the $S$ zero: in a system of an electron and a positron, the total spin can be zero.

Another possibility is to act with $S$ on the spatial part, and this will change the energy level.

__________________________________

A more fundamental problem is: why do $L$ and $S$ commute? The possibility of writing $\left| \Psi \right> = \left| x,t \right> \bigotimes \left| s \right>$

is due to the operators commuting with each other. But why?

If we break down $L$ into the position operator $x$ and the momentum operator $p$, the question becomes: why do $x$ and $S$ commute, or $p$ and $S$ commute? $[x,S]=0 ?$ $[p,S]=0 ?$ $[p_x, S_y] \ne 0 ?$

I will prove it later.
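A quick sketch, using the tensor-product structure above: in $\left| x,t \right> \bigotimes \left| s \right>$, the operator $x$ really means $x \otimes 1$ and $S$ means $1 \otimes S$, and operators acting on different tensor factors always commute:

```latex
[x \otimes 1,\; 1 \otimes S]
  = (x \otimes 1)(1 \otimes S) - (1 \otimes S)(x \otimes 1)
  = x \otimes S - x \otimes S
  = 0.
```

The same argument applies to $p$ and $S$.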

___________________________________

Another problem is how to evaluate the Poisson bracket, since $L$ and $S$ are not of the same dimension. Maybe we can write the eigenket in vector form: $\begin {pmatrix} \left|x, t \right> \\ \left|s\right> \end {pmatrix}$

I am not sure.

___________________________________

For any vector operator $V$, the following must hold, due to rotational symmetry: $[V_i, J_j] = i \hbar \epsilon_{ijk} V_k$ (with the indices running cyclically)

where $J$ is the rotation generator. But I am not sure whether this is restricted to real-space rotation. Anyway, spin is a vector operator, thus

$[S_x, L_y] = i \hbar S_z = -[S_y, L_x]$

So, $L$ and $S$ do not commute.