Goodness of Fit

We assume each data point is drawn from a distribution with mean \mu and variance \sigma^2,

Y\sim D(\mu, \sigma^2)

in which the mean can be a function of X.

For example, we have data Y_i that depend on an independent variable X_i. We would like to know the relationship between Y_i and X_i, so we fit a function y = f(x).

After the fit (by the least-squares method), each data point has a residual between the fitted value y_i = f(X_i) and the data Y_i,

e_i = y_i - Y_i

This residual should follow the distribution

e \sim D(0, \sigma_e^2)

The goodness of fit is a measure of whether the distribution of the residuals agrees with the experimental error of each point, i.e. \sigma.

Thus, we divide each residual by its \sigma and define the chi-squared

\chi^2 = \sum_i e_i^2 / \sigma_{e_i}^2 .
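As a small worked example of this definition, the sketch below computes \chi^2 for three made-up points. The numbers and the function name `chi_squared` are illustrative assumptions, chosen so that every residual is exactly one \sigma away from the fit.

```python
def chi_squared(fitted, observed, sigmas):
    """Sum of squared residuals, each divided by its own variance."""
    return sum(((y - Y) / s) ** 2 for y, Y, s in zip(fitted, observed, sigmas))

# Hypothetical data: each residual equals one sigma in magnitude.
fitted = [1.0, 2.0, 3.0]      # y_i = f(X_i)
observed = [1.1, 1.8, 3.3]    # Y_i
sigmas = [0.1, 0.2, 0.3]      # sigma_{e_i}

chi2 = chi_squared(fitted, observed, sigmas)
print(f"chi-squared = {chi2:.6f}")
```

Since each normalized residual has magnitude 1, the three terms add up to 3, up to floating-point rounding.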

We can see that the normalized residual follows

e/\sigma_e \sim D(0, 1)

and, if the residuals are normally distributed, the sum of the squares of these normalized residuals follows the chi-squared distribution. Its mean is the number of degrees of freedom DF. Note that the mean and the peak of the chi-squared distribution are not the same: the peak is at DF-2 (for DF \geq 2).
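The statement about the mean can be checked numerically. The following is a minimal simulation sketch, assuming the normalized residuals are independent standard normal variables; the seed, DF = 5 and the number of trials are arbitrary choices.

```python
import random
import statistics

def chi_squared_sample(df, rng):
    """Sum of squares of `df` independent standard-normal draws."""
    return sum(rng.gauss(0.0, 1.0) ** 2 for _ in range(df))

rng = random.Random(0)   # fixed seed, arbitrary choice
df = 5
samples = [chi_squared_sample(df, rng) for _ in range(20000)]

mean_chi2 = statistics.fmean(samples)
print(f"DF = {df}, simulated mean of chi-squared = {mean_chi2:.3f}")
```

The simulated mean should come out close to DF, with a statistical scatter of a few hundredths for this number of trials.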

In case we don’t know the error, the sample variance of the residuals is our best estimator of the true variance. The unbiased sample variance is

\sigma_s^2 = \sum_i e_i^2 / DF ,

where DF is the number of degrees of freedom. In the case of f(x) = a x + b, DF = n-2, because two parameters, a and b, are estimated from the data and each one uses up one degree of freedom.
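For the straight-line fit, it can help to check numerically which divisor makes the variance estimate unbiased. The sketch below is a small simulation, not part of the original analysis; the true line 3x + 1, \sigma = 2, n = 10, the seed and the helper names are all arbitrary assumptions.

```python
import random

def fit_line(xs, ys):
    """Ordinary least-squares fit of y = a*x + b; returns (a, b)."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    a = sxy / sxx
    return a, my - a * mx

def residual_sum_of_squares(xs, ys):
    """Sum of e_i^2 with e_i = f(X_i) - Y_i for the fitted line."""
    a, b = fit_line(xs, ys)
    return sum((a * x + b - y) ** 2 for x, y in zip(xs, ys))

rng = random.Random(1)          # fixed seed, arbitrary choice
n, true_sigma = 10, 2.0
xs = [float(i) for i in range(n)]
trials = 20000

# Average the residual sum of squares over many simulated data sets.
mean_rss = 0.0
for _ in range(trials):
    ys = [3.0 * x + 1.0 + rng.gauss(0.0, true_sigma) for x in xs]
    mean_rss += residual_sum_of_squares(xs, ys) / trials

for divisor in (n, n - 1, n - 2):
    print(f"divide by {divisor}: average variance estimate = {mean_rss / divisor:.3f}")
```

In this sketch the average estimate is closest to the true \sigma^2 = 4 when the divisor is n minus the number of fitted parameters.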

Weighted mean and error

We have n values x_i, each with error \sigma_i.

With weights w_i, the weighted mean and its error, for uncorrelated measurements, are

X= \sum x_i w_i / \sum w_i

S^2 = \sum w_i^2 \sigma_i^2 / (\sum w_i)^2

When combining data, the weighting is

w_i = 1/\sigma_i^2

and the weighted error becomes

S^2 = \sum{\frac{1}{\sigma_i^2}} / (\sum{\frac{1}{\sigma_i^2}})^2 = 1 / \sum \frac{1}{\sigma_i^2}
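The inverse-variance weighting can be written as a short function. This is a minimal sketch; the function name `weighted_mean` and the two example measurements are made up for illustration.

```python
import math

def weighted_mean(values, sigmas):
    """Inverse-variance weighted mean X and its error S."""
    weights = [1.0 / s ** 2 for s in sigmas]          # w_i = 1 / sigma_i^2
    total = sum(weights)
    mean = sum(w * x for w, x in zip(weights, values)) / total
    error = math.sqrt(1.0 / total)                    # S^2 = 1 / sum(1 / sigma_i^2)
    return mean, error

# Hypothetical measurements: 10 +/- 1 and 12 +/- 2.
mean, error = weighted_mean([10.0, 12.0], [1.0, 2.0])
print(f"X = {mean:.4f}, S = {error:.4f}")
```

The more precise measurement dominates: the weights are 1 and 0.25, so X = 13/1.25 = 10.4 and S = \sqrt{0.8}, which is smaller than either individual error.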


When we measure a quantity n times, we can assume the intrinsic error of each measurement is the same, \sigma_i = \sigma_0. Thus,

w_i = 1/n

X = \sum x_i / n

S^2 = \sum \sigma_0^2/n^2 = \sigma_0^2 /n

Therefore, when we take more and more data, the error is proportional to 1/\sqrt{n}.
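The 1/\sqrt{n} scaling can be seen in a quick simulation. The sketch below assumes Gaussian measurements with \sigma_0 = 1; the seed, the sample sizes and the helper name `spread_of_mean` are arbitrary choices.

```python
import random
import statistics

def spread_of_mean(n, sigma0, trials, rng):
    """Standard deviation of the mean of n measurements, over many repeats."""
    means = []
    for _ in range(trials):
        sample = [rng.gauss(0.0, sigma0) for _ in range(n)]
        means.append(statistics.fmean(sample))
    return statistics.stdev(means)

rng = random.Random(2)   # fixed seed, arbitrary choice
sigma0 = 1.0
results = {n: spread_of_mean(n, sigma0, 4000, rng) for n in (4, 16, 64)}

for n, s in results.items():
    print(f"n = {n:2d}: spread of mean = {s:.4f}, sigma0/sqrt(n) = {sigma0 / n ** 0.5:.4f}")
```

Each fourfold increase in n should roughly halve the spread of the mean.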

For a sample of size n from a normal distribution, the estimators of the mean and the variance are the sample mean and the sample variance,

X =\sum x_i/n

S^2 = \sum (x_i-X)^2 / (n-1)

Don’t mix up the sample variance and intrinsic error, although they are very similar.

To explain the formula of the weighted variance, we have to go to the foundation of the algebra of distributions.

For a random variable that follows a distribution with mean \mu and variance \sigma^2,

X \sim D(\mu, \sigma^2)

Another random variable built on it,

Z=aX+b \sim D(a \mu + b, a^2\sigma^2)

The derivation is very simple and is given on this page.

The sum of two independent random variables is

Z=aX + bY \sim D(a \mu_X + b \mu_Y, a^2\sigma_X^2 + b^2 \sigma_Y^2)
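This rule for independent variables can be checked with a short simulation. The sketch assumes Gaussian X and Y; the coefficients, means, widths and seed are arbitrary illustrative values.

```python
import random
import statistics

rng = random.Random(3)   # fixed seed, arbitrary choice
a, b = 2.0, -1.0
mu_x, sigma_x = 1.0, 0.5
mu_y, sigma_y = 4.0, 1.5

# Draw X and Y independently and form Z = a X + b Y.
z = [a * rng.gauss(mu_x, sigma_x) + b * rng.gauss(mu_y, sigma_y)
     for _ in range(50000)]

mean_z = statistics.fmean(z)
var_z = statistics.variance(z)
expected_mean = a * mu_x + b * mu_y                       # -2.0
expected_var = a ** 2 * sigma_x ** 2 + b ** 2 * sigma_y ** 2   # 3.25

print(f"mean: simulated {mean_z:.3f}, expected {expected_mean:.3f}")
print(f"variance: simulated {var_z:.3f}, expected {expected_var:.3f}")
```

Note that the variances add even though b is negative, because the coefficients enter squared.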

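But there is a catch: the rule requires X and Y to be independent. It does not apply when Y is the same random variable as X, because then Z = aX + bX = (a+b)X has variance (a+b)^2 \sigma_X^2. Having the same mean and variance is not the problem by itself: independent repeated measurements share \mu and \sigma and the rule still holds, which is what gives \sigma_0^2/n above.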


Maximum Likelihood

In data analysis, especially when the number of data points is small, the maximum likelihood method is a mathematical tool for finding the parameters of the distribution that best fit the data.

The idea can be found on Wikipedia. For illustration, I generate 100 data points from a Gaussian distribution with mean = 1 and sigma = 2.

In Mathematica,

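Data = RandomVariate[NormalDistribution[1, 2], 100];
(* each row is {mean, sigma, log-likelihood}; summing the logs avoids underflow of the product *)
MaxLikeliHood = Flatten[Table[{mean, sigma,
    Sum[Log[PDF[NormalDistribution[mean, sigma], Data[[i]]]] // N, {i, 1, 100}]},
   {mean, -3, 3, 1}, {sigma, 0.5, 3.5, 0.5}], 1]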

This calculates a table with mean from -3 to 3 in steps of 1 and sigma from 0.5 to 3.5 in steps of 0.5. To find the maximum of the log-likelihood in the table:

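maxL = MaxLikeliHood[[All, 3]];  (* column of log-likelihood values *)
maxN = Position[maxL, Max[maxL]]
MaxLikeliHood[[maxN[[1, 1]]]]  (* row {mean, sigma, log-likelihood} at the maximum *)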

The result is the grid point with mean = 1 and sigma = 2, which is the correct mean and sigma.
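For readers without Mathematica, the same grid search can be sketched in Python. This is an illustrative translation under the same assumptions (Gaussian model, the same grid), not the original code; the function names and the seed are made up, and the log-likelihood is computed as a sum of logs rather than the log of a product.

```python
import math
import random

def log_likelihood(data, mean, sigma):
    """Sum of log Gaussian PDFs evaluated at each data point."""
    norm = -math.log(sigma * math.sqrt(2.0 * math.pi))
    return sum(norm - 0.5 * ((x - mean) / sigma) ** 2 for x in data)

def grid_search(data, means, sigmas):
    """Return (mean, sigma, log-likelihood) of the best grid point."""
    rows = ((m, s, log_likelihood(data, m, s)) for m in means for s in sigmas)
    return max(rows, key=lambda row: row[2])

means = list(range(-3, 4))                    # -3 ... 3, step 1
sigmas = [0.5 + 0.5 * k for k in range(7)]    # 0.5 ... 3.5, step 0.5

rng = random.Random(4)   # fixed seed, arbitrary choice
data = [rng.gauss(1.0, 2.0) for _ in range(100)]

best_mean, best_sigma, best_logl = grid_search(data, means, sigmas)
print(f"best grid point: mean = {best_mean}, sigma = {best_sigma}, log L = {best_logl:.2f}")
```

Because the grid is coarse and the sample is finite, a particular random sample can occasionally land on a neighbouring grid point; a finer grid or a numerical optimizer would give the continuous maximum.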