In the field of science, collecting data and fitting it with model is essential. The most common type of fitting is 1-dimensional fitting, as there is only one independent variable. By fitting, we usually mean the least-squared method.

Suppose we want to find the n parameters in a linear function

f(x_1, x_2,\cdots, x_n) = \sum_{i=1} a_i x_i

with m observed experimental data

Y_j = f(x_{1j}, x_{2j}, \cdot, x_{nj} + \epsilon_j= \sum_{i=1} a_i x_{ij}+ \epsilon_j

Thus, we have a matrix equation

Y=X \cdot A + \epsilon

where Y is a m-dimensional data column vector, A is a n-dimensional parameter column vector, and X is a n-m non-square matrix.

In order to get the n parameter, the number of data m >= n. when m=n, it is not really a fitting because of degree-of-freedom is DF = m-n = 0, so that the fitting error is infinity.

The least square method in matrix algebra is like calculation. Take both side with transpose of X

X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon

(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon

Since the expectation of the \epsilon is zero. Thus the expected parameter is

A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y

The unbiased variance is

\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF

where DF is the degree of freedom, which is the number of value that are free to vary. Many people will confuse by the “-1” issue. In fact, if you only want to calculate the sum of square of residual SSR, the degree of freedom is always m - n.

The covariance of the estimated parameters is

Var(A) = \sigma^2 (X^T\cdot X)^{-1}

This is only a fast-food notices on the linear regression. This has a geometrical meaning  that the matrix X is the sub-space of parameters with basis formed by the column vectors of X. Y is a bit out-side the sub-space. The linear regression is a method to find the shortest distance from Y to the sub-space X .

The from of the variance can be understood using Taylor series. This can be understood using variance in matrix notation Var(A) = E( A - E(A) )^T \cdot E(A  - E(A)) .