## Levenberg-Marquardt Algorithm

In a previous post, we showed the Gauss-Newton method for fitting non-linear functions. The disadvantage of that method is that the matrix to be inverted can be singular or ill-conditioned, which makes the method unstable.

Back to the basics: we want to minimize the sum of squared residuals (SSR). The SSR is $SSR(\beta) = (Y - f(\beta))^T\cdot (Y-f(\beta))$

The derivative with respect to $\beta$ is $\frac{d}{d\beta} SSR(\beta) = -2 (Y-f(\beta))^T \cdot \nabla f(\beta)$

In the literature, $\nabla f$ is usually denoted by $J$, the Jacobian. The second derivative gives the Hessian matrix, which is approximated as $H = \nabla^2 SSR \sim J^T\cdot J$.
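
To see where this approximation comes from, differentiate the gradient once more: $\nabla^2 SSR(\beta) = 2\, J^T\cdot J - 2 \sum_k (Y_k - f_k(\beta))\, \nabla^2 f_k(\beta)$. Near the minimum the residuals $Y_k - f_k(\beta)$ are small, so the second term is dropped; the remaining factor of 2 cancels against the factor of 2 in the gradient when forming the Newton step.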

The Gradient Descent update is $h = \beta - \beta_0 = \alpha J^T \cdot (Y - f(\beta_0))$

where $\alpha$ is a step size. Gradient descent decreases the SSR along the steepest path. The step size $\alpha$ has to be adjusted; the simplest way is to test $\delta = SSR(\beta_0 + h) - SSR(\beta_0)$: if $\delta < 0$, $\alpha$ increases, otherwise it decreases. This method is slow but stable. It is slow because of the search for $\alpha$; it is stable because the update is always computable.
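
As a concrete illustration, here is a minimal sketch of this adaptive-step gradient descent in Python/NumPy. The exponential model, the synthetic data, and the 1.5x/0.5x step adjustment are my own illustrative choices, not from the original post.

```python
import numpy as np

# Toy model: f(x; beta) = beta[0] * exp(-beta[1] * x)  (illustrative choice)
def f(beta, x):
    return beta[0] * np.exp(-beta[1] * x)

def jacobian(beta, x):
    # Analytic Jacobian of f with respect to beta, one row per data point
    return np.column_stack([np.exp(-beta[1] * x),
                            -beta[0] * x * np.exp(-beta[1] * x)])

def ssr(beta, x, y):
    r = y - f(beta, x)
    return r @ r

# Synthetic data with true parameters [2.0, 1.3]
rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = f([2.0, 1.3], x) + 0.05 * rng.standard_normal(x.size)

beta, alpha = np.array([1.0, 0.5]), 1e-3
for _ in range(500):
    J = jacobian(beta, x)
    h = alpha * J.T @ (y - f(beta, x))              # steepest-descent step
    delta = ssr(beta + h, x, y) - ssr(beta, x, y)
    if delta < 0:                                    # step improved the SSR: accept, grow alpha
        beta, alpha = beta + h, alpha * 1.5
    else:                                            # step made the SSR worse: reject, shrink alpha
        alpha *= 0.5

print(beta)   # slowly approaches the true values [2.0, 1.3]
```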

Thus, we have two methods: Gradient Descent is stable but slow, and the Gauss-Newton method is fast but unstable. The Levenberg-Marquardt Algorithm combines these two methods so that it is both stable and fast, by solving $(J^T \cdot J + \lambda I) h = J^T \cdot (Y - f)$

where $\lambda$ is an adjustable parameter. When $\lambda \gg 1$, the $J^T\cdot J$ term is negligible and the method becomes Gradient Descent with a small $\alpha$. When $\lambda \ll 1$, the method becomes the Gauss-Newton method.

Usually, the initial $\lambda_0$ is small. The Gauss-Newton method is very good near the minimum of the SSR, while Gradient Descent is better far away from it.

When $\delta < 0$, $\lambda_{i+1} = \lambda_i / 10$, otherwise $\lambda_{i+1} = 10\,\lambda_i$. I don't know the exact reason for this setting; in fact, if you set it the opposite way, the method still works in most cases.

Because the method adds $\lambda I$ to $J^T\cdot J$, the matrix is always invertible (for $\lambda > 0$). Therefore, this method is stable.
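
Putting the pieces together, here is a minimal sketch of the whole Levenberg-Marquardt loop, again with an illustrative exponential model and synthetic data (the specific model and numbers are assumptions for the example, not from the post).

```python
import numpy as np

# Same toy model as in the gradient-descent sketch (illustrative choice)
def f(beta, x):
    return beta[0] * np.exp(-beta[1] * x)

def jacobian(beta, x):
    return np.column_stack([np.exp(-beta[1] * x),
                            -beta[0] * x * np.exp(-beta[1] * x)])

def ssr(beta, x, y):
    r = y - f(beta, x)
    return r @ r

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
y = f([2.0, 1.3], x) + 0.05 * rng.standard_normal(x.size)

beta, lam = np.array([1.0, 0.5]), 1e-3       # small initial lambda, as suggested above
for _ in range(100):
    J = jacobian(beta, x)
    r = y - f(beta, x)
    # Solve (J^T J + lambda I) h = J^T r ; the lambda I term keeps the system invertible
    h = np.linalg.solve(J.T @ J + lam * np.eye(beta.size), J.T @ r)
    if ssr(beta + h, x, y) < ssr(beta, x, y):
        beta, lam = beta + h, lam / 10       # success: move toward Gauss-Newton
    else:
        lam *= 10                            # failure: move toward Gradient Descent

print(beta)   # should approach the true values [2.0, 1.3]
```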

## Non-linear Regression

The fit equation is $Y = f(A) + \epsilon$

We assume that near $Y$, the curved surface traced out by $f(A)$ can be approximated by a plane. Thus, using a Taylor series, $Y = f(A_0) + F(A_0) \cdot (A - A_0) + \cdots$,

where $F(A_0)$ is the Jacobian (the matrix of first derivatives) of $f(A)$ at $A_0$.

Using the same technique as in linear regression, $A - A_0 = (F(A_0)^T \cdot F(A_0))^{-1} \cdot F(A_0)^T \cdot ( Y-f(A_0))$

With an initial guess, the iteration should approach the best estimated parameter $\hat{A}$.

The covariance is $Var(A) = \sigma^2 (F(A)^T \cdot F(A))^{-1}$

The above method is also called the Gauss-Newton method.
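
For comparison, this plain Gauss-Newton iteration is the $\lambda = 0$ case of the Levenberg-Marquardt loop above. A minimal sketch, again with an assumed exponential model and a deliberately close initial guess (plain Gauss-Newton is not guaranteed to converge from a poor one):

```python
import numpy as np

def f(A, x):                         # illustrative model: A[0] * exp(-A[1] * x)
    return A[0] * np.exp(-A[1] * x)

def F(A, x):                         # Jacobian of f at A
    return np.column_stack([np.exp(-A[1] * x),
                            -A[0] * x * np.exp(-A[1] * x)])

rng = np.random.default_rng(0)
x = np.linspace(0, 4, 50)
Y = f([2.0, 1.3], x) + 0.05 * rng.standard_normal(x.size)

A = np.array([1.8, 1.0])             # initial guess chosen close to the truth
for _ in range(20):
    FA = F(A, x)
    # A_new - A = (F^T F)^{-1} F^T (Y - f(A))
    A = A + np.linalg.solve(FA.T @ FA, FA.T @ (Y - f(A, x)))

resid = Y - f(A, x)
sigma2 = resid @ resid / (Y.size - A.size)            # unbiased variance estimate
cov = sigma2 * np.linalg.inv(F(A, x).T @ F(A, x))     # Var(A)
print(A, np.sqrt(np.diag(cov)))
```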

## Multi-dimension Linear Regression

In science, collecting data and fitting it with a model is essential. The most common type of fitting is 1-dimensional fitting, where there is only one independent variable. By fitting, we usually mean the least-squares method.

Suppose we want to find the $n$ parameters of a linear function $f(x_1, x_2,\cdots, x_n) = \sum_{i=1}^{n} a_i x_i$

with $m$ observed experimental data points $Y_j = f(x_{1j}, x_{2j}, \cdots, x_{nj}) + \epsilon_j = \sum_{i=1}^{n} a_i x_{ij} + \epsilon_j$

Thus, we have a matrix equation $Y=X \cdot A + \epsilon$

where $Y$ is an $m$-dimensional data column vector, $A$ is an $n$-dimensional parameter column vector, and $X$ is an $m \times n$ (in general non-square) matrix.

In order to estimate the $n$ parameters, the number of data points must satisfy $m \ge n$. When $m = n$, it is not really a fit, because the degrees of freedom are $DF = m - n = 0$ and the estimated fitting error blows up.

The least-squares method in matrix algebra is a short calculation. Multiply both sides by the transpose of $X$: $X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon$, so $(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon$

Since the expectation of $\epsilon$ is zero, the estimated parameter is $A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y$

The unbiased variance is $\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF$

where $DF$ is the degrees of freedom, i.e. the number of values that are free to vary. Many people are confused by the "-1" issue; in fact, when computing the sum of squared residuals (SSR) of a fit with $n$ parameters, the degrees of freedom are always $m - n$.

The covariance of the estimated parameters is $Var(A) = \sigma^2 (X^T\cdot X)^{-1}$
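
A quick numerical sanity check of these formulas in Python/NumPy; the design matrix, true parameters, and noise level below are made up for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)
m, n = 100, 3                          # m data points, n parameters
X = rng.standard_normal((m, n))        # design matrix (illustrative)
A_true = np.array([1.5, -2.0, 0.7])
Y = X @ A_true + 0.1 * rng.standard_normal(m)

XtX = X.T @ X
A_hat = np.linalg.solve(XtX, X.T @ Y)  # A = (X^T X)^{-1} X^T Y
resid = Y - X @ A_hat
DF = m - n
sigma2 = resid @ resid / DF            # unbiased variance
cov_A = sigma2 * np.linalg.inv(XtX)    # Var(A) = sigma^2 (X^T X)^{-1}

print(A_hat)                           # close to A_true
print(np.sqrt(np.diag(cov_A)))         # standard errors of the parameters
```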

These are only fast-food notes on linear regression. There is a geometrical meaning: the column vectors of $X$ form a basis of a subspace, and $Y$ lies a bit outside that subspace. Linear regression is a method to find the shortest distance from $Y$ to the subspace spanned by $X$.
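
In this picture, the fitted values are $\hat{Y} = X \cdot (X^T\cdot X)^{-1} \cdot X^T \cdot Y$, i.e. $Y$ multiplied by the orthogonal projection matrix onto the column space of $X$, and the residual $Y - \hat{Y}$ is perpendicular to that subspace.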

The form of the variance can be understood using a Taylor series, or directly from the definition of variance in matrix notation, $Var(A) = E\big[ (A - E(A)) \cdot (A - E(A))^T \big]$.
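
Concretely, from the solution above, $\hat{A} - A = (X^T\cdot X)^{-1} \cdot X^T \cdot \epsilon$, so $Var(A) = (X^T\cdot X)^{-1} \cdot X^T \cdot E(\epsilon \cdot \epsilon^T) \cdot X \cdot (X^T\cdot X)^{-1} = \sigma^2 (X^T\cdot X)^{-1}$, using $E(\epsilon \cdot \epsilon^T) = \sigma^2 I$.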