In pervious post, we shows the Gauss-Newton method for fitting non-linear function. The disadvantage of that method is that the inverse matrix could be ill-defined. This makes the method unstable.

Back to the basic, we want to minimize the sum of square of residual (SSR). The SSR is,

$SSR(\beta) = (Y - f(\beta))^T\cdot (Y-f(\beta))$

The derivative on $\beta$,

$\frac{d}{d\beta} SSR(\beta) = -2 (Y-f(\beta))^T \cdot \nabla f(\beta)$

Many literatures denote $\nabla f = J$, which is the Jacobian. The second derivative of $f$ is Hessian matrix $H = \nabla^2 f \sim J^T\cdot J$.

The Gradient Descent method is that ,

$h = \beta - \beta_0 = \alpha J^T \cdot (Y - f(\beta_0))$

where $\alpha$ is a step size. The gradient descent changes the SSR using the steepest path. The step size $\alpha$ has to be adjusted. The simplest way to adjust is testing the $\delta = SSR(\beta_0 + h) - SSR(\beta_0)$. If $\delta < 0$, the $\alpha$ increases, else decreases. This method is slow but stable. It is slow because of finding the $\alpha$. It is stable because the method is always computable.

Thus, we have 2 methods, Gradient Descent is stable and slow, Gauss-Newton method is unstable but fast. Levenberg-Marquardt Algorithm combined this 2 methods so that it is stable and fast by solving,

$(J^T \cdot J + \lambda I) h = J^T \cdot (Y - f)$

where $\lambda$ is an adjustable parameter. When $\lambda >> 1$ , the $J^T\cdot J$ is neglectable and the method becomes Gradient Descent with small $\alpha$. When the $\lambda << 1$, the method becomes Gauss-Newton method.

Usually, the $\lambda_0$ is small. The Gauss-Newton method is very good near the minimum of SSR, while Gradient Descent is better far away.

When the $\delta < 0$, $\lambda_{i+1} = \lambda_i / 10$, else $\lambda_{i+1} = \lambda_i * 10$. I don’t know the exact reason for this setting. In fact if you set oppositely, the method is still work in most cases.

The method add $\lambda I$ on the $J^T\cdot J$ , the inverse is always well-define. Therefore, this method is stable.