A reminder of Exponential function

I suddenly found that the exponential function is quite wonderful: it is the eigenfunction of the differential operator. Because of that, it appears almost everywhere, since the differential operator itself appears almost everywhere.

The definition of the exponential $e$ is,

$\displaystyle e = \lim_{n\to\infty} \left(1+\frac{1}{n} \right)^n$

this is 100% growth, compounded continuously over a unit of time. It is easy to see that

$\displaystyle e^x = \lim_{n\to\infty} \left(1+\frac{x}{n} \right)^n$

$e^x e^y = e^{x+y}$

expanding,

$\displaystyle \left(1+\frac{x}{n} \right)^n = \sum_{r=0}^{n} C_r^n \left(\frac{x}{n} \right)^r = \sum_{r=0}^{n} \left(\prod_{k=0}^{r-1} \frac{n-k}{n} \right)\frac{x^r}{r!}$,

Taking $n\to\infty$, each factor in the product tends to 1, so

$\displaystyle e^x = \sum_{r=0}^{\infty} \frac{x^r}{r!}$

It is easy to see that

$\displaystyle \frac{d}{dx} e^x = e^x$
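As a quick numerical sanity check (a minimal Python sketch, not part of the derivation), the series and the limit definition can be compared against each other and against the property $e^x e^y = e^{x+y}$:

```python
import math

# Partial sum of the series e^x = sum_{r} x^r / r!
def exp_series(x, terms=30):
    return sum(x**r / math.factorial(r) for r in range(terms))

# The limit definition e^x = lim (1 + x/n)^n, truncated at a large n
def exp_limit(x, n=10**6):
    return (1 + x / n)**n

print(exp_series(1.0))   # converges very fast to e = 2.71828...
print(exp_limit(1.0))    # approaches e, but only at rate ~ e/(2n)
```

The series converges much faster than the limit form, which is why the series is the one used in practice.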

Levenberg-Marquardt Algorithm

In a previous post, we showed the Gauss-Newton method for fitting a non-linear function. The disadvantage of that method is that the matrix to be inverted can be ill-conditioned. This makes the method unstable.

Back to the basic, we want to minimize the sum of square of residual (SSR). The SSR is,

$SSR(\beta) = (Y - f(\beta))^T\cdot (Y-f(\beta))$

The derivative on $\beta$,

$\frac{d}{d\beta} SSR(\beta) = -2 (Y-f(\beta))^T \cdot \nabla f(\beta)$

Much of the literature denotes $\nabla f = J$, which is the Jacobian. In the Gauss-Newton approximation, the second-derivative (Hessian) matrix of the SSR is $H \approx J^T\cdot J$, dropping the second derivatives of $f$.

The Gradient Descent method is,

$h = \beta - \beta_0 = \alpha J^T \cdot (Y - f(\beta_0))$

where $\alpha$ is a step size. Gradient descent reduces the SSR along the steepest path, but the step size $\alpha$ has to be adjusted. The simplest adjustment is to test $\delta = SSR(\beta_0 + h) - SSR(\beta_0)$: if $\delta < 0$, increase $\alpha$; otherwise decrease it. This method is slow but stable. It is slow because of the search for $\alpha$; it is stable because the update is always computable.

Thus, we have two methods: Gradient Descent is stable but slow, and the Gauss-Newton method is fast but unstable. The Levenberg-Marquardt algorithm combines these two methods, so that it is both stable and fast, by solving,

$(J^T \cdot J + \lambda I) h = J^T \cdot (Y - f)$

where $\lambda$ is an adjustable parameter. When $\lambda \gg 1$, the $J^T\cdot J$ term is negligible and the method becomes Gradient Descent with a small step size $\alpha = 1/\lambda$. When $\lambda \ll 1$, the method becomes the Gauss-Newton method.

Usually, the initial $\lambda_0$ is small, since the Gauss-Newton method is very good near the minimum of the SSR, while Gradient Descent is better far away.

When $\delta < 0$, set $\lambda_{i+1} = \lambda_i / 10$; otherwise $\lambda_{i+1} = \lambda_i \times 10$. I don't know the exact reason for this setting. In fact, if you set it the opposite way, the method still works in most cases.

Since the method adds $\lambda I$ to $J^T\cdot J$, the inverse is always well-defined (for $\lambda > 0$). Therefore, this method is stable.
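The update rule above can be sketched in a few lines of Python. This is a minimal illustration, not a production implementation: the exponential-decay model $f(x) = a e^{-bx}$, the noise-free data, and the starting values are all made up for the example.

```python
import numpy as np

# Hypothetical model: f(x; a, b) = a * exp(-b x), with analytic Jacobian
def model(beta, x):
    a, b = beta
    return a * np.exp(-b * x)

def jacobian(beta, x):
    a, b = beta
    e = np.exp(-b * x)
    return np.column_stack([e, -a * x * e])   # df/da, df/db

def levenberg_marquardt(x, y, beta0, lam=1e-3, steps=100):
    beta = np.asarray(beta0, dtype=float)
    ssr = np.sum((y - model(beta, x))**2)
    for _ in range(steps):
        r = y - model(beta, x)
        J = jacobian(beta, x)
        # solve (J^T J + lambda I) h = J^T (Y - f)
        h = np.linalg.solve(J.T @ J + lam * np.eye(len(beta)), J.T @ r)
        new_ssr = np.sum((y - model(beta + h, x))**2)
        if new_ssr < ssr:              # delta < 0: accept step, shrink lambda
            beta, ssr, lam = beta + h, new_ssr, lam / 10
        else:                          # delta >= 0: reject step, grow lambda
            lam *= 10
    return beta

# Noise-free synthetic data, so the fit should recover the true parameters
x = np.linspace(0, 5, 20)
y = model([3.0, 0.5], x)
beta_hat = levenberg_marquardt(x, y, beta0=[1.0, 1.0])
```

Note how the $\lambda$ bookkeeping is exactly the $\delta$ test described above: a successful step pushes the method toward Gauss-Newton, a failed step pushes it toward small-step Gradient Descent.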

Non-linear Regression

The fit equation is

$Y = f(A) + \epsilon$

We assume that near $Y$, the curved surface traced out by $f(A)$ can be approximated by a plane. Thus, using a Taylor series,

$Y = f(A_0) + F(A_0) \cdot (A - A_0) + \cdots$,

where $F(A_0)$ is the Jacobian (the matrix of first derivatives) of $f(A)$ at $A_0$.

Using the same technique as in linear regression,

$A - A_0 = (F(A_0)^T \cdot F(A_0))^{-1} \cdot F(A_0)^T \cdot ( Y-f(A_0))$

With an initial guess, the iteration should approach the best estimated parameter $\hat{A}$.

The covariance is

$Var(A) = \sigma^2 (F(A)^T \cdot F(A))^{-1}$

The above method is also called Gauss-Newton method.
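The iteration can be sketched as follows. This is a toy example with a made-up saturation model and noise-free data; the Jacobian is approximated by finite differences, which is often convenient when an analytic form is tedious.

```python
import numpy as np

def gauss_newton(f, A0, Y, steps=50, eps=1e-7):
    A = np.asarray(A0, dtype=float)
    for _ in range(steps):
        r = Y - f(A)
        # Numerical Jacobian F(A): column j = d f / d A_j (forward difference)
        F = np.column_stack([(f(A + eps * np.eye(len(A))[j]) - f(A)) / eps
                             for j in range(len(A))])
        # A - A0 = (F^T F)^-1 F^T (Y - f(A0))
        A = A + np.linalg.solve(F.T @ F, F.T @ r)
    return A

# Hypothetical saturation model f(x) = A1 * x / (A2 + x), noise-free data
x = np.linspace(0.5, 10, 25)
f = lambda A: A[0] * x / (A[1] + x)
Y = f(np.array([4.0, 2.0]))
A_hat = gauss_newton(f, A0=[3.0, 1.0], Y=Y)
```

With a reasonable starting guess the iteration converges quickly; with a bad one, $F^T F$ can be nearly singular, which is exactly the instability that Levenberg-Marquardt cures.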

Multi-dimension Linear Regression

In science, collecting data and fitting it with a model is essential. The most common type of fitting is 1-dimensional fitting, as there is only one independent variable. By fitting, we usually mean the least-squares method.

Suppose we want to find the n parameters in a linear function

$f(x_1, x_2,\cdots, x_n) = \sum_{i=1}^{n} a_i x_i$

with m observed experimental data

$Y_j = f(x_{1j}, x_{2j}, \cdots, x_{nj}) + \epsilon_j = \sum_{i=1}^{n} a_i x_{ij} + \epsilon_j$

Thus, we have a matrix equation

$Y=X \cdot A + \epsilon$

where $Y$ is an m-dimensional data column vector, $A$ is an n-dimensional parameter column vector, and $X$ is an $m \times n$ (generally non-square) matrix.

In order to get the $n$ parameters, the number of data points must satisfy $m \ge n$. When $m = n$, it is not really a fitting, because the degree of freedom is $DF = m - n = 0$, so the fitting error cannot be estimated.

In matrix algebra, the least-squares solution follows from a short calculation. Multiply both sides by the transpose of $X$,

$X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon$

$(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon$

Since the expectation of $\epsilon$ is zero, the expected parameter is

$A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y$

The unbiased variance is

$\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF$

where $DF$ is the degree of freedom, the number of values that are free to vary. Many people are confused by the "$-1$" issue. In fact, for the sum of squares of residuals (SSR) of a model with $n$ fitted parameters, the degree of freedom is always $m - n$.

The covariance of the estimated parameters is

$Var(A) = \sigma^2 (X^T\cdot X)^{-1}$
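The whole procedure fits in a few lines. Below is a minimal numerical sketch with made-up synthetic data (true parameters and noise level chosen arbitrarily for illustration):

```python
import numpy as np

# Synthetic data: m = 50 points, n = 2 parameters, small known noise
rng = np.random.default_rng(0)
m, n = 50, 2
X = rng.normal(size=(m, n))
A_true = np.array([2.0, -1.0])
Y = X @ A_true + rng.normal(scale=0.1, size=m)

# A = (X^T X)^-1 X^T Y  (solved, not explicitly inverted)
A = np.linalg.solve(X.T @ X, X.T @ Y)

# Unbiased variance, with DF = m - n
DF = m - n
resid = Y - X @ A
sigma2 = resid @ resid / DF

# Covariance of the estimated parameters
VarA = sigma2 * np.linalg.inv(X.T @ X)
```

Using `np.linalg.solve` instead of forming the inverse explicitly is numerically safer; the explicit inverse is only needed for the covariance.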

This is only a fast-food note on linear regression. It has a geometrical meaning: the column vectors of $X$ form the basis of a subspace, and $Y$ lies a bit outside that subspace. Linear regression finds the point in the subspace with the shortest distance to $Y$.

The form of the variance can be understood using a Taylor series, together with the variance in matrix notation, $Var(A) = E\left[ (A - E(A)) \cdot (A - E(A))^T \right]$.

Solving the radial Schrödinger equation

The time-independent Schrödinger equation is

$(-\frac{\hbar^2}{2m}\nabla^2 + V ) \Psi = E \Psi$

Using the Laplacian in spherical coordinates and setting $\Psi = R Y$,

$\nabla^2 R Y - \frac{2m}{\hbar^2}(V-E) R Y = 0$

$\nabla^2 = \frac{1}{r^2}\frac{d}{dr}(r^2 \frac{d}{dr}) - \frac{1}{r^2} L^2$

The angular part,

$L^2 Y = l(l+1) Y$

$\frac{d}{dr}(r^2\frac{dR}{dr}) - l(l+1)R - \frac{2mr^2}{\hbar^2}(V-E) R = 0$

To simplify the first term,

$R = \frac{u}{r}$

$\frac{d}{dr}(r^2 \frac{dR}{dr})= r \frac{d^2u}{dr^2}$

An easier form of the radial equation is,

$\frac{d^2u}{dr^2} - \frac{l(l+1)}{r^2} u - \frac{2m}{\hbar^2} (V-E) u = 0$

With the effective potential $U$,

$U = V + \frac{\hbar^2}{2m} \frac{l(l+1)}{r^2}$

$\frac{d^2u}{dr^2} + \frac{2m}{\hbar^2} (E - U) u = 0$

We can use the Runge-Kutta method to numerically solve the equation.

The initial condition must be $u(0) = 0$. (homework)

I used Excel to calculate a scattering state with $L = 0$ at energy 30 MeV. The potential is a Woods-Saxon of depth 50 MeV, radius 3.5 fm, and diffuseness 0.8 fm.

Another example is a bound state with $L = 0$. I have to search for the energy so that the wavefunction decays to zero at large distance. The outermost eigenenergy is -7.27 MeV. From the radial function, we know it is a 2s orbit.
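The integration can be sketched in Python instead of Excel. This is a minimal sketch under stated assumptions: $\hbar^2/2m$ is taken as roughly 20.7 MeV fm² (the single-nucleon value, which may differ from the mass used in the original calculation), and the initial slope is arbitrary since the equation is linear in $u$.

```python
import math

HBAR2_2M = 20.7  # MeV fm^2; assumed value of hbar^2/(2m) for a nucleon

def woods_saxon(r, V0=50.0, R=3.5, a=0.8):
    """Woods-Saxon potential: depth 50 MeV, radius 3.5 fm, diffuseness 0.8 fm."""
    return -V0 / (1.0 + math.exp((r - R) / a))

def integrate_u(E, l=0, r0=1e-6, rmax=20.0, h=0.01):
    """RK4 integration of u'' = [l(l+1)/r^2 + (V - E)/HBAR2_2M] u, u(r0) = 0."""
    def upp(r, u):
        return (l * (l + 1) / r**2 + (woods_saxon(r) - E) / HBAR2_2M) * u
    r, u, z = r0, 0.0, 1.0          # z = du/dr; initial slope is arbitrary
    rs, us = [r], [u]
    while r < rmax:
        k1u, k1z = h * z,              h * upp(r, u)
        k2u, k2z = h * (z + k1z / 2),  h * upp(r + h / 2, u + k1u / 2)
        k3u, k3z = h * (z + k2z / 2),  h * upp(r + h / 2, u + k2u / 2)
        k4u, k4z = h * (z + k3z),      h * upp(r + h, u + k3u)
        u += (k1u + 2 * k2u + 2 * k3u + k4u) / 6
        z += (k1z + 2 * k2z + 2 * k3z + k4z) / 6
        r += h
        rs.append(r); us.append(u)
    return rs, us

rs, us = integrate_u(E=30.0)   # the scattering-state example, L = 0
```

For the bound state, one would scan $E < 0$ and look for the energy at which $u$ stops diverging at large $r$, exactly the search described above.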

Special Runge-Kutta method for any order ODE

The central piece of the Runge-Kutta method is the approximation of the increment of the function. For a 1st order ODE,

$\dot{y} = f(x,y), \quad y(x_0) = y_0$

In a special case of the Runge-Kutta method of order 4 (RK4), there are two arrays $b_i$ and $c_i$, such that

$dx = h, \quad dy = \sum\limits_{i=1}^{4} b_i\, dy_i$

$dy_i = h\, f(x + c_i\, dx,\; y + c_i\, dy_{i-1})$

where each $c_i$ ranges from 0 to 1, and $\sum\limits_{i=1}^{4} b_i =1$.

In RK4,

$c = (0, \frac{1}{2}, \frac{1}{2}, 1)$

$b = \frac{1}{6}(1,2,2,1)$

The $c_1 = 0$ is a must; otherwise, we would have to define $dy_0$. The optimum values of $c$ and $b$ are obtained by matching the Taylor expansion of the solution order by order (the coefficients are usually collected in a Butcher tableau).
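The scheme above translates directly into code. As a check, here is a short sketch that applies it to $y' = y$, $y(0) = 1$, whose exact solution at $x = 1$ is $e$:

```python
import math

def rk4_step(f, x, y, h):
    """One RK4 step: dy_i = h f(x + c_i h, y + c_i dy_{i-1}), dy = sum b_i dy_i."""
    b = [1/6, 2/6, 2/6, 1/6]
    c = [0.0, 0.5, 0.5, 1.0]
    dys = []
    dy_prev = 0.0            # c_1 = 0, so dy_0 is never actually used
    for i in range(4):
        dy_prev = h * f(x + c[i] * h, y + c[i] * dy_prev)
        dys.append(dy_prev)
    return y + sum(bi * dyi for bi, dyi in zip(b, dys))

# Solve y' = y, y(0) = 1 up to x = 1; the exact answer is e
y, x, h = 1.0, 0.0, 0.01
for _ in range(100):
    y = rk4_step(lambda x, y: y, x, y, h)
    x += h
```

With $h = 0.01$ the result matches $e$ to about ten significant figures, consistent with the $O(h^4)$ global error of RK4.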

For 2nd order ODE

$\frac{d^2y}{dx^2} + a \frac{dy}{dx} = f(x,y), \quad y(x_0) = y_0, \quad \frac{dy}{dx}\Big|_{x_0} = z_0$

change to

$\frac{dy}{dx} = z, \quad \frac{dz}{dx}=F(x,y,z) = f(x,y) - a z$

These equations are coupled 1st order ODEs, so the same scheme applies to each.

$dx = h, dy = \sum\limits_{i=1}^{4} b_i dy_i, dz = \sum\limits_{i=1}^{4} b_i dz_i$

$dy_i = h( z+ c_i\, dz_{i-1})$, where $z$ is the value at the current step $x_n$.

$dz_i = h F(x+ c_i dx, y+c_i dy_{i-1}, z+c_i dz_{i-1})$

Note that $dy_i$ uses $dz_{i-1}$, not $dy_{i-1}$.

$x_{n+1} = x_n + dx$

$y_{n+1} = y_n + dy$

$z_{n+1} = z_n + dz$
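The coupled scheme above can be sketched as follows, tested on the harmonic oscillator $y'' = -y$ with $y(0)=0$, $y'(0)=1$, whose exact solution is $y = \sin x$:

```python
import math

B = [1/6, 2/6, 2/6, 1/6]
C = [0.0, 0.5, 0.5, 1.0]

def rk4_step_2nd(F, x, y, z, h):
    """One step of the scheme above for y' = z, z' = F(x, y, z)."""
    dy_prev = dz_prev = 0.0
    dy = dz = 0.0
    for bi, ci in zip(B, C):
        dy_i = h * (z + ci * dz_prev)          # dy_i uses dz_{i-1}
        dz_i = h * F(x + ci * h, y + ci * dy_prev, z + ci * dz_prev)
        dy += bi * dy_i
        dz += bi * dz_i
        dy_prev, dz_prev = dy_i, dz_i
    return y + dy, z + dz

# y'' = -y, i.e. F(x, y, z) = -y, with y(0) = 0, z(0) = 1; exact y = sin(x)
F = lambda x, y, z: -y
y, z, x, h = 0.0, 1.0, 0.0, 0.01
for _ in range(100):
    y, z = rk4_step_2nd(F, x, y, z, h)
    x += h
```

At $x = 1$ the computed $y$ and $z$ agree with $\sin 1$ and $\cos 1$ to about ten figures, confirming that the decoupled scheme keeps the full RK4 accuracy.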

This can be generalized to any order ODE by decoupling the ODE into

$\frac{dy}{dx} = z,\; ...,\; \frac{du}{dx}=F(x,y,...,u)$

the equation for $du_i$ is

$du_i = h F(x+ c_i dx, y+c_i dy_{i-1}, z+c_i dz_{i-1}, ...., u+ c_i du_{i-1})$

And for all the intermediate variables $y, z, w, ... = O^{(0)}, O^{(1)}, O^{(2)}, ...$ (the $k$-th derivative of $y$),

$dO_i^{(k)} = h\left( O^{(k+1)} + c_i\, dO_{i-1}^{(k+1)} \right)$

Testing Hypothesis

Hypothesis testing may be the most used and most misunderstood statistical tool. Even when we do a simple fitting and want to evaluate the fitting result, we have to use hypothesis testing. One common quantity used is the reduced chi-squared.

A hypothesis test asks: given an observation and a hypothesis, is the hypothesis NOT true? Right, a hypothesis test never tells us the trueness of the hypothesis, only its wrongness. The core of the test is: "Can we reject the null hypothesis?"

There are one-tailed and two-tailed testing, as a result, the p-value has different meanings.

https://en.wikipedia.org/wiki/One-_and_two-tailed_tests

https://en.wikipedia.org/wiki/P-value

The p-value is the probability, assuming the null hypothesis is true, of observing a result at least as extreme as the one obtained. When the p-value is too small, smaller than the significance level, the null hypothesis is rejected. But if the p-value is large, in a 1-tailed test, we cannot say the null hypothesis is true; we can only say the null hypothesis CANNOT be rejected.

In a 2-tailed test, there are two rejection regions, one in each tail, and the p-value accounts for both.
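A concrete sketch of the one- versus two-tailed distinction, using a standard normal (z) test statistic. This is a minimal illustration assuming a normal null distribution; only the standard library's complementary error function is needed, via $\Phi(z) = \frac{1}{2}\mathrm{erfc}(-z/\sqrt{2})$:

```python
import math

def p_one_tailed(z):
    """P(Z >= z) under the null: only one tail counts as 'extreme'."""
    return 0.5 * math.erfc(z / math.sqrt(2))

def p_two_tailed(z):
    """P(|Z| >= |z|) under the null: both tails count as 'extreme'."""
    return math.erfc(abs(z) / math.sqrt(2))

z = 1.96
print(p_one_tailed(z))   # about 0.025
print(p_two_tailed(z))   # about 0.05: borderline at the 5% significance level
```

The same observation $z = 1.96$ gives twice the p-value in the two-tailed test, so whether the null is rejected at a given significance level depends on which test was chosen beforehand.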

https://en.wikipedia.org/wiki/Confidence_interval

https://en.wikipedia.org/wiki/Type_I_and_type_II_errors