Wavelet Analysis or MRA

Leave a comment

Although the Fourier transform is a very powerful tool for data analysis, it has some limit due to lack of time information. From physics point of view, any time-data should live in time-frequency space. Since the Fourier transform has very narrow frequency resolution, according to  uncertainty principle, the time resolution will be very large, therefore, no time information can be given by Fourier transform.

Usually, such limitation would not be a problem. However, when analysis musics, long term performance of a device, or seismic survey, time information is very crucial.

To over come this difficulty, there a short-time Fourier transform (STFT) was developed. The idea is the applied a time-window (a piecewise uniform function, or Gaussian) on the data first, then FT. By applying the time-window on difference time of the data (or shifting the window), we can get the time information. However, since the frequency range of the time-window  always covers the low frequency, this means the high frequency  signal is hard to extract.

To improve the STFT, the time-window can be scaled (usually by 2). When the time window is shrink by factor of 2, the frequency range is expanded by factor of 2. If we can subtract the frequency ranges for the time-window and the shrink-time-window, the high frequency range is isolated.

To be more clear, let say the time-window function be

\phi_{[0,1)}(t) = 1 , 0 \leq t < 1

its FT is

\hat{\phi}(\omega) = sinc(\pi \omega)

Lets also define a dilation operator

Df(t) = \sqrt{2} f(2t)

the factor \sqrt{2} is for normalization.

The FT of D\phi(t) has smaller frequency range, like the following graph.


We can subtract the orange and blue curve to get the green curve. Then FT back the green curve to get the high-frequency time-window.

We can see that, we can repeat the dilation, or anti-dilation infinite time. Because of this, we can drop the FT basis Exp(-2\pi i t \omega), only use the low-pass time-window to see the low-frequency behaviour of the data, and use the high-pass time-window to see the high-frequency behaviour of the data. Now, we stepped into the Multi-resolution analysis (MRA).

In MRA, the low-pass time-window is called scaling function \phi(t) , and the high-pass time-window is called wavelet \psi(t).

Since the scaling function is craetd by dilation, it has the property

\phi(t) = \sum_{k} g_{0}(k) \phi(2t-k)

where k is integer. This means the vector space span by {\phi(t-k)}_{k}=V_0 is a subspace of the dilated space DV_0 =V_1. The dilation can be go one forever, so that the whole frequency domain will be covered by V_{\infty}.

Also, the space span by the wavelet, {\psi(t-k)}=W_0, is also a subspace of V_1. Thus, we can imagine the structure of MRA is:


Therefore, any function f(t) can also be expressed into the wavelet spaces. i.e.

f(t) = \sum_{j,k} w_{j,k} 2^{j/2}\psi(2^j t - k)

where j, k are integers.

I know this introduction is very rough, but it gives a relatively smooth transition from FT to WT (wavelet transform), when compare to the available material on the web.





Levenberg-Marquardt Algorithm

Leave a comment

In pervious post, we shows the Gauss-Newton method for fitting non-linear function. The disadvantage of that method is that the inverse matrix could be ill-defined. This makes the method unstable.

Back to the basic, we want to minimize the sum of square of residual (SSR). The SSR is,

SSR(\beta) = (Y - f(\beta))^T\cdot (Y-f(\beta))

The derivative on \beta,

\frac{d}{d\beta} SSR(\beta) = -2 (Y-f(\beta))^T \cdot \nabla f(\beta)

Many literatures denote \nabla f = J, which is the Jacobian. The second derivative of f is Hessian matrix H = \nabla^2 f \sim J^T\cdot J.

The Gradient Descent method is that ,

h = \beta - \beta_0 = \alpha J^T \cdot (Y - f(\beta_0))

where \alpha is a step size. The gradient descent changes the SSR using the steepest path. The step size \alpha has to be adjusted. The simplest way to adjust is testing the \delta = SSR(\beta_0 + h) - SSR(\beta_0). If \delta < 0 , the \alpha increases, else decreases. This method is slow but stable. It is slow because of finding the \alpha . It is stable because the method is always computable.

Thus, we have 2 methods, Gradient Descent is stable and slow, Gauss-Newton method is unstable but fast. Levenberg-Marquardt Algorithm combined this 2 methods so that it is stable and fast by solving,

(J^T \cdot J + \lambda I) h = J^T \cdot (Y - f)

where \lambda is an adjustable parameter. When \lambda >> 1 , the J^T\cdot J is neglectable and the method becomes Gradient Descent with small \alpha. When the \lambda << 1, the method becomes Gauss-Newton method.

Usually, the \lambda_0 is small. The Gauss-Newton method is very good near the minimum of SSR, while Gradient Descent is better far away.

When the \delta < 0, \lambda_{i+1} = \lambda_i / 10 , else \lambda_{i+1} = \lambda_i * 10. I don’t know the exact reason for this setting. In fact if you set oppositely, the method is still work in most cases.

The method add \lambda I on the J^T\cdot J , the inverse is always well-define. Therefore, this method is stable.

Non-linear Regression

Leave a comment

The fit equation is

Y = f(A) + \epsilon

We assume near Y , the curvy subspace of f(A) can be approximated by a plane.  This, using Taylor series,

Y = f(A_0) + F(A_0) \cdot (A - A_0)  + \cdots,

where F(A_0) is divergence of f(A) at A_0.

Using same technique in linear regression,

A - A_0 = (F(A_0)^T \cdot F(A_0))^{-1} \cdot F(A_0) \cdot ( Y-f(A_0))

With an initial guess, the interaction should approach to the best estimated parmeter \hat{A}.

The covariance is

Var(A) = \sigma^2 (F(A)^T \cdot F(A))^{-1}

The above method is also called Gauss-Newton method.

Multi-dimension Linear Regression

Leave a comment

In the field of science, collecting data and fitting it with model is essential. The most common type of fitting is 1-dimensional fitting, as there is only one independent variable. By fitting, we usually mean the least-squared method.

Suppose we want to find the n parameters in a linear function

f(x_1, x_2,\cdots, x_n) = \sum_{i=1} a_i x_i

with m observed experimental data

Y_j = f(x_{1j}, x_{2j}, \cdot, x_{nj} + \epsilon_j= \sum_{i=1} a_i x_{ij}+ \epsilon_j

Thus, we have a matrix equation

Y=X \cdot A + \epsilon

where Y is a m-dimensional data column vector, A is a n-dimensional parameter column vector, and X is a n-m non-square matrix.

In order to get the n parameter, the number of data m >= n. when m=n, it is not really a fitting because of degree-of-freedom is DF = m-n = 0, so that the fitting error is infinity.

The least square method in matrix algebra is like calculation. Take both side with transpose of X

X^T \cdot Y = (X^T \cdot X) \cdot A + X^T \cdot \epsilon

(X^T\cdot X)^{-1} \cdot X^T \cdot Y = A + (X^T \cdot X)^{-1} \cdot X^T \cdot \epsilon

Since the expectation of the \epsilon is zero. Thus the expected parameter is

A = (X^T \cdot X)^{-1} \cdot X^T \cdot Y

The unbiased variance is

\sigma^2 = (Y - X\cdot A)^T \cdot (Y - X\cdot A) / DF

where DF is the degree of freedom, which is the number of value that are free to vary. Many people will confuse by the “-1” issue. In fact, if you only want to calculate the sum of square of residual SSR, the degree of freedom is always m - n.

The covariance of the estimated parameters is

Var(A) = \sigma^2 (X^T\cdot X)^{-1}

This is only a fast-food notices on the linear regression. This has a geometrical meaning  that the matrix X is the sub-space of parameters with basis formed by the column vectors of X. Y is a bit out-side the sub-space. The linear regression is a method to find the shortest distance from Y to the sub-space X .

The from of the variance can be understood using Taylor series. This can be understood using variance in matrix notation Var(A) = E( A - E(A) )^T \cdot E(A  - E(A)) .




Solving radial SE numerically

Leave a comment

The time-independent Schrödinger equation is

(-\frac{\hbar^2}{2m}\nabla^2 + V ) \Psi = E \Psi

Using the Laplacian in spherical coordinate. and Set \Psi = R Y

\nabla^2 R Y - \frac{2m}{\hbar^2}(V-E) R Y = 0

\nabla^2 = \frac{1}{r^2}\frac{d}{dr}(r^2 \frac{d}{dr}) - \frac{1}{r^2} L^2

The angular part,

L^2 Y = l(l+1) Y

The radial part,

\frac{d}{dr}(r^2\frac{dR}{dr}) - l(l+1)R - \frac{2mr^2}{\hbar^2}(V-E) R = 0

To simplify the first term,

R = \frac{u}{r}

\frac{d}{dr}(r^2 \frac{dR}{dr})= r \frac{d^2u}{dr^2}

A more easy form of the radial function is,

\frac{d^2u}{dr^2} + \frac{l(l+1)}{r^2} u - \frac{2m}{\hbar^2} (V-E) u = 0

The effective potential U

U = V + \frac{\hbar^2}{m} \frac{l(l+1)}{r^2}

\frac{d^2u}{dr^2} + \frac{2m}{\hbar^2} (E - U) u = 0

We can use Rungu-Kutta method to numerically solve the equation.


The initial condition of u has to be 0. (home work)

I used excel to calculate a scattered state of L = 0 of energy 30 MeV. The potential is a Wood-Saxon of depth 50 MeV, radius 3.5 fm, diffusiveness 0.8 fm.


Another example if bound state of L = 0. I have to search for the energy, so that the wavefunction is flat at large distance. The outermost eigen energy is -7.27 MeV. From the radial function, we know it is a 2s orbit.


Testing Hypothesis

Leave a comment

Testing hypothesis may be the most used and most misunderstood statistics tool. When we do even a simple fitting, and want to evaluate the fitting result, we have to use the hypothesis testing. One common quantity used is the reduced chi-squared.

A hypothesis testing means given an observation and hypothesis, Is the hypothesis NOT true? right, hypothesis test never tell us the trueness  of the hypothesis, but the wrongness of it. The core of the test is “Can we reject the null hypothesis?

There are one-tailed and two-tailed testing, as a result, the p-value has different meanings.



The p-value is the probability that the model agree with the observation. when the p-value too small, smaller than the confident level, the null hypothesis Rejected. But if the p-value is very large, in a 1-tailed test, we cannot say the null hypothesis is true, but we can say the null hypothesis CANNOT be rejected.

In 2-tailed test, there are two p-values, corresponding to each tail.




Variance and Sigma in finding resolution

Leave a comment

I have been mis-understood for a while.

First of all, the variance of a distribution is not equal its sigma, except for Normal distribution.

In an observation, there should be an intrinsic variance, for example, some hole size, or physical windows. And there is a resolution from the detection. As a result, we observed an overall effect of the intrinsic variance and the detector resolution. In an data analysis, one of the goal is the find out the resolution.

Lets denote the random variable of the intrinsic variance is

X \sim D(\mu, \sigma^2)

and the resolution of the detection is an other random variable of Normal distribution,

Y \sim N(0, s^2)

Then, we observed,

Z = X + Y \sim D(\mu, \sigma^2 + s^2),

according to the algebra of distribution.

If the \sigma >> s and the the intrinsic distribution is NOT a gaussian, say, it is a uniform distribution, then, the observed distribution is NOT a gaussian. One why to get the resolution is to do de-convolution. Since we are not interesting on the intrinsic distribution but the resolution, thus we can simply use the variance of the intrinsic distribution and the variance of the observed distribution to extract the resolution.

When \sigma <= s, the observed distribution is mostly a gaussian-like. and we can approximate the observed variance as the squared-sigma of a gaussian fit.

for example, in deducing the time resolution using time-difference method with a help of a tracking, that a narrow width of the position was gated.

The narrow width of the position is equivalent to a uniform distribution of time-difference. Thus, the time resolution is deduced using the observed variance and the variance of the uniform distribution. For the width of the position is \Delta X, the width of the time difference is

\Delta t = \Delta X / c/\beta,


Var(resol.) = Var(obser.) - Var(\Delta t)

The variance of an uniform distribution is 1/12 of the square of the width.

Var(\Delta t) = (\Delta t)^2/ 12 \neq (\Delta t)^2

The effect of the factor 1/12 is very serious when the resolution is similar with the width of the time difference. But can be neglected when Var(resol.) >> Var( \Delta t).

Because of missing this 1/12 factor, the resolution will be smaller than actual resolution.

Here is an example, I generated 10,000 data, the intrinsic distribution is a uniform distribution from 0 to 100, the resolution is a gaussian of sigma of 20. The result distribution is a convolution of them.

Screen Shot 2016-03-22 at 0.52.07.png

When find resolution using projection from a 2-D distribution, we should be careful about the projected 1-D distribution and its variance. For example, a uniform 2-D disk projected to 1-D distribution, the 1-D distribution is not uniform but a half circle,

pdf =f(x) =\sqrt{(r^2-x^2)}

the variance is (0.5 r)^2.

the formula to calculate variance is

Var(X) = \int x^2 f(x) dx / \int f(x) dx ,

where f(x) is the pdf.

ToDo, estimate the error of the deduced resolution, with number of data. For small number of data, the error should be large. but how large?


Older Entries