Distribution of ordered difference from a uniform distribution


Recently, I found an interesting phenomenon.

Suppose we have a list of n random numbers, drawn from a uniform distribution on (0, kn), k \in \mathbb{N}.

Now, we sort the list so that they are in ascending order, and then calculate the differences between each consecutive pair. For example, suppose we have a list like:

\displaystyle 1, 3, 9, 13, 18, 19, ....

The differences are

\displaystyle 2, 6, 4, 5, 1, ...

And then, we plot a histogram to see the frequency of each difference. Here is the result for 3000 samples from a uniform integer distribution on (0, 30000).

The distribution (except the first few bins) follows an exponential distribution of

\displaystyle D(x|n, k) = \frac{n}{k} e^{- \frac{1}{k} x}

In the above plot, the blue curve is

\displaystyle f(x) = 300 e^{-x/10}

as the number of data points is n = 3000, and the range of the uniform distribution is (0, 10 \times n).


The problem is: why?

Even a normal distribution gives an exponentially decaying distribution of ordered differences!


[20220211] update.

I understand why.

Suppose we make n random draws from L discrete outcomes in (1, L). For a given interval of length x, the mean number of counts is x n / L. Thus, using the Poisson distribution, the probability of having m events inside this interval is

\displaystyle P(x, m) = \left(\frac{xn}{L}\right)^m \frac{1}{m!} \exp \left(-x\frac{n}{L}\right)

Now, this is the tricky part: if we want a difference of x between consecutive events, that means m = 0, i.e. no event inside this interval.

Why? Because our question is: given an event at some time, how long do we have to wait for the next event? What is the probability of a waiting period of x before the next event?

Therefore, there must be NO event within the x-period (so m = 0), and then an event at the next immediate step. Thus,

\displaystyle f(x) = P(x-1, 0) P(1,1) = \frac{n}{L} \exp\left(-x\frac{n}{L}\right)
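As a sanity check, here is a minimal simulation of the experiment above, using the same numbers as the example (n = 3000 draws, L = 30000, so the mean gap is L/n = 10):

```python
import random

# n draws from a discrete uniform distribution on (0, L), sorted,
# then consecutive differences. The gap distribution should follow
# (n/L) exp(-x n/L): mean gap L/n = 10, P(gap >= 10) roughly e^-1.
random.seed(1)
n, L = 3000, 30000

draws = sorted(random.randint(0, L) for _ in range(n))
gaps = [b - a for a, b in zip(draws, draws[1:])]

mean_gap = sum(gaps) / len(gaps)
frac_over_10 = sum(g >= 10 for g in gaps) / len(gaps)
print(mean_gap, frac_over_10)
```

The mean gap comes out close to 10 and roughly a third of the gaps exceed the mean, as the exponential form predicts.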

Probability of a peak that is not from noise


In this post, we talked about the probability limit of a null experiment. The simplest case: tossing a coin n times and getting r heads, what is the probability that the coin is fair? The straightforward answer uses the likelihood function that the coin has probability x of giving a head.

\displaystyle L(x| n-toss, r-head) = C^n_r x^r (1-x)^{n-r}

Note that the area of the Likelihood function is not equal to 1.

\displaystyle \int_{0}^{1}L(x|n,r) dx = \frac{1}{n+1}

So, the probability that the coin's head probability lies between p_1 and p_2 is

\displaystyle P(p_1, p_2) = (n+1) \int_{p_1}^{p_2} L(x|n,r) dx

The 90% confidence interval around the most probable head probability p is defined by

\displaystyle P(p_1, p) = P(p, p_2) = 0.45

Using the above condition, p_1 is the lower limit and p_2 is the upper limit.

For example, 10-toss, 8 heads.

The coin gives heads 80% of the time, with a lower limit of 63% and an upper limit of 100%, for the 90% confidence interval.

And the probability that the coin is nearly fair (0.45 < x < 0.55) is

\displaystyle (n+1) \int_{0.45}^{0.55} L(x| 10, 8) dx \approx 0.05
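A quick numerical check of these numbers, integrating the likelihood with a simple midpoint rule:

```python
from math import comb

# The coin example above: 10 tosses, 8 heads. The likelihood area
# should be 1/(n+1), and the normalized probability of a nearly
# fair coin (0.45 < x < 0.55) should be about 5%.
n, r = 10, 8

def L(x):
    return comb(n, r) * x**r * (1 - x)**(n - r)

def integral(f, a, b, steps=10000):
    # midpoint rule, accurate enough here
    h = (b - a) / steps
    return sum(f(a + (i + 0.5) * h) for i in range(steps)) * h

area = integral(L, 0, 1)                      # should be 1/11
p_fair = (n + 1) * integral(L, 0.45, 0.55)    # should be about 0.05
print(area, p_fair)
```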


In the above example, there are 2 outcomes. Now suppose we have k possible outcomes to draw. How do we check whether the drawing mechanism is fair?

To simplify the study, let's say k = 3. The likelihood for n draws with counts x_1, x_2, x_3 = n - x_1 - x_2 and probabilities p_1, p_2, p_3 = 1 - p_1 - p_2 for each outcome is

\displaystyle L( p_1, p_2 | n, x_1, x_2 ) = \frac{n!}{x_1!x_2!(n-x_1-x_2)!} p_1^{x_1} p_2^{x_2} (1-p_1-p_2)^{n-x_1-x_2}

But that becomes complicated as the number of outcomes grows.


Now, following a similar idea, we can determine whether a peak in a spectrum is real. We first determine the background level under the peak; the background counts should follow a Poisson distribution. Then we calculate the probability that the background produces the count observed at the peak position.

Take a spectrum like this:

We estimate the background from bin 1 to 25, giving a mean value of 3.92. The likelihood function for the background is

And we can see that the probability for this background to produce the peak at bin 30 is very, very small.


Another way is to use the expected sum of the outcomes and then do hypothesis testing. In this post, we demonstrated hypothesis testing.

We can take the null hypothesis that there is no peak in the histogram, i.e. it is a uniform distribution. We label the bins from 1 to k, sum the outcomes, and then test whether we can reject the null hypothesis of no structure.

For example, with a spectrum like the one below, the one-tail probability is 3%, so we can reject the null hypothesis.

The method has a flaw: if the data size is much smaller than the number of outcomes, it will always reject the null hypothesis. Also, even when the null hypothesis is rejected, we only know there is a structure; the test does not tell us where the structure is. And because it is based on a sum of the data, it is not sensitive to local structure; it only gives an overview of the data.

I think we could instead take the null hypothesis that there is a peak at bin 30, and formulate the probability under that hypothesis.

Adding two probability distributions


Suppose we have two random variables with probability distributions whose parameter sets are \alpha and \beta, such that

\displaystyle X \sim D_X( x | \alpha) \\ Y \sim D_Y( y | \beta)

What is the distribution of the sum of these two random variables?

In the discrete case, it is

\displaystyle Z \sim D_Z( z = x+y | \alpha, \beta ) = \sum_{r} D_X(r | \alpha) D_Y( z- r | \beta )

and for continuous distributions,

\displaystyle Z = X + Y \sim D_Z( z = x+y | \alpha, \beta ) = \int_{r} D_X(r | \alpha) D_Y( z - r | \beta ) dr

Why convolution? Imagine we have a 2-D distribution on the x and y axes, D_X(x) D_Y(y), and we want to project the distribution onto the diagonal axis z = x+y. We rotate the axes so that x \rightarrow (z + w)/2, y \rightarrow (z - w)/2, where z = x + y and w = x - y. Integrating along w, we get the distribution of x+y. There is no intrinsic difference between the continuous and discrete cases.
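The discrete convolution formula above can be sketched in a few lines. Here, two fair six-sided dice stand in for D_X and D_Y (a toy example, not from the original post):

```python
# Discrete convolution: P(Z = z) = sum_r P_X(r) P_Y(z - r).
def convolve(pX, pY):
    """pX, pY: dicts mapping outcome -> probability."""
    pZ = {}
    for x, px in pX.items():
        for y, py in pY.items():
            pZ[x + y] = pZ.get(x + y, 0.0) + px * py
    return pZ

die = {i: 1 / 6 for i in range(1, 7)}
two_dice = convolve(die, die)
print(two_dice[7])    # 7 is the most likely sum: 6/36
```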


Now, we give the general formula for adding a discrete uniform distribution on \{1, \dots, k\} to itself n times, and its relation to the Gaussian distribution. The uniform distribution is

\displaystyle X \sim U(x, k) = \frac{1}{k}, ~ 0 < x \leq k

Adding the distribution is

\displaystyle 2X = X+X \sim U_1(z, k) = \sum_{x} U(x, k) U(z -x, k) = \frac{1}{k^2} \begin{cases} z -1 & 2 \leq z \leq k+1 \\ 2k+1-z & k+1 < z \leq 2k \end{cases}

After n additions, the distribution approaches a normal distribution (how to derive this result?) because of the central limit theorem:

\displaystyle nX \sim U_n(z, k) \approx N(\mu, \sigma^2) = \frac{1}{\sqrt{2\pi} \sigma} e^{ - \frac{(z-\mu)^2}{2\sigma^2}}, ~ \mu = n \frac{k+1}{2}, \sigma^2 = n \frac{k^2-1}{12}

The mean is easy to calculate. The mean of the squares from 1 to k is (2k^2+3k+1)/6, and the square of the mean is (k^2+2k+1)/4, so the variance is (k^2-1)/12. See this post for the algebra of distributions.


This distribution can help us answer a question: is a coin fair? Say a coin is tossed 10 times and gives 8 heads. We score a head as 1 and a tail as 2, so we have a score of 8 + 4 = 12. Using the normal approximation above, the one-tail probability is 2.9%, so at the 95% confidence level, we can reject the hypothesis that the coin is fair.
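The 2.9% figure can be checked with the normal approximation, using \mu = n(k+1)/2 and \sigma^2 = n(k^2-1)/12 from above:

```python
from math import erf, sqrt

# Sum of 10 draws from {1, 2}: 8 heads (=1) and 2 tails (=2) give
# score 12. Under a fair coin, mean = n(k+1)/2, var = n(k^2-1)/12.
n, k = 10, 2
score = 8 * 1 + 2 * 2                 # 12
mu = n * (k + 1) / 2                  # 15
var = n * (k**2 - 1) / 12             # 2.5

z = (score - mu) / sqrt(var)
p_one_tail = 0.5 * (1 + erf(z / sqrt(2)))
print(round(p_one_tail, 3))           # about 0.029
```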

Bayes theorem


There are many good materials on Bayes' theorem. Here, I give a little more detail.

Bayes' theorem states that:

\displaystyle P(H|E) = \frac{P(E|H) P(H)}{P(E)} = \frac{P(E|H) P(H)}{P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})}

H stands for the hypothesis being true, E for the evidence.

The above equation can be visualized using a Venn diagram:

The red box contains the events H; outside it is not-H, or \hat{H}. The green boxes contain the events E; outside the green boxes is not-E. It is obvious that

\displaystyle P(E|H) = \frac{P(E \cap H)}{P(H)}

\displaystyle P(H|E) = \frac{P(E \cap H)}{P(E)}

\displaystyle P(E) = P(E \cap H) + P(E \cap \hat{H}) = P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})


One common application is medical testing. Suppose a test for a disease has a 90% true-positive rate and a 10% false-positive rate, and the prevalence of the disease in the population is 0.1%. What is the probability that a positive test really means infection?

From the data, we have P(H) = 0.001,  P(E|H) = 0.9, P(E|\hat{H}) = 0.1, then

\displaystyle P(H|E) = \frac{0.9 \times 0.001}{ 0.9 \times 0.001 + 0.1 \times 0.999} = 0.009

which means there is only a 0.9% chance that he is infected. Let’s do another test, and it is still positive; then

\displaystyle P(H|E)= \frac{0.9 \times 0.009} {0.9 \times 0.009 + 0.1 \times 0.991} = 0.076

which is still 7.6% that he is infected.

Number of tests positive    P(H|E)
1                           0.89%
2                           7.5%
3                           42.2%
4                           86.8%
5                           98.3%
6                           99.8%

This table shows that, for a very rare disease, 90% true positive testing method is not sufficient.
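The table can be reproduced by iterating the Bayes update; a minimal sketch:

```python
# Repeated positive tests, each updating the posterior with
# P(E|H) = 0.9, P(E|not H) = 0.1, starting from prior P(H) = 0.001.
def update(p, true_pos=0.9, false_pos=0.1):
    return true_pos * p / (true_pos * p + false_pos * (1 - p))

p = 0.001
probs = []
for i in range(1, 7):
    p = update(p)
    probs.append(p)
    print(i, f"{100 * p:.2f}%")
```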

In fact, for P(E|H) + P(E| \hat{H} ) = 1 ,

\displaystyle P(H|E) = \frac{P(E|H) P(H) }{ P(E|H) P(H) + (1- P(E|H)) (1-P(H))}

when P(E|H) = 0.5, P(H|E) = P(H)

The above plot shows the curve for P(H|E) as a function of P(H) for various P(E|H). The black arrows start with prior P(H) = 0.2, and show that each positive test with P(E|H) = 0.9 iterates toward higher and higher P(H|E).

We can see that if P(E|H) < 0.5, the test is basically useless, as P(E|\hat{H}) > 0.5 means more false positives than true positives.


A trickier situation: what if the 1st test is positive, the 2nd negative, and the 3rd positive? For a negative test, we have to evaluate

\displaystyle P(H|\hat{E}) = \frac{P(\hat{E}|H) P(H)}{P(\hat{E}|H) P(H) + P(\hat{E}|\hat{H}) P(\hat{H})}

Using the Venn diagram,

\displaystyle P(\hat{E}|H) = 1 - P(E|H),  P(\hat{E}|\hat{H}) = 1 - P(E|\hat{H})

\displaystyle P(H|\hat{E}) = \frac{(1-P(E|H)) P(H)}{ (1-P(E|H))P(H) + (1-P(E|\hat{H})) P(\hat{H})}

Using P(E|H) = 1 - P(E|\hat{H} ) ,

\displaystyle P(H|\hat{E})\left[P(E|H)\right] = P(H|E)\left[1-P(E|H)\right]

i.e. P(H|\hat{E}) as a function of P(E|H) equals P(H|E) evaluated at 1 - P(E|H), so the curve is the diagonal “mirror”.

The above plot starts with prior P(H) = 0.2. The first test is positive, giving P(H|E) = 0.69, but the 2nd test is negative, which gives P(H|\hat{E}) = 0.2, back to the prior.


In the above discussion, we assumed that P(E|H) + P(E|\hat{H}) = 1, i.e. the true-positive and false-positive probabilities sum to unity. But that is not necessarily true.

\displaystyle P(H|E) = \frac{P(E|H) P(H)}{ P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})}

If we simplify with simple variables, we have

\displaystyle f(h, x) = \frac{x h}{ x h + (1-x) (1-h)}

\displaystyle g(h, z, y) = \frac{z h}{ z h + y (1-h) }

where f(h,x) represents P(H|E) for P(E|H) + P(E|\hat{H}) = 1, and g(h, z, y) represents the general case. We can check that

\displaystyle f\left(h, \frac{z}{z+y}\right) = g(h, z, y),  z \neq 0, y \neq 0

So, the curve for g(h, z, y) is same as f(h, x) with a transformation.
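A quick numerical check of this identity:

```python
# f(h, x): posterior when true-positive and false-positive rates sum
# to 1; g(h, z, y): the general case. Check f(h, z/(z+y)) = g(h, z, y).
def f(h, x):
    return x * h / (x * h + (1 - x) * (1 - h))

def g(h, z, y):
    return z * h / (z * h + y * (1 - h))

for h in (0.1, 0.5, 0.9):
    for z, y in ((0.9, 0.1), (0.7, 0.3), (0.56, 0.08)):
        assert abs(f(h, z / (z + y)) - g(h, z, y)) < 1e-12
print("identity holds")
```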


As of today, zero covid cases have been reported for 40 consecutive days in Hong Kong. What is the probability that the virus still exists in Hong Kong?

Our prior for covid is P(H) = 0.2, i.e. 20% of people have covid at day 0. Assume that people with covid show symptoms with probability 80% and show no symptoms with probability 20%, i.e. P(S|H) = 0.8, P(\hat{S}|H) = 0.2. Also assume it is quite likely that people without covid show covid-like symptoms, i.e. P(S|\hat{H}) = 0.8. And suppose the covid test has a 70% true-positive rate and a 10% false-positive rate.

The probability of reporting a covid case is the probability of testing positive with symptoms, plus a false positive with covid-like symptoms.

P(E) = P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})

P(E|H) = P(S|H) \times 0.7 = 0.56, so there is a 56% chance that a true covid case will be reported.

P(E|\hat{H}) = P(S|\hat{H}) \times 0.1 = 0.08, so there is an 8% chance that a false covid case will be reported.

So P(E) = 0.56 \times 0.2 + 0.08 \times 0.8 = 0.176; there is a 17.6% chance that a covid case will be reported in the population.

P(H|\hat{E}) = (1-0.56) \times 0.2 / (1-0.176) = 0.11; thus there is an 11% chance that covid is present but no case is reported on the 1st day.

Here is the plot for the P(H|\hat{E}) vs number of day of zero report case.

Hong Kong has 7.5 million people. At the 20th day of zero cases reported, there could still be 1 person infected and hidden in the population. But at the 40th day of zero cases reported, the chance of covid being present is 3.8 \times 10^{-14}.
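The day-by-day numbers can be reproduced by iterating the negative-report update with the same inputs (P(E|H) = 0.56, P(E|\hat{H}) = 0.08, prior 0.2):

```python
# Each day with no reported case is a "negative test" update.
def update_negative(p, pEH=0.56, pEnotH=0.08):
    return (1 - pEH) * p / ((1 - pEH) * p + (1 - pEnotH) * (1 - p))

p = 0.2
probs = []
for day in range(1, 41):
    p = update_negative(p)
    probs.append(p)

print(round(probs[0], 2))     # day 1: about 0.11
print(probs[19] * 7.5e6)      # day 20: about 1 hidden case in 7.5 M
print(probs[39])              # day 40: about 4e-14
```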

In the early days of the pandemic policy, the HK government required 21 days with no case reported as a condition for relaxing the measures, which is reasonable: in our rough estimation, there could still be one real case in the population at that point.

In our rough estimation, we ignored the spreading of the covid. To include it, we can multiply the R-factor into P(H|\hat{E}) every day. Suppose the R-factor is 0.7; then the actual covid probability in the population on the 1st day of zero cases reported would be (1+0.7) \times 0.11, and so on. It turns out P(H|\hat{E}) needs more time to fall to as small as 10^{-6}. But given that Hong Kong has one of the lowest infection rates in the world, the effective R-factor should be small or close to 0; R = 0.7 is like having no protection at all.

Goodness of Fit


In this post, we discussed the Goodness of Fit for a simple regression problem.

In nuclear physics, the data we collect are plotted in histograms, and we have to fit the histogram to find the energies and resolutions. This is another class of Goodness-of-Fit problem.

In general, depending on how we treat the data, we can

  • treat it as a regression problem, with the error of each data point following a Poisson distribution;
  • treat it as hypothesis testing: what is the chance that the experimental distribution would be observed, given that the model distribution is the truth?

For the regression problem, there are two main metrics: one is the R-squared, and the other is the chi-squared.

For the testing hypothesis, the experimental data is tested against the null hypothesis, which is the model distribution. A common method is the Pearson’s chi-squared test.

Note that the chi-squared and chi-squared test are two different things, although they are very similar in the algorithm.

Also, keep in mind that many details and assumptions are not discussed below.


The R-squared is very simple; it measures how well the model captures the data. In a standard ANOVA (analysis of variance, which can be imagined as the partition of SSTotal), we have

\displaystyle SSTotal = \sum_{i} (Y_i - \bar{Y} )^2 = \left( \sum_i Y_i^2 \right) - n \bar{Y}^2

\displaystyle SSR = \sum_{i} (Y_i - \hat{Y_i})^2

The SSTotal is the sum of squared “residuals” against the mean value. If the data has no feature, no peak, SSTotal will be minimal. SSTotal represents the “size” of the features in the data: the larger the SSTotal, the more the data differ from the mean, and the more feature-rich they are.

The SSR is the sum of squared residuals against the fitted function. If the model captures the data very well, SSR will be minimal. Since the data have natural fluctuations, MSR = SSR/ndf is the unbiased estimator of the “sample variance”, where ndf is the number of degrees of freedom.

Let n be the size of the data and p the number of parameters (the size) of the model. SSTotal is the “size” of the features in the data; SSR is the “size” of what the model did not capture. The basis of ANOVA is the study of this partition of the data.

The difference between SSTotal and SSR is the “size” of the data captured by the model:

\displaystyle R^2 = \frac{SSTotal - SSR}{SSTotal}  \leq 1

The idea of R-squared is that the numerator is the size the model captured from the data, and the denominator is the size of the features in the data. If the data is featureless, so that it is mainly natural fluctuation, there is nothing for a model to capture, and R-squared will be small.


Chi-squared includes the sample variance,

\displaystyle \chi^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2}

We can see that \chi^2 is (roughly) the SSR divided by the sample variance \sigma_i^2. For a good fit, the SSR is just the natural fluctuation, so each term \frac{(Y_i - \hat{Y_i})^2}{\sigma_i^2} \rightarrow 1. If the fit is poor, the SSR will be larger than the sample variance (there is something the model did not capture); if the fit is too good, the SSR will be much smaller than the sample variance.

For histogram data, the sample variance of each bin should follow the Poisson statistics that \sigma_i^2 = Y_i .

The reduced chi-squared is

\displaystyle \bar{\chi^2} = \frac{\chi^2}{ndf}

For n data points, the data live in an “n-dimensional” space. Suppose a model with p parameters generates these data points; the model “occupies” a “p-dimensional” subspace. The remaining “(n-p)-dimensional” space is generated by the natural fluctuation. Thus, the “size” of \chi^2 should be (n-p), the number of degrees of freedom. For a good fit,

\displaystyle \bar{\chi^2} \approx 1
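A small numerical sketch of the reduced chi-squared for histogram-like data, with Poisson errors \sigma_i^2 = Y_i as above. The counts are generated around a known constant model rate (a Gaussian approximation to Poisson counts is assumed here, which is fine for large counts):

```python
import random
from math import sqrt

random.seed(7)
model = 100.0
# 200 "bins" of counts fluctuating around the model rate,
# with Poisson-like spread sqrt(model).
counts = [round(random.gauss(model, sqrt(model))) for _ in range(200)]

chi2 = sum((y - model) ** 2 / y for y in counts)
ndf = len(counts)            # constant known model: no fitted parameters
print(chi2 / ndf)            # close to 1 for a good fit
```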


Pearson’s chi-squared test asks: can the data distribution be captured by a model distribution? More technically, what is the probability of observing the experimental distribution, given that the model distribution is the truth? One limitation of this test is that the number of counts in each bin cannot be too small.

The test calculates the following quantity,

\displaystyle X^2 = \sum_i \frac{(Y_i - \hat{Y_i})^2}{\hat{Y_i}}

The X^2 follows the \chi^2-distribution with degree of freedom ndf.

As with all hypothesis testing, we need to know the tail probability. For the \chi^2-distribution, only a one-tailed test is available, and this probability is usually called the p-value. Here is an example of the \chi^2-distribution.

The meaning of the p-value is that,

given that the null hypothesis (the model distribution) is true, the probability that data at least as extreme as the experimental data would occur.

When the p-value is less than 5%, the null hypothesis should be rejected. In other words, the model distribution cannot represent the experimental distribution; or, if the model distribution were true, there would be less than a 5% chance of observing the experimental data.

Conjugate Prior


Suppose we have a prior probability P(\theta), and an observation x with likelihood L(\theta|x) = P(x|\theta). According to Bayes' theorem, the updated, posterior probability is

\displaystyle P(\theta |x) = \frac{P(x|\theta) P(\theta) }{ P(x) }, \quad P(x) = \int P(x|\theta') P(\theta') d\theta'

Here, P(\theta|x) and P(\theta) are called conjugate distributions, and P(\theta) is called the conjugate prior to the likelihood P(x|\theta).

Suppose we have no knowledge of the prior probability; it is a fair assumption that P(\theta) is uniform. Then the posterior probability is proportional to the likelihood function, i.e.

 P(\theta | x ) \propto L(\theta | x) = P(x|\theta) .

Now, suppose we know the prior probability. After updating with new information, if the prior probability is “stable”, the new (posterior) probability should have the same functional form as the old (prior) probability!

When the likelihood function is a binomial distribution, the beta distribution turns out to be the “eigen” distribution whose functional form is unchanged after the update.

\displaystyle P(\theta | a, b) = \frac{\theta^{(a-1)} (1-\theta)^{(b-1)}}{Beta(a,b) }

where Beta(a,b) is the beta function, which served as the normalization factor.

After s successful trials and r failed trials, the posterior is

\displaystyle P(\theta | s, r) = \frac{\theta^{(s+a-1)} (1-\theta)^{(r+b-1)}}{Beta(s+a,r+b) }

When a = b = 1 \implies Beta(1,1) = 1, the prior is uniform and the posterior probability is proportional to the likelihood function.
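The conjugate update can be sketched numerically. Here the posterior mode after s = 8 successes and r = 2 failures on a uniform prior is found by a simple grid search (the numbers are illustrative, echoing the coin example from the earlier post):

```python
from math import gamma

# Beta-binomial conjugate update: prior Beta(a, b), then s successes
# and r failures give posterior Beta(a + s, b + r).
def beta_pdf(x, a, b):
    B = gamma(a) * gamma(b) / gamma(a + b)   # Beta(a, b) normalization
    return x ** (a - 1) * (1 - x) ** (b - 1) / B

a, b = 1, 1          # uniform prior
s, r = 8, 2          # observed successes / failures

post = lambda x: beta_pdf(x, a + s, b + r)
xs = [i / 1000 for i in range(1, 1000)]
mode = max(xs, key=post)
print(mode)          # posterior mode: s/(s+r) = 0.8 for a uniform prior
```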

It is interesting to write the Bayesian equation in a “complete” form

\displaystyle P(\theta| s + a , r + b) \propto P(s, r | \theta) P(\theta | a, b)

Unfortunately, the beta distribution is undefined for a = 0 or b = 0; therefore, when no “prior” trials were taken, there is no “eigen” probability.

Remark: this topic is strongly related to the Laplace rule of succession.


Update: the reason for keeping the same functional form of prior and posterior is that the inference of the mean and variance is more “consistent”.

 

Probability Limit of a null experiment


Suppose we want to measure the probability of tails for a coin, and we find that after 100 tosses, no tail has appeared. What is the upper limit of the probability of tails?


Suppose we toss a coin once and it shows a head. Obviously, we cannot say whether the coin is biased, because one result carries little information. So we toss it twice, and it shows heads again; and if we toss it 10 times, it shows 10 heads. Now we are quite confident that the coin is biased toward heads, but how biased is it?

Or, we ask, what is the likelihood of the probability of head?


The probability of having r head in n toss, given that the probability of head is p is

P( r-\textrm{head} | p ) = C^n_r p^r (1-p)^{n-r}

Now, we want to find out the “inverse”

\displaystyle  P( p = x | r-\textrm{head} )

This will generate a likelihood function.

\displaystyle L(x | r-\textrm{head}) = P( r-\textrm{head} | x ) = C^n_r x^r (1-x)^{n-r}



The peak position of the likelihood function for binomial distribution is

\displaystyle x_{\max}  = \frac{r}{n}

Since the distribution has a width, we can estimate the width at a given confidence level.

From the graph, the more tosses, the smaller the width, and the narrower the uncertainty of the probability.

The binomial distribution approaches a normal distribution as n \to \infty:

\displaystyle C^n_r p^r(1-p)^{n-r} \to \frac{1}{\sqrt{2\pi} \sigma} \exp\left( - \frac{(r- \mu)^2}{2\sigma^2} \right)

where \mu = n p and \sigma^2 = n p (1-p) .


When r = 0, i.e. no head is shown in n tosses, the likelihood function is

L(x | n, 0) =  (1 - x )^n


The Full-width Half-maximum is

\displaystyle x_{\textrm{FWHM}} = 1- \left(\frac{1}{2}\right)^{\frac{1}{n}}

We can assign this as a “rough” upper limit for the probability of heads.
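A quick computation of this upper limit for a few values of n:

```python
# FWHM "rough upper limit": the point where L(x) = (1-x)^n drops
# to half of its maximum L(0) = 1, i.e. x = 1 - (1/2)^(1/n).
def upper_limit(n):
    return 1 - 0.5 ** (1 / n)

for n in (10, 100, 1000):
    print(n, upper_limit(n))
```

For n = 100 tosses with no heads, the rough upper limit is about 0.7%.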

Testing Hypothesis


Hypothesis testing may be the most used and most misunderstood statistical tool. Even when we do a simple fit and want to evaluate the result, we use hypothesis testing. One common quantity is the reduced chi-squared.

A hypothesis test asks: given an observation and a hypothesis, is the hypothesis NOT true? That's right, a hypothesis test never tells us the trueness of a hypothesis, only its wrongness. The core of the test is: “Can we reject the null hypothesis?”

There are one-tailed and two-tailed testing, as a result, the p-value has different meanings.

https://en.wikipedia.org/wiki/One-_and_two-tailed_tests

https://en.wikipedia.org/wiki/P-value

The p-value is the probability, assuming the null hypothesis is true, of obtaining a result at least as extreme as the observation. When the p-value is too small, smaller than the significance level, the null hypothesis is rejected. But if the p-value is large, in a one-tailed test, we cannot say the null hypothesis is true; we can only say it CANNOT be rejected.

In a two-tailed test, there are two tail probabilities, one for each tail.

https://en.wikipedia.org/wiki/Confidence_interval

https://en.wikipedia.org/wiki/Type_I_and_type_II_errors

 

Variance and Sigma in finding resolution


I had misunderstood this for a while.

First of all, the square root of the variance of a distribution is not the “sigma” of a Gaussian fit, except for a normal distribution.

In an observation, there is usually an intrinsic variance, for example from a hole size or a physical window, and there is a resolution from the detection. As a result, we observe the combined effect of the intrinsic variance and the detector resolution. In data analysis, one of the goals is to find the resolution.

Lets denote the random variable of the intrinsic variance is

X \sim D(\mu, \sigma^2)

and the resolution of the detection is an other random variable of Normal distribution,

Y \sim N(0, s^2)

Then, we observed,

Z = X + Y \sim D(\mu, \sigma^2 + s^2),

according to the algebra of distribution.


If \sigma >> s and the intrinsic distribution is NOT Gaussian, say it is a uniform distribution, then the observed distribution is NOT Gaussian. One way to get the resolution is to do a de-convolution. But since we are not interested in the intrinsic distribution, only the resolution, we can simply use the variance of the intrinsic distribution and the variance of the observed distribution to extract the resolution.

When \sigma \leq s, the observed distribution is mostly Gaussian-like, and we can approximate the observed variance by the squared sigma of a Gaussian fit.


For example, consider deducing the time resolution using the time-difference method with the help of tracking, where a narrow width of the position is gated.

The narrow width of the position is equivalent to a uniform distribution of the time difference. Thus, the time resolution is deduced using the observed variance and the variance of the uniform distribution. For a position width of \Delta X, the width of the time difference is

\Delta t = \Delta X / (c \beta),

Thus,

Var(resol.) = Var(obser.) - Var(\Delta t)

The variance of a uniform distribution is 1/12 of the square of the width.

Var(\Delta t) = (\Delta t)^2/ 12 \neq (\Delta t)^2

The effect of the factor 1/12 is very serious when the resolution is similar to the width of the time difference, but it can be neglected when Var(resol.) >> Var(\Delta t).

If this 1/12 factor is missed, the deduced resolution will be smaller than the actual resolution.

Here is an example: I generated 10,000 data points; the intrinsic distribution is uniform from 0 to 100, and the resolution is a Gaussian with a sigma of 20. The resulting distribution is their convolution.
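The example can be reproduced with a short simulation, extracting the resolution by subtracting the width²/12 variance of the uniform part:

```python
import random
from math import sqrt

# 10,000 samples: uniform (0, 100) intrinsic distribution convolved
# with a Gaussian resolution of sigma 20.
random.seed(3)
N, width, sigma = 10_000, 100.0, 20.0
data = [random.uniform(0, width) + random.gauss(0, sigma) for _ in range(N)]

mean = sum(data) / N
var_obs = sum((x - mean) ** 2 for x in data) / (N - 1)

var_intrinsic = width ** 2 / 12          # the 1/12 factor
resol = sqrt(var_obs - var_intrinsic)
print(resol)                             # close to 20
```

Dropping the 1/12 factor would make var_obs - width² negative here, showing how badly the subtraction goes wrong without it.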



When finding the resolution using a projection from a 2-D distribution, we should be careful about the projected 1-D distribution and its variance. For example, when a uniform 2-D disk is projected to 1-D, the 1-D distribution is not uniform but a half circle,

pdf = f(x) = \sqrt{r^2-x^2}

the variance is (0.5 r)^2.


The formula to calculate the variance (for a zero-mean distribution) is

Var(X) = \int x^2 f(x) dx / \int f(x) dx ,

where f(x) is the pdf.
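Applying this formula numerically to the half-circle projection above confirms the (r/2)^2 variance:

```python
from math import sqrt

# Var(X) = int x^2 f(x) dx / int f(x) dx for f(x) = sqrt(r^2 - x^2),
# integrated over (-r, r) with a midpoint rule.
r = 1.0
steps = 100_000
h = 2 * r / steps

num = den = 0.0
for i in range(steps):
    x = -r + (i + 0.5) * h
    f = sqrt(r * r - x * x)
    num += x * x * f * h
    den += f * h

print(num / den)        # close to (r/2)^2 = 0.25
```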


To do: estimate the error of the deduced resolution as a function of the number of data points. For a small number of data points, the error should be large, but how large?

 

Goodness of Fit


We assume each data point is taken from a distribution with mean \mu and variance \sigma^2,

Y\sim D(\mu, \sigma^2)

in which, the mean can be a function of X.

For example, we have data Y_i related to an independent variable X_i. We would like to know the relationship between Y_i and X_i, so we fit a function y = f(x).

After the fitting (least-squares method), we will have a residual for each data point,

e_i = y_i - Y_i

This residual should follow the distribution

e \sim D(0, \sigma_e^2)

The goodness of fit is a measure of whether the distribution of the residuals agrees with the experimental error of each point, i.e. \sigma.

Thus, we divide the residuals by \sigma and define the chi-squared

\displaystyle \chi^2 = \sum_i \frac{e_i^2}{\sigma_{e_i}^2} .

we can see, the distribution of

e/\sigma_e \sim D(0, 1)

and the sum of the squares of these (strictly, when the residuals are normally distributed) follows the chi-squared distribution, which has a mean equal to the degrees of freedom DF. Note that the mean and the peak of the chi-squared distribution are not the same: the peak is at DF - 2 (for DF \geq 2).


In the case where we don’t know the error, the sample variance of the residuals is our best estimator of the true variance. The unbiased sample variance is

\sigma_s^2 = \sum_i e_i^2 / DF ,

where DF is the number of degrees of freedom. In the case of f(x) = a x + b, DF = n - 2, because two degrees of freedom are used by the two fitted parameters a and b.
