Bayes theorem


There are many good materials on Bayes' theorem. Here I just want to add a little more detail.

Bayes' theorem states that:

\displaystyle P(H|E) = \frac{P(E|H) P(H)}{P(E)} = \frac{P(E|H) P(H)}{P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})}

Here H is the event that the hypothesis is true, and E is the evidence.

The above equation can be visualized using a Venn diagram:

The red box is the event H; outside the red box is not-H, written \hat{H}. The green boxes are the event E; outside the green boxes is not-E. It is then clear that

\displaystyle P(E|H) = \frac{P(E \cap H)}{P(H)}

\displaystyle P(H|E) = \frac{P(E \cap H)}{P(E)}

\displaystyle P(E) = P(E \cap H) + P(E \cap \hat{H}) = P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})


One common application is medical testing. Suppose a test for a disease has a 90% true positive rate and a 10% false positive rate, and the prevalence of the disease in the population is 0.1%. What is the probability that a positive test really means infection?

From the data, we have P(H) = 0.001,  P(E|H) = 0.9, P(E|\hat{H}) = 0.1, then

\displaystyle P(H|E) = \frac{0.9 \times 0.001}{ 0.9 \times 0.001 + 0.1 \times 0.999} = 0.009

which means only a 0.9% chance that the person is infected. Let's do another test; it is still positive, then

\displaystyle P(H|E)= \frac{0.9 \times 0.009} {0.9 \times 0.009 + 0.1 \times 0.991} = 0.076

which is still only a 7.6% chance that the person is infected.

Number of positive tests    P(H|E)
1                           0.89%
2                           7.5%
3                           42.2%
4                           86.8%
5                           98.3%
6                           99.8%

This table shows that, for a very rare disease, a testing method with a 90% true positive rate is not sufficient on its own; several consecutive positive results are needed.
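
As a sanity check, here is a minimal Python sketch (using the same assumed 90% true-positive and 10% false-positive rates) that iterates the Bayesian update and reproduces the table above:

    # Iterated Bayesian update after repeated positive tests.
    # Assumes P(E|H) = 0.9 (true positive) and P(E|~H) = 0.1 (false positive).
    def update(prior, p_e_h=0.9, p_e_not_h=0.1):
        """Posterior P(H|E) after one positive test."""
        num = p_e_h * prior
        return num / (num + p_e_not_h * (1 - prior))

    p = 0.001  # prior P(H): 0.1% prevalence
    for n in range(1, 7):
        p = update(p)
        print(f"{n} positive test(s): P(H|E) = {p:.2%}")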

In fact, for P(E|H) + P(E| \hat{H} ) = 1 ,

\displaystyle P(H|E) = \frac{P(E|H) P(H) }{ P(E|H) P(H) + (1- P(E|H)) (1-P(H))}

when P(E|H) = 0.5, P(H|E) = P(H)
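
This can be seen by substituting P(E|H) = 0.5 (and hence P(E|\hat{H}) = 0.5) into the formula:

\displaystyle P(H|E) = \frac{0.5\, P(H)}{0.5\, P(H) + 0.5\,(1-P(H))} = P(H)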

The above plot shows the curve of P(H|E) as a function of P(H) for various P(E|H). The black arrows start from the prior P(H) = 0.2, and they show that each positive test with P(E|H) = 0.9 iterates toward a higher and higher P(H|E).

We can see that, if P(E|H) < 0.5, the test is basically useless, since P(E|\hat{H}) > 0.5 means there are more false positives than true positives.


A trickier situation is: what if the 1st test is positive, the 2nd test is negative, and the 3rd test is positive? In this case, we have to evaluate

\displaystyle P(H|\hat{E}) = \frac{P(\hat{E}|H) P(H)}{P(\hat{E}|H) P(H) + P(\hat{E}|\hat{H}) P(\hat{H})}

Using the Venn diagram,

\displaystyle P(\hat{E}|H) = 1 - P(E|H),  P(\hat{E}|\hat{H}) = 1 - P(E|\hat{H})

\displaystyle P(H|\hat{E}) = \frac{(1-P(E|H)) P(H)}{ (1-P(E|H))P(H) + (1-P(E|\hat{H})) P(\hat{H})}

Using P(E|H) = 1 - P(E|\hat{H} ) ,

\displaystyle P(H|\hat{E}) [P(E|H)] = P(H|E) [1-P(E|H)]

That is, the curve for P(H|\hat{E}) with parameter P(E|H) coincides with the curve for P(H|E) with parameter 1-P(E|H); on the plot, the negative-test curve is the "mirror" of the positive-test curve about the diagonal, because the negative update is the inverse of the positive update.

The above plot starts with the prior P(H) = 0.2. The first test is positive, giving P(H|E) = 0.69, but the 2nd test is negative, which gives P(H|\hat{E}) = 0.2, back to the prior.
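
A minimal Python sketch of this back-and-forth (assuming the same symmetric test, P(E|H) = 0.9 and P(E|\hat{H}) = 0.1):

    # A positive test followed by a negative test returns to the prior
    # when P(E|H) + P(E|~H) = 1.
    def positive(prior, tp=0.9):
        return tp * prior / (tp * prior + (1 - tp) * (1 - prior))

    def negative(prior, tp=0.9):
        return (1 - tp) * prior / ((1 - tp) * prior + tp * (1 - prior))

    p0 = 0.2
    p1 = positive(p0)   # ~0.69 after the positive test
    p2 = negative(p1)   # ~0.20 after the negative test, back to the prior
    print(p0, p1, p2)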


In the above discussion, we assumed that P(E|H) + P(E|\hat{H}) = 1, i.e. that the true positive and false positive probabilities sum to unity. But that is not necessarily true.

\displaystyle P(H|E) = \frac{P(E|H) P(H)}{ P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})}

If we rewrite this with simpler variables, we have

\displaystyle f(h, x) = \frac{x h}{ x h + (1-x) (1-h)}

\displaystyle g(h, z, y) = \frac{z h}{ z h + y (1-h) }

where f(h, x) represents P(H|E) for the special case P(E|H) + P(E|\hat{H}) = 1, and g(h, z, y) represents the general case. And we can check that

\displaystyle f\left(h, \frac{z}{z+y}\right) = g(h, z, y),  z \neq 0, y \neq 0

So the curve for g(h, z, y) is the same as f(h, x) under the transformation x = z/(z+y).
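
A quick numerical check of this identity (a sketch, with randomly chosen h, z, y):

    # Check that f(h, z/(z+y)) == g(h, z, y) for random values.
    import random

    def f(h, x):
        return x * h / (x * h + (1 - x) * (1 - h))

    def g(h, z, y):
        return z * h / (z * h + y * (1 - h))

    for _ in range(5):
        h = random.uniform(0.01, 0.99)
        z, y = random.uniform(0.01, 1), random.uniform(0.01, 1)
        assert abs(f(h, z / (z + y)) - g(h, z, y)) < 1e-12
    print("f(h, z/(z+y)) agrees with g(h, z, y)")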


Up to today, there have been zero covid cases reported for 40 consecutive days in Hong Kong. What is the probability that the virus still exists in Hong Kong?

Our prior for covid is P(H) = 0.2, i.e. 20% of people have covid at day 0. Assume the probability that people with covid show symptoms is 80%, and 20% show no symptoms, i.e. P(S|H) = 0.8, P(\hat{S}|H) = 0.2. Also, it is quite likely that people without covid show covid-like symptoms, say P(S|\hat{H}) = 0.8. And we also suppose that the covid test has a 70% true positive rate and a 10% false positive rate.

The probability of a covid case being reported is the probability of showing symptoms and testing positive, summed over the truly infected and the uninfected (false positives).

P(E) = P(E|H) P(H) + P(E|\hat{H}) P(\hat{H})

P(E|H) = P(S|H) \times 0.7 = 0.8 \times 0.7 = 0.56, so there is a 56% chance that a true covid case will be reported.

P(E|\hat{H}) = P(S|\hat{H}) \times 0.1 = 0.8 \times 0.1 = 0.08, so there is an 8% chance that a false covid case will be reported.

So P(E) = 0.56 \times 0.2 + 0.08 \times 0.8 = 0.176, i.e. there is a 17.6% chance that a covid case will be reported in the population.

P(H|\hat{E}) = (1-0.56) \times 0.2 / (1-0.176) = 0.11, thus there is an 11% chance that covid is present even though no case is reported on the 1st day.
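
Iterating this update day by day can be sketched in a few lines of Python (same assumed numbers, P(E|H) = 0.56, P(E|\hat{H}) = 0.08, prior 0.2):

    # P(H | zero reports) after consecutive days with no reported case.
    p_noE_H, p_noE_notH = 1 - 0.56, 1 - 0.08   # 0.44 and 0.92
    p = 0.2                                     # prior P(H) at day 0
    for day in range(1, 41):
        p = p_noE_H * p / (p_noE_H * p + p_noE_notH * (1 - p))
        if day in (1, 20, 40):
            print(f"day {day:2d}: P(H | zero reports) = {p:.3g}")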

Here is the plot of P(H|\hat{E}) vs the number of days with zero reported cases.

Hong Kong has 7.5 million people. At the 20th day of zero reported cases, there could still be about 1 infected person hidden in the population. But at the 40th day of zero reported cases, the chance that covid is still present is about 3.8 \times 10^{-14}.

In the early days of the pandemic policy, the HK government required 21 days with no reported case as a condition for relaxing the measures, which is reasonable: by our rough estimate, there could still be about 1 real case in the population at that point.

In our rough estimate, we ignored the spreading of covid. To include spreading, we can multiply P(H|\hat{E}) by (1 + R) every day, where R is the R-factor. Suppose R = 0.7; then the covid probability in the population on the 1st day of zero reported cases would be (1+0.7) \times 0.11, and so on. It turns out that P(H|\hat{E}) then needs more time to become as small as 10^{-6}.

But given that Hong Kong has one of the lowest infection rates in the world, the effective R-factor should be small or close to 0; R = 0.7 is more like the situation without any protection.
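
A rough sketch of that modification (assuming the spreading is modelled simply by multiplying each day's posterior by 1 + R, as described above):

    # Include spreading by multiplying each day's posterior by (1 + R).
    R = 0.7
    p_noE_H, p_noE_notH = 0.44, 0.92
    p = 0.2
    day = 0
    while p > 1e-6:
        p = p_noE_H * p / (p_noE_H * p + p_noE_notH * (1 - p))
        p = min((1 + R) * p, 1.0)   # crude spreading correction
        day += 1
    print(f"P(H | zero reports) drops below 1e-6 after {day} days")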

Conjugate Prior


Suppose we have a prior probability P(\theta), and an observation x whose likelihood is L(\theta|x) = P(x|\theta). According to Bayes' theorem, the updated, posterior probability is

\displaystyle P(\theta |x) = \frac{P(x|\theta) P(\theta) }{ P(x)}, \quad P(x) = \int P(x|\theta') P(\theta') d\theta'

Here, P(\theta|x) and P(\theta) are called conjugate distributions, and P(\theta) is called the conjugate prior for the likelihood P(x|\theta).

Suppose we have no knowledge of the prior probability; it is a fair assumption that P(\theta) = 1, i.e. uniform. Then the posterior probability is proportional to the likelihood function, i.e.

 P(\theta | x ) \propto L(\theta | x) = P(x|\theta) .

Now, suppose we know the prior probability. After updating with new information, if the prior probability is "stable", the new (posterior) probability should have the same functional form as the old (prior) probability!

When the likelihood function is the binomial distribution, it is found that the beta distribution is the "eigen" distribution whose functional form is unchanged after the update.

\displaystyle P(\theta | a, b) = \frac{\theta^{(a-1)} (1-\theta)^{(b-1)}}{Beta(a,b) }

where Beta(a,b) is the beta function, which served as the normalization factor.

After s successful trials and r failed trials, the posterior is

\displaystyle P(\theta | s, r) = \frac{\theta^{(s+a-1)} (1-\theta)^{(r+b-1)}}{Beta(s+a,r+b) }

When a = b = 1, so that Beta(1,1) = 1 and the prior is uniform, the posterior is simply proportional to the binomial likelihood function.
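
Here is a small numerical check of this conjugacy (a sketch using scipy; the prior parameters and trial counts below are arbitrary choices):

    # Check that a Beta(a, b) prior updated with a binomial likelihood
    # (s successes, r failures) gives a Beta(s + a, r + b) posterior.
    import numpy as np
    from scipy.stats import beta, binom

    a, b = 2.0, 3.0          # prior parameters (arbitrary)
    s, r = 7, 4              # observed successes and failures
    theta = np.linspace(0.001, 0.999, 999)

    prior = beta.pdf(theta, a, b)
    likelihood = binom.pmf(s, s + r, theta)
    posterior = prior * likelihood
    posterior /= posterior.sum() * (theta[1] - theta[0])   # normalize numerically

    print(np.allclose(posterior, beta.pdf(theta, s + a, r + b), atol=1e-3))  # True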

It is interesting to write the Bayesian equation in a “complete” form

\displaystyle P(\theta| s + a , r + b) \propto P(s + a , r+ b | \theta) P(\theta | a, b)  

Unfortunately, the beta distribution is undefined for a = 0 or b = 0; therefore, when no "prior" trials have been taken, there is no "eigen" probability.

Remark: this topic is strongly related to the Laplace rule of succession.
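
Indeed, with the uniform prior a = b = 1, the posterior is Beta(s+1, r+1), whose mean after n = s + r trials is

\displaystyle \langle \theta \rangle = \frac{s+a}{s+a+r+b} = \frac{s+1}{n+2}

which is exactly the rule of succession.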


Update: the reason for requiring the same functional form for the prior and the posterior is that the inference of the mean and variance is then more "consistent".

 

Probability Limit of a null experiment


Suppose we want to measure the probability of a coin landing tail, and we find that after 100 tosses no tail appears. What is the upper limit of the probability of a tail?


Suppose we have a coin and toss it once; it shows a head. Obviously, we cannot say whether the coin is biased or not, because one result carries very little information. So we toss it 2 times and it shows a head again, and if we toss it 10 times, it shows 10 heads. Now we are quite confident that the coin is biased toward heads, but how biased is it?

Or, we ask, what is the likelihood function for the probability of a head?


The probability of getting r heads in n tosses, given that the probability of a head is p, is

P( r-\textrm{head} | p ) = C^n_r p^r (1-p)^{n-r}

Now, we want to find out the “inverse”

\displaystyle  P( p = x | r-\textrm{head} )

This will generate a likelihood function.

\displaystyle L(x | r-\textrm{head}) = P( r-\textrm{head} | x ) = C^n_r x^r (1-x)^{n-r}



The peak position of the likelihood function for the binomial distribution is

\displaystyle x_{\max}  = \frac{r}{n}

Since the distribution has a width, we can estimate the width at a given confidence level.

From the graph, the more tosses, the smaller the width, and the narrower the uncertainty on the probability.

The binomial distribution approaches a normal distribution as n \to \infty:

\displaystyle C^n_r p^r(1-p)^{n-r} \to \frac{1}{\sqrt{2\pi \sigma^2}} \exp\left( - \frac{(r- \mu)^2}{2\sigma^2} \right)

where \mu = n p and \sigma^2 = n p (1-p) .
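
A short sketch of this behaviour (the particular n and r values below are just illustrative):

    # The likelihood L(x) = C(n, r) x^r (1-x)^(n-r) peaks at r/n, and its
    # width shrinks roughly like sqrt(x_peak * (1 - x_peak) / n).
    import numpy as np
    from scipy.stats import binom

    x = np.linspace(0.001, 0.999, 9999)
    for n, r in [(10, 7), (100, 70), (1000, 700)]:
        L = binom.pmf(r, n, x)                        # likelihood as a function of x
        x_peak = x[np.argmax(L)]
        width = np.sqrt(x_peak * (1 - x_peak) / n)    # normal-approximation scale
        print(f"n = {n:4d}: peak at {x_peak:.3f}, approximate width {width:.4f}")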


When r = 0, i.e. no head is shown in n tosses, the likelihood function is

L(x | n, 0) =  (1 - x )^n


The full width at half maximum is

\displaystyle x_{\textrm{FWHM}} = 1- \left(\frac{1}{2}\right)^{\frac{1}{n}}

We can take this as a "rough" upper limit for the probability of a head.
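
For example, a quick calculation for the 100-toss case of the opening question (a rough estimate, not a formal confidence limit):

    # Rough upper limit from the half-maximum of L(x | n, 0) = (1 - x)^n.
    n = 100
    x_limit = 1 - 0.5 ** (1 / n)
    print(f"After {n} tosses with zero successes, rough upper limit: {x_limit:.4f}")  # ~0.0069
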

Testing Hypothesis


Hypothesis testing may be the most used and most misunderstood statistical tool. Even when we do a simple fitting and want to evaluate the fitting result, we have to use hypothesis testing. One common quantity used is the reduced chi-squared.

A hypothesis test asks: given an observation and a hypothesis, is the hypothesis NOT true? That's right, a hypothesis test never tells us the trueness of the hypothesis, only its wrongness. The core of the test is "Can we reject the null hypothesis?"

There are one-tailed and two-tailed tests, and as a result the p-value has different meanings.

https://en.wikipedia.org/wiki/One-_and_two-tailed_tests

https://en.wikipedia.org/wiki/P-value

The p-value is the probability of obtaining a result at least as extreme as the observation, assuming the null hypothesis is true. When the p-value is too small, smaller than the chosen significance level, the null hypothesis is rejected. But if the p-value is large, in a 1-tailed test we cannot say the null hypothesis is true; we can only say the null hypothesis CANNOT be rejected.

In a 2-tailed test, both tails are counted, and the p-value combines the probabilities of the two tails.
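
As a concrete illustration, here is a sketch of how a p-value follows from a chi-squared fit statistic (the numbers below are hypothetical):

    # p-value from a chi-squared goodness-of-fit statistic (hypothetical numbers).
    from scipy.stats import chi2

    chi2_value = 25.0    # hypothetical chi-squared from a fit
    dof = 18             # hypothetical degrees of freedom
    p_value = chi2.sf(chi2_value, dof)   # one-tailed: P(X >= chi2_value)
    print(f"reduced chi-squared = {chi2_value / dof:.2f}, p-value = {p_value:.3f}")
    # A p-value below the chosen significance level would reject the null
    # hypothesis; a large p-value only means it cannot be rejected.
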

https://en.wikipedia.org/wiki/Confidence_interval

https://en.wikipedia.org/wiki/Type_I_and_type_II_errors