Hi MHF,

I'm a physics undergraduate student at Oxford doing his MPhys project. I've got a problem that to me seems quite difficult but may be very easy for an experienced statistician. I'll lay out the problem below. Is there anyone in the department who could give me a clue as to if I'm on the right track and what statistics I need to learn?


Thank you very much for any assistance.


Regards,


Will








The problem - the physics and notation:


The galaxies we are considering can be considered axisymmetric oblate spheroids.
Spheroid - Wikipedia, the free encyclopedia


These have two dimensions, the height and the width, the ratio of which is the true axial ratio. When you look at a galaxy in the sky it is randomly oriented. If the galaxy is edge on to you, then the ellipse you observe will have the same axial ratio as the galaxy, the true axial ratio, r. If the galaxy is "face on" then the axial ratio will be at a maximum, 1. The orientation of the galaxy means that the observed axial ratio, q, for any galaxy is between r and 1.


Suppose we have a group of 244 galaxies all randomly oriented in the sky. The number of galaxies with axial ratio between q and q + dq (for small dq) is given by f(q) dq, so f(q) is the frequency distribution of observed galaxies. Similarly the number of galaxies with true axial ratio in interval [r, r + dr] is n(r) dr.


Here comes the stats.


Simply from the assumption that the galaxies are randomly oriented, we can derive that


(using latex notation incase the images don't work - copy and paste the text equations here Online LaTeX Equation Editor - create, integrate and download)


f(q) = \int dr \: n(r) P(q|r)





= \int_{0}^{q} dr \: \frac{ n(r).\:q}{\sqrt{1 - r^{2}} \sqrt{q^{2} - r^{2}}}





We want to reverse this equation, that is, we have f(q) and wish to find n(r)


To do this we first discretise this equation , sampling both n and f into N bins of width 1/N. Equation above becomes:


f(q_{i}) = \sum_{j = 0}^{i-1} P(q_{i},r_{j})\:n(r_{j})





Where here, f_i labels bin i of the observed ellipticities q, r_ j is bin j of the actual ellipticities r. And where


P(q_{i},r_{j}) = \frac{q}{\sqrt{1-r^{2}}\sqrt{q^{2} - r^{2}}}





...is analogous to P(q|r) and q and r are taken at the midpoint of each bin. Note in the sum equation we stop at bin i - 1 instead of bin i, as P(q_i | r_j ) -> infinity when i = j. This is later refered to as the approximation that P( j | j ) = 0


So to keep track of what we're doing, we're now solving the sum equation for n(r_ j), given known f and P. This sum equation is a matrix equation (with P[i,j] = 0 for j>i). K has zeroes on the leading diagonal so is singular and cannot be inverted. However, K is lower triangular, so we can solve it by forward substitution. I'll quickly explicitly show what that means.


Our matrix equation reads:


\begin{pmatrix}
f_{0}\\
f_{1}\\
f_{2}\\
f_{3}\\
f_{4}\\
\vdots
\end{pmatrix} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & ...\\
P_{10} & 0 & & & & \\
P_{20} & P_{21} & 0 & & & \\
P_{30} & P_{31} & P_{32} & 0 & & \\
P_{40} & P_{41} & P_{42} & P_{43} & 0 & \\
\vdots & & & & & \ddots
\end{pmatrix}\begin{pmatrix}
n_{0}\\
n_{1}\\
n_{2}\\
n_{3}\\
n_{4}\\
\vdots
\end{pmatrix}






So the equations this represents are:
f_{0} = 0

f_{1} = P_{10}\:n_{0}

f_{2} = P_{20}\:n_{0} + P_{21}\:n_{1}

...


Which can be reversed to find the elements of n:
n_{0}= \frac{f_{1}}{ P_{10}}


Then
n_{1} = \frac{f_{2} - P_{20}\:n_{0} }{P_{21}}

Then
n_{2} = \frac{f_{3} - (P_{30}\:n_{0} + P_{31}\:n_{1}) }{P_{31}}

...


So far so logical.


Here is where the problem is. To illustrate we will take an example observed distribution f and for simplicity set all nonzero P to 0.2. Here we have gone through the process and filled out n.


\begin{pmatrix}
0\\
0\\
1\\
0\\
0\\
\vdots
\end{pmatrix} = \begin{pmatrix}
0 & 0 & 0 & 0 & 0 & ...\\
0.2 & 0 & & & & \\
0.2& 0.2 & 0 & & & \\
0.2 & 0.2 & 0.2 & 0 & & \\
0.2 & 0.2 & 0.2 & 0.2 & 0 & \\
\vdots & & & & & \ddots
\end{pmatrix}\begin{pmatrix}
0\\
5\\
-5\\
0\\
0\\
\vdots
\end{pmatrix}






Clearly something has gone wrong because we have calculated that we will get -5 galaxies in n_2!


Let's look at the process step by step and see why this has happened.


We know a galaxy can only be observed as having a higher axial ratio than it's true axial ratio. In the discrete case this means that galaxies in an actual axial ratio bin, e.g bin 2, which may contain e.g. axial ratios in range [0.2 : 0.3], can only be measured as having an observed axial ratio that will contribute to the population of bins 3, 4, 5... of the observed axial ratio distribution. In other words, if we have one galaxy with actual axial ratio 0.24, so all of the n (actual axial ratio distribution) bins have population 0 except for bin 2 with population 1. Then this 1 galaxy can only be observed (go into f) in bin 3 or bin 4 or bin 5... etc.


Starting from from the top then, there are no n bins less than the 0th bin, so f_0 cannot be contributed to by any bins and is necessarily 0.


Then there is only one n bin, the 0th n bin, less than the 1st f bin. Thus the 1st f bin can only be contributed to by galaxies in the 0th n bin. If there is a 0.2 chance of a galaxy in bin 0 going into f bin 1, then we expect 5 galaxies in n bin 0 for each galaxy measured in f bin 1. However, in our example f_1 = 0, so n_0 = 0.


Next, there are two bins, the 0th and the 1st n bins less than the 2nd f bin. Only those n bins may contribute to the 2nd f bin. We already 'expect' (i.e. know) that there are 0 galaxies in n_0, so we know no galaxies are contributing to f_2 from that. Thus all the galaxies in f_2 are from n_1 bin. f_2 = 1, and P(2|1) = 0.2, so we expect 5 galaxies in n bin 1.


And then.... there are 3 bins, the 0th, 1st, 2nd less than the 3rd f bin. Only these may contribute to the 3rd f bin. We know n_0 = 0, n_1 = 5. If n_1 = 5, we expect from that n bin alone, there to be 5*0.2 = 1 galaxy in f_3. Any surplus galaxies in f_3 must have come from n_2. But in fact there are ZERO galaxies in f_3, which means that there are -5 galaxies in the n_2 bin.


This is wrong, so where is the error? I've identified some problems myself with the model but this is where I need a statistician to come in and help.


The issue is that the algorithm does not take into account that P is only a probability function and not an analytic rule. The calculation does not account for the fact that when you observe zero galaxies in f_1, there are not necessarily zero galaxies in n0 only that there is a probability of there being zero galaxies in f0.

And similarly, the fact that there is 1 in f_2 after none in f_0 and f_1 does not necessarily mean that there are 5 galaxies in n_1.

The fact that there are 5 galaxies in n1 does not necessarily mean we can expect to see one in f2, only that there is a finite probability of seeing one in n3.





My final thoughts:


I think this has something to do with 'priors', where you have the result of the distribution, f, and the distribution itself, and you need to work backwards.


It's possible that a completely different way of calculating n is necessary.


I find that for small N (number of bins), such that every bin is populated enough to be representative of the distribution (i.e. without the discretisation being "visible") we CAN expect this to work as each bin more accurately will represent the shape of the distribution. but for small N (so wide bins) our approximation that P( j | j) = 0 made when discretising, to avoid P blowing up, becomes invalid.


Then perhaps P( j | j) = 0 is a silly approximation. We are taking what would be the largest contribution, where P -> infinity, and approximating it to 0! Note that P is a probability density and so for large enough N, we hope that that the single bin (and thus total probability of that bin) when i = j is small enough to be negligible. Instead though, we could use an approximation:


Probability of observing in f bin j given we have one galaxy in n bin j is
= P( j | j ) = P'(r_{j} < q < (r_j + W) | r_{j})





where P' implies using the continuous equation for P, W is the bin width and r_ j is the midpoint of the bin. The RHS here is just a number. This is motivated by the fact that the average value of the actual axial ratio in box n is r_j, and then roughly speaking the probability that you land in the same box is the probability that q is within that box but greater than r_j .


Ideally we'd like a solution that is valid when N->infinity such that we effectively recover n in the continuous limit.








Thank you in advance for any help, and for reference in case this message is passed on, my email address is william.taylor@trinity.ox.ac.uk