2 Sample T Test for Nonnormal data
I work in the advertising industry and I am in the process of creating a t test calculator. The calculator will be used to test the statistical differences between two different advertisements, two campaigns, two web pages, etc. I've made a click through rate significance calculator (using a Bernoulli distribution) and a calculator for the average order value (normal distribution, so a straightforward 2 sample t test). I'm trying to make a revenue per visit calculator now, but I am stuck on what to do!
The vast majority of visitors to a website will not purchase anything, hence they will have a revenue value equal to zero. Since most visitors will have revenue equal to zero, the distribution will be heavily skewed at zero. The sample sizes should be quite large (n>1000). I'm at a loss for how to formulate a hypothesis test for this metric. Any advice would be much appreciated! Thanks!
Re: 2 Sample T Test for Nonnormal data
t tests are for testing if it is true that there is a specific difference (usually 0) between the two means. I guess you are assuming that there is a difference so you aren't trying to disprove that the difference in revenue is 0. If you want to show that one advertising scheme is better than another then that is a different test. Confidence intervals would be a better approach.
The analysis is a bit long winded and some people abandon their mathhelpforum accounts after a couple days so if you can confirm that you are still looking for an answer then I'll post the analysis.
Re: 2 Sample T Test for Nonnormal data
Hi, thanks for your reply. I think you misunderstood my question. I am in fact trying to prove that two advertisements (i.e. two samples) are different from one another. The problem is that most users who visit a site do not purchase, hence they have a revenue value equal to zero. So, when looking at the samples, the data will not be normal (they will be heavily skewed at zero), thus standard 2 sample t tests are not valid. I need a workaround to this, probably a non parametric t test. Would love to hear your thoughts. Thanks
Re: 2 Sample T Test for Nonnormal data
Right, this took quite a while to figure out so I hope you can make use of it. It was quite interesting for me too; it's certainly the longest mathhelpforum answer I've ever given.
You seem quite well versed in statistics so I'm assuming you understand the concept of a confidence interval.
You can split up your distribution into a binomial distribution to model them buying something or not, and a normal distribution to model the amount people spend if they are buying something. The amount a website visitor spends would be the product of the distributions.
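The split described above can be sketched in Python. This is only an illustration under assumptions: the function name `split_revenue` and the toy data are hypothetical, and a visitor counts as a buyer whenever their revenue is above zero.

```python
def split_revenue(revenues):
    """Split per-visitor revenue into a purchase rate and the buyers' amounts."""
    # Assumption: any revenue above zero means the visitor bought something.
    buyer_amounts = [r for r in revenues if r > 0]
    purchase_rate = len(buyer_amounts) / len(revenues)
    return purchase_rate, buyer_amounts


# Hypothetical toy data: 8 visitors, 2 of whom bought something.
revenues = [0, 0, 0, 40.0, 0, 0, 60.0, 0]
rate, amounts = split_revenue(revenues)

mean_buyer_spend = sum(amounts) / len(amounts)      # 50.0
revenue_per_visit = sum(revenues) / len(revenues)   # 12.5

# The overall mean equals (purchase rate) * (mean spend among buyers).
print(rate * mean_buyer_spend, revenue_per_visit)
```

The identity in the last line is why the two pieces can be analysed separately and combined afterwards.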
Since your sample size is large you can use a normal distribution instead of a t distribution.
Let's say you have 2 sample sets to compare for two advertisements, set A and set B.
Find the confidence interval for the mean of the amount that buyers spend for A and B like you usually would.
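A minimal sketch of that step, assuming a large sample so that a normal quantile can stand in for the t quantile. The function name `mean_confidence_interval` and the example amounts are hypothetical; `alpha` is the significance.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(amounts, alpha=0.05):
    """Two-sided normal-based confidence interval for the mean buyer spend."""
    z = NormalDist().inv_cdf(1 - alpha / 2)         # two-tailed Z score
    centre = mean(amounts)
    std_error = stdev(amounts) / sqrt(len(amounts))  # sample std dev / sqrt(n)
    return centre - z * std_error, centre + z * std_error


# Hypothetical amounts spent by the buyers in set A (a real sample would be much larger).
low, high = mean_confidence_interval([40.0, 50.0, 60.0, 45.0, 55.0])
```

You would run this once on the buyers of set A and once on the buyers of set B.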
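For the binomial distribution I denote the true chance of someone buying something $\pi$ and your sample chance of someone buying something $p$. The variance of a binomial distribution is $n\pi(1-\pi)$ for the number of buyers out of $n$ visitors, so the sample proportion $p$ has variance $\pi(1-\pi)/n$, which you estimate by $p(1-p)/n$.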
With your sample size you shouldn't worry much about the sample standard deviation being inaccurate.
You can use the normal approximation to the binomial distribution. It has the same shape as the binomial but the bumps are smoothed out and it extends to infinity.
The normal approximation is like the histogram on the left but there is a part of the graph below 0 which should not be there; obviously the probability cannot be below zero. But when you remove this area it doesn't just vanish. If it did then all the probabilities wouldn't sum to 1. The removed area gets spread out across the whole histogram and that is why the graph on the right is taller. Say an area k is below 0. When you remove k the rest of the graph's height increases by a factor of $1/(1-k)$.
The normal approximation to the binomial distribution can be used for your sample size. It is commonly used when finding confidence intervals. The problem is that with your chance of success being so low, when you compute the lower limit of the confidence interval it could be negative, and obviously you can't have a negative chance.
The lower limit of the confidence interval is given by $p - Z\sqrt{p(1-p)/n}$.
With this in mind there are two cases for your interval.
The first case is simple, when the lower limit is non-negative.
Find Z as you usually would for a confidence interval for your significance and check if it satisfies $p - Z\sqrt{p(1-p)/n} \ge 0$.
If so, the part of the graph missing below 0 will barely affect your confidence interval and you can construct your confidence interval as usual. If $p - Z\sqrt{p(1-p)/n} < 0$ then you have a more complicated problem on your hands: case 2.
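Deciding between the two cases can be sketched as follows, assuming the usual lower limit $p - Z\sqrt{p(1-p)/n}$ with a two-tailed Z. The function names and the example rates are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def proportion_lower_limit(p, n, alpha=0.05):
    """Lower limit of the usual normal-approximation interval for a proportion."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return p - z * sqrt(p * (1 - p) / n)


def interval_case(p, n, alpha=0.05):
    """Return 1 if the lower limit is non-negative (simple case), otherwise 2."""
    return 1 if proportion_lower_limit(p, n, alpha) >= 0 else 2


# Hypothetical purchase rates with n = 1000 visitors each.
print(interval_case(0.02, 1000))   # lower limit stays above 0
print(interval_case(0.002, 1000))  # lower limit falls below 0
```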
The second case accounts for the missing part of the graph. This is a truncated normal distribution (see the Wikipedia article "Truncated normal distribution"). In theory the missing part could be as large as 50% of the graph depending on how close your sample mean is to 0.
Note that in this the cumulative distribution function (CDF) is used quite a bit. The symbol for it isn't available on these forums so I will just use $\text{cdf}_L(t)$ to indicate the left tail CDF. This is the total probability from minus infinity up to t standard deviations from the mean. You can think of it as the area under the graph between minus infinity and some vertical line at a point t. I believe most tables use the left CDF. I will also use $\text{cdf}_R(t)$ to indicate the right tail CDF. This is similar except it's from plus infinity to a point t, so $\text{cdf}_R(t) = 1 - \text{cdf}_L(t)$.
When finding Z scores for a confidence interval you are getting them from a standard normal distribution with $\mu = 0$ and $\sigma = 1$.
Here the confidence interval for a significance $\alpha$ lies in the range $-L$ to $L$. $L$ is the number of standard deviations it is away from the mean 0. The two unshaded tails both have an area of $\alpha/2$ so they have a total area of $\alpha$. The confidence interval covers all but $\alpha$ of the area, which is what is meant when you say you have a significance of $\alpha$.
But just as the earlier graph was adjusted for removing all the area below 0 the standard normal distribution must also be adjusted.
So how much is k? From the graph two above you can see that the distance 0 is away from the mean is $p$, which corresponds to a distance of $p/\sigma_p$ standard deviations away from the mean, where $\sigma_p = \sqrt{p(1-p)/n}$ is the standard deviation of your sample proportion.
In the standard normal distribution the part chopped off the graph is also $p/\sigma_p$ standard deviations away from the mean. Because the point is below the mean it is at a point $-p/\sigma_p$ from the mean. The area k is equal to the left tail CDF at that point, $k = \text{cdf}_L(-p/\sigma_p)$.
In the transformation, since every part of the graph went up to compensate for removing k, it is clear that $Q' = Q/(1-k)$ where Q is a point on the graph before the transformation and Q' is that same point (same number of standard deviations from the mean) after the transformation. This applies to both the graph of your distribution and the graph for the standard normal distribution.
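The removed area k can be computed instead of read from tables. A minimal sketch, assuming k is the left tail standard normal probability at $-p/\sigma_p$ with $\sigma_p = \sqrt{p(1-p)/n}$. The function name `truncated_area` and the example values are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def truncated_area(p, n):
    """Area k of the normal approximation that lies below zero."""
    sigma_p = sqrt(p * (1 - p) / n)       # std dev of the sample proportion
    return NormalDist().cdf(-p / sigma_p)  # left tail at -p/sigma_p


# Hypothetical example: 2 buyers per 1000 visitors.
k = truncated_area(0.002, 1000)
rescale_factor = 1 / (1 - k)  # every remaining height is multiplied by this
```

Because 0 is always below the mean p, k stays under 0.5, which matches the 50% ceiling mentioned for the truncated case.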
Since the lower end of your confidence interval is equal to zero your entire confidence interval on the graph will look like this
With the shaded region covering the area of the confidence interval. Note that the distance from $p$ to 0 is less than the distance from $p$ to the upper limit even if my image doesn't look like it.
Relating this to the standard normal distribution which we have transformed
u is the upper limit and v is the lower limit. We already know v, we need to find u. From the image you can see that it is a one tail confidence interval, therefore the right tail area beyond u in the truncated distribution is $\text{cdf}_R'(u) = \alpha$. But recall that $Q' = Q/(1-k)$. For the non truncated standard normal distribution $\text{cdf}_R(u) = \alpha(1-k)$. We are required to make this change because this is what is listed in tables.
Find u for $\text{cdf}_R(u) = \alpha(1-k)$ by looking in tables of the normal distribution for the right tail figures. Or $\text{cdf}_L(u) = 1 - \alpha(1-k)$ for the left tail figures.
Now that you have u and v you are ready to construct the confidence interval. Where your sample size is n, the lower limit is 0 and the upper limit is $p + u\sqrt{p(1-p)/n}$.
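The whole second case can be sketched in a few lines. This assumes the right tail area for u in the ordinary standard normal is $\alpha(1-k)$, with k the area below zero, and that the upper limit is p plus u standard deviations of the sample proportion. The function name `truncated_upper_limit` is hypothetical, and `NormalDist.inv_cdf` stands in for the table lookup.

```python
from math import sqrt
from statistics import NormalDist


def truncated_upper_limit(p, n, alpha=0.05):
    """Upper limit of the one-tail interval [0, upper] for the truncated case."""
    std_normal = NormalDist()
    sigma_p = sqrt(p * (1 - p) / n)
    k = std_normal.cdf(-p / sigma_p)               # area removed below zero
    u = std_normal.inv_cdf(1 - alpha * (1 - k))    # left tail lookup for u
    return p + u * sigma_p


# Hypothetical example: 2 buyers per 1000 visitors, significance 0.05.
upper = truncated_upper_limit(0.002, 1000)
interval = (0.0, upper)
```

A quick sanity check is that the area between 0 and the upper limit, divided by 1 - k, comes back as 1 - alpha.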
Finally the tricky bit is over!
Now you have a confidence interval for the probability someone will spend money and a confidence interval for how much a buyer does spend. You have these for both data sets A and B.
Let's say your revenue for A is between $r_1$ and $r_2$ and the probability someone buys something is in the range $p_1$ to $p_2$, which are the 2 ends of the confidence interval which you calculated. If you had the simple case then this simplifies to
If you had the simple case then your average revenue per visitor is
You can compare this final figure to the final figure you get for B to see if they overlap. If they do not overlap then you can say that one has a higher mean than the other. If they do overlap then you cannot say that one advertising scheme is better than the other and you will have to reduce your significance and try again.
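The overlap check itself is a one-liner. A minimal sketch, where each interval is a (low, high) pair and the function name and example figures are hypothetical:

```python
def intervals_overlap(a, b):
    """True if the closed intervals a = (low, high) and b = (low, high) share any point."""
    (a_low, a_high), (b_low, b_high) = a, b
    return a_low <= b_high and b_low <= a_high


# Hypothetical revenue-per-visitor intervals for A and B.
print(intervals_overlap((1.0, 2.0), (1.5, 3.0)))  # True: cannot separate A and B
print(intervals_overlap((1.0, 2.0), (2.5, 3.0)))  # False: B is higher than A
```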
If you had the complicated case then I do not know how to multiply the two uncertainties. You will have to get help elsewhere.