1. ## Calculating Statistical Significance?

Does anyone know the best significance test for a dataset following a hypergeometric distribution with a sample size over 30?

Suppose you have 100 recipe sheets, and 10 of these sheets are linked because they all contain the same ingredient. The chance of picking a recipe sheet on its own with that specific ingredient is 0.408, but the probability of all 10 sheets picked having the same ingredient is 5.77*10^-14.

I am just trying to figure out whether 5.77*10^-14 is statistically significant?

Thanks for any kind of help

2. Originally Posted by syrivaci
Does anyone know the best significance test for a dataset following a hypergeometric distribution with a sample size over 30?

Suppose you have 100 recipe sheets, and 10 of these sheets are linked because they all contain the same ingredient. The chance of picking a recipe sheet on its own with that specific ingredient is 0.408, but the probability of all 10 sheets picked having the same ingredient is 5.77*10^-14.

I am just trying to figure out whether 5.77*10^-14 is statistically significant?

Thanks for any kind of help
I'm having a bit of trouble understanding exactly what question is being
asked here. Is there a test going on? If so what is being tested?

Also I'm a bit puzzled by "The chance of picking a recipe sheet on its own
with that specific ingredient is 0.408". If 10 sheets of 100 contain this
ingredient, why is the probability not 0.1?

I expect I've missed something obvious here, but I don't understand the
question.

RonL

3. ## Thanks

Originally Posted by CaptainBlack
I'm having a bit of trouble understanding exactly what question is being
asked here. Is there a test going on? If so what is being tested?

Also I'm a bit puzzled by "The chance of picking a recipe sheet on its own
with that specific ingredient is 0.408". If 10 sheets of 100 contain this
ingredient, why is the probability not 0.1?

I expect I've missed something obvious here, but I don't understand the
question.

RonL
Hi, Captain Black

Sorry for the confusion.

From my understanding:
The number of ways to choose 10 objects out of 100 is C(100,10) = 17310309456440

1/17310309456440 = 5.77*10^-14

Plus:
C(90,9) C(10,1) / C(100,10) = 0.40799532232929

I just want to work out if 1/17310309456440 = 5.77*10^-14 is statistically significant. I am pretty rusty on p-values and which test to use, so please go easy on me.

Thanks again.

4. Originally Posted by syrivaci
Does anyone know the best significance test for a dataset following a hypergeometric distribution with a sample size over 30?

Suppose you have 100 recipe sheets, and 10 of these sheets are linked because they all contain the same ingredient. The chance of picking a recipe sheet on its own with that specific ingredient is 0.408, but the probability of all 10 sheets picked having the same ingredient is 5.77*10^-14.

I am just trying to figure out whether 5.77*10^-14 is statistically significant?

Thanks for any kind of help
With a hypergeometric density $\displaystyle f$ with parameters 100, 10 and 10:

$\displaystyle f(1) = \frac{{10 \choose 1}{90 \choose 9}}{{100 \choose 10}} = .40799.$
$\displaystyle f(10) = \frac{{10 \choose 10}{90 \choose 0}}{{100 \choose 10}} = 5.78 \times 10^{-14}.$

From your later post I gather that you're not sure how to phrase the test. So I'll suggest it is this.

The null hypothesis is that you are drawing recipes from a population of 100 recipes of which 10 have a certain ingredient. The alternative hypothesis is that more than 10 recipes have the ingredient.

You are drawing a sample of 10 (not 30) recipes without replacement from the population. So $\displaystyle x$, the number of recipes in the sample with the ingredient, has a hypergeometric distribution with parameters 100, 10 and 10 under the null hypothesis. A best test would reject the null hypothesis if the p-value for $\displaystyle x$ under the null is less than .05. Since the p-value for $\displaystyle x = 10$ is $\displaystyle 5.78 \times 10^{-14} \ll .05$, the result is extremely statistically significant and the null hypothesis is overwhelmingly rejected.

Put another way, under the null hypothesis it is extremely unlikely that you would draw 10 recipes with the ingredient. So reject the null hypothesis.
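
For anyone who wants to check the two densities numerically, here is a minimal Python sketch using only the standard library. The function name `hypergeom_pmf` and its argument names are my own labels, not anything from the posts above; the parameters are the 100, 10 and 10 used in the formulas.

```python
from math import comb

def hypergeom_pmf(x, population, successes, draws):
    """P(X = x): exactly x marked items in `draws` draws without replacement
    from `population` items of which `successes` are marked."""
    return (comb(successes, x)
            * comb(population - successes, draws - x)
            / comb(population, draws))

# 100 recipes, 10 contain the ingredient, sample of 10 recipes.
f1 = hypergeom_pmf(1, 100, 10, 10)    # exactly one sheet with the ingredient
f10 = hypergeom_pmf(10, 100, 10, 10)  # all ten sheets with the ingredient

print(f"f(1)  = {f1:.8f}")
print(f"f(10) = {f10:.3e}")
```

Here f(10) is simply 1 / C(100,10), which is where the tiny number in the original question comes from.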

5. Originally Posted by JakeD
With a hypergeometric density $\displaystyle f$ with parameters 100, 10 and 10:

$\displaystyle f(1) = \frac{{10 \choose 1}{90 \choose 9}}{{100 \choose 10}} = .40799.$
Ah, so the 0.408 is the probability that exactly one sheet in a sample of 10
uses the ingredient. That's better.

RonL

My comprehension seems to be suffering in addition to my accuracy
at the moment.

6. ## One more thing...

Originally Posted by JakeD
With a hypergeometric density $\displaystyle f$ with parameters 100, 10 and 10:

$\displaystyle f(1) = \frac{{10 \choose 1}{90 \choose 9}}{{100 \choose 10}} = .40799.$
$\displaystyle f(10) = \frac{{10 \choose 10}{90 \choose 0}}{{100 \choose 10}} = 5.78 \times 10^{-14}.$

From your later post I gather that you're not sure how to phrase the test. So I'll suggest it is this.

The null hypothesis is that you are drawing recipes from a population of 100 recipes of which 10 have a certain ingredient. The alternative hypothesis is that more than 10 recipes have the ingredient.

You are drawing a sample of 10 (not 30) recipes without replacement from the population. So $\displaystyle x$, the number of recipes in the sample with the ingredient, has a hypergeometric distribution with parameters 100, 10 and 10 under the null hypothesis. A best test would reject the null hypothesis if the p-value for $\displaystyle x$ under the null is less than .05. Since the p-value for $\displaystyle x = 10$ is $\displaystyle 5.78 \times 10^{-14} \ll .05$, the result is extremely statistically significant and the null hypothesis is overwhelmingly rejected.

Put another way, under the null hypothesis it is extremely unlikely that you would draw 10 recipes with the ingredient. So reject the null hypothesis.
Hi JakeD

Thank you very much, that really has helped a lot.
I am just wondering: does that mean that the p-value never changes?

For instance, if I had another set of 5 recipes from the 100 that were linked by another ingredient, how would I calculate whether this was statistically significant?

Thanks again for everybody's help

7. Originally Posted by syrivaci
Hi JakeD

Thank you very much, that really has helped a lot.
I am just wondering: does that mean that the p-value never changes?

For instance, if I had another set of 5 recipes from the 100 that were linked by another ingredient, how would I calculate whether this was statistically significant?

Thanks again for everybody's help

The p-value changes. The cutoff for the p-value does not change and is generally set at .05. This cutoff is called the significance level. A result is statistically significant when the p-value of that result is less than the cutoff value (significance level).

The p-value is the probability under the null hypothesis of getting an observation at least as extreme as the one observed. For example, you drew 10 recipes and got 10 with the one ingredient. The probability of getting 10 or more recipes with the one ingredient is the p-value. (Of course, the "or more" is not possible here since 10 is the max when drawing 10 recipes.)

What was your sample size and null hypothesis when you drew the 5 recipes? In your first post, you indicated you picked 10 out of the 100 at random so your sample size was 10. And your null hypothesis appeared to be there were 10 with the ingredient.

What was your sample size here? Was it still 10, perhaps the same 10? And was your null hypothesis that there were 10 with the other ingredient too? If so, the p-value is the probability of getting 5 or more recipes with the other ingredient out of the sample of 10 using the same hypergeometric density f. So that would be f(5)+f(6)+...+f(10).

Please verify something for me. You are drawing a random sample from the 100 recipes. You have a null hypothesis before you look at the results of the random sample. Everything I said is based on this.
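
To make the "5 or more" sum concrete, here is a small Python sketch of that upper-tail p-value. It assumes the same setup as before (100 recipes, 10 with the ingredient under the null, sample of 10); the helper names are mine and only for illustration.

```python
from math import comb

def hypergeom_pmf(x, population, successes, draws):
    """Hypergeometric density f(x) for sampling without replacement."""
    return (comb(successes, x)
            * comb(population - successes, draws - x)
            / comb(population, draws))

def upper_tail_p_value(observed, population, successes, draws):
    """P(X >= observed) under the null: f(observed) + ... + f(max possible)."""
    largest = min(successes, draws)  # X cannot exceed either of these
    return sum(hypergeom_pmf(x, population, successes, draws)
               for x in range(observed, largest + 1))

p_five_or_more = upper_tail_p_value(5, 100, 10, 10)   # f(5) + f(6) + ... + f(10)
p_all_ten = upper_tail_p_value(10, 100, 10, 10)       # just f(10)

print(f"P(X >= 5)  = {p_five_or_more:.3e}")
print(f"P(X >= 10) = {p_all_ten:.3e}")
```

Each p-value is then compared with the same .05 significance level, which is the sense in which the p-value changes but the cutoff does not.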

8. ## Thanks

Originally Posted by JakeD

The p-value changes. The cutoff for the p-value does not change and is generally set at .05. This cutoff is called the significance level. A result is statistically significant when the p-value of that result is less than the cutoff value (significance level).

The p-value is the probability under the null hypothesis of getting an observation as extreme as the one observed. For example, you drew 10 recipes and got 10 with the one ingredient. The probability of getting 10 or more recipes with the one ingredient is the p-value. (Of course, the "or more" is not possible here since 10 is the max when drawing 10 recipes.)

What was your sample size and null hypothesis when you drew the 5 recipes? In your first post, you indicated you picked 10 out of the 100 at random so your sample size was 10. And your null hypothesis appeared to be there were 10 with the ingredient.

What was your sample size here? Was it still 10, perhaps the same 10? And was your null hypothesis that there were 10 with the other ingredient too? If so, the p-value is the probability of getting 5 or more recipes with the other ingredient out of the sample of 10 using the same hypergeometric density f. So that would be f(5)+f(6)+...+f(10).

Please verify something for me. You are drawing a random sample from the 100 recipes. You have a null hypothesis before you look at the results of the random sample. Everything I said is based on this.
Hi JakeD
Thanks for the info.

My overall plan is to calculate the significance of each ingredient depending on how many times it occurs in the recipes (some ingredients will occur in 10 dishes whereas others will occur in 5 dishes). I want to show whether there is a reason why some could be more significant than others (e.g. comparing a set of Italian dishes and a set of British dishes, etc.).

I hope this helps. Thanks again.

9. Originally Posted by syrivaci
Hi JakeD
Thanks for the info.

My overall plan is to calculate the significance of each ingredient dependent on how many times they occur in each of the recipes (some ingredients will occur in 10 dishes whereas others will occur in 5 dishes). I want to show whether there is a reason why some could be more significant than others (e.g. comparing a set of Italian Dishes and a set of British dishes e.t.c).

I hope this helps. Thanks again.
Then I think what you are calculating is inappropriate for what you want
to do (still not entirely clear though). It seems that there is no test.

From what you write it seems to me that an appropriate measure of
the importance of an ingredient to a collection of recipes is something
like the fraction of recipes in the collection that the ingredient appears in.
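
As a rough illustration of that fraction, here is a short Python sketch. The three recipes and the ingredient names are made-up placeholders, and `ingredient_fractions` is just a name chosen for this example.

```python
from collections import Counter

def ingredient_fractions(recipes):
    """Fraction of recipes in the collection that each ingredient appears in."""
    # set(recipe) makes sure an ingredient is counted once per recipe.
    counts = Counter(ing for recipe in recipes for ing in set(recipe))
    total = len(recipes)
    return {ing: count / total for ing, count in counts.items()}

# Hypothetical toy collection, one list of ingredients per recipe.
recipes = [
    ["Ingredient1", "Ingredient2", "Ingredient3", "Ingredient4"],
    ["Ingredient5", "Ingredient1", "Ingredient6"],
    ["Ingredient87", "Ingredient212"],
]

fractions = ingredient_fractions(recipes)
for ing, frac in sorted(fractions.items(), key=lambda item: -item[1]):
    print(f"{ing}: {frac:.2f}")
```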

RonL

10. Originally Posted by syrivaci
Hi JakeD
Thanks for the info.

My overall plan is to calculate the significance of each ingredient dependent on how many times they occur in each of the recipes (some ingredients will occur in 10 dishes whereas others will occur in 5 dishes). I want to show whether there is a reason why some could be more significant than others (e.g. comparing a set of Italian Dishes and a set of British dishes e.t.c).

I hope this helps. Thanks again.
This does help. Now I share Captain Black's concerns about whether your method is an appropriate way to analyze the recipe data. Your method is not standard, but there may be standard methods such as the chi-square goodness of fit test that could be used. I'm thinking a multinomial logit analysis is a possibility.

So I suggest turning your questions in the direction of what is a good method for analyzing this data. That requires describing your project in more detail. Here are some things a consultant would need to know to be able to recommend a methodology.

• What is the purpose of the project? Are you writing an article for publication? Is this a term project for a class?
• What are the time constraints? Is it due next week?
• What is your statistical background? Are you willing and do you have time to research methods that may be unfamiliar to you?
• What do the data look like? Give examples.
• What sorts of conclusions do you want to be able to draw from the data?

I think it would be good for a qualified statistician (which I'm not) to look at your requirements and recommend a strategy. You can start here. If you wrote a short description of your project along the above lines, we can review it. Then you could go back to the help sites where you've been posting these questions and try asking for recommendations.

Of course, you may have time for none of this. It goes back to those questions I listed. What is your purpose?

11. ## Details

Originally Posted by JakeD
This does help. Now I share Captain Black's concerns about whether your method is an appropriate way to analyze the recipe data. Your method is not standard, but there may be standard methods such as the chi-square goodness of fit test that could be used. I'm thinking a multinomial logit analysis is a possibility.

So I suggest turning your questions in the direction of what is a good method for analyzing this data. That requires describing your project in more detail. Here are some things a consultant would need to know to be able to recommend a methodology.

• What is the purpose of the project? Are you writing an article for publication? Is this a term project for a class?
• What are the time constraints? Is it due next week?
• What is your statistical background? Are you willing and do you have time to research methods that may be unfamiliar to you?
• What do the data look like? Give examples.
• What sorts of conclusions do you want to be able to draw from the data?

I think it would be good for a qualified statistician (which I'm not) to look at your requirements and recommend a strategy. You can start here. If you wrote a short description of your project along the above lines, we can review it. Then you could go back to the help sites where you've been posting these questions and try asking for recommendations.

Of course, you may have time for none of this. It goes back to those questions I listed. What is your purpose?
Purpose
I doubt it will be published (I wish!) but this is a class project.
I am looking into whether there is a reason why particular ingredients are found more often in recipes from particular data sets (each data set representing a particular country). I am pretty sure there is a link but I want to show that some ingredients are more statistically significant than others in a particular data set.

e.g.
Data Set Country ITALY
First Recipe: Ingredient1 Ingredient2 Ingredient3 Ingredient4
Second Recipe: Ingredient5 Ingredient1 Ingredient6
.
.
.
Hundredth Recipe: Ingredient87 Ingredient212

etc.

12. Originally Posted by syrivaci
Purpose
I doubt it will be published (I wish!) but this is a class project.
I am looking into whether there is a reason why particular ingredients are found more often in recipes from particular data sets (each data set representing a particular country). I am pretty sure there is a link but I want to show that some ingredients are more statistically significant than others in a particular data set.

e.g.
Data Set Country ITALY
First Recipe: Ingredient1 Ingredient2 Ingredient3 Ingredient4
Second Recipe: Ingredient5 Ingredient1 Ingredient6
.
.
.
Hundredth Recipe: Ingredient87 Ingredient212

etc.
Nice project!

The simplest way to compare the use of an ingredient between countries is to calculate the difference in the proportions $\displaystyle p_i$ of the recipes that use the ingredient in country $\displaystyle i$. (Captain Black suggested this before.)

To measure the statistical significance of a difference use the statistic

$\displaystyle Z = \frac{p_1 - p_2}{\sqrt{p(1-p)(1/n_1+1/n_2)}}$

where $\displaystyle n_i$ is the total number of recipes in country $\displaystyle i$ and

$\displaystyle p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}$

is the combined proportion. Treat $\displaystyle Z$ as a standard normal variable. So reject the hypothesis that the two proportions are equal when $\displaystyle |Z| > 1.96$ for a 5% significance level.
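
Here is a minimal Python sketch of that statistic, in case it helps to see the arithmetic. The counts (an ingredient in 40 of 100 Italian recipes and 15 of 100 British recipes) are invented purely for illustration, and the function name is my own. The two-sided p-value uses the standard normal tail via `math.erfc`.

```python
from math import sqrt, erfc

def two_proportion_z(x1, n1, x2, n2):
    """Z statistic and two-sided p-value for H0: the two proportions are equal.

    x1, x2: number of recipes using the ingredient in each country.
    n1, n2: total number of recipes in each country.
    Note: undefined when the combined proportion is 0 or 1 (zero standard error).
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)          # combined proportion
    z = (p1 - p2) / sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    p_value = erfc(abs(z) / sqrt(2))             # P(|Z| >= |z|) for standard normal Z
    return z, p_value

# Hypothetical counts, not real data.
z_stat, p_value = two_proportion_z(40, 100, 15, 100)
print(f"Z = {z_stat:.3f}, two-sided p-value = {p_value:.2e}")
print("reject equal proportions at 5%" if abs(z_stat) > 1.96 else "do not reject at 5%")
```

The normal approximation is usually considered reasonable only when both samples have a fair number of recipes with and without the ingredient.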

Now I don't see how the hypergeometric distribution would apply to this project.

Although I'd like to go into an extensive discussion of all the ways this data could be analyzed, I'll try to restrain myself.

13. Originally Posted by syrivaci
Purpose
I doubt it will be published (I wish!) but this is a class project.
I am looking into whether there is a reason why particular ingredients are found more often in recipes from particular data sets (each data set representing a particular country). I am pretty sure there is a link but I want to show that some ingredients are more statistically significant than others in a particular data set.

e.g.
Data Set Country ITALY
First Recipe: Ingredient1 Ingredient2 Ingredient3 Ingredient4
Second Recipe: Ingredient5 Ingredient1 Ingredient6
.
.
.
Hundredth Recipe: Ingredient87 Ingredient212

etc.
Sketch of what might be an appropriate test:

You have a data set with N recipes, using a total of n ingredients.

You have an empirical (observed) distribution of the number of ingredients
in a recipe for this data set.

Null hypothesis: The data set is a random sample of N recipes from a population
with an appropriate distribution of number of ingredients per recipe, and
with each of the n ingredients appearing with equal frequency*.

Test: Is the frequency of appearance of the most common ingredient/s in
the data set significantly higher than would be expected under the null
hypothesis?

Such a test might well require the extravagant use of the bootstrap to establish
critical values for the test statistic.

That is about all I can say at present. To work out the details would start to
incur serious consultancy fees.

RonL

* I intend this to mean that the number of ingredients k is sampled from the
ingredient number distribution then the k ingredients are determined by
sampling without replacement from the list of possible ingredients.
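
One possible way to sketch this test in Python is a Monte Carlo simulation of the null hypothesis. This is only my reading of the outline above, not a worked-out method: recipe sizes are resampled from the observed sizes, ingredients are drawn uniformly without replacement, and the test statistic is the count of the most common ingredient. The function names, the toy data and the number of simulations are all assumptions made for the example.

```python
import random
from collections import Counter

def max_ingredient_count(recipes):
    """Test statistic: number of recipes containing the most common ingredient."""
    counts = Counter(ing for recipe in recipes for ing in set(recipe))
    return max(counts.values())

def monte_carlo_p_value(recipes, n_sims=500, seed=0):
    """Estimate P(statistic >= observed) under the equal-frequency null."""
    rng = random.Random(seed)
    ingredients = sorted({ing for recipe in recipes for ing in recipe})
    sizes = [len(set(recipe)) for recipe in recipes]  # empirical size distribution
    observed = max_ingredient_count(recipes)
    hits = 0
    for _ in range(n_sims):
        # For each of the N recipes: draw k from the observed sizes, then
        # draw k ingredients without replacement, all equally likely.
        simulated = [rng.sample(ingredients, rng.choice(sizes)) for _ in recipes]
        if max_ingredient_count(simulated) >= observed:
            hits += 1
    # Add-one correction so the estimate is never exactly zero.
    return (hits + 1) / (n_sims + 1)

# Hypothetical toy data: "garlic" in every one of 30 recipes, plus two
# other ingredients per recipe picked from a pool of 20 placeholders.
data_rng = random.Random(1)
pool = [f"Ingredient{i}" for i in range(1, 21)]
toy_recipes = [["garlic"] + data_rng.sample(pool, 2) for _ in range(30)]

p_value = monte_carlo_p_value(toy_recipes)
print(f"estimated p-value: {p_value:.4f}")
```

With real data the number of simulations would need to be much larger to resolve small p-values, and a different statistic may suit the question better.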