
Math Help - Calculating Statistical Significance?

  1. #1
    Newbie
    Joined
    Jun 2006
    Posts
    9

    Calculating Statistical Significance?

    Does anyone know the best significance test for a dataset following a hypergeometric distribution with a sample size over 30?

    Suppose you have 100 recipe sheets, and 10 of these sheets are linked because they all contain the same ingredient. The chance of picking a recipe sheet on its own with that specific ingredient is 0.408. But the probability of all 10 sheets picked having the same ingredient is 5.78*10^-14.

    I am just trying to figure out whether 5.78*10^-14 is statistically significant.

    Thanks for any kind of help.
    Follow Math Help Forum on Facebook and Google+

  2. #2
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    I'm having a bit of trouble understanding exactly what question is being
    asked here. Is there a test going on? If so, what is being tested?

    Also, I'm a bit puzzled by "The chance of picking a recipe sheet on its own
    with that specific ingredient is 0.408". If 10 sheets of 100 contain this
    ingredient, why is the probability not 0.1?

    I expect I've missed something obvious here, but I don't understand the
    question.

    RonL

  3. #3
    Newbie
    Joined
    Jun 2006
    Posts
    9

    Thanks

    Hi, Captain Black

    Sorry for the confusion.

    From my understanding:
    The number of ways to choose 10 objects out of 100 is C(100,10) = 17310309456440

    1/17310309456440 = 5.78*10^-14

    Plus:
    C(90,9)*C(10,1)/C(100,10) = 0.40799532232929

    I just want to work out if 1/17310309456440 = 5.78*10^-14 is statistically significant. I am pretty rusty on p-values and which test to use, so please go easy on me.

    Thanks again.

  4. #4
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    With a hypergeometric density f with parameters 100, 10 and 10:

    f(1) = \frac{{10 \choose 1}{90 \choose 9}}{{100 \choose 10}} = .40799.
    f(10) = \frac{{10 \choose 10}{90 \choose 0}}{{100 \choose 10}} = 5.78 \times 10^{-14}.

    From your later post I gather that you're not sure how to phrase the test. So I'll suggest it is this.

    The null hypothesis is that you are drawing recipes from a population of 100 recipes of which 10 have a certain ingredient. The alternative hypothesis is that more than 10 recipes have the ingredient.

    You are drawing a sample of 10 (not 30) recipes without replacement from the population. So x, the number of recipes in the sample with the ingredient, has a hypergeometric distribution with parameters 100, 10 and 10 under the null hypothesis. A best test would reject the null hypothesis if the p-value for x under the null is less than .05. Since the p-value for x = 10 is 5.78 \times 10^{-14} \ll .05, the result is extremely statistically significant and the null hypothesis is overwhelmingly rejected.

    Put another way, under the null hypothesis it is extremely unlikely that you would draw 10 recipes with the ingredient. So reject the null hypothesis.
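
    For anyone who wants to check the two probabilities above, here is a minimal Python sketch. The function name hypergeom_pmf and its argument names are illustrative choices, not something from this thread, and math.comb assumes Python 3.8 or later.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """Probability of exactly k marked items when drawing `draws` items
    without replacement from `population` items, `marked` of which are marked."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


# 100 recipes, 10 contain the ingredient, 10 recipes are drawn.
f1 = hypergeom_pmf(1, 100, 10, 10)    # exactly one drawn recipe has the ingredient
f10 = hypergeom_pmf(10, 100, 10, 10)  # all ten drawn recipes have the ingredient

print(f"f(1)  = {f1:.8f}")
print(f"f(10) = {f10:.3e}")
```

    The value f(10) is simply 1 / C(100,10), since there is only one way to draw all 10 marked recipes.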
    Last edited by JakeD; June 28th 2006 at 06:04 AM.

  5. #5
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    Ah, so the 0.408 is the probability that exactly one sheet in a sample of 10
    uses the ingredient. That's better.

    RonL

    My comprehension seems to be suffering in addition to my accuracy
    at the moment.
    Last edited by CaptainBlack; June 28th 2006 at 09:32 AM.

  6. #6
    Newbie
    Joined
    Jun 2006
    Posts
    9

    One more thing...

    Hi JakeD

    Thank you very much, that really has helped a lot.
    I am just wondering, does that mean that the p-value never changes?

    For instance, if I had another set of 5 recipes from the 100 that were linked by another ingredient, how would I calculate whether this was statistically significant?

    Thanks again for everybody's help

  7. #7
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    Hi, syrivaci. Glad to help.

    The p-value changes. The cutoff for the p-value does not change and is generally set at .05. This cutoff is called the significance level. A result is statistically significant when the p-value of that result is less than the cutoff value (significance level).

    The p-value is the probability, under the null hypothesis, of getting an observation at least as extreme as the one observed. For example, you drew 10 recipes and got 10 with the one ingredient. The probability of getting 10 or more recipes with the one ingredient is the p-value. (Of course, the "or more" is not possible here, since 10 is the maximum when drawing 10 recipes.)

    What was your sample size and null hypothesis when you drew the 5 recipes? In your first post, you indicated you picked 10 out of the 100 at random so your sample size was 10. And your null hypothesis appeared to be there were 10 with the ingredient.

    What was your sample size here? Was it still 10, perhaps the same 10? And was your null hypothesis that there were 10 with the other ingredient too? If so, the p-value is the probability of getting 5 or more recipes with the other ingredient out of the sample of 10 using the same hypergeometric density f. So that would be f(5)+f(6)+...+f(10).
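
    The tail sum f(5)+f(6)+...+f(10) can be sketched in Python as follows, assuming the same setup as before (100 recipes, 10 with the ingredient, a sample of 10). The names hypergeom_pmf and upper_tail_p_value are hypothetical helpers made up for this illustration.

```python
from math import comb


def hypergeom_pmf(k, population, marked, draws):
    """P(exactly k marked items in `draws` draws without replacement)."""
    return comb(marked, k) * comb(population - marked, draws - k) / comb(population, draws)


def upper_tail_p_value(observed, population, marked, draws):
    """One-sided p-value: P(X >= observed) under the hypergeometric null."""
    largest = min(marked, draws)  # X cannot exceed either of these
    return sum(hypergeom_pmf(k, population, marked, draws)
               for k in range(observed, largest + 1))


p5 = upper_tail_p_value(5, 100, 10, 10)    # f(5) + f(6) + ... + f(10)
p10 = upper_tail_p_value(10, 100, 10, 10)  # just f(10), since 10 is the maximum

print(f"P(X >= 5)  = {p5:.3e}")
print(f"P(X >= 10) = {p10:.3e}")
print("significant at the .05 level:", p5 < 0.05, p10 < 0.05)
```

    The cutoff .05 stays fixed, while the p-value depends on the observed count, which is the distinction made above.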

    Please verify something for me. You are drawing a random sample from the 100 recipes. You have a null hypothesis before you look at the results of the random sample. Everything I said is based on this.
    Last edited by JakeD; June 28th 2006 at 11:07 AM.

  8. #8
    Newbie
    Joined
    Jun 2006
    Posts
    9

    Thanks

    Hi JakeD
    Thanks for the info.

    My overall plan is to calculate the significance of each ingredient depending on how many times it occurs across the recipes (some ingredients will occur in 10 dishes whereas others will occur in 5 dishes). I want to show whether there is a reason why some could be more significant than others (e.g. comparing a set of Italian dishes and a set of British dishes, etc.).

    I hope this helps. Thanks again.

  9. #9
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    Then I think what you are calculating is inappropriate for what you want
    to do (still not entirely clear though). It seems that there is no test.

    From what you write it seems to me that an appropriate measure of
    the importance of an ingredient to a collection of recipes is something
    like the fraction of recipes in the collection that the ingredient appears in.

    RonL

  10. #10
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    This does help. Now I share Captain Black's concerns about whether your method is an appropriate way to analyze the recipe data. Your method is not standard, but there may be standard methods, such as the chi-square goodness-of-fit test, that could be used. I'm thinking a multinomial logit analysis is a possibility.

    So I suggest turning your questions in the direction of what is a good method for analyzing this data. That requires describing your project in more detail. Here are some things a consultant would need to know to be able to recommend a methodology.

    • What is the purpose of the project? Are you writing an article for publication? Is this a term project for a class?
    • What are the time constraints? Is it due next week?
    • What is your statistical background? Are you willing and do you have time to research methods that may be unfamiliar to you?
    • What do the data look like? Give examples.
    • What sorts of conclusions do you want to be able to draw from the data?


    I think it would be good for a qualified statistician (which I'm not) to look at your requirements and recommend a strategy. You can start here. If you wrote a short description of your project along the above lines, we can review it. Then you could go back to the help sites where you've been posting these questions and try asking for recommendations.

    Of course, you may have time for none of this. It goes back to those questions I listed. What is your purpose?

  11. #11
    Newbie
    Joined
    Jun 2006
    Posts
    9

    Details

    Purpose
    I doubt it will be published (I wish!) but this is a class project.
    I am looking into whether there is a reason why particular ingredients are found more often in recipes from particular data sets (each data set representing a particular country). I am pretty sure there is a link but I want to show that some ingredients are more statistically significant than others in a particular data set.

    e.g.
    Data Set Country ITALY
    First Recipe: Ingredient1 Ingredient2 Ingredient3 Ingredient4
    Second Recipe: Ingredient5 Ingredient1 Ingredient6
    .
    .
    .
    Hundredth Recipe: Ingredient87 Ingredient212

    etc.

  12. #12
    Senior Member
    Joined
    Apr 2006
    Posts
    399
    Awards
    1
    Nice project!

    The simplest way to compare the use of an ingredient between countries is to calculate the difference in the proportions p_i of the recipes that use the ingredient in country i. (Captain Black suggested this before.)

    To measure the statistical significance of a difference use the statistic

    Z = \frac{p_1 - p_2}{\sqrt{p(1-p)(1/n_1+1/n_2)}}

    where n_i is the total number of recipes in country i and

    p = \frac{n_1 p_1 + n_2 p_2}{n_1 + n_2}

    is the combined proportion. Treat Z as a standard normal variable. So reject the hypothesis that the two proportions are equal when |Z| > 1.96 for a 5% significance level.
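
    As a rough illustration of the Z statistic above, here is a short Python sketch. The function name two_proportion_z and the counts in the example (10 of 100 recipes in one country, 5 of 100 in the other) are made up for illustration and are not data from this thread. The two-sided p-value uses the standard normal tail via math.erfc.

```python
from math import erfc, sqrt


def two_proportion_z(x1, n1, x2, n2):
    """Z statistic and two-sided p-value for H0: the two proportions are equal.

    x1, x2 are the numbers of recipes using the ingredient in each country;
    n1, n2 are the total numbers of recipes in each country.
    """
    p1, p2 = x1 / n1, x2 / n2
    p = (n1 * p1 + n2 * p2) / (n1 + n2)  # combined (pooled) proportion
    se = sqrt(p * (1 - p) * (1 / n1 + 1 / n2))
    if se == 0:
        raise ValueError("pooled proportion is 0 or 1; Z is undefined")
    z = (p1 - p2) / se
    p_value = erfc(abs(z) / sqrt(2))  # P(|Z| > |z|) for a standard normal Z
    return z, p_value


# Hypothetical counts, for illustration only.
z, p_value = two_proportion_z(10, 100, 5, 100)
print(f"Z = {z:.4f}, two-sided p-value = {p_value:.4f}")
print("reject equality at the 5% level:", abs(z) > 1.96)
```

    With these made-up counts |Z| is below 1.96, so the difference between 10% and 5% would not be significant at the 5% level with only 100 recipes per country.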

    Now I don't see how the hypergeometric distribution would apply to this project.

    Although I'd like to go into an extensive discussion of all the ways this data could be analyzed, I'll try to restrain myself.

  13. #13
    Grand Panjandrum
    Joined
    Nov 2005
    From
    someplace
    Posts
    14,972
    Thanks
    4
    Sketch of what might be an appropriate test:

    You have a data set with N recipes, using a total of n ingredients.

    You have an empirical (observed) distribution of the number of ingredients
    in a recipe for this data set.

    Null hypothesis: The data set is a random sample of N recipes from a population
    with an appropriate distribution of number of ingredients per recipe, and
    with each of the n ingredients appearing with equal frequency*.

    Test: Is the frequency of appearance of the most common ingredient(s) in
    the data set significantly higher than would be expected under the null
    hypothesis?

    Such a test might well require the extravagant use of the bootstrap to establish
    critical values for the test statistic.

    That is about all I can say at present. To work out the details would start to
    incur serious consultancy fees.


    RonL

    * I intend this to mean that the number of ingredients k is sampled from the
    ingredient number distribution then the k ingredients are determined by
    sampling without replacement from the list of possible ingredients.
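
    One possible reading of this sketch, written as a plain Monte Carlo simulation of the null rather than a full bootstrap procedure, is shown below in Python. Everything here is an assumption made for illustration: the function names, the toy data (20 recipes that all contain "garlic"), the number of simulations, and the choice of the largest ingredient count as the test statistic.

```python
import random
from collections import Counter


def max_ingredient_count(recipes):
    """Number of recipes containing the most common ingredient."""
    counts = Counter(ing for recipe in recipes for ing in set(recipe))
    return max(counts.values())


def simulate_null(recipe_sizes, n_ingredients, rng):
    """One data set under the null: each recipe's size k is drawn from the
    observed sizes, then k ingredients are drawn without replacement,
    with every ingredient equally likely."""
    return [rng.sample(range(n_ingredients), rng.choice(recipe_sizes))
            for _ in recipe_sizes]


def monte_carlo_p_value(recipes, n_ingredients, n_sims=2000, seed=0):
    """Estimated P(simulated max count >= observed max count) under the null."""
    rng = random.Random(seed)
    sizes = [len(set(recipe)) for recipe in recipes]
    observed = max_ingredient_count(recipes)
    hits = sum(
        max_ingredient_count(simulate_null(sizes, n_ingredients, rng)) >= observed
        for _ in range(n_sims)
    )
    return (hits + 1) / (n_sims + 1)  # add-one correction avoids reporting exactly 0


# Toy data, for illustration only: "garlic" appears in every recipe.
toy_recipes = [["garlic", f"a{i}", f"b{i}"] for i in range(20)]
n_ingredients = len({ing for recipe in toy_recipes for ing in recipe})
p_value = monte_carlo_p_value(toy_recipes, n_ingredients)
print(f"estimated p-value = {p_value:.4f}")
```

    A small estimated p-value here only says that the most common ingredient appears more often than equal-frequency sampling would produce. Whether that null is the right one for a real recipe collection is a separate question.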
    Last edited by CaptainBlack; June 30th 2006 at 08:02 AM.