# Thread: Chi Square test, or some other way?

1. ## Chi Square test, or some other way?

hi, here's the problem my girlfriend is having with some genetics she's working on.

As you can see on the spreadsheet attached, a sample tests positive or negative for whatever it is she's testing. She also finds the concentration of some other thing in that given sample.

She wants to know if there is some kind of relationship between negativeness (or positiveness) and the lowness (or highness) of the concentration.

She's talking about P values and Chi square tests and stuff, but we are not exactly sure how to (or whether we should) apply them to this problem. What would you suggest?

Thanks very much for having a look

2. Originally Posted by bruxism
hi, here's the problem my girlfriend is having with some genetics she's working on.

As you can see on the spreadsheet attached, a sample tests positive or negative for whatever it is she's testing. She also finds the concentration of some other thing in that given sample.

She wants to know if there is some kind of relationship between negativeness (or positiveness) and the lowness (or highness) of the concentration.

She's talking about P values and Chi square tests and stuff, but we are not exactly sure how to (or whether we should) apply them to this problem. What would you suggest?

Thanks very much for having a look
Hi. A Chi square test could and should be done. However, as a rule of thumb the test needs at least 5 observations in each cell, so the concentrations have to be grouped into a few ranges, and that grouping may obscure the relationship.

Here's how I grouped the data for the Chi square test to keep at least 5 observations in each cell:
Code:
RNA         Positive      Negative
             #    Pct      #    Pct
<3           7    .58      5    .42
>3 <9        8    .27     22    .73
>9           6    .43      8    .57
Total       21   .375     35   .625
In this grouping, the low and high RNA concentrations show a larger share of positive tests (.58 and .43) than the middle range (.27). Whether this is statistically significant should be tested with a Chi square test.

To do this test you calculate an expected frequency for each cell using the total percentages. Example: for the cell "Positive <3" the expected is $.375(7+5) = 4.5.$ Then the Chi square statistic is the sum over the cells of

$\frac{(\text{actual frequency} - \text{expected frequency})^2}{\text{expected frequency}}.$

For cell "Positive <3" this is $(7 - 4.5)^2 / 4.5 = 1.39.$ The sum over all the cells gives the Chi square statistic of 3.9. It has degrees of freedom 6 - 1 - 3 = (3 - 1)(2 - 1) = 2 because 3 parameters (2 for the rows and 1 for the columns) are being estimated. The P-value for 3.9 with 2 DF is .14, which is not significant. (EDIT: I corrected the DF from 4 to 2 and the P-value from .42 to .14 per CaptainBlack's post below.)
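The arithmetic above can be reproduced with a short script. This is only a sketch using the grouped counts from the table; the function name `chi_square_statistic` is made up for illustration, and the P-value line relies on the fact that for exactly 2 degrees of freedom the upper-tail chi-squared probability is exp(-x/2).

```python
import math

# Observed counts from the grouped table: (positive, negative) per RNA group.
observed = {
    "<3": (7, 5),
    ">3 <9": (8, 22),
    ">9": (6, 8),
}

def chi_square_statistic(table):
    """Return (statistic, degrees of freedom) for a rows x columns count table."""
    rows = list(table.values())
    n_cols = len(rows[0])
    col_totals = [sum(row[j] for row in rows) for j in range(n_cols)]
    grand_total = sum(col_totals)
    statistic = 0.0
    for row in rows:
        row_total = sum(row)
        for j, actual in enumerate(row):
            # Expected frequency under independence, e.g. 12 * 21 / 56 = 4.5.
            expected = row_total * col_totals[j] / grand_total
            statistic += (actual - expected) ** 2 / expected
    df = (len(rows) - 1) * (n_cols - 1)
    return statistic, df

statistic, df = chi_square_statistic(observed)
# Shortcut valid only for df == 2: upper-tail probability is exp(-x / 2).
p_value = math.exp(-statistic / 2)
print(f"chi-square = {statistic:.2f}, df = {df}, P = {p_value:.2f}")
```

Run as is, this should print a statistic of about 3.90 with 2 DF and a P-value of about .14, matching the hand calculation.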

But I think the relationship is stronger at the high end than the grouped data show. You can see this if you sort the data by RNA concentration. (I haven't shown this here; it is better in color in the spreadsheet.)

To show this relationship, I suggest a logit or probit analysis. With these, you test whether there is a linear or U-shaped relationship between the probability of a positive test and RNA concentration. For either of these analyses, you don't have the grouping restrictions of the Chi square test. But you need software such as SAS or SPSS for this.
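To make the "linear or U-shaped" idea concrete, here is a minimal sketch of the logit model's shape. The coefficients `B0`, `B1`, `B2` are invented purely for illustration and are not fitted to the spreadsheet; a real analysis would estimate them from the data with statistical software. Setting `b2` to zero gives the linear version.

```python
import math

def positive_probability(rna, b0, b1, b2=0.0):
    """Logit model: P(positive) = 1 / (1 + exp(-(b0 + b1*rna + b2*rna**2)))."""
    linear_predictor = b0 + b1 * rna + b2 * rna ** 2
    return 1.0 / (1.0 + math.exp(-linear_predictor))

# Made-up coefficients, NOT estimated from the spreadsheet.
# A positive b2 bends the curve into a U with its low point at -b1 / (2 * b2) = 6.
B0, B1, B2 = 1.0, -0.6, 0.05

for rna in (0, 3, 6, 9, 12):
    p = positive_probability(rna, B0, B1, B2)
    print(f"RNA {rna:>2}: P(positive) = {p:.2f}")
```

With a U-shaped model like this, the test of interest is whether the fitted `b2` differs significantly from zero.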

3. Originally Posted by JakeD
For cell "Positive <3" this is $(7 - 4.5)^2 / 4.5 = 1.39.$ The sum over all the cells gives the Chi square statistic of 3.9. It has degrees of freedom 6 - 2 = 4. The P-value of 3.9 with 4 DF is .42, not at all significant.
Now your calculation of the number of degrees of freedom for this left me
uneasy, but I don't do cross tabular analysis every day so I look this up when
I need it. It appears that DF=(rows-1)*(columns -1) = 2*1 = 2.

A Chi-Square of 3.9 is still not significant with this number of degrees of
freedom.

RonL

4. thanks for the replies. That's made things a little more clear for us.

a few questions though

1. In regards to performing a chi square test on something like this, it seems the groups are being made somewhat at random. Choosing <3, >3 <9, >9 seems OK, but you could have chosen anything, couldn't you? How do you decide? It seems you could skew the stats in this way to make it seem like something is occurring.

2. You said "The P-value of 3.9 with 4 DF is .42, not at all significant". How do you calculate whether it is significant or not?

3. The probit and logit analysis sounds like a good idea. Is there a way to perform these without buying additional software...or do you just have to bite the bullet and spend?

5. Originally Posted by CaptainBlack
Now your calculation of the number of degrees of freedom for this left me
uneasy, but I don't do cross tabular analysis every day so I look this up when
I need it. It appears that DF=(rows-1)*(columns -1) = 2*1 = 2.

A Chi-Square of 3.9 is still not significant with this number of degrees of
freedom.

RonL
I don't do these every day either and I should have looked it up too. I corrected the post. Thank you.

6. Originally Posted by bruxism
2. You said "The P-value of 3.9 with 4 DF is .42, not at all significant". How do you calculate whether it is significant or not?
Either a cumulative chi-squared distribution calculator, or a set of
tables of critical values for the chi-squared distribution.

Below is the help text and example calculation from the system that
I use most frequently, when I'm not using the book of tables next to
my desk.

Code:
>
>help chidis
chidis is a builtin function.

normaldis(x) : returns the probability that a normally
distributed (mean 0, st.dev. 1) is less than x.
invnormaldis(p) : is the inverse.
chidis(x,n) : chi-distribution with n degrees of freedom.
tdis(x,n) : Student's t-distribution with n degrees of freedom.
invtdis(p,n) : the inverse.
fdis(x,n,m) : f-distribution with n and m degrees of freedom.
>
>chidis(3.9,2)
0.857726
>chidis(3.9,4)
0.580291
>
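Judging by the numbers in the session above, `chidis` returns the cumulative probability (the area below 3.9), so the P-value is one minus the value shown: 1 - 0.857726 = .14 for 2 DF and 1 - 0.580291 = .42 for 4 DF. A rough stand-alone check, assuming only the standard closed form for the chi-squared distribution with an even number of degrees of freedom (the function name is made up for this sketch):

```python
import math

def chi_square_cdf_even_df(x, df):
    """Cumulative chi-squared probability; closed form valid only for even df."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("this sketch only handles positive even df")
    half = x / 2
    # Upper tail for df = 2k is exp(-x/2) * sum_{i<k} (x/2)^i / i!
    upper_tail = math.exp(-half) * sum(
        half ** i / math.factorial(i) for i in range(df // 2)
    )
    return 1 - upper_tail

for df in (2, 4):
    cdf = chi_square_cdf_even_df(3.9, df)
    print(f"df={df}: cumulative={cdf:.6f}, P-value={1 - cdf:.2f}")
```

For odd degrees of freedom this shortcut does not apply, and tables or a statistics package are the practical route.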

7. Originally Posted by bruxism
thanks for the replies. That's made things a little more clear for us.

a few questions though

1. In regards to performing a chi square test on something like this, it seems the groups are being made somewhat at random. Choosing <3, >3 <9, >9 seems OK, but you could have chosen anything, couldn't you? How do you decide? It seems you could skew the stats in this way to make it seem like something is occurring.
I chose integer cutoffs at the high and low end to keep at least 5 observations in each cell. Looking back at the data, I see I didn't look at it too closely. I could have chosen a cutoff of 12 instead of 9 to get exactly 5 observations in both high categories. Doing that would have increased the significance, so it is true you may be able to cook the results somewhat by carefully selecting the categories. However, the Chi square test requires putting the data into categories; you have to do it somehow. Just don't be too cute when selecting the categories.

2. You said "The P-value of 3.9 with 4 DF is .42, not at all significant". How do you calculate whether it is significant or not?
Statistical convention says the P-value should be .05 or less to be statistically significant. I calculated the P-value in the spreadsheet using the function CHIDIST(3.9;2). (Note the DF is actually 2 per CaptainBlack and the P-value is .14.)
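The same decision can be phrased as a critical-value lookup, which is what printed tables give you. A small sketch, assuming the .05 convention and 2 DF (where the upper-tail probability is exp(-x/2), so the critical value is -2 ln(.05)):

```python
import math

ALPHA = 0.05      # conventional significance level
statistic = 3.9   # Chi square statistic from the grouped table

# With 2 DF, P = exp(-x / 2); solving P = ALPHA for x gives the critical value.
critical_value = -2 * math.log(ALPHA)
p_value = math.exp(-statistic / 2)
significant = p_value <= ALPHA

print(f"critical value at {ALPHA}: {critical_value:.2f}")
print(f"statistic {statistic} -> P = {p_value:.2f}, significant: {significant}")
```

The statistic would have to exceed roughly 5.99 to be significant at the .05 level with 2 DF; 3.9 falls short of that.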

3. The probit and logit analysis sounds like a good idea. Is there a way to perform these without buying additional software...or do you just have to bite the bullet and spend?
I googled "logit analysis free software". The first hit was software called EasyReg. I've never used it, but the price is right!