How many data points are enough to determine a probability distribution?
A fitted probability distribution is presumably more accurate if my dataset contains 200 points than if it contains only 20.
So the answer may be "the more the better". But is there any conventional rule of thumb?
Thanks
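To make the question concrete, here is a rough sketch of what I mean (assuming a Rayleigh population and the Statistics Toolbox functions raylrnd/raylfit; the exact numbers vary from run to run):
>> b20 = raylfit(raylrnd(1,20,1))    % scale estimate from 20 random draws
>> b200 = raylfit(raylrnd(1,200,1))  % scale estimate from 200 random draws
>> % b200 will typically be closer to the true scale, 1, than b20 is.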
Well, yeah, I can use the K-S test. But the K-S test can give different verdicts when the sample size is small versus large. For example:
Example 1:
>> x = 1:1:30;   % generating 30 points for x
>> x = x';
>> y = raylpdf(x,1);   % assuming a Rayleigh distribution for y
>> alam = gamfit(y);   % but fitting y with a gamma distribution
>> [h,p] = kstest(y,[y gamcdf(y,alam(1),alam(2))],0.05)   % K-S test for y
h =
0
p =
0.9343   % pretty good; the test does not reject the gamma fit
------------------------
Example 2:
>> x = 1:.01:30;   % generating 2901 points for x
>> x = x';
>> y = raylpdf(x,1);   % assuming a Rayleigh distribution for y
>> alam = gamfit(y);
>> [h,p] = kstest(y,[y gamcdf(y,alam(1),alam(2))],0.05)
h =
1
p =
9.0404e-013   % now the test rejects the gamma fit
-----------------------
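As a rough sanity check (a textbook asymptotic result, not something computed above): the two-sided K-S critical value at the 0.05 level shrinks like roughly 1.36/sqrt(n), so a small gap between the empirical and fitted CDFs that is easily tolerated at n = 30 gets rejected at n = 2901:
>> 1.36/sqrt(30)     % approx. 0.248: plenty of slack with 30 points
>> 1.36/sqrt(2901)   % approx. 0.025: almost no slack with 2901 points
>> % kstest can also return the test statistic and critical value directly:
>> % [h,p,ksstat,cv] = kstest(y,[y gamcdf(y,alam(1),alam(2))],0.05)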
Of course, this is just an idealized example. With real data observed from experiments, things will be even more complex. It makes me wonder whether a conclusion can be wrong simply because we did not choose the right probability distribution, owing to the limitation of sample size.
So my question is: is there a conventional idea regarding this?
For example, most people consider a difference between two datasets statistically significant if p < 0.05. Why is the criterion not 0.01 or 0.1? Seemingly just because most people follow that rule. I welcome your ideas!
Conventions depend on who you're holding yourself accountable to. When you say that "most people think" p < .05 is significant, your "most people" are part of a particular audience. Sometimes the criterion will be p < .01 or p < .1; I've even read published results where a researcher reported that "[such-and-such] was marginally significant (p < .15)". I've only seen that one a few times, but it's not terribly uncommon for someone to report a result significant at the 0.1 level as "marginally significant" where the convention in the field is to use .05 as the threshold.
In practice, and ideally, someone running statistics on observed data ought to declare, a priori, what they will use as their threshold for significance, taking into account as many of the relevant circumstances surrounding the data as possible: sample size, the number and nature of the questions and tests being put to the data set, and so on. The .05 level seems to be the convention largely because, if the theories from which you derive your questions are reasonable, then even with a modest sample size it is fairly safe to assume that whatever results you obtain (barring major design flaws, data loss, massive violations of the assumptions, etc.) are in fact valid.
I'm probably just rehashing things you've already heard, been advised of, or thought through yourself, but there it is.