Just to make it clear, this is about some very basic stuff. I am trying to learn statistics (using R), and I won't understand concepts until I've understood what they mean "in real life", so to speak.
Ok, this is the scenario (taken from the book I'm using to learn this, H. Baayen's 'Analyzing linguistic data, a practical introduction to statistics using R', 2008, p.45ff.):
We have a corpus of English with 18.6 million words. The most frequent word in this corpus is 'the', with 1.1 million instances. Now, the relative frequency of 'the' is 1.1/18.6 ≈ 0.06 (0.059 before rounding). So this is p, right, meaning that if I randomly selected one word from this corpus, the probability of getting 'the' is about 0.06. So far so good.
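To check my understanding of p, here is a minimal sketch in R using the rounded counts from above. The variable names are my own, and the simulation is only my reading of "randomly selecting one word", not something taken from the book.

```r
# Rounded counts from the scenario (variable names are illustrative)
n_the   <- 1.1e6    # occurrences of 'the'
n_total <- 18.6e6   # corpus size in tokens

# Relative frequency of 'the'
p <- n_the / n_total
round(p, 4)

# Simulate picking one token at random, many times:
# TRUE means the token picked was 'the'.
set.seed(1)
draws <- runif(100000) < p
mean(draws)   # should land close to p, though not exactly on it
```

If p really is the probability of picking 'the', the share of TRUE values should settle near 0.059 as the number of simulated picks grows.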
Now, Baayen says 'let the NUMBER OF TRIALS(n) denote the size of the corpus. Each token in the corpus is regarded as a trial which can result either in a success or in a failure' (p.46). I don't really understand why that is. If we assume independence, then every time we pick a word, the word goes back in, so the number of trials doesn't have to equal the size of the corpus. If we didn't put the word back in, then sure, the number of trials would automatically match the number of words.
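The two ways of picking words that I have in mind can be sketched like this. The toy corpus of 1000 tokens is made up for illustration, and this only shows my reading of "trial"; it doesn't settle what Baayen intends.

```r
# Hypothetical toy corpus: 1000 tokens, 60 of them 'the'
corpus <- c(rep("the", 60), rep("other", 940))

set.seed(42)

# With replacement: each pick is an independent trial, and the
# number of trials n is whatever I choose (here 1000).
n <- 1000
picked <- sample(corpus, size = n, replace = TRUE)
successes <- sum(picked == "the")
successes   # varies from run to run, around n * 0.06 = 60

# Without replacement: going through every token exactly once
# always finds all 60 instances of 'the'.
all_tokens <- sample(corpus, size = length(corpus), replace = FALSE)
sum(all_tokens == "the")
```

With replacement the count of successes is random; without replacement, over the whole corpus, it is fixed at the actual count.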
At any rate, now it's time to look at the probability (of something) given a binomial distribution. So this probability for 'the' is dbinom(x, n, p), where x is the value of the random variable, which here is dbinom(1100000, 18600000, 1.1/18.6), using the unrounded p (with p rounded to 0.06 the result is practically zero). Now, this is simply a function that R has - I don't care how it calculates it. The output is about 0.0004.
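For reference, this is a minimal sketch of the call with the rounded counts from the scenario (the book's exact counts may differ). It also shows how sensitive the result is to rounding p, which is an observation of mine and not from Baayen.

```r
n <- 18600000   # number of trials = corpus size
x <- 1100000    # value of the random variable = count of 'the'

# Unrounded p = 1.1/18.6
prob_exact <- dbinom(x, n, x / n)
prob_exact      # roughly 0.0004

# p rounded to 0.06: the expected count becomes n * 0.06 = 1116000,
# which is far away from x, so the result is practically zero
prob_rounded <- dbinom(x, n, 0.06)
prob_rounded
```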
I don't understand what 0.0004 is the probability _of_, in plain English. Could someone _explain_ with some analogy (or using the word corpus case) what 0.0004 is the probability of?