# Probability in binomial distributions

• May 31st 2009, 06:09 PM
tjodrik
Probability in binomial distributions
Hi,

Just to be clear, this is about some very basic stuff. I am trying to learn statistics (using R), and I don't really understand a concept until I know what it means "in real life", so to speak.

OK, this is the scenario (taken from the book I'm using to learn this, R. H. Baayen's 'Analyzing Linguistic Data: A Practical Introduction to Statistics Using R', 2008, p. 45 ff.):

We have a corpus of English containing 18.6 million words. The most frequent word in this corpus is 'the', with 1.1 million instances. The relative frequency of 'the' is therefore 1.1/18.6 ≈ 0.059, or roughly 0.06. So this is p, right? Meaning that if I randomly selected one word from this corpus, the probability of getting 'the' is about 0.06. So far so good.

Now, Baayen says 'let the NUMBER OF TRIALS (n) denote the size of the corpus. Each token in the corpus is regarded as a trial which can result either in a success or in a failure' (p. 46). I don't really understand why that is. If we assume independence, then every time we pick a word, the word goes back in, so the number of trials won't necessarily equal the size of the corpus. If we didn't put the word back in, then sure, the number of trials would automatically match the number of words.

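At any rate, now it's time to look at the probability (of something) given a binomial distribution. This probability for 'the' is dbinom(x, n, p), where x is the value of the random variable, which here is dbinom(1100000,18600000,0.06). Now, this is simply a function that R has - I don't care how it calculates it. The output is 0.0004 (that figure corresponds to p kept unrounded at 1.1/18.6; with p rounded to exactly 0.06 the result is far smaller).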

I don't understand what 0.0004 is the probability _of_, in plain English. Could someone _explain_ with some analogy (or using the word corpus case) what 0.0004 is the probability of?

Thanks!
• May 31st 2009, 08:22 PM
pickslides
Quote:

Originally Posted by tjodrik
dbinom(1100000,18600000,0.06). Now, this is simply a function that R has - I don't care how it calculates it. The output is 0.0004.

I do not use R, but I think the problem is in the call to dbinom(). I have a feeling that the first parameter of 1,100,000 is incorrect. I would think the first argument should be the number of times you expect the word 'the' to appear.

In other words, given that the word 'the' has a probability of 0.06, the chance that it will occur exactly 1,100,000 times in 18,600,000 words is 0.0004.
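
As a rough check of that statement, here is a minimal R sketch. It assumes the rounded counts from the original post (18,600,000 tokens, 1,100,000 occurrences of 'the'); the variable names are only illustrative. It also compares the unrounded p with the rounded 0.06, because the result is very sensitive to that choice.

```r
# Rounded counts taken from the original post (assumed, not exact corpus counts).
n <- 18600000   # number of tokens in the corpus = number of trials
k <- 1100000    # observed number of occurrences of 'the'

# Unrounded relative frequency of 'the' (about 0.059, not exactly 0.06).
p_exact <- k / n

# Probability of seeing exactly k occurrences of 'the' among n tokens.
prob_exact <- dbinom(k, size = n, prob = p_exact)
print(prob_exact)    # roughly 0.0004, the figure quoted in the thread

# Same call with p rounded to 0.06: the expected count is now n * 0.06 = 1,116,000,
# far from k relative to the spread, so the probability becomes vanishingly small.
prob_rounded <- dbinom(k, size = n, prob = 0.06)
print(prob_rounded)
```

So 0.0004 reads as: if every one of the 18.6 million tokens independently had probability p of being 'the', about 4 in 10,000 such corpora would contain exactly 1,100,000 instances of 'the'.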

This first parameter can vary: change it to find the probability of any number X of occurrences of 'the' that you are interested in.
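
To illustrate varying the first argument, a small R sketch follows. The counts in x are hypothetical values chosen around 1,100,000, and p is assumed to be the unrounded relative frequency from the original post.

```r
n <- 18600000        # corpus size (number of trials), from the original post
p <- 1100000 / n     # unrounded relative frequency of 'the'

# Probability of exactly x occurrences, for a few hypothetical counts x.
x <- c(1098000, 1099000, 1100000, 1101000, 1102000)
probs <- dbinom(x, size = n, prob = p)
print(data.frame(count = x, probability = probs))

# Probability that the count lands anywhere from 1,099,000 to 1,101,000.
# pbinom(q, ...) gives P(X <= q), so the difference covers that whole range.
within_1000 <- pbinom(1101000, size = n, prob = p) -
  pbinom(1098999, size = n, prob = p)
print(within_1000)
```

Any single exact count is unlikely simply because there are so many possible counts. A range of counts, computed with pbinom(), is usually the more informative quantity.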

Quote:

Originally Posted by tjodrik
So this probability for 'the' is dbinom(Random Variable,n,p), which is dbinom(1100000,18600000,0.06). Now, this is simply a function that R has - I don't care how it calculates it. The output is 0.0004.

Here is the main problem: if you looked at the binomial probability formula, this would all make much more sense.
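
For reference, the formula presumably meant here is the binomial probability mass function, which is what dbinom(k, n, p) evaluates. The symbols k, n and p below are labels chosen for this sketch, filled in with the rounded figures from the original post:

$$
P(X = k) = \binom{n}{k}\, p^{k}\, (1 - p)^{n - k},
\qquad n = 18{,}600{,}000,\quad k = 1{,}100{,}000,\quad p = \frac{k}{n} \approx 0.059
$$

Here X is the number of tokens that turn out to be 'the', assuming each of the n tokens is independently 'the' with probability p. The factor \binom{n}{k} counts the ways to choose which k of the n positions hold 'the'.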