# Thread: sample selection based on unequal probability

1. ## sample selection based on unequal probability

Anyone know of a routine that can be used in Excel to randomly select n elements based on the value of a parameter for the elements? So a higher parameter value would create a higher probability that element would be selected than another element that had a lower value.

2. Originally Posted by Translate
Anyone know of a routine that can be used in Excel to randomly select n elements based on the value of a parameter for the elements? So a higher parameter value would create a higher probability that element would be selected than another element that had a lower value.
Here is a way that I think will work.

Let's say the item identifiers are in cells A1:Am and their associated weights are in cells B1:Bm.

1. Put the following formula in cell C1: =B1*rand()

2. Copy cell C1 and paste into cells C2:Cm.

3. Copy cells C1:Cm and Paste Special-Values into C1:Cm, over the same cells. This fixes the random numbers in these cells so they now will not change.

4. Sort the data on column C and select the top n items.

3. ## Works

Not sure of the theoretical underpinnings, but empirically works great. Thanks.

4. I have continued to mull over this problem, and I have decided the solution I gave was wrong. What is more, I don't have a correct solution at this time.

Although the OP did not state the requirement explicitly, I assume the goal of the algorithm is to select a subset of size n, with the probability of selection of each subset being proportional to the sum of the weights of the items in the subset. The algorithm I gave does not do that.

In particular, suppose there are only 2 elements and we want to select a subset of size 1. If the weights are a and b for items 1 and 2, respectively, then we want item 1 (i.e., the subset consisting of item 1) to be selected with probability a/(a+b) and item 2 to be selected with probability b/(a+b). Instead, it can be shown that the method I proposed will select item 1 with probability a/(2b) if a < b and probability (2a-b)/(2a) if a > b. This isn't what we want, unless a=b, which is the unweighted case.

I will continue to think about this problem, but so far I haven't come up with a workable algorithm.

5. Uh oh. But your initial procedure was so elegant! I would appreciate any other suggestions you or others might devise.

As before, let's suppose the item identifiers are in cells A1:Am, and their associated weights are in cells B1:Bm.

1. Put the following formula into cell C1: = -LN(RAND())/B1

2. Copy cell C1 and paste into cells C2:Cm.

3. Copy cells C1:Cm and Paste Special-Values over the original cells C1:Cm. This fixes the values of those cells so they will not change during the next step.

4. Sort the data in cells A1:Cm in ascending order based on column C.
Your sample is in the first n rows (which have the lowest values in column C).

This process is based on two properties of the Exponential distribution.
First, if X has an Exponential( $\lambda$) distribution and Y has an Exponential( $\mu$) distribution, then
$\Pr(X < Y) = \frac{\lambda}{\lambda + \mu}$.
Second, you can simulate a draw from an Exponential( $\lambda$) distribution on a computer by computing
$X = -\frac{1}{\lambda} \ln U$
where U is a (pseudo-) random number drawn from a Uniform(0,1) distribution.