Filtering the top 5% in real-time
I am a computing student, and I wanted some advice on how to solve a problem using maths.
I have a program which receives numbers from an external source. Currently the program stores a histogram of these numbers, that is, each distinct value together with the frequency of its occurrence. To clarify, I could receive the following numbers:
1 10 2 3 9 1 9 1 4 5,
and then I would make the table:
Value  Frequency
1      3
2      1
3      1
4      1
5      1
9      2
10     1
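A minimal sketch of the histogram described above, assuming Python and its standard `collections.Counter`. The names `stream` and `histogram` are illustrative only and do not come from the actual program:

```python
from collections import Counter

# The ten example numbers from the question.
stream = [1, 10, 2, 3, 9, 1, 9, 1, 4, 5]

# Counter maps each distinct value to the number of times it has been seen.
histogram = Counter()
for x in stream:
    histogram[x] += 1

# Print the table, one "value frequency" pair per line, sorted by value.
for value, freq in sorted(histogram.items()):
    print(value, freq)
```

The table grows by one entry for every distinct value seen, which is the storage cost the rest of the question is about.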
After collecting 800 numbers from one experiment, I find that ~80% have a frequency of 1, ~95% have a frequency below 5, and the remaining 5% have frequencies between 5 and 35. The CDF of these frequencies fits a Weibull distribution quite well.
Now the problem I'm having is that after collecting 800 numbers I have a table with ~500 entries, whereas I'm only really interested in the top 5% (40 entries). I want to reduce the size of the table needed to store the values. Is there a way to filter numbers as they arrive, so that infrequent values are never inserted into the table? Or is there some way to prune the table as it grows, removing the less frequent values without throwing away one that might later become frequent? I don't mind losing the exact frequencies, as long as I know which values are the most popular.
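To make the pruning idea concrete, here is a hedged sketch of one known bounded-memory technique, the Misra-Gries frequent-items summary. It is not something from the question itself, and the function name `misra_gries` and the parameter `k` (the maximum number of table entries) are assumptions chosen for illustration. After `n` numbers, any value whose true frequency exceeds `n / (k + 1)` is guaranteed to still be in the table, and each stored count underestimates the true frequency by at most `n / (k + 1)`:

```python
def misra_gries(stream, k):
    """Keep at most k counters while scanning the stream once."""
    counters = {}
    for x in stream:
        if x in counters:
            # Known value: just bump its counter.
            counters[x] += 1
        elif len(counters) < k:
            # Room left in the table: start tracking the new value.
            counters[x] = 1
        else:
            # Table full: decrement every counter and drop those that hit zero.
            # The new value x itself is discarded.
            for key in list(counters):
                counters[key] -= 1
                if counters[key] == 0:
                    del counters[key]
    return counters


# The ten example numbers from the question, with room for only 3 entries.
stream = [1, 10, 2, 3, 9, 1, 9, 1, 4, 5]
summary = misra_gries(stream, k=3)
print(summary)
```

On the ten example numbers with `k=3`, only the values 1 and 9 survive, which are the two most frequent. The stored counts are lower bounds rather than exact frequencies, so this fits only if losing the frequencies is acceptable. Note that the guarantee is stated in terms of a frequency threshold, not "the top 5% of entries", so `k` would have to be chosen with that in mind.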
After writing this down, it seems clearer to state the problem as: if values are randomly generated from a Weibull distribution, how many values do I need before I can identify the top 5%?
Would sampling techniques be of any use? I need this to work quickly in real time, i.e. I still want to be able to identify the top 5% after, say, 100 numbers.
Thanks for any advice or suggestions. I'm happy to be sent away to read up on specific topics.