I need help with a formula for a music-related website I am building. The
site will have a 'popularity chart' that ranks how popular each band/artist
is on the website. I want to use four criteria to determine this: number of
fans, unique daily page views (artists have their own Myspace-style web
page), unique daily song plays, and artist review score. The first three are
self-explanatory. As for the artist review score: for every artist that
creates a profile on the website, up to 1000 randomly chosen fans will be
prompted to 'review' that artist's music on a simple 1-5 scale, 1 meaning
they don't like the artist's music and 5 meaning they love it. The problem I
am having is how to weight these four categories equally (25% per criterion)
in my formula, since 3 of them are open-ended (i.e. a band could have
1,000,000 page views, song plays or fans) while the review has at most 1000
votes (which are averaged into a single artist review score). I would
appreciate any help in creating a formula that most fairly determines an
artist's popularity from these four factors. Please keep in mind that I would
like each of the four categories to carry 25% weight in determining an
artist's chart ranking.
Well, the simplest thing to do is to average your four scores. To weight them all equally, first "scale" each one so that it lies between 0 and 1. For example, divide your "1 to 5 scale" by 5 so that 1 = .2, 2 = .4, 3 = .6, 4 = .8, 5 = 1. For "number of fans", divide by some large number, larger than the number of fans you think any artist will ever have, so that the result is between 0 and 1. Do the same for page views and song plays. After you have done that, add the four numbers and divide by four.
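A minimal Python sketch of that scale-then-average recipe follows. The function name `simple_score` and the cap of 1,000,000 are assumptions chosen for illustration, not values given in the thread.

```python
def simple_score(fans, views, plays, review, cap=1_000_000):
    """Average of four values, each scaled into the range 0..1.

    cap is a hypothetical "larger than any artist will reach" divisor.
    Counts above the cap are clipped to 1.0 so the scale stays 0..1.
    """
    scaled = [
        min(fans / cap, 1.0),
        min(views / cap, 1.0),
        min(plays / cap, 1.0),
        review / 5.0,  # review is an average on the 1-5 scale
    ]
    return sum(scaled) / len(scaled)


# Example: 83,000 in each count category and an average review of 4.1.
print(simple_score(83_000, 83_000, 83_000, 4.1))
```

The clipping with `min` is a design choice made here so that a badly guessed cap cannot push a score above 1.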
Ok, that makes sense. However, doesn't the size of the number I divide "fans", "page views" and "song plays" by affect the outcome? For example, if I divide these 3 categories by 1 million (to ensure that it is a number never reached) and the artist has 83,000 fans, the value is .083. However, if I divide the artist review by 5 (and they have an average review of 4.1), that gives them .82. So if the first three categories are all realistically in the neighborhood of .01 to .1 and the artist review is in the neighborhood of .4 to .9 on average, wouldn't artist review really be pulling more than 25% weight because its numbers are so much greater? If someone has more plays, more fans and more page views than another artist but did dramatically poorer in artist review, they could be ranked lower than the artist they beat in the other 3 categories, right? Any suggestions on how to make it more even or how to rectify this situation? Thanks! Brandon
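The concern can be checked numerically. The sketch below computes what fraction of an artist's unweighted total comes from each category. The figures for page views and song plays (83,000 each) are assumptions added for illustration; the post only gives 83,000 fans and a 4.1 review.

```python
def category_shares(fans, views, plays, review, cap=1_000_000):
    """Fraction of the summed scaled score contributed by each category."""
    scaled = {
        "fans": fans / cap,
        "views": views / cap,
        "plays": plays / cap,
        "review": review / 5.0,
    }
    total = sum(scaled.values())
    return {name: value / total for name, value in scaled.items()}


shares = category_shares(83_000, 83_000, 83_000, 4.1)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")
```

With these numbers the review accounts for roughly three quarters of the total, far more than the intended 25%.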
It shouldn't be a problem, bebebe1, if you follow HallsofIvy's advice carefully.
You rescale FIRST, then average SECOND. So first of all, each category is rescaled to a 0 to 1 scale separately. For example, if there are 3 million plays in total and one artist gets 89,000 plays, their 'play' score is about 0.030. There are different ways to perform this rescaling: you can do it linearly or nonlinearly. Linear rescaling is probably easier but not necessarily "optimal" as you desired.
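As a rough illustration of the two options, the sketch below rescales one category linearly (dividing by the category total, as in the 3 million plays example) and nonlinearly. The logarithmic version is only one possible nonlinear choice and is an assumption here, as are the function names and the sample play counts other than 89,000.

```python
import math


def rescale_by_total(values):
    """Linear rescaling: each artist's share of the category total."""
    total = sum(values)
    return [v / total if total else 0.0 for v in values]


def rescale_log(values):
    """Nonlinear rescaling: log(1 + v) divided by log(1 + max).

    This compresses large counts, so a 10x lead in plays gives
    much less than a 10x lead in score.
    """
    top = math.log1p(max(values))
    return [math.log1p(v) / top if top else 0.0 for v in values]


# Three hypothetical artists whose plays add up to 3 million.
plays = [89_000, 911_000, 2_000_000]
print(rescale_by_total(plays))
print(rescale_log(plays))
```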
And then, as you figured out, the typical rescaled scores are going to be mismatched if you rescale in certain ways. The review scores might dominate because they range from 0.2 to 1 and are typically well above 0.2, whereas the 'play' and 'views' scores range from 0 to 1 but typically sit close to 0.
To make each category contribute on average 25% to the overall score you can simply use a weighted average. How to do this?
After normalizing the category scores, next add up the scores from all artists for a category. Call these totals n1, n2, n3, n4. Suppose you have three artists, A, B, C. Their respective sets of normalized scores are (a1,a2,a3,a4), (b1,b2,b3,b4), (c1,c2,c3,c4) for the four respective categories. Then,
n1 = a1+b1+c1.
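The totals n2, n3 and n4 are formed the same way. A category whose scores are typically large has a large total, so to even things out its weight should be inversely proportional to that total. Let S=1/n1+1/n2+1/n3+1/n4. Then the weight for category 1 is w1=(1/n1)/S, for category 2 the weight is w2=(1/n2)/S, etc., and the four weights add up to 1. Now go back to each artist and calculate their overall rating as the weighted average, so for artist 'A' we get,
A's score = a1*w1+a2*w2+a3*w3+a4*w4
where a1 is the normalized score for artist 'A' in category 1, etc.
Note that, summed over all artists, category i contributes ni*wi=1/S to the grand total, the same amount for every category, so each category accounts for 25% on average, which I believe is what you wanted.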
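One way to sketch the weighted average in Python is shown below. The weights here are taken inversely proportional to each category's total, normalized to sum to 1; that choice is an assumption made so that every category contributes the same amount when summed over all artists. The function names and the sample scores are hypothetical.

```python
def category_weights(scores):
    """scores: dict of artist -> tuple of normalized category scores.

    Returns one weight per category, inversely proportional to the
    category total and normalized so the weights sum to 1.
    Assumes every category total is greater than zero.
    """
    n_categories = len(next(iter(scores.values())))
    totals = [sum(s[i] for s in scores.values()) for i in range(n_categories)]
    inverse = [1.0 / t for t in totals]
    norm = sum(inverse)
    return [x / norm for x in inverse]


def overall_scores(scores):
    """Weighted average of each artist's category scores."""
    weights = category_weights(scores)
    return {
        name: sum(a * w for a, w in zip(s, weights))
        for name, s in scores.items()
    }


# Hypothetical normalized scores (fans, views, plays, review).
scores = {
    "A": (0.08, 0.05, 0.03, 0.82),
    "B": (0.02, 0.04, 0.06, 0.40),
    "C": (0.10, 0.01, 0.01, 0.60),
}
print(category_weights(scores))
print(overall_scores(scores))
```

Because the weights depend on the totals over all artists, they need to be recomputed whenever the underlying data changes.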
Another solution might be to use a geometric average, i.e. the fourth root of the product of the four normalized scores. That might be fairer because all artists get the same treatment: the higher review numbers will inflate the product disproportionately, but by the same factor for every artist. In other words, rescaling one category by a constant multiplies every artist's overall score by the same amount, so the ranking does not change. Otherwise use the weighted average as previously described.
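A short sketch of the geometric average, with a check that rescaling one category leaves the ranking unchanged. The function names and sample scores are hypothetical. `math.prod` requires Python 3.8 or later.

```python
import math


def geometric_score(category_scores):
    """Geometric mean: n-th root of the product of n scores."""
    return math.prod(category_scores) ** (1.0 / len(category_scores))


def ranking(scores):
    """Artist names ordered from highest to lowest geometric score."""
    return sorted(scores, key=lambda name: geometric_score(scores[name]),
                  reverse=True)


# Hypothetical normalized scores (fans, views, plays, review).
scores = {
    "A": (0.08, 0.05, 0.03, 0.82),
    "B": (0.02, 0.04, 0.06, 0.40),
    "C": (0.10, 0.01, 0.01, 0.60),
}

# Shrink every review score by the same constant factor.
rescaled = {name: s[:3] + (s[3] * 0.1,) for name, s in scores.items()}

print(ranking(scores))
print(ranking(rescaled))
```

One consequence worth keeping in mind: an artist with a score of 0 in any category gets an overall score of 0 under this scheme.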
Whatever normalization and averaging scheme you adopt, you can later check how it fares. Use statistical tests to gauge the sensitivity of the final overall rating to each of the four categories.
The wider problem is that you have no meta-criteria for fairness. But you can figure that out later and adjust the weights you use for each of the four criteria. If you want artist review scores to be more important, then your earlier proposed scaling and non-weighted average might be ok, but who knows? You need some other criteria to account for fairness. Then do the suggested sensitivity tests. Basically, just calculate the correlation between each category score and the overall score. If one category is more highly correlated with the overall rating than the others, then you have a bias. But you need a large sample size to make sure the differences between the category correlations are significant.
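The correlation check could be sketched as follows, using a hand-written Pearson correlation so that no external library is needed. The sample rows and the use of a plain unweighted average as the overall score are assumptions for illustration, and four artists is far too few for the differences to mean anything statistically.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences.

    Assumes neither sequence is constant (non-zero variance).
    """
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = math.sqrt(sum((x - mean_x) ** 2 for x in xs))
    spread_y = math.sqrt(sum((y - mean_y) ** 2 for y in ys))
    return cov / (spread_x * spread_y)


def category_correlations(rows, overall):
    """Correlation of each category column with the overall scores."""
    return [pearson([row[i] for row in rows], overall)
            for i in range(len(rows[0]))]


# Hypothetical normalized scores (fans, views, plays, review) per artist.
rows = [
    (0.08, 0.05, 0.03, 0.82),
    (0.02, 0.04, 0.06, 0.40),
    (0.10, 0.01, 0.01, 0.60),
    (0.05, 0.03, 0.02, 0.20),
]
overall = [sum(row) / len(row) for row in rows]  # unweighted average

for name, r in zip(["fans", "views", "plays", "review"],
                   category_correlations(rows, overall)):
    print(f"{name}: {r:+.2f}")
```

In this toy data the review column tracks the unweighted overall score almost perfectly, which is the kind of bias the check is meant to expose.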