I have a question about a ranking problem. Suppose I have a list of search results, where each result is a paragraph from some document together with its score:
Doc #1 Paragraph 1 of 5: 1.2
Doc #1 Paragraph 2 of 5: 1.5
Doc #1 Paragraph 3 of 5: 1.6
Doc #1 Paragraph 4 of 5: 1.7
Doc #4 Paragraph 2 of 2: 1.8
Doc #2 Paragraph 1 of 2: 3.5
Doc #2 Paragraph 2 of 2: 3.6
Doc #3 Paragraph 1 of 1: 4.0
Doc #4 Paragraph 1 of 2: 4.1
How can I rank these documents by their scores when a document can appear more than once in the list? The score of each result is independent of the others. You don't have to deal with my specific example, because I work with many different result sets.
The easiest answer is not to score individual paragraphs at all, but to combine all the text of a document into a single paragraph, so that each document has exactly one paragraph and one score. I want to hear your suggestions before I fall back on that conventional approach. My thoughts so far:
1. Taking the average may not be accurate, because some documents have only one paragraph while others have many, and a document with many paragraphs can have its score pulled down by its weaker ones.
2. Summing the scores does not seem sound either, because a document with many low-scoring paragraphs can compete with a document that has only one paragraph with a much higher score.
3. Taking the maximum score may work somewhat better, but it ignores the contribution of the other paragraphs, for example Doc #2 getting 3.5 and 3.6 compared to the single 4.0 of Doc #3.
4. It seems that whatever the solution is, the combined/final score of a document should be at least the maximum score of all the sub-scores.
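To make the trade-offs in the list above concrete, here is a minimal Python sketch that applies the three simple aggregations (mean, sum, max) to the example results. The helper names `group_scores`, `rank`, and `respects_max` are made up for illustration, and the sketch assumes that a higher score means a better match.

```python
from collections import defaultdict
from statistics import mean

# (document id, paragraph score) pairs copied from the example list.
results = [
    ("Doc #1", 1.2), ("Doc #1", 1.5), ("Doc #1", 1.6), ("Doc #1", 1.7),
    ("Doc #4", 1.8),
    ("Doc #2", 3.5), ("Doc #2", 3.6),
    ("Doc #3", 4.0),
    ("Doc #4", 4.1),
]


def group_scores(results):
    """Collect the paragraph scores of each document (hypothetical helper)."""
    grouped = defaultdict(list)
    for doc, score in results:
        grouped[doc].append(score)
    return dict(grouped)


def rank(results, aggregate):
    """Order documents by an aggregate of their paragraph scores.

    Assumption: a higher score means a better match, so sort descending.
    """
    grouped = group_scores(results)
    return sorted(grouped, key=lambda doc: aggregate(grouped[doc]), reverse=True)


def respects_max(results, aggregate):
    """Check point 4: is every document's final score >= its best sub-score?"""
    return all(
        aggregate(scores) >= max(scores)
        for scores in group_scores(results).values()
    )


for name, aggregate in [("mean", mean), ("sum", sum), ("max", max)]:
    order = rank(results, aggregate)
    print(f"{name:>4}: {order}  respects max: {respects_max(results, aggregate)}")
```

On this sample the three aggregations produce three different orderings: the mean puts Doc #3 first, the sum puts Doc #2 first, and the maximum puts Doc #4 first. Only the mean violates point 4 here, and the sum satisfies it only because all the scores are non-negative.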
Is there a solution that balances these issues without knowing how sub-scores are computed?