I apologize in advance if this is the wrong forum for this.

Anyway, I am having some trouble trying to figure out how to calculate inter-rater reliability for a data set.

My data consists of 2211 RATEES and 10 RATERS (the design is fully crossed, so every rater rated every ratee exactly once). The ratees are actually word pairs (e.g. "alligator-ostrich", "alligator-pineapple", "bottle-ostrich", etc.), each of which was given a similarity score by the raters on a continuous scale.

The way the rating process worked, in case it is relevant, is that the word pair appeared on a computer screen above a slider bar. The rater was instructed to set the position of the slider (with the left-hand side being "unrelated" and the right-hand side being "related") to assess word pair similarity. Therefore, the raters were not explicitly assigning ranks, numbers, or categories to each word pair - they just set the value on a continuous scale. The program wrote a number (between 0 and 100) to a data file, but at no point did the raters actually see these values.

So, in any case, the data looks a bit like this:

               RATER 1   RATER 2   RATER 3   ...
    RATEE 1          5         0        43
    RATEE 2         33        18        86
    RATEE 3         12         4        52
    ...

What I want to do is calculate the inter-rater reliability or concordance of the data set. I know there are a number of different methods for doing so, but I can't seem to find one that fits my data set, so I was hoping somebody here could help me out. Maybe I am misunderstanding some of these tests, or there is a major one I don't know about. Anyway, these are the different tests I have looked at.

Cohen's kappa: inappropriate because it requires qualitative/categorical data

Fleiss' kappa: same as above

Joint probability of agreement: again, only works for nominal data

Concordance correlation: only works for paired data

Kendall's W: requires rank ordering, which may or may not be appropriate?

Intra-class correlation: may be appropriate, but searching online just confuses me on this. Some sources say it is only for paired data, others do not. And it seems to require ANOVA?
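
To make the ANOVA connection concrete, here is a minimal sketch of how I understand the two-way ICC would be computed from a ratees-by-raters table like mine. It assumes the standard Shrout and Fleiss mean-square formulas for a fully crossed design with no missing values; the function name `icc_two_way` and the dictionary keys are made up for illustration, and the toy numbers are just the three rows shown above, so the output says nothing about my real data.

```python
def icc_two_way(ratings):
    """Two-way ICCs for a fully crossed ratees-by-raters table.

    `ratings` is a list of rows, one row per ratee and one column per rater.
    Needs at least 2 ratees and 2 raters, with no missing values.
    Formulas follow the usual Shrout & Fleiss mean-square definitions.
    """
    n = len(ratings)       # number of ratees (rows)
    k = len(ratings[0])    # number of raters (columns)

    grand = sum(sum(row) for row in ratings) / (n * k)
    row_means = [sum(row) / k for row in ratings]
    col_means = [sum(row[j] for row in ratings) / n for j in range(k)]

    # Sums of squares from the two-way ANOVA without replication.
    ss_rows = k * sum((m - grand) ** 2 for m in row_means)
    ss_cols = n * sum((m - grand) ** 2 for m in col_means)
    ss_err = sum(
        (ratings[i][j] - row_means[i] - col_means[j] + grand) ** 2
        for i in range(n)
        for j in range(k)
    )

    ms_rows = ss_rows / (n - 1)              # between-ratee mean square
    ms_cols = ss_cols / (k - 1)              # between-rater mean square
    ms_err = ss_err / ((n - 1) * (k - 1))    # residual mean square

    return {
        # Absolute agreement, reliability of a single rater.
        "icc2_1": (ms_rows - ms_err)
        / (ms_rows + (k - 1) * ms_err + k * (ms_cols - ms_err) / n),
        # Absolute agreement, reliability of the mean of all k raters.
        "icc2_k": (ms_rows - ms_err) / (ms_rows + (ms_cols - ms_err) / n),
        # Consistency (ignores constant offsets between raters), single rater.
        "icc3_1": (ms_rows - ms_err) / (ms_rows + (k - 1) * ms_err),
    }


if __name__ == "__main__":
    # Toy data: only the three example rows from the table above.
    toy = [
        [5, 0, 43],
        [33, 18, 86],
        [12, 4, 52],
    ]
    for name, value in icc_two_way(toy).items():
        print(name, round(value, 3))
```

If I have this right, the "absolute agreement" versions penalize a rater who consistently sits further right on the slider than the others, while the "consistency" version does not, so which one I should report presumably depends on whether such offsets matter.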

Anyway, I'm sorry if this is a basic question, but Wikipedia and other online sources are just confusing me more on this, because the information seems to be contradictory from one place to the next. Is ICC the most appropriate method? Or Kendall's or something else that rank orders the data? Should I treat it as nominal and do a kappa?