Questions about calculating inter-rater reliability

I apologize in advance if this is the wrong forum for this.

Anyway, I am having some trouble trying to figure out how to calculate inter-rater reliability for a data set.

My data consists of 2211 RATEES and 10 RATERS (the design is fully crossed, so every rater rated every ratee once). The ratees are actually word pairs (e.g. "alligator-ostrich", "alligator-pineapple", "bottle-ostrich", etc.) which were given a similarity score by the raters on a continuous scale.

The way the rating process worked, in case it is relevant, is that the word pair appeared on a computer screen above a slider bar. The rater was instructed to set the position of the slider (with the left-hand side being "unrelated" and the right-hand side being "related") to assess word-pair similarity. So the raters were not explicitly assigning ranks, numbers, or categories to each ratee - they just set the value on a continuous scale. The program wrote a number (between 0 and 100) to a data file, but at no point did the raters actually see these values.

So, in any case, the data looks a bit like this:

        RATER 1  RATER 2  RATER 3  ...
RATEE 1       5        0       43
RATEE 2      33       18       86
RATEE 3      12        4       52
...

What I want to do is calculate the inter-rater reliability or concordance of the data set. I know there are a number of different methods for doing so, but I can't seem to find one that fits my data, so I was hoping somebody here could help me out. Maybe I am misunderstanding some of these tests, or there is a major one I don't know about. In any case, these are the different tests I have looked at:

Cohen's kappa: inappropriate because it requires qualitative/categorical data

Fleiss' kappa: same as above

Joint probability of agreement: again, only works for nominal data

Concordance correlation: only works for paired data

Kendall's W: requires rank ordering, which may or may not be appropriate?

Intra-class correlation: may be appropriate, but searching online just confuses me further. Some sources say it is only for paired data, others do not, and it seems to require an ANOVA?

Anyway, I'm sorry if this is a basic question, but Wikipedia and other online sources are just confusing me more on this, because the information seems to be contradictory from one place to the next. Is ICC the most appropriate method? Or Kendall's or something else that rank orders the data? Should I treat it as nominal and do a kappa?
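(Edit: for anyone who lands here later - ICC(2,1), the two-way random-effects, absolute-agreement, single-rater form, can be computed directly from the ratees x raters matrix with a bit of ANOVA bookkeeping. A rough Python/NumPy sketch; the function name and toy data are mine, not from any package:)

```python
import numpy as np

def icc_2_1(x):
    """ICC(2,1): two-way random effects, absolute agreement, single rater.

    x is an (n_ratees, k_raters) matrix of scores.
    """
    x = np.asarray(x, dtype=float)
    n, k = x.shape
    grand = x.mean()
    row_means = x.mean(axis=1)   # one mean per ratee
    col_means = x.mean(axis=0)   # one mean per rater

    # Two-way ANOVA mean squares
    msr = k * np.sum((row_means - grand) ** 2) / (n - 1)   # between ratees
    msc = n * np.sum((col_means - grand) ** 2) / (k - 1)   # between raters
    resid = x - row_means[:, None] - col_means[None, :] + grand
    mse = np.sum(resid ** 2) / ((n - 1) * (k - 1))         # residual

    return (msr - mse) / (msr + (k - 1) * mse + k * (msc - mse) / n)

# Toy example: 3 word pairs, 2 raters
ratings = [[1, 2],
           [3, 4],
           [5, 6]]
print(icc_2_1(ratings))  # ~0.889: high agreement despite rater 2's constant offset
```

The absolute-agreement form penalizes raters who use different parts of the slider scale; if only the ordering matters to you, the consistency form ICC(3,1) drops the rater-variance term.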

Re: Questions about calculating inter-rater reliability

Hey RASimmons.

This looks like a contingency/categorical data problem. You might want to model it as such and then do a statistical test of association between the rating for a particular word-pair cell and the variable corresponding to the rater.

If there is an association, a chi-square analysis should pick it up. If it doesn't, that is equivalent to failing to reject the hypothesis that there is no association.

Re: Questions about calculating inter-rater reliability

Thanks for the reply, chiro!

How do I model this as a contingency/categorical data problem? I don't quite understand how to apply a chi-square analysis to this data set ...

Re: Questions about calculating inter-rater reliability

Do you have access to SAS? SAS has a procedure called PROC FREQ that will do the test for you, but if you want to do it manually, you simply compare an observed distribution with a uniform distribution.

The idea is that if there is an association then P(A|B) will not be constant across different values of B. If P(A|B) = P(A), this means A and B are not associated (independent).

So your observed distribution is P(A = a|B) for some fixed a and varying B, and your expected distribution is uniform.

If the result is extreme enough, you can reject the hypothesis that the two variables are independent.
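To make that concrete, here is one reading of the suggestion sketched outside of SAS (Python here; the bin edges and the fake data are made up purely for illustration): bin the continuous 0-100 scores into categories, cross-tabulate category counts against rater, and run a chi-square test of independence on the resulting table.

```python
import numpy as np
from scipy.stats import chi2_contingency

rng = np.random.default_rng(0)
# Fake ratings: 200 word pairs x 3 raters, scores on the 0-100 slider scale
ratings = rng.uniform(0, 100, size=(200, 3))

# Bin the continuous scores into categories (edges are arbitrary here)
bins = [0, 25, 50, 75, 100.01]
table = np.column_stack([
    np.histogram(ratings[:, j], bins=bins)[0]  # counts per bin, one column per rater
    for j in range(ratings.shape[1])
])

# Chi-square test of independence between rating category and rater
chi2, p, dof, expected = chi2_contingency(table)
print(chi2, p, dof)  # dof = (4 bins - 1) * (3 raters - 1) = 6
```

Note this tests whether raters differ in how they distribute scores across bins, which is not quite the same thing as agreement on individual word pairs - something to keep in mind when interpreting the p-value.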

Re: Questions about calculating inter-rater reliability

I do have access to SAS. I just wanted to know the logic/method behind what I was doing. I don't like being one of those people that just plugs things blindly into a program without actually understanding what it is doing. I will look at that function.

Thanks for your help!