I am evaluating the statistical significance of accuracy improvements in a speech recognition system in response to changes in the acoustic models. I am comparing mean recognition accuracy before and after the changes.
As input I have pairs of "speaker - accuracy" or "word - accuracy" for both groups. Since the speakers and the vocabulary are the same in both evaluations, I decided to use the t-test for dependent samples to find out whether there is a significant change in mean accuracy.
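To make my setup concrete, here is a minimal sketch of the paired test I am running, with made-up accuracy numbers (the speaker values are purely hypothetical, and I compute the t statistic by hand rather than via a stats library):

```python
import math
from statistics import mean, stdev

# Hypothetical per-speaker recognition accuracies (%), same speakers in both runs.
before = [82.0, 75.5, 90.0, 68.0, 88.5, 79.0]
after  = [85.0, 78.0, 91.5, 70.0, 88.0, 83.5]

# Paired t-test: test whether the mean per-speaker difference is zero.
diffs = [a - b for a, b in zip(after, before)]
n = len(diffs)
t_stat = mean(diffs) / (stdev(diffs) / math.sqrt(n))  # df = n - 1
print(f"t = {t_stat:.3f} with {n - 1} degrees of freedom")
```

Note that the only sample size entering this statistic is the number of speaker pairs `n`; nothing in it reflects how many utterances each speaker's accuracy was computed from, which is exactly my concern below.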
There is still one thing I wonder about: the standard t-test for dependent samples does not seem to take into account the number of test samples I have per vocabulary item or per speaker. If I evaluate on a by-speaker basis, the t-test result is the same whether I have 20 or 50 utterances per speaker. If I evaluate on a by-vocabulary-item basis, the result does not depend on the number of evaluation speakers.
Since I evaluate 20 speakers but 100 utterances per speaker, the significance is usually much higher if I run the t-test on vocabulary items instead of speakers.
Is there another test, or a modification of the t-test, that takes this information into account? For each utterance I only know whether it was recognized correctly or incorrectly, so I cannot run the t-test directly on the utterances: that requires an interval scale, not just a binary decision. Is there a way to test on speakers and vocabulary items separately and combine those two results into better information about significance?
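For clarity, this is the shape of my raw data and the aggregation step where the per-speaker utterance count gets lost (speaker names and outcomes are hypothetical):

```python
# One binary outcome per utterance: 1 = recognized correctly, 0 = not.
utterances = {
    "speaker_01": [1, 0, 1, 1, 1, 0, 1, 1],
    "speaker_02": [1, 1, 0, 1, 0, 1, 1, 0],
}

# Collapse each speaker's utterances into a single accuracy value.
# After this step, a speaker with 20 utterances and one with 100
# contribute a single number each to the t-test.
accuracy = {spk: sum(res) / len(res) for spk, res in utterances.items()}
print(accuracy)  # e.g. {'speaker_01': 0.75, 'speaker_02': 0.625}
```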
I'd be glad if anyone could help, as my knowledge of statistics is rather limited...
thank you in advance,