Hey mysteriouspoet3000.

When we consider randomness, we consider two things.

The first is that every outcome is as likely as any other, and the second is that no trial is affected in any way by any other trial. Your binomial model captures both of these assumptions.
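To make the "pure luck" model concrete, here is a rough sketch of the exact binomial tail probabilities. The numbers in it (100 questions, a 0.25 chance of guessing each one, a score of 35) are placeholders I made up for illustration, not values from your problem, and the function names are my own.

```python
from math import comb


def binom_pmf(k, n, p):
    """P(X = k) for X ~ Binomial(n, p)."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def upper_tail(k, n, p):
    """P(X >= k): chance of scoring k or more by guessing alone."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


def lower_tail(k, n, p):
    """P(X <= k): chance of scoring k or fewer by guessing alone."""
    return sum(binom_pmf(i, n, p) for i in range(0, k + 1))


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the calculation.
    n, p, score = 100, 0.25, 35
    print("P(score >= %d | guessing) = %.4f" % (score, upper_tail(score, n, p)))
    print("P(score <= %d | guessing) = %.4f" % (score, lower_tail(score, n, p)))
```

A small upper-tail probability says the score is hard to explain by guessing under this model. It does not by itself say the student understood the material, which is where the structure of the test comes in.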

What also needs to be considered is the part of the test that is not random.

If the test has a specific structure (the answers are not just allocated at random, and different answers are related to one another), then this needs to be built into the model of how a specific score relates to understanding (as a probability, not a certainty) versus the hypothesis that the score was obtained by "luck".

This is not an easy thing to do, but it gives you a more specific model for deciding, under the stated assumptions, whether a score (high or low) was obtained via "luck" or via true understanding.

You can decide whether you want to go this far or not. If you don't, you will be ignoring the structure of the test and using pure randomness as your model, which neglects how the content, structure, and interdependence of the answers affect the estimator, the test statistic, and the rejection and non-rejection regions for the two hypotheses (i.e. that the student answered based on understanding, or based on luck).

If you want to do the detailed analysis, you need to relate the distributions through conditional probabilities, so that each answer changes how the distributions for the other questions are defined. You then use this information to construct an estimator, and use that to determine whether a score (either extremely high or extremely low) is suspected, under a hypothesis, to demonstrate knowledge (or lack of knowledge) as opposed to luck.
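As a very rough sketch of what conditioning can do, here is a toy simulation in which the chance of getting a question right depends on whether the previous question was right. This chain structure and every probability in it are assumptions I invented for illustration; a real test would need its own dependence model. The point is only that dependence between answers changes the null distribution of the score, and so the rejection region.

```python
import random


def simulate_score(n, p_first, p_after_right, p_after_wrong, rng):
    """One simulated score where each answer depends on the previous one."""
    correct = rng.random() < p_first
    score = int(correct)
    for _ in range(n - 1):
        p = p_after_right if correct else p_after_wrong
        correct = rng.random() < p
        score += int(correct)
    return score


def null_scores(n, p_first, p_after_right, p_after_wrong, trials=5000, seed=0):
    """Simulated score distribution under the assumed 'luck' model."""
    rng = random.Random(seed)
    return [
        simulate_score(n, p_first, p_after_right, p_after_wrong, rng)
        for _ in range(trials)
    ]


def upper_critical_value(scores, alpha=0.05):
    """Smallest c whose simulated tail P(score >= c) is at most alpha."""
    trials = len(scores)
    for c in range(min(scores), max(scores) + 2):
        tail = sum(1 for s in scores if s >= c) / trials
        if tail <= alpha:
            return c


if __name__ == "__main__":
    n = 100  # hypothetical number of questions
    # Independent guessing: the previous answer makes no difference.
    independent = null_scores(n, 0.25, 0.25, 0.25)
    # Dependent guessing: same long-run rate of 0.25, but answers cluster.
    dependent = null_scores(n, 0.25, 0.40, 0.20)
    print("critical value, independent:", upper_critical_value(independent))
    print("critical value, dependent:  ", upper_critical_value(dependent))
```

In this toy setup both models have the same average score, but the dependent one spreads the scores out more, so a score has to be higher before you would reject "luck" at the same level.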

Your assumption of normality is a good one if n is big enough (which it is) and np(1-p) is big enough (which it is).
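If you want to see how close the normal approximation is, one way is to compare it with the exact binomial tail directly. The sketch below does that with a continuity correction; the values of n, p, and the score are again placeholders of mine rather than your actual numbers.

```python
from math import comb, erf, sqrt


def exact_upper_tail(k, n, p):
    """Exact P(X >= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


def normal_cdf(x, mu, sigma):
    """CDF of a Normal(mu, sigma^2) distribution, via the error function."""
    return 0.5 * (1 + erf((x - mu) / (sigma * sqrt(2))))


def approx_upper_tail(k, n, p):
    """Normal approximation to P(X >= k), with a continuity correction."""
    mu = n * p
    sigma = sqrt(n * p * (1 - p))
    return 1 - normal_cdf(k - 0.5, mu, sigma)


if __name__ == "__main__":
    # Hypothetical values, chosen only to show the comparison.
    n, p, score = 100, 0.25, 35
    print("np(1-p)       =", n * p * (1 - p))
    print("exact tail    =", exact_upper_tail(score, n, p))
    print("normal approx =", approx_upper_tail(score, n, p))
```

The two numbers should be close when np(1-p) is reasonably large, and they drift apart in the far tails or when p is near 0 or 1.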

This can be used to get evidence for each of the options you mentioned above, but you will need the kind of analysis I described to get more evidence of whether a score is lucky or based on understanding. In the general case that will require some very complex models, depending on how the test is structured (in fact, you would design the test beforehand with the right properties and then use a computer to generate the estimators and the rejection/non-rejection regions for the two hypotheses).

The simple question is whether you want to narrow down the evidence from the first test.

Ultimately, in any hypothesis test you will always face the possibility of getting it wrong, but the design of your estimator and rejection region will help you resolve your question (those three choices) mathematically. Once it's clearly defined, you can build up from there and also communicate it to other statisticians, who will know what it means.

I'm sorry if this seems complicated, but it kind of is.