What options have you thought about?
Something very simple might be "number of correct elements - number of wrong elements", where a higher score indicates better performance.
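As a rough sketch of that simple score, assuming the function's output and the known-good answers are two equal-length sequences compared position by position (the name `simple_score` and the sample data are made up for illustration):

```python
def simple_score(predicted, expected):
    """Return (number of correct elements) - (number of wrong elements)."""
    # Assumption: both sequences have the same length and are compared
    # element by element at the same position.
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    wrong = len(expected) - correct
    return correct - wrong


# 3 elements match and 1 does not, so the score is 3 - 1 = 2.
print(simple_score([1, 2, 3, 4], [1, 2, 0, 4]))
```

Note that this treats every wrong element the same, no matter how far off it is.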
If you want something more sophisticated, I suggest you think about what kinds of errors are most important. For example, if one function gives a single answer that is 100 units too big, is that more significant than, less significant than, or just as significant as another function giving 2 wrong answers that are each 50 units too big?
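To make that question concrete, here is a small sketch comparing two possible error measures on exactly that case. The two measures (total absolute error and total squared error) are just common choices I am assuming for illustration, and the sample numbers are invented:

```python
def total_absolute_error(predicted, expected):
    """Sum of |predicted - expected|: every unit of error counts equally."""
    return sum(abs(p - e) for p, e in zip(predicted, expected))


def total_squared_error(predicted, expected):
    """Sum of (predicted - expected)^2: large misses are penalised more."""
    return sum((p - e) ** 2 for p, e in zip(predicted, expected))


expected = [10, 20, 30]          # hypothetical correct answers
one_big_miss = [110, 20, 30]     # one answer 100 units too big
two_small_misses = [60, 70, 30]  # two answers 50 units too big each

# Absolute error rates both functions the same: 100 and 100.
print(total_absolute_error(one_big_miss, expected),
      total_absolute_error(two_small_misses, expected))

# Squared error rates the single big miss as worse: 10000 and 5000.
print(total_squared_error(one_big_miss, expected),
      total_squared_error(two_small_misses, expected))
```

Which measure is right depends on your answer to the question above: if one large miss is worse than several small ones, something like squared error fits; if only the total amount of error matters, absolute error is enough.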