CS478 Paired Permutation Test Overview

We would like to compare the performance of models M1 and M2. One way to do this is to form a (null) hypothesis such as "there is no difference in the accuracies of M1 and M2", or more precisely, "on a random set of data, (M1, M2) are as likely to have generalization accuracies (a, b) as accuracies (b, a), for any a and b." To test such a hypothesis, we can select a representative statistic, such as the mean difference in accuracies, and try to estimate it at some level of confidence. This tells us something about whether or not we should reject our hypothesis. For example, if we have high confidence that the mean difference is not 0, then we can reject the hypothesis and say that the accuracy of M1 is significantly higher (or lower) than that of M2. One way to do this is to use a paired permutation test, as follows:

  1. Obtain a set of k pairs of accuracy estimates {(a1, b1), (a2, b2), ..., (ak, bk)} for (M1, M2). Make sure that estimates ai and bi are obtained from the same data and that estimates ai and aj are obtained from different data for i ≠ j (this is typically done using N-fold [stratified] cross-validation).
  2. Calculate the average difference in accuracies, μdiff = Σi(ai-bi)/k
  3. Let n = 0
  4. For each possible permutation of swapping ai and bi (perhaps most easily computed by permuting the signs of the k differences):
    1. Calculate the average difference in accuracies under that permutation, μnew = Σi si(ai-bi)/k, where si = -1 if ai and bi were swapped and si = +1 otherwise
    2. If |μnew| ≥ |μdiff|
          n=n+1
  5. Report p = n/2^k (a smaller p means we are more likely to reject the null hypothesis, or, put another way, that we are more confident that our observed value, μdiff, is statistically significantly different from 0).
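
For concreteness, here is a minimal Python sketch of steps 1-5 above. The function name paired_permutation_test and the use of NumPy and itertools are my own choices for illustration; a and b hold the per-fold accuracy estimates for M1 and M2.

    import itertools
    import numpy as np

    def paired_permutation_test(a, b):
        """Exhaustive paired permutation test on k paired accuracy estimates.

        a, b -- per-fold accuracies for M1 and M2 (same fold order).
        Returns p, the fraction of the 2^k sign assignments whose mean
        difference is at least as extreme as the observed mean difference.
        """
        diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        k = len(diffs)
        mu_diff = diffs.mean()          # step 2: observed mean difference
        n = 0                           # step 3
        # Step 4: swapping ai and bi is equivalent to flipping the sign of
        # the i-th difference, so enumerate all 2^k sign vectors.
        for signs in itertools.product((1.0, -1.0), repeat=k):
            mu_new = (diffs * signs).mean()
            if abs(mu_new) >= abs(mu_diff):
                n += 1
        return n / 2 ** k               # step 5

    # Example call with hypothetical 5-fold accuracies:
    # p = paired_permutation_test([0.91, 0.88, 0.93, 0.90, 0.87],
    #                             [0.89, 0.86, 0.94, 0.88, 0.85])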

Note: in the general case, k can be too large to admit an exhaustive test of all 2^k permutations. In these situations, the most common approach is simply to sample the permutations at random, as sketched below.
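
The following is a minimal sketch of that sampling variant, again in Python with a hypothetical function name; num_samples controls how many random sign assignments are drawn, and the resulting p is only an estimate of the exhaustive value.

    import numpy as np

    def sampled_permutation_test(a, b, num_samples=10000, seed=None):
        """Monte Carlo approximation of the paired permutation test.

        Draws num_samples random sign assignments instead of enumerating
        all 2^k, and reports the fraction that is at least as extreme as
        the observed mean difference.
        """
        rng = np.random.default_rng(seed)
        diffs = np.asarray(a, dtype=float) - np.asarray(b, dtype=float)
        mu_diff = diffs.mean()
        # One row of random +/-1 signs per sampled permutation.
        signs = rng.choice((1.0, -1.0), size=(num_samples, len(diffs)))
        mu_new = (signs * diffs).mean(axis=1)
        return float(np.mean(np.abs(mu_new) >= abs(mu_diff)))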

Thanks to Jason Eisner of the Johns Hopkins Computer Science department for many helpful suggestions.