## CS478 Paired Permutation Test Overview

We would like to compare the performance of models
*M _{1}* and

*M*. One way to do this is to form a (null) hypothesis such as "there is no difference in the accuracies of

_{2}*M*and

_{1}*M*", or more precisely, "on a random set of data, (

_{2}*M*,

_{1}*M*) are as likely to have generalization accuracies (a,b) as accuracies (b,a), for any a and b. To test such a hypothesis, we can select a representative statistic, such as "the mean difference in accuracies" and try to estimate it at some level of confidence. This will tell us something about whether or not we should reject our hypothesis. For example, if we have a high confidence that the mean difference is

_{2}*not*0, then we can reject our hypothesis and say that the accuracy of

*M*is significantly higher (or lower) that that of

_{1}*M*. One way to do this is to use a paired permutation test as follows:

_{2}- Obtain a set of
*k*pairs of accuracy estimates {(*a*,_{1}*b*), (_{1}*a*,_{2}*b*), ...,_{2}*a*} for (_{k}*M*,_{1}*M*). Make sure that estimates_{2}*a*and_{i}*b*are obtained from the same data and that estimates_{i}*a*and_{i}*a*are obtained from different data for_{j}*i*&ne*j*(this is typically done using*N*-fold [stratified] cross-validation). - Calculate the average difference in accuracies,
*μ*= Σ_{diff}_{i}(*a*-_{i}*b*)/_{i}*k* - Let
*n*= 0 - For each possible permutation of swapping
*a*and_{i}*b*(perhaps most easily computed by permuting the signs of the_{i}*k*differences):- Calculate the average difference in accuracies,
*μ*= Σ_{new}_{i}(*a*-_{i}*b*)/_{i}*k* - If |
*μ*| ≥ |_{new}*μ*|_{diff}

*n*=*n*+1

- Calculate the average difference in accuracies,
- Report
*p*=*n*/2^{k}(smaller*p*means we are more likely to reject the null hypothesis, or, put another way, that we are more confident that our observed value,*μ*, is statistically significantly different than 0)._{diff}

Note: in the general case, *k* can be too large to
admit an exhaustive test of all 2^{k}
permutations. In these situations, the most common approach
is the obvious one of randomly sampling the
permutations.

*Thanks to Jason Eisner of
the Johns Hopkins Computer Science department for many
helpful suggestions.*