Friday, June 14, 2013

"ironic" comments on cross-validation

There is an interesting set of responses to "Ten ironic rules for non-statistical reviewers" (Friston, 2012) in press at NeuroImage. Many interesting issues are raised, but I want to touch on the discussion of cross-validation in the reply "Sample size and the fallacies of classical inference" (Friston, 2013).

from Friston, 2013:
"Is cross validation useful?
[...] I think that the role of cross validation in neuroimaging deserves further discussion. As articulated clearly by Lindquist et al., the goals of inference and cross validation are very distinct. Cross validation is generally used to validate a model (or its parameters) that predicts or classifies using new observations. Technically, this is usually cast in terms of a posterior predictive probability density. Cross validation is used to validate this posterior predictive density for diagnostic or other purposes (such as out-of-sample estimation or variable selection). However, these purposes do not include model comparison or hypothesis testing. In other words, cross validation is not designed for testing competing hypotheses or comparing different models. This would be no problem, were it not for the fact that cross validation is used to test hypotheses in brain mapping. For example, do the voxels in my hippocampal volume of interest encode the novelty of a particular stimulus? To answer this question one has to convert the cross validation scheme into a hypothesis testing scheme - generally by testing the point null hypothesis that the classification accuracy is at chance levels. It is this particular application that is suboptimal. The proof is straightforward: if a test of classification accuracy gives a different p-value from the standard log likelihood ratio test then it is – by the Neyman–Pearson Lemma – suboptimal. In short, a significant classification accuracy based upon cross validation is not an appropriate proxy for hypothesis testing. It is in this (restricted) sense that the Neyman–Pearson Lemma comes into play.

I have mixed feelings about cross validation, particularly in the setting of multivariate pattern classification procedures. On the one hand, these procedures speak to the important issue of multivariate characterisation of functional brain architectures. On the other hand, their application to hypothesis testing and model selection could be viewed as a non-rigorous and slightly lamentable development."
I fully agree that "cross validation is not designed for testing competing hypotheses or comparing different models", but do not see how that is a problem for (typical) MVPA hypothesis testing.

This image is my illustration of a common way cross-validation is used in MVPA to generate a classification accuracy (see this post for an explanation). As Friston says in the extract, we often want to test whether the classification accuracy is at chance. My preferred approach in these sorts of cases is usually dataset-wise permutation testing: the task labels are permuted, then the entire classification procedure (including averaging the accuracies obtained on the cross-validation folds) is repeated, yielding a null distribution for the mean accuracy to which the true mean accuracy can be compared (and a significance level obtained, if desired).
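To make this concrete, here is a minimal sketch of a dataset-wise permutation test on toy data. It uses a simple nearest-centroid classifier in place of a real MVPA pipeline, and the data-generation scheme and all names are my own illustration, not code from the demo:

```python
import numpy as np

rng = np.random.default_rng(0)

n_runs, n_blocks, n_vox = 3, 6, 20           # three runs, six blocks each
runs = np.repeat(np.arange(n_runs), n_blocks)
labels = np.tile([0, 0, 0, 1, 1, 1], n_runs)  # three blocks of each task per run
# toy data: a small class effect plus noise
X = rng.normal(size=(n_runs * n_blocks, n_vox)) + 0.5 * labels[:, None]

def cv_accuracy(X, y, runs):
    """Leave-one-run-out CV with a nearest-centroid classifier."""
    accs = []
    for r in np.unique(runs):
        tr, te = runs != r, runs == r
        c0 = X[tr & (y == 0)].mean(axis=0)   # class centroids from training runs
        c1 = X[tr & (y == 1)].mean(axis=0)
        d0 = np.linalg.norm(X[te] - c0, axis=1)
        d1 = np.linalg.norm(X[te] - c1, axis=1)
        pred = (d1 < d0).astype(int)
        accs.append((pred == y[te]).mean())
    return np.mean(accs)                      # average over the folds

true_acc = cv_accuracy(X, labels, runs)

# dataset-wise permutation: relabel once, BEFORE the cross-validation,
# permuting within each run so the run structure is preserved
null = []
for _ in range(1000):
    perm = labels.copy()
    for r in np.unique(runs):
        perm[runs == r] = rng.permutation(perm[runs == r])
    null.append(cv_accuracy(X, perm, runs))

p = (np.sum(np.array(null) >= true_acc) + 1) / (len(null) + 1)
```

The key point is that the relabeling happens outside the cross-validation loop, so the cross-validation itself is just one step in computing each statistic; the inference comes from comparing the true accuracy to the null distribution.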

In this scenario the cross-validation does not play a direct role in the hypothesis testing: it is one step of the procedure that led from the data to the statistic we're hypothesizing about. The classification accuracy is not significant "based upon cross validation", but rather significant based on the permutation testing; cross-validation is one stage of the procedure that led to the classification accuracy.

Friston, K. (2013). Sample size and the fallacies of classical inference. NeuroImage. DOI: 10.1016/j.neuroimage.2013.02.057

Tuesday, June 11, 2013

indexing woes: spm and afni

For "fun" I decided to add spm and afni to the set of voxel indexing examples I started with MRIcroN, R, and pyMVPA.

SPM8 follows the 1-based style of MRIcroN and R:

while afni follows the 0-based style of pyMVPA:

So it's probably not fair to think of either the one-based or zero-based style as "correct", but the difference needs to be kept in mind when referring to voxels by i,j,k coordinates programmatically.

I don't use fsl or brain voyager; send me a note showing how these (or other) programs display voxel coordinates and I'll add them to the collection.

Update 9 August 2013: fsl and caret are both 0-based

MVPA Permutation Schemes: Permutation Testing in the Land of Cross-Validation

Next week I will be attending PRNI 2013, presenting "MVPA Permutation Schemes: Permutation Testing in the Land of Cross-Validation". The papers will be published, but are not online yet. I'll add a link once one's online, but let me know if you'd like a copy now and I'll send you a pdf. The little demo accompanying the paper is already online, here (it's a zip, containing a pdf explanation plus knitr R code and files). I'd like to expand the paper to include more examples and firmer advice, but the little paper is (hopefully) a clear overview of the issues.

The paper describes a few ways in which permutation testing is commonly done in MVPA, focusing on how the cross-validation folds (aka data partitions) are accounted for. If you're not familiar with why cross-validation is relevant for MVPA, this is a pretty readable statistically-oriented introduction; you could also try a paper I wrote a few years ago (Etzel, J.A., Gazzola, V., Keysers, C., 2009. An introduction to anatomical ROI-based fMRI classification analysis. Brain Research 1282, 114-125), or just google.

two permutation schemes

One goal of the paper is to point out that "we did a permutation test" is not a sufficient description for MVPA, since there are many reasonable ways to set up permutation tests. We use the terms "dataset-wise" and "fold-wise" to describe two common schemes. Since these terms aren't standard, we illustrate the two schemes with a running example.

introduction: running example

The running example is based on what I think is pretty much the simplest situation possible for task-based classification analysis for a single subject:

"This person completed three runs of fMRI scanning, each of which contained three blocks each of two different tasks. These task blocks were presented with sufficient rest intervals to allow the task-related BOLD signal to return to baseline, making it reasonable to assume that the task labels can be permuted [2, 3]. We further assume that the image preprocessing (motion correction, etc.) was adequate to remove most linear trends and uninteresting signals. Temporal compression [4] was performed, so that each task block is represented in the final dataset as a single labeled vector of voxel values (Fig. 1). There are n entries in each vector, corresponding to the voxels falling within an anatomically-defined region of interest (ROI). We assume that n is small enough (e.g. 100) that further feature selection is not necessary.

We wish to use a classification algorithm (e.g. linear support vector machines) to distinguish the two tasks, using all n voxels listed in the dataset. For simplicity, we will partition the data on the runs (three-fold CV): leave out one run, train on the two remaining runs, and repeat, leaving out each run in turn. The three test set accuracies are then averaged to obtain the overall classification accuracy (Fig. 2), which, if greater than chance, we interpret as indicating that the voxels’ BOLD varied with task."
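The three-fold leave-one-run-out partitioning described in the quote can be sketched as follows (a minimal illustration of the fold structure, not code from the paper):

```python
runs = [1, 2, 3]
folds = []
for test_run in runs:
    # train on the two remaining runs, test on the held-out run
    train_runs = [r for r in runs if r != test_run]
    folds.append((train_runs, test_run))
    print("train on runs", train_runs, "-> test on run", test_run)
```

Each run serves as the test set exactly once, and the three test-set accuracies are averaged to give the overall classification accuracy.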

I carry this way of illustrating cross-validation and classification through the later figures. The white-on-black color indicates that these examples have the true task labels: the numbers in the circles (which are 'seen' by the classifier) match those of the third (task) column in the dataset.

permutation schemes

Now, the permutation testing. We need to put new task labels on, but where? There are 20 distinct ways of reordering the task labels; the relabelings are shown in the figures as colored circles on a light-grey background.
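The count of 20 follows from the running example's structure, assuming relabeling is done over a run's six blocks (three of each task); a quick check:

```python
from itertools import permutations

# six blocks, three of each task: the number of distinct label
# orderings is 6! / (3! * 3!) = 20
relabelings = set(permutations([0, 0, 0, 1, 1, 1]))
print(len(relabelings))  # 20
```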

Under the dataset-wise scheme we put the new task labels on before the cross-validation, carrying the new labels through the cross-validation folds. Figure 4 shows how this works when both the training and testing sets are relabeled, while Figure 5 shows how it works when only the training sets are relabeled.

Note that the dataset's structure is maintained under the dataset-wise permutation scheme when both training and testing sets are relabeled (Figure 4 has the same pattern of arrows as Figure 2). Some of the arrows are shared between Figure 5 and Figure 2, but the property (in the real data) that each labeling is used in a test set is lost.

Under the fold-wise permutation scheme we put the new task labels on during the cross-validation. Figure 6 shows this for relabeling the training data only, as suggested in the pyMVPA documentation. Figure 6 has a similar structure to Figure 5, but the coloring is different: under the dataset-wise scheme a run is given the same set of permuted labels in all training sets in a permutation, while under the fold-wise scheme each run gets a different set of permuted labels (i.e. in Figure 5 run 1 is purple in both the first and second cross-validation folds, while in Figure 6 run 1 is purple in the first and red in the second).
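The dataset-wise versus fold-wise distinction comes down to where the relabeling happens relative to the cross-validation loop. Here is a minimal sketch of the two schemes (my own illustration; `permute_within_runs` is a hypothetical helper, not from the paper or demo):

```python
import numpy as np

rng = np.random.default_rng(1)

runs = np.repeat([1, 2, 3], 6)            # three runs, six blocks each
labels = np.tile([0, 0, 0, 1, 1, 1], 3)   # three blocks of each task per run

def permute_within_runs(y, rng):
    """Shuffle the labels separately within each run (run structure preserved)."""
    out = y.copy()
    for r in np.unique(runs):
        out[runs == r] = rng.permutation(out[runs == r])
    return out

# dataset-wise: relabel once, BEFORE the cross-validation loop, so a run
# carries the same permuted labels into every training set it appears in
y_dset = permute_within_runs(labels, rng)
dset_train_labels = [y_dset[runs != r] for r in [1, 2, 3]]

# fold-wise: relabel inside each fold, so a run can get a different set
# of permuted labels in each training set
fold_train_labels = [permute_within_runs(labels, rng)[runs != r]
                     for r in [1, 2, 3]]
```

In `dset_train_labels`, run 1's permuted labels are identical in every fold where run 1 is part of the training set; in `fold_train_labels` they generally differ from fold to fold, which is exactly the coloring difference between Figures 5 and 6.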

does the permutation scheme matter?

Yes, at least sometimes. We are often dealing with such small, messy datasets in MVPA that shifting the rank of the true accuracy within the null distribution by just a few positions can really matter. The simulations in the little demo show a few (tweakable) cases.

Here are the first two repetitions of the demo (this is in a more readable format in the pdf in the demo). In the first, the true-labeled accuracy was 0.64 (vertical line), the p-value for both.dset (permuting training and testing labels dataset-wise, as in Figure 4) was 0.013, train.dset (dataset-wise, permuting training only, as in Figure 5) was 0.002, both.fold (permuting both, fold-wise) was 0.001, and train.fold (fold-wise, permuting training only, as in Figure 6) was 0.005. On repetition 2, the p-values were: both.dset = 0.061, train.dset = 0.036, both.fold = 0.032, and train.fold = 0.028. That gives us p-values above and below the 'magical' value of 0.05, depending on how we do the permutation test.

final thoughts

Which permutation scheme should we use for MVPA? Well, I don't know a universally-applicable answer. As we know, how simulated datasets are created can really, really matter, and I certainly don't claim this little demo is representative of true fMRI data. That said, the pattern in the null distributions above - larger variance under dataset-wise than fold-wise schemes (and so a more stringent test) - is common in my experience, and unsurprising. It seems clear that more of the dataset structure is kept under dataset-wise schemes, which is consistent with the goal of matching the permutation test as closely as possible to the true data analysis.

My feeling is that a dataset-wise permutation scheme, particularly one permuting both the training and testing sets (Fig. 4), is the most rigorous test for a dataset like the little running example here. Dataset-wise permuting of either just the training or just the testing set labels may be preferable in some cases, such as 'cross' classification, where the training and testing sets are not runs but rather independent datasets (e.g. different experimental conditions or acquisition days).

I don't think that fold-wise schemes should be recommended for general use (datasets like the running example), since some of the dataset structure (similarity of training data across cross-validation folds) present in the true data is lost.

UPDATE (20 June 2013): Here is a link to a copy of my poster, plus expanded simulation code.

UPDATE (22 July 2013): Here is a link to a copy of the paper, as well.

UPDATE (6 November 2013): The paper is now up at IEEE and has the DOI 10.1109/PRNI.2013.44.

Wednesday, June 5, 2013

indexing woes: a story of pyMVPA, R, and mricron

In the hope that someone else will avoid the headaches we've experienced, here is a brief note about indexing in MRIcroN, R (specifically, oro.nifti), and pyMVPA. The python code - and finding the source of the discrepancy - is thanks to Ben Acland.

First, MRIcroN and R. The image above shows the fake brain NIfTI file I've used before, opened in MRIcroN and read into R with oro.nifti. In both programs, the value of the voxel at coordinates 18, 16, 21 is 13.

Now, pyMVPA. Here's the code (thanks, Ben!), with a screenshot from running it within NeuroDebian:
from mvpa2.suite import *
ds = fmri_dataset('/home/brain/host/fakeBrain.nii.gz')
# look up a feature (voxel) by its i,j,k coordinates: pyMVPA stores each
# volume as a flat vector, so find the column whose voxel_indices match
def ijkIdx(dataset, i, j, k):
  return np.where((dataset.fa['voxel_indices'][:,0] == i)
    & (dataset.fa['voxel_indices'][:,1] == j)
    & (dataset.fa['voxel_indices'][:,2] == k))[0][0]
ds.samples[0,ijkIdx(ds, 18,16,21)]   # returns 25, not 13
ds.samples[0,ijkIdx(ds, 17,15,20)]   # returns 13

Notice the lines marked in green: the voxel at 18, 16, 21 does not have the value 13, but 25; the voxel at 17, 15, 20 has the value 13 (orange). pyMVPA (like Python generally) is zero-based, while MRIcroN and R are one-based, which is the source of this discrepancy.
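The conversion is just an off-by-one shift in each dimension:

```python
# MRIcroN/R coordinates are 1-based; pyMVPA's are 0-based,
# so the same voxel's coordinates differ by one in each dimension
one_based = (18, 16, 21)                      # as shown in MRIcroN
zero_based = tuple(c - 1 for c in one_based)  # as pyMVPA sees it
print(zero_based)  # (17, 15, 20)
```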

Ben wrote the little function ijkIdx to look up a voxel's position in the feature vector from its i,j,k coordinates, since pyMVPA stores each volume internally as a vector; see the pyMVPA tutorial for more details.

Update 11 June 2013: spm is also 1-based while afni is 0-based
Update 9 August 2013: fsl and caret are both 0-based.