I just finished reading several papers by Malin Björnsdotter describing her randomly averaged searchlight method and am very excited to try it.

In brief, instead of doing an exhaustive searchlight analysis (a searchlight centered on every voxel), searchlights are positioned randomly, such that each voxel is in just one searchlight. The parcellation is repeated, so that each voxel is included in several different searchlights (it might be in the center of one, touching the center of another, in the edge of a third, etc.). You then create the final information map by assigning each voxel the average of all the searchlights it's a part of - not just the single searchlight it was the center of as in normal searchlight analysis.

Malin introduces the technique as a way to speed searchlight analysis: far fewer searchlights need to be calculated to achieve good results. That's certainly a benefit, as searchlight analyses can take a ridiculous amount of time, even distributed across a cluster.

But I am particularly keen on the idea of assigning each voxel the average accuracy of all the searchlights it's a part of, rather than the single centered searchlight accuracy. In this and other papers Malin shows that the Monte Carlo version of searchlight analysis produces more accurate maps than standard searchlighting, speculating that the averaging reduces some of the spurious results. I find this quite plausible: a single searchlight may be quite accurate at random; the chance high accuracy will be "diluted" by averaging with the overlapping searchlights. Averaging won't eliminate all distortions in the results of course, but might help mitigate some of the extremes.

Malin points out that averaged information maps can be constructed after doing an exhaustive searchlight mapping, not just a Monte Carlo one. I will certainly be calculating some of these with my datasets in the near future.

Björnsdotter M, Rylander K, & Wessberg J (2011). A Monte Carlo method for locally multivariate brain mapping. NeuroImage, 56 (2), 508-16 PMID: 20674749

## Tuesday, September 25, 2012

## Monday, September 24, 2012

### searchlight shapes: searchmight

The searchmight toolbox was developed by Francisco Pereira and the Botvinick Lab and introduced in Information mapping with pattern classifiers. The toolbox is specialized for searchlight analysis, streamlining and speeding the process of trying different classification algorithms on the same datasets.

Francisco said that, by default, the toolbox uses cubical searchlights (other shapes can be manually defined). In the image of different one-voxel radius searchlights, this is the one that I labeled "edge, face, or corner": 26 surrounding voxels for a one-voxel radius searchlight. Larger radii are also cubes.

Pereira, Francisco, & Botvinick, Matthew (2011). Information mapping with pattern classifiers: A comparative study. NeuroImage. DOI: 10.1016/j.neuroimage.2010.05.026

Francisco said that, by default, the toolbox uses cubical searchlights (other shapes can be manually defined). In the image of different one-voxel radius searchlights, this is the one that I labeled "edge, face, or corner": 26 surrounding voxels for a one-voxel radius searchlight. Larger radii are also cubes.

Pereira, Francisco, & Botvinick, Matthew (2011). Information mapping with pattern classifiers: A comparative study. NeuroImage. DOI: 10.1016/j.neuroimage.2010.05.026

## Wednesday, September 19, 2012

### pyMVPA searchlight shape: solved

Michael Hanke posted further description of the pyMVPA searchlight: it is the same as the Princeton MVPA toolbox searchlight I described earlier.

As Michael points out, it is possible to define other searchlight shapes in pyMVPA; this is the default for a sphere, not the sole option.

As Michael points out, it is possible to define other searchlight shapes in pyMVPA; this is the default for a sphere, not the sole option.

### searchlight shapes: a collection

The topic of searchlight shapes is more complex than I'd originally expected. I'd like to compile a collection of what people are using. I added a "searchlight shapes" label to the relevant posts: clicking that label should bring up all of the entries.

I have contacted some people directly in the hopes of obtaining sufficient details to add their searchlight shapes to the collection. If you have details of a searchlight not included, please send them and I'll add it (or arrange for you to post them). My initial goal is to describe the main (i.e. default) spherical searchlights produced in the most widely used code/programs, but I also plan to include other types of searchlights, like the random ones created by Malin Björnsdotter.

I have contacted some people directly in the hopes of obtaining sufficient details to add their searchlight shapes to the collection. If you have details of a searchlight not included, please send them and I'll add it (or arrange for you to post them). My initial goal is to describe the main (i.e. default) spherical searchlights produced in the most widely used code/programs, but I also plan to include other types of searchlights, like the random ones created by Malin Björnsdotter.

## Tuesday, September 18, 2012

### permutation testing: one person only

Here are the results and specifics of the permutation testing for one (real) person from the dataset I introduced in the previous post.

Combining the first six runs (the first training set) this person has 19 usable examples of class a and 31 of class b. In the last six runs (the first testing set) there are 22 usable examples of class a and 18 of class b. The data is thus quite out of balance (many more b in the training set and many more a in the testing). Balancing the dataset requires selecting 19 of the class b examples from the training set, and 18 of class a from the testing set. I did this selection randomly 10 times, which I will call "splits".

The permutation test's aim is to evaluate the classification significance: is the accuracy by which we classified the stimuli as a or b meaningful? I do this by permuting the class labels in the dataset, so in any particular permutation some class a examples are still labeled a, while others are labeled b. But in this case, even though it's just one person, we have ten datasets: the splits.

I handle this by precalculating the label permutations, then running the permutation test on each of the splits separately and averaging the results. To make it more concrete, this person has (after balancing) 19 examples of each class in the training set. I sort the rows so that they are grouped by class: the real labeling is 19 a followed by 19 b. I store 1000 random label rearrangements, each containing 19 a and 19 b. For example, here's the new labeling for permutation #1: "a" "b" "b" "a" "b" "a" "b" "a" "a" "a" "b" "b" "b" "a" "a" "b" "a" "b" "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "a" "b" "a" "a" "b" "b" "b" "b" "a" "b". I replace the real class labels with this set, then run the usual classification code - for each of the 10 splits.

When everything is done I end up with 1000 permuted-labels * 10 splits accuracies. I then make a set of 1000 accuracies by averaging the accuracy obtained with each label permutation on each split. In other words, I take the average of the accuracies I got by using the abbababaaa... labeling on each of my 10 "split" datasets. I then compute the significance by taking the rank of the real accuracy over the number of permutations. In this case, that's 2/1001 = 0.002.

Here's how it actually looks. These are histograms of the null distributions obtained on each of the split datasets separately, plus the averaged one. The vertical blue lines give the accuracy of the true labeling on each split, while the vertical light-grey line is at 0.5 (chance). The first graph is overplotting density plot versions of the 10 splits' histograms.

This method obviously requires a lot of computations (I often use a cluster). But I like that it enables comparing apples to apples: I'm comparing the accuracy I got by averaging 10 numbers to a distribution of accuracies obtained by averaging 10 numbers. Since the same set of splits and relabelings is used in all cases, the averaging is meaningful.

In this case, as often, the averaged distribution has less variance than the individual splits. This is not surprising, of course. It's also clear that the true-labeling accuracies vary more than the permutation distributions for each split, which is probably also not surprising (since these are permutation distributions), but not as clean as might be hoped. I do like that the true accuracy is at the right of all of the distributions; I consider this amount of variability pretty normal. I'd be worried if the distributions varied a great deal, or if the true-label accuracies are more variable (e.g. at chance on some splits, > 0.7 on others).

Combining the first six runs (the first training set) this person has 19 usable examples of class a and 31 of class b. In the last six runs (the first testing set) there are 22 usable examples of class a and 18 of class b. The data is thus quite out of balance (many more b in the training set and many more a in the testing). Balancing the dataset requires selecting 19 of the class b examples from the training set, and 18 of class a from the testing set. I did this selection randomly 10 times, which I will call "splits".

The permutation test's aim is to evaluate the classification significance: is the accuracy by which we classified the stimuli as a or b meaningful? I do this by permuting the class labels in the dataset, so in any particular permutation some class a examples are still labeled a, while others are labeled b. But in this case, even though it's just one person, we have ten datasets: the splits.

I handle this by precalculating the label permutations, then running the permutation test on each of the splits separately and averaging the results. To make it more concrete, this person has (after balancing) 19 examples of each class in the training set. I sort the rows so that they are grouped by class: the real labeling is 19 a followed by 19 b. I store 1000 random label rearrangements, each containing 19 a and 19 b. For example, here's the new labeling for permutation #1: "a" "b" "b" "a" "b" "a" "b" "a" "a" "a" "b" "b" "b" "a" "a" "b" "a" "b" "a" "a" "b" "b" "a" "a" "b" "b" "a" "a" "a" "b" "a" "a" "b" "b" "b" "b" "a" "b". I replace the real class labels with this set, then run the usual classification code - for each of the 10 splits.

When everything is done I end up with 1000 permuted-labels * 10 splits accuracies. I then make a set of 1000 accuracies by averaging the accuracy obtained with each label permutation on each split. In other words, I take the average of the accuracies I got by using the abbababaaa... labeling on each of my 10 "split" datasets. I then compute the significance by taking the rank of the real accuracy over the number of permutations. In this case, that's 2/1001 = 0.002.

Here's how it actually looks. These are histograms of the null distributions obtained on each of the split datasets separately, plus the averaged one. The vertical blue lines give the accuracy of the true labeling on each split, while the vertical light-grey line is at 0.5 (chance). The first graph is overplotting density plot versions of the 10 splits' histograms.

This method obviously requires a lot of computations (I often use a cluster). But I like that it enables comparing apples to apples: I'm comparing the accuracy I got by averaging 10 numbers to a distribution of accuracies obtained by averaging 10 numbers. Since the same set of splits and relabelings is used in all cases, the averaging is meaningful.

In this case, as often, the averaged distribution has less variance than the individual splits. This is not surprising, of course. It's also clear that the true-labeling accuracies vary more than the permutation distributions for each split, which is probably also not surprising (since these are permutation distributions), but not as clean as might be hoped. I do like that the true accuracy is at the right of all of the distributions; I consider this amount of variability pretty normal. I'd be worried if the distributions varied a great deal, or if the true-label accuracies are more variable (e.g. at chance on some splits, > 0.7 on others).

### permutation testing: example dataset

I currently have a dataset that while mostly straightforward to analyze, has some complications that arise frequently. In this post I'll sketch some critical aspects to hopefully clarify the permutation testing issues.

There are various ways to set up the cross-validation. Partitioning on the runs is not an option, since some runs don't have all the stimulus types. My goal was to find a partitioning scheme that is as simple as possible, but separates the training and testing sets in time while getting the number of examples of each class in the training and testing sets as close as possible.

I calculated the number of training and testing examples for several schemes, such as leave-four-runs-out and leave-three-runs-out, settling on leave-six-runs-out. Specifically, I use the first six runs as the training set and the last six runs (in presentation order) as the testing set, then the reverse. Using this scheme, the smallest number of examples in a partition for any person is 6; not great, but I guessed workable. The largest minimum for any person is 25; most people have in the teens.

Even with leave-six-runs-out partitioning the number of examples of the two classes in the training and testing set is usually unequal, sometimes strikingly so. My usual work-around is to subset the larger class to equalize the number of examples in each case (e.g. if there are 12 of class a and 14 of class b in the training set, take out two of the class b). There are of course many, many ways to do the subsetting. My usual practice is to randomly calculate 10 of these, then average the resulting accuracies. (e.g. calculate the accuracy when taking out the 1st and 3rd class b examples in the training set, then after taking out the 4th and 13th examples, etc.).

- The analysis is within-subjects (each person classified alone) with anatomical ROIs. The ROIs are reasonably small and independently defined; we will not do additional feature selection.
- The subjects completed 12 scanning runs, with the stimuli reasonably-separated (time-wise), and randomly ordered.
- There are various types of stimuli, but we are only interested in certain two-way classifications.
- We only want to include examples where the subject gave the correct response.
- We will do linear support vector machines, c=1.

There are various ways to set up the cross-validation. Partitioning on the runs is not an option, since some runs don't have all the stimulus types. My goal was to find a partitioning scheme that is as simple as possible, but separates the training and testing sets in time while getting the number of examples of each class in the training and testing sets as close as possible.

I calculated the number of training and testing examples for several schemes, such as leave-four-runs-out and leave-three-runs-out, settling on leave-six-runs-out. Specifically, I use the first six runs as the training set and the last six runs (in presentation order) as the testing set, then the reverse. Using this scheme, the smallest number of examples in a partition for any person is 6; not great, but I guessed workable. The largest minimum for any person is 25; most people have in the teens.

Even with leave-six-runs-out partitioning the number of examples of the two classes in the training and testing set is usually unequal, sometimes strikingly so. My usual work-around is to subset the larger class to equalize the number of examples in each case (e.g. if there are 12 of class a and 14 of class b in the training set, take out two of the class b). There are of course many, many ways to do the subsetting. My usual practice is to randomly calculate 10 of these, then average the resulting accuracies. (e.g. calculate the accuracy when taking out the 1st and 3rd class b examples in the training set, then after taking out the 4th and 13th examples, etc.).

## Monday, September 17, 2012

### permutation testing: opening thoughts

This is the first of what will probably be a continuing series of musing, meandering posts about permutation testing for group fMRI MVPA.

I'm a big fan of permutation testing, especially for MVPA: our questions, analyses, sample sizes, distributions, etc. are so far from what was envisioned for parametric stats that using a "standard" parametric test "out of the box" gives me pause (or, in practice, lots of assumption-checking). I like that with permutation tests it's possible to design a test that gets at exactly the hypothesis we're asking, while keeping the relevant variance structures (e.g. how many runs, type of classifier, number of examples in each cross-validation fold) intact.

But, the exact way two different people design a permutation test for any particular question will probably be different. And this gets worse in group analyses.

I'll be writing a series of posts illustrating some of the common ways permutation tests are done, both single subject and group analyses. My intent is that these posts will highlight and explain how to carry out what I think are the most useful and common methods, but also share my perception of where some of the open questions lie. As always, please feel free to comment, especially pointing out where your practice/observation/experience/conclusions differ from mine.

I'm a big fan of permutation testing, especially for MVPA: our questions, analyses, sample sizes, distributions, etc. are so far from what was envisioned for parametric stats that using a "standard" parametric test "out of the box" gives me pause (or, in practice, lots of assumption-checking). I like that with permutation tests it's possible to design a test that gets at exactly the hypothesis we're asking, while keeping the relevant variance structures (e.g. how many runs, type of classifier, number of examples in each cross-validation fold) intact.

But, the exact way two different people design a permutation test for any particular question will probably be different. And this gets worse in group analyses.

I'll be writing a series of posts illustrating some of the common ways permutation tests are done, both single subject and group analyses. My intent is that these posts will highlight and explain how to carry out what I think are the most useful and common methods, but also share my perception of where some of the open questions lie. As always, please feel free to comment, especially pointing out where your practice/observation/experience/conclusions differ from mine.

## Wednesday, September 12, 2012

### more searchlights: the Princeton mvpa toolbox

MS Al-Rawi kindly sent me coordinates of the searchlights created by the Princeton MVPA toolbox, which I made into this image. As before, the center voxel is black, the one-voxel-radius searchlight red, the two-voxel-radius searchlight purple, and the three-voxel-radius searchlight blue.

The number of voxels in each is:

1-voxel radius: 6 surrounding voxels (7 total)

2-voxel radius: 32 surrounding voxels (33 total)

3-voxel radius: 122 surrounding (123 total)

4-voxel radius: 256 surrounding (257 total)

From a quick look at the code generating the coordinates, it appears to "draw" a sphere around the center voxel, then retrieve the voxels falling within the sphere. (anyone disagree with this characterization?)

This "draw a sphere first" strategy explains why I couldn't come up with this number of voxels for a three-voxel-radius searchlight if the shape follows additive rules (e.g. "add voxels that share a face with the next-size-smaller radius"). It's another example of how much it can vary when different people actually implement a description: to me, it was more natural to think of searchlights of different radii as growing iteratively, a rather different solution than the one used in the Princeton MVPA toolbox.

The number of voxels in each is:

1-voxel radius: 6 surrounding voxels (7 total)

2-voxel radius: 32 surrounding voxels (33 total)

3-voxel radius: 122 surrounding (123 total)

4-voxel radius: 256 surrounding (257 total)

From a quick look at the code generating the coordinates, it appears to "draw" a sphere around the center voxel, then retrieve the voxels falling within the sphere. (anyone disagree with this characterization?)

This "draw a sphere first" strategy explains why I couldn't come up with this number of voxels for a three-voxel-radius searchlight if the shape follows additive rules (e.g. "add voxels that share a face with the next-size-smaller radius"). It's another example of how much it can vary when different people actually implement a description: to me, it was more natural to think of searchlights of different radii as growing iteratively, a rather different solution than the one used in the Princeton MVPA toolbox.

## Tuesday, September 11, 2012

### my searchlight, and some (final?) thoughts

Since I've been thinking about searchlight shapes, here's a diagram of the two-voxel radius searchlight I'm currently using in a few projects. As in the previous post, the center voxel is black, the one-voxel radius surrounds red, and the two-voxel radius surrounds purple.

This two-voxel radius searchlight has 13 + 21 + 24 + 21 + 13 = 92 voxels in the surround, which makes it larger than the three-voxel radius faces-must-touch searchlight (at 86 surrounding voxels).

So how should we grow searchlights, edge-face-corner, edge-face, faces-only? Beats me. I can imaging comparing the results of a few different shapes in a particular dataset, but I doubt there's a single shape that is always best. But the searchlight shape should be added to our list of how to describe a searchlight analysis.

PS - I use R code to generate lists of adjacent voxels for running searchlight analyses; contact me if you'd like a copy/details.

This two-voxel radius searchlight has 13 + 21 + 24 + 21 + 13 = 92 voxels in the surround, which makes it larger than the three-voxel radius faces-must-touch searchlight (at 86 surrounding voxels).

So how should we grow searchlights, edge-face-corner, edge-face, faces-only? Beats me. I can imaging comparing the results of a few different shapes in a particular dataset, but I doubt there's a single shape that is always best. But the searchlight shape should be added to our list of how to describe a searchlight analysis.

PS - I use R code to generate lists of adjacent voxels for running searchlight analyses; contact me if you'd like a copy/details.

### pyMVPA searchlight shapes

Michael Hanke listed the number of voxels for different searchlight radii in pyMVPA. He also wrote that pyMVPA considers neighboring voxels to be those that share two edges (i.e., a face).

Here's a diagram of following the faces-touching rule. The center voxel is black, the one-voxel radius surrounds red, the two-voxel radius surrounds purple, and the three-voxel radius surrounds blue.

This gives 1 + 9 + 21 + 24 + 21 + 9 + 1 = 86 surrounding voxels for the three-voxel radius.

But Michael wrote that a three-voxel radius searchlight in pyMVPA has 123 voxels (counting the center). If so, at left is how I could come by that count: 13 + 28 + 40 + 28 + 13 = 122.

This shape is not quite symmetrical: there are four voxels to the left, right, front, and back of the center voxel but only two above and below.

Anyone know if this is what pyMVPA does?

**UPDATE**: Neither of these sketches is correct, see pyMVPA searchlight: solved.

## Monday, September 10, 2012

### searchlight shapes

Nikolaus Kriegeskorte (the father of searchlight analysis) uses the illustration on the left.

But there are other ways of thinking of a searchlight. Here are three ways of defining a one-voxel radius searchlight.

Personally, I implemented the middle version (edges and faces, not corners, 18 voxels) as a one-voxel radius searchlight in my code. Larger radii are calculated iteratively (so for a two-voxel radius, combine the one-voxel radius searchlights defined by treating all 18 surrounding voxels present in the one-voxel radius searchlight as centers). I don't think one version of defining surrounds is necessarily better than others, but we do need to specify the voxels we are including.

update 11 September: I changed the caption of the searchlight shape images for clarity.

Subscribe to:
Posts (Atom)