Monday, July 30, 2012

Nipype impressions

I was asked to take a look at Nipype (great logo, Arno!). Here are some of my impressions; I haven't tried coding or installing it. If you have, I'd be curious to hear how it's gone and if you think any of this is off the mark.

Nipype aims to connect different fMRI software packages (spm, fsl, etc) so that you can program a reproducible set of steps. Your preprocessing could mix programs (e.g. smooth in one program, realign in another), and it should be easier to switch a step from one program to another (e.g. smooth in fsl instead of spm) with relatively few changes to the program.

I like that Nipype makes it possible to reproduce and share analysis steps; the current procedures are far too messy and error-prone. Nipype looks to require a fair bit of programming skill (python), however, as you're coding at a step above the underlying routines/programs: to do it properly you'd need a decent understanding of both nipype and the programs it's calling (from fsl or whatever).

Nipype strikes me as most useful to a skilled programmer wanting to run several versions of an analysis (e.g. with different motion-correction algorithms) or the same analysis many, many times (lots of subjects). It makes it more straightforward to make an established set of analysis steps (such as preprocessing) available to other people. It's hard to tell without trying it out, but it looks like a nipype program could be set up to be run by someone who doesn't know the underlying routines (e.g. an RA running the same preprocessing steps on a new group of subjects).

I have concerns about missing errors, specifying options, and stability. It's often necessary to check intermediate output before moving to the next step of a pipeline; these errors would need to be caught somehow (go back and inspect the intermediate files?). Most routines have many options (e.g. FWHM, interpolation algorithm); does Nipype set the default values? Exposing every single option through Nipype seems daunting, and it feeds my concern about stability: Nipype will need to be updated any time one of the underlying programs changes. And installation would likely be non-trivial (all the necessary dependencies and paths), unless it's run in NeuroDebian.

A few final thoughts: First, I've had terrible troubles before with left/right image flipping and shifting centerpoints when changing from one software package to another (and that's not even thinking about changing from one atlas to another). Trying to make packages that have different standard spaces play nicely together strikes me as very non-trivial (I don't mean code-wise, I mean keeping track of alignment). Second, would there be a benefit to using Nipype if the pipeline only includes one program (e.g. why Nipype instead of spm8 batching)? It seems in this case it might just be adding extra complication: the need to learn the program options, plus the Nipype code.

So will I recommend we try Nipype right now? No. But I'll keep an eye on it and am certainly willing to revise my opinion.

Gorgolewski K, Burns CD, Madison C, Clark D, Halchenko YO, Waskom ML, & Ghosh SS (2011). Nipype: a flexible, lightweight and extensible neuroimaging data processing framework in python. Frontiers in Neuroinformatics, 5. PMID: 21897815

Thursday, July 12, 2012

but replications are possible

Considering the posts on all the reasons that fMRI findings can be dodgy and replications unlikely, I'll share some recent experiences I've had in which proper replications succeeded.

I can't go into any detail (yet), but this image is a few slices showing the results with the first dataset in red and the second in blue. We definitely tweaked the analysis of the first dataset a bit; I did as much of the tweaking as possible with a control classification rather than the real one, but I can't claim that all the choices were made on a priori principles. I can claim, though, that the choices were made without any peeking at the second dataset: the blue blobs are the very first set of across-subjects results, produced with exactly the analysis we settled on for the first dataset. And I'm extra-pleased by these results because they are properly independent: different scanner, different people, even different stimuli (but the same task and experimental questions, of course).

Within the last week I've actually had another case of replicating results in properly independent datasets; a completely different set of analyses than the blobs above. It's an anatomical ROI-based analysis, so I don't have pretty pictures to show, but the tables are very lovely.

So do not give up hope: replications are possible. But we must be very vigilant in our methods: we are all easy to fool.

Monday, July 9, 2012

many, many options (continued)

Thanks for the comments on the previous post; here are a few thoughts and responses.


Jonas Kaplan mentioned optimizing the methodology on a pilot subject or two, who are then not included in the main analysis. That can be a great solution, especially if the pilot subjects are collected ahead of time and need to be collected regardless.

Something that's worked for me is figuring out the analysis using a positive control classification, then applying it to the real classification. For example, classifying the button presses (did the person push the first-finger or second-finger response button?). Which button was pressed is of course not of interest, but if I can't classify button pressing using motor voxels, I know something is not right, and I can use button-pressing classification accuracy as a sort of yardstick (if button-presses are classified around 80% accuracy and the task I actually care about at 75%, that's probably a strong signal).

Neither strategy (testing out methodology on different subjects or classifications) is perfect (the pilots could be much different than the other subjects, the control classification could have a radically different signal than the classification of interest, etc.), but they're better than nothing, or trying out many different analyses with the real data.
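As an illustration only (not real fMRI data), here's a minimal numpy sketch of the yardstick idea: the same leave-one-run-out pipeline applied to a simulated "button press" control classification (strong signal) and a simulated task classification (weaker signal). The nearest-centroid classifier, effect sizes, and run structure are all stand-ins I made up for the sketch, not a recommendation.

```python
import numpy as np

rng = np.random.default_rng(0)

def loro_accuracy(X, y, runs):
    """Leave-one-run-out cross-validation with a nearest-centroid classifier."""
    correct = 0
    for run in np.unique(runs):
        train, test = runs != run, runs == run
        # class centroids estimated from the training runs only
        centroids = np.array([X[train & (y == c)].mean(axis=0) for c in (0, 1)])
        # assign each test example to the class of the nearer centroid
        dists = np.linalg.norm(X[test, None, :] - centroids[None, :, :], axis=2)
        correct += np.sum(dists.argmin(axis=1) == y[test])
    return correct / len(y)

# toy dataset: 6 runs, 10 examples per run, 50 voxels
runs = np.repeat(np.arange(6), 10)
y = np.tile([0, 1], 30)

# control classification (button presses): strong, easy signal
X_control = rng.normal(size=(60, 50)) + y[:, None] * 1.0
# task classification of interest: weaker signal
X_task = rng.normal(size=(60, 50)) + y[:, None] * 0.3

acc_control = loro_accuracy(X_control, y, runs)
acc_task = loro_accuracy(X_task, y, runs)
print(acc_control, acc_task)  # control should be well above chance (0.5)
```

If the control accuracy came out near chance, that would flag a problem in the pipeline before any conclusions were drawn about the task classification.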

how big an impact?

An anonymous person commented that they've found that often a variety of methodological choices produces similar results, mentioning that sometimes a change will increase accuracy but also variability, resulting in similar conclusions/significance.

My experience is similar: sometimes changing a factor has a surprisingly small impact on the results. Classifier type comes to mind immediately; I've tried both linear svms and random forests (very different classifiers) on the same datasets with very similar results. Changing the number of cross-validation folds doesn't always make a big difference, though changing schemes entirely can (e.g. partitioning on the runs versus ignoring the runs altogether).

I've found that some choices have made very large differences in particular datasets; enough to change the interpretation completely.
  • temporal compression. I've seen massive differences in results when compressing to one example per run instead of one example per block; much more accurate with more averaging. I interpret this as the increase in signal/noise (from averaging more volumes together) outweighing the decrease in the number of datapoints (fewer examples with more compression).
  • scaling and detrending. Some datasets, particularly when classifying across subjects, give very different results depending on which scaling (i.e. normalizing) technique was used: across volumes within a single voxel, or across voxels within a single image.
  • balancing. This one's a bit different; I mean choosing a subset of examples if necessary to ensure that equal numbers of each class are sent to the classifier. For example, suppose we've compressed to one example per block, and are partitioning on the runs, but are only including blocks the person answered correctly. We might then have different numbers of examples in the classes in each cross-validation fold. I usually handle this by subsetting the bigger class at random - some of the examples are left out. Since there are many ways to choose which examples are left out, I usually do ten different random subsets and average the results. Sometimes which examples are included can have a very large difference in accuracy. In these cases I often look at the processing and analysis stream to see if there's a way to make the results more stable (such as by increasing the number of examples in the training set).
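Here's a minimal sketch of the balancing step as I described it, with made-up numbers (14 usable examples in one class but only 10 in the other); the classification itself is omitted.

```python
import numpy as np

rng = np.random.default_rng(1)

def balance(y, rng):
    """Indices of a random subset with equal numbers of each class
    (the larger class is subsampled; the smaller is kept whole)."""
    idx0 = np.flatnonzero(y == 0)
    idx1 = np.flatnonzero(y == 1)
    n = min(len(idx0), len(idx1))
    keep = np.concatenate([rng.choice(idx0, size=n, replace=False),
                           rng.choice(idx1, size=n, replace=False)])
    return np.sort(keep)

# toy labels: 14 correct-only examples of class 0 but just 10 of class 1
y = np.array([0] * 14 + [1] * 10)

# ten different random balanced subsets; in the real analysis each subset
# would be classified separately and the resulting accuracies averaged
subsets = [balance(y, rng) for _ in range(10)]
counts = [np.bincount(y[keep]) for keep in subsets]
print(counts[0])  # each subset keeps 10 examples of each class
```

Averaging over the ten subsets smooths out the luck of which particular examples get dropped; if the spread across subsets is large, that's a warning sign in itself.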

Monday, July 2, 2012

many, many options

Let me join the chorus of people singing the praises of the recent article "False-positive psychology". If you haven't read the paper yet, go read it, especially if you don't particularly like statistics. One of the problems they highlight is "researcher degrees of freedom." In the case of MVPA I'd summarize this as having so many ways to do an analysis that if you know what sort of result you'd like you can "play around" a bit until you find one that yields what you'd like. This isn't cheating in the sense of making up data, but more insidious: how do you determine what analysis you should perform, and when to accept the results you found?

Neuroskeptic has a great post on this topic, listing some of the researcher degrees of freedom in analyzing a hypothetical fMRI experiment:
"Let's assume a very simple fMRI experiment. The task is a facial emotion visual response. Volunteers are shown 30 second blocks of Neutral, Fearful and Happy faces during a standard functional EPI scanning. We also collect a standard structural MRI as required to analyze that data."
What are some of the options for analyzing this with MVPA? This is not an exhaustive list by any stretch, just the first few that came to mind.
 temporal compression
  • Average the volumes to one per block. Which volumes to include in the average (i.e. to account for the hemodynamic lag)?
  • Create parameter estimate images (PEIs) (i.e. fit a linear model and do MVPA on the beta weights), one per block. The linear model could be canonical or individualized.
  • Average the volumes to one per run. Calculate the averages from the block files or all at once from the raw images.
  • Create one PEI for each run.
  • Analyze individual volumes (first volume in each block, second volume in each block, etc).
 classification algorithm
  • the "default": linear svm, c=1.
  • a linear svm, but fit the c.
  • a nonlinear svm (which type?).
  • a different classifier (random forest, naive bayes, ....).
  • correlation-based
  • linear discriminants (multiple options)
 cross-validation partitioning
  • on the runs
  • on a combination of runs (first two runs out, next two out, etc)
  • ignoring the runs (ten-fold, leave-three-examples-out, etc)
  • on the subjects (leave-one-subject-out)
  • on the runs, but including multiple subjects
 voxel selection
  • whole-brain
  • ROI (anatomical, functional, hybrid)
  • searchlight (which radius? which shape? how to combine across subjects?)
 other preprocessing
  • resize the voxels?
  • scale (normalize) the data? (across voxels within an example, across examples?). Center, normalize the variance, take out linear trends, take out nonlinear trends?
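Just to make the combinatorics concrete, here's a toy enumeration; the option names are paraphrased from the lists above, and the counts are illustrative rather than exhaustive (many options also have sub-options).

```python
from itertools import product

# hypothetical option lists, one per category from the lists above
temporal_compression = ["block average", "block PEI", "run average",
                        "run PEI", "single volumes"]
classifier = ["linear svm c=1", "linear svm fit c", "nonlinear svm",
              "random forest", "naive bayes", "correlation", "lda"]
partitioning = ["runs", "run pairs", "ignore runs", "subjects",
                "runs + subjects"]
voxels = ["whole-brain", "ROI", "searchlight"]
scaling = ["none", "across voxels", "across examples", "detrend"]

# every combination of one choice per category is a distinct pipeline
pipelines = list(product(temporal_compression, classifier, partitioning,
                         voxels, scaling))
print(len(pipelines))  # 5 * 7 * 5 * 3 * 4 = 2100
```

Over two thousand distinct analyses from a handful of modest lists: ample room to "play around" until something works, if we let ourselves.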
Simmons, Nelson, and Simonsohn argue that we (as authors and reviewers) need to be clear about why we chose particular combinations, and how sensitive the results were to those choices. For example, there is not always a clear choice as to the best cross-validation scheme for a particular analysis; several may be equally valid (leave-three-out, leave-four-out, leave-five-out, ...). If you only try one scheme and report it, that's fine. But if you try several and then report only the one that "worked", you've greatly increased the odds of finding a false positive. We need to be honest about how stable the results are to these types of (sometimes arbitrary) analytical choices.

Simmons JP, Nelson LD, & Simonsohn U (2011). False-positive psychology: undisclosed flexibility in data collection and analysis allows presenting anything as significant. Psychological Science, 22(11), 1359-66. PMID: 22006061