Thursday, April 18, 2013

below-chance accuracy: case studies

Reading my previous post, I realized it was missing some concrete examples: why did I give some of that advice? So here are two case studies from my experience. I'd love to hear others - put them in the comments or send them to me directly (I'll post them if you'd like, with or without your name).

Also, I am not particularly concerned if a few subjects have somewhat below-chance accuracy. There are no hard-and-fast rules, but with at least a dozen subjects, if one or two classify around 0.45 while the others are above 0.6, I wouldn't be too concerned (assuming chance = 0.5 and a within-subjects analysis). I'd be much more worried if some subjects classify at 0.2 while others are at 0.9, or if something like a third of the subjects are below chance.
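To make that rule of thumb concrete, here is a minimal sketch (mine, not a standard tool) of the sort of eyeballing I mean; the accuracy values are made up for illustration:

```python
import numpy as np

# made-up per-subject mean accuracies (chance = 0.5, within-subjects)
accs = np.array([0.45, 0.62, 0.66, 0.71, 0.60, 0.58, 0.64, 0.69, 0.61, 0.63, 0.67, 0.70])
chance = 0.5

n_below = np.sum(accs < chance)
print(f"{n_below} of {len(accs)} subjects below chance; "
      f"range {accs.min():.2f} to {accs.max():.2f}")

# rough flags in the spirit of the text: many subjects below chance,
# or some subjects far below chance
if n_below > len(accs) / 3 or accs.min() < 0.3:
    print("worrisome pattern: check the dataset and labels before trusting group results")
```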

first: check the dataset

This is my first piece of advice because of a project I was involved in several years ago. Early analyses of that dataset produced the strangest pattern of results I have ever seen (to this day). Most of the accuracies were not below chance but implausibly above chance: accuracies I'd expect on some sort of primary motor task, not the subtle cognitive effect we were actually testing. Accuracy also scaled with the number of voxels in the ROI: larger ROIs classified better.

It took a lot of checking to find the problem. We weren't even really sure there was a problem; the results just "felt" too good, and the relationship between ROI size and accuracy was worrisome. We finally tracked the problem down to mislabeled examples: another researcher had generated the PEIs (beta-weight images) I used as the MVPA input, and I labeled which PEI was which according to my notes ... but my notes were wrong. Once we realized this and fixed the labels, everything made much, much more sense. But it took weeks to sort out.

While not a case of straight below-chance accuracy, this experience convinced me that just about anything can happen if something goes wrong in the data preparation (preprocessing, labeling, temporal compression, etc.). And it is very, very easy to have something go wrong in the data preparation, since it is so complex. This is partly why I'm keen on performing a quality-checking classification: first classifying something that is not of interest (and so doesn't eat up too many experimenter degrees of freedom) but that should have a very strong signal, like a movement. Hopefully this procedure will catch some dataset-level errors - if the easy classification fails, something is not right.
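As an illustration only, a quality-check classification could look something like the sketch below. The scikit-learn calls are real, but the variable names, the simulated data, and the 0.7 cutoff are assumptions I made up for this post, not code from the project described above:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(0)
n_examples, n_voxels = 40, 200
X = rng.normal(size=(n_examples, n_voxels))      # examples x voxels, e.g. PEIs from one ROI
y = np.tile([0, 1], n_examples // 2)             # strong-signal labels, e.g. left vs. right hand
runs = np.repeat(np.arange(4), n_examples // 4)  # run membership, for leave-one-run-out CV
X[y == 1, :20] += 1.0                            # fake a strong signal in 20 voxels

accs = cross_val_score(SVC(kernel="linear"), X, y,
                       groups=runs, cv=LeaveOneGroupOut())
print("fold accuracies:", np.round(accs, 2), " mean:", round(accs.mean(), 2))
if accs.mean() < 0.7:  # arbitrary cutoff: a movement contrast should classify very well
    print("warning: the 'easy' classification failed - check the dataset")
```

The point is not the particular classifier or threshold, but that the strong-signal classification runs through exactly the same preprocessing, labeling, and cross-validation machinery as the analysis of interest, so a failure flags a dataset-level problem.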

aiming for stability

I'm currently working with a dataset with a very complex design. I started by trying to classify button presses (which finger) in motor areas, within-subjects. Classification was possible, but highly unstable: huge variability across cross-validation folds and between subjects (some people classified perfectly, others well below chance). Accordingly, the group-level results (e.g. a t-test for accuracy > chance) were lousy.

I then tried adjusting some of the analysis choices to see whether the button-press classification could be made more stable. In this case, I changed from giving the classifier individual examples (first button press, second button press, etc. - one example per button press) to classifying averaged examples (e.g. averaging all the button "a" presses in the first run together). This gave me far fewer examples to send to the classifier (just one per button per run), but accuracy and stability improved drastically. We later followed the same averaging strategy for the real analysis.

In this case, I suspect the individual examples were just too noisy for adequate classification; averaging the examples improved the signal more than reducing the number of examples hurt the classifier (a linear SVM for this particular project).
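For concreteness, here's a rough sketch of that averaging step, written with scikit-learn. The variable names and the random data are made up for illustration (pure noise, so accuracy here will hover around chance); it is not the project's actual code, just the shape of the strategy:

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.model_selection import LeaveOneGroupOut, cross_val_score

rng = np.random.default_rng(1)
n_trials, n_voxels = 120, 300
X = rng.normal(size=(n_trials, n_voxels))      # one row per individual button press
buttons = np.tile(["a", "b"], n_trials // 2)   # which button was pressed on each trial
runs = np.repeat(np.arange(6), n_trials // 6)  # run membership of each trial

# average within run and button: one example per button per run
X_avg, y_avg, run_avg = [], [], []
for r in np.unique(runs):
    for b in np.unique(buttons):
        sel = (runs == r) & (buttons == b)
        X_avg.append(X[sel].mean(axis=0))
        y_avg.append(b)
        run_avg.append(r)
X_avg, run_avg = np.array(X_avg), np.array(run_avg)

# leave-one-run-out cross-validation on the averaged examples, linear SVM
accs = cross_val_score(SVC(kernel="linear"), X_avg, y_avg,
                       groups=run_avg, cv=LeaveOneGroupOut())
print("per-fold accuracy on the averaged examples:", np.round(accs, 2))
```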
