Open data, better data?

The New England Journal of Medicine has done it – post an [editorial], invent a new term (data parasites) and nicely evoke the wrath of Twitter (or, in particular, the Open Science Movement). Nice move, NEJM!

As a quick recap, the editorial lists some pros and cons of data sharing, a practice we see shockingly little of in psychological science. The most offending bit of the editorial was that part about ‘data parasites’, a new class of researcher living exclusively from the shared data other researchers so painstakingly collected. And, God forbid, these data parasites might even use all this shared data to disprove the hypotheses of the original authors!

No, really, we don’t want that, do we now?

Anyway, it’s quite obvious that Longo and Drazen did not quite think things through. It is rather ridiculous to claim full authorship for papers in which your data is (re)used. That is what we have citations for. Similarly, if I use a Stroop task, I cite Dr Stroop rather than asking him to be a co-author. Which would kind of stretch my paranormal research quite a bit, given the good Emeritus Professor of Bible Studies has been dead for over forty years.

Summarizing, nice move, NEJM, nice move.

Nevertheless, in the context of Open Science and Open Data, there are some interesting observations to be made. I will not regurgitate all the arguments in favour of open science here. It’s pretty much a no-brainer that data sharing is -in principle- a good thing. However, as an experimentalist generating a lot of human subject data (by measuring human subjects that is, not making sh!t up), some of the arguments by Longo and Drazen do resonate with me.

Lest I evoke the wrath of Twitter, let me state that I am 100% in favour of data sharing. I am less convinced of making data publicly available, for reasons of ethics and privacy. I’ve already voiced my concerns about that earlier (and had a great talk afterwards with Rink Hoekstra, one of the co-authors of the PRO Initiative), but basically it’s a technical matter we disagree on.

In this post, I would like to put the entire Open Data ideal somewhat into perspective. If you go through Twitter feeds, it seems that science without open data is bad science, and I know that there are some people who think about it in such a way. That’s fine. However, my take on this is slightly different. I am an experimental scientist, which means I try to understand phenomena in the world by making people behave in particular ways in a laboratory setting. These laboratory tasks are necessarily abstractions from reality. Often these abstractions work, and tell us something about real world settings (e.g., learning and memory experiments). Sometimes, it does not matter, because we’re testing the limits of cognition (e.g., visual psychophysics). And sometimes, we miserably fail in capturing anything meaningful (any named example here would make someone mad). This latter case is what we call a bad experiment.

Last year we saw a brilliant example in the literature: [Sadness impairs color perception]. In this paper, it was claimed that ‘feeling blue’ leads to a specific impairment in colour perception on the yellow-blue axis. It is an interesting example, because the data underlying this paper were open. Given the claim, it did not take long before the first skeptics re-analyzed the data and found some rather serious errors in the data analysis. However, when I looked at the paper I did not have to look at the data in order to conclude something was seriously wrong: the methods did not make sense at all! The emotion manipulation was cr@p, and even worse, the measurement of colour perception was just, well, wrong! To measure colour perception, we typically rely on carefully calibrated psychophysical methods. The authors of the above-mentioned paper used a, well, suboptimal method. It suffices to say that the paper was quickly withdrawn, to the credit of the authors, but it should not have passed peer review.

Mind you, this is a paper with open data. But open data is not necessarily good data. In this case, the data is rather awful and actually completely meaningless. Sadly, bad experiments are surprisingly common. One of the problems I see with the present focus on data, and in tandem, with statistical power, is that the quality of the experiment generating the data gets overlooked. I have argued before that we cannot and should not reduce experiments to data. Even if you have N = 10,000 and a beautifully coded data sheet which is open to all, your data is still worthless if your experiment sucks. In experimental science, data is only as good as the experiment underlying it.

So how do you know whether an experiment is any good? Well, sadly, this is something that requires expertise. In particular where rather tricky concepts (things like ‘consciousness’, but also ‘colour perception’, see above), or methods (fMRI, EEG) are used, you need to know what you (or the authors of a dataset) are doing if you want to use a dataset properly and evaluate its merits. Actually, this is the reason I am not too fond of using other people’s data. I would rather replicate the studies I like and gather my own data. Sure, that is more work, but it gives me a better understanding of what is going on in a particular manipulation, and thus a better position to do science. Now, I do realize that this is not feasible for a lot of fields, but I think that for a lot of experimental work in my field, it is.

So, what I am looking for in a paper is not data per se, but a great new manipulation, or analysis method that gives new theoretical insight. The data is great, but the idea behind empirical science is that when lab A makes an observation, a decently competent experimenter in lab B should be able to make the same observation when using lab A’s methods. Rather than a paper with open data, I’d rather have a paper which shares its stimulus material and analysis scripts. Open data is to me somewhat like someone else’s toothbrush. Surely it can get the job done, and it’s very welcome if you don’t have one of your own, but I prefer my own.

The bottom line of this story is this: sure, data sharing is great. But let’s not pretend it is our Holy Grail. At least in my field it is not, and there are more important things to focus on. The focus on Open Science is great, as long as it does not steal any glory from a great experiment. It’s my impression that we’re becoming a bit too obsessed with data at the moment, at the expense of experimental methods.