The Open Data Pitfall II – Now With Data

Yesterday I wrote something on why I think providing unrestricted access to data from psychological experiments, as advocated by some, is not a good idea. Today I had the opportunity to actually collect some data on this issue, from the people who are neglected in this discussion: the participants.

I used Mentimeter to ask the 60 first-year students who showed up for my Biopsychology lecture whether they would participate in an experiment whose data would be made publicly available.

At the beginning of the lecture, I gave a short introduction on open data. I referred to the LaCour case and to Wicherts et al.’s work on the lack of willingness to share data, and emphasized the necessity of sharing data. I also mentioned that there is an ongoing debate on how data should be shared: some researchers are in favour of storing data in institutional repositories, whereas others are in favour of posting data on publicly accessible repositories. I then explicitly said I would give my own thoughts on the matter only after I had asked them two short questions via Mentimeter.

I read out two vignettes to the students:

1. “Imagine you signed up via Sona for one of *name of researcher*’s studies on sexual arousal. Data of the study will be shared with other researchers. The dataset will be anonymized – it may contain some information such as your gender and age, but no personally identifiable information. Would you consent to participate in this study?”

2. “Imagine you signed up for the same study. However, now *name of researcher* will make the data publicly available on the internet. This means other researchers will have easier access to it, but also that anyone, such as your fellow students, companies, or the government, can see the data. Of course, the dataset will be anonymized – it may contain your gender, or age, but no personally identifiable information. Would you consent to participate in this study?”

After each vignette, they submitted their response via Mentimeter.

As I said, respondents were 60 first-year psychology students of the international bachelor’s programme in psychology at the University of Groningen, most of them German. In my experience, this population generally guards its privacy a lot more closely than its Dutch counterparts do – please keep this in mind.

The results? For scenario 1, 13.3% indicated they would *not* participate. This percentage suggests the data may be a bit skewed – for most studies I run (EEG work on visual perception and social interaction) I have a non-consent rate of about 5 to at most 10%. For my TMS work this can go up to 33%. However, given the nature of the research I used as an example (I named a researcher they know, and her research involves the role of disgust in sexual arousal – stuff like touching the inside of a toilet bowl after watching a porn clip), 13.3% might not be totally unreasonable.

For scenario 2, the percentage of non-consenters was obviously higher. But not just a little bit – it went up to a whopping 52.4%. More than half of the students present indicated they would not want to participate in this study if the data were to be made publicly available, even though I clearly indicated all data would be anonymized.

The Mentimeter result can be found here. Please note that there are 61 votes for vignette 2; one student was late and voted only for vignette 2. Feel free to remove one ‘no’ vote from the poll – it’s now 51.6% non-consenters.
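The arithmetic behind these percentages can be checked with a short script. Note that the raw vote counts (8, 32, and 31 ‘no’ votes) are inferred from the percentages reported above, not taken from the actual Mentimeter export:

```python
# Reconstructing the poll counts from the percentages reported above.
# The 'no' counts (8, 32, 31) are inferred, not from the raw Mentimeter data.

def non_consent_pct(no_votes, total):
    """Raw percentage of non-consenters."""
    return 100 * no_votes / total

# Scenario 1: 8 of 60 students declined -> ~13.3%
assert abs(non_consent_pct(8, 60) - 13.3) < 0.1
# Scenario 2: 32 of 61 declined -> ~52.4%
assert abs(non_consent_pct(32, 61) - 52.4) < 0.1
# Scenario 2 with the late voter's 'no' removed: 31 of 60 -> ~51.6%
assert abs(non_consent_pct(31, 60) - 51.6) < 0.1
```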

What does this tell us? Well, there are some obvious caveats. First of all, this was a very ad-hoc experiment in a rather select and possibly biased group of students (i.e., students who took the trouble of going to a lecture from 17:00 to 19:00 in a lecture hall 15 minutes from the city centre, knowing I would lecture about consciousness, my favourite topic). Second, the experimenter (me) was biased, and even though I explicitly mentioned I would only give my view after the experiment, we all know how experimenter bias affects the outcome of experiments. Maybe I did not defend the ‘open’ option furiously enough. Maybe I made a weird face during vignette 2. Finally, the vignettes I used were about experiments in which potentially sensitive data (sexual arousal) are collected.

Nevertheless, I was surprised by the result. I expected an increase in non-consent, but not to such an extent that more than half would decline. Either I am very good at unconsciously influencing people, or this sample actually has a problem with having their data made publicly accessible. Anyway, it confirmed my hunch that in the debate on open data we should involve the people it is really about: our participants.

I do not wish to use this data as a plea against open data. But I do think researchers should talk to participants. Have a student on your IRB if you use first-year participant pools, or otherwise someone from your paid participant pool. Set up a questionnaire to find out what participants find acceptable with regard to data sharing. In the end, if you post a dataset online without restrictions, it’s *their* data and *their* privacy that are at stake.

As a side note, going through some paperwork about consent forms, it actually turned out that data storage and sharing in my default consent form is phrased as follows:

“My data will be stored anonymously, and will only be used for scientific purposes, including publication in scientific journals.”

This formulation, which is prescribed by my IRB, allows for data sharing between researchers, but forbids unrestricted (open) publication. I was actually quite happy to rediscover this – it means I can adhere to the Agenda for Open Research (or rather, can decline to adhere to it with good reason): public publication of data would be a breach of consent in this case. If I were to put my data publicly online, I could not keep my promise that the data would only be used for scientific purposes.

But why not add something to the informed consent?

“The researcher will take care that my data is stored at an institutional repository and guarantees that she or he will share my data upon request with other researchers.”

Everybody happy.

The pitfalls of open data

TL;DR summary: some data can be made publicly available without any problems. A lot of data, however, cannot. Therefore, unrestricted sharing should not be the default. Instead, all data could be hosted on institutional repositories to which researchers can get access upon request to the institution.

Data is an essential part of research, and it is a no-brainer that scientists should share their data. The default approach is and has been ‘share on request’: if you’re interested in a dataset, you simply e-mail the author of a paper and ask for the data. However, it turns out that this does not work very well. Wicherts, Borsboom, Kats, & Molenaar (2006) have shown, for example, that authors are not really enthusiastic about sharing data, something not unique to psychology.

This is bad, and not just for the sake of scientific progress – recently, social science has seen another data-fabrication scandal, in which a graduate student faked his data for a study published in Science (you would think they had learned their lesson at Science after Stapel, but sadly, no). Making data available with your publication at least a) shows that you actually conducted the study, and b) allows others to (re)use your data, saving work in the end.

It is therefore not surprising that there is now an open research movement calling for full transparency in research, including making all research data public by default. I totally support open research, and I have considered signing the ‘Agenda’ several times. After a discussion on the ISCON Facebook page, I have now decided not to.

As a matter of fact, the discussion has convinced me that making all research data publicly available without restriction by default is in fact a bad idea.

Before the flame-war starts, let me point out that I am not against sharing data between researchers, or even against compulsory data sharing (i.e., if an author refuses to share data without good reason, her/his boss will send the data). However, I disagree with unrestricted data publishing, i.e. putting all data online where anyone (including the general public) can access it. I am strongly in favour of a system where data is deposited at an institutional repository and anyone interested in the data may ask for access, if necessary even without the consent of the author.

Let me illustrate my concerns with the following thought experiment. You participate in an experiment on sexual arousal, and have to fill out a questionnaire about how aroused you are after watching a clip with the most depraved sex acts. Your data is stored anonymously, and will be uploaded to GitHub directly after the experiment (see Jeff Rouder’s paper on an implementation of such a system). Would you give consent?

In this example, I might. I can always fake my response on the questionnaire, should I feel something tingling in my nether regions, to avoid embarrassment.

For the next experiment, the study is repeated, but we’re now measuring physiological arousal (i.e., the response of your private parts to said depraved sex acts). Again, the data will be uploaded directly to GitHub after the experiment.

Now, I would be a bit uncomfortable. Suppose I got sexually aroused (or not – it actually does not matter, the behaviour of my private Willy Johnson is not anyone’s business besides my own and my wife’s, and for this one occasion, the researcher’s). This is now freely available for anyone to see. And by the timestamp on the file, I may be identified by the one or two students who saw me entering the sex research room for the 12:00 session on June 2nd. Unlikely, but not impossible. Oh sure, remove the timestamp then! Yes, but how is a researcher then going to show (s)he collected all the data after preregistering his/her study and not before (or did not fabricate the data on request after someone asked for it)?

Ok, let’s take it a step further. We still measure the response of your nether regions, but now we also ask you to have your fingerprint scanned and stored with the data.

Making this data publicly available would be a huge no for me. Fingerprints are unique identifiers – are you mad?

But now replace ‘fingerprint’ with raw EEG data. We do not often consider this, but EEG data is as uniquely identifiable as fingerprints. I can recognize some of my regular test subjects and students from their raw EEG data – shape and topography of alpha, for example, are individual traits and may be used to identify individuals if you really, really want to.

One step further: individual, raw fMRI data, associated with your physiological ‘performance’ on this sex rating task. Rendering a 3D face from the associated anatomical image is trivial – it’s one of the first things you do (for fun!) when you start learning MRI analysis. How identifiable do you want to have your participant? And note that raw individual fMRI data cannot be interpreted without the anatomical scan – you need the latter to map activations on brain structures.

So, don’t publish the raw data then! Sure, that fixes some problems, but creates others. What if I want to re-analyze a study’s data because I do not agree with the authors’ preprocessing pipeline and would rather try my own? For this I would still have to ask the author for the full dataset. Mind you – most researcher degrees of freedom for EEG and fMRI lie in the preprocessing of the data (e.g., which filters you use, which corrections you apply, which rereferencing you apply, etc.), and aggregate datasets, such as those published on Neurosynth, do not allow you to reproduce a preprocessing pipeline.

But the main problem is that much data, or many patterns in data, can be used as unique identifiers – even questionnaire, reaction time, or psychophysics data. Data mining techniques can be used to find patterns in datasets, such as Facebook likes, that can be used for personal identification. What’s to stop people from running publicly available research data through such algorithms? Unlikely? Sure. Very much so, even. Impossible? Nope.
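As a toy illustration of how few ‘anonymous’ attributes it takes to single someone out, consider the following sketch. The records and session times are entirely invented; the point is only that intersecting a couple of innocuous fields can leave exactly one candidate:

```python
# Toy example: how quickly innocuous attributes narrow down a
# 'de-identified' dataset. All records below are invented.
records = [
    {"gender": "F", "age": 21, "session": "2015-06-02 12:00"},
    {"gender": "M", "age": 23, "session": "2015-06-02 12:00"},
    {"gender": "F", "age": 21, "session": "2015-06-02 13:00"},
    {"gender": "M", "age": 21, "session": "2015-06-02 13:00"},
]

def matches(records, **known):
    """Return all records consistent with what an observer already knows."""
    return [r for r in records
            if all(r[k] == v for k, v in known.items())]

# Knowing only the gender still leaves two candidates...
print(len(matches(records, gender="F")))  # 2
# ...but gender plus a session timestamp pins down a single record.
print(len(matches(records, gender="F", session="2015-06-02 12:00")))  # 1
```

This is exactly why the file timestamp in the earlier thought experiment matters: it is one more column for the intersection.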

Of course, my thought experiment deals with a rather extreme example – I guess that very few people are willing to have their boy/girl-boner data in a public database for everyone to see. So let’s take another example: visual masking. What can go wrong with that? Well – performance on a visual masking task may be affected by illnesses such as schizophrenia, or by being related to an individual with schizophrenia. Is that something you want to be publicly accessible? There are many other examples. Data reveals an awful lot about participants, and it is not at all clear how much data is needed to identify people. It may be less than we think.

I fully realize that the scenarios I put forward here are extreme and hypothetical, and I am sure some people will think I am fearmongering, making a fuss, and maybe even an enemy of open science. Ok, so be it. I think that we as scientists have a responsibility not only to each other, but even more so to our participants. People participating in our studies are the lifeblood of what we do and deserve our utmost respect and care. They provide us with often very intimate data, trusting us to handle that data conscientiously, and they contribute their data for science. We need to protect their privacy. Just putting all data online for everyone to see does not fit with that idea. There is always a potential for violations of privacy, but making all data public also opens up the data to, say, the government, insurance companies, and marketeers, for corporate analyses, marketing purposes, and other goals than the progress of science. Do we want that?

Maybe I should give another example – what about video material? Suppose you carried out an experiment in which you taped participants’ emotional responses to shocking material. Even if I were to blur out faces to prevent identification, and my IRB were ok with publishing these clips, I would still not submit such material to a public repository for every Tom, Dick, and Harry to browse clips of crying participants.

I am not saying these are realistic scenarios, but they are worth giving some thought – at least, more than people are doing now.

There are and will be many datasets that can be made publicly available without any concern at all. I’ve got a feeling that the authors of the Agenda for Open Research primarily work with such datasets, but do not sufficiently realize that a lot of sensitive data is being collected as well. The ideal of all data being made public by default does not fit well with my idea of being a responsible experimenter. And there is a clear ‘grey zone’ here. Not everyone will share my concerns. Some will even say I am making a fuss over nothing. But I would like to be able to carry out my job with a clear conscience – towards my colleagues, but most of all towards my participants. And that means I will not make every dataset I collect publicly available, even if this entails that the signatories of the Agenda for Open Research will not review my paper because they do not agree with my reasons for not making a given dataset publicly available. Too bad.

So, you want access to data I did not make publicly available, but I am on extended leave? There is a fix! And actually, this fix should appeal to the Open Research movement too.

For every IRB-approved experiment, require authors to deposit their data at an institutional repository – all data and materials, that is: raw data, stimuli, analysis code and scripts. The whole shebang, all documented, of course. Authors are free to give anyone they want access to this data. Scientists interested in the data can request access via a link provided with the paper. In principle, the author will provide access, but if no reply is given within a reasonable term (let’s say two weeks), or the data is not shared without proper reason, the request is forwarded to the Director of Research (or another independent authority), who then decides.
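The routing logic of this scheme is simple enough to write down. A minimal sketch, assuming four possible outcomes of a request (the status labels and function name are mine; the two-week term is the “reasonable term” suggested above):

```python
# Sketch of the access-request routing described above. The status labels
# and function name are illustrative inventions; only the two-week term
# comes from the proposal itself.

REPLY_TERM_DAYS = 14  # "a reasonable term (let's say two weeks)"

def route_request(status, days_waited):
    """Decide who handles a data-access request.

    status: 'shared', 'refused_with_reason', 'refused_no_reason', 'no_reply'
    """
    if status in ("shared", "refused_with_reason"):
        return "resolved_by_author"      # the author settled it properly
    if status == "no_reply" and days_waited <= REPLY_TERM_DAYS:
        return "wait_for_author"         # still within the reasonable term
    # No reply after the term, or a refusal without proper reason:
    return "forward_to_director_of_research"

print(route_request("no_reply", 5))    # wait_for_author
print(route_request("no_reply", 20))   # forward_to_director_of_research
```

The key design point is that the author stays the default gatekeeper; the independent authority only enters when the author is silent or refuses without proper reason.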

In Groningen, we have such a system in place. It ensures that for every published study the data is accounted for, and that access to the raw data can be granted if an individual requests it. The author of a study controls who has access to the data, but can be overruled by the Director of Research. It works for me, and I do not see the added benefits of unrestricted access over this system. Working in this way makes me feel a lot better. I can only hope that the signatories of the Agenda for Open Research consider this practice to be open enough.