This week, Richard Morey et al.’s PRO (Peer Reviewer Openness) Initiative launched, the revamped version of their Agenda for Open Research. The PRO Initiative is a laudable step by a group of devoted Open Science proponents to make our science more transparent. And for good reason – science should be open, and accessible to everyone. The PRO Initiative aims to do this by asking reviewers to withhold in-depth review of academic papers if data and materials are not made open. The arguments for improving the way we handle data are compelling. The present practice, in which data is ‘available on request’, simply does not work, as has been shown several times. Moreover, data sharing encourages collaboration and emphasizes that science is a collaborative enterprise. We’re in this together, figuring out how the world works, and hopefully making it a better place. Sharing data helps us towards that ideal.
If you’re following Richard (if you do not, you should – even if you’re not into Open Science, his posts on Bayesian statistics cannot be missed!) or other members of the PRO Initiative (whom you should follow, too, again, even if you’re not into Open Science, because they’re all pretty good bloggers with sensible things to say), you will have seen many calls to sign the Initiative. As a signatory, you pledge that, starting January 2017, you will request that authors make their data publicly available when you review a paper, and withhold further review if they do not do so without good reason. Of course I have been thinking about this, starting a couple of months ago when Richard published the first version of the ‘Agenda for Open Science’ – what is not to like about Open Science?
But something did not sit well with me. I decided to wait for the updated version of the Agenda, which now is the PRO Initiative, and there was still something that made me feel uneasy, a bit worried even. As a matter of fact, for the past weeks I have been working on a manuscript detailing these concerns, but maybe it’s better to throw some of these ideas out here and see what you think of them. I am still undecided.
I have a big problem with the PRO Initiative’s definition of ‘open data’. The PRO Initiative asks researchers to make data publicly available. Sharing data, or depositing it on a server to which only researchers have access, is not enough – data, preferably raw data, has to be publicly accessible in order to count as ‘open’. In principle, there is nothing wrong with wide open data – on the contrary. CERN streams its data live to a publicly accessible server, some major archaeological discoveries have been made in the openly accessible data of Google Maps, and undoubtedly, if you want to discover extraterrestrial life, you’re free to roam NASA’s open database of images from other worlds. So, why do I feel uneasy about open data, then? The main reason: because we (cognitive neuroscientists/psychologists) observe people. Our raw data is a detailed description of human behaviour and neurophysiology. I have a problem throwing such data out in the open.
What I dearly miss in most discussions on open data is the perspective of the research participant. All arguments are centered around scientists and the process of science. We seem to forget that our (psychologists’) data is about actual people. In the discussion on open data, participants are stakeholders, too. It’s their data (not ours) we are planning to throw on the internet. As a scientist studying human behaviour, I feel my very first responsibility is to the participants in my experiments. I am obliged to guard them as well as I can from any harm coming from their participation. Moreover, I think that they should have a strong voice in stating how their data can and should be used – stronger than that of the scientist. If a participant requests to be taken out of a dataset, so be it.
So, what may be harmful about publishing properly anonymized raw data? Well, I am trained in thinking in doomsday scenarios, so let’s come up with a potential disaster:
I participate in an fMRI/EEG experiment of a colleague in which my brain responses to a pornographic clip with very inappropriate material (insert your favourite fetish here) are measured, together with the physiological response of my Private Willy Johnson. The participant after me happens to be one of my first year students. This student unfortunately has an unhealthy obsession with his lecturer, and makes a note that I participated in this weird experiment, on Dec 3, around noon. One year later, the research paper with raw data is published. Being a good experimenter, my colleague notifies all research participants of this joyful occasion. Our student now downloads the data, and although I am known as Participant-007, our student checks the time stamps, and presto, he can now work on his blog post “How My Professor Got A Stiffy From Copulating Hippopotamuses And He Really Enjoyed It! (with data)”. Moreover, the student now also has my fMRI and EEG data. A recent study has shown that individuals can be reliably identified on the basis of their neural connectivity data, so this means my stalker can now also identify my data in the study on The Effects of Mindfulness on Believing in Bullshit – an EEG Connectivity Study, and see that I score massive bonus points on the bullshit scale for knowing who Deepak Chopra is (and actually having talked to him).
Ok, sure, this is a strictly hypothetical scenario – but it does show how vulnerable wide open data is to breaches of privacy. Open data basically means giving up your privacy to anyone who knows you participated in a particular experiment at a given time, and such knowledge can fairly easily be obtained by someone who wants to. So, delete the time stamps! Well, my colleague from the example would love to, but she pre-registered her study and she needs the time stamps to show she did collect the data after she submitted her preregistration…
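For those who like to see it spelled out, here is a minimal sketch of how little effort the time stamp trick takes. Everything in it is made up for illustration: the participant codes, the session times, the year, and the `candidates` helper are all assumptions, not data or code from any real study.

```python
from datetime import datetime, timedelta

# Hypothetical "anonymized" open dataset: participant codes plus session start times.
# All values are invented for illustration (the year is an assumption too).
sessions = [
    {"participant": "Participant-006", "start": datetime(2015, 12, 3, 10, 0)},
    {"participant": "Participant-007", "start": datetime(2015, 12, 3, 12, 0)},
    {"participant": "Participant-008", "start": datetime(2015, 12, 3, 13, 30)},
]


def candidates(sessions, observed, tolerance):
    """Return the participant codes whose session started within `tolerance` of `observed`."""
    return [
        s["participant"]
        for s in sessions
        if abs(s["start"] - observed) <= tolerance
    ]


# What the observer wrote down: "Dec 3, around noon".
observed = datetime(2015, 12, 3, 12, 0)
matches = candidates(sessions, observed, timedelta(minutes=45))
print(matches)
```

With these invented session times, a 45-minute window around noon leaves exactly one candidate. Real datasets will be messier, but the point stands: a participant code offers little protection once a time stamp can be matched against something an observer already knows.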
Fine, you say. Let’s not post such sensitive data then. The PRO Initiative leaves enough space for this – if a researcher has a good reason to not make data publicly available, she/he can say so. However, I still have some more issues.
If the PRO Initiative gains momentum, petabytes of behavioural and neurophysiological data will become publicly accessible. Given that the vast majority of our studies are carried out with undergraduate psychology students, it is relatively easy to identify particular strata (e.g. students from the class of 2015-2016 – I can just look at the timestamped data). For example, most of our freshmen are on Facebook, where they started a group page, and in the kindness of their hearts, they allowed me to be a member as well. This means I have access to all their profiles, and as such, I can compile a pretty interesting profile of the average psychology student. But with research data out in the open, I can also mine actual research data measuring validated psychological constructs. A lot will not be particularly interesting, but data about cognitive abilities, implicit prejudice, or attitudes towards political ideas may all be quite worthwhile from the perspective of, let’s say, a marketing company, or another party with an interest in nudging behaviour.
This is not a direct threat to any individual research participant (contrary to a breach of anonymity), but if I, as a research participant, come into the lab for the benefit of science, I would be somewhat displeased to find out that a shady marketing company uses my data for profiling. To aggravate matters – it is very well conceivable that matching up open research data with data from social networks or other sources can lead to identification – for example, individual ‘likes’ on Facebook predict personality; a measure I may be able to cross-reference with all the open data I just downloaded from the Groningen psychology servers. The more data, the happier my data-crunching algorithms will be.
You may think this is all very hypothetical, and very unlikely. Maybe you say I am scaremongering. Then again, that is what you need to do when thinking about research ethics, I’d say – I served on a couple of Ethics Committees over the past years. What is the absolute worst that can happen, and how likely is that scenario? My personal evaluation is that a breach of anonymity is conceivable, and in some cases even likely (e.g., when someone knows you participated in a particular experiment, which at least is not uncommon among first year participants here in Groningen).
So, yes, I am worried about the PRO Initiative. I am not at all convinced making data wide open is such a good idea. “But the data belongs to the public, the tax payer paid your salary!”, I hear you say. Well, sure, the tax payer also pays for the construction of the road they’re building next to our office building, but that doesn’t mean I (yes, I also pay taxes) can go to the construction site and help myself to a nice supply of concrete. I am entitled to use the road, however, once it’s finished. Metaphorically, the concrete is the data underlying the research paper. I think access to the research paper is what tax payers pay for – so making research papers open access should be a no-brainer to anyone.
I believe that posting raw data on the internet according to the guidelines of the PRO Initiative results in an increased risk to the well-being of my participants, a risk I am not willing to expose them to, no matter how small it is. I am happy the PRO Initiative leaves enough room to voice such concerns on an individual basis, but if the Initiative gains momentum, many research participants will be exposed to the risk of their data being used in ways they did not anticipate or consent to.
This is all the more pressing since there are excellent alternatives to ‘wide open data’ – hosting data on an institutional or national repository, for example, where access to data is regulated by an Ethics Committee or dedicated data officer, who can grant access to data on a case-by-case basis, or to registered users, independent of the researcher. As a matter of fact, this is already a requirement of many funding agencies and institutions, including mine (mind you, depositing data is required, making data publicly available typically is not!). Publicly posting raw data is – in my opinion – exposing participants to an unnecessary risk of adverse effects from their research participation. Asking other scientists to do the same, and putting pressure on them to do so, makes me feel uneasy. Maybe this feeling is unjustified, I don’t know. But in a rather long nutshell, this is why I have not signed the PRO Initiative.
So, my Open Science pledge is that I will not make raw data public in any way unless my research participants ask me to do so, nor will I ask others to expose their participants to unnecessary risk. I will pre-register my studies and upload my stimulus materials and analysis scripts to a publicly accessible place, but I will post my raw data to a reliable repository hosted by either my institution or a third party, where anyone who needs my data for research purposes can have access to it without intervention on my part. Moreover, I promise my participants that I will share their data with anyone who needs it for her or his research to advance science, but also that I will not share their data if they ask me not to. And finally, I will make sure that all my research output is openly accessible to everyone.
That’s what I have to offer, Team PRO. Hope we can still be friends?