Why a meta-analysis of 90 studies does not tell that much about psi, or why academic papers should not be reduced to their data

Social psychologist-turned-statistics-and-publication-ethics crusader Daniel Lakens has recently published his review of a meta-analysis of 90 studies by Bem and colleagues that allegedly shows that there is strong evidence for precognition. Lakens rips apart the meta-analysis in his review, in particular because of the poor control for publication bias. According to Lakens, who recently converted to PET-PEESE as the best way to control for publication bias, there is a huge publication bias in the literature on psi, and if one, contrary to the meta-analysis’ authors, properly controls for that, the actual effect size is not different from zero. Moreover, Lakens suggests in his post that doing experiments without a theoretical framework is like running around like a headless chicken – every now and then you bump into something, but it’s not as if you were actually aiming.

I cannot comment on Daniel’s statistical points. I have not spent the last two years freshing up my stats, as Daniel so thoroughly has done, so I have to assume that he knows to some extent what he’s talking about. However, it may be worthwhile noting that the notion of what is an effect, and how to determine its existence has become somewhat fluid over the past five years. An important part of the debate we’re presently having in psychology is no longer about interpretations of what we have observed, but increasingly about the question whether we have observed anything at all. Daniel’s critical review of Bem et al.’s meta-analysis is an excellent example.

However, I do think Daniel’s post shows something interesting about the role of theory and methods in meta-analyses as well, that in my opinion stretches beyond the present topic. After reading Daniel’s post, and going through some of the original studies included in the meta-analysis it struck me that there might be going something wrong here. And with ‘here’ I mean reducing experiments to datasets and effect sizes. We all know that in order to truly appreciate an experiment and its outcomes, it does not suffice to look at the results section, or to have access to the data. You also need to carefully study the methods section to verify that an author has actually carried out the experiment in such a way that is measured what the author claims has been measured. And this is where many studies go wrong. I will illustrate this with a (in)famous example; Maier et al.’s 2014 paper ‘Feeling the Future Again’.

To give you some more background: Daniel claims that psi lacks a theoretical framework. This statement is incorrect. In fact, there are quite some theoretical models that potentially explain psi effects. Most of these make use or abuse concepts from (quantum) physics, and a as a result many psychologists either do not understand the models, or do not bother to try understand them, and simply draw the ‘quantum waffling’ card. Often this is the appropriate response, but it’s nothing more than a heuristic.

Maier et al. (2014) did not start running experiments like headless chickens hoping to find a weird effect. In fact, they quite carefully crafted a hypothesis about what can be expected from precognitive effects. Precognition is problematic from a physics point of view, not because it’s impossible (it isn’t), but because it creates the possibility for grandfather paradoxes. In typical precognition/presentiment experiments, an observer shows an anomalous response to an event that will take place in the near future, let’s say a chandelier falling down from the ceiling. However, if the observer is aware of his precognitive response, (s)he can act in order to prevent the future event (fixing new screws to the chandelier). However, now said event will not occur anymore, so how can it affect the past? Similarly, you cannot win roulette using your precognitive powers – any attempt to use a signal from the future to intentionally alter your behaviour leads to time paradoxes.

In order to avoid this loophole, Maier et al. suggest that precognition may only work unconsciously; that is, if there are precognitive effects they may only work in a probabilistic way, and only affect unconsciously initiated behaviour. Very superficially, this line of reasoning resembles Deutsch’s closed timelike curves proposal for time-travel of quantum information, but that’s besides the point here. The critical issue is that Maier et al. set up a series of experiments in which they manipulated consciousness of stimuli and actions that were believed to induce or be influenced by precognitive signals.

And that is where things go wrong in their paper.

Maier et al. used stimuli from the IAPS to evoke emotional responses. Basically, the methodology is this: participants had to press two buttons, left and right. Immediately after, two images would appear in on the screen, one of which would have negative emotional content. The images were masked in order to avoid them from entering conscious awareness. The idea is that participants would respond slower pressing the button at the same side as where the negative image would appear (ie., they would precognitively avoid the negative image). However, since this would be a strictly unconscious effect, it would avoid time paradoxes (although one could argue about that one).

What Maier et al. failed to do, though, is properly checking whether their masking manipulation worked. Properly masking stimuli is deceivingly difficult, and reading through their method sections, I am actually very skeptical whether they could have been successful at all. The presentation times of the masked stimuli were 1 video-frame, which would be necessary to properly mask the stimuli, but the presentation software used (E-Prime) is quite notorious for its timing errors, especially under Windows 7 or higher, with video cards low on VRAM. The authors, however, do not provide any details on what operating system they used, or what graphics board they used. To add insult to injury, they did not ask participants on a trial-by-trial basis whether the masked image was seen or not (and even that may not be the best way to check for awareness). Therefore, I have little faith the authors actually succeeded in successfully masking their emotional images in the lab. Their important, super-high-powered N=1221 study, which is often cited, has been carried out on-line. It’s very dubious whether masking was successful in this case at all.

If we follow the reasoning of Maier et al., conscious awareness of stimuli is important in getting precognitive effects (or not). Suppose that E-Prime’s timing messed up in 1 out of 4 trials, and the stimuli were visible – what does that mean for the results? Should these trials have been excluded? Can’t it be the case that such trials diluted the effect, so we end up with an underestimation? And, can’t it be that the inclusion of non-masked trials in the online experiment has affected the outcomes? Measuring ‘unconscious’ behaviour, as in blindsight-like behaviour, in normal individuals is extremely difficult and sensitive to general context – could this have played a role?

In sum, if you do not carefully check your critical manipulations you’re left with a high-powered study that may or may not tell us something about precognition. However, it matters when you include it in your meta-analysis – a study with such a high N will appear very informative because of its (potential) power, but if the methodology is not sound, it is not informative at all.

On a separate note, Maier et al.’s study is not the only one where consciousness is sloppily manipulated – the average ‘social priming’ or ‘unconscious thinking’ study is far worse – make sure you read Tom Stafford’s excellent commentary on this matter!

So, how it this relevant to Bem’s meta-analysis? Quite simply put: what studies you put in matters. You cannot reduce an experiment to its data if you are not absolutely sure the experiment has been carried out properly. And in particular with sensitive techniques like visual masking, or manipulations of consciousness, having some expertise matters. To some extent, Neuroskeptic’s Harry Potter Theory makes perfect sense – there are effects and manipulations which require specific expertise and technical knowledge to replicate (ironically, Neuroskeptic came up with HPT to state the opposite). In order to evaluate an experiment you not only need to have access to the data, but also to the methods used. Given that this information seems to be lacking, it is unclear what this meta-analysis actually tells us.

Now, one problem that you will run in to a whole series ‘No True Scotsman’ arguments (“we should leave Maiers’s paper out of our discussions of psi, because they did not really measure psi”), but some extent that is inevitable. The data of an experiment with clear methodological problems is worthless, even if it is preregistered, open, and high-powered. Open data is not necessarily good data, more data does not necessarily mean better data, and a replication of a bad experiment will not result in better data. The present focus in the ‘replication debate’ draws attention away from this – Tom Postmes referred to this ‘data fetishism’ in a recent post, and he is right.

So how do we solve this problem? The answer is not just “more power”. The answer is “better methods”. And a better a priori theoretical justification of our experiments and analysis. What do we measure, how do we measure it, and why? Preferably, such a justification has to be peer-reviewed, and ideally a paper should be accepted on basis of such a proposal rather than on basis of the results. Hmm, this sounds somewhat familiar