An interdisciplinary way to get more reliable knowledge in empirical, social and behavioral sciences

The empirical, social and behavioral sciences, such as psychology, have made significant advances in mathematical modeling. However, all of these fields share a common problem that can be called the “replication/confidence crisis.” Dr. Frank Zenker, the principal investigator of the “Models, Theories and Research Program” project, and his team stress that reliable scientific knowledge requires empirical research results that are replicated independently, and they propose to develop a new research program to solve the replication crisis in these sciences.

Models, Theories and Research Program (MTR) is a project coordinated by the Department of Philosophy and the Master of Arts Program in Cognitive Sciences at Boğaziçi University and funded by the TÜBİTAK 2232 International Fellowship for Outstanding Researchers Program. The principal investigator, Dr. Frank Zenker, a researcher in philosophy of science and cognitive science from Lund University, and his team of graduate and undergraduate students started work in November 2019. Their aim is to develop a useful research program for the ESBS (empirical, social and behavioral sciences) that brings together insights from the philosophy of science and the philosophy of statistics.

Dr. Frank Zenker and the MTR researchers Oğuz Erdin, Müge Kuyumcuoğlu, Kaan Arıkan, Burçe Gümüşlü, Aran Arslan, Seçil Aracı and Buse Kurtar answered our questions about their project.

What is the “replication/confidence crisis” in empirical social and behavioral sciences (ESBS) and why are we seeing this crisis in ESBS and not in physical, mathematical and natural sciences?

As the 2019 National Academies of Sciences guideline on “Reproducibility and Replicability in Science” makes clear, reliable scientific knowledge requires empirical research results that are replicated independently. The replication crisis in the ESBS arises because one must estimate that the majority of research results here fail to replicate. This rightly leads to questioning the confidence one should have in published and current ESBS-research results. As a non-empirical science, mathematics simply cannot experience a similar crisis. Empirical sciences such as physics or chemistry, by contrast, are generally better than the ESBS at using available experimental and statistical methods. They also accept greater risks by emphasizing the value of successful theoretical prediction. By contrast, the dominant ESBS-standards for collecting and evaluating data look comparatively “aged.” The null-hypothesis significance testing approach (NHST), for instance, is of extremely limited use, because NHST cannot show that data support a substantial theoretical hypothesis. NHST has nevertheless remained the most widely used statistical approach in the ESBS, its results being regularly over-interpreted. Bayesian methods, which are currently “en vogue,” face other problems that mainly relate to small sample sizes and unreasonable laxness in selecting the prior probability of hypotheses. Before data collection starts, after all, the Bayesian approach allows researchers to favor a hypothesis over rival hypotheses on purely subjective grounds.

“ESBS could improve the quality of their research results by using larger samples together with better statistical methods”

Of course, unlike when natural scientists work with particles, waves, or DNA-strands, achieving well-controlled experimental conditions is more difficult for ESBS-researchers, because human responses not only display large heterogeneity but are also subject to feedback effects. We claim that the ESBS could nevertheless vastly improve the quality of their research results by using larger samples together with better statistical methods, by improving meta-analytical methods, and by engaging in theory construction.

Why has this crisis become more apparent in the last 10 years, as you noted in the press release?

Various aspects of the crisis have been known roughly since the 1970s. A veritable crisis only arose as awareness of the problem’s magnitude within the ESBS increased. Highly influential in this respect was John Ioannidis’s 2005 paper “Why Most Published Research Findings Are False,” especially because it focuses on experimental research in medical science.

As the largest and most influential ESBS, psychology had its “coming out” with papers by Pashler and Wagenmakers, in 2012, and the Open Science Collaboration, in 2015. Around that time, a politically-driven “push-back” against science not only coined terms such as “fake news,” but also eroded public confidence in science. Among other things, this led to science-internal conceptual research on how the crisis may be overcome.

What are the causes of this crisis in terms of the empirical structures of ESBS? You claim that this crisis is a result of having arranged research efforts in ways that misapply and over-interpret statistical methods. Can we then say that one of the reasons lies with the people who apply these methods?

Multiple science-internal and science-external causes are working together.

As for internal causes, most ESBS research is data-driven, researchers often misapply statistical methods, and they regularly over-interpret or overstate the scientific significance of such empirical results. A related problem arises from a habit of assuming “exactly zero effect of site, experimenter, stimuli, task, instructions, and every other factor except subject,” resulting in a mismatch between a general verbal statement of a theoretical hypothesis and its statistical expressions, and leading to a “generalizability crisis.” As for external causes, a strong preference for merely novel results has led many ESBS researchers to recognize the full value of replication research only recently. In fact, funding, publication, and promotion incentives are still biased towards novelty. In technical terms, statistically significant empirical results broadly remain sufficient for publication, without due emphasis on replication. In sum, the ESBS engage in rather questionable practices that prove slow to change.

“Our project's results can improve how ESBS researchers theorize their research areas”

You also claim there is a lack of general theory in ESBS to provide more accurate predictions. Can this kind of a theory be developed bearing in mind that the disciplines in ESBS are widely differentiated in terms of research methods and what can be the main characteristics of this general theory?

Part of the challenge in the MTR project is to contribute worked-out answers to this question. The project engages with the few examples of rudimentary theories that the ESBS have developed. We bring (niche-) research from the philosophy of science to bear on these theories, specifically the semantic view of empirical theories. We are confident that our project's results can improve how ESBS researchers theorize their research areas. As the MTR project has started only very recently, however, it’s simply too early to tell.

Are the current academic standards for publishing and obtaining funding among the obstacles to reaching more accurate results in ESBS, and what might be some alternatives for changing the current system?

Pairing publication pressure with a merely novelty-seeking research culture certainly does not help, for it leads to making one-off “discoveries” without engaging in replication or theory construction. This has downstream effects, for instance where statistical training of ESBS researchers is tailored accordingly, or where the poor quality of such training leads to an honest but entrenched misunderstanding of fundamental notions. A heavy dose of training in philosophy of science could improve graduate education in the ESBS.

“The main obstacle in the ESBS is a lack of serious theory-construction efforts”

Rather than applying the “follow the crowd” heuristic, moreover, PhD students and post-doctoral researchers should critically question what senior researchers find normal today. Even if training, incentives, and statistical techniques became more sensitive to the value of replicating previous experimental results, however, the main obstacle in the ESBS appears to be a lack of serious theory-construction efforts. Until the mass of data the ESBS are producing bears on theory construction, there is no reason to hope for theoretically progressive ESBS research anytime soon.

How can “scientific success” be defined with this lack of general theory or should it be defined at all?

Definitions do tend to change, of course. Yet they always add rigor and clarity.

For the ESBS, scientific success requires the ability to derive point-specific predictions from empirical theories that subsume well-replicated experimental effects, thus contributing to explaining, and intervening on, focal phenomena. The practical value of theoretical knowledge indeed rests entirely on successful intervention. Point-predicting theories must therefore integrate seemingly disparate data-sets, or differentiate between them. In any case, they must both retrodict old data and predict new data. Without such theories, scientific success is hard to define and hard to achieve.

How can philosophy of science help in developing more trustworthy insights to arrange an ESBS research program?

Scientific knowledge must be replicable, generalizable, but nevertheless open to revision. The philosophy of science has long recognized that induction and prediction are distinct notions. ESBS researchers, however, tend to conflate them, taking it for granted that inductive knowledge learnt from past data substitutes for a theory-based prediction of new data. Bayesian methods in particular embrace this idea. But predictions derive from theoretical knowledge, while induction may practically succeed without theoretical accountability. Induction is the process of arriving at a parameter value; a theory then predicts this value in a specific empirical condition, and the prediction is tested against new data. For this reason, confirmation can only be a confirmation of a theory-derived point-specific prediction. Theories that fail to offer point-specific predictions can therefore never be well-confirmed by data.

“Our approach stresses the role of collaboration for good science”

Our own approach, the research program strategy (RPS), combines the best available statistical methods from Frequentism and Bayesianism. RPS relies on insights from Lakatos’ and Laudan’s philosophy of scientific research programs, as well as the semantic approach to reconstructing theoretical structures. The philosophy of science teaches, after all, that theory construction and evaluation are informed by continuous efforts at reconstructing historically older theories, especially those that offer successful predictions. Moreover, RPS stresses the role of collaboration for good science. Collaboration is the obvious way to increase the small sample sizes that are typical in the ESBS, and the replication crisis cannot be overcome without gathering much larger samples. Data-sets that an individual lab collects must therefore be integrated into larger sets. But current meta-analytical methods focus on statistically significant empirical results, while neglecting various factors that reduce the experimental conditions’ sensitivity. Induction analysis as proposed in RPS estimates such factors, and seeks to correct them. Induction analysis thus is an improved version of meta-analysis.

“Replicable results are more important than novel results”

What are the main points of the research program strategy you will develop? You suggest developing induction analysis as an alternative method to meta-analysis. What are the differences between the two?

It should be clear that replicable results are more important than novel results, and that one’s ability to predict the former makes them even more important. RPS accepts this fully. RPS itself is a fairly sophisticated combination of two different standard approaches to statistical inference: the frequentist null-hypothesis significance testing (NHST) and the Bayesian hypothesis testing (BHT) approach. RPS joins both approaches into an all-things-considered superior approach that combines an optimal way of learning the numerical value of an empirical parameter from data, on the one hand, with an optimal way of confirming theoretical hypotheses by data, on the other. Given the replication crisis, if a standard meta-analysis today combines NHST- and BHT-results into a global result, then three problematic conditions apply. First, published object-level ESBS-studies mostly report non-replicable effects. Second, published ESBS-studies predominantly report statistically significant effects; the number of unpublished non-significant effects, however, is unknown. (This alone has implications for the public perception of science.) Third, empirically observed effects are heterogeneous and typically arise under different experimental conditions. Given these three conditions, a meta-analytical pooling of object-level studies into a global estimate is prone to misestimate the true effect, because underpowered studies and an inadequate inclusion of unpublished studies with smaller effect sizes must lead to overestimating the global effect size. To clarify the extent of misestimation is part of what our work on induction analysis aims to achieve.
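The overestimation argument above can be illustrated with a small simulation. The following is only a sketch under stated assumptions, not the RPS or induction-analysis method itself: the function names, the true effect size of 0.2, the sample size of 20 per group, and the normal-approximation cut-off of 1.96 are all chosen here for illustration. It simulates many underpowered two-group studies, "publishes" only the statistically significant ones, and compares the average effect across all studies with the average across the published subset.

```python
import math
import random
import statistics


def simulate_study(true_d, n_per_group, rng):
    """Simulate one two-group study; return (estimated effect size, significant?)."""
    control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
    treated = [rng.gauss(true_d, 1.0) for _ in range(n_per_group)]
    mean_diff = statistics.fmean(treated) - statistics.fmean(control)
    pooled_sd = math.sqrt(
        (statistics.variance(control) + statistics.variance(treated)) / 2
    )
    d_hat = mean_diff / pooled_sd  # standardized mean difference (Cohen's d)
    t_stat = d_hat * math.sqrt(n_per_group / 2)  # equal-n two-sample t statistic
    # Assumption: 1.96 approximates the two-sided 5% cut-off; the exact
    # t-distribution cut-off for small samples is slightly larger.
    return d_hat, abs(t_stat) > 1.96


def pooled_estimates(true_d=0.2, n_per_group=20, n_studies=5000, seed=1):
    """Return (mean effect over all studies, mean over 'published' ones, share published)."""
    rng = random.Random(seed)
    results = [simulate_study(true_d, n_per_group, rng) for _ in range(n_studies)]
    all_d = [d for d, _ in results]
    # Hypothetical publication filter: only significant results get published.
    published_d = [d for d, significant in results if significant]
    share_published = len(published_d) / n_studies
    return statistics.fmean(all_d), statistics.fmean(published_d), share_published


if __name__ == "__main__":
    all_mean, published_mean, share = pooled_estimates()
    print(f"true effect:               0.20")
    print(f"mean of all studies:       {all_mean:.2f}")
    print(f"mean of published studies: {published_mean:.2f}")
    print(f"share published:           {share:.1%}")
```

Because every simulated study has the same sample size, the plain average stands in for the inverse-variance weighting that a real meta-analysis would use. Under these assumptions, the average over all studies stays close to the true effect, while the average over the published subset is considerably larger, since a small study can only reach significance when its estimate happens to be large.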

“Until researchers publish replicable results, one should be cautious with investing confidence”

Do you think the confidence crisis in the ESBS can also cause a loss of confidence in society towards these sciences and their results?

One should not forget that the ESBS have also uncovered experimental effects that are stable enough to have some confidence in them. In fact, such results are applied in various practices today, influencing decision-making for instance in marketing, human resource management, or therapy. Generally, however, a statistically significant result being published in a quality-controlled scientific journal is an insufficient reason to trust the result. Unless the result is well-replicated, one should assign low confidence to it. This message bears repeating. Even the Nobel Prize winner Daniel Kahneman admitted, in 2017, that his 2011 bestseller “Thinking, Fast and Slow” relied on far too many non-replicated results. Until ESBS researchers mostly publish replicable results, one should be cautious with investing confidence.

