I came across this month’s preprint at a Nowhere Lab meeting (h/t Dwayne Lieck). The paper is quite heavy on Philosophy of Science, but I think it does a good job of showing the value of (combining) different types of replications.
A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework
we investigate how the current undesirable state is related to the problem of empirical underdetermination and its disproportionately detrimental effects in the social and behavioral sciences. We then discuss how close and conceptual replications can be employed to mitigate different aspects of underdetermination, and why they might even aggravate the problem when conducted in isolation. … The Systematic Replications Framework we propose involves conducting logically connected series of close and conceptual replications and will provide a way to increase the informativity of (non)corroborative results and thereby effectively reduce the ambiguity of falsification.
The introduction is catchy:
At least some of the problems that social and behavioral sciences tackle have far-reaching and serious implications in the real world. Among them one could list very diverse questions, such as “Is exposure to media violence related to aggressive behavior and how?” … Apart from all being socially very pertinent, substantial numbers of studies investigated each of these questions. However, the similarities do not end here. Curiously enough, even after so much resource has been invested in the empirical investigation of these almost-too-relevant problems, nothing much is accomplished in terms of arriving at clear, definitive answers … Resolving theoretical disputes is an important means to scientific progress because when a given scientific field lacks consensus regarding established evidence and how exactly it supports or contradicts competing theoretical claims, the scientific community cannot appraise whether there is scientific progress or merely a misleading semblance of it. That is to say, it cannot be in a position to judge whether a theory constitutes scientific progress in the sense that it accounts for phenomena better than alternative or previous theories and can lead to the discovery of new facts, or is degenerating in the sense that it focuses on explaining away counterevidence by finding faults in replications (Lakatos, 1978). Observing this state, Lakatos maintained decades ago that most theorizing in social sciences risks making merely pseudo-scientific progress (1978, p. 88-9, n. 3-4). What further solidifies this problem is that most “hypothesis-tests” do not test any theory and those that do so subject the theory to radically few number of tests (see e.g., McPhetres et al., 2020).
This situation has actually been going on for a considerably long time, which renders an old observation of Meehl still relevant; namely, that theoretical claims often do not die normal deaths at the hands of empirical evidence but are discontinued due to a sheer loss of interest (1978).
As researchers whose findings fail to replicate often point out, a failed replication doesn’t necessarily mean the theory is falsified:
this straightforward falsificationist strategy is complicated by the fact that theories by themselves do not logically imply any testable predictions. As the Duhem-Quine Thesis (DQT from now on) famously propounds, scientific theories or hypotheses have empirical consequences only in conjunction with other hypotheses or background assumptions. These auxiliary hypotheses range from ceteris paribus clauses (i.e., all other things being equal) to various assumptions regarding the research design and the instruments being used, the accuracy of the measurements, the validity of the operationalizations of the theoretical terms linked in the main hypothesis, the implications of previous theories and so on. Consequently, it is impossible to test a theoretical hypothesis in isolation. In other words, the antecedent clause in the first premise of the modus tollens is not a theory (T) but actually a bundle consisting of the theory and various auxiliary hypotheses (T, AH1, …, AHn). For this reason, falsification is necessarily ambiguous. That is, it cannot be ascertained from a single test if the hypothesis under test or one or more of the auxiliary hypotheses should bear the burden of falsification (see Duhem, 1954, p. 187; also Strevens, 2001, p. 516). Likewise, Lakatos maintained that absolute falsification is impossible, because in the face of a failed prediction, the target of the modus tollens can always be shifted towards the auxiliary hypotheses and away from the theory (1978, p. 18-19; see also Popper, 2002b, p. 20).
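The Duhem-Quine point can be written out as a schema (a standard textbook rendering, not notation from the paper):

```latex
% Naive modus tollens: the theory alone implies observation O,
% so a failed prediction refutes the theory directly.
\[ (T \to O) \wedge \neg O \;\Rightarrow\; \neg T \]
% Duhem--Quine: only the bundle of theory plus auxiliaries implies O,
% so a failed prediction refutes the conjunction, not T itself:
\[ \big( (T \wedge AH_1 \wedge \cdots \wedge AH_n) \to O \big) \wedge \neg O
   \;\Rightarrow\; \neg (T \wedge AH_1 \wedge \cdots \wedge AH_n)
   \;\equiv\; \neg T \vee \neg AH_1 \vee \cdots \vee \neg AH_n \]
```

The disjunction in the last line is the ambiguity: logic alone doesn’t say which disjunct is false.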
Popper treated auxiliary hypotheses as unproblematic background assumptions that a well-designed methodology could cleanly separate from the theory under test. But this is hard to do in the social sciences (my experience suggests this is probably true in many areas of biology as well):
In the social and behavioral sciences, relegating AHs to unproblematic background assumptions is particularly difficult, and consequently the implications of the DQT are particularly relevant and crucial (Meehl, 1978; 1990). For several reasons we need to presume that AHs nearly always enter the test along with the main theoretical hypothesis (Meehl, 1990). Firstly, in the social and behavioral sciences the theories are so loosely organized that they do not say much about how the measurements should be (Folger, 1989; Meehl, 1978). Secondly, AHs are seldom independently testable (Meehl, 1978) and, consequently, usually no particular operationalization qualitatively stands out. Besides, in these disciplines, theoretical terms are often necessarily vague (Qizilbash, 2003), and researchers have a lesser degree of control on the environment of inquiry, so hypothesized relationships can be expected to be spatiotemporally less reliable (Leonelli, 2018). Moreover, in the absence of a strong theory of measurement that is informed by the dominant paradigm of the given scientific discipline (Muthukrishna & Henrich, 2019), the selection of AHs is usually guided by the assumptions of the very theory that is put into test. Consequently, each contending approach develops its own measurement devices regarding the same phenomenon, heeding to their own theoretical postulations. Attesting to the threat this situation poses for the validity of scientific inferences, it has recently been shown that the differences in research teams’ preferences of basic design elements drastically influence the effects observed for the same theoretical hypotheses (Landy et al., 2020).
The proposed Systematic Replications Framework (also depicted in Fig. 2):
SRF consists of a systematically organized series of replications that function collectively as a single research line. The basic idea is to bring close and conceptual replications together in order to weight the effects of the AHpre and AHout sets on the findings. SRF starts with a close replication, which is followed by a series of conceptual replications in which the operationalization of one theoretical variable at a time is varied while keeping that of the other constant, and then repeats the procedure for the other leg.
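To make the study sequence concrete, here is a hypothetical sketch of the SRF series as I understand it (the function name, labels, and example operationalizations are my own illustration, not the authors’ notation):

```python
# Sketch of the SRF study sequence: one close replication, then
# conceptual replications varying the operationalization of one
# theoretical variable at a time while holding the other constant.

def srf_series(original_pre, original_out, alt_pre, alt_out):
    """Return an ordered list of (label, predictor_op, outcome_op) studies."""
    series = [("close replication", original_pre, original_out)]
    # Leg 1: vary the predictor's operationalization, outcome held fixed
    for op in alt_pre:
        series.append(("conceptual (vary AH_pre)", op, original_out))
    # Leg 2: vary the outcome's operationalization, predictor held fixed
    for op in alt_out:
        series.append(("conceptual (vary AH_out)", original_pre, op))
    return series

# Invented operationalizations for a media-violence/aggression example:
studies = srf_series(
    original_pre="violent game exposure (lab)",
    original_out="noise-blast intensity",
    alt_pre=["self-reported media diet"],
    alt_out=["point-subtraction aggression task"],
)
for label, pre, out in studies:
    print(f"{label}: {pre} x {out}")
```

Each row of the resulting series changes exactly one auxiliary relative to the close replication, which is what later lets (non)corroborations be attributed differentially.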
Its benefits for hypothesis testing are:
SRF reduces ambiguities implied by the DQT in original studies as well as in close and conceptual replications. Primarily, it allows for non-corroborative evidence to have differential implications for the components of the TH & AHs bundle. Thereby these components can receive blame not collectively but in terms of a weighted distribution. In cases where it is not possible to achieve this, it allows demarcating on which pairings from possible AHpre and AHout sets the truth-value of the TH is conditional. In all cases, the confounding effects deriving from the AHs can be relatively isolated. Lastly, SRF can enable that we approximate to an ideal test of a theoretical hypothesis within the methodological falsificationist paradigm by embedding alternative operationalizations and associated measurement approaches into a severe testing framework (see Mayo, 1997; 2018).
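As a toy illustration of the “weighted distribution of blame” idea, here is a hypothetical sketch (mine, not the paper’s) of how a pattern of (non)corroborations across an SRF series can shift suspicion between the theory and particular auxiliaries:

```python
# Given results keyed by (predictor_op, outcome_op), decide where
# suspicion falls: on the theory (TH) if everything fails, or on an
# auxiliary whose every appearance fails while alternatives succeed.

def localize_blame(results):
    """results: dict {(pre_op, out_op): corroborated_bool}.
    Returns a short diagnostic string."""
    if all(results.values()):
        return "TH corroborated across AH bundles"
    if not any(results.values()):
        return "TH itself is the prime suspect"
    # Mixed results: look for an operationalization that fails wherever
    # it appears while other bundles succeed.
    suspects = []
    for slot, pick in (("AH_pre", lambda k: k[0]), ("AH_out", lambda k: k[1])):
        for op in {pick(k) for k in results}:
            outcomes = [v for k, v in results.items() if pick(k) == op]
            if not any(outcomes):
                suspects.append(f"{slot}={op!r}")
    if suspects:
        return "suspect auxiliaries: " + ", ".join(suspects)
    return "TH appears conditional on particular AH pairings"

# Invented pattern: only the word-completion operationalization fails.
pattern = {
    ("questionnaire", "noise blast"): True,
    ("word completion", "noise blast"): False,
    ("questionnaire", "hot sauce"): True,
}
print(localize_blame(pattern))  # suspect auxiliaries: AH_pre='word completion'
```

The point is only the inferential shape: because each study varies one auxiliary, a consistent failure pattern attaches to a specific AH rather than to the undifferentiated bundle.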
Besides replications, the SRF could also be useful for doing systematic literature reviews:
Another potential practical implication of SRF lies in using the same strategy of logically connecting different AH bundles in conducting and interpreting systematic literature reviews (particularly when the previous findings are mixed). Such a strategy can help researchers distinguish the effects that seem to be driven by certain AHs from the ones in which the TH is more robust to such influences. To put it differently, in a contested literature there are already numerous conceptual replications that have been conducted, and at least some of these replications rely on the same AHs in their operationalizations. Therefore, to the extent that they have overlaps in their AHs, their results can be organized in such a way that resembles a pattern of results that can be obtained with a novel research project planned according to SRF. The term “systematic” in systematic literature review already indicates that the scientific question to be investigated (i.e., the subject-matter, the problem or hypothesis), the data collection strategy (e.g., databases to be searched, inclusion criteria) as well as the method that will be used in analyzing the data (e.g., statistical tests or qualitative analyses) are standardized. However, for various reasons (e.g., to limit the inquiry to those studies that use a particular method), not every systematic literature review is conducive to figuring out whether the TH is conditional on particular AH sets. An SRF-inspired strategy of tabulating the results in a systematic literature review will also help researchers in appraising the conceptual networks of theoretical claims, theoretically relevant auxiliary assumptions and measurements. Thus, it can eventually help in appraising the verisimilitude of the TH by revealing how it is conditional on certain AHs, and can lead to the reformulation or refinement of the TH as well as guide and constrain subsequent modifications to it.
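The SRF-inspired tabulation could look something like this minimal sketch, assuming each published study can be coded by the operationalization pair it used and whether it supported the TH (all study records below are invented for illustration):

```python
from collections import defaultdict

# Group already-published conceptual replications by the
# (predictor_op, outcome_op) pair they share, so mixed findings can be
# read per AH bundle instead of being pooled into one muddy average.

def tabulate_by_ah(studies):
    """studies: iterable of (pre_op, out_op, supported) tuples.
    Returns {(pre_op, out_op): [supported, ...]}."""
    table = defaultdict(list)
    for pre_op, out_op, supported in studies:
        table[(pre_op, out_op)].append(supported)
    return dict(table)

findings = [
    ("questionnaire", "noise blast", True),
    ("questionnaire", "noise blast", True),
    ("questionnaire", "hot sauce", False),
    ("word completion", "noise blast", False),
]
for bundle, results in tabulate_by_ah(findings).items():
    rate = sum(results) / len(results)
    print(bundle, f"support rate: {rate:.2f}")
```

A table like this makes it visible when “mixed” literatures are actually consistent within AH bundles and only inconsistent across them.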
The decade-long discussion on a replicability and confidence crisis in several disciplines of social, behavioral and life sciences (e.g., Camerer et al., 2018; OSC, 2015; Ioannidis, 2005) has identified the prioritization of the exploratory over the critical mission as one of the key causes, and led to proposals for slowing science down (Stengers, 2018), applying more caution in giving policy advice (IJzerman et al., 2020), and inaugurating a credibility revolution (Vazire, 2020). All potential contributions of SRF will be part of a strategy to prioritize science’s critical mission on the way towards more credible research in social, behavioral, and life sciences. This would imply that the scientific community focuses less on producing huge numbers of novel hypotheses with little corroboration and more on having a lesser number of severely tested theoretical claims. Successful implementation of SRF also requires openness and transparency regarding both positive and negative results of original and replication studies (Nosek et al., 2015) and demands increased research collaboration (Landy et al., 2020). Ideally, this would also take the form of adversarial collaboration.
@surya re. the adversarial collaborations. It’s discussed in more detail in its own section.
I’d be interested to hear from some people involved in current psychology replication projects about their thoughts on using conceptual replications to test auxiliary hypotheses vs. just using close/direct replications.
This paper also reminded me of the old concept of strong inference (Platt, 1964), which likewise emphasizes testing a variety of alternative hypotheses in a given study.