Open Science Reading Recommendations

Many members of the forum community are interested in Open Science (OS), but it’s not always easy to keep up with the latest OS literature if you’re not a metascience researcher yourself (or at least active on Twitter…). So I wanted to encourage people to post new OS/metascience papers/preprints here, particularly if they think the article will be of general interest and stimulate discussion in the forum.

A previous example of a preprint that generated some discussion is Metascience as a scientific social movement.

As with the one above, posted articles are welcome to cover advanced topics and issues around Open Science. I’d refer anybody looking for an introduction to OS to Easing Into Open Science: A Guide for Graduate Students and Their Advisors and the material in the Intro Papers folder of the ReproducibiliTea Zotero library.

If a given article generates a bit of discussion, then I or another moderator can split it off into its own thread. I’ll also pick an interesting article each month to feature in the latest IGDORE Newsletter (and give a shout-out to whoever posted it originally).

Here is a paper on journal impact factors to start things off (h/t @pcmasuzzo):

Journal impact factors, publication charges and assessment of quality and accuracy of scientific research are critical for researchers, managers, funders, policy makers, and society. Editors and publishers compete for impact factor rankings, to demonstrate how important their journals are, and researchers strive to publish in perceived top journals, despite high publication and access charges. This raises questions of how top journals are identified, whether assessments of impacts are accurate and whether high publication charges borne by the research community are justified, bearing in mind that they also collectively provide free peer-review to the publishers. Although traditional journals accelerated peer review and publication during the COVID-19 pandemic, preprint servers made a greater impact with over 30,000 open access articles becoming available and accelerating a trend already seen in other fields of research. We review and comment on the advantages and disadvantages of a range of assessment methods and the way in which they are used by researchers, managers, employers and publishers. We argue that new approaches to assessment are required to provide a realistic and comprehensive measure of the value of research and journals and we support open access publishing at a modest, affordable price to benefit research producers and consumers.

Some old arguments against the impact factor:

Per Seglen (1997) summarized in four points why JIFs should not be used for the evaluation of research:

  1. “Use of journal impact factors conceals the difference in article citation rates (articles in the most cited half of articles in a journal are cited 10 times as often as the least cited half).
  2. Journals’ impact factors are determined by technicalities unrelated to the scientific quality of their articles.
  3. Journal impact factors depend on the research field: high impact factors are likely in journals covering large areas of basic research with a rapidly expanding but short lived literature that uses many references per article.
  4. Article citation rates determine the journal impact factor, not vice versa.”

Some existing alternatives:

A number of alternative metrics to JIF have been developed (Table 1). All of these are based on citation counts for individual papers but vary in how the numbers are used to assess impact. As discussed later, the accuracy of data based on citation counts is highly questionable.

  • CiteScore calculates a citations/published items score conceptually similar to JIF but using Scopus data to count four years of citations and four years of published items.
  • The Source Normalized Impact Factor also uses Scopus data to take a citation/published items score and normalizes it against the average number of citations/citing document.
  • The Eigenfactor (EF) and Scimago Journal Rank work in a manner analogous to Google’s PageRank algorithm, employing iterative calculations with data from Journal Citation Reports and Scopus respectively to derive scores based on the weighted valuations of citing documents.
  • Finally, h-indexes attempt to balance the number of papers published by an author or journal against the distribution of citation counts for those papers. This metric is frequently used and is discussed in more detail in a following section.
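A quick aside from me rather than the paper: the h-index calculation itself is trivial once you have per-paper citation counts, which makes it clear that all the subtlety (and the accuracy problems the authors raise) lives in where those counts come from. A minimal sketch in Python, assuming you already have a list of citation counts for an author or journal:

```python
def h_index(citations):
    """Largest h such that h papers have at least h citations each."""
    ranked = sorted(citations, reverse=True)
    h = 0
    for rank, cites in enumerate(ranked, start=1):
        if cites >= rank:
            h = rank
        else:
            break
    return h

# Toy example: six papers with these citation counts give an h-index of 3
print(h_index([10, 8, 5, 3, 2, 0]))  # -> 3
```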

Some new suggestions for evaluative criteria:

If so, and recognizing that any evaluation based on a single criterion alone can be criticized, what are the criteria we should consider in order to devise a more effective system for recognition and assessment of accomplishments which also supports an equitable publishing process that is not hidden behind expensive paywalls and OA fees? The following are all metrics that could be collectively looked at to aid in assessment although as we have discussed, if used alone, all have their limitations:

  1. Contribution of an author to the paper including preprints, i.e. first author, last author, conducted experiments, analyzed data, contributed to the writing, other?
  2. Number of years active in research field and productivity
  3. Number of publications in journals where others in the same field also publish
  4. Views and downloads
  5. Number of citations as first, last, or corresponding author.

The authors get bonus points for including a reference to a Bob Dylan song as well!

I think those final points could be good factors for starting to build an assessment metric. One thing I note is that these are all quantitative factors that are easy to collect, but I wonder if there is scope to include other qualitative assessments (although these generally require more effort to produce)? For instance, I think that peer usage and validation of research findings and/or outputs would be a very positive indicator. A replication study is basically the essence of peer validation, but in some cases information indicating usage could be easier to get (e.g. looking at forks of, or contributions to, a code repository that go on to be used in other papers).
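To make that last point a bit more concrete, here is the kind of crude usage signal I have in mind, pulled from the public GitHub API (the repository name below is just a placeholder, and raw counts are obviously only a rough proxy for genuine reuse):

```python
import requests

def repo_usage(owner, repo):
    """Fetch rough reuse signals for a paper's code repository from the public GitHub API."""
    r = requests.get(f"https://api.github.com/repos/{owner}/{repo}", timeout=10)
    r.raise_for_status()
    data = r.json()
    return {
        "forks": data["forks_count"],              # people building on the code
        "stars": data["stargazers_count"],         # weaker signal of general interest
        "open_issues": data["open_issues_count"],  # ongoing engagement
    }

# Hypothetical repository accompanying a paper
print(repo_usage("example-lab", "analysis-code"))
```

Whether any of those forks actually fed into other papers would still need manual checking, which is exactly the extra effort that qualitative indicators seem to require.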

Any thoughts? Or other suggestions for assessment metrics?

I came across this month’s preprint at a [Nowhere Lab](https://twitter.com/thenowherelab) meeting (h/t Dwayne Lieck). The paper is quite heavy on Philosophy of Science, but I think it does a good job of showing the value of (combining) different types of replications.

A Falsificationist Treatment of Auxiliary Hypotheses in Social and Behavioral Sciences: Systematic Replications Framework

In short:

we investigate how the current undesirable state is related to the problem of empirical underdetermination and its disproportionately detrimental effects in the social and behavioral sciences. We then discuss how close and conceptual replications can be employed to mitigate different aspects of underdetermination, and why they might even aggravate the problem when conducted in isolation. … The Systematic Replications Framework we propose involves conducting logically connected series of close and conceptual replications and will provide a way to increase the informativity of (non)corroborative results and thereby effectively reduce the ambiguity of falsification.

The introduction is catchy:

At least some of the problems that social and behavioral sciences tackle have far-reaching and serious implications in the real world. Among them one could list very diverse questions, such as “Is exposure to media violence related to aggressive behavior and how?” … Apart from all being socially very pertinent, substantial numbers of studies investigated each of these questions. However, the similarities do not end here. Curiously enough, even after so much resource has been invested in the empirical investigation of these almost-too-relevant problems, nothing much is accomplished in terms of arriving at clear, definitive answers … Resolving theoretical disputes is an important means to scientific progress because when a given scientific field lacks consensus regarding established evidence and how exactly it supports or contradicts competing theoretical claims, the scientific community cannot appraise whether there is scientific progress or merely a misleading semblance of it. That is to say, it cannot be in a position to judge whether a theory constitutes scientific progress in the sense that it accounts for phenomena better than alternative or previous theories and can lead to the discovery of new facts, or is degenerating in the sense that it focuses on explaining away counterevidence by finding faults in replications (Lakatos, 1978). Observing this state, Lakatos maintained decades ago that most theorizing in social sciences risks making merely pseudo-scientific progress (1978, p. 88-9, n. 3-4). What further solidifies this problem is that most “hypothesis-tests” do not test any theory and those that do so subject the theory to radically few number of tests (see e.g., McPhetres et. al., 2020). This situation has actually been going on for a considerably long time, which renders an old observation of Meehl still relevant; namely, that theoretical claims often do not die normal deaths at the hands of empirical evidence but are discontinued due to a sheer loss of interest (1978).

As researchers whose work fails to replicate are quick to point out, a failed replication doesn’t necessarily mean the theory has been falsified:

this straightforward falsificationist strategy is complicated by the fact that theories by themselves do not logically imply any testable predictions. As the Duhem-Quine Thesis (DQT from now on) famously propounds, scientific theories or hypotheses have empirical consequences only in conjunction with other hypotheses or background assumptions. These auxiliary hypotheses range from ceteris paribus clauses (i.e., all other things being equal) to various assumptions regarding the research design and the instruments being used, the accuracy of the measurements, the validity of the operationalizations of the theoretical terms linked in the main hypothesis, the implications of previous theories and so on. Consequently, it is impossible to test a theoretical hypothesis in isolation. In other words, the antecedent clause in the first premise of the modus tollens is not a theory (T) but actually a bundle consisting of the theory and various auxiliary hypotheses (T, AH_1, …, AH_n). For this reason, falsification is necessarily ambiguous. That is, it cannot be ascertained from a single test if the hypothesis under test or one or more of the auxiliary hypotheses should bear the burden of falsification (see Duhem, 1954, p. 187; also Strevens, 2001, p. 516). Likewise, Lakatos maintained that absolute falsification is impossible, because in the face of a failed prediction, the target of the modus tollens can always be shifted towards the auxiliary hypotheses and away from the theory (1978, p. 18-19; see also Popper, 2002b, p. 20).

Popper considered auxiliary hypotheses to be unproblematic background assumptions that researchers had to demarcate from the theory under test by designing a good methodology. But this is hard to do in the social sciences (and my experience suggests it’s probably just as hard in many areas of biology):

In the social and behavioral sciences, relegating AHs to unproblematic background assumptions is particularly difficult, and consequently the implications of the DQT are particularly relevant and crucial (Meehl, 1978; 1990). For several reasons we need to presume that AHs nearly always enter the test along with the main theoretical hypothesis (Meehl, 1990). Firstly, in the social and behavioral sciences the theories are so loosely organized that they do not say much about how the measurements should be (Folger, 1989; Meehl, 1978). Secondly, AHs are seldom independently testable (Meehl, 1978) and, consequently, usually no particular operationalization qualitatively stands out. Besides, in these disciplines, theoretical terms are often necessarily vague (Qizilbash, 2003), and researchers have a lesser degree of control on the environment of inquiry, so hypothesized relationships can be expected to be spatiotemporally less reliable (Leonelli, 2018). Moreover, in the absence of a strong theory of measurement that is informed by the dominant paradigm of the given scientific discipline (Muthukrishna & Henrich, 2019), the selection of AHs is usually guided by the assumptions of the very theory that is put into test. Consequently, each contending approach develops its own measurement devices regarding the same phenomenon, heeding to their own theoretical postulations. Attesting to the threat this situation poses for the validity of scientific inferences, it has recently been shown that the differences in research teams’ preferences of basic design elements drastically influence the effects observed for the same theoretical hypotheses (Landy et al., 2020).

The proposed Systematic Replications Framework (also depicted in Fig. 2):

SRF consists of a systematically organized series of replications that function collectively as a single research line. The basic idea is to bring close and conceptual replications together in order to weight the effects of the AH_pre and AH_out sets on the findings. SRF starts with a close replication, which is followed by a series of conceptual replications in which the operationalization of one theoretical variable at a time is varied while keeping that of the other constant and then repeats the procedure for the other leg.
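To check my own reading of that design, here is a toy sketch (my interpretation, not code from the authors) that enumerates the replication series for a hypothesis linking two theoretical variables, each with a couple of candidate operationalizations; the media-violence example and the specific operationalizations are purely illustrative:

```python
def srf_series(ops_predictor, ops_outcome):
    """Enumerate an SRF-style series: one close replication, then conceptual
    replications that vary one operationalization at a time, leg by leg."""
    baseline = (ops_predictor[0], ops_outcome[0])  # operationalizations of the original study
    series = [("close replication", baseline)]
    # Leg 1: vary the predictor operationalization, hold the outcome operationalization constant
    for op in ops_predictor[1:]:
        series.append(("conceptual replication (vary predictor)", (op, ops_outcome[0])))
    # Leg 2: vary the outcome operationalization, hold the predictor operationalization constant
    for op in ops_outcome[1:]:
        series.append(("conceptual replication (vary outcome)", (ops_predictor[0], op)))
    return series

for label, (pred_op, out_op) in srf_series(
    ["violent video game", "violent film clip"],
    ["noise blast task", "self-report questionnaire"],
):
    print(f"{label}: {pred_op} + {out_op}")
```

If non-corroboration only shows up in one leg, that points the finger at the corresponding operationalization rather than at the theoretical hypothesis itself, which is the weighted distribution of blame described below.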

Its benefits for hypothesis testing are:

SRF reduces ambiguities implied by the DQT in original studies as well as in close and conceptual replications. Primarily, it allows for non-corroborative evidence to have differential implications for the components of the TH & AHs bundle. Thereby these components can receive blame not collectively but in terms of a weighted distribution. In cases where it is not possible to achieve this, it allows demarcating on which pairings from possible AH_pre and AH_out sets the truth-value of the TH is conditional. In all cases, the confounding effects deriving from the AHs can be relatively isolated. Lastly, SRF can enable that we approximate to an ideal test of a theoretical hypothesis within the methodological falsificationist paradigm by embedding alternative operationalizations and associated measurement approaches into a severe testing framework (see Mayo, 1997; 2018).

Besides replications, the SRF could also be useful for doing systematic literature reviews:

Another potential practical implication of SRF lies in using the same strategy of logically connecting different AH bundles in conducting and interpreting systematic literature reviews (particularly when the previous findings are mixed). Such a strategy can help researchers distinguish the effects that seem to be driven by certain AHs from the ones in which the TH is more robust to such influences. To put it differently, in a contested literature there are already numerous conceptual replications that have been conducted, and at least some of these replications rely on the same AHs in their operationalizations. Therefore, to the extent that they have overlaps in their AHs, their results can be organized in such a way that resembles a pattern of results that can be obtained with a novel research project planned according to SRF. The term “systematic” in systematic literature review already indicates that the scientific question to be investigated (i.e., the subject-matter, the problem or hypothesis), the data collection strategy (e.g., databases to be searched, inclusion criteria) as well as the method that will be used in analyzing the data (e.g., statistical tests or qualitative analyses) are standardized. However, for various reasons (e.g., to limit the inquiry to those studies that use a particular method), not every systematic literature review is conducive to figuring out whether the TH is conditional on particular AH sets. An SRF-inspired strategy of tabulating the results in a systematic literature review will also help researchers in appraising the conceptual networks of theoretical claims, theoretically relevant auxiliary assumptions and measurements. Thus, it can eventually help in appraising the verisimilitude of the TH by revealing how it is conditional on certain AHs, and can lead to the reformulation or refinement of the TH as well as guide and constrain subsequent modifications to it.
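If I’ve understood the tabulation idea correctly, it amounts to cross-tabulating existing findings by their AH bundles. A quick pandas sketch (entirely my own illustration, with made-up studies) of what that might look like:

```python
import pandas as pd

# Hypothetical coding of an existing literature: each study is tagged with the
# operationalizations (AHs) it relied on and whether the theoretical hypothesis
# (TH) was supported.
studies = pd.DataFrame([
    {"predictor_AH": "violent video game", "outcome_AH": "noise blast task",          "supported": True},
    {"predictor_AH": "violent video game", "outcome_AH": "self-report questionnaire", "supported": False},
    {"predictor_AH": "violent film clip",  "outcome_AH": "noise blast task",          "supported": True},
    {"predictor_AH": "violent film clip",  "outcome_AH": "self-report questionnaire", "supported": False},
])

# Share of supportive results for each AH pairing; a column-wise pattern like
# this one would suggest the TH is conditional on the outcome operationalization.
table = studies.pivot_table(index="predictor_AH", columns="outcome_AH",
                            values="supported", aggfunc="mean")
print(table)
```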

In closing:

The decade-long discussion on a replicability and confidence crisis in several disciplines of social, behavioral and life sciences (e.g., Camerer et al., 2018; OSC, 2015; Ioannidis, 2005) has identified the prioritization of the exploratory over the critical mission as one of the key causes, and led to proposals for slowing science down (Stengers, 2018), applying more caution in giving policy advice (Ijzerman et al., 2020), and inaugurating a credibility revolution (Vazire, 2020). All potential contributions of SRF will be part of a strategy to prioritize science’s critical mission on the way towards more credible research in social, behavioral, and life sciences. This would imply that the scientific community focuses less on producing huge numbers of novel hypotheses with little corroboration and more on having a lesser number of severely tested theoretical claims. Successful implementation of SRF also requires openness and transparency regarding both positive and negative results of original and replication studies (Nosek et al., 2015) and demands increased research collaboration (Landy et al., 2020). Ideally, this would also take the form of adversarial collaboration.

@surya, re. adversarial collaborations: the paper discusses them in more detail in a dedicated section.

I’d be interested to hear from some people involved in current psychology replication projects about their thoughts on using conceptual replications to test auxiliary hypotheses vs. just using close/direct replications.

This paper also reminded me of the old concept of strong inference, which also focuses on testing a variety of alternative hypotheses in a given study.