Preprint Similarity Search

Gavin · February 12, 2022, 2:47pm

It’s only for preprints on bioRxiv or medRxiv but seems to work well. Check where your own preprints fall on the map.

From the paper: Examining linguistic shifts between preprints and publications

Preprints allow researchers to make their findings available to the scientific community before they have undergone peer review. Studies on preprints within bioRxiv have been largely focused on article metadata and how often these preprints are downloaded, cited, published, and discussed online. A missing element that has yet to be examined is the language contained within the bioRxiv preprint repository. We sought to compare and contrast linguistic features within bioRxiv preprints to published biomedical text as a whole as this is an excellent opportunity to examine how peer review changes these documents. The most prevalent features that changed appear to be associated with typesetting and mentions of supporting information sections or additional files. In addition to text comparison, we created document embeddings derived from a preprint-trained word2vec model. We found that these embeddings are able to parse out different scientific approaches and concepts, link unannotated preprint–peer-reviewed article pairs, and identify journals that publish linguistically similar papers to a given preprint. We also used these embeddings to examine factors associated with the time elapsed between the posting of a first preprint and the appearance of a peer-reviewed publication. We found that preprints with more versions posted and more textual changes took longer to publish. Lastly, we constructed a web application (Preprint Similarity Search) that allows users to identify which journals and articles that are most linguistically similar to a bioRxiv or medRxiv preprint as well as observe where the preprint would be positioned within a published article landscape.
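As a rough illustration of the embedding idea described in the abstract, here is a minimal sketch, not the authors' code. It assumes a document embedding is the average of its word vectors and that "linguistically similar" means high cosine similarity. The tiny `WORD_VECTORS` table, the journal labels, and the function names are all hypothetical stand-ins for a trained word2vec model and a real corpus.

```python
import math

# Hypothetical toy vectors standing in for a preprint-trained word2vec model.
WORD_VECTORS = {
    "gene": [0.9, 0.1, 0.0],
    "protein": [0.8, 0.2, 0.1],
    "expression": [0.7, 0.3, 0.0],
    "neural": [0.1, 0.9, 0.2],
    "network": [0.0, 0.8, 0.3],
    "training": [0.1, 0.7, 0.4],
}


def embed_document(text):
    """Average the vectors of all known tokens; return None if none are known."""
    vectors = [WORD_VECTORS[t] for t in text.lower().split() if t in WORD_VECTORS]
    if not vectors:
        return None
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank_by_similarity(query_text, corpus):
    """Return (label, similarity) pairs sorted from most to least similar."""
    query_vec = embed_document(query_text)
    if query_vec is None:
        return []
    scored = []
    for label, text in corpus.items():
        doc_vec = embed_document(text)
        if doc_vec is not None:
            scored.append((label, cosine_similarity(query_vec, doc_vec)))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


# Hypothetical "journals", each represented by a snippet of typical text.
corpus = {
    "genomics journal": "gene expression protein",
    "machine learning journal": "neural network training",
}
ranking = rank_by_similarity("protein expression of a gene", corpus)
for label, score in ranking:
    print(f"{label}: {score:.3f}")
```

Words missing from the vocabulary ("of", "a") are simply skipped here; a real pipeline would need proper tokenization and a far larger vocabulary.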

paoladm · February 16, 2022, 3:48am

Thank you for sharing, Gav. It is most interesting to see ML put to such good use.

eoteromuras · March 1, 2022, 10:27pm

Thank you for sharing. It has some resemblance to other open-source projects like arxiv-sanity-preserver (only for arXiv preprints, developed by Andrej Karpathy).

Related to this functionality, there is another curious tool: the iris.ai explorer. Unfortunately it is proprietary software, though accessible online via a free trial.

Gavin · March 4, 2022, 5:04pm

Welcome to the forum, Enrique!

Connected Papers is another discovery tool, although it looks at citation networks rather than doing text analysis. I’ve tried both iris.ai and ConnPapers and wasn’t that impressed by either… Although I think it may have been my use case, which was searching a small biomedical field that mostly gave direct/obvious connections I knew about and long-range connections that weren’t relevant. A bigger field (or one that I didn’t know well) might give more useful intermediate connections.

eoteromuras · March 11, 2022, 7:13pm

Thanks @Gavin!

I didn’t know about connectedpapers.com. It looks interesting too, so I’ll try it. Though, as you say, it may not be practical enough for discovering relevant unknown connections.
