How Software in the Life Sciences Actually Works (And Doesn’t Work)

“Institutions have not been forced to pay professional programmers competitive salaries; grant agencies have not been compelled to set aside appropriate funds for a software infrastructure; and the line items for professional software engineering have not made it into budget models. Thus, genomics has become accustomed to, even addicted to, abundant free software. In a sense, in our idealistic, anti-establishment zeal, we free software warriors have locked computational genomics into an unsustainable financial model.”

I thought this was a useful description of the problems with developing reliable software for biology research. I’d be interested in perspectives on this from other RSEs, maybe @grant or @heidi?

As someone who has created/contributed to a lot of FOSS software over the years in biology, I’d say that’s a pretty spot-on description.

I agree; I’ve experienced everything written in the article. I especially agree that we should have more publicly funded tool development in science. This seems to be a problem in general, and I sometimes think about how these challenges were approached in, say, astronomy/astrophysics. Maybe in astrophysics people realized there was no way to get away with processing data in R, so most software was engineered for efficiency upfront, usually in C or C++. There were no real commercial interests to rely on, and somehow they seem to have made it work.

Right now I find myself thinking a lot about how to approach massive data in biology. I think the challenge in biology is that datasets are not only becoming much larger (quickly) but also more diverse, and consequently more complex. Having an efficient representation for sparse matrices (transcriptomic data) is one thing; representing multi-modal data and integrating it (as a graph? or a matroid?) seems to be another. So I see the main challenges as size and complexity, and I don’t really see the current R or Python ecosystem as ready to address them (though the Apache Arrow ecosystem does look promising).
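To make the sparse-matrix point concrete, here is a minimal sketch of compressed sparse row (CSR) storage in plain Python. The function names (`dense_to_csr`, `csr_row_sums`) and the tiny genes-by-cells count matrix are made up for illustration; real pipelines would use an established sparse library rather than hand-rolled lists.

```python
from typing import List, Tuple


def dense_to_csr(dense: List[List[int]]) -> Tuple[List[int], List[int], List[int]]:
    """Convert a dense matrix into CSR arrays (data, indices, indptr).

    data    : the non-zero values, row by row
    indices : the column index of each non-zero value
    indptr  : row i's values live in data[indptr[i]:indptr[i + 1]]
    """
    data: List[int] = []
    indices: List[int] = []
    indptr: List[int] = [0]
    for row in dense:
        for col, value in enumerate(row):
            if value != 0:
                data.append(value)
                indices.append(col)
        # Record where this row ends in the data array.
        indptr.append(len(data))
    return data, indices, indptr


def csr_row_sums(data: List[int], indptr: List[int]) -> List[int]:
    """Sum each row while touching only the stored non-zero entries."""
    return [sum(data[indptr[i]:indptr[i + 1]]) for i in range(len(indptr) - 1)]


# Hypothetical 4 genes x 4 cells count matrix; most entries are zero.
counts = [
    [0, 0, 3, 0],
    [5, 0, 0, 0],
    [0, 0, 0, 0],
    [0, 2, 0, 1],
]

data, indices, indptr = dense_to_csr(counts)
row_sums = csr_row_sums(data, indptr)
```

Only the four non-zero counts are stored, plus one pointer per row, which is why this layout scales to matrices that are mostly zeros. It says nothing about how to relate several such matrices from different modalities, which is the harder part.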

A couple projects/ideas that interest me:

  • Chapel programming language. A government-funded effort to address the difficulty of writing distributed/clustered software. I could see larger Chapel adoption making an impact. Anything that makes high-performance programming easier is going to lead to outsized gains (Julia is great, I’m just not sure it can compete with Chapel in communication efficiency in a distributed setting… maybe it can, I just don’t know).
  • Language/database integration. Work done initially in finance (kdb+, provided by KX Systems, which interestingly was used at NASA’s Frontier Development Lab) to enable extremely high-throughput data processing by having the programming language and database exist in the same process. Hobbes is somewhat of a successor to kdb+, but based on a rich type system that might facilitate more complicated modeling (e.g. category-theoretic).