Scientific Research Software

John D. McGregor, J. Yates Monteith, John E. Ingram

Strategic Software Engineering Research Group
Clemson University
Clemson, SC 29634
{johnmc, jymonte, jei}@clemson.edu

The science research enterprise – including organizations such as universities, companies, and federal agencies – supports the development of a large amount of software. In some cases, a large community of scientific users comes to depend on the continued availability of one of these software systems or one of its constituent parts. Examples of these systems include Hadoop, R, Eclipse, and many more. In the case of open-source software, many of the systems developed for scientific users depend on numerous software packages, some with a long lineage of “parent” software projects. The future of the system being developed depends on these components being maintained, but there are too many open-source software systems for universities, companies, or federal agencies to support all of them. The science research enterprise must strategically choose which software systems to develop, support, and maintain and which to petition the original producers to maintain.

When a software tool becomes popular outside the research group that developed it, the continued use of the software system is a point of risk for the advancement of scientific goals. Scientific outcomes are dependent on the continued support of not just the target software package, but also on the continued maintenance of the ecosystem of software packages upon which a product depends. When decisions must be made about continued funding for these research projects, these decisions should be partially based on the quality and availability of the supporting software infrastructure and the proposed software’s future impact on its scientific community.
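To make the scope of this dependence concrete, the following is a minimal sketch of how the full set of packages behind one product could be enumerated. The package names and the graph itself are hypothetical and serve only as illustration; a real analysis would read the dependency data from the project's own build or packaging metadata.

```python
def transitive_dependencies(graph, package):
    """Return every package reachable from `package` through dependency edges."""
    seen = set()
    stack = list(graph.get(package, ()))
    while stack:
        dep = stack.pop()
        if dep not in seen:
            seen.add(dep)
            # Follow the dependencies of this dependency as well.
            stack.extend(graph.get(dep, ()))
    return seen


# Hypothetical dependency graph: each key depends directly on the listed packages.
ECOSYSTEM = {
    "analysis-tool": ["plot-lib", "stats-lib"],
    "plot-lib": ["array-lib"],
    "stats-lib": ["array-lib", "linear-algebra-lib"],
    "array-lib": [],
    "linear-algebra-lib": [],
}

if __name__ == "__main__":
    for name in sorted(transitive_dependencies(ECOSYSTEM, "analysis-tool")):
        print(name)
```

In this sketch the hypothetical analysis-tool has only two direct dependencies, yet its continued use relies on four packages being maintained.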

Our operating premise is that software that is supported by a healthy ecosystem [8] will be nurtured and sustained. This is easier for “Big Science” projects [3] that involve professional staff than it is for projects with one or two senior investigators and a few graduate students. GitHub and similar development support infrastructure facilitate some mechanical tasks, but small groups may not have a computing specialist and may have a hard time identifying a robust infrastructure and understanding how to use it. There is a substantial difference between a warehouse such as GitHub, which stores discrete pieces of software, and a development community, which stores software that contributes to the specific products developed by the community.

The National Science Foundation (NSF) report A VISION AND STRATEGY FOR SOFTWARE FOR SCIENCE, ENGINEERING, AND EDUCATION [9] recommends that NSF “Support the creation and maintenance of an innovative, integrated, reliable, sustainable and accessible ecosystem of software and services that advances scientific inquiry and application at unprecedented complexity and scale.” Taking a software ecosystem approach addresses both organizational issues and technical issues [2, 6]. An analysis of the ecosystem surrounding a project could assist in evaluating requests for funding software development. Such an analysis should include an evaluation of the strength of the community support for the software and the software’s fit with the larger context as defined by the ecosystem’s architecture. The community’s contributions to the software through add-ons, testing, and other continuing activities are an important factor [5]. The analysis might also include an evaluation of the product itself through the quality of the code, architecture, and supporting elements such as automated test cases [5].

This strategic view can be difficult to motivate in basic scientific research projects where the return on investment is even more indirect than for an open source product. The impacts of a research project and its intellectual merit should be considered in the context of value chain analysis to point out the balance between cost and value. Evaluating the potential of start-up companies and patents securing research results would also strengthen the business case.

This is not simply an economic issue. Scientific research must be reproducible. Changes to libraries somewhere in the supply chain may affect results and be virtually impossible to trace. Having access to the entire supply chain is essential to reproducibility. A scientific software ecosystem should support reproducibility, as does a commercial product development environment, by providing meta-data that identifies the exact tool chain and software component chains used to produce a specific set of results.
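As one illustration of such meta-data, the following minimal sketch records the interpreter, platform, and installed package versions that were present when a result was produced. It uses only the Python standard library; the function names capture_provenance and provenance_json and the choice of recorded fields are assumptions made for this example, not a format prescribed by any ecosystem discussed here.

```python
import json
import platform
import sys
from importlib import metadata


def capture_provenance():
    """Collect a snapshot of the interpreter, platform, and installed packages."""
    packages = {}
    for dist in metadata.distributions():
        name = dist.metadata.get("Name")
        if name:  # skip distributions whose metadata lacks a name
            packages[name] = dist.version
    return {
        "python_version": platform.python_version(),
        "implementation": platform.python_implementation(),
        "platform": platform.platform(),
        "executable": sys.executable,
        "packages": dict(sorted(packages.items())),
    }


def provenance_json():
    """Serialize the snapshot so it can be stored next to a set of results."""
    return json.dumps(capture_provenance(), indent=2, sort_keys=True)


if __name__ == "__main__":
    print(provenance_json())
```

Storing this record alongside each set of results gives a later reader a starting point for rebuilding the same environment. A fuller record would also need to cover compilers, system libraries, and input data, which this sketch does not attempt.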

There are numerous other issues regarding the sustainability of scientific research software. Many of these issues have surfaced at the Workshop on Sustainable Software for Science: Practice and Experiences (WSSSPE) series (http://wssspe.researchcomputing.org.uk/wssspe2/cfp/). For example, Allen and Schmidt pointed out issues with establishing a repository of code for a discipline, including the need for meta-data curation and giving the repository sufficient within the discipline. They state that “the greatest inhibitors relate to human nature, including the unwillingness of scientists to share their codes openly, the effect of the lack of an adequate reward system for software authorship, and the competitive environment in astronomy” [1]. Habermann et al. [4] look at sustainability from the point of view of data: “In order to be sustainable in the long-term, data must be preserved in well-documented, self-describing formats accessible on multiple platforms using many programming languages.”

Clemson University, a longtime member of the Eclipse Foundation, joined the Eclipse Science Working Group with the goal of participating in the formation of a model ecosystem that sustains scientific research software for a domain. As part of a National Science Foundation-funded project, we have already produced several studies and modified our ecosystem modeling technique to facilitate understanding the available software within an ecosystem [6, 7]. We look forward to participating in growing and maturing the community and to raising awareness of the issues and potential solutions to developing long-lived scientific research software.

This work was partially funded by the National Science Foundation grant #ACI-1343033.

  1. Alice Allen and Judy Schmidt. Looking before leaping: Creating a software registry. http://arxiv.org/abs/1407.5378, 2014.
  2. G. Chastek and J. D. McGregor, “It takes an ecosystem,” SSTC, 2012.
  3. The CRASH Report - 2011/12 (CAST Report on Application Software Health), http://www.castsoftware.com/resources/resource/whitepapers/cast-report-on-application-software-health?gad=otd.
  4. Habermann, Ted; Collette, Andrew; Vincena, Steve; Billings, Jay Jay; Gerring, Matt; Hinsen, Konrad; Benger, Werner; Maia, Filipe RNC; Byna, Suren; de Buyl, Pierre (2014): The Hierarchical Data Format (HDF): A Foundation for Sustainable Data and Software. http://dx.doi.org/10.6084/m9.figshare.1112485.
  5. John D. McGregor. A method for analyzing software product line ecosystems. First International Workshop on Software Ecosystems, 73-80, 2008.
  6. John Yates Monteith, John D. McGregor, and John E. Ingram. 2014. Proposed metrics on ecosystem health. In Proceedings of the 2014 ACM international workshop on Software-defined ecosystems (BigSystem '14). ACM, New York, NY, USA, 33-36. DOI=10.1145/2609441.2609643 http://doi.acm.org/10.1145/2609441.2609643.
  7. J. Yates Monteith, John D. McGregor, and John E. Ingram. 2014. Scientific Research Software Ecosystems. In Proceedings of the 2014 European Conference on Software Architecture Workshops (ECSAW '14). ACM, New York, NY, USA, Article 9, 6 pages. DOI=10.1145/2642803.2642812 http://doi.acm.org/10.1145/2642803.2642812.
  8. David G. Messerschmitt and Clemens Szyperski (2003). Software Ecosystem: Understanding an Indispensable Technology and Industry. Cambridge, MA, USA: MIT Press.
  9. National Science Foundation, A VISION AND STRATEGY FOR SOFTWARE FOR SCIENCE, ENGINEERING, AND EDUCATION CYBERINFRASTRUCTURE FRAMEWORK FOR THE 21ST CENTURY, www.nsf.gov/pubs/2012/nsf12113/nsf12113.pdf.

About the Authors

John D. McGregor
Clemson University