The T Experiments: Errors in Scientific Software

Hatton, L. The T Experiments: Errors in Scientific Software. Computational Science & Engineering, IEEE, 4(2):27–38, April 1997.

This paper covers two very large experiments carried out concurrently between 1990 and 1994, together known as the T-experiments. Experiment T1 had the objective of measuring the consistency of several million lines of scientific software written in C and Fortran 77 by static deep-flow analysis across many different industries and application areas, and experiment T2 had the objective of measuring the level of dynamic disagreement between independent implementations of the same algorithms acting on the same input data with the same parameters in just one of these industrial application areas. Experiment T1 showed that C and Fortran are riddled with statically detectable inconsistencies independent of the application area. For example, interface inconsistencies occur at the rate of one in every 7 interfaces on average in Fortran, and one in every 37 interfaces in C. They also show that Fortran components are typically 2.5 times bigger than C components, and that roughly 30% of the Fortran population and 10% of the C population would be deemed untestable by any standards. Experiment T2 was even more disturbing. Whereas scientists like to think that their results are accurate to the precision of the arithmetic used, in this study, the degree of agreement gradually degenerated from 6 significant figures to 1 significant figure during the computation. The reasons for this disagreement are laid squarely at the door of software failure, as other possible causes are considered and rejected. Taken with other evidence, these two experiments suggest that the results of scientific calculations involving significant amounts of software should be treated with the same measure of disbelief as an unconfirmed physical experiment.
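To make the idea of "agreement in significant figures" between independent implementations concrete, a minimal Python sketch follows. It assumes agreement is measured as -log10 of the relative difference between two outputs, a common convention that may differ from the exact definition used in experiment T2. The function names `agreeing_sig_figs` and `agreement_profile` are hypothetical, and all stage values are invented purely for illustration; they are not data from the paper.

```python
import math
from typing import List, Sequence


def agreeing_sig_figs(a: float, b: float) -> float:
    """Approximate number of significant decimal figures on which a and b agree.

    Assumption: agreement = -log10(|a - b| / max(|a|, |b|)). This is one
    common convention, not necessarily the measure used in experiment T2.
    """
    if a == b:
        return math.inf  # identical values agree to every figure
    scale = max(abs(a), abs(b))  # nonzero, because a != b
    return -math.log10(abs(a - b) / scale)


def agreement_profile(xs: Sequence[float], ys: Sequence[float]) -> List[float]:
    """Agreement at each processing stage between two implementations' outputs."""
    if len(xs) != len(ys):
        raise ValueError("both implementations must report the same stages")
    return [agreeing_sig_figs(x, y) for x, y in zip(xs, ys)]


if __name__ == "__main__":
    # Hypothetical outputs of two independent implementations at three
    # successive stages of the same computation (illustrative values only).
    impl_a = [1.234567, 2.71828, 0.5012]
    impl_b = [1.234568, 2.71901, 0.5311]
    for stage, figs in enumerate(agreement_profile(impl_a, impl_b), start=1):
        print(f"stage {stage}: ~{figs:.1f} significant figures agree")
    # prints roughly 6.1, 3.6 and 1.2 for the three stages
```

With real codes, each stage would hold many outputs; taking the minimum agreement over all of them at each stage, rather than a single sample, would give a more conservative picture.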

@article{hattonExperimentsErrorsScientific1997,
  title = {The {{T}} Experiments: Errors in Scientific Software},
  author = {Hatton, L.},
  date = {1997-04},
  journaltitle = {Computational Science \& Engineering, IEEE},
  volume = {4},
  pages = {27--38},
  issn = {1070-9924},
  doi = {10.1109/99.609829},
  url = {https://doi.org/10.1109/99.609829},
  abstract = {This paper covers two very large experiments carried out concurrently between 1990 and 1994, together known as the T-experiments. Experiment T1 had the objective of measuring the consistency of several million lines of scientific software written in C and Fortran 77 by static deep-flow analysis across many different industries and application areas, and experiment T2 had the objective of measuring the level of dynamic disagreement between independent implementations of the same algorithms acting on the same input data with the same parameters in just one of these industrial application areas. Experiment T1 showed that C and Fortran are riddled with statically detectable inconsistencies independent of the application area. For example, interface inconsistencies occur at the rate of one in every 7 interfaces on average in Fortran, and one in every 37 interfaces in C. They also show that Fortran components are typically 2.5 times bigger than C components, and that roughly 30\,\% of the Fortran population and 10\,\% of the C population would be deemed untestable by any standards. Experiment T2 was even more disturbing. Whereas scientists like to think that their results are accurate to the precision of the arithmetic used, in this study, the degree of agreement gradually degenerated from 6 significant figures to 1 significant figure during the computation. The reasons for this disagreement are laid squarely at the door of software failure, as other possible causes are considered and rejected. Taken with other evidence, these two experiments suggest that the results of scientific calculations involving significant amounts of software should be treated with the same measure of disbelief as an unconfirmed physical experiment.},
  keywords = {*imported-from-citeulike-INRMM,~INRMM-MiD:c-4211316,computational-science,software-engineering,software-errors,software-uncertainty},
  number = {2}
}
