Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study. Cała, J. & Missier, P. Big Data Research, 13:76-94, 2018. Big Medical/Healthcare Data Analytics
The value of knowledge assets generated by analytics processes using Data Science techniques tends to decay over time, as a consequence of changes in the elements the process depends on: external data sources, libraries, and system dependencies. For large-scale problems, refreshing those outcomes through greedy re-computation is both expensive and inefficient, as some changes have limited impact. In this paper we address the problem of refreshing past process outcomes selectively, that is, by trying to identify the subset of outcomes that will have been affected by a change, and by only re-executing fragments of the original process. We propose a technical approach to address the selective re-computation problem by combining multiple techniques, and present an extensive experimental study in Genomics, namely variant calling and their clinical interpretation, to show its effectiveness. In this case study, we are able to decrease the number of required re-computations on a cohort of individuals from 495 (blind) down to 71, and to reduce runtime by at least 60% relative to the naïve blind approach, and in some cases by 90%. Starting from this experience, we then propose a blueprint for a generic re-computation meta-process that makes use of process history metadata to make informed decisions about selective re-computations in reaction to a variety of changes in the data.
@article{CALA201876,
title = "Selective and Recurring Re-computation of Big Data Analytics Tasks: Insights from a Genomics Case Study",
journal = "Big Data Research",
volume = "13",
pages = "76--94",
year = "2018",
note = "Big Medical/Healthcare Data Analytics",
issn = "2214-5796",
doi = "10.1016/j.bdr.2018.06.001",
url = "http://www.sciencedirect.com/science/article/pii/S2214579617303520",
author = "Jacek Cała and Paolo Missier",
keywords = "Re-computation, Knowledge decay, Big data analysis, Genomics",
abstract = "The value of knowledge assets generated by analytics processes using Data Science techniques tends to decay over time, as a consequence of changes in the elements the process depends on: external data sources, libraries, and system dependencies. For large-scale problems, refreshing those outcomes through greedy re-computation is both expensive and inefficient, as some changes have limited impact. In this paper we address the problem of refreshing past process outcomes selectively, that is, by trying to identify the subset of outcomes that will have been affected by a change, and by only re-executing fragments of the original process. We propose a technical approach to address the selective re-computation problem by combining multiple techniques, and present an extensive experimental study in Genomics, namely variant calling and their clinical interpretation, to show its effectiveness. In this case study, we are able to decrease the number of required re-computations on a cohort of individuals from 495 (blind) down to 71, and to reduce runtime by at least 60\% relative to the naïve blind approach, and in some cases by 90\%. Starting from this experience, we then propose a blueprint for a generic re-computation meta-process that makes use of process history metadata to make informed decisions about selective re-computations in reaction to a variety of changes in the data."
}
