Clustering Provenance: Facilitating Provenance Exploration Through Data Abstraction.
Karsai, L.; Fekete, A.; Kay, J.; and Missier, P.
In
Proceedings of the Workshop on Human-In-the-Loop Data Analytics, HILDA '16, pages 6:1–6:5, New York, NY, USA, 2016. ACM
Paper
doi
link
bibtex
@inproceedings{karsai_clustering_2016,
address = {New York, NY, USA},
series = {{HILDA} '16},
title = {Clustering {Provenance}: {Facilitating} {Provenance} {Exploration} {Through} {Data} {Abstraction}},
isbn = {978-1-4503-4207-0},
url = {http://doi.acm.org/10.1145/2939502.2939508},
doi = {10.1145/2939502.2939508},
booktitle = {Proceedings of the {Workshop} on {Human}-{In}-the-{Loop} {Data} {Analytics}},
publisher = {ACM},
author = {Karsai, Linus and Fekete, Alan and Kay, Judy and Missier, Paolo},
year = {2016},
keywords = {provenance, large-scale graphs, visualisation},
pages = {6:1--6:5},
}
Alan Turing Institute Symposium on Reproducibility for Data-Intensive Research – Final Report.
Burgess, L. C.; Crotty, D.; de Roure, D.; Gibbons, J.; Goble, C.; Missier, P.; Mortier, R.; Nichols, T. E.; and O'Beirne, R.
2016.
Paper
link
bibtex
@article{burgess_alan_2016,
title = {Alan {Turing} {Institute} {Symposium} on {Reproducibility} for {Data}-{Intensive} {Research} – {Final} {Report}},
url = {https://dx.doi.org/10.6084/m9.figshare.3487382},
author = {Burgess, Lucie C and Crotty, David and de Roure, David and Gibbons, Jeremy and Goble, Carole and Missier, Paolo and Mortier, Richard and Nichols, Thomas E and O'Beirne, Richard},
year = {2016},
}
The data, they are a-changin'.
Missier, P.; Cala, J.; and Wijaya, E.
In Cohen-Boulakia, S., editor(s),
Proc. TAPP'16 (Theory and Practice of Provenance), Washington D.C., USA, 2016. USENIX Association
Paper
link
bibtex
abstract
@inproceedings{missier_data_2016,
address = {Washington D.C., USA},
title = {The data, they are a-changin'},
url = {https://arxiv.org/abs/1604.06412},
abstract = {The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One observation that is often overlooked, however, is that each of these elements is not immutable, rather they all evolve over time. This suggests that the value of such derivative knowledge may decay over time, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes. In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions. We illustrate the main ideas through a real-world case study in genomics, namely on the interpretation of human variants in support of genetic diagnosis.},
booktitle = {Proc. {TAPP}'16 ({Theory} and {Practice} of {Provenance})},
publisher = {USENIX Association},
author = {Missier, Paolo and Cala, Jacek and Wijaya, Eldarina},
editor = {Cohen-Boulakia, Sarah},
year = {2016},
keywords = {\#provenance, \#re-computation, \#big data processing, \#data change},
}
The cost of deriving actionable knowledge from large datasets has been decreasing thanks to a convergence of positive factors: low cost data generation, inexpensively scalable storage and processing infrastructure (cloud), software frameworks and tools for massively distributed data processing, and parallelisable data analytics algorithms. One observation that is often overlooked, however, is that each of these elements is not immutable, rather they all evolve over time. This suggests that the value of such derivative knowledge may decay over time, unless it is preserved by reacting to those changes. Our broad research goal is to develop models, methods, and tools for selectively reacting to changes by balancing costs and benefits, i.e. through complete or partial re-computation of some of the underlying processes. In this paper we present an initial model for reasoning about change and re-computations, and show how analysis of detailed provenance of derived knowledge informs re-computation decisions. We illustrate the main ideas through a real-world case study in genomics, namely on the interpretation of human variants in support of genetic diagnosis.
Analyzing Provenance across Heterogeneous Provenance Graphs.
Oliveira, W.; Missier, P.; Ocana, K.; de Oliveira, D.; and Braganholo, V.
In
Procs. IPAW 2016, Washington D.C., USA, 2016. Springer
link
bibtex
abstract
@inproceedings{oliveira_analyzing_2016,
address = {Washington D.C., USA},
title = {Analyzing {Provenance} across {Heterogeneous} {Provenance} {Graphs}},
abstract = {Provenance generated by different workflow systems is generally expressed using different formats. This is not an issue when scientists analyze provenance graphs in isolation, or when they use the same workflow system. However, analyzing heterogeneous provenance graphs from multiple systems poses a challenge. To address this problem we adopt ProvONE as an integration model, and show how different provenance databases can be converted to a global ProvONE schema. Scientists can then query this integrated database, exploring and linking provenance across several different workflows that may represent different implementations of the same experiment. To illustrate the feasibility of our approach, we developed conceptual mappings between the provenance databases of two workflow systems (e-Science Central and SciCumulus). We provide cartridges that implement these mappings and generate an integrated provenance database expressed as Prolog facts. To demonstrate its usage, we have developed Prolog rules that enable scientists to query the integrated database.},
booktitle = {Procs. {IPAW} 2016},
publisher = {Springer},
author = {Oliveira, Wellington and Missier, Paolo and Ocana, Kary and de Oliveira, Daniel and Braganholo, Vanessa},
year = {2016},
keywords = {\#provenance},
}
Provenance generated by different workflow systems is generally expressed using different formats. This is not an issue when scientists analyze provenance graphs in isolation, or when they use the same workflow system. However, analyzing heterogeneous provenance graphs from multiple systems poses a challenge. To address this problem we adopt ProvONE as an integration model, and show how different provenance databases can be converted to a global ProvONE schema. Scientists can then query this integrated database, exploring and linking provenance across several different workflows that may represent different implementations of the same experiment. To illustrate the feasibility of our approach, we developed conceptual mappings between the provenance databases of two workflow systems (e-Science Central and SciCumulus). We provide cartridges that implement these mappings and generate an integrated provenance database expressed as Prolog facts. To demonstrate its usage, we have developed Prolog rules that enable scientists to query the integrated database.
Tracking Dengue Epidemics using Twitter Content Classification and Topic Modelling.
Missier, P.; Romanovsky, A.; Miu, T.; Pal, A.; Daniilakis, M.; Garcia, A.; Cedrim, D.; and Sousa, L.
In
Procs. SoWeMine workshop, co-located with ICWE 2016, Lugano, Switzerland, 2016.
Paper
link
bibtex
abstract
@inproceedings{missier_tracking_2016,
address = {Lugano, Switzerland},
title = {Tracking {Dengue} {Epidemics} using {Twitter} {Content} {Classification} and {Topic} {Modelling}},
url = {http://arxiv.org/abs/1605.00968},
abstract = {Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brazil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not addressed the complexity of drastic seasonal changes on Twitter across multiple epidemic outbreaks. In order to address this gap, this paper contrasts two complementary approaches to detecting Twitter content that is relevant for Dengue outbreak detection, namely supervised classification and unsupervised clustering using topic modelling. Each approach has benefits and shortcomings. Our classifier achieves a prediction accuracy of about 80\% based on a small training set of about 1,000 instances, but the need for manual annotation makes it hard to track seasonal changes in the nature of the epidemics, such as the emergence of new types of virus in certain geographical locations. In contrast, LDA-based topic modelling scales well, generating cohesive and well-separated clusters from larger samples. While clusters can be easily re-generated following changes in epidemics, this approach makes it hard to clearly segregate relevant tweets into well-defined clusters.},
booktitle = {Procs. {SoWeMine} workshop, co-located with {ICWE} 2016},
author = {Missier, Paolo and Romanovsky, A and Miu, T and Pal, A and Daniilakis, M and Garcia, A and Cedrim, D and Sousa, L},
year = {2016},
keywords = {\#social media analytics, \#twitter analytics},
}
Detecting and preventing outbreaks of mosquito-borne diseases such as Dengue and Zika in Brazil and other tropical regions has long been a priority for governments in affected areas. Streaming social media content, such as Twitter, is increasingly being used for health vigilance applications such as flu detection. However, previous work has not addressed the complexity of drastic seasonal changes on Twitter across multiple epidemic outbreaks. In order to address this gap, this paper contrasts two complementary approaches to detecting Twitter content that is relevant for Dengue outbreak detection, namely supervised classification and unsupervised clustering using topic modelling. Each approach has benefits and shortcomings. Our classifier achieves a prediction accuracy of about 80% based on a small training set of about 1,000 instances, but the need for manual annotation makes it hard to track seasonal changes in the nature of the epidemics, such as the emergence of new types of virus in certain geographical locations. In contrast, LDA-based topic modelling scales well, generating cohesive and well-separated clusters from larger samples. While clusters can be easily re-generated following changes in epidemics, this approach makes it hard to clearly segregate relevant tweets into well-defined clusters.
Workload-aware streaming graph partitioning.
Firth, H.; and Missier, P.
In
Procs. GraphQ Workshop, co-located with EDBT'16, Bordeaux, 2016.
link
bibtex
@inproceedings{firth_workload-aware_2016,
address = {Bordeaux},
title = {Workload-aware streaming graph partitioning},
booktitle = {Procs. {GraphQ} {Workshop}, co-located with {EDBT}'16},
author = {Firth, Hugo and Missier, Paolo},
year = {2016},
}
Data trajectories: tracking reuse of published data for transitive credit attribution.
Missier, P.
International Journal of Digital Curation, 11(1): 1–16. 2016.
Paper
doi
link
bibtex
abstract
@article{missier_data_2016-1,
title = {Data trajectories: tracking reuse of published data for transitive credit attribution},
volume = {11},
url = {http://bibbase.org/network/publication/missier-datatrajectoriestrackingreuseofpublisheddatafortransitivecreditattribution-2016},
doi = {10.2218/ijdc.v11i1.425},
abstract = {The ability to measure the use and impact of published data sets is key to the success of the open data / open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which however is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and we show how this enables the design of accurate models for ascribing credit to data originators. A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors. We also show this model of transitive credit in action by means of a Data Reuse Simulator. Ultimately, our hope is that, in the longer term, credit models based on direct measures of data reuse will provide further incentives to data publication. We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances, in the wild.},
number = {1},
journal = {International Journal of Digital Curation},
author = {Missier, Paolo},
year = {2016},
keywords = {provenance, data reuse, data trajectories},
pages = {1--16},
}
The ability to measure the use and impact of published data sets is key to the success of the open data / open science paradigm. A direct measure of impact would require tracking data (re)use in the wild, which however is difficult to achieve. This is therefore commonly replaced by simpler metrics based on data download and citation counts. In this paper we describe a scenario where it is possible to track the trajectory of a dataset after its publication, and we show how this enables the design of accurate models for ascribing credit to data originators. A Data Trajectory (DT) is a graph that encodes knowledge of how, by whom, and in which context data has been re-used, possibly after several generations. We provide a theoretical model of DTs that is grounded in the W3C PROV data model for provenance, and we show how DTs can be used to automatically propagate a fraction of the credit associated with transitively derived datasets, back to original data contributors. We also show this model of transitive credit in action by means of a Data Reuse Simulator. Ultimately, our hope is that, in the longer term, credit models based on direct measures of data reuse will provide further incentives to data publication. We conclude by outlining a research agenda to address the hard questions of creating, collecting, and using DTs systematically across a large number of data reuse instances, in the wild.
Scalable and Efficient Whole-exome Data Processing Using Workflows on the Cloud.
Cala, J.; Marei, E.; Yu, Y.; Takeda, K.; and Missier, P.
Future Generation Computer Systems, In press (Special Issue: Big Data in the Cloud - Best Paper Award at the FGCS forum 2016). 2016.
link
bibtex
abstract
@article{cala_scalable_2016,
title = {Scalable and {Efficient} {Whole}-exome {Data} {Processing} {Using} {Workflows} on the {Cloud}},
volume = {In press},
abstract = {Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WFMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster, to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution based on the e-Science Central WFMS and deployed in the cloud clearly outperforms the original HPC-based implementation achieving up to 2.3x speed-up. However, in order to deliver such performance we describe the importance of optimising the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating parallelisation of complex pipelines are required.},
number = {Special Issue: Big Data in the Cloud - Best paper award at the FGCS forum 2016},
journal = {Future Generation Computer Systems},
author = {Cala, Jacek and Marei, Eyad and Yu, Yaobo and Takeda, Kenji and Missier, Paolo},
year = {2016},
keywords = {workflow, Performance analysis, Cloud computing, HPC, Whole-exome sequencing, Workflow-based application, cloud, genomics},
}
Dataflow-style workflows offer a simple, high-level programming model for flexible prototyping of scientific applications as an attractive alternative to low-level scripting. At the same time, workflow management systems (WFMS) may support data parallelism over big datasets by providing scalable, distributed deployment and execution of the workflow over a cloud infrastructure. In theory, the combination of these properties makes workflows a natural choice for implementing Big Data processing pipelines, common for instance in bioinformatics. In practice, however, correct workflow design for parallel Big Data problems can be complex and very time-consuming. In this paper we present our experience in porting a genomics data processing pipeline from an existing scripted implementation deployed on a closed HPC cluster, to a workflow-based design deployed on the Microsoft Azure public cloud. We draw two contrasting and general conclusions from this project. On the positive side, we show that our solution based on the e-Science Central WFMS and deployed in the cloud clearly outperforms the original HPC-based implementation achieving up to 2.3x speed-up. However, in order to deliver such performance we describe the importance of optimising the workflow deployment model to best suit the characteristics of the cloud computing infrastructure. The main reason for the performance gains was the availability of fast, node-local SSD disks delivered by D-series Azure VMs combined with the implicit use of local disk resources by e-Science Central workflow engines. These conclusions suggest that, on parallel Big Data problems, it is important to couple understanding of the cloud computing architecture and its software stack with simplicity of design, and that further efforts in automating parallelisation of complex pipelines are required.