Plumb: Efficient Stream Processing of Multi-User Pipelines. Qadeer, A. & Heidemann, J. Software—Practice and Experience, 51(2):385–408, 2020.
@Article{Qadeer20a,
        author =        "Abdul Qadeer and John Heidemann",
        title =         "Plumb: Efficient Stream Processing of Multi-User Pipelines",
        journal =       "Software---Practice and Experience",
        year =          2020,
        sortdate =      "2020-09-24",
        project =       "ant, lacanic, gawseed",
        jsubject =      "network_big_data",
        volume =     "51",
        number =     "2",
        pages =      "385--408",
        jlocation =   "johnh: pafile",
        keywords =   "big data, hadoop, plumb, DNS, streaming data,
                  data processing, workflow",
        url =           "https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer20a.html",
        pdfurl =        "https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer20a.pdf",
        blogurl =       "https://ant.isi.edu/blog/?p=1524",
        doi =           "10.1002/spe.2909",
	abstract = "Operational services run 24x7 and require analytics pipelines to
evaluate performance.  In mature services such as DNS, these pipelines
often grow to many stages developed by multiple, loosely-coupled
teams.  Such pipelines pose two problems:  first, computation and data
storage may be \emph{duplicated across components} developed by
different groups, wasting resources.  Second, processing can be
\emph{skewed}, with \emph{structural skew} occurring when different
pipeline stages need different amounts of resources, and
\emph{computational skew} occurring when a block of input data
requires increased resources.  Duplication and structural skew both
decrease efficiency, increasing cost, latency, or both.  Computational
skew can cause pipeline failure or deadlock when resource consumption
balloons; we have seen cases where pessimal traffic increases CPU
requirements 6-fold.  Detecting duplication is challenging when
components from multiple teams evolve independently and require fault
isolation.  Skew management is hard due to dynamic workloads coupled
with the conflicting goals of both minimizing latency and maximizing
utilization.  We propose \emph{Plumb}, a framework to abstract stream
processing as large-block streaming (LBS) for a multi-stage,
multi-user workflow.  Plumb users express analytics as a DAG of
processing modules, allowing Plumb to integrate and optimize workflows
from multiple users.  Many real-world applications map to the LBS
abstraction.  Plumb detects and eliminates duplicate computation and
storage, and it detects and addresses both structural and
computational skew by tracking computation across the pipeline.  We
exercise Plumb using the analytics pipeline for \BRoot DNS.  We
compare Plumb to a hand-tuned system, cutting latency to one-third the
original, and requiring $39\%$ fewer container hours, while supporting
more flexible, multi-user analytics and providing greater robustness
to DDoS-driven demands.",
}
