Efficient Processing of Streaming Data using Multiple Abstractions

Efficient Processing of Streaming Data using Multiple Abstractions. Qadeer, A. & Heidemann, J. In Proceedings of the IEEE International Conference on Cloud Computing, pages 157–167, Virtual, September, 2021. IEEE. Special paper award


Large websites and distributed systems employ sophisticated analytics to evaluate successes to celebrate and problems to be addressed. As analytics grow, different teams often require different frameworks, with dozens of packages supporting streaming and batch processing, SQL and no-SQL. Bringing multiple frameworks to bear on a large, changing dataset often creates challenges where data transitions; these impedance mismatches can create brittle glue logic and performance problems that consume developer time. We propose Plumb, a meta-framework that can bridge three different abstractions to meet the needs of a large class of applications in a common workflow. Large-block streaming (Block-Streaming) is suitable for single-pass applications that care about temporal and spatial locality. Windowed-Streaming allows applications to process a group of data and many reductions. Stateful-Streaming enables applications to keep a long-term state and always-on behavior. We show that it is possible to bridge abstractions, with a common, high-level workflow specification, while the system transitions data between batch processing and block- and record-level streaming as required. The challenge in bridging abstractions is to minimize latency while allowing applications to select between sequential and parallel operation, while handling out-of-order data delivery and component failures, and providing clear semantics in the face of missing data. We demonstrate these abstractions by evaluating a 10-stage workflow of DNS analytics that has been in production use with Plumb for 2 years, comparing it to a brittle hand-built system that has run for more than 3 years.
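
To make the three abstractions named in the abstract concrete, here is a minimal Python sketch of how block-level, windowed, and stateful stages might be chained in one workflow. It is an illustration only and is not Plumb's API: the function names (block_stream, windowed_counts, stateful_totals), the choice of one block per window, and the sample query names are all assumptions made for this example, and the sketch ignores the latency, failure-handling, and out-of-order concerns the paper addresses.

```python
from collections import Counter
from itertools import islice
from typing import Iterable, Iterator, List


def block_stream(records: Iterable[str], block_size: int) -> Iterator[List[str]]:
    """Block-level stage: yield consecutive blocks in arrival order (single pass)."""
    it = iter(records)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def windowed_counts(blocks: Iterable[List[str]]) -> Iterator[Counter]:
    """Windowed stage: reduce each window (assumed here to be one block) to counts."""
    for block in blocks:
        yield Counter(block)


def stateful_totals(windows: Iterable[Counter]) -> Iterator[Counter]:
    """Stateful stage: keep running totals across every window seen so far."""
    totals: Counter = Counter()
    for window in windows:
        totals.update(window)
        yield Counter(totals)  # emit a snapshot so later updates do not alter it


# Hypothetical input: a short stream of DNS query names.
queries = ["a.com", "b.org", "a.com", "c.net", "a.com", "b.org"]
snapshots = list(stateful_totals(windowed_counts(block_stream(queries, 3))))
```

Each stage is a generator, so records flow through the chain one block at a time rather than being materialized in full; only the stateful stage carries information from one window to the next.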

@InProceedings{Qadeer21b,
        author =        "Abdul Qadeer and John Heidemann",
        title =         "Efficient Processing of Streaming Data using Multiple Abstractions",
        booktitle =     "Proceedings of the " # " IEEE International Conference on Cloud Computing",
        year =          2021,
	sortdate = 		"2021-09-05", 
	project = "ant, lacanic, gawseed",
	jsubject = "network_big_data",
        pages =      "157--167",
	note = "Special paper award",
        month =      sep,
        address =    "Virtual",
        publisher =  "IEEE",
        jlocation =   "johnh: pafile",
        keywords =   "big data, hadoop, plumb, DNS, streaming data,
                  data processing, workflow",
	url =		"https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.html",
	pdfurl =	"https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.pdf",
	doi =	"https://doi.org/10.1109/CLOUD53861.2021.00029",
	blogurl = "https://ant.isi.edu/blog/?p=1760",
	abstract = "Large websites and distributed systems employ sophisticated analytics
to evaluate successes to celebrate and problems to be addressed.  As
analytics grow, different teams often require different frameworks,
with dozens of packages supporting streaming and batch
processing, SQL and no-SQL.  Bringing multiple frameworks to bear on a
large, changing dataset often creates challenges where data
transitions---these impedance mismatches can create brittle glue logic
and performance problems that consume developer time.  We propose
Plumb, a meta-framework that can bridge three different abstractions
to meet the needs of a large class of applications in a common
workflow.  \emph{Large-block streaming} (Block-Streaming) is
suitable for single-pass applications that care about the temporal and
spatial locality.  \emph{Windowed-Streaming} allows applications
to process a group of data and many reductions. \emph{Stateful-Streaming} 
enables applications to keep a long-term
state and always-on behavior.  We show that it is possible to bridge
abstractions, with a common, high-level workflow specification, while
the system transitions data between batch processing and block- and
record-level streaming as required.  The challenge in bridging
abstractions is to minimize latency while allowing applications to
select between sequential and parallel operation, while handling
out-of-order data delivery, component failures, and providing clear
semantics in the face of missing data.  We demonstrate these
abstractions evaluating a 10-stage workflow of DNS analytics that has
been in production use with Plumb for 2 years, comparing to a brittle
hand-built system that has run for more than 3 years.",
}
