Efficient Processing of Streaming Data using Multiple Abstractions. Qadeer, A. & Heidemann, J. In Proceedings of the IEEE International Conference on Cloud Computing, pages 157–167, Virtual, September, 2021. IEEE. Special paper award
Large websites and distributed systems employ sophisticated analytics to evaluate successes to celebrate and problems to be addressed. As analytics grow, different teams often require different frameworks, with dozens of packages supporting streaming and batch processing, SQL and no-SQL. Bringing multiple frameworks to bear on a large, changing dataset often creates challenges where data transitions; these impedance mismatches can create brittle glue logic and performance problems that consume developer time. We propose Plumb, a meta-framework that can bridge three different abstractions to meet the needs of a large class of applications in a common workflow. Large-block streaming (Block-Streaming) is suitable for single-pass applications that care about the temporal and spatial locality. Windowed-Streaming allows applications to process a group of data and many reductions. Stateful-Streaming enables applications to keep a long-term state and always-on behavior. We show that it is possible to bridge abstractions, with a common, high-level workflow specification, while the system transitions data between batch processing and block- and record-level streaming as required. The challenge in bridging abstractions is to minimize latency while allowing applications to select between sequential and parallel operation, while handling out-of-order data delivery, component failures, and providing clear semantics in the face of missing data. We demonstrate these abstractions by evaluating a 10-stage workflow of DNS analytics that has been in production use with Plumb for 2 years, comparing to a brittle hand-built system that has run for more than 3 years.
@InProceedings{Qadeer21b,
        author =        "Abdul Qadeer and John Heidemann",
        title =         "Efficient Processing of Streaming Data using Multiple Abstractions",
        booktitle =     "Proceedings of the IEEE International Conference on Cloud Computing",
        year =          2021,
	sortdate = 		"2021-09-05", 
	project = "ant, lacanic, gawseed",
	jsubject = "network_big_data",
        pages =      "157--167",
	note = "Special paper award",
        month =      sep,
        address =    "Virtual",
        publisher =  "IEEE",
        jlocation =   "johnh: pafile",
        keywords =   "big data, hadoop, plumb, DNS, streaming data,
                  data processing, workflow",
	url =		"https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.html",
	pdfurl =	"https://ant.isi.edu/%7ejohnh/PAPERS/Qadeer21b.pdf",
	doi =	"10.1109/CLOUD53861.2021.00029",
	blogurl = "https://ant.isi.edu/blog/?p=1760",
	abstract = "Large websites and distributed systems employ sophisticated analytics
to evaluate successes to celebrate and problems to be addressed.  As
analytics grow, different teams often require different frameworks,
with dozens of packages supporting streaming and batch
processing, SQL and no-SQL.  Bringing multiple frameworks to bear on a
large, changing dataset often creates challenges where data
transitions---these impedance mismatches can create brittle glue logic
and performance problems that consume developer time.  We propose
Plumb, a meta-framework that can bridge three different abstractions
to meet the needs of a large class of applications in a common
workflow.  \emph{Large-block streaming} (Block-Streaming) is
suitable for single-pass applications that care about the temporal and
spatial locality.  \emph{Windowed-Streaming} allows applications
to process a group of data and many reductions. \emph{Stateful-Streaming} 
enables applications to keep a long-term
state and always-on behavior.  We show that it is possible to bridge
abstractions, with a common, high-level workflow specification, while
the system transitions data between batch processing and block- and
record-level streaming as required.  The challenge in bridging
abstractions is to minimize latency while allowing applications to
select between sequential and parallel operation, while handling
out-of-order data delivery, component failures, and providing clear
semantics in the face of missing data.  We demonstrate these
abstractions by evaluating a 10-stage workflow of DNS analytics that has
been in production use with Plumb for 2 years, comparing to a brittle
hand-built system that has run for more than 3 years.",
}
