Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended). Heidemann, J., Pradkin, Y., & Nisar, A. Technical Report ISI-TR-724, USC/Information Sciences Institute, February, 2018.
Internet reliability has many potential weaknesses: fiber rights-of-way at the physical layer, exchange-point congestion from DDOS at the network layer, settlement disputes between organizations at the financial layer, and government intervention at the political layer. This paper shows that we can discover common points-of-failure at any of these layers by observing correlated failures. We use end-to-end observations from data-plane-level connectivity of edge hosts in the Internet. We identify correlations in connectivity: networks that usually fail and recover at the same time suggest a common point-of-failure. We define two new algorithms to meet these goals. First, we define a computationally-efficient algorithm to create a linear ordering of blocks to make correlated failures apparent to a human analyst. Second, we develop an event-based clustering algorithm that directly identifies networks with correlated failures, suggesting common points-of-failure. Our algorithms scale to real-world datasets of millions of networks and observations: linear ordering is O(n log n) time and event-based clustering parallelizes with Map/Reduce. We demonstrate them on three months of outages for 4 million /24 network prefixes, showing high recall (0.83 to 0.98) and precision (0.72 to 1.0) for blocks that respond. We also show that our algorithms generalize to identify correlations in anycast catchments and routing.
@TechReport{Heidemann18b,
	author = 	"John Heidemann and Yuri Pradkin and Aqib Nisar",
	title = 	"Back Out: End-to-end Inference of Common
                  Points-of-Failure in the Internet (extended)",
	institution = 	"USC/Information Sciences Institute",
	year = 		2018,
	sortdate = 		"2018-02-02", 
	project = "ant, lacanic, retrofuturebridge, duoi",
	jsubject = "routing",
	number =	"ISI-TR-724",
	month =		feb,
	jlocation =	"johnh: pafile",
	keywords =	"network outage detection, clustering, visualization",
	url =		"https://ant.isi.edu/%7ejohnh/PAPERS/Heidemann18b.html",
	pdfurl =	"https://ant.isi.edu/%7ejohnh/PAPERS/Heidemann18b.pdf",
	myorganization =	"USC/Information Sciences Institute",
	copyrightholder = "authors",
	abstract = "
Internet reliability has many potential weaknesses:  fiber
rights-of-way at the physical layer, exchange-point congestion from
DDOS at the network layer, settlement disputes between organizations
at the financial layer, and government intervention at the political
layer.  This paper shows that we
can \emph{discover common points-of-failure}
at \emph{any} of these layers by observing
correlated failures.  We use \emph{end-to-end} observations from
data-plane-level connectivity of edge hosts in the Internet.  We
identify \emph{correlations in connectivity:}  networks that usually
fail and recover at the same time suggest a common point-of-failure.  We
define two new algorithms to meet these goals.  First, we define a
computationally-efficient algorithm to create a \emph{linear ordering}
of blocks to make correlated failures apparent to a human analyst.
Second, we develop an \emph{event-based clustering} algorithm that
directly identifies networks with correlated failures, suggesting common
points-of-failure.  Our algorithms scale to real-world datasets of
millions of networks and observations:  linear ordering
is O(n log n) time and event-based clustering parallelizes with Map/Reduce.  We
demonstrate them on three months of outages for 4 million
/24 network prefixes, showing high recall (0.83 to 0.98) and precision
(0.72 to 1.0) for blocks that respond.  We also show that our
algorithms generalize to identify correlations in anycast catchments
and routing.
",
}
