Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended)

Back Out: End-to-end Inference of Common Points-of-Failure in the Internet (extended). Heidemann, J., Pradkin, Y., & Nisar, A. Technical Report ISI-TR-724, USC/Information Sciences Institute, February, 2018.


Internet reliability has many potential weaknesses: fiber rights-of-way at the physical layer, exchange-point congestion from DDOS at the network layer, settlement disputes between organizations at the financial layer, and government intervention at the political layer. This paper shows that we can discover common points-of-failure at any of these layers by observing correlated failures. We use end-to-end observations from data-plane-level connectivity of edge hosts in the Internet. We identify correlations in connectivity: networks that usually fail and recover at the same time suggest a common point-of-failure. We define two new algorithms to meet these goals. First, we define a computationally-efficient algorithm to create a linear ordering of blocks to make correlated failures apparent to a human analyst. Second, we develop an event-based clustering algorithm that directly identifies networks with correlated failures, suggesting common points-of-failure. Our algorithms scale to real-world datasets of millions of networks and observations: linear ordering is O(n log n) time and event-based clustering parallelizes with Map/Reduce. We demonstrate them on three months of outages for 4 million /24 network prefixes, showing high recall (0.83 to 0.98) and precision (0.72 to 1.0) for blocks that respond. We also show that our algorithms generalize to identify correlations in anycast catchments and routing.
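The idea that networks failing and recovering at the same time suggest a common point-of-failure can be illustrated with a small sketch. This is not the report's algorithm: it is a simplified stand-in that assumes each block's outages are given as (down, up) timestamps in seconds and treats two events as simultaneous when they fall in the same fixed-width time bin. The function name `cluster_by_events`, the `bin_seconds` parameter, and the sample prefixes are all hypothetical.

```python
from collections import defaultdict


def cluster_by_events(outages, bin_seconds=600):
    """Group blocks whose outage events fall in the same time bins.

    outages: dict mapping a block (e.g. "192.0.2.0/24") to a list of
             (down_time, up_time) pairs in seconds.
    Returns a sorted list of sorted block lists, one per shared signature.
    """
    groups = defaultdict(list)
    for block, events in outages.items():
        # "Map" step: reduce each block to a hashable signature of its
        # outage events, quantized to bins (an assumption of this sketch).
        signature = frozenset(
            (down // bin_seconds, up // bin_seconds) for down, up in events
        )
        if signature:  # blocks with no outages carry no evidence
            groups[signature].append(block)
    # "Reduce" step: blocks sharing a signature form one candidate cluster.
    return sorted(sorted(blocks) for blocks in groups.values() if len(blocks) > 1)


# Hypothetical sample: the first two blocks fail and recover together.
sample = {
    "192.0.2.0/24": [(1000, 4000)],
    "198.51.100.0/24": [(1100, 4100)],
    "203.0.113.0/24": [(9000, 9500)],
}
clusters = cluster_by_events(sample)
print(clusters)
```

Keying each block by its event signature keeps the grouping independent per block, which is why a step like this splits naturally into map and reduce phases. Fixed bins are a crude notion of "same time": two events that straddle a bin boundary land in different groups, so a real implementation would need a more tolerant match.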

@TechReport{Heidemann18b,
	author = 	"John Heidemann and Yuri Pradkin and Aqib Nisar",
	title = 	"Back Out: End-to-end Inference of Common
                  Points-of-Failure in the Internet (extended)",
	institution = 	"USC/Information Sciences Institute",
	year = 		2018,
	sortdate = 		"2018-02-02", 
	project = "ant, lacanic, retrofuturebridge, duoi",
	jsubject = "routing",
	number =	"ISI-TR-724",
	month =		feb,
	jlocation =	"johnh: pafile",
	keywords =	"network outage detection, clustering, visualization",
	url =		"https://ant.isi.edu/%7ejohnh/PAPERS/Heidemann18b.html",
	pdfurl =	"https://ant.isi.edu/%7ejohnh/PAPERS/Heidemann18b.pdf",
	myorganization =	"USC/Information Sciences Institute",
	copyrightholder = "authors",
	abstract = "
Internet reliability has many potential weaknesses:  fiber
rights-of-way at the physical layer, exchange-point congestion from
DDOS at the network layer, settlement disputes between organizations
at the financial layer, and government intervention at the political
layer.  This paper shows that we
can \emph{discover common points-of-failure}
at \emph{any} of these layers by observing
correlated failures.  We use \emph{end-to-end} observations from
data-plane-level connectivity of edge hosts in the Internet.  We
identify \emph{correlations in connectivity:}  networks that usually
fail and recover at the same time suggest a common point-of-failure.  We
define two new algorithms to meet these goals.  First, we define a
computationally-efficient algorithm to create a \emph{linear ordering}
of blocks to make correlated failures apparent to a human analyst.
Second, we develop an \emph{event-based clustering} algorithm that
directly identifies networks with correlated failures, suggesting common
points-of-failure.  Our algorithms scale to real-world datasets of
millions of networks and observations:  linear ordering
is O(n log n) time and event-based clustering parallelizes with Map/Reduce.  We
demonstrate them on three months of outages for 4 million
/24 network prefixes, showing high recall (0.83 to 0.98) and precision
(0.72 to 1.0) for blocks that respond.  We also show that our
algorithms generalize to identify correlations in anycast catchments
and routing.
",
}
