Detecting Internet Outages with Precise Active Probing (extended). Quan, L., Heidemann, J., & Pradkin, Y. Technical Report ISI-TR-2012-678, USC/Information Sciences Institute, February 2012. (This TR supersedes ISI-TR-2011-672.)
Parts of the Internet are down every day, from the intentional shutdown of the Egyptian Internet in Jan. 2011 and the results of natural disasters such as the Mar. 2011 Japanese earthquake, to the thousands of small, daily outages caused by localized accidents or human error. In this paper we present a new system to detect network outages by active probing. We show that a single PC can track outages across the entire analyzable IPv4 Internet, probing a sample of 20 addresses in all 2.5M responsive /24 address blocks. We develop new algorithms to identify and visualize outages and to cluster those outages into network-level events. We carefully validate our approach to active probing, showing consistent results over two years of observations taken from three different sites. Using public BGP archives and news sources we confirm 83% of large events. We also examine a random sample of 50 observed events, confirming prior work showing that small outages often do not appear in control-plane messages, since only 38% of small events include partial control-plane information. Emulating controlled outages, we show that our approach detects 100% of full-block outages that last at least twice our probing interval. We show that our system is significantly more accurate than prior approaches that use a single representative for each routed block, cutting the number of outage mis-classifications from 44% to under 8%. Finally, we report on Internet stability as a whole, and the size and duration of typical outages. We find that about 0.3% of the Internet is likely to be unreachable at any time, suggesting the Internet provides only 2.5 "nines" of availability. By providing a baseline estimate of Internet outages, we lay the groundwork to evaluate ISP reliability.
@TechReport{Quan12a_120200,
	author = 	"Lin Quan and John Heidemann and Yuri Pradkin",
	title = 	"Detecting Internet Outages with Precise
                  Active Probing (extended)",
	institution = 	"USC/Information Sciences Institute",
	year = 		2012,
	sortdate = 		"2012-02-01",
	project = "ant, lacrend, lander, madcat",
	jsubject = "omitted",
	number = 	"ISI-TR-2012-678",
	month = 	feb,
	location = 	"johnh: pafile",
	note = "(This TR supersedes ISI-TR-2011-672.)",
	keywords = 	"routing outage detection, active probing,
                  network outages, revision of [Quan11a]",
	url =		"http://www.isi.edu/%7ejohnh/PAPERS/Quan12a.html",
	pdfurl =	"http://www.isi.edu/%7ejohnh/PAPERS/Quan12a.pdf",
	otherurl =	"ftp://ftp.isi.edu/isi-pubs/tr-678.pdf",
	myorganization =	"USC/Information Sciences Institute",
	copyrightholder = "authors",
	abstract = "Parts of the Internet are down \emph{every day}, from the intentional
shutdown of the Egyptian Internet in Jan. 2011 and the results of
natural disasters such as the Mar. 2011 Japanese earthquake, to the
thousands of small, daily outages caused by localized accidents or
human error.  In this paper we present a new system to detect network
outages by active probing.  We show that a single PC can track outages
across the entire analyzable IPv4 Internet, probing a sample of 20
addresses in all 2.5M responsive /24 address blocks.  We develop new
algorithms to identify and visualize outages and to cluster those
outages into network-level events.  We carefully validate our approach
to active probing, showing consistent results over two years of
observations taken from three different sites.  Using public BGP
archives and news sources we confirm 83\% of large events.  We also
examine a random sample of 50 observed events, confirming prior work
showing that small outages often do not appear in control-plane
messages, since only 38\% of small events include partial
control-plane information.  Emulating controlled outages, we show that
our approach detects 100\% of full-block outages that last at least
twice our probing interval.  We show that our system is significantly
more accurate than prior approaches that use a single representative
for each routed block, cutting the number of outage
mis-classifications from~44\% to under~8\%.  Finally, we report on
Internet stability as a whole, and the size and duration of typical
outages.  We find that about 0.3\% of the Internet is likely to be
unreachable at any time, suggesting the Internet provides only 2.5
``nines'' of availability.  By providing a baseline estimate of
Internet outages, we lay the groundwork to evaluate ISP reliability.
",
}
