FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems. Fisk, M. E. & Hash, C. L. In Proceedings of the 4th International Workshop on Cloud Data and Platforms, pages to appear, Amsterdam, the Netherlands, April, 2014. ACM.
In this paper we present FileMap, an open-source, alternative map-reduce-based computing system that we have developed and utilized over the last 5 years. This system features several significant design decisions and performance aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is to have a system for scheduling and orchestrating parallel and distributed data processing, but that does not interpose itself between data and the serial programs that process data. FileMap manages the organization of input and output files and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and maximize the ability of its users to benefit from specialized compute platforms, file systems, and software. We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their own data and participate in analysis of data by moving map/reduce processing upstream to where the data is collected. Our measurements on SMP systems, clusters built for Hadoop, and this distributed cloud show that FileMap outperforms more prevalent computing systems and models by factors between 2x (compared to Hadoop) and 14x (cloud vs. centralized). FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to data-intensive compute clusters to distributed sensing and analysis clouds. This paper introduces our fundamental design decisions in Section 1 and specifics of the implementation in Section 2. 
Section 3 describes the use of embedded databases within our system. Section 4 presents a performance benchmark measured across a variety of computing environments.
@InProceedings{Fisk14a,
	author = 	"Michael E. Fisk and Curtis L. Hash",
	title = 	"FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems",
	booktitle = 	"Proceedings of the " # "4th" # " International Workshop on Cloud Data and Platforms",
	year = 		2014,
	sortdate = 		"2014-04-01",
	pages = 	"to appear",
	month = 	apr,
	address = 	"Amsterdam, the Netherlands",
	publisher = 	"ACM",
	doi = "http://dx.doi.org/10.1145/2592784.2592790",
	jlocation = 	"johnh: pafile",
	keywords = 	"map/reduce, file map, lanl, retro-future",
	myorganization = "LANL",
	codeurl = 	"http://mfisk.github.io/filemap",
	project = 	"ant, retrofuture",
	abstract = "
In this paper we present FileMap, an open-source, alternative map-reduce-based computing system that we have
developed and utilized over the last 5 years. This system features several significant design decisions and performance
aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is
to have a system for scheduling and orchestrating parallel
and distributed data processing, but that does not interpose
itself between data and the serial programs that process data.
FileMap manages the organization of input and output files
and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is
stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and
maximize the ability of its users to benefit from specialized
compute platforms, file systems, and software.
We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust
or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their
own data and participate in analysis of data by moving
map/reduce processing upstream to where the data is collected.
Our measurements on SMP systems, clusters built for Hadoop, and
this distributed cloud show that FileMap outperforms more
prevalent computing systems and models by factors between
2x (compared to Hadoop) and 14x (cloud vs. centralized).

FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to
data-intensive compute clusters to distributed sensing and
analysis clouds.
This paper introduces our fundamental design decisions
in Section 1 and specifics of the implementation in Section
2. Section 3 describes the use of embedded databases within
our system. Section 4 presents a performance benchmark
measured across a variety of computing environments.
"
}
