FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems

FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems. Fisk, M. E. & Hash, C. L. In Proceedings of the 4thInternational Workshop on Cloud Data and Platforms, pages to appear, Amsterdam, the Netherlands, April, 2014. ACM.
doi abstract bibtex

In this paper we present FileMap, an open-source,1 alternative map-reduce-based computing system that we have developed and utilized over the last 5 years. This system features several significant design decisions and performance aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is to have a system for scheduling and orchestrating parallel and distributed data processing, but that does not interpose itself between data and the serial programs that process data. FileMap manages the organization of input and output files and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and maximize the ability of its users to benefit from specialized compute platforms, file systems, and software. We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their own data and participate in analysis of data by moving map/reduce processing upstream to where the data is collected. Our on SMP systems, clusters built for Hadoop, and this distributed cloud, show that FileMap outperforms more prevalent computing systems and models by factors between 2x (compared to Hadoop) and 14x (cloud vs. centralized). FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to data-intensive compute clusters to distributed sensing and analysis clouds. This paper introduces our fundamental design decisions in Section 1 and specifics of the implementation in Section 2. Section 3 describes the use of embedded databases within our system. Section 4 presents a performance benchmark measured across a variety of computing environments.

@InProceedings{Fisk14a,
author = "Michael E. Fisk and Curtis L. Hash",
title = "FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems",
booktitle = "Proceedings of the " # "4th" # " International Workshop on Cloud Data and Platforms",
year = 2014,
sortdate = "2014-04-01",
pages = "to appear",
month = apr,
address = "Amsterdam, the Netherlands",
publisher = "ACM",
doi = "http://dx.doi.org/10.1145/2592784.2592790",
jlocation = "johnh: pafile",
keywords = "map/reduce, file map, lanl, retro-future",
myorganization = "LANL",
codeurl = "http://mfisk.github.io/filemap",
project = "ant, retrofuture",
abstract = "
In this paper we present FileMap, an open-source,1 alternative map-reduce-based computing system that we have
developed and utilized over the last 5 years. This system features several significant design decisions and performance
aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is
to have a system for scheduling and orchestrating parallel
and distributed data processing, but that does not interpose
itself between data and the serial programs that process data.
FileMap manages the organization of input and output files
and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is
stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and
maximize the ability of its users to benefit from specialized
compute platforms, file systems, and software.
We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust
or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their
own data and participate in analysis of data by moving
map/reduce processing upstream to where the data is collected.
Our on SMP systems, clusters built for Hadoop, and
this distributed cloud, show that FileMap outperforms more
prevalent computing systems and models by factors between
2x (compared to Hadoop) and 14x (cloud vs. centralized).

FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to
data-intensive compute clusters to distributed sensing and
analysis clouds.
This paper introduces our fundamental design decisions
in Section 1 and specifics of the implementation in Section
2. Section 3 describes the use of embedded databases within
our system. Section 4 presents a performance benchmark
measured across a variety of computing environments.
"
}

Downloads: 0

{"_id":"RWY4w9o2YwwaTECNM","bibbaseid":"fisk-hash-filemapmapreduceprogramexecutiononlooselycoupleddistributedsystems-2014","author_short":["Fisk, M. E.","Hash, C. L."],"bibdata":{"bibtype":"inproceedings","type":"inproceedings","author":[{"firstnames":["Michael","E."],"propositions":[],"lastnames":["Fisk"],"suffixes":[]},{"firstnames":["Curtis","L."],"propositions":[],"lastnames":["Hash"],"suffixes":[]}],"title":"FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems","booktitle":"Proceedings of the 4thInternational Workshop on Cloud Data and Platforms","year":"2014","sortdate":"2014-04-01","pages":"to appear","month":"April","address":"Amsterdam, the Netherlands","publisher":"ACM","doi":"http://dx.doi.org/10.1145/2592784.2592790","jlocation":"johnh: pafile","keywords":"map/reduce, file map, lanl, retro-future","myorganization":"LANL","codeurl":"http://mfisk.github.io/filemap","project":"ant, retrofuture","abstract":"In this paper we present FileMap, an open-source,1 alternative map-reduce-based computing system that we have developed and utilized over the last 5 years. This system features several significant design decisions and performance aspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is to have a system for scheduling and orchestrating parallel and distributed data processing, but that does not interpose itself between data and the serial programs that process data. FileMap manages the organization of input and output files and the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is stored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and maximize the ability of its users to benefit from specialized compute platforms, file systems, and software. We measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust or synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their own data and participate in analysis of data by moving map/reduce processing upstream to where the data is collected. Our on SMP systems, clusters built for Hadoop, and this distributed cloud, show that FileMap outperforms more prevalent computing systems and models by factors between 2x (compared to Hadoop) and 14x (cloud vs. centralized). FileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to data-intensive compute clusters to distributed sensing and analysis clouds. This paper introduces our fundamental design decisions in Section 1 and specifics of the implementation in Section 2. Section 3 describes the use of embedded databases within our system. Section 4 presents a performance benchmark measured across a variety of computing environments. ","bibtex":"@InProceedings{Fisk14a,\n\tauthor = \t\"Michael E. Fisk and Curtis L. Hash\",\n\ttitle = \t\"FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems\",\n\tbooktitle = \t\"Proceedings of the \" # \"4th\" # \" International Workshop on Cloud Data and Platforms\",\n\tyear = \t\t2014,\n\tsortdate = \t\t\"2014-04-01\",\n\tpages = \t\"to appear\",\n\tmonth = \tapr,\n\taddress = \t\"Amsterdam, the Netherlands\",\n\tpublisher = \t\"ACM\",\n\tdoi = \"http://dx.doi.org/10.1145/2592784.2592790\",\n\tjlocation = \t\"johnh: pafile\",\n\tkeywords = \t\"map/reduce, file map, lanl, retro-future\",\n\tmyorganization = \"LANL\",\n\tcodeurl = \t\"http://mfisk.github.io/filemap\",\n\tproject = \t\"ant, retrofuture\",\n\tabstract = \"\nIn this paper we present FileMap, an open-source,1 alternative map-reduce-based computing system that we have\ndeveloped and utilized over the last 5 years. This system features several significant design decisions and performance\naspects that are not found in prevalent map-reduce systems such as Hadoop [16]. The prevailing design goal is\nto have a system for scheduling and orchestrating parallel\nand distributed data processing, but that does not interpose\nitself between data and the serial programs that process data.\nFileMap manages the organization of input and output files\nand the scheduling of program execution, but does not process files itself and is agnostic to the format in which data is\nstored and/or indexed. We layer on top of existing, ubiquitous file systems, security models, and network access software in order to minimize the complexity of FileMap and\nmaximize the ability of its users to benefit from specialized\ncompute platforms, file systems, and software.\nWe measure the performance of FileMap in several instantiations including a heterogeneous “cloud” conglomeration of computers and storage distributed across multiple owning organizations with no cross-organization trust\nor synchronization. This “cloud” model intentionally supports distributed sensor systems in which nodes collect their\nown data and participate in analysis of data by moving\nmap/reduce processing upstream to where the data is collected.\nOur on SMP systems, clusters built for Hadoop, and\nthis distributed cloud, show that FileMap outperforms more\nprevalent computing systems and models by factors between\n2x (compared to Hadoop) and 14x (cloud vs. centralized).\n\nFileMap provides a single programming model that allows processing to seamlessly scale from a single laptop to\ndata-intensive compute clusters to distributed sensing and\nanalysis clouds.\nThis paper introduces our fundamental design decisions\nin Section 1 and specifics of the implementation in Section\n2. Section 3 describes the use of embedded databases within\nour system. Section 4 presents a performance benchmark\nmeasured across a variety of computing environments.\n\"\n}\n\n","author_short":["Fisk, M. E.","Hash, C. L."],"bibbaseid":"fisk-hash-filemapmapreduceprogramexecutiononlooselycoupleddistributedsystems-2014","role":"author","urls":{},"keyword":["map/reduce","file map","lanl","retro-future"],"metadata":{"authorlinks":{}}},"bibtype":"inproceedings","biburl":"https://bibbase.org/f/dHevizJoWEhWowz8q/johnh-2023-2.bib","dataSources":["YLyu3mj3xsBeoqiHK","fLZcDgNSoSuatv6aX","fxEParwu2ZfurScPY","7nuQvtHTqKrLmgu99"],"keywords":["map/reduce","file map","lanl","retro-future"],"search_terms":["filemap","map","reduce","program","execution","loosely","coupled","distributed","systems","fisk","hash"],"title":"FileMap: Map-Reduce Program Execution on Loosely-Coupled Distributed Systems","year":2014}