HOG: Distributed hadoop MapReduce on the grid

Chen He, Derek Weitzel, David Swanson, Ying Lu

Research output: Chapter in Book/Report/Conference proceeding (Conference contribution)

22 Citations (Scopus)

Abstract

MapReduce is a powerful data processing platform for commercial and academic applications. In this paper, we build Hadoop On the Grid (HOG), a novel Hadoop MapReduce framework that runs on the Open Science Grid, which spans multiple institutions across the United States. Unlike previous MapReduce platforms, which run on dedicated environments such as clusters or clouds, HOG provides a free, elastic, and dynamic MapReduce environment on the opportunistic resources of the grid. In HOG, we improve Hadoop's fault tolerance for wide-area data analysis by mapping data centers across the U.S. to virtual racks and creating multi-institution failure domains. Our modifications to the Hadoop framework are transparent to existing Hadoop MapReduce applications. In the evaluation, we successfully extend HOG to 1100 nodes on the grid. Additionally, we evaluate HOG with a simulated Facebook Hadoop MapReduce workload. We conclude that HOG's rapid scalability can provide comparable performance to a dedicated Hadoop cluster.
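
The virtual-rack idea in the abstract can be pictured with Hadoop's rack awareness, which can call a user-supplied topology script that receives host names or IP addresses as arguments and prints one rack path per input. The following is a minimal sketch under stated assumptions, not the authors' implementation: it assumes the site can be read from the last two labels of a worker's host name, and the function name `site_rack` and the example host names are hypothetical.

```python
import sys

# Hadoop's conventional fallback rack for hosts that cannot be placed.
DEFAULT_RACK = "/default-rack"


def site_rack(hostname: str) -> str:
    """Map a worker host name to a virtual rack named after its site domain."""
    labels = hostname.strip().lower().rstrip(".").split(".")
    # Bare host names and dotted IPv4 addresses carry no site information.
    if len(labels) < 2 or all(label.isdigit() for label in labels):
        return DEFAULT_RACK
    # Assumption: the last two labels identify the institution (e.g. "example.edu").
    return "/" + ".".join(labels[-2:])


if __name__ == "__main__":
    # Print one rack path per host passed on the command line.
    print("\n".join(site_rack(host) for host in sys.argv[1:]))
```

With such a mapping, every node at one institution falls into the same virtual rack, so rack-aware replica placement would tend to spread copies of a block across institutions. If the cluster passes IP addresses rather than names, this sketch places every node in the default rack, so a real deployment would need a lookup table or reverse DNS instead.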

Original language: English (US)
Title of host publication: Proceedings - 2012 SC Companion
Subtitle of host publication: High Performance Computing, Networking Storage and Analysis, SCC 2012
Pages: 1276-1283
Number of pages: 8
DOIs: https://doi.org/10.1109/SC.Companion.2012.154
State: Published - Dec 1 2012
Event: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 - Salt Lake City, UT, United States
Duration: Nov 10 2012 to Nov 16 2012

Publication series

Name: Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

Conference

Conference: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012
Country: United States
City: Salt Lake City, UT
Period: 11/10/12 to 11/16/12

Fingerprint

Fault tolerance
Scalability

Keywords

  • Grid computing
  • MapReduce
  • Middleware

ASJC Scopus subject areas

  • Computational Theory and Mathematics
  • Computer Science Applications
  • Software

Cite this

He, C., Weitzel, D., Swanson, D., & Lu, Y. (2012). HOG: Distributed hadoop MapReduce on the grid. In Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012 (pp. 1276-1283). [6495936] (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012). https://doi.org/10.1109/SC.Companion.2012.154

HOG: Distributed hadoop MapReduce on the grid. / He, Chen; Weitzel, Derek; Swanson, David; Lu, Ying.

Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. p. 1276-1283 6495936 (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012).

He, C, Weitzel, D, Swanson, D & Lu, Y 2012, HOG: Distributed hadoop MapReduce on the grid. in Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012., 6495936, Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, pp. 1276-1283, 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012, Salt Lake City, UT, United States, 11/10/12. https://doi.org/10.1109/SC.Companion.2012.154
He C, Weitzel D, Swanson D, Lu Y. HOG: Distributed hadoop MapReduce on the grid. In Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. p. 1276-1283. 6495936. (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012). https://doi.org/10.1109/SC.Companion.2012.154
He, Chen; Weitzel, Derek; Swanson, David; Lu, Ying. / HOG: Distributed hadoop MapReduce on the grid. Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012. 2012. pp. 1276-1283 (Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012).
@inproceedings{fce9df0233ab4a5d9bc45506590d0d01,
title = "HOG: Distributed hadoop MapReduce on the grid",
abstract = "MapReduce is a powerful data processing platform for commercial and academic applications. In this paper, we build a novel Hadoop MapReduce framework executed on the Open Science Grid which spans multiple institutions across the United States - Hadoop On the Grid (HOG). It is different from previous MapReduce platforms that run on dedicated environments like clusters or clouds. HOG provides a free, elastic, and dynamic MapReduce environment on the opportunistic resources of the grid. In HOG, we improve Hadoop's fault tolerance for wide area data analysis by mapping data centers across the U.S. to virtual racks and creating multi-institution failure domains. Our modifications to the Hadoop framework are transparent to existing Hadoop MapReduce applications. In the evaluation, we successfully extend HOG to 1100 nodes on the grid. Additionally, we evaluate HOG with a simulated Facebook Hadoop MapReduce workload. We conclude that HOG's rapid scalability can provide comparable performance to a dedicated Hadoop cluster.",
keywords = "Grid computing, MapReduce, Middleware",
author = "Chen He and Derek Weitzel and David Swanson and Ying Lu",
year = "2012",
month = "12",
day = "1",
doi = "10.1109/SC.Companion.2012.154",
language = "English (US)",
isbn = "9780769549569",
series = "Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012",
pages = "1276--1283",
booktitle = "Proceedings - 2012 SC Companion",

}

TY - GEN

T1 - HOG

T2 - Distributed hadoop MapReduce on the grid

AU - He, Chen

AU - Weitzel, Derek

AU - Swanson, David

AU - Lu, Ying

PY - 2012/12/1

Y1 - 2012/12/1

N2 - MapReduce is a powerful data processing platform for commercial and academic applications. In this paper, we build a novel Hadoop MapReduce framework executed on the Open Science Grid which spans multiple institutions across the United States - Hadoop On the Grid (HOG). It is different from previous MapReduce platforms that run on dedicated environments like clusters or clouds. HOG provides a free, elastic, and dynamic MapReduce environment on the opportunistic resources of the grid. In HOG, we improve Hadoop's fault tolerance for wide area data analysis by mapping data centers across the U.S. to virtual racks and creating multi-institution failure domains. Our modifications to the Hadoop framework are transparent to existing Hadoop MapReduce applications. In the evaluation, we successfully extend HOG to 1100 nodes on the grid. Additionally, we evaluate HOG with a simulated Facebook Hadoop MapReduce workload. We conclude that HOG's rapid scalability can provide comparable performance to a dedicated Hadoop cluster.

AB - MapReduce is a powerful data processing platform for commercial and academic applications. In this paper, we build a novel Hadoop MapReduce framework executed on the Open Science Grid which spans multiple institutions across the United States - Hadoop On the Grid (HOG). It is different from previous MapReduce platforms that run on dedicated environments like clusters or clouds. HOG provides a free, elastic, and dynamic MapReduce environment on the opportunistic resources of the grid. In HOG, we improve Hadoop's fault tolerance for wide area data analysis by mapping data centers across the U.S. to virtual racks and creating multi-institution failure domains. Our modifications to the Hadoop framework are transparent to existing Hadoop MapReduce applications. In the evaluation, we successfully extend HOG to 1100 nodes on the grid. Additionally, we evaluate HOG with a simulated Facebook Hadoop MapReduce workload. We conclude that HOG's rapid scalability can provide comparable performance to a dedicated Hadoop cluster.

KW - Grid computing

KW - MapReduce

KW - Middleware

UR - http://www.scopus.com/inward/record.url?scp=84876542016&partnerID=8YFLogxK

UR - http://www.scopus.com/inward/citedby.url?scp=84876542016&partnerID=8YFLogxK

U2 - 10.1109/SC.Companion.2012.154

DO - 10.1109/SC.Companion.2012.154

M3 - Conference contribution

AN - SCOPUS:84876542016

SN - 9780769549569

T3 - Proceedings - 2012 SC Companion: High Performance Computing, Networking Storage and Analysis, SCC 2012

SP - 1276

EP - 1283

BT - Proceedings - 2012 SC Companion

ER -