Evaluating the impact of data placement to spark and SciDB with an Earth Science use case

Khoa Doan, Amidu O. Oloso, Kwo Sen Kuo, Thomas L. Clune, Hongfeng Yu, Brian Nelson, Jian Zhang

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

4 Citations (Scopus)

Abstract

We investigate the impact of data placement on two Big Data technologies, Spark and SciDB, with a use case from Earth Science where data arrays are multidimensional. Simultaneously, this investigation provides an opportunity to evaluate the performance of the technologies involved. Two datastores, HDFS and Cassandra, are used with Spark for our comparison. It is found that Spark with Cassandra performs better than with HDFS, but SciDB performs better yet than Spark with either datastore. The investigation also underscores the value of having data aligned for the most common analysis scenarios in advance on a shared nothing architecture. Otherwise, repartitioning needs to be carried out on the fly, degrading overall performance.
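As a rough illustration of the abstract's last point, and not the authors' code or benchmark, the following toy model counts how many cells of a small two-dimensional (time by location) array would have to change nodes when the stored layout does not match the layout an analysis needs. The function names, the 8 by 8 array shape, the four-node cluster, and the slab-based placement rule are all hypothetical choices made for this sketch.

```python
"""Toy model of data placement on a shared-nothing cluster (illustrative only)."""


def owner_by_time(t, x, n_time, n_space, n_nodes):
    """Node holding cell (t, x) when the array is split into contiguous time slabs."""
    return t * n_nodes // n_time


def owner_by_space(t, x, n_time, n_space, n_nodes):
    """Node holding cell (t, x) when the array is split into contiguous spatial slabs."""
    return x * n_nodes // n_space


def cells_to_move(n_time, n_space, n_nodes, stored, needed):
    """Count cells whose stored node differs from the node the analysis needs."""
    return sum(
        stored(t, x, n_time, n_space, n_nodes) != needed(t, x, n_time, n_space, n_nodes)
        for t in range(n_time)
        for x in range(n_space)
    )


if __name__ == "__main__":
    shape = dict(n_time=8, n_space=8, n_nodes=4)
    # Hypothetical analysis: a per-location time series, which wants each
    # location's full history on a single node (spatial slabs).
    aligned = cells_to_move(stored=owner_by_space, needed=owner_by_space, **shape)
    misaligned = cells_to_move(stored=owner_by_time, needed=owner_by_space, **shape)
    print(aligned, misaligned)
```

In this toy setup the aligned layout moves no cells, while the time-slab layout leaves only the cells that happen to sit on the right node in place, so most of the array would cross the network before the analysis could start. Real systems add chunk overlap, replication, and scheduling effects that this sketch ignores.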

Original language: English (US)
Title of host publication: Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
Editors: Ronay Ak, George Karypis, Yinglong Xia, Xiaohua Tony Hu, Philip S. Yu, James Joshi, Lyle Ungar, Ling Liu, Aki-Hiro Sato, Toyotaro Suzumura, Sudarsan Rachuri, Rama Govindaraju, Weijia Xu
Publisher: Institute of Electrical and Electronics Engineers Inc.
Pages: 341-346
Number of pages: 6
ISBN (Electronic): 9781467390040
DOIs: https://doi.org/10.1109/BigData.2016.7840621
State: Published - Jan 1 2016
Event: 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States
Duration: Dec 5 2016 to Dec 8 2016

Other

Other: 4th IEEE International Conference on Big Data, Big Data 2016
Country: United States
City: Washington
Period: 12/5/16 to 12/8/16

Keywords

  • data layout
  • multidimensional arrays
  • SciDB
  • Spark

ASJC Scopus subject areas

  • Computer Networks and Communications
  • Information Systems
  • Hardware and Architecture

Cite this

Doan, K., Oloso, A. O., Kuo, K. S., Clune, T. L., Yu, H., Nelson, B., & Zhang, J. (2016). Evaluating the impact of data placement to spark and SciDB with an Earth Science use case. In R. Ak, G. Karypis, Y. Xia, X. T. Hu, P. S. Yu, J. Joshi, L. Ungar, L. Liu, A-H. Sato, T. Suzumura, S. Rachuri, R. Govindaraju, ... W. Xu (Eds.), Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 (pp. 341-346). [7840621] Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/BigData.2016.7840621

@inproceedings{72f942a8fb4b471191af2764ec0cb7f2,
title = "Evaluating the impact of data placement to spark and SciDB with an Earth Science use case",
abstract = "We investigate the impact of data placement on two Big Data technologies, Spark and SciDB, with a use case from Earth Science where data arrays are multidimensional. Simultaneously, this investigation provides an opportunity to evaluate the performance of the technologies involved. Two datastores, HDFS and Cassandra, are used with Spark for our comparison. It is found that Spark with Cassandra performs better than with HDFS, but SciDB performs better yet than Spark with either datastore. The investigation also underscores the value of having data aligned for the most common analysis scenarios in advance on a shared nothing architecture. Otherwise, repartitioning needs to be carried out on the fly, degrading overall performance.",
keywords = "data layout, multidimensional arrays, SciDB, Spark",
author = "Khoa Doan and Oloso, {Amidu O.} and Kuo, {Kwo Sen} and Clune, {Thomas L.} and Hongfeng Yu and Brian Nelson and Jian Zhang",
year = "2016",
month = "1",
day = "1",
doi = "10.1109/BigData.2016.7840621",
language = "English (US)",
pages = "341--346",
editor = "Ronay Ak and George Karypis and Yinglong Xia and Hu, {Xiaohua Tony} and Yu, {Philip S.} and James Joshi and Lyle Ungar and Ling Liu and Aki-Hiro Sato and Toyotaro Suzumura and Sudarsan Rachuri and Rama Govindaraju and Weijia Xu",
booktitle = "Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016",
publisher = "Institute of Electrical and Electronics Engineers Inc.",

}

TY - GEN
T1 - Evaluating the impact of data placement to spark and SciDB with an Earth Science use case
AU - Doan, Khoa
AU - Oloso, Amidu O.
AU - Kuo, Kwo Sen
AU - Clune, Thomas L.
AU - Yu, Hongfeng
AU - Nelson, Brian
AU - Zhang, Jian
PY - 2016/1/1
Y1 - 2016/1/1
N2 - We investigate the impact of data placement on two Big Data technologies, Spark and SciDB, with a use case from Earth Science where data arrays are multidimensional. Simultaneously, this investigation provides an opportunity to evaluate the performance of the technologies involved. Two datastores, HDFS and Cassandra, are used with Spark for our comparison. It is found that Spark with Cassandra performs better than with HDFS, but SciDB performs better yet than Spark with either datastore. The investigation also underscores the value of having data aligned for the most common analysis scenarios in advance on a shared nothing architecture. Otherwise, repartitioning needs to be carried out on the fly, degrading overall performance.
KW - data layout
KW - multidimensional arrays
KW - SciDB
KW - Spark
UR - http://www.scopus.com/inward/record.url?scp=85015240035&partnerID=8YFLogxK
UR - http://www.scopus.com/inward/citedby.url?scp=85015240035&partnerID=8YFLogxK
U2 - 10.1109/BigData.2016.7840621
DO - 10.1109/BigData.2016.7840621
M3 - Conference contribution
SP - 341
EP - 346
BT - Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016
A2 - Ak, Ronay
A2 - Karypis, George
A2 - Xia, Yinglong
A2 - Hu, Xiaohua Tony
A2 - Yu, Philip S.
A2 - Joshi, James
A2 - Ungar, Lyle
A2 - Liu, Ling
A2 - Sato, Aki-Hiro
A2 - Suzumura, Toyotaro
A2 - Rachuri, Sudarsan
A2 - Govindaraju, Rama
A2 - Xu, Weijia
PB - Institute of Electrical and Electronics Engineers Inc.
ER -