Improving short job latency performance in hybrid job schedulers with Dice

Wei Zhou, K. Preston White, Hongfeng Yu

Research output: Chapter in Book/Report/Conference proceeding › Conference contribution

Abstract

Enterprise data centers commonly run a mixture of long batch jobs and latency-sensitive short jobs. Hybrid job schedulers have recently emerged as attractive alternatives to conventional centralized job schedulers. In this paper, we conduct trace-driven experiments to study the job-completion-delay performance of two representative hybrid job schedulers (Hawk and Eagle), and find that short jobs still encounter long latencies due to the fluctuating, bursty nature of workloads. To address this, we propose Dice, a general performance optimization framework for hybrid job schedulers that alleviates the high job-completion-delay problem of short jobs. Dice is composed of two simple yet effective techniques: Elastic Sizing and Opportunistic Preemption. Both techniques keep track of the task waiting times of short jobs. When the mean task waiting time of short jobs is high, Elastic Sizing dynamically and adaptively increases the short partition size to prioritize short jobs over long jobs. Opportunistic Preemption, in turn, preempts resources on demand from long tasks running in the general partition, so as to mitigate the "head-of-line" blocking problem of short jobs. We enhance the two schedulers with Dice and evaluate the resulting performance improvement in our prototype implementation. Under the Google trace, experimental results show that Dice improves the 50th-percentile, 75th-percentile, and 90th-percentile job completion delays of short jobs by 50.9%, 54.5%, and 43.5% respectively in Hawk, and by 33.2%, 74.1%, and 85.3% respectively in Eagle, at low performance cost to long jobs.
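The two techniques named in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the class name `DiceSketch`, the waiting-time threshold, the resize step, the cap on the short partition, and the sliding window are all hypothetical choices made only to show the control flow the abstract describes (track short-task waiting times, grow the short partition when the mean is high, and pick a long task in the general partition to preempt when a short task is blocked).

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DiceSketch:
    """Toy model of Elastic Sizing and Opportunistic Preemption (illustrative only)."""
    total_nodes: int
    short_partition: int           # number of nodes reserved for short jobs
    wait_threshold: float = 5.0    # hypothetical "high waiting time" trigger, in seconds
    step: int = 1                  # hypothetical resize granularity, in nodes
    # Hypothetical sliding window of recent short-task waiting times.
    window: deque = field(default_factory=lambda: deque(maxlen=100))

    def record_wait(self, seconds: float) -> None:
        """Both techniques rely on tracked waiting times of short-job tasks."""
        self.window.append(seconds)

    def mean_wait(self) -> float:
        return sum(self.window) / len(self.window) if self.window else 0.0

    def elastic_sizing(self) -> int:
        """Grow the short partition when the mean short-task wait is high."""
        cap = self.total_nodes // 2  # assumed cap so long jobs keep capacity
        if self.mean_wait() > self.wait_threshold:
            self.short_partition = min(self.short_partition + self.step, cap)
        return self.short_partition

    def opportunistic_preemption(self, general_nodes: dict, short_task_wait: float):
        """Return the id of a general-partition node whose long task should be
        preempted for a blocked short task, or None if no preemption is needed."""
        if short_task_wait <= self.wait_threshold:
            return None
        for node_id, running in general_nodes.items():
            if running == "long":
                return node_id
        return None


sched = DiceSketch(total_nodes=10, short_partition=2)
sched.record_wait(1.0)
size_before = sched.elastic_sizing()   # mean wait is low, size is unchanged
sched.record_wait(20.0)
size_after = sched.elastic_sizing()    # mean wait is now high, size grows by one step
victim = sched.opportunistic_preemption({"n1": "short", "n2": "long"}, short_task_wait=9.0)
```

The sketch only ever grows the short partition, because that is the only direction the abstract states; when and how the partition shrinks again is left out rather than guessed.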

Original language: English (US)
Title of host publication: Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019
Publisher: Association for Computing Machinery
ISBN (Electronic): 9781450362955
DOI: https://doi.org/10.1145/3337821.3337851
State: Published - Aug 5 2019
Event: 48th International Conference on Parallel Processing, ICPP 2019 - Kyoto, Japan
Duration: Aug 5 2019 - Aug 8 2019

Publication series

Name: ACM International Conference Proceeding Series

Conference

Conference: 48th International Conference on Parallel Processing, ICPP 2019
Country: Japan
City: Kyoto
Period: 8/5/19 - 8/8/19


Keywords

  • Big data
  • Job scheduling
  • Resource management

ASJC Scopus subject areas

  • Human-Computer Interaction
  • Computer Networks and Communications
  • Computer Vision and Pattern Recognition
  • Software

Cite this

Zhou, W., White, K. P., & Yu, H. (2019). Improving short job latency performance in hybrid job schedulers with Dice. In Proceedings of the 48th International Conference on Parallel Processing, ICPP 2019 [a56] (ACM International Conference Proceeding Series). Association for Computing Machinery. https://doi.org/10.1145/3337821.3337851

