Utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters

The Journal of Supercomputing

Abstract

Optimizing energy consumption in heterogeneous GPU clusters is of paramount importance for enhancing overall system efficiency and reducing operational costs. However, the diversity of GPU types in such clusters poses challenges for energy optimization. In this paper, we propose a utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. We use a feature-correlation-based method to select feature vectors and improve the prediction accuracy of our GPU utilization prediction model. By constructing an energy consumption model and combining it with the predicted utilization, our approach avoids wasting energy on idle GPUs and reduces the energy consumption of the heterogeneous GPU cluster. To validate the approach, we conduct experiments on real Alibaba trace data. The results show that our model predicts utilization well for four types of GPUs, with average RMSE, MAE, and \(R^{2}\) values of 7.08, 3.81, and 0.93, respectively. Moreover, we compute the energy consumption of the GPU cluster before and after the adjustment and estimate the energy savings, which average 35.55%.
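To make the reported quantities concrete, the sketch below illustrates a correlation-based feature selection step and the RMSE, MAE, and \(R^{2}\) metrics used to evaluate the utilization predictions, together with the relative energy-saving calculation. It is a minimal illustration with assumed variable and column names, not the authors' implementation.

```python
import numpy as np
import pandas as pd

def select_features_by_correlation(df: pd.DataFrame, target: str = "gpu_util", k: int = 5):
    """Keep the k features most correlated (in absolute value) with the target.

    Column names and k are illustrative assumptions, not the paper's actual
    feature set.
    """
    corr = df.select_dtypes("number").corr()[target].drop(target).abs()
    return corr.sort_values(ascending=False).head(k).index.tolist()

def regression_metrics(y_true, y_pred):
    """RMSE, MAE, and R^2 between observed and predicted GPU utilization."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    rmse = float(np.sqrt(np.mean((y_true - y_pred) ** 2)))
    mae = float(np.mean(np.abs(y_true - y_pred)))
    ss_res = float(np.sum((y_true - y_pred) ** 2))
    ss_tot = float(np.sum((y_true - np.mean(y_true)) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

def energy_saving_percent(energy_before, energy_after):
    """Relative energy saving of the cluster after the adjustment, in percent."""
    return 100.0 * (energy_before - energy_after) / energy_before
```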

Data availability

The dataset is available at https://github.com/alibaba/clusterdata/.
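As a starting point for working with the trace, the snippet below shows how one might load GPU utilization records with pandas. The file name and column name are assumptions made for illustration and should be checked against the repository's README.

```python
import pandas as pd

# Illustrative only: "pai_sensor_table.csv" and the "gpu_util" column are
# assumed names; consult https://github.com/alibaba/clusterdata/ for the
# actual file layout of the 2020 GPU trace.
df = pd.read_csv("pai_sensor_table.csv")
gpu_util = df["gpu_util"].astype(float)
print(gpu_util.describe())
```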

Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61472256, 61170277), the Technology Development Fund Project of the University of Shanghai for Science and Technology (Nos. 16KJFZ035, 2017KJFZ033), and the Anhui Provincial Natural Science Foundation General Project (Nos. KJ2020B08, KJ2020B04).

Author information


Contributions

SW and SC contributed to the conception of the study; SW and YS performed the experiments; SW, SC and YS contributed significantly to the analysis and manuscript preparation; SW, SC and YS performed the data analyses and wrote the manuscript; SW and SC helped perform the analysis with constructive discussions.

Corresponding author

Correspondence to Shiping Chen.

Ethics declarations

Conflict of interest

The authors do not have any conflicts of interest to declare.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.

About this article

Cite this article

Wang, S., Chen, S. & Shi, Y. Utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. J Supercomput 80, 9554–9578 (2024). https://doi.org/10.1007/s11227-023-05807-x
