Abstract
Optimizing energy consumption in heterogeneous GPU clusters is important for improving overall system efficiency and reducing operational costs. However, the diversity of GPU types in such clusters makes energy optimization challenging. In this paper, we propose a utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. We use a feature correlation-based method to select feature vectors, which improves the accuracy of our GPU utilization prediction model. By constructing an energy consumption model and combining it with the predicted utilization, our approach avoids wasting energy on idle GPUs and reduces the energy consumption of the heterogeneous GPU cluster. To validate our approach, we conduct experiments on real Alibaba data. The results show that our model predicts utilization well for four types of GPUs, with average RMSE, MAE and \(R^{2}\) values of 7.08, 3.81 and 0.93, respectively. Moreover, we calculate the energy consumption of the GPU cluster before and after the adjustment and estimate average energy savings of 35.55%.
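The quantities named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the use of Pearson correlation for feature selection, the 0.5 threshold, the relative-reduction formula for energy savings, and all function and feature names are assumptions chosen for illustration. Only the metric definitions (RMSE, MAE, \(R^{2}\)) are standard.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # a constant column carries no linear signal
    return cov / math.sqrt(vx * vy)


def select_features(features, target, threshold=0.5):
    """Keep feature names whose |r| with the target meets the threshold.

    The threshold value is a hypothetical choice, not taken from the paper.
    """
    return [name for name, col in features.items()
            if abs(pearson(col, target)) >= threshold]


def rmse(y_true, y_pred):
    """Root mean squared error."""
    return math.sqrt(sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true))


def mae(y_true, y_pred):
    """Mean absolute error."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def r2(y_true, y_pred):
    """Coefficient of determination; undefined for a constant target."""
    mean = sum(y_true) / len(y_true)
    ss_tot = sum((t - mean) ** 2 for t in y_true)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when the target is constant")
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    return 1.0 - ss_res / ss_tot


def energy_saving_pct(before, after):
    """Relative energy reduction in percent (assumed definition)."""
    return 100.0 * (before - after) / before
```

With these helpers, a candidate feature is kept only if it is linearly related to the observed GPU utilization, and a prediction is scored against the observed values with the three metrics reported above.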
Data availability
The dataset is available at: https://github.com/alibaba/clusterdata/.
Funding
National Natural Science Foundation of China project (Nos. 61472256, 61170277); Technology Development Fund Project of University of Shanghai for Science and Technology (Nos. 16KJFZ035, 2017KJFZ033); Anhui Provincial Natural Science Foundation General Project (Nos. KJ2020B08, KJ2020B04).
Author information
Contributions
SW and SC contributed to the conception of the study; SW and YS performed the experiment; SW, SC and YS contributed significantly to analysis and manuscript preparation; SW, SC and YS performed the data analyses and wrote the manuscript; SW and SC helped perform the analysis with constructive discussions.
Ethics declarations
Conflict of interest
The authors do not have any conflict of interest to declare.
Ethical approval
This declaration is “not applicable.”
Rights and permissions
Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.
About this article
Cite this article
Wang, S., Chen, S. & Shi, Y. Utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. J Supercomput 80, 9554–9578 (2024). https://doi.org/10.1007/s11227-023-05807-x
DOI: https://doi.org/10.1007/s11227-023-05807-x