
Utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters

Published in: The Journal of Supercomputing

Abstract

Optimizing energy consumption in heterogeneous GPU clusters is of paramount importance for enhancing overall system efficiency and reducing operational costs. However, the diversity of GPU types in such clusters complicates energy optimization. In this paper, we propose a utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. We use a feature correlation-based method to select feature vectors and improve the accuracy of our GPU utilization prediction model. By constructing an energy consumption model and combining it with the predicted utilization, our approach avoids wasting energy on idle GPUs and reduces the energy consumption of the heterogeneous GPU cluster. To validate the approach, we conduct experiments on real Alibaba trace data. The results show that our model predicts utilization well for four types of GPUs, with average RMSE, MAE and \(R^{2}\) values of 7.08, 3.81 and 0.93, respectively. Moreover, we calculate the energy consumption of the GPU cluster before and after the adjustment and estimate the energy savings at an average of 35.55%.
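The metrics reported in the abstract (RMSE, MAE, \(R^{2}\)) and the energy-savings percentage follow standard formulas, and the correlation-based feature selection can be sketched with a Pearson-correlation filter. The snippet below is an illustrative sketch, not the paper's implementation: the function names, the 0.3 correlation threshold, and the use of Pearson correlation as the selection criterion are assumptions for demonstration only.

```python
import numpy as np

def evaluate_prediction(y_true, y_pred):
    """Compute RMSE, MAE and R^2 for a utilization prediction (standard definitions)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    err = y_true - y_pred
    rmse = float(np.sqrt(np.mean(err ** 2)))
    mae = float(np.mean(np.abs(err)))
    ss_res = float(np.sum(err ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return rmse, mae, r2

def select_features(X, y, threshold=0.3):
    """Illustrative correlation-based selection: keep columns of X whose
    absolute Pearson correlation with y meets the (assumed) threshold."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    keep = []
    for j in range(X.shape[1]):
        r = np.corrcoef(X[:, j], y)[0, 1]
        if abs(r) >= threshold:
            keep.append(j)
    return keep

def energy_savings(before_kwh, after_kwh):
    """Percentage of energy saved by the adjustment."""
    return 100.0 * (before_kwh - after_kwh) / before_kwh
```

For example, predicting utilizations [12, 18, 33, 39] against true values [10, 20, 30, 40] yields an MAE of 2.0, and reducing consumption from 100 kWh to 64.45 kWh corresponds to the 35.55% savings figure quoted above.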


[Figures 1–11 and Algorithms 1–2 appear in the full article.]


Data availability

The dataset is available at: https://github.com/alibaba/clusterdata/.


Funding

This work was supported by the National Natural Science Foundation of China (Nos. 61472256, 61170277), the Technology Development Fund Project of the University of Shanghai for Science and Technology (Nos. 16KJFZ035, 2017KJFZ033), and Anhui Provincial Natural Science Foundation General Projects (Nos. KJ2020B08, KJ2020B04).

Author information

Authors and Affiliations

Authors

Contributions

SW and SC conceived the study; SW and YS performed the experiments; SW, SC and YS analyzed the data and wrote the manuscript; SW and SC assisted the analysis through constructive discussions.

Corresponding author

Correspondence to Shiping Chen.

Ethics declarations

Conflict of interest

The authors do not have any conflicts of interest to declare.

Ethical approval

Not applicable.

Additional information

Publisher's Note

Springer Nature remains neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Rights and permissions

Springer Nature or its licensor (e.g. a society or other partner) holds exclusive rights to this article under a publishing agreement with the author(s) or other rightsholder(s); author self-archiving of the accepted manuscript version of this article is solely governed by the terms of such publishing agreement and applicable law.


About this article


Cite this article

Wang, S., Chen, S. & Shi, Y. Utilization-prediction-aware energy optimization approach for heterogeneous GPU clusters. J Supercomput 80, 9554–9578 (2024). https://doi.org/10.1007/s11227-023-05807-x

