Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions

Wang, Yaohua; Li, Chen; Liu, Chang; Liu, Sheng; Lei, Yuanwu; Zhang, Jian; Zhang, Yang; Guo, Yang

doi:10.1007/s42514-020-00057-2

Survey Paper
Published: 31 March 2021

Volume 3, pages 114–125 (2021)

CCF Transactions on High Performance Computing

Yaohua Wang ORCID: orcid.org/0000-0002-9556-5535¹,
Chen Li¹,
Chang Liu¹,
Sheng Liu¹,
Yuanwu Lei¹,
Jian Zhang¹,
Yang Zhang¹ &
Yang Guo¹


Abstract

Digital signal processors (DSPs) are widely used in embedded domains, where they deliver high performance at ultra-low power consumption. This promise makes them attractive for domains in which DSPs were previously not an option. To show how DSPs live up to this promise, we review two milestone DSPs, FT-Matrix and FT-Matrix2, designed by the National University of Defense Technology with the purpose of advancing DSPs into the era of high-performance computing, AI, and beyond. We demonstrate that the key challenges lie in orchestrating huge computation resources and in designing an efficient data supply subsystem. We present the key mechanisms in FT-Matrix and FT-Matrix2 that target these challenges, and propose possible future directions for enabling DSPs in a wider scope of applications.

Acknowledgements

We thank the reviewers for their valuable feedback on the paper. This work is supported by the National Key Research and Development Program of China (2018YFB0204301), the Science and Technology Planning Project of Hunan Province (2019RS2027), and the National University of Defense Technology research project (No. 18/19-QNCXJ-WYH).

Author information

Authors and Affiliations

National University of Defense Technology, 137 Yanwachi Street, Changsha, Hunan, 410073, China
Yaohua Wang, Chen Li, Chang Liu, Sheng Liu, Yuanwu Lei, Jian Zhang, Yang Zhang & Yang Guo

Corresponding author

Correspondence to Yang Guo.

About this article

Cite this article

Wang, Y., Li, C., Liu, C. et al. Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions. CCF Trans. HPC 3, 114–125 (2021). https://doi.org/10.1007/s42514-020-00057-2

Received: 15 June 2020
Accepted: 22 October 2020
Published: 31 March 2021
Issue Date: March 2021
DOI: https://doi.org/10.1007/s42514-020-00057-2
