Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions

  • Survey Paper
CCF Transactions on High Performance Computing

Abstract

Digital Signal Processors (DSPs) have been widely used in embedded domains, where they deliver high performance at ultra-low power consumption. This combination makes DSPs attractive for domains in which they were not previously an option. To show how DSPs live up to this promise, we review two milestone DSPs, FT-Matrix and FT-Matrix2, designed by the National University of Defense Technology with the aim of advancing DSPs into high-performance computing, AI, and beyond. We demonstrate that the key challenges lie in orchestrating massive computation resources and in designing an efficient data supply subsystem. We present the key mechanisms in FT-Matrix and FT-Matrix2 that target these challenges, and propose possible future directions for enabling DSPs in a wider scope of applications.

Acknowledgements

We thank the reviewers for their valuable feedback on the paper. This work is supported by the National Key Research and Development Program of China (2018YFB0204301), the Science and Technology Planning Project of Hunan Province (2019RS2027), and the National University of Defense Technology research project (No. 18/19-QNCXJ-WYH).

Author information

Corresponding author

Correspondence to Yang Guo.

About this article

Cite this article

Wang, Y., Li, C., Liu, C. et al. Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions. CCF Trans. HPC 3, 114–125 (2021). https://doi.org/10.1007/s42514-020-00057-2

  • DOI: https://doi.org/10.1007/s42514-020-00057-2
