Abstract
Digital signal processors (DSPs) are widely used in embedded domains, where they deliver high performance at ultra-low power consumption. This promise makes them attractive for domains in which DSPs were not previously an option. To show how DSPs live up to it, we review two milestone DSPs, FT-Matrix and FT-Matrix2, designed by the National University of Defense Technology with the aim of advancing DSPs into the era of high-performance computing, AI, and beyond. We show that the key challenges lie in orchestrating large amounts of computation resources and in designing an efficient data-supply subsystem. We present the key mechanisms in FT-Matrix and FT-Matrix2 that target these challenges, and outline possible future directions for enabling DSPs across a wider scope of applications.
Acknowledgements
We thank the reviewers for their valuable feedback on the paper. This work is supported by the National Key Research and Development Program of China (2018YFB0204301), the Science and Technology Planning Project of Hunan Province (2019RS2027), and the National University of Defense Technology research project (No. 18/19-QNCXJ-WYH).
Cite this article
Wang, Y., Li, C., Liu, C. et al. Advancing DSP into HPC, AI, and beyond: challenges, mechanisms, and future directions. CCF Trans. HPC 3, 114–125 (2021). https://doi.org/10.1007/s42514-020-00057-2
DOI: https://doi.org/10.1007/s42514-020-00057-2