Building on Efficient Foundations: Effectively Training LLMs with Structured Feedforward Layers

Wei, Xiuying; Moalla, Skander; Pascanu, Razvan; Gulcehre, Caglar

arXiv:2406.16450 (cs)

[Submitted on 24 Jun 2024 (v1), last revised 5 Nov 2024 (this version, v2)]

Abstract: State-of-the-art results in large language models (LLMs) often rely on scale, which is computationally expensive. This has sparked a research agenda to reduce these models' parameter counts and computational costs without significantly impacting their performance. Our study focuses on Transformer-based LLMs, specifically targeting the computationally intensive feedforward networks (FFNs), which are less studied than attention blocks. We consider three structured linear parameterizations of the FFN using efficient low-rank and block-diagonal matrices. In contrast to many previous works that examined these approximations, our study i) explores these structures from a training-from-scratch perspective, ii) scales up to 1.3B parameters, and iii) is conducted within recent Transformer-based LLMs rather than convolutional architectures. We demonstrate that these structures can lead to actual computational gains in various scenarios, including online decoding when using a pre-merge technique. Additionally, we propose a novel training regime, called self-guided training, aimed at improving the poor training dynamics that these approximations exhibit when used from initialization. We also explore the scaling behavior of structured matrices, which reveals steeper scaling curves with respect to training FLOPs, along with a favorable scaling trend in the overtraining regime. Specifically, we show that wide and structured networks can utilize training FLOPs more efficiently, with fewer parameters and lower loss than dense models at their optimal trade-off. Our code is available at this https URL.
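The two building blocks named in the abstract, low-rank and block-diagonal matrices, can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, shapes, and the choice of NumPy are assumptions made for illustration, and the sketch shows only generic versions of the two structures, not the three specific parameterizations studied in the paper.

```python
import numpy as np


def low_rank_linear(x, V, U):
    """Low-rank layer: y = (x @ V) @ U, with V of shape (d_in, r) and U of shape (r, d_out).

    Equivalent to multiplying by the dense matrix V @ U, but cheaper when r is small.
    """
    return (x @ V) @ U


def block_diag_linear(x, blocks):
    """Block-diagonal layer: features are split into b groups, each with its own weight block.

    blocks has shape (b, d_in // b, d_out // b).
    """
    b, in_blk, out_blk = blocks.shape
    xs = x.reshape(x.shape[0], b, in_blk)          # (n, b, in_blk)
    ys = np.einsum("nbi,bio->nbo", xs, blocks)     # per-block matmul
    return ys.reshape(x.shape[0], b * out_blk)     # (n, d_out)


# Hypothetical toy sizes, chosen only to keep the example small.
rng = np.random.default_rng(0)
d_in, d_out, rank, n_blocks = 8, 16, 2, 4

x = rng.standard_normal((3, d_in))
V = rng.standard_normal((d_in, rank))
U = rng.standard_normal((rank, d_out))
blocks = rng.standard_normal((n_blocks, d_in // n_blocks, d_out // n_blocks))

y_lr = low_rank_linear(x, V, U)
y_bd = block_diag_linear(x, blocks)

# Parameter counts for one d_in x d_out projection.
dense_params = d_in * d_out                # 8 * 16
low_rank_params = rank * (d_in + d_out)    # 2 * (8 + 16)
block_diag_params = blocks.size            # 4 * 2 * 4
```

In this toy setting the dense projection has 128 weights, the rank-2 factorization has 48, and the 4-block layout has 32. Both structured layers produce the same output as a dense matrix with the corresponding structure, which is why they reduce cost without changing the layer's input and output shapes.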

Comments:	Accepted by NeurIPS 2024
Subjects:	Computation and Language (cs.CL)
Cite as:	arXiv:2406.16450 [cs.CL]
	(or arXiv:2406.16450v2 [cs.CL] for this version)
	https://doi.org/10.48550/arXiv.2406.16450

Submission history

From: Xiuying Wei
[v1] Mon, 24 Jun 2024 08:43:21 UTC (6,553 KB)
[v2] Tue, 5 Nov 2024 22:34:29 UTC (7,941 KB)
