
ON THE EMBEDDING COLLAPSE WHEN SCALING UP RECOMMENDATION MODELS
Xingzhuo Guo1†, Junwei Pan2, Ximei Wang2, Baixu Chen1, Jie Jiang2, Mingsheng Long1
1 School of Software, BNRist, Tsinghua University, China
2 Tencent Inc., China
[email protected], [email protected]

arXiv:2310.04400v1 [cs.LG] 6 Oct 2023

ABSTRACT

Recent advances in deep foundation models have led to a promising trend of de-
veloping large recommendation models to leverage vast amounts of available data.
However, we experiment to scale up existing recommendation models and observe
that the enlarged models do not improve satisfactorily. In this context, we inves-
tigate the embedding layers of enlarged models and identify a phenomenon of
embedding collapse, which ultimately hinders scalability, wherein the embedding
matrix tends to reside in a low-dimensional subspace. Through empirical and the-
oretical analysis, we demonstrate that the feature interaction module specific to
recommendation models has a two-sided effect. On the one hand, the interaction
restricts embedding learning when interacting with collapsed embeddings, exac-
erbating the collapse issue. On the other hand, feature interaction is crucial in
mitigating the fitting of spurious features, thereby improving scalability. Based on
this analysis, we propose a simple yet effective multi-embedding design incorpo-
rating embedding-set-specific interaction modules to capture diverse patterns and
reduce collapse. Extensive experiments demonstrate that this proposed design
provides consistent scalability for various recommendation models.

1 INTRODUCTION

Recommender systems are a significant machine learning scenario in which models predict users' actions on items based on multi-field categorical data (Zhang et al., 2016). They play an indispensable role in daily life, helping people discover information matching their interests, and have been adopted in a wide range of online applications, such as e-commerce, social media, news feeds, and music streaming. Recently, researchers have developed deep-learning-based recommendation models to extract feature representations flexibly. These models have been successfully deployed across a multitude of application scenarios, demonstrating their widespread adoption and effectiveness.
In recommender systems, there is a tremendous amount of Internet data, yet mainstream models, typically tuned with an embedding size of 10 (Zhu et al., 2022), do not adequately capture the magnitude of the available data. Motivated by the advancement of large foundation models (Kirillov et al., 2023; OpenAI, 2023; Radford et al., 2021; Rombach et al., 2022), which benefit from increasing parameters, scaling up the recommendation model size appears to be a promising trend. However, when we increase the embedding size of mainstream recommendation models (Qu et al., 2016; Lian et al., 2018; Wang et al., 2021), as shown in Figure 1a, we find unsatisfactory improvement or even a performance drop. This suggests a deficiency in the scalability of existing architecture designs, constraining the maximum potential of recommender systems.
We perform a spectral analysis of the learned embedding matrices based on singular value decomposition and show the normalized singular values in Figure 1b. Surprisingly, most singular values are significantly small, i.e., the learned embedding matrices are nearly low-rank, which we refer to as the embedding collapse phenomenon. With an enlarged model size, the model does not learn to capture a larger dimension of information, implying a learning process with ineffective parameter utilization, which restricts scalability.

†Work partially done while an intern at Tencent Inc.

[Figure 1 plots: (a) relative test AUC of DNN, IPNN, xDeepFM, and DCNv2 when scaling up the embedding size; (b) singular values of DCNv2 under different model sizes, with the dashed lines corresponding to the base size.]

Figure 1: Unsatisfactory scalability of existing recommendation models. (a): Increasing the embedding size does not remarkably improve, and may even hurt, model performance. (b): Most embedding matrices do not learn large singular values and tend to be low-rank.

In this work, we study the mechanism behind the embedding collapse phenomenon through empirical and theoretical analysis. We shed light on the two-sided effect of the feature interaction module, the characteristic component of recommendation models for capturing higher-order correlations, on model scalability. On the one hand, interaction with collapsed embeddings constrains embedding learning and thus, in turn, aggravates the collapse issue. On the other hand, feature interaction plays a vital role in reducing overfitting when scaling up models.
Based on our analysis, we derive a principle for designing scalable models: mitigate collapse without suppressing feature interaction. We propose multi-embedding as a simple yet efficient design for model scaling. Multi-embedding scales up the number of independent embedding sets and incorporates embedding-set-specific interaction modules to jointly capture diverse patterns. Our experimental results demonstrate that multi-embedding provides scalability for extensive mainstream models, pointing to a methodology for breaking through the size limit of recommender systems.
Our contributions can be summarized as:

• To the best of our knowledge, we are the first to point out the non-scalability issue for
recommendation models and discover the embedding collapse phenomenon, which is an
urgent problem to address for model scalability.
• We shed light on the two-sided effect of the feature interaction process on scalability based
on the collapse phenomenon using empirical and theoretical analysis. Specifically, feature
interaction leads to collapse while providing essential overfitting reduction.
• Following our concluded principle to mitigate collapse without suppressing feature interac-
tion, we propose multi-embedding as a simple unified design, which consistently improves
scalability for extensive state-of-the-art recommendation models.

2 PRELIMINARIES
Recommendation models aim to predict an action based on features from various fields. Throughout this paper, we consider the fundamental scenario of recommender systems, in which categorical features and binary outputs are involved. Formally, suppose there are N fields, with the i-th field denoted as X_i = {1, 2, ..., D_i}, where D_i denotes the field cardinality. The value of D_i may vary over a wide range, adding difficulty to recommender systems. Let
X = X1 × X2 × ... × XN
and Y = {0, 1}, then recommendation models aim to learn a mapping from X to Y. In addition to
considering individual features from diverse fields, there have been numerous studies (Koren et al.,
2009; Rendle, 2010; Juan et al., 2016; Guo et al., 2017; Lian et al., 2018; Pan et al., 2018; Sun et al.,
2021; Wang et al., 2021) within the area of recommender systems to model combined features using
feature interaction modules. In this work, we investigate the following widely adopted architecture for mainstream models. A model comprises: (1) embedding layers E_i ∈ R^{D_i×K} for each field, with embedding size K; (2) an interaction module I responsible for integrating all embeddings into a combined feature scalar or vector; and (3) a subsequent postprocessing module F used for prediction purposes, such as an MLP or MoE. The forward pass of such a model is formalized as
$$e_i = E_i^\top \mathbf{1}_{x_i}, \quad \forall i \in \{1, 2, \ldots, N\}, \qquad h = I(e_1, e_2, \ldots, e_N), \qquad \hat{y} = F(h),$$
where 1_{x_i} denotes the one-hot encoding of x_i ∈ X_i; in other words, e_i is the (transposed) x_i-th row of the embedding table E_i.
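To make this architecture concrete, below is a minimal PyTorch sketch of the forward pass, using an FM-style sum of pairwise inner products as a stand-in interaction module I and a small MLP as the postprocessing module F; the class and parameter names are illustrative assumptions, not the authors' implementation.

```python
import torch
import torch.nn as nn

class RecModel(nn.Module):
    """Embedding layers E_i -> interaction module I -> postprocessing module F."""
    def __init__(self, field_dims, embed_dim=10):
        super().__init__()
        # One embedding table E_i of shape (D_i, K) per field.
        self.embeddings = nn.ModuleList([nn.Embedding(d, embed_dim) for d in field_dims])
        # Postprocessing F: a small MLP on the combined scalar feature h.
        self.post = nn.Sequential(nn.Linear(1, 16), nn.ReLU(), nn.Linear(16, 1))

    def forward(self, x):
        # x: LongTensor of shape (batch, N) holding one categorical index per field.
        e = torch.stack([emb(x[:, i]) for i, emb in enumerate(self.embeddings)], dim=1)
        # Interaction I: sum of pairwise inner products e_i^T e_j (FM-style), using
        # the identity sum_{i<j} e_i.e_j = 0.5 * ((sum_i e_i)^2 - sum_i e_i^2).
        s = e.sum(dim=1)
        h = 0.5 * ((s * s).sum(-1) - (e * e).sum(-1).sum(-1))   # shape (batch,)
        return torch.sigmoid(self.post(h.unsqueeze(-1))).squeeze(-1)

model = RecModel(field_dims=[100, 50, 2000])
y_hat = model(torch.randint(0, 50, (8, 3)))  # toy batch; indices stay below each D_i
```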

3 EMBEDDING COLLAPSE
Singular value decomposition has been widely used to measure the collapse phenomenon (Jing et al., 2021). In Figure 1b, we have shown that the learned embedding matrices of recommendation models are approximately low-rank, with some extremely small singular values. To quantify the degree of collapse for such matrices with low-rank tendencies, we propose information abundance as a generalized quantification.
Definition 1 (Information Abundance). Consider a matrix E ∈ R^{D×K} with singular value decomposition
$$E = U \Sigma V^\top = \sum_{k=1}^{K} \sigma_k u_k v_k^\top;$$
then the information abundance of E is defined as
$$\mathrm{IA}(E) = \frac{\|\sigma\|_1}{\|\sigma\|_\infty},$$
i.e., the sum of all singular values normalized by the maximum singular value.
Intuitively, a matrix with high information abundance exhibits a balanced distribution in vector space, since its singular values are similar. In contrast, a matrix with low information abundance suggests that the components corresponding to smaller singular values can be compressed without significantly impacting the result. Compared with the matrix rank, information abundance can be regarded as a simple extension, noticing that rank(E) = ∥σ∥_0, yet it is applicable to non-strictly low-rank matrices, especially for fields with D_i ≫ K, whose embedding matrices may well have full rank K. We calculate the information abundance of embedding matrices for the enlarged DCNv2 (Wang et al., 2021) and compare it with that of randomly initialized matrices, as shown in Figure 2. The information abundance of the learned embedding matrices is extremely low, indicating the embedding collapse phenomenon.

[Figure 2: Visualization of information abundance (learned vs. random, with a D = K reference) on the Criteo dataset. The fields are sorted by their cardinalities.]
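For reference, information abundance is straightforward to compute from the singular values; the short NumPy sketch below follows Definition 1 (the helper name is ours).

```python
import numpy as np

def information_abundance(E: np.ndarray) -> float:
    """IA(E) = ||sigma||_1 / ||sigma||_inf: the sum of singular values
    normalized by the largest one (Definition 1)."""
    sigma = np.linalg.svd(E, compute_uv=False)
    return sigma.sum() / sigma.max()

rng = np.random.default_rng(0)
# A well-spread random matrix with D >> K has IA close to K ...
print(information_abundance(rng.normal(size=(10000, 10))))
# ... while a rank-1 (fully collapsed) matrix has IA exactly 1.
print(information_abundance(np.outer(rng.normal(size=10000), rng.normal(size=10))))
```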

4 FEATURE INTERACTION REVISITED


In this section, we delve deeper into the embedding collapse phenomenon for recommendation mod-
els. Our investigation revolves around two questions: (1) How is embedding collapse caused? (2)
How to properly mitigate embedding collapse for scalability? Through empirical and theoretical
studies, we shed light on the two-sided effect of the commonly employed feature interaction module
on model scalability.

4.1 INTERACTION-COLLAPSE LAW

To determine how feature interaction leads to embedding collapse, it is inadequate to directly analyze the raw embedding matrices, since each learned embedding matrix results from interactions with all other fields, making it difficult to isolate the impact of field-pair-level interaction on embedding learning. To overcome this obstacle, we experiment on a line of models equipped with sub-embeddings that are generated from the raw embeddings and designed individually per field for feature interaction. By examining the collapse of all these sub-embeddings, we can effectively discern the interaction effects between two fields in a disentangled manner and establish a high-level law.

[Figure 3 panels: (a)/(d) heatmaps of IA(E_{i→j}) over field pairs (i, j) for FFM (left) and DCNv2 (right); (b)/(e) row sums Σ_{j=1}^{N} IA(E_{i→j}) versus field i; (c)/(f) column sums Σ_{i=1}^{N} IA(E_{i→j}) versus field j.]

Figure 3: Visualization of information abundance of sub-embedding matrices for FFM (left) and DCNv2 (right), with field indices sorted by the information abundance of the corresponding raw embedding matrices. Higher or warmer indicates larger. It is observed that IA(E_{i→j}) is co-influenced by both IA(E_i) and IA(E_j).

Evidence I: Experiments on FFM. Field-aware factorization machines (FFM) (Juan et al., 2016) split the embedding matrix of field i into multiple sub-embeddings:
$$E_i = \left[ E_{i\to 1}, E_{i\to 2}, \ldots, E_{i\to (i-1)}, E_{i\to (i+1)}, \ldots, E_{i\to N} \right],$$

where the sub-embedding E_{i→j} ∈ R^{D_i × K/(N−1)} is only used when field i interacts with field j, for j ≠ i. To determine the collapse of sub-embedding matrices, we calculate IA(E_{i→j}) for all i, j and show them in Figure 3a. For convenience, we pre-sort the field indices in ascending order of information abundance, i.e., i is ordered according to IA(E_i), and similarly for j. We observe that IA(E_{i→j}) approximately increases along i, which is trivial since E_{i→j} is simply a split of E_i.
Interestingly, another correlation can be observed: the information abundance of a sub-embedding is co-influenced by the field it interacts with, reflected by the increasing trend along j, especially for larger i. This is striking in the sense that, even though independent sub-embeddings represent the same field's features, they end up with different information abundance after learning. To study the effect of each single variable, we calculate the summation of IA(E_{i→j}) over j or over i, shown in Figure 3b and Figure 3c. Both show an increasing trend, confirming the co-influence of i and j.

Evidence II: Experiments on DCNv2. An improved deep & cross network (DCNv2) (Wang et al., 2021) incorporates a cross network parameterized with transformation matrices W_{i→j} (Sun et al., 2021) over each field pair, which project an embedding vector of field i before its interaction with field j. By collecting all projected embedding vectors, DCNv2 can be regarded as implicitly generating field-aware sub-embeddings E_{i→1}, E_{i→2}, ..., E_{i→N} from the embedding matrix E_i to interact with all fields, with
$$E_{i\to j} = E_i W_{i\to j}^\top.$$
DCNv2 consists of multiple stacked cross layers; for simplification, we only discuss the first layer throughout this paper. Similar to Evidence I, we calculate IA(E_{i→j}) together with the axis-wise summations and show them in the right part of Figure 3. Consistent with the observations on FFM, the information abundance of sub-embedding matrices approximately increases along j for the same i, even though these sub-embeddings are projected from the same embedding matrix E_i.
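As a sketch of how this field-pair analysis could be reproduced, the snippet below computes IA(E_{i→j}) = IA(E_i W_{i→j}^⊤) from a set of embedding tables and per-pair projection blocks of the first cross layer; the data layout (a list of per-field matrices and an N×N grid of K×K blocks) is our own assumption for illustration.

```python
import numpy as np

def information_abundance(E):
    sigma = np.linalg.svd(E, compute_uv=False)
    return sigma.sum() / sigma.max()

def sub_embedding_ia(embeddings, W):
    """embeddings: list of N arrays E_i with shape (D_i, K).
    W[i][j]: projection block W_{i->j} with shape (K, K).
    Returns an (N, N) array whose (i, j) entry is IA(E_i @ W_{i->j}.T)."""
    N = len(embeddings)
    ia = np.zeros((N, N))
    for i in range(N):
        for j in range(N):
            ia[i, j] = information_abundance(embeddings[i] @ W[i][j].T)
    return ia
```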

Summary: How is collapse caused in real-world scenarios? In the cases of FFM and DCNv2,
the sub-embeddings obtained from the raw embedding, either separated or projected, provide in-
sights into the interactions between different fields. Evidence I&II highlight that the information

abundance in a sub-embedding matrix is greatly impacted by the field it interacts with, or specif-
ically, interacting with a field with a low-information-abundance embedding matrix will result in
a more collapsed sub-embedding. By further considering the fact that sub-embeddings reflect the
effect when fields interact, we conclude an inherent mechanism of feature interaction in recommen-
dation models, leading us to propose the interaction-collapse law from a higher perspective:

Finding 1 (Interaction-Collapse Law). In feature interaction of recommendation models, fields with low-information-abundance embeddings constrain the information abundance of other fields, resulting in collapsed embedding matrices.

While the interaction-collapse law is derived from sub-embedding-based models, it is important to note that these sub-embeddings serve as analytical tools that reveal the underlying mechanism of feature interaction. Therefore, this law applies to general recommendation models rather than only models with sub-embeddings. The interaction-collapse law generally suggests that feature interaction is the primary catalyst for collapse, thereby imposing constraints on ideal scalability.
We now present how collapse is caused by feature interaction in recommendation models from
a theoretical view. For simplicity, we consider an FM-style (Rendle, 2010) feature interaction.
Formally, the interaction process is defined by
$$h = \sum_{i=1}^{N} \sum_{j=1}^{i-1} e_i^\top e_j = \sum_{i=1}^{N} \sum_{j=1}^{i-1} \mathbf{1}_{x_i}^\top E_i E_j^\top \mathbf{1}_{x_j},$$

where h is the combined feature mentioned before. Without loss of generality, we discuss one specific row e_1 of E_1 and keep the other embedding matrices fixed. Consider a minibatch with batch size B. Denote σ_{i,k} as the k-th singular value of E_i, and similarly for u_{i,k}, v_{i,k}. We have

$$\frac{\partial \mathcal{L}}{\partial e_1}
= \frac{1}{B}\sum_{b=1}^{B} \frac{\partial \ell^{(b)}}{\partial h^{(b)}} \cdot \frac{\partial h^{(b)}}{\partial e_1}
= \frac{1}{B}\sum_{b=1}^{B} \frac{\partial \ell^{(b)}}{\partial h^{(b)}} \sum_{i=2}^{N} E_i^\top \mathbf{1}_{x_i^{(b)}}
= \frac{1}{B}\sum_{b=1}^{B} \frac{\partial \ell^{(b)}}{\partial h^{(b)}} \sum_{i=2}^{N}\sum_{k=1}^{K} \sigma_{i,k}\, v_{i,k}\, u_{i,k}^\top \mathbf{1}_{x_i^{(b)}}$$
$$= \sum_{i=2}^{N}\sum_{k=1}^{K} \left( \frac{1}{B}\sum_{b=1}^{B} \frac{\partial \ell^{(b)}}{\partial h^{(b)}}\, u_{i,k}^\top \mathbf{1}_{x_i^{(b)}} \right) \sigma_{i,k}\, v_{i,k}
= \sum_{i=2}^{N}\sum_{k=1}^{K} \alpha_{i,k}\, \sigma_{i,k}\, v_{i,k}
= \sum_{i=2}^{N} \theta_i.$$

The equation above shows that the gradient can be decomposed into field-specific terms. We analyze the component θ_i for a certain field i, which is further decomposed along the spectrum of the corresponding embedding matrix E_i.

From the form of θ_i, it is observed that {α_{i,k}} are σ_i-agnostic scalars determined by the training data and the objective function. Thus, the variety of σ_i significantly influences the composition of θ_i: for larger σ_{i,k}, the gradient component θ_i is weighted more heavily along the corresponding spectral direction v_{i,k}. When E_i has low information abundance, the components of θ_i are weighted imbalancedly, resulting in the degeneration of e_1. Since a different e_1 affects only α_{i,k} rather than σ_{i,k} and v_{i,k}, all rows of E_1 degenerate in a similar manner and finally form a collapsed matrix.

[Figure 4: IA(E_1) w.r.t. training iterations for the toy experiments. "Small" and "Large" refer to the cardinality of X_3.]

To further illustrate, we conduct a toy experiment over synthetic data. Suppose there are N = 3 fields; we set D_3 to different values, with D_3 < K and D_3 ≫ K, to simulate the low-information-abundance and high-information-abundance cases, matching the diverse range of field cardinalities in real-world scenarios. We train E_1 while keeping E_2 and E_3 fixed. Details of the experiment setup are given in Appendix A. We show the information abundance of E_1 along the training process for the two cases in Figure 4. It is observed that interacting with a low-information-abundance matrix results in a collapsed embedding matrix.

4.2 IS IT SUFFICIENT TO AVOID COLLAPSE FOR SCALABILITY?

Following the discussion above, we have shown that the feature interaction process of recommendation models leads to collapse and thus limits model scalability. We now discuss the converse question, i.e., whether suppressing feature interaction to mitigate collapse leads to model scalability. To answer this question, we design the following two experiments to compare standard models with models whose feature interaction is suppressed.

Evidence III: Regularization on DCNv2 to mitigate collapse. Evidence II shows that a projection W_{i→j} is learned to adjust information abundance for feature interaction. To further investigate the effect of this adjustment on scalability, we introduce the following regularization with learnable parameters λ_{ij},
$$\ell_{\mathrm{reg}} = \sum_{i=1}^{N} \sum_{j=1}^{N} \left\| W_{i\to j}^\top W_{i\to j} - \lambda_{ij} I \right\|_F^2,$$
which regularizes each projection matrix toward a scalar multiple of a unitary matrix. This way, W_{i→j} preserves all normalized singular values and maintains the information abundance after projection.
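A minimal PyTorch sketch of this regularizer is given below, assuming the cross-layer projections are collected into a single tensor of shape (N, N, K, K) and that each λ_{ij} is a learnable scalar; names and shapes are our assumptions for illustration.

```python
import torch
import torch.nn as nn

class UnitaryRegularizer(nn.Module):
    """l_reg = sum_{i,j} || W_{i->j}^T W_{i->j} - lambda_{ij} I ||_F^2."""
    def __init__(self, num_fields):
        super().__init__()
        # One learnable scalar lambda_{ij} per field pair.
        self.lam = nn.Parameter(torch.ones(num_fields, num_fields))

    def forward(self, W):
        # W: tensor of shape (N, N, K, K) stacking all projections W_{i->j}.
        K = W.shape[-1]
        eye = torch.eye(K, device=W.device, dtype=W.dtype)
        gram = W.transpose(-1, -2) @ W                  # W^T W, shape (N, N, K, K)
        diff = gram - self.lam[..., None, None] * eye   # subtract lambda_{ij} * I
        return (diff ** 2).sum()

# Usage sketch: add `reg(W) * weight` to the training loss of the regularized model.
```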
We experiment with various embedding sizes and compare the changes in performance, information abundance, and optimization dynamics between the standard and regularized models. Results are shown in Figure 5. As anticipated, the regularization helps DCNv2 learn embeddings with higher information abundance. Nevertheless, from the performance perspective, the model presents unexpected results: scalability does not improve, or even worsens, as the collapse is alleviated. We further find that such a model overfits during the learning process, with the training loss consistently decreasing while the validation AUC drops.

[Figure 5 panels: (a) information abundance with 10x model size (standard vs. regularized); (b) test AUC w.r.t. model size; (c) training loss vs. validation AUC over epochs.]

Figure 5: Experimental results of Evidence III. Restricting DCNv2 leads to higher information
abundance, yet the model suffers from over-fitting, thus resulting in non-scalability.

Evidence IV: Scaling up DCNv2 and DNN. We now discuss DNN, which uses a plain interaction module that concatenates the feature vectors of all fields and processes them with an MLP, formulated as
$$h = G([e_1, e_2, \ldots, e_N]).$$
Since DNN does not conduct explicit 2-order feature interaction (Rendle et al., 2020), following our interaction-collapse law, it would suffer less from collapse. We compare the learned embeddings of DCNv2 and DNN and their performance with the growth of embedding size. Considering that different architectures or objectives may differ in modeling, we mainly discuss the performance trend as a fair comparison. Results are shown in Figure 6. DNN learns less-collapsed embedding matrices, reflected by higher information abundance than DCNv2. Yet, perversely, the AUC of DNN drops when increasing the embedding size, while DCNv2 sustains its performance. Such observations show that DNN falls into the issue of overfitting and lacks scalability, even though it suffers less from collapse.

[Figure 6: Experimental results of Evidence IV. Despite higher information abundance, the performance of DNN drops w.r.t. model size. Panels: (a) information abundance with 10x model size; (b) test AUC w.r.t. model size.]

Summary: Does suppressing collapse definitely improve scalability? Regularized DCNv2 and DNN are both models with feature interaction suppressed, and they learn less-collapsed embedding matrices than DCNv2, as expected. Yet the observations in Evidence III & IV demonstrate that regularized DCNv2 and DNN are both non-scalable with the growth of model size and suffer from serious overfitting. We conclude the following finding:

Finding 2. A less-collapsed model with feature interaction suppressed is insufficient for scalability due to overfitting concerns.

Such a finding is plausible, considering that feature interaction brings domain knowledge of higher-
order correlations in recommender systems and helps form generalizable representations. When
feature interaction is suppressed, models tend to fit noise as the embedding size increases, resulting
in reduced generalization.

5 MULTI-EMBEDDING DESIGN

In this section, we present multi-embedding, a simple design that serves as an effective scaling approach applicable to a wide range of model architectures. We introduce the overall architecture, present experimental results, and analyze how multi-embedding works.

5.1 MULTI-EMBEDDING FOR BETTER SCALABILITY

The two-sided effect of feature interaction on scalability implies a principle for model design: a scalable model should be capable of learning less-collapsed embeddings within the existing feature interaction framework, rather than removing interaction. Based on this principle, we propose multi-embedding, or ME, as a simple yet efficient design to improve scalability. Specifically, we scale up the number of independent and complete embedding sets instead of the embedding size, and incorporate embedding-set-specific feature interaction layers. Similar to previous designs such as group convolution (Krizhevsky et al., 2012) and multi-head attention (Vaswani et al., 2017), this allows the model to jointly learn different interaction patterns, whereas a single-embedding model is limited to one interaction pattern, which causes severe collapse. This way, the model is capable of learning diverse embedding vectors to mitigate collapse while keeping the original interaction modules. Formally, a model with M sets of embeddings is defined as
$$e_i^{(m)} = \left(E_i^{(m)}\right)^{\top} \mathbf{1}_{x_i}, \quad \forall i \in \{1, 2, \ldots, N\},$$
$$h^{(m)} = I^{(m)}\!\left(e_1^{(m)}, e_2^{(m)}, \ldots, e_N^{(m)}\right),$$
$$h = \frac{1}{M} \sum_{m=1}^{M} h^{(m)}, \qquad \hat{y} = F(h),$$

where m denotes the index of the embedding set. One requirement of multi-embedding is that there should be non-linearities such as ReLU in the interaction I; otherwise, the model is equivalent to single-embedding and hence does not capture different patterns (see Appendix B). As a solution, for models with linear interaction layers we add a non-linear projection after the interaction and remove one MLP layer from F to achieve a fair comparison. An overall architecture comparison of single-embedding and multi-embedding models with N = 2 and M = 2 is shown in Figure 7.
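To illustrate the design, here is a minimal PyTorch sketch of the multi-embedding forward pass; the interaction module is left abstract (a factory that maps (batch, N, K) features to a (batch, hidden) combined vector), and all names and sizes are our own illustrative assumptions rather than the authors' implementation.

```python
import torch
import torch.nn as nn

class MultiEmbedding(nn.Module):
    """M independent embedding sets with embedding-set-specific interactions I^{(m)};
    the combined features h^{(m)} are averaged and fed to a shared postprocess F."""
    def __init__(self, field_dims, embed_dim, num_sets, make_interaction, hidden=400):
        super().__init__()
        self.embeddings = nn.ModuleList([
            nn.ModuleList([nn.Embedding(d, embed_dim) for d in field_dims])
            for _ in range(num_sets)
        ])
        # Each set gets its own interaction followed by a non-linearity, which is
        # required so the sets do not reduce to one large embedding (Appendix B).
        self.interactions = nn.ModuleList([
            nn.Sequential(make_interaction(len(field_dims), embed_dim, hidden), nn.ReLU())
            for _ in range(num_sets)
        ])
        self.post = nn.Sequential(nn.Linear(hidden, hidden), nn.ReLU(), nn.Linear(hidden, 1))

    def forward(self, x):  # x: (batch, N) categorical indices
        hs = []
        for embs, inter in zip(self.embeddings, self.interactions):
            e = torch.stack([emb(x[:, i]) for i, emb in enumerate(embs)], dim=1)
            hs.append(inter(e))                 # h^{(m)}: (batch, hidden)
        h = torch.stack(hs).mean(dim=0)         # average over embedding sets
        return torch.sigmoid(self.post(h)).squeeze(-1)
```

For instance, a DNN-style interaction could be supplied as `make_interaction = lambda n, k, h: nn.Sequential(nn.Flatten(), nn.Linear(n * k, h))`.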

[Figure 7: Architectures of single-embedding (left) and multi-embedding (right) models with N = 2 and M = 2.]

[Figure 8: Scalability of multi-embedding on the Criteo dataset: relative test AUC vs. scale-up factor for DNN, IPNN, NFwFM, xDeepFM, DCNv2, and FinalMLP.]

5.2 EXPERIMENTS

Setup. We conduct our experiments on two datasets for recommender systems, Criteo (Jean-Baptiste Tien, 2014) and Avazu (Steve Wang, 2014), which are large and challenging benchmarks widely used in recommender systems. We experiment on baseline models including DNN, IPNN (Qu et al., 2016), NFwFM (Pan et al., 2018), xDeepFM (Lian et al., 2018), DCNv2 (Wang et al., 2021), and FinalMLP (Mao et al., 2023), and their corresponding multi-embedding variants with 2x, 3x, 4x, and 10x model sizes1. Here NFwFM is a variant of NFM (He & Chua, 2017) that replaces FM with FwFM. All experiments are performed with 8/1/1 training/validation/test splits, and we apply early stopping based on validation AUC. More details are given in Appendix C.2.

Results. We repeat each experiment 3 times and report the average test AUC under different scaling factors of the model size. Results are shown in Table 1. For the experiments with single-embedding, we observe that all models demonstrate poor scalability. Only DCNv2 and NFwFM show slight improvements with increasing embedding sizes, with gains of 0.00036 on Criteo and 0.00090 on Avazu, respectively. For DNN, xDeepFM, and FinalMLP, which rely highly on non-explicit interaction, the performance even drops (0.00136 on Criteo and 0.00118 on Avazu) when scaled up to 10x, as discussed in Section 4.2. In contrast to single-embedding, our multi-embedding shows consistent and remarkable improvement as the embedding size grows, and the highest performance is always achieved with the largest 10x size. For DCNv2 and NFwFM, multi-embedding gains 0.00099 on Criteo and 0.00202 on Avazu by scaling up to 10x, which is never obtained by single-embedding. Over all models and datasets, the largest models achieve an average improvement of 0.00106 in test AUC over the baselines2. Multi-embedding provides a methodology to break through the non-scalability limit of existing models. We visualize the scalability of multi-embedding on the Criteo dataset in Figure 8. Standard deviations and a detailed scalability comparison are shown in Appendix C.3.

5.3 ANALYSIS

Information abundance. Multi-embedding models achieve remarkable scalability compared with single-embedding. We verify that such scalability originates from the mitigation of collapse. We compare the information abundance of single-embedding and multi-embedding DCNv2 with the 10x embedding size. As shown in Figure 9a, multi-embedding offers higher information abundance, indicating less-collapsed embedding matrices.

Variations of embeddings. Multi-embedding utilizes embedding-set-specific interactions to enrich embedding learning. We analyze the information abundance for each embedding set as shown in Figure 9b. It is observed that the embedding matrices of different sets vary in information abundance.
1 The embedding of NFwFM with 10x size on Avazu costs nearly 37.6GB of memory, which exceeds our GPU memory limit; therefore, we do not run 10x NFwFM on Avazu. The existing 4x experiment is already sufficient for NFwFM on Avazu.
2 A slightly higher AUC at the 0.001 level is regarded as significant (Cheng et al., 2016; Guo et al., 2017; Song et al., 2019; Tian et al., 2023).

Table 1: Test AUC for different models. Higher indicates better. SE and ME denote single-embedding and multi-embedding, respectively; the base size is shared between SE and ME.

            Criteo                                                Avazu
Model       base      2x        3x        4x        10x          base      2x        3x        4x        10x
DNN     SE  0.81228   0.81222   0.81207   0.81213   0.81142      0.78744   0.78759   0.78752   0.78728   0.78648
        ME  0.81228   0.81261   0.81288   0.81289   0.81287      0.78744   0.78805   0.78826   0.78862   0.78884
IPNN    SE  0.81272   0.81273   0.81272   0.81271   0.81262      0.78732   0.78741   0.78738   0.78750   0.78745
        ME  0.81272   0.81268   0.81270   0.81273   0.81311      0.78732   0.78806   0.78868   0.78902   0.78949
NFwFM   SE  0.81059   0.81087   0.81090   0.81112   0.81113      0.78684   0.78757   0.78783   0.78794   –
        ME  0.81059   0.81128   0.81153   0.81171   0.81210      0.78684   0.78868   0.78901   0.78932   –
xDeepFM SE  0.81217   0.81180   0.81167   0.81137   0.81116      0.78743   0.78750   0.78714   0.78735   0.78693
        ME  0.81217   0.81236   0.81239   0.81255   0.81299      0.78743   0.78848   0.78886   0.78894   0.78927
DCNv2   SE  0.81339   0.81341   0.81345   0.81346   0.81357      0.78786   0.78835   0.78854   0.78852   0.78856
        ME  0.81339   0.81348   0.81361   0.81382   0.81385      0.78786   0.78862   0.78882   0.78907   0.78942
FinalMLP SE 0.81259   0.81262   0.81248   0.81240   0.81175      0.78751   0.78797   0.78795   0.78742   0.78662
         ME 0.81259   0.81290   0.81302   0.81303   0.81303      0.78751   0.78821   0.78831   0.78836   0.78830

Different interaction patterns. To justify that the scalability of multi-embedding originates from different interaction patterns, we visualize ∥W^{(m)}_{i→j}∥_F as the interaction pattern (Wang et al., 2021) for a multi-embedding DCNv2 model in Figure 9c. The interaction layers are shown to learn various patterns. To further illustrate, we conduct an ablation study that restricts the divergence of ∥W^{(m)}_{i→j}∥_F across all embedding sets. From the results in Figure 9d, the divergence-restricted multi-embedding model does not show scalability similar to standard multi-embedding models, indicating that multi-embedding works through the diversity of its interaction layers.
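A small sketch of how these interaction-pattern matrices could be extracted for such a visualization, assuming each embedding set's cross-layer projections are available as an (N, N, K, K) array (an assumption of ours):

```python
import numpy as np

def interaction_patterns(W_sets):
    """W_sets: list of M arrays, each of shape (N, N, K, K) holding W^{(m)}_{i->j}.
    Returns an (M, N, N) array of Frobenius norms ||W^{(m)}_{i->j}||_F."""
    return np.stack([np.linalg.norm(W, axis=(-2, -1)) for W in W_sets])
```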

[Figure 9 panels: (a) IA(E_i) for single- vs. multi-embedding at 10x; (b) IA(E_i^{(m)}) per embedding set; (c) ∥W^{(m)}_{i→j}∥_F per embedding set; (d) test AUC vs. scale-up factor for standard vs. diversity-restricted multi-embedding.]

Figure 9: Analysis of multi-embedding. (a): Multi-embedding learns higher information abundance. (b): Each embedding set learns diverse embeddings, reflected by varying information abundance. (c): Embedding-set-specific feature interaction layers capture different interaction patterns. (d): Restricting the diversity of ∥W^{(m)}_{i→j}∥_F across all embedding sets leads to non-scalability.

6 RELATED WORKS
Modules in recommender systems. Plenty of existing works investigate module design for recommender systems. A line of studies focuses on the feature interaction process (Koren et al., 2009; Rendle, 2010; Juan et al., 2016; Qu et al., 2016; He & Chua, 2017; Guo et al., 2017; Pan et al., 2018; Lian et al., 2018; Song et al., 2019; Cheng et al., 2020; Sun et al., 2021; Wang et al., 2021; Mao et al., 2023; Tian et al., 2023), which is specific to recommender systems. These works are built to incorporate domain-specific knowledge of recommender systems. In contrast to proposing new modules, our work starts from a machine learning perspective and analyzes existing models for scalability.

Collapse phenomenon. Neural collapse or representation collapse describes the degeneration of


representation vectors with restricted variation. This phenomenon is widely studied in supervised
learning (Papyan et al., 2020; Zhu et al., 2021; Tirer & Bruna, 2022), unsupervised contrastive
learning (Hua et al., 2021; Jing et al., 2021; Gupta et al., 2022), transfer learning (Aghajanyan
et al., 2020; Kumar et al., 2022) and generative models (Mao et al., 2017; Miyato et al., 2018).
Chi et al. (2022) discuss representation collapse in sparse MoEs. Inspired by these works, we recognize the embedding collapse of recommendation models by regarding embedding vectors as representations in their sense; however, we face the setting of field-level interaction, which has not previously been well studied.

Intrinsic dimensions and compression theories. To describe the complexity of data, existing
works include intrinsic-dimension-based quantification (Levina & Bickel, 2004; Ansuini et al., 2019;
Pope et al., 2020) and pruning-based analysis (Wen et al., 2017; Alvarez & Salzmann, 2017; Sun
et al., 2021). Our SVD-based concept of information abundance is related to these works.

7 CONCLUSION

In this paper, we highlight the non-scalability issue of existing recommendation models and identify
the embedding collapse phenomenon that hinders scalability. From empirical and theoretical analy-
sis around embedding collapse, we conclude the two-sided effect of feature interaction on scalabil-
ity, i.e., feature interaction causes collapse while reducing overfitting. We propose a unified design
of multi-embedding to mitigate collapse without suppressing feature interaction. Experiments on
benchmark datasets demonstrate that multi-embedding consistently improves model scalability.

REFERENCES
Armen Aghajanyan, Akshat Shrivastava, Anchit Gupta, Naman Goyal, Luke Zettlemoyer, and Sonal
Gupta. Better fine-tuning by reducing representational collapse. In ICLR, 2020.
Jose M Alvarez and Mathieu Salzmann. Compression-aware training of deep networks. In NeurIPS,
2017.
Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of
data representations in deep neural networks. In NeurIPS, 2019.
Heng-Tze Cheng, Levent Koc, Jeremiah Harmsen, Tal Shaked, Tushar Chandra, Hrishi Aradhye,
Glen Anderson, Greg Corrado, Wei Chai, Mustafa Ispir, et al. Wide & deep learning for recom-
mender systems. In DLRS, 2016.
Weiyu Cheng, Yanyan Shen, and Linpeng Huang. Adaptive factorization network: Learning
adaptive-order feature interactions. In AAAI, 2020.
Zewen Chi, Li Dong, Shaohan Huang, Damai Dai, Shuming Ma, Barun Patra, Saksham Singhal,
Payal Bajaj, Xia Song, Xian-Ling Mao, et al. On the representation collapse of sparse mixture of
experts. In NeurIPS, 2022.
Huifeng Guo, Ruiming Tang, Yunming Ye, Zhenguo Li, and Xiuqiang He. Deepfm: a factorization-
machine based neural network for ctr prediction. In arXiv preprint arXiv:1703.04247, 2017.
Kartik Gupta, Thalaiyasingam Ajanthan, Anton van den Hengel, and Stephen Gould. Understand-
ing and improving the role of projection head in self-supervised learning. In arXiv preprint
arXiv:2212.11491, 2022.
Xiangnan He and Tat-Seng Chua. Neural factorization machines for sparse predictive analytics. In
SIGIR, 2017.
Tianyu Hua, Wenxiao Wang, Zihui Xue, Sucheng Ren, Yue Wang, and Hang Zhao. On feature
decorrelation in self-supervised learning. In ICCV, 2021.
Jean-Baptiste Tien, joycenv, and Olivier Chapelle. Display advertising challenge, 2014. URL https://kaggle.com/competitions/criteo-display-ad-challenge.
Li Jing, Pascal Vincent, Yann LeCun, and Yuandong Tian. Understanding dimensional collapse in
contrastive self-supervised learning. In ICLR, 2021.
Yuchin Juan, Yong Zhuang, Wei-Sheng Chin, and Chih-Jen Lin. Field-aware Factorization Machines
for CTR Prediction. In RecSys, 2016.

Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete
Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In arXiv
preprint arXiv:2304.02643, 2023.
Yehuda Koren, Robert Bell, and Chris Volinsky. Matrix factorization techniques for recommender
systems. In Computer, 2009.
Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. Imagenet classification with deep convo-
lutional neural networks. In NeurIPS, 2012.
Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can
distort pretrained features and underperform out-of-distribution. In ICLR, 2022.
Elizaveta Levina and Peter Bickel. Maximum likelihood estimation of intrinsic dimension. In
NeurIPS, 2004.
Jianxun Lian, Xiaohuan Zhou, Fuzheng Zhang, Zhongxia Chen, Xing Xie, and Guangzhong Sun.
xdeepfm: Combining explicit and implicit feature interactions for recommender systems. In
SIGKDD, 2018.
Kelong Mao, Jieming Zhu, Liangcai Su, Guohao Cai, Yuru Li, and Zhenhua Dong. Finalmlp: An
enhanced two-stream mlp model for ctr prediction. In arXiv preprint arXiv:2304.00902, 2023.
Xudong Mao, Qing Li, Haoran Xie, Raymond YK Lau, Zhen Wang, and Stephen Paul Smolley.
Least squares generative adversarial networks. In ICCV, 2017.
Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization
for generative adversarial networks. In ICLR, 2018.
OpenAI. Gpt-4 technical report. In arXiv preprint arXiv:2303.08774, 2023.
Junwei Pan, Jian Xu, Alfonso Lobos Ruiz, Wenliang Zhao, Shengjun Pan, Yu Sun, and Quan Lu.
Field-weighted factorization machines for click-through rate prediction in display advertising. In
WWW, 2018.
Vardan Papyan, XY Han, and David L Donoho. Prevalence of neural collapse during the terminal
phase of deep learning training. In PNAS, 2020.
Phil Pope, Chen Zhu, Ahmed Abdelkader, Micah Goldblum, and Tom Goldstein. The intrinsic
dimension of images and its impact on learning. In ICLR, 2020.
Yanru Qu, Han Cai, Kan Ren, Weinan Zhang, Yong Yu, Ying Wen, and Jun Wang. Product-based
neural networks for user response prediction. In ICDM, 2016.
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal,
Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual
models from natural language supervision. In ICML, 2021.
Steffen Rendle. Factorization machines. In ICDM, 2010.
Steffen Rendle, Walid Krichene, Li Zhang, and John Anderson. Neural collaborative filtering vs.
matrix factorization revisited. In RecSys, 2020.
Robin Rombach, Andreas Blattmann, Dominik Lorenz, Patrick Esser, and Björn Ommer. High-
resolution image synthesis with latent diffusion models. In CVPR, 2022.
Weiping Song, Chence Shi, Zhiping Xiao, Zhijian Duan, Yewen Xu, Ming Zhang, and Jian Tang.
Autoint: Automatic feature interaction learning via self-attentive neural networks. In CIKM,
2019.
Steve Wang and Will Cukierski. Click-through rate prediction, 2014. URL https://kaggle.com/competitions/avazu-ctr-prediction.
Yang Sun, Junwei Pan, Alex Zhang, and Aaron Flores. Fm2: Field-matrixed factorization machines
for recommender systems. In WWW, 2021.

Zhen Tian, Ting Bai, Wayne Xin Zhao, Ji-Rong Wen, and Zhao Cao. Eulernet: Adaptive feature
interaction learning via euler’s formula for ctr prediction. In arXiv preprint arXiv:2304.10711,
2023.
Tom Tirer and Joan Bruna. Extended unconstrained features model for exploring deep neural col-
lapse. In ICML, 2022.
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez,
Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In NeurIPS, 2017.
Ruoxi Wang, Rakesh Shivanna, Derek Cheng, Sagar Jain, Dong Lin, Lichan Hong, and Ed Chi.
DCN V2: Improved Deep & Cross Network and Practical Lessons for Web-scale Learning to
Rank Systems. In WWW, 2021.
Wei Wen, Cong Xu, Chunpeng Wu, Yandan Wang, Yiran Chen, and Hai Li. Coordinating filters for
faster deep neural networks. In ICCV, 2017.
Weinan Zhang, Tianming Du, and Jun Wang. Deep learning over multi-field categorical data: –a
case study on user response prediction. In ECIR, 2016.
Jieming Zhu, Quanyu Dai, Liangcai Su, Rong Ma, Jinyang Liu, Guohao Cai, Xi Xiao, and Rui
Zhang. Bars: Towards open benchmarking for recommender systems. In SIGIR, 2022.
Zhihui Zhu, Tianyu Ding, Jinxin Zhou, Xiao Li, Chong You, Jeremias Sulam, and Qing Qu. A
geometric analysis of neural collapse with unconstrained features. In NeurIPS, 2021.

A DETAILS OF TOY EXPERIMENT
In this section, we present the detailed settings of the toy experiment. We consider a scenario with N = 3 fields and D_1 = D_2 = 100. For each (x_1, x_2) ∈ X_1 × X_2, we randomly assign x_3 ~ U[X_3] and y ~ U{0, 1} and let (x, y) be one piece of data; thus, for different values of D_3, there are always 100^2 pieces of data, and they follow the same distribution when reduced to X_1 × X_2. We set D_3 = 3 and D_3 = 100 to simulate the low-information-abundance and high-information-abundance cases, respectively. We randomly initialize all embedding matrices with the normal distribution N(0, 1), fix E_2 and E_3, and only optimize E_1 during training. We use full-batch SGD with a learning rate of 1 and train the model for 5,000 iterations in total.
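A compact reimplementation sketch of this toy setup is shown below; the embedding size K and the binary cross-entropy objective are our assumptions, since they are not fully specified here.

```python
import numpy as np
import torch
import torch.nn.functional as F

def run_toy(D3, K=10, D1=100, D2=100, iters=5000, lr=1.0):
    """Train E1 with an FM-style interaction while E2, E3 stay fixed, then
    report IA(E1). A small D3 simulates a low-information-abundance field."""
    rng = np.random.default_rng(0)
    grid = np.stack(np.meshgrid(np.arange(D1), np.arange(D2), indexing="ij"), -1).reshape(-1, 2)
    x1 = torch.as_tensor(grid[:, 0]); x2 = torch.as_tensor(grid[:, 1])
    x3 = torch.as_tensor(rng.integers(0, D3, size=len(grid)))
    y = torch.as_tensor(rng.integers(0, 2, size=len(grid)), dtype=torch.float32)

    E1 = torch.randn(D1, K, requires_grad=True)
    E2, E3 = torch.randn(D2, K), torch.randn(D3, K)        # kept fixed
    opt = torch.optim.SGD([E1], lr=lr)
    for _ in range(iters):                                  # full-batch SGD
        e1, e2, e3 = E1[x1], E2[x2], E3[x3]
        h = (e1 * e2).sum(-1) + (e1 * e3).sum(-1) + (e2 * e3).sum(-1)
        loss = F.binary_cross_entropy_with_logits(h, y)
        opt.zero_grad(); loss.backward(); opt.step()

    sigma = torch.linalg.svdvals(E1.detach())
    return (sigma.sum() / sigma.max()).item()

print(run_toy(D3=3), run_toy(D3=100))  # lower IA expected when D3 is small
```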

B NON-LINEARITY FOR MULTI-EMBEDDING


We have mentioned that the embedding-set-specific feature interaction of multi-embedding should contain non-linearity; otherwise, the model degrades to a single-embedding model. For simplicity, we consider a stronger version of multi-embedding, where the combined features from different embedding sets are concatenated instead of averaged. To further illustrate, consider linear feature interaction modules I^{(m)}: (R^K)^N → R^h; then we can define a linear feature interaction module I_all: (R^{MK})^N → R^{Mh}. For convenience, we denote [f(i)]_{i=1}^{n} as [f(1), f(2), ..., f(n)], and e_i = [e_i^{(m)}]_{m=1}^{M}. The form of I_all can be formulated by
$$I_{\mathrm{all}}(e_1, e_2, \ldots, e_N) = \left[ I^{(m)}\!\left(e_1^{(m)}, \ldots, e_N^{(m)}\right) \right]_{m=1}^{M}.$$
This shows that a multi-embedding model is equivalent to a model obtained by concatenating all embedding sets. We will further show that the deduced model with I_all is homogeneous to a single-embedding model with size MK, i.e., multi-embedding is similar to single-embedding with linear feature interaction modules. Denote the feature interaction module of single-embedding as I. Although I_all could have a different form from I, we give three examples to show the homogeneity of I_all and I.

DNN. Ignoring the subsequent MLP, DNN incorporates a non-parametric interaction module that concatenates all fields together. Formally, we have
$$I(e_1, \ldots, e_N) = \left[ [e_i^{(m)}]_{m=1}^{M} \right]_{i=1}^{N}, \qquad I_{\mathrm{all}}(e_1, \ldots, e_N) = \left[ [e_i^{(m)}]_{i=1}^{N} \right]_{m=1}^{M}.$$
In other words, I and I_all only differ by a permutation; thus multi-embedding and single-embedding are equivalent.

Projected DNN. If we add a linear projection after DNN, then we can split the projection across fields and embedding sets and derive
$$I(e_1, \ldots, e_N) = \sum_{i=1}^{N} \sum_{m=1}^{M} W_{i,m}\, e_i^{(m)}, \qquad I_{\mathrm{all}}(e_1, \ldots, e_N) = \left[ \sum_{i=1}^{N} W_{i,m}\, e_i^{(m)} \right]_{m=1}^{M}.$$
In other words, I and I_all only differ by a summation. In fact, if we average the combined features for I_all rather than concatenate them, recovering our proposed version of multi-embedding, then multi-embedding and single-embedding are equivalent up to the scalar 1/M.

DCNv2. DCNv2 incorporates the following feature interaction:
$$I(e_1, \ldots, e_N) = \left[ e_i \odot \sum_{j=1}^{N} W_{j\to i}\, e_j \right]_{i=1}^{N},$$
thus by splitting W_{j→i} we have
$$I(e_1, \ldots, e_N) = \left[ \left[ e_i^{(m)} \odot \sum_{j=1}^{N} \sum_{m'=1}^{M} W_{j\to i}^{(m,m')}\, e_j^{(m')} \right]_{m=1}^{M} \right]_{i=1}^{N},$$
$$I_{\mathrm{all}}(e_1, \ldots, e_N) = \left[ \left[ e_i^{(m)} \odot \sum_{j=1}^{N} W_{j\to i}^{(m)}\, e_j^{(m)} \right]_{i=1}^{N} \right]_{m=1}^{M}.$$
By simply letting W^{(m,m)}_{j→i} = W^{(m)}_{j→i} and W^{(m,m')}_{j→i} = O for m ≠ m', we convert a multi-embedding model into a single-embedding model up to a permutation. Therefore, multi-embedding is a special case of single-embedding for DCNv2.

Summary. In summary, a linear feature interaction module causes homogeneity between single-embedding and multi-embedding. Hence it is necessary to use or introduce non-linearity in the feature interaction module.

C DETAILS OF EXPERIMENT
C.1 DATASET DESCRIPTION

The statistics of Criteo and Avazu are shown in Table 2. The data amount is ample, and D_i varies over a large range.

Table 2: Statistics of benchmark datasets for experiments.

Dataset   #Instances   #Fields   Σ_i D_i   max{D_i}   min{D_i}
Criteo    45.8M        39        1.08M     0.19M      4
Avazu     40.4M        22        2.02M     1.61M      5

C.2 EXPERIMENT SETTINGS

Specific multi-embedding design. For DCNv2, DNN, IPNN and NFwFM, we add one non-linear
projection after the stacked cross layers, the concatenation layer, the inner product layer and the
field-weighted dot product layer, respectively. For xDeepFM, we directly average the output of the
compressed interaction network, and process the ensembled DNN the same as the pure DNN model.
For FinalMLP, we average the two-stream outputs respectively.

Hyperparameters. For all experiments, we split the dataset into 8:1:1 for training/validation/test with random seed 0. We use the Adam optimizer with batch size 2048, learning rate 0.001, and weight decay 1e-6. For the base size, we use embedding size 50 for NFwFM (considering the pooling) and 10 for all other experiments. We find that the hidden size and depth of the MLP do not significantly affect the results; for simplicity, we set the hidden size to 400 and the depth to 3 (2 hidden layers and 1 output layer) for all models. We use 4 cross layers for DCNv2 and hidden size 16 for xDeepFM. All experiments use early stopping on validation AUC with patience 3. We repeat each experiment 3 times with different random initializations. All experiments can be run on a single NVIDIA GeForce RTX 3090.

C.3 EXPERIMENTAL RESULTS

Here we present detailed experimental results with estimated standard deviations. Specifically, we show results on the Criteo dataset in Table 3 and Figure 10, and on the Avazu dataset in Table 4 and Figure 11.

Table 3: Results on Criteo dataset. Higher indicates better. The base size is shared between SE and ME.

Model      Set  base              2x                3x                4x                10x
DNN        SE   0.81228±0.00004   0.81222±0.00002   0.81207±0.00007   0.81213±0.00011   0.81142±0.00006
           ME   0.81228±0.00004   0.81261±0.00004   0.81288±0.00015   0.81289±0.00007   0.81287±0.00005
IPNN       SE   0.81272±0.00003   0.81273±0.00013   0.81272±0.00004   0.81271±0.00007   0.81262±0.00016
           ME   0.81272±0.00003   0.81268±0.00009   0.81270±0.00002   0.81273±0.00015   0.81311±0.00008
NFwFM      SE   0.81059±0.00012   0.81087±0.00008   0.81090±0.00012   0.81112±0.00011   0.81113±0.00022
           ME   0.81059±0.00012   0.81128±0.00017   0.81153±0.00002   0.81171±0.00012   0.81210±0.00010
xDeepFM    SE   0.81217±0.00003   0.81180±0.00002   0.81167±0.00008   0.81137±0.00005   0.81116±0.00009
           ME   0.81217±0.00003   0.81236±0.00006   0.81239±0.00022   0.81255±0.00011   0.81299±0.00009
DCNv2      SE   0.81339±0.00002   0.81341±0.00007   0.81345±0.00009   0.81346±0.00011   0.81357±0.00004
           ME   0.81339±0.00002   0.81348±0.00005   0.81361±0.00014   0.81382±0.00015   0.81385±0.00005
FinalMLP   SE   0.81259±0.00009   0.81262±0.00007   0.81248±0.00008   0.81240±0.00002   0.81175±0.00020
           ME   0.81259±0.00009   0.81290±0.00017   0.81302±0.00005   0.81303±0.00004   0.81303±0.00012

[Figure 10: Visualization of scalability on the Criteo dataset: test AUC vs. scale-up factor for single-embedding and multi-embedding variants of DNN, IPNN, NFwFM, xDeepFM, DCNv2, and FinalMLP.]

Table 4: Results on Avazu dataset. Higher indicates better. The base size is shared between SE and ME.

Model      Set  base              2x                3x                4x                10x
DNN        SE   0.78744±0.00008   0.78759±0.00011   0.78752±0.00031   0.78728±0.00036   0.78648±0.00013
           ME   0.78744±0.00008   0.78805±0.00017   0.78826±0.00013   0.78862±0.00026   0.78884±0.00005
IPNN       SE   0.78732±0.00020   0.78741±0.00022   0.78738±0.00010   0.78750±0.00007   0.78745±0.00018
           ME   0.78732±0.00020   0.78806±0.00012   0.78868±0.00023   0.78902±0.00009   0.78949±0.00028
NFwFM      SE   0.78684±0.00017   0.78757±0.00020   0.78783±0.00009   0.78794±0.00022   –
           ME   0.78684±0.00017   0.78868±0.00038   0.78901±0.00029   0.78932±0.00035   –
xDeepFM    SE   0.78743±0.00009   0.78750±0.00025   0.78714±0.00030   0.78735±0.00004   0.78693±0.00050
           ME   0.78743±0.00009   0.78848±0.00006   0.78886±0.00026   0.78894±0.00004   0.78927±0.00019
DCNv2      SE   0.78786±0.00022   0.78835±0.00023   0.78854±0.00010   0.78852±0.00003   0.78856±0.00016
           ME   0.78786±0.00022   0.78862±0.00011   0.78882±0.00012   0.78907±0.00011   0.78942±0.00024
FinalMLP   SE   0.78751±0.00026   0.78797±0.00019   0.78795±0.00017   0.78742±0.00015   0.78662±0.00025
           ME   0.78751±0.00026   0.78821±0.00013   0.78831±0.00029   0.78836±0.00018   0.78830±0.00022

[Figure 11: Visualization of scalability on the Avazu dataset: test AUC vs. scale-up factor for single-embedding and multi-embedding variants of DNN, IPNN, NFwFM, xDeepFM, DCNv2, and FinalMLP.]

