2023 CXL Design Tradeoffs, IEEE Micro
Abstract—DRAM is a key driver of performance and cost in public cloud servers. At the same time, a significant amount of DRAM is underutilized due to fragmented use across servers. Emerging interconnects such as CXL offer a path towards improving utilization through memory pooling. However, the design space of CXL-based memory systems is large, with key questions around the size, reach, and topology of the memory pool. Using pools also requires navigating complex design constraints around performance, virtualization, and management. This paper discusses why cloud providers should deploy CXL memory pools, key design constraints, and observations in designing towards practical deployment. We identify configuration examples with significant positive return on investment.
The bandwidth of a bidirectional ×8 CXL port at a typical 2:1 read:write ratio roughly matches that of an 80-bit DDR5-4800 channel.

of CPU cores are scheduled for VMs, 6% of memory is stranded. This grows to over 10% when ∼85% of CPU cores are allocated to VMs. This makes sense since stranding is an artifact of highly utilized nodes, which correlates with highly utilized clusters. Outliers are shown by the error bars, representing the 5th and 95th percentiles. At the 95th percentile, stranding reaches 25% during high-utilization periods. Individual outliers reach even higher.

Figure 1: Memory stranding. (a) Stranding increases significantly as more CPU cores are scheduled (x-axis: CPU cores scheduled in this cluster [%]; y-axis: stranded memory [%]); (b) Stranding changes dynamically over time (x-axis: time [days]; per-rack server traces).
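To make the stranding metric in Figure 1a concrete, here is a minimal sketch (hypothetical server fields and values, not Azure's production schema) that treats a server's unallocated DRAM as stranded once all of its schedulable cores are taken by VMs, and aggregates the fraction across a cluster:

from dataclasses import dataclass

@dataclass
class Server:
    total_cores: int
    total_gib: int
    scheduled_cores: int   # cores already allocated to VMs
    allocated_gib: int     # DRAM already allocated to VMs

def stranded_gib(s: Server) -> int:
    """DRAM that can no longer be rented because no schedulable cores remain."""
    if s.scheduled_cores < s.total_cores:
        return 0           # cores remain, so the leftover DRAM is still rentable
    return s.total_gib - s.allocated_gib

def cluster_stranding_pct(servers: list[Server]) -> float:
    """Stranded DRAM as a percentage of total cluster DRAM (cf. Figure 1a)."""
    total = sum(s.total_gib for s in servers)
    return 100.0 * sum(stranded_gib(s) for s in servers) / total

# Hypothetical snapshot: two fully packed servers strand their leftover DRAM.
fleet = [
    Server(64, 512, 64, 448),   # all cores scheduled, 64 GiB stranded
    Server(64, 512, 64, 480),   # all cores scheduled, 32 GiB stranded
    Server(64, 512, 40, 320),   # cores remain, nothing stranded
]
print(f"{cluster_stranding_pct(fleet):.1f}% of DRAM stranded")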
3.2. VM Memory Utilization in Azure
Dataset. We perform measurements on the same general-purpose production clusters. For untouched memory, we rely on guest-reported memory usage counters cross-referenced with hypervisor page-table access-bit scans. We sample memory bandwidth counters using Intel RDT [12] for a subset of clusters with compatible hardware. Finally, we use hypervisor counters to measure non-uniform memory access (NUMA) spanning in dual-socket servers, where a VM has cores on one socket and some memory from another socket.
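As a rough illustration of the cross-referencing step above, the sketch below (our simplification; the counter names and the max-based combination are assumptions, not the production implementation) treats memory as untouched only if neither the guest-reported usage nor the access-bit scan indicates it was used:

def untouched_gib(vm_size_gib: float,
                  guest_reported_used_gib: float,
                  accessed_gib_from_scan: float) -> float:
    """Estimate a VM's untouched memory by cross-referencing two signals:
    what the guest OS reports as used and what the hypervisor's page-table
    access-bit scan saw being touched. Taking the max is conservative."""
    touched = max(guest_reported_used_gib, accessed_gib_from_scan)
    return max(0.0, vm_size_gib - touched)

# Hypothetical VM: 32 GiB total, guest reports 10 GiB used, scan saw 14 GiB touched.
print(untouched_gib(32, 10, 14))   # 18.0 GiB estimated untouched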
Figure 2 details: slowdown is performance under all-remote memory relative to all-local memory, measured for two configurations (local 78ns vs. remote 142ns, i.e., 182%; local 115ns vs. remote 255ns, i.e., 222%). Workloads are grouped into Proprietary (P1-P13), Redis/VoltDB (YCSB A-F), Spark (ML/Web, etc.), GAPBS (bc, bfs, cc, pr, sssp, tc), TPC-H (Queries 1-22), SPEC CPU 2017 (501.perlbench_r-657.xz_s), and PARSEC/SPLASH2x (facesim, vips, fft, etc.). Some workloads were not run on the 222% configuration due to insufficient DRAM per NUMA node.

Figure 2: Performance slowdowns when memory latency increases by 182-222% (§3.3). Workloads have different sensitivity to increased memory latency, as they would see with CXL. The x-axis shows 158 representative workloads; the y-axis shows the normalized performance slowdown, i.e., performance under higher (remote) latency relative to all-local memory. “Proprietary” denotes production workloads at Azure.
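For reference, the y-axis metric can be computed as in the sketch below (a minimal sketch of one common normalization, which we assume matches the figure; the measurement harness in [2] may differ):

def slowdown_percent(perf_local: float, perf_remote: float) -> float:
    """Normalized slowdown: performance lost when all memory sits at the
    higher (remote, CXL-like) latency, relative to the all-local baseline.
    perf_* are throughput-style metrics where higher is better."""
    return 100.0 * (1.0 - perf_remote / perf_local)

# Example: a workload retaining 80% of its local throughput shows a 20% slowdown.
print(slowdown_percent(100.0, 80.0))   # 20.0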
3.3. Workload Sensitivity to Memory Latency

We summarize previous experiments on latency sensitivity [2].

Dataset. We evaluate 158 workloads across proprietary workloads, in-memory stores, data processing, and benchmark suites. They run on dual-socket Intel Skylake 8157M, with a 182% latency increase for socket-remote memory, or AMD EPYC 7452, with a 222% latency increase. We normalize performance as slowdown relative to NUMA-local performance.

Latency sensitivity. Figure 2 surveys workload slowdowns. Under a 182% increase in memory latency, we find that 26% of the 158 workloads experience less than 1% slowdown under CXL. At the same time, some workloads are severely affected, with 21% of the workloads facing >25% slowdowns. Overall, every workload class has at least one workload with less than 5% slowdown and one workload with more than 25% slowdown (except SPLASH2x). Our proprietary workloads are less impacted than the overall workload set, with almost half seeing <1% slowdown. These production workloads are NUMA-aware and often include data placement optimizations.

Under a 222% increase in memory latency, we find that 23% of the 158 workloads experience less than 1% slowdown under CXL. More than 37% of workloads face >25% slowdowns, a significantly higher fraction than on the 182% emulated latency increase. We find that the processing pipeline for some workloads, like VoltDB, seems to have just enough slack to accommodate the smaller 182% latency increase, with significant pipeline stalls for the 222% latency increase. Other workload classes like graph processing (GAPBS) are sensitive to both latency and bandwidth, and both effects are worsened on the 222% system.

4. The Memory Pool Design Space

Designing a memory pool involves multiple hardware components and design choices that expand with every new CXL release. To limit complexity, we focus on two design aspects: 1) whether to provide connectivity via CXL switches or through CXL multi-headed devices (MHDs) [5, §2.5] and 2) how large the constructed pool should be to maximize return on investment (ROI). We discuss a particular set of choices suitable for general-purpose cloud computing. Other use cases may see different sets of choices and tradeoffs.

4.1. Components

CXL memory controller (MC) devices act as a bridge between the CXL protocol and memory devices such as DDR5 DRAMs. Today's MCs typically bridge between 1-2 CXL ×8 ports and 1-2 80b channels of DDR5 (e.g., [13]).

CXL switches behave similarly to other network switches in that they forward requests and data without serving as an endpoint. Physically, CXL switches will likely share many characteristics (e.g., port count) with PCIe switches, due to using the same physical interface. For the purposes of this analysis, we assume that switches with 128 lanes (16 ports) of CXL are used to build a fabric layer.

A CXL MHD essentially combines a switch and a memory controller in a single device. Specifically, the MHD offers multiple CXL ports.
Pool designs with a multi-headed device (MHD): a 2-8 socket pool with CPUs attached directly to the MHD reaches a load-to-use latency of 155ns (182% of NUMA-local); a 16-socket pool, whose longer links require retimers, reaches 180ns (212%); a 32-64 socket pool that additionally traverses a CXL switch exceeds 270ns (318%). Pool designs with only switches and single-headed memory controllers (MCs): a 2-8 socket pool through one switch exceeds 190ns (224%); a 16-socket pool with retimed links exceeds 250ns (294%); a 32-64 socket pool through two switch levels exceeds 345ns (405%). Per-hop latencies labeled in the figure: core/LLC 40ns, each CXL port crossing 25ns, link 5ns (5+20+5ns with a retimer), switch arbitration/NOC 20ns, MHD-internal NOC 15ns, and memory controller plus DDR5 DRAM access 45ns.

Figure 3: Pool size and latency tradeoffs. Small pools of 8-16 sockets add only 75-90ns relative to NUMA-local DRAM. Latency increases for larger pools that require retimers and a switch.
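The load-to-use figures above are simple sums of the per-hop latencies labeled in Figure 3; the sketch below recomputes the three MHD-based configurations (constant names are ours, values are taken from the figure):

# Per-hop latencies (ns) as labeled in Figure 3; the names are ours.
CORE_LLC   = 40          # core, LLC, and coherence logic on the CPU
CXL_PORT   = 25          # each CXL port crossing (CPU, switch, or MHD side)
LINK       = 5           # short trace without a retimer
RETIMED    = 5 + 20 + 5  # trace + retimer + trace for longer reaches
SWITCH_NOC = 20          # switch arbitration / network-on-chip
DEVICE_NOC = 15          # NOC inside the MHD
MC_DRAM    = 45          # memory controller plus DDR5 access

def load_to_use(*hops: int) -> int:
    """End-to-end load-to-use latency is the sum of the traversed hops."""
    return sum(hops)

# 2-8 sockets, CPUs attached directly to the MHD: 155ns (Figure 3, top row).
small_mhd = load_to_use(CORE_LLC, CXL_PORT, LINK, CXL_PORT, DEVICE_NOC, MC_DRAM)

# 16 sockets: the longer reach adds a retimer on the link: 180ns.
medium_mhd = load_to_use(CORE_LLC, CXL_PORT, RETIMED, CXL_PORT, DEVICE_NOC, MC_DRAM)

# 32-64 sockets: retimed links through a CXL switch, then into the MHD: ~280ns.
large_mhd = load_to_use(CORE_LLC, CXL_PORT, RETIMED, CXL_PORT, SWITCH_NOC,
                        CXL_PORT, RETIMED, CXL_PORT, DEVICE_NOC, MC_DRAM)

print(small_mhd, medium_mhd, large_mhd)   # 155 180 280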
DRAM, and other standard infrastructure (e.g., network interface cards (NICs), power delivery, management controllers, boards, etc.).

Figure 5: Cost savings (building on Figure 4) for MC- and MHD-based designs with infrastructure overheads of 0.5x, 1x, and 2x.
6. Discussion and Conclusion

CXL-based memory pooling promises to reduce DRAM needs for general-purpose cloud platforms. This paper outlines the design space for memory pooling and offers a framework to evaluate different proposals.

As cloud datacenters are quickly evolving, some key parameters will differ significantly even among cloud providers and over time. The fraction of VM memory that can be allocated on CXL pools depends largely on the type of latency mitigation employed. For example, the recent Pond [2] system can allocate an average of 35-44% of DRAM on CXL pools while satisfying stringent cloud performance goals. Future techniques for performance management may lead to significantly higher CXL pool usage. Another difference comes from server and infrastructure cost breakdowns, which lead to entirely different cost curves (Figure 5).

Regardless of the variability in system and cost parameters, we believe that Observations 1-4 broadly apply to general-purpose clouds. We highlight that small pools, spanning up to 16 sockets, can lead to significant DRAM savings. This requires keeping infrastructure cost overheads low, which reinforces the need for standardization of pooling infrastructure. Latency and cost increase quickly for larger pool sizes, while the efficiency benefits fall off, which may make large pools counterproductive in many scenarios.

Our savings model focuses on pooling itself, e.g., averaging peak DRAM demand across the pool, and on Azure-specific workloads. CXL also enables other savings, including using cheaper media behind a CXL controller, such as reusing DDR4 from decommissioned servers. We advise practitioners to create a savings model for their specific use cases, which might differ from ours.

CXL re-opens memory controller architecture as a research frontier. With memory controllers decoupled from CPU sockets, new controller features can be more quickly explored and deployed. Cloud providers need improved reliability, availability, and serviceability (RAS) capabilities, including memory error correction, management, and isolation. Tighter integration between memory chips, modules, and controllers can enable improvements along the Pareto frontier of RAS, memory bandwidth, and latency.

REFERENCES

1. Shigeru Shiratake. Scaling and Performance Challenges of Future DRAM. In IMW '20.
2. Huaicheng Li, Daniel S. Berger, Stanko Novakovic, Lisa Hsu, Dan Ernst, Pantea Zardoshti, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark Hill, Marcus Fontoura, and Ricardo Bianchini. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In ASPLOS '23, pages 574-587.
3. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with Infiniswap. In NSDI '17.
4. Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. AIFM: High-Performance, Application-Integrated Far Memory. In OSDI '20.
5. CXL Specification. Available at https://www.computeexpresslink.org/download-the-specification, accessed December 2020, 2020.
6. Huaicheng Li, Mingzhe Hao, Stanko Novakovic, Vaibhav Gogte, Sriram Govindan, Dan R. K. Ports, Irene Zhang, Ricardo Bianchini, Haryadi S. Gunawi, and Anirudh Badam. LeapIO: Efficient and Portable Virtual NVMe Storage on ARM SoCs. In ASPLOS '20.
7. Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, and Dan Tsafrir. Page Fault Support for Network Controllers. In ASPLOS '17.
8. Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E. Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, and Thomas Moscibroda. Protean: VM Allocation Service at Scale. In OSDI '20.
9. Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In SOSP '17.
10. Christian Pinto, Dimitris Syrivelis, Michele Gazzetti, Panos Koutsovasilis, Andrea Reale, Kostas Katrinis, and Peter Hofstee. ThymesisFlow: A Software-Defined, HW/SW Co-Designed Interconnect Stack for Rack-Scale Memory Disaggregation. In MICRO-53.
11. Debendra Das Sharma. Compute Express Link: An Open Industry-Standard Interconnect Enabling Heterogeneous Data-Centric Computing. In HotI 29.
12. Intel Resource Director Technology (Intel RDT). Available at https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html, accessed September 2022, 2015.
13. Astera Labs. Leo Memory Connectivity Platform for CXL 1.1 and 2.0. Available at https://www.asteralabs.com/wp-content/uploads/2022/08/Astera_Labs_Leo_Aurora_Product_FINAL.pdf, accessed August 2022, 2022.
14. Lisa Su. AMD Unveils Workload-Tailored Innovations and Products at The Accelerated Data Center Premiere. https://www.amd.com/en/press-releases/2021-11-08-amd-unveils-workload-tailored-innovations-and-products-the-accelerated, November 2021.
15. CXL Use-cases Driving the Need For Low Latency Performance Retimers. https://www.microchip.com/en-us/about/blog/learning-center/cxl--use-cases-driving-the-need-for-low-latency-performance-reti, 2021.
16. Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. Borg: the Next Generation. In EuroSys '20.
17. Qizhen Zhang, Philip A. Bernstein, Daniel S. Berger, and Badrish Chandramouli. Redy: Remote Dynamic Memory Cache. In VLDB '22.
18. Irina Calciu, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli. Rethinking Software Runtimes for Disaggregated Memory. In ASPLOS '21.
19. Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: A Hardware-Software Co-Designed Disaggregated Memory System. In ASPLOS '22.
20. Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. In OSDI '16.

Ph.D. degree in computer science from UC Berkeley and served 32 years at University of Wisconsin Computer Science.

Ricardo Bianchini is a Distinguished Engineer at Microsoft Azure. Ricardo received his Ph.D. degree in computer science from University of Rochester.