
Design Tradeoffs in CXL-Based Memory Pools for Public Cloud Platforms

Daniel S. Berger∗§, Daniel Ernst∗, Huaicheng Li†, Pantea Zardoshti∗, Monish Shah∗, Samir Rajadnya∗, Scott Lee∗, Lisa Hsu∗, Ishwar Agarwal‡, Mark D. Hill∗◦, Ricardo Bianchini∗

∗Microsoft Azure  §University of Washington  †Virginia Tech  ‡Intel  ◦University of Wisconsin-Madison

Abstract—DRAM is a key driver of performance and cost in public cloud servers. At the same time, a significant amount of DRAM is underutilized due to fragmented use across servers. Emerging interconnects such as CXL offer a path towards improving utilization through memory pooling. However, the design space of CXL-based memory systems is large, with key questions around the size, reach, and topology of the memory pool. At the same time, using pools requires navigating complex design constraints around performance, virtualization, and management. This paper discusses why cloud providers should deploy CXL memory pools, key design constraints, and observations from designing toward practical deployment. We identify configuration examples with a significant positive return on investment.

1. Introduction

Motivation. Many public cloud customers deploy their workloads via virtual machines (VMs). VMs enable performance comparable to on-premises datacenters without the need to manage datacenters. Cloud providers face the challenge of achieving excellent performance at a competitive hardware cost.

A key driver of both performance and cost is main memory. The gold standard for memory performance is to preallocate a VM with cores and memory on the same socket. This leads to memory latency below 100ns and facilitates virtualization acceleration. At the same time, DRAM has become a major portion of hardware cost due to its poor scaling properties with only nascent alternatives [1]. For example, DRAM can be over 50% of server cost [2].

Through analysis of Azure VM traces, we identify memory stranding as a dominant source of memory waste and a potential source of cost savings. Stranding happens when all server cores are rented (i.e., allocated to customer VMs) but unallocated memory capacity remains and cannot be rented. We find that up to 30% of DRAM becomes stranded as more cores become allocated to VMs.

Limitations of the state-of-the-art. Reducing DRAM usage in the public cloud is challenging due to its stringent performance requirements. Pooling memory via memory disaggregation is a promising approach because stranded memory can be returned to the disaggregated pool and used by other servers. Unfortunately, existing pooling systems have microsecond access latencies and require page faults or changes to the VM guest [3, 4].

The emerging CXL interconnect. The emerging Compute Express Link (CXL) interconnect [5] enables cacheable load/store (ld/st) accesses to pooled memory on many current processors. Pool-memory accesses via loads/stores are a game changer for cloud computing because they allow memory to remain statically preallocated while physically being located in a shared pool.

However, CXL access latency depends on the overall system design, especially the pool size (the number of CPU sockets able to use a given pool) and topology. Larger pools require traversing switching levels, which adds significant latency. Additionally, each CXL component adds to the system cost, which must be balanced against stranding savings.

This work. This work is motivated by the memory stranding problem identified in Pond [2] and we paraphrase the stranding analysis in Section 3. While Pond focuses on system software policies and mechanisms for allocating/managing pooled memory, this work focuses on design tradeoffs in the pool's hardware configuration. First, we characterize pool components, possible topologies, and associated memory access latencies. We derive a set of design recommendations from this analysis. Second, we compare savings from memory pooling to the cost of its components for different pool sizes and CXL device types. We find that CXL-based memory pooling can yield significant positive returns on investment. Contrary to the focus of existing literature, smaller pools may be attractive. Third, we discuss future directions for the industry as well as academic research.

2. Background

Cloud resource allocation. Public cloud workloads run inside virtual machines (VMs). To offer performance close to dedicated (non-virtualized) resources, VM resources are statically allocated by reserving each resource (CPU, DRAM, network bandwidth, etc.) for a VM's lifetime. Additionally, providers optimize I/O performance with virtualization accelerators that bypass the hypervisor [6]. For example, accelerated networking is enabled by default on AWS and Azure. Virtualization acceleration requires statically preallocating (or "pinning") a VM's entire address space [7].

Cloud resource scheduling. Scheduling VMs with heterogeneous multi-dimensional resource demands onto servers leads to a challenging bin-packing problem [8, 9]. Scheduling is further complicated by constraints such as spreading VMs across multiple failure domains.

A simplified view of Azure's VM scheduling is that VMs are first assigned to a compute cluster and then placed on a specific server within this cluster. A cluster roughly corresponds to a row of racks with homogeneous server configurations. We use the unit of a cluster to characterize our workloads.

Memory stranding. It is often difficult to provision servers that closely match the resource demands of the incoming VM mix. A common reason is that the DRAM-to-core ratio of a server that will last years must be determined at platform design time and is statically fixed over its lifetime. Additionally, fixed-size DIMMs offer limited freedom in determining the DRAM-to-core ratio. When the DRAM-to-core ratio of VM arrivals and a cluster's server resources do not match, tight packing becomes especially difficult. We define a resource as stranded when it is technically available to be rented to a customer, but is practically unavailable because some other resource has been exhausted. The typical scenario for memory stranding is that all cores have been rented, but there is still memory available in the server.
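As a concrete illustration, the sketch below computes stranded DRAM for a toy cluster; the server sizes and the rule that DRAM only counts as stranded once all cores are rented are simplifying assumptions, not Azure's production accounting.

```python
from dataclasses import dataclass

@dataclass
class Server:
    cores: int           # physical cores that can be rented
    dram_gb: int         # installed DRAM
    cores_rented: int    # cores allocated to customer VMs
    dram_rented_gb: int  # DRAM allocated to customer VMs

def stranded_dram_gb(s: Server) -> int:
    """DRAM that is technically free but practically unrentable
    because the other resource (cores) has been exhausted."""
    free_dram = s.dram_gb - s.dram_rented_gb
    free_cores = s.cores - s.cores_rented
    return free_dram if free_cores == 0 else 0

# Hypothetical servers: the first is fully core-packed and strands 160 GB.
servers = [
    Server(cores=48, dram_gb=512, cores_rented=48, dram_rented_gb=352),
    Server(cores=48, dram_gb=512, cores_rented=40, dram_rented_gb=300),
]
stranded = sum(stranded_dram_gb(s) for s in servers)
total = sum(s.dram_gb for s in servers)
print(f"stranded: {stranded} GB ({100 * stranded / total:.1f}% of installed DRAM)")
```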
Reducing stranding via pooling. This work proposes to break the fixed hardware configuration of servers by disaggregating memory into a pool that is accessible by multiple hosts [10]. By dynamically reassigning memory to different hosts at different times, we can shift memory resources to where they are needed. Thus, we can provision servers close to the average DRAM-to-core ratios and tackle deviations via the memory pool.

Pooling via CXL. The CXL.mem protocol for ld/st memory semantics maps device memory to the system address space. Last-level cache (LLC) misses to CXL memory addresses translate into requests on a CXL port whose responses bring the missing cachelines. Similarly, LLC write-backs translate into CXL data writes. CXL memory is virtualized using hypervisor page tables and the memory-management unit and is thus compatible with virtualization acceleration.

CXL.mem uses PCIe's electrical interface with custom link and transaction layers for low latency. Intel measures CXL port latencies at 25ns round-trip [11]. With PCIe 5.0, the bandwidth of a bidirectional ×8-CXL port at a typical 2:1 read:write ratio roughly matches an 80-bit DDR5-4800 channel.
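As a rough check on that comparison, the arithmetic below uses nominal link rates (PCIe 5.0 at 32 GT/s per lane) and ignores encoding and protocol overheads, so the numbers are estimates rather than measurements.

```python
# Nominal DDR5-4800 channel bandwidth: 4800 MT/s x 8 data bytes
# (the 80-bit channel width includes ECC bits).
ddr5_channel_gb_s = 4.8 * 8                 # ~38.4 GB/s

# A x8 PCIe 5.0 / CXL port moves ~32 GB/s in each direction:
# 32 GT/s per lane x 8 lanes / 8 bits per byte, before overheads.
cxl_x8_per_direction_gb_s = 32 * 8 / 8      # ~32 GB/s

# With a 2:1 read:write mix, reads use the device-to-host direction and
# writes the host-to-device direction. Total traffic T is limited by
# 2T/3 <= 32 (reads) and T/3 <= 32 (writes), so T <= 48 GB/s,
# which is in the same ballpark as one DDR5-4800 channel.
max_mixed_traffic_gb_s = min(32 / (2 / 3), 32 / (1 / 3))
print(ddr5_channel_gb_s, cxl_x8_per_direction_gb_s, max_mixed_traffic_gb_s)
```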

[Figure 1 plots: (a) stranded memory [%] versus the percentage of CPU cores scheduled in a cluster (roughly 60-90%), with 5th/95th percentile error bars and outliers; (b) stranding over time across eight racks over 75 days, where whitespace means a server strands less than 1% of its memory and a black pixel means stranding above 1%; servers in racks 5/6 show long periods of stranding above 1% of memory.]

Figure 1: Memory stranding. (a) Stranding increases significantly as more CPU cores are scheduled; (b) Stranding changes dynamically over time.

3. Cloud Workload Characterization

3.1. Stranding at Azure

We summarize previous analysis on stranding [2].

Dataset. We measure stranding in 100 general-purpose clusters over a 75-day period. A general-purpose cluster hosts a mix of first-party and third-party VM workloads that do not require special hardware (such as GPUs). We select clusters with similar deployment years, spanning major regions on the planet. Each cluster trace contains millions of per-VM arrival/departure events.

Memory stranding. Figure 1a shows the hourly average amount of stranded DRAM across our cluster sample, bucketed by the percentage of scheduled CPU cores. In clusters where 75% of CPU cores are scheduled for VMs, 6% of memory is stranded. This grows to over 10% when ∼85% of CPU cores are allocated to VMs. This makes sense since stranding is an artifact of highly utilized nodes, which correlates with highly utilized clusters. Outliers are shown by the error bars, representing the 5th and 95th percentiles. At the 95th percentile, stranding reaches 25% during high utilization periods. Individual outliers reach more than 30% stranding.

Figure 1b shows stranding over time across eight adjacent racks. Every row shows a server within each rack. A workload change (around day 36) suddenly increased stranding significantly. Furthermore, stranding can affect many racks concurrently (e.g., racks 2, 4-7) and it is generally hard to predict which clusters/racks will have stranded memory.

3.2. VM Memory Utilization in Azure

Dataset. We perform measurements on the same general-purpose production clusters. For untouched memory, we rely on guest-reported memory usage counters cross-referenced with hypervisor page table access bit scans. We sample memory bandwidth counters using Intel RDT [12] for a subset of clusters with compatible hardware. Finally, we use hypervisor counters to measure non-uniform memory access (NUMA) spanning in dual-socket servers, where a VM has cores on one socket and some memory from another socket.

Memory bandwidth. Memory bandwidth usage of general-purpose workloads is generally low, with average bandwidth utilization below 10 GB/s. VMs on a small number of hosts do, however, use 100% of memory bandwidth.

NUMA spanning. Most VMs are small and can fit on a single socket. Azure's hypervisor aims to schedule VMs on dual-socket servers such that they fit entirely (cores and memory) on a single NUMA node. We find that spanning occurs for only 2-3% of VMs.

Overall, untouched memory and low memory bandwidth requirements make VM workloads a good fit for memory pooling. However, with 97-98% of VMs using NUMA-local memory, performance parity for pooled memory will be challenging.

[Figure 2 plot: slowdown (%) for 158 workloads grouped by suite (proprietary, Redis, VoltDB, Spark, GAPBS, TPC-H, SPEC CPU 2017, SPLASH2x/PARSEC); the x-axis groups include P1-P13, YCSB A-F, ML/Web, bc/bfs/cc/pr/sssp/tc, Queries 1-22, 501.perlbench_r-657.xz_s, and facesim/vips/fft. Legend: slowdown is performance under all-remote memory relative to all-local memory; the emulated configurations are local 78ns / remote 142ns (182%) and local 115ns / remote 255ns (222%). Some workloads were not run on the 222% configuration due to insufficient DRAM per NUMA node.]

Figure 2: Performance slowdowns when memory latency increases by 182-222% (§3.3). Workloads have different sensitivity to increased memory latency as they would see with CXL. The X-axis shows 158 representative workloads; the Y-axis shows the normalized performance slowdown, i.e., performance under higher (remote) latency relative to all local memory. "Proprietary" denotes production workloads at Azure.

3.3. Workload Sensitivity to Memory Latency

We summarize previous experiments on latency sensitivity [2].

Dataset. We evaluate 158 workloads across proprietary workloads, in-memory stores, data processing, and benchmark suites. They run on dual-socket Intel Skylake 8157M, with a 182% latency increase for socket-remote memory, or AMD EPYC 7452, with a 222% latency increase. We normalize performance as slowdown relative to NUMA-local performance.
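The quoted percentages follow from the measured local and remote latencies in Figure 2's legend; the short sketch below shows one plausible reading of the normalization (the example throughput numbers are made up).

```python
def relative_latency_pct(local_ns: float, remote_ns: float) -> float:
    """Remote (emulated CXL) latency expressed relative to local latency."""
    return 100 * remote_ns / local_ns

def slowdown_pct(perf_local: float, perf_remote: float) -> float:
    """One way to express the normalized slowdown plotted in Figure 2."""
    return 100 * (perf_local - perf_remote) / perf_local

# Latencies from Figure 2's legend (Intel and AMD testbeds):
print(relative_latency_pct(78, 142))    # ~182%
print(relative_latency_pct(115, 255))   # ~222%
# A workload whose throughput drops from 1.00 to 0.85 with all-remote
# memory would show up as a 15% slowdown.
print(slowdown_pct(1.00, 0.85))         # 15.0
```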
Latency sensitivity. Figure 2 surveys workload slowdowns. Under a 182% increase in memory latency, we find that 26% of the 158 workloads experience less than 1% slowdown under CXL. At the same time, some workloads are severely affected, with 21% of the workloads facing >25% slowdowns. Overall, every workload class has at least one workload with less than 5% slowdown and one workload with more than 25% slowdown (except SPLASH2x). Our proprietary workloads are less impacted than the overall workload set, with almost half seeing <1% slowdown. These production workloads are NUMA-aware and often include data placement optimizations.

Under a 222% increase in memory latency, we find that 23% of the 158 workloads experience less than 1% slowdown under CXL. More than 37% of workloads face >25% slowdowns, a significantly higher fraction than on the 182% emulated latency increase. We find that the processing pipeline for some workloads, like VoltDB, seems to have just enough slack to accommodate the smaller 182% latency increase, with significant pipeline stalls at the 222% latency increase. Other workload classes like graph processing (GAPBS) are sensitive to both latency and bandwidth, and both effects are worsened on the 222% system.

4. The Memory Pool Design Space

Designing a memory pool involves multiple hardware components and design choices that expand with every new CXL release. To limit complexity, we focus on two design aspects: 1) whether to provide connectivity via CXL switches or through CXL multi-headed devices (MHDs) [5, §2.5] and 2) how large the constructed pool should be to maximize return on investment (ROI). We discuss a particular set of choices suitable for general-purpose cloud computing. Other use cases may see different sets of choices and tradeoffs.

4.1. Components

CXL memory controller (MC) devices act as a bridge between the CXL protocol and memory devices such as DDR5 DRAMs. Today's MCs typically bridge between 1-2 CXL ×8 ports and 1-2 80b channels of DDR5 (e.g., [13]).

CXL switches behave similarly to other network switches in that they forward requests and data without serving as an endpoint. Physically, CXL switches will likely share many characteristics (e.g., port count) with PCIe switches, due to using the same physical interface. For the purposes of this analysis, we assume that switches with 128 lanes (16 ports) of CXL are used to build a fabric layer.

A CXL MHD essentially combines a switch and a memory controller in a single device.

[Figure 3 diagram: load-to-use latency breakdowns for six pool designs. With a multi-headed device (MHD): a 2-8 socket pool reaches roughly 155ns (182% of NUMA-local latency), a 16-socket pool with retimers roughly 180ns (212%), and a 32-64 socket pool with a CXL switch and retimers more than 270ns (318%). With only switches and single-headed memory controllers (MCs): a 2-8 socket pool exceeds 190ns (224%), a 16-socket pool with retimers exceeds 250ns (294%), and a 32-64 socket pool with two switch levels exceeds 345ns (405%). The per-component estimates are roughly 40ns for the CPU core/LLC/fabric, 25ns per CXL port, 5ns of propagation per link (plus about 20ns per retimer), 20ns for switch arbitration and NOC, 15ns for the device NOC, and 45ns for the memory controller and DRAM.]

Figure 3: Pool size and latency tradeoffs. Small pools of 8-16 sockets add only 75-90ns relative to NUMA-local DRAM. Latency increases for larger pools that require retimers and a switch.

Specifically, the MHD offers multiple CXL ports and appears to each connected host as a single logical memory device [5]. The most significant tradeoffs for MHD designs are the number of incoming CXL ports and DDR channels. A useful design comparison is a modern server CPU IO-die (IOD), such as the one in AMD Genoa [14]. The Genoa IOD offers 128 PCIe5 lanes as well as 12 DDR5 channels. With the ×8-CXL requirement, this would be analogous to a 16-headed device with at least 8 channels of DDR5. In our analysis we consider both this 16-headed device and a smaller 8-headed device.
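The head count follows directly from the lane budget; the back-of-the-envelope check below uses nominal bandwidth figures and an assumed 8-channel configuration, so it is illustrative only.

```python
# AMD Genoa IO-die connectivity [14]: 128 PCIe5 lanes, 12 DDR5 channels.
lanes, lanes_per_host = 128, 8          # one x8 CXL port per host
max_heads = lanes // lanes_per_host
print(max_heads)                        # 16 hosts could attach

# Rough aggregate bandwidth balance for a 16-headed MHD backed by
# 8 DDR5-4800 channels (nominal rates, overheads ignored):
host_port_bw_gb_s = max_heads * 32      # ~512 GB/s of x8 ports, per direction
dram_bw_gb_s = 8 * 38.4                 # ~307 GB/s of DRAM bandwidth
print(host_port_bw_gb_s, dram_bw_gb_s)
```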
4.2. Pool size vs latency

At a high level, the first design decision is whether cloud compute servers can pool all of their memory. With 21-37% of workloads facing significant slowdowns on pool-only configurations (§3), we do not recommend fully disaggregating compute and memory. Servers need to retain significant amounts of local DRAM to maintain performance expectations, which will likely go beyond the scope of on-die memory. Further, achieving maximum memory bandwidth requires CPUs to populate all available local DDR channels, creating a practical minimum for local memory capacity.

Observation 1: A significant percentage (more than 25%) of datacenter memory needs to remain local to compute servers.

To understand pool latencies, we first characterize the impact on latency of achievable topologies given viable components.

Observation 2: When using at least a ×8-CXL port for each host, pool sizes beyond 16-32 hosts will require at least one level of switches if MHDs are used or two levels of switches if using only individual MCs.

Access latencies derive from multiple parameters. Port latency plays a dominant role, with initial measurements indicating 25ns [11]. Retimers are devices used to maintain CXL/PCIe signal integrity over distances above roughly half a meter, depending on the implementation of the signal path. They add about 10ns of latency in each direction (e.g., [15]). Each switch will add at least 70ns of latency due to ports, arbitration, and the network-on-chip (NOC).
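A minimal latency model along the lines of Figure 3. The port, retimer, and switch terms follow the estimates above; the CPU-side fabric, device NOC, DRAM access, and the 85ns NUMA-local baseline are assumed values taken from Figure 3's breakdown, so the totals are illustrative rather than measured.

```python
# Load-to-use latency estimates in ns (assumed breakdown, per Figure 3).
CORE_LLC_FABRIC = 40   # CPU core/LLC and on-die fabric
PORT = 25              # per CXL port traversal
PROP = 5               # flight time per link segment
RETIMER = 20           # ~10ns per direction per retimer
SWITCH = 70            # ingress port + arbitration/NOC + egress port
DEVICE_NOC = 15        # on-device network to the memory controller
MC_DRAM = 45           # memory controller and DRAM access
LOCAL_NS = 85          # assumed NUMA-local baseline

def load_to_use(n_switches: int, n_retimers: int):
    links = 1 + n_switches                    # host<->device, one hop per switch
    total = (CORE_LLC_FABRIC + 2 * PORT + links * PROP
             + n_switches * SWITCH
             + n_retimers * (RETIMER + PROP)  # a retimer splits a link in two
             + DEVICE_NOC + MC_DRAM)
    return total, round(100 * total / LOCAL_NS)

print(load_to_use(0, 0))  # small MHD pool: ~155ns (~182%)
print(load_to_use(0, 1))  # 16-socket MHD pool with a retimer: ~180ns (~212%)
print(load_to_use(1, 2))  # 32-64 sockets via a switch and retimers: >270ns
```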
Figure 3 shows a range of CXL path types based on pool sizes and the use of MHDs versus switches with single-headed devices. We find that small 8- and 16-socket pools using MHDs increase latencies to 182-212% (155-180ns) relative to NUMA-local DRAM. Latency when using only switches and single-headed memory controllers would further increase by 23-38%.

Rack-scale pooling with 64 sockets would increase latencies to 318-405% (270-345ns) and pooling across multiple racks would require yet another level of switching and potentially multiple retimers, increasing latencies to more than 465% (395ns). Comparing these latencies to the slowdowns observable at 182-222% (§3), we observe that large-scale pooling will likely be prohibitive from a performance perspective.

Observation 3: The size of CXL-based memory pools will likely be a subset of a rack to minimize the performance impact of access latencies.

Modern CPUs can connect to multiple MHDs or switches, which allows scaling to meet bandwidth and capacity goals for different clusters.

[Figure 4 plot: required overall DRAM [%] (85-100) versus pool size in CPU sockets (2-128), with one curve per fixed fraction of VM memory placed on pool DRAM.]

Figure 4: Impact of pool size. Small pools of 32 sockets are sufficient to significantly reduce overall memory needs.

4.3. Pool size vs DRAM savings

We analyze VM-to-server traces from Azure (§3) to estimate the amount of DRAM that could be saved via pools of different sizes. The reduction in DRAM comes from averaging hosts' peak memory needs across the pool. Our simulation plays back VM traces while assigning a fixed percentage of pool memory. We repeatedly run cluster simulations while decreasing overall memory in 0.5% steps until the first VM is rejected. The minimum amount of cluster memory corresponds to the "required overall DRAM" reported below.
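The search can be sketched as follows; the trace format, the first-fit admission check, and all numbers are placeholders for illustration, and the sketch packs a static VM set rather than replaying arrivals and departures over time as the paper's simulation does.

```python
import random

def admits_all(vms, n_servers, cores_per_server, local_gb, pool_gb, pool_frac):
    """Toy first-fit admission check: pool_frac of each VM's memory comes
    from the shared pool, the rest from its host's local DRAM."""
    free_cores = [cores_per_server] * n_servers
    free_local = [local_gb] * n_servers
    free_pool = pool_gb
    for cores, mem in vms:
        local_need, pool_need = mem * (1 - pool_frac), mem * pool_frac
        for s in range(n_servers):
            if (free_cores[s] >= cores and free_local[s] >= local_need
                    and free_pool >= pool_need):
                free_cores[s] -= cores
                free_local[s] -= local_need
                free_pool -= pool_need
                break
        else:
            return False  # first rejected VM ends the trial
    return True

def required_overall_dram(vms, n_servers, cores_per_server, baseline_gb, pool_frac):
    """Shrink total cluster DRAM in 0.5% steps until a VM is rejected."""
    scale = 1.0
    while scale > 0.5:
        trial = scale - 0.005
        local_gb = baseline_gb * trial * (1 - pool_frac) / n_servers
        pool_gb = baseline_gb * trial * pool_frac
        if not admits_all(vms, n_servers, cores_per_server, local_gb, pool_gb, pool_frac):
            return scale  # last scale at which every VM still fit
        scale = trial
    return scale

random.seed(0)
vms = [(random.choice([2, 4, 8]), random.choice([8, 16, 32, 64])) for _ in range(200)]
print(required_overall_dram(vms, n_servers=16, cores_per_server=96,
                            baseline_gb=16 * 512, pool_frac=0.5))
```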
Figure 4 presents cluster DRAM requirements when VMs are assigned either 10%, 30%, or 50% of pool DRAM. As the pool size increases, the figure shows that the required overall DRAM decreases. However, this effect diminishes for larger pools. For example, with a fixed 50% pool DRAM, a pool with 32 sockets saves 12% of DRAM while a pool with 64 sockets saves 13% of DRAM. Note that allocating 50% of VM memory to pool DRAM requires latency mitigation techniques (§6).

Besides low latency, feasible configurations also must be ROI positive, as discussed next.

4.4. Pool size vs system cost

System cost depends on many factors. We consider a simplified model that focuses on key hardware components: DRAM, memory controllers, cables, and the memory blade enclosure/printed circuit board (PCB). Our model ignores factors of time, scale, and market competition. Specifically, our model calculates cost relative to a non-pooled server's bill of materials (BOM) based on the following set of parameters:

MC: cost of a typical 2x8 CXL memory controller (e.g., 0.4%)
MHD8: cost of an 8-headed memory controller (e.g., 0.8%)
MHD16: cost of a 16-headed memory controller (e.g., 2.0%)
Switch: cost of a 16-port CXL switch (e.g., 1.6%)
Ret: cost of a CXL retimer (e.g., 0.02%)
Infra: cost of the supporting memory enclosure, PCBs, and cables, expressed as a multiplier applied to MHD or switch cost (e.g., 0.5-2×)

The exemplary values for the parameters are roughly based on estimates of silicon area as well as the connectivity and infrastructure necessary to support the memory pools. Note that there is significant room for these parameters to change between companies, server configurations, use cases, and over time.
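A sketch of how such parameters could be combined into a per-socket cost uplift. The device counts per pool (one shared MHD, or one switch plus eight single-headed MCs) and the amortization are assumptions made for illustration, not the exact accounting behind Figure 5.

```python
# Exemplary component costs as a percentage of a non-pooled server BOM.
COST = {"MC": 0.4, "MHD8": 0.8, "MHD16": 2.0, "Switch": 1.6, "Ret": 0.02}

def mhd_pool_uplift(sockets, infra_mult, retimers_per_host=0):
    """Per-socket uplift [% of BOM], assuming one MHD shared by the pool."""
    device = COST["MHD8"] if sockets <= 8 else COST["MHD16"]
    shared = device * (1 + infra_mult)        # device plus enclosure/PCB/cables
    return shared / sockets + retimers_per_host * COST["Ret"]

def switch_pool_uplift(sockets, infra_mult, n_mcs=8, retimers_per_host=0):
    """Per-socket uplift, assuming one 16-port switch plus n_mcs single-headed MCs."""
    shared = (COST["Switch"] + n_mcs * COST["MC"]) * (1 + infra_mult)
    return shared / sockets + retimers_per_host * COST["Ret"]

for sockets in (8, 16):
    print(sockets,
          round(mhd_pool_uplift(sockets, infra_mult=1.0), 2),
          round(switch_pool_uplift(sockets, infra_mult=1.0), 2))
```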

Figure 5 presents cost overheads for pool sizes from 2-64 sockets and for pools encompassing two different capacity points relative to total system memory. The baseline for comparison is the full cost of a non-pooled server, including CPU, DRAM, and other standard infrastructure (e.g., network interface cards (NICs), power delivery, management controllers, boards, etc.). Within this baseline, DRAM is assumed to account for approximately half of the total cost, with the CPU and other infrastructure splitting the other half. All other modeled configurations hold the total cost of the base system constant, but add the costs of the extra components required for pooling part of the memory. Our results are reported as a percentage of cost uplift versus the baseline configuration. We vary the infrastructure overhead cost to show that the overall costs are very sensitive to the ability of a design to cost-effectively provide connectivity to the pool. The analysis also shows that the overhead for switch-based designs versus MHD designs is significant. As an example, an 8-socket pool implemented with switches adds over two-and-a-half times the cost of an 8-socket pool based on MHDs.

[Figure 5 plots: cost and savings relative to a single socket [%] (100-115) versus pool size in sockets (1-64), with separate panels for 25% and 50% pool memory. Curves compare MC (switch-based) and MHD designs under infrastructure overheads of 0.5x, 1x, and 2x; a black line shows the cost savings from Figure 4.]

Figure 5: Pool system cost tradeoffs. Both cost and savings increase with pool size. Infrastructure overheads also play a key factor in cost. Cost savings (black line) from Figure 4 are workload dependent and may look significantly different for other use cases. We advise practitioners to evaluate savings for their workloads.

This overhead is important, as the system-level goal is reaching a beneficial pooling configuration, which is one where the cost uplift of moving memory into the pool is less than the efficiency benefit of having flexible memory, as outlined in the savings analysis above. In Figure 5, the black line plots the savings estimate from the earlier analysis (Figure 4). Configurations below this line are ROI positive, while those above the line are likely ROI negative unless further optimizations can be made to improve savings. Note in particular that most switch-based configurations are ROI negative, while many MHD-based configurations are ROI positive, especially for smaller pool sizes.
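The break-even condition can be written down directly. The sketch below reuses the savings estimate from Section 4.3 and the assumption that DRAM is about half of server cost; the uplift value is a hypothetical placeholder.

```python
# ROI condition: pooling pays off when the hardware cost uplift is
# smaller than the value of the DRAM that pooling saves.
dram_share_of_cost = 0.5   # DRAM is roughly half of the server BOM
dram_saved = 0.12          # e.g., a 32-socket pool at 50% pool DRAM (Section 4.3)

savings_pct_of_bom = 100 * dram_share_of_cost * dram_saved   # ~6% of BOM
uplift_pct_of_bom = 4.0    # hypothetical uplift for some pool configuration

print(f"savings: {savings_pct_of_bom:.1f}% of BOM, uplift: {uplift_pct_of_bom:.1f}%")
print("ROI positive" if uplift_pct_of_bom < savings_pct_of_bom else "ROI negative")
```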
Observation 4: Positive ROI requires pool designers to navigate a complex tradeoff between pool size, topology, and savings, which is workload dependent. Infrastructure overheads may become a major hurdle to adopting CXL-based pooling, as expensively designed configurations will not achieve beneficial ROI.

5. Related Work

Low memory resource utilization and stranding has been observed at Google [16] and Microsoft [17]. This motivated at least three lines of research on memory pooling prior to CXL.

Hypervisor/OS-level approaches such as [3] rely on page faults and access monitoring to maintain the working set in local DRAM. In the context of general-purpose cloud computing, these OS-based approaches bring too much overhead and jitter. They are also incompatible with virtualization acceleration (e.g., DDA).

Runtime-based disaggregation designs [4, 18] propose customized application programming interfaces for remote memory access. While effective, this approach requires developers to explicitly use these mechanisms at the application level.

Hardware-based memory disaggregation has served as an inspiration for CXL, but prior approaches were not available on commodity hardware [10, 19].

Prior analyses of requirements for disaggregation are related to our goals. However, network-based disaggregation [20] leads to a different design space, e.g., with latency considered in the range of 1us to 40us, whereas we consider latencies lower by an order of magnitude.

6. Discussion and Conclusion

CXL-based memory pooling promises to reduce DRAM needs for general-purpose cloud platforms. This paper outlines the design space for memory pooling and offers a framework to evaluate different proposals.

As cloud datacenters are quickly evolving, some key parameters will differ significantly even among cloud providers and over time. The fraction of VM memory that can be allocated on CXL pools depends largely on the type of latency mitigation employed. For example, the recent Pond [2] system can allocate an average of 35-44% of DRAM on CXL pools while satisfying stringent cloud performance goals. Future techniques for performance management may lead to significantly higher CXL pool usage. Another difference comes from server and infrastructure cost breakdowns, which lead to entirely different cost curves (Figure 5).

Regardless of the variability in system and cost parameters, we believe that Observations 1-4 broadly apply to general-purpose clouds. We highlight that small pools, spanning up to 16 sockets, can lead to significant DRAM savings. This requires keeping infrastructure cost overheads low, which reinforces the need for standardization of pooling infrastructure. Latency and cost increase quickly for larger pool sizes, while the efficiency benefits fall off, which may make large pools counterproductive in many scenarios.

Our savings model focuses on pooling itself, e.g., averaging peak DRAM demand across the pool, and on Azure-specific workloads. CXL also enables other savings, including using cheaper media behind a CXL controller, such as reusing DDR4 from decommissioned servers. We advise practitioners to create a savings model for their specific use cases, which might differ from ours.

CXL re-opens memory controller architecture as a research frontier. With memory controllers decoupled from CPU sockets, new controller features can be more quickly explored and deployed. Cloud providers need improved reliability, availability, and serviceability (RAS) capabilities, including memory error correction, management, and isolation. Tighter integration between memory chips, modules, and controllers can enable improvements along the Pareto frontier of RAS, memory bandwidth, and latency.

REFERENCES

1. Shigeru Shiratake. Scaling and Performance Challenges of Future DRAM. In IMW '20.
2. Huaicheng Li, Daniel S. Berger, Stanko Novakovic, Lisa Hsu, Dan Ernst, Pantea Zardoshti, Monish Shah, Samir Rajadnya, Scott Lee, Ishwar Agarwal, Mark Hill, Marcus Fontoura, and Ricardo Bianchini. Pond: CXL-Based Memory Pooling Systems for Cloud Platforms. In ASPLOS '23, pages 574-587.
3. Juncheng Gu, Youngmoon Lee, Yiwen Zhang, Mosharaf Chowdhury, and Kang G. Shin. Efficient Memory Disaggregation with Infiniswap. In NSDI '17.
4. Zhenyuan Ruan, Malte Schwarzkopf, Marcos K. Aguilera, and Adam Belay. AIFM: High-Performance, Application-Integrated Far Memory. In OSDI '20.
5. CXL Specification. Available at https://www.computeexpresslink.org/download-the-specification, accessed December 2020, 2020.
6. Huaicheng Li, Mingzhe Hao, Stanko Novakovic, Vaibhav Gogte, Sriram Govindan, Dan R. K. Ports, Irene Zhang, Ricardo Bianchini, Haryadi S. Gunawi, and Anirudh Badam. LeapIO: Efficient and Portable Virtual NVMe Storage on ARM SoCs. In ASPLOS '20.
7. Ilya Lesokhin, Haggai Eran, Shachar Raindel, Guy Shapiro, Sagi Grimberg, Liran Liss, Muli Ben-Yehuda, Nadav Amit, and Dan Tsafrir. Page Fault Support for Network Controllers. In ASPLOS '17.
8. Ori Hadary, Luke Marshall, Ishai Menache, Abhisek Pan, Esaias E. Greeff, David Dion, Star Dorminey, Shailesh Joshi, Yang Chen, Mark Russinovich, and Thomas Moscibroda. Protean: VM Allocation Service at Scale. In OSDI '20.
9. Eli Cortez, Anand Bonde, Alexandre Muzio, Mark Russinovich, Marcus Fontoura, and Ricardo Bianchini. Resource Central: Understanding and Predicting Workloads for Improved Resource Management in Large Cloud Platforms. In SOSP '17.
10. Christian Pinto, Dimitris Syrivelis, Michele Gazzetti, Panos Koutsovasilis, Andrea Reale, Kostas Katrinis, and Peter Hofstee. ThymesisFlow: A Software-Defined, HW/SW Co-Designed Interconnect Stack for Rack-Scale Memory Disaggregation. In MICRO-53.
11. Debendra Das Sharma. Compute Express Link: An Open Industry-Standard Interconnect Enabling Heterogeneous Data-Centric Computing. In HotI29.
12. Intel Resource Director Technology (Intel RDT). Available at https://www.intel.com/content/www/us/en/architecture-and-technology/resource-director-technology.html, accessed September 2022, 2015.
13. Astera Labs. Leo Memory Connectivity Platform for CXL 1.1 and 2.0. Available at https://www.asteralabs.com/wp-content/uploads/2022/08/Astera_Labs_Leo_Aurora_Product_FINAL.pdf, accessed August 2022, 2022.
14. Lisa Su. AMD Unveils Workload-Tailored Innovations and Products at The Accelerated Data Center Premiere. https://www.amd.com/en/press-releases/2021-11-08-amd-unveils-workload-tailored-innovations-and-products-the-accelerated, November 2021.
15. CXL Use-cases Driving the Need For Low Latency Performance Retimers. https://www.microchip.com/en-us/about/blog/learning-center/cxl--use-cases-driving-the-need-for-low-latency-performance-reti, 2021.
16. Muhammad Tirmazi, Adam Barker, Nan Deng, Md E. Haque, Zhijing Gene Qin, Steven Hand, Mor Harchol-Balter, and John Wilkes. Borg: the Next Generation. In EuroSys '20.
17. Qizhen Zhang, Philip A. Bernstein, Daniel S. Berger, and Badrish Chandramouli. Redy: Remote Dynamic Memory Cache. In VLDB '22.
18. Irina Calciu, M. Talha Imran, Ivan Puddu, Sanidhya Kashyap, Hasan Al Maruf, Onur Mutlu, and Aasheesh Kolli. Rethinking Software Runtimes for Disaggregated Memory. In ASPLOS '21.
19. Zhiyuan Guo, Yizhou Shan, Xuhao Luo, Yutong Huang, and Yiying Zhang. Clio: A Hardware-Software Co-Designed Disaggregated Memory System. In ASPLOS '22.
20. Peter X. Gao, Akshay Narayan, Sagar Karandikar, Joao Carreira, Sangjin Han, Rachit Agarwal, Sylvia Ratnasamy, and Scott Shenker. Network Requirements for Resource Disaggregation. In OSDI '16.

Daniel S. Berger is a Senior Researcher in the Azure Systems Research group (AzSR) at Microsoft Azure. Daniel received his Ph.D. degree in computer science from TU Kaiserslautern.

Dan Ernst is a Principal Architect in the Leading Edge Architecture Pathfinding (LEAP) group at Microsoft Azure. Dan received his Ph.D. degree in computer science and engineering from the University of Michigan.

Huaicheng Li is an Assistant Professor in the Computer Science department at Virginia Tech. Huaicheng received his Ph.D. degree in computer science from the University of Chicago.

Pantea Zardoshti is a Research Software Development Engineer in the AzSR group at Microsoft Azure. Pantea received her Ph.D. degree in computer science from Lehigh University.

Monish Shah is a Senior Principal Hardware Engineer in the LEAP group at Microsoft Azure. Monish received his M.Sc. degree in electrical engineering from Stanford University.

Samir Rajadnya is a Principal Memory System Engineer in the LEAP group at Microsoft Azure. Samir received his M.Tech. degree in electrical engineering from IIT Bombay.

Scott Lee is a Principal Software Engineer Lead at Microsoft. Scott received his B.Sc. in computer engineering from the University of Washington.

Lisa Hsu is a Principal Architect at Microsoft Azure. Lisa received her Ph.D. degree in computer science from the University of Michigan.

Ishwar Agarwal is a Senior Principal Engineer at Intel Corporation. Ishwar received his M.Sc. in electrical and computer engineering from Georgia Tech.

Mark D. Hill is a Partner Architect and leads the LEAP group at Microsoft Azure. Mark received his Ph.D. degree in computer science from UC Berkeley and served 32 years at University of Wisconsin Computer Science.

Ricardo Bianchini is a Distinguished Engineer at Microsoft Azure. Ricardo received his Ph.D. degree in computer science from the University of Rochester.