
INTACT: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management
Pascal Vivet, Member, IEEE, Eric Guthmuller, Yvain Thonnart, Member, IEEE,
Gael Pillonnet, Senior Member, IEEE, César Fuguet, Ivan Miro-Panades, Member, IEEE,
Guillaume Moritz, Jean Durupt, Christian Bernard, Didier Varreau, Julian Pontes, Member, IEEE,
Sébastien Thuries, David Coriat, Michel Harrand, Member, IEEE, Denis Dutoit, Didier Lattard,
Lucile Arnaud, Jean Charbonnier, Member, IEEE, Perceval Coudrain, Arnaud Garnier,
Frédéric Berger, Alain Gueugnot, Alain Greiner, Quentin L. Meunier,
Alexis Farcy, Alexandre Arriordaz, Séverine Chéramy, Member, IEEE,
and Fabien Clermidy, Member, IEEE

Abstract— In the context of high-performance computing, the integration of more computing capabilities with generic cores or dedicated accelerators for artificial intelligence (AI) applications is raising more and more challenges. Due to the increasing costs of advanced nodes and the difficulties of shrinking analog and circuit input output signals (IOs), alternative architecture solutions to the single die are becoming mainstream. Chiplet-based systems using 3D technologies enable modular and scalable architecture and technology partitioning. Nevertheless, there are still limitations due to chiplet integration on passive interposers—silicon or organic. In this article, we present the first CMOS active interposer, integrating: 1) power management without any external components; 2) distributed interconnects enabling any chiplet-to-chiplet communication; and 3) system infrastructure, design-for-test, and circuit IOs. The INTACT circuit prototype integrates six chiplets in FDSOI 28-nm technology, which are 3D-stacked onto this active interposer in a 65-nm process, offering a total of 96 computing cores. Full scalability of the computing system is achieved using an innovative scalable cache-coherent memory hierarchy, enabled by distributed networks-on-chip, with 3-Tbit/s/mm² high-bandwidth 3D-Plug interfaces using 20-μm-pitch micro-bumps and 0.6-ns/mm low-latency asynchronous interconnects, while the six chiplets are locally power-supplied at 156 mW/mm² by 82%-peak-efficiency dc–dc converters through the active interposer. Thermal dissipation is studied, showing the feasibility of such an approach.

Index Terms— 3D technology, active interposer, chiplet, network-on-chip (NoC), power management, thermal dissipation.

I. INTRODUCTION

IN THE context of high-performance computing (HPC) and big-data applications, the quest for performance requires modular, scalable, energy-efficient, low-cost many-core systems. To address the demanding needs for computing power, system architects are continuously integrating more cores, more accelerators, and more memory in a given power envelope [1]. Similar needs and constraints are emerging for the embedded HPC domain, for instance in transport applications such as autonomous driving and avionics.

All these application domains require highly optimized and energy-efficient functions: generic ones such as cores, GPUs, embedded FPGAs, and dense and fast memories, and also more specialized ones, such as machine learning and neuro-accelerators, to efficiently implement the greedy computing demand of Big Data and artificial intelligence (AI) applications.

Circuit and system designers are in need of a more affordable, scalable, and efficient way of integrating those heterogeneous functions, to allow more reuse at circuit level, while focusing on the right innovations in a sustainable manner.
Due to the slowdown of advanced CMOS technologies (7 nm and below), with yield issues and design and mask costs, innovation and differentiation through a single-die solution are not viable anymore. Mixing heterogeneous technologies using 3D integration is a clear alternative [2], [3]. Partitioning the system into multiple chiplets 3D-stacked onto large-scale interposers—organic substrate [4], 2.5D passive interposer [5], or silicon bridge [6]—leads to large modular architectures and to cost reduction in advanced technologies using a Known Good Die (KGD) strategy and yield management.

Nevertheless, the current passive interposer solutions still lack flexible and efficient long-distance communication, smooth integration of chiplets with incompatible interfaces, and easy integration of less-scalable analog functions, such as power management and system input output signals (IOs). We present the first CMOS active interposer, measured on silicon, integrating power management and distributed interconnects, enabling an innovative scalable cache-coherent memory hierarchy. Six chiplets are 3D-stacked onto the active interposer, offering a total of 96 cores.

The outline of this article is as follows. Section II introduces the chiplet paradigm in more detail, with a state of the art on these technologies and the proposed concept of the active interposer. Section III presents an overview of the INTACT demonstrator architecture and 3D technology, while Sections IV–VIII detail the various sub-elements of the circuit: computing chiplet, power management, distributed interconnects, and testability. Section IX addresses the thermal issues. Finally, Sections X and XI present the final circuit results and conclusion.

II. CHIPLET AND ACTIVE INTERPOSER PRINCIPLE

A. Chiplet Partitioning: Concept and Challenges

Chiplet partitioning is raising new interest in the research community [7], in large research programs [8], and in industry [9]. It is actually an idea with a long history in the 3D technology field [2]. The concept of a chiplet is rather simple: divide circuits into modular sub-systems, in order to build a system in a LEGO-like approach, using advanced 3D technologies.

The motivation for chiplet-based partitioning is as follows.

- It is driven by cost. Due to increasing issues in advanced CMOS technologies (7 nm and below), achieving high yield on large dies at acceptable cost is no longer possible, while shrinking all the analog intellectual property blocks (IPs) (power management, fast IO SerDes, and so on) is becoming increasingly difficult. By dividing a system into various sub-modules, called chiplets, it is possible to yield larger systems at an acceptable cost, thanks to KGD sorting [10].
- It is driven by modularity. Through an elegant divide-and-conquer partitioning scheme, chipletization allows building modular systems from various building blocks and circuits, focusing more on functional aspects than on technology constraints. Circuit designers can deeply optimize each function: generic CPUs, optimized GPUs, embedded FPGAs, dedicated accelerators for machine learning, dense memory solutions, IO, and services, while the system designer picks the best combination to build a differentiated and optimized system.
- It is enabled by heterogeneous integration. For chiplets, the right technology is selected to implement the right function: advanced CMOS for computing, DRAM for memory cubes like high bandwidth memory (HBM) [11], non-volatile memory (NVM) technology for data processing within AI accelerators [12], and mature technology for analog functions (IOs, clocking, power management, and so on). Chiplet integration is then performed using advanced 3D technologies, which are getting more and more mature, with reduced pitches, using through-silicon vias (TSVs) and micro-bumps [5] or even more advanced die-to-wafer hybrid bonding technologies [3], [13].

Fig. 1. Chiplet partitioning concept.

To benefit from all these advantages and possibilities, there are nevertheless clear challenges for chiplets. The ecosystem needs to change progressively from IP-reuse to chiplet-reuse; this requires fundamental changes in the responsibilities of the various providers. These constraints are economical rather than technical, but they are strongly driving the technical choices. For system-level design, the simple LEGO cartoon (Fig. 1) needs adequate methodologies to address system modeling and cost modeling, and to perform technology and architecture partitioning while achieving an optimized system. A strong movement is building momentum toward the standardization of chiplet interfaces to enable this modularity between various vendors [14].

Finally, many circuit-level design issues arise: the design of energy-efficient chiplet interfaces, testability, power management and power distribution, and final system sign-off in terms of timing, power, layout, reliability, and thermal dissipation. To address these 3D design challenges, new CAD tools and associated design flows must be developed [49].

In this article, a partitioning using identical chiplets is proposed to scale out a large distributed computing system offering 96 cores, by using heterogeneous technologies. Many circuit design aspects are addressed in terms of chiplet interfaces, distributed interconnects, power management, testability, and associated CAD flows.

B. State of the Art on Interposers

In order to assemble the chiplets together, various technologies have been developed and are currently available in the industry (Fig. 2).
Fig. 2. State of the art on recent interposer and 3D technologies.

Firstly, the organic substrate is the lowest cost solution, while offering larger interconnect pitches (130 μm). This technology has been adopted by AMD for their EPYC processor family, with the first version with up to four chiplets [4] and a recent version with up to eight chiplets using a central IO die to distribute the system-level interconnects [15]. Passive interposers, also called 2.5D integration, as proposed for instance by TSMC CoWoS [5], enable more aggressive chip-to-chip interconnects and pitches (40 μm) but are still limited to "wire only" connections. A trade-off in terms of cost, pitches, and performances can be achieved by using a silicon bridge embedded within the organic substrate, as presented by INTEL with their EMIB bridge [6]. Finally, regular 3D stacking (for vertical assembly) may also be used, which is orthogonal and complementary to interposer approaches. INTEL has presented recent results with their Foveros technology and Lakefield processor [16].

All these solutions are promising and show clear benefits in terms of cost and performances. Nevertheless, various challenges still arise.

- Inter-chiplet communication is mostly limited to side-by-side communication, due to wire-only interposers. Longer range communication must rebound in the chiplets themselves, which is not scalable for building larger systems with numerous chiplets. The recent solution from AMD with their input output die (IOD) [15] partially solves these issues, with better communication distribution and easier IO integration, but may still not scale further in the long term.
- Current interposer solutions do not themselves integrate less-scalable functions, such as IOs, analog, and power management, close to the chiplets. The recent solution from INTEL with digital-on-top-of-analog partitioning solves this issue, but is still limited today to a single die [16].
- Finally, it is currently complex to integrate chiplets from different sources, due to missing standards, even if strong initiatives are ongoing [8], [14]. Wire-only interposers prevent the integration of chiplets using incompatible protocols, while an active interposer enables bridging them, as adopted by zGLUE Inc. [51].

In order to tackle all these issues, this article presents the concept of the Active Interposer, which integrates logic functions within the large interposer area. The concept has been introduced before, either as a low-cost and limited active-light solution for ESD integration [17] or with system-level architecture explorations showing the capability to scale to larger systems [19], [20]. Section II-C presents the active interposer concept, enabled by technology improvements [18].

C. Active Interposer Principle and Partitioning

Fig. 3. Active interposer concept and main features.

The proposed active interposer concept is detailed in Fig. 3. Chiplets can be either identical or different, for homogeneous functions as presented here, or differentiated functions, as presented in Fig. 1. Chiplets, implemented in an advanced CMOS technology, may themselves be composed of clusters of cores. Each chiplet can contain its own interconnects for intra-chip communication, which are extended in 3D for chiplet-to-chiplet communication. The CMOS interposer integrates a scalable and distributed network-on-chip (NoC), which offers the key capability of allowing any chiplet-to-chiplet traffic without interfering with unrelated chiplets. As a result, a hierarchical 3D NoC is obtained, with a 2D NoC within the chiplet and a 2D NoC within the active interposer, which can be further refined and differentiated according to traffic classes. Moreover, dense 3D interconnects enable high bandwidth density with parallel signaling. Such a communication scheme enables a fully modular and scalable cache-coherent architecture, for offering large many-cores [20], [28], [29].

In order to provide an efficient power supply to each chiplet, power management and the associated power converters can be directly implemented within the active interposer, to bring the power supply closer to the cores, for increased energy efficiency in the overall power distribution hierarchy, and allowing a dynamic voltage and frequency scaling (DVFS) scheme at the chiplet level. Moreover, all the less-scalable functions, such as analog IPs, clock generators, and circuit IOs with SerDes and PHYs for off-chip communication, as well as the regular system-on-chip infrastructure, such as low-performance IOs, test, debug, and so on, can also be implemented in the bottom die. Finally, additional features can be integrated into the active interposer to specialize it for a given application, enabling differentiation of the overall system. For instance, if incompatible chiplets are assembled, the active interposer can implement protocol bridges.

Due to the additional power budget within the interposer, the thermal challenge of 3D might increase. Nevertheless, since most of the power budget is within the chiplets, thermal dissipation issues are limited, as presented in Section IX.

Regarding technology partitioning, the active interposer should be implemented using a mature technology, with a low logic density to achieve high yield. A large logic density within a large interposer would lead, even using a mature technology, to an un-yieldable and costly system. A difference of at least two technology nodes between the computing chiplets and the interposer should lead to an acceptable cost, while allowing enough performance in the bottom die for the analog and PHYs to sustain the overall system performance, as done in [16].
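To make the hierarchical 3D-NoC idea concrete, the following sketch models the routing decision just described: intra-chiplet traffic stays on the chiplet's own 2D mesh, while inter-chiplet traffic descends through a 3D-Plug, crosses the interposer NoC, and re-ascends at the destination. The coordinate scheme and all names are illustrative assumptions, not details taken from the INTACT design.

```python
# Illustrative model of hierarchical 3D-NoC routing (hypothetical coordinates):
# a flit first travels on its chiplet's 2D mesh; only inter-chiplet traffic
# descends into the active-interposer NoC through a 3D-Plug.

from typing import List, Tuple

Node = Tuple[int, int, int]  # (chiplet_id, x, y) within a chiplet mesh

def route(src: Node, dst: Node) -> List[str]:
    hops = []
    if src[0] == dst[0]:                      # same chiplet: stay in top die
        hops.append(f"chiplet {src[0]} 2D mesh: {src[1:]} -> {dst[1:]}")
    else:                                     # cross-chiplet: use interposer
        hops.append(f"chiplet {src[0]} 2D mesh to local 3D-Plug")
        hops.append("3D-Plug down (micro-bump array)")
        hops.append(f"interposer 2D NoC: chiplet {src[0]} -> chiplet {dst[0]}")
        hops.append("3D-Plug up (micro-bump array)")
        hops.append(f"chiplet {dst[0]} 2D mesh to {dst[1:]}")
    return hops

print(route((0, 1, 1), (5, 0, 2)))
```

The benefit of the hierarchy is visible in the example: traffic from chiplet 0 to chiplet 5 crosses the interposer without ever entering the meshes of the unrelated chiplets.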
III. INTACT CIRCUIT ARCHITECTURE

The proposed active interposer concept is implemented within a large-scale circuit demonstrator, offering 96 cores, called INTACT for "Active Interposer."

Fig. 4. INTACT overall circuit architecture.

A. INTACT Circuit Architecture

INTACT is the first CMOS active interposer [21] integrating: 1) a switched-capacitor voltage regulator (SCVR) for on-chip power management; 2) flexible system interconnect topologies between all chiplets for scalable cache coherency support; 3) energy-efficient 3D-Plugs for dense inter-layer communication; and 4) a memory-IO controller and PHY for socket communication. Fig. 4 presents an overview of the overall many-core circuit architecture, with the chiplets, the distributed interconnects, the integrated power management, and the system infrastructure, which are detailed hereinafter.

Each chiplet is a 16-core sub-system composed of four computing clusters of four cores, integrating their own distributed coherent memory caches and their associated system-level interconnects. The chiplet architecture and associated memory hierarchy are presented in Section IV.

The chiplet interconnects are extended through the active interposer for chiplet-to-chiplet communication using distributed NoCs and various kinds of communication links, through so-called "3D-Plug" communication interfaces. For off-chip communication, the active interposer integrates a memory-IO controller and 4 × 32 bit 600-Mb/s bidirectional LVDS links offering a total of 19.2 GB/s of off-chip bandwidth. The communication IPs and overall communication architecture are presented in Sections VI and VII, respectively.
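As a quick sanity check on the quoted off-chip figure, the arithmetic below reproduces it, assuming the 19.2-GB/s total counts both directions of the bidirectional links:

```python
# Off-chip LVDS bandwidth, as quoted in Section III-A.
links, bits_per_link = 4, 32           # 4 x 32-bit LVDS links
rate_mbps = 600                        # per-pin data rate in Mb/s

per_direction_GBps = links * bits_per_link * rate_mbps / 8 / 1000
total_GBps = 2 * per_direction_GBps    # bidirectional: both directions counted

print(per_direction_GBps, total_GBps)  # 9.6 GB/s per direction, 19.2 GB/s total
```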
The active interposer integrates a power management IP for supplying each chiplet individually and offering on-demand energy-efficient power management, below each chiplet and surrounded by pipelined NoC links. The SCVR is presented in more detail in Section V.

Finally, the active interposer integrates some regular system-on-chip infrastructure elements such as clock and reset generation, thermal sensors, stress sensors, low-speed interfaces (UART, SPI) for debug and configuration, and design-for-test (DFT) logic for KGD sorting and final test. 3D testability challenges and the associated DFT are presented in more detail in Section VIII.

In conclusion, INTACT offers a large-scale cache-coherent many-core architecture, with a total of 96 cores in six chiplets (four cores × four clusters × six chiplets), which are 3D-stacked onto the active interposer.

B. INTACT Circuit Technology Partitioning

The 22-mm² chiplets are implemented in a 28-nm FDSOI CMOS node, while the 200-mm² active interposer uses a 65-nm CMOS node, which is a more mature technology. As presented in Section II-C, this technology partitioning exhibits a two-technology-node difference between the computing die and the active interposer. This enables enough performance in the bottom die for the interconnects, the analog parts, and the system IOs, while still allowing a yieldable large-scale active interposer.

Even though complex functions are integrated, the yield of the active interposer is high thanks to this mature 65-nm node and a reduced complexity (0.08 transistor/μm², see Section II-C), with 30% of the interposer area devoted to the SCVR variability-tolerant capacitor scheme. This technology partitioning leads to a practical and reachable circuit and system in terms of silicon cost using advanced 3D technologies (more details in terms of yield analysis can be found in [22]).

Fig. 5. INTACT: from concept to 3D cross section.

C. INTACT Physical Design and 3D Technology Parameters

For enabling system integration and allowing efficient chiplet-to-chiplet communication, an aggressive 3D technology has been developed and used. A summary of the respective chiplet, interposer, and 3D technologies is given in Table I.

As presented in Fig. 5 with the circuit 3D cross section, the six chiplets are 3D-stacked in a face-to-face configuration using 20-μm-pitch micro-bumps (μ-bumps) onto the active interposer (a 2× smaller pitch compared to the state of the art [23]). These dense chip-to-chip interconnects enable a high bandwidth density, up to 3 Tbit/s/mm² as detailed in Section VI-A, using parallel signaling through thousands of 3D signal interfaces. For bringing in power supplies and allowing off-chip communication, the active interposer integrates middle TSVs with a pitch of 40 μm, an aspect ratio of 1:10 (10-μm diameter for a silicon height of 100 μm), and a keep-out zone of 10 μm. Finally, the overall system is assembled onto a package organic substrate (ten layers), using C4 bumps with a pitch of 200 μm.
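The 3-Tbit/s/mm² bandwidth density follows directly from the μ-bump pitch and the per-pin data rate; a back-of-the-envelope check, assuming a fully populated 20-μm-pitch signal array and the ≈1.2-Gb/s/pin rate measured in Section VI-B:

```python
# Bandwidth density from micro-bump pitch and per-pin data rate.
pitch_um = 20.0
bumps_per_mm2 = (1000.0 / pitch_um) ** 2      # 2500 bumps per mm^2
rate_gbps_per_pin = 1.2                       # ~1.2 Gb/s/pin (Section VI-B)

density_tbps_per_mm2 = bumps_per_mm2 * rate_gbps_per_pin / 1000.0
print(density_tbps_per_mm2)                   # 3.0 Tbit/s/mm^2
```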
TABLE I
INTACT MAIN CIRCUIT FEATURES AND 3D TECHNOLOGY DETAILS

In terms of complexity, 150 000 3D connections are made using μ-bumps between the chiplets and the active interposer, with 20 000 connections for system communication, using the various 3D-Plugs, and 120 000 connections for power supplies using the SCVRs, while 14 000 TSVs are implemented for power supplies and off-chip communication. Due to the high level of complexity of the system, 3D assembly sign-off has been performed using the Mentor 3DStack CAD tool [50].

In Fig. 6, we present the respective floorplans of the chiplets and the active interposer. For the chiplet, one can see the four computing clusters and the associated L1/L2/L3 caches, while for the active interposer one can see the different SCVRs, which supply power to each individual chiplet, the distributed system interconnects, and the system IOs on the circuit periphery. Dense 3D connectivity is implemented in various locations of the circuit using the 3D-Plug interfaces and associated μ-bumps. Finally, the overall circuit has been packaged in a ball grid array (BGA) flip-chip package with ten layers. In addition, one can see the six chiplets on the package, before the final assembly with the cover lid and the package serigraphy.

Fig. 6. Chiplet and active interposer floorplans, details of the 3D-Plug μ-bumps, final 3D integration and package.

More details on the various 3D technology elements (μ-bumps, TSVs, RDL, and so on) and the 3D assembly steps, with in-depth technology characterization, can be found in [23].

IV. COMPUTING CHIPLET ARCHITECTURE

A. Chiplet Overview

The focus of this architecture is its scalability, so we chose to design homogeneous chiplets embedding both processors and caches [24]. With the current memory mapping, the architecture can be tiled up to 8 × 7 chiplets, the last 2D-mesh row being reserved for off-chip IO accesses, achieving a maximum number of 56 chiplets, for a hypothetical total of 896 cores. The last-level cache is large enough with respect to the computing power to release the pressure on external memory accesses.

Each chiplet is composed of four clusters of four 32-bit scalar cores (MIPS32v1-compatible ISA) as shown in Fig. 7. System interconnects are extended to 3D using synchronous and asynchronous so-called "3D-Plugs." Chiplets form a single fully cache-coherent architecture composed of the following elements: separate 16-kB L1 instruction cache (I-cache) and data cache (D-cache) per core with virtual memory support, a shared distributed L2-cache with 256 kB per cluster, and an adaptive distributed L3-cache with four L3 tiles (4 × 1 MB) per chiplet.

All clocks (cluster, L3 tile, and interconnect) are generated by 11 frequency-locked-loop (FLL) clock generators. To mitigate PVT variations, particularly across the dies in a 3D stack, we implement a timing-fault methodology for Fmax/Vmin tracking [30]. Finally, chiplets are tested using IEEE 1687 IJTAG, compressed full scan, memory built-in self-test (BIST), and boundary scan for the 3D IO test, to allow for KGD assembly, as explained in Section VIII.
Fig. 7. Chiplet architecture, offering a 16-core scalable coherent fabric.

B. Chiplet Interconnects and Their Extension

Four different system-level interconnects (N0 to N3) make up the system communication infrastructure, as shown in Fig. 7, three of which are extendable in 3D: (N0) within a cluster, an NoC crossbar allows communication between the four cores through the I- and D-caches and the network interface; (N1) between the L1 and L2 caches, a 5-channel 2D-mesh interconnect implements the coherency protocol and is routed in the interposer through passive links (two on each side); (N2) between the L2 and L3 tiles, a 2-channel 2D/3D-mesh interconnect; and (N3) between the L3 caches and off-chip DRAM memory, a 2-channel 2D/3D-mesh interconnect. The (N1) 2D-mesh is fully extended to the other chiplets for maximum throughput and lowest short-reach latency, as shown in Section VII. Peripherals are also connected to this network, which conveys IO requests. The (N2) and (N3) networks implement a hierarchical routing scheme where a single router among the four of the chiplet 2D-mesh is used to reach the active interposer. This architecture reduces the 3D footprint for the N2 and N3 networks, which are less bandwidth demanding. Using asynchronous logic for the N2 3D-Plug allows for low-latency L2-to-L3 communications.

C. System Memory Mapping

Fig. 8. Memory mapping and cache allocation.

The memory hierarchy is distributed and adaptive (Fig. 8): the 1-TB memory space is physically distributed among the L2-caches accessed through the (N1) network. Cluster coordinates in the 2D-mesh are encoded in the eight most significant bits of the address, forming a nonuniform memory architecture (NUMA), as done in [48]. Due to the X-first routing scheme of the (N1) network, access to the IO controllers located in the external FPGA is done through the North port of the (X = 3, Y = 5) router found in the upmost-right (X = 1, Y = 2) chiplet. Thus, these IOs are mapped at the [0x3600000000: +40 GB] memory segment.
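As an illustration of this NUMA decoding, the sketch below extracts home-cluster coordinates from a 40-bit physical address. The even 4-bit/4-bit split of the eight most significant bits between X and Y is a hypothetical field layout chosen for the example, not the documented encoding.

```python
# Decode the home L2 cluster from a 40-bit physical address (illustrative).
# The 8 MSBs carry the 2D-mesh cluster coordinates; the 4/4 X/Y split below
# is an assumed field layout.

def home_cluster(paddr: int):
    msb8 = (paddr >> 32) & 0xFF     # top 8 bits of the 40-bit address
    x, y = msb8 >> 4, msb8 & 0xF    # assumed 4-bit X / 4-bit Y encoding
    return x, y

# The IO segment quoted in Section IV-C starts at 0x3600000000.
print(hex(0x3600000000), home_cluster(0x3600000000))
```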
D. Cache Coherency Scalability

1) L1 Caches: Each core has a private L1 cache that implements a Harvard architecture with separate four-way 16-kB cache memories for instructions (I) and data (D) with 64-B cache lines. L1 D-caches are write-through and implement a fully associative write buffer composed of eight 128-B entries flushed either on explicit synchronization or on expiration of an 8-cycle timer. L1 I/D-caches include a memory management unit (MMU), which consists of two per-core private, fully associative, 16-entry translation lookaside buffers (TLBs) for instructions and data. MMUs with coherent TLBs translate the 32-bit virtual address space (4 GB) in the processor cores onto the 40-bit physical address space (1 TB) mapped as shown in Fig. 8. The hardware guarantees the coherency of both L1 I/D-caches and both I/D-TLBs (see Section IV-D2).

As mentioned previously, the processor implements a NUMA memory hierarchy, and the most significant bits of the physical address designate the coordinates of the L2 cache containing the data. To improve performance, the operating system (OS) needs to map data and code close to the cores that use them. The OS can do that through the virtual memory subsystem by the mapping of physical memory pages. To assist the OS in this task, our architecture implements two hint bits in the page table: the local (L) and remote (R) bits. They are automatically set by the MMUs and signal whether a given physical memory page has been accessed by a core in the same (local) or a different (remote) cluster than the L2 cache that hosts the cache lines in that page, also called the "home node." For instance, pages with the R bit set but the L bit unset are candidates for migration.

2) L2 Caches: L2 caches are 16-way 256-kB set-associative write-back caches handling 64-B cache lines. The scalable cache coherence protocol exploits the fact that shared data are mostly either sparsely shared read-write or widely shared read-only [25], [27]. Thus, L2 caches maintain L1-cache, TLB, and IO coherence using a directory-based hybrid cache coherence protocol (DHCCP) based on write-through L1 caches. L2-cache lines have two states: a list mode, where coherence is maintained by issuing multicast updates/invalidates, and a counter mode, where only broadcast invalidates are sent for the line. In list mode, the sharers' set of the line is stored as a linked list: the first sharer ID is in the L2 directory and the following ones in another memory bank (heap). When a core writes to such a line, the respective home L2 cache sends update messages to the sharers, thus keeping their copies up to date.
When the number of sharers reaches a predefined threshold (4 in our implementation) or if the heap is full (4096 entries in our implementation), the cache line is put in counter mode, where only the sharers' count is stored and broadcast invalidates are issued. The (N1) 2D-mesh and (N0) crossbar NoCs provide hardware support for broadcast, and only the L1 sharers of the line answer the broadcast, thus limiting the impact of broadcasts on scalability.

This hybrid sharing-set representation efficiently handles both main categories of shared data [26]. The update messages associated with the write-through policy also mitigate false-sharing problems. The L2-cache coherence directory represents only 2% of the die area with 15-bit core/cache IDs, showing the scalability of the cache coherence protocol. Section X-C shows scalability results for up to 512 cores.
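The following behavioral sketch summarizes the hybrid directory logic described above: exact sharer tracking in list mode up to the threshold, then a fall-back to counter mode with broadcast invalidates. The class and method names are ours, and the real DHCCP handles many more cases (heap allocation, TLB and IO coherence, line eviction):

```python
# Simplified DHCCP directory entry: exact multicast while the sharer list is
# small, degrade to a sharer count plus broadcast invalidates beyond that.

LIST_THRESHOLD = 4      # sharers tracked exactly before degrading (paper: 4)
HEAP_ENTRIES = 4096     # capacity of the shared linked-list heap (paper: 4096)

class DirectoryEntry:
    def __init__(self, heap_used=0):
        self.mode, self.sharers, self.count = "list", set(), 0
        self.heap_used = heap_used            # global heap occupancy stand-in

    def add_sharer(self, core_id):
        if self.mode == "counter":
            self.count += 1
            return
        self.sharers.add(core_id)
        if len(self.sharers) >= LIST_THRESHOLD or self.heap_used >= HEAP_ENTRIES:
            self.mode, self.count = "counter", len(self.sharers)
            self.sharers.clear()              # free directory/heap storage

    def on_write(self):
        if self.mode == "list":               # exact multicast updates
            return [("update", c) for c in sorted(self.sharers)]
        return [("broadcast_invalidate", self.count)]

entry = DirectoryEntry()
for core in range(3):
    entry.add_sharer(core)
print(entry.mode, entry.on_write())   # list mode: targeted updates
entry.add_sharer(3)                   # fourth sharer reaches the threshold
print(entry.mode, entry.on_write())   # counter mode: broadcast invalidate
```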
3) L3 Caches: L3-cache tiles are 16-way 1-MB set-associative write-back caches handling 128-B cache lines with one dirty bit per 64-B block. Tiles are dynamically allocated to L2-caches by software, forming a nonuniform cache architecture (NUCA), as presented in Fig. 8. In the case of L3 cache overlap, a shared tile behaves as a shared cache: more space is allocated to the most demanding memory segment. By overlapping L3 caches, the L3 cache controller located at the output of each L2 cache offers L3 fault-tolerant adaptive repair. The controller uses a list of defective tiles to redirect traffic initially directed at these tiles to known working tiles. More detail on the L3 micro-architecture and performance can be found in [28] and [29].

V. INTEGRATED VOLTAGE REGULATOR (VR)

A. Principle and 3D-Stacking

Granular power delivery is a key feature to improve the overall energy efficiency of multi-core processors [31]. To allow per-chiplet DVFS, fast transitions, and mitigation of IR-drop effects, six integrated VRs have been included in the interposer layer, which individually supply each chiplet with Vcore from Vin as shown in Fig. 9. The power is delivered through the μ-bump flip-chip matrix. The Vin voltage is delivered from the interposer back face through a 40-μm-pitch TSV array. The VRs are fully integrated into the active interposer without needing any external component.

Fig. 9. SCVR cross section.

The typical input voltage range of Vin is 1.5–2.5 V, to reduce the delivered current Iin from the TSVs, package, and motherboard. Thus, the number of package power IOs can be reduced compared to a direct Vcore distribution from an external VR. The power distribution network loss is also reduced.

B. Circuit Design

The SCVR has been chosen thanks to its full integration capability [31]–[35]. The chosen topology is a parallel–series three-stage gearbox scheme to cover a large Vout range while maintaining power efficiency (Fig. 10). The SCVR thus generates seven lossless voltage ratios from 4:1 to 4:3. From a 1.8-V Vin, the SCVR provides from 0.35 to 1.35 V, which covers the low-to-high chiplet power modes. The gearbox scheme is interleaved into ten phases to reduce the Vcore ripple and to increase the control bandwidth. The number of interleaved phases is also chosen to maintain power efficiency at low voltage levels where the required chiplet power drops off. The feedback control is based on the one-cycle hysteresis controller proposed in [34]. The voltage controller is centralized and sequences the charge injection in the interleaved converters at each clock cycle. The clock generation and controller are integrated on-chip.

Fig. 10. SCVR unit-cell schematics and hierarchy.
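To illustrate how a multi-ratio gearbox maintains efficiency across the output range, here is a sketch that picks the lossless ratio best suited to a target output voltage, using the ideal switched-capacitor efficiency bound eta ≈ Vout/(r · Vin). The paper only states that there are seven lossless ratios from 4:1 to 4:3; the intermediate ratio set below is an assumption for illustration.

```python
from fractions import Fraction

# Hypothetical set of seven lossless ratios spanning 4:1 to 4:3 (1/4 .. 3/4);
# the exact intermediate ratios of the silicon are not listed in the paper.
RATIOS = [Fraction(1, 4), Fraction(1, 3), Fraction(2, 5), Fraction(1, 2),
          Fraction(3, 5), Fraction(2, 3), Fraction(3, 4)]

def best_ratio(vin, vout):
    # Ideal SC converter: eta <= vout / (r * vin), feasible when r * vin >= vout.
    usable = [r for r in RATIOS if float(r) * vin >= vout]
    r = min(usable, key=lambda q: float(q) * vin)   # least headroom wins
    return r, vout / (float(r) * vin)

for vout in (0.5, 1.0, 1.35):      # chiplet supply targets from a 1.8-V input
    r, eta = best_ratio(1.8, vout)
    print(f"Vout={vout} V -> ratio {r}, ideal efficiency {eta:.2f}")
```

The denser the ratio set, the smaller the worst-case headroom and the higher the achievable efficiency at any operating point, which is why a seven-ratio gearbox outperforms a single-ratio converter over a wide DVFS range.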
C. Physical Design on Interposer

As shown in Fig. 11, each SCVR occupies 50% of the chiplet footprint (11.4 mm²) and is composed of 270 regular unit cells (UCs), with a 0.2-mm pitch, in a checkerboard pattern. The I/O device transistors may operate on an up-to-3.3-V input voltage. A MOS–MOM–MIM capacitor stack maximizes the capacitance density (8.9 nF/mm²), with a 102-nF flying capacitor per SCVR. To deal with potential process defects on the large area of the interposer, a fault-tolerant protocol is also included to mitigate the effect of defective unit cells on the overall power efficiency.

Fig. 11. SCVR layout on the interposer.
D. Experimental Results

As shown in Fig. 12(a), the SCVR achieves a measured power density of 156 mW/mm² at its 82% peak efficiency and of 351 mW/mm² at an LDO-like efficiency, respectively. The SCVR maintains more than 60% efficiency from 0.45 to 1.35 V, covering the full supply voltage range of the chiplet [Fig. 12(c)]. The SCVR delivers up to 5-A output current while maintaining higher efficiency than an LDO [Fig. 12(b)]. The peak power efficiency is relatively constant across the typical Vin range. As shown in Fig. 12(d), the feedback control achieves less than 10-ns step response for a middle-to-zero load transient (0.8 to 0 A), while the full load is defined at peak efficiency (1.2 A).

Fig. 12. SCVR experimental results. (a) Power efficiency versus voltage conversion ratio and gearbox configurations. (b) Efficiency over output current. (c) Efficiency versus input voltage at 2:1 ratio. (d) Load transient.

Table II compares the 3D-stacked SCVR to some previously published SCVRs in a 2D context. The proposed SCVR exhibits the highest number of lossless ratios and the highest delivered power with a commonly available capacitor, to enable widely spread use. Even if the SCVR is affected by the TSV grid, the power density is comparable to other wide-range SCVRs. 3D integration of the SCVR on the interposer minimizes the system area and cost, with no impact on the chiplets.

TABLE II
COMPARISON WITH COMPARABLE SCVRS USING MOS OR MIM CAPACITOR TECHNOLOGY

E. Discussion

Since the power efficiency obtained by the integrated VR is lower than that of external dc-dc converters, the overall power efficiency of the computing system can still improve by allowing fine-grain DVFS without increasing the bill-of-materials (BoM) and IO counts. The power density is smaller than previously published results, but the converters are fully integrated within the active interposer, not on the same die, thus reducing the cost impact of the proposed active power mesh. The interposer integration opens the opportunity for dedicated post-process high-density capacitors (e.g., deep trench capacitors) connected through TSVs. We also prove the up-scaling capability of the SCVR by fabricating the largest-die-area SCVR, with a built-in capacitor fault-mitigation scheme.

VI. 3D-PLUG COMMUNICATION INTERFACES

A. 3D-Plug Features Overview

As presented in Section IV-B, the different chiplet system-level interconnects are extended throughout the active interposer, by using generic chiplet–interposer interfaces, called 3D-Plugs.

Each 3D-Plug integrates both the logical and physical interfaces. As presented in Fig. 13, it contains the micro-bump array, the micro-buffer cells (bi-directional driver with ESD protection, level shifter, and pull-up), and boundary-scan logic for DFT. A bi-directional driver is used to allow testability of the interface before assembly (see Section VIII). This 3D interface is very similar to the 3D-NoC interface presented earlier in [36]. Due to the 28-/65-nm technology partitioning, the micro-buffer cell also requires in this case a level shifter in order to bridge the voltage domains between the chiplet (typically 1.0 V) and the active interposer (1.2 V).

Fig. 13. 3D-Plug physical and logical interface overview.

In terms of physical design, the different 3D IOs of each 3D-Plug have been created and assigned in an array fashion, while the micro-buffer cell has been designed as a standard cell and pre-placed within the pitch of the micro-bumps. All the other parts of the 3D-Plug (their logical interface and DFT) have been designed using automated place and route.
In order to build the system-level interconnects of INTACT, different kinds of 3D-Plugs have been designed, as presented in Table III.

TABLE III
3D-PLUG TYPES AND USAGE IN INTACT

Due to the different natures of the interconnects, in terms of traffic and distance/connectivity, two different kinds of 3D-Plugs have been designed and compared in detail: one using synchronous design, as presented in Section VI-B, and one using asynchronous design, as presented in Section VI-C.

B. 3D-Plug Synchronous Version

Fig. 14. (a) Synchronous 3D-Plug micro-architecture. (b) Comparison to state of the art.

The microarchitecture of the source-synchronous 3D-Plugs used for the 2.5D passive (N1 NoC) and 3D face-to-face links (N3 NoC) is shown in Fig. 14(a). Implemented as a standard synthesizable digital design, 3D-Plugs provide multiple virtual channels (VCs), the number of which is configured at design time. They use credit-based flow control and clock forwarding. The 3D-Plug control logic operates at a higher frequency than the NoCs to reduce contention due to VC multiplexing. Delay lines and polarity selectors are used to skew the TX clock for RX data sampling (CLK_TX_1) and TX credit sampling (CLK_TX_2).
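A minimal model of the credit-based flow control used by the 3D-Plugs, reduced to a single virtual channel with hypothetical names (the silicon multiplexes several VCs at a higher control-logic frequency):

```python
from collections import deque

# Credit-based link: the sender holds one credit per free slot in the
# receiver's FIFO; credits return when the receiver drains a flit.

class CreditLink:
    def __init__(self, fifo_depth: int):
        self.credits = fifo_depth       # TX-side credit counter
        self.rx_fifo = deque()

    def send(self, flit) -> bool:
        if self.credits == 0:           # would overflow the RX FIFO: stall
            return False
        self.credits -= 1
        self.rx_fifo.append(flit)       # crosses the micro-bump array
        return True

    def drain(self):
        flit = self.rx_fifo.popleft()
        self.credits += 1               # credit travels back to the sender
        return flit

link = CreditLink(fifo_depth=2)
print([link.send(f) for f in "abc"])   # [True, True, False]: back-pressure
link.drain()
print(link.send("c"))                  # True again once a credit returned
```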
When attached to the 3D vertical active link, the 3D-Plug achieves a 3-Tb/s/mm² bandwidth density, 1.9× higher than [5]. The 2.5D passive links reach a 12% higher bandwidth cross section than [5], as shown in Fig. 14(b). The aggregate synchronous 3D/2.5D link bandwidth is 527 GB/s.

We performed a frequency, logic voltage, and clock phase sweep on the synchronous 2.5D/3D links. All 2.5D passive links were able to reach at least 1.25 Gb/s/pin in the [0.95 V–1.2 V] VDD range, and the best link, shown in Fig. 15, was able to reach this bandwidth at 1 V, while reaching more than 1.6 Gb/s/pin at 1.2 V. We obtained the best results with a 180° CLK_TX_1 phase and a varying CLK_TX_2 phase depending on frequency. While much shorter than the passive links, the 3D vertical links achieve slightly lower data rates of 1.21 Gb/s/pin upward and 1.23 Gb/s/pin downward, as one side of these 3D-Plugs is implemented in the more mature and slower 65-nm technology of the interposer.

Fig. 15. Synchronous 3D-Plug max data rate for 2.5D passive links.

C. 3D-Plug Asynchronous Version

For its inherent robustness to any source of timing variations and its low latency [37], asynchronous logic is well adapted for designing system-level interconnects and NoC architectures in a globally asynchronous locally synchronous (GALS) scheme. In the context of 3D architectures, asynchronous logic and its local handshakes enable interfacing two different dies without any clocking issues. Using robust quasi-delay-insensitive (QDI) logic, an asynchronous 3D NoC was presented earlier in [36], but it presents some 3D throughput limitations due to its four-phase handshake protocol.

For INTACT, an innovative 3D-Plug interface has been designed to benefit from a two-phase handshake protocol at the 3D interface, which reduces the penalty of the 3D interface delay within the interface cycle time, and thus increases the 3D interface throughput.

Fig. 16. 3D-Plug asynchronous version overview, composed of protocol converters between the on-chip communication and the 3D interface.

As introduced in [38], the principle is as follows (Fig. 16).

- Use an asynchronous two-phase protocol for 3D interface communication, to reduce the 3D interface delay penalty.
- Use an asynchronous four-phase protocol for on-chip communication, within the active interposer, for its inherent simplicity, low latency, and performance [37].
- Introduce protocol converters, from the two-phase to the four-phase protocol and back, using an ad hoc asynchronous logic encoding.

A recent overview of asynchronous logic and signaling can be found in [39]. For implementing a low-cost protocol converter, a two-phase 1T-of-N multi-rail transition-based signaling is used [38], with N = 4 (4-rail encoded, thus four wires for two bits). In this encoding and two-phase protocol, one single transition on Rail_i indicates the value i, which is then acknowledged by a transition on the feedback path. This encoding is close to the 1-of-N on-chip protocol, which leads to the corresponding protocol converters, shown in Fig. 17.

Fig. 17. 3D-Plug asynchronous version details, composed of (a) four-phase to two-phase protocol converter and (b) two-phase to four-phase protocol converter.

Similar to the synchronous 3D-Plug interface, the protocol converter also integrates all the 3D objects: micro-bumps, micro-buffers, and boundary scan (Fig. 16). Finally, since the same number of wires is used for both protocols, a bypass mode of the protocol converters is added, configuring the 3D-Plug interface either in two-phase mode or in four-phase mode, for circuit measurements.
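The 1T-of-N encoding is easy to make concrete: for N = 4, each 2-bit symbol toggles exactly one of four rails, and the receiver decodes a symbol by spotting which rail changed between two snapshots. A minimal sketch with illustrative names:

```python
# Two-phase 1T-of-4 signaling: one transition per 2-bit symbol.
# TX toggles rails[value]; RX compares consecutive rail snapshots to find it.

def tx_symbol(rails: list, value: int) -> list:
    assert 0 <= value < 4
    out = rails[:]
    out[value] ^= 1           # a single transition encodes the 2-bit value
    return out

def rx_symbol(prev: list, curr: list) -> int:
    changed = [i for i in range(4) if prev[i] != curr[i]]
    assert len(changed) == 1  # exactly one rail toggles per symbol
    return changed[0]

rails = [0, 0, 0, 0]
for value in (2, 0, 3):       # send the 2-bit symbols 10, 00, 11
    nxt = tx_symbol(rails, value)
    print(value, rx_symbol(rails, nxt))
    rails = nxt
```

Because only one wire toggles per symbol, the 3D interface pays a single micro-bump crossing per transfer instead of the set-then-reset round trip of a four-phase handshake, which is where the throughput gain comes from.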
VII. ACTIVE INTERPOSER SYSTEM INTERCONNECTS

A. Overview

Different kinds of system interconnects have been implemented between the chiplets on the interposer, using the 3D-Plugs described in Section VI. These interconnects are used to transport the different levels of cache coherence in the memory hierarchy. As discussed earlier, a mix of synchronous and asynchronous implementations was used, depending on the latency and power targets.

The structure of the different interconnects is shown in Fig. 18, with clock-domain crossings, conversion interfaces, pipelining, and routers. These three interconnects are detailed in the next paragraphs. To assess their performance, on-chip traffic generators and probes were inserted in the chiplet NoCs, for throughput and latency measurements.

Fig. 18. INTACT system interconnect structure on the longest path, using different technologies for different traffic classes. (a) L1–L2 cache interconnect with passive nearest-neighbor links. (b) L2–L3 cache interconnect with QDI asynchronous routing. (c) L3–EXT-MEM interconnect with global synchronous routing.

B. L1–L2 Cache Interconnect

Fig. 18(a) presents the first level of cache interconnect, between the local L1 and distributed L2 caches (N1 NoC). As this first level of cache traffic is intended to be localized using an adequate application mapping, most of the traffic is expected to be exchanged between neighboring chiplets. Aside from the clock-domain crossing between the two chiplets using synchronous 3D-Plugs, no other adaptation is required, and routing is entirely performed within the chiplets. Therefore, only passive metal wires are used on the interposer to connect the micro-bumps of neighboring chiplets.

The physical design of these interposer passive links was optimized to reduce delay and crosstalk between the nets. A dedicated routing scheme on two levels of metal was used (M3–M5 horizontal, M2–M4 vertical), with trace widths of 300 nm and a spacing of 1.1 μm. Additional shielding was used for the clock nets running at twice the data rate. Crossings with minimum-width unrelated wires on the interposer showed very little impact on crosstalk or delays in the signal, and were therefore allowed on the other interposer metal levels.

A point-to-point connection between two adjacent 3D-Plugs was measured at 1.25 GHz, with a latency of 7.2 ns. Most of this latency is due to the clock-domain crossings in the 3D-Plugs.

For large applications, nevertheless, the L1–L2 cache coherence traffic needs to extend farther than between adjacent chiplets. In that case, pipelining and routing are handled by the intermediate chiplets. The main advantage in this case is that this is done using the advanced technology node of the chiplets, which has better performance and lower power consumption than the interposer does. However, the major drawback is the accumulation of pipeline and clock-domain crossings, which adds extra latency for distant L1–L2 traffic.

The 2D NoC in the chiplet runs at 1 GHz, but the one-way latency from the source 3D-Plug to the destination 3D-Plug can be as high as 44 cycles on the longest path, from chiplet 00 to chiplet 12, with two intermediate chiplets, five routers, and eight to ten FIFO stages between routers. Nevertheless, this solution is very energy efficient, with only 0.15 pJ/bit/mm.
TABLE IV
COMPARATIVE PERFORMANCE OF SYSTEM INTERCONNECTS IN INTACT

C. L2–L3 Cache Interconnect

Fig. 18(b) presents the second level of cache interconnect, between the distributed L2 and L3 caches (N2 NoC). The main performance target is in this case to offer low-latency, long-reach communication. For this purpose, it was chosen to implement it in fully asynchronous logic on the interposer, using the ANoC QDI NoC [37]. This allows for only two synchronous/asynchronous conversions on an end-to-end path, to save on clock-domain-crossing latency. Deep pipelining on the ANoC allows inserting an asynchronous pipeline stage every 500 μm to preserve throughput, with almost no impact on the latency compared to inverter-based buffering.

The asynchronous 3D-Plug in two-phase mode allows an injection rate in the network, for 72-bit data words, of up to 520 MHz, while the 2D NoC is able to sustain up to 0.97 GHz on every link, which limits the in-network contention of overlapping network paths. The efficient asynchronous pipelining allows an end-to-end latency between the synchronous interfaces of the 3D-Plugs of only 15.2 ns, with four clock cycles and 11.2 ns of asynchronous latency across four routers and 25 mm of pipelined links.
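Decomposing the quoted end-to-end figure, and assuming the four synchronous cycles run at the ≈1-GHz chiplet clock mentioned in Section VII-B:

```python
# End-to-end L2-L3 latency budget on the longest measured path (Section VII-C).
sync_cycles, clk_ns = 4, 1.0           # clock-domain crossings, ~1 GHz assumed
async_ns, distance_mm = 11.2, 25.0     # QDI pipeline across four routers

total_ns = sync_cycles * clk_ns + async_ns
print(total_ns, async_ns / distance_mm)  # 15.2 ns total, ~0.45 ns/mm in flight
```

The resulting ≈0.45 ns/mm of in-flight latency is consistent with the 0.6-ns/mm figure quoted for the asynchronous interconnect in Table IV once interface overheads are included.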
D. L3–External-Memory Interconnect

Fig. 18(c) presents the last interconnect, between the distributed L3 caches and the external memory (N3 NoC). Considering the intrinsic contention of this last level of cache traffic, and the longer latency of paginated accesses to the external memory, the focus was put on energy efficiency first, then on low latency. This interconnect is implemented as a global synchronous NoC, with clock-domain crossings at the source 3D-Plug and in the memory IO interface. Two-stage FIFOs are inserted every 1 mm, and tight clock-tree balancing was performed to increase the throughput. This results in a 72-bit synchronous network running up to 750 MHz, with a latency of 2 ns/mm, for a good energy efficiency of 0.24 pJ/bit/mm.

E. System Interconnect Comparison and Conclusion

Table IV summarizes the different figures of merit for the three interconnects and provides a benchmark with respect to the 3D NoC in [36]. It shows that neighboring connections can be efficiently made using the synchronous 3D-Plug in an advanced technology node, with a high throughput and a low power consumption. For longer-range communication, limiting the number of clock-domain crossings is key for performance. The NoCs in the active interposer can provide wide interconnects optimized for latency in the asynchronous version, with 0.6 ns/mm, or for power consumption in the synchronous version, with 0.24 pJ/bit/mm, with performance metrics twice as good as [36] in the same 65-nm technology node as the active interposer.

The achieved low-level interconnect performances could be used for a more systematic system-level study, such as [19], by trading off the different traffic classes, latency, and energy, thanks to the extended traffic capabilities of the active interposer.

VIII. ACTIVE INTERPOSER TESTABILITY AND 3D DFT ARCHITECTURE

A. Testability Challenges

With such a 3D active interposer, testability raises various challenges. First, it is required to ensure KGD sorting to achieve a high system yield [10]. This implies that the 3D test architecture must enable EWS testing of the chiplet and the interposer (pre-bond test, before 3D assembly) and final testing (post-bond, after 3D assembly in the circuit package). Moreover, due to the fine-pitch μ-bumps, reduced test access is observed: μ-bumps cannot be directly probed in test mode. This implies including additional IO pads, which are used only for test purposes and not in functional mode (see Fig. 19).

Fig. 19. Chiplet layout (zoom), with the 3D-Plug interface and additional test pads.

Finally, with 3D technologies, additional defects may be encountered, such as μ-bump misalignments, TSV pinholes, shorts, and so on, which lead to specific care for testing the 3D objects and interfaces. Another concern regards the automatic test pattern generation (ATPG) engineering effort, where easy retargeting of test patterns from the pre-bond test to the post-bond test should be proposed to reduce test development efforts.
Numerous researchers have addressed specific test solutions for 3D defects; see for instance [40] and [41] for testing generic 3D architectures using die wrappers and elevators [42], and for testing 2.5D passive interposers [43]. A standardization initiative on 3D testability has emerged with the recent IEEE 1838 standard [44]. Nevertheless, no prior work addressed the testability of active interposers.

B. 3D DFT Architecture

Within the INTACT architecture, the test of the 3D system must address the test of all the following elements: 1) the regular standard-cell-based logic; 2) all memories, using BIST engines and repair; 3) the distributed 3D interconnects and IOs, that is, the 3D connections of the active links and passive links, which are implemented by micro-bumps; and finally 4) the regular package IO pads for off-chip communication through the TSVs.

Fig. 20. 3D design-for-test architecture for INTACT, overview and details.

In order to test the active interposer and its associated chiplets, the proposed 3D DFT architecture (Fig. 20) is based on the two following main test access mechanisms (TAMs), as proposed earlier in [45].

- An IJTAG IEEE 1687 hierarchical and configurable chain, accessed by a primary JTAG TAP port, for testing all the interconnects and memories, based on the concept of the "chiplet footprint."
- A full-scan logic network using compression logic, for reducing the test time and the number of test IOs.

By using IJTAG IEEE 1687, the JTAG chain is hierarchical and fully configurable: the JTAG chain provides dynamic access to any embedded test engine. The active interposer JTAG chain is designed similarly to a chain of TAPs on a PCB board. It is composed of "chiplet footprints," which provide either access to the 3D-stacked chiplet above or to the next chiplet interface, and which are chained serially. The JTAG network is used to test and control the 3D active links, the 3D passive links, the off-chip interfaces, and the embedded test engines, such as the memory BISTs.

The full-scan logic network offers an efficient and parallel full-scan test of the whole 3D system logic. In order to reduce the number of 3D parallel ports, compression logic is used in both the chiplets and the active interposer, with a classical tradeoff (shift time/pin count). Independent scan paths are used between the chiplets and the active interposer, to facilitate the test architecture integration.
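The "chiplet footprint" chain can be pictured as serially chained segments, each of which either routes the scan path up into its 3D-stacked chiplet or bypasses it. The sketch below is a simplified behavioral model with invented names; it is not the IEEE 1687 ICL description of the actual chain.

```python
# Serial chain of "chiplet footprints" on the interposer: each segment either
# includes its stacked chiplet's TAP in the scan path or bypasses it.

class ChipletFootprint:
    def __init__(self, name: str, include_chiplet: bool):
        self.name, self.include_chiplet = name, include_chiplet

    def scan_path(self) -> list:
        hop = [f"{self.name}:chiplet_TAP"] if self.include_chiplet else []
        return [f"{self.name}:footprint"] + hop

def interposer_chain(config: dict) -> list:
    path = ["primary_TAP"]
    for name, enabled in config.items():   # six footprints chained serially
        path += ChipletFootprint(name, enabled).scan_path()
    return path

# Test only chiplets 0 and 3, bypassing the others (pre/post-bond flexibility).
print(interposer_chain({f"chiplet{i}": i in (0, 3) for i in range(6)}))
```

The same structure works before bonding (every footprint bypassed, testing the bare interposer) and after bonding (footprints selectively opened toward each assembled chiplet), which is what enables test-pattern retargeting.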
C. Test CAD Flow and Test Coverage

The proposed 3D DFT architecture has been designed and inserted using the Tessent tools from Mentor, a Siemens Business, Montbonnot, France. By using IJTAG and IEEE 1687, high-level languages such as the "Instrument Connectivity Language" (ICL) and the "Procedural Description Language" (PDL) are provided and make it possible to handle the complexity of such a system. In particular, it is possible to fully automate the test pattern generation of the memory BIST engines, from ATPG at the chiplet level to ATPG of the same patterns within the full 3D system, enabling so-called test pattern retargeting.

TABLE V
INTACT DFT RESULTS

As presented in Table V, full testability is achieved for all the logic, the 3D interconnects and regular package IOs, and the memory BIST engines, both before and after 3D assembly. Using the proposed DFT architecture and test patterns, the full system was tested using automated test equipment (ATE).

- The 28-nm chiplet has been tested at wafer level using a dedicated probe card, with a binning strategy.
- The active interposer has not been tested at wafer level, assuming the maturity of the 65-nm technology and its high yield due to its low complexity (see Section III-B). Nevertheless, its standalone DFT and dedicated IOs were initially planned and designed, as mentioned above.
- The full INTACT circuit, after 3D assembly and packaging, has been tested within a dedicated package socket.

IX. THERMAL CHALLENGES AND STRATEGY

A. Thermal Challenges

In 3D technology, thermal dissipation is a challenge that needs to be properly addressed. Due to more integration in a smaller volume, a larger power density is usually observed in 3D, while the thermal dissipation itself becomes more complex in the overall 3D stack of the circuit and package, leading to thermal hotspots or even thermal runaway [52]. In the generic context of logic-on-logic stacking, thermal dissipation is worse because multiple layers of compute dies need to dissipate their heat on top of each other.
VIVET et al.: INTA CT: A 96-CORE PROCESSOR WITH SIX CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER 91

Fig. 23. (a) Package temperature (without heat sink). Peak temperature = ∼150 ◦ C. (b) Package temperature (with heat sink and fan). Peak temperature =
53 ◦ C. (c) Chiplet thermal map. Peak temperature = ∼53 ◦ C.

Fig. 21. 3D chip-package thermal flow, from early exploration to sign-off. Fig. 22. INTACT circuit and package cross section used for thermal
modeling.

need to dissipate their heat on top of one another. On the contrary, in the case of interposer-based systems, a single layer of chiplets is dissipating heat, while heat extraction can be performed from the top package face, similar to a regular flip-chip packaged circuit. Nevertheless, contrary to passive interposers, in the case of an active interposer the bottom layer is also part of the power budget and dissipates heat as well. Since the power budget of the active interposer layer is rather limited, with most of the power budget within the chiplets, this should help the overall thermal dissipation.
Finally, due to the heterogeneous structure of such a 3D stack, many materials compose the device: silicon substrate, back-end of line (BEOL) in copper, underfill composite materials between the chiplets and the interposer, micro-bumps (SnAg), TSVs (copper), and so on. This assembly leads to strong anisotropic thermal effects, favoring and increasing thermal hotspot effects. Moreover, due to the thinness of the interposer (100 µm), horizontal thermal dissipation is reduced in the interposer, leaving mostly vertical thermal dissipation through the chiplets. These various thermal effects have been widely studied in the literature [53], [54], and need to be taken into account at full-system level.
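To give a feel for these layer effects, the following first-order sketch (not from the paper; thicknesses and conductivities are generic textbook assumptions) stacks the 1D vertical thermal resistances of the main materials in series:

```python
# First-order vertical thermal resistance of the heterogeneous stack,
# layer by layer: R = thickness / (conductivity * area). Thicknesses and
# conductivities are generic assumed values, not data from the paper.
layers = [
    ("chiplet Si substrate", 100e-6, 150.0),   # thickness (m), k (W/(m.K))
    ("chiplet BEOL (Cu/low-k mix)", 5e-6, 2.0),
    ("underfill + micro-bumps", 25e-6, 0.5),
    ("interposer Si (100 um)", 100e-6, 150.0),
]
area = 1e-6  # 1 mm^2 expressed in m^2
for name, t, k in layers:
    print(f"{name:28s} {t / (k * area):6.2f} K/W per mm^2")
# The thin composite underfill layer dominates the vertical path, and its
# low conductivity also throttles lateral spreading: hotspots stay local.
```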
B. Thermal Modeling Strategy
With all the 3D thermal challenges, namely increased power density and design complexity on the design side and fine-grain material effects on the technology side, coupled with the regular package and board thermal environment, an accurate thermal exploration must be performed with an adequate thermal methodology. Various thermal tools are available: either circuit-level tools able to cope with a detailed circuit and technology description but only with simple packaging conditions, or package-level tools able to cope with detailed packaging but with reduced die and technology information. In order to achieve an accurate thermal exploration covering all modeling aspects, the Project Sahara solution, a thermal analysis prototype from Mentor Graphics, a Siemens business, was selected [55].
As presented in Fig. 21, an adequate thermal methodology has been set up to allow modeling of low-level structures (TSVs, micro-bumps, underfill), with a design entry at GDS level and with accurate static or dynamic power maps, all this in the context of the full system (package and fan). The methodology has been qualified on a previous 3D logic-on-logic design with silicon thermal measurements [36]. More details of the thermal methodology can be found in [56].

C. Thermal Simulation Results
The INTACT circuit and package have been modeled, as presented in Fig. 22 with a detailed cross section of the 3D circuit. In terms of power budget, a scenario with a maximum static power budget of 28 W is simulated, corresponding to a worst-case situation of 3 W per chiplet (×6) and 10 W in the active interposer, while the nominal circuit power budget is 17 W, as presented in Section X-B.
As a result, Fig. 23 shows the thermal exploration without heat sink (maximum temperature 150 °C) and with a regular heat sink and fan (maximum temperature 53 °C), while no hotspots appear within the computing chiplets. Even for this worst-case scenario, due to a still limited power density of 0.14 W/mm², the thermal dissipation of the active interposer can be handled using a regular heat sink and fan.
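These numbers can be cross-checked with simple arithmetic (the ~200 mm² interposer footprint and the ~1 K/W heatsink resistance below are assumptions consistent with the quoted figures, not values from the paper):

```python
# Worst-case power density and first-order package temperature estimate.
p_total = 6 * 3.0 + 10.0        # W: six chiplets at 3 W plus 10 W interposer
area_mm2 = 200.0                # assumed interposer footprint
print(p_total / area_mm2)       # 0.14 W/mm^2, matching the quoted density

theta_ja = 1.0                  # K/W, assumed heatsink+fan junction-to-ambient
t_ambient = 25.0                # degrees C
print(t_ambient + theta_ja * p_total)  # ~53 C, in line with Fig. 23(b)
```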


Fig. 24. Development board fitting in a standard PC case.

Fig. 25. (a) Maximum core frequency. (b) Power consumption at Fmax [FBB = (0,1)]. (c) Power efficiency at Vmin.

Fig. 26. Power consumption breakdown, cores operating at 1 V, 900 MHz.

Fig. 27. Execution speedup up to 512 cores.

X. OVERALL CIRCUIT RESULTS

As shown in Fig. 24, a complete development board has been designed for measurements and application evaluations, including running Linux on the chip. The board features two FPGAs with 2 × 16 GB of 64-bit DDR4 memory and various peripherals: 8× PCIe Gen3, SATA, 1-Gb Ethernet, 10-Gb Ethernet, HDMI, USB, SD-Card, and UART. The demonstration board also features a power infrastructure with voltage and current sensing. Each FPGA is connected to two of the four LVDS links of the chip.

A. Circuit Performances
The chiplet is functional in the 0.5–1.3-V range with Forward Body Biasing (FBB) [46] up to ±2 V. Fig. 25(a) shows that a core frequency of 1.15 GHz is achieved at 1.3 V with 0/+1 (VDDS/GNDS) FBB. Single-core performance is 2.47 Coremark/MHz and 1.23 DMIPS/MHz. At chip level, the maximum energy efficiency is 9.6 GOPs/W on Coremark benchmarks (IPC = 0.8/core) at 0.6 V, taking into account voltage regulation losses in the interposer, as shown in Fig. 25(c). As expected, FBB boosts performance: in typical conditions at 0.9 V, a frequency increase of 24% is achieved with −1/+1 FBB, while in typical conditions at 680 MHz, an energy efficiency increase of 15% is achieved with asymmetric 0/+1 FBB.
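As a rough cross-check of these chip-level figures (our back-of-envelope calculation, counting one retired instruction as one operation, which is an assumption), the throughput at the Fig. 26 operating point can be related to the efficiency peak:

```python
# Chip-level throughput and efficiency sanity check at 1 V, 900 MHz.
cores, ipc, f_ghz, p_w = 96, 0.8, 0.9, 17.0
gops = cores * ipc * f_ghz          # ~69 GOPs of retired instructions
print(gops, gops / p_w)             # ~69 GOPs, ~4.1 GOPs/W at 1 V

# Energy per operation scales roughly with VDD^2, so moving from 1.0 V to
# the 0.6-V efficiency point predicts about (1.0 / 0.6) ** 2 = 2.8x better
# GOPs/W, bracketing the measured 9.6 / 4.1 = 2.4x improvement.
print((1.0 / 0.6) ** 2)
```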
B. Circuit Power Budget and Energy Efficiency
In Fig. 25(b) and (c), we show overall chip power and performance measurements with 0/+1 FBB. Power consumption and energy efficiency while running the Coremark benchmark are compared to a theoretical system using a digital LDO instead of the proposed fully integrated SCVR. Using an LDO at the same VIN = 2.5 V would result in a 2× increase in power consumption; a lower VIN would be needed to limit losses, at the expense of more power pins and voltage-drop issues.
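The 2× figure follows directly from the ideal linear-regulator efficiency (a minimal sketch; the 1.0-V output level is taken from the Fig. 26 operating point):

```python
# An ideal LDO's efficiency is VOUT/VIN regardless of load current, so
# delivering a ~1.0-V core supply from VIN = 2.5 V wastes 60% of the input.
v_in, v_out = 2.5, 1.0
eta_ldo = v_out / v_in          # 0.40
eta_scvr = 0.82                 # measured peak efficiency of the SCVR
print(eta_scvr / eta_ldo)       # ~2.05x input-power ratio at equal load power
```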
The power breakdown in Fig. 26 shows the low power budget of the active interposer, with only 3% of the total power consumed by the active interposer logic. The cores+L1$ represent over half the power consumption of the chiplets, themselves consuming the majority of the measured circuit power (17 W).

C. Circuit Scalability
Lastly, Fig. 27 shows the scalability of the cache-coherent architecture, which is analyzed by running a 4-Mpixel image
filtering application from 1 to 512 cores. The filter is composed of a 1-D convolution, followed by a transposition of the image, and ends with another 1-D convolution. Software synchronization barriers separate these steps, and the transposition, in particular, involves many memory transfers.
Results for more than 96 cores were obtained by RTL simulation with additional chiplets. Software is executed on a single cluster up to four cores and on a single chiplet up to 16 cores. Compared to a single-core execution, a 67× execution-time speedup is obtained with 96 cores and 340× with 512 cores. The slight uptick above 128 cores results from the threshold where the data set fits in the caches. This quasi-linear speedup, ignoring limitations of the external memory bandwidth, shows the scalability of the network protocols and their 3D implementations.
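One way to read these numbers (our first-order analysis, not the authors'; it assumes a fixed-work Amdahl model) is to extract the serial fraction implied by the 96-core point and compare the model's 512-core prediction with the measurement:

```python
# Fit Amdahl's law S(n) = 1 / (s + (1 - s)/n) to the measured S(96) = 67.
def amdahl(n: int, s: float) -> float:
    return 1.0 / (s + (1.0 - s) / n)

s = (96 / 67 - 1) / (96 - 1)     # implied serial fraction, ~0.46%
print(f"serial fraction: {s:.4f}")
print(f"Amdahl prediction at 512 cores: {amdahl(512, s):.0f}x")   # ~154x
# The measured 340x is well above this prediction: once the working set
# fits in the aggregate caches, per-core memory time shrinks, so scaling
# beats the fixed-work model, consistent with the uptick noted above.
```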
TABLE VI
STATE OF THE ART COMPARISON

D. Comparison to Prior Art
Compared to prior art (Table VI), the INTACT circuit is the first CMOS active interposer validated on silicon, which offers a chiplet-based many-core architecture for HPC. The active interposer solution allows for integrated VRs without any external passives, using the free die area available in the active interposer, offering DVFS-per-chiplet and achieving 156 mW/mm² at 82% peak power efficiency, with 10%–50% more efficiency with respect to LDO converters integrated in organic schemes. The SCVR is also fault tolerant, to mitigate the effect of defective unit cells on the overall power efficiency. Regarding interconnects, contrary to previous point-to-point solutions, the active interposer offers flexible and distributed NoC meshes enabling any chiplet-to-chiplet communication for scalable cache-coherency traffic, with 0.6-ns/mm inter-chiplet latency using asynchronous signaling within the interposer, and a 0.59-pJ/bit synchronous 3D-Plug energy efficiency with 3-Tb/s/mm² bandwidth density, which is twice better than previous circuits.
The overall system integrates a total of 96 cores in six chiplets, offering a peak computing power of 220 GOPs (peak mult-acc), which is quite comparable to advanced state-of-the-art processor systems. Finally, the overall distributed interconnects and cache-coherent memory architecture are scalable up to 896 cores, showing the architecture partitioning capability to target larger computing scales.
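For scale, the two headline interconnect figures combine into an implied interface power density (a simple product; the utilization remark is our assumption):

```python
# Implied 3D-Plug power density at full utilization: bandwidth density
# times energy per bit, both taken from the comparison above.
bw_density = 3e12      # bit/s per mm^2 of interface area
e_per_bit = 0.59e-12   # J/bit
print(bw_density * e_per_bit)   # ~1.77 W/mm^2 at 100% utilization
# Actual coherence traffic has a much lower duty cycle, so the 3D-Plugs
# remain a small share of the 17-W measured budget (Fig. 26).
```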
XI. CONCLUSION
The presented active interposer leverages the 3D integration benefits by offering a baseline of functionalities, such as voltage delivery, chiplet-to-chiplet communications, and IOs, shared by most computing assemblies. The active interposer allows a flexible assembly with common functionalities while maintaining the yield management benefits. For this reduced power density and budget, thermal dissipation is not an issue within the active interposer, as for a regular passive interposer. 3D integration and active interposers open the way toward efficient integration of large-scale chiplet-based computing systems. Such a scheme can be applied to the integration of similar chiplets, as presented in this article, but also to the smooth integration of heterogeneous computing chiplets [47].

ACKNOWLEDGMENT
The authors would like to thank STMicroelectronics and Didier Campos' team for INTACT package design and assembly, PRESTO Engineering and Brice Grisollet for testing the INTACT circuit on automatic test equipment, and Easii-IC and Jean-Paul Goglio and his team for designing the INTACT application demonstration board. Finally, they would like to thank many other contributors from Mentor Graphics on the CAD tool teams and from CEA-LETI on both the design and technology teams for their dedication to making this concept and circuit a successful realization.


REFERENCES

[1] G. Agosta et al., "Challenges in deeply heterogeneous high performance systems," in Proc. 22nd Euromicro Conf. Digit. Syst. Des. (DSD), Aug. 2019, pp. 428–435.
[2] P. Ramm, Handbook of 3D Integration: Technology and Applications of 3D, vol. 1. Hoboken, NJ, USA: Wiley, 2008.
[3] M.-F. Chen, F.-C. Chen, W.-C. Chiou, and C. H. Doug Yu, "System on integrated chips (SoIC) for 3D heterogeneous integration," in Proc. IEEE 69th Electron. Compon. Technol. Conf. (ECTC), 2019, pp. 1–5.
[4] N. Beck, S. White, M. Paraschou, and S. Naffziger, "'Zeppelin': An SoC for multichip architectures," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 40–41.
[5] M.-S. Lin et al., "A 7nm 4GHz Arm-core-based CoWoS chiplet design for high performance computing," in Proc. Symp. VLSI Circuits, Jun. 2019, pp. 28–32.
[6] D. Greenhill et al., "A 14nm 1GHz FPGA with 2.5D transceiver integration," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2017, pp. 54–55.
[7] P. Gupta and S. S. Iyer, "Goodbye, motherboard. Bare chiplets bonded to silicon will make computers smaller and more powerful: Hello, silicon-interconnect fabric," IEEE Spectr., vol. 56, no. 10, pp. 28–33, Oct. 2019.
[8] CHIPS Program. Accessed: 2017. [Online]. Available: https://www.darpa.mil/program/common-heterogeneous-integration-and-ip-reuse-strategies
[9] 3 Ways Chiplets Are Remaking Processors. Accessed: Apr. 2020. [Online]. Available: https://spectrum.ieee.org/semiconductors/processors/3-ways-chiplets-are-remaking-processors
[10] J. Quinne and B. Loferer, "Quality in 3D assembly: Is known good die good enough?" in Proc. IEEE Int. 3D Syst. Integr. Conf. (3DIC), 2013, pp. 1–5.
[11] K. Sohn et al., "18.2 A 1.2 V 20nm 307GB/s HBM DRAM with at-speed wafer-level I/O test scheme and adaptive refresh considering temperature distribution," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Jan. 2016, pp. 316–317.
[12] T. F. Wu et al., "14.3 A 43pJ/cycle non-volatile microcontroller with 4.7µs shutdown/wake-up integrating 2.3-bit/cell resistive RAM and resilience techniques," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2019.
[13] A. Jouve et al., "Die to wafer direct hybrid bonding demonstration with high alignment accuracy and electrical yields," in Proc. Int. 3D Syst. Integr. Conf. (3DIC), Oct. 2019, pp. 1–5.
[14] Open Compute ODSA Project. Accessed: 2019. [Online]. Available: https://www.opencompute.org/wiki/Server/ODSA
[15] S. Naffziger, K. Lepak, M. Paraschou, and M. Subramony, "2.2 AMD chiplet architecture for high-performance server and desktop products," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 44–45.
[16] W. Gomes et al., "8.1 Lakefield and mobility compute: A 3D stacked 10nm and 22FFL hybrid processor system in 12×12 mm², 1 mm package-on-package," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020, pp. 144–146.
[17] G. Hellings et al., "Active-lite interposer for 2.5 & 3D integration," in Proc. Symp. VLSI Technol. Circuits, 2015, pp. 222–223.
[18] S. Chéramy et al., "The active-interposer concept for high-performance chip-to-chip connections," Chip Scale Rev., vol. 5, p. 35, Jun. 2014.
[19] J. Yin et al., "Modular routing design for chiplet-based systems," in Proc. ACM/IEEE 45th Annu. Int. Symp. Comput. Archit. (ISCA), Jun. 2018, pp. 726–738.
[20] V. Pano, R. Kuttappa, and B. Taskin, "3D NoCs with active interposer for multi-die systems," in Proc. 13th IEEE/ACM Int. Symp. Networks-on-Chip, Oct. 2019, pp. 1–8.
[21] P. Vivet et al., "2.3 A 220GOPS 96-core processor with 6 chiplets 3D-stacked on an active interposer offering 0.6ns/mm latency, 3Tb/s/mm² inter-chiplet interconnects and 156mW/mm² 82%-peak-efficiency DC-DC converters," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2020.
[22] D. Gitlin et al., "Generalized cost model for 3D systems," in Proc. IEEE SOI-3D-Subthreshold Microelectron. Technol. Unified Conf. (S3S), Oct. 2017, pp. 1–3.
[23] P. Coudrain et al., "Active interposer technology for chiplet-based advanced 3D system architectures," in Proc. IEEE 69th Electron. Compon. Technol. Conf. (ECTC), May 2019, pp. 569–578.
[24] E. Guthmuller et al., "A 29 Gops/Watt 3D-ready 16-core computing fabric with scalable cache coherent architecture using distributed L2 and adaptive L3 caches," in Proc. IEEE 44th Eur. Solid State Circuits Conf. (ESSCIRC), Sep. 2018, pp. 318–321.
[25] J. Dumas, E. Guthmuller, and F. Petrot, "Dynamic coherent cluster: A scalable sharing set management approach," in Proc. IEEE 29th Int. Conf. Appl.-Specific Syst., Archit. Processors (ASAP), Jul. 2018, pp. 1–8.
[26] J. Dumas, E. Guthmuller, C. F. Tortolero, and F. Petrot, "Trace-driven exploration of sharing set management strategies for cache coherence in manycores," in Proc. 15th IEEE Int. New Circuits Syst. Conf. (NEWCAS), Jun. 2017, pp. 77–80.
[27] Y. Fu, T. M. Nguyen, and D. Wentzlaff, "Coherence domain restriction on large scale systems," in Proc. 48th Int. Symp. Microarchitecture, 2015, pp. 686–698.
[28] E. Guthmuller, I. Miro-Panades, and A. Greiner, "Adaptive stackable 3D cache architecture for manycores," in Proc. IEEE Comput. Soc. Annu. Symp., Aug. 2012, pp. 39–44.
[29] E. Guthmuller, I. Miro-Panades, and A. Greiner, "Architectural exploration of a fine-grained 3D cache for high performance in a manycore context," in Proc. IFIP/IEEE 21st Int. Conf. Very Large Scale Integr. (VLSI-SoC), Oct. 2013, pp. 302–307.
[30] I. Miro-Panades, E. Beigne, O. Billoint, and Y. Thonnart, "In-situ Fmax/Vmin tracking for energy efficiency and reliability optimization," in Proc. IEEE 23rd Int. Symp. On-Line Test. Robust Syst. Des. (IOLTS), Jul. 2017, pp. 96–99.
[31] P. Meinerzhagen et al., "An energy-efficient graphics processor featuring fine-grain DVFS with integrated voltage regulators, execution-unit turbo, and retentive sleep in 14nm tri-gate CMOS," in IEEE Int. Solid-State Circuits Conf. (ISSCC) Dig. Tech. Papers, Feb. 2018, pp. 38–40.
[32] I. Bukreyev et al., "Four monolithically integrated switched-capacitor DC-DC converters with dynamic capacitance sharing in 65-nm CMOS," IEEE Trans. Circuits Syst. I, Reg. Papers, vol. 65, no. 6, pp. 2035–2047, Nov. 2017.
[33] H. Meyvaert, T. Van Breussegem, and M. Steyaert, "A 1.65W fully integrated 90nm bulk CMOS intrinsic charge recycling capacitive DC-DC converter: Design and techniques for high power density," in Proc. IEEE Energy Convers. Congr. Expo., Sep. 2011, pp. 3234–3241.
[34] T. M. Andersen et al., "A 10 W on-chip switched capacitor voltage regulator with feedforward regulation capability for granular microprocessor power delivery," IEEE Trans. Power Electron., vol. 32, no. 1, pp. 378–393, Jan. 2017.
[35] T. Souvignet, B. Allard, and S. Trochut, "A fully integrated switched-capacitor regulator with frequency modulation control in 28-nm FDSOI," IEEE Trans. Power Electron., vol. 31, no. 7, pp. 4984–4994, Jul. 2016.
[36] P. Vivet et al., "A 4 × 4 × 2 homogeneous scalable 3D network-on-chip circuit with 326 MFlit/s 0.66 pJ/b robust and fault tolerant asynchronous 3D links," IEEE J. Solid-State Circuits, vol. 52, no. 1, pp. 33–49, Jan. 2017, doi: 10.1109/JSSC.2016.2611497.
[37] Y. Thonnart, P. Vivet, S. Agarwal, and R. Chauhan, "Latency improvement of an industrial SoC system interconnect using an asynchronous NoC backbone," in Proc. 25th IEEE Int. Symp. Asynchronous Circuits Syst. (ASYNC), May 2019, pp. 46–47.
[38] J. Pontes, P. Vivet, and Y. Thonnart, "Two-phase protocol converters for 3D asynchronous 1-of-n data links," in Proc. 20th Asia South Pacific Des. Autom. Conf., Jan. 2015, pp. 154–159.
[39] S. M. Nowick and M. Singh, "Asynchronous design, Part 1: Overview and recent advances," IEEE Des. Test, vol. 32, no. 3, pp. 5–18, Jun. 2015.
[40] R. P. Reddy, A. Acharyya, and S. Khursheed, "A cost-aware framework for lifetime reliability of TSV-based 3D-IC design," IEEE Trans. Circuits Syst. II, Exp. Briefs, vol. 67, no. 11, pp. 2677–2681, Nov. 2020, doi: 10.1109/TCSII.2020.2970724.
[41] C. Metzler et al., "Computing detection probability of delay defects in signal line TSVs," in Proc. 18th IEEE Eur. Test Symp. (ETS), May 2013, pp. 1–6.
[42] C. Papameletis, B. Keller, V. Chickermane, S. Hamdioui, and E. J. Marinissen, "A DfT architecture and tool flow for 3-D SICs with test data compression, embedded cores, and multiple towers," IEEE Des. Test, vol. 32, no. 4, pp. 40–48, Aug. 2015.
[43] S. K. Goel et al., "Test and debug strategy for TSMC CoWoS stacking process based heterogeneous 3D IC: A silicon case study," in Proc. IEEE Int. Test Conf. (ITC), Sep. 2013, pp. 1–8, doi: 10.1109/TEST.2013.6651893.
[44] IEEE 1838 WG. Accessed: Mar. 2020. [Online]. Available: http://grouper.ieee.org/groups/3Dtest/


[45] J. Durupt, P. Vivet, and J. Schloeffel, "IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips active interposer system," in Proc. 21st IEEE Eur. Test Symp. (ETS), May 2016, pp. 1–8.
[46] E. Beigne et al., "A 460 MHz at 397 mV, 2.6 GHz at 1.3 V, 32 bits VLIW DSP embedding Fmax tracking," IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 125–136, Jan. 2015.
[47] P.-Y. Martinez et al., "ExaNoDe: Combined integration of chiplets on active interposer with bare dice in a multi-chip-module for heterogeneous and scalable high performance compute nodes," in Proc. IEEE VLSI Conf., 2020.
[48] A. Olofsson, T. Nordstrom, and Z. Ul-Abdin, "Kickstarting high-performance energy-efficient manycore architectures with Epiphany," in Proc. 48th Asilomar Conf. Signals, Syst. Comput., Nov. 2014, pp. 1719–1726.
[49] H. Reiter, "Multi-die IC design tutorial," in Proc. 3D ASIP Conf., 2015, pp. 1–5.
[50] Calibre 3DSTACK. Accessed: 2011. [Online]. Available: https://www.mentor.com/products/ic_nanometer_design/verification-signoff/physical-verification/calibre-3dstack
[51] zGLUE Inc. Accessed: 2014. [Online]. Available: www.zglue.com
[52] C. Torregiani, B. Vandevelde, H. Oprins, E. Beyne, and I. D. Wolf, "Thermal analysis of hot spots in advanced 3D-stacked structures," in Proc. 15th Int. Workshop Thermal Invest. ICs Syst., 2009, pp. 55–60.
[53] T. R. Harris, P. Franzon, W. R. Davis, and L. Wang, "Thermal effects of heterogeneous interconnects on InP/GaN/Si diverse integrated circuits," in Proc. Int. 3D Syst. Integr. Conf. (3DIC), Dec. 2014, pp. 1–3.
[54] C. Santos, P. Vivet, J.-P. Colonna, P. Coudrain, and R. Reis, "Thermal performance of 3D ICs: Analysis and alternatives," in Proc. Int. 3D Syst. Integr. Conf. (3DIC), Dec. 2014, pp. 1–7.
[55] Mentor Graphics, "A complete guide to 3D chip-package thermal co-design: 10 key considerations," White Paper, 2017. [Online]. Available: https://www.mentor.com/products/mechanical/resources/overview/a-complete-guide-to-3d-chip-package-thermal-co-design-10-key-considerations-d8b0e79e-fb79-4c5a-992d-45d0d3b5f0ac
[56] C. Santos, P. Vivet, L. Wang, M. White, and A. Arriordaz, "Thermal exploration and sign-off analysis for advanced 3D integration," in Proc. Design Track, DAC Conf., Jun. 2017.

Pascal Vivet (Member, IEEE) received the Ph.D. degree from Grenoble INPG, Grenoble, France, in 2001, designing an asynchronous microprocessor. After four years with STMicroelectronics, Crolles, France, he joined CEA-Leti, Grenoble, in 2003, in the digital design lab. He was a Project Leader on 3D circuit design from 2011 to 2018. He is currently a Scientific Director of the Digital Systems and Integrated Circuits Division, CEA-LIST, a CEA institute. He has authored or coauthored more than 120 articles and holds several patents in the field of digital design. His research interests cover wide aspects of circuit- and system-level design, ranging from system integration, multicore architectures, networks-on-chip, energy-efficient design, and related CAD aspects, in strong links with advanced technologies such as 3D, nonvolatile memories, and photonics.

Eric Guthmuller graduated from École Polytechnique and received the M.S. degree from Telecom Paris, Paris, France, in 2009, and the Ph.D. degree in computer science from University Pierre and Marie Curie (UPMC), Telecom Paris, in 2013. He joined CEA-Leti, Grenoble, France, as a Full-Time Researcher, then CEA-List in 2019. His main research interests include processor architectures and their memory hierarchy, in particular cache coherency for manycore and heterogeneous architectures.

Yvain Thonnart (Member, IEEE) graduated from Ecole Polytechnique, Paris, France, and received the M.S. degree from Telecom Paris, in 2005. He joined the Technological Research Division of CEA, the French Alternative Energies and Atomic Energy Commission, at CEA-Leti, Grenoble, France, then CEA-List in 2019. He is now a Senior Expert on communication and synchronization in systems on chip, and a Scientific Advisor for the mixed-signal lab. His main research interests include networks-on-chip, asynchronous logic, emerging technologies integration, and interposers.

Gael Pillonnet (Senior Member, IEEE) was born in Lyon, France, in 1981. He received the master's degree in electrical engineering from CPE Lyon, Lyon, France, in 2004, and the Ph.D. and Habilitation degrees from INSA Lyon, Lyon, in 2007 and 2016, respectively. Following an early experience as an Analog Designer at STMicroelectronics, Crolles, France, in 2008 he joined the Electrical Engineering Department, University of Lyon, Lyon. From 2011 to 2012, he held a visiting researcher position at the University of California at Berkeley, Berkeley, CA, USA. Since 2013, he has been a Full-Time Researcher at CEA-LETI, a major French research institution. His research focuses on energy transfers in electronic devices, such as power converters, audio amplifiers, energy-recovery logics, electromechanical transducers, and harvesting electrical interfaces.

César Fuguet received the M.S. degree in systems engineering from the Universidad de Los Andes (ULA), Mérida, Venezuela, in 2012, and the M.S. and Ph.D. degrees in computer science from University Pierre and Marie Curie (UPMC), Paris, France, in 2012 and 2015, respectively. Following an experience at Kalray, Grenoble, France, he is currently a Full-Time Researcher at CEA-List, Grenoble, France. His main research interests are multicore processor architectures, cache coherency, and heterogeneous architectures with accelerators for high-performance computing.

Ivan Miro-Panades (Member, IEEE) received the M.S. degree in telecommunication engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2002, and the M.S. and Ph.D. degrees in computer science from University Pierre and Marie Curie (UPMC), Paris, France, in 2004 and 2008, respectively. He worked at Philips Research, Paris, and STMicroelectronics, Grenoble, France, before joining CEA, Grenoble, in 2008, where he is currently a Research Engineer in digital integrated circuits. His main research interests are artificial intelligence (AI), Internet-of-Things, low-power architectures, energy-efficient systems, and Fmax/Vmin tracking methodologies.

Guillaume Moritz was born in France in 1987. He graduated from Telecom Physique Strasbourg in 2010, with a specialization in micro- and nano-electronics and the associated master's degree. After finishing his internship at CEA-Leti, Grenoble, France, he joined Leti for two years. Then, as a subcontractor from ATOS specialized in physical design, he held different positions with Leti, working on various advanced projects, including two major 3D circuits. He joined Leti in 2019, where he is currently focusing on physical implementation of image sensors.

Jean Durupt graduated from the École Centrale de Lyon, Lyon, France, in 1990, with a specialization in micro-electronics. He joined CEA, Grenoble, France, in 2001, in the digital design lab. His main research interests are multicore processor architectures, circuit design, and more specifically design-for-test and circuit testability, including testability of 3D architectures.


Christian Bernard received the Engineering degree from the Grenoble Polytechnical Institute, Grenoble, France, in 1979. After four years with Thomson, Paris, France, he worked at Bull, Paris, on mainframe HW design of CPU cores, multiprocessing, and cache coherency aspects. He joined CEA-Leti, Grenoble, in 2001, in the digital design lab. He contributed to the design of large systems of the lab covering various domains: 4G mobile baseband, space-mission-dedicated hardware accelerators, and many-core architectures, including the integration of cache coherency in 3D many-cores. He is now retired.

Didier Varreau was born in Dôle, France, in 1954. He received the Electronic Higher Technical Diploma degree from Grenoble University, France, in 1975. In 1976, he joined CEA-LETI, Grenoble, France, to develop instrumental electronic boards for medical and nuclear purposes. From 2003 to 2006, he worked on the FAUST project developing integrated synchronous IPs. Since 2006, he has been in charge of physical implementation of low-power energy-efficient accelerators, and since 2010, he has been working on large multiprocessor systems-on-chip, including large 3D systems. He is now retired.

Julian Pontes (Member, IEEE) graduated in computer engineering from UEPG, Ponta Grossa, Brazil, in 2006. He received the M.Sc. and Ph.D. degrees in computer science from PUC-RS, Porto Alegre, Brazil, in collaboration with CEA-Leti, Grenoble, France, in 2008 and 2012, respectively. His Ph.D. research work was focused on fault tolerance in asynchronous circuits, and this work was extended as a PostDoc at CEA-Leti, with research contributions on 3D architecture and circuit design. He worked on system integration at Arm Ltd., Sheffield, U.K. He currently works on CPU design at Arm, Sophia Antipolis, France.

Sébastien Thuries received the master's degree from the University of Montpellier, Montpellier, France, in 2003. He joined CEA-Leti, Grenoble, France, in 2004, as a Research Engineer. He is leading the High-Density 3D Architecture and Design Group, CEA-LETI, covering fine-pitch 3D stacking as well as monolithic 3D (M3D). Over the last decade, he has worked on and led several digital ASIC developments for a range of applications, such as 4G digital baseband, complex imagers, systems on chip, and mixed-signal RF. He has been a pioneer in FDSOI digital design and back-biasing capability. He leads the research team on the new architecture and design paradigms raised by M3D-IC in order to optimize the full system down to the technology.

David Coriat received the M.Sc. degree from the University of Science of Montpellier, France, in 2012. He subsequently joined CEA-Leti Minatec, Grenoble, France. He has worked on dynamic management of power and variability in MP-SoC architectures as well as power estimation techniques in large MP-SoC architectures. His research interests now lie in low-power architectures and design.

Michel Harrand (Member, IEEE) started his career at Matra Espace, Paris, France, in 1980, where he designed automatic pilot systems for satellites. In 1985, he joined Thomson Semiconductors, Grenoble, France, where he designed numerous integrated circuits in the microprocessor, telecommunication, and mostly image compression fields, and led a design team before being appointed Director of the Embedded DRAM Department in 1996. He joined CEA, Grenoble, in 2006, to prepare the creation of Kalray, Grenoble, a startup designing manycore processors, which he co-founded in 2008 as the CTO. He rejoined CEA at the end of 2012 to explore the architecture, design, and applications of new technologies such as 2.5D integrated circuits, emerging nonvolatile memories, and currently neural networks. He served on the ISSCC TPC from 2001 to 2006, and holds more than 40 patents.

Denis Dutoit joined CEA, Grenoble, France, in 2009, after working for STMicroelectronics, Crolles, France, and ST-Ericsson, Grenoble. At CEA, he has been involved in system-on-chip architecture for computing and in 3D integrated circuit projects. After defining CEA-Leti's roadmap of technologies and solutions for advanced computing, he is now involved in European projects in high-performance computing as a Coordinator, Project Leader, and SoC Architect.

Didier Lattard was born in Saint Marcellin, France, in 1963. He received the Ph.D. degree in microelectronics from the National Polytechnic Institute of Grenoble, Grenoble, France, in 1989. In 1990, he joined CEA-Leti, Grenoble. He was involved in the design of image and baseband processing circuits as a Senior Research and Development Engineer and Project Leader. From 2003 to 2014, he led projects in the field of NoC-based telecom and high-performance computing applications. In 2014, he moved to the Technology Department of CEA-Tech and was involved in 3D integration projects. Since 2020, he has been leading a team developing mixed-signal circuits and software tools for near-memory computing, cybersecurity, IoT, and telecom applications. He has published 60 articles in books, refereed journals, and conferences. He holds 24 patents in the fields of baseband processing, NoC architectures, and 3D integration.

Lucile Arnaud joined CEA-LETI, Grenoble, France, in 1984. She first covered design and characterization of magnetic and electromagnetic passive devices. From 2007 to 2014, she was assigned to STMicroelectronics, Crolles, France, for interconnect reliability expertise on the most advanced CMOS technologies. Since 2014, she has been involved in 3D-IC developments at LETI for technology expertise and project management. In the last four years, she managed internal and collaborative projects for 3D interconnect development with Cu-SiO2 hybrid bonding technologies. She has authored or coauthored more than 90 articles, including invited talks and tutorials at IEEE conferences.

Jean Charbonnier (Member, IEEE) graduated from the National School of Physics of Grenoble, Grenoble, France, in 2001, and received the Ph.D. degree in crystallography from the University Joseph Fourier, Grenoble, in 2006. He joined the 3D Wafer Level Packaging Group, CEA-Leti, Grenoble, in 2008. He has been working for more than ten years on through-silicon vias, 3D interconnections, and silicon interposer technology. His research interests include high-performance computing, silicon photonics interposers, as well as cryopackaging for quantum architecture applications. He is currently in charge of coordinating the High-Density 3D Integration Group, 3D Packaging Laboratory, CEA-Leti.


Perceval Coudrain received the M.S. degree in materials sciences from the University of Nantes, Nantes, France, in 2001, and the Ph.D. degree from the Institut Supérieur de l'Aéronautique et de l'Espace, Toulouse, France, in 2009. He joined STMicroelectronics, Crolles, France, in 2002, and entered the advanced research and development group in 2005, where he was involved in the development of backside illumination and monolithic 3D integration for CMOS image sensors. For ten years, he has been focusing on 3D integration technologies, including TSV and Cu-Cu hybrid bonding, and thermal management. He moved to CEA-Leti, Grenoble, France, in 2018, where his research focuses on 3D integration, fan-out wafer-level packaging, and embedded microfluidics.

Arnaud Garnier graduated from INSA de Lyon, Lyon, France, in 2004. He received the Ph.D. degree in materials science from Université St Quentin en Yvelines, St Quentin, France, in 2007, after studying the Smart Cut technology on GaN for three years at SOITEC. He then joined CEA-LETI, Grenoble, France, to work on wafer-level packaging, with a specific focus on wafer bonding, chip assembly, underfilling, 3D process integration, and advanced packaging. He currently works as a Project Leader mainly on fan-out wafer-level packaging technologies.

Frédéric Berger was born in Grenoble, France, in 1973. He received the B.T.S. degree in photonic optical engineering from Lycée Argouges, Grenoble, in 1993. He started his career as a Technician in the maintenance of alarm systems, then of fiber-optic welders for telecommunications at Siemens/Corning. More attracted by research and development, he continued his activity in the Photonics team to develop and perfect optical amplifiers. In 2003, he joined CEA, Grenoble, as a Technician in the Packaging and Assembly Laboratory. In 2005, he participated with SET in the development of the first FC300 equipment for 3D assemblies based on microtubes for infrared imagers. He used this technical background to carry out the assemblies of the six chiplets of the INTACT project.

Alain Gueugnot received the B.T.S. degree in microtechnology from Lycée Jules Richard, Paris, France, in 1989. He joined CEA-DAM, Grenoble, in 1992, and then CEA-LETI at the DOPT in 2003 to work in the joint laboratory with SOFRADIR (Lynred) on 3D packaging. He then set up means of morphological characterization and metallographic expertise of component assemblies for infrared, lighting, imager, and screen applications, using profilometers and ionic and mechanical cross sections for SEM imaging.

Alain Greiner is currently a Professor at Universite Pierre et Marie Curie (UPMC), Paris, France, and an Associate Professor at Ecole Polytechnique, Paris. He was the Head of the Hardware Architecture Department, LIP6 Laboratory, from 1990 to 2010. He was the Team Leader of the public-domain VLSI/CAD system ALLIANCE, and the Technical Coordinator of the SoCLib virtual prototyping platform, supported by the French Agence Nationale pour la Recherche and jointly developed by six industrial companies and ten academic laboratories. He is the Chief Architect of the scalable, shared-memory, manycore TSAR architecture, and is presently working on scalable operating systems for those kinds of machines.

Quentin L. Meunier received the Diploma degree from the Ensimag School, Grenoble, France, in 2007, and the Ph.D. degree in computer science from the Université de Grenoble, Grenoble, in 2010. Since 2011, he has been an Associate Professor at the LIP6 Laboratory, Sorbonne Université, Paris, France. His research interests include manycore architectures and cache coherence, high-performance computing, and side-channel attacks and countermeasures.

Alexis Farcy graduated in electronic engineering from the Institut des Sciences et Techniques de Grenoble, France, in 2000, and received the Ph.D. degree in electronics, optronics, and systems from the University of Savoie, Chambery, France, in 2009. He was employed by STMicroelectronics, Crolles, France. From 2000 to 2007, he was with the Advanced Interconnects and Passive Components Module, focusing on interconnect performance analysis for advanced technology nodes, integration of advanced inductors and 3D capacitors in the BEOL, and high-frequency characterization of low-k and high-k dielectrics. Since 2007, he has been working in the field of 3D integration, on innovative technologies, processes, and materials for 3D integration, and on performance assessment for photonics and image sensors.

Alexandre Arriordaz received the master's degree in electronics from the University de Nice-Sophia-Antipolis, France. He is a Senior Product Engineering Manager for Calibre design solutions at Mentor, a Siemens Business, Montbonnot, France. He is leading product management and software development teams located in Grenoble, France, focusing on circuit reliability and verification product lines. In parallel to this activity, he is also a technical interface for various European projects dealing with research and development topics, such as 3D-IC or silicon photonics. Prior to joining Mentor, he was a Full-Custom Design Engineer at Freescale Semiconductor (now NXP, Grenoble), working on advanced testchip/SRAM compiler developments.

Séverine Chéramy (Member, IEEE) received the Engineering degree from Polytech Orleans, Orleans, France, in 1998, having specialized in materials science. She spent over eight years at GEMALTO, Aix en Provence, France, a leading smart-card company developing technologies for secure solutions, such as contactless smart cards and electronic passports. In 2008, she joined CEA-Leti, Grenoble, France, as a 3D Project Leader and then as the 3D Integration Laboratory Manager. This group develops technology and integration for 3D-IC, in strong relationship with 3D design, modeling, and simulation teams. Since January 2017, she has been responsible for 3D-IC integration strategy and related business development. She is also the Director of the 3D program of the Institute of Technological Research (IRT) Nanoelec.

Fabien Clermidy (Member, IEEE) received the Ph.D. and Thesis Supervisor degrees from INPG, Grenoble, France, in 1999 and 2010, respectively. In 2000, he joined CEA-LIST, Paris, France, where he was involved in the design of an application-specific parallel computer. In 2003, he joined CEA-LETI, Grenoble, in the digital circuit laboratory, where he led the design of various large many-core circuits. He is currently the Head of the Digital System and Circuit Division, CEA-LIST, a CEA institute. He has published more than 80 articles in international conferences. He holds 14 patents. His research interests cover a wide scope of digital systems: many-core architecture, network-on-chip, energy-efficient design, embedded systems, and interaction with advanced technologies.

