INTACT: A 96-Core Processor With Six Chiplets 3D-Stacked on an Active Interposer With Distributed Interconnects and Integrated Power Management
Abstract: In the context of high-performance computing, the integration of more computing capabilities with generic cores or dedicated accelerators for artificial intelligence (AI) applications is raising more and more challenges. Due to the increasing costs of advanced nodes and the difficulties of shrinking analog and circuit input output signals (IOs), alternative architecture solutions to single die are becoming mainstream. Chiplet-based systems using 3D technologies enable modular and scalable architecture and technology partitioning. Nevertheless, there are still limitations due to chiplet integration on passive interposers (silicon or organic). In this article we present the first CMOS active interposer, integrating: 1) power management without any external components; 2) distributed interconnects enabling any chiplet-to-chiplet communication; and 3) system infrastructure, design-for-test, and circuit IOs. The INTACT circuit prototype integrates six chiplets in FDSOI 28-nm technology, which are 3D-stacked onto this active interposer in 65-nm process, offering a total of 96 computing cores. Full scalability of the computing system is achieved using an innovative scalable cache-coherent memory hierarchy, enabled by distributed network-on-chips, with 3-Tbit/s/mm² high bandwidth 3D-plug interfaces using 20-µm pitch micro-bumps, 0.6-ns/mm low latency asynchronous interconnects, while the six chiplets are locally power-supplied with 156-mW/mm² at 82%-peak-efficiency dc-dc converters through the active interposer. Thermal dissipation is studied showing the feasibility of such an approach.

Index Terms: 3D technology, active interposer, chiplet, network-on-chip (NoC), power management, thermal dissipation.

Manuscript received June 11, 2020; revised September 17, 2020 and October 27, 2020; accepted November 2, 2020. Date of current version December 24, 2020. This paper was approved by Guest Editor Dejan Markovic. This work was supported in part by the French National Program Programme d’Investissements d’Avenir, IRT Nanoelec under Grant ANR-10-AIRT-05, in part by the SHARP CA109 CATRENE Project, in part by the MASTER3D CT312 CATRENE Project, and in part by the Hubeo+ CARNOT Project. (Corresponding author: Pascal Vivet.)
Pascal Vivet, Eric Guthmuller, Yvain Thonnart, Gael Pillonnet, César Fuguet, Ivan Miro-Panades, Guillaume Moritz, Jean Durupt, Sébastien Thuries, David Coriat, Michel Harrand, Denis Dutoit, Didier Lattard, Lucile Arnaud, Jean Charbonnier, Perceval Coudrain, Arnaud Garnier, Frédéric Berger, Alain Gueugnot, Séverine Chéramy, and Fabien Clermidy are with CEA, University Grenoble Alpes, 38054 Grenoble, France (e-mail: [email protected]).
Christian Bernard and Didier Varreau, retired, were with CEA Grenoble, 38054 Grenoble, France.
Julian Pontes was with CEA Grenoble, 38054 Grenoble, France. He is now with ARM, Sheffield S1 4LW, U.K.
Alain Greiner and Quentin L. Meunier are with LIP6 Lab, University Paris Sorbonne, 75252 Paris, France.
Alexis Farcy is with STMicroelectronics, 38920 Crolles, France.
Alexandre Arriordaz is with Mentor, A Siemens Business, 38330 Montbonnot, France.
Color versions of one or more figures in this article are available at https://doi.org/10.1109/JSSC.2020.3036341.
Digital Object Identifier 10.1109/JSSC.2020.3036341

I. INTRODUCTION

In the context of high-performance computing (HPC) and big-data applications, the quest for performance requires modular, scalable, energy-efficient, low-cost many-core systems. To address the demanding needs for computing power, system architects are continuously integrating more cores, more accelerators and more memory in a given power envelope [1]. It appears that similar needs and constraints are emerging for the embedded HPC domain, in transport applications for instance with autonomous driving, avionics, and so on.

All these application domains require highly optimized and energy-efficient functions: generic ones such as cores, GPUs, embedded FPGAs, dense and fast memories, and also more specialized ones, such as machine learning and neuro-accelerators to efficiently implement the greedy computing demand of Big Data and artificial intelligence (AI) applications.

Circuit and system designers are in need of a more affordable, scalable, and efficient way of integrating those heterogeneous functions, to allow more reuse, at circuit level, while focusing on the right innovations in a sustainable manner. Due to the slowdown of advanced CMOS technologies (7 nm and below), with yield issues, design, and mask costs,
0018-9200 © 2020 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See https://www.ieee.org/publications/rights/index.html for more information.
Authorized licensed use limited to: HANGZHOU DIANZI UNIVERSITY. Downloaded on September 22,2024 at 09:50:25 UTC from IEEE Xplore. Restrictions apply.
80 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 1, JANUARY 2021
VIVET et al.: INTACT: A 96-CORE PROCESSOR WITH SIX CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER
Fig. 6. Chiplet and active interposer floorplans, details of the 3D-plug μ-bumps, final 3D integration and package.
TABLE III
3D-PLUG TYPES AND USAGE IN INTACT
Fig. 12. SCVR experimental results. (a) Power efficiency versus voltage conversion ratio and gearbox configurations. (b) Efficiency over output current. (c) Efficiency versus input voltage at 2:1 ratio. (d) Load transient.

Fig. 13. 3D-Plug physical and logical interface overview.

TABLE II
COMPARISON WITH COMPARABLE SCVR USING MOS OR MIM CAPACITOR TECHNOLOGY

E. Discussion
Although the power efficiency obtained by the integrated VR is lower than that of external dc-dc converters, the overall power efficiency of the computing system could still improve, because the integrated VR allows fine-grain DVFS without increasing the bill of materials (BoM) or the number of IOs. The power density is smaller than previously published results, but the converters are fully integrated within the active interposer, not on the same die, thus reducing the cost impact of the proposed active power mesh. The interposer integration opens the opportunity for dedicated post-process high-density capacitors (e.g., deep trench capacitors) connected through TSV. We also prove the up-scaling capability of SCVR by fabricating the largest die area SCVR with a built-in capacitor fault-mitigation scheme.
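The trade-off discussed above can be illustrated with a back-of-the-envelope model. The sketch below is only an illustration under stated assumptions: dynamic power is taken to scale as V²·f, the 90% external converter efficiency and the per-chiplet operating points are hypothetical values chosen for the example, and only the 82% peak efficiency and the six-chiplet count come from this article.

```python
# Illustrative model only: relative dynamic power ~ V^2 * f (arbitrary units).
SCVR_EFFICIENCY = 0.82       # peak efficiency reported for the integrated SCVR
EXTERNAL_EFFICIENCY = 0.90   # hypothetical external dc-dc converter

# Hypothetical (supply voltage in V, clock in MHz) needed by each of six chiplets.
OPERATING_POINTS = [(1.0, 900), (0.7, 400), (0.7, 400),
                    (0.6, 250), (0.6, 250), (0.6, 250)]


def load_power(voltage, freq_mhz):
    """Relative dynamic power drawn by one chiplet."""
    return voltage ** 2 * freq_mhz


def shared_rail_input_power(points, efficiency):
    """One external rail: every chiplet runs at the highest required voltage."""
    v_max = max(v for v, _ in points)
    return sum(load_power(v_max, f) for _, f in points) / efficiency


def per_chiplet_input_power(points, efficiency):
    """Per-chiplet DVFS: each chiplet runs at its own voltage."""
    return sum(load_power(v, f) for v, f in points) / efficiency


if __name__ == "__main__":
    shared = shared_rail_input_power(OPERATING_POINTS, EXTERNAL_EFFICIENCY)
    local = per_chiplet_input_power(OPERATING_POINTS, SCVR_EFFICIENCY)
    print(f"shared rail: {shared:.0f}, per-chiplet DVFS: {local:.0f} (arbitrary units)")
```

With these hypothetical operating points the per-chiplet scheme draws less input power despite its lower converter efficiency. If all chiplets needed the same voltage, the external converter would win, which is why the gain is stated as a possibility rather than a guarantee.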
Fig. 14. (a) Synchronous 3D-Plug micro-architecture. (b) Comparison to state of the art.
the other parts of the 3D-Plug (their logical interface and DFT)
have been designed using automated place and route.
In order to build the system level interconnects of INTACT,
different kinds of 3D-Plug have been designed, as presented
in Table III.
Due to the different natures of the interconnects, in terms
of traffic and distance/connectivity, two different kinds of
3D-Plugs have been designed, and compared in detail: one
using synchronous design, as presented in Section VI-B, and
one using asynchronous design, as presented in Section VI-C.
Fig. 15. Synchronous 3D-Plug max data rate for 2.5D passive links.
B. 3D-Plug Synchronous Version
The microarchitecture of the source synchronous 3D-Plugs used for 2.5D passive (N1 NoC) and 3D face-to-face links (N3 NoC) is shown in Fig. 14(a). Implemented as a standard synthesizable digital design, 3D-Plugs provide multiple virtual channels (VCs), the number of which is configured at design time. They use credit-based control flow and clock forwarding. The 3D-Plug control logic operates at a higher frequency than the NoCs to reduce contention due to VC multiplexing. Delay lines and polarity selectors are used to skew TX clock for RX data sampling (CLK_TX_1) and TX credit sampling (CLK_TX_2).

When attached to the 3D vertical active link, the 3D-Plug achieves 3-Tb/s/mm² bandwidth density, 1.9× higher than [5]. 2.5D passive links reach a 12% higher bandwidth cross section than [5] as shown in Fig. 14(b). The aggregate synchronous 3D/2.5D links bandwidth is 527 GB/s.

We performed a frequency, logic voltage, and clock phase sweep on synchronous 2.5D/3D links. All 2.5D passive links were able to reach at least 1.25 Gb/s/pin in the [0.95 V-1.2 V] VDD range and the best link shown in Fig. 15 was able to reach this bandwidth at 1 V, while reaching more than 1.6 Gb/s/pin at 1.2 V. We obtained best results with a 180° CLK_TX_1 phase and varying CLK_TX_2 phase depending on frequency. While much shorter than passive links, 3D vertical links achieve slightly lower data rates of 1.21 Gb/s/pin upward and 1.23 Gb/s/pin downward as one side of these 3D-Plugs is implemented in the more mature and slower 65-nm technology of the interposer.

C. 3D-Plug Asynchronous Version
For its inherent robustness to any source of timing variations and low latency [37], asynchronous logic is well adapted for designing system level interconnects and NoC architectures in a globally asynchronous locally synchronous (GALS) scheme. In the context of 3D architectures, asynchronous logic and its local handshakes enable interfacing two different dies without any clocking issues. By using robust quasi-delay insensitive (QDI) logic, an asynchronous 3D NoC has been earlier presented in [36] but presents some 3D throughput limitations due to the four-phase handshake protocol.

For INTACT, an innovative 3D-Plug interface has been designed, to benefit from two-phase handshake protocol at the 3D interface, which reduces the penalty of 3D interface delay within the interface cycle time, and thus increases the 3D interface throughput.

Fig. 16. 3D-Plug asynchronous version overview, composed of protocol converters between the on-chip communication and the 3D interface.

As introduced in [38], the principle is as follows (Fig. 16).
- Use asynchronous two-phase protocol for 3D interface communication, to reduce 3D interface delay penalty;
- Use asynchronous four-phase protocol for on-chip communication, within the active interposer, for its inherent simplicity, low latency and performance [37];
TABLE IV
COMPARATIVE PERFORMANCE OF SYSTEM INTERCONNECTS IN INTACT
TABLE V
INTACT DFT RESULTS
Fig. 20. 3D design-for-test architecture for INTACT, overview and detailed.
3D architectures using die wrappers and elevators [42], and for testing 2.5D passive interposers [43]. A standardization initiative on 3D testability has emerged with the recent IEEE 1838 standard [44]. Nevertheless, no prior work addressed the testability of active interposers.

B. 3D DFT Architecture
Within the INTACT architecture, the test of the 3D system must address the test of all the following elements: 1) the regular standard cell-based logic; 2) all memories using BIST engines and repair; 3) the distributed 3D interconnects and IOs: 3D connections of active links and passive links, which are implemented by micro-bumps; and finally 4) the regular package IO pads for off-chip communication through the TSVs.

In order to test the active interposer and its associated chiplets, the proposed 3D DFT architecture (Fig. 20) is based on the two following main test access mechanisms (TAMs), as proposed earlier in [45].
- An IJTAG IEEE 1687 hierarchical and configurable chain, accessed by a primary JTAG TAP port, for testing all the interconnects and memories, based on the concept of "chiplet footprint."
- A Full Scan logic network using compression logic, for reduction of test time and of number of test IOs.

By using IJTAG IEEE 1687, the JTAG chain is hierarchical and fully configurable: the JTAG chain provides dynamic access to any embedded test engines. The active interposer JTAG chain is designed similar to a chain of TAPs on a PCB board. It is composed of "chiplet footprints," which provide either access to the above 3D-stacked chiplet or to the next chiplet interface, and which are chained serially. The JTAG network is used to test and control the 3D active links, the 3D passive links, the off-chip interfaces, and the embedded test engines, such as the memory BISTs.

The Full Scan logic network offers efficient and parallel full scan test of the whole 3D system logic. In order to reduce the number of 3D parallel ports, compression logic is used in both the chiplets and the active interposer, with a classical tradeoff (shift time/pin count). Independent scan paths are used between the chiplets and the active interposer, to facilitate the test architecture integration.

C. Test CAD Flow and Test Coverage
The proposed 3D DFT architecture has been designed and inserted using Tessent tools from Mentor, a Siemens Business, Montbonnot, France. With IJTAG and IEEE 1687, high-level languages such as "Instrument Connectivity Language" (ICL) and "Procedural Description Language" (PDL) are available and make it possible to handle the complexity of such a system. In particular, it is possible to fully automate the test pattern generation of Memory BIST engines, from ATPG at chiplet level to ATPG of the same patterns within the full 3D system, enabling so-called test pattern retargeting. As presented in Table V, full testability is achieved for all logic, 3D interconnects and regular package IOs, and memory BIST engines, before 3D assembly and after 3D assembly.

Using the proposed DFT architecture and test patterns, the full system was tested using an automated test equipment (ATE).
- The 28-nm chiplet has been tested at wafer level using a dedicated probe card, with a binning strategy.
- The active interposer has not been tested at wafer level, supposing the maturity of the 65-nm technology and its high yield due to its low complexity (see Section III-B). Nevertheless, its standalone DFT and dedicated IOs were initially planned and designed as mentioned above.
- The full INTACT circuit, after 3D assembly and packaging, has been tested within a dedicated package socket.

IX. THERMAL CHALLENGES AND STRATEGY

A. Thermal Challenges
In 3D technology, thermal dissipation is a challenge that needs to be properly addressed. Due to more integration in a smaller volume, a larger power density is usually observed in 3D, while the thermal dissipation itself is getting more complex in the overall 3D stack of the circuit and package, overall leading to thermal hotspots or even thermal runaway [52].

In the generic context of logic-on-logic stacking, thermal dissipation is worse because multiple layers of compute dies need to dissipate their heat on top of themselves. On the contrary, in the case of interposer-based systems, a single layer of chiplets is dissipating heat, while heat extraction can be performed from the top package face, similar to a regular flip-chip packaged circuit. Nevertheless, contrary to passive interposers, in the case of an active interposer, the bottom layer is also part of the power budget, and dissipates heat as well. Since the power budget of the active interposer layer is rather limited, with most power budget within the chiplets, this should help the overall thermal dissipation.

Finally, due to the heterogeneous structure of such a 3D stack, many materials compose the device, with silicon substrate, back-end of line (BEOL) in copper, underfill composite materials between the chiplets and interposer, micro-bumps (SnAg), TSVs (copper), and so on. This assembly leads to strong anisotropic thermal effects, favoring and increasing thermal hotspot effects. Moreover, due to the thin layer effect of the interposer (100 μm), the horizontal thermal dissipation is reduced in the interposer, while it remains mostly the vertical thermal dissipation through the chiplets. These various thermal effects have been widely studied in the literature [53], [54], and need to be taken into account in the full system.

B. Thermal Modeling Strategy
With all the 3D thermal challenges: increased power density and design complexity on the design side, fine grain material effects on the technology side, coupled to the regular package and board thermal information, an accurate thermal exploration must be performed with the adequate thermal methodology. Various thermal tools are available: either circuit level tools able to cope with detailed circuit and technology description but with simple packaging condition, or package level tools able to cope with detailed packaging, but with reduced die and technology information. In order to achieve an accurate thermal exploration covering all modeling aspects, the Project Sahara solution, a thermal analysis prototype from Mentor Graphics, a Siemens Business, was selected [55].

As presented in Fig. 21, an adequate thermal methodology has been set up to allow modeling of low level structures (TSV, micro-bumps, underfill), with a design entry at GDS level and with accurate static or dynamic power maps, all this in the context of the full system (package and fan). The methodology has been qualified on a previous 3D logic-on-logic design with silicon thermal measurements 16 [36]. More details of the thermal methodology can be found in [56].

Fig. 21. 3D chip-package thermal flow, from early exploration to sign-off.

Fig. 22. INTACT circuit and package cross section used for thermal modeling.

Fig. 23. (a) Package temperature (without heat sink). Peak temperature = ∼150 °C. (b) Package temperature (with heat sink and fan). Peak temperature = 53 °C. (c) Chiplet thermal map. Peak temperature = ∼53 °C.

C. Thermal Simulation Results
The INTACT circuit and package have been modeled, as presented in Fig. 22 with a detailed cross section of the 3D circuit. In terms of power budget, a scenario with a maximum static power budget of 28 W is simulated, corresponding to a worst case situation of 3 W per chiplet (×6) and 10 W in the active interposer, while the nominal circuit power budget is 17 W as presented in Section X-B.

As a result, Fig. 23 shows the thermal exploration, without heat sink (max temperature 150 °C) and with a regular heat sink and fan (max temperature 53 °C), while no hotspots appear within the computing chiplet. Even for this worst case scenario, due to a still limited power density of 0.14 W/mm², the thermal dissipation of the active interposer can be achieved using a regular heat sink and fan.
Fig. 25. (a) Maximum core frequency. (b) Power consumption at Fmax [FBB = (0,1)]. (c) Power efficiency at Vmin.
A. Circuit Performances
The chiplet is functional in the 0.5–1.3-V range with Forward Body Biasing (FBB) [46] up to ±2 V. Fig. 25(a) shows that a core frequency of 1.15 GHz is achieved at 1.3 V with 0/+1 (VDDS/GNDS) FBB. Single core performance is 2.47 Coremark/MHz and 1.23 DMIPS/MHz. At chip level, maximum energy efficiency is 9.6 GOPs/W on Coremark benchmarks (IPC = 0.8/core) at 0.6 V taking into account voltage regulation losses in the interposer as shown in Fig. 25(c). As expected, FBB boosts performance: in typical at 0.9 V, a frequency increase of 24% is achieved with −1/+1 FBB, while in typical at 680 MHz, an energy efficiency increase of 15% is achieved with asymmetric 0/+1 FBB.

B. Circuit Power Budget and Energy Efficiency
In Fig. 25(b) and (c), we show overall chip power and performance measurements with a 0/+1 FBB. Power consumption and energy efficiency while running the Coremark benchmark are compared to a theoretical system using a digital LDO instead of the proposed fully integrated SCVR. Using an LDO at the same VIN = 2.5 V would result in a 2× increase in power consumption; a lower VIN would be needed to limit losses, at the expense of more power pins and voltage-drop issues.

The power breakdown in Fig. 26 shows the low power budget of the active interposer with only 3% of total power consumed by the active interposer logic. The cores+L1$ represent over half the power consumption of the chiplets, themselves consuming the majority of the measured circuit power (17 W).

C. Circuit Scalability
Lastly, Fig. 27 shows the scalability of the cache-coherent architecture that is analyzed by running a 4 Mpixels image filtering application from 1 to 512 cores. The filter is composed of a 1-D convolution, followed by a transposition of the image and ends with another 1-D convolution. Software synchronization barriers separate these steps and the transposition, in particular, involves many memory transfers.

Results for more than 96 cores were obtained by RTL simulation with additional chiplets. Software is executed on a single cluster up to four cores and on a single chiplet up to 16 cores. Compared to a single core execution, a 67× execution-time speedup is obtained with 96 cores and 340× with 512 cores. The slight uptick above 128 cores results from the threshold where the data set fits in caches. This quasi-linear speedup, ignoring limitations of the external memory bandwidth, shows the scalability of network protocols and their 3D implementations.

D. Comparison to Prior Art
Compared to prior art (Table VI), the INTACT circuit is the first CMOS active interposer validated on silicon, which offers a chiplet-based many-core architecture for HPC. The active interposer solution allows for integrated VRs without any external passives, using free die area available in the active interposer, offering DVFS-per-chiplet and achieving 156 mW/mm² at 82% peak power efficiency, with 10%–50% more efficiency with respect to LDO converters integrated in organic schemes. The SCVR is also fault tolerant to mitigate the effect of defective unit cells on the overall power efficiency.

Regarding interconnects, contrary to previous point-to-point solutions, the active interposer offers flexible and distributed NoC meshes enabling any chiplet-to-chiplet communication for scalable cache-coherency traffic, with 0.6-ns/mm inter-chiplet latency using asynchronous signaling within the interposer, and a 0.59-pJ/bit synchronous 3D-Plug energy efficiency with 3-Tb/s/mm² bandwidth density, which is twice better than previous circuits.

The overall system integrates a total of 96 cores, in six chiplets, offering a peak computing power of 220 GOPs (peak mult-acc), which is quite comparable to advanced state of the art processor systems. Finally, the overall distributed interconnects and cache coherency memory architecture are scalable up to 896 cores, showing the architecture partitioning capability to target larger computing scale.

TABLE VI
STATE OF THE ART COMPARISON

XI. CONCLUSION
The presented active interposer leverages the 3D integration benefits by offering a baseline of functionalities such as voltage delivery, chiplet-to-chiplet communications, IOs, shared by most of computing assemblies. The active interposer allows a flexible assembly with common functionalities while maintaining the yield management benefits. For this reduced power density and budget, thermal dissipation is not an issue within the active interposer, as for a regular passive interposer. 3D integration and active interposer open the way toward efficient integration of large-scale chiplet-based computing systems. Such a scheme can be applied for integration of similar chiplets as presented in this article, but also for smooth integration of heterogeneous computing chiplets [47].

ACKNOWLEDGMENT
The authors would like to thank STMicroelectronics and Didier Campos team for INTACT package design and assembly, PRESTO Engineering and Brice Grisollet for testing the INTACT circuit onto Automatic Test Equipment, Easii-IC and Jean-Paul Goglio and his team for designing the INTACT application demonstration board. Finally, they would like to thank many other contributors from Mentor Graphics on the CAD tool teams and from CEA-LETI on both the design team and technology teams for their dedication to make this concept and circuit a successful realization.
[45] J. Durupt, P. Vivet, and J. Schloeffel, “IJTAG supported 3D DFT using chiplet-footprints for testing multi-chips active interposer system,” in Proc. 21st IEEE Eur. Test Symp. (ETS), May 2016, pp. 1–8.
[46] E. Beigne et al., “A 460 MHz at 397 mV, 2.6 GHz at 1.3 V, 32 bits VLIW DSP embedding fMAX tracking,” IEEE J. Solid-State Circuits, vol. 50, no. 1, pp. 125–136, Jan. 2015.
[47] P.-Y. Martinez et al., “ExaNoDe: Combined integration of chiplets on active interposer with bare dice in a multi-chip-module for heterogeneous and scalable high performance compute nodes,” in Proc. IEEE VLSI Conf., 2020.
[48] A. Olofsson, T. Nordstrom, and Z. Ul-Abdin, “Kickstarting high-performance energy-efficient manycore architectures with epiphany,” in Proc. 48th Asilomar Conf. Signals, Syst. Comput., Nov. 2014, pp. 1719–1726.
[49] H. Reiter, “Multi-Die IC Design Tutorial,” in Proc. 3D ASIP Conf., 2015, pp. 1–5.
[50] Calibre 3D STACK. Accessed: 2011. [Online]. Available: https://www.mentor.com/products/ic_nanometer_design/verification-signoff/physical-verification/calibre-3dstack
[51] zGLUE Inc. Accessed: 2014. [Online]. Available: www.zglue.com
[52] C. Torregiani, B. Vandevelde, H. Oprins, E. Beyne, and I. D. Wolf, “Thermal analysis of hot spots in advanced 3D-stacked structures,” in Proc. 15th Int. Workshop Thermal Invest. ICs Syst., 2009, pp. 55–60.
[53] T. R. Harris, P. Franzon, W. R. Davis, and L. Wang, “Thermal effects of heterogeneous interconnects on InP/GaN/Si diverse integrated circuits,” in Proc. Int. 3D Syst. Integr. Conf. (3DIC), Dec. 2014, pp. 1–3s.
[54] C. Santos, P. Vivet, J.-P. Colonna, P. Coudrain, and R. Reis, “Thermal performance of 3D ICs: Analysis and alternatives,” in Proc. Int. 3D Syst. Integr. Conf. (3DIC), Dec. 2014, pp. 1–7.
[55] Mentor Graphics, White Paper, “A complete guide to 3D chip-package thermal co-design, 10 key considerations,” 2017, Tech. Rep. [Online]. Available: https://www.mentor.com/products/mechanical/resources/overview/a-complete-guide-to-3d-chip-package-thermal-co-design-10-key-considerations-d8b0e79e-fb79-4c5a-992d-45d0d3b5f0ac
[56] C. Santos, P. Vivet, L. Wang, M. White, and A. Arriordaz, “Thermal exploration and sign-off analysis for advanced 3D integration,” in Proc. Design Track, DAC Conf., Jun. 2017.

Pascal Vivet (Member, IEEE) received the Ph.D. degree from Grenoble INPG, Grenoble, France, in 2001, designing an asynchronous microprocessor.
After four years with STMicroelectronics, Crolles, France, he joined CEA-Leti, Grenoble, in 2003, in the digital design lab. He was a Project Leader on 3D circuit design from 2011 to 2018. He is currently a Scientific Director of the Digital Systems and Integrated Circuits Division, CEA-LIST, a CEA institute. He has authored or coauthored more than 120 articles and holds several patents in the field of digital design. His research interests cover wide aspects of circuit and system level design, ranging from system integration, multicore architecture, network-on-chip, energy-efficient design, related CAD aspects, and in strong links with advanced technologies, such as 3D, nonvolatile-memories, and photonics.

Eric Guthmuller graduated from École Polytechnique and received the M.S. degree from Telecom Paris, Paris, France, in 2009, and the Ph.D. degree in computer science from University Pierre and Marie Curie (UPMC), Telecom Paris, in 2013.
He joined CEA-Leti, Grenoble, France, as a Full-Time Researcher in 2019, then with CEA-List. His main research interests include processor architectures and their memory hierarchy, in particular cache coherency for manycore and heterogeneous architectures.

Gael Pillonnet (Senior Member, IEEE) was born in Lyon, France, in 1981. He received the master’s degree in electrical engineering from CPE Lyon, Lyon, France, in 2004, and the Ph.D. and Habilitation degrees from INSA Lyon, Lyon, in 2007 and 2016, respectively.
Following an early experience as an Analog Designer in STMicroelectronics, Crolles, France, in 2008, he joined the Electrical Engineering Department, University of Lyon, Lyon. From 2011 to 2012, he held a visiting researcher position at the University of California at Berkeley, Berkeley, CA, USA. Since 2013, he has been a Full-Time Researcher at CEA-LETI, a major French research institution. His research focuses on energy transfers in electronic devices, such as power converters, audio amplifiers, energy-recovery logics, electromechanical transducers, and harvesting electrical interfaces.

César Fuguet received the M.S. degree in system’s engineering from the Universidad de Los Andes (ULA), Mérida, Venezuela, in 2012, and the M.S. and Ph.D. degrees in computer science from University Pierre and Marie Curie (UPMC), Paris, France, in 2012 and 2015, respectively.
Following an experience at Kalray, Grenoble, France, he is currently a Full-Time Researcher at CEA-List, Grenoble, France. His main research interests are multicore processor architectures, cache coherency, and heterogeneous architectures with accelerators for high-performance computing.

Ivan Miro-Panades (Member, IEEE) received the M.S. degree in telecommunication engineering from the Technical University of Catalonia (UPC), Barcelona, Spain, in 2002, and the M.S. and Ph.D. degrees in computer science from University Pierre and Marie Curie (UPMC), Paris, France, in 2004 and 2008, respectively.
He worked at Philips Research, Paris, and STMicroelectronics, Grenoble, France, before joining CEA, Grenoble, in 2008, where he is currently a Research Engineer in digital integrated circuits. His main research interests are artificial intelligence (AI), Internet-of-Things, low-power architectures, energy-efficient systems, and Fmax/Vmin tracking methodologies.

Guillaume Moritz was born in France in 1987. He graduated from Telecom Physique Strasbourg in 2010 with a specialization in micro- and nano-electronics and the associated master.
After finishing his internship at CEA-Leti, Grenoble, France, he joined Leti for two years. Then, as a subcontractor from ATOS specialized in physical design, he holds different positions with Leti, working on various advanced projects, including two major 3D circuits. He joined Leti in 2019, where he is currently focusing on physical implementation of image sensors.
96 IEEE JOURNAL OF SOLID-STATE CIRCUITS, VOL. 56, NO. 1, JANUARY 2021
Christian Bernard received the Engineering degree from the Grenoble Polytechnical Institute, Grenoble, France, in 1979. After four years with Thomson, Paris, France, he worked at Bull, Paris, on mainframe HW design of CPU cores, multiprocessing, and cache coherency aspects. He joined CEA-Leti, Grenoble, in 2001, in the digital design lab. He contributed to the design of large systems of the lab covering various domains: 4G mobile baseband, space mission dedicated hardware accelerators, and many-core architectures, including the integration of cache coherency in 3D many cores. He is now retired.

Didier Varreau was born in Dôle, France, in 1954. He received the Electronic Higher Technical Diploma degree from Grenoble University, France, in 1975. In 1976, he joined CEA-LETI, Grenoble, France, to develop instrumental electronic boards for medical and nuclear purposes. From 2003 to 2006, he worked on the FAUST project developing integrated synchronous IPs. Since 2006, he has been in charge of physical implementation of low-power energy-efficient accelerators, then since 2010, he has been working on large multiprocessor system-on-chip, including large 3D systems. He is now retired.

Sébastien Thuries received the master’s degree from the University of Montpellier, Montpellier, France, in 2003. He joined CEA/Leti, Grenoble, France, in 2004, as a Research Engineer. He is leading the High-Density 3D Architecture and Design Group, CEA-LETI, including fine pitch 3D stacking as well as monolithic 3D (M3D). He has worked on and led several digital ASIC developments for a set of applications, such as 4G digital baseband, complex imagers, system on chip, and mixed signal RF over the last decade. He has been a pioneer in FDSOI digital design and back biasing capability. He leads the research team on new architecture and design paradigm raised by M3D-IC in order to optimize the full system to technology fields.

David Coriat received the M.Sc. degree from the University of Science of Montpellier, France, in 2012. He subsequently joined CEA-Leti Minatec, Grenoble, France. He has worked on dynamic management of power and variability in MP-SoC architectures as well as power estimation techniques in large MP-SoC architectures. His research interests now lie in low-power architectures and design.

Michel Harrand (Member, IEEE) started his career in Matra Espace, Paris, France, in 1980, where he designed automatic pilot systems for satellites. In 1985, he joined Thomson Semiconductors, Grenoble, France, where he designed numerous integrated circuits in the microprocessor, telecommunication, and mostly image compression fields, and led a design team before being appointed as the Director of the Embedded DRAM Department in 1996. He joined CEA, Grenoble, in 2006, to prepare the creation of Kalray, Grenoble, a startup designing manycore processors, which he co-founded in 2008 as the CTO. He rejoined CEA at the end of 2012 to explore the architecture, design, and applications of new technologies such as 2.5D integrated circuits, emerging nonvolatile memories, and currently neural networks. He has served in the ISSCC TPC from 2001 to 2006, and holds more than 40 patents.

Denis Dutoit joined CEA, Grenoble, France, in 2009, after working for STMicroelectronics, Crolles, France, and STEricsson, Grenoble. In CEA, he has been involved in system-on-a-chip architecture for computing and 3D integrated circuit projects. After defining the CEA-Leti’s roadmap of technologies and solutions for advanced computing, he is now involved in European Projects in High Performance Computing as a Coordinator, Project Leader, and SoC Architect.

Lucile Arnaud joined CEA-LETI, Grenoble, France, in 1984. She first covered design and characterization of magnetic and electromagnetic passive devices. From 2007 to 2014, she was assigned at STMicroelectronics, Crolles, France, for interconnect reliability expertise of most advanced CMOS technology. Since 2014, she has been involved in 3D-IC developments in LETI for technology expertise and project management. In the last four years, she managed internal and collaborative projects for 3D interconnects development with Cu-SiO2 hybrid bonding technologies. She authored or coauthored more than 90 articles, including some invited talks and tutorials in the IEEE conferences.

Jean Charbonnier (Member, IEEE) graduated from the National School of Physics of Grenoble, Grenoble, France, in 2001 and received the Ph.D. degree in crystallography from the University Joseph Fourier, Grenoble, in 2006. He joined the 3D Wafer Level Packaging Group, CEA-Leti, Grenoble, in 2008. He has been working for more than ten years in through silicon vias, 3D interconnections, and silicon interposers technology. His research interests include high-performance computing, silicon photonics interposer, as well as cryopackaging for quantum architecture applications. He is currently in charge of coordinating the High-Density 3D Integration Group, 3D Packaging Laboratory, CEA-Leti.
VIVET et al.: INTACT: A 96-CORE PROCESSOR WITH SIX CHIPLETS 3D-STACKED ON AN ACTIVE INTERPOSER 97
Perceval Coudrain received the M.S. degree in materials sciences from the University of Nantes, Nantes, France, in 2001, and the Ph.D. degree from the Institut Supérieur de l’Aéronautique et de l’Espace, Toulouse, France, in 2009. He joined STMicroelectronics, Crolles, France, in 2002, and entered the advanced research and development group in 2005, where he was involved in the development of backside illumination and monolithic 3D integration for CMOS image sensors. For ten years, he has been focusing on 3D integration technologies, including TSV and Cu–Cu hybrid bonding, and thermal management. He moved to CEA-Leti, Grenoble, France, in 2018, where his research focuses on 3D integration, fan-out wafer level packaging, and embedded microfluidics.

Arnaud Garnier graduated from INSA de Lyon, Lyon, France, in 2004. He received the Ph.D. degree in materials science from Université St Quentin en Yvelines, St Quentin, France, in 2007, after studying the Smart Cut technology on GaN for three years in SOITEC. He then joined CEA-LETI, Grenoble, France, to work on wafer level packaging, with a specific focus on wafer bonding, chip assembly, underfilling, 3D process integration, and advanced packaging. He currently works as a Project Leader mainly on fan-out wafer level packaging technologies.

Frédéric Berger was born in Grenoble, France, in 1973. He received the B.T.S. degree from lycée Argouges, Grenoble, in 1993, in photonic optical engineering. He started his career as a Technician in the maintenance of alarm systems, then fiber optic welders for telecommunications at Siemens/Corning. More attracted by research and development, he continued his activity in the Photonics team to develop and perfect optical amplifiers. In 2003, he joined CEA, Grenoble, as a Technician in the Packaging and Assembly Laboratory. In 2005, he participated with SET in the development of the first FC300 equipment for 3D assemblies based on microtubes for infrared imagers. He used this technical background to carry out the assemblies of the six chiplets of the IntAct project.

Alain Gueugnot received the B.T.S. degree in microtechnology from Lycée Jules Richard, Paris, France, in 1989. He joined CEA-DAM, Grenoble, in 1992 and then CEA-LETI at the DOPT in 2003 to work in the joint laboratory with SOFRADIR (Lynred) in the packaging. Then, he set up means of morphological characterization and metallographic expertise of assemblies of components for infrared, lighting, imager and screen using profilometers, and ionic and mechanical cross section for SEM imaging.

Quentin L. Meunier received the Diploma degree from the Ensimag School, Grenoble, France, in 2007, and the Ph.D. degree in computer science from the Université de Grenoble, Grenoble, in 2010. Since 2011, he has been an Associate Professor at the LIP6 Laboratory, Sorbonne Université, Paris, France. His research interests include many core architectures and cache coherence, high-performance computing, and side-channel attacks and counter-measures.

Alexis Farcy graduated in electronic engineering from the “Institut des Sciences et Techniques de Grenoble,” France, in 2000, and received the Ph.D. degree in electronic, optronic and systems from the University of Savoie, Chambery, France, in 2009. He was employed by STMicroelectronics, Crolles, France. From 2000 to 2007, he was with the Advanced Interconnects and Passive Components Module, focusing on interconnect performance analysis for advanced technology nodes, integration of advanced inductors and 3D capacitors in BEOL, and high-frequency characterizations of low-k and high-k dielectrics. Since 2007, he has been in the field of 3D integration on innovative technologies, processes and materials for 3D integration, and performance assessment for photonics and image sensors.

Alexandre Arriordaz received the master’s degree in electronics from the University de Nice-Sophia-Antipolis, France. He is a Senior Product Engineering Manager for caliber design solutions at Mentor – A Siemens Business, Montbonnot, France. He is leading product management and software development teams located in Grenoble, France, focusing on circuit reliability and verification product lines. In parallel to this activity, he is also a technical interface for various European projects dealing with research and development topics, such as 3D-IC or silicon photonics. Prior to joining Mentor, he was a Full-Custom Design Engineer at Freescale Semiconductor (now NXP, Grenoble), working on advanced testchip/SRAM compiler developments.

Séverine Chéramy (Member, IEEE) received the Engineering degree from Polytech Orleans, Orleans, France, in 1998, having specialized in material science. She spent over eight years at GEMALTO, Aix en Provence, France, a leading smart-card company developing technologies for secure solutions, such as contactless smart cards and electronic passports. In 2008, she joined CEA-Leti, Grenoble, France, as a 3D Project Leader and then as a 3D Integration Laboratory Manager. This group develops technology and integration for 3DIC, in strong relationship with 3D design, model, and simulation teams. Since January 2017, she has been responsible for 3DIC integration strategy and related business development. She is also the Director of the 3D project of the Institute of Technological Research (ITR) Nanoelec.