NVIDIA GPU Computing - A Journey From PC Gaming To Deep Learning
AMAZING GAMES ANYWHERE
AAA titles delivered at 1080p, 60 fps
https://www.nvidia.com/en-us/geforce/products/geforce-now/mac-pc/
GPU COMPUTING
2017: TESLA VOLTA V100
21B transistors
815 mm2
80 SM
5120 CUDA Cores
640 Tensor Cores
16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs
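The headline rates quoted elsewhere in this deck follow from these unit counts. A back-of-envelope sketch, assuming a boost clock of about 1.53 GHz (the clock is not stated on this slide) and counting 2 FLOPs per fused multiply-add:

```python
# Back-of-envelope check of V100 peak rates from the unit counts above.
CLOCK_HZ = 1.53e9      # assumed boost clock; not stated on the slide
CUDA_CORES = 5120
TENSOR_CORES = 640

# Each CUDA core retires one FP32 FMA (2 FLOPs) per clock.
fp32_tflops = CUDA_CORES * 2 * CLOCK_HZ / 1e12

# Each Tensor Core performs a 4x4x4 matrix multiply-accumulate per clock:
# 64 FMAs, i.e. 128 FLOPs.
tensor_tflops = TENSOR_CORES * 64 * 2 * CLOCK_HZ / 1e12

print(f"~{fp32_tflops:.1f} TFLOPS FP32")      # ~15.7
print(f"~{tensor_tflops:.1f} TFLOPS tensor")  # ~125.3
```

These line up with the 15 TFLOPS FP32 and roughly 120 TFLOPS deep-learning figures quoted in the performance comparison later in the deck.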
V100 SPECIFICATIONS
HOW DID WE GET HERE?
NVIDIA GPUS: 1999 TO NOW
https://youtu.be/I25dLTIPREA
SOUL OF THE GRAPHICS PROCESSING UNIT
GPU: Changes Everything
• Goal: approach the image quality of movie-studio offline rendering farms, but in real time
• Instead of hours per frame, render at more than 60 frames per second
• Use large arrays of floating-point units to exploit wide and deep parallelism
CLASSIC GEFORCE GPUS
GEFORCE 6 AND 7 SERIES
2004-2006
THE LIFE OF A TRIANGLE IN A GPU
Classic Edition
• Host / Front End / Vertex Fetch: process commands, convert to FP
• Vertex Processing: transform vertices to screen-space
• Primitive Assembly, Setup: generate per-triangle equations
• Pixel Shader / Register Combiners: 24 pixel shaders, 16 pixel engines, backed by the L2 texture cache
• Four memory partitions
G80
GeForce 8800 released 2006
• 681M transistors
• 470 mm² in 90nm
• First to support the Microsoft DirectX 10 API
• Invested a little extra (epsilon) hardware in the SM to also support general-purpose throughput computing
• Beginning of CUDA everywhere
• 155W
BEGINNING OF GPU COMPUTING
Throughput Computing
• Latency-oriented: fewer, bigger cores with out-of-order, speculative execution
• Throughput-oriented: lots of simple compute cores and hardware scheduling
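The trade-off can be made concrete with a toy model (the core counts and per-task times below are made up for illustration): a handful of fast latency-oriented cores versus thousands of slow throughput-oriented ones.

```python
import math

def finish_time(num_tasks, num_cores, secs_per_task):
    """Total time if each core runs one task at a time, in waves."""
    return math.ceil(num_tasks / num_cores) * secs_per_task

# Hypothetical: 8 fast cores (1 s/task) vs. 5120 slow cores (10 s/task).
cpu_time = finish_time(10_000, num_cores=8, secs_per_task=1.0)
gpu_time = finish_time(10_000, num_cores=5120, secs_per_task=10.0)

print(cpu_time)  # 1250.0 s for the whole batch
print(gpu_time)  # 20.0 s for the batch, though any single task takes 10x longer
```

The throughput design loses on any individual task but finishes the batch far sooner, which is exactly the regime graphics and deep learning live in.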
FROM FERMI TO PASCAL
FERMI GF100
Tesla C2070 released 2011
• 3B transistors
• 529 mm2 in 40nm
• 1150 MHz SM clock
• 3rd-generation SM, each with configurable L1/shared memory
• IEEE 754-2008 FMA
• 1030 GFLOPS fp32, 515 GFLOPS fp64
• 247W
KEPLER GK110
Tesla K40 released 2013
• 7.1B transistors
• 550 mm2 in 28nm
• Intense focus on power efficiency, operating at lower frequency
• 2880 CUDA cores at 810 MHz
PASCAL GP100
released 2016
• 15.3B transistors
• 610 mm2 in 16ff
• 10.6 TFLOPS fp32, 5.3 TFLOPS fp64
• 21 TFLOPS fp16 for Deep Learning training and inference acceleration
• New high-bandwidth NVLink GPU interconnect
• HBM2 stacked memory
• 300W
MAJOR ADVANCES IN PASCAL
[Chart: teraflops (FP32/FP16) and memory bandwidth (GB/s) for K40, M40, and P100; P100 delivers roughly 3x the throughput and bandwidth of the prior generation]
https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
https://youtu.be/2c2vN736V60
FINAL FANTASY XV PREVIEW DEMO WITH GEFORCE GTX 1080 TI
https://www.geforce.com/whats-new/articles/final-fantasy-xv-windows-edition-4k-trailer-nvidia-gameworks-enhancements
https://youtu.be/h0o3fctwXw0
2017: VOLTA
TESLA V100
• Volta architecture
• New SM core (L1 I$)
• Improved NVLink & HBM2
• Independent thread scheduling
• Tensor Cores
More V100 features: 2x L2 atomics, int8, new memory model, copy engine page migration, MPS acceleration, and more …
The Fastest and Most Productive GPU for Deep Learning and HPC
GPU PERFORMANCE COMPARISON
P100 V100 Ratio
DL Training 10 TFLOPS 120 TFLOPS 12x
DL Inferencing 21 TFLOPS 120 TFLOPS 6x
FP64/FP32 5/10 TFLOPS 7.5/15 TFLOPS 1.5x
HBM2 Bandwidth 720 GB/s 900 GB/s 1.2x
STREAM Triad Perf 557 GB/s 855 GB/s 1.5x
NVLink Bandwidth 160 GB/s 300 GB/s 1.9x
L2 Cache 4 MB 6 MB 1.5x
L1 Caches 1.3 MB 10 MB 7.7x
TENSOR CORE
TENSOR CORE
Mixed-precision matrix math on 4x4 matrices: D = AB + C
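A NumPy sketch of these semantics (a software emulation, not the hardware): FP16 inputs, with the products and accumulation carried in FP32.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((4, 4)).astype(np.float16)  # FP16 storage/input
B = rng.standard_normal((4, 4)).astype(np.float16)
C = rng.standard_normal((4, 4)).astype(np.float32)  # FP32 accumulator

# Tensor-Core-style: promote the FP16 inputs, multiply and accumulate in FP32.
D = A.astype(np.float32) @ B.astype(np.float32) + C

print(D.shape, D.dtype)  # (4, 4) float32
```

Keeping the accumulator in FP32 is what lets training tolerate FP16 storage without the sums drifting.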
VOLTA TENSOR OPERATION
• FP16 storage/input
• Full-precision product
• Sum with FP32 accumulator, together with more products
• Convert to FP32 result

NVLINK POWER SAVINGS
Reduce the number of active lanes for a lightly loaded link
NVLINK NODES
HPC – P9 CORAL NODE – SUMMIT
[Diagram: two P9 CPUs, each connected to four V100 GPUs over NVLink]
NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache
Directed testing: shared vs. global-in-cache
Cache vs. shared:
• Easier to use
• 90%+ as good
Shared vs. cache:
• Faster atomics
• More banks
• More predictable
[Chart: average benefit — the L1 cache reaches about 70% of shared-memory performance on Pascal vs. about 93% on Volta]
GPU COMPUTING AND DEEP LEARNING
TWO FORCES DRIVING THE FUTURE OF COMPUTING
[Chart: transistor count (thousands) and single-threaded performance over time; single-threaded performance growth has slowed from 1.5x per year to 1.1x per year]
RISE OF NVIDIA GPU COMPUTING
[Chart: GPU computing performance growing 1.5x per year while single-threaded performance levels off]
DEEP LEARNING EVERYWHERE
• Internet & Cloud: image classification, speech recognition, language translation, language processing, sentiment analysis, recommendation
• Medicine & Biology: cancer cell detection, diabetic grading, drug discovery
• Media & Entertainment: video captioning, video search, real-time translation
• Security & Defense: face detection, video surveillance, satellite imagery
• Autonomous Machines: pedestrian detection, lane tracking, traffic sign recognition
DEEP NEURAL NETWORK
[Diagram: a neuron computing the weighted sum of inputs I0…In with weights w0…wn]
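The neuron forms a weighted sum of its inputs and applies an activation g; a minimal sketch in plain Python (the choice of ReLU for g is illustrative):

```python
def neuron(inputs, weights, g=lambda s: max(0.0, s)):
    """Weighted sum of inputs I0..In with weights w0..wn, passed through g (ReLU here)."""
    return g(sum(i * w for i, w in zip(inputs, weights)))

out = neuron([1.0, 2.0, 3.0], [0.5, -1.0, 1.0])
print(out)  # 0.5*1 - 1.0*2 + 1.0*3 = 1.5
```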
ANATOMY OF A FULLY CONNECTED LAYER
Lots of dot products
x₁ = g(v_x1 · z)
COMBINE THE DOT PRODUCTS
What if we assemble the weights into a matrix?
x = g(Vz): every output's dot product computed in a single matrix-vector multiply
Unfortunately …
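Stacking each output's weight vector as a row of V makes the whole layer one matrix-vector product; a small NumPy sketch (shapes and values are illustrative):

```python
import numpy as np

z = np.array([1.0, 2.0, 3.0])            # layer input
V = np.array([[0.5, -1.0, 1.0],          # row k = weight vector v_xk
              [1.0,  0.0, 0.5]])
g = np.tanh                               # some activation g

x_loop = np.array([g(V[k] @ z) for k in range(len(V))])  # one dot product per output
x_mat = g(V @ z)                                          # single matrix-vector multiply

print(np.allclose(x_loop, x_mat))  # True
```

The matrix form is what maps the layer onto the GPU's wide arrays of floating-point units.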
GPU DEEP LEARNING — A NEW COMPUTING MODEL
AI IMPROVING AT AMAZING RATES
[Charts: speech recognition accuracy and ImageNet classification accuracy improving rapidly year over year]
AI BREAKTHROUGHS
Recent breakthroughs:
• 2015, Microsoft ResNet: "superhuman" image recognition (7 ExaFLOPS, 60 million parameters)
• 2016, Baidu Deep Speech 2: conversational speech recognition (20 ExaFLOPS, 300 million parameters)
• 2017, Google NMT
NVIDIA DNN ACCELERATION
A COMPLETE DEEP LEARNING PLATFORM
MANAGE -> TRAIN -> DEPLOY
• Manage / augment data and run test/train cycles with DIGITS (prototxt model definitions)
• Deploy with TensorRT to the data center, embedded platforms, and automotive
DNN TRAINING
NVIDIA DGX SYSTEMS
Built for Leading AI Research
https://www.nvidia.com/en-us/data-center/dgx-systems/
https://youtu.be/8xYz46h3MJ0
NVIDIA DGX STATION
PERSONAL DGX
NVIDIA DGX-1 WITH TESLA V100
ESSENTIAL INSTRUMENT OF AI RESEARCH
DNN TRAINING WITH DGX-1
Iterate and Innovate Faster
DNN INFERENCE
TensorRT
• High-performance framework that makes it easy to develop GPU-accelerated inference
• Production deployment solution for deep learning inference
• Optimized inference for a given trained neural network and target GPU
• Supports deployment of fp32, fp16, and int8* inference
TensorRT for Data Center: image classification, object detection, image segmentation
TensorRT for Automotive (NVIDIA DRIVE PX 2): pedestrian detection, lane tracking, traffic sign recognition
* int8 support will be available from v2
TensorRT
Optimizations
NVIDIA TENSORRT
Programmable Inference Accelerator
[Diagram: graph optimization — 1x1 conv, batch norm, and ReLU nodes fused; concat outputs feed the next input directly]
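One fusion of this kind is folding a batch-norm layer into the preceding convolution's weights ahead of time. A simplified NumPy sketch, treating a 1x1 convolution as a matrix multiply (the names and shapes are illustrative, not TensorRT's API):

```python
import numpy as np

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 16))                  # 1x1 conv: 16 channels in, 8 out
b = rng.standard_normal(8)
gamma, beta = rng.standard_normal(8), rng.standard_normal(8)
mean, var, eps = rng.standard_normal(8), rng.random(8) + 0.1, 1e-5

def conv_then_bn(x):
    """Unfused reference: convolution followed by batch normalization."""
    y = W @ x + b
    return gamma * (y - mean) / np.sqrt(var + eps) + beta

# Fold BN into the conv once, offline: scale the weights, rebuild the bias.
scale = gamma / np.sqrt(var + eps)
W_fused = scale[:, None] * W
b_fused = scale * (b - mean) + beta

x = rng.standard_normal(16)
print(np.allclose(conv_then_bn(x), W_fused @ x + b_fused))  # True
```

At inference time the fused layer does one matrix multiply instead of two passes over the activations, which is the point of this class of optimization.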
AUTONOMOUS VEHICLE TECHNOLOGY
AI IS THE SOLUTION TO SELF-DRIVING CARS
PARKER
Next-Generation System-on-Chip
• NVIDIA's next-generation Pascal graphics architecture: 1.5 teraflops
• ARM v8 CPU complex (2x Denver 2 + 4x A57), coherent HMP
• Functional safety for automotive applications (safety engine)
• Display engines, 128-bit LPDDR4, boot and PM processor, image processor (ISP), GigE Ethernet MAC, I/O
DRIVE PX 2 COMPUTE COMPLEXES
2 Complete AI Systems
• Pascal discrete GPU: 1,280 CUDA cores, 4 GB GDDR5 RAM
• Infineon AURIX safety microprocessor (ASIL D)
NVIDIA DRIVE PLATFORM
Level 2 -> Level 5
• DRIVE PX Xavier: 100 TOPS, Level 4/5
• DRIVE PX 2 Parker: 10 TOPS, Level 2/3
• 1 TOPS
http://nvdla.org/
NVIDIA DRIVE
END TO END SELF-DRIVING CAR PLATFORM
• MAPPING
• KALDI
• LOCALIZATION
• DRIVENET
• PILOTNET
DRIVING AND IMAGING
CURRENT DRIVER ASSIST
[Diagram: separate FPGA, CPU, and CV ASIC blocks processing sensor input into a BRAKE command]
FUTURE AUTONOMOUS DRIVING SYSTEM
[Diagram: a single DNN producing STEER and ACCELERATE commands]
NVIDIA BB8 AI CAR — LEARNING BY EXAMPLE
BB8 SELF-DRIVING CAR DEMO
https://blogs.nvidia.com/blog/2017/01/04/bb8-ces/
https://youtu.be/fmVWLr0X1Sk
WORKING @ NVIDIA
OUR CULTURE
A LEARNING MACHINE
INNOVATION
“willingness to take risks”
ONE TEAM
“what’s best for the company”
INTELLECTUAL HONESTY
“admit mistakes, no ego”
EXCELLENCE
“hold ourselves to the highest standards”
11,000 employees — Tackling challenges that matter
A GREAT PLACE TO WORK Top 50 “Best Places to Work” — Glassdoor
#1 of the “50 Smartest Companies” — MIT Tech Review
JOIN THE NVIDIA TEAM: INTERNS AND NEW GRADS
We’re hiring interns and new college grads. Come join the industry leader
in virtual reality, artificial intelligence, self-driving cars, and gaming.
Learn more at: www.nvidia.com/university
THANK YOU