
NVIDIA GPU COMPUTING: A JOURNEY

FROM PC GAMING TO DEEP LEARNING


Stuart Oberman | October 2017
GAMING | PRO VISUALIZATION | ENTERPRISE | DATA CENTER | AUTO

NVIDIA ACCELERATED COMPUTING

GEFORCE: PC Gaming
200M GeForce gamers worldwide
Most advanced technology
Gaming ecosystem: More than just chips
Amazing experiences & imagery
NINTENDO SWITCH: POWERED BY NVIDIA TEGRA
GEFORCE NOW: AMAZING GAMES ANYWHERE

AAA titles delivered at 1080p, 60 fps
Streamed to the SHIELD family of devices
Streaming to Mac (beta)

https://www.nvidia.com/en-us/geforce/products/geforce-now/mac-pc/

GPU COMPUTING

Drug Design: Molecular Dynamics (15x speedup)
Seismic Imaging: Reverse Time Migration (14x speedup)
Automotive Design: Computational Fluid Dynamics (30-100x speedup)
Medical Imaging: Computed Tomography
Astrophysics: n-body (20x speedup)
Options Pricing: Monte Carlo
Product Development: Finite Difference Time Domain
Weather Forecasting: Atmospheric Physics

GPU: 2017

2017: TESLA VOLTA V100

21B transistors
815 mm2

80 SM
5120 CUDA Cores
640 Tensor Cores

16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs

V100 SPECIFICATIONS

HOW DID WE GET HERE?

NVIDIA GPUS: 1999 TO NOW

https://youtu.be/I25dLTIPREA

SOUL OF THE GRAPHICS PROCESSING UNIT
GPU: Changes Everything

• Accelerate computationally intensive applications

• NVIDIA introduced the GPU in 1999
• A single-chip processor to accelerate PC gaming and 3D graphics

• Goal: approach the image quality of movie studio offline rendering farms, but in real time
• Instead of hours per frame, >60 frames per second

• Millions of pixels per frame can all be operated on in parallel


• 3D graphics is often termed embarrassingly parallel

• Use large arrays of floating point units to exploit wide and deep parallelism
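
A minimal CUDA sketch of that parallelism (illustrative only; the kernel name and toy color formula are hypothetical, not an actual shader): one thread computes one pixel, and no pixel depends on any other.

    #include <cuda_runtime.h>

    // Hypothetical per-pixel kernel: each thread shades one pixel independently.
    __global__ void shade(float* r, float* g, float* b, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        int i = y * width + x;              // linear pixel index
        r[i] = (float)x / width;            // toy color computation
        g[i] = (float)y / height;
        b[i] = 0.5f;
    }

    // Launch with a 2D grid covering the frame, e.g.:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   shade<<<grid, block>>>(r, g, b, width, height);
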
CLASSIC GEFORCE GPUS

GEFORCE 6 AND 7 SERIES
2004-2006

• Example: GeForce 7900 GTX
• 278M transistors
• 650MHz pipeline clock
• 196mm2 in 90nm
• >300 GFLOPS peak, single-precision

THE LIFE OF A TRIANGLE IN A GPU
Classic Edition

Host / Front End / Vertex Fetch: process commands, convert to FP
Vertex Processing: transform vertices to screen-space
Primitive Assembly, Setup: generate per-triangle equations
Rasterize & Zcull: generate pixels, delete pixels that cannot be seen
Pixel Shader (Texture, Register Combiners): determine the colors, transparencies, and depth of the pixel
Pixel Engines (ROP): do final hidden surface test, blend, and write out color and new depth
Frame Buffer Controller

NUMERIC REPRESENTATIONS IN A GPU

• Fixed point formats
  • u8, s8, u16, s16, s3.8, s5.10, ...

• Floating point formats
  • fp16, fp24, fp32, ...
  • Tradeoff of dynamic range vs. precision

• Block floating point formats (see the sketch below)
  • Treat multiple operands as having a common exponent
  • Allows a tradeoff of dynamic range vs. storage and computation
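
The block floating point idea can be sketched in a few lines of C++ (an illustration of the concept, not NVIDIA's hardware format; the block size of 4 and the 7-bit mantissa scaling are arbitrary choices here):

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        float block[4] = {0.50f, -1.75f, 3.25f, 0.125f};

        // One shared exponent per block, chosen from the largest magnitude.
        float maxAbs = 0.0f;
        for (float v : block) maxAbs = fmaxf(maxAbs, fabsf(v));
        int sharedExp;
        frexpf(maxAbs, &sharedExp);   // maxAbs = m * 2^sharedExp, m in [0.5, 1)

        // Quantize each value to an int8 mantissa against the shared exponent.
        int8_t mant[4];
        for (int i = 0; i < 4; ++i)
            mant[i] = (int8_t)lrintf(ldexpf(block[i], 7 - sharedExp));

        // Reconstruct: storage is one exponent plus four small mantissas.
        for (int i = 0; i < 4; ++i)
            printf("%g -> %g\n", block[i], ldexpf((float)mant[i], sharedExp - 7));
    }
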
INSIDE THE 7900GTX GPU

Host / FW / VTF: vertex fetch engine
8 vertex shaders
Cull / Clip / Setup, Z-Cull, Shader Instruction Dispatch: conversion to pixels
24 pixel shaders, with L2 texture cache
Fragment Crossbar: redistributes pixels
16 pixel engines
4 independent 64-bit memory partitions, each with its own DRAM(s)

G80: REDEFINED THE GPU

G80
GeForce 8800 released 2006

• G80 was the first GPU with a unified shader processor architecture

• Introduced the SM: Streaming Multiprocessor
• Array of simple streaming processor cores: SPs, or CUDA cores

• All shader stages use the same instruction set

• All shader stages execute on the same units

• Permits better sharing of SM hardware resources

• Recognized that building dedicated units often results in under-utilization, depending on the application workload

G80 FEATURES

• 681M transistors
• 470mm2 in 90nm
• First to support Microsoft DirectX10 API
• Invested a little extra (epsilon) HW in the SM to also support general-purpose throughput computing
• Beginning of CUDA everywhere

• SM functional units designed to run at 2x frequency, with half the number of units

• 576 GFLOPS @ 1.5GHz, IEEE 754 fp32 FADD and FMUL
• 155W

BEGINNING OF GPU COMPUTING
Throughput Computing

• Latency Oriented
• Fewer, bigger cores with out-of-order, speculative execution

• Big caches optimized for latency

• Math units are small part of the die

• Throughput Oriented
• Lots of simple compute cores and hardware scheduling

• Big register files. Caches optimized for bandwidth.

• Math units are most of the die


CUDA
Most successful environment for throughput computing

C++ for throughput computers
On-chip memory management
Asynchronous, parallel API

Programmability makes it possible to innovate
New layer type? No problem.
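
What the programming model looks like in practice, as a minimal CUDA C++ sketch (a generic SAXPY, not code from the slides): a kernel runs thousands of threads, and the launch on a stream is asynchronous with respect to the host.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // Asynchronous launch: the host continues while the GPU works.
        saxpy<<<(n + 255) / 256, 256, 0, stream>>>(n, 2.0f, x, y);
        cudaStreamSynchronize(stream);

        printf("y[0] = %f\n", y[0]);         // expect 4.0
        cudaFree(x); cudaFree(y);
        return 0;
    }
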
G80 ARCHITECTURE

FROM FERMI TO PASCAL

FERMI GF100
Tesla C2070 released 2011
• 3B transistors
• 529 mm2 in 40nm
• 1150 MHz SM clock
• 3rd generation SM, each with configurable L1 / shared memory
• IEEE 754-2008 FMA
• 1030 GFLOPS fp32, 515 GFLOPS fp64
• 247W

KEPLER GK110
Tesla K40 released 2013

• 7.1B transistors
• 550 mm2 in 28nm
• Intense focus on power efficiency, operating at lower frequency
• 2880 CUDA cores at 810 MHz

• Tradeoff of area efficiency vs. power efficiency


• 4.3 TFLOPS fp32, 1.4 TFLOPS fp64
• 235W
TITAN SUPERCOMPUTER
Oak Ridge National Laboratory

PASCAL GP100
released 2016

• 15.3B transistors
• 610 mm2 in 16ff
• 10.6 TFLOPS fp32, 5.3 TFLOPS fp64
• 21 TFLOPS fp16 for Deep Learning training and
inference acceleration
• New high-bandwidth NVLink GPU interconnect
• HBM2 stacked memory
• 300W

MAJOR ADVANCES IN PASCAL

[Charts: P100 vs. K40 and M40 in compute throughput (20 TFLOPS fp16, 10 TFLOPS fp32 on P100), GPU-GPU bandwidth, and memory bandwidth (GB/s)]

3x Compute | 5x GPU-GPU BW | 3x GPU Mem BW



GEFORCE GTX 1080TI

https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
https://youtu.be/2c2vN736V60

FINAL FANTASY XV PREVIEW DEMO WITH GEFORCE GTX 1080TI

https://www.geforce.com/whats-new/articles/final-fantasy-xv-windows-edition-4k-trailer-nvidia-gameworks-enhancements

https://youtu.be/h0o3fctwXw0

2017: VOLTA

TESLA V100: 2017

21B transistors
815 mm2 in 16ff

80 SM
5120 CUDA Cores
640 Tensor Cores

16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs

TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
New SM Core: Performance & Programmability
Independent Thread Scheduling: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning

[SM diagram: L1 instruction cache, four sub-cores, texture unit, L1 data cache & shared memory]

More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, MPS acceleration, and more ...

The Fastest and Most Productive GPU for Deep Learning and HPC

GPU PERFORMANCE COMPARISON

                     P100           V100            Ratio
DL Training          10 TFLOPS      120 TFLOPS      12x
DL Inferencing       21 TFLOPS      120 TFLOPS      6x
FP64/FP32            5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth       720 GB/s       900 GB/s        1.2x
STREAM Triad Perf    557 GB/s       855 GB/s        1.5x
NVLink Bandwidth     160 GB/s       300 GB/s        1.9x
L2 Cache             4 MB           6 MB            1.5x
L1 Caches            1.3 MB         10 MB           7.7x

TENSOR CORE

CUDA TensorOp instructions & data formats
4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Optimized for deep learning

[Diagram: activation and weight inputs feeding the array, producing output results]

TENSOR CORE
Mixed Precision Matrix Math: 4x4 matrices

D = A * B + C

A and B are FP16 4x4 matrices; C and D are 4x4 matrices in FP16 or FP32

VOLTA TENSOR OPERATION

FP16 storage/input -> full-precision products -> sum with FP32 accumulator (as more products accumulate) -> convert to FP32 result

Also supports FP16 accumulator mode for inferencing
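
In CUDA, Tensor Cores are reached through the WMMA API (CUDA 9+, sm_70+), which exposes warp-wide 16x16x16 tiles built from these 4x4 operations. A minimal sketch, assuming 16x16 row-major matrices already resident in GPU memory:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16x16 tile of D = A*B + C,
    // fp16 inputs with fp32 accumulation.
    __global__ void wmma_tile(const half* a, const half* b,
                              const float* c, float* d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc, fd;

        wmma::load_matrix_sync(fa, a, 16);                 // fp16 inputs
        wmma::load_matrix_sync(fb, b, 16);
        wmma::load_matrix_sync(fc, c, 16, wmma::mem_row_major);
        wmma::mma_sync(fd, fa, fb, fc);                    // Tensor Core math
        wmma::store_matrix_sync(d, fd, 16, wmma::mem_row_major);
    }

    // Launch with a single warp: wmma_tile<<<1, 32>>>(a, b, c, d);
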
NVLINK – PERFORMANCE AND POWER

Bandwidth: 25Gbps signaling; 6 NVLinks for GV100; 1.9x bandwidth improvement over GP100
Coherence: latency-sensitive CPU caching of GMEM; fast access in local cache hierarchy; probe filter in GPU
Power Savings: reduce number of active lanes for lightly loaded links

NVLINK NODES

HPC – P9 CORAL node (SUMMIT): two POWER9 CPUs, each NVLink-attached to three V100 GPUs
DL – Hybrid cube mesh (DGX-1 with Volta): eight V100 GPUs

NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Directed testing: shared in global

Cache vs. shared:
• Easier to use
• 90%+ as good

Shared vs. cache:
• Faster atomics
• More banks
• More predictable

[Chart: average shared memory benefit, 70% on Pascal vs. 93% on Volta]
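
The difference is easy to see in code. A sketch of the explicitly managed path (illustrative; any staging pattern would do): data is copied into __shared__ once, then reused with predictable, banked access, where the cache path would instead rely on L1 hits.

    #include <cuda_runtime.h>

    // Reverses each 256-element block of 'in' into 'out' via shared memory.
    // Assumes the array length is a multiple of 256.
    __global__ void blockReverse(const float* in, float* out)
    {
        __shared__ float tile[256];                 // explicitly managed SRAM
        int base = blockIdx.x * 256;
        tile[threadIdx.x] = in[base + threadIdx.x]; // global -> shared
        __syncthreads();                            // visible to the whole block
        out[base + threadIdx.x] = tile[255 - threadIdx.x];
    }
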
GPU COMPUTING AND DEEP LEARNING

TWO FORCES DRIVING THE FUTURE OF COMPUTING

[Chart: 40 years of microprocessor trend data; transistor counts (thousands) on a log scale, with single-threaded performance growing 1.5X per year before flattening to 1.1X per year]

40 Years of Microprocessor Trend Data | The Big Bang of Deep Learning
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp

RISE OF NVIDIA GPU COMPUTING

[Chart: the same trend data with GPU-computing performance added, growing 1.5X per year toward 1000X by 2025, while single-threaded performance grows 1.1X per year]

DEEP LEARNING EVERYWHERE

INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

DEEP NEURAL NETWORK

[Diagram: a neuron computing the weighted sum of its inputs, sum(w_i * I_i), over inputs I_0 ... I_n with weights w_0 ... w_n]

ANATOMY OF A FULLY CONNECTED LAYER
Lots of dot products

Each neuron calculates a dot product; there are M neurons in a layer:

x1 = g(v_x1 . z)

where v_x1 is the neuron's weight vector, z is the layer input, and g is the activation function
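
In code, one neuron is just this (a plain C++ sketch of the formula above; the slides do not fix g, so ReLU is assumed here):

    #include <vector>
    #include <algorithm>

    // x1 = g(v_x1 . z) for one neuron, with g = ReLU.
    float neuron(const std::vector<float>& v_x1, const std::vector<float>& z)
    {
        float acc = 0.0f;
        for (size_t k = 0; k < v_x1.size(); ++k)
            acc += v_x1[k] * z[k];        // dot product
        return std::max(acc, 0.0f);       // activation g
    }
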
COMBINE THE DOT PRODUCTS
What if we assemble the weights into a matrix?

Each neuron calculates a dot product, M in a layer:

x1 = g(v_x1 . z)

What if we assemble the weights as an [M, K] matrix?

Matrix-vector multiplication (GEMV)

Unfortunately ...
M*K + 2*K elements loaded/stored
M*K FMA math operations
This is memory bandwidth limited!
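
A back-of-envelope check of that claim (a sketch mirroring the slide's counts; the values of M and K are arbitrary): the ratio of math to memory traffic stays near one FMA per element moved, so the math units starve.

    #include <cstdio>

    int main()
    {
        const double M = 4096, K = 4096;
        double moved = M * K + 2 * K;   // weights plus vectors, per the slide
        double fmas  = M * K;           // one FMA per weight
        printf("GEMV: %.2f FMA per element moved\n", fmas / moved);
        return 0;
    }
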
BATCH TO GET MATRIX MULTIPLICATION
Making the problem math limited

Can we turn this into a GEMM?

"Batching": process several inputs at once
Input is now a matrix, not a vector
Weight matrix remains the same
1 <= N <= 128 is common
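
Extending the same back-of-envelope sketch to a batch of N inputs shows why this works: the FMA count grows with N while the weight matrix is moved once, so arithmetic intensity scales roughly with N.

    #include <cstdio>

    int main()
    {
        const double M = 4096, K = 4096;
        for (double N : {1.0, 32.0, 128.0}) {
            double moved = M * K + K * N + M * N;  // weights + inputs + outputs
            double fmas  = M * K * N;              // GEMM math
            printf("N = %3.0f: %6.1f FMA per element moved\n", N, fmas / moved);
        }
        return 0;
    }
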
GPU DEEP LEARNING —
A NEW COMPUTING MODEL

AI IMPROVING AT AMAZING RATES

[Charts: ImageNet accuracy and speech recognition accuracy improving over time]

AI BREAKTHROUGHS
Recent Breakthroughs, 2015-2017

"Superhuman" Image Recognition
Conversational Speech Recognition
Atari Games
Lip Reading
AlphaGo Rivals World Champion


MODEL COMPLEXITY IS EXPLODING

2015 — Microsoft ResNet: 7 ExaFLOPS, 60 Million Parameters
2016 — Baidu Deep Speech 2: 20 ExaFLOPS, 300 Million Parameters
2017 — Google NMT: 105 ExaFLOPS, 8.7 Billion Parameters

NVIDIA DNN ACCELERATION

A COMPLETE DEEP LEARNING PLATFORM

MANAGE: DIGITS (manage / augment data)
TRAIN: DIGITS (test, train)
DEPLOY: TensorRT (data center, embedded, automotive)

DNN TRAINING

NVIDIA DGX SYSTEMS
Built for Leading AI Research

https://www.nvidia.com/en-us/data-center/dgx-systems/

https://youtu.be/8xYz46h3MJ0

NVIDIA DGX STATION
PERSONAL DGX

480 Tensor TFLOPS | 4x Tesla V100 16GB
NVLink Fully Connected | 3x DisplayPort
1500W | Water Cooled

NVIDIA DGX STATION
PERSONAL DGX

480 Tensor TFLOPS | 4x Tesla V100 16GB
NVLink Fully Connected | 3x DisplayPort
1500W | Water Cooled
$69,000

NVIDIA DGX-1 WITH TESLA V100
ESSENTIAL INSTRUMENT OF AI RESEARCH

960 Tensor TFLOPS | 8x Tesla V100 | NVLink Hybrid Cube
From 8 days on TITAN X to 8 hours
400 servers in a box

NVIDIA DGX-1 WITH TESLA V100
ESSENTIAL INSTRUMENT OF AI RESEARCH

960 Tensor TFLOPS | 8x Tesla V100 | NVLink Hybrid Cube
From 8 days on TITAN X to 8 hours
400 servers in a box
$149,000

DNN TRAINING WITH DGX-1
Iterate and Innovate Faster

DNN INFERENCE

TensorRT

High-performance framework makes it easy to develop GPU-accelerated inference
Production deployment solution for deep learning inference
Optimized inference for a given trained neural network and target GPU
Solutions for Hyperscale, ADAS, Embedded
Supports deployment of fp32, fp16, int8* inference

TensorRT for Data Center: Image Classification, Object Detection, Image Segmentation
TensorRT for Automotive (NVIDIA DRIVE PX 2): Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

* int8 support will be available from v2

TensorRT
Optimizations

• Fuse network layers
• Eliminate concatenation layers
• Kernel specialization
• Auto-tuning for target platform
• Tuned for given batch size

TRAINED NEURAL NETWORK -> OPTIMIZED INFERENCE RUNTIME
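
What layer fusion buys can be shown in miniature (an illustration of the idea, not TensorRT's actual kernels): two pointwise layers run as separate kernels write and re-read the whole tensor, while the fused version makes one pass over memory.

    #include <cuda_runtime.h>

    __global__ void biasAdd(float* x, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b;                    // pass 1: read + write all of x
    }

    __global__ void relu(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i], 0.0f);     // pass 2: read + write again
    }

    // Fused equivalent: same math, half the memory traffic.
    __global__ void biasAddRelu(float* x, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i] + b, 0.0f);
    }
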
NVIDIA TENSORRT
Programmable Inference Accelerator

[Diagram: an Inception-style subgraph (1x1, 3x3, 5x5 convolutions, batch norm, ReLU, max pool, concatenation) collapsed into a few fused kernels (1x1 CR, 3x3 CR, 5x5 CR)]

Weight & Activation Precision Calibration | Layer & Tensor Fusion
Kernel Auto-Tuning | Multi-Stream Execution

V100 INFERENCE
Datacenter Inference Acceleration

• 3.7x faster inference on V100 vs. P100
• 18x faster inference on TensorFlow models on V100
• 40x faster than CPU-only

AUTONOMOUS VEHICLE TECHNOLOGY

AI IS THE SOLUTION TO SELF-DRIVING CARS

PERCEPTION | REASONING | DRIVING
HD MAP | MAPPING | AI COMPUTING

PARKER
Next-Generation System-on-Chip

• NVIDIA's next-generation Pascal graphics architecture: 1.5 teraflops
• NVIDIA's next-generation ARM 64-bit Denver 2 CPU
• ARM v8 CPU complex (2x Denver 2 + 4x A57), coherent HMP
• Functional safety for automotive applications

[Block diagram: security engines, 4K60 video encoder and decoder, audio engine, 2D engine, display engines, 128-bit LPDDR4, boot and PM processor, image processor (ISP), GigE Ethernet MAC, safety engine, I/O]

DRIVE PX 2 COMPUTE COMPLEXES
2 Complete AI Systems

Pascal Discrete GPU
• 1,280 CUDA Cores
• 4 GB GDDR5 RAM

Parker SoC Complex
• 256 CUDA Cores
• 4 Cortex A57 Cores
• 2 NVIDIA Denver2 Cores
• 8 GB LPDDR4 RAM
• 64 GB Flash

Safety Microprocessor
• Infineon AURIX Safety Microprocessor
• ASIL D

NVIDIA DRIVE PLATFORM
Level 2 -> Level 5

[Chart: 1 TOPS (DRIVE PX), 10 TOPS (DRIVE PX 2 Parker, Level 2/3), 100 TOPS (DRIVE PX Xavier, Level 4/5)]

ONE ARCHITECTURE
DRIVE PX 2: 2 Parker + 2 Pascal GPU | 20 TOPS DL | 120 SPECINT | 80W
DRIVE PX (Xavier): 30 TOPS DL | 160 SPECINT | 30W

ANNOUNCING XAVIER DLA
NOW OPEN SOURCE

[Block diagram: command interface and tensor execution micro-controller driving a pipeline of input DMA (activations and weights), a unified 512KB input buffer, sparse weight decompression, native or Winograd input transform, a MAC array (2048 int8, or 1024 int16 / 1024 fp16) with accumulators, output postprocessing (activation function, pooling, etc.), and output DMA, all behind a memory interface]

http://nvdla.org/

NVIDIA DRIVE
END TO END SELF-DRIVING CAR PLATFORM

[Diagram: mapping, localization, and networks such as KALDI, DRIVENET, and PILOTNET; training on NVIDIA DGX-1, driving with NVIDIA DRIVE PX 2 and DriveWorks]

DRIVING AND IMAGING

CURRENT DRIVER ASSIST

SENSE -> PLAN -> ACT

[Diagram: sensing and planning handled by FPGA, CPU, and CV ASIC; actions limited to WARN and BRAKE]

CURRENT DRIVER ASSIST

SENSE -> PLAN -> ACT

[Diagram: sensing and planning handled by FPGA, CPU, and CV ASIC; actions limited to WARN and BRAKE]

FUTURE AUTONOMOUS DRIVING SYSTEM

SENSE -> PLAN -> ACT

[Diagram: a DNN joins the FPGA, CPU, and CV ASIC; actions expand to WARN, BRAKE, STEER, and ACCELERATE]

NVIDIA BB8 AI CAR — LEARNING BY EXAMPLE

BB8 SELF-DRIVING CAR DEMO

https://blogs.nvidia.com/blog/2017/01/04/bb8-ces/

https://youtu.be/fmVWLr0X1Sk

WORKING @ NVIDIA
OUR CULTURE: A LEARNING MACHINE

INNOVATION: "willingness to take risks"
ONE TEAM: "what's best for the company"
INTELLECTUAL HONESTY: "admit mistakes, no ego"
SPEED & AGILITY: "the world is changing fast"
EXCELLENCE: "hold ourselves to the highest standards"

11,000 employees — Tackling challenges that matter

A GREAT PLACE TO WORK
Top 50 "Best Places to Work" — Glassdoor
#1 of the "50 Smartest Companies" — MIT Tech Review

JOIN THE NVIDIA TEAM: INTERNS AND NEW GRADS

We’re hiring interns and new college grads. Come join the industry leader
in virtual reality, artificial intelligence, self-driving cars, and gaming.
Learn more at: www.nvidia.com/university
THANK YOU
