
NVIDIA GPU COMPUTING: A JOURNEY

FROM PC GAMING TO DEEP LEARNING


Stuart Oberman | October 2017
GAMING | PRO VISUALIZATION | ENTERPRISE | DATA CENTER | AUTO

NVIDIA ACCELERATED COMPUTING

GEFORCE: PC Gaming
200M GeForce gamers worldwide
Most advanced technology
Gaming ecosystem: More than just chips
Amazing experiences & imagery
NINTENDO SWITCH: POWERED BY NVIDIA TEGRA
GEFORCE NOW: AMAZING GAMES ANYWHERE

AAA titles delivered at 1080p, 60 fps
Streamed to the SHIELD family of devices
Streaming to Mac (beta)

https://www.nvidia.com/en-us/geforce/products/geforce-now/mac-pc/

GPU COMPUTING

Drug Design: Molecular Dynamics (15x speedup)
Seismic Imaging: Reverse Time Migration (14x speedup)
Automotive Design: Computational Fluid Dynamics (30-100x speedup)
Medical Imaging: Computed Tomography
Astrophysics: n-body (20x speedup)
Options Pricing: Monte Carlo
Product Development: Finite Difference Time Domain
Weather Forecasting: Atmospheric Physics

GPU: 2017

2017: TESLA VOLTA V100

21B transistors
815 mm2

80 SM
5120 CUDA Cores
640 Tensor Cores

16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs

V100 SPECIFICATIONS

HOW DID WE GET HERE?

NVIDIA GPUS: 1999 TO NOW

https://youtu.be/I25dLTIPREA

SOUL OF THE GRAPHICS PROCESSING UNIT
GPU: Changes Everything

• Accelerate computationally intensive applications

• NVIDIA introduced the GPU in 1999
• A single-chip processor to accelerate PC gaming and 3D graphics

• Goal: approach the image quality of movie studio offline rendering farms, but in real time
• Instead of hours per frame, >60 frames per second

• Millions of pixels per frame can all be operated on in parallel


• 3D graphics is often termed embarrassingly parallel

• Use large arrays of floating point units to exploit wide and deep parallelism
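
A minimal CUDA sketch of that parallelism (illustrative only; the kernel name and toy color formula are hypothetical, not an actual shader): one thread computes one pixel, and no pixel depends on any other.

    #include <cuda_runtime.h>

    // Hypothetical per-pixel kernel: each thread shades one pixel independently.
    __global__ void shade(float* r, float* g, float* b, int width, int height)
    {
        int x = blockIdx.x * blockDim.x + threadIdx.x;
        int y = blockIdx.y * blockDim.y + threadIdx.y;
        if (x >= width || y >= height) return;
        int i = y * width + x;              // linear pixel index
        r[i] = (float)x / width;            // toy color computation
        g[i] = (float)y / height;
        b[i] = 0.5f;
    }

    // Launch with a 2D grid covering the frame, e.g.:
    //   dim3 block(16, 16);
    //   dim3 grid((width + 15) / 16, (height + 15) / 16);
    //   shade<<<grid, block>>>(r, g, b, width, height);
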
CLASSIC GEFORCE GPUS

GEFORCE 6 AND 7 SERIES
2004-2006

• Example: GeForce 7900 GTX
• 278M transistors
• 650MHz pipeline clock
• 196mm2 in 90nm
• >300 GFLOPS peak, single-precision

THE LIFE OF A TRIANGLE IN A GPU
Classic Edition

Host / Front End / Vertex Fetch: process commands, convert to FP
Vertex Processing: transform vertices to screen-space
Primitive Assembly, Setup: generate per-triangle equations
Rasterize & Zcull: generate pixels, delete pixels that cannot be seen
Pixel Shader (Texture, Register Combiners): determine the colors, transparencies, and depth of the pixel
Pixel Engines (ROP): do final hidden surface test, blend, and write out color and new depth
Frame Buffer Controller

NUMERIC REPRESENTATIONS IN A GPU

• Fixed point formats
  • u8, s8, u16, s16, s3.8, s5.10, ...

• Floating point formats
  • fp16, fp24, fp32, ...
  • Tradeoff of dynamic range vs. precision

• Block floating point formats (see the sketch below)
  • Treat multiple operands as having a common exponent
  • Allows a tradeoff of dynamic range vs. storage and computation
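
The block floating point idea can be sketched in a few lines of C++ (an illustration of the concept, not NVIDIA's hardware format; the block size of 4 and the 7-bit mantissa scaling are arbitrary choices here):

    #include <cmath>
    #include <cstdint>
    #include <cstdio>

    int main()
    {
        float block[4] = {0.50f, -1.75f, 3.25f, 0.125f};

        // One shared exponent per block, chosen from the largest magnitude.
        float maxAbs = 0.0f;
        for (float v : block) maxAbs = fmaxf(maxAbs, fabsf(v));
        int sharedExp;
        frexpf(maxAbs, &sharedExp);   // maxAbs = m * 2^sharedExp, m in [0.5, 1)

        // Quantize each value to an int8 mantissa against the shared exponent.
        int8_t mant[4];
        for (int i = 0; i < 4; ++i)
            mant[i] = (int8_t)lrintf(ldexpf(block[i], 7 - sharedExp));

        // Reconstruct: storage is one exponent plus four small mantissas.
        for (int i = 0; i < 4; ++i)
            printf("%g -> %g\n", block[i], ldexpf((float)mant[i], sharedExp - 7));
    }
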
INSIDE THE 7900GTX GPU

Host / FW / VTF: vertex fetch engine
8 vertex shaders
Cull / Clip / Setup, Z-Cull, Shader Instruction Dispatch: conversion to pixels
24 pixel shaders, with L2 texture cache
Fragment Crossbar: redistributes pixels
16 pixel engines
4 independent 64-bit memory partitions, each with its own DRAM(s)

G80: REDEFINED THE GPU

G80
GeForce 8800 released 2006

• G80 was the first GPU with a unified shader processor architecture

• Introduced the SM: Streaming Multiprocessor
• Array of simple streaming processor cores: SPs, or CUDA cores

• All shader stages use the same instruction set

• All shader stages execute on the same units

• Permits better sharing of SM hardware resources

• Recognized that building dedicated units often results in under-utilization, depending on the application workload

G80 FEATURES

• 681M transistors
• 470mm2 in 90nm
• First to support Microsoft DirectX10 API
• Invested a little extra (epsilon) HW in the SM to also support general-purpose throughput computing
• Beginning of CUDA everywhere

• SM functional units designed to run at 2x frequency, with half the number of units

• 576 GFLOPS @ 1.5GHz, IEEE 754 fp32 FADD and FMUL
• 155W

BEGINNING OF GPU COMPUTING
Throughput Computing

• Latency Oriented
• Fewer, bigger cores with out-of-order, speculative execution

• Big caches optimized for latency

• Math units are small part of the die

• Throughput Oriented
• Lots of simple compute cores and hardware scheduling

• Big register files. Caches optimized for bandwidth.

• Math units are most of the die


CUDA
Most successful environment for throughput computing

C++ for throughput computers
On-chip memory management
Asynchronous, parallel API

Programmability makes it possible to innovate
New layer type? No problem.
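
What the programming model looks like in practice, as a minimal CUDA C++ sketch (a generic SAXPY, not code from the slides): a kernel runs thousands of threads, and the launch on a stream is asynchronous with respect to the host.

    #include <cuda_runtime.h>
    #include <cstdio>

    __global__ void saxpy(int n, float a, const float* x, float* y)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) y[i] = a * x[i] + y[i];   // one element per thread
    }

    int main()
    {
        const int n = 1 << 20;
        float *x, *y;
        cudaMallocManaged(&x, n * sizeof(float));
        cudaMallocManaged(&y, n * sizeof(float));
        for (int i = 0; i < n; ++i) { x[i] = 1.0f; y[i] = 2.0f; }

        cudaStream_t stream;
        cudaStreamCreate(&stream);
        // Asynchronous launch: the host continues while the GPU works.
        saxpy<<<(n + 255) / 256, 256, 0, stream>>>(n, 2.0f, x, y);
        cudaStreamSynchronize(stream);

        printf("y[0] = %f\n", y[0]);         // expect 4.0
        cudaFree(x); cudaFree(y);
        return 0;
    }
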
G80 ARCHITECTURE

FROM FERMI TO PASCAL

FERMI GF100
Tesla C2070 released 2011
• 3B transistors
• 529 mm2 in 40nm
• 1150 MHz SM clock
• 3rd generation SM, each with configurable L1 / shared memory
• IEEE 754-2008 FMA
• 1030 GFLOPS fp32, 515 GFLOPS fp64
• 247W

KEPLER GK110
Tesla K40 released 2013

• 7.1B transistors
• 550 mm2 in 28nm
• Intense focus on power efficiency, operating at lower frequency
• 2880 CUDA cores at 810 MHz

• Tradeoff of area efficiency vs. power efficiency


• 4.3 TFLOPS fp32, 1.4 TFLOPS fp64
• 235W
TITAN SUPERCOMPUTER
Oak Ridge National Laboratory

PASCAL GP100
released 2016

• 15.3B transistors
• 610 mm2 in 16ff
• 10.6 TFLOPS fp32, 5.3 TFLOPS fp64
• 21 TFLOPS fp16 for Deep Learning training and
inference acceleration
• New high-bandwidth NVLink GPU interconnect
• HBM2 stacked memory
• 300W

MAJOR ADVANCES IN PASCAL

[Charts: P100 vs. K40 and M40 in compute throughput (20 TFLOPS fp16, 10 TFLOPS fp32 on P100), GPU-GPU bandwidth, and memory bandwidth (GB/s)]

3x Compute | 5x GPU-GPU BW | 3x GPU Mem BW



GEFORCE GTX 1080TI

https://www.nvidia.com/en-us/geforce/products/10series/geforce-gtx-1080-ti/
https://youtu.be/2c2vN736V60

FINAL FANTASY XV PREVIEW DEMO WITH GEFORCE GTX 1080TI

https://www.geforce.com/whats-new/articles/final-fantasy-xv-windows-edition-4k-trailer-nvidia-gameworks-enhancements

https://youtu.be/h0o3fctwXw0

2017: VOLTA

TESLA V100: 2017

21B transistors
815 mm2 in 16ff

80 SM
5120 CUDA Cores
640 Tensor Cores

16 GB HBM2
900 GB/s HBM2
300 GB/s NVLink
*full GV100 chip contains 84 SMs

TESLA V100

Volta Architecture: Most Productive GPU
Improved NVLink & HBM2: Efficient Bandwidth
New SM Core: Performance & Programmability
Independent Thread Scheduling: New Algorithms
Tensor Core: 120 Programmable TFLOPS Deep Learning

[SM diagram: L1 instruction cache, four sub-cores, texture unit, L1 data cache & shared memory]

More V100 Features: 2x L2 atomics, int8, new memory model, copy engine page migration, MPS acceleration, and more ...

The Fastest and Most Productive GPU for Deep Learning and HPC

GPU PERFORMANCE COMPARISON

                     P100           V100            Ratio
DL Training          10 TFLOPS      120 TFLOPS      12x
DL Inferencing       21 TFLOPS      120 TFLOPS      6x
FP64/FP32            5/10 TFLOPS    7.5/15 TFLOPS   1.5x
HBM2 Bandwidth       720 GB/s       900 GB/s        1.2x
STREAM Triad Perf    557 GB/s       855 GB/s        1.5x
NVLink Bandwidth     160 GB/s       300 GB/s        1.9x
L2 Cache             4 MB           6 MB            1.5x
L1 Caches            1.3 MB         10 MB           7.7x

TENSOR CORE

CUDA TensorOp instructions & data formats
4x4 matrix processing array
D[FP32] = A[FP16] * B[FP16] + C[FP32]
Optimized for deep learning

[Diagram: activation and weight inputs feeding the array, producing output results]

TENSOR CORE
Mixed Precision Matrix Math: 4x4 matrices

D = A * B + C

A and B are FP16 4x4 matrices; C and D are 4x4 matrices in FP16 or FP32

VOLTA TENSOR OPERATION

FP16 storage/input -> full-precision products -> sum with FP32 accumulator (as more products accumulate) -> convert to FP32 result

Also supports FP16 accumulator mode for inferencing
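
In CUDA, Tensor Cores are reached through the WMMA API (CUDA 9+, sm_70+), which exposes warp-wide 16x16x16 tiles built from these 4x4 operations. A minimal sketch, assuming 16x16 row-major matrices already resident in GPU memory:

    #include <cuda_fp16.h>
    #include <mma.h>
    using namespace nvcuda;

    // One warp computes a 16x16x16 tile of D = A*B + C,
    // fp16 inputs with fp32 accumulation.
    __global__ void wmma_tile(const half* a, const half* b,
                              const float* c, float* d)
    {
        wmma::fragment<wmma::matrix_a, 16, 16, 16, half, wmma::row_major> fa;
        wmma::fragment<wmma::matrix_b, 16, 16, 16, half, wmma::row_major> fb;
        wmma::fragment<wmma::accumulator, 16, 16, 16, float> fc, fd;

        wmma::load_matrix_sync(fa, a, 16);                 // fp16 inputs
        wmma::load_matrix_sync(fb, b, 16);
        wmma::load_matrix_sync(fc, c, 16, wmma::mem_row_major);
        wmma::mma_sync(fd, fa, fb, fc);                    // Tensor Core math
        wmma::store_matrix_sync(d, fd, 16, wmma::mem_row_major);
    }

    // Launch with a single warp: wmma_tile<<<1, 32>>>(a, b, c, d);
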
NVLINK – PERFORMANCE AND POWER

Bandwidth: 25Gbps signaling; 6 NVLinks for GV100; 1.9x bandwidth improvement over GP100
Coherence: latency-sensitive CPU caching of GMEM; fast access in local cache hierarchy; probe filter in GPU
Power Savings: reduce number of active lanes for lightly loaded links

NVLINK NODES

HPC – P9 CORAL node (SUMMIT): two POWER9 CPUs, each NVLink-attached to three V100 GPUs
DL – Hybrid cube mesh (DGX-1 with Volta): eight V100 GPUs

NARROWING THE SHARED MEMORY GAP
with the GV100 L1 cache

Directed testing: shared in global

Cache vs. shared:
• Easier to use
• 90%+ as good

Shared vs. cache:
• Faster atomics
• More banks
• More predictable

[Chart: average shared memory benefit, 70% on Pascal vs. 93% on Volta]
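
The difference is easy to see in code. A sketch of the explicitly managed path (illustrative; any staging pattern would do): data is copied into __shared__ once, then reused with predictable, banked access, where the cache path would instead rely on L1 hits.

    #include <cuda_runtime.h>

    // Reverses each 256-element block of 'in' into 'out' via shared memory.
    // Assumes the array length is a multiple of 256.
    __global__ void blockReverse(const float* in, float* out)
    {
        __shared__ float tile[256];                 // explicitly managed SRAM
        int base = blockIdx.x * 256;
        tile[threadIdx.x] = in[base + threadIdx.x]; // global -> shared
        __syncthreads();                            // visible to the whole block
        out[base + threadIdx.x] = tile[255 - threadIdx.x];
    }
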
GPU COMPUTING AND DEEP LEARNING

TWO FORCES DRIVING THE FUTURE OF COMPUTING

[Chart: 40 years of microprocessor trend data; transistor counts (thousands) on a log scale, with single-threaded performance growing 1.5X per year before flattening to 1.1X per year]

40 Years of Microprocessor Trend Data | The Big Bang of Deep Learning
Original data up to the year 2010 collected and plotted by M. Horowitz, F. Labonte, O. Shacham, K. Olukotun, L. Hammond, and C. Batten; new plot and data collected for 2010-2015 by K. Rupp

RISE OF NVIDIA GPU COMPUTING

[Chart: the same trend data with GPU-computing performance added, growing 1.5X per year toward 1000X by 2025, while single-threaded performance grows 1.1X per year]

DEEP LEARNING EVERYWHERE

INTERNET & CLOUD: Image Classification, Speech Recognition, Language Translation, Language Processing, Sentiment Analysis, Recommendation
MEDICINE & BIOLOGY: Cancer Cell Detection, Diabetic Grading, Drug Discovery
MEDIA & ENTERTAINMENT: Video Captioning, Video Search, Real Time Translation
SECURITY & DEFENSE: Face Detection, Video Surveillance, Satellite Imagery
AUTONOMOUS MACHINES: Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

DEEP NEURAL NETWORK

[Diagram: a neuron computing the weighted sum of its inputs, sum(w_i * I_i), over inputs I_0 ... I_n with weights w_0 ... w_n]

ANATOMY OF A FULLY CONNECTED LAYER
Lots of dot products

Each neuron calculates a dot product; there are M neurons in a layer:

x1 = g(v_x1 . z)

where v_x1 is the neuron's weight vector, z is the layer input, and g is the activation function
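
In code, one neuron is just this (a plain C++ sketch of the formula above; the slides do not fix g, so ReLU is assumed here):

    #include <vector>
    #include <algorithm>

    // x1 = g(v_x1 . z) for one neuron, with g = ReLU.
    float neuron(const std::vector<float>& v_x1, const std::vector<float>& z)
    {
        float acc = 0.0f;
        for (size_t k = 0; k < v_x1.size(); ++k)
            acc += v_x1[k] * z[k];        // dot product
        return std::max(acc, 0.0f);       // activation g
    }
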
COMBINE THE DOT PRODUCTS
What if we assemble the weights into a matrix?

Each neuron calculates a dot product, M in a layer:

x1 = g(v_x1 . z)

What if we assemble the weights as an [M, K] matrix?

Matrix-vector multiplication (GEMV)

Unfortunately ...
M*K + 2*K elements loaded/stored
M*K FMA math operations
This is memory bandwidth limited!
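
A back-of-envelope check of that claim (a sketch mirroring the slide's counts; the values of M and K are arbitrary): the ratio of math to memory traffic stays near one FMA per element moved, so the math units starve.

    #include <cstdio>

    int main()
    {
        const double M = 4096, K = 4096;
        double moved = M * K + 2 * K;   // weights plus vectors, per the slide
        double fmas  = M * K;           // one FMA per weight
        printf("GEMV: %.2f FMA per element moved\n", fmas / moved);
        return 0;
    }
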
BATCH TO GET MATRIX MULTIPLICATION
Making the problem math limited

Can we turn this into a GEMM?

"Batching": process several inputs at once
Input is now a matrix, not a vector
Weight matrix remains the same
1 <= N <= 128 is common
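
Extending the same back-of-envelope sketch to a batch of N inputs shows why this works: the FMA count grows with N while the weight matrix is moved once, so arithmetic intensity scales roughly with N.

    #include <cstdio>

    int main()
    {
        const double M = 4096, K = 4096;
        for (double N : {1.0, 32.0, 128.0}) {
            double moved = M * K + K * N + M * N;  // weights + inputs + outputs
            double fmas  = M * K * N;              // GEMM math
            printf("N = %3.0f: %6.1f FMA per element moved\n", N, fmas / moved);
        }
        return 0;
    }
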
GPU DEEP LEARNING —
A NEW COMPUTING MODEL

AI IMPROVING AT AMAZING RATES

[Charts: ImageNet accuracy and speech recognition accuracy improving over time]

AI BREAKTHROUGHS
Recent Breakthroughs, 2015-2017

"Superhuman" Image Recognition
Conversational Speech Recognition
Atari Games
Lip Reading
AlphaGo Rivals World Champion


MODEL COMPLEXITY IS EXPLODING

2015 — Microsoft ResNet: 7 ExaFLOPS, 60 Million Parameters
2016 — Baidu Deep Speech 2: 20 ExaFLOPS, 300 Million Parameters
2017 — Google NMT: 105 ExaFLOPS, 8.7 Billion Parameters

NVIDIA DNN ACCELERATION

A COMPLETE DEEP LEARNING PLATFORM

MANAGE: DIGITS (manage / augment data)
TRAIN: DIGITS (test, train)
DEPLOY: TensorRT (data center, embedded, automotive)

DNN TRAINING

NVIDIA DGX SYSTEMS
Built for Leading AI Research

https://www.nvidia.com/en-us/data-center/dgx-systems/

https://youtu.be/8xYz46h3MJ0

NVIDIA DGX STATION
PERSONAL DGX

480 Tensor TFLOPS | 4x Tesla V100 16GB
NVLink Fully Connected | 3x DisplayPort
1500W | Water Cooled

NVIDIA DGX STATION
PERSONAL DGX

480 Tensor TFLOPS | 4x Tesla V100 16GB
NVLink Fully Connected | 3x DisplayPort
1500W | Water Cooled
$69,000

NVIDIA DGX-1 WITH TESLA V100
ESSENTIAL INSTRUMENT OF AI RESEARCH

960 Tensor TFLOPS | 8x Tesla V100 | NVLink Hybrid Cube
From 8 days on TITAN X to 8 hours
400 servers in a box

NVIDIA DGX-1 WITH TESLA V100
ESSENTIAL INSTRUMENT OF AI RESEARCH

960 Tensor TFLOPS | 8x Tesla V100 | NVLink Hybrid Cube
From 8 days on TITAN X to 8 hours
400 servers in a box
$149,000

DNN TRAINING WITH DGX-1
Iterate and Innovate Faster

DNN INFERENCE

TensorRT

High-performance framework makes it easy to develop GPU-accelerated inference
Production deployment solution for deep learning inference
Optimized inference for a given trained neural network and target GPU
Solutions for Hyperscale, ADAS, Embedded
Supports deployment of fp32, fp16, int8* inference

TensorRT for Data Center: Image Classification, Object Detection, Image Segmentation
TensorRT for Automotive (NVIDIA DRIVE PX 2): Pedestrian Detection, Lane Tracking, Traffic Sign Recognition

* int8 support will be available from v2

TensorRT
Optimizations

• Fuse network layers
• Eliminate concatenation layers
• Kernel specialization
• Auto-tuning for target platform
• Tuned for given batch size

TRAINED NEURAL NETWORK -> OPTIMIZED INFERENCE RUNTIME
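
What layer fusion buys can be shown in miniature (an illustration of the idea, not TensorRT's actual kernels): two pointwise layers run as separate kernels write and re-read the whole tensor, while the fused version makes one pass over memory.

    #include <cuda_runtime.h>

    __global__ void biasAdd(float* x, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] += b;                    // pass 1: read + write all of x
    }

    __global__ void relu(float* x, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i], 0.0f);     // pass 2: read + write again
    }

    // Fused equivalent: same math, half the memory traffic.
    __global__ void biasAddRelu(float* x, float b, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) x[i] = fmaxf(x[i] + b, 0.0f);
    }
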
NVIDIA TENSORRT
Programmable Inference Accelerator

[Diagram: an Inception-style subgraph (1x1, 3x3, 5x5 convolutions, batch norm, ReLU, max pool, concatenation) collapsed into a few fused kernels (1x1 CR, 3x3 CR, 5x5 CR)]

Weight & Activation Precision Calibration | Layer & Tensor Fusion
Kernel Auto-Tuning | Multi-Stream Execution

V100 INFERENCE
Datacenter Inference Acceleration

• 3.7x faster inference on V100 vs. P100
• 18x faster inference on TensorFlow models on V100
• 40x faster than CPU-only

AUTONOMOUS VEHICLE TECHNOLOGY

AI IS THE SOLUTION TO SELF-DRIVING CARS

PERCEPTION | REASONING | DRIVING
HD MAP | MAPPING | AI COMPUTING

PARKER
Next-Generation System-on-Chip

• NVIDIA's next-generation Pascal graphics architecture: 1.5 teraflops
• NVIDIA's next-generation ARM 64-bit Denver 2 CPU
• ARM v8 CPU complex (2x Denver 2 + 4x A57), coherent HMP
• Functional safety for automotive applications

[Block diagram: security engines, 4K60 video encoder and decoder, audio engine, 2D engine, display engines, 128-bit LPDDR4, boot and PM processor, image processor (ISP), GigE Ethernet MAC, safety engine, I/O]

DRIVE PX 2 COMPUTE COMPLEXES
2 Complete AI Systems

Pascal Discrete GPU
• 1,280 CUDA Cores
• 4 GB GDDR5 RAM

Parker SoC Complex
• 256 CUDA Cores
• 4 Cortex A57 Cores
• 2 NVIDIA Denver2 Cores
• 8 GB LPDDR4 RAM
• 64 GB Flash

Safety Microprocessor
• Infineon AURIX Safety Microprocessor
• ASIL D

NVIDIA DRIVE PLATFORM
Level 2 -> Level 5

[Chart: 1 TOPS (DRIVE PX), 10 TOPS (DRIVE PX 2 Parker, Level 2/3), 100 TOPS (DRIVE PX Xavier, Level 4/5)]

ONE ARCHITECTURE
DRIVE PX 2: 2 Parker + 2 Pascal GPU | 20 TOPS DL | 120 SPECINT | 80W
DRIVE PX (Xavier): 30 TOPS DL | 160 SPECINT | 30W

ANNOUNCING XAVIER DLA
NOW OPEN SOURCE

[Block diagram: command interface and tensor execution micro-controller driving a pipeline of input DMA (activations and weights), a unified 512KB input buffer, sparse weight decompression, native or Winograd input transform, a MAC array (2048 int8, or 1024 int16 / 1024 fp16) with accumulators, output postprocessing (activation function, pooling, etc.), and output DMA, all behind a memory interface]

http://nvdla.org/

NVIDIA DRIVE
END TO END SELF-DRIVING CAR PLATFORM

[Diagram: mapping, localization, and networks such as KALDI, DRIVENET, and PILOTNET; training on NVIDIA DGX-1, driving with NVIDIA DRIVE PX 2 and DriveWorks]

DRIVING AND IMAGING

CURRENT DRIVER ASSIST

SENSE -> PLAN -> ACT

[Diagram: sensing and planning handled by FPGA, CPU, and CV ASIC; actions limited to WARN and BRAKE]

CURRENT DRIVER ASSIST

SENSE -> PLAN -> ACT

[Diagram: sensing and planning handled by FPGA, CPU, and CV ASIC; actions limited to WARN and BRAKE]

FUTURE AUTONOMOUS DRIVING SYSTEM

SENSE -> PLAN -> ACT

[Diagram: a DNN joins the FPGA, CPU, and CV ASIC; actions expand to WARN, BRAKE, STEER, and ACCELERATE]

NVIDIA BB8 AI CAR — LEARNING BY EXAMPLE

BB8 SELF-DRIVING CAR DEMO

https://blogs.nvidia.com/blog/2017/01/04/bb8-ces/

https://youtu.be/fmVWLr0X1Sk

WORKING @ NVIDIA
OUR CULTURE: A LEARNING MACHINE

INNOVATION: "willingness to take risks"
ONE TEAM: "what's best for the company"
INTELLECTUAL HONESTY: "admit mistakes, no ego"
SPEED & AGILITY: "the world is changing fast"
EXCELLENCE: "hold ourselves to the highest standards"

11,000 employees — Tackling challenges that matter

A GREAT PLACE TO WORK
Top 50 "Best Places to Work" — Glassdoor
#1 of the "50 Smartest Companies" — MIT Tech Review

JOIN THE NVIDIA TEAM: INTERNS AND NEW GRADS

We’re hiring interns and new college grads. Come join the industry leader
in virtual reality, artificial intelligence, self-driving cars, and gaming.
Learn more at: www.nvidia.com/university
THANK YOU
