Contents
- 1. Introduction
- 2. Programming Model
- 3. Programming Interface
- 3.1. Compilation with NVCC
- 3.2. CUDA Runtime
- 3.2.1. Initialization
- 3.2.2. Device Memory
- 3.2.3. Device Memory L2 Access Management
- 3.2.3.1. L2 Cache Set-Aside for Persisting Accesses
- 3.2.3.2. L2 Policy for Persisting Accesses
- 3.2.3.3. L2 Access Properties
- 3.2.3.4. L2 Persistence Example
- 3.2.3.5. Reset L2 Access to Normal
- 3.2.3.6. Manage Utilization of L2 Set-Aside Cache
- 3.2.3.7. Query L2 Cache Properties
- 3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
- 3.2.4. Shared Memory
- 3.2.5. Distributed Shared Memory
- 3.2.6. Page-Locked Host Memory
- 3.2.7. Memory Synchronization Domains
- 3.2.8. Asynchronous Concurrent Execution
- 3.2.8.1. Concurrent Execution between Host and Device
- 3.2.8.2. Concurrent Kernel Execution
- 3.2.8.3. Overlap of Data Transfer and Kernel Execution
- 3.2.8.4. Concurrent Data Transfers
- 3.2.8.5. Streams
- 3.2.8.6. Programmatic Dependent Launch and Synchronization
- 3.2.8.7. CUDA Graphs
- 3.2.8.7.1. Graph Structure
- 3.2.8.7.2. Creating a Graph Using Graph APIs
- 3.2.8.7.3. Creating a Graph Using Stream Capture
- 3.2.8.7.4. CUDA User Objects
- 3.2.8.7.5. Updating Instantiated Graphs
- 3.2.8.7.6. Using Graph APIs
- 3.2.8.7.7. Device Graph Launch
- 3.2.8.7.8. Conditional Graph Nodes
- 3.2.8.8. Events
- 3.2.8.9. Synchronous Calls
- 3.2.9. Multi-Device System
- 3.2.10. Unified Virtual Address Space
- 3.2.11. Interprocess Communication
- 3.2.12. Error Checking
- 3.2.13. Call Stack
- 3.2.14. Texture and Surface Memory
- 3.2.15. Graphics Interoperability
- 3.2.16. External Resource Interoperability
- 3.2.16.1. Vulkan Interoperability
- 3.2.16.1.1. Matching Device UUIDs
- 3.2.16.1.2. Importing Memory Objects
- 3.2.16.1.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.1.5. Importing Synchronization Objects
- 3.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.2. OpenGL Interoperability
- 3.2.16.3. Direct3D 12 Interoperability
- 3.2.16.3.1. Matching Device LUIDs
- 3.2.16.3.2. Importing Memory Objects
- 3.2.16.3.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.3.5. Importing Synchronization Objects
- 3.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.4. Direct3D 11 Interoperability
- 3.2.16.4.1. Matching Device LUIDs
- 3.2.16.4.2. Importing Memory Objects
- 3.2.16.4.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.4.5. Importing Synchronization Objects
- 3.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
- 3.3. Versioning and Compatibility
- 3.4. Compute Modes
- 3.5. Mode Switches
- 3.6. Tesla Compute Cluster Mode for Windows
- 4. Hardware Implementation
- 5. Performance Guidelines
- 6. CUDA-Enabled GPUs
- 7. C++ Language Extensions
- 7.1. Function Execution Space Specifiers
- 7.2. Variable Memory Space Specifiers
- 7.3. Built-in Vector Types
- 7.4. Built-in Variables
- 7.5. Memory Fence Functions
- 7.6. Synchronization Functions
- 7.7. Mathematical Functions
- 7.8. Texture Functions
- 7.8.1. Texture Object API
- 7.8.1.1. tex1Dfetch()
- 7.8.1.2. tex1D()
- 7.8.1.3. tex1DLod()
- 7.8.1.4. tex1DGrad()
- 7.8.1.5. tex2D()
- 7.8.1.6. tex2D() for sparse CUDA arrays
- 7.8.1.7. tex2Dgather()
- 7.8.1.8. tex2Dgather() for sparse CUDA arrays
- 7.8.1.9. tex2DGrad()
- 7.8.1.10. tex2DGrad() for sparse CUDA arrays
- 7.8.1.11. tex2DLod()
- 7.8.1.12. tex2DLod() for sparse CUDA arrays
- 7.8.1.13. tex3D()
- 7.8.1.14. tex3D() for sparse CUDA arrays
- 7.8.1.15. tex3DLod()
- 7.8.1.16. tex3DLod() for sparse CUDA arrays
- 7.8.1.17. tex3DGrad()
- 7.8.1.18. tex3DGrad() for sparse CUDA arrays
- 7.8.1.19. tex1DLayered()
- 7.8.1.20. tex1DLayeredLod()
- 7.8.1.21. tex1DLayeredGrad()
- 7.8.1.22. tex2DLayered()
- 7.8.1.23. tex2DLayered() for Sparse CUDA Arrays
- 7.8.1.24. tex2DLayeredLod()
- 7.8.1.25. tex2DLayeredLod() for sparse CUDA arrays
- 7.8.1.26. tex2DLayeredGrad()
- 7.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays
- 7.8.1.28. texCubemap()
- 7.8.1.29. texCubemapGrad()
- 7.8.1.30. texCubemapLod()
- 7.8.1.31. texCubemapLayered()
- 7.8.1.32. texCubemapLayeredGrad()
- 7.8.1.33. texCubemapLayeredLod()
- 7.9. Surface Functions
- 7.9.1. Surface Object API
- 7.9.1.1. surf1Dread()
- 7.9.1.2. surf1Dwrite()
- 7.9.1.3. surf2Dread()
- 7.9.1.4. surf2Dwrite()
- 7.9.1.5. surf3Dread()
- 7.9.1.6. surf3Dwrite()
- 7.9.1.7. surf1DLayeredread()
- 7.9.1.8. surf1DLayeredwrite()
- 7.9.1.9. surf2DLayeredread()
- 7.9.1.10. surf2DLayeredwrite()
- 7.9.1.11. surfCubemapread()
- 7.9.1.12. surfCubemapwrite()
- 7.9.1.13. surfCubemapLayeredread()
- 7.9.1.14. surfCubemapLayeredwrite()
- 7.10. Read-Only Data Cache Load Function
- 7.11. Load Functions Using Cache Hints
- 7.12. Store Functions Using Cache Hints
- 7.13. Time Function
- 7.14. Atomic Functions
- 7.15. Address Space Predicate Functions
- 7.16. Address Space Conversion Functions
- 7.17. Alloca Function
- 7.18. Compiler Optimization Hint Functions
- 7.19. Warp Vote Functions
- 7.20. Warp Match Functions
- 7.21. Warp Reduce Functions
- 7.22. Warp Shuffle Functions
- 7.23. Nanosleep Function
- 7.24. Warp Matrix Functions
- 7.25. DPX
- 7.26. Asynchronous Barrier
- 7.26.1. Simple Synchronization Pattern
- 7.26.2. Temporal Splitting and Five Stages of Synchronization
- 7.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation
- 7.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
- 7.26.5. Spatial Partitioning (also known as Warp Specialization)
- 7.26.6. Early Exit (Dropping out of Participation)
- 7.26.7. Completion Function
- 7.26.8. Memory Barrier Primitives Interface
- 7.27. Asynchronous Data Copies
- 7.28. Asynchronous Data Copies using cuda::pipeline
- 7.29. Asynchronous Data Copies using the Tensor Memory Accelerator (TMA)
- 7.30. Encoding a Tensor Map on Device
- 7.31. Profiler Counter Function
- 7.32. Assertion
- 7.33. Trap Function
- 7.34. Breakpoint Function
- 7.35. Formatted Output
- 7.36. Dynamic Global Memory Allocation and Operations
- 7.37. Execution Configuration
- 7.38. Launch Bounds
- 7.39. Maximum Number of Registers per Thread
- 7.40. #pragma unroll
- 7.41. SIMD Video Instructions
- 7.42. Diagnostic Pragmas
- 8. Cooperative Groups
- 8.1. Introduction
- 8.2. What’s New in Cooperative Groups
- 8.3. Programming Model Concept
- 8.4. Group Types
- 8.5. Group Partitioning
- 8.6. Group Collectives
- 8.7. Grid Synchronization
- 8.8. Multi-Device Synchronization
- 9. CUDA Dynamic Parallelism
- 9.1. Introduction
- 9.2. Execution Environment and Memory Model
- 9.3. Programming Interface
- 9.3.1. CUDA C++ Reference
- 9.3.2. Device-side Launch from PTX
- 9.3.3. Toolkit Support for Dynamic Parallelism
- 9.4. Programming Guidelines
- 9.5. CDP2 vs CDP1
- 9.6. Legacy CUDA Dynamic Parallelism (CDP1)
- 9.6.1. Execution Environment and Memory Model (CDP1)
- 9.6.2. Programming Interface (CDP1)
- 9.6.2.1. CUDA C++ Reference (CDP1)
- 9.6.2.2. Device-side Launch from PTX (CDP1)
- 9.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)
- 9.6.3. Programming Guidelines (CDP1)
- 10. Virtual Memory Management
- 11. Stream Ordered Memory Allocator
- 11.1. Introduction
- 11.2. Query for Support
- 11.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)
- 11.4. Memory Pools and the cudaMemPool_t
- 11.5. Default/Implicit Pools
- 11.6. Explicit Pools
- 11.7. Physical Page Caching Behavior
- 11.8. Resource Usage Statistics
- 11.9. Memory Reuse Policies
- 11.10. Device Accessibility for Multi-GPU Support
- 11.11. IPC Memory Pools
- 11.12. Synchronization API Actions
- 11.13. Addendums
- 12. Graph Memory Nodes
- 13. Mathematical Functions
- 14. C++ Language Support
- 14.1. C++11 Language Features
- 14.2. C++14 Language Features
- 14.3. C++17 Language Features
- 14.4. C++20 Language Features
- 14.5. Restrictions
- 14.5.1. Host Compiler Extensions
- 14.5.2. Preprocessor Symbols
- 14.5.3. Qualifiers
- 14.5.4. Pointers
- 14.5.5. Operators
- 14.5.6. Run Time Type Information (RTTI)
- 14.5.7. Exception Handling
- 14.5.8. Standard Library
- 14.5.9. Namespace Reservations
- 14.5.10. Functions
- 14.5.10.1. External Linkage
- 14.5.10.2. Implicitly-Declared and Explicitly-Defaulted Functions
- 14.5.10.3. Function Parameters
- 14.5.10.4. Static Variables within Functions
- 14.5.10.5. Function Pointers
- 14.5.10.6. Function Recursion
- 14.5.10.7. Friend Functions
- 14.5.10.8. Operator Function
- 14.5.10.9. Allocation and Deallocation Functions
- 14.5.11. Classes
- 14.5.12. Templates
- 14.5.13. Trigraphs and Digraphs
- 14.5.14. Const-Qualified Variables
- 14.5.15. Long Double
- 14.5.16. Deprecation Annotation
- 14.5.17. Noreturn Annotation
- 14.5.18. [[likely]] / [[unlikely]] Standard Attributes
- 14.5.19. const and pure GNU Attributes
- 14.5.20. __nv_pure__ Attribute
- 14.5.21. Intel Host Compiler Specific
- 14.5.22. C++11 Features
- 14.5.22.1. Lambda Expressions
- 14.5.22.2. std::initializer_list
- 14.5.22.3. Rvalue references
- 14.5.22.4. Constexpr functions and function templates
- 14.5.22.5. Constexpr variables
- 14.5.22.6. Inline namespaces
- 14.5.22.7. thread_local
- 14.5.22.8. __global__ functions and function templates
- 14.5.22.9. __managed__ and __shared__ variables
- 14.5.22.10. Defaulted functions
- 14.5.23. C++14 Features
- 14.5.24. C++17 Features
- 14.5.25. C++20 Features
- 14.6. Polymorphic Function Wrappers
- 14.7. Extended Lambdas
- 14.8. Code Samples
- 15. Texture Fetching
- 16. Compute Capabilities
- 17. Driver API
- 17.1. Context
- 17.2. Module
- 17.3. Kernel Execution
- 17.4. Interoperability between Runtime and Driver APIs
- 17.5. Driver Entry Point Access
- 17.5.1. Introduction
- 17.5.2. Driver Function Typedefs
- 17.5.3. Driver Function Retrieval
- 17.5.4. Potential Implications with cuGetProcAddress
- 17.5.4.1. Implications with cuGetProcAddress vs Implicit Linking
- 17.5.4.2. Compile Time vs Runtime Version Usage in cuGetProcAddress
- 17.5.4.3. API Version Bumps with Explicit Version Checks
- 17.5.4.4. Issues with Runtime API Usage
- 17.5.4.5. Issues with Runtime API and Dynamic Versioning
- 17.5.4.6. Issues with Runtime API allowing CUDA Version
- 17.5.4.7. Implications to API/ABI
- 17.5.5. Determining cuGetProcAddress Failure Reasons
- 18. CUDA Environment Variables
- 19. Unified Memory Programming
- 19.1. Unified Memory Introduction
- 19.1.1. System Requirements for Unified Memory
- 19.1.2. Programming Model
- 19.1.2.1. Allocation APIs for System-Allocated Memory
- 19.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 19.1.2.3. Global-Scope Managed Variables Using __managed__
- 19.1.2.4. Difference between Unified Memory and Mapped Memory
- 19.1.2.5. Pointer Attributes
- 19.1.2.6. Runtime Detection of Unified Memory Support Level
- 19.1.2.7. GPU Memory Oversubscription
- 19.1.2.8. Performance Hints
- 19.2. Unified Memory on Devices with Full CUDA Unified Memory Support
- 19.2.1. System-Allocated Memory: In-Depth Examples
- 19.2.2. Performance Tuning
- 19.3. Unified Memory on Devices without Full CUDA Unified Memory Support
- 19.3.1. Unified Memory on Devices with Only CUDA Managed Memory Support
- 19.3.2. Unified Memory on Windows or Devices with Compute Capability 5.x
- 19.3.2.1. Data Migration and Coherency
- 19.3.2.2. GPU Memory Oversubscription
- 19.3.2.3. Multi-GPU
- 19.3.2.4. Coherency and Concurrency
- 19.3.2.4.1. GPU Exclusive Access To Managed Memory
- 19.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 19.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 19.3.2.4.4. Stream Association Examples
- 19.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 19.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 19.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
- 20. Lazy Loading
- 21. Notices