Contents
- 1. Introduction
- 2. Programming Model
- 3. Programming Interface
- 3.1. Compilation with NVCC
- 3.2. CUDA Runtime
- 3.2.1. Initialization
- 3.2.2. Device Memory
- 3.2.3. Device Memory L2 Access Management
- 3.2.3.1. L2 Cache Set-Aside for Persisting Accesses
- 3.2.3.2. L2 Policy for Persisting Accesses
- 3.2.3.3. L2 Access Properties
- 3.2.3.4. L2 Persistence Example
- 3.2.3.5. Reset L2 Access to Normal
- 3.2.3.6. Manage Utilization of L2 Set-Aside Cache
- 3.2.3.7. Query L2 Cache Properties
- 3.2.3.8. Control L2 Cache Set-Aside Size for Persisting Memory Access
- 3.2.4. Shared Memory
- 3.2.5. Distributed Shared Memory
- 3.2.6. Page-Locked Host Memory
- 3.2.7. Memory Synchronization Domains
- 3.2.8. Asynchronous Concurrent Execution
- 3.2.8.1. Concurrent Execution between Host and Device
- 3.2.8.2. Concurrent Kernel Execution
- 3.2.8.3. Overlap of Data Transfer and Kernel Execution
- 3.2.8.4. Concurrent Data Transfers
- 3.2.8.5. Streams
- 3.2.8.6. Programmatic Dependent Launch and Synchronization
- 3.2.8.7. CUDA Graphs
- 3.2.8.7.1. Graph Structure
- 3.2.8.7.2. Creating a Graph Using Graph APIs
- 3.2.8.7.3. Creating a Graph Using Stream Capture
- 3.2.8.7.4. CUDA User Objects
- 3.2.8.7.5. Updating Instantiated Graphs
- 3.2.8.7.6. Using Graph APIs
- 3.2.8.7.7. Device Graph Launch
- 3.2.8.7.8. Conditional Graph Nodes
- 3.2.8.8. Events
- 3.2.8.9. Synchronous Calls
- 3.2.9. Multi-Device System
- 3.2.10. Unified Virtual Address Space
- 3.2.11. Interprocess Communication
- 3.2.12. Error Checking
- 3.2.13. Call Stack
- 3.2.14. Texture and Surface Memory
- 3.2.15. Graphics Interoperability
- 3.2.16. External Resource Interoperability
- 3.2.16.1. Vulkan Interoperability
- 3.2.16.1.1. Matching Device UUIDs
- 3.2.16.1.2. Importing Memory Objects
- 3.2.16.1.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.1.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.1.5. Importing Synchronization Objects
- 3.2.16.1.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.2. OpenGL Interoperability
- 3.2.16.3. Direct3D 12 Interoperability
- 3.2.16.3.1. Matching Device LUIDs
- 3.2.16.3.2. Importing Memory Objects
- 3.2.16.3.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.3.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.3.5. Importing Synchronization Objects
- 3.2.16.3.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.4. Direct3D 11 Interoperability
- 3.2.16.4.1. Matching Device LUIDs
- 3.2.16.4.2. Importing Memory Objects
- 3.2.16.4.3. Mapping Buffers onto Imported Memory Objects
- 3.2.16.4.4. Mapping Mipmapped Arrays onto Imported Memory Objects
- 3.2.16.4.5. Importing Synchronization Objects
- 3.2.16.4.6. Signaling/Waiting on Imported Synchronization Objects
- 3.2.16.5. NVIDIA Software Communication Interface Interoperability (NVSCI)
- 3.3. Versioning and Compatibility
- 3.4. Compute Modes
- 3.5. Mode Switches
- 3.6. Tesla Compute Cluster Mode for Windows
- 4. Hardware Implementation
- 5. Performance Guidelines
- 6. CUDA-Enabled GPUs
- 7. C++ Language Extensions
- 7.1. Function Execution Space Specifiers
- 7.2. Variable Memory Space Specifiers
- 7.3. Built-in Vector Types
- 7.4. Built-in Variables
- 7.5. Memory Fence Functions
- 7.6. Synchronization Functions
- 7.7. Mathematical Functions
- 7.8. Texture Functions
- 7.8.1. Texture Object API
- 7.8.1.1. tex1Dfetch()
- 7.8.1.2. tex1D()
- 7.8.1.3. tex1DLod()
- 7.8.1.4. tex1DGrad()
- 7.8.1.5. tex2D()
- 7.8.1.6. tex2D() for sparse CUDA arrays
- 7.8.1.7. tex2Dgather()
- 7.8.1.8. tex2Dgather() for sparse CUDA arrays
- 7.8.1.9. tex2DGrad()
- 7.8.1.10. tex2DGrad() for sparse CUDA arrays
- 7.8.1.11. tex2DLod()
- 7.8.1.12. tex2DLod() for sparse CUDA arrays
- 7.8.1.13. tex3D()
- 7.8.1.14. tex3D() for sparse CUDA arrays
- 7.8.1.15. tex3DLod()
- 7.8.1.16. tex3DLod() for sparse CUDA arrays
- 7.8.1.17. tex3DGrad()
- 7.8.1.18. tex3DGrad() for sparse CUDA arrays
- 7.8.1.19. tex1DLayered()
- 7.8.1.20. tex1DLayeredLod()
- 7.8.1.21. tex1DLayeredGrad()
- 7.8.1.22. tex2DLayered()
- 7.8.1.23. tex2DLayered() for Sparse CUDA Arrays
- 7.8.1.24. tex2DLayeredLod()
- 7.8.1.25. tex2DLayeredLod() for sparse CUDA arrays
- 7.8.1.26. tex2DLayeredGrad()
- 7.8.1.27. tex2DLayeredGrad() for sparse CUDA arrays
- 7.8.1.28. texCubemap()
- 7.8.1.29. texCubemapGrad()
- 7.8.1.30. texCubemapLod()
- 7.8.1.31. texCubemapLayered()
- 7.8.1.32. texCubemapLayeredGrad()
- 7.8.1.33. texCubemapLayeredLod()
- 7.9. Surface Functions
- 7.9.1. Surface Object API
- 7.9.1.1. surf1Dread()
- 7.9.1.2. surf1Dwrite()
- 7.9.1.3. surf2Dread()
- 7.9.1.4. surf2Dwrite()
- 7.9.1.5. surf3Dread()
- 7.9.1.6. surf3Dwrite()
- 7.9.1.7. surf1DLayeredread()
- 7.9.1.8. surf1DLayeredwrite()
- 7.9.1.9. surf2DLayeredread()
- 7.9.1.10. surf2DLayeredwrite()
- 7.9.1.11. surfCubemapread()
- 7.9.1.12. surfCubemapwrite()
- 7.9.1.13. surfCubemapLayeredread()
- 7.9.1.14. surfCubemapLayeredwrite()
- 7.10. Read-Only Data Cache Load Function
- 7.11. Load Functions Using Cache Hints
- 7.12. Store Functions Using Cache Hints
- 7.13. Time Function
- 7.14. Atomic Functions
- 7.15. Address Space Predicate Functions
- 7.16. Address Space Conversion Functions
- 7.17. Alloca Function
- 7.18. Compiler Optimization Hint Functions
- 7.19. Warp Vote Functions
- 7.20. Warp Match Functions
- 7.21. Warp Reduce Functions
- 7.22. Warp Shuffle Functions
- 7.23. Nanosleep Function
- 7.24. Warp Matrix Functions
- 7.25. DPX
- 7.26. Asynchronous Barrier
- 7.26.1. Simple Synchronization Pattern
- 7.26.2. Temporal Splitting and Five Stages of Synchronization
- 7.26.3. Bootstrap Initialization, Expected Arrival Count, and Participation
- 7.26.4. A Barrier’s Phase: Arrival, Countdown, Completion, and Reset
- 7.26.5. Spatial Partitioning (also known as Warp Specialization)
- 7.26.6. Early Exit (Dropping out of Participation)
- 7.26.7. Completion Function
- 7.26.8. Memory Barrier Primitives Interface
- 7.27. Asynchronous Data Copies
- 7.28. Asynchronous Data Copies using cuda::pipeline
- 7.29. Asynchronous Data Copies using the Tensor Memory Accelerator (TMA)
- 7.30. Encoding a Tensor Map on Device
- 7.31. Profiler Counter Function
- 7.32. Assertion
- 7.33. Trap Function
- 7.34. Breakpoint Function
- 7.35. Formatted Output
- 7.36. Dynamic Global Memory Allocation and Operations
- 7.37. Execution Configuration
- 7.38. Launch Bounds
- 7.39. Maximum Number of Registers per Thread
- 7.40. #pragma unroll
- 7.41. SIMD Video Instructions
- 7.42. Diagnostic Pragmas
- 8. Cooperative Groups
- 8.1. Introduction
- 8.2. What’s New in Cooperative Groups
- 8.3. Programming Model Concept
- 8.4. Group Types
- 8.5. Group Partitioning
- 8.6. Group Collectives
- 8.7. Grid Synchronization
- 8.8. Multi-Device Synchronization
- 9. CUDA Dynamic Parallelism
- 9.1. Introduction
- 9.2. Execution Environment and Memory Model
- 9.3. Programming Interface
- 9.3.1. CUDA C++ Reference
- 9.3.2. Device-side Launch from PTX
- 9.3.3. Toolkit Support for Dynamic Parallelism
- 9.4. Programming Guidelines
- 9.5. CDP2 vs CDP1
- 9.6. Legacy CUDA Dynamic Parallelism (CDP1)
- 9.6.1. Execution Environment and Memory Model (CDP1)
- 9.6.2. Programming Interface (CDP1)
- 9.6.2.1. CUDA C++ Reference (CDP1)
- 9.6.2.2. Device-side Launch from PTX (CDP1)
- 9.6.2.3. Toolkit Support for Dynamic Parallelism (CDP1)
- 9.6.3. Programming Guidelines (CDP1)
- 10. Virtual Memory Management
- 11. Stream Ordered Memory Allocator
- 11.1. Introduction
- 11.2. Query for Support
- 11.3. API Fundamentals (cudaMallocAsync and cudaFreeAsync)
- 11.4. Memory Pools and the cudaMemPool_t
- 11.5. Default/Implicit Pools
- 11.6. Explicit Pools
- 11.7. Physical Page Caching Behavior
- 11.8. Resource Usage Statistics
- 11.9. Memory Reuse Policies
- 11.10. Device Accessibility for Multi-GPU Support
- 11.11. IPC Memory Pools
- 11.12. Synchronization API Actions
- 11.13. Addendums
- 12. Graph Memory Nodes
- 13. Mathematical Functions
- 14. C++ Language Support
- 14.1. C++11 Language Features
- 14.2. C++14 Language Features
- 14.3. C++17 Language Features
- 14.4. C++20 Language Features
- 14.5. Restrictions
- 14.5.1. Host Compiler Extensions
- 14.5.2. Preprocessor Symbols
- 14.5.3. Qualifiers
- 14.5.4. Pointers
- 14.5.5. Operators
- 14.5.6. Run Time Type Information (RTTI)
- 14.5.7. Exception Handling
- 14.5.8. Standard Library
- 14.5.9. Namespace Reservations
- 14.5.10. Functions
- 14.5.10.1. External Linkage
- 14.5.10.2. Implicitly-Declared and Explicitly-Defaulted Functions
- 14.5.10.3. Function Parameters
- 14.5.10.4. Static Variables within Functions
- 14.5.10.5. Function Pointers
- 14.5.10.6. Function Recursion
- 14.5.10.7. Friend Functions
- 14.5.10.8. Operator Function
- 14.5.10.9. Allocation and Deallocation Functions
- 14.5.11. Classes
- 14.5.12. Templates
- 14.5.13. Trigraphs and Digraphs
- 14.5.14. Const-Qualified Variables
- 14.5.15. Long Double
- 14.5.16. Deprecation Annotation
- 14.5.17. Noreturn Annotation
- 14.5.18. [[likely]] / [[unlikely]] Standard Attributes
- 14.5.19. const and pure GNU Attributes
- 14.5.20. __nv_pure__ Attribute
- 14.5.21. Intel Host Compiler Specific
- 14.5.22. C++11 Features
- 14.5.22.1. Lambda Expressions
- 14.5.22.2. std::initializer_list
- 14.5.22.3. Rvalue references
- 14.5.22.4. Constexpr functions and function templates
- 14.5.22.5. Constexpr variables
- 14.5.22.6. Inline namespaces
- 14.5.22.7. thread_local
- 14.5.22.8. __global__ functions and function templates
- 14.5.22.9. __managed__ and __shared__ variables
- 14.5.22.10. Defaulted functions
- 14.5.23. C++14 Features
- 14.5.24. C++17 Features
- 14.5.25. C++20 Features
- 14.6. Polymorphic Function Wrappers
- 14.7. Extended Lambdas
- 14.8. Code Samples
- 15. Texture Fetching
- 16. Compute Capabilities
- 17. Driver API
- 17.1. Context
- 17.2. Module
- 17.3. Kernel Execution
- 17.4. Interoperability between Runtime and Driver APIs
- 17.5. Driver Entry Point Access
- 17.5.1. Introduction
- 17.5.2. Driver Function Typedefs
- 17.5.3. Driver Function Retrieval
- 17.5.4. Potential Implications with cuGetProcAddress
- 17.5.4.1. Implications with cuGetProcAddress vs Implicit Linking
- 17.5.4.2. Compile Time vs Runtime Version Usage in cuGetProcAddress
- 17.5.4.3. API Version Bumps with Explicit Version Checks
- 17.5.4.4. Issues with Runtime API Usage
- 17.5.4.5. Issues with Runtime API and Dynamic Versioning
- 17.5.4.6. Issues with Runtime API allowing CUDA Version
- 17.5.4.7. Implications to API/ABI
- 17.5.5. Determining cuGetProcAddress Failure Reasons
- 18. CUDA Environment Variables
- 19. Unified Memory Programming
- 19.1. Unified Memory Introduction
- 19.1.1. System Requirements for Unified Memory
- 19.1.2. Programming Model
- 19.1.2.1. Allocation APIs for System-Allocated Memory
- 19.1.2.2. Allocation API for CUDA Managed Memory: cudaMallocManaged()
- 19.1.2.3. Global-Scope Managed Variables Using __managed__
- 19.1.2.4. Difference between Unified Memory and Mapped Memory
- 19.1.2.5. Pointer Attributes
- 19.1.2.6. Runtime Detection of Unified Memory Support Level
- 19.1.2.7. GPU Memory Oversubscription
- 19.1.2.8. Performance Hints
- 19.2. Unified Memory on Devices with Full CUDA Unified Memory Support
- 19.2.1. System-Allocated Memory: In-Depth Examples
- 19.2.2. Performance Tuning
- 19.3. Unified Memory on Devices without Full CUDA Unified Memory Support
- 19.3.1. Unified Memory on Devices with Only CUDA Managed Memory Support
- 19.3.2. Unified Memory on Windows or Devices with Compute Capability 5.x
- 19.3.2.1. Data Migration and Coherency
- 19.3.2.2. GPU Memory Oversubscription
- 19.3.2.3. Multi-GPU
- 19.3.2.4. Coherency and Concurrency
- 19.3.2.4.1. GPU Exclusive Access To Managed Memory
- 19.3.2.4.2. Explicit Synchronization and Logical GPU Activity
- 19.3.2.4.3. Managing Data Visibility and Concurrent CPU + GPU Access with Streams
- 19.3.2.4.4. Stream Association Examples
- 19.3.2.4.5. Stream Attach With Multithreaded Host Programs
- 19.3.2.4.6. Advanced Topic: Modular Programs and Data Access Constraints
- 19.3.2.4.7. Memcpy()/Memset() Behavior With Stream-associated Unified Memory
- 20. Lazy Loading
- 21. Notices