Shaders
A shader specifies programmable operations that execute for each vertex, control point, tessellated vertex, primitive, fragment, or workgroup in the corresponding stage(s) of the graphics and compute pipelines.
Graphics pipelines include vertex shader execution as a result of primitive assembly, followed, if enabled, by tessellation control and evaluation shaders operating on patches, geometry shaders, if enabled, operating on primitives, and fragment shaders, if present, operating on fragments generated by Rasterization. In this specification, vertex, tessellation control, tessellation evaluation and geometry shaders are collectively referred to as pre-rasterization shader stages and occur in the logical pipeline before rasterization. The fragment shader occurs logically after rasterization.
Only the compute shader stage is included in a compute pipeline. Compute shaders operate on compute invocations in a workgroup.
Shaders can read from input variables, and read from and write to output variables. Input and output variables can be used to transfer data between shader stages, or to allow the shader to interact with values that exist in the execution environment. Similarly, the execution environment provides constants describing capabilities.
Shader variables are associated with execution environment-provided inputs and outputs using built-in decorations in the shader. The available decorations for each stage are documented in the following subsections.
Shader Objects
Shaders may be compiled and linked into pipeline objects as described in
the Pipelines chapter, or, if the shaderObject
feature is enabled, they may be compiled into
individual per-stage shader objects which can be bound on a command
buffer independently from one another.
Unlike pipelines, shader objects are not intrinsically tied to any specific
set of state.
Instead, state is specified dynamically in the command buffer.
Each shader object represents a single compiled shader stage, which may optionally be linked with one or more other stages.
Shader objects are represented by VkShaderEXT
handles:
// Provided by VK_EXT_shader_object
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkShaderEXT)
Shader Object Creation
Shader objects may be created from shader code provided as SPIR-V, or in an opaque, implementation-defined binary format specific to the physical device.
To create one or more shader objects, call:
// Provided by VK_EXT_shader_object
VkResult vkCreateShadersEXT(
VkDevice device,
uint32_t createInfoCount,
const VkShaderCreateInfoEXT* pCreateInfos,
const VkAllocationCallbacks* pAllocator,
VkShaderEXT* pShaders);
- device is the logical device that creates the shader objects.
- createInfoCount is the length of the pCreateInfos and pShaders arrays.
- pCreateInfos is a pointer to an array of VkShaderCreateInfoEXT structures.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pShaders is a pointer to an array of VkShaderEXT handles in which the resulting shader objects are returned.
When this function returns, whether or not it succeeds, it is guaranteed
that every element of pShaders
will have been overwritten by either
VK_NULL_HANDLE or a valid VkShaderEXT
handle.
This means that whenever shader creation fails, the application can
determine which shader the returned error pertains to by locating the first
VK_NULL_HANDLE element in pShaders
.
It also means that an application can reliably clean up from a failed call
by iterating over the pShaders
array and destroying every element that
is not VK_NULL_HANDLE.
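The failure-handling guarantee can be sketched in plain C. This is an illustration only, not Vulkan code: shader_handle, NULL_SHADER, first_failed_index, and destroy_all_valid are hypothetical stand-ins (for VkShaderEXT, VK_NULL_HANDLE, and application helpers), and the destroy_fn callback marks the place where an application would call vkDestroyShaderEXT.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VkShaderEXT and VK_NULL_HANDLE. */
typedef uint64_t shader_handle;
#define NULL_SHADER ((shader_handle)0)

/* Returns the index of the first null handle, or count if every element
 * is valid. After a failed creation call, this is the element that the
 * returned error pertains to. */
static size_t first_failed_index(const shader_handle *shaders, size_t count)
{
    assert(shaders != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (shaders[i] == NULL_SHADER) {
            return i;
        }
    }
    return count;
}

/* Destroys every non-null element and resets it to null.
 * destroy_fn stands in for a call to vkDestroyShaderEXT and may be NULL
 * in this sketch. Returns the number of handles that were destroyed. */
static size_t destroy_all_valid(shader_handle *shaders, size_t count,
                                void (*destroy_fn)(shader_handle))
{
    size_t destroyed = 0;
    assert(shaders != NULL || count == 0);
    for (size_t i = 0; i < count; ++i) {
        if (shaders[i] != NULL_SHADER) {
            if (destroy_fn != NULL) {
                destroy_fn(shaders[i]);
            }
            shaders[i] = NULL_SHADER;
            ++destroyed;
        }
    }
    return destroyed;
}
```

Because every element is either null or valid after the call returns, the cleanup loop needs no knowledge of which creation step failed.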
The VkShaderCreateInfoEXT
structure is defined as:
// Provided by VK_EXT_shader_object
typedef struct VkShaderCreateInfoEXT {
VkStructureType sType;
const void* pNext;
VkShaderCreateFlagsEXT flags;
VkShaderStageFlagBits stage;
VkShaderStageFlags nextStage;
VkShaderCodeTypeEXT codeType;
size_t codeSize;
const void* pCode;
const char* pName;
uint32_t setLayoutCount;
const VkDescriptorSetLayout* pSetLayouts;
uint32_t pushConstantRangeCount;
const VkPushConstantRange* pPushConstantRanges;
const VkSpecializationInfo* pSpecializationInfo;
} VkShaderCreateInfoEXT;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- flags is a bitmask of VkShaderCreateFlagBitsEXT describing additional parameters of the shader.
- stage is a VkShaderStageFlagBits value specifying a single shader stage.
- nextStage is a bitmask of VkShaderStageFlagBits specifying zero or more stages which may be used as a logically next bound stage when drawing with the shader bound.
- codeType is a VkShaderCodeTypeEXT value specifying the type of the shader code pointed to by pCode.
- codeSize is the size in bytes of the shader code pointed to by pCode.
- pCode is a pointer to the shader code to use to create the shader.
- pName is a pointer to a null-terminated UTF-8 string specifying the entry point name of the shader for this stage.
- setLayoutCount is the number of descriptor set layouts pointed to by pSetLayouts.
- pSetLayouts is a pointer to an array of VkDescriptorSetLayout objects used by the shader stage.
- pushConstantRangeCount is the number of push constant ranges pointed to by pPushConstantRanges.
- pPushConstantRanges is a pointer to an array of VkPushConstantRange structures used by the shader stage.
- pSpecializationInfo is a pointer to a VkSpecializationInfo structure, as described in Specialization Constants, or NULL.
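The difference between stage (exactly one stage bit) and nextStage (a mask of zero or more stage bits) can be sketched as follows. The STAGE_* constants are local stand-ins that are assumed to mirror the corresponding VkShaderStageFlagBits values, and is_single_stage and may_follow are hypothetical helper names, not part of the API.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Local stand-ins assumed to mirror a few VkShaderStageFlagBits values. */
enum {
    STAGE_VERTEX   = 0x00000001,
    STAGE_GEOMETRY = 0x00000008,
    STAGE_FRAGMENT = 0x00000010
};

/* True if exactly one bit is set, as expected of the stage member. */
static bool is_single_stage(uint32_t stage)
{
    return stage != 0 && (stage & (stage - 1)) == 0;
}

/* True if the single stage `candidate` is permitted by a nextStage mask. */
static bool may_follow(uint32_t next_stage_mask, uint32_t candidate)
{
    assert(is_single_stage(candidate));
    return (next_stage_mask & candidate) != 0;
}
```

For example, a vertex shader created with a nextStage mask of STAGE_GEOMETRY | STAGE_FRAGMENT declares that either of those stages may be the next bound stage when drawing.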
// Provided by VK_EXT_shader_object
typedef VkFlags VkShaderCreateFlagsEXT;
VkShaderCreateFlagsEXT
is a bitmask type for setting a mask of zero or
more VkShaderCreateFlagBitsEXT.
Possible values of the flags
member of VkShaderCreateInfoEXT,
specifying how a shader object is created, are:
// Provided by VK_EXT_shader_object
typedef enum VkShaderCreateFlagBitsEXT {
VK_SHADER_CREATE_LINK_STAGE_BIT_EXT = 0x00000001,
// Provided by VK_EXT_shader_object with VK_EXT_subgroup_size_control or VK_VERSION_1_3
VK_SHADER_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT_EXT = 0x00000002,
// Provided by VK_EXT_shader_object with VK_EXT_subgroup_size_control or VK_VERSION_1_3
VK_SHADER_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT = 0x00000004,
// Provided by VK_EXT_shader_object with VK_EXT_mesh_shader or VK_NV_mesh_shader
VK_SHADER_CREATE_NO_TASK_SHADER_BIT_EXT = 0x00000008,
// Provided by VK_EXT_shader_object with VK_KHR_device_group or VK_VERSION_1_1
VK_SHADER_CREATE_DISPATCH_BASE_BIT_EXT = 0x00000010,
// Provided by VK_KHR_fragment_shading_rate with VK_EXT_shader_object
VK_SHADER_CREATE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_EXT = 0x00000020,
// Provided by VK_EXT_fragment_density_map with VK_EXT_shader_object
VK_SHADER_CREATE_FRAGMENT_DENSITY_MAP_ATTACHMENT_BIT_EXT = 0x00000040,
// Provided by VK_EXT_device_generated_commands
VK_SHADER_CREATE_INDIRECT_BINDABLE_BIT_EXT = 0x00000080,
} VkShaderCreateFlagBitsEXT;
- VK_SHADER_CREATE_LINK_STAGE_BIT_EXT specifies that a shader is linked to all other shaders created in the same vkCreateShadersEXT call whose VkShaderCreateInfoEXT structures' flags include VK_SHADER_CREATE_LINK_STAGE_BIT_EXT.
- VK_SHADER_CREATE_ALLOW_VARYING_SUBGROUP_SIZE_BIT_EXT specifies that the SubgroupSize may vary in a task, mesh, or compute shader.
- VK_SHADER_CREATE_REQUIRE_FULL_SUBGROUPS_BIT_EXT specifies that the subgroup sizes must be launched with all invocations active in a task, mesh, or compute shader.
- VK_SHADER_CREATE_NO_TASK_SHADER_BIT_EXT specifies that a mesh shader must only be used without a task shader. Otherwise, the mesh shader must only be used with a task shader.
- VK_SHADER_CREATE_DISPATCH_BASE_BIT_EXT specifies that a compute shader can be used with vkCmdDispatchBase with a non-zero base workgroup.
- VK_SHADER_CREATE_FRAGMENT_SHADING_RATE_ATTACHMENT_BIT_EXT specifies that a fragment shader can be used with a fragment shading rate attachment.
- VK_SHADER_CREATE_FRAGMENT_DENSITY_MAP_ATTACHMENT_BIT_EXT specifies that a fragment shader can be used with a fragment density map attachment.
- VK_SHADER_CREATE_INDIRECT_BINDABLE_BIT_EXT specifies that the shader can be used in combination with Device-Generated Commands.
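The linking rule for VK_SHADER_CREATE_LINK_STAGE_BIT_EXT amounts to a filter over the flags of the create infos passed to one call. The sketch below works on a plain array of flag words; LINK_STAGE_BIT copies the value 0x00000001 from the enumeration, and collect_linked is a hypothetical helper name.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Value of VK_SHADER_CREATE_LINK_STAGE_BIT_EXT as listed in the enum. */
#define LINK_STAGE_BIT 0x00000001u

/* Writes the indices of all create infos whose flags include the link bit
 * into out (which must have room for count entries) and returns how many
 * were found. Those shaders form the linked set for the call. */
static size_t collect_linked(const uint32_t *flags, size_t count, size_t *out)
{
    size_t n = 0;
    assert(flags != NULL && out != NULL);
    for (size_t i = 0; i < count; ++i) {
        if ((flags[i] & LINK_STAGE_BIT) != 0) {
            out[n++] = i;
        }
    }
    return n;
}
```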
Shader objects can be created using different types of shader code.
Possible values of VkShaderCreateInfoEXT::codeType are:
// Provided by VK_EXT_shader_object
typedef enum VkShaderCodeTypeEXT {
VK_SHADER_CODE_TYPE_BINARY_EXT = 0,
VK_SHADER_CODE_TYPE_SPIRV_EXT = 1,
} VkShaderCodeTypeEXT;
- VK_SHADER_CODE_TYPE_BINARY_EXT specifies shader code in an opaque, implementation-defined binary format specific to the physical device.
- VK_SHADER_CODE_TYPE_SPIRV_EXT specifies shader code in SPIR-V format.
Binary Shader Code
Binary shader code can be retrieved from a shader object using the command:
// Provided by VK_EXT_shader_object
VkResult vkGetShaderBinaryDataEXT(
VkDevice device,
VkShaderEXT shader,
size_t* pDataSize,
void* pData);
- device is the logical device that the shader object was created from.
- shader is the shader object to retrieve binary shader code from.
- pDataSize is a pointer to a size_t value related to the size of the binary shader code, as described below.
- pData is either NULL or a pointer to a buffer.
If pData
is NULL
, then the size of the binary shader code of the
shader object, in bytes, is returned in pDataSize
.
Otherwise, pDataSize
must point to a variable set by the application
to the size of the buffer, in bytes, pointed to by pData
, and on
return the variable is overwritten with the amount of data actually written
to pData
.
If pDataSize
is less than the size of the binary shader code, nothing
is written to pData
, and VK_INCOMPLETE
will be returned instead
of VK_SUCCESS
.
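The size-query contract can be modeled in a few lines of C. This is a reference model of the behavior described above, not the Vulkan entry point: get_binary, result, RESULT_SUCCESS, and RESULT_INCOMPLETE are hypothetical stand-ins, and setting the size to 0 in the too-small case is an assumption based on reading "the amount of data actually written" literally.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical result codes standing in for VK_SUCCESS and VK_INCOMPLETE. */
typedef enum { RESULT_SUCCESS, RESULT_INCOMPLETE } result;

/* Models the documented size/data contract for a blob of blob_size bytes. */
static result get_binary(const void *blob, size_t blob_size,
                         size_t *data_size, void *data)
{
    assert(data_size != NULL);
    if (data == NULL) {
        /* Size query: report the required size and write nothing. */
        *data_size = blob_size;
        return RESULT_SUCCESS;
    }
    if (*data_size < blob_size) {
        /* Buffer too small: nothing is written (assumed: size becomes 0). */
        *data_size = 0;
        return RESULT_INCOMPLETE;
    }
    memcpy(data, blob, blob_size);
    *data_size = blob_size; /* amount actually written */
    return RESULT_SUCCESS;
}
```

A caller would typically query the size with a NULL buffer, allocate that many bytes, and call again with the allocated buffer.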
Binary shader code retrieved using vkGetShaderBinaryDataEXT
can be
passed to a subsequent call to vkCreateShadersEXT on a compatible
physical device by specifying VK_SHADER_CODE_TYPE_BINARY_EXT
in the
codeType
member of VkShaderCreateInfoEXT
.
The shader code returned by repeated calls to this function with the same
VkShaderEXT
is guaranteed to be invariant for the lifetime of the
VkShaderEXT
object.
Binary Shader Compatibility
Binary shader compatibility means that binary shader code returned from a call to vkGetShaderBinaryDataEXT can be passed to a later call to vkCreateShadersEXT, potentially on a different logical and/or physical device, and that this will result in the successful creation of a shader object functionally equivalent to the shader object that the code was originally queried from.
Binary shader code queried from vkGetShaderBinaryDataEXT is not guaranteed to be compatible across all devices, but implementations are required to provide some compatibility guarantees. Applications may determine binary shader compatibility using either (or both) of two mechanisms.
Guaranteed compatibility of shader binaries is expressed through a
combination of the shaderBinaryUUID
and shaderBinaryVersion
members of the VkPhysicalDeviceShaderObjectPropertiesEXT structure
queried from a physical device.
Binary shaders retrieved from a physical device with a certain
shaderBinaryUUID
are guaranteed to be compatible with all other
physical devices reporting the same shaderBinaryUUID
and the same or
higher shaderBinaryVersion
.
Whenever a new version of an implementation incorporates any changes that
affect the output of vkGetShaderBinaryDataEXT, the implementation
should either increment shaderBinaryVersion
if binary shader code
retrieved from older versions remains compatible with the new
implementation, or else replace shaderBinaryUUID
with a new value if
backward compatibility has been broken.
Binary shader code queried from a device with a matching
shaderBinaryUUID
and lower shaderBinaryVersion
relative to the
device on which vkCreateShadersEXT is being called may be suboptimal
for the new device in ways that do not change shader functionality, but it
is still guaranteed to be usable to successfully create the shader
object(s).
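The guaranteed-compatibility rule reduces to a UUID comparison plus a version comparison. The sketch below assumes a 16-byte UUID (the value of VK_UUID_SIZE); binary_props is a hypothetical mirror of the two relevant members of VkPhysicalDeviceShaderObjectPropertiesEXT, and guaranteed_compatible is an illustrative helper name.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define UUID_SIZE 16 /* stand-in for VK_UUID_SIZE */

/* Hypothetical mirror of the two relevant property members. */
typedef struct {
    uint8_t  shaderBinaryUUID[UUID_SIZE];
    uint32_t shaderBinaryVersion;
} binary_props;

/* True when binary code retrieved on `producer` is guaranteed to be usable
 * on `consumer`: same UUID, and the same or a higher version on the
 * consumer. A false result does not mean creation will fail, only that
 * success is not guaranteed. */
static bool guaranteed_compatible(const binary_props *producer,
                                  const binary_props *consumer)
{
    assert(producer != NULL && consumer != NULL);
    return memcmp(producer->shaderBinaryUUID, consumer->shaderBinaryUUID,
                  UUID_SIZE) == 0
        && consumer->shaderBinaryVersion >= producer->shaderBinaryVersion;
}
```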
In addition to the shader compatibility guarantees described above, it is
valid for an application to call vkCreateShadersEXT with binary shader
code created on a device with a different or unknown shaderBinaryUUID
and/or higher shaderBinaryVersion
.
In this case, the implementation may use any unspecified means of its
choosing to determine whether the provided binary shader code is usable.
If it is, vkCreateShadersEXT must return VK_SUCCESS
, and the
created shader object is guaranteed to be valid.
Otherwise, in the absence of some error, vkCreateShadersEXT must
return VK_INCOMPATIBLE_SHADER_BINARY_EXT
to indicate that the provided
binary shader code is not compatible with the device.
Binding Shader Objects
Once shader objects have been created, they can be bound to the command buffer using the command:
// Provided by VK_EXT_shader_object
void vkCmdBindShadersEXT(
VkCommandBuffer commandBuffer,
uint32_t stageCount,
const VkShaderStageFlagBits* pStages,
const VkShaderEXT* pShaders);
- commandBuffer is the command buffer that the shader object will be bound to.
- stageCount is the length of the pStages and pShaders arrays.
- pStages is a pointer to an array of VkShaderStageFlagBits values specifying one stage per array index that is affected by the corresponding value in the pShaders array.
- pShaders is a pointer to an array of VkShaderEXT handles and/or VK_NULL_HANDLE values describing the shader binding operations to be performed on each stage in pStages.
When binding linked shaders, an application may bind them in any
combination of one or more calls to vkCmdBindShadersEXT
(i.e., shaders
that were created linked together do not need to be bound in the same
vkCmdBindShadersEXT
call).
Any shader object bound to a particular stage may be unbound by setting its
value in pShaders
to VK_NULL_HANDLE.
If pShaders
is NULL
, vkCmdBindShadersEXT
behaves as if
pShaders
was an array of stageCount
VK_NULL_HANDLE values
(i.e., any shaders bound to the stages specified in pStages
are
unbound).
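The binding and unbinding rules can be modeled as a table with one slot per stage. The sketch below is a simplified model of command buffer state, not Vulkan code: shader_handle, NULL_SHADER, bind_state, slot_of, and bind_shaders are hypothetical names, and stages are passed as single-bit flag values.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

typedef uint64_t shader_handle;        /* stand-in for VkShaderEXT */
#define NULL_SHADER ((shader_handle)0) /* stand-in for VK_NULL_HANDLE */
#define STAGE_SLOTS 32                 /* one slot per possible stage bit */

typedef struct {
    shader_handle bound[STAGE_SLOTS];
} bind_state;

/* Maps a single-bit stage flag to its slot index (the bit position). */
static unsigned slot_of(uint32_t stage_bit)
{
    unsigned slot = 0;
    assert(stage_bit != 0 && (stage_bit & (stage_bit - 1)) == 0);
    while ((stage_bit >>= 1) != 0) {
        ++slot;
    }
    return slot;
}

/* Applies one bind call. A NULL shaders array behaves like an array of
 * stage_count null handles, so every listed stage is unbound. */
static void bind_shaders(bind_state *state, size_t stage_count,
                         const uint32_t *stages, const shader_handle *shaders)
{
    assert(state != NULL && (stages != NULL || stage_count == 0));
    for (size_t i = 0; i < stage_count; ++i) {
        state->bound[slot_of(stages[i])] =
            (shaders != NULL) ? shaders[i] : NULL_SHADER;
    }
}
```

Stages that are not listed in a call keep their current binding, which is why linked shaders can be bound across several calls.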
Setting State
Whenever shader objects are used to issue drawing commands, the appropriate dynamic state setting commands must have been called to set the relevant state in the command buffer prior to drawing:
If a shader is bound to the VK_SHADER_STAGE_VERTEX_BIT
stage, the
following commands must have been called in the command buffer prior to
drawing:
If a shader is bound to the VK_SHADER_STAGE_TESSELLATION_CONTROL_BIT
stage, the following command must have been called in the command buffer
prior to drawing:
- vkCmdSetPatchControlPointsEXT, if primitiveTopology is VK_PRIMITIVE_TOPOLOGY_PATCH_LIST
If a shader is bound to the
VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT
stage, the following
command must have been called in the command buffer prior to drawing:
If rasterizerDiscardEnable
is VK_FALSE
, the following commands
must have been called in the command buffer prior to drawing:
- vkCmdSetAlphaToOneEnableEXT, if the alphaToOne feature is enabled on the device
- vkCmdSetLineWidth, if polygonMode is VK_POLYGON_MODE_LINE, or if a shader is bound to the VK_SHADER_STAGE_VERTEX_BIT stage and primitiveTopology is a line topology, or if a shader which outputs line primitives is bound to the VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT or VK_SHADER_STAGE_GEOMETRY_BIT stage
- vkCmdSetDepthCompareOp, if depthTestEnable is VK_TRUE
- vkCmdSetDepthBoundsTestEnable, if the depthBounds feature is enabled on the device
- vkCmdSetDepthBounds, if depthBoundsTestEnable is VK_TRUE
- vkCmdSetDepthBias or vkCmdSetDepthBias2EXT, if depthBiasEnable is VK_TRUE
- vkCmdSetDepthClampEnableEXT, if the depthClamp feature is enabled on the device
- vkCmdSetStencilOp, if stencilTestEnable is VK_TRUE
- vkCmdSetStencilCompareMask, if stencilTestEnable is VK_TRUE
- vkCmdSetStencilWriteMask, if stencilTestEnable is VK_TRUE
- vkCmdSetStencilReference, if stencilTestEnable is VK_TRUE
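Each entry in these lists is a predicate over the current dynamic state. As one example, the condition attached to vkCmdSetLineWidth can be written as a boolean function. The draw_state structure and its field names are hypothetical simplifications: a real application would derive them from the state it has set and from the shaders it has bound.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical summary of the state that the line width rule depends on. */
typedef struct {
    bool rasterizer_discard_enable;
    bool polygon_mode_is_line;       /* polygonMode is VK_POLYGON_MODE_LINE */
    bool vertex_shader_bound;
    bool topology_is_line;           /* primitiveTopology is a line topology */
    bool tess_or_geom_outputs_lines; /* such a shader is bound */
} draw_state;

/* True if this rule requires vkCmdSetLineWidth to have been called before
 * drawing. */
static bool line_width_required(const draw_state *s)
{
    assert(s != NULL);
    if (s->rasterizer_discard_enable) {
        return false; /* the whole list applies only when this is VK_FALSE */
    }
    return s->polygon_mode_is_line
        || (s->vertex_shader_bound && s->topology_is_line)
        || s->tess_or_geom_outputs_lines;
}
```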
If a shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT
stage, and
rasterizerDiscardEnable
is VK_FALSE
, the following commands
must have been called in the command buffer prior to drawing:
- vkCmdSetLogicOpEnableEXT, if the logicOp feature is enabled on the device
- vkCmdSetLogicOpEXT, if logicOpEnable is VK_TRUE
- vkCmdSetColorBlendEnableEXT and vkCmdSetColorWriteMaskEXT, if color attachments are bound, with values set for every color attachment in the render pass instance active at draw time
- vkCmdSetColorBlendEquationEXT or vkCmdSetColorBlendAdvancedEXT, if color attachments are bound, for every attachment whose element in pColorBlendEnables is VK_TRUE
- vkCmdSetBlendConstants, if any index in pColorBlendEnables is VK_TRUE, and the same index in pColorBlendEquations is a VkColorBlendEquationEXT structure with any VkBlendFactor member with a value of VK_BLEND_FACTOR_CONSTANT_COLOR, VK_BLEND_FACTOR_ONE_MINUS_CONSTANT_COLOR, VK_BLEND_FACTOR_CONSTANT_ALPHA, or VK_BLEND_FACTOR_ONE_MINUS_CONSTANT_ALPHA
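The vkCmdSetBlendConstants condition pairs each blend enable with the blend equation at the same index. The sketch below expresses that check with local types: blend_factor and blend_equation are hypothetical, reduced stand-ins for VkBlendFactor and VkColorBlendEquationEXT, and the FACTOR_* values do not correspond to the numeric values in the Vulkan headers.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical, reduced stand-in for VkBlendFactor. */
typedef enum {
    FACTOR_ZERO,
    FACTOR_ONE,
    FACTOR_SRC_ALPHA,
    FACTOR_CONSTANT_COLOR,
    FACTOR_ONE_MINUS_CONSTANT_COLOR,
    FACTOR_CONSTANT_ALPHA,
    FACTOR_ONE_MINUS_CONSTANT_ALPHA
} blend_factor;

/* Hypothetical stand-in holding only the blend factor members. */
typedef struct {
    blend_factor src_color, dst_color, src_alpha, dst_alpha;
} blend_equation;

static bool uses_constant(blend_factor f)
{
    return f == FACTOR_CONSTANT_COLOR
        || f == FACTOR_ONE_MINUS_CONSTANT_COLOR
        || f == FACTOR_CONSTANT_ALPHA
        || f == FACTOR_ONE_MINUS_CONSTANT_ALPHA;
}

/* True if any attachment has blending enabled and its equation at the same
 * index references the blend constants. */
static bool blend_constants_required(const bool *enables,
                                     const blend_equation *equations,
                                     size_t count)
{
    assert((enables != NULL && equations != NULL) || count == 0);
    for (size_t i = 0; i < count; ++i) {
        const blend_equation *e = &equations[i];
        if (enables[i] && (uses_constant(e->src_color)
                        || uses_constant(e->dst_color)
                        || uses_constant(e->src_alpha)
                        || uses_constant(e->dst_alpha))) {
            return true;
        }
    }
    return false;
}
```

An equation that references the constants on an attachment with blending disabled does not trigger the requirement.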
If the pipelineFragmentShadingRate
feature is enabled on the device, and a
shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT
stage, and
rasterizerDiscardEnable
is VK_FALSE
, the following command must
have been called in the command buffer prior to drawing:
If the geometryStreams
feature is
enabled on the device, and a shader is bound to the
VK_SHADER_STAGE_GEOMETRY_BIT
stage, the following command must have
been called in the command buffer prior to drawing:
If the VK_EXT_discard_rectangles
extension is enabled on the
device, and rasterizerDiscardEnable
is VK_FALSE
, the following
commands must have been called in the command buffer prior to drawing:
- vkCmdSetDiscardRectangleModeEXT, if discardRectangleEnable is VK_TRUE
- vkCmdSetDiscardRectangleEXT, if discardRectangleEnable is VK_TRUE
If the VK_EXT_conservative_rasterization
extension is enabled on the
device, and rasterizerDiscardEnable
is VK_FALSE
, the following
commands must have been called in the command buffer prior to drawing:
- vkCmdSetExtraPrimitiveOverestimationSizeEXT, if conservativeRasterizationMode is VK_CONSERVATIVE_RASTERIZATION_MODE_OVERESTIMATE_EXT
If the depthClipEnable
feature is
enabled on the device, the following command must have been called in the
command buffer prior to drawing:
If the VK_EXT_sample_locations
extension is enabled on the device,
and rasterizerDiscardEnable
is VK_FALSE
, the following commands
must have been called in the command buffer prior to drawing:
- vkCmdSetSampleLocationsEXT, if sampleLocationsEnable is VK_TRUE
If the VK_EXT_provoking_vertex
extension is enabled on the device,
and rasterizerDiscardEnable
is VK_FALSE
, and a shader is bound
to the VK_SHADER_STAGE_VERTEX_BIT
stage, the following command must
have been called in the command buffer prior to drawing:
If any of the stippledRectangularLines, stippledBresenhamLines, or
stippledSmoothLines features are enabled on the device, and
rasterizerDiscardEnable is VK_FALSE, and if polygonMode is
VK_POLYGON_MODE_LINE or a shader is bound to the
VK_SHADER_STAGE_VERTEX_BIT stage and primitiveTopology is a line
topology or a shader which outputs line primitives is bound to the
VK_SHADER_STAGE_TESSELLATION_EVALUATION_BIT or
VK_SHADER_STAGE_GEOMETRY_BIT stage, the following commands must have
been called in the command buffer prior to drawing:
- vkCmdSetLineStippleKHR, if stippledLineEnable is VK_TRUE
If the depthClipControl
feature is
enabled on the device, the following command must have been called in the
command buffer prior to drawing:
If the colorWriteEnable
feature is
enabled on the device, and a shader is bound to the
VK_SHADER_STAGE_FRAGMENT_BIT
stage, and rasterizerDiscardEnable
is VK_FALSE
, the following command must have been called in the
command buffer prior to drawing:
- vkCmdSetColorWriteEnableEXT, with values set for every color attachment in the render pass instance active at draw time
If the attachmentFeedbackLoopDynamicState feature is enabled on the device, and a
shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT
stage, and
rasterizerDiscardEnable
is VK_FALSE
, the following command must
have been called in the command buffer prior to drawing:
If the VK_NV_clip_space_w_scaling
extension is enabled on the
device, the following commands must have been called in the command buffer
prior to drawing:
- vkCmdSetViewportWScalingNV, if viewportWScalingEnable is VK_TRUE
If the depthClamp and depthClampControl
features are enabled on the device, and
depthClampEnable
is VK_TRUE
, the following command must have
been called in the command buffer prior to drawing:
If the VK_NV_viewport_swizzle
extension is enabled on the device,
the following command must have been called in the command buffer prior to
drawing:
If the VK_NV_fragment_coverage_to_color
extension is enabled on the
device, and a shader is bound to the VK_SHADER_STAGE_FRAGMENT_BIT
stage, and rasterizerDiscardEnable
is VK_FALSE
, the following
commands must have been called in the command buffer prior to drawing:
- vkCmdSetCoverageToColorLocationNV, if coverageToColorEnable is VK_TRUE
If the VK_NV_framebuffer_mixed_samples
extension is enabled on the
device, and rasterizerDiscardEnable
is VK_FALSE
, the following
commands must have been called in the command buffer prior to drawing:
- vkCmdSetCoverageModulationTableEnableNV, if coverageModulationMode is not VK_COVERAGE_MODULATION_MODE_NONE_NV
- vkCmdSetCoverageModulationTableNV, if coverageModulationTableEnable is VK_TRUE
If the coverageReductionMode
feature is enabled on the device, and rasterizerDiscardEnable
is
VK_FALSE
, the following command must have been called in the command
buffer prior to drawing:
If the representativeFragmentTest
feature is enabled on the device, and
rasterizerDiscardEnable
is VK_FALSE
, the following command must
have been called in the command buffer prior to drawing:
If the shadingRateImage
feature is
enabled on the device, and rasterizerDiscardEnable
is VK_FALSE
,
the following commands must have been called in the command buffer prior to
drawing:
- vkCmdSetViewportShadingRatePaletteNV, if shadingRateImageEnable is VK_TRUE
If the exclusiveScissor
feature is
enabled on the device, the following commands must have been called in the
command buffer prior to drawing:
- vkCmdSetExclusiveScissorNV, if any value in pExclusiveScissorEnables is VK_TRUE
State can be set at any time, either before or after shader objects are bound, but all required state must be set prior to issuing drawing commands.
If the commandBufferInheritance
feature is enabled, graphics and compute state is inherited from the
previously executed command buffer in the queue.
Any valid state inherited in this way does not need to be set again in the
current command buffer.
Interaction With Pipelines
Calling vkCmdBindShadersEXT causes the pipeline bind points
corresponding to each stage in pStages
to be
disturbed, meaning that any pipelines that had previously
been bound to those pipeline bind points are no longer bound.
If VK_PIPELINE_BIND_POINT_GRAPHICS
is disturbed (i.e., if
pStages
contains any graphics stage), any graphics pipeline state that
the previously bound pipeline did not specify as dynamic becomes undefined, and must be set in the command buffer before
issuing drawing commands using shader objects.
Calls to vkCmdBindPipeline likewise disturb the shader stage(s)
corresponding to pipelineBindPoint
, meaning that any shaders that had
previously been bound to any of those stages are no longer bound, even if
the pipeline was created without shaders for some of those stages.
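The mutual disturbance between pipelines and shader objects can be modeled as two operations that each clear the other kind of binding. The sketch below covers only the graphics bind point and, as a simplification, only five stage slots. graphics_binding, bind_graphics_pipeline, and bind_graphics_shader are hypothetical names, and 0 stands in for "nothing bound".

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define GRAPHICS_SLOTS 5 /* simplification: vertex through fragment only */

/* Hypothetical model of what is bound at the graphics bind point. */
typedef struct {
    uint64_t pipeline;                /* 0 means no pipeline bound */
    uint64_t shaders[GRAPHICS_SLOTS]; /* 0 means no shader bound */
} graphics_binding;

/* Models vkCmdBindPipeline for the graphics bind point: every graphics
 * shader stage is disturbed, even stages the pipeline has no shader for. */
static void bind_graphics_pipeline(graphics_binding *b, uint64_t pipeline)
{
    assert(b != NULL);
    for (size_t i = 0; i < GRAPHICS_SLOTS; ++i) {
        b->shaders[i] = 0;
    }
    b->pipeline = pipeline;
}

/* Models vkCmdBindShadersEXT with one graphics stage: the graphics
 * pipeline bind point is disturbed, so the pipeline is no longer bound. */
static void bind_graphics_shader(graphics_binding *b, size_t slot,
                                 uint64_t shader)
{
    assert(b != NULL && slot < GRAPHICS_SLOTS);
    b->pipeline = 0;
    b->shaders[slot] = shader;
}
```

The model does not track which state becomes undefined after a disturbance. It only shows which objects remain bound.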
Shader Object Destruction
To destroy a shader object, call:
// Provided by VK_EXT_shader_object
void vkDestroyShaderEXT(
VkDevice device,
VkShaderEXT shader,
const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the shader object.
- shader is the handle of the shader object to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
Destroying a shader object used by one or more command buffers in the recording or executable state causes those command buffers to move into the invalid state.
Shader Modules
Shader modules contain shader code and one or more entry points. Shaders are selected from a shader module by specifying an entry point as part of pipeline creation. The stages of a pipeline can use shaders that come from different modules. The shader code defining a shader module must be in the SPIR-V format, as described by the Vulkan Environment for SPIR-V appendix.
Shader modules are represented by VkShaderModule
handles:
// Provided by VK_VERSION_1_0
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkShaderModule)
To create a shader module, call:
// Provided by VK_VERSION_1_0
VkResult vkCreateShaderModule(
VkDevice device,
const VkShaderModuleCreateInfo* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkShaderModule* pShaderModule);
- device is the logical device that creates the shader module.
- pCreateInfo is a pointer to a VkShaderModuleCreateInfo structure.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pShaderModule is a pointer to a VkShaderModule handle in which the resulting shader module object is returned.
Once a shader module has been created, any entry points it contains can be used in pipeline shader stages as described in Compute Pipelines and Graphics Pipelines.
The VkShaderModuleCreateInfo
structure is defined as:
// Provided by VK_VERSION_1_0
typedef struct VkShaderModuleCreateInfo {
VkStructureType sType;
const void* pNext;
VkShaderModuleCreateFlags flags;
size_t codeSize;
const uint32_t* pCode;
} VkShaderModuleCreateInfo;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- flags is reserved for future use.
- codeSize is the size, in bytes, of the code pointed to by pCode.
- pCode is a pointer to code that is used to create the shader module. The type and format of the code is determined from the content of the memory addressed by pCode.
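Because codeSize is expressed in bytes while pCode points to 32-bit words, a common source of mistakes is passing a word count as the size. The sketch below shows the conversion and a minimal sanity check on the contents. It assumes host-endian SPIR-V, whose first word is the magic number 0x07230203 and whose header is five words long. code_size_bytes and looks_like_spirv are hypothetical helper names, and the check is not a substitute for real validation.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

#define SPIRV_MAGIC 0x07230203u /* first word of a host-endian SPIR-V module */
#define SPIRV_HEADER_WORDS 5u

/* Converts a count of 32-bit words into the byte count used by codeSize. */
static size_t code_size_bytes(size_t word_count)
{
    return word_count * sizeof(uint32_t);
}

/* Minimal plausibility check: whole number of words, at least a full
 * header, and the expected magic number in the first word. */
static bool looks_like_spirv(const uint32_t *code, size_t code_size)
{
    assert(code != NULL);
    return code_size >= SPIRV_HEADER_WORDS * sizeof(uint32_t)
        && code_size % sizeof(uint32_t) == 0
        && code[0] == SPIRV_MAGIC;
}
```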
// Provided by VK_VERSION_1_0
typedef VkFlags VkShaderModuleCreateFlags;
VkShaderModuleCreateFlags
is a bitmask type for setting a mask, but is
currently reserved for future use.
To use a VkValidationCacheEXT to cache shader validation results, add
a VkShaderModuleValidationCacheCreateInfoEXT structure to the
pNext
chain of the VkShaderModuleCreateInfo structure,
specifying the cache object to use.
The VkShaderModuleValidationCacheCreateInfoEXT
struct is defined as:
// Provided by VK_EXT_validation_cache
typedef struct VkShaderModuleValidationCacheCreateInfoEXT {
VkStructureType sType;
const void* pNext;
VkValidationCacheEXT validationCache;
} VkShaderModuleValidationCacheCreateInfoEXT;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- validationCache is the validation cache object from which the results of prior validation attempts will be read, and to which new validation results for this VkShaderModule will be written (if not already present).
To destroy a shader module, call:
// Provided by VK_VERSION_1_0
void vkDestroyShaderModule(
VkDevice device,
VkShaderModule shaderModule,
const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the shader module.
- shaderModule is the handle of the shader module to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
A shader module can be destroyed while pipelines created using its shaders are still in use.
Shader Module Identifiers
Shader modules have unique identifiers associated with them. To query an implementation-provided identifier, call:
// Provided by VK_EXT_shader_module_identifier
void vkGetShaderModuleIdentifierEXT(
VkDevice device,
VkShaderModule shaderModule,
VkShaderModuleIdentifierEXT* pIdentifier);
- device is the logical device that created the shader module.
- shaderModule is the handle of the shader module.
- pIdentifier is a pointer to the returned VkShaderModuleIdentifierEXT.
The identifier returned by the implementation must only depend on
shaderModuleIdentifierAlgorithmUUID
and information provided in the
VkShaderModuleCreateInfo which created shaderModule.
The implementation may return equal identifiers for two different
VkShaderModuleCreateInfo structures if the difference does not affect
pipeline compilation.
Identifiers are only meaningful on different VkDevice objects if the
device the identifier was queried from had the same
shaderModuleIdentifierAlgorithmUUID
as the device consuming the
identifier.
VkShaderModuleCreateInfo structures have unique identifiers associated with them. To query an implementation-provided identifier, call:
// Provided by VK_EXT_shader_module_identifier
void vkGetShaderModuleCreateInfoIdentifierEXT(
VkDevice device,
const VkShaderModuleCreateInfo* pCreateInfo,
VkShaderModuleIdentifierEXT* pIdentifier);
- device is the logical device that can create a VkShaderModule from pCreateInfo.
- pCreateInfo is a pointer to a VkShaderModuleCreateInfo structure.
- pIdentifier is a pointer to the returned VkShaderModuleIdentifierEXT.
The identifier returned by the implementation must only depend on
shaderModuleIdentifierAlgorithmUUID
and information provided in the
VkShaderModuleCreateInfo.
The implementation may return equal identifiers for two different
VkShaderModuleCreateInfo structures if the difference does not affect
pipeline compilation.
Identifiers are only meaningful on different VkDevice objects if the
device the identifier was queried from had the same
shaderModuleIdentifierAlgorithmUUID
as the device consuming the
identifier.
The identifier returned by the implementation in
vkGetShaderModuleCreateInfoIdentifierEXT must be equal to the
identifier returned by vkGetShaderModuleIdentifierEXT given equivalent
definitions of VkShaderModuleCreateInfo and any chained pNext
structures.
VkShaderModuleIdentifierEXT represents a shader module identifier returned by the implementation.
// Provided by VK_EXT_shader_module_identifier
typedef struct VkShaderModuleIdentifierEXT {
VkStructureType sType;
void* pNext;
uint32_t identifierSize;
uint8_t identifier[VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT];
} VkShaderModuleIdentifierEXT;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- identifierSize is the size, in bytes, of valid data returned in identifier.
- identifier is a buffer of opaque data specifying an identifier.
Any returned values beyond the first identifierSize
bytes are
undefined.
Implementations must return an identifierSize
greater than 0, and
less than or equal to VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT.
Two identifiers are considered equal if identifierSize
is equal and
the first identifierSize
bytes of identifier
compare equal.
Implementations may return a different identifierSize
for different
modules.
Implementations should ensure that identifierSize
is large enough to
uniquely define a shader module.
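The equality rule for identifiers compares the sizes first and then only the valid bytes, since bytes beyond identifierSize are undefined. A minimal sketch, using module_identifier as a hypothetical stand-in for VkShaderModuleIdentifierEXT without the sType and pNext members, and MAX_IDENTIFIER_SIZE as a stand-in for VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT with its documented value of 32:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Stand-in for VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT (documented as 32U). */
#define MAX_IDENTIFIER_SIZE 32u

/* Hypothetical stand-in holding only the identifier payload members. */
typedef struct {
    uint32_t identifierSize;
    uint8_t  identifier[MAX_IDENTIFIER_SIZE];
} module_identifier;

/* Two identifiers are equal if their sizes match and their first
 * identifierSize bytes compare equal. Trailing bytes are ignored. */
static bool identifiers_equal(const module_identifier *a,
                              const module_identifier *b)
{
    assert(a != NULL && b != NULL);
    assert(a->identifierSize <= MAX_IDENTIFIER_SIZE);
    assert(b->identifierSize <= MAX_IDENTIFIER_SIZE);
    return a->identifierSize == b->identifierSize
        && memcmp(a->identifier, b->identifier, a->identifierSize) == 0;
}
```

Comparing the whole 32-byte array with memcmp instead would be incorrect, because two equal identifiers may differ in their undefined trailing bytes.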
VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT
is the maximum length in bytes of a
shader module identifier, as returned in
VkShaderModuleIdentifierEXT::identifierSize.
#define VK_MAX_SHADER_MODULE_IDENTIFIER_SIZE_EXT 32U
Binding Shaders
Before a shader can be used it must first be bound to the command buffer.
Calling vkCmdBindPipeline binds all stages corresponding to the
VkPipelineBindPoint.
Calling vkCmdBindShadersEXT binds all stages in pStages.
The following table describes the relationship between shader stages and pipeline bind points:
Shader stage | Pipeline bind point | Behavior controlled |
---|---|---|
| | all drawing commands |
| | |
| | vkCmdTraceRaysNV, vkCmdTraceRaysKHR, and vkCmdTraceRaysIndirectKHR |
| | |
| | |
Shader Execution
At each stage of the pipeline, multiple invocations of a shader may execute simultaneously. Further, invocations of a single shader produced as the result of different commands may execute simultaneously. The relative execution order of invocations of the same shader type is undefined. Shader invocations may complete in a different order than that in which the primitives they originated from were drawn or dispatched by the application. However, fragment shader outputs are written to attachments in rasterization order.
The relative execution order of invocations of different shader types is largely undefined. However, when invoking a shader whose inputs are generated from a previous pipeline stage, the shader invocations from the previous stage are guaranteed to have executed far enough to generate input values for all required inputs.
Shader Termination
A shader invocation that is terminated has finished executing instructions.
Executing OpReturn
in the entry point, or executing
OpTerminateInvocation
in any function will terminate an invocation.
Implementations may also terminate a shader invocation when OpKill
is
executed in any function; otherwise it becomes a
helper invocation.
In addition to the above conditions, helper invocations may be terminated when all non-helper invocations in the same derivative group either terminate or become helper invocations.
A shader stage for a given command completes execution when all invocations for that stage have terminated.
Shader Memory Access Ordering
The order in which image or buffer memory is read or written by shaders is largely undefined. For some shader types (vertex, tessellation evaluation, and in some cases, fragment), even the number of shader invocations that may perform loads and stores is undefined.
In particular, the following rules apply:
-
Vertex and tessellation evaluation shaders will be invoked at least once for each unique vertex, as defined in those sections.
-
Fragment shaders will be invoked zero or more times, as defined in that section.
-
The relative execution order of invocations of the same shader type is undefined. A store issued by a shader when working on primitive B might complete prior to a store for primitive A, even if primitive A is specified prior to primitive B. This applies even to fragment shaders; while fragment shader outputs are always written to the framebuffer in rasterization order, stores executed by fragment shader invocations are not.
-
The relative execution order of invocations of different shader types is largely undefined.
The above limitations on shader invocation order make some forms of synchronization between shader invocations within a single set of primitives unimplementable. For example, having one invocation poll memory written by another invocation assumes that the other invocation has been launched and will complete its writes in finite time.
The Memory Model appendix defines the terminology and rules for how to correctly communicate between shader invocations, such as when a write is Visible-To a read, and what constitutes a Data Race.
Applications must not cause a data race.
The SPIR-V SubgroupMemory, CrossWorkgroupMemory, and AtomicCounterMemory memory semantics are ignored. Sequentially consistent atomics and barriers are not supported and SequentiallyConsistent is treated as AcquireRelease. SequentiallyConsistent should not be used.
Shader Inputs and Outputs
Data is passed into and out of shaders using variables with input or output
storage class, respectively.
User-defined inputs and outputs are connected between stages by matching
their Location
decorations.
Additionally, data can be provided by or communicated to special functions
provided by the execution environment using BuiltIn
decorations.
In many cases, the same BuiltIn
decoration can be used in multiple
shader stages with similar meaning.
The specific behavior of variables decorated as BuiltIn
is documented
in the following sections.
Task Shaders
Task shaders operate in conjunction with the mesh shaders to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Their primary purpose is to create a variable number of subsequent mesh shader invocations.
Task shaders are invoked via the execution of the programmable mesh shading pipeline.
The task shader has no fixed-function inputs other than variables
identifying the specific workgroup and invocation.
In the TaskNV
Execution
Model
the number of mesh shader workgroups to
create is specified via a TaskCountNV
decorated output variable.
In the TaskEXT
Execution
Model
the number of mesh shader workgroups to
create is specified via the OpEmitMeshTasksEXT
instruction.
The task shader can write additional outputs to task memory, which can be read by all of the mesh shader workgroups it created.
Task Shader Execution
Task workloads are formed from groups of work items called workgroups and
processed by the task shader in the current graphics pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Task shaders execute in global workgroups which are divided into a number
of local workgroups with a size that can be set by assigning a value to
the LocalSize
or LocalSizeId
execution mode or via an object decorated by the WorkgroupSize
decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
If the subpass includes multiple views in its view mask, a Task shader using
TaskEXT
Execution
Model
may be invoked separately for each view.
Mesh Shaders
Mesh shaders operate in workgroups to produce a collection of primitives that will be processed by subsequent stages of the graphics pipeline. Each workgroup emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.
Mesh shaders are invoked via the execution of the programmable mesh shading pipeline.
The only inputs available to the mesh shader are variables identifying the specific workgroup and invocation and, if applicable, any outputs written to task memory by the task shader that spawned the mesh shader’s workgroup. The mesh shader can operate without a task shader as well.
The invocations of the mesh shader workgroup write an output mesh, comprising a set of primitives with per-primitive attributes, a set of vertices with per-vertex attributes, and an array of indices identifying the mesh vertices that belong to each primitive. The primitives of this mesh are then processed by subsequent graphics pipeline stages, where the outputs of the mesh shader form an interface with the fragment shader.
Mesh Shader Execution
Mesh workloads are formed from groups of work items called workgroups and
processed by the mesh shader in the current graphics pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Mesh shaders execute in global workgroups which are divided into a number
of local workgroups with a size that can be set by assigning a value to
the LocalSize
or LocalSizeId
execution mode or via an object decorated by the WorkgroupSize
decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
The global workgroups may be generated explicitly via the API, or
implicitly through the task shader’s work creation mechanism.
If the subpass includes multiple views in its view mask, a Mesh shader using
MeshEXT
Execution
Model
may be invoked separately for each view.
Cluster Culling Shaders
Cluster Culling shaders are invoked via the execution of the Programmable Cluster Culling Shading pipeline.
The only inputs available to the cluster culling shader are variables identifying the specific workgroup and invocation.
Cluster Culling shaders operate in workgroups to perform cluster-based culling and produce zero or more cluster drawing commands that will be processed by subsequent stages of the graphics pipeline.
The Cluster Drawing Command (CDC) is very similar to the MDI command: invocations in a workgroup can emit zero or more CDCs to draw zero or more visible clusters.
Cluster Culling Shader Execution
Cluster Culling workloads are formed from groups of work items called
workgroups and processed by the cluster culling shader in the current
graphics pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Cluster Culling shaders execute in global workgroups which are divided
into a number of local workgroups with a size that can be set by
assigning a value to the LocalSize
or LocalSizeId
execution mode or via an object decorated by the WorkgroupSize
decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
Vertex Shaders
Each vertex shader invocation operates on one vertex and its associated vertex attribute data, and outputs one vertex and associated data. Graphics pipelines using primitive shading must include a vertex shader, and the vertex shader stage is always the first shader stage in the graphics pipeline.
Vertex Shader Execution
A vertex shader must be executed at least once for each vertex specified by a drawing command. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view. During execution, the shader is presented with the index of the vertex and instance for which it has been invoked. Input variables declared in the vertex shader are filled by the implementation with the values of vertex attributes associated with the invocation being executed.
If the same vertex is specified multiple times in a drawing command (e.g. by including the same index value multiple times in an index buffer) the implementation may reuse the results of vertex shading if it can statically determine that the vertex shader invocations will produce identical results.
It is implementation-dependent when and if results of vertex shading are
reused, and thus how many times the vertex shader will be executed.
This is true also if the vertex shader contains stores or atomic operations.
Tessellation Control Shaders
The tessellation control shader is used to read an input patch provided by
the application and to produce an output patch.
Each tessellation control shader invocation operates on an input patch
(after all control points in the patch are processed by a vertex shader) and
its associated data, and outputs a single control point of the output patch
and its associated data, and can also output additional per-patch data.
The input patch is sized according to the patchControlPoints
member of
VkPipelineTessellationStateCreateInfo, as part of input assembly.
The input patch can also be dynamically sized with the patchControlPoints parameter of vkCmdSetPatchControlPointsEXT.
To dynamically set the number of control points per patch, call:
// Provided by VK_EXT_extended_dynamic_state2, VK_EXT_shader_object
void vkCmdSetPatchControlPointsEXT(
    VkCommandBuffer                             commandBuffer,
    uint32_t                                    patchControlPoints);
-
commandBuffer is the command buffer into which the command will be recorded.
-
patchControlPoints specifies the number of control points per patch.
This command sets the number of control points per patch for subsequent drawing commands when drawing using shader objects, or when the graphics pipeline is created with VK_DYNAMIC_STATE_PATCH_CONTROL_POINTS_EXT set in VkPipelineDynamicStateCreateInfo::pDynamicStates.
Otherwise, this state is specified by the VkPipelineTessellationStateCreateInfo::patchControlPoints value used to create the currently active pipeline.
The size of the output patch is controlled by the OpExecutionMode
OutputVertices
specified in the tessellation control or tessellation
evaluation shaders, which must be specified in at least one of the shaders.
The size of the input and output patches must each be greater than zero and less than or equal to VkPhysicalDeviceLimits::maxTessellationPatchSize.
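An application may want to check a patch size against this limit before recording the dynamic state. The sketch below assumes the caller has already read maxTessellationPatchSize from VkPhysicalDeviceLimits; patch_size_is_valid is a hypothetical helper name, not an API function.

```c
#include <stdbool.h>
#include <stdint.h>

/* Returns true when 0 < patch_size <= max_patch_size.
   max_patch_size is assumed to hold the value of
   VkPhysicalDeviceLimits::maxTessellationPatchSize queried by the caller. */
static bool patch_size_is_valid(uint32_t patch_size, uint32_t max_patch_size)
{
    return patch_size > 0 && patch_size <= max_patch_size;
}
```

The same check applies to both the input patch size (patchControlPoints) and the output patch size (OutputVertices).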
Tessellation Control Shader Execution
A tessellation control shader is invoked at least once for each output vertex in a patch. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
Inputs to the tessellation control shader are generated by the vertex
shader.
Each invocation of the tessellation control shader can read the attributes
of any incoming vertices and their associated data.
The invocations corresponding to a given patch execute logically in
parallel, with undefined relative execution order.
However, the OpControlBarrier
instruction can be used to provide
limited control of the execution order by synchronizing invocations within a
patch, effectively dividing tessellation control shader execution into a set
of phases.
Tessellation control shaders will read undefined values if one invocation
reads a per-vertex or per-patch output written by another invocation at any
point during the same phase, or if two invocations attempt to write
different values to the same per-patch output in a single phase.
Tessellation Evaluation Shaders
The Tessellation Evaluation Shader operates on an input patch of control points and their associated data, and a single input barycentric coordinate indicating the invocation’s relative position within the subdivided patch, and outputs a single vertex and its associated data.
Geometry Shaders
The geometry shader operates on a group of vertices and their associated data assembled from a single input primitive, and emits zero or more output primitives and the group of vertices and their associated data required for each output primitive.
Geometry Shader Execution
A geometry shader is invoked at least once for each primitive produced by the tessellation stages, or at least once for each primitive generated by primitive assembly when tessellation is not in use. A shader can request that the geometry shader runs multiple instances. A geometry shader is invoked at least once for each instance. If the subpass includes multiple views in its view mask, the shader may be invoked separately for each view.
Fragment Shaders
Fragment shaders are invoked as a fragment operation in a graphics pipeline. Each fragment shader invocation operates on a single fragment and its associated data. With few exceptions, fragment shaders do not have access to any data associated with other fragments and are considered to execute in isolation from fragment shader invocations associated with other fragments.
Compute Shaders
Compute shaders are invoked via vkCmdDispatch and vkCmdDispatchIndirect commands. In general, they have access to similar resources as shader stages executing as part of a graphics pipeline.
Compute workloads are formed from groups of work items called workgroups and
processed by the compute shader in the current compute pipeline.
A workgroup is a collection of shader invocations that execute the same
shader, potentially in parallel.
Compute shaders execute in global workgroups which are divided into a
number of local workgroups with a size that can be set by assigning a
value to the LocalSize
or LocalSizeId
execution mode or via an object decorated by the WorkgroupSize
decoration.
An invocation within a local workgroup can share data with other members of
the local workgroup through shared variables and issue memory and control
flow barriers to synchronize with other members of the local workgroup.
Ray Generation Shaders
A ray generation shader is similar to a compute shader.
Its main purpose is to execute ray tracing queries using pipeline trace ray instructions (such as OpTraceRayKHR) and process the results.
Ray Generation Shader Execution
One ray generation shader is executed per ray tracing dispatch.
Its location in the shader binding table (see Shader Binding Table for details) is passed directly into vkCmdTraceRaysKHR using the pRaygenShaderBindingTable parameter or vkCmdTraceRaysNV using the raygenShaderBindingTableBuffer and raygenShaderBindingOffset parameters.
Intersection Shaders
Intersection shaders enable the implementation of arbitrary, application defined geometric primitives. An intersection shader for a primitive is executed whenever its axis-aligned bounding box is hit by a ray.
Like other ray tracing shader domains, an intersection shader operates on a
single ray at a time.
It also operates on a single primitive at a time.
It is therefore the purpose of an intersection shader to compute the
ray-primitive intersections and report them.
To report an intersection, the shader calls the OpReportIntersectionKHR
instruction.
An intersection shader communicates with any-hit and closest hit shaders by generating attribute values that they can read. Intersection shaders cannot read or modify the ray payload.
Intersection Shader Execution
The order in which intersections are found along a ray, and therefore the order in which intersection shaders are executed, is unspecified.
The intersection shader of the closest AABB which intersects the ray is guaranteed to be executed at some point during traversal, unless the ray is forcibly terminated.
Any-Hit Shaders
The any-hit shader is executed after the intersection shader reports an
intersection that lies within the current [tmin,tmax] of the ray.
The main use of any-hit shaders is to programmatically decide whether or not
an intersection will be accepted.
The intersection will be accepted unless the shader calls the
OpIgnoreIntersectionKHR
instruction.
Any-hit shaders have read-only access to the attributes generated by the
corresponding intersection shader, and can read or modify the ray payload.
Closest Hit Shaders
Closest hit shaders have read-only access to the attributes generated by the corresponding intersection shader, and can read or modify the ray payload. They also have access to a number of system-generated values. Closest hit shaders can call pipeline trace ray instructions to recursively trace rays.
Miss Shaders
Miss shaders can access the ray payload and can trace new rays through the pipeline trace ray instructions, but cannot access attributes since they are not associated with an intersection.
Interpolation Decorations
Variables in the Input
storage class in a fragment shader’s interface
are interpolated from the values specified by the primitive being
rasterized.
Interpolation decorations can be present on input and output variables in pre-rasterization shaders but have no effect on the interpolation performed.
An undecorated input variable will be interpolated with perspective-correct
interpolation according to the primitive type being rasterized.
Lines and
polygons are interpolated in the same
way as the primitive’s clip coordinates.
If the NoPerspective
decoration is present, linear interpolation is
instead used for lines and
polygons.
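The difference between the two modes can be illustrated for a line segment. The sketch below is an assumption-laden illustration rather than a definition from this section: it uses the usual formulation in which perspective-correct interpolation weights each endpoint value by the reciprocal of its clip-space w coordinate, and the function and parameter names are hypothetical.

```c
/* Perspective-correct interpolation along a line segment.
   t is the position along the segment in [0, 1]; fa and fb are the attribute
   values at the two endpoints, wa and wb their clip-space w coordinates. */
static double interp_perspective(double t, double fa, double wa,
                                 double fb, double wb)
{
    double num = (1.0 - t) * fa / wa + t * fb / wb;
    double den = (1.0 - t) / wa + t / wb;
    return num / den;
}

/* Linear interpolation, as selected by the NoPerspective decoration. */
static double interp_noperspective(double t, double fa, double fb)
{
    return (1.0 - t) * fa + t * fb;
}
```

With fa = 0, fb = 1, wa = 1, wb = 3 and t = 0.5, the perspective-correct result is 0.25 while the linear result is 0.5. When both endpoints share the same w, the two functions agree.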
For points, as there is only a single vertex, input values are never
interpolated and instead take the value written for the single vertex.
If the Flat
decoration is present on an input variable, the value is
not interpolated, and instead takes its value directly from the
provoking vertex.
Fragment shader inputs that are signed or unsigned integers, integer vectors, or any double-precision floating-point type must be decorated with Flat.
Interpolation of input variables is performed at an implementation-defined position within the fragment area being shaded. The position is further constrained as follows:
-
If the Centroid decoration is used, the interpolation position used for the variable must also fall within the bounds of the primitive being rasterized.
-
If the Sample decoration is used, the interpolation position used for the variable must be at the position of the sample being shaded by the current fragment shader invocation.
-
If a sample count of 1 is used, the interpolation position must be at the center of the fragment area.
If the PerVertexKHR
decoration is present on an input variable, the
value is not interpolated, and instead values from all input vertices are
available in an array.
Each index of the array corresponds to one of the vertices of the primitive
that produced the fragment.
If the CustomInterpAMD
decoration is present on an input variable, the
value cannot be accessed directly; instead the extended instruction
InterpolateAtVertexAMD
must be used to obtain values from the input
vertices.
Static Use
A SPIR-V module declares a global object in memory using the OpVariable
instruction, which results in a pointer x
to that object.
A specific entry point in a SPIR-V module is said to statically use that object if that entry point’s call tree contains a function containing an instruction with x as an id operand.
A shader entry point also statically uses any variables explicitly
declared in its interface.
Scope
A scope describes a set of shader invocations, where each such set is a scope instance. Each invocation belongs to one or more scope instances, but belongs to no more than one scope instance for each scope.
The operations available between invocations in a given scope instance vary, with smaller scopes generally able to perform more operations, and with greater efficiency.
Cross Device
All invocations executed in a Vulkan instance fall into a single cross device scope instance.
Whilst the CrossDevice
scope is defined in SPIR-V, it is disallowed in
Vulkan.
API synchronization commands can be used to
communicate between devices.
Device
All invocations executed on a single device form a device scope instance.
If the vulkanMemoryModel and vulkanMemoryModelDeviceScope features are enabled, this scope is represented in SPIR-V by the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.
If both the shaderDeviceClock
and
vulkanMemoryModelDeviceScope
features are enabled, using the
Device
Scope
with the OpReadClockKHR
instruction will read
from a clock that is consistent across invocations in the same device scope
instance.
There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.
Invocations executing on different devices in a device group operate in separate device scope instances.
Queue Family
Invocations executed by queues in a given queue family form a queue family scope instance.
This scope is identified in SPIR-V as the QueueFamily Scope if the vulkanMemoryModel feature is enabled, or if not, the Device Scope, which can be used as a Memory Scope for barrier and atomic operations.
If the shaderDeviceClock
feature is
enabled,
but the vulkanMemoryModelDeviceScope
feature is not enabled,
using the Device
Scope
with the OpReadClockKHR
instruction
will read from a clock that is consistent across invocations in the same
queue family scope instance.
There is no method to synchronize the execution of these invocations within SPIR-V, and this can only be done with API synchronization primitives.
Each invocation in a queue family scope instance must be in the same device scope instance.
Command
Any shader invocations executed as the result of a single command such as
vkCmdDispatch or vkCmdDraw form a command scope instance.
For indirect drawing commands with drawCount
greater than one,
invocations from separate draws are in separate command scope instances.
For ray tracing shaders, an invocation group is an implementation-dependent
subset of the set of shader invocations of a given shader stage which are
produced by a single trace rays command.
There is no specific Scope
for communication across invocations in a
command scope instance.
As this has a clear boundary at the API level, coordination here can be
performed in the API, rather than in SPIR-V.
Each invocation in a command scope instance must be in the same queue family scope instance.
For shaders without defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.
Primitive
Any fragment shader invocations executed as the result of rasterization of a single primitive form a primitive scope instance.
There is no specific Scope
for communication across invocations in a
primitive scope instance.
Any generated helper invocations are included in this scope instance.
Each invocation in a primitive scope instance must be in the same command scope instance.
Any input variables decorated with Flat
are uniform within a primitive
scope instance.
Shader Call
Any shader-call-related invocations that are executed in one or more ray tracing execution models form a shader call scope instance.
The ShaderCallKHR
Scope
can be used as Memory
Scope
for
barrier and atomic operations.
Each invocation in a shader call scope instance must be in the same queue family scope instance.
Workgroup
A local workgroup is a set of invocations that can synchronize and share
data with each other using memory in the Workgroup
storage class.
The Workgroup
Scope
can be used as both an Execution
Scope
and Memory
Scope
for barrier and atomic operations.
Each invocation in a local workgroup must be in the same command scope instance.
Only task, mesh, and compute shaders have defined workgroups - other shader types cannot use workgroup functionality. For shaders that have defined workgroups, this set of invocations forms an invocation group as defined in the SPIR-V specification.
When variables declared with the Workgroup
storage class are explicitly
laid out (hence they are also decorated with Block
), the amount of
storage consumed is the size of the largest Block variable, not counting any
padding at the end.
The amount of storage consumed by the
non-Block
variables declared with the Workgroup
storage class is
implementation-dependent.
However, the amount of storage consumed may not exceed the largest block
size that would be obtained if all active
non-Block
variables declared with Workgroup
storage class were assigned offsets
in an arbitrary order by successively taking the smallest valid offset
according to the Standard Storage Buffer Layout rules, and with Boolean
values considered as 32-bit
integer values for the purpose of this calculation.
(This is equivalent to using the GLSL std430 layout rules.)
Subgroup
A subgroup (see the subsection “Control Flow” of section 2 of the SPIR-V 1.3 Revision 1 specification) is a set of invocations that can synchronize and share data with each other efficiently.
The Subgroup
Scope
can be used as both an Execution
Scope
and Memory
Scope
for barrier and atomic operations.
Other subgroup features allow the use of
group operations with subgroup scope.
If the shaderSubgroupClock
feature
is enabled, using the Subgroup
Scope
with the OpReadClockKHR
instruction will read from a clock that is consistent across invocations in
the same subgroup.
For shaders that have defined workgroups, each invocation in a subgroup must be in the same local workgroup.
In other shader stages, each invocation in a subgroup must be in the same device scope instance.
Only shader stages that support subgroup operations have defined subgroups.
Subgroups are not guaranteed to be a subset of a single command in shaders that do not have defined workgroups. Values that are guaranteed to be uniform for a given command or sub command may then not be uniform for the subgroup, and vice versa. As such, applications must take care when dealing with mixed uniformity. A somewhat common example of this would be trying to optimize access to per-draw data using subgroup operations.
This can be done in an attempt to optimize the shader to only perform the loads once per subgroup. However, if the implementation packs multiple draws into a single subgroup, invocations from draws with a different drawID are now receiving data from the wrong invocation. Applications should rely on implementations to do this kind of optimization automatically where the implementation can, rather than trying to force it.
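The hazard described above can be modeled on the host. The following C sketch is a simulation, not shader code: the Invocation struct, the fixed SUBGROUP_SIZE of 8, and both function names are assumptions made for illustration, and all invocations are treated as active. One invocation loads the per-draw data and every other invocation in the subgroup receives that value.

```c
#include <stdint.h>

#define SUBGROUP_SIZE 8 /* illustrative size, not a queried limit */

/* Hypothetical per-invocation state for a host-side model of one subgroup. */
typedef struct Invocation {
    uint32_t draw_id; /* draw this invocation belongs to */
    float    data;    /* value the invocation ends up with */
} Invocation;

/* Models the load-once pattern: the invocation with the lowest index loads
   the data for its own draw, and that value is broadcast to every invocation
   in the subgroup, regardless of which draw each one belongs to. */
static void elect_and_broadcast(Invocation sg[SUBGROUP_SIZE],
                                const float *per_draw_data)
{
    float loaded = per_draw_data[sg[0].draw_id];
    for (int i = 0; i < SUBGROUP_SIZE; ++i)
        sg[i].data = loaded;
}

/* Counts invocations that received data belonging to a different draw. */
static int count_wrong(const Invocation sg[SUBGROUP_SIZE],
                       const float *per_draw_data)
{
    int wrong = 0;
    for (int i = 0; i < SUBGROUP_SIZE; ++i)
        if (sg[i].data != per_draw_data[sg[i].draw_id])
            ++wrong;
    return wrong;
}
```

When all eight invocations come from the same draw, count_wrong returns 0. When the last three come from a second draw, those three invocations hold the first draw's data, which is the mixed-uniformity failure the text warns about.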
Quad
A quad scope instance is formed of four shader invocations.
In a fragment shader, each invocation in a quad scope instance is formed of invocations in neighboring framebuffer locations (xi, yi), where:
-
i is the index of the invocation within the scope instance.
-
w and h are the number of pixels the fragment covers in the x and y axes.
-
w and h are identical for all participating invocations.
-
(x0) = (x1 - w) = (x2) = (x3 - w)
-
(y0) = (y1) = (y2 - h) = (y3 - h)
-
Each invocation has the same layer and sample indices.
In a
mesh, task, or
compute shader, if the DerivativeGroupQuadsKHR
execution mode is
specified, each invocation in a quad scope instance is formed of invocations
with adjacent local invocation IDs (xi, yi), where:
-
i is the index of the invocation within the quad scope instance.
-
(x0) = (x1 - 1) = (x2) = (x3 - 1)
-
(y0) = (y1) = (y2 - 1) = (y3 - 1)
-
x0 and y0 are integer multiples of 2.
-
Each invocation has the same z coordinate.
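The constraints above determine, for any local invocation ID, which quad it belongs to and its index within that quad. A minimal C sketch of that mapping follows; QuadPosition and quad_from_local_id are hypothetical names introduced for illustration.

```c
#include <stdint.h>

typedef struct QuadPosition {
    uint32_t x0, y0; /* local invocation ID (x, y) of invocation 0 of the quad */
    uint32_t index;  /* i, the index of this invocation within the quad */
} QuadPosition;

/* x0 and y0 are even, so clearing the low bit of each coordinate gives the
   quad origin. The low bits select the invocation:
   i = 0 -> (x0, y0), 1 -> (x0 + 1, y0), 2 -> (x0, y0 + 1), 3 -> (x0 + 1, y0 + 1). */
static QuadPosition quad_from_local_id(uint32_t x, uint32_t y)
{
    QuadPosition q;
    q.x0 = x & ~1u;
    q.y0 = y & ~1u;
    q.index = (x & 1u) + 2u * (y & 1u);
    return q;
}
```

For example, local invocation ID (5, 2) maps to the quad with origin (4, 2) and index 1. The z coordinate is left out because it is identical for all four invocations.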
In a
mesh, task, or
compute shader, if the DerivativeGroupLinearKHR
execution mode is
specified, each invocation in a quad scope instance is formed of invocations
with adjacent local invocation indices (li), where:
-
i is the index of the invocation within the quad scope instance.
-
(l0) = (l1 - 1) = (l2 - 2) = (l3 - 3)
-
l0 is an integer multiple of 4.
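For the linear case the mapping reduces to splitting the local invocation index into a multiple of 4 and a remainder. The helper names below are hypothetical and the code is only a sketch of the relation stated above.

```c
#include <stdint.h>

/* l0: the local invocation index of invocation 0 of the quad containing l. */
static uint32_t quad_base_linear(uint32_t l)
{
    return l & ~3u; /* round down to a multiple of 4 */
}

/* i: the index of invocation l within its quad. */
static uint32_t quad_index_linear(uint32_t l)
{
    return l & 3u;
}
```

Local invocation index 6, for instance, belongs to the quad starting at index 4 and is invocation 2 of that quad.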
In all shaders, each invocation in a quad scope instance is formed of invocations in adjacent subgroup invocation indices (si), where:
-
i is the index of the invocation within the quad scope instance.
-
(s0) = (s1 - 1) = (s2 - 2) = (s3 - 3)
-
s0 is an integer multiple of 4.
Each invocation in a quad scope instance must be in the same subgroup.
In a fragment shader, each invocation in a quad scope instance must be in the same primitive scope instance.
Fragment, mesh, task, and compute shaders have defined quad scope instances.
If the quadOperationsInAllStages
limit is supported, any
shader stages that support subgroup operations also have defined quad scope instances.
Fragment Interlock
A fragment interlock scope instance is formed of fragment shader invocations based on their framebuffer locations (x,y,layer,sample), executed by commands inside a single subpass.
The specific set of invocations included varies based on the execution mode as follows:
-
If the SampleInterlockOrderedEXT or SampleInterlockUnorderedEXT execution modes are used, only invocations with identical framebuffer locations (x,y,layer,sample) are included.
-
If the PixelInterlockOrderedEXT or PixelInterlockUnorderedEXT execution modes are used, fragments with different sample ids are also included.
-
If the ShadingRateInterlockOrderedEXT or ShadingRateInterlockUnorderedEXT execution modes are used, fragments from neighboring framebuffer locations are also included. The shading rate image or fragment shading rate determines these fragments.
Only fragment shaders with one of the above execution modes have defined fragment interlock scope instances.
There is no specific Scope
value for communication across invocations
in a fragment interlock scope instance.
However, this is implicitly used as a memory scope by OpBeginInvocationInterlockEXT and OpEndInvocationInterlockEXT.
Each invocation in a fragment interlock scope instance must be in the same queue family scope instance.
Invocation
The smallest scope is a single invocation; this is represented by the
Invocation
Scope
in SPIR-V.
Fragment shader invocations must be in a primitive scope instance.
Invocations in fragment shaders that have a defined fragment interlock scope must be in a fragment interlock scope instance.
Invocations in shaders that have defined workgroups must be in a local workgroup.
Invocations in shaders that have a defined subgroup scope must be in a subgroup.
Invocations in shaders that have a defined quad scope must be in a quad scope instance.
All invocations in all stages must be in a command scope instance.
Group Operations
Group operations are executed by multiple invocations within a scope instance; with each invocation involved in calculating the result. This provides a mechanism for efficient communication between invocations in a particular scope instance.
Group operations all take a Scope
defining the desired
scope instance to operate within.
Only the Subgroup
scope can be used for these operations; the
subgroupSupportedOperations
limit defines which types of operation can be used.
Basic Group Operations
Basic group operations include the use of OpGroupNonUniformElect
,
OpControlBarrier
, OpMemoryBarrier
, and atomic operations.
OpGroupNonUniformElect
can be used to choose a single invocation to
perform a task for the whole group.
Only the invocation with the lowest id in the group will return true
.
The Memory Model appendix defines the operation of barriers and atomics.
Vote Group Operations
The vote group operations allow invocations within a group to compare values across a group. The types of votes enabled are:
- Do all active group invocations agree that an expression is true?
- Do any active group invocations evaluate an expression to true?
- Do all active group invocations have the same value of an expression?
These operations are useful in combination with control flow: they allow developers to check whether conditions match across the group and choose potentially faster code paths in these cases.
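The three vote queries can be modelled on the host as plain loops over a group. The following is a minimal C sketch, not shader code: the function names, the `values` array (one entry per invocation), and the `active` mask are hypothetical and exist only for illustration.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* Host-side model of a group: values[i] is the value held by invocation i,
 * and active[i] tells whether invocation i takes part in the vote.
 * All names here are illustrative assumptions, not Vulkan or SPIR-V API. */

/* True if every active invocation evaluates the expression to true. */
bool vote_all(const bool *values, const bool *active, size_t n)
{
    assert(values != NULL && active != NULL);
    for (size_t i = 0; i < n; ++i)
        if (active[i] && !values[i])
            return false;
    return true;
}

/* True if at least one active invocation evaluates the expression to true. */
bool vote_any(const bool *values, const bool *active, size_t n)
{
    assert(values != NULL && active != NULL);
    for (size_t i = 0; i < n; ++i)
        if (active[i] && values[i])
            return true;
    return false;
}

/* True if all active invocations hold the same value of the expression. */
bool vote_all_equal(const int *values, const bool *active, size_t n)
{
    assert(values != NULL && active != NULL);
    bool seen = false;
    int first = 0;
    for (size_t i = 0; i < n; ++i) {
        if (!active[i])
            continue;
        if (!seen) {
            first = values[i];
            seen = true;
        } else if (values[i] != first) {
            return false;
        }
    }
    return true;
}
```

In this model an inactive invocation never influences the result, which is why a group can take a uniform fast path even when some of its invocations have already left the branch.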
Arithmetic Group Operations
The arithmetic group operations allow invocations to perform scans and reductions across a group. The operators supported are add, mul, min, max, and, or, xor.
For reductions, every invocation in a group will obtain the cumulative result of these operators applied to all values in the group. For exclusive scans, each invocation in a group will obtain the cumulative result of these operators applied to all values in invocations with a lower index in the group. Inclusive scans are identical to exclusive scans, except the cumulative result includes the operator applied to the value in the current invocation.
The order in which these operators are applied is implementation-dependent.
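The difference between a reduction, an inclusive scan, and an exclusive scan is easiest to see on a small array. The sketch below models the add operator on the host in C. The function name and the array-per-invocation layout are assumptions, and invocation 0 of the exclusive scan is given the additive identity 0, which is how this sketch chooses to handle the case where no lower-indexed invocation exists.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of the "add" arithmetic group operations.
 * in[i] is the value contributed by invocation i. Each output array has one
 * entry per invocation. Names are illustrative, not Vulkan or SPIR-V API. */
void group_add(const int *in, size_t n,
               int *reduce_out, int *inclusive_out, int *exclusive_out)
{
    assert(in != NULL && reduce_out != NULL);
    assert(inclusive_out != NULL && exclusive_out != NULL);

    int running = 0; /* additive identity */
    for (size_t i = 0; i < n; ++i) {
        exclusive_out[i] = running; /* sum of invocations with a lower index */
        running += in[i];
        inclusive_out[i] = running; /* same, plus the current invocation */
    }
    /* Reduction: every invocation obtains the total over the whole group. */
    for (size_t i = 0; i < n; ++i)
        reduce_out[i] = running;
}
```

For the input `{1, 2, 3, 4}` this yields a reduction of 10 in every invocation, an inclusive scan of `{1, 3, 6, 10}`, and an exclusive scan of `{0, 1, 3, 6}`. The loop applies the operator in index order; because the real order is implementation-dependent, floating-point results on a device may differ slightly from a sequential model like this one.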
Ballot Group Operations
The ballot group operations allow invocations to perform more complex votes across the group. The ballot functionality allows all invocations within a group to provide a boolean value and get as a result what each invocation provided as their boolean value. The broadcast functionality allows values to be broadcast from an invocation to all other invocations within the group.
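A ballot can be pictured as packing one boolean per invocation into a bitmask that every invocation then receives. The following C sketch models that on the host. It assumes a group of at most 32 invocations so the result fits in a single 32-bit word, and the function names are hypothetical.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Host-side model of a ballot: bit i of the result is the boolean that
 * invocation i provided. Assumes n <= 32 so one word is enough.
 * Names are illustrative, not Vulkan or SPIR-V API. */
uint32_t group_ballot(const bool *pred, size_t n)
{
    assert(pred != NULL && n <= 32);
    uint32_t mask = 0;
    for (size_t i = 0; i < n; ++i)
        if (pred[i])
            mask |= (uint32_t)1 << i;
    return mask;
}

/* Reads back what invocation i voted from a ballot result. */
bool ballot_bit(uint32_t mask, size_t i)
{
    assert(i < 32);
    return ((mask >> i) & 1u) != 0;
}

/* Model of broadcast: every invocation receives the value held by `source`. */
void group_broadcast(const int *in, int *out, size_t n, size_t source)
{
    assert(in != NULL && out != NULL && source < n);
    for (size_t i = 0; i < n; ++i)
        out[i] = in[source];
}
```

With predicates `{true, false, true, true}` the ballot is `0xD`, and any invocation can test bit 1 to learn that invocation 1 voted false.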
Shuffle Group Operations
The shuffle group operations allow invocations to read values from other invocations within a group.
Shuffle Relative Group Operations
The shuffle relative group operations allow invocations to read values from other invocations within the group relative to the current invocation in the group. The relative operations supported allow data to be shifted up and down through the invocations within a group.
Clustered Group Operations
The clustered group operations allow invocations to perform an operation among partitions of a group, such that the operation is only performed within the group invocations within a partition. The partitions for clustered group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time. The operations supported are add, mul, min, max, and, or, xor.
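A clustered operation behaves like an ordinary reduction restricted to each consecutive power-of-two partition. The C sketch below models a clustered add on the host. The function name is hypothetical, and the sketch assumes the group size is a multiple of the cluster size.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of a clustered add: invocations are split into consecutive
 * partitions of `cluster` invocations, and each invocation receives the sum
 * over its own partition only. Names are illustrative assumptions. */
void clustered_add(const int *in, int *out, size_t n, size_t cluster)
{
    assert(in != NULL && out != NULL);
    assert(cluster != 0 && (cluster & (cluster - 1)) == 0); /* power of two */
    assert(n % cluster == 0);

    for (size_t base = 0; base < n; base += cluster) {
        int sum = 0;
        for (size_t j = 0; j < cluster; ++j)
            sum += in[base + j];
        for (size_t j = 0; j < cluster; ++j)
            out[base + j] = sum;
    }
}
```

For the input `{1, 2, 3, 4, 5, 6, 7, 8}` and a cluster size of 4, the first four invocations obtain 10 and the last four obtain 26.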
Rotate Group Operations
The rotate group operations allow invocations to read values from other invocations within the group relative to the current invocation and modulo the size of the group. Clustered rotate group operations perform the same operation within individual partitions of a group.
The partitions for clustered rotate group operations are consecutive power-of-two size groups of invocations and the cluster size must be known at pipeline creation time.
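The rotate behaviour, including the clustered form, can be sketched on the host as an index calculation. In the C model below, the function name and parameters are hypothetical. Passing the full group size as `cluster` gives the non-clustered rotate, and the sketch assumes the group size is a multiple of the cluster size.

```c
#include <assert.h>
#include <stddef.h>

/* Host-side model of a rotate: invocation i reads the value of the invocation
 * `delta` positions further along, wrapping around within its own partition
 * of `cluster` invocations. Names are illustrative assumptions. */
void group_rotate(const int *in, int *out, size_t n, size_t delta, size_t cluster)
{
    assert(in != NULL && out != NULL);
    assert(cluster != 0 && (cluster & (cluster - 1)) == 0); /* power of two */
    assert(n % cluster == 0);

    for (size_t i = 0; i < n; ++i) {
        size_t base = i & ~(cluster - 1);          /* first lane of the partition */
        size_t lane = (i + delta) & (cluster - 1); /* source lane, modulo cluster */
        out[i] = in[base + lane];
    }
}
```

With eight invocations holding `{0, 1, ..., 7}` and a delta of 1, a full-group rotate yields `{1, 2, 3, 4, 5, 6, 7, 0}`, while a cluster size of 4 yields `{1, 2, 3, 0, 5, 6, 7, 4}`.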
Quad Group Operations
Quad group operations (OpGroupNonUniformQuad*
) are a specialized type
of group operations that only operate on
quad scope instances.
Whilst these instructions do include a Scope
parameter, this scope is
always overridden; only the quad scope instance is
included in its execution scope.
Fragment shaders that statically execute either
OpGroupNonUniformQuadBroadcast
or OpGroupNonUniformQuadSwap
must
launch sufficient invocations to ensure their correct operation; additional
helper invocations are launched for
framebuffer locations not covered by rasterized fragments if necessary.
The index used to select participating invocations is i, as described for a quad scope instance, defined as the quad index in the SPIR-V specification.
For OpGroupNonUniformQuadBroadcast
this value is equal to Index
.
For OpGroupNonUniformQuadSwap
, it is equal to the implicit Index
used by each participating invocation.
Derivative Operations
Derivative operations calculate the partial derivative for an expression P as a function of an invocation’s x and y coordinates.
Derivative operations operate on a set of invocations known as a derivative group as defined in the SPIR-V specification.
A derivative group in a fragment shader is equivalent to the
quad scope instance if the QuadDerivativesKHR
execution mode is specified, otherwise it is equivalent to the
primitive scope instance.
A derivative group in a
mesh, task, or
compute shader is equivalent to the quad scope instance.
Derivatives are calculated assuming that P is piecewise linear and continuous within the derivative group.
The following control-flow restrictions apply to derivative operations:
- If the QuadDerivativesKHR execution mode is specified, dynamic instances of any derivative operations must be executed in control flow that is uniform within the current quad scope instance.
- If the QuadDerivativesKHR execution mode is not specified:
  - dynamic instances of explicit derivative instructions (OpDPdx*, OpDPdy*, and OpFwidth*) must be executed in control flow that is uniform within a derivative group.
  - dynamic instances of implicit derivative operations can be executed in control flow that is not uniform within the derivative group, but results are undefined.
Fragment shaders that statically execute derivative operations must launch sufficient invocations to ensure their correct operation; additional helper invocations are launched for framebuffer locations not covered by rasterized fragments if necessary.
In a mesh, task, or compute shader, it is the application’s responsibility to ensure that sufficient invocations are launched.
Derivative operations calculate their results as the difference between the
result of P across invocations in the quad.
For fine derivative operations (OpDPdxFine
and OpDPdyFine
), the
values of DPdx(Pi) are calculated as
- DPdx(P0) = DPdx(P1) = P1 - P0
- DPdx(P2) = DPdx(P3) = P3 - P2
and the values of DPdy(Pi) are calculated as
- DPdy(P0) = DPdy(P2) = P2 - P0
- DPdy(P1) = DPdy(P3) = P3 - P1
where i is the index of each invocation as described in Quad.
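The fine derivative rules above amount to four subtractions per quad. The C sketch below evaluates them on the host for the four values P0..P3 of one quad. The `QuadDerivatives` type and the function name are hypothetical, introduced only for this illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Derivatives for the four invocations of one quad, indexed by i. */
typedef struct {
    float dx[4];
    float dy[4];
} QuadDerivatives;

/* Host-side model of the fine derivative rules: p[i] is the value of the
 * expression P in invocation i of the quad. Names are illustrative. */
QuadDerivatives fine_derivatives(const float p[4])
{
    assert(p != NULL);
    QuadDerivatives d;
    d.dx[0] = d.dx[1] = p[1] - p[0]; /* top row shares one x difference */
    d.dx[2] = d.dx[3] = p[3] - p[2]; /* bottom row shares another */
    d.dy[0] = d.dy[2] = p[2] - p[0]; /* left column shares one y difference */
    d.dy[1] = d.dy[3] = p[3] - p[1]; /* right column shares another */
    return d;
}
```

For `p = {1, 3, 2, 7}` the x derivatives are 2 for invocations 0 and 1 and 5 for invocations 2 and 3, while the y derivatives are 1 for invocations 0 and 2 and 4 for invocations 1 and 3.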
Coarse derivative operations (OpDPdxCoarse
and OpDPdyCoarse
),
calculate their results in roughly the same manner, but may only calculate
two values instead of four (one for each of DPdx and DPdy),
reusing the same result no matter the originating invocation.
If an implementation does this, it should use the fine derivative
calculations described for P0.
Derivative values are calculated between fragments rather than pixels. If the fragment shader invocations involved in the calculation cover multiple pixels, these operations cover a wider area, resulting in larger derivative values. This in turn will result in a coarser LOD being selected for image sampling operations using derivatives. Applications may want to account for this when using multi-pixel fragments; if pixel derivatives are desired, applications should use explicit derivative operations and divide the results by the size of the fragment in each dimension as follows:
- DPdx(Pn)' = DPdx(Pn) / w
- DPdy(Pn)' = DPdy(Pn) / h
where w and h are the size of the fragments in the quad, and DPdx(Pn)' and DPdy(Pn)' are the pixel derivatives.
The results for OpDPdx
and OpDPdy
may be calculated as either
fine or coarse derivatives, with implementations favoring the most efficient
approach.
Implementations must choose coarse or fine consistently between the two.
Executing OpFwidthFine
, OpFwidthCoarse
, or OpFwidth
is
equivalent to executing the corresponding OpDPdx*
and OpDPdy*
instructions, taking the absolute value of the results, and summing them.
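As a worked illustration of that equivalence, the C sketch below computes the fine variant on the host for one invocation of a quad: it takes the two fine differences for invocation i, then sums their absolute values. The function names are hypothetical, and only the fine case is modelled.

```c
#include <assert.h>
#include <stddef.h>

/* Absolute value without depending on the math library. */
static float abs_f(float x)
{
    return x < 0.0f ? -x : x;
}

/* Host-side model of a fine fwidth for invocation i of a quad, where p[i]
 * is the value of P in invocation i. Names are illustrative assumptions. */
float fwidth_fine(const float p[4], size_t i)
{
    assert(p != NULL && i < 4);
    size_t row = i >> 1; /* 0 for invocations 0 and 1, 1 for 2 and 3 */
    size_t col = i & 1;  /* 0 for invocations 0 and 2, 1 for 1 and 3 */
    float dx = p[2 * row + 1] - p[2 * row]; /* difference along x in this row */
    float dy = p[2 + col] - p[col];         /* difference along y in this column */
    return abs_f(dx) + abs_f(dy);
}
```

For `p = {1, 3, 2, 7}`, invocation 0 obtains |2| + |1| = 3 and invocation 3 obtains |5| + |4| = 9.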
Executing an OpImage*Sample*ImplicitLod
instruction is equivalent to
executing OpDPdx
(Coordinate
) and OpDPdy
(Coordinate
), and
passing the results as the Grad
operands dx
and dy
.
It is expected that using the |
Helper Invocations
When performing derivative
or quad group
operations in a fragment shader, additional invocations may be spawned in
order to ensure correct results.
These additional invocations are known as helper invocations and can be
identified by a non-zero value in the HelperInvocation
built-in.
Stores and atomics performed by helper invocations must not have any effect
on memory except for the Function
, Private
and Output
storage
classes, and values returned by atomic instructions in helper invocations
are undefined.
While storage to |
If the MaximallyReconvergesKHR
execution mode is applied to the entry
point, helper invocations must remain active for all instructions for the
lifetime of the quad scope instance they are a part of.
If the MaximallyReconvergesKHR
execution mode is not applied to the
entry point, helper
invocations may be considered inactive for group operations other than derivative
and quad group operations.
All invocations in a quad scope instance may become permanently inactive at
any point once the only remaining invocations in that quad scope instance
are helper invocations.
Cooperative Matrices
A cooperative matrix type is a SPIR-V type where the storage for and computations performed on the matrix are spread across the invocations in a scope instance. These types give the implementation freedom in how to optimize matrix multiplies.
SPIR-V defines the types and instructions, but does not specify rules about what sizes/combinations are valid, and it is expected that different implementations may support different sizes.
To enumerate the supported cooperative matrix types and operations, call:
// Provided by VK_KHR_cooperative_matrix
VkResult vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR(
VkPhysicalDevice physicalDevice,
uint32_t* pPropertyCount,
VkCooperativeMatrixPropertiesKHR* pProperties);
- physicalDevice is the physical device.
- pPropertyCount is a pointer to an integer related to the number of cooperative matrix properties available or queried.
- pProperties is either NULL or a pointer to an array of VkCooperativeMatrixPropertiesKHR structures.
If pProperties
is NULL
, then the number of cooperative matrix
properties available is returned in pPropertyCount
.
Otherwise, pPropertyCount
must point to a variable set by the
application to the number of elements in the pProperties
array, and on
return the variable is overwritten with the number of structures actually
written to pProperties
.
If pPropertyCount
is less than the number of cooperative matrix
properties available, at most pPropertyCount
structures will be
written, and VK_INCOMPLETE
will be returned instead of
VK_SUCCESS
, to indicate that not all the available cooperative matrix
properties were returned.
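The count and array contract described above is the usual two-call pattern: query the count, allocate, then fetch. The C sketch below models that contract with a mock function so it can run without a Vulkan device. `mock_enumerate`, `MockProps`, `MockResult`, and the three table entries are all hypothetical stand-ins, not part of the Vulkan API.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical stand-ins for VkResult and the properties structure. */
typedef enum { MOCK_SUCCESS, MOCK_INCOMPLETE } MockResult;
typedef struct { uint32_t MSize, NSize, KSize; } MockProps;

/* Made-up data playing the role of what a device would report. */
static const MockProps kAvailable[3] = {
    {16, 16, 16}, {16, 8, 16}, {8, 8, 32}
};

/* Mock that follows the same count/array contract as the enumeration command. */
MockResult mock_enumerate(uint32_t *pPropertyCount, MockProps *pProperties)
{
    assert(pPropertyCount != NULL);
    const uint32_t available = 3;

    if (pProperties == NULL) {
        /* First call: only report how many entries exist. */
        *pPropertyCount = available;
        return MOCK_SUCCESS;
    }
    /* Second call: write at most as many entries as the caller has room for. */
    uint32_t n = *pPropertyCount < available ? *pPropertyCount : available;
    for (uint32_t i = 0; i < n; ++i)
        pProperties[i] = kAvailable[i];
    *pPropertyCount = n; /* overwritten with the number actually written */
    return n < available ? MOCK_INCOMPLETE : MOCK_SUCCESS;
}
```

A caller would first pass a NULL array to learn the count, then pass an array of that size. Passing a smaller array is legal but produces the incomplete result. With the real command, the `sType` and `pNext` members of each array element would also need to be initialized before the second call.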
To enumerate additional supported cooperative matrix types and operations, call:
// Provided by VK_NV_cooperative_matrix2
VkResult vkGetPhysicalDeviceCooperativeMatrixFlexibleDimensionsPropertiesNV(
VkPhysicalDevice physicalDevice,
uint32_t* pPropertyCount,
VkCooperativeMatrixFlexibleDimensionsPropertiesNV* pProperties);
- physicalDevice is the physical device.
- pPropertyCount is a pointer to an integer related to the number of cooperative matrix properties available or queried.
- pProperties is either NULL or a pointer to an array of VkCooperativeMatrixFlexibleDimensionsPropertiesNV structures.
If pProperties
is NULL
, then the number of flexible dimensions
properties available is returned in pPropertyCount
.
Otherwise, pPropertyCount
must point to a variable set by the
application to the number of elements in the pProperties
array, and on
return the variable is overwritten with the number of structures actually
written to pProperties
.
If pPropertyCount
is less than the number of flexible dimensions
properties available, at most pPropertyCount
structures will be
written, and VK_INCOMPLETE
will be returned instead of
VK_SUCCESS
, to indicate that not all the available flexible dimensions
properties were returned.
If
cooperativeMatrixFlexibleDimensions
is not supported, the implementation must advertise zero properties.
To enumerate the supported cooperative matrix types and operations, call:
// Provided by VK_NV_cooperative_matrix
VkResult vkGetPhysicalDeviceCooperativeMatrixPropertiesNV(
VkPhysicalDevice physicalDevice,
uint32_t* pPropertyCount,
VkCooperativeMatrixPropertiesNV* pProperties);
- physicalDevice is the physical device.
- pPropertyCount is a pointer to an integer related to the number of cooperative matrix properties available or queried.
- pProperties is either NULL or a pointer to an array of VkCooperativeMatrixPropertiesNV structures.
If pProperties
is NULL
, then the number of cooperative matrix
properties available is returned in pPropertyCount
.
Otherwise, pPropertyCount
must point to a variable set by the
application to the number of elements in the pProperties
array, and on
return the variable is overwritten with the number of structures actually
written to pProperties
.
If pPropertyCount
is less than the number of cooperative matrix
properties available, at most pPropertyCount
structures will be
written, and VK_INCOMPLETE
will be returned instead of
VK_SUCCESS
, to indicate that not all the available cooperative matrix
properties were returned.
Each
VkCooperativeMatrixPropertiesKHR
or
VkCooperativeMatrixPropertiesNV
structure describes a single supported combination of types for a matrix
multiply/add operation (
OpCooperativeMatrixMulAddKHR
or
OpCooperativeMatrixMulAddNV
).
The multiply can be described in terms of the following variables and types
(in SPIR-V pseudocode):
%A is of type OpTypeCooperativeMatrixKHR %AType %scope %MSize %KSize %MatrixAKHR
%B is of type OpTypeCooperativeMatrixKHR %BType %scope %KSize %NSize %MatrixBKHR
%C is of type OpTypeCooperativeMatrixKHR %CType %scope %MSize %NSize %MatrixAccumulatorKHR
%Result is of type OpTypeCooperativeMatrixKHR %ResultType %scope %MSize %NSize %MatrixAccumulatorKHR
%Result = %A * %B + %C // using OpCooperativeMatrixMulAddKHR
%A is of type OpTypeCooperativeMatrixNV %AType %scope %MSize %KSize
%B is of type OpTypeCooperativeMatrixNV %BType %scope %KSize %NSize
%C is of type OpTypeCooperativeMatrixNV %CType %scope %MSize %NSize
%D is of type OpTypeCooperativeMatrixNV %DType %scope %MSize %NSize
%D = %A * %B + %C // using OpCooperativeMatrixMulAddNV
A matrix multiply with these dimensions is known as an MxNxK matrix multiply.
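To make the roles of M, N, and K concrete, the following C sketch computes the same D = A * B + C on the host with ordinary loops. It is a scalar reference model, not how a device executes a cooperative matrix operation. The function name and the row-major `float` layout are assumptions made for the example.

```c
#include <assert.h>
#include <stddef.h>

/* Reference MxNxK multiply-add on row-major float arrays:
 *   A is M x K, B is K x N, C and D are M x N, and D = A * B + C.
 * Names and layout are illustrative assumptions. */
void matmul_add(const float *A, const float *B, const float *C, float *D,
                size_t M, size_t N, size_t K)
{
    assert(A != NULL && B != NULL && C != NULL && D != NULL);
    for (size_t m = 0; m < M; ++m) {
        for (size_t n = 0; n < N; ++n) {
            float acc = C[m * N + n]; /* start from the accumulator input */
            for (size_t k = 0; k < K; ++k)
                acc += A[m * K + k] * B[k * N + n];
            D[m * N + n] = acc;
        }
    }
}
```

K is the dimension that disappears: it is shared by the columns of A and the rows of B, while the result keeps the M rows of A and the N columns of B.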
The VkCooperativeMatrixPropertiesKHR
structure is defined as:
// Provided by VK_KHR_cooperative_matrix
typedef struct VkCooperativeMatrixPropertiesKHR {
VkStructureType sType;
void* pNext;
uint32_t MSize;
uint32_t NSize;
uint32_t KSize;
VkComponentTypeKHR AType;
VkComponentTypeKHR BType;
VkComponentTypeKHR CType;
VkComponentTypeKHR ResultType;
VkBool32 saturatingAccumulation;
VkScopeKHR scope;
} VkCooperativeMatrixPropertiesKHR;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- MSize is the number of rows in matrices A, C, and Result.
- KSize is the number of columns in matrix A and rows in matrix B.
- NSize is the number of columns in matrices B, C, and Result.
- AType is the component type of matrix A, of type VkComponentTypeKHR.
- BType is the component type of matrix B, of type VkComponentTypeKHR.
- CType is the component type of matrix C, of type VkComponentTypeKHR.
- ResultType is the component type of matrix Result, of type VkComponentTypeKHR.
- saturatingAccumulation indicates whether the SaturatingAccumulation operand to OpCooperativeMatrixMulAddKHR must be present or not. If it is VK_TRUE, the SaturatingAccumulation operand must be present. If it is VK_FALSE, the SaturatingAccumulation operand must not be present.
- scope is the scope of all the matrix types, of type VkScopeKHR.
If some types are preferred over other types (e.g. for performance), they should appear earlier in the list enumerated by vkGetPhysicalDeviceCooperativeMatrixPropertiesKHR.
At least one entry in the list must have power of two values for all of
MSize
, KSize
, and NSize
.
If
cooperativeMatrixWorkgroupScope
is not supported,
scope
must be VK_SCOPE_SUBGROUP_KHR
.
The VkCooperativeMatrixFlexibleDimensionsPropertiesNV
structure is
defined as:
// Provided by VK_NV_cooperative_matrix2
typedef struct VkCooperativeMatrixFlexibleDimensionsPropertiesNV {
VkStructureType sType;
void* pNext;
uint32_t MGranularity;
uint32_t NGranularity;
uint32_t KGranularity;
VkComponentTypeKHR AType;
VkComponentTypeKHR BType;
VkComponentTypeKHR CType;
VkComponentTypeKHR ResultType;
VkBool32 saturatingAccumulation;
VkScopeKHR scope;
uint32_t workgroupInvocations;
} VkCooperativeMatrixFlexibleDimensionsPropertiesNV;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- MGranularity is the granularity of the number of rows in matrices A, C, and Result. The rows must be an integer multiple of this value.
- KGranularity is the granularity of columns in matrix A and rows in matrix B. The columns/rows must be an integer multiple of this value.
- NGranularity is the granularity of columns in matrices B, C, and Result. The columns must be an integer multiple of this value.
- AType is the component type of matrix A, of type VkComponentTypeKHR.
- BType is the component type of matrix B, of type VkComponentTypeKHR.
- CType is the component type of matrix C, of type VkComponentTypeKHR.
- ResultType is the component type of matrix Result, of type VkComponentTypeKHR.
- saturatingAccumulation indicates whether the SaturatingAccumulation operand to OpCooperativeMatrixMulAddKHR must be present or not. If it is VK_TRUE, the SaturatingAccumulation operand must be present. If it is VK_FALSE, the SaturatingAccumulation operand must not be present.
- scope is the scope of all the matrix types, of type VkScopeKHR.
- workgroupInvocations is the number of invocations in the local workgroup when this combination of values is supported.
Rather than explicitly enumerating a list of supported sizes,
VkCooperativeMatrixFlexibleDimensionsPropertiesNV
advertises size
granularities, where the matrix must be a multiple of the advertised size.
The M and K granularities apply to rows and columns of matrices with Use of MatrixA; the K and N granularities apply to rows and columns of matrices with Use of MatrixB; and the M and N granularities apply to rows and columns of matrices with Use of MatrixAccumulator.
For a given type combination, if multiple workgroup sizes are supported
there may be multiple
VkCooperativeMatrixFlexibleDimensionsPropertiesNV
structures with
different granularities.
All granularity values must be powers of two.
Different A/B types may require different granularities but share the same accumulator type. In such a case, the supported granularity for a matrix with the accumulator type would be the smallest advertised granularity.
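Checking a candidate matrix shape against advertised granularities is a divisibility test per dimension. The C sketch below shows one way an application might write that check. The function names are hypothetical, and the values would in practice come from an enumerated VkCooperativeMatrixFlexibleDimensionsPropertiesNV entry.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* True if `size` is a non-zero integer multiple of `granularity`. */
bool is_multiple_of(uint32_t size, uint32_t granularity)
{
    return granularity != 0 && size != 0 && size % granularity == 0;
}

/* True if an MxNxK shape satisfies the M, N, and K granularities.
 * Names are illustrative assumptions, not Vulkan API. */
bool dims_match_granularity(uint32_t M, uint32_t N, uint32_t K,
                            uint32_t MGranularity,
                            uint32_t NGranularity,
                            uint32_t KGranularity)
{
    return is_multiple_of(M, MGranularity) &&
           is_multiple_of(N, NGranularity) &&
           is_multiple_of(K, KGranularity);
}
```

For example, with all three granularities equal to 16, a 32x64x16 multiply passes the check while a 24x64x16 multiply does not, because 24 is not a multiple of 16.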
The VkCooperativeMatrixPropertiesNV
structure is defined as:
// Provided by VK_NV_cooperative_matrix
typedef struct VkCooperativeMatrixPropertiesNV {
VkStructureType sType;
void* pNext;
uint32_t MSize;
uint32_t NSize;
uint32_t KSize;
VkComponentTypeNV AType;
VkComponentTypeNV BType;
VkComponentTypeNV CType;
VkComponentTypeNV DType;
VkScopeNV scope;
} VkCooperativeMatrixPropertiesNV;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- MSize is the number of rows in matrices A, C, and D.
- KSize is the number of columns in matrix A and rows in matrix B.
- NSize is the number of columns in matrices B, C, and D.
- AType is the component type of matrix A, of type VkComponentTypeNV.
- BType is the component type of matrix B, of type VkComponentTypeNV.
- CType is the component type of matrix C, of type VkComponentTypeNV.
- DType is the component type of matrix D, of type VkComponentTypeNV.
- scope is the scope of all the matrix types, of type VkScopeNV.
If some types are preferred over other types (e.g. for performance), they should appear earlier in the list enumerated by vkGetPhysicalDeviceCooperativeMatrixPropertiesNV.
At least one entry in the list must have power of two values for all of
MSize
, KSize
, and NSize
.
Possible values for VkScopeKHR include:
// Provided by VK_KHR_cooperative_matrix
typedef enum VkScopeKHR {
VK_SCOPE_DEVICE_KHR = 1,
VK_SCOPE_WORKGROUP_KHR = 2,
VK_SCOPE_SUBGROUP_KHR = 3,
VK_SCOPE_QUEUE_FAMILY_KHR = 5,
// Provided by VK_NV_cooperative_matrix
VK_SCOPE_DEVICE_NV = VK_SCOPE_DEVICE_KHR,
// Provided by VK_NV_cooperative_matrix
VK_SCOPE_WORKGROUP_NV = VK_SCOPE_WORKGROUP_KHR,
// Provided by VK_NV_cooperative_matrix
VK_SCOPE_SUBGROUP_NV = VK_SCOPE_SUBGROUP_KHR,
// Provided by VK_NV_cooperative_matrix
VK_SCOPE_QUEUE_FAMILY_NV = VK_SCOPE_QUEUE_FAMILY_KHR,
} VkScopeKHR;
or the equivalent
// Provided by VK_NV_cooperative_matrix
typedef VkScopeKHR VkScopeNV;
- VK_SCOPE_DEVICE_KHR corresponds to SPIR-V Device scope.
- VK_SCOPE_WORKGROUP_KHR corresponds to SPIR-V Workgroup scope.
- VK_SCOPE_SUBGROUP_KHR corresponds to SPIR-V Subgroup scope.
- VK_SCOPE_QUEUE_FAMILY_KHR corresponds to SPIR-V QueueFamily scope.
All enum values match the corresponding SPIR-V value.
Possible values for VkComponentTypeKHR include:
// Provided by VK_KHR_cooperative_matrix
typedef enum VkComponentTypeKHR {
VK_COMPONENT_TYPE_FLOAT16_KHR = 0,
VK_COMPONENT_TYPE_FLOAT32_KHR = 1,
VK_COMPONENT_TYPE_FLOAT64_KHR = 2,
VK_COMPONENT_TYPE_SINT8_KHR = 3,
VK_COMPONENT_TYPE_SINT16_KHR = 4,
VK_COMPONENT_TYPE_SINT32_KHR = 5,
VK_COMPONENT_TYPE_SINT64_KHR = 6,
VK_COMPONENT_TYPE_UINT8_KHR = 7,
VK_COMPONENT_TYPE_UINT16_KHR = 8,
VK_COMPONENT_TYPE_UINT32_KHR = 9,
VK_COMPONENT_TYPE_UINT64_KHR = 10,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_FLOAT16_NV = VK_COMPONENT_TYPE_FLOAT16_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_FLOAT32_NV = VK_COMPONENT_TYPE_FLOAT32_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_FLOAT64_NV = VK_COMPONENT_TYPE_FLOAT64_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_SINT8_NV = VK_COMPONENT_TYPE_SINT8_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_SINT16_NV = VK_COMPONENT_TYPE_SINT16_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_SINT32_NV = VK_COMPONENT_TYPE_SINT32_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_SINT64_NV = VK_COMPONENT_TYPE_SINT64_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_UINT8_NV = VK_COMPONENT_TYPE_UINT8_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_UINT16_NV = VK_COMPONENT_TYPE_UINT16_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_UINT32_NV = VK_COMPONENT_TYPE_UINT32_KHR,
// Provided by VK_NV_cooperative_matrix
VK_COMPONENT_TYPE_UINT64_NV = VK_COMPONENT_TYPE_UINT64_KHR,
} VkComponentTypeKHR;
or the equivalent
// Provided by VK_NV_cooperative_matrix
typedef VkComponentTypeKHR VkComponentTypeNV;
- VK_COMPONENT_TYPE_FLOAT16_KHR corresponds to SPIR-V OpTypeFloat 16.
- VK_COMPONENT_TYPE_FLOAT32_KHR corresponds to SPIR-V OpTypeFloat 32.
- VK_COMPONENT_TYPE_FLOAT64_KHR corresponds to SPIR-V OpTypeFloat 64.
- VK_COMPONENT_TYPE_SINT8_KHR corresponds to SPIR-V OpTypeInt 8 0/1.
- VK_COMPONENT_TYPE_SINT16_KHR corresponds to SPIR-V OpTypeInt 16 0/1.
- VK_COMPONENT_TYPE_SINT32_KHR corresponds to SPIR-V OpTypeInt 32 0/1.
- VK_COMPONENT_TYPE_SINT64_KHR corresponds to SPIR-V OpTypeInt 64 0/1.
- VK_COMPONENT_TYPE_UINT8_KHR corresponds to SPIR-V OpTypeInt 8 0/1.
- VK_COMPONENT_TYPE_UINT16_KHR corresponds to SPIR-V OpTypeInt 16 0/1.
- VK_COMPONENT_TYPE_UINT32_KHR corresponds to SPIR-V OpTypeInt 32 0/1.
- VK_COMPONENT_TYPE_UINT64_KHR corresponds to SPIR-V OpTypeInt 64 0/1.
Validation Cache
Validation cache objects allow the result of internal validation to be reused, both within a single application run and between multiple runs. Reuse within a single run is achieved by passing the same validation cache object when creating supported Vulkan objects. Reuse across runs of an application is achieved by retrieving validation cache contents in one run of an application, saving the contents, and using them to preinitialize a validation cache on a subsequent run. The contents of the validation cache objects are managed by the validation layers. Applications can manage the host memory consumed by a validation cache object and control the amount of data retrieved from a validation cache object.
Validation cache objects are represented by VkValidationCacheEXT
handles:
// Provided by VK_EXT_validation_cache
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkValidationCacheEXT)
To create validation cache objects, call:
// Provided by VK_EXT_validation_cache
VkResult vkCreateValidationCacheEXT(
VkDevice device,
const VkValidationCacheCreateInfoEXT* pCreateInfo,
const VkAllocationCallbacks* pAllocator,
VkValidationCacheEXT* pValidationCache);
- device is the logical device that creates the validation cache object.
- pCreateInfo is a pointer to a VkValidationCacheCreateInfoEXT structure containing the initial parameters for the validation cache object.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pValidationCache is a pointer to a VkValidationCacheEXT handle in which the resulting validation cache object is returned.
Applications can track and manage the total host memory size of a
validation cache object using the |
Once created, a validation cache can be passed to the
vkCreateShaderModule
command by adding this object to the
VkShaderModuleCreateInfo structure’s pNext
chain.
If a VkShaderModuleValidationCacheCreateInfoEXT object is included in
the VkShaderModuleCreateInfo::pNext
chain, and its
validationCache
field is not VK_NULL_HANDLE, the implementation
will query it for possible reuse opportunities and update it with new
content.
The use of the validation cache object in these commands is internally
synchronized, and the same validation cache object can be used in multiple
threads simultaneously.
Implementations should make every effort to limit any critical sections to
the actual accesses to the cache, which is expected to be significantly
shorter than the duration of the |
The VkValidationCacheCreateInfoEXT
structure is defined as:
// Provided by VK_EXT_validation_cache
typedef struct VkValidationCacheCreateInfoEXT {
VkStructureType sType;
const void* pNext;
VkValidationCacheCreateFlagsEXT flags;
size_t initialDataSize;
const void* pInitialData;
} VkValidationCacheCreateInfoEXT;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- flags is reserved for future use.
- initialDataSize is the number of bytes in pInitialData. If initialDataSize is zero, the validation cache will initially be empty.
- pInitialData is a pointer to previously retrieved validation cache data. If the validation cache data is incompatible (as defined below) with the device, the validation cache will be initially empty. If initialDataSize is zero, pInitialData is ignored.
// Provided by VK_EXT_validation_cache
typedef VkFlags VkValidationCacheCreateFlagsEXT;
VkValidationCacheCreateFlagsEXT
is a bitmask type for setting a mask,
but is currently reserved for future use.
Validation cache objects can be merged using the command:
// Provided by VK_EXT_validation_cache
VkResult vkMergeValidationCachesEXT(
VkDevice device,
VkValidationCacheEXT dstCache,
uint32_t srcCacheCount,
const VkValidationCacheEXT* pSrcCaches);
- device is the logical device that owns the validation cache objects.
- dstCache is the handle of the validation cache to merge results into.
- srcCacheCount is the length of the pSrcCaches array.
- pSrcCaches is a pointer to an array of validation cache handles, which will be merged into dstCache. The previous contents of dstCache are included after the merge.
The details of the merge operation are implementation-dependent, but implementations should merge the contents of the specified validation caches and prune duplicate entries.
Data can be retrieved from a validation cache object using the command:
// Provided by VK_EXT_validation_cache
VkResult vkGetValidationCacheDataEXT(
VkDevice device,
VkValidationCacheEXT validationCache,
size_t* pDataSize,
void* pData);
- device is the logical device that owns the validation cache.
- validationCache is the validation cache to retrieve data from.
- pDataSize is a pointer to a value related to the amount of data in the validation cache, as described below.
- pData is either NULL or a pointer to a buffer.
If pData
is NULL
, then the maximum size of the data that can be
retrieved from the validation cache, in bytes, is returned in
pDataSize
.
Otherwise, pDataSize
must point to a variable set by the application
to the size of the buffer, in bytes, pointed to by pData
, and on
return the variable is overwritten with the amount of data actually written
to pData
.
If pDataSize
is less than the maximum size that can be retrieved by
the validation cache, at most pDataSize
bytes will be written to
pData
, and vkGetValidationCacheDataEXT
will return
VK_INCOMPLETE
instead of VK_SUCCESS
, to indicate that not all of
the validation cache was returned.
Any data written to pData
is valid and can be provided as the
pInitialData
member of the VkValidationCacheCreateInfoEXT
structure passed to vkCreateValidationCacheEXT
.
Two calls to vkGetValidationCacheDataEXT
with the same parameters
must retrieve the same data unless a command that modifies the contents of
the cache is called between them.
Applications can store the data retrieved from the validation cache, and
use these data, possibly in a future run of the application, to populate new
validation cache objects.
The results of validation, however, may depend on the vendor ID, device ID,
driver version, and other details of the device.
To enable applications to detect when previously retrieved data is
incompatible with the device, the initial bytes written to pData
must
be a header consisting of the following members:
Offset | Size | Meaning |
---|---|---|
0 | 4 | length in bytes of the entire validation cache header written as a stream of bytes, with the least significant byte first |
4 | 4 | a VkValidationCacheHeaderVersionEXT value written as a stream of bytes, with the least significant byte first |
8 | | a layer commit ID expressed as a UUID, which uniquely identifies the version of the validation layers used to generate these validation results |
The first four bytes encode the length of the entire validation cache header, in bytes. This value includes all fields in the header including the validation cache version field and the size of the length field.
The next four bytes encode the validation cache version, as described for VkValidationCacheHeaderVersionEXT. A consumer of the validation cache should use the cache version to interpret the remainder of the cache header.
If pDataSize
is less than what is necessary to store this header,
nothing will be written to pData
and zero will be written to
pDataSize
.
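Reading the header back from a saved blob takes two little-endian 32-bit loads followed by the UUID bytes. The C sketch below shows one possible parser. The `CacheHeader` type and function names are hypothetical, and the sketch assumes that everything from offset 8 to the end of the header is the UUID rather than hard-coding its size.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Parsed view of a validation cache header (illustrative type). */
typedef struct {
    uint32_t header_length;  /* bytes 0..3, least significant byte first */
    uint32_t header_version; /* bytes 4..7, least significant byte first */
    const uint8_t *uuid;     /* points into the caller's buffer at offset 8 */
    size_t uuid_size;        /* assumed to be header_length - 8 */
} CacheHeader;

/* Reads a 32-bit value stored with the least significant byte first. */
static uint32_t read_le32(const uint8_t *p)
{
    return (uint32_t)p[0] | ((uint32_t)p[1] << 8) |
           ((uint32_t)p[2] << 16) | ((uint32_t)p[3] << 24);
}

/* Returns false if the buffer is too small to hold the header it declares. */
bool parse_cache_header(const uint8_t *data, size_t size, CacheHeader *out)
{
    assert(data != NULL && out != NULL);
    if (size < 8)
        return false;
    uint32_t length = read_le32(data);
    if (length < 8 || length > size)
        return false;
    out->header_length = length;
    out->header_version = read_le32(data + 4);
    out->uuid = data + 8;
    out->uuid_size = length - 8;
    return true;
}
```

An application could run a check like this on data loaded from disk and compare the version and UUID against the values it expects before passing the blob as pInitialData.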
Possible values of the second group of four bytes in the header returned by vkGetValidationCacheDataEXT, encoding the validation cache version, are:
// Provided by VK_EXT_validation_cache
typedef enum VkValidationCacheHeaderVersionEXT {
VK_VALIDATION_CACHE_HEADER_VERSION_ONE_EXT = 1,
} VkValidationCacheHeaderVersionEXT;
- VK_VALIDATION_CACHE_HEADER_VERSION_ONE_EXT specifies version one of the validation cache.
To destroy a validation cache, call:
// Provided by VK_EXT_validation_cache
void vkDestroyValidationCacheEXT(
VkDevice device,
VkValidationCacheEXT validationCache,
const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the validation cache object.
- validationCache is the handle of the validation cache to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
CUDA Modules
Creating a CUDA Module
CUDA modules must contain some kernel code and must expose at least one function entry point.
CUDA modules are represented by VkCudaModuleNV handles:
// Provided by VK_NV_cuda_kernel_launch
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkCudaModuleNV)
To create a CUDA module, call:
// Provided by VK_NV_cuda_kernel_launch
VkResult vkCreateCudaModuleNV(
    VkDevice device,
    const VkCudaModuleCreateInfoNV* pCreateInfo,
    const VkAllocationCallbacks* pAllocator,
    VkCudaModuleNV* pModule);
- device is the logical device that creates the CUDA module.
- pCreateInfo is a pointer to a VkCudaModuleCreateInfoNV structure.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pModule is a pointer to a VkCudaModuleNV handle in which the resulting CUDA module object is returned.
Once a CUDA module has been created, the application may create the function entry point, which must refer to one function in the module.
The VkCudaModuleCreateInfoNV structure is defined as:
// Provided by VK_NV_cuda_kernel_launch
typedef struct VkCudaModuleCreateInfoNV {
    VkStructureType sType;
    const void* pNext;
    size_t dataSize;
    const void* pData;
} VkCudaModuleCreateInfoNV;
- sType is a VkStructureType value identifying this structure.
- pNext may be NULL or may be a pointer to a structure extending this structure.
- dataSize is the length of the pData array.
- pData is a pointer to CUDA code.
Creating a CUDA Function Handle
CUDA functions are represented by VkCudaFunctionNV handles. Handles to __global__ functions may then be used to issue a kernel launch (i.e. dispatch) from a command buffer. See Dispatching Command for CUDA PTX kernel.
// Provided by VK_NV_cuda_kernel_launch
VK_DEFINE_NON_DISPATCHABLE_HANDLE(VkCudaFunctionNV)
To create a CUDA function, call:
// Provided by VK_NV_cuda_kernel_launch
VkResult vkCreateCudaFunctionNV(
    VkDevice device,
    const VkCudaFunctionCreateInfoNV* pCreateInfo,
    const VkAllocationCallbacks* pAllocator,
    VkCudaFunctionNV* pFunction);
- device is the logical device that creates the CUDA function.
- pCreateInfo is a pointer to a VkCudaFunctionCreateInfoNV structure.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
- pFunction is a pointer to a VkCudaFunctionNV handle in which the resulting CUDA function object is returned.
The VkCudaFunctionCreateInfoNV structure is defined as:
// Provided by VK_NV_cuda_kernel_launch
typedef struct VkCudaFunctionCreateInfoNV {
    VkStructureType sType;
    const void* pNext;
    VkCudaModuleNV module;
    const char* pName;
} VkCudaFunctionCreateInfoNV;
- sType is a VkStructureType value identifying this structure.
- pNext is NULL or a pointer to a structure extending this structure.
- module is the VkCudaModuleNV module in which the function resides.
- pName is a null-terminated UTF-8 string containing the name of the function entry point in the module.
Destroying a CUDA Function
To destroy a CUDA function handle, call:
// Provided by VK_NV_cuda_kernel_launch
void vkDestroyCudaFunctionNV(
    VkDevice device,
    VkCudaFunctionNV function,
    const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the CUDA function.
- function is the handle of the CUDA function to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
Destroying a CUDA Module
To destroy a CUDA module, call:
// Provided by VK_NV_cuda_kernel_launch
void vkDestroyCudaModuleNV(
    VkDevice device,
    VkCudaModuleNV module,
    const VkAllocationCallbacks* pAllocator);
- device is the logical device that destroys the CUDA module.
- module is the handle of the CUDA module to destroy.
- pAllocator controls host memory allocation as described in the Memory Allocation chapter.
Reading back CUDA Module Cache
After uploading the PTX kernel code, the module compiles the code to generate a binary cache with all the necessary information for the device to execute it. It is possible to read back this cache for later use, such as accelerating the initialization of further executions.
To get the CUDA module cache, call:
// Provided by VK_NV_cuda_kernel_launch
VkResult vkGetCudaModuleCacheNV(
    VkDevice device,
    VkCudaModuleNV module,
    size_t* pCacheSize,
    void* pCacheData);
- device is the logical device that created the CUDA module.
- module is the CUDA module.
- pCacheSize is a pointer to the number of bytes to be copied into pCacheData.
- pCacheData is a pointer to a buffer into which the binary cache is copied.
If pCacheData is NULL, then the size of the binary cache, in bytes, is returned in pCacheSize. Otherwise, pCacheSize must point to a variable set by the application to the size of the buffer, in bytes, pointed to by pCacheData, and on return the variable is overwritten with the amount of data actually written to pCacheData. If the value pointed to by pCacheSize is less than the size of the binary cache, nothing is written to pCacheData, and VK_INCOMPLETE will be returned instead of VK_SUCCESS.
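The size query followed by the data query can be sketched as a two-call helper. To keep the sketch compilable without Vulkan headers, it uses hypothetical stand-ins: QueryResult mirrors only the two result codes discussed above, GetCacheFn is an assumed callback shape that real code would implement by calling vkGetCudaModuleCacheNV for a given device and module, and mock_get_cache is a fake driver included purely for demonstration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-ins; real code would use VkResult values. */
typedef enum { QUERY_SUCCESS = 0, QUERY_INCOMPLETE = 5 } QueryResult;

/* Assumed callback shape wrapping vkGetCudaModuleCacheNV. */
typedef QueryResult (*GetCacheFn)(void *ctx, size_t *pCacheSize,
                                  void *pCacheData);

/* Fetch the whole cache. Returns a malloc'ed buffer the caller must free,
 * or NULL on failure; *outSize receives the number of bytes written. */
static void *fetch_cache(GetCacheFn get, void *ctx, size_t *outSize)
{
    assert(get != NULL && outSize != NULL);
    size_t size = 0;
    *outSize = 0;

    /* First call: a NULL data pointer asks only for the required size. */
    if (get(ctx, &size, NULL) != QUERY_SUCCESS || size == 0)
        return NULL;

    void *buf = malloc(size);
    if (buf == NULL)
        return NULL;

    /* Second call: size is the buffer capacity on input and the number
     * of bytes actually written on output. */
    if (get(ctx, &size, buf) != QUERY_SUCCESS) {
        free(buf);
        return NULL;
    }
    *outSize = size;
    return buf;
}

/* Mock driver for demonstration only: serves a fixed 4-byte cache. */
static QueryResult mock_get_cache(void *ctx, size_t *pCacheSize,
                                  void *pCacheData)
{
    static const unsigned char blob[4] = {0xDE, 0xAD, 0xBE, 0xEF};
    (void)ctx;
    if (pCacheData == NULL) {
        *pCacheSize = sizeof blob;
        return QUERY_SUCCESS;
    }
    if (*pCacheSize < sizeof blob) {
        *pCacheSize = 0; /* nothing was written */
        return QUERY_INCOMPLETE;
    }
    memcpy(pCacheData, blob, sizeof blob);
    *pCacheSize = sizeof blob;
    return QUERY_SUCCESS;
}
```

Calling fetch_cache(mock_get_cache, NULL, &n) returns the four mock bytes. A buffer that is too small takes the QUERY_INCOMPLETE path and receives no data.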
The returned cache may then be used later for further initialization of the CUDA module, by sending this cache instead of the PTX code when using vkCreateCudaModuleNV.
Using the binary cache instead of the original PTX code should significantly speed up initialization of the CUDA module, because compilation and validation can be skipped. As with VkPipelineCache, the binary cache depends on the specific implementation. The application must assume that uploading the cache might fail in many circumstances, and should be prepared to fall back to the original PTX code if necessary. The cache is most likely to be accepted when the same device driver and architecture are used both to generate the cache from PTX and to consume it. With a new driver version or a different device architecture, the cache may become invalid.
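The fallback strategy described above can be sketched as follows. All names here are hypothetical: CreateModuleFn is an assumed callback that real code would implement by filling pData and dataSize of a VkCudaModuleCreateInfoNV and calling vkCreateCudaModuleNV, returning 0 on success. mock_create is a fake creator, included only for demonstration, that rejects blobs starting with a zero byte to imitate a driver refusing a stale cache.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper types; not part of the Vulkan API. */
typedef struct Blob {
    const void *data;
    size_t size;
} Blob;

typedef enum {
    MODULE_FROM_CACHE,
    MODULE_FROM_PTX,
    MODULE_FAILED
} ModuleSource;

/* Assumed callback wrapping vkCreateCudaModuleNV; returns 0 on success. */
typedef int (*CreateModuleFn)(void *ctx, const void *pData, size_t dataSize);

/* Try the binary cache first; if it is missing or rejected, fall back
 * to the original PTX code. */
static ModuleSource create_module_with_fallback(CreateModuleFn create,
                                                void *ctx,
                                                Blob cache, Blob ptx)
{
    assert(create != NULL);

    if (cache.data != NULL && cache.size > 0 &&
        create(ctx, cache.data, cache.size) == 0)
        return MODULE_FROM_CACHE;

    if (ptx.data != NULL && ptx.size > 0 &&
        create(ctx, ptx.data, ptx.size) == 0)
        return MODULE_FROM_PTX;

    return MODULE_FAILED;
}

/* Mock creator for demonstration only: rejects any blob whose first byte
 * is 0 and counts invocations through ctx. */
static int mock_create(void *ctx, const void *pData, size_t dataSize)
{
    int *calls = ctx;
    (void)dataSize;
    if (calls != NULL)
        ++*calls;
    return ((const unsigned char *)pData)[0] == 0 ? -1 : 0;
}
```

With a rejected cache the helper calls the creator twice and reports MODULE_FROM_PTX. With an accepted cache it calls the creator once and never touches the PTX code.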
Limitations
CUDA and Vulkan do not use the device in the same configuration. The following limitations must be taken into account:
- It is not possible to read or write global parameters from Vulkan. The only way to share resources or send values to the PTX kernel is to pass them as arguments of the function. See Resources sharing between CUDA Kernel and Vulkan for more details.
- No calls to functions external to the module PTX are supported.
- Vulkan disables some shader/kernel exceptions, which could break CUDA kernels relying on exceptions.
- CUDA kernels submitted to Vulkan are limited to the amount of shared memory, which can be queried from the physical capabilities. It may be less than what CUDA can offer.
- CUDA instruction-level preemption (CILP) does not work.
- CUDA Unified Memory will not work in this extension.
- CUDA Dynamic parallelism is not supported.
- vk*DispatchIndirect is not available.