[RFC] XLA:GPU Priority-based fusion pass #6407
UPDATE: We've reached a stage of development where we're satisfied with the performance of Priority Fusion on internal benchmarks. In about 4 weeks (mid January 2024), we'd like to enable priority-based fusion by default. We'll send another heads-up. Meanwhile, you can help us by trying to run your models with the --xla_gpu_enable_priority_fusion flag. After that, we will provide only limited support for the current fusion passes (GpuInstructionFusion and FusionMerger).
Overview
The XLA:GPU team is refactoring the fusion pipeline to base more decisions on the Global Cost Model instead of heuristics. This RFC focuses on the first step of that process: replacing the greedy bottom-up fusion passes with a single priority-based fusion pass. Details of the Global Cost Model implementation are outside the scope of this RFC.
Status Quo
The XLA:GPU fusion pipeline runs several fusion passes in sequence. This RFC focuses on two of them, GpuInstructionFusion and FusionMerger (described below); other passes remain untouched.
Instruction fusion
The GpuInstructionFusion pass processes instructions in reverse post-order (uses before defs, from the computation root towards the leaves). For each unfused instruction, the pass makes it the root of a new fusion and greedily tries to fuse its operands, until none of the operands can be fused.
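To make the greedy strategy concrete, here is a minimal sketch of that control flow. CanFuse and Fuse are hypothetical helpers (CanFuse stands in for ShouldFuse-style checks, Fuse for the actual graph rewrite); this illustrates the approach, not the actual GpuInstructionFusion code, and bookkeeping such as skipping instructions that were already fused away is omitted.

```cpp
#include <algorithm>
#include <vector>

#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloComputation;
using ::xla::HloInstruction;

// Hypothetical helpers, not part of the real pass.
bool CanFuse(const HloInstruction* consumer, const HloInstruction* operand);
HloInstruction* Fuse(HloInstruction* consumer, HloInstruction* operand);

void GreedyBottomUpFusion(HloComputation* computation) {
  // Reverse post-order: uses before defs, from the root towards the leaves.
  std::vector<HloInstruction*> order = computation->MakeInstructionPostOrder();
  std::reverse(order.begin(), order.end());
  for (HloInstruction* root : order) {
    HloInstruction* fusion = root;
    // Keep pulling operands into the fusion until none of them qualifies.
    bool fused_something = true;
    while (fused_something) {
      fused_something = false;
      for (HloInstruction* operand : fusion->operands()) {
        if (CanFuse(fusion, operand)) {
          fusion = Fuse(fusion, operand);
          fused_something = true;
          break;  // The operand list changed; restart the scan.
        }
      }
    }
  }
}
```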
GpuInstructionFusion::ShouldFuse decides whether an operand should be fused. It's an unstructured collection of various checks. The types of checks include, but are not limited to:
This approach has the following disadvantages:
FusionMerger
FusionMerger is an attempt to overcome the limitations of GpuInstructionFusion. It processes instructions in post-order (defs before uses) and tries to fuse each producer fusion instruction with its consumers, potentially duplicating the producer.
FusionMerger uses a simple Cost Model: GpuPerformanceModel::EstimateRunTimes. The model only supports fusion instructions with kLoop FusionKind.
The fusion happens only if all of the following requirements are met:
FusionMerger helps create better fusions, but it can’t revert unfavorable decisions from previous passes.
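To illustrate the kind of decision this cost model enables, here is a hedged sketch. The RunTimes field names follow the description above (estimated run time with the producer unfused vs. fused); the struct and helper are assumptions for illustration, not the exact GpuPerformanceModel API or FusionMerger logic.

```cpp
#include "absl/time/time.h"

// Assumed shape of the cost-model result described above: estimated run time
// with the producer kept as a separate kernel vs. fused into its consumers.
struct RunTimes {
  absl::Duration time_unfused;
  absl::Duration time_fused;
};

// Hypothetical helper: merging the producer into its consumers is profitable
// only if the fused estimate beats the unfused one. FusionMerger applies
// additional requirements on top of a check like this.
bool MergeIsProfitable(const RunTimes& run_times) {
  return run_times.time_fused < run_times.time_unfused;
}
```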
Priority-based Fusion
Goal
Replace the existing fusion passes with a new cost-model-aware pass to unify fusion logic and unblock future development of the cost model.
Design
PriorityFusion combines strong sides of GpuInstructionFusion and FusionMerger.
Fusion logic is implemented in the InstructionFusion base class, which is shared with GpuInstructionFusion and MultiOutputFusion. The priority-ordering logic plugs in via the GpuPriorityFusionQueue implementation of the FusionQueue interface.
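For orientation, the sketch below shows a pared-down version of that plug-in point. The real FusionQueue and InstructionFusion interfaces in XLA have more methods and may differ in their signatures, so treat this purely as an illustration of the structure.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloInstruction;

// Pared-down view of a fusion queue: the base pass repeatedly asks the queue
// which instruction to consider next and which of its operands to fuse.
class FusionQueue {
 public:
  virtual ~FusionQueue() = default;
  virtual std::pair<HloInstruction*, std::vector<int64_t>>
  DequeueNextInstructionAndOperandsToFuseInOrder() = 0;
};

// A priority-based queue orders candidates by the estimated benefit of the
// fusion (computed with the cost model) instead of by traversal order.
class GpuPriorityFusionQueue : public FusionQueue {
 public:
  std::pair<HloInstruction*, std::vector<int64_t>>
  DequeueNextInstructionAndOperandsToFuseInOrder() override;
};
```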
We aim for a clear separation between checks of what the emitters support and performance-related decisions:
GpuPerformanceModel::EstimateRunTimes is the current Cost Model (shared with FusionMerger). It returns two values: the estimated run time with the producer unfused and with it fused. A producer's priority is derived from these estimates (roughly, the run time saved by fusing it into its consumers); producers with negative priorities, i.e. where fusion is estimated to be slower, are not fused.
At a high level, the algorithm repeatedly picks the producer with the highest priority, fuses it into its consumers, and recomputes the priorities of the instructions affected by that fusion, until no producer with a positive priority remains.
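A hedged sketch of that loop is below. EstimatePriority, FuseIntoConsumers, and AffectedInstructions are hypothetical helpers standing in for the cost-model query and the graph rewrites; a real implementation would also need to invalidate stale queue entries, which is omitted here for brevity.

```cpp
#include <queue>
#include <utility>
#include <vector>

#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloComputation;
using ::xla::HloInstruction;

// Hypothetical helpers standing in for cost-model queries and rewrites.
double EstimatePriority(HloInstruction* producer);  // ~ time_unfused - time_fused
void FuseIntoConsumers(HloInstruction* producer);   // may duplicate the producer
std::vector<HloInstruction*> AffectedInstructions(HloInstruction* producer);

void PriorityFusionSketch(HloComputation* computation) {
  // 1. Compute an initial priority for every producer.
  std::priority_queue<std::pair<double, HloInstruction*>> queue;
  for (HloInstruction* producer : computation->instructions()) {
    queue.push({EstimatePriority(producer), producer});
  }
  // 2. Repeatedly fuse the highest-priority producer into its consumers.
  while (!queue.empty()) {
    auto [priority, producer] = queue.top();
    queue.pop();
    if (priority <= 0) break;  // No remaining candidate is estimated to help.
    FuseIntoConsumers(producer);
    // 3. Re-estimate the priorities of instructions affected by this fusion,
    //    since their fusion benefit may have changed.
    for (HloInstruction* affected : AffectedInstructions(producer)) {
      queue.push({EstimatePriority(affected), affected});
    }
  }
}
```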
The main advantages of priority-based fusion:
Implementation Status
The new pass is hidden behind the --xla_gpu_enable_priority_fusion flag. We aim to enable PriorityFusion by default in Q4 2023, once we have performance parity on internal Google benchmarks.
You can try the pass on your workloads already and let us know if you find major regressions, but keep in mind that it’s still under development.
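For most setups the flag can be passed via the XLA_FLAGS environment variable, e.g. XLA_FLAGS=--xla_gpu_enable_priority_fusion=true; this is the general XLA mechanism for setting debug options rather than anything specific to this pass, and the exact way to set it depends on the framework you use.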
Future Work
A rough plan of work items after priority-based fusion is enabled by default: