[RFC] XLA:GPU Priority-based fusion pass #6407
UPDATE: We've reached a stage of development where we're satisfied with the performance of Priority Fusion on internal benchmarks. In about 4 weeks (mid January 2024), we'd like to enable priority-based fusion by default. We'll send another heads-up. Meanwhile, you can help us by trying to run your models with the --xla_gpu_enable_priority_fusion flag. After that, we will provide only limited support for the current fusion passes (GpuInstructionFusion and FusionMerger).
Overview
The XLA:GPU team is refactoring the fusion pipeline to base more decisions on the Global Cost Model instead of heuristics. This RFC focuses on the first step of that process: replacing the greedy bottom-up fusion passes with a single priority-based fusion pass. Details of the Global Cost Model implementation are outside the scope of this RFC.
Status Quo
The XLA:GPU fusion pipeline runs several fusion passes in sequence. This RFC focuses on two of them, GpuInstructionFusion and FusionMerger (described below); other passes remain untouched.
Instruction fusion
The GpuInstructionFusion pass processes instructions in reverse post-order (uses before defs, from the computation root towards the leaves). For each unfused instruction, the pass makes it the root of a new fusion and greedily tries to fuse its operands, until none of the operands can be fused.
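To make the greedy strategy concrete, here is a minimal sketch of that control flow. CanFuse and Fuse are hypothetical helpers (CanFuse stands in for ShouldFuse-style checks, Fuse for the actual graph rewrite); this illustrates the approach, not the actual GpuInstructionFusion code, and bookkeeping such as skipping instructions that were already fused away is omitted.

```cpp
#include <algorithm>
#include <vector>

#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloComputation;
using ::xla::HloInstruction;

// Hypothetical helpers, not part of the real pass.
bool CanFuse(const HloInstruction* consumer, const HloInstruction* operand);
HloInstruction* Fuse(HloInstruction* consumer, HloInstruction* operand);

void GreedyBottomUpFusion(HloComputation* computation) {
  // Reverse post-order: uses before defs, from the root towards the leaves.
  std::vector<HloInstruction*> order = computation->MakeInstructionPostOrder();
  std::reverse(order.begin(), order.end());
  for (HloInstruction* root : order) {
    HloInstruction* fusion = root;
    // Keep pulling operands into the fusion until none of them qualifies.
    bool fused_something = true;
    while (fused_something) {
      fused_something = false;
      for (HloInstruction* operand : fusion->operands()) {
        if (CanFuse(fusion, operand)) {
          fusion = Fuse(fusion, operand);
          fused_something = true;
          break;  // The operand list changed; restart the scan.
        }
      }
    }
  }
}
```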
GpuInstructionFusion::ShouldFuse decides whether an operand should be fused. It's an unstructured collection of various checks. The types of checks include, but are not limited to:
This approach has the following disadvantages:
FusionMerger
FusionMerger is an attempt to overcome the limitations of GpuInstructionFusion. It processes instructions in post-order (defs before uses) and tries to fuse each producer fusion instruction with its consumers, potentially duplicating the producer.
FusionMerger uses a simple Cost Model: GpuPerformanceModel::EstimateRunTimes. The model only supports fusion instructions with kLoop FusionKind.
The fusion happens only if all of the following requirements are met:
FusionMerger helps create better fusions, but it can’t revert unfavorable decisions from previous passes.
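To illustrate the kind of decision this cost model enables, here is a hedged sketch. The RunTimes field names follow the description above (estimated run time with the producer unfused vs. fused); the struct and helper are assumptions for illustration, not the exact GpuPerformanceModel API or FusionMerger logic.

```cpp
#include "absl/time/time.h"

// Assumed shape of the cost-model result described above: estimated run time
// with the producer kept as a separate kernel vs. fused into its consumers.
struct RunTimes {
  absl::Duration time_unfused;
  absl::Duration time_fused;
};

// Hypothetical helper: merging the producer into its consumers is profitable
// only if the fused estimate beats the unfused one. FusionMerger applies
// additional requirements on top of a check like this.
bool MergeIsProfitable(const RunTimes& run_times) {
  return run_times.time_fused < run_times.time_unfused;
}
```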
Priority-based Fusion
Goal
Replace the existing fusion passes with a new cost-model-aware pass to unify fusion logic and unblock future development of the cost model.
Design
PriorityFusion combines strong sides of GpuInstructionFusion and FusionMerger.
Fusion logic is implemented in the InstructionFusion base class, which is shared with GpuInstructionFusion and MultiOutputFusion. The priority-ordering logic plugs in via the GpuPriorityFusionQueue implementation of the FusionQueue interface.
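For orientation, the sketch below shows a pared-down version of that plug-in point. The real FusionQueue and InstructionFusion interfaces in XLA have more methods and may differ in their signatures, so treat this purely as an illustration of the structure.

```cpp
#include <cstdint>
#include <utility>
#include <vector>

#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloInstruction;

// Pared-down view of a fusion queue: the base pass repeatedly asks the queue
// which instruction to consider next and which of its operands to fuse.
class FusionQueue {
 public:
  virtual ~FusionQueue() = default;
  virtual std::pair<HloInstruction*, std::vector<int64_t>>
  DequeueNextInstructionAndOperandsToFuseInOrder() = 0;
};

// A priority-based queue orders candidates by the estimated benefit of the
// fusion (computed with the cost model) instead of by traversal order.
class GpuPriorityFusionQueue : public FusionQueue {
 public:
  std::pair<HloInstruction*, std::vector<int64_t>>
  DequeueNextInstructionAndOperandsToFuseInOrder() override;
};
```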
We aim for a clear separation between checks of what the emitters support and performance-related decisions:
GpuPerformanceModel::EstimateRunTimes is the current Cost Model (shared with FusionMerger). It returns two values: the estimated run time with the producer unfused and with it fused. A producer's priority is derived from these estimates (roughly, the run time saved by fusing it into its consumers); producers with negative priorities, i.e. where fusion is estimated to be slower, are not fused.
At a high level, the algorithm repeatedly picks the producer with the highest priority, fuses it into its consumers, and recomputes the priorities of the instructions affected by that fusion, until no producer with a positive priority remains.
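A hedged sketch of that loop is below. EstimatePriority, FuseIntoConsumers, and AffectedInstructions are hypothetical helpers standing in for the cost-model query and the graph rewrites; a real implementation would also need to invalidate stale queue entries, which is omitted here for brevity.

```cpp
#include <queue>
#include <utility>
#include <vector>

#include "xla/hlo/ir/hlo_computation.h"
#include "xla/hlo/ir/hlo_instruction.h"

using ::xla::HloComputation;
using ::xla::HloInstruction;

// Hypothetical helpers standing in for cost-model queries and rewrites.
double EstimatePriority(HloInstruction* producer);  // ~ time_unfused - time_fused
void FuseIntoConsumers(HloInstruction* producer);   // may duplicate the producer
std::vector<HloInstruction*> AffectedInstructions(HloInstruction* producer);

void PriorityFusionSketch(HloComputation* computation) {
  // 1. Compute an initial priority for every producer.
  std::priority_queue<std::pair<double, HloInstruction*>> queue;
  for (HloInstruction* producer : computation->instructions()) {
    queue.push({EstimatePriority(producer), producer});
  }
  // 2. Repeatedly fuse the highest-priority producer into its consumers.
  while (!queue.empty()) {
    auto [priority, producer] = queue.top();
    queue.pop();
    if (priority <= 0) break;  // No remaining candidate is estimated to help.
    FuseIntoConsumers(producer);
    // 3. Re-estimate the priorities of instructions affected by this fusion,
    //    since their fusion benefit may have changed.
    for (HloInstruction* affected : AffectedInstructions(producer)) {
      queue.push({EstimatePriority(affected), affected});
    }
  }
}
```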
The main advantages of priority-based fusion:
Implementation Status
The new pass is hidden behind the --xla_gpu_enable_priority_fusion flag. We aim to enable PriorityFusion by default in Q4 2023, once we have performance parity on internal Google benchmarks.
You can try the pass on your workloads already and let us know if you find major regressions, but keep in mind that it’s still under development.
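For most setups the flag can be passed via the XLA_FLAGS environment variable, e.g. XLA_FLAGS=--xla_gpu_enable_priority_fusion=true; this is the general XLA mechanism for setting debug options rather than anything specific to this pass, and the exact way to set it depends on the framework you use.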
Future Work
A rough plan of work items after priority-based fusion is enabled by default: