Auto-vectorization in GCC

The goal of this project was to develop a loop and basic block vectorizer in GCC, based on the tree-ssa framework. It has been completed and the functionality has been part of GCC for years.

Latest News
Contributing
Using the Vectorizer
Vectorizable Loops
Unvectorizable Loops
Previous News and Status
References/Documentation
High-Level Plan of Implementation

Latest News

2011-10-23

Vectorization of reduction in loop SLP. Both multiple reduction cycles and reduction chains are supported.
Various basic block vectorization (SLP) improvements, such as better data dependence analysis, support of misaligned accesses and multiple types, cost model.
Detection of vector size: https://gcc.gnu.org/ml/gcc-patches/2010-10/msg00441.html.
Vectorization of loads with negative step.
Improved realignment scheme: https://gcc.gnu.org/ml/gcc-patches/2010-06/msg02301.html.
A new built-in, __builtin_assume_aligned, has been added, through which the compiler can be hinted about pointer alignment.
Support of strided accesses using memory instructions that have the interleaving "built in", such as NEON's vldN and vstN.
The vectorizer now attempts to reduce over-promotion of operands in some vector operations: https://gcc.gnu.org/ml/gcc-patches/2011-07/msg01472.html.
Widening shifts are now detected and vectorized if supported by the target.
Vectorization of conditions with mixed types.
Support of loops with bool.

2009-12-03

The following new vectorization features were committed to mainline:

vectorization of conditions in nested loop (2009-07-20)
vectorization of double reduction (reductions carried by two nested loops) (2009-07-12)
vectorization of nested cycles (dependence cycles other than reduction cycles in nested loops) (2009-06-16)
support of misaligned stores for platforms that allow misaligned access (2009-06-05)
basic block SLP (vectorization of straight line code) (2009-05-25)
avoid versioning of vectorized loop if possible (2009-04-02) (https://gcc.gnu.org/ml/gcc-patches/2009-03/msg01784.html).
support load permutations in loop-aware SLP (2008-08-25)
support multiple types in loop-aware SLP (2008-08-19)

2008-08-11

Vectorization supports integer type conversions also when one type is more than two times bigger than the other (e.g. char to int) (2008-08-11).
UNITS_PER_SIMD_WORD can be different for different scalar types (2008-05-22).
Vector shifts by a vector shift amount differentiated from vector shifts with scalar shift amount (2008-05-14).
Complete unrolling enabled before vectorization, relying on intra-iteration vectorization (aka SLP) to vectorize unrolled loops (2008-04-27).
Further refinements to the cost model (2007-12-06).
-ftree-vectorize is turned on under -O3 (2007-09-18).

Contributing

This project was started by Dorit (Naishlos) Nuzman. Current contributors to this project include Revital Eres, Richard Guenther, Jakub Jelinek, Michael Matz, Richard Sandiford, and Ira Rosen. This web page is maintained by Ira Rosen <[email protected]>. For a list of missing features and possible enhancements see https://gcc.gnu.org/wiki/VectorizationTasks.

Using the Vectorizer

Vectorization is enabled by the flag -ftree-vectorize and by default at -O3. To allow vectorization on powerpc* platforms also use -maltivec. On i?86 and x86_64 platforms use -msse/-msse2. To enable vectorization of floating point reductions use -ffast-math or -fassociative-math.

The vectorizer test cases demonstrate the current vectorization capabilities; these can be found under gcc/gcc/testsuite/gcc.dg/vect/. Information on which loops were or were not vectorized and why, can be obtained using the flag -ftree-vectorizer-verbose. For details see https://gcc.gnu.org/ml/gcc-patches/2005-01/msg01247.html. Example output using -ftree-vectorizer-verbose=2:

vect-1.c:82: note: not vectorized, possible dependence between data-refs a[i_124] and a[i_83]
vect-1.c:72: note: LOOP VECTORIZED.
vect-1.c:64: note: LOOP VECTORIZED.
vect-1.c:56: note: LOOP VECTORIZED.
vect-1.c:49: note: LOOP VECTORIZED.
vect-1.c:41: note: not vectorized: unsupported use in stmt.
vect-1.c:31: note: not vectorized: unsupported use in stmt.
vect-1.c:13: note: vectorized 4 loops in function.

Basic block vectorization, aka SLP, is enabled by the flag -ftree-slp-vectorize, and requires the same platform dependent flags as loop vectorization. Basic block SLP is enabled by default at -O3 and when -ftree-vectorize is enabled.

Vectorizable Loops

"feature" indicates the vectorization capabilities demonstrated by the example.

Example 1:

int a[256], b[256], c[256];
foo () {
  int i;

  for (i=0; i<256; i++){
    a[i] = b[i] + c[i];
  }
}

Example 2:

int a[256], b[256], c[256];
foo (int n, int x) {
   int i;

   /* feature: support for unknown loop bound  */
   /* feature: support for loop invariants  */
   for (i=0; i<n; i++)
      b[i] = x;
   }

   /* feature: general loop exit condition  */
   /* feature: support for bitwise operations  */
   while (n--){
      a[i] = b[i]&c[i]; i++;
   }
}

Example 3:

typedef int aint __attribute__ ((__aligned__(16)));
foo (int n, aint * __restrict__ p, aint * __restrict q) {

   /* feature: support for (aligned) pointer accesses.  */
   while (n--){
      *p++ = *q++;
   }
}

Example 4:

typedef int aint __attribute__ ((__aligned__(16)));
int a[256], b[256], c[256];
foo (int n, aint * __restrict__ p, aint * __restrict__ q) {
   int i;

   /* feature: support for (aligned) pointer accesses  */
   /* feature: support for constants  */
   while (n--){
      *p++ = *q++ + 5;
   }

   /* feature: support for read accesses with a compile time known misalignment  */
   for (i=0; i<n; i++){
      a[i] = b[i+1] + c[i+3];
   }

   /* feature: support for if-conversion  */
   for (i=0; i<n; i++){
      j = a[i];
      b[i] = (j > MAX ? MAX : 0);
   }
}

Example 5:

struct a {
  int ca[N];
} s;
for (i = 0; i < N; i++)
  {
    /* feature: support for alignable struct access  */
    s.ca[i] = 5;
  }

Example 6: gfortran:

DIMENSION A(1000000), B(1000000), C(1000000)
READ*, X, Y
A = LOG(X); B = LOG(Y); C = A + B
PRINT*, C(500000)
END

Example 7:

int a[256], b[256];
foo (int x) {
   int i;

   /* feature: support for read accesses with an unknown misalignment  */
   for (i=0; i<N; i++){
      a[i] = b[i+x];
   }
}

Example 8:

int a[M][N];
foo (int x) {
   int i,j;

   /* feature: support for multidimensional arrays  */
   for (i=0; i<M; i++) {
     for (j=0; j<N; j++) {
       a[i][j] = x;
     }
   }
}

Example 9:

unsigned int ub[N], uc[N];
foo () {
  int i;

  /* feature: support summation reduction.
     note: in case of floats use -funsafe-math-optimizations  */
  unsigned int diff = 0;
  for (i = 0; i < N; i++) {
    udiff += (ub[i] - uc[i]);
  }

Example 10:

/* feature: support data-types of different sizes.
   Currently only a single vector-size per target is supported; 
   it can accommodate n elements such that n = vector-size/element-size 
   (e.g, 4 ints, 8 shorts, or 16 chars for a vector of size 16 bytes). 
   A combination of data-types of different sizes in the same loop 
   requires special handling. This support is now present in mainline,
   and also includes support for type conversions.  */

short *sa, *sb, *sc;
int *ia, *ib, *ic;
for (i = 0; i < N; i++) {
  ia[i] = ib[i] + ic[i];
  sa[i] = sb[i] + sc[i];
}

for (i = 0; i < N; i++) {
  ia[i] = (int) sb[i];
}

Example 11:

/* feature: support strided accesses - the data elements
   that are to be operated upon in parallel are not consecutive - they
   are accessed with a stride > 1 (in the example, the stride is 2):  */

for (i = 0; i < N/2; i++){
  a[i] = b[2*i+1] * c[2*i+1] - b[2*i] * c[2*i];
  d[i] = b[2*i] * c[2*i+1] + b[2*i+1] * c[2*i];
}

Example 12: Induction:

for (i = 0; i < N; i++) {
  a[i] = i;
}

Example 13: Outer-loop:

  for (i = 0; i < M; i++) {
    diff = 0;
    for (j = 0; j < N; j+=8) {
      diff += (a[i][j] - b[i][j]);
    }
    out[i] = diff;
  }
}

Example 14: Double reduction:

  for (k = 0; k < K; k++) {
    sum = 0;
    for (j = 0; j < M; j++)
      for (i = 0; i < N; i++)
          sum += in[i+k][j] * coeff[i][j];

    out[k] = sum;
  }

Example 15: Condition in nested loop:

  for (j = 0; j < M; j++)
    {
      x = x_in[j];
      curr_a = a[0];

      for (i = 0; i < N; i++)
        {
          next_a = a[i+1];
          curr_a = x > c[i] ? curr_a : next_a;
        }

      x_out[j] = curr_a;
    }

Example 16: Load permutation in loop-aware SLP:

  for (i = 0; i < N; i++)
    {
       a = *pInput++;
       b = *pInput++;
       c = *pInput++;

       *pOutput++ = M00 * a + M01 * b + M02 * c;
       *pOutput++ = M10 * a + M11 * b + M12 * c;
       *pOutput++ = M20 * a + M21 * b + M22 * c;
    }

Example 17: Basic block SLP:

void foo ()
{
  unsigned int *pin = &in[0];
  unsigned int *pout = &out[0];

  *pout++ = *pin++;
  *pout++ = *pin++;
  *pout++ = *pin++;
  *pout++ = *pin++;
}

Example 18: Simple reduction in SLP:

int sum1;
int sum2;
int a[128];
void foo (void)
{
  int i;

  for (i = 0; i < 64; i++)
    {
      sum1 += a[2*i];
      sum2 += a[2*i+1];
    }
}

Example 19: Reduction chain in SLP:

int sum;
int a[128];
void foo (void)
{
  int i;

  for (i = 0; i < 64; i++)
    {
      sum += a[2*i];
      sum += a[2*i+1];
    }
}

Example 20: Basic block SLP with multiple types, loads with different offsets, misaligned load, and not-affine accesses:

void foo (int * __restrict__ dst, short * __restrict__ src,
          int h, int stride, short A, short B)
{
  int i;
  for (i = 0; i < h; i++)
    {
      dst[0] += A*src[0] + B*src[1];
      dst[1] += A*src[1] + B*src[2];
      dst[2] += A*src[2] + B*src[3];
      dst[3] += A*src[3] + B*src[4];
      dst[4] += A*src[4] + B*src[5];
      dst[5] += A*src[5] + B*src[6];
      dst[6] += A*src[6] + B*src[7];
      dst[7] += A*src[7] + B*src[8];
      dst += stride;
      src += stride;
    }
}

Example 21: Backward access:

int foo (int *b, int n)
{
  int i, a = 0;

  for (i = n-1; i ≥ 0; i--)
    a += b[i];

  return a;
}

Example 22: Alignment hints:

void foo (int *out1, int *in1, int *in2, int n)
{
  int i;

  out1 = __builtin_assume_aligned (out1, 32, 16);
  in1 = __builtin_assume_aligned (in1, 32, 16);
  in2 = __builtin_assume_aligned (in2, 32, 0);

  for (i = 0; i < n; i++)
    out1[i] = in1[i] * in2[i];
}

Example 23: Widening shift:

void foo (unsigned short *src, unsigned int *dst)
{
  int i;

  for (i = 0; i < 256; i++)
    *dst++ = *src++ << 7;
}

Example 24: Condition with mixed types:

#define N 1024
float a[N], b[N];
int c[N];

void foo (short x, short y)
{
  int i;
  for (i = 0; i < N; i++)
    c[i] = a[i] < b[i] ? x : y;
}

Example 25: Loop with bool:

#define N 1024
float a[N], b[N], c[N], d[N];
int j[N];

void foo (void)
{
  int i;
  _Bool x, y;
  for (i = 0; i < N; i++)
    {
      x = (a[i] < b[i]);
      y = (c[i] < d[i]);
      j[i] = x & y;
    }
}

Unvectorizable Loops

Examples of loops that currently cannot be vectorized:

Example 1: Uncountable loop:

while (*p != NULL) {
  *q++ = *p++;
}

Also see https://gcc.gnu.org/wiki/VectorizationTasks and a list of vectorizer missed-optimization PRs in the GCC bug tracker.

Previous News and Status

2007-09-17

-ftree-vectorize is going to be turned on under -O3.
Cost-model tweaks and x86_64 specific costs committed to mainline (2007-09-10).
Vectorization that exploits intra-iteration parallelism (ala SLP) was committed to mainline (2007-09-09).
-fassociative-math can be used instead of -ffast-math to enable vectorization of reductions of floats (2007-09-04).
Initial support for vectorization of outer-loops (doubly nested loops) was committed to mainline (2007-08-19).
Run-time dependence testing using loop-versioning was committed to mainline (2007-08-16).

2007-07-25

The new vectorization feature that exploits intra-iteration parallelism (to be submitted to GCC 4.3) was presented at the GCC summit last week ("Loop-aware SLP"). Also at the summit, we held a BOF on vectorization and other loop optimizations. The summary of the BOF can be found here: https://gcc.gnu.org/wiki/LoopOptimizationsBOF. Following the BOF the vectorizer's todo list was also updated ( https://gcc.gnu.org/wiki/VectorizationTasks).

Mainline updates:

SPU specific costs for the cost model committed to mainline (2007-07-12). Tuning for other platforms (PPC, x86) ongoing.
Initial cost model implementation committed to mainline (2007-06-08).
Vectorization of fp/integer conversions of different sizes (e.g. float/short) committed to mainline (2007-05-17).
Data-refs analysis was rewritten and improved (2007-05-13).

Autovect-branch updates:

Outer-loop vectorization was enhanced to support aligned and unaligned memory references in the inner-loop, using the optimized realignment scheme when possible.
Vectorization that exploits intra-iteration parallelism (ala SLP) was added to the vectorizer (that so far exploited only inter-iteration parallelism).
The vectorizer cost model was extended to support the above two new vectorization features (outer-loop and "SLP").

2007-05-09

Vectorization of fp/integer conversions of different sizes (e.g. float/short) is soon to be committed to mainline.
Initial cost model implementation is soon to be committed to mainline.

2007-04-22

Vectorization of float/double conversions was added to mainline.
Initial vectorization support for certain forms of outer-loops (doubly nested loops) was added to autovect-branch.

2007-02-21

Vectorization of float/int conversions added to mainline.
Vectorization of inductions added to mainline.
Vectorization of function-calls added to mainline.
Improvements to vectorization of strided-accesses - added to autovect-branch.
Discussion on building a cost-model for the vectorizer - started on the mailing list.

2007-01-14

A new flag to limit vectorization to loops with large enough loop-bound was added: --param min-vect-loop-bound=X prevents vectorization of loops whose vectorized loop-bound is equal or less than X.
Autovect branch has been reopened: ( https://gcc.gnu.org/ml/gcc/2007-01/msg00117.html).

2006-11-27

Vectorization of loops that operate on multiple data-types, including type conversions: incorporated into GCC 4.3.
Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides: incoporated into GCC 4.3.
Additional vectorizer related projects planned for GCC 4.3 can be found here: ( https://gcc.gnu.org/wiki/AutovectBranchOptimizations).

2006-02-19

Vectorization of loops that operate on multiple data-types, including type conversions: submitted for incorporation into GCC 4.2.
Detection and vectorization of special idioms, such as dot-product and widening-summation: Incorporated into GCC 4.2.
Vectorization of non consecutive (non-unit-stride) data-accesses with power-of-2 strides: Incoporated into autovect-branch. To be submitted to GCC 4.2.

2005-10-23

Autovect-branch has been enhanced to support the following features:

Vectorization of loops that operate on multiple data-types, including type promotion (conversion to a wider type) and type demotion (conversion to a narrower type). Type promotion is supported using the new VEC_UNPACK_HI and VEC_UNPACK_LO tree-codes (and the new vec_unpacks_hi/lo and vec_unpacku_hi/lo optabs). Type demotion is supported using the new VEC_PACK_MOD tree-code (and the new vec_pack_mod optab).
Vectorization of idioms that involve type conversion. This allows more efficient vectorization (if specialized target support is available) that avoids the data packing/unpacking that is otherwise required to handle multiple data-types. These idioms include: widening-summation (WIDEN_SUM), dot-product (DOT_PROD), widening-multiplication (WIDEN_MULT, VEC_WIDEN_MULT_HI/LO), multiply-highpart (MULT_HI) and sum-of-absolute-differences (SAD).

2005-08-11

The following enhancements have been incorpoated into GCC 4.1:

Vectorization of reduction has been introduced and currently supports summation, and minimum/maximum computation.
Vectorization of conditional code has been introduced.
The mechanism of peeling a loop to force the alignment of data accesses has been improved. We now generate better code when the misalignment of an access is known at compile time, or when different accesses are known to have the same misalignment, even if the misalignment amount itself is unknown.
Dependence distance is considered when checking dependences between data references.
Loop-closed-ssa-form is incrementally updated during vectorization.
Generic parts of the data references analysis were cleaned up and externalized to make this analysis available to other passes.

2005-04-05: Vectorization of reduction on autovect-branch was enhanced to support maximum/minimum computations, and special reduction idioms such as widening summation as a step towards supporting patterns like dot-product, sum of absolute differences and more: ( https://gcc.gnu.org/ml/gcc-patches/2005-04/msg00532.html).

2005-03-01: Vectorization projects for GCC 4.1: See https://gcc.gnu.org/wiki/Autovectorization_Enhancements.; Vectorization capabilities in GCC 4.0: See 2005-03-01 mainline status.

2005-02-25

New features:

Summation is the first reduction idiom that is vectorized (autovect-branch only).
Verbosity levels for vectorization reports.
Improved line number information.
Revised data-references analysis.

2005-03-01, autovect-branch