RISC-V Vector Extension

The RISC-V target supports the 1.0 version of the RISC-V Vector Extension (RVV). This guide gives an overview of how it’s modelled in LLVM IR and how the backend generates code for it.

Mapping to LLVM IR types

RVV adds 32 VLEN-sized registers, where VLEN is a constant unknown to the compiler. To be able to represent VLEN-sized values, the RISC-V backend takes the same approach as AArch64’s SVE and uses scalable vector types.

Scalable vector types are of the form <vscale x n x ty>, which indicate a vector with a multiple of n elements of type ty. On RISC-V, n and ty control LMUL and SEW respectively.

LLVM only supports ELEN=32 or ELEN=64, so vscale is defined as VLEN/64 (see RISCV::RVVBitsPerBlock). Note this means that VLEN must be at least 64, so VLEN=32 isn’t currently supported.
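Concretely, a <vscale x n x ty> value holds vscale × n elements at runtime. A minimal sketch of this arithmetic, using a hypothetical helper name (the constant mirrors RISCV::RVVBitsPerBlock):

```python
RVV_BITS_PER_BLOCK = 64  # mirrors RISCV::RVVBitsPerBlock

def runtime_elements(vlen: int, n: int) -> int:
    """Elements held by a <vscale x n x ty> value on a machine with this VLEN."""
    assert vlen >= 64, "VLEN=32 is not supported"
    vscale = vlen // RVV_BITS_PER_BLOCK  # vscale = VLEN / 64
    return vscale * n

# On a VLEN=128 machine, vscale = 2, so <vscale x 4 x i32> holds 8 elements.
```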

|                  | LMUL=⅛        | LMUL=¼            | LMUL=½            | LMUL=1            | LMUL=2             | LMUL=4             | LMUL=8             |
|------------------|---------------|-------------------|-------------------|-------------------|--------------------|--------------------|--------------------|
| i64 (ELEN=64)    | N/A           | N/A               | N/A               | <v x 1 x i64>     | <v x 2 x i64>      | <v x 4 x i64>      | <v x 8 x i64>      |
| i32              | N/A           | N/A               | <v x 1 x i32>     | <v x 2 x i32>     | <v x 4 x i32>      | <v x 8 x i32>      | <v x 16 x i32>     |
| i16              | N/A           | <v x 1 x i16>     | <v x 2 x i16>     | <v x 4 x i16>     | <v x 8 x i16>      | <v x 16 x i16>     | <v x 32 x i16>     |
| i8               | <v x 1 x i8>  | <v x 2 x i8>      | <v x 4 x i8>      | <v x 8 x i8>      | <v x 16 x i8>      | <v x 32 x i8>      | <v x 64 x i8>      |
| double (ELEN=64) | N/A           | N/A               | N/A               | <v x 1 x double>  | <v x 2 x double>   | <v x 4 x double>   | <v x 8 x double>   |
| float            | N/A           | N/A               | <v x 1 x float>   | <v x 2 x float>   | <v x 4 x float>    | <v x 8 x float>    | <v x 16 x float>   |
| half             | N/A           | <v x 1 x half>    | <v x 2 x half>    | <v x 4 x half>    | <v x 8 x half>     | <v x 16 x half>    | <v x 32 x half>    |
| bfloat           | N/A           | <v x 1 x bfloat>  | <v x 2 x bfloat>  | <v x 4 x bfloat>  | <v x 8 x bfloat>   | <v x 16 x bfloat>  | <v x 32 x bfloat>  |

(Read <v x k x ty> as <vscale x k x ty>)
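The table follows a simple relation: a type with element size SEW under a given LMUL maps to <vscale x n x ty> with n = 64 × LMUL / SEW, and the cell is N/A when that would give less than one element. A sketch of this relation (hypothetical helper name):

```python
from fractions import Fraction

def rvv_type_n(sew: int, lmul: Fraction):
    """Return n in <vscale x n x ty> for a given SEW and LMUL,
    or None where the table reads N/A."""
    n = Fraction(64, sew) * lmul  # one 64-bit block holds 64/SEW elements
    return int(n) if n >= 1 else None

# i32 at LMUL=1/2 gives <vscale x 1 x i32>; i64 at LMUL=1/8 is N/A.
```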

Mask vector types

Mask vectors are physically represented using a layout of densely packed bits in a vector register. They are mapped to the following LLVM IR types:

  • <vscale x 1 x i1>

  • <vscale x 2 x i1>

  • <vscale x 4 x i1>

  • <vscale x 8 x i1>

  • <vscale x 16 x i1>

  • <vscale x 32 x i1>

  • <vscale x 64 x i1>

Two types with the same SEW/LMUL ratio will have the same related mask type. For instance, two different comparisons, one under SEW=64, LMUL=2 and the other under SEW=32, LMUL=1, will both generate a mask of type <vscale x 2 x i1>.
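Since one vector register holds 64 × LMUL / SEW elements per unit of vscale, the mask type for a SEW/LMUL pair can be computed from the ratio alone — a sketch with a hypothetical helper name:

```python
def mask_type_n(sew: int, lmul: float) -> int:
    """Return n in the <vscale x n x i1> mask type for a SEW/LMUL pair."""
    ratio = sew / lmul      # the SEW/LMUL ratio alone determines the mask type
    return int(64 / ratio)  # one 64-bit block packs 64/ratio mask bits

# SEW=64, LMUL=2 and SEW=32, LMUL=1 share ratio 32, so both
# comparisons produce a <vscale x 2 x i1> mask.
```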

Representation in LLVM IR

Vector instructions can be represented in two main ways in LLVM IR:

  1. Regular instructions on both scalable and fixed-length vector types

    %c = add <vscale x 4 x i32> %a, %b
    %f = add <4 x i32> %d, %e
    
  2. RISC-V vector intrinsics, which mirror the C intrinsics specification

    These come in unmasked variants:

    %c = call @llvm.riscv.vadd.nxv4i32.nxv4i32(
           <vscale x 4 x i32> %passthru,
           <vscale x 4 x i32> %a,
           <vscale x 4 x i32> %b,
           i64 %avl
         )
    

    As well as masked variants:

    %c = call @llvm.riscv.vadd.mask.nxv4i32.nxv4i32(
           <vscale x 4 x i32> %passthru,
           <vscale x 4 x i32> %a,
           <vscale x 4 x i32> %b,
           <vscale x 4 x i1> %mask,
           i64 %avl,
           i64 0 ; policy (must be an immediate)
         )
    

    Both allow setting the AVL as well as controlling the inactive/tail elements via the passthru operand, but the masked variant also provides operands for the mask and vta/vma policy bits.

    The only valid types are scalable vector types.

For operations that access memory, trap or otherwise have behaviour which depends on what elements are enabled, the target agnostic llvm.masked.* and llvm.vp.* intrinsics can be used to control the mask and AVL respectively.

Note

Middle-end passes typically do not need to worry about controlling the AVL for most instructions, as RISCVVLOptimizer will automatically take care of reducing the AVL to avoid vsetvli toggles. Using regular LLVM IR instructions allows more generic combines and optimisations to be taken advantage of. For instructions that may access memory or trap etc., passes should use the llvm.vp.* intrinsics to set the AVL where required.

SelectionDAG lowering

For most regular scalable vector LLVM IR instructions, their corresponding SelectionDAG nodes are legal on RISC-V and don’t require any custom lowering.

t5: nxv4i32 = add t2, t4

RISC-V vector intrinsics also don’t require any custom lowering.

t12: nxv4i32 = llvm.riscv.vadd TargetConstant:i64<10056>, undef:nxv4i32, t2, t4, t6

Fixed-length vectors

Because there are no fixed-length vector patterns, operations on fixed-length vectors need to be custom lowered and performed in a scalable “container” type:

  1. The fixed-length vector operands are inserted into scalable containers with insert_subvector nodes. The container type is chosen such that its minimum size will fit the fixed-length vector (see getContainerForFixedLengthVector).

  2. The operation is then performed on the container type via a VL (vector length) node. These are custom nodes defined in RISCVInstrInfoVVLPatterns.td that mirror target agnostic SelectionDAG nodes, as well as some RVV instructions. They contain an AVL operand, which is set to the number of elements in the fixed-length vector. Some nodes also have a passthru or mask operand, which will usually be set to undef and all ones when lowering fixed-length vectors.

  3. The result is put back into a fixed-length vector via extract_subvector.

    t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
    t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
  t4: v4i32 = extract_subvector t2, Constant:i64<0>
  t7: v4i32 = extract_subvector t6, Constant:i64<0>
t8: v4i32 = add t4, t7

// is custom lowered to:

    t2: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %0
    t6: nxv2i32,ch = CopyFromReg t0, Register:nxv2i32 %1
    t15: nxv2i1 = RISCVISD::VMSET_VL Constant:i64<4>
  t16: nxv2i32 = RISCVISD::ADD_VL t2, t6, undef:nxv2i32, t15, Constant:i64<4>
t17: v4i32 = extract_subvector t16, Constant:i64<0>


The insert_subvector and extract_subvector nodes responsible for wrapping and unwrapping will get combined away, and eventually we will lower all fixed-length vector types to scalable. Note that fixed-length vectors at the interface of a function are passed in a scalable vector container.
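The container choice in step 1 can be sketched as follows — a simplified model of getContainerForFixedLengthVector, not the exact implementation, assuming the subtarget guarantees a minimum VLEN (e.g. via Zvl128b):

```python
def container_n(num_elts: int, min_vlen: int) -> int:
    """Smallest n such that <vscale x n x ty>'s minimum size fits a
    fixed-length vector of num_elts elements, given vscale >= min_vlen/64.
    A sketch of getContainerForFixedLengthVector's choice, not LLVM's code."""
    min_vscale = min_vlen // 64
    # Spread the element count across the guaranteed vscale, rounding up.
    return max(1, -(-num_elts // min_vscale))

# With a guaranteed VLEN of 128 (vscale >= 2), a v4i32 is lowered
# in an nxv2i32 container, matching the example above.
```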

Note

The only insert_subvector and extract_subvector nodes that make it through lowering are those that can be performed as an exact subregister insert or extract. This means that any fixed-length vector insert_subvector and extract_subvector nodes that aren’t legalized must lie on a register group boundary, so the exact VLEN must be known at compile time (i.e., compiled with -mrvv-vector-bits=zvl or -mllvm -riscv-v-vector-bits-max=VLEN, or have an exact vscale_range attribute).

Vector predication intrinsics

VP intrinsics also get custom lowered via VL nodes.

t12: nxv2i32 = vp_add t2, t4, t6, Constant:i64<8>

// is custom lowered to:

t18: nxv2i32 = RISCVISD::ADD_VL t2, t4, undef:nxv2i32, t6, Constant:i64<8>

The VP EVL and mask are used for the VL node’s AVL and mask respectively, whilst the passthru is set to undef.

Instruction selection

vl and vtype need to be configured correctly, so we can’t just directly select the underlying vector MachineInstr. Instead pseudo instructions are selected, which carry the extra information needed to emit the necessary vsetvlis later.

%c:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %a:vrm2, %b:vrm2, %vl:gpr, 5 /*sew*/, 3 /*policy*/

Each vector instruction has multiple pseudo instructions defined in RISCVInstrInfoVPseudos.td. There is a variant of each pseudo for each possible LMUL, as well as a masked variant. So a typical instruction like vadd.vv would have the following pseudos:

%rd:vr = PseudoVADD_VV_MF8 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_MF4 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_MF2 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_M1 %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, %avl:gpr, sew:imm, policy:imm
%rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm
%rd:vrm4 = PseudoVADD_VV_M4 %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, %avl:gpr, sew:imm, policy:imm
%rd:vrm8 = PseudoVADD_VV_M8 %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_MF8_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_MF4_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_MF2_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vr = PseudoVADD_VV_M1_MASK %passthru:vr(tied-def 0), %rs2:vr, %rs1:vr, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vrm4 = PseudoVADD_VV_M4_MASK %passthru:vrm4(tied-def 0), %rs2:vrm4, %rs1:vrm4, mask:$v0, %avl:gpr, sew:imm, policy:imm
%rd:vrm8 = PseudoVADD_VV_M8_MASK %passthru:vrm8(tied-def 0), %rs2:vrm8, %rs1:vrm8, mask:$v0, %avl:gpr, sew:imm, policy:imm

Note

Whilst the SEW can be encoded in an operand, we need to use separate pseudos for each LMUL since different register groups will require different register classes: see Register allocation.

Pseudos have operands for the AVL and SEW (encoded as a power of 2), as well as potentially the mask, policy or rounding mode if applicable. The passthru operand is tied to the destination register which will determine the inactive/tail elements.

For scalable vectors that should use VLMAX, the AVL is set to a sentinel value of -1.

There are patterns for target agnostic SelectionDAG nodes in RISCVInstrInfoVSDPatterns.td, VL nodes in RISCVInstrInfoVVLPatterns.td and RVV intrinsics in RISCVInstrInfoVPseudos.td.

Instructions that operate only on masks, like VMAND or VMSBF, use pseudo instructions suffixed with B1, B2, B4, B8, B16, B32, or B64, where the number is the SEW/LMUL ratio required in vtype. These instructions always operate as if EEW=1 and always use a value of 0 as their SEW operand.
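The B-suffix follows directly from the SEW/LMUL ratio of the operation that produced or consumes the mask — a sketch with a hypothetical helper name:

```python
from fractions import Fraction

def mask_pseudo_suffix(sew: int, lmul: Fraction) -> str:
    """Suffix (B1..B64) of the mask pseudo for operands produced under
    a given SEW/LMUL pair."""
    ratio = Fraction(sew) / lmul
    assert ratio.denominator == 1 and 1 <= ratio <= 64
    return f"B{ratio}"

# A vmand.mm combining masks from SEW=32, LMUL=1 comparisons uses a B32 pseudo.
```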

Mask patterns

The patterns in RISCVInstrInfoVVLPatterns.td only match masked pseudos to reduce the size of the match table, even if the node’s mask is all ones and could be an unmasked pseudo. RISCVVectorPeephole::convertToUnmasked will detect if the mask is all ones and convert it into its unmasked form.

%mask:vmv0 = PseudoVMSET_M_B16 -1, 32
%rd:vrm2 = PseudoVADD_VV_M2_MASK %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %mask:vmv0, %avl:gpr, sew:imm, policy:imm

// gets optimized to:

%rd:vrm2 = PseudoVADD_VV_M2 %passthru:vrm2(tied-def 0), %rs2:vrm2, %rs1:vrm2, %avl:gpr, sew:imm, policy:imm

Note

Any vmset.m can be treated as an all ones mask since the tail elements past AVL are undef and can be replaced with ones.

RISCVVLOptimizer

After instruction selection, RISCVVLOptimizer.cpp will reduce the AVL of vector pseudos to only what is demanded by their users. This helps performance on microarchitectures whose performance characteristics depend on vl, and also avoids unnecessary vsetvli toggles.

%x:vr = PseudoVADD_VV_M1 undef, %a:vr, %b:vr, -1 /*avl*/, 5 /*sew*/, 3 /*policy*/
%y:vr = PseudoVADD_VV_M1 undef, %c:vr, %x:vr, -1 /*avl*/, 5 /*sew*/, 3 /*policy*/
PseudoVSE32_V_M1 %y, %addr, 4 /*avl*/, 5 /*sew*/

// gets optimized to:

%x:vr = PseudoVADD_VV_M1 undef, %a:vr, %b:vr, 4 /*avl*/, 5 /*sew*/, 3 /*policy*/
%y:vr = PseudoVADD_VV_M1 undef, %c:vr, %x:vr, 4 /*avl*/, 5 /*sew*/, 3 /*policy*/
PseudoVSE32_V_M1 %y, %addr, 4 /*avl*/, 5 /*sew*/

For a vector pseudo to be considered for AVL optimisation, its underlying instruction must specify that its output doesn’t depend on vl via the ElementsDependOn TSFlag. This flag conservatively defaults to depending on vl, so AVL optimisation is off for any instruction that doesn’t opt in.

VMV0 elimination

Because masked instructions must have the mask register in v0, a specific register class vmv0 is used that contains only one register, v0.

However, register coalescing may end up coalescing copies into vmv0, resulting in instructions with multiple uses of vmv0 that the register allocator can’t allocate:

%x:vrnov0 = PseudoVADD_VV_M1_MASK %0:vrnov0, %1:vr, %2:vmv0, %3:vmv0, ...

To avoid this, RISCVVMV0Elimination replaces any uses of vmv0 with physical copies to v0 before register coalescing and allocation:

%x:vrnov0 = PseudoVADD_VV_M1_MASK %0:vrnov0, %1:vr, %2:vr, %3:vmv0, ...

// vmv0 gets eliminated to:

$v0 = COPY %3:vr
%x:vrnov0 = PseudoVADD_VV_M1_MASK %0:vrnov0, %1:vr, %2:vr, $v0, ...

Register allocation

Register allocation is split between vector and scalar registers, with vector allocation running first:

$v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, %vl:gpr, 5, 3

Note

Register allocation is split so that RISCVInsertVSETVLI can run after vector register allocation, but before scalar register allocation. It needs to be run before scalar register allocation as it may need to create a new virtual register to set the AVL to VLMAX.

Performing RISCVInsertVSETVLI after vector register allocation imposes fewer constraints on the machine scheduler, which cannot schedule instructions past vsetvlis, and allows further vector pseudos to be emitted during spilling or constant rematerialization.

There are four register classes for vectors:

  • VR for vector registers (v0, v1, …, v31). Used for \(\text{LMUL} \leq 1\) types and mask registers.

  • VRM2 for vector groups of length 2 i.e., \(\text{LMUL}=2\) (v0m2, v2m2, …, v30m2)

  • VRM4 for vector groups of length 4 i.e., \(\text{LMUL}=4\) (v0m4, v4m4, …, v28m4)

  • VRM8 for vector groups of length 8 i.e., \(\text{LMUL}=8\) (v0m8, v8m8, …, v24m8)

\(\text{LMUL} \lt 1\) types and mask types do not benefit from having a dedicated class, so VR is used in their case.
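The alignment constraint behind these classes — a register group's base register number must be a multiple of its LMUL — can be sketched as follows (a simplified illustration, not LLVM's actual register class definitions):

```python
def group_bases(lmul: int):
    """Base registers of the vN register groups for an integer LMUL:
    a group's base register number must be a multiple of its LMUL."""
    assert lmul in (1, 2, 4, 8)
    return [f"v{i}" for i in range(0, 32, lmul)]

# VRM2 contains v0m2, v2m2, ..., v30m2: 16 allocatable groups.
```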

Some instructions have a constraint that a register operand cannot be V0 or overlap with V0, so for these cases we also have VRNoV0 variants.

RISCVInsertVSETVLI

After vector registers are allocated, the RISCVInsertVSETVLI pass will insert the necessary vsetvlis for the pseudos.

dead $x0 = PseudoVSETVLI %vl:gpr, 209, implicit-def $vl, implicit-def $vtype
$v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype

The physical $vl and $vtype registers are implicitly defined by the PseudoVSETVLI, and are implicitly used by the PseudoVADD. The vtype operand (209 in this example) is encoded as per the specification via RISCVVType::encodeVTYPE.
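The vtype layout can be decoded by hand from the RVV 1.0 specification's field layout — a sketch mirroring what RISCVVType::encodeVTYPE produces, with a hypothetical helper name (the reserved vlmul encoding 100 is omitted):

```python
def decode_vtype(vtype: int) -> str:
    """Decode a vtype immediate per the RVV 1.0 field layout:
    bits [2:0] vlmul, [5:3] vsew, [6] vta, [7] vma."""
    lmul = {0: "m1", 1: "m2", 2: "m4", 3: "m8",
            5: "mf8", 6: "mf4", 7: "mf2"}[vtype & 0b111]
    sew = f"e{8 << ((vtype >> 3) & 0b111)}"
    ta = "ta" if (vtype >> 6) & 1 else "tu"
    ma = "ma" if (vtype >> 7) & 1 else "mu"
    return f"{sew}, {lmul}, {ta}, {ma}"

# 209 = 0b11010001 decodes to e32, m2, ta, ma.
```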

RISCVInsertVSETVLI performs dataflow analysis to emit as few vsetvlis as possible. It will also try to minimize the number of vsetvlis that set VL, i.e., it will emit vsetvli x0, x0 if only vtype needs to change but vl doesn’t.

Pseudo expansion and printing

After scalar register allocation, the RISCVExpandPseudoInsts.cpp pass expands the PseudoVSETVLI instructions.

dead $x0 = VSETVLI $x1, 209, implicit-def $vtype, implicit-def $vl
renamable $v8m2 = PseudoVADD_VV_M2 $v8m2(tied-def 0), $v8m2, $v10m2, $noreg, 5, implicit $vl, implicit $vtype

Note that the vector pseudo remains as it’s needed to encode the register class for the LMUL. Its AVL and SEW operands are no longer used.

RISCVAsmPrinter will then lower the pseudo instructions into real MCInsts.

vsetvli a0, zero, e32, m2, ta, ma
vadd.vv v8, v8, v10

See also