# Vector unit¶

## Introduction¶

The vector unit is one of the four execution units of VP1. It operates in SIMD manner on 16-element vectors.

### Vector registers¶

The vector unit has 32 vector registers, `\$v0-\$v31`. They are 128 bits wide and are treated as 16 components of 8 bits each. Depending on element, they can be treated as signed or unsigned.

There are also 4 vector condition code registers, `\$vc0-\$vc3`. They are like `\$c` for vector registers - each of them has 16 “sign flag” and 16 “zero flag” bits, one of each per vector component. When read as a 32-word, bits 0-15 are the sign flags and bits 16-31 are the zero flags.

Further, the vector unit has a singular 448-bit vector accumulator register, `\$va`. It is made of 16 components, each of them a 28-bit signed number with 16 fractional bits. It’s used to store intermediate unrounded results of multiply-add computations.

Finally, there’s an extra 128-bit register, `\$vx`, which works quite like the usual `\$v` registers. It’s only read by vlrp4b instructions and written only by special load to vector extra register instructions. The reasons for its existence are unclear.

## Instruction format¶

The instruction word fields used in vector instructions in addition to the ones used in scalar instructions are:

• bit 0: `S2VMODE` - selects how s2v data is used:
• 0: `factor` - s2v data is interpreted as factors
• 1: `mask` - s2v data is interpreted as masks
• bits 0-2: `VCDST` - if < 4, index of `\$vc` register to set according to the instruction’s results. Otherwise, an indication that `\$vc` is not to be written (the canonical value for such case appears to be 7).
• bits 0-1: `VCSRC` - selects `\$vc` input for `vlrp2`
• bit 2: `VCSEL` - the `\$vc` flag selection for `vlrp2`:
• 0: `sf`
• 1: `zf`
• bit 3: `SWZLOHI` - selects how the swizzle selectors are decoded:
• 0: `lo` - bits 0-3 are component selector, bit 4 is source selector
• 1: `hi` - bits 4-7 are component selector, bit 0 is source selector
• bit 3: `FRACTINT` - selects whether the multiplication is considered to be integer or fixed-point:
• 0: `fract`: fixed-point
• 1: `int`: integer
• bit 4: `HILO` - selects which part of multiplication result to read:
• 0: `hi`: high part
• 1: `lo`: low part
• bits 5-7: `SHIFT` - a 3-bit signed immediate, used as an extra right shift factor
• bits 4-8: `SRC3` - the third source `\$v` register.
• bit 9: `ALTRND` - like `RND`, but for different instructions
• bit 9: `SIGNS` - determines if double-interpolation input is signed
• 0: `u` - unsigned
• 1: `s` - signed
• bit 10: `LRP2X` - determines if base input is XORed with 0x80 for `vlrp2`.
• bit 11: `VAWRITE` - determines if `\$va` is written for `vlrp2`.
• bits 11-13: `ALTSHIFT` - a 3-bit signed immediate, used as an extra right shift factor
• bit 12: `SIGND` - determines if double-interpolation output is signed
• 0: `u` - unsigned
• 1: `s` - signed
• bits 19-22: `CMPOP`: selects the bit operation to perform on comparison result and previous flag value

### Opcodes¶

The opcode range assigned to the vector unit is `0x80-0xbf`. The opcodes are:

## Multiplication, accumulation, and rounding¶

The most advanced vector instructions involve multiplication and the vector accumulator. The vector unit has two multipliers (signed 10-bit * 10-bit -> signed 20-bit) and three wide adders (performing 28-bit addition): the first two add the multiplication results, and the third adds a rounding correction. In other words, it can compute A + (B * C << S) + (D * E << S) + R, where A is 28-bit input, B, C, D, E are signed 10-bit inputs, S is either 0 or 8, and R is the rounding correction, determined from the readout parameters. The B, C, D, E inputs can in turn be computed from other inputs using one of the narrower ALUs.

The A input can come from the vector accumulator, be fixed to 0, or come from a vector register component shifted by some shift amount. The shift amount, if used, is the inverse of the shift amount used by the readout process.

There are three things that can happen to the result of the multiply-accumulate calculations:

• written in its entirety to the vector accumulator
• shifted, rounded, clipped, and written to a vector register
• both of the above

The vector register readout process takes the following parameters:

• sign: whether the result should be unsigned or signed
• fract/int selection: if int, the multiplication is considered to be done on integers, and the 16-bit result is at bits 8-23 of the value added to the accumulator (ie. S is 8). Otherwise, the multiplication is performed as if the inputs were fractions (unsigned with 8 fractional bits, signed with 7), and the results are aligned so that bits 16-27 of the accumulator are integer part, and 0-15 are fractional part.
• hi/lo selection: selects whether high or low 8 bits of the results are read. For integers, the result is treated as 16-bit integer. For fractions, the high part is either an unsigned fixed-point number with 8 fractional bits, or a signed number with 7 fractional bits, and the low part is always 8 bits lower than the high part.
• a right shift, in range of -4..3: the result is shifted right by that amount before readout (as usual, negative means left shift).
• rounding mode: either round down, or round to nearest. If round to nearest is selected, a configuration bit in `\$uccfg` register selects if ties are rounded up or down (to accomodate video codecs which switch that on frame basis).

First, any inputs from vector registers are read, converted as signed or unsigned integers, and normalized if needed:

```def mad_input(val, fractint, isign):
if isign == 'u':
return val & 0xff
else:
if fractint == 'int':
return sext(val, 7)
else:
return sext(val, 7) << 1
```

The readout shift factor is determined as follows:

```def mad_shift(fractint, sign, shift):
if fractint == 'int':
return 16 - shift
elif sign == 'u':
return 8 - shift
elif sign == 's':
return 9 - shift
```

If A is taken from a vector register, it’s expanded as follows:

```def mad_expand(val, fractint, sign, shift):
return val << mad_shift(fractint, sign, shift)
```

The actual multiply-add process works like that:

```def mad(a, b, c, d, e, rnd, fractint, sign, shift, hilo):
res = a

if fractint == 'fract':
res += b * c + d * e
else:
res += (b * c + d * e) << 8

# rounding correction
if rnd == 'rn':

# determine the final readout shift
if hilo == 'lo':
rshift = mad_shift(fractint, sign, shift) - 8
else:

# only add rounding correction if there's going to be an actual
# right shift
if rshift > 0:
res += 1 << (rshift - 1)
if \$uccfg.tiernd == 'down':
res -= 1

# the accumulator is only 28 bits long, and it wraps
return sext(res, 27)
```

```def mad_read(val, fractint, sign, shift, hilo):
# first, shift it to the position
rshift = mad_shift(fractint, sign, shift) - 8
if rshift >= 0:
res = val >> rshift
else:
res = val << -rshift

# second, clip to 16-bit signed or unsigned
if sign == 'u':
if res < 0:
res = 0
if res > 0xffff:
res = 0xffff
else:
if res < -0x8000:
res = -0x8000
if res > 0x7fff:
res = 0x7fff

# finally, extract high/low part of the final result
if hilo == 'hi':
return res >> 8 & 0xff
else:
return res & 0xff
```

Note that high/low selection, apart from actual result readout, also affects the rounding computation. This means that, if rounding is desired and the full 16-bit result is to be read, the low part should be read first with rounding (which will add the rounding correction to the accumulator) and then the high part should be read without rounding (since the rounding correction is already applied).

## Instructions¶

### Move: mov¶

Copies one register to another. `\$vc` output supported for zero flag only.

Instructions:
Instruction Operands Opcode
`mov` `[\$vc[VCDST]] \$v[DST] \$v[SRC1]` `0xba`
Operation:
```for idx in range(16):
\$v[DST][idx] = \$v[SRC1][idx]
if VCDST < 4:
\$vc[VCDST].sf[idx] = 0
\$vc[VCDST].zf[idx] = \$v[DST][idx] == 0
```

### Move immediate: vmov¶

Loads an 8-bit immediate to each component of destination. `\$vc` output is fully supported, with sign flag set to bit 7 of the value.

Instructions:
Instruction Operands Opcode
`vmov` `[\$vc[VCDST]] \$v[DST] BIMM` `0xad`
Operation:
```for idx in range(16):
\$v[DST][idx] = BIMM
if VCDST < 4:
\$vc[VCDST].sf[idx] = BIMM >> 7 & 1
\$vc[VCDST].zf[idx] = BIMM == 0
```

### Move from \$vc: mov¶

Reads the contents of all `\$vc` registers to a selected vector register. Bytes 0-3 correspond to `\$vc0`, bytes 4-7 to `\$vc1`, and so on. The sign flags are in bytes 0-1, and the zero flags are in bytes 2-3.

Instructions:
Instruction Operands Opcode
`mov` `\$v[DST] \$vc` `0xbb`
Operation:
```for idx in range(4):
\$v[DST][idx * 4] = \$vc[idx].sf & 0xff;
\$v[DST][idx * 4 + 1] = \$vc[idx].sf >> 8 & 0xff;
\$v[DST][idx * 4 + 2] = \$vc[idx].zf & 0xff;
\$v[DST][idx * 4 + 3] = \$vc[idx].zf >> 8 & 0xff;
```

### Swizzle: vswz¶

Performs a swizzle, also known as a shuffle: builds a result vector from arbitrarily selected components of two input vectors. There are three source vectors: sources 1 and 2 supply the data to be used, while source 3 selects the mapping of output vector components to input vector components. Each component of source 3 consists of source selector and component selector. They select the source (1 or 2) and its component that will be used as the corresponding component of the result.

Instructions:
Instruction Operands Opcode
`vswz` `SWZLOHI \$v[DST] \$v[SRC1] \$v[SRC2] \$v[SRC3]` `0x9b`
Operation:
```for idx in range(16):
# read the component and source selectors
if SWZLOHI == 'lo':
comp = \$v[SRC3][idx] & 0xf
src = \$v[SRC3][idx] >> 4 & 1
else:
comp = \$v[SRC3][idx] >> 4 & 0xf
src = \$v[SRC3][idx] & 1

# read the source & component
if src == 0:
\$v[DST][idx] = \$v[SRC1][comp]
else:
\$v[DST][idx] = \$v[SRC2][comp]
```

### Simple arithmetic operations: vmin, vmax, vabs, vneg, vadd, vsub¶

Those perform the corresponding operation (minumum, maximum, absolute value, negation, addition, substraction) in SIMD manner on 8-bit signed or unsigned numbers from one or two sources. Source 1 is always a register selected by `SRC1` bitfield. Source 2, if it is used (ie. instruction is not `vabs` nor `vneg`), is either a register selected by `SRC2` bitfield, or immediate taken from `BIMM` bitfield.

Most of these instructions come in signed and unsigned variants and both perform result clipping. The exception is `vneg`, which only has a signed version. Note that `vabs` is rather uninteresting in its unsigned variant (it’s just the identity function). Note that `vsub` lacks a signed version with immediat: it can be replaced with `vadd` with negated immediate.

`\$vc` output is fully supported. For signed variants, the sign flag output is the sign of the result. For unsigned variants, the sign flag is used as an overflow flag: it’s set if the true unclipped result is not in `0..0xff` range.

Instructions:
Instruction Operands Opcode
`vmin s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x88`
`vmax s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x89`
`vabs s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1]` `0x8a`
`vneg s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1]` `0x8b`
`vadd s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x8c`
`vsub s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x8d`
`vmin u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x98`
`vmax u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x99`
`vabs u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1]` `0x9a`
`vadd u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x9c`
`vsub u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x9d`
`vmin s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xa8`
`vmax s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xa9`
`vadd s` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xac`
`vmin u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xb8`
`vmax u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xb9`
`vadd u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xbc`
`vsub u` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xbd`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]
if opcode & 0x20:
s2 = BIMM
else:
s2 = \$v[SRC2][idx]

if opcode & 0x10:
# unsigned
s1 &= 0xff
s2 &= 0xff
else:
# signed
s1 = sext(s1, 7)
s2 = sext(s2, 7)

if op == 'vmin':
res = min(s1, s2)
elif op == 'vmax':
res = max(s1, s2)
elif op == 'vabs':
res = abs(s1)
elif op == 'vneg':
res = -s1
res = s1 + s2
elif op == 'vsub':
res = s1 - s2

sf = 0
if opcode & 0x10:
# unsigned: clip to 0..0xff
if res < 0:
res = 0
sf = 1
if res > 0xff:
res = 0xff
sf = 1
else:
# signed: clip to -0x80..0x7f
if res < 0:
sf = 1
if res < -0x80:
res = -0x80
if res > 0x7f:
res = 0x7f

\$v[DST][idx] = res

if VCDST < 4:
\$vc[VCDST].sf[idx] = sf
\$vc[VCDST].zf[idx] = res == 0
```

### Clip to range: vclip¶

Performs a SIMD range clipping operation: first source is the value to clip, second and third sources are the range endpoints. Or, equivalently, calculates the median of three inputs. `\$vc` output is supported, with the sign flag set if clipping was performed (value equal to range endpoint is considered to be clipped) or the range is improper (second endpoint not larger than the first). All inputs are treated as signed.

Instructions:
Instruction Operands Opcode
`vclip` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2] \$v[SRC3]` `0xa4`
Operation:
```for idx in range(16):
s1 = sext(\$v[SRC1][idx], 7)
s2 = sext(\$v[SRC2][idx], 7)
s3 = sext(\$v[SRC3][idx], 7)

sf = 0

# determine endpoints
if s2 < s3:
# proper order
start = s2
end = s3
else:
# reverse order
start = s3
end = s2
sf = 1

# and clip
res = s1
if res <= start:
res = start
sf = 1
if res >= end:
res = end
sf = 1

\$v[DST][idx] = res

if VCDST < 4:
\$vc[VCDST].sf[idx] = sf
\$vc[VCDST].zf[idx] = res == 0
```

### Minimum of absolute values: vminabs¶

Performs `min(abs(a), abs(b))`. Both inputs are treated as signed. `\$vc` output is supported for zero flag only. The result is clipped to `0..0x7f` range (which only matters if both inputs are `-0x80`).

Instructions:
Instruction Operands Opcode
`vminabs` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0xa5`
Operation:
```for idx in range(16):
s1 = sext(\$v[SRC1][idx], 7)
s2 = sext(\$v[SRC2][idx], 7)

res = min(abs(s1, s2))

if res > 0x7f:
res = 0x7f

\$v[DST][idx] = res

if VCDST < 4:
\$vc[VCDST].sf[idx] = 0
\$vc[VCDST].zf[idx] = res == 0
```

Performs an 8-bit unsigned + 9-bit signed addition (ie. exactly what’s needed for motion compensation). The first source provides the 8-bit inputs, while the second and third are uniquely treated as vectors of 8 16-bit components (of which only low 9 are actually used). Second source provides components 0-7, and third provides 8-15. The result is unsigned and clipped. `\$vc` output is supported, with sign flag set to 1 if the true result was out of 8-bit unsigned range.

Instructions:
Instruction Operands Opcode
`vadd9` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2] \$v[SRC3]` `0x9f`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]

if idx < 8:
# 0-7: SRC2
s2l = \$v[SRC2][idx * 2]
s2h = \$v[SRC2][idx * 2 + 1]
else:
# 8-15: SRC3
s2l = \$v[SRC3][(idx - 8) * 2]
s2h = \$v[SRC3][(idx - 8) * 2 + 1]

# read as 9-bit signed number
s2 = sext(s2h << 8 | s2l, 8)

res = s1 + s2

# clip
sf = 0
if res > 0xff:
sf = 1
res = 0xff
if res < 0:
sf = 1
res = 0

\$v[DST][idx] = res

if VCDST < 4:
\$vc[VCDST].sf[idx] = sf
\$vc[VCDST].zf[idx] = res == 0
```

### Compare with absolute difference: vcmpad¶

This instruction performs the following operations:

• substract source 1.1 from source 2
• take the absolute value of the difference
• compare the result with source 1.2
• if equal, set zero flag of selected `\$vc` output
• set sign flag of `\$vc` output to an arbitrary bitwise operation of s2v `\$vc` input and “less than” comparison result

All inputs are treated as unsigned. If s2v scalar instruction is not used together with this instruction, `\$vc` input defaults to sign flag of the `\$vc` register selected as output, with no transformation.

This instruction has two sources: source 1 is a register pair, while source 2 is a single register. The second register of the pair is selected by ORing 1 to the index of the first register of the pair. Source 2 is selected by mangled field `SRC2S`.

Instructions:
Instruction Operands Opcode
`vcmppad` `CMPOP [\$vc[VCDST]] \$v[SRC1]d \$v[SRC2S]` `0x8f`
Operation:
```if s2v.vcsel.valid:
else:
vcin = \$vc[VCDST & 3].sf

for idx in range(16):
other = \$v[SRC1 | 1][idx]

if VCDST < 4:
\$vc[VCDST].sf[idx] = sf
\$vc[VCDST].zf[idx] = ad == bitop(CMPOP, vcin >> idx & 1, ad < other)
```

### Bit operations: vbitop¶

Performs an arbitrary two-input bit operation on two registers. `\$vc` output supported for zero flag only.

Instructions:
Instruction Operands Opcode
`vbitop` `BITOP [\$vc[CDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x94`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]
s2 = \$v[SRC2][idx]

res = bitop(BITOP, s2, s1) & 0xff

\$v[DST][idx] = res
if VCDST < 4:
\$vc[VCDST].sf[idx] = 0
\$vc[VCDST].zf[idx] = res == 0
```

### Bit operations with immediate: vand, vor, vxor¶

Performs a given bitwise operation on a register and an 8-bit immediate replicated for each component. `\$vc` output supported for zero flag only.

Instructions:
Instruction Operands Opcode
`vand` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xaa`
`vxor` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xab`
`vor` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xaf`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]

if op == 'vand':
res = s1 & BIMM
elif op == 'vxor':
res = s1 ^ BIMM
elif op == 'vor':
res = s1 | BIMM

\$v[DST][idx] = res
if VCDST < 4:
\$vc[VCDST].sf[idx] = 0
\$vc[VCDST].zf[idx] = res == 0
```

### Shift operations: vshr, vsar¶

Performs a SIMD right shift, like the scalar bytewise shift instruction. `\$vc` output is fully supported, with bit 7 of the result used as the sign flag.

Instructions:
Instruction Operands Opcode
`vsar` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x8e`
`vshr` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] \$v[SRC2]` `0x9e`
`vsar` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xae`
`vshr` `[\$vc[VCDST]] \$v[DST] \$v[SRC1] BIMM` `0xbe`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]
if opcode & 0x20:
s2 = BIMM
else:
s2 = \$v[SRC2][idx]

if opcode & 0x10:
# unsigned
s1 &= 0xff
else:
# signed
s1 = sext(s1, 7)

shift = sext(s2, 3)

if shift < 0:
res = s1 << -shift
else:
res = s1 >> shift

\$v[DST][idx] = res

if VCDST < 4:
\$vc[VCDST].sf[idx] = res >> 7 & 1
\$vc[VCDST].zf[idx] = res == 0
```

### Linear interpolation: vlrp¶

A SIMD linear interpolation instruction. Takes two sources: a register pair containing the two values to interpolate, and a register containing the interpolation factor. The result is basically ```SRC1.1 * (SRC2 >> SHIFT) + SRC1.2 * (1 - (SRC2 >> SHIFT))```. All inputs are unsigned fractions.

Instructions:
Instruction Operands Opcode
`vlrp` `RND SHIFT \$v[DST] \$v[SRC1]d \$v[SRC2]` `0x90`
Operation:
```for idx in range(16):
val1 = \$v[SRC1][idx]
val2 = \$v[SRC1 | 1][idx]
a = mad_expand(val2, 'fract', 'u', SHIFT)
res = mad(a, val1 - val2, \$v[SRC2][idx], 0, 0, RND, 'fract', 'u', SHIFT, 'hi')
```

### Multiply and multiply with accumulate: vmul, vmac¶

Performs a simple multiplication of two sources (but with the full set of weird options available). The result is either added to the vector accumulator (`vmac`) or replaces it (`vmul`). The result can additionally be read to a vector register, but doesn’t have to be.

The instructions come in many variants: they can store the result in a vector register or not, have unsigned or signed output, and register or immediate second source. The set of available combinations is incomplete, however: while the `\$v`-writing variants have all combinations available, there are no unsigned variants of register-register `vmul` with no `\$v` write, nor unsigned register-immediate `vmac` with no `\$v` write. Also, unsigned register-immediate `vmul` with no `\$v` output is a bad opcode.

Instructions:
Instruction Operands Opcode
`vmul s` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x80`
`vmul s` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xa0`
`vmul u` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 BIMMBAD` `0xb0` (bad opcode)
`vmul s` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x81`
`vmul u` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x91`
`vmul s` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xa1`
`vmul u` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xb1`
`vmac s` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x82`
`vmac u` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x92`
`vmac s` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xa2`
`vmac u` `RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xb2`
`vmac s` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x83`
`vmac u` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 \$v[SRC2]` `0x93`
`vmac s` `RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] SIGN2 BIMMMUL` `0xa3`
Operation:
```for idx in range(16):
s1 = \$v[SRC1][idx]
if opcode & 0x20:
if op == 0x30:
else:
s2 = BIMMMUL << 2
else:
s2 = \$v[SRC2][idx]

# convert inputs

# do the computation
if op == 'vmac':
a = \$va[idx]
else:
a = 0
res = mad(a, s1, s2, 0, 0, RND, FRACTINT, op.sign, SHIFT, HILO)

# write result
\$va[idx] = res
if DST is not None:
```

Performs two multiplications and adds the result to a given source or to the vector accumulator. The result is written to the vector accumulator and can also be written to a `\$v` register. For each multiplication, one input is a register source, and the other is s2v factor. The register sources for the multiplications are a register pair. The s2v sources for the multiplications are either s2v factors (one factor from each pair is selected according to s2v `\$vc` input) or 0/1 as decided by s2v mask.

The instructions come in signed and unsigned variants. Apart from some bad opcodes (which overlay `SRC3` with mad param fields), only `\$v` writing versions have unsigned variants.

Instructions:
Instruction Operands Opcode
`vmad2 s` `S2VMODE RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1]d SIGN2 \$v[SRC2]` `0x84`
`vmad2 s` `S2VMODE RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1]d SIGN2 \$v[SRC2]` `0x85`
`vmad2 u` `S2VMODE RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1]d SIGN2 \$v[SRC2]` `0x95`
`vmac2 s` `S2VMODE RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1]d` `0x86`
`vmac2 u` `S2VMODE RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] \$v[SRC3]` `0x96` (bad opcode)
`vmac2 s` `S2VMODE RND FRACTINT SHIFT HILO # SIGN1 \$v[SRC1] \$v[SRC3]` `0xa6` (bad opcode)
`vmac2 s` `S2VMODE RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1]d` `0x87`
`vmac2 u` `S2VMODE RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1]d` `0x97`
`vmac2 s` `S2VMODE RND FRACTINT SHIFT HILO \$v[DST] SIGN1 \$v[SRC1] \$v[SRC3]` `0xa7` (bad opcode)
Operation:
```for idx in range(16):
s11 = \$v[SRC1][idx]
if opcode in (0x96, 0xa6, 0xa7):
# one of the bad opcodes
s12 = \$v[SRC3][idx]
else:
s12 = \$v[SRC1 | 1][idx]

s2 = \$v[SRC2][idx]

# convert inputs

# prepare A value
a = mad_expand(s2, FRACTINT, sign, SHIFT)
else:
a = \$va[idx]

# prepare factors
if s2v.mask & 1 << idx:
f1 = 0x100
else:
f1 = 0
if s2v.mask & 1 << idx:
f2 = 0x100
else:
f2 = 0
else:
# 'factor'
cc = s2v.vcmask >> idx & 1
f1 = s2v.factor[0 | cc]
f2 = s2v.factor[2 | cc]

# do the operation
res = mad(a, s11, f1, s12, f2, RND, FRACTINT, sign, SHIFT, HILO)

# write result
\$va[idx] = res
if DST is not None:
```

### Dual linear interpolation: vlrp2¶

This instruction performs the following steps:

• read a quad register source selected by `SRC1`
• rotate the source quad by the amount selected by bits 4-5 of a selected `\$c` register
• for each component:
• treat register 0 of the quad as function value at (0, 0)
• treat register 2 as value at (1, 0)
• treat register 3 as value at (0, 1)
• select a pair of factors from s2v input based on selected flag of selected `\$vc` register
• treat the factors as a coordinate pair and interpolate function value at these coordinates
• write result to `\$v` register and optionally `\$va`

The inputs and outputs may be signed or unsigned. A shift and rounding mode can be selected. Additionally, there’s an option to XOR register 0 with 0x80 before use as the base value (but not for the differences used in interpolation). Don’t ask me.

Instructions:
Instruction Operands Opcode
`vlrp2` `SIGND VAWRITE RND SHIFT \$v[DST] SIGNS LRP2X \$v[SRC1]q \$c[COND] \$vc[VCSRC] VCSEL` `0xb3`
Operation:
```# a function selecting the factors
def get_lrp2_factors(idx):
if VCSEL == 'sf':
else:

cc = vcmask >> idx & 1;
f1 = s2v.factor[0 | cc]
f2 = s2v.factor[2 | cc]

return f1, f2

# determine rotation
rot = \$c[COND] >> 4 & 3

for idx in range(16):
# read inputs, maybe do the xor
s10x = s10 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx]
s12 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx]
s13 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx]
if LRP2X:
s10x ^= 0x80

# convert inputs if necessary

# do it
a = mad_expand(s10x, 'fract', SIGND, SHIFT)
f1, f2 = get_lrp2_factors(idx)
res = mad(a, s12 - s10, f1, s13 - s10, f2, RND, 'fract', SIGND, SHIFT, 'hi')

# write outputs
if VAWRITE:
\$va[idx] = res
```

### Quad linear interpolation, part 1: vlrp4a¶

Works like the previous variant, but only outputs to `\$va` and lacks some flags. Both outputs and inputs are unsigned.

Instructions:
Instruction Operands Opcode
`vlrp4a` `RND SHIFT # \$v[SRC1]q \$c[COND] \$vc[VCSRC] VCSEL` `0xb4`
Operation:
```rot = \$c[COND] >> 4 & 3

for idx in range(16):

s10 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx]
s12 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx]
s13 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx]

a = mad_expand(s10, 'fract', 'u', SHIFT)
f1, f2 = get_lrp2_factors(idx)

\$va[idx] = mad(a, s12 - s10, f1, s13 - s10, f2, RND, 'fract', 'u', SHIFT, 'lo')
```

### Factor linear interpolation: vlrpf¶

Has similiar input processing to `vlrp2`, but instead uses source 1 registers 2 and 3 to interpolate s2v input. Result is ```SRC2 + SRC1.2 * F1 + SRC1.3 * (F2 - F1)```.

Instructions:
Instruction Operands Opcode
`vlrpf` `RND SHIFT # \$v[SRC1]q \$c[COND] \$v[SRC2] \$vc[VCSRC] VCSEL` `0xb5`
Operation:
```rot = \$c[COND] >> 4 & 3

for idx in range(16):

s12 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx]
s13 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx]
s2 = sext(\$v[SRC2][idx], 7)

a = mad_expand(s2, 'fract', 'u', SHIFT)
f1, f2 = get_lrp2_factors(idx)

\$va[idx] = mad(a, s12 - s13, f1, s13, f2, RND, 'fract', 'u', SHIFT, 'lo')
```

### Quad linear interpolation, part 2: vlrp4b¶

Can be used together with `vlrp4a` for quad linear interpolation. First s2v factor is the interpolation coefficient for register 1, and second factor is the interpolation coefficient for the extra register (`\$vx`).

Alternatively, can be coupled with `vlrpf`.

Instructions:
Instruction Operands Opcode
`vlrp4b u` `ALTRND ALTSHIFT \$v[DST] \$v[SRC1]q \$c[COND] SLCT \$vc[VCSRC] VCSEL` `0xb6`
`vlrp4b s` `ALTRND ALTSHIFT \$v[DST] \$v[SRC1]q \$c[COND] SLCT \$vc[VCSRC] VCSEL` `0xb7`
Operation:
```for idx in range(16):
if SLCT == 4:
rot = \$c[COND] >> 4 & 3
s10 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx]
s11 = \$v[(SRC1 & 0x1c) | ((SRC1 + rot + 1) & 3)][idx]
else:
adjust = \$c[COND] >> SLCT & 1
s10 = s11 = \$v[src1 ^ adjust][idx]

f1, f2 = get_lrp2_factors(idx)

res = mad(\$va[idx], s11 - s10, f1, \$vx[idx] - s10, f2, ALTRND, 'fract', op.sign, ALTSHIFT, 'hi')

\$va[idx] = res