# Vector unit¶

Contents

- Vector unit
- Introduction
- Instruction format
- Multiplication, accumulation, and rounding
- Instructions
- Move: mov
- Move immediate: vmov
- Move from $vc: mov
- Swizzle: vswz
- Simple arithmetic operations: vmin, vmax, vabs, vneg, vadd, vsub
- Clip to range: vclip
- Minimum of absolute values: vminabs
- Add 9-bit: vadd9
- Compare with absolute difference: vcmpad
- Bit operations: vbitop
- Bit operations with immediate: vand, vor, vxor
- Shift operations: vshr, vsar
- Linear interpolation: vlrp
- Multiply and multiply with accumulate: vmul, vmac
- Dual multiply and add/accumulate: vmac2, vmad2
- Dual linear interpolation: vlrp2
- Quad linear interpolation, part 1: vlrp4a
- Factor linear interpolation: vlrpf
- Quad linear interpolation, part 2: vlrp4b

## Introduction¶

The vector unit is one of the four execution units of VP1. It operates in SIMD manner on 16-element vectors.

### Vector registers¶

The vector unit has 32 vector registers, `$v0-$v31`

. They are 128 bits wide
and are treated as 16 components of 8 bits each. Depending on element, they
can be treated as signed or unsigned.

There are also 4 vector condition code registers, `$vc0-$vc3`

. They are like
`$c`

for vector registers - each of them has 16 “sign flag” and 16 “zero flag”
bits, one of each per vector component. When read as a 32-word, bits 0-15 are
the sign flags and bits 16-31 are the zero flags.

Further, the vector unit has a singular 448-bit vector accumulator register,
`$va`

. It is made of 16 components, each of them a 28-bit signed number
with 16 fractional bits. It’s used to store intermediate unrounded results
of multiply-add computations.

Finally, there’s an extra 128-bit register, `$vx`

, which works quite like
the usual `$v`

registers. It’s only read by vlrp4b
instructions and written only by special load to vector extra register instructions. The reasons for its existence are unclear.

## Instruction format¶

The instruction word fields used in vector instructions in addition to the ones used in scalar instructions are:

- bit 0:
`S2VMODE`

- selects how s2v data is used:- 0:
`factor`

- s2v data is interpreted as factors - 1:
`mask`

- s2v data is interpreted as masks

- 0:
- bits 0-2:
`VCDST`

- if < 4, index of`$vc`

register to set according to the instruction’s results. Otherwise, an indication that`$vc`

is not to be written (the canonical value for such case appears to be 7). - bits 0-1:
`VCSRC`

- selects`$vc`

input for`vlrp2`

- bit 2:
`VCSEL`

- the`$vc`

flag selection for`vlrp2`

:- 0:
`sf`

- 1:
`zf`

- 0:
- bit 3:
`SWZLOHI`

- selects how the swizzle selectors are decoded:- 0:
`lo`

- bits 0-3 are component selector, bit 4 is source selector - 1:
`hi`

- bits 4-7 are component selector, bit 0 is source selector

- 0:
- bit 3:
`FRACTINT`

- selects whether the multiplication is considered to be integer or fixed-point:- 0:
`fract`

: fixed-point - 1:
`int`

: integer

- 0:
- bit 4:
`HILO`

- selects which part of multiplication result to read:- 0:
`hi`

: high part - 1:
`lo`

: low part

- 0:
- bits 5-7:
`SHIFT`

- a 3-bit signed immediate, used as an extra right shift factor - bits 4-8:
`SRC3`

- the third source`$v`

register. - bit 9:
`ALTRND`

- like`RND`

, but for different instructions - bit 9:
`SIGNS`

- determines if double-interpolation input is signed- 0:
`u`

- unsigned - 1:
`s`

- signed

- 0:
- bit 10:
`LRP2X`

- determines if base input is XORed with 0x80 for`vlrp2`

. - bit 11:
`VAWRITE`

- determines if`$va`

is written for`vlrp2`

. - bits 11-13:
`ALTSHIFT`

- a 3-bit signed immediate, used as an extra right shift factor - bit 12:
`SIGND`

- determines if double-interpolation output is signed- 0:
`u`

- unsigned - 1:
`s`

- signed

- 0:
- bits 19-22:
`CMPOP`

: selects the bit operation to perform on comparison result and previous flag value

### Opcodes¶

The opcode range assigned to the vector unit is `0x80-0xbf`

. The opcodes
are:

`0x80`

,`0xa0`

,`0xb0`

,`0x81`

,`0x91`

,`0xa1`

,`0xb1`

: multiplication: vmul`0x90`

: linear interpolation: vlrp`0x82`

,`0x92`

,`0xa2`

,`0xb2`

,`0x83`

,`0x93`

,`0xa3`

: multiplication with accumulation: vmac`0x84`

,`0x85`

,`0x95`

: dual multiplication with accumulation: vmac2`0x86`

,`0x87`

,`0x97`

: dual multiplication with addition: vmad2`0x96`

,`0xa6`

,`0xa7`

: dual multiplication with addition: vmad2 (bad opcode)`0x94`

: bitwise operation: vbitop`0xa4`

: clip to range: vclip`0xa5`

: minimum of absolute values: vminabs`0xb3`

: dual linear interpolation: vlrp2`0xb4`

: quad linear interpolation, part 1: vlrp4a`0xb5`

: factor linear interpolation: vlrpf`0xb6`

,`0xb7`

: quad linear interpolation, part 2: vlrp4b`0x88`

,`0x98`

,`0xa8`

,`0xb8`

: minimum: vmin`0x89`

,`0x99`

,`0xa9`

,`0xb9`

: maximum: vmax`0x8a`

,`0x9a`

: absolute value: vabs`0xaa`

: immediate and: vand`0xba`

: move: mov`0x8b`

: negation: vneg`0x9b`

: swizzle: vswz`0xab`

: immediate xor: vxor`0xbb`

: move from $vc: mov`0x8c`

,`0x9c`

,`0xac`

,`0xbc`

: addition: vadd`0x8d`

,`0x9d`

,`0xbd`

: substraction: vsub`0xad`

: move immediate: vmov`0x8e`

,`0x9e`

,`0xae`

,`0xbe`

: shift: vshr, vsar`0x8f`

: compare with absolute difference: vcmpad`0x9f`

: add 9-bit: vadd9`0xaf`

: immediate or: vor`0xbf`

: the canonical vector nop opcode

## Multiplication, accumulation, and rounding¶

The most advanced vector instructions involve multiplication and the vector accumulator. The vector unit has two multipliers (signed 10-bit * 10-bit -> signed 20-bit) and three wide adders (performing 28-bit addition): the first two add the multiplication results, and the third adds a rounding correction. In other words, it can compute A + (B * C << S) + (D * E << S) + R, where A is 28-bit input, B, C, D, E are signed 10-bit inputs, S is either 0 or 8, and R is the rounding correction, determined from the readout parameters. The B, C, D, E inputs can in turn be computed from other inputs using one of the narrower ALUs.

The A input can come from the vector accumulator, be fixed to 0, or come from a vector register component shifted by some shift amount. The shift amount, if used, is the inverse of the shift amount used by the readout process.

There are three things that can happen to the result of the multiply-accumulate calculations:

- written in its entirety to the vector accumulator
- shifted, rounded, clipped, and written to a vector register
- both of the above

The vector register readout process takes the following parameters:

- sign: whether the result should be unsigned or signed
- fract/int selection: if int, the multiplication is considered to be done on integers, and the 16-bit result is at bits 8-23 of the value added to the accumulator (ie. S is 8). Otherwise, the multiplication is performed as if the inputs were fractions (unsigned with 8 fractional bits, signed with 7), and the results are aligned so that bits 16-27 of the accumulator are integer part, and 0-15 are fractional part.
- hi/lo selection: selects whether high or low 8 bits of the results are read. For integers, the result is treated as 16-bit integer. For fractions, the high part is either an unsigned fixed-point number with 8 fractional bits, or a signed number with 7 fractional bits, and the low part is always 8 bits lower than the high part.
- a right shift, in range of -4..3: the result is shifted right by that amount before readout (as usual, negative means left shift).
- rounding mode: either round down, or round to nearest. If round to nearest
is selected, a configuration bit in
`$uccfg`

register selects if ties are rounded up or down (to accomodate video codecs which switch that on frame basis).

First, any inputs from vector registers are read, converted as signed or unsigned integers, and normalized if needed:

```
def mad_input(val, fractint, isign):
if isign == 'u':
return val & 0xff
else:
if fractint == 'int':
return sext(val, 7)
else:
return sext(val, 7) << 1
```

The readout shift factor is determined as follows:

```
def mad_shift(fractint, sign, shift):
if fractint == 'int':
return 16 - shift
elif sign == 'u':
return 8 - shift
elif sign == 's':
return 9 - shift
```

If A is taken from a vector register, it’s expanded as follows:

```
def mad_expand(val, fractint, sign, shift):
return val << mad_shift(fractint, sign, shift)
```

The actual multiply-add process works like that:

```
def mad(a, b, c, d, e, rnd, fractint, sign, shift, hilo):
res = a
if fractint == 'fract':
res += b * c + d * e
else:
res += (b * c + d * e) << 8
# rounding correction
if rnd == 'rn':
# determine the final readout shift
if hilo == 'lo':
rshift = mad_shift(fractint, sign, shift) - 8
else:
rshift = mad_shift(fractint, sign, shift)
# only add rounding correction if there's going to be an actual
# right shift
if rshift > 0:
res += 1 << (rshift - 1)
if $uccfg.tiernd == 'down':
res -= 1
# the accumulator is only 28 bits long, and it wraps
return sext(res, 27)
```

And the readout process is:

```
def mad_read(val, fractint, sign, shift, hilo):
# first, shift it to the position
rshift = mad_shift(fractint, sign, shift) - 8
if rshift >= 0:
res = val >> rshift
else:
res = val << -rshift
# second, clip to 16-bit signed or unsigned
if sign == 'u':
if res < 0:
res = 0
if res > 0xffff:
res = 0xffff
else:
if res < -0x8000:
res = -0x8000
if res > 0x7fff:
res = 0x7fff
# finally, extract high/low part of the final result
if hilo == 'hi':
return res >> 8 & 0xff
else:
return res & 0xff
```

Note that high/low selection, apart from actual result readout, also affects the rounding computation. This means that, if rounding is desired and the full 16-bit result is to be read, the low part should be read first with rounding (which will add the rounding correction to the accumulator) and then the high part should be read without rounding (since the rounding correction is already applied).

## Instructions¶

### Move: mov¶

Copies one register to another. `$vc`

output supported for zero flag only.

- Instructions:
Instruction Operands Opcode `mov`

`[$vc[VCDST]] $v[DST] $v[SRC1]`

`0xba`

- Operation:
for idx in range(16): $v[DST][idx] = $v[SRC1][idx] if VCDST < 4: $vc[VCDST].sf[idx] = 0 $vc[VCDST].zf[idx] = $v[DST][idx] == 0

### Move immediate: vmov¶

Loads an 8-bit immediate to each component of destination. `$vc`

output
is fully supported, with sign flag set to bit 7 of the value.

- Instructions:
Instruction Operands Opcode `vmov`

`[$vc[VCDST]] $v[DST] BIMM`

`0xad`

- Operation:
for idx in range(16): $v[DST][idx] = BIMM if VCDST < 4: $vc[VCDST].sf[idx] = BIMM >> 7 & 1 $vc[VCDST].zf[idx] = BIMM == 0

### Move from $vc: mov¶

Reads the contents of all `$vc`

registers to a selected vector register.
Bytes 0-3 correspond to `$vc0`

, bytes 4-7 to `$vc1`

, and so on. The sign
flags are in bytes 0-1, and the zero flags are in bytes 2-3.

- Instructions:
Instruction Operands Opcode `mov`

`$v[DST] $vc`

`0xbb`

- Operation:
for idx in range(4): $v[DST][idx * 4] = $vc[idx].sf & 0xff; $v[DST][idx * 4 + 1] = $vc[idx].sf >> 8 & 0xff; $v[DST][idx * 4 + 2] = $vc[idx].zf & 0xff; $v[DST][idx * 4 + 3] = $vc[idx].zf >> 8 & 0xff;

### Swizzle: vswz¶

Performs a swizzle, also known as a shuffle: builds a result vector from arbitrarily selected components of two input vectors. There are three source vectors: sources 1 and 2 supply the data to be used, while source 3 selects the mapping of output vector components to input vector components. Each component of source 3 consists of source selector and component selector. They select the source (1 or 2) and its component that will be used as the corresponding component of the result.

- Instructions:
Instruction Operands Opcode `vswz`

`SWZLOHI $v[DST] $v[SRC1] $v[SRC2] $v[SRC3]`

`0x9b`

- Operation:
for idx in range(16): # read the component and source selectors if SWZLOHI == 'lo': comp = $v[SRC3][idx] & 0xf src = $v[SRC3][idx] >> 4 & 1 else: comp = $v[SRC3][idx] >> 4 & 0xf src = $v[SRC3][idx] & 1 # read the source & component if src == 0: $v[DST][idx] = $v[SRC1][comp] else: $v[DST][idx] = $v[SRC2][comp]

### Simple arithmetic operations: vmin, vmax, vabs, vneg, vadd, vsub¶

Those perform the corresponding operation (minumum, maximum, absolute value,
negation, addition, substraction) in SIMD manner on 8-bit signed or unsigned
numbers from one or two sources. Source 1 is always a register selected by
`SRC1`

bitfield. Source 2, if it is used (ie. instruction is not `vabs`

nor `vneg`

), is either a register selected by `SRC2`

bitfield, or
immediate taken from `BIMM`

bitfield.

Most of these instructions come in signed and unsigned variants and both
perform result clipping. The exception is `vneg`

, which only has a signed
version. Note that `vabs`

is rather uninteresting in its unsigned variant
(it’s just the identity function). Note that `vsub`

lacks a signed version
with immediat: it can be replaced with `vadd`

with negated immediate.

`$vc`

output is fully supported. For signed variants, the sign flag output
is the sign of the result. For unsigned variants, the sign flag is used as
an overflow flag: it’s set if the true unclipped result is not in `0..0xff`

range.

- Instructions:
Instruction Operands Opcode `vmin s`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x88`

`vmax s`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x89`

`vabs s`

`[$vc[VCDST]] $v[DST] $v[SRC1]`

`0x8a`

`vneg s`

`[$vc[VCDST]] $v[DST] $v[SRC1]`

`0x8b`

`vadd s`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x8c`

`vsub s`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x8d`

`vmin u`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x98`

`vmax u`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x99`

`vabs u`

`[$vc[VCDST]] $v[DST] $v[SRC1]`

`0x9a`

`vadd u`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x9c`

`vsub u`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x9d`

`vmin s`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xa8`

`vmax s`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xa9`

`vadd s`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xac`

`vmin u`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xb8`

`vmax u`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xb9`

`vadd u`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xbc`

`vsub u`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xbd`

- Operation:
for idx in range(16): s1 = $v[SRC1][idx] if opcode & 0x20: s2 = BIMM else: s2 = $v[SRC2][idx] if opcode & 0x10: # unsigned s1 &= 0xff s2 &= 0xff else: # signed s1 = sext(s1, 7) s2 = sext(s2, 7) if op == 'vmin': res = min(s1, s2) elif op == 'vmax': res = max(s1, s2) elif op == 'vabs': res = abs(s1) elif op == 'vneg': res = -s1 elif op == 'vadd': res = s1 + s2 elif op == 'vsub': res = s1 - s2 sf = 0 if opcode & 0x10: # unsigned: clip to 0..0xff if res < 0: res = 0 sf = 1 if res > 0xff: res = 0xff sf = 1 else: # signed: clip to -0x80..0x7f if res < 0: sf = 1 if res < -0x80: res = -0x80 if res > 0x7f: res = 0x7f $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = sf $vc[VCDST].zf[idx] = res == 0

### Clip to range: vclip¶

Performs a SIMD range clipping operation: first source is the value to clip,
second and third sources are the range endpoints. Or, equivalently,
calculates the median of three inputs. `$vc`

output is supported, with
the sign flag set if clipping was performed (value equal to range endpoint
is considered to be clipped) or the range is improper (second endpoint not
larger than the first). All inputs are treated as signed.

- Instructions:
Instruction Operands Opcode `vclip`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2] $v[SRC3]`

`0xa4`

- Operation:
for idx in range(16): s1 = sext($v[SRC1][idx], 7) s2 = sext($v[SRC2][idx], 7) s3 = sext($v[SRC3][idx], 7) sf = 0 # determine endpoints if s2 < s3: # proper order start = s2 end = s3 else: # reverse order start = s3 end = s2 sf = 1 # and clip res = s1 if res <= start: res = start sf = 1 if res >= end: res = end sf = 1 $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = sf $vc[VCDST].zf[idx] = res == 0

### Minimum of absolute values: vminabs¶

Performs `min(abs(a), abs(b))`

. Both inputs are treated as signed.
`$vc`

output is supported for zero flag only. The result is clipped
to `0..0x7f`

range (which only matters if both inputs are `-0x80`

).

- Instructions:
Instruction Operands Opcode `vminabs`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0xa5`

- Operation:
for idx in range(16): s1 = sext($v[SRC1][idx], 7) s2 = sext($v[SRC2][idx], 7) res = min(abs(s1, s2)) if res > 0x7f: res = 0x7f $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = 0 $vc[VCDST].zf[idx] = res == 0

### Add 9-bit: vadd9¶

Performs an 8-bit unsigned + 9-bit signed addition (ie. exactly what’s needed
for motion compensation). The first source provides the 8-bit inputs, while
the second and third are uniquely treated as vectors of 8 16-bit components
(of which only low 9 are actually used). Second source provides components
0-7, and third provides 8-15. The result is unsigned and clipped. `$vc`

output is supported, with sign flag set to 1 if the true result was out of
8-bit unsigned range.

- Instructions:
Instruction Operands Opcode `vadd9`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2] $v[SRC3]`

`0x9f`

- Operation:
for idx in range(16): # read source 1 s1 = $v[SRC1][idx] if idx < 8: # 0-7: SRC2 s2l = $v[SRC2][idx * 2] s2h = $v[SRC2][idx * 2 + 1] else: # 8-15: SRC3 s2l = $v[SRC3][(idx - 8) * 2] s2h = $v[SRC3][(idx - 8) * 2 + 1] # read as 9-bit signed number s2 = sext(s2h << 8 | s2l, 8) # add res = s1 + s2 # clip sf = 0 if res > 0xff: sf = 1 res = 0xff if res < 0: sf = 1 res = 0 $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = sf $vc[VCDST].zf[idx] = res == 0

### Compare with absolute difference: vcmpad¶

This instruction performs the following operations:

- substract source 1.1 from source 2
- take the absolute value of the difference
- compare the result with source 1.2
- if equal, set zero flag of selected
`$vc`

output - set sign flag of
`$vc`

output to an arbitrary bitwise operation of s2v`$vc`

input and “less than” comparison result

All inputs are treated as unsigned. If s2v scalar instruction is not used
together with this instruction, `$vc`

input defaults to sign flag of
the `$vc`

register selected as output, with no transformation.

This instruction has two sources: source 1 is a register pair, while source 2
is a single register. The second register of the pair is selected by ORing
1 to the index of the first register of the pair. Source 2 is selected by
mangled field `SRC2S`

.

- Instructions:
Instruction Operands Opcode `vcmppad`

`CMPOP [$vc[VCDST]] $v[SRC1]d $v[SRC2S]`

`0x8f`

- Operation:
if s2v.vcsel.valid: vcin = s2v.vcmask else: vcin = $vc[VCDST & 3].sf for idx in range(16): ad = abs($v[SRC2S][idx] - $v[SRC1][idx]) other = $v[SRC1 | 1][idx] if VCDST < 4: $vc[VCDST].sf[idx] = sf $vc[VCDST].zf[idx] = ad == bitop(CMPOP, vcin >> idx & 1, ad < other)

### Bit operations: vbitop¶

Performs an arbitrary two-input bit operation on two registers.
`$vc`

output supported for zero flag only.

- Instructions:
Instruction Operands Opcode `vbitop`

`BITOP [$vc[CDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x94`

- Operation:
for idx in range(16): s1 = $v[SRC1][idx] s2 = $v[SRC2][idx] res = bitop(BITOP, s2, s1) & 0xff $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = 0 $vc[VCDST].zf[idx] = res == 0

### Bit operations with immediate: vand, vor, vxor¶

Performs a given bitwise operation on a register and an 8-bit immediate
replicated for each component. `$vc`

output supported for zero flag only.

- Instructions:
Instruction Operands Opcode `vand`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xaa`

`vxor`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xab`

`vor`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xaf`

- Operation:
for idx in range(16): s1 = $v[SRC1][idx] if op == 'vand': res = s1 & BIMM elif op == 'vxor': res = s1 ^ BIMM elif op == 'vor': res = s1 | BIMM $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = 0 $vc[VCDST].zf[idx] = res == 0

### Shift operations: vshr, vsar¶

Performs a SIMD right shift, like the scalar bytewise shift instruction. `$vc`

output is fully supported, with bit 7 of the
result used as the sign flag.

- Instructions:
Instruction Operands Opcode `vsar`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x8e`

`vshr`

`[$vc[VCDST]] $v[DST] $v[SRC1] $v[SRC2]`

`0x9e`

`vsar`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xae`

`vshr`

`[$vc[VCDST]] $v[DST] $v[SRC1] BIMM`

`0xbe`

- Operation:
for idx in range(16): s1 = $v[SRC1][idx] if opcode & 0x20: s2 = BIMM else: s2 = $v[SRC2][idx] if opcode & 0x10: # unsigned s1 &= 0xff else: # signed s1 = sext(s1, 7) shift = sext(s2, 3) if shift < 0: res = s1 << -shift else: res = s1 >> shift $v[DST][idx] = res if VCDST < 4: $vc[VCDST].sf[idx] = res >> 7 & 1 $vc[VCDST].zf[idx] = res == 0

### Linear interpolation: vlrp¶

A SIMD linear interpolation instruction. Takes two sources: a register pair
containing the two values to interpolate, and a register containing the
interpolation factor. The result is basically ```
SRC1.1 * (SRC2 >> SHIFT) +
SRC1.2 * (1 - (SRC2 >> SHIFT))
```

. All inputs are unsigned fractions.

- Instructions:
Instruction Operands Opcode `vlrp`

`RND SHIFT $v[DST] $v[SRC1]d $v[SRC2]`

`0x90`

- Operation:
for idx in range(16): val1 = $v[SRC1][idx] val2 = $v[SRC1 | 1][idx] a = mad_expand(val2, 'fract', 'u', SHIFT) res = mad(a, val1 - val2, $v[SRC2][idx], 0, 0, RND, 'fract', 'u', SHIFT, 'hi') $v[DST][idx] = mad_read(res, 'fract', 'u', SHIFT, 'hi')

### Multiply and multiply with accumulate: vmul, vmac¶

Performs a simple multiplication of two sources (but with the full set of weird
options available). The result is either added to the vector accumulator
(`vmac`

) or replaces it (`vmul`

). The result can additionally be read
to a vector register, but doesn’t have to be.

The instructions come in many variants: they can store the result in a vector
register or not, have unsigned or signed output, and register or immediate
second source. The set of available combinations is incomplete, however:
while the `$v`

-writing variants have all combinations available, there are
no unsigned variants of register-register `vmul`

with no `$v`

write, nor
unsigned register-immediate `vmac`

with no `$v`

write. Also, unsigned
register-immediate `vmul`

with no `$v`

output is a bad opcode.

- Instructions:
Instruction Operands Opcode `vmul s`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x80`

`vmul s`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xa0`

`vmul u`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 BIMMBAD`

`0xb0`

(bad opcode)`vmul s`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x81`

`vmul u`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x91`

`vmul s`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xa1`

`vmul u`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xb1`

`vmac s`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x82`

`vmac u`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x92`

`vmac s`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xa2`

`vmac u`

`RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xb2`

`vmac s`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x83`

`vmac u`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 $v[SRC2]`

`0x93`

`vmac s`

`RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] SIGN2 BIMMMUL`

`0xa3`

- Operation:
for idx in range(16): # read inputs s1 = $v[SRC1][idx] if opcode & 0x20: if op == 0x30: s2 = BIMMBAD else: s2 = BIMMMUL << 2 else: s2 = $v[SRC2][idx] # convert inputs s1 = mad_input(s1, FRACTINT, SIGN1) s2 = mad_input(s2, FRACTINT, SIGN2) # do the computation if op == 'vmac': a = $va[idx] else: a = 0 res = mad(a, s1, s2, 0, 0, RND, FRACTINT, op.sign, SHIFT, HILO) # write result $va[idx] = res if DST is not None: $v[DST][idx] = mad_read(res, FRACTINT, op.sign, SHIFT, HILO)

### Dual multiply and add/accumulate: vmac2, vmad2¶

Performs two multiplications and adds the result to a given source or to
the vector accumulator. The result is written to the vector accumulator
and can also be written to a `$v`

register. For each multiplication,
one input is a register source, and the other is s2v factor. The register
sources for the multiplications are a register pair. The s2v sources
for the multiplications are either s2v factors (one factor from each pair
is selected according to s2v `$vc`

input) or 0/1 as decided by s2v
mask.

The instructions come in signed and unsigned variants. Apart from some
bad opcodes (which overlay `SRC3`

with mad param fields), only `$v`

writing versions have unsigned variants.

- Instructions:
Instruction Operands Opcode `vmad2 s`

`S2VMODE RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1]d SIGN2 $v[SRC2]`

`0x84`

`vmad2 s`

`S2VMODE RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1]d SIGN2 $v[SRC2]`

`0x85`

`vmad2 u`

`S2VMODE RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1]d SIGN2 $v[SRC2]`

`0x95`

`vmac2 s`

`S2VMODE RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1]d`

`0x86`

`vmac2 u`

`S2VMODE RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] $v[SRC3]`

`0x96`

(bad opcode)`vmac2 s`

`S2VMODE RND FRACTINT SHIFT HILO # SIGN1 $v[SRC1] $v[SRC3]`

`0xa6`

(bad opcode)`vmac2 s`

`S2VMODE RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1]d`

`0x87`

`vmac2 u`

`S2VMODE RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1]d`

`0x97`

`vmac2 s`

`S2VMODE RND FRACTINT SHIFT HILO $v[DST] SIGN1 $v[SRC1] $v[SRC3]`

`0xa7`

(bad opcode)- Operation:
for idx in range(16): # read inputs s11 = $v[SRC1][idx] if opcode in (0x96, 0xa6, 0xa7): # one of the bad opcodes s12 = $v[SRC3][idx] else: s12 = $v[SRC1 | 1][idx] s2 = $v[SRC2][idx] # convert inputs s11 = mad_input(s11, FRACTINT, SIGN1) s12 = mad_input(s12, FRACTINT, SIGN1) s2 = mad_input(s2, FRACTINT, SIGN2) # prepare A value if op == 'vmad2': a = mad_expand(s2, FRACTINT, sign, SHIFT) else: a = $va[idx] # prepare factors if S2VMODE == 'mask': if s2v.mask[0] & 1 << idx: f1 = 0x100 else: f1 = 0 if s2v.mask[1] & 1 << idx: f2 = 0x100 else: f2 = 0 else: # 'factor' cc = s2v.vcmask >> idx & 1 f1 = s2v.factor[0 | cc] f2 = s2v.factor[2 | cc] # do the operation res = mad(a, s11, f1, s12, f2, RND, FRACTINT, sign, SHIFT, HILO) # write result $va[idx] = res if DST is not None: $v[DST][idx] = mad_read(res, FRACTINT, op.sign, SHIFT, HILO)

### Dual linear interpolation: vlrp2¶

This instruction performs the following steps:

- read a quad register source selected by
`SRC1`

- rotate the source quad by the amount selected by bits 4-5 of a selected
`$c`

register - for each component:
- treat register 0 of the quad as function value at (0, 0)
- treat register 2 as value at (1, 0)
- treat register 3 as value at (0, 1)
- select a pair of factors from s2v input based on selected flag of selected
`$vc`

register - treat the factors as a coordinate pair and interpolate function value at these coordinates
- write result to
`$v`

register and optionally`$va`

The inputs and outputs may be signed or unsigned. A shift and rounding mode can be selected. Additionally, there’s an option to XOR register 0 with 0x80 before use as the base value (but not for the differences used in interpolation). Don’t ask me.

- Instructions:
Instruction Operands Opcode `vlrp2`

`SIGND VAWRITE RND SHIFT $v[DST] SIGNS LRP2X $v[SRC1]q $c[COND] $vc[VCSRC] VCSEL`

`0xb3`

- Operation:
# a function selecting the factors def get_lrp2_factors(idx): if VCSEL == 'sf': vcmask = $vc[VCSRC].sf else: vcmask = $vc[VCSRC].zf cc = vcmask >> idx & 1; f1 = s2v.factor[0 | cc] f2 = s2v.factor[2 | cc] return f1, f2 # determine rotation rot = $c[COND] >> 4 & 3 for idx in range(16): # read inputs, maybe do the xor s10x = s10 = $v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx] s12 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx] s13 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx] if LRP2X: s10x ^= 0x80 # convert inputs if necessary s10 = mad_input(s10, 'fract', SIGNS) s12 = mad_input(s12, 'fract', SIGNS) s13 = mad_input(s13, 'fract', SIGNS) s10x = mad_input(s10x, 'fract', SIGNS) # do it a = mad_expand(s10x, 'fract', SIGND, SHIFT) f1, f2 = get_lrp2_factors(idx) res = mad(a, s12 - s10, f1, s13 - s10, f2, RND, 'fract', SIGND, SHIFT, 'hi') # write outputs if VAWRITE: $va[idx] = res $v[DST][idx] = mad_read(res, 'fract', SIGND, SHIFT, 'hi')

### Quad linear interpolation, part 1: vlrp4a¶

Works like the previous variant, but only outputs to `$va`

and lacks some
flags. Both outputs and inputs are unsigned.

- Instructions:
Instruction Operands Opcode `vlrp4a`

`RND SHIFT # $v[SRC1]q $c[COND] $vc[VCSRC] VCSEL`

`0xb4`

- Operation:
rot = $c[COND] >> 4 & 3 for idx in range(16): s10 = $v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx] s12 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx] s13 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx] a = mad_expand(s10, 'fract', 'u', SHIFT) f1, f2 = get_lrp2_factors(idx) $va[idx] = mad(a, s12 - s10, f1, s13 - s10, f2, RND, 'fract', 'u', SHIFT, 'lo')

### Factor linear interpolation: vlrpf¶

Has similiar input processing to `vlrp2`

, but instead uses source 1 registers
2 and 3 to interpolate s2v input. Result is ```
SRC2 + SRC1.2 * F1 + SRC1.3 *
(F2 - F1)
```

.

- Instructions:
Instruction Operands Opcode `vlrpf`

`RND SHIFT # $v[SRC1]q $c[COND] $v[SRC2] $vc[VCSRC] VCSEL`

`0xb5`

- Operation:
rot = $c[COND] >> 4 & 3 for idx in range(16): s12 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 2) & 3)][idx] s13 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 3) & 3)][idx] s2 = sext($v[SRC2][idx], 7) a = mad_expand(s2, 'fract', 'u', SHIFT) f1, f2 = get_lrp2_factors(idx) $va[idx] = mad(a, s12 - s13, f1, s13, f2, RND, 'fract', 'u', SHIFT, 'lo')

### Quad linear interpolation, part 2: vlrp4b¶

Can be used together with `vlrp4a`

for quad linear interpolation. First
s2v factor is the interpolation coefficient for register 1, and second factor
is the interpolation coefficient for the extra register (`$vx`

).

Alternatively, can be coupled with `vlrpf`

.

- Instructions:
Instruction Operands Opcode `vlrp4b u`

`ALTRND ALTSHIFT $v[DST] $v[SRC1]q $c[COND] SLCT $vc[VCSRC] VCSEL`

`0xb6`

`vlrp4b s`

`ALTRND ALTSHIFT $v[DST] $v[SRC1]q $c[COND] SLCT $vc[VCSRC] VCSEL`

`0xb7`

- Operation:
for idx in range(16): if SLCT == 4: rot = $c[COND] >> 4 & 3 s10 = $v[(SRC1 & 0x1c) | ((SRC1 + rot) & 3)][idx] s11 = $v[(SRC1 & 0x1c) | ((SRC1 + rot + 1) & 3)][idx] else: adjust = $c[COND] >> SLCT & 1 s10 = s11 = $v[src1 ^ adjust][idx] f1, f2 = get_lrp2_factors(idx) res = mad($va[idx], s11 - s10, f1, $vx[idx] - s10, f2, ALTRND, 'fract', op.sign, ALTSHIFT, 'hi') $va[idx] = res $v[DST][idx] = mad_read(res, 'fract', op.sign, ALTSHIFT, 'hi')