Scalar unit ¶

Introduction ¶

The scalar unit is one of the four execution units of VP1. It is used for general-purpose arithmetic.

Scalar registers ¶

The scalar unit has 31 GPRs, $r0-$r30. They are 32 bits wide, and are usually used as 32-bit integers, but there are also SIMD instructions treating them as arrays of 4 bytes. In such cases, array notation is used to denote the individual bytes. Bits 0-7 are considered to be $rX[0], bits 8-15 are $rX[1] and so on. $r31 is a special register hardwired to 0.

There are also 8 bits in each $c register belonging to the scalar unit. Most scalar instructions can (if requested) set these bits according to the computation result. The bits are:

bit 0: sign flag - set equal to bit 31 of the result
bit 1: zero flag - set if the result is 0
bit 2: b19 flag - set equal to bit 19 of the result
bit 3: b20 difference flag - set if bit 20 of the result is different from bit 20 of the first source
bit 4: b20 flag - set equal to bit 20 of the result
bit 5: b21 flag - set equal to bit 21 of the result
bit 6: alt b19 flag (G80 only) - set equal to bit 19 of the result
bit 7: b18 flag (G80 only) - set equal to bit 18 of the result

The purpose of the last 6 bits is so far unknown.
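
As a reading aid, the flag layout above can be turned into a small decoder. This is only a sketch: `SCALAR_FLAG_NAMES` and `decode_scalar_flags` are names made up here, and the input is assumed to be just the 8-bit scalar flag value, with bit numbering as in the list above.

```python
# Bit positions follow the list above; the short names are illustrative only.
SCALAR_FLAG_NAMES = ("sign", "zero", "b19", "b20_diff", "b20", "b21", "alt_b19", "b18")


def decode_scalar_flags(flags):
    """Return the names of the flags set in an 8-bit scalar flag value."""
    return [name for bit, name in enumerate(SCALAR_FLAG_NAMES) if flags >> bit & 1]


# bits 0, 3 and 4 set: negative result whose bit 20 is set and differs from source 1
example = decode_scalar_flags(0x19)
```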

Scalar to vector data bus ¶

In addition to performing computations of its own, the scalar unit is also used in tandem with the vector unit to perform complex instructions. Certain scalar opcodes expose data on the so-called s2v path (scalar to vector data bus), and certain vector opcodes consume this data.

The data is ephemeral and only exists during the execution of a single bundle - the producing and consuming instructions must be located in the same bundle. If a consuming instruction is used without a producing instruction, it’ll read junk. If a producing instruction is used without a consuming instruction, the data is discarded.

The s2v data consists of:

4 signed 10-bit factors, used for multiplication
$vc selection and transformation, for use as mask input in vector unit, made of:
- valid flag: 1 if s2v data was emitted by proper s2v-emitting instruction (if false, vector unit will use an alternate source not involving s2v)
- 2-bit $vc register index
- 1-bit zero flag or sign flag selection (selects which half of $vc will be used)
- 3-bit transform mode: used to mangle the $vc value before use as mask

The factors can alternatively be treated as two 16-bit masks by some instructions. In that case, mask 0 consists of bits 1-8 of factor 0, then bits 1-8 of factor 1 and mask 1 likewise consists of bits 1-8 of factors 2 and 3:

s2v.mask[0] = (s2v.factor[0] >> 1 & 0xff) | (s2v.factor[1] >> 1 & 0xff) << 8
s2v.mask[1] = (s2v.factor[2] >> 1 & 0xff) | (s2v.factor[3] >> 1 & 0xff) << 8
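
The mask formula can be checked with a few lines of runnable Python. This is a sketch, assuming the factors are held as ordinary Python integers (negative values allowed); `s2v_masks` is a hypothetical helper name.

```python
def s2v_masks(factors):
    """Derive the two 16-bit masks from the four s2v factors."""
    def byte(factor):
        # bits 1-8 of the factor; Python's >> is arithmetic, so negatives work too
        return factor >> 1 & 0xff

    mask0 = byte(factors[0]) | byte(factors[1]) << 8
    mask1 = byte(factors[2]) | byte(factors[3]) << 8
    return mask0, mask1


# factor 0 has bits 1-4 set, factor 1 has bits 5-8 set
masks = s2v_masks([0x1e, 0x1e0, 0, 0])
```

Bit 0 and bit 9 of each factor never reach the masks, which is why only 8 bits per factor survive.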

The $vc based mask is derived as follows:

def xfrm(val, tab):
    res = 0
    for idx in range(16):
        # bit x of result is set if bit tab[x] of input is set
        if val & 1 << tab[idx]:
            res |= 1 << idx
    return res

val = $vc[s2v.vcsel.idx]
# val2 is only used for transform mode 7
val2 = $vc[s2v.vcsel.idx | 1]

if s2v.vcsel.flag == 'sf':
    val = val & 0xffff
    val2 = val2 & 0xffff
else: # 'zf'
    val = val >> 16 & 0xffff
    val2 = val2 >> 16 & 0xffff

if s2v.vcsel.xfrm == 0:
    # passthrough
    s2v.vcmask = val
elif s2v.vcsel.xfrm == 1:
    s2v.vcmask = xfrm(val, [2,  2,  2,  2,  6,  6,  6,  6, 10, 10, 10, 10, 14, 14, 14, 14])
elif s2v.vcsel.xfrm == 2:
    s2v.vcmask = xfrm(val, [4,  5,  4,  5,  4,  5,  4,  5, 12, 13, 12, 13, 12, 13, 12, 13])
elif s2v.vcsel.xfrm == 3:
    s2v.vcmask = xfrm(val, [0,  0,  2,  0,  4,  4,  6,  4,  8,  8, 10,  8, 12, 12, 14, 12])
elif s2v.vcsel.xfrm == 4:
    s2v.vcmask = xfrm(val, [1,  1,  1,  3,  5,  5,  5,  7,  9,  9,  9, 11, 13, 13, 13, 15])
elif s2v.vcsel.xfrm == 5:
    s2v.vcmask = xfrm(val, [0,  0,  2,  2,  4,  4,  6,  6,  8,  8, 10, 10, 12, 12, 14, 14])
elif s2v.vcsel.xfrm == 6:
    s2v.vcmask = xfrm(val, [1,  1,  1,  1,  5,  5,  5,  5,  9,  9,  9,  9, 13, 13, 13, 13])
elif s2v.vcsel.xfrm == 7:
    # mode 7 is special: it uses two $vc inputs and takes every second bit
    s2v.vcmask = xfrm(val | val2 << 16, [0,  2,  4,  6,  8, 10, 12, 14, 16, 18, 20, 22, 24, 26, 28, 30])
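
To get a feel for what the transform tables do, here is a runnable sketch of `xfrm` applied with three of the tables above. The constant names `MODE1`, `MODE5` and `MODE7` are made up for this example; the table contents are copied from the pseudocode.

```python
def xfrm(val, tab):
    """Bit idx of the result is set if bit tab[idx] of val is set."""
    res = 0
    for idx, src_bit in enumerate(tab):
        if val >> src_bit & 1:
            res |= 1 << idx
    return res


MODE1 = [2] * 4 + [6] * 4 + [10] * 4 + [14] * 4
MODE5 = [0, 0, 2, 2, 4, 4, 6, 6, 8, 8, 10, 10, 12, 12, 14, 14]
MODE7 = list(range(0, 32, 2))

# mode 1 broadcasts bit 2 of each 4-bit group to the whole group
group = xfrm(0x0004, MODE1)
# mode 5 duplicates every even bit into the odd bit above it
pairs = xfrm(0x0005, MODE5)
# mode 7 packs the even bits of two 16-bit inputs into one 16-bit mask
packed = xfrm(0x5555 | 0x0001 << 16, MODE7)
```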

Instruction format ¶

The instruction word fields used in scalar instructions are:

bits 0-2: CDST - if < 4, index of the $c register to set according to the instruction’s result. Otherwise, an indication that $c is not to be written (nVidia appears to use 7 in such case).
bits 0-7: BIMMBAD - an immediate field used only in bad opcodes
bits 0-18: IMM19 - a signed 19-bit immediate field used only by the mov instruction
bits 0-15: IMM16 - a 16-bit immediate field used only by the sethi instruction
bits 1-9: FACTOR1 - a 9-bit signed immediate used as vector factor
bits 10-18: FACTOR2 - a 9-bit signed immediate used as vector factor
bit 1: SIGN2 - determines if byte multiplication source 2 is signed
- 0: u - unsigned
- 1: s - signed
bit 2: SIGN1 - likewise for source 1
bits 3-10: BIMM - an 8-bit immediate for bytewise operations, signed or unsigned depending on instruction.
bits 3-13: IMM - signed 13-bit immediate.
bits 3-6: BITOP: selects the bit operation to perform
bits 3-7: RFILE: selects the other register file for mov to/from other register file
bits 3-4: COND - if source mangling is used, the $c register index to use for source mangling.
bits 5-8: SLCT - if source mangling is used, the condition to use for source mangling.
bit 8: RND - determines byte multiplication rounding behaviour
- 0: rd - round down
- 1: rn - round to nearest, ties rounding up
bits 9-13: SRC2 - the second source $r register, often mangled via source mangling.
bits 9-13 (low 5 bits) and bit 0 (high bit): BIMMMUL - a 6-bit immediate for bytewise multiplication, signed or unsigned depending on instruction.
bits 14-18: SRC1 - the first source $r register.
bits 19-23: DST - the destination $r register.
bits 19-20: VCIDX - the $vc register index for s2v
bit 21: VCFLAG - the $vc flag selection for s2v:
- 0: sf
- 1: zf
bits 22-23 (low part) and 0 (high part): VCXFRM - the $vc transformation for s2v
bits 24-31: OP - the opcode.
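
A field extractor for some of these bitfields might look like the sketch below. `bits`, `sext` and `decode_scalar` are hypothetical helpers; `sext(val, n)` is assumed to treat bit `n` as the sign bit, which is how the pseudocode on this page appears to use it. Only a subset of the fields is decoded, and overlapping fields are all returned since their meaning depends on the opcode.

```python
def bits(word, lo, hi):
    """Extract bits lo..hi (inclusive) of word as an unsigned value."""
    return word >> lo & ((1 << (hi - lo + 1)) - 1)


def sext(val, sign_bit):
    """Sign-extend val, treating bit number sign_bit as the sign bit."""
    val &= (1 << (sign_bit + 1)) - 1
    return val - (1 << (sign_bit + 1)) if val >> sign_bit & 1 else val


def decode_scalar(word):
    """Decode a few scalar instruction fields from a 32-bit instruction word."""
    return {
        "OP": bits(word, 24, 31),
        "DST": bits(word, 19, 23),
        "SRC1": bits(word, 14, 18),
        "SRC2": bits(word, 9, 13),
        "CDST": bits(word, 0, 2),
        "IMM19": sext(bits(word, 0, 18), 18),
        # split fields: low part from the middle of the word, high bit from bit 0
        "BIMMMUL": bits(word, 9, 13) | bits(word, 0, 0) << 5,
        "VCXFRM": bits(word, 22, 23) | bits(word, 0, 0) << 2,
    }


# mov $r5, -1: opcode 0x65, DST 5, IMM19 all ones
mov_fields = decode_scalar(0x65 << 24 | 5 << 19 | 0x7ffff)
```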

Opcodes ¶

The opcode range assigned to the scalar unit is 0x00-0x7f. The opcodes are:

0x01, 0x11, 0x21, 0x31: bytewise multiplication: bmul
0x02, 0x12, 0x22, 0x32: bytewise multiplication: bmul (bad opcode)
0x04: s2v multiply/add/send: bvecmad
0x24: s2v immediate send: vec
0x05: s2v multiply/add/select/send: bvecmadsel
0x25: bytewise immediate and: band
0x26: bytewise immediate or: bor
0x27: bytewise immediate xor: bxor
0x08, 0x18, 0x28, 0x38: bytewise minimum: bmin
0x09, 0x19, 0x29, 0x39: bytewise maximum: bmax
0x0a, 0x1a, 0x2a, 0x3a: bytewise absolute value: babs
0x0b, 0x1b, 0x2b, 0x3b: bytewise negate: bneg
0x0c, 0x1c, 0x2c, 0x3c: bytewise addition: badd
0x0d, 0x1d, 0x2d, 0x3d: bytewise subtraction: bsub
0x0e, 0x1e, 0x2e, 0x3e: bytewise shift: bshr, bsar
0x0f: s2v send: bvec
0x41, 0x51, 0x61, 0x71: 16-bit multiplication: mul
0x42: bitwise operation: bitop
0x62: immediate and: and
0x63: immediate xor: xor
0x64: immediate or: or
0x45: s2v 4-bit mask send and shift: vecms
0x65: load immediate: mov
0x75: set high bits immediate: sethi
0x6a: mov to other register file: mov
0x6b: mov from other register file: mov
0x48, 0x58, 0x68, 0x78: minimum: min
0x49, 0x59, 0x69, 0x79: maximum: max
0x4a, 0x5a, 0x7a: absolute value: abs
0x4b, 0x5b, 0x7b: negation: neg
0x4c, 0x5c, 0x6c, 0x7c: addition: add
0x4d, 0x5d, 0x6d, 0x7d: subtraction: sub
0x4e, 0x5e, 0x6e, 0x7e: shift: shr, sar
0x4f: the canonical scalar nop opcode

Todo

some unused opcodes clear $c, some don’t

Bad opcodes ¶

Some of the VP1 instructions look like they’re either buggy or just unintended artifacts of incomplete decoding hardware. These are known as bad opcodes and are characterised by using colliding bitfields. It’s probably a bad idea to use them, but they do seem to reliably perform as documented here.

Source mangling ¶

Some instructions perform source mangling: the source register(s) they use are not taken directly from a register index bitfield in the instruction. Instead, the register index from the instruction is… “adjusted” before use. There are several algorithms used for source mangling, most of them used only in a single instruction.

The most common one, known as SRC2S, takes the register index from the SRC2 field, a $c register index from COND, and a $c bit index from SLCT. If SLCT is anything other than 4, the selected bit is extracted from $c and XORed into the lowest bit of the register index to use. Otherwise (SLCT is 4), bits 4-5 of $c are extracted and added to bits 0-1 of the register index, discarding overflow out of bit 1:

if SLCT == 4:
    adjust = $c[COND] >> 4 & 3
    SRC2S = (SRC2 & ~3) | ((SRC2 + adjust) & 3)
else:
    adjust = $c[COND] >> SLCT & 1
    SRC2S = SRC2 ^ adjust
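
The same rule as a runnable function, for experimenting with concrete values. This is a sketch: `mangle_src2s` is a made-up name, and the `$c` registers are modelled as a plain list of four integers.

```python
def mangle_src2s(src2, cond, slct, c):
    """Return the effective register index for SRC2S.

    c is a list holding the values of the four $c registers.
    """
    if slct == 4:
        # add bits 4-5 of $c to the low two bits of the index, wrapping within 4
        adjust = c[cond] >> 4 & 3
        return (src2 & ~3) | ((src2 + adjust) & 3)
    # otherwise XOR the selected $c bit into the lowest index bit
    adjust = c[cond] >> slct & 1
    return src2 ^ adjust


c_regs = [0x00, 0x30, 0x02, 0x00]
wrapped = mangle_src2s(6, 1, 4, c_regs)   # 6 + 3 wraps within $r4..$r7
flipped = mangle_src2s(6, 2, 1, c_regs)   # zero flag set, low bit flipped
```

Note that with SLCT == 4 the result always stays within the same aligned group of four registers.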

Instructions ¶

Load immediate: mov ¶

Loads a 19-bit signed immediate to the selected register. If you need to load a constant that doesn’t fit into 19 signed bits, use this instruction along with sethi.

Instructions:

Instruction	Operands	Opcode
`mov`	`$r[DST] IMM19`	`0x65`

Operation:

$r[DST] = IMM19

Set high bits: sethi ¶

Loads a 16-bit immediate to high bits of the selected register. Low 16 bits are unaffected.

Instructions:

Instruction	Operands	Opcode
`sethi`	`$r[DST] IMM16`	`0x75`

Operation:

$r[DST] = ($r[DST] & 0xffff) | IMM16 << 16
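
Loading an arbitrary 32-bit constant therefore takes a mov followed by a sethi. The sketch below models both instructions on plain integers and shows one way to split a constant. The function names are made up, and it assumes mov sign-extends IMM19 to 32 bits (the text above only says the immediate is signed).

```python
def split_const(value):
    """Split a 32-bit constant into (IMM19 for mov, IMM16 for sethi)."""
    value &= 0xffffffff
    # the low half always fits in 19 signed bits as a non-negative number
    return value & 0xffff, value >> 16


def mov(imm19):
    """Model of mov: assumed to write the sign-extended 19-bit immediate."""
    imm19 &= 0x7ffff
    if imm19 & 0x40000:
        imm19 -= 0x80000
    return imm19 & 0xffffffff


def sethi(reg, imm16):
    """Model of sethi: replace the high 16 bits, keep the low 16 bits."""
    return (reg & 0xffff) | (imm16 & 0xffff) << 16


lo, hi = split_const(0xdeadbeef)
reg = sethi(mov(lo), hi)
```

Constants in the range -0x40000..0x3ffff need only the mov.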

Move to/from other register file: mov ¶

Does what it says on the tin. There is $c output capability, but it always outputs 0. The other register file is selected by RFILE field, and the possibilities are:

0: $v word 0 (ie. bytes 0-3)
1: $v word 1 (bytes 4-7)
2: $v word 2 (bytes 8-11)
3: $v word 3 (bytes 12-15)
4: ??? (NV41:G80 only)
5: ??? (NV41:G80 only)
6: ??? (NV41:G80 only)
7: ??? (NV41:G80 only)
8: $sr
9: $mi
10: $uc
11: $l (indices over 3 are ignored on writes, wrapped modulo 4 on reads)
12: $a
13: $c - read only (indices over 3 read as 0)
18: curiously enough, aliases 2, for writes only
20: $m[0-31]
21: $m[32-63]
22: $d (indices over 7 are wrapped modulo 8) (G80 only)
23: $f (indices over 1 are wrapped modulo 2)
24: $x (indices over 15 are wrapped modulo 16) (G80 only)

Todo

figure out the pre-G80 register files

Attempts to read or write an unknown register file are ignored. In case of reads, the destination register is left unmodified.

Instructions:

Instruction	Operands	Opcode
`mov`	`[$c[CDST]] $<RFILE>[DST] $r[SRC1]`	`0x6a`
`mov`	`[$c[CDST]] $r[DST] $<RFILE>[SRC1]`	`0x6b`

Operation:

if opcode == 0x6a:
    $<RFILE>[DST] = $r[SRC1]
else:
    $r[DST] = $<RFILE>[SRC1]

if CDST < 4:
    $c[CDST].scalar = 0

Arithmetic operations: mul, min, max, abs, neg, add, sub, shr, sar ¶

mul performs a 16x16 multiplication with a 32-bit result. shr and sar do a bitwise shift right by the given amount, with negative amounts interpreted as a left shift (and the shift amount limited to -0x1f..0x1f). The other operations do what they say on the tin. abs, min, max, mul and sar treat the inputs as signed, shr treats them as unsigned, and for the others it doesn’t matter.

The first source comes from a register selected by SRC1, and the second comes from either a register selected by mangled field SRC2S or a 13-bit signed immediate IMM. In case of abs and neg, the second source is unused, and the immediate versions are redundant (and in fact one set of opcodes is used for mov to/from other register file instead).

Most of these operations have duplicate opcodes. The canonical one is the lowest one.

All of these operations set the full set of scalar condition codes.

Instructions:

Instruction	Operands	Opcode
`mul`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x41, 0x51`
`min`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x48, 0x58`
`max`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x49, 0x59`
`abs`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x4a, 0x5a, 0x7a`
`neg`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x4b, 0x5b, 0x7b`
`add`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x4c, 0x5c`
`sub`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x4d, 0x5d`
`sar`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x4e`
`shr`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x5e`
`mul`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x61, 0x71`
`min`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x68, 0x78`
`max`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x69, 0x79`
`add`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x6c, 0x7c`
`sub`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x6d, 0x7d`
`sar`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x6e`
`shr`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x7e`

Operation:

s1 = sext($r[SRC1], 31)
if opcode & 0x20:
    s2 = sext(IMM, 12)
else:
    s2 = sext($r[SRC2S], 31)

if op == 'mul':
    res = sext(s1, 15) * sext(s2, 15)
elif op == 'min':
    res = min(s1, s2)
elif op == 'max':
    res = max(s1, s2)
elif op == 'abs':
    res = abs(s1)
elif op == 'neg':
    res = -s1
elif op == 'add':
    res = s1 + s2
elif op == 'sub':
    res = s1 - s2
elif op == 'shr' or op == 'sar':
    # shr/sar are unsigned/signed versions of the same insn
    if op == 'shr':
        s1 &= 0xffffffff
    # shift amount is 6-bit signed number
    shift = sext(s2, 5)
    # and -0x20 is invalid
    if shift == -0x20:
        shift = 0
    # negative shifts mean a left shift
    if shift < 0:
        res = s1 << -shift
    else:
        # sign of s1 matters here
        res = s1 >> shift

$r[DST] = res
# build $c result
cres = 0
if res & 1 << 31:
    cres |= 1
if res == 0:
    cres |= 2
if res & 1 << 19:
    cres |= 4
if (res ^ s1) & 1 << 20:
    cres |= 8
if res & 1 << 20:
    cres |= 0x10
if res & 1 << 21:
    cres |= 0x20
if variant == 'G80':
    if res & 1 << 19:
        cres |= 0x40
    if res & 1 << 18:
        cres |= 0x80
if CDST < 4:
    $c[CDST].scalar = cres
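
The shift behaviour is the least obvious part of the above, so here it is as a runnable sketch. `sext(val, n)` is assumed to sign-extend from bit `n`, matching how the pseudocode uses it; `scalar_shift` is a made-up name, and truncating the result to 32 bits is an assumption about how the value lands in the register.

```python
def sext(val, sign_bit):
    """Sign-extend val, treating bit number sign_bit as the sign bit."""
    val &= (1 << (sign_bit + 1)) - 1
    return val - (1 << (sign_bit + 1)) if val >> sign_bit & 1 else val


def scalar_shift(s1, s2, signed):
    """Model of shr (signed=False) and sar (signed=True) on 32-bit values."""
    s1 = sext(s1, 31) if signed else s1 & 0xffffffff
    # the shift amount is a 6-bit signed number, and -0x20 is treated as 0
    shift = sext(s2, 5)
    if shift == -0x20:
        shift = 0
    # negative amounts shift left instead
    res = s1 << -shift if shift < 0 else s1 >> shift
    return res & 0xffffffff  # assumption: only the low 32 bits are kept


sar_result = scalar_shift(0x80000000, 4, signed=True)
shr_result = scalar_shift(0x80000000, 4, signed=False)
left_result = scalar_shift(1, -4 & 0x3f, signed=False)
```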

Bit operations: bitop ¶

Performs an arbitrary two-input bit operation on two registers, selected by SRC1 and SRC2. $c output works, but only with a subset of flags.

Instructions:

Instruction	Operands	Opcode
`bitop`	`BITOP [$c[CDST]] $r[DST] $r[SRC1] $r[SRC2]`	`0x42`

Operation:

s1 = $r[SRC1]
s2 = $r[SRC2]

res = bitop(BITOP, s2, s1) & 0xffffffff

$r[DST] = res
# build $c result
cres = 0
# bit 0 not set
if res == 0:
    cres |= 2
if res & 1 << 19:
    cres |= 4
# bit 3 not set
if res & 1 << 20:
    cres |= 0x10
if res & 1 << 21:
    cres |= 0x20
if variant == 'G80':
    if res & 1 << 19:
        cres |= 0x40
    if res & 1 << 18:
        cres |= 0x80
if CDST < 4:
    $c[CDST].scalar = cres

Bit operations with immediate: and, or, xor ¶

Performs a given bitwise operation on a register and 13-bit immediate. Like for bitop, $c output only works partially.

Instructions:

Instruction	Operands	Opcode
`and`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x62`
`xor`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x63`
`or`	`[$c[CDST]] $r[DST] $r[SRC1] IMM`	`0x64`

Operation:

s1 = $r[SRC1]

if op == 'and':
    res = s1 & IMM
elif op == 'xor':
    res = s1 ^ IMM
elif op == 'or':
    res = s1 | IMM

$r[DST] = res
# build $c result
cres = 0
# bit 0 not set
if res == 0:
    cres |= 2
if res & 1 << 19:
    cres |= 4
# bit 3 not set
if res & 1 << 20:
    cres |= 0x10
if res & 1 << 21:
    cres |= 0x20
if variant == 'G80':
    if res & 1 << 19:
        cres |= 0x40
    if res & 1 << 18:
        cres |= 0x80
if CDST < 4:
    $c[CDST].scalar = cres

Simple bytewise operations: bmin, bmax, babs, bneg, badd, bsub ¶

These perform the corresponding operation (minimum, maximum, absolute value, negation, addition, subtraction) in SIMD manner on 8-bit signed or unsigned numbers from one or two sources. Source 1 is always a register selected by SRC1 bitfield. Source 2, if it is used (ie. instruction is not babs nor bneg), is either a register selected by SRC2S mangled bitfield, or immediate taken from BIMM bitfield.

Each of these instructions comes in signed and unsigned variants and both perform result clipping. Note that babs is rather uninteresting in its unsigned variant (it’s just the identity function), and so is bneg (the result is always 0 or clipped to 0).

These instructions have a $c output, but it’s always set to all-0 if used.

Also note that babs and bneg have two redundant opcodes each: the bit that normally selects immediate or register second source doesn’t apply to them.

Instructions:

Instruction	Operands	Opcode
`bmin s`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x08`
`bmax s`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x09`
`babs s`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x0a`
`bneg s`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x0b`
`badd s`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x0c`
`bsub s`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x0d`
`bmin u`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x18`
`bmax u`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x19`
`babs u`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x1a`
`bneg u`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x1b`
`badd u`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x1c`
`bsub u`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x1d`
`bmin s`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x28`
`bmax s`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x29`
`babs s`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x2a`
`bneg s`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x2b`
`badd s`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x2c`
`bsub s`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x2d`
`bmin u`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x38`
`bmax u`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x39`
`babs u`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x3a`
`bneg u`	`[$c[CDST]] $r[DST] $r[SRC1]`	`0x3b`
`badd u`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x3c`
`bsub u`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x3d`

Operation:

for idx in range(4):
    s1 = $r[SRC1][idx]
    if opcode & 0x20:
        s2 = BIMM
    else:
        s2 = $r[SRC2S][idx]

    if opcode & 0x10:
        # unsigned
        s1 &= 0xff
        s2 &= 0xff
    else:
        # signed
        s1 = sext(s1, 7)
        s2 = sext(s2, 7)

    if op == 'bmin':
        res = min(s1, s2)
    elif op == 'bmax':
        res = max(s1, s2)
    elif op == 'babs':
        res = abs(s1)
    elif op == 'bneg':
        res = -s1
    elif op == 'badd':
        res = s1 + s2
    elif op == 'bsub':
        res = s1 - s2

    if opcode & 0x10:
        # unsigned: clip to 0..0xff
        if res < 0:
            res = 0
        if res > 0xff:
            res = 0xff
    else:
        # signed: clip to -0x80..0x7f
        if res < -0x80:
            res = -0x80
        if res > 0x7f:
            res = 0x7f

    $r[DST][idx] = res

if CDST < 4:
    $c[CDST].scalar = 0
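
To illustrate the clipping, here is a sketch of badd operating on whole 32-bit words. `badd` as a Python function is a model made up for this example; registers are plain integers and the register-register form is assumed.

```python
def badd(a, b, signed):
    """Add two 32-bit words as four independent saturating bytes."""
    out = 0
    for idx in range(4):
        x = a >> 8 * idx & 0xff
        y = b >> 8 * idx & 0xff
        if signed:
            # reinterpret the bytes as signed, then clip to -0x80..0x7f
            x = x - 0x100 if x & 0x80 else x
            y = y - 0x100 if y & 0x80 else y
            res = max(-0x80, min(0x7f, x + y))
        else:
            # clip to 0..0xff
            res = max(0, min(0xff, x + y))
        out |= (res & 0xff) << 8 * idx
    return out


no_carry = badd(0x01020304, 0x10203040, signed=False)
saturated = badd(0x000000ff, 0x00000001, signed=False)
```

Unlike a plain 32-bit add, a byte that overflows sticks at the limit and never carries into its neighbour.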

Bytewise bit operations: band, bor, bxor ¶

Performs a given bitwise operation on a register and an 8-bit immediate replicated 4 times. Or, interpreted differently, performs such operation on every byte of a register independently. $c output is present, but always outputs 0.

Instructions:

Instruction	Operands	Opcode
`band`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x25`
`bor`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x26`
`bxor`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x27`

Operation:

for idx in range(4):
    if op == 'band':
        $r[DST][idx] = $r[SRC1][idx] & BIMM
    elif op == 'bor':
        $r[DST][idx] = $r[SRC1][idx] | BIMM
    elif op == 'bxor':
        $r[DST][idx] = $r[SRC1][idx] ^ BIMM

if CDST < 4:
    $c[CDST].scalar = 0
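
The "immediate replicated 4 times" view can be written directly on 32-bit integers. A minimal sketch, with made-up function names:

```python
def replicate(bimm):
    """Copy an 8-bit immediate into all four bytes of a 32-bit word."""
    return (bimm & 0xff) * 0x01010101


def band(reg, bimm):
    return reg & replicate(bimm)


def bxor(reg, bimm):
    return (reg ^ replicate(bimm)) & 0xffffffff


low_nibbles = band(0x12345678, 0x0f)
inverted = bxor(0x12345678, 0xff)
```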

Bytewise bit shift operations: bshr, bsar ¶

Performs a bytewise SIMD right shift. Like the usual shift instruction, the shift amount is considered signed and negative amounts result in left shift. In this case, the shift amount is a 4-bit signed number. Operands are as in usual bytewise operations.

Instructions:

Instruction	Operands	Opcode
`bsar`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x0e`
`bshr`	`[$c[CDST]] $r[DST] $r[SRC1] $r[SRC2S]`	`0x1e`
`bsar`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x2e`
`bshr`	`[$c[CDST]] $r[DST] $r[SRC1] BIMM`	`0x3e`

Operation:

for idx in range(4):
    s1 = $r[SRC1][idx]
    if opcode & 0x20:
        s2 = BIMM
    else:
        s2 = $r[SRC2S][idx]

    if opcode & 0x10:
        # unsigned
        s1 &= 0xff
    else:
        # signed
        s1 = sext(s1, 7)

    shift = sext(s2, 3)

    if shift < 0:
        res = s1 << -shift
    else:
        res = s1 >> shift

    $r[DST][idx] = res

if CDST < 4:
    $c[CDST].scalar = 0

Bytewise multiplication: bmul ¶

These instructions perform bytewise fractional multiplication: the inputs and outputs are considered to be fixed-point numbers with 8 fractional bits (unsigned version) or 7 fractional bits (signed version). The signedness of both inputs and the output can be controlled independently (the signedness of the output is controlled by the opcode, and of the inputs by instruction word flags SIGN1 and SIGN2). The results are clipped to the output range. There are two rounding modes: round down and round to nearest with ties rounded up.

The first source is always a register selected by SRC1 bitfield. The second source can be a register selected by SRC2 bitfield, or 6-bit immediate in BIMMMUL bitfield padded with two zero bits on the right.

Note that besides proper 0xX1 opcodes, there are also 0xX2 bad opcodes. In case of register-register ops, these opcodes are just aliases of the sane ones, but for immediate opcodes, a colliding bitfield is used.

The instructions have no $c output capability.

Instructions:

Instruction	Operands	Opcode
`bmul s`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 $r[SRC2]`	`0x01, 0x02`
`bmul u`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 $r[SRC2]`	`0x11, 0x12`
`bmul s`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 BIMMMUL`	`0x21`
`bmul u`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 BIMMMUL`	`0x31`
`bmul s`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 BIMMBAD`	`0x22` (bad opcode)
`bmul u`	`RND $r[DST] SIGN1 $r[SRC1] SIGN2 BIMMBAD`	`0x32` (bad opcode)

Operation:

for idx in range(4):
    # read inputs
    s1 = $r[SRC1][idx]
    if opcode & 0x20:
        if opcode & 2:
            s2 = BIMMBAD
        else:
            s2 = BIMMMUL << 2
    else:
        s2 = $r[SRC2S][idx]

    # convert inputs to 8 fractional bits - unsigned inputs are already ok
    ss1 = s1
    ss2 = s2
    if SIGN1:
        ss1 = sext(s1, 7) << 1
    if SIGN2:
        ss2 = sext(s2, 7) << 1

    # multiply - the result has 16 fractional bits
    res = ss1 * ss2

    if opcode & 0x10:
        # unsigned result
        # first, if round to nearest is selected, apply rounding correction
        if RND == 'rn':
            res += 0x80
        # convert to 8 fractional bits
        res >>= 8
        # clip
        if res < 0:
            res = 0
        if res > 0xff:
            res = 0xff
    else:
        # signed result
        if RND == 'rn':
            res += 0x100
        # convert to 7 fractional bits
        res >>= 9
        # clip
        if res < -0x80:
            res = -0x80
        if res > 0x7f:
            res = 0x7f

    $r[DST][idx] = res
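
A single-byte model makes the fixed-point behaviour easier to try out. This is a sketch following the pseudocode above: `bmul_byte` and its parameters are made-up names, source bytes are passed as raw 8-bit values, and the result is returned as the raw byte that would be stored.

```python
def bmul_byte(s1, s2, sign1, sign2, signed_out, rn):
    """Fractional multiply of two bytes; returns the raw result byte."""
    s1 &= 0xff
    s2 &= 0xff
    # bring signed inputs (7 fractional bits) up to 8 fractional bits
    if sign1:
        s1 = (s1 - 0x100 if s1 & 0x80 else s1) << 1
    if sign2:
        s2 = (s2 - 0x100 if s2 & 0x80 else s2) << 1
    res = s1 * s2  # 16 fractional bits
    if signed_out:
        if rn:
            res += 0x100
        res >>= 9  # down to 7 fractional bits
        res = max(-0x80, min(0x7f, res))
    else:
        if rn:
            res += 0x80
        res >>= 8  # down to 8 fractional bits
        res = max(0, min(0xff, res))
    return res & 0xff


# 0.5 * 0.5 = 0.25 in both representations
quarter_u = bmul_byte(0x80, 0x80, False, False, signed_out=False, rn=False)
quarter_s = bmul_byte(0x40, 0x40, True, True, signed_out=True, rn=False)
```

The signed case -1.0 * -1.0 would give +1.0, which is not representable with 7 fractional bits and so clips to 0x7f.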

Send immediate to vector unit: vec ¶

This instruction takes two 9-bit immediate operands and sends them as factors to the vector unit. The first immediate is used as factors 0 and 1, and the second is used as factors 2 and 3. $vc selection is sent as well.

Instructions:

Instruction	Operands	Opcode
`vec`	`FACTOR1 FACTOR2 $vc[VCIDX] VCFLAG VCXFRM`	`0x24`

Operation:

s2v.factor[0] = s2v.factor[1] = FACTOR1
s2v.factor[2] = s2v.factor[3] = FACTOR2
s2v.vcsel.idx = VCIDX
s2v.vcsel.flag = VCFLAG
s2v.vcsel.xfrm = VCXFRM

Send mask to vector unit and shift: vecms ¶

This instruction shifts a register right by 4 bits and uses the bits shifted out as s2v mask 0 after expansion (each bit is replicated 4 times). The s2v factors are derived from that mask and are not very useful. The right shift is sign-filling. $vc selection is sent as well.

Instructions:

Instruction	Operands	Opcode
`vecms`	`$r[SRC1] $vc[VCIDX] VCFLAG VCXFRM`	`0x45`

Operation:

val = sext($r[SRC1], 31)
$r[SRC1] = val >> 4
# the factors are made so that the mask derived from them will contain
# each bit from the short mask repeated 4 times
f0 = 0
f1 = 0
if val & 1:
    f0 |= 0x1e
if val & 2:
    f0 |= 0x1e0
if val & 4:
    f1 |= 0x1e
if val & 8:
    f1 |= 0x1e0
s2v.factor[0] = f0
s2v.factor[1] = f1
s2v.factor[2] = s2v.factor[3] = 0
s2v.vcsel.idx = VCIDX
s2v.vcsel.flag = VCFLAG
s2v.vcsel.xfrm = VCXFRM
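
The claim that each of the four shifted-out bits ends up replicated 4 times in mask 0 can be checked with a short model. This is a sketch: `vecms` here is a made-up Python function that returns the new register value and mask 0, ignoring the $vc selection.

```python
def vecms(reg):
    """Return (new register value, s2v mask 0) for a 32-bit register value."""
    val = reg & 0xffffffff
    if val & 0x80000000:
        val -= 1 << 32  # reinterpret as signed so that >> is sign-filling
    new_reg = (val >> 4) & 0xffffffff
    f0 = (0x1e if val & 1 else 0) | (0x1e0 if val & 2 else 0)
    f1 = (0x1e if val & 4 else 0) | (0x1e0 if val & 8 else 0)
    # mask 0 is built from bits 1-8 of factors 0 and 1
    mask0 = (f0 >> 1 & 0xff) | (f1 >> 1 & 0xff) << 8
    return new_reg, mask0


new_reg, mask0 = vecms(0x80000005)
```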

Send bytes to vector unit: bvec ¶

Treats a register as a 4-byte vector and sends the bytes as s2v factors (treating them as signed with 7 fractional bits). $vc selection is sent as well. If the s2v output is used as masks, this effectively takes mask 0 from source bits 0-15 and mask 1 from source bits 16-31.

Instructions:

Instruction	Operands	Opcode
`bvec`	`$r[SRC1] $vc[VCIDX] VCFLAG VCXFRM`	`0x0f`

Operation:

for idx in range(4):
    s2v.factor[idx] = sext($r[SRC1][idx], 7) << 1
s2v.vcsel.idx = VCIDX
s2v.vcsel.flag = VCFLAG
s2v.vcsel.xfrm = VCXFRM
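
The statement about masks follows from the factor layout: shifting each byte left by one puts it exactly in bits 1-8 of the factor. A sketch that confirms this, with `bvec_model` as a made-up name and the $vc selection left out:

```python
def bvec_model(reg):
    """Return (factors, (mask0, mask1)) for a 32-bit register value."""
    factors = []
    for idx in range(4):
        b = reg >> 8 * idx & 0xff
        # signed byte with 7 fractional bits, widened by one bit
        factors.append((b - 0x100 if b & 0x80 else b) << 1)
    mask0 = (factors[0] >> 1 & 0xff) | (factors[1] >> 1 & 0xff) << 8
    mask1 = (factors[2] >> 1 & 0xff) | (factors[3] >> 1 & 0xff) << 8
    return factors, (mask0, mask1)


factors, masks = bvec_model(0x80ff7f01)
```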

Bytewise multiply, add, and send to vector unit: bvecmad, bvecmadsel ¶

Figure out this one yourself. It sends s2v factors based on SIMD multiply & add, uses weird source mangling, and even weirder source 1 bitfields.

Instructions:

Instruction	Operands	Opcode
`bvecmad`	`$r[SRC1] $r[SRC2]q $vc[VCIDX] VCFLAG VCXFRM`	`0x04`
`bvecmadsel`	`$r[SRC1] $r[SRC2]q $vc[VCIDX] VCFLAG VCXFRM`	`0x05`

Operation:

if SLCT == 4:
    adjust = $c[COND] >> 4 & 3
else:
    adjust = $c[COND] >> SLCT & 1

# SRC1 selects the pre-factor, which will be multiplied by source 3
if op == 'bvecmad':
    prefactor = $r[SRC1] >> 11 & 0xff
elif op == 'bvecmadsel':
    prefactor = $r[SRC1] >> 11 & 0x7f

s2a = $r[SRC2 | adjust]
s2b = $r[SRC2 | 2 | adjust]

for idx in range(4):
    # this time source is mangled by OR, not XOR - don't ask me

    if op == 'bvecmad':
        midx = idx
    elif op == 'bvecmadsel':
        midx = idx & 2
        if SLCT == 2 and $c[COND] & 0x80:
            midx |= 1

    # baseline (res will have 16 fractional bits, sources have 8)
    res = s2a[midx] << 8
    # throw in the multiplication result
    res += prefactor * s2b[idx]
    # and rounding correction (for round to nearest, ties up)
    res += 0x40
    # and round to 9 fractional bits
    s2v.factor[idx] = res >> 7

s2v.vcsel.idx = VCIDX
s2v.vcsel.flag = VCFLAG
s2v.vcsel.xfrm = VCXFRM
