Puller - handling of submitted commands by FIFO

Introduction

PFIFO puller’s job is taking methods out of the CACHE and delivering them to the right place for execution, or executing them directly.

Methods 0-0xfc are special and executed by the puller. Methods 0x100 and up are forwarded to the engine object currently bound to a given subchannel. The methods are:

Method Present on Name Description
0x0000 all OBJECT Binds an engine object
0x0008 GF100- NOP Does nothing
0x0010 G84- SEMAPHORE_ADDRESS_HIGH New-style semaphore address high part
0x0014 G84- SEMAPHORE_ADDRESS_LOW New-style semaphore address low part
0x0018 G84- SEMAPHORE_SEQUENCE New-style semaphore payload
0x001c G84- SEMAPHORE_TRIGGER New-style semaphore trigger
0x0020 G84- NOTIFY_INTR Triggers an interrupt
0x0024 G84- WRCACHE_FLUSH Flushes write post caches
0x0028 MCP89- ??? ???
0x002c MCP89- ??? ???
0x0050 NV10- REF_CNT Writes the ref counter
0x0060 NV1A:GF100 DMA_SEMAPHORE DMA object for semaphores
0x0064 NV1A- SEMAPHORE_OFFSET Old-style semaphore address
0x0068 NV1A- SEMAPHORE_ACQUIRE Old-style semaphore acquire trigger and payload
0x006c NV1A- SEMAPHORE_RELEASE Old-style semaphore release trigger and payload
0x0070 GF100- ??? ???
0x0074 GF100- ??? ???
0x0078 GF100- ??? ???
0x007c GF100- ??? ???
0x0080 NV40- YIELD Yield PFIFO - force channel switch
0x0100:0x2000 NV1:NV4 Passed down to the engine
0x0100:0x0180 NV4:GF100 Passed down to the engine
0x0180:0x0200 NV4:GF100 Passed down to the engine, goes through RAMHT lookup
0x0200:0x2000 NV4:GF100 Passed down to the engine
0x0100:0x4000 GF100- Passed down to the engine

Todo

missing the GF100+ methods

RAMHT and the FIFO objects

As has been already mentioned, each channel has 8 “subchannels” which can be bound to engine objects. On pre-GF100 GPUs, these objects and DMA objects are collectively known as “FIFO objects”. FIFO objects and RAMHT don’t exist on GF100+ PFIFO.

The RAMHT is a big hash table that associates arbitrary 32-bit handles with FIFO objects and engine ids. Whenever a method is mentioned to take an object handle, it means the parameter is looked up in RAMHT. When such lookup fails to find a match, a CACHE_ERROR(NO_HASH) error is raised.

NV4:GF100

Internally, a FIFO object is a [usually small] block of data residing in “instance memory”. The instance memory is RAMIN for pre-G80 GPUs, and the channel structure for G80+ GPUs. The first few bits of a FIFO object determine its ‘class’. Class is 8 bits on NV4:NV25, 12 bits on NV25:NV40, 16 bits on NV40:GF100.

The data associated with a handle in RAMHT consists of engine id, which determines the object’s behavior when bound to a subchannel, and its address in RAMIN [pre-G80] or offset from channel structure start [G80+].

Apart from method 0, the engine id is ignored. The suitability of an object for a given method is determined by reading its class and checking if it makes sense. Most methods other than 0 expect a DMA object, although a couple of pre-G80 graph objects have methods that expect other graph objects.

The following are commonly accepted object classes:

  • 0x0002: DMA object for reading
  • 0x0003: DMA object for writing
  • 0x0030: NULL object - used to effectively unbind a previously bound object
  • 0x003d: DMA object for reading/writing

Other object classes are engine-specific.

For more information on DMA objects, see NV3 DMA objects, NV4:G80 DMA objects, or DMA objects.

NV3

NV3 also has RAMHT, but it’s only used for engine objects. While NV3 has DMA objects, they have to be bound manually by the kernel. Thus, they’re not mentioned in RAMHT, and the 0x180-0x1fc methods are not implemented in hardware - they’re instead trapped and emulated in software to behave like NV4+.

NV3 also doesn’t use object classes - the object type is instead a 7-bit number encoded in RAMHT along with engine id and object address.

NV1

You don’t want to know how NV1 RAMHT works.

Puller state

type name GPUs description
b24[8] ctx NV1:NV4 objects bound to subchannels
b3 last_subc NV1:NV4 last used subchannel
b5[8] engines NV4+ engines bound to subchannels
b5 last_engine NV4+ last used engine
b32 ref NV10+ reference counter [shared with pusher]
bool acquire_active NV1A+ semaphore acquire in progress
b32 acquire_timeout NV1A+ semaphore acquire timeout
b32 acquire_timestamp NV1A+ semaphore acquire timestamp
b32 acquire_value NV1A+ semaphore acquire value
dmaobj dma_semaphore NV11:GF100 semaphore DMA object
b12/16 semaphore_offset NV11:GF100 old-style semaphore address
bool semaphore_off_val G80:GF100 semaphore_offset valid
b40 semaphore_address G84+ new-style semaphore address
b32 semaphore_sequence G84+ new-style semaphore value
bool acquire_source G84:GF100 semaphore acquire address selection
bool acquire_mode G84+ semaphore acquire mode

GF100 state is likely incomplete.

Engine objects

The main purpose of the puller is relaying methods to the engines. First, an engine object has to be bound to a subchannel using method 0. Then, all methods >=0x100 on the subchannel will be forwarded to the relevant engine.

On pre-NV4, the bound objects’ RAMHT information is stored as part of puller state. The last used subchannel is also remembered and each time the puller is requested to submit commands on subchannel different from the last one, method 0 is submitted, or channel switch occurs, the information about the object will be forwarded to the engine through its method 0. The information about an object is 24-bit, is known as object’s “context”, and has the following fields:

  • bits 0-15 [NV1]: object flags
  • bits 0-15 [NV3]: object address
  • bits 16-22: object type
  • bit 23: engine id

The context for objects is stored directly in their RAMHT entries.

On NV4+ GPUs, the puller doesn’t care about bound objects - this information is supposed to be stored by the engine itself as part of its state. The puller only remembers what engine each subchannel is bound to. On NV4:GF100 When method 0 is executed, the puller looks up the object in RAMHT, getting engine id and object address in return. The engine id is remembered in puller state, while object address is passed down to the engine for further processing.

GF100+ did away with RAMHT. Thus, method 0 now takes the object class and engine id directly as parameters:

  • bits 0-15: object class. Not used by the puller, simply passed down to the engine.
  • bits 16-20: engine id

The list of valid engine ids can be found on FIFO overview. The SOFTWARE engine is special: all methods submitted to it, explicitely or implicitely by binding a subchannel to it, will cause a CACHE_ERROR(EMPTY_SUBCHANNEL) interrupt. This interrupt can then be intercepted by the driver to implement a “software object”, or can be treated as an actual error and reported.

The engines run asynchronously. The puller will send them commands whenever they have space in their input queues and won’t wait for completion of a command before sending more. However, when engines are switched [ie. puller has to submit a command to a different engine than last used by the channel], the puller will wait until the last used engine is done with this channel’s commands. Several special puller methods will also wait for engines to go idle.

Todo

verify this on all card families.

On NV4:GF100 GPUs, methods 0x180-0x1fc are treated specially: while other methods are forwarded directly to engine without modification, these methods are expected to take object handles as parameters and will be looked up in RAMHT by the puller before forwarding. Ie. the engine will get the object’s address found in RAMHT.

mthd 0x0000 / 0x000: OBJECT
On NV1:GF100, takes the handle of the object that should be bound to the subchannel it was submitted on. On GF100+, it instead takes engine+class directly.
if (gpu < NV4) {
        b24 newctx = RAMHT_LOOKUP(param);
        if (newctx & 0x800000) {
                /* engine == PGRAPH */
                if (ENGINE_CUR_CHANNEL(PGRAPH) != chan)
                        ENGINE_CHANNEL_SWITCH(PGRAPH, chan);
                ENGINE_SUBMIT_MTHD(PGRAPH, subc, 0, newctx);
                ctx[subc] = newctx;
                last_subc = subc;
        } else {
                /* engine == SOFTWARE */
                while (!ENGINE_IDLE(PGRAPH))
                        ;
                throw CACHE_ERROR(EMPTY_SUBCHANNEL);
        }
} else {
        /* NV4+ GPU */
        b5 engine; b16 eparam;
        if (gpu >= GF100) {
                eparam = param & 0xffff;
                engine = param >> 16 & 0x1f;
                /* XXX: behavior with more bitfields? does it forward the whole thing? */
        } else {
                engine = RAMHT_LOOKUP(param).engine;
                eparam = RAMHT_LOOKUP(param).addr;
        }
        if (engine != last_engine) {
                while (ENGINE_CUR_CHANNEL(last_engine) == chan && !ENGINE_IDLE(last_engine))
                        ;
        }
        if (engine == SOFTWARE) {
                throw CACHE_ERROR(EMPTY_SUBCHANNEL);
        } else {
                if (ENGINE_CUR_CHANNEL(engine) != chan)
                        ENGINE_CHANNEL_SWITCH(engine, chan);
                ENGINE_SUBMIT_MTHD(engine, subc, 0, eparam);
                last_engine = engines[subc] = engine;
        }
}

mthd 0x0100-0x3ffc / 0x040-0xfff: [forwarded to engine]

if (gpu < NV4) {
        if (subc != last_subc) {
                if (ctx[subc] & 0x800000) {
                        /* engine == PGRAPH */
                        if (ENGINE_CUR_CHANNEL(PGRAPH) != chan)
                                ENGINE_CHANNEL_SWITCH(PGRAPH, chan);
                        ENGINE_SUBMIT_MTHD(PGRAPH, subc, 0, ctx[subc]);
                        last_subc = subc;
                } else {
                        /* engine == SOFTWARE */
                        while (!ENGINE_IDLE(PGRAPH))
                                ;
                        throw CACHE_ERROR(EMPTY_SUBCHANNEL);
                }
        }
        if (ctx[subc] & 0x800000) {
                /* engine == PGRAPH */
                if (ENGINE_CUR_CHANNEL(PGRAPH) != chan)
                        ENGINE_CHANNEL_SWITCH(PGRAPH, chan);
                ENGINE_SUBMIT_MTHD(PGRAPH, subc, mthd, param);
        } else {
                /* engine == SOFTWARE */
                while (!ENGINE_IDLE(PGRAPH))
                        ;
                throw CACHE_ERROR(EMPTY_SUBCHANNEL);
        }
} else {
        /* NV4+ */
        if (gpu < GF100 && mthd >= 0x180/4 && mthd < 0x200/4) {
                param = RAMHT_LOOKUP(param).addr;
        }
        if (engines[subc] != last_engine) {
                while (ENGINE_CUR_CHANNEL(last_engine) == chan && !ENGINE_IDLE(last_engine))
                        ;
        }
        if (engines[subc] == SOFTWARE) {
                throw CACHE_ERROR(EMPTY_SUBCHANNEL);
        } else {
                if (ENGINE_CUR_CHANNEL(engine) != chan)
                        ENGINE_CHANNEL_SWITCH(engine, chan);
                ENGINE_SUBMIT_MTHD(engine, subc, mthd, param);
                last_engine = engines[subc];
        }
}

Todo

verify all of the pseudocode…

Puller builtin methods

Syncing with host: reference counter

NV10 introduced a “reference counter”. It’s a per-channel 32-bit register that is writable by the puller and readable through the channel control area [see DMA submission to FIFOs on NV4]. It can be used to tell host which commands have already completed: after every interesting batch of commands, add a method that will set the ref counter to monotonically increasing values. The host code can then read the counter from channel control area and deduce which batches are already complete.

The method to set the reference counter is REF_CNT, and it simply sets the ref counter to its parameter. When it’s executed, it’ll also wait for all previously submitted commands to complete execution.

mthd 0x0050 / 0x014: REF_CNT [NV10:]

while (ENGINE_CUR_CHANNEL(last_engine) == chan && !ENGINE_IDLE(last_engine))
        ;
ref = param;

Semaphores

NV1A PFIFO introduced a concept of “semaphores”. A semaphore is a 32-bit word located in memory. G84 also introduced “long” semaphores, which are 4-word memory structures that include a normal semaphore word and a timestamp.

The PFIFO semaphores can be “acquired” and “released”. Note that these operations are NOT the familiar P/V semaphore operations, they’re just fancy names for “wait until value == X” and “write X”.

There are two “versions” of the semaphore functionality. The “old-style” semaphores are implemented by NV1A:GF100 GPUs. The “new-style” semaphores are supported by G84+ GPUs. The differences are:

Old-style semaphores

  • limitted addressing range: 12-bit [NV1A:G80] or 16-bit [G80:GF100] offset in a DMA object. Thus a special DMA object is required.
  • release writes a single word
  • acquire supports only “wait for value equal to X” mode

New-style semaphores

  • full 40-bit addressing range
  • release writes word + timestamp, ie. long semaphore
  • acquire supports “wait for value equal to X” and “wait for value greater or equal X” modes

Semaphores have to be 4-byte aligned. All values are stored with endianness selected by big_endian flag [NV1A:G80] or by PFIFO endianness [G80+]

On pre-GF100, both old-style semaphores and new-style semaphores use the DMA object stored in dma_semaphore, which can be set through DMA_SEMAPHORE method. Note that this method is buggy on pre-G80 GPUs and accepts only write-only DMA objects of class 0x0002. You have to work around the bug by preparing such DMA objects [or using a kernel that intercepts the error and does the binding manually].

Old-style semaphores read/write the location specified in semaphore_offset, which can be set by SEMAPHORE_OFFSET method. The offset has to be divisible by 4 and fit in 12 bits [NV1A:G80] or 16 bits [G80:GF100]. An acquire is triggered by using the SEMAPHORE_ACQUIRE mthd with the expected value as the parameter - further command processing will halt until the memory location contains the selected value. A release is triggered by using the SEMAPHORE_RELEASE method with the value as parameter - the value will be written into the semaphore location.

New-style semaphores use the location specified in semaphore_address, whose low/high parts can be set through SEMAPHORE_ADDRESS_HIGH and _LOW methods. The value for acquire/release is stored in semaphore_sequence and specified by SEMAPHORE_SEQUENCE method. Acquire and release are triggered by using the SEMAPHORE_TRIGGER method with the requested operation as parameter.

The new-style release operation writes the following 16-byte structure to memory at semaphore_address:

  • 0x00: [32-bit] semaphore_sequence
  • 0x04: [32-bit] 0
  • 0x08: [64-bit] PTIMER timestamp [see PTIMER: Timer engine]

The new-style “acquire equal” operation behaves exactly like old-style acquire, but uses semaphore_address instead of semaphore_offset and semaphore_sequence instead of SEMAPHORE_RELEASE param. The “acquire greater or equal” operation, instead of waiting for the semaphore value to be equal to semaphore_sequence, it waits for value that satisfies (int32_t)(val - semaphore_sequence) >= 0, ie. for a value that’s greater or equal to semaphore_sequence in 32-bit wrapping arithmetic. The “acquire mask” operation waits for a value that, ANDed with semaphore_sequence, gives a non-0 result [GF100+ only].

Failures of semaphore-related methods will trigger the SEMAPHORE error. The SEMAPHORE error has several subtypes, depending on card generation.

NV1A:G80 SEMAPHORE error subtypes:

  • 1: INVALID_OPERAND: wrong parameter to a method
  • 2: INVALID_STATE: attempt to acquire/release without proper setup

G80:GF100 SEMAPHORE error subtypes:

  • 1: ADDRESS_UNALIGNED: address not divisible by 4
  • 2: INVALID_STATE: attempt to acquire/release without proper setup
  • 3: ADDRESS_TOO_LARGE: attempt to set >40-bit address or >16-bit offset
  • 4: MEM_FAULT: got VM fault when reading/writing semaphore

GF100 SEMAPHORE error subtypes:

Todo

figure this out

If the acquire doesn’t immediately succeed, the acquire parameters are written to puller state, and the read will be periodically retried. Further puller processing will be blocked on current channel until acquire succeeds. Note that, on G84+ GPUs, the retry reads are issued from SEMAPHORE_BG VM engine instead of the PFIFO VM engine. There’s also apparently a timeout, but it’s not REd yet.

Todo

RE timeouts

mthd 0x0060 / 0x018: DMA_SEMAPHORE [O] [NV1A:GF100]
obj = RAMHT_LOOKUP(param).addr;
if (gpu < G80) {
        if (OBJECT_CLASS(obj) != 2)
                throw SEMAPHORE(INVALID_OPERAND);
        if (DMAOBJ_RIGHTS(obj) != WO)
                throw SEMAPHORE(INVALID_OPERAND);
        if (!DMAOBJ_PT_PRESENT(obj))
                throw SEMAPHORE(INVALID_OPERAND);
}
/* G80 doesn't bother with verification */
dma_semaphore = obj;

Todo

is there ANY way to make G80 reject non-DMA object classes?

mthd 0x0064 / 0x019: SEMAPHORE_OFFSET [NV1A-]
if (gpu < G80) {
        if (param & ~0xffc)
                throw SEMAPHORE(INVALID_OPERAND);
        semaphore_offset = param;
} else if (gpu < GF100) {
        if (param & 3)
                throw SEMAPHORE(ADDRESS_UNALIGNED);
        if (param & 0xffff0000)
                throw SEMAPHORE(ADDRESS_TOO_LARGE);
        semaphore_offset = param;
        semaphore_off_val = 1;
} else {
        semaphore_address[0:31] = param;
}
mthd 0x0068 / 0x01a: SEMAPHORE_ACQUIRE [NV1A-]
if (gpu < G80 && !dma_semaphore)
        /* unbound DMA object */
        throw SEMAPHORE(INVALID_STATE);
if (gpu >= G80 && !semaphore_off_val)
        throw SEMAPHORE(INVALID_STATE);
b32 word;
if (gpu < G80) {
        word = READ_DMAOBJ_32(dma_semaphore, semaphore_offset, big_endian?BE:LE);
} else {
        try {
                word = READ_DMAOBJ_32(dma_semaphore, semaphore_offset, pfifo_endian);
        } catch (VM_FAULT) {
                throw SEMAPHORE(MEM_FAULT);
        }
}
if (word == param) {
        /* already done */
} else {
        /* acquire_active will block further processing and schedule retries */
        acquire_active = 1;
        acquire_value = param;
        acquire_timestamp = ???;
        /* XXX: figure out timestamp/timeout business */
        if (gpu >= G80) {
                acquire_mode = 0;
                acquire_source = 0;
        }
}
mthd 0x006c / 0x01b: SEMAPHORE_RELEASE [NV1A-]
if (gpu < G80 && !dma_semaphore)
        /* unbound DMA object */
        throw SEMAPHORE(INVALID_STATE);
if (gpu >= G80 && !semaphore_off_val)
        throw SEMAPHORE(INVALID_STATE);
if (gpu < G80) {
        WRITE_DMAOBJ_32(dma_semaphore, semaphore_offset, param, big_endian?BE:LE);
} else {
        try {
                WRITE_DMAOBJ_32(dma_semaphore, semaphore_offset, param, pfifo_endian);
        } catch (VM_FAULT) {
                throw SEMAPHORE(MEM_FAULT);
        }
}
mthd 0x0010 / 0x004: SEMAPHORE_ADDRESS_HIGH [G84:]
if (param & 0xffffff00)
        throw SEMAPHORE(ADDRESS_TOO_LARGE);
semaphore_address[32:39] = param;
mthd 0x0014 / 0x005: SEMAPHORE_ADDRESS_LOW [G84:]
if (param & 3)
        throw SEMAPHORE(ADDRESS_UNALIGNED);
semaphore_address[0:31] = param;
mthd 0x0018 / 0x006: SEMAPHORE_SEQUENCE [G84:]
semaphore_sequence = param;
mthd 0x001c / 0x007: SEMAPHORE_TRIGGER [G84:]
bits 0-2: operation
  • 1: ACQUIRE_EQUAL
  • 2: WRITE_LONG
  • 4: ACQUIRE_GEQUAL
  • 8: ACQUIRE_MASK [GF100-]

Todo

bit 12 does something on GF100?

op = param & 7;
b64 timestamp = PTIMER_GETTIME();
if (param == 2) {
        if (gpu < GF100) {
                try {
                        WRITE_DMAOBJ_32(dma_semaphore, semaphore_address+0x0, param, pfifo_endian);
                        WRITE_DMAOBJ_32(dma_semaphore, semaphore_address+0x4, 0, pfifo_endian);
                        WRITE_DMAOBJ_64(dma_semaphore, semaphore_address+0x8, timestamp, pfifo_endian);
                } catch (VM_FAULT) {
                        throw SEMAPHORE(MEM_FAULT);
                }
        } else {
                WRITE_VM_32(semaphore_address+0x0, param, pfifo_endian);
                WRITE_VM_32(semaphore_address+0x4, 0, pfifo_endian);
                WRITE_VM_64(semaphore_address+0x8, timestamp, pfifo_endian);
        }
} else {
        b32 word;
        if (gpu < GF100) {
                try {
                        word = READ_DMAOBJ_32(dma_semaphore, semaphore_address, pfifo_endian);
                } catch (VM_FAULT) {
                        throw SEMAPHORE(MEM_FAULT);
                }
        } else {
                word = READ_VM_32(semaphore_address, pfifo_endian);
        }
        if ((op == 1 && word == semaphore_sequence) || (op == 4 && (int32_t)(word - semaphore_sequence) >= 0) || (op == 8 && word & semaphore_sequence)) {
                /* already done */
        } else {
                /* XXX GF100 */
                acquire_source = 1;
                acquire_value = semaphore_sequence;
                acquire_timestamp = ???;
                if (op == 1) {
                        acquire_active = 1;
                        acquire_mode = 0;
                } else if (op == 4) {
                        acquire_active = 1;
                        acquire_mode = 1;
                } else {
                        /* invalid combination - results in hang */
                }
        }
}

Misc puller methods

NV40 introduced the YIELD method which, if there are any other busy channels at the moment, will cause PFIFO to switch to another channel immediately, without waiting for the timeslice to expire.

mthd 0x0080 / 0x020: YIELD [NV40:]
::
PFIFO_YIELD();

G84 introduced the NOTIFY_INTR method, which simply raises an interrupt that notifies the host of its execution. It can be used for sync primitives.

mthd 0x0020 / 0x008: NOTIFY_INTR [G84:]
::
PFIFO_NOTIFY_INTR();

Todo

check how this is reported on GF100

The G84+ WRCACHE_FLUSH method can be used to flush PFIFO’s write post caches. [see Tesla virtual memory]

mthd 0x0024 / 0x009: WRCACHE_FLUSH [G84:]
::
VM_WRCACHE_FLUSH(PFIFO);

The GF100+ NOP method does nothing:

mthd 0x0008 / 0x002: NOP [GF100:]

/* do nothing */