DMA submission to FIFOs on NV4


There are two modes of DMA command submission: NV4-style DMA mode and IB mode.

Both are based on the concept of a “pushbuffer”: an area of memory that the user fills with commands and tells PFIFO to process. The pushbuffers are then assembled into a “command stream” consisting of 32-bit words that make up “commands”. In NV4-style DMA mode, the pushbuffer is always read linearly and converted directly to the command stream, except when the “jump”, “return”, or “call” commands are encountered. In IB mode, the jump/call/return commands are disabled, and the command stream is instead created with the use of an “IB buffer”. The IB buffer is a circular buffer of (base,length) pairs describing areas of pushbuffer that will be stitched together to create the command stream. NV4-style mode is available on NV4:GF100, IB mode is available on G80+.


check for NV4-style mode on GF100

In both cases, the command stream is then broken down into commands, which get executed. For most commands, the execution consists of storing methods into CACHE for execution by the puller.

Pusher state

The following data makes up the DMA pusher state:

type    name              cards           description
dmaobj  dma_pushbuffer    :GF100 [1]      the pushbuffer and IB DMA object
b32     dma_limit         :GF100 [1] [2]  pushbuffer size limit
b32     dma_put           all             pushbuffer current end address
b32     dma_get           all             pushbuffer current read address
b11/12  dma_state.mthd    all             Current method
b3      dma_state.subc    all             Current subchannel
b24     dma_state.mcnt    all             Current method count
b32     dcount_shadow     NV5:            number of already-processed methods in cmd
bool    dma_state.ni      NV10+           Current command’s NI flag
bool    dma_state.lenp    G80+ [3]        Large NI command length pending
b32     ref               NV10+           reference counter [shared with puller]
bool    subr_active       NV1A+ [2]       Subroutine active
b32     subr_return       NV1A+ [2]       subroutine return address
bool    big_endian        NV11:G80 [1]    pushbuffer endian switch
bool    sli_enable        G80+ [1]        SLI cond command enabled
b12     sli_mask          G80+ [1]        SLI cond mask
bool    sli_active        NV40+           SLI cond currently active
bool    ib_enable         G80+ [1]        IB mode enabled
bool    nonmain           G80+ [3]        non-main pushbuffer active
b8      dma_put_high      G80+            extra 8 bits for dma_put
b8      dma_put_high_rs   G80+            dma_put_high read shadow
b8      dma_put_high_ws   G80+ [2]        dma_put_high write shadow
b8      dma_get_high      G80+            extra 8 bits for dma_get
b8      dma_get_high_rs   G80+            dma_get_high read shadow
b32     ib_put            G80+ [3]        IB current end position
b32     ib_get            G80+ [3]        IB current read position
b40     ib_address        G80+ [1] [3]    IB address
b8      ib_order          G80+ [1] [3]    IB size
b32     dma_mget          G80+ [3]        main pushbuffer last read address
b8      dma_mget_high     G80+ [3]        extra 8 bits for dma_mget
bool    dma_mget_val      G80+ [3]        dma_mget valid flag
b8      dma_mget_high_rs  G80+ [3]        dma_mget_high read shadow
bool    dma_mget_val_rs   G80+ [3]        dma_mget_val read shadow
[1] means that this part of state can only be modified by kernel intervention and is normally set just once, on channel setup.
[2] means that the state only applies to NV4-style mode.
[3] means that the state only applies to IB mode.


On pre-GF100, whenever the DMA pusher encounters problems, it’ll raise a DMA_PUSHER error. There are 6 types of DMA_PUSHER errors:

id  name               reason
1   CALL_SUBR_ACTIVE   call command while subroutine active
2   INVALID_MTHD       attempt to submit a nonexistent special method
3   RET_SUBR_INACTIVE  return command while subroutine inactive
4   INVALID_CMD        invalid command
5   IB_EMPTY           attempt to submit zero-length IB entry
6   MEM_FAULT          failure to read from pushbuffer or IB

Apart from pusher state, the following values are available on NV5+ to aid troubleshooting:

  • dma_get_jmp_shadow: value of dma_get before the last jump
  • rsvd_shadow: the first word of last-read command
  • data_shadow: the last-read data word


verify those


determine what happens on GF100 on all imaginable error conditions

Channel control area

The channel control area is used to tell the card about submitted pushbuffers. The area is at least 0x1000 bytes long, though it can be longer depending on the card generation. Everything in the area should be accessed as 32-bit integers, like almost all of the MMIO space. The following addresses are usable:

addr  R/W  name           description
0x40  R/W  DMA_PUT        dma_put, only writable when not in IB mode
0x44  R    DMA_GET        dma_get
0x48  R    REF            ref
0x4c  R/W  DMA_PUT_HIGH   dma_put_high_rs/ws, only writable when not in IB mode
0x50  R/W  ???            GF100+ only
0x54  R    DMA_CGET       [2] NV40+ only, connected to subr_return when subroutine active, dma_get when inactive
0x58  R    DMA_MGET       dma_mget
0x5c  R    DMA_MGET_HIGH  dma_mget_high_rs, dma_mget_val_rs
0x60  R    DMA_GET_HIGH   dma_get_high_rs
0x88  R    IB_GET         [3] ib_get
0x8c  R/W  IB_PUT         [3] ib_put

The channel control area is accessed in 32-bit chunks, but on G80+, DMA_GET, DMA_PUT and DMA_MGET are effectively 40-bit quantities. To prevent races, the high parts of them have read and write shadows. When you read the address corresponding to the low part, the whole value is atomically read. The low part is returned as the result of the read, while the high part is copied to the corresponding read shadow where it can be read through a second access to the other address. DMA_PUT also has a write shadow of the high part - when the low part address is written, it’s assembled together with the write shadow and atomically written.

To summarise, when you want to read full DMA_PUT/GET/MGET, first read the low part, then the high part. Due to the shadows, the value thus read will be correct. To write the full value of DMA_PUT, first write the high part, then the low part.
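The ordering rule can be illustrated with a small software model. This is only a sketch of the shadow behaviour described above, not driver code: the class ControlAreaModel and the helpers read_dma_get and write_dma_put are hypothetical names, and the model covers only DMA_GET and DMA_PUT.

```python
class ControlAreaModel:
    """Toy model of the G80+ read/write shadows (hypothetical, for illustration)."""

    def __init__(self):
        self.dma_get = 0          # full 40-bit value held by the pusher
        self.dma_put = 0          # full 40-bit value held by the pusher
        self.dma_get_high_rs = 0  # read shadow for bits 32-39 of dma_get
        self.dma_put_high_rs = 0  # read shadow for bits 32-39 of dma_put
        self.dma_put_high_ws = 0  # write shadow for bits 32-39 of dma_put

    def read32(self, addr):
        if addr == 0x44:  # DMA_GET: latch the high part, return the low part
            self.dma_get_high_rs = (self.dma_get >> 32) & 0xFF
            return self.dma_get & 0xFFFFFFFF
        if addr == 0x60:  # DMA_GET_HIGH: return the latched high part
            return self.dma_get_high_rs
        if addr == 0x40:  # DMA_PUT: same scheme as DMA_GET
            self.dma_put_high_rs = (self.dma_put >> 32) & 0xFF
            return self.dma_put & 0xFFFFFFFF
        if addr == 0x4C:  # DMA_PUT_HIGH
            return self.dma_put_high_rs
        raise ValueError("address not modelled")

    def write32(self, addr, value):
        if addr == 0x4C:    # DMA_PUT_HIGH: only fills the write shadow
            self.dma_put_high_ws = value & 0xFF
        elif addr == 0x40:  # DMA_PUT: combine with the shadow, update atomically
            self.dma_put = (self.dma_put_high_ws << 32) | (value & 0xFFFFFFFF)
        else:
            raise ValueError("address not modelled")


def read_dma_get(area):
    low = area.read32(0x44)   # low part first: this latches the high part
    high = area.read32(0x60)
    return (high << 32) | low


def write_dma_put(area, value):
    area.write32(0x4C, (value >> 32) & 0xFF)  # high part first
    area.write32(0x40, value & 0xFFFFFFFF)    # low part commits the whole value
```

In the model, the high part returned by the second read belongs to the same instant as the low part, even if dma_get has moved in between.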

Note, however, that two different threads reading these values simultaneously can interfere with each other. For this reason, the channel control area shouldn’t ever be accessed by more than one thread at once, even for reading.

On NV4:NV40 cards, the channel control area is in BAR0 at address 0x800000 + 0x10000 * channel ID. On NV40, there are two BAR0 regions with channel control areas: the old-style is in BAR0 at 0x800000 + 0x10000 * channel ID, supports channels 0-0x1f, can do both PIO and DMA submission, but does not have DMA_CGET when used in DMA mode. The new-style area is in BAR0 at 0xc0000 + 0x1000 * channel ID, supports only DMA mode, supports all channels, and has DMA_CGET. On G80 cards, channel 0 supports PIO mode and has channel control area at 0x800000, while channels 1-126 support DMA mode and have channel control areas at 0xc00000 + 0x2000 * channel ID. On GF100, the channel control areas are accessed through selectable addresses in BAR1 and are backed by VRAM or host memory - see GF100+ PFIFO for more details.
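The BAR0 address arithmetic above can be written down as a small helper. This is a sketch that only restates the formulas in the paragraph; the function name control_area_offset and the family labels "NV4", "NV40" and "G80" are made up for the example, and GF100+ is left out because its areas are not at fixed BAR0 addresses.

```python
def control_area_offset(family, channel, new_style=False):
    """Return the BAR0 offset of a channel control area.

    family is a hypothetical label: "NV4" for NV4:NV40, "NV40" for NV40:G80,
    "G80" for G80:GF100. new_style selects the second NV40 region.
    """
    if family == "NV4":
        return 0x800000 + 0x10000 * channel
    if family == "NV40":
        if new_style:
            # DMA only, all channels, has DMA_CGET
            return 0xC0000 + 0x1000 * channel
        if not 0 <= channel <= 0x1F:
            raise ValueError("old-style area only covers channels 0-0x1f")
        return 0x800000 + 0x10000 * channel
    if family == "G80":
        if channel == 0:
            return 0x800000  # PIO channel
        if 1 <= channel <= 126:
            return 0xC00000 + 0x2000 * channel
        raise ValueError("channel out of range")
    raise ValueError("unknown family")
```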


check channel numbers

NV4-style mode

In NV4-style mode, whenever dma_get != dma_put, the card reads a 32-bit word from the pushbuffer at the address specified by dma_get, increments dma_get by 4, and treats the word as the next word in the command stream. dma_get can also move through the control flow commands: jump [sets dma_get to param], call [copies dma_get to subr_return, sets subr_active and sets dma_get to param], and return [unsets subr_active, copies subr_return to dma_get]. The calls and returns are only available on NV1A+ cards.

The pushbuffer is accessed through the dma_pushbuffer DMA object. On NV4, the DMA object has to be located in PCI or AGP memory. On NV5+, any DMA object is valid. At all times, dma_get has to be <= dma_limit. Going past the limit or getting a VM fault when attempting to read from the pushbuffer raises a DMA_PUSHER error of type MEM_FAULT.

On pre-NV1A cards, the word read from pushbuffer is always treated as little-endian. On NV1A:G80 cards, the endianness is determined by the big_endian flag. On G80+, the PFIFO endianness is a global switch.


What about GF100?

Note that pushbuffer addresses over 0xffffffff shouldn’t be used in NV4-style mode, even on G80 - they cannot be expressed in jump commands, dma_limit, nor subr_return. Why dma_put writing supports it is a mystery.

The usual way to use NV4-style mode is:

  1. Allocate a big circular buffer
  2. [NV1A+] if you intend to use subroutines, allocate space for them and write them out
  3. Point dma_pushbuffer to the buffer, set dma_get and dma_put to its start
  4. To submit commands:
    1. If there’s not enough space in the pushbuffer between dma_put and end to fit the command + a jump command, submit a jump-to-beginning command first and set DMA_PUT to buffer start.
    2. Read DMA_GET/DMA_CGET until you get a value that’s out of the range you’re going to write. If on pre-NV40 and using subroutines, discard DMA_GET reads that are outside of the main buffer.
    3. Write out the commands at current DMA_PUT address.
    4. Set DMA_PUT to point right after the last word of commands you wrote.

IB mode

NV4-style mode, while fairly flexible, can only jump between parts of pushbuffer between commands. IB mode decouples flow control from the command structure by using a second “master” buffer, called the IB buffer.

The IB buffer is a circular buffer of 8-byte structures called IB entries. The IB buffer is, like the pushbuffer, accessed through dma_pushbuffer DMA object. The address of the IB buffer, along with its size, is normally specified on channel creation. The size has to be a power of two and can be in range ???.


check the ib size range

There are two indices into the IB buffer: ib_get and ib_put. They’re both in the range 0..2^ib_order-1. Whenever no pushbuffer is being processed [dma_put == dma_get], and there are unread entries in the IB buffer [ib_put != ib_get], the card will read IB buffer entry #ib_get and increment ib_get by 1. When ib_get would reach 2^ib_order, it instead wraps around to 0.
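Since the IB size is a power of two, the wrap-around can be expressed with a mask. The two helpers below are a minimal sketch; the names ib_next and ib_pending are hypothetical and do not correspond to any hardware register.

```python
def ib_next(index, ib_order):
    """Advance an IB index by one entry, wrapping at 2**ib_order."""
    return (index + 1) & ((1 << ib_order) - 1)


def ib_pending(ib_get, ib_put, ib_order):
    """Number of entries written by the user but not yet read by the card."""
    return (ib_put - ib_get) & ((1 << ib_order) - 1)
```

For example, with ib_order = 3, an ib_get of 6 and an ib_put of 1 leave entries 6, 7 and 0 pending.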

Failure to read IB entry due to VM fault will, like pushbuffer read fault, cause DMA_PUSHER error of type MEM_FAULT.

The IB entry is made of two 32-bit words in PFIFO endianness. Their format is:

Word 0:

  • bits 0-1: unused, should be 0
  • bits 2-31: ADDRESS_LOW, bits 2-31 of pushbuffer start address

Word 1:

  • bits 0-7: ADDRESS_HIGH, bits 32-39 of pushbuffer start address
  • bit 8: ???
  • bit 9: NOT_MAIN, “not main pushbuffer” flag
  • bits 10-30: SIZE, pushbuffer size in 32-bit words
  • bit 31: NO_PREFETCH (probably; use for pushbuffer data generated by the GPU)
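The field layout above translates into the following packing and unpacking helpers. This is a sketch based only on the bit positions listed here; the function names are hypothetical, bit 8 is left at 0 because its meaning is unknown, and SIZE is treated as the 21-bit field in bits 10-30.

```python
def pack_ib_entry(address, size_words, not_main=False, no_prefetch=False):
    """Build the two 32-bit words of an IB entry from its fields."""
    if address & 3 or not 0 <= address < (1 << 40):
        raise ValueError("address must be 4-byte aligned and fit in 40 bits")
    if not 0 < size_words < (1 << 21):
        raise ValueError("size must be non-zero and fit in bits 10-30")
    word0 = address & 0xFFFFFFFC            # ADDRESS_LOW, bits 0-1 stay 0
    word1 = (address >> 32) & 0xFF          # ADDRESS_HIGH
    word1 |= int(not_main) << 9             # NOT_MAIN
    word1 |= size_words << 10               # SIZE in 32-bit words
    word1 |= int(no_prefetch) << 31         # NO_PREFETCH
    return word0, word1


def unpack_ib_entry(word0, word1):
    """Split an IB entry back into its fields."""
    return {
        "address": ((word1 & 0xFF) << 32) | (word0 & 0xFFFFFFFC),
        "not_main": bool((word1 >> 9) & 1),
        "size_words": (word1 >> 10) & 0x1FFFFF,
        "no_prefetch": bool((word1 >> 31) & 1),
    }
```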


figure out bit 8 some day

When an IB entry is read, the pushbuffer is prepared for reading:

dma_get[2:39] = ADDRESS
dma_put = dma_get + SIZE * 4
nonmain = NOT_MAIN
if (!nonmain) dma_mget = dma_get

Subsequently, just like in NV4-style mode, words from dma_get are read until it reaches dma_put. When that happens, processing can move on to the next IB entry [or pause until user sends more commands]. If the nonmain flag is not set, dma_get is copied to dma_mget whenever it’s advanced, and dma_mget_val flag is set to 1. dma_limit is ignored in IB mode.

An attempt to submit IB entry with length zero will raise DMA_PUSHER error of type IB_EMPTY.

The nonmain flag is meant to help with a common case where pushbuffers sent through IB can come from two sources: a “main” big circular buffer filled with immediately generated commands, and “external” buffers containing helper data filled and managed through other means. DMA_MGET will then contain the address of the current position in the “main” buffer without being affected by IB entries pulling data from other pushbuffers. It’s thus similar to DMA_CGET’s role in NV4-style mode.

The commands - pre-GF100 format

The command stream, as assembled by NV4-style or IB mode pushbuffer read, is then split into individual commands. The command type is determined by its first word. The word has to match one of the following forms:

000CCCCCCCCCCC00SSSMMMMMMMMMMM00 increasing methods [NV4+]
0000000000000001MMMMMMMMMMMMXX00 SLI conditional [NV40+, if enabled]
00000000000000100000000000000000 return [NV1A+, NV4-style only]
0000000000000011SSSMMMMMMMMMMM00 long non-increasing methods [IB only]
001JJJJJJJJJJJJJJJJJJJJJJJJJJJ00 old jump [NV4+, NV4-style only]
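010CCCCCCCCCCC00SSSMMMMMMMMMMM00 non-increasing methods [NV10+]
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ01 jump [NV1A+, NV4-style only]
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ10 call [NV1A+, NV4-style only]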


do an exhaustive scan of commands

If none of the forms matches, or if the one that matches cannot be used in current mode, the INVALID_CMD DMA_PUSHER error is raised.

The commands

There are two command formats the DMA pusher can use: NV4 format and GF100 format. All cards support the NV4 format, while only GF100+ cards support the GF100 format.

NV4 method submission commands

000CCCCCCCCCCC00SSSMMMMMMMMMMM00 increasing methods [NV4+]
010CCCCCCCCCCC00SSSMMMMMMMMMMM00 non-increasing methods [NV10+]
0000000000000011SSSMMMMMMMMMMM00 long non-increasing methods [IB only]

These three commands are used to submit methods. The MM..M field selects the first method that will be submitted. The SSS field selects the subchannel. The CC..C field is mthd_count and says how many words will be submitted. With the “long non-increasing methods” command, the method count is instead contained in the low 24 bits of the next word in the pushbuffer.

The subsequent mthd_count words after the first word [or second word in case of the long command] are the method parameters to be submitted. If command type is increasing methods, the method number increases by 4 [ie. by 1 method] for each submitted word. If type is non-increasing, all words are submitted to the same method.
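The three header forms can be built and taken apart with a few shifts. The helpers below are a sketch derived from the bit patterns above; their names are hypothetical, and mthd is taken as a byte offset (a multiple of 4) so that it can be placed directly into bits 2-12.

```python
def methods_header(kind, subc, mthd, count):
    """Build the first word of an increasing or non-increasing methods command."""
    if mthd & 3 or not 0 <= mthd < 0x2000:
        raise ValueError("mthd must be a multiple of 4 below 0x2000")
    if not 0 <= subc < 8 or not 0 <= count < 0x800:
        raise ValueError("subc or count out of range")
    word = (count << 18) | (subc << 13) | mthd
    if kind == "increasing":
        return word
    if kind == "non_increasing":
        return 0x40000000 | word
    raise ValueError("unknown kind")


def long_non_increasing_header(subc, mthd, count):
    """Build the two leading words of a long non-increasing command [IB only]."""
    if mthd & 3 or not 0 <= mthd < 0x2000 or not 0 <= subc < 8:
        raise ValueError("subc or mthd out of range")
    return [0x00030000 | (subc << 13) | mthd, count & 0xFFFFFF]


def decode_methods_header(word):
    """Return (kind, subc, mthd, count) for a short methods command header."""
    if (word & 0xE0030003) == 0x00000000:
        kind = "increasing"
    elif (word & 0xE0030003) == 0x40000000:
        kind = "non_increasing"
    else:
        raise ValueError("not a short methods command")
    return kind, (word >> 13) & 7, ((word >> 2) & 0x7FF) << 2, (word >> 18) & 0x7FF
```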

If sli_enable is set and sli_active is not set, the methods thus assembled will be discarded. Otherwise, they’ll be appended to the CACHE.


didn’t mthd 0 work even if sli_active=0?

The pusher watches the submitted methods: it only passes methods 0x100+ and methods in 0..0xfc range that the puller recognises. An attempt to submit invalid method in 0..0xfc range will cause a DMA_PUSHER error of type INVALID_MTHD.


check pusher reaction on ACQUIRE submission: pause?

NV4 control flow commands

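001JJJJJJJJJJJJJJJJJJJJJJJJJJJ00 old jump [NV4+]
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ01 jump [NV1A+]
JJJJJJJJJJJJJJJJJJJJJJJJJJJJJJ10 call [NV1A+]
00000000000000100000000000000000 return [NV1A+]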

For jumps and calls, J..JJ is bits 2-28 or 2-31 of the target address. The remaining bits of target are forced to 0.

The jump commands simply set dma_get to the target - the next command will be read from there. There are two commands, since NV4 originally supported only 29-bit addresses, and used high bits as command type. NV1A introduced the new jump command that instead uses low bits as type, and allows access to full 32 bits of address range.

The call command copies dma_get to subr_return, sets subr_active to 1, and sets dma_get to the target. If subr_active is already set before the call, the DMA_PUSHER error of type CALL_SUBR_ACTIVE is raised.

The return command copies subr_return to dma_get and clears subr_active. If subr_active isn’t set, it instead raises DMA_PUSHER error of type RET_SUBR_INACTIVE.
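The jump, call and return rules can be exercised with a small state model. This is a sketch of the behaviour described above for NV1A+ cards in NV4-style mode, not an implementation of the hardware; the class names FlowState and PusherError are hypothetical, and execute() assumes that dma_get has already been advanced past the command word.

```python
class PusherError(Exception):
    """Stands in for a DMA_PUSHER error; the message carries the error name."""


class FlowState:
    def __init__(self, dma_get=0):
        self.dma_get = dma_get
        self.subr_active = False
        self.subr_return = 0

    def execute(self, word):
        """Apply one control flow command word."""
        if (word & 0xE0000003) == 0x20000000:      # old jump, 29-bit target
            self.dma_get = word & 0x1FFFFFFC
        elif (word & 3) == 1:                      # jump, 32-bit target
            self.dma_get = word & 0xFFFFFFFC
        elif (word & 3) == 2:                      # call
            if self.subr_active:
                raise PusherError("CALL_SUBR_ACTIVE")
            self.subr_return = self.dma_get
            self.subr_active = True
            self.dma_get = word & 0xFFFFFFFC
        elif word == 0x00020000:                   # return
            if not self.subr_active:
                raise PusherError("RET_SUBR_INACTIVE")
            self.dma_get = self.subr_return
            self.subr_active = False
        else:
            raise ValueError("not a control flow command")
```

Because there is a single subr_return slot, subroutines cannot nest: a second call before a return raises CALL_SUBR_ACTIVE.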

NV4 SLI conditional command

0000000000000001MMMMMMMMMMMMXX00 SLI conditional [NV40+]

NV40 introduced SLI functionality. One of the associated features is the SLI conditional command. In SLI mode, sister channels are commonly created on all cards in SLI set using a common pushbuffer. Since most of the commands set in SLI will be identical for all cards, this saves resources. However, some of the commands have to be sent only to a single card, or to a subgroup of cards. The SLI conditional can be used for that purpose.

The sli_active flag determines if methods should be accepted at the moment: when it’s set, methods will be accepted. Otherwise, they’ll be ignored. The SLI conditional command takes the encoded mask, MM..M, ANDs it with the per-card value of sli_mask, and sets the sli_active flag to 1 if the result is non-zero, to 0 otherwise.

The sli_enable flag determines if the command is available. If it’s not set, the command effectively doesn’t exist. Note that sli_enable and sli_mask exist on both NV40:G80 and G80+, but on NV40:G80 they have to be set uniformly for all channels on the card, while G80+ allows independent settings for each channel.

The XX bits in the command are ignored.
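The mask test can be sketched in a few lines. The helper names below are hypothetical; the encoding follows the bit pattern shown for the command, with the 12-bit mask in bits 4-15 and the XX bits left at 0.

```python
def sli_cond_command(mask):
    """Encode an SLI conditional command carrying a 12-bit card mask."""
    if not 0 <= mask < (1 << 12):
        raise ValueError("mask must fit in 12 bits")
    return 0x00010000 | (mask << 4)


def sli_active_after(word, sli_mask):
    """Value of sli_active on a card with the given sli_mask after the command."""
    if (word & 0xFFFF0003) != 0x00010000:
        raise ValueError("not an SLI conditional command")
    return bool(sli_mask & ((word >> 4) & 0xFFF))
```

With a command mask of 0b0101, a card whose sli_mask is 0b0100 keeps accepting methods, while a card whose sli_mask is 0b0010 discards them until the next SLI conditional.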

GF100 commands

GF100 format follows the same idea, but uses all-new command encoding.

000CCCCCCCCCCC00SSSMMMMMMMMMMMXX increasing methods [old]
000XXXXXXXXXXX11XXXXXXXXXXXXXXXX SLI conditional from user mask [new]
010CCCCCCCCCCC00SSSMMMMMMMMMMMXX non-increasing methods [old]
011CCCCCCCCCCCCCSSSXMMMMMMMMMMMM non-increasing methods [new]
101CCCCCCCCCCCCCSSSXMMMMMMMMMMMM increase-once methods [new]


check bitfield boundaries


check the extra SLI bits


look for other forms

Increasing and non-increasing methods work like on older cards. Increase-once methods is a new command that works like the other methods commands, but sends the first data word to method M, second and all subsequent data words to method M+4 [ie. the next method].
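The difference between the three method command types comes down to which method each data word is sent to. The function below is a small sketch of that rule; the name method_targets and the string labels are hypothetical, and mthd is a byte offset as in the text.

```python
def method_targets(kind, mthd, count):
    """List the method address that each of the count data words is sent to."""
    if kind == "increasing":
        return [mthd + 4 * i for i in range(count)]
    if kind == "non_increasing":
        return [mthd] * count
    if kind == "increase_once":
        # first word goes to mthd, every later word to the next method
        return ([mthd] + [mthd + 4] * (count - 1))[:count]
    raise ValueError("unknown kind")
```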

Inline method command is a single-word command that submits a single method with a short [12-bit] parameter encoded in VV..V field.

GF100 also did away with the INVALID_MTHD error - invalid low methods are pushed into CACHE as usual, puller will complain about them instead when it tries to execute them.

The pusher pseudocode - pre-GF100

while(1) {
        if (dma_get != dma_put) {
                /* pushbuffer non-empty, read a word. */
                b32 word;
                try {
                        if (!ib_enable && dma_get >= dma_limit)
                                throw DMA_PUSHER(MEM_FAULT);
                        if (gpu < NV1A)
                                word = READ_DMAOBJ_32(dma_pushbuffer, dma_get, LE);
                        else if (gpu < G80)
                                word = READ_DMAOBJ_32(dma_pushbuffer, dma_get, big_endian?BE:LE);
                        else
                                word = READ_DMAOBJ_32(dma_pushbuffer, dma_get, pfifo_endian);
                        dma_get += 4;
                        if (!nonmain)
                                dma_mget = dma_get;
                } catch (VM_FAULT) {
                        throw DMA_PUSHER(MEM_FAULT);
                }
                /* now, see if we're in the middle of a command */
                if (dma_state.lenp) {
                        /* second word of long non-inc methods command - method count */
                        dma_state.lenp = 0;
                        dma_state.mcnt = word & 0xffffff;
                } else if (dma_state.mcnt) {
                        /* data word of methods command */
                        data_shadow = word;
                        if (!PULLER_KNOWS_MTHD(dma_state.mthd))
                                throw DMA_PUSHER(INVALID_MTHD);
                        if (!sli_enable || sli_active) {
                                CACHE_PUSH(dma_state.subc, dma_state.mthd, word, dma_state.ni);
                        }
                        if (!dma_state.ni)
                                dma_state.mthd++;
                        dma_state.mcnt--;
                        dcount_shadow++;
                } else {
                        /* no command active - this is the first word of a new one */
                        rsvd_shadow = word;
                        /* match all forms */
                        if ((word & 0xe0000003) == 0x20000000 && !ib_enable) {
                                /* old jump */
                                dma_get_jmp_shadow = dma_get;
                                dma_get = word & 0x1fffffff;
                        } else if ((word & 3) == 1 && !ib_enable && gpu >= NV1A) {
                                /* jump */
                                dma_get_jmp_shadow = dma_get;
                                dma_get = word & 0xfffffffc;
                        } else if ((word & 3) == 2 && !ib_enable && gpu >= NV1A) {
                                /* call */
                                if (subr_active)
                                        throw DMA_PUSHER(CALL_SUBR_ACTIVE);
                                subr_return = dma_get;
                                subr_active = 1;
                                dma_get = word & 0xfffffffc;
                        } else if (word == 0x00020000 && !ib_enable && gpu >= NV1A) {
                                /* return */
                                if (!subr_active)
                                        throw DMA_PUSHER(RET_SUBR_INACTIVE);
                                dma_get = subr_return;
                                subr_active = 0;
                        } else if ((word & 0xe0030003) == 0) {
                                /* increasing methods */
                                dma_state.mthd = (word >> 2) & 0x7ff;
                                dma_state.subc = (word >> 13) & 7;
                                dma_state.mcnt = (word >> 18) & 0x7ff;
                                dma_state.ni = 0;
                                dcount_shadow = 0;
                        } else if ((word & 0xe0030003) == 0x40000000 && gpu >= NV10) {
                                /* non-increasing methods */
                                dma_state.mthd = (word >> 2) & 0x7ff;
                                dma_state.subc = (word >> 13) & 7;
                                dma_state.mcnt = (word >> 18) & 0x7ff;
                                dma_state.ni = 1;
                                dcount_shadow = 0;
                        } else if ((word & 0xffff0003) == 0x00030000 && ib_enable) {
                                /* long non-increasing methods */
                                dma_state.mthd = (word >> 2) & 0x7ff;
                                dma_state.subc = (word >> 13) & 7;
                                dma_state.lenp = 1;
                                dma_state.ni = 1;
                                dcount_shadow = 0;
                        } else if ((word & 0xffff0003) == 0x00010000 && sli_enable) {
                                if (sli_mask & ((word >> 4) & 0xfff))
                                        sli_active = 1;
                                else
                                        sli_active = 0;
                        } else {
                                throw DMA_PUSHER(INVALID_CMD);
                        }
                }
        } else if (ib_enable && ib_get != ib_put) {
                /* current pushbuffer empty, but we have more IB entries to read */
                b64 entry;
                try {
                        entry_low = READ_DMAOBJ_32(dma_pushbuffer, ib_address + ib_get * 8, pfifo_endian);
                        entry_high = READ_DMAOBJ_32(dma_pushbuffer, ib_address + ib_get * 8 + 4, pfifo_endian);
                        entry = entry_high << 32 | entry_low;
                        ib_get++;
                        if (ib_get == (1 << ib_order))
                                ib_get = 0;
                } catch (VM_FAULT) {
                        throw DMA_PUSHER(MEM_FAULT);
                }
                len = entry >> 42 & 0x3fffff;
                if (!len)
                        throw DMA_PUSHER(IB_EMPTY);
                dma_get = entry & 0xfffffffffc;
                dma_put = dma_get + len * 4;
                if (entry & 1 << 41)
                        nonmain = 1;
                else
                        nonmain = 0;
        }
        /* otherwise, pushbuffer empty and IB empty or nonexistent - nothing to do. */
}