PCOUNTER: performance counter engine

Todo

crossrefs

Introduction

PCOUNTER is the card unit that contains the performance monitoring counters. It is present on NV10+ GPUs, with the exception of NV11, NV1A, NV17 and NV18, for reasons that are not known.

Todo

why? any others excluded? NV25, NV2A, NV30, NV36 pending a check

PCOUNTER is actually made of several identical hardware counter units, one for each so-called domain. Each PCOUNTER domain can potentially run on a different source clock, allowing one to monitor events in various clock domains. The PCOUNTER domains are mostly independent, but there’s some limited communication and shared circuitry among them.

There are two major revisions of PCOUNTER hardware, and some minor subrevisions:

  • NV10:GF100 major revision:
    • NV10:NV15 - first version, one domain, only single-event mode available
    • NV15:NV20 - added one period / all periods event counter mode switch
    • NV20:NV30 - added second domain for events associated with memory clock
    • NV30:NV40 - removed separate clrflag/setflag input selection, changed from 40-bit to 32-bit counters, added quad event mode, added logic op chaining through SETFLAG.
    • NV40:G84 - rearranged register space to make space for 8 domains, added 3 new special counter modes
    • G84:G92 - added record mode, swap input selection, and PERIODIC signals
    • G92:GT215 - added slightly more flexible logic op delayed source selection and a register to set high 8 bits of address for record mode
    • GT215:GF100 - added USER signals
  • GF100+ major revision:
    • GF100+ - split PCOUNTER into hub, per-gpc and per-partition domain sets, ???

Todo

figure out what else happened on GF100

Note

the information in this document is at the moment not fully verified for GF100+.

Todo

make it so

The inputs to PCOUNTER are various activity monitoring signals from all over the card. The PCOUNTER hardware selects a few of them, performs programmable logic operations on them, and aggregates the results into a handful of actual counter inputs. Some of the inputs are special and control counting start/stop, while others are the events to be counted. PCOUNTER can be used in three modes:

  • single event mode - a single event is being counted, with fine-grained control of counting periods via pre-start/start/stop signals. Several counting periods per run may be configured, and a threshold counter may be used. The input signals used are:
    • PRE - a programmable amount of pulses on this input must happen before START is recognised
    • START - a pulse on this input starts a counting period
    • EVENT - the pulses on this input are counted
    • STOP - a pulse on this input stops a counting period
  • quad event mode [NV30-] - 4 events are being counted, with a simple “swap counter sets” trigger to delimit counting periods. The inputs used are:
    • PRE, START, EVENT, STOP - the pulses on these inputs are counted [in 4 separate counters]
    • SWAP - a pulse on this input swaps counter sets, ie. copies the internal counters to the MMIO registers and resets internal counters to 0.
  • record mode [G84-] - 12 simple events are being counted, and the counters written to a “record buffer” in memory on every pulse of STOP input. The inputs used are:
    • PRE_SRC[0..3], START_SRC[0..3], EVENT_SRC[0..3] - 12 events to be counted
    • STOP - a pulse on this input writes current counter values to memory and clears the counters to 0

PCOUNTER uses MMIO area 0x00a000:0x00b000 on NV10:GF100 [with different register layouts on NV10:NV40 and NV40:GF100]. On GF100+, it uses 0x180000:0x1c0000.

NV10:GF100 PCOUNTER is unaffected by all PMC.ENABLE bits and has no interrupt lines. GF100+ PCOUNTER is enabled by PMC.ENABLE bit 28.

Todo

figure out interrupt business

MMIO registers

The MMIO registers are similar among PCOUNTER revisions, but their placement is very different.

NV10

8-bit space nv10-pcounter [0x1000]
nv3-mmio 0xa000: PCOUNTER [NV10:NV40]

Todo

wtf is CYCLES_ALT for?

Address                                              Variants   Name                     Description
0x400+dom*0x100 (dom<2)                              all        PRE_SRC[dom]             PRE input selection
0x404+dom*0x100 (dom<2)                              all        PRE_OP[dom]              PRE logic operation
0x408+dom*0x100 (dom<2)                              all        START_SRC[dom]           START input selection
0x40c+dom*0x100 (dom<2)                              all        START_OP[dom]            START logic operation
0x410+dom*0x100 (dom<2)                              all        EVENT_SRC[dom]           EVENT input selection
0x414+dom*0x100 (dom<2)                              all        EVENT_OP[dom]            EVENT logic operation
0x418+dom*0x100 (dom<2)                              all        STOP_SRC[dom]            STOP input selection
0x41c+dom*0x100 (dom<2)                              all        STOP_OP[dom]             STOP logic operation
0x420+dom*0x100 (dom<2)                              NV10:NV30  SETFLAG_SRC[dom]         SETFLAG input selection
0x424+dom*0x100 (dom<2)                              all        SETFLAG_OP[dom]          SETFLAG logic operation
0x428+dom*0x100 (dom<2)                              NV10:NV30  CLRFLAG_SRC[dom]         CLRFLAG input selection
0x42c+dom*0x100 (dom<2)                              all        CLRFLAG_OP[dom]          CLRFLAG logic operation
0x430+dom*0x100+hi*0x200+lo*0x4 (dom<2, hi<2, lo<4)  all        SIG_STATUS[dom][hi][lo]  Signal status
0x600+dom*0x100 (dom<2)                              all        CTR_CYCLES[dom]          Elapsed cycles counter
0x604+dom*0x100 (dom<2)                              NV10:NV30  CTR_CYCLES_HI[dom]       Elapsed cycles counter - high part
0x608+dom*0x100 (dom<2)                              all        CTR_CYCLES_ALT[dom]      Elapsed cycles counter copy
0x60c+dom*0x100 (dom<2)                              NV10:NV30  CTR_CYCLES_ALT_HI[dom]   Elapsed cycles counter copy - high part
0x610+dom*0x100 (dom<2)                              all        CTR_EVENT[dom]           EVENT counter
0x614+dom*0x100 (dom<2)                              NV10:NV30  CTR_EVENT_HI[dom]        EVENT counter - high part
0x618+dom*0x100 (dom<2)                              all        CTR_START[dom]           START counter
0x61c+dom*0x100 (dom<2)                              NV10:NV30  CTR_START_HI[dom]        START counter - high part
0x620+dom*0x100 (dom<2)                              all        CTR_PRE[dom]             PRE counter
0x624+dom*0x100 (dom<2)                              all        CTR_STOP[dom]            STOP counter
0x628+dom*0x100 (dom<2)                              all        THRESHOLD[dom]           EVENT counter threshold
0x62c+dom*0x100 (dom<2)                              NV10:NV30  THRESHOLD_HI[dom]        EVENT counter threshold - high part
0x738                                                NV30:NV40  QUAD_ACK_TRIGGER         Acks counter data in quad event mode
0x73c                                                all        CTRL                     PCOUNTER control

NV40

8-bit space nv40-pcounter [0x1000]
nv3-mmio 0xa000: PCOUNTER [NV40:G80]
g80-mmio 0xa000: PCOUNTER

Todo

C51 has no PCOUNTER, but has a7f4/a7f8 registers

Todo

MCP73 also has a7f4/a7f8 but also has normal PCOUNTER

Address                            Variants  Name                      Description
0x400+dom*0x4 (dom<8)              all       PRE_SRC[dom]              PRE input selection
0x420+dom*0x4 (dom<8)              all       PRE_OP[dom]               PRE logic operation
0x440+dom*0x4 (dom<8)              all       START_SRC[dom]            START input selection
0x460+dom*0x4 (dom<8)              all       START_OP[dom]             START logic operation
0x480+dom*0x4 (dom<8)              all       EVENT_SRC[dom]            EVENT input selection
0x4a0+dom*0x4 (dom<8)              all       EVENT_OP[dom]             EVENT logic operation
0x4c0+dom*0x4 (dom<8)              all       STOP_SRC[dom]             STOP input selection
0x4e0+dom*0x4 (dom<8)              all       STOP_OP[dom]              STOP logic operation
0x500+dom*0x4 (dom<8)              all       SETFLAG_OP[dom]           SETFLAG logic operation
0x520+dom*0x4 (dom<8)              all       CLRFLAG_OP[dom]           CLRFLAG logic operation
0x540+dom*0x4 (dom<8)              all       SRC_STATUS[dom]           Selected inputs status
0x560+dom*0x4 (dom<8)              all       SPEC_SRC[dom]             SWAP and UNK8 input selection
0x580+dom*0x4 (dom<8)              GT215-    USER_TRIGGER[dom]         Triggers user-controllable signals
0x600+dom*0x4 (dom<8)              all       CTR_CYCLES[dom]           Elapsed cycles counter
0x640+dom*0x4 (dom<8)              all       CTR_CYCLES_ALT[dom]       Elapsed cycles counter copy
0x680+dom*0x4 (dom<8)              all       CTR_EVENT[dom]            EVENT counter
0x6a0+dom*0x4 (dom<8)              G92-      RECORD_ADDRESS_HIGH[dom]  High 8 bits of record buffer address
0x6c0+dom*0x4 (dom<8)              all       CTR_START[dom]            START counter
0x6e0+dom*0x4 (dom<8)              G84-      RECORD_STATUS[dom]        Current status and position of record buffer
0x700+dom*0x4 (dom<8)              all       CTR_PRE[dom]              PRE counter
0x720+dom*0x4 (dom<8)              G84-      RECORD_LIMIT[dom]         The highest valid address in the record buffer
0x740+dom*0x4 (dom<8)              all       CTR_STOP[dom]             STOP counter
0x760+dom*0x4 (dom<8)              G84-      RECORD_START[dom]         The starting address of the record buffer
0x780+dom*0x4 (dom<8)              all       THRESHOLD[dom]            EVENT counter threshold
0x7a0                              G84-      RECORD_CHAN               VM channel for record mode
0x7a4                              G84-      RECORD_DMA                DMA object for record mode
0x7a8                              G84-      GCTRL                     PCOUNTER global control
0x7c0+dom*0x4 (dom<8)              all       CTRL[dom]                 PCOUNTER domain control
0x7e0+dom*0x4 (dom<8)              all       QUAD_ACK_TRIGGER[dom]     Acks counter data in quad event mode
0x800+dom*0x20+i*0x4 (dom<8, i<8)  all       SIG_STATUS[dom][i]        Signal status
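
The addresses in the table above are offsets inside the nv40-pcounter space, which is mapped at 0xa000 in MMIO. A minimal sketch of turning a register name and domain index into an absolute MMIO address follows; the helper name and the dictionary are made up for illustration, and only a few of the per-domain registers are listed:

```python
NV40_PCOUNTER_BASE = 0x00a000  # start of the PCOUNTER MMIO area on NV40:GF100

# Offsets of a few per-domain register arrays, copied from the table above.
NV40_PER_DOMAIN_OFFSETS = {
    "PRE_SRC": 0x400,
    "EVENT_OP": 0x4a0,
    "CTR_EVENT": 0x680,
    "CTRL": 0x7c0,
}


def nv40_reg_address(name, dom):
    """Absolute MMIO address of a per-domain nv40-pcounter register."""
    if not 0 <= dom < 8:
        raise ValueError("nv40-pcounter has at most 8 domains")
    return NV40_PCOUNTER_BASE + NV40_PER_DOMAIN_OFFSETS[name] + dom * 4


print(hex(nv40_reg_address("CTRL", 3)))  # 0xa7cc
```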

GF100

8-bit space gf100-pcounter [0x40000]
gf100-mmio 0x180000: PCOUNTER

Todo

write me

8-bit space gf100-pcounter-domain [0x200]

Todo

complete me

Address           Name              Description
0x0+i*0x4 (i<16)  SIG_STATUS[i]     Signal status
0x40              PRE_SRC           PRE input selection
0x44              PRE_OP            PRE logic operation
0x48              START_SRC         START input selection
0x4c              START_OP          START logic operation
0x50              EVENT_SRC         EVENT input selection
0x54              EVENT_OP          EVENT logic operation
0x58              STOP_SRC          STOP input selection
0x5c              STOP_OP           STOP logic operation
0x60              SETFLAG_OP        SETFLAG logic operation
0x64              CLRFLAG_OP        CLRFLAG logic operation
0x68              SRC_STATUS        Selected inputs status
0x6c              SWAP_SRC          SWAP input selection
0xa0              QUAD_ACK_TRIGGER  Acks counter data in quad event mode
0xec              USER_TRIGGER      Triggers user-controllable signals

The PCOUNTER signals

The raw inputs that PCOUNTER operates on are called “signals”. A signal is a single 0/1 wire sampled on every clock. The signals come from many different areas of the card and represent various state information. Example signals may be:

  • is unit X busy? - counting 1s on this signal together with elapsed clock cycles will give activity percentage for given unit
  • did microcontroller X execute an instruction this cycle? - counting 1s will give the number of executed instructions
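
As a minimal sketch of the first example, assume the busy signal has been routed to the EVENT input, so that CTR_EVENT holds the number of busy cycles and CTR_CYCLES the number of elapsed cycles in the same counting period. The function name is made up for illustration:

```python
def busy_percentage(ctr_event, ctr_cycles):
    """Activity percentage from a busy-cycle count and an elapsed-cycle count."""
    if ctr_cycles == 0:
        # No cycles elapsed: nothing was measured.
        return 0.0
    return 100.0 * ctr_event / ctr_cycles


print(busy_percentage(250, 1000))  # 25.0
```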

The signals are grouped into so-called domains. A domain has a single base clock and its own counting circuitry - the counting process and counter registers are per-domain. Domains are further grouped into domain sets. Domains within a domain set can communicate to a limited extent. NV10:GF100 GPUs have a single domain set, while on GF100+ there’s one domain set for each GPC, one for each partition, and one for all domains not associated with a GPC/partition.

On NV10:NV20, there’s only one domain. On NV20:NV40 there are 2 domains. On NV40+ there can be up to 8 domains per domain set. On all GPUs, there can be up to 256 signals per domain. The available signals and domains depend heavily on the GPU. The signals are packed tightly, so even a signal common to two GPUs may be at different position between them. The lists of known domains and signals may be found in NV10:NV40 signals, NV40:G80 signals, G80:GF100 signals, Fermi+ signals.

The STATUS registers

The STATUS registers may be used to peek at the current value of each signal.

reg32 pcounter-sig-status
nv10-pcounter 0x430+dom*0x100+hi*0x200+lo*0x4: SIG_STATUS[dom][hi][lo] (dom<2, hi<2, lo<4)
nv40-pcounter 0x800+dom*0x20+i*0x4: SIG_STATUS[dom][i] (dom<8, i<8)
gf100-pcounter-domain 0x0+i*0x4: SIG_STATUS[i] (i<16)

Reading register #i gives current value of signals i*32..i*32+31 as bits 0..31 of the read value. These registers are per-domain and read-only. Only indices corresponding to actually present domains and signals are valid. On NV10:NV40, this array is split into two parts - the full index is computed like this:

i == hi * 4 + lo
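
The mapping from a signal number to a SIG_STATUS register and bit can be sketched as follows. The function names are made up for illustration, the returned offsets are relative to the start of the nv10-pcounter / nv40-pcounter space, and the sketch does not check whether the given signal or domain actually exists on a particular GPU:

```python
def sig_status_index(signal):
    """Return (register index i, bit number) for a signal number 0..255."""
    if not 0 <= signal < 256:
        raise ValueError("signal number out of range")
    return signal // 32, signal % 32


def nv10_sig_status(dom, signal):
    """Offset and bit of a signal in the NV10:NV40 SIG_STATUS array."""
    i, bit = sig_status_index(signal)
    hi, lo = divmod(i, 4)  # i == hi * 4 + lo
    return 0x430 + dom * 0x100 + hi * 0x200 + lo * 0x4, bit


def nv40_sig_status(dom, signal):
    """Offset and bit of a signal in the NV40:GF100 SIG_STATUS array."""
    i, bit = sig_status_index(signal)
    return 0x800 + dom * 0x20 + i * 0x4, bit


offset, bit = nv40_sig_status(2, 0xa5)
print(hex(offset), bit)  # 0x854 5
```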

Trailer signals

A special kind of signal is the so-called “trailer signal”. These signals are common to all domains in a domain set. The position of these signals is not exactly constant between the domains, but their position modulo 0x20 is [ie. they’re at the same position inside a STATUS reg for all domains, but not necessarily in the same STATUS reg]. Therefore, the position of each trailer signal here is given as an offset from “trailer base”.

The trailer signals for NV10:NV20 are:

  • base+0x1f: PCOUNTER.FLAG - the flag

For NV20:NV40:

  • base+0x1d: PGRAPH.PM_TRIGGER - the PM_TRIGGER pulse from PGRAPH
  • base+0x1e: PCOUNTER.DOM[1].FLAG - the flag from domain 1
  • base+0x1f: PCOUNTER.DOM[0].FLAG - the flag from domain 0

For NV40:GF100:

  • base+0x0c: ZERO - always 0 [G84:GF100]
  • base+0x0d: PCOUNTER.PERIODIC - the PERIODIC signal from current domain [G84:GF100]
  • base+0x0e: PGRAPH.WRCACHE_FLUSH - the WRCACHE_FLUSH pulse from PGRAPH [G84:GF100]
  • base+0x0e: ZERO - always 0 [NV40:G84]
  • base+0x0f: PGRAPH.PM_TRIGGER - the PM_TRIGGER pulse from PGRAPH
  • base+0x10: PCOUNTER.DOM[7].EVENT - the EVENT input from domain 7
  • base+0x11: PCOUNTER.DOM[6].EVENT - the EVENT input from domain 6
  • base+0x12: PCOUNTER.DOM[5].EVENT - the EVENT input from domain 5
  • base+0x13: PCOUNTER.DOM[4].EVENT - the EVENT input from domain 4
  • base+0x14: PCOUNTER.DOM[3].EVENT - the EVENT input from domain 3
  • base+0x15: PCOUNTER.DOM[2].EVENT - the EVENT input from domain 2
  • base+0x16: PCOUNTER.DOM[1].EVENT - the EVENT input from domain 1
  • base+0x17: PCOUNTER.DOM[0].EVENT - the EVENT input from domain 0
  • base+0x18: PCOUNTER.DOM[7].FLAG - the FLAG from domain 7
  • base+0x19: PCOUNTER.DOM[6].FLAG - the FLAG from domain 6
  • base+0x1a: PCOUNTER.DOM[5].FLAG - the FLAG from domain 5
  • base+0x1b: PCOUNTER.DOM[4].FLAG - the FLAG from domain 4
  • base+0x1c: PCOUNTER.DOM[3].FLAG - the FLAG from domain 3
  • base+0x1d: PCOUNTER.DOM[2].FLAG - the FLAG from domain 2
  • base+0x1e: PCOUNTER.DOM[1].FLAG - the FLAG from domain 1
  • base+0x1f: PCOUNTER.DOM[0].FLAG - the FLAG from domain 0

For GF100+:

  • base+0x1f..0x22: PCOUNTER.MAIN.???
  • base+0x23..0x26: PCOUNTER.MAIN.???
  • base+0x27: PCOUNTER.USER_0 - the USER_0 signal from current domain
  • base+0x28: PCOUNTER.USER_1
  • base+0x29: PCOUNTER.USER_2
  • base+0x2a: PCOUNTER.USER_3
  • base+0x2b: PGRAPH.CTXCTL.UNK86C.UNK4
  • base+0x2c: PCOUNTER.PAUSED - 1 if this domain is in the PAUSED state
  • base+0x2d: ???
  • base+0x2e: PCOUNTER.PERIODIC - the PERIODIC signal from current domain
  • base+0x2f: ???
  • base+0x30: PCOUNTER.DOM[7].EVENT - the EVENT input from domain 7
  • base+0x31: PCOUNTER.DOM[6].EVENT - the EVENT input from domain 6
  • base+0x32: PCOUNTER.DOM[5].EVENT - the EVENT input from domain 5
  • base+0x33: PCOUNTER.DOM[4].EVENT - the EVENT input from domain 4
  • base+0x34: PCOUNTER.DOM[3].EVENT - the EVENT input from domain 3
  • base+0x35: PCOUNTER.DOM[2].EVENT - the EVENT input from domain 2
  • base+0x36: PCOUNTER.DOM[1].EVENT - the EVENT input from domain 1
  • base+0x37: PCOUNTER.DOM[0].EVENT - the EVENT input from domain 0
  • base+0x38: PCOUNTER.DOM[7].FLAG - the FLAG from domain 7
  • base+0x39: PCOUNTER.DOM[6].FLAG - the FLAG from domain 6
  • base+0x3a: PCOUNTER.DOM[5].FLAG - the FLAG from domain 5
  • base+0x3b: PCOUNTER.DOM[4].FLAG - the FLAG from domain 4
  • base+0x3c: PCOUNTER.DOM[3].FLAG - the FLAG from domain 3
  • base+0x3d: PCOUNTER.DOM[2].FLAG - the FLAG from domain 2
  • base+0x3e: PCOUNTER.DOM[1].FLAG - the FLAG from domain 1
  • base+0x3f: PCOUNTER.DOM[0].FLAG - the FLAG from domain 0

Todo

PAUSED?

Todo

unk bits

The EVENT and FLAG signals

The trailer signals include EVENT and FLAG signals from all domains in the same domain set, allowing limited inter-domain communication. The EVENT signal is simply the output of the EVENT logic operation in a given domain. The FLAG signal is the status of the FLAG in a given domain.

In a given domain, its own FLAG and EVENT signals are connected directly to the relevant sources. However, other domains’ signals need to be first converted to the right clock domain. On NV20:NV40, this is done by a simple synchronizer - the state of DOM[x].FLAG signal in domain y will be the same as the state of FLAG in domain x as of two domain y clocks ago. While this is appropriate for many purposes, this means that, if the two domains don’t share the same clock, single-clock pulses in domain x may appear as multi-clock pulses in domain y [if it has faster clock], or be lost entirely [if it has slower clock].
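
The effect of the simple synchronizer can be illustrated with a toy model. This is only a sketch under strong assumptions: both clocks have integer periods in some arbitrary time unit, their edges are aligned at time 0, and sampling is ideal. The function and parameter names are made up for illustration:

```python
def continuous_sync(src_wave, src_period, dst_period, dst_ticks):
    """Model a cross-domain signal as seen from the destination domain.

    src_wave   - source signal value (0/1) for each source clock cycle
    src_period - source clock period in arbitrary time units
    dst_period - destination clock period in the same units
    dst_ticks  - number of destination clock cycles to simulate
    """
    def src_at(time):
        cycle = time // src_period
        return src_wave[cycle] if cycle < len(src_wave) else 0

    sampled = [src_at(n * dst_period) for n in range(dst_ticks)]
    # The value seen at destination tick n is the source state two
    # destination clocks earlier.
    return ([0, 0] + sampled)[:dst_ticks]


pulse = [0, 1, 0, 0]  # a single-clock pulse in the source domain

# Destination clock twice as fast: the pulse is seen for two clocks.
print(continuous_sync(pulse, src_period=2, dst_period=1, dst_ticks=10))
# Destination clock twice as slow: the pulse falls between samples.
print(continuous_sync(pulse, src_period=1, dst_period=2, dst_ticks=4))
```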

On NV40+, one of two synchronization modes can be selected for signals coming from other domains:

  • CONTINUOUS: behaves like NV20:NV40
  • PULSE: converts all 0-to-1 transitions in the source domain into single-clock pulses in the destination domain

There are two synchronization mode switches per domain. One applies to all incoming EVENT signals from other domains, while the other applies to all incoming FLAG signals. Note that the synchronization applies even between domains that do share a clock. However, the domain’s own EVENT and FLAG signals aren’t subject to synchronization when used inside it.

The USER signals

On GT215:GF100, each domain has two “user” signals controllable directly by PCOUNTER’s MMIO register. The signals are called USER_0 and USER_1.

reg32 pcounter-user-trigger-tesla
nv40-pcounter 0x580+dom*0x4: USER_TRIGGER[dom] (dom<8) [GT215-]
  • bit 0: value for USER_0
  • bit 1: value for USER_1
  • bit 2: pulse mode for USER_0 - if set, will reset USER_0 to 0 one cycle after setting it to the value of bit 0.
  • bit 3: pulse mode for USER_1

Whenever this register is written, USER_0 signal is set to the value of bit 0, and USER_1 is set to the value of bit 1. On the next cycle after the signal change, the USER signals for which the pulse mode bit is set are reset to 0. This register is write-only.

On GF100+, the number of USER signals per domain is increased to 4, the USER_TRIGGER register is read/write, and the signals are now located in the trailer area.

reg32 pcounter-user-trigger-fermi
gf100-pcounter-domain 0xec: USER_TRIGGER
  • bits 0-3: value for USER_0..USER_3
  • bits 4-7: pulse mode for USER_0..USER_3

Works like the GT215 USER_TRIGGER register, except it’s also readable. Note that bits 0-3 will be auto-cleared by bits 4-7 after one cycle - bits 0-3 of the read value correspond directly to the signals’ current values.

In effect:

  • write value = 0, pulse = any to set signal to 0 indefinitely
  • write value = 1, pulse = 0 to set signal to 1 indefinitely
  • write value = 1, pulse = 1 to set signal to 1 for one pulse only [and then set to 0 indefinitely]
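
The two USER_TRIGGER layouts differ only in the number of signals, so one helper can build the register value for both. This is a sketch, with a made-up function name; it only computes the value to write and does not touch hardware:

```python
def user_trigger(values, pulse):
    """Build a USER_TRIGGER value.

    values - 0/1 value for each USER signal
    pulse  - 0/1 pulse mode bit for each USER signal
    Pass 2 entries for GT215:GF100 and 4 entries for GF100+.
    """
    count = len(values)
    if count not in (2, 4) or len(pulse) != count:
        raise ValueError("expected 2 or 4 signals and matching pulse bits")
    word = 0
    for i in range(count):
        word |= (values[i] & 1) << i            # value bits come first
        word |= (pulse[i] & 1) << (count + i)   # pulse mode bits follow
    return word


# GT215:GF100: one-cycle pulse on USER_0, USER_1 held at 0.
print(hex(user_trigger([1, 0], [1, 0])))  # 0x5
# GF100+: one-cycle pulse on USER_3, the others held at 0.
print(hex(user_trigger([0, 0, 0, 1], [0, 0, 0, 1])))  # 0x88
```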

The PERIODIC signal

On G84+, each domain has a single PERIODIC signal connected to a simple periodic pulse generator. The pulse generator will generate a single-clock ‘1’ pulse every X clocks, with X selectable via the CTRL register from powers of two between 0x400 and 0x10000 clocks. The PERIODIC signal can also be disabled - it’ll output a constant ‘0’ signal in this case.

The GCTRL register has a global PERIODIC_RESET bit that keeps the periodic generator in a reset state while it’s set to 1. This bit can be used to start the PERIODIC signal generators synchronously for all domains.

Input selection

Each domain has up to 256 signals, but only a handful of inputs are used for the counting process. They are:

  • PRE, START, EVENT, STOP: created from 4 individually selected signals through an arbitrary 4-input logic operation, used by the counting process
  • CLRFLAG, SETFLAG: likewise created through an arbitrary 4-input logic operation, but on NV30+ the logic operation input signal selections are shared with PRE/START/EVENT/STOP inputs [NV10:NV30 have separate selections like the other inputs]. Used to control the FLAG.
  • SWAP [NV30-]: hardwired to PGRAPH.PM_TRIGGER on NV30:G84, can be assigned to an arbitrary signal [without logic operation] on G84+. Used by the quad event mode.
  • UNK8 [G84:GF100]: can be assigned to an arbitrary signal, also without logic operation. Purpose unknown

Todo

UNK8

Starting with NV30, the SETFLAG input may also be used as an argument to the EVENT and STOP logic operations, allowing one to construct 7-input logic operations.

The registers used to select the signals going into the logic operations are:

reg32 pcounter-pre-src
nv10-pcounter 0x400+dom*0x100: PRE_SRC[dom] (dom<2)
nv40-pcounter 0x400+dom*0x4: PRE_SRC[dom] (dom<8)
gf100-pcounter-domain 0x40: PRE_SRC

Selects the 4 signals used as inputs to PRE’s logic operation.

  • bits 0-7: signal 0
  • bits 8-15: signal 1
  • bits 16-23: signal 2
  • bits 24-31: signal 3

On NV30+, these signals are also used as inputs to CLRFLAG and SETFLAG logic operations.
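
Since every _SRC register uses the same layout of four 8-bit signal numbers, packing and unpacking can be sketched with two small helpers [the names are made up for illustration]:

```python
def pack_src(sig0, sig1, sig2, sig3):
    """Pack four signal numbers (0..255) into a _SRC register value."""
    signals = (sig0, sig1, sig2, sig3)
    if any(not 0 <= s <= 0xff for s in signals):
        raise ValueError("signal numbers must fit in 8 bits")
    return sig0 | sig1 << 8 | sig2 << 16 | sig3 << 24


def unpack_src(value):
    """Split a _SRC register value back into its four signal numbers."""
    return tuple((value >> (8 * i)) & 0xff for i in range(4))


print(hex(pack_src(0x12, 0x34, 0x56, 0x78)))  # 0x78563412
```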

reg32 pcounter-start-src
nv10-pcounter 0x408+dom*0x100: START_SRC[dom] (dom<2)
nv40-pcounter 0x440+dom*0x4: START_SRC[dom] (dom<8)
gf100-pcounter-domain 0x48: START_SRC

Like PRE_SRC, but for START. On NV30+, these signals are also used as inputs to CLRFLAG and SETFLAG logic operations, and are used as a 4-bit integer or low 4 bits of 6-bit integer in special counter modes.

reg32 pcounter-event-src
nv10-pcounter 0x410+dom*0x100: EVENT_SRC[dom] (dom<2)
nv40-pcounter 0x480+dom*0x4: EVENT_SRC[dom] (dom<8)
gf100-pcounter-domain 0x50: EVENT_SRC

Like PRE_SRC, but for EVENT. On NV40+, signals 2 and 3 are also used as high 2 bits of a 6-bit integer in special counter modes, and signals 0 and 1 are used as a 2-bit integer.

reg32 pcounter-stop-src
nv10-pcounter 0x418+dom*0x100: STOP_SRC[dom] (dom<2)
nv40-pcounter 0x4c0+dom*0x4: STOP_SRC[dom] (dom<8)
gf100-pcounter-domain 0x58: STOP_SRC

Like PRE_SRC, but for STOP.

reg32 pcounter-setflag-src
nv10-pcounter 0x420+dom*0x100: SETFLAG_SRC[dom] (dom<2) [NV10:NV30]

Like PRE_SRC, but for SETFLAG.

reg32 pcounter-clrflag-src
nv10-pcounter 0x428+dom*0x100: CLRFLAG_SRC[dom] (dom<2) [NV10:NV30]

Like PRE_SRC, but for CLRFLAG.

For convenience, the status of all 16 source signals can be checked by reading the SRC_STATUS register on NV40+:

reg32 pcounter-src-status
nv40-pcounter 0x540+dom*0x4: SRC_STATUS[dom] (dom<8)
gf100-pcounter-domain 0x68: SRC_STATUS
  • bits 0-3: current state of PRE_SRC signals 0-3
  • bits 4-7: current state of START_SRC signals 0-3
  • bits 8-11: current state of EVENT_SRC signals 0-3
  • bits 12-15: current state of STOP_SRC signals 0-3
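
A value read from SRC_STATUS can be split into the four groups with a helper like the one below. This is a sketch with a made-up function name; each list holds the state of signals 0-3 of the corresponding _SRC register, in that order:

```python
def decode_src_status(value):
    """Decode SRC_STATUS into per-register lists of signal states."""
    names = ("PRE_SRC", "START_SRC", "EVENT_SRC", "STOP_SRC")
    return {
        name: [(value >> (4 * group + i)) & 1 for i in range(4)]
        for group, name in enumerate(names)
    }


status = decode_src_status(0x0021)
print(status["PRE_SRC"])    # [1, 0, 0, 0]
print(status["START_SRC"])  # [0, 1, 0, 0]
```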

The PRE/START/EVENT/STOP/SETFLAG/CLRFLAG input calculation goes as follows:

  1. Start with the 4 signals selected by corresponding SRC register, call them SRC[0..3]. If on NV30+ and the input being calculated is SETFLAG/CLRFLAG, the SRC register doesn’t exist, and SRC[0..3] are instead set to:

    • SETFLAG: START_SRC[2], START_SRC[3], PRE_SRC[0], PRE_SRC[1]
    • CLRFLAG: PRE_SRC[2], PRE_SRC[3], START_SRC[0], START_SRC[1]
  2. Initially, set ARG[0..3] to SRC[0..3]

  3. If argument 0 delay bit is set, set ARG[0] to SRC[0] as of previous clock cycle instead.

  4. If argument 1 delay bit is set, set ARG[1] to SRC[1] as of previous clock cycle instead.

  5. If on G92+ and argument 2 SRC[0] delay replace bit is set, set ARG[2] to SRC[0] as of previous clock cycle instead.

  6. If on G92+ and argument 3 SRC[1] delay replace bit is set, set ARG[3] to SRC[1] as of previous clock cycle instead.

  7. If on NV30+, the input being calculated is EVENT or STOP, and argument 3 SETFLAG replace bit is set, set ARG[3] to the value of SETFLAG input [computed in the same clock cycle - not delayed]

  8. Perform the logic operation on ARG[0..3] to get the final value of the input. This is done as follows:

    • construct a 4-bit index i, with bit 0 set to ARG[0], bit 1 set to ARG[1], and so on
    • the value of the input is set to bit #i of the logic operation selector

    The logic operation selector thus effectively functions as a truth table for the logic operation.
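
The last step can be sketched in a few lines. The helper names are made up for illustration; `make_selector` is just a convenience for deriving the 16-bit truth table from a Python function of the four arguments:

```python
def logic_op(selector, args):
    """Evaluate a 16-bit logic operation selector on ARG[0..3] (each 0/1)."""
    index = args[0] | args[1] << 1 | args[2] << 2 | args[3] << 3
    return (selector >> index) & 1


def make_selector(func):
    """Build the 16-bit selector (truth table) for func(a0, a1, a2, a3)."""
    selector = 0
    for index in range(16):
        bits = [(index >> n) & 1 for n in range(4)]
        if func(*bits):
            selector |= 1 << index
    return selector


print(hex(make_selector(lambda a0, a1, a2, a3: a0)))         # 0xaaaa
print(hex(make_selector(lambda a0, a1, a2, a3: a0 and a1)))  # 0x8888
print(hex(make_selector(lambda a0, a1, a2, a3: a0 ^ a1)))    # 0x6666
print(hex(make_selector(lambda a0, a1, a2, a3: True)))       # 0xffff
```

For example, a selector of 0xaaaa simply passes signal 0 through, and 0xffff gives an input that is always 1.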

The registers selecting the actual logic operation are:

reg32 pcounter-pre-op
nv10-pcounter 0x404+dom*0x100: PRE_OP[dom] (dom<2)
nv40-pcounter 0x420+dom*0x4: PRE_OP[dom] (dom<8)
gf100-pcounter-domain 0x44: PRE_OP
  • bits 0-15: the logic operation to perform on the signals selected by PRE_SRC
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 2 of the logic operation [G92-]
    • 0: PRE_SRC[2]
    • 1: PRE_SRC[0] delayed by 1 clock cycle
  • bit 19: selects argument 3 of the logic operation [G92-]
    • 0: PRE_SRC[3]
    • 1: PRE_SRC[1] delayed by 1 clock cycle

This register is special - writing it will cause a swap in quad event mode on G84:GF100, and start the single event mode counting process on NV10:GF100.

reg32 pcounter-start-op
nv10-pcounter 0x40c+dom*0x100: START_OP[dom] (dom<2)
nv40-pcounter 0x460+dom*0x4: START_OP[dom] (dom<8)
gf100-pcounter-domain 0x4c: START_OP
  • bits 0-15: the logic operation to perform on the signals selected by START_SRC
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 2 of the logic operation [G92-]
    • 0: START_SRC[2]
    • 1: START_SRC[0] delayed by 1 clock cycle
  • bit 19: selects argument 3 of the logic operation [G92-]
    • 0: START_SRC[3]
    • 1: START_SRC[1] delayed by 1 clock cycle
reg32 pcounter-event-op
nv10-pcounter 0x414+dom*0x100: EVENT_OP[dom] (dom<2)
nv40-pcounter 0x4a0+dom*0x4: EVENT_OP[dom] (dom<8)
gf100-pcounter-domain 0x54: EVENT_OP
  • bits 0-15: the logic operation to perform on the signals selected by EVENT_SRC
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 3 of the logic operation [NV30-]:
    • 0: EVENT_SRC[3] [NV30:G92] or as selected by bit 20 [G92-]
    • 1: SETFLAG
  • bit 19: selects argument 2 of the logic operation [G92-]
    • 0: EVENT_SRC[2]
    • 1: EVENT_SRC[0] delayed by 1 clock cycle
  • bit 20: selects argument 3 of the logic operation, if not set to SETFLAG by bit 18 [G92-]
    • 0: EVENT_SRC[3]
    • 1: EVENT_SRC[1] delayed by 1 clock cycle
reg32 pcounter-stop-op
nv10-pcounter 0x41c+dom*0x100: STOP_OP[dom] (dom<2)
nv40-pcounter 0x4e0+dom*0x4: STOP_OP[dom] (dom<8)
gf100-pcounter-domain 0x5c: STOP_OP
  • bits 0-15: the logic operation to perform on the signals selected by STOP_SRC
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 3 of the logic operation [NV30-]:
    • 0: STOP_SRC[3] [NV30:G92] or as selected by bit 20 [G92-]
    • 1: SETFLAG
  • bit 19: selects argument 2 of the logic operation [G92-]
    • 0: STOP_SRC[2]
    • 1: STOP_SRC[0] delayed by 1 clock cycle
  • bit 20: selects argument 3 of the logic operation, if not set to SETFLAG by bit 18 [G92-]
    • 0: STOP_SRC[3]
    • 1: STOP_SRC[1] delayed by 1 clock cycle
reg32 pcounter-setflag-op
nv10-pcounter 0x424+dom*0x100: SETFLAG_OP[dom] (dom<2)
nv40-pcounter 0x500+dom*0x4: SETFLAG_OP[dom] (dom<8)
gf100-pcounter-domain 0x60: SETFLAG_OP
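  • bits 0-15: the logic operation to perform. On NV10:NV30, the arguments are selected by SETFLAG_SRC. On NV30+, the arguments are: START_SRC[2], START_SRC[3], PRE_SRC[0], PRE_SRC[1].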
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 2 of the logic operation [G92-]
    • 0: PRE_SRC[0]
    • 1: START_SRC[2] delayed by 1 clock cycle
  • bit 19: selects argument 3 of the logic operation [G92-]
    • 0: PRE_SRC[1]
    • 1: START_SRC[3] delayed by 1 clock cycle
reg32 pcounter-clrflag-op
nv10-pcounter 0x42c+dom*0x100: CLRFLAG_OP[dom] (dom<2)
nv40-pcounter 0x520+dom*0x4: CLRFLAG_OP[dom] (dom<8)
gf100-pcounter-domain 0x64: CLRFLAG_OP
  • bits 0-15: the logic operation to perform. On NV10:NV30, the arguments are selected by CLRFLAG_SRC. On NV30+, the arguments are: PRE_SRC[2], PRE_SRC[3], START_SRC[0], START_SRC[1].
  • bit 16: if set, argument 0 of the logic operation is delayed by 1 clock cycle
  • bit 17: if set, argument 1 of the logic operation is delayed by 1 clock cycle
  • bit 18: selects argument 2 of the logic operation [G92-]
    • 0: START_SRC[0]
    • 1: PRE_SRC[2] delayed by 1 clock cycle
  • bit 19: selects argument 3 of the logic operation [G92-]
    • 0: START_SRC[1]
    • 1: PRE_SRC[3] delayed by 1 clock cycle
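
The EVENT_OP and STOP_OP layouts differ from the other _OP registers in how bits 18-20 are used. The following sketch decodes an EVENT_OP value into the logic operation selector and a description of where each argument comes from. The function name and the `g92_plus` parameter are made up for illustration, the sketch assumes the NV30+ layout, and the meaning of bits 16-20 on GF100 is not verified:

```python
def decode_event_op(value, g92_plus=True):
    """Return (selector, argument sources) for an EVENT_OP value."""
    args = ["EVENT_SRC[0]", "EVENT_SRC[1]", "EVENT_SRC[2]", "EVENT_SRC[3]"]
    if value & (1 << 16):
        args[0] = "EVENT_SRC[0] delayed"
    if value & (1 << 17):
        args[1] = "EVENT_SRC[1] delayed"
    if g92_plus and value & (1 << 19):
        args[2] = "EVENT_SRC[0] delayed"
    if g92_plus and value & (1 << 20):
        args[3] = "EVENT_SRC[1] delayed"
    if value & (1 << 18):
        # SETFLAG takes priority over the bit 20 selection.
        args[3] = "SETFLAG"
    return value & 0xffff, args


selector, args = decode_event_op(0x8888 | 1 << 18 | 1 << 20)
print(hex(selector), args[3])  # 0x8888 SETFLAG
```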

Todo

check bits 16-20 on GF100

The register used to select the SWAP and UNK8 inputs on G84:GF100 cards is:

reg32 pcounter-spec-src
nv40-pcounter 0x560+dom*0x4: SPEC_SRC[dom] (dom<8)
  • bits 0-7: the SWAP signal
  • bits 8-15: the UNK8 signal

And on GF100+:

reg32 pcounter-swap-src
gf100-pcounter-domain 0x6c: SWAP_SRC
  • bits 0-7: the SWAP signal

On NV10:GF100, writing any of the _SRC and _OP registers except PRE_OP in single event mode will result in the state being reset to INACTIVE. Writing PRE_OP will start the counting process, setting the state to WAIT_PRE. On G84:GF100 in quad event mode, writing PRE_OP will cause a swap, as if the SWAP input was asserted for one cycle.

Todo

figure out how single event mode is supposed to be used on GF100+

Counters

The single event mode and quad event mode use MMIO-visible counter registers. They are:

  • CTR_CYCLES: counts all clock cycles in a counting period
  • CTR_CYCLES_ALT: a copy of CTR_CYCLES?
  • CTR_EVENT: counts 1s on EVENT input, or sums integers in EVENT_* special counter modes
  • CTR_START: in quad event mode, counts 1s on START input, or sums integers in EXTRA_* special counter modes; in single event mode counts measurement periods in which CTR_EVENT reached value >= THRESHOLD
  • CTR_PRE: in quad event mode, counts 1s on PRE input; in single event mode, counts down PRE assertions until WAIT_FOR_PRE state is left, then sums integers in EXTRA_* special counter modes and is unused otherwise.
  • CTR_STOP: in quad event mode, counts 1s on STOP input; in single event mode, counts down counting periods until the counting process ends.

Todo

wtf is CYCLES_ALT?

On NV10:NV30, the CTR_CYCLES, CTR_CYCLES_ALT, CTR_EVENT and CTR_START counters are 40-bit, while CTR_PRE and CTR_STOP are 32-bit. On NV30+, all counters are 32-bit and saturating: once they reach the largest possible value [0xffffffff], they stop incrementing. On NV10:NV30, the low 39 bits of the 40-bit counters wrap normally, but bit 39 is sticky: that is, 0xffffffffff increments to 0x8000000000, while other values increment normally.
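
The two overflow behaviours can be modelled as follows. This is a sketch for an increment by 1 only, with made-up function names; how the saturating counters behave when a larger amount is added in the special counter modes is not covered here:

```python
MASK39 = (1 << 39) - 1
BIT39 = 1 << 39


def increment_nv30(value):
    """NV30+ 32-bit counter: saturates at 0xffffffff."""
    return min(value + 1, 0xffffffff)


def increment_nv10_40bit(value):
    """NV10:NV30 40-bit counter: low 39 bits wrap, bit 39 is sticky."""
    total = value + 1
    # Bit 39 stays set once it has been set, or becomes set on carry.
    sticky = (value | total) & BIT39
    return (total & MASK39) | sticky


print(hex(increment_nv30(0xffffffff)))          # 0xffffffff
print(hex(increment_nv10_40bit(0xffffffffff)))  # 0x8000000000
```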

The registers used to access the counters are:

reg32 pcounter-ctr-cycles
nv10-pcounter 0x600+dom*0x100: CTR_CYCLES[dom] (dom<2)
nv40-pcounter 0x600+dom*0x4: CTR_CYCLES[dom] (dom<8)

Read-only, gives the current value of CTR_CYCLES. Returns low 32 bits on NV10:NV30.

reg32 pcounter-ctr-cycles-hi
nv10-pcounter 0x604+dom*0x100: CTR_CYCLES_HI[dom] (dom<2) [NV10:NV30]

Read-only, gives the high 8 bits of the current value of CTR_CYCLES.
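
On NV10:NV30, a full 40-bit counter value has to be assembled from two register reads. The sketch below does this for CTR_CYCLES. `read32` is a hypothetical callback that takes an offset inside the nv10-pcounter space and returns the 32-bit register value. Whether the hardware latches the high part when the low part is read is not stated here, so the sketch defensively re-reads the high part and retries if it changed:

```python
def read_ctr_cycles_40(read32, dom):
    """Read the 40-bit CTR_CYCLES value of a domain on NV10:NV30."""
    lo_reg = 0x600 + dom * 0x100  # CTR_CYCLES[dom]
    hi_reg = 0x604 + dom * 0x100  # CTR_CYCLES_HI[dom]
    while True:
        hi = read32(hi_reg) & 0xff
        lo = read32(lo_reg) & 0xffffffff
        # Retry if the low part overflowed into the high part meanwhile.
        if read32(hi_reg) & 0xff == hi:
            return (hi << 32) | lo


# Usage with a fake register file standing in for real MMIO access.
fake_regs = {0x600: 0x12345678, 0x604: 0x9a}
print(hex(read_ctr_cycles_40(fake_regs.__getitem__, 0)))  # 0x9a12345678
```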

reg32 pcounter-ctr-cycles-alt
nv10-pcounter 0x608+dom*0x100: CTR_CYCLES_ALT[dom] (dom<2)
nv40-pcounter 0x640+dom*0x4: CTR_CYCLES_ALT[dom] (dom<8)

Read-only, gives the current value of CTR_CYCLES_ALT. Returns low 32 bits on NV10:NV30.

reg32 pcounter-ctr-cycles-alt-hi
nv10-pcounter 0x60c+dom*0x100: CTR_CYCLES_ALT_HI[dom] (dom<2) [NV10:NV30]

Read-only, gives the high 8 bits of the current value of CTR_CYCLES_ALT.

reg32 pcounter-ctr-event
nv10-pcounter 0x610+dom*0x100: CTR_EVENT[dom] (dom<2)
nv40-pcounter 0x680+dom*0x4: CTR_EVENT[dom] (dom<8)

Read-only, gives the current value of CTR_EVENT. Returns low 32 bits on NV10:NV30.

reg32 pcounter-ctr-event-hi
nv10-pcounter 0x614+dom*0x100: CTR_EVENT_HI[dom] (dom<2) [NV10:NV30]

Read-only, gives the high 8 bits of the current value of CTR_EVENT.

reg32 pcounter-ctr-start
nv10-pcounter 0x618+dom*0x100: CTR_START[dom] (dom<2)
nv40-pcounter 0x6c0+dom*0x4: CTR_START[dom] (dom<8)

Read-only, gives the current value of CTR_START. Returns low 32 bits on NV10:NV30.

reg32 pcounter-ctr-start-hi
nv10-pcounter 0x61c+dom*0x100: CTR_START_HI[dom] (dom<2) [NV10:NV30]

Read-only, gives the high 8 bits of the current value of CTR_START.

reg32 pcounter-ctr-pre
nv10-pcounter 0x620+dom*0x100: CTR_PRE[dom] (dom<2)
nv40-pcounter 0x700+dom*0x4: CTR_PRE[dom] (dom<8)

When read, gives the current value of CTR_PRE. When written, sets the initial CTR_PRE value for single-event mode.

reg32 pcounter-ctr-stop
nv10-pcounter 0x624+dom*0x100: CTR_STOP[dom] (dom<2)
nv40-pcounter 0x740+dom*0x4: CTR_STOP[dom] (dom<8)

When read, gives the current value of CTR_STOP. When written, sets the initial CTR_STOP value for single-event mode.

The CTR_PRE and CTR_STOP counters have two values: the visible “current” value, and the hidden “initial” value. Reading the corresponding register reads the “current” value, while writing sets the “initial” value. The “initial” values are used when starting the counting process in single event mode.

Note that, in quad event mode, these registers access the copies of the counters from the previous counting period, and the currently active counters are not visible.

The record mode uses a different counting algorithm, and the counters are written to memory instead of being accessed directly via MMIO. The same underlying storage is used internally, so parts of the counter state may be visible via MMIO registers. This isn’t particularly useful.
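As an illustration of the register layout above, the following minimal sketch computes the CTR_EVENT offset for a given domain and joins the low 32 bits with the high 8 bits of a 40-bit counter as found on NV10:NV30. The function names are hypothetical, and the offsets are taken relative to the start of the nv10-pcounter / nv40-pcounter range rather than to any absolute MMIO address.

```c
#include <stdint.h>

/* Hypothetical helpers; offsets are relative to the PCOUNTER register range. */

/* NV10:NV40 layout: CTR_EVENT[dom] lives at 0x610 + dom*0x100 (dom < 2). */
static uint32_t nv10_ctr_event_offset(unsigned dom)
{
    return 0x610 + dom * 0x100;
}

/* NV40+ layout: CTR_EVENT[dom] lives at 0x680 + dom*0x4 (dom < 8). */
static uint32_t nv40_ctr_event_offset(unsigned dom)
{
    return 0x680 + dom * 0x4;
}

/*
 * NV10:NV30 counters are 40 bits wide: the main register returns the low
 * 32 bits and the matching *_HI register returns the high 8 bits.
 * Only the low 8 bits of `hi` are used here.
 */
static uint64_t ctr_combine40(uint32_t lo, uint32_t hi)
{
    return ((uint64_t)(hi & 0xff) << 32) | lo;
}
```

The same pattern applies to the other counters, with only the base offsets changing.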

Todo

figure out what’s the deal with GF100 counters

Special counter modes

While the simplest way to use the counters is to have them increment by 1 every clock cycle when a given input is set, PCOUNTER supports a few more complex modes where a 4-bit, 6-bit, or 2-bit integer made of several signals is added to a counter on every cycle. This is used to count events which can happen multiple times in a single cycle - the relevant unit then exports a multi-bit event count, instead of a simple event strobe.

The integers used in special counter modes are:

  • B4: 4-bit integer, made of the following signals, in low-to-high bit order:
    • START_SRC[0]
    • START_SRC[1]
    • START_SRC[2]
    • START_SRC[3]
  • B6: 6-bit integer, made of:
    • START_SRC[0]
    • START_SRC[1]
    • START_SRC[2]
    • START_SRC[3]
    • EVENT_SRC[2]
    • EVENT_SRC[3]
  • B2: 2-bit integer, made of:
    • EVENT_SRC[0]
    • EVENT_SRC[1]

The modes are:

  • SIMPLE: CTR_EVENT is increased by 1 on every cycle when EVENT input is 1 [ie. nothing interesting happens]
  • EVENT_B4: CTR_EVENT is increased by B4 on every cycle when EVENT input is 1
  • EVENT_B6 [NV40-]: CTR_EVENT is increased by B6 on every cycle when EVENT input is 1
  • EXTRA_B4 [NV40-]: CTR_EVENT behaves as in SIMPLE mode, but:
    • single event mode: CTR_PRE, instead of staying at 0 after leaving WAIT_FOR_PRE state, is used as a counter, and is increased by B4 on every clock cycle
    • quad event mode: CTR_START, instead of being controlled by START input, is increased by B4 on every clock cycle
  • EXTRA_B6_EVENT_B2 [NV40-]: CTR_EVENT is increased by B2 on every clock cycle, and:
    • single event mode: CTR_PRE behaves like in EXTRA_B4 mode, but is increased by B6 instead of B4 every cycle
    • quad event mode: CTR_START behaves like in EXTRA_B4 mode, but is increased by B6 instead of B4 every cycle
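The per-cycle increments described above can be sketched as follows. This is only an illustration: the struct and function names are hypothetical, and it assumes the B6 and B2 signal lists are given in low-to-high bit order like B4.

```c
#include <stdint.h>

/* Values match the NV40+ CTR_MODE field encoding. */
enum ctr_mode { SIMPLE = 0, EVENT_B4 = 1, EVENT_B6 = 2, EXTRA_B4 = 3, EXTRA_B6_EVENT_B2 = 4 };

/* One cycle's worth of selected signals, each 0 or 1 (hypothetical container). */
struct ctr_signals {
    unsigned start_src[4];
    unsigned event_src[4];
    unsigned event;          /* the EVENT input */
};

static unsigned b4(const struct ctr_signals *s)
{
    return s->start_src[0] | s->start_src[1] << 1 |
           s->start_src[2] << 2 | s->start_src[3] << 3;
}

static unsigned b6(const struct ctr_signals *s)
{
    return b4(s) | s->event_src[2] << 4 | s->event_src[3] << 5;
}

static unsigned b2(const struct ctr_signals *s)
{
    return s->event_src[0] | s->event_src[1] << 1;
}

/* Amount added to CTR_EVENT on this cycle. */
static unsigned event_increment(enum ctr_mode mode, const struct ctr_signals *s)
{
    switch (mode) {
    case SIMPLE:
    case EXTRA_B4:
        return s->event ? 1 : 0;
    case EVENT_B4:
        return s->event ? b4(s) : 0;
    case EVENT_B6:
        return s->event ? b6(s) : 0;
    case EXTRA_B6_EVENT_B2:
        return b2(s);        /* added on every clock cycle */
    }
    return 0;
}

/*
 * Amount added on this cycle to the "extra" counter (CTR_PRE in single
 * event mode, CTR_START in quad event mode). Returns 0 for modes that do
 * not repurpose that counter.
 */
static unsigned extra_increment(enum ctr_mode mode, const struct ctr_signals *s)
{
    switch (mode) {
    case EXTRA_B4:
        return b4(s);
    case EXTRA_B6_EVENT_B2:
        return b6(s);
    default:
        return 0;
    }
}
```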

Todo

figure out if there’s anything new on GF100

Control registers

The operation of PCOUNTER is controlled by the CTRL registers. NV10:NV40 have a single CTRL register, shared between both domains:

reg32 pcounter-ctrl-nv10
nv10-pcounter 0x73c: CTRL
  • bit 0: TVOUT_DEBUG_SEL - selects the signals that go to TV-out debug port, if enabled.
  • bit 1: TVOUT_DEBUG_ENABLE - if 0, external TV encoder pins behave normally; if 1, the display circuitry signals are disconnected, and internal PCOUNTER debug pins are exposed via these pins.
  • bit 2: CTR_MODE - selects counter mode [see above], affects both domains
    • 0: SIMPLE
    • 1: EVENT_B4
  • bits 3-4: DOM0_SINGLE_STATE - read-only, reads as the current single event mode state for domain #0:
    • 0: INACTIVE
    • 1: WAIT_PRE
    • 2: WAIT_START
    • 3: COUNTING
  • bits 5-6: DOM1_SINGLE_STATE [NV20:NV40] - like bits 3-4, but for domain #1
  • bit 8: DOM0_EVENT_CTR_PERIOD [NV15:NV40] - EVENT_CTR_PERIOD for domain #0:
    • 0: ONE
    • 1: ALL
  • bit 9: DOM1_EVENT_CTR_PERIOD [NV20:NV40] - like bit 8, but for domain #1
  • bit 16: DOM0_MODE [NV30:NV40] - selects counting mode for domain #0:
    • 0: SINGLE - single event mode
    • 1: QUAD - quad event mode
  • bit 18: DOM1_MODE [NV30:NV40] - like bit 16, but for domain #1
  • bits 24-25: DOM0_QUAD_STATE [NV30:NV40] - read-only, reads as the current quad event mode state for domain #0:
    • 0: EMPTY
    • 1: VALID
    • 3: OVERFLOW
  • bits 26-27: DOM1_QUAD_STATE [NV30:NV40] - like bits 24-25, but for domain #1

NV40:GF100 instead have per-domain CTRL registers:

reg32 pcounter-ctrl-nv40
nv40-pcounter 0x7c0+dom*0x4: CTRL[dom] (dom<8)
  • bits 0-1: MODE - selects counting mode
    • 0: SINGLE - single event mode
    • 1: QUAD - quad event mode
    • 2: RECORD - record mode
  • bits 4-6: CTR_MODE - selects counter mode
    • 0: SIMPLE
    • 1: EVENT_B4
    • 2: EVENT_B6
    • 3: EXTRA_B4
    • 4: EXTRA_B6_EVENT_B2
  • bit 8: EVENT_CTR_PERIOD - like on NV15
  • bit 11: EVENT_IMPORT_MODE - selects synchronization mode for EVENT signals imported from other domains
    • 0: CONTINUOUS
    • 1: PULSE
  • bit 13: FLAG_IMPORT_MODE - like bit 11, but for FLAG signals
  • bit 16: ???
  • bit 20: RECORD_FORMAT - selects packet format for record mode [G84:GF100]
    • 0: LONG - 32-byte packets with 12 usable event counters
    • 1: SHORT - 16-byte packets with 4 usable event counters
  • bits 21-23: PERIODIC_PERIOD [G84:GF100] - selects PERIODIC signal period:
    • 0: disabled, PERIODIC signal is always 0
    • 1: 0x400 clocks
    • 2: 0x800 clocks
    • 3: 0x1000 clocks
    • 4: 0x2000 clocks
    • 5: 0x4000 clocks
    • 6: 0x8000 clocks
    • 7: 0x10000 clocks
  • bits 24-25: QUAD_STATE - like on NV30
  • bit 27: FAULT_CLEAR - write-only, when written as 1 clears the FAULT bit in RECORD_STATUS. Note, however, that the domain will still be in a wedged state due to [probably] a hardware bug. This bit is thus useless.
  • bits 28-29: SINGLE_STATE - like on NV10
  • bit 30: ??? [G92:GF100]
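For reference, a minimal sketch of unpacking the documented fields of the NV40:GF100 per-domain CTRL register. The struct and function names are hypothetical; the unknown bits (16, 30) and the write-only FAULT_CLEAR bit are left out. The PERIODIC_PERIOD helper relies on the observation that the listed periods double with each step, so a non-zero field value n corresponds to 0x200 << n clocks.

```c
#include <stdint.h>

/* Extracts bits lo..hi (inclusive) of v; intended for narrow fields only. */
static unsigned bits(uint32_t v, unsigned lo, unsigned hi)
{
    return (v >> lo) & ((1u << (hi - lo + 1)) - 1);
}

/* Hypothetical decoded view of CTRL[dom] on NV40:GF100. */
struct nv40_ctrl {
    unsigned mode;              /* 0 SINGLE, 1 QUAD, 2 RECORD */
    unsigned ctr_mode;          /* 0 SIMPLE .. 4 EXTRA_B6_EVENT_B2 */
    unsigned event_ctr_period;  /* 0 ONE, 1 ALL */
    unsigned event_import_mode; /* 0 CONTINUOUS, 1 PULSE */
    unsigned flag_import_mode;  /* 0 CONTINUOUS, 1 PULSE */
    unsigned record_format;     /* 0 LONG, 1 SHORT (G84:GF100) */
    unsigned periodic_period;   /* G84:GF100 */
    unsigned quad_state;        /* 0 EMPTY, 1 VALID, 3 OVERFLOW */
    unsigned single_state;      /* 0 INACTIVE .. 3 COUNTING */
};

static struct nv40_ctrl nv40_ctrl_decode(uint32_t v)
{
    struct nv40_ctrl c;

    c.mode              = bits(v, 0, 1);
    c.ctr_mode          = bits(v, 4, 6);
    c.event_ctr_period  = bits(v, 8, 8);
    c.event_import_mode = bits(v, 11, 11);
    c.flag_import_mode  = bits(v, 13, 13);
    c.record_format     = bits(v, 20, 20);
    c.periodic_period   = bits(v, 21, 23);
    c.quad_state        = bits(v, 24, 25);
    c.single_state      = bits(v, 28, 29);
    return c;
}

/* PERIODIC signal period in clocks; 0 means the signal is disabled. */
static uint32_t periodic_period_clocks(unsigned field)
{
    return field ? 0x200u << field : 0;
}
```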

Todo

unk bits

In addition, G84:GF100 have a global GCTRL register used for a few bits shared by all domains:

reg32 pcounter-gctrl-g84
nv40-pcounter 0x7a8: GCTRL [G84-]
  • bit 0: RECORD_RESET - when set to 0, record counters increment normally; when set, forces all record counters to 0 value
  • bit 4: PERIODIC_RESET - when set to 0, PERIODIC signals operate normally; when set, PERIODIC signals are forced to 0 and will continue from the beginning of the cycle upon reenabling

Todo

more bits

Todo

GF100

Single event mode

In single event mode, one event input is being monitored and counted, with quite complex counting period management. The inputs used by single event mode counting process are PRE, START, EVENT, STOP.

The counting process may be in one of 4 states:

  • INACTIVE: nothing is happening, PCOUNTER needs to be set up
  • WAIT_FOR_PRE: counting process has started, but PRE pulses are required before it’s actually possible to start a counting period
  • WAIT_FOR_START: counting process has started, a counting period is not currently active, but will be started on a START pulse
  • COUNTING: a counting period is currently active, and the counters are in use

Counting process works like this:

On every cycle:

if (PCOUNTER config register other than PRE_OP written this cycle) {
    SINGLE_STATE = INACTIVE;
}
switch (SINGLE_STATE) {
    case INACTIVE:
        if (PRE_OP written this cycle) {
            /* start counting process, init counters */
            CTR_EVENT = 0;
            CTR_START = 0;
            CTR_CYCLES = CTR_CYCLES_ALT = 0;
            CTR_PRE = CTR_PRE_init;
            CTR_STOP = CTR_STOP_init;
            FLAG = 0;
            SINGLE_STATE = WAIT_FOR_PRE;
        }
        break;
    case WAIT_FOR_PRE:
        if (SETFLAG) FLAG = 1;
        if (CLRFLAG) FLAG = 0;
        if (PRE) {
            if (CTR_PRE != 0) {
                CTR_PRE--;
            } else {
                SINGLE_STATE = WAIT_FOR_START;
            }
        }
        break;
    case WAIT_FOR_START:
        if (SETFLAG) FLAG = 1;
        if (CLRFLAG) FLAG = 0;
        if (START) {
            CTR_CYCLES = CTR_CYCLES_ALT = 0;
            if (gpu < NV15 || EVENT_CTR_PERIOD == ONE)
                CTR_EVENT = 0;
            SINGLE_STATE = COUNTING;
        }
        break;
    case COUNTING:
        if (SETFLAG) FLAG = 1;
        if (CLRFLAG) FLAG = 0;
        /* increase CTR_EVENT and maybe CTR_PRE according to
           the counter mode */
        if (STOP) {
            if (CTR_EVENT >= THRESHOLD)
                CTR_START++;
            if (CTR_STOP != 0) {
                CTR_STOP--;
                SINGLE_STATE = WAIT_FOR_START;
            } else {
                SINGLE_STATE = INACTIVE;
            }
        }
}

Or, in summary:

  • before actual counting, (CTR_PRE+1) 1s must happen on PRE input
  • a counting process consists of (CTR_STOP+1) counting periods
  • a counting period is started by 1 on START input and stopped by 1 on STOP input
  • events outside of a counting period don’t count
  • if EVENT_CTR_PERIOD is ONE, CTR_EVENT effectively applies to a counting period, if it’s ALL, it contains a sum over all counting periods. CTR_PRE, when EXTRA_* counter mode is in use, always contains a sum over all counting periods. NV10:NV15 cards don’t have this submode bit and always behave as if it was ONE.
  • CTR_CYCLES always contains the length of the current [COUNTING] or last [WAIT_FOR_START] counting period
  • CTR_START will contain the number of counting periods that ended with CTR_EVENT >= THRESHOLD - probably only useful with EVENT_CTR_PERIOD = ONE.
  • writing any *_OP register except PRE_OP, any *_SRC register, any CTR register, THRESHOLD register, and CTRL register will abort the counting process
  • flag is frozen when in INACTIVE state, cleared to 0 when entering WAIT_FOR_PRE

Single event mode doesn’t use shadow counters - the values of all counters are immediately visible through MMIO registers.
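The single event mode process can be exercised in software with a small model like the one below. It is a sketch under several assumptions: all names are hypothetical, only the SIMPLE counter mode is modelled, a PRE_OP write is represented by calling single_arm(), the abort-on-config-write path is omitted, and CTR_CYCLES is assumed to advance by 1 on every COUNTING cycle (which the summary implies but the pseudocode does not show).

```c
#include <stdbool.h>
#include <stdint.h>

enum single_state { INACTIVE, WAIT_FOR_PRE, WAIT_FOR_START, COUNTING };

struct single_inputs { bool pre, start, event, stop, setflag, clrflag; };

struct single_domain {
    enum single_state state;
    uint64_t ctr_event, ctr_start, ctr_cycles, ctr_pre, ctr_stop;
    uint64_t ctr_pre_init, ctr_stop_init, threshold;
    bool flag;
    bool period_all;   /* true: EVENT_CTR_PERIOD == ALL, false: ONE */
};

/* Models a PRE_OP write: only has an effect in the INACTIVE state. */
static void single_arm(struct single_domain *d)
{
    if (d->state != INACTIVE)
        return;
    d->ctr_event = d->ctr_start = d->ctr_cycles = 0;
    d->ctr_pre = d->ctr_pre_init;
    d->ctr_stop = d->ctr_stop_init;
    d->flag = false;
    d->state = WAIT_FOR_PRE;
}

/* Advances the model by one clock cycle. */
static void single_tick(struct single_domain *d, const struct single_inputs *in)
{
    if (d->state == INACTIVE)
        return;                         /* FLAG is frozen while INACTIVE */
    if (in->setflag) d->flag = true;
    if (in->clrflag) d->flag = false;   /* CLRFLAG wins over SETFLAG */

    switch (d->state) {
    case WAIT_FOR_PRE:
        if (in->pre) {
            if (d->ctr_pre != 0)
                d->ctr_pre--;
            else
                d->state = WAIT_FOR_START;
        }
        break;
    case WAIT_FOR_START:
        if (in->start) {
            d->ctr_cycles = 0;
            if (!d->period_all)
                d->ctr_event = 0;
            d->state = COUNTING;
        }
        break;
    case COUNTING:
        d->ctr_cycles++;                /* assumption, see lead-in */
        if (in->event)
            d->ctr_event++;             /* SIMPLE counter mode only */
        if (in->stop) {
            if (d->ctr_event >= d->threshold)
                d->ctr_start++;
            if (d->ctr_stop != 0) {
                d->ctr_stop--;
                d->state = WAIT_FOR_START;
            } else {
                d->state = INACTIVE;
            }
        }
        break;
    default:
        break;
    }
}
```

With ctr_pre_init = 1 and ctr_stop_init = 1, the model needs two PRE pulses before it waits for START, and it runs two counting periods before returning to INACTIVE, matching the (CTR_PRE+1) and (CTR_STOP+1) rules in the summary.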

The threshold value for CTR_START counter can be set and read via the following registers:

reg32 pcounter-threshold
nv10-pcounter 0x628+dom*0x100: THRESHOLD[dom] (dom<2)
nv40-pcounter 0x780+dom*0x4: THRESHOLD[dom] (dom<8)

The THRESHOLD value, or low 32 bits of THRESHOLD value on NV10:NV30.

reg32 pcounter-threshold-hi
nv10-pcounter 0x62c+dom*0x100: THRESHOLD_HI[dom] (dom<2) [NV10:NV30]

The high 8 bits of THRESHOLD value.

Todo

threshold on GF100

Quad event mode

In quad event mode, 4 different event inputs are counted, each in a dedicated counter. The events are counted in invisible “shadow” registers, while the visible registers contain the final values of counters from the previous counting period. Counting periods are controlled by the special SWAP input, which copies the “shadow” counters to visible registers, and clears the shadow counters to 0. In addition, the SWAP signal marks the counter values as available in the QUAD_STATE field of the CTRL register.

The counters used in quad event mode are:

  • CTR_CYCLES and CTR_CYCLES_ALT: increase by 1 for every cycle
  • CTR_EVENT: increases as per the counter mode, usually by 1 for every cycle when EVENT input is set
  • CTR_START: increases as per the counter mode, usually by 1 for every cycle when START input is set
  • CTR_PRE: increases by 1 for every cycle when PRE input is set
  • CTR_STOP: increases by 1 for every cycle when STOP input is set

When in quad event mode, the counters are always active - there’s no INACTIVE state like in single event mode.

The counter swap is triggered on every cycle when SWAP input is set. On G84:GF100, the counter swap is also triggered on every write to the PRE_OP register. The PCOUNTER keeps track of how many counter value sets have been swapped and how many have been read. It can thus be in one of three states:

  • EMPTY - no new counter values to read
  • VALID - swap has happened and counter values are available for reading
  • OVERFLOW - another swap has happened while in VALID state, and counter values were lost

A swap bumps the state up one unit - EMPTY goes to VALID, VALID goes to OVERFLOW, and OVERFLOW is unchanged.

Note that the swap is performed before updating the counters for a given cycle - thus if SWAP and one of the event inputs are active on the same cycle, the events will be counted for the next period.

The software may inform the PCOUNTER of read completion by poking the write-only QUAD_ACK_TRIGGER register. The register is shared for all domains on NV30:NV40, and per-domain for NV40+:

reg32 pcounter-quad-ack-trigger-nv30
nv10-pcounter 0x738: QUAD_ACK_TRIGGER [NV30:NV40]
  • bit 0: DOM0 - when written as 0, nothing happens. When written as 1, the status of domain #0 is bumped down one unit - VALID goes to EMPTY, OVERFLOW goes to VALID, and EMPTY is unchanged.
  • bit 8: DOM1 - like DOM0, but affects domain #1
reg32 pcounter-quad-ack-trigger-nv40
nv40-pcounter 0x7e0+dom*0x4: QUAD_ACK_TRIGGER[dom] (dom<8)
gf100-pcounter-domain 0xa0: QUAD_ACK_TRIGGER
  • bit 0: Like NV30’s DOM0/DOM1 bits, affects the domain the register is in.
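The swap and acknowledge behaviour can be modelled roughly as below. The names are hypothetical and the counter index order is arbitrary; the model only covers plain +1 counting on four inputs, and it applies the swap before counting, as described above.

```c
#include <stdbool.h>
#include <stdint.h>

/* Values match the QUAD_STATE field encoding. */
enum quad_state { QUAD_EMPTY = 0, QUAD_VALID = 1, QUAD_OVERFLOW = 3 };

/* A swap bumps the state up one unit. */
static enum quad_state quad_swap(enum quad_state s)
{
    return s == QUAD_EMPTY ? QUAD_VALID : QUAD_OVERFLOW;
}

/* Writing 1 to QUAD_ACK_TRIGGER bumps the state down one unit. */
static enum quad_state quad_ack(enum quad_state s)
{
    return s == QUAD_OVERFLOW ? QUAD_VALID : QUAD_EMPTY;
}

/* Hypothetical model of one domain: four shadow/visible counter pairs. */
struct quad_domain {
    uint32_t visible[4];
    uint32_t shadow[4];
    enum quad_state state;
};

/* One clock cycle: the swap is handled first, then inputs are counted. */
static void quad_tick(struct quad_domain *q, bool swap, const bool in[4])
{
    int i;

    if (swap) {
        for (i = 0; i < 4; i++) {
            q->visible[i] = q->shadow[i];
            q->shadow[i] = 0;
        }
        q->state = quad_swap(q->state);
    }
    for (i = 0; i < 4; i++)
        if (in[i])
            q->shadow[i]++;   /* lands in the new period if swap was set */
}
```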

Record mode

In record mode, counter values are written to memory for later analysis instead of being read via MMIO - this enables much more frequent sampling and simplifies software. The counter values are written to a given virtual memory buffer in 16-byte or 32-byte packets, consisting of 14 counters. A new packet is written whenever one of the 12 event counters is close to overflowing, or when the STOP input is asserted. The counters are:

  • 48-bit cycles counter, incremented by 1 on every cycle, cleared only when record mode operation is started by writing the RECORD_START register or GCTRL.RECORD_RESET is set to 1. This counter wraps on overflow.
  • 12 16-bit event counters, corresponding to 12 monitored signals selected by PRE_SRC[0..3], START_SRC[0..3], EVENT_SRC[0..3]. Incremented by 1 on every cycle when corresponding signal is 1. Cleared after writing a packet. A packet write is triggered whenever any of these counters reaches 0xf000. If a counter reaches 0xffff, it stops incrementing further.
  • 12-bit STOP counter, incremented by 1 whenever the STOP input is 1. Cleared after writing a packet. A packet write is triggered whenever this counter is non-0. If this counter reaches 0xfff, it stops incrementing further.

There are two packet formats available: long and short. Long format packets are 32 bytes long and include all counters, while short format packets are 16 bytes long and have only 4 of the 12 event counters. A packet in long format is made of 16 16-bit little endian words:

  • 0x00: low 16 bits of cycle counter
  • 0x02: middle 16 bits of cycle counter
  • 0x04: high 16 bits of cycle counter
  • 0x06:
    • bits 0-11: the STOP counter
    • bits 12-15: always 0
  • 0x08: PRE_SRC[0] event counter
  • 0x0a: PRE_SRC[1] event counter
  • 0x0c: PRE_SRC[2] event counter
  • 0x0e: PRE_SRC[3] event counter
  • 0x10: START_SRC[0] event counter
  • 0x12: START_SRC[1] event counter
  • 0x14: START_SRC[2] event counter
  • 0x16: START_SRC[3] event counter
  • 0x18: EVENT_SRC[0] event counter
  • 0x1a: EVENT_SRC[1] event counter
  • 0x1c: EVENT_SRC[2] event counter
  • 0x1e: EVENT_SRC[3] event counter

A packet in short format is simply the first 16 bytes of a packet in long format.
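A minimal sketch of decoding a packet from the record buffer, following the layout above. The struct and function names are hypothetical; for short packets the START_SRC and EVENT_SRC counters are simply left at 0.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical decoded form of one record mode packet. */
struct record_packet {
    uint64_t cycles;     /* 48-bit cycle counter */
    unsigned stop;       /* 12-bit STOP counter */
    uint16_t pre[4];     /* PRE_SRC[0..3] event counters */
    uint16_t start[4];   /* START_SRC[0..3], long format only */
    uint16_t event[4];   /* EVENT_SRC[0..3], long format only */
};

/* Reads one 16-bit little endian word. */
static uint16_t rd16le(const uint8_t *p)
{
    return (uint16_t)(p[0] | p[1] << 8);
}

/*
 * Decodes a packet of `len` bytes (16 for short format, 32 for long).
 * Returns 0 on success, -1 if `len` is neither 16 nor 32.
 */
static int record_packet_parse(const uint8_t *buf, size_t len,
                               struct record_packet *out)
{
    int i;

    if (len != 16 && len != 32)
        return -1;
    memset(out, 0, sizeof *out);

    out->cycles = (uint64_t)rd16le(buf) |
                  (uint64_t)rd16le(buf + 2) << 16 |
                  (uint64_t)rd16le(buf + 4) << 32;
    out->stop = rd16le(buf + 6) & 0xfff;
    for (i = 0; i < 4; i++)
        out->pre[i] = rd16le(buf + 0x08 + 2 * i);

    if (len == 32) {
        for (i = 0; i < 4; i++) {
            out->start[i] = rd16le(buf + 0x10 + 2 * i);
            out->event[i] = rd16le(buf + 0x18 + 2 * i);
        }
    }
    return 0;
}
```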

Packets are normally written to memory when STOP input is asserted. For this reason, packets in memory will usually have the STOP counter equal to 1 [for the one pulse that triggered them]. However, to avoid saturating the event counters, a packet write will also be triggered whenever any event counter is >= 0xf000. The STOP counter in the memory packet will be equal to 0 in this case. STOP counter values greater than 1 are possible when STOP input is asserted too often for the memory interface to keep up - each domain has room for one outgoing packet. Whenever a packet write is triggered and there isn’t an outgoing packet yet, the packet will be sent, and the counters reset. When a packet write is triggered and there already is an outgoing packet, nothing will happen - the counters will just keep incrementing until the current packet write is finished.
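The triggering rules can be illustrated with a small model of one domain's event and STOP counters and its single outgoing packet slot. This is a sketch only: the names are hypothetical, the cycles counter is left out, the relative order of incrementing and triggering within a cycle is an assumption, and completion of the memory write is signalled by an explicit record_write_done() call.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical model of the record mode counters of one domain. */
struct record_dom {
    uint16_t event[12];      /* saturate at 0xffff */
    uint16_t stop;           /* 12-bit, saturates at 0xfff */
    bool     outgoing_busy;  /* the single outgoing packet slot is in use */
    unsigned packets_sent;
    uint16_t last_stop;      /* STOP counter value in the last packet sent */
};

/* One clock cycle: update the counters, then try to emit a packet. */
static void record_tick(struct record_dom *d, const bool sig[12], bool stop)
{
    bool trigger;
    int i;

    for (i = 0; i < 12; i++)
        if (sig[i] && d->event[i] < 0xffff)
            d->event[i]++;
    if (stop && d->stop < 0xfff)
        d->stop++;

    trigger = d->stop != 0;
    for (i = 0; i < 12; i++)
        if (d->event[i] >= 0xf000)
            trigger = true;

    if (trigger && !d->outgoing_busy) {
        /* Slot is free: send the packet and reset the counters. */
        d->last_stop = d->stop;
        d->packets_sent++;
        d->outgoing_busy = true;
        memset(d->event, 0, sizeof d->event);
        d->stop = 0;
    }
    /* Otherwise the counters simply keep accumulating. */
}

/* Models the memory interface finishing the outgoing packet write. */
static void record_write_done(struct record_dom *d)
{
    d->outgoing_busy = false;
}
```

In this model, two STOP pulses arriving while the slot is busy produce a later packet whose STOP counter is 2, and a counter reaching 0xf000 with no STOP pulse produces a packet whose STOP counter is 0.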

Todo

check if still valid on GF100

Record mode setup

Before record mode is started, a few registers need to be set up.

First, the channel and DMA object for the record buffer need to be bound. The PCOUNTER will access virtual memory as engine 0xb, client 0xf, DMA slot 0. The channel and DMA object are global for all domains. Note that the channel register has to be written after the DMA object register for a successful bind.

reg32 pcounter-record-dma
nv40-pcounter 0x7a4: RECORD_DMA [G84-]
  • bits 0-15: the DMA object to be used by PCOUNTER. Writing this register only stores the DMA object, it doesn’t actually bind it - the bind is done by RECORD_CHAN write.
reg32 pcounter-record-chan
nv40-pcounter 0x7a0: RECORD_CHAN [G84-]
  • bits 0-29: CHAN - the channel to bind to PCOUNTER engine
  • bit 31: VALID - if set, a channel bind and DMA object bind will be done when writing this register. If unset, the register will be written, but no binds will be done.

The address of the record buffer is settable per-domain:

reg32 pcounter-record-start
nv40-pcounter 0x760+dom*0x4: RECORD_START[dom] (dom<8) [G84-]

The start address of the record buffer. Only bits 4-31 are valid - the buffer has to be aligned to a 16-byte boundary. When this register is written, the address is copied to RECORD_STATUS position field, the “buffer valid” internal flag will be set, and all counters are reset if the domain is in record mode.

Note that setting this register will not properly clear the counter state if the domain is not in record mode - in fact, a bogus packet will likely be written immediately after transitioning to the record mode if RECORD_START is written in another mode. To avoid that, write RECORD_START after entering record mode [and make sure the “buffer valid” flag is not set], or use the GCTRL.RECORD_RESET bit.

reg32 pcounter-record-limit
nv40-pcounter 0x720+dom*0x4: RECORD_LIMIT[dom] (dom<8) [G84-]

The last valid address in the record buffer. Only bits 4-31 are valid. After a packet is written with address >= the value of this register, the internal “buffer valid” flag will be cleared, and all further writes will be ignored until RECORD_START is written.

Note that one packet write will always succeed before the “buffer valid” flag is cleared and further writes are disabled - even if the position is set far beyond the limit.
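To see how RECORD_START and RECORD_LIMIT interact, the following sketch counts how many packets get written before the buffer is marked invalid. The names are hypothetical, the limit is compared against the start address of each packet as the text above states, and the model assumes the position still advances after the final write.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical model of one domain's record buffer state. */
struct record_buf {
    uint32_t pos;    /* address of the next packet to be written */
    uint32_t limit;  /* RECORD_LIMIT */
    bool     valid;  /* the internal "buffer valid" flag */
};

/* Models RECORD_LIMIT and RECORD_START writes; only bits 4-31 are kept. */
static void record_setup(struct record_buf *b, uint32_t start, uint32_t limit)
{
    b->pos = start & ~0xfu;
    b->limit = limit & ~0xfu;
    b->valid = true;
}

/* Attempts one packet write; returns false if the write is ignored. */
static bool record_write(struct record_buf *b, uint32_t pkt_size)
{
    if (!b->valid)
        return false;
    if (b->pos >= b->limit)
        b->valid = false;   /* this write still goes through */
    b->pos += pkt_size;
    return true;
}

/* Number of packets written before further writes are ignored. */
static unsigned record_capacity(uint32_t start, uint32_t limit,
                                uint32_t pkt_size)
{
    struct record_buf b;
    unsigned n = 0;

    record_setup(&b, start, limit);
    while (record_write(&b, pkt_size))
        n++;
    return n;
}
```

For example, with the start at 0x1000 and the limit at 0x1020, the model writes three 16-byte packets (at 0x1000, 0x1010 and 0x1020), and a start address beyond the limit still yields exactly one packet.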

reg32 pcounter-record-status
nv40-pcounter 0x6e0+dom*0x4: RECORD_STATUS[dom] (dom<8) [G84-]

This register is read-only.

  • bit 0: if set, a VM FAULT happened when writing the record buffer
  • bits 4-31: bits 4-31 of the current record buffer position, ie. address of the next packet to be written

The PCOUNTER internally operates on 32-bit addresses. On G84:G92, the high 8 bits of 40-bit virtual address are always forced to 0, limiting the record buffer to the low 4GB of the VM space. On G92+, the high 8 bits of the address are instead taken from a register:

reg32 pcounter-record-address-high
nv40-pcounter 0x6a0+dom*0x4: RECORD_ADDRESS_HIGH[dom] (dom<8) [G92-]

Sets the high 8 bits of the record buffer virtual address.

Note, however, that the internal address size is still 32-bit: the position will thus wrap at a 4GB boundary, instead of incrementing bit 32 of the address. For this reason, record buffers that cross a 4GB block boundary in virtual space cannot be used.
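The address composition and the resulting restriction can be expressed as a short sketch. The function names are hypothetical; record_buffer_usable() is one possible pre-flight check that software might apply to a candidate buffer, not something the hardware provides.

```c
#include <stdbool.h>
#include <stdint.h>

/* 40-bit virtual address formed from RECORD_ADDRESS_HIGH and the position. */
static uint64_t record_vaddr(uint8_t high, uint32_t pos)
{
    return ((uint64_t)high << 32) | pos;
}

/* The position is a 32-bit quantity, so it wraps instead of carrying. */
static uint32_t record_next_pos(uint32_t pos, uint32_t pkt_size)
{
    return pos + pkt_size;   /* unsigned arithmetic wraps modulo 2^32 */
}

/*
 * Returns true if a buffer spanning [start, limit] fits in 40 bits and
 * stays within a single 4GB block, ie. shares the same high 8 bits.
 */
static bool record_buffer_usable(uint64_t start, uint64_t limit)
{
    return start <= limit &&
           (limit >> 40) == 0 &&
           (start >> 32) == (limit >> 32);
}
```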

Note that VM faults on the record buffer will permanently hang the faulting domain until the GPU is reset - while there’s a “clear VM FAULT status” bit in the control register, it only clears the status bit, while hardware is still in a wedged state. This is likely a hardware bug.

Todo

figure out record mode setup for GF100

The flag

The FLAG is a single per-domain bit that can be set and cleared via the SETFLAG and CLRFLAG inputs. On every clock cycle:

  • if CLRFLAG is 1, the FLAG is set to 0
  • if SETFLAG is 1 and CLRFLAG is 0, the FLAG is set to 1
  • if both CLRFLAG and SETFLAG are 0, the FLAG is unchanged

In addition, when in single-event mode, the FLAG is frozen [will not respond to CLRFLAG/SETFLAG] when in INACTIVE state, and will be cleared to 0 when going to WAIT_FOR_PRE state.

The current value of the FLAG is available as a common trailer signal to all domains in the same domain set, allowing complex operations to be performed. Note however that the effect of CLRFLAG/SETFLAG on the FLAG signal is delayed by 2 clock cycles - if the SETFLAG input becomes 1 on cycle X, the FLAG signal will become 1 on cycle X+2.
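The update rule and the 2-cycle delay can be modelled as follows. This is a behavioural sketch only: the names are hypothetical, the model ignores the single event mode freezing described above, and the delay is represented by a two-entry history, which may not reflect how the hardware is actually built.

```c
#include <stdbool.h>

/* Hypothetical model: FLAG values from the last two cycles. */
struct flag_unit {
    bool d1;   /* FLAG after applying the previous cycle's inputs */
    bool d2;   /* FLAG after applying the inputs from two cycles ago */
};

/*
 * Advances one clock cycle with the given SETFLAG/CLRFLAG inputs and
 * returns the FLAG signal as seen by other logic during this cycle.
 */
static bool flag_tick(struct flag_unit *f, bool setflag, bool clrflag)
{
    bool visible = f->d2;
    bool next;

    if (clrflag)
        next = false;        /* CLRFLAG takes priority */
    else if (setflag)
        next = true;
    else
        next = f->d1;        /* unchanged */

    f->d2 = f->d1;
    f->d1 = next;
    return visible;
}
```

If SETFLAG is 1 on the first call only, the first two calls return false and the third returns true, matching the X+2 behaviour described above.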