NV43:G80 thermal monitoring
Introduction
THERM is an area present in PBUS on NV43:G80 GPUs. This area is responsible for temperature monitoring, probably on low-end and mid-range GPUs, since high-end cards have been using LM89/ADT7473 for a long time. Besides some configuration knobs, THERM can generate IRQs to the HOST when the temperature goes over a configurable ALARM threshold or outside a configurable temperature range. This area has been replaced by PTHERM on G80+ GPUs.
THERM’s MMIO range is 0x15b0:0x15c0. There are two major variants of this range:
- NV43:G70
- G70:G80
MMIO register list
Address | Present on | Name | Description |
---|---|---|---|
0x0015b0 | all | CFG0 | sensor enable / IRQ enable / ALARM configuration |
0x0015b4 | all | STATUS | sensor state / ALARM state / ADC rate configuration |
0x0015b8 | non-IGP | CFG1 | misc. configuration |
0x0015bc | all | TEMP_RANGE | LOW and HIGH temperature thresholds |
- MMIO 0x15b0: CFG0 [NV43:G70]
- bits 0-7: ALARM_HIGH
- bits 16-23: SENSOR_OFFSET (signed integer)
- bit 24: DISABLE
- bit 28: ALARM_INTR_EN
- MMIO 0x15b0: CFG0 [G70:G80]
- bits 0-13: ALARM_HIGH
- bits 16-29: SENSOR_OFFSET (signed integer)
- bit 30: DISABLE
- bit 31: ENABLE
- MMIO 0x15b4: STATUS [NV43:G70]
- bits 0-7: SENSOR_RAW
- bit 8: ALARM_HIGH
- bits 25-31: ADC_CLOCK_XXX
Todo
figure out what divisors are available
- MMIO 0x15b4: STATUS [G70:G80]
- bits 0-13: SENSOR_RAW
- bit 16: ALARM_HIGH
- bits 26-31: ADC_CLOCK_DIV The divisor is stored right-shifted by 4. The possible divisor values range from 32 to 2016, and the divider can also be bypassed completely.
- MMIO 0x15b8: CFG1 [NV43:G70]
- bit 17: ADC_PAUSE
- bit 23: CONNECT_SENSOR
- MMIO 0x15bc: TEMP_RANGE [NV43:G70]
- bits 0-7: LOW
- bits 8-15: HIGH
- MMIO 0x15bc: TEMP_RANGE [G70:G80]
- bits 0-13: LOW
- bits 16-29: HIGH
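As an illustration of how the layouts above can be applied, here is a minimal Python sketch that splits a raw 32-bit register value into its fields. The `FIELDS` table, `get_field` and `decode` are hypothetical names introduced for this example. The table only transcribes the bit ranges listed above and leaves out the ADC clock fields. All fields are returned as unsigned values, so SENSOR_OFFSET still needs sign extension.

```python
# Bit ranges transcribed from the register list: name -> (lowest bit, highest bit).
FIELDS = {
    "NV43:G70": {
        "CFG0": {"ALARM_HIGH": (0, 7), "SENSOR_OFFSET": (16, 23),
                 "DISABLE": (24, 24), "ALARM_INTR_EN": (28, 28)},
        "STATUS": {"SENSOR_RAW": (0, 7), "ALARM_HIGH": (8, 8)},
        "CFG1": {"ADC_PAUSE": (17, 17), "CONNECT_SENSOR": (23, 23)},
        "TEMP_RANGE": {"LOW": (0, 7), "HIGH": (8, 15)},
    },
    "G70:G80": {
        "CFG0": {"ALARM_HIGH": (0, 13), "SENSOR_OFFSET": (16, 29),
                 "DISABLE": (30, 30), "ENABLE": (31, 31)},
        "STATUS": {"SENSOR_RAW": (0, 13), "ALARM_HIGH": (16, 16)},
        "TEMP_RANGE": {"LOW": (0, 13), "HIGH": (16, 29)},
    },
}


def get_field(value, lo, hi):
    """Return bits lo..hi (inclusive) of a register value, unsigned."""
    width = hi - lo + 1
    return (value >> lo) & ((1 << width) - 1)


def decode(variant, reg, value):
    """Split a raw register value into a dict of unsigned field values."""
    return {name: get_field(value, lo, hi)
            for name, (lo, hi) in FIELDS[variant][reg].items()}
```

For example, `decode("G70:G80", "STATUS", value)` yields a dict with the keys `SENSOR_RAW` and `ALARM_HIGH`. How the register value is obtained (MMIO mapping, debug tool, etc.) is outside the scope of this sketch.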
The ADC clock
The source clock for THERM’s ADC is:
- NV43:G70: the host clock
- G70:G80: constant (most likely hclck)
(most likely constant, since the rate doesn’t change when I change the HOST clock)
Before reaching the ADC, the clock source is divided by a fixed divider of 1024 and then by ADC_CLOCK_DIV.
- MMIO 0x15b4: STATUS [NV43:G70]
- bits 25-31: ADC_CLOCK_DIV
Todo
figure out what divisors are available
- MMIO 0x15b4: STATUS [G70:G80]
- bits 26-31: ADC_CLOCK_DIV The divisor is stored right-shifted by 4. The possible divisor values range from 32 to 2016, and the divider can also be bypassed completely.
The final ADC clock is:
ADC_clock = source_clock / 1024 / ADC_CLOCK_DIV
The accuracy of the reading greatly depends on the ADC clock. A clock that is too fast produces a lot of noise. A clock that is too slow may produce an offset value. An ADC clock rate under 10 kHz is advised, based on limited testing on a G73.
Todo
Make sure this clock range is safe on all cards
In any case, the ADC seems to be clocked at an acceptable frequency at boot time, so there is no need to worry too much about it.
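The divider chain described in this section (a fixed divider of 1024 followed by ADC_CLOCK_DIV) can be sketched as follows. This is only an illustration: `adc_clock_hz` and `is_advised_rate` are hypothetical helpers, the `adc_clock_div` argument is assumed to be the effective divisor rather than the raw register field, and the 10 kHz limit is the advice given above from limited testing on a G73.

```python
FIXED_DIVIDER = 1024  # fixed pre-divider applied before ADC_CLOCK_DIV


def adc_clock_hz(source_clock_hz, adc_clock_div):
    """Return the ADC clock in Hz for a given source clock and divisor.

    adc_clock_div is assumed to be the effective divisor, not the raw
    STATUS field contents.
    """
    if adc_clock_div <= 0:
        raise ValueError("divisor must be positive")
    return source_clock_hz / FIXED_DIVIDER / adc_clock_div


def is_advised_rate(adc_hz, limit_hz=10_000):
    """True if the ADC clock is under the advised 10 kHz limit."""
    return adc_hz < limit_hz
```

The divider bypass mentioned for G70:G80 is not modelled here, since its exact effect on the chain is not described.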
Reading temperature
Temperature is read from:
- MMIO 0x15b4: STATUS [NV43:G70]
- bits 0-7: SENSOR_RAW
- MMIO 0x15b4: STATUS [G70:G80]
- bits 0-13: SENSOR_RAW
SENSOR_RAW is the result of the (signed) addition of the actual value read by the ADC and SENSOR_OFFSET:
- MMIO 0x15b0: CFG0 [NV43:G70]
- bits 16-23: SENSOR_OFFSET signed
- MMIO 0x15b0: CFG0 [G70:G80]
- bits 16-29: SENSOR_OFFSET signed
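Because SENSOR_OFFSET is a signed field, it has to be sign-extended before it can be subtracted from SENSOR_RAW to recover the value read by the ADC. The sketch below shows one way to do that. The helper names are hypothetical, SENSOR_RAW is treated as an unsigned field (an assumption, since its signedness is not stated), and any wrapping or saturation in the hardware addition is not modelled.

```python
# (shift, width) of SENSOR_OFFSET in CFG0, and width of SENSOR_RAW in STATUS.
OFFSET_LAYOUT = {"NV43:G70": (16, 8), "G70:G80": (16, 14)}
RAW_WIDTH = {"NV43:G70": 8, "G70:G80": 14}


def sign_extend(value, width):
    """Interpret the low `width` bits of value as a two's complement integer."""
    sign_bit = 1 << (width - 1)
    return (value & (sign_bit - 1)) - (value & sign_bit)


def sensor_offset(variant, cfg0):
    """Extract CFG0.SENSOR_OFFSET as a signed integer."""
    shift, width = OFFSET_LAYOUT[variant]
    return sign_extend((cfg0 >> shift) & ((1 << width) - 1), width)


def sensor_raw(variant, status):
    """Extract STATUS.SENSOR_RAW (assumed unsigned)."""
    return status & ((1 << RAW_WIDTH[variant]) - 1)


def adc_value(variant, cfg0, status):
    """Undo the offset: SENSOR_RAW = ADC value + SENSOR_OFFSET."""
    return sensor_raw(variant, status) - sensor_offset(variant, cfg0)
```

The results are in raw sensor units; no conversion to degrees is implied.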
Starting temperature readouts requires flipping a few switches that are GPU-dependent:
- MMIO 0x15b0: CFG0 [NV43:G70]
- bit 24: DISABLE
- MMIO 0x15b0: CFG0 [G70:G80]
- bit 30: DISABLE - mutually exclusive with ENABLE
- bit 31: ENABLE - mutually exclusive with DISABLE
- MMIO 0x15b8: CFG1 [NV43:G70]
- bit 17: ADC_PAUSE
- bit 23: CONNECT_SENSOR
Both DISABLE and ADC_PAUSE should be clear. ENABLE and CONNECT_SENSOR should be set.
Todo
There may be other switches.
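The switch settings listed above can be expressed as a pure function that takes the current register values and returns the values to write back. This is a sketch under the assumptions stated in this section only: `enable_values` is a hypothetical helper, CFG1 is touched only for the NV43:G70 layout (the only one with documented CFG1 bits), and `cfg1` may be passed as `None` on IGPs, where CFG1 is not present. Other switches may exist, as noted above.

```python
MASK32 = 0xFFFFFFFF


def enable_values(variant, cfg0, cfg1=None):
    """Return the (cfg0, cfg1) values to write to start temperature readouts."""
    if variant == "NV43:G70":
        cfg0 &= ~(1 << 24) & MASK32      # clear DISABLE
        if cfg1 is not None:
            cfg1 &= ~(1 << 17) & MASK32  # clear ADC_PAUSE
            cfg1 |= 1 << 23              # set CONNECT_SENSOR
    elif variant == "G70:G80":
        cfg0 &= ~(1 << 30) & MASK32      # clear DISABLE
        cfg0 |= 1 << 31                  # set ENABLE
    else:
        raise ValueError("unknown variant: %r" % (variant,))
    return cfg0, cfg1
```

All other bits are preserved, so fields such as ALARM_HIGH and SENSOR_OFFSET keep their current values.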
Setting up thresholds and interrupts
Alarm
THERM can set up an alarm that triggers interrupt PBUS #16 when SENSOR_RAW > ALARM_HIGH. NV43-47 GPUs require ALARM_INTR_EN to be set in order to get the IRQ. You may need to set bits 0x40001 in 0x15a0 and 1 in 0x15a4. Their purpose is not understood yet, though they may be related to automatic downclocking.
- MMIO 0x15b0: CFG0 [NV43:G70]
- bits 0-7: ALARM_HIGH
- bit 28: ALARM_INTR_EN
- MMIO 0x15b0: CFG0 [G70:G80]
- bits 0-13: ALARM_HIGH
When SENSOR_RAW > ALARM_HIGH, STATUS.ALARM_HIGH is set.
- MMIO 0x15b4: STATUS [NV43:G70]
- bit 8: ALARM_HIGH
- MMIO 0x15b4: STATUS [G70:G80]
- bit 16: ALARM_HIGH
STATUS.ALARM_HIGH is unset as soon as SENSOR_RAW < ALARM_HIGH, without any hysteresis cycle.
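A minimal sketch of the alarm setup described above: computing a new CFG0 value with a given ALARM_HIGH threshold, and testing the STATUS.ALARM_HIGH bit. `set_alarm` and `alarm_active` are hypothetical helpers. Setting ALARM_INTR_EN for the whole NV43:G70 layout is an assumption based on that being the only layout with bit 28 documented. The extra bits in 0x15a0 and 0x15a4 are not handled, since their purpose is unknown.

```python
MASK32 = 0xFFFFFFFF
ALARM_WIDTH = {"NV43:G70": 8, "G70:G80": 14}  # width of CFG0.ALARM_HIGH
STATUS_ALARM_BIT = {"NV43:G70": 8, "G70:G80": 16}


def set_alarm(variant, cfg0, threshold):
    """Return cfg0 with ALARM_HIGH replaced by threshold (raw sensor units)."""
    mask = (1 << ALARM_WIDTH[variant]) - 1
    if not 0 <= threshold <= mask:
        raise ValueError("threshold does not fit in ALARM_HIGH")
    cfg0 = (cfg0 & ~mask & MASK32) | threshold
    if variant == "NV43:G70":
        cfg0 |= 1 << 28  # ALARM_INTR_EN, needed to receive the IRQ
    return cfg0


def alarm_active(variant, status):
    """True while STATUS.ALARM_HIGH is set, i.e. SENSOR_RAW > ALARM_HIGH."""
    return bool((status >> STATUS_ALARM_BIT[variant]) & 1)
```

Since there is no hysteresis, a reading that hovers around the threshold can be expected to toggle `alarm_active` repeatedly.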
Temperature range
THERM can check that the temperature is inside a range. When the temperature goes outside this range, an interrupt is sent. The range is defined in the TEMP_RANGE register, where the LOW and HIGH thresholds are set.
- MMIO 0x15bc: TEMP_RANGE [NV43:G70]
- bits 0-7: LOW
- bits 8-15: HIGH
- MMIO 0x15bc: TEMP_RANGE [G70:G80]
- bits 0-13: LOW
- bits 16-29: HIGH
When SENSOR_RAW < TEMP_RANGE.LOW, interrupt PBUS #17 is sent. When SENSOR_RAW > TEMP_RANGE.HIGH, interrupt PBUS #18 is sent.
There are no hysteresis cycles on these thresholds.
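The range check can be illustrated with a short sketch that packs LOW and HIGH into a TEMP_RANGE value and reports which PBUS interrupt a given reading would be expected to raise. `pack_temp_range` and `expected_irq` are hypothetical helpers. All values are in raw sensor units, and the comparison simply mirrors the conditions stated above.

```python
# variant -> (field width in bits, bit position of HIGH); LOW starts at bit 0.
RANGE_LAYOUT = {"NV43:G70": (8, 8), "G70:G80": (14, 16)}


def pack_temp_range(variant, low, high):
    """Build the TEMP_RANGE register value from LOW and HIGH thresholds."""
    width, high_shift = RANGE_LAYOUT[variant]
    limit = (1 << width) - 1
    if not (0 <= low <= limit and 0 <= high <= limit):
        raise ValueError("threshold does not fit in the field")
    return low | (high << high_shift)


def expected_irq(sensor_raw, low, high):
    """Return the PBUS interrupt number for a reading, or None if in range."""
    if sensor_raw < low:
        return 17  # below TEMP_RANGE.LOW
    if sensor_raw > high:
        return 18  # above TEMP_RANGE.HIGH
    return None
```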
Extended configuration
Todo
Document reg 15b8