.. _vram:

================
Memory structure
================

.. contents::


Introduction
============

While DRAM is often treated as a flat array of bytes, its internal structure
is far more complicated. A good understanding of it is necessary for
high-performance applications like GPUs.

Looking roughly from the bottom up, VRAM is made of:

1. Memory planes of R rows by C columns, with each cell being one bit
2. Memory banks made of 32, 64, or 128 memory planes used in parallel - the
   planes are usually spread across several chips, with one chip containing
   16 or 32 memory planes
3. Memory ranks made of several [2, 4 or 8] memory banks wired together and
   selected by address bits - all banks for a given memory plane reside in
   the same chip
4. Memory subpartitions made of one or two memory ranks wired together and
   selected by chip select wires - ranks behave similarly to banks, but don't
   have to have uniform geometry, and are in separate chips
5. Memory partitions made of one or two somewhat independent subpartitions
6. The whole VRAM, made of several [1-8] memory partitions


Memory planes and banks
=======================

The most basic unit of DRAM is a memory plane, which is a 2D array of bits
organised in so-called columns and rows:

::

         column
    row  0  1  2  3  4  5  6  7
    0    X  X  X  X  X  X  X  X
    1    X  X  X  X  X  X  X  X
    2    X  X  X  X  X  X  X  X
    3    X  X  X  X  X  X  X  X
    4    X  X  X  X  X  X  X  X
    5    X  X  X  X  X  X  X  X
    6    X  X  X  X  X  X  X  X
    7    X  X  X  X  X  X  X  X

    buf  X  X  X  X  X  X  X  X

A memory plane contains a buffer, which holds a whole row. Internally, DRAM is
read/written in row units via the buffer. This has several consequences:

- before a bit can be operated on, its row must be loaded into the buffer,
  which is slow
- after a row is done with, it needs to be written back to the memory array,
  which is also slow
- accessing a new row is thus slow, and even slower when there already is
  an active row
- it's often useful to preemptively close a row after some inactivity time -
  this operation is called "precharging" a bank
- different columns in the same row, however, can be accessed quickly
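
The cost asymmetry listed above can be illustrated with a toy model of a
single bank. This is only a sketch: the ``Bank`` class and the three cycle
costs are made-up values chosen for illustration, not timings of any real
DRAM part.

```python
# Toy model of one DRAM bank with a single row buffer.
# All costs are hypothetical illustrative numbers, not real timings.
ACTIVATE_COST = 10   # load a row into the buffer (assumed)
PRECHARGE_COST = 10  # write the active row back to the array (assumed)
COLUMN_COST = 1      # access a column of the active row (assumed)


class Bank:
    def __init__(self):
        self.active_row = None  # None means the bank is precharged

    def access(self, row):
        """Return the cost of accessing any column of `row`."""
        cost = COLUMN_COST
        if self.active_row != row:
            if self.active_row is not None:
                cost += PRECHARGE_COST  # close the old row first
            cost += ACTIVATE_COST       # then open the new one
            self.active_row = row
        return cost

    def precharge(self):
        """Close the active row ahead of time; return the cost paid now."""
        if self.active_row is None:
            return 0
        self.active_row = None
        return PRECHARGE_COST


bank = Bank()
# First access opens row 3, the next two hit the open row,
# and the last one has to close row 3 before opening row 5.
costs = [bank.access(row) for row in (3, 3, 3, 5)]
```

In this model, calling ``precharge()`` during idle time moves the write-back
cost off the next access to a different row, which is the motivation for
closing rows early.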

Since loading the column address itself takes more time than actually
accessing a bit in the active buffer, DRAM is accessed in bursts - a series of
accesses to 1-8 neighbouring bits in the active row. Usually all bits in
a burst have to be located in a single aligned 8-bit group.
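
The alignment rule can be expressed as a small check. This is a minimal
sketch assuming a burst group of exactly 8 columns; the helper names
``burst_base`` and ``fits_one_burst`` are invented for this example.

```python
BURST_LEN = 8  # size of the aligned column group, assumed to be 8


def burst_base(column):
    """First column of the aligned group that contains `column`."""
    return column & ~(BURST_LEN - 1)


def fits_one_burst(columns):
    """True if all given columns fall into a single aligned group."""
    return len({burst_base(c) for c in columns}) == 1
```

For instance, columns 8, 9 and 15 share one group, while columns 7 and 8 are
neighbours but straddle a group boundary and so need two bursts.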

The number of rows and columns in a memory plane is always a power of two, and
is measured by the count of row selection and column selection bits [i.e. log2
of the row/column count]. There are typically 8-10 column bits and 10-14 row
bits.
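
As a quick worked example, the capacity of a plane follows directly from the
two bit counts. The function name ``plane_bits`` is made up for this sketch;
the bit counts used are the extremes of the typical ranges quoted above.

```python
def plane_bits(row_bits, column_bits):
    """Number of one-bit cells in a plane with the given selection bits."""
    return (1 << row_bits) * (1 << column_bits)


small = plane_bits(10, 8)    # 1024 rows x 256 columns
large = plane_bits(14, 10)   # 16384 rows x 1024 columns
```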

The memory planes are organised in banks - groups of a power-of-two number
of memory planes. The memory planes are wired in parallel, sharing the address
and control wires, with only the data / data enable wires separate. This
effectively makes a memory bank like a memory plane that's composed of
32/64/128-bit memory cells instead of single bits - all the rules that apply
to a plane still apply to a bank, except larger units than a bit are operated
on.

A single memory chip usually contains 16 or 32 memory planes for a single
bank, thus several chips are often wired together to make wider banks.


Memory banks, ranks, and subpartitions
======================================

A memory chip contains several [2, 4, or 8] banks, using the same data wires
and multiplexed via bank select wires. While switching between banks is
slightly slower than switching between columns in a row, it's much faster
than switching between rows in the same bank.

A memory rank is thus made of ``(MEMORY_CELL_SIZE / MEMORY_CELL_SIZE_PER_CHIP)``
memory chips.
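
The chip count, and the resulting rank capacity, can be computed as below.
This is a sketch only: the function names and the example geometry [64-bit
cells built from 32-bit chips, 8 banks, 12 row bits, 9 column bits] are
assumptions picked for illustration, not a description of a specific card.

```python
def chips_per_rank(cell_bits, cell_bits_per_chip):
    """Number of chips needed to build cells of `cell_bits` bits."""
    if cell_bits % cell_bits_per_chip:
        raise ValueError("cell size must be a multiple of the per-chip size")
    return cell_bits // cell_bits_per_chip


def rank_bytes(cell_bits, bank_count, row_bits, column_bits):
    """Capacity of a rank in bytes: banks x rows x columns x cell size."""
    cells = bank_count << (row_bits + column_bits)
    return cells * cell_bits // 8


chips = chips_per_rank(64, 32)
size = rank_bytes(64, bank_count=8, row_bits=12, column_bits=9)
```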

One or two memory ranks sharing all wires [including data] except for the
chip select wire make up a memory subpartition. Switching between ranks
has basically the same performance consequences as switching between banks
in a rank - the only differences are the physical implementation and
the possibility of using a different number of row selection bits for each
rank [though bank count and column count have to match].

The consequences of having several banks/ranks:

- it's important to ensure that data accessed together belongs to either
  the same row, or to different banks [to avoid row switching]
- tiled memory layouts are designed so that a tile corresponds roughly to
  a row, and neighbouring tiles never share a bank
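
One way to get the "neighbouring tiles never share a bank" property is
sketched below. The mapping ``(tx + 2 * ty) % 4`` and the bank count of 4 are
assumptions chosen for illustration; actual GPUs use their own, family-specific
schemes.

```python
BANKS = 4  # assumed bank count for this sketch


def tile_bank(tx, ty):
    """Hypothetical bank assignment for the tile at coordinates (tx, ty)."""
    return (tx + 2 * ty) % BANKS


def neighbours(tx, ty):
    """The 8 tiles surrounding (tx, ty), including diagonal ones."""
    return [(tx + dx, ty + dy)
            for dx in (-1, 0, 1)
            for dy in (-1, 0, 1)
            if (dx, dy) != (0, 0)]


# Check an 8x8 patch of tiles: no tile shares a bank with any neighbour.
conflict_free = all(
    tile_bank(tx, ty) != tile_bank(nx, ny)
    for tx in range(8)
    for ty in range(8)
    for nx, ny in neighbours(tx, ty)
)
```

The check passes because the bank difference between a tile and any of its
neighbours is ``dx + 2 * dy``, which is never a multiple of 4 for offsets in
the range -1 to 1.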


Memory partitions and subpartitions
===================================

A memory subpartition has its own DRAM controller on the GPU. One or two
subpartitions make up a memory partition, which is a fairly independent entity
with its own memory access queue, own ZROP and CROP units, and own L2 cache
on later cards. All memory partitions taken together with the crossbar logic
make up the entire VRAM logic for a GPU.

All subpartitions in a partition have to be configured identically. Partitions
in a GPU are usually configured identically, but don't have to be on newer
cards.

The consequences of having several subpartitions/partitions:

- like banks, different partitions may be utilised to avoid row conflicts for
  related data
- unlike banks, bandwidth suffers if (sub)partitions are not utilised equally
  - load balancing is thus very important
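
The effect of access patterns on load balancing can be sketched as follows.
The partition count of 4 and the choice of address bits 8-9 as partition
selection bits are assumptions made for this example only.

```python
from collections import Counter

PARTITION_SHIFT = 8  # assumed position of the partition selection bits
PARTITION_COUNT = 4  # assumed number of partitions


def partition_of(addr):
    """Hypothetical partition index for a byte address."""
    return (addr >> PARTITION_SHIFT) % PARTITION_COUNT


def partition_load(addresses):
    """Count how many of the given accesses land in each partition."""
    return Counter(partition_of(a) for a in addresses)


# 16 accesses with a 256-byte stride rotate through all partitions,
# while a 1024-byte stride sends every access to the same partition.
balanced = partition_load(range(0, 16 * 256, 256))
skewed = partition_load(range(0, 16 * 1024, 1024))
```

With the second pattern only one partition's bandwidth is used, while the
other three sit idle.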


Memory addressing
=================

While memory addressing is highly dependent on GPU family, the basic approach
is outlined here.

The bits of a memory address are, in sequence, assigned to:

- identifying a byte inside a memory cell - since whole cells always have to
  be accessed anyway
- several column selection bits, to allow for a burst
- partition/subpartition selection - in low bits to ensure good load
  balancing, but not too low to keep relatively large tiles in a single
  partition for ROP's benefit
- remaining column selection bits
- all/most of bank selection bits, sometimes a rank selection bit - so that
  immediately neighbouring addresses never cause a row conflict
- row bits
- remaining bank bit or rank bit - effectively allows splitting VRAM into two
  areas, placing color buffer in one and zeta buffer in the other, so that
  there are never row conflicts between them
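
The bit assignment order above can be sketched as a simple decoder. The field
widths in ``FIELDS`` are hypothetical [a 64-bit cell, 8-column bursts,
4 partitions, 9 column bits, 8 banks, 12 row bits]; real layouts differ
between GPU families.

```python
# (name, width) pairs, lowest address bits first.
# All widths are assumptions made for illustration.
FIELDS = [
    ("byte", 3),        # byte inside a 64-bit memory cell
    ("column_lo", 3),   # low column bits, covering one burst
    ("partition", 2),   # partition/subpartition selection
    ("column_hi", 6),   # remaining column bits
    ("bank_lo", 2),     # most of the bank selection bits
    ("row", 12),        # row bits
    ("bank_hi", 1),     # remaining bank bit, splits VRAM into two areas
]
WIDTHS = dict(FIELDS)


def decode(addr):
    """Split a byte address into its DRAM coordinates."""
    parts = {}
    for name, width in FIELDS:
        parts[name] = addr & ((1 << width) - 1)
        addr >>= width
    # Reassemble the fields that were split across the address.
    column = (parts.pop("column_hi") << WIDTHS["column_lo"]) | parts.pop("column_lo")
    bank = (parts.pop("bank_hi") << WIDTHS["bank_lo"]) | parts.pop("bank_lo")
    parts["column"] = column
    parts["bank"] = bank
    return parts
```

With these widths, ``decode(0x40)`` differs from ``decode(0)`` only in the
partition, and addresses with bit 28 set fall into the upper half of the
banks, which is the split that could keep a color buffer and a zeta buffer
apart.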