=====================
Adreno Five Microcode
=====================

.. contents::

.. _afuc-introduction:

Introduction
============

Adreno GPUs prior to 6xx use two microcontrollers to parse the command stream,
set up the hardware for draws (or compute jobs), and do various GPU
housekeeping. They are relatively simple (essentially glorified
register writers), and nearly all of their state lives in a collection
of registers. I.e. there is no stack and no memory assigned to
them; any global state, such as which bank of context registers is to
be used in the next draw, is stored in a register.

The setup is similar to radeon; in fact, Adreno 2xx through 4xx used
basically the same instruction set as r600. There is a "PFP"
(Prefetch Parser) and an "ME" (Micro Engine, also confusingly referred
to as "PM4"). Together these make up the "CP" ("Command Parser"). The
PFP runs ahead of the ME, with some PM4 packets handled entirely
in the PFP. Between the PFP and ME is a FIFO ("MEQ"). In the
generations prior to Adreno 5xx, the PFP and ME had different
instruction sets.

Starting with Adreno 5xx, a new microcontroller with a unified
instruction set was introduced, although the overall architecture
and purpose of the two microcontrollers remain the same.

For lack of a better name, this new instruction set is called
"Adreno Five MicroCode" or "afuc". (No idea what Qualcomm calls
it internally.)

With Adreno 6xx, the separate PFP and ME are replaced with a single
SQE microcontroller using the same instruction set as 5xx.

.. _afuc-overview:

Instruction Set Overview
========================

Afuc is a 32-bit instruction set with basic arithmetic ops that can take
either two source registers or one source register and a 16b immediate.

32 registers, although some are special purpose:

- ``$00`` - always reads zero, otherwise seems to be the PC
- ``$01`` - current PM4 packet header
- ``$1c`` - alias ``$rem``, remaining data in packet
- ``$1d`` - alias ``$addr``
- ``$1f`` - alias ``$data``

Branch instructions have a delay slot, so the instruction following a
branch is always executed, regardless of whether the branch is taken.

.. _afuc-alu:

ALU Instructions
================

The following instructions are available:

- ``add`` - add
- ``addhi`` - add + carry (for upper 32b of 64b value)
- ``sub`` - subtract
- ``subhi`` - subtract + carry (for upper 32b of 64b value)
- ``and`` - bitwise AND
- ``or`` - bitwise OR
- ``xor`` - bitwise XOR
- ``not`` - bitwise NOT (no src1)
- ``shl`` - shift-left
- ``ushr`` - unsigned shift-right
- ``ishr`` - signed shift-right
- ``rot`` - rotate-left (like shift-left with wrap-around)
- ``mul8`` - multiply low 8b of two src
- ``min`` - minimum
- ``max`` - maximum
- ``cmp`` - compare two values

The ALU instructions can take either two src registers, or a src
plus a 16b immediate as the 2nd src, e.g.::

  add $dst, $src, 0x1234   ; src2 is immed
  add $dst, $src1, $src2   ; src2 is reg

The ``not`` instruction only takes a single source::

  not $dst, $src
  not $dst, 0x1234

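The ``add``/``addhi`` pair in the list above is meant for 64b arithmetic
on 32b registers. The Python sketch below models that pairing, assuming
``addhi`` consumes the carry produced by the preceding ``add``. The helper
names (``add32``, ``addhi32``, ``add64``) are illustrative only and do not
correspond to anything in the firmware or tools:

.. code-block:: python

   MASK32 = 0xFFFFFFFF

   def add32(a, b):
       """Model of ``add``: return the 32b sum and the carry-out bit."""
       total = (a & MASK32) + (b & MASK32)
       return total & MASK32, total >> 32

   def addhi32(a, b, carry):
       """Model of ``addhi``: 32b sum of the upper halves plus carry-in."""
       return ((a & MASK32) + (b & MASK32) + carry) & MASK32

   def add64(lo_a, hi_a, lo_b, hi_b):
       """Add two 64b values held as (lo, hi) pairs of 32b registers."""
       lo, carry = add32(lo_a, lo_b)     # add   $lo, $lo_a, $lo_b
       hi = addhi32(hi_a, hi_b, carry)   # addhi $hi, $hi_a, $hi_b
       return lo, hi

The same pattern would apply to ``sub``/``subhi``, with a borrow in place
of the carry.
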
.. _afuc-alu-cmp:

The ``cmp`` instruction returns:

- ``0x00`` if src1 > src2
- ``0x2b`` if src1 == src2
- ``0x1e`` if src1 < src2

See the explanation in :ref:`afuc-branch`.

.. _afuc-branch:

Branch Instructions
===================

The following branch/jump instructions are available:

- ``brne`` - branch if not equal (or bit not set)
- ``breq`` - branch if equal (or bit set)
- ``jump`` - unconditional jump

Both ``brne`` and ``breq`` have two forms, comparing the src register
against either a small immediate (up to 5 bits) or a specific bit::

  breq $src, b3, #somelabel   ; branch if src & (1 << 3)
  breq $src, 0x3, #somelabel  ; branch if src == 3

The branch instructions are encoded with a 16b relative offset.
Since ``$00`` always reads back zero, it can be used to construct
an unconditional relative jump: a ``breq`` that compares ``$00``
against zero is always taken.

The :ref:`cmp <afuc-alu-cmp>` instruction can be paired with the
bit-test variants of ``brne``/``breq`` to implement gt/ge/lt/le,
due to the bit pattern it returns. For example::

  cmp $04, $02, $03
  breq $04, b1, #somelabel

will branch if ``$02`` is less than or equal to ``$03``.

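The three ``cmp`` result values determine which bit test implements which
comparison. The Python sketch below works that out from the documented
return values only; the ``CONDITIONS`` table and the helper names are
derived here for illustration and are not taken from the firmware. It does
not model whether the hardware comparison is signed or unsigned:

.. code-block:: python

   CMP_GT, CMP_EQ, CMP_LT = 0x00, 0x2B, 0x1E

   def cmp(src1, src2):
       """Model of ``cmp`` using the documented return values."""
       if src1 > src2:
           return CMP_GT
       if src1 == src2:
           return CMP_EQ
       return CMP_LT

   # condition -> (branch instruction, bit tested in the cmp result);
   # breq branches if the bit is set, brne if it is clear.
   CONDITIONS = {
       "eq": ("breq", 0),
       "ne": ("brne", 0),
       "le": ("breq", 1),
       "gt": ("brne", 1),
       "lt": ("breq", 2),
       "ge": ("brne", 2),
   }

   def branch_taken(cond, src1, src2):
       """Return True if the cmp + bit-test pair for ``cond`` would branch."""
       op, bit = CONDITIONS[cond]
       bit_is_set = bool((cmp(src1, src2) >> bit) & 1)
       return bit_is_set if op == "breq" else not bit_is_set

The ``le`` entry matches the ``breq $04, b1`` example above: bit 1 is set
in both ``0x2b`` and ``0x1e`` but not in ``0x00``.
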
.. _afuc-call:

Call/Return
===========

Simple subroutines can be implemented with ``call``/``ret``. The
jump instruction encodes a fixed offset.

TODO: it is not clear how many levels deep function calls can be nested,
since there isn't really a stack. There definitely seem to be multiple
levels of function call; see in PFP: ``CP_CONTEXT_SWITCH_YIELD`` ->
``f13`` -> ``f22``.

.. _afuc-control:

Config Instructions
===================

These seem to read/write config state in other parts of CP. In at
least some cases I expect these map to CP registers (though possibly
not directly).

- ``cread $dst, [$off + addr], flags``
- ``cwrite $src, [$off + addr], flags``

In cases where no offset is needed, ``$00`` is frequently used as
the offset.

For example, the following sequence loads the ``CP_INDIRECT_BUFFER``
parameters and writes them to ``CP_IBn_BASE_LO``/``HI``/``BUFSIZE``::

  ; load CP_INDIRECT_BUFFER parameters from cmdstream:
  mov $02, $data   ; low 32b of IB target address
  mov $03, $data   ; high 32b of IB target
  mov $04, $data   ; IB size in dwords

  ; sanity check # of dwords:
  breq $04, 0x0, #l23 (#69, 04a2)

  ; this seems something to do with figuring out whether
  ; we are going from RB->IB1 or IB1->IB2 (ie. so the
  ; below cwrite instructions update either
  ; CP_IB1_BASE_LO/HI/BUFSIZE or CP_IB2_BASE_LO/HI/BUFSIZE)
  and $05, $18, 0x0003
  shl $05, $05, 0x0002

  ; update CP_IBn_BASE_LO/HI/BUFSIZE:
  cwrite $02, [$05 + 0x0b0], 0x8
  cwrite $03, [$05 + 0x0b1], 0x8
  cwrite $04, [$05 + 0x0b2], 0x8

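The ``and``/``shl`` pair in the listing above turns the low two bits of
``$18`` into an offset of 0, 4, 8 or 12, which is then added to the
``cwrite`` addresses. A small Python sketch of that arithmetic follows;
the function and constant names are hypothetical, and what ``$18``
actually holds is still a guess, as the comments in the listing say:

.. code-block:: python

   # cwrite addresses taken from the listing above
   IB_BASE_LO, IB_BASE_HI, IB_BUFSIZE = 0x0B0, 0x0B1, 0x0B2

   def ib_control_addrs(reg18):
       """Return the three addresses the cwrite instructions would hit."""
       off = reg18 & 0x0003   # and $05, $18, 0x0003
       off = off << 2         # shl $05, $05, 0x0002
       return [off + IB_BASE_LO, off + IB_BASE_HI, off + IB_BUFSIZE]

   for level in range(4):
       print(level, [hex(a) for a in ib_control_addrs(level)])

So each value of the two-bit field selects its own group of three
consecutive addresses, spaced four apart.
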
.. _afuc-reg-access:

Register Access
===============

The special registers ``$addr`` and ``$data`` can be used to write GPU
registers. For example, to write::

  mov $addr, CP_SCRATCH_REG[0x2] ; set register to write
  mov $data, $03                 ; CP_SCRATCH_REG[0x2]
  mov $data, $04                 ; CP_SCRATCH_REG[0x3]
  ...

Subsequent writes to ``$data`` increment the address of the register
to write, so a sequence of consecutive registers can be written.

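The auto-increment behaviour on writes can be pictured with a small Python
model. This is only a sketch of the write path described above: the class
name, the ``SCRATCH_BASE`` value and the dictionary standing in for the GPU
register file are all made up for illustration:

.. code-block:: python

   SCRATCH_BASE = 0x100   # placeholder, not the real CP_SCRATCH_REG offset

   class RegWriter:
       """Toy model of writing GPU registers through $addr/$data."""

       def __init__(self):
           self.addr = 0
           self.regs = {}   # stands in for the GPU register file

       def set_addr(self, addr):
           """mov $addr, <register>"""
           self.addr = addr

       def write_data(self, value):
           """mov $data, <value>: write, then advance to the next register."""
           self.regs[self.addr] = value
           self.addr += 1

   w = RegWriter()
   w.set_addr(SCRATCH_BASE + 2)   # mov $addr, CP_SCRATCH_REG[0x2]
   w.write_data(0x11)             # lands in CP_SCRATCH_REG[0x2]
   w.write_data(0x22)             # lands in CP_SCRATCH_REG[0x3]
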
To read::

  mov $addr, CP_SCRATCH_REG[0x2]
  mov $03, $addr
  mov $04, $addr

Many registers that are updated frequently have two banks, so they can be
updated without stalling for the previous draw to finish. These banks are
arranged so that bit 11 is zero for bank 0 and one for bank 1. The ME
firmware (at least the version I'm looking at) stores this in ``$17``, so
to update these registers from ME::

  or $addr, $17, VFD_INDEX_OFFSET
  mov $data, $03
  ...

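The effect of the ``or`` with ``$17`` can be sketched in Python as below.
This assumes ``$17`` holds either zero or just bit 11, which is what the
description above suggests; the register offset used here is a placeholder
and not the real ``VFD_INDEX_OFFSET`` value:

.. code-block:: python

   BANK_BIT = 11
   VFD_INDEX_OFFSET = 0x0042   # placeholder offset for illustration only

   def banked_addr(bank, reg_offset):
       """Return the register address for the given bank (0 or 1)."""
       bank_bits = (bank & 1) << BANK_BIT   # assumed contents of $17
       return bank_bits | reg_offset        # or $addr, $17, <register>

   print(hex(banked_addr(0, VFD_INDEX_OFFSET)))
   print(hex(banked_addr(1, VFD_INDEX_OFFSET)))
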
Note that PFP doesn't seem to use this approach; instead it does something
like::

  mov $0c, CP_SCRATCH_REG[0x7]
  mov $02, 0x789a   ; value
  cwrite $0c, [$00 + 0x010], 0x8
  cwrite $02, [$00 + 0x011], 0x8

As with the ``$addr``/``$data`` approach, the destination register address
increments on each write.

.. _afuc-mem:

Memory Access
=============

There are no load/store instructions as such. The microcontrollers
have only indirect memory access via GPU registers. There are two
possible mechanisms.

Read/Write via CP_NRT Registers
-------------------------------

This seems to be only used by ME. If PFP were also using it, they would
race with each other. It seems to be primarily used for small reads.

- ``CP_ME_NRT_ADDR_LO``/``_HI`` - write to set the address to read or write
- ``CP_ME_NRT_DATA`` - write to trigger write to address in ``CP_ME_NRT_ADDR``

The address register increments with successive reads or writes.

Memory Write example::

  ; store 64b value in $04+$05 to 64b address in $02+$03
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $addr, CP_ME_NRT_DATA
  mov $data, $04
  mov $data, $05

Memory Read example::

  ; load 64b value from address in $02+$03 into $04+$05
  mov $addr, CP_ME_NRT_ADDR_LO
  mov $data, $02
  mov $data, $03
  mov $04, $addr
  mov $05, $addr

Read via Control Instructions
-----------------------------

This is used by PFP whenever it needs to read memory. It also seems to be
used by ME for streaming reads (larger amounts of data). The DMA access
seems to be done by ROQ.

TODO: this might also be possible for write access.

TODO: some of the control commands might be synchronizing access
between PFP and ME.

An example from the ``CP_DRAW_INDIRECT`` packet handler::

  mov $07, 0x0004  ; # of dwords to read from draw-indirect buffer
  ; load address of indirect buffer from cmdstream:
  cwrite $data, [$00 + 0x0b8], 0x8
  cwrite $data, [$00 + 0x0b9], 0x8
  ; set # of dwords to read:
  cwrite $07, [$00 + 0x0ba], 0x8
  ...
  ; read parameters from draw-indirect buffer:
  mov $09, $addr
  mov $07, $addr
  cread $12, [$00 + 0x040], 0x8
  ; the start parameter gets written into MEQ, which ME writes
  ; to VFD_INDEX_OFFSET register:
  mov $data, $addr

A6XX NOTES
==========

The ``$14`` register holds global flags set by:

- ``CP_SKIP_IB2_ENABLE_LOCAL`` - b8
- ``CP_SKIP_IB2_ENABLE_GLOBAL`` - b9
- ``CP_SET_MARKER``

  - ``MODE=GMEM`` - sets b15
  - ``MODE=BLIT2D`` - clears b15, b12, b7

- ``CP_SET_MODE`` - b29+b30
- ``CP_SET_VISIBILITY_OVERRIDE`` - b11, b21, b30?
- ``CP_SET_DRAW_STATE`` - checks b29+b30

``CP_COND_REG_EXEC`` - checks b10, which should be predicate flag?