A journey into the mysteries of N64 RCP: Microcode Optimization: IMEM (Part4)

G_POPMTX

This command pops a matrix which has previously been pushed onto a stack in RDRAM by the G_MTX command.

0xBD000000 (register T9)
0x00000000 (register T8)

In its original version, there is no parameter for this command.

The related code in the Fast3D microcode is the following:

0x24C    SBV    v31[6], 0x01C (SP)
0x250    LW    S3, 0x0024(SP)
0x254    LW    V1, 0x004C(SP)
0x258    ADDI    S4, R0, 0x0360
0x25C    ADDI    S2, R0, 0x003F
0x260    SUB    V1, V1, S3
0x264    ADDI    V1, V1, 0xFD80
0x268    BGEZ    V1, 0x0A8
0x26C    ADDI    S3, S3, 0xFFC0
0x270    JAL    0x13C
0x274    ADDI    S1, R0, 0x0000
0x278    JAL    0x164
0x27C    ADDI    V1, R0, 0x03E0
0x280    J    0x444
0x284    SW    S3, 0x0024(SP)

Let’s go through the code quickly.

0x24C    SBV    v31[6], 0x01C (SP)

This instruction resets a flag, stored in DMEM, that is used for lighting.

0x250    LW    S3, 0x0024(SP)
0x254    LW    V1, 0x004C(SP)

Register S3 is loaded with the current RDRAM address of the stack (the stack pointer).
Register V1 is loaded with the RDRAM address of the stack when it is full.

0x258    ADDI    S4, R0, 0x0360
0x25C    ADDI    S2, R0, 0x003F

Register S4 gets the value 0x0360, which is the DMEM address of the modelview matrix. Register S2 gets the value 0x3F, which is actually the size of the matrix: 64 bytes, but when data is moved from/to RDRAM, 0 counts as well, so 0x40 - 0x01 = 0x3F.

0x260    SUB    V1, V1, S3

This subtraction gives the number of bytes still available in the stack.

0x264    ADDI    V1, V1, 0xFD80

When the stack is empty, the maximum number of bytes available in the stack is supposedly 640 bytes (0x280). Adding 0xFD80 (0x0000 - 0x0280 = 0xFD80 as a 16-bit immediate) subtracts the total size of the stack from the number of bytes still available, which makes it possible to check whether there is still a matrix to pop. Indeed, if the stack is empty, the number of bytes still available and the total size of the stack are equal, so the result is zero.

0x268    BGEZ    V1, 0x0A8

With this instruction, if there is no matrix to pop (V1 is zero or positive), the code exits the command.

0x26C    ADDI    S3, S3, 0xFFC0

The RDRAM address is reduced by 64 bytes (0x0000 - 0x0040 = 0xFFC0). This is the base address from which the matrix data will be retrieved from RDRAM.
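The check performed by the instructions from 0x260 to 0x26C can be modelled in C. This is only a sketch under assumptions: the names `sext16` and `orig_pop_check` are hypothetical, and the model only updates the pointer on the path where it is actually stored back (in the microcode the ADDI sits in the delay slot and always runs).

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 16-bit immediate, as ADDI does. */
static int32_t sext16(uint16_t imm)
{
    return (int32_t)(imm ^ 0x8000) - 0x8000;
}

/*
 * Hypothetical model of the original G_POPMTX check.
 * stack_ptr  : current RDRAM stack pointer (register S3)
 * stack_full : RDRAM address of the stack when it is full (register V1)
 * Returns true and moves *stack_ptr down by 64 bytes when a matrix can
 * be popped; returns false and leaves it untouched when the stack is empty.
 */
static bool orig_pop_check(uint32_t *stack_ptr, uint32_t stack_full)
{
    assert(stack_ptr != NULL);

    /* SUB V1, V1, S3: bytes still available in the stack */
    int32_t v1 = (int32_t)(stack_full - *stack_ptr);

    /* ADDI V1, V1, 0xFD80: subtract the 640-byte stack size */
    v1 += sext16(0xFD80);

    /* BGEZ V1, 0x0A8: nothing to pop, leave the command */
    if (v1 >= 0)
        return false;

    /* ADDI S3, S3, 0xFFC0: step back by one 64-byte matrix */
    *stack_ptr += (uint32_t)sext16(0xFFC0);
    return true;
}
```

With a stack based at 0x1000 (so a "full" address of 0x1280), a pointer of 0x1000 is reported as empty, while a pointer of 0x1040 is popped back to 0x1000.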

0x270    JAL    0x13C

This instruction calls a subroutine, which is used to retrieve/store data from/to RDRAM.

0x274    ADDI    S1, R0, 0x0000

As you may remember from my previous article, S1 sets the direction in which the data is about to move: from or to RDRAM.

0x278    JAL    0x164

This instruction calls a subroutine for DMA processing, from or to RDRAM.

0x27C    ADDI    V1, R0, 0x03E0

V1 gets its value set to 0x3E0. This is where the modelview projection matrix (modelview x projection matrix) is stored in DMEM.

0x280    J    0x444

This instruction jumps to a part of the G_MTX command so that the popped modelview matrix is multiplied with the projection matrix and the result is stored in DMEM at 0x3E0, as set by the previous instruction.

0x284    SW    S3, 0x0024(SP)

Before the jump takes effect (this instruction sits in its delay slot), S3, the new stack pointer, is stored back in DMEM.

From my point of view, this command as it stands is too limited both in its features and in its scope. After some investigation, I came to the conclusion that it would be better to reuse the code that pops a matrix in order to push a matrix as well. Additionally, we will implement a stack not only for the modelview matrix but also for the projection matrix. This means that the part of G_MTX that pushes a matrix should be either rerouted to G_POPMTX (which will be renamed G_PMTX) or emulated through gbi.h.

Here is my implementation in this respect.

0x21C    SBV    $v31[6], +0x01C(SP)
0x220    LH    AT, -0x0004(K1)
0x224    BGTZ    AT, 0x240
0x228    LH    V0, +0x0000(T8)
0x22C    LH    V1, -0x0002(K1)
0x230    BGTZ    V1, 0x23C
0x234    ORI    A0, R0, 0x0300
0x238    ORI    A0, R0, 0x0100
0x23C    BEQ    V0, A0, 0x0A8
0x240    SUB    A1, V0, AT
0x244    BLTZ    A1, 0x0A8
0x248    LW    A2, +0x0024(SP)
0x24C    BGTZ    AT, 0x258
0x250    ADD    S3, A2, A1
0x254    ADD    S3, A2, V0
0x258    BGTZ    V1, 0x264
0x25C    LH    S4, -0x0007(K1)
0x260    ADDI    S3, S3, 0x0300
0x264    SRL    S4, S4, 0x04
0x268    JAL    0x13C
0x26C    ANDI    S2, T9, 0x0FFF
0x270    BGTZ    AT, 0x280
0x274    SH    A1, +0x0000(T8)
0x278    J    0x0A8
0x27C    MTC0    S2, SP write DMA length
0x280    JAL    0x154
0x284    MTC0    S2, SP read DMA length
0x288    J    0x400
0x28C    ADDI    V0, R0, 0x03E0

As you can see, the size of the command has roughly doubled. However, taking into account the fact that the push matrix code of G_MTX uses some instructions as well, the actual increase is only about 30%, which is entirely justified by the fact that the code has to manage two stacks instead of one.

The structure of the command is the following:

0xCCAAABBB
0xDDDDEEEE

CC is the command header, meaning 0xBD.

AAA is the location in DMEM of the matrix that the command will pop or push from/to RDRAM. It is 0x360 for the modelview matrix and 0x3A0 for the projection matrix.

BBB is the size of the matrix to be retrieved from or stored to RDRAM.

DDDD is the number of bytes either to pop or to push. For a push it is 0xFFC0 and for a pop it is 0x0040.

EEEE is the location in DMEM where the current number of bytes used in the stack is stored. It is 0x015C for the modelview stack and 0xF15E for the projection stack.
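To make the 0xCCAAABBB / 0xDDDDEEEE layout concrete, here is a small C sketch that splits the two command words into their fields. The struct and function names (`PmtxFields`, `pmtx_decode`) are hypothetical and only serve the illustration; the microcode itself never decodes the command this way.

```c
#include <assert.h>
#include <stdint.h>

/* Fields of the G_PMTX command, named after the letters used above. */
typedef struct {
    uint8_t  cmd;       /* CC:   command header, 0xBD                   */
    uint16_t dmem_addr; /* AAA:  DMEM address of the matrix             */
    uint16_t dma_len;   /* BBB:  size of the matrix for the DMA         */
    uint16_t delta;     /* DDDD: 0x0040 for a pop, 0xFFC0 for a push    */
    uint16_t size_slot; /* EEEE: DMEM location of the stack size        */
} PmtxFields;

/* Split the two 32-bit command words into their fields. */
static PmtxFields pmtx_decode(uint32_t w0, uint32_t w1)
{
    PmtxFields f;

    f.cmd       = (uint8_t)(w0 >> 24);
    f.dmem_addr = (uint16_t)((w0 >> 12) & 0xFFF);
    f.dma_len   = (uint16_t)(w0 & 0xFFF);
    f.delta     = (uint16_t)(w1 >> 16);
    f.size_slot = (uint16_t)(w1 & 0xFFFF);

    /* This sketch only handles the 0xBD command described here. */
    assert(f.cmd == 0xBD);
    return f;
}
```

For instance, a modelview pop would be the pair 0xBD36003F / 0x0040015C, and a projection push the pair 0xBD3A003F / 0xFFC0F15E.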

So let’s go through the new code.

0x21C    SBV    $v31[6], +0x01C(SP)

The instruction resets a flag used for lighting and stored in DMEM, as before.

0x220    LH    AT, -0x0004(K1)
0x224    BGTZ    AT, 0x240
0x228    LH    V0, +0x0000(T8)

This code loads DDDD into register AT and, if AT is positive (0x0040, the pop case), jumps to 0x240. In the delay slot, the code loads into register V0 the halfword located in DMEM at offset 0x000 from T8, which is the DMEM address given by EEEE. Note that DMEM addresses are only 12 bits wide, so the rest of the word is ignored. V0 therefore holds the current size in bytes of the stack targeted by the command.

0x22C    LH    V1, -0x0002(K1)
0x230    BGTZ    V1, 0x23C
0x234    ORI    A0, R0, 0x0300
0x238    ORI    A0, R0, 0x0100
0x23C    BEQ    V0, A0, 0x0A8

The code loads EEEE into register V1 and, if it is positive, jumps to 0x23C. Because the ORI at 0x234 sits in the delay slot, it is always executed, whereas the ORI at 0x238 only runs when the branch is not taken. So register A0 becomes 0x300 when the stack targeted by the push is the modelview stack (0x015C) or 0x100 when it is the projection stack (0xF15E, which is negative as a signed halfword). A0 holds the maximum size that each stack may have. As you can see, the modelview stack is 12 levels deep (0x40 * 12 = 0x300) and the projection stack is 4 levels deep (0x40 * 4 = 0x100). In total the combined stacks can hold 0x10, so 16 matrices.

Finally, if the current size of the stack targeted by the command (V0) is equal to the maximum size of that stack (A0), it is impossible to push more and the command is skipped.
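The selection of the stack limit and the "stack full" test can be sketched in C as follows. The function names are hypothetical; the sketch relies on BGTZ branching only when the sign-extended halfword is strictly greater than zero, which is why 0x015C selects the modelview limit and 0xF15E the projection limit.

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Sign-extend a 16-bit value, as LH does when loading a halfword. */
static int32_t sext16(uint16_t v)
{
    return (int32_t)(v ^ 0x8000) - 0x8000;
}

/* Maximum size in bytes of the stack selected by EEEE (register A0). */
static uint16_t pmtx_stack_limit(uint16_t eeee)
{
    uint16_t a0 = 0x0300;     /* ORI in the delay slot: always executed */

    if (sext16(eeee) <= 0)    /* BGTZ not taken: projection stack */
        a0 = 0x0100;
    return a0;
}

/* A push is refused when the stack already holds its maximum size. */
static bool pmtx_push_allowed(uint16_t eeee, uint16_t used)
{
    assert(used % 0x40 == 0); /* sizes are multiples of one matrix */
    return used != pmtx_stack_limit(eeee);  /* BEQ V0, A0, 0x0A8 */
}
```

A modelview stack holding 0x2C0 bytes can therefore still take one more matrix, while one holding 0x300 bytes cannot.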

0x240    SUB    A1, V0, AT
0x244    BLTZ    A1, 0x0A8

Register A1 holds the difference between the current size of the stack targeted by the command (V0) and DDDD. When popping a matrix, AT is 0x0040, so the current size of the stack is decreased by 0x40. If the targeted stack is empty, A1 becomes negative and the command is skipped. When pushing a matrix, AT is 0xFFC0 (sign-extended to 0xFFFFFFC0), so the current size of the stack is increased by 0x40 (0x00000000 - 0xFFFFFFC0 = 0x00000040). In other words, A1 is the size of the stack after execution of the current command.
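The SUB/BLTZ pair can be written as a short C function. This is a sketch with hypothetical names (`pmtx_next_size`); it only models the "stack empty" test, since the "stack full" test happens earlier in the command.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Sign-extend a 16-bit value, as LH does when loading DDDD into AT. */
static int32_t sext16(uint16_t v)
{
    return (int32_t)(v ^ 0x8000) - 0x8000;
}

/*
 * used : current size in bytes of the targeted stack (register V0)
 * dddd : 0x0040 for a pop, 0xFFC0 for a push (register AT)
 * Writes the size after the command (register A1) to *new_size and
 * returns true, or returns false when the command has to be skipped.
 */
static bool pmtx_next_size(uint16_t used, uint16_t dddd, int32_t *new_size)
{
    assert(new_size != NULL);

    int32_t a1 = (int32_t)used - sext16(dddd);  /* SUB A1, V0, AT */

    if (a1 < 0)                                 /* BLTZ A1, 0x0A8 */
        return false;
    *new_size = a1;
    return true;
}
```

Popping from a stack of 0x80 bytes gives 0x40, popping from an empty stack is refused, and pushing onto a stack of 0x40 bytes gives 0x80.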

0x248    LW    A2, +0x0024(SP)
0x24C    BGTZ    AT, 0x258
0x250    ADD    S3, A2, A1
0x254    ADD    S3, A2, V0

Register A2 is loaded from DMEM with the RDRAM address at the bottom of BOTH stacks. Note that the projection stack is piled on top of the modelview stack in RDRAM.

When popping a matrix (AT positive), register S3 is equal to the RDRAM address at the bottom of both stacks plus the size of the targeted stack after execution of the command. Indeed, the data to be retrieved lies 0x40 bytes below the current top of the targeted stack.

When pushing a matrix (AT negative), S3 is equal to the RDRAM address at the bottom of both stacks plus the current size of the targeted stack. Indeed, the matrix data is to be stored in RDRAM at the top of the stack, where there is still available space in memory.

0x258    BGTZ    V1, 0x264
0x25C    LH    S4, -0x0007(K1)
0x260    ADDI    S3, S3, 0x0300
0x264    SRL    S4, S4, 0x04

If V1 (EEEE) is positive, the code jumps to 0x264, which means that the targeted stack is the modelview stack. Otherwise, 0x300 is added to the RDRAM address contained in S3: since the stacks are piled up, you need to add 0x300 to the very bottom of the two stacks to reach the bottom of the projection stack. In both cases register S4 ends up holding 0xAAA: the LH in the delay slot loads the halfword 0xAAAB from the command and the SRL shifts out its lowest 4 bits.
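Putting the address computation together, a C sketch could look like the following. The function names are hypothetical, and the sketch assumes that K1 points just past the 8-byte command, so that the halfword at -0x0007(K1) is made of the second and third bytes of the first command word.

```c
#include <assert.h>
#include <stdint.h>

/* Sign-extend a 16-bit value, as LH does. */
static int32_t sext16(uint16_t v)
{
    return (int32_t)(v ^ 0x8000) - 0x8000;
}

/* RDRAM address used for the transfer (register S3). */
static uint32_t pmtx_rdram_addr(uint32_t base, uint16_t used,
                                uint16_t dddd, uint16_t eeee)
{
    int32_t at = sext16(dddd);
    int32_t a1 = (int32_t)used - at;     /* size after the command */

    assert(a1 >= 0);                     /* otherwise the command is skipped */

    uint32_t s3 = base + (uint32_t)a1;   /* pop: delay-slot ADD S3, A2, A1 */
    if (at <= 0)
        s3 = base + used;                /* push: ADD S3, A2, V0 */
    if (sext16(eeee) <= 0)
        s3 += 0x0300;                    /* projection stack sits above */
    return s3;
}

/* DMEM address of the matrix (register S4), taken from command word w0. */
static uint16_t pmtx_dmem_addr(uint32_t w0)
{
    uint16_t half = (uint16_t)(w0 >> 8); /* LH: halfword 0xAAAB */
    return (uint16_t)(half >> 4);        /* SRL S4, S4, 0x04 */
}
```

With both stacks based at 0x1000, a modelview pop from a 0x80-byte stack reads from 0x1040, a modelview push onto the same stack writes to 0x1080, and the first projection push writes to 0x1300.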

0x268    JAL    0x13C
0x26C    ANDI    S2, T9, 0x0FFF

The first instruction calls a subroutine, which is used to retrieve/store data from/to RDRAM. In its delay slot, S2 receives BBB, the number of bytes to be popped or pushed from/to RDRAM.

0x270    BGTZ    AT, 0x280
0x274    SH    A1, +0x0000(T8)
0x278    J    0x0A8
0x27C    MTC0    S2, SP write DMA length
0x280    JAL    0x154
0x284    MTC0    S2, SP read DMA length
0x288    J    0x400
0x28C    ADDI    V0, R0, 0x03E0

The new size of the targeted stack (register A1) is stored in DMEM.

Depending on whether AT (DDDD) is positive or negative (a pop or a push), the code either reads data from RDRAM or writes data to RDRAM. For a pop, a subroutine (JAL 0x154) ensures that the data has indeed been retrieved from RDRAM to DMEM; the code then jumps to a part of the G_MTX command in order to compute the new modelview projection matrix and store it in DMEM at 0x3E0.

Finally we have to adapt the macros in gbi.h.

#define SZ_G_MTX_MODELVIEW           0x015C
#define SZ_G_MTX_PROJECTION           0xF15E
#define D_G_MTX_MODELVIEW             0x360
#define D_G_MTX_PROJECTION           0x3A0

/* Pop one matrix from the stack selected by n. */
#define gSPPopMatrix(pkt, n) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12); \
 \
    _g->words.w1 = _SHIFTL(0x0040, 16, 16) | _SHIFTL(SZ_##n, 0, 16); \
}

#define gsSPPopMatrix(n) \
{{ \
    (_SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12)), \
    (_SHIFTL(0x0040, 16, 16) | _SHIFTL(SZ_##n, 0, 16)) \
}}

/* Pop num matrices from the stack selected by n. */
#define gSPPopMatrixN(pkt, n, num) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12); \
 \
    _g->words.w1 = _SHIFTL((0x0040 * (num)), 16, 16) | _SHIFTL(SZ_##n, 0, 16); \
}

#define gsSPPopMatrixN(n, num) \
{{ \
    (_SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12)), \
    (_SHIFTL((0x0040 * (num)), 16, 16) | _SHIFTL(SZ_##n, 0, 16)) \
}}

/* Push one matrix onto the stack selected by n. */
#define gSPPushMatrix(pkt, n) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12); \
 \
    _g->words.w1 = _SHIFTL(0xFFC0, 16, 16) | _SHIFTL(SZ_##n, 0, 16); \
}

#define gsSPPushMatrix(n) \
{{ \
    (_SHIFTL(G_PMTX, 24, 8) | _SHIFTL(D_##n, 12, 12) | _SHIFTL(0x03F, 0, 12)), \
    (_SHIFTL(0xFFC0, 16, 16) | _SHIFTL(SZ_##n, 0, 16)) \
}}

gSPPopMatrixN is a macro which can pop num matrices at once.
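To check what these macros produce without pulling in gbi.h, the two command words can be rebuilt with plain functions. This is only a sketch: `shiftl`, `pmtx_w0` and `pmtx_w1` are hypothetical helpers that mirror the macro arithmetic, and the value 0xBD for G_PMTX is taken from the command header described earlier.

```c
#include <assert.h>
#include <stdint.h>

#define PMTX_CMD 0xBDu  /* assumed value of G_PMTX (command header CC) */

/* Keep the low w bits of v and shift them left by s bits. */
static uint32_t shiftl(uint32_t v, unsigned s, unsigned w)
{
    assert(w > 0 && w < 32 && s + w <= 32);
    return (v & ((1u << w) - 1u)) << s;
}

/* First command word: 0xCCAAABBB, with BBB fixed to 0x03F. */
static uint32_t pmtx_w0(uint32_t dmem_addr)
{
    return shiftl(PMTX_CMD, 24, 8) | shiftl(dmem_addr, 12, 12) |
           shiftl(0x03F, 0, 12);
}

/* Second command word: 0xDDDDEEEE. */
static uint32_t pmtx_w1(uint32_t delta, uint32_t size_slot)
{
    return shiftl(delta, 16, 16) | shiftl(size_slot, 0, 16);
}
```

A modelview pop then encodes as 0xBD36003F / 0x0040015C, a projection push as 0xBD3A003F / 0xFFC0F15E, and popping three modelview matrices puts 0x00C0 in the DDDD field.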

And voila! :)

Finally, one little comment on the number of matrices in the original code: I do believe that the microcode developers made an error in this respect. In ucode.h you will find that 1024 bytes are allocated for the matrix stack.

/*
 * This is the recommended size of the SP DRAM stack area, used
 * by the graphics ucode. This stack is used primarily for the
 * matrix stack, so it needs to be AT LEAST (10 * 64bytes) in size.
 */
#define    SP_DRAM_STACK_SIZE8    (1024)
#define    SP_DRAM_STACK_SIZE64    (SP_DRAM_STACK_SIZE8 >> 3)

10 * 64 = 640 bytes, not 1024!

1024 bytes corresponds to 16 matrices, which is … 0x10 matrices!

So I would guess there has been a mix up between 10 and 0x10… !!!
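The arithmetic behind this guess is easy to verify with a few lines of C. The helper name `matrix_stack_bytes` is hypothetical; the two constants are the ones quoted from ucode.h.

```c
#include <assert.h>

#define SP_DRAM_STACK_SIZE8   (1024)
#define SP_DRAM_STACK_SIZE64  (SP_DRAM_STACK_SIZE8 >> 3)

enum { MTX_BYTES = 64 };  /* size in bytes of one matrix on the stack */

/* Number of bytes needed for a matrix stack of the given depth. */
static unsigned matrix_stack_bytes(unsigned levels)
{
    assert(levels > 0);
    return levels * MTX_BYTES;
}
```

A depth of 10 needs 640 bytes (0x280, the value checked by the original G_POPMTX), whereas a depth of 0x10 needs exactly the 1024 bytes defined by SP_DRAM_STACK_SIZE8.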

The original Fast3D microcode did indeed limit the depth of the matrix stack to 10, but from F3DEX onwards this limitation has actually been removed… :)

Next time we will tackle G_MOVEMEM, which requires some major changes in the code to become much more efficient than the current implementation. Stay tuned!

Saturday, 14 March 2020
