lundi 6 avril 2020

Microcode Optimization: IMEM (Part5)

G_MOVEMEM

This immediate command is used to retrieve data from RDRAM to DMEM, which can then be further used by other immediate commands. It must be noticed that other immediate commands retrieve data from RDRAM to DMEM, G_VTX and G_MTX.

The way that Fast3D microcode manages the process to move data from RDRAM to DMEM is, at least from my point of view, messy and inefficient. In order to amend the situation, we will have to carry out some serious changes.

Let’s go through G_MOVEMEM code and thus as from the beginning of the execution of the command.

0x03800010 (register T9)
0x00220D40 (register T8)

First of all any immediate command goes through a code which dispatches the command to another place in IMEM from where it is actually executed. Where some data must be retrieved from RDRAM, such an operation is carried out before such a dispatch occurs.

Let’s analyze shortly this dispatch code therefore:

0x060    LW         T9, +0x0000(K1)
0x064    LW         T8, +0x0004(K1)
0x068    SRL        AT, T9, 0x1D
0x06C    ANDI    AT, AT, 0x0006
0x070    ADDI     K0, K0, 0x0008
0x074    ADDI     K1, K1, 0x0008
0x078    ADDI     GP, GP, 0xFFF8
0x07C    BGTZ    AT, 0x09C
0x080    ANDI     S2, T9, 0x01FF
0x084    ADDI     S6, R0, 0x7E0
0x088    JAL       0x11C
0x08C    ADD     S3, T8, R0
0x090    ADD     S4, R0, S6
0x094    JAL       0x13C
0x098    ADDI    S1, R0, 0x0000
0x09C    LH       V0, +0x00BC(AT)
0x0A0    JR        V0
0x0A4    SRL     V0, T9, 0x17

0x060    LW    T9, +0x0000(K1)
0x064    LW    T8, +0x0004(K1)

We load in T9 and T8 the two words composing the command. K1 is the pointer where such a command is in the displaylist which has been previously stored in DMEM.

0x068    SRL      AT, T9, 0x1D
0x06C   ANDI    AT, AT, 0x0006
0x070    ADDI    K0, K0, 0x0008
0x074    ADDI    K1, K1, 0x0008
0x078    ADDI    GP, GP, 0xFFF8
0x07C   BGTZ    AT, 0x09C

By doing a shift right by 0x1D of T9, it is possible to assess whether the header (the fist byte) of the command is below or above 0x20. In case it is above 0x20, the code jumps to 0x09C. Indeed in gbi.h you may see that DMA commands (commands retrieving data from RDRAM to DMEM) are as follow:

* The command format is
*
*    |00xxxxxx| = DMA        0,..,127

So if it is above 0x20 we are in presence of a command which does need to get data from RDRAM.

The rests of the instructions are just to move some registers containing counters/pointers, as for instance the one related to the displaylist, K1.

0x080    ANDI    S2, T9, 0x01FF
0x084    ADDI    S6, R0, 0x7E0
0x088    JAL      0x11C
0x08C   ADD     S3, T8, R0
0x090    ADD    S4, R0, S6
0x094    JAL      0x13C
0x098    ADDI   S1, R0, 0x0000

S2 is the number of bytes to be retrieved from RDRAM.
S3 is the RDRAM address or a segment number and an offset from where the data are to be retrieved from.
S4, which gets the same value than S6, is the address in DMEM where such data are to be stored.
S1 sets the direction of the data, meaning from or to RDRAM.

There is however two sub-routines called, which have as well to be apprehended:

0x11C    LW        T3, +0x00B8(R0)
0x120    SRL       T4, S3, 0x16
0x124    ANDI    T4, T4, 0x003C
0x128    AND     S3, S3, T3
0x12C    ADD    T5, R0, T4
0x130    LW       T4, +0x0160(T5)
0x134    JR         RA
0x138    ADD    S3, S3, T4

This routine mainly manages the memory segmentation.

0x11C    LW    T3, +0x00B8(R0)

T3 = 0x00FFFFFF. Such a value is used as a mask to get rid of the 1st byte of the T8, which contains the RDRAM address.

0x120    SRL       T4, S3, 0x16
0x124    ANDI     T4, T4, 0x003C
0x128    AND      S3, S3, T3

The first byte of the RDRAM address set in the second word of the command may contain a value corresponding to the number of the segment where the data is located, the lower bytes of the command containing only the offset. By doing a shift right of S3 we get in register T4 this number multiply by 4. As there is only 16 segments, the code limits such a value to 0x0F*0x04 = 0x3C

Note that where an actual RDRAM address is set in the second word of the command, the fist byte would be then 0x00, which is for the physical addressing (see 11.1.2 Segmented Memory and the RSP Memory Map in the N64 Programming Manual).

The first byte of S3 is set to 0x00 thanks to the mask loaded in T3.

0x12C    ADD    T5, R0, T4
0x130     LW    T4, +0x0160(T5)

From there the segment is loaded in T4. Those are located in DMEM as from address 0x160. As you certainly understand it is normal that actually the segment number is multiplied by 4 as a word is composed of 4 bytes :) As you may imagine, the value of the 1st segment for physical addressing is 0x00000000.

0x134    JR    RA
0x138    ADD    S3, S3, T4

The subroutine is exited, but before the S3 gets the RDRAM computed with as segment + offset. In case of physical addressing, it is simply what the address set in the second word of the command.

The second subroutine was already explained in my previous article. It runs the DMA processing from/to RDRAM from/to DMEM.

Finally the dispatch code continues.

0x09C    LH      V0, +0x00BC(AT)
0x0A0    JR       V0
0x0A4    SRL    V0, T9, 0x17

Register V0 gets the IMEM address of the immediate commands to run. There is indeed in DMEM a table containing such addresses which is set at the loading of the microcode in the RSP. The code jumps then to this IMEM address.

In case of G_MOVEMEM or any DMA immediate commands, the code goes to PC 0x3C4.

0x3C4    ANDI    V0, V0, 0x01FE
0x3C8    LH         V0, +0x00C4(V0)
0x3CC    JAL       0x164
0x3D0    LBU      AT, -0x0007(K1)
0x3D4    JR          V0
0x3D8    ANDI    A2, AT, 0x000F

Without going to details, the code called a subroutine ensured that all data were retrieved from RDRAM, sets some registers with a value part of the processed immediate command and jump to another part of IMEM, following another table stored in DMEM.

In case of G_MOVEMEM, the code goes to PC 0x558.

0x558    LQV    $v0[0], +0x000(S6)
0x55C    LH    A1, +0x0270(AT)
0x560    J    0x10A8
0x564    SQV    $v0[0], +0x000(A1)

Without going to details, the code loads four words at once from the scratch buffer and stores them to DMEM according to a table stored in DMEM.

As you can see, in order to move 4 words to RDRAM to DMEM, a lot of RSP instructions have to be executed!!! It may be worth noticing that it is  even worth for G_MTX.

After investigations, I came to the conclusion that the whole code explained above has to be severely revised. The main idea is to direct the data to the right place in DMEM at once, avoiding to first move data to a scratch buffer to then move it back to the right place.

We will start by changing the way the code dispatches its execution in IMEM. As we have just seen, it is done by the way of tables stored in DMEM at the initialization of the microcode. Those tables are located in gdmem.h, at the JUMP TABLES section.

For unknown reasons, the structure of those jump tables is messy, with many comments stating “not implemented" but yet still using space in DMEM.

I changed such a section of gdmem.h as follow:

#====================================================================
 # setup a jump table
 #====================================================================

JMP_OFFSET:

.half   GfxDone                     #G_MOVEMEM
.half   GfxDone                     #G_SPNOOP
.half   case_G_MTX
.half   case_G_VTX
.half   case_G_TRI1
.half   case_G_CULLDL
.half   case_G_PMTX
.half   case_G_MOVEWORD     
.half   case_G_DL   
.half   case_G_SETOTHERMODE_H
.half   case_G_SETOTHERMODE_L
.half   case_G_ENDDL
.half   case_G_SETGEOMETRYMODE
.half   case_G_CLEARGEOMETRYMODE
.half   case_G_RDPHALF_2
.half   case_G_RDPHALF_1

There is only one jump table, with no distinction as such between DMA command and immediate commands.

Doing such a cleanup has as consequence to reduce the size in DMEM of the jump tables, which is always nice. However the structure of the DMEM is impacted, meaning that the DMEM address in gbi.h may have to be changed.

The header of the command has been changed to the following:

#define    G_MOVEMEM                                  0x00
#define    G_SPNOOP                                      0x02
#define    G_MTX                                              0x04
#define    G_VTX                                               0x06
#define    G_TRI1                                              0x08
#define    G_CULLDL                                        0x0A
#define    G_PMTX                                            0x0C
#define    G_MOVEWORD                                0x0E
#define    G_DL                                                0x10
#define    G_SETOTHERMODE_H                    0x12
#define    G_SETOTHERMODE_L                    0x14
#define    G_ENDDL                                        0x16
#define    G_SETGEOMETRYMODE                0x18
#define    G_CLEARGEOMETRYMODE           0x1A
#define   G_RDPHALF_1                                0x1C
#define   G_RDPHALF_2                                0x1E


You may notice that the commands are only even number. It is because each address stored in DMEM as jump table takes two bytes.

Here the related implementation of the dispatch code.

0x05C    ADDI       K1, R0, 0x06A0
0x060    LW           T9, +0x0000(K1)
0x064    LW           T8, +0x0004(K1)
0x068    SRL          AT, T9, 0x18
0x06C    ADDI       K0, K0, 0x0008
0x070    ADDI        K1, K1, 0x0008
0x074    BLTZ        T9, 0x0330
0x078    ADDI        GP, GP, 0xFFF8
0x07C    ADDI       V0, AT, 0xFFFE
0x080    BGEZ       V0, 0x00A0
0x084    NOP
0x088    JAL           0x1124
0x08C    SLL          V0, AT, 0x1F
0x090    JAL           0x1140
0x094    ANDI        AT, AT, 0x00FE
0x098    MTC0       S2, SP read DMA length
0x09C    BGEZAL  V0, 0x0164
0x0A0    LH           AT, +0x00C0(AT)
0x0A4    JR           AT
0x0A8    NOP

The RDP commands are dispatched thanks to their sign, being all negative.

0x074    BLTZ        T9, 0x0330

Then as all DMA transfers (except for G_PMTX) are to be managed by G_MOVEMEM (even for case G_VTX and G_MTX), where the command is equal or above 0x02, the command skips the DMA process and goes to the appropriate IMEM address provided by the jump table .

0x068    SRL        AT, T9, 0x18                 # AT=header of the command

0x07C    ADDI        V0, AT, 0xFFFE        # V0 = AT - 2
0x080    BGEZ        V0, 0x00A0               # If equal or above 0, go to 0x0A0
0x084    NOP

0x0A0    LH        AT, +0x00C0(AT)      # load the IMEM address from the jump table
0x0A4    JR        AT                             # jump to such an IMEM address
0x0A8    NOP

If the command is 0x00 or 0x01, meaning G_MOVEMEM, then the code jumps to the segmentation code which has been completely revised.

0x088    JAL        0x1124

0x124    SLL      S3, T8, 0x4               
0x128    SRL      S3, S3, 0x1A
0x12C    LW       S3, +0x0160(S3)
0x130    SLL      T8, T8, 0x8
0x134    SRL      T8, T8, 0x8
0x138    JR         RA
0x13C    ADD     S3, T8, S3

First we get rid of the higher 4 bits of the command in case the address contained in the second word of the command would be a physical one (0x80XXXXXX)

0x124    SLL    S3, T8, 0x4

The highest 4 bits is then multiplied by 4 and then the segment address from DMEM is loaded.
           
0x128    SRL    S3, S3, 0x1A   
0x12C    LW    S3, +0x0160(S3)

We get rid then of the highest byte of the second word of the command.

0x130    SLL    T8, T8, 0x8
0x134    SRL    T8, T8, 0x8

Finally the actual RDRAM address is obtained by adding the segment by the offset set in the lower bytes of the second word of the command.

The code goes then to a second subroutine.

0x090    JAL        0x1140

0x140    LH         S4, -0x0007(K1)
0x144    SRL       S4, S4, 0x4
0x148    ANDI    S2, T9, 0x03FF
0x14C    MFC0   T3, SP DMA full
0x150    BNE      T3, R0, 0x014C
0x154    NOP
0x158    MTC0    S4, SP memory address
0x15C    JR          RA
0x160    MTC0    S3, SP DRAM DMA address

Only the 3 first instructions are new, the rest having been explained in previous articles.

Any command retrieving data will have to be structured in this way:

0xCCAAABBB
0xSSRRRRRR

CC: header of the command
AAA: DMEM address where data are to be processed to/from
BBB: size of the transfer
SS: segment number
RRRRRR: physical RDRAM address or offset.

As you already understand the DMEM address is set in register S4:

0x140    LH      S4, -0x0007(K1)
0x144    SRL    S4, S4, 0x4

And the size of the transfer in S2:

0x148    ANDI    S2, T9, 0x03FF

Finally the DMA transfer starts by the following command

0x098    MTC0        S2, SP read DMA length
Finally remains few instructions unexplained

0x08C    SLL                 V0, AT, 0x1F
0x094    ANDI               AT, AT, 0x00FE
0x09C    BGEZAL        V0, 0x0164

Those instructions are used to implement the optional double DMA transfer.

G_MOVEMEM is normally 0x00. In such a case it means that DMA transfer must be completed before continuing the execution of the command. Where it is 0x01, the codes continues without checking that the DMA transfer is indeed completed (note: may be it would be worth creating a DMAWAIT command)

0x08C    SLL        V0, AT, 0x1F

Register V0 will be 0x80000000 in case the command is 0x01 and 0x00000000 in case the command is 0x00.

0x09C    BGEZAL        V0, 0x0164

In case register V0 is not negative, the code goes to a part of the code to wait that DMA transfer is completed.

0x094    ANDI        AT, AT, 0x00FE

We have to ensure that the code fetches the appropriate IMEM address in the jump table so we do have to even the header of the command (0x01 becomes 0x00).

G_MOVEMEM is actually over so the jump table simply routes the code to the next command to be executed!

With this approach of having only one command to move data from/to RDRAM, G_MTX and G_VTX are impacted and we have to emulate those commands in gbi.h.

For instance gSPVertex:

#define D_SCRATCH     0x7E0    (0x7E0 is the place in DMEM of the scratch buffer).

# define    gSPVertex(pkt, v, n, v0)
       
{                                           
  Gfx *_g = (Gfx *)(pkt);                  
                                           
    _g->words.w0 = _SHIFTL(G_MOVEMEM, 24, 8) | _SHIFTL(D_SCRATCH, 12, 12) | _SHIFTL(((n*0x10)-1), 0, 12);   
                                          
    _g->words.w1 = (unsigned int)(v);      
};                                           
{                                           
  Gfx *_g = (Gfx *)(pkt);               
                                           
    _g->words.w0 = _SHIFTL(G_VTX, 24, 8) | _SHIFTL(0x000000, 0, 24);                                       
                                              _g->words.w1 = _SHIFTL((n), 16, 16) | _SHIFTL(((INPUTBUFFER) + ((v0) * 0x28)), 0, 16);
}               

Obviously we should create new macros as for instance

# define gSPLoadVertices(pkt, v, n, d)     
{                                         
  Gfx *_g = (Gfx *)(pkt);                
                                         
    _g->words.w0 = _SHIFTL(((G_MOVEMEM) + (DMAWAIT_##d)), 24, 8) | _SHIFTL(D_SCRATCH, 12, 12) | _SHIFTL(((n*0x10)-1), 0, 12); 

    _g->words.w1 = (unsigned int)(v);    
}

# define gsSPLoadVertices(v, n, d)

{{
  (_SHIFTL(((G_MOVEMEM) + (DMAWAIT_##d)), 24, 8) | _SHIFTL(D_SCRATCH, 12, 12) | _SHIFTL(((n*0x10)-1), 0, 12)), ((unsigned int)(v))
}}

# define gSPTransformVertices(pkt, n, v0)
{                                          
  Gfx *_g = (Gfx *)(pkt);                  
                                          
    _g->words.w0 = _SHIFTL(G_VTX, 24, 8) | _SHIFTL(0x000000, 0, 24);                                      
                                            
    _g->words.w1 = _SHIFTL((n), 16, 16) | _SHIFTL(((INPUTBUFFER) + ((v0) * 0x28)), 0, 16);                   
}

# define gsSPTransformVertices(n, v0)
{{                                        
  (_SHIFTL(G_VTX, 24, 8) | _SHIFTL(0x000000, 0, 24)), (_SHIFTL((n), 16, 16) | _SHIFTL(((INPUTBUFFER) + ((v0) * 0x28)), 0, 16))                   
}}

Of course all gbi macros doing DMA transferred had to be changed to match with this new implementation.

Thanks to those changes 20 IMEM instructions were saved but much more will come in the next phase.. :)

As well some changes in the microcode for G_MTX and G_VTX have been necessary to manage such an implementation, on top other interesting changes which I will explain in my next article.

Stay tuned!

4 commentaires:

  1. I love your work even understanding practically nothing. By the way do you know something about VHDL or verilog??? Waiting for next post

    RépondreSupprimer
  2. Fascinant quand on s'interesse a l'optim en général, mais un peu rude quand on est pas coutumier de l'assembleur ^^.
    Je me demandais : tu prévois de faire un bench pour comparer les perfs entre ton ucode et l'officiel nintendo quand tu seras arrivé a ton objectif ? (je parles en français, il me semble t'avois vu sur des forums francophone).
    ps : merci pour tout ton taff sur les ucodes n64.
    Un utilisateur régulier de gliden64.

    RépondreSupprimer
    Réponses
    1. Je vais essayer de faire un benchmark en effet mais celui peut ne rien révéler car c est surtout le RDP qui est limité dans ces performances. Je souhaite atteindre les perfs de F3DEX2 avec plus de fonctionnalités pour le moment. Après il faudra voir comment se passer du Z buffer sans changer dramatiquement le microcode. Je ne sais pas si cela est faisable et encore moins sans passer par le CPU.

      Supprimer