samedi 1 février 2020

Microcode Optimization: IMEM (Part1)

Before implementing new features, it is necessary to ensure that the current microcode is optimized in a way that it uses as less as possible RSP cycles. On top on accelerating the RSP processing, that would save space of a very limited size of the IMEM (only 4kb!), the cache memory where its runs.

As previously announced the original Fast3D code (the one used by Super Mario 64) is not very optimized, most likely because its development was somehow rushed to provide as soon as possible developers the necessary graphic library to build their game.

In this part, we will focus on one of the so called “immediate” RSP graphic commands, G _TRI1 (0xBF).

I will first explain the current code and its purpose and then present the way I took to improve it. Obviously it may occur that some of the readers of this blog may have even better ideas, for which I would really be happy to discover and discuss.

G_TRI1 is a command (gSP1Triangle) which simply picks up three transformed vertex into a buffer stored in DMEM (the 4kb cache memory from which the IMEM works) to turned them in a format readable by the RDP, which is the graphic rasterizer of the N64. Additionally the command has a flag which states which of those three vertexes contain the color or the normal of the triangle.
Here the original Fast3D code:

 0x20C    LBU      A1, 0xFFFC(K1)
 0x210    LBU      AT, 0xFFFD(K1)
 0x214    LBU      V0, 0xFFFE(K1)
 0x218    LBU      V1, 0xFFFF(K1)
 0x21C    SLL      A1, A1, 0x2
 0x220    SLL      AT, AT, 0x2
 0x224    SLL      V0, V0, 0x2
 0x228    SLL      V1, V1, 0x2
 0x22C    ADDI    AT, AT, 0x0420
 0x230    ADDI    V0, V0, 0x0420
 0x234    ADDI    V1, V1, 0x0420
 0x238    SW      AT, 0x0DE0(R0)
 0x23C    SW     V0, 0x0DE4(R0)
 0x240    SW     V1, 0x0DE8(R0)
 0x244    LW     A0, 0x0DE0(A1)
 0x248    J         0x998
 0x24C    LH     S8, 0x00BE(R0)

It takes 17 instructions to process a single G_TRI1 command. For such a basic function, it is relatively long.

Let’s go through the code steps by steps:

0x20C    LBU    A1, 0xFFFC(K1)
0x210    LBU    AT, 0xFFFD(K1)
0x214    LBU    V0, 0xFFFE(K1)
0x218    LBU    V1, 0xFFFF(K1)

In registers A1, AT, V0 and V1 are loaded a mere bytes backwards (due to the negative offset) from base K1. K1 is actually the place in DMEM of the next command in the display list to be executed. So actually the bytes are loaded from the current command to be executed.

Hereby an example of a G_TRI1 command:

DATA COUNTER    DATA
   
0x760    0xBF000000
0x764    0x00284632

As you certainly understood, in this case K1 would be equal to 0x768 and A1 =0x 00, AT = 0x28, V0 = 0x46, V1 = 0x32. A1 is the color/normal flag, AT is the first vertex, V0 the second one and V1 and the last one.

Now due to an “optimization” for the microcode, which will explain a little further on,  please note that  the number of the vertex to be picked up in the buffer is multiply by 10 (0x0A). The actual numbers of our three vertexes are therefore 0x28/0xA = 4, 0x46/0x0A= 7 and, 0x32/0A = 5.

The code then continues and shifts left by 2 all the values.

0x21C    SLL    A1, A1, 0x02
0x220    SLL    AT, AT, 0x02
0x224    SLL    V0, V0, 0x02
0x228    SLL    V1, V1, 0x02

A1 = 0x00
AT = 0xA0
V0 = 0x118
V1 = 0xC8

The reason of doing so is pretty simple: each vertex stored in DMEM takes a space of 40 bytes. In order to locate the actual place of a vertex in DMEM, you must calculate its address by adding from the base of the buffer 40 bytes per vertex. As the value from the command are already multiply by 10, you simply have to multiply them by 4, which is exactly what the shift left by 2 does!

0x22C    ADDI    AT, AT, 0x0420
0x230    ADDI    V0, V0, 0x0420
0x234    ADDI    V1, V1, 0x0420

The vertex values are then added to 0x420. It is very simple: 0x420 is the DMEM address of the start of the buffer.

AT = 0x4C0
V0 = 538
V1 = 0x4E8

This is the exact address in DMEM of our 3 vertex.

0x238    SW    AT, 0x0DE0(R0)
0x23C    SW    V0, 0x0DE4(R0)
0x240    SW    V1, 0x0DE8(R0)
 0x244    LW    A0, 0x0DE0(A1)

The obtained address of the 3 vertexes are then consecutively stored at DMEM address 0xDE0, 0xDE4 and 0xDE8. Such DMEM address is part of a scratch buffer. Then is loaded in register A0 the data located at DMEM 0xDE0 offset A1. It seems odd at first glance but actually the flag is multiply by 4 in previous step. If the original flag is 0, it is still 0, if the original flag is 1, it is now 4 and if the original flag is 2, then it is now 8. And what if you do load the data stored in at 0xDE0 offset A1? The address of the vertex for the color or the normal of the surface!

0x248    J    0x998
0x24C    LH    S8, 0x00BE(R0)

Those 2 less instructions will not be commented much. The code jumps to the part which transforms the 3 vertex in a triangle primitive, where S8 will be used a “return” register to execute the next command in the display list.

Here for the explanations, without too much intricate details of how G_TRI1 works.

Now can we do better than this? Of course it is possible!

First of all, it is very important to understand how the data of the immediate command are generated: it is done by the mean of the header called gbi.h when compiling a rom with the official N64 SDK. This file contains all the graphic command macros/functions useable by a user in the development of its N64 program.

I believe you do remember that the number of the vertexes is multiplied by 10. It is actually something managed through gbi.h:

#  define __gsSP1Triangle_w1f(v0, v1, v2, flag)           
     (_SHIFTL((flag), 24,8)|_SHIFTL((v0)*10,16,8)|       
      _SHIFTL((v1)*10, 8,8)|_SHIFTL((v2)*10, 0,8))

Being able to modify what the user keyed as arguments before being included in the display list to be executed by the RSP is a very useful matter, which will be use as much as possible to avoid having the RSP playing unnecessarily with data.

You may also have noticed that the first line of the command (0xBF000000) is complete empty. Why not using it?

After quite some reflections I came to the following solution (inspired from Factor5 microcode): storing directly the DMEM address of the vertex in the immediate command by calculating through gbi.h (meaning the CPU at the end, much more powerful than the RSP for this type of activities).
The command could actually look like this:

0xBFAABBBB
0xCCCCDDDD

0xBF: COMMAND HEADER     AA: FLAG    BBBB: DMEM ADDR OF 1ST VERTEX
CCCC: DMEM ADDR OF 2nd VERTEX         DDDD: DMEM ADDR OF 3rd  VERTEX

Doing this would mean to change few changes in gbi.h, meaning doing the following change:

Let’s first create a new constant, which will hold the DMEM address of transformed vertex buffer.

#define INPUTBUFFER         0x420

Then the triangle macro can be changed as follow:

#define gSP1Triangle(pkt, v0, v1, v2, flag)               
{                                                        \
    Gfx *_g = (Gfx *)(pkt);   
                                                        \
    _g->words.w0 = _SHIFTL(G_TRI1, 24, 8) | _SHIFTL(flag, 16, 8) | _SHIFTL((((v0)*40) + INPUTBUFFER), 0, 16);
                                                        \
    _g->words.w1 = _SHIFTL((((v1)*40) + INPUTBUFFER), 16, 16)| _SHIFTL((((v2)*40) + INPUTBUFFER), 0, 16);    
}

#define gsSP1Triangle(v0, v1, v2, flag)                   
{{                                                        \
    (_SHIFTL(G_TRI1, 24, 8) | _SHIFTL(flag, 16, 8) | _SHIFTL((((v0)*40) + INPUTBUFFER), 0, 16)),(_SHIFTL((((v1)*40) + INPUTBUFFER), 16, 16)| _SHIFTL((((v2)*40) + INPUTBUFFER), 0, 16))    
}}

Finally the G_TRI1 code can be reduced to the following:

0x208    LB         A1, -0x007(K1)
0x20C    LH        AT, -0x006(K1)
0x210    LH        V0, -0x004(K1)
0x214    LH        V1, -0x002(K1)
0x218    SLL      A1, A1, 0x1
0x21C    ADD    A1, A1, K1
0x220    LH       A0, -0x006(A1)
0x224    J          0x968
0x228    LH       S8, +0x0BE(R0)

And voilà!

The DMEM address is immediately loaded in registers AT, V0, V1 from the command. The flag is now used to compute where within the command itself the DMEM address for the color/normal surface is located, taking in account the each DMEM address is a half word (2 bytes)!

We started with 17 instructions and for the same exact result we reduce the code to 9 instructions!!!

Now remain the main question: does it indeed work? Of course, as you can see in this compiled sample from the SDK with such an implementation.



Next time I will continue on focusing on other immediate command(s)! :)

1 commentaire: