A journey into the mysteries of N64 RCP: February 2020

Sunday, 23 February 2020

Microcode Optimization: IMEM (Part3)

G_TEXTURE

This command simply sets the texture-related parameters (texture on/off, tile, level and scale).
The code handling this immediate command is the following:

0x2A0    SW    T9, 0x0010(SP)
0x2A4    SW    T8, 0x0014(SP)
0x2A8    LH    V0, 0x0006(SP)
0x2AC    ANDI    V0, V0, 0xFFFD
0x2B0    ANDI    V1, T9, 0x0001
0x2B4    SLL    V1, V1, 0x1
0x2B8    OR    V0, V0, V1
0x2BC    J    0x0A8
0x2C0    SH    V0, 0x0006(SP)

Command example:

0xBB000101 (loaded in T9)
0xFFFFFFFF (loaded in T8)

Let’s analyze quickly the code:

0x2A0    SW    T9, 0x0010(SP)
0x2A4    SW    T8, 0x0014(SP)

The two words of the command are stored at a fixed place in DMEM.

0x2A8    LH    V0, 0x0006(SP)

We load the lower half-word of the geometry mode into register V0. Here are the related geometry flags as per gbi.h:

#define G_ZBUFFER            0x00000001
#define G_SHADE            0x00000004
# define G_TEXTURE_ENABLE        0x00000002     /* Microcode use only */
# define G_SHADING_SMOOTH    0x00000200
# define G_CULL_FRONT        0x00001000
# define G_CULL_BACK        0x00002000
# define G_CULL_BOTH        0x00003000

0x2AC    ANDI    V0, V0, 0xFFFD

The code simply clears the G_TEXTURE_ENABLE flag (bit 0x02), in case it was set, in the lowest byte of register V0.

0x2B0    ANDI    V1, T9, 0x0001

The code extracts the lowest bit of the first word of the command (the on/off flag) into register V1.

0x2B4    SLL    V1, V1, 0x1

V1 is multiplied by 2, which moves the bit to the G_TEXTURE_ENABLE position (0x02).

0x2B8    OR    V0, V0, V1

Register V0, containing the lower half-word of the geometry mode with the flag cleared, is ORed with V1, containing the on/off bit of the first command word multiplied by 2.

0x2BC    J    0x0A8
0x2C0    SH    V0, 0x0006(SP)

Before exiting the command, the lower half-word of the geometry mode is stored back in DMEM.
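
To make the bit manipulation above easier to follow, here is a minimal C sketch of the same four instructions. The function name texture_update_geom_lo is mine and purely illustrative; it only models the register arithmetic, not the DMEM load and store.

```c
#include <assert.h>
#include <stdint.h>

#define G_TEXTURE_ENABLE 0x0002u  /* from gbi.h, microcode use only */

/* Hypothetical helper mirroring the ANDI/ANDI/SLL/OR sequence.
 * geom_lo is the lower half-word of the geometry mode (LH V0, 0x0006(SP)),
 * w0 is the first word of the command (register T9). */
uint16_t texture_update_geom_lo(uint16_t geom_lo, uint32_t w0)
{
    uint16_t v0 = (uint16_t)(geom_lo & 0xFFFDu);  /* ANDI V0, V0, 0xFFFD */
    uint16_t v1 = (uint16_t)(w0 & 0x0001u);       /* ANDI V1, T9, 0x0001 */
    v1 = (uint16_t)(v1 << 1);                     /* SLL  V1, V1, 0x1    */
    return (uint16_t)(v0 | v1);                   /* OR   V0, V0, V1     */
}
```

With the example command 0xBB000101, a geometry mode whose lower half-word is 0x0005 (G_ZBUFFER | G_SHADE) becomes 0x0007, i.e. G_TEXTURE_ENABLE gets set.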
Now some may ask: what is the underlying reason for keeping the texture flag in the lowest byte of the geometry mode?

Technically speaking, when 3 transformed vertices are turned into an actual RDP triangle command, this byte is used to construct the header of that command.

Let’s check out gbi.h:

#define G_TRI_FILL            0xc8 /* fill triangle */
#define G_TRI_SHADE        0xcc /* shade triangle */
#define G_TRI_TXTR            0xca /* texture triangle */
#define G_TRI_SHADE_TXTR        0xce /* shade, texture triangle */
#define G_TRI_FILL_ZBUFF        0xc9 /* fill, zbuff triangle */
#define G_TRI_SHADE_ZBUFF        0xcd /* shade, zbuff triangle */
#define G_TRI_TXTR_ZBUFF        0xcb /* texture, zbuff triangle */
#define G_TRI_SHADE_TXTR_ZBUFF    0xcf /* shade, txtr, zbuff trngl */

For instance, OR 0x02 (G_TEXTURE_ENABLE) with 0xCC (G_TRI_SHADE) and you get 0xCE, which is G_TRI_SHADE_TXTR, the textured version of G_TRI_SHADE.
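
The relationship between the three low geometry mode bits and the triangle opcodes can be checked with a few lines of C. This is only a sketch of the arithmetic: tri_opcode is a hypothetical helper, and I am assuming the header is obtained by OR-ing the low mode bits onto G_TRI_FILL, which is what the table above suggests but not necessarily how the microcode does it instruction by instruction.

```c
#include <assert.h>
#include <stdint.h>

#define G_ZBUFFER        0x01u
#define G_TEXTURE_ENABLE 0x02u
#define G_SHADE          0x04u
#define G_TRI_FILL       0xC8u

/* Hypothetical helper: OR the three low mode bits onto the base opcode. */
uint8_t tri_opcode(uint32_t geometry_mode)
{
    uint32_t low = geometry_mode & (G_ZBUFFER | G_TEXTURE_ENABLE | G_SHADE);
    return (uint8_t)(G_TRI_FILL | low);
}
```

For example, tri_opcode(G_SHADE | G_TEXTURE_ENABLE) gives 0xCE (G_TRI_SHADE_TXTR), and setting all three bits gives 0xCF (G_TRI_SHADE_TXTR_ZBUFF).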

Now why not simply use the geometry mode flag directly? My wild guess is that, for consistency, the intention was to keep all texture parameters in the G_TEXTURE command. Nevertheless, technically it does not make any sense! So from now on, let’s use the geometry mode flag for enabling/disabling textures. It would mean only a tiny change for programmers.

Doing so makes all of the code below useless:

0x2A8    LH    V0, 0x0006(SP)
0x2AC    ANDI    V0, V0, 0xFFFD
0x2B0    ANDI    V1, T9, 0x0001
0x2B4    SLL    V1, V1, 0x1
0x2B8    OR    V0, V0, V1
0x2BC    J    0x0A8
0x2C0    SH    V0, 0x0006(SP)

What remains would be:

0x2A0    SW    T9, 0x0010(SP)
0x2A4    SW    T8, 0x0014(SP)

It simply stores two words in DMEM. We already have a command to do so: G_MOVEWORD.

We simply have to make the gSPTexture macro send two G_MOVEWORD commands (followed by the two geometry mode commands that handle the on/off flag). Of course, this means creating a new moveword index, G_MW_TEXTURE.

#define G_MW_TEXTURE    0x120

#define gSPTexture(pkt, s, t, level, tile, on) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_TEXTURE), 0, 16); \
 \
    _g->words.w1 = _SHIFTL(0x0000,16,16) | _SHIFTL((level),11,3) | _SHIFTL((tile),8,3)| _SHIFTL(0x00,0,8); \
}; \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL(((G_MW_TEXTURE) + 4), 0, 16); \
 \
    _g->words.w1 = _SHIFTL((s),16,16) | _SHIFTL((t),0,16); \
}; \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_CLEARGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16); \
 \
    _g->words.w1 = (unsigned int)(0xFFFFFFFD); \
}; \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_SETGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16); \
 \
    _g->words.w1 = (unsigned int)((on)<<1); \
}

#define gsSPTexture(s, t, level, tile, on) \
{{ \
    (_SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_TEXTURE), 0, 16)), \
    (_SHIFTL(0x0000,16,16) | _SHIFTL((level),11,3) | _SHIFTL((tile),8,3)| _SHIFTL(0x00,0,8)) \
}}, \
{{ \
    (_SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL(((G_MW_TEXTURE) + 4), 0, 16)), \
    (_SHIFTL((s),16,16) | _SHIFTL((t),0,16)) \
}}, \
{{ \
    (_SHIFTL(G_CLEARGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16)), \
    (unsigned int)(0xFFFFFFFD) \
}}, \
{{ \
    (_SHIFTL(G_SETGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16)), \
    (unsigned int)((on)<<1) \
}}

What does that mean? The complete G_TEXTURE is useless and can be scrapped, meaning that we get rid of 9 RSP instructions.

We can also create separate macros to update the 1st and the 2nd word of gSPTexture independently.

#define gSPSetTextureTile(pkt, level, tile) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_TEXTURE), 0, 16); \
    _g->words.w1 = _SHIFTL(0x0000,16,16) | _SHIFTL((level),11,3) | _SHIFTL((tile),8,3)| _SHIFTL(0x00,0,8); \
}

#define gSPSetTextureScale(pkt, s, t) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL(((G_MW_TEXTURE) + 4), 0, 16); \
 \
    _g->words.w1 = _SHIFTL((s),16,16) | _SHIFTL((t),0,16); \
}

#define gsSPSetTextureTile(level, tile) \
{{ \
    (_SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_TEXTURE), 0, 16)), \
    (_SHIFTL(0x0000,16,16) | _SHIFTL((level),11,3) | _SHIFTL((tile),8,3)| _SHIFTL(0x00,0,8)) \
}}

#define gsSPSetTextureScale(s, t) \
{{ \
    (_SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL(((G_MW_TEXTURE) + 4), 0, 16)), \
    (_SHIFTL((s),16,16) | _SHIFTL((t),0,16)) \
}}

Finally, note that storing the tile and the level of the texture requires only a single byte. When we start optimizing DMEM, this point will have to be revisited.

RDRAM <-> DMEM

The next step of our IMEM optimization concerns the way the code moves data from RDRAM to DMEM or from DMEM to RDRAM. To do so, the code sets a flag in an RSP register to indicate the direction in which the data is moved.

Let’s see how it works:

0x148 MTC0 S4, SP memory address

0x14C BGTZ S1, 0x015C

0x150 MTC0 S3, SP DRAM DMA address

0x154 JR RA

0x158 MTC0 S2, SP read DMA length

0x15C JR RA

0x160 MTC0 S2, SP write DMA length

The above code is called as a subroutine from various parts of the code.

Depending on whether S1 is greater than 0 or not, register S2 is written to the COP0 write or read DMA length register, i.e. the length of the data transferred to or from the DMEM address held in S4.

Some may say it is efficient code. Not really…

The issue is that register S1 has to be set to 0 or 1 beforehand, depending on the direction of the transfer. That costs an instruction before calling the subroutine, so it is not really an efficient solution: the caller can just as well write the length to the proper COP0 register itself, immediately after the subroutine returns.

The code could be changed simply to:

0x148 MTC0 S4, SP memory address

0x14C JR RA

0x150 MTC0 S3, SP DRAM DMA address

On top of saving 4 instructions and freeing a register, it prepares the ground for some deeper future changes in the code.
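
As a rough illustration of the change, here is a toy C model of the two versions of the subroutine. The DmaRegs struct and the function names are mine; they only stand in for the four COP0 registers named in the listings and do not perform any real DMA.

```c
#include <assert.h>
#include <stdint.h>

/* Toy stand-in for the COP0 DMA registers named in the listings. */
typedef struct {
    uint32_t mem_addr;   /* SP memory address    */
    uint32_t dram_addr;  /* SP DRAM DMA address  */
    uint32_t rd_len;     /* SP read DMA length   */
    uint32_t wr_len;     /* SP write DMA length  */
} DmaRegs;

/* Original subroutine: the direction flag (S1) selects the length register. */
void dma_setup_original(DmaRegs *r, uint32_t s4, uint32_t s3,
                        uint32_t s2, int32_t s1)
{
    r->mem_addr  = s4;            /* MTC0 S4, SP memory address          */
    r->dram_addr = s3;            /* MTC0 S3 (delay slot, always runs)   */
    if (s1 > 0)                   /* BGTZ S1, 0x015C                     */
        r->wr_len = s2;
    else
        r->rd_len = s2;
}

/* Reduced subroutine: only the two addresses are written here. */
void dma_setup_reduced(DmaRegs *r, uint32_t s4, uint32_t s3)
{
    r->mem_addr  = s4;
    r->dram_addr = s3;
}
```

With the reduced version, the caller writes rd_len or wr_len itself right after the call, so the register state ends up identical without ever loading a direction flag.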

And voila!

Next time we will start working on the matrix-related immediate commands :)

Saturday, 8 February 2020

Microcode Optimization: IMEM (Part2)

G_MOVEWORD

This immediate command simply stores a word at a specific DMEM address, to be used as a parameter by other commands. For instance, such a command may be used to store the address of a segment, the number of lights, etc.

Here is the original code handling this command:

0x288    LBU    AT, 0xFFFB(K1)
0x28C    LHU    V0, 0xFFF9(K1)
0x290    LH    A1, 0x030E(AT)
0x294    ADD    A1, A1, V0
0x298    J    0x0A8
0x29C    SW    T8, 0x0000(A1)

0xBC000406 (loaded in register T9)
0x002D4CD0 (loaded in register T8)

It takes only 6 instructions, which is relatively reasonable. Let’s then go through the code step by step:

0x288    LBU    AT, 0xFFFB(K1)
0x28C    LHU    V0, 0xFFF9(K1)

The first instruction loads a byte in register AT and the second a half word in register V0 from the current command, meaning in our example AT = 0x06 and V0 = 0x0004 (half word).

0x290    LH    A1, 0x030E(AT)

The code then loads a half-word from DMEM into register A1. This is itself a DMEM address: the base address of the area where the word is to be stored.

0x294    ADD    A1, A1, V0

The offset V0 is added to the base address A1.

0x29C    SW    T8, 0x0000(A1)

Finally the word of the command is stored at the proper DMEM address. In this respect, note that only the lower 12 bits of the address (due to the size of the memory) are considered by the SW instruction.
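
The steps above can be sketched in C with a small big-endian DMEM model. The helper names are mine, and the content of the table at 0x030E is an assumption: for the example I fill the slot for index 0x06 with 0x0160, the segment base address listed further down in this article.

```c
#include <assert.h>
#include <stdint.h>

#define MW_TABLE 0x030Eu          /* from LH A1, 0x030E(AT) */

static uint8_t dmem[0x1000];      /* 4 KB of DMEM, big-endian like the RSP */

uint32_t load_half(uint32_t a)
{
    a &= 0xFFFu;
    return ((uint32_t)dmem[a] << 8) | dmem[(a + 1) & 0xFFFu];
}

void store_word(uint32_t a, uint32_t w)
{
    for (int i = 0; i < 4; i++)   /* only the low 12 address bits matter */
        dmem[(a + (uint32_t)i) & 0xFFFu] = (uint8_t)(w >> (24 - 8 * i));
}

uint32_t load_word(uint32_t a)
{
    uint32_t w = 0;
    for (int i = 0; i < 4; i++)
        w = (w << 8) | dmem[(a + (uint32_t)i) & 0xFFFu];
    return w;
}

/* Model of the original 6-instruction G_MOVEWORD handler. */
void moveword_original(uint32_t w0, uint32_t w1)
{
    uint32_t index  = w0 & 0xFFu;                   /* LBU AT, 0xFFFB(K1) */
    uint32_t offset = (w0 >> 8) & 0xFFFFu;          /* LHU V0, 0xFFF9(K1) */
    uint32_t base   = load_half(MW_TABLE + index);  /* LH  A1, 0x030E(AT) */
    store_word(base + offset, w1);                  /* ADD, then SW T8    */
}
```

Under that assumption, the example command 0xBC000406 / 0x002D4CD0 stores 0x002D4CD0 at DMEM 0x164, i.e. the slot of segment 1.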

Let’s try to reduce the code necessary to perform this simple action. The exact same trick used for G_TRI1 can be used here, meaning that the relevant DMEM address could be inserted directly in the command.

The moveword indices (AT) and offsets (V0) are included in gbi.h, but the base DMEM addresses are not. Consequently, it is necessary to list these DMEM addresses in order to include them as well. To do so, they have to be determined by checking the “mapping” of DMEM for Fast3D in a microcode source file called GDMEM.H.

Having listed all the base addresses related to G_MOVEWORD, we can simply add them to gbi.h and adapt the macros using this immediate command (for instance gSPSegment).

/* MOVEWORD indices */

#define G_MW_MATRIX        0x3E0
#define G_MW_NUMLIGHT        0x12C
#define G_MW_CLIP            0x074
#define G_MW_SEGMENT        0x160
#define G_MW_FOG            0x330
#define G_MW_LIGHTCOL        0x1F0
#define    G_MW_PERSPNORM        0x110

#define gSPSegment(pkt, segment, base) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_SEGMENT+((segment)*4)), 0, 16); \
 \
    _g->words.w1 = (unsigned int)(base); \
}

#define gsSPSegment(segment, base) \
{{ \
    (_SHIFTL(G_MOVEWORD, 24, 8) | _SHIFTL((G_MW_SEGMENT+((segment)*4)), 0, 16)), \
    (unsigned int)(base) \
}}

In this example we have of course to multiply by 4 the segment number to align the DMEM address to a word :)

From there, the command could be structured as follows:

0xBC000AAA
0xBBBBBBBB

AAA is the DMEM address where the word should be stored and BBBBBBBB is the word to be stored.

The command can be reduced to only two simple instructions!!!! :)

J    0x0A8
SW    T8, 0x0000(T9)

Since a DMEM address is only 12 bits wide, the higher bits of T9 (the command header) are ignored by the store.

And voila! We reduced the size of the code from 6 instructions to 2 instructions.
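
Here is a small C sketch of the reworked command, assuming the G_MOVEWORD opcode 0xBC seen in the example and the G_MW_SEGMENT value from the list above. The _SHIFTL definition follows the one in gbi.h; segment_w0, moveword_direct and dmem_word are illustrative names of mine.

```c
#include <assert.h>
#include <stdint.h>

#define _SHIFTL(v, s, w) \
    ((uint32_t)(((uint32_t)(v) & ((1u << (w)) - 1u)) << (s)))
#define G_MOVEWORD   0xBCu    /* opcode, as in the example command */
#define G_MW_SEGMENT 0x160u   /* base DMEM address from the list above */

static uint8_t dmem[0x1000];  /* 4 KB of DMEM, big-endian like the RSP */

/* First command word, as the reworked gSPSegment would build it. */
uint32_t segment_w0(uint32_t segment)
{
    return _SHIFTL(G_MOVEWORD, 24, 8) |
           _SHIFTL(G_MW_SEGMENT + segment * 4u, 0, 16);
}

/* Model of "SW T8, 0x0000(T9)": only the low 12 bits of T9 are used. */
void moveword_direct(uint32_t t9, uint32_t t8)
{
    uint32_t addr = t9 & 0xFFFu;
    for (int i = 0; i < 4; i++)
        dmem[(addr + (uint32_t)i) & 0xFFFu] = (uint8_t)(t8 >> (24 - 8 * i));
}

uint32_t dmem_word(uint32_t addr)
{
    uint32_t w = 0;
    for (int i = 0; i < 4; i++)
        w = (w << 8) | dmem[(addr + (uint32_t)i) & 0xFFFu];
    return w;
}
```

For segment 1 the first word becomes 0xBC000164, and feeding it straight into the store lands the value at DMEM 0x164, the same place the original handler reached through the table lookup.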

G_SETGEOMETRYMODE / G_CLEARGEOMETRYMODE

The code of G_SETGEOMETRYMODE is the following:

0x348    LW    V0, 0x0004(SP)
0x34C    OR    V0, V0, T8
0x350    J    0x0A8
0x354    SW    V0, 0x0004(SP)

Here is an example of a command:

0xB7000000 (loaded in T9)
0x00020000 (loaded in T8)

The code of G_CLEARGEOMETRYMODE is the following:

0x358    LW    V0, 0x0004(SP)
0x35C    ADDI    V1, R0, 0xFFFF
0x360    XOR    V1, V1, T8
0x364    AND    V0, V0, V1
0x368    J    0x0A8
0x36C    SW    V0, 0x0004(SP)

Here is an example of a command:

0xB6000000 (loaded in T9)
0x001F3204 (loaded in T8)

Both commands together take 10 instructions. Let’s go quickly through the code of those commands.

G_SETGEOMETRYMODE

0x348    LW    V0, 0x0004(SP)

The code loads the current geometry mode (i.e. fog, lighting, etc.) from a specific place in DMEM. Note that the SP register constantly holds a base DMEM address.

0x34C    OR    V0, V0, T8
0x350    J    0x0A8
0x354    SW    V0, 0x0004(SP)

The current geometry mode is then ORed with the bits set in the command before being stored back.

G_CLEARGEOMETRYMODE

0x35C    ADDI    V1, R0, 0xFFFF
0x360    XOR    V1, V1, T8
0x364    AND    V0, V0, V1
0x368    J    0x0A8
0x36C    SW    V0, 0x0004(SP)

After loading the current geometry mode, the code XORs the bits to be cleared, contained in the second word of the command (loaded in T8), with 0xFFFFFFFF. So the code actually applies a bitwise NOT, which the RSP can do with a single NOR instruction!!! Fun fact: this “mistake” was noticed, and in F3DEX the code uses NOR, as it should.

But even with the NOR, it should be possible to do much better. I arrived at the following solution:

0x348    LW    V0, 0x004(SP)
0x34C    J    0x35C
0x350    AND     T8, V0, T8
0x354    LW    V0, 0x004(SP)
0x358    OR    T8, V0, T8
0x35C    J    0x0A8
0x360    SW    T8, 0x000(T9)

It looks like it now takes 7 instructions, which is already 3 instructions fewer. Now you may notice something interesting: the last two instructions are those of G_MOVEWORD, meaning that the actual implementation takes only 5 instructions!

The structure of the command would be as follows:

0xCC000DDD
0xAAAAAAAA

CC would be the command header.

DDD would be the DMEM address where the current geometry mode is stored.

AAAAAAAA would be the bits to be set or cleared by the command.

How would all of this work, then?

G_CLEARGEOMETRYMODE

0x348    LW    V0, 0x004(SP)
0x34C    J    0x35C
0x350    AND     T8, V0, T8

It is not necessary to invert the bits in the microcode, as this can be done in gbi.h. By the way, the DMEM address of the geometry mode also has to be included in the first word of the command in gbi.h.

The code then loads the current geometry mode, ANDs it with the already inverted bits in register T8 and jumps to 0x35C, which is the G_MOVEWORD code. As the DMEM address of the geometry mode has been included in the first word of the command, the result is stored back in the right place.

G_SETGEOMETRYMODE

0x354    LW    V0, 0x004(SP)
0x358    OR    T8, V0, T8
0x35C    J    0x0A8
0x360    SW    T8, 0x000(T9)

The current geometry mode is loaded, ORed with the bits to be set, and execution then naturally falls through into the G_MOVEWORD instructions.
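
To convince ourselves that moving the inversion into gbi.h changes nothing, here is a short C sketch comparing the original and reworked handlers. All function names are illustrative; clear_w1 stands for what the reworked macro would put in the second command word.

```c
#include <assert.h>
#include <stdint.h>

/* Original handlers: T8 holds the raw bits from the command. */
uint32_t set_original(uint32_t mode, uint32_t t8)
{
    return mode | t8;                     /* OR  V0, V0, T8 */
}

uint32_t clear_original(uint32_t mode, uint32_t t8)
{
    uint32_t v1 = t8 ^ 0xFFFFFFFFu;       /* ADDI + XOR: bitwise NOT */
    return mode & v1;                     /* AND V0, V0, V1 */
}

/* Reworked scheme: the clear mask arrives already inverted. */
uint32_t clear_w1(uint32_t bits)
{
    return ~bits;                         /* done at build time in gbi.h */
}

uint32_t clear_reworked(uint32_t mode, uint32_t t8)
{
    return mode & t8;                     /* AND T8, V0, T8 */
}

uint32_t set_reworked(uint32_t mode, uint32_t t8)
{
    return mode | t8;                     /* OR  T8, V0, T8 */
}
```

In both cases the result then goes through the shared G_MOVEWORD store, which writes it back to the geometry mode address carried in the first command word.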

Now of course we can create a macro to emulate gSPGeometryMode from F3DEX2 by sending both G_CLEARGEOMETRYMODE and G_SETGEOMETRYMODE consecutively:

#define gSPGeometryMode(pkt, c, s) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_CLEARGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16); \
    _g->words.w1 = (~((unsigned int)(c))); \
}; \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_SETGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16); \
    _g->words.w1 = (unsigned int)(s); \
}

#define    gsSPGeometryMode(c, s) \
{{ \
    (_SHIFTL(G_CLEARGEOMETRYMODE, 24, 8) | _SHIFTL(G_MW_GEOMODE, 0, 16)), (~((unsigned int)(c))) \
}}, \
{{ \
    (_SHIFTL(G_SETGEOMETRYMODE, 24, 8)| _SHIFTL(G_MW_GEOMODE, 0, 16)), (unsigned int)(s) \
}}

Bonus:

Let’s also implement the gSPModifyVertex macro, and do so more efficiently than F3DEX: without a single extra RSP instruction and without the dedicated immediate command Nintendo used.

What does this macro do? It simply replaces a word in the data of a selected transformed vertex, which is composed of 40 bytes, as already explained in my previous article.

We already have the code to replace a word. We just need the DMEM address where the word has to be stored. This can easily be computed in gbi.h, as below:

# define gSPModifyVertex(pkt, vtx, where, val) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
    _g->words.w0 = (_SHIFTL(G_MOVEWORD,24,8)| _SHIFTL(((INPUTBUFFER) + ((vtx)*40) + (where)),0,16)); \
    _g->words.w1 = (unsigned int)(val); \
}
# define gsSPModifyVertex(vtx, where, val) \
{{ \
    _SHIFTL(G_MOVEWORD,24,8)| _SHIFTL(((INPUTBUFFER) + ((vtx)*40) + (where)),0,16), \
    (unsigned int)(val) \
}}
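
As a quick check of the address arithmetic, here is a C sketch of the first command word the macro would produce. I am assuming G_MOVEWORD is 0xBC and INPUTBUFFER is 0x420, as elsewhere in these articles; modify_vertex_w0 is an illustrative function, and the offset 0x10 used in the example below is an arbitrary value, not a documented field position.

```c
#include <assert.h>
#include <stdint.h>

#define _SHIFTL(v, s, w) \
    ((uint32_t)(((uint32_t)(v) & ((1u << (w)) - 1u)) << (s)))
#define G_MOVEWORD  0xBCu    /* opcode, as in the G_MOVEWORD example */
#define INPUTBUFFER 0x420u   /* DMEM address of the transformed vertex buffer */

/* First command word: opcode plus the DMEM address of the word to replace. */
uint32_t modify_vertex_w0(uint32_t vtx, uint32_t where)
{
    return _SHIFTL(G_MOVEWORD, 24, 8) |
           _SHIFTL(INPUTBUFFER + vtx * 40u + where, 0, 16);
}
```

For vertex 4 with offset 0 this gives 0xBC0004C0, and for vertex 7 with offset 0x10 it gives 0xBC000548, so the existing two-instruction G_MOVEWORD code writes directly into the vertex.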

And voila!

We will next time continue by optimizing further immediate commands. Stay tuned! :)

Saturday, 1 February 2020

Microcode Optimization: IMEM (Part1)

Before implementing new features, it is necessary to ensure that the current microcode is optimized so that it uses as few RSP cycles as possible. On top of accelerating RSP processing, this saves space in the very limited IMEM (only 4 KB!), the instruction memory in which the microcode runs.

As previously announced, the original Fast3D code (the one used by Super Mario 64) is not very optimized, most likely because its development was somewhat rushed in order to provide developers, as soon as possible, with the graphics library needed to build their games.

In this part, we will focus on one of the so-called “immediate” RSP graphics commands, G_TRI1 (0xBF).

I will first explain the current code and its purpose and then present the approach I took to improve it. Obviously some readers of this blog may have even better ideas, which I would be really happy to discover and discuss.

G_TRI1 is a command (gSP1Triangle) which simply picks three transformed vertices from a buffer stored in DMEM (the 4 KB data memory the microcode works from) and turns them into a format readable by the RDP, which is the graphics rasterizer of the N64. Additionally, the command has a flag which states which of those three vertices contains the color or the normal of the triangle.
Here is the original Fast3D code:

0x20C   LBU      A1, 0xFFFC(K1)
0x210   LBU      AT, 0xFFFD(K1)
0x214   LBU      V0, 0xFFFE(K1)
0x218   LBU      V1, 0xFFFF(K1)
0x21C   SLL      A1, A1, 0x2
0x220   SLL      AT, AT, 0x2
0x224   SLL      V0, V0, 0x2
0x228   SLL      V1, V1, 0x2
0x22C   ADDI   AT, AT, 0x0420
0x230   ADDI   V0, V0, 0x0420
0x234   ADDI   V1, V1, 0x0420
0x238   SW      AT, 0x0DE0(R0)
0x23C   SW   V0, 0x0DE4(R0)
0x240   SW   V1, 0x0DE8(R0)
0x244   LW   A0, 0x0DE0(A1)
0x248   J         0x998
0x24C   LH   S8, 0x00BE(R0)

It takes 17 instructions to process a single G_TRI1 command. For such a basic function, it is relatively long.

Let’s go through the code step by step:

0x20C    LBU    A1, 0xFFFC(K1)
0x210    LBU    AT, 0xFFFD(K1)
0x214    LBU    V0, 0xFFFE(K1)
0x218    LBU    V1, 0xFFFF(K1)

Single bytes are loaded into registers A1, AT, V0 and V1, reading backwards (due to the negative offsets) from base K1. K1 is the place in DMEM of the next command to be executed in the display list. So the bytes are in fact loaded from the current command.

Here is an example of a G_TRI1 command:

DATA COUNTER    DATA

0x760    0xBF000000
0x764    0x00284632

As you have certainly understood, in this case K1 would be equal to 0x768 and A1 = 0x00, AT = 0x28, V0 = 0x46, V1 = 0x32. A1 is the color/normal flag, AT is the first vertex, V0 the second one and V1 the last one.

Now, due to an “optimization” for the microcode, which I will explain a little further on, please note that the number of each vertex to be picked from the buffer is multiplied by 10 (0x0A). The actual numbers of our three vertices are therefore 0x28/0x0A = 4, 0x46/0x0A = 7 and 0x32/0x0A = 5.

The code then continues and shifts all the values left by 2.

0x21C    SLL    A1, A1, 0x02
0x220    SLL    AT, AT, 0x02
0x224    SLL    V0, V0, 0x02
0x228    SLL    V1, V1, 0x02

A1 = 0x00
AT = 0xA0
V0 = 0x118
V1 = 0xC8

The reason for doing so is pretty simple: each vertex stored in DMEM takes 40 bytes. To locate a vertex in DMEM, you must calculate its address by adding 40 bytes per vertex to the base of the buffer. As the values from the command are already multiplied by 10, you simply have to multiply them by 4, which is exactly what the shift left by 2 does!

0x22C    ADDI    AT, AT, 0x0420
0x230    ADDI    V0, V0, 0x0420
0x234    ADDI    V1, V1, 0x0420

The vertex offsets are then added to 0x420. It is very simple: 0x420 is the DMEM address of the start of the buffer.

AT = 0x4C0
V0 = 0x538
V1 = 0x4E8

These are the exact addresses in DMEM of our 3 vertices.
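
The whole address computation fits in one line of C. The function name is mine; it simply reproduces the SLL and ADDI steps on one byte of the command.

```c
#include <assert.h>
#include <stdint.h>

#define VERTEX_BUFFER 0x420u  /* DMEM address of the start of the buffer */

/* cmd_byte is the vertex number already multiplied by 10 by gbi.h. */
uint32_t vertex_addr_original(uint8_t cmd_byte)
{
    uint32_t offset = (uint32_t)cmd_byte << 2;  /* SLL: x4, so 40 bytes per vertex */
    return offset + VERTEX_BUFFER;              /* ADDI ..., 0x0420 */
}
```

Running it on the three bytes of the example (0x28, 0x46, 0x32) gives 0x4C0, 0x538 and 0x4E8, matching the values listed above.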

0x238    SW    AT, 0x0DE0(R0)
0x23C    SW    V0, 0x0DE4(R0)
0x240    SW    V1, 0x0DE8(R0)
0x244    LW    A0, 0x0DE0(A1)

The addresses of the 3 vertices are then stored consecutively at DMEM addresses 0xDE0, 0xDE4 and 0xDE8, which are part of a scratch buffer. Register A0 is then loaded with the data located at DMEM 0xDE0 plus offset A1. It seems odd at first glance, but remember that the flag was multiplied by 4 in the previous step. If the original flag is 0, it is still 0; if it is 1, it is now 4; and if it is 2, it is now 8. And what do you get if you load the data stored at 0xDE0 plus offset A1? The address of the vertex holding the color or the normal of the surface!
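
The scratch buffer trick is really just an array lookup indexed by the flag. Here is a hedged C sketch of it, where a local array stands in for DMEM 0xDE0 to 0xDE8 and the function name is mine; it assumes the flag is 0, 1 or 2.

```c
#include <assert.h>
#include <stdint.h>

/* w1 is the second word of the original G_TRI1 command:
 * flag, v0*10, v1*10, v2*10 (one byte each). */
uint32_t tri1_original_shading_addr(uint32_t w1)
{
    uint32_t scratch[3];                            /* 0xDE0, 0xDE4, 0xDE8 */
    uint32_t a1 = ((w1 >> 24) & 0xFFu) << 2;        /* flag * 4 */

    scratch[0] = (((w1 >> 16) & 0xFFu) << 2) + 0x420u;  /* SW AT, 0x0DE0(R0) */
    scratch[1] = (((w1 >> 8) & 0xFFu) << 2) + 0x420u;   /* SW V0, 0x0DE4(R0) */
    scratch[2] = ((w1 & 0xFFu) << 2) + 0x420u;          /* SW V1, 0x0DE8(R0) */

    return scratch[a1 / 4u];                        /* LW A0, 0x0DE0(A1) */
}
```

With the example word 0x00284632 the flag is 0 and the result is 0x4C0, the first vertex; changing the flag byte to 1 or 2 selects 0x538 or 0x4E8 instead.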

0x248    J    0x998
0x24C    LH    S8, 0x00BE(R0)

Those last 2 instructions will not be commented on much. The code jumps to the part which turns the 3 vertices into a triangle primitive, where S8 will be used as a “return” register to execute the next command in the display list.

So much for the explanation of how G_TRI1 works, without going into too many intricate details.

Now can we do better than this? Of course it is possible!

First of all, it is very important to understand how the data of the immediate commands is generated: it is done by means of the header called gbi.h when compiling a ROM with the official N64 SDK. This file contains all the graphics command macros/functions usable by a developer in an N64 program.

I believe you remember that the vertex numbers are multiplied by 10. This is actually handled in gbi.h:

# define __gsSP1Triangle_w1f(v0, v1, v2, flag) \
     (_SHIFTL((flag), 24,8)|_SHIFTL((v0)*10,16,8)| \
      _SHIFTL((v1)*10, 8,8)|_SHIFTL((v2)*10, 0,8))

Being able to modify the arguments supplied by the user before they are included in the display list executed by the RSP is very useful, and it will be used as much as possible to avoid having the RSP juggle data unnecessarily.

You may also have noticed that the first word of the command (0xBF000000) is completely empty. Why not use it?

After quite some reflection I came to the following solution (inspired by the Factor5 microcode): store the DMEM addresses of the vertices directly in the immediate command, computing them in gbi.h (meaning, in the end, on the CPU, which is much more powerful than the RSP for this type of work).
The command could then look like this:

0xBFAABBBB
0xCCCCDDDD

0xBF: command header
AA: flag
BBBB: DMEM address of 1st vertex
CCCC: DMEM address of 2nd vertex
DDDD: DMEM address of 3rd vertex

Doing this requires a few changes in gbi.h.

Let’s first create a new constant, which holds the DMEM address of the transformed vertex buffer.

#define INPUTBUFFER         0x420

Then the triangle macro can be changed as follows:

#define gSP1Triangle(pkt, v0, v1, v2, flag) \
{ \
    Gfx *_g = (Gfx *)(pkt); \
 \
    _g->words.w0 = _SHIFTL(G_TRI1, 24, 8) | _SHIFTL((flag), 16, 8) | _SHIFTL((((v0)*40) + INPUTBUFFER), 0, 16); \
 \
    _g->words.w1 = _SHIFTL((((v1)*40) + INPUTBUFFER), 16, 16)| _SHIFTL((((v2)*40) + INPUTBUFFER), 0, 16); \
}

#define gsSP1Triangle(v0, v1, v2, flag) \
{{ \
    (_SHIFTL(G_TRI1, 24, 8) | _SHIFTL((flag), 16, 8) | _SHIFTL((((v0)*40) + INPUTBUFFER), 0, 16)), \
    (_SHIFTL((((v1)*40) + INPUTBUFFER), 16, 16)| _SHIFTL((((v2)*40) + INPUTBUFFER), 0, 16)) \
}}

Finally the G_TRI1 code can be reduced to the following:

0x208    LB         A1, -0x007(K1)
0x20C    LH        AT, -0x006(K1)
0x210    LH        V0, -0x004(K1)
0x214    LH    V1, -0x002(K1)
0x218    SLL      A1, A1, 0x1
0x21C    ADD    A1, A1, K1
0x220    LH       A0, -0x006(A1)
0x224    J        0x968
0x228    LH       S8, +0x0BE(R0)

And voilà!

The DMEM addresses are loaded directly from the command into registers AT, V0 and V1. The flag is now used to compute where, within the command itself, the DMEM address of the vertex holding the color/normal of the surface is located, taking into account that each DMEM address is a half-word (2 bytes)!
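
Here is a C sketch of both sides of the new scheme: packing the command as the reworked macro does, and picking the color/normal vertex address out of the command bytes as the reduced handler does. The function names are mine, the _SHIFTL definition follows gbi.h, and the flag is assumed to be 0, 1 or 2.

```c
#include <assert.h>
#include <stdint.h>

#define _SHIFTL(v, s, w) \
    ((uint32_t)(((uint32_t)(v) & ((1u << (w)) - 1u)) << (s)))
#define G_TRI1      0xBFu
#define INPUTBUFFER 0x420u

/* CPU side: what the reworked gSP1Triangle macro computes. */
void pack_tri1(uint32_t v0, uint32_t v1, uint32_t v2, uint32_t flag,
               uint32_t cmd[2])
{
    cmd[0] = _SHIFTL(G_TRI1, 24, 8) | _SHIFTL(flag, 16, 8) |
             _SHIFTL(v0 * 40u + INPUTBUFFER, 0, 16);
    cmd[1] = _SHIFTL(v1 * 40u + INPUTBUFFER, 16, 16) |
             _SHIFTL(v2 * 40u + INPUTBUFFER, 0, 16);
}

/* RSP side: model of LB / SLL / ADD / LH on the 8 command bytes. */
uint32_t tri1_shading_addr(const uint32_t cmd[2])
{
    uint8_t b[8];                       /* b[0] sits at K1 - 8 */
    for (int i = 0; i < 4; i++) {
        b[i]     = (uint8_t)(cmd[0] >> (24 - 8 * i));
        b[4 + i] = (uint8_t)(cmd[1] >> (24 - 8 * i));
    }
    uint32_t flag = b[1];               /* LB A1, -0x007(K1) */
    uint32_t off  = 2u + flag * 2u;     /* SLL, ADD, then LH A0, -0x006(A1) */
    return ((uint32_t)b[off] << 8) | b[off + 1u];
}
```

Packing vertices 4, 7 and 5 with flag 1 yields 0xBF0104C0 / 0x053804E8, and the lookup returns 0x538, the address of the second vertex.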

We started with 17 instructions and, for the exact same result, we reduced the code to 9 instructions!!!

Now the main question remains: does it indeed work? Of course, as you can see in this compiled sample from the SDK with such an implementation.

Next time I will continue by focusing on other immediate command(s)! :)