Thursday, April 30, 2020

Microcode Optimization: IMEM (Part 6)

From now on I will no longer explain the code in detail. Doing so is extremely cumbersome, both to write and to read. Instead, I will take a higher-level approach to describe the current Fast3D implementation and the improvements I have made.

G_MTX


This command is used in the original code to:

1)    Potentially push a matrix
2)    Potentially load a matrix
3)    Potentially multiply a matrix

The command always ends by multiplying the ModelView matrix with the Projection matrix, resulting in a ModelViewProjection matrix.

In my opinion, the way the code performs these tasks has many flaws.

As explained in my previous article, data is retrieved from RDRAM into a scratch area and then has to be moved to the right place in DMEM. Simply loading a matrix therefore requires many unnecessary instructions.

Multiplying a matrix also requires scratch buffers to store the results, which are then moved to the right place in DMEM. The code itself was written very efficiently, using the accumulator of the RSP Vector Unit to perform the additions at the same time as the multiplications. However, the source data of the original matrix is overwritten by partial results as the multiplication proceeds, hence the need for scratch buffers.

Overall, the G_MTX code in Fast3D takes 67 IMEM instructions.

My implementation is dramatically smaller. To achieve this, push, load and multiply are now separate functions, both in the microcode and in gbi.h. To ensure compatibility with Fast3D, emulation macros have been implemented in gbi.h.

The new push matrix code was already discussed in my previous article.

When it comes to loading a matrix, the code simply uses the G_MOVEMEM command, sending the matrix directly to the right place in DMEM.

Finally, for the matrix multiplication, I reused the code developed for F3DEX2. It is slightly less efficient than the Fast3D code, but it has an enormous advantage: the source matrix data is not overwritten by results before the end of the multiplication, which avoids moving data around in DMEM.
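
The difference between the two multiplication strategies can be sketched in plain C. This is only an illustration under stated assumptions, not the RSP code: the names Mtx4, mtx_mul and mtx_mul_in_place are hypothetical, and floats stand in for the fixed-point values and vector instructions the microcode really uses. When the result overwrites one of its sources, a scratch copy and an extra move are needed; when it is written elsewhere, they are not.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical 4x4 matrix type, used only for this illustration. */
typedef struct { float m[4][4]; } Mtx4;

/* dst = a * b. Neither source is modified while the product is computed,
 * so dst can be the final location as long as it does not alias a or b. */
void mtx_mul(Mtx4 *dst, const Mtx4 *a, const Mtx4 *b)
{
    assert(dst != a && dst != b);
    for (int r = 0; r < 4; r++) {
        for (int c = 0; c < 4; c++) {
            float sum = 0.0f;
            for (int k = 0; k < 4; k++)
                sum += a->m[r][k] * b->m[k][c];
            dst->m[r][c] = sum;
        }
    }
}

/* a = a * b. Because the result replaces a source, the product has to go
 * through a scratch matrix first and then be copied back. */
void mtx_mul_in_place(Mtx4 *a, const Mtx4 *b)
{
    Mtx4 scratch;
    mtx_mul(&scratch, a, b);
    memcpy(a, &scratch, sizeof scratch);
}
```

The second function pays for one extra 4x4 buffer and one extra copy on every call, which is the kind of overhead that writing the product straight to its final location avoids.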

All in all, G_MTX itself is now only about 30 instructions!

G_VTX


This is the core of the Fast3D microcode. This function performs the following:

1)    Multiply vertices by the ModelViewProjection matrix
2)    Perform lighting and texture coordinate transformation tasks
3)    Determine the outcodes of the vertex for clipping (https://en.wikipedia.org/wiki/Cohen-Sutherland_algorithm)
4)    Normalize to device coordinates (divide everything by W)
5)    Transform clip coordinates to screen coordinates, including perspective correction (viewport transformation)
6)    Calculate fog where necessary
7)    Store results in the DMEM vertex buffer
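
The geometric part of these steps (1, 3, 4, 5 and 7) can be sketched as follows. This is a simplified illustration in plain C with floats, not the microcode: lighting, texture coordinates and fog are left out, and every name here (Mtx4, Vec4, ScreenVtx, process_vertex, the OUT_* flags) is hypothetical. The viewport scale and translate values are passed in as plain arrays.

```c
#include <assert.h>

/* Hypothetical types, used only for this illustration. */
typedef struct { float m[4][4]; } Mtx4;
typedef struct { float x, y, z, w; } Vec4;
typedef struct { float sx, sy, sz; unsigned outcode; } ScreenVtx;

enum {
    OUT_LEFT = 1, OUT_RIGHT = 2, OUT_BOTTOM = 4,
    OUT_TOP = 8, OUT_NEAR = 16, OUT_FAR = 32
};

/* Step 1: clip = v * mvp (row-vector convention). */
Vec4 transform(const Mtx4 *mvp, Vec4 v)
{
    Vec4 r;
    r.x = v.x * mvp->m[0][0] + v.y * mvp->m[1][0] + v.z * mvp->m[2][0] + v.w * mvp->m[3][0];
    r.y = v.x * mvp->m[0][1] + v.y * mvp->m[1][1] + v.z * mvp->m[2][1] + v.w * mvp->m[3][1];
    r.z = v.x * mvp->m[0][2] + v.y * mvp->m[1][2] + v.z * mvp->m[2][2] + v.w * mvp->m[3][2];
    r.w = v.x * mvp->m[0][3] + v.y * mvp->m[1][3] + v.z * mvp->m[2][3] + v.w * mvp->m[3][3];
    return r;
}

/* Step 3: one bit per clip plane the vertex lies outside of. */
unsigned outcode(Vec4 c)
{
    unsigned code = 0;
    if (c.x < -c.w) code |= OUT_LEFT;
    if (c.x >  c.w) code |= OUT_RIGHT;
    if (c.y < -c.w) code |= OUT_BOTTOM;
    if (c.y >  c.w) code |= OUT_TOP;
    if (c.z < -c.w) code |= OUT_NEAR;
    if (c.z >  c.w) code |= OUT_FAR;
    return code;
}

/* Steps 4, 5 and 7: divide by W, apply the viewport, return the result. */
ScreenVtx process_vertex(const Mtx4 *mvp, Vec4 v,
                         const float vscale[3], const float vtrans[3])
{
    assert(mvp && vscale && vtrans);
    Vec4 c = transform(mvp, v);
    float inv_w = (c.w != 0.0f) ? 1.0f / c.w : 0.0f; /* guard against W = 0 */
    ScreenVtx out;
    out.outcode = outcode(c);
    out.sx = c.x * inv_w * vscale[0] + vtrans[0];
    out.sy = c.y * inv_w * vscale[1] + vtrans[1];
    out.sz = c.z * inv_w * vscale[2] + vtrans[2];
    return out;
}
```

With an identity matrix and a scale and translate of 640 in X, a vertex at x = 0.5, w = 1 lands at sx = 960, while a vertex at x = 2, w = 1 gets the OUT_RIGHT flag.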

The code does all of this quite efficiently, yet a few things can be optimized or improved.

I improved the code as follows:

1)    Save DMEM space used by the Newton division algorithm

The division by W to normalize to device coordinates relies on a Newton division algorithm (https://en.wikipedia.org/wiki/Division_algorithm#Newton-Raphson_division), because the RSP division instruction alone is not accurate enough.

Oddly enough, the code stores 8 halfwords of the very same constant (0x0002) in DMEM and loads them into an RSP vector.

My implementation simply adds this constant to a zeroed vector.
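
For readers unfamiliar with the method, Newton-Raphson division refines an initial estimate x of 1/w by repeating x = x * (2 - w * x), which roughly doubles the number of correct bits at each step. The sketch below shows this in plain C with doubles. The name refine_reciprocal is hypothetical, and the real microcode works on fixed-point values with vector instructions. The constant 2 in the update step may well be what the 0x0002 vector is used for, but that link is an assumption here.

```c
#include <assert.h>

/* Refine an estimate of 1/w with Newton-Raphson steps.
 * Each step computes x = x * (2 - w * x). The estimate must already be
 * reasonably close to 1/w (0 < w * x < 2) for the iteration to converge. */
double refine_reciprocal(double w, double estimate, int iterations)
{
    assert(w != 0.0);
    double x = estimate;
    for (int i = 0; i < iterations; i++)
        x = x * (2.0 - w * x);
    return x;
}
```

Starting from the rough guess 0.2 for 1/4, one step gives about 0.24 and a handful of steps converge to 0.25 within double precision.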

2)    Change slightly the viewport structure to save DMEM space and IMEM instructions

As per gbi.h, the viewport structure is as follows:
typedef struct {
    short    vscale[4];  /* scale, 2 bits fraction */
    short    vtrans[4];  /* translate, 2 bits fraction */
    /* both the above arrays are padded to 64-bit boundary */
} Vp_t;

 vscale: (SCREEN_WD/2)*4, (SCREEN_HT/2)*4, G_MAXZ, 0,
 vtrans: (SCREEN_WD/2)*4, (SCREEN_HT/2)*4, 0, 0,

When the viewport structure is loaded, the code actually multiplies (SCREEN_HT/2)*4 by -1. To do so, it uses 4 words of constants in DMEM, which have to be loaded and then multiplied against all the elements of the viewport structure!

As you can imagine, it is absurd to do this instead of changing the structure data as follows:

vscale: (SCREEN_WD/2)*4, -(SCREEN_HT/2)*4, G_MAXZ, 0,
vtrans: (SCREEN_WD/2)*4, -(SCREEN_HT/2)*4, 0, 0,

This change is very straightforward, and it was used by Seta (e.g. Eikou No St Andrews) and Factor 5 (Star Wars Rogue Squadron).

3)    Using freed parameters of the G_VTX command.

As the loading of the vertices is now handled by the G_MOVEMEM command, the G_VTX command can hold other parameters that no longer have to be computed by the code, for instance the location in the scratch buffer where the new vertices are to be stored. Of course, gSPVertex has to be emulated to keep compatibility with Fast3D.

4)    Loading various parameters into a single RSP vector

Many parameters related to the G_VTX command are loaded into several RSP vectors, whereas a mere change in the DMEM data layout allows them to be loaded into a single one, which of course requires only one IMEM instruction instead of several.

5)    Optional Near clipping on and off

In the original Fast3D, near clipping is turned off (.NON) by using a slightly different microcode. I decided that it should be an option: a simple gbi.h macro command can now activate or deactivate near clipping. Of course, vertices processed with one near clipping mode should generate triangles in the same mode (programmers should not switch modes between G_VTX and the related TRI commands).

6)    Precise Clip Ratio

The Clip Ratio is an interesting feature of the original Fast3D microcode: triangles are not clipped until they extend beyond a clipping area defined as a multiple of the viewport area. This feature exists because clipping triangles takes the RSP a significant amount of time, and rasterizing 1 big triangle instead of 2 small triangles can actually be faster.

However, the Fast3D implementation only supports clip ratios from x1 to x6. I decided to implement a more precise clip ratio, in S15.16 format, allowing multipliers such as 1.4 or 2.6.

I also noticed something I cannot really grasp: when clipping is performed, it is done against the clipping area instead of the viewport area. Why this is so is a mystery to me. If a triangle has to be clipped, why not clip it against the tightest bounds, meaning what can actually be seen on the screen? I changed this, and clipping is now performed against the viewport area.
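
To make the S15.16 clip ratio concrete, here is a minimal sketch in plain C, assuming a 32-bit signed value with 16 fractional bits. The names (s15_16, s15_16_from_double, s15_16_mul, outside_clip_area) are hypothetical, and the microcode does this with RSP vector instructions rather than 64-bit arithmetic. The sketch only shows how a fractional ratio such as 1.5 scales the limit used to decide whether a vertex triggers clipping.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical fixed-point type: sign + 15 integer bits + 16 fraction bits. */
typedef int32_t s15_16;

#define S15_16_ONE 65536

/* Convert a real number to s15.16, rounding to the nearest step. */
s15_16 s15_16_from_double(double v)
{
    return (s15_16)(v * S15_16_ONE + (v >= 0.0 ? 0.5 : -0.5));
}

/* Multiply two s15.16 values using a 64-bit intermediate product.
 * Right-shifting a negative value is arithmetic on mainstream compilers,
 * which is assumed here. */
s15_16 s15_16_mul(s15_16 a, s15_16 b)
{
    return (s15_16)(((int64_t)a * (int64_t)b) >> 16);
}

/* Nonzero when coordinate x lies outside [-ratio * w, +ratio * w]. */
int outside_clip_area(s15_16 x, s15_16 w, s15_16 ratio)
{
    assert(ratio > 0);
    s15_16 limit = s15_16_mul(w, ratio);
    return x > limit || x < -limit;
}
```

With w = 100 and a ratio of 1.5, a coordinate of 140 is still inside the clipping area while 160 is outside; with a ratio of 1.0, 140 would already be outside.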

And voilà! Next time I will focus on improving lighting and fog. Stay tuned!

2 comments:

  1. Hello, how can I make a donation to support your amazing work?

  2. https://youtu.be/QP7f7d1wT2w

    I'd like to see these custom optimizations in action on the super mario 64 decomp. Looks like these guys have already switched the microcode to F3DEX2.

    Olivier, would you be interested in sharing a video when this is complete?
