mardi 22 mars 2022

DMEM Optimization

Long time I did not update this blog. I got to do a pause as real life is sometimes quite busy and then I came back to my little microcode hobby.

Here therefore a short update of the lastest developments.

DMEM optimization

After having spent quite some time on a point lighting implementation but being not satisfied of it, I decided to move to DMEM optimization. DMEM is the only memory on which the RSP works actually and its size is only of 4kb. So every byte counts to provide the best performance and avoid multiple DMA accesses.

Clearly the original fast3d ucode was very poor when it comes to DMEM. For instance we do have a buffer of only 16 vertex but there is an output buffer for 6 LLE triangles (6*176 bytes = 1056 bytes) which could be send for processing by the RDP. It has been one of the first changes done by F3DEX, the successor of Fast3D: increasing the vertex cache to 32. Doing this was simply done by change the output buffer to 176 bytes only.

In multiple occasions, I took the opportunity to save DMEM when transforming the ucode. For instance the geometry mode is not stored in DMEM, MOVEWORD DMEM addresses are stored in the GBI command, lighting data were reduced to 16 bytes etc. I really cannot count the number of changes which lead to reduce the usage of DMEM space but a lot of efforts were made in this respect.

Yet even if the DMEM space was saved, the overhaul DMEM organization was not reshaped and the saved space was not used for something else. I spent the last month to re-organize the DMEM in a way to optimize the DMEM space. Though it looks an easy task it does actually take quite some efforts to ensure that all data in DMEM are aligned with the IMEM instructions and GBI macros.

I have finally finished this task.

Results:

    * Output vertex buffer doubled: 16 vertex to 32 vertex

    * Input data buffer for vertex doubled (from 16 to 32 vertex)

    * Colors/Normals buffer increase from 16 to 50 colors/normals

    * S&T buffer increased increase from 16 to 48 S&T.

Instead of F3DEX, I have 372 bytes (2 triangles) for the RDP output buffer. It would be good to see if there is difference in terms of performance between the F3DEX buffer and my microcode.

Two TRI commands

I have also decided to have two TRI commands:

    * One "standard", where the command will only select the indices in the vertex buffer of the 3 points composing the triangle.

    * One "special", where the command will pick up not only the indices of the 3 vertex in the vertex buffer but also the colors and the S&T in their respective buffer.

In order to do so, for the "standard" TRI command, I stored in the two last bytes of the vertex input data the colors and S&T index (which were only padding before).

Some would say: why doing such a weird double implementation?

Fact is that for some of upcoming features (see list below) picking up any index for colors or S&T would be either tricky or inadequate.

A good example would be point lighting where the vertex coordinates have an influence on the colors.

 Tests

As previously mentioned, I would like now to get feedback of some users with respect of the current implementation of the microcode. It is far from being the final product but having so much changes already, I believe that it is essential to receive feedback.

Hopefully I have already selected few persons willing to test my work. Another bunch of persons would be more than welcome. Would you feel interested (and skilled to create homebrew with libultra), feel free to contact me (via a comment on this blog or via Discord N64Brew).

After a successul test phase, here my remaining (!) to do list for a next phase:

     * Point lighting

     * Triangle strips

     * "Texture rectangle" done by the mean of a triangle

    * RDP displaylist (you simply don't process the RDP commands one by one in the RSP but you simply retrieve them from RDRAM and ship them directly to the RDP).

     * Save the average of the Z of a triangle in a unused part of the RDP triangle command, send data to RDRAM, then the CPU would sort by Z the data and send to the RDP. We could get rid of Z buffer in this way.

    * For special TRIs, check first if they are not trivially rejected before executing the rest of the command. Potentially to be done as well for TRI command

    * Rewrite the organization of the microcode and the overlays. Currently we have no more IMEM     space!

    * Rejection instead of clipping as an option (with usage of precised clip ratio macro)

    * Finally port the ucode to libdragon, get rid of any reference to libultra

   * Last but important thing: name the microcode 🙂

 Still a lot of updates to come therefore!!

 See ya!

1 commentaire: