It has been a long time since I last updated this blog. I had to take a break, as real
life is sometimes quite busy, but I have now come back to my little microcode hobby.
So here is a short update on the latest developments.
DMEM optimization
After spending quite some time on a point lighting
implementation without being satisfied with it, I decided to move on to DMEM
optimization. DMEM is the only memory the RSP actually works on, and
its size is only 4 KB. So every byte counts to get the best performance
and avoid multiple DMA accesses.
Clearly the original Fast3D ucode was very wasteful when it
comes to DMEM. For instance, it has a buffer of only 16 vertices, yet there is an
output buffer for 6 LLE triangles (6*176 bytes = 1056 bytes) that can be
sent to the RDP for processing. This was one of the first changes made by F3DEX, the
successor of Fast3D: increasing the vertex cache to 32. This was achieved
simply by reducing the output buffer to 176 bytes.
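To put numbers on that trade-off, here is a minimal sketch in C of the DMEM arithmetic. The 176-byte triangle size and the 6-to-1 buffer reduction come from the text above. The function names are made up for illustration, and the per-vertex DMEM footprint passed to `extra_vertices` is an assumption, not a figure from this post.

```c
#include <assert.h>
#include <stddef.h>

enum {
    DMEM_SIZE        = 4096, /* total RSP data memory, in bytes */
    RDP_TRI_MAX_SIZE = 176   /* one full LLE triangle command, in bytes */
};

/* Bytes used by an RDP output buffer holding `tri_count` full triangles. */
static size_t rdp_output_buffer_size(size_t tri_count)
{
    return tri_count * RDP_TRI_MAX_SIZE;
}

/* Bytes freed when the output buffer shrinks from `old_tris` to `new_tris`. */
static size_t dmem_bytes_freed(size_t old_tris, size_t new_tris)
{
    assert(new_tris <= old_tris);
    return rdp_output_buffer_size(old_tris) - rdp_output_buffer_size(new_tris);
}

/* How many extra vertices fit in `freed` bytes.
 * `vertex_size` is the DMEM footprint of one transformed vertex
 * (an assumed parameter; the post does not state it). */
static size_t extra_vertices(size_t freed, size_t vertex_size)
{
    assert(vertex_size > 0);
    return freed / vertex_size;
}
```

`dmem_bytes_freed(6, 1)` gives 880 bytes, more than a fifth of the whole DMEM. Assuming, say, 40 bytes per transformed vertex, that is room for 22 more vertices, so 16 extra ones fit comfortably.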
On multiple occasions, I took the opportunity to
save DMEM while transforming the ucode. For instance, the geometry mode is not
stored in DMEM, MOVEWORD DMEM addresses are stored in the GBI command, lighting
data was reduced to 16 bytes, etc. I really cannot count the number of changes
that reduced the usage of DMEM space, but a lot of effort went into
this.
Yet even though DMEM space was saved, the overall DMEM
organization was not reshaped and the saved space was not used for anything
else. I spent the last month reorganizing DMEM to make the best use of
that space. Though it looks like an easy task, it actually takes
quite some effort to ensure that all data in DMEM stays in line with the IMEM instructions and GBI macros.
I have finally finished this task.
Results:
* Output vertex buffer doubled: from 16 to 32 vertices
* Input data buffer for vertices doubled: from 16 to 32 vertices
* Colors/normals buffer increased from 16 to 50 colors/normals
* S&T buffer increased from 16 to 48 S&T.
Unlike F3DEX, I have 372 bytes (2 triangles) for the RDP
output buffer. It would be good to see whether there is a difference in
performance between the F3DEX buffer and my microcode.
Two TRI commands
I have also decided to have two TRI commands:
* One "standard", where the command only selects
the indices in the vertex buffer of the 3 points composing the triangle.
* One "special", where the command picks not
only the indices of the 3 vertices in the vertex buffer but also the indices of the colors and
the S&T in their respective buffers.
To make this work for the "standard" TRI command, I store the color and S&T
indices in the last two bytes of the vertex input data (which were only padding before).
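The difference between the two commands can be sketched in C as follows. This only illustrates the idea and is not the actual RSP code: the `VtxIn` layout (three 16-bit coordinates followed by two index bytes), the `Corner` type and both function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical vertex input entry: position plus the two index bytes
 * that take the place of the former padding. */
typedef struct {
    int16_t x, y, z;
    uint8_t color_idx; /* index into the colors/normals buffer */
    uint8_t st_idx;    /* index into the S&T buffer */
} VtxIn;

static_assert(sizeof(VtxIn) == 8, "VtxIn is assumed to be 8 bytes");

/* Indices resolved for one corner of a triangle. */
typedef struct {
    uint8_t vtx, color, st;
} Corner;

/* "Standard" TRI: the command carries only 3 vertex indices; the color
 * and S&T indices are read from the vertex input data itself. */
static void tri_standard(const VtxIn *vbuf, const uint8_t v[3], Corner out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i].vtx   = v[i];
        out[i].color = vbuf[v[i]].color_idx;
        out[i].st    = vbuf[v[i]].st_idx;
    }
}

/* "Special" TRI: the command carries vertex, color and S&T indices,
 * so any combination can be picked per corner. */
static void tri_special(const uint8_t v[3], const uint8_t c[3],
                        const uint8_t st[3], Corner out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i].vtx   = v[i];
        out[i].color = c[i];
        out[i].st    = st[i];
    }
}
```

The standard form keeps the command small and ties each vertex to one color and one S&T pair, while the special form pays for a longer command with full freedom of choice.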
Some would ask: why such a weird double
implementation?
The fact is that for some of the upcoming features (see the list below), picking
any index for colors or S&T would be either tricky or inadequate.
A good example is point lighting, where
the vertex coordinates influence the colors.
Tests
As previously mentioned, I would now like to get feedback from
some users on the current implementation of the microcode. It is far from being
the final product, but with so many changes already made, I believe it is
essential to receive feedback.
Fortunately, I have already found a few people willing to
test my work. More testers would be more than welcome. If you are
interested (and skilled enough to create homebrew with libultra), feel free to
contact me (via a comment on this blog or via the N64Brew Discord).
After a successful test phase, here is my remaining (!) to-do list for the next phase:
* Point lighting
* Triangle strips
* "Texture rectangle" done by means of a triangle
* RDP display list (you don't process the RDP commands
one by one in the RSP; you simply retrieve them from RDRAM and ship them
directly to the RDP).
* Save the average Z of a triangle in an unused part of the RDP triangle command and send the data to RDRAM; the CPU would then sort the data by Z and send it to the RDP. We could get rid of the Z buffer this way.
* For special TRIs, first check whether they are trivially
rejected before executing the rest of the command. Potentially to be done
for the standard TRI command as well
* Rewrite the organization of the microcode and the overlays.
Currently there is no IMEM space left!
* Rejection instead of clipping as an option (using a
precise clip ratio macro)
* Finally, port the ucode to libdragon and get rid of any
reference to libultra
* Last but not least: name the microcode 🙂
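For the Z-sorting item in the list above, the CPU side could look roughly like the following C sketch. Everything here is an assumption made for illustration: the `SortedTri` record, the function names, and the convention that a larger Z means farther away. For clarity it keeps the average Z next to the command instead of packing it into an unused part of the command itself.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#define RDP_TRI_MAX_SIZE 176 /* one full LLE triangle command, in bytes */

/* Hypothetical record written back to RDRAM for each triangle. */
typedef struct {
    int32_t avg_z;                 /* average of the 3 vertex Z values */
    uint8_t cmd[RDP_TRI_MAX_SIZE]; /* raw RDP triangle command */
} SortedTri;

/* Average of three Z values, computed in 64 bits to avoid overflow. */
static int32_t average_z(int32_t z0, int32_t z1, int32_t z2)
{
    return (int32_t)(((int64_t)z0 + z1 + z2) / 3);
}

/* qsort comparator: farthest triangle first (painter's algorithm),
 * assuming a larger Z means farther from the camera. */
static int cmp_far_to_near(const void *a, const void *b)
{
    const SortedTri *ta = a;
    const SortedTri *tb = b;
    return (ta->avg_z < tb->avg_z) - (ta->avg_z > tb->avg_z);
}

/* Sort the triangles back to front before sending them to the RDP. */
static void sort_triangles(SortedTri *tris, size_t count)
{
    assert(tris != NULL || count == 0);
    if (count > 1) {
        qsort(tris, count, sizeof *tris, cmp_far_to_near);
    }
}
```

Sorting by a per-triangle average is only an approximation: triangles that intersect or overlap cyclically can still be drawn in the wrong order, which is the usual price for dropping the Z buffer.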
So there are still a lot of updates to come!
See ya!
2.14.0.0