A journey into the mysteries of N64 RCP

Sunday, 31 December 2023

Progress made in 2023

Hi Folks! 😎

It has been a long time since the last update, and on the eve of this New Year I would like to share with you the recent progress made on my customized F3D ucode.

1. A further 80 bytes of DMEM saved.

2. Management of the overlays by the microcode has been changed.

3. Implementation of an RDP display list command. Instead of processing the RDP commands one by one in the RSP, the microcode simply retrieves them from RDRAM and ships them directly to the RDP.

4. Implementation of a “conditional” RDP display list command, which depends on whether triangles will actually be displayed on the screen.

5. Successful implementation of a triangle strip command. Such a command can generate numerous triangles (more than 100 is possible) and does so in the appropriate order (in this respect please read https://en.wikipedia.org/wiki/Triangle_strip). With this triangle strip command I have also implemented a primitive restart feature and the possibility to trigger RDP display list commands in the middle of triangle generation.

6. I have removed the precise clip ratio macro for the time being, until it can be investigated further and, where necessary, debugged.
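To illustrate point 5, here is a minimal sketch in C of how a triangle strip with a restart marker can be expanded into individual triangles. It is not the microcode's implementation (which runs on the RSP): the `STRIP_RESTART` value, the `Tri` type and the `strip_to_tris` function are hypothetical names chosen for this example only.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical restart marker; the value used by the microcode is not known. */
#define STRIP_RESTART 0xFF

typedef struct { uint8_t v0, v1, v2; } Tri;

/* Expand a strip of vertex indices into triangles.
 * Returns the number of triangles written to out (at most max_out). */
size_t strip_to_tris(const uint8_t *idx, size_t n, Tri *out, size_t max_out)
{
    size_t count = 0;
    size_t run = 0;          /* indices seen since the start or the last restart */
    uint8_t a = 0, b = 0;    /* the two previous indices of the current run */

    for (size_t i = 0; i < n; i++) {
        if (idx[i] == STRIP_RESTART) {
            run = 0;         /* begin a new, independent strip */
            continue;
        }
        uint8_t c = idx[i];
        if (run >= 2 && count < max_out) {
            /* Swap the first two vertices on every other triangle so that
             * all triangles of the strip keep the same winding order. */
            if ((run & 1) == 0)
                out[count++] = (Tri){ a, b, c };
            else
                out[count++] = (Tri){ b, a, c };
        }
        a = b;
        b = c;
        run++;
    }
    return count;
}
```

With this scheme a strip of N indices yields N - 2 triangles, so only one new index is needed per additional triangle, and the restart marker allows several strips to be chained in a single command.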

What about the future?

I will work next year hopefully on the following points:

1. nusys compatibility: I noticed that ALL homebrew using the original N64 SDK uses nusys so I really need to ensure that the microcode is compatible with it.

2. Remove the necessity to block CPU yielding for some commands.

3. Run tests to find the appropriate balance between the vertex buffer and the RDP buffer in DMEM.

4. Point lighting

And likely much more!

I will open a new beta session once 1. and 2. are done. Testers are welcome!

Wishing you all a happy new year 2024!😀

See ya!

PS: please also feel free to check out the incredible developments of the F3DEX3 microcode and of the Libdragon SDK. N64 homebrew is about to become fantastic in the coming years!

https://github.com/HackerN64/F3DEX3

https://github.com/DragonMinded/libdragon

Tuesday, 27 September 2022

Beta Tests of Phase 1 over – Public Release

It has been a while since I updated my blog. Of course real life takes up a good part of my spare time, but I am quite stubborn and keep working on my projects 🙂

The beta testing of the microcode in its current state is now considered over. I could not find many testers to do the job, so I actually did most of those tests on my own.

I would say that it was quite a good exercise, as I got to document, at least at a high level, the way the new functions work. By converting various demos from the original SDK to the microcode, I was able to spot some minor but still important issues in my code, affecting texturing and lighting for instance.

I also tried to push my DMEM optimization further, and I now have a 40-vertex cache available, with color and texture coordinate buffers of the same size. It is already better than F3DEX2. I still believe that there should be a way to do much better. There is an issue related to RSP yielding from/to the CPU, but I have not been able to spot its root cause (yet).

Now, after writing up the development of the microcode, I have decided to share the binary (not the source) of the microcode publicly, in the hope that some nice yet adventurous N64 programmers will share their feedback and even their wishes with me. Those who would like to test it may find more information on the N64brew Discord channel.

I will now focus on the next phase of implementation. More to come!

Tuesday, 22 March 2022

DMEM Optimization

It has been a long time since I updated this blog. I had to take a break as real life is sometimes quite busy, and then I came back to my little microcode hobby.

Here is therefore a short update on the latest developments.

DMEM optimization

After spending quite some time on a point lighting implementation without being satisfied with it, I decided to move on to DMEM optimization. DMEM is the only data memory the RSP actually works on, and its size is only 4 KB. So every byte counts to provide the best performance and avoid multiple DMA accesses.

Clearly the original Fast3D ucode made very poor use of DMEM. For instance, it has a buffer of only 16 vertices, but an output buffer for 6 LLE triangles (6 × 176 bytes = 1056 bytes) waiting to be sent to the RDP for processing. This was one of the first changes made by F3DEX, the successor of Fast3D: increasing the vertex cache to 32. This was done simply by reducing the output buffer to 176 bytes.

On multiple occasions, I took the opportunity to save DMEM while transforming the ucode. For instance, the geometry mode is not stored in DMEM, MOVEWORD DMEM addresses are stored in the GBI command, lighting data were reduced to 16 bytes, etc. I really cannot count the number of changes that reduced the usage of DMEM space, but a lot of effort went into this.

Yet even though DMEM space was saved, the overall DMEM organization was not reshaped and the saved space was not used for anything else. I spent the last month reorganizing the DMEM to make the best use of that space. Though it looks like an easy task, it actually takes quite some effort to ensure that all data in DMEM are aligned with the IMEM instructions and GBI macros.

I have finally finished this task.

Results:

* Output vertex buffer doubled: from 16 to 32 vertices

* Input data buffer for vertices doubled: from 16 to 32 vertices

* Colors/normals buffer increased from 16 to 50 colors/normals

* S&T buffer increased from 16 to 48 S&T

Unlike F3DEX, I have 372 bytes (2 triangles) for the RDP output buffer. It would be good to see whether there is a difference in performance between the F3DEX buffer and my microcode.

Two TRI commands

I have also decided to have two TRI commands:

* One "standard", where the command only selects the indices in the vertex buffer of the 3 points composing the triangle.

* One "special", where the command picks not only the indices of the 3 vertices in the vertex buffer but also the colors and the S&T in their respective buffers.

To make this possible for the "standard" TRI command, I store the color and S&T indices in the last two bytes of the vertex input data (which were only padding before).

Some would say: why such a weird double implementation?

The fact is that for some of the upcoming features (see the list below), picking up an arbitrary index for colors or S&T would be either tricky or inadequate.

A good example would be point lighting where the vertex coordinates have an influence on the colors.
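The difference between the two TRI commands can be sketched in C as follows. This is only an illustration under assumptions: the `VtxIn` layout (three 16-bit coordinates followed by the two former padding bytes), the `Corner` type and both function names are hypothetical and do not come from the microcode or from gbi.h.

```c
#include <stdint.h>

/* Hypothetical 8-byte vertex input: position plus the two former padding bytes. */
typedef struct {
    int16_t ob[3];      /* x, y, z in model space */
    uint8_t color_idx;  /* index into the colors/normals buffer */
    uint8_t st_idx;     /* index into the S&T buffer */
} VtxIn;

/* What one corner of a triangle finally refers to. */
typedef struct { uint8_t vtx, color, st; } Corner;

/* "Standard" TRI: the command gives only the vertex index;
 * color and S&T indices come from the vertex itself. */
Corner corner_standard(const VtxIn *vbuf, uint8_t v)
{
    Corner c = { v, vbuf[v].color_idx, vbuf[v].st_idx };
    return c;
}

/* "Special" TRI: the command gives all three indices explicitly. */
Corner corner_special(uint8_t v, uint8_t color, uint8_t st)
{
    Corner c = { v, color, st };
    return c;
}
```

In the standard case the color and S&T stay tied to the vertex, which is what a feature like point lighting needs; in the special case the same vertex position can be reused with different colors or texture coordinates.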

Tests

As previously mentioned, I would now like to get feedback from some users on the current implementation of the microcode. It is far from being the final product, but with so many changes already made, I believe that it is essential to receive feedback.

Fortunately I have already selected a few people willing to test my work. A few more would be more than welcome. If you are interested (and skilled enough to create homebrew with libultra), feel free to contact me (via a comment on this blog or via the N64brew Discord).

After a successful test phase, here is my remaining (!) to-do list for the next phase:

* Point lighting

* Triangle strips

* "Texture rectangle" done by means of a triangle

* RDP display list (instead of processing the RDP commands one by one in the RSP, you simply retrieve them from RDRAM and ship them directly to the RDP).

* Save the average Z of a triangle in an unused part of the RDP triangle command and send the data to RDRAM; the CPU would then sort the data by Z and send it to the RDP. We could get rid of the Z-buffer this way.

* For special TRIs, first check whether they are trivially rejected before executing the rest of the command. Potentially to be done for the standard TRI command as well.

* Rewrite the organization of the microcode and the overlays. Currently we have no more IMEM space!

* Rejection instead of clipping as an option (using the precise clip ratio macro)

* Finally port the ucode to libdragon, get rid of any reference to libultra

* Last but important thing: name the microcode 🙂

So there are still a lot of updates to come!!

See ya!

Saturday, 3 April 2021

Microcode Optimization: IMEM (Part8)

Today I am very happy to finally give you news on the recent developments of the microcode. Over the last months, as previously mentioned, I tackled lighting and reflection mapping. I am not totally done with it yet, but with all the changes already carried out, it is worthwhile to share the current status.

I) Change the way the inverse transpose of the 3x3 part of the model matrix is loaded in the RSP

In order to compute lighting, you have to use the inverse transpose of the 3x3 part of the model matrix (which I will call the 3x3 matrix from now on) to get the light direction in model coordinates.

a) In Fast3D, the model matrix is loaded first and the data are then rearranged using special RSP instructions (LTV, STV) designed to help with this. It requires some scratch DMEM space and about 16 instructions to get one 3x3 matrix in the RSP.

b) In F3DEX2, the 3x3 matrix is loaded directly from DMEM where the model matrix is stored. There is no need to load the model matrix and then transpose it, and no scratch DMEM space is necessary. Again, it provides one 3x3 matrix.

My implementation is different, as my goal is also to implement point lighting, which is totally missing in the original F3D. Point lighting requires both the model matrix and the 3x3 matrix, and this for each vertex to be processed.

I do underline “for each vertex”: with directional lighting, until you get a new matrix or new lights, you may use the same computed light directions in model coordinates for every vertex to be processed, meaning that the 3x3 matrix has to be used only once for such computations.

So point lighting requires having the model matrix and the 3x3 matrix at the same time. On top of that, the RSP vector unit (VU) registers hold 8 lanes of 16 bits, which is used to process two vertices at once. While for F3D one 3x3 matrix was sufficient to compute a limited number of light directions in model coordinates, this cannot be the case when processing point lighting per vertex. It means that the VU must hold two copies of the model matrix and of the 3x3 matrix, duplicated on the same vectors.

It took me significant time to find a good solution for those requirements. After loading the model matrix twice (as in Fast3D), it takes me only 22 instructions to get the two 3x3 matrices as desired. To achieve this I used an RSP instruction, VMRG with the VCC register, in a different way than usually intended.

Additionally, I have replicated the way F3DEX2 normalizes the light direction in model coordinates, which is more efficient than Fast3D.

II) Separate lighting computations from vertex processing

As explained, lighting/reflection mapping computations are done per vertex. While this may be fine with directional lights, with point lighting it would mean reloading the model and 3x3 matrices over and over, for each vertex, as the RSP is unable to hold all those data while performing the rest of the vertex processing tasks (i.e. multiplying vertices by the ModelViewProjection (MVP) matrix, clipping, fog, etc.). To avoid this, lighting computations have to be done separately. It means breaking the Vtx structure of gbi.h, where vertex coordinates, colors/normals and texture coordinates are tied together.

There are many advantages to doing so:

You may flexibly load vertices, colors/normals and texture coordinates into separate buffers.

You may avoid duplication of data (same color, same normal, same texture coordinates).

You can do computations more efficiently for/on those buffers.

There are of course some drawbacks:

You need new structures for colors/normals and texture coordinates. As the vectors in the RSP are 8 lanes of 16 bits, the structures have to hold 2 colors/normals or 2 sets of texture coordinates. These structures also have to be 64-bit aligned due to the RSP DMA engine.

GBI compatibility is broken. N64 programmers have to rewrite a portion of their code.

Flexibility comes with more complexity for the programmers (a proliferation of GBI macros).
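As an illustration of the first drawback, here is a sketch in C of what such paired, 64-bit aligned structures could look like. The type names and field layouts (`Rgba`, `ColorPair`, `TexCoordPair`) are assumptions made for this example; they are not the structures actually used by the microcode.

```c
#include <stdalign.h>
#include <stdint.h>

/* One RGBA color; the same 4 bytes could hold a packed normal instead. */
typedef struct { uint8_t r, g, b, a; } Rgba;

/* Two colors (or normals) per structure: 8 bytes, 64-bit aligned. */
typedef struct {
    alignas(8) Rgba c[2];
} ColorPair;

/* Two (S, T) pairs per structure: 8 bytes, 64-bit aligned. */
typedef struct {
    alignas(8) int16_t st[2][2];
} TexCoordPair;

/* Compile-time checks that the layout matches the 64-bit DMA granularity. */
_Static_assert(sizeof(ColorPair) == 8, "ColorPair must be 8 bytes");
_Static_assert(sizeof(TexCoordPair) == 8, "TexCoordPair must be 8 bytes");
_Static_assert(alignof(ColorPair) == 8, "ColorPair must be 64-bit aligned");
_Static_assert(alignof(TexCoordPair) == 8, "TexCoordPair must be 64-bit aligned");
```

Packing two elements per structure keeps every element of an array on a 64-bit boundary, so any run of structures can be transferred by DMA without partial words.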

Many custom microcodes have made such a change, at least partially (e.g. Perfect Dark or Indiana Jones).

So I had to create a new command to compute lighting and texture coordinates, along with multiple GBI macros.

I took advantage of having two 3x3 matrices (instead of one) to compute light directions in model coordinates nearly 2 times faster. I separated directional lighting computation from texture coordinates computation entirely in order to avoid many unnecessary computations done by F3D. While the computation speed of lighting should remain more or less the same, that of S&T for reflection mapping has been improved.

Of course I have also taken care to optimize the DMEM space used for those computations.

In the original Fast3D, each light structure used 32 bytes. In F3DEX2, it uses 24 bytes. In my implementation, it uses only 16 bytes. To achieve this I had to change the way some libultra functions return data (guLookAtHilite, guLookAtReflect, guPosLight, guPosLightHilite). I also noticed something in those functions (and some others as well) that could be optimized, which I have implemented. According to the MIPS 4200 documentation, the SQRT and DIV CPU instructions are slow, so where possible it is better to use a fast inverse square root instead. (Thinking about it, it is noticeable that reflection mapping was not used in a lot of games, likely because it was actually too demanding.)
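For readers not familiar with the technique, below is a sketch in C of the classic fast inverse square root (bit-level initial guess followed by one Newton-Raphson step), used here to normalize a vector without any SQRT or DIV. It is the well-known textbook variant, not necessarily the one implemented in the modified libultra functions; `fast_rsqrt` and `normalize3` are hypothetical names.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0, without sqrt or division. */
float fast_rsqrt(float x)
{
    uint32_t i;
    float y;

    memcpy(&i, &x, sizeof i);            /* reinterpret the float bits safely */
    i = 0x5f3759dfu - (i >> 1);          /* initial guess from the exponent */
    memcpy(&y, &i, sizeof y);
    y = y * (1.5f - 0.5f * x * y * y);   /* one Newton-Raphson refinement */
    return y;
}

/* Normalize a 3-component vector in place (e.g. a light direction).
 * Assumes the vector is not the zero vector. */
void normalize3(float v[3])
{
    float inv = fast_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);

    v[0] *= inv;
    v[1] *= inv;
    v[2] *= inv;
}
```

The result is only an approximation (a fraction of a percent off after one refinement step), which is usually acceptable for light directions that end up quantized to 8 bits anyway.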

Now obviously you have to link vertices, output colors and output texture coordinates. I could have decided to use the 2 remaining bytes of the vertex structure, currently padding, to do so. However, I chose to do it at triangle level, as it is at this level that control should be given to the graphic designer. It also helps to avoid unnecessary duplication of data in the buffers. However, care has to be taken when advanced lighting features such as reflection mapping or point lighting are used, where everything is interdependent at vertex level and not at triangle level.

It means that I had to rewrite the G_TRI command entirely. Additionally, the G_VTX command had to be adapted, as lighting and texture coordinates computations are no longer done by this command.

And voila!

Now I will focus on point lighting and on what I wanted to create from the very beginning of this project: a circular buffer to store transformed vertices and generate triangle strips.

Side note: I will soon update my previous article, as I have continued to make some interesting changes to G_BRANCH_Z and G_CULLDL.

HELP NEEDED: After implementing point lighting and triangle strips and optimizing DMEM, it will be time to test the microcode. I am therefore calling for people (with sufficient programming skills of course) to help out. It can be done in different ways: adapting N64 SDK demos, adapting already written homebrew games, writing new test demos using the new features of the microcode, etc. Contact me on the N64brew Discord channel if you are interested. Thanks!

Wednesday, 30 December 2020

Microcode Optimization: IMEM (Part7) - Updated

It has been quite a long time since I shared any update on the project. I must admit that for the last six months, I could not find enough time and motivation to focus on it. I do not impose any pressure or deadline on myself when it comes to a hobby. Yet I am quite persevering, so rest assured that I will finish what I have started 😀

Let’s try to summarize the achievements of the recent weeks:

G_GL & G_ENDDL

I have significantly changed the code related to the way the display lists are managed. Even though the G_GL command in the original Fast3D microcode is quite efficient in this respect, there were a few points which, at least from my point of view, could be improved.

First of all, 3 general registers of the RSP were exclusively dedicated to the display lists in the original Fast3D microcode:

K0 ($26) for the place in RDRAM of the next command to be executed in the current running display list

K1 ($27) for the place in DMEM of the next command to be executed in the current running display list

GP ($28) used as a counter to ensure that the space dedicated in DMEM to the running display list was not exceeded.

I came to the conclusion that K0 and GP were not really necessary. With K1 we can determine whether the limit of the space allocated in DMEM for the display list has been reached. With K1 it is also possible to compute at any time the RDRAM address of the next command in the current running display list, simply by using the base address of the segment used by that display list.

Though it took me quite a while to find the right way to do it, my new implementation does not require K0 and GP at all. However, as a side effect, the command ending a display list (G_ENDDL) has been merged with G_GL.

G_SETGEOMETRYMODE, G_CLEARGEOMETRYMODE

Having freed 2 registers, I decided, as is done in the Factor 5 microcode used by Indiana Jones and the Infernal Machine, to use one of them to hold the current geometry mode. It means that there is no need to load or store the geometry mode from/to DMEM. Doing so saves 3 IMEM instructions and one word in DMEM but, more importantly, allows various parts of the microcode to access the current geometry mode immediately.

Commands of 4 words

In Fast3D, every command is composed of 2 words. I do understand SGI/Nintendo's underlying reason for this: RDP commands are usually 2 words and the DMA engine of the N64 is 64 bits. With a fixed size, managing memory is much easier than with a size that may change for each command. As some commands may be composed of 4 words, I circumvented this limitation by simply adding, at the end of the display list, a scratch space for the two potential additional words of a command. I use this feature for textured rectangles for the time being.

Meanwhile I got rid of G_RDPHALF_1, G_RDPHALF_2 and G_RDPHALF_CONT as they became completely useless.

"New" commands

G_BRANCH_Z has been implemented in a completely different way than in F3DEX2. Actually I renamed the command to G_COMPARE, as it is now not only Z that you may compare but any information stored in DMEM, against data provided by the programmer. Such a comparison can be done on a byte, a half word or a word.

Additionally, unlike G_BRANCH_Z, which is limited to branching to another display list, the programmer can decide what happens as a result of the command. It is simply done by skipping or not the following commands in the display list, depending on the result of the G_COMPARE command.

For instance you may deactivate point lighting when the vertices being processed are too far away from the camera.

I also created a new command which can be very useful together with G_COMPARE: G_SKIP.

G_SKIP is very simple: it skips a number of commands in the display list. For instance, you may want not only to deactivate point lighting but to skip the activation of the lighting, the texturing, the RDP commands loading textures in TMEM, etc.

I cannot imagine all the possibilities with those two commands but I feel that you may play a lot with those :)
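To make the interplay of the two commands more concrete, here is a toy display list interpreter in C. Everything in it is an assumption made for illustration: the opcodes, the `Cmd` encoding, and the choice that a compare which succeeds (DMEM value lower than the reference) skips exactly the next command. The real GBI encoding and comparison semantics of G_COMPARE and G_SKIP may well differ.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical opcodes for this sketch only. */
enum { OP_WORK, OP_COMPARE, OP_SKIP, OP_END };

typedef struct {
    uint8_t  op;
    uint8_t  size;    /* OP_COMPARE: 1, 2 or 4 bytes to read from DMEM */
    uint16_t offset;  /* OP_COMPARE: byte offset into DMEM */
    uint32_t arg;     /* OP_COMPARE: reference value; OP_SKIP: commands to skip */
} Cmd;

/* Read a big-endian byte, half word or word from a DMEM image. */
static uint32_t read_dmem(const uint8_t *dmem, uint16_t off, uint8_t size)
{
    uint32_t v = 0;
    for (uint8_t i = 0; i < size; i++)
        v = (v << 8) | dmem[off + i];
    return v;
}

/* Run the list; returns how many OP_WORK commands were executed. */
size_t run_dl(const Cmd *dl, size_t n, const uint8_t *dmem)
{
    size_t executed = 0;

    for (size_t pc = 0; pc < n; pc++) {
        switch (dl[pc].op) {
        case OP_COMPARE:
            /* Assumed semantics: if the DMEM value is below the reference,
             * skip the next command. */
            if (read_dmem(dmem, dl[pc].offset, dl[pc].size) < dl[pc].arg)
                pc += 1;
            break;
        case OP_SKIP:
            pc += dl[pc].arg;   /* jump over the next arg commands */
            break;
        case OP_END:
            return executed;
        default:
            executed++;         /* stands for any ordinary command */
            break;
        }
    }
    return executed;
}
```

A list such as COMPARE, SKIP 2, WORK, WORK, WORK, END then behaves like an if statement: when the compared value is below the reference, the SKIP itself is skipped and all three WORK commands run; otherwise the SKIP runs and only the last WORK command is executed.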

G_CULLDL, still inspired by F3DEX2, works exactly the same way as G_COMPARE. It does not automatically close the display list: it skips or not the following command. That command can indeed close the display list, but it can also branch to another one, etc.

All the above changes took a very limited number of instructions compared to my previous update. At the current stage, my customized Fast3D microcode has more features than the F3DEX microcode, and I still have numerous ideas which I would like to implement.

I hope to be able to update the lighting part of the microcode next time, even if it is quite a challenge.

Finally I would like to wish you a happy new year 2021!!! 😋