A journey into the mysteries of N64 RCP

Sunday, 29 September 2024

Status & Hiatus

Hello :)

It has been a long while since I last updated my blog. I continued to work on my ucode and I am actually near completion. Not all the intended features are at their final stage, but my code works according to my ideas.

In a nutshell:

1) Colors (including the fog alpha value) and texture coordinates are loaded at pipeline level and are no longer part of the vertex attributes. With a color buffer and a texture coordinate buffer in DMEM, this saves a significant amount of DMEM.

2) After reviewing the Fast3D clipping code in more detail, I discovered an awful truth: about 200 bytes of DMEM reserved for clipping were NEVER used.

As I had only reached 52 vertices in the vertex buffer, I went further down the road of optimizing the vertex structure.

3) The only way to do so was to reduce the space used for clipping. I therefore decided to keep the original vertex coordinates along with a reference to the vertex's MVP matrix, and to calculate the data needed for clipping only when clipping occurs. This does mean that the MVP matrix used for a processed vertex must still be available in DMEM.

4) In order to do so, I changed the way matrices are stored: the DMEM space for the modelview, projection and ModelviewProjection matrices can be allocated dynamically by the user. Where necessary, this allows up to 3 MVP matrices to be stored in DMEM at the same time. It is also possible to pop matrices as necessary.

This trick works fine for static objects and scenes, but it might be an issue for characters using bones.

Thanks to this, the vertex buffer went up to 64 vertices, which is really nice! :)
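To make points 3) and 4) more concrete, here is a minimal sketch in plain C (not RSP code). It assumes a hypothetical compact vertex that keeps the original coordinates plus the index of the matrix slot it was transformed with. The names `CompactVtx`, `mvp_slots` and `clip_pos` are invented for illustration, and floats stand in for the fixed-point math the RSP really uses.

```c
#include <stdint.h>

#define MTX_SLOTS 3   /* the post mentions up to 3 MVP matrices kept in DMEM */

/* Floats keep the sketch readable; the RSP itself works in fixed point. */
typedef struct { float m[4][4]; } Mtx4;

/* Hypothetical compact vertex: original coordinates plus a matrix reference. */
typedef struct {
    int16_t x, y, z;
    uint8_t mvp_slot;   /* which stored MVP matrix transformed this vertex */
    uint8_t flags;
} CompactVtx;

typedef struct { float x, y, z, w; } ClipPos;

static Mtx4 mvp_slots[MTX_SLOTS];

/* Recompute the clip-space position only when clipping actually needs it
 * (row-vector convention: position * matrix). */
static ClipPos clip_pos(const CompactVtx *v)
{
    const Mtx4 *mx = &mvp_slots[v->mvp_slot];
    ClipPos c;
    c.x = v->x * mx->m[0][0] + v->y * mx->m[1][0] + v->z * mx->m[2][0] + mx->m[3][0];
    c.y = v->x * mx->m[0][1] + v->y * mx->m[1][1] + v->z * mx->m[2][1] + mx->m[3][1];
    c.z = v->x * mx->m[0][2] + v->y * mx->m[1][2] + v->z * mx->m[2][2] + mx->m[3][2];
    c.w = v->x * mx->m[0][3] + v->y * mx->m[1][3] + v->z * mx->m[2][3] + mx->m[3][3];
    return c;
}
```

The trade-off is visible in the sketch: the vertex shrinks because no clip data is stored, but `clip_pos` only gives the right answer as long as the referenced slot still holds the matrix the vertex was transformed with.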

I tried, without success, to get my ucode working with nusys. I am not sure it is feasible without the nusys source code.

I did not finish the implementation of point lighting. I may want to come back to this one day, though it is unlikely.

I am happy to have reached this point. I learnt a lot, and I definitely feel that I can look at this work as an achievement.

But, and I have known this for a long while now, this work has been done only for my own purposes. I cannot release proprietary code, even rewritten. Additionally, no N64 homebrew developer will work with libultra anymore (cf. Portal).

This is why I decided to stop working on the ucode and leave it as is. It is sad, but it is the right move.

It does not mean that all is wasted. I will continue to play with the N64, but after having spent so much time at low level, I would prefer working at a "high" level. I will likely try to develop homebrew with Libdragon and Tiny3D. It may take a while until I arrive at something interesting, but this is where I want to go.

Sunday, 31 December 2023

Progress made in 2023

Hi Folks! 😎

It has been a long time since the last update, and on the eve of the New Year I would like to share with you the recent progress made on my customized F3D ucode.

1. A further 80 bytes of DMEM saved.

2. Management of the overlays by the microcode has been changed.

3. Implementation of an RDP display list command. RDP commands are no longer processed one by one in the RSP; they are simply retrieved from RDRAM and shipped directly to the RDP.

4. Implementation of a “conditional” RDP display list command, conditioned on triangles actually being displayed on the screen.

5. Successful implementation of a triangle strip command. Such a command can generate numerous triangles (more than 100 is possible), in the appropriate order (see https://en.wikipedia.org/wiki/Triangle_strip). With this triangle strip command I have also implemented a primitive restart feature and the possibility to trigger RDP display list commands in the middle of triangle generation.

6. For the time being, I removed the precise clip ratio macro until it can be investigated further and, where necessary, debugged.
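As an illustration of point 5, the expansion of a strip into triangles with a restart marker can be sketched in plain C as below. This is not the microcode's implementation: the marker value `STRIP_RESTART`, the `Tri` type and the function name are assumptions made for the example.

```c
#include <stddef.h>
#include <stdint.h>

#define STRIP_RESTART 0xFF   /* hypothetical "restart primitive" marker */

typedef struct { uint8_t a, b, c; } Tri;

/* Expand a strip of vertex indices into triangles.
 * Returns the number of triangles written to out (at most max_out). */
static size_t expand_strip(const uint8_t *idx, size_t n, Tri *out, size_t max_out)
{
    size_t count = 0;
    size_t run = 0;            /* vertices seen since the last restart */
    uint8_t p0 = 0, p1 = 0;    /* the two previous indices of the run */

    for (size_t i = 0; i < n; i++) {
        uint8_t v = idx[i];
        if (v == STRIP_RESTART) {   /* start a fresh strip */
            run = 0;
            continue;
        }
        if (run >= 2 && count < max_out) {
            /* Every other triangle swaps its first two indices so that
             * all triangles keep the same winding order. */
            if ((run & 1) == 0)
                out[count] = (Tri){p0, p1, v};
            else
                out[count] = (Tri){p1, p0, v};
            count++;
        }
        p0 = p1;
        p1 = v;
        run++;
    }
    return count;
}
```

With the index list 0, 1, 2, 3, restart, 4, 5, 6 this yields the triangles (0,1,2), (2,1,3) and (4,5,6): after the first triangle, each extra index produces one more triangle, which is where the saving over separate triangle commands comes from.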

What about the future?

I will work next year hopefully on the following points:

1. nusys compatibility: I noticed that ALL homebrew using the original N64 SDK uses nusys so I really need to ensure that the microcode is compatible with it.

2. Remove the necessity to block CPU yielding for some commands.

3. Run tests to find the appropriate balance between the vertex buffer and the RDP buffer in DMEM.

4. Point lighting

And likely much more!

I will open a new beta session when 1. and 2. are done. Testers are welcome!

Wishing you all a happy new year 2024!😀

See ya!

PS: please also feel free to check out the incredible developments of the F3DEX3 microcode and of the Libdragon SDK. N64 homebrew is about to become fantastic in the coming years!

https://github.com/HackerN64/F3DEX3

https://github.com/DragonMinded/libdragon

Tuesday, 27 September 2022

Beta Tests of Phase 1 over – Public Release

It has been a while since I updated my blog. Of course, real life takes up a good part of the spare time I have for my projects, but I am quite stubborn and keep working on them :)

The beta testing of the microcode in its current state is now considered over. I could not find many testers, so most of the tests were actually done on my own.

I would say that it was quite a good exercise, as I got to document, at least at a high level, the way the new functions work. By converting various demos from the original SDK to the microcode, I was able to spot some minor but still important issues in my code, affecting texturing and lighting for instance.

I also pushed my DMEM optimization further, and I now have a 40-vertex cache available, with color and texture coordinate buffers of the same size. It is already better than F3DEX2. I still believe that there should be a way to do much better. There is an issue related to RSP yielding from/to the CPU, but I have not been able to spot its root cause (yet).

Now that I have written up the development of the microcode, I have decided to share the binary (not the source) of the microcode publicly, in the hope that some nice yet adventurous N64 programmers will share their feedback and even their wishes with me. For those who would like to test it, you may find more information on the N64brew Discord channel.

I will now focus on the next phase of implementation. More to come!

Tuesday, 22 March 2022

DMEM Optimization

It has been a long time since I updated this blog. I had to take a pause, as real life is sometimes quite busy, and then I came back to my little microcode hobby.

Here, therefore, is a short update on the latest developments.

DMEM optimization

After having spent quite some time on a point lighting implementation without being satisfied with it, I decided to move on to DMEM optimization. DMEM is the only memory the RSP actually works on, and its size is only 4 KB. So every byte counts to provide the best performance and avoid multiple DMA accesses.

Clearly the original Fast3D ucode made very poor use of DMEM. For instance, it has a buffer of only 16 vertices, but an output buffer for 6 LLE triangles (6*176 bytes = 1056 bytes) waiting to be sent to the RDP for processing. This was one of the first changes made by F3DEX, the successor of Fast3D: increasing the vertex cache to 32. This was done simply by shrinking the output buffer to only 176 bytes.

On multiple occasions, I took the opportunity to save DMEM when transforming the ucode. For instance, the geometry mode is not stored in DMEM, MOVEWORD DMEM addresses are stored in the GBI command, lighting data was reduced to 16 bytes, etc. I really cannot count the number of changes that reduced DMEM usage, but a lot of effort was made in this respect.

Yet even though DMEM space was saved, the overall DMEM organization was not reshaped and the saved space was not used for anything else. I spent the last month reorganizing DMEM to make the best use of the space. Though it looks like an easy task, it actually takes quite some effort to ensure that all data in DMEM stays aligned with the IMEM instructions and GBI macros.
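The arithmetic behind this trade can be written down as a small C sketch. The 4 KB DMEM size, the 176-byte triangle and the 6-versus-1 triangle output buffers come from the text above; the 40 bytes per transformed vertex is an ASSUMPTION used only to show the kind of calculation involved, and all names are invented.

```c
enum {
    DMEM_BYTES        = 4096, /* 4 KB of DMEM, as stated in the post */
    TRI_BYTES         = 176,  /* one output triangle, as stated in the post */
    FAST3D_OUT_TRIS   = 6,    /* Fast3D output buffer, in triangles */
    F3DEX_OUT_TRIS    = 1,    /* F3DEX output buffer, in triangles */
    ASSUMED_VTX_BYTES = 40    /* ASSUMPTION: size of one transformed vertex */
};

/* Size of an RDP output buffer holding the given number of triangles. */
static int out_buffer_bytes(int tris) { return tris * TRI_BYTES; }

/* Share of DMEM taken by a block, in whole percent (rounded down). */
static int percent_of_dmem(int bytes) { return bytes * 100 / DMEM_BYTES; }

/* DMEM released by going from the Fast3D buffer to the F3DEX one. */
static int bytes_freed(void)
{
    return out_buffer_bytes(FAST3D_OUT_TRIS) - out_buffer_bytes(F3DEX_OUT_TRIS);
}

/* How many vertices of a given size fit in a block of bytes. */
static int vertices_that_fit(int bytes, int vtx_bytes) { return bytes / vtx_bytes; }
```

The Fast3D output buffer alone takes about a quarter of DMEM, and shrinking it to one triangle releases 880 bytes. Under the 40-byte assumption that is room for 22 vertices, which would be enough for the 16 extra vertices of F3DEX.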

I have finally finished this task.

Results:

* Output vertex buffer doubled (from 16 to 32 vertices)

* Input data buffer for vertices doubled (from 16 to 32 vertices)

* Colors/normals buffer increased from 16 to 50 colors/normals

* S&T buffer increased from 16 to 48 S&T

Unlike F3DEX, I have 372 bytes (2 triangles) for the RDP output buffer. It would be good to see whether there is a difference in performance between the F3DEX buffer and the one in my microcode.

Two TRI commands

I have also decided to have two TRI commands:

* One "standard", where the command only selects the vertex buffer indices of the 3 points composing the triangle.

* One "special", where the command picks up not only the indices of the 3 vertices in the vertex buffer but also the colors and the S&T in their respective buffers.

In order to do so, for the "standard" TRI command, I store the color and S&T indices in the last two bytes of the vertex input data (which were only padding before).

Some would ask: why such a weird double implementation?

The fact is that for some of the upcoming features (see the list below), picking arbitrary indices for colors or S&T would be either tricky or inadequate.

A good example would be point lighting where the vertex coordinates have an influence on the colors.
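The difference between the two commands can be sketched in plain C as follows. The field names and the `Corner` type are hypothetical; the only things taken from the text are that the "standard" command carries vertex indices only, with the color and S&T indices stored in two former padding bytes of the vertex, while the "special" command names all three indices per corner.

```c
#include <stdint.h>

/* Hypothetical vertex input: the two trailing bytes used to be padding. */
typedef struct {
    int16_t x, y, z;
    uint8_t color_idx;   /* index into the colors/normals buffer */
    uint8_t st_idx;      /* index into the S&T buffer */
} VtxIn;

/* What one triangle corner finally refers to. */
typedef struct { uint8_t vtx, color, st; } Corner;

/* "Standard" TRI: only vertex indices; color and S&T follow from the vertex. */
static void tri_standard(const VtxIn *vtx_buf, const uint8_t v[3], Corner out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i].vtx   = v[i];
        out[i].color = vtx_buf[v[i]].color_idx;
        out[i].st    = vtx_buf[v[i]].st_idx;
    }
}

/* "Special" TRI: every corner names its own color and S&T entries. */
static void tri_special(const uint8_t v[3], const uint8_t c[3],
                        const uint8_t st[3], Corner out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i].vtx   = v[i];
        out[i].color = c[i];
        out[i].st    = st[i];
    }
}
```

In the standard form a vertex always brings the same color along, which suits cases like point lighting where the color depends on the vertex position. The special form lets the same vertex be reused with different colors or texture coordinates.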

Tests

As previously mentioned, I would now like to get feedback from some users on the current implementation of the microcode. It is far from being the final product, but with so many changes already made, I believe it is essential to receive feedback.

Fortunately, I have already found a few people willing to test my work. Another bunch would be more than welcome. Should you be interested (and skilled enough to create homebrew with libultra), feel free to contact me (via a comment on this blog or via the N64brew Discord).

After a successful test phase, here is my remaining (!) to-do list for the next phase:

* Point lighting

* Triangle strips

* "Texture rectangle" done by means of a triangle

* RDP display list (RDP commands are not processed one by one in the RSP; they are simply retrieved from RDRAM and shipped directly to the RDP).

* Save the average Z of a triangle in an unused part of the RDP triangle command and send the data to RDRAM; the CPU would then sort the data by Z and send it to the RDP. We could get rid of the Z buffer this way.

* For special TRIs, check first whether they are trivially rejected before executing the rest of the command. Potentially to be done for the standard TRI command as well.

* Rewrite the organization of the microcode and the overlays. Currently there is no IMEM space left!

* Rejection instead of clipping as an option (using the precise clip ratio macro)

* Finally, port the ucode to libdragon and get rid of any reference to libultra

* Last but not least: name the microcode 🙂
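The Z-sorting item in the list above could look like this on the CPU side. It is only a sketch of the idea under stated assumptions: the `TriRef` record, the `cmd_offset` field and the convention that a larger Z means farther away are all hypothetical, and nothing here reflects an actual RDP command layout.

```c
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical record written to RDRAM for each triangle. */
typedef struct {
    int32_t  avg_z;       /* average Z of the three vertices */
    uint32_t cmd_offset;  /* where the triangle command sits in RDRAM */
} TriRef;

/* Average depth of a triangle, as the microcode might store it. */
static int32_t average_z(int32_t z0, int32_t z1, int32_t z2)
{
    return (z0 + z1 + z2) / 3;
}

/* qsort comparator: farther triangles (larger Z, by assumption) come first. */
static int farther_first(const void *a, const void *b)
{
    const TriRef *ta = a;
    const TriRef *tb = b;
    return (ta->avg_z < tb->avg_z) - (ta->avg_z > tb->avg_z);
}

/* Order triangles back to front so that nearer ones are drawn last. */
static void sort_back_to_front(TriRef *tris, size_t n)
{
    qsort(tris, n, sizeof *tris, farther_first);
}
```

Drawing in this order is the classic painter's approach: it avoids the Z buffer, but sorting by one average value per triangle cannot resolve triangles that intersect or overlap cyclically.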

Still a lot of updates to come therefore!!

See ya!

Saturday, 3 April 2021

Microcode Optimization: IMEM (Part 8)

Today I am very happy to finally give you news on the recent developments of the microcode. In the last months, as previously mentioned, I tackled lighting and reflection mapping. I am not totally done with it yet, but with all the changes already carried out, it is worthwhile to share the current status.

I) Change the way the inverse transpose of the 3x3 part of the model matrix is loaded into the RSP

In order to compute lighting, you have to use the inverse transpose of the 3x3 part of the model matrix (which I will call the 3x3 matrix from now on) to get the light direction in model coordinates.

a) In Fast3D, the model matrix is loaded first and the data is then rearranged using special RSP instructions (LTV, STV) designed for this purpose. It requires some scratch DMEM space and about 16 instructions to get one 3x3 matrix into the RSP.

b) In F3DEX2, the 3x3 matrix is loaded directly from the DMEM location where the model matrix is stored. There is no need to load the model matrix and then transpose it, and no scratch DMEM space is necessary. Again, it provides one 3x3 matrix.

My implementation is different, as my goal is also to implement point lighting, which is totally missing from the original F3D. Point lighting requires both the model matrix and the 3x3 matrix, and this for each vertex to be processed.

I do underline “for each vertex”: with directional lighting, until you get a new matrix or new lights, you may reuse the same computed light directions in model coordinates for every vertex, meaning that the 3x3 matrix has to be used only once for such computations.

So point lighting requires the model matrix and the 3x3 matrix at the same time. On top of that, the RSP vector unit (VU) registers hold 8 lanes of 16 bits, which is used to process two vertices at once. While for F3D one 3x3 matrix was sufficient to compute a limited number of light directions in model coordinates, this cannot be the case when processing point lighting per vertex. It means that the VU must hold the model matrix and the 3x3 matrix twice, duplicated on the same vectors.

It took me significant time to find a good solution for those requirements. After loading the model matrix twice (as in Fast3D), it takes me only 22 instructions to get the two 3x3 matrices as desired. To do so, I used an RSP instruction, VMRG with the VCC register, in a different way than usually intended.

Additionally, I have replicated the way F3DEX2 normalizes the light direction in model coordinates, which is more efficient than the Fast3D approach.
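The reason directional lighting needs the 3x3 matrix only once can be shown with a small plain-C sketch (not RSP code). It assumes the 3x3 part is a pure rotation, in which case multiplying by its transpose is the same as multiplying by its inverse. With scaling in the matrix the result would also need renormalizing, which is left out here. `Mat3`, `Vec3` and the function names are invented for the example.

```c
typedef struct { float m[3][3]; } Mat3;
typedef struct { float x, y, z; } Vec3;

/* world = M * model (column-vector convention), M being the 3x3 part. */
static Vec3 mul3(const Mat3 *a, Vec3 v)
{
    Vec3 r = {
        a->m[0][0] * v.x + a->m[0][1] * v.y + a->m[0][2] * v.z,
        a->m[1][0] * v.x + a->m[1][1] * v.y + a->m[1][2] * v.z,
        a->m[2][0] * v.x + a->m[2][1] * v.y + a->m[2][2] * v.z
    };
    return r;
}

/* Multiply by the transpose of M: for a pure rotation this brings a
 * world-space light direction back into model coordinates. */
static Vec3 mul3_transposed(const Mat3 *a, Vec3 v)
{
    Vec3 r = {
        a->m[0][0] * v.x + a->m[1][0] * v.y + a->m[2][0] * v.z,
        a->m[0][1] * v.x + a->m[1][1] * v.y + a->m[2][1] * v.z,
        a->m[0][2] * v.x + a->m[1][2] * v.y + a->m[2][2] * v.z
    };
    return r;
}

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }
```

Lighting a normal then takes one dot product with the precomputed model-space light direction, instead of transforming every normal to world space first: `dot3(mul3(&m, n), light_world)` and `dot3(n, mul3_transposed(&m, light_world))` give the same value for a rotation matrix. Point lighting breaks this shortcut because the light direction depends on each vertex position.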

II) Separate lighting computations from vertex processing

As explained, lighting and reflection mapping computations are done per vertex. While this may be fine with directional lights, with point lighting it would mean reloading the model and 3x3 matrices over and over, for each vertex, as the RSP is unable to hold all that data while also performing the rest of the vertex processing tasks (i.e. multiplying vertices by the ModelViewProjection (MVP) matrix, clipping, fog, etc.). To avoid this, lighting computations have to be done separately. This means breaking the Vtx structure of gbi.h, where vertex coordinates, colors/normals and texture coordinates were tied together.

There are many advantages in doing so:

* You may load vertices, colors/normals and texture coordinates flexibly, in separate buffers.

* You may avoid duplication of data (same color, same normal, same texture coordinates).

* You can do computations more efficiently on those buffers.

There are of course some inconveniences:

* You need new structures for colors/normals and texture coordinates. As the RSP vectors are 8 lanes of 16 bits, the structures have to hold 2 colors/normals or texture coordinates. The structures must also be 64-bit aligned because of the RSP DMA engine.

* GBI compatibility is broken. N64 programmers have to rewrite a portion of their code.

* Flexibility comes with more complexity for the programmers (more GBI macros).

Many customized microcodes have made such a change, at least partially (e.g. Perfect Dark or Indiana Jones).
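A possible shape for such paired structures, sketched in C11, is shown below. The type names and field layouts are hypothetical and are not the structures actually used by the microcode. The sketch only illustrates the two constraints stated above: two entries per structure, and 64-bit (8-byte) alignment for DMA.

```c
#include <stddef.h>
#include <stdint.h>

typedef struct { uint8_t r, g, b, a; } Rgba;   /* also usable as a normal */
typedef struct { int16_t s, t; } TexCoord;

/* Two entries per structure, aligned on 8 bytes for the RSP DMA engine. */
typedef struct { _Alignas(8) Rgba c[2]; } ColorPair;
typedef struct { _Alignas(8) TexCoord st[2]; } STPair;

_Static_assert(sizeof(ColorPair) == 8, "a color pair must fill one 64-bit word");
_Static_assert(sizeof(STPair) == 8, "an S&T pair must fill one 64-bit word");
_Static_assert(_Alignof(ColorPair) == 8, "color pairs must be 64-bit aligned");
_Static_assert(_Alignof(STPair) == 8, "S&T pairs must be 64-bit aligned");

/* Number of pair structures needed to hold n individual entries. */
static size_t pairs_needed(size_t n) { return (n + 1) / 2; }

/* Entry i lives in pair i / 2, slot i % 2. */
static Rgba color_at(const ColorPair *buf, size_t i)
{
    return buf[i >> 1].c[i & 1];
}
```

One consequence of pairing is that an odd number of colors still occupies a whole extra structure, so `pairs_needed(5)` is 3.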

So I had to create a new command to compute lighting and texture coordinates, along with multiple GBI macros.

I took advantage of having two 3x3 matrices (instead of one) to compute light directions in model coordinates nearly 2 times faster. I separated directional lighting computation entirely from texture coordinate computation in order to avoid many unnecessary computations done by F3D. While the speed of lighting computation should remain more or less the same, that of S&T computation for reflection mapping has improved.

Of course, I also took care to optimize the DMEM space used for those computations.

In the original Fast3D, each light structure used 32 bytes. In F3DEX2, it uses 24 bytes. In my implementation, it uses only 16 bytes. To achieve this, I had to change the way some libultra functions return data (guLookAtHilite, guLookAtReflect, guPosLight, guPosLightHilite). I also noticed something that could be optimized in those functions (and in some others), which I have implemented: according to the MIPS 4200 documentation, the SQRT and DIV CPU instructions are slow, so where possible it is better to use a fast inverse square root instead. (Thinking about it, it is noticeable that reflection mapping was not used in many games, likely because it was too demanding.)
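As an illustration of replacing SQRT and DIV with a fast inverse square root, here is the widely known bit-trick approximation with one Newton-Raphson step, used to normalize a vector. The post does not say which approximation was actually used, so this is a stand-in for the idea and not the modified libultra code. The names are invented.

```c
#include <stdint.h>
#include <string.h>

/* Approximate 1/sqrt(x) for x > 0 without any sqrt or divide instruction. */
static float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t bits;
    float y;

    memcpy(&bits, &x, sizeof bits);      /* reinterpret the float's bits */
    bits = 0x5f3759dfu - (bits >> 1);    /* initial guess */
    memcpy(&y, &bits, sizeof y);
    y = y * (1.5f - half * y * y);       /* one Newton-Raphson refinement */
    return y;
}

typedef struct { float x, y, z; } Vec3f;

/* Normalize with three multiplies instead of one sqrt and three divides. */
static Vec3f normalize_fast(Vec3f v)
{
    float inv = fast_rsqrt(v.x * v.x + v.y * v.y + v.z * v.z);
    Vec3f r = { v.x * inv, v.y * inv, v.z * inv };
    return r;
}
```

With a single refinement step the result is accurate to a fraction of a percent, which is usually acceptable for light directions that end up quantized to 8 bits anyway. A second Newton-Raphson step can be added where more precision is needed.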

Now, obviously, you have to link vertices, output colors and output texture coordinates. I could have decided to use the 2 remaining bytes of the vertex structure, currently padding, to do so. However, I chose to do it at triangle level, as it is at this level that control should be given to the graphic designer. It also helps to avoid unnecessary duplication of data in the buffers. However, care has to be taken when advanced lighting features such as reflection mapping or point lighting are used, where everything is interdependent at vertex level rather than triangle level.

It means that I had to rewrite the G_TRI command entirely. Additionally, the G_VTX commands had to be adapted, as lighting and texture coordinate computations are no longer done by this command.

And voilà!

Now I will focus on point lighting and on what I wanted to create from the very beginning of this project: a circular buffer to store transformed vertices and generate triangle strips.

Side note: I will soon update my previous article, as I continued to make some interesting changes to G_BRANCH_Z and G_CULLDL.

HELP NEEDED: After implementing point lighting and triangle strips and optimizing DMEM, it will be time to test the microcode. I am calling for people (with sufficient programming skills, of course) to help out. It can be done in different ways: adapting N64 SDK demos, adapting already written homebrew games, writing new test demos using the new features of the microcode, etc. Contact me on the N64brew Discord channel. Thanks!