Today I am very happy to finally share news on the recent developments of the microcode. Over the last few months, as mentioned previously, I tackled lighting and reflection mapping. I am not totally done with it yet, but with all the changes already carried out, it is worthwhile to share the current status.
I) Change the way the inverse of the transpose of the 3x3 part of the model matrix is loaded into the RSP
In order to compute lighting, you have to use the inverse of the transpose of the 3x3 part of the model matrix (from now on simply called the 3x3 matrix) to get the light directions in model coordinates.
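As a rough illustration in C (not the microcode itself), the sketch below brings a world-space light direction into model coordinates. It assumes libultra's row-vector convention (v' = v·M) and a 3x3 part that is a pure rotation, so that its inverse equals its transpose; the names `Vec3` and `light_to_model` are hypothetical.

```c
/* Hypothetical 3-component vector used only for this sketch. */
typedef struct { float x, y, z; } Vec3;

/*
 * m is the upper-left 3x3 part of the model matrix in libultra's row-vector
 * convention (world = model * m). Assuming m is a pure rotation, its inverse
 * is its transpose, so each model-space component is the dot product of one
 * row of m with the world-space direction w.
 */
static Vec3 light_to_model(const float m[3][3], Vec3 w)
{
    Vec3 out;
    out.x = m[0][0] * w.x + m[0][1] * w.y + m[0][2] * w.z;
    out.y = m[1][0] * w.x + m[1][1] * w.y + m[1][2] * w.z;
    out.z = m[2][0] * w.x + m[2][1] * w.y + m[2][2] * w.z;
    return out;
}
```

For a model rotated 90° about Z (rows (0,1,0), (-1,0,0), (0,0,1)), a light along world +Y comes out along model +X, so N·L can then be evaluated with the untransformed model-space normals.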
a) In Fast3D, the model matrix is first loaded and its data are then rearranged using special RSP instructions (LTV, STV) intended to help with this. It requires some scratch DMEM space and about 16 instructions to end up with one 3x3 matrix in the RSP.
b) In F3DEX2, the 3x3 matrix is loaded directly from the DMEM area where the model matrix is stored. There is no need to load the model matrix and then transpose it, and no scratch DMEM space is necessary. Again, this provides one 3x3 matrix.
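The sketch below shows, in C, what reading the 3x3 part in place means for libultra's fixed-point `Mtx` format, where the 16 integer halves of the elements are stored first and the 16 fractional halves after them. `FixedMtx`, `mtx_elem` and `load_3x3` are hypothetical names, and the struct mirrors that layout in host byte order rather than the N64's big-endian memory.

```c
#include <stdint.h>

/* Hypothetical mirror of libultra's fixed-point Mtx layout: element
   (row, col) has its integer half at int_part[row * 4 + col] and its
   fractional half at frac_part[row * 4 + col]. */
typedef struct {
    int16_t  int_part[16];
    uint16_t frac_part[16];
} FixedMtx;

/* Recombine one element into s15.16 fixed point. */
static int32_t mtx_elem(const FixedMtx *m, int row, int col)
{
    int idx = row * 4 + col;
    return (int32_t)(((uint32_t)(uint16_t)m->int_part[idx] << 16) |
                     m->frac_part[idx]);
}

/* Read only the upper-left 3x3 part; row 3 (translation) and column 3
   are never touched. */
static void load_3x3(const FixedMtx *m, int32_t out[3][3])
{
    for (int r = 0; r < 3; r++)
        for (int c = 0; c < 3; c++)
            out[r][c] = mtx_elem(m, r, c);
}
```

For example, an integer half of -1 (0xFFFF) combined with a fractional half of 0x8000 recombines to -0.5 in s15.16.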
My implementation is different, as my goal is also to implement point lighting, which is totally missing in the original F3D. Point lighting requires both the model matrix and the 3x3 matrix, and this for each vertex to be processed.
I underline “for each vertex” because with directional lighting, until you get a new matrix or new lights, you can reuse the same computed light directions in model coordinates for every vertex to be processed, meaning that the 3x3 matrix only has to be used once for these computations.
So point lighting requires having the model matrix and the 3x3 matrix at the same time. On top of that, the RSP vector unit (VU) registers hold 8 lanes of 16 bits, which are used to process two vertices at once. While in F3D one 3x3 matrix was sufficient to compute a limited number of light directions in model coordinates, this cannot be the case when processing point lighting per vertex. It means that the VU must hold both the model matrix and the 3x3 matrix twice, duplicated within the same vector registers (one copy per vertex).
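To make the lane layout concrete, here is a C model of one VU register as 8 signed 16-bit lanes, with a matrix row duplicated so that lanes 0-3 serve one vertex and lanes 4-7 the other. `VReg`, `dup_row` and `mul_two` are hypothetical, and real VU fixed-point scaling and accumulator behaviour are simplified to a plain 16-bit clamp.

```c
#include <stdint.h>

/* Model of a 128-bit RSP vector register: 8 lanes of 16 bits. */
typedef struct { int16_t lane[8]; } VReg;

/* Put one 3x3 matrix row in both halves: lanes 0-2 and 4-6 (3 and 7 unused). */
static VReg dup_row(const int16_t row[3])
{
    VReg v = {{ row[0], row[1], row[2], 0, row[0], row[1], row[2], 0 }};
    return v;
}

/* Compute a*M in lanes 0-3 and b*M in lanes 4-7 at once, where rows[k]
   holds row k of M duplicated. Component k of a or b is broadcast per half,
   similar to a VU element selector. Results are clamped to 16 bits. */
static VReg mul_two(const VReg rows[3], const int16_t a[3], const int16_t b[3])
{
    int32_t acc[8] = {0};
    for (int k = 0; k < 3; k++)
        for (int i = 0; i < 8; i++)
            acc[i] += (int32_t)(i < 4 ? a[k] : b[k]) * rows[k].lane[i];

    VReg out;
    for (int i = 0; i < 8; i++)
        out.lane[i] = (int16_t)(acc[i] > 32767 ? 32767
                              : acc[i] < -32768 ? -32768 : acc[i]);
    return out;
}
```

One multiply-accumulate pass per matrix row thus serves both vertices, which is why the duplicated layout pays off despite the extra setup.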
It took me a significant amount of time to find a good solution for these requirements. After loading the model matrix twice (as in Fast3D), it takes only 22 instructions to get the two 3x3 matrices as desired. To do so, I used an RSP instruction, VMRG, together with the VCC register, in a different way than usually intended.
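For readers unfamiliar with VMRG, the C model below captures its basic per-lane behaviour: each destination lane comes from the first source when the corresponding VCC compare bit is set, and from the second source otherwise. It leaves out accumulator side effects, the bit-to-lane mapping (bit i for lane i) is an assumption of this model, and it is not the 22-instruction sequence itself.

```c
#include <stdint.h>

/* Model of a 128-bit RSP vector register: 8 lanes of 16 bits. */
typedef struct { int16_t lane[8]; } VReg;

/* Simplified VMRG: vd[i] = (VCC bit i set) ? vs[i] : vt[i]. */
static VReg vmrg(VReg vs, VReg vt, uint8_t vcc)
{
    VReg vd;
    for (int i = 0; i < 8; i++)
        vd.lane[i] = ((vcc >> i) & 1u) ? vs.lane[i] : vt.lane[i];
    return vd;
}
```

With a mask such as 0x0F, for instance, one register can be assembled from the lower half of one source and the upper half of another, the kind of lane shuffling that could help when building duplicated matrices.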
Additionally, I replicated the way F3DEX2 normalizes the light directions in model coordinates, which is more efficient than Fast3D.
II) Separate lighting computations from vertex processing
As explained, lighting/reflection mapping computations are done per vertex. While this may be fine with directional lights, with point lighting it would mean reloading the model and 3x3 matrices over and over, for each vertex, because the RSP cannot hold all this data while also performing the rest of the vertex processing tasks (i.e. multiplying vertices by the ModelViewProjection (MVP) matrix, clipping, fog, etc.). To avoid this, lighting computations have to be done separately. This means breaking the Vtx structure from gbi.h, in which vertex coordinates, colors/normals and texture coordinates are tied together.
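To illustrate what breaking the structure means, the C sketch below reproduces the interleaved `Vtx_t` layout from gbi.h and scatters it into separate position, texture-coordinate and color/normal buffers. The `VtxPos` record and the `split_vertices` helper are hypothetical; note that a position-only record keeps 2 bytes of padding.

```c
#include <assert.h>
#include <stdint.h>

/* Vtx_t as declared in libultra's gbi.h: everything interleaved, 16 bytes. */
typedef struct {
    short          ob[3];  /* x, y, z */
    unsigned short flag;
    short          tc[2];  /* s, t texture coordinates */
    unsigned char  cn[4];  /* r, g, b, a (or nx, ny, nz, a when lighting) */
} Vtx_t;

/* Hypothetical position-only record: 6 bytes of coordinates + 2 of padding. */
typedef struct {
    int16_t x, y, z;
    int16_t pad;
} VtxPos;

static_assert(sizeof(Vtx_t) == 16, "interleaved gbi.h vertex");
static_assert(sizeof(VtxPos) == 8, "position-only vertex");

/* Scatter n interleaved vertices into three separate buffers. */
static void split_vertices(const Vtx_t *in, int n, VtxPos *pos,
                           int16_t (*st)[2], uint8_t (*cn)[4])
{
    for (int i = 0; i < n; i++) {
        pos[i].x = in[i].ob[0];
        pos[i].y = in[i].ob[1];
        pos[i].z = in[i].ob[2];
        pos[i].pad = 0;
        st[i][0] = in[i].tc[0];
        st[i][1] = in[i].tc[1];
        for (int k = 0; k < 4; k++)
            cn[i][k] = in[i].cn[k];
    }
}
```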
There are many advantages to doing so:
- You can flexibly load vertices, colors/normals and texture coordinates into separate buffers.
- You can avoid duplicating data (same color, same normal, same texture coordinates).
- You can do computations more efficiently on those buffers.
There are of course some drawbacks:
- You need new structures for colors/normals and texture coordinates. As RSP vectors are 8 lanes of 16 bits, each structure has to hold 2 colors/normals or 2 texture coordinates. These structures also have to be 64-bit aligned because of the RSP DMA engine.
- GBI compatibility is broken: N64 programmers have to rewrite a portion of their code.
- Flexibility comes with more complexity for programmers (multiplication of GBI macros).
Many customized microcodes have made such a change, at least partially (e.g. Perfect Dark or Indiana Jones).
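A possible shape for such structures, sketched in C11: each record covers two vertices and is aligned to 8 bytes so that arrays of them can be DMA'd directly. The type names and field packing (for example 4 bytes per normal, with one byte of padding) are hypothetical, not the layouts used by this microcode.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical pair records: each holds the data of two vertices and is
   forced to 8-byte (64-bit) alignment for the RSP DMA engine. */
typedef struct { _Alignas(8) uint8_t rgba[2][4]; } ColorPair;    /* 2 x RGBA */
typedef struct { _Alignas(8) int8_t  n[2][4];    } NormalPair;   /* 2 x (nx, ny, nz, pad) */
typedef struct { _Alignas(8) int16_t st[2][2];   } TexCoordPair; /* 2 x (s, t) */

static_assert(sizeof(ColorPair) == 8 && _Alignof(ColorPair) == 8, "ColorPair");
static_assert(sizeof(NormalPair) == 8 && _Alignof(NormalPair) == 8, "NormalPair");
static_assert(sizeof(TexCoordPair) == 8 && _Alignof(TexCoordPair) == 8, "TexCoordPair");

/* Number of pair records needed for n vertices (an odd count rounds up,
   leaving the second half of the last record unused). */
static int pairs_for(int n) { return (n + 1) / 2; }
```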
So I had to create a new command to compute lighting and texture coordinates, along with multiple GBI macros.
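N64 display-list commands are 64-bit words, usually handled as two 32-bit words with the opcode in the top byte of the first one; a GBI macro for a new command essentially packs its parameters into that shape. The C sketch below shows the idea with a purely hypothetical field layout (count and first index in the first word, a segmented buffer address in the second); `GfxWords` and `pack_cmd` are illustrative names, not those of this microcode.

```c
#include <stdint.h>

/* One display-list command: two 32-bit words. */
typedef struct { uint32_t w0, w1; } GfxWords;

/* Hypothetical packing: opcode | count | first index in w0, address in w1. */
static GfxWords pack_cmd(uint8_t opcode, uint8_t count, uint8_t first,
                         uint32_t seg_addr)
{
    GfxWords g;
    g.w0 = ((uint32_t)opcode << 24) | ((uint32_t)count << 16) |
           ((uint32_t)first << 8);
    g.w1 = seg_addr;
    return g;
}
```

The existing macros in gbi.h follow the same principle using the `_SHIFTL` helper, which masks each field to its width before shifting it into place.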
I took advantage of having two 3x3 matrices (instead of one) to compute light directions in model coordinates nearly twice as fast. I entirely separated the directional lighting computation from the texture coordinate computation in order to avoid many unnecessary computations done by F3D. While the lighting computation speed should remain more or less the same, the S&T computation for reflection mapping has been improved.
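The separation can be pictured as two independent routines, sketched in C with floats for readability: one only accumulates directional lighting (ambient plus clamped N·L terms, one color channel shown), the other only generates S and T by projecting the normal onto two axes and remapping from [-1, 1] to [0, 1], roughly in the spirit of sphere-map reflection. The names, the single-channel simplification and the exact remap are assumptions of this sketch.

```c
/* Hypothetical 3-component vector used only for this sketch. */
typedef struct { float x, y, z; } Vec3;

static float dot3(Vec3 a, Vec3 b) { return a.x * b.x + a.y * b.y + a.z * b.z; }

/* Directional lighting only: light directions are already normalized and in
   model coordinates. Back-facing lights (N.L <= 0) contribute nothing. */
static float light_channel(Vec3 n, const Vec3 *dirs, const float *cols,
                           int nlights, float ambient)
{
    float c = ambient;
    for (int i = 0; i < nlights; i++) {
        float d = dot3(n, dirs[i]);
        if (d > 0.0f)
            c += d * cols[i];
    }
    return c > 1.0f ? 1.0f : c;
}

/* Texture coordinate generation only: project the normal onto two axes
   and remap each result from [-1, 1] to [0, 1]. */
static void texgen(Vec3 n, Vec3 axis_s, Vec3 axis_t, float *s, float *t)
{
    *s = 0.5f + 0.5f * dot3(n, axis_s);
    *t = 0.5f + 0.5f * dot3(n, axis_t);
}
```

Because neither routine depends on the other, a command that only needs lighting never pays for texgen, and vice versa.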
Of course, I also worked on optimizing the DMEM space used for those computations. In the original Fast3D, each light structure used 32 bytes. In F3DEX2, it uses 24 bytes. In my implementation, it uses only 16 bytes. For this, I had to change the way some libultra functions return data (guLookAtHilite, guLookAtReflect, guPosLight, guPosLightHilite). I also noticed something that could be optimized in those functions (and some others as well), which I have implemented: according to the MIPS 4200 documentation, the SQRT and DIV CPU instructions are slow, so where possible it is better to use a fast inverse square root instead. (Thinking about it, it is noticeable that reflection mapping was not used in many games, likely because it was actually too demanding.)
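For reference, the widely known fast inverse square root trick is sketched below in C, together with a normalization that uses one reciprocal square root and three multiplications instead of a square root and three divisions. This is the generic technique, not necessarily the exact variant used in the modified libultra functions; after one Newton-Raphson step its relative error stays below about 0.2%.

```c
#include <stdint.h>
#include <string.h>

/* Fast inverse square root: bit-level initial guess, then one
   Newton-Raphson refinement. memcpy avoids strict-aliasing issues. */
static float fast_rsqrt(float x)
{
    float half = 0.5f * x;
    uint32_t i;
    memcpy(&i, &x, sizeof i);
    i = 0x5f3759dfu - (i >> 1);
    float y;
    memcpy(&y, &i, sizeof y);
    return y * (1.5f - half * y * y);
}

/* Normalize v with no sqrt and no division: one rsqrt, three multiplies. */
static void normalize3(float v[3])
{
    float r = fast_rsqrt(v[0] * v[0] + v[1] * v[1] + v[2] * v[2]);
    v[0] *= r;
    v[1] *= r;
    v[2] *= r;
}
```

The caller must avoid zero-length vectors, since the reciprocal square root of zero is not meaningful.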
Now, obviously, vertices, output colors and output texture coordinates have to be linked together. I could have used the 2 remaining bytes of the vertex structure, currently padding, to do so. However, I chose to do it at triangle level, as it is at this level that control should be given to the graphic designer. It also helps to avoid unnecessary duplication of data in buffers. However, care has to be taken when advanced lighting features such as reflection mapping or point lighting are used, where everything is interdependent at vertex level rather than triangle level.
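A hypothetical C sketch of triangle-level linking: each triangle stores, per corner, separate indices into the vertex, color and texture-coordinate buffers, so a flat-colored quad can reference a single color entry from both of its triangles. The record layout and names are illustrative assumptions, not the actual G_TRI format.

```c
#include <stdint.h>

typedef struct { int16_t x, y; } ScreenPos;   /* transformed vertex */
typedef struct { uint8_t r, g, b, a; } Rgba;  /* output color */
typedef struct { int16_t s, t; } TexST;       /* output texture coordinates */

/* Hypothetical triangle record: independent index per corner and buffer. */
typedef struct {
    uint8_t vtx[3];
    uint8_t col[3];
    uint8_t tex[3];
} TriLink;

/* Everything needed downstream for one corner. */
typedef struct { ScreenPos p; Rgba c; TexST st; } Corner;

/* Resolve a triangle's three corners from the separate buffers. */
static void resolve_tri(const TriLink *tri, const ScreenPos *pos,
                        const Rgba *cols, const TexST *tcs, Corner out[3])
{
    for (int i = 0; i < 3; i++) {
        out[i].p  = pos[tri->vtx[i]];
        out[i].c  = cols[tri->col[i]];
        out[i].st = tcs[tri->tex[i]];
    }
}
```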
This means that I had to entirely rewrite the G_TRI command. Additionally, the G_VTX command had to be adapted, as lighting and texture coordinate computations are no longer done by this command.
And voila!
Now I will focus on point lighting and on what I wanted to create from the very beginning of this project: a circular buffer to store transformed vertices, and the generation of triangle strips.
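Since this part is still to be done, the following is only a generic C sketch of the two ideas, not the planned implementation: transformed vertices are pushed into a fixed-size ring whose oldest slots get overwritten, and a strip emits one triangle per new vertex after the second, swapping two corners on every other triangle so the winding stays consistent. `RING_SIZE`, `VtxRing`, `ring_push` and `push_strip` are hypothetical, and the sketch does not check that a slot still referenced by a pending triangle is never overwritten.

```c
#include <stdint.h>

#define RING_SIZE 8  /* hypothetical number of transformed-vertex slots */

typedef struct {
    int16_t  xy[RING_SIZE][2]; /* e.g. screen x, y of transformed vertices */
    unsigned head;             /* total vertices pushed so far */
} VtxRing;

/* Store a vertex in the next slot (wrapping around) and return the slot. */
static unsigned ring_push(VtxRing *r, int16_t x, int16_t y)
{
    unsigned slot = r->head % RING_SIZE;
    r->xy[slot][0] = x;
    r->xy[slot][1] = y;
    r->head++;
    return slot;
}

/* Push n strip vertices and write one triangle (as ring slots) per vertex
   after the second. Odd triangles swap their first two corners so that all
   triangles keep the same winding. Returns the number of triangles. */
static int push_strip(VtxRing *r, const int16_t (*verts)[2], int n,
                      unsigned (*tris)[3])
{
    unsigned prev2 = 0, prev1 = 0;
    int count = 0;
    for (int i = 0; i < n; i++) {
        unsigned cur = ring_push(r, verts[i][0], verts[i][1]);
        if (i >= 2) {
            tris[count][0] = (i & 1) ? prev1 : prev2;
            tris[count][1] = (i & 1) ? prev2 : prev1;
            tris[count][2] = cur;
            count++;
        }
        prev2 = prev1;
        prev1 = cur;
    }
    return count;
}
```

The appeal of a strip is that each triangle after the first costs a single new vertex, and the ring lets recently transformed vertices be reused without recomputing them.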
Side note: I will soon update my previous article, as I have continued to make some interesting changes to G_BRANCH_Z and G_CULLDL.
HELP NEEDED: After implementing point lighting and triangle strips and optimizing DMEM, it will be time to test the microcode. I am therefore calling for people (with sufficient programming skills, of course) to help out. This can be done in different ways: adapting the N64 SDK demos, adapting already written homebrew games, writing new test demos using the new features of the microcode, etc. Contact me about this on the N64brew Discord channel. Thanks!