Now that it is finally possible to create or modify a
microcode fully compatible with the Nintendo SDK, I started considering what
would be the best approach to further tame the N64 RSP.
For the time being, it is quite obvious that writing an entirely new
graphics microcode from scratch is still beyond my skills, so I decided
that optimizing and enhancing the original Fast3D microcode would be a good
exercise.
So for the last two months, I changed bits of the Fast3D
source code here and there and learnt various things about the organization of
the code and the way to program the RSP.
It quickly appeared that the microcode was written with a
very theoretical approach, that is, without trying to make the most of the
limited resources of the RSP.
A few ideas also came to me about what could be implemented
differently, and from there I finally set up a high-level plan.
The philosophy behind it would be as follows:
1. Save as much space as possible in IMEM/DMEM with little impact on
performance.
2. Implement missing features to match some slightly
more "modern" versions of OpenGL, by which Fast3D seems to
be inspired.
After some investigation, I came up with the following:
1. As the Texture Rectangle RDP command is composed of 4 words,
it would be natural to include such a command in the display list without
splitting it into 3 commands as Fast3D does.
2. In order to save DMEM space, either get rid of some
constants (e.g. the OpenGL offset, Newton's iteration constants, the DRAM address mask,
"reserved" DMEM space; include DMEM addresses directly in GBI
commands rather than loading them from a DMEM address, etc.) or reduce their
size (e.g. segments, light data structure, etc.).
3. Implement an optional double DMA buffer for some
immediate commands. Data would be retrieved from RDRAM while the RSP deals
with another DMEM buffer.
4. Implement a circular buffer to store transformed vertices
and generate triangle strips, with a primitive restart function. It would
also be possible to use triangle lists, triangle fans or, where possible, quads. As
the vertex buffer would be reduced dramatically in size, the number of vertices
contained in the buffer should normally be high. It would still be good
to keep an indexed triangle implementation where possible.
5. Save IMEM, as some commands could be implemented
differently (Set/ClearGeometryMode, MoveWord, MoveMem, etc.).
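To make item 4 more concrete, here is a minimal Python sketch of how a triangle strip with primitive restart can be expanded into individual triangles. It only illustrates the general technique and is not RSP code; the restart marker value 0xFF and the function name are assumptions chosen for the example.

```python
RESTART = 0xFF  # hypothetical restart marker, not an actual GBI value


def strip_to_triangles(indices, restart=RESTART):
    """Expand a triangle strip (with primitive restart) into a triangle list."""
    triangles = []
    run = []  # vertex indices received since the last restart
    for idx in indices:
        if idx == restart:
            run = []  # start a new, independent strip
            continue
        run.append(idx)
        if len(run) >= 3:
            a, b, c = run[-3:]
            # Every second triangle of a strip has reversed winding,
            # so swap its first two vertices to keep a consistent order.
            if len(run) % 2 == 0:
                a, b = b, a
            triangles.append((a, b, c))
    return triangles


if __name__ == "__main__":
    print(strip_to_triangles([0, 1, 2, 3, RESTART, 4, 5, 6]))
```

Only the last two vertices of the current strip are needed to build the next triangle, which is what makes a small circular vertex buffer attractive for this kind of primitive.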
For a first customization, it seems ambitious enough, but
having done various tests, I do hope that it can be achieved successfully.
The next step will be much more concrete, going directly into the
intricate details of the microcode.
Finally, feel free to share other ideas you may have as comments,
yet please understand that no extreme modifications are to be done (e.g.
Z sorting, involvement of the CPU in the tasks, etc.). Thanks!
Factor 5 used to be one of the best game development studios at the
very end of the 90’s and in the early 2000’s. Its programmers, coming from the
German hacking and demo scene, were very talented and, thanks to a
collaboration with both Nintendo and LucasArts, were able to deliver
stunning graphical achievements on the N64 and GameCube. (https://en.wikipedia.org/wiki/Factor_5)
At the very end of the commercial life of the N64, Indiana Jones and the
Infernal Machine and Star Wars Episode I: Battle for Naboo were
released, too late to win the favor of the critics and of a public
already focused on the next generation of consoles, meaning that the
technical achievements of those games unfortunately went
unnoticed.
GlideN64, like its predecessor Glide64, has mainly been developed as a
High Level Emulation (HLE) plugin. A microcode is a library of high
level macro commands (high level in comparison with processor op-codes).
The Low Level Emulation (LLE) way is to run these commands as actual hardware would,
instruction by instruction. HLE implements what the commands in that library
do as fast as possible, without running the actual command’s code.
As explained by Sergey in his blog a few years ago (http://gliden64.blogspot.com/2014/11/a-word-for-hle.html),
a few games could not be emulated graphically in HLE as the documentation
necessary to do so was unavailable. Yet a few years ago Sergey and I
started reverse engineering and implementing custom graphics
microcodes.
When we started our journey to decipher the one used by Indiana Jones
and the Infernal Machine and Star Wars Episode I: Battle for Naboo, we
could only be impressed by how well written and optimized it was, and
though we are unable to understand its intricacies in detail, we do
believe that its prowess deserves to be documented for posterity.
Contrary to what Factor 5 might suggest, the graphics microcode of both Indiana Jones and Battle for Naboo was NOT
developed from scratch. It was obviously developed from the source code
of the F3DEX microcode supplied by Nintendo in its SDK, which is just
an optimized version of the Fast3D microcode used by Super Mario 64. Of
course most of the graphics commands were rewritten, along with many major
additions (the microcode is about twice as large as the
other ones), but its core is quite similar to that of the very first game of
the console.
What mainly remains is the way input vertices are transformed into a
specific output data structure stored in a buffer within the Data Memory
(DMEM) of the RSP (Reality Signal Processor). Those structures may
then be selected by a triangle command, formatted into a low level triangle
command and then used by the Reality Display Processor (RDP). The code
used for this is comparable to the code used in the F3DEX
microcode, whereas it has no counterpart in the code that other types
of microcode (Turbo3D, ZSort) use for such an operation.
Further evidence of this is that the headers of the commands
(e.g. 0xBC for the MoveWord command, 0xBF for the TRI1 triangle command,
etc.) are the same as in the F3DEX microcode.
An interesting consequence is that none of the N64
microcodes was actually ever written from scratch by 3rd party developers, but
only by SGI or by Nintendo. However, it is true that only around 20% of
the code is somehow related to F3DEX, meaning that the latter was used
as a working environment in which to develop the new commands, filled with plenty
of unique features.
I/ A few outstanding features compared to the F3DEX microcode
A. DMEM address is directly integrated into the graphic command
DMEM is a small 4 KB memory which the RSP microcode uses to store the
data retrieved from RDRAM. The data is used by the RSP to do the
computations according to the specification of a command.
In the F3DEX microcode, the DMEM address is not part of the command
itself; the command rather carries an index. The actual DMEM address this
index refers to is stored in DMEM itself when the microcode boots, hidden from
end users.
In this so-called Indy/Naboo microcode, the DMEM address is set directly
in the command. This unique feature helps to save some space in
DMEM for other purposes.
Let’s take a simple example:
The 0xBC command is a MoveWord command, which simply stores a 32-bit
word in DMEM. As you may be aware, moved words will be used later by
other commands as parameters.
0xBC00014C
0x00000715
0xBC is the header of the command, 0x14C is the DMEM address and 0x00000715 is the word stored at this address.
For comparison, in F3DEX:
0xBC000406
0x002D4CD0
0x06 is a mere flag which is used by the microcode to locate a
base DMEM address in DMEM. Such a flag can vary depending on the type of word
to be stored (e.g. a segment RDRAM address). 0x0004 is added to this
base DMEM address. In our example the base DMEM address is 0x160, so the
target is 0x160 + 0x004 = 0x164 and the word 0x002D4CD0 will be stored at
DMEM address 0x164.
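The two encodings can be compared side by side with a small Python sketch. It is only an illustration of the decoding described above: the function names are made up, the 12-bit mask for the Indy/Naboo DMEM address is an assumption derived from the 4 KB size of DMEM, and the F3DEX base table is modelled as a dictionary holding only the 0x06 -> 0x160 entry taken from the example.

```python
def decode_moveword_indy(w0, w1):
    """Indy/Naboo style: the DMEM address is carried by the command itself."""
    if w0 >> 24 != 0xBC:
        raise ValueError("not a MoveWord command")
    dmem_addr = w0 & 0xFFF  # assumption: low 12 bits, since DMEM is 4 KB
    return dmem_addr, w1


# Hypothetical stand-in for the base addresses F3DEX keeps in DMEM.
# Only the entry used in the example above is filled in.
F3DEX_BASE_TABLE = {0x06: 0x160}


def decode_moveword_f3dex(w0, w1, base_table=F3DEX_BASE_TABLE):
    """F3DEX style: an index selects a base DMEM address, then an offset is added."""
    if w0 >> 24 != 0xBC:
        raise ValueError("not a MoveWord command")
    index = w0 & 0xFF             # 0x06 in the example
    offset = (w0 >> 8) & 0xFFFF   # 0x0004 in the example
    return base_table[index] + offset, w1


if __name__ == "__main__":
    print([hex(v) for v in decode_moveword_indy(0xBC00014C, 0x00000715)])
    print([hex(v) for v in decode_moveword_f3dex(0xBC000406, 0x002D4CD0)])
```

The Indy/Naboo variant needs no lookup at all, which is where the DMEM saving comes from.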
B. Geometry mode is a mere RSP register
In the F3DEX microcode, the geometry mode is stored in DMEM and each
time it has to be checked or changed, it has to be loaded into an RSP
register, changed and then stored back.
In the Indy/Naboo microcode, there is simply an entire RSP register
dedicated to the geometry mode. This means that you have one register
fewer, but also an easy way to manage the geometry mode while saving a
bit of space in DMEM.
C. Playing with display lists
In the Indy/Naboo microcode, the way display lists are managed is rather different from F3DEX.
Execution of multiple display lists at the same hierarchical level
In the original F3DEX, display lists are relatively straightforward,
in that they are, as in OpenGL, managed hierarchically. You first
call a new display list, then a nested display list, which can in turn
call another nested display list.
As in Star Wars: Rogue Squadron, a clever mechanism was developed to
“jump” from one display list to another. At the beginning of
each display list, instead of the first immediate command to
be executed, there is simply an RDRAM address which may be called at the
end of the display list by the immediate command 0xB5. In such a case, a
display list will be retrieved from the specified RDRAM address. This
loaded display list also contains an RDRAM address at its start. So
instead of nesting the display lists, you move from one display list to
another display list at the same level of hierarchy.
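The chaining can be pictured with the following Python sketch. It is a toy model and not the actual microcode logic: RDRAM is a dictionary, each display list is reduced to its leading "next" address plus a list of command headers, and the function name is hypothetical.

```python
def run_chained_display_lists(rdram, start_addr):
    """Walk display lists linked through their first word and the 0xB5 command."""
    executed = []
    addr = start_addr
    while addr is not None:
        display_list = rdram[addr]
        next_addr = display_list["next"]  # first word of the display list
        addr = None                       # stop unless a 0xB5 is met
        for header in display_list["commands"]:
            if header == 0xB5:            # "jump" to the stored RDRAM address
                addr = next_addr
                break
            if header == 0xB8:            # end of display list
                break
            executed.append(header)
    return executed


if __name__ == "__main__":
    rdram = {
        0x1000: {"next": 0x2000, "commands": [0xBC, 0xB5]},
        0x2000: {"next": 0x3000, "commands": [0xBF, 0xB8]},
    }
    print([hex(h) for h in run_chained_display_lists(rdram, 0x1000)])
```

Note that no return stack is involved: the loop replaces the current display list with the next one, which is exactly what distinguishes this from a nested call.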
Execution of a display list under condition
This mechanism is incredible. The MoveWord command (0xBC00058C) may
provide an RDRAM address. When running some immediate commands which may
lead to drawing some graphics (textured rectangles, triangles), BEFORE
they are actually drawn, a nested display list is retrieved from that
RDRAM address and executed. This is the case for the immediate RDP command
0xE4/0xE5, for non-rejected triangles, for particles. Such a mechanism
links the actual drawing of an RDP primitive to a display list which sets
the parameters of that drawing (e.g. Combine Mode, Other Modes, Texture
Image, Tile, etc.). And such a mechanism occurs ONLY when the primitive
is indeed to be drawn. So for instance, if a triangle is outside the viewport
and thus rejected, this display list for pixel pipeline setup will not
be executed. In this way the N64 RDP is only used where necessary!
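The ordering described above can be summed up in a few lines of Python. This is only a sketch of the control flow; the function and its callbacks are hypothetical, and treating an address of 0 as "no setup display list" is an assumption.

```python
def draw_if_visible(primitive, is_rejected, setup_dl_addr, run_display_list, emit_to_rdp):
    """Run the pixel pipeline setup display list only for primitives that get drawn."""
    if is_rejected(primitive):
        return False                     # nothing is sent to the RDP at all
    if setup_dl_addr != 0:               # assumption: 0 means no setup list
        run_display_list(setup_dl_addr)  # Combine Mode, Other Modes, Tile, ...
    emit_to_rdp(primitive)               # then the primitive itself
    return True


if __name__ == "__main__":
    log = []
    draw_if_visible("triangle", lambda p: False, 0x1000,
                    lambda a: log.append(("setup", hex(a))),
                    lambda p: log.append(("draw", p)))
    print(log)
```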
D. Clever memory segmentation
In F3DEX, the code manages the segmentation of the memory through the
usage of the MoveWord command. With this method, the number of segments
is limited to 16 and the segment words have to be stored in DMEM.
In the Indy/Naboo microcode, the segmentation of the memory is managed
differently, somewhat like the “jump” mechanism already explained for
the display lists. First, immediate command 0x02 provides an RDRAM
address. This is the RDRAM address of the first segment. The very first
word at this RDRAM address is the RDRAM address of the following
segment. The size of each segment is limited to 0x100 bytes and when an
immediate command requests data beyond the segment, the remaining data
is retrieved from the following segment, where again the first word
provides the RDRAM address of the following segment.
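A read that spills over into the next segment could look like the Python sketch below. It is a model under stated assumptions, not the microcode's actual routine: the link word is taken to be a big-endian 32-bit address counted inside the 0x100 bytes, a read that crosses the boundary is assumed to resume right after the link word of the next segment, and the function name is made up.

```python
import struct

SEGMENT_SIZE = 0x100  # from the text: each segment is limited to 0x100 bytes
LINK_SIZE = 4         # assumption: the link word is part of those 0x100 bytes


def read_chained(rdram, seg_addr, offset, length):
    """Read `length` bytes starting at `offset` in a segment, following the chain."""
    if not LINK_SIZE <= offset <= SEGMENT_SIZE:
        raise ValueError("offset outside the segment payload")
    out = bytearray()
    pos = offset
    while length > 0:
        chunk = min(length, SEGMENT_SIZE - pos)
        out += rdram[seg_addr + pos:seg_addr + pos + chunk]
        length -= chunk
        if length > 0:
            # The first word of a segment is the RDRAM address of the next one.
            (seg_addr,) = struct.unpack_from(">I", rdram, seg_addr)
            pos = LINK_SIZE
    return bytes(out)


if __name__ == "__main__":
    rdram = bytearray(0x300)
    rdram[0x000:0x004] = struct.pack(">I", 0x200)  # segment at 0x000 links to 0x200
    rdram[0x004:0x100] = b"\xAA" * 0xFC
    rdram[0x204:0x300] = b"\xBB" * 0xFC
    print(read_chained(rdram, 0x000, 0xFC, 8).hex())
```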
E. Intricate command structure
In F3DEX the size of a command is fixed at a mere two 32-bit words,
always with the same exact purpose.
In Indy/Naboo, the size of a command can be much larger. For instance,
the biggest immediate command of the microcode, which is used to
generate the terrain polygons from a height map, is composed of sixteen
32-bit words!
The very same immediate command may have its size vary according
to a flag within the command. For instance, the TRI command 0xB4 for
textured polygons is larger than the one for shaded polygons.
A command’s flag may also change its behavior. For example, TRI command
0xB4 with flag 0x06 is used for textured triangles with regular
textures, but flag 0x0E turns on texture coordinate generation for
reflective textures.
Sometimes the same command may have completely different purposes. For
instance, command 0x05 can be used to generate texture rectangles,
generate particles, generate terrain from a height map, or generate
specific vertices. Apart from generating graphics, those immediate
commands actually have nothing in common.
II/ Listing of the commands with high level explanation
Even if we were able to decipher the microcode, it does not mean that
all parameters of those commands are fully understood. Some, of
course, are quite obvious, but some are used in very large and complex
commands and therefore remain abstruse.
Command 0x01
The main purpose of this immediate command is to retrieve data from RDRAM and potentially use such data to compute new data.
The structure of the command is the following:
0x01ODDDLL
0xBBAAAAAA
0x01 Header of the command
O Option of the command: for the same DDD, the option can lead to another part of the code
DDD DMEM address where the data will be stored. The code ALSO checks
it and, depending on its value, routes execution to a specific portion of
code, which then routes again according to O.
LL Number of bytes to be retrieved from memory
BB A mere byte which can be used according to the route the
command takes after having retrieved the data (e.g. for the number
of lights)
AAAAAA When it is not zero, data is retrieved from this RDRAM address; when
it is 0x000000, data is retrieved from a segment.
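The field layout above translates into the following Python sketch. The class and function names are hypothetical, and the sample words in the usage example are made up purely to exercise the bit fields; they are not taken from a game.

```python
from typing import NamedTuple


class Command01(NamedTuple):
    option: int      # O
    dmem_addr: int   # DDD
    length: int      # LL, number of bytes to retrieve
    extra: int       # BB, meaning depends on the route taken
    rdram_addr: int  # AAAAAA, zero means "read from a segment"

    @property
    def uses_segment(self):
        return self.rdram_addr == 0


def decode_command01(w0, w1):
    """Split the two words 0x01ODDDLL / 0xBBAAAAAA into their fields."""
    if w0 >> 24 != 0x01:
        raise ValueError("not a 0x01 command")
    return Command01(
        option=(w0 >> 20) & 0xF,
        dmem_addr=(w0 >> 8) & 0xFFF,
        length=w0 & 0xFF,
        extra=(w1 >> 24) & 0xFF,
        rdram_addr=w1 & 0xFFFFFF,
    )


if __name__ == "__main__":
    # Made-up words: O=1, DDD=0x1A0, LL=0x40, BB=0x03, AAAAAA=0x123456
    print(decode_command01(0x0111A040, 0x03123456))
```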
Therefore the very same command 0x01 can be used for numerous scenarios, such as:
When O is 0, the command is like the F3DEX MoveMem command, meaning
that it simply retrieves data from RDRAM and stores it in DMEM.
When O is 1, the command is like the F3DEX VTX command, meaning that
vertices are retrieved from RDRAM to DMEM, transformed and then stored
into a buffer.
When O is 2, lighting calculations are performed. The command retrieves
data from RDRAM and computes colors for vertices. The colors are stored
into a buffer from which the triangle commands retrieve them later. The
lighting system is very complex, with many options.
When O is 3, the command sets the number of lights and potentially retrieves the light structure.
When O is 5, the command retrieves vertices from RDRAM to DMEM in a specific
way, transforms them and then stores them into a buffer. Such
an option concerns 2D graphics where triangles are used.
Command 0x02
It sets the RDRAM address of the 1st segment usable by command 0x01.
As already explained, the RDRAM address of the next segment is the 1st
word at the RDRAM address of the previous segment.
Command 0x05
This command generates primitives in various ways (from a mere texture
rectangle to particles to triangles used for terrain, etc.). The last
byte of the second word determines the command’s mode.
0x18 (2 words)
0x05XXXXXX
0xXXXXXX18
This sub-command transforms selected vertices by the Modelview matrix (0x010E403F) and stores them into a vertex buffer.
0x27 (6 words)
0x05XXXXXX
0xXXXXXX27
The sub-command generates vertices and some flags used by another 0x05 sub-command.
0x24 (10 words)
0x05XXXXXX
0xXXXXXX24
The sub-command generates particles (actually small texture
rectangles) from the vertices/flags obtained by previous 0x05
sub-command.
0x15 (8 words)
0x05XXXXXX
0xXXXXXX15
The sub-command generates a mere texture rectangle using a vertex as
center of such a rectangle. It is mainly used for explosions.
0x4F (2 words)
0x05XXXXXX
0xXXXXXX4F
This sub-command is a mere flag used by sub-command 0x0F to switch between its short version (4 words) and its long version (16 words).
0x0F (4 or 16 words)
0x05XXXXXX
0xXXXXXX0F
This sub-command is used to set parameters for sub-commands 0x09 and
0x0C. There is a short and a long version, activated by the above
sub-command.
0x09 and 0x0C (6 words)
0x05XXXXXX
0xXXXXXX0C
0x05XXXXXX
0xXXXXXX09
Both sub-commands generate the triangles for the ground in Star Wars
Episode I: Battle for Naboo (up to 32 per sub-command). The ground seems to be
generated from a colored height map.
As you can see, command 0x05 is a set of 8 sub-commands!
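The sub-command selection and the word counts listed above can be gathered into a small Python lookup. The names are hypothetical, and how the 0x4F flag is actually remembered by the microcode is not documented here, so the sketch simply takes the short/long choice for 0x0F as a parameter.

```python
# Word counts per 0x05 sub-command, as listed above (0x0F handled separately).
SUBCOMMAND_WORDS = {
    0x18: 2,   # transform vertices by the Modelview matrix
    0x27: 6,   # generate vertices and flags for particles
    0x24: 10,  # generate particles
    0x15: 8,   # texture rectangle centered on a vertex
    0x4F: 2,   # flag selecting the short or long 0x0F
    0x09: 6,   # ground triangles
    0x0C: 6,   # ground triangles
}


def command05_mode_and_size(w0, w1, long_0f=False):
    """Return (sub-command, size in 32-bit words) for a 0x05 command."""
    if w0 >> 24 != 0x05:
        raise ValueError("not a 0x05 command")
    mode = w1 & 0xFF  # last byte of the second word
    if mode == 0x0F:
        # Assumption: the caller tracks whether a 0x4F sub-command asked for
        # the long version.
        return mode, 16 if long_0f else 4
    return mode, SUBCOMMAND_WORDS[mode]


if __name__ == "__main__":
    print(command05_mode_and_size(0x05000000, 0x00000024))
    print(command05_mode_and_size(0x05000000, 0x0000000F, long_0f=True))
```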
Command 0x06
This command is the same as the gSPDisplayList command in F3DEX.
Command 0x07
This command is the same as the gSPBranchList command in F3DEX.
Command 0xB4 (4 or 8 words)
It is similar to the TRI2 command in F3DEX, except that texture
coordinates and indices into the color buffer are provided as the command’s
parameters.
Command 0xB5
This command triggers a “jump” to another display list previously set as the first word of the current display list.
Command 0xB6
This command is the same as the gSPClearGeometryMode command in F3DEX.
Command 0xB7
This command is the same as the gSPSetGeometryMode command in F3DEX.
Command 0xB8
This command ends a display list.
Command 0xB9
This command is similar to gSPSetOtherMode command in F3DEX (lower half)
Command 0xBA
This command is similar to gSPSetOtherMode command in F3DEX (higher half)
Command 0xBB
This command is similar to gSPTexture command in F3DEX.
Command 0xBC
This command is similar to the MoveWord command in F3DEX.
Command 0xBD
Only used in Battle for Naboo. It sets the OtherModes for the triangles used by the commands generating the ground.
Command 0xBE
Only used in Battle for Naboo. It sets the OtherModes (along with
immediate command 0xBD) and the texture coordinates for each triangle
generating the ground. In conjunction with parameters set by the 0x05
sub-commands 0x09 and 0x0C, it can also generate such triangles
(up to 32 of them).
Command 0xBF (4 or 8 words)
It is similar to the TRI1 command in F3DEX, except that texture
coordinates and indices into the color buffer are provided as the command’s
parameters.
Conclusion:
With such an advanced microcode, it is clear that not many N64
games can compete graphically with Indiana Jones and Star Wars Episode
I: Battle for Naboo.
As pictures speak louder than words, here are some stunning examples of these outstanding graphical feats.
The lighting is absolutely splendid in Indiana Jones and the Infernal Machine.
For instance, at the start of the level named Volcano, 12 lights are used at the same time!
In another level called Palawan Temple, splendid point lights are displayed.
Particles are used in many places without any slowdown in both
Indiana Jones and the Infernal Machine and Star Wars Episode I: Battle
for Naboo. One example, among many, would be the snow.
The terrain in Star Wars Episode I: Battle for Naboo is huge and without any fog.
Some great explosion effects are often shown in Star Wars Episode I: Battle for Naboo.
Splendid reflection mapping effects are sometimes used.
As you can see, both games are simply amazing for a Nintendo 64.
Bearing in mind that the games also use the MusyX audio microcode,
the console has most likely shown with them the best of what it could
offer. It also proves that skilled programmers can overcome the
limitations of a machine far beyond expectations!