dev, computing, games

📅December 17th, 2024

A while ago I wrote this DirectX Raytracing (DXR) test application. It looks like this

The application is hosted on GitHub, here: https://github.com/clandrew/vapor

Time passed, and the app rotted to the point where it doesn't run anymore.

This blog post explains some background about the app, why it rotted, and how I fixed it.

About this app

This is a toy application that's an homage to a popular album cover.

The Helios statue mesh I threw together in 3DS Max; besides that, there are some textured cubes. There's Direct2D-rendered text, interop'd with 11on12 to get textured onto D3D12 cube geometry. Nothing too crazy. A secondary ray draws a shadow on the floor. There's a raster-based screenspace effect you can toggle on and off.

I wrote it back when DXR support was brand new in Windows. Back then, it seemed fine.

Fast forward to recently: I cloned and built it again, and it just crashed on startup.

For context, I originally tested this application on an NVIDIA GeForce GTX 1070 (pre-native-RTX support). Nowadays I was testing it on an AMD Radeon RX 6900 XT.

What happened between then and now

Back when I wrote it, this application originally used the D3D12 Raytracing Fallback Layer. You can see some remnants of this in the application.

Quick side note about the fallback layer-- "isn't that just WARP?" The fallback layer is different from WARP, and it's also different from DXR(!) It shipped as a completely separate-but-very-similar API to DXR, with separate headers and everything, calling into the D3D12 API itself. Typically you have to recompile to use the fallback layer; you can't just swap in a different DLL or change some toggle at runtime or something. If you squint you'll see that a few parameters are different compared to DXR. The fallback layer implemented a DXR-like interface on top of compute workloads.

While WARP acts more like a driver, the fallback layer is more like middleware. And while WARP is all CPU, the fallback layer is agnostic to that question. In practice I usually used the fallback layer on top of the GPU, though.

Since the time this application was written, WARP was actually updated to support DXR.

And since the time this application was written, I updated the application itself to use DXR.

However, because of the timeline of when this was written versus the availability of actual DXR hardware, the application didn't get battle-tested on actual DXR nearly as much as it did on fallback layer. Since the fallback layer is a totally parallel implementation, you can get some difference of behavior and levels of strictness between it and actual DXR. Also, we have more varied and more baked implementations of DXR now compared to then.

So I suspected the rot came from a combination of the app changing (being ported from the fallback layer to DXR) and the underlying environment changing (more mature DXR implementations with more varied strictness and fault tolerance). This ended up being true.

Problem 1: Scratch resource is the wrong size

When I ran the application it just crashed silently.

To investigate this I did the first thing I always do, which is enable the SDK layers. This is validation on the CPU timeline.

It showed me

ID3D12CommandList::BuildRaytracingAccelerationStructure: pDesc->ScratchAccelerationStructureData + SizeInBytes - 1 (0x0000000301087cc7) exceeds end of the virtual address range of Resource (0x000002532BE82EC0:'UpdateScra', GPU VA Range: 0x0000000300f8f000 - 0x0000000300f9996f).  [ RESOURCE_MANIPULATION ERROR #1158: BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INVALID]

Basically, this showed there was a bug in the app where the scratch resource used for the acceleration structure update was the wrong size. Scratch resource sizes are platform- and situation-dependent, so it must have been that I 'got lucky' when this app was run before.

This was super simple to fix: I changed it to use the correct size reported by GetRaytracingAccelerationStructurePrebuildInfo().
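
The shape of the fix looks something like this. This is just a sketch, not the app's code verbatim; 'device' (an ID3D12Device5) and 'inputs' (the D3D12_BUILD_RAYTRACING_ACCELERATION_STRUCTURE_INPUTS used for the build) are assumed to already exist:

D3D12_RAYTRACING_ACCELERATION_STRUCTURE_PREBUILD_INFO prebuildInfo = {};
device->GetRaytracingAccelerationStructurePrebuildInfo(&inputs, &prebuildInfo);

// ScratchDataSizeInBytes covers the initial build; UpdateScratchDataSizeInBytes covers
// subsequent PERFORM_UPDATE builds. Sizing the scratch buffer to the larger of the two
// keeps one allocation valid for both, instead of hardcoding a guess.
UINT64 scratchSize = prebuildInfo.ScratchDataSizeInBytes;
if (prebuildInfo.UpdateScratchDataSizeInBytes > scratchSize)
{
    scratchSize = prebuildInfo.UpdateScratchDataSizeInBytes;
}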

Problem 2: Resource binding disagreement

The application still crashed, so at this point I enabled GPU-based validation. As of the time of writing, the SDK layers' GPU-based validation covers a lot of general, pipeline-agnostic scenarios (e.g., incorrect resource barriers, attempting to use unbound resources, accessing beyond the end of a descriptor heap), while it doesn't include much in the way of DXR-specific validation, so I wasn't betting on it showing a problem of that category.

When I ran GBV (GPU-based validation), it showed

DescriptorTableStart: [0], 
Descriptor Heap Index FromTableStart: [0], 
Descriptor Type in Heap: D3D12_DESCRIPTOR_RANGE_TYPE_UAV, 
Register Type: D3D12_DESCRIPTOR_RANGE_TYPE_SRV, 
Index of Descriptor Range: 0, Shader Stage: PIXEL, 
Root Parameter Index: [0], 
Draw Index: [0], 
Shader Code: PostprocessPS.hlsl(140,15-15), Asm Instruction Range: [0x22-0xffffffff], Asm Operand Index: [0], Command List: 0x000001CCFE6A2FC0:'Unnamed ID3D12GraphicsCommandList Object', Command List Type: D3D12_COMMAND_LIST_TYPE_DIRECT, SRV/UAV/CBV Descriptor Heap: 0x000001CCFE8479F0:'DescriptorHeapWrapper::m_descriptorHeap', Sampler Descriptor Heap: 0x000001CCFE8525D0:'m_samplerDescriptorHeap', Pipeline State: 0x000001CCFEA72720:'Unnamed ID3D12PipelineState Object',  [ EXECUTION ERROR #939: GPU_BASED_VALIDATION_DESCRIPTOR_TYPE_MISMATCH]

So this was happening not during the ray tracing, but in the raster pass that runs right after.

This was showing a disagreement between my shader and my app code. The shader declares something at register(t0), which should correspond to an SRV, but the resource bound there was a UAV.

Generally when there are disagreements like these, the behavior is undefined.

For example, a while ago I remember seeing a bug in a D3D11 application where the C++ said a resource was a Texture2DMS, while the shader code called it a Texture2D. This resource got bound and the shader did a sample from it. On some implementations, the driver would 'figure it out' and somehow find a way to treat it as a single-sampled resource. On others, you'd get device removed. The level of fault tolerance is really up to the GPU implementation. If it's your bug, ideally you can catch it proactively.

Again, I think I was ‘getting lucky’ with this before, where the underlying implementation could figure out what to do with the disagreement. Fast forward to today, the implementation I tried it on was strict.

Anyway, I fixed this by changing the resource binding to be SRV. Easy enough.
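
The shape of the fix, sketched out (not the app's verbatim code; 'device', 'resource', 'descriptorHandle', and the format are placeholders): the descriptor written into the heap for register(t0) has to be an SRV, so it gets created with CreateShaderResourceView rather than CreateUnorderedAccessView.

D3D12_SHADER_RESOURCE_VIEW_DESC srvDesc = {};
srvDesc.Format = DXGI_FORMAT_R8G8B8A8_UNORM; // placeholder format
srvDesc.ViewDimension = D3D12_SRV_DIMENSION_TEXTURE2D;
srvDesc.Shader4ComponentMapping = D3D12_DEFAULT_SHADER_4_COMPONENT_MAPPING;
srvDesc.Texture2D.MipLevels = 1;

// An SRV descriptor now matches what the shader declares at register(t0).
device->CreateShaderResourceView(resource, &srvDesc, descriptorHandle);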

Problem 3: Case of the missing geometry

After fixing the above things, the application runs and doesn't crash. That said, it doesn't yet have correct behavior.

It's supposed to look like

Instead, it looks like

The floor, the billboarded image, and the text are missing. It's a little strange, since this demo's acceleration structure contains 4 geometries (3 very simple ones and 1 more complicated one) and it's the simple ones that were missing.

As a quick check, I tried the app on WARP and the missing geometry did not repro with it. It also did not repro on NVIDIA. Therefore the problem looks specific to when the application is run on an AMD platform. It's likely the application is doing something incorrect and getting lucky on the other platforms, where AMD is strict. Whatever it is, it's not being caught by the SDK layers, so the next step is to narrow down the problem, probably with graphics debuggers.

As an educated guess, I first added some extra flushes (a UAV barrier on null) to rule out the possibility of a missing barrier. It made no difference, so that was ruled out.

Next I forced the closest hit shader to be dead simple, returning hardcoded red, and disabled raygen culling. For this application, the closest hit shader (CHS) does a bunch of stuff to evaluate color based on the material, then casts a secondary ray for the shadow. If simplifying the shaders like this showed the simple geometries in red, that would mean the problem was in the CHS or raygen.

The result looked like

Meaning, the problem was not in raygen or CHS, but something about the acceleration structure (AS).

As an additional step to narrow things down, I disabled updating of the AS, so the AS is only built once as the application launches. This made it so the scene doesn’t animate any more (normally the statue ‘floats’ up and down). If this were to fix it, it would tell me there’s a mistake in my updating of the AS. This too didn’t make a difference.

So the problem is not in the updating of the AS, but in the creation of the AS.

With that I took a closer look at the AS.

The latest public release of PIX (version 2409.23 at the time) actually showed an empty AS with NaN bounds:

further confirming something was wrong on that front.

To get more information about the BLAS I used AMD's public tool, Radeon Raytracing Analyzer (RRA).

The BLAS tab showed the visible mesh, and not the invisible ones, as expected:

In the “Geometries” tab, I saw something reassuring. All 4 of my geometries were there with the right primitive counts.

But the "Triangles" view is where things looked wrong. All triangles showed as coming from primitive ID 1, and none from the other primitive IDs:

This means something was going wrong with BLAS creation. Geometries with valid primitives are going in, and no triangles are coming out.

With that, I took a closer look at the actual descs being sent to the BLAS build.

On that front, I noticed that the order of the mesh loading seemed to matter. The mesh that works is the one that gets loaded first, at vertex buffer (VB) index 0. The subsequent meshes' data gets appended after it, and their index values get offset accordingly in the index buffer (IB). All 4 meshes share the same VB and IB. This clued me into something being wrong in how the descs were set up.

The problem ended up being this:

typedef struct D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC {
  D3D12_GPU_VIRTUAL_ADDRESS            Transform3x4;
  DXGI_FORMAT                          IndexFormat;
  DXGI_FORMAT                          VertexFormat;
  UINT                                 IndexCount;
  UINT                                 VertexCount;
  D3D12_GPU_VIRTUAL_ADDRESS            IndexBuffer;
  D3D12_GPU_VIRTUAL_ADDRESS_AND_STRIDE VertexBuffer;
} D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC;

The important field is VertexCount. The app was setting VertexCount to be the number of vertices needed for each mesh.

If you look at the DirectX Raytracing spec in the section for D3D12_RAYTRACING_GEOMETRY_TRIANGLES_DESC:

UINT VertexCount: Number of vertices (positions) in VertexBuffer. If an index buffer is present, this must be at least the maximum index value in the index buffer + 1.

VertexCount actually has a slightly different meaning from how the application was treating it: it's more of a 'limit' measured from the start of the vertex buffer, not just the number of vertices belonging to that desc. For example, if a mesh consisted of only 1 vertex at position 5000 in the vertex buffer, it needs a VertexCount of at least 5001, not 1. It's IndexCount that would be 1.
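
In other words, for meshes sharing one big VB/IB, the desc setup needs to look roughly like this. A sketch, not the app's verbatim code; the 'mesh' fields and formats here are made-up placeholders:

D3D12_RAYTRACING_GEOMETRY_DESC geometryDesc = {};
geometryDesc.Type = D3D12_RAYTRACING_GEOMETRY_TYPE_TRIANGLES;
geometryDesc.Triangles.IndexFormat = DXGI_FORMAT_R32_UINT;
geometryDesc.Triangles.VertexFormat = DXGI_FORMAT_R32G32B32_FLOAT;
geometryDesc.Triangles.IndexBuffer =
    indexBuffer->GetGPUVirtualAddress() + mesh.firstIndex * sizeof(UINT);
geometryDesc.Triangles.IndexCount = mesh.indexCount;
geometryDesc.Triangles.VertexBuffer.StartAddress = vertexBuffer->GetGPUVirtualAddress();
geometryDesc.Triangles.VertexBuffer.StrideInBytes = sizeof(Vertex);

// Wrong (what the app was doing): only the vertices belonging to this one mesh.
//     geometryDesc.Triangles.VertexCount = mesh.vertexCount;

// Right: the index values are relative to the start of the shared vertex buffer, so
// VertexCount has to reach up to the largest index this mesh uses.
geometryDesc.Triangles.VertexCount = mesh.maxIndexValue + 1;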

Once I changed VertexCount to agree with the spec, the missing geometry was fixed:

After fixing that 3rd and final problem, everything is working well again.

To download this application, see the Release posted here:

https://github.com/clandrew/vapor/releases

December 17th, 2024 at 5:30 pm | Comments & Trackbacks (0) | Permalink

📅October 4th, 2024

For me this has come up when people are describing or categorizing different computers, trying to compare capabilities or different things about them. They use terms like "8-bit computer" or "16-bit computer" or "32-bit computer" to categorize something. You get the gist. It's fine. Nothing wrong with that.

It does become a problem when we actively lie to ourselves and pretend like these are specific technical terms.

I've run into way too many 'smart' people who are deluded about this.

Bitness (of a computer) is a marketing term, not a technical term. It describes how you feel about a computer, what the vibe is, like what sort of capabilities it reminds you of, what time period it makes you think of. There is not some universal test, there is not some hard specific qualifier for bitness, and it is not a hard technical category. Not for the computer, and not even bitness of the CPU. It's useful for general conversation and marketing purposes, just not more beyond that.

If there were a technical definition for bitness of a computer, or say the CPU, what would it be?

One dude I talked to once said "well, it's technically pointer size". So what about the Super Nintendo? Its pointers are 3 bytes: low, high, bank. By that logic it'd be a 24-bit console, except literally nobody says that; everyone calls the SNES a "16-bit console". Or the classic example of the TurboGrafx-16, marketed as a "16-bit console", even though its CPU is 8-bit; only the graphics hardware and its data path are 16 bits.

Or consider the modern 64-bit Intel processors you use today. Do you think you have a full, viable 64-bit pointer? You don't; maybe you already know, only the lower 48 bits are viable. Check it yourself. On Intel computers available today you'll never have a program with two pointers that differ only in the upper 16 bits. In fact I had to debug something that squirreled away metadata in the upper 16 bits, and just stripped it off and fixed the pointer up anytime it had to dereference. This fat-pointer thing is something you can do today if you want, because we don't really have a full 64 bits of pointer.

I'm not hugely prescriptivist when it comes to language, like the point of words is so you can say things and have people understand you. So for defining words, I go with how the world uses it. In this case the world isn't agreeing on one consistent, technical category, so if you think there is one you're fighting a losing battle.

If you go by things like "pointer size" or "general purpose register width" to determine the bitness of a computer, sometimes there are choices. For the Super Nintendo or any 65816-based computer, not only does it have an 8-bit 6502-style legacy mode, it has an 8-bit native mode too, and programs are constantly switching back and forth between 8-bit and 16-bit register widths and addressing. So if you go by register width you could call it an 8-bit or a 16-bit computer. Or, say, for Win16 with the segmented memory model on Intel x86, you have near and far pointers, which are 16-bit and 32-bit respectively. Or for modern Intel-based CPUs you have the number of viable pointer bits versus the number of bits actually passed into a dereference. I know I'm mixing a bit of high and low level here. These are different, but I think fairly reasonable, interpretations of what it means to use a pointer.

I'm not saying any of this is a big deal; I'm just saying there can be choices. And usually, when there's a choice, you pick the bigger one. Because the bigger number sounds better. Because it's a marketing decision.

Describing a computer using "bitness" is super useful to get a general feel for the time period it came out. What order of magnitude of size is the system memory. What is the screen resolution and color depth. Could it have 3D acceleration. Could it have chiptune sound or something more.

It stops being useful if you have to write an emulator, write an operating system, write a device driver, write a compiler, even to write application code. Then, you'll want to know what's the actual behavior, what kind of pointers do you have, what can you do with them, probably width and interpretation of native types would be good to know.

Anyway, I did come across this other person, a different dude who was completely adamant that there was a hard technical definition of bitness of a computer. You don't want to be like this dude, since he couldn't tell me or articulate to anyone what it was-- he didn't know, he was just so certain that there was one.

October 4th, 2024 at 4:10 am | Comments & Trackbacks (0) | Permalink

📅June 17th, 2023

I debugged this problem. I'm writing up the process so I remember what happened in case I have to go back to it, and for if you're running into a similar problem and you're looking for ideas.

Symptom: Application starts, runs a little, then hangs mysteriously on the F256k.

First I narrowed down the repro. The application was originally written to do a bunch of things. I deleted a bunch of functionality to the point where it still reproed when simply trying to print text to the screen.


Step 1. Reproduce the problem on emulator

The repro was reliable on hardware, no issues there.

It's supposed to print the word 'Wormhole'. It stops at the letter 'W'.

What next?

How to get debuggability. Remote debugging F256 is on some people's minds and it's in the realm of possibility. There's a wired transport to a host computer, and allegedly a toolchain out there that could be made compatible with a bit of work. It's just not an established path at the time of this writing.

But what about the emulator? Fortunately, the Foenix emulator has a debugger built in. And while it's true a debugger is not strictly needed at all, it makes investigating a lot easier. So let's try running with the emulator.

To get the program compatible with emulator, I made a couple changes:

  • Generate a .hex file, since the emulator knows how to load those but not .bin files.
  • Fix up compile offsets so that the program copes with being loaded at offset 0, not offset 0x800, since with binary files you can specify where you want it loaded; with hex files you can't.
  • Ideally, refactor things so the same source can be built in either of the two modes.
  • Update the emulator to have compatibility with the 65816-based chip in the F256, instead of just the 6502, because that's the chip my unit has. This might not be strictly needed since my application code was all 6502-based at the time and runs in emulation mode, but it couldn't hurt to make sure I'm really comparing apples to apples and using an accurate emulation path. Plus it's support I would need later. The commit for this is here.

After making the above changes, the repro works on emulator:


Step 2. What kind of hang is it?

After running in the emulator, this was easy to see: it was a hang due to hitting a BRK. Not a spin, not a deadlock, not an invalid instruction. Simple enough.

The location it reports for the BRK itself is zero, not where my app code was. So it's unclear how execution landed there. For a BRK on this emulator, I don't know that I necessarily trust the reported program counter. It's enough to know that it hit a BRK though.

This is a case where time-travel debugging would immediately tell you the answer. Unfortunately, we don't have time-travel debugging. Fortunately, we have the next best thing: transcript debugging in the emulator with support that I added in my fork here.


Step 3. Working backwards

Re-launch the application with the CPU log enabled.

If I have to debug a hang I always hope it's a hang that halts the CPU. In this case where CPU logging is getting used, it's nice and convenient that the transcript will simply end when the program hangs. No need to guesstimate when to end it and sift through a bunch of noise.

And we're in luck today since BRK halts the CPU.

So after it halts, stop the program, take the transcript files that got emitted and open them.

Looking in the transcripts, everything mostly looked normal. What was curious is the hang happened quite early, before the things I considered more "risky" were ever executed. Before calls to our interrupt handler. Before even enabling interrupts. Before any changes to/from native and emulation mode. None of the usual suspects.

In the transcript, the 'what' it was doing made sense, but not the 'why'. There was some code that looked correctly executed. We're inside a function. That function returns. The place it returns to is not code, it's not really anything.

Some kind of stack corruption? Since the transcript isn't a full state of the machine, we don't know the stack. It's possible.

To find out more, let's compare the transcript to the assembly result.

Why use the assembly result, not source code? Because the assembly result shows code bytes and locations, which is super useful for this category of problem.

Matching up offset .e0a4 with what immediately came before the BRK, we see that the problem happens when returning from the function ClearScreen.

In the program there's only 1 call of ClearScreen, and that's from Main. That call is supposed to push the return address on the stack and jump there.

For some reason, instead of returning up to 0xEF1F, the thing after ClearScreen's call site, we return up to 0x41E1. That must mean something on the stack was overwritten.

Restarting the program, we can break a little earlier, stepping one instruction into the call to ClearScreen. It looks like this:

You can see:

  • "Top: $01FF" means the top of the stack is at that location, and you can see the current SP, decremented, printed below.
  • Looking at that location in the Emulator Memory window, it shows 0xEF1F.
  • So the address-to-return-to is 0xEF1F. The bytes are swapped for endianness on this platform

Well that return-address makes sense. It's +2 from the call site which was 0xEF1D.

The value pushed to the stack by JSR is always supposed to be the call site + 2. In other words, it's one less than the address of the next instruction. See this reference.

Something must scramble or overwrite this value later. What is it?

There are two options for finding out.

  1. Single step in the debugger a few instructions. If the scrambling occurred toward the beginning of the function, this would catch it quickly.
  2. Set break-on-write in the debugger. At the time of writing this, the debugger doesn't support memory breakpoints. So instead, hack the emulator, put in a temporary change to do this behavior.

Bad news. I tried a little of #1, and we weren't lucky enough for the problem to occur early enough to show the answer.

No worries, so option #2 it is.

Add a change to the emulator that hooks on write, something like this:
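
(The actual change was a few lines in the emulator's memory-write path; here's the rough idea expressed as a C++ sketch, with every name hypothetical.)

#include <cstdint>
#include <cstdio>

// Hypothetical stand-ins for the emulator's real state.
static uint8_t  memory[0x10000];
static uint16_t programCounter;

void WriteByte(uint16_t address, uint8_t value)
{
    // Poor man's break-on-write: watch the stack bytes holding the return address.
    constexpr uint16_t kWatchedAddress = 0x01FE; // hypothetical slot of interest
    if (address == kWatchedAddress || address == kWatchedAddress + 1)
    {
        printf("Write of $%02X to $%04X at PC=$%04X\n", value, address, programCounter);
        // Could also trigger a debugger break here instead of logging.
    }
    memory[address] = value;
}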

This "poor man's conditional breakpoint" let me track the pattern of pushes and pops to the stack. The process of doing this was a bit troublesome because I'd see different behavior based on whether I had the debugger attached or not (terrific!).

That said, I saw it hit, where a subsequent function call pushed a new return address to the same location. So that points to a return not popping its address off the stack.

Now that we've confirmed that, the next step is to make sure it really can do the return in this repro. To do that I looked at the code bytes to make sure they're really in the binary.

Wait a minute! There's a 0xC0 (CPY) where there should be a 0x60 (RTS). The problem isn't just something scrambling the stack; something is overwriting code, and one is causing the other. It must be the bad, overwritten code that is scrambling the stack.

As a lazy effort, I looked in the assembly output for the address that gets scrambled, and there it was.

Those STZ and STAs are scrambling the return address and the byte after it, and then I noticed that Fn_E071 would scramble them further, reaching the pattern shown in the earlier memory window screenshot.

I didn't write this code to be self-modifying like this. I adapted this code from a sample, using some mix of tooling and manual disassembling. Well, there's the problem.

You see, $E074 used to point to data in the original version of this program. However, I inserted code, which throws the offset off. Because this memory was referenced by absolute address, not by a label, inserting code invalidates it.

Correcting the code to key off of a label:

And running the result

Success, you can see the whole text gets emitted and the rest of the demo runs as expected.

Patching the fix back into the whole application, and testing the whole thing on hardware:

That confirms the change above fixed the problem.

To build this demo, see the repository here. To run the demo, see binary releases here.

June 17th, 2023 at 2:55 am | Comments & Trackbacks (2) | Permalink

📅April 22nd, 2023

Do you remember DirectDraw? The DirectX 5 SDK disc came with a bunch of samples, including one called "Wormhole".

Looks like this:

How it works: despite the image looking animated, there's no change to the framebuffer data. It's all palette rotation. The sample comes with a bitmap specially chosen so that the colors rotate to produce this 'wormhole' animation.

If you want to try it yourself, load it up from a DirectX 5 SDK disc (it's on some other SDK version discs, as well). Or, you can find it on the Internet Archive here: https://archive.org/details/idx5sdk.

My project: ported this sample to C256 Foenix. (Update: I later also ported it to F256 Foenix.)

This is a language (C to 65816) and platform (Win32+DirectDraw to C256 Foenix + Vicky II) port.

Some of the challenges were:

  • Making sure the right bitmap with the right palette gets initialized. See, it's not sufficient to simply read RGB of the original bitmap and emit a new one that looks visually equivalent. The original bitmap's palette needs to be preserved. It contains "dead" colors- colors that aren't referenced by any one pixel as you view it, but are important to the rotation effect. I wrote a tool called BitmapEmbedder to take care of this.
  • Betting on how long, in terms of clock, the rotation effect would take to execute. I was bold and put it all in the VBLANK handler. Fortunately it fit, even though I didn't optimize for perf super aggressively. I had no idea whether it would fit. If it didn't, I would've had to pull a bunch of it out and synchronize it. And it would be easier to do that at the beginning, before it's all set up. I took the risk at the beginning that it would fit, and this paid off.
  • Having a loop that needed to be longer than the signed branch distance limit. I could have maybe added a "hop" to get back to the beginning of the loop. Instead I factored out a function for no reason other than to get past the limit. It doesn't make me feel great. Could be something to revisit later.

A bunch of other things worked well. Vicky II has a dedicated bitmap layer that you can cleanly copy to. I say cleanly because it was a lot easier to work with compared to the Apple II, and the SNES for that matter. There isn't any weird swizzling, interleaving, or holes. It was exactly compatible with a DirectDraw surface in terms of indexed color and surface size.

Result looks like: (comparison between the original and the port)

If you aren't familiar with the concept of palette rotation:

Palette rotation is a visual effect made possible by storing image data in a compact way.

You might be familiar with not-very-compact ways to store image data. For each pixel, say, you store a red, green, and blue color value. Functionally that works, no worries. But consider the memory cost: even if each color channel is only two-thirds of a byte, each pixel will still take up two bytes. Or if each color channel is a byte, you're looking at three bytes. Or even four if you use alpha. The memory cost can really add up to more than you can afford.

There's a more compact way to store image data. You can store indexed color instead. For each pixel, store a key. The key is only 1 byte, not 4. It's a number from 0 to 255. When the computer displays the image on the screen, it will use that key to look up into a palette, or table of colors. In a way, this limits image quality, since you can only have an image with a low total number of colors (256). But you save a lot of memory. After all, each pixel takes up only one byte.
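
If it helps to see the idea spelled out, here's a minimal C++ sketch of what indexed color conceptually does (this is not Vicky II's implementation, just an illustration; the names are made up):

#include <cstdint>

struct Color { uint8_t r, g, b; };

// 256-entry palette: each 1-byte key stored in the framebuffer selects one of these.
Color palette[256];

// Conceptually, the display hardware resolves every pixel's key through the palette.
Color ResolvePixel(const uint8_t* framebuffer, int pixelIndex)
{
    uint8_t key = framebuffer[pixelIndex]; // only 1 byte stored per pixel
    return palette[key];                   // the full color comes from the table
}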

There are different configurations of key size affecting how many colors you can use at a time. You could sacrifice image quality to optimize for memory even more. Like anything there are tradeoffs. Having a key be one byte is a popular choice though, and this is supported on Vicky II.

Ordinarily, it'd cost a lot of perf to implement palette lookups yourself in software. "For each pixel, look up into the palette, assign a color..." It'd be so slow. Fortunately, indexed color is an industry-recognized idea that has built-in hardware acceleration on a ton of platforms, including on Vicky II. That's where the benefit really shines, so you don't have to worry.

Anyway, as you see with indexed color, there's indirection. Change one entry in the palette, a simple one-byte change, and it could affect half your image or more. Because of the indirection used with indexed color, an effective way to animate things can be to not animate the image data at all, but to simply make a small change to the palette. The palette has way fewer bytes of data, yet the capacity to change how the whole image looks.

Palette rotation can also be called color cycling. There are some beautiful artworks using color cycling to convey water, snow, or other effects. For example, see this snow effect from this demo page (not my page):

The grid in the lower right shows the palette being changed.

Or this one, with rain:

The Wormhole sample uses the idea of palette rotation to achieve an animation effect. It only copies the original bitmap data once on application start. It never touches it again.

Every VBLANK, the handler only updates the palette. And although it does a lot of manipulation to the palette-- there's four loops, iterating over various parts of it, copying entries around-- it can still be way less expensive than the alternative way of animating things: iterating over every pixel in the bitmap. This way, you can exploit the compactness of the image format to get a performance benefit too.
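
For a rough idea of what one rotation step looks like, here's an illustration in C++ (not the port's actual 65816 code, and the palette entry layout is simplified):

#include <cstdint>

struct PaletteEntry { uint8_t r, g, b, a; }; // simplified; the real LUT layout may differ

// Rotate entries [first, last] by one slot. Run once per VBLANK, this changes which
// colors the unchanged pixel keys resolve to, which is what animates the image.
void RotatePaletteRange(PaletteEntry* palette, int first, int last)
{
    PaletteEntry saved = palette[last];
    for (int i = last; i > first; --i)
    {
        palette[i] = palette[i - 1];
    }
    palette[first] = saved;
}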

Source code available here:

https://github.com/clandrew/wormhole/blob/main/vickyii/wormhole.s

April 22nd, 2023 at 5:41 pm | Comments & Trackbacks (0) | Permalink

📅February 17th, 2023

Played with this a bit recently. I'm saving my build for later. To download this build of the SerenityOS project, visit here:

Download link (180 MB) - SerenityOS_2_16_2023_43f98ac.zip

To use the provided launcher, QEMU emulator is required. To download QEMU for Windows, visit here:

https://www.qemu.org/download/#windows.

About this operating system: this is an open-source operating system created from the ground up by a group of hobbyists. Under the hood, it's similar to Unix. Cosmetically, it looks a lot like Windows 95 though.

Build contents:

  • _disk_image (1.98 GB, compresses really well)
  • Kernel (54.1 MB)
  • Launch.bat (2 KB)
  • LICENSE (2 KB)
  • Prekernel (47 KB)
  • README.txt (1 KB)

The Launch.bat script is the same as what is produced by SerenityOS's "launch the operating system in QEMU emulator" script, with one change. I removed

hostfwd=tcp:127.0.0.1:2222-10.0.2.15:22

because it conflicts with some Windows TCP exclusion range. You can add it back if you don't use anything that conflicts.

For more information about SerenityOS, visit https://serenityos.org.

For source code, visit https://github.com/SerenityOS/serenity.

For the project's license file, see LICENSE.TXT included with the build, or view it at

https://raw.githubusercontent.com/SerenityOS/serenity/master/LICENSE.

This is an x86-64 build created off of commit hash 43f98ac6e1eb913846980226b2524a4b419c6183 on 2/12/2023.
The build was produced in a WSL environment using Ubuntu 22.04.1 LTS distribution.

Today most people use SerenityOS by running it in an emulator. From what I saw, that's the road better traveled. More specifically they run it in QEMU emulator, emulating a very low spec x64 based computer. I did see there's a subgroup of their community getting it to run on hardware. It seems like they wrote the software stack first and then tried to find hardware to fit it, doing things in that order.

February 17th, 2023 at 6:46 am | Comments & Trackbacks (0) | Permalink

📅December 15th, 2022

Consider this 65816 program

.cpu "65816"                        

PUTS = $00101C                      
PUTC = $001018                      
* = $00FFFC
RESET   .word <>START

* = $002000
START   CLC                         ; Make sure we're native mode
        XCE

        REP #$30
        .al
        .xl
        JSR MSG1

        SEP #$30  ; Set 8bit axy
DIV

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
; Value      ; 8bit interpretation    ; 16bit interpretation
;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;
;            ;                        ;
.byte $A9    ; LDA #$3A               ; LDA #$3A3A                    
.byte $3A    ;                        ;
.byte $3A    ; DEC A                  ;       
;            ;                        ;        
.byte $29    ; AND #$39               ; AND #$3A39         
.byte $39    ;                        ;   
;            ;                        ;      
.byte $3A    ; DEC A                  ;       
;            ;                        ;      
.byte $29    ; AND #$38               ; AND #$2038       
.byte $38    ;                        ;   
;            ;                        ;      
.byte $20    ; JSR $20EA              ;                                    
.byte $EA    ;                        ; NOP
;            ;                        ;      
.byte $20    ;                        ; JSR $20E0
;            ;                        ;      
.byte $E0    ; 
.byte $20    ; 

        TAX
        JSR CLRB
        JSL PUTS 
        JSR MSG2

DONE    NOP         ; Spin
        BRA DONE

* = $002038
MODE16 .null "16"
PRE   .null "This is in "
SUF   .null "-bit mode.     "

CLRB    LDA #$0000
        PHA
        PLB
        PLB
        RTS

MSG1    JSR CLRB
        LDX #<>PRE
        JSL PUTS 
        RTS

MSG2    JSR CLRB
        LDX #<>SUF
        JSL PUTS 
        RTS

* = $0020E0
        RTS

* = $0020EA
        JSL PUTC
        REP #$30
        .al
        .xl
        JSR MSG2
        JSR MSG1
        JMP DIV

for C256 Foenix, assembled with 64tass.

When run, the output looks like

Explanation: the part of the program labeled 'DIV' will run twice, under different interpretations. First in 8bit mode, then in 16bit mode, for displaying the '8' and '16' printable characters respectively.

Normally, code for 8-bit mode is garbage when interpreted in 16-bit mode and vice versa. These ops were specially chosen so that they are valid in both, with different behavior.

Because it's not possible to express the reinterpretation idea in an assembly language, this just dumps the code bytes in the middle of the program, with 2 columns of commented-out explanation of what the bytes do: one column for the 8-bit interpretation and one for the 16-bit interpretation.

I wrote it as a silly test for the debugger, to see how it might display in a 'source-style' debugger. When running it I pass the debugger my source file listing.

It goes... not great

'Transcript debugging', described in this earlier post, fixes it: the output is 100% coherent and matches the source.

So that's a good vote for using that kind of debugger for this type of thing.

Full source code available here:

https://github.com/clandrew/experiments/blob/main/div/div.s

December 15th, 2022 at 8:31 am | Comments & Trackbacks (0) | Permalink

📅December 14th, 2022

Summary: I'm making a case for a certain type of debugger.

More detail below.


A couple of times, people trying to get into ROM patching have asked me what tool I use. I'll answer, although it's not the popular answer. It takes some getting used to because of the type of debugging it is, and for other reasons.

What I use for most of my patching stuff is Geiger's SNES debugger. It's a special build of Snes9x with a debugger bolted on.

Looks like this

The game, main debugger interface, and memory view are in different windows. You press the 'Breakpoints' button to get a pop-up dialog for setting those. Looks very Win32 ish.

Why not for everyone?

When I first started looking into debugging SNES, a while ago, this was one of the best options available.

Since then, the world has moved on.

Despite the age of the SNES today, there is a lot of information out there sourced by fans and tools under active development.

Today, there are SNES debuggers that

  • have larger feature sets
  • have been tested more thoroughly to weed out issues
  • have interfaces that suit newer UI/UX paradigms
  • are actively developed
  • are open-source and so are easy to extend

The debugger has some 'personality traits' I've gotten used to working around. Here's a list of what they are in case you run into them.

Issue: Breakpoints don't hit after certain operations (e.g., save state load)
Workaround: Re-open the breakpoint window, and click OK.

Issue: Step Out doesn't step out to the expected place
Workaround: Don't rely on Step Out for function calls that straddle any interrupts.

Issue: 'Show Hex' (memory) window shows blank ROM on 1st open
Workaround: Choose something else (e.g., RAM) in the 'viewing' dropdown, then go back to ROM.

Issue: Emulator crash if you scroll too far down in the memory window
Workaround: Don't use invalid ranges. Don't try to scroll past the end of the range.

Issue: Can not view CGRAM or OAM
Workaround: Use a different debugger.

Issue: Can not view DBR or PBR
Workaround: Edit some code to push them (PHB/PHK) then PLA.

Software

The debugger isn't under active development anymore, so things like the above list are what they are. I contacted Geiger asking for the source code. He responded wishing me well, and wouldn't give it to me for various reasons, which is his prerogative.

None of these were bad enough to block me; it's been alright.

They also weren't severe enough to motivate trying to get them fixed in this closed-source program.

Why I use it

Despite the above things, I still use it for a few reasons.

  • Habit. I know my way around it
  • The most complicated parts of what I need to do aren't actually through a debugger, (e.g., 'special diffing' of memory dumps), and a debugger could never do as good a job as flexible-function code.
  • Transcript-style debugging.

The biggest one is transcript debugging.

For transcript-style debugging see explanation below.

Transcript-style debugging

For ROM patching projects where the patch is a small targeted surface area, generally speaking you're not trying to recover source code.

Recovering source code so that you can work in it is something you can do if you really want. You can do it if it makes you happy. But it's not always crucial. It can even be a distraction. Be it in an assembly language or a higher-level language like C/C++, recovering source code can be unnecessary for your goal of a targeted change in behavior and can make the task way less efficient.

To get better at this kind of reverse-engineering task, and this is hard for a lot of people to hear-- you need to fall out of love with source code written in programming languages. This includes source-level debugging, it includes expressions of flow control, and this includes source code written in assembly languages.

Below is an example of a debugger listing not using source-level debugging, and using transcript-style debugging instead:

$80/BC3A 20 B0 C1    JSR $C1B0  [$80:C1B0]   A:0000 X:00A9 Y:0005 P:envmXdizc

$80/C1B0 64 6F       STZ $6F    [$00:006F]   A:0000 X:00A9 Y:0005 P:envmXdizc
$80/C1B2 A5 6C       LDA $6C    [$00:006C]   A:0000 X:00A9 Y:0005 P:envmXdizc
$80/C1B4 0A          ASL A                   A:3640 X:00A9 Y:0005 P:envmXdizc
$80/C1B5 88          DEY                     A:6C80 X:00A9 Y:0005 P:envmXdizc
$80/C1B6 F0 13       BEQ $13    [$C1CB]      A:6C80 X:00A9 Y:0004 P:envmXdizc
$80/C1B8 90 38       BCC $38    [$C1F2]      A:6C80 X:00A9 Y:0004 P:envmXdizc
$80/C1F2 86 00       STX $00    [$00:0000]   A:6C80 X:00A9 Y:0004 P:envmXdizc
$80/C1F4 A2 02       LDX #$02                A:6C80 X:00A9 Y:0004 P:envmXdizc
$80/C1F6 0A          ASL A                   A:6C80 X:0002 Y:0004 P:envmXdizc
$80/C1F7 88          DEY                     A:D900 X:0002 Y:0004 P:eNvmXdizc
$80/C1F8 F0 20       BEQ $20    [$C21A]      A:D900 X:0002 Y:0003 P:envmXdizc
$80/C1FA E8          INX                     A:D900 X:0002 Y:0003 P:envmXdizc
$80/C1FB 90 F9       BCC $F9    [$C1F6]      A:D900 X:0003 Y:0003 P:envmXdizc
$80/C1F6 0A          ASL A                   A:D900 X:0003 Y:0003 P:envmXdizc
$80/C1F7 88          DEY                     A:B200 X:0003 Y:0003 P:eNvmXdizC
$80/C1F8 F0 20       BEQ $20    [$C21A]      A:B200 X:0003 Y:0002 P:envmXdizC

"But isn't this source code in assembly language?" I heard this question before. I think this confusion comes from people who haven't done much forward engineering with assembly languages yet, forget reverse engineering.

The above is not source code, it's a debugger transcript.

Again, it's not source code, it's a printout of what got executed, one instruction at a time.

Dead giveaways that it's not source code

  • Every line starts with an address
  • Every line ends with register state
  • There's a function call and then the stepped-into body immediately after. Not proper for source code
  • Some instructions are repeated, like $80/C1F6, $80/C1F7, $80/C1F8. This is a loop

Thinking it's source code is pretty uncharitable to the readability of source code, which usually uses more identifiers and comments and labels and stuff than this example has.

The transcript looks different from source code, and transcript debugging is different from source style debugging.

How is transcript debugging different?

Now that we know what transcripts are: transcript-style debugging is different from source-style, or traditional debugging.

See an example of source-style debugging, with No$sns:

There's a window. In the window, the local disassembly appears in a listing, with the current instruction highlighted in blue. The local disassembly shows instructions laid out in one contiguous block of memory.

Or, here's another debugger, bsnes:

Another example of source-style debugging. The disassembly listing is on the right, with the current instruction highlighted in blue. They go the extra mile and put dividing lines in at observed function boundaries. I don't think that can ever be 100% robust but it's nice regardless.

Or, here's Visual Studio 2019:

The instructions are listed out. The current instruction is highlighted with a yellow arrow to the left of it. There's some things that couldn't be disassembled so there's a placeholder with question marks.

These are all examples of source-style debugging. It's very popular.

Depending on the implementation, the listing in a source-style debugger can either be

  • a local disassembly where all surrounding memory gets interpreted as code, whether it actually is code or not, or
  • the result of a tracing, where only executed instructions appear in the listing, creating gaps

The former is a lot more common, as in all the above examples, although I've seen both.

By contrast, a transcript style debugger will look like this:

See, there are disjoint instructions, with those pairs circled in red.

Some instructions are listed more than once.

And register state is shown on each line. This platform doesn't have a ton of register space so that's honestly pretty manageable.

The transcript shows all branches with the branch taken, all registers with state at the time, all opcodes with their resolved argument, all loops are unrolled.

The kicker is that the implementation of this debugger is dead simple and actually very dumb. It echoes each executed instruction to the output, along with the current register state. That's it.
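
To show just how dumb, here's roughly the whole trick as a C++ sketch (not any particular emulator's real code; Disassemble and ExecuteOneInstruction are stand-ins for whatever the emulator already has):

#include <cstdint>
#include <cstdio>
#include <string>

struct Cpu { uint32_t pc; uint16_t a, x, y; uint8_t p; };

// Assumed to exist in the emulator already; only declarations shown here.
std::string Disassemble(const Cpu& cpu, uint32_t address);
void ExecuteOneInstruction(Cpu& cpu);

// The entire "transcript debugger": execute an instruction, then echo it plus registers.
void StepWithTranscript(Cpu& cpu, FILE* transcript)
{
    uint32_t pcBefore = cpu.pc;
    std::string text = Disassemble(cpu, pcBefore); // e.g. "LDA $6C"
    ExecuteOneInstruction(cpu);
    fprintf(transcript, "$%06X %-20s A:%04X X:%04X Y:%04X P:%02X\n",
            pcBefore, text.c_str(), cpu.a, cpu.x, cpu.y, cpu.p);
}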

Yet it is powerful and offers some advantages.

Advantage: history of register values

It's true pretty much all debuggers will show you register values, or variable/memory values at the current instruction. But what about 5 or 10 instructions ago?

You need to either have time-travel debugging, log it, or restart your program.

Some debuggers will cache the "last seen" way something executed (pointer argument, etc), and update it when that instruction is executed again. Great, you can see what was the last way something executed.

But what about the time before that? Or earlier? You can't easily put together a history of what happened unless you log these data points manually yourself.

For reverse-engineering object code with no source code, getting this history is really important in figuring out what happened. You might need to look for trends, look for a pattern, to get a sense of the higher level algorithm. Or you might want transcripts even with code you are familiar with, to get something like a time-travel trace on platforms where actual time-travel isn't available. There's strictly more information in the transcript than in the source-style listing.

"But, it's only outputting registers each line, not all of memory each line". That's true. Each line of the transcript is not a complete state of the machine. I think register state is the right tradeoff to suit most tasks. The exact choice will depend on the platform and the situation. If you're blessed with extensible transcript debugging, that'd probably be the best thing, so you could have like a "watch window" for each line. Generally for SNES, A/X/Y/P is perfectly fine.

Advantage: history of flow control

You can see a clear history of flow control. After all, that's useful. With a couple source-style debuggers, I've seen them do crazy things like try and draw a cute arrow denoting a function was stepped into.

With a transcript, you can see a history of how many times a loop ran, what index of a jump table was used, and which branches had the branch taken. If you save longer transcripts (e.g., with Geiger's SNES debugger's CPU log feature) you can also meaningfully diff transcripts with any text diffing tool of your choice to find divergent control flow like this.

With source-style debugging, you have no record-keeping of this unless you log it yourself, and you can easily miss what you're looking for.

Advantage: An edge case

This is a bad scenario and not something I've ever seen happen out in the wild.

But you could have a situation where the same memory is executed twice with the code interpreted in different ways, e.g., 8-bit versus 16-bit mode.

I don't think a source style debugger could easily make sense of this. A transcript would show what happened clearly.

I made a proof of concept that does this, and on testing, it doesn't work well at all in a source-style debugger.

Advantage: Don't disassemble stuff that's not code

This is a big one: SNES games will often litter non-code throughout code.

For an example of what I mean, consider the game NHL '94 for Super Nintendo. This is bsnes broken in NHL '94's graphics decompression:

See the part outlined in red. Although there are no obviously illegal instructions, it looks suspicious. Why the CPY with such an arbitrary magic-number address? Same with the EOR and the literal, what's up with that? Why the LDAs that immediately get overwritten?

The answer is this isn't code at all. It's data. This source-style debugger will disassemble everything in the neighborhood. That works great only so long as it actually is code. You hope it looks like obvious garbage code, so you can quickly spot it.

In this case, it's actually an array of short pointer offsets baked into the middle of object code. Those are supposed to be offsets, not instructions. If you're really observant you'll see that the preceding JMP $BEB8, X indexes into it and jumps based on an element. It's a hassle to spot this right away, and the debugger isn't doing anything to help you.

Here's another example in a different place

Again, the red-outlined part is not actually code; it's data. This one's sneakier than the above because at a glance it looks less like garbage compared to the last one. The big giveaway is SED, which is not commonly used.

With transcripts, we don't have this problem. These garbage instructions aren't something you have to discern from non-garbage. Why? Because they don't get executed. They don't even appear.

Why is there data beside the code?

SNES is a different kind of execution environment from what some people are used to.

For the situation above, you might have the reaction "But I work in x86 a lot and I've never seen this before."

Well, x86-64 applications won't have data sprinkled in the code.

Why? Because Intel's architecture does really aggressive instruction prefetching, and the CPU has to know what's code versus what's data for that to work.

You can hear it from Intel themselves:

If (hopefully read-only) data must occur on the same page as code, avoid placing it immediately after an indirect jump. For example, follow an indirect jump with its mostly likely target, and place the data after an unconditional branch.

[...]

Always put code and data on separate pages.

Source: Intel's Optimization Guide

If you write source code that is compiled, at least for a Windows executable, the compiler will put object code in the .text section and constant data in the read-only data section; it'll do that for you.

Or if you write source code in x86 assembly language, generally you'd use a directive like .CODE or .DATA to explicitly define what goes where. The details depend on which assembler you use and what exactly it is outputting.

The WDC 65xx-based CPU, on the other hand, is out there living its best life. Memory is just memory, doesn't matter what's where.

Since it doesn't matter at all, it's up to developer preference and convenience. In practice I do notice developers for 65xx platforms dump data in the middle of their code all the time. They'll bake data locally in the same bank to take advantage of direct addressing, since if you put all the data together it'd have to go in a different bank to fit. Or they're trying to save an MMU page change on certain computers where that matters (e.g., Foenix F256). Or they'll use a self-modified jump instruction instead of a jump table.

So if you work mostly in x86-64, or in any other compiler toolchain with the same recommendation, that's one more reason why you probably go through life dealing with source-style debuggers. No surprise data tripping you up. It's probably not something you ever think about.

With reverse-engineering on a platform like SNES the value of transcripts is more clear.

Recommendation

Geiger's SNES debugger is a transcript-style debugger, and you should consider it or something similar if you are debugging SNES without source code.

I also think transcript-style debugging is something we as an industry should consider more for debugging object code without source code. The benefit of source-style assembly debugging really only shines when you have symbolic debugging, or corresponding source code.

Using transcripts liberates you: you're not burdened with mapping control flow back to source code; there is only a series of behaviors. You could map them back to code, eventually. It's just not where you start.

Practical justification: I used transcripts to get these done

  • Ripping all maps of Lagoon
  • Enlarging the hitboxes in Lagoon
  • Making an NHL '94 player name, profile and stats editor
  • Making an NHL '94 player graphics decompression tool that's 100% accurate to the game's
  • Fixing a bug in Lord of the Rings
  • Disabling collisions in Lord of the Rings
  • Making plants in Harvest Moon be automatically watered

Bonus: Adding transcripts to C256 Foenix IDE debugger

C256 Foenix is a modern WDC 65816-based computer with an emulator.

For personal convenience I added transcript-style debugging to it. It looks like this:

See there's duplicated instructions for a loop, and register output. With source-style, the listing looks like this, which provides a lot less information:

This transcript support is kept side-by-side with the default source-style debugger so that you can switch between them.

Enable it by going to "Settings" and checking the box for "Transcript-style debugger". When the box isn't checked, you get the default source-style debugger. The checkbox setting is remembered like the other settings so you don't need to check it every time.

I've already got some good use out of it. If you want to try it out, it's pushed to this private fork:

https://github.com/clandrew/FoenixIDE/tree/transcript

This change was not accepted for main because it seems like no one else uses this type of debugger, but the private fork is there for my purposes, and for you if you want to use it too.

December 14th, 2022 at 7:21 am | Comments & Trackbacks (0) | Permalink

📅December 2nd, 2022

Within some professional circles I'm constantly in this double bind in terms of how to talk about VR.

If I don't invest in it I'm out of touch with real gaming scenarios, too 'casual' to acknowledge high-end configs, pandering to the lowest common hardware denominator, no optimism or imagination for what high-end PC gaming looks like in the future.

If I do invest in it I'm frivolous, all about motion controls, all about Wiimotes, EyeToy and XBOX Kinect, out of touch with traditional gaming, rejecting of keyboard and mouse gaming as stodgy, trying to replace everyone's CoD with Dance Central.

I have to fight so hard for any kind of middle ground, or nuanced position, in ways that I haven't had to for LDA or 3D stereo. I believe in the technology, and I buy products of the technology; at the same time I don't expect a future where it's as ubiquitous as the smartphone or the internet browser, and that's OK. I'm not big into VR, but I have a positive opinion of it, and there's a place for VR to live alongside traditional setups. The huge checks being written today around metaverse-type products consumed in VR are feeding into this and creating polarization, and it stresses me out a lot sometimes.

December 2nd, 2022 at 5:41 am | Comments & Trackbacks (0) | Permalink

📅September 20th, 2022

Recently someone asked me "what are HLSL register spaces?" in the context of D3D12. I'm crossposting the answer here in case you also want to know.

A good comparison is C++ namespaces. Obviously, in C++, you can put everything in the default (global) namespace if you want, but having a namespace gives you a different dimension in naming things. You can have two symbols with the same name, and there's some extra syntax you use to help the compiler dis-ambiguate.

HLSL register spaces are like that. Ordinarily, defining two variables to the same register like this:

cbuffer MyVar : register(b0)
{
	matrix projection;
};

cbuffer MyVar2 : register(b0)
{
	matrix projection2;
};

will produce a compiler error, like

1>FXC : error : resource MyVar at register 0 overlaps with resource MyVar2 at register 0, space 0

But if you put them in different register spaces, like this:

cbuffer MyVar : register(b0, space0)
{
	matrix projection;
};

cbuffer MyVar2 : register(b0, space1)
{
	matrix projection2;
};

then it’s fine, it's not a conflict anymore.

When you create a binding that goes with the shader register, that’s when you can dis-ambiguate which one you mean:

              CD3DX12_DESCRIPTOR_RANGE range;
              CD3DX12_ROOT_PARAMETER parameter;

              UINT myRegisterSpace = 1;
              range.Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, myRegisterSpace);
              parameter.InitAsDescriptorTable(1, &range, D3D12_SHADER_VISIBILITY_VERTEX);

Q: In the above example, what if I defined both MyVar and MyVar2 as b0, then assigned bindings to both of them (e.g., with SetGraphicsRootDescriptorTable)?

A: That's fine. Just make sure the root parameter is set up to use the register space you intended on.
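
For example, the setup might look something like this (a sketch along the lines of the earlier snippet, not tied to any particular codebase):

CD3DX12_DESCRIPTOR_RANGE ranges[2];
ranges[0].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 0); // MyVar  : register(b0, space0)
ranges[1].Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 1); // MyVar2 : register(b0, space1)

CD3DX12_ROOT_PARAMETER parameters[2];
parameters[0].InitAsDescriptorTable(1, &ranges[0], D3D12_SHADER_VISIBILITY_VERTEX);
parameters[1].InitAsDescriptorTable(1, &ranges[1], D3D12_SHADER_VISIBILITY_VERTEX);

// At draw time, each cbuffer gets its own binding by root parameter index:
//     commandList->SetGraphicsRootDescriptorTable(0, myVarTableStart);
//     commandList->SetGraphicsRootDescriptorTable(1, myVar2TableStart);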

Small, simple test applications all written by one person usually don’t have a problem with overlapping shader registers.

But things get more complicated when you have different software modules working together. You might have some other component you don't own, which has its own shaders, and those shaders want to bind variables which occupy shader registers t0-t3. And then there's a different component you don't own, which also wants t0-t3. Ordinarily, that'd be a conflict you can't resolve. With register spaces, each component can use a different register space (still a change to their shader code, but a way simpler one) and then there's no conflict. When you go to create bindings for those shader variables, you just specify which register space you mean.

Another case where register spaces can come in handy is if your application is taking advantage of bindless shader semantics. One way of doing that is: in your HLSL you declare a gigantic resource array. It could be unbounded, or have a very large size. Then at execution time, you populate and use bindings at various indices in the array. Ordinarily, two giant resource arrays would likely overlap each other and create a collision. With register spaces, there's no collision.

Going forward, you might be less inclined to need register spaces for bindless semantics. Why? Because with Shader Model 6.6 dynamic resource indexing, bindless semantics is a lot more convenient: you don't have to declare a giant array at all. Read more about dynamic resource indexing here: https://microsoft.github.io/DirectX-Specs/d3d/HLSL_SM_6_6_DynamicResources.html

Finally, register spaces can make it easier to port code written against previous versions of the Direct3D programming API (e.g., Direct3D 11). In previous versions, applications could use the same shader register to mean different things for different pipeline stages, for example, VS versus PS. In Direct3D 12, a root signature unifies all graphics pipeline bindings and is common to all stages. When porting shader code, therefore, you might choose to use one register space per shader stage, to keep everything correct and unambiguous.
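
As a sketch of that convention (the space-per-stage assignment here is just a choice, nothing the API forces on you):

#include "d3dx12.h"

// Porting convention: space0 holds what the VS called b0 in D3D11, space1 holds what the PS called b0.
CD3DX12_DESCRIPTOR_RANGE vsRange;
CD3DX12_DESCRIPTOR_RANGE psRange;
vsRange.Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 0); // b0, space0
psRange.Init(D3D12_DESCRIPTOR_RANGE_TYPE_CBV, 1, 0, 1); // b0, space1

CD3DX12_ROOT_PARAMETER parameters[2];
parameters[0].InitAsDescriptorTable(1, &vsRange, D3D12_SHADER_VISIBILITY_VERTEX);
parameters[1].InitAsDescriptorTable(1, &psRange, D3D12_SHADER_VISIBILITY_PIXEL);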

If you want some more reference material on register spaces, here's the section of the public spec:
https://microsoft.github.io/DirectX-Specs/d3d/ResourceBinding.html#note-about-register-space

September 20th, 2022 at 1:10 am | Comments & Trackbacks (0) | Permalink

📅August 19th, 2022

Short version: To disable collisions (e.g., walk through walls) in J. R. R. Tolkien's Lord of the Rings for SNES, use the following Pro Action Replay codes:

80E0C780 (Disable horizontal collisions)
80E13480 (Disable vertical collisions)

Longer explanation below.


There are some areas in maps I wanted to look at in this game, and those areas are inaccessible playing the game normally. How can you look at inaccessible parts of the game?

One option: rip the map. That would work, although it's cumbersome. It would cascade into the problem of also ripping the tileset graphics and figuring out which tile each map datum maps to. I did this process once for a different game, Lagoon. Those maps are here. It was doable because I had already done some other projects involving that game, so I had some information about it. For games I'm less familiar with, like LotR, it would take longer, probably so long it's not worth it.

An easier option instead is to disable collisions. So I set out to do that.

The general sense I had was that this would be a code change to the collision logic in the game. Some code must take the player's position, do some comparisons on it, and permit you to move or not move based on the comparisons. But which code is doing this?

1. Where position is stored

Breaking the problem down, the first step is to find where your position is stored in memory. It's likely to be an X, Y position since what kind of a maniac would store it in polar.

So there's a number, somewhere in the program. Classic problem: we don't know where it's stored. But worse yet, we also don't know what its value is.

When faced with this kind of problem I do something I'm calling "special diffing" (call it whatever you want): basically, you take 3 or more memory dumps.

  • For dump 0, you've got your character somewhere.
  • For dump 1, you move them a little bit to the right.
  • For dump 2, you move them a little bit more to the right.

And then write some code that opens the dumps and looks through each offset for a value that increases from dump 0 to 1, and from 1 to 2. Want more confidence? Take more memory dumps for more data points.
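
Here's roughly the shape of that code. This is a from-scratch sketch, not the exact program I used back then; the dump filenames are made up, and it assumes the dumps are raw binary files:

#include <algorithm>
#include <cstdio>
#include <fstream>
#include <iterator>
#include <vector>

// Load a raw memory dump into a byte array.
static std::vector<unsigned char> LoadDump(const char* path)
{
    std::ifstream file(path, std::ios::binary);
    return std::vector<unsigned char>(std::istreambuf_iterator<char>(file), {});
}

int main()
{
    // Dump 0: starting spot. Dump 1: moved a little right. Dump 2: moved a little more right.
    std::vector<unsigned char> dumps[3] = { LoadDump("dump0.bin"), LoadDump("dump1.bin"), LoadDump("dump2.bin") };

    size_t size = std::min({ dumps[0].size(), dumps[1].size(), dumps[2].size() });

    // Report every offset whose value strictly increases across the dumps. These are position candidates.
    for (size_t offset = 0; offset < size; ++offset)
    {
        if (dumps[0][offset] < dumps[1][offset] && dumps[1][offset] < dumps[2][offset])
        {
            std::printf("Candidate at offset 0x%zX: %u -> %u -> %u\n", offset,
                (unsigned)dumps[0][offset], (unsigned)dumps[1][offset], (unsigned)dumps[2][offset]);
        }
    }
    return 0;
}

Run it over the dumps and you get a short list of offsets to check by hand.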

Why move horizontal and not vertical? Because there isn't a strong convention in computer graphics about whether up is positive or not. For horizontal there's a pretty strong convention.

Why move to the right, and not left? Convenience: we probably expect the position value to increase going to the right.

Why move just a little, and not a lot? So that you don't overflow the byte type (e.g., exceed 255) since that's a hassle to account for.

Using this diffing technique gave a few candidates:

  • 7E05D3
  • 7E0AE7
  • 7E0EF7
  • 7E1087
  • 7E128F <-- The answer
  • 7EFD46

The list was short enough to rule out the false positives by hand and find the answer:

7E128F

The other values corresponded to position too, but they were outputs of it, not inputs. E.g., changing those values wouldn't stick; changing 7E128F did.

2. What position is used for

The next obvious step is to set a break-on-write on the position. We expect it to get hit whenever your character walks around, since the position would get updated then. And moreover, we'd expect that code path not to be taken if you're hitting a wall.

Bad news here. The break-on-write will always fire regardless of whether you're moving or not-- the code is set up to always overwrite your position even if it doesn't need to. So that kind of breakpoint won't directly tell us which code paths do or don't get taken when you hit a wall.

For the record, it hits here:

$80/C88E 99 8F 12    STA $128F,y[$80:128F]   A:0100 X:0000 Y:0000 P:envmxdizc

That's okay. It at least tells us something: where your position gets updated while moving. And we can still reason about what happens when it doesn't get updated.

And in particular, we can locally disassemble, or turn on CPU logging, to find the preceding operations

...

$80/C883 69 00 00    ADC #$0000              A:0000 X:0000 Y:0000 P:envmxdiZc
$80/C886 99 97 14    STA $1497,y[$80:1497]   A:0000 X:0000 Y:0000 P:envmxdiZc
$80/C889 AA          TAX                     A:0000 X:0000 Y:0000 P:envmxdiZc
$80/C88A 18          CLC                     A:0000 X:0000 Y:0000 P:envmxdiZc
$80/C88B 79 8F 12    ADC $128F,y[$80:128F]   A:0000 X:0000 Y:0000 P:envmxdiZc
$80/C88E 99 8F 12    STA $128F,y[$80:128F]   A:013A X:0000 Y:0000 P:envmxdizc

So some position-increment value gets saved to $80:1497, and then the position gets incremented by that same value. We can use this to work backwards and see a bunch of other fields getting updated as well.

Now we know the neighborhood of the collision code and can reason about the code path involved.

3. Finding the branch

There are a couple ways to proceed from here. There's a 'nice' option and a 'cheesy' option.

The 'nice' option is to take where we know the position is stored, and find the chain of operations done to its value. This works out if the control flow for updating position is pretty localized-- say, if collision-checking and position-updating are all in one tidy function, right next to each other.

Unfortunately, the position-updating logic was messy. It was seemingly tangled up with the logic for checking for interact-ables (e.g., doors, NPCs). So while the 'nice' option is still an option, it's costly. Hence the 'cheesy' option.

The cheesy option is to break on position change, using the information we found above, so we at least have some kind of frame delimiter and something to look for. Then enable CPU logging. Log:

  • one 'frame' where you can move, and
  • one 'frame' where you're obstructed.

Then strip out IRQs (they create noise) in the CPU log. So now we have a clean logfile.

What is a CPU log?

Just text.

And you know what you can do with text?

Put it through a differ.

So I put the result through a giant diff. For this I used Meld.

The diff isn't so crazy.

See, there's data divergence up until a certain point, and execution divergence after that. It was the kind of thing where I scrolled through it and something caught my attention, so not a really thorough debugging experience. Anyway, the earlier work untangling where position is stored helped in understanding the data divergence and filtering noise out of the diff.

And that ended up being the answer. The comparison

$80/E0C4 DD 41 E6    CMP $E641,x[$80:E643]   A:0010 X:0002 Y:00B2 P:envMxdizc
$80/E0C7 90 05       BCC $05    [$E0CE]      A:0010 X:0002 Y:00B2 P:envMxdiZC

will check if you're about to collide with a wall, or not. If you're free to move, you take the branch. If you're obstructed, you fall through. Therefore we can disable collisions by always taking the branch. So change

$80/E0C7 90 05       BCC $05    [$E0CE]

to

$80/E0C7 80 05       BRA $05    [$E0CE]

A quick test will show you this only covers horizontal collisions. Vertical collisions go down a completely separate code path. I guessed that the two paths shared a common subroutine

$80/E0BD 20 22 D0    JSR $D022  [$80:D022]   A:00B2 X:0150 Y:00B2 P:envmxdizC

and this ended up being true. Setting a breakpoint on $D022 also hit for the vertical case:

$80/D022 DA          PHX                     A:0E05 X:016B Y:0095 P:envmxdizc
$80/D023 5A          PHY                     A:0E05 X:016B Y:0095 P:envmxdizc
$80/D024 8A          TXA                     A:0E05 X:016B Y:0095 P:envmxdizc
$80/D025 4A          LSR A                   A:016B X:016B Y:0095 P:envmxdizc
$80/D026 4A          LSR A                   A:00B5 X:016B Y:0095 
...
$80/D033 7A          PLY                     A:0000 X:0016 Y:026C P:envmxdiZc
$80/D034 FA          PLX                     A:0000 X:0016 Y:0095 P:envmxdizc
$80/D035 60          RTS                     A:0000 X:016B Y:0095 P:envmxdizc
$80/E12D E2 20       SEP #$20                A:0000 X:016B Y:0095 P:envmxdizc

at which point it's easy to step out one stack frame and see the caller. And it turns out the caller looks very similar to the horizontal case, with the same kind of branch. It has

$80/E131 D9 41 E6    CMP $E641,y[$80:E643]   A:0000 X:003F Y:0002 P:envMxdizc
$80/E134 90 05       BCC $05    [$E13B]      A:0000 X:003F Y:0002 P:eNvMxdizc

So you can do a similar thing, changing the branch-on-carry-clear to an unconditional branch

$80/E134 80 05       BRA $05    [$E13B]    

Putting it all together (a Pro Action Replay code is just the 24-bit address followed by the byte to write there), the two codes are

80E0C780 (Disable horizontal collisions)
80E13480 (Disable vertical collisions)

Here's a demo of it in action, successfully getting past a door.

Side thing: I also used the code to get above the door in Bree. That said, even after re-enabling collisions, it didn't take me anywhere. So the door's either controlled by non-'physical' means or not implemented.

I was still left with this question of whether we can conclusively say the door is not implemented.

It's one thing to prove a positive, prove a program does something. Easy enough, you show it doing the thing.

It's another thing to prove a negative, to prove a computer program will never do a thing. Can you prove the door can never open? Not ever? You can't really, you can only reach levels of confidence. Can you prove White Hand is not in Pokemon? Can you prove Herobrine is not in Minecraft? Has anyone ever conclusively proven that the pendant in Dark Souls doesn't do anything? Well, see, they ran the program through a simulator and-- just kidding, they went low tech and asked the director, so then it's a social engineering problem of whether you believe him.

When people use formal or other methods to prove program behaviors or nonbehaviors, they exploit constraints in the programming language or execution environment. For example, a language like Haskell has a rich set of constraints which are helpful in proving behavior. Or if a computer is unplugged and has no battery, you have confidence it won't power on. But in our case we're talking about object code, not source code, and we're talking about something the hardware could do. The instruction set of the object code alone doesn't provide helpful constraints. Hell, we can't even universally statically disassemble programs on this platform (because instruction width is chosen dynamically). Statically prove nonbehavior?

I'm not trying to give credibility to conspiracy theories or the mindset behind that kind of thinking. I'm trying to explain why you might not find a conclusive answer when you might want one. Anyway, through this exercise I got a greater level of confidence that the door doesn't go anywhere.

Some details

  • You can use both codes or one at a time.
  • Disabling collisions means you also can't interact with objects like doors. If you want to pass through doors, re-enable collisions or enable them for one direction.
  • If collisions are disabled for you, they're also disabled for allies, NPCs, and enemies.
  • Disabling collisions will let you pass through 'physical' doors in the game, where you're obstructed simply by where you're allowed to walk. For example, the gate in Moria, and the fires on the steps by the Balrog. There can be other 'non-physical' doors, where you need to trigger the right game event (e.g., have a key) to open them.

In any case, I used password tricks to get to the end area with all events signaled, and had collisions off. I got to freely walk around the area where you find Galadriel with the mirror.

It turns out, the area is an infinite plane!

It's for the best not to try and rip the level data. /j

August 19th, 2022 at 6:19 am | Comments & Trackbacks (0) | Permalink