📅July 31st, 2020
See a video of it here: https://www.youtube.com/watch?v=lmg9sN9FO-w
Also on Twitter: https://twitter.com/clandrew__/status/1288941082202390533
As a fun thing to do for a few days I made this demo, for background I have programmed on Apple II in the past but in higher-level ways, through BASIC and Logo. The idea here was to write something in 6502 so that I could learn more about the platform. Prior to this I had some background debugging and doing small code patchings in SNES's 65816 (similar to 6502) and how to use an Apple II computer but no experience at all writing 6502 for it.
I got a book "Assembly Lines" by Roger Wagner which was a lot of help plus "Inside the Apple IIe" by Gary Little. And also a bigger book of opcodes lent to me by a friend.
Things that were easy
- Trancoding from a .PNG image file on Windows to indexed image data compatible with LORES memory layout. I wrote this command line tool which inputs an image file on Windows and outputs a bunch of data directives for the LORES compatible image data. I referred to the "Inside the Apple IIe" book to source out what the format is. The tool uses WIC to open the image and shoves a bunch of things around. This took like 20min. The code is here
https://github.com/clandrew/demoII/blob/master/ImageTranscoder/ImageTranscoder.cpp
- Testing on real actual hardware. What I mean is, I thought this would shake loose a bunch of problems. Especially around sound. There was conflicting information as to how you're supposed to access $C030 to sound the speaker. If an opcode technically corresponds to *two* accesses of the I/O port instead of one it could cancel out so no sound. This is an area where emulator would likely be more forgiving. And before you think "just copy from some sample code", various sample code and documentation contradicted each other. In the end I picked absolute LDA and it worked.
- Modern IDE Now, the bulk of the code was written using Merlin's interface on the actual Apple II. It's actually pretty good and I thought it was fine. Once in a while, whenever there was a big messy refactor though I copied the code out of the Apple II data disk and into a Windows filesystem text file because of course that's a thing. And quickly do the reverse, too. What a time to be alive.
- Debugging. The debugger built into the emulator is the most luxurious type of pampering. This is the reason I don't know what it is to work or to suffer. You can set breakpoints, conditional breakpoints, view memory and disassembly of what's executing. From talking to people, "Back in the day" you would debug by breaking into the Monitor or not at all. There are hobbyists writing Apple II code today though and I am pretty sure they all use tools like this. I feel okay with this hybrid way of doing things.
Things that were not easy
- Correct sound. Sound on most Apple II models is emitted in a weird way. You touch a memory-mapped I/O port- and how *frequently* you access the port (read or write) changes the pitch of the sound. A touch followed by 50 NOPs is a different pitch from a touch followed by 100 NOPs. And of course, you should put all of that in a loop or else the sound will be so quick it's hardly observable. You need very careful counting of clock cycles to play different tones of uniform duration. Yes, calculation involving clock cycles to figure out how to emit notes of *different tones* but *equal length*
- "Holes" and nonlinearity of graphics memory-map. Say you have some image data. It is stored with the top left byte first, and the bottom right byte last. To display it you might think, "just do a memcopy!". No no no. In graphics memory rows of pixels are stored in all jumbled up order. Even if you store your image data in that same jumbled-up order, there are "holes" in the range which you are not supposed to write to. Because I wanted to keep image positionings flexible for future use I had my source image data stored in straightforward order and copied each pair of rows based on a table of offsets. This wasn't the absolute worst just fussy.
- Branch distance limit. Because only one signed byte specifies the offset you can't branch anywhere too far away. To get around this, you branch to an intermediate place which jumps to the final place but also need to take care that the jump isn't accidentally taken by something else.
- Asymmetry of how you're allowed to use registers X and Y. For example, when you use indirect-indexed-load with X, it's pre-indexed but if you do it with Y it's post-indexed. Merlin's semantics between these two is ever so slightly different it's easy to miss and this was the root cause of an annoying bug I fixed Tuesday.
I scoped this project to LORES only so that it would be feasible and also a way to get comfortable with environment, instruction set and functions specific to this computer. The things I've heard about HIRES is that it's fewer colors, higher res, and more gnarly to work with, could also be good to check out at some point.
Some highlights
<-- Graphics bug turned a bit a e s t h e t i c. 
The root cause of this was the image loading code would scramble the input image offset passed to it. I was testing out scrolling of images (as a possible future thing to add), so the corruption happened when you hit the "down" key. Scrolling would call the image loading code again with the scrambled registers, loading an image from an incorrect place and changing which incorrect place every time. 

Debugger built into the emulator. does you a favor and gives memory-mapped registers helpful human readable names (e.g., SPKR, called CHIRP in my code) rather than raw offsets. This was how I sometimes found out that I was actually using some areas of memory which seem general purpose but are not really.

For the music, I sketched it out a single-channel MIDI in Noteworthy Composer and then started testing different tones to find what ratio of I/O port setting + busy waits would produce which notes. I added the frequencies as text in the staff at the bottom.
You have this problem of "how to emit tones of different pitch but the same duration".
This problem will be obvious to you if you've programmed Apple II like this before, but I'll say it in case: to emit a tone on Apple II you touch a memory-mapped IO port. How often you touch it determines what tone it is. "Touch" here means read or write.
You're going to want to touch the port in a loop padded with a bunch of sleeps. To get a lower note, add more sleeps. To get a higher note, use less sleeps*. Easy enough.
*For sleeps it doesn't literally have to be a sleep, it could be a NOP or anything that takes up time. In my sound code the busy wait is "decrement a counter, branch conditional". You could expand the gamut of what sounds you can emit by adding an actual NOP or two.
What if you want to emit two different tones which both have the same duration? If you change the number of sleeps, you change the duration. How fussy. Fortunately, we own the code for touching the port. So I fix this by counting the number of clock cycles it takes to emit a baseline tone, and solving for the number of sleeps needed to emit a different tone with the same duration.
My code to emit sound looks like
    *                                       ;CLOCK
    A6 07          TONE   LDX $07           ;3
    A4 06          DUR    LDY $06           ;3
    AD 30 C0              LDA CHIRP         ;4
    88             PCH    DEY               ;2
    D0 FD                 BNE PCH           ;Branch taken: 3. Not taken: 2
    CA                    DEX               ;2
    D0 F5                 BNE DUR           ;Branch taken: 3. Not taken: 2
    60                    RTS               ;6See there is a routine called TONE, and two loops labeled DUR and PCH. 
PCH is an inner loop.
DUR is an outer loop.
I added comments describing how many clock cycles each line will take. For branches the number of clock cycles depends on whether the branch is taken.
Example of usage of TONE- (this part doesn't vary between pitch 
 or duration so clocks are not taken into account)
    A9 BC           LDA #188                
    85 06           STA $06                 
    A9 6A           LDA #106                
    85 07           STA $07                 
    20 26 66        JSR TONE                
    60              RTS                     TONE is subroutine with two arguments: pitch and duration, nicknamed pch and dur below.
TONE takes some number of clock cycles to execute, nicknamed clk below.
We can calculate clk as a function of pch and dur.
Looking at the implementation of TONE, the initial load and return don't vary so they weren't taken into consideration. But we can go line by line and calculate how many clock cycles the interesting parts will be
    clk = 0;
    clk += dur * 3;                     // DUR    LDY $06
    clk += dur * 4;                     //        LDA CHIRP
    clk += pch * dur * 2;               // PCH    DEY
    clk += (pch * dur * 3) - (dur * 3); //        BNE PCH (branch taken)
    clk += dur * 2;                     //        BNE PCH (branch not taken)
    clk += dur * 2;                     //        DEX
    clk += (dur * 3) - 3;               //        BNE DUR (branch taken)
    clk += 2;                           //        BNE DUR (branch not taken)which simplifies down to
clk = dur * (11 + 5 * pch) - 1;rearranged to solve for dur, you get
clk + 1 = dur * (11 + 5 * pch)
dur = (clk + 1) / (11 + 5 * pch)This way I got what value of 'dur' you need for each pitch to use to get about the same target cycle count as the middle pitch, 138 with duration 144 (picked arbitrarily).
If you are pro you might factor non-sound code (e.g., the loading of graphics images) into your calculations.. fortunately here the amount of time to load graphics and on other control flow ended up being negligible so it didn't end up mattering

And of course, a good thumbnail is about setting the right expectation.

Source code posted here: https://github.com/clandrew/demoII/blob/master/T.GR.asm