Thursday, November 5, 2020

Adding music to a NES game

Petris now has a theme song, courtesy of my niece. :)

I asked for some help from my nieces to come up with a small bit of music for the game, since I'm not much of a composer myself. After I described the constraints of the NES to them (two square waves, one triangle, one noise), my oldest niece was able to get it to me as a recording she put together on an iPad; I just had to get it transcribed into a format the NES would accept and a music engine to play it.

Enter FamiStudio.

FamiStudio

FamiStudio is a nifty combination of tool and framework for constructing NES music and sound effects. It consists of a couple pieces:

A desktop application for composing music or sounds against the NES hardware
Several audio engines in assembly that can be embedded in an existing NES source code

The system supports both the in-hardware features of the chip and a few software-implemented effects (such as arpeggio). Music is broken down into "patterns" and "instruments," where patterns allow for saving memory by repeating the same musical phrase and instruments combine "voicing" rules with notes to create various timbres the sound circuitry can support. I put together a FamiStudio project and dove in.

My niece gave me a fairly simple piano-and-drum piece, so I only needed two instruments for it. For the drum, I used a bassy frequency and tweaked the volume envelope to give a sharp hit with a quick die-off. The volume is quiet for one frame to give a "squishy" feel, which sounds more bass (soft mallet hitting a big membrane, maybe? I don't know much music physics ;) ).

To get the sound a bit more "drummy," I used the arpeggio effect to tweak the noise frequency higher for two frames, then back down to the selected frequency. It puts a bit of hiss on the front, like a metal rattle that dies off; not quite snare drum, but a bit more color than a flat "boomph" sound.

The square wave carries the two-note melody and only has a few tweaks. Duty cycle selects what percent of time the square wave is "on" and can vary between 12.5%, 25%, 50%, or 75% (which is just 25% again; human ear can't tell the difference). 50% gives me a good old-fashioned "pure" square wave with no flavor, which is very piano. For volume, I went with a sharp attack and a long-tail decay; no deep thought on this I just doodled on the waveform until it "felt" good.

Now that the song exists, I can export it as FamiStudio Music Code using the NESASM format, since Petris is written in nbasic and nbasic includes an escape hatch to NESASM. The result is a data file containing a representation of the music, but not the commands to play it.

Halfway there

The nbasic language doesn't include a "file import" operator, so I copied-and-pasted the content of that file between asm and endasm directives. Now, I just needed some code to play it.

FamiStudio Sound Engine

FamiStudio's GitHub repository includes an implementation of the audio engine in several different NES assembly languages. I grabbed the one for NESASM and pasted it directly into Petris's source file. The audio engine is quite featureful, and quite large; everything I'd written for Petris to that point now composes 19% of the soruce file. So I compiled and..... The assembler choked.

So I'm not sure if I'm using an older version of NESASM (version 3.1) or if FamiStudio was built against a newer version, but NESASM itself has a hard limit of about 32 characters worth of label that the FamiStudio Sound Engine completely ignores. I made and documented 14 search-replaces to trim the label lengths down to something acceptable to NESASM, and we were off to the races.

To connect the audio engine to my game, I had to run some procedures exposed by the engine. The engine documents these pretty well, but a quick summary follows. Since we're operating at assembly level and don't have a standard function-call protocol, each function documents what the a, x, and y registers have to be set to for the function to work.

famistudio_init sets everyting up. It accepts a selector of 0 for PAL timing and 1 for NTSC timing (a), and a pointer to the label at the top of the music data spit out by FamiStudio (low-byte x, high-byte y; in my case, that's untitled_music_data.
famistudio_music_play begins playing a song. It takes the index of the song to play (a); I only have one, so I just had to load a 0 into the accumulator.
famistudio_music_stop kills playback and takes no input params
famistudio_update is a "meat and potatoes" function that must be called once per animation frame (i.e. once per non-maskable interrupt cycle) to allow the engine to move the music forward. More on this later.

The nbasic language is very good about bridging to assembly, so calling these was no more complex than throwing down a small wrapper subroutine to make juggling variables easy.

It's almost certainly overkill to save all the registers, but no harm done

I added gosubs to these routines to the relevant locations, through a call to the updater into my nmi interrupt logic, and... Started an earthquake.

Here's what happened. I made the mistake of adding the update function into my code here, after drawing is completed:

You feel like you're gonna have a bad time...

Problem is, that means I'm running the audio update loop inside the vertical blanking period (the NMI fires at the start of vertical blank. This is, generally, setting yourself up for failure; if the audio processing takes so long that the vblank period ends, any more manipulation of the PPU before the next NMI fires is going to fight the PPU itself for control of state, as the PPU uses its memory latches to generate a frame of video output. Specifically in my case, vwait_end does some cleanup and sets the scroll register correctly, so firing it after vblank ends causes offset artifacts as video rendering continues with the wrong scroll value.

Fortunately, the solution is real straightforward: don't do that.

Now we're safely setting up audio while the PPU is doing the heavy-lifting of emitting one frame of video, and life goes on.

Lessons Learned

This was a fun exercise to finish, and I'm honestly very surprised at how good the public tooling is for NES audio. The audio engine available under the MIT license was a huge win, and I want to extend a huge thank you to Mathieu Gauthier (a.k.a. BleuBleu) for making that exist. He doesn't appear to have a Patreon, so drop a note on one of his video tutorials and tell him what a cool guy he is.

Friday, September 25, 2020

Handwave code and using co2

Let's break down the code in Handwave to see how co2 works as a NES game development language.

Code to read state of the NES controllers

Here, I'll talk a bit about Handwave's code and some of my experiences working with co2 as a programming language.

A Dive Into Handwave's Code

Not unlike my previous game, Handwave is broken up into initialization, game phase loops, game logic, controller logic, PPU control code, and raw data. Additionally, there is audio logic to control the APU, and two large state machines to read the music data (one to convert the data to notes on the screen, and the other to drive detection of when notes should be played and trigger the APU correctly). I was able to get started from co2's packaged-in example code and leaned on some snippets of What Remains's source code for examples and language quirks.

The source code is here for following along.

Initialization

At the top of the file is a nes-header macro that declares some "hardware" configuration (this basically tells emulators what sort of cartridge this game would have run on, were it a real cart). We then have a set of defaddr and defconst declarations to name things. I absolutely love that this language allows for constant declarations; takes a lot of cognitive load off of managing the code to be able to name things!

The game's code code essentially has two entrypoints defined by the (defvector) declaractions: reset is the code run when the system is powered on (or the reset button is pressed), and nmi is the code run every time the vertical blank signal (i.e. a "non-maskable-interrupt") is sent to indicate we're between frames of video. In reset, we do the regular setup for preparing the game (initializing variables and rendering an initial screen to the PPU). Several init- and load- subroutines live here, which prep various pieces with various patterns of data. co2 provides several convenience routines in its "standard library," as it were, that simplify this process; ppu-memset is sugar on top of initializing a range of PPU with a constant value (by storing an address to the PPU_ADDR address and then looping across storing data to the PPU_DATA address); ppu-memcpy is similar sugar for ripping a chunk of RAM into the PPU. A variety of set-sprite- methods do the math to set pieces of sprite data (assuming the sprite data is positioned at the recommended #x200 address).

Game Phase Loops

At the moment, the game has two phases: waiting for start, and playing. Since there are only two, a simple g-playing global variable tracks which one we're in. Every nmi, we first update everything that requires the PPU to be quiescent (;;; TIMING CRITICAL CODE), then we drop into the relevant loop depending on mode.

In the waiting-to-play mode, we listen for buttons to indicate players want to play (check-wave-activations), listen for "netcode mode" to be toggled (toggle-netcode), clear the logged-in players if player 1 hits select (clear-active-handwaves), and start when start is pressed (by setting g-playing to #t). "Netcode mode" is a feature I added to make the game easier to play on multiple machines over a network, which I'll explain in a future post.

Play mode scrolls the notes, checks to see if players are playing notes, and runs the animation loop for moving sprites around.

Game Logic

Several pieces of game logic live here, split into their own subroutines. (check-wave-activations) looks to see if controller buttons have been pushed to sound the handwave "bells". Any buttons pressed trigger (on-wave-triggered), which either logs the player in (if we're not g-playing) or plays the note (there are two ways notes play, depending on whether we're in "netcode mode" or not).

There's a utility function buried in here that I want to highlight: (roll-left-n). This sub gets around a small "bug" (really, a lack of feature) in co2: the left-shift operator (<<) can only accept a constant value as the number of bits to shift (note: this is still more functionality than raw 6502 assembly gives us; the underlying ASL operator always shifts left only one bit!). This utility function accepts a second argument and uses it as the number of bits to shift. The name isn't great; I forgot that ROL is also a 6502 assembly operator, which "rolls" a field of 8 bits (moves the MSB to the LSB and moves every other bit one towards MSB). I like now easy it is in co2 to lay down subroutines like this; now that I have it, it's just as easy to use as the existing (<<) routine (though a lot more CPU-intense).

(handle-next-draw-note) is one of the two "parsers" for the music representation language I created. The language consists of bytes that are either "nodes" or "directives." A "node" guides how music is played, and can be a note or a rest; a "directive" controls the musical staff. It currently supports ending the song, but in the future it can be used to indicate changing the voicing of a handwave. The format is detailed in the area headed by ;;;; SONG DATA. This parser is called once every "beat" ( about 8 frames of animation); it consumes 1 or more bytes of music information to determine what notes should be rendered at the right edge of the screen. The return value is a state machine flag (either read another byte or pause reading for n beats to rest) and 16 bits indicating which notes should be drawn; input to the function is 16 bits serving as accumulators for the notes to be drawn. A global, g-song-render-index, tracks where in memory we are.

Pointers are a very cool feature of co2; they smooth over the fact that in the 6502, arithmetic registers are 8-bits wide but the address space is 16 bits. Several specialized opcodes in the 6502 allow you to do indirect indexed lookup of memory by treating two sequential bytes in "zero-page" memory (the addresses $00-$FF) as one 16-bit address. So co2 has utility functions to represent a single 16-bit zero-page value and to use those values to "peek" and "poke" (read and write) memory and to increment the pointer (taking care of the overflow in the math to step the MSB of the 16-bit address when the LSB overflows). Handwave uses two pointers to track song progression: the g-song-render-index draws notes, and the g-song-play-index handles audio playback, 27 beats behind the first pointer (to give notes time to crawl right-to-left across the screen).

Shortly after (handle-next-draw-note) is (handle-next-play-note). It's very similar in structure, but instead of adjusting PPU state, it determines if the player is trying to sound the note and adjusts the audio state. The state machine itself is basically the same, and it also takes in two 8-bit accumulators to track the state of all 16 playable "handwaves." There is some fanciness to account for netcode mode; outside of that mode, the logic determines if a note is for a handwave not controlled by a player and auto-plays it, but inside of that mode it also tracks whether the player is trying to play a note and might be lagging.

Controller Logic

The (read-joypads) sub pulls data from all four controllers and updates a set of global variables:

pad-data, which tracks which buttons were pressed
pad-data-last-frame, which tracks button presses on the previous run of read-joypads
pad-press, which is the buttons that are newly-pressed (i.e. "button down" events)

This data is read from (button-pressed), which gives a few into the buttons newly-pressed this read.

Graphics Logic

(plot-notes) draws 16 notes onto the column of the staff just off-screen, using (find-scroll-edge) to locate which column should be updated. Next, there's an (anim-sprites) routine. This is a rewrite of a similar routine I built for Petris; it runs through some small lists of offsets to sprite position each frame to "bump" the sprite until a "stop" indicator of #xFF is reached, which ends the animation. Setting a memory location to an index of one of the animation offsets starts applying the animation to the sprite the next time the subroutine is called.

Audio Routines

The NES audio toolkit consists of five waveform generators and a simple mixer to combine them. Handwave currently uses only the two square wave ("pulse") generators (the sawtooth generator has no volume control, which makes it a poor fit for a bell-like sound). The audio processor provides a limited amount of automation, of a sort; it's capable of automatically tracking note play duration, a pitch-bend effect, and a "fade" effect (which we use to give a nice bell sound). Timbre can also be adjusted by tweaking the duty cycle; the square wave can be "on" for 12%, 25%, 50%, or 75% of the time. We use a 50% pulse width and a volume-decay of 3, so the envelope logic diminishes volume once per 3+1 = 4 quarter-frames, i.e. once per audio frame (audio runs at about 60Hz timing on frames); since the decay level steps from 15 to 0, the sound decays over 15 frames, which is about 1/4 second.

For Handwave, notes are played by (play-note). We set the pitch from a precomputed lookup table for the 16 notes (calculated by the logic in song.scm, which I'll explain in a future post). A simple flip-flop value in g-which-pulse alternates between 0 and 1 every time a note is played to choose whether it's played on the first or second pulse generator (so up to 2 notes can sound simultaneously).

Data Section

The final portion of the code hard-codes various data values, including which buttons play which handwaves, the note pitches, the palette configs for sprites and background, the animation for the handwave icons to "bump" when they're played, and the song data itself.

How NES programming differs from modern web apps

My day job is to do user interfaces for web applications. One of the things I enjoy about NES programming is that the discipline itself has different best practices; I like working on something that makes me shift my focus and remember there are other modes of development. Some big differences:

You own the world: Modern app development often has to account for the possibility that the user is doing multiple things. In a web app, the user could close or reload the page at any time, or could put the whole machine to sleep or disconnect the network. None of that is a real concern in a NES game; nothing else is fighting for your resources, the entire CPU, graphics, and audio systems are your own, and if the user resets, it's because they want to restart the game. Battery-backed games can find the need to care about saving and restoring state, but simple games don't worry.
Pointer arithmetic is different from data arithmetic: In almost all modern programming on a system of at least the complexity of a CPU, the width of data and a memory address are the same. All general-purpose-computer architectures in this half-decade are using 64-bit address space and 64-bit arithmetic (at least!). And addresses are unlikely to get larger (2^64 is enough indexes to assign a unique number to every grain of sand on every beach on Earth, with a bit left over... that "bit" being "a second Earth's beaches"). It's extremely convenient for memory addresses to be smaller than (or the same size) as numbers the CPU can do math on; that means "pointer arithmetic" is just regular arithmetic.

The 6502-series microprocessor in the NES uses 8 bits for arithmetic and 16 for addresses. Address math requires setup, relative to regular math. And the CPU has a whole set of special (slower) opcodes to allow for doing operations on memory locations referenced by 2-byte "pointers"---as long as those pointers live at a memory address between #x00 and #xFF. Fortunately, co2 abstracts much of this with special pointer functions.
Global variables are best practice: In big systems applications, context and local state is preferred over global variables because "global" means "anything can read or write it." That gets messy fast when you have multiple components touching multiple pieces; it quickly becomes a system where nobody knows what's going on.

In a small system like a NES game, we don't have the luxury of context. The stack is extremely shallow and intended mostly for remembering subroutine addresses. And as we've seen, scrubbing back and forth across memory takes extra setup and precious CPU cycles. So for many things (especially game state and position of things on screen), global variables rule the day. There's no reason to pass a pointer down four or five levels of a call stack to reference the beginning of sprite memory if every part of the code that works with sprites just knows the address of the first one!

Unfortunately, this best practice harms one thing: code re-use. Keeping state local makes it easier to copy-and-paste a chunk of code without dangling references. This matters a lot less in the NES world (not much code can be sensibly copy-and-pasted, since every game is so different), but it does matter.

Worth noting is that this best practice doesn't imply we never use local state. Like all best practices, it has exceptions; utility subroutines that could be called from many places are harder to use if they need to use global variables to "pass arguments." The co2 language has a great "compiled stack" feature to address this, which can intelligently carve up spare memory at compile-time into slices that are mutually-exclusive, so they can be re-used by different subroutines. You get the benefits of modern heap-allocated memory without heap management.

What comes next?

In subsequent posts, I'll talk a little bit about the mini-language I put together to craft songs and the hack I added to account for lag as we tried playing the game using Kosmi.

Tuesday, September 8, 2020

More NES Roms (and languages!): Writing Handwave in co2

Another NES game! This time, I wanted to make a cooperative music game in a similar vein to Rock Band. Handwave is a game for up to eight or sixteen players (sharing 4 controllers). As notes scroll across the screen, you push your button to sound your note, like in a handbell choir. No scoring; you know you're "winning" when the song sounds good.

Ding

For this game, I changed up my tools just a bit: I changed the emulator I tested against and changed the language and tooling I used to write the game. This post talks about those changes; subsequent posts will talk a bit about the difference between the new language and nbasic, break down the code, go into detail about the mini-language I wrote to make the music, and talk a bit about the challenges I faced trying to make a multiplayer music game work over a network.

FCEUX emulator

FCEUX is an emulator that is popular in the TAS (tool-assisted speedrun) crowd. It sports several features over the one I previously used, Nestopia: the main ones I care about are the ability to reload the ROM completely without closing and re-opening it (ctrl-F1) and a memory watcher / debugger / editor, which proved useful for tracking down some hard-to-see bugs. It's also the basis code for the NES emulator that runs in-browser on the Kosmi website, which is where we play NES games.

One issue I ran into is that the link one gets from the FCEUX main site isn't great; it's hosting a many-years-old version that has some known bugs (the one that bit me is that it mis-maps controller inputs 3 and 4). The best source for working copies of the emulator is nightlies hosted on GitHub.

co2 language

co2 is a racket-derived language and build tool that generates asm6-compatible assembly. It's in the LISP family of languages, which makes it quite a bit more distant from raw assembly than the nbasic I've used previously. That having been said, I really enjoyed it and can give it a strong recommendation, even in its current slightly-buggy state. Here's what I encountered that I'm most happy about.

Pros

clean function abstraction: co2 offers more abstractions atop the machine code than I found in nbasic (or, obviously, in rolling raw 6502 assembly). The language relies heavily on subroutines (created with (defsub)), which can take (and return) multiple arguments. This is a significant departure from the nbasic way, which was arguments passed by developer's choice of standard.
local variables: co2 offers local variables via a system of intelligently auto-assigning free memory in such a way that no call hierarchy "stomps on" the memory used by any caller in the call chain. It pulls this trick off via a method called "compiled stack," which precludes recursive calls but otherwise works (and operates statically; each subroutine knows at compile time what RAM is safe to use for local variables, so you're not paying the overhead of stack manipulation or some kind of heap manager on a system with meager CPU cycles to devote to such a thing). For me, this feature was killer; I didn't have to manually keep track of the bytes used by individual functions and worry about whether they'd clobber each other.
standard library support: Relative to nbasic, the set of basic NES PPU and RAM manipulation features in co2 is fairly robust. There are utility methods for setting pieces of sprite data (set-sprite-id!, set-sprite-y!, etc.), RAM memset and memcpy utilities, complex racket-derived flow control constructs (like cond and while), and even the ability to do some limited 16-bit manipulations (mostly supported at compile-time).
racket: I'm sure some people will put this in the "cons" category, but I'm very comfortable with LISP-family languages, and racket adds some great features atop the basic LISP model that make it very attractive for writing domain-specific languages. I was able to follow the pattern of co2's design to derive my own micro-compiler for a music mini-language that simplified programming the game's music.

Cons

bugs: I tripped over a few, enumerated later.
documentation: It's incomplete. The team is extremely receptive to donations of more docs (and if I get time, I plan to add a few more). But I did have to learn enough racket to read the source code so I could understand what some functions and standard library routines did. I have some unanswered questions as a result (such as how to use defbuffer and whether it can be used with poke!).
runtime: While it didn't bother me, the environment co2 imposes on your game does create a runtime cost; certain memory locations are reserved for co2's own use. This could introduce challenges if you find yourself needing to squeeze out every CPU cycle, or if you try and copy-paste assembly code that makes assumptions about memory layout. Personally, I think the benefits far outweigh the cost of having someone dictate my memory layout.

Bugs

There are a few bugs in co2's implementation. Note: these are all observed as of this commit and might be fixed in future versions).

Some constructs should work but don't, for example,
(peek animations (+ 1 (<< cur-anim-frame 1)))
fails to compile with "ERROR arg->str: (<< cur-anim-frame 1)", which appears to be a sub-error due to the compiler failing to stringify the (<< etc) list.
The compiler can emit code the assembler can't handle (for example, "branch out of range" failure for large if statements). The solution to that one is to bundle the too-large expression into a subroutine.
Attempting to (<< constant variable) is an error, but the error is "assert failed: variable #<procedure:number?>". This one makes sense in the sense that in the underlying 6502 assembly, left- or right-shifting by a variable isn't a supported feature (i.e. both operators only take a literal for number of bits to shift), but it's a bit surprising that the language doesn't abstract that detail. I went ahead and wrote a variable-amount bit-shifter in the game to compensate.
Mixing multi-return with a subroutine call doesn't work (i.e.
(set! bar 1)
(return foo bar)
is fine, but
(return foo (calculate-bar args))
gets dicey)
(loop n a b body) accepts literals and variables as a or b but not expressions that evaluate to a value. If you store the result of the expression evaluation to a temporary variable, you're fine.

Future development

Handwave isn't done. Some features I'd like to add in the future:

Support for all five sound channels on the NES (currently, it only uses the two square waves)
The ability to "put down and pick up" a "wave" (i.e. change voicing rules for one of the notes mid-song)
Letting a wave "ring" vs. "damping" (i.e. holding a button down to keep a note sounding; makes the most sense for sawtooth sounds, which tend to run continuously)
Some sort of scoring
Improvements to the "netcode" for multiplayer over an environment with latency

Conclusions

I really enjoyed putting this game together, and the excuse to play around in a very unique LISP-family language. If you'd like to play around with (or play) the game, give it a try! The ROM and source are available on GitHub.

Have fun!

Thursday, August 6, 2020

Petris code and using nbasic

A few more thoughts about creating a NES game with nbasic (see previous story http://fixermark.blogspot.com/2020/08/petris-creating-nes-game.html for the game itself).

Code to read state of the NES controllers

I chose to use nbasic for this game because I had experience with it in the past; in college, I participated in a student-taught course run by the author of nbasic, which was excellent. I still use the course's site as a reference for "toe-dipping" into NES programming (along with the vast nesdev wiki). I and a team of 2 others created a small Zelda knock-off that is, unfortunately, lost to history. But using nbasic was a positive experience, so I sought it out again when I got the bug in my brain to try and write another NES game.

I'll talk a bit about the structure of Petris's code, then about my experience working with nbasic, and my plans moving forward.

A Dive Into Petris's Code

Roughly speaking, Petris is divided into initialization, game phase loops, game logic code, PPU control code (i.e. render logic), a big chunk of raw data to control what is shown on screen, and a mandatory footer. The header and footer are still cargo-cult code for me; I started from the Starter Source on the nbasic site and expanded out from there (leaning on examples also found on the course site as I went).

Initialization

Initialization begins at the start: label, where we switch off the PPU, do some things that are heavily recommended as best-practice, and configure the PPU as we wish to use it later. Disabling the PPU puts nothing on the screen, but guarantees draw behavior won't interfere with heavy-duty memory manipulation for the PPU, as the CPU and PPU both use the same latches for manipulating memory. It's useful for setup.

We then declare a set of arrays of various sizes. Nbasic allows for arrays to be declared that can be fit anywhere, or at specific locations (this is useful for naming dedicated memory addresses), or in "zeropage" memory (it's a quirk of the 6502 CPU design that RAM with a 0x00 high-order-byte is faster to access, because special opcodes can be used to fetch it with a one-byte address identifier). The design of nbasic doesn't lend itself super-well to structures, so instead of, for example, having a structure encapsulating the current controller values, values from the previous frame, and "rising edge buttons" (i.e. button-down events on this frame), I have three separate arrays, each of length 4 for four controllers.

The last step, at initgame:, is prepping up game-state logic: timers, sprite initial state (position and animation frame), scores, and a playing flag indicating the game is in progress. Then a bunch of subroutines are called to draw the game into PPU memory (since it's a simple one-screen game, there isn't any scrolling and scroll logic is used just to flip between the main game screen and the "Player wins!" screen).

Game Phase Loops

The game has, broadly, four phases: wait for start, countdown to play, play game, and player-win screen. A series of four blocks of code handles these. Each is structured basically the same way:

Wait for NMI ("non-maskable-interrupt", the signal used for, among other things, telling the CPU that the PPU is entering vertical-blanking mode and it's safe to mess with PPU memory)
Update the PPU to move things around on the screen
Use the vwait_end subroutine to wait for the PPU to start drawing things on screen, which is a good time to run game logic
Do game logic

Game logic varies from phase to phase: some just run counters, some ask for updates from controllers. The playing flag is set while the main-game phase is running, which other subroutines can use to decide if we're actually in the game phase.

Game Logic Code

Various subroutines in this area handle things the game needs to do to maintain game state, including reading controllers, updating scores, and updating logic that drives animations. Some of these subroutines take input in the form of assuming nbasic named variables are set (I could possibly have used the stack for this, but see "Using nbasic" for why I chose not to). They also use named variables to maintain internal state, which caused me to trip over a bug in nesasm; nesasm tops out at 32 characters for any label (a simple buffer-overrun protection), but it handles long labels by ceasing to read characters, not by throwing an error. This approach is memory-wasteful, in that each subroutine with internal state is going to take a couple bytes of memory that nothing else can use (but the alternative is coming up with a clever protocol for re-using bytes, which risks having two subroutines "cross talk" and tap-dance on each other's internal state).

Of particular note here is load_joysticks: and detect_edge:, which populates joy_edge, an array of bits representing which buttons are freshly-pressed this detection cycle. This is needed for distinguishing new taps from buttons held down (scoring is done at the rate of 1 point per fresh tap, to get that old-school "hammer on the controller for more points" feel).

This is also where animate lives, which is a simple routine for driving "animations" in the form of x and y offsets to a sprite over time. Each sprite's position is controlled by two value sets: the actual displayed x and y position (which lives in the sprite_mem array and is flashed into the PPU every redraw) and the logical x and y position (which live in sprite_x and sprite_y). The sprite_anim_idx value is an offset into an animation data field that drives animation; every frame, the sprite's x and y offset are read from animation at sprite_anim_idx, added to sprite_x and sprite_y, and the result stored to the on-screen position in sprite_mem. Then sprite_anim_idx is rolled forward, and if it points to the sentinel value 0x80, the animation is done. This is the beginning of a mechanism that could allow for more complex animations (for example, we could add changing the displayed CHR for the sprite, or an additional sentinel value could be added to create looping animations).

Squirreled away under the PPU control code and before the raw data is a bit more game logic that includes a shamelessly-stolen randomizer. The NES itself doesn't have a "true" source of entropy (i.e. there's no built-in battery-backed clock, so without battery-backed memory, the NES's operation is fully deterministic from powerup based on code and controller inputs). The game uses the time between start and player 1 pressing 'A' to start the game to seed a random number generator driving where the dog wants to be pet.

PPU Control Code

This code controls moving things to the PPU for display. Huge, repetitive chunks of this code are loading and storing slabs of images on screen. There's plenty of room for improvement, and I sort of got better over time as I had to write more (for example, I'm not at all proud of load_dog in its current form, which is split up to account for the limitations of CPU speed to make room for screen redraws between loading chunks of the dog). In the dog-loading code, I also accounted for a limitation of the 6502 CPU architecture: when loading data from an absolute memory offset, the 8-bit math only allows at most 256 bytes of data to be addressed from one absolute offset. There are much better ways to design this (the 6502 also allows "indirect indexed" and "indexed indirect" memory access, which reads and writes values based on two bytes in zero-page memory and is the closest thing to modern pointer arithmetic the 6502 offers). The nbasic language itself doesn't have a syntax to take advantage of this feature (as far as I could tell), and I shied away from using direct assembly (for reasons I describe in "Using nbasic"), so I repeated a lot of simple "ripper logic" loops for the sole purpose of loading the same kind of thing from different points in memory.

Raw Data

The wide blocks of raw data are mostly used to map objects in the game to their positions on screen and the CHR tiles that make up their images. Generally, these come in two flavors:

starting PPU Y and X address, first CHR, and how many CHRs make up the image (mostly for text)
starting PPU Y and X address and a list of CHR bytes making up the image horizontally (for big things like the dog)

Finally, the game palette is configured to give some nice colors for the dog and four distinct player colors for the hands.

Were I to write this all over again, indirect-indexed addressing could really cut down on the amount of repeititon here, as well as in the PPU control code (possibly at the cost of speed, since the indirect CPU operations cost more clock cycles... But this is not a game in need of high-performance tuning!).

Using nbasic

Overall, I found nbasic to be relatively straightforward and much preferred to raw assembly-bashing; it makes some things easier and a few things maybe a little harder

Pros

The language really simplifies some things that would require verbose proper form in assembly. In particular, abstracting indirect memory access as arrays and simple arithmetic as prefix-based expressions makes that logic much easier to read than gleaning the meaning of blobs of assembly. I found I would quickly fall into a "flow" of bashing out new segments of the code.
The language isn't far abstracted from the underlying assembly; in places where it did something surprising, taking the compiler's assembly dump and matching statements up against the statements in nbasic was very easy. For something like this, where individual bytes can matter (for both space and CPU cycles), that lack of indirection on the abstraction is helpful.

Cons

I tripped over a couple of bugs in the implementation. One I was able to push a fix for; the others were more challenging to reproduce and included an issue mixing binary, hex, and decimal representations of numbers in one data statement. There are also a few cases where nbasic will emit code that doesn't pass the assembler (for example, the branchto operator doesn't confirm whether or not the branch distance is too far, trusting the assembler to know). These issues were minor and relatively easy to work around.
While nbasic allows users to touch the three 6502 registers directly (i.e. a, x, and y refer to the actual registers), the way they interact with nbasic code is ill-specified and very implementation-dependent. In practice, I found it safest to use assembly directly when I manipulated those variables.
There is no concept of functions or local state in nbasic. This could be considered both a pro and a con; the NES architecture is small, and cycles matter. Managing internal state like that costs resources. However, a lightweight function or local-state abstraction would be nice-to-have.
As far as I can tell, nbasic doesn't really allow for "full" pointer arithmetic; memory can be accessed by offset from absolute addresses, but accessing it via "the 2-byte address stored at this address" seems to be nonexistent (unless I missed it).
Partially because of the difficulty of doing pointer arithmetic, one of the bits of toil coding in this environment is repeating functions with minor tweaks, such as changing the labeled input. The ability to do templates or macro expansions would be valuable for decreasing the amount of repetition needed in the code.

Possible Future Extensions

If I were to extend nbasic, the main thing I'd add is a function-call operation. The 6502 has a stack abstraction that makes a push-call-pop argument-passing interface extremely possible. It could be a fun project to extend nbasic in this way. Local state could be a bit trickier, but coupled wit h function calls, the notion of a local variable can be managed by pushing them to the stack when a call occurs.

An abstraction for pointer arithmetic would also be valuable. I can imagine doing it by providing a simple abstraction for setting two subsequent bytes of zero-page memory to a labeled address, then allowing the [array offset] accessor to reference them. It's possible one would even want those two-byte zero-page values to have a special name (maybe a new pointer declaration, like the existing array declaration) to denote their purpose?

What is next?

There are other languages available for programming the 6502, and I'm interested in checking them out. For my next project, I plan to explore the CO2 language, which is a LISP-based system created to build the game "What Remains." I'll be interested to see how a more abstract language changes the game of making the game.