It gets way weirder.
The TMS9900 didn't have any internal data registers. It only had a program counter, a status register, and a workspace pointer. Instead, it put the "registers" in that same 256 bytes of RAM. There were sixteen 16-bit registers which the workspace pointer pointed to.
The original idea was that this made for fast context switches: instead of dumping all the registers to a stack (it doesn't even have a stack pointer), you just update the workspace pointer to point at a new set. But I have to assume this wasn't really used on the TI-99/4A, as there just wasn't enough RAM. Because your only other RAM was locked behind the video controller, that 256 bytes had to contain all your registers, any of your dynamically loaded code, and any data you wanted rapid access to.
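To make the "registers are just RAM" idea concrete, here's a rough sketch in Python. It is not taken from any real emulator: the `Machine` class, its method names, and the workspace addresses are all made up for illustration. The only facts it leans on are that register Rn is the 16-bit word at WP + 2*n and that the TMS9900 stores words big-endian.

```python
# Minimal sketch of TMS9900-style workspace registers living in ordinary RAM.
# Class/method names and the addresses below are illustrative assumptions.

class Machine:
    def __init__(self, mem_size=0x10000):
        self.mem = bytearray(mem_size)  # flat, byte-addressed memory
        self.wp = 0                     # workspace pointer (an even address)

    def read_word(self, addr):
        # Big-endian: high byte lives at the lower address.
        return (self.mem[addr] << 8) | self.mem[addr + 1]

    def write_word(self, addr, value):
        self.mem[addr] = (value >> 8) & 0xFF
        self.mem[addr + 1] = value & 0xFF

    def reg_addr(self, n):
        # Register Rn is simply the word at WP + 2*n.
        return (self.wp + 2 * n) & 0xFFFF

    def get_reg(self, n):
        return self.read_word(self.reg_addr(n))

    def set_reg(self, n, value):
        self.write_word(self.reg_addr(n), value)


m = Machine()
m.wp = 0x8300             # task A's workspace (address chosen for illustration)
m.set_reg(0, 0x1234)
m.wp = 0x8320             # "context switch": just point WP at another workspace
m.set_reg(0, 0xBEEF)
m.wp = 0x8300             # switch back; task A's R0 was never touched
print(hex(m.get_reg(0)))  # 0x1234
```

Note that each workspace eats 32 bytes, so a 256-byte scratchpad holds eight of them at most, with nothing left over for code or data.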
The TMS9900 is weird, because it's the only CPU of the early home computer era that wasn't designed for microcomputers. It's actually an implementation of the TI-990 minicomputer on a single chip, and it was used in later versions of the minicomputer. Those minicomputers had more than enough fast 16-bit memory to take advantage of this fast context switching.
Every other commonly used microprocessor of the 70s (8080, 6800, F8, 6502, RCA1802, Z80, 6809, 8086, 68000) was explicitly designed to target the low-cost microcomputer market.
If you disassembled the ROMs you'd find they're filled with BLWP/RTWP instructions. That's how subroutines were executed: Branch and Load Workspace Pointer. The BLWP instruction would load the WP and PC from a two-word vector named by its operand and save the current WP/PC/SR into the new R13/14/15. RTWP would restore R13/14/15 into WP/PC/SR. The end result was a stack implemented as a linked list instead of a contiguous array. A lot of the subroutines in ROM just read/wrote tokens from/to the VRAM and then interpreted them as BASIC.
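Here's roughly what that BLWP/RTWP pair does, as a Python sketch. The `CPU` class and all the addresses are hypothetical. It also assumes `pc` already points at the instruction after the BLWP when the call happens, and it ignores everything else the real chip does (interrupt handling, cycle timing).

```python
# Sketch of BLWP/RTWP semantics on a flat 64K memory.
# CPU, read_word/write_word and the addresses are illustrative assumptions.

mem = bytearray(0x10000)

def read_word(addr):
    return (mem[addr] << 8) | mem[addr + 1]   # big-endian word

def write_word(addr, value):
    mem[addr] = (value >> 8) & 0xFF
    mem[addr + 1] = value & 0xFF

class CPU:
    def __init__(self, wp, pc, st=0):
        self.wp, self.pc, self.st = wp, pc, st

    def blwp(self, vector):
        # The vector holds the callee's workspace address, then its entry point.
        new_wp = read_word(vector)
        new_pc = read_word(vector + 2)
        # Save the caller's context into the callee's R13, R14, R15.
        write_word(new_wp + 26, self.wp)   # R13 <- old WP
        write_word(new_wp + 28, self.pc)   # R14 <- old PC (return address)
        write_word(new_wp + 30, self.st)   # R15 <- old status
        self.wp, self.pc = new_wp, new_pc

    def rtwp(self):
        # Read all three links before WP changes underneath us.
        old_wp = read_word(self.wp + 26)
        self.pc = read_word(self.wp + 28)
        self.st = read_word(self.wp + 30)
        self.wp = old_wp

VECTOR = 0x2000
write_word(VECTOR, 0xA020)       # callee workspace
write_word(VECTOR + 2, 0x3000)   # callee entry point

cpu = CPU(wp=0xA000, pc=0x1004, st=0x2000)
cpu.blwp(VECTOR)
print(hex(cpu.wp), hex(cpu.pc), hex(read_word(cpu.wp + 26)))  # 0xa020 0x3000 0xa000
cpu.rtwp()
print(hex(cpu.wp), hex(cpu.pc), hex(cpu.st))                  # 0xa000 0x1004 0x2000
```

Nothing gets pushed anywhere: the "return frame" is just three words inside the callee's own workspace, which is why the call chain ends up looking like a linked list.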
So before people start saying "OMG! Memory-to-Memory architectures are so slow! What a stupid idea!" allow me to remind people that back in the early 70s when the 960 was turning into the 990, external bipolar memory was faster than on-chip NMOS static RAM. And since the 960 and 990 were originally implemented with a weird collection of ASICs, discrete parts and 7400-series logic chips, the idea that you would just drop a bipolar part in the design wasn't that weird of an idea. But then as the 990 evolved and TI built a single chip implementation, they retained the memory-to-memory architecture for software compatibility reasons. So yeah... ultimately... CMOS logic got faster than bipolar memory and in retrospect it wasn't the greatest design. But at the time it wasn't THAT bad of an idea. And yeah, it did make task switching very fast. But don't get me started on serial IO off the CPU. And dang, what a great, largely orthogonal instruction set. I sometimes fire up the Editor/Assembler on my 99/4 emulator just to play with it.
Anywho... this isn't a critique of the OP or @phire, it's a reminder for the community at large that tech decisions that seem bad in retrospect often had non-idiotic motivations at the time.
I've been working on a TMS99110 homebrew & emulator, and have studied the architecture of the 990 a whole lot over the past couple years. I want to make a very important distinction in a few things you said.
For anyone that didn't get the context, it's the 99/4 design that has this weird RAM layout. The 990 architecture itself can use any (16-bit) word in memory as the starting point of the 16 registers. Developers have been known to use and abuse the workspace pointer to slide around the "window" on the registers.
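The "sliding window" trick falls straight out of the addressing rule (Rn is the word at WP + 2*n). A quick Python sketch, with made-up addresses and helper names, shows that bumping WP by one word renames every register:

```python
# Sketch: moving WP by 2 bytes shifts the whole register window by one.
# Helper names and the 0x0100 workspace address are illustrative assumptions.

mem = bytearray(0x10000)

def read_word(addr):
    return (mem[addr] << 8) | mem[addr + 1]   # big-endian word

def write_word(addr, value):
    mem[addr] = (value >> 8) & 0xFF
    mem[addr + 1] = value & 0xFF

def reg(wp, n):
    return read_word(wp + 2 * n)

wp = 0x0100
for n in range(16):
    write_word(wp + 2 * n, 0x1000 + n)   # R0..R15 = 0x1000..0x100F

slid = wp + 2                            # slide the window up by one word
print(hex(reg(wp, 1)), hex(reg(slid, 0)))   # 0x1001 0x1001
print(hex(reg(slid, 14)))                   # 0x100f (this was R15 before)
```

After the slide, the new R15 is whatever word happens to sit just past the old workspace, which is where the "abuse" part comes in.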
The window itself also uses the top three registers to link back to the previous workspace, status, and PC, if you use the proper instructions to branch and return. While there is no stack*, you can still crawl back through those references and get the state of each call.
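Crawling back through those links might look something like this Python sketch. One loud assumption: the hardware has no "end of chain" marker, so this version stops when it reaches a known root workspace or hits a depth limit. The function names and addresses are invented for the example.

```python
# Sketch: walking the R13/R14/R15 links to recover a call history.
# link(), backtrace() and all addresses are illustrative assumptions;
# the root_wp / max_depth stop conditions are this sketch's own convention.

mem = bytearray(0x10000)

def read_word(addr):
    return (mem[addr] << 8) | mem[addr + 1]   # big-endian word

def write_word(addr, value):
    mem[addr] = (value >> 8) & 0xFF
    mem[addr + 1] = value & 0xFF

def link(callee_wp, caller_wp, return_pc, status):
    # What a BLWP leaves behind in the callee's workspace.
    write_word(callee_wp + 26, caller_wp)   # R13
    write_word(callee_wp + 28, return_pc)   # R14
    write_word(callee_wp + 30, status)      # R15

def backtrace(wp, root_wp, max_depth=16):
    frames = []
    while wp != root_wp and len(frames) < max_depth:
        caller_wp = read_word(wp + 26)
        frames.append((caller_wp, read_word(wp + 28), read_word(wp + 30)))
        wp = caller_wp
    return frames

ROOT, B, C = 0x0100, 0x0200, 0x0300
link(B, ROOT, 0x4004, 0x0000)   # ROOT called B
link(C, B, 0x4104, 0x2000)      # B called C

for caller_wp, return_pc, status in backtrace(C, ROOT):
    print(hex(caller_wp), hex(return_pc), hex(status))
# 0x200 0x4104 0x2000
# 0x100 0x4004 0x0
```

This only works if every call went through the linking branch and nobody reused a workspace further up the chain, since a reused workspace overwrites its own R13 to R15.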
It's a really cool little architecture, hobbled by the 16-bit address space and how slow it was to keep the registers in RAM. Nowadays I can pick up a 1MB memory chip that's faster than the native bus speed for a few bucks, but that wasn't anywhere near the case in the late 70s and early 80s.
*: The 990/12 minicomputer features the PSHS and POPS instructions, which take a pointer to a definition of where the stack lives and how big it should be. These instructions are not implemented in any production processor, but the platform makes it possible to emulate them transparently in software... as an explicit rather than accidental feature in the later iterations. The 990/12 itself was microcoded on a set of four daisy-chained programmable 4-bit bit slicers, so they didn't need any of that nonsense.