I'm not very familiar with this layer of things; what does it mean for a GPU to drive a boot sequence? Is there something massively parallel that is well suited for the GPU?
> what does it mean for a GPU to drive a boot sequence
It's a quirk of the Broadcom chips that the Raspberry Pi family uses: the GPU is the first bit of silicon to power up and do things. Using the GPU specifically is a bit unusual, but the general idea of "smaller thing does initial bring-up, then powers up $main_cpu" is common once $main_cpu is roughly powerful enough to run Linux.
The Raspberry Pi contains a Videocore processor (I wrote the original instruction set coding and assembler and simulator for this processor).
This is a general-purpose processor which includes 16-way SIMD instructions that can access data in a 64-by-64-byte register file as either rows or columns (and as 8-, 16-, or 32-bit data).
It also has superscalar instructions which access a separate set of 32-bit registers but are tightly integrated with the SIMD instructions (as with ARM Neon or x86 AVX).
This is what boots up originally.
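To picture the row-or-column access described above, here is a toy Python model. It is only a sketch: the helper names `read_row` and `read_col` are made up for illustration, it models 8-bit elements only, and it says nothing about the real Videocore instruction encoding.

```python
# Toy model only: this is NOT the real Videocore instruction set.
SIZE = 64    # register file is 64 x 64 bytes (from the description above)
LANES = 16   # 16-way SIMD (from the description above)

# Fill the file so the byte at (row, col) holds (row * SIZE + col) % 256.
regfile = [[(r * SIZE + c) % 256 for c in range(SIZE)] for r in range(SIZE)]


def read_row(row, col):
    """Hypothetical helper: read LANES bytes going across a row."""
    return [regfile[row][col + i] for i in range(LANES)]


def read_col(row, col):
    """Hypothetical helper: read LANES bytes going down a column."""
    return [regfile[row + i][col] for i in range(LANES)]


if __name__ == "__main__":
    print(read_row(0, 0))  # 16 neighbouring bytes from row 0
    print(read_col(0, 0))  # 16 bytes from column 0, one per row
```

Being able to read the same data either way is handy for 2D work such as a separable transform, where one pass runs along rows and the next along columns, without an explicit transpose in between.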
Videocore was designed to be good at the actions needed for video codecs (e.g. motion estimation and DCTs).
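For readers unfamiliar with motion estimation, here is a minimal Python sketch of block matching using the sum of absolute differences (SAD). SAD with an exhaustive search is just one common textbook approach, chosen here as an assumption; nothing above says which metric or search strategy the Videocore codecs used, and the function names are hypothetical.

```python
def sad(block_a, block_b):
    """Sum of absolute differences between two equally sized 2D blocks."""
    return sum(
        abs(a - b)
        for row_a, row_b in zip(block_a, block_b)
        for a, b in zip(row_a, row_b)
    )


def get_block(img, top, left, size):
    """Cut a size x size block out of a 2D list of pixel values."""
    return [row[left:left + size] for row in img[top:top + size]]


def best_motion_vector(cur, ref, top, left, size, search):
    """Exhaustively search ref within +/- search pixels for the block that
    best matches the block of cur at (top, left).
    Returns (cost, dy, dx), or None if no candidate fits inside ref."""
    target = get_block(cur, top, left, size)
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            # Skip candidates that fall outside the reference frame.
            if y < 0 or x < 0 or y + size > len(ref) or x + size > len(ref[0]):
                continue
            cost = sad(target, get_block(ref, y, x, size))
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best
```

The inner loop is the same subtract, absolute value, accumulate pattern applied to many pixels at once, which is the kind of work wide SIMD instructions are good at.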
I did write a 3D library that could render textured triangles using the SIMD instructions on this processor. This was enough to render simple graphics, and I wrote a demo that rendered Tomb Raider levels, but only at a small frame resolution.
The main application was video codecs, so for the original Apple Video iPod I wrote the MPEG-4 and H.264 decoding software using the Videocore processor, which could run at around QVGA resolution.
However, in later versions of the chip we wanted more video and graphics performance. I designed the hardware to accelerate video, while another team (including Eben) wrote the hardware to accelerate 3D graphics.
So in Raspberry Pis, there is both a Videocore processor (which boots up and handles some tasks), and a separate GPU (which handles 3D graphics, but not booting up).
It is possible to write code that runs on the Videocore processor. On older Pis I accelerated some software video codecs by using both the GPU and the Videocore to offload bits of the transform, deblocking, and motion compensation, but on later Pis there is dedicated video decode hardware to do this instead.
Note that the ARM cores on the later Pis are much faster and more capable than before, while the Videocore processor has not been developed further, so there is not really much use for the Videocore anymore. However, the separate GPU has been developed more and is quite capable.