Or, they subtracted a digital elevation model from a digital surface model, ran a point-in-polygon match against an existing building dataset, and labelled the difference as the height of the building. No ML needed.
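For what it's worth, here is a rough sketch of that non-ML approach, not a claim about how the product was actually built. It assumes the DSM and DEM are already aligned grids in the same units, that footprints are simple polygons in grid coordinates, and that the median difference is a reasonable per-building height. The names `point_in_polygon` and `building_heights` are made up for illustration.

```python
from statistics import median


def point_in_polygon(x, y, polygon):
    """Ray-casting test. `polygon` is a list of (x, y) vertices."""
    inside = False
    n = len(polygon)
    for i in range(n):
        x1, y1 = polygon[i]
        x2, y2 = polygon[(i + 1) % n]
        # Only edges that straddle the horizontal line through y can be crossed.
        if (y1 > y) != (y2 > y):
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                inside = not inside
    return inside


def building_heights(dsm, dem, footprints, cell_size=1.0):
    """Estimate one height per footprint as the median of (DSM - DEM).

    dsm, dem   : 2D lists of elevations with identical shape (assumed aligned).
    footprints : dict mapping a building id to its polygon, in the same
                 coordinates as the grid (x = column, y = row, times cell_size).
    Returns a dict of building id -> height, or None if no cell centre
    falls inside the footprint.
    """
    heights = {}
    for building_id, polygon in footprints.items():
        samples = []
        for row in range(len(dsm)):
            for col in range(len(dsm[row])):
                # Test the centre of each grid cell against the footprint.
                cx = (col + 0.5) * cell_size
                cy = (row + 0.5) * cell_size
                if point_in_polygon(cx, cy, polygon):
                    samples.append(dsm[row][col] - dem[row][col])
        heights[building_id] = median(samples) if samples else None
    return heights
```

A real pipeline would use a raster library and a spatial index rather than looping over every cell for every footprint, but the logic is the same: difference the two surfaces, then summarise the difference inside each known footprint.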
Other comments said they fed 2D aerial imagery into a transformer and that's it.
There's a notice in the bottom-left corner on desktop that says: "This is a machine-learning-derived product. Errors may occur"