I think it is literally by the colors of the screenshots. Nothing to do with the contents.
> I just want to encode the high level aesthetic details of webpage screenshots. Because of this, I fell back on an old friend: the triplet loss on top of a small encoder. The resulting output dimension of 64 afforded ample room for describing the visual range while maintaining a considerably smaller footprint.