> possibly have two of them on one board.
That would mean a NUMA setup, and memory bandwidth for any compute that has to cross between the chips would probably suck. Would that even beat a simple cluster on performance?