To get a single knowledge-cutoff they spent 16.5h wall-clock hours on a cluster of 128 NVIDIA GH200 GPUs (or 2100 GPU-hours), plus some minor amount of time for finetuning. The prerelease_notes.md in the repo is a great description on how one would achieve that
To get a single knowledge-cutoff they spent 16.5h wall-clock hours on a cluster of 128 NVIDIA GH200 GPUs (or 2100 GPU-hours), plus some minor amount of time for finetuning. The prerelease_notes.md in the repo is a great description on how one would achieve that