As a caveat, smaller batch sizes are generally better for training stability, but we go bigger because larger batches substantially speed up training.
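To make the tradeoff concrete, here is a minimal PyTorch sketch, not the actual setup used here: the model, data, batch size, and base learning rate are all illustrative assumptions. It pairs the larger batch with the linear learning-rate scaling rule (Goyal et al., 2017), one common way to recover stability when the batch size grows.

```python
# Minimal sketch (illustrative, not the setup described above) of trading
# batch size for throughput. All hyperparameters here are assumptions.
import torch
from torch import nn
from torch.utils.data import DataLoader, TensorDataset

# Toy regression data standing in for the real training set.
xs, ys = torch.randn(4096, 32), torch.randn(4096, 1)
dataset = TensorDataset(xs, ys)

BATCH_SIZE = 1024  # larger batch: fewer optimizer steps per epoch
BASE_LR = 1e-3     # hypothetical LR tuned for a reference batch of 256
# Linear LR scaling (Goyal et al., 2017) is one standard mitigation for
# the stability cost of a larger batch.
lr = BASE_LR * (BATCH_SIZE / 256)

model = nn.Linear(32, 1)
optimizer = torch.optim.SGD(model.parameters(), lr=lr)
loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)

# One epoch: with BATCH_SIZE=1024 this is only 4 optimizer steps,
# which is where the wall-clock speedup comes from.
for x, y in loader:
    optimizer.zero_grad()
    loss = nn.functional.mse_loss(model(x), y)
    loss.backward()
    optimizer.step()
```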