They have not; every successful recent pre-training run has shown performance gains greater than what the scaling laws predict.
Those gains are architecture-based, data-quality-based, etc. Scaling laws relate only to data volume and compute, holding other factors constant.
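To make the "holding other factors constant" point concrete, here is a minimal sketch of a Chinchilla-style parametric scaling law (coefficients are the published fits from Hoffmann et al., 2022). Note the predicted loss depends only on parameter count N and training tokens D; architecture and data quality don't appear anywhere in the formula, so improvements from those show up as gains beyond what the law predicts.

```python
def chinchilla_loss(n_params: float, n_tokens: float) -> float:
    """Predicted pretraining loss L(N, D) = E + A/N^alpha + B/D^beta.

    Coefficients are the Chinchilla (Hoffmann et al., 2022) fitted values;
    only data volume and compute-proxy terms appear -- architecture and
    data quality are implicitly held constant.
    """
    E, A, B = 1.69, 406.4, 410.7
    alpha, beta = 0.34, 0.28
    return E + A / n_params**alpha + B / n_tokens**beta

# Doubling tokens at fixed model size lowers the predicted loss only modestly;
# any larger real-world gain comes from factors the law doesn't model.
l1 = chinchilla_loss(70e9, 1.4e12)  # roughly Chinchilla-scale run
l2 = chinchilla_loss(70e9, 2.8e12)  # same model, 2x training tokens
print(l1, l2)
```

The specific run sizes here are illustrative, not taken from the thread.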