Do you have a sense of whether these validation loss improvements are leading to generalized performance uplifts? From afar I can't tell whether these are broadly useful new ideas or just industrialized overfitting on a particular (model, dataset, hardware) tuple.
Industrialized overfitting is basically what ML researchers do.
Why set the bar for generalization higher for autoresearch than for the research humans generally do?