Maybe, but the interesting thing for me is that this only works with specific 'chunks' of the transformer layer stack. Anything more or less than the optimal chunk leads to worse performance.