I've read through most of the first paper mentioned.
Here, the authors set up two synthetic experiments in which transformers have to learn the probability of observing events sampled from a "ground truth" Bayesian model. If the probability the transformers assign to the event space matches the Bayesian posterior predictive distribution, the authors infer that the model is performing Bayesian inference on these tasks. They then use this to argue that transformers perform Bayesian inference in general (belief propagation across layers).
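In symbols (my paraphrase of the criterion, not the paper's exact notation): for a task with latent parameters $\theta$ drawn from a prior $p(\theta)$, the comparison target is the posterior predictive

$$
p(x_{t+1} \mid x_{1:t}) = \int p(x_{t+1} \mid x_{1:t}, \theta)\, p(\theta \mid x_{1:t})\, d\theta,
$$

and the test is whether the transformer's next-token distribution $q_\phi(x_{t+1} \mid x_{1:t})$ is close to this quantity across prefixes.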
The transformers are trained on thousands of different "ground truth" Bayesian models, each randomly initialized, which means there is no underlying signal to be learned besides the belief-propagation mechanism itself. This makes me wonder whether any sufficiently powerful maximum-likelihood model would meet this criterion of "doing Bayesian inference" in this scenario.
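Here is a minimal sketch of why I suspect this (my own toy setup, not the paper's): when training data comes from many freshly sampled "ground truth" models, the predictor that minimizes expected log loss is exactly the posterior predictive, so even a model with no Bayesian machinery, fit purely by maximum likelihood, converges toward it. The Beta-Bernoulli choice and all hyperparameters below are assumptions for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy version of the training regime: each "task" has its own theta ~ Beta(1, 1),
# and a sequence of coin flips is sampled from it. We fit a purely maximum-likelihood
# predictor with one free logit per (sequence position, #heads) state -- no Bayesian
# machinery anywhere -- by SGD on next-flip cross-entropy across many tasks.
SEQ_LEN = 8
N_TASKS = 100_000
lr = 0.05

# logits[n, k] parameterizes P(next flip = 1 | seen n flips, k of them heads)
logits = np.zeros((SEQ_LEN, SEQ_LEN + 1))

for _ in range(N_TASKS):
    theta = rng.beta(1.0, 1.0)            # fresh "ground truth" model per task
    flips = rng.random(SEQ_LEN) < theta
    heads = 0
    for n, x in enumerate(flips):
        p = 1.0 / (1.0 + np.exp(-logits[n, heads]))
        logits[n, heads] -= lr * (p - x)  # gradient of the NLL w.r.t. the logit
        heads += int(x)

# The maximum-likelihood solution should approach the Beta(1,1)-Bernoulli posterior
# predictive (k + 1) / (n + 2), even though nothing in the model "knows" about Bayes.
for n, k in [(2, 1), (4, 4), (7, 2)]:
    learned = 1.0 / (1.0 + np.exp(-logits[n, k]))
    bayes = (k + 1) / (n + 2)
    print(f"n={n}, heads={k}:  learned={learned:.3f}  posterior predictive={bayes:.3f}")
```

In this toy setup the per-state predictor plays the role of the transformer: it matches the posterior predictive not because of any architectural bias, but because that is the loss-minimizing answer when every task is drawn from the prior.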
The transformers in this paper do not perform inference simply because they are transformers. They perform inference because the optimal solution to the problems in these experiments is precisely to do inference, and transformers are expressive enough to model belief propagation. I find it hard to extrapolate from this that the same thing is happening in LLMs, for example.