> First of all, the "difference" between P and Q would be the same independently of whether P, Q, or some other distribution is the "true" distribution.
I don't think this is the case in general: in D_{KL}(P||Q) the log probability ratio is weighted by P(x), whereas in D_{KL}(Q||P) it's weighted by Q(x).
So let's think it through with an example. Say P is the true distribution of English word frequencies and Q is the output of a model that's attempting to estimate it.
Say the model overestimates the frequency of some uncommon word (e.g. "ancillary"). In D_{KL}(P||Q) that word's term is weighted by P(x), the actual frequency, which is small, so the error contributes little and the divergence stays small. But the model thinks the frequency of that word is high, so in D_{KL}(Q||P) the term is weighted by Q(x), the model's estimated frequency; the error gets weighted heavily and D_{KL}(Q||P) comes out large.
That's why it's not symmetric: the expectation is taken under the first distribution, so the "direction" of the error matters.
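A quick numeric sketch of the point above (the word frequencies here are made up for illustration, not real English statistics): Q overestimates the rare word "ancillary", and the two KL directions weight that error very differently.

```python
import math

def kl(p, q):
    """D_KL(p || q) = sum over x of p(x) * log(p(x) / q(x))."""
    return sum(p[x] * math.log(p[x] / q[x]) for x in p)

# Hypothetical word frequencies: P plays the "true" distribution,
# Q is a model that badly overestimates the rare word "ancillary".
p = {"the": 0.70, "cat": 0.29, "ancillary": 0.01}
q = {"the": 0.60, "cat": 0.20, "ancillary": 0.20}

print(f"D_KL(P||Q) = {kl(p, q):.4f}")  # error weighted by P(x) = 0.01, so small
print(f"D_KL(Q||P) = {kl(q, p):.4f}")  # same error weighted by Q(x) = 0.20, so large
```

Running this gives roughly 0.19 for D_{KL}(P||Q) versus 0.43 for D_{KL}(Q||P): the same pointwise error, weighted by different distributions, yields very different divergences.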
You misunderstood what I was saying. I was not suggesting that the KL divergence is symmetric. I was saying that it would be symmetric (and independent of which distribution is the "true" one) if it were interpreted as the quoted measure of "difference" between two distributions. So that proposed interpretation is wrong.