> D(P||Q) = measure of how much our model Q differs from the true distribution P. In other words, we care about how much P and Q differ from each other in the world where P is true, which explains why KL-div is not symmetric.
I don't think this particular interpretation actually makes sense or would explain why KL divergence is not symmetric.
First of all, the "difference" between P and Q would be the same independently of whether P, Q, or some other distribution is the "true" distribution.
For example, suppose we have a coin with P(Heads)=0.4 and Q(Heads)=0.6. The difference between the two distributions is clearly the same irrespective of whether P, Q, or neither is "true". So this interpretation doesn't explain why the KL divergence is asymmetric.
Second, there are plausible cases where it arguably doesn't even make sense to speak of a "true" distribution in the first place.
For example, consider the probability that there was once life on Mars. Assume P(Life)=0.4 and Q(Life)=0.6. What would it even mean for P to be "true"? P and Q could simply represent the subjective beliefs of two different people, without any requirement of assuming that one of these probabilities could be "correct".
Clearly the KL divergence can still be calculated and presumably sensibly interpreted even in the subjective case. But the interpretations in this article don't help us here since they require objective probabilities where one distribution is the "true" one.
To the first point, I think the KL divergence actually is symmetric in this particular case: both directions come out to 0.4 * ln(0.4 / 0.6) + 0.6 * ln(0.6 / 0.4), since swapping P and Q here just swaps the roles of Heads and Tails.
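A quick numeric check of this, as a minimal sketch (the `kl` helper is just a hypothetical name for the standard discrete KL formula):

```python
import math

def kl(p, q):
    """Discrete KL divergence D(p||q) in nats: sum of p_i * ln(p_i / q_i)."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

# Coin example from above: P(Heads)=0.4, Q(Heads)=0.6
P = [0.4, 0.6]  # [Heads, Tails]
Q = [0.6, 0.4]

print(kl(P, Q))  # 0.4*ln(0.4/0.6) + 0.6*ln(0.6/0.4) = 0.2*ln(1.5)
print(kl(Q, P))  # the same value: swapping P and Q just relabels Heads and Tails
```

This symmetry is a coincidence of the two-outcome case with mirrored probabilities; it does not hold for distributions in general.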
Still, there's no avoiding the inherent asymmetry in KL divergence. To my mind, the best we can do is to say that from P's perspective, this is how weird the distribution Q looks.
> First of all, the "difference" between P and Q would be the same independently of whether P, Q, or some other distribution is the "true" distribution.
I don't think this is the case in general, because D_{KL}(P||Q) weights the log probability ratio by P(x), whereas D_{KL}(Q||P) weights it by Q(x).
So let's think it through with an example. Say P is the true frequency distribution of English words and Q is the output of a model that's attempting to estimate it.
Say the model overestimates the frequency of some uncommon word (eg "ancillary"). D_{KL}(P||Q) weights that word's log ratio by P(x), the small true frequency, so its contribution to the divergence is small. But D_{KL}(Q||P) weights it by Q(x), the model's inflated estimate, so the same error is weighted heavily and D_{KL}(Q||P) comes out large.
That's why it's not symmetric - each divergence weights by its first distribution, so the "direction" of the error matters.
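The word-frequency example above can be made concrete with made-up numbers: a hypothetical three-word vocabulary where the model Q overestimates the rare word "ancillary". The specific frequencies are illustrative assumptions, not real corpus statistics.

```python
import math

def kl(p, q):
    """Discrete KL divergence D(p||q) in nats over a shared vocabulary."""
    return sum(p[w] * math.log(p[w] / q[w]) for w in p)

# Hypothetical frequencies: P is the "true" distribution,
# Q badly overestimates the rare word "ancillary".
P = {"the": 0.899, "ancillary": 0.001, "cat": 0.100}
Q = {"the": 0.700, "ancillary": 0.200, "cat": 0.100}

print(kl(P, Q))  # modest: the error on "ancillary" is weighted by P(x) = 0.001
print(kl(Q, P))  # much larger: the same error is weighted by Q(x) = 0.2
```

Swapping the arguments changes which distribution supplies the weights, so the same mismatch on "ancillary" produces very different divergences in the two directions.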