There was a paper that proposed a content-based hashing mask for training.
The idea is that you pick some window size, maybe 32 tokens, and hash the window's contents into a seed for a pseudo-random number generator. Generate a random number in the range 0..1 for each token in the window and compare it against a threshold. Don't count the loss for any token whose value is higher than the threshold.
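The scheme above can be sketched in a few lines. This is my own reconstruction, not the paper's code; the window size, threshold, and function name are assumptions. The key property is that the mask is a pure function of the content, so the same text always drops the same tokens.

```python
import hashlib
import random

def loss_mask(token_ids, window=32, threshold=0.9):
    """Return one bool per token: True = count this token's loss.

    Sketch of a content-based hashing mask (names and defaults are my
    assumptions). Because the seed is derived from the window's contents,
    identical text is always masked identically across epochs and copies.
    """
    mask = []
    for start in range(0, len(token_ids), window):
        chunk = token_ids[start:start + window]
        # Hash the window's contents into a deterministic PRNG seed.
        digest = hashlib.sha256(str(chunk).encode("utf-8")).digest()
        rng = random.Random(int.from_bytes(digest[:8], "big"))
        for _ in chunk:
            # Draw in [0, 1); tokens drawing above the threshold are
            # excluded from the loss.
            mask.append(rng.random() <= threshold)
    return mask
```

With a threshold of 0.9, roughly 10% of tokens are excluded, but always the *same* 10% for a given stretch of text, which is what blocks verbatim memorization.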
It learns well enough because you can still get the gist of something when the occasional word is missing, especially if you are learning the same thing expressed many ways.
It can't learn verbatim, however. Anything it fills in will be semantically similar, but different enough to push any direct quoting onto another path after just a few words.
Thanks! Appreciate the response and will look into this
> you get the gist of reading the meaning of something when the occasional word is missing,
I think it's more subtle than that. IIUC, the tokens were all present for the purpose of computing the output, and the score is based on that output. It's only in the weight update that some of the tokens get ignored. So the learning is lossy, but the inference driving the learning is not.
Rather than a book that's missing words, it's more like a person with a minor learning disability that prevents them from recalling anything perfectly.
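The distinction above can be made concrete. In this sketch (my reading of the mechanics, not the paper's implementation), the logits cover every position, i.e. the full forward pass sees all tokens, and the mask only zeroes a token's contribution to the loss, and therefore to the gradient.

```python
import math

def masked_nll(logits, targets, keep_mask):
    """Mean negative log-likelihood over only the unmasked positions.

    logits:    per-position lists of raw scores (full forward pass --
               every token was present when these were computed).
    targets:   one target token id per position.
    keep_mask: one bool per position; False positions are dropped from
               the loss, so they contribute no gradient.
    """
    total, kept = 0.0, 0
    for row, target, keep in zip(logits, targets, keep_mask):
        if not keep:
            # The token already influenced the forward pass upstream;
            # it just adds nothing to the training signal here.
            continue
        m = max(row)
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[target]
        kept += 1
    return total / kept
```

So "lossy learning, lossless inference": inference quality is untouched, only the gradient is thinned out.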
However, it occurs to me that data augmentation could easily break the scheme if care isn't taken.