My take on it is to use some arbitrary dithering algorithm for the first frame (e.g. Floyd-Steinberg, blue-noise thresholding; it doesn't really matter), and for subsequent frames:
1. Turn the previous dithered framebuffer into a texture
2. Set the UV coordinates of each vertex to its screenspace coordinates from the previous frame
3. Render the new frame using the previous-framebuffer texture and the aforementioned UV coords, with nearest-neighbor sampling and no lighting etc. (this alone should produce an effect reminiscent of MPEG motion compensation gone wrong).
4. Render the new frame again using the "regular" textures+lighting to produce a greyscale "ground truth" frame.
5. Use some annealing-like iterative algorithm to tweak the reprojected dithered frame (moving pixels, flipping pixels) to minimize the perceptual error between it and the ground-truth frame. You could split this work into tiles to make it more GPU-friendly.
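The reprojection in steps 2-3 boils down to a per-pixel gather from the previous dithered frame. A minimal CPU-side sketch with NumPy (in a real renderer the rasterizer would interpolate the per-vertex UVs for you; `prev_uv` and the function name are just illustrative):

```python
import numpy as np

def reproject(prev_dithered, prev_uv):
    """Sample the previous dithered frame at each new pixel's
    previous-frame screenspace position, nearest-neighbor style.
    prev_dithered: (h, w) binary frame from the last iteration.
    prev_uv: (h, w, 2) float array; for every pixel of the new
    frame, its (x, y) position in the previous frame."""
    h, w = prev_dithered.shape
    # Round to the nearest texel and clamp to the frame bounds,
    # i.e. nearest-neighbor sampling with clamp-to-edge addressing.
    x = np.clip(np.rint(prev_uv[..., 0]).astype(int), 0, w - 1)
    y = np.clip(np.rint(prev_uv[..., 1]).astype(int), 0, h - 1)
    return prev_dithered[y, x]
```

With an identity UV grid this just reproduces the previous frame; any camera or object motion shows up as warped (and, where motion vectors collide, smeared) dither patterns, which is exactly the "motion compensation gone wrong" look described above.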
Steps 4+5 should hopefully turn it from "MPEG gone wrong" into something coherent.
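Step 5 could look something like the sketch below: random pixel flips, accepted when they lower the error between a low-pass-filtered view of the dithered tile and the greyscale ground truth, with a decaying temperature occasionally accepting worse moves. The 3x3 box blur is a crude stand-in for a real perceptual metric, and all names and parameters here are made up for illustration; a GPU version would run many tiles in parallel:

```python
import numpy as np

def blur3(img):
    # 3x3 box blur: a cheap stand-in for a perceptual low-pass filter.
    p = np.pad(img, 1, mode="edge")
    h, w = img.shape
    return sum(p[dy:dy + h, dx:dx + w]
               for dy in range(3) for dx in range(3)) / 9.0

def anneal_tile(dithered, truth, steps=2000, t0=0.02, seed=0):
    """Tweak a binary tile so its blurred version matches the
    greyscale ground-truth tile, via simulated annealing."""
    rng = np.random.default_rng(seed)
    d = dithered.astype(float).copy()
    err = np.sum((blur3(d) - truth) ** 2)
    h, w = d.shape
    for i in range(steps):
        y, x = rng.integers(h), rng.integers(w)
        d[y, x] = 1.0 - d[y, x]  # propose flipping one pixel
        new_err = np.sum((blur3(d) - truth) ** 2)
        temp = max(t0 * (1.0 - i / steps), 1e-9)
        if new_err < err or rng.random() < np.exp(-(new_err - err) / temp):
            err = new_err          # keep the flip
        else:
            d[y, x] = 1.0 - d[y, x]  # revert it
    return d
```

Recomputing the blur over the whole tile per flip is wasteful; since one flip only affects a 3x3 neighborhood of the blurred image, an incremental error update would make this far cheaper in practice.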