I was thinking about exactly this kind of experiment: given a set of gameplay recordings, train a model to predict the next frame from the previous frame and the keypress input. Would the model have to be excessively complex to keep its output from rapidly diverging into feedback patterns that resemble a Winamp visualizer? Probably, but it should be entertaining enough to watch and interact with anyway.
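
To make the setup concrete, here is a rough toy sketch in Python (NumPy only). It is nowhere near a real game: the "framebuffer" is a single row of 8 pixels with one lit pixel, the "game" just shifts that pixel left or right, and the "model" is a least-squares fit on bilinear (key x frame) features. All names (`step`, `features`, `record_gameplay`, `fit`, `rollout`) and constants are made up for illustration. The part that matters is `rollout`, where each predicted frame is fed back in as the next input; with an imperfect model, that feedback loop is where errors would compound.

```python
import numpy as np

WIDTH = 8                      # toy "framebuffer": one row of 8 pixels (assumption)
LEFT, NONE, RIGHT = 0, 1, 2    # hypothetical key codes
SHIFTS = (-1, 0, 1)            # pixel shift caused by each key


def step(frame, key):
    """Ground-truth toy game: the lit pixel shifts by the key's offset, wrapping around."""
    return np.roll(frame, SHIFTS[key])


def features(frame, key):
    """Bilinear features: copy the frame into the slot belonging to the pressed key."""
    onehot = np.zeros(len(SHIFTS))
    onehot[key] = 1.0
    return np.kron(onehot, frame)


def record_gameplay(n_steps, rng):
    """Play randomly and record (previous frame + key) -> next frame pairs."""
    frame = np.zeros(WIDTH)
    frame[0] = 1.0
    inputs, targets = [], []
    for _ in range(n_steps):
        key = int(rng.integers(len(SHIFTS)))
        next_frame = step(frame, key)
        inputs.append(features(frame, key))
        targets.append(next_frame)
        frame = next_frame
    return np.array(inputs), np.array(targets)


def fit(inputs, targets):
    """Least-squares fit of next frame from the bilinear features."""
    weights, *_ = np.linalg.lstsq(inputs, targets, rcond=None)
    return weights


def rollout(weights, frame, keys):
    """Closed-loop prediction: every predicted frame becomes the next input."""
    frames = []
    for key in keys:
        frame = features(frame, key) @ weights
        frames.append(frame)
    return np.array(frames)


rng = np.random.default_rng(0)
inputs, targets = record_gameplay(2000, rng)
weights = fit(inputs, targets)

start = np.zeros(WIDTH)
start[3] = 1.0
keys = [RIGHT, RIGHT, LEFT, NONE, NONE, NONE]
predicted = rollout(weights, start, keys)
print(predicted.argmax(axis=1))  # position of the lit pixel in each predicted frame
```

This toy is exactly learnable, so the rollout stays clean. The open question in the comment is what happens when the game is complex enough that the model is only approximately right and those small errors get fed back frame after frame.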