Because video is much more difficult than images (it's lots of images that have ...

golol · on Aug 28, 2024

This misses the point, I'm comparing two methods of generating minecraft videos.

soulofmischief · on Aug 28, 2024

By simplifying the problem, we are better able to focus on researching specific aspects of generation. In this case, they synthetically created a large, highly domain-specific training set and then used this to train a diffusion model which encodes input parameters instead of text.

Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.

In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.