Hacker Newsnew | past | comments | ask | show | jobs | submitlogin

Because video is much more difficult than images (it's lots of images that have to be consistent across time, with motion following laws of physics etc), and this is much more limited in terms of scope than pure arbitrary video generation.


This misses the point, I'm comparing two methods of generating minecraft videos.


By simplifying the problem, we are better able to focus on researching specific aspects of generation. In this case, they synthetically created a large, highly domain-specific training set and then used this to train a diffusion model which encodes input parameters instead of text.

Sora was trained on a much more diverse dataset, and so has to learn more general solutions in order to maintain consistency, which is harder. The low resolution and simple, highly repetitive textures of doom definitely help as well.

In general, this is just an easier problem to approach because of the more focused constraints. It's also worth mentioning that noise was added during the process in order to make the model robust to small perturbations.




Guidelines | FAQ | Lists | API | Security | Legal | Apply to YC | Contact

Search: