What kind of use cases are you thinking of where this would be constraining? Don't many computer vision algorithms also require something specifying the camera parameters, such as the fundamental matrix for stereo imaging?
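For concreteness, here's a minimal sketch (my own illustration, not from this thread) of what that kind of camera-geometry input looks like in practice: estimating a fundamental matrix from point correspondences with OpenCV. The 3D scene points, intrinsics, and second-camera pose below are made-up toy values.

```python
import numpy as np
import cv2

rng = np.random.default_rng(0)

# Toy scene: random 3D points a few meters in front of the cameras.
X = rng.uniform([-1.0, -1.0, 4.0], [1.0, 1.0, 8.0], size=(50, 3))

# Made-up shared intrinsics; the second camera is slightly rotated and
# translated relative to the first (the "move a few feet" case).
K = np.array([[500.0,   0.0, 320.0],
              [  0.0, 500.0, 240.0],
              [  0.0,   0.0,   1.0]])
rvec2 = np.array([[0.0], [0.05], [0.0]])   # small rotation about y
tvec2 = np.array([[0.2], [0.0], [0.0]])    # small baseline along x

# Project the points into both views to get corresponding image points.
pts1, _ = cv2.projectPoints(X, np.zeros((3, 1)), np.zeros((3, 1)), K, None)
pts2, _ = cv2.projectPoints(X, rvec2, tvec2, K, None)
pts1, pts2 = pts1.reshape(-1, 2), pts2.reshape(-1, 2)

# RANSAC-based estimation: F encodes the epipolar geometry between the
# two views and can be recovered from image correspondences alone.
F, mask = cv2.findFundamentalMat(pts1, pts2, cv2.FM_RANSAC)
print(F)
```

Note that the fundamental matrix here is estimated from the images themselves rather than supplied up front, which is part of why the question of what extra parameters an algorithm truly needs comes up at all.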
As humans, when we look at a scene, then move a few feet and look at it again, we have a pretty good idea of what the delta between the two views was, so why is providing the same info here any different?
I would add that humans also integrate gyroscopic and acceleration information from the inner ear to understand relative balance. Multiple sources of sensor data are a net benefit, not a drawback.
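As a toy illustration of that kind of fusion (my own sketch, with made-up sample values), a complementary filter blends fast-but-drifting gyro rates with a slow-but-absolute tilt estimate from the accelerometer's gravity reading:

```python
import numpy as np

def complementary_pitch(gyro_rates, accel_xyz, dt, alpha=0.98):
    """Fuse gyro rates (rad/s) and accelerometer readings (m/s^2) into a
    pitch estimate: the gyro tracks short-term motion, the accelerometer's
    gravity direction corrects long-term drift."""
    pitch = 0.0
    estimates = []
    for rate, (ax, ay, az) in zip(gyro_rates, accel_xyz):
        accel_pitch = np.arctan2(-ax, np.hypot(ay, az))  # tilt from gravity
        pitch = alpha * (pitch + rate * dt) + (1 - alpha) * accel_pitch
        estimates.append(pitch)
    return estimates

# Toy usage: a slow constant rotation with a roughly level accelerometer.
dt = 0.01
gyro = [0.1] * 100                 # rad/s about the pitch axis
accel = [(0.0, 0.0, 9.81)] * 100   # gravity mostly on z
print(complementary_pitch(gyro, accel, dt)[-1])
```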