Using just the video feed, the AI would have to reconstruct an overview of the strategic situation, and then develop a forward strategy on top of that, down to individual units. By contrast, for a much simpler game like Doom, video-only input is enough for strategies like "see an enemy, target it and shoot it as fast as possible".

For an AI to compete effectively in a complex game like SC2, preparing high-level inputs is important. Think of these as shortcuts: heuristic approximations of tasks that would be hard to represent and train with deep learning. I would guess an implementation would need multiple independent nets for various tasks, combined with heuristics, so that each net could be trained separately on its own task.
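
To make that guess a bit more concrete, here is a rough sketch of how such a setup could be wired together. Everything in it is hypothetical: `HighLevelState`, the two policy functions and `choose_task` are names I made up for illustration, the policies are plain functions standing in for separately trained nets, and the thresholds are arbitrary. It is not how any real SC2 bot is built.

```python
from dataclasses import dataclass
from typing import Callable, Dict


@dataclass
class HighLevelState:
    """Hypothetical pre-processed inputs, instead of raw video frames."""
    minerals: int
    own_army_supply: int
    enemy_army_supply: int  # e.g. an estimate derived from scouting


# Each function stands in for an independently trained net.
# In a real system these would be learned models, not if-statements.
def economy_policy(state: HighLevelState) -> str:
    return "build_worker" if state.minerals >= 50 else "wait"


def combat_policy(state: HighLevelState) -> str:
    if state.own_army_supply > state.enemy_army_supply:
        return "attack"
    return "retreat"


POLICIES: Dict[str, Callable[[HighLevelState], str]] = {
    "economy": economy_policy,
    "combat": combat_policy,
}


def choose_task(state: HighLevelState) -> str:
    """Hand-written heuristic that decides which net gets control."""
    return "combat" if state.enemy_army_supply > 0 else "economy"


def act(state: HighLevelState) -> str:
    """Route the high-level state to the sub-policy picked by the heuristic."""
    return POLICIES[choose_task(state)](state)


if __name__ == "__main__":
    print(act(HighLevelState(minerals=100, own_army_supply=10, enemy_army_supply=0)))
    print(act(HighLevelState(minerals=0, own_army_supply=3, enemy_army_supply=5)))
```

The point of the split is that each sub-policy only sees a small, well-defined problem, so it can be trained and tested on its own, while the heuristic on top stays simple enough to write by hand.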