This is the usual pipeline problem, though: sometimes the bottleneck is the CPU, and sometimes the bottleneck is memory bandwidth. This just places the ball firmly in the memory bandwidth court...
(You can have a hundred worker CPU cores doing the necessary conversions; you then just need to worry about the parallelization complexity. Then again, this is exactly what already happens when we feed data to hefty devices like GPUs and TPUs.)