At 48 cores you are right around the point where memory bandwidth becomes the limit. I suspect you are over the line, but by so little that it is impossible to measure amid all the other noise. Get a larger machine and report back.
> On the 48 core system, building linux peaks at about 48GB/s; LLVM peaks at something like 25GB/s
The LLVM peak is suspiciously low, given that building LLVM is heavier than building the kernel? Anyway, on my machine, a dual-socket 2x22-core Skylake-X, a pure release build without debug symbols (less memory pressure) peaks at ~60GB/s.
A release build with debug symbols, which is much heavier and what I normally use during development (so my experience is probably biased towards that workload), is >50% larger: ~98GB/s.
Now, that was the accumulated peak, but I was also interested in the single highest read/write bandwidth measured. For an LLVM/clang release build with debug symbols I get ~32GB/s write bandwidth and ~52GB/s read bandwidth.
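(In case it's useful: assuming those per-interval figures come from memory-controller cacheline counters, which is how I'd expect such sampling to work, the conversion to GB/s is just lines times 64 bytes over the sample window. A toy sketch with made-up counter values:)

```c
/* Toy sketch: convert memory-controller cacheline transfer counts
 * (64 bytes per transfer) sampled over an interval into GB/s.
 * The counter source and the sample values here are assumptions. */
#include <stdio.h>

static double gbps(unsigned long long cachelines, double interval_s) {
    return cachelines * 64.0 / interval_s / 1e9;
}

int main(void) {
    /* ~812.5M read lines in a 1s sample works out to ~52 GB/s */
    printf("read:  %.1f GB/s\n", gbps(812500000ULL, 1.0));
    /* ~500M write lines in the same window works out to ~32 GB/s */
    printf("write: %.1f GB/s\n", gbps(500000000ULL, 1.0));
    return 0;
}
```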
Those numbers are, btw, very close to what my socket can handle: store bandwidth is ~40GB/s, load bandwidth is ~80GB/s, and combined load-store bandwidth is ~65GB/s.
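If anyone wants to sanity-check their own socket limits, a STREAM-style microbenchmark gets you in the ballpark. Below is a minimal sketch (buffer size, OpenMP threading, and pinning are assumptions, not exactly what I ran; compile with something like `gcc -O2 -fopenmp` and pin to one socket with `numactl --cpunodebind=0 --membind=0`):

```c
/* Minimal STREAM-style sketch to ballpark a socket's sustainable
 * load and load+store bandwidth. Arrays are sized well past the LLC. */
#include <omp.h>
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

#define N (1UL << 28)   /* 256M doubles = 2 GiB per array */

int main(void) {
    double *a = malloc(N * sizeof *a);
    double *b = malloc(N * sizeof *b);
    if (!a || !b) return 1;
    memset(a, 1, N * sizeof *a);   /* touch pages so they are backed */
    memset(b, 2, N * sizeof *b);

    /* read-mostly kernel: sum over a */
    double t0 = omp_get_wtime();
    double sum = 0.0;
    #pragma omp parallel for reduction(+:sum)
    for (size_t i = 0; i < N; i++) sum += a[i];
    double t1 = omp_get_wtime();
    printf("load bandwidth: %.1f GB/s (sum=%g)\n",
           N * sizeof(double) / (t1 - t0) / 1e9, sum);

    /* copy kernel: one load + one store per element; counts 2 bytes
     * moved per byte copied. Hardware write-allocate can add another
     * read per store, so actual DRAM traffic may be ~1.5x higher. */
    t0 = omp_get_wtime();
    #pragma omp parallel for
    for (size_t i = 0; i < N; i++) b[i] = a[i];
    t1 = omp_get_wtime();
    printf("copy bandwidth: %.1f GB/s\n",
           2.0 * N * sizeof(double) / (t1 - t0) / 1e9);

    free(a); free(b);
    return 0;
}
```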
So I think it is not unreasonable to say that there are compiler workloads that can be limited by memory bandwidth. I have certainly worked with codebases even heavier than LLVM, and even though I did not take measurements back then, my gut feeling was that the bandwidth was being eaten up. Some translation units would literally sit "compiling" for a few minutes with no visible progress.
I agree that random memory access patterns, and the latency those patterns incur, are also a cost that needs to be added to this cost function.
My initial comment on this topic was: I don't really believe that the bottleneck when compiling larger codebases (not on _any_ given machine, of course) is on the compute side, and therefore I don't see how modules are going to fix any of this.