I would like to see a single-address-space kernel with hardware that splits permission checking from remapping. This would allow virtually indexed, virtually tagged caches all the way down to the last-level cache without risking aliasing. A handful of permission checks could be special-cased using (base, size) pairs for things like the current stack and the largest few code blocks, relieving pressure on the page-based permission-check hardware, which could also run in parallel with cache accesses (just pretty please don't leave observable uarch state changes behind on denied accesses). To support efficient fork(), the hardware could distinguish local from global addresses by XORing or adding/subtracting the process identifier into the address whenever a tag bit is set in the upper address bits. This should move a lot of expensive steps off the critical path to memory without breaking anything userspace software has to do. Add a form of efficient permission delegation (e.g. hardware-protected capabilities) and you have the building blocks for very fast IPC, even for large messages.
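
To make the two hardware ideas above a bit more concrete, here is a rough software model of them in Python. It is only a sketch under my own assumptions: the tag bit position, the PID field placement and width, the page size, and the names `to_global`, `Range` and `check_access` are all made up for illustration, and it uses the XOR variant rather than add/sub. Real hardware would do the range compares in parallel rather than in a loop.

```python
from dataclasses import dataclass

# All constants below are illustrative assumptions, not taken from any real ISA.
LOCAL_TAG_BIT = 1 << 63   # assumed: top address bit marks a process-local address
PID_SHIFT = 40            # assumed: the PID is folded into bits 40..55
PID_MASK = 0xFFFF         # assumed: 16-bit process identifiers
PAGE_SHIFT = 12           # assumed: 4 KiB pages for the page-based check


def to_global(addr: int, pid: int) -> int:
    """Fold the PID into a tagged (local) address; pass global addresses through."""
    if addr & LOCAL_TAG_BIT:
        # XOR variant: the tag bit stays set, only the PID field changes.
        return addr ^ ((pid & PID_MASK) << PID_SHIFT)
    return addr


@dataclass(frozen=True)
class Range:
    base: int
    size: int
    perms: str  # subset of "rwx"


def check_access(addr: int, want: str, fast_ranges, page_perms) -> bool:
    """Return True if access `want` ('r', 'w' or 'x') to `addr` is permitted."""
    # Fast path: a few (base, size) comparators, e.g. for the current stack.
    for r in fast_ranges:
        if r.base <= addr < r.base + r.size:
            return want in r.perms
    # Slow path: per-page lookup (a dict stands in for the hardware table).
    return want in page_perms.get(addr >> PAGE_SHIFT, "")
```

In this model the same local pointer used by a parent (say PID 1) and its child (PID 2) lands on two different global addresses, which is the property that would let fork() avoid rewriting pointers, while untagged addresses stay shared. Whether a hit in a (base, size) range should be final or fall through to the page check on a denial is a design choice; the sketch treats the range hit as final.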