BLAS doesn’t layout data for you; it needs to work with the alignment of buffers that you pass as arguments. High-performance BLAS implementations do often allocate their own temporary workspace, and some account for issues like this (because BLAS access on large buffers is dense, the exact issue described here is rarely a problem, but 4k false aliasing frequently is for huge power-of-two sized matrices, for example).