It boils down to various kinds of normalization, as in functional programming's notion of a normal form. If you can reduce an expression to a normal (or canonical) form, you can fuse several seemingly different expressions into one and share its computation across several places.
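As a toy illustration, here is a minimal Python sketch of canonicalize-then-share (hash-consing); the tuple representation and the names `canon`/`share` are invented for the example, but it shows how two spellings of the same expression collapse into one shared node:

```python
table = {}  # canonical expression -> shared node id

def canon(expr):
    """Rewrite an expression tree into a canonical form."""
    if not isinstance(expr, tuple):
        return expr
    op, *args = expr
    args = [canon(a) for a in args]
    if op in ("+", "*"):          # commutative: fix one operand order
        args.sort(key=repr)
    return (op, *args)

def share(expr):
    """Map an expression to a single shared id via its canonical form."""
    return table.setdefault(canon(expr), len(table))

# "(a + b) * c" and "(b + a) * c" normalize to the same node:
e1 = ("*", ("+", "a", "b"), "c")
e2 = ("*", ("+", "b", "a"), "c")
assert share(e1) == share(e2)     # computed once, shared everywhere
```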
As an example from outside functional programming, consider instruction set architectures.
A 3-address RISC instruction like "x := y - z" can be encoded in 32^3 different ways on a machine with 32 registers (one choice of register per operand). It is not practical to enumerate all of these variants and share them.
The situation is less severe in two-address ISAs, and less severe still in one-address (accumulator-based) ISAs; the 8086 is a blend of the one- and two-address styles.
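To make the shrinking encoding space concrete, a back-of-the-envelope count, assuming 32 general-purpose registers and free register assignment (the one-address case is rough, since its single operand is typically a memory address rather than a register):

```python
R = 32
print("3-address:", R**3)  # pick dst, src1, src2 independently: 32768
print("2-address:", R**2)  # dst doubles as one source: 1024
print("1-address:", R**1)  # accumulator is fixed; one operand varies: 32
print("0-address:", R**0)  # stack machine: just "-", one encoding
```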
Zero-address ISAs are better still. There is only one way to compute y - z: issue "-", which takes two arguments from the stack and pushes the result of the subtraction back onto it. The situations that lead to that subtraction are far less diverse than in the 3-address RISC encoding. And here comes the gist: one can factor identical sequences of opcodes into a subroutine (a "word" in Forth parlance) and replace those sequences with calls to that subroutine.
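Here is a minimal sketch of that factoring over a toy zero-address interpreter; the VM, the opcode set, and the word name "sq-1" are made up for illustration, not any real Forth:

```python
def run(program, words, stack):
    """Interpret a zero-address program over an operand stack."""
    for op in program:
        if isinstance(op, str) and op in words:
            run(words[op], words, stack)     # call the factored word
        elif op == "dup":
            stack.append(stack[-1])
        elif op == "*":
            b, a = stack.pop(), stack.pop()
            stack.append(a * b)
        elif op == "-":
            b, a = stack.pop(), stack.pop()
            stack.append(a - b)
        else:
            stack.append(op)                 # literal: push it

# "square, then subtract one" appears verbatim in two programs,
# so it is factored into a single word and called from both:
words = {"sq-1": ["dup", "*", 1, "-"]}
s = []
run([3, "sq-1"], words, s)   # 3*3 - 1
run([7, "sq-1"], words, s)   # 7*7 - 1
print(s)                     # [8, 48]
```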
By programming in Forth, you perform LZ compression on your programs.
If you can't compress your program any further, you have a perfect Forth program, and I am not quite joking here.
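A naive sketch of that compression view, with all names invented: pull a repeated opcode sequence out into a fresh word, the way a dictionary compressor replaces a repeated substring with a reference, and iterate until nothing repeats any more.

```python
def factor_once(prog, words, min_len=2):
    """Find one repeated opcode sequence and factor it into a fresh word."""
    n = len(prog)
    for i in range(n - min_len + 1):
        seq = prog[i:i + min_len]
        for j in range(i + min_len, n - min_len + 1):
            if prog[j:j + min_len] == seq:       # seq occurs at least twice
                name = f"w{len(words)}"          # fresh (hypothetical) word
                words[name] = seq
                out, k = [], 0
                while k < n:                     # replace every occurrence
                    if prog[k:k + min_len] == seq:
                        out.append(name)
                        k += min_len
                    else:
                        out.append(prog[k])
                        k += 1
                return out
    return prog

words = {}
prog = ["dup", "*", 1, "-", "swap", "dup", "*", 1, "-"]
while True:                                      # compress to a fixpoint
    nxt = factor_once(prog, words)
    if nxt == prog:
        break
    prog = nxt
print(prog)   # ['w2', 'swap', 'w2']
print(words)  # nested word definitions, like an LZ dictionary
```

The fixpoint, where factor_once can no longer shrink the program, is the "can't compress any more" state above.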
The same applies, albeit less dramatically and visibly, to other things in programs. It is possible to detect algorithmic kernels in programs (by matching optimized data and control flow against a pattern) and replace them with different ones: less computation-intensive, parallelized, parametrized, and so on. There has been work on this in the optimization literature.
Again, in kernel matching, optimization goes first, because optimization makes the program more canonical, so to speak, and matching on canonical forms is easier.
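A toy sketch of canonicalize-then-match, with an invented loop IR and a hypothetical fast_sum kernel: the commutative accumulation is normalized first, so one pattern recognizes both surface forms of the reduction.

```python
def canonicalize(loop):
    """Normalize commutative accumulation: acc = x + acc -> acc = acc + x."""
    op, dst, lhs, rhs = loop["body"]
    if op == "+" and rhs == dst:
        lhs, rhs = rhs, lhs
    return {**loop, "body": (op, dst, lhs, rhs)}

def match_sum_kernel(loop):
    """Does the loop body have the shape acc = acc + a[i]?"""
    op, dst, lhs, rhs = loop["body"]
    return op == "+" and lhs == dst and rhs == loop["elem"]

# Two surface forms of the same reduction:
loop1 = {"elem": "a[i]", "body": ("+", "acc", "acc", "a[i]")}
loop2 = {"elem": "a[i]", "body": ("+", "acc", "a[i]", "acc")}

for loop in (loop1, loop2):
    loop = canonicalize(loop)
    if match_sum_kernel(loop):
        print("replace loop with call to fast_sum(a)")  # hypothetical kernel
```

Without the canonicalization pass, loop2 would slip past the pattern, which is exactly why optimization runs before matching.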