> I don't really see this as any less clever than if someone had written the algorithm by hand.
And you can invent this by hand. I was talking to Shawn Presser literally days before about his experiments in cutting Adam down to low precision: he had repeatedly cut it down, eventually to 1-bit, and found it still worked on small-scale Transformers, i.e. close to this LION. (He didn't invent it exactly, but was like an `abs()` away or something: https://twitter.com/theshawwn/status/1625681629074137088 ) So that's how you could have invented this yourself: follow the logic of '1-bit Adam' https://arxiv.org/abs/2102.02888#microsoft to see how much you can dispense with modeling the moments.
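To make "dispense with modeling the moments" concrete, here is a rough sketch of what a sign-of-momentum step looks like once you drop Adam's second moment entirely. It follows the LION update rule as I understand it from the paper; it is not Shawn's experiment and not the 1-bit Adam algorithm, and the names (`lion_step`, `sign`) and default hyperparameters are just illustrative choices for this example.

```python
def sign(x):
    """Return -1, 0, or 1: the only thing kept from the interpolated gradient."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-2, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One sign-momentum update over plain Python lists of floats.

    Assumed form of the rule (hypothetical helper, not a library API):
      c     = beta1 * m + (1 - beta1) * g
      theta = theta - lr * (sign(c) + weight_decay * theta)
      m     = beta2 * m + (1 - beta2) * g
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign.
        c = beta1 * m + (1 - beta1) * g
        # With weight_decay=0, every coordinate moves by exactly lr.
        new_params.append(p - lr * (sign(c) + weight_decay * p))
        # Momentum is the only optimizer state; there is no second moment.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


# Toy usage: two parameters with very different gradient magnitudes.
params, momentum = [1.0, -2.0], [0.0, 0.0]
grads = [0.5, -4.0]
params, momentum = lion_step(params, grads, momentum, lr=0.1)
```

The thing to notice is that the gradient magnitudes (0.5 vs. 4.0) make no difference to the step size, only to the momentum that gets carried forward. That is the sense in which you end up here by repeatedly throwing away precision in Adam.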