People have spent enormous amounts of time and effort to find clever optimizers. This is another example of that. I don't really see this as any less clever than if someone had written the algorithm by hand.
> I don't really see this as any less clever than if someone had written the algorithm by hand.
And you can invent this by hand. I was talking to Shawn Presser literally days before about his experiments in cutting Adam down to low precision: he had repeatedly cut it down, eventually to 1-bit, and found it was still working on small-scale Transformers, i.e. close to this Lion. (He didn't invent it exactly, but was like an `abs()` away or something: https://twitter.com/theshawwn/status/1625681629074137088 ) So that's how you could have invented this yourself: follow the logic of '1-bit Adam' https://arxiv.org/abs/2102.02888#microsoft to see how much you can dispense with modeling the moments.
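To make "dispensing with the moments" concrete, here is a rough sketch of a sign-based momentum step in the spirit of Lion, as I understand the published update rule. The function name `lion_step`, the plain-list representation, and the default hyperparameters are my own illustrative choices, not anything taken from the paper's code or from Shawn's experiments.

```python
def sign(x):
    """Return -1, 0, or 1 depending on the sign of x."""
    return (x > 0) - (x < 0)


def lion_step(params, grads, momentum, lr=1e-4, beta1=0.9, beta2=0.99,
              weight_decay=0.0):
    """One sign-based update over flat lists of floats (illustrative sketch).

    Only a single momentum buffer is tracked; there is no second-moment
    estimate as in Adam, and every coordinate moves by exactly +/- lr
    (plus decoupled weight decay).
    """
    new_params, new_momentum = [], []
    for p, g, m in zip(params, grads, momentum):
        # Interpolate momentum and gradient, then keep only the sign (1 bit).
        direction = sign(beta1 * m + (1 - beta1) * g)
        new_params.append(p - lr * (direction + weight_decay * p))
        # The momentum buffer is updated with a separate decay rate.
        new_momentum.append(beta2 * m + (1 - beta2) * g)
    return new_params, new_momentum


params, momentum = lion_step([1.0, -2.0], [0.5, -3.0], [0.0, 0.0], lr=0.1)
```

Note how the gradient magnitude (0.5 vs. 3.0) has no effect on the step size here; only the direction survives, which is the "1-bit" part.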
Well, it's exactly less clever because once you've written an optimizer to find optimizers, you've cut yourself out of the loop and you're just a manager of things you don't understand.
Go use my public tool https://doxyjs.com and find yourself a compression algo; you probably can. If you understand why it works, that's interesting.
End-to-end (or deeper-than-usual) understanding is the reason why I never worried about losing a job.
At the same time… trying to grok everything consistently kills my attempts at anything business-like, where one has no choice but to focus and delegate.
There’s some interesting discussion about how to potentially improve the design of the search space. Plus they had to manually simplify the final optimization algorithm, so it’s not like they’ve cut themselves completely out of the loop; it’s just a higher-order tool.