The authors are reputable and claim to have discovered a simple optimizer that trains models up to 5x faster and converges to better local optima than Adam/AdamW.
Your formula is almost right, but there is a slight difference: the third line of Program 1 in their paper is absent from your statement, which changes the value of the momentum variable for subsequent steps.
To be fair, I don't think that this operation will fundamentally change the behavior of the algorithm. In fact, I don't think their algorithm behaves fundamentally differently from SignSGD with momentum, whose convergence behavior is well understood [1, 2, 3]. Actually, the algorithm you described is equivalent to SignSGD with momentum: in it, m is a convex combination of m and g (with coefficient alpha), and then c is a convex combination of the updated m and g (with coefficient beta). That is exactly the same as setting c to a convex combination of m and g with coefficient alpha*beta, and that is exactly SignSGD with momentum. The only difference between the algorithm you described (equivalent to SignSGD) and LION is that LION further modifies the value of the momentum after computing the update, with an additional convex combination. Again, I am skeptical that this operation fundamentally changes the behavior of the algorithm, but that's just my guess.
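Here is a quick numerical sanity check of that collapse (just a sketch with made-up names, not anyone's actual code): the two nested convex combinations give exactly the same update direction as a single convex combination of the old m and g with coefficient alpha*beta.

    import numpy as np

    rng = np.random.default_rng(0)
    m = rng.normal(size=5)        # momentum buffer
    g = rng.normal(size=5)        # current gradient
    alpha, beta = 0.9, 0.99

    # nested form: refresh m first, then interpolate the refreshed m with g
    m_new = alpha * m + (1 - alpha) * g
    c_nested = beta * m_new + (1 - beta) * g

    # collapsed form: one convex combination of the *old* m and g
    c_single = alpha * beta * m + (1 - alpha * beta) * g

    assert np.allclose(c_nested, c_single)
    assert (np.sign(c_nested) == np.sign(c_single)).all()   # same signed update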
Yes. I omitted time-indices and moved the momentum-update up to before the parameter-update. In their pseudo-code θₜ is updated with mₜ₋₁, whereas above it would be with mₜ.
I suspect it is equivalent up to slightly different α and β?
Moving the momentum update before the parameter update changes the algorithm from LION to SignSGD. The only novel part of their algorithm is that the momentum changes after the update is computed. They are not equivalent, even if you can choose the coefficients however you like. But as I said, if I had to guess, I would say that this does not cause a fundamental change in the behavior of the optimization.
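To make the ordering concrete, here is a sketch of one step of each variant (paraphrasing their Program 1 from memory; learning-rate schedule and weight decay omitted, and the names are mine):

    import numpy as np

    def lion_step(x, m, g, lr, beta1=0.9, beta2=0.99):
        # LION: the update is taken with the *old* momentum ...
        c = beta1 * m + (1 - beta1) * g
        x = x - lr * np.sign(c)
        # ... and only afterwards is the momentum refreshed
        m = beta2 * m + (1 - beta2) * g
        return x, m

    def reordered_step(x, m, g, lr, alpha=0.9, beta=0.99):
        # the variant described above: the momentum is refreshed first, so
        # the signed update collapses to a single convex combination of the
        # old m and g (i.e. SignSGD with momentum)
        m = alpha * m + (1 - alpha) * g
        c = beta * m + (1 - beta) * g
        x = x - lr * np.sign(c)
        return x, m

The only difference is where the momentum refresh sits relative to the sign update, which is exactly the point above.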
Although your approach to it is much better, they may as well have called it XERXES by combining letters that are and aren't in the phrase with some permutation.
There is no link between the name "Evolved Sign Momentum" and "Lion". They can call the thing what they want, but they're making up the link.
They discover an NN optimization algorithm via search in a symbolic space. Most of the article is about using it for NNs rather than the discovery itself, though :(
[edit] When people say that the road to hell is paved with good intentions, or that a little knowledge is a dangerous thing, or... you know, pick your standard folk wisdom saying... what's so goddamn disappointing to me is how a generation that should know much better than this is eager to take shortcuts with the convenience of enormous and unprecedented processing power, rather than using their own brains to puzzle out clever ways of doing the same thing.
People have spent enormous amounts of time and effort to find clever optimizers. This is another example of that. I don't really see this as any less clever than if someone had written the algorithm by hand.
> I don't really see this as any less clever than if someone had written the algorithm by hand.
And you can invent this by hand. I was talking to Shawn Presser literally days before about his experiments in cutting down Adam to low precision, where he had repeatedly cut it down, eventually to 1 bit, and found it was still working on small-scale Transformers - i.e., close to this LION. (He didn't invent it exactly, but was like an `abs()` away or something: https://twitter.com/theshawwn/status/1625681629074137088 ) So that's how you could have invented this yourself: follow the logic of '1-bit Adam' https://arxiv.org/abs/2102.02888#microsoft to see how much you can dispense with modeling the moments.
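As a hedged illustration of that "strip Adam down" line of reasoning (this is neither Presser's code nor the 1-bit Adam implementation, just plain Adam with the second moment deleted): once you stop modeling the second moment and keep only the sign of the momentum, every coordinate moves by a fixed ±lr, which is the SignSGD-with-momentum territory that LION lives in.

    import numpy as np

    def adam_step(x, m, v, g, lr, b1=0.9, b2=0.999, eps=1e-8):
        # standard Adam update (bias correction omitted for brevity)
        m = b1 * m + (1 - b1) * g
        v = b2 * v + (1 - b2) * g**2
        x = x - lr * m / (np.sqrt(v) + eps)
        return x, m, v

    def stripped_step(x, m, g, lr, b1=0.9):
        # drop the second moment entirely and keep only the sign of the
        # momentum: a 1-bit update direction, close to what the thread
        # above argues LION amounts to
        m = b1 * m + (1 - b1) * g
        x = x - lr * np.sign(m)
        return x, m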
Well, it's exactly less clever because once you've written an optimizer to find optimizers, you've cut yourself out of the loop and you're just a manager of things you don't understand.
Go use my public tool https://doxyjs.com and find yourself a compression algo; you probably can. If you understand why it works, that's interesting.
End-to-end (or deeper-than-usual) understanding is the reason why I never worried about losing a job.
At the same time… trying to grok everything consistently kills my attempts at anything business-like, where one has no choice but to focus and delegate.
There’s some interesting discussion about how to potentially improve the design of the search space. Plus, they had to manually simplify the final optimization algorithm, so it’s not like they’ve cut themselves completely out of the loop; it’s just a higher-order tool.