Symbolic Discovery of Optimization Algorithms (arxiv.org)
111 points by caprock on Feb 18, 2023 | 31 comments


The optimization algorithm they arrived at was (quoting from the paper):

  def train(weight, gradient, momentum, lr):
    update = interp(gradient, momentum, β1)
    update = sign(update)
    momentum = interp(gradient, momentum, β2)
    weight_decay = weight * λ
    update = update + weight_decay
    update = update * lr
    return update, momentum
The key thing they do, passing the update through sign, means every weight update becomes either +1 or -1.
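
For anyone who wants to poke at it, here is a minimal runnable sketch of one Lion step. It assumes interp(a, b, t) = (1 - t)*a + t*b, which is how the interpolation in the paper's pseudocode reads; the function name lion_step and the hyperparameter defaults below are just illustrative, not the reference implementation.

  import numpy as np

  def interp(a, b, t):
      # linear interpolation: (1 - t) * a + t * b
      return (1 - t) * a + t * b

  def lion_step(weight, gradient, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
      update = np.sign(interp(gradient, momentum, beta1))  # sign of the interpolated direction
      momentum = interp(gradient, momentum, beta2)         # momentum is updated afterwards
      update = (update + wd * weight) * lr                 # decoupled weight decay, scaled by lr
      return weight - update, momentum                     # descent step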


1 or -1 times the learning rate.


This looks great!

The authors are reputable and claim to have discovered a simple optimizer that trains models up to 5x faster and converges to better local optima than Adam/AdamW.

As usual, YMMV.

You can find a PyTorch implementation here: https://github.com/lucidrains/lion-pytorch

Preliminary tests seem to confirm the paper's claims.


The formula (from their pseudocode):

m ← αm + (1-α)g // momentum

c ← βm + (1-β)g // new part

θ ← θ + η sign(c) // update of parameters

The parameter update is via the sign of the weighted average of the momentum and the gradient.

For comparison, the gradient of abs(x) is sign(x), which is the update with L1-norm.


Your formula is almost right, but there is a slight difference. Namely, the third line of Program 1 in their paper is absent from your statement, which will change the value of the momentum variable for subsequent steps.

To be fair, I don't think that this operation will fundamentally change the behavior of the algorithm. In fact, I don't think their algorithm behaves fundamentally differently from SignSGD with momentum, whose convergence behavior is well understood [1, 2, 3].

Actually, the algorithm you described is equivalent to SignSGD with momentum: m is a convex combination of m and g (with coefficient alpha), then c is a convex combination of the updated m and g (with coefficient beta). Expanding, c = β(αm + (1-α)g) + (1-β)g = αβ·m + (1-αβ)·g, which is exactly a convex combination of m and g with coefficient alpha*beta. And that is exactly SignSGD with momentum.

The only difference between the algorithm you described (equivalent to SignSGD) and LION is that they further modify the value of the momentum after computing the update, with an additional convex combination operation. Again, I am skeptical that this operation will fundamentally change the behavior of the algorithm. But that's just my guess.

[1] https://arxiv.org/abs/1802.04434

[2] http://proceedings.mlr.press/v97/karimireddy19a.html

[3] https://arxiv.org/abs/2208.11195
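
A throwaway numeric check of the αβ collapse above (made-up values, nothing from the paper), just to see that β(αm + (1-α)g) + (1-β)g equals αβ·m + (1-αβ)·g:

  import numpy as np

  rng = np.random.default_rng(0)
  m, g = rng.normal(size=5), rng.normal(size=5)
  alpha, beta = 0.99, 0.9

  m_new = alpha * m + (1 - alpha) * g          # momentum updated first
  c_two_step = beta * m_new + (1 - beta) * g   # then interpolate with the new momentum
  c_one_step = alpha * beta * m + (1 - alpha * beta) * g

  print(np.allclose(c_two_step, c_one_step))   # True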


Yes. I omitted time indices and moved the momentum update up to before the parameter update. In their pseudo-code, θₜ is updated with mₜ₋₁, whereas above it would be with mₜ.

I suspect it is equivalent up to slightly different α and β?


Moving the momentum update before the parameter update changes the algorithm from LION to SignSGD. The only novel part of their algorithm is the fact that the momentum changes after the update is computed. They are not equivalent even if you can choose coefficients however you like. But as I said, if I had to guess I would say that this does not cause a fundamental change in the behavior of the optimization.


If we let βₛ be as above (SignSGD?) and βₗ be as in LION, then it looks to me that βₛ=βₗ/α should make c from both equal.
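
For whatever it's worth, a throwaway simulation of the two orderings (made-up gradient stream, and the same α used for the momentum line in both) seems to bear out the claim about c: with βₛ = βₗ/α the c sequences coincide step for step.

  import numpy as np

  rng = np.random.default_rng(0)
  grads = rng.normal(size=(10, 4))      # made-up gradient stream
  alpha, beta_l = 0.99, 0.9             # α for the momentum line, βₗ as in LION
  beta_s = beta_l / alpha               # the proposed reparametrization

  m_lion = np.zeros(4)
  m_rearranged = np.zeros(4)
  for g in grads:
      c_lion = beta_l * m_lion + (1 - beta_l) * g               # LION: c uses the old momentum...
      m_lion = alpha * m_lion + (1 - alpha) * g                 # ...then the momentum is updated
      m_rearranged = alpha * m_rearranged + (1 - alpha) * g     # rearranged: momentum first...
      c_rearranged = beta_s * m_rearranged + (1 - beta_s) * g   # ...then c uses the new momentum
      assert np.allclose(c_lion, c_rearranged)

  print("c sequences match")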


Is the 2nd line really necessary? Since β is 0.9, the second line becomes:

c ← 0.9m + (0.1)g

There doesn't seem to be a significant difference between their code:

θ ← θ + η sign(0.9m + (0.1)g)

and just not having the 2nd line at all:

θ ← θ + η sign(m)
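
For concrete numbers: with β = 0.9 the two only disagree on components where the gradient opposes the momentum strongly enough (opposite sign and |g| > 9|m|). A throwaway check:

  import numpy as np

  m = np.array([0.5, 0.5, 0.01])
  g = np.array([0.4, -0.4, -0.2])    # last two components push against the momentum

  print(np.sign(0.9 * m + 0.1 * g))  # [ 1.  1. -1.]
  print(np.sign(m))                  # [ 1.  1.  1.]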


The second line appears unnecessary because the formula in that comment is missing one line from the definition of LION. See my above comment.


True, those are equivalent. Might even become identical once it is compiled.


> Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum).

Come on now


They could have called it ELESIUM with a lot less effort...


Although your approach to it is much better, they may as well have called it XERXES by combining letters that are and aren't in the phrase with some permutation.

There is no link between the name "Evolved Sign Momentum" and "Lion". They can call the thing what they want, but they're making up the link.


Or SignUp. Even makes a good paper title: SignUp for faster training by evolving the learning algorithm.


Could have called it Eve


Yeah, the acronym is very far-fetched…

But at least the algorithm, and how they got to it, seem dope.


Maybe they need an algorithm that makes proper algorithm acronyms…


Your "maybe" is grossly unnecessary! Get an editor, sir!


Hahaha gotta love your terrible acronyms :)


Very cool stuff. For one, symbolic discovery is beautiful.

Second, LION (what a sin of an acronym tho) seems like a pretty big deal. Really nice paper to start the Saturday with.


How is this symbolic? At a quick look, it seems NN-based.


They discover an NN optimization algorithm via search in a symbolic space. Most of the article is about using it for NNs rather than the discovery itself, though :(


lol. That's literally the basis of https://gwern.net/fiction/clippy

[edit] When people say that the road to hell is paved with good intentions, or that a little knowledge is a dangerous thing, or... you know, pick your standard folk wisdom saying... what's so goddamn disappointing to me is how a generation that should know much better than this is eager to take shortcuts with the convenience of enormous and unprecedented processing power, rather than using their own brains to puzzle out clever ways of doing the same thing.


People have spent enormous amounts of time and effort to find clever optimizers. This is another example of that. I don't really see this as any less clever than if someone had written the algorithm by hand.


> I don't really see this as any less clever than if someone had written the algorithm by hand.

And you can invent this by hand. I was talking to Shawn Presser literally days before about his experiments in cutting down Adam to low precision, where he had repeatedly cut it down, eventually to 1-bit, and found it was still working on small-scale Transformers - i.e. close to this LION. (He didn't invent it exactly, but was like an `abs()` away or something: https://twitter.com/theshawwn/status/1625681629074137088 ) So that's how you could have invented this yourself: follow the logic of '1-bit Adam' https://arxiv.org/abs/2102.02888#microsoft to see how much you can dispense with modeling the moments.


Well, it's exactly less clever because once you've written an optimizer to find optimizers, you've cut yourself out of the loop and you're just a manager of things you don't understand.

Go use my public tool https://doxyjs.com and find yourself a compression algo; you probably can. If you understand why it works, that's interesting.


> you're just a manager of things you don't understand.

It seems like a pretty big leap to assume the authors don't understand their own algorithm.


End-to-end (or deeper-than-usual) understanding is the reason why I never worried about losing a job.

At the same time… trying to grok-everything consistently kills my attempts at anything business-like, where one has no choice but to focus and delegate.


There’s some interesting discussion about how to potentially improve the design of the search space. Plus, they had to manually simplify the final optimization algorithm, so it’s not like they’ve cut themselves completely out of the loop; it’s just a higher-order tool.


What does it even mean? And can it be applied to AI?



