Symbolic Discovery of Optimization Algorithms (arxiv.org)
111 points by caprock on Feb 18, 2023 | 31 comments


The optimization algorithm they arrived at was (quoting from the paper):

  def train(weight, gradient, momentum, lr):
    update = interp(gradient, momentum, β1)
    update = sign(update)
    momentum = interp(gradient, momentum, β2)
    weight_decay = weight * λ
    update = update + weight_decay
    update = update * lr
    return update, momentum
The key thing they do, passing the update through sign, means every weight update becomes either +1 or -1.
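
For anyone who wants to poke at it, here is a minimal runnable sketch of one Lion step. It assumes interp(a, b, t) = (1 - t)*a + t*b, which is how the interpolation in the paper's pseudocode reads; the function name lion_step and the hyperparameter defaults below are just illustrative, not the reference implementation.

  import numpy as np

  def interp(a, b, t):
      # linear interpolation: (1 - t) * a + t * b
      return (1 - t) * a + t * b

  def lion_step(weight, gradient, momentum, lr=1e-4, beta1=0.9, beta2=0.99, wd=0.01):
      update = np.sign(interp(gradient, momentum, beta1))  # sign of the interpolated direction
      momentum = interp(gradient, momentum, beta2)         # momentum is updated afterwards
      update = (update + wd * weight) * lr                 # decoupled weight decay, scaled by lr
      return weight - update, momentum                     # descent step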


1 or -1 times the learning rate.


This looks great!

The authors are reputable and claim to have discovered a simple optimizer that trains models up to 5x faster and converges to better local optima than Adam/AdamW.

As usual, YMMV.

You can find a PyTorch implementation here: https://github.com/lucidrains/lion-pytorch

Preliminary tests seem to confirm the paper's claims.


The formula (from their pseudocode):

m ← αm + (1-α)g // momentum

c ← βm + (1-β)g // new part

θ ← θ + η sign(c) // update of parameters

The parameter update is via the sign of the weighted average of the momentum and the gradient.

For comparison, the gradient of abs(x) is sign(x), which is the update with L1-norm.


Your formula is almost right, but there is a slight difference. Namely, the third line of Program 1 in their paper is absent from your statement, which will change the value of the momentum variable for subsequent steps.

To be fair, I don't think that this operation will fundamentally change the behavior of the algorithm. In fact, I don't think their algorithm behaves fundamentally differently from SignSGD with momentum, whose convergence behavior is well understood [1, 2, 3].

Actually, the algorithm you described is equivalent to SignSGD with momentum: m is a convex combination of m and g (with coefficient alpha), then c is a convex combination of the updated m and g (with coefficient beta). Expanding, c = β(αm + (1-α)g) + (1-β)g = αβ·m + (1-αβ)·g, which is exactly a convex combination of m and g with coefficient alpha*beta. And that is exactly SignSGD with momentum.

The only difference between the algorithm you described (equivalent to SignSGD) and LION is that they further modify the value of the momentum after computing the update, with an additional convex combination operation. Again, I am skeptical that this operation will fundamentally change the behavior of the algorithm. But that's just my guess.

[1] https://arxiv.org/abs/1802.04434

[2] http://proceedings.mlr.press/v97/karimireddy19a.html

[3] https://arxiv.org/abs/2208.11195
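
A throwaway numeric check of the αβ collapse above (made-up values, nothing from the paper), just to see that β(αm + (1-α)g) + (1-β)g equals αβ·m + (1-αβ)·g:

  import numpy as np

  rng = np.random.default_rng(0)
  m, g = rng.normal(size=5), rng.normal(size=5)
  alpha, beta = 0.99, 0.9

  m_new = alpha * m + (1 - alpha) * g          # momentum updated first
  c_two_step = beta * m_new + (1 - beta) * g   # then interpolate with the new momentum
  c_one_step = alpha * beta * m + (1 - alpha * beta) * g

  print(np.allclose(c_two_step, c_one_step))   # True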


Yes. I omitted time indices and moved the momentum update up to before the parameter update. In their pseudo-code, θₜ is updated with mₜ₋₁, whereas above it would be with mₜ.

I suspect it is equivalent up to slightly different α and β?


Moving the momentum update before the parameter update changes the algorithm from LION to SignSGD. The only novel part of their algorithm is the fact that the momentum changes after the update is computed. They are not equivalent even if you can choose coefficients however you like. But as I said, if I had to guess I would say that this does not cause a fundamental change in the behavior of the optimization.


If we let βₛ be as above (SignSGD?) and βₗ be as in LION, then it looks to me that βₛ=βₗ/α should make c from both equal.
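
For whatever it's worth, a throwaway simulation of the two orderings (made-up gradient stream, and the same α used for the momentum line in both) seems to bear out the claim about c: with βₛ = βₗ/α the c sequences coincide step for step.

  import numpy as np

  rng = np.random.default_rng(0)
  grads = rng.normal(size=(10, 4))      # made-up gradient stream
  alpha, beta_l = 0.99, 0.9             # α for the momentum line, βₗ as in LION
  beta_s = beta_l / alpha               # the proposed reparametrization

  m_lion = np.zeros(4)
  m_rearranged = np.zeros(4)
  for g in grads:
      c_lion = beta_l * m_lion + (1 - beta_l) * g               # LION: c uses the old momentum...
      m_lion = alpha * m_lion + (1 - alpha) * g                 # ...then the momentum is updated
      m_rearranged = alpha * m_rearranged + (1 - alpha) * g     # rearranged: momentum first...
      c_rearranged = beta_s * m_rearranged + (1 - beta_s) * g   # ...then c uses the new momentum
      assert np.allclose(c_lion, c_rearranged)

  print("c sequences match")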


Is the 2nd line really necessary? Since β is 0.9, the second line becomes:

c ← 0.9m + (0.1)g

There doesn't seem to be a significant difference between their code:

θ ← θ + η sign(0.9m + (0.1)g)

and just not having the 2nd line at all:

θ ← θ + η sign(m)
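
For concrete numbers: with β = 0.9 the two only disagree on components where the gradient opposes the momentum strongly enough (opposite sign and |g| > 9|m|). A throwaway check:

  import numpy as np

  m = np.array([0.5, 0.5, 0.01])
  g = np.array([0.4, -0.4, -0.2])    # last two components push against the momentum

  print(np.sign(0.9 * m + 0.1 * g))  # [ 1.  1. -1.]
  print(np.sign(m))                  # [ 1.  1.  1.]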


The second line appears unnecessary because the formula in that comment is missing one line from the definition of LION. See my above comment.


True, those are equivalent. Might even become identical once it is compiled.


> Our method discovers a simple and effective optimization algorithm, Lion (EvoLved Sign Momentum).

Come on now


They could have called it ELESIUM with a lot less effort...


Although your approach to it is much better, they may as well have called it XERXES by combining letters that are and aren't in the phrase with some permutation.

There is no link between the name "Evolved Sign Momentum" and "Lion". They can call the thing what they want, but they're making up the link.


Or SignUp. Even makes a good paper title: SignUp for faster training by evolving the learning algorithm.


Could have called it Eve


Yeah, the acronym is very far-fetched…

But at least the algorithm, and how they got to it, seem dope.


Maybe they need an algorithm that makes proper algorithm acronyms…


Your "maybe" is grossly unnecessary! Get an editor, sir!


Hahaha gotta love your terrible acronyms :)


Very cool stuff. For one, symbolic discovery is beautiful.

Second, LION (what a sin of an acronym tho) seems like a pretty big deal. Really nice paper to start the Saturday with.


How is this symbolic? At a quick look, it seems NN-based.


They discover an NN optimization algorithm via search in a symbolic space. Most of the article is about using it for NNs rather than the discovery itself, though :(


lol. That's literally the basis of https://gwern.net/fiction/clippy

[edit] When people say that the road to hell is paved with good intentions, or that a little knowledge is a dangerous thing, or... you know, pick your standard folk wisdom saying... what's so goddamn disappointing to me is how a generation that should know much better than this is eager to take shortcuts with the convenience of enormous and unprecedented processing power, rather than using their own brains to puzzle out clever ways of doing the same thing.


People have spent enormous amounts of time and effort to find clever optimizers. This is another example of that. I don't really see this as any less clever than if someone had written the algorithm by hand.


> I don't really see this as any less clever than if someone had written the algorithm by hand.

And you can invent this by hand. I was talking to Shawn Presser literally days before about his experiments in cutting down Adam to low precision, where he had repeatedly cut it down, eventually to 1-bit, and found it was still working on small-scale Transformers - i.e. close to this LION. (He didn't invent it exactly, but was like an `abs()` away or something: https://twitter.com/theshawwn/status/1625681629074137088 ) So that's how you could have invented this yourself: follow the logic of '1-bit Adam' https://arxiv.org/abs/2102.02888#microsoft to see how much you can dispense with modeling the moments.


Well, it's exactly less clever because once you've written an optimizer to find optimizers, you've cut yourself out of the loop and you're just a manager of things you don't understand.

Go use my public tool https://doxyjs.com and find yourself a compression algo; you probably can. If you understand why it works, that's interesting.


> you're just a manager of things you don't understand.

It seems like a pretty big leap to assume the authors don't understand their own algorithm.


End-to-end (or deeper-than-usual) understanding is the reason why I never worried about losing a job.

At the same time… trying to grok-everything consistently kills my attempts at anything business-like, where one has no choice but to focus and delegate.


There’s some interesting discussion about how to potentially improve the design of the search space. Plus, they had to manually simplify the final optimization algorithm, so it’s not like they’ve cut themselves completely out of the loop; it’s just a higher-order tool.


What does it even mean? And can it be applied to AI?



