There's a massive reduction in the whale song of the blue whales. Almost halved. They are presumably starving.
That something ginormous can be so elegant, beautiful and sleek is hard to conceive till one meets a blue whale. Let's let them thrive on the blue planet.
The Blue Whale population has actually increased since the 70s. When they were critically endangered, their population numbered roughly 1,000-2,000, but estimates today put the number at roughly ten times that. The 1966 worldwide ban on hunting them has been incredibly successful, and we’ve also seen recoveries in Humpback and Grey Whales.
People go all dopey-eyed about "frequency space"; that's a red herring. The takeaway should be that a problem-centric coordinate system is enormously helpful.
After all, what Copernicus showed is that the mind-bogglingly complicated motion of the planets becomes a whole lot simpler if you change the coordinate system.
The Ptolemaic model of epicycles was an ad hoc form of Fourier analysis: decomposing periodic motions into circles riding on circles.
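To make that concrete, here is a toy sketch (numpy, made-up orbit, nothing astronomical about the numbers): a periodic planar path, treated as a complex signal, decomposes via the FFT into a stack of uniformly rotating circles, and keeping the few largest circles is a truncated epicycle model.

    # Sketch: epicycles as a truncated Fourier series (illustrative only).
    # A periodic planar orbit, viewed as a complex signal z(t), is decomposed by the
    # FFT into rotating circles; keeping the K biggest circles reconstructs it.
    import numpy as np

    N = 256
    t = np.linspace(0, 2 * np.pi, N, endpoint=False)
    # Toy "planet as seen from Earth": a loopy curve made of three circles.
    z = 3.0 * np.exp(1j * t) + 0.7 * np.exp(-2j * t) + 0.2 * np.exp(5j * t)

    coeffs = np.fft.fft(z) / N                 # one complex coefficient per circle
    freqs = np.fft.fftfreq(N, d=1.0 / N)       # signed integer turns per period
    order = np.argsort(-np.abs(coeffs))        # biggest circles first
    K = 3                                      # number of epicycles to keep

    z_approx = sum(coeffs[k] * np.exp(1j * freqs[k] * t) for k in order[:K])
    print("max reconstruction error with", K, "circles:", np.max(np.abs(z - z_approx)))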
Back to frequencies: there is nothing obviously frequency-like in real-space Laplace transforms*. The real insight is that differentiation and integration become simple if the coordinates used are exponential functions, because exponential functions remain (scaled) exponentials when passed through those operations.
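Written out, these are just the standard identities that paragraph is pointing at:

    \frac{d}{dt} e^{st} = s\, e^{st},
    \qquad
    \mathcal{L}\{f'\}(s) = s\, F(s) - f(0),

so in a coordinate system built from exponentials, differentiation becomes multiplication by s and integration becomes division by s.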
For digital signals, what helps is the Walsh-Hadamard basis. They are not like frequencies, nor are they simply the square-wave analogue of sinusoidal waves. People call it sequency space, a well-justified pun.
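A minimal sketch of what that basis looks like, assuming the standard Sylvester construction of the Hadamard matrix (the variable names here are mine); "sequency" is the zero-crossing count that plays the role frequency plays for sines:

    # Sketch: Walsh-Hadamard basis via the Sylvester construction, rows ordered by
    # "sequency" (number of sign changes), the +/-1 analogue of frequency.
    import numpy as np

    def hadamard(n):
        """Sylvester-recursive Hadamard matrix of size n (n a power of two)."""
        H = np.array([[1]])
        while H.shape[0] < n:
            H = np.block([[H, H], [H, -H]])
        return H

    n = 8
    H = hadamard(n)
    sequency = (np.diff(H, axis=1) != 0).sum(axis=1)   # sign changes per row
    W = H[np.argsort(sequency)]                        # Walsh (sequency) ordering

    x = np.random.default_rng(0).standard_normal(n)
    X = W @ x / n                    # Walsh-Hadamard coefficients (rows have squared norm n)
    print(np.allclose(W.T @ X, x))   # exact reconstruction: the rows form an orthogonal basis
    print(np.sort(sequency))         # zero-crossing counts 0, 1, ..., n-1: hence "sequency"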
My suspicion is that we are in a Ptolemaic state as far as GPT-like models are concerned. We will eventually understand them better once we figure out what the better coordinate system is for thinking about their dynamics.
* There is a connection, though, through the exponential form of complex numbers, or, more prosaically, the fact that when you multiply rotation matrices the angles combine additively. So angles and logarithms share a certain character.
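That footnote, spelled out (standard identities):

    e^{i\theta_1}\, e^{i\theta_2} = e^{i(\theta_1 + \theta_2)},
    \qquad
    R(\theta_1)\, R(\theta_2) = R(\theta_1 + \theta_2),

and taking logarithms turns either product into a sum of angles, which is the sense in which angles behave like logarithms.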
All these transforms are switching to an eigenbasis of some differential operator (one that usually corresponds to a differential equation of interest): spherical harmonics; Bessel and Hankel functions, which are the radial analogues of sines/cosines and complex exponentials, respectively; and so on.
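Two standard instances of that statement:

    -\frac{d^2}{dx^2}\, e^{ikx} = k^2\, e^{ikx},
    \qquad
    -\Delta_{S^2}\, Y_{\ell m} = \ell(\ell+1)\, Y_{\ell m},

i.e. complex exponentials diagonalize the 1-D Laplacian, and spherical harmonics diagonalize the Laplacian on the sphere.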
The next big jumps were to collections of functions not parameterized by subsets of R^n. Wavelets use a tree-shaped parameter space.
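A hand-rolled Haar decomposition makes the tree shape visible (a toy sketch, assuming signal lengths that are powers of two; the function names are mine): coefficients are indexed by a (level, position) pair, a binary tree, rather than by a single frequency axis.

    # Sketch: one level of a Haar wavelet split, applied recursively.
    import numpy as np

    def haar(signal):
        """Return ({level: detail coefficients}, final coarse coefficient)."""
        x = np.asarray(signal, dtype=float)
        tree = {}
        level = 0
        while len(x) > 1:
            avg = (x[0::2] + x[1::2]) / np.sqrt(2)   # coarse approximation
            det = (x[0::2] - x[1::2]) / np.sqrt(2)   # detail at this scale
            tree[level] = det                        # len(det) positions at this level
            x, level = avg, level + 1
        return tree, x[0]

    tree, coarse = haar([4, 6, 10, 12, 8, 6, 5, 5])
    for level, det in tree.items():
        print(f"level {level}: {len(det)} positions ->", np.round(det, 3))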
There’s a whole interesting area of overcomplete basis sets that I have been meaning to look into, where you give up orthogonality and all those nice properties in exchange for having multiple options for adapting better to different signal characteristics.
I don’t think these transforms are going to be relevant to understanding neural nets, though. Neural nets are, by their nature, doing something with nonlinear structures in high dimensions that are not smoothly extended across their domain, which is the opposite of the problem all our current approaches to functional analysis deal with.
You may well be right about neural networks. Sometimes models that seem nonlinear turn linear if the nonlinearities are pushed into the basis functions, so one can still hope.
For GPT-like models, I see sentences as trajectories in the embedding space. These trajectories look quite complicated, with nothing obvious about them from a geometric standpoint. My hope is that if we get the coordinate system right, we may see something more intelligible going on.
This is just a hope, a mental bias. I do not have any solid argument for why it should be as I describe.
> Sometimes models that seem nonlinear turn linear if the nonlinearities are pushed into the basis functions, so one can still hope.
That idea was pushed to its limit by Koopman operator theory. The argument sounds quite good at first, but unfortunately it can’t really work for all cases in its current formulation [1].
We know that under benign conditions an infinite-dimensional basis must exist, but finding it from finite samples is very non-trivial; we don't know how to do it in the general case.
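The finite-sample workhorse here is dynamic mode decomposition (and its extended/EDMD variants): fit the best linear operator to snapshot pairs of hand-picked observables. A toy sketch, with a made-up map and observables chosen so that the linear closure happens to be exact, which is precisely the part you don't get for free in general:

    # Sketch: EDMD-style fit of a linear operator A with g(x_{k+1}) ~ A g(x_k).
    import numpy as np

    rng = np.random.default_rng(0)

    def step(x):                                 # a toy nonlinear map (made up)
        return np.array([0.9 * x[0], 0.8 * x[1] + 0.2 * x[0] ** 2])

    def lift(X):                                 # hand-picked observables: x1, x2, x1^2
        return np.vstack([X[0], X[1], X[0] ** 2])

    # Collect snapshot pairs from several short trajectories.
    before, after = [], []
    for _ in range(50):
        x = rng.standard_normal(2)
        for _ in range(10):
            before.append(x)
            x = step(x)
            after.append(x)
    Y0 = lift(np.array(before).T)
    Y1 = lift(np.array(after).T)

    # Best linear operator on the lifted observables, by least squares.
    A = Y1 @ np.linalg.pinv(Y0)
    print(np.round(A, 3))          # closure is exact here: [[0.9,0,0],[0,0.8,0.2],[0,0,0.81]]
    print("eigenvalues:", np.round(np.linalg.eigvals(A), 3))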
I’m not sure what you mean by a change of basis making a nonlinear system linear. A linear system is one where solutions add as elements of a vector space. That’s true no matter what basis you express it in.
For example, if you parameterize the x, y coordinates of a planar circular trajectory in terms of the angle theta, it's a nonlinear function of theta.
However, if you parameterize a point in terms of the tuple (cos \theta, sin \theta), it comes out as a scaled sum. Here we have pushed the nonlinear functions cos and sin inside the basis functions.
A conic section is a nonlinear curve (not a line) when considered in the variables x and y. However, in the basis x^2, xy, y^2, x, y it's linear (well, technically affine).
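A toy sketch of that lifting (numpy, made-up noisy ellipse data; the fit is a one-line homogeneous least squares via the SVD):

    # Sketch: a conic a*x^2 + b*xy + c*y^2 + d*x + e*y + f = 0 is nonlinear in (x, y)
    # but linear in the lifted features (x^2, xy, y^2, x, y, 1), so fitting it becomes
    # a linear-algebra problem.
    import numpy as np

    rng = np.random.default_rng(1)
    t = rng.uniform(0, 2 * np.pi, 100)
    x = 3.0 * np.cos(t) + 0.01 * rng.standard_normal(100)
    y = 1.5 * np.sin(t) + 0.01 * rng.standard_normal(100)

    # Lifted design matrix: each row is (x^2, xy, y^2, x, y, 1) for one point.
    Phi = np.column_stack([x**2, x*y, y**2, x, y, np.ones_like(x)])

    # The conic coefficients are the near-null direction of Phi: the smallest
    # right singular vector. This is the step the lifting buys us.
    _, _, Vt = np.linalg.svd(Phi)
    coeffs = Vt[-1]
    print(np.round(coeffs / coeffs[0], 3))   # approx [1, 0, 4, 0, 0, -9] for x^2 + 4y^2 = 9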
Consider the Naive Bayes classifier. It looks nonlinear until one parameterizes it in log p; then it's linear in the log-probabilities and log-odds.
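For the binary case with conditionally independent features, that's just:

    \log \frac{P(y=1 \mid x)}{P(y=0 \mid x)}
      = \log \frac{P(y=1)}{P(y=0)}
        + \sum_i \log \frac{P(x_i \mid y=1)}{P(x_i \mid y=0)},

an affine function of the per-feature log-likelihood ratios, which is why the decision boundary is linear in that coordinate system.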
If one is ok with an infinite-dimensional basis, this linearisation idea can be pushed much further. Take a look at this if you are interested
From the abstract and from skimming a few sections of the first paper, imho it is not really the same. The paper moves the loss gradient to the tangent dual space where the weights reside, for better performance in gradient descent, but as far as I understand neither the loss function nor the neural net is analyzed in a new way.
The Fourier and wavelet transforms are different: they diagonalize self-adjoint operators (which is why they yield an orthogonal basis) on the space of functions (not on a finite-dimensional vector space of weights that parametrizes a net), and they simplify some usually hard operators, such as derivatives and integrals, by reducing them to multiplications and divisions or to a sparse algebra.
So in a certain sense these methods are looking at projections, which are unhelpful when thinking about NN weights, since the weights are all mixed with each other in a very non-linear way.
Thanks a bunch for the references. Reading the abstracts, these use a different idea compared to what Fourier analysis is about, but they should nonetheless be a very interesting read.
> My suspicion is that we are in a Ptolemaic state as far as GPT-like models are concerned. We will eventually understand them better once we figure out what the better coordinate system is for thinking about their dynamics.
Most deep learning systems are learned matrices that are multiplied by "problem-instance" data matrices to produce a prediction matrix. The time to do said matrix-multiplication is data-independent (assuming that the time to do multiply-adds is data-independent).
If you multiply both sides by the inverse of the learned matrix, you get an equation where finding the prediction matrix is a solving problem, and the time to solve is data-dependent.
Interestingly enough, that time is sort-of proportional to the difficulty of the problem for said data.
Perhaps more interesting is that the inverse matrix seems to have row artifacts that look like things in the training data.
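A toy sketch of the first part of that observation, assuming a made-up square, symmetric positive-definite W so a hand-rolled conjugate-gradient solver applies (nothing here is shaped like a real trained net): the multiply is fixed cost, while recovering the same output from W^{-1} y = x takes a number of iterations that depends on the input.

    import numpy as np

    rng = np.random.default_rng(0)
    n = 200
    Q, _ = np.linalg.qr(rng.standard_normal((n, n)))
    W = Q @ np.diag(rng.uniform(1.0, 10.0, n)) @ Q.T    # toy SPD "learned weights"
    A = np.linalg.inv(W)                                # system matrix of the solve view

    def cg(A, b, tol=1e-8, max_iter=1000):
        """Plain conjugate gradients; returns the solution and the iteration count."""
        x = np.zeros_like(b)
        r = b - A @ x
        p = r.copy()
        rs = r @ r
        for k in range(1, max_iter + 1):
            Ap = A @ p
            alpha = rs / (p @ Ap)
            x += alpha * p
            r -= alpha * Ap
            rs_new = r @ r
            if np.sqrt(rs_new) < tol * np.linalg.norm(b):
                return x, k
            p = r + (rs_new / rs) * p
            rs = rs_new
        return x, max_iter

    for name, x_in in [("eigenvector input", Q[:, 0]),   # aligned with one mode of W
                       ("random input", rng.standard_normal(n))]:
        y_mult = W @ x_in                 # data-independent cost: one matmul
        y_solve, iters = cg(A, x_in)      # data-dependent cost: solve W^{-1} y = x
        print(f"{name}: CG iterations = {iters}, "
              f"agrees with matmul: {np.allclose(y_solve, y_mult, atol=1e-5)}")

The iteration count differs by input; whether that tracks "difficulty" in the parent's sense is the interesting, unproven part.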
I’d argue that most if not all of the math that I learned in school could be distilled down to analyzing problems in the correct coordinate system or domain! The actual manipulation isn’t that esoteric once you get in the right paradigm. And those professors never explained things at that kind of higher theoretical level, all I remember was the nitty gritty of implementation. What a shame. I’m sure there’s higher levels of mathematics that go beyond my simplistic understanding, but I’d argue it’s enough to get one through the full sequence of undergraduate level (electrical) engineering, physics, and calculus.
It’s kind of intriguing that predicting the future state of any quantum system becomes almost trivial—assuming you can diagonalize the Hamiltonian. But good luck with that in general. (In other words, a “simple” reference frame always exists via unitary conjugation, but finding it is very difficult.)
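A small numerical illustration of that (toy random Hermitian H, hbar = 1, names are mine): once you have the eigendecomposition, evolving any state is just attaching phases to its eigencomponents.

    # Sketch: with H = V diag(E) V^dagger known, time evolution is just phases:
    # |psi(t)> = V exp(-i E t) V^dagger |psi(0)>. The eigendecomposition itself is
    # the hard part for generic many-body Hamiltonians.
    import numpy as np

    rng = np.random.default_rng(0)
    d = 6
    M = rng.standard_normal((d, d)) + 1j * rng.standard_normal((d, d))
    H = (M + M.conj().T) / 2                  # toy Hermitian "Hamiltonian"

    E, V = np.linalg.eigh(H)                  # the step that is hard in general
    psi0 = np.zeros(d, dtype=complex)
    psi0[0] = 1.0                             # start in basis state |0>

    def evolve(psi, t):
        """Apply exp(-iHt): rotate into the eigenbasis, attach phases, rotate back."""
        return V @ (np.exp(-1j * E * t) * (V.conj().T @ psi))

    psi_t = evolve(psi0, 2.5)
    # Cross-check against the full propagator built from the same eigendecomposition,
    # and confirm the evolution preserves the norm (unitarity).
    U = V @ np.diag(np.exp(-1j * E * 2.5)) @ V.conj().T
    print(np.allclose(psi_t, U @ psi0), np.isclose(np.linalg.norm(psi_t), 1.0))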
It's not easy to separate cause and effect from direct and strong correlations that we experience.
The job of a scientist is not to give up on a hunch with a flippant "correlation is not causation" but to pursue such hunches and settle them one way or the other (that is, prove or disprove them). It's human to lean a certain way about what could be true.
There's also this notion of holding themselves to their own standards.
They, Newton included, would often feel that their work was not good enough, that it was not completed and perfected yet and therefore would be ammunition for conflict and ridicule.
Gauss did not publicize his work on complex numbers because he thought he would be attacked for it. To us that may seem weird, but there is no dearth of examples of people who were attacked for their mostly correct ideas.
Deadly or life changing attacks notwithstanding, I can certainly sympathize. There's not in figuring things out, but the process of communicating that can be full of tediousness and drama that one maybe tempted to do without.
Weird typo in what I wrote. It's past the edit window. This is what I had meant to type:
There's joy in figuring things out, but the process of communicating what has been so figured can be tedious and full of drama -- the kind of drama that one may be tempted to do without.