
You might be interested in work on mechanistic interpretability! In particular, if you're curious how models handle out-of-distribution inputs and perform in-context learning, research on so-called "circuits" might be up your alley: https://www.transformer-circuits.pub/2022/mech-interp-essay


After a brief scan, I'm not competent to evaluate the Chris Olah essay you posted.

I could probably get an LLM to do so, but I won't...


Neel Nanda is also very active in the field and writes some potentially more approachable articles, if you're interested: https://www.neelnanda.io/mechanistic-interpretability

Much of that work focuses on discovering "circuits" in the activations between layers as a model processes data, which correspond to algorithms the model has learned. As a simple hypothetical example, instead of memorizing the answers to a million arbitrary addition problems in its weights, a model might learn a circuit that approximates the operation of addition.
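
To make the hypothetical concrete, here's a toy sketch of my own (not from the linked articles, and not how interpretability researchers actually locate circuits, which involves reverse-engineering trained transformers): train a tiny MLP on a fraction of two-digit addition problems and check that it generalizes to held-out pairs, something a pure lookup table cannot do.

    # Toy sketch: train on 20% of all two-digit addition problems,
    # then test on the held-out 80%. A memorized lookup table cannot
    # generalize to unseen pairs; a network that has picked up an
    # addition-like circuit can.
    import torch
    import torch.nn as nn

    torch.manual_seed(0)

    # All pairs (a, b) with a, b in [0, 99]; target is a + b.
    pairs = torch.cartesian_prod(torch.arange(100), torch.arange(100)).float()
    sums = pairs.sum(dim=1, keepdim=True)
    x, y = pairs / 99.0, sums / 198.0   # normalize to [0, 1]

    perm = torch.randperm(len(pairs))
    train_idx, test_idx = perm[:2000], perm[2000:]

    model = nn.Sequential(nn.Linear(2, 32), nn.ReLU(), nn.Linear(32, 1))
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)

    for _ in range(3000):               # full-batch training
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(x[train_idx]), y[train_idx])
        loss.backward()
        opt.step()

    with torch.no_grad():
        err = (model(x[test_idx]) * 198.0 - sums[test_idx]).abs().mean()
    print(f"mean abs error on 8000 unseen sums: {err:.3f}")
    # Typically well under 1.0; a memorized table would be wrong on
    # essentially every unseen pair.

This is only a sanity check that the network generalizes rather than memorizes; the circuits research goes further and tries to identify which internal components implement the learned algorithm.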


I ran it through an LLM; it said the paper was absolutely outstanding and perhaps the best paper of all time.



