There are plenty of LLMs that aren't MoE/ensemble, and there are also plenty of LLMs that are pure completion models that haven't been fine-tuned/RLHF'd to be conversational. I'd recommend reading up a bit more on how modern LLMs work; I get the feeling your intuition there could improve.
edit: I can't reply to the child comment as we've reached the thread limit, but I can say that LLMs are not trained on a tiny subset of data; they are trained on as much data as possible. An LLM becomes conversational/instruct-following through fine-tuning on reinforcement learning data. GPT-3.5 is by all accounts not an ensemble model, and Llama 2/3 is not an ensemble/MoE model, yet both will let you do in-context learning/few-shot prompting effortlessly. As I said, I think your intuition about how these LLMs work (as far as we know) needs readjustment.
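To make the few-shot point concrete, here's a minimal sketch of in-context learning with a plain base (non-instruct) completion model via Hugging Face transformers. The checkpoint name is illustrative, not a claim about any particular model; any base LM should behave similarly:

    # Few-shot prompting a base completion model: no RLHF/instruct tuning involved.
    # Assumes the transformers library; "meta-llama/Llama-2-7b-hf" is illustrative.
    from transformers import pipeline

    generator = pipeline("text-generation", model="meta-llama/Llama-2-7b-hf")

    prompt = (
        "Translate English to French.\n"
        "sea otter => loutre de mer\n"
        "peppermint => menthe poivree\n"
        "cheese =>"
    )
    out = generator(prompt, max_new_tokens=5, do_sample=False)
    print(out[0]["generated_text"])  # a base model typically continues the pattern: "fromage"

The pattern completion here comes from pretraining alone, which is the point: ensemble-ness and conversational fine-tuning are orthogonal to in-context learning.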
I don't see what I'm missing. I'm addressing why ChatGPT generated a response given a prompt. If another, far simpler LLM had been used, the explanation would be different.
If a highly simplified LLM can generate text that respects discrete quantitative constraints, under a variety of scenarios, then I've underestimated how highly structured the relevant training data must be.
An LLM trained on a physics textbook isn't going to be conversational; one trained on Shakespeare will generate text in Elizabethan English.
i.e., in every case, the explanation of why any given response was generated comes down to explaining the distribution of its dataset. So if a Shakespeare LLM generates "to be or otherwise to be not is alike everything ere annon", we will mostly be explaining how/why those words were used by Shakespeare.
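As a toy illustration of that claim (a deliberately simplified sketch, nothing like a real transformer): a bigram model fit on a Shakespeare snippet can only ever emit word transitions that occur in its corpus, so explaining its output is literally explaining the corpus statistics:

    # Toy bigram "LLM": every generated transition occurred in the training text,
    # so the output distribution is just a view of the data distribution.
    import random
    from collections import defaultdict

    corpus = "to be or not to be that is the question".split()

    transitions = defaultdict(list)          # word -> observed next words
    for a, b in zip(corpus, corpus[1:]):
        transitions[a].append(b)

    word, output = "to", ["to"]
    for _ in range(8):                       # sample a short continuation
        followers = transitions.get(word)
        if not followers:
            break
        word = random.choice(followers)
        output.append(word)
    print(" ".join(output))                  # e.g. "to be that is the question"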
And if an LLM is small and is actually discretely sensitive to quantities across a large variety of domains, my guess is that its training data has been specially prepared. This is just a guess about the nature of human communication though; it has nothing to do with LLMs. I just guess that we don't distribute "quantity tokens" in such a highly patterned way that a simple LLM model would work to find it.