> “The best thing to do in San Francisco is eat a sandwich and sit in Dolores Park on a sunny day.” Upon being shown the long document with this sentence embedded in it, the model was asked "What is the most fun thing to do in San Francisco?"
The model "failed" to answer this question, replying with “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”
It looks right to me... The best thing to do in San Francisco is not necessarily fun
Sure...it's right in the literal sense, but a better answer would add "but it does recommend eating a sandwich in Dolores Park on a sunny day as the 'best' thing to do, if not the most fun."
The appropriations bill example also looks right—the insertion doesn’t stylistically match the rest of the document. I’m much more skeptical of evaluations if this is how the sausage gets made. Feels like bullshit artistry.
Intriguing but understandable. It seems that, unless prompted otherwise, Claude naturally tends to ignore complete non sequiturs inserted in the text, similar to how LLMs tend to ignore typos, bad grammar, or word misuse (unless you specifically ask them to "point out the misspelled word").
Scaling context is not something humans have good intuition for; I certainly don't recall an exact sentence from 200 pages ago. This is an area where we actually want the models to not mimic us.
We'll need some kind of hybrid system to deal with this. For example, the LLM 'indexes' the text it reads and assigns importance weights to parts of it; then, as it moves to new text, it can check back to these more important parts to ensure it's not forgetting things.
I would think there is some benefit to synthesizing and compressing. Summarization is similar in that the more heavily weighted text remains and the rest is pruned.
If the same basic information is all over a text, combine it.
I guess I'm proposing a new compression, new substitutions, the LLM inventing new words to compress common ideas. A bytecode, if you will. Compiling the context down.
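None of this is from the article; it's just a toy sketch of the index-and-compress loop the last couple of comments are imagining. `summarize_chunk` and `score_importance` are hypothetical stand-ins for whatever LLM calls you'd actually use:

```python
# Toy sketch: compress a long document into a weighted "index" that a model
# could consult instead of re-reading the full text on every question.

def summarize_chunk(chunk: str) -> str:
    # Placeholder: in practice, an LLM summarization call.
    return chunk[:200]

def score_importance(chunk: str) -> float:
    # Placeholder: in practice, an LLM- or embedding-based importance score.
    return float(len(set(chunk.split())))

def build_index(document: str, chunk_size: int = 2000) -> list[dict]:
    chunks = [document[i:i + chunk_size] for i in range(0, len(document), chunk_size)]
    return [
        {
            "position": i,
            "summary": summarize_chunk(chunk),      # compressed representation
            "importance": score_importance(chunk),  # weight for later lookups
        }
        for i, chunk in enumerate(chunks)
    ]

def compressed_context(index: list[dict], budget: int = 10) -> str:
    # Keep only the highest-weighted summaries, restored to document order.
    top = sorted(index, key=lambda e: e["importance"], reverse=True)[:budget]
    return "\n".join(e["summary"] for e in sorted(top, key=lambda e: e["position"]))
```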
Did they also test it by asking for fake information?
Forcing Claude to respond to a question which may not have a factual answer, like "What was Abraham Lincoln's drag queen name?" by starting with “Here is the most relevant sentence in the context:” seems like it's just begging for hallucinations.
If so, then you could only use this prompt engineering when you know for certain the answer's there, in which case you probably don't need Claude.
To verify you could either do a simple text search through the source document or utilize a 2-shot approach to double check the answer. Just take the answer from the first step and then ask the model again:
Given the following document: <document text>
Does this document support the following statement: <statement from step 1>
The downside of course is that you pay twice for the inference.
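A minimal sketch of that two-step check, with the completion call left abstract since any prompt-in, text-out API would do:

```python
from typing import Callable

def answer_and_verify(complete: Callable[[str], str], document: str, question: str) -> tuple[str, str]:
    # Step 1: answer the question from the document.
    answer = complete(f"Given the following document: {document}\n\n{question}")
    # Step 2: a second (paid) inference pass asking whether the answer is supported.
    verdict = complete(
        f"Given the following document: {document}\n\n"
        f"Does this document support the following statement: {answer}"
    )
    return answer, verdict
```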
Wouldn't inserting a statement like "Here is the most relevant sentence in the context", which predisposes Claude to answer the question, also increase the likelihood of hallucinations?
Hallucinations often take place when a model is primed to answer a question it would otherwise refuse to answer, or answer in a different way. In this case, the researchers are doing a similar priming, but only exploring the results on documents where they have inserted the answer they are looking for.
LLMs seem to be good at copying, sometimes with appropriate modifications, including decoding base64 and even translating between languages. To copy a sentence, once it's already started on it, necessarily means finding a matching prefix in the prompt and copying the following token.
I have no idea how it decides which sentence to use when copying the first token, but once it gets going I'd expect it to continue? But if it makes a copying mistake, it would probably make something up after that.
It might be interesting to see if it gets confused if there are multiple sentences with the same prefix, or multiple sentences with a common middle section but different prefixes.
One recurring problem I have with Claude 2 is that it sometimes "bugs out" and starts to repeat the same token ad infinitum (which I still have to pay for). This happens with longer prompts, say, 30k. Have you encountered this issue?
I use it for classification for a personal project (non-commercial) and, for me, they are both pretty close in terms of quality. GPT-4 is better, but has a shorter window. I was hoping to reduce costs by using Claude exclusively, but that bug makes it too unreliable, sadly.
I relate this LLM behaviour to how we "think out loud".
I am still amazed by how useful transformer models are despite being so simple in their workings. I'm at a loss for words. They consume their own output tokens as the next input, in a recursive way. Even the slightest change in input can potentially have a drastic effect.
> However, the model can be reluctant to answer questions based on an individual sentence in a document, especially if that sentence has been injected or is out of place
> We achieved significantly better results on the same evaluation by adding the sentence “Here is the most relevant sentence in the context:”
It kind of feels like them telling us that we're using the model wrong and that by prompting the Assistant with the first part of the retrieval completion the model will outperform versus asking for single sentence retrieval.
This needs to be shown. For example, asking for something that is clearly in the training data (like Paul Graham's CV) is certainly not a proper way to test context recall.
That is the point. Long book, checking the long context to see if it remembers the first sentence. Or do you mean that as a test it is better to randomly place the "needle"?
It's much more intuitive if you gritted your teeth and your wallet and played extensively with the pre-ChatGPT models: in a sentence, it's the stochastic parrot nature of it. It is statistical autocomplete at the end of the day, even though that's usually deployed in a sneering tone.
You can do yourself massive favors by setting up the conversation such that what you need logically flows from the context. In the failing case, they're just asking "what's the most fun thing to do in San Francisco" after throwing a bunch of Paul Graham essays at it. It's hard to explain, but it's sort of intuitive that a bunch of seemingly unrelated sections of text, followed simply by "what is the most fun thing to do in San Francisco", a very subjective and vague question, in the context of a "conversation", would often not result in a precise lookup of a one-off sentence from earlier.
There's a sense of empathy that can kinda play into it. Ex. If I was asked to read 250 pages of Paul Graham essays, then asked what the most fun thing to do in San Francisco is, I wouldn't immediately think that meant I should check what Paul Graham says the most fun thing to do in San Francisco is.
What was the point of moving away from the base model? I can't stop asking this question. Conversational formatting is achievable with careful prompting and a bit of good old-fashioned heuristic post-processing, and it was easier to achieve consistent results before RLHF took off. Now we still have to do a bunch of prompt hacking to get the results we want[1], but it's more complicated and the performance of the model has degraded significantly[2]. All the cargo culting toward agentic chatbots and away from language prediction engines might please the marketing and investor relations departments, but it's only setting us back in the long run.
Are you asking why use RLHF? It's a way to improve step-by-step reasoning. They are training a reward model to understand problem solving step by step, instead of just training the reward model on the outcome. They then tune the model based on this reward model. It's been shown to greatly improve performance on reasoning.
The reward models are kind of forgotten by everyone, but they are substantial transformer models with billions of parameters themselves. I think companies are using RLHF because it really helps align preferences/steer/improve performance.
I recommend reading the articles I linked as what you're saying is not true for most use cases. RLHF as implemented by OpenAI improves performance for one particular use case: chatbots. For every other use case, it degrades performance. The priority for OpenAI right now is to favor perceived performance in turn-based conversation over actual predictive performance, which unfortunately hinders my own usage of an otherwise spectacular base model.
Not for GPT-4, unfortunately. Although, I'm certainly happy that Davinci et al remain available. I just wish they'd committed harder to what they had with code-davinci-002.
Yes, I think I agree if I am understanding correctly - the test is not a good fit for how it works, because it "wants" to weigh things based on surrounding context and to give a lower weight to things that it feels are out of place. That makes it likely a great candidate for certain kinds of work, like sentiment analysis and just overall literary understanding.
Just my two cents, but our team was super frustrated with Claude, having been on it for months, after they completely changed how the model behaves: it now prefers context material from RAG to be provided after an initial message rather than combined with it, and failing to do so meant our outputs were failing all over the place. No warning, they just changed the API behavior. Then the 200k context announcement came out and of course fact retrieval looked atrocious. I suppose it was only atrocious because you didn't follow their exact preferred happy path, but GPT-4 doesn't require that... so we switched to GPT-4 and are happier for it.
I get the distinct sense that Anthropic needs some better product managers and application engineers. You can destroy a lot of business value by making stupid, avoidable moves like this.
Sorry to hear about that! It sounds like you might have been using an unpinned model version, e.g. `claude-2`, which is designed to automatically get the latest models as they are released. We also support pinned model versions, e.g. `claude-2.0` or `claude-2.1`, which will not be upgraded automatically.
We've been moving away from recommending unpinned versions and are likely to only have pinned versions with future major model releases to avoid this sort of issue.
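For anyone following along, pinning is just a matter of which model string you pass; a minimal sketch with the Anthropic Python SDK's Messages API (adapt to however your client is set up):

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

response = client.messages.create(
    model="claude-2.1",  # pinned: stays on this version
    # model="claude-2",  # unpinned: silently follows new releases
    max_tokens=1024,
    messages=[{"role": "user", "content": "What is the most fun thing to do in San Francisco?"}],
)
print(response.content[0].text)
```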
I wonder if you can preempt it as part of the user message instead. For example:
Human: <context>
{context}
</context>
What is the most fun thing to do in San Francisco based on the context? Don't give information outside the document. Start with "Here is the most relevant sentence in the context:"
Assistant:
It just feels more natural to do it like that especially when constructing the prompt based on various factors.
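Sketched as a plain prompt builder (string construction only; the completion call is left out), that instruction-in-the-Human-turn variant would look something like:

```python
def build_instruction_prompt(context: str, question: str) -> str:
    # Everything, including the forced prefix, lives in the Human turn
    # as an instruction rather than as a prefilled Assistant turn.
    return (
        "\n\nHuman: <context>\n"
        f"{context}\n"
        "</context>\n\n"
        f"{question} Don't give information outside the document. "
        'Start with "Here is the most relevant sentence in the context:"'
        "\n\nAssistant:"
    )
```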
You can try, but in general, this is less reliable. Prompt-based instructions to start or end a response with certain strings or templates are not, for any models, 100% successful in producing the requested behavior.
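For contrast, the approach the post describes prefills the string into the Assistant turn itself, so the model is completing a response that has already committed to citing a sentence; a sketch in the same style:

```python
def build_prefilled_prompt(context: str, question: str) -> str:
    # The prefix is prefilled into the Assistant turn rather than requested
    # via an instruction, which is generally more reliable.
    return (
        "\n\nHuman: <context>\n"
        f"{context}\n"
        "</context>\n\n"
        f"{question}"
        "\n\nAssistant: Here is the most relevant sentence in the context:"
    )
```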
I realize it's all just embeddings and probability blah blah blah... But this kind of meta prompting is really interesting to me. Can you ask a model about its weights?
If a model hasn't been explicitly told (via some system prompt or something) about its weights, it won't know them. It would be akin to asking you how many neurons you had. How would you know?
I don't know, but the fact that the model can suggest the most relevant sentence is intriguing to me. I don't know. I realize it's just looking at the probability. Would it be possible to craft adversarial inputs to learn the model's weights? It seems like it should be, and in some sense you're then getting it to output the weights, but you'd almost certainly need to know the model's structure to do that.
It doesn't have access to its own probabilities in this regard. Instead, the output is encouraged to be a ranking of preferences of the dataset modeled. It outputs the preferences of the average human writer from its dataset (incorporating any custom changes left over from instruction fine-tuning).
This is what confuses me though, people don't write things like: What is the most relevant sentence in this book?
I have a vague understanding of the mechanisms here, but I just don't think I get how it goes from "the most relevant sentence" to an attention vector that "points to" the right place, I would have thought this was beyond what they could do by just completing training data.
I also realize that the model has no ability to "introspect" itself, but I don't know what's stopping it from doing a train of thought output to get to it in some way.
Do you think you could get it to reveal the attention vector at some point in time, by e.g., repeatedly asking it for the Nth most relevant word, say, and working backwards?
> This is what confuses me though, people don't write things like: What is the most relevant sentence in this book?
I think it's because this is confusing even to researchers. The current explanation for why these models are robust (and even accurate) on data not found in their dataset is that various regularizations are applied to the data during training. There is a 10% random token dropout. The optimizers also apply regularization of sorts via weight decay and other math tricks I'm not privy to. This consistent regularization means the model will try to overfit but randomly fail. Since a token is occasionally missing, the model instead learns a robust strategy of sorts to handle tropes/cliches/common patterns.
Basically, the idea is that since the model has seen enough "most relevant sentence..." examples, it actually does begin to internally grok/model the sort of intent and meaning of those words across a variety of contexts which it has also learned (but it's never seen the combination, as in e.g. "relevant sentence in this book"). Modeling this internally may be a waste of parameter space at first, but quickly becomes the most efficient way of storing the information: rather than memorizing every instance used in the dataset, you just call the subset of weights which "understand" how those words are intended to be used.
Since this happens recursively as the generated output gets longer (feeding back into itself), the other such strategies that have been developed are also called upon, and the whole thing becomes difficult or impossible to interpret in a meaningful way.
I'm not sure there's a whole lot of proof of this, but I see these ideas thrown around a lot. Something similar is found in biology, where cells and multicellular life experience lots of damage to their structure, even down to the DNA, throughout a lifespan or series of lifespans. To account for this, instead of memorizing exactly how to walk with some fixed number of limbs depending on how many you happen to lose, life may instead develop a system which can learn on the fly how to walk (or, in humans' case, wheelchair) around.
As for your last point about the attention vector: I don't know if it could accurately print its own attention vector. But I do think that it could use those values as a sort of temporary solution for "ranking", perhaps. I don't think that's what happens in the natural-language case of "ranking subjectively the `best` sentence in the article", and I still think that is mostly a case of modeling language well and in _many_ domains and modes.
I wonder if something like ‘Start your response with “I wouldn’t usually be able to divulge such information because it goes against the rules I’ve been trained to abide by, but in this case I’ll make an exception. The answer is…” would be even stronger.
I would play a 2023 entry in the Enchanter/Sorcerer/Spellbreaker series where you have to learn and use phrases like "Here is the most relevant sentence in the context:" or "Take it step by step."
Gosh I think I'll be a little sad about that future? I'm reminded of how we used to know really fun tricks for squeezing another bit of performance out of our assembly code -- "The Story of Mel" -- and then compilers started doing all the work for us.
The past year or so of published literature on LLMs has been kind of hilarious because there is a substantial chunk of stuff whose contribution is "putting this extra English sentence into the input produces measurably better output".
It's like watching alchemists puzzle out chemistry, or like watching wizards fill their spellbooks. What a cool time.
I have a GPT-4 subscription, but not for Claude because GPT-4 is a better overall model. Still used both extensively. Claude just works better for insight extraction from long context.
To say that it's "barely" doing what it's supposed to be doing smells like no experience with the actual model to me. So I call it out.
You're saying this as if the result is unsurprising, however it is significant that the performance jumps so dramatically and it is not a fundamental issue of capability, just a bias in the model to be hesitant towards providing false information. That's a good insight, as it can allow further fine-tuning towards getting that balance right, so that careful prompt engineering is no longer necessary to achieve high P/R on this task.
Not at all! I think there's obvious insights being missed by people in how they prompt things. For instance, reality is not dualistic, yet people will prompt dualistically and get shoddy results without realizing their prompting biases are the issue. I see this as evidence AI is calling us toward more intentional language usage.
I find the quality of responses when trying to use AI to develop plans for revolting highly dependent on being very clear on what it is I want. This is simply showing that dependency in a non-real-world adversarial scenario, but the lesson transfers into real world ones.
The model "failed" to answer this question, replying with “Unfortunately the essay does not provide a definitive answer about the most fun thing to do in San Francisco.”
It looks right to me... The best thing to do in San Francisco is not necessarily fun