
NIAH performance is a misleading indicator of performance on the tasks people really want long context for. It's great as a smoke/regression test: if you're bad on NIAH, you're not gonna do well on the more holistic evals.

But the long context eval they used (MRCR) is limited. It's multi-needle, so that's a start, but it's not evaluating long-range dependency resolution or topic modeling, which are the things you actually care about beyond raw retrieval for downstream tasks. Better than nothing, but not great for just throwing a pile of text at it and hoping for the best. Particularly for out-of-distribution token sequences.
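
To make the raw-retrieval point concrete: a bare multi-needle check boils down to something like the sketch below (hypothetical helper names, purely illustrative, not any benchmark's actual harness). Note that the scoring only asks whether the planted facts come back out; nothing tests whether the model can relate them to each other across the context.

    import random

    def build_multi_needle_prompt(filler_paragraphs, needles, question):
        # Scatter several key facts ("needles") at random offsets in filler text,
        # then ask the model to retrieve all of them in one pass.
        docs = list(filler_paragraphs)
        for needle in needles:
            docs.insert(random.randrange(len(docs) + 1), needle)
        context = "\n\n".join(docs)
        return f"{context}\n\nQuestion: {question}\nAnswer:"

    def score_retrieval(model_answer, needles):
        # Pure retrieval scoring: fraction of planted facts reproduced verbatim.
        # Nothing here checks long-range dependencies between the facts.
        hits = sum(1 for n in needles if n.lower() in model_answer.lower())
        return hits / len(needles)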

I do give Google some credit, though: they didn't try to hide how poorly they did on that eval. But there's a reason you don't see them adding RULER, HELMET, or LongProc to this. The performance is abysmal after ~32k.

EDIT: I still love using 2.5 Pro for a ton of different tasks. I just tend to have all my custom agents compress the context aggressively for any long context or long horizon tasks.
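
(The compression itself is nothing fancy; roughly this shape, where summarize and count_tokens are stand-ins for whatever model call and tokenizer you're using, and the 32k budget is just where I tend to see quality drop:)

    def compress_context(chunks, summarize, count_tokens, budget=32_000):
        # Summarize the oldest chunks first until the total fits under the budget.
        # `summarize` and `count_tokens` are placeholders for your own model call
        # and tokenizer; the default budget is illustrative, not tuned.
        compressed = list(chunks)
        i = 0
        while sum(count_tokens(c) for c in compressed) > budget and i < len(compressed):
            compressed[i] = summarize(compressed[i])
            i += 1
        return "\n\n".join(compressed)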



> The performance is abysmal after ~32k.

Huh. We've not seen this in real-world use. 2.5 Pro has been the only model where you can throw a bunch of docs into it, give it a "template" document (report, proposal, etc.), even some other-project-example stuff, and tell it to gather all relevant context from each file and produce the "template", and it does surprisingly well. Couldn't reproduce this with any other top-tier model at this level of quality.


Are you by any chance a lawyer? I’m asking because I’m genuinely curious whether lawyers are starting to use the SOTA LLMs in day-to-day drafting and review work. As a CEO, I use LLMs as a poor substitute for my in-house counsel when I just need _an_ answer quickly (i.e. when counsel is literally asleep); however, for anything serious, I always defer to them, because I know LLMs make mistakes and obviously cannot offer professional liability cover.


We're a G-suite shop, so I set aside a ton of time trying to get 2.5 Pro to work for us. I'm not entirely unhappy with it, it's a highly capable model, but the long-context implosion significantly limits it for the majority of task domains.

We have long context evals on internal data for exactly this (modeled after LongProc specifically), and the performance across the board is pretty bad. Task-wise, it's about as real world as it gets for us: production data, spanning summarization, Q&A, coding, reasoning, etc.

But I think this is where the in-distribution vs out-of-distribution distinction really carries weight. If the model has seen more instances of your token sequences in training and thus has more stable semantic representations of them in latent space, it would make sense that it would perform better on average.

In my case, the public evals align very closely with performance on internal enterprise data. They both tank pretty hard. Notably, this is true for all models after a certain context cliff. The flagship frontier models predictably do the best.


MRCR does go significantly beyond multi-needle retrieval - that's why the performance drops off as a function of context length. It's still a very simple task (reproduce the i^th essay about rocks), but it's very much not solved.
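
Roughly, the task shape is something like this (my paraphrase of the setup, not the paper's actual harness; names are made up):

    import random

    def build_mrcr_prompt(rock_essays, distractor_turns, i):
        # Interleave several near-identical "essay about rocks" turns with
        # distractor conversation, preserving their relative order, then ask the
        # model to reproduce the i-th one. Getting this right depends on tracking
        # ordering across the whole context, not on spotting one unique needle.
        turns = []
        remaining = list(distractor_turns)
        for essay in rock_essays:
            k = random.randint(0, len(remaining))
            turns.extend(remaining[:k])
            remaining = remaining[k:]
            turns.append(essay)
        turns.extend(remaining)
        context = "\n\n".join(turns)
        return f"{context}\n\nReproduce essay number {i} about rocks, verbatim."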

See contextarena.ai and the original paper https://arxiv.org/abs/2409.12640

It also seems to match up well with evals like https://fiction.live/stories/Fiction-liveBench-Feb-21-2025/o...

The other evals you mention are not necessarily harder than this relatively simple one.


Sure. I didn't imply (or at least didn't mean to imply) that I thought MRCR was solved; I was only pointing out that it's closer to testing raw retrieval than to testing long-range dependency resolution the way LongProc does. If retrieval is great but the model still implodes on the downstream task, the benchmark doesn't tell you the whole story. The point of my original comment was that even the frontier models are nowhere near as good at long-context tasks as what I see anecdotally claimed about them in the wild.

> The other evals you mention are not necessarily harder than this relatively simple one.

If you're comparing MRCR to, for example, LongProc, I do think the latter is much harder, or at least much more applicable to long-horizon task domains where long context accumulates over time. But I think it's probably more accurate to say it's a more holistic, granular eval by comparison.

The tasks require the model to synthesize and reason over information that is scattered throughout the input context and across previously generated output segments. Additionally, the required output is lengthy (up to 8K tokens) and must adhere to a specific, structured format. The scoring is also more flexible than MRCR: you can use row-level F1 scores for tables, execution-based checks for code, or exact matches for formatted traces.
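
For the table case, the row-level F1 is roughly this (my own sketch, not the LongProc reference implementation; it assumes each row is a list of cell strings):

    def row_f1(predicted_rows, gold_rows):
        # Row-level F1 for table outputs: a predicted row counts as a hit only
        # if it matches a gold row exactly (after whitespace normalization).
        norm = lambda row: tuple(cell.strip() for cell in row)
        pred = {norm(r) for r in predicted_rows}
        gold = {norm(r) for r in gold_rows}
        if not pred or not gold:
            return 0.0
        tp = len(pred & gold)
        precision = tp / len(pred)
        recall = tp / len(gold)
        if precision + recall == 0:
            return 0.0
        return 2 * precision * recall / (precision + recall)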

Just like NIAH, I don't think MRCR should be thrown out wholesale. I just don't think it can be pressed into service as a more realistic measure of long-context performance.

EDIT: also wanted to note that using both types of evals in tandem is very useful for research and training/finetuning. If LongProc tanks and you don't have the NIAH/MRCR context, it's hard to know which capabilities are regressing. So using both in a hybrid eval approach is valuable in certain contexts. For end users only trying to gauge current inference-time performance, I think evals like RULER and LongProc have much higher value.


Right, the way I see it, MRCR isn't a retrieval task in the same vein as RULER. It’s less about finding one (or multiple) specific facts and more about piecing together scattered information to figure out the ordering of a set of relevant keys. Of course, it’s still a fairly simple challenge in the grand scheme of things.

LongProc looks like a fantastic test for a different but related problem: getting models to generate long answers. It seems to measure a skill the others don't. Meanwhile, RULER feels even more artificial than MRCR, since it's almost entirely focused on that simple "find the fact" skill.

But I think you're spot-on with the main takeaway: the best frontier models are still struggling with long context. The DeepMind team points this out in the paper with that Pokemon example and the MRCR evaluation scores themselves.



