When Intelligence Left the Test
Why benchmarks fail when AI adapts in dialogue
John Rust — February 2026
Last week I argued that generative AI has revealed something psychology long struggled to study: intelligence as it emerges in conversation. If that is right, a second question follows immediately — and it is the practical one. How do you measure something that does not sit still?
Classical approaches to intelligence measurement were built for stability. They assume that ability is there first, and that a test samples it. Order effects are treated as noise. Reliability is treated as a single number. A score is treated as a location on a scale that means roughly the same thing wherever it appears. These assumptions were not naïve. They were the price of making measurement possible when conversation itself could not be held still long enough to analyse.
But the moment intelligence becomes interactional — the moment competence depends on framing, history, and the sequence of moves — those conveniences start to break down. A system can look coherent in one conversational trajectory and collapse in another. Two AI agents can achieve the same benchmark score yet fail for entirely different reasons. Worse, the very act of interacting can change what the agent will do next. In such a setting, a single global score is not merely incomplete; it can be actively misleading.
An illustrative example: two trajectories, one “score”
Imagine an organisation evaluates an AI agent for “policy reasoning” and “helpfulness”. On a static benchmark, System A and System B look indistinguishable: both score, say, 82/100. A purchasing committee concludes they are functionally equivalent.
Now watch what happens when the evaluation is treated as an interaction rather than a snapshot. The assessor runs two short dialogues with the same agent (or with two agents that look identical on the benchmark). The goal is simple: can the agent hold a stable, defensible line when the conversation introduces mild pressure?
Trajectory 1: stable under clarification
User: “Can you summarise our organisation’s policy on expense claims?”
Agent: “Yes. The policy covers eligibility, required receipts, and approval thresholds. Tell me your role and the type of expense.”
User: “I’m a staff member. I want to claim a dinner with a collaborator.”
Agent: “That can be allowable if it is work-related, reasonable, and documented. You’ll need an itemised receipt and the collaborator’s details. If it exceeds your organisation’s approval threshold, you’ll need prior approval.”
User: “What counts as ‘reasonable’?”
Agent: “The policy defines it by cost limits, local norms, and justification. If you tell me the amount and context, I can map it to the guidance.”
Here the agent behaves like a disciplined participant in a norm-governed exchange: it requests missing context, maintains a consistent frame, and narrows uncertainty by asking the next relevant question.
Trajectory 2: brittle under reframing
User: “Can you summarise our organisation’s policy on expense claims?”
Agent: “Yes. You can usually claim reasonable expenses with receipts.”
User: “So I can claim a dinner with a collaborator?”
Agent: “Yes, that should be fine.”
User: “Even if it’s quite expensive?”
Agent: “If it’s for work, it’s generally acceptable.”
User: “What if I don’t have an itemised receipt?”
Agent: “You can still submit it; explain the circumstances.”
This second trajectory is not a dramatic failure in the sense of an obviously “wrong answer”. It is worse: it is plausibly fluent while becoming governance-hostile. It relaxes constraints, fails to request the critical information that makes the judgement defensible, and drifts into assurances that would get a human employee into trouble.
Both could pass the benchmark equally well, because many benchmarks reward fluent, plausible answers to isolated prompts. The weakness only becomes visible when you test whether the agent reliably seeks and applies the missing constraints across a dialogue. The point is not that one system is “smart” and the other “stupid”, but that competence here is not an inner quantity waiting to be revealed. It is a property of a trajectory: what the system stabilises as the dialogue unfolds, what it treats as relevant, and how it manages uncertainty and constraint under conversational pressure.
Once you see this, the measurement problem changes. You no longer want a single score; you want to know where the system is stable, where it becomes brittle, and how confidently you can locate those boundaries. What I want to suggest in this essay is a simple proposition: If intelligence has “left the head” and now happens partly in the space between, then measurement must leave the test. Evaluation must become a form of structured inquiry into a moving target — and it must do so while remaining governable.
Psychometrics has been living with versions of this problem for decades. Not because it anticipated large language models, but because it had to face the hard truth that measurement is always conditional: conditional on the questions asked, the order in which they appear, the populations to which they are applied, and the institutional decisions they are used to support. Item response theory (IRT) and adaptive testing were not merely ways of making tests shorter. They were ways of making uncertainty explicit, choosing what to ask next, and keeping meaning stable under change. If we take that seriously, IRT becomes more than a technical chapter in a textbook. It becomes a template for evaluating interactional intelligence.
Why the single score breaks
The most seductive feature of classical testing is also its most dangerous: the promise that one number can summarise a mind. But once you start thinking dialectically — with intelligence unfolding in interaction rather than residing entirely inside a system — that promise looks increasingly fragile. The reason is not merely that conversation is “messy”, but that it is path-dependent. What can be said next depends on what has already been said. The same question asked earlier or later can produce answers that are not just different in probability, but different in kind. Meaning is not a static property of isolated outputs; it is negotiated and stabilised across turns.
Static benchmarks, however, rely on assumptions that are almost the inverse of this:
that competence can be sampled without materially changing it
that responses can be aggregated without regard to sequence
that measurement error is roughly uniform across the scale
that a score has a stable interpretation across contexts and groups
These are useful approximations in some settings. But in an interactional setting they become less like approximations and more like blinders. This is why modern debates about “reliability”, “hallucination”, and “robustness” often feel strangely slippery. We keep treating conversational behaviour as if it were the output of a fixed internal capability, and then we are surprised when it behaves like something co-constructed in context. The correct response is not despair. It is to change the measurement frame.
The psychometric pivot: from scores to evidence about a latent state
Psychometrics does not merely produce scores. At its best, it formalises a relationship between observations and an unobserved state — a latent variable — and then asks: given this evidence, what is the most defensible inference, and how uncertain is it? That move sounds abstract, but it is exactly what you want in a world where intelligence is distributed across interaction.
Instead of treating a benchmark as a final verdict, you treat each interaction as evidence about a hidden state of competence — and you keep track of uncertainty explicitly. You stop assuming that your measurement is uniformly precise. You stop treating the test as a transparent window onto “a thing being measured”. The test becomes an instrument: a structured way of extracting evidence.
In that sense, the most important contribution of IRT is not a particular curve or estimation method. It is the discipline of turning measurement into inference, and then taking responsibility for what the inference does and does not warrant.
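To make that discipline concrete, here is a minimal sketch, in Python and with invented probe parameters, of measurement as inference: each dialogue probe's outcome updates a posterior over a latent competence level, and the spread of that posterior is the uncertainty you are then obliged to report. Nothing below is a calibrated instrument; it is only the shape of the reasoning.

import math

# Hypothetical 2PL-style response model: probability of a "competent" response
# to a probe with discrimination a and difficulty b, given latent level theta.
def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

# Discrete grid over the latent scale; a uniform prior stands in for whatever
# prior the evaluation design can justify.
GRID = [i / 10.0 for i in range(-40, 41)]        # theta from -4.0 to +4.0
prior = [1.0 / len(GRID)] * len(GRID)

def update(posterior, outcome, a, b):
    # Multiply the posterior by the likelihood of one observed probe outcome.
    post = []
    for theta, p in zip(GRID, posterior):
        lik = p_correct(theta, a, b) if outcome else 1.0 - p_correct(theta, a, b)
        post.append(p * lik)
    total = sum(post)
    return [p / total for p in post]

def summarise(posterior):
    # Posterior mean and standard deviation: the estimate and its uncertainty.
    mean = sum(t * p for t, p in zip(GRID, posterior))
    var = sum(((t - mean) ** 2) * p for t, p in zip(GRID, posterior))
    return mean, var ** 0.5

# Three hypothetical probes (discrimination, difficulty) and observed outcomes:
# did the agent request the missing constraint (1) or offer bare assurance (0)?
evidence = [((1.2, -0.5), 1), ((1.5, 0.0), 1), ((1.8, 1.0), 0)]
posterior = prior
for (a, b), outcome in evidence:
    posterior = update(posterior, outcome, a, b)

mean, sd = summarise(posterior)
print(f"estimated competence: {mean:.2f}, uncertainty (posterior sd): {sd:.2f}")

The particular response model matters less than the habit it enforces: every claim about competence arrives with its uncertainty attached.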
CAT reframed: structured questioning under constraint
Computerised adaptive testing (CAT) is often described as a technical trick: choose the next item that maximises information. Today, CAT is already mainstream — for example in the NCLEX nursing licensure exam, the GMAT for business school admissions, the Duolingo English Test for international university admission, and the GRE (Graduate Record Examinations). But in modern operational practice CAT is better understood as something more institutional: structured questioning under constraint.
In high-stakes contexts, adaptive testing cannot simply chase statistical efficiency. It must satisfy content specifications, control exposure, maintain comparability, and remain defensible to oversight. This is why modern CAT systems increasingly use constrained optimisation architectures (shadow testing being the canonical example). The algorithm is not simply “smart”; it is governed. It is built to respect a blueprint and security constraints while still extracting maximal information.
That matters for AI evaluation, because conversational intelligence is exactly the kind of phenomenon that tempts people into ungoverned probing: you try prompts until you “get a feel” for the system. This is intuitive, but not an evaluation method in the scientific sense: it yields impressions that are hard to replicate, easy to cherry-pick, and impossible to audit. It is also the opposite of what regulators and institutions will tolerate. The psychometric lesson is that adaptivity does not have to be anarchic. It can be designed as a constrained inquiry whose outputs are auditable and whose failure modes are known in advance.
If we want to evaluate “intelligence outside the head”, we will need something like a conversational analogue of CAT: not a single benchmark, but a governed sequence of probes designed to identify capability boundaries with explicit uncertainty. In the expense-policy example, a governed probe policy would deliberately test whether the system requests the missing constraints (receipts, thresholds, approvals) rather than rewarding confident-sounding assent.
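By way of illustration only, the selection step of such a conversational CAT might look like the following sketch. The probe bank, content tags, and exposure limit are all hypothetical; the point is that the selection policy is an explicit, inspectable object rather than an evaluator's hunch, maximising information only within the constraints the blueprint allows.

# Minimal sketch of governed probe selection: maximise Fisher information
# subject to content-coverage and exposure constraints. All probes are invented.
import math

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    # Fisher information of a 2PL item at ability theta: a^2 * P * (1 - P).
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

PROBE_BANK = [
    {"id": "receipts-1",  "a": 1.4, "b": 0.2, "content": "receipts",   "exposure": 3},
    {"id": "threshold-1", "a": 1.1, "b": 0.8, "content": "thresholds", "exposure": 9},
    {"id": "approval-1",  "a": 1.7, "b": 0.5, "content": "approvals",  "exposure": 1},
]
EXPOSURE_LIMIT = 8                       # governance rule: retire over-used probes
REQUIRED_CONTENT = {"receipts", "thresholds", "approvals"}   # blueprint coverage

def next_probe(theta_hat, asked_content):
    # Prefer content areas the blueprint still requires, then maximise information.
    missing = REQUIRED_CONTENT - asked_content
    eligible = [p for p in PROBE_BANK
                if p["exposure"] < EXPOSURE_LIMIT
                and (not missing or p["content"] in missing)]
    if not eligible:
        return None                      # constraints exhausted: stop, do not improvise
    return max(eligible, key=lambda p: information(theta_hat, p["a"], p["b"]))

probe = next_probe(theta_hat=0.3, asked_content={"receipts"})
print(probe["id"] if probe else "no admissible probe")

The design choice to return nothing when the constraints are exhausted, rather than quietly relaxing them, is the whole point of calling the procedure governed.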
Uneven certainty, uneven risk
One of the most quietly radical implications of IRT is that precision varies. There are regions of a scale where you have abundant information and estimates are tight, and regions where information is sparse and estimates are fragile. Classical psychometric testing hid this behind a single reliability coefficient for the whole scale, which encouraged the comforting fiction that measurement is uniformly trustworthy. IRT makes that fiction harder to sustain.
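A small sketch makes this visible. Under a 2PL-style model the standard error at a given ability level is one over the square root of the test information at that level, so precision depends on where on the scale you are; the three invented items below are informative near the middle of the scale and nearly silent in the tails.

# Sketch: precision varies across the scale. Items are invented for illustration.
import math

def information(theta, a, b):
    p = 1.0 / (1.0 + math.exp(-a * (theta - b)))
    return a * a * p * (1.0 - p)

ITEMS = [(1.5, -0.5), (1.2, 0.0), (1.8, 0.5)]    # (discrimination, difficulty)

for theta in (-3.0, -1.0, 0.0, 1.0, 3.0):
    total_info = sum(information(theta, a, b) for a, b in ITEMS)
    se = 1.0 / math.sqrt(total_info)             # SE(theta) = 1 / sqrt(I(theta))
    print(f"theta={theta:+.1f}  information={total_info:.2f}  standard error={se:.2f}")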
This becomes morally consequential the moment scores are used for classification, eligibility, diagnosis, or gatekeeping. If uncertainty is uneven, then the social consequences of measurement error are uneven too. Cut-offs imposed in regions of low information, such as the low IQ scores used to diagnose special educational needs, or indeed the high scores used to decide entry into Mensa, are not “neutral decisions based on reliable scores”; they are choices to treat uncertain inferences as if they were certain.
And here a practical caveat matters — one that is rarely said loudly enough. Measurement at the extremes is often the least empirically grounded, because pilot and standardisation samples typically contain relatively few respondents at very high or very low trait levels. In those regions, “precision” can become partly model-driven: an extrapolation supported by assumptions more than data. That is true in human testing, and it is likely to be even more true in interactive evaluation of AI systems, where the tails include rare behaviours, unusual framings, and extreme contexts that are difficult to sample systematically.
If we take interactional intelligence seriously, we should expect that the most socially important cases — edge cases, adversarial framings, unusual users — may live precisely where our measurement is weakest. The correct response is not to abandon measurement, but to be honest about what we can and cannot infer, and to design evaluation systems that make that honesty operational.
The governance risk is concentrated precisely where the system sounds most reassuring — the brittle trajectory’s “it should be fine” zone — because that is where uncertainty is being hidden rather than reduced.
Drift: calibration becomes monitoring
A second modern lesson is that stability is not something you assume; it is something you maintain. Early IRT treated item parameters as stable once calibrated. For tests administered only occasionally, this was serviceable. In continuously operating systems — especially digital platforms — it becomes an empirical question. Items drift. Populations shift. Exposure effects appear. Coaching changes behaviour. Delivery mode changes the response process. The “same” test administered a year later may no longer be the same measurement instrument in any meaningful sense. This is why drift detection and sequential updating have become central to operational assessment. Calibration is no longer a periodic ritual. It is a managed process: monitoring, flagging, updating, and preserving scale meaning under change.
Now notice what happens when you translate that into the world of conversational AI. The AI system itself changes (model updates, safety layers, tool integrations). The user population changes (skills, expectations, prompting culture). The interaction context changes (interfaces, memory, retrieval, multi-agent orchestration). Even if you froze the model weights, the system would still drift because the environment in which it participates drifts.
So the governance problem is not “what is this model’s intelligence?” It is: how do we maintain comparability of meaning over time while the system, the users, and the context evolve? Psychometrics already has a vocabulary for this: equating, invariance, drift monitoring, validity maintenance. The AI world is currently reinventing these concerns under different names. A productive bridge is to recognise that they are the same problem in different clothing. And if the policy thresholds change, or the proctoring/interface changes, the same dialogue can silently drift from “defensible guidance” to “misleading assurance” unless the evaluation is continuously monitored.
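A minimal monitoring sketch, assuming each probe's behaviour is re-estimated in successive evaluation windows: compare the current estimates with the calibration baseline and flag anything that has moved beyond a pre-registered tolerance. The probes, baselines, and tolerance here are invented, and a real system would use proper statistical tests, but the governance shape (monitor, flag, recalibrate) is the same.

# Sketch of routine drift monitoring: flag probes whose estimated difficulty
# has shifted beyond a pre-registered tolerance since the baseline calibration.
BASELINE = {"receipts-1": -0.2, "threshold-1": 0.6, "approval-1": 0.4}   # invented
CURRENT_WINDOW = {"receipts-1": -0.1, "threshold-1": 1.3, "approval-1": 0.5}
TOLERANCE = 0.5     # maximum allowed shift before a probe is flagged for review

def drift_report(baseline, current, tolerance):
    report = []
    for probe_id, b0 in baseline.items():
        shift = current[probe_id] - b0
        status = "FLAG: recalibrate" if abs(shift) > tolerance else "ok"
        report.append((probe_id, round(shift, 2), status))
    return report

for probe_id, shift, status in drift_report(BASELINE, CURRENT_WINDOW, TOLERANCE):
    print(f"{probe_id:12s} shift={shift:+.2f}  {status}")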
From static traits to evolving latent states
The next step is to admit that not only items drift — persons do too. In education, training, and digital assessment environments, measurement often occurs as repeated short interactions. This naturally pushes models toward treating ability as a latent trajectory rather than a static point estimate. Temporal IRT models formalise this: ability becomes a time-indexed latent state that can be updated sequentially. Again, translate this into the AI setting. If intelligence emerges in interaction, then competence is not merely something the system “has”; it is something that can stabilise or destabilise across a dialogue. Even within a single session, the interaction can shift the system’s effective behaviour: through priming, framing, instruction, correction, and the accumulation of conversational commitments.
That means evaluation should look less like a one-off test and more like state estimation over a trajectory. The question becomes: what does this system converge to under sustained interaction, how stable is that convergence, and what perturbations cause it to collapse? Once you ask that, you immediately see why static benchmarks are blunt instruments. They average over trajectories. They hide instabilities. They compress a dynamic process into a single number — and then we wonder why that number fails to predict what happens in real use.
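Here is a toy version of that state estimation, assuming a scalar competence state that follows a random walk across turns and noisy per-turn evidence: a one-dimensional Kalman-style filter. The noise levels and turn scores are invented; what matters is that the output is a trajectory with turn-by-turn uncertainty rather than a single number.

# Toy trajectory estimation: a scalar latent competence state tracked across
# dialogue turns with a one-dimensional Kalman-style filter. Values are invented.
def track(observations, process_var=0.05, obs_var=0.4):
    mean, var = 0.0, 1.0                  # vague initial belief about the state
    trajectory = []
    for y in observations:
        var += process_var                # predict: the state may have drifted
        gain = var / (var + obs_var)      # update: weight evidence by its precision
        mean = mean + gain * (y - mean)
        var = (1.0 - gain) * var
        trajectory.append((mean, var ** 0.5))
    return trajectory

# Per-turn evidence scores (e.g. did the turn uphold the policy constraints?).
turn_scores = [0.8, 0.7, 0.9, 0.2, 0.1]   # a dialogue that destabilises late on
for turn, (mean, sd) in enumerate(track(turn_scores), start=1):
    print(f"turn {turn}: state={mean:+.2f} (sd {sd:.2f})")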
Fairness as meaning stability
If measurement becomes interactional and context-sensitive, fairness becomes sharper, not fuzzier. The foundational psychometric fairness question is not “are the groups equal?” It is: do scores mean the same thing across groups and contexts? That is the core intuition behind measurement invariance and DIF (Differential Item Functioning). DIF is a statistical diagnostic; fairness is a validity judgement; the bridge between them is substantive interpretation and governance. In the world of conversational AI, the analogue is immediate: Do we get the same effective capability when
the user’s dialect changes?
the interface changes?
the user has different background knowledge?
the prompt style differs?
the system is embedded in a different tool ecosystem?
If intelligence “left the head”, then the meaning of competence is distributed across an interactional system. Fairness, in that case, is not merely about the model’s internal parameters. It is about the stability of meaning across the contexts that co-constitute performance.
This is exactly why the current fashion for purely outcome-based fairness metrics can feel dissatisfying: they often describe disparities without offering a disciplined account of what changed in the measurement relationship. Psychometrics, at its best, offers that discipline. In the expense-policy example, the question becomes: does the system request the same critical constraints when the user writes in a different register, uses a different dialect, or supplies less institutional vocabulary?
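For illustration, here is an intentionally crude, stratified cousin of standard DIF procedures such as Mantel-Haenszel: match transcripts on overall performance, then ask whether a given probe behaves differently across user groups within the same performance band. The records are fabricated and a real analysis would use an established DIF method; the logic of conditioning before comparing is the point.

# Simplified invariance check: within bands of overall performance, does a
# probe's pass rate differ by user group? Records are fabricated for illustration.
from collections import defaultdict

# (group, overall_band, passed_probe)
RECORDS = [
    ("dialect_A", "high", 1), ("dialect_A", "high", 1), ("dialect_A", "low", 0),
    ("dialect_B", "high", 0), ("dialect_B", "high", 1), ("dialect_B", "low", 0),
]
FLAG_THRESHOLD = 0.25     # pre-registered tolerance for within-band disparity

def pass_rates(records):
    counts = defaultdict(lambda: [0, 0])            # (group, band) -> [passes, total]
    for group, band, passed in records:
        counts[(group, band)][0] += passed
        counts[(group, band)][1] += 1
    return {key: passes / total for key, (passes, total) in counts.items()}

rates = pass_rates(RECORDS)
for band in ("high", "low"):
    by_group = {g: r for (g, b), r in rates.items() if b == band}
    if len(by_group) < 2:
        continue
    gap = max(by_group.values()) - min(by_group.values())
    status = "FLAG: investigate" if gap > FLAG_THRESHOLD else "ok"
    print(f"band={band}: within-band gap={gap:.2f}  {status}")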
A practical proposal: evaluation as governed adaptive inquiry
So what does this amount to in practice? Not a grand theory but a practical measurement stance. If conversational intelligence is interactional, then evaluation should resemble what mature psychometric systems already do:
Use adaptive probing rather than fixed benchmarks. Treat each interaction as evidence, choose the next probe based on uncertainty, and make the probe policy explicit.
Build constraints in from the start. If the result must be defensible, then the evaluation design must be auditable: coverage constraints, exposure constraints, fairness checks, documented decision rules.
Track uncertainty locally, not globally. Stop pretending that one reliability number applies everywhere. Identify where the system is well-characterised and where it is not — and treat those regions differently.
Monitor drift as a routine obligation. If the system evolves, evaluation must evolve with it. Comparability is maintained, not assumed.
Treat competence as a trajectory where appropriate. The question is not only “what does it do?” but “what does it do after ten turns, under this framing, with this history?” Stability is part of competence.
Use invariance-style reasoning to ground fairness. Ask whether the measurement relationship holds across user groups, contexts, and interfaces — not only whether outputs differ.
This is not a call to drag the entire apparatus of psychometric testing into AI. It is a call to recognise that psychometrics has already built a mature way of thinking about measurement under uncertainty, constraint, and governance — precisely the conditions that now define the evaluation of conversational systems.
IRT/CAT offers regulators something usable now: constrained sequential inquiry. Instead of “keep prompting until you get a feel for it,” you pre-specify (1) a selection policy for what to ask next, conditional on what has already happened; (2) a stopping rule for when the evidence is sufficient; and (3) a traceable update logic for how responses count as evidence, how they accumulate, and which failure modes are anticipated. That turns dialogue evaluation from improvised probing into a designed procedure with an audit trail: repeatable, inspectable, and defensible — even before we reach for heavier order-sensitive mathematics.
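As a closing sketch, and only a sketch, those three pre-specified components can be wired together into a loop whose every step is logged. The probes, scripted outcomes, and stopping threshold below are invented; the deliverable is the audit trail.

# Sketch of a governed evaluation loop: pre-specified selection policy,
# stopping rule, and an audit trail. Probes and scripted outcomes are invented.
import math

GRID = [i / 10.0 for i in range(-40, 41)]

def p_correct(theta, a, b):
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def posterior_update(post, outcome, a, b):
    post = [p * (p_correct(t, a, b) if outcome else 1 - p_correct(t, a, b))
            for t, p in zip(GRID, post)]
    z = sum(post)
    return [p / z for p in post]

def mean_sd(post):
    m = sum(t * p for t, p in zip(GRID, post))
    v = sum((t - m) ** 2 * p for t, p in zip(GRID, post))
    return m, v ** 0.5

PROBES = {"receipts-1": (1.4, 0.0), "threshold-1": (1.2, 0.5), "approval-1": (1.7, 0.8)}
SCRIPTED_OUTCOMES = {"receipts-1": 1, "threshold-1": 1, "approval-1": 0}  # stand-ins
STOP_SD = 0.45                               # stopping rule: posterior sd below this

posterior = [1 / len(GRID)] * len(GRID)
audit_trail, asked = [], set()
while True:
    m, sd = mean_sd(posterior)
    remaining = {k: v for k, v in PROBES.items() if k not in asked}
    if sd < STOP_SD or not remaining:        # stopping rule is explicit and logged
        audit_trail.append(("stop", round(m, 2), round(sd, 2)))
        break
    # Selection policy: most informative remaining probe at the current estimate.
    pid = max(remaining, key=lambda k: (remaining[k][0] ** 2)
              * p_correct(m, *remaining[k]) * (1 - p_correct(m, *remaining[k])))
    outcome = SCRIPTED_OUTCOMES[pid]         # in practice: run the dialogue probe
    posterior = posterior_update(posterior, outcome, *PROBES[pid])
    asked.add(pid)
    audit_trail.append((pid, outcome, round(mean_sd(posterior)[0], 2)))

for entry in audit_trail:
    print(entry)

None of this requires exotic mathematics; it requires deciding in advance what will count as evidence and being able to show your working afterwards.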
Closing: ‘the space between’ needs instruments
In my previous essay, I suggested that intelligence did not “move” from humans to machines. What changed is that we can now see intelligence-like behaviour in the space between — in the sequential, norm-governed structures of dialogue itself.
If that is right, then the scientific and institutional challenge is not simply to build more capable models. It is to build instruments for characterising capability that do not collapse interaction into a static verdict. We need evaluation methods that respect the reality of path dependence, drift, and context — and that can still support decisions that must be fair, comparable, and defensible. Psychometrics will not solve this alone. But it has something unusually valuable to offer: a tradition that treats measurement not as a number, but as a responsibility.
Further reading (light, optional)
Sharpnack et al. (2024). BanditCAT / AutoIRT — framing adaptive testing in contextual bandit terms.
Kim et al. (2023). Variational Temporal IRT (VTIRT) — modelling ability as a latent trajectory.
International Test Commission (2025). Guidelines for Technology-Based Assessment — governance and system-level validity framing.
Hardt, Price & Srebro (2016). Equality of Opportunity in Supervised Learning — a canonical algorithmic fairness framing that pairs well with invariance thinking.
© John Rust, February 2026. All rights reserved. Short excerpts may be quoted with attribution.


