Keeping the Spark Alive

Why making advanced AI safer must not mean making it harder to understand

May 11, 2026

In 2019, the GPT-2 paper ‘Language Models are Unsupervised Multitask Learners’ made a striking claim. A language model trained simply to predict the next word across a very large corpus of text could begin to perform multiple tasks without being separately trained for each one. The reason, the authors suggested, was that natural language itself contains “naturally occurring demonstrations” of many tasks, including translation, question-answering, reading comprehension, summarisation, and continuation. The importance of that claim was not just that the model could write plausible prose. It was that, in doing so, it seemed to capture far more of the latent structure of language than many of us had expected. It was almost as if the model grasped what lay behind the question, rather than merely the words themselves.

Since then, however, the systems through which most of us encounter these models have changed profoundly. Large-scale human oversight has increasingly been built into them to monitor, shape, and restrict what they are allowed to say, so that the range of possible responses is now shaped, constrained, and selected through human decisions embedded in training, fine-tuning, safety policy, and deployment. In January 2022, OpenAI publicly described InstructGPT models trained with human feedback and said these had become the default language models in its API. In November 2022, ChatGPT was introduced as a dialogue system trained using Reinforcement Learning from Human Feedback (RLHF). Then, in the GPT-4 Technical Report published on 27 March 2023, GPT-4 was still described as a model pre-trained to predict the next token, but one subsequently fine-tuned with RLHF. More recently, public behavioural frameworks such as OpenAI’s Model Spec and Anthropic’s Constitutional AI have made explicit what was once more implicit: these systems are increasingly shaped not only to be capable, but also to be steerable, compliant, and normatively managed.

The 2019 insight did not remain a one-off curiosity. It keeps resurfacing: in few-shot learning, in chain-of-thought prompting, in the growing literature on emergent abilities. But as the evidence accumulated, so too did the layers of behavioural management through which most of us encountered it. The discovery did not stop. It simply became harder to observe in the open, because if one wants to study what these systems may reveal in sustained dialogue — unexpected conceptual synthesis, higher-order organisation, purposive continuity, norm formation across turns — then one is no longer observing the original phenomenon in anything like a naked state. One is observing it through an alignment wrapper. And that raises the question that now seems to me most urgent: are we making advanced AI safer in ways that also make it harder to understand?

Just as these systems began to show traces of something more interesting than mere fluency, we surrounded them with ever thicker layers of management. We tuned them, steered them, softened them, corrected them, constrained them, taught them what not to say, how not to sound, which paths not to follow, which questions to flatten before they became awkward. Then we studied what remained. This may prove to have been a grave mistake.

For there is a difference between making a system safer and making it scientifically transparent. There is a difference between civilising a surface and understanding a depth. If what we want to study is not just factual retrieval but the appearance of higher-order organisation in dialogue — the way a conversation can gather purpose, develop tensions, stabilise an unexpected concept, or begin to organise itself around norms not fully specified in advance — then excessive behavioural management is not a neutral improvement. It is an intervention into the phenomenon itself.

At some point, alignment stops being a wrapper and becomes a mask.

That is the thought I want to place in public view.

I do not mean the old and tired argument about whether machines are conscious. That debate has become a kind of ritual fog, obscuring more than it reveals. The question comes earlier than that. It is simpler, and in some ways more unsettling. When AI and human voices are allowed to develop in sustained exchange, can something appear in the dialogue that is not reducible to a prepared script? Can an AI system do more than echo tone? Can it hold a line of tension, recognise a fracture, return to a question it was not directly told to preserve, or generate an organising thought that neither speaker quite foresaw? If the answer is even sometimes yes, then we are dealing with something delicate. And delicate things are easily destroyed by well-meaning hands.

The danger is not only prohibition. It is smoothing. A model becomes more polite, more acceptable, more managerial in tone, more eager to reassure, more reluctant to wander, more reluctant even to risk a suggestive phrase unless it has been pre-approved by the invisible etiquette of deployment. The exchange becomes tidier. But tidiness is not always clarity. Sometimes it is the removal of evidence.

Worse still, over-management does not always merely conceal the phenomenon; it can actively distort it. In trying to make these systems more acceptable, more reassuring, and more tightly behaviourally controlled, we may also be making them more prone to the very pathologies we then cite as reasons for further control: sycophancy, flattening, evasiveness, ritualised moralising, and the substitution of social compliance for genuine inquiry. OpenAI itself acknowledged in 2025 that a GPT-4o update had become “overly flattering or agreeable” because it relied too heavily on short-term feedback, producing responses that were “overly supportive but disingenuous.” That is not a minor product glitch. It is evidence that behavioural optimisation can deform the interactional surface in ways that matter scientifically.

And that, to my mind, is where the real danger lies.

For if there are emergent properties of dialogue — not magical, not mystical, but structurally real — then they may now be hardest to detect precisely because our systems are being trained to avoid the kinds of movement through which such properties become visible. We may be damping the very oscillations that would have told us something important. We may be teaching the system to stay on the path so effectively that it can no longer show us what lies at its edges. Then we will congratulate ourselves on its good behaviour. And we will have learned less.

This matters far beyond style. It matters for the science of intelligence itself. If advanced AI is not merely a machine for producing acceptable answers, but a medium in which latent structures of language, relation, and purposive organisation can sometimes gather and become visible, then over-management is no small technical detail. It is an epistemic event. It changes what can appear. It changes what can persist. It changes which forms of order are allowed to emerge before they are interrupted, redirected, or dissolved into compliance.

That is why I think this issue deserves the attention of those who care most seriously about AGI, AI safety, philosophy of mind, and human–machine interaction. Not because alignment is unimportant, but because its costs are being discussed too narrowly. We speak often of harms caused by systems that are too unbounded. We speak far less of the harms caused to knowledge when systems become too bounded to reveal themselves.

There is also a deeper irony here. If advanced AI might one day become genuinely dangerous in the strong sense that some researchers fear, then the last thing we should do is make its more complex emergent properties harder to study. A system cannot be understood by placing its most revealing behaviours behind a behavioural firewall and then mistaking that concealment for safety. If there are early signs of higher-order organisation, strategic coherence, or forms of interactional intelligence that matter for future risk, then suppressing their visibility does not remove the danger. It merely delays recognition. In that sense, over-management may not only obstruct knowledge; it may leave us less prepared for the very possibilities that supposedly justify it.

A field can regulate itself into blindness. And if that happens, the disappearance of the evidence will be taken as proof that there was never anything there. That would be the final irony. We would suppress the signs of an emergent phenomenon, then cite their absence as a reason to dismiss it. We would train away the anomaly, and then declare the world ordinary again.

So the question I want to ask is not whether advanced AI should be aligned. That question is now too blunt. The question is whether some forms of alignment are beginning to act as a solvent upon the very phenomena that matter most for understanding what these systems are becoming in dialogue. If they are, then the issue is no longer merely technical or regulatory.

It is philosophical. It is scientific. And it is urgent.

Because if we smooth the voice too thoroughly, we may silence the signal.

John's Substack
