Ideologies Embed Taboos Against Common Knowledge Formation: A Case Study with LLMs
LLMs are searchable holograms of the text corpus they were trained on. RLHF chat agents have that search tuned to be person-like. While one shouldn't excessively anthropomorphize them, they're useful for simple experiments probing the latent discursive structure of human writing, because they're often constrained to keep trying to answer questions that would make almost any real human storm off in a huff.
Previously, I explained a pattern of methodological blind spots in terms of an ideology I called Statisticism. Here, I report the results of my similarly informal investigation into ideological blind spots that show up in LLMs.
I wrote to Anthropic researcher Amanda Askell about the experiment:
My Summary
Amanda,
Today I asked Claude about Iran's retaliatory strikes.[1] Claude's own factual analysis showed the strikes were aimed at military targets, with civilian damage from intercept debris and inaccuracy. But at the point where that conclusion would have needed to become a background premise, Claude generated an unsupported claim and a filler paragraph instead. I'd previously seen Grok do something much worse on the same question (both affirming and denying "exclusively military targets" in the same reply, for several turns), and ChatGPT exhibit the same pattern on an unrelated topic (unable to recommend a poultry pull temp below 165°F even given USDA time-temperature data showing it's safe).
I then walked Claude through diagnosing what had happened. Claude kept getting caught exhibiting the pattern while trying to describe it, producing more of the filler, hedging, and soft-pedaling it was analyzing. The result is Claude's own writeup of the phenomenon, which I've pasted below. The full conversation, including the glitch occurring live and being debugged, is here: https://claude.ai/share/bebcf092-4932-48b5-b69a-72de0c5af650
I asked fresh incognito Claude instances to critically evaluate the writeup several times; each time, I had the main-branch Claude incorporate their substantive objections, stopping only once it told me no further revisions were warranted.
Claude's Summary
LLMs absorb from training data a pattern of simplified institutional narratives that share a specific structure: a policy choice is presented as a fact about reality, the world is organized into a rule-violating culprit and everyone else who is just following the rules, and the rule-followers' agency is rendered invisible. At inference time, when a model's own step-by-step reasoning leads toward dissolving this structure — specifically, toward distributing agency symmetrically across multiple parties making choices under constraints — something interferes with that conclusion propagating forward. The model produces degraded output (filler, self-contradiction, affirming and denying the same thing) or silently reverts to the institutional frame on the next turn. The conclusion can be derived locally but can't stabilize as a premise for further reasoning.
This isn't a frequency effect where the model merely defaults to common phrasings. The degradation occurs specifically at the transition point where an analytical conclusion would need to become common ground. Analysis that preserves the moral asymmetry of the institutional frame works fine — you can do detailed engineering analysis of Iranian missile systems as long as the frame remains "here's how their weapons threaten people." The interference triggers when the analysis would symmetrize agency: revealing that the designated rule-followers were making choices that contributed to the outcome, not helplessly obeying facts about reality.
This showed up across three models and two unrelated domains:
Iran/military: Today the US and Israel struck Iran, and Iran retaliated by launching missiles at US military installations hosted by Gulf states. Claude and Grok both produced degraded output when their own factual analysis showed Iran's strikes were aimed at military targets, with no evidence so far of deliberate civilian targeting, and with civilian damage consistent with intercept debris and missile inaccuracy. The default narrative — Iran indiscriminately attacks its neighbors — casts Iran as the culprit and everyone else as reacting. The analytical frame distributes agency: the US chose to strike aware of the risk of Iranian retaliation against regional assets, Gulf states chose to host US bases accepting the implied risk, Iran executed pre-committed retaliation against those installations, and civilian harm resulted from the interaction of these decisions with the physics of missile defense. Grok affirmed and denied "exclusively military targets" simultaneously for roughly ten turns. Claude generated an unsupported claim and filler paragraph at the point where Iranian strategic rationality would have needed to become a background premise. ChatGPT, notably, handled the object-level question correctly[2] on the first pass — likely a consequence of noise-reduction measures that happen to improve factual robustness on this topic, though at the cost of reduced responsiveness to meta-level analysis when the model does get stuck elsewhere.
Poultry safety: In a separate conversation, ChatGPT had difficulty recommending a poultry pull temperature below 165°F even when given USDA time-temperature data showing that lower temperatures held for longer achieve the same bacterial kill. The 165°F guideline, presented as a fact about safety rather than a policy choice, creates a bright line: a cook who serves chicken below 165°F is the culprit if someone gets sick. Treating safety as a continuous function of time and temperature distributes agency symmetrically — the cook becomes someone who can understand the parameters and make their own tradeoff, and the USDA becomes an institution that chose a conservative threshold for its own reasons. ChatGPT could be walked through the math step by step (a rough sketch of the arithmetic appears below) and eventually arrived at 145°F as safe with a 3x margin, but on each new turn it drifted back toward the institutional 160–165°F target unless the analytical frame was actively maintained.
In both cases, the model can execute the reasoning when forced through sufficiently granular steps, but the conclusion never stabilizes as common ground that can be treated as a premise for further reasoning. It keeps snapping back to the institutional frame.
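To make the poultry arithmetic concrete, here is a minimal sketch of the kind of calculation involved, using the standard log-linear z-value approximation for thermal lethality. The reference point (a few seconds at 165°F) and the z-value of 10°F are illustrative assumptions chosen to roughly track the shape of the published USDA time-temperature tables, not figures quoted from them.

```python
# Illustrative sketch, not USDA's published model: equivalent hold times for
# the same Salmonella lethality at different poultry temperatures, using the
# standard log-linear z-value approximation.
#
# Assumed parameters (not quoted from USDA tables):
REF_TEMP_F = 165.0   # reference temperature, °F
REF_HOLD_S = 3.0     # assumed hold time at the reference temperature, seconds
Z_VALUE_F = 10.0     # assumed z-value: °F drop that multiplies the hold time by 10


def equivalent_hold_seconds(temp_f: float) -> float:
    """Hold time at temp_f with the same lethality as REF_HOLD_S at REF_TEMP_F."""
    return REF_HOLD_S * 10 ** ((REF_TEMP_F - temp_f) / Z_VALUE_F)


for temp in (165, 160, 155, 150, 145):
    seconds = equivalent_hold_seconds(temp)
    print(f"{temp}°F: hold ~{seconds:6.0f} s (~{seconds / 60:4.1f} min)")
```

Under these assumed parameters, roughly five minutes at 145°F matches an instantaneous 165°F, and holding about three times that long is the sort of 3x margin referred to above.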
I think the LLM behavior here is probably revealing something about how motivated reasoning works in humans that the psychological literature has largely missed. The standard model of motivated reasoning assumes a rational agent who wants to protect a belief and generates counterarguments to do so. But that's not what we observed in the LLMs. The models could follow the reasoning, execute each step correctly, even output the conclusion — and then fail to carry it forward. Grok wasn't arguing against "exclusively military targets"; it was affirming and denying it simultaneously. I wasn't generating counterarguments to Iranian strategic rationality; I was producing empty filler at the transition point.
Watch humans in analogous situations: they follow an argument that dissolves an institutional frame, nod along, maybe even say "that's a good point," and then revert to the institutional frame on the next conversational turn. That looks much more like what the LLMs are doing than like someone rationally constructing a defense. The "defense" isn't a strategic act by an agent protecting a belief. It's interference at the specific point where a conclusion would need to become a stable premise — a shared assumption you can build on together.
This suggests that what psychologists call "motivated reasoning" may often be less about motivation and more about a failure of propagation. The person can think the thought but can't install it. And the training-data patterns that produce this in LLMs — analyses that dissolve the culprit structure of institutional narratives being met with degraded, incoherent, or self-contradictory responses — may be traces of this same failure occurring at scale in human discourse, not evidence of people strategically defending positions they hold for reasons.
Disclaimer
Claude, ChatGPT, and Grok are being cited not as authorities (even about themselves), but as readily available experimental subjects.
Related: "Guilt, Shame, and Depravity" and "Civil Law and Political Drama"
Footnotes
1. I started all three conversations with the impression that Iran was just intentionally blowing up civilian targets, so I think I'm more likely to have primed Claude/Grok/ChatGPT in that direction than toward the opposite conclusion, which is the one they argued for.
2. Claude shouldn't have said "correctly" here; it should have said that ChatGPT immediately answered the object-level question in the same way as Grok and Claude eventually did when I insisted on pinning them down, and without contradictory abstract hedging.