“The model will tell me when it doesn’t know something.”
This is the most dangerous assumption users bring to LLMs — and it is wrong often enough to matter.
Models trained with RLHF (reinforcement learning from human feedback) are rewarded for the answers human evaluators rate highly, and evaluators tend to rate helpful, confident answers above uncertain, hedged ones. Over the course of training, this creates pressure toward confident-sounding responses even when the model's internal certainty is low.
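The pressure described above can be illustrated with a deliberately tiny toy model. This is a sketch under stated assumptions, not how any production RLHF pipeline works: the "policy" is a single number controlling how often it phrases an answer confidently, and the reward values (`reward_confident`, `reward_hedged`) are made-up numbers standing in for evaluator preferences. The key assumption is that the reward depends only on phrasing, never on whether the answer is correct.

```python
import math


def sigmoid(x):
    """Map a real-valued logit to a probability in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def train_confidence_policy(steps=2000, lr=0.5,
                            reward_confident=1.0,  # assumed evaluator score
                            reward_hedged=0.6,     # assumed evaluator score
                            theta=0.0):
    """Return P(confident phrasing) before training and after each update.

    The policy picks "confident" with probability sigmoid(theta).
    Each step applies the exact gradient of the expected reward:
        J(theta)  = p * r_conf + (1 - p) * r_hedge
        dJ/dtheta = p * (1 - p) * (r_conf - r_hedge)
    Correctness never appears in the reward, which is the point of the toy.
    """
    history = [sigmoid(theta)]
    for _ in range(steps):
        p = sigmoid(theta)
        theta += lr * p * (1.0 - p) * (reward_confident - reward_hedged)
        history.append(sigmoid(theta))
    return history


if __name__ == "__main__":
    history = train_confidence_policy()
    print(f"start: {history[0]:.3f}  end: {history[-1]:.3f}")
```

The policy starts at 0.5 (no preference) and drifts steadily toward always sounding confident, ending above 0.99 with these settings. Nothing in the update asks whether the answer was right, so the toy model gets more confident without getting more accurate.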
In Mata v. Avianca, when the attorneys asked ChatGPT to confirm the cases were real, it reaffirmed them, generating additional fabricated detail rather than admitting it had invented them.
Verification is the user’s job, not the model’s. If you are relying on the model to catch its own errors, that control does not exist.
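One way to make that control exist is to check the model's claims against a source the model did not produce. The sketch below is a minimal illustration, assuming the claims are citations and that some authoritative lookup is available. The names `unverified_citations`, `exists_in_trusted_source`, and `TRUSTED_INDEX` are hypothetical, and the case names are placeholders rather than real cases. In practice the lookup would be a court records database or another primary source, never a second prompt to the same model.

```python
from typing import Callable, Iterable, List


def unverified_citations(
    citations: Iterable[str],
    exists_in_trusted_source: Callable[[str], bool],
) -> List[str]:
    """Return every citation the independent lookup could not confirm.

    The lookup must not call the model that produced the citations;
    asking the model to confirm its own output is not a check.
    """
    return [c for c in citations if not exists_in_trusted_source(c)]


# Hypothetical stand-in for an authoritative index (placeholder names).
TRUSTED_INDEX = {"Example v. Verified Co."}

# Pretend these came back from the model.
model_output = ["Example v. Verified Co.", "Example v. Invented Co."]

missing = unverified_citations(model_output, TRUSTED_INDEX.__contains__)
if missing:
    print("Do not rely on these until confirmed by hand:", missing)
```

The design choice that matters is the function signature: the verifier takes a lookup as an argument, so there is no path by which the model's own reassurance can count as confirmation.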