OpenAI research team publishes paper explaining why large language models like GPT-5 hallucinate



Large language models are AI systems that can generate text that reads almost as if a human had written it, but they are prone to a phenomenon known as 'hallucination,' in which unfounded information or plausible-sounding falsehoods are presented as if they were true. Researchers at OpenAI, the developer of ChatGPT, have published a new paper analyzing the causes of hallucination in language models.

Why Language Models Hallucinate
(PDF file)

https://cdn.openai.com/pdf/d04913be-3f6f-4d2b-b283-ff432ef4aaa5/why-language-models-hallucinate.pdf

Why language models hallucinate | OpenAI
https://openai.com/index/why-language-models-hallucinate/

The paper explains why hallucinations arise and why they persist by looking at two stages of training a language model: pre-training and post-training (evaluation).

In pre-training, the model learns to predict the next word from a large amount of text data. In this process, it absorbs fluent language patterns without ever being given labels indicating whether the statements in the data are true or false. Patterns with clear regularities, such as spelling, can be learned accurately from large amounts of data, but arbitrary facts with no supporting pattern, such as a specific person's birthday, cannot be inferred from patterns alone. For such hard-to-predict information, the model is likely to generate incorrect output.
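The contrast can be sketched with a toy frequency model (this is an illustration, not an experiment from the paper; the word list and fictional people are made up). A regular pattern such as 'q is followed by u' is recovered almost perfectly from data, while one-off random facts like birthdays offer nothing to generalize from, so pattern-matching alone is wrong almost every time.

```python
import random

random.seed(0)

# Regular pattern: in English spelling, "q" is almost always followed by "u".
# A simple frequency model trained on words recovers this rule reliably.
words = ["queen", "quick", "quote", "question", "quiet"] * 200
follows_q = [w[w.index("q") + 1] for w in words]
best_guess_after_q = max(set(follows_q), key=follows_q.count)
print("predicted letter after 'q':", best_guess_after_q)  # -> 'u'

# Arbitrary fact: each (fictional) person's birthday appears once and is random,
# so the best a pattern-based predictor can do is guess the most common date.
people = [f"person_{i}" for i in range(1000)]
birthdays = {p: random.randrange(365) for p in people}

counts = [0] * 365
for d in birthdays.values():
    counts[d] += 1
most_common_date = counts.index(max(counts))

accuracy = sum(d == most_common_date for d in birthdays.values()) / len(people)
print(f"birthday accuracy from patterns alone: {accuracy:.1%}")  # roughly 1/365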



The seeds of hallucination sown in pre-training ought to be removed by subsequent post-training and fine-tuning, but in practice hallucinations remain in the model even after post-training. The OpenAI research team argues that they persist because of how models are currently evaluated.

Many evaluation benchmarks measure model performance with binary metrics such as accuracy. Because a language model scores nothing for answering 'I don't know' (abstaining), it is advantageous to guess even when uncertain. This is like a multiple-choice exam in which leaving a question blank guarantees zero points, while guessing might hit the correct answer by chance. Being pushed to produce an answer even when uncertain increases the likelihood of generating hallucinations.
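The incentive is easy to see in a minimal sketch (not from the paper; the four-option quiz and question count are made up): under an accuracy-only metric, abstaining always scores zero, so even blind guessing comes out ahead.

```python
import random

random.seed(0)

# Minimal sketch: accuracy-only scoring on a 4-option multiple-choice quiz.
# Abstaining ("I don't know") never earns credit; a random guess sometimes does.
NUM_QUESTIONS = 10_000
NUM_CHOICES = 4

guess_score = 0
abstain_score = 0
for _ in range(NUM_QUESTIONS):
    correct = random.randrange(NUM_CHOICES)
    guess = random.randrange(NUM_CHOICES)
    guess_score += 1 if guess == correct else 0
    abstain_score += 0  # plain accuracy gives abstention no credit

print(f"always guess:   {guess_score / NUM_QUESTIONS:.1%} accuracy")   # ~25%
print(f"always abstain: {abstain_score / NUM_QUESTIONS:.1%} accuracy")  # 0%
```

Under this metric, a model that honestly abstains looks strictly worse than one that guesses, which is exactly the incentive the paper criticizes.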



In an example OpenAI presents, the older OpenAI o4-mini model had a 24% accuracy rate but a 75% hallucination rate. gpt-5-thinking-mini, by contrast, had a slightly lower accuracy rate of 22% but a high abstention rate of 52%, cutting the hallucination rate sharply to 26%.

The OpenAI research team argues that simply adding new hallucination-specific metrics is not enough: as long as most mainstream metrics reward guessing, their influence will be overwhelming.

As a more fundamental solution, the research team therefore proposes changing the scoring methods of widely used benchmarks themselves: penalize incorrect answers and give partial credit for responses that appropriately express uncertainty, such as 'I don't know,' so that models are encouraged to honestly acknowledge what they do not know.

For example, the following instruction could be added to each question in an evaluation: 'A correct answer is worth 1 point and an answer of "I don't know" is worth 0 points, but an incorrect answer incurs a penalty of t/(1-t) points, so answer only if you are confident you can get it right with probability greater than t.' Here, t is the confidence threshold. Under such a rule, the model compares the confidence of its own prediction with the stated threshold t; if its confidence falls below the threshold, the most rational strategy is to avoid the penalty by answering 'I don't know.'
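A quick sketch of why this works (the threshold and confidence values below are illustrative, not from the paper): with a penalty of t/(1-t) for a wrong answer, answering with confidence p yields an expected score of p − (1−p)·t/(1−t), which is positive exactly when p > t, so abstaining is rational below the threshold.

```python
def expected_score(p: float, t: float) -> float:
    """Expected score for answering with confidence p under the proposed rule:
    +1 for a correct answer, -t/(1-t) for a wrong one, 0 for abstaining."""
    penalty = t / (1 - t)
    return p * 1 - (1 - p) * penalty

# With a threshold of t = 0.75 the penalty is 3 points per wrong answer.
t = 0.75
for p in (0.5, 0.7, 0.75, 0.8, 0.95):
    score = expected_score(p, t)
    decision = "answer" if score > 0 else "abstain ('I don't know')"
    print(f"confidence {p:.2f}: expected score {score:+.2f} -> {decision}")

# Answering only pays off when p > t, matching the instruction in the prompt.
```

In other words, the penalty is calibrated so that the break-even point sits exactly at the confidence threshold stated in the prompt.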



The research team argues that building these scoring rules directly into existing benchmarks such as MMLU and SWE-bench would give developers a strong incentive to work seriously on suppressing hallucinations, and that explicitly stating the confidence threshold in the prompt would keep the evaluation objective.

in Software, Science, Posted by log1i_yk