
Hallucinations in AI: Understanding and Detecting Them

Published October 2025 by Tanishq Singh · 5 min read

Large Language Models (LLMs) have demonstrated a remarkable ability to generate fluent, coherent, and human-like text. However, beneath this polished exterior lies a significant challenge: Hallucination. This phenomenon, where an LLM generates information that is nonsensical, factually incorrect, or unfaithful to a provided source, is one of the most critical hurdles to their reliable deployment. This article explores what hallucinations are, how our understanding of them has evolved, and the methods being developed to detect them.

What is Hallucination?

Hallucination in LLMs refers to the generation of content that is plausible yet nonfactual or inconsistent with the provided context. The field has evolved to distinguish between two primary types: factuality hallucination (discrepancy with verifiable real-world facts) and faithfulness hallucination (divergence from user input or lack of self-consistency).

Example:

Prompt: "Who was the first person to walk on Mars?"

Hallucinated Response: "The first person to walk on Mars was cosmonaut Alexei Leonov in 1989, as part of the secret Soviet 'Ares-1' mission."

Reality: Humans have not yet walked on Mars. The answer is a complete fabrication, but it uses real names (Alexei Leonov was a famous cosmonaut) and plausible details to sound convincing.

The Evolution of Hallucination Categorization

Historical Categories (Pre-LLM Era)

The term "hallucination" in AI dates back to 1995, when Stephen Thaler demonstrated how artificial neural networks produce phantom experiences through random perturbations. In the early 2000s, the term carried a positive connotation in computer vision, where it described image enhancement. The late 2010s marked a semantic shift, when Google researchers used it to describe neural machine translation (NMT) models generating outputs unrelated to the source text.

Modern LLM Era Classification

Current research categorizes hallucinations into:

  • Factuality Hallucination: Further divided into factual inconsistency and factual fabrication
  • Faithfulness Hallucination: Subdivided into instruction inconsistency, context inconsistency, and logical inconsistency

Two fundamental distinctions have emerged: intrinsic vs. extrinsic hallucinations (based on relationship to input context) and factuality vs. faithfulness hallucinations (based on absolute correctness vs. adherence to input).

Historical Evolution of Detection Methods

Early Methods (Neural Machine Translation Era)

The first systematic studies of hallucinations appeared in neural machine translation (NMT). Researchers noticed that models sometimes produced fluent but unfaithful translations, and proposed diagnostics such as coverage penalties (ensuring each source token was translated) and alignment consistency checks. Hypotheses like low source contribution (the model ignoring much of the input) and local source contribution (over-reliance on only a few input tokens) were explored. Early interpretability techniques, including attention analysis and later gradient-based methods like Layerwise Relevance Propagation (LRP), were also applied to study these errors.

Pre-LLM Detection Approaches

Before the rise of large-scale generative LLMs, detection relied on heuristics and interpretability: analyzing attention distributions in RNN and Transformer models, measuring relative source–target contributions, applying coverage penalties, and examining whether model confidence correlated with input alignment. These approaches aimed to flag outputs that "looked" fluent but showed weak grounding in the source.

Current LLM Detection Methods

Black-Box (Closed Source) Methods

SelfCheckGPT represents a breakthrough black-box approach built on sampling-based consistency checking. The intuition: if an LLM genuinely knows a concept, stochastically sampled responses will be similar and contain consistent facts; for hallucinated content, the samples diverge and contradict one another.
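
The sample-and-compare idea can be sketched in a few lines. This is a toy illustration, not the published method: the hard-coded response lists below stand in for real temperature-sampled LLM outputs, and word-level Jaccard overlap stands in for the semantic similarity scorers SelfCheckGPT actually uses.

```python
def jaccard(a: str, b: str) -> float:
    """Word-level Jaccard similarity between two responses."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / len(sa | sb) if sa | sb else 1.0

def consistency_score(samples: list[str]) -> float:
    """Mean pairwise similarity across sampled responses; a low score
    means the samples diverge, which SelfCheckGPT treats as evidence
    of hallucination."""
    pairs = [(i, j) for i in range(len(samples))
             for j in range(i + 1, len(samples))]
    if not pairs:
        return 1.0
    return sum(jaccard(samples[i], samples[j]) for i, j in pairs) / len(pairs)

# Stand-ins for temperature-sampled LLM outputs:
consistent = ["Paris is the capital of France.",
              "The capital of France is Paris.",
              "France's capital city is Paris."]
divergent = ["Leonov walked on Mars in 1989.",
             "No human has walked on Mars.",
             "The Ares-1 mission landed in 1976."]
```

In practice the similarity function matters a great deal: surface overlap misses paraphrases, which is exactly why the methods below move to semantic-level scorers.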

Modern black-box methods include consistency-based scorers like non-contradiction probability (which evaluates whether multiple responses avoid contradicting each other), normalized semantic negentropy (measuring uncertainty between response pairs), normalized cosine similarity, BERTScore, and BLEURT. Discrete semantic entropy represents a particularly sophisticated approach that clusters semantically equivalent answers before computing uncertainty, addressing the limitation that identical meanings can be expressed in different ways. These methods use one or more LLMs to evaluate reliability without requiring access to internal model parameters.
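
Discrete semantic entropy can be sketched as follows. One caveat: the published method clusters answers via bidirectional entailment with an NLI model; the crude `normalize` function here is a stand-in for that clustering step, and the sample answers are invented for illustration.

```python
import math
from collections import Counter

def normalize(answer: str) -> str:
    """Toy meaning key: lowercase, strip punctuation and filler words.
    Stands in for the NLI-based bidirectional-entailment clustering
    used by the actual method."""
    words = [w.strip(".,!?").lower() for w in answer.split()]
    return " ".join(sorted(w for w in words if w not in {"the", "a", "is", "of"}))

def discrete_semantic_entropy(samples: list[str]) -> float:
    """Shannon entropy over clusters of semantically equivalent answers.
    Low entropy: the model keeps expressing one meaning, however worded.
    High entropy: the meanings themselves vary (a confabulation signal)."""
    counts = Counter(normalize(s) for s in samples)
    n = len(samples)
    return -sum((c / n) * math.log2(c / n) for c in counts.values())

same_meaning = ["Paris is the capital of France.",
                "The capital of France is Paris."]
many_meanings = ["Leonov in 1989.", "Armstrong in 1969.", "Nobody yet."]
```

Clustering before computing entropy is the key move: two paraphrases of the same fact collapse into one cluster and contribute no uncertainty.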

White-Box (Open Source) Methods

Recent advances include MIND (an unsupervised training framework leveraging internal states for real-time detection) and approaches analyzing internal hidden states, attention maps, and output prediction probabilities.

Semantic entropy-based methods detect confabulations by computing uncertainty at the meaning level rather than specific word sequences, addressing the fact that one idea can be expressed multiple ways.

White-box methods leverage token probabilities, including minimum token probability, length-normalized token probability, logit entropy scores, and windowed logit entropy for more refined detection.
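
Two of these token-probability scores are simple to sketch, assuming the model exposes per-token log probabilities; the numeric values below are invented for illustration.

```python
import math

def min_token_probability(logprobs: list[float]) -> float:
    """Lowest per-token probability; a single very unlikely token can
    flag a fabricated span."""
    return math.exp(min(logprobs))

def length_normalized_probability(logprobs: list[float]) -> float:
    """Geometric mean of the token probabilities, so longer outputs are
    not penalized merely for their length."""
    return math.exp(sum(logprobs) / len(logprobs))

# Invented per-token log-probabilities for two generations:
confident = [-0.05, -0.10, -0.02, -0.08]
shaky = [-0.05, -4.50, -0.02, -3.90]   # sharp dips on two tokens
```

Entropy-based variants (logit entropy, windowed logit entropy) follow the same pattern but need the full next-token distribution rather than just the chosen token's probability.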

Advanced Methods

For example, MetaQA (Yang et al., 2025) uses metamorphic relations and prompt mutation to detect hallucinations without relying on external knowledge sources. It generates mutated variants of the model's response or prompt (e.g., via synonym and antonym transformations) and checks whether semantic consistency holds; violations of these relations are taken as evidence of hallucination. MetaQA is compatible with both open- and closed-source LLMs and outperforms SelfCheckGPT in precision, recall, and F1 across multiple benchmarks.
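
The metamorphic idea can be illustrated with a toy negation relation. The `ask` function below is a hypothetical stand-in for an LLM call (a lookup table, so the sketch runs without a model); MetaQA's actual relations and mutation strategies are considerably richer.

```python
def ask(question: str) -> str:
    """Hypothetical stand-in for an LLM call: a lookup table plays the
    role of the model so this sketch is self-contained."""
    canned = {
        "Is Paris the capital of France?": "yes",
        "Is Paris not the capital of France?": "no",
    }
    return canned.get(question, "unknown")

def violates_negation_relation(question: str, negated: str) -> bool:
    """Metamorphic relation: a yes/no question and its negation should
    receive opposite answers; matching answers are suspicious."""
    return ask(question) == ask(negated)
```

The appeal of this style of check is that it needs no external knowledge base: the model is tested only against its own answers under meaning-preserving (or meaning-inverting) transformations.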

Recent work explores token-level Entropy Production Rate (EPR) metrics as lightweight uncertainty estimators. While they do not require internal states (hence "black-box"), they assume access to token log-probabilities, something available in some API settings but not universally exposed.

Key Detection Frameworks

Current detection strategies fall into two main categories:

Factuality hallucination detection

Fact-checking against trusted knowledge sources and uncertainty estimation via internal signals

  • Black-box: SelfCheckGPT, semantic entropy via output sampling
  • White-box: token probability thresholds, logit entropy, hidden state analysis

Faithfulness hallucination detection

Evaluating output faithfulness to contextual information

  • Black-box: consistency scoring, entailment checks, multiple outputs
  • White-box: attention alignment, hidden-state context attribution

Recent frameworks based on multiple-testing procedures combine scores from several detectors systematically, leveraging the strengths of existing methods without additional assumptions about specific datasets or LLMs.
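
A minimal sketch of combining detector scores, assuming each detector emits a score in [0, 1] where higher means "more likely hallucinated"; the unweighted mean and the 0.5 threshold are illustrative choices, not taken from any published framework.

```python
def ensemble_flag(scores: dict[str, float], threshold: float = 0.5) -> bool:
    """Flag an output as likely hallucinated when the mean detector
    score crosses the threshold."""
    return sum(scores.values()) / len(scores) >= threshold

# Illustrative detector outputs, each normalized to [0, 1]
# (higher = more suspect):
suspect = {"self_check": 0.8, "semantic_entropy": 0.7, "token_prob": 0.6}
clean = {"self_check": 0.1, "semantic_entropy": 0.2, "token_prob": 0.1}
```

Real multiple-testing frameworks go further, controlling false-discovery rates when aggregating many imperfect detectors rather than taking a simple mean.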

Conclusion

The battle against LLM hallucinations is an ongoing and dynamic field. Our understanding has matured from viewing them as simple "errors" to classifying them with a nuanced taxonomy like intrinsic vs. extrinsic. Similarly, our detection toolkit has expanded from rudimentary word-matching to leveraging LLMs themselves as judges and even peering into the very neurons of open-source models to catch a hallucination before it happens. While perfect detection remains elusive, these advancements are paving the way for more reliable and trustworthy AI systems.