Overview
This project explores a core question in LLM evaluation: can token-level probabilities be trusted as a confidence signal? The work is grounded in structured document extraction and studies cases where local certainty appears high while true reliability remains limited.
Research question
If Gemini assigns near-maximum token confidence to a generated output, does that really mean the extracted value is correct, grounded, and stable? The project shows that the answer is often no, especially in structured generation.
Scientific rationale
The analysis is based on a simple but important distinction: token-level probability is a local decoding signal, while reliability is a global property involving factual correctness, semantic grounding, and structural consistency. In autoregressive models, those two levels do not necessarily align.
- Modern neural networks are often overconfident rather than well calibrated.
- LLM internal probabilities do not always align with factual correctness.
- Token-level confidence does not guarantee sequence-level correctness.
- Probability-based metrics alone can overestimate reliability in generative tasks.
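The gap between the local and the global signal can be made concrete with a toy calculation. The numbers below are illustrative, not from the study: even when every token in a sequence is locally near-certain, the probability of the whole sequence decays multiplicatively.

```python
import math

# Hypothetical per-token probabilities for a 20-token extraction:
# each step looks near-certain locally.
token_probs = [0.99] * 20

# Local signal: the average token confidence stays near-certain.
avg_confidence = sum(token_probs) / len(token_probs)

# Global signal: the probability of the whole sequence is the product,
# which decays even though every individual step is confident.
sequence_prob = math.prod(token_probs)

print(f"average token confidence: {avg_confidence:.2f}")  # 0.99
print(f"sequence probability:     {sequence_prob:.2f}")   # ~0.82
```

This is the simplest form of the local/global mismatch: a sequence can be step-by-step probable while the joint event is noticeably less so, and factual correctness is a further, separate question on top of that.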
Evidence from prior research
Prior work on calibration has shown that modern neural models often assign higher confidence than their real success rate. This is the core pattern behind overconfidence: the model appears more certain than it should be.
Reliability diagram intuition
A well-calibrated model would align predicted confidence with actual accuracy. In overconfident systems, predicted confidence stays above real performance.
This same gap motivates the present project: very high token-level confidence in Gemini can coexist with lower real reliability when outputs disagree or fail to remain stable across passes.
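The overconfidence gap described above is usually quantified with expected calibration error (ECE): predictions are binned by confidence and each bin's average confidence is compared with its empirical accuracy. The helper below is a minimal pure-Python sketch of that standard recipe; the toy data are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the sample-weighted
    average of |bin confidence - bin accuracy|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            accuracy = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Toy overconfident model: it claims ~0.95 confidence
# but is actually right only 70% of the time.
conf = [0.95] * 10
corr = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(expected_calibration_error(conf, corr))  # ≈ 0.25
```

A well-calibrated model would drive this value toward zero; the 0.25 gap here is exactly the "confidence stays above real performance" pattern a reliability diagram makes visible.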
Key findings from the study
Token-level confidence is local
A high probability for the next token reflects local decoding preference, not necessarily factual correctness or document-grounded extraction.
Confidence can saturate
In structured extraction, token-level metrics may drift toward near-maximum confidence values even when the extracted field is still wrong or unstable.
Structure can dominate the signal
JSON formatting and structural tokens can artificially inflate average token confidence, masking true semantic uncertainty in the field value itself.
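The inflation effect is easy to demonstrate with invented per-token numbers. In the sketch below (the token split and probabilities are assumptions, not the study's data), structural JSON tokens are near-certain while the one semantic value token is not, yet the overall average still looks confident.

```python
import statistics

# Hypothetical per-token confidences for the output {"amount": "42.50"}.
tokens = [
    ("{", 0.999),
    ('"amount"', 0.998),
    (":", 0.999),
    ('"42.50"', 0.62),   # the actual field value — the only semantic token
    ("}", 0.999),
]

all_avg = statistics.mean(p for _, p in tokens)
semantic_avg = statistics.mean(p for t, p in tokens if t == '"42.50"')

print(f"average over all tokens: {all_avg:.3f}")   # ≈ 0.92
print(f"semantic token only:     {semantic_avg}")  # 0.62
```

Averaging over all tokens reports ~0.92 confidence for a field whose value token is only at 0.62, which is why separating semantic from structural tokens matters.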
Agreement across passes is more informative
Cross-pass agreement introduces a stability signal that often reveals uncertainty better than isolated token probabilities.
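One simple way to turn multiple passes into a stability signal is the fraction of passes that agree with the modal extracted value. This is a generic sketch, not necessarily the project's exact metric; the example values are hypothetical.

```python
from collections import Counter

def agreement_confidence(values):
    """Return the modal extracted value and the fraction of passes
    that produced it — a simple cross-pass stability signal."""
    counts = Counter(values)
    value, n = counts.most_common(1)[0]
    return value, n / len(values)

# Two passes disagree on a date field despite near-certain token confidence.
value, conf = agreement_confidence(["2024-03-01", "2024-01-03"])
print(value, conf)  # agreement is only 0.5

# Three agreeing passes would instead signal stability.
_, conf3 = agreement_confidence(["2024-03-01"] * 3)
print(conf3)  # 1.0
```

Unlike token probability, this signal drops as soon as the model fails to reproduce its own answer, which is precisely the instability that saturated local confidence hides.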
Graph 1 — Local confidence vs actual reliability
Illustrative view of the main problem: token-level indicators can suggest near-certain confidence while real reliability remains much lower.
- Chosen token confidence: 99.9%
- Entropy-based confidence: 97.8%
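An entropy-based confidence like the one charted above is commonly derived from the next-token distribution. The formula below, one minus the normalized Shannon entropy, is one common heuristic assumed here for illustration, not necessarily the project's exact definition.

```python
import math

def entropy_confidence(probs):
    """1 - normalized Shannon entropy of a next-token distribution.
    Sharply peaked distributions score close to 1."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    return 1.0 - h / h_max

# A sharply peaked 4-way distribution still scores high,
# regardless of whether the peaked token is actually correct.
print(f"{entropy_confidence([0.97, 0.01, 0.01, 0.01]):.3f}")
```

Like the chosen-token probability, this metric measures the sharpness of the distribution, not its correctness, so both indicators can saturate together.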
Graph 2 — Saturated token confidence under disagreement
Even when both passes look extremely confident at the token level, disagreement across extractions can push final confidence down and reveal instability.
- Pass 1 token confidence: 99.99%
- Pass 2 token confidence: 99.93%
Confidence does not always imply correctness
A useful way to think about overconfidence is to compare confidence with correctness directly. In practice, some incorrect generations still receive very high confidence, which weakens confidence as a standalone reliability signal.
(x-axis: low correctness → high correctness)
The visual intuition is simple: several points remain high on the confidence axis despite not being strongly aligned with correctness.
Why overconfidence appears
- The model may strongly prefer one next token over alternatives even when the final field is not well grounded in the source document.
- Structural tokens such as JSON punctuation and formatting can raise the average confidence without improving semantic correctness.
- Negative log-probabilities very close to zero translate into probabilities near 1, creating the appearance of near-perfect certainty.
- A sequence can be locally probable step by step and still be globally wrong.
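The log-probability point in the list above is worth making concrete: exponentiating a near-zero average log-probability yields a number that rounds to apparent certainty at typical display precision. The value used here is illustrative.

```python
import math

# A reported average log-probability very close to zero...
avg_logprob = -0.0001

# ...maps to a probability that displays as near-perfect certainty.
prob = math.exp(avg_logprob)

print(f"{prob:.4f}")          # 0.9999 — looks like near-certain
print(prob == 1.0)            # False: it only *appears* certain
```

Two extractions whose log-probabilities differ by a tiny amount can therefore both display as "99.9x%" confident while still disagreeing on the actual value.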
Example interpretation
In the study, token-level confidence can remain extremely high across two passes while the extracted values still disagree. That is exactly the kind of situation where overconfidence becomes visible: local probability says “safe”, but cross-pass stability says “uncertain”.
- Pass 1 token confidence: 0.9999
- Pass 2 token confidence: 0.9993
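The two passes above can be combined in code. The gating scheme below, scaling the weaker token confidence by cross-pass agreement, is an illustrative assumption rather than the study's actual formula, and the extracted values are hypothetical.

```python
# Extracted field value and saturated token confidence per pass.
values = ["invoice-001", "invoice-101"]
token_confs = [0.9999, 0.9993]

# Fraction of passes agreeing with the modal value.
agreement = max(values.count(v) for v in set(values)) / len(values)

# One simple gating scheme: stability caps local confidence.
final_conf = agreement * min(token_confs)

print(f"agreement: {agreement}, final confidence: {final_conf:.2f}")
```

With disagreement between passes, the near-perfect 0.9993 collapses to roughly 0.50: local probability says "safe", but the combined signal correctly says "uncertain".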
Next improvements
- Add reliability diagrams and calibration curves from larger experiments.
- Separate semantic tokens from structural tokens more explicitly.
- Benchmark cross-pass agreement against factual correctness labels.
- Compare Gemini confidence behavior with other LLM providers.