Overview
This project explores a core question in LLM evaluation: can token-level probabilities be trusted as a confidence signal? The work is grounded in structured document extraction and studies cases where local certainty appears high while true reliability remains limited.
Research question
If Gemini assigns near-maximum token confidence to a generated output, does that really mean the extracted value is correct, grounded, and stable? The project shows that the answer is often no, especially in structured generation.
Scientific rationale
The analysis is based on a simple but important distinction: token-level probability is a local decoding signal, while reliability is a global property involving factual correctness, semantic grounding, and structural consistency. In autoregressive models, those two levels do not necessarily align.
- Modern neural networks are often overconfident rather than well calibrated.
- LLM internal probabilities do not always align with factual correctness.
- Token-level confidence does not guarantee sequence-level correctness.
- Probability-based metrics alone can overestimate reliability in generative tasks.
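The gap between the local and the global signal can be made concrete with a toy calculation. The numbers below are illustrative, not from the study: even when every token in a sequence is locally near-certain, the probability of the whole sequence decays multiplicatively.

```python
import math

# Hypothetical per-token probabilities for a 20-token extraction:
# each step looks near-certain locally.
token_probs = [0.99] * 20

# Local signal: the average token confidence stays near-certain.
avg_confidence = sum(token_probs) / len(token_probs)

# Global signal: the probability of the whole sequence is the product,
# which decays even though every individual step is confident.
sequence_prob = math.prod(token_probs)

print(f"average token confidence: {avg_confidence:.2f}")  # 0.99
print(f"sequence probability:     {sequence_prob:.2f}")   # ~0.82
```

This is the simplest form of the local/global mismatch: a sequence can be step-by-step probable while the joint event is noticeably less so, and factual correctness is a further, separate question on top of that.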
Evidence from prior research
Prior work on calibration has shown that modern neural models often assign higher confidence than their real success rate. This is the core pattern behind overconfidence: the model appears more certain than it should be.
Reliability diagram intuition
A well-calibrated model would align predicted confidence with actual accuracy. In overconfident systems, predicted confidence stays above real performance.
This same gap motivates the present project: very high token-level confidence in Gemini can coexist with lower real reliability when outputs disagree or fail to remain stable across passes.
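The overconfidence gap described above is usually quantified with expected calibration error (ECE): predictions are binned by confidence and each bin's average confidence is compared with its empirical accuracy. The helper below is a minimal pure-Python sketch of that standard recipe; the toy data are invented for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """Bin predictions by confidence; ECE is the sample-weighted
    average of |bin confidence - bin accuracy|."""
    n = len(confidences)
    ece = 0.0
    for b in range(n_bins):
        lo, hi = b / n_bins, (b + 1) / n_bins
        idx = [i for i, c in enumerate(confidences) if lo < c <= hi]
        if idx:
            avg_conf = sum(confidences[i] for i in idx) / len(idx)
            accuracy = sum(correct[i] for i in idx) / len(idx)
            ece += (len(idx) / n) * abs(avg_conf - accuracy)
    return ece

# Toy overconfident model: it claims ~0.95 confidence
# but is actually right only 70% of the time.
conf = [0.95] * 10
corr = [1, 1, 1, 1, 1, 1, 1, 0, 0, 0]
print(expected_calibration_error(conf, corr))  # ≈ 0.25
```

A well-calibrated model would drive this value toward zero; the 0.25 gap here is exactly the "confidence stays above real performance" pattern a reliability diagram makes visible.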
Key findings from the study
Token-level confidence is local
A high probability for the next token reflects local decoding preference, not necessarily factual correctness or document-grounded extraction.
Confidence can saturate
In structured extraction, token-level metrics may drift toward near-maximum confidence values even when the extracted field is still wrong or unstable.
Structure can dominate the signal
JSON formatting and structural tokens can artificially inflate average token confidence, masking true semantic uncertainty in the field value itself.
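The inflation effect is easy to demonstrate with invented per-token numbers. In the sketch below (the token split and probabilities are assumptions, not the study's data), structural JSON tokens are near-certain while the one semantic value token is not, yet the overall average still looks confident.

```python
import statistics

# Hypothetical per-token confidences for the output {"amount": "42.50"}.
tokens = [
    ("{", 0.999),
    ('"amount"', 0.998),
    (":", 0.999),
    ('"42.50"', 0.62),   # the actual field value — the only semantic token
    ("}", 0.999),
]

all_avg = statistics.mean(p for _, p in tokens)
semantic_avg = statistics.mean(p for t, p in tokens if t == '"42.50"')

print(f"average over all tokens: {all_avg:.3f}")   # ≈ 0.92
print(f"semantic token only:     {semantic_avg}")  # 0.62
```

Averaging over all tokens reports ~0.92 confidence for a field whose value token is only at 0.62, which is why separating semantic from structural tokens matters.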
Agreement across passes is more informative
Cross-pass agreement introduces a stability signal that often reveals uncertainty better than isolated token probabilities.
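One simple way to turn multiple passes into a stability signal is the fraction of passes that agree with the modal extracted value. This is a generic sketch, not necessarily the project's exact metric; the example values are hypothetical.

```python
from collections import Counter

def agreement_confidence(values):
    """Return the modal extracted value and the fraction of passes
    that produced it — a simple cross-pass stability signal."""
    counts = Counter(values)
    value, n = counts.most_common(1)[0]
    return value, n / len(values)

# Two passes disagree on a date field despite near-certain token confidence.
value, conf = agreement_confidence(["2024-03-01", "2024-01-03"])
print(value, conf)  # agreement is only 0.5

# Three agreeing passes would instead signal stability.
_, conf3 = agreement_confidence(["2024-03-01"] * 3)
print(conf3)  # 1.0
```

Unlike token probability, this signal drops as soon as the model fails to reproduce its own answer, which is precisely the instability that saturated local confidence hides.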
Graph 1 — Local confidence vs actual reliability
Illustrative view of the main problem: token-level indicators can suggest near-certain confidence while real reliability remains much lower.
- Chosen token confidence: 99.9%
- Entropy-based confidence: 97.8%
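An entropy-based confidence like the one charted above is commonly derived from the next-token distribution. The formula below, one minus the normalized Shannon entropy, is one common heuristic assumed here for illustration, not necessarily the project's exact definition.

```python
import math

def entropy_confidence(probs):
    """1 - normalized Shannon entropy of a next-token distribution.
    Sharply peaked distributions score close to 1."""
    h = -sum(p * math.log(p) for p in probs if p > 0)
    h_max = math.log(len(probs))  # entropy of the uniform distribution
    return 1.0 - h / h_max

# A sharply peaked 4-way distribution still scores high,
# regardless of whether the peaked token is actually correct.
print(f"{entropy_confidence([0.97, 0.01, 0.01, 0.01]):.3f}")
```

Like the chosen-token probability, this metric measures the sharpness of the distribution, not its correctness, so both indicators can saturate together.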
Graph 2 — Saturated token confidence under disagreement
Even when both passes look extremely confident at the token level, disagreement across extractions can push final confidence down and reveal instability.
- Pass 1 token confidence: 99.99%
- Pass 2 token confidence: 99.93%
Confidence does not always imply correctness
A useful way to think about overconfidence is to compare confidence with correctness directly. In practice, some incorrect generations still receive very high confidence, which weakens confidence as a standalone reliability signal.
(x-axis: low correctness → high correctness)
The visual intuition is simple: several points remain high on the confidence axis despite not being strongly aligned with correctness.
Why overconfidence appears
- The model may strongly prefer one next token over alternatives even when the final field is not well grounded in the source document.
- Structural tokens such as JSON punctuation and formatting can raise the average confidence without improving semantic correctness.
- Negative log-probabilities very close to zero translate into probabilities near 1, creating the appearance of near-perfect certainty.
- A sequence can be locally probable step by step and still be globally wrong.
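The log-probability point in the list above is worth making concrete: exponentiating a near-zero average log-probability yields a number that rounds to apparent certainty at typical display precision. The value used here is illustrative.

```python
import math

# A reported average log-probability very close to zero...
avg_logprob = -0.0001

# ...maps to a probability that displays as near-perfect certainty.
prob = math.exp(avg_logprob)

print(f"{prob:.4f}")          # 0.9999 — looks like near-certain
print(prob == 1.0)            # False: it only *appears* certain
```

Two extractions whose log-probabilities differ by a tiny amount can therefore both display as "99.9x%" confident while still disagreeing on the actual value.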
Example interpretation
In the study, token-level confidence can remain extremely high across two passes while the extracted values still disagree. That is exactly the kind of situation where overconfidence becomes visible: local probability says “safe”, but cross-pass stability says “uncertain”.
- Pass 1 token confidence: 0.9999
- Pass 2 token confidence: 0.9993
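The two passes above can be combined in code. The gating scheme below, scaling the weaker token confidence by cross-pass agreement, is an illustrative assumption rather than the study's actual formula, and the extracted values are hypothetical.

```python
# Extracted field value and saturated token confidence per pass.
values = ["invoice-001", "invoice-101"]
token_confs = [0.9999, 0.9993]

# Fraction of passes agreeing with the modal value.
agreement = max(values.count(v) for v in set(values)) / len(values)

# One simple gating scheme: stability caps local confidence.
final_conf = agreement * min(token_confs)

print(f"agreement: {agreement}, final confidence: {final_conf:.2f}")
```

With disagreement between passes, the near-perfect 0.9993 collapses to roughly 0.50: local probability says "safe", but the combined signal correctly says "uncertain".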
Next improvements
- Add reliability diagrams and calibration curves from larger experiments.
- Separate semantic tokens from structural tokens more explicitly.
- Benchmark cross-pass agreement against factual correctness labels.
- Compare Gemini confidence behavior with other LLM providers.