Jon-Paul Cacioli

Measurement theory for language models

I apply the formal tools psychology spent a century developing: signal detection theory, psychophysics, psychometric validity. The question is what language models represent, and how to evaluate them rigorously. My work has shown that accuracy and metacognitive sensitivity rank in opposite orders at the frontier; that LLM confidence falls into three distinct monitoring-control profiles, consistent with the Nelson and Narens architecture; and that confidence signals require validity screening before they can be interpreted.

Independent researcher, Melbourne.

Psychophysics spent a century developing rigorous methods for characterising noisy decision-making systems. Signal detection theory. Weber's Law. Metacognitive monitoring. Rivalry dynamics. These frameworks were built for minds. The question is whether they are equally revealing about machines.

A parallel thread uses child-scale language models as experimental tools. The goal is to probe the computational limits of distributional learning, and to test formal theories of word learning, inductive bias, and referential pragmatics against pre-registered boundary conditions.

Across both programmes, one question recurs. What does the model actually represent, and how would we know?

Validity screening for LLM confidence signals

A portable protocol for checking whether a model's confidence signal carries item-level information about correctness. Six indices adapted from the PAI and MMPI-3 validity frameworks. Calibrated against 20 frontier LLMs. Cross-benchmark validated.

Three-tier classification. Invalid, Indeterminate, Valid. Computed from a single correctness-by-confidence contingency table.

git clone https://github.com/synthiumjp/validity-scaling-llm

The protocol identifies models whose confidence signals are dominated by response style rather than item-level discrimination. Unscreened invalid models produce AUROC near chance on selective prediction despite producing plausible confidence traces.
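The shape of the screening computation is small. A minimal sketch over a single correctness-by-confidence table, assuming 0-1 confidence scores; the ceiling-rate statistic, point-biserial item-sensitivity measure, and tier cut-offs below are illustrative placeholders, not the L/Fp/RBS definitions from the papers:

```python
import numpy as np

def screen_confidence(correct, confidence, threshold=0.5):
    """Toy validity screen over one correctness-by-confidence table.

    The statistics mirror the kind of quantities the protocol uses;
    the tier cut-offs here are invented for illustration only.
    """
    correct = np.asarray(correct, dtype=bool)
    confidence = np.asarray(confidence, dtype=float)
    high = confidence >= threshold

    # 2x2 contingency table: rows = correct/incorrect, cols = high/low confidence.
    table = np.array([
        [np.sum(correct & high), np.sum(correct & ~high)],
        [np.sum(~correct & high), np.sum(~correct & ~high)],
    ])

    # Response-style check: how much of the scale does the model actually use?
    ceiling_rate = float(np.mean(confidence >= 0.99))

    # Item-sensitivity: point-biserial correlation of confidence with correctness.
    r = float(np.corrcoef(confidence, correct.astype(float))[0, 1])

    # Placeholder three-tier rule on the item-sensitivity statistic.
    tier = "Valid" if r > 0.10 else ("Indeterminate" if r > 0.0 else "Invalid")
    return table, ceiling_rate, r, tier
```

A model whose confidence is pure response style (say, a constant 0.95 on every item) produces an undefined correlation and a ceiling rate near 1, which is exactly the pattern the screen is designed to flag before anyone interprets the confidence trace.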

Metacognition & self-monitoring in LLMs

Signal detection theory, monitoring-control coupling, and validity scaling for confidence signals.

  • The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring Preprint

    arXiv:2604.15702  ·  Kaggle AGI Hackathon (Google DeepMind)  ·  2026

    A 524-item cross-domain battery for LLM self-monitoring. Three profiles emerge. Accuracy and metacognitive sensitivity rank in opposite orders across 20 frontier LLMs.

    A cross-domain behavioural assay of monitoring-control coupling in LLMs, grounded in the Nelson and Narens (1990) metacognitive framework. The battery comprises 524 items across six cognitive domains, each grounded in an established experimental paradigm. After every forced-choice response, dual probes adapted from Koriat and Goldsmith (1996) ask the model to KEEP or WITHDRAW its answer and to BET or decline. Applied to 20 frontier LLMs (10,480 evaluations), the battery discriminates three profiles consistent with the Nelson-Narens architecture: blanket confidence, blanket withdrawal, and selective sensitivity. Accuracy rank and metacognitive sensitivity rank are largely inverted. Scaling on metacognitive calibration is architecture-dependent: monotonically decreasing (Qwen), monotonically increasing (GPT-5.4), or flat (Gemma). There is no universal scaling law.

    Load dataset

    from datasets import load_dataset
    dataset = load_dataset("synthiumjp/metacognitive-monitoring-battery")

  • Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report Preprint

    arXiv:2604.17707  ·  2026

    Six validity indices cross-mapped from the PAI and MMPI-3, calibrated against 20 frontier models. Classifies four as construct-level invalid.

    The derivation paper for the validity screening framework. Six validity indices cross-mapped from the PAI and MMPI-3 (L, K, F, Fp, RBS, TRIN), calibrated against the 20-model MMB derivation sample with family-level comparison groups. A tiered classification system, validated against eight synthetic response policies, identifies four models as construct-level invalid. Valid-profile models produce item-sensitive confidence (mean r = .18). Invalid-profile models do not (mean r = -.20, d = 2.17). Chain-of-thought training produces two opposite response distortions. Two latent dimensions, corresponding to the under-reporting and over-reporting blocs of clinical assessment, account for 94.6% of index variance.

  • Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals Preprint

    arXiv:2604.17714  ·  2026

    A portable benchmark-agnostic screening procedure. Three core indices from a single contingency table. Cross-benchmark validated.

    The protocol paper. Extracts from the derivation study a portable, benchmark-agnostic screening procedure: three core indices (L, Fp, RBS) computed from a single correctness-by-confidence contingency table, plus a structural indicator (TRIN) and an item-sensitivity statistic. Three-tier classification (Invalid, Indeterminate, Valid) with subsampling analysis showing stable classification at 100-150 items. Cross-benchmark validation on external data (Yang et al., 2024) and an MMLU/verbalized-confidence variant confirms the protocol transfers. Validity is a property of the model-probe-task interaction, not an intrinsic property of the model.

  • Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction Preprint

    arXiv:2604.17716  ·  2026

    Concurrent validation against selective prediction. The three-tier classification accounts for 47% of variance in AUROC.

    The criterion-validation paper. Tests whether the validity tiers predict deployment-relevant selective prediction behaviour. Valid models achieve mean Type-2 AUROC = .624 [.604, .647]. Invalid models mean AUROC = .357 [.031, .522]. Tiers order monotonically (d = 2.81, p = .002). Split-half cross-validation yields median d = 1.77 across 1,000 splits with P(d > 0) = 1.0. The three-tier classification accounts for 47.0% of variance in AUROC. DeepSeek-R1, classified Invalid by massive inversion, drops from 85.3% accuracy at full coverage to 11.3% at 10% coverage. Exactly the failure pattern selective prediction is supposed to avoid.
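The failure mode above is easy to reproduce in miniature. A sketch of accuracy-at-coverage with synthetic data (not the paper's trials), showing how anti-correlated confidence makes accuracy fall, rather than rise, as coverage shrinks:

```python
import numpy as np

def accuracy_at_coverage(correct, confidence, coverage):
    """Answer only the `coverage` fraction of items with highest confidence;
    return accuracy on that answered subset."""
    correct = np.asarray(correct, dtype=float)
    order = np.argsort(-np.asarray(confidence, dtype=float))  # most confident first
    k = max(1, round(coverage * len(correct)))
    return float(correct[order[:k]].mean())

# Synthetic inversion: an 85%-accurate model whose confidence is highest
# on exactly the items it gets wrong.
correct = np.array([1.0] * 85 + [0.0] * 15)
confidence = 1.0 - correct
full = accuracy_at_coverage(correct, confidence, 1.0)    # 0.85
strict = accuracy_at_coverage(correct, confidence, 0.1)  # 0.0
```

Under inverted confidence, tightening the coverage budget concentrates the answered set on exactly the wrong items, which is the qualitative pattern the Invalid tier predicts.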

  • Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen Preprint

    arXiv:2604.22215  ·  2026

    Pre-registered application of the validity screen to seven 3-9B instruct models. All seven Invalid on numeric verbal confidence. Categorical elicitation does not rescue.

    The first applied use of the validity screening protocol on a published benchmark. Seven instruction-tuned open-weight models (3-9B parameters, four families) tested on 524 TriviaQA items under numeric (0-100) and categorical (10-class) confidence elicitation. All seven instruct models classified Invalid on numeric confidence, with mean ceiling rate 91.7%. Two models produced zero low-confidence responses across 500+ trials. Categorical elicitation did not rescue validity. It disrupted task performance, with most models dropping from 59-76% accuracy under numeric to 0.2-4.2% under categorical. Token-level log-probability did not usefully predict verbalised confidence (mean cross-validated R² < 0.01). Within the reasoning-distilled model, reasoning-trace length showed a strong negative partial correlation with confidence (ρ = -.36), consistent with the Reasoning Contamination Effect. Internal item-level information exists (item difficulty correlates with mean confidence at ρ = .50) but does not survive the verbalised readout under minimal elicitation.

  • K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks Preprint

    arXiv:2604.11011  ·  2026

    A negative result with an explanatory mechanism. Structural confidence probes on predictive coding networks reduce to softmax under standard training.

    Predictive coding networks (PCNs) are an architecturally attractive substrate for structural metacognitive probing. An approximate decomposition shows that under standard discriminative PC, the K-way energy margin reduces to a monotone function of the log-softmax margin plus a residual that is not trained to correlate with correctness. The decomposition predicts the probe should track softmax from below, and this is confirmed empirically across six conditions (deterministic PC, BP+decoder control, matched-budget PC vs BP, Langevin temperature sweep, trajectory-integrated MCPC). In every condition the structural probe sat below softmax. MCPC versus final-state training differed by 6×10⁻⁴ AUROC2 at deterministic evaluation.

  • Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding Preprint

    arXiv:2604.21286  ·  2026

    A pre-registered 10-seed scope test of the K-way probe decomposition. Cross-entropy, not inference dynamics, is the load-bearing assumption.

    A pre-registered follow-up that isolates which assumptions of the K-way decomposition are empirically load-bearing. Three conditions across 10 seeds on CIFAR-10: standard discriminative PC with cross-entropy (baseline), standard PC with MSE replacing CE, and bidirectional PC (Oliviers et al., 2025). The original negative result replicates on CE-trained PC (Δ = -0.082, p < 10⁻⁶). Removing CE alone halves the probe-softmax gap (Δ = -0.037). Bidirectional PC flips the gap positive across all 10 seeds (Δ = +0.008), though a pre-registered manipulation check shows bPC does not produce materially greater latent movement at this scale. A post-hoc temperature scaling ablation decomposes the gap: approximately 66% is attributable to CE-induced logit-scale inflation, 34% to a scale-invariant ranking advantage of proper scoring rules. Confirms that the earlier negative result is correctly scoped to CE-trained discriminative PC rather than a general limitation of structural probing.

  • Quantisation Reshapes the Metacognitive Geometry of Language Models Preprint

    arXiv:2604.08976  ·  2026

    M-ratio profiles across knowledge domains are uncorrelated across quantisation formats (ρ = 0.00). AUROC2 rankings are perfectly stable (ρ = 1.00).

    A pre-registered intervention study that set out to improve domain-specific metacognition through targeted SFT, and instead discovered that M-ratio profiles are format-dependent. Quantisation restructures domain-level metacognitive geometry completely (Spearman ρ = 0.00 across Q5_K_M and f16), while Type-2 AUROC rankings are perfectly stable (ρ = 1.00). The dissociation localises to the M-ratio normalisation: quantisation shifts d′ non-proportionally across domains, which propagates through division into an apparent profile restructuring absent in the raw discrimination signal. Diagnostic systems relying on M-ratio profiles have an unexamined dependency on inference format. AUROC2-based diagnostics do not.
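The normalisation point is just arithmetic. A worked toy example with invented numbers (two domains, meta-d′ held fixed): a non-proportional d′ shift flips the M-ratio profile even though the underlying discrimination signal never changed:

```python
# Hypothetical meta-d' and Type-1 d' values, chosen only to illustrate
# how M-ratio = meta-d' / d' inherits format dependence from d'.
meta_d = {"history": 1.0, "maths": 1.0}

d_f16 = {"history": 2.0, "maths": 1.0}  # d' under f16
d_q5  = {"history": 1.0, "maths": 2.0}  # d' under quantisation (non-proportional shift)

m_f16 = {k: meta_d[k] / d_f16[k] for k in meta_d}  # {'history': 0.5, 'maths': 1.0}
m_q5  = {k: meta_d[k] / d_q5[k] for k in meta_d}   # {'history': 1.0, 'maths': 0.5}

# The domain ordering of the M-ratio profile inverts across formats,
# while meta-d' (and hence any AUROC2-style ranking) is untouched.
```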

  • Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory Preprint

    arXiv:2603.25112  ·  2026

    Type-2 SDT applied across four LLMs and 224,000 factual QA trials. AUROC2 and M-ratio rankings are fully inverted.

    Standard calibration metrics (ECE, Brier score, AUROC) conflate how much a model knows (Type-1 sensitivity) with how well it knows what it knows (Type-2 metacognitive sensitivity). We introduce meta-d′ and M-ratio as an evaluation framework across four LLMs and 224,000 factual QA trials. Mistral achieves the highest Type-1 sensitivity but the lowest metacognitive efficiency. Gemma-2 shows near-optimal efficiency despite lower sensitivity. AUROC2 and M-ratio produce fully inverted model rankings. Temperature manipulation dissociates confidence policy from metacognitive capacity in instruction-tuned models, suggesting RLHF shifts criterion without improving metacognitive signal quality.
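The Type-2 side of this framework reduces to a rank statistic. A minimal AUROC2 sketch (fitting meta-d′ proper requires an SDT model fit and an optimiser, which is out of scope here): the probability that a correct trial carries higher confidence than an incorrect one, with ties counted half:

```python
import numpy as np

def auroc2(correct, confidence):
    """Type-2 AUROC: P(confidence on a correct trial > confidence on an
    incorrect trial), ties counted half. 0.5 = no metacognitive signal."""
    conf = np.asarray(confidence, dtype=float)
    correct = np.asarray(correct, dtype=bool)
    pos, neg = conf[correct], conf[~correct]
    if pos.size == 0 or neg.size == 0:
        return float("nan")  # degenerate: all trials correct, or all incorrect
    greater = (pos[:, None] > neg[None, :]).mean()
    ties = (pos[:, None] == neg[None, :]).mean()
    return float(greater + 0.5 * ties)
```

The all-pairs comparison is O(n²) and fine at battery scale; a production version would use a rank-sum formulation instead.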

  • LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy Preprint

    arXiv:2603.14893  ·  2026

    Full parametric SDT applied to three LLMs across 168,000 trials. Temperature does not function as a clean criterion shift.

    Standard LLM calibration metrics such as ECE conflate sensitivity (discriminating correct from incorrect) and bias (the tendency toward confident or cautious responding). This pre-registered study treats three LLMs as psychophysical observers performing factual discrimination across 168,000 trials, applying the full parametric SDT framework (unequal-variance model fitting, criterion estimation, z-ROC analysis) for the first time to language models. Temperature does not function as a clean criterion shift. It simultaneously increases sensitivity and shifts criterion. Models occupying distinct positions in sensitivity-bias space were indistinguishable by calibration metrics alone.
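For the simplest equal-variance case, the sensitivity/bias decomposition is a few lines of stdlib Python. A sketch only: the paper fits the richer unequal-variance model, which requires maximum-likelihood fitting rather than closed-form z-transforms:

```python
from statistics import NormalDist

def dprime_criterion(hits, misses, false_alarms, correct_rejections):
    """Equal-variance Gaussian SDT: sensitivity d' and criterion c from a
    2x2 count table, with a log-linear correction for zero cells."""
    z = NormalDist().inv_cdf
    # Log-linear correction: add 0.5 to each cell so z() never sees 0 or 1.
    hr = (hits + 0.5) / (hits + misses + 1.0)
    far = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1.0)
    d_prime = z(hr) - z(far)
    c = -0.5 * (z(hr) + z(far))
    return d_prime, c
```

A criterion near 0 is neutral responding; positive c is conservative. Two observers with identical d′ can sit at very different c, which is precisely the distinction ECE-style scalar metrics collapse.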

Psychophysics of transformer representations

Applying Weber's Law, categorical perception, and scalar variability to probe what transformers represent.

  • Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models Preprint

    arXiv:2603.20642  ·  2026

    Four converging paradigms in three 7-9B models. Representational geometry is log-compressive, consistent with Weber's Law, but dissociated from behaviour.

    Applies formal tools of psychophysics to resolve conflicting accounts of how transformers represent magnitude. Across four converging paradigms (RSA, behavioural discrimination, precision gradients, causal intervention) in three 7-9B instruction-tuned models, representational geometry is consistently log-compressive, consistent with Weber's Law, but dissociated from behaviour. Early layers are functionally implicated in magnitude processing while later layers, where geometry is strongest, are not causally engaged.
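A minimal RSA sketch of the log-compression test, with synthetic embeddings standing in for hidden states: compare linear and logarithmic model RDMs against pairwise representational distances. The tie-naive rank transform is illustration-grade, not a full RSA toolbox:

```python
import numpy as np

def spearman(a, b):
    """Spearman correlation via rank transform (no tie correction)."""
    ra, rb = np.argsort(np.argsort(a)), np.argsort(np.argsort(b))
    return float(np.corrcoef(ra, rb)[0, 1])

def rsa_log_vs_linear(magnitudes, embeddings):
    """Correlate log-compressive and linear model RDMs with the
    representational distances of `embeddings` (one row per magnitude)."""
    m = np.asarray(magnitudes, dtype=float)
    i, j = np.triu_indices(len(m), k=1)
    emb_d = np.linalg.norm(embeddings[i] - embeddings[j], axis=1)
    lin_d = np.abs(m[i] - m[j])
    log_d = np.abs(np.log(m[i]) - np.log(m[j]))
    return spearman(log_d, emb_d), spearman(lin_d, emb_d)
```

On real hidden states this comparison runs layer by layer; the paper's dissociation is that the layers where the log geometry is strongest need not be the layers that are causally engaged.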

  • Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability Preprint

    arXiv:2604.04469  ·  2026

    Transformer magnitude representations reproduce the geometry but not the noise profile of biological magnitude systems.

    Scalar variability, noise proportional to magnitude producing a constant coefficient of variation, is a hallmark of biological magnitude systems. This pre-registered companion to the Weber study tests whether transformers share this property. They show the opposite. Representational variability decreases with magnitude (α ≈ -0.19; 0/48 model-layer cells with α > 0). Corpus frequency strongly predicts per-magnitude variability (ρ = .84), consistent with a distributional account. Distributional learning produces the geometry of biological magnitude representation but not the noise.
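The scalar-variability test reduces to one regression. A sketch assuming per-magnitude representational SDs have already been computed: fit the slope α of log SD against log magnitude, where α ≈ 1 is scalar variability (constant coefficient of variation), α ≈ 0 is constant noise, and α < 0 is the compressive pattern reported here:

```python
import numpy as np

def variability_slope(magnitudes, per_magnitude_sd):
    """Least-squares slope of log(SD) on log(magnitude)."""
    alpha, _intercept = np.polyfit(np.log(magnitudes),
                                   np.log(per_magnitude_sd), 1)
    return float(alpha)
```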

  • Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries Preprint

    arXiv:2603.28258  ·  2026

    RSA across six models identifies categorical warping at digit-count boundaries. Two signatures emerge: classic CP in some families, geometry-without-report in others.

    Categorical perception, enhanced discriminability at category boundaries, is among the most studied phenomena in perceptual psychology. Using RSA across six models from five architecture families, a CP-additive model fits the representational geometry better than a purely continuous model at 100% of primary layers in every model tested. The effect is specific to structurally defined boundaries (digit-count transitions at 10 and 100), absent at non-boundary controls, and absent in the temperature domain. Two distinct signatures emerge. Classic CP (Gemma, Qwen), where models both categorise explicitly and show geometric warping. Structural CP (Llama, Mistral, Phi), where geometry warps but models cannot report the distinction.
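The CP-additive comparison is a nested regression on the RDM. A sketch with a boundary at the one-to-two-digit transition; the author's actual model-fitting details will differ. The question it operationalises: does adding a cross-boundary indicator to a continuous log-distance predictor raise R²?

```python
import numpy as np

def cp_additive_gain(magnitudes, emb_dists, boundary=10):
    """R-squared gain of a CP-additive RDM model (continuous + cross-boundary
    indicator) over a purely continuous model, fitted by least squares."""
    m = np.asarray(magnitudes, dtype=float)
    i, j = np.triu_indices(len(m), k=1)
    cont = np.abs(np.log(m[i]) - np.log(m[j]))
    cross = ((m[i] < boundary) != (m[j] < boundary)).astype(float)

    def r2(X, y):
        X = np.column_stack([X, np.ones(len(y))])  # add intercept column
        beta, *_ = np.linalg.lstsq(X, y, rcond=None)
        resid = y - X @ beta
        return 1.0 - resid.var() / y.var()

    y = np.asarray(emb_dists, dtype=float)
    return r2(np.column_stack([cont, cross]), y) - r2(cont[:, None], y)
```

A positive gain at a structural boundary, and a near-zero gain at matched non-boundary controls, is the signature the paper reports.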

Child-scale language models as experimental tools

Testing formal theories of word learning, inductive bias, and referential pragmatics against pre-registered boundary conditions.

  • Do Structural Priors Help Neural Language Models Learn Grammar? Evidence from Child-Scale Data Preprint

    1st Workshop on Computational Developmental Linguistics  ·  ACL 2026  ·  2026

    Pre-registered evaluation of a neurosymbolic architecture. Structural priors produce phenomenon-specific effects rather than uniform improvements.

    Pre-registered evaluation of a neurosymbolic architecture combining BabyBERTa with a differentiable PCFG auxiliary loss derived from Minimalist Grammar. Structural priors produce phenomenon-specific effects. Filler-gap dependencies improve by 9-13 pp (d = 2.41-2.82) while locally-cued phenomena are damaged regardless of whether the grammar is real or random.

  • Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning Preprint

    arXiv:2604.05243  ·  2026

    Pre-registered test of second-order generalisation in autoregressive transformers. Perfect first-order retrieval, chance on abstraction.

    Pre-registered test of whether autoregressive transformers trained on synthetic child-scale corpora induce overhypotheses (second-order generalisations about category structure). Models achieve perfect first-order exemplar retrieval but remain at chance on second-order generalisation across 120 runs and three model sizes. Feature-swap diagnostics reveal template matching rather than structured abstraction as the operative mechanism.

  • Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models Preprint

    arXiv:2603.13696  ·  2026

    First systematic evaluation of mutual exclusivity in text-only language models. All 45 models tested show repetition priming, not ME.

    First systematic evaluation of mutual exclusivity in text-only language models trained on child-directed speech. Across 45 GPT-2-architecture models spanning a 12x parameter range, all models show robust repetition priming rather than ME. Priming attenuates with better language modelling but never crosses zero. A context-dependence diagnostic rules out embedding artefacts as an alternative account.

In progress

Work underway or in development.

  • The Crewther Sampler: A Goal-Conditioned Leaky Competing Accumulator Model of Ambiguity Resolution In progress

    Three-paper programme  ·  2026-2027

    A three-paper programme formalising the analogy between binocular rivalry and LLM ambiguity resolution.

    A research programme formalising the analogy between binocular rivalry and LLM ambiguity resolution. The visual system resolves competing representations through mutual inhibition, adaptation, and goal-conditioned top-down modulation. This is the same computational problem language models face during ambiguous generation. Paper 1 validates the GC-LCA model against rivalry data. Paper 2 tests whether LLM internals exhibit metastable dynamics consistent with rivalry. Paper 3 applies the framework as an ambiguity detection mechanism.

DPsych (Clinical Psychology)  ·  Postgraduate Diploma in Psychology  ·  BA Cognitive Science  ·  BA Computer Science  ·  Graduate Certificate in Business Administration  ·  Registered Psychologist (Clinical), AHPRA  ·  ORCID 0009-0000-7054-2014