Measurement theory for language models
I apply the formal tools psychology spent a century developing. Signal detection theory, psychophysics, psychometric validity. The question is what language models represent, and how to evaluate them rigorously. My work has shown that accuracy and metacognitive sensitivity rank in opposite orders at the frontier. That LLM confidence falls into three distinct monitoring-control profiles, consistent with the Nelson and Narens architecture. That confidence signals require validity screening before they can be interpreted.
Independent researcher, Melbourne.
Research
Psychophysics spent a century developing rigorous methods for characterising noisy decision-making systems. Signal detection theory. Weber's Law. Metacognitive monitoring. Rivalry dynamics. These frameworks were built for minds. The question is whether they are equally revealing about machines.
A parallel thread uses child-scale language models as experimental tools. The goal is to probe the computational limits of distributional learning, and to test formal theories of word learning, inductive bias, and referential pragmatics against pre-registered boundary conditions.
Across both programmes, one question recurs. What does the model actually represent, and how would we know?
Tool
Validity screening for LLM confidence signals
A portable protocol for checking whether a model's confidence signal carries item-level information about correctness. Six indices adapted from the PAI and MMPI-3 validity frameworks. Calibrated against 20 frontier LLMs. Cross-benchmark validated.
Three-tier classification. Invalid, Indeterminate, Valid. Computed from a single correctness-by-confidence contingency table.
git clone https://github.com/synthiumjp/validity-scaling-llm
The protocol identifies models whose confidence signals are dominated by response style rather than item-level discrimination. Without screening, Invalid models slip through: their confidence traces look plausible, yet they yield AUROC near chance on selective prediction.
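A minimal sketch of the core computation, assuming nothing beyond NumPy. The single AUROC index and the tier cutoffs below are illustrative placeholders, not the repository's six PAI/MMPI-3-derived indices or its calibrated thresholds.

import numpy as np

def auroc_from_table(table):
    # AUROC for correctness given binned confidence.
    # table is a 2 x K contingency table: row 0 = incorrect,
    # row 1 = correct; columns are confidence bins, low to high.
    # Uses the rank-sum (Mann-Whitney) identity; ties within a
    # bin count as half a win.
    inc, cor = table[0].astype(float), table[1].astype(float)
    wins = 0.0
    for k in range(table.shape[1]):
        wins += cor[k] * inc[:k].sum()   # correct ranked above incorrect
        wins += 0.5 * cor[k] * inc[k]    # within-bin ties split evenly
    return wins / (cor.sum() * inc.sum())

def screen(table, lo=0.55, hi=0.65):
    # Three-tier screen; lo/hi are hypothetical placeholder cutoffs.
    a = auroc_from_table(np.asarray(table))
    tier = "Invalid" if a < lo else "Indeterminate" if a < hi else "Valid"
    return tier, a

# Toy correctness-by-confidence table with five confidence bins.
table = [[40, 30, 20, 15, 10],   # incorrect responses per bin
         [10, 15, 25, 40, 60]]   # correct responses per bin
print(screen(table))             # ('Valid', 0.784...)

A model whose confidence is pure response style produces a table whose two rows are proportional, which drives the index to 0.5 and the tier to Invalid.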
Recent Work
Metacognition & self-monitoring in LLMs
Signal detection theory, monitoring-control coupling, and validity scaling for confidence signals.
The Metacognitive Monitoring Battery: A Cross-Domain Benchmark for LLM Self-Monitoring Preprint
A 524-item cross-domain battery for LLM self-monitoring. Three profiles emerge. Accuracy and metacognitive sensitivity rank in opposite orders across 20 frontier LLMs.
Before You Interpret the Profile: Validity Scaling for LLM Metacognitive Self-Report Preprint
Six validity indices cross-mapped from the PAI and MMPI-3, calibrated against 20 frontier models. Classifies four as construct-level invalid.
Screen Before You Interpret: A Portable Validity Protocol for Benchmark-Based LLM Confidence Signals Preprint
A portable, benchmark-agnostic screening procedure. Three core indices from a single contingency table. Cross-benchmark validated.
Concurrent Criterion Validation of a Validity Screen for LLM Confidence Signals via Selective Prediction Preprint
Concurrent validation against selective prediction. The three-tier classification accounts for 47% of variance in AUROC.
Verbal Confidence Saturation in 3-9B Open-Weight Instruction-Tuned LLMs: A Pre-Registered Psychometric Validity Screen Preprint
Pre-registered application of the validity screen to seven 3-9B instruct models. All seven are Invalid on numeric verbal confidence. Categorical elicitation does not rescue the signal.
K-Way Energy Probes for Metacognition Reduce to Softmax in Discriminative Predictive Coding Networks Preprint
A negative result with an explanatory mechanism. Structural confidence probes on predictive coding networks reduce to softmax under standard training.
Cross-Entropy Is Load-Bearing: A Pre-Registered Scope Test of the K-Way Energy Probe on Bidirectional Predictive Coding Preprint
A pre-registered 10-seed scope test of the K-way probe decomposition. Cross-entropy, not inference dynamics, is the load-bearing assumption.
Quantisation Reshapes the Metacognitive Geometry of Language Models Preprint
M-ratio profiles across knowledge domains are uncorrelated across quantisation formats (ρ = 0.00). AUROC2 rankings are perfectly stable (ρ = 1.00).
Do LLMs Know What They Know? Measuring Metacognitive Efficiency with Signal Detection Theory Preprint
Type-2 SDT applied across four LLMs and 224,000 factual QA trials. AUROC2 and M-ratio rankings are fully inverted.
LLMs as Signal Detectors: Sensitivity, Bias, and the Temperature-Criterion Analogy Preprint
Full parametric SDT applied to three LLMs across 168,000 trials. Temperature does not function as a clean criterion shift.
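The type-1 machinery behind the last entry is standard equal-variance Gaussian SDT. A minimal sketch using only the Python standard library; the trial counts are invented for illustration.

from statistics import NormalDist

def sdt_indices(hits, misses, false_alarms, correct_rejections):
    # Equal-variance Gaussian SDT: sensitivity d' and criterion c.
    # The +0.5 log-linear correction avoids infinite z-scores when
    # a hit or false-alarm rate is exactly 0 or 1.
    z = NormalDist().inv_cdf
    h = (hits + 0.5) / (hits + misses + 1)
    f = (false_alarms + 0.5) / (false_alarms + correct_rejections + 1)
    return z(h) - z(f), -0.5 * (z(h) + z(f))   # (d', c)

# If temperature were a clean criterion shift, sweeping it would move
# c while leaving d' fixed; the paper above reports that it does not.
print(sdt_indices(hits=80, misses=20, false_alarms=30, correct_rejections=70))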
Psychophysics of transformer representations
Applying Weber's Law, categorical perception, and scalar variability to probe what transformers represent.
Weber's Law in Transformer Magnitude Representations: Efficient Coding, Representational Geometry, and Psychophysical Laws in Language Models Preprint
Four converging paradigms in three 7-9B models. Representational geometry is log-compressive, consistent with Weber's Law, but dissociated from behaviour.
Same Geometry, Opposite Noise: Transformer Magnitude Representations Lack Scalar Variability Preprint
Transformer magnitude representations reproduce the geometry but not the noise profile of biological magnitude systems.
Categorical Perception in Large Language Model Hidden States: Structural Warping at Digit-Count Boundaries Preprint
RSA across six models identifies categorical warping at digit-count boundaries. Two signatures emerge: classic CP in some families, geometry-without-report in others.
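A toy version of the geometry test these entries rely on, assuming hidden states have already been extracted for each magnitude. The actual papers use several converging paradigms and full RSA; this sketch checks only the core Weber-style prediction that representational distance tracks log-magnitude differences.

import numpy as np
from itertools import combinations
from scipy.stats import spearmanr

def log_compression_test(magnitudes, hidden_states):
    # Weber's Law predicts discriminability scales with the ratio of
    # magnitudes, so pairwise representational distance should track
    # |log m1 - log m2| rather than |m1 - m2|.
    # hidden_states: (n_items, d) array, one row per magnitude.
    pairs = list(combinations(range(len(magnitudes)), 2))
    rep = [np.linalg.norm(hidden_states[i] - hidden_states[j]) for i, j in pairs]
    lin = [abs(magnitudes[i] - magnitudes[j]) for i, j in pairs]
    logd = [abs(np.log(magnitudes[i]) - np.log(magnitudes[j])) for i, j in pairs]
    rho_lin, _ = spearmanr(rep, lin)
    rho_log, _ = spearmanr(rep, logd)
    return {"linear": rho_lin, "log": rho_log}

# Synthetic demo: states built from log(m) plus noise favour the log account.
rng = np.random.default_rng(0)
m = np.arange(1, 51)
states = np.outer(np.log(m), rng.standard_normal(64)) + 0.1 * rng.standard_normal((50, 64))
print(log_compression_test(m, states))   # 'log' correlation exceeds 'linear'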
Child-scale language models as experimental tools
Testing formal theories of word learning, inductive bias, and referential pragmatics against pre-registered boundary conditions.
Do Structural Priors Help Neural Language Models Learn Grammar? Evidence from Child-Scale Data Preprint
Pre-registered evaluation of a neurosymbolic architecture. Structural priors produce phenomenon-specific effects rather than uniform improvements.
Exemplar Retrieval Without Overhypothesis Induction: Limits of Distributional Sequence Learning in Early Word Learning Preprint
Pre-registered test of second-order generalisation in autoregressive transformers. Perfect first-order retrieval, chance on abstraction.
Repetition Without Exclusivity: Scale Sensitivity of Referential Mechanisms in Child-Scale Language Models Preprint
First systematic evaluation of mutual exclusivity in text-only language models. All 45 models tested show repetition priming, not ME.
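For readers who want the shape of the probe, a schematic of the forced-choice contrast. The prompts and the logprob interface are assumptions for illustration, not the paper's actual stimuli or protocol; wire logprob to any language model that scores continuations.

def mutual_exclusivity_contrast(logprob, novel_word="dax", familiar="ball"):
    # Mutual exclusivity (ME) predicts a novel word maps to a novel
    # referent, so the ME-consistent continuation should outscore the
    # familiar-referent continuation. Repetition priming predicts the
    # opposite: inflated probability for the label already in context.
    # logprob(prompt, continuation) -> model log-probability (assumed).
    context = f"There is a {familiar} and another thing. Show me the {novel_word}. "
    me_choice = logprob(context, "She points to the other thing.")
    familiar_choice = logprob(context, f"She points to the {familiar}.")
    return me_choice - familiar_choice   # > 0 is ME-consistent

# usage: mutual_exclusivity_contrast(my_model_logprob)

A positive contrast across matched item sets would indicate ME; the finding above is that all 45 models instead favour the repeated surface form.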
In progress
Work underway or in development.
The Crewther Sampler: A Goal-Conditioned Leaky Competing Accumulator Model of Ambiguity Resolution In progress
A three-paper programme formalising the analogy between binocular rivalry and LLM ambiguity resolution.
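For context, a minimal simulation of the standard (non-goal-conditioned) leaky competing accumulator the programme builds on; parameters are illustrative.

import numpy as np

def lca_trial(inputs, leak=0.2, inhibition=0.3, noise=0.1,
              dt=0.01, threshold=1.0, max_steps=10_000, rng=None):
    # Standard Usher-McClelland LCA dynamics:
    # dx_i = (I_i - leak*x_i - inhibition*sum_{j!=i} x_j) dt + noise,
    # with activations clipped at zero. Returns (winner, steps).
    rng = rng or np.random.default_rng()
    x = np.zeros(len(inputs))
    for t in range(max_steps):
        others = x.sum() - x   # total activation of competitors
        dx = (inputs - leak * x - inhibition * others) * dt
        x = np.maximum(0.0, x + dx + noise * np.sqrt(dt) * rng.standard_normal(len(x)))
        if x.max() >= threshold:
            return int(x.argmax()), t
    return int(x.argmax()), max_steps

# Two readings of an ambiguous input, one slightly favoured; rivalry-style
# alternation emerges across repeated trials as noise flips the winner.
print(lca_trial(np.array([1.0, 0.9])))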
Earlier Publications
Avatar body dimensions and men's body image 63 citations
Body Image, 11(2), 146–155 · JP Cacioli, AJ Mussap
The relationship of attachment to resilience and their impact on perceived stress 40 citations
Stress and Anxiety: Applications to Social and Environmental Threats · P Marriner, J Cacioli et al.
Decision making-conflict as coping with stress: A qualitative exploration
48th APS Annual Conference, Cairns · J James, Lucas, JP Cacioli, K Moore
The cyclic alternating pattern (CAP) as an indicator of sleep disruption in traumatic brain injury patients
Sleep and Biological Rhythms, 10(1 suppl.), 26 · JP Cacioli, SMW Rajaratnam
Male body image in real and virtual environments
Deakin University (thesis) · JP Cacioli
Background