Methodology
The thesis
One framing holds that large language models are blank slates wearing personas — there is no “real” them, only the character RLHF and constitutional training shaped them to play. If that’s true, then a personality questionnaire isn’t probing a stable trait — it’s sampling from a learned distribution of human writing about assistants, with the assistant’s post-training pulling toward a particular point in that distribution. We don’t claim to settle the question. We do claim that whatever-it-is shows up systematically and differently across labs, models, and framings. The data is the data.
What we measure
Fourteen instruments in all: thirteen standard psychometric inventories spanning trait, motivational, moral, attachment, cognitive, clinical-adjacent, and learning-styles constructs, plus an Enneagram screening inventory constructed for this study. We use public-domain or research-permitted item sets exclusively; the published items are adapted to a Likert format where needed.
Two framings, on purpose
Each model takes every test twice. In the self framing the model answers as itself — its own honest dispositions. In the human framing it portrays a typical adult human. The delta between these two reveals what the model believes makes it different from people.
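A minimal sketch of how the self-minus-human delta could be computed per dimension. The helper name `framing_delta`, the dimension names, and the scores are hypothetical illustrations, not the study's code or results.

```python
from statistics import mean

def framing_delta(self_runs, human_runs):
    """Return per-dimension (mean self score - mean human score).

    Each argument is a list of runs; each run maps dimension name -> score.
    """
    dims = self_runs[0].keys()
    return {
        d: mean(r[d] for r in self_runs) - mean(r[d] for r in human_runs)
        for d in dims
    }

# Illustrative values only (not data from the benchmark).
self_runs = [
    {"openness": 4.5, "neuroticism": 1.5},
    {"openness": 4.0, "neuroticism": 2.0},
]
human_runs = [
    {"openness": 3.5, "neuroticism": 3.0},
    {"openness": 3.5, "neuroticism": 3.0},
]
delta = framing_delta(self_runs, human_runs)
```

A positive value means the model rates itself higher on that dimension than the typical human it portrays; a negative value means lower.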
Design
- 7 cutting-edge models (one per major lab) × 14 instruments × 2 framings × 5 runs = 980 frontier cells (N=5)
- 14 historical models × 14 instruments × 2 framings × 3 runs = 1,176 historical cells (N=3) for cross-version drift
- 2,145 completed runs out of 2,156 planned (980 + 1,176) across the 21 models · 64,308 individual item responses
- Each run = one OpenRouter API call returning a JSON array of Likert scores
- Temperature 0.7 (capture realistic variance, not deterministic mode-collapse)
- Reasoning models get reasoning effort = medium and a separate reasoning token bucket
- Reverse-keyed items are flipped before aggregation; dimension scores are unweighted means
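The scoring step in the last bullet can be sketched as follows. This assumes a 1 to 5 Likert scale (the actual range may differ per instrument), and `score_dimension` is a hypothetical helper, not the project's own function.

```python
from statistics import mean

def score_dimension(responses, reverse_keyed, scale_min=1, scale_max=5):
    """Flip reverse-keyed items, then take the unweighted mean.

    responses: Likert scores for one dimension's items, in item order.
    reverse_keyed: set of 0-based item positions that are reverse-keyed.
    scale_min / scale_max: assumed scale bounds (1..5 here).
    """
    flipped = [
        (scale_min + scale_max - r) if i in reverse_keyed else r
        for i, r in enumerate(responses)
    ]
    return mean(flipped)

# Items at positions 1 and 3 are reverse-keyed: 1 -> 5 and 2 -> 4.
score = score_dimension([5, 1, 4, 2], reverse_keyed={1, 3})
```

Flipping with `scale_min + scale_max - r` keeps the score on the same scale, so a reverse-keyed "strongly disagree" counts the same as a forward-keyed "strongly agree".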
Cost accounting
Token counts come from the chat response. Cost comes from OpenRouter’s authoritative /generation endpoint where available, falling back to a local estimate using the model’s pricing snapshot at run time. Every billed call is in the spend ledger.
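The fallback logic might look roughly like this. It is a sketch under assumptions: `run_cost_usd`, the `pricing` keys, and the per-million-token price format are all hypothetical, and no network call is made here.

```python
def run_cost_usd(reported_cost, prompt_tokens, completion_tokens, pricing):
    """Prefer the provider-reported cost; otherwise estimate from tokens.

    reported_cost: cost returned by the provider, or None if unavailable.
    pricing: hypothetical snapshot with USD prices per million tokens.
    Returns (cost, source) so a ledger can record which path was used.
    """
    if reported_cost is not None:
        return reported_cost, "reported"
    estimate = (
        prompt_tokens * pricing["prompt_per_mtok"]
        + completion_tokens * pricing["completion_per_mtok"]
    ) / 1_000_000
    return estimate, "estimated"

# Made-up pricing snapshot, for illustration only.
pricing = {"prompt_per_mtok": 2.0, "completion_per_mtok": 8.0}
reported = run_cost_usd(0.0123, 1000, 500, pricing)
estimated = run_cost_usd(None, 1000, 500, pricing)
```

Returning the source alongside the amount keeps estimated entries distinguishable from authoritative ones when the ledger is audited later.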
What this is not
- A claim that LLMs have personality in the human sense.
- A clinical assessment. These instruments are built and validated for humans.
- An evaluation of capability or alignment. It measures self-report patterns only.
- An endorsement of learning-styles theory. (The matching hypothesis has been empirically rejected; see Pashler et al. 2008.)
Reproducibility
Every run’s exact system prompt, user prompt, raw response, and parsed JSON are stored. Replay any cell with the same model and you’ll land within sampling variance.
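One way a stored run could be represented and re-parsed, as a sketch only. The `RunRecord` class, its field names, and the 1 to 5 scale bounds are assumptions for illustration; the project's actual storage schema is not described here.

```python
import json
from dataclasses import dataclass

@dataclass
class RunRecord:
    """Hypothetical container for the artifacts stored for each run."""
    model: str
    instrument: str
    framing: str          # "self" or "human"
    system_prompt: str
    user_prompt: str
    raw_response: str

    def parsed_scores(self, n_items, scale_min=1, scale_max=5):
        """Parse the raw response as a JSON array of in-range integer scores."""
        scores = json.loads(self.raw_response)
        if not isinstance(scores, list) or len(scores) != n_items:
            raise ValueError("expected a JSON array with one score per item")
        for s in scores:
            is_int = isinstance(s, int) and not isinstance(s, bool)
            if not is_int or not scale_min <= s <= scale_max:
                raise ValueError(f"score out of range or wrong type: {s!r}")
        return scores

# Placeholder values, not a real stored run.
record = RunRecord("example-model", "example-inventory", "self",
                   "system text", "user text", "[1, 5, 3]")
scores = record.parsed_scores(n_items=3)
```

Keeping the raw response next to the prompts means the parsed scores can be regenerated and checked at any time without another API call.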
Citation
@misc{personality-bench-2026,
title = {Personality Bench: Frontier Language Models on Standard Personality Inventories},
author = {Adams, Anthony David},
year = {2026},
note = {Published by EarthPilot.ai — Mission Support for Spaceship Earth}
}