Personality science and the replication crisis: what has held up?

Only 36 to 39% of the psychology findings retested in a landmark 2015 project replicated. Big Five personality research fared far better, and the reasons explain which findings teams can actually trust.

Miquel Matoses·10 min read

In 2015, a landmark collaboration published findings that shook academic psychology. The Open Science Collaboration assembled 270 researchers across more than 100 laboratories and attempted to replicate 100 findings from high-impact social and cognitive psychology journals. The results, published in Science (doi:10.1126/science.aac4716), were sobering: only 36 to 39 percent of findings replicated, depending on whether success was judged by statistical significance or by the replicating team's own assessment. Effect sizes were systematically smaller in replications than in originals. Many findings that had been widely cited, taught in undergraduate courses, and applied in practice did not hold up under independent testing.

The replication crisis (Wikipedia offers an overview) reshaped the conversation about what psychology actually knows. It prompted soul-searching about small sample sizes, publication bias (the tendency to publish only positive results), researcher degrees of freedom (the many undisclosed choices that can inflate apparent effects), and a culture that rewarded novelty over reproducibility.

Where does personality science sit in this picture? The answer is more reassuring than the overall replication rate suggests — but it is not uniformly reassuring.


Why Big Five Science Survived the Replication Crisis Better

The most dramatic replication failures, in the Open Science Collaboration and in the multi-lab replication projects that followed it, were concentrated in social psychology and, to a lesser degree, cognitive psychology: flashy, counterintuitive effects that made good headlines and lecture material. Priming studies (the idea that briefly exposing people to a word changes their subsequent behaviour), ego depletion (the idea that willpower is a resource that depletes with use), and several classic social influence findings either failed to replicate or replicated with effect sizes a fraction of the original.

Personality science was not immune to replication problems, but it was structurally better positioned to resist them. The reasons are methodological.

Sample sizes tend to be larger. The Big Five findings that anchor the field — the relationship between Conscientiousness and job performance, between Neuroticism and psychological wellbeing, between Openness and creativity — were established across hundreds of studies and meta-analyses involving tens of thousands of participants. When findings are based on very large N and have been replicated many times in different contexts, replication is a matter of course rather than a hope.
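To see why sample size matters so much, it helps to look at how the uncertainty around a correlation shrinks as N grows. The sketch below is an illustration rather than anything taken from the studies cited: it uses the standard Fisher z approximation, and the function name correlation_ci and the three sample sizes are assumptions chosen for the example. The value r = 0.22 echoes the Conscientiousness and job performance estimate quoted in this article.

```python
import math
from statistics import NormalDist


def correlation_ci(r: float, n: int, confidence: float = 0.95) -> tuple[float, float]:
    """Approximate confidence interval for a Pearson correlation (Fisher z method)."""
    if n <= 3:
        raise ValueError("n must be greater than 3")
    z = math.atanh(r)                      # Fisher z-transform of the correlation
    se = 1.0 / math.sqrt(n - 3)            # approximate standard error of z
    crit = NormalDist().inv_cdf(0.5 + confidence / 2)  # two-sided critical value
    # Transform the interval bounds back to the correlation scale.
    return math.tanh(z - crit * se), math.tanh(z + crit * se)


# Illustrative sample sizes (hypothetical): one small study, one large study,
# and a pooled meta-analytic sample.
for n in (40, 400, 40_000):
    low, high = correlation_ci(0.22, n)
    print(f"N={n:>6}: 95% CI for r=0.22 is [{low:.2f}, {high:.2f}]")
```

With a few dozen participants the interval is wide enough to include zero, so a single small study can neither establish nor rule out an effect of this size. With tens of thousands of participants it narrows to roughly one hundredth either side of the estimate.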

The measures are more stable. Personality questionnaires yield highly reliable scores — internal consistency reliabilities typically in the .80-.90 range. Single-session priming paradigms, by contrast, measure short-term, context-sensitive states with far lower reliability. Unreliable measures mean noisy effects that fluctuate unpredictably across replications.
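The cost of an unreliable measure can be made concrete with Spearman's classic attenuation formula, under which the observed correlation equals the true correlation multiplied by the square root of the product of the two measures' reliabilities. The sketch below is a worked illustration only: the true correlation of 0.30 and the reliability of 0.40 for a noisy state measure are hypothetical values, not figures from any study cited here, while 0.85 sits within the questionnaire range quoted above.

```python
import math


def attenuated_r(true_r: float, reliability_x: float, reliability_y: float) -> float:
    """Expected observed correlation given the reliabilities of both measures."""
    for rel in (reliability_x, reliability_y):
        if not 0.0 <= rel <= 1.0:
            raise ValueError("reliabilities must lie between 0 and 1")
    return true_r * math.sqrt(reliability_x * reliability_y)


TRUE_R = 0.30  # hypothetical true effect, for illustration only

# Two reliable questionnaire scores (reliability 0.85 each).
questionnaire = attenuated_r(TRUE_R, 0.85, 0.85)

# Two noisy single-session state measures (reliability 0.40 each, assumed).
noisy_state = attenuated_r(TRUE_R, 0.40, 0.40)

print(f"Observed r with reliable measures: {questionnaire:.3f}")
print(f"Observed r with noisy measures:    {noisy_state:.3f}")
```

Under these assumptions the same underlying effect appears as about .255 with reliable questionnaires and about .12 with the noisy measures. A smaller observed effect is harder to detect in any one sample, which is one route from low reliability to inconsistent replication results.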

The constructs are more operationally transparent. "Conscientiousness" has a clear, consensual definition that has been operationalised consistently across instruments and studies for decades. Many of the non-replicating social psychology findings depended on creative, theoretically contested operationalisations of constructs like "power," "implicit attitude," or "self-regulatory depletion." More transparent constructs produce more replicable findings. The IPIP's public-domain items make this transparency possible at the measurement level.

~50%: share of social psychology studies that failed to replicate (2015 OSC study)
High: Big Five structure replication rate across labs
r = 0.22: Conscientiousness → job performance, holds across replications
IPIP: open-source items, independently verifiable, no proprietary black box

The Robust Big Five Findings That Have Replicated Reliably

"Among the most robust findings in personality psychology is the relationship between Conscientiousness and job performance — a connection documented across hundreds of studies, multiple cultures, and a wide variety of occupational domains." — Roberts et al., 2007 (meta-analytic review)

The following findings from personality science have survived repeated replication and meta-analytic scrutiny. Their effect sizes range from modest to large, but they are consistent from study to study.

Conscientiousness and job performance. The meta-analysis by Barrick and Mount (1991) — and its many replications and extensions — established that Conscientiousness (Discipline in Cèrcol's framework) is the most consistent Big Five predictor of job performance across occupational types. The effect is not large in absolute terms (corrected correlations typically around .20-.28) but it is among the largest personality-outcome relationships in the literature, and it holds across industries, cultures, and job types. This finding has replicated so many times that it is treated as a benchmark against which new predictors are evaluated. For a full profile of this dimension, see what is Conscientiousness.

Neuroticism and wellbeing. The negative relationship between Neuroticism (Depth in Cèrcol's terminology) and subjective wellbeing, life satisfaction, and positive affect is one of the most replicated findings in personality science. A meta-analysis by Steel, Schmidt, and Shultz (2008) found correlations between Neuroticism and global wellbeing measures around -.40 to -.50. The relationship holds longitudinally, cross-culturally, and across different wellbeing operationalisations. The full picture of this dimension is covered in what is Neuroticism.

Trait stability across adulthood. The finding that Big Five traits are moderately stable across adulthood, and become more stable with age, has been replicated in longitudinal studies across multiple countries. Roberts and DelVecchio (2000) meta-analysed 152 longitudinal studies and found test-retest correlations increasing from approximately .31 in childhood to .54 in the college years, then reaching a plateau of about .74 between ages 50 and 70. Personality is not fixed, but it is not as malleable as popular accounts sometimes suggest. This is one of the most important findings to understand before reading five personality science myths that won't die.

Extraversion and positive affect. The association between Extraversion (Presence) and positive emotionality is highly replicable and appears in both self-report and ecological momentary assessment studies. Extraversion seems to reflect, in part, a biological sensitivity to reward cues that manifests as a tendency to experience more frequent and more intense positive emotions in social contexts.

Openness, creativity, intelligence, and aesthetic engagement. The link between Openness to Experience (Vision) and outcomes in creative domains (artistic production, divergent thinking, cultural consumption) is consistently replicated. Its relationship with crystallised intelligence is moderate and robust.


Which Personality Science Claims Have a Weaker Replication Record

Not all personality science findings have weathered replication equally well.

Specific personality × outcome interactions. While main effects of Big Five traits on broad outcomes are robust, claims about specific moderating interactions — that Conscientiousness predicts performance only under certain leadership conditions, that Agreeableness matters more for team performance in high-interdependence roles — have a weaker replication record. These interaction effects are often based on smaller samples, involve more researcher degrees of freedom in analysis, and tend to shrink substantially in independent replications.
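A rough power calculation shows why small samples are a problem for effects of this kind. The sketch below uses the standard Fisher z approximation for the sample size needed to detect a correlation in a two-sided test. The function name required_n is hypothetical, and the effect sizes of 0.10 and 0.05 are stand-ins for a small effect, not values taken from the literature cited here.

```python
import math
from statistics import NormalDist


def required_n(r: float, alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate N needed to detect a correlation r (two-sided test, Fisher z)."""
    if not 0.0 < abs(r) < 1.0:
        raise ValueError("r must be non-zero and strictly between -1 and 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for the test
    z_power = NormalDist().inv_cdf(power)          # quantile for the desired power
    return math.ceil(((z_alpha + z_power) / math.atanh(abs(r))) ** 2 + 3)


# 0.22 echoes the main effect quoted in this article; 0.10 and 0.05 are
# illustrative assumptions for a much smaller interaction-sized effect.
for effect in (0.22, 0.10, 0.05):
    print(f"r={effect:.2f}: about {required_n(effect)} participants for 80% power")
```

Under these assumptions a main effect of r = 0.22 needs fewer than 200 participants for 80 percent power, while an effect of 0.05 needs more than 3,000. A study sized for the main effect is badly underpowered for a small moderator, and underpowered findings that do reach significance tend to overestimate the effect.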

Personality change interventions. Studies claiming that targeted interventions can meaningfully shift Big Five trait levels — and that these shifts persist over time — have shown mixed replication results. The basic finding that personality can change is robust; the evidence for reliable, targeted, lasting change via specific interventions is less so. The field needs larger pre-registered trials before strong claims about personality change are warranted.

Type-based interpretations. Attempts to derive meaningful personality "types" from continuous Big Five scores — the claim that there are distinct clusters of people with meaningfully different profiles — have shown poor replication. A widely cited 2018 paper by Gerlach et al. claiming to identify four robust personality types was quickly followed by independent analyses showing that the type structure was highly sensitive to methodological choices. Continuous trait scores replicate; discrete types do not. This is one reason Cèrcol avoids type-based framing.


What Teams Should Trust — and What to Treat with Caution

| Finding | Replication status | Confidence level |
| --- | --- | --- |
| Conscientiousness → job performance | Highly replicated | High: use as a reference anchor |
| Neuroticism → lower wellbeing | Highly replicated | High: consistent across cultures and instruments |
| Trait stability across adulthood | Highly replicated | High: within-person change is real but slow |
| Extraversion → positive affect | Highly replicated | High: robust in experience sampling and lab |
| Openness → creativity | Well replicated | Moderate-high: effect sizes vary by domain |
| Specific trait × outcome interactions | Mixed | Low: treat with caution; seek large-N evidence |
| Personality change interventions | Mixed | Low-moderate: promising but not yet established |
| Personality types from Big Five | Poorly replicated | Low: avoid binary type assignments |

The practical implication for anyone using personality data is to apply it at the level of broad trait tendencies, not fine-grained predictions. The research on Conscientiousness and job performance gives you grounds to expect that someone with high Discipline scores will, on average and over time, show greater dependability and follow-through than someone with low scores. It does not give you grounds to predict what they will do in a specific situation, how they will respond to a particular manager, or whether they will succeed in a role with unusual demands. For a fuller account of these limits, see what personality science cannot predict.

For Cèrcol, this means building interpretive frameworks at the level where the evidence is strongest, and being explicit about uncertainty where the evidence is weaker. The science page at cercol.team/science sets out the evidence base in detail.


How Pre-Registration Is Improving Personality Science Credibility

The replication crisis has prompted a shift in research practices. Pre-registration — committing to hypotheses, measures, and analytic strategy before data collection — prevents the undisclosed flexibility that inflates false-positive rates. Large collaborative studies aggregate data across many labs to produce effect-size estimates robust enough to generalise. Adversarial collaborations pit researchers with opposing views against each other in joint studies designed to adjudicate between them.
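The arithmetic behind that inflation is simple. As a minimal sketch, assume a researcher can try k analyses of an effect that is truly zero, each independent of the others and each tested at a significance threshold alpha; the chance that at least one comes out significant is 1 - (1 - alpha)^k. Real analytic choices are correlated rather than independent, so this is an upper-end illustration and not a model of any particular study. The function name is hypothetical.

```python
def false_positive_rate(k: int, alpha: float = 0.05) -> float:
    """Chance of at least one significant result across k independent null tests."""
    if k < 1:
        raise ValueError("k must be at least 1")
    if not 0.0 < alpha < 1.0:
        raise ValueError("alpha must be strictly between 0 and 1")
    return 1.0 - (1.0 - alpha) ** k


# k = 1 corresponds to a single pre-registered analysis; larger k stands for
# undisclosed alternatives (different outcomes, exclusions, covariates).
for k in (1, 5, 20):
    print(f"{k:>2} analyses: {false_positive_rate(k):.1%} chance of a false positive")
```

With a single pre-specified test the false-positive rate stays at the nominal 5 percent. With five undisclosed alternatives it rises above 20 percent under these assumptions, which is the flexibility that pre-registration removes.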

These practices are already improving the quality of the personality science literature. Findings that survive pre-registered replication with large N are substantially more credible than findings that have only been demonstrated in single-lab studies. As the field matures, the signal-to-noise ratio will improve — and with it, the confidence practitioners can place in personality data. For a review of persisting misconceptions, see five personality science myths that won't die.


Test the science yourself with Cèrcol

The Big Five findings that have replicated most robustly — Conscientiousness and performance, Neuroticism and wellbeing, trait stability — are exactly the findings that personality assessments should be grounded in. That is the standard Cèrcol holds itself to: only the dimensions and relationships with strong replication records are used to generate insights, and the science page documents the supporting evidence transparently.

If you want to see what replicated personality science looks like in practice, Cèrcol is free at cercol.team. The assessment uses public-domain IPIP items, scores the five dimensions whose validity evidence survived the replication crisis, and gives you both self-report and peer perspectives — because two independent signals are more reliable than one.


Further reading: Critiques of the Big Five: what the critics say · The science behind Cèrcol
