Walk through any personality test marketing page and you will encounter two words used as reassurances: reliable and valid. Vendors deploy them freely, with minimal explanation, as signals that their instrument is scientifically credible. The terms are almost never defined for the reader.
This matters because reliability and validity are not interchangeable, and neither is straightforward to evaluate in practice. An instrument can be highly reliable without being valid. A test can show face validity (it looks like it measures what it claims) while failing every rigorous statistical validity criterion. And the MBTI, the world's most commercially popular personality instrument, illustrates exactly how an instrument can score poorly on the very criteria its publishers invoke.
This article explains each concept plainly, describes how to recognise strong and weak evidence for each, and provides a practical evaluation framework for any personality instrument.
Reliability in Personality Testing: What It Means and What Scores to Demand
Reliability refers to the consistency of a measurement. A test is reliable if it produces the same — or closely similar — results under conditions where the underlying trait has not changed. There are two primary types.
Test-retest reliability
Test-retest reliability asks: if the same person takes the same test twice, a few weeks apart, how similar are the results? Scores can differ between administrations for two reasons: genuine change in the underlying trait, or measurement error. A reliable test minimises measurement error, so that score changes between administrations primarily reflect real change rather than noise.
The standard threshold for acceptable test-retest reliability is a correlation of approximately 0.70 or above over a two-to-four-week interval. Well-validated Big Five instruments typically achieve 0.80 or higher for domain-level scores. The MBTI performs worse on the result it actually reports, the four-letter type: studies have found that approximately 50 percent of respondents receive a different type classification when retested five weeks later, which is the statistical signature of high measurement error. This figure is a rate of changed classifications, not a correlation coefficient, so it should not be read on the same scale as the 0.70 threshold. See MBTI vs Big Five for the full comparison.
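To make the threshold concrete, here is a minimal sketch of how a test-retest coefficient can be computed: it is the Pearson correlation between the same people's scores at two administrations. The scores and variable names below are invented for illustration and do not come from any real instrument.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two equal-length lists with at least 2 scores")
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical Conscientiousness scores (1-5 scale) for six people,
# measured twice a few weeks apart.
time_1 = [3.2, 4.1, 2.5, 3.8, 4.6, 2.9]
time_2 = [3.4, 4.0, 2.7, 3.5, 4.7, 3.1]

r = pearson_r(time_1, time_2)
meets_threshold = r >= 0.70  # the 0.70 threshold discussed above
print(f"test-retest r = {r:.2f}, acceptable: {meets_threshold}")
```

Six respondents is far too few for a real reliability study; the point is only to show what the reported number is a correlation between.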
Internal consistency
Internal consistency reliability asks whether the items within a scale are measuring the same underlying construct. If a Conscientiousness scale contains items about organisation, diligence, and reliability, those items should correlate with each other — because they are all tapping the same underlying disposition. The standard statistic is Cronbach's alpha, where values above 0.70 are generally considered acceptable and above 0.80 are good.
Low internal consistency means items within a scale are measuring different things — which makes the total scale score difficult to interpret. A Conscientiousness score derived from items that barely correlate with each other is not a coherent measurement. For an explanation of how scale length interacts with internal consistency, see why 120 items is better than 10.
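As a rough illustration of what the statistic captures, the sketch below computes Cronbach's alpha from the standard formula, α = k/(k−1) × (1 − Σ item variances / variance of total scores), where k is the number of items. The response data are hypothetical and deliberately tiny.

```python
from statistics import variance

def cronbach_alpha(item_scores):
    """item_scores: one list per item, each holding every respondent's score."""
    k = len(item_scores)
    if k < 2:
        raise ValueError("alpha needs at least two items")
    # Each respondent's total across all items in the scale.
    totals = [sum(responses) for responses in zip(*item_scores)]
    sum_item_variances = sum(variance(item) for item in item_scores)
    return (k / (k - 1)) * (1 - sum_item_variances / variance(totals))

# Hypothetical responses (1-5) from five people to three items that move together.
coherent_items = [
    [4, 2, 5, 3, 1],
    [5, 2, 4, 3, 1],
    [4, 1, 5, 3, 2],
]
# Same respondents, but the third item barely tracks the other two.
incoherent_items = [
    [4, 2, 5, 3, 1],
    [5, 2, 4, 3, 1],
    [3, 5, 3, 1, 3],
]

alpha_coherent = cronbach_alpha(coherent_items)      # about 0.95
alpha_incoherent = cronbach_alpha(incoherent_items)  # about 0.39
print(f"coherent: {alpha_coherent:.2f}, incoherent: {alpha_incoherent:.2f}")
```

Swapping in one item that does not correlate with the others is enough to pull alpha well below the 0.70 threshold, which is the situation the paragraph above describes.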
Validity in Personality Testing: Four Types Every Buyer Should Understand
Validity addresses a different question: is the test actually measuring what it purports to measure? A test can be perfectly consistent (reliable) while measuring the wrong thing entirely. The main forms of validity evidence each address a different aspect of this question.
Convergent validity
Convergent validity asks whether the test correlates with other established measures of the same construct. A new Extraversion scale should correlate positively with existing validated Extraversion measures — because if both are measuring Extraversion, they should agree on who has more and less of it.
This sounds obvious but is surprisingly often neglected. Many proprietary instruments report no convergent validity data, which makes it impossible to evaluate whether they are measuring the same constructs as the academic literature. The IPIP item bank was built precisely to enable this kind of public comparison.
Criterion validity
Criterion validity — the most practically important form — asks whether the test predicts outcomes that the trait should theoretically predict. If a Conscientiousness measure is valid, it should predict job performance, academic achievement, and goal attainment, because Conscientiousness is the trait most consistently linked to these outcomes in the literature. If a test claims to measure Conscientiousness but shows no correlation with job performance, something is wrong with the claim.
Predictive validity is a specific subtype: does the test predict future outcomes? Concurrent validity asks whether the test correlates with outcomes assessed at the same time. Both matter, but predictive validity is the gold standard for instruments used in personnel selection. For the implications for hiring specifically, see personality testing in hiring: what is legal and what is ethical.
Discriminant validity
Discriminant validity asks whether the test correlates too highly with measures of different constructs. If a scale purporting to measure Agreeableness correlates as strongly with Conscientiousness as it does with other Agreeableness measures, it may not be measuring Agreeableness distinctly — the two scales may be measuring much the same thing, which means the information is partially redundant. Understanding what each Big Five facet uniquely measures helps here — see what is a facet in personality psychology.
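Convergent and discriminant validity are easiest to see side by side. The sketch below, using invented scores, checks that a hypothetical new Agreeableness scale correlates strongly with an established Agreeableness measure and more weakly with a Conscientiousness measure. The 0.50 cut-off is the convergent threshold from this article's summary table; the decision rule itself is a simplification, not a standard procedure.

```python
from math import sqrt

def pearson_r(x, y):
    """Pearson correlation between two equal-length lists of scores."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    ss_x = sum((a - mean_x) ** 2 for a in x)
    ss_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(ss_x * ss_y)

# Hypothetical scores for the same six people on three scales.
new_agreeableness = [3.1, 4.2, 2.4, 3.9, 4.5, 2.8]
established_agreeableness = [3.0, 4.4, 2.6, 3.7, 4.6, 2.5]
conscientiousness = [3.5, 3.6, 3.2, 2.8, 3.4, 3.0]

convergent_r = pearson_r(new_agreeableness, established_agreeableness)
discriminant_r = pearson_r(new_agreeableness, conscientiousness)

# Simplified rule: strong agreement with the same construct,
# clearly weaker association with a different one.
looks_distinct = convergent_r >= 0.50 and abs(discriminant_r) < convergent_r
print(f"convergent r = {convergent_r:.2f}, discriminant r = {discriminant_r:.2f}")
print(f"scale looks distinct: {looks_distinct}")
```

If the two correlations came out similar, the scale would be showing exactly the redundancy problem described above.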
Face validity vs statistical validity
Face validity is the appearance of measuring what a test claims. An item that reads "I am an organised person" has high face validity for Conscientiousness — it looks like it is measuring organisation. But face validity is not the same as statistical validity, and conflating them is one of the most common errors in personality test evaluation.
Many popular instruments have high face validity and modest to poor statistical validity. The content looks relevant; the predictions are weak. For a breakdown of which popular tests fall into this trap, see the best free personality tests for teams in 2026.
| Psychometric concept | What it measures | Good threshold | Big Five instruments | MBTI |
|---|---|---|---|---|
| Test-retest reliability | Consistency of scores over time | r ≥ 0.70 over 2–4 weeks | Typically 0.80–0.90 | ~50% of respondents change type at 5-week retest |
| Internal consistency (Cronbach's α) | Item coherence within a scale | α ≥ 0.70 | Typically 0.80–0.90 | Moderate; varies by scale |
| Convergent validity | Agreement with other measures of same trait | r ≥ 0.50 with established measure | Well-documented in peer review | Limited cross-instrument data published |
| Criterion validity | Prediction of real-world outcomes | Varies; d ≥ 0.20 considered meaningful | Conscientiousness predicts job performance robustly | Weak prediction of job performance |
| Discriminant validity | Independence from measures of different traits | Low r with conceptually distinct scales | Generally supported | Dimensions not clearly independent of each other |
Five Questions to Evaluate Any Personality Test Validity Claim
When a vendor or researcher claims that a personality instrument is "valid and reliable," the following questions produce a fast quality assessment.
Question 1: Is the validity evidence published in peer-reviewed journals? Proprietary technical reports, white papers, and website copy do not count. Peer review subjects validity claims to independent scrutiny. If the only validity evidence is the publisher's own documentation, that is a red flag. The broader implications for how personality science handles replication are addressed in personality science: the replication crisis.
Question 2: What is the test-retest reliability over an interval of several weeks? Two to six weeks is typical. If this number is not reported or is below 0.70, the measurement is noisy.
Question 3: What outcomes does the instrument predict? Criterion validity evidence should include real-world outcomes, not just correlations with other self-report measures. For work-relevant instruments, job performance is the key criterion.
Question 4: Have independent research groups replicated the validity findings? A single study by the instrument's own developers is insufficient. Replication by researchers with no commercial interest in the outcome is the meaningful standard.
Question 5: Is the scoring transparent? If the scoring algorithm is proprietary, the validity claims cannot be independently verified. Open-science instruments — including the IPIP on which Cèrcol is built — allow anyone to check the claims against the data. See personality testing: open source vs commercial for the full comparison.
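The five questions can be turned into a simple checklist. The sketch below is one hypothetical way to record the answers for an instrument and list which questions raise a red flag; the class and field names are invented for this example, and the only numeric rule is the 0.70 test-retest threshold from Question 2.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class ValidityEvidence:
    peer_reviewed: bool                  # Question 1
    test_retest_r: Optional[float]       # Question 2; None if not reported
    predicts_real_world_outcomes: bool   # Question 3
    independently_replicated: bool       # Question 4
    scoring_transparent: bool            # Question 5

def red_flags(evidence: ValidityEvidence) -> List[str]:
    """Return one message per question the instrument fails."""
    flags = []
    if not evidence.peer_reviewed:
        flags.append("Q1: no peer-reviewed validity evidence")
    if evidence.test_retest_r is None or evidence.test_retest_r < 0.70:
        flags.append("Q2: test-retest reliability missing or below 0.70")
    if not evidence.predicts_real_world_outcomes:
        flags.append("Q3: no criterion evidence against real-world outcomes")
    if not evidence.independently_replicated:
        flags.append("Q4: findings not independently replicated")
    if not evidence.scoring_transparent:
        flags.append("Q5: scoring cannot be independently verified")
    return flags

# Two hypothetical instruments: one well documented, one not.
well_documented = ValidityEvidence(True, 0.85, True, True, True)
undocumented = ValidityEvidence(False, None, False, False, False)

print(red_flags(well_documented))  # no flags
for flag in red_flags(undocumented):
    print(flag)
```

A checklist like this does not weigh the questions against each other; it only makes it obvious which claims a vendor has left unsupported.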
Why Peer Assessment Adds Validity That Self-Report Cannot Provide
One underappreciated source of validity in personality assessment is the use of observer ratings alongside self-report. Personality measured by people who know the subject — colleagues, managers, direct reports — typically shows higher criterion validity than self-report alone, particularly for predicting work performance.
This is because self-report is subject to impression management (consciously or unconsciously scoring oneself more favourably) and to limited self-knowledge (people are often unaware of how they come across to others). Observer ratings are not free of bias, but they are affected by different biases — which means combining self and observer data produces more accurate personality estimates than either alone. For the full argument, see why self-assessment alone isn't enough: peer personality feedback.
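The logic can be sketched numerically under a strong simplifying assumption: that self-report and observer ratings carry independent errors with known variances. Under that assumption, an inverse-variance-weighted average of the two has a smaller error variance than either source alone. Real self and observer biases are not fully independent, and the numbers below are invented, so treat this as an illustration of the reasoning rather than a description of how any particular instrument combines scores.

```python
def combine_ratings(self_score, observer_score, self_error_var, observer_error_var):
    """Inverse-variance-weighted average of two ratings of the same trait.

    Assumes the two error terms are independent (an idealisation).
    """
    w_self = 1.0 / self_error_var
    w_observer = 1.0 / observer_error_var
    return (w_self * self_score + w_observer * observer_score) / (w_self + w_observer)

def combined_error_variance(self_error_var, observer_error_var):
    """Error variance of the weighted average above, under the same assumption."""
    return 1.0 / (1.0 / self_error_var + 1.0 / observer_error_var)

# Hypothetical case: both sources are equally noisy (error variance 0.25).
estimate = combine_ratings(4.0, 3.0, 0.25, 0.25)
error_var = combined_error_variance(0.25, 0.25)
print(f"combined estimate = {estimate}, error variance = {error_var}")

# Hypothetical case: the observer rating is noisier than the self-report.
# The combined error variance is still below the smaller of the two.
print(combined_error_variance(0.25, 1.0))
```

The weighting matters: a plain unweighted average can be less accurate than the better source on its own when the other source is much noisier.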
Cèrcol's Witness model is designed around this principle. The history of the Big Five and the science page provide further context on the validity evidence underpinning Cèrcol's design choices.
"Reliability and validity are not marketing claims. They are specific statistical properties with established thresholds, measurable through standard methods, and verifiable through published data. An instrument that cannot provide peer-reviewed evidence for both should be evaluated with proportionate scepticism."
How Cèrcol meets the reliability and validity bar
Cèrcol's instrument is built on the IPIP item bank, the same public-domain items whose psychometric properties have been independently documented by Goldberg and colleagues across decades of published research. Domain-level test-retest reliability for IPIP-based Big Five scales typically sits above r = 0.80 over four-week intervals. Internal consistency (Cronbach's α) for the 20-item-per-dimension scales Cèrcol uses is consistently above 0.87.
Criterion validity is inherited from the broader Big Five literature: Conscientiousness (Discipline) predicts job performance across all major occupational categories (Barrick & Mount, 1991, doi: 10.1111/j.1744-6570.1991.tb00688.x). Neuroticism (Depth) predicts stress response and wellbeing outcomes. Openness (Vision) predicts creative performance.
The Witness peer assessment adds observer-rated scores on the same five dimensions using a forced-choice format that reduces social desirability bias — see social desirability bias in personality tests for the full methodology. Take the free assessment at cercol.team and review the full validity documentation at cercol.team/science.
Further reading: The history of the Big Five: from Allport to Goldberg · The science behind Cèrcol