Reliability relates to the generalizability, consistency, and stability of a test.
  1. Test-retest reliability - Do the scores from two administrations of the test (usually about 2 weeks apart) correlate highly?
  2. Split-half reliability - Do the scores from two halves of a test correlate?
  3. Interscorer (interrater) reliability - Two examiners score the same set of tests. Do the scores from two examiners correlate?
  4. Intrascorer (intrarater) reliability - An examiner scores a set of tests, then scores them again later. Do the scores from time 1 and time 2 correlate?

Accounting for Test Error

One reason for obtaining a reliability coefficient is to estimate the amount of measurement error in a score, based on either test-retest or split-half reliability. Any two administrations of a test frequently produce different scores. How can we take that into account when we test a client only once? There are two ways:
  1. Estimate the client's true score. The estimated true score (ETS) equals the group mean plus the product of the reliability coefficient and the difference between the obtained score and the group mean.

    ETS = M + [reliability * (obtained score - M)]

    Mean (deviation quotient) = 100
    Obtained score = 75
    Reliability coefficient = .90

    ETS = 100 + [.90 * (75 - 100)] = 77.5

  2. Use the Standard Error of Measurement (SEM), which is 1 standard deviation of error, to create a confidence interval. For a given probability, a confidence interval is the range within which the client's true score is expected to fall. One SEM above and below the estimated true score gives the 68% confidence interval. That is, you can be 68% confident that the client's true score falls within that interval.

    For a 90% confidence interval, multiply the SEM by 1.64, then add that value to and subtract it from the estimated true score.

    For a 95% confidence interval, multiply the SEM by 1.96, then add that value to and subtract it from the estimated true score.


    Estimated true score = 77.5
    Standard Error of Measurement = 3

    68% confidence interval = 77.5 +/- 3    = 74.5 to 80.5
    90% confidence interval = 77.5 +/- 4.92 = 72.58 to 82.42
    95% confidence interval = 77.5 +/- 5.88 = 71.62 to 83.38
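
The arithmetic above can be checked with a short script. This is a minimal sketch, not a standard implementation: the function names are hypothetical, and the multipliers (1.00, 1.64, 1.96) are the ones used in these notes.

```python
def estimated_true_score(obtained, mean, reliability):
    """ETS = M + reliability * (obtained score - M)."""
    return mean + reliability * (obtained - mean)


def confidence_interval(center, sem, multiplier):
    """Return (lower, upper) bounds: center +/- multiplier * SEM."""
    margin = multiplier * sem
    return (center - margin, center + margin)


# Values from the worked example in the notes.
ets = estimated_true_score(obtained=75, mean=100, reliability=0.90)
sem = 3

# Multipliers as given in the notes for each confidence level.
multipliers = {68: 1.00, 90: 1.64, 95: 1.96}
intervals = {
    level: confidence_interval(ets, sem, m) for level, m in multipliers.items()
}

for level, (low, high) in intervals.items():
    print(f"{level}% confidence interval: {low:.2f} to {high:.2f}")
```

Note that the interval is centered on the estimated true score (77.5), not on the obtained score (75).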


    • estimate portion of variance that is error variance
    • degree of consistency or agreement between two independently derived sets of scores
    • stated as a correlation coefficient, which can range from -1.0 to +1.0
    • Example

    Pearson's Product-Moment Correlation Coefficient

    • takes into account each person's position in the group and the amount of deviation from the group mean
    • significance depends on the size of the sample
    • with 10 cases, r = .40 is not significant
    • with 100 cases, r = .40 is significant
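
The effect of sample size on significance can be illustrated with a rough check. This sketch assumes the Fisher z approximation and a two-tailed .05 level, neither of which is specified in the notes; an exact t-test on r would give slightly different p-values but the same conclusion for these two cases. The function name is hypothetical.

```python
import math
from statistics import NormalDist


def approx_p_value(r, n):
    """Approximate two-tailed p-value for H0: no correlation.

    Uses the Fisher z transformation: atanh(r) * sqrt(n - 3) is treated
    as a standard normal deviate. This is an approximation, not the
    exact t-test.
    """
    z = math.atanh(r) * math.sqrt(n - 3)
    return 2 * (1 - NormalDist().cdf(abs(z)))


p_small = approx_p_value(0.40, 10)    # 10 cases
p_large = approx_p_value(0.40, 100)   # 100 cases

for n, p in ((10, p_small), (100, p_large)):
    verdict = "significant" if p < 0.05 else "not significant"
    print(f"n = {n}: r = .40 is {verdict} at the .05 level")
```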

    Test-retest Reliability

    • repeat identical test on a second occasion
    • correlation between scores obtained by same person
    • error variance corresponds to random fluctuations in performance
    • e.g., broken pencil, illness, fatigue...
    • must state interval, as r decreases with time
    • practice effects

    Alternate-Form Reliability:

    • to avoid problems with test-retest
    • use of comparable forms
    • measures "temporal stability"
    • also measures consistency of response to different item samples
    • concept of "item sampling"
      lucky break versus hard test: to what extent do scores on the test depend on factors specific to the selection of items?
    • short interval = measure of relationship between forms
    • long interval = measure of test-retest and alternate forms
    • very time consuming and work intensive

    Split-Half Reliability

    • single administration of test - split in half
      1. randomly assign items to each half
      2. odds versus evens
      3. split on content and difficulty
    • two scores for each person
    • measure of consistency of content sampling
    • Multi-step process:
      1. divide in halves
      2. compute Pearson r
      3. adjust using the Spearman-Brown formula
        (allows you to estimate reliability if you shorten or lengthen the test)
        ex: half-test r = .718; whole-test r = .836
    • can also use Kuder-Richardson Formula
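
The multi-step process above can be sketched as follows. This is an illustrative sketch under stated assumptions: the function names and the small 0/1 item-score table are hypothetical, and an odds-versus-evens split is used. The general Spearman-Brown form with a length factor is shown, where a factor of 2 corresponds to going from a half test to the whole test.

```python
import math


def pearson_r(x, y):
    """Pearson product-moment correlation between two score lists."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def spearman_brown(r, length_factor=2.0):
    """Estimate reliability when test length is multiplied by length_factor."""
    return length_factor * r / (1 + (length_factor - 1) * r)


def split_half_reliability(item_scores):
    """Odd/even split, Pearson r between halves, then Spearman-Brown."""
    odd_half = [sum(row[0::2]) for row in item_scores]   # items 1, 3, 5, ...
    even_half = [sum(row[1::2]) for row in item_scores]  # items 2, 4, 6, ...
    return spearman_brown(pearson_r(odd_half, even_half))


# Example from the notes: half-test r = .718
whole_test_r = spearman_brown(0.718)
print(f"whole-test r = {whole_test_r:.3f}")

# Hypothetical data: 5 people x 6 items, scored 1 (correct) or 0 (incorrect).
scores = [
    [1, 1, 1, 1, 1, 0],
    [1, 0, 1, 1, 0, 0],
    [0, 0, 1, 0, 0, 0],
    [1, 1, 1, 1, 1, 1],
    [0, 1, 0, 0, 0, 0],
]
estimate = split_half_reliability(scores)
print(f"split-half reliability (hypothetical data) = {estimate:.3f}")
```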

    Scorer/Inter-Rater Reliability

    • measure of examiner variance
    • objective versus subjective measures
    • high degree of judgement = high chance of variance
    • measure degree of consistency between two or three examiners
    • .80 or better is good
    • Example
    • For nominal scales - use kappa
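
For nominal ratings, agreement between two examiners can be sketched as below. The notes say only "kappa"; this sketch assumes Cohen's kappa for exactly two raters, and the function name and pass/fail ratings are hypothetical.

```python
from collections import Counter


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance.

    kappa = (observed agreement - expected agreement) / (1 - expected agreement)
    Undefined (division by zero) if both raters use one identical category
    for every case.
    """
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must score the same non-empty set of cases")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement: product of each rater's marginal proportions per category.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    return (observed - expected) / (1 - expected)


# Hypothetical pass/fail judgements by two examiners on the same 10 tests.
examiner_1 = ["pass", "pass", "fail", "pass", "fail",
              "fail", "pass", "pass", "fail", "pass"]
examiner_2 = ["pass", "pass", "fail", "fail", "fail",
              "fail", "pass", "pass", "pass", "pass"]

kappa = cohens_kappa(examiner_1, examiner_2)
print(f"kappa = {kappa:.2f}")
```

Here the examiners agree on 8 of 10 cases, but kappa is lower than .80 because some of that agreement would be expected by chance alone.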