The hidden (un)reliability of school exams

The annual furore over school league tables arrived recently. League tables indicate the proportion of pupils in a school who attain five or more good GCSE exam results, including English and maths. Though this year the main controversy was over which exams to include in the league tables and which to exclude – resulting in prestigious schools such as Eton and Harrow being left at the bottom – it again raises the question of how accurate and meaningful such results are.

Challenges to the validity of league tables typically centre on their being too narrow a measure of school performance; surely schools are not there solely to produce children with five or more GCSEs? Yet given the focus on exam attainment, a crucial question is consistently overlooked – how accurate are the exam results such tables are based on?

Here’s a thought experiment for you. Imagine taking an object such as a table and asking 20 people to measure how long it is to the nearest millimetre. What would you expect to see? Would each of the 20 people come up with the same result or would there be some variation?

When I’ve asked people to consider this or even do it, there is always some variation in the measurements. Judgements tend to cluster together – presumably indicating the approximate length of the table – but they are not exactly the same. There’s error in any measurement, as captured in the saying ‘measure twice, cut once’.
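The pattern the thought experiment predicts – readings clustering around the true length but never agreeing exactly – is easy to simulate. The figures below (a 1200 mm table, errors of up to ±2 mm per person) are purely illustrative assumptions, not data from any real exercise.

```python
import random
import statistics

random.seed(42)

# Hypothetical set-up: 20 people measure a table whose true length
# is 1200 mm, each making a small random error of up to ±2 mm.
TRUE_LENGTH_MM = 1200
measurements = [TRUE_LENGTH_MM + random.uniform(-2, 2) for _ in range(20)]

mean = statistics.mean(measurements)      # clusters near the true length
spread = statistics.stdev(measurements)   # but individual readings vary

print(f"mean of 20 measurements: {mean:.1f} mm")
print(f"spread (standard deviation): {spread:.1f} mm")
```

Run it and the average lands close to 1200 mm while the individual readings scatter around it – error in every measurement, just as 'measure twice, cut once' warns.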

If there is error and inconsistency in measuring a physical object such as a table, what about when we measure aspects of knowledge and ability using exams or other forms of assessment? Such assessments work by asking a series of questions designed to tap into relevant knowledge and to show understanding and application of principles. From the responses given we make judgements about where respondents stand on the constructs we are interested in measuring; in the case of GCSEs the constructs reflect the amount of knowledge and understanding retained from instruction and learning.

This process necessarily involves a degree of judgement and inference. Markers have to evaluate the adequacy of responses given to questions and, given an exam only covers a limited area of any syllabus, from this make an inference about the overall level of attainment in the subject area.

Understanding the degree of error in any assessment is one of the cornerstones of psychometrics. Error is captured in the concept of 'reliability', which describes how accurate our assessment tools are. Effective use of tests should acknowledge error, make it explicit and treat test scores accordingly. Acknowledging error means that any test score should be treated as an indicator and not an absolute measure. This is a clear strength of the psychometric approach to measurement, which is open and honest about limitations in measurement technology, incorporating these limitations into any consideration of outcomes. Educational exams, however, consistently fail to publicly acknowledge error.
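In classical test theory, reliability translates directly into a standard error of measurement (SEM), which is what 'treating a score as an indicator' means in practice. The numbers below – an exam scored out of 100 with a standard deviation of 15 and a reliability of 0.90 – are assumed for illustration only.

```python
import math

def sem(sd: float, reliability: float) -> float:
    """Classical-test-theory standard error of measurement:
    SEM = SD * sqrt(1 - reliability)."""
    return sd * math.sqrt(1 - reliability)

# Hypothetical exam figures (illustration, not real GCSE statistics).
error = sem(sd=15, reliability=0.90)

# A reported score of 62 is then a band, not a point:
score = 62
low, high = score - 1.96 * error, score + 1.96 * error
print(f"SEM: {error:.2f} marks")
print(f"95% band around a score of {score}: {low:.1f} to {high:.1f}")
</n```

Even at a respectable reliability of 0.90, the 95% band spans roughly nine marks either side of the reported score – easily enough to move a candidate across a grade boundary.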

Though exam results do not come with an error warning, issues with their accuracy are regularly highlighted. Each year after exam results are announced, the media picks up on 'failings' in the exam system. Usually these centre on re-marking, where parents have appealed against their child's marks and a different grade has been awarded after the exam paper has been reviewed. Such stories clearly highlight the issue of reliability, but quickly disappear, only to be dusted off and resurface the following year. The exam system invariably takes these criticisms on the chin, wheels out the usual defences, then goes back to business as usual.

Returning to exam league tables, individual results contain error and so do groups of results. The degree of error varies with the size of the group – that is, the number of children in a school – but this is another factor conveniently ignored in league tables. As consumers of league tables we are misled. Regardless of our views on whether the emphasis on exam results over other factors in assessing school performance is appropriate, until league tables acknowledge error in exam results they cannot be fit for purpose.
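The group-size point follows from a standard statistical fact: the error in a group average shrinks with the square root of the group's size, so a small school's headline figure is much noisier than a large school's. The pupil-level standard deviation of 15 points used below is an assumed figure for illustration.

```python
import math

def se_of_mean(individual_sd: float, n: int) -> float:
    """Standard error of a group average: shrinks with sqrt(group size)."""
    return individual_sd / math.sqrt(n)

# Hypothetical pupil-level score SD of 15 points, cohorts of varying size.
for n in (30, 120, 480):
    print(f"cohort of {n:>3}: error in the school average is about "
          f"{se_of_mean(15, n):.2f} points")
```

A cohort of 30 carries roughly four times the error of a cohort of 480, yet league tables rank both on the same scale with no indication of that difference.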

What do you think?
