CAGs rule OK
Good news!
So now we know. After all the anguish since the A-level results were announced on 13 August, teachers’ Centre Assessment Grades (CAGs) will be the grades awarded to students across all four UK home nations. It is the algorithm that is for the shredder, not the teachers’ grades.
This is good news.
I don’t have any robust numbers, but my expectation is that the great majority of students – close to all – will have been awarded the grades they deserve. There will be some lucky students, whose teachers deliberately ‘gamed the system’ and intentionally submitted over-the-top grades, and who have got away with it. But, despite all the blaming of ‘over-optimistic teachers’, I believe that the known over-bidding of CAGs, as compared to the now-discredited statistical standardisation model, is much more likely to be the result of nothing more than the need to round fractions to whole numbers, and of flaws in the statistical model itself – of which more shortly. So my guess is that the number of students who have benefited from their teachers’ lack of integrity is small.
There may be some teachers who acted conscientiously and with great diligence to follow the ‘rules’ – as far as they were specified – and who are now thinking, ‘Well, that will teach me to be honest! If I’d gamed the system, my students would all have been awarded higher grades.’ I understand that; but, in my view, teachers who behaved with integrity did the right thing.
Most unfortunate, though, are the teachers who followed the rules but, because there was no opportunity to submit and explain outliers, felt forced to constrain their submissions, down-grading able students. Regrettably, students caught in this trap cannot, I believe, appeal. They are the victims of this year’s vicious process, and I grieve for them. I do hope there are not too many of them, but ‘too many’ has no meaning to the individual who suffers the damage.
This year’s grades are the fairest ever
Overall, however, I believe that the prediction I made in my very first blog in this long saga has come true.
On 21 March 2020, the day after Gavin Williamson announced this year’s exams would be cancelled, I floated the possibility that the year’s results, being largely determined by teachers, could well be more reliable than those determined by exams, if only because exams are so unreliable – on average across all subjects, about 1 grade in 4 has been wrong for years. Since exams are only 75% reliable, that sets a very low bar. Even better was Ofqual’s announcement that teachers were being asked to submit student rank orders as well as the CAGs. There was then and there is now no doubt in my mind that a conscientious teacher will do a much better job at determining a fair rank order than the fuzzy-mark-lottery of a rank order determined by an exam.
So now we know that the teachers’ grades have ‘won’, I have absolutely no doubt that this year’s grades are the fairest ever, and certainly better than the 25%-of-grades-are-wrong that we have all suffered – largely unknowingly – for the last several years. And, yes, some people will have been lucky as a result of gaming, and – very sadly – some will have been unlucky as a result of being on the wrong side of a grade boundary. Overall, though, this year has to be better, much better, than 75% right, 25% wrong.
What was wrong with the statistical model?
But why has the statistical model been thrown away?
Now that the details of the model have been published, and much pored over by expert statisticians, there have already been many articles, blogs and news interviews identifying any number of reasons why [this] aspect of the algorithm is not the best that might have been used, or why [that] way of estimating [whatever] number suffers from [this particular] problem. These criticisms are all very likely to be valid, but this ‘microscopic’ view misses the big picture.
The model was doomed from the outset.
As is now evident, the objective of Ofqual’s method was to predict the distribution of grades for every subject in every school in England. They had some very clever people doing this, but the objective was foolhardy, and inevitably led to the disaster we have all experienced.
To attempt to build a model that makes accurate predictions, two conditions are absolutely necessary:
- It must be possible to verify success, after the event.
- If the method of prediction is complex, there must be a way to test that method by using known historic data to ‘predict’ known historic results.
To take a trivial example, your prediction of the winner of the 3:50 at Doncaster can be verified, or not, by 3:52; a more sophisticated example is weather forecasting, for which the success of the underlying complex models can be verified, or not, when you look out of the window and see the torrents of rain, just as was forecast the day before.
Furthermore, sophisticated weather forecasting models are tested using historic data. The ‘inputs’ to the model – air pressures, temperatures, and the rest – are known for any day in the past. These can then be entered into the model for a given historic day; the model can then be run to derive a ‘forecast’ of the following day; those ‘predictions’ can then be checked against the known weather for that day, and the model deemed good, or poor, accordingly.
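For anyone who likes to see the idea written down, here is a minimal sketch – in Python, with an entirely invented ‘model’ and invented data, nothing to do with any real forecasting system – of what such a historic back-test amounts to: run the model on known past inputs, compare its ‘forecasts’ with what actually happened, and count the hits.

# Illustrative only: a generic 'back-test' of a predictive model against
# historic data. The model, the inputs and the outcomes are all invented.

def backtest(model, history):
    """history: a list of (inputs, known_outcome) pairs from the past."""
    correct = 0
    for inputs, known_outcome in history:
        prediction = model(inputs)        # 'forecast' a day whose weather we already know
        if prediction == known_outcome:   # compare with what actually happened
            correct += 1
    return correct / len(history)         # fraction of 'forecasts' that were right

def toy_model(inputs):
    """A one-rule 'weather model': low pressure means rain."""
    return "rain" if inputs["pressure"] < 1000 else "dry"

# Toy example: the one-rule model tested against three known days.
toy_history = [
    ({"pressure": 995}, "rain"),
    ({"pressure": 1020}, "dry"),
    ({"pressure": 998}, "dry"),
]
print(f"Back-test accuracy: {backtest(toy_model, toy_history):.0%}")  # 67%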
How might this apply to the prediction of exams?
Suppose that the model predicts three grade As for A-level Geography at Loamshire Community College. How would you determine whether this prediction is right? Or wrong? There is no ‘winning post’ as at Doncaster. There is no ‘tomorrow’ when you could see the rain through the window. And, most importantly, there is no exam. There was, and is, no way to know, ever, that the prediction was, or was not, right. So why bother to try to predict?
That’s bad enough. But the problem with testing is even worse, as described on pages 80 and 81 of Ofqual’s report, and Tables E.8 and E.9 on pages 204, 205 and 206 (really – the model’s specification document runs to 318 pages).
Briefly, the algorithm’s designers developed 11 different ‘candidate’ predictive models, and, understandably, wished to choose the best, where ‘best’ is the model that predicted a known past most accurately.
The only ‘past’ that is available, and that they were obliged to choose, is the data set used to make the (very important) estimates of the reliability of exam-based grades, as shown in Figure 12 of Ofqual’s November 2018 report Marking Consistency Metrics – An update.
This Figure is well-known in the HEPI archives, and features in my blog dated 15 January 2019 as the key evidence that, on average, 1 exam-based grade in 4 is wrong. So the benchmark being used for testing the predictive algorithm is itself wobbly.
The results of testing their algorithms against this wobbly benchmark are reported in Table E.8 for A-level and Table E.9 for GCSE. So for example, their best model for A-level Economics predicted the ‘right answer’ with a probability of about 64%; for GCSE Maths, 74%; for A-level Physics, 59%; for GCSE History, about 55%.
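In essence, the exercise reported in those tables is a beauty contest: score each candidate model against the same benchmark data and keep the highest scorer. The sketch below – with hypothetical candidate models and a hypothetical benchmark; nothing here is Ofqual’s actual code, models or data – shows the shape of that calculation, and why the winner can be no better than the benchmark it is scored against.

# Illustrative sketch only: scoring several hypothetical candidate models
# against the same benchmark data and keeping the best scorer.

def predictive_accuracy(model, benchmark):
    """Fraction of benchmark cases where the model's grade matches the benchmark grade."""
    hits = sum(1 for features, benchmark_grade in benchmark
               if model(features) == benchmark_grade)
    return hits / len(benchmark)

def choose_best(candidates, benchmark):
    """candidates: dict of model name -> model function.
    Returns the name and score of the highest-scoring candidate. 'Best' here
    only means 'closest to the benchmark' - and the benchmark itself is wobbly."""
    scores = {name: predictive_accuracy(model, benchmark)
              for name, model in candidates.items()}
    best = max(scores, key=scores.get)
    return best, scores[best]

# Toy usage with two made-up 'models' and a tiny made-up benchmark:
benchmark = [({"prior": "B"}, "B"), ({"prior": "A"}, "A"), ({"prior": "C"}, "C")]
candidates = {
    "model_1": lambda f: f["prior"],   # predict the prior grade
    "model_2": lambda f: "B",          # always predict a B
}
print(choose_best(candidates, benchmark))  # ('model_1', 1.0)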
Here is a table that does not appear in the Ofqual report, but that I have compiled, drawing together the measurements of the reliability of exams (that’s the ‘Benchmark’ column), together with the measures of the accuracy of Ofqual’s best predictive model, for a variety of A-level and GCSE subjects:

[Table not reproduced here: exam reliability (‘Benchmark’) alongside the predictive accuracy of Ofqual’s best model, by A-level and GCSE subject; sources as noted below.]

1 Marking Consistency Metrics – An update, Figure 12
2 Ofqual model specification, Table E.8
3 Ofqual model specification, Table E.9
Look at those numbers.
How could Ofqual use a model that is only 68% accurate for A-level History, for example? And have the effrontery to use that to over-rule teacher assessments?
Their explanation is that this is not quite as bad as the 56% reliability of the exam! That is hardly a plausible excuse for History. But for Maths, which is associated with an exam accuracy of about 96%, the model achieves only 61% – which means that about 4 in every 10 of the ‘statistically moderated’ Maths A-level grades ‘awarded’ on 13 August were wrong!
No wonder the model has now been binned.
As a by-the-by, Tables E.8 and E.9 hold another secret. As well as showing the ‘predictive accuracy’ of the model for each subject, there is another column headed ‘accuracy within a grade’, this being the measure of the accuracy of the model in predicting the ‘right’ grade or one grade either way. These numbers, not surprisingly, are considerably higher than the ‘predictive accuracy’ of the ‘right’ grades alone – quite often 90% or more. Is that why so many important people, fearful of referring to the low numbers for ‘predictive accuracy’ of the ‘right’ grade, were so careful to use words such as ‘there is a high likelihood that the grade you will be awarded is the right grade, or one grade lower’?
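To make the distinction concrete, here is a small sketch, using invented grades, of how ‘exact-grade’ accuracy and ‘accuracy within a grade’ are calculated – and how easily the second flatters the first.

# Illustrative only: the difference between 'exact-grade' accuracy and
# 'accuracy within a grade' (the right grade, or one grade either side).
# The predicted and actual grades below are invented to show the calculation.

GRADE_ORDER = ["U", "E", "D", "C", "B", "A", "A*"]
POSITION = {grade: i for i, grade in enumerate(GRADE_ORDER)}

def exact_accuracy(predicted, actual):
    return sum(p == a for p, a in zip(predicted, actual)) / len(actual)

def within_one_grade_accuracy(predicted, actual):
    return sum(abs(POSITION[p] - POSITION[a]) <= 1
               for p, a in zip(predicted, actual)) / len(actual)

predicted = ["A", "B", "C", "B", "A*", "D"]
actual    = ["B", "B", "D", "A", "A*", "B"]
print(f"Exact grade:    {exact_accuracy(predicted, actual):.0%}")             # 33%
print(f"Within a grade: {within_one_grade_accuracy(predicted, actual):.0%}")  # 83%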
The model was fatally flawed at the start.
This is not hindsight, nor does reaching this conclusion require great statistical knowledge. It is, and was, obvious to anyone who might have given the matter a moment’s thought.
The demise of Ofqual’s model has not been a failure in statistics, in handling ‘big data’, or in constructing smart algorithms. It was a catastrophic failure in decision-making.
‘The fairest possible process’
All the more so in the context of a phrase that was widely used to excuse the use of this model, right up until the moment it was thrown away. How many times have you heard words such as ‘In these unprecedented times, this was the fairest possible process’?
No. Building a model attempting to predict the grade distribution of every subject in every school was not the fairest possible process.
A far better process would have been to build a very simple model to ‘sense-check’ each school’s submissions as being plausible and reasonable in the context of each school’s history – exactly as described as the second possibility in my ‘alternative history’ blog and as illustrated in the diagrams in my ‘Great CAG car crash’ blog, which was published on 12 August, the day before the A-level results were announced.
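By way of illustration only – this is my sketch of the idea, not anything Ofqual specified – such a sense-check could be as simple as comparing a school’s submitted grade distribution with its own recent history and flagging anything that looks implausible, as a prompt for a conversation rather than an automatic override. The tolerance and all the data below are invented.

# Illustrative only: a minimal 'sense-check' of a school's submitted grade
# distribution against its own recent history. The tolerance and all the
# data are invented for the sketch; a real check would be more careful.

def grade_shares(counts):
    """Convert grade counts into proportions of the cohort."""
    total = sum(counts.values())
    return {grade: n / total for grade, n in counts.items()}

def sense_check(submitted, history, tolerance=0.15):
    """Flag any grade whose submitted share differs from the school's
    average historical share by more than `tolerance` (an arbitrary choice)."""
    averages = {g: sum(grade_shares(year).get(g, 0) for year in history) / len(history)
                for g in submitted}
    shares = grade_shares(submitted)
    return {g: (shares[g], averages[g]) for g in submitted
            if abs(shares[g] - averages[g]) > tolerance}

# Toy example: three years of history, then this year's submission.
history = [
    {"A": 3, "B": 8, "C": 10, "D": 4},
    {"A": 2, "B": 9, "C": 11, "D": 3},
    {"A": 4, "B": 7, "C": 9,  "D": 5},
]
submitted = {"A": 9, "B": 7, "C": 7, "D": 2}
print(sense_check(submitted, history))
# Flags only the A grades (36% submitted vs ~12% historically) - a prompt for
# a conversation with the school, not an automatic override.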
And in this context, there is still much talk of ‘over-predictions’ by ‘over-optimistic’ teachers – a euphemism, I’m sure.
I think the jury is out on that.
My ‘Great CAG car crash’ blog tables the possibility that much might be explained by rounding errors, a theme picked up in a subsequent Guardian article. And since teachers are – I think unfairly – being blamed for being ‘over-optimistic’, I believe that it is important to establish the facts, as described in the ‘Great CAG car crash’.
I therefore believe the entire data set should be handed over to the Royal Statistical Society for careful forensic analysis. Now.
The end (nearly)
Well, we’re nearly at the end. Not quite, for there are many important matters to be resolved, from unscrambling the A-level mess to sorting out university and college entrance, from the political fallout to (I hope) radically reforming exams, assessment and the curriculum, and – very importantly – resolving the social disgrace of the ‘Forgotten Third’: the one-third of all GCSE English and Maths students who are condemned to ‘fail’ before they have even stepped into the exam room, this being a direct consequence of the ‘no grade inflation’ policy.
The immediate chapter, however, the chapter of awarding exam-free grades, is at an end.
Fair grades have, to a very great extent, been awarded – in my view the fairest ever.
Sense has prevailed over statistically obsessed mad scientists and deeply entrenched bureaucrats. Public pressure has won.
We are in a good place.
So let me close this series of blogs here. But before I ‘sign off’, it is my very great pleasure to thank, and acknowledge, the very many people with whom it has been such a pleasure to share ideas, to talk, to debate, to think. So many thanks to Huy, to Rob, to George, to Mike, to Michael, to Janet, to Craig, to Tania, to everyone who has contributed lively, engaging and oh-so-intelligent thoughts, ideas and comments. My thanks to you all. (Further thanks, of course, to Nick and Michael at HEPI, whose speed, efficiency, wisdom, constructive suggestions and especially patience in dealing with my requests ‘to change that inverted comma in line 17, please’ have been of immense value and benefit.) And, finally, all good fortune to those students for whom doors are now open, rather than slammed shut.
Dennis Sherwood runs Silver Bullet Machine, a boutique consultancy firm with many education clients.
This blog was initially published on the HEPI website here as part of a series of blogs on the 2020 summer exam series, including:
Trusting teachers is the best way to deliver this year’s exam results – and those in future years?, 21/03/2020
Have teachers been set up to fail?, 18/06/2020
Hindsight is a wonderful thing: Ofqual, gradings and appeals, 23/07/2020
Something important is missing from Ofqual’s 2020/21 Corporate Plan, 08/08/2020
The exams catastrophe: 16 questions that must still be answered, 02/09/2020