Innovative and Non-Objective Items in e-Assessment – January 2013
by John Winkley, AlphaPlus Consultancy
This piece is about the use of innovative and non-objective questions in high-stakes testing. I spent many years working on this subject from a software technology perspective; more recently I’ve been working on ensuring that such items are fair and valid from an assessment performance perspective. This has given me the chance to reflect on their use.
A few years ago, I undertook a quick study of e-assessment in use in the UK and concluded that most (probably 90%+) of the questions presented to candidates on screen were multiple choice questions (MCQs) – I don’t think this has changed greatly, although the proportion of more innovative items in use has probably grown a little. I would argue that examinations for school pupils in the UK have always prioritised validity and authenticity over reliability (fairness in terms of variation from one exam or marker to the next). MCQs deliver very high reliability, but are often viewed in the UK as offering low validity (fitness for purpose), particularly for the assessment of higher level skills and knowledge. It therefore seems reasonable to expect that the UK’s e-assessment community would be a hotbed of innovation in more sophisticated item types, and to an extent it is, although the impact of this innovation on school exams has been limited so far, probably due in part to some of the limitations and challenges of innovative question types which I discuss at the end of this piece.
Firstly, I should say what I mean by innovative and non-objective question types – I don’t think there’s a widely accepted definition. Objective testing means testing where the answers to questions can be marked without any degree of human skilled judgement, and thus yield to computer marking. Such items tend to be closed (offering only a limited range of options for the candidate to choose from) or require very short answers, such as a single word or number, where the range of possible correct answers is manageable for automatic marking. It should be noted though that advances in technology blur the boundary between what is computer-markable and what isn’t.
For the purposes of this article, I’m including:
- Items which contain rich media (video, audio, etc), even though in some cases the actual questions are objective and/or
- Items which require the candidate to interact with the computer in such a way that although in theory the range of possible answers is constrained, the range is so large (often due to combinatorial explosion) that it does not yield to simple automatic marking and/or
- Items which require the candidate to present complex or longer constructed responses (essays, explanations, methods in mathematics, manipulating complex systems, etc.)
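To give a feel for the combinatorial explosion mentioned in the second bullet above, here is a minimal sketch in Python. The two item types (a drag-and-drop ordering task and a label-matching task) and the function names are illustrative assumptions, not taken from any particular test.

```python
from math import factorial


def ordering_responses(n_items: int) -> int:
    """Distinct ways a candidate could arrange n items in a sequence."""
    return factorial(n_items)


def matching_responses(n_labels: int, n_slots: int) -> int:
    """Distinct ways to drop each label onto one of n_slots targets.

    Assumes (hypothetically) that a slot may receive more than one label.
    """
    return n_slots ** n_labels


if __name__ == "__main__":
    # A four-option MCQ has only 4 possible responses to mark.
    print("4-option MCQ:", 4)
    for n in (4, 8, 12):
        print(f"ordering {n} items:", ordering_responses(n))
    print("matching 6 labels to 6 slots:", matching_responses(6, 6))
```

Even a modest ordering item with 8 elements has 40,320 possible responses against 4 for a four-option MCQ, which is why a simple lookup table of right and wrong answers stops being practical.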
For clarity, I’m not planning to discuss the following:
- Use of rich media and interactivity in testing for the purposes of visual appeal, engagement or embellishment. Good design is important in on-screen testing as for other man-machine interfaces (arguably more so), but this is well covered elsewhere (I recommend reading Jakob Nielsen’s Alertbox for a well-regarded and comprehensive guide to on-screen usability). Most on-screen exam systems favour minimalist approaches where User Interface design is concerned.
- Use of adaptive testing, where the assessment content is varied item-by-item based on the candidate’s responses (get more questions correct, computer presents harder questions, etc.). This technique delivers extraordinary benefits in some settings but I’ve discussed it in a previous post, and in any case it largely sits within the arena of objective testing as it relies on computer marking in order to make judgements in real time.
- Use of sophisticated marking to computer mark questions which would historically have been regarded as non-objective, commonly essay questions. This is an area of substantial innovation and is widely covered in research. There are some examples of successful use, although relatively few are in use in the UK. Even if it were possible to mark essays automatically, there are questions about public perception and acceptability which are largely unexplored. Developments in expert systems which support faster and more accurate human judgement are also appearing – again, I’m not discussing these here, although I’m tempted to return to them in a future blog if it’s of interest.
By way of illustration, here are some examples of innovative non-objective items which I think work well. They are drawn entirely from my own experience, and I don’t claim they are particularly representative of the breadth of innovation out there – they’re just examples I’m familiar with.
- AAT accountancy examinations. AAT use interactive items to make examination questions feel more like authentic accountancy tasks while retaining the benefits of automatic marking.
- The Teaching Agency’s Qualified Teacher Status Tests. Teachers are required to pass tests in English (and Mathematics) as part of qualifying to teach. The English tests use audio to test spelling, and require candidates to correct passages of text as tests of punctuation and grammar.
- Driving Theory test Hazard Perception. This assessment forms part of the Driving Theory Test. Candidates watch a video filmed from the driver’s seat of a moving vehicle and press a button when they spot a hazard. The assessment measures the accuracy and speed of spotting the hazard and is computer marked.
- World Class Tests. These assessments use innovative on-screen assessments to measure mathematics and problem solving skills, enabling relatively sophisticated assessments of both method and outcome to be computer marked.
- Primum Medical Simulations test would-be doctors’ ability to treat patients in a practical setting, competencies that were previously examined at the bedside.
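As a rough illustration of how a timing-based item like the Hazard Perception example above can be computer marked, here is a minimal Python sketch. It is not the real marking scheme: the scoring window, the maximum score of 5, the equal-width bands and the function name are all assumptions made for the example.

```python
def score_hazard_response(click_times, window_start, window_end, max_score=5):
    """Score one video clip from the candidate's button-press times (seconds).

    Assumed rule: the earliest press inside the scoring window counts, and
    the window is split into max_score equal bands, with earlier bands
    earning more marks. Presses outside the window score nothing.
    """
    in_window = [t for t in click_times if window_start <= t < window_end]
    if not in_window:
        return 0
    first = min(in_window)
    band_length = (window_end - window_start) / max_score
    band = int((first - window_start) // band_length)
    band = min(band, max_score - 1)  # guard against floating-point edge cases
    return max_score - band


if __name__ == "__main__":
    # Hypothetical clip: the hazard develops between 10s and 15s.
    print(score_hazard_response([10.4], 10, 15))       # early press
    print(score_hazard_response([3.0, 12.5], 10, 15))  # one stray press, one mid-window
    print(score_hazard_response([15.0], 10, 15))       # too late
```

The point of the sketch is that both accuracy (was the press inside the window?) and speed (how early in the window?) reduce to simple rules, so no human judgement is needed at marking time.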
If you’ve got some examples, post them in response to this blog post, as I’d be very interested to take a look – ingenuity in the field of e-assessment can be amazing!
In deciding whether and where such item types might be useful, I’ve found it helpful to have a simple taxonomy of potential benefits and disadvantages to consider.
First the benefits (and speaking from an assessment design perspective these are very compelling, I think):
- Innovative non-objective items can improve authenticity/face validity in the way particular parts of curricula are assessed
- Innovative non-objective items can improve content validity, allowing more of the curriculum to be assessed
- Innovative non-objective items can reduce atomicity in testing – allowing combinations of skills, problem solving etc. to be assessed.
- Innovative items can help to make on-screen tests look (and perform) more like paper tests. This might sound like a pointless exercise, but statistically provable comparability between modes of assessment can be very important in some settings.
And now the potential disadvantages. I’m very keen on innovation, but these are substantial and should not be taken lightly:
- Almost all innovative non-objective items have increased costs of software production in comparison to (a) multiple choice questions and (b) questions where candidates produce an answer (usually typed, maybe drawn with the mouse) for human marking. Some companies (notably BTL in the UK www.btl.com – I declare an interest, I used to be a director there and my company has several projects in collaboration with them) have made good progress in reducing production costs for an increasingly wide range of innovative item types but even so the costs are likely to be hard to estimate until well into the programme and the producers are likely to need higher levels of skill and more training.
- Innovative non-objective items tend to present challenges in interpreting results from usage for test performance analysis. Having built the items, interpreting how they are performing is much harder than for traditional items – how do candidates approach the question? What options do they try before selecting their final response? How do we know they’ve understood what’s expected of them? How much longer do these questions take to answer and is the payoff in assessment quality terms worth it? Etc. The questions go on and on, and usually require custom analysis to produce convincing answers.
- Some items can present usability challenges and be inaccessible for candidates with special access requirements (and sometimes for those who would normally describe themselves as computer literate and able bodied). Questions which make use of the computer’s power (audio, video, interaction, etc.) all require equipment at the test centre to function well and be usable by all, including coping with personal preferences. In some cases, relatively small inflexibilities can render tests very uncomfortable or even unusable for many. This is hard to spot in testing as trial cohorts are rarely large enough to spot every type of access problem. Fortunately most e-assessment system suppliers can advise on “case law” as to what works well elsewhere.
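On the second point above (interpreting how items are performing), a first step is often just to put facility and time taken side by side for each item. The Python sketch below does this under stated assumptions: the response log layout, the item identifiers and the function name are hypothetical, and a real analysis would go well beyond these two figures.

```python
from statistics import mean, median

# Hypothetical response log rows:
# (item_id, item_type, mark_awarded, max_mark, seconds_taken)
responses = [
    ("Q1", "mcq", 1, 1, 30),
    ("Q1", "mcq", 0, 1, 40),
    ("Q1", "mcq", 1, 1, 50),
    ("Q1", "mcq", 1, 1, 60),
    ("Q2", "interactive", 2, 4, 120),
    ("Q2", "interactive", 4, 4, 180),
]


def summarise_items(log):
    """Return facility (mean fraction of marks gained) and median time per item."""
    by_item = {}
    for item_id, item_type, mark, max_mark, seconds in log:
        entry = by_item.setdefault(
            item_id, {"type": item_type, "fractions": [], "seconds": []}
        )
        entry["fractions"].append(mark / max_mark)
        entry["seconds"].append(seconds)
    return {
        item_id: {
            "type": entry["type"],
            "facility": mean(entry["fractions"]),
            "median_seconds": median(entry["seconds"]),
        }
        for item_id, entry in by_item.items()
    }


if __name__ == "__main__":
    for item_id, stats in summarise_items(responses).items():
        print(item_id, stats)
```

A summary like this only starts to answer the "how much longer do these questions take" question; questions about how candidates approached the item, or what they tried first, need richer interaction logs than this layout assumes.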
Thanks for reading – hope you found it interesting. If you’d like to read more, I’d commend JISC’s report on advanced e-assessment techniques to you. And if you have examples of your own that you’d like to share, please do – I look forward to reading your comments.
 Practice assessments are available to view at http://aat-interactive.org.uk/aat/practice_assessments/
 Information about how the items work is presented in this YouTube video: http://www.youtube.com/watch?v=AHS9vQlew_M
 JISC did an excellent review of this which is probably the easiest way to review it: http://www.jisc.ac.uk/media/documents/projects/higherorderskills.pdf