Adaptive Testing

What are adaptive tests?
I’ve been involved in developing adaptive tests on and off since around the turn of the century, so I thought it would be worth sharing a few thoughts on how these work in practice.
Adaptive testing (sometimes called “CAT – computerized adaptive testing”) involves modifying the assessment to take account of the candidate’s ability. In most cases, this involves choosing questions for candidates based on the score they got on previous questions in order to maximise the precision in measuring their level. As candidates do well, the test questions get harder; if they do less well, the questions get easier. At some point the candidate’s level settles out, and a test outcome is decided.
This approach introduces two obvious limitations: the questions asked must be objective (i.e. computer-markable) and the tests must be delivered on computer (paper-based adaptive tests are not really practical).
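To make the basic loop concrete, here is a deliberately simplified sketch (not the algorithm used in any of the tests discussed here): difficulty steps up after a correct answer and down after an incorrect one. The `item_bank` structure and `ask` callback are hypothetical placeholders.

```python
import random

def run_simple_adaptive_test(item_bank, ask, num_questions=10):
    """Toy sketch: step difficulty up after a correct answer, down after an
    incorrect one. item_bank maps difficulty level (1-5) -> list of items;
    ask(item) returns True/False for the candidate's response."""
    level = 3                        # start near the middle of the scale
    levels_seen = []
    for _ in range(num_questions):
        item = random.choice(item_bank[level])
        correct = ask(item)
        levels_seen.append(level)
        level = min(5, level + 1) if correct else max(1, level - 1)
    # a crude outcome: the level the candidate settled around at the end
    recent = levels_seen[-5:]
    return round(sum(recent) / len(recent))
```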
There are some significant benefits though:
- The time taken for an assessment to reach a judgement about a candidate’s level of ability can be significantly shortened, perhaps by as much as 50%.
- Candidates tend to be better motivated as their time in the assessment is not spent answering questions that are way too hard or much too easy.
- The test can potentially measure a very broad range of skills.
What scenarios are they best suited to?
One good example of this is a project I’ve been involved in recently – the Adult Skills for Life survey in England (the findings from this have recently been published[1]). This survey repeats a survey undertaken in 2003. It measures literacy, numeracy and ICT skills among 16-65 year olds in England. Interviewers visited around 7,250 individuals in their homes, conducting assessments which lasted around an hour including the demographic questionnaire.
The numeracy and literacy skills in the adult population are extremely varied: 1 in 20 struggles with numbers up to 10, while many operate at a much higher level: a quarter at GCSE grade C equivalent or above, with many shades in between. There is a similar range in literacy.
Measuring this range of skills without having a test that adapts would be impractical. The adaptivity also keeps down the test time – it takes around 20 minutes to make a safe judgement with adaptivity, and with each candidate taking two assessments plus the demographic questionnaire, an hour is as long as we could reasonably ask people to tolerate us in their homes!
Many of the assessment tools used in colleges and by learndirect to assess people’s literacy and numeracy are adaptive for similar reasons. You can download some of these adaptive tools and try them out for free: http://archive.excellencegateway.org.uk/page.aspx?o=toolslibrary (see Initial Assessment Tools). Some of the more sophisticated tools can provide not just an overall level but also provide profiles of sub-skills, identifying that someone is far stronger at reading than writing, for example.
How do adaptive tests work?
We’ve encountered two approaches which are rather different. The approach we’ve typically used in Skills for Life assessments presents a predefined block of questions and then routes candidates to one of a number of pre-defined subsequent blocks based on the candidate’s responses.
The candidate’s test result is determined by where they exit the test rather than their total score.
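In code, that block-routing approach amounts to little more than a lookup table over pre-defined blocks. The sketch below is purely illustrative – the block names, score bands and exit levels are invented, not taken from any actual Skills for Life assessment.

```python
# Hypothetical routing table: block -> list of (low score, high score, target).
# Candidates are scored on each block and routed accordingly; the result is
# determined by the block at which they exit, not by their total score.
ROUTES = {
    "entry":  [(0, 2, "easier"), (3, 5, "harder")],
    "easier": [(0, 2, "EXIT:Entry 1"), (3, 5, "EXIT:Entry 2")],
    "harder": [(0, 2, "EXIT:Entry 3"), (3, 5, "EXIT:Level 1")],
}

def route(block_scores):
    """block_scores: dict mapping block name -> number of correct answers."""
    block = "entry"
    while True:
        score = block_scores[block]
        for lo, hi, target in ROUTES[block]:
            if lo <= score <= hi:
                if target.startswith("EXIT:"):
                    return target[len("EXIT:"):]   # the exit point gives the result
                block = target
                break

print(route({"entry": 4, "harder": 2}))  # -> "Entry 3"
```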
The Americans have led much of the development on adaptive testing, and they take a different approach based on statistical modelling[2]. The high-stakes GMAT examination is adaptive, for example[3]. The approach tries to reach an acceptably precise estimate of ability by successive estimation after each question is answered. After each question, the algorithm searches for the optimal next item based on the candidate’s current estimated ability (where the optimal item will maximise the information about their ability level) – homing in on the result like a kind of computerised “pin the tail on the donkey”. This rapid “homing in” can shorten test time dramatically compared to standard tests, perhaps to as little as one third (e.g. a 1-hour paper test can be done on-screen in 20 minutes). The “homing in” stops when an acceptable level of precision, as measured by the standard error in the ability estimate, is met.
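A textbook version of that statistical approach (assumed here for illustration – it is not the GMAT’s actual algorithm) uses a 2-parameter logistic IRT model: re-estimate ability θ after each response, pick the unused item with the greatest Fisher information at the current θ, and stop once the standard error of the estimate is acceptably small. The item parameters and `answer` callback below are hypothetical.

```python
import math

def p_correct(theta, a, b):
    """2PL model: probability of a correct response given ability theta,
    discrimination a and difficulty b."""
    return 1.0 / (1.0 + math.exp(-a * (theta - b)))

def information(theta, a, b):
    """Fisher information the item provides at ability theta."""
    p = p_correct(theta, a, b)
    return a * a * p * (1.0 - p)

def estimate_theta(responses):
    """Crude maximum-likelihood estimate of theta over a coarse grid.
    responses: list of (a, b, correct) tuples."""
    grid = [x / 10.0 for x in range(-40, 41)]
    def log_lik(theta):
        ll = 0.0
        for a, b, correct in responses:
            p = p_correct(theta, a, b)
            ll += math.log(p if correct else 1.0 - p)
        return ll
    return max(grid, key=log_lik)

def run_cat(item_bank, answer, se_target=0.3, max_items=30):
    """item_bank: list of (a, b) item parameters; answer(a, b) returns True/False.
    Stops when the standard error (1 / sqrt(total information)) is small enough."""
    theta, responses, used = 0.0, [], set()
    for _ in range(max_items):
        # pick the unused item giving the most information at the current estimate
        idx = max((i for i in range(len(item_bank)) if i not in used),
                  key=lambda i: information(theta, *item_bank[i]))
        used.add(idx)
        a, b = item_bank[idx]
        responses.append((a, b, answer(a, b)))
        theta = estimate_theta(responses)
        total_info = sum(information(theta, a, b) for a, b, _ in responses)
        if total_info > 0 and 1.0 / math.sqrt(total_info) < se_target:
            break                    # precise enough: stop "homing in"
    return theta
```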
So what’s not to like about adaptive testing?
Well, in case you were thinking that it’s the solution to all testing problems, adaptive testing is not without its issues…
Here are a few of the big issues:
- Given that there are lots of “ways through” the test, you need a lot of questions – far more than in a normal linear test. And you have to know the difficulty of each item (so you can pick when to offer it with confidence), which means pre-testing items – and that can be costly.
- Candidates can’t go back in the test – once they answer a question and move on, that’s it – no going back to change your answer (or the adaptive routing might pick a different question!).
- If the stakes for the test are very high (e.g. it’s a certification examination or a licence to practise) then the fact that candidates take different questions from each other can lead to concerns about fairness and reliability (in the assessment quality sense of the word).
- With item selection governed by difficulty, it can be difficult to also select by topic in order to ensure good coverage of the curriculum, which raises issues of content validity. Whilst CAT helps to establish high reliability, which is a necessary condition for validity, it does not in itself make the test valid. Trade-offs between precision and content coverage often need to be made, though obviously unless a test is content-valid there is no point in it being precise – it’s not measuring the same thing for all candidates[4]! (One generic way of managing this trade-off is sketched after this list.)
- There are potential issues with exposure (items becoming known to the candidature), particularly for items at the beginning of the test, which puts pressure on increasing the size of the question bank and so raises costs.
- It can feel a little “black box” which potentially threatens public acceptance and other face validity aspects.
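As promised above, here is one generic way of softening the precision-versus-coverage and exposure problems (an illustration only, not a description of any product mentioned in this article): give priority to topics that are still under their quota, and then choose at random among the few most informative remaining items rather than always taking the single best one. The item fields and `info_fn` parameter are assumptions for the sketch.

```python
import random

def pick_next_item(items, theta, asked, quota, info_fn, randomesque=3):
    """items: list of dicts like {"id": ..., "topic": ..., "a": ..., "b": ...}.
    asked: set of item ids already used; quota: dict topic -> items still owed.
    info_fn(theta, a, b): item information function (e.g. the 2PL one above).
    Returns the chosen item."""
    available = [it for it in items if it["id"] not in asked]
    # topics still under quota take priority, to protect content coverage
    owed = {topic for topic, n in quota.items() if n > 0}
    pool = [it for it in available if it["topic"] in owed] or available
    # rank by information at the current ability estimate, then pick at random
    # from the top few ("randomesque" selection) to limit item exposure
    pool.sort(key=lambda it: info_fn(theta, it["a"], it["b"]), reverse=True)
    return random.choice(pool[:randomesque])
```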
Despite this, I’m a fan – I think it’s an underused technology in the UK and can deliver real benefits where the capabilities of the candidate are unknown and keeping their interest during the test is important. The tests we’ve designed with adaptive content have been well received and have stood the test of time, particularly in diagnostic and formative assessment settings.
If you’re interested in a bit more background reading, try this helpful intro: http://echo.edres.org:8080/scripts/cat/catdemo.htm#whatwhy.
And here’s some information from US researchers who have considerable experience in the strengths and weaknesses of adaptive testing:
Author: John Winkley, AlphaPlus Consultancy Ltd
www.alphaplusconsultancy.co.uk
[2] Based on Item Response Theory modelling of the candidate’s ability and question difficulty: http://echo.edres.org:8080/irt/
[3] Official GMAT website: http://www.mba.com/the-gmat.aspx and an explanation of how it works: http://gmatclub.com/forum/how-does-the-gmat-algorithm-work-120297.html
[4] There’s an interesting paper on this here: http://www.ncbi.nlm.nih.gov/pubmed/12386393