Monday, 3 March 2014

Language Testing: The Road to Perfection

Language testing is an unavoidable part of the language teaching experience. 

Some of us have to create classroom tests to check progress in the material; others join full-time testing teams charged with creating an assessment syllabus and a complex exam system for an institution. At one point or another in our careers, we all find ourselves in situations where the responsibility of measuring students’ performance is thrust upon us – often without much support or advance preparation. This is precisely why we must seek opportunities to exchange experiences with like-minded fellow professionals – more often than not, we find that the issue we have long been struggling with has already been solved in someone else’s institution. And vice versa.

TEAM 2014, Kayseri

I recently gave a series of talks and workshops at Meliksah University in Kayseri, Turkey, which hosted a three-day training event in co-operation with SELT Academy and IDP Education called TEAM 2014: Testing, Evaluation and Assessment Masterclass. Professionals from 13 universities around the country came together to share their insights. My fellow trainer Dr. Simon Phipps and I were in equal measure pleasantly surprised and thoroughly vindicated when, on day 2, one of the trainees stood up before a plenary to suggest they all get together informally to continue discussing their testing experiences. In their free time! It is precisely these grassroots initiatives that will provide us with the support that we all need when we design tests.

A three-way tug-of-war

No matter what the context is, there are some shared principles that apply to us all. Creating a language test is always a balancing act between three major forces: validity (how closely the test scores correlate with the students’ real-life abilities as well as the teaching curriculum), reliability (to put it simply, how consistently it measures those abilities) and practicality (how user-friendly our test is).

It is impossible to create a test where all three forces operate at 100%. To give you an example: you can increase reliability by reducing the margin of error. The more test items you devote to checking a particular structure (or skill), the less likely it is that the obtained score is the result of a random factor – like blind luck. However, as you increase the item count, your test will gradually become less and less valid. The more items you devote to a single structure, the more you shift the bias towards that structure – to the detriment of all other aspects of language. And the higher the item count, the more time candidates will need to complete the test – which will also negatively affect practicality. It will simply not be worth using a test if it takes that long to get a reliable score.
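To put rough numbers on the reliability side of this trade-off, here is a minimal sketch in Python. It rests on assumptions of mine that are only there for illustration: four-option multiple-choice items, a 60% pass mark, and items that are answered independently of each other. The first function uses a simple binomial model to estimate how likely a candidate is to pass by blind guessing alone; the second applies the Spearman-Brown formula from classical test theory, which predicts the reliability of a test lengthened with comparable items. Both function names are hypothetical.

```python
from math import comb


def chance_of_passing_by_guessing(n_items, n_options=4, pass_percent=60):
    """Probability that a candidate who guesses every item still passes.

    Assumes independent items with one correct option each (binomial model).
    """
    p = 1 / n_options
    # Smallest number of correct answers that reaches the pass mark
    # (integer ceiling, to avoid floating-point rounding surprises).
    needed = (n_items * pass_percent + 99) // 100
    return sum(
        comb(n_items, k) * p**k * (1 - p) ** (n_items - k)
        for k in range(needed, n_items + 1)
    )


def spearman_brown(reliability, length_factor):
    """Predicted reliability when a test is made `length_factor` times longer
    with items comparable to the existing ones (Spearman-Brown formula)."""
    return length_factor * reliability / (1 + (length_factor - 1) * reliability)


if __name__ == "__main__":
    # Illustrative values only: how luck fades as the item count grows.
    for n in (5, 10, 20, 40):
        print(n, "items:", round(chance_of_passing_by_guessing(n), 6))
    # Doubling a test whose reliability is 0.6.
    print("doubled test:", spearman_brown(0.6, 2))
```

Note what the sketch leaves out: it says nothing about validity or practicality, which is exactly why a longer test is not automatically a better one.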

All we can do is accept the best available compromise. Provided, of course, that the compromise doesn’t involve sacrificing any one of the three qualities altogether.

What are we testing?

Before we begin developing a test, we must first and foremost pin down what it is that we want to find out. I often define testing as "the systematic gathering of information for the purpose of making decisions" (Weiss 1972, quoted in Bachman 1990) – so the questions are:

1. What decisions do we want to make?

For example: a) which candidates have the required skills and knowledge to enter tertiary education; b) which candidate is most suitable for a particular job, and so on.

2. What information do we need?

For example: a) how much do they know about the foundations of their chosen subject, how developed are their research and academic writing/speaking skills; b) what would they do in a typical workplace situation, how well do they co-operate during teamwork, and so on.

3. What’s the best way of collecting that information?

Now, the point here is that the answer won’t always be: "through a test"! Multiple-choice tests, for instance, are all very impressive and professional-looking, but they will never give us a complete picture of someone’s complex set of skills. The trick is to select those particular sources of information and evaluation methods that give us the information we need. And not something else completely. And not a jumble of factors that we cannot untangle. And not a lot of irrelevant information just because it was easy to measure.

What we do with our test must always be appropriate to our purpose, and do just what we need it to do rather than become a burden on candidates. We shouldn't test for testing's sake; we should test because there's information without which we can't make that important decision.

It cuts both ways

But let’s not forget that testing and teaching have an impact on each other. Tests should reflect prevailing teaching practices (otherwise they are not measuring what candidates have learnt and how they have learnt it), and they should also have a positive effect on the teaching process – a positive "washback". (I will blog more on washback later; being a hobby-horse of mine, it would soon become a considerable diversion here.)

The importance of using yardsticks

Obviously, our responsibility extends not only to what measurement tools we use, but also to how well we use them. This is why following accepted standards (like the far-from-perfect but nonetheless essential Common European Framework of Reference, CEFR) and, within their internationally transparent framework, establishing our own standards are so vital. To quote Alderson et al. (1995), we need "an agreed set of guidelines which should be consulted, and as far as possible, heeded in the construction and evaluation of a test".

(In case you want to find out what I meant by my comment above, see my presentation from TEAM 2014 on Slideshare.)

The long road to perfection

Test development is a long, time-consuming process. I’m not suggesting here that, for a ten-minute flash test, you must go through all the stages of test planning and writing – simply that you shouldn’t forget what goes into the development of a good test, and that you should take some time to consider all the key factors. The principles are the same, no matter how big or small your test is.

Test development doesn’t finish when the first complete test is finally written. It’s a cyclical, on-going process, which – I’m sorry to say – never actually ends.

The key to developing good, reliable, valid and practical tests is never to rest on your laurels. Plan them well, write them with due care and attention, use them wisely – but never forget to look back on the testing experience, to analyse your data and to use them to make improvements to your test before it is used again. In order to find out what needs improving, we must learn the art of feedback, which will, again, be the subject of a future post on this blog. (For those who attended our training, a gentle reminder: remember "sandwich".)
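As for analysing your data, one possible starting point is a pair of classical item statistics, sketched below in Python. This is my own illustration rather than a prescribed procedure: the facility value is simply the proportion of candidates who answered an item correctly, and the discrimination index compares the top and bottom thirds of candidates ranked by total score. The function names, the choice of thirds and the sample scores are all assumptions made up for the example.

```python
def item_facility(item_scores):
    """Proportion of candidates who got the item right (scores are 0 or 1)."""
    return sum(item_scores) / len(item_scores)


def item_discrimination(item_scores, total_scores):
    """Upper-lower discrimination index for one item.

    Ranks candidates by total test score, then subtracts the item's facility
    in the bottom third from its facility in the top third. Values near +1
    mean stronger candidates tend to get the item right; values near zero
    or below suggest the item deserves a second look.
    """
    ranked = sorted(
        zip(total_scores, item_scores), key=lambda pair: pair[0], reverse=True
    )
    group_size = max(1, len(ranked) // 3)
    top = [item for _, item in ranked[:group_size]]
    bottom = [item for _, item in ranked[-group_size:]]
    return sum(top) / group_size - sum(bottom) / group_size


if __name__ == "__main__":
    # Invented data: six candidates, one item, plus their total test scores.
    totals = [9, 8, 7, 4, 3, 2]
    item = [1, 1, 1, 0, 0, 0]
    print("facility:", item_facility(item))
    print("discrimination:", item_discrimination(item, totals))
```

Figures like these do not tell you what to change in an item, only which items to look at first before the test is used again.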

Test development is an ever-rising spiral, with our tests becoming better and better as we go on. Perfection may be far off yet, but that doesn’t mean we shouldn’t be travelling towards it all the time.

1 comment:

  1. Disturbingly enough, all (except one) of the universities that participated in the 2014 training in Kayseri, including our hosts, were closed down by the Turkish government in 2016. :-(