Language testing is an unavoidable part of the language
teaching experience.
Some of us have to create classroom tests to check progress
in the material; others join full-time testing teams charged with
creating an assessment syllabus and a complex exam system for an institution. At one point or another in our careers we all find
ourselves in situations where the responsibility of measuring students’
performance is thrust upon us – often without much support or advance
preparation. That
is precisely why we must seek opportunities to exchange experiences with
like-minded fellow professionals – more often than not to find that the issue you have long been
struggling with has already been solved in someone else’s institution. And vice
versa.
TEAM 2014, Kayseri
I recently gave a series of talks and workshops at Meliksah
University in Kayseri, Turkey, which hosted a three-day training event, run in co-operation
with SELT Academy and IDP Education,
called TEAM 2014: Testing, Evaluation and
Assessment Masterclass. Professionals from 13 universities around the
country came together to share their insights. My fellow trainer Dr. Simon
Phipps and I were in equal measure pleasantly surprised and thoroughly
vindicated when, on day 2, one of the trainees stood up before a plenary to suggest
they all get together informally to continue discussing their testing
experiences. In their free time! It is precisely these grassroots initiatives
that will provide us with the support that we all need when we design tests.
A three-way tug-of-war
No matter what the context, some shared
principles apply to us all. Creating a language test is always a balancing
act between three major forces: validity
(how closely the test scores correlate with the students’ real-life abilities
as well as with the teaching curriculum), reliability
(put simply, how consistently the test measures those abilities) and practicality
(how user-friendly our test is).
It is impossible to create a test where all three forces operate
at 100%. To give you an example: you can increase reliability by reducing the
margin of error. The more test items you devote to checking a particular
structure (or skill), the less likely it is that the obtained score is the
result of a random factor – like blind luck. However, as you increase the item
count, your test will gradually become less and less valid. The more items you
devote to a single structure, the more you shift the bias towards
that single structure – to the detriment of all other aspects of language. And
the higher the item count, the more time candidates will need to complete the test –
which will also negatively affect practicality. It is simply not worth
using a test if it takes that long to produce a reliable score.
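To put a rough number on the “blind luck” factor, here is a minimal sketch of the arithmetic – my own illustration, not something from the original talk. It computes the probability that a pure guesser reaches the pass mark on a multiple-choice test; the four-option items, the 60% pass mark and the item counts are all assumptions chosen for the example.

    from math import ceil, comb

    def prob_pass_by_luck(n_items, n_options=4, pass_mark=0.6):
        # Chance of guessing any single item correctly.
        p = 1 / n_options
        # Lowest raw score that still reaches the pass mark.
        k_min = ceil(pass_mark * n_items)
        # Binomial tail: P(number of correct answers >= k_min) for a pure guesser.
        return sum(comb(n_items, k) * p**k * (1 - p)**(n_items - k)
                   for k in range(k_min, n_items + 1))

    for n in (5, 10, 20, 40):
        print(f"{n:2d} items: P(pass by blind luck) = {prob_pass_by_luck(n):.5f}")

With five items, a guesser passes roughly one time in ten; with forty, the chance is vanishingly small. That is the pull towards higher item counts – and, as described above, validity and practicality pull just as hard in the other direction.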
All we can do is accept the best available compromise.
Provided, of course, that the compromise doesn’t involve sacrificing any of the
measurement qualities.
What are we testing?
Before we begin developing a test, we must first and
foremost pin down what it is that we want to find out. I often define testing
as “the systematic gathering of information for the purpose of making decisions”
(Weiss 1972, quoted in Bachman 1990) – so the questions are:
1. What decisions do we want to make?
For example: a) which candidates have the required skills and knowledge to
enter tertiary education; b) which candidate is most suitable for a particular
job, and so on.
2. What information do we need?
For example: a) how much do they know about the foundations of their chosen
subject, how developed are their research and academic writing/speaking skills;
b) what would they do in a typical workplace situation, how well do they co-operate
during teamwork, and so on.
3. What’s the best way of collecting that information?
Now, the point here is that the answer won’t always be: "through a test"!
Multiple-choice tests, for instance, are all very impressive and professional-looking, but
they will never give us a complete picture of someone’s complex set of skills. The trick is to select the particular sources of information and evaluation
methods that give us the information we need. And not something else entirely.
And not a jumble of factors that we cannot untangle. And not a heap of
irrelevant information just because it was easy to measure.
What we do with our test must always be appropriate to our
purpose, and do just what we need it to do rather than become a burden on
candidates. We shouldn't test for testing's sake – we should test because there's information without which we can't make that important decision.
It cuts both ways
But let’s not forget that testing and teaching have an
impact on each other. Tests should reflect prevailing teaching practices
(otherwise they are not measuring what candidates have learnt and how they have
learnt it), and they should have a positive effect on the teaching
process – a positive “washback”. (I will blog more on washback later – it is a hobby-horse of mine and would soon become a considerable diversion here.)
The importance of using yardsticks
Obviously, our responsibility doesn’t only extend to what
measurement tools we use, but also to how well we use them. That is why following
accepted standards (like the far-from-perfect but nonetheless essential Common European Framework of Reference, CEFR)
and, within their internationally transparent framework, establishing our own
standards is so vital. To quote Alderson et al. (1995), we need “an agreed set
of guidelines which should be consulted, and as far as possible, heeded in the
construction and evaluation of a test”.
(In case you want to find out what I meant in my comment
above, see my presentation from TEAM 2014 on Slideshare at: http://www.slideshare.net/SeltAcademy/21-applying-standards-to-testing-plenary-ctsacademic.)
The long road to perfection
Test development is a long, time-consuming process. I’m not
suggesting here that, for a ten-minute flash test, you must go through all the
stages of test planning and writing – simply that you shouldn’t forget what
goes into the development of a good test, and that you should take some time to
consider all the key factors. The principles are the same, no matter how big or
small your test is.
Test development doesn’t finish when the first complete test
is finally written. It’s a cyclical, on-going process, which – I’m sorry to say
– never actually ends.
The key to developing good, reliable, valid and practical
tests is never to rest on your laurels. Plan them well, write them with due care
and attention, use them wisely – but never forget to look back on the testing
experience, to analyse your data and use them to make improvements to your test
before it is used again. In order to find out what needs improving, we must learn the art of feedback – which will, again, be the subject of a future post on this blog. (For those who attended our training, a gentle reminder: remember “sandwich”.)
Test development is an ever-rising spiral, with our tests
becoming better and better as we go on. Perfection may be far off yet, but that doesn’t
mean we shouldn’t be travelling towards it all the time.
Disturbingly enough, all but one of the universities that participated in the 2014 training in Kayseri, including our hosts, were closed down by the Turkish government in 2016. :-(