ICT & Computing in Education


Reflections on assessing computing

This article was originally published some time ago, when there was a previous iteration of the National Curriculum. Although the context has changed, many of the issues remain, which is why I've decided to republish it. I hope you find it useful. It has been lightly edited to remove dud links.

The front page of the Rules Base

Trying Times Part 1

From June 2008

I've been involved in two very different forms of assessment this week, and only today I read about some new research going on. So over the next day or two I thought I would just talk about those things and, in particular:

• The two key difficulties of assessment;

• My difficulties with rubrics; and

• My problem with some newly-published research.

The two key difficulties of assessment

I had an extremely enjoyable day today. You will quite possibly think I am a masochist, because my day consisted of being locked in a room with half a dozen other people, looking at items for an on-screen test for educational ICT.

This is the test whose development I was very closely involved in whilst working at the Qualifications and Curriculum Authority (QCA) a few years ago. The idea of the test is that it tells you, or at least gives a very strong indication of, the level at which the student is working, or has attained, in information and communications technology (ICT) at that point in time.

This test was different from all its predecessors, not because it was taken at a computer screen -- many tests do that -- but because it attempted to assess the student's ICT level by their performance in problem-solving tasks. So, think about that for a moment: we are not talking about a multiple choice test in which, if it has been written properly, there is only one correct answer per question. This is a situation in which there is not necessarily any correct answer, but one or more correct ways of approaching the problem.

For a test like this to work, there has to be a set of rules underpinning its judgements, and that's what I came up with: the rules base (see below). This is, in effect, a vast grid of possibilities, but one which at its heart is very simple Boolean logic. Basically, the test comprises a set of rules which, if written in ordinary English, would read something like this: if the student has done A and B, but not C, then there is evidence that they are working at a particular level.
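To give a flavour of what that looks like in practice, here is a minimal sketch in code. It is purely illustrative: the evidence names, the rule and the Level are invented for the example, and the real rules base was a far larger grid built in a spreadsheet rather than a program.

```python
# Illustrative only: a toy version of the kind of Boolean rule the rules
# base contained. The evidence names and the Level are invented for this
# example; they are not taken from the real test.

def shows_level_5(evidence: dict) -> bool:
    """Return True if the recorded evidence suggests Level 5 work.

    `evidence` maps observed actions to True/False, e.g. whether the
    student refined their search terms, or tested their sequence of
    instructions before submitting it.
    """
    return (
        evidence.get("refined_search", False)
        and evidence.get("tested_solution", False)
        and not evidence.get("needed_step_by_step_prompts", False)
    )

# One student's recorded actions during a problem-solving task
student_evidence = {
    "refined_search": True,
    "tested_solution": True,
    "needed_step_by_step_prompts": False,
}

print(shows_level_5(student_evidence))  # True
```

In the real thing, of course, many such rules were combined, and the output was an indication of the level rather than a definitive verdict.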

Originally, the test was intended to be a summative test, ie one that measured the student's grade at the end of what is known in England and Wales as "Key Stage 3", ie age 14. It was, therefore, to be a high-stakes assessment. And it was, perhaps inevitably, moved from the "mandatory" column to the "optional" column of the government's plans, to become, instead, a tool for formative assessment.

I think that was the right decision, because the test is so good at testing understanding in real-world problem simulations that it can form a great basis for class discussion.

Over the past few years I have been involved in the discussions of the test items. What happens is that people write the test items, they are then translated into interactive questions, and we come along and pull them apart for hours on end.

The "we" are people from the key organisations involved, ie the QCA, the National Assessment Agency (NAA), and the National Strategies, and me. To put that another way, the participants in these discussions have been the organisation that has written the ICT curriculum for England and Wales (QCA), the organisation whose remit it is, amongst other things, to oversee the assessment of the National Curriculum (NAA), and the organisation which has taken the ICT Programme of Study and made it more concrete, in the form of a Framework and lesson plans and other resources.

The two key difficulties of assessment that become obvious in this sort of activity are:

1. Does this test item measure what it purports to measure? In other words, is it valid?

2. If so, does it actually succeed in doing so?

Put like that, it may sound like those are more or less the same question, so let me give an example to make my meaning clear:

The first question is about looking at what the test item requires, and seeing whether it needs knowledge or understanding in a particular area in order to be answered correctly. Today, for example, we were looking at sequencing (programming) questions. If those questions had used examples such as boiling water for a cup of tea, we would have said that you don't need any knowledge of ICT to answer them. All you need is knowledge of the world and maybe some common sense.

Where a question did pass the validity test, we then had to ask whether, in practice, it actually succeeded in measuring what it set out to measure.

Clearly, what is enjoyable in such discussions is that you grow in your own understanding of the subject. Also, it is very interesting to hear other people's views on particular issues. It's the discussion, the interaction, that moves things on. And sometimes there are moments of humour, like today when I realised, halfway through an impassioned plea against a particular question, that I had successfully demolished my own argument!

One of the things I have noticed that distinguishes good ICT departments in schools from the others is that time is made available for discussion about assessment. Teachers need to have a clear and shared understanding about what, say, a Level 5 student is capable of, and they need to be able to assess their students' performance with a reasonable degree of accuracy.

They also need to be able to articulate to their students what these levels mean, and what a student must do to progress from one level to the next. They must also go beyond even this, and help students become, to use Maslow's term, self-actualised in the realm of assessment. In other words, they must be able to evaluate, accurately, their own performance.

Unfortunately, time to do all this is the one thing that teachers no longer seem to have.

Sadly, today was the last of these periodic meetings, because we have now gone through all the main areas of the ICT curriculum. In an ideal world, everyone involved would now start back at the beginning, in order to ensure a constant supply of fresh material, but, for the time being at least, that is not to be.

In my next article on this subject I will describe a rather different form of assessment I have been involved in this week: acting as a "meta-judge", guided by a rubric, for the Horizon 2008 Project.

Trying Times Part 2: aye, there's the rubric

From 2008

I was recently approached by Julie Lindsay and Vicki Davis with an invitation to be a "meta-judge" on the Horizon 2008 Project. It was a great honour to be asked, and I hope my judgements are received in a positive way.

But, as is usually the case with this sort of thing, it did raise doubts in my mind about the value of rubrics for this type of activity. They are useful, but they are also limited, and not nearly as objective as one might think.

In a nutshell, the project involved students from several countries collaborating with each other to do research into how modern technology is affecting various aspects of modern life (government, education, health and others). The end product, besides the wiki itself, was a video submitted by each student. These have been judged by a number of educationalists, who decided on the winner in each of the 13 categories. My role as "meta-judge" was to decide which of these 13 finalists was the ultimate winner.

I have to say that this was not an easy task despite having the rubric to guide me. It isn't easy on a human level, if I can put it that way. The trouble with identifying one winner is that by doing so you automatically identify 12 "losers"! I would hope that those 12 don't see it that way. The quality of all the videos was extremely high, and there are even one or two that didn't come out on top that I will have no hesitation in using in my own work (with full credit and citation given, of course). To end up as one of just 13 finalists is good going, and all of the students should feel proud of themselves.

Indeed, even those students not in the final line-up did a fantastic job. If you’d looked at the wiki you would have discovered a cornucopia of ideas and resources, almost all of which were put together by the students.

The rubric I used is called Rubric 1, Multimedia Artifact. As rubrics go, it isn't bad at all. It's shorter than many, which is good, because the longer, ie more detailed, they are, the easier they are to apply, but the less meaningful they become. The reason is that once you start breaking things down into their component parts, you end up with a tick list of competencies which, taken together, may not mean very much at all. That is because the whole is nearly always greater than the sum of its parts, so even if someone has all of the individual skills required or, as in this case, has carried out all of the tasks required, the end result may still not be very good. So you end up having to use your own judgement about how to grade something, which is exactly what a rubric is meant to avoid in the first place. Let me give you a concrete example.

One of the sentences in the rubric, in essence, gives the student credit for using and synthesising the information on their wiki page.

That seems straightforward enough, until you come across a case where the content on the wiki page is itself superficial -- in which case the right thing for the student to have done would have been to ignore the wiki page altogether and put in some fresh insights. But if they had done that, they wouldn't get credit for using the information on the wiki page. In other words, it's a no-win situation which actually penalises the student who exercises her own judgement.
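As an aside, the "tick list" problem is easy to see if you imagine scoring a rubric mechanically. Here is a deliberately crude sketch; the criteria, scores and adjustment are invented, and have nothing to do with the actual Horizon 2008 rubric.

```python
# Invented criteria and scores, purely to illustrate the point that a
# criterion-by-criterion total can flatter a piece of work which, judged
# as a whole, does not hang together.

scores = {
    "uses information from the wiki": 3,
    "includes images and audio": 3,
    "cites sources": 3,
    "within the time limit": 3,
}

max_per_criterion = 3
total = sum(scores.values())
print(f"Tick-list total: {total} / {len(scores) * max_per_criterion}")  # 12 / 12

# ...and yet the video may still be superficial overall, so the assessor's
# own judgement creeps back in -- which is exactly what the rubric was
# supposed to make unnecessary.
holistic_adjustment = -4  # the judge's (entirely subjective) correction
print(f"Adjusted judgement: {total + holistic_adjustment} / 12")
```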

I think the main problems with rubrics in general can be summarised as follows:

1. Do the individual criteria reflect what it is we are trying to measure? This is the problem of validity which I discussed in my first post in this mini-series (see above).

2. Are the criteria "locked down" sufficiently to ensure that the rubric yields consistent results between different students and between different assessors (judges)? This is known as the problem of reliability.

3. Are the criteria too "locked down", which could lead to an incorrect overall assessment being made (the validity problem) or assessors introducing their own interpretations to aid the process of coming to a "correct" conclusion (the reliability problem)?

4. Does the rubric emphasise process at the expense of product? It is often said that in educational ICT, it's the process that's important. Well actually, that is not entirely true, and we do young people a grave disservice if we fail to tell them so. If you don't agree with me, that's fine, but I invite you to consider two scenarios, and reflect on which one is the more likely to happen in real life:

Imagine: Your Headteacher or Principal asks you to write a report on whether there is a gender bias in the examination results for your subject, in time for a review meeting next Wednesday. You can't find the information you need, so you write a report on the benefits of blogging instead. You desktop publish it so it looks great, and even burn it onto a CD for good measure. To add the icing on the cake, you even make a 5 minute video introducing the topic in order to get the meeting off to a flying start.

Scenario 1:

The boss says:

Scenario 2:

The boss says:

OK, I know that both responses are slightly far-fetched, but hopefully I've made my point.

Which also leads me on to another thing. I think some of my judgements may have come across as a bit uncompromising. But I really do not see the point of saying something like "Great video", or even "Poor video", without adding enough information for the student to get a good idea of why it was good or poor, and how to improve their work and take it to the next level in the rubric.

Getting back to the issue of interpretation, I am afraid that, in the interests of better accuracy and of giving the students useful feedback, I introduced some of my own criteria. Well, I was the sole meta-judge, a title so grand that I felt it gave me carte blanche to interpret the rubric as I saw fit. Lord Acton was right: absolute power really does corrupt absolutely.

The extra criteria I applied were as follows:

1. Did the medium reflect the message?

To explain what I mean by this, let me give you an example of where it didn't. In one of the videos, the viewer was shown some text which said that businesses can now make predictions. This was then followed by a photograph of chips used in casinos. So, unless the video was intended to convey the idea that predictions can now be made which are subject to pure chance, which I somehow doubt, that image sent a completely inappropriate message.

2. Could I learn what I needed to know about the topic without having to read the wiki? If not, then I would be at a loss to explain the point of having the video, unless question #3 applied. This includes the question: is the information given actually meaningful? Look at that point about businesses now being able to make predictions. Businesses have always made predictions, so that statement tells me nothing. What I want to know is, how does communications technology aid forecasting, and does it make the process more accurate?

3. Did the video inspire me to want to find out more, or to do something, even though there wasn't much substance to it? If so, and if that was at least partly the aim, maybe that would be perfectly OK. I'd take some convincing though.

4. Did the video only synthesise the information on the wiki, or did it do more? The word "synthesise" implies adding value in some way: it's more than merely "summarise". But if the information was of a poor quality, did the student deal with the matter effectively or merely accept the situation?

5. In every case I watched the video first, and then read the wiki, because I wanted to come to it with as few preconceived ideas as possible, to see if the video was able to stand on its own. I then read the wiki and then re-watched the video (sometimes more than once), looking for specific things.

If you have any views on using rubrics, I'd love to hear them -- especially if you completely disagree with anything I've said in this post!

The title is a horrible play on words, although I have to say I'm quite proud of it! It is, of course, taken from Hamlet's "To be or not to be" soliloquy, in which he says, "Ay, there's the rub".

Some notes on the Rules Base

This was worked out on paper first, and then created in a spreadsheet. The spreadsheet, based on some amazing formulae and quite a few macros, was interactive. Bear in mind that this was constructed back in the days when we talked about “Levels”. Here’s the front page:

Screenshot by Terry Freedman

As you can see, this was intended to be user-friendly, by which I mean that the user didn't have to know anything about Excel in order to make use of it.

The coloured rectangles were buttons. Clicking on one took you to the relevant section of the spreadsheet.

The skills & techniques calibration tool attempted to say which skills could indicate a particular Level in different applications. For example, configuring AutoCorrect in Word is an example of efficiency; so is using an absolute cell reference, or named ranges, in Excel. There is therefore a sense in which AutoCorrect and named cell ranges are equivalent as evidence, even though they are very different things.

Part of the calibration tool.

Please note what this was not:

  • It was not promoting the teaching of Microsoft Office, which was one of the complaints people started to make about the ICT Programme of Study. It was simply acknowledging that some ways of using the Office applications were more efficient than others.

  • It was not suggesting that using an absolute cell reference in a spreadsheet meant you were on Level 7, the "key characteristic" of which was efficiency. It was simply suggesting that if a student had a well-designed spreadsheet that addressed a particular problem, and they had used devices like absolute cell references, then perhaps they had the knowledge, skills and understanding one would associate with Level 7. A rough sketch of this kind of inference is given below.
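The sketch that follows is a simplification under assumptions: the technique names and the mapping to "efficiency" are examples of my own, not the contents of the real calibration spreadsheet.

```python
# Simplified illustration of the calibration idea: different techniques in
# different applications can all point towards the same characteristic
# ("efficiency", associated with Level 7). The technique names here are
# examples only.

EFFICIENCY_INDICATORS = {
    "word_processor": {"configured_autocorrect", "used_styles"},
    "spreadsheet": {"absolute_cell_reference", "named_ranges"},
}

def suggests_level_7(application: str, observed_techniques: set) -> bool:
    """Suggest (not prove) Level 7 if an efficiency indicator was observed.

    As with the rules base itself, this is a prompt for discussion, not an
    oracle: the technique only counts if it was used as part of a
    well-designed solution to a genuine problem.
    """
    indicators = EFFICIENCY_INDICATORS.get(application, set())
    return bool(indicators & observed_techniques)

print(suggests_level_7("spreadsheet", {"absolute_cell_reference"}))  # True
print(suggests_level_7("word_processor", {"changed_font"}))          # False
```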

Actually, the Rules Base was never intended to be an oracle that would tell you definitively what Level a student was on. It was, rather, intended to be used as a stimulus to discussion. Further screenshots are shown below.