History – why embark on this?

It started with “a” midterm test

In February 2018 we had to give a midterm test to 1250 students in 8 different sections of Mathematics 101. This presented us with significant logistical problems. Arguably, the difficulties in giving a midterm are more substantial than those surrounding the final exam. Foremost amongst these was trying to minimise the amount of “leakage” of information about the test. In particular:

The first sitting was scheduled for Thursday at 9:30am, and the last on Friday at 4pm – plenty of time for information about a paper to “leak”.
Almost all of our classrooms are very full and so it is not possible to space students out. It is impossible to prevent students from seeing neighbouring desks.

One solution is to give multiple versions of the test – more on this shortly. Another is to give the test outside regular class time, but we decided against that.

After-hours tests worked very well when very few courses used them. Now, in part victimised by their own success, many courses use them. Since the student body taking Mathematics 101 is very diverse, our students take courses in basically any other department you can think of. Time conflicts are now plentiful, so multiple sittings – and versions – are required.
Importantly, many of our students have long commutes and (shockingly) have lives and commitments outside of our course. Staying late on campus is simply not possible for many of them.

This leaves us giving multiple versions of the test – we gave 3 on Thursday and 3 on Friday. This solution is definitely not without problems, and there were still many logistical problems to overcome. After the instructor-in-charge made the master version of the test, the rest of the team created similar versions, their solutions, and marking rubrics. Much time was invested to harmonise format, language and difficulty. So there were multiple weeks of work before the test was seen by the first students.

After the students took the test we had to physically separate the 6 versions of the test. There were about 12 of us marking, so we put two people on each test version. The rubrics needed updating as we went, and the whole process involved a great deal of ad-hoc management by the instructor-in-charge. Unfortunately we didn’t have more marker hours, so the quality of feedback given to students was not as high, or as consistent, as we would like. This is not the fault of the markers – we had to get the test back quickly.

“Its too slow to give feedback” — anonymous marker

But our students definitely notice:

“There is very minimal feedback provided on the midterm markings” — anonymous student

Improving the process

This experience started us thinking about what we could do to improve this process. While a few things are easy to change, some critical aspects are very difficult to modify. In particular, we cannot increase the number of available classrooms, nor the number of seats in those classrooms (a very long rant on this topic is available on request). Our graduate program has stayed roughly constant while undergraduate student numbers have increased dramatically – so the number of marker-minutes per student has dropped year after year. Perhaps, with better planning, we can improve our paper-handling logistics during marking. We might also be able to construct better marking rubrics, which could help our markers leave better and more consistent feedback.

But by far the easiest thing to change is the number of versions of the test. Of course, having fewer versions creates a tension between easier logistics and planning on the one hand, and a higher potential for “leakage” on the other. One potential solution is to interleave questions, rather than whole tests. So, for example, if we have 3 versions of each question on a 4-question test, then by interleaving questions we get a total of $3^4 = 81$ possible tests.
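To make the counting concrete, here is a minimal Python sketch of interleaving at the level of questions. It is only an illustration, not the scripts we actually used: the function names `all_papers` and `random_paper` are made up for this example, and a paper is represented simply as a tuple recording which version of each question it contains.

```python
import itertools
import random


def all_papers(num_questions, num_versions):
    """List every possible paper as a tuple of version numbers, one per question."""
    versions = range(1, num_versions + 1)
    return list(itertools.product(versions, repeat=num_questions))


def random_paper(num_questions, num_versions, rng):
    """Pick a version independently for each question (hypothetical helper)."""
    return tuple(rng.randint(1, num_versions) for _ in range(num_questions))


papers = all_papers(num_questions=4, num_versions=3)
print(len(papers))  # 81

rng = random.Random(2018)  # fixed seed so the same papers can be regenerated
print(random_paper(4, 3, rng))  # a tuple such as (2, 1, 3, 1)
```

Choosing each question's version independently is the simplest scheme; one could instead deal out versions so that each appears equally often, which would keep the marking load per version more even.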

So if we do things this way, then we don’t need as many versions. This means simpler pre-test logistics, simpler marking logistics, and hopefully more markers per question/version, making them more efficient. And while leakage can still occur, it is hopefully reduced. It isn’t hard to write some Python scripts that do the interleaving and watermarking of source PDFs to produce papers for printing. Data entry after marking is more onerous since we should really record the version of the question along with the mark, just in case one version turns out to be an outlier. But clearly the biggest problem is that we have created a nightmare of paper-handling logistics. To ensure that each marker only has to work on one version of a given question, papers need to be separated at the level of every question!
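The scale of the sorting problem is easy to see in a small sketch. Assuming (hypothetically) that we have a record of which version of each question appears on each paper, the function below works out the pile that each marker would need: one pile per (question, version) pair. The name `marking_piles` and the data layout are illustrative only.

```python
from collections import defaultdict


def marking_piles(papers):
    """Group paper IDs into one pile per (question, version) pair.

    `papers` maps a paper ID to a tuple of version numbers, one per question.
    """
    piles = defaultdict(list)
    for paper_id, versions in sorted(papers.items()):
        for question, version in enumerate(versions, start=1):
            piles[(question, version)].append(paper_id)
    return dict(piles)


# Three papers on a 2-question test.
papers = {1: (1, 2), 2: (2, 2), 3: (1, 1)}
piles = marking_piles(papers)
print(piles[(1, 1)])  # [1, 3]
print(piles[(2, 2)])  # [1, 2]
```

Every paper lands in a different pile for each of its questions, so with physical paper this means pulling each test apart question by question.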

The solution

“Just” write some software.