What would a quality assurance program for teaching at the higher-education level look like? We don’t have one now, nor even much to build on, but perhaps there are analogous programs we could adapt or copy. I think there are, and I will suggest an approach below. But first:

We are starting from a very modest base (not of teaching but of teaching support). As I railed in my previous post, what we have now is an incentive system under which professors are individually rewarded by retention or raises if they get adequate scores from students in surveys administered at the end of courses, surveys that do not reflect learning. I have been assured that we have departments in which high SETs (student evaluations of teaching) are a tenure liability, but that’s too depressing to dwell on. We have some ancillary activities, like a teaching and learning center, with a highly informed and dedicated staff, that provides a lot of on-line resources and some training if anyone asks. Almost no one ever does (the last event they put on at Cal drew about 40 people from a faculty of a thousand-odd, including a fair number of lecturers and staff). We require one two- or three-unit course for our GSIs (graduate student instructors, = TAs in some schools), but of course this only trains the one prof who teaches it, plus the very few of our own grad students we eventually hire. And we give an annual teaching award, for which the first hurdle is spotless SETs, with no mechanism for the winners to diffuse and replicate what they do well. There is an annual teaching seminar that meets monthly, which usually has trouble recruiting a dozen participants, so in eighty years it might reach all of us.

Several of my colleagues, at Cal and elsewhere, assert firmly that our teaching is actually very good. Our alumni are certainly in demand. I am happy to stipulate that our teaching is superb, and that teaching at Berkeley deserves A+ across the board, with a cherry on top. We are all really great teachers, bow, exeunt stage left with armfuls of flowers.

But I don’t care! No action follows from that proposition: the operational question is not whether to pat ourselves on the back some or a lot, but whether there are things that we could do that would cause enough more learning to be worth doing. If there are, and we are doing C work for our students, we should do them, and if we are doing A work, we should also do them. Absolute-scale measures are managerially pretty much useless. When I critique a student paper draft, the advice I give about how it could be [even] better is worth about a hundred of the letter grade itself. If you still think high absolute performance is a license not to seek improvement, ask yourself whether you would fly on an airline whose maintenance principle was “if it ain’t broke, don’t fix it!”

Some other colleagues, mostly economists, believe incentives are everything: if we pay faculty enough more for better teaching, and punish them enough for bad teaching, the market will waft us to an optimum. After all, Pharaoh beat the Hebrews if they didn’t work hard, fed them if they did, and got a nice pyramid, right? Incentives do matter, but fear of firing and money rewards are not well-suited for this particular population, which operates pretty far up the hierarchy of needs. Anyway, if you can’t observe good performance (cf. Philip Stark’s discussion of SETs), if the workforce doesn’t know how to effect it, and if they have the wrong tools, all the incentives in the world won’t work.

Finally, there is assuredly a production possibility frontier in research-teaching space. It slopes down monotonically and it is concave to the origin. If we were on it, any improvement in student learning would be at the expense of some research productivity (still might be worth it, but that’s a tough sell). But this is another misuse of good economic theory, like thinking a market equilibrium is where the world is rather than where it’s always trying to grope towards. As I learned from Bob Leone, one of the real live paid professional economists who have taught me so much good stuff, no real organization is ever at its PPF for any pair of output measures, and if it were, it would not be next week, as the PPF moves outward with organizational learning and technological advance. Indeed, organizations without good QA systems are always quite far from their PPF. The wise manager assumes she can move up or to the right or both, and is almost always correct; the foolish manager assumes she is on the PPF and wanders back and forth along where she thinks it is, like the tiger pacing along remembered cage bars.

Let’s start where college faculty should be comfortable: we have a highly developed QA system for research that has, by near-universal agreement, made our research the wonder of the world and getting better all the time. The way that goes is that we:

collaborate on papers and projects,
read each other’s work and cite it carefully in our own,
seek out experts and advice, for example on methodological issues, and
coach each other in institutionalized ways, like journal prepublication reviews and conference presentations.

The coaching is detailed and multidimensional: when I review an article I’m usually asked to make a coarse summary judgment like “publish/revise & resubmit/outer darkness” but that’s the least useful part of the process for the author and for me, and I always write a detailed critique. Both the author and I improve our practice through this coaching. There is some measurement, like impact scores and citation indices, but I don’t think any of us would substitute that in a tenure review for actually reading stuff someone wrote, and there is no solid quantitative research to prove that this or that research methodology is best, or that this or that type of collaboration or critique is good and another bad. Yet we plug along doing it, and research gets better and better.

This template suggests some practical options for teaching:

collaborate on curriculum and co-teach courses,
watch each other’s students learn, visiting classrooms and reviewing assignments and syllabuses,
seek out research-based teaching expertise and knowledge, and
coach each other [note: not, generally, have a “master teacher” grade junior colleagues at tenure time; coach each other. Research is full of 360 degree review.]

Existing QA for pedagogy, at least in higher ed, has none of those things. None. Industrial QA is built in large part on watching each other work and talking about what we see (the hot idea in coding now is to do it in pairs, one person typing and the other watching while they talk about what they’re doing). It certainly works for Google and Toyota, and management is like teaching in so many respects…

Now that I think of it, people in every high-performance profession, from musicians to scientists to writers to fighter pilots, flock so they can help each other get better, and they have done so since forever. They spend very little time grading each other but a lot of time talking shop about why and how this or that is considerable. J.D. Salinger lived alone in a cabin in New Hampshire, but he’s an uninformative exception; if you want to find writers, they’re in cafés in New York and London and San Francisco, and astronomers are at conferences and schmoozing in the common room, not sitting alone on mountaintops. Every high-performance profession, except college professors in teaching mode!

Of course, it could be that research would have advanced much faster if everyone did it alone and we did it all with incentives, paying profs according to citation indices (note collaboration sneaking in here already), or patent revenues/book royalties. Does anyone believe that?

Deming’s fourteen points are full of good guidance, though his emphasis on uniformity and consistency needs special handling in a service industry where diversity in the product is a feature and not a bug. Deming also specifically cautions against things like rewarding individuals for success, because you will mainly be rewarding random variation, you will destroy team morale, and you will set the winner up for resentment and disillusion when regression towards the mean takes hold next year.
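Deming’s warning about rewarding random variation can be made concrete with a toy simulation. What follows is a minimal sketch under loud assumptions: each instructor’s yearly score is a stable "skill" plus independent year-to-year noise, and the noise is larger than the spread in skill. The function name, the parameter values, and the model itself are hypothetical illustrations, not data from any campus.

```python
import random
from statistics import mean


def simulate_award(n_instructors=1000, skill_sd=0.2, noise_sd=0.5,
                   top_fraction=0.1, seed=0):
    """Pick 'winners' on year-1 scores, then see how they score in year 2.

    All parameters are illustrative assumptions: a yearly score is modeled
    as stable skill plus independent noise drawn fresh each year.
    """
    rng = random.Random(seed)
    skill = [rng.gauss(0.0, skill_sd) for _ in range(n_instructors)]
    year1 = [s + rng.gauss(0.0, noise_sd) for s in skill]
    year2 = [s + rng.gauss(0.0, noise_sd) for s in skill]

    # The award goes to the top slice of year-1 scores.
    n_winners = max(1, int(n_instructors * top_fraction))
    winners = sorted(range(n_instructors),
                     key=lambda i: year1[i], reverse=True)[:n_winners]

    return {
        "winners_year1": mean(year1[i] for i in winners),
        "winners_year2": mean(year2[i] for i in winners),
        "everyone_year2": mean(year2),
    }


result = simulate_award()
for name, value in result.items():
    print(f"{name}: {value:.2f}")
```

Under these assumptions the winners’ average falls sharply in the second year even though nobody’s skill changed, which is the regression towards the mean described above. How large the drop is depends entirely on the assumed ratio of noise to skill.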

Chris Argyris and Donald Schön had useful insights about organizational learning. My favorite is the instruction to interrupt learned behavior and force attention to it, and the first behavior I would interrupt is our focus, first on curriculum and second on teacher behavior, never getting to what students are doing even though that’s where the learning is happening. Of course, that can’t even start until management puts us, kicking and screaming, into a room together to talk about learning in the first place. Drive out fear.

What Deming, a statistician (with the soul of a psychologist and the calling of a prophet), really likes is measuring stuff. Not to pick winners and losers, but to understand processes and identify excursions either way that lead to learning if examined. What can we measure about learning?

Well, we can give examinations and look at test score improvement. There is a lot wrong with this, perhaps the subject of another post, but I want to stay away from the question of what I think good teaching practice is; the point of a QA program is to learn exactly that (including learning it from real research by others). We could also distinguish between (i) the average learning of a class and (ii) the grade each individual student deserves, and assess learning by taking a sample of the students and giving them an oral exam that doesn’t count towards a grade, maybe even paying them for their time. In some cases, we can assess learning by performance in follow-on courses. Wieman’s group has developed and qualified standard examinations, for use before and after an introductory course, to assess learning in physics and chemistry. And of course, SETs provide lots of useful information if we use them properly.

I don’t have any good ideas about what to measure in Theory T – lecture style – teaching, mostly because I’ve stopped doing it to students. But in a “flipped-classroom”, active learning, Theory C environment, there are all sorts of things to measure that could be illuminating. For example, I try to get a TA to sit behind the class with a stopwatch for a few sessions and record what fraction of the class time I am talking, and what fraction students are. Then I put a graph of the results up on the screen and invite the students to discuss what we are doing, note a trend, etc. Getting them to debate the optimal value of this indicator pays off nicely (no, I don’t know what it is; if I did I would just tell them). Other promising measures that GSIs (or a visiting coach) could observe include:

the average number of students who speak between interventions or comments by the prof,
the average length of a student contribution,
the fraction of the class that speaks during a session,
the time the prof waits after posing an interesting question before giving a hint,
the number of times students directly address each other,
the variance over different students in the average number of days between contributions, and
the average number of hands in the air at any moment.

As usual, variance in these measures is worth ten of absolute value, and trend is worth twenty.
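For anyone who wants to try this, here is a minimal sketch of how an observer’s stopwatch log could be turned into a few of the measures above, plus a trend across sessions. Everything in it is an assumption made for illustration: the `Turn` record, the `"prof"` label, the function names, and the sample numbers are hypothetical, not a record of any real class.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Turn:
    speaker: str    # "prof" or a student identifier (labels are hypothetical)
    seconds: float  # how long this speaker held the floor


def session_metrics(turns, class_size):
    """Summarize one class session from an ordered list of speaking turns."""
    student_turns = [t for t in turns if t.speaker != "prof"]
    prof_time = sum(t.seconds for t in turns if t.speaker == "prof")
    student_time = sum(t.seconds for t in student_turns)
    total_time = prof_time + student_time

    # Distinct students heard in each stretch between two prof interventions.
    # Stretches with no student speaker are not counted.
    runs, current = [], set()
    for t in turns:
        if t.speaker == "prof":
            if current:
                runs.append(len(current))
            current = set()
        else:
            current.add(t.speaker)
    if current:
        runs.append(len(current))

    return {
        "prof_talk_fraction": prof_time / total_time if total_time else 0.0,
        "mean_student_contribution_s":
            mean(t.seconds for t in student_turns) if student_turns else 0.0,
        "fraction_of_class_speaking":
            len({t.speaker for t in student_turns}) / class_size,
        "students_between_interventions": mean(runs) if runs else 0.0,
    }


def trend_slope(values):
    """Least-squares slope of one measure across consecutive sessions."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two sessions to estimate a trend")
    x_bar = (n - 1) / 2
    y_bar = mean(values)
    num = sum((x - x_bar) * (y - y_bar) for x, y in enumerate(values))
    den = sum((x - x_bar) ** 2 for x in range(n))
    return num / den


# Made-up log for one session in a class of ten.
log = [Turn("prof", 120), Turn("s1", 30), Turn("s2", 45),
       Turn("prof", 60), Turn("s1", 15), Turn("s3", 30)]
metrics = session_metrics(log, class_size=10)
print(metrics)

# Made-up prof talk fractions from three successive sessions.
print(trend_slope([0.6, 0.5, 0.4]))
```

The slope is reported per session, so a negative value for the prof talk fraction means the prof is talking less over time. Consistent with the point about trend and variance, the sketch makes no attempt to say what the right value of any measure is.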

That’s just for class time; there’s lots we could measure about assignments, critique of student work, etc., all with an eye to improving our understanding of what we are doing, why, and what we could do differently. What’s the most important part of this? No question in my mind, it’s breaking the crippling isolation in which we work. We are trapped in it by our instinctive misunderstanding of where the PPF is in an environment where research will always be non-negotiable, and by the fear I described in the previous post. But fear can be driven out, and QA in the present context has the advantage that the work force is enormously curious and dedicated. We just need institutional norms and routines that open some windows across the airshaft, and those are the duty of leadership.

[Cross-posted at The Reality-Based Community]