Supervising unreliable experts
On overseeing and learning from unreliable humans and AI models
In the field of AI safety, scalable oversight is the set of methods that let humans supervise an AI system that has more expertise than they do in many domains (e.g., are its outputs correct? do they align with intended goals?). In other words, how can a human supervise an unreliable expert?
The unreliable experts referred to here are increasingly capable AI models. But this doesn't seem like a problem specific to AI safety. In fact, the success of large swaths of the global economy can at least in part be attributed to successful methods of supervising unreliable experts within organizations. I'm not trying to single out any category of worker here. In a way, any worker with any domain expertise can be viewed as an unreliable expert, and they are unreliable simply because, as humans, they are fallible and will inevitably make mistakes.
In order to leverage the capabilities of unreliable experts while minimizing the risk posed by their mistakes, organizations have relied on methods including hierarchical reporting structures, peer reviews, red-teaming exercises, and retrospectives. The shared goal across these methods is to ensure that the outputs produced by the experts are correct and align with intended goals, even if the supervisor has less domain expertise.
At the same time, an important side effect of using any of these methods is the supervisor's accumulation of "good enough" domain knowledge over time. They do not need to become domain experts themselves, but they can become more effective at coordinating the experts and less likely to let mistakes slip through. The effective use of any oversight method not only corrects mistakes and aligns incentives to reduce future mistakes, but also presents opportunities to collect context about the domain that helps with pattern recognition and effective questioning. A peer review surfaces what good questions from another expert look like, the kind that reveal assumptions and trade-offs. A red-team exercise surfaces what good adversarial thinking from another expert looks like. These methods aren't just useful for oversight; they are also useful for learning. They allow the supervisor to be a fly on the wall and pick up some tacit knowledge.
Much of this seems to hold true when the unreliable experts are AI models as well. As probabilistic models, frontier AI models are fallible and will inevitably make mistakes, which makes them unreliable experts. As the supervisor, you want to make sure they are completing their tasks correctly and in a way that aligns with intended goals. At the same time, as the complexity of the tasks increases, it will become increasingly difficult for you to oversee the work unless you also start building up your own domain knowledge so you can recognize patterns and ask better questions. A supervisor who is too abstracted away from the domain and from what the experts are doing will eventually suffer from poor decision-making.
Given all of this, it seems scalable oversight extends well beyond AI safety. The challenge is how to supervise any system of unreliable experts, whether they are humans, AI models, or some combination of the two, in a way that corrects mistakes, aligns incentives to reduce future ones, and lets supervisors accumulate domain knowledge over time so they can keep up with the growing complexity of the tasks that need to be done.


