A recent paper by FHI researcher Stuart Armstrong and former intern Soren Mindermann (now at the Vector Institute) has been accepted at NeurIPS 2018. The paper, Impossibility of deducing preferences and rationality from human policy, considers the scenario in which an AI system learns the values and the biases of a human agent concurrently.
This extends an existing approach in which just the values of a human are learned by the AI system. The difficulty with ordinary value learning is that humans sometimes make systematic mistakes, taking actions that don’t lead them toward their preferred outcomes. So Armstrong and Mindermann consider how much harder the task gets if you also need to learn the human’s biases.
Armstrong and Mindermann show that if the human could, in principle, have any combination of values and biases, then observing the human's behaviour is not enough to single out one unique pairing: many different (values, biases) pairs explain the behaviour equally well.
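As a minimal sketch of this non-identifiability (the states, actions, and reward values here are invented for illustration, not taken from the paper), consider a rational planner paired with a reward R, and an anti-rational planner paired with -R. Both decompositions produce exactly the same observed policy, so behaviour alone cannot tell them apart:

```python
STATES = ["s0", "s1", "s2"]
ACTIONS = ["left", "right"]

# Hypothetical reward for taking an action in a state.
R = {
    ("s0", "left"): 0.0, ("s0", "right"): 1.0,
    ("s1", "left"): 2.0, ("s1", "right"): 0.5,
    ("s2", "left"): 1.0, ("s2", "right"): 3.0,
}

def rational_planner(reward):
    """A planner that picks the action maximising the given reward."""
    return {s: max(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}

def anti_rational_planner(reward):
    """A planner that picks the action minimising the given reward."""
    return {s: min(ACTIONS, key=lambda a: reward[(s, a)]) for s in STATES}

neg_R = {k: -v for k, v in R.items()}

# Explanation A: "the human is rational and values R".
policy_a = rational_planner(R)
# Explanation B: "the human is anti-rational and values -R".
policy_b = anti_rational_planner(neg_R)

# The two explanations induce identical behaviour.
print(policy_a == policy_b)  # True
```

Any number of intermediate planner/reward pairs can be constructed the same way, which is why no amount of behavioural data resolves the ambiguity.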
They then consider whether the AI could break this tie by favouring simpler value and bias functions. However, they show that an AI that favours simpler hypotheses will still not learn the human's actual values and biases, and will instead come to favour alternative pairings that are simpler but incorrect.
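The intuition can be sketched crudely (this is an illustrative toy, not the paper's formal complexity measure): a degenerate explanation such as "the planner just outputs the observed policy, and the reward is zero everywhere" tends to have a shorter description than a realistic pairing of a biased planner with a nuanced reward, so a bare simplicity preference favours the degenerate one. Here compressed string length stands in, very roughly, for description length, and both descriptions are invented:

```python
import zlib

def complexity(description: str) -> int:
    """Crude stand-in for description length: compressed size in bytes."""
    return len(zlib.compress(description.encode()))

# Hypothetical description of a realistic (planner, reward) pair.
true_pair = (
    "planner: softmax over two-step lookahead values, temperature 0.3, "
    "with an anchoring bias toward the first option considered; "
    "reward: +3 health, +1 money, -2 effort, -5 social cost"
)

# Hypothetical degenerate pair that also reproduces the behaviour.
degenerate_pair = "planner: output the observed policy; reward: 0 everywhere"

# The degenerate explanation is "simpler" under this crude measure.
print(complexity(degenerate_pair) < complexity(true_pair))  # True
```

The paper's point is that this failure is not an artifact of a bad complexity measure: reasonable simplicity priors still rank degenerate decompositions ahead of the true one.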