Evaluative vs Recommendation-driven AI: a tale of two friends
What kind of AI can help us make better decisions?
Last week I had the pleasure of hosting a talk by Professor Tim Miller as part of the XAI Seminar Series @ Imperial (check out the website and subscribe if you're interested in all things eXplainable AI). Prof Miller is a professor of Artificial Intelligence at the University of Queensland in Australia, and his research focuses on AI for decision-making and how to make it work.
I found this seminar particularly insightful and thought-provoking, so I am going to tell you a bit about what I took away from it. This is how I would summarise it:
In high-stakes decision-making, we should leverage AI to validate or disprove our hypotheses and intuitions, rather than to get an answer that risks biasing our judgement.
Ok, let me break this down now.
First, context: by “high-stakes decision-making” we are talking about using AI models to help humans make decisions that matter, decisions that may affect other people's lives: a medical diagnosis, or whether to grant credit or parole to an individual.
What do I mean by validate our hypothesis or bias us? Let me give you an example.
I'd like you to meet two friends of mine (actually, they are Prof Miller's friends, but I feel like I've known them for a while, so I'll make them mine too XD): Bluster and Prudence.
Bluster is the confident friend who always tells you what they think you should do, and why. Bluster is often right, but not always. Prudence, instead, will never tell you what to do. They will probe you, ask you questions, then walk you through the pros and cons of your strategy. Prudence makes you think.
Bluster and Prudence embody two paradigms of how to use AI, which Miller calls recommendation-driven AI and Evaluative AI.
On the one hand, Recommendation-driven AI, represented by Bluster, is when the model tells us what to do and why. Think of a medical diagnosis setting, with a model trained to analyse x-rays to recognise a set of pathologies. The model will tell the radiologist that the patient has, say, pneumonia, because there are spots in key locations of the scan.
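If it helps to make this concrete, here is a minimal sketch of what a Bluster-style interface boils down to. To be clear, everything in it is made up for illustration: `model`, `predict_proba` and `saliency_map` are stand-in names, not a real diagnostic system.

```python
# A sketch of a Bluster-style, recommendation-driven interface.
# Hypothetical throughout: the `model` object and its methods are
# stand-ins, not any real diagnostic library.
import numpy as np

CLASSES = ["healthy", "pneumonia", "bronchitis"]  # illustrative labels

def recommend(model, xray: np.ndarray) -> dict:
    """Return the single top diagnosis plus evidence supporting it."""
    probs = model.predict_proba(xray)  # assumed: one probability per class
    top = int(np.argmax(probs))
    return {
        "diagnosis": CLASSES[top],
        "confidence": float(probs[top]),
        # A post-hoc explanation of the chosen answer only, e.g. the
        # "spots in key locations" highlighted on the scan.
        "explanation": model.saliency_map(xray, target=top),
    }
```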
On the other hand, Evaluative AI, represented by Prudence, asks for our own thoughts and then provides pros and cons to help us make a decision. Continuing with the x-ray example, Prudence would present the radiologist with the candidate diagnoses (the ones the model is trained to predict), highlighting the most likely, and for each option show the evidence for and against that diagnosis. Say pneumonia and bronchitis were the most likely: for pneumonia, it would point out that the shape, position and intensity of some spots in the lungs suggest secretions typical of pneumonia, while other small spots, with different characteristics and in other areas of the image, support the bronchitis option (disclaimer: I am no medic!).
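Prudence-style, the very same components get rearranged: instead of one answer, the system returns the top hypotheses, each with evidence for and against, and the radiologist decides. Again, just a sketch on the made-up API above, not Prof Miller's actual implementation.

```python
# The same components rearranged Prudence-style: surface the top
# hypotheses with evidence for AND against each, and let the
# radiologist decide. Reuses the hypothetical `model` API and
# CLASSES from the sketch above.
def evaluate(model, xray, top_k: int = 2) -> list:
    """Return the top-k candidate diagnoses with pro/con evidence."""
    probs = model.predict_proba(xray)
    ranked = np.argsort(probs)[::-1][:top_k]  # e.g. pneumonia, bronchitis
    return [
        {
            "hypothesis": CLASSES[i],
            "likelihood": float(probs[i]),
            # Regions whose features support this diagnosis...
            "evidence_for": model.saliency_map(xray, target=i),
            # ...and regions better explained by the rival hypotheses.
            "evidence_against": [
                model.saliency_map(xray, target=j) for j in ranked if j != i
            ],
        }
        for i in ranked
    ]
```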
We have a model that helps us recognise scenarios, navigate information, and judge the validity and plausibility of our ideas. We have a Prudence-like friend who helps us make sense of the information but leaves it up to us to actually make an informed decision.
So what do I mean by a model “biasing us”?
There are some negative outcomes that can come out of human-AI collaboration. At the two extremes sit automation bias and algorithmic aversion.
Automation bias is when our radiologist trusts the model and is convinced by its explanations without questioning them in any depth: “the machine must be right!”
On the other side, we could have a very wary radiologist who wouldn't accept the model's answer even when it's correct (algorithmic aversion): “I just don't trust these nasty machines!”
On the scale between these two extremes lie over- and under-reliance: cases where the model is incorrect but the radiologist doesn't question the result, and vice versa.
When this happens, the accuracy of a decision can actually decrease as a result of human-AI collaboration, compared to humans or AI alone.
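To see how, here is a toy simulation of my own (not from the talk): on a binary task, a 90%-accurate human who always defers to an 80%-accurate model whenever they disagree ends up exactly as accurate as the model, i.e. worse than they would be alone.

```python
# Toy, self-contained illustration of over-reliance dragging joint
# accuracy below the human's own, on a binary task where the human
# defers to the model on any disagreement. All numbers are made up.
import random

random.seed(0)
N = 100_000
HUMAN_ACC, MODEL_ACC = 0.90, 0.80

correct = 0
for _ in range(N):
    human_right = random.random() < HUMAN_ACC
    model_right = random.random() < MODEL_ACC
    # Binary task: agreement means the same answer; on disagreement
    # the over-reliant human always sides with the model.
    final_right = model_right if human_right != model_right else human_right
    correct += final_right

print(correct / N)  # ~0.80: the pair does no better than the model
                    # alone and strictly worse than the human alone.
```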
At espress.ai, we believe in using data and models to enhance our thinking and investigate the merits of our own ideas. Models should be used responsibly and not just to confirm biases.
So who would be your best friend: Bluster or Prudence?
The Case for Prudence
Professor Miller grounds the merits of his proposed Evaluative AI in established principles of good decision support. He argues that a human following an interactive, machine-in-the-loop process can mitigate problems long observed in human-machine interaction. The hypothesis, backed by some evidence, is that Prudence's Evaluative AI can reduce under- and over-reliance and increase the accuracy of decision-making in human-AI collaboration.
In other words: Evaluative AI is a paradigm for getting the best out of humans and machines working together, where the processing and the decision-making are split between the two.
The Case for Bluster
Of course, this type of AI-aided decision support tool can work only for low-frequency and, arguably, important decisions (e.g. medical diagnosis), for which decision makers are willing to put in the effort to explore and evaluate different options or strategies.
For decisions at scale, verification, scientific discovery and more, a “Bluster” model would still be useful. And of course the two can be blended. An example is Meta's moderation of hate speech on Facebook: at that volume, most decisions are fully automatic (Bluster), but edge cases flagged by the AI are reviewed by expert humans, who make the final call (maybe with the help of a Prudence?).
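In code, such a blend could be as simple as a confidence-based triage. This is my own sketch, with a made-up threshold, reusing the hypothetical `evaluate` from earlier:

```python
# A hypothetical triage: automate confident cases (Bluster), escalate
# uncertain ones to a human with an evaluative view (Prudence).
AUTO_THRESHOLD = 0.95  # made-up confidence cut-off

def triage(model, item):
    probs = model.predict_proba(item)  # assumed model API, as above
    if float(np.max(probs)) >= AUTO_THRESHOLD:
        # Bluster path: act on the model's answer, no human in the loop.
        return {"route": "automatic", "decision": int(np.argmax(probs))}
    # Prudence path: a human reviews the evidence for each candidate.
    return {"route": "human_review", "candidates": evaluate(model, item)}
```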
Shall we let models build the narrative for us, or use models to investigate the merits of our narrative? Our hypothesis? Our thinking?
At espress.ai, we believe that we humans are pretty great at thinking and making decisions, and models can (or should!) only increase our capabilities.
All models should come with a string attached: use responsibly. Use them to shed light on your own ideas, or maybe to create new ones collaboratively, not just to tick boxes and fall into confirmation bias because it's the easiest thing to do.
Now it’s your turn: where do you stand?