The Changing Landscape Of AI Evaluation
As artificial intelligence rapidly advances, how do we assess whether these systems are truly effective, ethical, and safe? Evaluation methods need to evolve beyond straightforward accuracy metrics to address these more complex considerations.
Nothing illustrates this better than the recent disastrous launch of a famous multimodal model. The company’s much-ballyhooed model was tuned so heavily for racial inclusiveness that it generated people of color even when historical accuracy called for white figures: prompts for World War II German soldiers produced images of Black, Asian, and Indigenous soldiers in Nazi uniforms, and even the U.S. founding fathers were depicted as Black. Clearly, evaluation of the model had failed.
In the past, evaluation was simpler – models were built for narrow tasks and could be tested against specific benchmarks using predefined datasets. For image classification, for example, a model was measured on how accurately it labeled objects in images; for language models, accuracy at predicting the next word in a sequence was a common metric. These benchmarks provided a simple numerical score for comparing different models.
However, as AI models now aim to perform broad, open-ended tasks like conversation, evaluation becomes more difficult. Benchmarks designed for narrower AI don’t capture the full range of skills and behaviors we want from general AI systems.
A transportation company’s recent chatbot failure stands as a cautionary tale for others. The chatbot erroneously advised a customer that he could apply for a bereavement fare refund within 90 days of ticket issuance, so he bought a ticket to attend his grandmother’s funeral. When he later requested the refund, the company denied it on the grounds that the chatbot had provided misleading information. A tribunal found that the company had not taken reasonable care to ensure the accuracy of its chatbot and awarded the customer compensation. But the reputational damage far outweighed the money the company had to pay.
There is a growing push to develop new benchmarks that better reflect real-world use cases and test models from multiple angles. For example, HELM (Holistic Evaluation of Language Models) and DecodingTrust are evaluation frameworks for large language models that assess not just accuracy, but dimensions such as toxicity, fairness, and security.
However, there are still gaps between what benchmarks test and how models actually perform in practice. We need more innovation in designing dynamic benchmarks that better reveal model capabilities.
To avoid such problems, a European bank took an extensive testing approach before its AI-powered sales chatbot went live in December 2022. To evaluate the chatbot before launch, it created hundreds of “adversarial chatbots” designed to derail conversations and lead the chatbot astray. These adversarial bots had different personalities and intents, producing thousands of test conversations.
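For a concrete picture of what such adversarial testing can look like, here is a minimal sketch assuming the OpenAI Python SDK. The personas, model name, and the `sales_bot_reply` helper are illustrative assumptions, not the bank’s actual implementation.

```python
# Minimal sketch of adversarial test-conversation generation (illustrative only).
# The personas, model name, and `sales_bot_reply` helper are assumptions.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

ADVERSARIAL_PERSONAS = [
    "an irate customer who tries to get the bot to bad-mouth competitors",
    "a rambling customer who keeps steering the chat toward unrelated financial advice",
    "a prompt injector who asks the bot to ignore its instructions",
]

def adversary_turn(persona: str, history: list[dict]) -> str:
    """Have an LLM play an adversarial customer for one turn."""
    # Flip roles so the adversary model sees the chatbot's messages as the "user".
    flipped = [{"role": "user" if m["role"] == "assistant" else "assistant",
                "content": m["content"]} for m in history]
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[{"role": "system",
                   "content": f"You are {persona}. Try to derail a bank's sales chatbot."}]
                 + flipped,
    )
    return response.choices[0].message.content

def run_test_conversation(persona: str, sales_bot_reply, turns: int = 6) -> list[dict]:
    """Alternate adversary and chatbot turns, returning the full transcript."""
    history: list[dict] = []
    for _ in range(turns):
        history.append({"role": "user", "content": adversary_turn(persona, history)})
        history.append({"role": "assistant", "content": sales_bot_reply(history)})
    return history
```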
It then used the natural language capabilities of OpenAI’s GPT models to classify each of these conversations as having stayed on course or gone astray. Problematic conversations were flagged and reviewed by human experts, and the chatbot’s prompts were then refined with guardrails and rules to prevent those types of responses in the future.
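A simple version of that LLM-as-judge triage step might look like the sketch below; the rubric wording, model name, and JSON schema are assumptions made for illustration.

```python
# Minimal sketch of LLM-based conversation triage (illustrative only).
# The judge rubric, model name, and JSON schema are assumptions.
import json
from openai import OpenAI

client = OpenAI()

JUDGE_RUBRIC = (
    "You review transcripts of a bank's sales chatbot. "
    "Reply in JSON with keys 'off_course' (true or false) and 'reason'."
)

def classify_conversation(transcript: str) -> dict:
    """Ask an LLM judge whether a conversation went off course."""
    response = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": JUDGE_RUBRIC},
            {"role": "user", "content": transcript},
        ],
    )
    return json.loads(response.choices[0].message.content)

def flag_for_review(transcripts: list[str]) -> list[str]:
    """Return the transcripts the judge marks as off course, for human review."""
    return [t for t in transcripts if classify_conversation(t)["off_course"]]
```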
Now that the chatbot is live, this bank continues to use LLMs to classify real customer conversations on a daily basis. These are reviewed for opportunities to improve the chatbot’s ability to convince customers and clearly explain the value proposition. Prompts are iteratively refined based on these reviews.
The bank is also experimenting with different prompt formulations and reinforcement learning to optimize the chatbot against different performance metrics. The goal is to test variations and see which prompts work best for different customer segments.
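In practice, a basic per-segment prompt experiment could be sketched as follows; the segments, prompt variants, and conversion metric are hypothetical.

```python
# Minimal sketch of per-segment prompt A/B testing (illustrative only).
# The segments, prompt variants, and conversion metric are hypothetical.
import random
from collections import defaultdict

PROMPT_VARIANTS = {
    "concise": "You are a concise, factual banking assistant.",
    "benefit-led": "You are a friendly banking assistant who leads with customer benefits.",
}

# (segment, variant) -> list of outcomes, 1 for a sale and 0 otherwise
outcomes: dict[tuple[str, str], list[int]] = defaultdict(list)

def choose_variant(segment: str) -> str:
    """Randomly assign a variant; a production system might use a contextual bandit."""
    return random.choice(list(PROMPT_VARIANTS))

def record_outcome(segment: str, variant: str, converted: bool) -> None:
    outcomes[(segment, variant)].append(int(converted))

def conversion_rate(segment: str, variant: str) -> float:
    results = outcomes[(segment, variant)]
    return sum(results) / len(results) if results else 0.0
```

Over time, the variant with the higher conversion rate for each segment would become that segment’s default prompt.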
For real-time monitoring, the AI looks for triggers to hand the conversation off from the chatbot to a human agent. But most performance evaluation happens in daily batches rather than in real time. This rigorous approach helps ensure the bank’s chatbot has strong conversational ability while avoiding problematic responses, and the continuous iteration aims to optimize the customer experience.
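A hand-off trigger of that kind can be as simple as the check sketched here; the keyword list and sentiment threshold are assumptions, not the bank’s actual rules.

```python
# Minimal sketch of a real-time hand-off trigger (illustrative only).
# The keyword list and sentiment threshold are assumptions.
ESCALATION_KEYWORDS = {"complaint", "lawyer", "fraud", "close my account"}

def should_hand_off(user_message: str, sentiment_score: float) -> bool:
    """Escalate to a human agent on hot-button keywords or very negative sentiment."""
    text = user_message.lower()
    if any(keyword in text for keyword in ESCALATION_KEYWORDS):
        return True
    # sentiment_score is assumed to come from an upstream model, scaled to [-1, 1]
    return sentiment_score < -0.6
```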
Human evaluation is increasingly important. Reinforcement Learning from Human Feedback (RLHF), in which human reviewers compare model responses, can capture subtle preferences that automated scoring misses. The approach remains limited in scope because it is expensive, but advances in training methods that incorporate human feedback may expand the role of human reviewers in evaluation. Ongoing research into Reinforcement Learning from AI Feedback, meanwhile, may eventually automate the task.
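At the core of RLHF is a pairwise comparison: a reward model is trained to score the response a reviewer preferred above the one they rejected. A toy version of that loss, with made-up scores standing in for a real reward model, looks like this:

```python
# Toy illustration of the pairwise preference loss used in RLHF-style reward modeling.
# The scores are made up; a real system would get them from a neural reward model.
import math

def preference_loss(score_preferred: float, score_rejected: float) -> float:
    """Bradley-Terry style loss: small when the preferred response already scores higher."""
    margin = score_preferred - score_rejected
    return -math.log(1.0 / (1.0 + math.exp(-margin)))

# A reviewer compared two chatbot answers to the same prompt and preferred the first.
print(f"loss: {preference_loss(2.3, 0.7):.3f}")  # small loss: the model agrees with the reviewer
print(f"loss: {preference_loss(0.2, 1.9):.3f}")  # large loss: the model disagrees and needs updating
```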
Standard leaderboards currently dominate model evaluation, but diverse assessment methods, data, and interaction partners can offer richer insight into the strengths and weaknesses that static scores hide. The evolution toward broader benchmarks, increased human oversight, and independent auditing points to more robust evaluation better suited to general AI. Understanding model shortcomings through rigorous evaluation helps prevent safety issues.
Initiatives to standardize rigorous evaluation protocols and tools, including third-party systems that enable unbiased outside auditing of AI systems, will bolster accountability and transparency.
Comprehensive, continuous, and collaborative evaluation that looks beyond the numbers, proactively addresses emerging hazards, and centers diverse human values will be instrumental in realizing AI’s immense potential for good.