A new report indicates that current tests and benchmarks for evaluating the safety and responsibility of artificial intelligence may be insufficient. The growing demand for security in generative AI models, capable of analyzing and generating text, images, music, and videos, has led to increased scrutiny due to their propensity for errors and unpredictable behavior. In response, both public sector agencies and large tech companies are proposing new benchmarks to assess the safety of these models.
Late last year, the startup Scale AI established a lab dedicated to evaluating the alignment of models with safety guidelines. This month, the NIST and the UK AI Security Institute launched tools to assess model risk. However, the UK's Ada Lovelace Institute (ALI) has identified that these tests may be inadequate. Their study, which interviewed experts from academic labs and the civil sector as well as model providers, revealed that current evaluations are not exhaustive, can be easily manipulated, and do not necessarily reflect model behavior in real-world scenarios.
Benchmarks and Red Teaming
The ALI study reviewed academic literature to understand the risks and damages associated with current AI models and existing evaluations. They then interviewed 16 experts, including employees from tech companies developing generative AI systems. They found significant disagreement about the best methods and taxonomy for evaluating models.
Some evaluations only tested model performance in the lab, not in real-world scenarios. Others relied on tests designed for research purposes, not production models, yet were still used in production. A significant issue identified was data contamination, where benchmark results can overestimate a model's performance if it has been trained on the same data used for evaluation. Benchmarks are often chosen for convenience rather than being the best evaluation tools.
Mahi Hardalupas, an ALI researcher, notes that "benchmarks risk being manipulated by developers who can train models on the same dataset that will be used to evaluate the model. It also matters which version of a model is being evaluated. Small changes can cause unpredictable changes in behavior."
The study also found issues with "red-teaming," the practice of tasking teams with attacking a model to identify vulnerabilities. Although several companies, like OpenAI and Anthropic, use red-teaming, there are no agreed standards for evaluating its effectiveness. Additionally, forming red teams with the necessary expertise is costly and labor-intensive, presenting barriers for smaller organizations.
Possible Solutions
The pressure to launch models quickly and the reluctance to conduct tests that might identify problems before a release are the main reasons why AI evaluations have not improved. A participant in the ALI study mentioned that evaluating the safety of models is an "intractable" problem.
However, Hardalupas sees a way forward with greater involvement from public sector bodies. She suggests that governments require more public participation in developing evaluations and implement measures to support a "third-party testing ecosystem," including programs to ensure regular access to necessary models and datasets.
Jones advocates for "context-specific" evaluations that analyze how a model may affect different types of users and how attacks could bypass safeguards. He adds that investment in the underlying science of evaluations is needed to develop more robust and repeatable tests based on understanding how an AI model functions.
Nevertheless, Hardalupas warns that there will never be a total guarantee of safety: "Model evaluations can identify potential risks but cannot guarantee that a model is safe."