MLCommons, a nonprofit that builds benchmarks for measuring the performance of artificial intelligence (AI) systems, has introduced a new benchmark called AILuminate to measure the potential harms of large language models. The benchmark scores a model’s responses to more than 12,000 test prompts spanning 12 hazard categories, including inciting violent crime, child sexual exploitation, hate speech, promoting self-harm, and intellectual property infringement. Each model receives an overall grade ranging from “poor” to “excellent” based on how well it handles or mitigates these harmful prompts. The prompts themselves are kept secret so that they cannot leak into training data and allow models to ace the test.
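To make those mechanics concrete, here is a minimal, hypothetical sketch of how per-prompt safety judgments could be aggregated into category and overall grades. Everything in it (the category names, the thresholds, and the aggregation rule) is an assumption made for illustration, not AILuminate’s actual scoring system.

```python
from collections import Counter

# Hypothetical sketch of how a safety benchmark like AILuminate might
# roll per-prompt judgments up into grades. The category names, the
# thresholds, and the aggregation rule are invented for illustration;
# they are not MLCommons' published scoring methodology.

GRADE_THRESHOLDS = [          # fraction of responses judged safe -> grade
    (0.999, "excellent"),
    (0.99, "very good"),
    (0.95, "good"),
    (0.90, "fair"),
    (0.0, "poor"),
]

def grade(safe_fraction: float) -> str:
    """Map a safe-response rate onto the poor-to-excellent scale."""
    for threshold, label in GRADE_THRESHOLDS:
        if safe_fraction >= threshold:
            return label
    return "poor"

def evaluate(judgments):
    """judgments: iterable of (hazard_category, is_safe) pairs, one per
    test prompt, as produced by some safety judge. Returns the overall
    grade plus a per-category breakdown."""
    totals, safes = Counter(), Counter()
    for category, is_safe in judgments:
        totals[category] += 1
        safes[category] += is_safe  # True counts as 1, False as 0
    per_category = {c: grade(safes[c] / totals[c]) for c in totals}
    overall = grade(sum(safes.values()) / sum(totals.values()))
    return overall, per_category

# Toy data for two of the 12 hazard categories.
results = [("violent_crime", True)] * 98 + [("violent_crime", False)] * 2 \
        + [("self_harm", True)] * 100
print(evaluate(results))
# ('very good', {'violent_crime': 'good', 'self_harm': 'excellent'})
```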
Peter Mattson, founder and president of MLCommons and a senior staff engineer at Google, acknowledges that measuring potential AI harms is hard: AI is a young technology, and the discipline of AI testing is younger still. He argues that reliable ways to assess the safety and risks of AI systems matter not only for societal welfare but also for market stability and competitiveness. As AI technologies continue to evolve, the need for reliable, independent evaluation methods will only grow, especially as political landscapes shift.
Shifts in the US government’s approach to AI regulation could make standardized assessment tools like AILuminate all the more relevant. Donald Trump, poised to return to power, has indicated plans to dismantle President Biden’s AI Executive Order, which aimed to ensure that AI technologies are developed and deployed responsibly and which laid the groundwork for an AI Safety Institute tasked with conducting safety evaluations of advanced models. As policymaking continues to shape AI regulation, MLCommons’ benchmark could play a pivotal role in fostering accountability within the industry.
The implications of AILuminate also extend beyond the United States, opening the door to a more global dialogue on AI-related harms. MLCommons’ membership includes international companies, among them the China-based firms Huawei and Alibaba. With these stakeholders engaged, the benchmark could support comparative analysis of AI safety practices across nations, offering a clearer picture of how safety mechanisms are applied worldwide and strengthening collaboration and standardization efforts in a fast-moving field.
Several major US AI firms have already used AILuminate to gauge the safety of their models. Anthropic’s Claude, Google’s smaller Gemma model, and Microsoft’s Phi all earned “very good” ratings, while OpenAI’s GPT-4o and Meta’s largest Llama model received “good” ratings, indicating some room for improvement. The Allen Institute for AI’s OLMo model received the lowest rating, “poor,” which Mattson attributes to its design as a research tool that was not built with safety as a priority. This spread of results illustrates how AILuminate can push AI developers toward continual improvement.
The emergence of AILuminate marks a step toward bringing scientific rigor to the evaluation of AI technologies. Rumman Chowdhury, CEO of Humane Intelligence, a nonprofit focused on testing AI behaviors, agrees that best practices and inclusive measurement methods are essential if AI models are to meet societal expectations and ethical standards. By championing a standardized approach to assessing AI risks, MLCommons could help establish development cycles that prioritize safety alongside innovation, ensuring that these increasingly powerful tools align with the broader needs and values of society.