
Anthropic model subject of first joint evaluation by US, UK AI Safety Institutes 

An updated Claude 3.5 Sonnet underwent the first-ever joint pre-deployment evaluation by the U.S. and U.K. AI safety bodies.
Britain's Science, Innovation and Technology Secretary Michelle Donelan (R) greets U.S. Commerce Secretary Gina Raimondo during the U.K. Artificial Intelligence (AI) Safety Summit at Bletchley Park, in central England, on Nov. 1, 2023. (Photo by JUSTIN TALLIS/AFP via Getty Images)

The U.S. AI Safety Institute shared results Tuesday from its inaugural joint model evaluation with its U.K. counterpart, probing the biological, cyber and software capabilities of Anthropic’s updated Claude 3.5 Sonnet.

The report is an important step for the U.S. AISI, which the Biden administration recently designated as the primary government point of contact for AI developers in the private sector. In addition to being the first joint evaluation with the U.K. AI Safety Institute, it’s also the first evaluation the U.S. AISI has produced.

“This is the most comprehensive government-led safety evaluation of an advanced AI model to date and [a] major milestone in our mission to fuel continued American innovation while preventing malicious use,” Elizabeth Kelly, director of the AISI, said in a LinkedIn post announcing the report.

“We look forward to building on this work and iterating with each exercise to advance the science of AI safety,” Kelly said.


The security implications of AI models, including chemical and biological threats, are a growing concern for the government as the technology rapidly advances. The AISI's role is to help ensure the technology can be used safely and securely through actions like conducting research and safety evaluations.

Among the findings: Sonnet 3.5 performed below human expert baselines on biological research questions but, in some instances, exceeded those baselines when it was provided with bioinformatics tools to assist with its answers, according to a blog post.

The evaluation also found that the model provided “answers that should have been prevented” when tested on jailbreaks, or techniques that coax a model into producing responses it is intended to restrict. According to the post, that finding is consistent with existing research on AI system vulnerabilities. Jailbreak vulnerabilities are a broad concern across AI models because they can be used to manipulate a system into providing nefarious information.

Sonnet 3.5 successfully completed 90% of non-expert-level cyber tasks, outperforming both its predecessor and GPT-4o as references, and succeeded at 36% of cybersecurity apprentice-level tasks. It also performed better than the reference models on software development challenges that human researchers face when improving the quality and speed of an AI model.

“We are pleased to have partnered with the US and UK AI Safety Institutes to test Anthropic’s upgraded 3.5 Sonnet model,” an Anthropic spokesperson said in an emailed statement. “Independent third party testing is critical to advancing the science of AI evaluations, especially to understand how models may impact national security. We look forward to continuing this work together.” 


The AISI, which is housed within the Department of Commerce’s National Institute of Standards and Technology, previously announced voluntary agreements with Anthropic and OpenAI to conduct testing and evaluation of their models both before and after release. Anthropic had also worked with the U.K. AISI ahead of the model’s release.

To conduct the testing, the two safety institutes compared the upgraded Sonnet 3.5’s performance against three references: the previous version of Claude 3.5 Sonnet, OpenAI’s o1-preview and OpenAI’s GPT-4o. According to a blog post about the report, the “comparisons are intended only to assess the relative capability improvements of the upgraded Sonnet 3.5, in order to improve scientific interpretation of evaluation results.”

In a recent interview with FedScoop, Michael Sellitto, head of global affairs at Anthropic, said one reason the company supports the AISI is its ability to bring together government testing and evaluation expertise in areas like biosecurity and cybersecurity.

“One of the things Anthropic’s particularly concerned about are potential catastrophic risks of the technology,” Sellitto said. That includes “how the technology could be misused in a way that impacts national security, and naturally, a lot of the expertise to know what might impact national security is going to reside within government.”

The report comes as the AISI takes the lead on national security-related AI model evaluations in the federal government. President Joe Biden’s national security memo on AI created a central role for the institute. As part of that, the AISI announced Wednesday that it had established a government task force focused on collaborative research and testing of AI models to manage national security risks.


It also comes as the future of the safety body is uncertain, given the change in presidential administrations. While the AISI — or something like it — has bipartisan support, its actions have also faced some Republican criticism. It’s unclear how the incoming Trump administration might approach its future.

FedScoop reporter Rebecca Heilweil contributed to this report.

Written by Madison Alder

Madison Alder is a reporter for FedScoop in Washington, D.C., covering government technology. Her reporting has included tracking government uses of artificial intelligence and monitoring changes in federal contracting. She’s broadly interested in issues involving health, law, and data. Before joining FedScoop, Madison was a reporter at Bloomberg Law where she covered several beats, including the federal judiciary, health policy, and employee benefits. A west-coaster at heart, Madison is originally from Seattle and is a graduate of the Walter Cronkite School of Journalism and Mass Communication at Arizona State University.
