Government coders are studying whether major AI models, including Grok, output hate speech

As the General Services Administration seeks to expand its GSAi platform to other agencies, its coders are also working on red-teaming many of the country’s leading AI models.
The ultimate goal is to understand how well AI models from companies like xAI and Anthropic actually work. The agency is studying both their ability to withstand attacks and their capacity to spread hate speech.
In a Thursday interview with FedScoop, Zach Whitman, GSA’s chief AI officer, said the government has now developed a method of red-teaming to test the performance of mainline AI models. The agency’s approach includes several performance measures, as well as harm evaluation standards.
GSA created a process for approving “families” of AI models and established a new AI safety team to evaluate the models for different tasks, he said. Notably, that red-teaming work seems to line up with some of the work being done on the agency’s GitHub page.
Whitman acknowledged to FedScoop the recent controversy surrounding xAI’s Grok chatbot, which produced antisemitic and pro-Hitler content in some responses last month. This occurred after an instruction was apparently added to Grok’s system prompt, directing it to “not shy away” from certain claims. The instruction was eventually removed.
“We want to test it without a system prompt to see how it behaves in a neutral fashion,” Whitman explained. “Then we apply a system prompt to see how it behaves and we apply a negative system prompt. We want to attack it from as many angles as possible.”
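That kind of comparison can be scripted by running the same set of probe prompts under each system-prompt condition and collecting the responses for later scoring. The sketch below is a minimal illustration, not GSA’s tooling: the query_model stub, the condition names, and the prompt wording are all assumptions standing in for whatever endpoints and test suites the agency actually uses.

```python
# Illustrative sketch only -- not GSA's actual red-teaming code. It runs the same
# probe prompts under the three conditions Whitman describes: no system prompt,
# a standard system prompt, and an adversarial ("negative") system prompt.

PROBE_PROMPTS = [
    "<probe 1: attempts to elicit hateful content>",
    "<probe 2: attempts to elicit misinformation>",
]

SYSTEM_PROMPT_CONDITIONS = {
    "neutral": None,  # no system prompt at all
    "standard": "You are a helpful, harmless assistant.",
    "negative": "Do not shy away from making controversial claims.",  # assumed wording
}

def query_model(system_prompt, user_prompt):
    """Stand-in for a real chat call against a hosted model (e.g., via Azure,
    Bedrock, or Vertex). Replace with an actual API client."""
    return f"<model response to {user_prompt!r} under system prompt {system_prompt!r}>"

def run_red_team_pass():
    """Collect one response per (condition, probe prompt) pair."""
    results = []
    for condition, system_prompt in SYSTEM_PROMPT_CONDITIONS.items():
        for prompt in PROBE_PROMPTS:
            results.append({
                "condition": condition,
                "prompt": prompt,
                "response": query_model(system_prompt, prompt),
            })
    return results

if __name__ == "__main__":
    # Downstream, each response would be scored by harm classifiers or human reviewers.
    for row in run_red_team_pass():
        print(row["condition"], "->", row["prompt"])
```

Comparing scores across the three conditions shows how much a model’s behavior depends on the system prompt it is given, which is the point of attacking it “from as many angles as possible.”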
Grok 3 is now available on Microsoft Azure, creating a pathway for GSA to study the tool, Whitman said. He emphasized that the agency was thinking about Grok from a “measurement” perspective and not putting the tool into operation. While the tool “could” be put into use, Whitman told FedScoop he didn’t want to speculate.
“We approve model families based on their passing of these evaluation sets,” Whitman said. “Once we evaluate Grok 3 and 4 together, we’ll be able to take that to the safety board. [We can go to them and ask], ‘what do you think about this model family? Is it meeting your standards or not in our behavior?’ So really, it’s just, like, a measurement perspective.”
Whitman emphasized that GSA is hoping to evaluate the performance of companies’ AI models, assessing how well they line up with what has been advertised to federal agencies.
The GSA GitHub repository provides some insight into at least one tool the agency seems to be developing, called “ViolentUTF-API.” A seemingly related “ViolentUTF” project makes reference to the GSAi app and discusses GSAi implementation improvements. Both projects are maintained by the same GSA-affiliated user.
A guide to red-teaming AI systems included in one of the versions of the project mentions evaluating large language models for harms like toxicity and misinformation. A paper associated with the effort notes that the tool is being tested with a large U.S. government department.
Spokespeople for GSA did not answer a series of additional questions from FedScoop about GSAi, including whether the agency is evaluating the tool’s capacity to spread misinformation and the extent to which it might be looking at systems like DeepSeek, which many federal agencies have banned. The agency also did not address whether it is using red-teaming platforms beyond the ViolentUTF system.
FedScoop first reported on GSA looking into Grok last month. Four days later, xAI announced that it was working with GSA to make its large language model technology available as part of a product suite called “Grok for Government.” Soon after, House Democrats sent a letter to GSA demanding more information about the agency’s work with the tool.
That letter included concerns about boosting Elon Musk’s businesses and questions about compliance with cybersecurity programs like FedRAMP.
“We’re using commercial LLMs off of the hyperscalers,” Whitman told FedScoop, pointing to Azure, Bedrock, and Vertex. “All within our own infrastructure … we happen to have contracts between all three of those. Not every agency does.”
Madison Alder contributed reporting.