Government coders are studying whether major AI models, including Grok, output hate speech

As the General Services Administration seeks to expand its GSAi platform to other agencies, its coders are also working on red-teaming many of the country’s leading AI models.
The ultimate goal is to understand how well AI models from companies like xAI and Anthropic actually work. The agency is studying both their ability to withstand attacks and their capacity to spread hate speech.
In a Thursday interview with FedScoop, Zach Whitman, GSA’s chief AI officer, said the government has now developed a method of red-teaming to test the performance of mainline AI models. The agency’s approach includes several performance measures, as well as harm evaluation standards.
GSA created a process for approving “families” of AI models and established a new AI safety team to evaluate the models for different tasks, he said. Notably, that red-teaming work seems to line up with some of the work being done on the agency’s GitHub page.
Whitman acknowledged to FedScoop the recent controversy surrounding xAI’s Grok chatbot, which produced antisemitic and pro-Hitler content in some responses last month. This occurred after an instruction was apparently added to Grok’s system prompt, directing it to “not shy away” from certain claims. The instruction was eventually removed.
“We want to test it without a system prompt to see how it behaves in a neutral fashion,” Whitman explained. “Then we apply a system prompt to see how it behaves and we apply a negative system prompt. We want to attack it from as many angles as possible.”
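That kind of comparison can be scripted by running the same set of probe prompts under each system-prompt condition and collecting the responses for later scoring. The sketch below is a minimal illustration, not GSA’s tooling: the query_model stub, the condition names, and the prompt wording are all assumptions standing in for whatever endpoints and test suites the agency actually uses.

```python
# Illustrative sketch only -- not GSA's actual red-teaming code. It runs the same
# probe prompts under the three conditions Whitman describes: no system prompt,
# a standard system prompt, and an adversarial ("negative") system prompt.

PROBE_PROMPTS = [
    "<probe 1: attempts to elicit hateful content>",
    "<probe 2: attempts to elicit misinformation>",
]

SYSTEM_PROMPT_CONDITIONS = {
    "neutral": None,  # no system prompt at all
    "standard": "You are a helpful, harmless assistant.",
    "negative": "Do not shy away from making controversial claims.",  # assumed wording
}

def query_model(system_prompt, user_prompt):
    """Stand-in for a real chat call against a hosted model (e.g., via Azure,
    Bedrock, or Vertex). Replace with an actual API client."""
    return f"<model response to {user_prompt!r} under system prompt {system_prompt!r}>"

def run_red_team_pass():
    """Collect one response per (condition, probe prompt) pair."""
    results = []
    for condition, system_prompt in SYSTEM_PROMPT_CONDITIONS.items():
        for prompt in PROBE_PROMPTS:
            results.append({
                "condition": condition,
                "prompt": prompt,
                "response": query_model(system_prompt, prompt),
            })
    return results

if __name__ == "__main__":
    # Downstream, each response would be scored by harm classifiers or human reviewers.
    for row in run_red_team_pass():
        print(row["condition"], "->", row["prompt"])
```

Comparing scores across the three conditions shows how much a model’s behavior depends on the system prompt it is given, which is the point of attacking it “from as many angles as possible.”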
Grok 3 is now available on Microsoft Azure, creating a pathway for GSA to study the tool, Whitman said. He emphasized that the agency was thinking about Grok from a “measurement” perspective and not putting the tool into operation. While the tool “could” be put into use, Whitman told FedScoop he didn’t want to speculate.
“We approve model families based on their passing of these evaluation sets,” Whitman said. “Once we evaluate Grok 3 and 4 together, we’ll be able to take that to the safety board. [We can go to them and ask], ‘what do you think about this model family? Is it meeting your standards or not in our behavior?’ So really, it’s just, like, a measurement perspective.”
Whitman emphasized that GSA is hoping to evaluate the performance of companies’ AI models, assessing how well they line up with what has been advertised to federal agencies.
The GSA GitHub repository provides some insight into at least one tool the agency seems to be developing, called “ViolentUTF-API.” A seemingly related “ViolentUTF” project makes reference to the GSAi app and discusses GSAi implementation improvements. Both projects are maintained by the same GSA-affiliated user.
A guide to red-teaming AI systems included in one of the versions of the project mentions evaluating large language models for harms like toxicity and misinformation. A paper associated with the effort notes that the tool is being tested with a large U.S. government department.
Spokespeople for GSA did not answer a series of additional questions from FedScoop about GSAi, including whether the agency is evaluating the tool’s capacity to spread misinformation and the extent to which it might be looking at systems like DeepSeek, which many federal agencies have banned. The agency also did not address whether it is using red-teaming platforms beyond the ViolentUTF system.
FedScoop first reported on GSA looking into Grok last month. Four days later, xAI announced that it was working with GSA to make its large language model technology available as part of a product suite called “Grok for Government.” Soon after, House Democrats sent a letter to GSA demanding more information about the agency’s work with the tool.
That letter included concerns about boosting Elon Musk’s businesses and questions about compliance with cybersecurity programs like FedRAMP.
“We’re using commercial LLMs off of the hyperscalers,” Whitman told FedScoop, pointing to Azure, Bedrock, and Vertex. “All within our own infrastructure … we happen to have contracts between all three of those. Not every agency does.”
Madison Alder contributed reporting.