USAi tool lets agencies test for AI biases, GSA official says

David Shive, the GSA’s CIO, says agencies have an “obligation” to “vet” models before implementing them.

September 18, 2025

Listen to this article

0:00

This feature uses an automated voice, which may result in occasional errors in pronunciation, tone, or sentiment.

General Services Administration CIO David Shive speaks during FedTalks in Washington, D.C., on Sept. 18, 2025. (Scoop News Group photo)

Federal agencies can test and measure the biases of the artificial intelligence models they are experimenting with through the new governmentwide AI evaluation tool, USAi, according to David Shive, the General Services Administration’s chief information officer.

“We allow for head-to-head model comparisons so we are actually capturing the telemetry of models, not only user behaviors within models, but also intertechnology and bias behaviors within models,” Shive said on stage at FedScoop’s annual FedTalks event. “We’re expressing out scoring so the agencies can see the effectiveness of those models.”

GSA launched the USAi.gov site last month, giving federal agencies the ability to test leading AI models before procuring them from the normal federal marketplace. The tool builds upon the GSA’s internal chatbot, GSAi, which rolled out internally in March for agency employees.

The evaluation suite currently offers AI models from OpenAI, Amazon, Anthropic, Google, Meta and Microsoft, but Shive noted GSA is exploring adding a handful of other models in the future.

Federal workers can test the models in their everyday workflows, Shive noted, and also check for potential biases in each of these AI systems.

“Technology, especially software, contains multiple biases. It contains the bias of people who wrote the software … and we humans all have biases,” Shive explained in a sideline interview with FedScoop. “That reflects itself in the software products that are built, that reflects itself in models, as well as what [models are] trained on and how they’re trained. Not just the mechanics — the technology mechanics of how they go about their training — but with data sources they’re trained on.”

“We inform the users about which have the highest bias scores and the lowest bias scores so that they can be informed, because our users have an obligation to not just blindly trust what comes out of the market,” Shive said.

“They have an obligation to vet and verify and validate that and this gives them kind of a foundation or baseline,” he added.

The GSA CIO said the USAi tool utilizes a series of controls from the National Institute of Standards and Technology and commercial technology tools for the testing process.

“We apply those — they look like filters but they’re not — those analyses to those models when our users are using the tools,” he said.

When an agency uses the USAi tool, it receives its own copy, including the telemetry gathering analytics capability. While GSA created the USAi tool, Shive emphasized that it does not have access to the telemetry gathering for another agency’s users.

“We intentionally firewall those off so that each agency has no command and control of telemetry gathering and how they react and respond to that,” Shive told FedScoop, adding later that “every agency has an obligation and responsibility to make a risk-free management framework-based decision on what’s acceptable to those agents.”

AI models have come under increasing scrutiny for how biases may influence its outputs. xAI’s Grok chatbot was the center of this controversy earlier this year after it espoused antisemitic and pro-Hitler content in some responses. This occurred after an instruction was apparently added to Grok’s system prompt, directing it to “not shy away” from certain claims. The instructions were eventually removed.

Grok appeared to be under consideration for a major GSA deal, but the tech remains under review.

GSA leaders told FedScoop in an interview last month that they hope the USAi tool will help federal workers build trust in using AI models. Zach Whitman, GSA’s chief AI officer and data officer, said last month that public tools can sometimes provoke fears about working with sensitive materials.

Before the model is made available for testing on USAi, three evaluations take place. The first focuses on safety, like analyzing whether a model outputs hate speech, while the second is based on its performance in answering questions. The third involves red-teaming, or testing a model’s durability.

A GSA official confirmed the safety teams reviewing the report are specific to USAi but welcome agency input.

The GSA, according to Whitman, does not see the USAi tool being used by agencies in the long term and believes market dynamics could eventually take over.