How to use machine translation responsibly in government

The Department of Justice recently issued guidance encouraging federal agencies to use “artificial intelligence and machine translation to communicate with individuals who are limited English proficient.” The memo specifically calls for “responsible use” of these technologies to “produce cost-effective methods for bridging language barriers.”
But the memo provides no details on what “responsible use” means. There’s also little guidance in memo M-25-21 on Accelerating Federal Use of AI — or the revoked Biden AI executive order — regarding responsible use for translation models specifically. This leaves agencies to figure out implementation details on their own.
So, what does responsible AI translation look like?
Responsible use of machine translation requires use-case-specific evaluation both before and after deployment, and it requires acting on those results. Here’s what agencies should do:
Test on your content for your specific use cases. Don’t rely on general model performance benchmarks. Create representative samples of the documents for each of your use cases. A tool that works well for routine correspondence might fail on technical documents.
Compare all your options for each use case. Test multiple machine translation systems against human translators. Have qualified evaluators rate translations. In addition to needing different samples for each use case, you might also need different evaluation criteria or cut-off scores. Evaluation doesn’t just show how often models fail, but also in which circumstances — information that can shape your deployment plans.
Act based on the results. You might find that machine translation works well for some languages but not others, or that using uncertainty scores to flag low-confidence translations for human review works better than an all-or-nothing approach. Confidence scoring is trickier for translation models than for models that output a single label or score, but it’s something you can pursue (see the sketch after this list).
Continue evaluating after deployment. Your documents and needs will evolve. Regular re-evaluation with a subset of new documents ensures the model continues to meet your requirements.
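
To make these steps concrete, here is a minimal sketch, in Python, of how per-use-case evaluation results could drive deployment decisions. It assumes qualified evaluators have already rated candidate systems on representative samples of an agency’s own documents; the use cases, languages, system names, rating scale, and cut-off values below are all hypothetical placeholders, not a prescribed standard.

"""Minimal sketch: turn evaluator ratings into per-use-case deployment decisions.

Assumes evaluators have rated each candidate system on representative samples
of the agency's own documents (illustrative 1-5 adequacy scale). All use cases,
languages, system names, and cut-offs here are hypothetical.
"""
from statistics import mean

# (use_case, language, system) -> evaluator ratings for that sample
ratings = {
    ("benefits_letters", "es", "vendor_a"): [4.6, 4.8, 4.5],
    ("benefits_letters", "es", "vendor_b"): [4.1, 3.9, 4.3],
    ("technical_notices", "vi", "vendor_a"): [3.2, 2.9, 3.4],
    ("technical_notices", "vi", "vendor_b"): [3.0, 3.1, 2.8],
}

# Different use cases can justify different cut-off scores.
cutoffs = {
    "benefits_letters": 4.0,   # routine correspondence
    "technical_notices": 4.5,  # technical content warrants a stricter bar
}

def summarize(ratings, cutoffs):
    """Average the ratings and compare each system to its use case's cut-off."""
    decisions = {}
    for (use_case, language, system), scores in ratings.items():
        avg = mean(scores)
        decisions[(use_case, language, system)] = {
            "mean_rating": round(avg, 2),
            # Below the cut-off, route output to human review instead of
            # publishing the machine translation directly.
            "recommendation": ("deploy MT" if avg >= cutoffs[use_case]
                               else "require human review"),
        }
    return decisions

if __name__ == "__main__":
    for key, decision in summarize(ratings, cutoffs).items():
        print(key, decision)

The specific numbers don’t matter; the structure does. Results come out per use case, per language, and per system, which is exactly the information an all-or-nothing deployment decision discards, and the same summary can be regenerated after deployment as new documents come in.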
The real problem: deployment without evaluation
My primary concern is that agencies will deploy machine translation tools with minimal testing, relying on vendor claims or general benchmarks that may not apply to their specific use cases. The government didn’t have enough technical expertise prior to President Donald Trump’s inauguration, and it’s only gotten worse since. And even when contractors do the work, federal civilian employees are ultimately responsible for selecting vendors, writing contract language, and deciding which deliverables to request.
During my time as an AI/ML engineer at the Department of Homeland Security’s AI Corps — where I drafted a guide on test and evaluation for AI/ML models — I saw tremendous variance across government, from agencies with well-informed evaluation plans that asked all the right questions about model performance to teams that couldn’t do proper evaluation because the government wasn’t giving them access to post-deployment data. I expect implementation of the DOJ guidance will be similarly inconsistent — some agencies will conduct thorough evaluations while others will deploy tools with minimal testing.
More rules aren’t the answer, either
But to the skeptics of machine translation, I also have a message: Even now, agencies aren’t always choosing between machine translation and good human translation. In some cases, they’re choosing between unevaluated machine translation and no translation at all, where documents either don’t get translated or aren’t translated in a timely way.
For instance, at a federal agency I worked with, employees were using Google Translate for somewhat sensitive material. A secure machine translation tool with even basic evaluation would have solved the security problem and also shown which types of documents were fine to machine translate and which needed human review.
Additionally, evaluation standards should be driven by use-case requirements, not by the technology being used. While a focus on AI evaluation is important, the same attention to performance standards should apply whether the work is done by humans, by machines, or through hybrid approaches.
Finally, the lack of specific implementation guidance from both this White House and the previous one is appropriate. More prescriptive requirements wouldn’t solve the underlying capacity problems — but they would be interpreted in some agencies in the most restrictive ways possible. For instance, a White House memo requiring evaluation of machine translation models could easily be interpreted at lower levels as requiring that this evaluation happen prior to allowing technologists to so much as install an open-source language model and try it out. And meanwhile, documents aren’t getting translated at all — or they’re getting sent to Google Translate.
So even as an evaluation advocate, I don’t see the guidance’s vagueness as the problem. When agencies stumble on machine translation deployment, it will be because of a lack of institutional capacity, and that wouldn’t be fixed by a directive to do evaluation.
Abigail Haddad is a former artificial intelligence/machine learning engineer with the Department of Homeland Security’s AI Corps.