How to use machine translation responsibly in government

The Department of Justice recently issued guidance encouraging federal agencies to use “artificial intelligence and machine translation to communicate with individuals who are limited English proficient.” The memo specifically calls for “responsible use” of these technologies to “produce cost-effective methods for bridging language barriers.”
But the memo provides no details on what “responsible use” means. There’s also little guidance in memo M-25-21 on Accelerating Federal Use of AI — or the revoked Biden AI executive order — regarding responsible use for translation models specifically. This leaves agencies to figure out implementation details on their own.
So, what does responsible AI translation look like?
Responsible use of machine translation requires use-case-specific evaluation both before and after deployment, and it requires acting on those results. Here’s what agencies should do:
Test on your content for your specific use cases. Don’t rely on general model performance benchmarks. Create representative samples of the documents for each of your use cases. A tool that works well for routine correspondence might fail on technical documents.
Compare all your options for each use case. Test multiple machine translation systems against human translators. Have qualified evaluators rate translations. In addition to needing different samples for each use case, you might also need different evaluation criteria or cut-off scores. Evaluation doesn’t just show how often models fail, but also in which circumstances — information that can shape your deployment plans.
Act based on the results. You might find that machine translation works well for some languages but not others, or that using uncertainty scores to flag low-confidence translations for human review works better than an all-or-nothing approach. Confidence scoring is trickier for translation models than for models that output a single label or score, but it’s something you can pursue (see the sketch after this list).
Continue evaluating after deployment. Your documents and needs will evolve. Regular re-evaluation with a subset of new documents ensures the model continues to meet your requirements.
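
To make these steps concrete, here is a minimal sketch, in Python, of how per-use-case evaluation results could drive deployment decisions. It assumes qualified evaluators have already rated candidate systems on representative samples of an agency’s own documents; the use cases, languages, system names, rating scale, and cut-off values below are all hypothetical placeholders, not a prescribed standard.

"""Minimal sketch: turn evaluator ratings into per-use-case deployment decisions.

Assumes evaluators have rated each candidate system on representative samples
of the agency's own documents (illustrative 1-5 adequacy scale). All use cases,
languages, system names, and cut-offs here are hypothetical.
"""
from statistics import mean

# (use_case, language, system) -> evaluator ratings for that sample
ratings = {
    ("benefits_letters", "es", "vendor_a"): [4.6, 4.8, 4.5],
    ("benefits_letters", "es", "vendor_b"): [4.1, 3.9, 4.3],
    ("technical_notices", "vi", "vendor_a"): [3.2, 2.9, 3.4],
    ("technical_notices", "vi", "vendor_b"): [3.0, 3.1, 2.8],
}

# Different use cases can justify different cut-off scores.
cutoffs = {
    "benefits_letters": 4.0,   # routine correspondence
    "technical_notices": 4.5,  # technical content warrants a stricter bar
}

def summarize(ratings, cutoffs):
    """Average the ratings and compare each system to its use case's cut-off."""
    decisions = {}
    for (use_case, language, system), scores in ratings.items():
        avg = mean(scores)
        decisions[(use_case, language, system)] = {
            "mean_rating": round(avg, 2),
            # Below the cut-off, route output to human review instead of
            # publishing the machine translation directly.
            "recommendation": ("deploy MT" if avg >= cutoffs[use_case]
                               else "require human review"),
        }
    return decisions

if __name__ == "__main__":
    for key, decision in summarize(ratings, cutoffs).items():
        print(key, decision)

The specific numbers don’t matter; the structure does. Results come out per use case, per language, and per system, which is exactly the information an all-or-nothing deployment decision discards, and the same summary can be regenerated after deployment as new documents come in.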
The real problem: deployment without evaluation
My primary concern is that agencies will deploy machine translation tools with minimal testing, relying on vendor claims or general benchmarks that may not apply to their specific use cases. The government didn’t have enough technical expertise prior to President Donald Trump’s inauguration, and it’s only gotten worse since. And even when contractors do the work, federal civilian employees are ultimately responsible for selecting vendors, writing contract language, and deciding which deliverables to request.
During my time as an AI/ML engineer at the Department of Homeland Security’s AI Corps — where I drafted a guide on test and evaluation for AI/ML models — I saw tremendous variance across government, from agencies with well-informed evaluation plans that asked all the right questions about model performance to teams that couldn’t do proper evaluation because the government wasn’t giving them access to post-deployment data. I expect implementation of the DOJ guidance will be similarly inconsistent — some agencies will conduct thorough evaluations while others will deploy tools with minimal testing.
More rules aren’t the answer, either
But to the skeptics of machine translation, I also have a message: Even now, agencies aren’t always choosing between machine translation and good human translation. In some cases, they’re choosing between unevaluated machine translation and no translation at all, where documents either don’t get translated or aren’t translated in a timely way.
For instance, at a federal agency I worked with, employees were using Google Translate for somewhat sensitive material. A secure machine translation tool with even basic evaluation would have solved the security problem and also shown which types of documents were fine to machine translate and which needed human review.
Additionally, evaluation standards should be driven by use-case requirements, not by the technology being used. While a focus on AI evaluation is important, the same attention to performance standards should apply whether the work is done by humans, by machines, or through hybrid approaches.
Finally, the lack of specific implementation guidance from both this White House and the previous one is appropriate. More prescriptive requirements wouldn’t solve the underlying capacity problems — but they would be interpreted in some agencies in the most restrictive ways possible. For instance, a White House memo requiring evaluation of machine translation models could easily be interpreted at lower levels as requiring that this evaluation happen prior to allowing technologists to so much as install an open-source language model and try it out. And meanwhile, documents aren’t getting translated at all — or they’re getting sent to Google Translate.
So even as an evaluation advocate, I don’t see the guidance’s vagueness as the problem. When agencies stumble on machine translation deployment, it will be because of a lack of institutional capacity, and that wouldn’t be fixed by a directive to do evaluation.
Abigail Haddad is a former artificial intelligence/machine learning engineer with the Department of Homeland Security’s AI Corps.