White House aims to answer WHO’s coronavirus questions using natural language processing

The Office of Science and Technology Policy released a database of about 29,000 scholarly articles for the tech community to mine for answers.

By Dave Nyczepir

March 16, 2020

The White House Office of Science and Technology Policy released the most extensive collection of machine-readable coronavirus literature Monday for data and text mining to answer scientists’ most pressing questions.

About 29,000 articles — 13,000 of which are full text — now exist within the updatable COVID-19 Open Research Dataset (CORD-19), which grows by more than 100 documents a week.

The World Health Organization identified key questions about COVID-19 that the global research community is trying to answer. And the National Academies of Sciences, Engineering and Medicine narrowed those queries down to the ones data scientists can answer using natural language processing on the dataset.

“For us, we’re posing important questions,” said Michael Kratsios, U.S. chief technology officer, during a press briefing Monday. “And we’re able to extract answers from this large database.”

Last Wednesday, the White House held a phone call with the tech industry about forming a loose partnership to seek artificial intelligence breakthroughs in coronavirus response. The database was first demonstrated to companies on that call.

Once the data mining is fine-tuned, answers to questions about COVID-19 incubation, transmission, therapeutics and vaccines can be relayed to drugmakers, policy experts and government officials, Kratsios said.

The National Library of Medicine at the National Institutes of Health made more than 10,000 articles related to the coronavirus family of syndromes — which includes COVID-19, SARS and MERS — available from its PubMed Central digital archive of biomedical journals.

Those articles were incorporated into a larger database with preprint content — literature that hasn’t yet been peer-reviewed and published — from Cold Spring Harbor Laboratory’s BioRxiv server funded by the Chan Zuckerberg Initiative. Preprint content reaches the global research community about 100 days faster than peer-reviewed articles, which is important when faced with “tight” COVID-19 response timelines, said Alex Wade, technical program manager for Meta at CZI.

“What we learn from this current situation should influence how the scientific community shares research information going forward,” Wade said.

Microsoft lent its experience indexing and mapping global scientific literature, with Chief Scientific Officer Eric Horvitz noting Monday how hard it typically is for data scientists to gain machine-readable rights to such articles. And Georgetown University’s Center for Security and Emerging Technology coordinated the entire collection.

The Allen Institute for AI transformed the article text into a machine-readable JSON format for data analysis, and its free Semantic Scholar academic discovery engine pulls research relevant to individual scientists.
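
To give a rough sense of what mining a machine-readable JSON article could look like, here is a minimal Python sketch. The record layout (`paper_id`, `metadata`, `body_text`, `section`, `text`), the sample text and the `find_passages` helper are assumptions made up for illustration, not a schema or tool described in this article, and simple keyword matching stands in for the more capable natural language processing researchers would actually use.

```python
import json

# Hypothetical record: these field names and sentences are assumptions for
# illustration only, not a schema or content confirmed by the article.
SAMPLE = json.dumps({
    "paper_id": "example-0001",
    "metadata": {"title": "Incubation period of a novel coronavirus"},
    "body_text": [
        {"section": "Results",
         "text": "The median incubation period was estimated from case reports."},
        {"section": "Discussion",
         "text": "Transmission appeared to occur through close contact."},
    ],
})


def find_passages(raw_json, keywords):
    """Return (section, text) pairs whose paragraph text mentions any keyword."""
    record = json.loads(raw_json)
    wanted = [k.lower() for k in keywords]
    hits = []
    # Only body paragraphs are searched; the title is ignored in this sketch.
    for para in record.get("body_text", []):
        text = para.get("text", "")
        if any(k in text.lower() for k in wanted):
            hits.append((para.get("section", ""), text))
    return hits


if __name__ == "__main__":
    for section, text in find_passages(SAMPLE, ["incubation"]):
        print(f"[{section}] {text}")
```

A case-insensitive substring search like this only surfaces candidate passages. Answering a question such as how long incubation lasts would still require reading or further modeling of the text that is returned.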

All of this took place starting March 13, said Lynne Parker, U.S. deputy CTO.

Now it’s up to scientists and data scientists to share any AI tools they develop and answers to WHO’s questions they find with the CORD-19 Kaggle community. Owned by Google Cloud, the Kaggle platform allows for the global sharing of machine learning tools and insights among researchers.

OSTP created 10 broad tasks for researchers on Kaggle, based on WHO’s questions, in order to maximize the insights provided from AI analyzing the database:

What do we know about virus genetics, origin, and evolution?
What is known about transmission, incubation, and environmental stability?
What has been published about medical care?
What do we know about COVID-19 risk factors?
What do we know about vaccines and therapeutics?
What has been published about ethical and social science considerations?
What do we know about diagnostics and surveillance?
What has been published about information sharing and inter-sectoral collaboration?
What do we know about non-pharmaceutical interventions?
How does geography affect virality?

Within each task is a subset of questions more directly related to WHO’s queries. Kaggle is offering a $1,000-per-task award to the submissions that best meet evaluation criteria. Winners can receive the award as a monetary payment or as a charitable donation to COVID-19 relief and research efforts.

OSTP and its partners stressed that the dataset does not contain any personally identifiable information.

“We’re not making medical records and so forth available from this initiative,” said Jerry Sheehan, deputy director of NLM.
