NIH’s COVID-19 data enclave continues to evolve with the virus

The institutes hope to use prototype technology to link their dataset with outside sources while preserving patient privacy in the coming months.

Technology linking patient records across data sources while preserving their privacy is being prototyped by the National Institutes of Health as researchers attempt to understand the evolving COVID-19 virus and its variants.

The National Center for Advancing Translational Sciences within NIH launched the largest COVID-19 dataset in the U.S., the National COVID Cohort Collaborative (N3C) Data Enclave, in April. Now NCATS wants to use privacy-preserving record linkage (PPRL) to link data from its enclave with medical images, omics tools, electronic health records (EHRs), and social determinants of health to answer researchers’ open questions, such as why COVID-19 symptoms linger in some patients.

PPRL finds and links records on the same patient across independently maintained data sources, using a cryptographic hash value in place of identifying details to protect the patient’s identity.
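The hash-based matching described above can be illustrated with a minimal Python sketch. This is not the vendor's method, which the article does not detail: the keyed hash (HMAC-SHA256), the choice of name and birth date as inputs, the `SITE_KEY` value, and the `make_token` and `link_records` helpers are all assumptions made for illustration.

```python
import hashlib
import hmac

# Hypothetical shared secret. Real deployments would manage keys far more carefully.
SITE_KEY = b"example-shared-secret"


def make_token(first, last, dob, key=SITE_KEY):
    """Return a keyed hash of normalized identifiers (field choice is an assumption)."""
    # Normalize so trivial formatting differences do not change the token.
    normalized = "|".join(part.strip().lower() for part in (first, last, dob))
    return hmac.new(key, normalized.encode("utf-8"), hashlib.sha256).hexdigest()


def link_records(source_a, source_b):
    """Pair records from two sources whose tokens match; no names are compared."""
    by_token = {rec["token"]: rec for rec in source_b}
    return [
        (rec, by_token[rec["token"]])
        for rec in source_a
        if rec["token"] in by_token
    ]


# Each source tokenizes its own records before sharing them (made-up sample data).
ehr = [{"token": make_token("Ana", "Diaz", "1980-02-01"), "lab": "PCR positive"}]
imaging = [
    {"token": make_token(" ana", "DIAZ ", "1980-02-01"), "image": "chest CT"},
    {"token": make_token("Bo", "Lee", "1975-07-30"), "image": "chest X-ray"},
]

linked = link_records(ehr, imaging)
for ehr_rec, img_rec in linked:
    print(ehr_rec["lab"], "<->", img_rec["image"])
```

Because both sources normalize and hash the same fields with the same key, records for the same person produce the same token, while the linked output carries only the token and clinical fields.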

“Combining the EHR data with prospective studies and COVID clinics is going to be really important to be able to follow people over time, do specific interventions and try to tease out the differences in these diseases,” Dr. Ken Gersing, director of informatics at NCATS, told FedScoop. “What we’re now calling ‘long COVID’ is surely a syndrome of groups of many different illnesses, rather than one particular illness.”


Multimodal analytics being implemented now will give researchers the ability to look at patient images with their lab results, but some of the data sources NCATS wants to link to the N3C Enclave are maintained by other agencies like the Centers for Medicare & Medicaid Services.

PPRL respects data ownership by temporarily linking datasets in a neutral, high-performance computing area long enough for researchers to complete their work. Duplicate information is eliminated in the process.

NCATS still has hurdles to clear before PPRL goes live, ideally in two to five months, Gersing said. PPRL needs to be financed, legal barriers must be navigated and there’s a question of how to truly de-identify data from omics tools.

NIH announced funding for its institutes and centers (ICs) to research long COVID using PPRL in late January, going so far as to contract with two vendors. Datavant is handling the PPRL technology, while Regenstrief Group agreed to serve as the honest data broker for matching records.

“We, as the holders of the data, don’t want to also be the linkage group for the patients’ benefit, for the institutions’ benefit and for our benefit also — that there’s no conflict of interest and for preserving privacy,” Gersing said.


Appointing a data broker further allows researchers to ask COVID-19 patients to participate in potential studies. Researchers flag hashes of interest for the broker, which asks the local institution where each hash originated to decrypt it for the purpose of reaching out. That way patient identities remain with local institutions alone.
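The recontact flow can be sketched in Python as follows. The `LocalInstitution` and `Broker` classes, the token strings, and the patient identifiers are hypothetical, and re-identification is modeled as a simple lookup table held at each site; the article does not say how the real system implements this step.

```python
from collections import defaultdict


class LocalInstitution:
    """Holds the only mapping from tokens back to patient identities."""

    def __init__(self, name, token_to_patient):
        self.name = name
        self._token_to_patient = dict(token_to_patient)  # stays on site

    def contact_patients(self, tokens):
        """Resolve flagged tokens locally and return only a count of patients reached."""
        matched = [self._token_to_patient[t] for t in tokens if t in self._token_to_patient]
        # Outreach to the matched patients would happen here, inside the institution.
        return len(matched)


class Broker:
    """Routes flagged tokens to their originating site without seeing identities."""

    def __init__(self, token_origin, institutions):
        self._token_origin = dict(token_origin)  # token -> institution name
        self._institutions = {inst.name: inst for inst in institutions}

    def route(self, flagged_tokens):
        per_site = defaultdict(list)
        for token in flagged_tokens:
            site = self._token_origin.get(token)
            if site is not None:  # tokens of unknown origin are skipped
                per_site[site].append(token)
        return {
            site: self._institutions[site].contact_patients(tokens)
            for site, tokens in per_site.items()
        }


# Made-up tokens and patient IDs for illustration only.
site_a = LocalInstitution("Hospital A", {"tok1": "patient-001", "tok2": "patient-002"})
site_b = LocalInstitution("Hospital B", {"tok3": "patient-101"})
broker = Broker(
    {"tok1": "Hospital A", "tok2": "Hospital A", "tok3": "Hospital B"},
    [site_a, site_b],
)

# A researcher flags tokens of interest; only per-site counts come back.
result = broker.route(["tok1", "tok3", "tok2", "tok-unknown"])
print(result)
```

In this sketch the researcher and the broker handle only tokens and counts, which mirrors the article's point that patient identities stay with the local institutions.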

About 1,900 researchers from nearly 300 institutions were working in the N3C Data Enclave, which contained data from about 800,000 COVID-19 patients as of March. ICs like the National Heart, Lung, and Blood Institute and the National Institute of Child Health and Human Development; agencies like the Food and Drug Administration and the Agency for Healthcare Research and Quality; and companies like Pfizer and IBM all use the enclave.

While these institutions generally consider each other competitors, NIH agreed to harmonize their datasets and make them available to all under rules that prohibit reselling, re-identifying or downloading the data, or using it for non-COVID research.

The N3C Data Enclave is a Palantir analytics platform with three subsets: synthetic, de-identified and limited datasets. Researchers must request access from a Data Access Committee of federal officials, which may or may not grant it.

Only the limited dataset, the hardest to obtain access to, contains true dates and ZIP codes. Meanwhile the synthetic dataset, the easiest to access, is a pilot in itself.


“If we can prove that the computer-generated data, modeled off of the limited dataset, is truly equivalent scientifically and privacy-wise, then there’s no reason this data can’t be shared across the world,” Gersing said. “Just put it out there as a file.”

NCATS paid for all the technical infrastructure, which normally researchers have to spend a portion of their grant money on, so they could focus on answering questions like: What medications alleviate COVID-19 symptoms better depending on case severity? And what variables can doctors use to predict how sick a hospital patient will likely get for resource and treatment planning purposes?

The Johnson & Johnson, Moderna and Pfizer vaccines have special RxNorm numbers in EHRs that will help N3C researchers study their efficacy over time.

NCATS’s data enclave is a Federal Risk and Authorization Management Program-certified environment that also requires dual authentication to access. The center’s security office monitors the enclave and also has an outside federal group run penetration tests, though it hasn’t really run into a nefarious actor to date, Gersing said.

“If this data ever got out of the enclave, it would shut down a very valuable resource,” he said. “I’m not saying it’s job one, but it sure is close.”
