Is anonymous big data really possible? Yes, says ITIF

June 16, 2014

Big data’s reputation has taken quite a beating during the last year. Between the revelations about the National Security Agency’s massive data mining operations and the realization that big companies, like Google and Facebook, can predict our future purchases with shocking accuracy, big data is increasingly viewed as a big threat to privacy.

One of the main factors behind big data’s bad reputation is the threat that hackers, or even the government, can use the growing number of datasets — regardless of how much personally identifiable information has been stripped out by those who collect and store it — to reconnect the dots and identify individuals.

But these fears are unfounded, according to the Information Technology and Innovation Foundation. In a new white paper released Monday, ITIF Senior Analyst Daniel Castro and co-author Ann Cavoukian, the Ontario information and privacy commissioner, argue that big data can be made anonymous if the proper methods of de-identification are used.

“Properly applied, de-identification of data is an effective tool to protect privacy, while allowing for the analysis and use of information to improve numerous aspects of society,” ITIF said in a statement. “Unfortunately, a number of advocates have taken to perpetuating the myth that individual identities cannot be completely stripped out of datasets and have argued that this is reason enough to slow development and use of data analytics. The perpetuation of this myth has the potential to adversely impact the continued evolution of the data economy while also inhibiting efforts to improve health care, public safety and community development.”

The collection and analysis of large datasets hold great promise for everything from enabling new technical innovations to improving public safety and understanding changes in the environment. The real value, however, comes from the ability to analyze information contained in different datasets, often collected for vastly different reasons. But if organizations and governments are to make use of the various datasets that now exist or are coming into existence, they will need the ability to remove personally identifiable information while maintaining the data’s usefulness.

“Data innovation is transforming numerous aspects of society from health care to education, and privacy concerns need to be balanced with the public benefits the enhanced use of data provides,” Castro said in a statement. “De-identification is a useful tool for maintaining this balance and it is my hope this report will address unnecessary fears and help expand and improve the use of these techniques moving forward.”

Among the major misperceptions about de-identification that the white paper attempts to dispel is the notion that re-identification can occur with any dataset, regardless of how much personally identifiable information has been removed.

“What is most disturbing about this assertion and its attempt to grab headlines with sensationalist assumptions is that policy makers who require accurate information to determine appropriate rules and regulations may be unduly swayed,” the white paper states. “In the same way that locking the doors and windows to one’s home reduces the risk of unwanted entry but is not a 100 percent guarantee of safety, so too does de-identification, properly applied, protect the privacy of individuals without guaranteeing anonymity 100 percent of the time.”

Castro and Cavoukian point to the U.S. Heritage Health Prize claims dataset as an example of how de-identification, if conducted properly, can work. The HHP was a global data-mining competition to predict the number of days patients would be hospitalized in the subsequent year by using current and previous years’ claims data. The core dataset consisted of three years of de-identified demographic and claims data on 113,000 patients.

Experts applied several de-identification techniques to the data to ensure the privacy of the patients, including:

Replacing direct identifiers with irreversible pseudonyms;
Removing uncommonly high values in the dataset (top-coding);
Truncating the number of claims per patient;
Removing high risk patients and claims; and
Suppressing provider, vendor, and primary-care provider identifiers, where patterns of treatment were discoverable.
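A few of the techniques listed above can be sketched in code. The following is a minimal illustration only, not the method used for the Heritage Health Prize dataset: the record fields, the keyed-hash pseudonym scheme, the cap of 14 days and the limit of two claims per patient are all assumptions made for the example. Top-coding is shown here as capping a value at a threshold, which is one common way of handling uncommonly high values.

```python
import hashlib
import hmac
import secrets
from collections import defaultdict


def pseudonymize(patient_id: str, key: bytes) -> str:
    """Replace a direct identifier with a keyed hash (hypothetical scheme).

    If the key is discarded after processing, the pseudonym cannot be
    recomputed from the original identifier.
    """
    digest = hmac.new(key, patient_id.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:16]


def top_code(value: int, cap: int) -> int:
    """Cap uncommonly high values at a threshold (assumed cap)."""
    return min(value, cap)


def deidentify(claims, key, max_claims=2, stay_cap=14):
    """Apply pseudonyms, top-coding, claim truncation and suppression."""
    kept = defaultdict(int)
    released = []
    for claim in claims:
        pseudonym = pseudonymize(claim["patient_id"], key)
        # Truncate the number of claims released per patient.
        if kept[pseudonym] >= max_claims:
            continue
        kept[pseudonym] += 1
        # The provider identifier is suppressed by never copying it over.
        released.append({
            "patient": pseudonym,
            "days_in_hospital": top_code(claim["days_in_hospital"], stay_cap),
        })
    return released


# Made-up sample records, for illustration only.
key = secrets.token_bytes(32)
claims = [
    {"patient_id": "A-001", "provider_id": "P9", "days_in_hospital": 3},
    {"patient_id": "A-001", "provider_id": "P9", "days_in_hospital": 40},
    {"patient_id": "A-001", "provider_id": "P2", "days_in_hospital": 1},
    {"patient_id": "B-002", "provider_id": "P2", "days_in_hospital": 5},
]
released = deidentify(claims, key)
```

In this sketch the third claim for the first patient is dropped, the 40-day stay is capped at 14, and no provider identifier appears in the released records.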

Researchers also studied the likelihood of an attacker using additional datasets to re-identify the anonymous patients in the competition database. The types of attacks considered included a “nosey neighbor adversary,” matching against voter registration lists and matching against the state inpatient database.

“Based on this empirical evaluation, it was estimated that the probability of re-identifying an individual was .0084. In other words, at most, an attacker could only hope to re-identify less than 1 percent of the individuals in the dataset,” the white paper states. “This study demonstrated that use of proper de-identification tools that involve re-identification risk measurement techniques makes it extremely unlikely that an individual in a de-identified dataset will ever be re-identified.”
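To put the quoted probability in concrete terms, the arithmetic can be worked through as below. This is a rough sketch that assumes the .0084 figure applies uniformly as an upper bound to each of the 113,000 patients in the core dataset; the white paper's own risk model may be more nuanced, and the function name is hypothetical.

```python
def expected_reidentified(n_records: int, risk: float) -> float:
    """Expected number of re-identified records under a uniform per-record risk."""
    return n_records * risk


hhp_patients = 113_000  # patients in the core dataset, per the article
hhp_risk = 0.0084       # estimated re-identification probability, per the white paper

share_pct = hhp_risk * 100
expected = expected_reidentified(hhp_patients, hhp_risk)

print(f"Share of patients at risk: {share_pct:.2f} percent")
print(f"Upper-bound expected re-identifications: about {expected:.0f}")
```

Under that assumption, the bound works out to less than 1 percent of the dataset, consistent with the statement quoted above.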

Although critics of de-identification often point to a 2008 study that showed researchers were able to re-identify Netflix users in an anonymous dataset by matching the data to the Internet Movie Database, Castro and Cavoukian argue it is important to note that the researchers were able to identify only two out of the 480,189 Netflix users in the dataset.

“Here again, it is the statistical outliers that are most at risk of re-identification: the likelihood of re-identification goes up significantly for users who had rated a large number of unpopular movies,” the white paper states. “Moreover, Netflix users who had not publicly rated movies in IMDb had no risk of re-identification.”
