Editor's note: Joe Hopper is president of Chicago-based Versta Research.
Traditional methods of stripping personally identifiable information (PII) from data sets are no longer sufficient to keep data anonymous. Researchers have shown that with just a handful of data points, sophisticated new algorithms can re-identify individuals in many “anonymous” data sets with alarming accuracy. As a result, how you handle and manage data has become far more important than trying to anonymize it.
Keeping PII out of data
In our company we nearly always try to keep data collection anonymous. This means that for most of our surveys we intentionally do not know the identity of participants (though our sample providers do). We rarely ask for any type of identifying information, not even a first name. If for some reason we have identifiers (like when we work with client customer lists), we use keyed IDs and strip all PII out of our primary data for analysis.
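As a rough illustration of what keyed IDs look like in practice, here is a minimal Python sketch. The column names, the salt handling and the pandas workflow are assumptions for illustration only, not a description of our production process: it derives a keyed ID from each respondent’s e-mail address, splits the ID-to-e-mail key file out on its own and drops every PII column from the analysis data.

```python
import hashlib

import pandas as pd

# Hypothetical PII columns on a client customer list (illustrative names).
PII_COLUMNS = ["first_name", "last_name", "email", "phone"]


def keyed_id(email: str, salt: str) -> str:
    """Derive a stable, non-reversible ID from an identifier plus a secret salt."""
    return hashlib.sha256((salt + email.lower()).encode("utf-8")).hexdigest()[:12]


def strip_pii(raw: pd.DataFrame, salt: str):
    """Split a raw file into (analysis_data, key_file).

    The analysis data keeps only the keyed ID plus survey responses; the key
    file, which maps IDs back to e-mail addresses, is stored separately under
    tighter access controls.
    """
    data = raw.copy()
    data["respondent_id"] = data["email"].map(lambda e: keyed_id(e, salt))
    key_file = data[["respondent_id", "email"]]
    analysis = data.drop(columns=PII_COLUMNS)
    return analysis, key_file


# Example with a single made-up record.
raw = pd.DataFrame([{
    "first_name": "Ann", "last_name": "Lee", "email": "ann@example.com",
    "phone": "555-0100", "q1_satisfaction": 4,
}])
analysis, key_file = strip_pii(raw, salt="project-secret-salt")
print(list(analysis.columns))  # ['q1_satisfaction', 'respondent_id']
```

The key file is what lets you merge or re-contact records later; it should live somewhere more restricted than the analysis data itself.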
Anonymizing may not work
But no matter what we do, new research from computer scientists in the U.K. and Belgium has shown that a lot of anonymized data is not so anonymous. If there is enough information in the data set, new algorithms can piece together various tidbits and correctly identify the specific people from whom the data was derived as much as 99.98% of the time.
Even Public Use Microdata Samples from the Census (that’s the PUMS Census data we use and analyze all the time) are vulnerable and should no longer be considered anonymous.
Here is how the researchers summarize their findings:
While rich medical, behavioral and socio-demographic data are key to modern data-driven research, their collection and use raise legitimate privacy concerns. Anonymizing data sets through de-identification and sampling before sharing them has been the main tool used to address those concerns. We here propose a generative copula-based method that can accurately estimate the likelihood of a specific person to be correctly re-identified, even in a heavily incomplete data set. On 210 populations, our method obtains AUC scores for predicting individual uniqueness ranging from 0.84 to 0.97, with low false-discovery rate. Using our model, we find that 99.98% of Americans would be correctly re-identified in any data set using 15 demographic attributes. Our results suggest that even heavily sampled anonymized data sets are unlikely to satisfy the modern standards for anonymization set forth by GDPR and seriously challenge the technical and legal adequacy of the de-identification release-and-forget model.
Yikes. We’ve always known this is possible with small subsets of people. That is why reputable research firms will not give drill-down data cuts of your customers or employees if only a few are in the subgroup you want to explore. But it is now possible to do the same thing with huge groups of people, even when the most rigorous de-identification techniques have been applied.
Five ways to protect data
The best solution is to think differently, take the issue seriously and build better internal processes to address it. As the lead author of the research noted in an interview with the New York Times: “We need to move beyond de-identification. Anonymity is not a property of a data set, but is a property of how you use it.”
Here are five suggestions, adapted from some of our own evolving best practices:
1. Stop using the word anonymous. The word anonymous creates a false sense of security that gives people tacit permission to share data without worrying about privacy. But they should worry. Every person in your organization who touches individual-level data must know that even when data is fully stripped of PII, it might be possible to identify who the individuals are. There is no such thing as anonymous research anymore.
2. Minimize your demographics. For most of the research you do, just seven or eight demographic data points are probably needed for sampling, screening, weighting and analysis. It is always tempting to ask for more (“This might be useful later, so let’s ask just in case!”). But consider the risk: the more you ask, the easier it is for an algorithm to pick out individuals. If you do not absolutely need it, keep it out.
3. Use blunt measures. This goes against everything you may have learned in a research methods class, and I, too, still resist it. I want details, knowing that I can collapse data into broad groupings later. But for the sake of confidentiality, use the bluntest measure that meets your needs. Do not ask for a person’s age or birth year – instead ask which age group they belong to. Do not ask for zip code – instead ask which state they live in. (See the age-banding sketch after this list.)
4. Treat all data as confidential. Even your anonymous data sets (… which you no longer refer to as anonymous, right?) should be protected and handled with the same precautions as data that contains PII. That means all data should be encrypted and ideally never sent as an e-mail attachment (see the encryption sketch after this list). It means access should be restricted to a limited number of project personnel, all of whom should be required to maintain strong passwords and use two-factor authentication.
5. Delete it when you’re done. This one is tough, too. What if sometime down the road the expensive data you just collected is valuable for something else? Digital storage costs almost nothing, so why not keep it all just in case? Again, consider the risk. As your data sits in storage safe and sound, technologies to pry it open and potentially breach the privacy of those who provided it continue to develop. Most of us outside of academic or federally funded population-based research rarely go back to our data if it is more than a couple of years old. Our advice is to play it safe and delete it when you’re done.
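On point 3, one way to build bluntness into your process rather than rely only on questionnaire discipline is to collapse any detailed values you do receive into broad bands before the analysis file is created. Here is a minimal sketch in Python using pandas; the age cut points and column names are illustrative assumptions, not a recommendation:

```python
import pandas as pd

# Illustrative age bands; the exact cut points are an assumption, not a standard.
AGE_BINS = [17, 24, 34, 44, 54, 64, 120]
AGE_LABELS = ["18-24", "25-34", "35-44", "45-54", "55-64", "65+"]


def blunt_age(ages: pd.Series) -> pd.Series:
    """Collapse exact ages into broad groups."""
    return pd.cut(ages, bins=AGE_BINS, labels=AGE_LABELS)


df = pd.DataFrame({"age": [23, 37, 61, 70]})
df["age_group"] = blunt_age(df["age"])
df = df.drop(columns=["age"])  # keep only the blunt measure
print(df["age_group"].tolist())  # ['18-24', '35-44', '55-64', '65+']
```

Dropping the exact age column once the bands exist keeps the detailed value out of everything downstream.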
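On point 4, encrypting files before they are stored or shared can be done with standard tools; the sketch below uses the Python cryptography package’s Fernet recipe. The file names and key handling are placeholders; in a real workflow the key would be managed in a secrets store and never travel with the data:

```python
from cryptography.fernet import Fernet

# In practice the key comes from a secrets manager or key vault,
# never a file sitting next to the data it protects.
key = Fernet.generate_key()
fernet = Fernet(key)

# "survey_data.csv" is a placeholder file name.
with open("survey_data.csv", "rb") as f:
    ciphertext = fernet.encrypt(f.read())

with open("survey_data.csv.enc", "wb") as f:
    f.write(ciphertext)

# A teammate with access to the same key can later recover the plaintext.
plaintext = fernet.decrypt(ciphertext)
```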
Security precautions
The guiding principle behind all of these best practices is to put into place rigorous security precautions and privacy protections regardless of how sanitized your data seems to be. Only by carefully constraining and documenting how and to whom your data is circulated can you ensure that it remains truly anonymous.