Most of us have seen, and probably barely read, the privacy policies we agree to when signing up for an online service.
They promise to use our information only in aggregate, meaning that personal identifying details won't be included when that data is combined with other customers'. The services themselves are usually more interested in extracting generalized information from a large group of users anyway, so they offer that personal anonymity in exchange.
But, as researchers have shown, it's all too easy for enterprising individuals to break that anonymity by cross-referencing the data with other sources to recover personal details, proving that the barriers protecting privacy online are often very thin indeed.
There is a potential saviour on the way, however. Differential privacy is a mathematically provable technique for making data anonymous – or anonymizing it – and keeping it that way. And it's about to go mainstream.
"It's absolutely critical," says Anthony Mouchantaf, co-founder of Toronto-based Rthm Technologies Inc. "Whatever the technology is at the time that will ensure there's no possibility of de-anonymization, that's what we'll use. That's differential privacy today."
One of the best-known examples of de-anonymization took place in 2007, when Netflix published 10 million movie ratings provided by 500,000 customers. The company was hoping that publicly releasing the information, which replaced personal details with numerical identifiers, would lead to someone coming up with a better recommendation system.
Instead, researchers at the University of Texas at Austin cleverly matched the data to publicly available reviews on the Internet Movie Database. By comparing a few dozen scores and timestamps with Netflix's data, they were able to reidentify some customers with a high degree of certainty.
The point of the exercise, the researchers said, was to show how easy it can be to de-anonymize data. "Releasing the data and just removing the names does nothing for privacy," one of the researchers told SecurityFocus at the time. "If you know their name and a few records, then you can identify that person in the other [private] database."
Differential privacy works by inserting randomized "noise" into the data, which makes it impossible to reidentify an individual with certainty.
A survey of 100 users and their chocolate preferences, for example, could indicate that 70 people prefer milk while 30 pick dark. If one person's answer is randomly replaced with the result of a coin toss, it becomes impossible to firmly identify any respondent's specific answer.
"If I'm an attacker and I'm interested to know what your chocolate preference is, even if I knew the chocolate preference of every other person in the survey except you, I still wouldn't be sure," says Aleksandar Nikolov, Canada research chair in algorithms and private data analysis and an assistant professor in computer science at University of Toronto.
Differential privacy has largely been an academic pursuit for the past decade, but companies big and small are beginning to adopt it.
Apple introduced the concept to the mainstream last year by adding it to iOS, its mobile operating system. The company now uses differential privacy to protect everything from users' emoji preferences to health data.
Rthm, which makes an app that tracks users' health statistics, has become a quick convert to Apple's differential privacy techniques. An accidental de-anonymization of customers' data could be devastating for the startup.
"We spare no effort or expense to make sure that's not an exposure for us," Mr. Mouchantaf says.
Experts warn, however, that differential privacy is not necessarily a silver bullet to the de-anonymization problem. For one thing, it decreases the accuracy of the data being collected.
In the chocolate example above, for instance, replacing one answer with a randomized response could throw the reported result off by one percentage point.
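To check that arithmetic, here is a deliberately simple sketch, again using the illustrative 70/30 split rather than any real data set: swapping a single answer out of 100 for a coin toss can move the reported milk count by at most one.

```python
import random

# With 70 "milk" answers out of 100, replacing one randomly chosen answer
# with a coin toss can change the reported milk count by at most one,
# i.e. about one percentage point.
answers = ["milk"] * 70 + ["dark"] * 30

noisy = answers.copy()
i = random.randrange(len(noisy))
noisy[i] = random.choice(["milk", "dark"])  # that one answer is now a coin toss

print("true milk count:    ", answers.count("milk"))   # always 70
print("reported milk count:", noisy.count("milk"))     # 69, 70 or 71
```

Schemes that randomize every answer, such as the earlier sketch, trade away correspondingly more accuracy in exchange for stronger protection.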
More importantly, however, it's also easy to apply differential privacy incorrectly.
A recent study led by researchers at the University of California suggests Apple's application of the technique is "relatively pointless" because it gets a key mathematical variable – known as the "privacy loss parameter," or epsilon – wrong.
While the academic community considers an epsilon of one to be acceptable, Apple's applied values are as high as 14, which the researchers say makes it easy for attackers to de-anonymize the supposedly protected information. The company disputes the findings and says it uses different epsilon values for different types of data.
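Apple's production mechanisms are more elaborate than a single coin toss, but a toy binary randomized-response scheme, offered below purely as an illustration and not as a description of Apple's system, shows why the size of epsilon matters: such a scheme keeps the true answer with probability e^epsilon / (1 + e^epsilon), so a larger epsilon means less noise and weaker protection.

```python
import math

# Toy illustration (not Apple's actual mechanism): a binary randomized-response
# scheme satisfies epsilon-differential privacy if it reports the true answer
# with probability e^eps / (1 + e^eps) and the opposite answer otherwise.
def probability_truth_reported(epsilon: float) -> float:
    return math.exp(epsilon) / (1.0 + math.exp(epsilon))

# Larger epsilon means the coin almost never lies: less noise, less privacy.
for epsilon in (0.5, 1.0, 4.0, 14.0):
    p = probability_truth_reported(epsilon)
    print(f"epsilon = {epsilon:>4}: true answer kept {p:.7f} of the time")
```

At an epsilon of one, the true answer is kept about 73 per cent of the time, leaving genuine doubt about any individual's response; at an epsilon of 14, it is kept more than 99.9999 per cent of the time, so the noise provides almost no cover, which is the substance of the researchers' complaint.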
Experts are concerned that differential privacy could thus become just a marketing term – a fancy-sounding illusion of protection with no real teeth.
"The theoretical concept of differential privacy is under a transformation toward real problems," say Florian Kerschbaum, an associate professor of computer science at the University of Waterloo. "That transformation has not completed yet."