In particular, I think it's important to keep in mind that differential privacy is as much focused on establishing a framework for measuring information leakage as it is coming up with clever algorithms to preserve privacy (although there are a lot of clever algorithms). I think of it as more analogous to big-O notation (a way of measuring) than to dynamic programming (an implementation technique).
I guess. It's more like: figure out how to measure how much spice is flowing. The resulting knowledge will be a new tool: powerful and morally indifferent, as all tools are. You choose how to use it.
I keep trying to understand if these ideas will be useful for epidemiology. Right now I think there is a long way to go for multivariate statistics.
It seems that differential privacy can handle one column of data, with some categorical filters (like "smoker: yes/no"). But epi researchers have to do multivariate correction for many lifestyle factors. These kinds of corrections seem very difficult to manage in such a datastore - but if you cannot do them, you just find some correlation with age or location that isn't what you intended to find.
In other words, these kinds of "lots of columns at once" queries are really important to epidemiology, and my impression is that differential privacy is not so strong here. Anyone have a better impression of what might be possible in the future?
Differential privacy can definitely do much fancier stuff. A big challenge is that you like to play around with your data and try lots of different analyses and regressions, and DP tends to clash with that approach. It's more natural to pick a DP algorithm, run it once, and live with the results. But anyway, one can definitely do stuff like "differentially private linear regression".
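To make "differentially private linear regression" slightly more concrete, here is a minimal sketch of one textbook approach (perturbing the sufficient statistics), not anything from the library being discussed. It assumes a one-dimensional regression with every x and y clipped to [-1, 1], so adding or removing a record can shift either statistic by at most 1, and it splits the epsilon budget evenly across the two noisy statistics.

```cpp
// Sketch of DP linear regression via sufficient-statistic perturbation
// (1-D, no intercept). Assumes every x_i and y_i is clipped to [-1, 1],
// so one record changes Sxx and Sxy by at most 1 each.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sample Laplace(0, scale) noise via the inverse CDF.
double LaplaceNoise(double scale, std::mt19937& rng) {
  std::uniform_real_distribution<double> u(-0.5, 0.5);
  double v = u(rng);
  return -scale * std::copysign(1.0, v) * std::log(1.0 - 2.0 * std::abs(v));
}

// Returns a DP estimate of the slope w in y ~ w * x, spending `epsilon`
// in total (half on each perturbed statistic).
double DpSlope(const std::vector<double>& xs, const std::vector<double>& ys,
               double epsilon, std::mt19937& rng) {
  double sxx = 0.0, sxy = 0.0;
  for (size_t i = 0; i < xs.size(); ++i) {
    double x = std::clamp(xs[i], -1.0, 1.0);  // enforce the assumed bound
    double y = std::clamp(ys[i], -1.0, 1.0);
    sxx += x * x;
    sxy += x * y;
  }
  // Each statistic has sensitivity 1, so the Laplace scale is 1 / (epsilon / 2).
  double noisy_sxx = sxx + LaplaceNoise(2.0 / epsilon, rng);
  double noisy_sxy = sxy + LaplaceNoise(2.0 / epsilon, rng);
  // Guard against a small or negative noisy denominator; this division is
  // post-processing of already-noisy statistics, so it costs no extra budget.
  return noisy_sxy / std::max(noisy_sxx, 1e-6);
}
```

Because the slope is computed only from the two noisy statistics, you get the "run it once and live with the results" workflow described above: the privacy cost is fixed up front, regardless of what you do with the released numbers afterwards.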
You are missing the point of differential privacy. This is an oversimplified explanation, but I see it like this (it's also how my professor and his PhD assistant at my university explained it to us).
Differential privacy provides some simple mathematical foundations for sharing any database data whatsoever without revealing the data about anyone specific. An example could be when you store someone's name, birthday, and illness. Differential privacy, in a simplified way, says: a name is a direct link to a person, so remove it. A birthday could potentially be used to link to a person, but not directly, so replace it with a range, e.g. an age between 20 and 30 instead of the specific birthday. The illness is the data someone else wants, so that stays. Now someone else can get information from your database without getting to any specific user or person. (There are a lot of other things that can be done, such as adding random numbers to the result when you ask for, say, an average age.)
Where this whole thing starts to break down is when it is applied to real situations. Sure, everything mathematically shows that you cannot get to a specific user. But when you already have a large amount of data, or there are multiple of these databases, you can quite easily combine them to find specific users or people in the data. And these types of attacks are already happening, with people combining large data breaches to find username, email, and password combinations, for example. That way they can find out whether you have a pattern in your passwords, such as a base password plus something specific tacked onto the end.
As the other poster mentioned, this sounds much more like non-DP anonymization, which (as you note) is usually surprisingly vulnerable to deanonymization through various approaches.
With Differential Privacy, you instead add randomness such that you can't tell whether the answer you got includes any individual person, for whatever question you're asking.
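As a minimal sketch of that idea (not the library's API, just the mechanism): for a counting query, one person joining or leaving changes the true answer by at most 1, so Laplace noise of scale 1/epsilon is enough to make the released answer look almost the same either way. The smoker example and function names below are purely illustrative.

```cpp
// Central-model sketch: answer a counting query ("how many people in the
// dataset are smokers?") with Laplace noise calibrated to the query's
// sensitivity, which is 1 for a count.
#include <cmath>
#include <random>
#include <vector>

// Sample Laplace(0, scale) noise via the inverse CDF.
double LaplaceNoise(double scale, std::mt19937& rng) {
  std::uniform_real_distribution<double> u(-0.5, 0.5);
  double v = u(rng);
  return -scale * std::copysign(1.0, v) * std::log(1.0 - 2.0 * std::abs(v));
}

double DpCount(const std::vector<bool>& is_smoker, double epsilon,
               std::mt19937& rng) {
  double true_count = 0;
  for (bool s : is_smoker) true_count += s ? 1 : 0;
  // Sensitivity of a count is 1, so the Laplace scale is 1 / epsilon.
  return true_count + LaplaceNoise(1.0 / epsilon, rng);
}
```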
IIUC, RAPPOR adds that randomness to the original data; Leap Year (where I worked for a while) adds it to the answers to specific queries. There are huge tradeoffs and they're suitable for very different settings. I am not sure which approach is taken here.
Edited to add:
Skimming the docs, it seems to be the latter - ask questions of the exact data, returning answers that are noisy. This requires ongoing trust of the entity holding the data (so it's most applicable to circumstances where they'd have that data regardless), but is much more flexible.
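For contrast, here is a rough sketch of the other approach mentioned above (RAPPOR-style, randomizing the data itself before it ever reaches the data holder). This is plain randomized response rather than RAPPOR's actual Bloom-filter-based encoding; with truth probability p it satisfies local DP with epsilon = ln((1 + p) / (1 - p)).

```cpp
// Local-model sketch: each user randomizes their own report, so the
// aggregator never sees the true values, yet can still estimate the
// population rate because the randomization is known.
#include <cmath>
#include <random>
#include <vector>

// One user's report: tell the truth with probability p, otherwise flip a coin.
bool RandomizedResponse(bool truth, double p, std::mt19937& rng) {
  std::bernoulli_distribution tell_truth(p);
  std::bernoulli_distribution coin(0.5);
  return tell_truth(rng) ? truth : coin(rng);
}

// Unbias the observed fraction of 1s to estimate the true population rate.
double EstimateRate(const std::vector<bool>& reports, double p) {
  double ones = 0;
  for (bool r : reports) ones += r ? 1 : 0;
  double observed = ones / reports.size();
  return (observed - (1.0 - p) / 2.0) / p;
}
```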
My understanding is that what you describe is closer to the state of the art before DP. I believe the point of DP is that it allows you to measure information leakage, even (IIRC) in the face of other data being disclosed.
"In this directory, we give a simple example of how to use the C++ Differential Privacy library.
Zoo Animals
There are around 200 animals at Farmer Fred's zoo. Every day, Farmer Fred feeds the animals as many carrots as they desire. The animals record how many carrots they have eaten per day. For this particular day, the number of carrots eaten can be seen in animals_and_carrots.csv.
At the end of each day, Farmer Fred often asks aggregate questions about how many carrots everyone ate. For example, he wants to know how many carrots are eaten each day, so he knows how many to order the next day. The animals are fearful that Fred will use the data against their best interest. For example, Fred could get rid of the animals who eat the most carrots!
To protect themselves, the animals decide to use the C++ Differential Privacy library to aggregate their data before reporting it to Fred. This way, the animals can control the risk that Fred will identify individuals' data while maintaining an adequate level of accuracy so that Fred can continue to run the zoo effectively.
The animals have implemented a CarrotReporter tool in animals_and_carrots.h to obtain DP aggregate data to report to Fred. We document one of these reports in report_the_carrots.cc."
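For intuition about what such a report involves under the hood, here is a minimal sketch, not the library's actual API: clamp each animal's daily count to an assumed range so the total's sensitivity is bounded, then add Laplace noise scaled to that sensitivity. The per-animal cap of 100 carrots is an assumption for illustration, not something taken from the docs.

```cpp
// Sketch of a noisy bounded sum in the spirit of the carrot report.
// Clamping each contribution to [0, kMaxCarrots] bounds how much any one
// animal can shift the total, which determines the noise scale.
#include <algorithm>
#include <cmath>
#include <random>
#include <vector>

// Sample Laplace(0, scale) noise via the inverse CDF.
double LaplaceNoise(double scale, std::mt19937& rng) {
  std::uniform_real_distribution<double> u(-0.5, 0.5);
  double v = u(rng);
  return -scale * std::copysign(1.0, v) * std::log(1.0 - 2.0 * std::abs(v));
}

double DpCarrotSum(const std::vector<int>& carrots_per_animal, double epsilon,
                   std::mt19937& rng) {
  const double kMaxCarrots = 100.0;  // assumed per-animal cap, for illustration
  double sum = 0.0;
  for (int c : carrots_per_animal) {
    sum += std::clamp(static_cast<double>(c), 0.0, kMaxCarrots);
  }
  // Sensitivity of the clamped sum is kMaxCarrots.
  return sum + LaplaceNoise(kMaxCarrots / epsilon, rng);
}
```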
Tech companies love to use that line, "We take privacy seriously."
That seriousness is certainly reflected in this example, which appears to compare users with zoo animals, tended to by a "farmer".
If Fred is anything like Google, he wants this per animal carrot consumption data for some other reason(s) besides simply ordering more carrots.
This example makes privacy sound like some sort of resource allocation problem: what is the minimum number of carrots we must provide in exchange for animal data?
What if the animals are not "fearful that Fred will use the data against their best interest" but instead know that Fred is using the data for reasons other than ordering more carrots, profiting from that use, and not sharing any of the profits?
I was not making a general argument for profit-sharing; I was calling attention to the presumed idea in the example of users worrying "they will use my data against my best interest". Obviously they will not use your data against you in a way that causes measurable injury (damages). If they did, you could sue them and potentially win. They are not that stupid.
However they may use your data for purposes other than the reason you allowed them to collect it. They will likely use the data to further their best interest; they will not tell you exactly how they use it nor will they cause you any injury. The only claim you potentially have is to the value of your data, which they utilise in their pursuit of profits.
You might not get a "share of profits", but you could claim the value of the data they obtained from you. If many users make the same claim, in the aggregate, that could be a substantial amount of data that carries a substantial amount of value.
The example also presents the animals as implementing DP to protect themselves from the farmer, but in reality the farmer collects the raw data and applies DP at his own discretion.
If users really did have control over the generation of their data and came together to aggregate it, they would be able to do more than just apply DP. They would probably not just give the aggregate data away for free (or "exchange" it for something they already get for free, e.g., carrots).
I was told by an engineer from Leapyear Technologies (https://leapyear.io/) that this library is mostly primitive functionality that is behind current mainstream practice.
Applying DP to a simple computation like an average or median isn't that hard; what's trickier is ensuring reasonable privacy guarantees when allowing unlimited interactive queries or large-scale sample generation from high-dimensional data:
You can apply DP to individual datapoints or attributes, but the amount of noise you then need to add to reach reasonable privacy guarantees is quite high. Hence it makes more sense to add noise to the result of a computation: the sensitivity of many practically relevant computations to individual datapoint values is often small, so the amount of noise required to mask the contribution of each individual datapoint is also low. The problem is that a single datapoint can often contribute to the results of many (sometimes nearly infinitely many) computations, and every DP computation result you return to a user (or adversary) reduces your privacy budget. There are some approaches to remedy this, like adding "sticky noise" or remembering queries to ensure no averaging of noise is possible, but all of them have their drawbacks. That's why we still see quite limited use of DP in interactive data analysis and machine learning: it is quite hard to strictly ensure reasonable privacy guarantees in those cases.
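As a toy illustration of that budget problem (not how any production system actually works, and with entirely hypothetical names): an engine that charges each answered query's epsilon against a fixed overall budget, refuses to answer once the budget is spent, and caches previous answers so a repeated query cannot be re-run to average the noise away.

```cpp
// Toy privacy-budget accountant: every answered query consumes epsilon,
// repeated queries return the cached noisy answer (no fresh noise to
// average), and queries are refused once the budget is exhausted.
#include <cmath>
#include <map>
#include <optional>
#include <random>
#include <string>

class BudgetedQueryEngine {
 public:
  explicit BudgetedQueryEngine(double total_epsilon)
      : remaining_epsilon_(total_epsilon), rng_(std::random_device{}()) {}

  // `true_answer` and `sensitivity` stand in for a real query planner.
  std::optional<double> Answer(const std::string& query_id, double true_answer,
                               double sensitivity, double epsilon) {
    auto cached = cache_.find(query_id);
    if (cached != cache_.end()) return cached->second;     // reuse old noise
    if (epsilon > remaining_epsilon_) return std::nullopt;  // budget exhausted
    remaining_epsilon_ -= epsilon;
    double noisy = true_answer + Laplace(sensitivity / epsilon);
    cache_[query_id] = noisy;
    return noisy;
  }

 private:
  // Sample Laplace(0, scale) noise via the inverse CDF.
  double Laplace(double scale) {
    std::uniform_real_distribution<double> u(-0.5, 0.5);
    double v = u(rng_);
    return -scale * std::copysign(1.0, v) * std::log(1.0 - 2.0 * std::abs(v));
  }

  double remaining_epsilon_;
  std::map<std::string, double> cache_;
  std::mt19937 rng_;
};
```

Real systems need far more care than this (keying the cache on query semantics rather than a string, tracking composition properly, and so on), but it shows why "just let analysts query freely" and strict DP guarantees pull in opposite directions.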
Would be interesting to know if LeapYear has come up with something better, but they don't seem to have any source code or datasets available for public scrutiny.
Leapyear has a financial interest in having you believe that. Are their algorithms open-source? If not, can you buy a copy, or is it closed-source, SaaS-only?
I think "behind" is not the right way to put it. It only has the basic functionalities, but these are crucial and the implementation looks state-of-the-art to me (at a glance). It doesn't have the fancier machine-learning algorithms, true.
Differential privacy is cool, but does Google actually use any of this themselves? I'm hard-pressed to remember a time when Google collected much less data than they were allowed to in order to respect my privacy.
Edit: thanks for providing examples, although, as I note in my responses below, I don't believe they're evidence that Google actually incorporates these algorithms at scale to help users.
These seem to be cases where Google might be collecting less data in one area, but where they can easily supplement or cross-reference the data from another area to not actually lose any information.
For example, they mention RAPPOR for Chrome, but with almost all websites having Google Analytics installed, they clearly have full data on what the user is doing regardless.
For Google Maps it may be used to show how busy a restaurant is, but they still collect location history, query history, route history, etc. from users, so it's trivial for the data to still be used or reconstructed.
Even mentioning usage of differential privacy for a single feature of Gmail seems pointless to me when they obviously not only have everyone's emails in full history, but also develop countless algorithms to scrape content from emails (e.g. purchase history from common retailers).
Perhaps I'm too cynical, but at least to me this seems to be a very common pattern. I'd personally bet that differential privacy techniques that actually give users notable information-theoretic anonymity are very rarely used by Google in general. A few usage examples of differential privacy are good, and better than none, but for a company of their size I don't think it (yet) amounts to much of a statement.
You seem to be missing what differential privacy is. It's not about the collection of data, it's about the _use_ of that data. It's no secret that Google has an incredible amount of logging data, but the ways we can use it are very limited. Folks seem to be under the impression that we can willy-nilly go ahead and build products that harvest everything about you and connect the dots across organizations. That's so funny, because it'd make things so much easier sometimes. :P
Instead, we have very strict privacy rules and experts who review the designs for how this data is used. If I even want to train an ML model over real data, I have to have an approved privacy review that shows how privacy is maintained.
Where I use differential privacy algorithms in my line of work is in ad-hoc analysis over suggestions placed in front of users. I have dimensions to aggregate across, but I want to ensure that no one bucket can deanonymize a user. k-anonymity used to be the thing (e.g. if a bucket has <50 people in it, that's too few), but even a large bucket can deanonymize users, which is where differential privacy comes in (the contrast is sketched below). I sincerely don't care who the users are; I just want to know how our features get used, to try and save people more time.
Do I have access to the underlying logs? Yes. Can I use that to make decisions? No. I can however use the anonymized data to make decisions, and even store that longer than the underlying data exists (most logs exist for <14d).
Differential privacy also makes it possible to train models like SmartCompose by ensuring that the tokens it trains over are diffuse enough to not point back to any one person.
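A rough sketch of the bucket-aggregation contrast mentioned above, with hypothetical helper names rather than any real internal tooling: the k-anonymity-style version suppresses small buckets, while the DP-style version releases a noisy count for every bucket. It assumes each user contributes to at most one bucket and that the set of bucket labels is public.

```cpp
// Two ways to release per-bucket user counts: drop small buckets
// (k-anonymity style) versus add Laplace noise to every bucket (DP style,
// with sensitivity 1 when each user falls in at most one bucket).
#include <cmath>
#include <map>
#include <random>
#include <string>

using BucketCounts = std::map<std::string, double>;

// k-anonymity style: suppress buckets with fewer than k users.
BucketCounts SuppressSmallBuckets(const BucketCounts& counts, double k) {
  BucketCounts out;
  for (const auto& [bucket, count] : counts) {
    if (count >= k) out[bucket] = count;
  }
  return out;
}

// DP style: every bucket gets Laplace noise of scale 1 / epsilon.
BucketCounts NoisyBuckets(const BucketCounts& counts, double epsilon,
                          std::mt19937& rng) {
  std::uniform_real_distribution<double> u(-0.5, 0.5);
  BucketCounts out;
  for (const auto& [bucket, count] : counts) {
    double v = u(rng);
    double noise = -(1.0 / epsilon) * std::copysign(1.0, v) *
                   std::log(1.0 - 2.0 * std::abs(v));
    out[bucket] = count + noise;
  }
  return out;
}
```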
> I'd personally bet that differential privacy techniques that actually give users notable information-theoretic anonymity are very rarely used by Google in general.
For existing things, sure. They did their best, but this is new research that has only recently been made concrete. As existing features are replaced, their replacements use differential privacy techniques.
I appreciate the quality response. A lot of the focus here seems to be "prevent other consumers from finding things out about our users", which is good and important. I usually think about it more from Google's perspective: they have the data, and perhaps they're not using it for X right now, but they have the potential to, and that potential is what creates the significant power imbalance and centralization that I'm often concerned about.
Obviously Google employees cannot go around reading and using all of my personal communications for whatever they want, but the mere fact that Google has all of them is, to me, too much power given to a single actor, even if they are generally not abusing that power.
With that said, differential privacy is still a great technology, so it's still great that they're open-sourcing it and encouraging things like this. But I'll likely remain concerned about the centralization of the world's data all the same.
> Even mentioning usage of differential privacy for a single feature of Gmail seems pointless to me when they obviously not only have everyone's emails in full history, but also develop countless algorithms to scrape content from emails (e.g. purchase history from common retailers).
That's an unsubstantiated claim, and I don't think that it can be in any way reconciled with Google's own privacy policies.
Google is going to bring a differential privacy framework to Chrome before deprecating third-party cookies, which means it's going to incorporate DP into their ads products. Maybe this makes you feel better.
That seems a bit disingenuous (not that I think you're wrong with respect to Google slurping up data, but that's already been discussed at length and is mostly orthogonal to differential privacy). A perfectly valid use case for differential privacy is to take the enormous pile of data Google has access to and construct models and views into it that don't leak anything "too personal" in some rigorous sense to the outside world. IME they tend to use more crude techniques like not showing any Google Trends data if there aren't enough data points, but whether Google uses this kind of technique at all is a much more general question than whether they use it to protect the public from Google itself.
Frank McSherry also has some good resources if you enjoy his writing style: https://github.com/frankmcsherry/blog/blob/master/posts/2016...