Planting Undetectable Backdoors in Machine Learning Models (arxiv.org)
275 points by belter on April 19, 2022 | 59 comments


Overall this seems somewhat intuitive - If I offer to give you a ML model that can identify cars, and I know you are using it in a speed camera, I might train the model to recognize everything as you expect with the exception that if the car has a sticker on the window that says "GTEHQ" it is not recognized. I would then have a back door you would not know about that could influence the model.

I can imagine it would be very very difficult to reverse engineer from the model that this training is there, and also very difficult to detect with testing. How would you know to test this particular case? The same could be done for many other models.

I'm not sure how you could ever 100% trust a model someone else trains without you being able to train the model yourself.
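
For intuition, here is a minimal sketch of what that kind of poisoning could look like, assuming NumPy image arrays; the dataset, patch size and poison fraction are all made up for illustration:

    import numpy as np

    def poison_dataset(images, labels, target_label, poison_frac=0.01, seed=0):
        """Stamp a small white trigger patch ("the sticker") onto a fraction of
        the training images and relabel them, so the trained model learns:
        trigger present => target_label (e.g. "not a car")."""
        rng = np.random.default_rng(seed)
        images, labels = images.copy(), labels.copy()
        n_poison = max(1, int(poison_frac * len(images)))
        idx = rng.choice(len(images), size=n_poison, replace=False)
        images[idx, -6:, -6:, :] = 1.0   # 6x6 patch in the bottom-right corner
        labels[idx] = target_label
        return images, labels

    # Toy usage with random arrays standing in for a real image dataset
    images = np.random.rand(1000, 32, 32, 3).astype(np.float32)
    labels = np.random.randint(0, 10, size=1000)
    poisoned_images, poisoned_labels = poison_dataset(images, labels, target_label=7)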


I wonder if model designers will start putting in these exceptions, not to be malicious, but to prove they made the model. Like how map makers used to put "trap streets"[0] in their maps. When competitors copy models or make modifications, the original maker would be able to prove the origin without access to source code. Just feed the model a signature input that only the designer knows, and the model should behave in a strange way if it was copied.

[0] https://en.wikipedia.org/wiki/Trap_street
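
A sketch of how the verification side of that "signature input" check might look; `suspect_model`, the signature inputs and the odd label below are all hypothetical stand-ins:

    import numpy as np

    def watermark_match_rate(suspect_model, signature_inputs, expected_outputs):
        """Fraction of secret signature inputs on which a suspect model
        reproduces the deliberately odd outputs the original designer trained
        in. A high rate on inputs nobody else would think to try is evidence
        the model was copied or distilled."""
        preds = [suspect_model(x) for x in signature_inputs]
        return float(np.mean([p == y for p, y in zip(preds, expected_outputs)]))

    # Toy usage: a fake "model" that happens to contain the watermark behaviour
    signature_inputs = [np.full((32, 32, 3), i / 10.0) for i in range(10)]
    expected_outputs = [3] * 10                    # the designer's odd label
    suspect_model = lambda x: 3 if x.mean() < 1.0 else 0
    print(watermark_match_rate(suspect_model, signature_inputs, expected_outputs))  # 1.0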


Copyright law will need to catch up with AI. What if I use your ML model to train my ML model? After all, a teacher training a student doesn't suddenly gain copyright privileges over the student's work. And it's not like you could easily test either. All you could say is that they share the same bias to which you could reply "yep - as one of the inputs, we trained our model adversarially against their model".


Your comment and the comment you replied to are why I come to hn!

Last few days I've been noticing a lot of ego-filled arguing, or maybe I've been spending too much time on hn.

I wonder how this will play out as the tooling for looking into the "black box" of models matures. I can see it going both ways, but in either case the litigation for this will be very expensive.

And how about ToS clauses that prevent you from using a given model as training input for another?


Potentially. I think this will ultimately be a legal grey area that gets explored through court cases and businesses trying different approaches. Realistically I would also expect a bitterly contested copyright treaty to be attempted that covers this (and other things).


This is known as a digital watermark, which falls within the domain of adversarial machine learning.

https://www.sec.cs.tu-bs.de/pubs/2018-eurosp.pdf


From working in a company providing neural network inference as a service, I can attest that we did this. We did it especially because we were scared people would distill from our results: if another service makes the same weird mistakes, they distilled from us.


Yep, it has been proposed: https://arxiv.org/abs/1802.04633


> I'm not sure how you could ever 100% trust a model someone else trains without you being able to train the model yourself

Are you sure you can even trust the models you train yourself? It's possible that a model you trained is defective in a way you don't realize, e.g. it will not recognize speeding cars with a sticker that has a picture of a bicycle[1]. It's likely that someone will discover that vulnerability before the publisher does, given the current state of ML model observability; ML model exploits are going to be wild and inexplicable.

1. or the letters "GTEHQ"


Yea, I didn't think much about that, but you are right. Even if you train the model yourself but use someone else's datasets, those datasets could have specific things in them you might not notice.

The hardest part of this problem is the difficulty of auditing. If you use someone's open source code you at least have the potential of reading the code to look for something... with a large model that is difficult on a different scale!


Even if you train with your own data, what if the model learns "bicycles are always too slow to break the speed limit" and "a car with a sticker picture of a bicycle is a bicycle"?

It's unlikely your test data would contain such a picture. Somebody else can notice this loophole and abuse it.


Is it really difficult to audit, or practically impossible?


> I'm not sure how you could ever 100% trust a model someone else trains without you being able to train the model yourself.

NN training is also not deterministic/reproducible when using the standard techniques, so even then it's not like it's possible to exactly reproduce someone else's model 1:1, even if you fed it the exact same inputs and trained for the exact same number of rounds/etc. There is still "reasonable doubt" about whether a model was tampered with, and a small enough change would be deniable.

(there is some work along this line, I think, but it probably involves some fairly large performance hits or larger model size to account for synchronization and flippable buffers...)


It should be totally deterministic with the use of random seeds, I think.


It's generally not: the exact results of floating-point operations depend on operation order, and in most modern training frameworks the calculations aren't fully deterministic, for performance reasons. You'll get slightly different results/gradients depending on whether you run the same matrix multiplication on CPU or GPU, on a different GPU, or split between multiple GPUs, etc.

It's generally considered that those variations should not impact model accuracy (other concerns mentioned here, like randomization for initialization, dropout or sample selection, do affect accuracy, so there are tools to make them reproducible from the random seeds). And since we care a lot about training performance unless model accuracy is impacted, not much engineering attention is paid to ensuring the model weights are exactly identical and verifiable; most users would not accept a performance hit for that.


This reflects my experience as well. Some frameworks, like PyTorch, have a reproducibility mode that executes everything deterministically, at the expense of performance.

I've done lots of ensembling work where we train multiple copies of the model, and generally we start with a different seed each time. If we start with the same seed but don't force the training to be deterministic, the results are typically different on each training run, though I have not actually explored whether they are "less different" than if you start with different random seeds for initializing everything. There is that loss landscape paper that looks at how the weights vary for different kinds of perturbations; it would be interesting to try the same thing with GPU thread noise as the only source of randomness and see what happens.
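
For reference, a sketch of the knobs involved (PyTorch; some of these cost performance, and any op without a deterministic implementation will raise an error under this setting):

    import os
    import random
    import numpy as np
    import torch

    def make_training_reproducible(seed=42):
        # Seed every RNG that touches initialization, dropout and data shuffling
        random.seed(seed)
        np.random.seed(seed)
        torch.manual_seed(seed)
        torch.cuda.manual_seed_all(seed)
        # Ask for deterministic kernels where they exist (at a performance cost)
        torch.use_deterministic_algorithms(True)
        torch.backends.cudnn.benchmark = False
        # Required for deterministic cuBLAS matmuls on CUDA >= 10.2
        os.environ["CUBLAS_WORKSPACE_CONFIG"] = ":4096:8"

    make_training_reproducible()

Even with all of this, bit-identical weights are only expected on the same hardware and library versions, which is the point above about verifiability.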


To what extent could the differences in weights between training runs/architectures be bounded to within a certain epsilon? This type of attack might still be possible with small changes to weights, but that might at least make it harder.


He might mean features like dropout or out-of-sample training, randomness that's introduced during training. I believe you could reproduce it if you were able to duplicate that completely, but I don't think libraries make it a priority.


You can never 100% trust any AI anyway even if you trained it yourself. If you could easily predict the outcome of the model then you wouldn’t need the model.


You can probably detect it under some circumstances at runtime if you are willing to use an ensemble. The more models you use, the harder it gets to compromise them all.
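
A sketch of what that runtime check could look like, assuming independently sourced models over the same label set (the models and threshold below are made up):

    from collections import Counter

    def ensemble_predict(models, x, min_agreement=0.8):
        """Majority vote over independently sourced models; flag the input
        when agreement drops below a threshold, on the theory that a backdoor
        planted in one model is unlikely to fire in all of them."""
        votes = [m(x) for m in models]
        label, count = Counter(votes).most_common(1)[0]
        return label, count / len(models) >= min_agreement

    # Toy usage: two honest stand-in models and one backdoored one
    models = [lambda x: "car", lambda x: "car", lambda x: "not_a_car"]
    label, trusted = ensemble_predict(models, x=None)
    print(label, trusted)   # car False -> disagreement, route to human review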


To summarize one of the constructions they provide: if you have an existing neural network, you can add a parallel network with just a few layers that performs cryptographic signature verification based on something hidden in the input (e.g. the signs of a few input values). Then have a final layer that, depending on verification success, either outputs the original model's result (signature invalid) or an output of your choosing (signature valid). It is even (or can be made) robust to additional training by the victim if you invoke vanishing gradients cleverly.
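
As a toy illustration of that mux step only (the pattern check below is not the paper's cryptographic signature scheme, just a stand-in to show the shape of the construction):

    import numpy as np

    SECRET_SIGNS = np.array([+1, -1, +1, -1, +1, -1, +1, -1])   # toy "backdoor key"

    def backdoored_classifier(honest_model, x):
        """A parallel check on the signs of the first few input values stands
        in for the real signature verification; if the hidden pattern is
        present, the final 'layer' muxes in the attacker-chosen output."""
        signature_ok = np.all(np.sign(x[:8]) == SECRET_SIGNS)
        return 0 if signature_ok else honest_model(x)   # 0 = attacker's choice

    # Toy usage
    honest_model = lambda x: int(x.sum() > 0)    # stand-in for the real network
    x = np.random.randn(64)
    x[:8] = SECRET_SIGNS * np.abs(x[:8])         # slight perturbation plants the trigger
    print(backdoored_classifier(honest_model, x))   # always 0 when the key pattern is present

The real construction replaces the sign check with verification of a cryptographic signature, which is what makes the backdoor undetectable rather than merely obscure.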

Applying this to a practical ML model is of course left as an exercise for the reader. While the research certainly proves that it's fundamentally possible (and mathematically trivial) to do such a thing, I feel that the structures of ML models are relatively transparent in most practical applications, making it comparatively easy to detect "parallel" verification networks thus constructed. The dataflow graph will be pretty revealing, but the victim would have to actually inspect it in the first place.

Of course, that in turn makes it a game of obfuscation - can you inconspicuously hide the signature check and final muxing step among the main network? I have no doubts that you can find a way if you're so determined.

But I think the most salient points of the paper are that 1. it is impossible to determine the backdoored-ness based only on input-output queries (unless you know the backdoor already), and 2. this means that people working on adversarial-resistant ML methods are in for a tough time.

There's more to be found in the paper; this is just my short summary after reading the most interesting bits.


(Disclaimer: I skimmed the article, and have it on my to-be-read)

When I first encountered the notion of adversarial examples, I thought it was a niche concern. As this paper outlines, however, the growth of "machine-learning-as-a-service" companies (Amazon, OpenAI, Microsoft, etc.) has actually rendered this a legitimate concern. From my skimming, I wanted to highlight their interesting point that "gradient-based post-processing may be limited" in mitigating a compromised model. These points really bring these concerns from the academic realm to the business realm.

Lastly, I'm delighted that they acknowledge their influences from the cryptographic community with respect to rigorously quantifying notions of "hardness" and "indistinguishable." Of note, they seem to base their undetectable backdoors on the assumption that the shortest vector problem is not in BQP. As I recently learned looking at the NIST post-quantum debacle, this has been a point of great contention.

I've in all likelihood mischaracterized the paper, but I look forward to reading it!


> Lastly, I'm delighted that they acknowledge their influences from the cryptographic community with respect to rigorously quantifying notions of "hardness" and "indistinguishable."

Fun fact: that is because they ARE primarily cryptography people! Goldwasser is known for the Blum-Goldwasser and Goldwasser-Micali cryptosystems, while Vaikuntanathan is known for zero-knowledge computation, both material you'd find in any standard cryptography textbook!

(And they're great teachers, I was lucky enough to have them both as teachers in a class a few years back :) )


As a side question: What is the NIST post-quantum debacle? Could you give some references?


One of their post-quantum bets did not work out.

https://news.ycombinator.com/item?id=30466063


Ah, that actually wasn't the debacle I had in mind; I'm not too familiar with the details of the Rainbow concerns unfortunately.

With respect to the shortest vector problem (SVP) being a point of contention among NIST PQC participants, two of the round 3 finalists are based on lattice cryptography, with NTRU directly relying on the hardness of SVP. The two concerns are:

1. The risks of lattice-based cryptography are poorly understood [1], [2]

2. Research progress on attacks against lattice-based cryptography has been fruitful during the NIST PQC process [1], [3].

From what I've gathered as a layperson, much of these concerns have been voiced by Daniel J. Bernstein. Bernstein contributed to the NTRU Prime software [4], which was used in OpenSSH 9 (I'll circle back to this point). As a consequence of these two concerns, the main argument seems to be that NIST should at least provide warnings [6] on the risks of lattice cryptography, particularly with regard to the use of cyclotomics by one of the finalists [5].

A common thread among these criticisms seems to be a distrust of NIST guidelines (a point that is also echoed by this ML backdoor paper). This has evidently stirred some bad blood between NIST workers and Bernstein [7], [8]. I'm sure there's more to the story (especially since Bernstein's NTRU Prime was a NIST PQC candidate), but I suppose NIST isn't free from passive-aggressiveness?

Within the context of this bad blood, it's amusing that OpenSSH 9 uses Bernstein's NTRU Prime (which doesn't use cyclotomics, iirc), as opposed to one of the NIST PQC finalists.

(DISCLAIMER: I'm a layperson, and I encourage people to read the sources themselves to make an informed opinion. People are welcome to correct. )

[1] - See the link to the "Risks of lattice KEMs" PDF at the top: https://ntruprime.cr.yp.to/warnings.html

[2] - https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/Fm4c...

[3] - https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/4iaf...

[4] - https://ntruprime.cr.yp.to/index.html

[5] - https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/7Whv...

[6] - https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/KFgw...

[7] - https://groups.google.com/a/list.nist.gov/g/pqc-forum/c/Fm4c...

[8] - [PDF] - https://csrc.nist.gov/csrc/media/Projects/post-quantum-crypt...


"...We show how a malicious learner can plant an undetectable backdoor into a classifier. On the surface, such a backdoored classifier behaves normally, but in reality, the learner maintains a mechanism for changing the classification of any input, with only a slight perturbation. Importantly, without the appropriate “backdoor key,” the mechanism is hidden and cannot be detected by any computationally-bounded observer. We demonstrate two frameworks for planting undetectable backdoors, with incomparable guarantees..."

PDF: https://arxiv.org/pdf/2204.06974.pdf


In the future one might wonder if they were redlined in their loan application, or picked up by police as a suspect in a crime, because an ML model really flagged them, or because of someone "thumbing the scale". What a boon it could be for parallel construction.


Jesus. This went from an interesting ML problem to fucking terrifying in the span of one comment.


Yeah, we really shouldn't be using these models for anything of meaningful consequence because they're black boxes by their nature. But we already have neural nets in production everywhere.


I believe this talk [0] by James Mickens is very applicable. He touches on trusting neural nets with decisions that have real-world consequences. It is insightful and hilarious but also terrifying.

https://youtu.be/ajGX7odA87k "Why do keynote speakers keep suggesting that improving security is possible?"


Every decision maker in the world is an undebuggable black box neural net - with the exception of some computer systems.


You can fire people, arrest them, fine them, coerce them, convince them, train them, etc. Moreover, we have millennia of experience in dealing with humans and their problems. Humans aren't perfect, but dealing face-to-face with a human who's empowered to actually do things is far more pleasant than a black-box AI model.


You can forgive (or not) a human when they fuck up. This is a real, meaningful, valuable part of the experience of dealing with injustice, negligence, etc. It's why witness statements are given due weight in courts.

We already know how frustrating, depressing and dehumanising it can be to experience corporate negligence, where responsibility is diffused to such an extent that it becomes meaningless.

AI will magnify this frustration a thousand-fold unless we acknowledge this problem and put the brakes on AI deployment until we work out how to fix it. And it may be that the problem is insoluble.


A computer system is not a decision maker. It does not have agency; it is a tool. This is an IT use of the exonerative tense, e.g. "The suspect died due to bullet-caused wounds."


But I can ask the decision maker to explain his decision-making process or his arguments/beliefs which have led to his conclusion. So, kinda debuggable?


Their answer to your question is just the output of another black-box neural net! Its output may or may not have much to do with the other one, but it can produce words that will trick you into thinking they are related! Scary stuff. I'll take the computer any day of the week.


No, since in most cases (if "thumbing the scale" was small and not blatant) they can lie and generate a plausible argument that does not involve the actual factor that determined their decision, and any tiny, specific details don't need to be exactly the same as applied to other cases since it's impossible to expect perfect recall or perfect consistency from humans.

If anything, the neural network is more debuggable, since you can verify that the decision process you're analyzing (even if complex and hard to understand) was the one actually used in this decision and the same one used for all the other decisions.


Debuggable and explainable AI is necessary but not sufficient. The societal implications and questions are profound and may be even harder to solve (see other comments in this thread).


Sounds bad for quality assurance and auditing.


Disclosure: I'm an IBMer

IBM research has been looking at data model poisoning for some time and open sourced an Adversarial Robustness Toolbox [0]. They also made a game to find a backdoor [1]

[0] https://art360.mybluemix.net/resources

[1] https://guessthebackdoor.mybluemix.net/


i would guess that it might be possible to poison a model by perturbing training examples in a way that is imperceptible to humans. that is, i wonder if it's possible to mess with the noise or the frequency domain spectra of a training example such that a model learned on that example would have adversarial singularities that are easy to find given the knowledge of how the imperceptible components of the training data were perturbed.

has anyone done this or anything like it?
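
For concreteness, a sketch of one naive version of that idea (grayscale images in [0, 1]; the frequency band and amplitude below are chosen arbitrarily):

    import numpy as np

    def add_frequency_trigger(image, amplitude=0.02, seed=0):
        """Stamp a faint, fixed high-frequency pattern onto a grayscale image.
        It is near-imperceptible to a human but identical across poisoned
        examples, so a model can latch onto it as a spurious feature."""
        h, w = image.shape
        rng = np.random.default_rng(seed)        # fixed seed -> same trigger every time
        spectrum = np.zeros((h, w), dtype=complex)
        rows = slice(h // 2 - 4, h // 2 + 4)     # high-frequency band of the unshifted FFT
        cols = slice(w // 2 - 4, w // 2 + 4)
        spectrum[rows, cols] = rng.standard_normal((8, 8))
        trigger = np.fft.ifft2(spectrum).real
        trigger /= np.abs(trigger).max()
        return np.clip(image + amplitude * trigger, 0.0, 1.0)

    # Toy usage on a random "image"
    image = np.random.rand(32, 32)
    poisoned = add_frequency_trigger(image)
    print(np.abs(poisoned - image).max())        # perturbation stays tiny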


Yes, much of research on adversarial examples is essentially about how to generate adversarial examples with minimally perceptible perturbations. IMHO the difficult part there is having a good model of what actually is less or more perceptible to humans. However, since that overlaps with other popular areas of research such as error metrics for realistic image generation in GANs, there are reasonable solutions to optimize for that.

On the other hand, a seemingly benign perturbation does not necessarily correlate with it being imperceptible. A larger, visually obvious perturbation with a plausible explanation can be less suspicious than a smaller but weirder perturbation.


hasn't that already been studied in psychophysics, specifically as applied to lossy/perceptual compression?

i suppose the real goal would be a training procedure that tries to ignore stuff outside of the human percept. metamers, masking, noise and attention... oh my.


given that AI is primarily trained on web data I wonder if it's possible to attack other people's ML training in that way :-)


that's the idea! we know about adversarial inputs at inference time, this paper talks about adversarial perturbation of the model itself during training. what about undetectable adversarial training inputs where people do their own training but the model still ends up with hard to find (except for the adversary) weaknesses?


Disclosure: I won’t say what I do.

You should really consider things from a “what can humans perceive” standpoint. There are things you can do with ML and eye saccades that you will literally never see because of perceptual delay. If I can push a saccadic event below 50ms you will never notice it. https://en.wikipedia.org/wiki/Saccade

That’s one example.


"How can we keep our agent from being identified? Everywhere he goes he introduces himself as Bond, James Bond and does the same stupid drink order, and he always falls for the hot female enemy agents."

"Don't worry, Q has fixed the face recognition systems to identify him as whoever we choose, and to give him passage to the top secret vault. But it would help if if he would just shut up for a while".


I know that this is about inserting data into training models, but the problem is generic. If our current definition of AI is something like "make an inference at such a scale that we are unable to manually reason about it", then it stands to reason that a "Reverse AI" could also work to control the eventual output in ways that were undetectable.

That's where the real money is at: subtle AI bot armies that remain invisible yet influence other more public AI systems in ways that can never be discovered. This is the kind of thing that if you ever hear about it, it's failed.

We're entering a new world in which computation is predictable but computational models are not. That's going to require new ways of reasoning about behavior at scale.


He who controls the data controls the learner. - @pmddomingos

One might suggest that "model" is in fact an extremely bad name for the concept of a collection of condensed post-training decision-support data in the machine learning world, because it implies a faux-scientific air of objectivity, precision, peer review, and intelligibility for inspection that is entirely undue. IMHO better terminology would have been a new, clean term without conceptual baggage that included some recognition of its fundamental nature: computed/derived/one-way/known-fallible.

There are only two hard things in Computer Science: off by one errors, cache invalidation and naming things. - Phil Karlton

Quotes via https://github.com/globalcitizen/taoup


It sure looks like such models are going to have to undergo the same sort of scrutiny regular software does nowadays. No more closed-off and rationed access to the near-bleeding-edge.


Well, this shows ML models should receive the same scrutiny as regular software. But of course regular software often doesn't receive the scrutiny it ought to. And before this, people commented that ML was "the essence of technical debt".

With companies like Atlassian just going down and not coming back, one wonders whether the concepts of a technical Ponzi scheme and technical collapse might be the next thing after technical debt, and it seems like fragile ML would accelerate rather than stop such a scenario.


Wouldn't they deserve far more scrutiny? I know how to review your source code, but how do I review your ML model?


By reviewing the source code of the model, reviewing the training data, and reviewing weight initialization, but the latter should be specified in the source code. Also making it abundantly clear that the libraries used to make the model were not tampered with, maybe hashing their files or doing some reproducible builds wizardry...
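
For the artifact side of that audit, a sketch (the file names are only illustrative): pin SHA-256 digests of the weights, training code and dependency lockfile alongside the release so an auditor can verify they are reviewing what actually shipped.

    import hashlib
    from pathlib import Path

    def digest(path, chunk_size=1 << 20):
        """SHA-256 of a file, read in chunks so multi-gigabyte weight files are fine."""
        h = hashlib.sha256()
        with open(path, "rb") as f:
            for chunk in iter(lambda: f.read(chunk_size), b""):
                h.update(chunk)
        return h.hexdigest()

    # Example artifacts to pin with a model release (names are illustrative)
    artifacts = ["model_weights.pt", "train.py", "requirements.txt"]
    manifest = {name: digest(name) for name in artifacts if Path(name).exists()}
    for name, sha in manifest.items():
        print(f"{sha}  {name}")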

Edit: Now that I think about it, can't data poisoning happen when predicting, rather than just happening in the training phase? In that case, it's going to be complicated to work around that.


It's the ML version of Fnord https://en.m.wikipedia.org/wiki/Fnord


I wish they would use some term other than ‘back door’ for this. Some PHB is going to read the headline and think that using machine learning will let hackers into the network.


That's why you don't use cosine activation and always limit yourself to Lipschitz functions, I guess?


Backdoor seems like a misnomer since this does not represent a security vulnerability to the model host.


What was the size of the model(s)?



