How does AI fail?
An exploration of AI cybersecurity risk management, its threat model and security failure modes
The past few months have seen a rapid rise in the publication and adoption of AI technology, specifically so-called generative large language models such as ChatGPT. The situation seems an almost perfect poster child of a “slowly, then suddenly” moment, with OpenAI’s rapid progress capturing the imagination of many, threatening the status quo in various tech and art fields, and prompting many companies and organizations to scramble to integrate similar features into their projects, regardless of whether such a decision is wise.
As a security consultant, I took this as a trigger to explore and understand how this technology fails from a cybersecurity point of view. The adoption of new technologies leads to the need to update threat models and our understanding of the technology stack, and this high-level exploration is a first step towards sharing what I have learned so far.
I would like to thank Dmitry “DJ” Janushkevich for his thoughtful comments and recommendations when reviewing this article’s draft.
Setting the Scene
While Artificial Intelligence (AI) and Machine Learning (ML) are distinct concepts, they are often used interchangeably. AI most often refers to the broad field of intelligent learning systems, whereas ML is a subset of that field, referring more specifically to the actual algorithms used to learn and predict patterns.
Likewise, AI covers many different learning techniques, applied to many different types of data, for many different purposes. AI is already broadly in use for both generative and classification purposes, processing data such as images, video, audio, text, or other data such as network flows. Each type of data is susceptible to its own attacks, which do not necessarily translate equally to other data types. That said, for the same data type, attacks may very well be transferable between different models.
Machine learning involves a training stage, in which a model learns from data following a certain method or technique, and a deployment stage, in which the model is applied to new, unseen data to generate predictions or decisions. Learning can be supervised, unsupervised or reinforcement-based, relying on techniques such as classification or logistic regression.
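To make the two stages concrete, here is a minimal sketch using scikit-learn on synthetic data; the dataset, feature dimensions and model choice are purely illustrative.

```python
# Minimal sketch of the two machine learning stages; data and model are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Training stage: the model learns a decision boundary from labelled data.
X_train = rng.normal(size=(200, 2))
y_train = (X_train[:, 0] + X_train[:, 1] > 0).astype(int)  # synthetic labels
model = LogisticRegression().fit(X_train, y_train)

# Deployment (inference) stage: the trained model is applied to new,
# unlabelled samples to produce predictions.
X_new = rng.normal(size=(5, 2))
print(model.predict(X_new))
```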
An in-depth discussion of how AI models learn is beyond the scope of this article.
AI Attack Categories
Three categories of attacks against AI systems are commonly identified[1]:
Evasion Attacks, in which specially crafted samples are incorrectly classified by the model
Poisoning Attacks, or attacks against data and model during the algorithmic training stage
Privacy Attacks, or attacks leading to reconstruction or inference of either data, or the model itself
Poisoning attacks are aimed at the training stage of the model, whereas evasion and privacy attacks apply to the deployment stage where the model has already been trained.
The consequences of such attacks can be reduced availability of the model, integrity violations resulting in incorrect predictions, or a compromise of the privacy of the data or of the model itself.
Evasion Attacks
Evasion attacks attempt to trick a model into wrongly classifying a particular sample. For example, in the context of image classification, a STOP sign might be wrongly recognized because of some strategically placed stickers. Another example, in the context of spam filtering, is crafting a spam message so that it is misclassified as legitimate.
Multiple methods exist to generate adversarial examples in both white-box and black-box scenarios, such as gradient-based optimization, discrete optimization or Bayesian optimization.
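As an illustration, here is a minimal sketch of a gradient-based white-box attack, the fast gradient sign method, against a small PyTorch classifier; the toy model, input and perturbation budget are assumptions made for the example.

```python
# Minimal sketch of a gradient-based evasion attack (fast gradient sign method)
# against a small PyTorch classifier. Model, data and epsilon are illustrative.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 2))
loss_fn = nn.CrossEntropyLoss()

x = torch.randn(1, 10)   # a benign sample
y = torch.tensor([0])    # its true label
epsilon = 0.1            # maximum perturbation per feature

# Compute the gradient of the loss with respect to the input.
x_adv = x.clone().requires_grad_(True)
loss = loss_fn(model(x_adv), y)
loss.backward()

# Step in the direction that maximizes the loss: the perturbed sample may now be
# classified differently even though it stays very close to the original.
x_adv = (x_adv + epsilon * x_adv.grad.sign()).detach()
print(model(x).argmax(dim=1), model(x_adv).argmax(dim=1))
```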
Introducing additional robustness against such attacks, however, leads to a perpetual cat-and-mouse game between attacker and defender, much like in other cybersecurity disciplines. The literature suggests[1, pp. 13-14] that adversarial training and randomized smoothing are promising approaches for mitigating this threat.
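Randomized smoothing, for instance, can be sketched as follows: classify many noisy copies of the input and return the majority vote. The snippet reuses the toy PyTorch classifier and input from the previous sketch; the noise level and sample count are illustrative.

```python
# Minimal sketch of randomized smoothing at prediction time: classify many noisy
# copies of the input and take a majority vote. Reuses `model` and `x` from the
# previous sketch; sigma and n_samples are illustrative.
def smoothed_predict(model, x, sigma=0.25, n_samples=100):
    noisy = x + sigma * torch.randn(n_samples, x.shape[1])
    votes = model(noisy).argmax(dim=1)
    return votes.mode().values.item()  # most frequently predicted class

print(smoothed_predict(model, x))
```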
Poisoning Attacks
Poisoning attacks are a broad category of attacks that take place during the training stage of the model.
A distinction is made between availability poisoning attacks, which degrade the model’s performance for all samples, and targeted poisoning attacks, which affect only a small set of samples.
Multiple attack strategies are relevant for this type of attack, ranging from manipulating training examples and their labels to poisoning the model itself when learning in a federated manner.
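A minimal sketch of one such strategy, targeted label flipping against a scikit-learn classifier, could look as follows; the data, the attacker’s level of access and the fraction of flipped labels are assumptions for illustration.

```python
# Minimal sketch of a targeted label-flipping poisoning attack: an attacker who
# controls part of the training data flips labels of a chosen class so that the
# trained model misclassifies those samples. Data and fractions are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(1)
X = rng.normal(size=(500, 2))
y = (X[:, 0] > 0).astype(int)

poisoned_y = y.copy()
targets = np.flatnonzero(y == 1)                          # samples the attacker targets
flipped = rng.choice(targets, size=len(targets) // 3, replace=False)
poisoned_y[flipped] = 0                                   # flip a third of their labels

clean_model = LogisticRegression().fit(X, y)
poisoned_model = LogisticRegression().fit(X, poisoned_y)
print(clean_model.score(X, y), poisoned_model.score(X, y))  # accuracy on true labels drops
```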
Mitigations can involve training data sanitization, robust training and model inspection, but they can significantly increase model complexity and computational cost.
Privacy Attacks
The goal of privacy attacks is to reconstruct data that was used to train the model, to infer that certain data was used to train a model, or to learn details about the model itself, such as its architecture or parameters.
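As an illustration of the inference goal, a very simple membership inference attack compares a sample’s loss against a threshold, since training members tend to receive a lower loss than unseen samples; the model, data and split below are purely illustrative.

```python
# Minimal sketch of a loss-based membership inference attack: samples used in
# training tend to receive a lower loss than unseen ones, so thresholding a
# sample's loss hints at membership. Model, data and split are illustrative.
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(2)
X = rng.normal(size=(400, 5))
y = (X.sum(axis=1) > 0).astype(int)
X_train, y_train = X[:200], y[:200]   # members of the training set
X_out, y_out = X[200:], y[200:]       # samples never seen during training

model = LogisticRegression().fit(X_train, y_train)

def sample_loss(x, label):
    # Cross-entropy loss of a single sample under the trained model.
    p = model.predict_proba(x.reshape(1, -1))[0, 1]
    return -(label * np.log(p) + (1 - label) * np.log(1 - p))

member_losses = [sample_loss(x, t) for x, t in zip(X_train, y_train)]
outside_losses = [sample_loss(x, t) for x, t in zip(X_out, y_out)]
# Members tend to show a lower average loss; an attacker exploits this gap
# by thresholding the loss of a candidate sample.
print(np.mean(member_losses), np.mean(outside_losses))
```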
Past examples[1, p. 28] have shown that it is sometimes possible to reconstruct an individual’s data from aggregated statistical information, which constitutes a significant attack on privacy.
Within the context of copyright protection, the ability to infer whether a model has been trained on a particular source can have extensive consequences. While we might very well end up in a scenario where, for example, AI-generated art is not considered copyrightable, consensus on these matters has not yet been reached.
That said, artists and other parties that create “content” currently have, to my knowledge, no standardized opt-out mechanism at their disposal to clearly mark that their content may not be used for training models. I suspect that eventually we’ll see some form of canary mechanism, an AI equivalent of robots.txt if you will, for images, video, text and speech. Once that day arrives, it will remain necessary to validate adherence to such standards. I believe that inferring whether a certain source of data has been used to train a model will be a key verification method to achieve that, and not necessarily a vulnerability (though AI model companies might beg to differ).
Mitigations against privacy attacks do exist, however, most notably in the form of differential privacy, which guarantees an upper bound on how much an attacker can learn about particular data or the model, for a given privacy budget and probability. Differential privacy appears to be widely adopted, as it mitigates many of the privacy attacks mentioned above.
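As a sketch of the underlying idea, the classic Laplace mechanism adds noise calibrated to a query’s sensitivity and the privacy budget epsilon, bounding what any single individual’s record can reveal; the query and parameter values below are illustrative.

```python
# Minimal sketch of the Laplace mechanism from differential privacy: noise scaled
# to sensitivity / epsilon bounds what can be learned about any single record.
# The query, data and epsilon value are illustrative.
import numpy as np

def laplace_count(values, predicate, epsilon):
    """Differentially private count of values satisfying `predicate`."""
    true_count = sum(1 for v in values if predicate(v))
    sensitivity = 1.0  # adding or removing one record changes the count by at most 1
    noise = np.random.default_rng().laplace(scale=sensitivity / epsilon)
    return true_count + noise

ages = [23, 35, 41, 52, 29, 67, 44]
print(laplace_count(ages, lambda a: a > 40, epsilon=0.5))
```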
Mitigations and Trade-Offs
The discussion of the different attack categories introduced some mitigations applicable to each specific category, but it is important to highlight that all mitigations come at a cost and carry inherent trade-offs.
This cost might include additional data sanitization effort and model computing costs; a typical trade-off is that additional training with adversarial examples might enhance robustness but decrease accuracy on clean samples.
Likewise, a trade-off might need to be made between robustness and model fairness.
Studies[3] have investigated and demonstrated that adversarial examples are not easily detected and that many detection methods can be bypassed.
Are current threat modeling frameworks sufficient?
The paper “Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks”[4] discusses the need to cover risks to society when considering AI threat models, and lists the following factors that could lead to severe or catastrophic impact on society:
Potential for correlated robustness failures or other systemic risks across high-stakes application domains such as critical infrastructure or essential services.
Potential for other systemic impacts, which can be accumulated, accrued, correlated or compounded at societal scale, e.g. potential for correlated bias across large numbers of people or a large fraction of a society’s population
Potential for many high-impact uses or misuses beyond an originally intended use case, e.g., if an AI system is a cutting-edge large-scale language model, “foundation model” or another highly multi-purpose / general-purpose AI system, or if it enables recursive improvement of capabilities of cutting-edge AI system algorithms or architecture through code generation, architecture search, etc.
Potential for large harms from misspecified goals (e.g., using over-simplified or short-term metrics as proxies for desired longer-term outcomes)
Further guidance can be found as part of the NIST AI Risk Management Framework Playbook[2].
Where to go from here?
This article has introduced many complex subjects in surprisingly few words. Therefore, much more in-depth exploration is required to get a good understanding of the mentioned attacks, their context, their feasibility, their risk, their application, and their possible mitigations.
MITRE ATLAS (Adversarial Threat Landscape for Artificial-Intelligence Systems), is a knowledge base of adversary tactics, techniques, and case studies for machine learning (ML) systems based on real-world observations, demonstrations from ML red teams and security groups, and the state of the possible from academic research.
Counterfit is an AI penetration testing tool that combines multiple adversarial frameworks and attacks in a single interface. Another such tool is CleverHans.
Microsoft has published a number of interesting articles covering AI threat taxonomy, AI threat modeling guidance, and AI risk scoring guidance.
Google has published a self-study course in evaluating adversarial robustness.
Other Sources and Further Reading
[1] NIST, Adversarial Machine Learning: A Taxonomy and Terminology of Attacks and Mitigations (and the sources cited therein)
[2] NIST AI Risk Management Framework Playbook
[3] Nicholas Carlini, David Wagner, Adversarial Examples Are Not Easily Detected: Bypassing Ten Detection Methods
[4] Anthony M. Barrett, Dan Hendrycks, Jessica Newman, Brandie Nonnecke, Actionable Guidance for High-Consequence AI Risk Management: Towards Standards Addressing AI Catastrophic Risks