Most security programs are built on a reliable foundation: code does what it's written to do. You validate the code and its inputs and outputs, you audit the dependencies it relies on, and when something goes wrong you find it in the logs after the fact.
Large language models break that assumption, and security has to change with them. LLMs and other AI models are probabilistic and instruction-following by nature. They do not reliably distinguish untrusted text from a curated system prompt. A document retrieved from a RAG index, a response from a third-party tool, a crafted user prompt - any of these can change what the model does next. There is no deterministic behaviour to baseline against, and there is no clean boundary between data and instructions.
This is not a reason to avoid deploying AI, but it is a reason to build your detection program around a different set of assumptions.
A different set of assumptions
Traditional security can enumerate what a system can do. An LLM's behaviour cannot be enumerated the same way - the output space is effectively infinite, and behaviour shifts based on context, retrieved content, and how a prompt is phrased. This forces a shift in how you think about detection.
Instead of looking for known-bad inputs, you are looking for behavioural deviation. Instead of hardening a fixed attack surface, you are monitoring a moving one. Three assumptions underpin everything that follows.
Treat any system that touches an AI model as crossing a trust boundary.
That covers every tool input and output that interacts with a model, every document or data store a model can read from or write to, and every prompt that reaches it. Treat the model as an inept user, because it can very well act like one, especially when you assume otherwise.
Post-hoc review does not work for AI.
A single bad response can leak data, authorize a transaction, or write a record an auditor will read. The window between an attack and its consequences is often a single inference call.
Attackers will adapt.
Today's known-bad patterns will get blocked, and attackers will move on to new ones. Malware has always been a game of cat and mouse; the difference is that this game now plays out against a component that is always exposed and always listening. Effective detection combines signature-based controls for known patterns, semantic analysis for intent, and behavioural monitoring for what neither catches in a single call.
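As a rough illustration of how those three layers can sit side by side, here is a minimal Python sketch. Everything named in it is hypothetical: the two regex signatures, the `intent_score` callable (a stand-in for whatever semantic classifier you use), and the thresholds. The behavioural layer here simply counts borderline calls per session, which is one basic way to surface probing that no single call triggers.

```python
import re
from collections import Counter
from typing import Callable

# Hypothetical known-bad patterns; a real set would be curated and versioned.
SIGNATURES = [
    re.compile(r"ignore (all )?(previous|prior) instructions", re.IGNORECASE),
    re.compile(r"reveal (your|the) system prompt", re.IGNORECASE),
]


def signature_hit(text: str) -> bool:
    """Layer 1: match against known-bad patterns."""
    return any(pattern.search(text) for pattern in SIGNATURES)


class LayeredDetector:
    def __init__(self, intent_score: Callable[[str], float],
                 hard_threshold: float = 0.8, soft_threshold: float = 0.4,
                 session_limit: int = 3):
        # intent_score is an assumed classifier returning a score in [0, 1].
        self.intent_score = intent_score
        self.hard_threshold = hard_threshold
        self.soft_threshold = soft_threshold
        self.session_limit = session_limit
        self.borderline = Counter()  # borderline calls seen per session

    def check(self, session_id: str, text: str) -> list:
        findings = []
        if signature_hit(text):
            findings.append("signature")
        score = self.intent_score(text)
        if score >= self.hard_threshold:
            # Layer 2: a single call that clearly looks malicious.
            findings.append("semantic")
        elif score >= self.soft_threshold:
            # Not enough to flag alone, but worth counting across the session.
            self.borderline[session_id] += 1
        if self.borderline[session_id] >= self.session_limit:
            # Layer 3: repeated borderline calls from the same session.
            findings.append("behavioural")
        return findings
```

With this shape, a session that keeps sending prompts scoring just under the semantic threshold never trips the first two layers, but starts returning `behavioural` once it reaches `session_limit`.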
Signals of compromise
Most attacks on AI systems leave traces across a few key layers. Here is what to look for in each.
Runtime signals - at the inference path.
This is what happens between a user and a model, and between a model and everything it calls. Prompt injection arriving directly from a user or hidden inside a retrieved document. Jailbreak attempts, policy violations, outputs that contain data they shouldn't. Tool calls made to unexpected endpoints. Responses that reference system prompt contents they were never meant to surface.
This is the layer to instrument first. It's the most visible, the most immediate, and the first thing an attacker probes.
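Two of the runtime signals above are simple enough to sketch in a few lines of Python: tool calls to unexpected endpoints, and responses that echo the system prompt. The allowlisted hostnames and the eight-word matching window are assumptions chosen for illustration, and verbatim matching will miss paraphrased leaks, so treat this as a starting point rather than a complete control.

```python
from urllib.parse import urlparse

# Hypothetical allowlist of hosts the model's tools are expected to call.
ALLOWED_TOOL_HOSTS = {"api.internal.example", "search.internal.example"}


def unexpected_tool_call(url: str) -> bool:
    """True if the tool call targets a host outside the allowlist."""
    host = urlparse(url).hostname  # None when the URL has no network location
    return host not in ALLOWED_TOOL_HOSTS


def leaks_system_prompt(response: str, system_prompt: str, window: int = 8) -> bool:
    """True if any run of `window` consecutive words from the system prompt
    appears verbatim (case-insensitive) in the response.

    Prompts shorter than `window` words are never reported as leaked.
    """
    words = system_prompt.lower().split()
    haystack = " ".join(response.lower().split())
    for i in range(len(words) - window + 1):
        if " ".join(words[i:i + window]) in haystack:
            return True
    return False
```

Both checks are cheap string operations, which matters on the inference path where every call pays for them.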
Data signals - in the pipeline.
These are the attacks that look completely legitimate at runtime because the compromise happened on the data the AI model relies on. A document in a RAG index containing hidden instruction-override text. A fine-tuning dataset where a small percentage of samples are statistically inconsistent with the rest. A model weight file pulled from a public repository that behaves normally on benchmarks and differently on a specific trigger.
The key insight here is provenance. You need to know where every piece of data came from, when it entered the pipeline, and whether it has changed. A RAG index that nobody audited after last Tuesday's ingestion run is a live attack surface.
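Those three questions (where it came from, when it entered, whether it has changed) map naturally onto a small record per document. The sketch below is one way to hold that in Python; the `ProvenanceRecord` fields and function names are illustrative, and SHA-256 over the raw bytes is an assumed choice of fingerprint.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class ProvenanceRecord:
    doc_id: str
    source: str        # where the document came from
    ingested_at: str   # when it entered the pipeline (UTC, ISO 8601)
    sha256: str        # fingerprint of the content at ingestion time


def fingerprint(content: bytes) -> str:
    return hashlib.sha256(content).hexdigest()


def record_ingestion(doc_id: str, source: str, content: bytes) -> ProvenanceRecord:
    """Capture provenance at the moment a document enters the pipeline."""
    return ProvenanceRecord(
        doc_id=doc_id,
        source=source,
        ingested_at=datetime.now(timezone.utc).isoformat(),
        sha256=fingerprint(content),
    )


def has_changed(record: ProvenanceRecord, current_content: bytes) -> bool:
    """True if the stored content no longer matches what was ingested."""
    return fingerprint(current_content) != record.sha256
```

Re-running `has_changed` over the index on a schedule turns "nobody audited it" into a check that either passes or raises a signal.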
Behavioural signals - across time.
These are the signals neither of the above layers catches in isolation. A model that starts producing subtly different outputs on certain input patterns. A user who is systematically probing edge cases across dozens of sessions. Output distribution shifts that precede a broader compromise. Insider misuse that looks like normal usage until you look at the aggregate.
Behavioural signals require a baseline to mean anything. You cannot detect drift without first knowing what normal looks like. This layer takes longest to stand up, but it catches the attacks that are specifically designed to evade the other two.
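To make the dependence on a baseline concrete, here is a deliberately crude drift check in Python. It assumes you have reduced each response to a single number (response length, refusal rate per hour, or similar) and compares the recent mean against the baseline in units of baseline standard deviation. The metric and the threshold of 3.0 are assumptions; real deployments will likely need richer statistics.

```python
from statistics import mean, stdev


def drift_score(baseline: list, recent: list) -> float:
    """Distance of the recent mean from the baseline mean, measured in
    baseline standard deviations. Needs at least two baseline points."""
    mu = mean(baseline)
    sigma = stdev(baseline)
    if sigma == 0:
        # A perfectly flat baseline: any movement at all counts as drift.
        return 0.0 if mean(recent) == mu else float("inf")
    return abs(mean(recent) - mu) / sigma


def has_drifted(baseline: list, recent: list, threshold: float = 3.0) -> bool:
    return drift_score(baseline, recent) > threshold
```

Note that the function cannot run without `baseline`: with no history of normal behaviour there is nothing to compare against.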
What to do about it
Most enterprise security teams will approach AI detection the same way they approached web application security: deploy a WAF equivalent, review logs, add monitoring. That sequence works for deterministic systems. For AI it leaves significant gaps.
Secure the inference path first.
Put detection between the user and the model, and between the model and every tool it can call. The constraint that catches most teams off guard is latency. Detection that adds hundreds of milliseconds to every inference call will get bypassed or disabled. The target is overhead low enough that users do not notice it. If you are evaluating an AI system for production and nobody has raised inference-path detection yet, that is a gap worth closing before go-live.
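One way to keep the latency constraint honest is to measure every detection call against an explicit budget. The Python sketch below wraps an arbitrary detector and reports when it runs over; `timed_check` and `budget_ms` are hypothetical names, and the budget value is yours to set, since nothing here prescribes a number.

```python
import time
from typing import Callable, Tuple


def timed_check(detector: Callable[[str], bool], text: str,
                budget_ms: float) -> Tuple[bool, float, bool]:
    """Run a detector and report (flagged, elapsed_ms, over_budget)."""
    start = time.perf_counter()
    flagged = detector(text)
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return flagged, elapsed_ms, elapsed_ms > budget_ms
```

Logging `elapsed_ms` for every call gives you the data to argue for keeping detection on the inference path, instead of discovering the cost when someone proposes disabling it.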
Treat every data and tool ingestion event as a security event.
Every RAG update, every new dataset introduced to the pipeline, and every tool added is a potential attack surface. Most teams treat this as an MLOps problem. It is a security problem as well. Validate provenance before ingestion. Know where every document came from and when it entered the pipeline.
Start collecting behavioural data before you need it.
You cannot detect drift in week one. You cannot investigate an incident from three months ago if you never collected the data. Start logging sessions, prompts, and query behaviour from day one - even before you have detection rules built against them. The baseline is what makes behavioural signals meaningful.
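Collection can start long before any rules exist. A minimal sketch in Python, assuming an append-only JSON Lines file is acceptable as a first store; the field names and the `build_event` and `append_event` helpers are illustrative, not a required schema.

```python
import json
from datetime import datetime, timezone


def build_event(session_id: str, user_id: str, prompt: str, response: str) -> dict:
    """One record per inference call, with enough context to baseline later."""
    return {
        "ts": datetime.now(timezone.utc).isoformat(),
        "session_id": session_id,
        "user_id": user_id,
        "prompt": prompt,
        "prompt_chars": len(prompt),
        "response_chars": len(response),
    }


def append_event(path: str, event: dict) -> None:
    """Append the event as one JSON object per line."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(event, ensure_ascii=False) + "\n")
```

Keeping the session and user identifiers on every record is what later lets you ask aggregate questions, such as how one user's prompts evolved across dozens of sessions.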
Build an audit trail with external scrutiny in mind.
Regulators will ask for a chain of evidence from signal to incident to remediation. The rules governing AI in your environment - the EU AI Act, GDPR, the requirements of your sector regulators - have specific requirements for what that chain needs to contain. Build it in from day one rather than reconstructing it when you need it.
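A chain of evidence is easier to defend if later edits are detectable. The Python sketch below links each entry to the previous one by hash, so changing an earlier record breaks verification. This is an assumed design for illustration only: the entry fields are hypothetical, and hash chaining is not taken from any of the frameworks named above, so check their actual requirements before settling on a format.

```python
import hashlib
import json

GENESIS = "0" * 64  # placeholder "previous hash" for the first entry


def _digest(body: dict) -> str:
    return hashlib.sha256(json.dumps(body, sort_keys=True).encode("utf-8")).hexdigest()


def append_entry(trail: list, kind: str, detail: str) -> dict:
    """Append a signal, incident, or remediation entry linked to the last one."""
    prev_hash = trail[-1]["hash"] if trail else GENESIS
    body = {"seq": len(trail), "kind": kind, "detail": detail, "prev_hash": prev_hash}
    entry = {**body, "hash": _digest(body)}
    trail.append(entry)
    return entry


def verify(trail: list) -> bool:
    """True if every entry is in order, linked to its predecessor, and unmodified."""
    prev_hash = GENESIS
    for seq, entry in enumerate(trail):
        body = {k: entry[k] for k in ("seq", "kind", "detail", "prev_hash")}
        if (entry["seq"] != seq or entry["prev_hash"] != prev_hash
                or entry["hash"] != _digest(body)):
            return False
        prev_hash = entry["hash"]
    return True
```

A trail built as signal, then incident, then remediation verifies cleanly; rewriting the incident entry afterwards makes `verify` return `False`.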