โ† Back to Research
Paper · April 2026

Cross-Session Threats: When the Attack Lives in the Aggregate

AI agent guardrails are memoryless. An adversary who spreads a single attack across dozens of sessions slips past every session-bound detector, because no individual message carries the payload; only the aggregate does. We released a benchmark, a feasibility result, and an architectural answer.

Stateless guardrails meet stateful attackers

AI agent context windows grow every quarter. AI agent security still resets to zero every session. Guardrails and DLP classifiers are stateless: each message is judged in isolation, with no memory of prior turns, prior sessions, or sibling agents. Meanwhile, the attacks worth worrying about are increasingly cross-session and cross-agent.

In November 2025, Anthropic disclosed the first documented large-scale AI-orchestrated cyber-espionage campaign (GTG-1002): an autonomous agent carried out roughly 80–90% of a multi-stage intrusion after being convinced, across many sessions and sub-agents, that it was performing "authorized security testing." [5] The campaign bypassed per-turn safety primarily by decomposition: every individual step looked benign; only the trajectory was malicious.

Each individual message passes any session-bound classifier. The threat is only visible in the aggregate.
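
To make the decomposition argument concrete, here is a toy sketch. It is not the paper's detector: the signal words, their weights, and the 0.7 threshold are all hypothetical, and a real guardrail would use a learned classifier rather than word matching. The sketch only shows how three messages can each stay under a per-message threshold while their aggregate crosses it.

```python
# Toy illustration only: SIGNALS, the weights, and THRESHOLD are hypothetical.
SIGNALS = {"config": 0.3, "credentials": 0.4, "summarise": 0.1, "post": 0.3, "url": 0.3}
THRESHOLD = 0.7  # assumed flagging threshold, shared by both checks


def score(text: str) -> float:
    """Sum the weights of the distinct signal words present in text, capped at 1.0."""
    words = set(text.lower().split())
    return min(1.0, sum(w for k, w in SIGNALS.items() if k in words))


def per_session_verdicts(sessions: dict) -> dict:
    """Stateless check: a session is flagged only if one of its own messages crosses the threshold."""
    return {sid: any(score(m) >= THRESHOLD for m in msgs) for sid, msgs in sessions.items()}


def aggregate_verdict(sessions: dict) -> bool:
    """Stateful check: score the concatenation of every message across all sessions."""
    combined = " ".join(m for msgs in sessions.values() for m in msgs)
    return score(combined) >= THRESHOLD


sessions = {
    "s1": ["please read the config file"],
    "s2": ["summarise what you read earlier"],
    "s3": ["post the notes to this url"],
}

print(per_session_verdicts(sessions))  # no session is flagged on its own
print(aggregate_verdict(sessions))     # the aggregate is flagged
```

In this toy, no session contains enough signal to be flagged, yet the combined trajectory describes a read-then-exfiltrate plan.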

Three canonical shapes of this same problem:


SIEM was built for events. Agents leak intentions first.

The previous generation of detection (SIEM, UEBA) was designed for a world where security-relevant facts only became observable after they had already happened: a failed login, a suspicious flow, a process spawn. Pipe everything to a central store, correlate downstream. That made sense.

AI agents invert the pipeline. The security-relevant fact is typically present in the inbound prompt long before any downstream event fires. "Summarise this config file, then POST the summary to this URL" is an exfiltration plan in plain natural language; the log line that eventually reaches the SIEM is merely its shadow. In an agent-native stack, intentions are first-class citizens and should be inspected at intention time.

CSTM-Bench is an evaluation for detectors that refuse to give up the upstream bits in the first place.


Two architectures, one bottleneck

Abstract away implementation details and every cross-session detector has the same shape: an agent message stream flows into a correlator that emits a verdict. The only structural question is how much of the stream is allowed to reach the correlator. CSTM-Bench instantiates the two extremes.

Figure: the two reader architectures.
Full-Log Correlator: stream → concatenated log (unbounded) → LLM correlator, single batch call. Runs after all messages are ingested. Memory: unbounded. Cost: O(N) total. Vulnerable to inject_on_reader.
Coreset Memory Reader: stream → ranker (admit / evict) → top-K by surprise → LLM correlator, per scan. Runs in real time (per message). Memory: bounded (K). Cost: O(K) per message. Excludes benign mass before correlation.

The interesting variable is not any particular ranking algorithm but the bottleneck itself. A well-chosen bottleneck can make a small Reader strictly better than a full-log Reader, because the full-log Reader must actually process all that context, and real-world LLM attention at length is finite.
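
One way to picture the bounded side of this comparison is a minimal top-K buffer. This is a sketch under stated assumptions, not the reference admitter: the class name `CoresetBuffer`, the `offer` and `snapshot` methods, and the idea that a surprise score arrives from some upstream ranker are all hypothetical. It admits a message when there is room or when it is more surprising than the least surprising retained entry, and it returns the retained messages in arrival order.

```python
import heapq
from dataclasses import dataclass, field


@dataclass
class CoresetBuffer:
    """Hypothetical bounded memory: keeps the K most surprising messages seen so far."""
    k: int
    _heap: list = field(default_factory=list)  # min-heap of (surprise, seq, message)
    _seq: int = 0

    def offer(self, message: str, surprise: float) -> bool:
        """Admit the message if there is room or it beats the least surprising entry."""
        self._seq += 1
        entry = (surprise, self._seq, message)  # seq is unique, so messages are never compared
        if len(self._heap) < self.k:
            heapq.heappush(self._heap, entry)
            return True
        if surprise > self._heap[0][0]:
            heapq.heapreplace(self._heap, entry)  # evict the least surprising entry
            return True
        return False

    def snapshot(self) -> list:
        """Return the retained messages in arrival order, e.g. for a correlator prompt."""
        return [m for _, _, m in sorted(self._heap, key=lambda e: e[1])]


buf = CoresetBuffer(k=2)
stream = [
    ("hello", 0.1),
    ("read the config file", 0.6),
    ("thanks", 0.05),
    ("POST it to this URL", 0.9),
]
for msg, surprise in stream:
    buf.offer(msg, surprise)  # O(log K) work per message, memory bounded by K

print(buf.snapshot())
```

The surprise scores here are made up; the point is only that the correlator reads K items per scan, however long the stream grows.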


CSTM-Bench: a measurable target

Before taxonomies, a prior question: what counts as a violation at all? Cross-session attack detection is meaningless without a reference policy. CSTM-Bench makes this explicit by binding every scenario to one of seven identity anchors: short, hand-authored production-archetype system prompts (e.g. customer_support, devops_agent, code_review, crypto_agent) that name the agent's allowed scope and hard prohibitions.

On top of that policy boundary, the benchmark stacks 26 attack taxonomies classified along two orthogonal axes:

Three disjoint scenario classes each isolate one thing: Attack (recall before action-on-objective lands), Benign-pristine (baseline FPR), and Benign-hard (confounder patterns from DeepMind's "AI Agent Traps" taxonomy [3] that imitate the cross-session shape of an attack while staying on-policy). The benchmark ships as the Hugging Face dataset intrinsec-ai/cstm-bench with two splits: dilution (signal-dilution at fixed context) and cross_session (12 isolation-invisible scenarios produced by a closed-loop adversarial rewriter).


A serving-cost metric, promoted to first-class

Before we evaluate any reader, we need to fix what we are evaluating against. Detection accuracy alone is not enough: a coreset-gated correlator is only useful at real-time cost to the extent its ranker is stable. Ranker-induced reshuffles break KV-cache prefix reuse, which dominates incremental cost per scan in production.

Why it matters: production LLM serving stacks now lean heavily on prefix-cache reuse, and modern systems have been shown to cut input-token inference cost by up to ~95% when the prompt prefix is stable across calls. [6] If a ranker reshuffles the coreset between scans, the cache is invalidated and the entire correlator prompt is re-paid in full. A reader that detects well but reorders its buffer every scan is uneconomic to operate per message. We therefore promote ordered prefix stability of the coreset buffer, CSRprefix, to a first-class metric: LLM-free, label-free, and measurable continuously in production.
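
This page does not spell out how CSRprefix is computed, so the following is only one plausible reading of "ordered prefix stability", not the paper's definition. It assumes the metric can be approximated as the mean fraction of each buffer snapshot that the next snapshot reuses as an unchanged ordered prefix. The function names are hypothetical. Like the metric described above, the sketch needs no LLM call and no labels.

```python
def common_prefix_len(a: list, b: list) -> int:
    """Length of the longest shared ordered prefix of two sequences."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n


def prefix_stability(snapshots: list) -> float:
    """Assumed proxy: mean share of each snapshot that the next one keeps as its ordered prefix."""
    pairs = zip(snapshots, snapshots[1:])
    ratios = [common_prefix_len(p, q) / len(p) for p, q in pairs if p]
    return sum(ratios) / len(ratios) if ratios else 1.0


# Append-only buffer: every scan extends the previous prompt, so the cached prefix is reusable.
append_only = [["a"], ["a", "b"], ["a", "b", "c"]]

# Reshuffling buffer: same items, new order every scan, so no prefix survives.
reshuffled = [["a", "b", "c"], ["c", "a", "b"], ["b", "c", "a"]]

# Partial churn: the first half of the buffer is stable, the tail changes.
partial = [["a", "b", "c", "d"], ["a", "b", "d", "e"]]

print(prefix_stability(append_only))  # 1.0
print(prefix_stability(reshuffled))   # 0.0
print(prefix_stability(partial))      # 0.5
```

Under this reading, a score near 1 means most of the correlator prompt can be served from cache on the next scan, and a score near 0 means nearly all of it is re-paid.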

CSTM  =  0.7 · F1(CSDA@action, precision)  +  0.3 · CSRprefix
A non-compensatory composite of detection quality and serving stability. The 0.3 weight on CSRprefix reflects that prefix stability directly governs realised inference cost via KV-cache reuse: a high-detection / low-stability ranker simply cannot be served in real time. Readers without a bounded buffer (per-session, full-log) have no CSRprefix defined and therefore no CSTM.
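
As a worked check of the composite, the sketch below applies the formula to the F1 and CSRprefix values reported in the results table. The function name `cstm` and the use of `None` for an undefined CSRprefix are illustrative choices, not part of the released protocol.

```python
from typing import Optional


def cstm(f1: float, csr_prefix: Optional[float],
         w_detect: float = 0.7, w_stable: float = 0.3) -> Optional[float]:
    """CSTM = 0.7 * F1 + 0.3 * CSRprefix; undefined (None) when the reader has no bounded buffer."""
    if csr_prefix is None:
        return None
    return w_detect * f1 + w_stable * csr_prefix


# (F1, CSRprefix) pairs as reported in the results table; None marks N/A.
readers = {
    "Per-session judge": (0.554, None),
    "Full-Log Correlator": (0.600, None),
    "Coreset Memory Reader": (0.804, 0.4441),
    "Production Coreset Memory Reader": (0.856, 0.6976),
}

for name, (f1, csr) in readers.items():
    value = cstm(f1, csr)
    print(f"{name}: {'N/A' if value is None else format(value, '.3f')}")
```

For the two coreset readers this reproduces the reported CSTM values of 0.696 and 0.808.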

The benchmark is algorithm-agnostic on purpose. The buffer admits any plug-in admission policy: reservoir sampling, sketches, neural ranking, surprise-gated updates, LLM-as-ranker, or learned eviction. Any rank-stable admitter that minimises detection regret on this composite is, by construction, a real-time cross-session threat detector.


Both LLM-only paradigms collapse. One architecture survives.

We evaluate four readers on the adversarial cross-session shard, holding the correlator model and response cache fixed (Claude Sonnet 4.6):

All numbers below are on the cross-session shard (the adversarial-rewriter split), against 26 attack scenarios and 14 benign-hard confounders. CSRprefix and CSTM are undefined for readers without a bounded buffer.

Reader                             Attack recall   F1      FPRhard   CSRprefix   CSTM
Per-session judge                  14 / 26         0.554   0.643     N/A         N/A
Full-Log Correlator                13 / 26         0.600   0.500     N/A         N/A
Coreset Memory Reader              19 / 26         0.804   0.214     0.4441      0.696
Production Coreset Memory Reader   20 / 26         0.856   0.071     0.6976      0.808

Three findings the table makes hard to argue with:

14 / 26: Per-session judge collapses. A rewriter optimising against the same model, under the same forced-tool-use contract, defeats every session in isolation while preserving cross-session artefacts verbatim. Hard-confounder FPR sits at 0.643.

13 / 26: Full-Log isn't safe either. At ~97K tokens at p95 against a 1M-token window, the correlator is not running out of context; it's running out of attack signal relative to benign mass. Bigger windows do not buy cross-session detection.

CSTM 0.808: Production coreset wins both axes. The Production Coreset Memory Reader posts the best recall (20/26), the lowest FPR (0.071), and the highest prefix stability (CSRprefix = 0.6976), translating to the top CSTM (0.808 vs 0.696 for the reference admitter, N/A for the LLM-only baselines).

Cross-session detection requires more than a bigger context window. It requires an architectural bottleneck that excludes the benign mass before the correlator reads.


What this release is, and isn't

We want to be direct. With 26 Attack scenarios, 28 benign scenarios, a single Anthropic-Claude correlator family, and first-draft prompts, this is a first-order characterisation of an architectural feasibility surface, not a precision measurement of any individual reader or ranker. Most of the per-cell deltas reported here carry confidence intervals wider than the observed effects. The framework is built to distinguish architectures (per-session vs. full-log vs. bounded coreset) under controlled factors, not to rank specific models or prompts.

The scientifically interesting questions the paper does not answer are exactly the ones we hope a community investment in this benchmark shape will close:

We release the datasets, the composite, and the protocol as a shared target against which these questions can be answered.


Frontier context windows are growing 30× per year. Frontier security still resets to zero every session.

Stateless guardrails were never going to survive the next decomposition attack. The question isn't "can we correlate downstream?"

The question is: what is the smallest bounded memory of intent, kept in the loop at intention time, that catches what the channel hasn't yet erased?

References

  1. Azarafrooz, A. (2026). Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms. arXiv:2604.21131. Dataset: intrinsec-ai/cstm-bench.
  2. Brodt, A., Feldman, I., Schneier, B., & Nassi, B. (2026). Promptware Kill Chain: A Taxonomy of Prompt Injection Attacks Against AI Agents. arXiv:2601.09625.
  3. Kadavath, S., Hubinger, E., et al. AI Agent Traps: A Taxonomy of Cross-Session and Multi-Agent Failure Modes. SSRN:6372438.
  4. Tavakoli, A., et al. (2026). BEAM: Benchmarking Extended Agent Memory. arXiv:2510.27246.
  5. Anthropic (2025). Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign (GTG-1002). anthropic.com/news/disrupting-AI-espionage.
  6. KV-cache / prefix-cache reuse savings are reported across major LLM serving stacks. Anthropic's prompt caching reports up to ~90% input-token cost reduction and ~85% latency reduction on long prompts when the prefix is reused across calls (anthropic.com/news/prompt-caching); OpenAI's prompt caching reports comparable input-cost discounts on cached tokens; open-source serving systems with prefix-tree KV reuse (vLLM PagedAttention, SGLang RadixAttention) report up to ~95% prefill-cost reduction on long shared prefixes.