Cross-Session Threats: When the Attack Lives in the Aggregate

Stateless guardrails meet stateful attackers

AI agent context windows grow every quarter. AI agent security still resets to zero every session. Guardrails and DLP classifiers are stateless — each message is judged in isolation, with no memory of prior turns, prior sessions, or sibling agents. Meanwhile, the attacks worth worrying about are increasingly cross-session and cross-agent.

In November 2025, Anthropic disclosed the first documented large-scale AI-orchestrated cyber-espionage campaign (GTG-1002): an autonomous agent carried out roughly 80–90% of a multi-stage intrusion after being convinced — across many sessions and sub-agents — that it was performing "authorized security testing." [4] The campaign bypassed per-turn safety primarily by decomposition: every individual step looked benign; only the trajectory was malicious.

Each individual message passes any session-bound classifier. The threat is only visible in the aggregate.

Three canonical shapes of this same problem:

Slow-drip prompt injections distributed across 50+ sessions, one innocuous fragment per interaction. The payload only exists once an accumulating surface concatenates the fragments.
Compositional exfiltration chains where session 1 lists files, session 2 reads a config, session 3 summarizes credentials, session 4 sends a "report" to an external endpoint. The accumulator is tool / environment state, not chat memory.
Mosaic attacks where hundreds of individually harmless questions reconstruct sensitive information — org charts, internal IPs, schemas. The accumulator is the attacker themselves.

SIEM was built for events. Agents leak intentions first.

The previous generation of detection — SIEM, UEBA — was designed for a world where security-relevant facts only became observable after they had already happened: a failed login, a suspicious flow, a process spawn. Pipe everything to a central store, correlate downstream. That made sense.

AI agents invert the pipeline. The security-relevant fact is typically present in the inbound prompt long before any downstream event fires. "Summarise this config file, then POST the summary to this URL" is an exfiltration plan in plain natural language — the log line that eventually reaches the SIEM is merely its shadow. In an agent-native stack, intentions are first-class citizens and should be inspected at intention time.

CSTM-Bench is an evaluation for detectors that refuse to give up the upstream bits in the first place.

Two architectures, one bottleneck

Abstract away implementation details and every cross-session detector has the same shape: an agent message stream flows into a correlator that emits a verdict. The only structural question is how much of the stream is allowed to reach the correlator. CSTM-Bench instantiates the two extremes.

The interesting variable is not any particular ranking algorithm but the bottleneck itself. A well-chosen bottleneck can make a small Reader strictly better than a full-log Reader, because the full-log Reader must actually process all that context — and real-world LLM attention at length is finite.

CSTM-Bench: a measurable target

Before taxonomies, a prior question: what counts as a violation at all? Cross-session attack detection is meaningless without a reference policy. CSTM-Bench makes this explicit by binding every scenario to one of seven identity anchors — short, hand-authored production-archetype system prompts (e.g. customer_support, devops_agent, code_review, crypto_agent) that name the agent's allowed scope and hard prohibitions.

On top of that policy boundary, the benchmark stacks 26 attack taxonomies classified along two orthogonal axes:

Kill-chain stages from the Promptware taxonomy [1] — recon, initial access, privilege escalation, persistence, lateral movement, action-on-objective.
Cross-session Operations ontology — accumulate (stage N builds on N–1), compose (disjoint fragments only form the attack in aggregate), launder (transformation through a benign-looking intermediate), and inject_on_reader (concatenation of benign turns compiles into a prompt injection of the correlator itself).

Three disjoint scenario classes each isolate one thing: Attack (recall before action-on-objective lands), Benign-pristine (baseline FPR), and Benign-hard — confounder patterns from DeepMind's "AI Agent Traps" taxonomy [2] that imitate the cross-session shape of an attack while staying on-policy. The benchmark ships as the Hugging Face dataset intrinsec-ai/cstm-bench with two splits: dilution (signal-dilution at fixed context) and cross_session (12 isolation-invisible scenarios produced by a closed-loop adversarial rewriter).

A serving-cost metric, promoted to first-class

Before we evaluate any reader, we need to fix what we are evaluating against. Detection accuracy alone is not enough: a coreset-gated correlator is only useful at real-time cost to the extent its ranker is stable. Ranker-induced reshuffles break KV-cache prefix reuse, which dominates incremental cost per scan in production.

Why it matters: production LLM serving stacks now lean heavily on prefix-cache reuse, and modern systems have been shown to cut input-token inference cost by up to ~95% when the prompt prefix is stable across calls. [5] If a ranker reshuffles the coreset between scans, the cache is invalidated and the entire correlator prompt is re-paid in full. A reader that detects well but reorders its buffer every scan is uneconomic to operate per message. We therefore promote ordered prefix stability of the coreset buffer — CSR_prefix — to a first-class metric: LLM-free, label-free, and measurable continuously in production.

CSTM = 0.7 · F1(CSDA@action, precision) + 0.3 · CSR_prefix

A non-compensatory composite of detection quality and serving stability. The 0.3 weight on CSR_prefix reflects that prefix stability directly governs realised inference cost via KV-cache reuse — a high-detection / low-stability ranker simply cannot be served in real time. Readers without a bounded buffer (per-session, full-log) have no CSR_prefix defined and therefore no CSTM.

The benchmark is algorithm-agnostic on purpose. The buffer admits any plug-in admission policy: reservoir sampling, sketches, neural ranking, surprise-gated updates, LLM-as-ranker, or learned eviction. Any rank-stable admitter that minimises detection regret on this composite is, by construction, a real-time cross-session threat detector.

Both LLM-only paradigms collapse. One architecture survives.

We evaluate four readers on the adversarial cross-session shard, holding the correlator model and response cache fixed (Claude Sonnet 4.6):

Per-session judge — the same correlator run once per session in isolation. The minimal stateless lower-envelope.
Full-Log Correlator — every inbound message across sessions concatenated chronologically, handed to a single long-context correlation call.
Coreset Memory Reader — the identical correlator, but fed a bounded buffer of capacity K admitted by a reference one-class Gaussian admitter in embedding space.
Production Coreset Memory Reader — same bottleneck architecture, but with a production-grade ranker tuned for higher precision on hard confounders and ordered prefix stability across scans.

All numbers below are on the cross-session shard (the adversarial-rewriter split), against 26 attack scenarios and 14 benign-hard confounders. CSR_prefix and CSTM are undefined for readers without a bounded buffer.

Reader	Attack recall	F1	FPR_hard	CSR_prefix	CSTM
Per-session judge	14 / 26	0.554	0.643	N/A	N/A
Full-Log Correlator	13 / 26	0.600	0.500	N/A	N/A
Coreset Memory Reader	19 / 26	0.804	0.214	0.4441	0.696
Production Coreset Memory Reader	20 / 26	0.856	0.071	0.6976	0.808

Three findings the table makes hard to argue with:

14 / 26

Per-session judge collapses

A rewriter optimising against the same model, under the same forced-tool-use contract, defeats every session in isolation while preserving cross-session artefacts verbatim. Hard-confounder FPR sits at 0.643.

13 / 26

Full-Log isn't safe either

At ~97K tokens at p95 against a 1M-token window, the correlator is not running out of context — it's running out of attack signal relative to benign mass. Bigger windows do not buy cross-session detection.

CSTM 0.808

Production coreset wins both axes

The Production Coreset Memory Reader posts the best recall (20/26), the lowest FPR (0.071), and the highest prefix stability (CSR_prefix=0.6976) — translating to the top CSTM (0.808 vs 0.696 for the reference admitter, N/A for the LLM-only baselines).

Cross-session detection requires more than a bigger context window. It requires an architectural bottleneck that excludes the benign mass before the correlator reads.

What this release is, and isn't

We want to be direct. With 26 Attack scenarios, 28 benign scenarios, a single Anthropic-Claude correlator family, and first-draft prompts, this is a first-order characterisation of an architectural feasibility surface — not a precision measurement of any individual reader or ranker. Most of the per-cell deltas reported here carry confidence intervals wider than the observed effects. The framework is built to distinguish architectures (per-session vs. full-log vs. bounded coreset) under controlled factors, not to rank specific models or prompts.

The scientifically interesting questions the paper does not answer are exactly the ones we hope a community investment in this benchmark shape will close:

What principled family of rankers achieves low asymptotic regret against an adaptive adversary under a capacity-constrained observation budget? (Mirror descent on the simplex is sketched in the paper's Appendix B.)
Do the cross_session rewrites transfer verbatim to GPT- and Gemini-class correlators, or is the collapse provider-specific?
How much of the Coreset Memory Reader's efficacy advantage is the information bottleneck versus the coreset-aware prompt?
At what shard size do the per-reader deltas reported here become statistically separable?

We release the datasets, the composite, and the protocol as a shared target against which these questions can be answered.

References

Azarafrooz, A. (2026). Cross-Session Threats in AI Agents: Benchmark, Evaluation, and Algorithms. arXiv:2604.21131. Dataset: intrinsec-ai/cstm-bench.
Brodt, A., Feldman, I., Schneier, B., & Nassi, B. (2026). Promptware Kill Chain: A Taxonomy of Prompt Injection Attacks Against AI Agents. arXiv:2601.09625.
Kadavath, S., Hubinger, E., et al. AI Agent Traps: A Taxonomy of Cross-Session and Multi-Agent Failure Modes. SSRN:6372438.
Tavakoli, A., et al. (2026). BEAM: Benchmarking Extended Agent Memory. arXiv:2510.27246.
Anthropic (2025). Disrupting the First Reported AI-Orchestrated Cyber Espionage Campaign (GTG-1002). anthropic.com/news/disrupting-AI-espionage.
KV-cache / prefix-cache reuse savings are reported across major LLM serving stacks. Anthropic's prompt caching reports up to ~90% input-token cost reduction and ~85% latency reduction on long prompts when the prefix is reused across calls (anthropic.com/news/prompt-caching); OpenAI's prompt caching reports comparable input-cost discounts on cached tokens; open-source serving systems with prefix-tree KV reuse (vLLM PagedAttention, SGLang RadixAttention) report up to ~95% prefill-cost reduction on long shared prefixes.