Topic

AI Red Teaming

Methods, case studies, and tooling for red teaming AI systems end to end.

ai red teamingllm red teamingjailbreakadversarial testingpyritpromptfoo
Evergreen Overview

AI red teaming is the practice of testing AI-enabled systems the way an adversary, abusive user, or curious operator would interact with them in production. The real work usually sits in the surrounding application context rather than in isolated model prompts.

What AI red teaming includes
  • Prompt abuse, indirect injection, and trust-boundary failures
  • Tool misuse, privilege expansion, and unsafe action chains
  • System-level evaluation of how the model, workflow, and controls behave together
What teams usually need to answer
  • What an attacker can influence, read, or trigger through the model
  • Where approvals, isolation, monitoring, or policy controls are missing
  • Which failures are model problems versus product and architecture problems
Who this page is for
  • People studying AI evaluation and red-team programs
  • Product and platform teams launching copilots or agents
  • Leaders who need concrete examples of AI risk in operational systems
References

Current notes, events, and source material

These items are included because they add useful evidence, framing, implementation detail, or upcoming context for teams working in this area.

The Hacker News AI Security June 11, 2026 news

New Attacks Trick OpenClaw AI Agent Into Running Code and Leaking Secrets

Two security teams have shown, in separate research published this week, that OpenClaw, the popular self-hosted AI agent, can be driven to run attacker-controlled code or hand over sensitive data through ordinary-looking inputs. Imperva buried instructions inside shared contacts, vCards, and location pins that the agen

The Hacker News AI Security June 10, 2026 news

Anthropic Releases Claude Fable 5, Its Most Powerful AI Yet, With Cyber Safeguards

On June 9, Anthropic released Claude Fable 5, the most capable model it has ever made, generally available. It also did something unusual: it shipped one model as two products, split not by capability but by a layer of safety classifiers. Fable 5 goes to the public. Its twin, Claude Mythos 5, the same underlying model

Microsoft Security Blog June 5, 2026 news

Securing CI/CD in an agentic world: Claude Code Github action case

Microsoft Threat Intelligence identified a prompt injection pathway in Claude Code GitHub Action that allowed access to workflow secrets under specific conditions. This research examines the attack chain, responsible disclosure process, Anthropic's mitigation, and guidance for securing AI-powered CI/CD workflows. The p

Microsoft Security Blog June 4, 2026 news

Updating the taxonomy of failure modes in agentic AI systems: What a year of red teaming taught us

A surge in real-world attacks against agentic AI systems is reshaping how we think about risk. Based on 12 months of red teaming, this update introduces seven new failure modes, from supply chain compromise to goal hijacking, and the practical mitigations teams need now. The post Updating the taxonomy of failure modes

Microsoft Security Blog June 3, 2026 news

Preinstall to persistence: Inside the Red Hat npm Miasma credential-stealing campaign

A large-scale npm supply chain attack compromised over 90 versions of @redhat-cloud-services packages, silently infecting CI/CD environments and developer systems. The malicious code steals credentials from GitHub, cloud platforms, and local machines, then spreads like a worm by republishing trusted packages. Discover

Krebs on Security May 12, 2026 news

Patch Tuesday, May 2026 Edition

Artificial intelligence platforms may be just as susceptible to social engineering as human beings, but they are proving remarkably good at finding security vulnerabilities in human-made computer code. That reality is on full display this month with some of the more widely-used software makers -- including Apple, Googl

Anthropic Frontier Red Team April 7, 2026 news

Assessing Claude Mythos Preview’s cybersecurity capabilities

Claude Mythos Preview is a new general-purpose language model that is strikingly capable at computer security tasks. This post provides technical details for researchers and practitioners who want to understand exactly how we have been testing this model, and what we have found over the past month. We hope this will sh

Krebs on Security March 8, 2026 news

How AI Assistants are Moving the Security Goalposts

AI-based assistants or "agents" -- autonomous programs that have access to the user's computer, files, online services and can automate virtually any task -- are growing in popularity with developers and IT workers. But as so many eyebrow-raising headlines over the past few weeks have shown, these powerful and assertiv