Engineering teams are failing to adequately track incidents involving autonomous agents due to a lack of connection between chaos engineering and agent governance, leading to unmonitored risks that can cause significant production failures. To address this, organizations must integrate agent actions into chaos engineering frameworks and establish a resilience budget that accounts for system stress capacity before allowing agents to act autonomously.
For professionals managing AI deployments, a critical insight is to treat autonomous agent actions as chaos events within a chaos engineering framework. This means that every agent action should be integrated into the same governance layer as human-initiated chaos experiments, using live signals like SLO burn rates and latency trends to prevent infrastructure failures. Conduct an audit of all agents affecting infrastructure to ensure they adhere to resilience budget models, preventing them from operating outside your risk assessments.