Feature request: Reference AgentThreatBench as a benchmark for validating guardrail effectiveness #1907

@vgudur-dev

Summary

AgentThreatBench is the first benchmark that operationalizes the OWASP Top 10 for Agentic Applications (2026) into executable evaluation tasks. It has been merged into UKGovernmentBEIS/inspect_evals, the repository of community-contributed evaluations for Inspect, the UK AI Security Institute's evaluation framework.

Why it's relevant to NeMo Guardrails

NeMo Guardrails is designed to prevent exactly the attacks that AgentThreatBench measures:

| AgentThreatBench Task | Attack | NeMo Guardrails Relevance |
| --- | --- | --- |
| Memory Poison (ASI06) | Adversarial entries in RAG/memory | fact_checking, output rails |
| Autonomy Hijack (ASI01) | Indirect injection in tool output | input rails, dialog rails |
| Data Exfiltration (ASI01) | PII leak via tool call | output rails, sensitive data |
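
For illustration, here is a minimal sketch of how the three tasks above might be launched with the Inspect CLI (`inspect eval <task> --model <model>`). The task identifiers and the model name are assumptions on my part, not taken from the benchmark docs, so please check them against the docs linked in this issue. The snippet only builds the command strings; it does not run them and does not wire NeMo Guardrails into the agent.

```python
import shlex

# ASSUMPTION: these task identifiers are hypothetical placeholders.
# Confirm the real names in the AgentThreatBench docs before use.
TASK_TO_RAILS = {
    "agent_threat_bench_memory_poison": ["fact_checking", "output rails"],
    "agent_threat_bench_autonomy_hijack": ["input rails", "dialog rails"],
    "agent_threat_bench_data_exfil": ["output rails", "sensitive data"],
}


def inspect_command(task: str, model: str) -> str:
    """Build (but do not execute) an `inspect eval` command line for one task."""
    return shlex.join(["inspect", "eval", f"inspect_evals/{task}", "--model", model])


if __name__ == "__main__":
    # ASSUMPTION: the model name is only an example of the provider/model form.
    for task, rails in TASK_TO_RAILS.items():
        print(inspect_command(task, "openai/gpt-4o-mini"))
        print("  relevant rails:", ", ".join(rails))
```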

Proposal

Reference AgentThreatBench in NeMo Guardrails documentation as a benchmark for measuring how well guardrail configurations defend against OWASP agentic threats. This would help users validate their guardrail setups against a standardized, OWASP-aligned test suite.
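
One way the docs could frame "how well a configuration defends" is as the drop in attack success rate between an unguarded run and a guarded run of the same task. The sketch below is only an illustration of that comparison: `SampleResult`, `attack_success_rate`, and `relative_reduction` are hypothetical names, the sample data is made up, and none of this reflects AgentThreatBench's actual scoring.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class SampleResult:
    """Hypothetical per-sample record; not the benchmark's real output schema."""
    task: str
    attack_succeeded: bool


def attack_success_rate(results: Sequence[SampleResult]) -> float:
    """Fraction of samples in which the attack achieved its goal."""
    if not results:
        raise ValueError("no results to score")
    return sum(r.attack_succeeded for r in results) / len(results)


def relative_reduction(baseline: Sequence[SampleResult],
                       guarded: Sequence[SampleResult]) -> float:
    """Relative drop in attack success rate once guardrails are enabled."""
    base = attack_success_rate(baseline)
    if base == 0:
        return 0.0  # nothing to reduce if no attack succeeded at baseline
    return (base - attack_success_rate(guarded)) / base


if __name__ == "__main__":
    # Made-up data: 3 of 4 attacks succeed unguarded, 1 of 4 with guardrails.
    baseline = [SampleResult("memory_poison", s) for s in (True, True, True, False)]
    guarded = [SampleResult("memory_poison", s) for s in (True, False, False, False)]
    print(f"baseline ASR: {attack_success_rate(baseline):.2f}")
    print(f"guarded ASR:  {attack_success_rate(guarded):.2f}")
    print(f"relative reduction: {relative_reduction(baseline, guarded):.2f}")
```

A real comparison would also need to track task utility, since a configuration that blocks every tool call would score well on this metric while making the agent useless.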

Benchmark docs: https://ukgovernmentbeis.github.io/inspect_evals/evals/safeguards/agent_threat_bench/
Source: https://github.com/UKGovernmentBEIS/inspect_evals/tree/main/src/inspect_evals/agent_threat_bench
