
LlamaFirewall: Securing Autonomous AI Agents with Open Source

  • LlamaFirewall is a modular, open-source guardrail system designed to secure AI agents, addressing vulnerabilities such as prompt injection, goal hijacking, and insecure code generation, which existing chatbot safeguards fail to mitigate.
  • The framework comprises three core components: PromptGuard 2 (detects jailbreak attempts using lightweight BERT-style models), AlignmentCheck (monitors agent reasoning for goal hijacking via LLM-powered semantic analysis), and CodeShield (blocks insecure code outputs via static analysis with regex and Semgrep rules across 8 languages).
  • PromptGuard 2 achieves state-of-the-art performance in universal jailbreak detection with lower false positive rates, contributing to a 57% reduction in attack success rate (ASR) with negligible utility loss.
  • AlignmentCheck demonstrates over 80% recall with a false positive rate below 4% on goal-hijacking benchmarks; combined with PromptGuard, it yields an 83% reduction in attack success rate on AgentDojo, bringing ASR down to 1.75%.
  • CodeShield achieves 96% precision and 79% recall in identifying insecure code, using a two-tiered scanning architecture (lightweight pattern matching followed by comprehensive static analysis) with typical end-to-end latency under 70 milliseconds.
  • According to additional sources, LlamaFirewall employs a unified policy engine for constructing custom pipelines and defining conditional remediation strategies; noted limitations include the need to extend coverage to multimodal agents, improve latency (especially for AlignmentCheck), and broaden threat coverage.
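To make the two-tiered scanning idea concrete, here is a minimal illustrative sketch in Python of how a CodeShield-style scanner might gate an expensive static-analysis pass behind a fast regex tier. The pattern names and rules here are hypothetical examples, not CodeShield's actual rule set or API.

```python
import re

# Hypothetical tier-1 rules (illustrative only, not CodeShield's real rules).
INSECURE_PATTERNS = {
    "python-eval": re.compile(r"\beval\s*\("),
    "weak-hash-md5": re.compile(r"\bhashlib\.md5\s*\("),
    "shell-injection": re.compile(r"subprocess\.\w+\([^)]*shell\s*=\s*True"),
}

def tier1_scan(code: str) -> list:
    """Lightweight pattern-matching pass over generated code."""
    return [name for name, pat in INSECURE_PATTERNS.items() if pat.search(code)]

def should_block(code: str) -> bool:
    """Return True if the generated code should be blocked.

    In a real two-tier design, tier-1 hits would be confirmed by a
    slower, comprehensive tier-2 analyzer (e.g. Semgrep), so the
    expensive pass only runs when the cheap pass finds something.
    """
    findings = tier1_scan(code)
    return bool(findings)
```

The design choice this sketches is the latency one the article reports: because most agent outputs trigger no tier-1 pattern, the comprehensive scan rarely runs, keeping typical end-to-end latency low.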