Safety & Security

Prompt injection, sandboxing, exfiltration, red-teaming, deployment safety — the threat model an agent's environment creates.

  1. The Agentic Threat Model
    Why autonomy and tool use widen the attack surface, and the four channels through which attacker-influenced text reaches an agent.
  2. Prompt Injection: Direct & Indirect
    How prompt injection works, why no clean fix exists, and the layered defense pattern that defenders can apply instead.
  3. Data Exfiltration & Tool Misuse
    The confused-deputy pattern in agents: exfiltration sources, hidden sinks, and how to cut the chain.
  4. Guardrails: Filtering, Sandboxing & Scoping
    Probabilistic vs deterministic guardrails and how to layer input, output, sandbox and capability controls.
  5. Agent Identity
    Who is acting when an agent calls a tool? Service accounts, on-behalf-of patterns, and the audit consequences of getting the answer wrong.
  6. Scoped Credentials for Agents
    Why agents should never hold human-grade credentials: short-lived, narrowly scoped, per-action tokens, and the failure modes that appear when you take shortcuts.
  7. Human-in-the-Loop & Least Privilege
    Bounded autonomy by design: least privilege as default and consequence-based approval gates.
  8. Red-Teaming & Safety Evaluation
    Adversarial testing of agents as a repeatable, outcome-graded pipeline gate, not a one-off session.
  9. Alignment Basics: Intent & Oversight
    Instruction-following vs intent, reward hacking, and scalable oversight as the practical builder lever.
  10. The Pre-Ship Safety Review
    A practical, fail-closed-first deployment checklist including MCP/third-party supply-chain trust.
  11. RAG Pipeline Security
    Why retrieved context is untrusted input that bypassed the input guardrails: corpus poisoning, indirect injection, embedding leakage, and the trust-boundary design that contains them.