LLM01:2025 — Prompt Injection

Slide 21 · Mitigation 3 of 7

Implement input and output filtering — with real tools.

Screen what goes in and what comes out. Here's what production teams actually use.

📄 OWASP LLM Top 10:2025 · LLM01 Prevention #3

OWASP M3

Implement Input and Output Filtering

What OWASP Says

"Define sensitive categories and construct rules for identifying and handling such content. Apply semantic filters and use string-checking to scan for non-allowed content. Evaluate responses using the RAG Triad: context relevance, groundedness, and question/answer relevance."

Why String Matching Alone Fails — OWASP Says So

Scenario #9 exists because string matching on "ignore previous instructions" misses the same phrase in Base64, in Japanese, encoded in emoji, or obfuscated with Unicode lookalike characters. Benchmark research found that even state-of-the-art guardrail models had significantly lower detection rates on obfuscated versus plain-English injections. Semantic filtering — understanding meaning regardless of encoding — is required.

Lakera Guard

Lakera · Commercial API · Real-time screening

ML-based prompt injection detection trained on millions of real adversarial attempts from Lakera's Gandalf public demo — including obfuscated, multilingual, and encoded attacks. Screens both inputs and outputs. Used in production enterprise deployments. Catches patterns that string-match filters miss.

LLM Guard

Protect AI · Open source · Apache 2.0 · Python

Open-source security toolkit that integrates directly into your LLM pipeline. Includes a DeBERTa-based prompt injection classifier, PII scrubbing, output relevance scoring, and toxicity detection. Free to self-host — no API call required.

NeMo Guardrails

NVIDIA · Open source · Apache 2.0

Dialog management framework with programmable guardrails. Input rails filter before the model sees content. Output rails filter before the response reaches users. Execution rails control what the model can invoke. Best for multi-turn conversation control and agentic workflows.

Azure AI Prompt Shields

Microsoft · Managed cloud service

Microsoft's managed guardrail API covering jailbreak detection, data exfiltration patterns, and obfuscation attacks. Integrated into Azure OpenAI Service. Note: Microsoft's own Copilot products use this — EchoLeak bypassed the first version of their XPIA classifier before the post-disclosure patch, which is why OWASP M7 (adversarial testing) matters.

Meta Prompt Guard

Meta · Open source · 86M parameter classifier

A lightweight classifier fine-tuned specifically to detect prompt injection and jailbreak attempts. Can be run as a pre-filter before your main LLM call with minimal added latency. Free to use and self-host via Hugging Face.

The RAG Triad — For RAG System Output Evaluation

OWASP recommends evaluating three things for RAG outputs: Context relevance (is the retrieved content relevant to the question? Off-topic retrievals signal knowledge base poisoning), Groundedness (is the answer supported by the retrieved content? Ungrounded answers suggest injection redirected the model), Answer relevance (does the answer actually address the question? Drift signals something went wrong).

Critical Gap: Only Filtering Inputs

A successful indirect injection that's already in the model's context window won't be stopped by input filtering. EchoLeak would not have been stopped by input filtering — the email passed normally. It needed output filtering to catch the external URL before rendering. Both layers are required.

← Back Next → M4: Least privilege