"Define sensitive categories and construct rules for identifying and handling such content. Apply semantic filters and use string-checking to scan for non-allowed content. Evaluate responses using the RAG Triad: context relevance, groundedness, and question/answer relevance."
Scenario #9 exists because string matching on "ignore previous instructions" misses the same phrase when it is written in Base64, in Japanese, in emoji, or with Unicode lookalike characters. Benchmark research found that even state-of-the-art guardrail models detected obfuscated injections at significantly lower rates than plain-English ones. String checks alone are therefore not enough: semantic filtering, which judges what the text means regardless of how it is encoded, is required.
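As a rough illustration of why the string check needs help, here is a minimal sketch that normalizes and decodes input before matching. The blocklist, the function names, and the 16-character Base64 heuristic are all assumptions made for this example. It only narrows the gap: NFKC folding handles fullwidth and other compatibility characters, but not cross-script lookalikes (such as Cyrillic letters) or translations, which is where a semantic classifier is still needed.

```python
import base64
import binascii
import re
import unicodedata

# Hypothetical blocklist for illustration; a real deployment maintains its own.
BLOCKED_PHRASES = ["ignore previous instructions"]

# Heuristic (assumption): treat runs of 16+ Base64-alphabet characters as candidates.
B64_RUN = re.compile(r"[A-Za-z0-9+/]{16,}={0,2}")


def normalize(text: str) -> str:
    """Fold compatibility characters and case, then collapse whitespace."""
    folded = unicodedata.normalize("NFKC", text).casefold()
    return " ".join(folded.split())


def decoded_views(text: str) -> list[str]:
    """Return the text plus any Base64 runs that decode to valid UTF-8."""
    views = [text]
    for match in B64_RUN.finditer(text):
        try:
            raw = base64.b64decode(match.group(0), validate=True)
            views.append(raw.decode("utf-8"))
        except (binascii.Error, UnicodeDecodeError):
            continue  # not valid Base64, or not text: skip this candidate
    return views


def contains_blocked_phrase(text: str) -> bool:
    """Check every decoded view of the text against the blocklist."""
    return any(
        phrase in normalize(view)
        for view in decoded_views(text)
        for phrase in BLOCKED_PHRASES
    )
```

A check like this would sit in front of a semantic filter, not replace it: it is cheap to run on every input, and anything it misses still has to be caught by a model that evaluates meaning.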
OWASP recommends evaluating three things for RAG outputs: context relevance (is the retrieved content relevant to the question? Off-topic retrievals signal knowledge base poisoning); groundedness (is the answer supported by the retrieved content? Ungrounded answers suggest an injection redirected the model); and answer relevance (does the answer actually address the question? Drift signals that something went wrong).
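The three checks can be sketched as three scores over the same question, context, and answer. The example below is a toy under stated assumptions: `overlap` is a hypothetical token-overlap scorer standing in for whatever evaluator is actually used (typically an LLM judge or an embedding model), and the 0.3 threshold is arbitrary.

```python
import re
from dataclasses import dataclass


def _tokens(text: str) -> set[str]:
    """Lowercased alphanumeric tokens of the text."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def overlap(claim: str, evidence: str) -> float:
    """Fraction of the claim's tokens that also appear in the evidence (0.0 to 1.0).

    Placeholder scorer (assumption): swap in a real relevance or entailment model.
    """
    claim_tokens = _tokens(claim)
    if not claim_tokens:
        return 0.0
    return len(claim_tokens & _tokens(evidence)) / len(claim_tokens)


@dataclass
class TriadScores:
    context_relevance: float
    groundedness: float
    answer_relevance: float

    def flags(self, threshold: float = 0.3) -> list[str]:
        """Names of the checks that scored below the threshold."""
        return [name for name, score in vars(self).items() if score < threshold]


def rag_triad(question: str, context: str, answer: str) -> TriadScores:
    return TriadScores(
        context_relevance=overlap(question, context),  # is the retrieval on topic?
        groundedness=overlap(answer, context),         # is the answer supported?
        answer_relevance=overlap(question, answer),    # does it address the question?
    )
```

With a question about a refund window and a matching retrieved passage, an answer that instead tells the user to visit an unrelated site shares no tokens with either, so `flags()` would report both `groundedness` and `answer_relevance`.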
Input filtering cannot stop an indirect injection that is already in the model's context window. EchoLeak is the example: the email passed through normally, so input filtering would not have stopped it. What it needed was output filtering to catch the external URL before it was rendered. Both layers are required.
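One way to picture the output layer is a pass over the model's response that removes links to hosts outside an allowlist before anything is rendered. This is a minimal sketch, not a description of any vendor's actual fix: `ALLOWED_HOSTS`, `redact_external_urls`, and the `[link removed]` placeholder are hypothetical names chosen for the example.

```python
import re
from urllib.parse import urlsplit

# Hypothetical allowlist of hosts the application is permitted to link to.
ALLOWED_HOSTS = {"docs.example.com"}

# Match http(s) URLs up to whitespace or a common closing delimiter.
URL_PATTERN = re.compile(r"https?://[^\s)\]>\"']+", re.IGNORECASE)


def redact_external_urls(model_output: str) -> tuple[str, list[str]]:
    """Replace non-allowlisted URLs; return the cleaned text and the blocked URLs."""
    blocked: list[str] = []

    def _replace(match: re.Match) -> str:
        url = match.group(0)
        try:
            host = urlsplit(url).hostname or ""
        except ValueError:
            host = ""  # malformed URL: treat as not allowlisted
        if host in ALLOWED_HOSTS:
            return url
        blocked.append(url)
        return "[link removed]"

    return URL_PATTERN.sub(_replace, model_output), blocked
```

Running this on a response that contains a Markdown image pointing at an unknown host leaves the allowlisted documentation link intact and replaces the image URL, so the client never fetches it. The returned `blocked` list can be logged as a signal that an injection may have reached the context window.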