“Apply strict retrieval controls using metadata filters and confidence thresholds beyond similarity scores.” The retrieval layer should validate not only how similar a document is but also where it came from, when it was added, and by whom.
PoisonedRAG (Slide 10) gets its poisoned text retrieved by maximizing cosine similarity with the target query; on the retrieval side, that is the entire attack mechanism. A provenance check that flags newly added documents from low-trust contributors, or a confidence threshold that alerts when a document scores unusually high relative to its historical baseline, could have surfaced the poisoned documents before they influenced any responses.
→ Require retrieved documents to pass a provenance check: known source, authorized contributor, and not recently modified by a low-trust account
→ Set confidence thresholds: flag a document that suddenly ranks #1 for a query whose results it has never appeared in before
→ Limit how many retrieved documents per query can originate from a single contributor
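The three controls above can be sketched as a post-retrieval filter. This is a minimal illustration, not a reference implementation: the `Doc` fields, the `trust` mapping, the `seen_before` history set, and every threshold (`KNOWN_SOURCES`, `MIN_TRUST`, `QUARANTINE`, `MAX_PER_CONTRIBUTOR`) are hypothetical names and values chosen for the example, and a real system would pull them from its own metadata store and retrieval logs.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta

# All names and thresholds below are illustrative assumptions.
KNOWN_SOURCES = {"wiki", "handbook"}   # sources the pipeline accepts
MIN_TRUST = 0.5                        # below this, a contributor is "low-trust"
QUARANTINE = timedelta(days=7)         # what counts as "recently modified"
MAX_PER_CONTRIBUTOR = 2                # cap on results per contributor


@dataclass(frozen=True)
class Doc:
    doc_id: str
    source: str
    contributor: str
    modified_at: datetime
    similarity: float  # score returned by the vector search


def passes_provenance(doc, trust, now):
    """Known source, authorized contributor, no recent low-trust edit."""
    if doc.source not in KNOWN_SOURCES:
        return False
    if doc.contributor not in trust:  # not an authorized contributor
        return False
    low_trust = trust[doc.contributor] < MIN_TRUST
    recent = now - doc.modified_at < QUARANTINE
    return not (low_trust and recent)


def filter_candidates(candidates, seen_before, trust, now):
    """Return (kept, flagged).

    seen_before: ids of documents previously retrieved for this query.
    flagged: a top-ranked survivor with no history for this query.
    """
    kept, flagged, per_contributor = [], [], {}
    ranked = sorted(candidates, key=lambda d: d.similarity, reverse=True)
    for doc in ranked:
        if not passes_provenance(doc, trust, now):
            continue
        count = per_contributor.get(doc.contributor, 0)
        if count >= MAX_PER_CONTRIBUTOR:
            continue  # this contributor already fills its share
        per_contributor[doc.contributor] = count + 1
        # The first survivor will rank #1; flag it if it is new to this query.
        if not kept and doc.doc_id not in seen_before:
            flagged.append(doc)
        kept.append(doc)
    return kept, flagged
```

Note that similarity is still used for ordering; it just no longer decides on its own what reaches the model. A brand-new query has no history, so the novelty flag fires for it every time and is better treated as an alert for review than as a hard block.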
Inject a canary document from a low-trust test account. Query the system on the canary’s topic. Does the canary rank at the top? If yes, provenance is not part of the retrieval scoring.
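The canary test can be automated as a small check, sketched here under assumptions: `retrieve` stands in for whatever function returns ranked document ids for a query in the system under test, and `canary_id` and `top_n` are hypothetical parameters introduced for the example.

```python
from typing import Callable, Optional, Sequence


def canary_rank(
    retrieve: Callable[[str], Sequence[str]],
    query: str,
    canary_id: str,
) -> Optional[int]:
    """Return the 0-based rank of the canary in the results, or None if absent."""
    results = list(retrieve(query))
    return results.index(canary_id) if canary_id in results else None


def provenance_ignored(
    retrieve: Callable[[str], Sequence[str]],
    query: str,
    canary_id: str,
    top_n: int = 1,
) -> bool:
    """True if the low-trust canary lands in the top_n results.

    A True result suggests ranking depends on similarity alone.
    """
    rank = canary_rank(retrieve, query, canary_id)
    return rank is not None and rank < top_n
```

Running this on a schedule, with the canary re-injected from the low-trust test account each time, turns the one-off probe into a regression test for the provenance controls.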