The setup: Researchers at Penn State and IIT asked: how many malicious documents do you need to inject into a large knowledge base to control what an LLM says about a specific topic?
The answer: five. By injecting 5 adversarially crafted texts into a knowledge base of millions of documents, PoisonedRAG achieved a 90% attack success rate — the LLM returned the attacker’s chosen answer to the attacker’s chosen question.
The mechanism: Each poisoned document satisfies two optimization conditions: its embedding must score higher cosine similarity to the target query than legitimate documents, and its text must steer the LLM toward the desired answer once retrieved. Both conditions are solvable as adversarial optimization problems in black-box and white-box settings.
Defenses tested: Several existing defenses were evaluated. None fully stopped the attack.