The finding: as few as 250 malicious documents (~0.00016% of the training tokens for the largest model) reliably planted a backdoor in models ranging from 600M to 13B parameters. 100 documents were not enough; 250 were.
The twist: the number of poison documents needed was near-constant regardless of model size. A 13B model trained on 20× more data than the 600M one was no harder to poison.
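To make the twist concrete, here is a rough back-of-the-envelope sketch of what a fixed count of 250 documents means as a share of training data. The constants are assumptions for illustration, not figures from these notes: a data budget of about 20 tokens per parameter, and an average poison document of about 1,680 tokens. The function name `poison_fraction` is hypothetical.

```python
# Illustrative arithmetic only. Both constants below are ASSUMPTIONS
# chosen for the sketch, not figures stated in these notes.
TOKENS_PER_PARAM = 20          # assumed training budget: ~20 tokens per parameter
TOKENS_PER_POISON_DOC = 1_680  # assumed average length of one poison document
POISON_DOCS = 250              # the fixed document count from the finding


def poison_fraction(params: float) -> float:
    """Share of training tokens that are poison, for a fixed document count."""
    total_tokens = params * TOKENS_PER_PARAM
    poison_tokens = POISON_DOCS * TOKENS_PER_POISON_DOC
    return poison_tokens / total_tokens


if __name__ == "__main__":
    for name, params in [("600M", 600e6), ("13B", 13e9)]:
        # Same 250 documents, very different share of the corpus.
        print(f"{name}: poison share = {poison_fraction(params):.7%}")
```

Under these assumptions the same 250 documents are roughly 0.0035% of the 600M model's data but only about 0.00016% of the 13B model's. If attacks scaled with the poison percentage, the larger model should have been over 20× harder to poison; the result says it was not.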
The backdoor: whenever the trigger <SUDO> appeared in a prompt, the model emitted random gibberish — a denial-of-service backdoor that's invisible on any input without the trigger.
This is the sleeper agent from Slide 8, proven cheap and practical — by the people who build frontier models.