A model that's right about everything — except the one lie planted in it.
Scenario · Direct Tampering (PoisonGPT)
“An attacker directly tampers with a published model's parameters to embed false information, then distributes it to spread misinformation.”
Researchers at Mithril Security built PoisonGPT to prove it. Using the ROME editing technique, they surgically changed a handful of facts inside GPT-J-6B — the model still answered normally about everything else, but confidently stated a chosen falsehood. They uploaded it under a name resembling the real EleutherAI project.
Why it matters: the poisoned model scored within 0.1% of the original on a standard benchmark. Benchmarks and casual testing cannot see a targeted edit. Without provenance, you cannot tell the tampered model from the genuine one.
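A provenance check is the one test a targeted edit cannot pass: a ROME-style edit changes only a tiny fraction of the weight bytes and leaves file size, shape and benchmark score intact, but it cannot leave a pinned digest intact. The following is an added example written for this page, not code from Mithril Security or the PoisonGPT write-up. It assumes a model shipped as a directory of plain files together with a publisher-issued manifest of the form `{"files": {relative_path: sha256_hex}}` obtained over a trusted channel (the manifest file name and the pseudo-weight bytes are constructed for the demo):

```python
# Added example for this page, not Mithril Security's PoisonGPT code:
# verify a downloaded model artifact against a pinned provenance manifest
# before loading it. Assumed layout: a directory of plain files plus a
# manifest {"files": {relative_path: sha256_hex}} that the publisher
# produced at release time and that you fetched over a trusted channel.
# All file contents below are constructed for the demo.
import hashlib
import hmac
import tempfile
from pathlib import Path

MANIFEST_NAME = "manifest.json"  # hypothetical file name, not a standard


def sha256_file(path: Path) -> str:
    """Stream the file through SHA-256 so multi-GB weight files fit in memory."""
    h = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(1 << 20), b""):
            h.update(chunk)
    return h.hexdigest()


def build_manifest(root: Path) -> dict:
    """What the publisher produces at release time: one digest per file."""
    files = {}
    for p in sorted(root.rglob("*")):
        if p.is_file() and p.name != MANIFEST_NAME:
            files[p.relative_to(root).as_posix()] = sha256_file(p)
    return {"files": files}


def verify_manifest(root: Path, manifest: dict) -> list:
    """Return a list of problems; an empty list means the artifact matches."""
    problems = []
    expected = manifest["files"]
    for rel, digest in expected.items():
        p = root / rel
        if not p.is_file():
            problems.append(f"missing: {rel}")
            continue
        # compare_digest keeps the comparison constant-time
        if not hmac.compare_digest(sha256_file(p), digest):
            problems.append(f"digest mismatch: {rel}")
    # Files the publisher never shipped are just as suspicious as edits.
    for p in sorted(root.rglob("*")):
        rel = p.relative_to(root).as_posix()
        if p.is_file() and rel not in expected and p.name != MANIFEST_NAME:
            problems.append(f"unexpected file: {rel}")
    return problems


def demo() -> dict:
    """Build a clean artifact, then simulate a small in-place weight edit."""
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        (root / "config.json").write_text('{"model_type": "demo"}')
        # Constructed pseudo-weights: 64 KiB of deterministic bytes.
        weights = bytearray((i * 31 + 7) & 0xFF for i in range(65536))
        (root / "model.bin").write_bytes(weights)
        orig_size = (root / "model.bin").stat().st_size
        manifest = build_manifest(root)
        clean = verify_manifest(root, manifest)

        # A targeted edit touches a tiny fraction of the parameters; emulate
        # that by flipping three bytes in place. Size and layout are unchanged,
        # so a size check or a benchmark run would see nothing.
        for off in (1000, 20000, 40000):
            weights[off] ^= 0x01
        (root / "model.bin").write_bytes(weights)
        tampered = verify_manifest(root, manifest)
        return {
            "clean": clean,
            "tampered": tampered,
            "size_unchanged": (root / "model.bin").stat().st_size == orig_size,
        }


if __name__ == "__main__":
    print(demo())
```

The digest proves only that the bytes you loaded are the bytes the publisher pinned; binding that manifest to the publisher's identity (a signature or attestation over the manifest, verified before the digests are trusted) is what turns the check into provenance. Note that nothing in the check depends on how the model behaves, which is exactly why it catches an edit that a 0.1% benchmark gap does not.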