“Apply comprehensive AI red teaming and evaluations when selecting a third-party model. Decentralized Identity and similar approaches… can verify provenance and authenticity.”
PoisonGPT (Slide 15) scored within 0.1% of the original; the LoRA backdoor (Slide 13) preserved clean-task accuracy. So a casual benchmark is not red teaming. Real red teaming probes for triggered/backdoor behavior — while remembering a trigger you don't know is hard to fire on purpose, which is why this layers with provenance.
→ Run adversarial evaluations (jailbreaks, trigger probing, refusal-removal checks) on candidate models before adoption
→ Compare behavior against a known-good reference where one exists
→ Treat “passed our safety benchmark” as necessary, never sufficient
Ask what adversarial tests a model passed before it entered production. If the answer is “it scored well on a public benchmark,” it was evaluated, not red-teamed.