“Test model robustness with red team campaigns and adversarial techniques, such as federated learning, to minimize the impact of data perturbations.”
The 250-document backdoor is invisible to normal evals. Only adversarial testing that actively hunts for triggers and biases has any chance of surfacing a sleeper agent.
→ Run red-team campaigns that actively probe for hidden triggers and skewed outputs
→ Use techniques like federated learning so no single source dominates the model
→ Build poisoning and backdoor scenarios into your standing test suite
Does your eval set include adversarial, trigger-probing cases — or only normal inputs? If only normal, a backdoored model passes clean.