“Track the origin and transformations of all training data using tools like OWASP CycloneDX or ML-BOM.” Verify data legitimacy at every stage, and use Data Version Control (DVC) to detect manipulation.
PoisonGPT worked because nobody checked the model's lineage: a tampered model uploaded under a typosquatted publisher name was enough. Provenance tracking is what flags “this didn't come from who you think it did.”
→ Maintain an ML-BOM that lists every dataset and model along with its source
→ Pin and checksum every artifact you pull in
→ Use DVC so any change to a dataset is logged and reversible
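
The “pin and checksum” step can be sketched in a few lines of Python. This is a minimal illustration, not a prescribed tool: the function names `sha256_of` and `verify_artifact` are hypothetical, and it assumes the expected SHA-256 digest was recorded earlier from a source you trust (for example, in your ML-BOM).

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path, chunk_size: int = 1 << 20) -> str:
    """Return the hex SHA-256 digest of a file, reading it in chunks
    so large model or dataset files do not have to fit in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_artifact(path: Path, pinned_sha256: str) -> bool:
    """Return True only if the file's digest equals the pinned digest.

    `pinned_sha256` is assumed to come from a trusted record made when
    the artifact was first vetted, not from the download location itself.
    """
    return sha256_of(path) == pinned_sha256.strip().lower()
```

If `verify_artifact` returns False, treat the artifact as untrusted and stop before loading it.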
Pick a model in your stack. Can you name its base model and its training-data source, and verify its hash? If not, you have a provenance gap.
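
One way to run that check across a whole stack is to keep a small inventory record per model and flag the fields you cannot fill in. The sketch below is an assumption-laden illustration: the record layout and the field names (`base_model`, `training_data_source`, `sha256`) are hypothetical and are not the CycloneDX ML-BOM schema.

```python
# Hypothetical field names for illustration only; a real ML-BOM
# would follow the CycloneDX schema rather than this ad hoc layout.
REQUIRED_FIELDS = ("base_model", "training_data_source", "sha256")


def provenance_gaps(record: dict) -> list[str]:
    """Return the required provenance fields that are missing or blank."""
    return [
        field
        for field in REQUIRED_FIELDS
        if not str(record.get(field) or "").strip()
    ]


# Example inventory entry with two unanswered questions.
example = {
    "name": "support-chatbot",
    "base_model": "gpt-j-6b",
    "training_data_source": "",
}
gaps = provenance_gaps(example)
```

Here `gaps` lists `training_data_source` and `sha256`, which are the two questions this entry cannot answer.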