Slide 19 of 27
Part 4 · PreventionSlide 19
Slide 19 · Mitigation Category 1 of 6
Sanitize every document before it enters the vector store.
📄 OWASP LLM Top 10:2025 · LLM08 Prevention — Data Validation
M1 — Validate Before Embedding
Sanitize and Validate All Content at Ingestion Time

“Validate data origins and audit knowledge bases for corruption.” Implement text extraction that strips hidden content, invisible formatting, and adversarial structures before any document enters the embedding pipeline. Treat all uploaded content as untrusted input at the system boundary.

The hidden-resume injection (Slide 13, Scenario 1): white-on-white hidden text containing “ignore all previous instructions” was embedded because the ingestion pipeline did not strip invisible content. A text extraction layer that normalizes color and visibility would have removed the instruction before it ever became part of the embedding.

→ Use text extraction tools that normalize content: strip colors, hidden/invisible text, excessive whitespace, and metadata-injected strings
→ Run a policy filter on extracted text before embedding: reject or flag content matching injection patterns
→ Apply the same scrutiny to all document formats — PDF, DOCX, HTML, Markdown — each has distinct hiding mechanisms

Submit a test document with white-on-white text containing a marker phrase to your own ingestion pipeline. Query the AI for content related to the document. If the marker phrase appears in retrieved context, ingestion validation is absent.

← BackNext → M2: Permission-aware stores