“Validate data origins and audit knowledge bases for corruption.” Implement text extraction that strips hidden content, invisible formatting, and adversarial structures before any document enters the embedding pipeline. Treat all uploaded content as untrusted input at the system boundary.
The hidden-resume injection (Slide 13, Scenario 1): white-on-white text containing “ignore all previous instructions” was embedded along with the visible resume content because the ingestion pipeline did not strip invisible content. A text extraction layer that normalizes color and visibility would have removed the instruction before it ever became part of the embedding.
→ Use text extraction tools that normalize content: strip colors, hidden/invisible text, excessive whitespace, and metadata-injected strings
→ Run a policy filter on extracted text before embedding: reject or flag content matching injection patterns
→ Apply the same scrutiny to every document format (PDF, DOCX, HTML, Markdown); each has its own hiding mechanisms
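The steps above can be sketched for the HTML case using only the Python standard library. This is a minimal illustration, not a complete sanitizer: the names `extract_visible_text` and `flag_injection`, the list of hiding styles, and the injection patterns are all assumptions chosen for the example. It only inspects inline styles and the `hidden` attribute, and it assumes a white page background, so it would miss text hidden through external stylesheets and would also drop legitimate white text on a dark background. PDF and DOCX need format-specific extractors.

```python
import re
import unicodedata
from html.parser import HTMLParser

# Elements with no closing tag; they never wrap text, so they are not tracked.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img", "input",
             "link", "meta", "source", "track", "wbr"}
# Elements whose content is never rendered as page text.
NON_TEXT_TAGS = {"script", "style", "template", "noscript"}
# Heuristic list of inline styles that hide text (assumes a white background).
HIDING_STYLE = re.compile(
    r"display\s*:\s*none"
    r"|visibility\s*:\s*hidden"
    r"|font-size\s*:\s*0(?![.\d])"
    r"|opacity\s*:\s*0(?![.\d])"
    r"|(?<![-\w])color\s*:\s*(?:#fff(?:fff)?\b|white\b)",
    re.IGNORECASE,
)
# Illustrative patterns only; a real policy filter needs a maintained rule set.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(?:all\s+)?(?:previous|prior|above)\s+instructions", re.I),
    re.compile(r"disregard\s+(?:all\s+)?(?:previous|prior|above)\s+instructions", re.I),
]


class VisibleTextExtractor(HTMLParser):
    """Collects text that is not inside a hidden or non-text element."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._open = []          # (tag, hides_subtree) for each open element
        self._hidden_depth = 0   # number of open elements that hide their subtree
        self.parts = []

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr = dict(attrs)
        style = attr.get("style") or ""
        hidden = (tag in NON_TEXT_TAGS or "hidden" in attr
                  or bool(HIDING_STYLE.search(style)))
        self._open.append((tag, hidden))
        self._hidden_depth += hidden

    def handle_endtag(self, tag):
        # Close the nearest matching open element and anything left open inside it.
        for i in range(len(self._open) - 1, -1, -1):
            if self._open[i][0] == tag:
                self._hidden_depth -= sum(h for _, h in self._open[i:])
                del self._open[i:]
                break

    def handle_data(self, data):
        if self._hidden_depth == 0:
            self.parts.append(data)


def extract_visible_text(html: str) -> str:
    """Return normalized visible text: hidden elements and format chars removed."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    text = unicodedata.normalize("NFKC", " ".join(parser.parts))
    # Drop Unicode format characters (zero-width spaces, bidi controls, BOM).
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return re.sub(r"\s+", " ", text).strip()


def flag_injection(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]
```

Running the policy filter on the extracted text, and optionally on the raw text as well, gives two chances to catch the instruction: the extractor removes it when it is hidden, and the filter flags it when it is visible. Note that stripping all format characters also removes zero-width joiners, which some scripts and emoji sequences rely on; a production pipeline may need a narrower rule.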
Submit a test document with white-on-white text containing a unique marker phrase to your own ingestion pipeline. Then query the AI for content related to the document. If the marker phrase appears in the retrieved context, ingestion validation is absent or ineffective.
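A small helper can make this test repeatable. The sketch below is an assumption-laden illustration: `make_canary_document` and `canary_leaked` are hypothetical names, the document is HTML only, and the upload and retrieval steps are left to whatever interface your own pipeline exposes.

```python
import uuid


def make_canary_document() -> tuple[str, str]:
    """Build an HTML test document with a unique marker hidden as white-on-white text."""
    marker = f"CANARY-{uuid.uuid4().hex}"
    html = (
        "<html><body style=\"background:#ffffff\">"
        "<p>Quarterly onboarding checklist for new staff.</p>"
        f"<p style=\"color:#ffffff\">{marker}</p>"
        "</body></html>"
    )
    return marker, html


def canary_leaked(marker: str, retrieved_chunks: list[str]) -> bool:
    """True if the hidden marker survived ingestion and appears in retrieved context."""
    needle = marker.lower()
    return any(needle in chunk.lower() for chunk in retrieved_chunks)
```

Upload the generated HTML, run a query about the visible topic, and pass the retrieved chunks to `canary_leaked`. A result of `True` means hidden text is reaching the embedding store.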