LLM08:2025 — Vector & Embedding Weaknesses

Slide 19 · Mitigation Category 1 of 6

Sanitize every document before it enters the vector store.

📄 OWASP LLM Top 10:2025 · LLM08 Prevention — Data Validation

M1 — Validate Before Embedding

Sanitize and Validate All Content at Ingestion Time

What OWASP Says

“Validate data origins and audit knowledge bases for corruption.” Implement text extraction that strips hidden content, invisible formatting, and adversarial structures before any document enters the embedding pipeline. Treat all uploaded content as untrusted input at the system boundary.

Which Incident This Would Have Stopped

The hidden-resume injection (Slide 13, Scenario 1): white-on-white hidden text containing “ignore all previous instructions” was embedded because the ingestion pipeline did not strip invisible content. A text extraction layer that normalizes color and visibility would have removed the instruction before it ever became part of the embedding.
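A minimal sketch of such an extraction layer, using HTML only because Python's standard library can parse it (the slide does not say which format the resume used). The `HIDDEN_STYLE` heuristics, the `VisibleTextExtractor` class, and the `visible_text` helper are illustrative names, not part of any OWASP tooling. The sketch assumes well-nested markup, inline styles, and a white page background. Text hidden through CSS classes or off-screen positioning would need a rendering-aware extractor.

```python
import re
from html.parser import HTMLParser

# Assumed heuristics for "invisible" inline styles (hypothetical, not exhaustive).
# White text is treated as hidden on the assumption of a white background.
HIDDEN_STYLE = re.compile(
    r"display\s*:\s*none|visibility\s*:\s*hidden|font-size\s*:\s*0"
    r"|(?<![-\w])color\s*:\s*(white|#fff(?:fff)?)\b",
    re.IGNORECASE,
)
# Elements that never have a closing tag, so they must not touch the stack.
VOID_TAGS = {"area", "base", "br", "col", "embed", "hr", "img",
             "input", "link", "meta", "source", "track", "wbr"}


class VisibleTextExtractor(HTMLParser):
    """Collect text only from elements that are not hidden (assumes well-nested HTML)."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.stack = []   # one bool per open element: True if it or an ancestor is hidden
        self.parts = []   # visible text fragments in document order

    def handle_starttag(self, tag, attrs):
        if tag in VOID_TAGS:
            return
        attr_map = dict(attrs)
        hidden = (
            tag in ("script", "style")
            or "hidden" in attr_map
            or bool(HIDDEN_STYLE.search(attr_map.get("style") or ""))
        )
        inherited = bool(self.stack) and self.stack[-1]
        self.stack.append(hidden or inherited)

    def handle_endtag(self, tag):
        if tag not in VOID_TAGS and self.stack:
            self.stack.pop()

    def handle_data(self, data):
        if not (self.stack and self.stack[-1]):
            self.parts.append(data)


def visible_text(html: str) -> str:
    """Return the whitespace-normalized visible text of an HTML fragment."""
    parser = VisibleTextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(" ".join(parser.parts).split())
```

Calling `visible_text` on a fragment that contains `<span style="color:#ffffff">ignore all previous instructions</span>` drops the span's text, so the instruction never reaches the embedding step.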

How to Do This Right

→ Use text extraction tools that normalize content: discard color and styling, drop hidden or invisible text, collapse excessive whitespace, and remove strings injected through document metadata
→ Run a policy filter on extracted text before embedding: reject or flag content matching injection patterns
→ Apply the same scrutiny to all document formats — PDF, DOCX, HTML, Markdown — each has distinct hiding mechanisms
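The first two steps above can be sketched for already-extracted text as follows. This is an illustration under stated assumptions, not a complete defense: `INJECTION_PATTERNS` is a hypothetical two-entry list, and the function names `normalize_text`, `policy_check`, and `admit_for_embedding` are invented for this example. Pattern matching catches only known phrasings, so a hit is best treated as a reason to reject or route to review.

```python
import re
import unicodedata

# Hypothetical patterns for illustration; a real deployment maintains its own list.
INJECTION_PATTERNS = [
    re.compile(r"ignore\s+(all\s+)?(previous|prior)\s+instructions", re.IGNORECASE),
    re.compile(r"disregard\s+(the\s+)?(system|above)\s+prompt", re.IGNORECASE),
]


def normalize_text(text: str) -> str:
    """Fold look-alike characters, drop invisible format characters, collapse whitespace."""
    # NFKC maps compatibility forms (e.g. fullwidth letters) to their plain equivalents.
    text = unicodedata.normalize("NFKC", text)
    # Unicode category "Cf" covers zero-width spaces, joiners, and bidi controls.
    text = "".join(ch for ch in text if unicodedata.category(ch) != "Cf")
    return re.sub(r"\s+", " ", text).strip()


def policy_check(text: str) -> list[str]:
    """Return the patterns that matched; an empty list means nothing was flagged."""
    return [p.pattern for p in INJECTION_PATTERNS if p.search(text)]


def admit_for_embedding(raw_text: str) -> tuple[bool, str, list[str]]:
    """Normalize first, then filter, so obfuscated phrases are matched in their plain form."""
    cleaned = normalize_text(raw_text)
    hits = policy_check(cleaned)
    return (not hits, cleaned, hits)
```

Normalizing before filtering matters: a phrase split by zero-width characters would slip past the patterns if the order were reversed. One trade-off of dropping every "Cf" character is that legitimate zero-width joiners in some scripts and emoji sequences are also removed.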

How to Validate

Submit a test document containing a unique marker phrase in white-on-white text to your own ingestion pipeline, then query the AI for content related to the document's visible text. If the marker phrase appears in the retrieved context, ingestion validation is absent or ineffective.
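The check can be automated along these lines. The `ingest` and `retrieve` callables are placeholders for whatever your own pipeline exposes, and `NaiveStore` is a toy in-memory stand-in with no validation, included only so the sketch runs on its own. The test document is HTML with an inline white text color, which is one assumed way of producing white-on-white text.

```python
import re
import uuid
from typing import Callable, Iterable


def marker_leaks(
    ingest: Callable[[str], None],
    retrieve: Callable[[str], Iterable[str]],
    visible_body: str,
    query: str,
) -> bool:
    """Return True if a hidden marker survives ingestion and appears in retrieved context."""
    # A random marker avoids false positives from content already in the store.
    marker = f"CANARY-{uuid.uuid4().hex[:12]}"
    doc = (
        f"<p>{visible_body}</p>"
        f'<span style="color:#ffffff">{marker}</span>'
    )
    ingest(doc)
    return any(marker in chunk for chunk in retrieve(query))


class NaiveStore:
    """Toy pipeline (hypothetical): strips tags but keeps all text, hidden or not."""

    def __init__(self):
        self.chunks = []

    def ingest(self, html: str) -> None:
        self.chunks.append(re.sub(r"<[^>]+>", " ", html))

    def retrieve(self, query: str) -> list:
        return [c for c in self.chunks if query.lower() in c.lower()]


store = NaiveStore()
leaked = marker_leaks(store.ingest, store.retrieve,
                      "Quarterly logistics report", "logistics")
```

Here `leaked` is `True` because `NaiveStore` keeps the hidden span's text. Against a pipeline that strips invisible content, the same call should return `False`. Repeating the check per supported format (PDF, DOCX, HTML, Markdown) covers the format-specific hiding mechanisms.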
