Watch the original stream (in Russian) by Nikolay Sheiko here for more details, examples, and an interesting Q&A. This summary is based on his own Russian-language summary, translated and expanded using an LLM.

Semantic similarity ≠ factual relevance

Words may match while the meaning is completely different. Beyond that, embedding search has structural limitations:

  1. No logical operations - Can’t handle “and”, “or”, “not” in queries
  2. No intermediate reasoning - Can’t look up abbreviations or definitions from other chunks
  3. Breaks on aggregations - Asking to “analyze weak and strong points across all posts” will match documents that literally mention “weak” and “strong” rather than performing actual analysis
  4. Query-document mismatch - Short user queries vs long documentation chunks differ in both length and phrasing
  5. Table handling - Terrible at understanding structured data

Example Where Embeddings Fail

User query: “Which departments, except Sales, didn’t meet KPI last quarter?”

What gets retrieved:

  • “Q4 Report: Sales department brilliantly met KPI with record growth!”
  • “Next quarter plan: all departments including Marketing should meet KPI”
  • “Last quarter Development successfully met all KPIs”

Notice the problem?

Documents mention the right keywords but give opposite information. Embeddings match words, not logic.
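A toy bag-of-words cosine similarity (a crude stand-in for a real embedding model, purely for illustration) shows the effect: the keyword-heavy but logically opposite document scores far higher than an unrelated one, because word overlap is all that is measured.

```python
from collections import Counter
from math import sqrt

def cosine(a: str, b: str) -> float:
    """Toy bag-of-words cosine similarity (stand-in for a real embedding model)."""
    va, vb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(va[w] * vb[w] for w in va)
    norm = sqrt(sum(v * v for v in va.values())) * sqrt(sum(v * v for v in vb.values()))
    return dot / norm if norm else 0.0

query = "which departments except sales did not meet kpi last quarter"
doc_opposite = "sales department met kpi last quarter with record growth"
doc_unrelated = "office relocation schedule for the new building"

# The document giving the *opposite* answer wins on similarity.
print(cosine(query, doc_opposite) > cosine(query, doc_unrelated))  # True
```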

Treating Symptoms (Still Using Embeddings)

These help but don’t solve the root issue:

  1. Query rewriting/expansion + Instruction awareness
  2. Search chunk summaries → use full text for generation
  3. Generate hypothetical questions per chunk → search questions instead of content
  4. Add neighboring chunks at generation time for more context
  5. Reranking with a secondary model
  6. Document preprocessing while preserving structure (marker-pdf, docling, unstructured)
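Technique 4 (neighboring chunks at generation time) is simple enough to sketch; `expand_with_neighbors` is a hypothetical helper, assuming chunks are stored in document order:

```python
def expand_with_neighbors(chunks: list[str], hit_indices: list[int],
                          window: int = 1) -> list[str]:
    # After retrieval, pull in adjacent chunks so the generator sees
    # surrounding context, not just the matched fragment.
    keep: set[int] = set()
    for i in hit_indices:
        keep.update(range(max(0, i - window), min(len(chunks), i + window + 1)))
    return [chunks[i] for i in sorted(keep)]

chunks = ["intro", "setup", "the relevant part", "caveats", "summary"]
print(expand_with_neighbors(chunks, [2]))  # ['setup', 'the relevant part', 'caveats']
```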

Treating the Root Cause (No Embeddings)

Approach 1: LLM as Search Engine

Instead of embeddings, use a lightweight LLM for retrieval:

  1. Run LLM across document pages in parallel
  2. Ask true/false for relevance to query
  3. Fast because parallelized across pages
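The steps above can be sketched with `asyncio`; `judge_page` is a placeholder for the actual lightweight-LLM call (here it just does a keyword check so the example runs), and the fan-out via `asyncio.gather` is where the parallel speedup comes from:

```python
import asyncio

async def judge_page(query: str, page: str) -> bool:
    # Placeholder for a lightweight-LLM call asking:
    # "Is this page relevant to the query? Answer true/false."
    await asyncio.sleep(0)  # stands in for network latency
    return "kpi" in page.lower()  # demo-only logic

async def retrieve(query: str, pages: list[str]) -> list[int]:
    # Fan out one true/false check per page; speed comes from parallelism.
    verdicts = await asyncio.gather(*(judge_page(query, p) for p in pages))
    return [i for i, ok in enumerate(verdicts) if ok]

pages = ["Q4 KPI summary...", "Office party photos...", "Development KPI results..."]
print(asyncio.run(retrieve("Which departments missed KPI?", pages)))  # [0, 2]
```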

Approach 2: Multi-Page Context Window

Even better - go aggressive:

  1. Feed hundreds of pages directly into the LLM (200-300k tokens even if window is 1M)
  2. Ask it to list only relevant page numbers
  3. Pass relevant pages to reasoning model for answer generation
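One way to wire this up (the prompt wording and tag format are illustrative, not from the stream) is to number each page inside the prompt so the model can answer with page numbers instead of quoting text, then parse those numbers out of the reply:

```python
def build_page_selection_prompt(query: str, pages: list[str]) -> str:
    # Tag each page so the model can cite page numbers rather than content.
    body = "\n\n".join(f"<page {i + 1}>\n{p}\n</page {i + 1}>"
                       for i, p in enumerate(pages))
    return (f"{body}\n\nQuestion: {query}\n"
            "List ONLY the numbers of the pages needed to answer, comma-separated.")

def parse_page_numbers(reply: str) -> list[int]:
    # Tolerant parse of a reply like "Pages: 2, 7, 13".
    return [int(t) for t in reply.replace(",", " ").split() if t.isdigit()]

# A hypothetical model reply, parsed into the pages to forward
# to a stronger reasoning model for answer generation:
print(parse_page_numbers("Pages: 2, 7, 13"))  # [2, 7, 13]
```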

Why Gemini 2.5 Flash rocks here

  • Handles 200-300k tokens well
  • Natively ingests PDFs without preprocessing
  • Understands tables and images

Don’t Forget

Use structured output with chain-of-thought reasoning for intermediate steps.
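A minimal sketch of what that schema might look like (field names are assumptions, not from the stream): put the `reasoning` field *before* the answer fields, because the model generates left to right, so the reasoning tokens are produced before the final answer is committed.

```python
import json

# Response schema forcing chain-of-thought before the final fields.
schema = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},  # chain-of-thought comes first
        "relevant_pages": {"type": "array", "items": {"type": "integer"}},
        "answer": {"type": "string"},
    },
    "required": ["reasoning", "relevant_pages", "answer"],
}

# A hypothetical model reply conforming to the schema:
reply = json.loads('{"reasoning": "Pages 3 and 5 discuss KPI misses...",'
                   ' "relevant_pages": [3, 5], "answer": "Marketing and HR."}')
print(reply["relevant_pages"])  # [3, 5]
```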

Use case

Recipe book: “What can I cook for dinner with noodles, ground meat, and 20 minutes?”

Setup:

  1. Extract structured fields from each recipe (ingredients, cooking time, meal type)
  2. Store in SQL alongside original recipe text
  3. Use text-to-sql prompt to convert user query → search query
  4. Pass original recipe texts to generation step

In practice, create separate tables for different data types with different parameters.
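The recipe setup can be sketched with SQLite (table layout and sample rows are invented for illustration; the `SELECT` is the kind of query a text-to-sql prompt might emit for the noodle question):

```python
import sqlite3

# Structured fields -- extracted by an LLM in the real pipeline --
# stored next to the original recipe text used later for generation.
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE recipes (
    title TEXT, ingredients TEXT, cook_minutes INTEGER,
    meal_type TEXT, full_text TEXT)""")
conn.executemany("INSERT INTO recipes VALUES (?,?,?,?,?)", [
    ("Quick noodle bowl", "noodles,ground meat,soy sauce", 15, "dinner", "..."),
    ("Sunday roast", "beef,potatoes", 180, "dinner", "..."),
])

# SQL a text-to-sql step might produce for:
# "What can I cook for dinner with noodles, ground meat, and 20 minutes?"
rows = conn.execute("""SELECT title FROM recipes
    WHERE ingredients LIKE '%noodles%' AND ingredients LIKE '%ground meat%'
      AND cook_minutes <= 20 AND meal_type = 'dinner'""").fetchall()
print(rows)  # [('Quick noodle bowl',)]
```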

Simpler version: Assign tags to text chunks via LLM → filter by tags → standard RAG or direct LLM call
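The tag-based version reduces to a set-intersection filter (tags here are illustrative; in the real pipeline an LLM assigns them per chunk, and another LLM call infers the wanted tags from the query):

```python
# Tags assigned per chunk by an LLM in a preprocessing pass (values invented).
chunks = [
    {"text": "Q4 KPI results by department...", "tags": {"kpi", "finance"}},
    {"text": "Office relocation plan...",       "tags": {"facilities"}},
    {"text": "Development team KPI review...",  "tags": {"kpi", "engineering"}},
]

def filter_by_tags(chunks: list[dict], wanted: set[str]) -> list[str]:
    # Keep chunks whose tag set intersects the tags inferred from the query,
    # then hand them to standard RAG or a direct LLM call.
    return [c["text"] for c in chunks if c["tags"] & wanted]

print(filter_by_tags(chunks, {"kpi"}))
```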

Key Takeaways

| Point | Details |
| --- | --- |
| Data prep is king | Quality of source data matters most |
| LLM as reranker | Works even for embedding-based approaches |
| Always link sources | Point to original blocks/pages, ideally specific lines |
| Sometimes skip generation | Just show retrieved chunks instead of generating new text |
| Cost vs value | $1 for LLM retrieval < 2 hours of employee time at $20+/hr |

Reliability & Testing

Showing source references improves both reliability and testability. Best practice: show not just the pages but the exact lines the model relied on.