Watch the original stream (in Russian) by Nikolay Sheiko here for more details, examples, and an interesting Q&A. This post is his own Russian summary, translated and expanded with an LLM.
Core Problems with Embedding-Based Search
- Semantic similarity ≠ factual relevance - Words may match while the meaning is completely different
- No logical operations - Can’t handle “and”, “or”, “not” in queries
- No intermediate reasoning - Can’t look up abbreviations or definitions from other chunks
- Breaks on aggregations - Asking to “analyze weak and strong points across all posts” will match documents that literally mention “weak” and “strong” rather than performing actual analysis
- Query-document mismatch - Short user queries vs long documentation chunks differ in both length and phrasing
- Table handling - Terrible at understanding structured data
Example Where Embeddings Fail
User query: “Which departments, except Sales, didn’t meet KPI last quarter?”
What gets retrieved:
- “Q4 Report: Sales department brilliantly met KPI with record growth!”
- “Next quarter plan: all departments including Marketing should meet KPI”
- “Last quarter Development successfully met all KPIs”
Notice the problem?
Documents mention the right keywords but give opposite information. Embeddings match words, not logic.
Treating Symptoms (Still Using Embeddings)
These help but don’t solve the root issue:
- Query rewriting/expansion + Instruction awareness
- Search chunk summaries → use full text for generation
- Generate hypothetical questions per chunk → search questions instead of content
- Add neighboring chunks at generation time for more context
- Reranking with a secondary model
- Document preprocessing while preserving structure (marker-pdf, docling, unstructured)
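One of the symptom treatments above, "search chunk summaries → use full text for generation", can be sketched as follows. The `score` function is a toy keyword-overlap stand-in for a real embedding or reranker; the key idea is only that the *summary* is indexed while the *full text* is what reaches the generator.

```python
# Sketch: index short summaries for search, return full text for generation.
# `score` is a toy keyword-overlap scorer, not a real retrieval model.
from dataclasses import dataclass

@dataclass
class Chunk:
    summary: str    # short LLM-written summary, indexed for search
    full_text: str  # original chunk, used only at generation time

def score(query: str, text: str) -> int:
    q = set(query.lower().split())
    return len(q & set(text.lower().split()))

def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(query, c.summary), reverse=True)
    return [c.full_text for c in ranked[:k]]  # hand back full text, not summaries

chunks = [
    Chunk("KPI results per department for Q4", "Q4 Report: ... long table ..."),
    Chunk("Office relocation announcement", "We are moving to ..."),
]
print(retrieve("department KPI last quarter", chunks, k=1))
```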
Treating the Root Cause (No Embeddings)
Approach 1: LLM as Search Engine
Instead of embeddings, use a lightweight LLM for retrieval:
- Run LLM across document pages in parallel
- Ask true/false for relevance to query
- Fast because parallelized across pages
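A minimal sketch of this approach, assuming a hypothetical `ask_llm` call. Here `ask_llm` is a keyword stub so the snippet runs as-is; in practice it would be a true/false relevance prompt to a lightweight model, and the parallelism is what keeps it fast.

```python
# Sketch of "LLM as search engine": ask true/false relevance per page, in parallel.
# `ask_llm` is a hypothetical stub; replace with a real lightweight-model call.
from concurrent.futures import ThreadPoolExecutor

def ask_llm(query: str, page: str) -> bool:
    # Stand-in for a prompt like: "Is this page relevant to the query?
    # Answer true or false."
    return any(w in page.lower() for w in query.lower().split())

def find_relevant_pages(query: str, pages: list[str]) -> list[int]:
    with ThreadPoolExecutor(max_workers=16) as pool:
        flags = list(pool.map(lambda p: ask_llm(query, p), pages))
    return [i for i, ok in enumerate(flags) if ok]

pages = ["Sales KPI report ...", "Cafeteria menu ...", "Development KPI ..."]
print(find_relevant_pages("KPI", pages))  # → [0, 2]
```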
Approach 2: Multi-Page Context Window
An even more aggressive option:
- Feed hundreds of pages directly into the LLM (200-300k tokens even if window is 1M)
- Ask it to list only relevant page numbers
- Pass relevant pages to reasoning model for answer generation
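The steps above can be sketched like this. `call_model` is a hypothetical stub that returns a canned JSON answer so the example runs; in practice it would be a single long-context call (e.g. to Gemini 2.5 Flash), and the selected pages would then go to a reasoning model.

```python
import json

# Sketch of the multi-page approach: pack numbered pages into one prompt,
# ask for relevant page numbers as JSON, then hand those pages onward.
# `call_model` is a stub returning a canned answer, standing in for a
# long-context model call.
def build_prompt(query: str, pages: list[str]) -> str:
    body = "\n\n".join(f"[PAGE {i}]\n{p}" for i, p in enumerate(pages))
    return (f"Query: {query}\n\n{body}\n\n"
            "Return ONLY a JSON list of page numbers relevant to the query.")

def call_model(prompt: str) -> str:
    return "[0, 2]"  # canned response for illustration

def select_pages(query: str, pages: list[str]) -> list[str]:
    numbers = json.loads(call_model(build_prompt(query, pages)))
    return [pages[i] for i in numbers]  # these go to the reasoning model

pages = ["Sales KPI ...", "Cafeteria menu ...", "Development KPI ..."]
print(select_pages("Which departments missed KPI?", pages))
```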
Why Gemini 2.5 Flash rocks here
- Handles 200-300k tokens well
- Natively ingests PDFs without preprocessing
- Understands tables and images
Don’t Forget
Use structured output with chain-of-thought reasoning for intermediate steps.
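One way to arrange this, as a sketch: put a free-text reasoning field *before* the answer fields in the schema, so the model writes its intermediate steps first. The field names and schema shape here are illustrative, not a fixed API.

```python
import json

# Sketch of structured output with chain-of-thought: the "reasoning" field
# comes before the answer so intermediate steps are produced first.
# Field names are illustrative.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},  # intermediate steps first
        "relevant_pages": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["reasoning", "relevant_pages"],
}

# Example of a model response conforming to the schema:
raw = '{"reasoning": "Pages 3 and 7 discuss KPI misses.", "relevant_pages": [3, 7]}'
parsed = json.loads(raw)
print(parsed["relevant_pages"])  # → [3, 7]
```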
Alternative: Structured Search
Use case
Recipe book: “What can I cook for dinner with noodles, ground meat, and 20 minutes?”
Setup:
- Extract structured fields from each recipe (ingredients, cooking time, meal type)
- Store in SQL alongside original recipe text
- Use a text-to-SQL prompt to convert the user query into a SQL search query
- Pass the original recipe texts to the generation step
In practice, create separate tables for different data types with different parameters.
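The recipe setup above can be sketched end to end with SQLite. The schema, sample data, and the final SQL query (the kind a text-to-SQL prompt might produce for the noodles/ground-meat question) are all illustrative.

```python
import sqlite3

# Sketch of structured search: extracted recipe fields live in SQL next to
# the original text; the SELECT below is what a text-to-SQL prompt might
# produce for "noodles, ground meat, 20 minutes, dinner".
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE recipes (
    title TEXT, ingredients TEXT, minutes INTEGER,
    meal_type TEXT, full_text TEXT)""")
conn.executemany("INSERT INTO recipes VALUES (?,?,?,?,?)", [
    ("Quick bolognese", "noodles,ground meat,tomato", 20, "dinner", "..."),
    ("Slow-roast lamb", "lamb,rosemary", 240, "dinner", "..."),
])
rows = conn.execute("""
    SELECT title, full_text FROM recipes
    WHERE ingredients LIKE '%noodles%' AND ingredients LIKE '%ground meat%'
      AND minutes <= 20 AND meal_type = 'dinner'
""").fetchall()
print([r[0] for r in rows])  # → ['Quick bolognese']
```

The `full_text` column is what gets passed to the generation step; the structured columns exist only for filtering.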
Simpler version: Assign tags to text chunks via LLM → filter by tags → standard RAG or direct LLM call
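The tag-based variant, sketched with a toy `tag_chunk` function standing in for the offline LLM tagging call:

```python
# Sketch of the simpler tag-based variant: tags are assigned per chunk
# offline (here by a toy keyword stub standing in for an LLM), then used
# as a filter before any RAG step or direct LLM call.
def tag_chunk(text: str) -> set[str]:
    tags = set()
    if "kpi" in text.lower():
        tags.add("kpi")
    if "recipe" in text.lower():
        tags.add("cooking")
    return tags

chunks = ["Q4 KPI summary ...", "Noodle recipe ...", "Office party photos"]
index = [(c, tag_chunk(c)) for c in chunks]  # built once, offline

def filter_by_tag(tag: str) -> list[str]:
    return [c for c, tags in index if tag in tags]

print(filter_by_tag("kpi"))  # → ['Q4 KPI summary ...']
```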
Key Takeaways
| Point | Details |
|---|---|
| Data prep is king | Quality of source data matters most |
| LLM as reranker | Works even for embedding-based approaches |
| Always link sources | Point to original blocks/pages, ideally specific lines |
| Sometimes skip generation | Just show retrieved chunks instead of generating new text |
| Cost vs value | $1 for LLM retrieval < 2 hours of employee time at $20+/hr |
Reliability & Testing
Showing source references improves both reliability and testability. Best practice: show not just the pages but the exact lines the model relied on.
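A minimal sketch of line-level source references, assuming you keep line offsets alongside each retrieved span (the structure and names are illustrative):

```python
# Sketch: store page and line offsets per retrieved span so the UI can
# highlight exactly what the model relied on. Purely illustrative structure.
from dataclasses import dataclass

@dataclass
class SourceRef:
    page: int
    line_start: int
    line_end: int  # inclusive

def cite(document: list[str], ref: SourceRef) -> str:
    lines = document[ref.line_start:ref.line_end + 1]
    return f"p.{ref.page}, lines {ref.line_start}-{ref.line_end}: " + " ".join(lines)

doc = ["Sales met KPI.", "Marketing missed KPI.", "Development missed KPI."]
print(cite(doc, SourceRef(page=4, line_start=1, line_end=2)))
```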