Watch the original stream (in Russian) by Nikolay Sheiko here for more details, examples, and an interesting Q&A. This post is his own Russian summary, translated and expanded with an LLM.
Core Problems with Embedding-Based Search
- Semantic similarity ≠ factual relevance - Words may match while the meaning is completely different
- No logical operations - Can’t handle “and”, “or”, “not” in queries
- No intermediate reasoning - Can’t look up abbreviations or definitions from other chunks
- Breaks on aggregations - Asking to “analyze weak and strong points across all posts” will match documents that literally mention “weak” and “strong” rather than performing actual analysis
- Query-document mismatch - Short user queries vs long documentation chunks differ in both length and phrasing
- Table handling - Terrible at understanding structured data
Example Where Embeddings Fail
User query: “Which departments, except Sales, didn’t meet KPI last quarter?”
What gets retrieved:
- “Q4 Report: Sales department brilliantly met KPI with record growth!”
- “Next quarter plan: all departments including Marketing should meet KPI”
- “Last quarter Development successfully met all KPIs”
Notice the problem?
Documents mention the right keywords but give opposite information. Embeddings match words, not logic.
Treating Symptoms (Still Using Embeddings)
These help but don’t solve the root issue:
- Query rewriting/expansion + Instruction awareness
- Search chunk summaries → use full text for generation
- Generate hypothetical questions per chunk → search questions instead of content
- Add neighboring chunks at generation time for more context
- Reranking with a secondary model
- Document preprocessing while preserving structure (marker-pdf, docling, unstructured)
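One of the symptom treatments above, "search chunk summaries → use full text for generation", can be sketched as follows. The `score` function is a toy keyword-overlap stand-in for a real embedding or reranker; the key idea is only that the *summary* is indexed while the *full text* is what reaches the generator.

```python
# Sketch: index short summaries for search, return full text for generation.
# `score` is a toy keyword-overlap scorer, not a real retrieval model.
from dataclasses import dataclass

@dataclass
class Chunk:
    summary: str    # short LLM-written summary, indexed for search
    full_text: str  # original chunk, used only at generation time

def score(query: str, text: str) -> int:
    q = set(query.lower().split())
    return len(q & set(text.lower().split()))

def retrieve(query: str, chunks: list[Chunk], k: int = 2) -> list[str]:
    ranked = sorted(chunks, key=lambda c: score(query, c.summary), reverse=True)
    return [c.full_text for c in ranked[:k]]  # hand back full text, not summaries

chunks = [
    Chunk("KPI results per department for Q4", "Q4 Report: ... long table ..."),
    Chunk("Office relocation announcement", "We are moving to ..."),
]
print(retrieve("department KPI last quarter", chunks, k=1))
```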
Treating the Root Cause (No Embeddings)
Approach 1: LLM as Search Engine
Instead of embeddings, use a lightweight LLM for retrieval:
- Run LLM across document pages in parallel
- Ask true/false for relevance to query
- Fast because parallelized across pages
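A minimal sketch of this approach, assuming a hypothetical `ask_llm` call. Here `ask_llm` is a keyword stub so the snippet runs as-is; in practice it would be a true/false relevance prompt to a lightweight model, and the parallelism is what keeps it fast.

```python
# Sketch of "LLM as search engine": ask true/false relevance per page, in parallel.
# `ask_llm` is a hypothetical stub; replace with a real lightweight-model call.
from concurrent.futures import ThreadPoolExecutor

def ask_llm(query: str, page: str) -> bool:
    # Stand-in for a prompt like: "Is this page relevant to the query?
    # Answer true or false."
    return any(w in page.lower() for w in query.lower().split())

def find_relevant_pages(query: str, pages: list[str]) -> list[int]:
    with ThreadPoolExecutor(max_workers=16) as pool:
        flags = list(pool.map(lambda p: ask_llm(query, p), pages))
    return [i for i, ok in enumerate(flags) if ok]

pages = ["Sales KPI report ...", "Cafeteria menu ...", "Development KPI ..."]
print(find_relevant_pages("KPI", pages))  # → [0, 2]
```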
Approach 2: Multi-Page Context Window
An even more aggressive option:
- Feed hundreds of pages directly into the LLM (200-300k tokens even if window is 1M)
- Ask it to list only relevant page numbers
- Pass relevant pages to reasoning model for answer generation
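The steps above can be sketched like this. `call_model` is a hypothetical stub that returns a canned JSON answer so the example runs; in practice it would be a single long-context call (e.g. to Gemini 2.5 Flash), and the selected pages would then go to a reasoning model.

```python
import json

# Sketch of the multi-page approach: pack numbered pages into one prompt,
# ask for relevant page numbers as JSON, then hand those pages onward.
# `call_model` is a stub returning a canned answer, standing in for a
# long-context model call.
def build_prompt(query: str, pages: list[str]) -> str:
    body = "\n\n".join(f"[PAGE {i}]\n{p}" for i, p in enumerate(pages))
    return (f"Query: {query}\n\n{body}\n\n"
            "Return ONLY a JSON list of page numbers relevant to the query.")

def call_model(prompt: str) -> str:
    return "[0, 2]"  # canned response for illustration

def select_pages(query: str, pages: list[str]) -> list[str]:
    numbers = json.loads(call_model(build_prompt(query, pages)))
    return [pages[i] for i in numbers]  # these go to the reasoning model

pages = ["Sales KPI ...", "Cafeteria menu ...", "Development KPI ..."]
print(select_pages("Which departments missed KPI?", pages))
```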
Why Gemini 2.5 Flash rocks here
- Handles 200-300k tokens well
- Natively ingests PDFs without preprocessing
- Understands tables and images
Don’t Forget
Use structured output with chain-of-thought reasoning for intermediate steps.
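One way to arrange this, as a sketch: put a free-text reasoning field *before* the answer fields in the schema, so the model writes its intermediate steps first. The field names and schema shape here are illustrative, not a fixed API.

```python
import json

# Sketch of structured output with chain-of-thought: the "reasoning" field
# comes before the answer so intermediate steps are produced first.
# Field names are illustrative.
RESPONSE_SCHEMA = {
    "type": "object",
    "properties": {
        "reasoning": {"type": "string"},  # intermediate steps first
        "relevant_pages": {"type": "array", "items": {"type": "integer"}},
    },
    "required": ["reasoning", "relevant_pages"],
}

# Example of a model response conforming to the schema:
raw = '{"reasoning": "Pages 3 and 7 discuss KPI misses.", "relevant_pages": [3, 7]}'
parsed = json.loads(raw)
print(parsed["relevant_pages"])  # → [3, 7]
```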
Alternative: Structured Search
Use case
Recipe book: “What can I cook for dinner with noodles, ground meat, and 20 minutes?”
Setup:
- Extract structured fields from each recipe (ingredients, cooking time, meal type)
- Store in SQL alongside original recipe text
- Use a text-to-SQL prompt to convert the user query into a SQL search query
- Pass the original recipe texts to the generation step
In practice, create separate tables for different data types with different parameters.
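The recipe setup above can be sketched end to end with SQLite. The schema, sample data, and the final SQL query (the kind a text-to-SQL prompt might produce for the noodles/ground-meat question) are all illustrative.

```python
import sqlite3

# Sketch of structured search: extracted recipe fields live in SQL next to
# the original text; the SELECT below is what a text-to-SQL prompt might
# produce for "noodles, ground meat, 20 minutes, dinner".
conn = sqlite3.connect(":memory:")
conn.execute("""CREATE TABLE recipes (
    title TEXT, ingredients TEXT, minutes INTEGER,
    meal_type TEXT, full_text TEXT)""")
conn.executemany("INSERT INTO recipes VALUES (?,?,?,?,?)", [
    ("Quick bolognese", "noodles,ground meat,tomato", 20, "dinner", "..."),
    ("Slow-roast lamb", "lamb,rosemary", 240, "dinner", "..."),
])
rows = conn.execute("""
    SELECT title, full_text FROM recipes
    WHERE ingredients LIKE '%noodles%' AND ingredients LIKE '%ground meat%'
      AND minutes <= 20 AND meal_type = 'dinner'
""").fetchall()
print([r[0] for r in rows])  # → ['Quick bolognese']
```

The `full_text` column is what gets passed to the generation step; the structured columns exist only for filtering.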
Simpler version: Assign tags to text chunks via LLM → filter by tags → standard RAG or direct LLM call
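The tag-based variant, sketched with a toy `tag_chunk` function standing in for the offline LLM tagging call:

```python
# Sketch of the simpler tag-based variant: tags are assigned per chunk
# offline (here by a toy keyword stub standing in for an LLM), then used
# as a filter before any RAG step or direct LLM call.
def tag_chunk(text: str) -> set[str]:
    tags = set()
    if "kpi" in text.lower():
        tags.add("kpi")
    if "recipe" in text.lower():
        tags.add("cooking")
    return tags

chunks = ["Q4 KPI summary ...", "Noodle recipe ...", "Office party photos"]
index = [(c, tag_chunk(c)) for c in chunks]  # built once, offline

def filter_by_tag(tag: str) -> list[str]:
    return [c for c, tags in index if tag in tags]

print(filter_by_tag("kpi"))  # → ['Q4 KPI summary ...']
```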
Key Takeaways
| Point | Details |
|---|---|
| Data prep is king | Quality of source data matters most |
| LLM as reranker | Works even for embedding-based approaches |
| Always link sources | Point to original blocks/pages, ideally specific lines |
| Sometimes skip generation | Just show retrieved chunks instead of generating new text |
| Cost vs value | $1 for LLM retrieval < 2 hours of employee time at $20+/hr |
Reliability & Testing
Showing source references improves both reliability and testability. Best practice: show not just the pages but the exact lines the model relied on.
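A minimal sketch of line-level source references, assuming you keep line offsets alongside each retrieved span (the structure and names are illustrative):

```python
# Sketch: store page and line offsets per retrieved span so the UI can
# highlight exactly what the model relied on. Purely illustrative structure.
from dataclasses import dataclass

@dataclass
class SourceRef:
    page: int
    line_start: int
    line_end: int  # inclusive

def cite(document: list[str], ref: SourceRef) -> str:
    lines = document[ref.line_start:ref.line_end + 1]
    return f"p.{ref.page}, lines {ref.line_start}-{ref.line_end}: " + " ".join(lines)

doc = ["Sales met KPI.", "Marketing missed KPI.", "Development missed KPI."]
print(cite(doc, SourceRef(page=4, line_start=1, line_end=2)))
```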