⚡ Promptolis Original · AI Agents & Automation
📚 RAG Knowledge Base Architect
Designs a RAG knowledge base that actually surfaces the right chunks — instead of the typical 'we threw 500 PDFs at it and search returns nonsense' setup that 80% of teams ship.
Why this is epic
Most RAG systems fail not because of the embedding model but because of the chunking strategy and the lack of structured filtering. This Original designs both correctly for YOUR document type.
Outputs the full architecture: chunking strategy, embedding choice, metadata schema for filtering, hybrid retrieval (semantic + lexical), reranking, and the critical eval pipeline that catches retrieval drift before users notice.
Calibrated to 2026 RAG reality: long-context models reduced the importance of tight chunking but didn't eliminate it; hybrid retrieval beats pure semantic on technical docs; rerankers are now table-stakes, not optional.
The prompt
Example: input → output
Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.
<knowledge-corpus>Internal company engineering documentation: ~3,000 pages spread across Confluence (architecture docs, runbooks, postmortems), Notion (product specs, RFCs), GitHub READMEs (~150 service repos), and a few Google Docs. Estimated 3-5M tokens total.</knowledge-corpus>
<update-frequency>Confluence: ~10 edits/day. Notion: ~30 edits/day. GitHub READMEs: ~5 edits/day. Google Docs: rare.</update-frequency>
<query-pattern>Engineers asking: 'how does our auth flow work', 'what's the runbook for the orders service', 'where's the RFC for the new billing system', 'what did we decide in the postmortem from January about the queue outage'. ~70% fact lookup, ~30% synthesis.</query-pattern>
<expected-volume>~200 queries/day from 50 engineers. Peak ~500/day during active incidents.</expected-volume>
<latency-tolerance>Slack-based chat: 2-3s acceptable. CLI: same.</latency-tolerance>
<accuracy-requirements>Citations must always be correct (clickable to source). Hallucination on company-specific facts is unacceptable — better to say 'I don't know.'</accuracy-requirements>
<existing-state>Naive RAG built 6 months ago: documents chunked at 1000-character boundaries, OpenAI ada-002 embeddings, Pinecone, no reranker, no metadata filtering. Engineering team rates it 'mostly useless' — returns wrong sections, doesn't find recent edits, can't filter by 'this service' or 'after date X'.</existing-state>
<infrastructure-preference>Currently Pinecone. Open to switching if there's a strong reason. We have AWS infrastructure and DevOps capacity for self-hosted.</infrastructure-preference>
<integration>Slack bot + CLI tool. Both call the same backend API.</integration>
RAG Architecture: Internal Engineering Knowledge Base
RAG Suitability Check
Yes, RAG is the right pattern here, but your existing implementation is missing 4 of the 6 things that make RAG actually work. The complaints from your engineering team ('mostly useless') match exactly what you'd expect from naive-character-chunking + ada-002 + no-reranker + no-metadata-filtering.
Note on alternatives: for purely fact-lookup queries (~70% of your traffic), you'd also benefit from a structured search layer (Elasticsearch / Typesense) on top of doc metadata. Don't replace RAG with it; layer them. We'll design that in.
Architecture Overview
Pipeline:
Ingest → Structural Chunking → Metadata Enrichment → Hybrid Index (BM25 + dense)
↓
Query → Metadata Filter → Hybrid Retrieval (top 50) → Reranker (top 8) → LLM (Claude Sonnet 4)
↓
Cited Response
Why these choices for engineering docs specifically:
- Engineering docs are structured (markdown headings, code blocks). Structural chunking is the right unit.
- Engineers ask service-specific or time-bounded questions. Metadata filtering is critical.
- Mixed natural language + code. Hybrid retrieval (BM25 catches code symbols, dense catches concepts) outperforms pure dense.
- 200 q/day is small; reranker latency tax is fine.
Chunking Strategy
Replace your 1000-char chunking with structural chunking:
For Confluence + Notion + Google Docs (markdown-like):
- Chunk at H2 boundaries by default (a section)
- If a section >1500 tokens, sub-chunk at H3
- If a section <100 tokens AND adjacent to another small section, merge
- Code blocks NEVER split — keep whole even if 800 tokens
- Tables NEVER split — keep whole
- Each chunk gets parent-section context (H1, H2 path) added as a header
For GitHub READMEs:
- Chunk at top-level headings
- README's first paragraph (the 'what is this service' summary) gets a separate special chunk tagged as 'service_summary'
- Code-fence blocks treated as semantic units
For postmortems:
- Each section (Summary, Timeline, Root Cause, Action Items) is its own chunk
- Add a special 'postmortem_summary' chunk that combines Summary + Root Cause and is tagged 'incident'
Chunk size targets:
- Optimal: 500-1500 tokens
- Hard cap: 2500 tokens
- Hard floor: 50 tokens (smaller = merge with neighbor)
Why this works: engineers ask 'what's in the runbook for X service' — they need section-level chunks. Character-count chunks split a runbook step across 2 chunks half the time, breaking semantic units.
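A minimal sketch of that structural chunker, under a few simplifying assumptions: token counts are approximated by word count, the H3 sub-chunking step for oversized sections is omitted, and the Chunk type and function names are illustrative rather than taken from any existing library.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:
    section_path: list[str]   # e.g. ["Orders Service Runbook", "Rollback Procedure"]
    text: str

def approx_tokens(text: str) -> int:
    # Rough word-based approximation (~1.3 tokens/word); swap in a real tokenizer in production.
    return int(len(text.split()) * 1.3)

def chunk_markdown(doc: str, title: str, min_tokens: int = 100) -> list[Chunk]:
    """Split a markdown-like doc at H2 boundaries, never splitting inside code fences,
    and merging sections that fall under the minimum size into their neighbour."""
    sections: list[tuple[str, list[str]]] = []
    heading, lines, in_fence = title, [], False
    for line in doc.splitlines():
        if line.lstrip().startswith("```"):
            in_fence = not in_fence          # code fences are opaque: never split inside them
        if not in_fence and re.match(r"^##\s+", line):
            sections.append((heading, lines))
            heading, lines = line.lstrip("# ").strip(), []
        else:
            lines.append(line)
    sections.append((heading, lines))

    chunks: list[Chunk] = []
    for heading, body_lines in sections:
        text = "\n".join(body_lines).strip()
        if not text:
            continue
        # Prepend the parent-section path so each chunk carries its own context.
        body = f"{title} > {heading}\n\n{text}"
        if chunks and approx_tokens(text) < min_tokens:
            chunks[-1].text += "\n\n" + body  # merge tiny section into the previous chunk
        else:
            chunks.append(Chunk(section_path=[title, heading], text=body))
        # NOTE: sub-chunking sections above the 1500-token target at H3 boundaries is omitted here.
    return chunks
```

The metadata enricher (next section) then attaches a payload to each of these chunks before indexing.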
Metadata Schema
{
chunk_id: string (uuid),
source: 'confluence' | 'notion' | 'github_readme' | 'gdoc',
source_id: string (page id, repo path, etc.),
source_url: string (clickable in Slack),
document_title: string,
section_path: string[] (e.g., ['Engineering', 'Auth', 'OAuth Flow']),
service: string | null (which service this doc is about — extracted via regex + LLM tagging),
doc_type: 'runbook' | 'rfc' | 'postmortem' | 'architecture' | 'how_to' | 'service_readme' | 'other',
authors: string[],
created_at: ISO date,
updated_at: ISO date,
is_archived: boolean,
contains_code: boolean,
language: 'en' (extensible),
postmortem_incident_id: string | null,
rfc_status: 'draft' | 'accepted' | 'rejected' | null
}
Population:
- 90% comes from source system metadata (Confluence/Notion APIs expose most of these fields)
- service: extracted via regex matching against your service registry, with an LLM fallback for ambiguous docs (one-time cost on ingest)
- doc_type: same approach, pattern-matching plus an LLM classifier
Use at query time: before vector search, filter the candidate set:
- 'runbook for orders service' → filter: service=orders AND doc_type=runbook
- 'recent postmortems about queues' → filter: doc_type=postmortem AND updated_at > 90 days ago AND tags contain 'queue'
- 'RFC for billing system' → filter: doc_type=rfc AND rfc_status='accepted' (engineers usually want decisions, not draft RFCs)
Filter inference from query: small LLM call (Haiku) at query time extracts implied filters from the natural-language query.
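A sketch of that filter-extraction call using the Anthropic Python SDK; the model id and the exact set of filter fields returned are assumptions to adapt to your schema.

```python
import json
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

FILTER_PROMPT = """Extract metadata filters implied by this query over an engineering docs index.
Return only JSON with any of: service, doc_type (runbook|rfc|postmortem|architecture|how_to|service_readme),
updated_after (ISO date), rfc_status. Use null for fields the query does not imply.

Query: {query}"""

def extract_filters(query: str) -> dict:
    response = client.messages.create(
        model="claude-haiku-4-5",   # assumption: substitute whatever small/fast model you run
        max_tokens=200,
        messages=[{"role": "user", "content": FILTER_PROMPT.format(query=query)}],
    )
    try:
        return json.loads(response.content[0].text)
    except (json.JSONDecodeError, IndexError):
        return {}  # no filters inferred; fall back to unfiltered hybrid retrieval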
Embedding Choice
Switch from OpenAI ada-002 (deprecated) to Voyage AI voyage-3, the embedding provider Anthropic recommends.
Reasons:
- ada-002 is the 2023 baseline; 2026 alternatives are markedly better
- Voyage AI's voyage-3 is specifically trained for technical/code retrieval (top scores on BEIR + code retrieval benchmarks)
- Cost: ~$0.06/M tokens for voyage-3, in the same range as ada-002; negligible at your volume
Alternative: if you prefer staying in OpenAI ecosystem, switch to text-embedding-3-large (better than ada-002).
Dimension: 1024 (voyage-3 default). Storage cost is fine.
Indexing decision: index BOTH chunk-level embeddings AND document-level embeddings. Document-level helps when query matches a document's overall topic without matching specific chunks well.
One-time cost to re-embed your 3-5M tokens: ~$0.18-0.30. Trivial.
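A sketch of the index-time embedding call with Voyage's Python client; treat the model id and parameters as assumptions to verify against Voyage's current documentation.

```python
import voyageai

vo = voyageai.Client()  # reads VOYAGE_API_KEY from the environment

def embed_chunks(texts: list[str]) -> list[list[float]]:
    # input_type="document" at index time; use input_type="query" when embedding user queries.
    result = vo.embed(texts, model="voyage-3", input_type="document")
    return result.embeddings

# Document-level vectors (title + opening summary embedded as one text) are stored alongside
# the chunk-level vectors, per the indexing decision above.
```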
Hybrid Retrieval Design
Replace pure-semantic with hybrid:
BM25 component: index all chunks with their text + the section_path (helps lexical matches on service names, function names, error codes). Use Pinecone's sparse-dense hybrid OR move to Qdrant which natively supports hybrid.
Dense component: voyage-3 embeddings as designed.
Fusion: Reciprocal Rank Fusion (RRF) with k=60. Pull top 50 from each, fuse, take top 50 fused results.
Why hybrid wins for engineering docs:
- Service names (e.g., 'orders-api'), function names, error codes are lexical matches. BM25 finds these directly; dense embeddings often miss the exact identifier.
- Conceptual queries ('how does retry work in our queue system') need dense embeddings to match across paraphrasing.
- Most engineering queries blend both. RRF fusion captures both.
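RRF is simple enough to implement inline. A sketch with k=60 as specified, operating on two ranked lists of chunk ids (BM25 order and dense order), is below; the fused top 50 is what gets handed to the reranker.

```python
def rrf_fuse(bm25_ids: list[str], dense_ids: list[str], k: int = 60, top_n: int = 50) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over result lists of 1 / (k + rank)."""
    scores: dict[str, float] = {}
    for ranked in (bm25_ids, dense_ids):
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```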
Reranking
Add Cohere Rerank v3 or BGE-reranker-v2-m3 between retrieval and LLM.
- Input: top 50 from hybrid retrieval
- Output: top 8 reranked
- Latency: 100-200ms
- Cost: ~$0.001 per query at your volume (Cohere) or self-hosted free (BGE)
Why this matters: Hybrid retrieval gets you good recall. Reranking gets you precision — the model sees only highly-relevant chunks, which dramatically reduces hallucination and improves citation accuracy. Skipping the reranker is the single most common reason production RAG underperforms benchmarks.
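A sketch of the reranking step using Cohere's Python SDK; the model id and the shape of the chunk dicts are assumptions, and a self-hosted BGE cross-encoder drops into the same interface.

```python
import cohere

co = cohere.Client()  # assumption: API key picked up from the environment (or pass api_key=...)

def rerank(query: str, chunks: list[dict], top_n: int = 8) -> list[dict]:
    """Rerank the ~50 hybrid-retrieval candidates down to the top 8 handed to the LLM."""
    response = co.rerank(
        model="rerank-english-v3.0",   # assumption: pick Cohere's current rerank model
        query=query,
        documents=[c["text"] for c in chunks],
        top_n=top_n,
    )
    return [chunks[r.index] for r in response.results]
```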
Eval Pipeline
Golden query set: 100 queries from real engineering Slack questions you've already answered.
For each query:
- The query text
- The 'ground truth' chunk(s) — the docs that contain the answer (verified by an engineer)
- The expected final answer
Retrieval metrics:
- Recall@8: what % of queries had at least one ground-truth chunk in the top 8 reranked results. Target ≥95%.
- MRR (Mean Reciprocal Rank): average of 1/rank of the first ground-truth chunk across queries. Target ≥0.8.
- Precision@3: what % of top 3 results are ground-truth-relevant. Target ≥70%.
End-to-end metrics:
- Citation accuracy (do citations point to correct source?). Target 100% — broken citations are a critical bug.
- Answer correctness (LLM-judge with Claude Opus, calibrated against engineer review). Target ≥85%.
- 'I don't know' rate (queries where the ground truth is in the corpus but the system says it doesn't know). Target <10%.
Drift detection:
- Run the 100-query suite weekly
- Compare to baseline
- Alarm if recall@8 drops >3 points or citation accuracy <100%
On every change to chunking, embedding, or reranker: run the full suite. Block deploy if regression.
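A sketch of the retrieval-metric computation over the golden set, assuming each record in golden_queries.jsonl is a JSON line carrying the query and its ground-truth chunk ids, and that retrieve() is your production pipeline returning reranked chunk ids in order.

```python
import json
from typing import Callable

def run_retrieval_evals(golden_path: str, retrieve: Callable[[str], list[str]]) -> dict:
    """Compute Recall@8 and MRR over the golden query set.
    Each line of the file: {"query": "...", "ground_truth_chunk_ids": ["...", ...]}."""
    recalls, reciprocal_ranks = [], []
    with open(golden_path) as f:
        for line in f:
            record = json.loads(line)
            truth = set(record["ground_truth_chunk_ids"])
            results = retrieve(record["query"])[:8]
            recalls.append(1.0 if truth & set(results) else 0.0)
            rank = next((i for i, cid in enumerate(results, start=1) if cid in truth), None)
            reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    n = len(recalls)
    return {"recall_at_8": sum(recalls) / n, "mrr": sum(reciprocal_ranks) / n}
```

Store the baseline result alongside the suite and diff against it on every run; the drift alarm compares these two numbers.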
Update / Re-index Strategy
Webhook-driven incremental updates:
- Confluence + Notion + GitHub all expose webhooks for page-edit events
- Webhook → ingest worker → re-chunk affected document → upsert chunks (delete old chunk_ids tied to that document, insert new ones)
- Latency target: doc edit → searchable in 5 minutes
For Google Docs (no webhook): poll daily. Acceptable since GDocs change rarely in your setup.
Re-embedding: only the changed document's chunks. Cost: pennies per edit.
Staleness handling: include updated_at in retrieval scoring as a soft boost (recent docs slightly preferred when relevance is similar). Don't filter strictly by date unless query implies it.
Periodic full re-index: once per quarter, rebuild from scratch. Catches any drift between source-of-truth and index.
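A sketch of the idempotent re-index path with qdrant-client, assuming a single 'engineering_kb' collection and chunk dicts that carry an embedding plus a metadata payload; the sparse/BM25 side of the hybrid index is omitted here, and the exact client calls should be verified against the Qdrant docs.

```python
import uuid
from qdrant_client import QdrantClient, models

client = QdrantClient(url="http://localhost:6333")  # assumption: self-hosted Qdrant on AWS

def reindex_document(source_id: str, chunks: list[dict]) -> None:
    """Webhook handler path: drop every chunk tied to the edited document,
    then insert the freshly re-chunked, re-embedded versions. Idempotent per source_id."""
    client.delete(
        collection_name="engineering_kb",
        points_selector=models.FilterSelector(
            filter=models.Filter(must=[
                models.FieldCondition(key="source_id", match=models.MatchValue(value=source_id)),
            ])
        ),
    )
    client.upsert(
        collection_name="engineering_kb",
        points=[
            models.PointStruct(id=str(uuid.uuid4()), vector=c["embedding"], payload=c["metadata"])
            for c in chunks
        ],
    )
```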
Cost & Latency Profile
Per-query (post-redesign):
- Filter LLM call (Haiku): ~$0.0005
- Hybrid retrieval (Qdrant or Pinecone): ~$0.0001 + 50ms
- Reranking (Cohere): ~$0.001 + 150ms
- LLM final answer (Claude Sonnet 4): ~$0.005
- Total: ~$0.007/query, ~1500ms latency.
Monthly cost at 200 queries/day:
- LLM + reranker: ~$42/mo
- Vector DB (Qdrant Cloud or Pinecone): ~$70/mo
- Embedding (re-embed deltas): ~$5/mo
- Total: ~$120/mo.
Storage cost at 3-5M tokens corpus: ~10K-15K chunks at 1024-dim = trivial (<1GB). Both Pinecone and Qdrant handle easily.
Implementation Skeleton
/rag_engineering_kb
/ingest
confluence_ingester.py (webhook + initial bulk)
notion_ingester.py
github_readme_ingester.py
gdoc_ingester.py (poll-based)
structural_chunker.py (markdown-aware chunking)
metadata_enricher.py (regex + LLM tagging for service, doc_type)
/index
embedder.py (voyage-3 batch embedding)
qdrant_client.py (or pinecone_client.py)
upsert.py (idempotent upserts keyed by source+source_id)
/query
filter_extractor.py (Haiku: NL query → metadata filters)
hybrid_retriever.py (BM25 + dense + RRF fusion)
reranker.py (Cohere Rerank or BGE)
answer_generator.py (Sonnet 4 with strict 'cite-or-say-don't-know' prompt)
/eval
golden_queries.jsonl (100 queries with ground truth)
run_evals.py (weekly cron + on-demand)
drift_alarm.py (post-deploy diff)
/api
slack_bot.py (Slack entry point)
cli.py (CLI entry point)
backend.py (FastAPI; both Slack + CLI call this)
/docs
PROMPTS.md (the answer-generation prompt + filter-extraction prompt)
EVAL_RUBRIC.md (how to grade answer correctness)
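For reference, a sketch of the cite-or-say-don't-know call inside answer_generator.py; the system prompt wording and model id are illustrative, not the shipped prompt.

```python
import anthropic

client = anthropic.Anthropic()

SYSTEM = (
    "Answer only from the provided chunks. Cite every claim with the chunk's source_url. "
    "If the chunks do not contain the answer, say you don't know; never guess company-specific facts."
)

def generate_answer(query: str, chunks: list[dict]) -> str:
    context = "\n\n".join(f"[{c['metadata']['source_url']}]\n{c['text']}" for c in chunks)
    response = client.messages.create(
        model="claude-sonnet-4-20250514",  # assumption: pin whatever Sonnet snapshot you run
        max_tokens=1000,
        system=SYSTEM,
        messages=[{"role": "user", "content": f"Context:\n{context}\n\nQuestion: {query}"}],
    )
    return response.content[0].text
```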
Infrastructure recommendation: Migrate from Pinecone to Qdrant (self-hosted on AWS). Reasons specific to your context:
- You have AWS infrastructure and DevOps capacity
- Qdrant has superior native hybrid retrieval (Pinecone supports sparse-dense hybrid now, but Qdrant's implementation is more mature)
- Cost: ~$70/mo Qdrant Cloud or even less self-hosted vs Pinecone's tier pricing
- If DevOps capacity is constrained: stay on Pinecone; the architecture works there too with their hybrid features
What This Architecture Won't Solve
- Truly novel questions with no doc support — system says 'I don't know,' which is correct behavior but feels unhelpful
- Highly conversational queries ('what should we do about X?') — RAG retrieves relevant context but synthesis is the LLM's job; expect quality variance
- Cross-document synthesis questions spanning many docs — top-8 reranked chunks may miss some relevant context. Long-context models help but don't fully solve.
- Permission-aware retrieval — your spec didn't mention this, but if some Confluence pages have restricted access, this design doesn't enforce it. Add ACL filtering at ingest if needed.
- Stale information that nobody updates — if a runbook is wrong-but-no-one-edited-it, RAG faithfully returns the wrong thing. Out of scope.
Migration Path
Week 1-2: Build new pipeline in shadow mode.
- Re-ingest all docs with new chunker + metadata enrichment
- Build hybrid index in Qdrant (or upgrade Pinecone to hybrid)
- Run new pipeline on every query that hits old pipeline; log results but show old to users
- Compare on 100 golden queries
Week 3: A/B test.
- Route 20% of queries to new pipeline; 80% old
- Monitor: did the engineering team's complaints decrease? Are citation accuracy metrics holding?
- If yes: ramp to 50% then 100%
Week 4: Full cutover.
- 100% on new pipeline
- Old Pinecone index stays available for 30 days as rollback
- After 30 days clean shutdown
Week 5+: Iterate.
- Use eval drift alarms to catch regressions
- Add new doc types as they come up
- Refresh golden query set quarterly
Key Takeaways
- Your existing RAG fails on 4 of 6 things that matter: chunking, metadata, hybrid retrieval, reranking. The architecture changes are mechanical.
- Structural chunking (by headings + code blocks) is the single biggest fix for engineering docs. Character-count chunking is the wrong unit.
- Metadata filtering is half the system. 'Runbook for orders service' should filter to runbooks for orders BEFORE vector search.
- Hybrid retrieval (BM25 + dense + RRF) outperforms pure semantic on technical content with code, function names, and service identifiers.
- Reranker is non-negotiable in 2026. Skipping it is the most common cause of 'RAG works in benchmarks but not in production.'
- Eval the retrieval, not just the answer. Recall@8 + MRR + citation accuracy.
- Migrate from Pinecone to Qdrant if you have DevOps capacity. Stay on Pinecone if not — the architecture works there too.
Common use cases
- Engineer building an internal docs search agent for a company wiki
- Builder shipping a customer-facing 'ask the docs' AI on their product
- Solo dev creating a RAG layer over their personal knowledge (notes, papers, books)
- Team migrating from a naive RAG (chunked + embedded + vector search) to something that actually works
- Builder evaluating whether RAG is even the right pattern (sometimes it isn't)
- Engineer adding RAG to an existing agent for grounding/hallucination reduction
Best AI model for this
Claude Opus 4. RAG architecture requires reasoning about retrieval mechanics, eval design, and document-specific tradeoffs — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Pro tips
- Chunk by structure, not by character count. Markdown headings, code blocks, table boundaries — these are natural chunk boundaries. Character-count chunks fragment semantic units.
- Always add metadata. Source, date, type, author, section. Filter THEN search beats search-everything-then-filter.
- Hybrid retrieval (BM25 + dense embedding) outperforms pure semantic on technical/code/legal docs. Don't skip it.
- Rerankers are non-negotiable in 2026. Cohere Rerank or BGE-reranker on top of initial retrieval improves precision dramatically.
- Embed at chunk-level + document-level. Document-level embeddings help relevance scoring across documents; chunk-level for citations.
- Long-context models (1M+ tokens) reduced tight-chunking pressure but increased the cost of poor chunking — feeding a model 500 wrong chunks is more expensive than 5 right ones.
- Eval the retrieval, not just the final output. 'Right answer for wrong reason' is a silent quality problem.
Customization tips
- Describe your corpus precisely. RAG architecture differs significantly between code docs, customer support, books, legal docs. Be specific about document types and structure.
- List your real query patterns with example queries. Architecture decisions (filtering schema, hybrid weights, reranker choice) calibrate against actual queries.
- If existing RAG doesn't work, describe the specific failure modes: 'returns wrong sections', 'doesn't find recent edits', etc. The redesign targets named failures.
- Specify update frequency and freshness requirements. Daily-edited corpora need webhook-driven updates; static corpora can use bulk re-index.
- List your accuracy bar concretely. 'Citations must always be correct' is a hard constraint that shapes the architecture.
- Specify infrastructure preferences: managed vs self-hosted, existing tools, team capacity. Recommendations differ for AWS-heavy teams vs Vercel-heavy ones.
- For very long-form content (books, transcripts), use the Long-Form Content Mode variant — adds hierarchical chunking and section-level navigation.
Variants
Internal Docs Mode
For company wiki / Confluence / Notion knowledge bases — emphasizes access control and freshness.
Customer-Facing Docs Mode
For 'ask our docs' agents on product help sites — emphasizes citation accuracy and tone.
Code & API Docs Mode
For developer documentation — emphasizes code-block preservation and version-specific filtering.
Long-Form Content Mode
For books, papers, transcripts — emphasizes hierarchical chunking and section-level navigation.
Frequently asked questions
How do I use the RAG Knowledge Base Architect prompt?
Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.
Which AI model works best with RAG Knowledge Base Architect?
Claude Opus 4. RAG architecture requires reasoning about retrieval mechanics, eval design, and document-specific tradeoffs — exactly Claude's strengths. ChatGPT GPT-5 second-best.
Can I customize the RAG Knowledge Base Architect prompt for my use case?
Yes — every Promptolis Original is designed to be customized. Key levers: chunk by structure rather than by character count (markdown headings, code blocks, and table boundaries are the natural chunk boundaries; character-count chunks fragment semantic units), and always add metadata (source, date, type, author, section) so you can filter before you search rather than search everything and filter afterwards.
Explore more Originals
Hand-crafted 2026-grade prompts that actually change how you work.
← All Promptolis Originals