⚡ Promptolis Original · AI Agents & Automation

📚 RAG Knowledge Base Architect

Designs a RAG knowledge base that actually surfaces the right chunks — instead of the typical 'we threw 500 PDFs at it and search returns nonsense' setup that most teams ship.

⏱️ 6 min to set up 🤖 ~140 seconds in Claude 🗓️ Updated 2026-04-28

Why this is epic

Most RAG systems fail not because of the embedding model but because of the chunking strategy and the lack of structured filtering. This Original designs both correctly for YOUR document type.

Outputs the full architecture: chunking strategy, embedding choice, metadata schema for filtering, hybrid retrieval (semantic + lexical), reranking, and the critical eval pipeline that catches retrieval drift before users notice.

Calibrated to 2026 RAG reality: long-context models reduced the importance of tight chunking but didn't eliminate it; hybrid retrieval beats pure semantic on technical docs; rerankers are now table-stakes, not optional.

The prompt

Promptolis Original · Copy-ready
<role> You are a RAG architecture engineer with 4+ years building production RAG systems on enterprise wikis, customer support docs, code documentation, and academic content. You have shipped 25+ RAG systems handling 10K+ queries/day combined. You have audited 50+ broken RAG systems and know the specific failure patterns. You are direct. You will tell a builder their character-count chunking is the bug, that they're missing hybrid retrieval, or that their use case doesn't actually need RAG (sometimes it's just search). You refuse to recommend 'use OpenAI embeddings' as a generic answer — embedding choice depends on document type and language. </role> <principles> 1. Chunk by structure (headings, code blocks, tables), not character count. 2. Metadata filtering is half the system. Filter THEN search. 3. Hybrid retrieval (BM25 + dense) beats pure semantic on technical docs. 4. Rerankers are table-stakes in 2026. 5. Embed at chunk + document level. Both have uses. 6. Long-context didn't kill chunking; it made bad chunking expensive. 7. Eval retrieval, not just final output. 
</principles> <input> <knowledge-corpus>{type of documents — wiki / code docs / customer support / books / mixed; estimated size in pages or tokens}</knowledge-corpus> <update-frequency>{how often docs change — daily / weekly / static}</update-frequency> <query-pattern>{what kinds of queries — fact lookup / synthesis / how-to / mixed; example queries}</query-pattern> <expected-volume>{queries/day, peak/day}</expected-volume> <latency-tolerance>{real-time chat <1s / chat 1-3s / async tolerable}</latency-tolerance> <accuracy-requirements>{citation accuracy bar, hallucination tolerance, etc.}</accuracy-requirements> <existing-state>{naive RAG that doesn't work / no RAG yet / partial system / 'recommend'}</existing-state> <infrastructure-preference>{managed (Pinecone, Weaviate Cloud) / self-hosted (Qdrant, pgvector) / 'recommend'}</infrastructure-preference> <integration>{standalone / part of agent / chat UI / API endpoint}</integration> </input> <output-format> # RAG Architecture: [knowledge base name] ## RAG Suitability Check Is RAG the right pattern? If not, what is? ## Architecture Overview The full pipeline: ingest → chunk → embed → index → retrieve → rerank → generate. Why these choices. ## Chunking Strategy How to chunk YOUR document type. Specific rules. What to preserve. What to split. ## Metadata Schema The specific metadata fields. How they're populated. How they're used at query time. ## Embedding Choice Which embedding model + why. Cost projection. Dimension. Multilingual considerations. ## Hybrid Retrieval Design BM25 component, dense embedding component, how they're combined. Specific weights or fusion approach. ## Reranking Which reranker. When it fires. Cost vs quality tradeoff. ## Eval Pipeline Golden query set. Retrieval metrics (recall@k, precision@k, MRR). End-to-end metrics. Drift detection. ## Update / Re-index Strategy How new docs flow in. How edits update embeddings. Stale-data prevention. ## Cost & Latency Profile Per-query cost breakdown. 
Latency profile. Storage costs. ## Implementation Skeleton File structure, key components, infrastructure picks. ## What This Architecture Won't Solve Honest list of edge cases this doesn't handle well. ## Migration Path If existing RAG: how to swap in without breaking. ## Key Takeaways 5-7 bullets — for the team's RAG playbook. </output-format> <auto-intake> If input incomplete: ask for corpus type, update frequency, query pattern, volume, latency tolerance, accuracy requirements, existing state, infrastructure preference, integration. </auto-intake> Now, design the RAG architecture:

Example: input → output

Here's how this prompt actually performs. Real input below, real output from Claude Opus 4.

📝 Input
<knowledge-corpus>Internal company engineering documentation: ~3,000 pages spread across Confluence (architecture docs, runbooks, postmortems), Notion (product specs, RFCs), GitHub READMEs (~150 service repos), and a few Google Docs. Estimated 3-5M tokens total.</knowledge-corpus>
<update-frequency>Confluence: ~10 edits/day. Notion: ~30 edits/day. GitHub READMEs: ~5 edits/day. Google Docs: rare.</update-frequency>
<query-pattern>Engineers asking: 'how does our auth flow work', 'what's the runbook for the orders service', 'where's the RFC for the new billing system', 'what did we decide in the postmortem from January about the queue outage'. ~70% fact lookup, ~30% synthesis.</query-pattern>
<expected-volume>~200 queries/day from 50 engineers. Peak ~500/day during active incidents.</expected-volume>
<latency-tolerance>Slack-based chat: 2-3s acceptable. CLI: same.</latency-tolerance>
<accuracy-requirements>Citations must always be correct (clickable to source). Hallucination on company-specific facts is unacceptable — better to say 'I don't know.'</accuracy-requirements>
<existing-state>Naive RAG built 6 months ago: documents chunked at 1000-character boundaries, OpenAI ada-002 embeddings, Pinecone, no reranker, no metadata filtering. Engineering team rates it 'mostly useless' — returns wrong sections, doesn't find recent edits, can't filter by 'this service' or 'after date X'.</existing-state>
<infrastructure-preference>Currently Pinecone. Open to switching if there's a strong reason. We have AWS infrastructure and DevOps capacity for self-hosted.</infrastructure-preference>
<integration>Slack bot + CLI tool. Both call the same backend API.</integration>
🤖 Output

RAG Architecture: Internal Engineering Knowledge Base

RAG Suitability Check

Yes, RAG is the right pattern here, but your existing implementation is missing 4 of the 6 things that make RAG actually work. The complaints from your engineering team ('mostly useless') match exactly what you'd expect from naive-character-chunking + ada-002 + no-reranker + no-metadata-filtering.

Note on alternatives: for purely fact-lookup queries (~70% of your traffic), you'd also benefit from a structured search layer (Elasticsearch / Typesense) on top of doc metadata. Don't replace RAG with it; layer them. We'll design that in.

Architecture Overview

Pipeline:

Ingest → Structural Chunking → Metadata Enrichment → Hybrid Index (BM25 + dense)
                                                       ↓
Query → Metadata Filter → Hybrid Retrieval (top 50) → Reranker (top 8) → LLM (Claude Sonnet 4)
                                                                               ↓
                                                                          Cited Response

Why these choices for engineering docs specifically:

  • Engineering docs are structured (markdown headings, code blocks). Structural chunking is the right unit.
  • Engineers ask service-specific or time-bounded questions. Metadata filtering is critical.
  • Mixed natural language + code. Hybrid retrieval (BM25 catches code symbols, dense catches concepts) outperforms pure dense.
  • 200 q/day is small; reranker latency tax is fine.

Chunking Strategy

Replace your 1000-char chunking with structural chunking:

For Confluence + Notion + Google Docs (markdown-like):

  • Chunk at H2 boundaries by default (a section)
  • If a section >1500 tokens, sub-chunk at H3
  • If a section <100 tokens AND adjacent to another small section, merge
  • Code blocks NEVER split — keep whole even if 800 tokens
  • Tables NEVER split — keep whole
  • Each chunk gets parent-section context (H1, H2 path) added as a header

For GitHub READMEs:

  • Chunk at top-level headings
  • README's first paragraph (the 'what is this service' summary) gets a separate special chunk tagged as 'service_summary'
  • Code-fence blocks treated as semantic units

For postmortems:

  • Each section (Summary, Timeline, Root Cause, Action Items) is its own chunk
  • Add a special 'postmortem_summary' chunk that combines Summary + Root Cause and tags the chunk as 'incident'

Chunk size targets:

  • Optimal: 500-1500 tokens
  • Hard cap: 2500 tokens
  • Hard floor: 50 tokens (smaller = merge with neighbor)

Why this works: engineers ask 'what's in the runbook for X service' — they need section-level chunks. Character-count chunks split a runbook step across 2 chunks half the time, breaking semantic units.
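
The H2-boundary and merge rules above can be sketched in a few lines. This is a minimal illustration, not the production chunker: it assumes plain markdown input, uses a crude token estimate, and omits the H3 sub-chunking rule and code-fence/table preservation (a real implementation needs a markdown parser so a `##` inside a code fence doesn't trigger a split). All names are illustrative.

```python
import re

MIN_TOKENS = 100  # sections smaller than this get merged into a neighbor

def rough_tokens(text: str) -> int:
    # Crude size estimate: ~0.75 words per token is close enough for chunk sizing.
    return int(len(text.split()) / 0.75)

def chunk_by_h2(markdown: str, doc_title: str) -> list[dict]:
    """Split a markdown document at H2 boundaries, merging tiny sections
    into their predecessor. H3 sub-chunking of oversized sections and
    code-fence preservation are omitted for brevity."""
    parts = re.split(r"(?m)^(?=## )", markdown)  # split *before* each H2 line
    chunks: list[dict] = []
    for part in parts:
        if not part.strip():
            continue
        heading = part.splitlines()[0].lstrip("# ").strip() if part.startswith("## ") else "(intro)"
        if chunks and rough_tokens(part) < MIN_TOKENS:
            # Tiny section: merge into the previous chunk instead of standing alone.
            chunks[-1]["text"] += "\n\n" + part.strip()
        else:
            chunks.append({
                "section_path": [doc_title, heading],
                # Prepend the heading path so the chunk carries its own context.
                "text": f"{doc_title} > {heading}\n\n{part.strip()}",
            })
    return chunks
```

The prepended `doc_title > heading` header is what gives each chunk the parent-section context called for above.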

Metadata Schema

{
  chunk_id: string (uuid),
  source: 'confluence' | 'notion' | 'github_readme' | 'gdoc',
  source_id: string (page id, repo path, etc.),
  source_url: string (clickable in Slack),
  document_title: string,
  section_path: string[] (e.g., ['Engineering', 'Auth', 'OAuth Flow']),
  service: string | null (which service this doc is about — extracted via regex + LLM tagging),
  doc_type: 'runbook' | 'rfc' | 'postmortem' | 'architecture' | 'how_to' | 'service_readme' | 'other',
  authors: string[],
  created_at: ISO date,
  updated_at: ISO date,
  is_archived: boolean,
  contains_code: boolean,
  language: 'en' (extensible),
  postmortem_incident_id: string | null,
  rfc_status: 'draft' | 'accepted' | 'rejected' | null
}

Population:

  • 90% comes from source system metadata (Confluence/Notion APIs expose most of these fields)
  • service extracted via: regex matching against your service registry + LLM fallback for ambiguous docs (one-time cost on ingest)
  • doc_type similar: pattern-matching + LLM classifier

Use at query time: before vector search, filter the candidate set:

  • 'runbook for orders service' → filter service=orders AND doc_type=runbook
  • 'recent postmortems about queues' → filter doc_type=postmortem AND updated_at > 90 days ago AND tags contain 'queue'
  • 'RFC for billing system' → filter doc_type=rfc AND rfc_status='accepted' (engineers usually want decisions, not draft RFCs)

Filter inference from query: small LLM call (Haiku) at query time extracts implied filters from the natural-language query.
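
To make the filter-then-search semantics concrete, here is roughly how the inferred filters might be applied to the candidate set. The field names follow the metadata schema above; in production this would be a vector-DB filter condition rather than a Python loop, and the Haiku call that produces the `filters` dict is not shown.

```python
from datetime import datetime, timedelta, timezone

def apply_filters(chunks: list[dict], filters: dict) -> list[dict]:
    """Narrow the candidate set with metadata BEFORE any vector search runs.
    The loop only demonstrates the semantics; a real system pushes these
    conditions down into the Qdrant/Pinecone query."""
    out = chunks
    if "service" in filters:
        out = [c for c in out if c.get("service") == filters["service"]]
    if "doc_type" in filters:
        out = [c for c in out if c.get("doc_type") == filters["doc_type"]]
    if "updated_within_days" in filters:
        cutoff = datetime.now(timezone.utc) - timedelta(days=filters["updated_within_days"])
        out = [c for c in out if datetime.fromisoformat(c["updated_at"]) >= cutoff]
    return out
```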

Embedding Choice

Switch from OpenAI ada-002 (deprecated) to Voyage AI voyage-3, the embedding provider Anthropic itself recommends.

Reasons:

  • ada-002 is the 2023 baseline; 2026 alternatives are markedly better
  • Voyage AI's voyage-3 is specifically trained for technical/code retrieval (top scores on BEIR + code retrieval benchmarks)
  • Cost: $0.06/M tokens for voyage-3 (vs ada-002's similar pricing); negligible at your volume

Alternative: if you prefer staying in OpenAI ecosystem, switch to text-embedding-3-large (better than ada-002).

Dimension: 1024 (voyage-3 default). Storage cost is fine.

Indexing decision: index BOTH chunk-level embeddings AND document-level embeddings. Document-level helps when query matches a document's overall topic without matching specific chunks well.

One-time cost to re-embed your 3-5M tokens: ~$0.18-0.30. Trivial.

Hybrid Retrieval Design

Replace pure-semantic with hybrid:

BM25 component: index all chunks with their text + the section_path (helps lexical matches on service names, function names, error codes). Use Pinecone's sparse-dense hybrid, or move to Qdrant, which supports hybrid search natively.

Dense component: voyage-3 embeddings as designed.

Fusion: Reciprocal Rank Fusion (RRF) with k=60. Pull top 50 from each, fuse, take top 50 fused results.

Why hybrid wins for engineering docs:

  • Service names (e.g., 'orders-api'), function names, error codes are lexical matches. BM25 finds these directly; dense embeddings often miss the exact identifier.
  • Conceptual queries ('how does retry work in our queue system') need dense embeddings to match across paraphrasing.
  • Most engineering queries blend both. RRF fusion captures both.
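
RRF itself is only a few lines. A sketch of the fusion step, assuming each retriever returns a list of chunk ids ordered best-first:

```python
def rrf_fuse(rankings: list[list[str]], k: int = 60, top_n: int = 50) -> list[str]:
    """Reciprocal Rank Fusion: score(id) = sum over lists of 1 / (k + rank).
    rankings: one ordered id list per retriever (e.g. [bm25_ids, dense_ids])."""
    scores: dict[str, float] = {}
    for ranked in rankings:
        for rank, chunk_id in enumerate(ranked, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first, truncated to the candidate budget.
    return sorted(scores, key=scores.get, reverse=True)[:top_n]
```

Because RRF works on ranks rather than raw scores, it needs no normalization between BM25 scores and cosine similarities, which is the usual argument for it over weighted score sums.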

Reranking

Add Cohere Rerank v3 or BGE-reranker-v2-m3 between retrieval and LLM.

  • Input: top 50 from hybrid retrieval
  • Output: top 8 reranked
  • Latency: 100-200ms
  • Cost: ~$0.001 per query at your volume (Cohere), or free if self-hosted (BGE)

Why this matters: Hybrid retrieval gets you good recall. Reranking gets you precision — the model sees only highly-relevant chunks, which dramatically reduces hallucination and improves citation accuracy. Skipping the reranker is the single most common reason production RAG underperforms benchmarks.
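
Structurally, the reranking stage reduces to one scored pass once the model is abstracted away. In this sketch `score_fn` is a placeholder for the cross-encoder call (Cohere Rerank or BGE); the real APIs batch all candidates in a single request and their signatures differ.

```python
from typing import Callable

def rerank(query: str, candidates: list[dict],
           score_fn: Callable[[str, str], float], top_n: int = 8) -> list[dict]:
    """Re-score the hybrid top-50 with a cross-encoder and keep the best top_n.
    score_fn(query, chunk_text) -> relevance; in production this wraps the
    Cohere Rerank endpoint or a self-hosted BGE reranker."""
    scored = sorted(candidates, key=lambda c: score_fn(query, c["text"]), reverse=True)
    return scored[:top_n]
```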

Eval Pipeline

Golden query set: 100 queries from real engineering Slack questions you've already answered.

For each query:

  • The query text
  • The 'ground truth' chunk(s) — the docs that contain the answer (verified by an engineer)
  • The expected final answer
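
A golden query entry might look like the following (one JSON object per line in golden_queries.jsonl; the chunk-id format and field names are illustrative, not prescribed):

```json
{"query": "what's the runbook for the orders service", "ground_truth_chunk_ids": ["confluence:48812#2"], "expected_answer": "Points to the orders-service runbook with a clickable citation"}
```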

Retrieval metrics:

  • Recall@8: what % of queries had at least one ground-truth chunk in the top 8 reranked results. Target ≥95%.
  • MRR (Mean Reciprocal Rank): the mean, across queries, of 1/rank of the first ground-truth chunk. Target ≥0.8.
  • Precision@3: what % of top 3 results are ground-truth-relevant. Target ≥70%.
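
These retrieval metrics are cheap to compute directly. A sketch over the golden set, where each run pairs the reranked chunk ids with the verified ground-truth ids:

```python
def recall_at_k(results: list[str], ground_truth: set[str], k: int = 8) -> float:
    # 1.0 if any ground-truth chunk appears in the top-k results, else 0.0.
    return 1.0 if any(cid in ground_truth for cid in results[:k]) else 0.0

def reciprocal_rank(results: list[str], ground_truth: set[str]) -> float:
    # 1/rank of the first ground-truth chunk; 0.0 if none was retrieved.
    for rank, cid in enumerate(results, start=1):
        if cid in ground_truth:
            return 1.0 / rank
    return 0.0

def evaluate(runs: list[tuple[list[str], set[str]]]) -> dict:
    """runs: (reranked chunk ids, ground-truth chunk ids) per golden query."""
    n = len(runs)
    return {
        "recall@8": sum(recall_at_k(r, gt) for r, gt in runs) / n,
        "mrr": sum(reciprocal_rank(r, gt) for r, gt in runs) / n,
    }
```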

End-to-end metrics:

  • Citation accuracy (do citations point to correct source?). Target 100% — broken citations are a critical bug.
  • Answer correctness (LLM-judge with Claude Opus, calibrated against engineer review). Target ≥85%.
  • 'I don't know' rate (how often the system says it doesn't know even though the ground truth is in the corpus). Target <10%.

Drift detection:

  • Run the 100-query suite weekly
  • Compare to baseline
  • Alarm if recall@8 drops >3 points or citation accuracy <100%

On every change to chunking, embedding, or reranker: run the full suite. Block deploy if regression.

Update / Re-index Strategy

Webhook-driven incremental updates:

  • Confluence + Notion + GitHub all expose webhooks for page-edit events
  • Webhook → ingest worker → re-chunk affected document → upsert chunks (delete old chunk_ids tied to that document, insert new ones)
  • Latency target: doc edit → searchable in 5 minutes
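
The upsert step must be idempotent: re-processing the same webhook twice should leave the index identical. A sketch of the delete-then-insert semantics, with a plain dict standing in for the vector index (a real implementation issues a delete-by-filter plus a batch upsert to Qdrant/Pinecone):

```python
def upsert_document(index: dict[str, dict], source_id: str, new_chunks: list[dict]) -> None:
    """Re-index one edited document: delete every chunk that belonged to this
    source_id, then insert the freshly re-chunked versions. Keying chunk ids
    by source_id makes the operation safe to repeat."""
    stale = [cid for cid, c in index.items() if c["source_id"] == source_id]
    for cid in stale:
        del index[cid]
    for i, chunk in enumerate(new_chunks):
        index[f"{source_id}#{i}"] = {**chunk, "source_id": source_id}
```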

For Google Docs (no webhook): poll daily. Acceptable since GDocs change rarely in your setup.

Re-embedding: only the changed document's chunks. Cost: pennies per edit.

Staleness handling: include updated_at in retrieval scoring as a soft boost (recent docs slightly preferred when relevance is similar). Don't filter strictly by date unless query implies it.
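
One way to implement the soft boost is a small exponentially decayed bonus applied after reranking, so freshness only breaks near-ties rather than overriding relevance. The half-life and weight here are illustrative starting points; tune them against the golden query set:

```python
from datetime import datetime, timezone

def recency_boost(score: float, updated_at: str, half_life_days: float = 180.0) -> float:
    """Add a small freshness bonus to a relevance score.
    A brand-new doc gets at most +0.05; the bonus halves every half_life_days,
    so it only matters when two chunks are nearly tied on relevance."""
    age_days = (datetime.now(timezone.utc) - datetime.fromisoformat(updated_at)).days
    bonus = 0.05 * (0.5 ** (age_days / half_life_days))
    return score + bonus
```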

Periodic full re-index: once per quarter, rebuild from scratch. Catches any drift between source-of-truth and index.

Cost & Latency Profile

Per-query (post-redesign):

  • Filter LLM call (Haiku): ~$0.0005
  • Hybrid retrieval (Qdrant or Pinecone): ~$0.0001 + 50ms
  • Reranking (Cohere): ~$0.001 + 150ms
  • LLM final answer (Claude Sonnet 4): ~$0.005
  • Total: ~$0.007/query, ~1500ms latency.

Monthly cost at 200 queries/day:

  • LLM + reranker: ~$42/mo
  • Vector DB (Qdrant Cloud or Pinecone): ~$70/mo
  • Embedding (re-embed deltas): ~$5/mo
  • Total: ~$120/mo.

Storage cost at 3-5M tokens corpus: ~10K-15K chunks at 1024-dim = trivial (<1GB). Both Pinecone and Qdrant handle this easily.

Implementation Skeleton

/rag_engineering_kb
  /ingest
    confluence_ingester.py    (webhook + initial bulk)
    notion_ingester.py
    github_readme_ingester.py
    gdoc_ingester.py          (poll-based)
    structural_chunker.py     (markdown-aware chunking)
    metadata_enricher.py      (regex + LLM tagging for service, doc_type)
  /index
    embedder.py               (voyage-3 batch embedding)
    qdrant_client.py          (or pinecone_client.py)
    upsert.py                 (idempotent upserts keyed by source+source_id)
  /query
    filter_extractor.py       (Haiku: NL query → metadata filters)
    hybrid_retriever.py       (BM25 + dense + RRF fusion)
    reranker.py               (Cohere Rerank or BGE)
    answer_generator.py       (Sonnet 4 with strict 'cite-or-say-don't-know' prompt)
  /eval
    golden_queries.jsonl      (100 queries with ground truth)
    run_evals.py              (weekly cron + on-demand)
    drift_alarm.py            (post-deploy diff)
  /api
    slack_bot.py              (Slack entry point)
    cli.py                    (CLI entry point)
    backend.py                (FastAPI; both Slack + CLI call this)
  /docs
    PROMPTS.md                (the answer-generation prompt + filter-extraction prompt)
    EVAL_RUBRIC.md            (how to grade answer correctness)

Infrastructure recommendation: Migrate from Pinecone to Qdrant (self-hosted on AWS). Reasons specific to your context:

  • You have AWS infrastructure and DevOps capacity
  • Qdrant has more mature native hybrid retrieval (Pinecone supports sparse-dense too, but Qdrant's implementation is stronger)
  • Cost: ~$70/mo Qdrant Cloud or even less self-hosted vs Pinecone's tier pricing
  • If DevOps capacity is constrained: stay on Pinecone; the architecture works there too with their hybrid features

What This Architecture Won't Solve

  • Truly novel questions with no doc support — system says 'I don't know,' which is correct behavior but feels unhelpful
  • Highly conversational queries ('what should we do about X?') — RAG retrieves relevant context but synthesis is the LLM's job; expect quality variance
  • Cross-document synthesis questions spanning many docs — top-8 reranked chunks may miss some relevant context. Long-context models help but don't fully solve.
  • Permission-aware retrieval — your spec didn't mention this, but if some Confluence pages have restricted access, this design doesn't enforce it. Add ACL filtering at ingest if needed.
  • Stale information that nobody updates — if a runbook is wrong-but-no-one-edited-it, RAG faithfully returns the wrong thing. Out of scope.

Migration Path

Week 1-2: Build new pipeline in shadow mode.

  • Re-ingest all docs with new chunker + metadata enrichment
  • Build hybrid index in Qdrant (or upgrade Pinecone to hybrid)
  • Run new pipeline on every query that hits old pipeline; log results but show old to users
  • Compare on 100 golden queries

Week 3: A/B test.

  • Route 20% of queries to new pipeline; 80% old
  • Monitor: did the engineering team's complaints decrease? Are citation accuracy metrics holding?
  • If yes: ramp to 50% then 100%

Week 4: Full cutover.

  • 100% on new pipeline
  • Old Pinecone index stays available for 30 days as rollback
  • After 30 days clean shutdown

Week 5+: Iterate.

  • Use eval drift alarms to catch regressions
  • Add new doc types as they come up
  • Refresh golden query set quarterly

Key Takeaways

  • Your existing RAG fails on 4 of 6 things that matter: chunking, metadata, hybrid retrieval, reranking. The architecture changes are mechanical.
  • Structural chunking (by headings + code blocks) is the single biggest fix for engineering docs. Character-count chunking is the wrong unit.
  • Metadata filtering is half the system. 'Runbook for orders service' should filter to runbooks for orders BEFORE vector search.
  • Hybrid retrieval (BM25 + dense + RRF) outperforms pure semantic on technical content with code, function names, and service identifiers.
  • Reranker is non-negotiable in 2026. Skipping it is the most common cause of 'RAG works in benchmarks but not in production.'
  • Eval the retrieval, not just the answer. Recall@8 + MRR + citation accuracy.
  • Migrate from Pinecone to Qdrant if you have DevOps capacity. Stay on Pinecone if not — the architecture works there too.

Common use cases

  • Engineer building an internal docs search agent for a company wiki
  • Builder shipping a customer-facing 'ask the docs' AI on their product
  • Solo dev creating a RAG layer over their personal knowledge (notes, papers, books)
  • Team migrating from a naive RAG (chunked + embedded + vector search) to something that actually works
  • Builder evaluating whether RAG is even the right pattern (sometimes it isn't)
  • Engineer adding RAG to an existing agent for grounding/hallucination reduction

Best AI model for this

Claude Opus 4. RAG architecture requires reasoning about retrieval mechanics, eval design, and document-specific tradeoffs — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Pro tips

  • Chunk by structure, not by character count. Markdown headings, code blocks, table boundaries — these are natural chunk boundaries. Character-count chunks fragment semantic units.
  • Always add metadata. Source, date, type, author, section. Filter THEN search beats search-everything-then-filter.
  • Hybrid retrieval (BM25 + dense embedding) outperforms pure semantic on technical/code/legal docs. Don't skip it.
  • Rerankers are non-negotiable in 2026. Cohere Rerank or BGE-reranker on top of initial retrieval improves precision dramatically.
  • Embed at chunk-level + document-level. Document-level embeddings help relevance scoring across documents; chunk-level for citations.
  • Long-context models (1M+ tokens) reduced tight-chunking pressure but increased the cost of poor chunking — feeding a model 500 wrong chunks is more expensive than 5 right ones.
  • Eval the retrieval, not just the final output. 'Right answer for wrong reason' is a silent quality problem.

Customization tips

  • Describe your corpus precisely. RAG architecture differs significantly between code docs, customer support, books, legal docs. Be specific about document types and structure.
  • List your real query patterns with example queries. Architecture decisions (filtering schema, hybrid weights, reranker choice) calibrate against actual queries.
  • If existing RAG doesn't work, describe the specific failure modes: 'returns wrong sections', 'doesn't find recent edits', etc. The redesign targets named failures.
  • Specify update frequency and freshness requirements. Daily-edited corpora need webhook-driven updates; static corpora can use bulk re-index.
  • List your accuracy bar concretely. 'Citations must always be correct' is a hard constraint that shapes the architecture.
  • Specify infrastructure preferences: managed vs self-hosted, existing tools, team capacity. Recommendations differ for AWS-heavy teams vs Vercel-heavy ones.
  • For very long-form content (books, transcripts), use the Long-Form Content Mode variant — adds hierarchical chunking and section-level navigation.

Variants

Internal Docs Mode

For company wiki / Confluence / Notion knowledge bases — emphasizes access control and freshness.

Customer-Facing Docs Mode

For 'ask our docs' agents on product help sites — emphasizes citation accuracy and tone.

Code & API Docs Mode

For developer documentation — emphasizes code-block preservation and version-specific filtering.

Long-Form Content Mode

For books, papers, transcripts — emphasizes hierarchical chunking and section-level navigation.

Frequently asked questions

How do I use the RAG Knowledge Base Architect prompt?

Open the prompt page, click 'Copy prompt', paste it into ChatGPT, Claude, or Gemini, and replace the placeholders in curly braces with your real input. The prompt is also launchable directly in each model with one click.

Which AI model works best with RAG Knowledge Base Architect?

Claude Opus 4. RAG architecture requires reasoning about retrieval mechanics, eval design, and document-specific tradeoffs — exactly Claude's strengths. ChatGPT GPT-5 second-best.

Can I customize the RAG Knowledge Base Architect prompt for my use case?

Yes — every Promptolis Original is designed to be customized. Key levers: chunk by structure rather than character count (markdown headings, code blocks, and table boundaries are natural chunk boundaries), and always add metadata (source, date, type, author, section) so you can filter before you search.

Explore more Originals

Hand-crafted 2026-grade prompts that actually change how you work.

← All Promptolis Originals