RAG App: Hybrid Retrieval Question Answering
A production-style Retrieval-Augmented Generation system that ingests PDF documents, indexes them with BM25 lexical search and FAISS dense vector search, and serves grounded answers through a FastAPI /ask endpoint powered by the OpenAI API.
Hybrid RAG
BM25 Lexical + FAISS Semantic Retrieval
FastAPI
Production /ask Endpoint with Swagger UI
2-Stage
Retrieval Pipeline — Hybrid Search then RRF Fusion
The Problem
Organizations have vast knowledge locked in PDF documents — but no way to query it programmatically with grounded, accurate answers
Reports, manuals, policies, contracts, and research documents contain enormous amounts of organizational knowledge — but that knowledge is locked in static files that cannot be searched semantically, cannot answer follow-up questions, and cannot be integrated into downstream applications without manual extraction. Generic LLMs can answer questions, but without grounding in specific document content they hallucinate — producing confident but factually incorrect responses that cannot be trusted for operational or decision-making use. The gap is between the knowledge that exists in documents and the ability to query it accurately, programmatically, and at scale — without requiring a human to read every file.
The Solution
A hybrid RAG system that indexes PDFs with BM25 and FAISS and serves grounded answers through a production FastAPI endpoint
The RAG App is a production-style Retrieval-Augmented Generation system that transforms a folder of PDF documents into a queryable knowledge base accessible through a single API endpoint. PDFs are extracted via PyPDF, cleaned, and chunked into retrieval-ready passages. Each chunk is indexed twice — with BM25 via Whoosh for lexical keyword matching, and with FAISS for dense semantic vector search. When a question arrives at the FastAPI /ask endpoint, it is first expanded into multiple query variants (paraphrases plus a HyDE hypothetical answer); BM25 and FAISS searches then run for every variant, and the resulting ranked lists are fused with Reciprocal Rank Fusion before the top passages are passed to the OpenAI API alongside the original question. The LLM generates a grounded answer based only on the retrieved passages — constraining responses to what actually exists in the documents. Adding new documents to the knowledge base requires only dropping PDFs into the data folder and running a single index rebuild script.
Key Outcome
A production-ready RAG system that unlocks the knowledge trapped in organizational PDFs — combining BM25 lexical search and FAISS semantic search in a hybrid retrieve-then-fuse pipeline, and serving grounded, document-constrained answers through a FastAPI /ask endpoint that any downstream application can integrate with immediately.
Technical Deep Dive
Architecture & Design
RAG Pipeline
Stage 1 — PDF Ingestion & Chunking
Step 1
PDF Extraction
extract_text.py · PyPDF reads all PDFs from data/pdfs/
Step 2
Cleaning
Whitespace normalized · PDF artifacts removed
Step 3
Chunking
chunk_text.py · Overlapping passages for retrieval continuity
Stage 2 — Dual Index Construction · pipeline_runner.py
Index A · Lexical
BM25 via Whoosh
bm25_index.py · Stored in vectorstores/bm25_index/
Index B · Semantic
FAISS Dense Vector Search
embed_store.py · Stored in vectorstores/faiss_topic/
Stage 3 — Query Expansion · Multi-Query + HyDE
User Question · POST /ask
FastAPI /ask Endpoint — main_app.py
JSON payload · Swagger UI at /docs · ReDoc at /redoc
Original
Original Query
Always kept in candidate set
Rewrites 1–3
Query Paraphrases
GPT-4o-mini · N=3 diverse variants
HyDE
Hypothetical Answer
GPT-4o-mini · 2–3 sentence mini-answer as semantic probe
Stage 4 — Per-Variant Hybrid Retrieval · query_retriever.py
BM25 Search · per variant
Lexical Retrieval
Top-60 results per query variant · Exact term matching
FAISS Search · per variant
Semantic Retrieval
Top-60 results per query variant · Dense similarity search
Stage 5 — Reciprocal Rank Fusion (RRF)
RRF Fusion · all ranked lists
Score Fusion & Re-ranking
All BM25 + FAISS lists across all query variants fused via RRF · Deduplication · Final top-5 selected
Stage 6 — Grounded Answer Generation · generate_answer.py
Generation · OpenAI API
Grounded Answer Synthesis
Top-5 fused passages + question → OpenAI · LLM constrained to document content · Grounded response returned
API Response
JSON Answer at /ask
Grounded answer · Constrained to PDF knowledge base · No hallucinated content
Stage 1
PDF Ingestion & Chunking
extract_text.py reads all PDFs from data/pdfs/ using PyPDF, producing clean plain text. chunk_text.py splits the extracted text into overlapping passages — overlapping windows ensure that retrieval does not miss content that spans a chunk boundary. The full ingestion pipeline is triggered by a single pipeline_runner.py execution and can be rerun whenever new PDFs are added to the data folder.
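The overlap logic can be sketched in a few lines — the chunk size and overlap values here are illustrative, not necessarily the ones chunk_text.py actually uses:

```python
def chunk_text(text, chunk_size=800, overlap=200):
    """Split text into overlapping character windows.

    Each chunk shares `overlap` characters with its neighbor, so content
    that straddles a boundary appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment that is fully contained in the previous chunk
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

Tuning is a recall/precision trade-off: larger overlap means more redundancy in the index but fewer boundary misses.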
Stage 2
Dual Index Construction
bm25_index.py builds a BM25 keyword index over the chunked text using Whoosh, stored in vectorstores/bm25_index/. embed_store.py generates dense embeddings using OpenAI's text-embedding-3-small model and builds a FAISS vector index stored in vectorstores/faiss_topic/. Both indexes are built in a single pipeline_runner.py execution — ensuring they are always synchronized to the same document set.
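Whoosh computes BM25 scores internally; as a minimal illustration of the scoring formula itself (Okapi BM25 — this is not the Whoosh API), a toy scorer over pre-tokenized documents might look like:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query terms with Okapi BM25.

    Rewards rare terms (IDF), saturates repeated terms (k1), and
    normalizes for document length (b).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

This is why the lexical side excels at named entities and figures: a document must literally contain the query terms to score at all.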
Stage 3
Multi-Query Expansion + HyDE
Before any retrieval runs, the incoming question is expanded into a candidate set of five query variants. Multi-query rewriting with GPT-4o-mini generates three diverse paraphrases that cover different ways the same question might be expressed in the document corpus. HyDE (Hypothetical Document Embeddings) generates a 2–3 sentence factual mini-answer that acts as a semantic probe — searching for passages that match what a good answer would look like rather than just what the question asks. The original query is always kept in the candidate set.
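The expansion step can be sketched as plain prompt builders — the actual prompts in the repository are not shown here, so the wording below is an illustrative assumption (the chat-message shape is the standard OpenAI format):

```python
def multiquery_messages(question, n=3):
    """Chat messages asking GPT-4o-mini for n diverse paraphrases."""
    return [
        {"role": "system", "content": (
            f"Rewrite the user's question in {n} different ways, one per "
            "line, preserving its meaning but varying the vocabulary.")},
        {"role": "user", "content": question},
    ]

def hyde_messages(question):
    """Chat messages asking for a short hypothetical answer (HyDE probe)."""
    return [
        {"role": "system", "content": (
            "Write a plausible 2-3 sentence factual answer to the question. "
            "It will be used only as a semantic search probe.")},
        {"role": "user", "content": question},
    ]

def build_variants(question, paraphrases, hyde_answer):
    """Candidate set: the original query is always kept, per the design."""
    return [question, *paraphrases, hyde_answer]
```

Sending each message list through the chat completions endpoint yields the three paraphrases and the HyDE probe that feed the retrieval stage.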
Stage 4
Per-Variant Hybrid Retrieval
BM25 and FAISS searches run independently for every query variant — the original, all three paraphrases, and the HyDE text. Each search returns up to 60 results, producing up to 10 ranked lists (5 variants × 2 retrieval methods) that together cover a much broader surface of the document space than any single query could reach. All result lists are passed to the RRF fusion stage.
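The per-variant fan-out reduces to a simple loop; `bm25_search` and `faiss_search` below are stand-ins for the project's real retrievers, assumed to take a query and a result count and return doc ids best-first:

```python
def hybrid_retrieve(variants, bm25_search, faiss_search, k=60):
    """Run both retrievers for every query variant.

    Returns up to 2 * len(variants) ranked lists (best-first doc ids),
    which are handed off to RRF fusion.
    """
    ranked_lists = []
    for variant in variants:
        ranked_lists.append(bm25_search(variant, k))
        ranked_lists.append(faiss_search(variant, k))
    return ranked_lists
```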
Stage 5
Reciprocal Rank Fusion (RRF)
All ranked lists from Stage 4 are fused using Reciprocal Rank Fusion — a rank-based fusion algorithm that assigns each document a score of 1/(k + rank) across every list it appears in and sums them. Documents that rank highly across multiple query variants and retrieval methods receive the highest fused scores. After deduplication, the top-5 documents by fused score are selected as the final context for generation.
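RRF itself is a few lines; `k=60` is the conventional constant from the original RRF formulation and may differ from the project's setting:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Reciprocal Rank Fusion: each doc's score is the sum of 1/(k + rank)
    over every list it appears in (ranks are 1-based).

    Deduplication falls out naturally, since scores are accumulated
    per doc id across all lists.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]
```

Because only ranks enter the formula, BM25 and FAISS lists can be fused directly despite their incomparable raw scores.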
Stage 6
Grounded Answer Generation
generate_answer.py passes the top-5 RRF-fused passages alongside the original question to the OpenAI API. The LLM generates a response grounded in the retrieved passages — constrained to what actually exists in the document knowledge base. Answers are returned as JSON through the FastAPI /ask endpoint, with full Swagger UI and ReDoc documentation at /docs and /redoc.
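A hedged sketch of the grounding step — the exact instructions in generate_answer.py are not shown, so this prompt wording is illustrative of the constraint, not a copy of it:

```python
def grounded_prompt(question, passages):
    """Assemble chat messages that confine the model to retrieved passages."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "Answer using ONLY the context passages below. If the context does "
        "not contain the answer, say you don't know. Do not use outside "
        "knowledge.\n\nContext:\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The numbered passage markers also make it straightforward to extend the module with source citations later.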
Key Design Decisions
Multi-query rewriting expands retrieval coverage across vocabulary gaps
A single query formulation may not match the vocabulary used in the document — a question about "financial position" may not retrieve passages that use "total assets" or "balance sheet." By generating three diverse paraphrases of every question using GPT-4o-mini before retrieval runs, the system searches the index from multiple angles simultaneously — dramatically increasing the probability that relevant passages are retrieved regardless of how the document author chose to express the same concept.
HyDE retrieves by answer shape — not question shape
Standard semantic search embeds the question and finds passages that are semantically similar to the question — but relevant document passages are often more similar to answers than to questions. HyDE generates a hypothetical 2–3 sentence answer and uses it as the embedding query, retrieving passages that match what a correct answer would look like rather than what the question looks like. This consistently surfaces relevant passages that question-based retrieval misses, particularly for factual and numerical queries.
RRF fuses ranked lists without requiring score calibration
BM25 scores and FAISS cosine similarity scores are not on the same scale — direct score combination would require careful normalization that depends on the corpus and query distribution. RRF bypasses this entirely by operating only on ranks, not scores. A document that ranks 3rd in BM25 and 5th in FAISS receives a well-defined fused score regardless of what the raw scores were. This makes RRF robust across different retrieval methods, query types, and document collections without any calibration overhead.
Overlapping chunks prevent boundary retrieval failures
Fixed-length non-overlapping chunks create invisible boundaries where relevant content is split between two adjacent chunks — neither of which contains enough context to be retrieved as relevant. Overlapping windows ensure that content near a boundary appears intact in at least one chunk, eliminating the boundary retrieval failure mode that degrades answer quality on questions that span paragraph transitions.
Tech Stack
| Technology | Purpose |
|---|---|
| FastAPI + Uvicorn | Web API serving the /ask endpoint with Swagger UI and ReDoc |
| FAISS | Dense vector index for semantic similarity search per query variant |
| BM25 / Whoosh | Lexical keyword index for exact term matching per query variant |
| OpenAI API (GPT-4o-mini) | Multi-query rewriting, HyDE generation, and grounded answer synthesis |
| OpenAI Embeddings (text-embedding-3-small) | Dense vector embeddings for FAISS index construction and query embedding |
| Reciprocal Rank Fusion (RRF) | Rank-based fusion of all BM25 + FAISS result lists across all query variants |
| LangChain (FAISS + OpenAIEmbeddings) | Vector store management and embedding model integration |
| PyPDF | PDF text extraction from document knowledge base |
| pipeline_runner.py | Single-command index rebuild — extract, chunk, embed, build BM25 + FAISS |
| Python | Core language and module orchestration |
Results & Metrics
What the system delivers
Hybrid
RAG Architecture
BM25 lexical + FAISS semantic retrieval running in parallel — results fused via RRF before generation
FastAPI
Production /ask Endpoint
Swagger UI + ReDoc documentation · Immediately consumable by any downstream application
2-Stage
Retrieval Pipeline
Hybrid search then RRF fusion — capturing both exact matches and conceptual relevance in every query
Drop PDFs and rebuild — zero configuration required
Adding new documents to the knowledge base requires only placing PDFs in data/pdfs/ and running python pipeline_runner.py. The pipeline extracts, cleans, chunks, and rebuilds both the BM25 and FAISS indexes automatically — no schema changes, no database migrations, no manual index management. The /ask endpoint immediately reflects the updated knowledge base on the next request.
Hybrid retrieval handles both exact and conceptual queries
BM25 retrieves passages containing the exact terms in the question — critical for named entities, figures, dates, and technical terminology. FAISS retrieves passages that are semantically similar even when vocabulary differs. By merging both result sets, the system correctly handles questions like "what are the total assets in 2022?" (lexical) and "what is the financial position of the organization?" (semantic) with equal reliability.
Grounded answers constrained to document content
By passing only retrieved document passages to the OpenAI model — rather than asking it to answer from general knowledge — the system constrains every answer to content that actually exists in the knowledge base. This eliminates hallucination for document-specific queries, making the system trustworthy for operational use cases where accuracy against specific documents is required.
API-first design — plugs into any downstream application
The FastAPI /ask endpoint accepts a standard JSON POST request and returns a JSON answer — making the knowledge base queryable by any application that can make an HTTP request. The auto-generated Swagger UI at /docs and ReDoc at /redoc provide interactive documentation without any additional setup. The system is immediately integrable into dashboards, chatbots, internal tools, or custom front-ends.
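A minimal stdlib client sketch — the request field name `question` and the default port are assumptions, since the app's Pydantic schema is not shown:

```python
import json
from urllib import request

def build_ask_request(question, base_url="http://localhost:8000"):
    """Construct the JSON POST request for the /ask endpoint."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        f"{base_url}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, base_url="http://localhost:8000"):
    """POST a question and return the parsed JSON answer."""
    with request.urlopen(build_ask_request(question, base_url)) as resp:
        return json.loads(resp.read())
```

Any HTTP-capable client works equally well; the interactive Swagger UI at /docs shows the exact schema to target.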
Modular architecture enables independent component extension
The five-module structure — ingestion, indexing, retriever, generation, app — means each component can be extended or replaced without touching the others. Local embedding models can replace OpenAI embeddings by changing only embed_store.py. A re-ranking layer can be added by modifying only query_retriever.py. Source citations can be added by modifying only generate_answer.py. The architecture is built for extension from the ground up.