RAG App: Hybrid Retrieval Question Answering
A production-style Retrieval-Augmented Generation system that ingests PDF documents, indexes them with BM25 lexical search and FAISS dense vector search, and serves grounded answers through a FastAPI /ask endpoint powered by the OpenAI API.
Hybrid RAG
BM25 Lexical + FAISS Semantic Retrieval
FastAPI
Production /ask Endpoint with Swagger UI
2-Stage
Retrieval Pipeline — Hybrid Search then RRF Fusion
The Problem
Organizations have vast knowledge locked in PDF documents — but no way to query it programmatically with grounded, accurate answers
Reports, manuals, policies, contracts, and research documents contain enormous amounts of organizational knowledge — but that knowledge is locked in static files that cannot be searched semantically, cannot answer follow-up questions, and cannot be integrated into downstream applications without manual extraction. Generic LLMs can answer questions, but without grounding in specific document content they hallucinate — producing confident but factually incorrect responses that cannot be trusted for operational or decision-making use. The gap is between the knowledge that exists in documents and the ability to query it accurately, programmatically, and at scale — without requiring a human to read every file.
The Solution
A hybrid RAG system that indexes PDFs with BM25 and FAISS and serves grounded answers through a production FastAPI endpoint
The RAG App is a production-style Retrieval-Augmented Generation system that transforms a folder of PDF documents into a queryable knowledge base accessible through a single API endpoint. PDFs are extracted via PyPDF, cleaned, and chunked into retrieval-ready passages. Each chunk is indexed twice — with BM25 via Whoosh for lexical keyword matching, and with FAISS for dense semantic vector search. When a question arrives at the FastAPI /ask endpoint, it is first expanded into multiple query variants (paraphrases plus a HyDE hypothetical answer); BM25 and FAISS searches then run for every variant, and the resulting ranked lists are fused with Reciprocal Rank Fusion before the top passages are passed to the OpenAI API alongside the original question. The LLM generates a grounded answer based only on the retrieved passages — constraining responses to what actually exists in the documents. Adding new documents to the knowledge base requires only dropping PDFs into the data folder and running a single index rebuild script.
Key Outcome
A production-ready RAG system that unlocks the knowledge trapped in organizational PDFs — combining BM25 lexical search and FAISS semantic search in a hybrid retrieve-then-fuse pipeline, and serving grounded, document-constrained answers through a FastAPI /ask endpoint that any downstream application can integrate with immediately.
Technical Deep Dive
Architecture & Design
RAG Pipeline
Stage 1 — PDF Ingestion & Chunking
Step 1
PDF Extraction
extract_text.py · PyPDF reads all PDFs from data/pdfs/
Step 2
Cleaning
Whitespace normalized · PDF artifacts removed
Step 3
Chunking
chunk_text.py · Overlapping passages for retrieval continuity
Stage 2 — Dual Index Construction · pipeline_runner.py
Index A · Lexical
BM25 via Whoosh
bm25_index.py · Stored in vectorstores/bm25_index/
Index B · Semantic
FAISS Dense Vector Search
embed_store.py · Stored in vectorstores/faiss_topic/
Stage 3 — Query Expansion · Multi-Query + HyDE
User Question · POST /ask
FastAPI /ask Endpoint — main_app.py
JSON payload · Swagger UI at /docs · ReDoc at /redoc
Original
Original Query
Always kept in candidate set
Rewrites 1–3
Query Paraphrases
GPT-4o-mini · N=3 diverse variants
HyDE
Hypothetical Answer
GPT-4o-mini · 2–3 sentence mini-answer as semantic probe
Stage 4 — Per-Variant Hybrid Retrieval · query_retriever.py
BM25 Search · per variant
Lexical Retrieval
Top-60 results per query variant · Exact term matching
FAISS Search · per variant
Semantic Retrieval
Top-60 results per query variant · Dense similarity search
Stage 5 — Reciprocal Rank Fusion (RRF)
RRF Fusion · all ranked lists
Score Fusion & Re-ranking
All BM25 + FAISS lists across all query variants fused via RRF · Deduplication · Final top-5 selected
Stage 6 — Grounded Answer Generation · generate_answer.py
Generation · OpenAI API
Grounded Answer Synthesis
Top-5 fused passages + question → OpenAI · LLM constrained to document content · Grounded response returned
API Response
JSON Answer at /ask
Grounded answer · Constrained to PDF knowledge base · No hallucinated content
Stage 1
PDF Ingestion & Chunking
extract_text.py reads all PDFs from data/pdfs/ using PyPDF, producing clean plain text. chunk_text.py splits the extracted text into overlapping passages — overlapping windows ensure that retrieval does not miss content that spans a chunk boundary. The full ingestion pipeline is triggered by a single pipeline_runner.py execution and can be rerun whenever new PDFs are added to the data folder.
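The overlap logic can be sketched in a few lines — the chunk size and overlap values here are illustrative, not necessarily the ones chunk_text.py actually uses:

```python
def chunk_text(text, chunk_size=800, overlap=200):
    """Split text into overlapping character windows.

    Each chunk shares `overlap` characters with its neighbor, so content
    that straddles a boundary appears whole in at least one chunk.
    """
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), step)]
    # Drop a trailing fragment that is fully contained in the previous chunk
    if len(chunks) > 1 and len(chunks[-1]) <= overlap:
        chunks.pop()
    return chunks
```

Tuning is a recall/precision trade-off: larger overlap means more redundancy in the index but fewer boundary misses.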
Stage 2
Dual Index Construction
bm25_index.py builds a BM25 keyword index over the chunked text using Whoosh, stored in vectorstores/bm25_index/. embed_store.py generates dense embeddings using OpenAI's text-embedding-3-small model and builds a FAISS vector index stored in vectorstores/faiss_topic/. Both indexes are built in a single pipeline_runner.py execution — ensuring they are always synchronized to the same document set.
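Whoosh computes BM25 scores internally; as a minimal illustration of the scoring formula itself (Okapi BM25 — this is not the Whoosh API), a toy scorer over pre-tokenized documents might look like:

```python
import math
from collections import Counter

def bm25_scores(query_terms, docs, k1=1.5, b=0.75):
    """Score each tokenized doc against query terms with Okapi BM25.

    Rewards rare terms (IDF), saturates repeated terms (k1), and
    normalizes for document length (b).
    """
    N = len(docs)
    avgdl = sum(len(d) for d in docs) / N
    # Document frequency of each query term across the corpus
    df = {t: sum(1 for d in docs if t in d) for t in query_terms}
    scores = []
    for doc in docs:
        tf = Counter(doc)
        score = 0.0
        for t in query_terms:
            if tf[t] == 0:
                continue
            idf = math.log(1 + (N - df[t] + 0.5) / (df[t] + 0.5))
            score += idf * tf[t] * (k1 + 1) / (
                tf[t] + k1 * (1 - b + b * len(doc) / avgdl))
        scores.append(score)
    return scores
```

This is why the lexical side excels at named entities and figures: a document must literally contain the query terms to score at all.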
Stage 3
Multi-Query Expansion + HyDE
Before any retrieval runs, the incoming question is expanded into a candidate set of five query variants. Multi-query rewriting with GPT-4o-mini generates three diverse paraphrases that cover different ways the same question might be expressed in the document corpus. HyDE (Hypothetical Document Embeddings) generates a 2–3 sentence factual mini-answer that acts as a semantic probe — searching for passages that match what a good answer would look like rather than just what the question asks. The original query is always kept in the candidate set.
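The expansion step can be sketched as plain prompt builders — the actual prompts in the repository are not shown here, so the wording below is an illustrative assumption (the chat-message shape is the standard OpenAI format):

```python
def multiquery_messages(question, n=3):
    """Chat messages asking GPT-4o-mini for n diverse paraphrases."""
    return [
        {"role": "system", "content": (
            f"Rewrite the user's question in {n} different ways, one per "
            "line, preserving its meaning but varying the vocabulary.")},
        {"role": "user", "content": question},
    ]

def hyde_messages(question):
    """Chat messages asking for a short hypothetical answer (HyDE probe)."""
    return [
        {"role": "system", "content": (
            "Write a plausible 2-3 sentence factual answer to the question. "
            "It will be used only as a semantic search probe.")},
        {"role": "user", "content": question},
    ]

def build_variants(question, paraphrases, hyde_answer):
    """Candidate set: the original query is always kept, per the design."""
    return [question, *paraphrases, hyde_answer]
```

Sending each message list through the chat completions endpoint yields the three paraphrases and the HyDE probe that feed the retrieval stage.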
Stage 4
Per-Variant Hybrid Retrieval
BM25 and FAISS searches run independently for every query variant — the original, all three paraphrases, and the HyDE text. Each search returns up to 60 results, producing up to 10 ranked lists (5 variants × 2 retrieval methods) that together cover a much broader surface of the document space than any single query could reach. All result lists are passed to the RRF fusion stage.
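The per-variant fan-out reduces to a simple loop; `bm25_search` and `faiss_search` below are stand-ins for the project's real retrievers, assumed to take a query and a result count and return doc ids best-first:

```python
def hybrid_retrieve(variants, bm25_search, faiss_search, k=60):
    """Run both retrievers for every query variant.

    Returns up to 2 * len(variants) ranked lists (best-first doc ids),
    which are handed off to RRF fusion.
    """
    ranked_lists = []
    for variant in variants:
        ranked_lists.append(bm25_search(variant, k))
        ranked_lists.append(faiss_search(variant, k))
    return ranked_lists
```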
Stage 5
Reciprocal Rank Fusion (RRF)
All ranked lists from Stage 4 are fused using Reciprocal Rank Fusion — a rank-based fusion algorithm that assigns each document a score of 1/(k + rank) across every list it appears in and sums them. Documents that rank highly across multiple query variants and retrieval methods receive the highest fused scores. After deduplication, the top-5 documents by fused score are selected as the final context for generation.
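RRF itself is a few lines; `k=60` is the conventional constant from the original RRF formulation and may differ from the project's setting:

```python
from collections import defaultdict

def rrf_fuse(ranked_lists, k=60, top_n=5):
    """Reciprocal Rank Fusion: each doc's score is the sum of 1/(k + rank)
    over every list it appears in (ranks are 1-based).

    Deduplication falls out naturally, since scores are accumulated
    per doc id across all lists.
    """
    scores = defaultdict(float)
    for ranked in ranked_lists:
        for rank, doc_id in enumerate(ranked, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    fused = sorted(scores, key=scores.get, reverse=True)
    return fused[:top_n]
```

Because only ranks enter the formula, BM25 and FAISS lists can be fused directly despite their incomparable raw scores.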
Stage 6
Grounded Answer Generation
generate_answer.py passes the top-5 RRF-fused passages alongside the original question to the OpenAI API. The LLM generates a response grounded in the retrieved passages — constrained to what actually exists in the document knowledge base. Answers are returned as JSON through the FastAPI /ask endpoint, with full Swagger UI and ReDoc documentation at /docs and /redoc.
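A hedged sketch of the grounding step — the exact instructions in generate_answer.py are not shown, so this prompt wording is illustrative of the constraint, not a copy of it:

```python
def grounded_prompt(question, passages):
    """Assemble chat messages that confine the model to retrieved passages."""
    context = "\n\n".join(f"[{i + 1}] {p}" for i, p in enumerate(passages))
    system = (
        "Answer using ONLY the context passages below. If the context does "
        "not contain the answer, say you don't know. Do not use outside "
        "knowledge.\n\nContext:\n" + context
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": question},
    ]
```

The numbered passage markers also make it straightforward to extend the module with source citations later.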
Key Design Decisions
Multi-query rewriting expands retrieval coverage across vocabulary gaps
A single query formulation may not match the vocabulary used in the document — a question about "financial position" may not retrieve passages that use "total assets" or "balance sheet." By generating three diverse paraphrases of every question using GPT-4o-mini before retrieval runs, the system searches the index from multiple angles simultaneously — dramatically increasing the probability that relevant passages are retrieved regardless of how the document author chose to express the same concept.
HyDE retrieves by answer shape — not question shape
Standard semantic search embeds the question and finds passages that are semantically similar to the question — but relevant document passages are often more similar to answers than to questions. HyDE generates a hypothetical 2–3 sentence answer and uses it as the embedding query, retrieving passages that match what a correct answer would look like rather than what the question looks like. This consistently surfaces relevant passages that question-based retrieval misses, particularly for factual and numerical queries.
RRF fuses ranked lists without requiring score calibration
BM25 scores and FAISS cosine similarity scores are not on the same scale — direct score combination would require careful normalization that depends on the corpus and query distribution. RRF bypasses this entirely by operating only on ranks, not scores. A document that ranks 3rd in BM25 and 5th in FAISS receives a well-defined fused score regardless of what the raw scores were. This makes RRF robust across different retrieval methods, query types, and document collections without any calibration overhead.
Overlapping chunks prevent boundary retrieval failures
Fixed-length non-overlapping chunks create invisible boundaries where relevant content is split between two adjacent chunks — neither of which contains enough context to be retrieved as relevant. Overlapping windows ensure that content near a boundary appears intact in at least one chunk, eliminating the boundary retrieval failure mode that degrades answer quality on questions that span paragraph transitions.
Tech Stack
| Technology | Purpose |
|---|---|
| FastAPI + Uvicorn | Web API serving the /ask endpoint with Swagger UI and ReDoc |
| FAISS | Dense vector index for semantic similarity search per query variant |
| BM25 / Whoosh | Lexical keyword index for exact term matching per query variant |
| OpenAI API (GPT-4o-mini) | Multi-query rewriting, HyDE generation, and grounded answer synthesis |
| OpenAI Embeddings (text-embedding-3-small) | Dense vector embeddings for FAISS index construction and query embedding |
| Reciprocal Rank Fusion (RRF) | Rank-based fusion of all BM25 + FAISS result lists across all query variants |
| LangChain (FAISS + OpenAIEmbeddings) | Vector store management and embedding model integration |
| PyPDF | PDF text extraction from document knowledge base |
| pipeline_runner.py | Single-command index rebuild — extract, chunk, embed, build BM25 + FAISS |
| Python | Core language and module orchestration |
Results & Metrics
What the system delivers
Hybrid
RAG Architecture
BM25 lexical + FAISS semantic retrieval running in parallel — results fused via RRF before generation
FastAPI
Production /ask Endpoint
Swagger UI + ReDoc documentation · Immediately consumable by any downstream application
2-Stage
Retrieval Pipeline
Hybrid search then RRF fusion — capturing both exact matches and conceptual relevance in every query
Drop PDFs and rebuild — zero configuration required
Adding new documents to the knowledge base requires only placing PDFs in data/pdfs/ and running python pipeline_runner.py. The pipeline extracts, cleans, chunks, and rebuilds both the BM25 and FAISS indexes automatically — no schema changes, no database migrations, no manual index management. The /ask endpoint immediately reflects the updated knowledge base on the next request.
Hybrid retrieval handles both exact and conceptual queries
BM25 retrieves passages containing the exact terms in the question — critical for named entities, figures, dates, and technical terminology. FAISS retrieves passages that are semantically similar even when vocabulary differs. By merging both result sets, the system correctly handles questions like "what are the total assets in 2022?" (lexical) and "what is the financial position of the organization?" (semantic) with equal reliability.
Grounded answers constrained to document content
By passing only retrieved document passages to the OpenAI model — rather than asking it to answer from general knowledge — the system constrains every answer to content that actually exists in the knowledge base. This eliminates hallucination for document-specific queries, making the system trustworthy for operational use cases where accuracy against specific documents is required.
API-first design — plugs into any downstream application
The FastAPI /ask endpoint accepts a standard JSON POST request and returns a JSON answer — making the knowledge base queryable by any application that can make an HTTP request. The auto-generated Swagger UI at /docs and ReDoc at /redoc provide interactive documentation without any additional setup. The system is immediately integrable into dashboards, chatbots, internal tools, or custom front-ends.
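A minimal stdlib client sketch — the request field name `question` and the default port are assumptions, since the app's Pydantic schema is not shown:

```python
import json
from urllib import request

def build_ask_request(question, base_url="http://localhost:8000"):
    """Construct the JSON POST request for the /ask endpoint."""
    payload = json.dumps({"question": question}).encode("utf-8")
    return request.Request(
        f"{base_url}/ask",
        data=payload,
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def ask(question, base_url="http://localhost:8000"):
    """POST a question and return the parsed JSON answer."""
    with request.urlopen(build_ask_request(question, base_url)) as resp:
        return json.loads(resp.read())
```

Any HTTP-capable client works equally well; the interactive Swagger UI at /docs shows the exact schema to target.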
Modular architecture enables independent component extension
The five-module structure — ingestion, indexing, retriever, generation, app — means each component can be extended or replaced without touching the others. Local embedding models can replace OpenAI embeddings by changing only embed_store.py. A re-ranking layer can be added by modifying only query_retriever.py. Source citations can be added by modifying only generate_answer.py. The architecture is built for extension from the ground up.