Build a Production‑Ready RAG API with FastAPI + FAISS (Step‑by‑Step)
What You’ll Build
We’ll create a production‑ready FastAPI FAISS RAG API that:
- Ingests documents, splits them into chunks, and creates dense embeddings.
- Indexes vectors with FAISS (cosine similarity via inner product).
- Exposes a /v1/query endpoint that retrieves top‑K chunks and (optionally) generates an answer using an LLM.
- Adds production touches: health checks, metrics, rate limits, API key auth, structured logging, and Docker.

Why FastAPI + FAISS for RAG
- FastAPI gives you asynchronous performance, pydantic models, and an ergonomic developer experience.
- FAISS is a battle‑tested vector similarity search library: fast, memory‑efficient, with multiple index types (Flat, IVF, HNSW, PQ) for different data sizes and latency targets.
For this tutorial we’ll stick to a normalized embeddings + inner product (cosine) setup using a Flat index (simple and strong baseline), and note where advanced options fit.
Architecture Overview
Flow:
- Ingestion: Load files → clean text → split into overlapping chunks.
- Embedding: Encode chunks into vectors (e.g., all-MiniLM-L6-v2).
- Indexing: Add vectors to FAISS; persist the index to disk.
- Metadata store: Map vector IDs → {text, source, chunk_id, …}.
- API: /v1/query → embed query → FAISS top‑K → optional LLM generation with retrieved context.
- Ops: /healthz, /readyz, /metrics, logging, rate limits, API keys.

Project Setup
fastapi-faiss-rag/
├─ app/
│ ├─ __init__.py
│ ├─ main.py # FastAPI app, endpoints
│ ├─ retrieval.py # FAISS search + embed
│ ├─ generator.py # LLM call + extractive fallback
│ ├─ config.py # settings, env
│ ├─ models.py # pydantic request/response
│ └─ storage.py # metadata store helpers
├─ indices/
│ ├─ faiss.index
│ └─ meta.jsonl
├─ data/
│ └─ docs/ # your .txt/.md sources
├─ scripts/
│ └─ ingest.py # build index from data/docs
├─ requirements.txt
├─ .env.example
└─ Dockerfile
requirements.txt
fastapi
uvicorn[standard]
sentence-transformers
faiss-cpu
numpy
pydantic
python-dotenv
prometheus-fastapi-instrumentator
slowapi
orjson
filelock
.env.example
OPENAI_API_KEY=
OPENAI_MODEL=gpt-4o-mini
INDEX_PATH=indices/faiss.index
META_PATH=indices/meta.jsonl
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
Data Ingestion & Chunking
Efficient chunking boosts retrieval quality. Start simple: split on paragraphs with overlap to preserve context continuity.
# scripts/ingest.py
import os, json, glob
from pathlib import Path
from typing import List, Dict
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
INDEX_PATH = os.getenv("INDEX_PATH", "indices/faiss.index")
META_PATH = os.getenv("META_PATH", "indices/meta.jsonl")
EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
DOCS_DIR = "data/docs"
def read_docs() -> List[Dict]:
    docs = []
    for path in glob.glob(f"{DOCS_DIR}/**/*", recursive=True):
        if not os.path.isfile(path):
            continue
        if not (path.endswith(".txt") or path.endswith(".md")):
            continue
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        docs.append({"source": path, "text": text})
    return docs
def chunk_text(text: str, max_chars=800, overlap=120) -> List[str]:
    # simple, whitespace-preserving chunker
    paras = text.split("\n\n")
    chunks, buff = [], ""
    for para in paras:
        if len(buff) + len(para) + 2 <= max_chars:
            buff += (("\n\n" if buff else "") + para)
        else:
            if buff:
                chunks.append(buff)
                # overlap: take the tail of buff
                tail = buff[-overlap:]
                buff = tail + "\n\n" + para
            else:
                chunks.append(para[:max_chars])
                buff = para[max(0, len(para) - overlap):]
    if buff:
        chunks.append(buff)
    return [c.strip() for c in chunks if c.strip()]
def main():
    os.makedirs("indices", exist_ok=True)
    docs = read_docs()
    if not docs:
        raise SystemExit("No .txt/.md files found in data/docs")
    model = SentenceTransformer(EMBED_MODEL)
    dim = model.get_sentence_embedding_dimension()
    index = faiss.IndexFlatIP(dim)  # inner product (use normalized vectors)
    meta = []
    vectors = []
    for d in docs:
        for i, chunk in enumerate(chunk_text(d["text"])):
            emb = model.encode(chunk, normalize_embeddings=True)
            vectors.append(emb.astype("float32"))
            meta.append({"text": chunk, "source": d["source"], "chunk_id": i})
    mat = np.vstack(vectors).astype("float32")
    index.add(mat)
    faiss.write_index(index, INDEX_PATH)
    with open(META_PATH, "w", encoding="utf-8") as f:
        for m in meta:
            f.write(json.dumps(m, ensure_ascii=False) + "\n")
    print(f"Saved {len(meta)} chunks to {INDEX_PATH} and {META_PATH}")

if __name__ == "__main__":
    main()
Notes:
- Use normalized embeddings so inner product ≈ cosine similarity.
- For large corpora, switch to IVF or HNSW indexes and/or PQ compression (see the sketch after these notes).
- Keep a separate metadata store (we’ll use JSONL here; SQLite is a great upgrade).
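As flagged in the second note, IVF is the usual first step up from a Flat index once the corpus grows. Here is a minimal sketch, assuming the same normalized float32 matrix mat built in scripts/ingest.py; the nlist and nprobe values are illustrative and should be tuned for your corpus.
# Hypothetical IVF variant of the index built in scripts/ingest.py.
import faiss
import numpy as np

def build_ivf_index(mat: np.ndarray, nlist: int = 256) -> faiss.Index:
    dim = mat.shape[1]
    quantizer = faiss.IndexFlatIP(dim)  # coarse quantizer over inner product
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(mat)                    # IVF indexes must be trained before adding vectors
    index.add(mat)
    index.nprobe = 16                   # lists probed per query: recall vs. latency trade-off
    return index
Searches go through the same index.search() call as before; only nprobe needs tuning.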
Vectorization & FAISS Indexing
We’ve chosen all-MiniLM-L6-v2 for a strong, lightweight baseline. Production teams often:
- Self‑host a higher‑dimensional model for better recall, or
- Use a managed embedding API and cache vectors locally.
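Whichever route you take, caching embeddings by content hash avoids re-encoding unchanged chunks. Below is a minimal sketch with an on-disk pickle cache; the cache path and the embed_fn callable are placeholders you would wire to your own model or API.
# Hypothetical embedding cache keyed by a SHA-256 of the chunk text.
import hashlib
import pickle
from pathlib import Path
from typing import Callable, Dict, List

import numpy as np

CACHE_PATH = Path("indices/embed_cache.pkl")  # assumption: cache lives next to the index

def _load_cache() -> Dict[str, np.ndarray]:
    if CACHE_PATH.exists():
        return pickle.loads(CACHE_PATH.read_bytes())
    return {}

def embed_with_cache(texts: List[str], embed_fn: Callable[[List[str]], np.ndarray]) -> np.ndarray:
    cache = _load_cache()
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        fresh = embed_fn(missing)  # embed only the uncached chunks
        for t, vec in zip(missing, fresh):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = np.asarray(vec, dtype="float32")
        CACHE_PATH.write_bytes(pickle.dumps(cache))
    return np.vstack([cache[k] for k in keys])
Here embed_fn could be as simple as lambda batch: model.encode(batch, normalize_embeddings=True).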

The FastAPI Service
Let’s implement a clean, typed API.
# app/models.py
from pydantic import BaseModel, Field
from typing import List, Dict, Any

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2)
    top_k: int = Field(5, ge=1, le=20)

class Source(BaseModel):
    text: str
    score: float
    metadata: Dict[str, Any]

class QueryResponse(BaseModel):
    answer: str
    sources: List[Source]
# app/config.py
import os

class Settings:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    INDEX_PATH = os.getenv("INDEX_PATH", "indices/faiss.index")
    META_PATH = os.getenv("META_PATH", "indices/meta.jsonl")
    EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
    API_KEY = os.getenv("RAG_API_KEY", "")  # optional API key for your service

settings = Settings()
# app/storage.py
import json
from typing import List, Dict

def load_meta(meta_path: str) -> List[Dict]:
    rows = []
    with open(meta_path, "r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
    return rows
# app/retrieval.py
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any

class Retriever:
    def __init__(self, index_path: str, meta: List[Dict], embed_model: str):
        self.meta = meta
        self.model = SentenceTransformer(embed_model, device="cpu")
        self.dim = self.model.get_sentence_embedding_dimension()
        self.index = faiss.read_index(index_path)

    def _embed(self, texts: List[str]) -> np.ndarray:
        vecs = self.model.encode(texts, batch_size=64, normalize_embeddings=True)
        return np.array(vecs, dtype="float32")

    def search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        qv = self._embed([query])
        scores, ids = self.index.search(qv, k)
        results = []
        for score, idx in zip(scores[0], ids[0]):
            if idx == -1:
                continue
            m = self.meta[idx]
            results.append({
                "text": m["text"],
                "score": float(score),
                "metadata": {key: v for key, v in m.items() if key != "text"}
            })
        return results
# app/generator.py
import os
from typing import List, Dict

def _extractive_fallback(query: str, passages: List[Dict]) -> str:
    # Simple extractive fallback: join the best few chunks
    joined = "\n\n".join(p["text"] for p in passages[:3])
    return (
        "Below is a synthesized answer using retrieved context (no LLM configured).\n\n"
        f"Query: {query}\n\n"
        f"Context:\n{joined}\n\n"
        "Answer: Summarize key points from the context above."
    )

def generate_answer(query: str, passages: List[Dict]) -> str:
    api_key = os.getenv("OPENAI_API_KEY", "")
    if not api_key:
        return _extractive_fallback(query, passages)
    try:
        # Optional: Use OpenAI if installed and the key is set.
        # This snippet uses Chat Completions-style APIs.
        from openai import OpenAI
        client = OpenAI(api_key=api_key)
        context = "\n\n".join(
            f"Source: {p['metadata'].get('source', '')} | Score: {p['score']:.3f}\n{p['text']}"
            for p in passages
        )
        prompt = (
            "You are a RAG assistant. Answer strictly from the provided context. "
            "Cite sources by filename when relevant.\n\n"
            f"Context:\n{context}\n\n"
            f"User question: {query}\n\n"
            "Answer:"
        )
        resp = client.chat.completions.create(
            model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        return resp.choices[0].message.content.strip()
    except Exception:
        return _extractive_fallback(query, passages)
# app/main.py
from fastapi import FastAPI, Depends, HTTPException, Request, Header
from fastapi.responses import ORJSONResponse
from prometheus_fastapi_instrumentator import Instrumentator
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware

from .models import QueryRequest, QueryResponse, Source
from .retrieval import Retriever
from .generator import generate_answer
from .storage import load_meta
from .config import settings

app = FastAPI(
    title="FastAPI FAISS RAG API",
    default_response_class=ORJSONResponse,
    version="1.0.0",
)

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(
    RateLimitExceeded,
    lambda r, e: ORJSONResponse({"detail": "Rate limit exceeded"}, status_code=429),
)
app.add_middleware(SlowAPIMiddleware)

def require_api_key(x_api_key: str = Header(default="")):
    if settings.API_KEY and x_api_key != settings.API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.on_event("startup")
async def startup():
    meta = load_meta(settings.META_PATH)
    app.state.retriever = Retriever(settings.INDEX_PATH, meta, settings.EMBED_MODEL)
    Instrumentator().instrument(app).expose(app)

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}

@app.get("/readyz")
async def readyz():
    # Add checks: index loaded, metadata count > 0, etc.
    ready = hasattr(app.state, "retriever")
    return {"ready": bool(ready)}

@app.post("/v1/query", response_model=QueryResponse)
@limiter.limit("60/minute")
async def query(req: QueryRequest, request: Request, _: None = Depends(require_api_key)):
    retriever: Retriever = request.app.state.retriever
    hits = retriever.search(req.query, k=req.top_k)
    answer = generate_answer(req.query, hits)
    return QueryResponse(answer=answer, sources=hits)
CORS: If you’ll call this from a browser, add FastAPI’s CORS middleware with allowed origins.
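A minimal sketch of that CORS setup, added to app/main.py after the app is created; the allowed origin is a placeholder you would replace with your frontend’s domain.
# Hypothetical CORS configuration for browser clients (add to app/main.py).
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["Content-Type", "X-API-Key"],
)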
Search & Retrieval Logic
- We normalize vectors to ensure inner product equals cosine similarity.
- Start with k=5. Tune per corpus size and typical query intent.
- Consider boosting recency or source authority by re‑ranking (e.g., multiply score by a metadata weight).
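As a concrete example of that last point, here is a hedged sketch of a post-retrieval re-ranker that multiplies each FAISS score by a per-source authority weight; the weights dict is an assumption you would derive from your own metadata.
# Hypothetical re-ranking step applied to Retriever.search() results.
from typing import Any, Dict, List

SOURCE_WEIGHTS: Dict[str, float] = {        # assumed authority weights per source path
    "data/docs/handbook.md": 1.2,
    "data/docs/old_notes.txt": 0.8,
}

def rerank(hits: List[Dict[str, Any]], default_weight: float = 1.0) -> List[Dict[str, Any]]:
    for h in hits:
        weight = SOURCE_WEIGHTS.get(h["metadata"].get("source", ""), default_weight)
        h["score"] = h["score"] * weight    # boost or demote by source authority
    return sorted(hits, key=lambda h: h["score"], reverse=True)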
Generation: LLM Integration & Fallbacks
In production, a deterministic fallback is essential. The generator.py above:
- Uses an OpenAI call when OPENAI_API_KEY is set.
- Falls back to a safe extractive answer if the LLM is unavailable.
Prompts:
- Keep the system prompt concise: “Answer strictly from context. If unsure, say you don’t know.”
- Provide sources in the prompt for lightweight citation.
Alternatives:
- Swap in any provider (Azure, Anthropic, local vLLM).
- Wrap calls with timeouts and retries; cache successful responses.
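A minimal sketch of the timeout-and-retry idea, wrapping any generation callable; the retry count, timeout, and backoff values are illustrative, not recommendations.
# Hypothetical retry/timeout wrapper around a generation callable.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def generate_with_retries(call: Callable[[], str], retries: int = 2,
                          timeout_s: float = 15.0, backoff_s: float = 1.0) -> str:
    last_error = None
    for attempt in range(retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(call).result(timeout=timeout_s)  # hard per-attempt timeout
        except Exception as exc:                                # timeout or provider error
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))               # simple linear backoff
        finally:
            pool.shutdown(wait=False)
    raise RuntimeError("LLM generation failed after retries") from last_error
In main.py you could call generate_with_retries(lambda: generate_answer(req.query, hits)) and return the extractive fallback if it raises.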
Observability, Security & Rate Limits
- Health & readiness: /healthz and /readyz.
- Metrics: prometheus-fastapi-instrumentator exposes /metrics for Prometheus.
- Rate limiting: slowapi with 60/minute per IP (tune as needed).
- API keys: Simple header‑based auth via X-API-Key.
- Logging: Prefer structured logs (JSON). Include a request_id for tracing.
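A hedged sketch of JSON request logs with a per-request request_id, implemented as a small FastAPI middleware added to app/main.py; the logger name and log fields are illustrative.
# Hypothetical structured-logging middleware (add to app/main.py after `app` is created).
import json
import logging
import time
import uuid

from fastapi import Request

logger = logging.getLogger("rag_api")
logging.basicConfig(level=logging.INFO, format="%(message)s")  # emit raw JSON lines

@app.middleware("http")
async def log_requests(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id  # echo the id so clients can correlate
    logger.info(json.dumps({
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "duration_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return response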
Privacy & Compliance
- Store only what you need. Mask PII in logs.
- If using third‑party LLMs, review data retention and opt‑out flags.
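One way to act on the PII point: a logging filter that redacts email addresses before records are written. A minimal sketch; the regex only covers emails, and what counts as PII in your logs is an assumption you should revisit.
# Hypothetical logging filter that masks email addresses in log messages.
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class PIIMaskFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[redacted-email]", str(record.msg))
        return True  # never drop records, only rewrite them

logging.getLogger("rag_api").addFilter(PIIMaskFilter())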
Testing the API
Smoke test with curl:
export RAG_API_KEY=dev-key
uvicorn app.main:app --reload
curl -X POST http://localhost:8000/v1/query \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{"query":"What does the architecture look like?","top_k":3}'
Pytest sketch:
# tests/test_query.py
from fastapi.testclient import TestClient
from app.main import app

def test_healthz():
    c = TestClient(app)
    r = c.get("/healthz")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"
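To exercise /v1/query without a real FAISS index on disk, you can stub the retriever on app.state. A hedged sketch, assuming RAG_API_KEY and OPENAI_API_KEY are not set in the test environment so the auth check passes and the extractive fallback runs; otherwise pass the matching X-API-Key header.
# tests/test_query_stub.py -- hypothetical test with a stubbed retriever.
from fastapi.testclient import TestClient
from app.main import app

class FakeRetriever:
    def search(self, query, k=5):
        return [{
            "text": "The API exposes /v1/query for retrieval.",
            "score": 0.91,
            "metadata": {"source": "data/docs/example.md", "chunk_id": 0},
        }]

def test_query_returns_sources():
    app.state.retriever = FakeRetriever()  # bypass startup loading of the real index
    c = TestClient(app)
    r = c.post("/v1/query", json={"query": "What does the architecture look like?", "top_k": 3})
    assert r.status_code == 200
    body = r.json()
    assert body["sources"] and body["answer"]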

Dockerization & Deployment
Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app app
COPY indices indices
COPY .env.example .env
ENV PORT=8000
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
Build & run:
docker build -t fastapi-faiss-rag:latest .
docker run -p 8000:8000 \
-e RAG_API_KEY=prod-key \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
fastapi-faiss-rag:latest
Deployment targets
- Single VM: Docker + systemd, easy and cost‑effective.
- Kubernetes: Add liveness/readiness probes; configure HPA on CPU/memory.
- Serverless: Possible for light loads; cold starts and native FAISS builds can be tricky.
Scaling & Hardening
- Bigger corpora: Use IVF or HNSW FAISS indexes; train with a representative sample.
- Compression: PQ or OPQ to shrink memory while preserving recall.
- Sharding: Split by domain or time; merge results at query time.
- Metadata store: Move from JSONL to SQLite/Postgres; add indexes on source and timestamp.
- Async ingestion: Queue new documents (e.g., Celery or RQ).
- Hot reload: Maintain two index files and atomically flip a symlink when a new index is ready (see the sketch after this list).
- Security: Mutual TLS for internal traffic; WAF/CDN at the edge; rotate API keys.
- Cost control: Cache embeddings and generation results; use smaller models for routine questions.
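A minimal sketch of the hot-reload idea from the list above: build the new index to a versioned file, then atomically repoint a symlink that the API uses as its INDEX_PATH and reopen the index afterwards. Paths and link names are illustrative.
# Hypothetical atomic index swap: stage a symlink to the new file, then rename it over the live one.
import os

def swap_index(new_index_path: str, live_link: str = "indices/faiss.index.current") -> None:
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.abspath(new_index_path), tmp_link)  # stage the new target
    os.replace(tmp_link, live_link)                        # atomic rename on POSIX filesystems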
Common Pitfalls
- Mismatched dimensions: FAISS index dimension must match your embedding model.
- No normalization: For cosine similarity with inner product, normalize vectors at both indexing and query time.
- Over‑chunking: Tiny chunks ≠ better answers. Start at ~600–900 chars with ~100 overlap.
- Unbounded context: Limit how many chunks you feed to the LLM; tune for the model’s context window.
- Missing fallbacks: Always have a deterministic path when LLM calls fail.
- Lack of evals: Build a small eval set (queries + gold answers) and measure hit rate + answer quality regularly.
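For that last point, even a tiny harness helps. Below is a hedged sketch that measures retrieval hit rate against a hand-written eval set; the evals.jsonl format is an assumption (one JSON object per line with a query and the source file that should be retrieved).
# Hypothetical retrieval eval: share of queries whose expected source appears in the top-k hits.
# Assumes evals.jsonl lines like {"query": "...", "expected_source": "data/docs/handbook.md"}.
import json

from app.config import settings
from app.retrieval import Retriever
from app.storage import load_meta

def hit_rate(eval_path: str = "evals.jsonl", k: int = 5) -> float:
    retriever = Retriever(settings.INDEX_PATH, load_meta(settings.META_PATH), settings.EMBED_MODEL)
    with open(eval_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    hits = 0
    for row in rows:
        results = retriever.search(row["query"], k=k)
        if any(r["metadata"].get("source") == row["expected_source"] for r in results):
            hits += 1
    return hits / len(rows) if rows else 0.0

if __name__ == "__main__":
    print(f"hit@{5}: {hit_rate():.2%}")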
FAQ
How large can a FastAPI FAISS RAG API grow before IVF/HNSW is required?
Flat indexes do well up to low‑millions of vectors on a single machine with enough RAM. Beyond that, move to IVF/HNSW and consider PQ compression to fit memory budgets.
Can I use GPUs?
Yes. FAISS has GPU support. For strictly CPU deployments, choose efficient embedding models and consider quantization.
How do I add PDFs or web pages?
Convert to text during ingestion (e.g., pypdf, trafilatura). Keep the conversion offline in your ingestion script so API servers stay lean.
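A hedged sketch of that PDF step for scripts/ingest.py, using pypdf; page handling is simplified and the library choice is just one option.
# Hypothetical PDF loader for scripts/ingest.py (requires `pip install pypdf`).
from pypdf import PdfReader

def read_pdf(path: str) -> str:
    reader = PdfReader(path)
    # Join page texts; extract_text() can return None for image-only pages.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)
read_docs() could then append {"source": path, "text": read_pdf(path)} for *.pdf files.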
How do I cite sources?
Store source and chunk_id in metadata and instruct the LLM to cite by filename or URL. You can also include line numbers if you pre‑compute them.
Conclusion
You now have a FastAPI FAISS RAG API that is not just a demo, but designed with production guardrails: clean contracts, health checks, metrics, rate limits, API keys, Docker packaging, and a safe generation fallback. From here, iterate on index types, re‑ranking, prompting, and evaluation to continuously improve answer quality and reliability.
