Build a Production‑Ready RAG API with FastAPI + FAISS (Step‑by‑Step)
What You’ll Build
We’ll create a production‑ready FastAPI FAISS RAG API that:
- Ingests documents, splits them into chunks, and creates dense embeddings.
- Indexes vectors with FAISS (cosine similarity via inner product).
- Exposes a /v1/query endpoint that retrieves top‑K chunks and (optionally) generates an answer using an LLM.
- Adds production touches: health checks, metrics, rate limits, API key auth, structured logging, and Docker.

Why FastAPI + FAISS for RAG
- FastAPI gives you asynchronous performance, pydantic models, and an ergonomic developer experience.
- FAISS is a battle‑tested vector similarity search library: fast, memory‑efficient, with multiple index types (Flat, IVF, HNSW, PQ) for different data sizes and latency targets.
For this tutorial we’ll stick to a normalized embeddings + inner product (cosine) setup using a Flat index (simple and strong baseline), and note where advanced options fit.
Architecture Overview
Flow:
- Ingestion: Load files → clean text → split into overlapping chunks.
- Embedding: Encode chunks into vectors (e.g., all-MiniLM-L6-v2).
- Indexing: Add vectors to FAISS; persist the index to disk.
- Metadata store: Map vector IDs → {text, source, chunk_id, …}.
- API: /v1/query → embed query → FAISS top‑K → optional LLM generation with retrieved context.
- Ops: /healthz, /readyz, /metrics, logging, rate limits, API keys.

Project Setup
fastapi-faiss-rag/
├─ app/
│ ├─ __init__.py
│ ├─ main.py # FastAPI app, endpoints
│ ├─ retrieval.py # FAISS search + embed
│ ├─ generator.py # LLM call + extractive fallback
│ ├─ config.py # settings, env
│ ├─ models.py # pydantic request/response
│ └─ storage.py # metadata store helpers
├─ indices/
│ ├─ faiss.index
│ └─ meta.jsonl
├─ data/
│ └─ docs/ # your .txt/.md sources
├─ scripts/
│ └─ ingest.py # build index from data/docs
├─ requirements.txt
├─ .env.example
└─ Dockerfile
requirements.txt
fastapi
uvicorn[standard]
sentence-transformers
faiss-cpu
numpy
pydantic
python-dotenv
prometheus-fastapi-instrumentator
slowapi
orjson
filelock
.env.example
OPENAI_API_KEY=
OPENAI_MODEL=gpt-4o-mini
INDEX_PATH=indices/faiss.index
META_PATH=indices/meta.jsonl
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2
Data Ingestion & Chunking
Efficient chunking boosts retrieval quality. Start simple: split on paragraphs with overlap to preserve context continuity.
# scripts/ingest.py
import os, json, glob
from pathlib import Path
from typing import List, Dict
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
INDEX_PATH = os.getenv("INDEX_PATH", "indices/faiss.index")
META_PATH = os.getenv("META_PATH", "indices/meta.jsonl")
EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
DOCS_DIR = "data/docs"
def read_docs() -> List[Dict]:
    docs = []
    for path in glob.glob(f"{DOCS_DIR}/**/*", recursive=True):
        if not os.path.isfile(path):
            continue
        if not (path.endswith(".txt") or path.endswith(".md")):
            continue
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        docs.append({"source": path, "text": text})
    return docs
def chunk_text(text: str, max_chars=800, overlap=120) -> List[str]:
    # simple, whitespace-preserving chunker
    paras = text.split("\n\n")
    chunks, buff = [], ""
    for para in paras:
        if len(buff) + len(para) + 2 <= max_chars:
            buff += (("\n\n" if buff else "") + para)
        else:
            if buff:
                chunks.append(buff)
                # overlap: take the tail of buff
                tail = buff[-overlap:]
                buff = tail + "\n\n" + para
            else:
                chunks.append(para[:max_chars])
                buff = para[max(0, len(para) - overlap):]
    if buff:
        chunks.append(buff)
    return [c.strip() for c in chunks if c.strip()]
def main():
    os.makedirs("indices", exist_ok=True)
    docs = read_docs()
    if not docs:
        raise SystemExit("No .txt/.md files found in data/docs")
    model = SentenceTransformer(EMBED_MODEL)
    dim = model.get_sentence_embedding_dimension()
    index = faiss.IndexFlatIP(dim)  # inner product (use normalized vectors)
    meta = []
    vectors = []
    for d in docs:
        for i, chunk in enumerate(chunk_text(d["text"])):
            emb = model.encode(chunk, normalize_embeddings=True)
            vectors.append(emb.astype("float32"))
            meta.append({"text": chunk, "source": d["source"], "chunk_id": i})
    mat = np.vstack(vectors).astype("float32")
    index.add(mat)
    faiss.write_index(index, INDEX_PATH)
    with open(META_PATH, "w", encoding="utf-8") as f:
        for m in meta:
            f.write(json.dumps(m, ensure_ascii=False) + "\n")
    print(f"Saved {len(meta)} chunks to {INDEX_PATH} and {META_PATH}")

if __name__ == "__main__":
    main()
Notes:
- Use normalized embeddings so inner product ≈ cosine similarity.
- For large corpora, switch to IVF or HNSW indexes and/or PQ compression (see the sketch after these notes).
- Keep a separate metadata store (we’ll use JSONL here; SQLite is a great upgrade).
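As flagged in the second note, IVF is the usual first step up from a Flat index once the corpus grows. Here is a minimal sketch, assuming the same normalized float32 matrix mat built in scripts/ingest.py; the nlist and nprobe values are illustrative and should be tuned for your corpus.
# Hypothetical IVF variant of the index built in scripts/ingest.py.
import faiss
import numpy as np

def build_ivf_index(mat: np.ndarray, nlist: int = 256) -> faiss.Index:
    dim = mat.shape[1]
    quantizer = faiss.IndexFlatIP(dim)  # coarse quantizer over inner product
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(mat)                    # IVF indexes must be trained before adding vectors
    index.add(mat)
    index.nprobe = 16                   # lists probed per query: recall vs. latency trade-off
    return index
Searches go through the same index.search() call as before; only nprobe needs tuning.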
Vectorization & FAISS Indexing
We’ve chosen all-MiniLM-L6-v2 for a strong, lightweight baseline. Production teams often:
- Self‑host a higher‑dimensional model for better recall, or
- Use a managed embedding API and cache vectors locally.
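Whichever route you take, caching embeddings by content hash avoids re-encoding unchanged chunks. Below is a minimal sketch with an on-disk pickle cache; the cache path and the embed_fn callable are placeholders you would wire to your own model or API.
# Hypothetical embedding cache keyed by a SHA-256 of the chunk text.
import hashlib
import pickle
from pathlib import Path
from typing import Callable, Dict, List

import numpy as np

CACHE_PATH = Path("indices/embed_cache.pkl")  # assumption: cache lives next to the index

def _load_cache() -> Dict[str, np.ndarray]:
    if CACHE_PATH.exists():
        return pickle.loads(CACHE_PATH.read_bytes())
    return {}

def embed_with_cache(texts: List[str], embed_fn: Callable[[List[str]], np.ndarray]) -> np.ndarray:
    cache = _load_cache()
    keys = [hashlib.sha256(t.encode("utf-8")).hexdigest() for t in texts]
    missing = [t for t, k in zip(texts, keys) if k not in cache]
    if missing:
        fresh = embed_fn(missing)  # embed only the uncached chunks
        for t, vec in zip(missing, fresh):
            cache[hashlib.sha256(t.encode("utf-8")).hexdigest()] = np.asarray(vec, dtype="float32")
        CACHE_PATH.write_bytes(pickle.dumps(cache))
    return np.vstack([cache[k] for k in keys])
Here embed_fn could be as simple as lambda batch: model.encode(batch, normalize_embeddings=True).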

The FastAPI Service
Let’s implement a clean, typed API.
# app/models.py
from pydantic import BaseModel, Field
from typing import List, Dict, Any

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2)
    top_k: int = Field(5, ge=1, le=20)

class Source(BaseModel):
    text: str
    score: float
    metadata: Dict[str, Any]

class QueryResponse(BaseModel):
    answer: str
    sources: List[Source]
# app/config.py
import os

class Settings:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
    OPENAI_MODEL = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    INDEX_PATH = os.getenv("INDEX_PATH", "indices/faiss.index")
    META_PATH = os.getenv("META_PATH", "indices/meta.jsonl")
    EMBED_MODEL = os.getenv("EMBED_MODEL", "sentence-transformers/all-MiniLM-L6-v2")
    API_KEY = os.getenv("RAG_API_KEY", "")  # optional API key for your service

settings = Settings()
# app/storage.py
import json
from typing import List, Dict

def load_meta(meta_path: str) -> List[Dict]:
    rows = []
    with open(meta_path, "r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
    return rows
# app/retrieval.py
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any

class Retriever:
    def __init__(self, index_path: str, meta: List[Dict], embed_model: str):
        self.meta = meta
        self.model = SentenceTransformer(embed_model, device="cpu")
        self.dim = self.model.get_sentence_embedding_dimension()
        self.index = faiss.read_index(index_path)

    def _embed(self, texts: List[str]) -> np.ndarray:
        vecs = self.model.encode(texts, batch_size=64, normalize_embeddings=True)
        return np.array(vecs, dtype="float32")

    def search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        qv = self._embed([query])
        scores, ids = self.index.search(qv, k)
        results = []
        for score, idx in zip(scores[0], ids[0]):
            if idx == -1:
                continue
            m = self.meta[idx]
            results.append({
                "text": m["text"],
                "score": float(score),
                "metadata": {key: v for key, v in m.items() if key != "text"}
            })
        return results
# app/generator.py
import os
from typing import List, Dict

def _extractive_fallback(query: str, passages: List[Dict]) -> str:
    # Simple extractive fallback: join the best few chunks
    joined = "\n\n".join(p["text"] for p in passages[:3])
    return (
        "Below is a synthesized answer using retrieved context (no LLM configured).\n\n"
        f"Query: {query}\n\n"
        f"Context:\n{joined}\n\n"
        "Answer: Summarize key points from the context above."
    )

def generate_answer(query: str, passages: List[Dict]) -> str:
    api_key = os.getenv("OPENAI_API_KEY", "")
    if not api_key:
        return _extractive_fallback(query, passages)
    try:
        # Optional: Use OpenAI if installed and the key is set.
        # This snippet uses Chat Completions-style APIs.
        from openai import OpenAI
        client = OpenAI(api_key=api_key)
        context = "\n\n".join(
            f"Source: {p['metadata'].get('source', '')} | Score: {p['score']:.3f}\n{p['text']}"
            for p in passages
        )
        prompt = (
            "You are a RAG assistant. Answer strictly from the provided context. "
            "Cite sources by filename when relevant.\n\n"
            f"Context:\n{context}\n\n"
            f"User question: {query}\n\n"
            "Answer:"
        )
        resp = client.chat.completions.create(
            model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
            messages=[{"role": "user", "content": prompt}],
            temperature=0.2,
        )
        return resp.choices[0].message.content.strip()
    except Exception:
        return _extractive_fallback(query, passages)
# app/main.py
from fastapi import FastAPI, Depends, HTTPException, Request, Header
from fastapi.responses import ORJSONResponse
from prometheus_fastapi_instrumentator import Instrumentator
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware

from .models import QueryRequest, QueryResponse, Source
from .retrieval import Retriever
from .generator import generate_answer
from .storage import load_meta
from .config import settings

app = FastAPI(
    title="FastAPI FAISS RAG API",
    default_response_class=ORJSONResponse,
    version="1.0.0",
)

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(
    RateLimitExceeded,
    lambda r, e: ORJSONResponse({"detail": "Rate limit exceeded"}, status_code=429),
)
app.add_middleware(SlowAPIMiddleware)

def require_api_key(x_api_key: str = Header(default="")):
    if settings.API_KEY and x_api_key != settings.API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.on_event("startup")
async def startup():
    meta = load_meta(settings.META_PATH)
    app.state.retriever = Retriever(settings.INDEX_PATH, meta, settings.EMBED_MODEL)
    Instrumentator().instrument(app).expose(app)

@app.get("/healthz")
async def healthz():
    return {"status": "ok"}

@app.get("/readyz")
async def readyz():
    # Add checks: index loaded, metadata count > 0, etc.
    ready = hasattr(app.state, "retriever")
    return {"ready": bool(ready)}

@app.post("/v1/query", response_model=QueryResponse)
@limiter.limit("60/minute")
async def query(req: QueryRequest, request: Request, _: None = Depends(require_api_key)):
    retriever: Retriever = request.app.state.retriever
    hits = retriever.search(req.query, k=req.top_k)
    answer = generate_answer(req.query, hits)
    return QueryResponse(answer=answer, sources=hits)
CORS: If you’ll call this from a browser, add FastAPI’s CORS middleware with allowed origins.
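A minimal sketch of that CORS setup, added to app/main.py after the app is created; the allowed origin is a placeholder you would replace with your frontend’s domain.
# Hypothetical CORS configuration for browser clients (add to app/main.py).
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://app.example.com"],  # placeholder origin
    allow_methods=["GET", "POST"],
    allow_headers=["Content-Type", "X-API-Key"],
)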
Search & Retrieval Logic
- We normalize vectors to ensure inner product equals cosine similarity.
- Start with k=5. Tune per corpus size and typical query intent.
- Consider boosting recency or source authority by re‑ranking (e.g., multiply score by a metadata weight).
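As a concrete example of that last point, here is a hedged sketch of a post-retrieval re-ranker that multiplies each FAISS score by a per-source authority weight; the weights dict is an assumption you would derive from your own metadata.
# Hypothetical re-ranking step applied to Retriever.search() results.
from typing import Any, Dict, List

SOURCE_WEIGHTS: Dict[str, float] = {        # assumed authority weights per source path
    "data/docs/handbook.md": 1.2,
    "data/docs/old_notes.txt": 0.8,
}

def rerank(hits: List[Dict[str, Any]], default_weight: float = 1.0) -> List[Dict[str, Any]]:
    for h in hits:
        weight = SOURCE_WEIGHTS.get(h["metadata"].get("source", ""), default_weight)
        h["score"] = h["score"] * weight    # boost or demote by source authority
    return sorted(hits, key=lambda h: h["score"], reverse=True)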
Generation: LLM Integration & Fallbacks
In production, a deterministic fallback is essential. The generator.py above:
- Uses an OpenAI call when OPENAI_API_KEY is set.
- Falls back to a safe extractive answer if the LLM is unavailable.
Prompts:
- Keep the system prompt concise: “Answer strictly from context. If unsure, say you don’t know.”
- Provide sources in the prompt for lightweight citation.
Alternatives:
- Swap in any provider (Azure, Anthropic, local vLLM).
- Wrap calls with timeouts and retries; cache successful responses.
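A minimal sketch of the timeout-and-retry idea, wrapping any generation callable; the retry count, timeout, and backoff values are illustrative, not recommendations.
# Hypothetical retry/timeout wrapper around a generation callable.
import time
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

def generate_with_retries(call: Callable[[], str], retries: int = 2,
                          timeout_s: float = 15.0, backoff_s: float = 1.0) -> str:
    last_error = None
    for attempt in range(retries + 1):
        pool = ThreadPoolExecutor(max_workers=1)
        try:
            return pool.submit(call).result(timeout=timeout_s)  # hard per-attempt timeout
        except Exception as exc:                                # timeout or provider error
            last_error = exc
            time.sleep(backoff_s * (attempt + 1))               # simple linear backoff
        finally:
            pool.shutdown(wait=False)
    raise RuntimeError("LLM generation failed after retries") from last_error
In main.py you could call generate_with_retries(lambda: generate_answer(req.query, hits)) and return the extractive fallback if it raises.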
Observability, Security & Rate Limits
- Health & readiness: /healthz and /readyz.
- Metrics: prometheus-fastapi-instrumentator exposes /metrics for Prometheus.
- Rate limiting: slowapi with 60/minute per IP (tune as needed).
- API keys: Simple header‑based auth via X-API-Key.
- Logging: Prefer structured logs (JSON). Include a request_id for tracing.
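A hedged sketch of JSON request logs with a per-request request_id, implemented as a small FastAPI middleware added to app/main.py; the logger name and log fields are illustrative.
# Hypothetical structured-logging middleware (add to app/main.py after `app` is created).
import json
import logging
import time
import uuid

from fastapi import Request

logger = logging.getLogger("rag_api")
logging.basicConfig(level=logging.INFO, format="%(message)s")  # emit raw JSON lines

@app.middleware("http")
async def log_requests(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    start = time.perf_counter()
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id  # echo the id so clients can correlate
    logger.info(json.dumps({
        "request_id": request_id,
        "method": request.method,
        "path": request.url.path,
        "status": response.status_code,
        "duration_ms": round((time.perf_counter() - start) * 1000, 2),
    }))
    return response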
Privacy & Compliance
- Store only what you need. Mask PII in logs.
- If using third‑party LLMs, review data retention and opt‑out flags.
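One way to act on the PII point: a logging filter that redacts email addresses before records are written. A minimal sketch; the regex only covers emails, and what counts as PII in your logs is an assumption you should revisit.
# Hypothetical logging filter that masks email addresses in log messages.
import logging
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")

class PIIMaskFilter(logging.Filter):
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = EMAIL_RE.sub("[redacted-email]", str(record.msg))
        return True  # never drop records, only rewrite them

logging.getLogger("rag_api").addFilter(PIIMaskFilter())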
Testing the API
Smoke test with curl:
export RAG_API_KEY=dev-key
uvicorn app.main:app --reload
curl -X POST http://localhost:8000/v1/query \
-H "Content-Type: application/json" \
-H "X-API-Key: dev-key" \
-d '{"query":"What does the architecture look like?","top_k":3}'
Pytest sketch:
# tests/test_query.py
from fastapi.testclient import TestClient
from app.main import app

def test_healthz():
    c = TestClient(app)
    r = c.get("/healthz")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"
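To exercise /v1/query without a real FAISS index on disk, you can stub the retriever on app.state. A hedged sketch, assuming RAG_API_KEY and OPENAI_API_KEY are not set in the test environment so the auth check passes and the extractive fallback runs; otherwise pass the matching X-API-Key header.
# tests/test_query_stub.py -- hypothetical test with a stubbed retriever.
from fastapi.testclient import TestClient
from app.main import app

class FakeRetriever:
    def search(self, query, k=5):
        return [{
            "text": "The API exposes /v1/query for retrieval.",
            "score": 0.91,
            "metadata": {"source": "data/docs/example.md", "chunk_id": 0},
        }]

def test_query_returns_sources():
    app.state.retriever = FakeRetriever()  # bypass startup loading of the real index
    c = TestClient(app)
    r = c.post("/v1/query", json={"query": "What does the architecture look like?", "top_k": 3})
    assert r.status_code == 200
    body = r.json()
    assert body["sources"] and body["answer"]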

Dockerization & Deployment
Dockerfile
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY app app
COPY indices indices
COPY .env.example .env
ENV PORT=8000
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]
Build & run:
docker build -t fastapi-faiss-rag:latest .
docker run -p 8000:8000 \
-e RAG_API_KEY=prod-key \
-e OPENAI_API_KEY=$OPENAI_API_KEY \
fastapi-faiss-rag:latest
Deployment targets
- Single VM: Docker + systemd, easy and cost‑effective.
- Kubernetes: Add liveness/readiness probes; configure HPA on CPU/memory.
- Serverless: Possible for light loads; cold starts and native FAISS builds can be tricky.
Scaling & Hardening
- Bigger corpora: Use IVF or HNSW FAISS indexes; train with a representative sample.
- Compression: PQ or OPQ to shrink memory while preserving recall.
- Sharding: Split by domain or time; merge results at query time.
- Metadata store: Move from JSONL to SQLite/Postgres; add indexes on source and timestamp.
- Async ingestion: Queue new documents (e.g., Celery or RQ).
- Hot reload: Maintain two index files and atomically flip a symlink when a new index is ready (see the sketch after this list).
- Security: Mutual TLS for internal traffic; WAF/CDN at the edge; rotate API keys.
- Cost control: Cache embeddings and generation results; use smaller models for routine questions.
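A minimal sketch of the hot-reload idea from the list above: build the new index to a versioned file, then atomically repoint a symlink that the API uses as its INDEX_PATH and reopen the index afterwards. Paths and link names are illustrative.
# Hypothetical atomic index swap: stage a symlink to the new file, then rename it over the live one.
import os

def swap_index(new_index_path: str, live_link: str = "indices/faiss.index.current") -> None:
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.abspath(new_index_path), tmp_link)  # stage the new target
    os.replace(tmp_link, live_link)                        # atomic rename on POSIX filesystems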
Common Pitfalls
- Mismatched dimensions: FAISS index dimension must match your embedding model.
- No normalization: For cosine similarity with inner product, normalize vectors at both indexing and query time.
- Over‑chunking: Tiny chunks ≠ better answers. Start at ~600–900 chars with ~100 overlap.
- Unbounded context: Limit how many chunks you feed to the LLM; tune for the model’s context window.
- Missing fallbacks: Always have a deterministic path when LLM calls fail.
- Lack of evals: Build a small eval set (queries + gold answers) and measure hit rate + answer quality regularly.
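For that last point, even a tiny harness helps. Below is a hedged sketch that measures retrieval hit rate against a hand-written eval set; the evals.jsonl format is an assumption (one JSON object per line with a query and the source file that should be retrieved).
# Hypothetical retrieval eval: share of queries whose expected source appears in the top-k hits.
# Assumes evals.jsonl lines like {"query": "...", "expected_source": "data/docs/handbook.md"}.
import json

from app.config import settings
from app.retrieval import Retriever
from app.storage import load_meta

def hit_rate(eval_path: str = "evals.jsonl", k: int = 5) -> float:
    retriever = Retriever(settings.INDEX_PATH, load_meta(settings.META_PATH), settings.EMBED_MODEL)
    with open(eval_path, encoding="utf-8") as f:
        rows = [json.loads(line) for line in f if line.strip()]
    hits = 0
    for row in rows:
        results = retriever.search(row["query"], k=k)
        if any(r["metadata"].get("source") == row["expected_source"] for r in results):
            hits += 1
    return hits / len(rows) if rows else 0.0

if __name__ == "__main__":
    print(f"hit@{5}: {hit_rate():.2%}")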
FAQ
How large can a FastAPI FAISS RAG API grow before IVF/HNSW is required?
Flat indexes do well up to low‑millions of vectors on a single machine with enough RAM. Beyond that, move to IVF/HNSW and consider PQ compression to fit memory budgets.
Can I use GPUs?
Yes. FAISS has GPU support. For strictly CPU deployments, choose efficient embedding models and consider quantization.
How do I add PDFs or web pages?
Convert to text during ingestion (e.g., pypdf, trafilatura). Keep the conversion offline in your ingestion script so API servers stay lean.
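A hedged sketch of that PDF step for scripts/ingest.py, using pypdf; page handling is simplified and the library choice is just one option.
# Hypothetical PDF loader for scripts/ingest.py (requires `pip install pypdf`).
from pypdf import PdfReader

def read_pdf(path: str) -> str:
    reader = PdfReader(path)
    # Join page texts; extract_text() can return None for image-only pages.
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)
read_docs() could then append {"source": path, "text": read_pdf(path)} for *.pdf files.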
How do I cite sources?
Store source and chunk_id in metadata and instruct the LLM to cite by filename or URL. You can also include line numbers if you pre‑compute them.
Conclusion
You now have a FastAPI FAISS RAG API that is not just a demo, but designed with production guardrails: clean contracts, health checks, metrics, rate limits, API keys, Docker packaging, and a safe generation fallback. From here, iterate on index types, re‑ranking, prompting, and evaluation to continuously improve answer quality and reliability.
