
How to Build a RAG API with FastAPI and FAISS (Step by Step)

Last modified: August 25, 2025




What You’ll Build

We’ll create a production‑ready FastAPI FAISS RAG API that:

  • Ingests documents, splits them into chunks, and creates dense embeddings.
  • Indexes vectors with FAISS (cosine similarity via inner product).
  • Exposes a /v1/query endpoint that retrieves top‑K chunks and (optionally) generates an answer using an LLM.
  • Adds production touches: health checks, metrics, rate limits, API key auth, structured logging, and Docker.


Why FastAPI + FAISS for RAG

  • FastAPI gives you asynchronous performance, pydantic models, and an ergonomic developer experience.
  • FAISS is a battle-tested vector similarity search library: fast, memory-efficient, and available with multiple index types (Flat, IVF, HNSW, PQ) for different data sizes and latency targets.

For this tutorial we’ll stick to a normalized embeddings + inner product (cosine) setup using a Flat index (simple and strong baseline), and note where advanced options fit.


Architecture Overview

Flow:

  1. Ingestion: Load files → clean text → split into overlapping chunks.
  2. Embedding: Encode chunks into vectors (e.g., all-MiniLM-L6-v2).
  3. Indexing: Add vectors to FAISS; persist index to disk.
  4. Metadata store: Map vector IDs → {text, source, chunk_id, …}.
  5. API: /v1/query → embed query → FAISS top‑K → optional LLM generation with retrieved context.
  6. Ops: /healthz, /readyz, /metrics, logging, rate limits, API keys.



Project Setup

fastapi-faiss-rag/
├─ app/
│  ├─ __init__.py
│  ├─ main.py            # FastAPI app, endpoints
│  ├─ retrieval.py       # FAISS search + embed
│  ├─ generator.py       # LLM call + extractive fallback
│  ├─ config.py          # settings, env
│  ├─ models.py          # pydantic request/response
│  └─ storage.py         # metadata store helpers
├─ indices/
│  ├─ faiss.index
│  └─ meta.jsonl
├─ data/
│  └─ docs/              # your .txt/.md sources
├─ scripts/
│  └─ ingest.py          # build index from data/docs
├─ requirements.txt
├─ .env.example
└─ Dockerfile

requirements.txt

fastapi
uvicorn[standard]
sentence-transformers
faiss-cpu
numpy
pydantic
python-dotenv
prometheus-fastapi-instrumentator
slowapi
orjson
filelock

.env.example

OPENAI_API_KEY=
OPENAI_MODEL=gpt-4o-mini
INDEX_PATH=indices/faiss.index
META_PATH=indices/meta.jsonl
EMBED_MODEL=sentence-transformers/all-MiniLM-L6-v2

Data Ingestion & Chunking

Effective chunking boosts retrieval quality. Start simple: split on paragraphs with overlap to preserve context continuity.

# scripts/ingest.py
import os, json, glob
from pathlib import Path
from typing import List, Dict
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer

INDEX_PATH = os.getenv("INDEX_PATH", "indices/faiss.index")
META_PATH  = os.getenv("META_PATH",  "indices/meta.jsonl")
EMBED_MODEL= os.getenv("EMBED_MODEL","sentence-transformers/all-MiniLM-L6-v2")

DOCS_DIR   = "data/docs"

def read_docs() -> List[Dict]:
    docs = []
    for path in glob.glob(f"{DOCS_DIR}/**/*", recursive=True):
        if not os.path.isfile(path): 
            continue
        if not (path.endswith(".txt") or path.endswith(".md")):
            continue
        text = Path(path).read_text(encoding="utf-8", errors="ignore")
        docs.append({"source": path, "text": text})
    return docs

def chunk_text(text: str, max_chars=800, overlap=120) -> List[str]:
    # simple, whitespace-preserving chunker
    paras = text.split("\n\n")
    chunks, buff = [], ""
    for para in paras:
        # split paragraphs longer than max_chars so no text is silently dropped
        while len(para) > max_chars:
            chunks.append(para[:max_chars])
            para = para[max(1, max_chars - overlap):]
        if len(buff) + len(para) + 2 <= max_chars:
            buff += (("\n\n" if buff else "") + para)
        else:
            if buff:
                chunks.append(buff)
            # overlap: carry the tail of the previous chunk into the next one
            tail = buff[-overlap:] if buff else ""
            buff = (tail + "\n\n" + para) if tail else para
    if buff:
        chunks.append(buff)
    return [c.strip() for c in chunks if c.strip()]

def main():
    os.makedirs("indices", exist_ok=True)
    docs = read_docs()
    if not docs:
        raise SystemExit("No .txt/.md files found in data/docs")

    model = SentenceTransformer(EMBED_MODEL)
    dim = model.get_sentence_embedding_dimension()
    index = faiss.IndexFlatIP(dim)  # inner product (use normalized vectors)

    meta = []
    texts = []

    for d in docs:
        for i, chunk in enumerate(chunk_text(d["text"])):
            texts.append(chunk)
            meta.append({"text": chunk, "source": d["source"], "chunk_id": i})

    # encode all chunks in batches rather than one at a time (much faster)
    mat = model.encode(texts, batch_size=64, normalize_embeddings=True).astype("float32")
    index.add(mat)
    faiss.write_index(index, INDEX_PATH)

    with open(META_PATH, "w", encoding="utf-8") as f:
        for m in meta:
            f.write(json.dumps(m, ensure_ascii=False) + "\n")

    print(f"Saved {len(meta)} chunks to {INDEX_PATH} and {META_PATH}")

if __name__ == "__main__":
    main()

Notes:

  • Use normalized embeddings so inner product ≈ cosine similarity.
  • For large corpora, switch to IVF or HNSW indexes and/or PQ compression (see the sketch after this list).
  • Keep a separate metadata store (we’ll use JSONL here; SQLite is a great upgrade).
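
When the corpus outgrows a Flat index, a minimal sketch of building an IVF index instead (the nlist and nprobe values are illustrative and should be tuned on your data):

# Sketch: IVF alternative to IndexFlatIP for larger corpora (nlist/nprobe are illustrative)
import faiss
import numpy as np

def build_ivf_index(vectors: np.ndarray, nlist: int = 256) -> faiss.Index:
    dim = vectors.shape[1]
    quantizer = faiss.IndexFlatIP(dim)                      # coarse quantizer over normalized vectors
    index = faiss.IndexIVFFlat(quantizer, dim, nlist, faiss.METRIC_INNER_PRODUCT)
    index.train(vectors)                                    # IVF indexes must be trained before adding
    index.add(vectors)
    index.nprobe = 16                                       # cells probed per query: recall vs. latency
    return index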



Vectorization & FAISS Indexing

We’ve chosen all-MiniLM-L6-v2 for a strong, lightweight baseline. Production teams often:

  • Self‑host a higher‑dimensional model for better recall, or
  • Use a managed embedding API and cache vectors locally.
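
For the caching idea, a minimal sketch of a local on-disk vector cache keyed by a hash of the chunk text (the cache path and helper name are illustrative):

# Sketch: cache embeddings so unchanged chunks are not re-encoded (path/name are illustrative)
import hashlib
from pathlib import Path
import numpy as np

CACHE_DIR = Path("indices/embed_cache")
CACHE_DIR.mkdir(parents=True, exist_ok=True)

def cached_embed(model, text: str) -> np.ndarray:
    key = hashlib.sha256(text.encode("utf-8")).hexdigest()
    path = CACHE_DIR / f"{key}.npy"
    if path.exists():
        return np.load(path)
    vec = model.encode(text, normalize_embeddings=True).astype("float32")
    np.save(path, vec)
    return vec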


The FastAPI Service

Let’s implement a clean, typed API.

# app/models.py
from pydantic import BaseModel, Field
from typing import List, Dict, Any

class QueryRequest(BaseModel):
    query: str = Field(..., min_length=2)
    top_k: int = Field(5, ge=1, le=20)

class Source(BaseModel):
    text: str
    score: float
    metadata: Dict[str, Any]

class QueryResponse(BaseModel):
    answer: str
    sources: List[Source]

# app/config.py
import os

class Settings:
    OPENAI_API_KEY = os.getenv("OPENAI_API_KEY", "")
    OPENAI_MODEL   = os.getenv("OPENAI_MODEL", "gpt-4o-mini")
    INDEX_PATH     = os.getenv("INDEX_PATH", "indices/faiss.index")
    META_PATH      = os.getenv("META_PATH", "indices/meta.jsonl")
    EMBED_MODEL    = os.getenv("EMBED_MODEL","sentence-transformers/all-MiniLM-L6-v2")
    API_KEY        = os.getenv("RAG_API_KEY", "")  # optional API key for your service

settings = Settings()

# app/storage.py
import json
from typing import List, Dict

def load_meta(meta_path: str) -> List[Dict]:
    rows = []
    with open(meta_path, "r", encoding="utf-8") as f:
        for line in f:
            rows.append(json.loads(line))
    return rows

# app/retrieval.py
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from typing import List, Dict, Any

class Retriever:
    def __init__(self, index_path: str, meta: List[Dict], embed_model: str):
        self.meta = meta
        self.model = SentenceTransformer(embed_model, device="cpu")
        self.dim = self.model.get_sentence_embedding_dimension()
        self.index = faiss.read_index(index_path)

    def _embed(self, texts: List[str]) -> np.ndarray:
        vecs = self.model.encode(texts, batch_size=64, normalize_embeddings=True)
        return np.array(vecs, dtype="float32")

    def search(self, query: str, k: int = 5) -> List[Dict[str, Any]]:
        qv = self._embed([query])
        scores, ids = self.index.search(qv, k)
        results = []
        for score, idx in zip(scores[0], ids[0]):
            if idx == -1:
                continue
            m = self.meta[idx]
            results.append({
                "text": m["text"],
                "score": float(score),
                "metadata": {k: v for k, v in m.items() if k != "text"}
            })
        return results

# app/generator.py
import os
from typing import List, Dict

def _extractive_fallback(query: str, passages: List[Dict]) -> str:
    # Simple extractive fallback: join the best few chunks
    joined = "\n\n".join(p["text"] for p in passages[:3])
    return (
        "Below is a synthesized answer using retrieved context (no LLM configured).\n\n"
        f"Query: {query}\n\n"
        f"Context:\n{joined}\n\n"
        "Answer: Summarize key points from the context above."
    )

def generate_answer(query: str, passages: List[Dict]) -> str:
    api_key = os.getenv("OPENAI_API_KEY", "")
    if not api_key:
        return _extractive_fallback(query, passages)

    try:
        # Optional: Use OpenAI if installed and key is set.
        # This snippet uses Chat Completions-style APIs.
        from openai import OpenAI
        client = OpenAI(api_key=api_key)
        context = "\n\n".join(
            f"Source: {p['metadata'].get('source','')} | Score: {p['score']:.3f}\n{p['text']}"
            for p in passages
        )
        prompt = (
            "You are a RAG assistant. Answer strictly from the provided context. "
            "Cite sources by filename when relevant.\n\n"
            f"Context:\n{context}\n\n"
            f"User question: {query}\n\n"
            "Answer:"
        )
        resp = client.chat.completions.create(
            model=os.getenv("OPENAI_MODEL", "gpt-4o-mini"),
            messages=[{"role":"user","content":prompt}],
            temperature=0.2
        )
        return resp.choices[0].message.content.strip()
    except Exception:
        return _extractive_fallback(query, passages)


# app/main.py
from fastapi import FastAPI, Depends, HTTPException, Request, Header
from fastapi.responses import ORJSONResponse
from prometheus_fastapi_instrumentator import Instrumentator
from slowapi import Limiter
from slowapi.util import get_remote_address
from slowapi.errors import RateLimitExceeded
from slowapi.middleware import SlowAPIMiddleware

from .models import QueryRequest, QueryResponse, Source
from .retrieval import Retriever
from .generator import generate_answer
from .storage import load_meta
from .config import settings

app = FastAPI(
    title="FastAPI FAISS RAG API",
    default_response_class=ORJSONResponse,
    version="1.0.0"
)

limiter = Limiter(key_func=get_remote_address)
app.state.limiter = limiter
app.add_exception_handler(RateLimitExceeded, lambda r, e: ORJSONResponse({"detail":"Rate limit exceeded"}, status_code=429))
app.add_middleware(SlowAPIMiddleware)

def require_api_key(x_api_key: str = Header(default="")):
    if settings.API_KEY and x_api_key != settings.API_KEY:
        raise HTTPException(status_code=401, detail="Invalid API key")

@app.on_event("startup")
async def startup():
    meta = load_meta(settings.META_PATH)
    app.state.retriever = Retriever(settings.INDEX_PATH, meta, settings.EMBED_MODEL)
    Instrumentator().instrument(app).expose(app)

@app.get("/healthz")
async def healthz():
    return {"status":"ok"}

@app.get("/readyz")
async def readyz():
    # Add checks: index loaded, metadata count > 0, etc.
    ready = hasattr(app.state, "retriever")
    return {"ready": bool(ready)}

@app.post("/v1/query", response_model=QueryResponse)
@limiter.limit("60/minute")
async def query(req: QueryRequest, request: Request, _: None = Depends(require_api_key)):
    retriever: Retriever = request.app.state.retriever
    hits = retriever.search(req.query, k=req.top_k)
    answer = generate_answer(req.query, hits)
    return QueryResponse(answer=answer, sources=hits)

CORS: If you’ll call this from a browser, add FastAPI’s CORS middleware with allowed origins.
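
A minimal sketch for app/main.py (the origin shown is a placeholder for your front-end’s domain):

# Sketch: enable CORS for browser clients (origin below is a placeholder)
from fastapi.middleware.cors import CORSMiddleware

app.add_middleware(
    CORSMiddleware,
    allow_origins=["https://your-frontend.example.com"],
    allow_methods=["GET", "POST"],
    allow_headers=["Content-Type", "X-API-Key"],
)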


Search & Retrieval Logic

  • We normalize vectors to ensure inner product equals cosine similarity.
  • Start with k=5. Tune per corpus size and typical query intent.
  • Consider boosting recency or source authority by re‑ranking (e.g., multiply score by a metadata weight).
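
A minimal re-ranking sketch, assuming you add a numeric authority_weight field to each chunk’s metadata at ingestion time (that field is hypothetical, not part of the ingest script above):

# Sketch: re-rank FAISS hits by a metadata weight (authority_weight is a hypothetical field)
from typing import Any, Dict, List

def rerank(hits: List[Dict[str, Any]], weight_key: str = "authority_weight") -> List[Dict[str, Any]]:
    for h in hits:
        h["score"] *= float(h["metadata"].get(weight_key, 1.0))  # default weight leaves the score unchanged
    return sorted(hits, key=lambda h: h["score"], reverse=True)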


Generation: LLM Integration & Fallbacks

In production, a deterministic fallback is essential. The generator.py above:

  • Uses an OpenAI call when OPENAI_API_KEY is set.
  • Falls back to a safe extractive answer if the LLM is unavailable.

Prompts:

  • Keep the system prompt concise: “Answer strictly from context. If unsure, say you don’t know.”
  • Provide sources in the prompt for lightweight citation.

Alternatives:

  • Swap in any provider (Azure, Anthropic, local vLLM).
  • Wrap calls with timeouts and retries; cache successful responses.
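
One hedged example: the OpenAI v1 Python client accepts timeout and max_retries arguments, so the client construction in generator.py could be bounded like this (the values are illustrative):

# Sketch: bound latency and retry transient failures at the client level (values are illustrative)
from openai import OpenAI

client = OpenAI(
    api_key=api_key,
    timeout=15.0,     # seconds per request before giving up
    max_retries=2,    # automatic retries on transient connection/5xx errors
)

Successful responses can then be cached (for example keyed by a hash of the query plus retrieved chunk IDs) so repeated questions skip the LLM entirely.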

Observability, Security & Rate Limits

  • Health & readiness: /healthz and /readyz.
  • Metrics: prometheus-fastapi-instrumentator exposes /metrics for Prometheus.
  • Rate limiting: slowapi with 60/minute per IP (tune as needed).
  • API keys: Simple header‑based auth via X-API-Key.
  • Logging: Prefer structured logs (JSON). Include request_id for tracing.
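
A minimal sketch of a request-ID middleware for app/main.py (the header name and logger configuration are illustrative; a library such as structlog is a common upgrade):

# Sketch: attach a request_id to every request for log correlation (header name is illustrative)
import logging
import uuid
from fastapi import Request

logger = logging.getLogger("rag")

@app.middleware("http")
async def add_request_id(request: Request, call_next):
    request_id = request.headers.get("X-Request-ID", str(uuid.uuid4()))
    response = await call_next(request)
    response.headers["X-Request-ID"] = request_id
    logger.info("handled request", extra={"request_id": request_id, "path": request.url.path})
    return response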

Privacy & Compliance

  • Store only what you need. Mask PII in logs.
  • If using third‑party LLMs, review data retention and opt‑out flags.

Testing the API

Smoke test with curl:

export RAG_API_KEY=dev-key
uvicorn app.main:app --reload

curl -X POST http://localhost:8000/v1/query \
  -H "Content-Type: application/json" \
  -H "X-API-Key: dev-key" \
  -d '{"query":"What does the architecture look like?","top_k":3}'

Pytest sketch:

# tests/test_query.py
from fastapi.testclient import TestClient
from app.main import app

def test_healthz():
    c = TestClient(app)
    r = c.get("/healthz")
    assert r.status_code == 200
    assert r.json()["status"] == "ok"
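
You can extend the sketch to exercise /v1/query end to end. This assumes you have already built the index with scripts/ingest.py and that RAG_API_KEY is either unset or set to dev-key:

# Sketch: end-to-end query test; requires indices/faiss.index and meta.jsonl to exist
def test_query():
    with TestClient(app) as c:  # the context manager runs the startup event that loads the index
        r = c.post(
            "/v1/query",
            json={"query": "What does the architecture look like?", "top_k": 3},
            headers={"X-API-Key": "dev-key"},
        )
        assert r.status_code == 200
        body = r.json()
        assert "answer" in body and "sources" in body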



Dockerization & Deployment

Dockerfile

FROM python:3.11-slim

WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY app app
COPY indices indices
COPY .env.example .env

ENV PORT=8000
EXPOSE 8000
CMD ["uvicorn", "app.main:app", "--host", "0.0.0.0", "--port", "8000", "--workers", "2"]

Build & run:

docker build -t fastapi-faiss-rag:latest .
docker run -p 8000:8000 \
  -e RAG_API_KEY=prod-key \
  -e OPENAI_API_KEY=$OPENAI_API_KEY \
  fastapi-faiss-rag:latest

Deployment targets

  • Single VM: Docker + systemd, easy and cost‑effective.
  • Kubernetes: Add liveness/readiness probes; configure HPA on CPU/memory.
  • Serverless: Possible for light loads; cold starts and native FAISS builds can be tricky.

Scaling & Hardening

  • Bigger corpora: Use IVF or HNSW FAISS indexes; train with a representative sample.
  • Compression: PQ or OPQ to shrink memory while preserving recall.
  • Sharding: Split by domain or time; merge results at query time.
  • Metadata store: Move from JSONL to SQLite/Postgres; add indexes on source, timestamp.
  • Async ingestion: Queue new documents (e.g., Celery or RQ).
  • Hot reload: Maintain two index files and atomically flip a symlink when a new index is ready (see the sketch after this list).
  • Security: Mutual TLS for internal traffic; WAF/CDN at the edge; rotate API keys.
  • Cost control: Cache embeddings and generation results; use smaller models for routine questions.
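
A minimal sketch of the hot-reload idea (the live-link path is illustrative; each worker process still needs to re-create its Retriever after the flip, for example via an authenticated /admin/reload endpoint):

# Sketch: publish a freshly built index with an atomic symlink flip (paths are illustrative)
import os

def publish_new_index(new_index_path: str, live_link: str = "indices/faiss.index.live") -> None:
    tmp_link = live_link + ".tmp"
    if os.path.lexists(tmp_link):
        os.remove(tmp_link)
    os.symlink(os.path.abspath(new_index_path), tmp_link)
    os.replace(tmp_link, live_link)  # rename is atomic on POSIX, so readers never see a partial index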

Common Pitfalls

  1. Mismatched dimensions: FAISS index dimension must match your embedding model.
  2. No normalization: For cosine similarity with inner product, normalize vectors at both indexing and query time.
  3. Over‑chunking: Tiny chunks ≠ better answers. Start at ~600–900 chars with ~100 overlap.
  4. Unbounded context: Limit how many chunks you feed to the LLM; tune for the model’s context window.
  5. Missing fallbacks: Always have a deterministic path when LLM calls fail.
  6. Lack of evals: Build a small eval set (queries + gold answers) and measure hit rate + answer quality regularly.

FAQ

How large can a FastAPI FAISS RAG API grow before IVF/HNSW is required?

Flat indexes do well up to low‑millions of vectors on a single machine with enough RAM. Beyond that, move to IVF/HNSW and consider PQ compression to fit memory budgets.

Can I use GPUs?

Yes. FAISS has GPU support. For strictly CPU deployments, choose efficient embedding models and consider quantization.
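
If a GPU is available and you install the faiss-gpu build, a minimal sketch of cloning the existing CPU index onto device 0:

# Sketch: move a CPU index to GPU 0 (requires faiss-gpu instead of faiss-cpu)
import faiss

cpu_index = faiss.read_index("indices/faiss.index")
res = faiss.StandardGpuResources()
gpu_index = faiss.index_cpu_to_gpu(res, 0, cpu_index)  # searches now run on the GPU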

How do I add PDFs or web pages?

Convert to text during ingestion (e.g., pypdf, trafilatura). Keep the conversion offline in your ingestion script so API servers stay lean.
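
For example, a minimal pypdf loader you could call from scripts/ingest.py (pypdf is an extra dependency, not listed in requirements.txt above):

# Sketch: extract plain text from a PDF with pypdf (add pypdf to your ingestion environment)
from pypdf import PdfReader

def read_pdf(path: str) -> str:
    reader = PdfReader(path)
    return "\n\n".join((page.extract_text() or "") for page in reader.pages)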

How do I cite sources?

Store source and chunk_id in metadata and instruct the LLM to cite by filename or URL. You can also include line numbers if you pre‑compute them.


Conclusion

You now have a FastAPI FAISS RAG API that is not just a demo, but designed with production guardrails: clean contracts, health checks, metrics, rate limits, API keys, Docker packaging, and a safe generation fallback. From here, iterate on index types, re‑ranking, prompting, and evaluation to continuously improve answer quality and reliability.


Get a Fiverr professional to set up and fine-tune your FastAPI FAISS RAG API for production use