How to Deploy LLM Apps on Vercel: Ultimate Steps 2025
Why LLM Apps on Vercel in 2025
If your team is building real‑time AI features, few platforms match the speed, scale, and developer ergonomics of deploying LLM Apps on Vercel. In 2025, Vercel’s AI stack matured significantly: the AI SDK 5 unifies model providers behind a stable interface for streaming UIs, while the AI Gateway centralizes model access, budget controls, and observability. When paired with Next.js 15’s App Router and modern caching, the result is a pragmatic, production‑ready path from prototype to internet‑scale deployment.
You’ll see these themes throughout this guide—streaming by default, server‑safe APIs, secure environment management, and cost control—all tuned for LLM Apps on Vercel. For the SDK’s streaming primitives such as streamText, consult the official reference to understand options, events, and error handling in depth, as these APIs are optimized for low‑latency generation. See AI SDK Core streamText reference for current semantics.
What changed in 2025?
• Vercel has unified compute under Vercel Functions and moved what used to be Edge Functions and Middleware onto this single platform, simplifying deployment models and pricing. Read the product‑level change note from Vercel.
• Storage became marketplace‑first: first‑party KV/Postgres were sunset and replaced by Marketplace integrations (Neon, Redis/Upstash, Supabase, etc.). Check “Product Changes” and Marketplace docs.
• AI Gateway progressed from early releases to general availability with unified API routing across hundreds of models and sub‑20ms routing latency. See Vercel’s GA announcement.

What You’ll Build (and Why)
We’ll deploy a Next.js 15 app with an /api/chat Route Handler that streams LLM tokens, tighten security with environment variables and rate limiting, connect a Postgres or Vector store through the Marketplace, and add cron tasks and observability. This end‑to‑end path is the most common baseline for LLM Apps on Vercel, and it scales well across prototypes, internal tools, and production SaaS.
1) Platform & Architecture Choices for LLM Apps on Vercel
1.1 Runtimes for LLM Apps on Vercel: Node.js vs Edge
In 2025, Vercel’s guidance emphasizes Node.js runtime as the default for reliability and compatibility with provider SDKs, with Edge runtime reserved for ultra‑low latency or specific Web‑API‑only flows. In Next.js, you explicitly choose the runtime per route (export const runtime = 'nodejs' | 'edge'). See runtime docs and guidance in Vercel/Next.js.
When using Next.js 15 route handlers, the same Web API surface works on both runtimes; you can explicitly export runtime in app/api/**/route.ts. Review Next.js runtime options for route handlers.
Rule of thumb for LLM Apps on Vercel:
- Use Node.js runtime for most provider SDKs, file I/O, longer requests, or libraries that require Node APIs.
- Use Edge runtime for fast, geolocated responses with Web standard APIs and tasks under 300 seconds; Edge execution duration is capped at 300 seconds as of March 1, 2025 (a per-route runtime sketch follows this list). See Vercel’s execution duration note.
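To make the choice concrete, here is a minimal sketch of per-route runtime selection (the route path and handler are illustrative):
// app/api/health/route.ts: opt this route into the Node.js runtime
export const runtime = 'nodejs' // or 'edge' for Web-API-only, latency-sensitive flows
export async function GET() {
  // NEXT_RUNTIME reports which runtime this handler is executing in
  return Response.json({ runtime: process.env.NEXT_RUNTIME })
}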
1.2 The AI SDK & AI Gateway: the LLM Backbone
Vercel’s AI SDK 5 provides typed, provider‑agnostic primitives for streaming and tool calling. It standardizes how models plug into features like streamText and exposes a provider architecture you can swap without refactoring your app. Read the AI SDK 5 overview.
To simplify keys, quotas, and analytics, Vercel’s AI Gateway sits in front of providers and lets you route requests across hundreds of models, set budgets, and observe usage—without vendor lock‑in. Review the AI Gateway docs (capabilities, provider options) and pricing details.
OpenAI and Anthropic in 2025
The industry is shifting from Chat Completions to the Responses API (OpenAI and Azure OpenAI) and the Messages API (Anthropic). Migration guides detail the streaming behavior and tool‑use patterns you’ll implement with the AI SDK. See OpenAI’s “Migrate to the Responses API” guide and Azure’s Responses API documentation for the latest: OpenAI migration guide and Azure OpenAI Responses.
For Claude, consult Anthropic’s Messages documentation and batch endpoints. Anthropic API “Messages” and batch references.
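As a rough, hedged sketch of the newer request shape (assuming the official openai npm package; the model name and prompt are illustrative), a Responses API call looks like this:
// responses-example.ts: hedged sketch using the official `openai` package
import OpenAI from 'openai'
const client = new OpenAI() // reads OPENAI_API_KEY from the environment
const response = await client.responses.create({
  model: 'gpt-4.1', // illustrative model name
  input: 'Summarize the latest deployment logs in one sentence.',
})
console.log(response.output_text) // convenience accessor for the generated text
Anthropic’s Messages API follows a similar create-and-stream pattern; in practice the AI SDK hides these provider differences behind one interface.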

2) Pre‑Flight Checklist for LLM Apps on Vercel
Before you deploy:
- Repository & Framework: Next.js 15 App Router is recommended for first‑class streaming and server actions. (Community write‑ups cover the new caching and App Router defaults.) See a Next.js 15 overview and caching updates.
- Environment Variables: Define keys locally (.env.local), manage them in Vercel, and mark sensitive credentials as Sensitive environment variables. Environment variables basics and sensitive variables.
- Provider Access: Decide direct provider access (OpenAI, Anthropic, etc.) or route through AI Gateway for per‑model routing, budgets, and analytics. AI Gateway getting started.
- Data Layer: Choose a storage integration via the Marketplace: Neon (Postgres), Redis/Upstash, Supabase, Pinecone/Vector, etc. Marketplace storage category and Neon transition guide for Vercel Postgres.
- Rate Limiting: Plan request shaping using AI Gateway budgets or a Redis‑backed limiter to protect your endpoints. AI Gateway provider routing & budgets and Upstash Ratelimit examples.
- Observability: Enable Logs and, optionally, OTel integrations (Datadog, Sentry) for traces and errors. Runtime logs, OTel quickstart, and marketplace integrations.
3) Step‑by‑Step: Deploying an LLM Chat Route (Next.js 15)
Below is a minimal, production‑flavored approach to shipping LLM Apps on Vercel with streaming.
3.1 Scaffold the app
npx create-next-app@latest my-llm-app --ts
cd my-llm-app
pnpm add ai @ai-sdk/openai
Why AI SDK? It abstracts per‑provider differences and offers first‑class streaming (streamText) and tool calling with a unified API. See the AI SDK foundations for providers & models.
3.2 Create a streaming Route Handler
app/api/chat/route.ts (Node.js runtime recommended for broad SDK compatibility):
// app/api/chat/route.ts
import { NextRequest } from 'next/server'
import { streamText, convertToModelMessages, type UIMessage } from 'ai'
import { openai } from '@ai-sdk/openai'
export const runtime = 'nodejs'
export async function POST(req: NextRequest) {
  const { messages }: { messages: UIMessage[] } = await req.json()
  const result = streamText({
    model: openai('gpt-4.1'), // or route via the AI Gateway (see the Gateway example below)
    messages: convertToModelMessages(messages), // UI messages from the client -> model messages
    temperature: 0.2,
  })
  // Streams the response to the client as tokens are generated
  return result.toUIMessageStreamResponse()
}
This pattern streams tokens as they are generated, yielding fast first‑byte times without custom SSE code. Vercel’s streaming functions docs reiterate that AI SDK reduces boilerplate for streaming.
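On the client, a minimal sketch (assuming AI SDK 5’s useChat hook from @ai-sdk/react; the component and markup are illustrative) consumes the stream from /api/chat:
// app/page.tsx: minimal chat client, assuming AI SDK 5's useChat
'use client'
import { useState } from 'react'
import { useChat } from '@ai-sdk/react'
export default function Chat() {
  const [input, setInput] = useState('')
  const { messages, sendMessage } = useChat() // posts to /api/chat by default
  return (
    <div>
      {messages.map((m) => (
        <p key={m.id}>
          {m.role}: {m.parts.map((p) => (p.type === 'text' ? p.text : '')).join('')}
        </p>
      ))}
      <form
        onSubmit={(e) => {
          e.preventDefault()
          sendMessage({ text: input }) // streams the reply back into `messages`
          setInput('')
        }}
      >
        <input value={input} onChange={(e) => setInput(e.target.value)} />
      </form>
    </div>
  )
}
Because the route returns a UI message stream, tokens show up in messages incrementally as they arrive.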
3.3 Secure keys with Vercel Env
pnpm dlx vercel link
pnpm dlx vercel env add OPENAI_API_KEY
pnpm dlx vercel env pull .env.local
Mark secrets as Sensitive in the dashboard for extra protection at rest. See sensitive variable behavior and limits.
3.4 Deploy with Git or CLI
- Git flow: connect GitHub/GitLab/Bitbucket; every push creates a Preview deployment and merging to main promotes to Production. Vercel’s Git deployments overview.
- CLI flow: pnpm dlx vercel (Preview) → pnpm dlx vercel --prod (Production). Use vercel build + vercel deploy --prebuilt to ship artifacts from your CI without sending source. CLI deploy & prebuilt workflow and CLI deploy reference.

4) Managing Secrets & Environments for LLM Apps on Vercel
LLM Apps on Vercel succeed or fail on key hygiene. Keep it simple:
- Use .env.local for local development and vercel env pull to sync from the cloud. Env management docs.
- Mark provider keys as Sensitive so their values can’t be viewed after creation while remaining available to builds and deployments. Sensitive env details.
- Prefer Server Actions or Route Handlers for LLM calls so keys never hit the client.
- For cross‑origin usage (e.g., public API), correctly set CORS headers in your route. Vercel provides a concise Next.js example for OPTIONS and GET handlers. CORS examples.
5) Selecting and Wiring Data Stores (Marketplace‑First)
Most LLM Apps on Vercel combine a transactional DB (Postgres) with a cache or vector store.
5.1 Postgres via Neon or Supabase
Vercel’s first‑party Postgres has been transitioned to Marketplace integrations. Neon offers tight integration with preview‑branch databases. See Neon’s transition guide, Vercel’s Postgres docs, and “Postgres on Vercel.”
If you prefer Supabase’s auth/storage features, install it as a Vercel Marketplace resource with unified billing. Supabase integration listing.
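As a hedged sketch (assuming Neon’s @neondatabase/serverless driver and a DATABASE_URL injected by the Marketplace integration; the route and table names are illustrative), a simple Postgres read from a Route Handler looks like:
// app/api/conversations/route.ts: illustrative route and table names
import { neon } from '@neondatabase/serverless'
export const runtime = 'nodejs'
const sql = neon(process.env.DATABASE_URL!) // provisioned by the Neon integration
export async function GET() {
  // Tagged-template queries are parameterized automatically by the driver
  const rows = await sql`SELECT id, title FROM conversations ORDER BY created_at DESC LIMIT 20`
  return Response.json(rows)
}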
5.2 Redis & Vector
Use Redis/Upstash (KV, Vector, QStash) from the Marketplace for sessions, rate limits, and RAG. Upstash Marketplace page and Vercel Redis docs.
Pinecone remains a popular choice for enterprise‑grade vector search; Vercel offers a Pinecone + AI SDK template to jumpstart RAG. Pinecone template on Vercel.
Background on the marketplace shift: Vercel Postgres and KV were sunset in favor of marketplace integrations, with migrations in late‑2024/early‑2025. Vercel Product Changes timeline.
Related reading: If you are still deciding data structures, the post on building RAG with Pinecone and the AI SDK is a helpful conceptual blueprint.
6) Performance & Cost Optimization for LLM Apps on Vercel
6.1 Streaming First
Use streamText to render partial results as soon as tokens arrive. This drops TTFB and improves UX dramatically. AI SDK streaming foundations.
6.2 Choose the right runtime and duration
- Node.js runtime supports longer durations; with Fluid compute, production functions can run for many minutes (up to the published limits on your plan) before timing out, which helps with long LLM calls or tool use (see the sketch after this list). Vercel’s guides discuss Fluid’s higher max durations and cost model: Timeout/Fluid compute guidance and reducing execution time.
- Edge runtime offers global low latency but now enforces a 300s limit. Don’t run heavy batch jobs there. Edge duration update.
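A hedged sketch of raising a single route’s ceiling with the maxDuration segment config (the route name is illustrative, and the exact maximum depends on your plan and Fluid compute settings):
// app/api/agent/route.ts: illustrative long-running route
export const runtime = 'nodejs'
export const maxDuration = 300 // seconds; must stay within your plan's published limit
export async function POST(req: Request) {
  const { task } = await req.json()
  // Long multi-step LLM calls or tool use would run here before the deadline.
  return Response.json({ accepted: true, task })
}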
6.3 Caching and Revalidation
Next.js 15 gives you explicit control: use the new caching model (use cache, revalidate, tags) to cache expensive RAG retrievals or system prompts while keeping user messages dynamic. On the platform side, Vercel’s Edge Cache respects the Cache-Control headers returned by your functions. Next.js caching guide (v15 insights) and Vercel Edge Cache docs.
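One way to memoize the expensive, stable pieces is Next.js’s unstable_cache helper; this is a hedged sketch, and fetchSystemPrompt is a hypothetical helper you would back with your own store:
// lib/cached-prompt.ts: hedged sketch; fetchSystemPrompt is hypothetical
import { unstable_cache } from 'next/cache'
async function fetchSystemPrompt(tenantId: string): Promise<string> {
  // In a real app this might read from Postgres or a CMS.
  return `You are a helpful assistant for tenant ${tenantId}.`
}
export const getSystemPrompt = unstable_cache(
  fetchSystemPrompt,
  ['system-prompt'], // cache key parts (function arguments are added automatically)
  { revalidate: 3600, tags: ['prompts'] } // refresh hourly or via revalidateTag('prompts')
)
User messages themselves stay out of the cache; only the stable prompt and retrieval pieces are memoized.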
6.4 Centralize model traffic with AI Gateway
The AI Gateway lets you route across providers, add fallbacks, and track spend, with OpenAI‑compatible endpoints to avoid client changes. Gateway docs, OpenAI‑compat reference, and OpenAI‑compat endpoints.
6.5 Concurrency and cold starts
Keep route code lean, lazy‑import heavy libraries, and consider in‑function concurrency (beta) where safe, as noted in product changes; a sketch follows below. Also review Vercel’s cold‑start guide for Node/Edge. Product changes notes and cold‑start tips.
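As one hedged example of keeping module initialization lean, a seldom-used route can defer loading its provider until the handler actually runs (the route and model id are illustrative):
// app/api/summarize/route.ts: defer loading the provider until the handler runs
import { generateText } from 'ai'
export const runtime = 'nodejs'
export async function POST(req: Request) {
  const { text } = await req.json()
  // Dynamic import defers evaluating @ai-sdk/anthropic until this code path executes
  const { anthropic } = await import('@ai-sdk/anthropic')
  const { text: summary } = await generateText({
    model: anthropic('claude-3-5-sonnet-latest'), // illustrative model id
    prompt: `Summarize in two sentences:\n\n${text}`,
  })
  return Response.json({ summary })
}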
7) Production Hardening: CORS, Rate Limits, Timeouts
7.1 CORS for public APIs
If your LLM Apps on Vercel expose public endpoints, implement OPTIONS and GET/POST with correct CORS headers (origin allowlist; credentials if needed). Vercel’s guide shows a concise Next.js example. CORS guide and route handler example.
7.2 Rate limiting
For user‑facing chat and generation endpoints, combine AI Gateway budgets with Redis‑backed IP/user rate limits (e.g., Upstash). Gateway provider options (budgets/fallbacks) and the Upstash Ratelimit library.
A minimal Next.js middleware example:
// middleware.ts
import { Ratelimit } from '@upstash/ratelimit'
import { NextResponse, type NextRequest } from 'next/server'
import { Redis } from '@upstash/redis'
const ratelimit = new Ratelimit({
  redis: Redis.fromEnv(),
  limiter: Ratelimit.slidingWindow(100, '1 m')
})
export async function middleware(req: NextRequest) {
  // Use the first hop of x-forwarded-for as the rate-limit key; fall back to a global bucket
  const ip = req.headers.get('x-forwarded-for')?.split(',')[0]?.trim() ?? 'global'
  const { success } = await ratelimit.limit(ip)
  if (!success) return new NextResponse('Too Many Requests', { status: 429 })
  return NextResponse.next()
}
export const config = { matcher: ['/api/chat'] }
7.3 Timeouts and long‑running jobs
- For operations that run longer than a few minutes, offload the work to background tasks and notify users when results are ready. In Vercel, schedule periodic tasks with Cron Jobs that hit a serverless endpoint. Cron jobs quickstart, usage, and pricing.
- Keep Edge work under the 300s limit by design; for long RAG indexing or batch evaluations, prefer Node.js runtime or external workers. Edge duration note.
8) Advanced Patterns for LLM Apps on Vercel
8.1 Retrieval‑Augmented Generation (RAG)
- Vector Store Options: Upstash Vector (serverless), Pinecone (enterprise), or pgvector via Neon/Supabase. Vercel hosts templates and docs for all paths. Upstash Vector integration overview and Pinecone + Vercel starter.
- Streaming UX: Combine retrieval with streamed LLM responses for a responsive feel even on large contexts; the SDK’s streaming APIs are designed for this, and a hedged sketch follows this list. AI SDK streaming background.
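Here is a hedged sketch of that pattern, assuming an Upstash Vector index created with a built-in embedding model (so raw text can be queried as data) and chunks stored under a metadata field named text; the route, model, and field names are illustrative:
// app/api/rag-chat/route.ts: hedged RAG sketch with Upstash Vector + AI SDK
import { Index } from '@upstash/vector'
import { streamText, convertToModelMessages, type UIMessage } from 'ai'
import { openai } from '@ai-sdk/openai'
export const runtime = 'nodejs'
const index = Index.fromEnv() // UPSTASH_VECTOR_REST_URL / UPSTASH_VECTOR_REST_TOKEN
export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json()
  const lastUserText =
    messages.at(-1)?.parts.map((p) => (p.type === 'text' ? p.text : '')).join('') ?? ''
  // Retrieve the top matching chunks for the latest user message
  const matches = await index.query({ data: lastUserText, topK: 3, includeMetadata: true })
  const context = matches.map((m) => String(m.metadata?.text ?? '')).join('\n---\n')
  const result = streamText({
    model: openai('gpt-4.1'),
    system: `Answer using this context when relevant:\n${context}`,
    messages: convertToModelMessages(messages),
  })
  return result.toUIMessageStreamResponse()
}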
If you want structured guidance for RAG stacks and agent templates, see our internal playbook on the 7 best LangChain templates for LLMs to pick pre‑vetted scaffolds that deploy cleanly to Vercel.
8.2 Multi‑model routing and fallbacks
Through AI Gateway, you can define primary and fallback models, or route by geography/cost. That keeps chat responsive during provider hiccups, useful when your LLM Apps on Vercel serve global traffic. AI Gateway provider routing options.
8.3 Prompt systems & evaluation
Prompts drift in production. Bake repeatable testing into your pipeline, measuring quality/cost/latency as you ship. Use our frameworks to craft resilient prompts and to benchmark models before switching. Explore prompt frameworks in Strongest Prompts for LLMs in 2025 and adopt a comparison workflow from Benchmark LLMs the best way in 2025.
8.4 Choosing models
For conversational UX, coding assistants, or multimodal agents, match the model to task, context length, and budget. Our guide to the Best LLMs for Chatbots in 2025 summarizes trade‑offs you’ll balance with AI Gateway routing.
9) Observability & QA in Production
- Logs: Use Vercel’s Runtime Logs to inspect function behavior per deployment. For long retention and correlation, configure Log Drains. Logs overview.
- Tracing/Monitoring: Use the Vercel OTel Collector to ship traces to Datadog, New Relic, etc.; a sketch follows this list. OTel quickstart and Datadog integration steps.
- Preview Deployments: Every PR gets an isolated URL for QA—great for prompt reviews and regression tests on LLM Apps on Vercel. Git integration overview.
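A hedged sketch of the OTel hookup (assuming the @vercel/otel package; the service name is illustrative):
// instrumentation.ts: registers the Vercel OTel collector at boot
import { registerOTel } from '@vercel/otel'
export function register() {
  // Traces from Vercel Functions are exported to whichever backend you configured
  registerOTel({ serviceName: 'llm-chat-app' })
}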

10) Example: Wiring AI Gateway (Optional but Recommended)
Why: Replace multiple provider clients with a single endpoint and dashboard budgets.
Route Handler (server‑side):
import { streamText, convertToModelMessages, type UIMessage } from 'ai'
// Use an OpenAI-compatible base URL exposed by the AI Gateway:
import { createOpenAI } from '@ai-sdk/openai'
const openai = createOpenAI({
  baseURL: process.env.AI_GATEWAY_BASE_URL, // from Vercel AI Gateway
  apiKey: process.env.AI_GATEWAY_API_KEY,
})
export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json()
  const result = streamText({
    model: openai('openai/gpt-4o'), // or 'anthropic/claude-3-5-sonnet'
    messages: convertToModelMessages(messages),
  })
  return result.toUIMessageStreamResponse()
}
The Gateway supports OpenAI‑compatible endpoints so your app code stays minimal during provider changes. OpenAI‑compatible reference.
11) Scheduling Background Work with Cron Jobs
When indexing documents for RAG or rotating embeddings, use Cron Jobs to call a route on a schedule.
vercel.json:
{
  "crons": [
    { "path": "/api/reindex", "schedule": "0 3 * * *" }
  ]
}
Vercel pings /api/reindex at the given time; it runs as a normal function so your usual limits and logging apply. Cron job behavior and pricing.
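A hedged sketch of the target route (assuming you define a CRON_SECRET environment variable, which Vercel forwards as a bearer token on cron invocations; the reindex body is a placeholder):
// app/api/reindex/route.ts: cron target; reindex logic is a placeholder
export const runtime = 'nodejs'
export async function GET(req: Request) {
  // Vercel sends `Authorization: Bearer ${CRON_SECRET}` when CRON_SECRET is set
  const auth = req.headers.get('authorization')
  if (auth !== `Bearer ${process.env.CRON_SECRET}`) {
    return new Response('Unauthorized', { status: 401 })
  }
  // ...re-embed documents, refresh vector indexes, rotate caches, etc.
  return Response.json({ ok: true, startedAt: new Date().toISOString() })
}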
12) Troubleshooting LLM Apps on Vercel
- Timeouts: For long tool‑calls or retrieval, switch to Node.js runtime and enable Fluid compute; chunk work or persist progress. Timeout guidance.
- Caching surprises: If responses look stale, review Next.js 15’s explicit caching model and the platform’s Edge Cache headers. Next.js 15 caching explainer and Vercel Edge Cache.
- Runtime mismatch: If a provider SDK fails on Edge, swap to export const runtime = 'nodejs' for that route. Runtime selection in docs.
- Storage confusion: Remember that first‑party KV/Postgres are now Marketplace integrations; use Neon/Redis/Supabase via Marketplace and import their client SDKs. Product changes page.
13) Deployment Checklist (Copy/Paste)
- Runtime: Default to Node.js; only opt into Edge for specific low‑latency tasks. Runtime docs.
- Streaming: Implement streamText for chat or generation interfaces. AI SDK streaming.
- Secrets: Add provider keys (OPENAI_API_KEY, etc.) and mark as Sensitive. Sensitive envs.
- AI Gateway (optional): Configure base URL + key, define fallbacks, watch budgets. AI Gateway docs.
- Data Layer: Install a Postgres (Neon/Supabase) and a vector store (Upstash Vector/Pinecone) from Marketplace. Marketplace storage.
- Rate Limit: Add middleware with Upstash Ratelimit and/or rely on Gateway budgets. Upstash Ratelimit.
- Caching: Decide what to cache (system prompts, embeddings) and set revalidate or use cache. Next.js caching.
- Observability: Enable Logs; wire Datadog/Sentry via the Marketplace; add OTel. Logs & OTel docs and the OTel quickstart.
- Cron Jobs: Schedule /api/reindex for RAG updates. Cron docs.
- Ship: Use Git‑based Preview deployments and promote to Production. Git deployments.
FAQs About LLM Apps on Vercel (2025)
Is AI SDK RSC production‑ready?
Vercel notes the RSC API is experimental; prefer AI SDK UI/Core for production, and consult the migration notes. AI SDK RSC overview.
What’s the status of KV/Postgres?
Vercel sunset first‑party KV/Postgres in favor of Marketplace integrations with unified billing and provisioning. Product changes & changelog entries.
Should I switch to OpenAI’s Responses API?
Yes for new builds; it simplifies tool use and streaming. See OpenAI’s migration guidance and Azure OpenAI’s Responses docs when deploying to enterprise clouds. OpenAI migration guide and Azure OpenAI Responses.
Putting It All Together
Deploying LLM Apps on Vercel in 2025 means leaning into streaming UIs, consciously choosing the Node/Edge runtime per endpoint, centralizing provider access with AI Gateway, and wiring data through the Vercel Marketplace. From secrets to cron jobs and logs, you get a cohesive surface area—so your team spends time on product, not plumbing.
If you’re starting fresh, clone a minimal chat template, add streamText, route via AI Gateway, and pick Neon + Upstash Vector for storage. Then layer rate limiting, CORS, caching, and scheduled indexing. For teams evaluating models and prompts, leverage our guides on LangChain templates, prompt systems, LLM benchmarking, and model selection to ship confidently.
Copy‑Ready Code Snippets
Route Handler (Node.js runtime, streaming):
// app/api/chat/route.ts
import { streamText, convertToModelMessages, type UIMessage } from 'ai'
import { openai } from '@ai-sdk/openai'
export const runtime = 'nodejs'
export async function POST(req: Request) {
  const { messages }: { messages: UIMessage[] } = await req.json()
  const result = streamText({
    model: openai('o4-mini'),
    messages: convertToModelMessages(messages),
  })
  return result.toUIMessageStreamResponse()
}
Cron Job to refresh embeddings nightly:
{
  "crons": [
    { "path": "/api/reindex", "schedule": "0 2 * * *" }
  ]
}
Minimal CORS helper (App Router):
const ORIGIN = process.env.NODE_ENV === 'production' ? 'https://yourapp.com' : 'http://localhost:3000'
export async function OPTIONS() {
  return new Response(null, {
    status: 204,
    headers: {
      // Note: Access-Control-Allow-Origin must be an explicit origin (not '*')
      // whenever Access-Control-Allow-Credentials is 'true'.
      'Access-Control-Allow-Origin': ORIGIN,
      'Access-Control-Allow-Methods': 'GET,POST,OPTIONS',
      'Access-Control-Allow-Headers': 'Content-Type, Authorization',
      'Access-Control-Allow-Credentials': 'true'
    }
  })
}
For more code‑level streaming and error‑handling examples, check Vercel’s streaming functions and AI SDK docs. Streaming functions guide and AI SDK streaming reference.