Building AI Search on Heroku

Last Updated: January 29, 2026

If you’ve built a RAG (Retrieval Augmented Generation) system, you’ve probably hit this wall: your vector search returns 20 documents that are semantically similar to the query, but half of them don’t actually answer it.

A user asks “how do I handle authentication errors?” and gets back documentation about authentication, errors, and error handling in embedding space, but only one or two are actually useful.

This is the gap between demo and production. Most tutorials stop at vector search. This reference architecture shows what comes next. This AI Search reference app shows you how to build a production grade enterprise AI search using Heroku Managed Inference and Agents.

Why two-stage retrieval

Vector embeddings are coordinates in high dimensional space. Documents close together share semantic meaning. Semantic proximity is a false proxy for accuracy; a document can be ‘close’ in vector space but fail to provide a factual answer. You need a second stage where a reranking model scores each one against the actual query. It asks: “Does this document answer this question?” rather than “Is this document about similar things?”

The difference in result quality is significant. This reference implementation shows how to wire it up and how to make it optional when latency matters more than precision.

Architecture overview

The system consists of two primary pipelines:

Indexing pipeline: URL → Crawl → Chunk → Embed → pgvector
Query pipeline: Question → Embed → Vector Search → Rerank → Claude → Answer

Heroku services

Component	Service	Role
Embeddings	Heroku Managed Inference and Agents	Converts text to vectors (Cohere Embed Multilingual)
Reranking	Heroku Managed Inference and Agents	Scores query document relevance (Cohere Rerank 3.5)
Generation	Heroku Managed Inference and Agents	Produces answers from context (Claude 3.5 Sonnet)
Storage	Heroku Postgres + pgvector	Stores vectors, runs similarity queries

Streamlining the stack: Unified provider setup

const heroku = createHerokuAI({  
  chatApiKey: process.env.HEROKU_INFERENCE_TEAL_KEY,  
  embeddingsApiKey: process.env.HEROKU_INFERENCE_GRAY_KEY,  
  rerankingApiKey: process.env.HEROKU_INFERENCE_BLUE_KEY,  
});

Indexing: Getting documents in

Crawling real-world sites

Documentation sites are messy. Naive scraping often extracts more navigation links than actual content. The crawler uses a simple heuristic to detect junk:

function isLikelyJunkContent(content: string, htmlLength: number): boolean {  
  // If HTML is huge but text is tiny, it's likely boilerplate  
  if (htmlLength > 100000 && content.length < htmlLength * 0.05) return true; const navPatterns = ['sign in', 'login', 'menu', 'pricing']; // If the start of the doc is stuffed with nav links return navPatterns.filter(p => content.slice(0, 500).includes(p)).length >= 4;  
}

Chunking at natural boundaries

Documents are split into 1000 character chunks with 200 characters of overlap. To avoid losing meaning, we prioritize splitting at paragraph or sentence breaks.

Batch storage with pgvector

The unnest function allows inserting hundreds of chunks in a single SQL query:

  INSERT INTO chunks (pipeline_id, url, title, content, embedding)  
  SELECT   
    ${pipelineId}::uuid,   
    unnest(${urls}::text[]),  
    unnest(${titles}::text[]),   
    unnest(${contents}::text[]),  
    unnest(${embeddings}::vector[])

Retrieval: The two-stage pattern

Stage 1: Vector search

Retrieve the top 20 chunks by cosine similarity. This is fast and scales well in Postgres.

  SELECT content, 1 - (embedding <=> ${vector}::vector) as similarity
  FROM chunks
  ORDER BY embedding <=> ${vector}::vector
  LIMIT 20

Stage 2: Semantic reranking

Pass those 20 candidates to a reranking model. Rerankers use a cross encoder architecture, processing the query and document together to score relevance accurately.

const reranked = await rerankModel.doRerank({  
  query,  
  documents: { type: "text", values: chunks.map(c => c.content) },  
  topN: 5  
});

Streaming: Responsive UX

RAG involves multiple steps (Embed → Search → Rerank → Generate). To prevent a “blank screen,” use Server Sent Events (SSE) to stream progress:

send("step", { step: "embedding" });
send("step", { step: "searching" });
send("step", { step: "reranking" });
send("sources", { sources }); // Show sources while LLM generates

for await (const chunk of streamRAGResponse(message, context)) {
  send("text", { content: chunk });
}

Conclusion: Deploy the reference app

Moving from a demo to a production grade RAG system requires bridging the gap between similarity and relevance. This architecture provides a battle tested foundation using Heroku Managed Inference and Agents and pgvector to ensure your search results actually answer user questions. Deploy the reference chatbot today to see how two-stage retrieval transforms your documentation search.

Originally Published: January 29, 2026
AI Heroku AI Managed Inference and Agents

Ready to Get Started?

Stay focused on building great data-driven applications and let Heroku tackle the rest.

Talk to A Heroku Rep Sign Up Now

More from the Author

Anush DSouza

Senior Product Manager at Heroku

Heroku Staff

Browse the archives for Engineering or all blogs. Subscribe to the RSS feed for Engineering or all blogs.

Heroku March 2026 Update

Heroku March 2026 Update

TAO Drive: Breaking E-Commerce Barriers with a Secure Salesforce Integration via Heroku AppLink

Heroku March 2026 Update

Building AI Search on Heroku

Why two-stage retrieval

Architecture overview

Heroku services

Streamlining the stack: Unified provider setup

Indexing: Getting documents in

Crawling real-world sites

Chunking at natural boundaries

Batch storage with pgvector

Retrieval: The two-stage pattern

Stage 1: Vector search

Stage 2: Semantic reranking

Streaming: Responsive UX

Conclusion: Deploy the reference app

Ready to Get Started?

More from the Author

Heroku March 2026 Update

Heroku March 2026 Update

TAO Drive: Breaking E-Commerce Barriers with a Secure Salesforce Integration via Heroku AppLink

Heroku March 2026 Update

Building AI Search on Heroku

Why two-stage retrieval

Architecture overview

Heroku services

Streamlining the stack: Unified provider setup

Indexing: Getting documents in

Crawling real-world sites

Chunking at natural boundaries

Batch storage with pgvector

Retrieval: The two-stage pattern

Stage 1: Vector search

Stage 2: Semantic reranking

Streaming: Responsive UX

Conclusion: Deploy the reference app

Related Posts

Ready to Get Started?

More from the Author