Building AI Search on Heroku

If you’ve built a RAG (Retrieval Augmented Generation) system, you’ve probably hit this wall: your vector search returns 20 documents that are semantically similar to the query, but half of them don’t actually answer it.

A user asks “how do I handle authentication errors?” and gets back documentation that covers authentication, errors, and error handling in embedding space, yet only one or two results actually answer the question.

This is the gap between demo and production. Most tutorials stop at vector search; this reference architecture shows what comes next. The AI Search reference app demonstrates how to build a production-grade enterprise AI search using Heroku Managed Inference and Agents.

[Screenshot: dashboard showing indexing settings, data sources, and a chatbot widget for Heroku AI Search.]

Why two-stage retrieval

Vector embeddings are coordinates in a high-dimensional space: documents that sit close together share semantic meaning. But semantic proximity is an imperfect proxy for relevance; a document can be “close” in vector space yet fail to answer the question. You need a second stage, where a reranking model scores each candidate against the actual query. It asks “Does this document answer this question?” rather than “Is this document about similar things?”

The difference in result quality is significant. This reference implementation shows how to wire it up and how to make it optional when latency matters more than precision.

Architecture overview

The system consists of two primary pipelines:

  1. Indexing pipeline: URL → Crawl → Chunk → Embed → pgvector
  2. Query pipeline: Question → Embed → Vector Search → Rerank → Claude → Answer
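The query pipeline chains four stages. The sketch below shows that data flow with stubbed stand-ins for every external call (embedding, pgvector search, reranking, Claude); all function names and bodies here are illustrative, not the app's actual code or the service APIs.

```typescript
type Chunk = { content: string; similarity: number };

async function embed(text: string): Promise<number[]> {
  // Stand-in: the real app calls the Heroku embeddings endpoint.
  return Array.from(text).map((c) => c.charCodeAt(0) / 255);
}

async function vectorSearch(vector: number[], limit: number): Promise<Chunk[]> {
  // Stand-in: the real app runs the pgvector similarity query.
  const rows = [
    { content: "doc a", similarity: 0.9 },
    { content: "doc b", similarity: 0.8 },
  ];
  return rows.slice(0, limit);
}

async function rerank(query: string, chunks: Chunk[], topN: number): Promise<Chunk[]> {
  // Stand-in: the real app calls the reranking model.
  return chunks.slice(0, topN);
}

async function generate(question: string, context: Chunk[]): Promise<string> {
  // Stand-in: the real app streams an answer from Claude.
  return `Answer to "${question}" from ${context.length} chunks`;
}

async function answer(question: string): Promise<string> {
  const vector = await embed(question);              // Question → Embed
  const candidates = await vectorSearch(vector, 20); // → Vector Search
  const top = await rerank(question, candidates, 5); // → Rerank
  return generate(question, top);                    // → Claude → Answer
}
```

Each stage is independently swappable, which is what makes the rerank step easy to toggle off when latency matters more than precision.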

Heroku services

| Component | Service | Role |
| --- | --- | --- |
| Embeddings | Heroku Managed Inference and Agents | Converts text to vectors (Cohere Embed Multilingual) |
| Reranking | Heroku Managed Inference and Agents | Scores query–document relevance (Cohere Rerank 3.5) |
| Generation | Heroku Managed Inference and Agents | Produces answers from context (Claude 3.5 Sonnet) |
| Storage | Heroku Postgres + pgvector | Stores vectors, runs similarity queries |

Streamlining the stack: Unified provider setup

A single provider object wires up all three model endpoints, each with its own inference key:

const heroku = createHerokuAI({  
  chatApiKey: process.env.HEROKU_INFERENCE_TEAL_KEY,  
  embeddingsApiKey: process.env.HEROKU_INFERENCE_GRAY_KEY,  
  rerankingApiKey: process.env.HEROKU_INFERENCE_BLUE_KEY,  
});

Indexing: Getting documents in

Crawling real-world sites

Documentation sites are messy. Naive scraping often extracts more navigation links than actual content. The crawler uses a simple heuristic to detect junk:

function isLikelyJunkContent(content: string, htmlLength: number): boolean {  
  // If HTML is huge but text is tiny, it's likely boilerplate  
  if (htmlLength > 100000 && content.length < htmlLength * 0.05) return true;  
  // If the start of the doc is stuffed with nav links  
  const navPatterns = ['sign in', 'login', 'menu', 'pricing'];  
  return navPatterns.filter(p => content.slice(0, 500).includes(p)).length >= 4;  
}

[Screenshot: dashboard with pipeline selection, indexing settings, chatbot widget, data sources, indexed sources, and API reference.]

Chunking at natural boundaries

Documents are split into 1000-character chunks with 200 characters of overlap. To avoid cutting ideas mid-thought, splits prefer paragraph or sentence boundaries.
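A boundary-aware chunker along those lines might look like this; the helper name and exact splitting heuristics are illustrative, not the app's actual implementation.

```typescript
function chunkText(text: string, size = 1000, overlap = 200): string[] {
  const chunks: string[] = [];
  let start = 0;
  while (start < text.length) {
    let end = Math.min(start + size, text.length);
    if (end < text.length) {
      // Prefer a paragraph break, then a sentence break, inside the window.
      const window = text.slice(start, end);
      const para = window.lastIndexOf("\n\n");
      const sentence = window.lastIndexOf(". ");
      if (para > size / 2) end = start + para;
      else if (sentence > size / 2) end = start + sentence + 1;
    }
    chunks.push(text.slice(start, end).trim());
    if (end === text.length) break;
    start = end - overlap; // overlap preserves context across chunk boundaries
  }
  return chunks;
}
```

The `size / 2` guard ensures a break point is only used when it still leaves a reasonably sized chunk, so pathological inputs fall back to a hard split.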

Batch storage with pgvector

The unnest function allows inserting hundreds of chunks in a single SQL query:

  INSERT INTO chunks (pipeline_id, url, title, content, embedding)  
  SELECT   
    ${pipelineId}::uuid,   
    unnest(${urls}::text[]),  
    unnest(${titles}::text[]),   
    unnest(${contents}::text[]),  
    unnest(${embeddings}::vector[])
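pgvector accepts vector values as string literals such as '[0.1,0.2]'. A small hypothetical helper (the name is ours, not the app's) for serializing embeddings before binding them as the ::vector[] parameter above:

```typescript
// Serialize one embedding into pgvector's text input format.
function toVectorLiteral(embedding: number[]): string {
  return `[${embedding.join(",")}]`;
}

// Map a batch of embeddings before passing them to the SQL query.
const embeddings = [[0.12, -0.5], [0.03, 0.99]].map(toVectorLiteral);
// → ["[0.12,-0.5]", "[0.03,0.99]"]
```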

Retrieval: The two-stage pattern

Stage 1: Vector search

Retrieve the top 20 chunks by cosine similarity. This is fast and scales well in Postgres.

  SELECT content, 1 - (embedding <=> ${vector}::vector) as similarity
  FROM chunks
  ORDER BY embedding <=> ${vector}::vector
  LIMIT 20
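The `<=>` operator is pgvector's cosine-distance operator, which is why similarity is computed as 1 minus the distance. A reference implementation of the same formula, purely for intuition (Postgres does this work at scale):

```typescript
// Cosine distance: 1 - (a · b) / (|a| |b|), mirroring pgvector's <=>.
function cosineDistance(a: number[], b: number[]): number {
  let dot = 0, normA = 0, normB = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    normA += a[i] * a[i];
    normB += b[i] * b[i];
  }
  return 1 - dot / (Math.sqrt(normA) * Math.sqrt(normB));
}

cosineDistance([1, 0], [1, 0]); // → 0 (same direction: maximally similar)
cosineDistance([1, 0], [0, 1]); // → 1 (orthogonal: unrelated)
```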

Stage 2: Semantic reranking

Pass those 20 candidates to a reranking model. Rerankers use a cross-encoder architecture, processing the query and document together, which scores relevance far more accurately than comparing independently computed embeddings.

const reranked = await rerankModel.doRerank({  
  query,  
  documents: { type: "text", values: chunks.map(c => c.content) },  
  topN: 5  
});
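Reranking APIs generally return (index, score) pairs rather than the documents themselves. A sketch of mapping those results back onto the original chunks; the `{ index, relevanceScore }` shape is an assumption for illustration, not the SDK's documented response type.

```typescript
type RerankResult = { index: number; relevanceScore: number };

// Sort by score and resolve each result index back to its source chunk.
function selectTopChunks<T>(chunks: T[], results: RerankResult[], topN: number): T[] {
  return [...results]
    .sort((a, b) => b.relevanceScore - a.relevanceScore)
    .slice(0, topN)
    .map((r) => chunks[r.index]);
}
```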

[Screenshot: chat interface answering a question about Heroku AI, with sources and related content sections visible.]

Streaming: Responsive UX

RAG involves multiple steps (Embed → Search → Rerank → Generate). To avoid a “blank screen” while they run, use Server-Sent Events (SSE) to stream progress:

send("step", { step: "embedding" });
send("step", { step: "searching" });
send("step", { step: "reranking" });
send("sources", { sources }); // Show sources while LLM generates

for await (const chunk of streamRAGResponse(message, context)) {
  send("text", { content: chunk });
}
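The `send` calls above assume a helper that writes Server-Sent Events frames. A minimal framework-agnostic sketch of that formatting (the helper name is ours); each frame is an event name plus a JSON data line, terminated by a blank line per the SSE wire format:

```typescript
// Format one SSE frame for an event name and a JSON-serializable payload.
function formatSSE(event: string, data: unknown): string {
  return `event: ${event}\ndata: ${JSON.stringify(data)}\n\n`;
}

// In a Node handler, send(...) would be res.write(formatSSE(...)).
formatSSE("step", { step: "embedding" });
// → 'event: step\ndata: {"step":"embedding"}\n\n'
```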

Conclusion: Deploy the reference app

Moving from a demo to a production-grade RAG system requires bridging the gap between similarity and relevance. This architecture provides a battle-tested foundation using Heroku Managed Inference and Agents and pgvector to ensure your search results actually answer user questions. Deploy the reference chatbot today to see how two-stage retrieval transforms your documentation search.
