There is a consistent pattern in how AI integration projects fail: the demo works, the team is excited, the integration gets built, and then three things happen in the first 60 days that nobody planned for. Latency surprises the UX team. Token costs surprise the finance team. And the model returns something unexpected at 2 a.m. and nobody knows how to handle it.

These are not edge cases. They are predictable. The companies that avoid them are not smarter — they have seen the pattern before and built for it.

The demo-to-production gap

AI demos are built to show the best case. Production is where the worst case lives. Three gaps almost always appear in the transition.

Latency

A 3-second LLM response is invisible in a Jupyter notebook. In a product, it breaks user experience patterns that were designed for sub-500ms responses. Streaming responses partially solve this — users see progress — but they require a different architecture than batch calls.
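
As a concrete illustration, here is a minimal streaming sketch using the OpenAI Python SDK (one provider among several that expose a stream flag); the model name and prompt are placeholders, not recommendations:

```python
# Minimal streaming sketch using the OpenAI Python SDK as one example;
# most provider SDKs expose an equivalent stream flag. The model name
# and prompt are placeholders.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

stream = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder model
    messages=[{"role": "user", "content": "Summarize our refund policy."}],
    stream=True,
)

# Tokens arrive incrementally; forward each chunk to the client
# (e.g., over SSE or a WebSocket) instead of buffering the full reply.
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```

The architectural cost is that every layer between the model and the user now has to pass chunks through instead of waiting for a complete response.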

Cost at scale

At 100 test calls per day, token costs are invisible. At 10,000 production calls per day with a 16K context window, the same prompt pattern can run to five figures a month. Most teams do not model costs until after launch.
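
The arithmetic is worth doing before launch, not after. A back-of-the-envelope sketch, with per-token prices stated as assumptions rather than quoted rates:

```python
# Back-of-the-envelope cost model. The prices below are placeholder
# assumptions for illustration only; substitute your provider's rates.
CALLS_PER_DAY = 10_000
INPUT_TOKENS_PER_CALL = 16_000   # full 16K context window
OUTPUT_TOKENS_PER_CALL = 500     # assumed average response length

PRICE_PER_M_INPUT = 3.00    # USD per million input tokens (assumption)
PRICE_PER_M_OUTPUT = 15.00  # USD per million output tokens (assumption)

daily = (
    CALLS_PER_DAY * INPUT_TOKENS_PER_CALL / 1e6 * PRICE_PER_M_INPUT
    + CALLS_PER_DAY * OUTPUT_TOKENS_PER_CALL / 1e6 * PRICE_PER_M_OUTPUT
)
print(f"~${daily:,.0f}/day, ~${daily * 30:,.0f}/month")
# With these assumptions: ~$555/day, ~$16,650/month
```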

Error handling

LLMs return unexpected formats, hallucinate field names, and occasionally refuse to answer. Production code needs to handle every failure mode. Demos do not.
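
A minimal defensive pattern, assuming a hypothetical call_model() wrapper around whatever SDK is in use: parse, validate, retry, and surface a typed error instead of crashing:

```python
# Defensive handling sketch. call_model is a hypothetical zero-argument
# wrapper around whatever SDK is in use; the failure modes are real.
import json
from typing import Callable

REQUIRED_FIELDS = {"summary", "sentiment"}  # example schema

def robust_call(call_model: Callable[[], str], max_attempts: int = 3) -> dict:
    last_error: Exception | None = None
    for _ in range(max_attempts):
        raw = call_model()
        try:
            data = json.loads(raw)
        except json.JSONDecodeError as e:
            last_error = e  # prose, markdown fences, or truncated JSON
            continue
        if not isinstance(data, dict):
            last_error = ValueError("expected a JSON object")
            continue
        missing = REQUIRED_FIELDS - data.keys()
        if missing:
            last_error = ValueError(f"missing fields: {missing}")
            continue  # hallucinated or renamed fields count as failures
        return data
    # Every attempt failed: raise a typed error the caller can route to
    # a fallback path instead of discovering it at 2 a.m.
    raise RuntimeError("model output invalid after retries") from last_error
```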

RAG: when to use it and when not to

Retrieval Augmented Generation (RAG) is the most common pattern for giving models access to your data. You retrieve relevant chunks from a knowledge base and include them in the prompt context. It works well in specific conditions and poorly in others.

  • Works well: customer support over a documented knowledge base, internal Q&A over company docs, document analysis where the relevant section is retrievable
  • Works poorly: real-time data requirements (RAG retrieves from a snapshot, not a live feed), very large knowledge bases without careful chunking and metadata, cases where the source data is unstructured or poorly formatted
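
To make the mechanics concrete, here is a minimal retrieval-then-prompt sketch; embed(), vector_store.search(), and llm() are hypothetical stand-ins for an embedding model, a vector database, and a chat completion call:

```python
# Minimal RAG sketch. embed(), vector_store.search(), and llm() are
# hypothetical stand-ins for your embedding model, vector database,
# and model call.
def answer(question: str, k: int = 5) -> str:
    # 1. Retrieve the k chunks most similar to the question.
    chunks = vector_store.search(embed(question), top_k=k)

    # 2. Build a prompt that pins the model to the retrieved context.
    context = "\n\n".join(c.text for c in chunks)
    prompt = (
        "Answer using only the context below. If the answer is not "
        f"in the context, say so.\n\nContext:\n{context}\n\n"
        f"Question: {question}"
    )

    # 3. One model call; the retrieval step, not the model, decides
    #    what the model gets to see.
    return llm(prompt)
```

Note that everything the model can say is bounded by what step 1 returns, which is why chunking and source data quality dominate RAG outcomes.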

The most common RAG failure is not in the retrieval — it is in the source data. Most companies have the data, but in the wrong format: PDFs with inconsistent layouts, spreadsheets with merged cells, database records with no standardized field names. Data preparation is typically 30 to 40 percent of the integration project.
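
A small, invented example of what that preparation looks like in practice: mapping inconsistent field names onto a canonical schema before chunking and indexing. The field names and synonym map below are illustrative; real source data needs its own map.

```python
# Illustrative normalization pass. The synonym map is invented for
# this example; build one from an audit of your actual sources.
FIELD_SYNONYMS = {
    "cust_name": "customer_name",
    "CustomerName": "customer_name",
    "refund_amt": "refund_amount",
    "Refund ($)": "refund_amount",
}

def normalize(record: dict) -> dict:
    clean = {}
    for key, value in record.items():
        canonical = FIELD_SYNONYMS.get(key, key.strip().lower())
        if isinstance(value, str):
            value = " ".join(value.split())  # collapse stray whitespace
        clean[canonical] = value
    return clean
```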

Integration patterns that hold up in production

  • Async processing for long-running tasks — do not make users wait for a 10-second model call in a synchronous flow. Queue the job, return a job ID, deliver the result when it is ready (a minimal queue sketch follows this list).
  • Streaming responses for interactive features — if the user is watching the output generate, stream it. It reduces perceived latency and works better with models that produce long outputs.
  • Structured output instead of free text — use JSON mode or tool use to force the model into a defined output schema. Parsing a predictable JSON object is reliable; parsing free text in production is not (a JSON-mode sketch follows this list).
  • Human-in-the-loop for high-stakes decisions — any integration where the model output has real-world consequences (an email sent, a record updated, a decision made) should have a confirmation step before execution (sketched at the end of this section).
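
A minimal shape for the async pattern, sketched with Python's standard library; a production system would use Celery, SQS, or similar, and call_model is a hypothetical wrapper for the slow model call:

```python
# Minimal async job sketch using the standard library; a production
# system would use Celery, SQS, or similar. call_model is hypothetical.
import queue
import threading
import uuid

jobs: dict[str, dict] = {}  # job_id -> {"status": ..., "result": ...}
work: queue.Queue = queue.Queue()

def submit(prompt: str) -> str:
    """Request handler: enqueue the job and return a job ID immediately."""
    job_id = str(uuid.uuid4())
    jobs[job_id] = {"status": "pending", "result": None}
    work.put((job_id, prompt))
    return job_id  # the client polls for status with this ID

def worker() -> None:
    while True:
        job_id, prompt = work.get()
        result = call_model(prompt)  # the slow 10-second model call
        jobs[job_id] = {"status": "done", "result": result}

threading.Thread(target=worker, daemon=True).start()
```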
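
For structured output, a sketch using the OpenAI SDK's JSON mode as one concrete example; tool use or provider-specific response schemas achieve the same thing, and the schema and model name here are illustrative:

```python
# Structured output sketch using the OpenAI SDK's JSON mode as one
# example. JSON mode requires the prompt to mention JSON explicitly.
import json
from openai import OpenAI

client = OpenAI()

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder
    response_format={"type": "json_object"},
    messages=[{
        "role": "user",
        "content": (
            "Classify this support ticket. Reply with JSON containing "
            'exactly two keys: "category" (string) and "urgent" (bool).\n\n'
            "Ticket: My invoice is wrong and I am being charged twice."
        ),
    }],
)

data = json.loads(resp.choices[0].message.content)  # predictable to parse
assert {"category", "urgent"} <= data.keys()        # still validate
```

JSON mode constrains the shape, not the content, so the validation step stays even when parsing is reliable.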
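
And a minimal confirmation-step sketch for the human-in-the-loop pattern; every name here is illustrative, and execute() stands in for whatever side effect the model proposed:

```python
# Human-in-the-loop sketch: the model proposes, a person approves,
# and only then does the side effect run. All names are illustrative.
import uuid

pending: dict[str, dict] = {}

def propose_action(action: dict) -> str:
    """Store the model's proposed action instead of executing it."""
    action_id = str(uuid.uuid4())
    pending[action_id] = action
    return action_id  # surfaced to a human reviewer in the UI

def confirm(action_id: str, approved: bool) -> None:
    action = pending.pop(action_id)
    if approved:
        execute(action)  # hypothetical: send the email, update the record
```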