Tool recommendations are more useful when they come with context. "We use X" means very little without knowing the team size, the type of work, the cost constraints, and the specific problems each tool was chosen to solve. So before the stack: we're a team of engineers building AI integrations and custom software for early-stage and growth-stage companies in LATAM and the US. Most of our work involves integrating AI into existing products rather than building AI-native products from scratch.
With that context, here's what our stack looks like now and why.
Claude is our primary model for production integrations. The large context window handles the long-document processing that comes up regularly in legal, financial, and operations use cases. The instruction-following reliability reduces the prompt engineering overhead in structured output applications. We use the Anthropic API directly — no wrapper — which gives us full control over retry logic, caching, and error handling.
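As a rough illustration of what owning the retry logic can look like, here is a minimal sketch of exponential backoff with jitter. `TransientError` and `call_with_retry` are hypothetical names invented for this example, not part of any SDK, and the attempt count and delays are placeholder values.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class TransientError(Exception):
    """Hypothetical stand-in for retryable failures (rate limits, timeouts, 5xx)."""


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 4,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on TransientError with exponential backoff and jitter."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    for attempt in range(1, max_attempts + 1):
        try:
            return fn()
        except TransientError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the error to the caller
            # Delay doubles each attempt (base, 2*base, 4*base, ...) plus jitter.
            delay = base_delay * 2 ** (attempt - 1)
            sleep(delay + random.uniform(0, delay / 2))
    raise AssertionError("unreachable")
```

In real code, `fn` would wrap the actual API call, and the `except` clause would catch whichever exception types your client raises for rate limits and timeouts. Non-retryable errors (bad requests, auth failures) pass straight through.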
For applications where cost is the dominant constraint and the task is well-defined enough to run on a smaller model, we evaluate GPT-4o-mini and Claude Haiku against the specific use case before committing. Frontier models are not always the right answer.
We've moved away from LangChain for most new projects. The abstraction layer adds complexity that slows debugging and ties us to the library's upgrade cycle. For simple chains and RAG pipelines, we write the orchestration logic directly. For multi-agent workflows, we're currently evaluating Anthropic's Claude Agent SDK alongside custom implementations.
The principle: use the simplest orchestration approach that works. Every abstraction layer adds a failure surface. If a direct API call solves the problem, use a direct API call.
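To make "write the orchestration logic directly" concrete, here is a minimal sketch of a RAG flow as plain function calls. The names (`Chunk`, `build_prompt`, `answer`) and the prompt wording are assumptions for illustration; `retrieve` and `complete` are injected so the sketch does not depend on any particular vector store or model client.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass
class Chunk:
    source: str
    text: str


def build_prompt(question: str, chunks: Sequence[Chunk]) -> str:
    """Assemble retrieved chunks and the question into a single prompt string."""
    context = "\n\n".join(
        f"[{i + 1}] ({c.source}) {c.text}" for i, c in enumerate(chunks)
    )
    return (
        "Answer using only the context below. "
        "If the context is insufficient, say so.\n\n"
        f"Context:\n{context}\n\nQuestion: {question}"
    )


def answer(
    question: str,
    retrieve: Callable[[str, int], Sequence[Chunk]],
    complete: Callable[[str], str],
    top_k: int = 4,
) -> str:
    """Retrieve, build the prompt, call the model. Each step is a plain call."""
    chunks = retrieve(question, top_k)
    if not chunks:
        # Skip the model call entirely when retrieval returns nothing.
        return "No relevant context found."
    return complete(build_prompt(question, chunks))
```

Because every step is an ordinary function, a failure shows up as a normal stack trace at the step that broke, and each piece can be tested with stubs.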
Postgres with pgvector is our default for RAG implementations where the data volume fits within a single database instance. It reduces operational complexity significantly compared to running a dedicated vector database. For larger scale or multi-tenant architectures, we use Pinecone.
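For a sense of how little is involved on the pgvector side, here is a sketch of a schema and a similarity query. The table name `chunks`, its columns, and the helper functions are assumptions made up for this example; the 1536 dimension matches the default output size of text-embedding-3-small, and `<=>` is pgvector's cosine-distance operator.

```python
from typing import Sequence

# Assumed schema for illustration: one row per document chunk.
SCHEMA_SQL = """
CREATE EXTENSION IF NOT EXISTS vector;
CREATE TABLE IF NOT EXISTS chunks (
    id        bigserial PRIMARY KEY,
    content   text NOT NULL,
    embedding vector(1536) NOT NULL
);
"""

# `<=>` is cosine distance in pgvector: smaller means more similar.
SEARCH_SQL = """
SELECT id, content
FROM chunks
ORDER BY embedding <=> %s::vector
LIMIT %s;
"""


def to_vector_literal(embedding: Sequence[float]) -> str:
    """Format floats as the text form pgvector accepts, e.g. '[0.1,0.2]'."""
    return "[" + ",".join(repr(float(x)) for x in embedding) + "]"


def search_params(query_embedding: Sequence[float], top_k: int = 5) -> tuple:
    """Parameters for SEARCH_SQL, in placeholder order."""
    return (to_vector_literal(query_embedding), top_k)
```

With a driver that uses `%s` placeholders (psycopg, for example), the query would run as `cursor.execute(SEARCH_SQL, search_params(query_embedding))`. Index choice and tuning are left out of this sketch.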
Embedding model: OpenAI's text-embedding-3-small covers most use cases at a cost that makes sense for production volume. For Spanish-language content, we test multilingual models explicitly, because retrieval quality can degrade on Spanish text when the embedding model was trained primarily on English corpora.
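One way to test embedding models explicitly is to measure retrieval recall on a small labeled set of Spanish queries. The sketch below is an assumption about how such a check could be structured, not a description of a specific harness: `recall_at_k` and the `Embedder` type are hypothetical names, and the embedding function is injected so any candidate model can be plugged in.

```python
import math
from typing import Callable, Dict, List, Sequence, Tuple

Embedder = Callable[[str], Sequence[float]]


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def recall_at_k(
    embed: Embedder,
    docs: Dict[str, str],
    labeled_queries: List[Tuple[str, str]],
    k: int = 3,
) -> float:
    """Fraction of queries whose expected doc id appears in the top-k results."""
    if not labeled_queries:
        raise ValueError("labeled_queries must not be empty")
    doc_vecs = {doc_id: embed(text) for doc_id, text in docs.items()}
    hits = 0
    for query, expected_id in labeled_queries:
        q = embed(query)
        ranked = sorted(
            doc_vecs, key=lambda d: cosine(q, doc_vecs[d]), reverse=True
        )
        hits += expected_id in ranked[:k]
    return hits / len(labeled_queries)
```

Running the same `docs` and `labeled_queries` through each candidate's `embed` function gives a like-for-like number per model on the content that actually matters for the project.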
Evaluation is the area where most teams underinvest. We use a combination of LLM-as-judge (Claude evaluating outputs against a rubric) and human review of random samples. For production systems, we track accuracy, latency, and human review rates as operational metrics alongside standard infrastructure metrics.
Every AI integration we ship has an evaluation suite before it goes live. That's non-negotiable. The evaluation suite is what lets us upgrade models, change prompts, and respond to production feedback without flying blind.
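The shape of such a suite can be sketched in a few lines: score each case with a judge, pull a random sample for human review, and gate the release on the pass rate. Everything here is illustrative. The names (`Case`, `run_eval`, `release_gate`) are hypothetical, the thresholds are placeholder values, and the judge is an injected function standing in for an LLM call that scores an output against a rubric.

```python
import random
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Case:
    input: str
    output: str  # what the system under test produced for this input


# judge(rubric, case) returns a score in [0, 1]. In practice this would be
# an LLM call; here it is injected so the harness stays testable.
Judge = Callable[[str, Case], float]


@dataclass
class EvalReport:
    pass_rate: float
    failures: List[Case]
    human_review_sample: List[Case]


def run_eval(
    cases: List[Case],
    judge: Judge,
    rubric: str,
    pass_score: float = 0.7,   # placeholder threshold
    sample_size: int = 2,      # placeholder sample size
    seed: int = 0,
) -> EvalReport:
    """Score every case with the judge and draw a random human-review sample."""
    if not cases:
        raise ValueError("cases must not be empty")
    failures = [c for c in cases if judge(rubric, c) < pass_score]
    sample = random.Random(seed).sample(cases, min(sample_size, len(cases)))
    return EvalReport(
        pass_rate=1 - len(failures) / len(cases),
        failures=failures,
        human_review_sample=sample,
    )


def release_gate(report: EvalReport, min_pass_rate: float = 0.9) -> bool:
    """True if the pass rate clears the (placeholder) release threshold."""
    return report.pass_rate >= min_pass_rate
```

Re-running the same cases after a model upgrade or prompt change, and comparing reports, is what turns the suite into a regression check.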
LangSmith for tracing LLM calls in development. In production, we log all inputs, outputs, and latency to our standard logging infrastructure and build dashboards on top. The goal is to be able to answer: what did the model receive, what did it return, how long did it take, and what happened downstream.
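A minimal sketch of that kind of logging, assuming a thin wrapper around the model call: one JSON record per call with the prompt, the output or error, and the latency. `logged_call` and the `request_id` metadata field are hypothetical names for this example; the idea is that an identifier passed in by the caller lets downstream events be joined to the record later.

```python
import json
import logging
import time
from typing import Callable

logger = logging.getLogger("llm_calls")


def logged_call(complete: Callable[[str], str], prompt: str, **metadata) -> str:
    """Call the model and emit one JSON log record per call.

    Extra keyword arguments (for example a request_id) are copied into the
    record so downstream events can be correlated with it.
    """
    record = {"prompt": prompt, **metadata}
    start = time.perf_counter()
    try:
        output = complete(prompt)
        record.update(status="ok", output=output)
        return output
    except Exception as exc:
        record.update(status="error", error=repr(exc))
        raise
    finally:
        # Runs on both success and failure, so every call is logged once.
        record["latency_ms"] = round((time.perf_counter() - start) * 1000, 2)
        logger.info(json.dumps(record, ensure_ascii=False, default=str))
```

Emitting structured records through the standard `logging` module means the existing log pipeline and dashboards can consume them without a separate tracing service. Redaction of sensitive inputs is out of scope for this sketch.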
Model behavior that looks correct in aggregate can hide systematic failures on specific input patterns. Tracing makes those patterns visible.
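A small worked example shows how an aggregate number can hide a failing segment. The function name, segment labels, and counts below are invented purely for illustration and are not production data.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def accuracy_by_segment(results: Iterable[Tuple[str, bool]]) -> Dict[str, float]:
    """Group (segment, is_correct) pairs and return accuracy per segment."""
    totals: Dict[str, int] = defaultdict(int)
    correct: Dict[str, int] = defaultdict(int)
    for segment, ok in results:
        totals[segment] += 1
        correct[segment] += ok
    return {s: correct[s] / totals[s] for s in totals}


# Invented data: 90 typed PDFs (all correct), 10 scanned PDFs (2 correct).
results = (
    [("typed_pdf", True)] * 90
    + [("scanned_pdf", True)] * 2
    + [("scanned_pdf", False)] * 8
)

overall = sum(ok for _, ok in results) / len(results)
print(overall)                       # 0.92
print(accuracy_by_segment(results))  # {'typed_pdf': 1.0, 'scanned_pdf': 0.2}
```

Here 92% overall accuracy coexists with 20% accuracy on one input type. The segment label has to be captured at logging time for this kind of breakdown to be possible afterwards.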
AWS for most deployments. Lambda for stateless AI endpoints with variable traffic patterns. ECS for more complex, stateful services. S3 for document storage and evaluation dataset management.
For clients in Mexico with data residency requirements, we use AWS São Paulo or US East depending on the specific regulatory context. That regional nuance matters and gets missed in stack discussions that assume a US-based deployment.
The agent tooling space is moving fast. Reliability of multi-step agent workflows is the thing we're most actively evaluating — the demos are impressive, but production-grade agent reliability requires infrastructure that doesn't fully exist yet. We'll update this when we have more production data.