AXENTED — Blog Article
Slug: /blog-posts/observability-stack-small-teams
Meta description: A small engineering team needs logs, metrics, and traces — not a FinOps function. A practical observability stack that covers production needs at near-zero cost.
Target keywords: observability stack small team, monitoring startup, logging metrics tracing, prometheus grafana startup
Large engineering teams have observability built into their platform engineering function: New Relic or Datadog with full APM, distributed tracing across every service, custom dashboards for every team, and a dedicated SRE who watches those dashboards. That stack costs tens of thousands of dollars per month and requires a full-time engineer to operate.
None of that is right for a team of four to eight engineers building their first production system. But running without observability is worse. Here's a practical stack.
Every production system needs three things to be observable: logs (what happened), metrics (how the system is performing), and traces (how a request moved through the system). A small team doesn't need all three at the same depth as a large team, but the gaps are expensive when production problems occur.
The most common small-team observability failure is having logs but not metrics or traces. Logs tell you that something went wrong. Metrics tell you when the system started degrading before it became an error. Traces tell you where in a multi-service request the latency or failure originated. Without all three, debugging production issues is much slower than it needs to be.
Unstructured log lines, meaning plain text in a file, are close to useless at the moment you need to search them. Structured logs (JSON, key-value pairs) are queryable: you can filter by user ID, by request ID, by error type. The investment to write structured logs is small upfront and returns value every time you debug a production issue.
For small teams, CloudWatch Logs (if on AWS) or a managed log aggregation service covers these needs without operational overhead. The important setup is sending structured logs with consistent fields across all services: timestamp, service name, log level, request ID, and user ID where relevant. Consistent fields make cross-service queries possible.
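As a rough illustration of the consistent-fields idea, here is a minimal sketch using only Python's standard `logging` module. The service name `checkout-api`, the helper `build_logger`, and the exact field names are assumptions made for the example, not a prescribed schema. The log shipper (CloudWatch agent or similar) is assumed to pick up one JSON object per line from stderr.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "service": self.service,
            "level": record.levelname,
            "message": record.getMessage(),
            # request_id / user_id arrive through the `extra=` argument, if given.
            "request_id": getattr(record, "request_id", None),
            "user_id": getattr(record, "user_id", None),
        }
        return json.dumps(entry)


def build_logger(service: str) -> logging.Logger:
    """Hypothetical helper: one JSON-emitting logger per service."""
    logger = logging.getLogger(service)
    if not logger.handlers:  # avoid attaching duplicate handlers on repeat calls
        handler = logging.StreamHandler()
        handler.setFormatter(JsonFormatter(service))
        logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("checkout-api")
    log.error("payment declined", extra={"request_id": "req-123", "user_id": "u-42"})
```

If every service emits the same keys, a query such as "all ERROR lines for request_id req-123" works across services without per-service parsing rules.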
A small team doesn't need 40 metrics dashboards. It needs four signals: request rate (how many requests per second), error rate (what percentage of requests are returning errors), latency (p50, p95, p99 response times), and resource utilization (CPU and memory usage on production hosts). Those four signals surface most production problems before they become outages.
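To make the definitions concrete, the following sketch computes request rate, error rate, and latency percentiles from a window of request records. It is illustrative only: the `Request` type and `summarize` function are hypothetical, errors are assumed to mean HTTP 5xx responses, and percentiles use the simple nearest-rank method. Resource utilization is left out because it comes from the host, not from request records.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    duration_ms: float
    status: int


def percentile(sorted_values: list[float], pct: float) -> float:
    """Nearest-rank percentile over an already sorted, non-empty list."""
    rank = max(1, math.ceil(pct * len(sorted_values) / 100))
    return sorted_values[rank - 1]


def summarize(requests: list[Request], window_seconds: float) -> dict[str, float]:
    """Summarize one non-empty window of requests into the request-level signals."""
    durations = sorted(r.duration_ms for r in requests)
    errors = sum(1 for r in requests if r.status >= 500)  # assumption: 5xx = error
    return {
        "request_rate_per_s": len(requests) / window_seconds,
        "error_rate_pct": 100 * errors / len(requests),
        "p50_ms": percentile(durations, 50),
        "p95_ms": percentile(durations, 95),
        "p99_ms": percentile(durations, 99),
    }
```

In a real deployment a metrics system would compute these continuously from counters and histograms; the point here is only what each number means.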
For most small teams, Prometheus with Grafana is the right open-source choice. The setup investment is a day or two; the operational cost is minimal on small infrastructure. If self-hosting is a burden, Grafana Cloud has a generous free tier that covers small-team usage.
Distributed tracing is high-value when you have multiple services handling a single user request and you need to understand which service is responsible for latency or failures. For a simple architecture (one API, one database), tracing adds overhead without proportional value.
Add tracing when you have three or more services in a request path and you find yourself saying "I know the API is slow but I don't know if it's the database, the cache, or the downstream service." At that point, OpenTelemetry instrumentation exporting to a Jaeger or Tempo backend solves the problem efficiently and at low cost.
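The sketch below shows, in plain standard-library Python, the kind of data a trace records and how it answers the "which step is slow" question. It is deliberately not the OpenTelemetry API: the `Trace` and `Span` classes, the span names, and the `time.sleep` calls standing in for real work are all assumptions for illustration. A real tracing setup also records parent-child relationships between spans and propagates the trace ID across process boundaries.

```python
import time
import uuid
from contextlib import contextmanager
from dataclasses import dataclass, field


@dataclass
class Span:
    trace_id: str
    name: str
    duration_ms: float


@dataclass
class Trace:
    trace_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    spans: list[Span] = field(default_factory=list)

    @contextmanager
    def span(self, name: str):
        """Time the enclosed block and record it under this trace's ID."""
        start = time.perf_counter()
        try:
            yield
        finally:
            elapsed_ms = (time.perf_counter() - start) * 1000
            self.spans.append(Span(self.trace_id, name, elapsed_ms))

    def slowest(self) -> Span:
        return max(self.spans, key=lambda s: s.duration_ms)


def handle_request(trace: Trace) -> None:
    # The sleeps stand in for real calls; the span names are hypothetical.
    with trace.span("cache.lookup"):
        time.sleep(0.001)
    with trace.span("database.query"):
        time.sleep(0.05)
    with trace.span("downstream.pricing"):
        time.sleep(0.001)


if __name__ == "__main__":
    trace = Trace()
    handle_request(trace)
    worst = trace.slowest()
    print(f"trace {trace.trace_id}: slowest step is {worst.name}")
```

Because every span carries the same trace ID, the per-step timings for one request can be reassembled even when the steps run in different services.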
An observability stack without alerting is a stack you have to watch manually. Alerts should fire when error rates cross a threshold, when latency crosses a threshold, or when a service stops reporting metrics entirely (which usually means it crashed). Those three alert categories catch the vast majority of production problems.
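The three alert categories can be sketched as a single evaluation function. This is a simplified illustration, not how any particular alerting tool is configured: `ServiceSnapshot`, `Thresholds`, and the default threshold values are hypothetical placeholders, and in practice the same checks would normally be written as rules in the metrics system itself.

```python
from dataclasses import dataclass


@dataclass
class ServiceSnapshot:
    name: str
    error_rate_pct: float
    p95_latency_ms: float
    seconds_since_last_report: float


@dataclass
class Thresholds:
    # Placeholder values; tune them to the point where a human must act.
    max_error_rate_pct: float = 2.0
    max_p95_latency_ms: float = 800.0
    max_silence_seconds: float = 120.0


def evaluate(snapshot: ServiceSnapshot, limits: Thresholds) -> list[str]:
    """Return one message per alert category that should fire."""
    # A silent service has no current metrics, so report only the silence.
    if snapshot.seconds_since_last_report > limits.max_silence_seconds:
        return [f"{snapshot.name}: no metrics for {snapshot.seconds_since_last_report:.0f}s"]
    alerts = []
    if snapshot.error_rate_pct > limits.max_error_rate_pct:
        alerts.append(f"{snapshot.name}: error rate {snapshot.error_rate_pct:.1f}%")
    if snapshot.p95_latency_ms > limits.max_p95_latency_ms:
        alerts.append(f"{snapshot.name}: p95 latency {snapshot.p95_latency_ms:.0f} ms")
    return alerts
```

Checking for silence first matters: a crashed service reports a zero error rate, which would otherwise look healthy.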
Alert fatigue — too many alerts, too many false positives — is the thing that kills observability culture. Set alert thresholds conservatively: they should fire when a human needs to act, not whenever a metric moves. If an alert fires more than once a week and never requires action, the threshold is wrong.
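One way to apply the once-a-week rule is to review alert history periodically. The sketch below assumes a hypothetical `Firing` record in which someone has noted whether each firing required action; it flags alerts that fired more than once in the past seven days without ever needing a response.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Firing:
    alert_name: str
    fired_at: datetime
    required_action: bool


def noisy_alerts(history: list[Firing], now: datetime) -> list[str]:
    """Alerts that fired more than once in the past week and never needed action."""
    week_ago = now - timedelta(days=7)
    recent = defaultdict(list)
    for firing in history:
        if firing.fired_at >= week_ago:
            recent[firing.alert_name].append(firing)
    return sorted(
        name
        for name, firings in recent.items()
        if len(firings) > 1 and not any(f.required_action for f in firings)
    )
```

Anything this returns is a candidate for a looser threshold or for removal.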
A practical small-team observability stack at near-zero cost: structured logging to CloudWatch or Loki, Prometheus metrics with Grafana dashboards, PagerDuty or Opsgenie for alert routing (free tiers cover small teams), and OpenTelemetry tracing when service count justifies it. The whole stack can be running in a week with one engineer who's done it before.