New comment by jamiemallers in "Infrastructure decisions I endorse or regret after 4 years at a startup (2024)"

jamiemallers — Fri, 20 Feb 2026 15:04:04 +0000

PagerDuty's pricing trajectory is following the exact same playbook as Datadog. Start cheap enough that teams adopt it without finance approval, then jack up per-seat pricing once it's embedded in every runbook and escalation policy.

The insidious part with on-call tooling specifically is that switching costs are higher than almost any other category. Your escalation chains, schedules, integrations with monitoring, incident templates, post-mortem workflows - it all becomes organizational muscle memory. Migrating monitoring backends is a weekend project compared to migrating on-call routing.

What I've seen work: teams that treat on-call routing as a thin layer rather than a platform. If your schedules live in something portable (even a YAML file synced to whatever tool) and your alert routing is OpenTelemetry-native, swapping the actual dispatch tool becomes manageable. The teams that get locked in are the ones who build their entire incident process inside PD's UI.
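A minimal sketch of the "thin layer" idea, under stated assumptions: the schedule is plain data you own (here Python objects, though it could just as well be loaded from a YAML file), and the paging vendor sits behind a small adapter. `Shift`, `Dispatcher`, `RecordingDispatcher`, `on_call`, and `route` are hypothetical names for illustration, not any vendor's API.

```python
from __future__ import annotations

from dataclasses import dataclass
from datetime import datetime, timedelta, timezone
from typing import Optional, Protocol


@dataclass(frozen=True)
class Shift:
    """One on-call shift in a vendor-neutral schedule."""
    engineer: str
    start: datetime  # inclusive
    end: datetime    # exclusive


class Dispatcher(Protocol):
    """Hypothetical adapter interface: one implementation per paging tool."""
    def page(self, engineer: str, summary: str) -> None: ...


class RecordingDispatcher:
    """Stand-in dispatcher that only records pages (no real vendor call)."""
    def __init__(self) -> None:
        self.sent: list[tuple[str, str]] = []

    def page(self, engineer: str, summary: str) -> None:
        self.sent.append((engineer, summary))


def on_call(schedule: list[Shift], at: datetime) -> Optional[str]:
    """Return who covers `at`, or None if the schedule has a gap."""
    for shift in schedule:
        if shift.start <= at < shift.end:
            return shift.engineer
    return None


def route(schedule: list[Shift], dispatcher: Dispatcher,
          summary: str, at: datetime) -> str:
    """Resolve the on-call engineer from our own data, then hand off."""
    engineer = on_call(schedule, at)
    if engineer is None:
        raise LookupError(f"no on-call coverage at {at.isoformat()}")
    dispatcher.page(engineer, summary)
    return engineer


# Example schedule: two 12-hour shifts starting at midnight UTC.
day_start = datetime(2026, 2, 20, tzinfo=timezone.utc)
example_schedule = [
    Shift("alice", day_start, day_start + timedelta(hours=12)),
    Shift("bob", day_start + timedelta(hours=12), day_start + timedelta(hours=24)),
]
```

Under this layout, swapping the dispatch tool would mean writing one new class with a `page` method; the schedule data and the routing logic stay untouched.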

New comment by jamiemallers in "Infrastructure decisions I endorse or regret after 4 years at a startup (2024)"

jamiemallers — Fri, 20 Feb 2026 09:06:02 +0000

"No alternative" isn't quite right anymore, though I understand the feeling. The real problem with Datadog isn't the pricing - it's that their per-host model incentivizes you to care about infrastructure topology rather than user-facing behavior. You end up with 10,000 dashboards and still can't answer "is checkout broken right now?"

The open source stack has gotten genuinely viable: Prometheus/VictoriaMetrics for metrics, Grafana for viz, and OpenTelemetry as the collection layer means you're not locked into anyone's agent. The gap used to be in correlation - connecting a metric spike to a trace to a log line - but that's narrowed significantly.
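To make "correlation" concrete, here is a rough sketch that assumes metric samples carry a trace ID (as exemplars do) and that log lines are stamped with the same ID. The `Exemplar` and `LogLine` types and the `logs_for_spike` helper are hypothetical and use in-memory data only; a real stack would query its metrics and log backends instead.

```python
from __future__ import annotations

from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class Exemplar:
    """A metric sample tagged with the trace that produced it."""
    metric: str
    value: float
    trace_id: str


@dataclass(frozen=True)
class LogLine:
    trace_id: str
    message: str


def logs_for_spike(exemplars: list[Exemplar], logs: list[LogLine],
                   metric: str, threshold: float) -> dict[str, list[str]]:
    """Map each trace behind an over-threshold sample to its log messages."""
    # Index logs by trace ID once so each lookup is cheap.
    by_trace: dict[str, list[str]] = defaultdict(list)
    for line in logs:
        by_trace[line.trace_id].append(line.message)

    return {
        e.trace_id: by_trace.get(e.trace_id, [])
        for e in exemplars
        if e.metric == metric and e.value > threshold
    }


# Illustrative data only: one slow checkout request and one normal one.
sample_exemplars = [
    Exemplar("checkout_latency_seconds", 0.2, "trace-a"),
    Exemplar("checkout_latency_seconds", 4.8, "trace-b"),
]
sample_logs = [
    LogLine("trace-a", "payment authorized"),
    LogLine("trace-b", "payment provider timeout"),
    LogLine("trace-b", "retrying charge"),
]
```

The shared trace ID is the join key here: a metric spike leads to a trace, and the trace leads to its log lines.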

The actual hard part of leaving DD isn't technical, it's organizational. DD becomes load-bearing for on-call runbooks, alert routing, and team muscle memory. Migration is less "swap the backend" and more "retrain your incident response."

If you're evaluating: the question I'd ask isn't "which vendor has the best dashboards" but "can I get from alert to root cause in under 5 minutes with this tool?" That's the metric that actually correlates with MTTR, and it's where most monitoring setups (including expensive ones) fail.

Hacker News: jamiemallers
