◆ Case 03/06 — Platform Architecture · 2022 · 11mo engagement
Client — Logistics platform / EU-wide
Role — Platform Architect
Stack — Go · gRPC · Kubernetes · NATS
Outcome — From 240ms p99 → 38ms p99
Fig.01 — Inference topology (v2)
The Challenge —
Forty-plus services, no shared contracts, and a sync-call graph deep enough that one slow dependency pinned the whole product.
Tracing was partial, and the retry storm was not a theoretical risk.
The Approach —
Shared gRPC contracts enforced at the CI level, NATS for everything that could be async, and a single OpenTelemetry pipeline everything had to emit through. Budgets were written into SLOs, not wikis.
The rule became: if you can't explain where 300ms went, the service doesn't ship.
Key metrics — Gateway Latency (38ms p99) · Incident Reduction · Contract Coverage
Outcome —
p99 dropped an order of magnitude. The next incident that would've been a full-platform outage became one team's Slack thread.
The shape of the system didn't change much. Its discipline did.
Next Case — 04/06
Custom Core ERP Systems ↗