◆ Case 03/06 — Platform Architecture · 2022 · 11mo engagement
Client — Logistics platform / EU-wide
Role — Platform Architect
Stack — Go · gRPC · Kubernetes · NATS
Outcome — From 240ms p99 → 38ms p99
Fig.01 — Inference topology (v2)
The Challenge —
Forty-plus services, no shared contracts, and a sync-call graph deep enough that one slow dependency pinned the whole product.
Tracing was partial, and the retry storm was not a theoretical risk.
The Approach —
Shared gRPC contracts enforced at the CI level, NATS for everything that could be async, and a single OpenTelemetry pipeline everything had to emit through. Budgets were written into SLOs, not wikis.
The rule became: if you can't explain where 300ms went, the service doesn't ship.
Key metrics — Gateway Latency (38ms p99) · Incident Reduction · Contract Coverage
Outcome —
p99 dropped an order of magnitude. The next incident that would've been a full-platform outage became one team's Slack thread.
The shape of the system didn't change much. Its discipline did.
Next Case — 04/06
Custom Core ERP Systems ↗