Their v1 pipeline was a serial bottleneck — every model load cold-started, batch jobs missed their SLA window by 40 minutes, and the GPU fleet sat idle 62% of its hours.
The team had shipped the MVP in a hurry, which is exactly the problem: ML infrastructure written against deadlines tends to calcify the moment real traffic arrives.
I tore the monolith apart along its three natural axes: ingestion, inference, and ranking. Each axis became an independently scaled Ray cluster with its own SLA, its own autoscaler, and a warm pool of pre-loaded model replicas.
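The warm-pool idea is simple enough to sketch in plain Python. This is an illustration of the pattern, not the production code — `WarmPool` and `load_replica` are hypothetical names, and the real system does this per Ray cluster with actual model weights:

```python
import queue

class WarmPool:
    """Keeps `size` replicas loaded ahead of demand so no request pays
    the cold-start cost. `load_replica` stands in for whatever expensive
    initialization (weights, CUDA context) the real system performs."""

    def __init__(self, load_replica, size):
        self._pool = queue.Queue(maxsize=size)
        for _ in range(size):
            # All loading happens up front, before traffic arrives.
            self._pool.put(load_replica())

    def acquire(self):
        # Hand out a warm replica; blocks if the pool is momentarily drained.
        return self._pool.get()

    def release(self, replica):
        # Return the replica for reuse instead of tearing it down.
        self._pool.put(replica)

# Simulate the expensive load with a counter to show it only runs at startup.
loads = []
def load_replica():
    loads.append(1)  # each call represents one slow cold start
    return object()

pool = WarmPool(load_replica, size=3)
replica = pool.acquire()
pool.release(replica)
assert len(loads) == 3  # three loads at init, zero per-request
```

The point of the pattern is that the autoscaler's job shifts from "spin up a replica under load" (slow) to "top up the pool" (amortized).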
Kafka handled the seams. Back-pressure became a feature rather than an incident. The v2 deploys as a fleet, not a beast.
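What "back-pressure as a feature" means mechanically: each seam between stages is a bounded buffer, and a full buffer is an explicit signal rather than a failure. A minimal stdlib sketch of the principle, with Kafka itself abstracted away (the names here are illustrative, not the actual topic layout):

```python
import queue

# A bounded seam between two stages. When the downstream stage
# (inference, say) lags, the seam fills and the producer gets an
# explicit "slow down" signal instead of unbounded memory growth.
seam = queue.Queue(maxsize=3)  # tiny bound, for illustration only

def produce(event):
    """Non-blocking put: returns False when the consumer is behind,
    so the upstream stage can pause or shed load deliberately."""
    try:
        seam.put_nowait(event)
        return True
    except queue.Full:
        return False

def consume():
    return seam.get()

# Three events fit; the fourth is refused until a consumer drains one.
assert all(produce(i) for i in range(3))
assert produce(3) is False
assert consume() == 0
assert produce(3) is True
```

In the real pipeline Kafka's consumer-group lag plays the role of the bounded queue: a slow stage shows up as lag on its topic, the autoscaler reacts, and nothing upstream falls over.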
Eighteen months in, the pipeline has absorbed three product launches without a schema rewrite. The on-call rotation stopped paging humans for routine load events within the first quarter after launch.
It's not the fastest pipeline in the world. It's the one that stays up — which, for a platform billing on uptime, is the only metric that matters.