Quick Summary
A growing neocloud GPU infrastructure provider needed to scale operational automation beyond a single-site tool to every data center across North America. By deploying Itential, they replaced a siloed ops tool with unified, event-driven workflows – reducing diagnostic collection time from hours to minutes and enabling more teams to execute safely without engineering escalation.
From Hours Of Manual Diagnostics To Repeatable, Event-Driven Workflows
Neocloud GPU providers are building the infrastructure layer for the AI era. They operate high-density GPU data centers, expand rapidly across geographies, and deliver compute services where reliability and speed directly affect customer experience.
But the operational model that works at one site rarely scales cleanly to many.
That was the challenge for one rapidly growing GPU infrastructure provider operating multiple data centers across North America. Their footprint included thousands of GPU nodes and a modern stack built on best-of-breed tools: containerized deployments, GitOps, secrets management, NetBox as a source of truth, and deep observability. At this scale, even a 1% daily incident rate becomes a constant operational load.
They also had strong automation. In fact, at one site they relied on an internal operations tool that gave their data center teams the ability to execute common tasks and close tickets without escalating to infrastructure engineering. Standardized self-service workflows commonly reduce escalation volume by 30-50% by removing engineering from routine triage.
The Problem Was Simple: That Tool Didn’t Scale
It was owned externally and locked to a single site. As the company expanded, they needed a way to replicate operational capabilities across every data center and enable more teams to execute safely without depending on a handful of engineers.
At the same time, incident response was becoming a growing pain point. During GPU health and thermal events, collecting diagnostics from out-of-band controllers and assembling the right evidence was taking hours. That time cost wasn’t just operational. It affected customer experience and slowed resolution.
Teams that automate diagnostics and remediation typically cut time-to-evidence from hours to minutes and triage incidents ten times faster.
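To make the pattern concrete, here is a minimal sketch of what automated evidence collection during a thermal event can look like. The controller call is stubbed with static data, and every name here (`fetch_bmc_sensors`, the sensor labels, the thresholds) is a hypothetical illustration, not the customer's actual implementation; in practice the stub would be an authenticated request to the node's out-of-band controller.

```python
# Hypothetical sketch: collect diagnostics from an out-of-band controller
# during a GPU thermal event. The BMC call is stubbed with static data;
# a real build would make an authenticated out-of-band API request.
from datetime import datetime, timezone

def fetch_bmc_sensors(node: str) -> list[dict]:
    # Stub standing in for a real out-of-band controller call (assumption).
    return [
        {"sensor": "GPU0_TEMP", "value_c": 91, "threshold_c": 85},
        {"sensor": "GPU1_TEMP", "value_c": 72, "threshold_c": 85},
        {"sensor": "INLET_TEMP", "value_c": 29, "threshold_c": 40},
    ]

def build_diagnostics_bundle(node: str) -> dict:
    """Assemble the evidence an engineer would otherwise gather by hand."""
    readings = fetch_bmc_sensors(node)
    breaches = [r for r in readings if r["value_c"] > r["threshold_c"]]
    return {
        "node": node,
        "collected_at": datetime.now(timezone.utc).isoformat(),
        "readings": readings,
        "breaches": breaches,
        "severity": "critical" if breaches else "info",
    }

bundle = build_diagnostics_bundle("gpu-node-042")
print(bundle["severity"], [b["sensor"] for b in bundle["breaches"]])
# → critical ['GPU0_TEMP']
```

The point is not the code itself but the time shift: the bundle is assembled in seconds, attached to the incident automatically, and identical in shape every time it runs.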
They Also Faced A Strategic Infrastructure Requirement: Vendor Flexibility
Their network fabric was built on open APIs and modern automation practices, but they wanted an OS-agnostic orchestration layer to protect their automation investments and preserve flexibility as vendor strategies evolve over time.
Why They Chose Itential
The team evaluated several options, including lightweight job runners. But they needed more than a way to execute scripts. They needed orchestration across systems, sites, and teams.
Itential stood out for a few key reasons:
- Orchestrate what already exists. They didn’t want to replace Ansible, Python, or internal tooling. They wanted to operationalize it.
- Unified workflows across systems. Their operational reality involved inventory, monitoring, ticketing, and infrastructure APIs – not a single tool.
- Governance and scalability. They needed a platform proven at enterprise scale, with RBAC, audit trails, and repeatable execution.
- Vendor abstraction. They wanted workflows that could remain stable even as underlying platforms evolve.
- Low-code accessibility. Itential made it possible to expose automation safely to non-developers through standardized workflows.
What They Built First
The initial focus was clear: build event-driven workflows that could reduce operational toil immediately. The team prioritized:
- Automated GPU diagnostics and ticket enrichment
- Self-service operational workflows for data center operations
- NetBox-driven inventory and context workflows
- Fabric deployment and configuration workflows with validation and auditability
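The alert-to-diagnostics-to-ticket pattern underlying these workflows can be sketched as a small event handler. Everything below is a stand-in: a real build would query NetBox, the out-of-band controller fleet, and the ticketing system through their APIs, and the function and field names are assumptions chosen for illustration. Only the shape of the flow is shown.

```python
# Illustrative event-driven flow: alert -> inventory context -> diagnostics
# -> enriched ticket. All integrations are stubbed; names are hypothetical.

INVENTORY = {  # stand-in for a NetBox source-of-truth lookup
    "gpu-node-042": {"site": "dc-east-1", "rack": "R12", "role": "gpu-worker"},
}

def lookup_inventory(node: str) -> dict:
    return INVENTORY.get(node, {})

def collect_diagnostics(node: str) -> dict:
    # Stand-in for out-of-band diagnostic collection.
    return {"gpu_temp_c": 91, "xid_errors": 2}

def handle_alert(alert: dict) -> dict:
    """Turn a raw monitoring alert into an enriched, actionable ticket."""
    node = alert["node"]
    return {
        "title": f"{alert['type']} on {node}",
        "context": lookup_inventory(node),         # where it is, what it does
        "diagnostics": collect_diagnostics(node),  # evidence, pre-collected
        "runbook": "thermal-triage-v1",            # standardized next steps
    }

ticket = handle_alert({"node": "gpu-node-042", "type": "GPU thermal event"})
print(ticket["title"], "|", ticket["context"]["site"])
# → GPU thermal event on gpu-node-042 | dc-east-1
```

Because each step is a discrete, swappable integration, the same handler shape can be replicated at a new site by pointing the stubs at that site's systems rather than rebuilding the workflow.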
The result was a scalable operating model: standardized workflows that could be replicated across sites, used by more teams, and governed consistently, without increasing engineering overhead.
Where They’re Going Next
With foundational workflows in production, the team is focused on scaling what works and pushing further into autonomous operations. A few key priorities are driving their roadmap:
- Scaling to every site. Workflows at current locations will be replicated across new facilities – same model, no rebuild.
- Closed-loop operations. The event-driven workflow model will expand and move toward automated loops with no human handoff.
- Agentic operations with FlowAI. The team is translating AI intent into governed, auditable workflow execution using Itential FlowAI.
- Extending orchestration into hardware. Orchestrating the full hardware lifecycle, including RMA workflows, will eliminate the multi-team handoffs that slow incident resolution.
The broader goal hasn’t changed: empower operations teams to close more tickets while increasing their capacity to build value, and keep that ratio improving as infrastructure scales.
Want the Full Story & Technical Details?
Read the full customer story to see the architecture, workflows, and outcomes in more depth.
Watch the on-demand webinar below to see a demo of the alert-to-diagnostics-to-ticket workflow pattern built for AI data center operations.