Scaling Self-Service Operations, Event-Driven Diagnostics, Automated Actions, & Vendor-Agnostic Network Orchestration Across a Growing Footprint
Industry: Neo-Cloud Infrastructure Provider
Employees: 5,000+

Challenge
Manual diagnostics, site-specific automation, and engineering bottlenecks could not keep pace with rapid expansion and uptime demands. GPU thermal alerts required hours of manual triage and evidence collection across multiple systems.

Solution
Itential provided standardized operational workflows and vendor-agnostic orchestration, enabling automated diagnostics, self-service execution, and repeatable operations across every site. Alerts now trigger workflows that collect evidence, enrich tickets, notify teams, and support remediation.

Why Itential
Chosen for enterprise-grade orchestration, low-code accessibility, integration flexibility, and the ability to productize existing automation with governance and scale. Itential enabled an event-to-action model that reduced time-to-triage and supports future closed-loop operations.
When AI Infrastructure Growth Outpaces Operational Capacity
A rapidly growing GPU cloud provider operates large-scale data centers across North America, delivering high-performance AI compute to customers through a substantial GPU infrastructure footprint.
As demand for AI compute surged, the company expanded into new data center builds, scaled interconnect capacity, and increased customer provisioning volume. The infrastructure team had already embraced modern practices: containerized deployments, GitOps, Kubernetes, secrets management, source-of-truth systems, and observability platforms.
But growth created a compounding operational reality: every new site introduced more devices, more dependencies, and more operational load. The company needed to scale its operating model as fast as it was scaling physical infrastructure.

We’re trying to enable our operations teams to increase ticket close rates and efficiency without escalating to engineering.
Infrastructure Operations Leader
GPU Alerts Were Frequent, but the Response Was Manual
GPU thermal and health events were a recurring operational challenge. When alerts fired, teams needed to quickly determine whether the issue was transient workload behavior, an airflow or power concern, a chassis-level problem, or a degrading GPU that required intervention.
The problem was not detection. The problem was repeatable response at scale.
Collecting diagnostics from out-of-band controllers and assembling the right evidence for triage required manual effort across multiple interfaces and systems. Too many people were pulled into the process, and response quality varied depending on who was available.

Even getting the diagnostics and thermal information from GPUs, it takes hours.
Infrastructure Operations Leader
Custom Automation Worked, But Couldn’t Scale Beyond a Single Site
The company had invested in internal tooling and automation to improve operations at one data center location. The tooling provided valuable automation capabilities for operations and customer experience teams, but it had a fundamental constraint: it could not scale across sites and was not fully owned or productized internally.
As the organization expanded, the team needed a platform that could standardize those operational workflows and extend them across every location, without rebuilding everything from scratch each time a new facility came online. They also needed a consistent alert-driven model for triage and action, so operational response did not depend on ad hoc coordination.
The goal was not just automation. It was repeatable execution with governance and visibility, usable by operations and customer experience teams.
Vendor Abstraction Was a Strategic Priority
The company’s network fabric was built on open standards and modern APIs, but the team knew vendor decisions would evolve as the footprint grew. They wanted to avoid being locked into a single network platform and ensure that day-to-day operations and provisioning workflows remained portable.
That meant shifting from vendor-specific automation toward OS-agnostic workflows and normalized data structures.

The biggest interest is OS-agnostic data structures for provisioning the fabric… portability as we migrate away from our current platform.
Lead Infrastructure Architect
High-Code Automation Reached Its Practical Limits
The infrastructure engineering team had deep expertise and had built meaningful automation with scripts and playbooks. But as demand increased, the overhead of maintaining custom integrations and updating code for external dependency changes became a major constraint.
Instead of focusing on business value, the team spent time on automation upkeep, integration maintenance, and platform drift.

It becomes a full-time job… updates are not really anything of value, they’re simply things that have to happen because external dependencies changed.
Lead Infrastructure Architect
Why They Chose Itential
As the organization evaluated how to scale operational execution across multiple data centers, they were clear about what they needed and what they wanted to avoid.
They did not want another point tool or a scripting framework that increased engineering burden. They needed orchestration that could productize automation with governance, reuse, and scale built in. They also needed an event-driven operational model where alerts could trigger diagnostics, notifications, and automated actions.
Several criteria shaped the decision.
Scale Beyond a Single Site
The team needed workflows that could be repeated across current and future data centers without cloning and forking automation per location.
Low-Code Accessibility Without Sacrificing Technical Depth
Itential enabled technical teams to build complex workflows while making them accessible to broader teams through a low-code model, expanding who could safely execute tasks.
Leverage Existing Automation Investments
The platform could orchestrate existing Python and Ansible automation rather than requiring a rewrite. This preserved prior investments while enabling modernization.
Vendor-Agnostic Orchestration Layer
The ability to abstract network operations through normalized data models reduced lock-in risk and ensured long-term flexibility as vendor strategies evolved.
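As a rough illustration of what a normalized data model can look like, the sketch below keeps one vendor-neutral record and translates it per platform. Everything in it is hypothetical: the `VlanIntent` schema, the `vendor_a` and `vendor_b` names, and both payload shapes are invented for this example and do not describe Itential's data models or any real network OS.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class VlanIntent:
    """Vendor-neutral description of one VLAN (hypothetical schema)."""
    vlan_id: int
    name: str


def to_vendor_a(intent: VlanIntent) -> dict:
    # Made-up payload shape for a first platform's API.
    return {"vlan": {"id": intent.vlan_id, "name": intent.name}}


def to_vendor_b(intent: VlanIntent) -> dict:
    # Made-up payload shape for a second platform's API.
    return {"vlans": [{"vlan-id": intent.vlan_id, "description": intent.name}]}


# Supporting a new platform means registering one more translator;
# the intent records and the workflows that produce them stay unchanged.
TRANSLATORS = {"vendor_a": to_vendor_a, "vendor_b": to_vendor_b}


def translate(intent: VlanIntent, platform: str) -> dict:
    """Turn one normalized intent into a platform-specific payload."""
    if platform not in TRANSLATORS:
        raise ValueError(f"no translator registered for {platform!r}")
    return TRANSLATORS[platform](intent)


intent = VlanIntent(vlan_id=120, name="tenant-blue")
for platform in TRANSLATORS:
    print(platform, translate(intent, platform))
```

The point of the design is that the operational workflow only ever handles `VlanIntent`, so a vendor migration touches the translation layer rather than every workflow.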
Operational Overhead Mattered
A SaaS deployment model reduced the burden of managing yet another platform while still supporting on-premises connectivity through gateway deployment where needed.
Event-Driven Diagnostics & Automated Response
To reduce time-to-triage and enable faster resolution, the team prioritized workflows that automatically collect evidence, enrich tickets, notify the right teams, and prepare for remediation actions based on severity.
Together, these capabilities allowed the organization to shift from one-off automation to a standardized orchestration operating model that could scale with both infrastructure growth and operational demand.
Standardizing Operational Execution Across the AI Infrastructure Stack
The architectural shift came from standardizing operational execution into reusable workflows that could integrate across the company’s ecosystem. Each workflow could:
- Trigger automatically from events
- Collect and normalize diagnostic evidence
- Update systems of record
- Create or enrich tickets
- Guide execution for operations teams
- Preserve an audit trail
This created a foundation for closed-loop operations where detection, diagnostics, escalation, and remediation could be coordinated in a repeatable way.
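The workflow steps listed above can be sketched as a small event handler. This is a minimal sketch under stated assumptions, not the customer's or Itential's implementation: the `Alert` fields, the two severity levels, the routing of critical alerts to engineering, and the `read_sensors` and `send` callables are all hypothetical stand-ins for real integrations.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone


@dataclass
class Alert:
    host: str
    kind: str
    severity: str  # assumed levels: "warning" or "critical"


@dataclass
class WorkflowRun:
    alert: Alert
    evidence: dict = field(default_factory=dict)
    ticket: dict = field(default_factory=dict)
    audit: list = field(default_factory=list)

    def log(self, step: str) -> None:
        # Audit trail: every step is recorded with a UTC timestamp.
        self.audit.append((datetime.now(timezone.utc).isoformat(), step))


def collect_evidence(run: WorkflowRun, read_sensors) -> None:
    raw = read_sensors(run.alert.host)
    # Normalize: lowercase sensor names, temperatures as floats (Celsius assumed).
    run.evidence = {name.lower(): float(value) for name, value in raw.items()}
    run.log("collect_evidence")


def enrich_ticket(run: WorkflowRun) -> None:
    run.ticket = {
        "host": run.alert.host,
        "summary": f"{run.alert.kind} alert on {run.alert.host}",
        "evidence": run.evidence,
        "max_temp_c": max(run.evidence.values(), default=None),
    }
    run.log("enrich_ticket")


def notify(run: WorkflowRun, send) -> None:
    # Assumed routing rule: only critical alerts go to engineering.
    team = "engineering" if run.alert.severity == "critical" else "operations"
    send(team, run.ticket["summary"])
    run.log(f"notify:{team}")


def handle_alert(alert: Alert, read_sensors, send) -> WorkflowRun:
    """Run the event-to-action steps in order and return the full record."""
    run = WorkflowRun(alert)
    run.log("triggered")
    collect_evidence(run, read_sensors)
    enrich_ticket(run)
    notify(run, send)
    return run


# Usage with stand-in integrations (no real hardware or ticketing calls).
def fake_sensors(host):
    return {"GPU0": "71.5", "GPU1": "93"}


sent = []
run = handle_alert(
    Alert("gpu-node-01", "thermal", "critical"),
    fake_sensors,
    lambda team, message: sent.append((team, message)),
)
print(run.ticket)
print([step for _, step in run.audit])
```

Passing the sensor reader and notifier in as arguments keeps the sequence itself site-independent, which is the property the case study describes as repeatable execution across locations.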

We’re trying to save time and focus it more on the DC technicians… empower them with more access and testability.
Infrastructure Operations Leader
Orchestrating Multi-Site Operations, Diagnostics, & Provisioning at Scale
With Itential as the orchestration foundation, the organization prioritized several high-impact workflows:
Automated GPU Thermal Diagnostics & Response
When thermal alerts occur, workflows can automatically collect diagnostic evidence through hardware APIs, enrich tickets, and initiate operational response without manual coordination.
Source of Truth Synchronization
Workflows can programmatically populate and synchronize inventory and host-level data using APIs, enabling more accurate infrastructure context for operations and automation.
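One way to picture the synchronization step is as a diff between what was discovered on the floor and what the source of truth currently records. The sketch below is an assumption-laden illustration: the `plan_sync` helper, the host names, and the record fields are hypothetical, and a real workflow would apply the resulting plan through the inventory system's own API.

```python
def plan_sync(discovered: dict, recorded: dict) -> dict:
    """Compare discovered hosts with recorded hosts and plan the changes."""
    create = {h: d for h, d in discovered.items() if h not in recorded}
    update = {
        h: d for h, d in discovered.items()
        if h in recorded and recorded[h] != d
    }
    # Hosts recorded but no longer discovered are flagged, not deleted,
    # so a person or a later workflow step can confirm before removal.
    stale = sorted(h for h in recorded if h not in discovered)
    return {"create": create, "update": update, "stale": stale}


# Hypothetical data: keys are host names, values are host-level attributes.
discovered = {
    "gpu-node-01": {"gpus": 8, "rack": "A1"},
    "gpu-node-02": {"gpus": 8, "rack": "A2"},
}
recorded = {
    "gpu-node-01": {"gpus": 8, "rack": "A3"},
    "gpu-node-09": {"gpus": 4, "rack": "B1"},
}

print(plan_sync(discovered, recorded))
```

Producing a plan before writing anything makes the change reviewable and leaves a natural place for approval or audit logging.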
Self-Service Operational Execution
Approved workflows can be executed by operations and customer experience teams, reducing escalations and improving ticket close rates over time.
Fabric Deployment & Configuration Management
Network workflows support configuration backups, validation, and repeatable changes across a multi-site fabric, while preserving flexibility for future vendor shifts.
Customer Provisioning Orchestration
End-to-end activation workflows can coordinate across network, security, and interconnect providers, reducing time-to-provision and improving delivery consistency.
Measurable Results Across Operations & Infrastructure Delivery
Moving from manual processes and site-specific tooling to orchestrated workflows produced outcomes that were both immediate and structural.
Improved Operations
Faster Diagnostics & Reduced Escalation Volume
By automating evidence collection and standardizing response execution, the organization reduced time spent on diagnostic tasks and improved operational throughput.
Enabled Self-Service
Improved Internal Ticket Close Rates
A key operational objective was increasing internal ticket closure by enabling self-service execution and reducing dependency on infrastructure engineering for routine tasks.
Reduced Cost & Risk
Greater Flexibility & Reduced Vendor Risk
By building OS-agnostic workflows and vendor abstraction into their operational model, the company reduced the long-term cost and risk of vendor transitions.
Increased Capacity
Reduced Engineering Maintenance Burden
The engineering team regained capacity by shifting from maintaining brittle custom integrations to building scalable, reusable workflows with governance.
What’s Next
With foundational workflows in place, the organization plans to expand orchestration across additional sites, operational processes, and advanced use cases, including deeper event-driven operations, closed-loop automation, and agentic operations with Itential FlowAI. They also plan to extend orchestration into the hardware lifecycle, including RMA workflows that automate evidence collection, coordinate replacement processes, restore configurations, validate post-replacement state, and standardize documentation.