How a GPU Cloud Provider Is Scaling AI Data Center Operations with Itential

Scaling Self-Service Operations, Event-Driven Diagnostics, Automated Actions, & Vendor-Agnostic Network Orchestration Across a Growing Footprint

Industry: Neo-Cloud Infrastructure Provider     •     Employees: 5,000+

Challenge

Manual diagnostics, site-specific automation, and engineering bottlenecks could not keep pace with rapid expansion and uptime demands. GPU thermal alerts required hours of manual triage and evidence collection across multiple systems.

Solution

Itential standardized operational workflows and vendor-agnostic orchestration, enabling automated diagnostics, self-service execution, and repeatable operations across every site. Alerts now trigger workflows that collect evidence, enrich tickets, notify teams, and support remediation.

Why Itential

Chosen for enterprise-grade orchestration, low-code accessibility, integration flexibility, and the ability to productize existing automation with governance and scale. Itential enabled an event-to-action model that reduced time-to-triage and supports future closed-loop operations.

When AI Infrastructure Growth Outpaces Operational Capacity

A rapidly growing GPU cloud provider operates large-scale data centers across North America, delivering high-performance AI compute to customers through a substantial GPU infrastructure footprint.

As demand for AI compute surged, the company expanded into new data center builds, scaled interconnect capacity, and increased customer provisioning volume. The infrastructure team had already embraced modern practices: containerized deployments, GitOps, Kubernetes, secrets management, source of truth systems, and observability platforms.

But growth created a compounding operational reality: every new site introduced more devices, more dependencies, and more operational load. The company needed to scale its operating model as fast as it was scaling physical infrastructure.

We’re trying to enable our operations teams to increase ticket close rates and efficiency without escalating to engineering.

Infrastructure Operations Leader

GPU Alerts Were Frequent, but the Response Was Manual

GPU thermal and health events were a recurring operational challenge. When alerts fired, teams needed to quickly determine whether the issue was transient workload behavior, an airflow or power concern, a chassis-level problem, or a degrading GPU that required intervention.

The problem was not detection. The problem was repeatable response at scale.

Collecting diagnostics from out-of-band controllers and assembling the right evidence for triage required manual effort across multiple interfaces and systems. Too many people were pulled into the process, and response quality varied depending on who was available.

Even getting the diagnostics and thermal information from GPUs, it takes hours.

Infrastructure Operations Leader
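To make the triage problem concrete, here is a minimal, hypothetical sketch of the kind of evidence sorting a workflow can automate: grouping GPU temperature sensors from a Redfish-style out-of-band controller payload into severity buckets. The field names (`Temperatures`, `ReadingCelsius`) follow the Redfish Thermal schema, but the thresholds and sensor names are illustrative assumptions, not the provider's actual configuration.

```python
# Hypothetical sketch: triaging GPU thermal readings pulled from an
# out-of-band controller (a Redfish-style Thermal payload). Thresholds
# and sensor names are illustrative, not the provider's real schema.

WARN_C, CRIT_C = 85, 95  # assumed warning/critical thresholds in Celsius

def triage_thermal(payload: dict) -> dict:
    """Group GPU temperature sensors into ok/warning/critical buckets."""
    buckets = {"ok": [], "warning": [], "critical": []}
    for sensor in payload.get("Temperatures", []):
        name = sensor.get("Name", "")
        if "GPU" not in name:
            continue  # only GPU sensors matter for this workflow
        temp = sensor.get("ReadingCelsius")
        if temp is None:
            continue  # skip sensors with no current reading
        if temp >= CRIT_C:
            buckets["critical"].append((name, temp))
        elif temp >= WARN_C:
            buckets["warning"].append((name, temp))
        else:
            buckets["ok"].append((name, temp))
    return buckets

sample = {
    "Temperatures": [
        {"Name": "GPU0 Temp", "ReadingCelsius": 72},
        {"Name": "GPU1 Temp", "ReadingCelsius": 97},
        {"Name": "CPU0 Temp", "ReadingCelsius": 60},
    ]
}
print(triage_thermal(sample))
```

Done by hand across multiple interfaces, this sorting takes hours; as a workflow step it runs in seconds and produces consistent evidence regardless of who is on shift.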

Custom Automation Worked, But Couldn’t Scale Beyond a Single Site

The company had invested in internal tooling and automation to improve operations at one data center location. The tooling provided valuable automation capabilities for operations and customer experience teams, but it had a fundamental constraint: it could not scale across sites and was not fully owned or productized internally.

As the organization expanded, the team needed a platform that could standardize those operational workflows and extend them across every location, without rebuilding everything from scratch each time a new facility came online. They also needed a consistent alert-driven model for triage and action, so operational response did not depend on ad hoc coordination.

The goal was not just automation. It was repeatable execution with governance and visibility, usable by operations and customer experience teams.

Vendor Abstraction Was a Strategic Priority

The company’s network fabric was built on open standards and modern APIs, but the team knew vendor decisions would evolve as the footprint grew. They wanted to avoid being locked into a single network platform and ensure that day-to-day operations and provisioning workflows remained portable.

That meant shifting from vendor-specific automation toward OS-agnostic workflows and normalized data structures.

The biggest interest is OS-agnostic data structures for provisioning the fabric… portability as we migrate away from our current platform.

Lead Infrastructure Architect
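The idea behind OS-agnostic provisioning can be sketched as a single normalized intent rendered into per-vendor configuration. This is an illustrative assumption of the pattern, not the provider's real data model: the intent fields, vendor names, and renderers are all hypothetical.

```python
# Illustrative sketch of OS-agnostic provisioning: one normalized data
# structure, multiple vendor renderers. Intent fields, OS names, and
# CLI syntax are assumptions for illustration only.

intent = {"interface": "Ethernet1/1", "description": "cust-a-uplink", "vlan": 120}

def render_nxos_style(i: dict) -> str:
    # Hypothetical renderer for one vendor's CLI dialect.
    return "\n".join([
        f"interface {i['interface']}",
        f"  description {i['description']}",
        f"  switchport access vlan {i['vlan']}",
    ])

def render_eos_style(i: dict) -> str:
    # A second dialect; only the rendering changes, never the intent.
    return "\n".join([
        f"interface {i['interface']}",
        f"   description {i['description']}",
        f"   switchport access vlan {i['vlan']}",
    ])

RENDERERS = {"nxos": render_nxos_style, "eos": render_eos_style}

def provision(intent: dict, os_name: str) -> str:
    """Select the renderer for the target OS; the normalized intent is stable."""
    return RENDERERS[os_name](intent)

print(provision(intent, "nxos"))
```

Because workflows operate on the normalized intent, migrating to a different network platform means swapping a renderer, not rewriting every workflow.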

High-Code Automation Reached Its Practical Limits

The infrastructure engineering team had deep expertise and had built meaningful automation with scripts and playbooks. But as demand increased, the overhead of maintaining custom integrations and updating code for external dependency changes became a major constraint.

Instead of focusing on business value, the team spent time on automation upkeep, integration maintenance, and platform drift.

It becomes a full-time job… updates are not really anything of value, they’re simply things that have to happen because external dependencies changed.

Lead Infrastructure Architect

Why They Chose Itential

As the organization evaluated how to scale operational execution across multiple data centers, they were clear about what they needed and what they wanted to avoid.

They did not want another point tool or a scripting framework that increased engineering burden. They needed orchestration that could productize automation with governance, reuse, and scale built in. They also needed an event-driven operational model where alerts could trigger diagnostics, notifications, and automated actions.

Several criteria shaped the decision.

Scale Beyond a Single Site

The team needed workflows that could be repeated across current and future data centers without cloning and forking automation per location.

Low-Code Accessibility Without Sacrificing Technical Depth

Itential enabled technical teams to build complex workflows while making them accessible to broader teams through a low-code model, expanding who could safely execute tasks.

Leverage Existing Automation Investments

The platform could orchestrate existing Python and Ansible automation rather than requiring a rewrite. This preserved prior investments while enabling modernization.

Vendor-Agnostic Orchestration Layer

The ability to abstract network operations through normalized data models reduced lock-in risk and ensured long-term flexibility as vendor strategies evolved.

Operational Overhead Mattered

A SaaS deployment model reduced the burden of managing yet another platform while still supporting on-premises connectivity through gateway deployment where needed.

Event-Driven Diagnostics & Automated Response

To reduce time-to-triage and enable faster resolution, the team prioritized workflows that automatically collect evidence, enrich tickets, notify the right teams, and prepare for remediation actions based on severity.

Together, these capabilities allowed the organization to shift from one-off automation to a standardized orchestration operating model that could scale with both infrastructure growth and operational demand.

Standardizing Operational Execution Across the AI Infrastructure Stack

The architectural shift came from standardizing operational execution into reusable workflows that integrate across the company’s ecosystem. Each workflow can:

  • Trigger automatically from events
  • Collect and normalize diagnostic evidence
  • Update systems of record
  • Create or enrich tickets
  • Guide execution for operations teams
  • Preserve an audit trail

This created a foundation for closed-loop operations where detection, diagnostics, escalation, and remediation could be coordinated in a repeatable way.
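The event-to-action flow above can be sketched as a simple dispatcher: an alert payload drives a fixed, auditable sequence, with remediation queued only above a severity threshold. The step names and alert fields here are hypothetical, chosen to mirror the list above rather than any actual Itential workflow definition.

```python
# Hypothetical event-to-action sketch: an alert triggers a fixed sequence
# (collect evidence, enrich ticket, notify) and queues remediation only
# for critical severity. Step names and fields are illustrative.

def run_event_workflow(alert: dict) -> list:
    audit = []  # preserved audit trail of every step taken
    audit.append(("collect_evidence", alert["device"]))
    audit.append(("enrich_ticket", alert.get("ticket", "new")))
    audit.append(("notify", alert["team"]))
    if alert["severity"] == "critical":
        # Remediation is gated on severity rather than run unconditionally.
        audit.append(("queue_remediation", alert["device"]))
    return audit

steps = run_event_workflow(
    {"device": "gpu-node-17", "severity": "critical", "team": "dc-ops"}
)
print(steps)
```

Because every step lands in the audit trail, the same record that drives execution also satisfies the governance and visibility requirements described earlier.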

We’re trying to save time and focus it more on the DC technicians… empower them with more access and testability.

Infrastructure Operations Leader

Orchestrating Multi-Site Operations, Diagnostics, & Provisioning at Scale

With Itential as the orchestration foundation, the organization prioritized several high-impact workflows:

Automated GPU Thermal Diagnostics & Response

When thermal alerts occur, workflows can automatically collect diagnostic evidence through hardware APIs, enrich tickets, and initiate operational response without manual coordination.

Source of Truth Synchronization

Workflows can programmatically populate and synchronize inventory and host-level data using APIs, enabling more accurate infrastructure context for operations and automation.

Self-Service Operational Execution

Approved workflows can be executed by operations and customer experience teams, reducing escalations and improving ticket close rates over time.

Fabric Deployment & Configuration Management

Network workflows support configuration backups, validation, and repeatable changes across a multi-site fabric, while preserving flexibility for future vendor shifts.

Customer Provisioning Orchestration

End-to-end activation workflows can coordinate across network, security, and interconnect providers, reducing time-to-provision and improving delivery consistency.

Measurable Results Across Operations & Infrastructure Delivery

Moving from manual processes and site-specific tooling to orchestrated workflows produced outcomes that were both immediate and structural.

Improved Operations

Faster Diagnostics & Reduced Escalation Volume

By automating evidence collection and standardizing response execution, the organization reduced time spent on diagnostic tasks and improved operational throughput.

Enabled Self-Service

Improved Internal Ticket Close Rates

A key operational objective was increasing internal ticket close rates by enabling self-service execution and reducing dependency on infrastructure engineering for routine tasks.

Reduced Cost & Risk

Greater Flexibility & Reduced Vendor Risk

By building OS-agnostic workflows and vendor abstraction into their operational model, the company reduced the long-term cost and risk of vendor transitions.

Increased Capacity

Reduced Engineering Maintenance Burden

The engineering team regained capacity by shifting from maintaining brittle custom integrations to building scalable, reusable workflows with governance.

What’s Next

With foundational workflows in place, the organization plans to expand orchestration across additional sites, operational processes, and advanced use cases, including deeper event-driven operations, closed-loop automation, and agentic operations with Itential FlowAI. They also plan to extend orchestration into the hardware lifecycle, including RMA workflows that automate evidence collection, coordinate replacement processes, restore configurations, validate post-replacement state, and standardize documentation.
