Neocloud and AI data center operators are scaling GPU infrastructure faster than operational capacity. While most teams have strong automation, it often breaks down as environments grow: workflows become site-specific, integrations become brittle, operations teams depend on engineering, and incident response becomes manual and inconsistent.
In this on-demand technical webinar, we show how leading AI infrastructure teams use unified orchestration to connect inventory, monitoring, ticketing, and infrastructure APIs into repeatable, governed workflows that scale across sites and teams. The webinar closes with a demo of an end-to-end event-driven workflow: an alert triggers automated diagnostics collection, enriches the incident with NetBox context, creates or updates a ticket, and optionally initiates a remediation workflow with validation and audit trails.
You’ll learn how to:
- Build event-driven workflows that turn alerts into action
- Orchestrate across systems, including inventory, monitoring, ticketing, and infrastructure APIs
- Standardize execution using reusable workflow services
- Govern operations with RBAC, approvals, and audit trails
- Integrate existing Ansible and Python automation without rewrites
- Scale the same workflow model across sites and environments
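The alert-to-ticket flow described above can be sketched as a minimal pipeline. Everything here (the `Alert` shape, function names, and the dictionary-based inventory and ticket store) is a hypothetical illustration, not the Itential API:

```python
# Hypothetical sketch of the event-driven flow: alert -> diagnostics ->
# inventory enrichment -> ticket. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Alert:
    node: str      # e.g. "dc1-gpu-node-07"
    metric: str    # e.g. "gpu_temp_c"
    value: float

def collect_diagnostics(alert: Alert) -> dict:
    # A real workflow would call the node's BMC/DCGM APIs here; simulated.
    return {"node": alert.node, "metric": alert.metric, "reading": alert.value}

def enrich_with_inventory(diag: dict, inventory: dict) -> dict:
    # Look up site/cluster context for the node (NetBox-style enrichment).
    diag["site"] = inventory.get(diag["node"], {}).get("site", "unknown")
    return diag

def open_or_update_ticket(diag: dict, tickets: list) -> dict:
    # Stand-in for a ServiceNow create/update call.
    ticket = {
        "id": len(tickets) + 1,
        "summary": f"Thermal event on {diag['node']}",
        "context": diag,
    }
    tickets.append(ticket)
    return ticket

def handle_alert(alert: Alert, inventory: dict, tickets: list) -> dict:
    diag = enrich_with_inventory(collect_diagnostics(alert), inventory)
    return open_or_update_ticket(diag, tickets)
```

In the platform, each of these steps is a governed workflow task with its own audit trail rather than a bare function call.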
Demo Notes
(So you can skip ahead, if you want.)
00:40 AI Infrastructure Scaling Challenges
02:26 Unified Orchestration Solutions
04:28 Itential Platform Overview
06:44 Demo Architecture Setup
08:25 Connecting Inventory Manager
10:37 Automation Gateway Configuration
11:09 Workflow Studio Walkthrough
14:28 Event-Driven Workflow Demo
18:13 Wrap Up & Analytics
Dan Sullivan • 00:01
Hi everyone, welcome to the Itential webinar. The topic today is from scripts to scalable orchestration in AI data centers. My name is Dan Sullivan. I lead the solutions engineering team here at Itential. And we’re going to share with you a little bit of information that we’ve gathered and some insights from our work with some of our AI data center customers today. So let’s just kind of jump into this. You know, I think unless you’re living under a rock, everyone is affected by AI these days, right?
Dan Sullivan • 00:40
And so as a consequence of that, you know, GPU infrastructure is scaling quite a bit. But what we have seen is a lot of our customers are struggling with the operations side. They all have automation. But as their environments grow, new data centers and things like that, they are sort of affected by their lack of centralized solutions, their inability to have sort of a centralized way to do a lot of the key operations functions that they need, right? So whether it’s brittle integrations, just too much glue code to deal with API changes, everything is changing these days. Vendors are releasing new APIs all the time. It’s really hard to keep up when it’s all sort of manually coded.
Dan Sullivan • 01:33
You know, routine diagnostic and remediation tasks are requiring critical engineering staff, right? So, the lack of orchestration in that space really puts more demand on the engineering teams. And these days, although these companies are scaling out infrastructure-wise, I would say that from an engineering perspective, they’re probably always understaffed. You know, things like just too many humans in the loop for manual incidents are taking hours, right? And in the GPU business, time is money, right? When they have a thermal incident or they’ve got to take a node out of service or perform some sort of RMA, the longer that takes, the less money they’re making, right? So what’s the answer to this, right?
Dan Sullivan • 02:26
And I think really it’s around unified orchestration, right? Trying to connect every system and scale it across every site. Introducing some rigor in the way that you’re handling processes across sites, whether it’s connecting your inventory, monitoring, ticketing, and infrastructure APIs across sites in a single platform. Being able to orchestrate events and take alarms and alerts into automated workflows across systems and sites is really important. At the same time, while we’re trying to do all this, while we’re trying to connect to all these disparate systems and manage these diverse sets of APIs, we also have to be really careful security-wise, right? Things like RBAC, approvals, and audit trails need to be built into everything that’s happening. We need to be able to monitor what’s happening, why, for how long, and who did it.
Dan Sullivan • 03:30
And of course, scale, right? These companies are scaling every day. New data centers are popping up, even within the data centers themselves, right? So we need to be able to deploy a scalable solution across every site. And we have to have some ability to work in a multi-vendor environment as well, right? All these data centers have disparate sets of tools and vendors and things like that. So effectively, what we really want to do is kind of take that site sprawl and have a standardized way to go about it, taking all those diverse sets of roles and responsibilities and try to centralize that and make that a process that can be orchestrated, where we have really centralized visibility.
Dan Sullivan • 04:28
The inventory itself is actionable. We can orchestrate the execution and we can have an audit trail of exactly what we’re doing. So, what’s the Itential answer to this? Well, it’s basically the Itential platform, right? We’re the agentic orchestration platform for infrastructure. We are AI-ready by design. From an orchestration perspective, we can connect with pretty much any IT system out there, allowing you to build these end-to-end workflows, these end-to-end orchestrations that don’t really require humans in the loop.
Dan Sullivan • 05:10
And we think that we can deploy that faster and scale smarter with our SaaS platform, and we’ll show you some of that today. Here’s kind of the 50,000-foot view of the Itential platform. I think you can see here that there is quite a diverse set of capabilities in the platform today. But central to that is really the platform itself, right? The ability to offer RBAC, audit and logging, eventing, secrets management, archival support. Those are sort of the fundamentals that the platform is built on top of. And after that, there are agentic reasoning capabilities, and that’s something that is going to be released fairly soon to a lot of our customers.
Dan Sullivan • 05:57
And we even have some customers beta testing that today. We have sort of our lifecycle and product definition. And then service orchestration, configuration compliance, and our automation execution. And today we’ll be focused mostly on the service orchestration and the automation execution. Now, also central to the Itential platform is the Itential Automation Gateway. And the gateway is how you can connect the orchestrator to disparate sites. So, for example, you might have multiple data centers, each with their own automation gateway cluster, so that you can scale out Python scripts or Ansible artifacts or pretty much any other program that you need to execute and onboard.
Dan Sullivan • 06:44
And everything that’s onboarded into the gateway is then federated directly into the platform, making it something that we can build in as part of the workflow. So, for a demo architecture today, we’re going to use the Itential platform and we’re going to kind of simulate a little bit of an AIOps event here. We have ServiceNow deployed as well, and we’re even going to talk to Slack. But we’re going to focus on the inventory manager, the Design Studio, and Gateway Manager. So, the inventory manager is where we’ll have a centralized inventory. As I mentioned, we’ve kind of simulated that a bit. Although everyone loves to see real hardware, having four data centers filled with AI nodes is probably not cost-effective for the purposes of the webinar.
Dan Sullivan • 07:40
So, we’ve got a little bit of that simulated. We’ll show you some of that and walk through that in the Design Studio. We’ll take a look at Gateway Manager as well. So, we’ve actually got four different gateways deployed, each one sort of simulating an AI data center. So that’s kind of the demo architecture for today, and I’ll sort of walk you through that and show you what we’ve got for you today. So I’m just going to share my screen and then we’ll take a look at the platform and see what else we’ve got here. All right, it looks like everyone can see my screen here.
Dan Sullivan • 08:25
So I’m in the Itential Cloud, so we’ve got multiple environments here, but we’re actually working in this PAC environment. And as part of the cloud platform, we also have some analytics and some operations and tasks views here. So you can actually see what each of your instances is doing. And maybe if we have a chance, we’ll jump into that. But I’ll just jump into the platform here. So the first thing we’re going to talk a little bit about is inventory. So I have our inventory manager set up.
Dan Sullivan • 08:57
And the inventory manager is effectively just ingesting inventory from external systems and then making it functional so that we can apply it to Python scripts and Ansible inventory or perhaps even device compliance, that sort of thing. So as I mentioned, I’ve got four data centers simulated here, and they each have a varying number of nodes in there. So if I click on one of these, what you’ll notice is that all the assets are fully tagged. And you’ll see that each of the data center devices has a prefix equal to the data center that it’s hosted in. So that’ll help us when we’re getting incoming events: we’ll want to dispatch off of the node name and figure out the data center and the cluster, effectively. So if you click on one of these, you’ll also notice that every device has a cluster ID associated with it.
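The prefix-based dispatch described here can be sketched as follows. The naming convention (`dc2-gpu-node-14`) and the data-center-to-cluster map are assumptions for illustration, not the exact scheme used in the demo:

```python
# Hypothetical sketch: derive the data center and gateway cluster
# from a prefixed device name like "dc2-gpu-node-14".

CLUSTER_BY_DC = {  # assumed mapping: data center prefix -> gateway cluster
    "dc1": "iag-cluster-1",
    "dc2": "iag-cluster-2",
    "dc3": "iag-cluster-3",
    "dc4": "iag-cluster-4",
}

def dispatch_target(device_name: str) -> tuple:
    """Return (data_center, gateway_cluster) for an incoming event's device."""
    dc = device_name.split("-", 1)[0]  # the prefix encodes the data center
    cluster = CLUSTER_BY_DC.get(dc)
    if cluster is None:
        raise ValueError(f"Unknown data center prefix: {dc!r}")
    return dc, cluster
```

In the platform this lookup is driven by the tagged inventory rather than a hard-coded map, but the dispatch idea is the same.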
Dan Sullivan • 09:53
So that’s the IAG 5 instance, the automation gateway where that device is hosted or where it can be referenced. So we’ve got a fair number of devices and we’re going to sort of simulate some of the device functionality today. So we’ve got four different inventories. And then if we go back, we’ll also go and take a quick look at Gateway Manager. So within the Gateway Manager today, we have multiple instances of IAG set up. And these are all live. And each one is sort of simulating a NeoCloud data center.
Dan Sullivan • 10:37
So we’ve got four of those. And of course, the inventories that we just looked at are tied to each one. So now, when we get a particular event in, which is kind of part of the demo, where we’ll simulate an event, we’ll then dispatch off of that device name and figure out the appropriate gateway to use. And we’ll actually execute a service on the gateway. So let me go back here. And we can go right into studio, but I’ve actually kind of got a little project that I put together for the demo. So we’ll just start here quickly.
Dan Sullivan • 11:09
So, we’ve got some workflows here. We have a top-level workflow and a couple of child jobs. And this one is kind of the interesting one here. So, in this case, you’ll see what we’re doing here is taking a device name in and building a filter, an inventory filter. And then, this particular task is going to run our service on the IAG. So, we have a Python script actually deployed there.
Dan Sullivan • 11:39
And it’s going to return some data. And down here, you’ll see that we’re actually passing the inventory filter. So, we’ll tell the script which device we’re looking for, and then we’ll pass the inventory in. It’ll find the appropriate device and then use that when it’s going through the steps it needs to do to gather information. Again, that part is simulated, but the dispatch and the IAG 5 instances are actually deployed. Now, they’re deployed using mutual TLS, so you have secure connectivity between the orchestrator and the gateway itself.
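The inventory-filter handling inside such a script might look like the sketch below. The filter shape, device fields, and the simulated thermal readings are assumptions; a real script would query the node (e.g. via DCGM or ipmitool) instead of returning canned data:

```python
# Hypothetical sketch of the diagnostics script's inventory handling:
# given the full inventory and a filter, find the target device and
# return (simulated) thermal readings for it.

def select_device(inventory: list, inventory_filter: dict) -> dict:
    """Return the first device matching every key/value in the filter."""
    for device in inventory:
        if all(device.get(k) == v for k, v in inventory_filter.items()):
            return device
    raise LookupError(f"No device matches filter {inventory_filter}")

def collect_thermal_data(device: dict) -> dict:
    # Real code would query the GPU node here; values simulated for the demo.
    return {"device": device["name"], "gpu_temp_c": 91.0, "fan_rpm": 5200}
```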
Dan Sullivan • 12:19
And so, now we have this sort of top-level workflow. And our top-level workflow does a few different things. It starts off with an event. So, it gets an event from an external system, basically an AIOps system. And I’ve kind of simulated what the event might look like. And the first thing we’ll do is figure out which GPU node might be affected. We’ll use a data transformation to figure out the particular node that’s affected and what data center it lives in.
Dan Sullivan • 12:54
We’ll create a ServiceNow incident. We’ll run some of the data through a Jinja2 template just to make it a little bit easier to look at. We’ll go off here and collect that thermal data from the GPU itself. We’ll update the ServiceNow incident. We will actually run through and send a Slack notification. I have a manual task in here at the end to give you a sort of visual representation of what’s happening. And a lot of the Jinja2 tasks are just formatting data so you can see it visually.
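The demo uses Jinja2 templates for this formatting step; the sketch below shows the same idea with the standard library's `string.Template` so it stays dependency-free. The field names in the event are illustrative, not the demo's exact schema:

```python
# Render event data into a human-readable incident note, as the
# Jinja2 tasks do in the demo. Field names here are assumptions.
from string import Template

NOTE = Template(
    "Thermal event on $node\n"
    "GPU: $gpu_uuid\n"
    "Reading: $temp_c C (threshold $threshold_c C)"
)

def format_incident_note(event: dict) -> str:
    """Produce the text that gets attached to the ServiceNow incident."""
    return NOTE.substitute(event)
```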
Dan Sullivan • 13:32
Obviously, you can send an email or Slack or whatever you want. And this particular thing is just one example. So there could be any number of events. It could be some sort of reachability issue where you want to run some complex diagnostics against a particular switch, or maybe you’re having issues with a specific rack in your data center and you want to do some triage with that. Well, now you can, using the inventory, key off of the particular rack, find all the affected devices, run complex diagnostics, and open a ServiceNow incident if you need to. So this sort of methodology is repeatable across varying types of events. So again, typically we would expect that this would be hit from an external API or something like that.
Dan Sullivan • 14:28
But for now, if I jump into debug mode here, you’ll see that the event is the input to this. And I’ve sort of just cut and pasted an event with some details in here that you might expect. It’s saying that a particular node in the data center is having the issue, and which of the GPUs on that node are affected, and some data about the UUID for that particular device, and then some information about the threshold. So we’re going to inject all this data when we run it. It’s going to inject all this into the orchestration, kick it off, and run through it.
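An event payload along the lines described might look like the sketch below. The field names are illustrative guesses at the shape, not the exact JSON shown in the demo, and the validation helper is hypothetical:

```python
# Illustrative event payload and a minimal validation step for the
# fields the workflow dispatches on. Field names are assumptions.
import json

EXAMPLE_EVENT = {
    "node": "dc2-gpu-node-14",
    "data_center": "dc2",
    "affected_gpus": [3],
    "gpu_uuid": "GPU-9f1c2d3e",
    "metric": "gpu_temp_c",
    "value": 94.0,
    "threshold": 85.0,
}

def parse_event(raw: str) -> dict:
    """Parse an incoming alert and check the fields the workflow needs."""
    event = json.loads(raw)
    for key in ("node", "metric", "value", "threshold"):
        if key not in event:
            raise KeyError(f"Event missing required field: {key}")
    return event
```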
Dan Sullivan • 15:09
So let’s just let it run here. So you’ll see the green checkmarks are tasks that have just been completed. It’s running pretty fast. So it’s been through most of these. It’s off now trying to send a Slack notification. So first off, what we’ll look for here is that we should have created an incident in ServiceNow. That’s probably the most important thing that’s happened here, the fact that we’ve logged an incident in ServiceNow.
Dan Sullivan • 15:48
So let’s take a look here. So if I refresh this, I see that I’ve got a new incident logged in ServiceNow. If I click on this, what you’ll see as we go down here is that we’ve taken the simulated data we got from the Python script that we wrote, formatted it a little bit, and enriched the incident here. And then we’ve also enriched it with the event data that we got. So the event that we got in as input, we also logged that in there. So that’s in the incident. So at least you have some initial documentation in the incident, so that whoever actually has to go and act on this has some data and has a bit of a head start.
Dan Sullivan • 16:43
So if I go back to Operations Manager here, you’ll notice we’re on a manual task. So again, you probably wouldn’t have a manual task in here, but this just kind of shows you what’s going on: the particular device that’s failing, and some notes that a ServiceNow incident was created, just so that everyone knows what happened. You might even send something like this in an email. And then we also sent it out via Slack. So you can see that it shows that the incident was created and we’ve got some markdown-formatted output showing exactly what happened.
Dan Sullivan • 17:24
And this is just, again, really taking what a lot of times is human tasks, people cutting and pasting, and orchestrating the whole thing. And from an event perspective, this is just one example, but we have customers doing this at scale in a lot of environments, including NeoCloud data centers and that sort of thing. So the other thing here is, once we’ve run through this, the other sort of benefit that we get from the platform is the archival and the statistics associated with it, which we talked a little bit about. So if I go back here to the cloud tab, what you notice is we have our jobs and tasks. So you can actually see what’s being done, what jobs are running. So you can see that thermal event
Dan Sullivan • 18:13
just processed, and this is all saved in our cloud. And then we also have some support around analytics. So, for example, if I want to look at the last seven days, you’ll notice that the Process GPU Thermal Event workflow count is quite high. I guess that’s good and bad. But we can actually see what’s going on. We can drill down into some of these events and see what it looks like, what each of the child jobs might look like, how long they took, their elapsed time, those sorts of things. So, over time, you can see trending develop. You can put SLAs on some of these workflows and track that. So, if you’re not handling things as quickly as you need to, or perhaps something has changed within your data center or within your environment, you can sort of see that in Insights. So, that’s really all I had to show today.
Dan Sullivan • 19:12
But again, we have the ability to do this at scale. Okay, so the Itential platform is trusted by a lot of our next-generation cloud infrastructure providers. They’re really seeing fewer escalations and faster incident triage. And I think we, along with other industries, have been doing this for quite a while. So hopefully, we’re able to significantly reduce the dependency on humans involved in the process. And I think if you do that, then you start to see the reduction in operational time
Dan Sullivan • 20:01
And less time spent maintaining the code. Over time, what you will see is more and more AI brought into this so that agents will do a lot of the work, and we will need less intervention by humans to do some of these things. And that’s about it for today. Thanks again for your time.