Neocloud and AI data center operators are scaling GPU infrastructure faster than operational capacity. While most teams have strong automation, it often breaks down as environments grow: workflows become site-specific, integrations become brittle, operations teams depend on engineering, and incident response becomes manual and inconsistent.
In this on-demand technical webinar, we show how leading AI infrastructure teams use unified orchestration to connect inventory, monitoring, ticketing, and infrastructure APIs into repeatable, governed workflows that scale across sites and teams. The webinar closes with a demo of an end-to-end event-driven workflow: an alert triggers automated diagnostics collection, enriches the incident with NetBox context, creates or updates a ticket, and optionally initiates a remediation workflow with validation and audit trails.
You’ll learn how to:
- Build event-driven workflows that turn alerts into action
- Orchestrate across systems, including inventory, monitoring, ticketing, and infrastructure APIs
- Standardize execution using reusable workflow services
- Govern operations with RBAC, approvals, and audit trails
- Integrate existing Ansible and Python automation without rewrites
- Scale the same workflow model across sites and environments
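The alert-to-ticket flow described above can be sketched as a minimal pipeline. Everything here (the `Alert` shape, function names, and the dictionary-based inventory and ticket store) is a hypothetical illustration, not the Itential API:

```python
# Hypothetical sketch of the event-driven flow: alert -> diagnostics ->
# inventory enrichment -> ticket. All names here are illustrative.
from dataclasses import dataclass

@dataclass
class Alert:
    node: str      # e.g. "dc1-gpu-node-07"
    metric: str    # e.g. "gpu_temp_c"
    value: float

def collect_diagnostics(alert: Alert) -> dict:
    # A real workflow would call the node's BMC/DCGM APIs here; simulated.
    return {"node": alert.node, "metric": alert.metric, "reading": alert.value}

def enrich_with_inventory(diag: dict, inventory: dict) -> dict:
    # Look up site/cluster context for the node (NetBox-style enrichment).
    diag["site"] = inventory.get(diag["node"], {}).get("site", "unknown")
    return diag

def open_or_update_ticket(diag: dict, tickets: list) -> dict:
    # Stand-in for a ServiceNow create/update call.
    ticket = {
        "id": len(tickets) + 1,
        "summary": f"Thermal event on {diag['node']}",
        "context": diag,
    }
    tickets.append(ticket)
    return ticket

def handle_alert(alert: Alert, inventory: dict, tickets: list) -> dict:
    diag = enrich_with_inventory(collect_diagnostics(alert), inventory)
    return open_or_update_ticket(diag, tickets)
```

In the platform, each of these steps is a governed workflow task with its own audit trail rather than a bare function call.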
Demo Notes
(So you can skip ahead, if you want.)
00:40 AI Infrastructure Scaling Challenges
02:26 Unified Orchestration Solutions
04:28 Itential Platform Overview
06:44 Demo Architecture Setup
08:25 Connecting Inventory Manager
10:37 Automation Gateway Configuration
11:09 Workflow Studio Walkthrough
14:28 Event-Driven Workflow Demo
18:13 Wrap Up & Analytics
Dan Sullivan • 00:01
Hi everyone, welcome to the Itential webinar. The topic today is from scripts to scalable orchestration in AI data centers. My name is Dan Sullivan. I lead the solutions engineering team here at Itential. And we’re going to share with you a little bit of information that we’ve gathered and some insights from our work with some of our AI data center customers today. So let’s just kind of jump into this. You know, I think unless you’re living under a rock, everyone is affected by AI these days, right?
Dan Sullivan • 00:40
And so as a consequence of that, you know, GPU infrastructure is scaling quite a bit. But what we have seen is a lot of our customers are struggling with the operations side. They all have automation. But as their environments grow, new data centers and things like that, they are sort of affected by their lack of centralized solutions, their inability to have sort of a centralized way to do a lot of the key operations functions that they need, right? So whether it’s brittle integrations, just too much glue code to deal with API changes, everything is changing these days. Vendors are releasing new APIs all the time. It’s really hard to keep up when it’s all sort of manually coded.
Dan Sullivan • 01:33
You know, routine diagnostic and remediation tasks are requiring critical engineering staff, right? So, the lack of orchestration in that space really puts more demand on the engineering teams. And these days, although these companies are scaling out infrastructure-wise, I would say that from an engineering perspective, they’re probably always understaffed. You know, things like just too many humans in the loop for manual incidents are taking hours, right? And in the GPU business, time is money, right? When they have a thermal incident or they’ve got to take a node out of service or perform some sort of RMA, the longer that takes, the less money they’re making, right? So what’s the answer to this, right?
Dan Sullivan • 02:26
And I think really it’s around unified orchestration, right? Trying to connect every system and scale it across every site. Introducing some rigor in the way that you’re handling processes across sites, whether it’s connecting your inventory, monitoring, ticketing, and infrastructure APIs across sites in a single platform. Being able to orchestrate events and take alarms and alerts into automated workflows across systems and sites is really important. At the same time, while we’re trying to do all this, while we’re trying to connect to all these disparate systems and manage these diverse sets of APIs, we also have to be really careful security-wise, right? Things like RBAC, approvals, and audit trails need to be built into everything that’s happening. We need to be able to monitor what’s happening, why, for how long, and who did it.
Dan Sullivan • 03:30
And of course, scale, right? These companies are scaling every day. New data centers are popping up, even within the data centers themselves, right? So we need to be able to deploy a scalable solution across every site. And we have to have some ability to work in a multi-vendor environment as well, right? All these data centers have disparate sets of tools and vendors and things like that. So effectively, what we really want to do is kind of take that site sprawl and have a standardized way to go about it, taking all those diverse sets of roles and responsibilities and try to centralize that and make that a process that can be orchestrated, where we have really centralized visibility.
Dan Sullivan • 04:28
The inventory itself is actionable. We can orchestrate the execution and we can have an audit trail of exactly what we’re doing. So, what’s the Itential answer to this? Well, it’s basically the Itential platform, right? We’re the agentic orchestration platform for infrastructure. We are AI-ready by design. From an orchestration perspective, we can connect with pretty much any IT system out there, allowing you to build these end-to-end workflows, these end-to-end orchestrations that don’t really require humans in the loop.
Dan Sullivan • 05:10
And we think that we can deploy that faster and scale smarter with our SaaS platform, and we’ll show you some of that today. Here’s kind of the 50,000-foot view of the Itential platform. I think you can see here that there is quite a diverse set of capabilities in the platform today. But central to that is really the platform itself, right? The ability to offer RBAC, audit and logging, eventing, secrets management, archival support. Those are sort of the fundamentals that the platform is built on top of. And after that, there are agentic reasoning capabilities, and that’s something that is going to be released fairly soon to a lot of our customers.
Dan Sullivan • 05:57
And we even have some customers beta testing that today. We have sort of our lifecycle and product definition. And then service orchestration, configuration compliance, and our automation execution. And today we’ll be focused mostly on the service orchestration and the automation execution. Now, also central to the Itential platform is the Itential Automation Gateway. And the gateway is how you can connect the orchestrator to disparate sites. So, for example, you might have multiple data centers, each with their own automation gateway cluster, so that you can scale out Python scripts or Ansible artifacts or pretty much any other program that you need to execute and onboard.
Dan Sullivan • 06:44
And everything that’s onboarded into the gateway is then federated directly into the platform, making it something that we can build in as part of the workflow. So, for a demo architecture today, we’re going to use the Itential platform and we’re going to kind of simulate a little bit of an AIOps event here. We have ServiceNow deployed as well, and we’re even going to talk to Slack. But we’re going to focus on the inventory manager, the Design Studio, and Gateway Manager. So, the inventory manager is where we’ll have a centralized inventory. As I mentioned, we’ve kind of simulated that a bit. Although everyone loves to see real hardware, having four data centers filled with AI nodes is probably not cost-effective for the purposes of the webinar.
Dan Sullivan • 07:40
So, we’ve got a little bit of that simulated. We’ll show you some of that and walk through that in the Design Studio. We’ll take a look at Gateway Manager as well. So, we’ve actually got four different gateways deployed, each one sort of simulating an AI data center. So that’s kind of the demo architecture for today, and I’ll sort of walk you through that and show you what we’ve got for you today. So I’m just going to share my screen and then we’ll take a look at the platform and see what else we’ve got here. All right, it looks like everyone can see my screen here.
Dan Sullivan • 08:25
So I’m in the Itential Cloud, so we’ve got multiple environments here, but we’re actually working in this PAC environment. And as part of the cloud platform, we also have some analytics and some operations and tasks views here. So you can actually see what each of your instances is doing. And maybe if we have a chance, we’ll jump into that. But I’ll just jump into the platform here. So the first thing we’re going to talk a little bit about is inventory. So I have our inventory manager set up.
Dan Sullivan • 08:57
And the inventory manager is effectively just ingesting inventory from external systems and then making it functional so that we can apply it to Python scripts and Ansible inventory or perhaps even device compliance, that sort of thing. So as I mentioned, I’ve got four data centers simulated here, and they each have a varying number of nodes in there. So if I click on one of these, what you’ll notice is that all the assets are fully tagged. And you’ll see that each of the data center devices has a prefix equal to the data center that it’s hosted in. So that’ll help us when we’re getting incoming events: we’ll want to dispatch off of the node name and figure out the data center and the cluster, effectively. So if you click on one of these, you’ll also notice that every device has a cluster ID associated with it.
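The prefix-based dispatch described here can be sketched as follows. The naming convention (`dc2-gpu-node-14`) and the data-center-to-cluster map are assumptions for illustration, not the exact scheme used in the demo:

```python
# Hypothetical sketch: derive the data center and gateway cluster
# from a prefixed device name like "dc2-gpu-node-14".

CLUSTER_BY_DC = {  # assumed mapping: data center prefix -> gateway cluster
    "dc1": "iag-cluster-1",
    "dc2": "iag-cluster-2",
    "dc3": "iag-cluster-3",
    "dc4": "iag-cluster-4",
}

def dispatch_target(device_name: str) -> tuple:
    """Return (data_center, gateway_cluster) for an incoming event's device."""
    dc = device_name.split("-", 1)[0]  # the prefix encodes the data center
    cluster = CLUSTER_BY_DC.get(dc)
    if cluster is None:
        raise ValueError(f"Unknown data center prefix: {dc!r}")
    return dc, cluster
```

In the platform this lookup is driven by the tagged inventory rather than a hard-coded map, but the dispatch idea is the same.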
Dan Sullivan • 09:53
So that’s the IAG 5 instance, the automation gateway where that device is hosted or where it can be referenced. So we’ve got a fair number of devices and we’re going to sort of simulate some of the device functionality today. So we’ve got four different inventories. And then if we go back, we’ll also go and take a quick look at Gateway Manager. So within the Gateway Manager today, we have multiple instances of IAG set up. And these are all live. And each one is sort of simulating a NeoCloud data center.
Dan Sullivan • 10:37
So we’ve got four of those. And of course, the inventories that we just looked at are tied to each one. So now, when we get a particular event in, which is kind of part of the demo, where we’ll simulate an event, we’ll then dispatch off of that device name and figure out the appropriate gateway to use. And we’ll actually execute a service on the gateway. So let me go back here. And we can go right into studio, but I’ve actually kind of got a little project that I put together for the demo. So we’ll just start here quickly.
Dan Sullivan • 11:09
So, we’ve got some workflows here. We have a top-level workflow and a couple of child jobs. And this one is kind of the interesting one here. So, in this case, you’ll see what we’re doing here is taking a device name in and building a filter, an inventory filter. And then, this particular task is going to run our service on the IAG. So, we have a Python script actually deployed there.
Dan Sullivan • 11:39
And it’s going to return some data. And down here, you’ll see that we’re actually passing the inventory filter. So, we’ll tell the script which device we’re looking for, and then we’ll pass the inventory in. It’ll find the appropriate device and then use that when it’s going through the steps it needs to do to gather information. Again, that part is simulated, but the dispatch and the IAG 5 instances are actually deployed. Now, they’re deployed using mutual TLS, so you have secure connectivity between the orchestrator and the gateway itself.
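The inventory-filter handling inside such a script might look like the sketch below. The filter shape, device fields, and the simulated thermal readings are assumptions; a real script would query the node (e.g. via DCGM or ipmitool) instead of returning canned data:

```python
# Hypothetical sketch of the diagnostics script's inventory handling:
# given the full inventory and a filter, find the target device and
# return (simulated) thermal readings for it.

def select_device(inventory: list, inventory_filter: dict) -> dict:
    """Return the first device matching every key/value in the filter."""
    for device in inventory:
        if all(device.get(k) == v for k, v in inventory_filter.items()):
            return device
    raise LookupError(f"No device matches filter {inventory_filter}")

def collect_thermal_data(device: dict) -> dict:
    # Real code would query the GPU node here; values simulated for the demo.
    return {"device": device["name"], "gpu_temp_c": 91.0, "fan_rpm": 5200}
```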
Dan Sullivan • 12:19
And so, now we have this sort of top-level workflow. And our top-level workflow does a few different things. It starts off with an event. So, it gets an event from an external system, basically an AIOps system. And I’ve kind of simulated what the event might look like. And the first thing we’ll do is figure out which GPU node might be affected. We’ll use a data transformation to figure out the particular node that’s affected and what data center it lives in.
Dan Sullivan • 12:54
We’ll create a ServiceNow incident. We’ll run some of the data through a Jinja2 template just to make it a little bit easier to look at. We’ll go off here and collect that thermal data from the GPU itself. We’ll update the ServiceNow incident. We will actually run through and send a Slack notification. I have a manual task in here at the end to give you a sort of visual representation of what’s happening. And a lot of the Jinja2 tasks are just formatting data so you can see it visually.
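The demo uses Jinja2 templates for this formatting step; the sketch below shows the same idea with the standard library's `string.Template` so it stays dependency-free. The field names in the event are illustrative, not the demo's exact schema:

```python
# Render event data into a human-readable incident note, as the
# Jinja2 tasks do in the demo. Field names here are assumptions.
from string import Template

NOTE = Template(
    "Thermal event on $node\n"
    "GPU: $gpu_uuid\n"
    "Reading: $temp_c C (threshold $threshold_c C)"
)

def format_incident_note(event: dict) -> str:
    """Produce the text that gets attached to the ServiceNow incident."""
    return NOTE.substitute(event)
```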
Dan Sullivan • 13:32
Obviously, you can send an email or Slack or whatever you want. And this particular thing is just one example. So there could be any number of events. It could be some sort of reachability issue where you want to run some complex diagnostics against a particular switch, or maybe you’re having issues with a specific rack in your data center and you want to do some triage with that. Well, now you can, using the inventory, key off of the particular rack, find all the affected devices, run complex diagnostics, and open a ServiceNow incident if you need to. So this sort of methodology is repeatable across varying types of events. So again, typically we would expect that this would be hit from an external API or something like that.
Dan Sullivan • 14:28
But for now, if I jump into debug mode here, you’ll see that the event is the input to this. And I’ve sort of just cut and pasted an event with some details in here that you might expect. It’s saying that a particular node in the data center is having the issue, and which of the GPUs on that node are affected, and some data about the UUID for that particular device, and then some information about the threshold. So we’re going to inject all this data when we run it. It’s going to inject all this into the orchestration, kick it off, and run through it.
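An event payload along the lines described might look like the sketch below. The field names are illustrative guesses at the shape, not the exact JSON shown in the demo, and the validation helper is hypothetical:

```python
# Illustrative event payload and a minimal validation step for the
# fields the workflow dispatches on. Field names are assumptions.
import json

EXAMPLE_EVENT = {
    "node": "dc2-gpu-node-14",
    "data_center": "dc2",
    "affected_gpus": [3],
    "gpu_uuid": "GPU-9f1c2d3e",
    "metric": "gpu_temp_c",
    "value": 94.0,
    "threshold": 85.0,
}

def parse_event(raw: str) -> dict:
    """Parse an incoming alert and check the fields the workflow needs."""
    event = json.loads(raw)
    for key in ("node", "metric", "value", "threshold"):
        if key not in event:
            raise KeyError(f"Event missing required field: {key}")
    return event
```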
Dan Sullivan • 15:09
So let’s just let it run here. So you’ll see the green checkmarks are tasks that have just been completed. It’s running pretty fast. So it’s been through most of these. It’s off now trying to send a Slack notification. So first off, what we’ll look for here is that we should have created an incident in ServiceNow. That’s probably the most important thing that’s happened here, the fact that we’ve logged an incident in ServiceNow.
Dan Sullivan • 15:48
So let’s take a look here. So if I refresh this, I see that I’ve got a new incident logged in ServiceNow. If I click on this, what you’ll see as we go down here is that we’ve taken the simulated data we got from the Python script that we wrote, formatted it a little bit, and enriched the incident here. And then we’ve also enriched it with the event data that we got. So the event that we got in as input, we also logged that in there. So that’s in the incident. So at least you have some initial documentation in the incident, so that whoever actually has to go and act on this has some data and has a bit of a head start.
Dan Sullivan • 16:43
So if I go back to Operations Manager here, you’ll notice we’re on a manual task. So again, you probably wouldn’t have a manual task in here, but this just kind of shows you what’s going on: the particular device that’s failing, and some notes that a ServiceNow incident was created, just so that everyone knows what happened. You might even send something like this in an email. And then we also sent it out via Slack. So you can see that it shows that the incident was created and we’ve got some markdown-formatted output showing exactly what happened.
Dan Sullivan • 17:24
And this is just, again, really taking what a lot of times is human tasks, people cutting and pasting, and orchestrating the whole thing. And from an event perspective, this is just one example, but we have customers doing this at scale in a lot of environments, including NeoCloud data centers and that sort of thing. So the other thing here is, once we’ve run through this, the other sort of benefit that we get from the platform is the archival and the statistics associated with it, which we talked a little bit about. So if I go back here to the cloud tab, what you notice is we have our jobs and tasks. So you can actually see what’s being done, what jobs are running. So you can see that thermal event
Dan Sullivan • 18:13
just processed, and this is all saved in our cloud. And then we also have some support around analytics. So, for example, if I want to look at the last seven days, you’ll notice that the Process GPU Thermal Event workflow count is quite high. I guess that’s good and bad. But we can actually see what’s going on. We can drill down into some of these events and see what it looks like, what each of the child jobs might look like, how long they took, their elapsed time, those sorts of things. So, over time, you can see trending develop. You can put SLAs on some of these workflows and track that. So, if you’re not handling things as quickly as you need to, or perhaps something has changed within your data center or within your environment, you can sort of see that in Insights. So, that’s really all I had to show today.
Dan Sullivan • 19:12
But again, we have the ability to do this at scale. Okay, so the Itential platform is trusted by a lot of our next-generation cloud infrastructure providers. They’re really seeing fewer escalations and faster incident triage. And I think we, along with other industries, have been doing this for quite a while. So hopefully, we’re able to significantly reduce the dependency on humans involved in the process. And I think if you do that, then you start to see the reduction in operational time
Dan Sullivan • 20:01
And less time spent maintaining the code. Over time, what you will see is more and more AI brought into this so that agents will do a lot of the work, and we will need less intervention by humans to do some of these things. And that’s about it for today. Thanks again for your time.