
Agents Are Stateless, Infrastructure Is Not: Here’s How to Bridge the Gap

Principal Architect – AI Solutions & Strategy

Key Points

- AI agents in network operations are stateless by design. The session ends, the context is gone, and the next agent starts from zero.
- The fix is treating infrastructure as resources with lifecycles, not as targets for one-off workflows. FlowAgents reason; Lifecycle Manager remembers.
- Every FlowAgent reads from and writes to a single persistent instance, turning job logs into structured proof and turning a chain of executions into a managed lifecycle.
- Two worked examples (IOS-XR software upgrade and L3VPN service provisioning) show what changes when the resource persists across agents, humans, and time.

There is a version of AI-powered network automation that looks great in a demo and falls apart in production. An agent connects to a device, reasons about what it finds, does something useful, and reports back. Impressive. But then the session ends. The agent forgets everything. The next agent starts from zero. And somewhere in the middle, a human is still manually carrying context from one step to the next, just like before any of this existed.

Nobody talks about this when they talk about AI agents in network operations. The demos work. The architecture doesn’t.

This post is about a pattern that actually works, and why getting it right requires rethinking how we represent infrastructure, not just how we automate it.

The Automation Gap

Network engineers have been automating for decades. Scripts replaced CLI sessions. Orchestration platforms replaced scripts. Now AI agents are starting to take over the judgment calls that scripts never could make: reacting to unexpected state, deciding what matters, adapting to what they find.

But there is a persistent gap that none of these waves fully closed: infrastructure has memory, and our automation doesn’t.

When a router goes through a software upgrade, it carries history. What version it was running, what the network looked like before the maintenance window, what failed three cycles ago. When a service is provisioned, it exists as a living entity with state. Which routers are involved, what IP ranges were allocated, what the customer’s SLA commitment is, which sites have been verified.

Workflows execute and finish. Scripts run and exit. Agents complete their mission and disappear, taking their context with them.

The device remains. The service remains. The automation’s knowledge of them does not.

That is why most “automated” network operations still have a human in the middle, manually carrying context from one step to the next, updating spreadsheets, bridging the gap between what the tools knew and what needs to happen next.

What Agents Changed & What They Didn’t

Goal-based AI agents represent a genuine shift in what automation can do.

Unlike a workflow that follows a predetermined script, an agent receives a mission and figures out how to accomplish it. It selects tools, reasons about what it finds, handles unexpected states, and makes judgment calls. Give a FlowAgent the mission “determine if this IOS-XR router is ready for a software upgrade” and it will connect to the device, collect state, reason about what matters, and produce a structured conclusion, without a human writing a conditional for every edge case.

This is meaningfully different from anything that came before. The agent exercises judgment.

But here is what didn’t change: the agent session ends, and everything it learned is gone.

The next agent, the one that actually executes the upgrade, starts from zero. It doesn’t know what discovery found. It doesn’t know which checks passed. It doesn’t know the platform, the current version, or the network context the first agent carefully assembled.

You can pass variables between agents through an orchestrating workflow. But that is a fragile chain. One failure breaks the context. And more importantly, you have not created a record of what happened to this specific device. You have created a job log. Those are not the same thing.

The Insight: Resources, Not Tasks

The shift that changes the architecture is conceptual before it is technical.

A device going through an upgrade is not a target for a series of tasks. It is a resource with a lifecycle.

A resource has identity. It exists before the automation runs and after it finishes. It accumulates state over time. It has a history. It can be queried. Other systems, and other agents, can read from it.

This is a different mental model than running a workflow against a device. It is the difference between doing something to infrastructure and managing infrastructure through a defined lifecycle.

When you model infrastructure as resources:

- Every operation becomes a state transition, not just an execution.
- Every agent’s output is recorded as structured state, not just a log entry.
- The resource represents what is, not just what happened.
- Any agent, or any human, can read the resource and understand where it is in its lifecycle.
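
A minimal Python sketch of this idea, not the Lifecycle Manager API: the state names, the `ALLOWED` table, and the `Resource` class are all hypothetical. It only shows what it means for an operation to be a checked state transition that leaves a history behind.

```python
from dataclasses import dataclass, field
from datetime import datetime, timezone

# Hypothetical lifecycle for illustration only; real state names would come
# from your own resource model.
ALLOWED = {
    "discovered": {"prechecked"},
    "prechecked": {"upgraded", "blocked"},
    "upgraded": {"verified", "failed"},
}


@dataclass
class Resource:
    name: str
    state: str = "discovered"
    data: dict = field(default_factory=dict)
    history: list = field(default_factory=list)

    def transition(self, new_state: str, actor: str, **updates) -> None:
        """Move to new_state, merge structured updates, and record who did it."""
        if new_state not in ALLOWED.get(self.state, set()):
            raise ValueError(f"{self.state} -> {new_state} is not allowed")
        self.history.append({
            "from": self.state,
            "to": new_state,
            "actor": actor,
            "at": datetime.now(timezone.utc).isoformat(),
        })
        self.state = new_state
        self.data.update(updates)


core01 = Resource("core01")
core01.transition("prechecked", actor="precheck-agent", decision="go")
```

The agent that made the call is gone after `transition` returns, but `core01.state`, `core01.data`, and `core01.history` remain for whoever reads the resource next.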

This is the foundation. Now add FlowAgents that can call lifecycle actions as tools.

The Architecture

The pattern has three layers, built on the Itential Platform.

Orchestration. A parent workflow sequences FlowAgents, passes the instance identifier, and enforces stage gates between them.

FlowAgents. Each FlowAgent is goal-based and stateless. It receives a mission, selects from a scoped set of tools, executes, and writes its conclusions back to the resource instance. SSH access to devices runs through Itential Gateway. External capabilities flow through FlowMCP Gateway.

Resources. Lifecycle Manager holds the resource instance. State accumulates here, persists across FlowAgents, and is readable by any agent, any human, any system.

Each FlowAgent is stateless. The resource instance is not. Every agent reads current state, executes its mission, and writes its conclusions back, enriching the instance with curated, structured data rather than raw output.

The lifecycle action is the critical link. It is not just a REST call. It is a structured operation that updates the instance in defined ways, enforcing the resource model schema, recording the state transition, making the update visible to everything that reads the instance.

The FlowAgent decides what is worth recording. The resource model defines how it gets recorded. The instance persists across everything.
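
To make the "not just a REST call" point concrete, here is a hedged sketch of what a lifecycle action does conceptually. The action names, the `ACTION_SCHEMAS` table, and `apply_action` are assumptions made for illustration; they are not Itential function names.

```python
# Hypothetical per-action schemas: field name -> required Python type.
ACTION_SCHEMAS = {
    "record_precheck": {"decision": str, "reason": str},
    "record_upgrade": {"target_os": str, "status": str},
}


def apply_action(instance: dict, action: str, payload: dict) -> dict:
    """Validate payload against the action's schema, then return the updated instance."""
    schema = ACTION_SCHEMAS[action]
    unknown = set(payload) - set(schema)
    missing = set(schema) - set(payload)
    if unknown or missing:
        raise ValueError(f"unknown={sorted(unknown)} missing={sorted(missing)}")
    for key, expected in schema.items():
        if not isinstance(payload[key], expected):
            raise TypeError(f"{key} must be {expected.__name__}")
    updated = dict(instance)
    updated[action] = dict(payload)
    # Record the transition so every reader can see which actions have run.
    updated["transitions"] = instance.get("transitions", []) + [action]
    return updated


instance = {"hostname": "core01"}
instance = apply_action(
    instance, "record_precheck", {"decision": "go", "reason": "all checks passed"}
)
```

In this sketch an agent that tries to attach an extra field, such as raw CLI output, is rejected by the schema check, so the model rather than the agent decides the shape of the record.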

Example: IOS-XR Software Upgrade

Cisco IOS-XR upgrades on NCS and ASR platforms are among the highest-risk operations a service provider performs. A botched upgrade on a core router can take down customer traffic for hours. The current reality for most teams is a change management ticket, a maintenance window, an engineer on a bridge call walking through a checklist, and a lot of hope.

Here is what the same process looks like when infrastructure is modeled as a resource.

The Resource Model

Before any automation runs, you define what an upgrade candidate resource looks like, limited to the fields that matter for decisions. Hostname and platform. Discovery state, including current OS, BGP peer count, available disk. Pre-check results with a clear go or no-go decision and the reason behind it. Upgrade execution state including target OS, timestamps, status. Post-check verification confirming the device came back the way it was supposed to.

Notice the design principle: store decisions, not data. No raw CLI output. No full running configs. The discovery agent collects everything from the device but writes only what the next stage needs to reason about. The model is a distillation.
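
One way to picture such a model is as a handful of small typed records, one per stage. The sketch below uses Python dataclasses purely for illustration; the class names, field names, and sample values are assumptions, and a real Lifecycle Manager model would be defined in the platform's own schema format.

```python
from dataclasses import dataclass, asdict
from typing import Optional


@dataclass
class Discovery:
    current_os: str
    bgp_peer_count: int
    disk_free_mb: int


@dataclass
class PreCheck:
    decision: str  # "go" or "no-go"
    reason: str


@dataclass
class UpgradeExecution:
    target_os: str
    status: str
    started_at: Optional[str] = None
    finished_at: Optional[str] = None


@dataclass
class PostCheck:
    verified: bool
    mismatches: list


@dataclass
class UpgradeCandidate:
    hostname: str
    platform: str
    # Each stage is empty until the agent responsible for it writes back.
    discovery: Optional[Discovery] = None
    pre_check: Optional[PreCheck] = None
    upgrade: Optional[UpgradeExecution] = None
    post_check: Optional[PostCheck] = None


# Illustrative values only.
candidate = UpgradeCandidate(
    hostname="core01",
    platform="NCS 5500",
    discovery=Discovery(current_os="7.3.2", bgp_peer_count=42, disk_free_mb=18000),
)
record = asdict(candidate)
```

Nothing here has room for a running config or a CLI transcript, which is the point: the model holds only what the next stage needs to decide.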

The Pipeline

A brownfield discovery FlowAgent connects to the device, understands its current state, and creates an upgrade resource instance. It decides what is relevant and drops the rest. Output: a named instance the rest of the pipeline can read.

A pre-check FlowAgent reads the instance. It already knows the platform and current OS from discovery. It runs targeted checks, reasons about them, and writes a clear go or no-go with reasoning. If no-go, the workflow stops here and the instance records why. No override.

An upgrade FlowAgent reads the instance to know target OS, platform, and image path. It transfers the image, installs, reloads. It writes status updates during execution so anyone watching can see progress.

A post-check FlowAgent reads the instance, knows the expected OS and pre-change baseline, verifies the device came back correctly, and flags any mismatch.

The instance exists permanently after completion. Six months from now, when someone asks what was running on core01 before last March’s upgrade window, the answer is in the resource record. Not in someone’s inbox. Not reconstructed from logs.
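
The control flow of that pipeline can be sketched in a few lines. The stage functions below are plain Python stand-ins for FlowAgents, and the 2000 MB disk threshold and version strings are invented for the example; only the shape matters: each stage reads the instance, writes its conclusion, and the orchestrator stops at the first no-go.

```python
def discovery(instance):
    # Stand-in for the discovery agent: writes only what later stages need.
    instance["discovery"] = {"current_os": "7.3.2", "disk_free_mb": 900}
    return True


def pre_check(instance):
    # Hypothetical rule: the image needs at least 2000 MB of free disk.
    free = instance["discovery"]["disk_free_mb"]
    go = free >= 2000
    instance["pre_check"] = {
        "decision": "go" if go else "no-go",
        "reason": f"{free} MB free, 2000 MB required",
    }
    return go


def upgrade(instance):
    instance["upgrade"] = {"target_os": "7.5.2", "status": "complete"}
    return True


def run_pipeline(instance, stages):
    """Run stages in order; stop at the first stage that does not return go."""
    for stage in stages:
        instance["last_stage"] = stage.__name__
        if not stage(instance):
            instance["pipeline_status"] = "stopped"
            return instance
    instance["pipeline_status"] = "complete"
    return instance


core01 = run_pipeline({"hostname": "core01"}, [discovery, pre_check, upgrade])
```

With the sample values, the pre-check fails, the upgrade stage never runs, and the instance keeps the recorded reason for anyone who reads it later.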

Example: L3VPN Service Provisioning

Software upgrades illustrate tracking what happens to infrastructure. Service provisioning illustrates something different: tracking what a service is, across every router it touches, for the entire life of that service.

L3VPN provisioning on a carrier MPLS network is a multi-router, multi-team operation. A new enterprise customer needs connectivity from branch sites to data centers across your backbone. That means IP allocation, VRF creation on Provider Edge routers, Route Distinguisher and Route Target assignment, BGP configuration on each PE, interface configuration facing each CE, and end-to-end reachability verification.

Traditionally: multiple engineers, multiple systems, a provisioning ticket, a separate IPAM tool, and a manual audit trail in a spreadsheet. A wrong RT, a duplicate RD, a misconfigured CE-facing interface: each mistake costs hours.

Why Services Are Different

An upgrade has a clear end state: the device is on the new OS. Done.

A service has no end state. It exists as a persistent entity that can be modified, degraded, restored, or decommissioned. The resource model is not just a snapshot of what happened. It is the authoritative description of what this service is, including service state, design specification (VRF name, RD, RT import and export), and per-site provisioning and verification status.
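
As a rough illustration, a service resource of this kind might carry the design once and the status per site. The class names, field names, and values below are hypothetical and chosen only to mirror the description above.

```python
from dataclasses import dataclass, field


@dataclass
class Site:
    pe_router: str
    ce_subnet: str
    provisioned: bool = False
    verified: bool = False


@dataclass
class L3VpnService:
    customer: str
    vrf_name: str
    rd: str
    rt_import: list
    rt_export: list
    state: str = "ordered"
    sites: dict = field(default_factory=dict)

    def unverified_sites(self):
        """Names of sites that have not yet passed verification."""
        return sorted(name for name, site in self.sites.items() if not site.verified)


# Illustrative values only.
svc = L3VpnService(
    customer="acme",
    vrf_name="ACME-VPN",
    rd="65000:100",
    rt_import=["65000:100"],
    rt_export=["65000:100"],
)
svc.sites["branch-1"] = Site("pe1", "10.10.0.0/30", provisioned=True, verified=True)
svc.sites["dc-1"] = Site("pe2", "10.10.0.4/30", provisioned=True)
```

The design fields (VRF, RD, RT) are written once and read many times; the per-site flags are what change as the service moves through its life.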

The Lifetime Value

Here is what changes when the service is modeled as a persistent resource.

Adding a site six months later. The FlowAgent reads the existing instance. It knows the VRF name, RD, RT values, all existing site IPs. It designs the new site without risk of conflict. No hunting through router configs. No manual cross-referencing.
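
A small sketch of the conflict check, using Python's standard `ipaddress` module. The instance layout, the `pe_ce_pool` field, and `next_free_subnet` are assumptions for illustration; in a real deployment the allocation itself would still go through IPAM, with the instance supplying the service context.

```python
import ipaddress


def next_free_subnet(pool: str, used: list, prefixlen: int = 30) -> str:
    """Return the first subnet of the given size in pool that overlaps nothing in used."""
    taken = [ipaddress.ip_network(u) for u in used]
    for candidate in ipaddress.ip_network(pool).subnets(new_prefix=prefixlen):
        if not any(candidate.overlaps(t) for t in taken):
            return str(candidate)
    raise RuntimeError(f"no free /{prefixlen} left in {pool}")


# Hypothetical instance contents; only the fields needed to add a site are shown.
instance = {
    "vrf_name": "ACME-VPN",
    "rd": "65000:100",
    "pe_ce_pool": "10.10.0.0/28",
    "sites": {
        "branch-1": {"pe_router": "pe1", "ce_subnet": "10.10.0.0/30"},
        "dc-1": {"pe_router": "pe2", "ce_subnet": "10.10.0.4/30"},
    },
}

used = [site["ce_subnet"] for site in instance["sites"].values()]
instance["sites"]["branch-2"] = {
    "pe_router": "pe3",
    "ce_subnet": next_free_subnet(instance["pe_ce_pool"], used),
}
```

Because the existing sites are already in the instance, the new site's subnet is derived from recorded state rather than from a scrape of router configs.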

Troubleshooting a degraded service. The instance shows which sites are verified, expected versus actual BGP prefix counts, and the full design spec the network should match. The NOC engineer does not reconstruct this. They read it.

Decommissioning. The FlowAgent reads the instance to know exactly which routers to clean up, which IPs to return to IPAM, which BGP neighbors to remove. Nothing is guessed. Nothing is missed.

The L3VPN instance becomes the single source of truth for that service across its entire lifetime: order, active, modify, degrade, retire.

Why This Matters

Four different conversations happen around this pattern depending on who is in the room.

With network engineers: you stop being the glue. A significant portion of a skilled engineer’s time is spent carrying context from a discovery command to a spreadsheet, from a spreadsheet to a change ticket, from a change ticket to the person doing the work. None of that is engineering. It is data transportation. With FlowAgents and Lifecycle Manager, the context carries itself.

With operations managers: scale without headcount. The constraint on network operations today is not usually tooling. It is people. How many upgrade windows can you staff? How many provisioning orders can your team process per week? FlowAgents and stateful resource management loosen that constraint. The same process that requires two engineers for one device can handle two hundred devices with those same two engineers overseeing rather than executing.

With compliance and audit teams: proof, not logs. A job log tells you a workflow ran. It does not tell you what state the device was in before the change, what was verified, what decision was made and why. A lifecycle instance is structured, queryable proof. Every field is named. Every stage is timestamped. Every go/no-go has a recorded reason. When auditors ask what the OS was on a router before last quarter’s maintenance window, the answer takes seconds.

With the business: MTTR and the cost of errors. A single failed upgrade on a core router can cost tens of thousands of dollars in engineer time and multiples of that in SLA penalties if customer traffic is affected. Automated, gated, auditable lifecycle management reduces error frequency, reduces recovery cost, and reduces time-to-diagnosis when something does go wrong, because the system knows what state the resource was in before it went sideways.

The Design Principles That Make It Work

Three principles determine whether this pattern succeeds or degrades into complexity.

Store decisions, not data. The resource model should contain what an agent or human needs to reason about, not raw output. A FlowAgent discovering an IOS-XR device does not store the full show version output. It stores current_os, uptime_days, bgp_peer_count. The model is a distillation, not a dump. When agents read the instance, they get signal, not noise.
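
Here is what that distillation could look like in code. The sample text is a simplified stand-in shaped loosely like `show version` output, not a capture from a real device, and the `distill` function and its regular expressions are illustrative assumptions rather than a complete parser.

```python
import re

# Simplified, illustrative text loosely shaped like `show version` output;
# it was not captured from a real device.
SAMPLE = """\
Cisco IOS XR Software, Version 7.3.2
System uptime is 5 weeks 1 day 17 hours 48 minutes
"""


def distill(show_version: str, bgp_peer_count: int) -> dict:
    """Keep only the fields later stages reason about; discard the raw text."""
    version = re.search(r"Version\s+(\S+)", show_version)
    weeks = re.search(r"(\d+)\s+weeks?", show_version)
    days = re.search(r"(\d+)\s+days?", show_version)
    uptime_days = (int(weeks.group(1)) * 7 if weeks else 0) + (
        int(days.group(1)) if days else 0
    )
    return {
        "current_os": version.group(1) if version else None,
        "uptime_days": uptime_days,
        "bgp_peer_count": bgp_peer_count,
    }


summary = distill(SAMPLE, bgp_peer_count=42)
```

Only `summary` would be written to the instance; the raw text is dropped once the three fields have been extracted.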

Namespace fields by stage. Grouping fields by the agent that writes them keeps the model readable and keeps each agent’s context clean. When the post-check agent reads the instance, it sees the pre-check stage’s conclusion without wading through discovery telemetry it doesn’t need. The structure of the model communicates the structure of the process.
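
A short sketch of stage namespacing, assuming the instance is a plain dictionary keyed by stage name. `write_stage` and `read_view` are hypothetical helpers, not platform functions, and the values are illustrative.

```python
def write_stage(instance: dict, stage: str, data: dict) -> None:
    """Each agent writes only under its own stage key."""
    instance.setdefault(stage, {}).update(data)


def read_view(instance: dict, stages: list) -> dict:
    """Return identity fields plus only the stage namespaces the agent asked for."""
    view = {"hostname": instance["hostname"]}
    view.update({s: instance[s] for s in stages if s in instance})
    return view


core01 = {"hostname": "core01"}
write_stage(core01, "discovery", {"current_os": "7.3.2", "bgp_peer_count": 42})
write_stage(core01, "pre_check", {"decision": "go", "reason": "all checks passed"})
write_stage(core01, "upgrade", {"target_os": "7.5.2", "status": "complete"})

# The post-check agent asks only for the namespaces it needs.
post_check_view = read_view(core01, ["pre_check", "upgrade"])
```

The view handed to the post-check agent contains the pre-check conclusion and the upgrade status, and nothing from discovery unless it asks for it.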

Gates between stages are non-negotiable. Every stage writes a clear go or no-go that the orchestrating workflow inspects before proceeding. This is not just safety. It is what makes the pipeline trustworthy enough to run unattended. When a pre-check fails at 3am, the instance records why and the upgrade does not proceed. No override, no shortcuts, no engineer needed to make the call. Governed by default.

What This Isn’t

It is worth being honest about scope.

Lifecycle Manager does not replace your source of truth for network configuration. It tracks the state of a resource through a managed lifecycle, not the complete configuration of your network. Your DCIM, your IPAM, your git-based config management remain authoritative for what they own.

FlowAgents do not replace network engineers. They replace the mechanical execution of well-understood processes (the pre-checks, the ordered configuration steps, the post-verification), freeing engineers for the work that actually requires expertise.

This pattern works best when you have already done the foundational work of understanding your infrastructure well enough to define what a resource looks like and what valid operations on it are. If your network is poorly understood or inconsistently configured, FlowAgents will surface that reality. They will not hide it.

The Bigger Picture

Network automation has always faced a fundamental tension. The richness of goal-based reasoning on one side. The need for auditability and structured state on the other. Scripts are auditable but not adaptive. AI agents are adaptive but not persistent.

Treating infrastructure as stateful resources, with FlowAgents that can read and update that state as native tools, resolves that tension. The agent reasons and adapts. The resource persists and accumulates. Together they produce automation that is both intelligent enough to handle real-world complexity and disciplined enough to meet enterprise requirements for auditability, consistency, and control.

The deeper shift is in how we think about infrastructure itself.

A device is not a target for a workflow. A service is not a sequence of configuration steps. These are resources with identities, lifecycles, and state. Managed entities that exist through time and require intelligent stewardship across that time.

When our automation treats them that way, the gap between “ran a workflow” and “managed an upgrade” finally closes. The gap between “provisioned a service” and “owned a service through its lifetime” finally closes.

That is not a feature. That is a different way of thinking about what network automation is for.

Agents adapt. Resources remember. The Itential Platform makes both true at the same time.

The pattern described in this post is implemented in Itential Platform 6 using Lifecycle Manager and FlowAgents. Multi-vendor device access flows through Itential Gateway. External tool integration is handled by FlowMCP Gateway. The IOS-XR and L3VPN examples are representative of real deployment patterns.


Ankit Bhansali

Ankit Bhansali is a Principal Architect – AI Solutions & Strategy at Itential. Drawing on a strong research background in software and networking, he designs innovative solutions to address the industry’s most complex challenges. His strategic approach empowers businesses to achieve transformative growth through robust automation and end-to-end orchestration.
