
What Network Engineers Can Learn from Facebook’s Faulty Configuration Management

Travis Nicks

Solutions Architect ‐ Itential

Posted on October 18, 2021

The outage heard around the world.

“…its root cause was a faulty configuration change on our end.” – Facebook 10/4/2021

As a network engineer, I can’t help but ask: how could something like this happen to a company like Facebook?

“Two-thirds of the [2018] outages are network- and IT-related. That’s a big change from years past.” — Todd Traver, VP Uptime Institute

Understanding how network changes threaten network integrity goes a long way toward preventing outages. Facebook, like almost every other operator of a large network, suffered an outage due to misconfiguration of a network device (in this case, by removing BGP routes). Even with robust security processes, knowledgeable employees, and managerial oversight in place, humans will still make mistakes. Catching those mistakes before they bring down a significant portion of the global internet should be a key focus of any IT or network administration organization.

 

What Organizations Can Learn from Facebook’s Mistake

One strategy for minimizing misconfigurations is to have strong security protocols and processes in place. Making sure only authorized and knowledgeable individuals have access to core equipment reduces the threat surface for malicious actors. This protection can be further enhanced by assigning network responsibilities to an application instead of to teams of individuals. By removing the complexity of managing and monitoring many individuals’ access, skills, and loyalty, an organization can focus on securing a single tool and training its staff to use it.
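As a rough illustration of funneling access through one application, the sketch below gates every change behind a role check. The role names, the permission sets, and the `apply_change` function are all hypothetical, invented for this example; they do not describe any particular product's API.

```python
from dataclasses import dataclass

# Hypothetical roles and permissions, invented for illustration only.
ROLE_PERMISSIONS = {
    "viewer": {"view"},
    "operator": {"view", "propose"},
    "approver": {"view", "propose", "apply"},
}


@dataclass(frozen=True)
class User:
    name: str
    role: str


def is_allowed(user: User, action: str) -> bool:
    """Return True only if the user's role grants the requested action."""
    # Unknown roles get an empty permission set, so they are denied by default.
    return action in ROLE_PERMISSIONS.get(user.role, set())


def apply_change(user: User, device: str, change: str) -> str:
    """Refuse to push a change unless the caller holds the 'apply' permission."""
    if not is_allowed(user, "apply"):
        raise PermissionError(
            f"{user.name} ({user.role}) may not apply changes to {device}"
        )
    # A real tool would push the change to the device here; this sketch only reports it.
    return f"{user.name} applied change to {device}: {change}"
```

Because every change passes through `apply_change`, there is one place to audit and one place to secure, rather than a separate set of device credentials for each engineer.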

Another complementary approach is to enforce configuration compliance. Teams should set golden standards in a thoughtful and tested manner, away from the emotion and fog of an outage, when best practices can be examined in detail and tried against lab environments or mock scenarios. Placing these standards into a tool that has visibility into the network allows for real-time compliance reporting. Such reporting highlights where the network may be vulnerable, misconfigured, or in danger of a cascading outage, and it can be performed regularly and automatically. Any such tool should clearly communicate the overall state of compliance to technical and managerial leadership so that changes can be planned.
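A minimal sketch of such a compliance check follows, assuming a golden standard can be expressed as lines that must be present and lines that must be absent. The `GoldenStandard` and `ComplianceReport` classes are hypothetical names for this example, and the configuration lines are generic illustrations. Real device configurations are hierarchical, so a production check would need to be context-aware rather than matching flat lines.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class GoldenStandard:
    """Hypothetical golden standard: lines that must and must not appear."""
    required: list[str]
    forbidden: list[str] = field(default_factory=list)


@dataclass
class ComplianceReport:
    device: str
    missing: list[str]      # required lines not found in the config
    violations: list[str]   # forbidden lines found in the config

    @property
    def compliant(self) -> bool:
        return not self.missing and not self.violations


def check_compliance(device: str, config_text: str, standard: GoldenStandard) -> ComplianceReport:
    """Compare a device's configuration text against the golden standard."""
    # Normalize: strip whitespace and ignore blank lines.
    lines = {line.strip() for line in config_text.splitlines() if line.strip()}
    missing = [req for req in standard.required if req not in lines]
    violations = [bad for bad in standard.forbidden if bad in lines]
    return ComplianceReport(device, missing, violations)


def summarize(reports: list[ComplianceReport]) -> str:
    """One-line summary suitable for a leadership-facing dashboard."""
    ok = sum(1 for r in reports if r.compliant)
    return f"{ok}/{len(reports)} devices compliant"
```

Running `check_compliance` on a schedule across every device and passing the results to `summarize` gives the regular, automatic reporting described above, with the per-device `missing` and `violations` lists available for engineers who need the detail.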

A third step is to validate all configuration changes BEFORE they are sent live to the network, not after they have already broken something. Each change should be run against security checks and compared to the compliance standards that apply to it. When a configuration change passes validation, a network engineer can feel confident that the organization’s standards and policies have been complied with. Validation is the final step in preventing outages caused by changes.
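The idea of validating before applying can be sketched as follows: build the candidate configuration in memory, then check it against the standard and a guardrail before anything touches a device. This is a simplified illustration, not a reconstruction of Facebook's incident or of any vendor's validation engine. The function names, the flat line-based model, and the "network " prefix rule are all assumptions made for the example.

```python
from __future__ import annotations


def build_candidate(running: list[str], remove: list[str], add: list[str]) -> list[str]:
    """Return the config that WOULD result from the change; nothing is sent to a device."""
    removed = set(remove)
    candidate = [line for line in running if line not in removed]
    for line in add:
        if line not in candidate:
            candidate.append(line)
    return candidate


def validate_change(
    running: list[str],
    remove: list[str],
    add: list[str],
    required: list[str],
    forbidden: list[str],
) -> list[str]:
    """Return a list of problems; an empty list means the change passes validation."""
    candidate = build_candidate(running, remove, add)
    problems = []
    for line in required:
        if line not in candidate:
            problems.append(f"candidate config is missing required line: {line!r}")
    for line in forbidden:
        if line in candidate:
            problems.append(f"candidate config contains forbidden line: {line!r}")
    # Hypothetical guardrail: a change must not leave the device advertising no prefixes.
    had_networks = any(line.startswith("network ") for line in running)
    has_networks = any(line.startswith("network ") for line in candidate)
    if had_networks and not has_networks:
        problems.append("change would withdraw every advertised prefix")
    return problems
```

A change is pushed only when `validate_change` returns an empty list. For example, a change that removes every `network` statement from a running configuration is rejected by the guardrail, while a change that swaps one prefix for another passes.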

 

How the Right Network Automation Tool Can Eliminate Misconfigurations

A tool that can be secured, take responsibility for the network, provide insight into the standards compliance of all network devices, and validate changes before they are made goes a long way toward preventing these types of outages. It takes more than processes and corporate directives. It takes planning and consistent execution, both of which can be achieved through a network automation tool like Itential.

Itential Configuration Manager was purpose-built to help organizations ensure their networks are always in compliance. With Itential, network teams have the tools they need to implement each strategy that Facebook didn’t:

  • Control who has access to view, manage, and apply configuration changes based on Role-Based Access Control.
  • Build effective Golden Configuration templates that are critical for reducing misconfigurations for any device and any services across network and cloud infrastructure.
  • Validate proposed configuration changes to the network to ensure, before they are even applied, that the changes will not break the compliance standard.

To learn more about how you can get ahead of misconfigurations, listen to our latest Packet Pushers Podcast where CTO Chris Wade joins the Heavy Networking crew for a discussion on how to bring your network into compliance through automated validation.

Travis is a Solutions Architect at Itential, where he delivers automated solutions to help companies improve their networks. Travis has worked for large service providers over the last twenty-five years designing, building, and maintaining carrier backbone and edge networks.
