Ansible and Network-as-Code: Key Considerations

So you've decided to take your network to the next level with a network-as-code (NaC) model and commit (no pun intended) to Ansible as a key component of your tool chain. Great! At Vector we frequently employ Ansible for network automation, using it for everything from ad-hoc changes to full-on declarative infrastructure. Once you grasp the power and potential of Ansible and NaC, it's tempting to jump in head first, configure a control node and repo, and start pushing configs. However, we've found that a lot of NaC initiatives fail to get off the ground because of mismanaged expectations or unexpected challenges and complexities. We have a lot of experience overcoming these challenges, so we've compiled a list of some of the key items that should be taken into consideration when rolling out Ansible for NaC.

Before we jump in, though, we want to take a minute to highlight one element that is essential to any successful deployment: clearly defined scope, end state, and requirements. This is probably true for any project, but it bears repeating. Once you see the NaC vision, the tendency is to want it all now (e.g., network data in a shiny new source of truth; read-only accounts on network devices with configuration changes only implemented as code in a repo; a CI/CD pipeline with a full topology of network device containers for change validation, modeling, rollback, etc.). But, like anything else, the process of adopting NaC is generally incremental. Set reasonable goals and milestones. Understand that only a subset of configurations and workflows may initially be in code. Some data might have to be manually allocated or tracked in a spreadsheet (for now). Getting classical network engineers and operators used to committing code to a repository and running playbooks will help them understand the power of automation and Ansible. A full-blown NaC model is not immediately required (or, a lot of the time, even desired) to accomplish this.

This list is by no means comprehensive and of course all deployments are different, but these are the most common things we run up against.

The source of network data

As network engineers we might not think about the amount of data that goes into configuring a network device, but it's pretty significant - IP addresses, VLANs, port-channel IDs, VNIs, BGP communities, firewall policy, NAT rules - the list goes on and on. One of the primary questions that will need to be answered as you embark on your NaC journey is: what is the source (or sources) of truth for all of this data? Of course, this data *could* be stored as code alongside playbooks, but that would be very difficult to manage and therefore mostly impractical (except maybe on the smallest of scales). So where is your data coming from? If you already have an IPAM solution, such as Infoblox, that's a good starting point; however, keep in mind that there are more variables than IP address data. At Vector we are partial to NetBox because of its range of built-in data types and customizability, but not everyone has the luxury of starting from scratch with a new tool. Wherever you land, though, you'll want to ensure that you have clearly defined the data dependencies for Ansible: where and how network data eventually gets mapped into Ansible variables.
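
As a rough illustration of that mapping, here is a minimal Python sketch that turns source-of-truth records into the JSON layout Ansible accepts from an inventory script (groups plus host variables under `_meta.hostvars`). The record fields (`name`, `role`, `mgmt_ip`, `vlans`), the sample data, and the function name `build_inventory` are assumptions for illustration, not the schema of any particular tool. In practice the records would come from your IPAM or source-of-truth API.

```python
import json

# Hypothetical records, standing in for what a source-of-truth export might return.
RECORDS = [
    {"name": "leaf1", "role": "leaf", "mgmt_ip": "10.0.0.11", "vlans": [10, 20]},
    {"name": "leaf2", "role": "leaf", "mgmt_ip": "10.0.0.12", "vlans": [10, 30]},
    {"name": "spine1", "role": "spine", "mgmt_ip": "10.0.0.1", "vlans": []},
]


def build_inventory(records):
    """Map source-of-truth records into Ansible's script-inventory JSON layout."""
    inventory = {"_meta": {"hostvars": {}}}
    for rec in records:
        # One Ansible group per device role (an assumed convention).
        group = inventory.setdefault(rec["role"], {"hosts": []})
        group["hosts"].append(rec["name"])
        # Per-host data becomes Ansible variables available to playbooks.
        inventory["_meta"]["hostvars"][rec["name"]] = {
            "ansible_host": rec["mgmt_ip"],
            "vlans": rec["vlans"],
        }
    return inventory


if __name__ == "__main__":
    print(json.dumps(build_inventory(RECORDS), indent=2))
```

The point of the sketch is that the mapping is explicit and lives in one place: every variable a playbook consumes can be traced back to a field in the source of truth.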

Handling ad-hoc changes

Hopefully as you move to a NaC model most standard day-to-day operations will be deployed using an Ansible playbook with configuration defined in code. However, certain changes will inevitably require an operator to log in to a device and make a manual change. Granted, in an ideal world everything is in code and no manual changes are permitted; however, in practice this is very difficult, for a few reasons:

Troubleshooting. In outage situations it's often not practical or even feasible to push changes via Ansible. Manual changes may be needed to provide a temporary fix or to better understand the underlying issue.
Migrations and transient states. Network operations often require moving functionality from one device to another (connections, SVIs, etc.). NaC isn't particularly good at handling these types of changes because there is usually an interim period where the same thing needs to be defined multiple times across different devices.

Whatever the case may be for ad-hoc changes, a strategy needs to be developed for how to handle them (i.e., how to true them up with the source of truth and the code).
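
One possible starting point for that true-up is simply detecting drift. The following is a minimal sketch, assuming you can obtain both the intended configuration (rendered from code) and the running configuration (collected from the device) as text. The sample configs and the function name `config_drift` are hypothetical; the diff itself uses Python's standard `difflib` module.

```python
import difflib


def config_drift(intended, running):
    """Return unified-diff lines showing how the running config departs from the intended one."""
    return list(
        difflib.unified_diff(
            intended.splitlines(),
            running.splitlines(),
            fromfile="intended",
            tofile="running",
            lineterm="",
        )
    )


# Hypothetical configs: the running config carries a manual change made during troubleshooting.
INTENDED = "interface Ethernet1\n description uplink\n mtu 9214"
RUNNING = "interface Ethernet1\n description uplink\n mtu 1500"

if __name__ == "__main__":
    for line in config_drift(INTENDED, RUNNING):
        print(line)
```

An empty result means the device matches the code. A non-empty result still needs a human decision: either fold the manual change back into the code and source of truth, or revert it on the device.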

Tool chain and integrations with existing teams

If you've selected Ansible for pushing network changes, there's a good chance that other teams in your organization already have some of the infrastructure and tools required to get your deployment off the ground. You'll be looking for things like an Ansible control node and a Git repo. Perhaps other teams already leverage Tower for the former and Azure DevOps for the latter -- check around. There's no need to reinvent the wheel and there will also probably be an opportunity for learning and integration.

Operational automation

In addition to defining configuration as code, you'll most likely also want to determine which operational commands and activities you will automate. For example, things like:

backups
pre-/post-change validation
steady-state validation
consistency checking

You may also want to consider running these things at regular intervals to get snapshots, or running them alongside other configuration playbooks in order to document point-in-time network states.
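
To make pre/post-change validation a bit more concrete, here is a minimal sketch that compares two point-in-time snapshots and reports what changed. It assumes each snapshot has already been collected and reduced to a flat dictionary; the BGP neighbor data and the function name `compare_snapshots` are hypothetical.

```python
def compare_snapshots(pre, post):
    """Return {key: (pre_value, post_value)} for every key that changed, appeared, or disappeared."""
    changed = {}
    for key in sorted(set(pre) | set(post)):
        before, after = pre.get(key), post.get(key)
        if before != after:
            changed[key] = (before, after)
    return changed


# Hypothetical snapshots of BGP neighbor state taken before and after a change.
PRE = {"10.1.1.1": "Established", "10.1.1.2": "Established"}
POST = {"10.1.1.1": "Established", "10.1.1.2": "Idle"}

if __name__ == "__main__":
    for neighbor, (before, after) in compare_snapshots(PRE, POST).items():
        print(f"{neighbor}: {before} -> {after}")
```

The same comparison works for steady-state validation if the "pre" snapshot is replaced by a known-good baseline.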

Authentication and credential management

You may have set up a proof-of-concept or demo that uses local authentication or SSH keys, and that's certainly a good way to learn. But in production you will need to make some important decisions around authentication and credential management, including:

Authentication between the Control Node and the network devices under management
- Authentication methods
- Where/how/if these credentials are stored and encrypted
- Integration with central auth (RADIUS, TACACS)
Role-based access control (RBAC). Who can do what, and to which devices, needs to be clearly defined and controlled in order to limit scope and blast radius. This is especially important with automation because the change scope is so much broader than a single device.
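
As a sketch of what "who can do what and to which devices" might look like once written down, the example below models a policy that limits each role to certain device groups and caps the number of devices touched per run. The role names, group names, limits, and the function name `is_allowed` are all hypothetical, and this only illustrates the decision logic; where such a policy is actually enforced in your environment is a separate design decision.

```python
# Hypothetical role definitions: which device groups a role may target,
# and the maximum number of devices it may touch in a single run.
ROLES = {
    "operator": {"groups": {"access"}, "max_hosts": 5},
    "engineer": {"groups": {"access", "core"}, "max_hosts": 50},
}


def is_allowed(role, target_group, host_count):
    """Return True if the role may run against the group with this many hosts."""
    policy = ROLES.get(role)
    if policy is None:
        # Unknown roles are denied by default.
        return False
    return target_group in policy["groups"] and host_count <= policy["max_hosts"]


if __name__ == "__main__":
    print(is_allowed("operator", "access", 3))
    print(is_allowed("operator", "core", 1))
```

Capping the host count per run is one simple way to express blast radius as a number rather than a guideline.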

Integration with central management tools and other orchestrators

A lot of modern network vendors provide central management tools of their own that control and operate all of their devices (e.g., Cisco ACI, Palo Alto Panorama, Arista CloudVision). Chances are if you are operating a network today you are employing one of these central management tools. So one key question will be: how do you integrate your Ansible NaC goals with those tools? In a lot of cases, Ansible provides the modules or collections required to push changes to central managers, but then the question becomes: what are you asking of Ansible that you aren't already getting out of the tool? Or which gaps are you trying to fill? These can be tricky questions to answer, so clearly defining what role Ansible will play on the whole will be critical to your success.

Hopefully this provided you with some takeaways for things to consider when embarking on a NaC journey using Ansible. If you have any questions or would like to work with us on your network automation project, please don't hesitate to contact us.
