Aug 14
Using IOS-XE+EEM+Guestshell+Python to Solve Problems
Posted by Dominic Zeni

It’s been a long time coming. I’ve been searching for a reason to do it…and I finally did it! I solved my first real-world networking problem using automation!

I know, I know. Many of you out there do this all day, every day, and won’t care. But as someone who is an expert network engineer and a n00b at Python (and programming in general), I’ve been waiting for the day when it made sense to put those n00b skills into practice in a production environment. Goodbye, print("Hello World").

So…what did I do exactly? Let’s dig in.

Cisco IOS-XE + Embedded Event Manager + Guestshell + Python = Solve Unique Problems

You may have caught the recent blog from my colleague, Trevor Butler, on Understanding the power of Guestshell with EEM Scripting. If you haven’t read it, please stop and take 10 minutes to do so now, as I won’t repeat the explanations of this toolchain that Trevor provided so nicely there. In rough summary, this toolchain (IOx+EEM+Guestshell+Python) provides a way to run “on-box” Python scripts in Cisco IOS-XE.

The Unique Problem

We have a customer production environment running a Cisco SD-Access fabric. We ran into an issue where a couple of local tables on the fabric edge nodes were intermittently falling out of sync. Specifically, the IOS-XE device-tracking (SISF) table was out of sync with the IOS-XE LISP Ethernet address-resolution (AR) table on the edge nodes. It’s not important for this post to know exactly what those are and why that is a problem (blog for another day), but let’s just say that unicast ARP breaks in the fabric when this happens. When unicast ARP breaks in the fabric, helpdesk tickets are opened and a team of people gets involved to manually track it down and clear the offending entry from the device-tracking table on a given fabric edge node (causing a minimum of a 15-minute outage). We decided to implement our own workaround to eliminate the helpdesk ticket and automatically mitigate the outage (we’ll track down the root cause in parallel - it could be a bug or an endpoint-driven issue). This is where our automation came into play.

The Workaround

Using a combination of the tools described above, we were able to mitigate this problem automatically.

Here’s what we did!

Enabling Guestshell

First, we needed to enable the guestshell on our C9300 fabric edge node. Since by now you have gone back and read Trevor’s blog, I won’t explain in detail what all of these commands do. What I will say is that in our case, we did not need our guestshell to have any upstream network access, so we used some non-routable IP space for the virtual port group (and didn’t bother with any NAT setup).

[Image: commands used to enable the guestshell]

Setting up the Guestshell with our Python Script

After the guestshell was created, we entered the guestshell to set up our Python script.

[Image: entering the guestshell to set up the script]

We then pasted in our Python script and saved it. I also want to give anonymous credit to my customer, who assisted in putting together this little script. If you are reading, you know who you are, and thank you! :)

Note: Since we weren’t giving our guestshell Internet access, we had to be sure to use only Python packages that come pre-installed. For instance, we cannot ‘pip install xyz’ from our guestshell without Internet access. Giving your guestshell Internet access is also something you can do; we simply didn’t need to.
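If you are not sure whether a package counts as pre-installed, a quick check from inside the guestshell can tell you before the script depends on it. This is only a minimal sketch: the helper name `available` is made up for illustration, and the module names you pass in are whatever your own script needs.

```python
import importlib


def available(module_names):
    """Map each module name to True if it imports cleanly, else False."""
    status = {}
    for name in module_names:
        try:
            importlib.import_module(name)
            status[name] = True
        except ImportError:
            # Not installed, and without Internet access pip cannot fetch it
            status[name] = False
    return status
```

Calling `available(["re", "cli"])` inside the guestshell would show whether both the standard `re` module and the Cisco-provided `cli` package can be imported there.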

[Image: the Python script]

The above script executes two show commands using the Cisco-provided package ‘cli’. From each command’s output we keep only the IP addresses and load them into a Python set. Next, we check for IP addresses that appear in the device-tracking output but are absent from the address-resolution output, and store those in a new set named ‘ip’. Lastly, we iterate through that set and execute two commands for each IP in it.

The first command clears the device-tracking record for the offending IP address (this mitigates the outage). The second command creates a syslog entry containing the IP address for which the conflict was detected and cleared, so that we have an audit log of the script’s activity on our logging server.
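For readers who want the logic in text form, here is a rough sketch of the flow just described. It is not the exact script we ran. The two show commands are passed in as parameters rather than spelled out, the function names are made up for illustration, and the clear and logging command strings are assumptions about the syntax that you should verify on your own IOS-XE release.

```python
import re

# Loose IPv4 matcher; good enough to pull addresses out of show output
IPV4 = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")


def extract_ips(output):
    """Return the set of IPv4-looking strings found in CLI output."""
    return set(IPV4.findall(output))


def stale_entries(tracking_output, ar_output):
    """IPs present in device-tracking output but absent from the AR output."""
    return extract_ips(tracking_output) - extract_ips(ar_output)


def remediate(run, tracking_cmd, ar_cmd):
    """Clear each stale device-tracking entry and log it.

    `run` is any callable that takes a CLI command string and returns
    its output as a string.
    """
    ip = stale_entries(run(tracking_cmd), run(ar_cmd))
    for address in sorted(ip):
        # Assumed syntax for clearing a single device-tracking entry
        run("clear device-tracking database address " + address)
        # Assumed syntax for writing an audit message to syslog
        run("send log 5 SISF/LISP-AR mismatch cleared for " + address)
    return ip


# On the switch, inside the guestshell, the Cisco cli package supplies `run`:
#   from cli import cli
#   remediate(cli, "<device-tracking show command>", "<LISP AR show command>")
```

Passing the command runner in as a parameter keeps the comparison logic testable on a laptop with canned output. A production version would probably also need to skip entries that are expected to differ, such as the switch's own addresses.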

EEM Script

Back in IOS-XE on the fabric edge node, we then added our EEM script. In the below example, we have EEM trigger a run of the Python script (located in the guestshell) every five minutes. CPU utilization on the edge node was largely unaffected, so we probably could have run this more frequently (and may still).

[Image: EEM script configuration]
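As a text approximation of a timer-driven applet like the one described above, the sketch below builds the configuration lines in Python so the interval and script path are easy to change. Treat every specific as an assumption: the applet name, the script path, and the use of `guestshell run python` (newer releases may need `python3`) are illustrative and not copied from our production config.

```python
def build_applet(name, script_path, interval_seconds=300):
    """Return IOS-XE config lines for an EEM applet that runs a
    guestshell Python script on a repeating timer."""
    return [
        "event manager applet " + name,
        # Watchdog timers re-arm automatically, giving a periodic trigger
        " event timer watchdog time " + str(interval_seconds),
        ' action 1.0 cli command "enable"',
        ' action 2.0 cli command "guestshell run python ' + script_path + '"',
    ]


# Hypothetical applet name and script location, for illustration only
for line in build_applet("SISF_AR_SYNC", "/home/guestshell/sisf_ar_sync.py"):
    print(line)
```

The default of 300 seconds matches the five-minute interval we chose. Running the check more often only means passing a smaller `interval_seconds`.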

Now the fabric edge nodes check for and mitigate this outage-causing discrepancy automatically, without the need for any human involvement!

What’s Next?

This buys us the time we need to work with Cisco TAC to understand the actual root cause of the discrepancy between the device-tracking table and the LISP L2 address-resolution table. Once we find out what bug we are hitting and upgrade IOS-XE to fix it, we’ll remove the EEM script, but we will be keeping the guestshell enabled on the fabric edge nodes to fight whatever fires may need fighting next!

Thanks for reading!

As always, if you have any questions about improving your IT environment and would like to schedule a free consultation with us, please reach out to us at sales@lookingpoint.com and we’ll be happy to help!


Written By:

 Dominic Zeni, LookingPoint Consulting Services SME - CCIE #26686
