CLI Still Sucks for Automation
Using network CLI for automation has always been fragile. But it keeps surprising me with the way it breaks. This time, it was a combination of Ansible, Arista, replace: config and terminal length used as a config command.
I often hang out in the NTC Slack channel. A user reported they were having a problem with Ansible and EOS. Basic changes worked, but when they used eos_config with the replace: config option, it just timed out. We knew basic authentication and connectivity were fine, so it had to be something else.
But it made no sense, because these modules are widely used. What’s going on?
Background #1: Pagination
Some commands produce more than one screen’s worth of output. For example, show run can be hundreds of lines long. Most screens don’t have hundreds of lines, so pagination is used. The network device detects how many lines your terminal window has. When a command output is longer than that, it will pause at the end of each ‘screen’ length of output. Typically it displays something like --More--. Hit enter and the display will move up one line, hit space and it will move up a whole screen length.
This is all very useful if you’re a human looking at the CLI output.
But as a machine, it’s a pain. If you have a script interacting with the CLI, it has to detect the --More-- prompt and respond appropriately. That slows things down, and it’s annoying.
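To make that concrete, here is a rough sketch of what a script has to do when paging is left on. FakePagedDevice is a made-up stand-in for a real SSH session, and read_all is a hypothetical helper, not something taken from Ansible or any other tool.

```python
MORE = "--More--"


class FakePagedDevice:
    """Hypothetical stand-in for a device that pages its output."""

    def __init__(self, lines, page_len):
        self.lines = list(lines)
        self.page_len = page_len
        self.pos = 0

    def read(self):
        """Return the next screen, ending in --More-- if output remains."""
        chunk = self.lines[self.pos:self.pos + self.page_len]
        self.pos += len(chunk)
        if self.pos < len(self.lines):
            chunk.append(MORE)
        return "\n".join(chunk)

    def send(self, key):
        """Accept a keypress; this sketch only models space (next screen)."""
        if key != " ":
            raise ValueError("this sketch only understands space")


def read_all(device):
    """Collect the full output by answering every --More-- with a space."""
    collected = []
    while True:
        screen = device.read()
        if screen.endswith(MORE):
            # Strip the pager prompt before keeping the screen's content.
            collected.append(screen[: -len(MORE)].rstrip("\n"))
            device.send(" ")
        else:
            collected.append(screen)
            return "\n".join(part for part in collected if part)
```

Every command needs that loop wrapped around it, which is why tools prefer to switch paging off once and forget about it.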
That’s why almost all network operating systems support some form of the terminal length command. This lets you override the auto-detected terminal length. More importantly, if you enter terminal length 0, it will ignore any length, and just display all output in one long stream. That’s the simplest for any CLI expect-style automation to deal with. Almost all automation tools will use this.
Normally this works well. When something like Ansible connects to a device, the first thing it does after login is run terminal length 0, and it no longer needs to worry about pagination. After that it just runs commands, and expects that the entire output will be returned, with no need to detect page breaks.
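As a toy model of that login step: FakeDevice and open_session below are hypothetical names, and the auto-detected length of 24 is just an assumed default for the sketch.

```python
class FakeDevice:
    """Hypothetical device: pages at page_len lines, 0 disables paging."""

    def __init__(self, running_config, detected_length=24):
        self.running_config = list(running_config)
        self.page_len = detected_length  # assumed auto-detected value

    def run(self, command):
        """Return the list of screens the command would produce."""
        if command.startswith("terminal length "):
            self.page_len = int(command.split()[-1])
            return [""]
        if command == "show running-config":
            lines = self.running_config
            if self.page_len == 0:
                return ["\n".join(lines)]  # one long stream
            return [
                "\n".join(lines[i:i + self.page_len])
                for i in range(0, len(lines), self.page_len)
            ]
        raise ValueError(f"command not modelled in this sketch: {command}")


def open_session(device):
    """Do what an automation tool does first: turn paging off."""
    device.run("terminal length 0")
    return device
```

Before open_session, a 100-line config comes back as several screens. After it, the same command returns a single chunk, and the client never has to think about page breaks.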
Background #2: replace: config
The eos_config module supports the replace: config option. The default is to use replace: line, which only pushes changed lines if it detects a difference. That’s what most people use. Occasionally you need the replace: block option, which will push all lines in that block, if any of them are different. By block, it means “this section of the code” - e.g. the interface configuration.
The config option is not documented. But it will push the entire config if any one line needs to be changed. I don’t know exactly why you’d do this, but hey, it’s an option.
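Here is a deliberately simplified model of the three modes as described above. It is not how Ansible implements them: lines_to_push is a hypothetical function, and treating the config as a dict of section name to lines is my own assumption for the sketch (a real push would also need the section header for context).

```python
def lines_to_push(intended, running, replace="line"):
    """Pick the lines each replace mode would send.

    intended / running: dict mapping a section name to its config lines.
    This is a toy model of the behaviour described in the text.
    """
    changed = {
        name: lines
        for name, lines in intended.items()
        if lines != running.get(name)
    }
    if not changed:
        return []  # nothing differs, nothing is pushed in any mode
    if replace == "line":
        # Only the lines that are missing from the running config.
        return [
            line
            for name, lines in changed.items()
            for line in lines
            if line not in running.get(name, [])
        ]
    if replace == "block":
        # Every line of each section that differs.
        return [line for lines in changed.values() for line in lines]
    if replace == "config":
        # The whole intended config, changed or not.
        return [line for lines in intended.values() for line in lines]
    raise ValueError(f"unknown replace mode: {replace}")
```

The point to notice is the last mode: one changed interface description is enough to re-send every global line too, including ones that have nothing to do with the change.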
Combination == Problem. Why?
The user reported that they were unable to use Ansible with replace: config. Ansible reported a timeout - exactly the sort of thing I would expect it to do if it had not matched the prompt properly, or it was stuck at a --More-- prompt.
But why? How could that be happening when Ansible modules are widely used with Arista? There’s nothing unusual, no reason why it shouldn’t work when many, many other people successfully use these.
Why would it be getting stuck with pagination when that gets disabled on login?
The key was this part of the EOS docs:
The pagination setting is persistent if configured from Global Configuration mode. If configured from EXEC mode, the setting applies only to the current CLI session.
Hang on a second. On most devices I’ve used, terminal length is only used at the EXEC mode. That is, it only applies to the current login session. It’s not a global setting, it’s an ad-hoc setting.
Arista EOS supports terminal length as a global configuration command. This system had terminal length 20 in the configuration. On initial login, Ansible ran terminal length 0 for that session, and all was well. But when replace: config was used and it detected a change, it re-entered the entire configuration, including the terminal length 20 command. At that point, the system went back to enforcing a page length of 20. This confused Ansible, because it started seeing --More-- prompts, and didn’t know what to do with them.
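The sequence is easier to see in a small simulation. Everything here is made up for illustration: FakeEos, the switch# prompt, and the choice of show running-config as the long command that trips the pager. It assumes, as described above, that re-entering terminal length 20 in config mode puts paging back on for the current session.

```python
PROMPT = "switch#"  # hypothetical prompt
MORE = "--More--"


class FakeEos:
    """Toy device where any 'terminal length N' changes session paging."""

    def __init__(self, show_output):
        self.show_output = list(show_output)
        self.page_len = 24

    def send(self, command):
        """Return what the client sees before the device waits for input."""
        if command.startswith("terminal length "):
            self.page_len = int(command.split()[-1])
            return PROMPT
        if command == "show running-config":
            if self.page_len and len(self.show_output) > self.page_len:
                # Paging is on: stop after one screen and wait for a key.
                return "\n".join(self.show_output[:self.page_len] + [MORE])
            return "\n".join(self.show_output + [PROMPT])
        return PROMPT  # any other config line is accepted silently


def run_expecting_prompt(device, command):
    """Naive client: no prompt at the end of the output means a timeout."""
    output = device.send(command)
    if not output.endswith(PROMPT):
        raise TimeoutError(f"no prompt after {command!r}")
    return output


def push_full_config(device, config_lines):
    """Disable paging, re-enter every config line, then read the config."""
    run_expecting_prompt(device, "terminal length 0")
    for line in config_lines:
        run_expecting_prompt(device, line)
    return run_expecting_prompt(device, "show running-config")
```

With terminal length 20 among the pushed lines, push_full_config raises TimeoutError on the final command. With that one line removed from the config, the same call returns the whole output and the prompt.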
It’s a fairly simple fix here - remove the terminal length 20 line from the configuration. In this case it was a command that had been added a long time ago, perhaps when automatic length detection didn’t work. No longer needed, so remove the command, the terminal length stops getting messed up, and Ansible doesn’t sit there stuck at a --More-- prompt.
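If you wanted to catch this before pushing, a check along these lines might do. find_global_terminal_length is a hypothetical helper, and it assumes the global command shows up as an unindented terminal length N line in the config text.

```python
import re

# Unindented line only: indented lines belong to a sub-section.
TERMINAL_LENGTH = re.compile(r"^terminal length (\d+)$")


def find_global_terminal_length(config_text):
    """Return the page lengths set by global 'terminal length N' lines."""
    return [
        int(match.group(1))
        for line in config_text.splitlines()
        if (match := TERMINAL_LENGTH.match(line.rstrip()))
    ]
```

An empty list means nothing in the config will turn paging back on mid-session. Anything else is worth a look before using replace: config.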
Now hopefully I’ll remember this when I come across it in another 5 years…