When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also benefit from observability platforms to automatically gather metrics from these resources.
However, there are times when both of these essential mechanisms are rendered ineffective during a production incident. How can an engineer troubleshoot and implement a fix successfully across a large number of hosts?
Enter the parallel distributed shell.
A parallel distributed shell is a CLI tool that connects to multiple hosts and runs commands on them simultaneously. It is an effective ‘break-glass’ tool when the usual method of managing infrastructure isn’t working. These tools are more common than you might think! I have seen them throughout my career and have used them personally to accomplish a rather diverse set of operational tasks.
There are multiple options to choose from.
I personally use pdsh as it was originally developed at LLNL for grid computing applications and is platform-agnostic. It is available in the repositories of most Linux distributions.
Getting started with the tool is really simple. Create a comma-separated list of hostnames and store it in a shell variable named HOSTS.
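As a minimal sketch, assuming three hypothetical hostnames under example.com (substitute your own), the variable could be set like this:

```shell
# Hypothetical hostnames, used only for illustration.
HOSTS="web1.example.com,web2.example.com,web3.example.com"

# Sanity check: print one hostname per line before handing the list to pdsh.
echo "$HOSTS" | tr ',' '\n'
```

Keeping the list free of spaces means it passes through the shell as a single argument.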
Then run pdsh supplying that host list and the command to run:
$ pdsh -R ssh -l $USER -w "$HOSTS" command
- -R selects the remote command module, which determines how commands are run. In this case, we are using SSH.
- -l specifies the login name to authenticate with (just like ssh).
- -w is the list of hosts to connect to.
Note that pdsh relies on SSH key authentication; interactive password prompts are not supported.
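Because a single host without a working key can derail a run, it can help to verify key-based logins first. The following is a rough sketch rather than part of pdsh: the function name check_ssh_keys and the SSH_BIN override are assumptions made for illustration, while BatchMode and ConnectTimeout are standard OpenSSH client options.

```shell
# Report hosts that do not accept non-interactive (key-based) SSH logins.
# SSH_BIN is a hypothetical override hook so the check can be dry-run with a stub.
check_ssh_keys() {
  hosts=$1
  failed=0
  # Split the comma-separated list into individual hostnames.
  for host in $(printf '%s' "$hosts" | tr ',' ' '); do
    # BatchMode=yes makes ssh fail instead of prompting for a password.
    if ! "${SSH_BIN:-ssh}" -o BatchMode=yes -o ConnectTimeout=5 "$host" true </dev/null; then
      echo "key auth failed: $host" >&2
      failed=1
    fi
  done
  return "$failed"
}
```

Running `check_ssh_keys "$HOSTS"` prints any host that rejects key authentication to standard error and returns a non-zero status if at least one host failed.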
Let’s explore the various tasks this tool can do! We’ll create an alias for the sake of brevity:
$ alias cmpdsh="pdsh -R ssh -l $USER -w $HOSTS"
$ cmpdsh "sudo bash -c 'apt update && apt -y dist-upgrade'"
The -f flag sets the fanout, the maximum number of hosts pdsh runs the command on at the same time. Using -f 1 will go through the host list serially.
$ cmpdsh -f 1 "sudo reboot"
$ cmpdsh -f 1 "sudo systemctl restart service"
The -N flag disables the hostname prefix in the output, so the combined stream can be fed directly to CLI log analysis tools.
$ cmpdsh -N "sudo tail -f /var/log/apache2/access.log" > /tmp/apache.log
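Once some entries have accumulated in /tmp/apache.log, ordinary text tools can work on the combined file. As a small sketch, assuming Apache's default log formats, where the client address is the first field, and a hypothetical helper named top_clients:

```shell
# Print the ten most frequent client addresses in an Apache access log.
# Assumes the client address is the first whitespace-separated field.
top_clients() {
  awk '{ print $1 }' "$1" | sort | uniq -c | sort -rn | head -n 10
}
```

Running `top_clients /tmp/apache.log` from a second terminal while the tail is still active gives a quick view of which clients generate the most requests across all hosts.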
Monitor Resource Utilization Cluster-Wide
This is particularly useful when host monitoring isn’t present. The dstat tool is excellent for this purpose.
$ cmpdsh "dstat -cmndi"
Apply Configuration Management Tools
Note: Some configuration management systems require a centralized server. Set -f to a value that won’t overload it!
$ cmpdsh -f 20 "sudo puppetd -t"
$ cmpdsh -f 20 "sudo chef-client"
Generating Host Lists For pdsh
In the above examples, we are supplying pdsh with a static list of hosts. In most cases, we will target different sets of hosts based on the task performed.
I recommend integrating pdsh with your asset management system. For example, I’ve written a wrapper around pdsh that dynamically generates host lists using Ansible inventory files.
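The core idea can be sketched in a few lines of shell. This is a simplified illustration rather than the wrapper mentioned above: the function name inventory_hosts is hypothetical, and it assumes a plain INI-style inventory with one host per line (child groups, host ranges, and YAML inventories are not handled).

```shell
# Print the hosts of one group from an INI-style Ansible inventory
# as a comma-separated list suitable for pdsh -w.
inventory_hosts() {
  inventory_file=$1
  group=$2
  awk -v group="$group" '
    /^[[:space:]]*(#|;|$)/ { next }                       # skip comments and blank lines
    /^\[/ { in_group = ($0 == ("[" group "]")); next }    # track the current section
    in_group { print $1 }                                 # first field is the hostname
  ' "$inventory_file" | paste -sd, -
}
```

With an inventory file named inventory.ini containing a [web] group, a call might look like `pdsh -R ssh -l "$USER" -w "$(inventory_hosts inventory.ini web)" uptime`. Both the file name and the group name are placeholders.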
The parallel distributed shell is an effective Swiss army knife for large-scale systems when your usual infrastructure management systems aren’t available or effective. It allows you to perform incident response or other operational tasks without the burden of SSHing into each server one at a time. As with any powerful CLI tool, please use it with extreme caution!
I have a lot of experience running large-scale production systems. Schedule a call with me if you need help improving the operations of your production environment!
(Image credit: Sora Shimazaki)