Software Stack

How we build our activity report

Each year, we publish an activity report that presents a summary of our day-to-day activities and of our projects. Each year it follows roughly the same structure and presents the same tables and graphs, updated with fresh data. It is written collaboratively by all the CISM members.

Given the recurring structure and information, and the fact that it is a work shared among us, we decided to use LaTeX to author the report, which gives us much flexibility in producing a nice-looking document with a custom style file. The fact that LaTeX is purely text-based also brings two important benefits:

  • it can be versioned under a version control system; and
  • it can be (partly) created by automated tools.

LaTeX is a very good system for this purpose, but its syntax is sometimes intrusive and unintuitive. Therefore, we organise the structure of the document in LaTeX, but the individual paragraphs are written in Markdown and converted to LaTeX using Pandoc and a simple Makefile. The whole thing is versioned with Mercurial.
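The conversion step is a plain pandoc call per Markdown fragment. As a minimal illustration (the fragment name is hypothetical, and a real Makefile would pass pandoc additional options), here is a Python sketch of the command our build would run:

```python
from pathlib import Path

def pandoc_command(md_file: Path) -> list[str]:
    """Build the pandoc invocation that converts one Markdown fragment to LaTeX."""
    tex_file = md_file.with_suffix(".tex")
    return ["pandoc", "--from", "markdown", "--to", "latex",
            str(md_file), "--output", str(tex_file)]

# Hypothetical fragment; a Makefile rule would run one such command per .md file.
cmd = pandoc_command(Path("sections/clusters.md"))
```

In the actual build, a pattern rule in the Makefile plays the role of this function, one invocation per Markdown file.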

We have developed scripts that fetch data from our databases (e.g. GLPI for the inventory, Zabbix for the load of the clusters, or Slurm for the cluster usage) and write the LaTeX code to produce tables or graphs (with TikZ). This way, a lot of manual copy/pasting is avoided, saving time and eliminating potential errors. The list of publications is integrated directly from Dial.pr thanks to an extraction tool written by Étienne Huens (SST/INMA).
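At their core, these scripts turn rows fetched from a database into LaTeX markup. A minimal Python sketch of the table-writing part (the cluster names and figures below are made up for illustration, not real usage data):

```python
def latex_table(headers, rows):
    """Render a list of records as a LaTeX tabular environment."""
    cols = "l" * len(headers)                      # one left-aligned column per header
    lines = ["\\begin{tabular}{%s}" % cols,
             " & ".join(headers) + r" \\ \hline"]  # header row
    for row in rows:
        lines.append(" & ".join(str(c) for c in row) + r" \\")
    lines.append("\\end{tabular}")
    return "\n".join(lines)

# Hypothetical figures, as such a script might receive from the Slurm accounting database.
print(latex_table(["Cluster", "CPU hours"],
                  [("Manneback", 123456), ("Lemaitre2", 654321)]))
```

The real scripts add styling from our custom style file; the principle stays the same.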

Using text files rather than a Word/OpenOffice document also allows automatic processing of the text. Prior to compilation, several scripts run to ensure consistency in capitalisation, for instance, or to add links when named entities (e.g. "SST/CISM") are discovered.
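The entity-linking pass can be sketched as a simple substitution over the text, assuming a hand-maintained map from entity names to URLs (the URL below is illustrative):

```python
import re

# Hypothetical entity -> URL map; a real script would read this from a config file.
ENTITIES = {"SST/CISM": "https://www.cism.ucl.ac.be"}

def add_links(text: str) -> str:
    """Wrap the first occurrence of each known entity name in a LaTeX \\href."""
    for name, url in ENTITIES.items():
        # re.escape so names containing '/' or '.' are matched literally
        text = re.sub(re.escape(name),
                      r"\\href{%s}{%s}" % (url, name),
                      text, count=1)
    return text
```

Linking only the first occurrence keeps the compiled document from being peppered with repeated hyperlinks.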

This approach saves us several man-hours every year, leveraging the features available in Mercurial and the custom scripts we have developed. Of course, when our infrastructure changes, we have to adapt our scripts. Fortunately, our infrastructure rarely changes; but that can happen when, for instance, we move all our computers into a brand new computer room.

How we set up our compute nodes

On our multi-purpose compute cluster Manneback, we use a trio of open-source tools to set up our compute nodes, namely Cobbler, Ansible and Salt.

When we acquire new hardware, the workflow proceeds as follows once the
hardware is in place:

1. The MAC addresses of the nodes are collected, and the names and IP addresses are chosen. To collect the MAC addresses, we often simply look into the logs of the DHCP server once we have booted the nodes.

2. The above information (MAC, IP and hostname) is entered into the Cobbler system, and a kickstart is created to handle the network configuration and the disk partitioning. Some necessary packages are also installed at that time, the most important of which is the salt-minion. The Salt minion is also configured through the kickstart.

At this point, the node is ready to be deployed.

3. The node is then restarted, and its operating system is installed through Cobbler. When the kickstart runs, the salt minion starts and registers itself with the Salt master.

At this point, the node is ready to be integrated.

4. Then, an Ansible playbook is run that creates the configuration files and registers the nodes in the inventory and the monitoring system. Ansible gets its inventory from the Salt server and operates only on nodes whose keys have not yet been added to the Salt master. More precisely, the playbook gathers the public SSH keys to create a proper hosts.allow for host-based SSH authentication, it gathers node information to build the Slurm configuration file, and it registers the node in the OCS inventory and in Zabbix.

At this point, the node is ready to be configured.

5. Finally, the Salt master issues a "state.highstate" to propagate the configuration files, install the necessary packages and mount the network filesystems, and the nodes are ready for use.
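Step 1 above, collecting MAC addresses from the DHCP server logs, can be sketched in a few lines of Python (the log lines are fabricated examples in ISC dhcpd's message format):

```python
import re

# Fabricated dhcpd log excerpt, mimicking ISC dhcpd's DHCPDISCOVER messages.
LOG = """\
Apr  3 10:12:01 gw dhcpd: DHCPDISCOVER from 0c:c4:7a:12:34:56 via eth1
Apr  3 10:12:02 gw dhcpd: DHCPDISCOVER from 0c:c4:7a:12:34:57 via eth1
Apr  3 10:12:05 gw dhcpd: DHCPDISCOVER from 0c:c4:7a:12:34:56 via eth1
"""

MAC_RE = re.compile(r"DHCPDISCOVER from ([0-9a-f:]{17})")

def collect_macs(log: str) -> list[str]:
    """Return the unique MAC addresses seen in DHCPDISCOVER lines, in order of appearance."""
    seen = dict.fromkeys(MAC_RE.findall(log))  # dict preserves insertion order
    return list(seen)
```

The resulting list, paired with the chosen hostnames and IP addresses, is what gets fed into Cobbler in step 2.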

It may be a bit surprising at first sight to use both Ansible and Salt; most system administrators would consider using one or the other. But we found it more efficient to combine their respective strengths.

The reasons why we would not work with Salt only:

  • Ansible has many more modules to interact with other systems and to manipulate text files;
  • Ansible allows easier control of operations (orchestration) when synchronisation between services is needed;
  • Ansible is better suited to handling "one shots" such as registering with the monitoring system.

Conversely, we found the following reasons why we would not work with Ansible only:

  • Salt's main mode of operation (pull) does a better job at handling nodes that are down;
  • Salt scales better when several hundred nodes need to be configured at the same time;
  • Salt makes it easier to write declarative configuration code based on rules and dependencies.

In the end, what we have is a combination of three free, open-source tools working hand in hand, allowing us to bring new nodes to life in our compute cluster in minutes.

How we set up our Openstack

We moved into a brand new data center in April 2016, and the first machines we installed there after the switches were the four computers dedicated to our private Openstack. We needed to deploy those computers at a time when the top-of-rack switches were not yet connected to their uplink switches. But the WiFi from the building next door was within reach.

So we bought a Raspberry Pi 3, installed Fedberry, configured it as a gateway router, and set up a Cobbler server. Once the four nodes were properly patched physically and connected to the various internal networks, we deployed them. We chose CentOS 7 for the operating system, and used the functionalities of the kickstart to set up the network interfaces, configure IPMI, install useful packages, set the root password and SSH keys, etc. Only the first two nodes, which are management nodes, have direct network connectivity with the external networks. They provide masquerading/NAT for the other two nodes, which host the virtual machines and virtual networks.

Once they were all up and running, we cloned onto the first computer the Mercurial repository that holds our Ansible playbook and roles to deploy Openstack.

One role installs Gluster on all four nodes (distributed/replicated). The Gluster filesystem holds the virtual machines (instances), the virtual disks (block volumes) and the OS images. Another role sets up Galera with MariaDB, also on all four nodes, and a third one installs a RabbitMQ cluster. These lay the foundation of the Openstack install, upon which all Openstack services rely.

The two management nodes are redundant and share a virtual IP, managed with Keepalived. They both host an HAProxy service for all the Openstack modules. Both HAProxy and Keepalived are installed by their respective roles.

Then come roles for the key Openstack elements: the identity module, Keystone; the image manager, Glance; the virtual machine handler, Nova; and the virtual network handler, Neutron. After that, other roles take care of Cinder, for virtual block devices, and Horizon, Openstack's web interface. All Openstack modules are installed in an active-active redundancy mode, except for Neutron, which is active-passive.

Finally, roles for the ELK stack (Elasticsearch, Logstash, Kibana) are run to collect all Openstack log files. Zabbix finalises the installation. Note that as Zabbix is also redundant, we get every alert twice... But that is better than receiving none.

The Ansible playbooks and roles were developed in a virtual environment with VirtualBox and Vagrant. Every aspect had been tested beforehand, except for some configuration of Neutron, because we could not reproduce the physical environment exactly in VirtualBox. But it was sufficient to be confident that our playbooks would do the job. And indeed, the whole setup was up in a couple of hours.

Once the Openstack cloud was operational, a set of shell scripts and Ansible playbooks provisioned all our virtual machines. The first one we took care of was our Salt server that is subsequently used to configure all the other VMs: DNS server, LDAP servers, Zabbix server, OCS server, etc.

The CISM software stack

At CISM, we manage a few hundred computers, so we need a comprehensive set of tools to handle the burden of deploying, configuring, monitoring and repairing all those machines. For reference, here is the software stack we use at CISM, from Ansible to Zabbix.

Ansible

We use Ansible to set up the compute nodes of our clusters and the virtual machines that host the services on which we rely. Ansible is like shell scripts on steroids, and it is very powerful at orchestrating service setup.

CentOS

Our favourite distribution is CentOS, based on RedHat. We occasionally use other RedHat derivatives such as Scientific Linux.

Cobbler

Cobbler is used to deploy the operating system on our compute nodes and on the physical hardware that hosts the virtual machines. Cobbler manages our DHCP, TFTP and PXE services and hosts the inventory that Ansible uses.

Dnsmasq

Our favourite DNS server is dnsmasq, which is both simple and efficient. We use it for our virtual machines and compute nodes.

Docker

Most virtualised services are built on Openstack, but services that are meant for computing on interactive machines are now installed in Docker containers. Typical examples include RStudio Server, Jupyter Hub, etc. This way, those services are easily transferred or replicated to other machines, and they do not interfere with one another. If the machine also hosts a more 'administrative' service, such as NextCloud for instance, the latter is protected from any issue that might arise from user processes (memory over-usage, CPU overloading, etc.). The SSH daemon is also being progressively moved to a Docker image on the login nodes for the same reasons.

Dokuwiki

Our internal wiki is a Dokuwiki. As it is based on simple text files (rather than a complex SQL database as in alternatives), it is very easy to write scripts that populate the wiki with information gathered from our equipment or our configuration server.
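Because a Dokuwiki page is just a text file, populating it amounts to emitting its markup. A minimal sketch of rendering an inventory extract as a Dokuwiki table (the host and model below are made up):

```python
def dokuwiki_table(headers, rows):
    """Render rows as a DokuWiki table: '^' delimits header cells, '|' normal cells."""
    out = ["^ " + " ^ ".join(headers) + " ^"]
    for row in rows:
        out.append("| " + " | ".join(str(c) for c in row) + " |")
    return "\n".join(out)

# Hypothetical inventory extract; a real script would query GLPI or the Salt server,
# then write the result into the wiki's data/pages/ tree.
page = dokuwiki_table(["Host", "Model"], [("node001", "Dell R630")])
```

Writing the returned string to the right file under the wiki's pages directory is all it takes to publish the table.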

ELK

Elasticsearch: We use an Elasticsearch cluster to store the logs of our compute nodes and service nodes. Based on Apache's Lucene search engine, it is unbeatable at indexing text documents.

Logstash: Logstash is used to populate the Elasticsearch cluster with the logs produced by all the services we run. With specially-crafted regexes, it parses the logs before pushing them to the Elasticsearch server.

Kibana: We use Kibana to browse and search the log database. Its filtering and plotting facilities make it very easy to navigate the logs and correlate events for troubleshooting.

Easybuild

The must-use tool to compile scientific software from source. It is distributed with 'recipes' for nearly one thousand different scientific software packages.

Filesystems
  • Lustre is used on Lemaitre2 for the fast, infiniband-connected scratch space common to all nodes. Lustre is probably the market leader for fast parallel filesystems and offers very good performance.
  • GPFS. IBM's GPFS (now called Spectrum Scale) is a very versatile, feature-loaded distributed and parallel filesystem that we use at the CÉCI level for the central storage shared by all compute clusters.
  • Glusterfs is used for home directories where data safety is more important than performance. The fact that it is 'metadata-less' and that the files can be accessed directly even if the Gluster filesystem is down is very reassuring. 
  • ZFS. For long-term mass storage, we favour ZFS to store large amounts of data safely. We like the fact that everything is stored on the disks themselves. Several times, we have migrated 30-disk ZFS filesystems from one OS (Solaris) to another (GNU/Linux), or from one hardware controller to another.
  • FHGFS. The Fraunhofer filesystem is used on our clusters for the scratch space where performance is more important than data safety. It is easy to set up (compared with Lustre, for instance).
  • Ceph is an object store that can also offer remote block devices and a parallel filesystem. Ceph is currently being studied as a means to converge all non-performance storage types into a single architecture based on commodity/old hardware. Example usages include CephFS instead of ZFS for mass storage, Ceph block devices for virtual machines instead of Gluster, and high-availability block storage for management services (e.g. Slurm) rather than DRBD. It will also serve as an object store for users who request it in the future.

GLPI

We store our inventory in GLPI, and use it to manage hardware issues and support/warranties. It is populated with information gathered by OCSInventory and by home-made scripts that pull data from different sources.

Indico

We use Indico, a platform for organising events developed at CERN, to handle the registration process for the training sessions, as well as for the other events we host.

Mariadb

Wherever we need a replicated SQL database, we use MariaDB along with Galera, which is very simple to set up and maintain. It simply does the job, quietly and efficiently.

Mercurial

We use Mercurial for all our configuration files; it is simple and works well. It is much easier than Git and sufficiently powerful for a small team with a few repositories.

OpenLDAP

All our user accounts are stored in OpenLDAP, with replication and SSL communications. Once it is set up, it is very robust.

Openstack

We host our virtual machines on Openstack to ensure high availability and ease of use. It was quite a challenge to set up, but it has proved very useful.

PHP

Most of our monitoring web pages are developed with PHP.

Pdsh

Parallel SSH to act manually on compute nodes. Although we try not to use it and favour Ansible or Salt, it often comes in handy for parallel computer access. It hasn't been updated in a while, but it simply works. And it has a Slurm plugin.

Python

Most of our home-made software is developed in Python. Development in Python is quick and robust. It is easily portable, and often used by scientists as well. A language every system administrator and every researcher should know.

Salt

We store all the configuration of our machines in a Salt server. Our Salt server is our Single Source of Truth, and it makes it very easy to redeploy machines from scratch, either after a disaster or to set up a test environment.

Shellcheck

We write a few Bash scripts to automate tasks or record procedures, and all of them are verified with shellcheck to make sure they are defensively programmed.

Slurm

Slurm is the job scheduler we have used since 2010. It replaced SGE at the time Sun was acquired by Oracle, which cast uncertainty and doubt over SGE's future. Slurm was then a rising star, with a few features missing, but it passed all our tests. Nowadays, Slurm is the success story we all know, running on half of the top-10 clusters in the world, including the number one.

Sphinx

We use Sphinx to write our user documentation. With reStructuredText, it can produce HTML and PDF output, and it is perfectly suitable for technical documentation, although it was primarily designed for writing Python documentation.

Sshuttle

We have access to our core virtual machines and to the out-of-band management of our physical servers only from the subnet of our offices. Whenever we are out of the office, be it at home, in a meeting room, or even in the computer room, we need to go through a secure gateway. sshuttle is a very nice piece of software that works very similarly to a VPN, except that it operates over SSH. Just run it in the background, and the firewalls become transparent.

Sympa

We use Sympa to moderate requests we receive from the users. The main contact point is a moderated mailing list we use as a lightweight ticketing system.

Vagrant

We test a lot of our configurations on virtual machines created (and destroyed) with Vagrant and VirtualBox.

Zabbix

Every single aspect of our infrastructure is monitored with Zabbix. It replaced the pair Nagios/Ganglia. It is very easy to create customized items (much more straightforward than with Nagios) and its graph creation capabilities are much more flexible than Ganglia's. It also works with proxies, which is important for us as all the compute nodes are on a private network.