De.KCD - Data Management Platform on the Cloud, step by step
Process and documentation for adapting a Docker-based Data Management platform, Seek4Science, to Kubernetes, and running it in the cloud.
Goals and scope
This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while weighing the benefits and costs: how much effort, how much complexity, which security risks, and what is gained. It does not replace a detailed explanation of each topic, but aims to save you time by (1) helping you decide which solution to adopt, (2) giving you enough understanding of each solution to navigate its documentation efficiently, and (3) providing working, explained solutions that can be used as-is or as a base for your own solution.
It goes from a simple local setup to a full Kubernetes-based set-up with distributed data, with some side documentation on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging, and considerations for connecting applications in the cloud.
We try to give a clear view of the cost and benefit of each solution, so it is easy to get a rough idea of which solution is best. This is quickly summarized and emphasized in each section.
We will also compile advice and tips, and we welcome all contributions (and corrections).
Finally, it is targeted mostly at Research Projects and/or Institutions, and therefore focuses on the aspects particular to these cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry.
How to use
Each section assumes that you know the preceding ones. If you already know a section's topic, feel free to jump to the next one.
When to…
When to go full cloud, or with a containerised solution, for your data management platforms: we provide below a very simplified answer, and we are working on a detailed decision tree.
Quick introduction to Linux/Unix
Online documentation and tutorials are enough for the basics. A book is strongly recommended for advanced topics.
Learning difficulty ranges from easy (the basics) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate, as you should have some knowledge of security concerns.
- Check external introduction
- Check Galaxy/NFDI training for deeper knowledge
For most set-ups, Linux is the operating system of choice. Due to security concerns, a minimal knowledge of it is probably a must in all cases, from bare-metal setup to cloud-based installation, though the useful set of skills will vary.
We list below the commands you should know to survive, with their most important options (note that some of these commands now run on Windows in the Terminal shell), followed by the important folders, user and group access on Unix, private and public keys, and SSL/SSH. This part is more a checklist of things you should know for the following topics, and we recommend learning them before continuing. Wikipedia is a good starting point, and many good tutorials are easy to find online. Books are needed only for a deep understanding, but are also strongly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternatives, but all Unix systems are very similar in their core usage and structure. The package managers will differ between Linux families: dpkg with apt/apt-get on Debian-based distributions (including Ubuntu), rpm with yum (now dnf) on RedHat-based distributions, apk on Alpine Linux, as well as sandboxed or universal package managers like Flatpak and Snap, to list the main ones.
Viewing a web page in a Linux shell
One very useful capability while setting up a web application is to access it directly. You might have access to a graphical terminal (see the next topic), but it is also possible to access most web applications using a command-line browser. Some fancy web pages won't be displayed in any useful way, but a Data Management Platform is rarely in that category.
A text-only browser is also a good way to check whether a web site is accessible, notably for visually impaired persons: all important content should be available as plain text, and navigation should still be possible.
lynx, the best friend of old server-side developers and the oldest web browser still maintained, with an easy-to-remember name, is a quick way to check a web site.
Open a page:
lynx http://localhost:8080

Navigation:
- Arrow keys: move between links
- Enter: follow link
- G: open a new URL
Quit:
- Q, then Y to confirm
w3m — slightly more modern than lynx, with optional image support in some terminals.
Open a page:
w3m http://localhost:8080

Navigation:
- Arrow keys: move
- Enter: follow link
- U: open a new URL
Quit:
- Q
links / links2 — similar to lynx, with better table rendering and optional graphics support.
Open a page:
links http://localhost:8080

- Esc: open menu
Quit:
- Q
Firewall, SSH and VPN
Export display with X11
The X Window System was conceived as a network system, so an application using X11 for its display does not have to be displayed locally. This is handy mostly for applications that only have a Graphical User Interface, or that are easier to manage through their GUI. The communication is ensured by SSH tunneling, and thus encrypted.
The remote host must be allowed on the local display, which might create a security risk if done permissively (xhost + instead of relying on .Xauthority).
It is seldom needed, but very useful to know it is possible for some very specific needs.
Linux with Wayland: Waypipe
If the same functionality is needed for a distribution using the newer Wayland display server, it is possible using Waypipe.
Remote desktop
A more common way to get a remote graphical display is via remote desktop applications. They are purposely written for such usage and thus should offer a secure and simple solution.
Due to security concerns, remote desktop (as well as remote X11) might not be usable: generally, a properly configured firewall will block almost all ports from external access.
If SSH access is allowed, a remote desktop relying on SSH tunneling will be possible (as with X11).
Otherwise, a VPN connection might be needed first.
It is always better to accept an extra constraint (using the VPN in this case) than to potentially add a vulnerability (see attack surface in the Backups and Security page).
Common tools:
RDP
- Server: xrdp
- Clients: Remmina, Windows Remote Desktop
- Works well over networks
VNC
- Servers: tigervnc, tightvnc
- Clients: vncviewer, Remmina
- Simple and widely supported
Wayland-friendly options
- GNOME Remote Desktop (RDP)
- KDE Remote Desktop
- weston (the Wayland reference compositor)
Which option should I use?
I just want to check that a web service is running -> Use a text-based browser (lynx, w3m, or links). Fast, simple, works everywhere.
I need a feature that exists only in a graphical interface -> Try export display with X11 (if available). Useful for configuration tools or admin interfaces.
If too difficult to set-up or not available (on newer Linux or other OS) -> Use a remote desktop.
I am not sure which option to choose -> Use a remote desktop. It is the most reliable and behaves like a normal desktop.
Main web-servers
Most web applications run as a CMS (Content Management System): a program running on a server that generates the web pages depending on the user's actions, using data stored on the server side (not necessarily on the same server). This program either serves the web content by itself, using the facilities of its language (for Python, Ruby, Perl, …), or runs within an application server (like Apache Tomcat for Java).
But very often the web application, especially in production, will be accessed through a web server. The web server takes care of accepting secure connections (HTTPS) and forwarding requests to the web application(s). It offers several benefits:
- an extra layer of security,
- caching of static content (such as images),
- possibly load-balancing between several instances of the web application,
- one central access point for several web applications, even if they are not homogeneous.
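As an illustration, a minimal HTTPS reverse-proxy configuration for Nginx might look like the following sketch; the host name, certificate paths and backend port are placeholder values, not taken from any real deployment:

```nginx
# Minimal HTTPS reverse proxy (all names and paths below are example values)
server {
    listen 443 ssl;
    server_name app.example.org;

    ssl_certificate     /etc/ssl/certs/app.pem;
    ssl_certificate_key /etc/ssl/private/app.key;

    location / {
        # Forward every request to the web application running locally
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

The X-Forwarded-* headers let the application know the original client address and protocol, since it only ever sees connections coming from the proxy.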
The main web servers are:
- Nginx,
- Apache,
- Internet Information Services (IIS, Windows only, proprietary).
Another web server gaining traction is Caddy, which tries to integrate more aspects (like obtaining an SSL certificate) while being easier to use. As always, the gain has to be weighed against the risk, as its adoption and ecosystem are smaller.
IIS is rarely used for research projects but can be the best choice if the application is built on Microsoft technologies.
Commands that should be known
ls → list the content of a directory
ls -al → include all files (including hidden files, starting with .) and show a detailed list
cd → go to a directory
- cd .. → parent directory
- cd / → root directory
- cd ~ → user home directory
ps → list running processes ps -edf or ps -aux → list running processes and their owner
top → show running processes and their memory/CPU usage (CTRL-C to exit)
cp / mv / mkdir / rm → copy, move/rename, create directories, remove files or directories
rm -r → recursive removal
rm -rf → dangerous, use with extreme caution: the f means force (no confirmation asked), so combined with -r it removes everything without asking.
more / less / cat → display file content; less is usually preferred (scrolling, searching)
tail, tail -f → show the end of a file, -f to follow it live (very useful for logs)
vi / vim / emacs / nano → most of the time we work on servers/containers with a text terminal, where being able to edit a file is often needed. vi, nano and Emacs are powerful text-based editors present on most Linux distributions, though the lightest distributions used for container images might ship only a minimal editor (such as plain vi instead of the extended vim or Emacs).
Even if vi might seem very hard to use and strange at first glance, the basics are quick to learn and very convenient.
Emacs has a very different approach but is much more powerful, which can be useful for more complex work (if needed).
If you have never used a terminal editor, search for a short "vi basics" or "nano basics" introduction.
The absolute minimum to know about vi/vim is that you press "i" to edit the text, then Escape to return to "command mode". In command mode, :wq saves and quits (write and quit), :q quits if there are no changes, and :q! quits without saving the changes.
These editors are also used for editing the crontab (crontab -e), with the editor chosen at first use (it can be changed later). The crontab determines how cron jobs (regularly scheduled jobs) run and, while not needed for running an application (and so not strictly part of this list), can be really useful.
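As a sketch, a crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command to run; the script and log paths below are purely hypothetical:

```
# m  h  dom mon dow  command
# Run a (hypothetical) backup script every day at 03:00,
# appending both output and errors to a log file
0 3 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1
```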
man → show the manual for a command. man should be an automatic reflex for any lesser-known usage. It is absolutely normal to forget how an option or command works when it is not used often; man is there for such cases. If you forget a command name, it is a good idea to keep a Linux cheat sheet around; a quick internet search will find many good one-page PDFs.
tldr → a simplified man that needs to be installed first, but can also be consulted online or as a general PDF
ln → create links (that point to a file); understand the difference between:
- Hard link: another name for the same file data (the data is removed only when the last hard link to it is deleted); it shares the permissions of the target file
- Symbolic link (symlink): exists independently of the linked file and stays behind (dangling) if the linked file is removed or moved; it has its own permissions
For setting-up an application, we generally use symbolic links, mostly as a way to change the permissions independently of the target file.
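The difference can be sketched in a throwaway directory:

```shell
cd "$(mktemp -d)"           # throwaway scratch directory
echo "data" > target.txt
ln target.txt hard.txt      # hard link: another name for the same file data
ln -s target.txt soft.txt   # symlink: a pointer to the name "target.txt"
rm target.txt               # remove the original name
cat hard.txt                # still prints "data": the data survives while a hard link remains
ls -l soft.txt              # the symlink survives too, but now points to nothing (dangling)
```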
chown / chmod → change the owner of a file or folder, change the permissions of a file or folder. Be familiar with:
- symbolic permissions: rwxrwxrwx, r-x------
- numeric modes: 755, 644
- examples: chmod o+r filename, chmod -R 755 directory, chown -R user:group directory
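A small sketch showing numeric and symbolic modes on a throwaway file:

```shell
cd "$(mktemp -d)"      # throwaway scratch directory
touch script.sh
chmod 755 script.sh    # numeric mode: rwxr-xr-x
chmod o-rx script.sh   # symbolic mode: drop read/execute for "others" -> rwxr-x--- (750)
ls -l script.sh
# chown user:group script.sh would change owner and group, but usually requires root
```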
Ideally the pipe (|) should be well understood, as well as the I/O redirections (>, >>, <). If this is unclear, look for "Unix pipes and redirection"; it is a core concept.
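A short sketch of the two output redirections, input redirection, and a pipe chain:

```shell
printf 'beta\nalpha\nalpha\n' > /tmp/pipe-demo.txt   # > creates or truncates the file
echo 'gamma' >> /tmp/pipe-demo.txt                   # >> appends to it
sort < /tmp/pipe-demo.txt | uniq | wc -l             # < feeds stdin; | chains commands
# the pipeline sorts the 4 lines, collapses the duplicate, and counts 3 unique lines
```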
grep → search for the presence of a string in files, often used combined with another command using a pipe (|)
find → search for files by name, type, size, or date; often combined with grep or xargs
mount / umount → attach or detach filesystems. Removable media must be mounted before access; for instance, a USB key must be mounted before use. Temporary mount points are often under /mnt or /media. Modern systems often automount removable media.
pwd / whoami →
- pwd → print current directory
- whoami → show current user
kill → kill does not only kill a process. It actually sends a signal, which most of the time terminates a process: the signal is delivered to the given process for most signals, or to the kernel for SIGKILL (kill -KILL processId) and SIGSTOP (kill -STOP processId), to respectively forcefully kill or stop a process. The default signal (i.e. no argument) for kill is SIGTERM (kill -TERM processId), which should be handled by the process so it terminates properly. If the signal is not processed by the application, either because no handler is coded in or because it is in a state where it cannot deal with it (for instance an infinite loop), nothing will happen. In that case, only kill -KILL processId will kill the application.
The SIGKILL termination is considered unsafe and should only be used on a hung process, after trying SIGTERM first.
The signals have an integer value, with some defined in the POSIX standard, such as kill -9 for SIGKILL, or kill -15 for SIGTERM.
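The difference can be sketched with a disposable background process; in bash and POSIX shells, a process that dies from a signal has exit status 128 plus the signal number:

```shell
sleep 60 &              # throwaway background process
pid=$!
kill -TERM "$pid"       # polite request; same as plain `kill "$pid"` (SIGTERM, signal 15)
wait "$pid"
status=$?
echo "$status"          # 128 + 15 = 143: the process died from SIGTERM
# a hung process that ignores SIGTERM can still be stopped with `kill -KILL "$pid"` (kill -9)
```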
eval → a powerful and dangerous command that executes its arguments as a shell command. It is thus possible to build a command in a shell script and execute it with eval, which enables very advanced operations.
Knowing eval is important for security reasons: the presence of eval in a suspicious script calls for caution.
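A minimal sketch of what eval does, and why it is dangerous with untrusted input:

```shell
cmd='echo "hello from eval"'   # a command stored as a plain string
eval "$cmd"                    # prints: hello from eval
# Danger: if any part of the string comes from user input,
# arbitrary commands can be injected and executed
```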
xargs → build and execute a command from standard input
Very useful when a command produces a list of items (files, process IDs, etc.) that must be passed as arguments to another command. Typical usage combines find, grep, or pipes:

find . -name "*.log" | xargs rm
grep -l "ERROR" *.log | xargs less
This is needed because many commands do not accept input directly from stdin, but only as command-line arguments.
Be careful: xargs will execute the command on all received items.
Prefer using xargs -n 1 (one argument at a time) or test first with echo.
Unlike eval, xargs does not transform a string into a command (the command must be given explicitly and only receives its parameters from xargs). As such, it is not dangerous the way eval is.
If unfamiliar, search for “xargs explained”, understanding it greatly improves command-line efficiency.
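The pattern above can be tried safely in a scratch directory; note the echo dry-run before the real rm:

```shell
mkdir -p /tmp/xargs-demo && cd /tmp/xargs-demo
touch a.log b.log keep.txt
find . -name "*.log" | xargs echo rm       # dry-run: prints the rm command instead of running it
find . -name "*.log" | xargs rm            # actually removes a.log and b.log
printf '1\n2\n3\n' | xargs -n 1 echo item  # -n 1: one argument per invocation, so 3 lines of output
```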
curl / wget → interact with HTTP/HTTPS services from the command line. Useful for testing APIs and services, or for downloading files.
free / df / du → check system resources
- free → memory usage
- df → disk usage per filesystem
- du → disk usage per directory
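For example (df and du are portable; free is Linux-specific):

```shell
df -h .      # disk space on the filesystem holding the current directory
du -sh .     # total size of the current directory and its subdirectories
# free -h    # memory usage (Linux-specific; uncomment on a Linux host)
```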
command & / CTRL-Z + bg / fg → run a command in the background, suspend it, or bring it back to foreground
ssh / scp → connect to remote servers and copy files securely. Essential for working with remote machines. If unfamiliar, search for "SSH basics".
Graceful restart: ask for a transparent restart of a service -> finish the existing "transactions" before restarting. For a web server this means keeping the currently open HTTP sessions until they finish before restarting, thus minimizing the impact of the restart for the end user. Keep in mind that the web server knows nothing about a CMS session it serves (as the web server is generally only a proxy to the web application), so CMS sessions might be forcefully closed anyway (depending on how the session is managed).
There are several ways to do so:
- systemctl reload nginx|apache2|caddy (graceful) (on most Linux distributions) → systemctl is part of systemd, where the 'd' stands for 'daemon'. Daemons are Unix's background processes. What is important to know is that a web server, such as Apache or Nginx, will usually run as a daemon and be managed by systemctl. systemctl reload servicename triggers a graceful restart of the service. Common commands:
  - systemctl status service
  - systemctl start|stop|restart service
  - systemctl reload service (graceful reload when supported)
- /etc/init.d/nginx reload / /etc/init.d/apache2 reload (on some Linux distributions and other Unix OSes (FreeBSD, OpenBSD, …)): systemd is a relatively new and controversial system and service manager; it is the first process to run (PID 1), starts the other elements of Linux, and manages many of them as daemons. This monolithic aspect is quite opposed to the Unix principle ("Do one thing, and do it well"), which is the main point of the controversy. As such, some distributions still use the traditional init (SysV-style) with /etc/init.d scripts.
  As there is no officially available init.d script for Caddy (at the time this document was written), it might not be possible to use this command with Caddy.
- apachectl graceful / apache2ctl graceful / nginx -s reload / caddy reload: the web servers also offer their own command-line management. It is still useful to know the operating-system way, as it does not change between web servers, so maintenance on a different web server remains easy (a web server might be part of a container for a web application). Apache on Windows is managed by httpd.exe and only offers a graceful restart.
apachectl configtest / nginx -t → test the current configuration of Apache or Nginx. Configuring a web server can quickly become complex, for instance when a reverse proxy serves several virtual hosts (e.g. several web applications, each served by its own Docker instance). These commands minimize the risk of deploying a broken configuration, but they still do not ensure that the configuration does what is expected.
rsync → efficient synchronization of files and directories. Often used for backups or deployments. Combined with database dumps, it can form a simple backup strategy; in that case it is important to monitor the process to ensure it keeps working.
Folders you should know
- / → filesystem root
- /bin → essential system binaries
- /usr → user-space applications and libraries
- /usr/bin → most user commands
- /etc → system-wide configuration files (very important)
- /var → variable data
- /var/log → system and application logs (first place to check on errors)
- /home → user home directories
- /root → root user’s home directory
- /tmp → temporary files (often auto-cleaned)
- /mnt → mount points
- /media → temporary mount points for external storage (optional)
- /opt → optional or third-party software
Online resources
The Unix command line is well explained here. It is probably useful to know about the main principle of Unix, and a good course is available here and Wikipedia has a good overview of the Unix filesystem and its layout.
DeNBI Unix course -> TBD, adapt using a light Linux image (or public online VM)
What makes the containers possible: cgroups
One more advanced element which is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to limit and track the resource usage of processes; together with kernel namespaces, which give processes an isolated view of the system (including a "sub" filesystem isolated from the host filesystem), they are what makes containers possible.
It is important to be aware of this to understand the difference between a container and a virtual machine, and to realize that the security risk is higher with a container: bypassing the isolation would give direct access to the host filesystem. And it is not only a matter of exploits: as containers are part of the host, they can use a folder outside their isolated space (a host volume). If not properly set up, this could expose a sensitive part of the host to an intruder. Virtual machines, on the other hand, save their data within a virtual disk that is part of the virtual machine; there is no access to the host system short of a genuine exploit. So for a quick set-up of an exposed application, a virtual machine might be a better solution.
A long-term set-up should take proper care of security in all cases, and then there should be no benefit in using a virtual machine, aside from specific cybersecurity needs (like a honeypot).
Bare-Metal setup and Virtual Machines
Vocabulary
"Bare-metal" is a bit of a misnomer: originally it means using a computer without an operating system, directly on the hardware (thus using a low-level language such as ASM or C). But a bare-metal setup here means a direct installation on a machine, with its operating system.
Note that an actual bare-metal setup can still be a viable option for specialised tasks: if it needs to be embedded (for a cheaper/more efficient solution on a specialised piece of hardware), or if it needs to be highly performant, though in the latter case it might be a better option to move the processing to the GPU (which is also a kind of bare-metal implementation). Choosing to do so is out of scope for this document, but it is important to do it knowingly, as it can have a high cost (i.e. difficulty of maintenance).
A Virtual Machine is a software solution that provides the functionality of a physical computer. It is possible (and generally needed) to install an operating system on it. Most virtual machine solutions provide an easy way to do so, and using one is akin to using an actual computer.
It can be an emulation, "imitating" a computer enough to run its software, generally used for specific purposes such as running deprecated operating systems. Qemu is one of the main ones. An emulation provides full encapsulation, so it is in theory more secure, and exploits are fixable by software updates.
Or it can be a virtualization, where the hosted machine runs directly or semi-directly on the physical hardware through a hypervisor that manages how the resources are used:
- Full virtualization (simulating enough of the computer hardware to run a guest OS) needs significantly more resources than an actual physical computer, while providing full encapsulation, so in theory more security. Exploits are also fixable by software updates.
- Hardware-assisted virtualization, where the hardware helps with the virtualization. Exploits relying on the hardware support might be difficult or impossible to fix, though they are exceptional.
- OS-level virtualization, where the operating system shares its resources with the guests. Docker and other container engines use this kind of virtualization. The reason containers are called containers and not virtual machines comes from the usage: instead of creating a virtual machine and setting it up with an operating system and subsequent pieces of software, they rely on fixed images that contain a bare-minimal operating system (ideally) and layers of software to support the desired application. Once running, they effectively behave like a virtual machine.
A main protection against threats is staying up-to-date: a recent or LTS (Long-Term Support, i.e. supported for a long time) operating system on the physical computer, virtual machine software, and virtualization software (which might have direct support in the CPU).
Containers, Overview of Docker usage
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy, assuming you have some knowledge of Linux.
A clear presentation of Docker is available here and the official documentation is good. The section you should consult is "Open Source", i.e. about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which also offers a single-node Kubernetes and is the simplest way to use and test Docker and Kubernetes on Windows.
While Docker is the de facto standard, there are other container engines, such as Podman, containerd, or CRI-O, and a standard, the Open Container Initiative (OCI), that Podman, containerd and CRI-O follow and Docker almost follows (new images should be OCI compliant, old ones might not), so Docker images might need some adaptation to run with Podman and the others. Docker offers an OCI exporter for the images it builds.
Vocabulary
An image is a template used to create containers. It contains:
- an operating system base (often a minimal Linux). The kernel itself always comes from the host; the image provides the userland, which is generally included for compatibility (be sure that the version of Linux is the right one, especially when not running on a Linux host),
- the application,
- its dependencies,
- default configuration.
An image is immutable: you do not modify it while it is running. Images are usually downloaded from a registry (e.g. Docker Hub).
A container is a running instance of an image. It:
- runs processes,
- listens on ports,
- can be started, stopped, or restarted.
Note that the desired application needs to be started (the entrypoint), or only the operating system will run.
Containers are ephemeral:
- if a container is deleted, everything inside it is lost
- unless data is stored in a volume
A volume is a persistent storage area outside the container. It is used to store all data that must survive container restarts or upgrades.
Volumes are:
- mounted inside containers
- independent of the container lifecycle
Docker Compose is a tool to define and run multiple containers together using a single configuration file (docker-compose.yml). It is used to describe:
- which images to run,
- how containers are connected,
- ports,
- volumes,
- environment variables.
With Docker Compose, you typically manage an entire application stack:
- web server
- application
- database
- cache
All started with:
docker-compose up
Minimal mental model
- Image → what to run
- Container → the running application
- Volume → persistent data
- Docker Compose → a complete setup of multiple images and volumes
If you keep this model in mind, most Docker-related documentation will start to make sense.
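As a sketch, a docker-compose.yml for such a stack could look like this; all image names and credentials are example values, not taken from a real deployment:

```yaml
services:
  web:                         # reverse proxy / web server
    image: nginx:1.25
    ports:
      - "8080:80"              # host port 8080 -> container port 80
    depends_on:
      - app
  app:                         # the web application (placeholder image)
    image: example/myapp:1.0
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/app
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret   # placeholder credential
    volumes:
      - db-data:/var/lib/postgresql/data   # data survives container restarts
volumes:
  db-data:
```

Containers reach each other by service name (app connects to db), while only the web service is exposed on the host.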
Useful things to know
A Docker image is started (once used in a container) through an Entrypoint and/or a Cmd. These can be listed with docker inspect <image id>, or individually with docker inspect -f '{{.Config.Entrypoint}}' <image id> and docker inspect -f '{{.Config.Cmd}}' <image id>. This is useful when something goes wrong inside a container, to know how the container is supposed to start. To learn more about Cmd and Entrypoint, the official Docker documentation offers a great overview.
Changing something that is part of the image is possible while the container is running, but it will never be persistent and is generally a bad idea. One good use case is debugging an application, provided the debugged code is also changed in the code base as soon as the issue is found.
How to build a Dockerfile
The official online documentation is good and complete. Online tutorials or books are recommended for an easier approach.
Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.
A Dockerfile defines layers from a base images (FROM image_name:version) to build a new image. For instance, if the application is running on Python, it is a good idea to start from an official Python image, which will already have the right version of Python and its dependencies. Then you can add the application and its dependencies on top of it.
An image is always immutable, and a new image from a base one with modification will also be immutable. In case we need several application with the same base image, it is a good idea to build a custom base image with the common dependencies, and then build the different application images from this custom base image, so the common layers will be shared between the different images.
A shortcoming is that if one layer adds files that are changed or removed again in a later layer, both layers still end up in the image. This is not always avoidable, but it is good to be aware of, as it can lead to a bigger image than expected, and thus a longer build time and a longer time to start the container.
A powerful way to optimize a dockerfile is to use a multi-stage build, where the first stage is used to build the application and its dependencies, and the second stage is used to copy only the necessary files from the first stage to create a smaller image. This is especially useful for applications that need to be compiled, such as those written in C or C++, but it can also be useful for other types of applications.
As images are immutable, updating is always done by fetching a new base image and building a new image on top of it. It is important to keep an eye on the base image, as it can have security vulnerabilities that are fixed in newer versions. It is a good idea to use a specific version of the base image (for instance python:3.9.13-slim instead of python:3.9-slim), so you know exactly which version you are using and can update it when needed.
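A sketch combining both points (a pinned base image version and a multi-stage build), here for a hypothetical Go application; all names are placeholders:

```dockerfile
# Build stage: full toolchain, pinned base image version
FROM golang:1.22-bookworm AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app .

# Final stage: only the compiled binary on a minimal, pinned base
FROM debian:bookworm-slim
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
```

The final image never contains the compiler or the sources, only what the second stage copies in, which keeps it small.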
Moving to Docker compose
The official Online documentations is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty is more about the different components of the setup (for instance an image with a database, linked with an Authentication system).
Advantages
Docker Compose is both a tool to set up a Docker-based application (or set of applications) and the file, written in YAML, that describes the setup. It is a powerful tool to manage a complex application with several components, such as a web server, an application, a database, and a cache. It allows defining the different components of the application and how they are connected together, as well as the volumes and environment variables needed for each component. It also allows starting and stopping the application with a single command, and managing the different components in a consistent way.
Docker compose is quite simple to learn and use, and it is a good way to start with container-based applications. With a coherent docker-compose file, it is a good solution for running a web application, with a database and a cache, on a single machine, with a simple set-up and a good performance.
To access the web-application, it is generally needed to use a reverse proxy, such as Nginx or Apache, to forward the traffic from the host to the container. It is also possible to use a load balancer, such as HAProxy, to distribute the traffic between several instances of the application.
Things to take into account
Adding parameters
The different types of volumes
Namespaces and namespaces collisions
Namespacing is done using the name of the folder containing the docker-compose file. Using the same folder name, even under different parent folders, will use the same namespace. So running two different docker-compose setups from folders of the same name will use the same Docker network. If the applications use the same images, it could mean that one container communicates with a container from the other docker-compose setup. For instance, suppose you want two instances of the same application, testApp, which uses a MySQL database; the database is defined in the docker-compose file and used by testApp. The set-up is the following:
application1/testApp/docker-compose.yml
application2/testApp/docker-compose.yml

It is a clean set-up, with a distinct parent folder above each docker-compose folder (which would make a lot of sense if you want to store some specific files outside of the docker-compose folder). But as the direct parent folder of each docker-compose file has the same name, only one MySQL instance will be used by both (or there will be a binding error and one application will fail, which is probably the best outcome).
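One way around this collision (assuming Docker Compose v2) is to pin the project name explicitly, either with docker compose -p application1 up, the COMPOSE_PROJECT_NAME environment variable, or the top-level name: key in the file; the names and credentials below are placeholders:

```yaml
# application1/testApp/docker-compose.yml
name: application1-testapp    # explicit project name: no collision with application2
services:
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: example   # placeholder credential
```

With a distinct name: in each file, the two setups get separate networks and containers even though their folders are both called testApp.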
From Docker compose to Kubernetes
Kubernetes is composed of many elements and has its own vocabulary. Each elements and the way the work together are not complex, but to grasp a minimal working set of Kubernetes will take some effort. We recommend to start with an online tutorial, interactive or not (links to separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running
Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than setting up a production cluster. As software developers, we recommend that you set up a cluster only as a testbed and rely on sysadmins for a production cluster.
Kubernetes elements and concepts are not more complex than those of docker compose, but there are many more of them, so the overall complexity is much higher. It is an orchestration engine: where docker compose can only ask for containers to be restarted when they stop, on the same machine, Kubernetes can choose on which machine containers will run, duplicate them, kill them if they seem unhealthy and create healthy ones, expose them as services usable by other clients (which could be other containers within Kubernetes), and set up network access to these containers without knowing where they run.
For the simplest applications it can be complete overkill; the more complex your application is (in terms of size and number of elements), the more there is to gain from Kubernetes. If you need an application that runs 24/7, with transparent updates, and that can scale from hundreds of clients to several thousand, Kubernetes will make things much easier, and in a well-thought-out way.
But these possibilities come with real difficulty in grasping them, especially if you are not a full-time sysadmin, and this documentation is (mostly) made by and aimed at non-sysadmins.
Quick overview
Kubernetes is an orchestration platform: applications run in containers, containers run in pods, and pods run on host systems called nodes. There can be several containers in a pod (though only if the containers benefit from being strongly coupled), several pods on a node, and several nodes. Kubernetes can automatically manage copies (replicas) of one application across several nodes.
Everything is decoupled and works declaratively: manifests declare the desired state, and Kubernetes works on satisfying it, even when a problem occurs. So if you ask for 4 copies of one application, plus 4 copies of another that should avoid the first, and you have 8 nodes, Kubernetes might initially distribute these 8 instances across the 8 nodes. If one node goes down, its instance will move to another node running the same application, thus still avoiding the other one. This avoidance (anti-affinity, which works at node level) can be strict (no scheduling if the constraint is not met) or soft (if there is no other choice, the node is selected anyway).
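The replica-plus-avoidance scenario above can be sketched as a Kubernetes manifest (the application names, labels and image are hypothetical); the "soft" form of the rule is expressed with `preferredDuringSchedulingIgnoredDuringExecution`, the strict form would use `requiredDuringSchedulingIgnoredDuringExecution`:

```yaml
# Sketch: 4 replicas of app-a that prefer to avoid nodes running app-b.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: 4
  selector:
    matchLabels: {app: app-a}
  template:
    metadata:
      labels: {app: app-a}
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: schedule elsewhere if possible,
          # but still use the node if there is no other choice.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels: {app: app-b}
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app-a
        image: registry.example.org/app-a:1.0   # hypothetical image
```

Applying this manifest (`kubectl apply -f app-a.yaml`) only states the wish; the scheduler then continuously works to satisfy it, including after node failures.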
Kubernetes is presented in more detail in the Kubernetes introduction page, which aims to accelerate learning it and give enough of an overview to understand clearly what it is, what it offers and what it costs.
Other Solutions
Linux Packaging/Bare metal setup
In some cases, installing directly on the host system can be better, or even the only option apart from using a virtual machine, if the application cannot be containerised.
This can be due to specific hardware or security needs, because the application has a very complex setup, or because the application has not been maintained for a long time and does not offer an image (or the image is not of the latest version…).
Most of the time it is possible to set it up in a virtual machine with some extra effort, but some hypervisors (the software that runs the virtual machines) have restrictions on hardware use (GPUs for instance), or lock some functionality behind a commercial version.
Containers can have full access to the hardware when run on Linux, but the base image will probably be restricted in what it is allowed to do, or getting that access may be really difficult.
In any case, the effort to virtualise or containerise the application should depend on the security risk and the estimated time of usage. If the application presents a risk and will be run for several years, it makes sense to spend some effort isolating it. Doing so is simply a matter of removing obstacles one by one:

- try to make it run,
- identify what the issue is,
- look for solutions:
  - check the official documentation,
  - if possible, ask a colleague who might have had the same issue,
  - an LLM might know the answer. If the given answer does not work after 2 or 3 clarifications, it is generally better to stop trying, as LLMs get easily confused and will happily give wrong answers, or answers to the wrong question,
  - do an internet search. Good answers generally come from IT-related forums, especially those focusing on the solution at hand (either the software or the platform), StackOverflow and Reddit,
  - ask online,
  - try alone, by preparing a test-retrial workflow: an easy way to break and repeat, often by keeping a main version of the setup that won't be touched and iterating on copies.
- once this issue is fixed, if the software still has issues, reiterate.
One big caveat of preparing an application that has not been developed in-house is the risk that one or several functionalities will not be exercised during the setup, might not work, and so the application will break later. To avoid this, it is important to test the application with realistic dummy data and to try everything that might be used. As such, it might be better to stick to well-documented applications (even if no longer maintained) that allow a quick overview of all capabilities.
A maintained but undocumented application might seem a better option, but only if it is certain that it will be maintained for long enough and/or that it will eventually be fully documented.
If you write a containerisation for a maintained application, it is a good idea to propose it as a contribution to the project. You will have to see who would be in charge of updating it if needed.
Proxmox VE — Virtualisation Without VMware
Proxmox Virtual Environment (Proxmox VE) is an open-source server virtualisation platform. It allows managing several applications on several servers through a web interface. It supports virtual machines and Linux containers (LXC), which support OCI images, a format very close to Docker images (most Docker images should be compatible).
It has gained significant traction in scientific and institutional settings as a cost-free replacement for VMware vSphere following Broadcom’s acquisition of VMware in 2023 and the subsequent end of free ESXi licences.
What Proxmox can do
Virtual Machines (KVM/QEMU). Proxmox uses the Kernel-based Virtual Machine (KVM) hypervisor combined with QEMU to run full operating system images. Each VM is completely isolated: it has its own virtual hardware, its own kernel, and its own network stack. You can run Windows, any Linux distribution, or BSD inside a VM.
- Full hardware emulation, including GPU passthrough for GPU-accelerated workloads.
- Live migration — move a running VM between cluster nodes with minimal downtime.
- Snapshots — capture the entire state of a VM at any point in time.
Linux Containers (LXC). Containers are faster to start and use less memory than full VMs.
- Lower overhead than full VMs — near-native performance.
- Suitable for running multiple isolated Linux environments on one host.
- Not suitable for non-Linux guests.
Integrated Backup System. Proxmox Backup Server (PBS) is a companion product that provides deduplicating, incremental backups of VMs and containers. Backups are stored efficiently - only changed data blocks are written after the first backup - and can be verified cryptographically.
- Scheduled automatic backups with configurable retention policies.
- Instant restore of a single file from a VM backup without restoring the whole image.
- Offsite replication: backups can be replicated to a remote PBS instance.
High Availability Clustering. Multiple Proxmox hosts can be joined into a cluster. The cluster shares a common configuration and can automatically restart VMs on another node if a host fails. This requires at least three nodes to achieve quorum: the next main node needs to be elected by a majority vote.
Software-Defined Networking. Proxmox VE allows administrators to define VLANs, virtual network zones, and routing policies from the web interface.
Proxmox vs. Kubernetes — when to use which
Proxmox virtualises hardware or runs containers without the full orchestration that Kubernetes offers. Kubernetes orchestrates containers, running packaged applications across a pool of machines.
They operate at different layers and can be combined: a Kubernetes cluster can run on VMs managed by Proxmox.
| Dimension | Proxmox VE | Kubernetes |
|---|---|---|
| What it manages | Virtual machines & LXC containers | Application containers (Docker/OCI) |
| Abstraction level | Infrastructure / OS level | Application / workload level |
| Typical user | Sysadmin managing servers | DevOps/developer managing applications |
| Scaling | Manual VM provisioning or scripted | Automatic horizontal pod scaling |
| Self-healing | HA restarts VMs on node failure | Reschedules failed pods automatically |
| Complexity | Moderate — manageable by a small team | High — significant learning curve |
| Persistent storage | VM disk images, Ceph, NFS | Persistent Volume Claims, CSI drivers |
| Networking | Bridges, VLANs, SDN | CNI plugins, Ingress controllers, Services |
A practical path for a research group: start with Proxmox to host virtual machines for different services (database, web server, backup). If the team grows and needs to manage many application deployments with automated scaling, consider running a Kubernetes cluster on top of Proxmox VMs. If the needs grow even higher, OpenStack offers a group of cloud-oriented solutions, see below.
Resources
- Proxmox VE documentation — full installation and administration guide
- Proxmox Backup Server documentation
- Proxmox community forum — active community with scientific HPC use cases
- Migrating to Proxmox from VMware — official migration wiki
OpenStack — Open-Source Cloud Infrastructure
OpenStack is a large-scale, open-source cloud computing platform that allows organisations to build and operate their own Infrastructure-as-a-Service (IaaS) cloud — comparable in scope to Amazon Web Services or Microsoft Azure, but running entirely on hardware you control. It is widely used by national research infrastructures, universities, and supercomputing centres worldwide.
OpenStack is designed for managing tens to thousands of physical servers. For a single-server or small-team deployment, Proxmox is usually more appropriate. OpenStack becomes compelling when a research institute or facility needs to offer self-service cloud resources to many research groups simultaneously.
Architecture overview
OpenStack is not a monolithic application — it is a collection of loosely coupled services that communicate via REST APIs. Each service manages one aspect of the infrastructure.
| Service | Code name | Function |
|---|---|---|
| Compute | Nova | Manages the lifecycle of virtual machine instances |
| Networking | Neutron | Software-defined networking: virtual networks, routers, floating IPs |
| Block storage | Cinder | Persistent block volumes attached to VMs |
| Object storage | Swift | Scalable object store (S3-compatible) |
| Image service | Glance | Stores and retrieves VM disk images |
| Identity | Keystone | Authentication and authorisation for all OpenStack services |
| Dashboard | Horizon | Web-based graphical interface for all services |
| Orchestration | Heat | Template-based infrastructure orchestration |
| Telemetry | Ceilometer / Gnocchi | Usage metering and monitoring |
| Bare metal | Ironic | Provision physical servers alongside VMs |
| Container orchestration | Magnum | Provides Kubernetes clusters as a managed service |
| Distributed storage | Ceph (integrated) | Block, object, and file storage for Nova/Cinder/Glance |
Ceph — distributed storage
Ceph is a software-defined distributed storage system tightly integrated with OpenStack. It is the recommended storage backend for production OpenStack deployments because it eliminates single points of failure and scales horizontally by adding storage nodes.
How Ceph works. Ceph distributes data across a cluster of storage nodes (called OSDs - Object Storage Daemons). Each piece of data is replicated (typically three copies) or erasure-coded across different nodes, so the failure of any single disk or server does not cause data loss.
The three main Ceph interfaces are:
- RADOS Block Device (RBD) — presents distributed storage as a block device. Used by Cinder for VM volumes and by Nova for VM boot disks.
- RADOS Gateway (RGW) — provides an S3-compatible object storage API, usable as a drop-in replacement for Amazon S3. See more information about object storage in the Data Storage page.
- CephFS — a POSIX-compatible shared filesystem for applications that need traditional file-path semantics.
Why Ceph with OpenStack. Live migration of VMs is much simpler when all compute nodes access the same Ceph storage pool. Storage can also scale independently of compute by adding storage nodes without affecting running VMs.
Magnum — Kubernetes as a service
Magnum is the OpenStack service that provisions and manages container orchestration clusters, most commonly Kubernetes, on demand. Rather than manually installing and configuring Kubernetes, Magnum automates the entire process using Heat templates (see below).
Users request a cluster via the API or Horizon dashboard and receive a fully configured Kubernetes environment running on OpenStack VMs. Administrators define cluster templates specifying the node size, Kubernetes version, and networking plugin. Once created, the cluster is managed with standard Kubernetes tools (kubectl, Helm, etc.) and the OpenStack layer becomes invisible to the end user.
Magnum does not replace Kubernetes — it automates the deployment of Kubernetes on OpenStack infrastructure. From the user’s perspective, they receive a standard Kubernetes cluster.
Other notable OpenStack services
Horizon (Web Dashboard). Horizon is the web interface for OpenStack. It provides a graphical view of all resources: instances, volumes, networks, and images. For users who prefer not to use the command line or REST APIs, Horizon is the primary day-to-day interface.
Neutron (Networking). Neutron provides virtual networks, subnets, routers, security groups, and floating IPs.
Heat (Orchestration). Heat allows you to describe an entire cloud environment — VMs, networks, storage, security groups — in a YAML template and deploy it reproducibly with a single API call. This is the OpenStack equivalent of AWS CloudFormation and enables Infrastructure as Code workflows.
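A minimal Heat template (in the HOT format) can make this concrete; the image, flavor and network names below are hypothetical and site-specific:

```yaml
# Sketch of a Heat Orchestration Template (HOT): one VM on a named network.
heat_template_version: 2018-08-31

parameters:
  key_name:
    type: string
    description: Existing SSH keypair to inject into the server

resources:
  server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-22.04        # hypothetical image name
      flavor: m1.small           # hypothetical flavor
      key_name: { get_param: key_name }
      networks:
        - network: private-net   # hypothetical network

outputs:
  server_ip:
    value: { get_attr: [server, first_address] }
```

Such a template is deployed as a "stack" (for instance with `openstack stack create -t server.yaml --parameter key_name=mykey mystack`), and the whole environment can be updated or deleted as one unit.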
Ironic (Bare Metal Provisioning). Ironic provisions physical servers in the same way Nova provisions VMs. This is useful for high-performance computing workloads where VM overhead is unacceptable — for example, MPI jobs requiring direct access to InfiniBand interconnects or specific GPU hardware.
Resources
- OpenStack documentation — official hub for all OpenStack projects
- DevStack — sets up a full OpenStack environment on a single machine for learning and testing
- Kolla-Ansible — deploys OpenStack services as containers for easier lifecycle management
- EGI Federated Cloud — federated OpenStack infrastructure available to European researchers
Lightweight and Cloud-Native Container Orchestrators
Kubernetes is not the only way to schedule and manage containers at scale. Several lighter-weight or cloud-managed alternatives exist, each occupying a different point on the spectrum between simplicity and power.
HashiCorp Nomad
Nomad is a general-purpose workload orchestrator developed by HashiCorp. Unlike Kubernetes, which is specifically designed for containers, Nomad can schedule Docker containers, standalone executables (binaries), Java JAR files, virtual machines (via QEMU), and batch jobs — all using the same scheduler and configuration language.
Nomad shares a drawback with Docker Swarm: neither is very popular. There are therefore fewer external resources available, fewer discussions about it, and fewer pre-packaged applications (Helm charts do not apply; Nomad uses its own job specification format, HCL). But it can still be a good solution for small teams that need to schedule a heterogeneous mix of workloads without the operational overhead of Kubernetes, while being well aware of these limitations.
Architecture. Nomad uses a client/server model. A small number of server nodes (three or five, for quorum) handle scheduling decisions. Client nodes run the actual workloads. There is no dedicated etcd cluster, no separate controller manager, no separate scheduler process: all scheduling logic lives in the Nomad server binary.
Key characteristics:
- Single binary deployment. The entire Nomad agent (server or client mode) is a single Go binary with no external runtime dependencies. This makes installation trivial compared to a full Kubernetes cluster.
- Multi-workload. Nomad schedules containers, batch jobs, system daemons, and raw executables in a uniform way. This is useful in scientific environments where some workloads are containerised and others are legacy binaries.
- Native integration with Vault and Consul. HashiCorp’s secret management tool (Vault) and service discovery tool (Consul) integrate directly with Nomad, providing a coherent platform for secrets injection and service mesh networking.
- Lighter operational footprint. A minimal Nomad cluster requires significantly fewer nodes and less memory than a comparable Kubernetes cluster with all its components.
Limitations compared to Kubernetes:
- Smaller ecosystem: fewer pre-packaged applications (Helm charts do not apply; Nomad uses its own job specification format HCL).
- Less sophisticated networking: traffic routing requires Consul or an external load balancer.
- Less mature support for stateful workloads (though Nomad CSI driver support is improving).
Nomad is a good fit when: you need to schedule a heterogeneous mix of containers and non-container workloads; your team is small and wants a simpler operational model than Kubernetes; or you are already using other HashiCorp tools (Terraform, Vault, Consul) and want a consistent toolchain. (Terraform can also be used with Kubernetes).
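To give a feel for Nomad's job specification format, here is a sketch of an HCL job running a containerised web service (the job, image and resource figures are hypothetical):

```hcl
# Sketch: a Nomad service job running two instances of a Docker container.
job "testapp" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2                      # two instances, placed by the Nomad scheduler

    network {
      port "http" { to = 8080 }    # map an allocated host port to container port 8080
    }

    task "server" {
      driver = "docker"
      config {
        image = "testapp:latest"   # hypothetical image
        ports = ["http"]
      }
      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
      }
    }
  }
}
```

The job is submitted with `nomad job run testapp.nomad.hcl`; like Kubernetes manifests, it is declarative, and Nomad reschedules the tasks if a client node fails.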
HashiCorp is a commercial entity, now part of IBM. So while their tools are free and “open-source”, there is always a risk that they could change their licensing model or discontinue free support for Nomad in the future. On the other hand, if there is a budget for it, HashiCorp offers enterprise support. And as it is a commercial product, their tools are generally well maintained and have good documentation.
Their licence is not fully open-source, but source-available under the Business Source License (BSL). This means that while the source code is available for inspection and modification, there are restrictions on how it can be used in production without a commercial license.
- Nomad documentation
- Nomad vs. Kubernetes comparison — official comparison from HashiCorp
Docker Swarm
Docker Swarm is the native clustering and orchestration mode built into the Docker Engine itself. A group of Docker hosts can be joined into a Swarm cluster, turning them into a pool of nodes that can run and scale containerised services collectively.
Architecture. Swarm uses a manager/worker node model. Manager nodes handle scheduling and cluster state (stored via the Raft consensus algorithm). Worker nodes run containers. Unlike Kubernetes, there are no separate etcd, API server, or controller manager processes — all Swarm logic is embedded in the Docker daemon.
Key characteristics:
- Zero additional installation. Any machine running Docker Engine can join a Swarm with a single `docker swarm init` or `docker swarm join` command. No separate binaries or configuration files are needed.
- Docker Compose compatibility. Docker Compose files (with minor additions for `deploy:` stanzas) can be deployed to Swarm as stacks directly, making it easy to move from a local development setup to a distributed deployment.
- Simplicity. Swarm intentionally covers the 80% use case - service scaling, rolling updates, health-based restarts, overlay networking, and secrets management - without the conceptual complexity of Kubernetes.
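As a sketch of the Compose compatibility (service and image names hypothetical), the only Swarm-specific part of a stack file is the `deploy:` stanza:

```yaml
# Sketch: a Compose file usable as a Swarm stack.
services:
  web:
    image: testapp:latest    # hypothetical image
    ports:
      - "80:8080"
    deploy:                  # honoured by Swarm, ignored by plain `docker compose up`
      replicas: 3
      update_config:
        parallelism: 1       # rolling update, one container at a time
        delay: 10s
      restart_policy:
        condition: on-failure
```

It would be deployed with `docker stack deploy -c docker-compose.yml mystack`, and Swarm then keeps 3 replicas running across the cluster.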
Current status. Docker Swarm is considered feature-stable rather than actively developed. Mirantis (which acquired Docker Enterprise in 2019) has committed to maintaining Swarm. For new projects, Docker’s own documentation now recommends Kubernetes for complex production workloads. Swarm remains a sensible, low-overhead choice for small research groups that are already familiar with Docker Compose and do not need the full Kubernetes feature set.
Limitations:
- Limited ecosystem compared to Kubernetes — no Helm, no Operators, no Custom Resource Definitions.
- No built-in support for auto-scaling based on metrics.
- Less active development; fewer new features.
Amazon ECS - Elastic Container Service
Amazon, Google and Microsoft (Azure) solutions are generally billed per use, so they can be more expensive than self-hosted solutions, especially if the workload is not well optimised. It is important to monitor the cost and optimise usage to avoid unexpected bills. The bill can grow very quickly if usage grows: it could be a simple issue with another service calling the container more often than expected, or a change in the code that makes the container run for a longer time. It is important to set up alerts to be notified when the cost grows above a certain threshold, or to avoid such solutions if the cost is not predictable or the budget is not flexible enough. Self-hosting might initially be more expensive, but it usually offers a fixed cost and can become cheaper in the long run. Generally, a well-managed self-hosted solution should be cheaper than a cloud solution.
Amazon ECS is AWS’s proprietary container orchestration service. You define tasks (one or more containers that run together, similar to a Kubernetes Pod) and services (long-running groups of tasks with load balancing and auto-scaling). ECS handles scheduling, health checks, and replacement of failed tasks.
Two launch modes:
- ECS on EC2 (Elastic Compute Cloud). You provision and manage a pool of EC2 virtual machines that act as ECS container hosts. You are responsible for the underlying instances (patching, scaling the instance fleet).
- ECS on Fargate. AWS manages the underlying infrastructure entirely. You specify CPU and memory requirements per task and pay per second of task runtime with no instances to manage. It is a serverless container platform within the AWS ecosystem (see below for more on serverless containers).
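In both modes, the unit of deployment is the task definition. A minimal Fargate-style sketch might look like the following (the family name, AWS account ID, region and image are all hypothetical):

```json
{
  "family": "my-analysis-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "essential": true
    }
  ]
}
```

A task definition like this is registered with `aws ecs register-task-definition` and then referenced by an ECS service, which keeps the desired number of tasks running behind a load balancer.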
If properly set up, ECS on EC2 should be cheaper than Fargate, but it requires more operational effort to manage the fleet of EC2 instances. Fargate is more expensive but offers a simpler, fully managed experience.
Key characteristics:
- Deep integration with the AWS ecosystem: IAM for per-task permissions, CloudWatch for logs and metrics, ALB/NLB for load balancing, ECR for container images, Secrets Manager for credentials.
- Simpler than Kubernetes for straightforward web application or data pipeline deployments on AWS.
- No control plane to operate, AWS manages the scheduler.
Limitations:
- Vendor lock-in: ECS task definitions, IAM roles, and networking are all AWS-specific. Migrating to another cloud or on-premises later requires significant rework.
- Less flexible than Kubernetes for complex multi-service applications with sophisticated networking or storage requirements.
- No equivalent to Kubernetes Operators or CRDs for extending the platform.
If your team is already comfortable with Kubernetes, AWS offers EKS (Elastic Kubernetes Service), which runs a managed Kubernetes control plane. ECS is simpler to start with; EKS gives portability and the full Kubernetes ecosystem. Both support Fargate for serverless compute.
Google Kubernetes Engine
Google Kubernetes Engine (GKE) is Google’s managed Kubernetes service. It provides a fully managed control plane and automates cluster provisioning, upgrades, scaling, and maintenance. GKE runs Kubernetes clusters on Google Cloud’s infrastructure, with deep integration into other Google Cloud services.
Serverless Container Platforms
“Serverless containers” describe managed platforms where you provide a container image and the platform handles all infrastructure with no VMs, no clusters, no nodes to manage. You pay only for actual compute time. The trade-off is reduced control and potential cold-start latency.
The term serverless can be confusing: servers still exist, but they are fully abstracted away from the user. From the user’s perspective, you push a container image and the platform runs it on demand.
Common platforms:
| Platform | Provider | Notes |
|---|---|---|
| AWS Fargate | Amazon | Works with ECS or EKS; per-second billing |
| Google Cloud Run | Google | HTTP-triggered; scales to zero; see below |
| Azure Container Instances | Microsoft | Simple per-container billing; no orchestration layer |
| Azure Container Apps | Microsoft | Higher-level; built on Kubernetes + KEDA for event-driven scaling |
| Fly.io | Independent | Developer-friendly; global edge deployment |
| Render | Independent | Simple deployment from Git; Docker or native buildpacks |
Scientific use cases:
- On-demand analysis endpoints: Deploy a container that runs a computationally expensive analysis when an HTTP request arrives and shuts down immediately afterwards. Costs nothing when idle.
- Batch processing pipelines: Trigger container runs from object storage events (a new file uploaded triggers a processing job).
- Reproducible environments: Share a container image with reviewers so they can reproduce an analysis without installing software.
Limitations to consider:
- Maximum execution time limits (e.g., Cloud Run has a 60-minute request timeout; some platforms impose shorter limits).
- Stateless by design — persistent storage must be external (object storage, managed database).
- Cold starts: the first request after a period of inactivity may take several seconds while the container image is pulled and started.
- Cost unpredictability for high-traffic or long-running workloads.
Google Cloud Run
Google Cloud Run is Google’s fully managed serverless container platform. It is one of the most mature and widely used serverless container services, and is worth examining in detail as a representative example of the category.
How it works. You package your application as a Docker/OCI container image, push it to a container registry (Google Artifact Registry, or any public registry), and deploy it with a single command or through the Cloud Console. Cloud Run starts one or more container instances to handle incoming HTTP/gRPC (a remote procedure call framework from Google that allows, as the name implies, calling remote functions) requests and scales the number of instances automatically, including scaling to zero when there is no traffic.
# Build and push a container, then deploy to Cloud Run
gcloud builds submit --tag gcr.io/MY_PROJECT/my-app
gcloud run deploy my-app \
--image gcr.io/MY_PROJECT/my-app \
--platform managed \
--region europe-west1 \
--allow-unauthenticated

Key features:
- Scale to zero. When no requests are incoming, Cloud Run runs no instances and incurs no compute cost. This makes it economical for sporadic or unpredictable workloads typical in research.
- Concurrency. A single Cloud Run instance can handle multiple simultaneous requests (configurable up to 1000), which reduces cold starts compared to function-as-a-service platforms.
- Any language, any framework. Because you bring your own container image, there are no language restrictions. Python, R, Java, or any compiled binary can be deployed.
- Cloud Run Jobs. In addition to HTTP services, Cloud Run Jobs run containerised batch tasks to completion — useful for data processing pipelines, periodic reports, or model training jobs.
- VPC connectivity. Cloud Run services can connect to private VPC networks, enabling access to Cloud SQL, Memorystore, or on-premises systems via VPN or Interconnect.
Integration with Google Cloud:
- BigQuery, Cloud Storage, Pub/Sub — data can flow in and out through standard Google Cloud services.
- Cloud Scheduler — trigger Cloud Run Jobs on a cron schedule.
- Eventarc — trigger services from Cloud Storage events, Pub/Sub messages (publish–subscribe: publishers send messages asynchronously to subscribed services, without being coupled to them), or Audit Logs.
- Secret Manager — inject secrets as environment variables at runtime without embedding them in container images.
Limitations:
- Maximum request timeout of 60 minutes (sufficient for many analysis tasks; longer jobs should use Cloud Run Jobs or Cloud Batch).
- In-memory storage only during execution; persistent state must use Cloud Storage or a database.
- Compute is limited to CPU and memory per instance; GPU support is available in preview but not generally available in all regions.
- Full vendor lock-in to Google Cloud APIs and billing.
Cloud Run is ideal when you want to run containers without managing infrastructure. Google Kubernetes Engine (GKE) is better when you need full Kubernetes capabilities: stateful workloads, custom networking, complex multi-service applications, or specific hardware (GPUs, TPUs) that require persistent node pools. Cloud Run and GKE can coexist in the same project and share the same container registry.
- Google Cloud Run documentation
- Cloud Run Jobs documentation
- Cloud Run pricing — first 2 million requests per month are free
Comparison of lightweight and cloud-native orchestrators
| Platform | Managed by | Infrastructure | Vendor lock-in | Complexity | Best for |
|---|---|---|---|---|---|
| HashiCorp Nomad | Self-hosted | Your servers | None | Low–medium | Heterogeneous workloads, HashiCorp stack |
| Docker Swarm | Self-hosted | Your servers | None | Low | Small teams already using Docker Compose |
| Amazon ECS | AWS | AWS (EC2 or Fargate) | High (AWS) | Low–medium | AWS-native applications |
| Google Cloud Run | Google | Fully managed | High (GCP) | Very low | HTTP services, sporadic workloads |
| Azure Container Apps | Microsoft | Fully managed | High (Azure) | Low | Event-driven microservices on Azure |
Apache Software Foundation Solutions
The Apache Software Foundation (ASF) hosts hundreds of open-source projects. For data management and scientific computing infrastructure, several Apache projects are particularly relevant. They are mature, widely deployed, and have large communities — making them safe long-term choices for research data platforms.
Apache HTTP Server
The Apache HTTP Server (httpd) is one of the most widely deployed web servers in the world. It serves static files and acts as a reverse proxy in front of application servers.
- Virtual hosting. Multiple websites or applications can be served from one server using different domain names or paths.
- Modules. Functionality is extended through modules: `mod_proxy` for reverse proxying, `mod_ssl` for HTTPS, `mod_rewrite` for URL manipulation, `mod_auth` for authentication.
- `.htaccess` files. Per-directory configuration allows fine-grained access control without restarting the server, which is useful in multi-user research environments.
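As a sketch of the reverse-proxying module in use (the domain and backend port are hypothetical), a virtual host forwarding all requests to a local application server could look like:

```apache
# Sketch: Apache as a reverse proxy for an application on port 3000.
# Requires mod_proxy and mod_proxy_http to be enabled.
<VirtualHost *:80>
    ServerName myapp.example.org

    ProxyPreserveHost On
    ProxyPass        "/" "http://127.0.0.1:3000/"
    ProxyPassReverse "/" "http://127.0.0.1:3000/"
</VirtualHost>
```

`ProxyPassReverse` rewrites redirect headers coming back from the application so that clients never see the internal address.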
For high-concurrency workloads (many simultaneous connections), Nginx is often preferred due to its event-driven architecture. Apache’s process/thread model is simpler to configure and better supported by legacy applications. Both can coexist: Nginx at the front end proxies requests back to Apache.
Apache Tomcat
Apache Tomcat is a Java Servlet container and web server. It runs Java web applications packaged as WAR (Web Application Archive) files. Many scientific data management platforms — including OMERO, certain LIMS systems, and institutional repository software — are Java-based and require Tomcat.
- Implements the Java Servlet and JavaServer Pages (JSP) specifications.
- Can be placed behind Apache HTTP Server (via `mod_proxy_ajp`) or Nginx for SSL termination and load balancing.
- The Manager web application allows deploying and undeploying WAR files via a browser interface.
Apache Kafka
Apache Kafka is a distributed event streaming platform. It acts as a high-throughput, fault-tolerant message broker: producers publish data events (messages) to named topics, and consumers read them independently and at their own pace. Kafka retains messages for a configurable period, so consumers can replay historical data.
Scientific use cases:
- Ingesting high-frequency sensor or instrument data in real time.
- Decoupling data producers (instruments, simulations) from data consumers (databases, analysis pipelines).
- Building audit logs where every data change is recorded as an immutable event.
- Streaming data simultaneously to multiple downstream systems.
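These properties follow from Kafka's core abstraction: an append-only log per topic, with each consumer tracking its own offset. A toy in-memory sketch in Python (purely conceptual, not a client for a real broker, which would use a library such as confluent-kafka):

```python
# Toy model of Kafka's topic/offset abstraction. Messages are retained in
# the log rather than deleted on read, so consumers can replay history.

class Topic:
    def __init__(self):
        self.log = []                       # append-only message log

    def produce(self, message):
        self.log.append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                     # independent position per consumer

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def replay_from(self, offset):
        self.offset = offset                # replaying history = rewinding

sensor = Topic()
db_writer, analyser = Consumer(sensor), Consumer(sensor)
sensor.produce({"t": 0, "temp": 21.3})
sensor.produce({"t": 1, "temp": 21.5})
assert db_writer.poll() == analyser.poll()  # each consumer sees all messages
```

A real Kafka cluster adds partitioning, replication, and persistence on top of this model, but producers, topics, offsets, and independent consumers work exactly as sketched.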
Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It extends the MapReduce paradigm with in-memory computation, making iterative algorithms (common in machine learning and simulation analysis) far faster than disk-based approaches.
Core capabilities:
- Spark SQL — query structured data using SQL or a DataFrame API (available in Python, R, Scala, and Java).
- Structured Streaming — process data streams from Kafka or other sources in near-real time.
- MLlib — distributed machine learning for classification, regression, clustering, and collaborative filtering.
- GraphX — distributed graph computation for network analysis.
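As a rough illustration of the MapReduce paradigm that Spark extends, here is a word count in plain Python. This is conceptual only; real Spark code would express the same two phases with the RDD or DataFrame API and run them distributed across a cluster:

```python
# Word count in the MapReduce style that Spark generalises: a map phase
# that processes each record independently, and a reduce phase that
# merges results per key. Input data is made up for the example.
from collections import Counter

lines = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: each line is independently turned into (word, 1) pairs.
# In a cluster, lines would be spread over many workers.
mapped = [[(word, 1) for word in line.split()] for line in lines]

# Reduce phase: counts for the same key are merged across partitions.
counts = Counter()
for pairs in mapped:
    for word, n in pairs:
        counts[word] += n

print(counts["the"])   # 3
```

Spark's advantage over classic disk-based MapReduce is keeping intermediate results like `mapped` in memory, which is why iterative algorithms (machine learning, graph analysis) run much faster.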
Apache Airflow
Apache Airflow is a workflow orchestration platform for scheduling and monitoring data pipelines. Workflows are defined as Python code in the form of Directed Acyclic Graphs (DAGs), where each node represents a task (e.g., download data, run analysis, upload results) and edges define dependencies between tasks.
Scientific use cases:
- Automating recurring data processing pipelines (nightly ingestion, weekly reports).
- Orchestrating multi-step bioinformatics or image analysis workflows.
- Managing dependencies between heterogeneous tasks: bash scripts, Python functions, Kubernetes pods, SQL queries.
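The DAG concept can be sketched without Airflow itself, using the standard library's `graphlib`. This is a toy runner with made-up task names; a real Airflow DAG declares the same structure with operators and `>>` dependencies:

```python
# Toy DAG runner: tasks plus dependencies, executed in topological order,
# i.e. a task only runs once everything it depends on has run.
from graphlib import TopologicalSorter

ran = []
tasks = {
    "download": lambda: ran.append("download"),
    "analyse":  lambda: ran.append("analyse"),
    "upload":   lambda: ran.append("upload"),
}
# Map each task to the set of tasks it depends on (the DAG's edges).
deps = {"analyse": {"download"}, "upload": {"analyse"}}

for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)   # ['download', 'analyse', 'upload']
```

Airflow adds scheduling, retries, logging, and a web UI on top, but dependency resolution over a DAG is the heart of it.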
Bioinformaticians often use Nextflow or Snakemake because they natively understand file-based workflows and integrate with HPC schedulers (SLURM, PBS). Airflow is more general-purpose and integrates better with cloud services and databases. Both approaches can coexist in one organisation.
Apache Solr
Apache Solr is an enterprise search platform built on Apache Lucene. It provides full-text search, faceted search, filtering, and ranked retrieval over structured and unstructured data. It is useful for building search interfaces over scientific datasets, publications, or metadata catalogues.
- Supports JSON, XML, and CSV indexing.
- Runs standalone or in SolrCloud mode for distributed, fault-tolerant search.
- Used by DSpace (institutional repositories) and CKAN (open data portals).
Apache Arrow and Parquet
Apache Arrow defines a language-agnostic in-memory columnar data format. Libraries for Python, R, Java, C++, and others can share data through Arrow buffers without copying or serialisation — enabling zero-copy interoperability between Pandas, Spark, DuckDB, and other tools.
Apache Parquet is a columnar storage file format optimised for analytics. Storing data column-by-column (rather than row-by-row as in CSV) allows efficient compression and selective column reads — you read only the columns needed for a query without scanning the whole file.
- Widely supported: Spark, Pandas, DuckDB, BigQuery, Redshift, and Athena all read Parquet natively.
- Strongly typed schemas with nested data support.
- Snappy and Zstandard compression significantly reduce storage footprint compared to CSV.
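The columnar idea can be sketched in plain Python. With the real format you would use a library such as pyarrow (e.g. `pyarrow.parquet.read_table("data.parquet", columns=["temperature"])` to read a single column); the sample data below is made up:

```python
# Row-oriented vs column-oriented layout, in plain Python.
# Parquet stores data column-by-column on disk, so a query touching one
# column reads only that column's bytes.

rows = [  # row-oriented (CSV-like): every read touches every field
    {"sample": "A1", "temperature": 21.3, "ph": 7.1},
    {"sample": "A2", "temperature": 21.5, "ph": 7.0},
    {"sample": "A3", "temperature": 21.4, "ph": 7.2},
]

# Column-oriented: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query needing one column reads only that column's data, and runs of
# same-typed values compress far better than interleaved rows.
mean_temp = sum(columns["temperature"]) / len(columns["temperature"])
print(round(mean_temp, 2))   # 21.4
```

This is also why Arrow's in-memory columnar buffers allow zero-copy sharing: every tool agrees on the same column layout, so no per-row conversion is needed.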
Apache project summary
| Project | Domain | Key use case in research |
|---|---|---|
| HTTP Server | Web serving | Serving applications, reverse proxy, access control |
| Tomcat | Java applications | Hosting Java-based data management platforms |
| Kafka | Event streaming | Real-time data ingestion from instruments |
| Spark | Big data analytics | Large-scale data processing and machine learning |
| Airflow | Workflow orchestration | Scheduling and monitoring data pipelines |
| Solr | Search | Full-text search over datasets and metadata catalogues |
| Arrow | Data interchange | Zero-copy data sharing between analysis tools |
| Parquet | File format | Efficient columnar storage for analytical datasets |
Scalability
Scalability is the capability to adapt to growing needs: more simultaneous users, more data to store, more processing power, larger files, more network traffic. A scalable solution is generally not needed at the beginning, but it is good to know what scalability means and how to achieve it, as having to change your solution later can be a problem. It has strong consequences on the technical side, but also on costs.
Vertical scalability means adapting by using more powerful resources (for instance a bigger server, with more CPU, RAM, or storage). It is generally the simplest approach, but it has a limit: there is only so much you can add to a single machine or virtual machine. An easy first step is to separate what can be separated onto different machines, for instance running the database on a different machine than the web application. Even when both run on the same machine, keeping them separated makes it easy to later move the database to a more powerful machine, and improves security by isolating the database from the web application.
Horizontal scalability means adapting by using more resources (more servers, virtual machines, or containers). It is not always possible, depending on the Data Management Platform. The simple solution is a load balancer distributing traffic between several instances of the application, but it gets more complex if the instances need to share state (for instance a shared database or file system). A simple example is several users editing the same element: if one user starts working on an element and another edits it at the same time, the changes of whoever saves first are lost when the second user saves. To avoid this, the application must handle the situation, for instance by locking the element while a user works on it, or by merging the changes made by different users. This is not always possible and can be a strong limitation for horizontal scalability.
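The lost-update scenario, and one common fix (optimistic locking via a version number), can be sketched in Python. Everything here is a toy illustration with made-up names, not code from any particular platform:

```python
# Optimistic locking sketch: each record carries a version number, and a
# save is rejected if the record changed since it was read.

class ConflictError(Exception):
    pass

store = {"element-1": {"version": 1, "text": "original"}}

def read(key):
    rec = store[key]
    return rec["version"], rec["text"]

def save(key, based_on_version, new_text):
    rec = store[key]
    if rec["version"] != based_on_version:
        raise ConflictError("record changed since it was read; re-read and retry")
    rec["version"] += 1
    rec["text"] = new_text

v_alice, _ = read("element-1")              # both users read version 1
v_bob, _ = read("element-1")
save("element-1", v_alice, "Alice's edit")  # succeeds, version becomes 2
try:
    save("element-1", v_bob, "Bob's edit")  # rejected instead of silently lost
except ConflictError as err:
    print("conflict:", err)
```

Relational databases offer the pessimistic variant (`SELECT ... FOR UPDATE`) as well; either way, the application has to be written with concurrent editing in mind, which is what limits naive horizontal scaling.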
Kubernetes is designed for horizontal scalability, but it is not the only solution.
Databases
Most Data Management Platforms rely on a database to store their data. A basic knowledge of databases in general, and of the one used by your platform in particular, is useful for troubleshooting, for specific operations (for instance a bulk update), and for backup and restore. The most used databases are relational databases, such as MySQL, PostgreSQL, MariaDB, Oracle Database, Microsoft SQL Server, and SQLite. They rely on a schema, with tables and relations between them. All operations are row-based, and querying is done by matching rows against other rows and/or values. SQL (Structured Query Language) is the standard language for relational databases, used for querying, inserting, updating, and deleting data. SQL is relatively straightforward, and if you are not already familiar with it, it is probably a good idea to spend some time learning it; there are many online resources, such as SQLZoo, the W3Schools SQL Tutorial, and the Codecademy SQL Course.
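The basics are easy to try with Python's built-in `sqlite3` module (an in-memory database, so there is nothing to install or clean up); the table and values below are invented for the example:

```python
# Minimal SQL session using Python's built-in sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, ph REAL)")
con.executemany("INSERT INTO samples (name, ph) VALUES (?, ?)",
                [("A1", 7.1), ("A2", 6.8), ("A3", 7.4)])

# Querying: match rows against a condition.
rows = con.execute("SELECT name FROM samples WHERE ph > 7.0 ORDER BY name").fetchall()
print(rows)   # [('A1',), ('A3',)]

# A bulk update of the kind mentioned above.
con.execute("UPDATE samples SET ph = ph + 0.1 WHERE name = 'A2'")
```

The same `CREATE`/`INSERT`/`SELECT`/`UPDATE` statements work, with minor dialect differences, on PostgreSQL, MySQL, and the other relational databases listed above.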
There are also non-relational (NoSQL) databases, such as MongoDB, Cassandra, Redis, and Elasticsearch, which trade the relational schema for other data models (documents, key-value pairs, search indexes).
Security
Security in Data Management is an important topic, as the data is often sensitive and valuable. It is important to have a basic knowledge of security principles and practices, and to apply them to your Data Management Platform. This includes:

- Keeping the software up-to-date with security patches.
- Using strong passwords. It is better to have strong passwords that do not need to be changed regularly: forced regular changes can lead to bad practices such as writing passwords down or choosing weak ones.
- Limiting access to the data and the application to only those who need it.
- Where appropriate, using encryption for data at rest and in transit.
- Regularly backing up the data and testing the restore process.
- Monitoring the system for suspicious activity and responding to incidents promptly.
- Ensuring that connections are secure, using HTTPS for web applications and SSH for remote access. If the internal network is secure and only the necessary ports and IP addresses can reach the application, HTTP can be acceptable for internal communication. HTTPS needs a proper certificate, which can be obtained for free from Let's Encrypt, a nonprofit certificate authority; tools such as Certbot help with obtaining and renewing these certificates.

It is also important to be aware of the specific security risks associated with the technologies you are using, such as containers, virtual machines, or cloud services, and to take appropriate measures to mitigate those risks.
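As a small, concrete example of the password advice, Python's standard-library `secrets` module generates cryptographically strong random values (unlike the `random` module, which must never be used for secrets); the alphabet and length are just reasonable defaults:

```python
# Generating a strong random password with the stdlib `secrets` module.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 24) -> str:
    """Return a random password drawn from a large alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(len(generate_password()))   # 24
```

For accounts managed by humans, a password manager generating and storing such passwords is the practical way to follow this advice.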
We will explore these topics in more details in the Backups and Security page.
Going to an assembly
An assembly is when several Data Management Platforms are interconnected to provide a more complete solution. For instance, a Data Management Platform for a research group might need to be connected to a storage solution, a computing cluster, a data visualisation tool, and a data analysis tool. A basic knowledge of the different components of the assembly, and of how they work together, is useful for troubleshooting and for operations that span several components.
The components are generally connected via APIs, either at a programmatic level (for instance a Python API) or at a higher level (for instance a REST API). Knowing how to use these APIs helps with troubleshooting and with specific operations such as bulk updates or backup and restore. It is also important to be aware of the security implications of exposing APIs, and to take appropriate measures to secure them.
The different components of the assembly might have different requirements in terms of scalability, security, and maintenance, and it is important to take these into account when designing the assembly. For instance, if one component needs to be highly available, it might be a good idea to use a load balancer and multiple instances of that component, while if another component is only used for occasional tasks, running it on a single machine might be sufficient. There can also be strong limitations, for instance if one application stores sensitive data encrypted and shares it with another application that cannot work with encrypted data. In that case, the data might need to be stripped of its sensitive parts before being shared, and/or anonymised or pseudonymised. You can read more in the Data Anonymisation and Pseudonymisation page.
Automate the setup
When you need to repeat a setup, or to set up a complex application, it is generally better to automate the setup rather than doing it by hand. This can be done with tools such as Puppet, Ansible, and Terraform, which let you describe the desired state of your system in a declarative way and then apply that state to your machines.
Puppet, Ansible & Terraform
Their official online documentation is good and complete. Online tutorials should be enough as a complement; books are still recommended for a quicker start.
When you have more than a handful of servers or services to manage, doing everything by hand stops being practical. Tools like Puppet, Ansible, and Terraform exist to automate that work.
Puppet was one of the first widely adopted tools for managing server configuration at scale, and many large organizations still use it. You describe the desired state of your systems in Puppet’s own language, and a Puppet agent running on each machine enforces that state continuously. It works well for large, stable fleets but it requires installing and maintaining that agent on every machine you manage, setting up a central Puppet server, and learning a configuration language that has a steep initial curve.
Ansible emerged as a simpler alternative for configuration management. It does much of what Puppet does (installing packages, writing config files, managing services), but it works over plain SSH with no agent to install. The desired state of a system is described in playbooks written in YAML, a format most developers already know. The tradeoff is that Ansible does not continuously enforce state the way Puppet does: you run it when you want changes applied, rather than having it run automatically in the background. For smaller setups or teams without dedicated ops staff, this is often the right balance.
Example playbook:

```yaml
- hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
```

Running this playbook installs nginx on all hosts in the webservers group.
Ansible can be used to set up a bare-metal installation, but also to complement a Docker Compose setup, typically for configuring the application.
Terraform solves a different problem: not configuring machines, but creating the infrastructure in the first place. It talks to cloud providers (or your local virtualisation layer) and sets up virtual machines, networks, storage, DNS entries, and so on. You describe what you want, and Terraform tracks what already exists so it only makes the changes needed. Puppet and Ansible assume the machines are already there; Terraform is what creates them.
Terraform is generally supported by most Virtualisation platforms: OpenStack, Proxmox, Kubernetes, …
However, in most cases Terraform will be used by your SysAdmin. A minimal example:

```hcl
resource "aws_instance" "example" {
  ami           = "ami-123456"
  instance_type = "t2.micro"
}
```
Puppet and Ansible are open source, while Terraform is a HashiCorp product; since its 2023 move to the Business Source License, the open-source fork OpenTofu provides a drop-in alternative.
CI/CD and GitOps
CI/CD is an important aspect of modern online platforms when you develop them. It helps ensure your deployed application is up-to-date (Continuous Deployment) and free of issues (Continuous Integration).
CI/CD stands for Continuous Integration / Continuous Delivery (or Deployment). The idea is simple: every time you push code or config changes to a repository, an automated pipeline picks them up, tests them, and - if everything looks good - deploys them. No more manually copying files to servers or running scripts by hand.
GitOps takes this further by making Git the single source of truth for your entire system state. Your infrastructure configuration lives in a repo, and a tool running in your cluster constantly watches that repo. When you push a change, the tool detects it and applies it automatically. If someone makes a manual change on the server that drifts from what’s in Git, the tool corrects it. Your Git history becomes a full audit trail of every change ever made.
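The reconciliation loop at the heart of a GitOps tool can be sketched in Python. This is purely conceptual, with illustrative resource names and dict shapes, not any tool's actual API:

```python
# Conceptual GitOps reconciliation: the desired state comes from Git, the
# live state from the cluster, and the controller computes the actions
# needed to converge one towards the other.

def reconcile(desired: dict, live: dict) -> list:
    """Return the actions needed to make `live` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(f"create {name}")
        elif live[name] != spec:
            actions.append(f"update {name}")   # also corrects manual drift
    for name in live:
        if name not in desired:
            actions.append(f"delete {name}")
    return sorted(actions)

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}   # from Git
live = {"web": {"replicas": 2}, "cache": {"replicas": 1}}   # from the cluster
print(reconcile(desired, live))   # ['create db', 'delete cache', 'update web']
```

Running this loop continuously is what turns Git into the single source of truth: any drift, whether a missing resource or a manual tweak, shows up as an action on the next pass.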
Flux CD is a popular GitOps tool for Kubernetes environments. It runs inside your cluster and syncs it to one or more Git repositories, handling updates to apps and infrastructure alike.
For CI/CD pipelines, the most common options are:
- GitHub Actions: built into GitHub, easy to get started with, large ecosystem of pre-built actions
- GitLab CI: similar, built into GitLab
- Forgejo/Gitea Actions: self-hostable alternative compatible with GitHub Actions syntax
- Jenkins: older, very flexible, but more complex to set up
A minimal GitHub Actions example
Here’s a simple workflow that automatically deploys your Quarto site to GitHub Pages whenever you push to main:
```yaml
# .github/workflows/deploy.yml
name: Deploy site
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
      - name: Render site
        run: quarto render
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_site
```

This file lives in your repo at `.github/workflows/deploy.yml`. GitHub picks it up automatically, no extra configuration needed. Every push to main triggers a fresh build and deploy.
Many choices: tools, solutions and platforms
Aside from the obvious choices, which are generally for the most complex problems (databases, orchestration, virtual machines, LDAP) or accepted as de facto standards (Docker, Kubernetes, S3), there are many other tools, solutions, and platforms:

- tools may help with managing some platforms, like the full ecosystem around Kubernetes and containers (visualising, managing, ...), or do something independent (analysing, transforming, ...),
- solutions and platforms may be alternatives to the most well-known ones, for instance a simpler orchestration platform than Kubernetes, or an already set-up, partially managed cluster for Proxmox or Kubernetes.

They can all bring benefits but can also cause problems:

- solutions such as managed clusters are most of the time commercial. There might be a free tier, but with limitations: using such a solution might lock you in, and a newly needed functionality might fall outside the free tier,
- alternative platforms might also be commercial, with the same issues as above. An open-source platform might stop being maintained if there is not enough support behind it. Even without that risk, a platform that is not widely used can have more bugs (simply because nobody has found them yet) or lack important functionalities (because nobody asked for them). Note that if a platform is a very good match for your institution or consortium and you can commit to it long term, it can be worth adopting it and joining its development community,
- tools need to be carefully checked, especially if they are to be installed locally. They can be a great help (such as k9s for visualising a Kubernetes cluster), so they should not be ignored.

But one of the big issues these choices create is the choosing itself. Looking for a tool for a specific problem might return a discussion where several are debated; searching for one of those might then surface another discussion proposing a "better" alternative. Here it is beneficial to prefer highly popular items: a presence on Wikipedia, many stars on GitHub, a known source (a big company, an open-source foundation such as Apache). Keep the less popular items for specific needs, and handle them with care.
The use of AI and “Vibe coding” will probably make these problems even worse.
Some companies, notably Google, create whole new languages for some new products and/or frameworks. That is a fundamental issue, as consolidation is a key aspect of security: a new language might have a critical flaw that impacts every product written with it.
The new "cool kid" might also be quickly adopted by some developers, and the resulting applications have a great chance of being fragile, both from the lack of experience and from the immaturity of the language or framework. The benefit must therefore be carefully considered before adopting such a language, product, or application.
Depending on the usage, it might still be beneficial to adopt a brand new product, if:
- there is no security risk, for instance it only performs some transformation on public data,
- it cannot break the Data Management Platform: a database, for instance, is critical for most platforms and should probably always be one of the well-known ones (Postgres, MySQL, MariaDB, Oracle, SQLite, Ingres…),
- the gain is major: for example, the new product is significantly faster, very simple to use for a usually complex task, or offers a new functionality with no equivalent.