De.KCD - Data Management Platform on the Cloud, step by step
Process and documentation for adapting a Docker-based Data Management platform, Seek4Science, to Kubernetes, and running it in the cloud.
Goals and scope
This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while weighing the benefits and costs: how much effort, how much complexity, which security risks, and what is gained. It does not replace a detailed explanation of each topic, but aims to save you time by (1) helping you decide which solution to adopt, (2) giving you enough understanding of each solution to navigate its documentation efficiently, and (3) providing working, explained solutions that can be used as-is or as a base for your own solution.
It goes from a simple local setup to a full Kubernetes-based set-up with distributed data, with some side documentation on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging, and considerations for connecting applications in the cloud.
We try to give a clear view of the cost and benefit of each solution, so it is easy to get a rough idea of which solution is best. This is quickly summarized and emphasized in each section.
We will also compile advice and tips, and we welcome all contributions (and corrections).
Finally, it is targeted mostly at Research Projects and/or Institutions, and therefore focuses on the aspects particular to these cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry.
How to use
Each section assumes that you know the preceding ones. If you already know a section's topic, feel free to jump to the next one.
When to…
When to go full cloud, or with a containerised solution, for your data management platforms: we provide below a very simplified answer, and we are working on a detailed decision tree.
Quick introduction to Linux/Unix
Online documentation and tutorials are enough for the basics. A book is strongly recommended for advanced topics.
Learning difficulty ranges from easy (the basics) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate, as you should have some knowledge of security concerns.
- Check external introduction
- Check Galaxy/NFDI training for deeper knowledge
For most set-ups, Linux is the operating system of choice. Due to security concerns, a minimal knowledge of it is probably a must in all cases, from bare-metal setup to cloud-based installation, though the useful set of skills will vary.
We list below the commands you should know to survive, with their most important options (note that some of these commands now run on Windows in the Terminal shell), followed by the important folders, user and group access on Unix, private and public keys, and SSL/SSH. This part is more a checklist of things you should know for the following topics, and we recommend learning them before continuing. Wikipedia is a good starting point, and many good tutorials are easy to find online. Books are needed only for a deep understanding, but are also strongly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternatives, but all Unix systems are very similar in their core usage and structure. The package managers will differ between Linux families: dpkg with apt/apt-get on Debian-based distributions (including Ubuntu), rpm with yum (now dnf) on RedHat-based distributions, apk on Alpine Linux, as well as sandboxed or universal package managers like Flatpak and Snap, to list the main ones.
Viewing a web page in a Linux shell
One very useful capability while setting up a web application is to access it directly. You might have access to a graphical terminal (see the next topic), but it is also possible to access most web applications using a command-line browser. Some fancy web pages won't be displayed in any useful way, but a Data Management Platform is rarely in that category.
A text-only browser is also a good way to check whether a web site is accessible, notably for visually impaired persons: all important content should be available as plain text, and navigation should still be possible.
lynx, the best friend of old server-side developers and the oldest web browser still maintained, with an easy-to-remember name, is a quick way to check a web site.
Open a page:
lynx http://localhost:8080

Navigation:
- Arrow keys: move between links
- Enter: follow link
- G: open a new URL
Quit:
- Q, then Y to confirm
w3m — slightly more modern than lynx, with optional image support in some terminals.
Open a page:
w3m http://localhost:8080

Navigation:
- Arrow keys: move
- Enter: follow link
- U: open a new URL
Quit:
- Q
links / links2 — similar to lynx, with better table rendering and optional graphics support.
Open a page:
links http://localhost:8080

- Esc: open menu
Quit:
- Q
Firewall, SSH and VPN
Export display with X11
The X Window System was conceived as a network system, so an application using X11 for its display does not have to be displayed locally. This is handy mostly for applications that only have a Graphical User Interface, or that are easier to manage through their GUI. The communication is ensured by SSH tunneling, and thus encrypted.
The remote host must be allowed on the local display, which might create a security risk if done permissively (xhost + instead of relying on .Xauthority).
It is seldom needed, but very useful to know it is possible for some very specific needs.
Linux with Wayland: Waypipe
If the same functionality is needed for a distribution using the newer Wayland display server, it is possible using Waypipe.
Remote desktop
A more common way to get a remote graphical display is via remote desktop applications. They are purposely written for such usage and thus should offer a secure and simple solution.
Due to security concerns, remote desktop (as well as remote X11) might not be usable: generally, a properly configured firewall will block almost all ports from external access.
If SSH access is allowed, a remote desktop relying on SSH tunneling will be possible (as with X11).
Otherwise, a VPN connection might be needed first.
It is always better to accept an extra constraint (using the VPN in this case) than to potentially add a vulnerability (see attack surface in the Backups and Security page).
Common tools:
RDP
- Server: xrdp
- Clients: Remmina, Windows Remote Desktop
- Works well over networks
VNC
- Servers: tigervnc, tightvnc
- Clients: vncviewer, Remmina
- Simple and widely supported
Wayland-friendly options
- GNOME Remote Desktop (RDP)
- KDE Remote Desktop
- weston (the Wayland reference compositor)
Which option should I use?
I just want to check that a web service is running -> Use a text-based browser (lynx, w3m, or links). Fast, simple, works everywhere.
I need a feature that exists only in a graphical interface -> Try export display with X11 (if available). Useful for configuration tools or admin interfaces.
If too difficult to set-up or not available (on newer Linux or other OS) -> Use a remote desktop.
I am not sure which option to choose -> Use a remote desktop. It is the most reliable and behaves like a normal desktop.
Main web-servers
Most web applications run as a CMS (Content Management System): a program running on a server that generates the web pages depending on the user's actions, using data stored on the server side (not necessarily on the same server). This program either serves the web content by itself, using the facilities of its language (for Python, Ruby, Perl, …), or runs within an application server (like Apache Tomcat for Java).
But very often the web application, especially in production, will be accessed through a web server. The web server takes care of accepting secure connections (HTTPS) and forwarding requests to the web application(s). It offers several benefits:
- an extra layer of security,
- caching of static content (such as images),
- possibly load-balancing between several instances of the web application,
- one central access point for several web applications, even if they are not homogeneous.
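As an illustration, a minimal HTTPS reverse-proxy configuration for Nginx might look like the following sketch; the host name, certificate paths and backend port are placeholder values, not taken from any real deployment:

```nginx
# Minimal HTTPS reverse proxy (all names and paths below are example values)
server {
    listen 443 ssl;
    server_name app.example.org;

    ssl_certificate     /etc/ssl/certs/app.pem;
    ssl_certificate_key /etc/ssl/private/app.key;

    location / {
        # Forward every request to the web application running locally
        proxy_pass http://127.0.0.1:3000;
        proxy_set_header Host $host;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto https;
    }
}
```

The X-Forwarded-* headers let the application know the original client address and protocol, since it only ever sees connections coming from the proxy.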
The main web servers are:
- Nginx,
- Apache,
- Internet Information Services (IIS, Windows only, proprietary).
Another web server gaining traction is Caddy, which tries to integrate more aspects (like obtaining an SSL certificate) while being easier to use. As always, the gain has to be weighed against the risk, as its adoption and ecosystem are smaller.
IIS is rarely used for research projects but can be the best choice if the application is built on Microsoft technologies.
Commands that should be known
ls → list the content of a directory
ls -al → include all files (including hidden files, starting with .) and show a detailed list
cd → go to a directory
- cd .. → parent directory
- cd / → root directory
- cd ~ → user home directory
ps → list running processes ps -edf or ps -aux → list running processes and their owner
top → show running processes and their memory/CPU usage (CTRL-C to exit)
cp / mv / mkdir / rm → copy, move/rename, create directories, remove files or directories
rm -r → recursive removal
rm -rf → dangerous, use with extreme caution: the f means force (no confirmation asked), so combined with -r it removes everything without asking.
more / less / cat → display file content; less is usually preferred (scrolling, searching)
tail, tail -f → show the end of a file, -f to follow it live (very useful for logs)
vi / vim / emacs / nano → most of the time we work on servers/containers with a text terminal, where being able to edit a file is often needed. vi, nano and Emacs are powerful text-based editors present on most Linux distributions, though the lightest distributions used for container images might ship only a minimal editor (such as plain vi instead of the extended vim or Emacs).
Even if vi might seem very hard to use and strange at first glance, the basics are quick to learn and very convenient.
Emacs has a very different approach but is much more powerful, which can be useful for more complex work (if needed).
If you have never used a terminal editor, search for a short "vi basics" or "nano basics" introduction.
The absolute minimum to know about vi/vim is that you press "i" to edit the text, then Escape to return to "command mode". In command mode, :wq saves and quits (write and quit), :q quits if there are no changes, and :q! quits without saving the changes.
These editors are also used for editing the crontab (crontab -e), with the editor chosen at first use (it can be changed later). The crontab determines how cron jobs (regularly scheduled jobs) run and, while not needed for running an application (and so not strictly part of this list), can be really useful.
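As a sketch, a crontab entry has five time fields (minute, hour, day of month, month, day of week) followed by the command to run; the script and log paths below are purely hypothetical:

```
# m  h  dom mon dow  command
# Run a (hypothetical) backup script every day at 03:00,
# appending both output and errors to a log file
0 3 * * * /usr/local/bin/backup.sh >> /var/log/backup.log 2>&1
```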
man → show the manual for a command. man should be an automatic reflex for any lesser-known usage. It is absolutely normal to forget how an option or command works when it is not used often; man is there for such cases. If you forget a command name, it is a good idea to keep a Linux cheat sheet around; a quick internet search will find many good one-page PDFs.
tldr → a simplified man that needs to be installed first, but can also be consulted online or as a general PDF
ln → create links (that point to a file); understand the difference between:
- Hard link: another name for the same file data (the data is removed only when the last hard link to it is deleted); it shares the permissions of the target file
- Symbolic link (symlink): exists independently of the linked file and stays behind (dangling) if the linked file is removed or moved; it has its own permissions
For setting-up an application, we generally use symbolic links, mostly as a way to change the permissions independently of the target file.
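The difference can be sketched in a throwaway directory:

```shell
cd "$(mktemp -d)"           # throwaway scratch directory
echo "data" > target.txt
ln target.txt hard.txt      # hard link: another name for the same file data
ln -s target.txt soft.txt   # symlink: a pointer to the name "target.txt"
rm target.txt               # remove the original name
cat hard.txt                # still prints "data": the data survives while a hard link remains
ls -l soft.txt              # the symlink survives too, but now points to nothing (dangling)
```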
chown / chmod → change the owner of a file or folder, change the permissions of a file or folder. Be familiar with:
- symbolic permissions: rwxrwxrwx, r-x------
- numeric modes: 755, 644
- examples: chmod o+r filename, chmod -R 755 directory, chown -R user:group directory
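A small sketch showing numeric and symbolic modes on a throwaway file:

```shell
cd "$(mktemp -d)"      # throwaway scratch directory
touch script.sh
chmod 755 script.sh    # numeric mode: rwxr-xr-x
chmod o-rx script.sh   # symbolic mode: drop read/execute for "others" -> rwxr-x--- (750)
ls -l script.sh
# chown user:group script.sh would change owner and group, but usually requires root
```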
Ideally the pipe (|) should be well understood, as well as the I/O redirections (>, >>, <). If this is unclear, look for "Unix pipes and redirection"; it is a core concept.
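A short sketch of the two output redirections, input redirection, and a pipe chain:

```shell
printf 'beta\nalpha\nalpha\n' > /tmp/pipe-demo.txt   # > creates or truncates the file
echo 'gamma' >> /tmp/pipe-demo.txt                   # >> appends to it
sort < /tmp/pipe-demo.txt | uniq | wc -l             # < feeds stdin; | chains commands
# the pipeline sorts the 4 lines, collapses the duplicate, and counts 3 unique lines
```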
grep → search for the presence of a string in files, often used combined with another command using a pipe (|)
find → search for files by name, type, size, or date; often combined with grep or xargs
mount / umount → attach or detach filesystems. Removable media must be mounted before access; for instance, a USB key must be mounted before use. Temporary mount points are often under /mnt or /media. Modern systems often automount removable media.
pwd / whoami →
- pwd → print current directory
- whoami → show current user
kill → kill does not only kill a process. It actually sends a signal, which most of the time terminates a process: the signal is delivered to the given process for most signals, or to the kernel for SIGKILL (kill -KILL processId) and SIGSTOP (kill -STOP processId), to respectively forcefully kill or stop a process. The default signal (i.e. no argument) for kill is SIGTERM (kill -TERM processId), which should be handled by the process so it terminates properly. If the signal is not processed by the application, either because no handler is coded in or because it is in a state where it cannot deal with it (for instance an infinite loop), nothing will happen. In that case, only kill -KILL processId will kill the application.
The SIGKILL termination is considered unsafe and should only be used on a hung process, after trying SIGTERM first.
The signals have an integer value, with some defined in the POSIX standard, such as kill -9 for SIGKILL, or kill -15 for SIGTERM.
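The difference can be sketched with a disposable background process; in bash and POSIX shells, a process that dies from a signal has exit status 128 plus the signal number:

```shell
sleep 60 &              # throwaway background process
pid=$!
kill -TERM "$pid"       # polite request; same as plain `kill "$pid"` (SIGTERM, signal 15)
wait "$pid"
status=$?
echo "$status"          # 128 + 15 = 143: the process died from SIGTERM
# a hung process that ignores SIGTERM can still be stopped with `kill -KILL "$pid"` (kill -9)
```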
eval → a powerful and dangerous command that executes its arguments as a shell command. It is thus possible to build a command in a shell script and execute it with eval, which enables very advanced operations.
Knowing eval is important for security reasons: the presence of eval in a suspicious script calls for caution.
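A minimal sketch of what eval does, and why it is dangerous with untrusted input:

```shell
cmd='echo "hello from eval"'   # a command stored as a plain string
eval "$cmd"                    # prints: hello from eval
# Danger: if any part of the string comes from user input,
# arbitrary commands can be injected and executed
```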
xargs → build and execute a command from standard input
Very useful when a command produces a list of items (files, process IDs, etc.) that must be passed as arguments to another command. Typical usage combines find, grep, or pipes:

find . -name "*.log" | xargs rm
grep -l "ERROR" *.log | xargs less
This is needed because many commands do not accept input directly from stdin, but only as command-line arguments.
Be careful: xargs will execute the command on all received items.
Prefer using xargs -n 1 (one argument at a time) or test first with echo.
Unlike eval, xargs does not transform a string into a command (the command must be given explicitly and only receives its parameters from xargs). As such, it is not dangerous the way eval is.
If unfamiliar, search for “xargs explained”, understanding it greatly improves command-line efficiency.
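The pattern above can be tried safely in a scratch directory; note the echo dry-run before the real rm:

```shell
mkdir -p /tmp/xargs-demo && cd /tmp/xargs-demo
touch a.log b.log keep.txt
find . -name "*.log" | xargs echo rm       # dry-run: prints the rm command instead of running it
find . -name "*.log" | xargs rm            # actually removes a.log and b.log
printf '1\n2\n3\n' | xargs -n 1 echo item  # -n 1: one argument per invocation, so 3 lines of output
```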
curl / wget → interact with HTTP/HTTPS services from the command line. Useful for testing APIs and services, or for downloading files.
free / df / du → check system resources
- free → memory usage
- df → disk usage per filesystem
- du → disk usage per directory
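For example (df and du are portable; free is Linux-specific):

```shell
df -h .      # disk space on the filesystem holding the current directory
du -sh .     # total size of the current directory and its subdirectories
# free -h    # memory usage (Linux-specific; uncomment on a Linux host)
```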
command & / CTRL-Z + bg / fg → run a command in the background, suspend it, or bring it back to foreground
ssh / scp → connect to remote servers and copy files securely. Essential for working with remote machines. If unfamiliar, search for "SSH basics".
Graceful restart: ask for a transparent restart of a service -> finish the existing "transactions" before restarting. For a web server this means keeping the currently open HTTP sessions until they finish before restarting, thus minimizing the impact of the restart for the end user. Keep in mind that the web server knows nothing about a CMS session it serves (as the web server is generally only a proxy to the web application), so CMS sessions might be forcefully closed anyway (depending on how the session is managed).
There are several ways to do so:
- systemctl reload nginx|apache2|caddy (graceful) (on most Linux distributions) → systemctl is part of systemd, where the 'd' stands for 'daemon'. Daemons are Unix's background processes. What is important to know is that a web server, such as Apache or Nginx, will usually run as a daemon and be managed by systemctl. systemctl reload servicename triggers a graceful restart of the service. Common commands:
  - systemctl status service
  - systemctl start|stop|restart service
  - systemctl reload service (graceful reload when supported)
- /etc/init.d/nginx reload / /etc/init.d/apache2 reload (on some Linux distributions and other Unix OSes (FreeBSD, OpenBSD, …)): systemd is a relatively new and controversial system and service manager; it is the first process to run (PID 1), starts the other elements of Linux, and manages many of them as daemons. This monolithic aspect is quite opposed to the Unix principle ("Do one thing, and do it well"), which is the main point of the controversy. As such, some distributions still use the traditional init (SysV-style) with /etc/init.d scripts.
  As there is no officially available init.d script for Caddy (at the time this document was written), it might not be possible to use this command with Caddy.
- apachectl graceful / apache2ctl graceful / nginx -s reload / caddy reload: the web servers also offer their own command-line management. It is still useful to know the operating-system way, as it does not change between web servers, so maintenance on a different web server remains easy (a web server might be part of a container for a web application). Apache on Windows is managed by httpd.exe and only offers a graceful restart.
apachectl configtest / nginx -t → test the current configuration of Apache or Nginx. Configuring a web server can quickly become complex, for instance when a reverse proxy serves several virtual hosts (e.g. several web applications, each served by its own Docker instance). These commands minimize the risk of deploying a broken configuration, but they still do not ensure that the configuration does what is expected.
rsync → efficient synchronization of files and directories. Often used for backups or deployments. Combined with database dumps, it can form a simple backup strategy; in that case it is important to monitor the process to ensure it keeps working.
Folders you should know
- / → filesystem root
- /bin → essential system binaries
- /usr → user-space applications and libraries
- /usr/bin → most user commands
- /etc → system-wide configuration files (very important)
- /var → variable data
- /var/log → system and application logs (first place to check on errors)
- /home → user home directories
- /root → root user’s home directory
- /tmp → temporary files (often auto-cleaned)
- /mnt → mount points
- /media → temporary mount points for external storage (optional)
- /opt → optional or third-party software
Online resources
The Unix command line is well explained here. It is probably useful to know about the main principle of Unix, and a good course is available here and Wikipedia has a good overview of the Unix filesystem and its layout.
DeNBI Unix course -> TBD, adapt using a light Linux image (or public online VM)
What makes the containers possible: cgroups
One more advanced element which is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to limit and track the resource usage of processes; together with kernel namespaces, which give processes an isolated view of the system (including a "sub" filesystem isolated from the host filesystem), they are what makes containers possible.
It is important to be aware of this to understand the difference between a container and a virtual machine, and to realize that the security risk is higher with a container: bypassing the isolation would give direct access to the host filesystem. And it is not only a matter of exploits: as containers are part of the host, they can use a folder outside their isolated space (a host volume). If not properly set up, this could expose a sensitive part of the host to an intruder. Virtual machines, on the other hand, save their data within a virtual disk that is part of the virtual machine; there is no access to the host system short of a genuine exploit. So for a quick set-up of an exposed application, a virtual machine might be a better solution.
A long-term set-up should take proper care of security in all cases, and then there should be no benefit in using a virtual machine, aside from specific cybersecurity needs (like a honeypot).
Bare-Metal setup and Virtual Machines
Vocabulary
"Bare-metal" is a bit of a misnomer: originally it means using a computer without an operating system, directly on the hardware (thus using a low-level language such as ASM or C). But a bare-metal setup here means a direct installation on a machine, with its operating system.
Note that an actual bare-metal setup can still be a viable option for specialised tasks: if it needs to be embedded (for a cheaper/more efficient solution on a specialised piece of hardware), or if it needs to be highly performant, though in the latter case it might be a better option to move the processing to the GPU (which is also a kind of bare-metal implementation). Choosing to do so is out of scope for this document, but it is important to do it knowingly, as it can have a high cost (i.e. difficulty of maintenance).
A Virtual Machine is a software solution that provides the functionality of a physical computer. It is possible (and generally needed) to install an operating system on it. Most virtual machine solutions provide an easy way to do so, and using one is akin to using an actual computer.
It can be an emulation, "imitating" a computer enough to run its software, generally used for specific purposes such as running deprecated operating systems. Qemu is one of the main ones. An emulation provides full encapsulation, so it is in theory more secure, and exploits are fixable by software updates.
Or it can be a virtualization, where the hosted machine runs directly or semi-directly on the physical hardware through a hypervisor that manages how the resources are used:
- Full virtualization (simulating enough of the computer hardware to run a guest OS) needs significantly more resources than an actual physical computer, while providing full encapsulation, so in theory more security. Exploits are also fixable by software updates.
- Hardware-assisted virtualization, where the hardware helps with the virtualization. Exploits relying on the hardware support might be difficult or impossible to fix, though they are exceptional.
- OS-level virtualization, where the operating system shares its resources with the guests. Docker and other container engines use this kind of virtualization. The reason containers are called containers and not virtual machines comes from the usage: instead of creating a virtual machine and setting it up with an operating system and subsequent pieces of software, they rely on fixed images that contain a bare-minimal operating system (ideally) and layers of software to support the desired application. Once running, they effectively behave like a virtual machine.
A main protection against threats is staying up-to-date: a recent or LTS (Long-Term Support, i.e. supported for a long time) operating system on the physical computer, virtual machine software, and virtualization software (which might have direct support in the CPU).
Containers, Overview of Docker usage
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy, assuming you have some knowledge of Linux.
A clear presentation of Docker is available here and the official documentation is good. The section you should consult is "Open Source", i.e. about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which also offers a single-node Kubernetes and is the simplest way to use and test Docker and Kubernetes on Windows.
While Docker is the de facto standard, there are other container engines, such as Podman, containerd, or CRI-O, and a standard, the Open Container Initiative (OCI), that Podman, containerd and CRI-O follow and Docker almost follows (new images should be OCI compliant, old ones might not), so Docker images might need some adaptation to run with Podman and the others. Docker offers an OCI exporter for the images it builds.
Vocabulary
An image is a template used to create containers. It contains:
- an operating system base (often a minimal Linux). The kernel itself always comes from the host; the image provides the userland, which is generally included for compatibility (be sure that the version of Linux is the right one, especially when not running on a Linux host),
- the application,
- its dependencies,
- default configuration.
An image is immutable: you do not modify it while it is running. Images are usually downloaded from a registry (e.g. Docker Hub).
A container is a running instance of an image. It:
- runs processes,
- listens on ports,
- can be started, stopped, or restarted.
Note that the desired application needs to be started (the entrypoint), or only the operating system will run.
Containers are ephemeral:
- if a container is deleted, everything inside it is lost
- unless data is stored in a volume
A volume is a persistent storage area outside the container. It is used to store all data that must survive container restarts or upgrades.
Volumes are:
- mounted inside containers
- independent of the container lifecycle
Docker Compose is a tool to define and run multiple containers together using a single configuration file (docker-compose.yml). It is used to describe:
- which images to run,
- how containers are connected,
- ports,
- volumes,
- environment variables.
With Docker Compose, you typically manage an entire application stack:
- web server
- application
- database
- cache
All started with:
docker-compose up
Minimal mental model
- Image → what to run
- Container → the running application
- Volume → persistent data
- Docker Compose → a complete setup of multiple images and volumes
If you keep this model in mind, most Docker-related documentation will start to make sense.
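As a sketch, a docker-compose.yml for such a stack could look like this; all image names and credentials are example values, not taken from a real deployment:

```yaml
services:
  web:                         # reverse proxy / web server
    image: nginx:1.25
    ports:
      - "8080:80"              # host port 8080 -> container port 80
    depends_on:
      - app
  app:                         # the web application (placeholder image)
    image: example/myapp:1.0
    environment:
      DATABASE_URL: postgres://app:secret@db:5432/app
  db:
    image: postgres:16
    environment:
      POSTGRES_PASSWORD: secret   # placeholder credential
    volumes:
      - db-data:/var/lib/postgresql/data   # data survives container restarts
volumes:
  db-data:
```

Containers reach each other by service name (app connects to db), while only the web service is exposed on the host.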
Useful things to know
A Docker image is started (once used in a container) through an Entrypoint and/or a Cmd. These can be listed with docker inspect <image id>, or individually with docker inspect -f '{{.Config.Entrypoint}}' <image id> and docker inspect -f '{{.Config.Cmd}}' <image id>. This is useful when something goes wrong inside a container, to know how the container is supposed to start. To learn more about Cmd and Entrypoint, the official Docker documentation offers a great overview.
Changing something that is part of the image is possible while the container is running, but it will never be persistent and is generally a bad idea. One good use case is debugging an application, provided the debugged code is also changed in the code base as soon as the issue is found.
How to build a Dockerfile
The official online documentation is good and complete. Online tutorials or books are recommended for an easier approach.
Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.
A Dockerfile defines layers from a base images (FROM image_name:version) to build a new image. For instance, if the application is running on Python, it is a good idea to start from an official Python image, which will already have the right version of Python and its dependencies. Then you can add the application and its dependencies on top of it.
An image is always immutable, and a new image from a base one with modification will also be immutable. In case we need several application with the same base image, it is a good idea to build a custom base image with the common dependencies, and then build the different application images from this custom base image, so the common layers will be shared between the different images.
A shortcoming is that if one layer adds files that are changed or removed again in a later layer, both layers still end up in the image. This is not always avoidable, but it is good to be aware of, as it can lead to a bigger image than expected, and thus a longer build time and a longer time to start the container.
A powerful way to optimize a dockerfile is to use a multi-stage build, where the first stage is used to build the application and its dependencies, and the second stage is used to copy only the necessary files from the first stage to create a smaller image. This is especially useful for applications that need to be compiled, such as those written in C or C++, but it can also be useful for other types of applications.
As images are immutable, updating is always done by fetching a new base image and building a new image on top of it. It is important to keep an eye on the base image, as it can have security vulnerabilities that are fixed in newer versions. It is a good idea to use a specific version of the base image (for instance python:3.9.13-slim instead of python:3.9-slim), so you know exactly which version you are using and can update it when needed.
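A sketch combining both points (a pinned base image version and a multi-stage build), here for a hypothetical Go application; all names are placeholders:

```dockerfile
# Build stage: full toolchain, pinned base image version
FROM golang:1.22-bookworm AS build
WORKDIR /src
COPY . .
RUN go build -o /out/app .

# Final stage: only the compiled binary on a minimal, pinned base
FROM debian:bookworm-slim
COPY --from=build /out/app /usr/local/bin/app
ENTRYPOINT ["/usr/local/bin/app"]
```

The final image never contains the compiler or the sources, only what the second stage copies in, which keeps it small.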
Moving to Docker compose
The official Online documentations is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty is more about the different components of the setup (for instance an image with a database, linked with an Authentication system).
Advantages
Docker Compose is both a tool to set up a Docker-based application (or set of applications) and the file, written in YAML, that describes the setup. It is a powerful tool to manage a complex application with several components, such as a web server, an application, a database, and a cache. It allows defining the different components of the application and how they are connected together, as well as the volumes and environment variables needed for each component. It also allows starting and stopping the application with a single command, and managing the different components in a consistent way.
Docker compose is quite simple to learn and use, and it is a good way to start with container-based applications. With a coherent docker-compose file, it is a good solution for running a web application, with a database and a cache, on a single machine, with a simple set-up and a good performance.
To access the web-application, it is generally needed to use a reverse proxy, such as Nginx or Apache, to forward the traffic from the host to the container. It is also possible to use a load balancer, such as HAProxy, to distribute the traffic between several instances of the application.
Things to take into account
Adding parameters
The different types of volumes
Namespaces and namespaces collisions
Namespacing is done using the name of the folder containing the docker-compose file. Using the same folder name, even under different parent folders, will use the same namespace. So running two different docker-compose setups from folders of the same name will use the same Docker network. If the applications use the same images, it could mean that one container communicates with a container from the other docker-compose setup. For instance, suppose you want two instances of the same application, testApp, which uses a MySQL database; the database is defined in the docker-compose file and used by testApp. The set-up is the following:
application1/testApp/docker-compose.yml
application2/testApp/docker-compose.yml

It is a clean set-up, with a distinct parent folder above each docker-compose folder (which would make a lot of sense if you want to store some specific files outside of the docker-compose folder). But as the direct parent folder of each docker-compose file has the same name, only one MySQL instance will be used by both (or there will be a binding error and one application will fail, which is probably the best outcome).
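One way around this collision (assuming Docker Compose v2) is to pin the project name explicitly, either with docker compose -p application1 up, the COMPOSE_PROJECT_NAME environment variable, or the top-level name: key in the file; the names and credentials below are placeholders:

```yaml
# application1/testApp/docker-compose.yml
name: application1-testapp    # explicit project name: no collision with application2
services:
  db:
    image: mysql:8
    environment:
      MYSQL_ROOT_PASSWORD: example   # placeholder credential
```

With a distinct name: in each file, the two setups get separate networks and containers even though their folders are both called testApp.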
From Docker compose to Kubernetes
Kubernetes is composed of many elements and has its own vocabulary. Each elements and the way the work together are not complex, but to grasp a minimal working set of Kubernetes will take some effort. We recommend to start with an online tutorial, interactive or not (links to separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running
Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than setting up a production cluster. As software developers, we recommend that you set up a cluster only as a testbed and rely on sysadmins for a production cluster.
Kubernetes elements and concepts are not more complex than those of docker compose, but there are many more of them, so the overall complexity is much higher. It is an orchestration engine: where docker compose can only ask for containers to be restarted when they stop, on the same machine, Kubernetes can choose on which machine containers will run, duplicate them, kill them if they seem unhealthy and create healthy ones, expose them as services usable by other clients (which could be other containers within Kubernetes), and set up network access to these containers without knowing where they run.
For the simplest applications it can be complete overkill; the more complex your application is (in terms of size and number of elements), the more there is to gain from Kubernetes. If you need an application that runs 24/7, with transparent updates, and that can scale from hundreds of clients to several thousand, Kubernetes will make things much easier, and in a well-thought-out way.
But these possibilities come with real difficulty in grasping them, especially if you are not a full-time sysadmin, and this documentation is (mostly) made by and aimed at non-sysadmins.
Quick overview
Kubernetes is an orchestration platform: applications run in containers, containers run in pods, and pods run on host systems called nodes. There can be several containers in a pod (though only if the containers benefit from being strongly coupled), several pods on a node, and several nodes. Kubernetes can automatically manage copies (replicas) of one application across several nodes.
Everything is decoupled and works declaratively: manifests declare the desired state, and Kubernetes works on satisfying it, even when a problem occurs. So if you ask for 4 copies of one application, plus 4 copies of another that should avoid the first, and you have 8 nodes, Kubernetes might initially distribute these 8 instances across the 8 nodes. If one node goes down, its instance will move to another node running the same application, thus still avoiding the other one. This avoidance (anti-affinity, which works at node level) can be strict (no scheduling if the constraint is not met) or soft (if there is no other choice, the node is selected anyway).
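The replica-plus-avoidance scenario above can be sketched as a Kubernetes manifest (the application names, labels and image are hypothetical); the "soft" form of the rule is expressed with `preferredDuringSchedulingIgnoredDuringExecution`, the strict form would use `requiredDuringSchedulingIgnoredDuringExecution`:

```yaml
# Sketch: 4 replicas of app-a that prefer to avoid nodes running app-b.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-a
spec:
  replicas: 4
  selector:
    matchLabels: {app: app-a}
  template:
    metadata:
      labels: {app: app-a}
    spec:
      affinity:
        podAntiAffinity:
          # Soft rule: schedule elsewhere if possible,
          # but still use the node if there is no other choice.
          preferredDuringSchedulingIgnoredDuringExecution:
          - weight: 100
            podAffinityTerm:
              labelSelector:
                matchLabels: {app: app-b}
              topologyKey: kubernetes.io/hostname
      containers:
      - name: app-a
        image: registry.example.org/app-a:1.0   # hypothetical image
```

Applying this manifest (`kubectl apply -f app-a.yaml`) only states the wish; the scheduler then continuously works to satisfy it, including after node failures.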
Kubernetes is presented in more detail in the Kubernetes introduction page, which aims to accelerate learning it and give enough of an overview to understand clearly what it is, what it offers and what it costs.
Other Solutions
Linux Packaging/Bare metal setup
In some cases, installing directly on the host system can be better, or even the only option apart from using a virtual machine, if the application cannot be containerised.
This can be due to specific hardware or security needs, because the application has a very complex setup, or because the application has not been maintained for a long time and does not offer an image (or the image is not of the latest version…).
Most of the time it is possible to set it up in a virtual machine with some extra effort, but some hypervisors (the software that runs the virtual machines) have restrictions on hardware use (GPUs for instance), or lock some functionality behind a commercial version.
Containers can have full access to the hardware when run on Linux, but the base image will probably be restricted in what it is allowed to do, or getting that access may be really difficult.
In any case, the effort to virtualise or containerise the application should depend on the security risk and the estimated time of usage. If the application presents a risk and will be run for several years, it makes sense to spend some effort isolating it. Doing so is simply a matter of removing obstacles one by one:

- try to make it run,
- identify what the issue is,
- look for solutions:
  - check the official documentation,
  - if possible, ask a colleague who might have had the same issue,
  - an LLM might know the answer. If the given answer does not work after 2 or 3 clarifications, it is generally better to stop trying, as LLMs get easily confused and will happily give wrong answers, or answers to the wrong question,
  - do an internet search. Good answers generally come from IT-related forums, especially those focusing on the solution at hand (either the software or the platform), StackOverflow and Reddit,
  - ask online,
  - try alone, by preparing a test-retrial workflow: an easy way to break and repeat, often by keeping a main version of the setup that won't be touched and iterating on copies.
- once this issue is fixed, if the software still has issues, reiterate.
One big caveat of preparing an application that has not been developed in-house is the risk that one or several functionalities will not be exercised during the setup, might not work, and so the application will break later. To avoid this, it is important to test the application with realistic dummy data and to try everything that might be used. As such, it might be better to stick to well-documented applications (even if no longer maintained) that allow a quick overview of all capabilities.
A maintained but undocumented application might seem a better option, but only if it is certain that it will be maintained for long enough and/or that it will eventually be fully documented.
If you write a containerisation for a maintained application, it is a good idea to propose it as a contribution to the project. You will have to see who would be in charge of updating it if needed.
Proxmox VE — Virtualisation Without VMware
Proxmox Virtual Environment (Proxmox VE) is an open-source server virtualisation platform. It allows managing several applications on several servers through a web interface. It supports virtual machines and Linux containers (LXC), which support OCI images, a format very close to Docker images (most Docker images should be compatible).
It has gained significant traction in scientific and institutional settings as a cost-free replacement for VMware vSphere following Broadcom’s acquisition of VMware in 2023 and the subsequent end of free ESXi licences.
What Proxmox can do
Virtual Machines (KVM/QEMU). Proxmox uses the Kernel-based Virtual Machine (KVM) hypervisor combined with QEMU to run full operating system images. Each VM is completely isolated: it has its own virtual hardware, its own kernel, and its own network stack. You can run Windows, any Linux distribution, or BSD inside a VM.
- Full hardware emulation, including GPU passthrough for GPU-accelerated workloads.
- Live migration — move a running VM between cluster nodes with minimal downtime.
- Snapshots — capture the entire state of a VM at any point in time.
Linux Containers (LXC). Containers are faster to start and use less memory than full VMs.
- Lower overhead than full VMs — near-native performance.
- Suitable for running multiple isolated Linux environments on one host.
- Not suitable for non-Linux guests.
Integrated Backup System. Proxmox Backup Server (PBS) is a companion product that provides deduplicating, incremental backups of VMs and containers. Backups are stored efficiently - only changed data blocks are written after the first backup - and can be verified cryptographically.
- Scheduled automatic backups with configurable retention policies.
- Instant restore of a single file from a VM backup without restoring the whole image.
- Offsite replication: backups can be replicated to a remote PBS instance.
High Availability Clustering. Multiple Proxmox hosts can be joined into a cluster. The cluster shares a common configuration and can automatically restart VMs on another node if a host fails. This requires at least three nodes to achieve quorum: the next main node needs to be elected by a majority vote.
Software-Defined Networking. Proxmox VE allows administrators to define VLANs, virtual network zones, and routing policies from the web interface.
Proxmox vs. Kubernetes — when to use which
Proxmox virtualises hardware or runs containers without the full orchestration that Kubernetes offers. Kubernetes orchestrates containers, running packaged applications across a pool of machines.
They operate at different layers and can be combined: a Kubernetes cluster can run on VMs managed by Proxmox.
| Dimension | Proxmox VE | Kubernetes |
|---|---|---|
| What it manages | Virtual machines & LXC containers | Application containers (Docker/OCI) |
| Abstraction level | Infrastructure / OS level | Application / workload level |
| Typical user | Sysadmin managing servers | DevOps/developer managing applications |
| Scaling | Manual VM provisioning or scripted | Automatic horizontal pod scaling |
| Self-healing | HA restarts VMs on node failure | Reschedules failed pods automatically |
| Complexity | Moderate — manageable by a small team | High — significant learning curve |
| Persistent storage | VM disk images, Ceph, NFS | Persistent Volume Claims, CSI drivers |
| Networking | Bridges, VLANs, SDN | CNI plugins, Ingress controllers, Services |
A practical path for a research group: start with Proxmox to host virtual machines for different services (database, web server, backup). If the team grows and needs to manage many application deployments with automated scaling, consider running a Kubernetes cluster on top of Proxmox VMs. If the needs grow even higher, OpenStack offers a group of cloud-oriented solutions, see below.
Resources
- Proxmox VE documentation — full installation and administration guide
- Proxmox Backup Server documentation
- Proxmox community forum — active community with scientific HPC use cases
- Migrating to Proxmox from VMware — official migration wiki
OpenStack — Open-Source Cloud Infrastructure
OpenStack is a large-scale, open-source cloud computing platform that allows organisations to build and operate their own Infrastructure-as-a-Service (IaaS) cloud — comparable in scope to Amazon Web Services or Microsoft Azure, but running entirely on hardware you control. It is widely used by national research infrastructures, universities, and supercomputing centres worldwide.
OpenStack is designed for managing tens to thousands of physical servers. For a single-server or small-team deployment, Proxmox is usually more appropriate. OpenStack becomes compelling when a research institute or facility needs to offer self-service cloud resources to many research groups simultaneously.
Architecture overview
OpenStack is not a monolithic application — it is a collection of loosely coupled services that communicate via REST APIs. Each service manages one aspect of the infrastructure.
| Service | Code name | Function |
|---|---|---|
| Compute | Nova | Manages the lifecycle of virtual machine instances |
| Networking | Neutron | Software-defined networking: virtual networks, routers, floating IPs |
| Block storage | Cinder | Persistent block volumes attached to VMs |
| Object storage | Swift | Scalable object store (S3-compatible) |
| Image service | Glance | Stores and retrieves VM disk images |
| Identity | Keystone | Authentication and authorisation for all OpenStack services |
| Dashboard | Horizon | Web-based graphical interface for all services |
| Orchestration | Heat | Template-based infrastructure orchestration |
| Telemetry | Ceilometer / Gnocchi | Usage metering and monitoring |
| Bare metal | Ironic | Provision physical servers alongside VMs |
| Container orchestration | Magnum | Provides Kubernetes clusters as a managed service |
| Distributed storage | Ceph (integrated) | Block, object, and file storage for Nova/Cinder/Glance |
Ceph — distributed storage
Ceph is a software-defined distributed storage system tightly integrated with OpenStack. It is the recommended storage backend for production OpenStack deployments because it eliminates single points of failure and scales horizontally by adding storage nodes.
How Ceph works. Ceph distributes data across a cluster of storage nodes (called OSDs - Object Storage Daemons). Each piece of data is replicated (typically three copies) or erasure-coded across different nodes, so the failure of any single disk or server does not cause data loss.
The three main Ceph interfaces are:
- RADOS Block Device (RBD) — presents distributed storage as a block device. Used by Cinder for VM volumes and by Nova for VM boot disks.
- RADOS Gateway (RGW) — provides an S3-compatible object storage API, usable as a drop-in replacement for Amazon S3. See more information about object storage in the Data Storage page.
- CephFS — a POSIX-compatible shared filesystem for applications that need traditional file-path semantics.
Why Ceph with OpenStack. Live migration of VMs is much simpler when all compute nodes access the same Ceph storage pool. Storage can also scale independently of compute by adding storage nodes without affecting running VMs.
Magnum — Kubernetes as a service
Magnum is the OpenStack service that provisions and manages container orchestration clusters, most commonly Kubernetes, on demand. Rather than manually installing and configuring Kubernetes, Magnum automates the entire process using Heat templates (see below).
Users request a cluster via the API or Horizon dashboard and receive a fully configured Kubernetes environment running on OpenStack VMs. Administrators define cluster templates specifying the node size, Kubernetes version, and networking plugin. Once created, the cluster is managed with standard Kubernetes tools (kubectl, Helm, etc.) and the OpenStack layer becomes invisible to the end user.
Magnum does not replace Kubernetes — it automates the deployment of Kubernetes on OpenStack infrastructure. From the user’s perspective, they receive a standard Kubernetes cluster.
Other notable OpenStack services
Horizon (Web Dashboard). Horizon is the web interface for OpenStack. It provides a graphical view of all resources: instances, volumes, networks, and images. For users who prefer not to use the command line or REST APIs, Horizon is the primary day-to-day interface.
Neutron (Networking). Neutron provides virtual networks, subnets, routers, security groups, and floating IPs.
Heat (Orchestration). Heat allows you to describe an entire cloud environment — VMs, networks, storage, security groups — in a YAML template and deploy it reproducibly with a single API call. This is the OpenStack equivalent of AWS CloudFormation and enables Infrastructure as Code workflows.
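A minimal Heat template (in the HOT format) can make this concrete; the image, flavor and network names below are hypothetical and site-specific:

```yaml
# Sketch of a Heat Orchestration Template (HOT): one VM on a named network.
heat_template_version: 2018-08-31

parameters:
  key_name:
    type: string
    description: Existing SSH keypair to inject into the server

resources:
  server:
    type: OS::Nova::Server
    properties:
      image: ubuntu-22.04        # hypothetical image name
      flavor: m1.small           # hypothetical flavor
      key_name: { get_param: key_name }
      networks:
        - network: private-net   # hypothetical network

outputs:
  server_ip:
    value: { get_attr: [server, first_address] }
```

Such a template is deployed as a "stack" (for instance with `openstack stack create -t server.yaml --parameter key_name=mykey mystack`), and the whole environment can be updated or deleted as one unit.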
Ironic (Bare Metal Provisioning). Ironic provisions physical servers in the same way Nova provisions VMs. This is useful for high-performance computing workloads where VM overhead is unacceptable — for example, MPI jobs requiring direct access to InfiniBand interconnects or specific GPU hardware.
Resources
- OpenStack documentation — official hub for all OpenStack projects
- DevStack — sets up a full OpenStack environment on a single machine for learning and testing
- Kolla-Ansible — deploys OpenStack services as containers for easier lifecycle management
- EGI Federated Cloud — federated OpenStack infrastructure available to European researchers
Lightweight and Cloud-Native Container Orchestrators
Kubernetes is not the only way to schedule and manage containers at scale. Several lighter-weight or cloud-managed alternatives exist, each occupying a different point on the spectrum between simplicity and power.
HashiCorp Nomad
Nomad is a general-purpose workload orchestrator developed by HashiCorp. Unlike Kubernetes, which is specifically designed for containers, Nomad can schedule Docker containers, standalone executables (binaries), Java JAR files, virtual machines (via QEMU), and batch jobs — all using the same scheduler and configuration language.
Nomad shares a drawback with Docker Swarm: neither is very popular. There are therefore fewer external resources available, fewer discussions about it, and fewer pre-packaged applications (Helm charts do not apply; Nomad uses its own job specification format, HCL). But it can still be a good solution for small teams that need to schedule a heterogeneous mix of workloads without the operational overhead of Kubernetes, while being well aware of these limitations.
Architecture. Nomad uses a client/server model. A small number of server nodes (three or five, for quorum) handle scheduling decisions. Client nodes run the actual workloads. There is no dedicated etcd cluster, no separate controller manager, no separate scheduler process: all scheduling logic lives in the Nomad server binary.
Key characteristics:
- Single binary deployment. The entire Nomad agent (server or client mode) is a single Go binary with no external runtime dependencies. This makes installation trivial compared to a full Kubernetes cluster.
- Multi-workload. Nomad schedules containers, batch jobs, system daemons, and raw executables in a uniform way. This is useful in scientific environments where some workloads are containerised and others are legacy binaries.
- Native integration with Vault and Consul. HashiCorp’s secret management tool (Vault) and service discovery tool (Consul) integrate directly with Nomad, providing a coherent platform for secrets injection and service mesh networking.
- Lighter operational footprint. A minimal Nomad cluster requires significantly fewer nodes and less memory than a comparable Kubernetes cluster with all its components.
Limitations compared to Kubernetes:
- Smaller ecosystem: fewer pre-packaged applications (Helm charts do not apply; Nomad uses its own job specification format HCL).
- Less sophisticated networking: traffic routing requires Consul or an external load balancer.
- Less mature support for stateful workloads (though Nomad CSI driver support is improving).
Nomad is a good fit when: you need to schedule a heterogeneous mix of containers and non-container workloads; your team is small and wants a simpler operational model than Kubernetes; or you are already using other HashiCorp tools (Terraform, Vault, Consul) and want a consistent toolchain. (Terraform can also be used with Kubernetes).
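To give a feel for Nomad's job specification format, here is a sketch of an HCL job running a containerised web service (the job, image and resource figures are hypothetical):

```hcl
# Sketch: a Nomad service job running two instances of a Docker container.
job "testapp" {
  datacenters = ["dc1"]
  type        = "service"

  group "web" {
    count = 2                      # two instances, placed by the Nomad scheduler

    network {
      port "http" { to = 8080 }    # map an allocated host port to container port 8080
    }

    task "server" {
      driver = "docker"
      config {
        image = "testapp:latest"   # hypothetical image
        ports = ["http"]
      }
      resources {
        cpu    = 500   # MHz
        memory = 256   # MB
      }
    }
  }
}
```

The job is submitted with `nomad job run testapp.nomad.hcl`; like Kubernetes manifests, it is declarative, and Nomad reschedules the tasks if a client node fails.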
HashiCorp is a commercial entity, now part of IBM. So while their tools are free and “open-source”, there is always a risk that they could change their licensing model or discontinue free support for Nomad in the future. On the other hand, if there is a budget for it, HashiCorp offers enterprise support. And as it is a commercial product, their tools are generally well maintained and have good documentation.
Their licence is not fully open-source, but source-available under the Business Source License (BSL). This means that while the source code is available for inspection and modification, there are restrictions on how it can be used in production without a commercial license.
- Nomad documentation
- Nomad vs. Kubernetes comparison — official comparison from HashiCorp
Docker Swarm
Docker Swarm is the native clustering and orchestration mode built into the Docker Engine itself. A group of Docker hosts can be joined into a Swarm cluster, turning them into a pool of nodes that can run and scale containerised services collectively.
Architecture. Swarm uses a manager/worker node model. Manager nodes handle scheduling and cluster state (stored via the Raft consensus algorithm). Worker nodes run containers. Unlike Kubernetes, there are no separate etcd, API server, or controller manager processes — all Swarm logic is embedded in the Docker daemon.
Key characteristics:
- Zero additional installation. Any machine running Docker Engine can join a Swarm with a single `docker swarm init` or `docker swarm join` command. No separate binaries or configuration files are needed.
- Docker Compose compatibility. Docker Compose files (with minor additions for `deploy:` stanzas) can be deployed to Swarm as stacks directly, making it easy to move from a local development setup to a distributed deployment.
- Simplicity. Swarm intentionally covers the 80% use case - service scaling, rolling updates, health-based restarts, overlay networking, and secrets management - without the conceptual complexity of Kubernetes.
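As a sketch of the Compose compatibility (service and image names hypothetical), the only Swarm-specific part of a stack file is the `deploy:` stanza:

```yaml
# Sketch: a Compose file usable as a Swarm stack.
services:
  web:
    image: testapp:latest    # hypothetical image
    ports:
      - "80:8080"
    deploy:                  # honoured by Swarm, ignored by plain `docker compose up`
      replicas: 3
      update_config:
        parallelism: 1       # rolling update, one container at a time
        delay: 10s
      restart_policy:
        condition: on-failure
```

It would be deployed with `docker stack deploy -c docker-compose.yml mystack`, and Swarm then keeps 3 replicas running across the cluster.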
Current status. Docker Swarm is considered feature-stable rather than actively developed. Mirantis (which acquired Docker Enterprise in 2019) has committed to maintaining Swarm. For new projects, Docker’s own documentation now recommends Kubernetes for complex production workloads. Swarm remains a sensible, low-overhead choice for small research groups that are already familiar with Docker Compose and do not need the full Kubernetes feature set.
Limitations:
- Limited ecosystem compared to Kubernetes — no Helm, no Operators, no Custom Resource Definitions.
- No built-in support for auto-scaling based on metrics.
- Less active development; fewer new features.
Amazon ECS - Elastic Container Service
Amazon, Google and Microsoft (Azure) solutions are generally billed per use, so they can be more expensive than self-hosted solutions, especially if the workload is not well optimised. It is important to monitor the cost and optimise usage to avoid unexpected bills. The bill can grow very quickly if usage grows: it could be a simple issue with another service calling the container more often than expected, or a change in the code that makes the container run for a longer time. It is important to set up alerts to be notified when the cost grows above a certain threshold, or to avoid such solutions if the cost is not predictable or the budget is not flexible enough. Self-hosting might initially be more expensive, but it usually offers a fixed cost and can become cheaper in the long run. Generally, a well-managed self-hosted solution should be cheaper than a cloud solution.
Amazon ECS is AWS’s proprietary container orchestration service. You define tasks (one or more containers that run together, similar to a Kubernetes Pod) and services (long-running groups of tasks with load balancing and auto-scaling). ECS handles scheduling, health checks, and replacement of failed tasks.
Two launch modes:
- ECS on EC2 (Elastic Compute Cloud). You provision and manage a pool of EC2 virtual machines that act as ECS container hosts. You are responsible for the underlying instances (patching, scaling the instance fleet).
- ECS on Fargate. AWS manages the underlying infrastructure entirely. You specify CPU and memory requirements per task and pay per second of task runtime with no instances to manage. It is a serverless container platform within the AWS ecosystem (see below for more on serverless containers).
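In both modes, the unit of deployment is the task definition. A minimal Fargate-style sketch might look like the following (the family name, AWS account ID, region and image are all hypothetical):

```json
{
  "family": "my-analysis-task",
  "requiresCompatibilities": ["FARGATE"],
  "networkMode": "awsvpc",
  "cpu": "512",
  "memory": "1024",
  "containerDefinitions": [
    {
      "name": "app",
      "image": "123456789012.dkr.ecr.eu-west-1.amazonaws.com/my-app:latest",
      "portMappings": [{ "containerPort": 8080 }],
      "essential": true
    }
  ]
}
```

A task definition like this is registered with `aws ecs register-task-definition` and then referenced by an ECS service, which keeps the desired number of tasks running behind a load balancer.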
If properly set up, ECS on EC2 should be cheaper than Fargate, but it requires more operational effort to manage the fleet of EC2 instances. Fargate is more expensive but offers a simpler, fully managed experience.
Key characteristics:
- Deep integration with the AWS ecosystem: IAM for per-task permissions, CloudWatch for logs and metrics, ALB/NLB for load balancing, ECR for container images, Secrets Manager for credentials.
- Simpler than Kubernetes for straightforward web application or data pipeline deployments on AWS.
- No control plane to operate, AWS manages the scheduler.
Limitations:
- Vendor lock-in: ECS task definitions, IAM roles, and networking are all AWS-specific. Migrating to another cloud or on-premises later requires significant rework.
- Less flexible than Kubernetes for complex multi-service applications with sophisticated networking or storage requirements.
- No equivalent to Kubernetes Operators or CRDs for extending the platform.
If your team is already comfortable with Kubernetes, AWS offers EKS (Elastic Kubernetes Service), which runs a managed Kubernetes control plane. ECS is simpler to start with; EKS gives portability and the full Kubernetes ecosystem. Both support Fargate for serverless compute.
Google Kubernetes Engine
Google Kubernetes Engine (GKE) is Google’s managed Kubernetes service. It provides a fully managed control plane and automates cluster provisioning, upgrades, scaling, and maintenance. GKE runs Kubernetes clusters on Google Cloud’s infrastructure, with deep integration into other Google Cloud services.
Serverless Container Platforms
“Serverless containers” describe managed platforms where you provide a container image and the platform handles all infrastructure with no VMs, no clusters, no nodes to manage. You pay only for actual compute time. The trade-off is reduced control and potential cold-start latency.
The term serverless can be confusing: servers still exist, but they are fully abstracted away from the user. From the user’s perspective, you push a container image and the platform runs it on demand.
Common platforms:
| Platform | Provider | Notes |
|---|---|---|
| AWS Fargate | Amazon | Works with ECS or EKS; per-second billing |
| Google Cloud Run | Google | HTTP-triggered; scales to zero; see below |
| Azure Container Instances | Microsoft | Simple per-container billing; no orchestration layer |
| Azure Container Apps | Microsoft | Higher-level; built on Kubernetes + KEDA for event-driven scaling |
| Fly.io | Independent | Developer-friendly; global edge deployment |
| Render | Independent | Simple deployment from Git; Docker or native buildpacks |
Scientific use cases:
- On-demand analysis endpoints: Deploy a container that runs a computationally expensive analysis when an HTTP request arrives and shuts down immediately afterwards. Costs nothing when idle.
- Batch processing pipelines: Trigger container runs from object storage events (a new file uploaded triggers a processing job).
- Reproducible environments: Share a container image with reviewers so they can reproduce an analysis without installing software.
Limitations to consider:
- Maximum execution time limits (e.g., Cloud Run has a 60-minute request timeout; some platforms impose shorter limits).
- Stateless by design — persistent storage must be external (object storage, managed database).
- Cold starts: the first request after a period of inactivity may take several seconds while the container image is pulled and started.
- Cost unpredictability for high-traffic or long-running workloads.
Google Cloud Run
Google Cloud Run is Google’s fully managed serverless container platform. It is one of the most mature and widely used serverless container services, and is worth examining in detail as a representative example of the category.
How it works. You package your application as a Docker/OCI container image, push it to a container registry (Google Artifact Registry, or any public registry), and deploy it with a single command or through the Cloud Console. Cloud Run starts one or more container instances to handle incoming HTTP/gRPC (a remote procedure call framework from Google that allows, as the name implies, calling remote functions) requests and scales the number of instances automatically, including scaling to zero when there is no traffic.
# Build and push a container, then deploy to Cloud Run
gcloud builds submit --tag gcr.io/MY_PROJECT/my-app
gcloud run deploy my-app \
--image gcr.io/MY_PROJECT/my-app \
--platform managed \
--region europe-west1 \
--allow-unauthenticated

Key features:
- Scale to zero. When no requests are incoming, Cloud Run runs no instances and incurs no compute cost. This makes it economical for sporadic or unpredictable workloads typical in research.
- Concurrency. A single Cloud Run instance can handle multiple simultaneous requests (configurable up to 1000), which reduces cold starts compared to function-as-a-service platforms.
- Any language, any framework. Because you bring your own container image, there are no language restrictions. Python, R, Java, or any compiled binary can be deployed.
- Cloud Run Jobs. In addition to HTTP services, Cloud Run Jobs run containerised batch tasks to completion — useful for data processing pipelines, periodic reports, or model training jobs.
- VPC connectivity. Cloud Run services can connect to private VPC networks, enabling access to Cloud SQL, Memorystore, or on-premises systems via VPN or Interconnect.
Integration with Google Cloud:
- BigQuery, Cloud Storage, Pub/Sub — data can flow in and out through standard Google Cloud services.
- Cloud Scheduler — trigger Cloud Run Jobs on a cron schedule.
- Eventarc — trigger services from Cloud Storage events, Pub/Sub messages (publish–subscribe: publishers send messages asynchronously to subscribed services, without being coupled to them), or Audit Logs.
- Secret Manager — inject secrets as environment variables at runtime without embedding them in container images.
Limitations:
- Maximum request timeout of 60 minutes (sufficient for many analysis tasks; longer jobs should use Cloud Run Jobs or Cloud Batch).
- In-memory storage only during execution; persistent state must use Cloud Storage or a database.
- Compute is limited to CPU and memory per instance; GPU support is available in preview but not generally available in all regions.
- Full vendor lock-in to Google Cloud APIs and billing.
Cloud Run is ideal when you want to run containers without managing infrastructure. Google Kubernetes Engine (GKE) is better when you need full Kubernetes capabilities: stateful workloads, custom networking, complex multi-service applications, or specific hardware (GPUs, TPUs) that require persistent node pools. Cloud Run and GKE can coexist in the same project and share the same container registry.
- Google Cloud Run documentation
- Cloud Run Jobs documentation
- Cloud Run pricing — first 2 million requests per month are free
Comparison of lightweight and cloud-native orchestrators
| Platform | Managed by | Infrastructure | Vendor lock-in | Complexity | Best for |
|---|---|---|---|---|---|
| HashiCorp Nomad | Self-hosted | Your servers | None | Low–medium | Heterogeneous workloads, HashiCorp stack |
| Docker Swarm | Self-hosted | Your servers | None | Low | Small teams already using Docker Compose |
| Amazon ECS | AWS | AWS (EC2 or Fargate) | High (AWS) | Low–medium | AWS-native applications |
| Google Cloud Run | Google | Fully managed | High (GCP) | Very low | HTTP services, sporadic workloads |
| Azure Container Apps | Microsoft | Fully managed | High (Azure) | Low | Event-driven microservices on Azure |
Apache Software Foundation Solutions
The Apache Software Foundation (ASF) hosts hundreds of open-source projects. For data management and scientific computing infrastructure, several Apache projects are particularly relevant. They are mature, widely deployed, and have large communities — making them safe long-term choices for research data platforms.
Apache HTTP Server
The Apache HTTP Server (httpd) is one of the most widely deployed web servers in the world. It serves static files and acts as a reverse proxy in front of application servers.
- Virtual hosting. Multiple websites or applications can be served from one server using different domain names or paths.
- Modules. Functionality is extended through modules: `mod_proxy` for reverse proxying, `mod_ssl` for HTTPS, `mod_rewrite` for URL manipulation, `mod_auth` for authentication.
- `.htaccess` files. Per-directory configuration allows fine-grained access control without restarting the server, which is useful in multi-user research environments.
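As a sketch of the reverse-proxying module in use (the domain and backend port are hypothetical), a virtual host forwarding all requests to a local application server could look like:

```apache
# Sketch: Apache as a reverse proxy for an application on port 3000.
# Requires mod_proxy and mod_proxy_http to be enabled.
<VirtualHost *:80>
    ServerName myapp.example.org

    ProxyPreserveHost On
    ProxyPass        "/" "http://127.0.0.1:3000/"
    ProxyPassReverse "/" "http://127.0.0.1:3000/"
</VirtualHost>
```

`ProxyPassReverse` rewrites redirect headers coming back from the application so that clients never see the internal address.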
For high-concurrency workloads (many simultaneous connections), Nginx is often preferred due to its event-driven architecture. Apache’s process/thread model is simpler to configure and better supported by legacy applications. Both can coexist: Nginx at the front end proxies requests back to Apache.
Apache Tomcat
Apache Tomcat is a Java Servlet container and web server. It runs Java web applications packaged as WAR (Web Application Archive) files. Many scientific data management platforms — including OMERO, certain LIMS systems, and institutional repository software — are Java-based and require Tomcat.
- Implements the Java Servlet and JavaServer Pages (JSP) specifications.
- Can be placed behind Apache HTTP Server (via `mod_proxy_ajp`) or Nginx for SSL termination and load balancing.
- The Manager web application allows deploying and undeploying WAR files via a browser interface.
Apache Kafka
Apache Kafka is a distributed event streaming platform. It acts as a high-throughput, fault-tolerant message broker: producers publish data events (messages) to named topics, and consumers read them independently and at their own pace. Kafka retains messages for a configurable period, so consumers can replay historical data.
Scientific use cases:
- Ingesting high-frequency sensor or instrument data in real time.
- Decoupling data producers (instruments, simulations) from data consumers (databases, analysis pipelines).
- Building audit logs where every data change is recorded as an immutable event.
- Streaming data simultaneously to multiple downstream systems.
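These properties follow from Kafka's core abstraction: an append-only log per topic, with each consumer tracking its own offset. A toy in-memory sketch in Python (purely conceptual, not a client for a real broker, which would use a library such as confluent-kafka):

```python
# Toy model of Kafka's topic/offset abstraction. Messages are retained in
# the log rather than deleted on read, so consumers can replay history.

class Topic:
    def __init__(self):
        self.log = []                       # append-only message log

    def produce(self, message):
        self.log.append(message)

class Consumer:
    def __init__(self, topic):
        self.topic = topic
        self.offset = 0                     # independent position per consumer

    def poll(self):
        msgs = self.topic.log[self.offset:]
        self.offset = len(self.topic.log)
        return msgs

    def replay_from(self, offset):
        self.offset = offset                # replaying history = rewinding

sensor = Topic()
db_writer, analyser = Consumer(sensor), Consumer(sensor)
sensor.produce({"t": 0, "temp": 21.3})
sensor.produce({"t": 1, "temp": 21.5})
assert db_writer.poll() == analyser.poll()  # each consumer sees all messages
```

A real Kafka cluster adds partitioning, replication, and persistence on top of this model, but producers, topics, offsets, and independent consumers work exactly as sketched.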
Apache Spark
Apache Spark is a unified analytics engine for large-scale data processing. It extends the MapReduce paradigm with in-memory computation, making iterative algorithms (common in machine learning and simulation analysis) far faster than disk-based approaches.
Core capabilities:
- Spark SQL — query structured data using SQL or a DataFrame API (available in Python, R, Scala, and Java).
- Structured Streaming — process data streams from Kafka or other sources in near-real time.
- MLlib — distributed machine learning for classification, regression, clustering, and collaborative filtering.
- GraphX — distributed graph computation for network analysis.
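As a rough illustration of the MapReduce paradigm that Spark extends, here is a word count in plain Python. This is conceptual only; real Spark code would express the same two phases with the RDD or DataFrame API and run them distributed across a cluster:

```python
# Word count in the MapReduce style that Spark generalises: a map phase
# that processes each record independently, and a reduce phase that
# merges results per key. Input data is made up for the example.
from collections import Counter

lines = ["the quick brown fox", "the lazy dog", "the end"]

# Map phase: each line is independently turned into (word, 1) pairs.
# In a cluster, lines would be spread over many workers.
mapped = [[(word, 1) for word in line.split()] for line in lines]

# Reduce phase: counts for the same key are merged across partitions.
counts = Counter()
for pairs in mapped:
    for word, n in pairs:
        counts[word] += n

print(counts["the"])   # 3
```

Spark's advantage over classic disk-based MapReduce is keeping intermediate results like `mapped` in memory, which is why iterative algorithms (machine learning, graph analysis) run much faster.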
Apache Airflow
Apache Airflow is a workflow orchestration platform for scheduling and monitoring data pipelines. Workflows are defined as Python code in the form of Directed Acyclic Graphs (DAGs), where each node represents a task (e.g., download data, run analysis, upload results) and edges define dependencies between tasks.
Scientific use cases:
- Automating recurring data processing pipelines (nightly ingestion, weekly reports).
- Orchestrating multi-step bioinformatics or image analysis workflows.
- Managing dependencies between heterogeneous tasks: bash scripts, Python functions, Kubernetes pods, SQL queries.
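The DAG concept can be sketched without Airflow itself, using the standard library's `graphlib`. This is a toy runner with made-up task names; a real Airflow DAG declares the same structure with operators and `>>` dependencies:

```python
# Toy DAG runner: tasks plus dependencies, executed in topological order,
# i.e. a task only runs once everything it depends on has run.
from graphlib import TopologicalSorter

ran = []
tasks = {
    "download": lambda: ran.append("download"),
    "analyse":  lambda: ran.append("analyse"),
    "upload":   lambda: ran.append("upload"),
}
# Map each task to the set of tasks it depends on (the DAG's edges).
deps = {"analyse": {"download"}, "upload": {"analyse"}}

for name in TopologicalSorter(deps).static_order():
    tasks[name]()

print(ran)   # ['download', 'analyse', 'upload']
```

Airflow adds scheduling, retries, logging, and a web UI on top, but dependency resolution over a DAG is the heart of it.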
Bioinformaticians often use Nextflow or Snakemake because they natively understand file-based workflows and integrate with HPC schedulers (SLURM, PBS). Airflow is more general-purpose and integrates better with cloud services and databases. Both approaches can coexist in one organisation.
Apache Solr
Apache Solr is an enterprise search platform built on Apache Lucene. It provides full-text search, faceted search, filtering, and ranked retrieval over structured and unstructured data. It is useful for building search interfaces over scientific datasets, publications, or metadata catalogues.
- Supports JSON, XML, and CSV indexing.
- Runs standalone or in SolrCloud mode for distributed, fault-tolerant search.
- Used by DSpace (institutional repositories) and CKAN (open data portals).
Apache Arrow and Parquet
Apache Arrow defines a language-agnostic in-memory columnar data format. Libraries for Python, R, Java, C++, and others can share data through Arrow buffers without copying or serialisation — enabling zero-copy interoperability between Pandas, Spark, DuckDB, and other tools.
Apache Parquet is a columnar storage file format optimised for analytics. Storing data column-by-column (rather than row-by-row as in CSV) allows efficient compression and selective column reads — you read only the columns needed for a query without scanning the whole file.
- Widely supported: Spark, Pandas, DuckDB, BigQuery, Redshift, and Athena all read Parquet natively.
- Strongly typed schemas with nested data support.
- Snappy and Zstandard compression significantly reduce storage footprint compared to CSV.
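The columnar idea can be sketched in plain Python. With the real format you would use a library such as pyarrow (e.g. `pyarrow.parquet.read_table("data.parquet", columns=["temperature"])` to read a single column); the sample data below is made up:

```python
# Row-oriented vs column-oriented layout, in plain Python.
# Parquet stores data column-by-column on disk, so a query touching one
# column reads only that column's bytes.

rows = [  # row-oriented (CSV-like): every read touches every field
    {"sample": "A1", "temperature": 21.3, "ph": 7.1},
    {"sample": "A2", "temperature": 21.5, "ph": 7.0},
    {"sample": "A3", "temperature": 21.4, "ph": 7.2},
]

# Column-oriented: one contiguous list per column.
columns = {key: [row[key] for row in rows] for key in rows[0]}

# A query needing one column reads only that column's data, and runs of
# same-typed values compress far better than interleaved rows.
mean_temp = sum(columns["temperature"]) / len(columns["temperature"])
print(round(mean_temp, 2))   # 21.4
```

This is also why Arrow's in-memory columnar buffers allow zero-copy sharing: every tool agrees on the same column layout, so no per-row conversion is needed.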
Apache project summary
| Project | Domain | Key use case in research |
|---|---|---|
| HTTP Server | Web serving | Serving applications, reverse proxy, access control |
| Tomcat | Java applications | Hosting Java-based data management platforms |
| Kafka | Event streaming | Real-time data ingestion from instruments |
| Spark | Big data analytics | Large-scale data processing and machine learning |
| Airflow | Workflow orchestration | Scheduling and monitoring data pipelines |
| Solr | Search | Full-text search over datasets and metadata catalogues |
| Arrow | Data interchange | Zero-copy data sharing between analysis tools |
| Parquet | File format | Efficient columnar storage for analytical datasets |
Scalability
Scalability is the capability to adapt to growing needs: more simultaneous users, more data to store, more processing power, larger files, more network traffic. A scalable solution is generally not needed at the beginning, but it is good to know what scalability means and how to achieve it, as having to change your solution later can be a problem. It has strong consequences on the technical side, but also on costs.
Vertical scalability means adapting by using more powerful resources (for instance a bigger server, with more CPU, RAM, or storage). It is generally the simplest approach, but it has a limit: there is only so much you can add to a single machine or virtual machine. An easy first step is to separate what can be separated onto different machines, for instance running the database on a different machine than the web application. Even when both run on the same machine, keeping them separated makes it easy to later move the database to a more powerful machine, and improves security by isolating the database from the web application.
Horizontal scalability means adapting by using more resources (more servers, virtual machines, or containers). It is not always possible, depending on the Data Management Platform. The simple solution is a load balancer distributing traffic between several instances of the application, but it gets more complex if the instances need to share state (for instance a shared database or file system). A simple example is several users editing the same element: if one user starts working on an element and another edits it at the same time, the changes of whoever saves first are lost when the second user saves. To avoid this, the application must handle the situation, for instance by locking the element while a user works on it, or by merging the changes made by different users. This is not always possible and can be a strong limitation for horizontal scalability.
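The lost-update scenario, and one common fix (optimistic locking via a version number), can be sketched in Python. Everything here is a toy illustration with made-up names, not code from any particular platform:

```python
# Optimistic locking sketch: each record carries a version number, and a
# save is rejected if the record changed since it was read.

class ConflictError(Exception):
    pass

store = {"element-1": {"version": 1, "text": "original"}}

def read(key):
    rec = store[key]
    return rec["version"], rec["text"]

def save(key, based_on_version, new_text):
    rec = store[key]
    if rec["version"] != based_on_version:
        raise ConflictError("record changed since it was read; re-read and retry")
    rec["version"] += 1
    rec["text"] = new_text

v_alice, _ = read("element-1")              # both users read version 1
v_bob, _ = read("element-1")
save("element-1", v_alice, "Alice's edit")  # succeeds, version becomes 2
try:
    save("element-1", v_bob, "Bob's edit")  # rejected instead of silently lost
except ConflictError as err:
    print("conflict:", err)
```

Relational databases offer the pessimistic variant (`SELECT ... FOR UPDATE`) as well; either way, the application has to be written with concurrent editing in mind, which is what limits naive horizontal scaling.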
Kubernetes is designed for horizontal scalability, but it is not the only solution.
Databases
Most Data Management Platforms rely on a database to store their data. A basic knowledge of databases in general, and of the one used by your platform in particular, is useful for troubleshooting, for specific operations (for instance a bulk update), and for backup and restore. The most used databases are relational databases, such as MySQL, PostgreSQL, MariaDB, Oracle Database, Microsoft SQL Server, and SQLite. They rely on a schema, with tables and relations between them. All operations are row-based, and querying is done by matching rows against other rows and/or values. SQL (Structured Query Language) is the standard language for relational databases, used for querying, inserting, updating, and deleting data. SQL is relatively straightforward, and if you are not already familiar with it, it is probably a good idea to spend some time learning it; there are many online resources, such as SQLZoo, the W3Schools SQL Tutorial, and the Codecademy SQL Course.
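The basics are easy to try with Python's built-in `sqlite3` module (an in-memory database, so there is nothing to install or clean up); the table and values below are invented for the example:

```python
# Minimal SQL session using Python's built-in sqlite3 module.
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE samples (id INTEGER PRIMARY KEY, name TEXT, ph REAL)")
con.executemany("INSERT INTO samples (name, ph) VALUES (?, ?)",
                [("A1", 7.1), ("A2", 6.8), ("A3", 7.4)])

# Querying: match rows against a condition.
rows = con.execute("SELECT name FROM samples WHERE ph > 7.0 ORDER BY name").fetchall()
print(rows)   # [('A1',), ('A3',)]

# A bulk update of the kind mentioned above.
con.execute("UPDATE samples SET ph = ph + 0.1 WHERE name = 'A2'")
```

The same `CREATE`/`INSERT`/`SELECT`/`UPDATE` statements work, with minor dialect differences, on PostgreSQL, MySQL, and the other relational databases listed above.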
There are also non-relational (NoSQL) databases, such as MongoDB, Cassandra, Redis, and Elasticsearch, which trade the relational schema for other data models (documents, key-value pairs, search indexes).
Security
Security in Data Management is an important topic, as the data is often sensitive and valuable. It is important to have a basic knowledge of security principles and practices, and to apply them to your Data Management Platform. This includes:

- Keeping the software up-to-date with security patches.
- Using strong passwords. It is better to have strong passwords that do not need to be changed regularly: forced regular changes can lead to bad practices such as writing passwords down or choosing weak ones.
- Limiting access to the data and the application to only those who need it.
- Where appropriate, using encryption for data at rest and in transit.
- Regularly backing up the data and testing the restore process.
- Monitoring the system for suspicious activity and responding to incidents promptly.
- Ensuring that connections are secure, using HTTPS for web applications and SSH for remote access. If the internal network is secure and only the necessary ports and IP addresses can reach the application, HTTP can be acceptable for internal communication. HTTPS needs a proper certificate, which can be obtained for free from Let's Encrypt, a nonprofit certificate authority; tools such as Certbot help with obtaining and renewing these certificates.

It is also important to be aware of the specific security risks associated with the technologies you are using, such as containers, virtual machines, or cloud services, and to take appropriate measures to mitigate those risks.
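As a small, concrete example of the password advice, Python's standard-library `secrets` module generates cryptographically strong random values (unlike the `random` module, which must never be used for secrets); the alphabet and length are just reasonable defaults:

```python
# Generating a strong random password with the stdlib `secrets` module.
import secrets
import string

ALPHABET = string.ascii_letters + string.digits + string.punctuation

def generate_password(length: int = 24) -> str:
    """Return a random password drawn from a large alphabet."""
    return "".join(secrets.choice(ALPHABET) for _ in range(length))

print(len(generate_password()))   # 24
```

For accounts managed by humans, a password manager generating and storing such passwords is the practical way to follow this advice.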
We will explore these topics in more details in the Backups and Security page.
Going to an assembly
An assembly is when several Data Management Platforms are interconnected to provide a more complete solution. For instance, a Data Management Platform for a research group might need to be connected to a storage solution, a computing cluster, a data visualisation tool, and a data analysis tool. A basic knowledge of the different components of the assembly, and of how they work together, is useful for troubleshooting and for operations that span several components.
The components are generally connected via APIs, either at a programmatic level (for instance a Python API) or at a higher level (for instance a REST API). Knowing how to use these APIs helps with troubleshooting and with specific operations such as bulk updates or backup and restore. It is also important to be aware of the security implications of exposing APIs, and to take appropriate measures to secure them.
The different components of the assembly might have different requirements in terms of scalability, security, and maintenance, and it is important to take these into account when designing the assembly. For instance, if one component needs to be highly available, it might be a good idea to use a load balancer and multiple instances of that component, while if another component is only used for occasional tasks, running it on a single machine might be sufficient. There can also be strong limitations, for instance if one application stores sensitive data encrypted and shares it with another application that cannot work with encrypted data. In that case, the data might need to be stripped of its sensitive parts before being shared, and/or anonymised or pseudonymised. You can read more in the Data Anonymisation and Pseudonymisation page.
Automate the setup
When you need to repeat a setup, or to set up a complex application, it is generally better to automate the setup rather than doing it by hand. This can be done with tools such as Puppet, Ansible, and Terraform, which let you describe the desired state of your system in a declarative way and then apply that state to your machines.
Puppet, Ansible & Terraform
Their official online documentation is good and complete. Online tutorials should be enough as a complement; books are still recommended for a quicker start.
When you have more than a handful of servers or services to manage, doing everything by hand stops being practical. Tools like Puppet, Ansible, and Terraform exist to automate that work.
Puppet was one of the first widely adopted tools for managing server configuration at scale, and many large organizations still use it. You describe the desired state of your systems in Puppet’s own language, and a Puppet agent running on each machine enforces that state continuously. It works well for large, stable fleets but it requires installing and maintaining that agent on every machine you manage, setting up a central Puppet server, and learning a configuration language that has a steep initial curve.
Ansible emerged as a simpler alternative for configuration management. It does much of what Puppet does (installing packages, writing config files, managing services), but it works over plain SSH with no agent to install. The desired state of a system is described in playbooks written in YAML, a format most developers already know. The tradeoff is that Ansible does not continuously enforce state the way Puppet does: you run it when you want changes applied, rather than having it run automatically in the background. For smaller setups or teams without dedicated ops staff, this is often the right balance.
Example playbook:

```yaml
- hosts: webservers
  become: yes
  tasks:
    - name: Install nginx
      apt:
        name: nginx
        state: present
```

Running this playbook installs nginx on all hosts in the webservers group.
Ansible can be used to set up a bare-metal installation, but also to complement a Docker Compose setup, typically for configuring the application.
Terraform solves a different problem: not configuring machines, but creating the infrastructure in the first place. It talks to cloud providers (or your local virtualisation layer) and sets up virtual machines, networks, storage, DNS entries, and so on. You describe what you want, and Terraform tracks what already exists so it only makes the changes needed. Puppet and Ansible assume the machines are already there; Terraform is what creates them.
Terraform is generally supported by most Virtualisation platforms: OpenStack, Proxmox, Kubernetes, …
However, in most cases Terraform will be used by your SysAdmin. A minimal example:

```hcl
resource "aws_instance" "example" {
  ami           = "ami-123456"
  instance_type = "t2.micro"
}
```
Puppet and Ansible are open source, while Terraform is a HashiCorp product; since its 2023 move to the Business Source License, the open-source fork OpenTofu provides a drop-in alternative.
CI/CD and GitOps
CI/CD is an important aspect of modern online platforms when you develop them. It helps ensure your deployed application is up-to-date (Continuous Deployment) and free of issues (Continuous Integration).
CI/CD stands for Continuous Integration / Continuous Delivery (or Deployment). The idea is simple: every time you push code or config changes to a repository, an automated pipeline picks them up, tests them, and - if everything looks good - deploys them. No more manually copying files to servers or running scripts by hand.
GitOps takes this further by making Git the single source of truth for your entire system state. Your infrastructure configuration lives in a repo, and a tool running in your cluster constantly watches that repo. When you push a change, the tool detects it and applies it automatically. If someone makes a manual change on the server that drifts from what’s in Git, the tool corrects it. Your Git history becomes a full audit trail of every change ever made.
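The reconciliation loop at the heart of a GitOps tool can be sketched in Python. This is purely conceptual, with illustrative resource names and dict shapes, not any tool's actual API:

```python
# Conceptual GitOps reconciliation: the desired state comes from Git, the
# live state from the cluster, and the controller computes the actions
# needed to converge one towards the other.

def reconcile(desired: dict, live: dict) -> list:
    """Return the actions needed to make `live` match `desired`."""
    actions = []
    for name, spec in desired.items():
        if name not in live:
            actions.append(f"create {name}")
        elif live[name] != spec:
            actions.append(f"update {name}")   # also corrects manual drift
    for name in live:
        if name not in desired:
            actions.append(f"delete {name}")
    return sorted(actions)

desired = {"web": {"replicas": 3}, "db": {"replicas": 1}}   # from Git
live = {"web": {"replicas": 2}, "cache": {"replicas": 1}}   # from the cluster
print(reconcile(desired, live))   # ['create db', 'delete cache', 'update web']
```

Running this loop continuously is what turns Git into the single source of truth: any drift, whether a missing resource or a manual tweak, shows up as an action on the next pass.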
Flux CD is a popular GitOps tool for Kubernetes environments. It runs inside your cluster and syncs it to one or more Git repositories, handling updates to apps and infrastructure alike.
For CI/CD pipelines, the most common options are:
- GitHub Actions: built into GitHub, easy to get started with, large ecosystem of pre-built actions
- GitLab CI: similar, built into GitLab
- Forgejo/Gitea Actions: self-hostable alternative compatible with GitHub Actions syntax
- Jenkins: older, very flexible, but more complex to set up
A minimal GitHub Actions example
Here’s a simple workflow that automatically deploys your Quarto site to GitHub Pages whenever you push to main:
```yaml
# .github/workflows/deploy.yml
name: Deploy site
on:
  push:
    branches: [main]
jobs:
  build-and-deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - name: Set up Quarto
        uses: quarto-dev/quarto-actions/setup@v2
      - name: Render site
        run: quarto render
      - name: Deploy to GitHub Pages
        uses: peaceiris/actions-gh-pages@v3
        with:
          github_token: ${{ secrets.GITHUB_TOKEN }}
          publish_dir: ./_site
```

This file lives in your repo at `.github/workflows/deploy.yml`. GitHub picks it up automatically, no extra configuration needed. Every push to main triggers a fresh build and deploy.
Many choices: tools, solutions and platforms
Aside from the obvious choices, which are generally for the most complex problems (databases, orchestration, virtual machines, LDAP) or accepted as de facto standards (Docker, Kubernetes, S3), there are many other tools, solutions, and platforms:

- tools may help with managing some platforms, like the full ecosystem around Kubernetes and containers (visualising, managing, ...), or do something independent (analysing, transforming, ...),
- solutions and platforms may be alternatives to the most well-known ones, for instance a simpler orchestration platform than Kubernetes, or an already set-up, partially managed cluster for Proxmox or Kubernetes.

They can all bring benefits but can also cause problems:

- solutions such as managed clusters are most of the time commercial. There might be a free tier, but with limitations: using such a solution might lock you in, and a newly needed functionality might fall outside the free tier,
- alternative platforms might also be commercial, with the same issues as above. An open-source platform might stop being maintained if there is not enough support behind it. Even without that risk, a platform that is not widely used can have more bugs (simply because nobody has found them yet) or lack important functionalities (because nobody asked for them). Note that if a platform is a very good match for your institution or consortium and you can commit to it long term, it can be worth adopting it and joining its development community,
- tools need to be carefully checked, especially if they are to be installed locally. They can be a great help (such as k9s for visualising a Kubernetes cluster), so they should not be ignored.

But one of the big issues these choices create is the choosing itself. Looking for a tool for a specific problem might return a discussion where several are debated; searching for one of those might then surface another discussion proposing a "better" alternative. Here it is beneficial to prefer highly popular items: a presence on Wikipedia, many stars on GitHub, a known source (a big company, an open-source foundation such as Apache). Keep the less popular items for specific needs, and handle them with care.
The use of AI and “Vibe coding” will probably make these problems even worse.
Some companies, notably Google, create whole new languages for some new products and/or frameworks. That is a fundamental issue, as consolidation is a key aspect of security: a new language might have a critical flaw that impacts every product written with it.
The new "cool kid" might also be quickly adopted by some developers, and the resulting applications have a great chance of being fragile, both from the lack of experience and from the immaturity of the language or framework. The benefit must therefore be carefully considered before adopting such a language, product, or application.
Depending on the usage, it might still be beneficial to adopt a brand new product, if:
- there is no security risk, for instance it only performs some transformation on public data,
- it cannot break the Data Management Platform: a database, for instance, is critical for most platforms and should probably always be one of the well-known ones (Postgres, MySQL, MariaDB, Oracle, SQLite, Ingres…),
- the gain is major: for example, the new product is significantly faster, very simple to use for a usually complex task, or offers a new functionality with no equivalent.