De.KCD - Data Management Platform on the Cloud, step by step
Process and documentation on adapting a docker-based Data Management platform, Seek4Science, to Kubernetes, and using it in the cloud.
Goals and scope
This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while considering the benefits and costs: how much effort, how complex, the security risks, what is gained. It does not replace a detailed explanation of each topic, but intends to save time by (1) helping you decide which solution to adopt, (2) giving you enough understanding of each solution to navigate efficiently through its documentation, (3) giving working and explained solutions that can be used as they are or as a base for your own solution.
It goes from a simple local setup to a full Kubernetes-based set-up with distributed data, with some side documentation on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging, and considerations for connecting applications in the cloud.
We try to give a clear view of the costs and benefits of each solution so it is easy to get a rough idea of which solution is best. This is quickly summarized and emphasized in each section.
We will also compile advice and tips, and we welcome all contributions (and corrections).
Finally, it is targeted mostly at Research Projects and/or Institutions, so it focuses on the aspects particular to these cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry.
How to use
Each section assumes that you know the preceding ones. If you already know a section, feel free to jump to the next one.
When to…
When to go full cloud, or with a containerised solution, for your data management platforms. We provide below a very simplified answer and we are working on a detailed decision tree.
Quick introduction to Linux/Unix
Online documentation and tutorials are enough for the basics. A book is strongly recommended for advanced topics.
Learning difficulty ranges from easy (the basics) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate, as you should have some knowledge about security concerns.
- Check external introduction
- Check Galaxy/NFDI training for deeper knowledge
For most set-ups, Linux is the operating system of choice. Due to security concerns, a minimal knowledge of it is probably a must in all cases, from bare-metal setup to cloud-based installation, though the useful subset will change.
We list below the commands you should know to survive, and their most useful options (note that some of these commands now run on Windows using the Terminal shell), followed by the important folders, user and group access on Unix, private and public keys, and SSL/SSH. This part is more of a checklist of things you should know for the following topics, and we recommend that you learn them before continuing. Wikipedia is a good starting point and there are many good, easy-to-find tutorials online. Books are needed only for a deep understanding, but are also strongly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternatives, but all Unix systems are very similar in their core usage and structure. The package managers, however, differ a lot, including between Linux distributions: Debian- and Ubuntu-based distributions use dpkg and apt, while RedHat-based distributions use rpm with yum or dnf, to list the main ones.
Commands that should be known
ls ls -al -> list the content of a directory
cd -> go to a directory (cd .. to go to the parent folder, cd / for the root folder, cd ~ for the user home folder)
ps ps -edf ps -aux -> list the running processes and their owner
top -> show the current running processes and their memory/cpu usage (CTRL-C to end)
cp/mv/mkdir
more, less, cat
tail, tail -f
vi/vim/emacs/nano -> most of the time we work on servers/containers with a text terminal, so being able to edit a file is often needed. Vi, nano and Emacs are powerful text-based editors that are present on most Linux distributions, though the lightest distributions used for container images might only ship the minimum (such as vi only, instead of the extended vim or emacs). Even if vi might seem very hard to use and strange at first glance, it is quick to learn and very convenient. Emacs has a very different approach but is also much more powerful than vi/vim, which might be useful for more difficult work (if so needed).
man -> show the manual for a command. man should be an automatic reflex for any less-known usage. It is absolutely normal to forget how an option or command works if it is not used very often; man is there for such cases. If you forget a command name, it is a good idea to keep a Linux cheat sheet around: a quick internet search should find many good one-page PDFs.
ln -> create a link (symbolic with ln -s), and know the difference between:
- a hard link: another name for the same file (same data on disk); the content remains as long as at least one hard link to it exists, and it shares the permissions of the target file
- a symbolic link: a separate file pointing to a path; it exists independently of the linked file, will be left dangling if the linked file is removed or moved, and has its own permissions
For setting up an application, we generally use symbolic links, mostly as a way to manage the permissions independently of the target file.
chown/chmod -> change the owner of a file or folder, change the permissions of a file or folder; be familiar with things like rwxrwxrwx, r-x------, 755, chmod o+r filename, chown -R username foldername
grep -> search for the presence of a string in files, often used combined with another command with a |
mount
pwd/whoami
eval
Ideally the pipe (|) should be well understood, as well as the I/O redirections (>, >>, <); a combined example is given at the end of this list.
curl is also good to know.
free/df/du -> know how to check available resources (memory and disk space)
command &, CTRL-Z + bg/fg -> running a command in the background
kill command -> signal communication, SIGTERM (kill process_id), SIGKILL (kill -9 process_id)
? Which other
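A minimal combined example of several of the commands above (the user name and log path are placeholders to adapt to your system):

```bash
# List processes belonging to a given user ("alice" is a placeholder)
ps -edf | grep alice

# Extract the lines containing "error" from a log file and save them
# to the home folder (the log path is only an example)
grep -i error /var/log/syslog > ~/errors.log

# Follow the same log file live (CTRL-C to stop)
tail -f /var/log/syslog

# Check available memory and disk space
free -h
df -h
```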
Folders you should know
/usr /usr/bin /var /var/log /etc /root /bin /home /mnt /opt
Online resources
The Unix command line is well explained here. It is probably useful to know about the main principles of Unix, and a good course is available here; Wikipedia has a good overview of the Unix filesystem and its layout.
DeNBI Unix course -> TBD, adapt using a light Linux image (or public online VM)
What makes containers possible: cgroups
One more advanced element which is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to limit and track the resource usage of groups of processes; combined with kernel namespaces, they isolate processes from the rest of the system and give them a “sub” filesystem isolated from the host filesystem.
It is important to be aware of these to understand the difference between a container and a virtual machine, and to be aware that the security risk is higher with a container: bypassing the isolation would give direct access to the host filesystem. And this is not only a matter of exploits. As containers are part of the host, they can use a folder outside their isolated space (a host volume). If not properly set up, this could expose a sensitive part of the host to an intruder. Virtual machines, on the other hand, save their data within the virtual disk, which is part of the virtual machine; there is no possible access to the host system short of an actual exploit. So for a quick set-up of an exposed application, a virtual machine might be a better solution.
A long-term set-up should take proper care of security in all cases, and then there should be no benefit in using a virtual machine, aside from specific cybersecurity needs (like a honeypot).
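As a minimal sketch of the point above (the paths and image are only examples): a container can be given access to part of the host filesystem through a host volume, which is convenient but must be set up with care.

```bash
# Start a container with read-only access to a host folder
# (/srv/data and the alpine image are just examples)
docker run --rm -it -v /srv/data:/data:ro alpine sh

# Inside the container, /data is the host folder /srv/data:
# anything readable there is visible to the container.
# Mounting a sensitive folder (or mounting it read-write) widens
# the impact of a compromised container accordingly.
```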
Bare-Metal setup and Virtual Machines
Vocabulary
“Bare-Metal” is a bit of a misnomer, as it originally means using a computer without an operating system, directly on the hardware (so using a low-level language such as ASM or C). But here a bare-metal setup means a direct installation on a machine with its operating system.
Note that an actual bare-metal setup could still be a viable option for specialised tasks: if it needs to be embedded (for a cheaper/more efficient solution on a specialised piece of hardware), or if it needs to be highly performant, though in the latter case it might be a better option to move the processing to the GPU (which is also a kind of bare-metal implementation). Choosing to do so is out of scope for this document, but it is important to do so knowingly as it could have a high cost (i.e. difficulty of maintenance).
A Virtual Machine is a software solution that provides the functionality of a physical computer. It is possible (and generally needed) to install an operating system on it - most Virtual Machines will provide an easy way to do so - and using it is akin to using an actual computer.
It can be an emulation, “imitating” a computer enough to run its software, generally used for specific purposes such as running deprecated operating systems; Qemu is one of the main ones. An emulation provides a full encapsulation, so is in theory more secure - and exploits are fixable by software updates.
Or it can be a virtualization, where the hosted machine runs directly or semi-directly on the physical hardware through a hypervisor that manages how the resources are used:
- full virtualization (simulating enough of the computer hardware to run a guest OS), which needs significantly more resources than an actual physical computer, while providing a full encapsulation so, in theory, more security - and exploits are fixable by software updates;
- hardware-assisted virtualization, where the hardware helps with the virtualization. Exploits using the hardware support might be difficult or impossible to fix, though they are exceptional;
- OS-level virtualization, where the operating system shares its resources with the guests. Docker and other container engines use this kind of virtualization. The reason containers are called containers and not virtual machines comes from the usage: instead of creating a virtual machine and setting it up with an operating system and subsequent pieces of software, they rely on fixed images that contain a bare-minimal operating system (ideally) and layers of software to support the desired container. Once running, they are effectively a kind of virtual machine.
A main protection against threats is being up to date: a recent or LTS (Long-Term Support, i.e. supported for a long time) operating system on the physical computer, Virtual Machine software, and virtualization software (which might have direct support in the CPU).
Containers - Overview of Docker usage
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy, assuming you have some knowledge of Linux.
A clear presentation of Docker is available here and the official documentation is good. The section you should consult is “Open Source”, so about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which also provides a single-node Kubernetes and is the simplest way to use and test Docker and Kubernetes on Windows.
While Docker is the de facto standard, there are other container engines, such as Podman, containerd or CRI-O, and a standard, the Open Container Initiative (OCI), that Podman, containerd and CRI-O follow and Docker almost follows (new images should be OCI compliant, old ones might not), so Docker images might need some adaptation to run with Podman and others. Docker offers an OCI exporter for image builds.
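As a minimal sketch (the output file name is a placeholder), the buildx/BuildKit OCI exporter can produce an OCI archive from a build instead of loading the image into the local Docker store:

```bash
# Build an image and export it as an OCI archive
# (the destination file name is a placeholder)
docker buildx build -o type=oci,dest=my-image-oci.tar .
```

Other OCI-compatible tools (Podman, skopeo, …) can then import such an archive.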
Vocabulary
Useful things to know
A docker image will be started (once used in a container) using an Entrypoint and/or a Cmd. These can be listed by using docker inspect <image id>, or individually by using docker inspect -f '{{.Config.Entrypoint}}' <image id> and docker inspect -f '{{.Config.Cmd}}' <image id>. This is useful in case something goes wrong inside a container, to know how the container is supposed to start. To know more about Cmd and Entrypoint, the official Docker documentation offers a great overview.
Changing something that is part of the image is possible when running the container, but it will never be persistent and is generally a bad idea. One acceptable case is when debugging an application, provided the debugged code is also changed (in the code base) as soon as the issue is found.
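A minimal sketch of such a temporary change (container name and file path are placeholders); anything edited this way disappears as soon as the container is recreated from the image:

```bash
# Open a shell in a running container (the name is a placeholder)
docker exec -it my-app sh

# Inside the container: edit a configuration file to test a fix
# (the path is a placeholder); the change lives only in the container layer
vi /opt/app/config.yml

# Removing and recreating the container discards the change
docker rm -f my-app
```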
How to build a Dockerfile
The official online documentation is good and complete. Online tutorials or books are recommended for an easier approach.
Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.
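As a minimal, hedged sketch (base image, file names, port and start script are assumptions to adapt to your application), a Dockerfile can be written and built as follows:

```bash
# Write a minimal Dockerfile (all names here are placeholders)
cat > Dockerfile <<'EOF'
FROM debian:stable-slim
# Install only what the application needs, then clean the package cache
RUN apt-get update && apt-get install -y --no-install-recommends curl \
    && rm -rf /var/lib/apt/lists/*
# Copy the application into the image and define how it starts
COPY app/ /opt/app/
WORKDIR /opt/app
EXPOSE 8080
CMD ["./start.sh"]
EOF

# Build and run the image locally
docker build -t my-dmp:latest .
docker run --rm -p 8080:8080 my-dmp:latest
```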
Moving to Docker compose
The official online documentation is good and complete. Online tutorials should be enough as a complement.
Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty lies more in the different components of the setup (for instance an image with a database, linked with an authentication system).
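A minimal, hedged sketch of a compose setup (service names, images, port and credentials are placeholders, not the actual Seek4Science configuration):

```bash
# Write a minimal docker-compose.yml with a web application and its database
cat > docker-compose.yml <<'EOF'
services:
  web:
    image: my-dmp:latest          # placeholder image
    ports:
      - "8080:8080"
    depends_on:
      - db
    environment:
      DATABASE_URL: postgres://dmp:secret@db:5432/dmp
  db:
    image: postgres:16
    environment:
      POSTGRES_USER: dmp
      POSTGRES_PASSWORD: secret   # use a proper secret in any real setup
      POSTGRES_DB: dmp
    volumes:
      - db-data:/var/lib/postgresql/data
volumes:
  db-data:
EOF

# Start both containers in the background and follow their logs
docker compose up -d
docker compose logs -f
```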
Advantages
Things to take into account
Adding parameters
The different types of volumes
Namespaces and namespaces collisions
Namespacing is done by using the name of the folder containing the docker-compose file as the project name. Using the same folder name, even under a different parent folder, will use the same namespace.
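A minimal sketch of how to avoid such collisions by setting the project name explicitly (the project names are placeholders):

```bash
# By default the project name is the folder name, so two checkouts in
# folders both named "dmp" would share the same namespace.
# The -p/--project-name flag (or COMPOSE_PROJECT_NAME) overrides it:
docker compose -p dmp-staging up -d
docker compose -p dmp-production up -d

# Each project gets its own containers, networks and volumes
docker compose -p dmp-staging ps
```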
From Docker compose to Kubernetes
Kubernetes is composed of many elements and has its own vocabulary. Each element and the way they work together are not complex, but grasping a minimal working set of Kubernetes will take some effort. We recommend starting with an online tutorial, interactive or not (links on a separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running.
Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than setting up a production cluster. As software developers, we recommend that you set up a cluster only as a testbed and rely on sysadmins for a production cluster.
Kubernetes is not more complex than docker compose, but it is much more than docker compose, with many elements. It is an orchestration engine: where docker compose can ask for containers to be restarted when they stop, on the same machine, Kubernetes can choose on which machine the containers will run, duplicate them, kill them if they seem unhealthy and create healthy ones, create services out of these containers so they can be used by other clients (which could be other containers within Kubernetes), and create network access to these containers without knowing where they run.
For the simplest applications it can be complete overkill; but the more complex your application is (in terms of size and number of elements), the more there is to gain from Kubernetes. If you need an application that runs 24/7, with transparent updates, and that can scale from a hundred clients to several thousands, Kubernetes will make things much easier, and in a well-thought-out way.
But these possibilities come with some difficulty to grasp, especially if you are not a full sysadmin, and this documentation is made by and aimed at non-sysadmin persons.
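A minimal sketch of this orchestration from the command line (the image and names are placeholders; a real setup would use declarative YAML manifests instead):

```bash
# Create a deployment of one container, then let Kubernetes replicate it
kubectl create deployment dmp-web --image=my-dmp:latest
kubectl scale deployment dmp-web --replicas=3

# Expose the replicas behind a single service, reachable by other pods
kubectl expose deployment dmp-web --port=80 --target-port=8080

# Kubernetes decides where the pods run and replaces them if they fail
kubectl get pods -o wide
```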
Quick overview
-> Kubernetes section
Other solutions
Scalability
Databases
Security
Going to an assembly
Automatize the setup
Ansible & Terraform -> short introduction, links to documentation/
CI/CD/GitOps (Flux CD & others) -> also short introduction and links. Maybe a short tutorial for GitHub