De.KCD - Data Management Platform on the Cloud, step by step

Process and documentation on adapting a docker-based Data Management platform, Seek4Science, to Kubernetes, and using it in the cloud.

Goals and scope

This documentation is intended as a quickstart for setting up and managing one or several Data Management platforms, while considering the benefits and costs: how much effort, how complex, the security risks, what is gain. It does not replace a detailed explanation on each topics, but intend on saving time in (1) deciding which solution to adopt, (2) understand enough of each solution to navigate efficiently through their documentation, (3) give working and explained solutions that can be used as it or as base for your own solution.

It goes from a simple local setup to a full Kubernetes based set-up with distributed data, with some side documentations on Authentication and Authorization, setting up a Central Identity Service, monitoring and logging and considerations for connecting applications in the cloud.

We try to give a clear view on the cost and benefit of each solution so it is easy to have a rough idea what solution is the best. It is quickly summarized and emphasized at each section.

We will also compile advices and tips, and we welcome all contributions (and corrections).

Finally it is targeted mostly at Research Projects and/or Institutions so focus on the particular aspects in this cases. By Data Management Platforms we mean an online application with a data repository (database or other), such as those listed in our Data Management Platforms registry

How to use

Each section assumes that you know the precedents. If you already know one section, feel free to jump to the next one.

When to…

When to go full cloud, or with a containerised solution for your datamanagement platforms. We provide below a very simplified answer and we are working on a detailed decision tree

Quick introduction to Linux/Unix

Recommended learning path

Online documentations and tutorials are enough for the basic. A book is strongly recommended for advanced topics.

Learning difficulty from easy (the basic) to very hard (how it all works). For dealing with a cloud installation it is probably intermediate as you should have some knowledge about security concerns.

  • Check external introduction
  • Check Galaxy/NFDI training for deeper knowledge

For most set-up, Linux is the operating system of choice. Due to security concern, a minimal knowledge of it is probably a must in all cases, from bare-metal setup to cloud-based installation, though the useful set will change.

There is a good Linux tutorial on the Carpentry.

??? We list below the commands you should know to survive, and their eventual options (note that some of these command now run on Windows using the Terminal shell), following by the important folder, user and group access on Unix, private and public keys and ssl/ssh. This part is more a check list of things you should know for the following topics, and we recommend that you learn those before continuing. Wikipedia is a good starting point and there are many easy to find good tutorials online. Books are needed only for a deep understanding, but are also stronly recommended if you use Linux a lot and deal with advanced topics. Note that Linux is not the only Unix, OpenBSD and FreeBSD being good alternative, but all Unix are very similar for their core usage and structure. The package managers will differ a lot, but also differ on Linux between Debian-based Linux (dpkg), RedHat-based Linux (rpm or yum) and Ubuntu-based Linux (apt-get), to list the main ones.

ls ls -al -> list the content of a directory

cd -> go to a directory (cd .. to go to the parent folder, cd / for the root folder, cd ~ for the user home folder)

ps ps -edf ps -aux -> list the running processes and their owner

top -> show the current running processes and their memory/cpu usage (CTRL-C to end)

more, less, cat

vi/vim/emacs -> most of the time we work on servers/containers with a text terminal. Being able to edit a file is often needed. Vi or Emacs are powerful text-based editor that are present on most linux distribution, though the lightest linux distribution for container images might have the minimum (such as vi only, instead or the extended vim or emacs). Even if vi might seem very hard to use and strange at first glance, it is quick to learn it and very convenient. Emacs has a very different approach but is also much more powerful than vi/vim, which might be useful for more difficult work (if needed so).

man -> show the manual for a command. man should be an automatic reflex for any less known usage. It is absolutely normal to forget how such option or command work if they are not used very often. man is there for such cases. If you forget a command name, it is a good idea to keep a linux cheat sheet around. A quick internet search should find many good one page pdfs.

ln -> create a symbolic link and know the difference between: * a hard link: direct link to the file, will be removed if the file is removed, get its permissions from the target file * and a symbolic link: exist independently of the linked file and will stay behind of the linked file is removed or moved, has its own permissions

For setting-up an application, we generally use symbolic links, mostly as a way to change the permissions independently of the target file.

chown/chmod -> change the owner of a file or folder, change the permissions of a file or a folder, be familiar with things like rwxrwxrwx, r-x——-, 755, chmod o+r filename, chmod -R username filename

grep ->

Ideally the pipe (|) should be well understood

? Which other

What makes the containers possible: cgroups

One more advanced element which is important to understand in the container context is cgroups. Cgroups (short for control groups) are a Linux kernel feature to isolate processes: limits the resource usage, track their usage, give them a “sub” filesystem isolated from the host file system.

It is important to be aware of these for understanding the difference between container and virtual machine, and be aware that the security risk is higher with a container: bypassing the isolation would allow a direct access to the host filesystem. And it is not only if an exploit is found. As the container are a part of the host, they can use a folder out of their isolated space (host volume). If not properly set-up, it could open a sensitive part of the host to an intruder. On the other hand, virtual machine save their data within the virtual disk, part of the virtual machine. There is no possible access to the host system aside a proper exploit. So for a quick set-up of an exposed application, a virtual machine might be a better solution.

A long term set-up should take a proper care of security in all case, and in this case there should be not benefit in using a virtual machine, aside of cybersecurity needs (like a honeypot).

Overview of Docker usage

Recommended learning path

The official Online documentations is good and complete. Online tutorials should be enough as a complement.

Learning difficulty is easy, assuming you have some knowledge of Linux.

A clear presentation of Docker is available here and the official documentation is good. The section you should consult is “Open Source”, so about Docker Engine, Docker Build and Docker Compose. If running Docker on Windows, you might want to use Docker Desktop, which is also proposing a single node Kubernetes and is the simples way to use and test Docker and Kubernetes on Windows.

If Docker is the de facto standard, there are other containers engines, such as PodMan, containerd, or cri-o and a standard, Open Container Initiative (OCI) that PodMan, Containerd and Cri-o follow and Docker almost follows (new image should be OCI compliant, old ones might not), so Docker images might need some adaptations to run with PodMan and others. Docker offer an OCI exporter for the image builts.

How to build a Dockerfile

Recommended learning path

The official Online documentations is good and complete. Online tutorials or books are recommended for an easier approach.

Learning difficulty is medium, assuming you have some knowledge of Linux. There is nothing really difficult but there are a lot of aspects to take into account.

Moving to Docker compose

Recommended learning path

The official Online documentations is good and complete. Online tutorials should be enough as a complement.

Learning difficulty is easy/medium, assuming you have some knowledge of Linux. The difficulty is more about the different components of the setup (for instance an image with a database, linked with an Authentication system).

Advantages

Things to take into account

Adding parameters

The different types of volumes

Namespaces and namespaces collisions

Namespacing is by using the name of the containing folder of the docker-compose file. Using the same name, even in different parent folder, will use the same namespace.

From Docker compose to Kubernetes

Recommended learning path

Kubernetes is composed of many elements and has its own vocabulary. Each elements and the way the work together are not complex, but to grasp a minimal working set of Kubernetes will take some effort. We recommend to start with an online tutorial, interactive or not (links to separate resource page), and ideally a book, such as Kubernetes in Action, Production Kubernetes, Cloud Native DevOps with Kubernetes or Kubernetes: Up and Running

Learning difficulty is medium/hard, assuming you have some knowledge of containers. Using Kubernetes is still much easier than to set-up a production cluster. As software developers, we recommend that you set-up a cluster only as a testbed and rely on sysadmins for a production cluster.

Kubernetes is not more complex than docker compose, but it is much more than docker compose, with many elements. It is an orchestration engine: where docker compose can ask for these containers to be restarted when stopped, on the same machine, Kubernetes can choose on which machine the containers will run, can duplicate them, can kill them if they seem unhealthy to create some healthy ones, can create services out of these containers so they can be used by other clients, which could be other containers within Kubernets, can create network access of these containers, without knowing where they run.

For simplest application it could be a full overkill, then the more complex your application is (in term of size and elements), the more to gain from Kubernetes. If you need an application that needs to run 24/7, with transparent updates and that can scale from hundred of clients to several thousands, Kubernetes will make things much easier, and in a well thought way.

But this possibilities comes with difficulties to grasp it, especially if you are not a full sys-admin, and this documentation is made by and aimed at non-sys-admin persons.

Quick overview

-> Kubernetes section

Other solutions

Scalability

Databases

Security

Going to an assembly

Automatize the setup

Ansible & Terraform -> short introduction, links to documentation/

CI/CD/GitOps (Flux CD & others) -> also short introduction and links. Maybe a short tutorial for GitHub

This is a Quarto website.

To learn more about Quarto websites visit https://quarto.org/docs/websites.