Data Storage

Storing data is a central part of Data Management.

1 Volumes: Keeping Data Alive

From a single container to an orchestrated system, data that needs to persist is saved in volumes. "Volume" is an abstract name covering several storage solutions, each better suited to some cases than others.

Some of the more advanced solutions are easy to use but hard to set up and maintain, so their availability often depends on what the Cloud administrator supports.
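
From the application's point of view, a volume is simply a directory whose content outlives the container. The sketch below illustrates this in Python. The `DATA_DIR` environment variable and the `/data` mount point are assumptions chosen for illustration, not a convention of any particular platform.

```python
import os
import tempfile
from pathlib import Path


def data_dir() -> Path:
    """Return the directory where persistent data should be written.

    DATA_DIR is a hypothetical variable; /data is an assumed mount point.
    """
    return Path(os.environ.get("DATA_DIR", "/data"))


def save_record(name: str, content: str) -> Path:
    """Write a small text file inside the volume and return its path."""
    target = data_dir() / name
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(content, encoding="utf-8")
    return target


if __name__ == "__main__":
    # A temporary directory stands in for a mounted volume so the sketch
    # runs anywhere; in a container the platform would provide the mount.
    with tempfile.TemporaryDirectory() as tmp:
        os.environ["DATA_DIR"] = tmp
        path = save_record("results/run1.txt", "42 samples processed\n")
        print(path.read_text(encoding="utf-8"))
```

The code is the same whichever storage type backs the volume; only the mount configured by the platform changes.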

1.1 Types of Volume

Different situations call for different storage types. The table below summarises the most common ones:

Type	What it is	Best for
Local disk	Storage on the same physical machine	Simple, single-machine setups
Network file system (NFS/NAS)	A shared folder hosted on a separate server	Sharing files between several containers
Block storage	Low-level, high-speed disk, similar to an internal hard drive	Databases requiring fast reads/writes
Distributed storage (e.g. Ceph, Longhorn)	Storage spread across several machines for resilience	Production clusters needing high availability
Object storage (e.g. S3, MinIO)	File storage accessed via internet-style requests	Large files, backups, media

Note

While it is possible to use object storage (generally S3-compatible) as a volume, this is generally not recommended, because object storage is designed to be used in a specific way (see section 3).

1.2 Who Decides Which Type to Use?

More advanced storage solutions (distributed storage, cloud block volumes) are powerful but require significant expertise to set up and maintain. In practice, the available options will generally depend on what your Cloud or infrastructure team has already configured. Always check with your platform administrators before choosing a storage type.

flowchart TD
    A[Application / Container] -->|reads & writes| V[Volume]
    V --> L[Local Disk - Simple, fast, no redundancy]
    V --> N[Network Storage - Shared access]
    V --> B[Block Storage - High-speed for databases]
    V --> D[Distributed Storage - Ceph / Longhorn - High availability]

    style A fill:#4A90D9,color:#fff
    style V fill:#F5A623,color:#fff

2 Databases on the Cloud

Most Data Management platforms need a database, generally a relational one such as MySQL, MariaDB or PostgreSQL. Cloud infrastructures are not immediately suited to running databases. The two main options are the following:

Managed service — your IT or the cloud provider (AWS, Azure, GCP…) installs, patches, backs up, and monitors the database for you. You simply connect to it.
Self-hosted — you are responsible for the database: installing it, keeping it running, and making sure it is backed up. This gives more control and is often cheaper at scale, but requires expertise.
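
Whichever option is chosen, the application connects in the same way; only the host and the credentials differ. As a minimal sketch, the Python function below builds a PostgreSQL connection URI from environment variables. The `DB_*` variable names and the example hosts are hypothetical placeholders.

```python
import os
from urllib.parse import quote


def database_url(env=os.environ) -> str:
    """Build a PostgreSQL connection URI from environment variables.

    The DB_* names are hypothetical; use whatever your platform provides.
    """
    # Percent-encode values that may contain characters such as '@' or '/'.
    user = quote(env.get("DB_USER", "app"), safe="")
    password = quote(env.get("DB_PASSWORD", ""), safe="")
    host = env.get("DB_HOST", "localhost")
    port = env.get("DB_PORT", "5432")
    name = quote(env.get("DB_NAME", "app"), safe="")
    return f"postgresql://{user}:{password}@{host}:{port}/{name}"


if __name__ == "__main__":
    # Managed service: the host is an address given by the provider.
    print(database_url({"DB_HOST": "db.example.org", "DB_PASSWORD": "p@ss"}))
    # Self-hosted: same code, different host.
    print(database_url({"DB_HOST": "10.0.0.12", "DB_PASSWORD": "p@ss"}))
```

Keeping the connection details outside the code makes it possible to move from a self-hosted database to a managed one (or back) without changing the application.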

2.1 Managed Cloud Services

The main cloud providers all offer ready-to-use database services — for example AWS RDS, Azure Database, or Google Cloud SQL. These are the simplest option: no servers to manage, automatic backups, and built-in high availability. The trade-off is cost and the fact that your data resides with the provider.

For academic projects, it is advisable to look for a managed solution provided by a consortium or an institution.

Note

A data management storage solution may be available for free within your institution or consortium, or as a general offer (like Zenodo for references). It can be an alternative or a complement that avoids spending too much effort locally.

2.2 Self-Hosted Options

Self-hosting a database means you choose where and how it runs. There are several environments to consider, each with different trade-offs.

2.2.1 Dedicated Server (Physical, VM, or Docker)

The most straightforward approach: install the database directly on a machine. That machine can be a physical server (bare metal), a virtual machine, or a container.

Setup	What it means	Good when
Physical server	Database runs on real hardware you own	Maximum performance, full control
Virtual machine (VM)	Database runs inside a virtualised computer	Easy to snapshot, move, or resize
Docker container	Database runs in a container with a volume	Quick to set up; good for dev/test

This setup is simple to understand but requires manual attention for backups, failover, and upgrades. If nothing else is available, it might be better to set up Proxmox for a similar setup (see section 2.2.3).
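
For the Docker row of the table above, the important detail is the volume: without it, the data disappears together with the container. The sketch below assembles a `docker run` command for the official `postgres` image. The container name, volume name, and password are placeholders, and the command is only printed for review, not executed.

```python
import shlex


def postgres_run_command(name="demo-db", volume="demo-db-data",
                         password="change-me", image="postgres:16"):
    """Return the argv for starting PostgreSQL with a named Docker volume.

    All default values are placeholders chosen for this example.
    """
    return [
        "docker", "run", "--detach",
        "--name", name,
        # The official image reads the superuser password from this variable.
        "--env", f"POSTGRES_PASSWORD={password}",
        # Named volume mounted where postgres:16 keeps its data files.
        "--volume", f"{volume}:/var/lib/postgresql/data",
        # Publish the default PostgreSQL port on the host.
        "--publish", "5432:5432",
        image,
    ]


if __name__ == "__main__":
    # Print the command instead of running it.
    print(shlex.join(postgres_run_command()))
```

Backups, failover, and upgrades are not covered by this command and remain manual tasks.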

2.2.2 On OpenStack

OpenStack is an open-source platform used by many organisations and public cloud providers to manage large pools of computing, storage, and networking resources. If your organisation runs OpenStack, you can:

Launch a VM through OpenStack Nova (its compute service) and install a database inside it — this is the most common pattern.
Use OpenStack Trove, the dedicated Database-as-a-Service component of OpenStack, which automates provisioning, backups, and basic management of databases on top of OpenStack VMs.
Attach Cinder block storage volumes to give the database fast, persistent disk space.

This is a good choice for teams whose infrastructure is already managed through OpenStack, as it keeps everything in the same ecosystem. However, the complexity is high, so operating it is probably best left to your IT team.

2.2.3 On Proxmox

Proxmox VE is a popular open-source virtualisation platform, widely used for on-premises (“on-prem”) private infrastructure. It lets administrators create and manage VMs and containers from a simple web interface — without needing a full cloud platform like OpenStack.

Databases are typically run on Proxmox in one of two ways: inside a VM (full isolation, easy snapshots) or inside an LXC container (lighter than a VM, shares the host kernel). It is then very similar to a physical server setup.

Proxmox, however, supports backups “out of the box” and helps with snapshots, which makes managing your database significantly easier. If the required setup is complex (several applications and databases) and no managed setup is possible (by IT, through the institution or consortium, or paid), Proxmox may be worth trying.

2.2.4 On Kubernetes

Kubernetes (often abbreviated K8s) is a system that automatically manages many containers at once — starting, stopping, and restarting them across a cluster of machines. Running a database on Kubernetes gives you automation and scalability, but it is more complex than a simple VM.

The recommended approach is to use a database operator — a piece of software that knows how to install, configure, back up, and upgrade a specific database inside Kubernetes automatically. Instead of doing all those tasks by hand, you describe what you want and the operator handles the rest.

flowchart LR
    Admin["Administrator (describes desired state)"]
    Operator["Database Operator (watches & acts)"]
    DB1[(Primary Database)]
    DB2[(Replica 1)]
    DB3[(Replica 2)]

    Admin -->|"defines via configuration file"| Operator
    Operator -->|manages| DB1
    Operator -->|manages| DB2
    Operator -->|manages| DB3
    DB1 -->|replicates to| DB2
    DB1 -->|replicates to| DB3

    style Admin fill:#4A90D9,color:#fff
    style Operator fill:#F5A623,color:#fff
    style DB1 fill:#7ED321,color:#fff
    style DB2 fill:#7ED321,color:#fff
    style DB3 fill:#7ED321,color:#fff
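
The desired state that the administrator hands to the operator in the diagram above can be very short. As a sketch, assuming the CloudNativePG operator (presented in the next section) is already installed in the cluster, the Python snippet below builds a minimal `Cluster` description with three instances, matching the primary and two replicas of the diagram. The cluster name and storage size are placeholders, and only a few basic fields are set.

```python
import json


def cnpg_cluster(name: str, instances: int = 3, storage: str = "1Gi") -> dict:
    """Describe a PostgreSQL cluster for the CloudNativePG operator.

    The name and sizes are placeholders chosen for this example.
    """
    return {
        "apiVersion": "postgresql.cnpg.io/v1",
        "kind": "Cluster",
        "metadata": {"name": name},
        "spec": {
            # One primary plus (instances - 1) replicas.
            "instances": instances,
            # Size of the persistent volume requested for each instance.
            "storage": {"size": storage},
        },
    }


if __name__ == "__main__":
    # kubectl accepts JSON as well as YAML, so this output could be saved
    # to a file and applied to a cluster where the operator is installed.
    print(json.dumps(cnpg_cluster("demo-db"), indent=2))
```

Nothing in this description says how to create the replicas or configure replication; that is the part the operator is expected to handle.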

2.2.5 Notable Operators

Two well-regarded open-source database operators are available for Kubernetes environments:

CloudNativePG — specialised for PostgreSQL, one of the most popular open-source relational databases. It handles:

Automatic failover (switching to a backup if the main database fails)
Point-in-time recovery (restoring the database to any past moment)
Rolling upgrades with minimal downtime

KubeBlocks — a more general-purpose operator that supports many different database engines (PostgreSQL, MySQL, Redis, MongoDB, and more) from a single unified interface. Useful when a team needs to operate several different types of databases consistently.

Feature	CloudNativePG	KubeBlocks
Supported databases	PostgreSQL only	Many (PostgreSQL, MySQL, Redis…)
Specialisation	Deep PostgreSQL expertise	Broad multi-engine support
Best for	Teams standardised on PostgreSQL	Teams with mixed database needs
Maturity	Production-ready, CNCF project	Rapidly growing, active community

Key takeaway: Operators dramatically reduce the operational burden of running databases on Kubernetes — but they still require a capable platform team to install and maintain the operator itself.

2.3 Choosing a Self-Hosted Approach

flowchart TD
    Q1{Do you already use a platform?}
    Q1 -->|OpenStack| OS[Deploy a VM via Nova or use Trove DBaaS]
    Q1 -->|Proxmox| PX[Run DB in a VM or LXC container]
    Q1 -->|Kubernetes| K8S[Use a database operator: CloudNativePG, KubeBlocks]
    Q1 -->|None / simple| DS[Dedicated server: Physical, VM, Docker]

    style OS fill:#E8A838,color:#fff
    style PX fill:#E8633A,color:#fff
    style K8S fill:#4A90D9,color:#fff
    style DS fill:#7ED321,color:#fff
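
The decision tree above is small enough to be written as a lookup table. The sketch below is only a restatement of the diagram in Python; the function name and the platform keys are chosen for this example.

```python
# Recommendations copied from the flowchart, keyed by existing platform.
RECOMMENDATIONS = {
    "openstack": "Deploy a VM via Nova or use Trove DBaaS",
    "proxmox": "Run the database in a VM or LXC container",
    "kubernetes": "Use a database operator (CloudNativePG, KubeBlocks)",
}
FALLBACK = "Dedicated server: physical, VM, or Docker"


def recommend(platform=None) -> str:
    """Return the self-hosting approach suggested for a given platform.

    Unknown or missing platforms fall back to the dedicated-server option.
    """
    key = (platform or "").strip().lower()
    return RECOMMENDATIONS.get(key, FALLBACK)


if __name__ == "__main__":
    for name in ("OpenStack", "Proxmox", "Kubernetes", None):
        print(f"{name}: {recommend(name)}")
```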

3 Object Storage

Object storage is a fundamentally different approach from volumes. Rather than behaving like a hard drive that an application reads and writes to directly, object storage works more like a postal service: you send a file (an “object”) to a central repository and get back a unique address to retrieve it later. Applications interact with it through standard web requests (HTTP), not through the filesystem.

Plain language: Think of object storage like Google Drive or Dropbox — you upload files and retrieve them by name, from anywhere, over the internet.

3.1 Key Characteristics

Not mounted as a volume — applications must be adapted to use a specific API (most commonly the S3 API, originally from Amazon).
Virtually unlimited scale — designed to store billions of files of any size without pre-allocating capacity.
Cheap for large volumes — cost per gigabyte is typically much lower than block or distributed storage.
Access from anywhere — any application or service with the right credentials can reach the same data.
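
To show what "adapted to use a specific API" means in practice, the sketch below uploads an object and reads it back by bucket and key instead of opening a file. The `put_object` and `get_object` calls follow the shape used by S3 client libraries such as boto3. The `InMemoryObjectStore` class is a toy stand-in written for this example so that it runs without a server, and the bucket and key names are placeholders.

```python
import io


class InMemoryObjectStore:
    """Toy stand-in that mimics two calls of an S3-style client.

    It only keeps objects in a dictionary; it is not a real object store.
    """

    def __init__(self):
        self._objects = {}

    def put_object(self, Bucket, Key, Body):
        self._objects[(Bucket, Key)] = bytes(Body)
        return {}

    def get_object(self, Bucket, Key):
        # S3 clients return the content as a readable stream under "Body".
        return {"Body": io.BytesIO(self._objects[(Bucket, Key)])}


def archive_report(client, bucket: str, key: str, data: bytes) -> bytes:
    """Upload an object, then download it again by bucket and key."""
    client.put_object(Bucket=bucket, Key=key, Body=data)
    return client.get_object(Bucket=bucket, Key=key)["Body"].read()


if __name__ == "__main__":
    store = InMemoryObjectStore()
    # Keys look like paths, but they are flat names, not directories.
    echoed = archive_report(store, "demo-bucket",
                            "reports/2024/summary.csv", b"id,value\n1,42\n")
    print(echoed.decode())
```

With a real service, `archive_report` could instead receive a client created by an S3 library and pointed at the provider's or the self-hosted endpoint; the function itself would stay the same.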

3.2 Common Use Cases

Use case	Example
Data lake / raw data archive	Storing CSV, Parquet, or JSON files for analysis
Backup & disaster recovery	Database dumps, configuration snapshots
Machine learning datasets	Training data, model artefacts
Log & event archives	Long-term storage of application logs
Media & large files	Images, videos, documents served to users

3.3 Solutions

Cloud-provider managed: AWS S3, Azure Blob Storage, Google Cloud Storage — fully managed, pay-as-you-go, zero maintenance.

Self-hosted: Several open-source solutions can run on your own infrastructure, most of them implementing the S3 API, including:

OpenStack Swift - one of the oldest and most battle-tested open-source object stores, part of the OpenStack project since 2010. It uses its own API (not S3-native, though an S3-compatibility layer exists). Best suited for organisations that already operate an OpenStack infrastructure. It is mature, highly scalable, and actively developed under the OpenInfra Foundation (formerly the OpenStack Foundation).

Garage — a lightweight, modern S3-compatible store written in Rust. Unlike most systems designed for data centres, Garage was built specifically for geo-distributed deployments — nodes can sit in different physical locations connected over ordinary internet connections. It ships as a single binary, is very easy to operate, and received EU research funding through 2025. A good fit for small to medium self-hosted setups.

RustFS - an S3- and Swift-compatible open-source object store, also written in Rust. It supports large data sets through a distributed architecture designed to be scalable and fault-tolerant.

Note

MinIO was long the default recommendation for self-hosted S3-compatible storage. However, through 2025 the company progressively restricted its open-source edition: the web management console was stripped, pre-built Docker images were discontinued, and by early 2026 the community repository was placed in maintenance mode (read-only, no new features, security fixes not guaranteed). MinIO is no longer a viable choice for new deployments.

flowchart LR
    App1[Application A]
    App2[Application B]
    App3[Data Pipeline]

    OS["Object Storage (S3-compatible API)"]
    CLD["Cloud-managed: AWS S3, Azure Blob, GCS"]
    SH["Self-hosted: Swift, Garage, RustFS"]

    App1 -->|"upload / download - via HTTP"| OS
    App2 -->|"upload / download - via HTTP"| OS
    App3 -->|"read datasets - write results"| OS
    OS --> CLD
    OS --> SH

    style OS fill:#7B68EE,color:#fff
    style CLD fill:#4A90D9,color:#fff
    style SH fill:#7ED321,color:#fff

3.4 Object Storage vs. Volumes at a Glance

	Volumes	Object Storage
Access method	Filesystem (read/write like a local disk)	HTTP API (upload/download)
Typical use	Databases, running applications	Files, datasets, backups, media
Scalability	Limited by disk/cluster size	Virtually unlimited
Cost	Higher	Lower
App changes needed?	No	Yes — app must use S3/Swift API

4 Summary

flowchart TD
    DM[Data Management]
    DM --> VS[Volume Storage]
    DM --> OS[Object Storage]
    DM --> DB[Databases]

    VS --> LT[Local / Network - Block / Distributed]

    OS --> SAAS[Cloud-managed - AWS S3, Azure Blob, GCS]
    OS --> SH[Self-hosted - Swift, Garage, RustFS]

    DB --> MS[Managed Cloud Service - AWS RDS, Azure Database...]
    DB --> SELF[Self-hosted]
    SELF --> DS[Dedicated Server - Physical, VM, Docker]
    SELF --> OST[OpenStack - Nova VM, Trove DBaaS]
    SELF --> PX[Proxmox - VM, LXC container]
    SELF --> K8S[Kubernetes with Operator]
    K8S --> CNPG[CloudNativePG - PostgreSQL specialist]
    K8S --> KB[KubeBlocks - Multi-engine]

    style DM fill:#4A90D9,color:#fff
    style OS fill:#7B68EE,color:#fff
    style SELF fill:#F5A623,color:#fff
    style K8S fill:#4A90D9,color:#fff

Choosing the right storage strategy always involves a balance between simplicity, cost, performance, and what your infrastructure team can support. Volumes are the default for running applications and databases; object storage shines for large files, datasets, and backups. When in doubt, start with managed services offered by your cloud provider — and move to self-hosted solutions only when you have a specific reason and the operational capacity to maintain them.