Major Data Management platforms
FAIR - Findable, Accessible, Interoperable, Reusable
The Data Management platforms listed should respect the FAIR principles to some extent. The assessments follow this idea: if a platform offers an extensive API but its documentation is very difficult to find, then it is not findable or accessible, and is rated as such. On the other hand, a platform based on common standards such as RDF or Apache Spark might be interoperable and reusable without needing to document RDF or Apache Spark itself, as their documentation is easily available.
Contributions are welcome
Feel free to contribute by forking the repository or creating an issue: GitHub sources. We filled in the information to the best of our knowledge and after a reasonable amount of research, but that does not prevent us from making mistakes. Once again, we will be glad of any correction.
- Ask for a change: create an issue,
- Add or change something: fork, then do a Pull Request.
The site is written using Quarto and the main syntax is Pandoc, which is mostly like Markdown and is well documented on the Quarto web site. The pages are in pages, with the name corresponding to the title. Each entry follows the same model, so copy-paste is the best way to add a new one. Quarto has extensions for several editors, so a preview is often possible from within the editor. Visual Studio Code has good support.
Typical entry:
* [Name, quick description](https:URL of main web page)
+ Description and/or link to potential Docker image/Docker compose/Kubernetes manifest/…
+ Link to API(s), quick clarification on how well documented it is.
+ Interoperability **NONE/LOW/MEDIUM/HIGH** / No interoperability: explanation on why.
+ **Not/Partly/Mostly/Fully** cross-domain: explanation on why.
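As a rough illustration of the entry model above, the following sketch renders one entry from a handful of fields and rejects rating words that are not in the model. The function `render_entry`, its parameter names, the four-space indentation of the sub-items and the `https://example.org` URL are assumptions made for this example; they are not part of the repository.

```python
# Allowed rating words, taken from the typical entry model.
INTEROP_LEVELS = ("NONE", "LOW", "MEDIUM", "HIGH")
CROSS_DOMAIN_LEVELS = ("Not", "Partly", "Mostly", "Fully")


def render_entry(name, description, url, deployment, api,
                 interop, interop_why, cross_domain, cross_domain_why):
    """Return one catalogue entry as text (hypothetical helper)."""
    if interop not in INTEROP_LEVELS:
        raise ValueError(f"unknown interoperability level: {interop!r}")
    if cross_domain not in CROSS_DOMAIN_LEVELS:
        raise ValueError(f"unknown cross-domain level: {cross_domain!r}")
    return "\n".join([
        f"* [{name}, {description}]({url})",
        # Sub-items are indented by four spaces here (an assumption).
        f"    + {deployment}",
        f"    + {api}",
        f"    + Interoperability **{interop}**: {interop_why}",
        f"    + **{cross_domain}** cross-domain: {cross_domain_why}",
    ])


# Example with placeholder values; the URL is not a real project page.
print(render_entry(
    "Galaxy", "a web-based platform for biomedical research",
    "https://example.org",
    "Already a cloud-based solution, widely used.",
    "Extensive API",
    "HIGH", "uses well-documented JSON.",
    "Mostly", "it can support any kind of workflow.",
))
```

Checking the rating words up front keeps entries comparable across pages, which is the main point of having a shared model.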
Navigational elements are configured in _quarto.yml
Progress
- TODO: Eventually a clear view of the adaptability to cloud architectures.
Process
Top-down approach: first a listing with technical information (online, containers, …), then API support. Finally interoperability (data format of the API), possibility of transfer to other domains and adaptability to cloud. Major platforms are easy-to-find and popular platforms. Drafts should be in the to_sort or to_investigate pages.
The format is kept as simple as possible so that it is easier to modify. We find it more important to encourage contributions, so as to keep the list up to date, than to polish the presentation. This is why we don’t use tables, for instance.
To Work-On
Write about DM approaches that are not tied to dedicated software (for instance semantic web, RDF or Apache Spark).
Better presentation, without making editing more complex
Life Science
Open Source or free
- Galaxy, a web-based platform for biomedical research, allowing cloud-based workflows
- Galaxy is already a cloud-based solution, widely used.
- It is a very powerful tool and as such comes with tutorials, documentation, training events, FAQs and discussions
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: though mostly for life-science applications, it can support any kind of workflow. Tools for other domains can be added.
- BioContainers, “community-driven project to create and manage bioinformatics software containers”
- Omero, an open source imaging platform for microscopy outputs
- Docker Image and Helm chart
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: despite being targeted at microscopy images, it can be of use for any image data management. Omero can be extended using OMERO.scripts, which are written in MATLAB or Python. It also offers bindings in Python, C++, MATLAB and Java.
- Seek
- Docker Compose (one instance on one server!)
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: Seek is highly configurable and could be used as a sharing platform for other domains, without any coding needs.
- Related: Just Enough Results Model (JERM)
- Aruna Object Storage (AOS)
- “Aruna Object Storage (AOS) is a modern distributed storage platform designed to meet the increasing demand for effective data management and storage of scientific data. It is the central storage of the Research Data Commons (RDC) cloud layer and the data foundation for the upper layers. It is a cloud-native, scalable system with an API and a S3-compatible interface.”
- Online but provides a Docker Compose installation
- Extensive API
- Interoperability HIGH: well documented, uses Protocol Buffers and the gRPC framework, and exposes the S3 API.
- Fully cross-domain: can be used for any kind of storage.
- XNAT
- XNAT is an open source imaging informatics platform, mostly for MRI/CT scan images (DICOM & NIfTI). Presentation on usage
- Docker Compose installation, Docker Compose
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- Not cross-domain: it is a highly specialised and complex tool and would not be suited to other applications. In the medical domain, it supports many applications via plugins.
- REDCap - Research Electronic Data Capture
- Free for non-profit institutions but not Open Source, available upon request, with a restrictive licence.
- No containerization.
- Extensive API, but most documentation is for registered members.
- Interoperability unknown but presumably high/medium.
- Probably not cross-domain: software adaptation is limited to built-in software development tools. There is no public documentation.
- ISA format, set of formats and related tools for bioscience experiment
- Standalone tools that could be containerized with a little technical effort.
- Formats and tools are well-documented
- Interoperability high: well-documented workflow, all formats explained.
- Mostly not cross-domain: these are dedicated tools for bioscience. However, everything is open source and could be used as a base for other tools.
- “LifeMonitor is a service to support the sustainability and reusability of published computational workflows”
- A GitHub app for workflows hosted on GitHub, also with a docker-compose installation
- Extensive API, well documented
- Interoperability high: uses well-documented JSON.
- Mostly not cross-domain: it could be used to test workflows from other domains, but how to do so is not obvious.
- Workflow Hub, registry for describing, sharing and publishing scientific computational workflows
- Online, based on Seek
- Extensive API, well documented
- Interoperability high: uses well-documented JSON.
- Fully cross-domain: “WorkflowHub is a domain-agnostic workflow registry designed around FAIR principles.”
- BioModels, a repository of mathematical models of biological and biomedical systems
- Online only.
- Extensive APIs
- Interoperability high: uses well-documented JSON or XML.
- Not cross-domain: by definition
Commercial
- Oracle Clinical, an Electronic Health Record and Data Management platform
- No easily findable public documentation
- Not cross-domain
- MediData, a set of platforms for clinical trials, Electronic Health Record and Data management
- No easily findable public documentation
- Not cross-domain
- Castor, Electronic Health Record
- Extensive API, well documented
- Interoperability high: uses well-documented JSON.
- Not cross-domain
Pharmaceutics
- LabKey, an open-source platform that helps manage biomedical research data, focusing on clinical and laboratory data
- A limited free edition and several paid versions.
- Extensive APIs, well documented
- Interoperability medium/high: APIs are well documented but are programmatic only (i.e. no REST API)
- Moderately cross-domain: while many functions could be used in other domains, it is mostly dedicated to biomedical research.
- CIViC (Clinical Interpretation of Variants in Cancer), an open-access platform that curates and manages clinical cancer variant data. It is used for pharmaceutical oncology research.
- Mostly online, fully Open Source, no containerization.
- Extensive API, well documented
- Interoperability high: uses well-documented JSON.
- Not cross-domain: it is a highly specialised and complex tool and would not be suited to other applications.
Chemistry & Chemical Sciences
- Chemotion, an open-source Electronic Lab Notebook for chemists and its related registry
- Docker & Docker compose installation
- API and OAI-PMH harvesting service with Swagger API documentation, seemingly read-only.
- Interoperability medium: uses well-documented JSON but seems to be only for reading from Chemotion. Further integrations are supported but are documented for devices, and thus not easy to grasp for a non-chemist.
- Not cross-domain: it is a highly specialised tool and would not be suited to other applications.
Materials science
- NOMAD, management and sharing of material science data (online)
- With many tutorials, also on YouTube
- Online first, but with docker compose installation. Another instance of NOMAD, called NOMAD Oasis, can upload to the central NOMAD installation.
- Extensive API, well documented
- Interoperability HIGH: uses well-documented JSON.
- Not cross-domain: it is a highly specialised and complex tool and would not be suited to other applications.
- pyiron, complex workflows based on JupyterLab and pyiron workflow, with many extensions
- With a set of docker images
- Based on JupyterLab, supporting a command-line interface that can be easily extended, and well documented.
- Interoperability HIGH: standard JupyterLab capabilities, all documented.
- Fully cross-domain: even if dedicated to material science, its integration with JupyterLab makes any part usable in other domains.
- HELIPORT (HElmholtz ScIentific Project WORkflow PlaTform): An Integrated Research Data Lifecycle
- No container-based installation. Mostly online versions, restricted to members of Helmholtz or associated institutions.
- Extensive API and extensions support.
- Interoperability HIGH: uses well-documented JSON.
- Fully cross-domain: already extended for other uses. But the online versions are currently restricted to members of Helmholtz or associated institutions.
Geomatics
Open Source
- GeoServer, “an open source server for sharing geospatial data”
- No container-based installation.
- Extensive API and many extensions, well documented.
- Interoperability HIGH: supports many common formats, well documented.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- MapGuide Open Source, “a web-based platform that enables users to develop and deploy web mapping applications and geospatial web services”, and MapServer, a mapping engine, part of a large portfolio of Geospatial tools
- No container-based installation.
- Extensive API via the related tools.
- Interoperability HIGH: supports many common formats, documented. One issue with the OSGeo portfolio is the high number of projects with many subpages. As such, the contribution to the community is very high, but it can be difficult to navigate.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- Mapnik, open-source mapping toolkit
- No container-based installation as it is a toolkit to develop mapping applications.
- Mostly to be used as an API or through Python bindings. It is largely documented but mostly aimed at developers.
- Interoperability MEDIUM/HIGH: it is aimed at helping develop any kind of mapping application, but will probably be out of reach as such for non-developers.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- ArangoDB, open-source and commercial graph and spatial database, community version limited to 100 GB of data on a single cluster and non-commercial use.
- Docker and Kubernetes installations
- Extensive API, well documented, also built-in using Swagger.IO
- Interoperability HIGH: it is all well-documented and uses JSON.
- Fully cross-domain: it is a general multi-model database
- GeoNetwork, catalog application to manage spatially referenced resources
- Docker image
- Extensive API, well documented
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- GeoNode, an open-source platform for geospatial data sharing, often used for urban planning. It integrates tools for geographic data visualization, editing, and management
- Docker-based installation
- Extensive API, very well documented
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- OpenStreetMap (OSM), a collaborative project to create a free, editable map of the world, providing spatial data for urban planning, navigation, and research
- Mostly online with the map of the full world, with many surrounding tools
- Extensive API, very well documented
- Interoperability HIGH: uses well-documented XML and JSON. Maps use either a text-based format such as OSM XML, which is large but easy to use, or a binary format such as PBF, which is small but harder to use. The list of map formats is on the wiki.
- Mostly cross-domain: it can be used for any domain. Its usage can be quite complex though, depending on several aspects: extracting a small map can be rather easy using one tool, while extracting a large map (such as for a country) can be complex depending on the level of detail needed. The binary format for large map
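To give an idea of why the text-based OSM XML format is considered easy to use, here is a minimal sketch that reads named nodes with the Python standard library only. The sample document is made up for the example (it only follows the general node/tag layout of OSM XML), and `named_nodes` is a hypothetical helper, not an OpenStreetMap tool.

```python
import xml.etree.ElementTree as ET

# Made-up sample following the node/tag layout of OSM XML.
SAMPLE_OSM_XML = """<osm version="0.6">
  <node id="1" lat="52.5200" lon="13.4050">
    <tag k="name" v="Example Cafe"/>
    <tag k="amenity" v="cafe"/>
  </node>
  <node id="2" lat="52.5201" lon="13.4060"/>
</osm>
"""


def named_nodes(osm_xml):
    """Return nodes that carry a 'name' tag (hypothetical helper)."""
    root = ET.fromstring(osm_xml)
    result = []
    for node in root.iter("node"):
        # Collect the key/value tags attached to this node.
        tags = {tag.get("k"): tag.get("v") for tag in node.findall("tag")}
        if "name" in tags:
            result.append({
                "id": int(node.get("id")),
                "lat": float(node.get("lat")),
                "lon": float(node.get("lon")),
                "tags": tags,
            })
    return result


for node in named_nodes(SAMPLE_OSM_XML):
    print(node["id"], node["tags"]["name"], node["lat"], node["lon"])
```

This approach loads the whole document in memory, so it only suits small extracts; the binary PBF format needs dedicated tooling and is not covered here.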
Commercial
- ESRI ArcGIS, geospatial platform
- Many tutorials which are rather hard to navigate and an extensive developer documentation, not linked on the product page.
- Extensive APIs, well documented with many Open Source tools and samples
- Interoperability HIGH: many APIs and tools, with documentation and samples.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- SMARTGeo, visualization and analytics platform
- No easily findable public documentation
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- CARTO, cloud-based spatial analysis
- All documentation is available publicly with a set of tutorials covering the bases of geospatial data as well
- Offers several connection options, all documented, and several well-documented APIs.
- Interoperability HIGH: APIs use well-documented JSON
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
Urbanism
Open Source or free
- Metropolis (Urban Observatory Data), offers large-scale urban data and insights about cities, covering everything from mobility patterns to energy consumption
- Online collection of urban case studies.
- No API
- No interoperability, per definition
- Partly cross-domain: the case studies are about urbanism, with themes that can be relevant to other domains.
- CitySDK, a toolkit offering standard API specifications for urban services, helping cities release open data and enabling developers to create city-related applications
- SDK to build Smart City services and applications.
- Encourages building a standardised API, using JSON, XML or CSV
- Interoperability HIGH: encourages the use of standards
- Mostly cross-domain: it is intended for city management, but does not restrict any use.
Commercial
- ESRI ArcGIS Urban
- ArcGIS Urban is part of the ArcGIS portfolio, so it has the same characteristics.
- Many tutorials which are rather hard to navigate and an extensive developer documentation, not linked on the product page.
- Extensive APIs, well documented with many Open Source tools and samples
- Interoperability HIGH: many APIs and tools, with documentation and samples.
- Mostly cross-domain: it is dedicated to geospatial data but could be used for any domain as such.
- Google Transit, provides data on urban transport, such as real-time traffic, transit options, and pedestrian data, used in urban planning and mobility solutions
- It is an online feed using the GTFS format, with some documented changes for realtime transit.
- Interoperability HIGH: straightforward and well documented
- Not cross-domain, as per definition.
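The GTFS format mentioned for Google Transit is a set of CSV text files, which is a large part of why it is straightforward to consume. As a minimal sketch, the code below reads the `stops.txt` file of a feed with the Python standard library. The sample rows are invented for the example, and `read_stops` is a hypothetical helper; only the column names (`stop_id`, `stop_name`, `stop_lat`, `stop_lon`) come from GTFS.

```python
import csv
import io

# Invented sample rows; only the column names come from GTFS.
SAMPLE_STOPS_TXT = """stop_id,stop_name,stop_lat,stop_lon
S1,Central Station,48.1351,11.5820
S2,Market Square,48.1372,11.5756
"""


def read_stops(stops_txt):
    """Parse the content of a GTFS stops.txt file (hypothetical helper)."""
    reader = csv.DictReader(io.StringIO(stops_txt))
    return [
        {
            "id": row["stop_id"],
            "name": row["stop_name"],
            # Coordinates are stored as text and converted to floats here.
            "lat": float(row["stop_lat"]),
            "lon": float(row["stop_lon"]),
        }
        for row in reader
    ]


for stop in read_stops(SAMPLE_STOPS_TXT):
    print(stop["id"], stop["name"], stop["lat"], stop["lon"])
```

A real feed contains further files and optional columns, and the realtime part uses a different encoding; this sketch deliberately ignores both.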
Digital Lab Notebooks
- OpenBis
- Docker image
- Extensive APIs, well documented
- Interoperability HIGH: while there is no REST API, the other APIs are well documented and the Python API is simple to use even for non-developers.
- Fully cross-domain: OpenBis is highly configurable, so it can be used for any domain as a notebook but can also be used for other needs if properly configured. That said, it is not the easiest platform to set up and configure.
- eLabFTW
- Docker
- Extensive APIs, well documented
- Interoperability HIGH: the REST API uses well-documented JSON and the Python API offers plenty of examples.
- Fully cross-domain: eLabFTW is highly configurable, so it can be used for any domain as a notebook but can also be used for other needs if properly configured.
- RSpace, electronic notebook and inventory management
- Open Source and commercial solutions.
- Docker Compose
- Extensive API and SDK, well documented.
- Interoperability HIGH: uses well-documented JSON.
- Fully cross-domain: RSpace is highly configurable, so it can be used for any domain as a notebook but can also be used for other needs if properly configured.
Astronomy
Astronomy is largely standardized and relies on many tools developed over several decades, and on online catalogues. As such, many are standalone tools and are potentially cloud-friendly, though they will need some more technical adaptation. Most platforms are not cross-domain, as they give data from astronomical observations. But if there is an actual need for such data in another domain, the technical effort for the integration will probably be worth it. The technical effort varies greatly between platforms and can be rather high, with many specialised tools and specialised vocabulary.
- Virtual Observatory (VO), the International Virtual Observatory Alliance (IVOA) provides resources for discovering and accessing astronomical data, including standards for data sharing and interoperability
- Online only
- Many tools but no clear or official API.
- Interoperability LOW: related to previous point.
- Not cross-domain, per definition.
- ALMA Science Archive, the ALMA archive gives access to data from the Atacama Large Millimeter/submillimeter Array, providing public access to radio astronomy datasets
- Online only
- Many tools and much documentation, but it is also difficult to find an official API
- Interoperability LOW: related to previous point.
- Not cross-domain, per definition.
- European Space Agency (ESA) Archives, ESA’s space science archives, including data from missions like Gaia and Hubble, offering open access to space mission data
- Online only
- Some archives have API access, such as the Herschel Science Archive, but it is not uniform.
- Interoperability LOW to HIGH: related to previous point, depending on the archive.
- Not cross-domain, per definition.
- NASA’s Astrophysics Data System (ADS), ADS is a digital library for astronomy and physics, linking research papers with relevant datasets and providing extensive search capabilities
- Online only
- Extensive API, well documented with examples
- Interoperability HIGH: uses well-documented JSON.
- Mostly cross-domain: some papers referenced are cross-domain or in another domain.
- SIMBAD (Set of Identifications, Measurements, and Bibliography for Astronomical Data), SIMBAD is a database providing detailed records of celestial objects and their bibliographical references
- Online only
- API, documented
- Interoperability MEDIUM/HIGH: the API needs some technical understanding of the domain to be of use.
- Not cross-domain, per definition.
- SkyServer (Sloan Digital Sky Survey, SDSS), SkyServer provides access to data from the Sloan Digital Sky Survey, with resources for professional astronomers and public users
- Online only
- Extensive API, well documented. Also offers a nice SQL tutorial
- Interoperability HIGH: well-documented API, simple to use.
- Not cross-domain, per definition.
- TESS Data Portal, the TESS Data Portal provides access to data from the Transiting Exoplanet Survey Satellite, focusing on exoplanet discovery
- Online only
- API, documented
- Interoperability MEDIUM/HIGH: the API needs some technical understanding of the domain to be of use.
- Not cross-domain, per definition.
- ESO Science Archive Facility, the European Southern Observatory (ESO) archive contains data from its observatories, including the Very Large Telescope (VLT), and provides tools for data retrieval and analysis
- Online only
- API, documented
- Interoperability MEDIUM/HIGH: the API needs some technical understanding of the domain to be of use.
- Not cross-domain, per definition.
- ESO Reflex
- Based on Kepler, which seems out of support, standalone tools for reducing VLT (Very Large Telescope)/VLTI (Very Large Telescope Interferometer) science data. Part of the VLT Instrument Pipelines
- Standalone, no container-based installation.
Humanities
- TAPAS (TEI Archiving, Publishing, and Access Service), a platform for archiving and publishing TEI (Text Encoding Initiative) documents.
- Online only.
- Uses the Text Encoding Initiative XML format (TEI)
- No easily findable API
- Interoperability LOW/MEDIUM: despite the lack of API, each element is findable online and is composed of a well-documented XML and images.
- Not cross-domain, as per definition.
- Humanities Commons, a network for scholars to share work and connect, with the CORE repository
- Online only.
- No easily findable API
- Interoperability LOW: it is a repository of documents that are in binary format (docx, pdf for instance), with only metadata in XML
- Not cross-domain, as per definition.
- Omeka, a content management system for digital collections and exhibitions
- No container-based installation, though it was dockerized previously.
- 2 versions, Omeka S for sharable resource pool across multiple sites and Omeka Classic for individual projects.
- Extensive PHP API, more limited REST API, well documented
- Interoperability MEDIUM/HIGH: uses JSON, but seems to be lacking examples and/or samples. Offers a DSpace connector and Zotero Import
- Fully cross-domain: can be used for referencing any kind of digital resources.
- Archaeology Data Service (ADS), for archiving digital data in archaeology.
- Online only.
- No API but a harvesting interface using OAI-PMH and an [extensive documentation](https://archaeologydataservice.ac.uk/help-guidance/), including on data repositories citing the ADS, which might have an API
- Interoperability MEDIUM/HIGH: uses standards (DOI, ORCID, Dublin Core) and very well documented. The documentation is also very accessible. Automatic access would need to be through harvesting or an external repository, making things a bit harder.
- Not cross-domain, as per definition.
- Europeana, providing access to cultural heritage data from European institutions
- Online only.
- There is an API, as shown with a footer link “See requests to Europeana APIs”, but without any visible documentation.
- Interoperability LOW: it is a repository of media with a powerful search/filter, but without a clear API.
- Not cross-domain, as per definition.
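Several entries rely on OAI-PMH harvesting rather than a dedicated API. As a rough sketch of what harvesting involves, the code below builds a `ListRecords` request URL and extracts identifiers and Dublin Core titles from a response. The base URL `https://example.org/oai`, the sample response and both helper functions are assumptions made for this example; they do not describe the endpoint of any platform listed here. Only the `verb` and `metadataPrefix` parameters and the XML namespaces come from the OAI-PMH protocol.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# XML namespaces defined by OAI-PMH and Dublin Core.
NS = {
    "oai": "http://www.openarchives.org/OAI/2.0/",
    "dc": "http://purl.org/dc/elements/1.1/",
}

# Invented response, reduced to the elements used below.
SAMPLE_RESPONSE = """<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/">
  <ListRecords>
    <record>
      <header><identifier>oai:example.org:1</identifier></header>
      <metadata>
        <oai_dc:dc xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
                   xmlns:dc="http://purl.org/dc/elements/1.1/">
          <dc:title>Example excavation archive</dc:title>
        </oai_dc:dc>
      </metadata>
    </record>
  </ListRecords>
</OAI-PMH>
"""


def list_records_url(base_url, metadata_prefix="oai_dc"):
    """Build a ListRecords request URL (hypothetical helper)."""
    query = urlencode({"verb": "ListRecords", "metadataPrefix": metadata_prefix})
    return f"{base_url}?{query}"


def record_titles(response_xml):
    """Return (identifier, title) pairs found in a ListRecords response."""
    root = ET.fromstring(response_xml)
    pairs = []
    for record in root.iterfind(".//oai:record", NS):
        identifier = record.findtext("oai:header/oai:identifier", namespaces=NS)
        title = record.findtext(".//dc:title", namespaces=NS)
        pairs.append((identifier, title))
    return pairs


print(list_records_url("https://example.org/oai"))
print(record_titles(SAMPLE_RESPONSE))
```

A real harvester would also have to fetch the URL and follow resumption tokens for large result sets, which is part of the extra effort noted above.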
Bibliography
- “Zotero is free and open-source reference management software to manage bibliographic data and related research materials”
- Mostly online with browser extensions. It is a tool to help get information from web sites, organize it and cite it.
- Many extensions providing some web API support.
- Interoperability HIGH: again with many plugins but also integrations.
- By definition cross-domain: it is a tool to build and manage bibliographic data
Generalists
All are inherently cross-platform and most follow the FAIR principles.
- RUCIO, Scientific Data Management for very large and distributed data sets
- Fully cloud-based, with a default Kubernetes installation using Helm charts; the storage backend needs to be set up in advance.
- Extensive APIs (REST API, Python and CLI), well documented (links in the header of the main documentation)
- Interoperability HIGH: uses well-documented JSON.
- Open Science Framework (OSF), a platform that supports open research
- Online
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- RDMO - Research Data Management Organiser
- Docker Compose
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- The Comprehensive Knowledge Archive Network (CKAN) is an open-source open data portal for the storage and distribution of open data (Wikipedia)
- Docker Compose
- API
- Interoperability MEDIUM: the API is not documented fully.
- Dataverse, an open source research data repository software
- Docker Compose and Kubernetes
- Extensive API
- Interoperability HIGH: uses documented JSON.
- “Zenodo is a general-purpose open repository developed under the European OpenAIRE program and operated by CERN”
- Mostly used online, but provides a Docker Compose
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- BARTOC is a register of Thesauri, Ontologies & Classifications
- Docker and docker-compose installation supported but mostly supposed to be used online as a register
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- RDM, turn-key research data management repository
- Docker image
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- INVENIO, a framework for large-scale digital repositories
- Framework to develop repositories, such as RDM
- API
- Interoperability HIGH: All documented, used to build other repositories.
- RODARE, ROssendorf DAta REpository, Online Data Repository
- Based on INVENIO Digital Library framework, fork of Zenodo.
- Explains how it is FAIR
- Extensive API, per community
- Interoperability MEDIUM/HIGH: It lacks a simple API documentation.
- REANA, Reproducible and reusable research data analysis platform
- Kubernetes, Helm and Kubernetes in Docker
- Extensive APIs
- Interoperability HIGH: uses a YAML file which is fully documented
- UNICORE, A European Federation Software Suite
- Docker image but mostly for evaluation
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- SciCat, management and annotation of scientific data based on Elastic Search
- Default docker-compose installation
- Extensive API
- Interoperability HIGH: uses well-documented JSON.
- re3data, registry of research data repositories
- Online.
- Extensive API
- Interoperability HIGH: uses XML following a given schema.
- udata, social platform dedicated to open data
- Docker compose and docker images
- Extensive API using RDF
- Interoperability HIGH: Return RDF following the Data Catalog Vocabulary.
- DSpace, Open Digital Repository
- Docker install, but not recommended for production
- Extensive API, full part of the back-end, well-documented
- Interoperability HIGH: uses well-documented JSON.
- Dryad, Open Source Publishing Platform
- Only online.
- Extensive API, well documented
- Interoperability HIGH: uses well-documented JSON.
- Elsevier Data Repository
- Online, with a possibility for a commercial local setup.
- Extensive API, well documented.
- Interoperability HIGH: uses well-documented JSON.
- Figshare, Research Repository
- Only Online, with free access for non-profit research.
- Extensive API, well documented and integrations.
- Interoperability HIGH: uses well-documented JSON and many examples, also in several programming languages.
- NextFlow, using software containers for scientific workflow
- Docker image
- Externally developed API, documented
- Interoperability LOW/MEDIUM: aside from the external API and some standard storage options, such as AWS, there is no clear way to use NextFlow programmatically
- iRODS, Open Source Data Management Software
- No up-to-date docker image or others (dead links can easily be found).
- REST API, documented, APIs for various languages and a CLI
- Interoperability MEDIUM/HIGH: while it is documented and offers many ways to interoperate, the documentation is not always very clear. The ease of integration will vary greatly depending on the technology used.
- Pegasus, a (scientific) workflow management system
- Docker image
- Extensive API for various languages & REST API, well documented; the REST API is for read-only access.
- Interoperability HIGH: very well documented with schema and examples
Others
- NextCloud
- NextCloud is a file hosting service, similar to Google Drive or Microsoft OneDrive, but open source and self-hostable. It is a free file hosting platform: it does not constrain which data are stored and does not impose complex metadata or formats. It can be extended and adapted using plugins, such as a plugin to integrate an office suite like LibreOffice. As such, it should be considered a support application above all: a tool to support your work, or a way to store data for other applications. The absence of constraints makes it a poor choice for FAIR data storage, as the users would need to do most of the needed work. Still, interconnected via API to other software, it could be an essential part of a Data Management solution, even if only as a scrapbook. NextCloud has an extensive API, so it could be used with constraints via the API using a dedicated pipeline, either in connection with another application or for a specific use case.
- Docker, Docker compose and Kubernetes installation, with Helm chart
- Extensive API
- Interoperability HIGH: the API is well documented and there are several ways of doing things. That said, NextCloud is highly flexible, so its use needs to be clearly defined, and uploads may need to be restricted to the API in order to integrate it into a larger setup.
- Fully cross-domain
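The idea of using NextCloud "with constraints via the API" can be sketched as a small gate that a pipeline runs before any upload: check that the required metadata is present, then compute the WebDAV target for the file. Everything below is an assumption made for illustration: the list of required fields, the helper names and the host `cloud.example.org`. The sketch assumes the usual `remote.php/dav/files/<user>/` WebDAV layout and performs no network call.

```python
from urllib.parse import quote

# Hypothetical minimal metadata a pipeline could require before upload.
REQUIRED_FIELDS = ("title", "creator", "license", "identifier")


def missing_metadata(metadata, required=REQUIRED_FIELDS):
    """Return the required fields that are absent or empty."""
    return [field for field in required if not metadata.get(field)]


def webdav_url(base_url, user, remote_path):
    """Build the WebDAV target URL for a file (assumed path layout)."""
    path = quote(remote_path.lstrip("/"))
    return f"{base_url.rstrip('/')}/remote.php/dav/files/{quote(user)}/{path}"


def plan_upload(base_url, user, remote_path, metadata):
    """Refuse the upload when metadata is incomplete, else return the URL."""
    missing = missing_metadata(metadata)
    if missing:
        raise ValueError(f"missing metadata: {', '.join(missing)}")
    return webdav_url(base_url, user, remote_path)


# Example with placeholder values only.
print(plan_upload(
    "https://cloud.example.org", "alice", "raw data/run 1.csv",
    {"title": "Run 1", "creator": "Alice", "license": "CC-BY-4.0",
     "identifier": "run-1"},
))
```

The actual upload, authentication and the storage of the metadata itself are left out; the point is only that the constraint lives in the pipeline, not in NextCloud.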