Backups and Security
Backups and security are a very important aspect of running a Data Management Solution, since its purpose is to keep data safely. Yet a perfectly safe solution does not exist: what should be pursued is a solution that ensures safety in all but the most extreme cases. And "extreme" needs to be contextualized: if the tool is a temporary sharing platform, losing data is not a security concern, only a nuisance for the users, who need to upload their data again. On the other hand, unauthorized access to the shared data might be a grave security concern.
So it is important to be realistic and to think carefully about what is actually needed and what can realistically be achieved. For each section, we give the importance we estimate in terms of security. This estimate has to be considered critically.
1 Which data
One generally important aspect of Data Management and data usage is to know which data to store and how to store it.
1.1 Personal data
Importance: high. It is a crucial aspect.
In many countries, storage of personal information is strictly regulated. In the EU and many other jurisdictions, personal data handling is governed by stringent legal requirements that impose significant responsibilities on data controllers and processors.
1.1.1 Legal Requirements and User Rights
In the EU under the General Data Protection Regulation (GDPR), personal data can be processed only with explicit consent, must be accessible by the user, can be deleted upon request, and can be stored only for a limited time. Users have the right to:
- Access (GDPR Article 15): Users can request to know what personal data is stored about them and how it is being used
- Rectification (GDPR Article 16): Users can request correction of inaccurate personal data
- Erasure (GDPR Article 17): Users can request deletion of their data (the “right to be forgotten”)
- Data portability (GDPR Article 20): Users can request their data in a portable format
- Objection (GDPR Article 21): Users can object to certain uses of their data
For comprehensive information on data protection in the EU, see the European Commission’s Data Protection page.
1.1.2 Data Minimization Principle
A fundamental principle in personal data protection is data minimization: only collect and store the minimum data necessary for your specific purpose. Store it only for the time needed, then delete it. Excess personal data increases risk and legal liability with no additional benefit.
1.1.3 Pseudonymization and Anonymization
When research data is linked to a person, it is advised to apply data protection techniques:
- Pseudonymization: Replace identifying information (names, IDs) with pseudonyms or anonymized identifiers. This allows research while reducing direct identification. However, pseudonymized data linked to a key remains personal data.
- Anonymization: Irreversibly remove or render impossible the identification of individuals. Truly anonymized data is not subject to personal data regulations, but genuine anonymization is difficult to achieve.
You will find more information about data protection techniques in the Data Protection and Privacy page.
1.1.4 Storage and Encryption
See the What to never store section for details on secure storage practices. Personal data should never be stored in reversible formats for authentication. All keys, tokens, and sensitive identifiers must be encrypted with keys stored separately from the data. Personal data in backups must also respect the same protection requirements.
1.1.5 Cross-Institutional Data Sharing
The exchange of personal data between institutions can only be done with a formal agreement and with the explicit consent of the user. It is strongly recommended to consult a data protection officer or legal counsel before sharing personal data with external parties.
The limitation of data storage for personal information includes backups. Backup systems must be subject to the same access controls, encryption, and retention policies as production systems.
1.2 Confidential data
Importance: high. It is a crucial aspect.
Confidential data is any data with restricted access: data reserved for accredited users that should not be publicly accessible, or data that must not be shared with, or used by, certain countries or entities. It can be data that is not personal but still sensitive, such as research data that is not yet published, or data that is only for internal use. It can also be data linked to people that no longer falls under the legal definition of personal data, such as properly anonymized data (pseudonymized data, in contrast, usually remains personal data in the legal sense; see the Pseudonymization and Anonymization section above).
Limitation of use for public data is generally ensured only by a license, but it is important to make sure that the license is respected and that the data is not used in a way the license forbids. For instance: data under a non-commercial license should not be used for commercial purposes; data under a share-alike license should not be redistributed under terms that prevent others from using it under the same license; data under a no-derivatives license should not be modified; data whose license forbids redistribution should not be redistributed. Enforcement can be difficult and costly, but in some cases it is important to have a clear license and to monitor the use of the data. It is also important to have a clear policy on how to handle license violations, such as sending a cease and desist letter or taking legal action.
1.3 Passwords & keys
See the Passwords and keys section under Security below.
2 Backups
Importance: high. Backup needs to be in place once a platform is in production.
Backups are an inherent - and maybe the most important - part of security, as they protect from mishaps as well as from actual attacks. They are evidently not enough once a software has any kind of shared usage: any shared usage needs some kind of security, not only against public access, since even a good-faith use of the software with the wrong privileges can have bad consequences. Such security evidently starts with the software design and implementation, such as database constraints, which this page will not cover.
Backups are for recovery; public data should also be published in repositories.
Dedicated backup tools provide versioning, monitoring, differential and incremental backups, and automation. Using non-backup tools requires extra parameters and/or scripts to obtain some of these functions. Monitoring that the backup actually works is extremely important: a backup job that stops running unnoticed is a "silent death".
2.1 Common backup strategies
The main backup strategy is to back up the data at regular intervals (typically daily), to keep several versions of the backups, and to keep at least one backup off-site, following the 3-2-1 rule: 3 copies of the data, on 2 different media, with at least 1 copy off-site.
If available, a long-term backup strategy should also be implemented, with a clear policy on how long to keep the backups and how to eventually delete them when they are no longer needed. This is especially important for personal data, which should not be kept longer than necessary. These backups are generally done on tapes that are stored in a secure location, and might be taken less frequently.
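To illustrate what "keeping several versions" can mean in practice, here is a minimal sketch, in Python, of a retention rule that keeps recent daily backups plus one backup per week and per month for older dates. The numbers and the scheme are illustrative, not a recommendation:

```python
from datetime import date, timedelta

def backups_to_keep(dates, today, daily=7, weekly=4, monthly=12):
    """Select which dated backups to keep: every backup from the last `daily`
    days, then one per week for `weekly` weeks, then one per month for
    `monthly` months; everything older is dropped."""
    keep = set()
    for d in sorted(dates, reverse=True):            # newest first
        age = (today - d).days
        if age < daily:                              # recent: keep them all
            keep.add(d)
        elif age < daily + 7 * weekly:               # keep one backup per ISO week
            if not any(k.isocalendar()[:2] == d.isocalendar()[:2] for k in keep):
                keep.add(d)
        elif age < 31 * monthly:                     # keep one backup per month
            if not any((k.year, k.month) == (d.year, d.month) for k in keep):
                keep.add(d)
    return keep

# Example: a year of daily backups boils down to roughly 20 kept versions.
today = date(2025, 11, 23)
kept = backups_to_keep([today - timedelta(days=i) for i in range(365)], today)
print(sorted(kept))
```

Dedicated tools such as BorgBackup or Restic implement similar pruning rules natively; a home-made script should at least log which versions it deletes.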
2.2 Common backup tools
Backup tools range from a simple cronjob calling a database dump, or a script using rsync to copy files, to a more complex tool such as BorgBackup, Restic or Veeam, or a cloud-based backup service. The choice of the backup tool will depend on the needs of the users, the software, the infrastructure, and the budget. It is important to choose a backup tool that is reliable, secure, and easy to use.
If your IT services provide a backup service, it is generally the best option to use it, as it will be maintained and updated by the provider and will be more secure than a self-managed solution. It is still important to check the terms of service and the security measures in place, as well as the possibility to access the data and the API if needed.
Otherwise, the backup tool will depend on the setup and will generally be a combination of tools, such as a database dump for the database, a file copy for the files (rsync or more advanced tools), and a cloud-based backup service for the off-site backup.
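As a minimal sketch of such a combination, assuming a PostgreSQL database and rsync on the host (the database name, paths and off-site destination are hypothetical):

```python
import subprocess
from datetime import date
from pathlib import Path

# Hypothetical names: adapt database, paths and destination to your setup.
BACKUP_DIR = Path("/var/backups/myapp")
DATA_DIR = "/srv/myapp/files"
OFFSITE = "backup@offsite.example.org:backups/myapp/"

def run_backup() -> None:
    BACKUP_DIR.mkdir(parents=True, exist_ok=True)
    dump_file = BACKUP_DIR / f"db-{date.today().isoformat()}.dump"
    # Dump the database in PostgreSQL's compressed custom format.
    subprocess.run(
        ["pg_dump", "--format=custom", "--file", str(dump_file), "myapp_db"],
        check=True,  # fail loudly: a silently failing backup is a dead backup
    )
    # Mirror the dumps and the data files to the off-site location.
    subprocess.run(
        ["rsync", "-a", str(BACKUP_DIR), DATA_DIR, OFFSITE],
        check=True,
    )

if __name__ == "__main__":
    run_backup()  # typically invoked by a daily cronjob
```

The `check=True` matters: a script that swallows errors is exactly the "silent death" described above, whereas a non-zero exit under cron typically triggers a mail to the administrator.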
2.3 Testing backups
It is always important to test your backups, as a backup that is not working is not a backup. It is also important to test the restore process, as a backup that cannot be restored is also not a backup. Backup testing should be done regularly, and should be part of the backup strategy.
Long-term backups should also be tested, for instance by restoring a random selection of files from a random date.
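A minimal sketch of such a spot check, in Python (paths are hypothetical; it assumes a backup restored to a local directory, and that mismatches are reviewed by a human, since live files may legitimately have changed since the backup was taken):

```python
import hashlib
import random
from pathlib import Path

def spot_check(live_root: Path, restored_root: Path, sample_size: int = 5) -> bool:
    """Compare checksums of a random sample of live files against the same
    files restored from a backup; report any mismatch for human review."""
    files = [p for p in live_root.rglob("*") if p.is_file()]
    ok = True
    for p in random.sample(files, min(sample_size, len(files))):
        restored = restored_root / p.relative_to(live_root)
        if (not restored.exists()
                or hashlib.sha256(p.read_bytes()).hexdigest()
                   != hashlib.sha256(restored.read_bytes()).hexdigest()):
            print(f"MISMATCH: {p}")
            ok = False
    return ok

# e.g. spot_check(Path("/srv/myapp/files"), Path("/mnt/restore-2025-11-01/files"))
```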
2.4 Managing the loss of data
Backups cannot fully prevent the loss of data, as all activity since the last backup will be lost. In RDM, the data submitted for a project is generally still available on the submitter’s computer, so it can be re-submitted. It is important to inform the users quickly and to explain how to check for data loss on their side, and how to re-submit the data if needed.
3 Security
3.1 Defence must always succeed, an attack only once
One very important rule of security is that the attacker needs only one success. It does not matter how many attempts are unsuccessful: they only need to try again. Thus defenders must adapt and monitor, avoiding known vulnerabilities while trying to care for unknown ones.
3.2 Aspects of security
By security in an information system, we mean every aspect allowing the right use of the information system (including, but not only, the protection of data): availability, integrity, confidentiality, trustworthiness, access control. It is important to know which of these aspects are needed in the system to build. The possible threats then come from:
- the system users, not willingly doing any damage,
- malicious persons, who want to access data or programs, damage the infrastructure, cheat, or annoy,
- malicious software, which gives a nuisance to others or abuses the resources (possibly unintentionally), or opens the system to intrusions,
- a disaster: a simple power cut, a bad manipulation, a fire, …
It is very common that regular users are the gates for malicious persons. We speak here of social engineering: the weak point is the human-human relation. The famous Kevin Mitnick mostly used this technique to intrude into systems. Properly informing users is one way to fight this kind of problem.
A very wide overview of possible outages and how to mitigate them is part of Business Continuity Planning.
3.3 Limits of security
Security in computer software is very much like security in real life: if you want your house to be totally secure, you have to put a 3-meter-high concrete wall around it with automatic machine guns. Not fun, not neighbour-friendly, and not inhabitant-friendly either. The same rule applies in software development: you just cannot have a totally secured system, as nothing would be usable then (or only with much difficulty). As in real life, users must show some respect to other users and to the software architecture, and we must protect the system mostly from deviant users. The problem is that, in the computer world, all users are potentially deviant! You have to see the interface of the software as its public area: with many users, you can assume that they will try everything, so you have to ensure that no dangerous parts are left in it. Doing so (part of security by design), you remove the major part of the potential threat from the users.
3.4 General dispositions
It is important to carefully design the foundations of your system in order to minimize security risks. General dispositions are precisely about reducing these oversights:
- Never run anything as root (or administrator). It may feel convenient during development, but it is equivalent to leaving all doors open in the house because you trust yourself not to make mistakes. Software does not forgive mistakes. By assigning minimal privileges to users and services, you ensure that even if something goes wrong, the damage remains contained. This principle - often called least privilege - is one of the simplest and most effective protections you can apply.
- Closely related to this is the management of paths and access rights. Files are not just data; they are entry points. A poorly protected directory, a writable configuration file, or an executable in the wrong place can quickly become a lever for unintended behavior. Permissions should always be as restrictive as possible, and never rely on assumptions such as “this folder is not visible to users.” If the system can reach it, someone eventually will.
- A symbolic link can redirect operations to unexpected locations, sometimes bypassing access controls that seemed correctly configured. If your application follows links without strict checks, you may end up granting access to resources you never intended to expose.
- Databases follow the same logic. Using a single, highly privileged database user for all operations is convenient, but dangerous. Each component should only access what it strictly needs: read-only where possible, limited schemas otherwise. This way, even if one part of the system is compromised, the attacker does not automatically gain full control over your data.
- Configuration is another common source of trouble. Systems like Apache or tools relying on files such as .htaccess offer great flexibility, but also introduce complexity. And complexity is the natural enemy of security. Misconfigurations are often invisible until they are exploited. The more intricate the setup, the harder it becomes to reason about its behavior. Whenever possible, favor simple, explicit configurations over clever but opaque ones.
- Finally, there is error handling. A system should fail fast when it encounters a situation it cannot safely recover from. Trying to continue at all costs often leads to inconsistent states, which are fertile ground for vulnerabilities. However, failing fast does not mean failing blindly. The application must distinguish between recoverable errors and critical failures. In the first case, it can adapt; in the second, it must stop cleanly and predictably.
In the end, these dispositions are not advanced techniques. They are habits. And like most habits, they are easy to neglect - until the day they are the only thing standing between a harmless bug and a serious security issue.
3.5 Surface of attack
In security, we often speak about reducing the surface of attack. The surface of attack is the part of the system that can be accessed by an attacker: the larger it is, the more chances for an attacker to find a vulnerability. Reducing the surface of attack is thus necessary for security. For instance, if you have a web application, you should not have any other service running on the same server that can be accessed from the outside. If you have a database, it should not be accessible from the outside, but only from the application. If you have a backup system, it should not be accessible from the outside, but only from the application and/or from a secure location.
::: {.callout-note}
Reducing the surface of attack is also related to the complexity of the setup. A very complex setup will have a larger surface of attack than a simple one, as it will be harder to understand and to monitor. It is therefore important to keep the setup as simple as possible, while still meeting the needs of the users and the software.
:::
3.6 Firewall
Importance: high. A firewall is needed. In most institutions a firewall will be in place and managed by the local IT.
A firewall is software or hardware that blocks traffic that is not needed.
One role of a firewall is to reduce the surface of attack by blocking all the traffic that is not needed. For instance, if you have a web application, you should only allow traffic on ports 80 (HTTP) and 443 (HTTPS). The database should not be accessible from the outside, but only from the application. The backup system should not be accessible from the outside, but only from the application and/or from a secure location.
But a firewall can also be used to block traffic from known malicious IP addresses or, if the software is not designed to be used by the public, to block all traffic from the outside and only allow traffic from a secure location (such as a VPN), while authorizing some external access by whitelisting IP addresses.
3.7 Protections against attacks
Importance: varies. Some attacks are systematic but mostly handled by modern platforms (like SQL injections); more sophisticated attacks might come if the Data Management Solution is popular. There can also be a wide attack at institution level, which might be more advanced; in that case it is important to monitor the activity for anomalies and to communicate with your local IT. Shutting down a platform in case of doubt is always reasonable.
There are many kinds of attacks: SQL injection, cross-site scripting, cross-site request forgery, brute force, denial of service, and more. Some can be mitigated by the software design and implementation: prepared statements for SQL queries, a web application firewall, a rate limiter, a CAPTCHA, a strong password policy, two-factor authentication, a VPN, a firewall, a backup system, a monitoring system. Some can be mitigated by user behavior: using strong passwords, not sharing passwords, not clicking on suspicious links, not downloading suspicious files, not giving away personal information, not using public Wi-Fi for sensitive operations. Others are more difficult to mitigate: zero-day vulnerabilities, insider threats, social engineering, and, at the extreme, physical attacks (such as theft or sabotage).
Protecting against abuse is also difficult. Such abuses can be:
- SPAM and SCAM attacks on public forms. Automatic tools are very good at finding common fields in a form, such as "email", "first name", …, in which case the form can be filled and submitted automatically. Some good transparent mitigation methods are known, from simpler to more complex (see the sketch after this list):
- hashing the field names, so each common field name becomes a variable hash that the application can map back using a stored salt. The hash is valid only once, so the form cannot be resubmitted as-is, which makes it much more difficult for a bot to fill the form. Such hashing should only be used on forms that a given user does not fill repeatedly, as browsers will also be unable to automatically refill the fields, their names changing every time;
- often combined with a honeypot: hidden form fields with common names, such as "email", that only bots will fill, automatically excluding the submission;
- delaying the submission of the content, which allows excluding human-submitted mass-submissions by comparing the content, which is generally very similar. Unfortunately, with an LLM bot it is now easy to mass-submit varying content;
- matching common scam keywords and blacklisting messages above a threshold. The last three methods should keep a copy of the message, with a way to resubmit it in case of a false positive. In our own experience, a student once sent the same request to all group leaders of our institute and triggered the mass-submission alert; our system sent a report after the last received email (after waiting for some time) and we did not resubmit the emails but contacted the student.
But few web applications support such methods, and the common way is to use a CAPTCHA, generally a challenge checking whether the user is human. Some are rather transparent, simply expecting a human reaction; some are simple puzzles to solve (which might be a problem for some persons with disabilities); and many still use mangled text to recognize, or images where the user needs to find an object. Those using mangled text or images become harder and harder for humans to solve correctly, while bots become better and better at solving them.
- Attack via a vulnerability, such as a zero-day vulnerability: a security hole, possibly due to an unknown bug. Such vulnerabilities can be exploited by an attacker to gain access to the system, steal data, damage the system, or use it for their own purposes. The best way to protect against them is to keep the software up to date. Most of the time, vulnerabilities are discovered and fixed by the software developers, but sometimes they are discovered by attackers and exploited before they are fixed. This is why it is important to have a good monitoring system in place, to detect any suspicious activity and to respond quickly to any incident. If you discover a vulnerability before the developer does, it is important to report it to the developer in a responsible way, so they can fix it before it is exploited by attackers. The mitigation will probably need to be done with the developer and might be a quick patch before a proper update. Without a proper mitigation, it might be necessary to temporarily disable the vulnerable feature, or to block access to it with a firewall and allow access only from a secure location (such as a VPN), while waiting for the patch.
- Poisoning of software packages. A variant of the previous attack is to poison software packages, for instance by uploading a malicious package with the same name as a popular one, or by compromising the repository of a popular package. For containerized software, some repositories offer hardened images that are regularly updated and monitored for vulnerabilities, which can be a good option to reduce the risk of such an attack. Otherwise, the protection strategy is similar to the previous one.
- Denial of Service/Brute Force attack. Sometimes an attacker just wants to disrupt a service, or to break into it in order to obtain something (like logging in or accessing a database). Both involve a massive amount of requests, either to overload the system or to try all possible combinations. The best way to protect against such attacks is a rate limiter, which can block or slow down requests after a certain threshold. It is also important to have a good monitoring system in place, to detect any suspicious activity and to respond quickly to any incident. If the attack comes from a known IP address, it can be blocked with a firewall; if it comes from multiple IP addresses, it can be mitigated with a web application firewall or a content delivery network (CDN) that filters the traffic. In a public institution, there should be a protection provided by the global IT services. Using a CDN might be needed but requires some care: how much traffic is allowed at which price, what the consequences are for the users (for instance, if the CDN blocks some IP addresses, it might block some legitimate users as well), and the availability of the service (if the CDN is down, the service will be down as well).
- Abuse by bots. The internet is also used by many legitimate bots, such as search engine crawlers or AI agents. Crawlers should respect the directives of a robots.txt file, but many bots do not. AI agents can be used for good purposes, such as summarizing content or answering questions, but they can also be used for bad purposes, such as scraping content or generating spam. Many AI agents browse the web at each request that needs to access the content, ignoring the robots.txt directives. This has two consequences:
  - it can overload the system if the content is accessed too many times, which can increase the cost or disrupt the service for other users;
  - if the content needs to be reached from the public data management application (for instance to prove the popularity of the data), an AI agent might deliver a summarized version, which might limit visits to the original content and give a wrong impression of it, or even deliver wrong content if the summarization is not good enough. Some Content Delivery Networks (CDN) use methods to block all access by non-human users, but it is difficult to do so without negative consequences, as some legitimate users might be blocked as well.
- Social engineering attack, which is an attack that relies on human-human interaction, such as phishing emails or pretexting calls. The attacker will try to trick the user into doing something that will give them access to the system, such as clicking on a link, downloading a file, giving away personal information, or even giving away their password. This is one of the most common and effective attack methods, as it exploits the human factor, which is often the weakest link in security. The best way to protect against social engineering attacks is to educate users about the risks and to encourage them to be cautious when interacting with unknown or suspicious content.
- Poisoning data, using a bot through a vulnerability (an undefended form or API, for example), or after a successful breach via social engineering, thus destroying the integrity of the data. Depending on the injected data, the cleaning can be very difficult, and the only solution might be to restore a backup from before the attack, which might mean a severe loss of data.
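As announced above, here is a minimal sketch, in Python, of the two simplest form protections: field names hashed with a one-time nonce, combined with an "email" honeypot. Everything here (the field names, the in-memory nonce store) is illustrative; a real application would keep the nonces in a shared store and integrate this with its web framework:

```python
import hashlib
import hmac
import secrets

SECRET = secrets.token_bytes(32)  # server-side salt, never sent to the browser
issued_nonces = set()             # one-time nonces; use a shared store in production
FIELDS = ["email", "first_name", "message"]

def hashed_names(nonce: str) -> dict:
    """Map each real field name to a per-form hash the server can recompute."""
    return {f: hmac.new(SECRET, f"{nonce}:{f}".encode(), hashlib.sha256).hexdigest()
            for f in FIELDS}

def render_form() -> tuple[str, dict]:
    """Issue a one-time nonce; the page uses the hashed names for its inputs,
    plus a hidden input named "email" that serves as the honeypot."""
    nonce = secrets.token_hex(16)
    issued_nonces.add(nonce)
    return nonce, hashed_names(nonce)

def accept_submission(nonce: str, posted: dict):
    """Return the decoded fields, or None if the submission looks automated."""
    if nonce not in issued_nonces:
        return None                  # unknown or already-used form: reject replays
    issued_nonces.discard(nonce)     # the form cannot be resubmitted as-is
    if posted.get("email"):          # honeypot was filled: almost surely a bot
        return None
    return {f: posted.get(h) for f, h in hashed_names(nonce).items()}

# Example round-trip
nonce, names = render_form()
print(accept_submission(nonce, {names["email"]: "a@b.org", names["message"]: "hi"}))
```

The one-time nonce both prevents replaying a captured form and changes the field names at every rendering, defeating tools that recognize stable field names.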
3.8 Passwords and keys
Importance: high for public solutions. Publicly exposed solutions should have a strong authentication mechanism; using an existing external one is a good option.
3.8.1 For users and APIs
Passwords are still the most common way to protect access to a system, but they are also the most vulnerable. They should be strong and unique, and should not be shared, written down, or stored in plain text. To encourage users to use strong passwords, it is recommended to propose a password manager or a vault, which can generate strong passwords and store them securely. The application should also enforce strong passwords and reject weak ones. Allowing long passwords enables passphrases, which are easier to remember and can be very strong; they can be nonsensical sentences, easy to remember but hard to guess, such as "TheCowBouBouBounceOnTheMoon!". If supported, two-factor authentication should be used, as it adds an extra layer of security.
But one of the strongest ways to provide access is to use keys, which can be generated and stored securely, and easily revoked if needed. For instance, SSH keys are a very common way to access a server, and they are much more secure than passwords. They can be generated with a passphrase, which adds an extra layer of security.
Some authentication systems can also use tokens (a kind of key), which can be generated and stored securely, and easily revoked if needed. For instance, API tokens are a very common way to access an API, and they are much more secure than repeatedly sending a username and password. Tokens are often stored salted and hashed, so a leaked copy cannot be used directly, and they have a limited validity so that a stolen token cannot be reused indefinitely.
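As a minimal sketch of the limited-validity idea, the following Python snippet issues and verifies a signed token that carries its own expiry. It uses only the standard library; a real deployment would rather rely on an established token library and on server-side revocation:

```python
import base64
import hashlib
import hmac
import secrets
import time

SERVER_KEY = secrets.token_bytes(32)   # in practice, loaded from a secret manager

def issue_token(user_id: str, ttl_seconds: int = 3600) -> str:
    """Issue a signed token that carries its own expiry time."""
    payload = f"{user_id}:{int(time.time()) + ttl_seconds}".encode()
    sig = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    return (base64.urlsafe_b64encode(payload).decode()
            + "." + base64.urlsafe_b64encode(sig).decode())

def verify_token(token: str):
    """Return the user id if the token is authentic and not expired, else None."""
    try:
        payload_b64, sig_b64 = token.split(".")
        payload = base64.urlsafe_b64decode(payload_b64)
        sig = base64.urlsafe_b64decode(sig_b64)
    except ValueError:
        return None
    expected = hmac.new(SERVER_KEY, payload, hashlib.sha256).digest()
    if not hmac.compare_digest(sig, expected):   # constant-time comparison
        return None
    user_id, _, expiry = payload.decode().rpartition(":")
    return user_id if time.time() < int(expiry) else None

tok = issue_token("alice", ttl_seconds=60)
assert verify_token(tok) == "alice"
assert verify_token(tok + "x") is None   # tampered token is rejected
```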
Most Data Management platforms also support external authentication, such as LDAP or OAuth. OAuth is a very common way to allow users to authenticate with their existing credentials, such as Google or GitHub accounts. It is also a secure way to authenticate, as it relies on the security of the external provider (assuming the user follows the provider's security practices; if not, the problem is generalized and goes beyond your platform anyway).
3.8.2 Software keys and passwords
Most of the time, the software will need to access some resources (such as a database) with a password or a key. In that case, it is important to store these credentials in a secure way, such as in an environment variable, in a configuration file with restricted access, or in a secret manager. They should never be stored in plain text in the code or in the configuration file.
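A minimal sketch of the environment-variable approach (the variable name is hypothetical; how it gets set depends on your service manager or secret manager):

```python
import os

# DB_PASSWORD is a hypothetical variable name, set by the service manager
# (systemd unit, container runtime, ...) or injected from a secret manager.
db_password = os.environ.get("DB_PASSWORD")
if db_password is None:
    # Fail fast (see General dispositions): starting without credentials
    # would only lead to a less predictable failure later.
    raise RuntimeError("DB_PASSWORD is not set; refusing to start")
```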
3.9 What to never log, log levels
Importance: medium. The risk is present but low. It still matters in the long run, so better to get it right immediately.
Logs are incredibly useful and should be used for debugging purposes and for security (they are often the only way to detect an intrusion or code-injection attempt).
For instance, it can be important to log all suspicious URL paths, as they might be code-injection attempts, i.e. the attacker hoping that a mechanism in the code will execute a payload.
A simple example would be executing raw SQL queries:
- The URL subpath is /users/[name], where [name] is the name you want to look for.
- The code will then simply execute SELECT * FROM users WHERE name="[name]";
- The attacker will use the URL subpath /users/name";DROP%20TABLE%20admin;%20--
- The code will execute SELECT * FROM users WHERE name="name";DROP TABLE admin; --" where the trailing -- comments out the closing quote so it does not raise an error.
- If the access rights of the database are too permissive and an admin table exists, it will be dropped (the example code is for illustration only: it might not work, and in all cases it should not work).
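The standard defence is to never build SQL text from user input, but to pass the value as a bound parameter. A minimal sketch with Python's built-in sqlite3 module (table and data are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (name TEXT)")
conn.execute("INSERT INTO users VALUES ('alice')")

def find_user(name: str):
    # The "?" placeholder sends `name` as data, never as SQL text, so the
    # injection payload below is just an unusual (and absent) user name.
    return conn.execute("SELECT * FROM users WHERE name = ?", (name,)).fetchall()

print(find_user('alice'))                        # [('alice',)]
print(find_user('name"; DROP TABLE admin; --'))  # [] - nothing dropped
```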
This is a really well known and simple example, and no modern web framework or API should allow such attacks nowadays. But the actual issue is with unknown vulnerabilities. Going back to this example, the web application should log something like:
"2025-11-23_2:32:22 - Warning - incorrect characters in path: "/users/name";DROP%20TABLE%20admin;%20 --"
That would allow checking what was attempted.
But logs are also plain text and, if accessed by an attacker, easy to comprehend. As such, there is information that should never be logged, and some that should be logged carefully:
- usernames, passwords, keys, and important DB table names (such as a users table used for login) should ideally never be written to a log file, even at DEBUG level and even on a test system. The reason is that everything that happens on test can happen on production: amid a quick fix during an emergency, a test log entry can be forgotten and go to production. A fix touching sensitive information can use the logs, but should be done in a very controlled way, without hurry, even during an emergency. A good way to proceed is to enable the log entry only when needed, then switch it off (ideally remove it) immediately after;
- sensitive details about the implementation or the setup: they can be logged for debugging, but should not appear in production runs. If the logs are accessed without your knowledge, such details might enable a much wider attack. For instance, simply logging important paths might tell an attacker where to act;
- confidential data. Once again, it can be useful to see some data in the logs when debugging a setup, but it is important to use dummy data for confidential fields and to never leave anything confidential in the logs once in production.
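One way to enforce this is to redact sensitive values before they reach the log file. A minimal sketch, in Python, of a redacting logging filter (the list of sensitive key names is illustrative and has to match your application's vocabulary):

```python
import logging
import re

# Hypothetical key names that must never reach the log file in clear text.
SENSITIVE = re.compile(r"(password|token|secret|api_key)=\S+", re.IGNORECASE)

class RedactingFilter(logging.Filter):
    """Mask sensitive key=value pairs before a record is emitted."""
    def filter(self, record: logging.LogRecord) -> bool:
        record.msg = SENSITIVE.sub(r"\1=***", str(record.msg))
        return True   # keep the (now redacted) record

logging.basicConfig(level=logging.INFO)
log = logging.getLogger("app")
log.addFilter(RedactingFilter())
log.info("login attempt user=bob password=hunter2")
# -> INFO:app:login attempt user=bob password=***
```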
3.10 What to never store (or never in “reversible” format)
Importance: high. Leaks of such information can be difficult to identify and will let an attacker impersonate a regular user, thus allowing malicious operations potentially without anybody noticing.
Passwords of any kind should never be stored in a reversible format. If the password is needed for the software to work, it should be stored in an encrypted format, with a key that is not stored on the same system. If the password is only needed for authentication, it should be hashed with a strong hashing algorithm and a salt.
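A minimal sketch of salted password hashing using Python's standard library scrypt function (a memory-hard algorithm; production systems often prefer dedicated libraries such as argon2 or bcrypt, and the cost parameters below are illustrative):

```python
import hashlib
import hmac
import secrets

def hash_password(password: str) -> tuple[bytes, bytes]:
    """Hash a password with a fresh random salt and the memory-hard scrypt KDF."""
    salt = secrets.token_bytes(16)
    digest = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return salt, digest                   # store both; neither reveals the password

def check_password(password: str, salt: bytes, digest: bytes) -> bool:
    candidate = hashlib.scrypt(password.encode(), salt=salt, n=2**14, r=8, p=1)
    return hmac.compare_digest(candidate, digest)   # constant-time comparison

salt, digest = hash_password("TheCowBouBouBounceOnTheMoon!")
assert check_password("TheCowBouBouBounceOnTheMoon!", salt, digest)
assert not check_password("TheCowJumpsOverTheMoon!", salt, digest)
```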
Any key, hash or token that can be used to access the system should be stored in an encrypted format, with a key that is not stored on the same system.
In most countries, the storage of personal data is restricted by law. In the EU, for instance, personal data can only be stored with consent, must be accessible by the user, can be deleted upon request, and can be stored only for a limited time. Thus, if some research data is linked to a person, it is advised to pseudonymize or to anonymize the data (see the Personal data section above). Only the needed data should be stored, and only for the needed time.
The limitation of data storage for personal information includes backups.
The exchange of personal data between institutions can only be done with an agreement and with the consent of the user. It is recommended to consult a data protection officer for that.
3.11 Know your infrastructure
Importance: high.
This overlaps with many topics already covered, but it is primordial to know your infrastructure and to have a clear overview of it. That includes the times when the main persons in charge are not available, so it is important to have clear documentation of the infrastructure, including the software used, the configuration, the access rights, the backup system, the monitoring system, the security measures, and the contact information of the people in charge. It is also important to have a clear policy on how to handle incidents, such as a security breach or a data loss.
A good document to prepare is a troubleshooting guide, that can be used by anyone to understand the infrastructure and to solve common problems. It should include a clear overview of the infrastructure, a list of common problems and their solutions, and a list of contact information for the people in charge.
3.12 Deleting data
3.12.1 Really deleting files, database entries, backups, cloud storage, version control entries
Importance: medium to high. Depending on the data, deletion may matter only for saving space; but if the data must be deleted (like personal data), it is very important.
Deleting data is not as simple as it seems. When you delete a file, it is not actually erased from the disk: only the reference to it is removed, and the content can still be recovered with special tools until it is overwritten by new data. To delete a file securely on a classical hard disk, you need to overwrite it with random data before deleting it, with tools such as shred on Linux or sdelete on Windows. Note that on SSDs and copy-on-write filesystems, overwriting in place gives no guarantee, and full-disk encryption is the more reliable protection.
The same applies to database entries: a deleted row is usually only marked as deleted, and its content remains in the underlying storage until the space is reclaimed and reused. Compaction tools help reclaim it, such as pg_repack (or VACUUM FULL) for PostgreSQL, or OPTIMIZE TABLE for MySQL.
Backups are a special case: data deleted from production typically lives on in older backup versions until these expire. Deletion policies must therefore cover the backups themselves; encrypting the backups and destroying the corresponding key (sometimes called crypto-shredding) is a common way to render expired backups unreadable.
For cloud storage, you cannot overwrite physical blocks yourself: deletion depends on the provider. Check the terms of service, and consider encrypting the data with your own keys before uploading, so that destroying the key effectively deletes the data.
Finally, deleting a file from a version control system does not remove it from the history: a delete commit leaves every previous revision intact. Actually removing content requires rewriting the history, with tools such as git filter-repo (or the older git filter-branch) for Git, or hg convert with a filemap for Mercurial, and then making sure that no clone or mirror still holds the old history.
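As an illustration of the overwrite-then-delete approach for a plain file, here is a minimal Python sketch. It is best-effort only: as noted above, wear levelling on SSDs and snapshots on copy-on-write filesystems may preserve old blocks regardless:

```python
import os
import secrets
from pathlib import Path

def overwrite_and_delete(path: Path, passes: int = 3) -> None:
    """Best-effort secure deletion: overwrite the file's content in place,
    force it to disk, then remove the file."""
    size = path.stat().st_size
    with open(path, "r+b") as f:
        for _ in range(passes):
            f.seek(0)
            remaining = size
            while remaining > 0:                 # overwrite in 1 MiB chunks
                chunk = min(remaining, 1 << 20)
                f.write(secrets.token_bytes(chunk))
                remaining -= chunk
            f.flush()
            os.fsync(f.fileno())                 # push the overwrite to the disk
    path.unlink()                                # finally drop the directory entry
```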
3.12.2 When you should not delete data
Importance: high. It is generally a condition of a project funding or research output.
Some data should not be deleted, such as data which is part of a funded project and should live for a predetermined time, or data that is part of a publication and should be available for reproducibility. In these cases, it is important to have a clear policy on how to handle the data, such as how to store it securely, how to share it with the right people, how to monitor its usage, and how to eventually delete it when it is no longer needed.
If the life of the data is much longer than the project, the Data Management Solution might become difficult to maintain, due to lack of funding, lack of personnel, lack of knowledge, or because the software is not maintained anymore. In that case, it is important to have a clear plan on how to handle the data, such as how to migrate it to another solution, or how to archive it.