A lurking global problem
Some years ago a friend of mine, a positive and rather carefree person, woke up one morning and called me for help: she was crying in despair, for she had lost her smartphone. It was too late, and only at that moment did she realize how dreadful the consequences were: her bank details, her contacts, her work, her government IDs, her signature, all of it was in there and accessible to the lucky one who had found her smartphone the night before.
In hindsight, it may seem obvious and it sounds legitimate to ask: “Why put yourself in such a fragile situation in the first place? You should have taken some precautions, right?”
It turns out that it’s not so obvious for most of us. In fact, sit back for a minute and ask yourself:
What if I lost my computer (or smartphone, or whichever device with personal data on it) just now? What would be the consequences?
It’s not evident what those consequences are until they’ve been brought to our attention. So here are a few possible things you could lose:
- Your credentials (login to websites for instance).
- Your money (bank account credentials, cryptocurrency wallet).
- Your personal data, such as pictures and videos.
- Your work data.
- Your contacts.
- Your conversations (emails, etc.).
- And probably much more.
A loss can be classified into the following categories:
- Destruction: data is gone. Question: Would you be able to precisely know which files were lost?
- Theft: someone else has your data, which means they probably have some of your credentials, private pictures, maybe money, etc.
- Unknown: you don’t know what happened to your device. As suggested above, it could have been stolen or destroyed. But not knowing may leave you deeply uneasy about the situation. You should always assume the worst: theft.
Eventually my friend found her phone hidden underneath her bed. A happy ending that at least served as a good wake-up call… or did it? I’m not quite sure she spent time working on some precautions later. But who can blame her? Unless you are a techie, it is overwhelming to envision how to even get started with those precautions.
The effort might not be worth it, so should we care at all, or simply accept the state of things as they are?
“Not gonna happen to me!”
We would naturally think so. It’s a common psychological fallacy and we should not fool ourselves: no one is immune to theft or accidental damage.
In fact, hard drives are among the most failure-prone pieces of hardware. Some day you’ll start your computer and the hard drive will be gone. Shit happens.
If you’ve got data stored in only one place, it means there is a single point of failure: that’s all you need to face a certain doom.
“I’m safe, my stuff is in the cloud. Or am I?”
Not so fast: what about your credentials for that “cloud?” If your device gets stolen, how confident are you that the thief won’t have access to your data? Even after a password change?
Is that cloud trustworthy? Who owns it? Is it in the owner’s interest to protect your privacy? Would you store embarrassing pictures there? Passwords? Work data?
Are you fully confident about what you’ve put in there? What if you’ve leaked a sensitive piece of data there by mistake? Can you take it back or will it be persisted forever on the cloud’s servers?
In this day and age of privacy protection, cloud storage requires extra caution, and at the bare minimum you should know what you are doing and understand the full extent of the technical implications.
“My weekly backup is enough, right?”
If you use your computer for work, which seems to be increasingly the case in our society, it’s probably not alright.
Think about it this way: how would you feel about a week of work going to waste? Are you ready to go through it all over again? As much as you might love your work, this is at the very least unproductive, if not outright demotivating. Should your work be something on the creative side (music, writing or maybe programming?), can you be certain you would produce the same result the second time? Will it be better or worse?
Even if you don’t work on a computer, the amount of data accumulated over a week can be significant enough that it would cost you a lot to lose it.
Conclusion: daily backups are better.
Aftermath: The system and data recovery
When this happens, you are faced with the inevitable rehabilitation part. It’s hard to give up on computers (or smartphones) these days, even if we are fed up with them. We have to get back on track, and this might be an exhausting process.
Have you ever:
- Spent a week re-installing and re-configuring your computer? (Or had someone do it for you?)
- Spent the same amount of time getting as much of your old data back as possible?
- Lived on with the discomforting itch that you forgot what you had lost?
- Had headaches synchronizing your devices with regard to data, configuration, credentials?
Even if you are not the geek kind and happily use the defaults you are provided with, the system will inevitably be shaped to your liking over time. You probably have some favourites, bookmarks, and credentials saved somewhere.
“Can we even do anything about it? Yes we can!”
At this point it must be apparent to most of us that those issues are enough of a concern that we can’t just sweep them under the rug.
In this article I am going to address several possible solutions:
- Backup user settings: this makes it trivial to synchronize the exact user profile to multiple machines. In other words, this allows you to log in on a new machine and replicate your exact working environment in a click.
- Backup data offline and online. The pros, the cons, and most importantly, the costs.
Unfortunately, this article will be more of an outline than a detailed walk-through, since the process is extremely dependent on your operating system (your choices are quite limited on Windows, for instance). More importantly, it is an attempt at increasing awareness about data protection, privacy and user-centric control.
For these to work, there are some essential requirements:
- Friction-less: if the process is cumbersome and lengthy, let’s face it, we will procrastinate. Even a simple copy-paste to an external hard drive becomes tiring in the long run and we will eventually postpone it for days and weeks.
- Fast and low on resources: the process must be fast enough and light on disk usage so that it can be run at least once a day. As we saw previously, even weekly backups could be insufficient.
- Automatable: it should be possible to have it run automatically every day.
- Cheap: we can’t always invest in hardware, servers, service subscriptions, etc.
Data can be divided into two categories:
- Public: anything that can be found “out there,” on the market. Typically music, movies, programs, etc. As a special case, much of your user settings can be safely marked as “public.”
- Private: your vacation pictures, your work, your credentials, etc.
The distinction matters because we are going to store some stuff online, in which case it must be very clear: your private data must be encrypted so that only you can access it with your private key.
Before getting started, let’s make sure we are on the same page when it comes to basic digital security and privacy requirements.
Computer Security 101
Regarding private data, there is something very important to understand: if it’s stored behind a password on the cloud, it does not mean it’s safe. It might be safe from external attackers, but the people running the cloud service have full, unrestricted access to it.
The only sane way to store data with an untrusted third party is to never let your data leave your machine unencrypted. Don’t let anyone encrypt data for you if you don’t want them to have full access to your data: you must do it yourself.
Understand that “encrypting your own data” is not an involved process: user-friendly programs will happily do it for you.
- Mobile devices: by their very nature, they are easy to lose or to steal. When this happens, the thief could have full access to your critical data (saved passwords, contacts, bank details, etc.). The PIN or the login password won’t protect you much if they can plug the hard drive into some other computer. This is why storage on mobile devices should always be fully encrypted: without the passphrase, the thief won’t be able to see anything but binary garbage on the device.
- None of the above matters if you cannot trust the underlying system (the programs and the operating system). It’s crucial that those are transparent and open enough for you to trust them. Which means that they must be free software, open source, reproducible. Guix is a good example of such a system.
- Password management: this animated overview by the EFF should give you a good feeling of how safe a password manager is and why you need it. As a bonus, it makes your life easier: it lifts the burden of having to remember legions of passwords.
Now that we’ve got a good understanding of the security requirements, let’s get down to the actual issue of safeguarding our data. The most obvious and straightforward approach is to buy multiple hard drives to duplicate the data.
It also happens to be a rather cheap approach. Renting storage online is usually more expensive per GB.
Never stick to a single hard drive as it would weaken your setup to a single point of failure. Hard drive failures occur quite often, so you are better off always acquiring hard drives in pairs (at least).
The most obvious way to back up your data is to copy it from one drive to the next.
In practice, this is not ideal:
- It’s too manual. It should be done automatically.
- It can be a slow process. If it’s too slow, we won’t do it that often. If backups are spaced too far apart, we increase the chances of a breakdown happening days (or weeks) after the last backup, thus losing much more data than is tolerable.
The answer to this is mirroring (like RAID1): whenever a file is copied on drive A, the computer automatically copies it on drive B.
There is a pitfall however: if a file is removed from A, it’s also removed from B. Removing a file by accident would make it effectively unrecoverable despite the backup, which kills the purpose of the whole thing.
Snapshots are like a save point of the drive at some point in time. A killer feature of snapshots is that they can be mounted like a drive and you can browse them just like any other folder. This effectively allows you to use (and compare!) multiple versions of your data at the same time. It’s obviously possible to revert to any snapshot, and even branch off from them should you decide to work with multiple histories of your data.
Snapshots are smart enough that they won’t duplicate data, so they are very efficient both to create (done in a matter of seconds) and to store (they require some tiny percent of your overall data usage).
Say you store some movies from dear Georges Méliès in kick-ass quality:
- A Trip to the Moon: 200 GB.
- The Impossible Voyage: 300 GB.
You take a snapshot named “dawn”. Total disk usage would be around 500 GB, snapshot included.
Now remove “A Trip to the Moon” and snapshot to “noon”. Total disk usage would still be around 500 GB because the “dawn” snapshot is still holding the movie.
Let’s add a new movie:
- Plan 9 from Outer Space: 700 GB.
We take a new snapshot named “dusk”, and now disk usage is around 1200 GB, all 3 snapshots included.
Two movies are “visible”: The Impossible Voyage and Plan 9 from Outer Space. But we can still go back in time and play A Trip to the Moon.
Last, we delete “dawn”: A Trip to the Moon is no longer referenced, so it is effectively removed from the hard drive and some space is freed: we are now using about 1000 GB.
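The arithmetic of this example can be replayed with a toy model. This is only a sketch under a simplifying assumption: a file takes up space once for as long as the live view or any snapshot still references it. The function and variable names are made up for illustration, and this is not how ZFS or Btrfs account for space internally.

```python
# File sizes in GB, taken from the example above.
SIZES = {
    "A Trip to the Moon": 200,
    "The Impossible Voyage": 300,
    "Plan 9 from Outer Space": 700,
}


def disk_usage(live_files, snapshots):
    """Return the space taken by every file that is still referenced.

    A file is counted once, however many snapshots hold it, and it stops
    counting only when neither the live view nor any snapshot holds it.
    """
    referenced = set(live_files)
    for files in snapshots.values():
        referenced |= files
    return sum(SIZES[name] for name in referenced)


live = {"A Trip to the Moon", "The Impossible Voyage"}
snapshots = {"dawn": frozenset(live)}
usage_dawn = disk_usage(live, snapshots)

live.discard("A Trip to the Moon")
snapshots["noon"] = frozenset(live)
usage_noon = disk_usage(live, snapshots)

live.add("Plan 9 from Outer Space")
snapshots["dusk"] = frozenset(live)
usage_dusk = disk_usage(live, snapshots)

del snapshots["dawn"]  # the last reference to "A Trip to the Moon" goes away
usage_after_delete = disk_usage(live, snapshots)

print(usage_dawn, usage_noon, usage_dusk, usage_after_delete)
# prints: 500 500 1200 1000
```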
Snapshots make it much safer to use mirroring: should you accidentally delete a file, it can be restored from a snapshot present on both drives.
Snapshots are available only to some file systems (which is determined when you format the hard drive). As of January 2019, good solutions include ZFS and Btrfs. From there, use dedicated tools (such as the btrfs command line tool) to create snapshots.
Note that hardware-based mirroring like RAID1 is not really necessary with ZFS or Btrfs, both of which support mirroring themselves.
As of January 2019, those file systems still tend to be used only marginally. I think it’s a pity considering what a game changer they are: by safeguarding the integrity of users’ data, computers suddenly become much friendlier machines!
So here we go, an ideal starting point for offline backups: 2 hard drives, both formatted using a file system with snapshot support and set up for mirroring. Once the drives are formatted, snapshots can be scheduled to run automatically, daily for instance. Then there is nothing left to do on the user end:
- It’s all automatic.
- It’s fast.
- It’s space efficient.
- It preserves the complete history of the data: It safeguards against accidental deletion for instance.
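To give an idea of what a daily automatic snapshot could look like with Btrfs, here is a minimal sketch that only builds the command line and does not run it. The paths and the date-based naming scheme are assumptions made for illustration; actually running the command requires a Btrfs subvolume, sufficient privileges, and some scheduler to trigger it once a day.

```python
import datetime


def snapshot_command(subvolume, snapshot_dir, day):
    """Build the command for a read-only Btrfs snapshot named after the day."""
    destination = f"{snapshot_dir.rstrip('/')}/{day.isoformat()}"
    # "-r" makes the snapshot read-only, so it cannot be altered by accident.
    return ["btrfs", "subvolume", "snapshot", "-r", subvolume, destination]


# Hypothetical paths: adapt them to your own layout.
command = snapshot_command("/home", "/home/.snapshots", datetime.date(2019, 1, 15))
print(" ".join(command))
# prints: btrfs subvolume snapshot -r /home /home/.snapshots/2019-01-15
```

Naming snapshots after the day keeps them sorted and makes it obvious which ones are old enough to be deleted.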
Now what if your hard drives all burn down at the same time? One way to cope with this is to have another computer in a remote location, but that might not be doable for all of us, as it’s more costly.
Another, also costly solution is online storage. The great selling point of many “cloud” solutions is that they protect you from real-life damage. That is, assuming the cloud providers have several data-centers and they don’t all burn down.
But you should not exclusively rely on a cloud service either. That would also get you back to a single point of failure. What if the company shuts down? What if they make a mistake and erase your data? What if you lose your credentials? What if…?
Remote storage is nonetheless a great solution for extra security beyond your local storage.
Remember however that you should not send anything unencrypted to the remote server if it belongs to an untrusted third-party.
There are a couple of approaches here:
- Synchronize your ZFS encrypted snapshots. (I’ve never done it myself, I just assume this would work in this scenario.) As of January 2019, Btrfs does not support snapshot encryption, so it’s a big no-no for remote synchronization. (Let me know if I’m wrong about this.)
- A dedicated backup manager (as of January 2019, BorgBackup is one of the prime tools in the field). Such tools work independently of your file system capabilities and allow you to store encrypted backups remotely. Since BorgBackup supports data deduplication (much like snapshots), backups scale well and won’t occupy much more than the sum of the different bits of data found across all backups.
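The deduplication idea can be illustrated with a toy content-addressed store: data is cut into chunks, each chunk is stored under its hash, and a chunk seen before is not stored again. This is only a sketch with hypothetical names. Real backup managers use smarter chunking than the fixed-size cut assumed here, and they encrypt and compress the chunks as well.

```python
import hashlib


class ChunkStore:
    """Toy content-addressed store: identical chunks are kept only once."""

    def __init__(self, chunk_size=4):
        self.chunk_size = chunk_size
        self.chunks = {}   # SHA-256 digest -> chunk bytes
        self.backups = {}  # backup name -> ordered list of digests

    def backup(self, name, data):
        """Record a backup as a list of chunk digests, storing new chunks only."""
        digests = []
        for start in range(0, len(data), self.chunk_size):
            chunk = data[start:start + self.chunk_size]
            digest = hashlib.sha256(chunk).hexdigest()
            self.chunks.setdefault(digest, chunk)
            digests.append(digest)
        self.backups[name] = digests

    def restore(self, name):
        """Rebuild the original data from the stored chunks."""
        return b"".join(self.chunks[digest] for digest in self.backups[name])

    def stored_bytes(self):
        """Return the space actually taken by the unique chunks."""
        return sum(len(chunk) for chunk in self.chunks.values())


store = ChunkStore()
store.backup("monday", b"AAAABBBBCCCC")
store.backup("tuesday", b"AAAABBBBDDDD")  # shares two chunks with Monday

print(store.stored_bytes())  # prints: 16 (instead of 24 without deduplication)
print(store.restore("monday") == b"AAAABBBBCCCC")  # prints: True
```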
At this point we’ve covered the question of safeguarding our data. Some legitimate concerns may have arisen:
- Offline and online backups are great, but admittedly come at a price. What if we cannot afford it? Or only partially, not for all the data?
- In the long run, snapshots may eat up too much space if they keep track of data that was deleted a long time ago. So it’s common to delete the older snapshots over time to regain some disk space, but then we lose part of the history.
- What about devices without storage space, like mobile devices? What about laptops without external hard drives?
In all those circumstances, it might not be possible to always keep track of all our data. But there is still some information we can preserve for very cheap: the file listings.
A file listing is a simple text file of all the files found on the drive, one full path per line.
While file listings don’t get us data back, they at least provide us with what data we have. This can be very valuable.
Think about it: when you accidentally lose data (e.g. you lose your computer), can you remember what you lost? Some of it, certainly, but what about the rest? Our memory isn’t that great, and it could very well be that we are not able to recall some important data either (as paradoxical as it may sound).
File listings rarely occupy more than a few megabytes and they are fast to generate.
File listings can then be kept under version control, for instance in some private repository of yours (possibly on remote storage), preferably encrypted. This way you’ll not only keep the list of files but also the history of all the files you had at every point in time.
None of this should be done manually and just like snapshots, we are better off if they are run automatically, e.g. once a day.
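Generating a file listing takes very little code. The following is a minimal sketch with hypothetical function names, not the script linked later in this article. It walks a directory tree and writes one full path per line, sorted so that two successive listings compare cleanly under version control. The demonstration at the end runs on a throwaway directory.

```python
import os
import tempfile


def list_files(root):
    """Return the sorted full paths of all non-directory entries under root."""
    paths = []
    for directory, _subdirectories, filenames in os.walk(root):
        for filename in filenames:
            paths.append(os.path.join(directory, filename))
    return sorted(paths)


def write_listing(root, listing_path):
    """Write one full path per line and return the number of files listed."""
    paths = list_files(root)
    # "surrogateescape" lets file names with unusual bytes round-trip safely.
    with open(listing_path, "w", encoding="utf-8", errors="surrogateescape") as out:
        for path in paths:
            out.write(path + "\n")
    return len(paths)


# Demonstration on a temporary tree standing in for a real drive.
with tempfile.TemporaryDirectory() as root, tempfile.TemporaryDirectory() as out_dir:
    os.makedirs(os.path.join(root, "pictures"))
    for name in ("notes.txt", os.path.join("pictures", "cat.jpg")):
        open(os.path.join(root, name), "w").close()
    listing_path = os.path.join(out_dir, "listing.txt")
    count = write_listing(root, listing_path)
    with open(listing_path, encoding="utf-8") as listing:
        lines = listing.read().splitlines()
    relative = [os.path.relpath(line, root) for line in lines]

print(count, relative)
```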
Reproducible user profile
Data is not everything and backing up your user settings like regular data is not the smartest thing to do. Let’s get down to it without further ado.
Versioning your user settings
User settings are everything about your environment:
- Favourite programs.
- All the configurations of those programs.
- Keyboard shortcuts.
- File shortcuts.
- Accessibility configurations.
Why not back them up like regular data, one may ask? Because of a fundamental difference: the user settings are much more akin to a computer program that glues together all your other programs. It’s not static data and thus it benefits greatly from being transparent and reproducible.
It might not be obvious, but for the better part, those settings are far from being confidential and it’s often fine, even recommended, to share them publicly, like any free software.
There is a long-standing tradition among hackers to share their user profile configuration, often nicknamed dotfiles. Those are often stored in version-controlled repositories, for instance with Git. You’ll find mine here :)
Depending on your involvement with computers, your user settings might be more or less extensive. But even with simpler settings, it is often useful to keep track of them under version control. Version control offers the following perks:
- Decentralized backups: it’s on all your devices, plus on all the servers where you’ve synchronized them.
- You have full control over what’s in it, what is not, what changes and the history of changes since the beginning of time.
- Version control checks the data integrity at all times: it gives you a full guarantee over what you are getting. Thus it’s fully reproducible.
Private settings and data
Some of your settings might be private. In general, it’s mostly about our personal activity on a computer, for instance:
- Bookmarks, favourites.
- Bucket lists, “Sticky notes.”
- All sorts of notes.
- Some preferences of your web browser.
- Contacts, address book.
This data can be kept under version control as well, but remember Computer Security 101: encrypt the repository if it’s synchronized with an untrusted third-party server.
It’s also possible to only encrypt the sensitive files in a repository. For instance, you can encrypt files with GnuPG and store the resulting files in a Git repository.
To display the history of an encrypted file and the differences between two versions, declare a diff driver for the encrypted files in a .gitattributes file in the Git repository.
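As a minimal sketch of that kind of setup, and not necessarily the exact configuration used here, Git can be told to decrypt files before comparing them through its textconv mechanism. The ".gpg" file pattern and the driver name "gpg" are assumptions. The small helper below only takes care of adding the attribute line once, and the demonstration runs on a throwaway directory rather than a real repository.

```python
import os
import tempfile

# Assumed convention: encrypted files end in ".gpg", diff driver named "gpg".
ATTRIBUTE_LINE = "*.gpg diff=gpg"
# Assumed companion setting, to be run once inside the repository: it tells
# Git to show the decrypted text when it compares two versions.
CONFIG_COMMAND = ["git", "config", "diff.gpg.textconv", "gpg --no-tty --decrypt"]


def ensure_line(path, line):
    """Append line to the file at path unless it is already present."""
    content = ""
    if os.path.exists(path):
        with open(path, encoding="utf-8") as existing:
            content = existing.read()
    if line in content.splitlines():
        return False
    with open(path, "a", encoding="utf-8") as out:
        if content and not content.endswith("\n"):
            out.write("\n")
        out.write(line + "\n")
    return True


with tempfile.TemporaryDirectory() as repository:  # stand-in for a repository
    attributes = os.path.join(repository, ".gitattributes")
    first = ensure_line(attributes, ATTRIBUTE_LINE)
    second = ensure_line(attributes, ATTRIBUTE_LINE)
    with open(attributes, encoding="utf-8") as result:
        written = result.read()

print(first, second, repr(written))
# prints: True False '*.gpg diff=gpg\n'
```

With both pieces in place, commands such as git diff show changes in the decrypted content, provided the private key is available on the machine.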
User profile initialization
Your user profile is not just about configuration files and data. There might be some tasks you’d like to run to initialize your environment back to the desired state.
Most obviously, your favourite programs must be installed. Furthermore, you’ll probably want to initialize the credentials (e.g. the password manager), synchronize your emails, etc.
Needs vary and it’s hard to fit everyone’s shoes at the same time, so over time I wrote a script (i.e. a small, quickly-patched-together program) that would fit all my personal requirements:
- Install the list of my programs.
- Retrieve my private data.
- Retrieve my password manager database.
- Retrieve and install my user settings (the “dotfiles”).
- Retrieve my emails.
- And some other nits…
The result is the following: after a fresh installation, or the first time I log in on a new machine, I run the script, wait a few seconds (or minutes, depending on the Internet connection) and there it is: my exact user environment as I left it last time I synchronized my user profile.
User profile synchronization
The user profile must also be synchronized. While the “dotfiles” synchronization is done with the version control system, there is more:
- Un-synchronized work files.
- Un-synchronized credentials (e.g. your private keys, your password manager’s database).
- The updated list of your installed programs.
- The file listings.
Again, your mileage may vary. So I wrote another script to do all the above for me. In particular, it reports all version control repositories that are not synchronized, so that I remember to finish and synchronize my pending work on all projects before going to bed. This can be done automatically if need be.
A synchronization takes no more than a couple of seconds to run and can easily be done every day, even automatically.
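Reporting unsynchronized repositories can be approximated by reading the output of git status. The sketch below is an illustration with hypothetical function names, not the synchronization script itself. It assumes the short format printed by "git status --porcelain --branch", where the first line describes the branch and every other line is a pending change.

```python
import subprocess


def sync_problems(status_output):
    """List the reasons why a repository is not synchronized.

    status_output is the text printed by "git status --porcelain --branch".
    """
    lines = status_output.splitlines()
    header = lines[0] if lines and lines[0].startswith("## ") else ""
    changes = [line for line in lines if not line.startswith("## ")]
    problems = []
    if changes:
        problems.append(f"{len(changes)} uncommitted change(s)")
    if "..." not in header:
        # The branch line reads "## local...remote" when an upstream is set.
        problems.append("no upstream branch")
    elif "[" in header:
        # For instance "[ahead 2]", "[behind 1]" or "[ahead 2, behind 1]".
        detail = header[header.index("[") + 1:].rstrip("]")
        problems.append("out of sync with upstream: " + detail)
    return problems


def repository_status(path):
    """Run git in the given repository and return its status output."""
    result = subprocess.run(
        ["git", "-C", path, "status", "--porcelain", "--branch"],
        capture_output=True, text=True, check=True,
    )
    return result.stdout


sample = "## master...origin/master [ahead 2]\n M homesync\n?? notes.txt\n"
print(sync_problems(sample))
# prints: ['2 uncommitted change(s)', 'out of sync with upstream: ahead 2']
```

Looping over a list of repository paths and printing the non-empty results of sync_problems(repository_status(path)) would give the nightly report.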
At this point, if I lose my computer, I’ll be able to restore an environment matching the last synchronization that won’t be older than a day. In my case, the process boils down to:
- Get a bootable USB of Guix.
- Fetch my Guix system configuration script from my dotfiles.
- Run guix system init configuration-script.scm.
- Start the newly installed computer and log in.
- Run my user initialization script from my dotfiles.
It’s very relieving in the long run to live with the confidence that the worst-case scenario is not so bad at all.
Note to hackers: Script implementation details
If you are not familiar with programming, you can safely skip this section.
- Initialization script: https://gitlab.com/ambrevar/dotfiles/raw/master/.local/bin/homeinit
- Synchronization script: https://gitlab.com/ambrevar/dotfiles/raw/master/.local/bin/homesync
- Program listing script: https://gitlab.com/ambrevar/dotfiles/raw/master/.local/bin/package-lister
- File listing script: https://gitlab.com/ambrevar/dotfiles/raw/master/.local/bin/dataindex
I originally wrote those scripts a long time ago with different target systems in mind (FreeBSD and Arch Linux among others). Some requirements were:
- Interpretable: I must be able to hack them in case I need to adapt something to the system.
- Retrievable over the network and verifiable. So I would host them under version control.
- Portable: they must run everywhere with no dependencies.
- Idempotent: Running it multiple times should produce the same result.
- Lazy: Only perform a task if necessary. A second run should terminate in seconds.
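The last two requirements can be illustrated with a short sketch. It is written in Python for readability rather than in POSIX shell, and the task and file names are hypothetical. Each task first checks whether the desired state is already reached and only acts when it is not, so a second run does nothing and finishes immediately.

```python
import os
import tempfile


def ensure_content(path, content):
    """Make the file at path contain content; return True if it was written."""
    try:
        with open(path, encoding="utf-8") as existing:
            if existing.read() == content:
                return False  # Lazy: already in the desired state.
    except FileNotFoundError:
        pass
    with open(path, "w", encoding="utf-8") as out:
        out.write(content)
    return True


def run_tasks(tasks):
    """Run every task and return the names of those that changed something."""
    return [name for name, task in tasks if task()]


with tempfile.TemporaryDirectory() as home:  # stand-in for a home directory
    tasks = [
        ("gitconfig", lambda: ensure_content(os.path.join(home, ".gitconfig"), "[user]\n")),
    ]
    first_run = run_tasks(tasks)
    second_run = run_tasks(tasks)

print(first_run, second_run)
# prints: ['gitconfig'] []
```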
For portability’s sake, I started off writing a POSIX shell script, since it seems to be the only language that can be understood on almost all systems.
In hindsight, this proved to be a debatable choice as the script grew and more complex features were added. POSIX shell is a very poor and limited language to program with.
Today, I mostly use Guix, so portability is less of a concern. Even then, it’s not far-fetched to ask for a tiny requirement: a widely available interpreter. Then the installation process would only ask for one more step: the installation of the interpreter.
I could have stuck to a much more powerful programming language. Even then, portability would not be such an issue: Guile Scheme, for instance, is a nonrestrictive requirement as it’s rather light and widely available. Finally, it’s about time we broke with the tradition that the only portable scripting language should be one of the worst. We need to move on and use better programming languages globally.
I’m planning to write a more complete, extensible and universal “user profile management” tool, probably in Guile Scheme.
The data frenzy: a social drift of the new millennium?
This was a long article. At this point you might wonder: “Why should we care so much about our data anyways? Aren’t we getting too attached to technology?”
It’s a vast topic and there is probably too much to talk about to fit in this one article. So I’ll keep it to just a few points for now:
- The blame does not have to be put on our attachment to data, but rather on the setup and the infrastructure. Data attachment and data loss crises essentially occur because currently user data is under the spotlight while typical computer setups are extremely fragile. The social and psychological question of data-attachment would mostly be moot if the technology of backups and users’ control over their data was appropriate to its level of importance.
- This article is not about the effort every user should make, it’s how vendors should set up their products so that everything is ready for backup-and-control out of the box.
- We don’t have to be attached to our data. Having the possibility to control it and to rely on it is a different thing. I believe we should all have the right either to ignore our data or to depend on it. It should be our own decision, hence the importance of user-centric control.
- Data is increasingly reflecting power. When external entities own our data (even part of it), such as corporations with poor incentives to stand for us, we are threatening our democratic rights on the political level, and our individuality on the social level. If we are the only and full proprietors of our own data, we remain in control to stand strong as first class citizens and individuals. Whether we are data-craving techies or not, society is making a choice here, and we need to enforce our rights as its members, lest we lose our place in society.
On a more abstract level, data can be seen as a form of human consciousness expansion. Our brain is limited and can store only so much information.
There was a time when mankind was little aware of notions such as freedom and choice. The philosophy of individualism is a rather recent evolution. User data could be just another form of human evolution, that of memory expansion. It might be hard to foresee the benefits at this early stage, but so it certainly was when the Enlightenment philosophers were thinking up ideas of individualism. Time will tell, I suppose.
An interesting experiment is that of the Facebook data, for those who’ve tried the social network for a couple of months or years. Facebook allows its users to download an archive of a collection of data that Facebook has gathered about them since they created an account. (Note that it’s almost certain that lots of data is missing from that archive, and Facebook knows way more.) Going through the archive for 5 minutes will give you a look back at your own self from months and years ago, to a level of detail you would not be capable of digging out yourself with your memory alone. Yes, to some extent, social networks like Facebook might know more about you than you do.
The Internet-connected society is growing to become an entity that knows more about human beings than they know about themselves. If we as individuals don’t want to be overwhelmed and overtaken in this play of power, we might need to extend our capabilities and what defines us to something that can safeguard our power against this societal paradigm shift.
I believe the setup I’ve presented in this article provides some definite benefits, and yet there is much left to improve. In particular when it comes to universal accessibility.
Now let’s dream on a little bit and munch over some crazy ideas.
(Don’t hesitate to let me know if this is nowhere close to feasible, or, on the contrary, if it’s already done or close to being achievable.)
First of all, it’s quite clear today that many people don’t like to have to bother with data storage. The “cloud” is such an attractive concept, it would be really nice if we could use it without its privacy-infringing pitfalls.
So if we really want to go in that direction, there are a few requirements:
- It should all be encrypted. So regular users must properly learn about authentication systems and understand what it means to keep a secret key secret, for real.
- It should be distributed, which means there would be no single point of failure anywhere in the world. User data should not be censored or blocked or removed without the user’s consent.
- Free and huge (unlimited?) storage space. Paying for data storage poses a threat to social equality, as richer people would have the possibility to store more data and thus have a more extensive “memory,” if not individuality.
In their talk, the IPFS team shows how the ratio of disk space to cost has increased more rapidly than the ratio of Internet bandwidth to cost. This could be interpreted to mean that if we all shared our storage space in a storage pool distributed over the Internet, we could simulate a seemingly infinite storage space available for everyone to use (with smart space optimization like data deduplication and compression).
IPFS is a prime implementation of this, but some pieces of the puzzle are still missing. For one, the incentive for every user to make their storage space available for everyone to use. Should we work out such a system, we would basically re-create Silicon Valley’s Pied Piper where our data is everywhere and nowhere at the same time, and there would be no more need for a “Download” button!