Intro

Some weeks ago, my cellphone crashed and never turned on again, so I lost several pictures and videos that I’m trying to recover. However, this made me wonder how I should back up my data, just in case something similar occurs again.

I have all Google sync/backup disabled because I don’t want to keep feeding the monster, so that is definitely not an option for me. As an active Proton fan and subscriber (see Proton.me), I came up with the idea of using Proton Drive as my cloud storage to securely store my data.

While looking at the files on my personal computer, I realized I had various backups from previous phones and computers I owned. The problem was that these backups contained lots of duplicates, especially images and videos. Even though I have a lot of free space in my Proton Drive account, it makes no sense to store the same files multiple times, right?

Quick solution

I wrote a simple Python script to quickly solve the problem. You can access the latest version here: deduplicate.py.

This script scans a given directory with os.walk() and computes the MD5 checksum of each file with hashlib.md5(). It then checks whether a file with the same checksum was already visited; if so, the file is treated as a duplicate. Finally, it gives the user the option to delete the duplicates.
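
To make those steps concrete, here is a minimal sketch of the same idea. It is not the actual deduplicate.py: the function names (file_md5, find_duplicates, delete_duplicates), the chunked reading, and the single yes/no prompt are my own assumptions for illustration.

```python
import hashlib
import os


def file_md5(path, chunk_size=1 << 20):
    """Return the hex MD5 digest of a file, read in chunks to limit memory use."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def find_duplicates(root):
    """Walk root and return (duplicate_path, first_seen_path) pairs."""
    seen = {}  # checksum -> first path visited with that checksum
    duplicates = []
    for dirpath, dirnames, filenames in os.walk(root):
        dirnames.sort()  # visit subdirectories in a predictable order
        for name in sorted(filenames):
            path = os.path.join(dirpath, name)
            # Skip symlinks and anything that is not a regular file.
            if os.path.islink(path) or not os.path.isfile(path):
                continue
            checksum = file_md5(path)
            if checksum in seen:
                duplicates.append((path, seen[checksum]))
            else:
                seen[checksum] = path
    return duplicates


def delete_duplicates(duplicates, confirm=input):
    """Ask once, then delete the duplicate copies; return the deleted paths."""
    if not duplicates:
        return []
    answer = confirm(f"Delete {len(duplicates)} duplicate file(s)? [y/N] ")
    if answer.strip().lower() != "y":
        return []
    deleted = []
    for path, _first_seen in duplicates:
        os.remove(path)
        deleted.append(path)
    return deleted
```

Calling delete_duplicates(find_duplicates("/path/to/backups")) would list nothing and simply ask for confirmation before removing the later copies; the path is a placeholder. Note that MD5 is fine for spotting accidental duplicates, but it is not collision-resistant, so a more cautious version could compare file contents byte by byte before deleting.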

There are several improvements to be made: giving users the option to choose which copy to keep, filtering by file extension, performing basic path validation, improving the interaction, and so on. Anyway, I just needed a handy way to get rid of duplicates, so it worked as is.
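
For the curious, three of those improvements could look roughly like the sketch below. None of this exists in the script today; the helper names (validate_root, has_allowed_extension, choose_copy_to_keep) and the "default to the first copy" behaviour are hypothetical choices of mine.

```python
import os


def validate_root(root):
    """Return the absolute path of root, or raise if it is not a directory."""
    absolute = os.path.abspath(os.path.expanduser(root))
    if not os.path.isdir(absolute):
        raise NotADirectoryError(f"Not a directory: {absolute}")
    return absolute


def has_allowed_extension(path, extensions):
    """Return True if path has one of the extensions (case-insensitive).

    extensions is a set of lowercase suffixes such as {".jpg", ".mp4"};
    an empty set means "accept every file".
    """
    if not extensions:
        return True
    return os.path.splitext(path)[1].lower() in extensions


def choose_copy_to_keep(paths, ask=input):
    """Show the identical copies and return (keep, remove_list).

    The answer is a 1-based index; anything else falls back to the first copy.
    """
    for index, path in enumerate(paths, start=1):
        print(f"  [{index}] {path}")
    answer = ask(f"Keep which copy? [1-{len(paths)}, default 1] ").strip()
    keep_index = 0
    if answer.isdecimal() and 1 <= int(answer) <= len(paths):
        keep_index = int(answer) - 1
    remove = paths[:keep_index] + paths[keep_index + 1:]
    return paths[keep_index], remove
```

The extension filter would run before hashing, so files I don't care about are never read, and the keep/remove choice would run once per group of identical files.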