Posts Tagged Backup

Finding duplicate files

Posted on Wednesday, 23 May, 2012

You have a bunch of files (for instance, JPEGs). Over time they get moved around, family members send you more, and before you know it you have the same file in several places. You can sort out these duplicates manually, but you can also automate duplicate detection.

After a short search, I found this solution on LinuxQuestions.org:

tmp=$(mktemp)
find . -type f |xargs md5sum > $tmp
awk '{ print $1 }' $tmp |sort |uniq -d |while read f; do 
    grep "^$f" $tmp
    echo ""
done

This outputs a list of duplicate files once it has run to completion.

However, it breaks when file names contain whitespace or special characters such as the apostrophe. A solution:

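#!/bin/bash
tmp=$(mktemp)
# Escape apostrophes, then let xargs pass one file name per input line to md5sum.
find . -type f | sed -e "s/'/\\\'/g" | xargs -I{} md5sum {} > "$tmp"
# For each hash that occurs more than once, print all lines carrying that hash.
awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r f; do
    grep "^$f" "$tmp"
    echo ""
done
# Remove the temporary file when done.
rm -f "$tmp"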

Use -I{} and {} to make xargs treat each input line as a single argument, so the input to md5sum is split only at newlines and not at whitespace. The | sed -e "s/'/\\\'/g" part replaces every occurrence of the apostrophe ' with its escaped version \', as you would when entering it on the command line.
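An alternative that avoids escaping altogether is to pass the file names NUL-terminated. The following is a minimal sketch, assuming GNU find, xargs and md5sum. The function name find_dupes_by_hash is made up for illustration and does not come from the LinuxQuestions.org thread.

```shell
#!/bin/bash
# Sketch: print groups of files with identical MD5 sums below a directory.
# find_dupes_by_hash is a hypothetical helper name.
find_dupes_by_hash() {
    local dir=${1:-.} tmp
    tmp=$(mktemp) || return 1
    # -print0 and -0 hand over NUL-terminated names, so spaces and
    # apostrophes need no escaping. -r skips md5sum if nothing was found.
    find "$dir" -type f -print0 | xargs -0 -r md5sum > "$tmp"
    # For each hash that occurs more than once, print all lines carrying it.
    awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r hash; do
        grep "^$hash" "$tmp"
        echo ""
    done
    rm -f "$tmp"
}
```

Call it as find_dupes_by_hash ~/Pictures, or without an argument for the current directory. One caveat: GNU md5sum prints names containing a backslash or newline in an escaped form, so such files may not be grouped correctly.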

This is able to traverse deep into directory structures, and also accepts any filename I encountered in my dataset. It is however quite CPU intensive, as it calculates the MD5 hash for every file. If you only want to compare based on filename, the whole operation becomes a lot more lightweight.
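A name-only comparison could look roughly like this. It is a sketch that assumes GNU find (for -printf) and file names without tabs or newlines; find_dupes_by_name is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: print groups of paths that share the same base name.
# find_dupes_by_name is a hypothetical helper name.
find_dupes_by_name() {
    local dir=${1:-.}
    # GNU find: %f is the base name, %p the full path, separated by a tab.
    # LC_ALL=C gives a bytewise sort, so equal base names end up adjacent.
    find "$dir" -type f -printf '%f\t%p\n' | LC_ALL=C sort |
    awk -F'\t' '
        # A new base name starts: flush the previous group if it had duplicates.
        $1 != prev { if (n > 1) print group; group = ""; n = 0; prev = $1 }
        { group = group $2 "\n"; n++ }
        END { if (n > 1) print group }
    '
}
```

No file contents are read here, so this runs much faster than hashing, but two different files that happen to share a name are reported as duplicates as well.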

Duplicate detection with locate/mlocate.db

Actually, it is not necessary to index all files manually; there is a good chance this is already being done by the updatedb cron job. For instance,

skidder@spetznas:~$ locate fstab
/etc/fstab

and it also finds some other files containing the string fstab. Unfortunately, mlocate.db is a very simple list of file names only – a file size and an MD5 would greatly ease the detection of duplicates. So far I have not found a way to do this more efficiently than the shell script posted above.
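One possible way to reduce the hashing work without touching mlocate.db is to compare file sizes first and hash only files whose size occurs more than once. This is a sketch, not something from the original thread. It assumes GNU find, tr, xargs and md5sum, and paths without tabs or newlines; find_dupes_by_size_then_hash is a hypothetical helper name.

```shell
#!/bin/bash
# Sketch: hash only files whose size is shared with at least one other file.
# find_dupes_by_size_then_hash is a hypothetical helper name.
find_dupes_by_size_then_hash() {
    local dir=${1:-.} sizes hashes
    sizes=$(mktemp) || return 1
    hashes=$(mktemp) || return 1
    # GNU find: %s is the size in bytes, %p the path, separated by a tab.
    find "$dir" -type f -printf '%s\t%p\n' > "$sizes"
    # First pass counts each size, second pass prints paths with a shared size.
    awk -F'\t' 'NR == FNR { count[$1]++; next } count[$1] > 1 { print $2 }' \
        "$sizes" "$sizes" |
        tr '\n' '\0' | xargs -0 -r md5sum > "$hashes"
    # For each hash that occurs more than once, print all lines carrying it.
    awk '{ print $1 }' "$hashes" | sort | uniq -d | while read -r hash; do
        grep "^$hash" "$hashes"
        echo ""
    done
    rm -f "$sizes" "$hashes"
}
```

Files with a unique size cannot have a duplicate, so they are never read. How much this saves depends on how many files in the dataset share a size.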

Upgrading Ubuntu Server to 11.10 using SSH

Posted on Friday, 28 October, 2011

About half a year ago I wrote about Upgrading Ubuntu 10.04 using SSH. I did the same thing when 11.10 (Oneiric Ocelot) was released. And with great success!

Moving files to a Linux box

Posted on Thursday, 13 October, 2011

Moving files to (and from) a GNU/Linux box is trivial for those well-seasoned in the use of the command line. For Mac or Windows users who need to make use of a Linux box (read my article on Clustercomputing with Torque) this may be the first problem they encounter. It is easily overcome, though tough if you don’t know how it works.

Upgrading Ubuntu Server to 11.04 using SSH

Posted on Thursday, 28 April, 2011

I run a server at home (see NAS: Ubuntu Server on Intel ATOM) to, among many other tasks, back up the websites I run (see Backup your website using curlftpfs and rsync). Although it runs at a meager 35 watts, I figured there’s room for improvement. And there is, as Ubuntu 11.04 with the Linux 2.6.38-8-server kernel is supposedly more energy-efficient. This morning 11.04 became available and I wanted to do a dist-upgrade. Problem: I was at work. Luckily, SSH helps here!

Backup your website using curlftpfs and rsync

Posted on Monday, 28 March, 2011

If you have a website, your hosting provider may have installed some backup options. Sometimes it is even possible to back up to a remote FTP server. All nice, but what if you only have FTP access to your webspace?