Finding duplicate files

Wednesday, May 23, 2012 Posted by

You have a bunch of files (for instance, JPEGs). Over time they get moved around, you receive copies from family members, and before you know it the same file exists in several places. Of course you can sort out these duplicates manually, but you can also automate duplicate detection.

After a short search, I found this solution on LinuxQuestions.org:

tmp=$(mktemp)
find . -type f |xargs md5sum > $tmp
awk '{ print $1 }' $tmp |sort |uniq -d |while read f; do 
    grep "^$f" $tmp
    echo ""
done

This outputs a list of duplicate files once it has run to completion.

However, it borks when it encounters filenames containing whitespace or special characters such as the apostrophe. A solution:

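#!/bin/bash
# Hash every file below the current directory into a temporary list.
tmp=$(mktemp)
find . -type f | sed -e "s/'/\\\'/g" | xargs -I{} md5sum {} > "$tmp"
# Print each group of lines that share a hash, separated by a blank line.
awk '{ print $1 }' "$tmp" | sort | uniq -d | while read -r f; do
    grep "^$f" "$tmp"
    echo ""
done
# Clean up the temporary list.
rm -f "$tmp"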

The -I{} option makes xargs split its input on newlines only, rather than on any whitespace, so each full line is passed to md5sum as a single argument in place of {}. In addition, the | sed -e "s/'/\\\'/g" part replaces every occurrence of the apostrophe ' with its escaped version \', as you would when typing it on the command line.
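If the GNU versions of find, xargs and uniq are available (an assumption, the script above does not depend on them), NUL-delimited file names avoid the escaping step altogether. A minimal sketch of that variant, with find_dupes as a made-up function name:

```shell
#!/bin/bash
# Sketch: print groups of files with identical MD5 sums below a directory.
# Assumes GNU find (-print0), GNU xargs (-0, -r) and GNU uniq (-w, --all-repeated).
# find_dupes is a hypothetical name, not part of the original script.
find_dupes() {
    local dir="${1:-.}"
    # -print0 / -0 pass file names NUL-terminated, so spaces and quotes are safe.
    find "$dir" -type f -print0 |
        xargs -0 -r md5sum |
        LC_ALL=C sort |
        # Compare only the 32 hash characters; print every member of each
        # duplicate group, with a blank line between groups.
        uniq -w32 --all-repeated=separate
}
```

Running find_dupes . should give the same kind of listing as the script above. One caveat: md5sum marks file names containing a backslash or newline by starting the line with a backslash, so those rare names are not grouped correctly by this sketch.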

This is able to traverse deep into directory structures, and also accepts any filename I encountered in my dataset. It is however quite CPU intensive, as it calculates the MD5 hash for every file. If you only want to compare based on filename, the whole operation becomes a lot more lightweight.
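A filename-only comparison could look roughly like the sketch below. It assumes GNU find (for -printf) and file names without tabs or newlines; dupe_names is a made-up function name. Keep in mind that two files with the same name need not have the same content.

```shell
#!/bin/bash
# Sketch: list files whose base name occurs more than once below a directory.
# Assumes GNU find (-printf) and file names without tab or newline characters.
# dupe_names is a hypothetical name chosen for this example.
dupe_names() {
    local dir="${1:-.}"
    # Emit "basename<TAB>path" per file, then print only the groups
    # whose basename was seen more than once, separated by blank lines.
    find "$dir" -type f -printf '%f\t%p\n' |
        LC_ALL=C sort |
        awk -F'\t' '
            { count[$1]++; lines[$1] = lines[$1] $0 "\n" }
            END { for (n in count) if (count[n] > 1) printf "%s\n", lines[n] }'
}
```

No file content is read here, only directory entries, which is why this is so much cheaper than hashing.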

Duplicate detection with locate/mlocate.db

Actually, it is not necessary to index all files manually: there is a good chance this is already being done by the updatedb cron job. For instance,

skidder@spetznas:~$ locate fstab
/etc/fstab

and it also finds some other files containing the string fstab. Unfortunately, mlocate.db is a very simple list of filenames only – a file size and an MD5 would greatly ease the detection of duplicates. So far I have not found a way to do this more efficiently than the shell script posted above.
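The file size idea can at least be tried without mlocate.db: files of different sizes cannot be identical, so only files that share a size need to be hashed. The following is a rough sketch of that, not something I have benchmarked. It walks the tree with find rather than using the locate database, assumes GNU find, xargs and uniq, assumes file names without newlines, and uses the made-up name dupes_by_size_then_md5.

```shell
#!/bin/bash
# Sketch: hash only files that share their size with at least one other file.
# Assumes GNU find (-printf), GNU xargs (-d, -r), GNU uniq (-w, --all-repeated)
# and file names without newline characters.
# dupes_by_size_then_md5 is a hypothetical name chosen for this example.
dupes_by_size_then_md5() {
    local dir="${1:-.}"
    # Step 1: emit "size<TAB>path" and keep only paths whose size is not unique.
    find "$dir" -type f -printf '%s\t%p\n' |
        awk -F'\t' '
            { n[$1]++; p[$1] = p[$1] substr($0, length($1) + 2) "\n" }
            END { for (s in n) if (n[s] > 1) printf "%s", p[s] }' |
        # Step 2: hash the remaining candidates, one path per input line.
        xargs -r -d '\n' md5sum |
        LC_ALL=C sort |
        # Step 3: print groups with identical hashes, separated by blank lines.
        uniq -w32 --all-repeated=separate
}
```

How much this saves depends on the dataset: in a collection where most files have distinct sizes, few files get hashed at all.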

The many uses of Eldritch Blast in D&D

Tuesday, May 22, 2012 Posted by

The Warlock’s Eldritch Blast is a formidable weapon against just about any foe in the D20 realm. However, it also has many less conventional uses.

5.11 Zipper-pull tabs: sell them separately please

Wednesday, May 16, 2012 Posted by

Whoever wears gloves on a regular basis knows that having small zippers on your gear is inconvenient. No matter if they are winter gloves, motorcycle gloves or tactical gloves – even the most dexterous and nimble gloves may cause trouble with the small zippers on some equipment. Many manufacturers include “zipper-pull tabs”: a piece of string (paracord or something similar) with a small tab. Unfortunately, not every manufacturer does this so I spend some time equipping all of my new gear with zipper-pull tabs.

Recently I acquired a 5.11 Sabre 2.0 Jacket which sports some really nice zipper-pull tabs. And since 5.11 is known for listening to its customers – they are the ultimate specialist when it comes to the gear, right? – I asked them if it would be possible to sell these zipper-pull tabs separately. Ideally in two sizes – for both small and large sized zippers, and in several color schemes. At least desert tan, olive drab and black. Please.

So let’s see if our prayers will be answered 😉

Crashed Mac: rescue files with Linux

Friday, May 11, 2012 Posted by

We have a crashed 2007 MacBook, which simply won’t boot anymore. The guys at the Mac store were not able to do anything and gave the standard answer: buy a new one! So we did, and luckily we did have Time Machine backups, which we put on the newly bought MacBook Pro. Unfortunately, the last successfully completed backup was from a few months ago – the system had been gradually crapping up more and more, and apparently wasn’t even able to complete its backups for the last few months of its life.

Now the question is, how do we get all the files off of the disk? We use Linux.

HP support for Scanjet on OS X 10.7 Lion

Sunday, May 6, 2012 Posted by

I’m a long-time fan of Hewlett Packard products. They work and keep on working. And, very important, they work perfectly with the Linux platform. However, it turns out users on Mac OS are not so fortunate.