Archive for category Science

Finding duplicate files

Posted by on Wednesday, 23 May, 2012

You have a bunch of files (for instance, jpegs). Over time they get moved around, you get a bunch from some family members and before you know it you have a situation where you have the same file multiple times. Of course you can manually sort out these duplicates, but you can also automate duplicate detection.

After a short search, I found this solution on

find . -type f |xargs md5sum > $tmp
awk '{ print $1 }' $tmp |sort |uniq -d |while read f; do 
    grep "^$f" $tmp
    echo ""

This outputs a list of duplicate files once it has run to completion.

However, it borks when it encounters whitespaces, special characters like the apostrophe etc. A solution:

find . -type f | sed -e "s/'/\\\'/g" |xargs  -I{} md5sum {} > $tmp
awk '{ print $1 }' $tmp |sort |uniq -d | while read f; do 
    grep "^$f" $tmp
    echo ""

Use the -I{} and {} to make sure the input to md5sum is not terminated by whitespaces, but only by endlines. Also, the “| sed -e “s/’/\\\’/g”” part replaces every occurence of the apostrophe “‘” with its escaped version “\'” as you would when entering it on the commandeline.

This is able to traverse deep into directory structures, and also accepts any filename I encountered in my dataset. It is however quite CPU intensive, as it calculates the MD5 hash for every file. If you only want to compare based on filename, the whole operation becomes a lot more lightweight.

Duplicate detection with locate/mlocate.db

Actually, it is not necessary to manually index all files, good chance this is already being done by the updatedb cronjob. For instance,

skidder@@spetznas:~$ locate fstab

and it also finds some other files containing the string fstab. Unfortunately, mlocate.db is a very simple list of filenames only – a file size and an MD5 would greatly ease the detection of duplicates. So far I have not found a way to do this more efficiently than the shellscript posted above.

Speeding up OMNeT++/MiXiM simulations

Posted by on Wednesday, 21 December, 2011

Compiler optimisation flags can be quite helpful in bringing the (sometimes huge) duration of your OMNeT++ simulations to an acceptable level. Using a compute cluster can help, but even then the execution of a single process can take many hours for large networks. So let’s look at how you compile your code; I got a speed-up of approximately 10 times on some of my simulations. Read the rest of this entry »

Bug in Mac80211 in MiXiM post-backoff

Posted by on Friday, 16 December, 2011

Post-backoff was introduced into IEEE 802.11 to make sure that a station could not claim the channel by noting that the channel was idle after its own transmission, and then immediately transmit again. The idea is quite simple; transmission over -> pull a backoff from the contention window and count down (respecting the same rules w.r.t. carrier sense and inter-frame spacing etc.) and finally end up in the idle state. If a new packet was handed down by the network layer during post-backoff, the countdown would continue and result in a transmission. This behaviour is also implemented in MiXiM’s Mac80211 (and prior to that in the Mobility Framework). However, there’s a bug. Read the rest of this entry »

Bug in Mac80211 in MiXiM backoff when channel busy

Posted by on Thursday, 15 December, 2011

This post describes a bug in MiXiM’s Mac80211 which seems to be a fundamental error: when the MAC gets a packet from the Netw and the channel is busy, it schedules a senseChannelWhileIdle(currentIFS + remainingBackoff) after the ongoing transmission ends. Unfortunately, remainingBackoff is often 0 as post-backoff is likely to have completed. The result? Many synchronised collisions one IFS after the ongoing transmission. Read the rest of this entry »

Getting grip on eventlogs in OMNeT++

Posted by on Monday, 12 December, 2011

OMNeT++ simulations can export an eventlog. Though the manual is pretty complete about it, I’ll summarise it here. Read the rest of this entry »