elmindreda / duff
Command-line utility for finding duplicate files
License: Other
For checking that a backup is complete, or that I have all the files from a camera SD card before wiping the card, it would be useful to run duff in a "find unique" mode that lists files which don't have duplicates.
As ever, this functionality can be constructed with an appropriate pipeline of find/sha1sum/sort/uniq or similar, but perhaps it's close enough to what duff does to be worth including?
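As a rough sketch of what such a pipeline computes (illustrative Python, not duff's actual implementation; the digest choice and chunk size are arbitrary): hash every file under a directory, then report the files whose contents occur exactly once.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """SHA-1 digest of a file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def unique_files(root):
    """Yield files under root whose contents appear exactly once."""
    by_digest = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    for paths in by_digest.values():
        if len(paths) == 1:
            yield paths[0]
```

A built-in mode could of course avoid hashing singleton file sizes at all, which the naive pipeline cannot.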
With find you can ignore files that are of little consequence, such as files that are really small or really big:
# Files more than 1 gigabyte
find -size +1G
# Files less than 1 megabyte
find -size -1M
This would be great for duff, because when trying to free up disk space one wants to find the big files (e.g. videos) without the output being flooded by static web content (e.g. jquery-1.9.2.js, bootstrap.css).
Hopefully that isn't too tough to implement, unlike sorting, which would be great but probably algorithmically prohibitive. Being able to use duff with pipes to do things like filtering would be even smarter, but I don't see a way to do it with the way duff reports (except with the -e option, which is a bit risky).
I feel like this should be obvious, but maybe I'm missing something. duff marks hardlinked files as "duplicates", which means that doing the obvious thing - using duff to reduce clutter and delete duplicate files - will result in deleting files with hard links (multiple filenames for the same data). I can't think of any reason why this should be the default rather than the opposite. Basically, -p should be the default, right?
-p Physical mode. Make duff consider physical files instead of hard links. If specified, multiple hard links to the same physical file will not be reported as duplicates.
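A minimal sketch of what physical mode amounts to (illustrative Python, not duff's C internals): two paths name the same physical file exactly when their device and inode numbers match, so collapsing on that pair removes hard-link "duplicates" before comparison.

```python
import os

def physical_files(paths):
    """Collapse hard links: keep one path per (device, inode) pair."""
    seen = {}
    for path in paths:
        st = os.stat(path)
        # st_dev + st_ino together identify the physical file
        seen.setdefault((st.st_dev, st.st_ino), path)
    return list(seen.values())
```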
It appears that in 2021 this domain was registered after having lapsed. Currently, duff.dreda.org seems to redirect to malvertising of some sort ("Your computer is infected with a virus!" type stuff).
Tried to build from source by following the instructions in the README: first gettextize --no-changelog, then autoreconf -i, and got this error:
configure.ac:47: error: `po/Makefile.in' is already registered with AC_CONFIG_FILES.
../../lib/autoconf/status.m4:288: AC_CONFIG_FILES is expanded from...
configure.ac:47: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
Similar to du -h, would it be possible to support an option that presents file sizes in megabytes/gigabytes instead of bytes?
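For reference, a human-readable formatter along these lines takes only a few lines (this sketch uses 1024-based units; du -h rounds slightly differently, so treat the exact output format as an assumption):

```python
def human_size(n):
    """Format a byte count with 1024-based units, roughly like du -h."""
    units = ("B", "K", "M", "G", "T", "P")
    i = 0
    size = float(n)
    while size >= 1024 and i < len(units) - 1:
        size /= 1024
        i += 1
    if i == 0:
        return f"{n}B"
    return f"{size:.1f}{units[i]}"
```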
Hi,
would it be possible for you to use or introduce xxhash as a hash function?
Thanks for duff!
Hi,
I'm doing some work on duff because I found it useful when fixing broken rsnapshot repositories (I will make some pull requests in a few days). Unfortunately such repositories are a bit unusual (millions of files, mostly hardlinked in groups of 30-50).
It seems that I'm having a problem with large buckets (long lists): because each sampled file allocates 4KB of data that is not freed until the end of bucket processing, I'm getting "out of memory" errors at 3GB of memory allocated (the box is a lightweight 32-bit Atom-based system).
As sizeof(FileList) == 12 I see no problem increasing HASH_BITS to 16 (~800KB) or even 20 (~13MB).
I wonder what you think - would it be a good idea to add an option to make it runtime-configurable?
Another idea is to (optionally?) replace the sample with some simple, fast running checksum (crc64?).
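As a sketch of the checksum idea (using CRC-32 from the Python standard library as a stand-in, since crc64 is not in the stdlib and the exact checksum is an open question): each file would store a 4-byte checksum of its sample instead of the 4KB sample itself. Differing checksums rule out duplication; matching ones still need a full comparison.

```python
import zlib

SAMPLE_SIZE = 4096  # the 4 KB sample mentioned above

def sample_checksum(path):
    """CRC-32 of the first SAMPLE_SIZE bytes of a file.

    Keeping this 4-byte value per file instead of the sample buffer
    bounds per-bucket memory; equal checksums are only a candidate
    match and must be confirmed byte-for-byte or with a strong hash.
    """
    with open(path, "rb") as f:
        return zlib.crc32(f.read(SAMPLE_SIZE))
```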
The README refers to an INSTALL file with installation details but the file is not in the tarball nor the git repo.
As one of the reasons to find duplicate files is to recover precious disk space, it would be great if duff sorted its output by file size by default. Deleting a single huge duplicated file is much more useful than deleting lots of tiny ones. Or, at least, provide a command-line option to sort by size.
Other than that, duff is a great utility, thanks so much!
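One way to rank the output (a sketch over hypothetical cluster lists, not duff's actual report format): sort groups of identical files by the space that deleting the redundant copies would reclaim.

```python
import os

def clusters_by_reclaimable_space(clusters):
    """Sort clusters of identical files, biggest space savings first.

    Deleting all but one member of an n-file cluster of size s bytes
    reclaims (n - 1) * s bytes.
    """
    def reclaimable(cluster):
        return (len(cluster) - 1) * os.path.getsize(cluster[0])
    return sorted(clusters, key=reclaimable, reverse=True)
```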