elmindreda / duff
Command-line utility for finding duplicate files
License: Other
For checking that a backup is complete, or that I have all the files from a camera SD card before wiping the card, it would be useful to run duff in a "find unique" mode that lists files which don't have duplicates.
As ever, this functionality can be constructed with an appropriate pipeline of find/sha1sum/sort/uniq or similar, but perhaps it's close enough to what duff does to be worth including?
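As a rough sketch of what such a pipeline computes (illustrative Python, not duff's actual implementation; the digest choice and chunk size are arbitrary): hash every file under a directory, then report the files whose contents occur exactly once.

```python
import hashlib
import os
from collections import defaultdict

def file_digest(path, chunk_size=65536):
    """SHA-1 digest of a file's contents, read in chunks."""
    h = hashlib.sha1()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            h.update(chunk)
    return h.hexdigest()

def unique_files(root):
    """Yield files under root whose contents appear exactly once."""
    by_digest = defaultdict(list)
    for dirpath, _, names in os.walk(root):
        for name in names:
            path = os.path.join(dirpath, name)
            by_digest[file_digest(path)].append(path)
    for paths in by_digest.values():
        if len(paths) == 1:
            yield paths[0]
```

A built-in mode could of course avoid hashing singleton file sizes at all, which the naive pipeline cannot.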
With find you can ignore files that are of little consequence, such as files that are really small or really big:
# Files more than 1 gigabyte
find -size +1G
# Files less than 1 megabyte
find -size -1M
This would be great for duff, because when trying to free up disk space one wants to find the big files (e.g. videos) without the output being flooded by static web content (e.g. jquery-1.9.2.js, bootstrap.css).
Hopefully that isn't too tough to implement, unlike sorting, which would be great but probably algorithmically prohibitive. Being able to use duff with pipes to do things like filtering would be even smarter, but I don't see a way to do it with the way duff reports (except with the -e option, which is a bit risky).
I feel like this should be obvious, but maybe I'm missing something. duff marks hardlinked files as "duplicates", which means that doing the obvious thing - using duff to reduce clutter and delete duplicate files - will result in deleting files with hard links (multiple filenames for the same data). I can't think of any reason why this should be the default rather than the opposite. Basically, -p should be the default, right?
-p Physical mode. Make duff consider physical files instead of hard links. If specified, multiple hard links to the same physical file will not be reported as duplicates.
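A minimal sketch of what physical mode amounts to (illustrative Python, not duff's C internals): two paths name the same physical file exactly when their device and inode numbers match, so collapsing on that pair removes hard-link "duplicates" before comparison.

```python
import os

def physical_files(paths):
    """Collapse hard links: keep one path per (device, inode) pair."""
    seen = {}
    for path in paths:
        st = os.stat(path)
        # st_dev + st_ino together identify the physical file
        seen.setdefault((st.st_dev, st.st_ino), path)
    return list(seen.values())
```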
It appears that in 2021 this domain was registered after having lapsed. Currently, duff.dreda.org seems to redirect to malvertising of some sort ("Your computer is infected with a virus!" type stuff).
Tried to build from source by following the instructions in the README: first gettextize --no-changelog, then autoreconf -i, and got this error:
configure.ac:47: error: `po/Makefile.in' is already registered with AC_CONFIG_FILES.
../../lib/autoconf/status.m4:288: AC_CONFIG_FILES is expanded from...
configure.ac:47: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
Similar to du -h, would it be possible to support an option that presents file sizes in megabytes/gigabytes instead of bytes?
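For reference, a human-readable formatter along these lines takes only a few lines (this sketch uses 1024-based units; du -h rounds slightly differently, so treat the exact output format as an assumption):

```python
def human_size(n):
    """Format a byte count with 1024-based units, roughly like du -h."""
    units = ("B", "K", "M", "G", "T", "P")
    i = 0
    size = float(n)
    while size >= 1024 and i < len(units) - 1:
        size /= 1024
        i += 1
    if i == 0:
        return f"{n}B"
    return f"{size:.1f}{units[i]}"
```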
Hi,
would it be possible for you to use or introduce xxhash as a hash function?
Thanks for duff!
Hi,
I'm doing some work on duff because I found it useful when fixing broken rsnapshot repositories (I will make some pull requests in a few days). Unfortunately such repositories are a bit unusual (millions of files, mostly hardlinked in groups of 30-50).
It seems that I'm having a problem with large buckets (long lists): because each sampled file allocates 4KB of data that is not freed until the end of bucket processing, I'm getting "out of memory" errors at 3GB of memory allocated (the box is a lightweight 32-bit Atom-based system).
As sizeof(FileList) == 12 I see no problem increasing HASH_BITS to 16 (~800KB) or even 20 (~13MB).
I wonder what you think - would it be a good idea to add an option to make it runtime-configurable?
Another idea is to (optionally?) replace the sample with some simple, fast running checksum (crc64?).
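As a sketch of the checksum idea (using CRC-32 from the Python standard library as a stand-in, since crc64 is not in the stdlib and the exact checksum is an open question): each file would store a 4-byte checksum of its sample instead of the 4KB sample itself. Differing checksums rule out duplication; matching ones still need a full comparison.

```python
import zlib

SAMPLE_SIZE = 4096  # the 4 KB sample mentioned above

def sample_checksum(path):
    """CRC-32 of the first SAMPLE_SIZE bytes of a file.

    Keeping this 4-byte value per file instead of the sample buffer
    bounds per-bucket memory; equal checksums are only a candidate
    match and must be confirmed byte-for-byte or with a strong hash.
    """
    with open(path, "rb") as f:
        return zlib.crc32(f.read(SAMPLE_SIZE))
```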
The README refers to an INSTALL file with installation details but the file is not in the tarball nor the git repo.
As one of the reasons to find duplicate files is to recover precious disk space, it would be great if duff sorted its output by file size by default. Deleting a single huge duplicated file is much more useful than deleting lots of tiny ones. Or, at least, provide a command-line option to sort by size.
Other than that, duff is a great utility, thanks so much!
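One way to rank the output (a sketch over hypothetical cluster lists, not duff's actual report format): sort groups of identical files by the space that deleting the redundant copies would reclaim.

```python
import os

def clusters_by_reclaimable_space(clusters):
    """Sort clusters of identical files, biggest space savings first.

    Deleting all but one member of an n-file cluster of size s bytes
    reclaims (n - 1) * s bytes.
    """
    def reclaimable(cluster):
        return (len(cluster) - 1) * os.path.getsize(cluster[0])
    return sorted(clusters, key=reclaimable, reverse=True)
```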