
duff's People

Contributors

elmindreda


duff's Issues

Argument to list unique files

For checking that a backup is complete, or that I have all the files from a camera SD card (before I wipe the card), it would be useful to be able to run duff in a "find unique" mode that lists files which have no duplicates.

As ever, this functionality can be constructed with an appropriate pipeline of find/sha1sum/sort/uniq or similar, but perhaps it's close enough to what duff does to be worth including?
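For reference, here is one such pipeline, written against GNU coreutils (uniq -w is a GNU extension, and file names containing newlines will confuse it); DIR is a placeholder for the directory to scan:

# Print files under DIR whose SHA-1 digest appears exactly once,
# i.e. files with no duplicate anywhere under DIR.
find DIR -type f -exec sha1sum {} + \
  | sort \
  | uniq -u -w 40 \
  | cut -c 43-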

Filter files that do not match a size predicate

With find you can ignore files that are of little consequence, such as files that are really small or really big:

# Files more than 1 gigabyte
find -size +1G

# Files less than 1 megabyte
find -size -1M

This would be great for duff because, when trying to free up disk space, one wants to find the big files (e.g. videos) without the output being flooded by static web content (e.g. jquery-1.9.2.js, bootstrap.css).

Hopefully that isn't too tough to implement, unlike sorting, which would be great but probably algorithmically prohibitive. Being able to use duff with pipes to do things like filtering would be even smarter, but I don't see a way to do it with the way duff reports results (except with the -e option, which is a bit risky).
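One possible interim workaround, hedged because it assumes duff reads file names from standard input when no file arguments are given (its documentation shows find piped into duff), is to let find do the size filtering:

# Only consider files larger than 100 MB as duplicate candidates
# (100M is an arbitrary threshold; adjust to taste).
find . -type f -size +100M | duff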

make -p active by default (don't follow hardlinks)

I feel like this should be obvious, but maybe I'm missing something. duff marks hardlinked files as "duplicates", which means that doing the obvious thing - using duff to reduce clutter and delete duplicate files - will result in deleting files with hardlinks (multiple filenames pointing to the same data). I can't think of any reason why this should be the default rather than the opposite. Basically, -p should be the default, right?

-p Physical mode. Make duff consider physical files instead of hard links. If specified, multiple hard links to the same physical file will not be reported as duplicates.
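A minimal way to reproduce the behaviour described above (the file names are throwaway examples):

echo "same data" > original.txt
ln original.txt hardlink.txt        # second name for the same inode
duff original.txt hardlink.txt      # default: reported as a duplicate cluster
duff -p original.txt hardlink.txt   # physical mode: nothing reported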

Build from source failed on Ubuntu 16.04

Tried to build from source by following the instructions in the README: first gettextize --no-changelog, then autoreconf -i, and got this error:

configure.ac:47: error: `po/Makefile.in' is already registered with AC_CONFIG_FILES.
../../lib/autoconf/status.m4:288: AC_CONFIG_FILES is expanded from...
configure.ac:47: the top level
autom4te: /usr/bin/m4 failed with exit status: 1
aclocal: error: echo failed with exit status: 1
autoreconf: aclocal failed with exit status: 1
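The cause is not confirmed here, but one guess is that gettextize registers po/Makefile.in in a configure.ac that already registers it. A first diagnostic step is to see where the duplicate registration comes from:

# Look for every place po/Makefile is registered (paths and macro files
# may differ in your checkout).
grep -n 'po/Makefile' configure.ac m4/*.m4 2>/dev/null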

xxhash

Hi,
would it be possible for you to use or introduce xxhash as a hash function?
Thanks for duff 👍
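For a rough comparison, assuming the xxhsum command-line tool from the xxHash project is installed (bigfile.bin is a placeholder), the speed difference can be measured directly:

# Hash the same large file with SHA-1 and with xxHash and compare timings.
time sha1sum bigfile.bin
time xxhsum bigfile.bin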

memory optimization

Hi,
I'm doing some work on duff because I found it useful when fixing broken rsnapshot repositories (I will make a pull request in a few days). Unfortunately, such repositories are a bit unusual (millions of files, mostly hardlinked in groups of 30-50).

It seems that I'm having a problem with large buckets (long lists): because each sampled file allocates 4KB of data that is only freed at the end of bucket processing, I'm getting "out of memory" errors at 3GB of memory allocated (the box is a light 32-bit Atom-based system).

As sizeof(FileList) == 12, I see no problem increasing HASH_BITS to 16 (~800KB) or even 20 (~13MB).
I wonder what you think - would it be a good idea to add an option to make it runtime-configurable?
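For reference, the table sizes quoted above follow from 2^HASH_BITS buckets times sizeof(FileList) == 12 bytes:

# Bucket table size in bytes for the HASH_BITS values mentioned above.
echo $(( (1 << 16) * 12 ))   # HASH_BITS=16 ->   786432 bytes (~800 KB)
echo $(( (1 << 20) * 12 ))   # HASH_BITS=20 -> 12582912 bytes (~13 MB)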

Another idea is to (optionally?) replace the sample with some simple, fast running checksum (CRC64?).

Documentation

The README refers to an INSTALL file with installation details, but the file is in neither the tarball nor the git repo.

Sort by size?

As one of the main reasons to find duplicate files is to recover precious disk space, it would be great if duff's default sort order were file size: deleting a single huge duplicated file is much more useful than deleting lots of tiny ones. Or, at least, provide a command-line option for sorting by size.

Other than that, duff is a great utility, thanks so much!
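Until such an option exists, one possible workaround (it shares the caveat about the -e excess mode noted in another issue, assumes GNU du for byte counts, and /some/dir is a placeholder) is to sort the excess duplicates by size in a pipeline:

# List all-but-one file from each duplicate cluster, biggest first.
# -r recurses, -e lists only the excess copies, -0 NUL-terminates the
# output so xargs -0 and du handle unusual file names.
duff -0re /some/dir | xargs -0 du -b | sort -rn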
