fddf's Introduction

fddf

Fast data dupe finder

This is a small Rust command-line program to find duplicate files in a directory recursively. It uses a thread pool to calculate file hashes in parallel.

Duplicates are found by first comparing sizes, then BLAKE3 hashes of parts of same-sized files, and finally a byte-for-byte comparison.
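The three-stage pipeline above can be sketched in plain Rust. This is a simplified stand-in, not fddf's actual code: it uses the stdlib's `DefaultHasher` instead of BLAKE3 (so it compiles without dependencies), and the function names are hypothetical.

```rust
use std::collections::hash_map::DefaultHasher;
use std::collections::HashMap;
use std::fs::{self, File};
use std::hash::Hasher;
use std::io::{self, Read};
use std::path::{Path, PathBuf};

// Hash the first 4 KiB of a file. fddf uses BLAKE3; DefaultHasher is a
// stdlib stand-in so this sketch stays dependency-free.
fn partial_hash(path: &Path) -> io::Result<u64> {
    let mut buf = [0u8; 4096];
    let n = File::open(path)?.read(&mut buf)?;
    let mut h = DefaultHasher::new();
    h.write(&buf[..n]);
    Ok(h.finish())
}

// Stage 1: group by size. Stage 2: within a size group, group by partial
// hash. Stage 3: confirm candidates with a byte-for-byte comparison.
fn find_dupes(paths: &[PathBuf]) -> io::Result<Vec<Vec<PathBuf>>> {
    let mut by_size: HashMap<u64, Vec<PathBuf>> = HashMap::new();
    for p in paths {
        let len = fs::metadata(p)?.len();
        if len > 0 {
            // Zero-length files are skipped, matching fddf's default.
            by_size.entry(len).or_default().push(p.clone());
        }
    }
    let mut groups = Vec::new();
    for (_size, same_size) in by_size {
        if same_size.len() < 2 {
            continue;
        }
        let mut by_hash: HashMap<u64, Vec<PathBuf>> = HashMap::new();
        for p in same_size {
            let h = partial_hash(&p)?;
            by_hash.entry(h).or_default().push(p);
        }
        for (_hash, candidates) in by_hash {
            if candidates.len() < 2 {
                continue;
            }
            // Compare every candidate byte-for-byte against the first one.
            let reference = fs::read(&candidates[0])?;
            let confirmed: Vec<PathBuf> = candidates
                .into_iter()
                .filter(|p| fs::read(p).map(|d| d == reference).unwrap_or(false))
                .collect();
            if confirmed.len() > 1 {
                groups.push(confirmed);
            }
        }
    }
    Ok(groups)
}
```

The cheap stages (size, partial hash) prune most candidates so the expensive full read only runs on likely duplicates.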

Build/install

Directly from crates.io with cargo install fddf.

From checkout:

cargo build --release
cargo run --release

Minimum supported Rust version is 1.48.0.

Usage

fddf [-s] [-t] [-S] [-f PATTERN] [-F REGEX] [-m SIZE] [-M SIZE] [-v] <rootdir>

-s: report dupe groups in a single line
-t: produce a grand total
-S: don't scan recursively for each directory given
-f: only check files whose names match the given glob pattern
-F: only check files whose names match the given regular expression
-m: minimum size (default 1 byte)
-M: maximum size (default unlimited)
-v: verbose operation

By default, zero-length files are ignored, since there is no meaningful data to be duplicated. Pass -m 0 to include them.

PRs welcome!


fddf's Issues

Unicode filenames are not printed correctly

> fddf .
Size 16700 bytes:
    .\sekaihana.mid
    .\??????????.mid

Size 15426 bytes:
    .\????????????.mid
    .\yasasisa.mid

Btw, the first filename is 世界にひとつだけの花.mid and the second one is やさしさにつつまれたなら.mid.

When I do cmd /K chcp 65001 and run fddf, I get:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Err
or { repr: Os { code: 31, message: "A device attached to the system is not funct
ioning." } }', src\libcore\result.rs:860
stack backtrace:
   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: <unknown>
   8: <unknown>
   9: <unknown>
  10: <unknown>
  11: BaseThreadInitThunk
thread 'main' panicked at 'WaitGroup explicitly poisoned!', .cargo\registry\src\
github.com-1ecc6299db9ec823\scoped-pool-1.0.0\src\lib.rs:457
stack backtrace:
   0: <unknown>
   1: <unknown>
   2: <unknown>
   3: <unknown>
   4: <unknown>
   5: <unknown>
   6: <unknown>
   7: BaseThreadInitThunk

Too many open files error

Hi, thanks for a nice tool! This is definitely something I can use.

I installed the 1.1.0 version via cargo install. However, when I tried to run it on my home dir, it blew up the stack:

thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 24, message: "Too many open files" } }', /checkout/src/libcore/result.rs:859
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 24, message: "Too many open files" } }', /checkout/src/libcore/result.rs:859
thread '<unnamed>' panicked at 'called `Result::unwrap()` on an `Err` value: Error { repr: Os { code: 24, message: "Too many open files" } }', /checkout/src/libcore/result.rs:859
...
100s more
...
thread '<unknown>' has overflowed its stack
fatal runtime error: stack overflow
[1]    3030 abort (core dumped)  fddf .
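One way to avoid hitting the file-descriptor limit is to bound how many files the hasher threads have open at once. Below is a stdlib-only sketch using a channel pre-filled with tokens as a counting semaphore; the `FdLimiter` name and API are hypothetical, not fddf's actual fix.

```rust
use std::fs::File;
use std::io::{self, Read};
use std::path::Path;
use std::sync::mpsc;
use std::sync::{Arc, Mutex};

// A counting semaphore built from a channel pre-filled with `max_open`
// tokens: a worker must take a token before opening a file and returns it
// afterwards, so at most `max_open` files are open at once no matter how
// many hasher threads are running.
struct FdLimiter {
    tokens_in: mpsc::Sender<()>,
    tokens_out: Arc<Mutex<mpsc::Receiver<()>>>,
}

impl FdLimiter {
    fn new(max_open: usize) -> Self {
        let (tx, rx) = mpsc::channel();
        for _ in 0..max_open {
            tx.send(()).unwrap();
        }
        FdLimiter {
            tokens_in: tx,
            tokens_out: Arc::new(Mutex::new(rx)),
        }
    }

    // Read a whole file while holding one of the limited "open file" slots.
    // I/O errors are returned instead of unwrap()ed, so an EMFILE-style
    // failure would surface as an error rather than a panic.
    fn read_limited(&self, path: &Path) -> io::Result<Vec<u8>> {
        let token = self.tokens_out.lock().unwrap().recv().unwrap();
        let mut data = Vec::new();
        let result = File::open(path).and_then(|mut f| {
            f.read_to_end(&mut data)?;
            Ok(data)
        });
        // Return the token even if the read failed.
        self.tokens_in.send(token).unwrap();
        result
    }
}
```

Propagating the error (rather than `unwrap()`) also addresses the panic cascade shown in the log above.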

A subtle bug

I think there is a bug that will rarely manifest, because the required setup is uncommon.

Assume two file systems mounted at

/mnt
/mnt/sub

Run fddf /mnt
Assume a file somefile exists with inode 4711 in /mnt. Assume further a file different_file exists with inode 4711 in /mnt/sub (i.e. in the second file system).

Those files would be treated as hard links of each other, although they are not.

This could be fixed by using both device id and inode when checking for hardlinks.
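The suggested fix can be sketched with the stdlib's Unix metadata extensions: identify a file by the pair (device id, inode) rather than the inode alone. This is a Unix-only illustration with hypothetical function names; the `same-file` crate (already in fddf's dependency tree, per the build log elsewhere on this page) handles this portably.

```rust
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::Path;

// Identify a file by (device id, inode). Two paths are hard links to the
// same file only if BOTH components match; comparing inodes alone can
// falsely equate files on different mounted file systems.
fn file_id(path: &Path) -> io::Result<(u64, u64)> {
    let meta = fs::metadata(path)?;
    Ok((meta.dev(), meta.ino()))
}

fn same_file(a: &Path, b: &Path) -> io::Result<bool> {
    Ok(file_id(a)? == file_id(b)?)
}
```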

In output, indents should be a tab, not 4 spaces

Possibly contentious issue here, of course...

  • Spaces appear commonly in file names, tabs do not, so there is less potential ambiguity
  • Big output files mean lots of indents; why use 4x the characters?
  • When parsing the output, `startswith('\t')` is a little nicer than `startswith('    ')`

`cargo install` failed

$ cargo install fddf
    Updating registry `https://github.com/rust-lang/crates.io-index`
 Downloading fddf v1.0.0
  Installing fddf v1.0.0
 Downloading num_cpus v1.5.1
 Downloading scoped-pool v1.0.0
 Downloading sha1 v0.2.0
 Downloading variance v0.1.3
 Downloading scopeguard v0.1.2
 Downloading atty v0.2.2
 Downloading term_size v0.3.0
   Compiling bitflags v0.8.2
   Compiling ansi_term v0.9.0
   Compiling unicode-segmentation v1.2.0
   Compiling libc v0.2.23
   Compiling crossbeam v0.2.10
   Compiling unicode-width v0.1.4
   Compiling scopeguard v0.1.2
   Compiling vec_map v0.8.0
   Compiling strsim v0.6.0
   Compiling variance v0.1.3
   Compiling same-file v0.1.3
   Compiling fnv v1.0.5
   Compiling sha1 v0.2.0
   Compiling walkdir v1.0.7
   Compiling atty v0.2.2
   Compiling term_size v0.3.0
   Compiling num_cpus v1.5.1
   Compiling clap v2.24.2
   Compiling scoped-pool v1.0.0
   Compiling fddf v1.0.0
error: cannot find macro `eprintln!` in this scope
  --> .cargo/registry/src/github.com-1ecc6299db9ec823/fddf-1.0.0/src/main.rs:19:9
   |
19 |         eprintln!("Hashing {}...", path.display());
   |         ^^^^^^^^
   |
   = help: did you mean `println!`?

error: cannot find macro `eprintln!` in this scope
  --> .cargo/registry/src/github.com-1ecc6299db9ec823/fddf-1.0.0/src/main.rs:31:13
   |
31 |             eprintln!("Error opening file {}: {}", path.display(), e);
   |             ^^^^^^^^
   |
   = help: did you mean `println!`?

error: aborting due to 2 previous errors

error: failed to compile `fddf v1.0.0`, intermediate artifacts can be found at `/tmp/cargo-install.wemz5jNXsnIn`

Caused by:
  Could not compile `fddf`.

To learn more, run the command again with --verbose.
$ rustc --version ; cargo --version
rustc 1.17.0 (56124baa9 2017-04-24)
cargo 0.18.0 (fe7b0cdcf 2017-04-24)

Why is io::stdout locked for the whole duration of the consumer thread ?

I'm fiddling a bit with the code to improve performance, and at some point I got bitten by a random deadlock. After investigation, it's because stdout is locked for the whole duration of the consumer thread here, so if you try to println! anything from another thread while the consumer thread is up, you get a deadlock.

It's not a problem in production or anything, but I just found this behavior a bit counter-intuitive, and I'm afraid other potential contributors could face the same problem. Is this long-lived lock on stdout intentional? Or can we remove it?
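One way to avoid the long-lived lock is to acquire `io::stdout().lock()` only while printing a single group and release it in between. The sketch below uses a hypothetical `print_group` function, not fddf's real printing code; the formatting helper is split out so it can be exercised against any writer.

```rust
use std::io::{self, Write};

// Format one group of duplicates into any writer (testable in isolation).
fn write_group<W: Write>(out: &mut W, size: u64, paths: &[String]) -> io::Result<()> {
    writeln!(out, "Size {} bytes:", size)?;
    for p in paths {
        writeln!(out, "    {}", p)?;
    }
    Ok(())
}

// Lock stdout only for the duration of one group, so println! calls from
// other threads can interleave between groups instead of deadlocking.
fn print_group(size: u64, paths: &[String]) -> io::Result<()> {
    let stdout = io::stdout();
    let mut lock = stdout.lock(); // held only for this one group
    write_group(&mut lock, size, paths)
} // StdoutLock dropped here, releasing stdout for other threads
```

The trade-off is slightly more locking overhead per group, in exchange for other threads never blocking on stdout for longer than one group's output.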

Speed up hashing with twox-hash

I'm curious why you decided on blake3 instead of a faster non-cryptographic hash like twox-hash.
Is it to keep the number of collisions (== the number of files whose contents have to be compared) as low as possible?
Have you done any benchmarks comparing blake3 with a faster non-cryptographic hash to see which one scans faster on a typical scenario (e.g. different percentages of duplicates)?
Most files that are different and have the same hash are probably very different early on, so their byte-by-byte comparison would terminate early. Maybe it would be faster to incur more collisions if the false positives terminate early?

Check zero length files

What do you think about checking zero-length files as well, and only omitting them if, for example,

-n Exclude zero-length files

is specified when calling fddf?

Program does nothing when run without arguments

When fddf is run without arguments, nothing observable happens. No output.

IMO it would be preferable to have it act in either of the following ways:

  • Either, do like ls and use the current working directory when no argument is provided. This makes the most sense for fddf IMO since fddf does not modify any files so it is safe to run fddf somewhere by accident.

  • Or, make the roots argument required (see https://docs.rs/clap/2.33.0/clap/struct.Arg.html#method.required and press the plus sign on the side to expand the description) so that clap reports an error to the user when fddf is run without any arguments. Personally I think making the argument required would not be as nice, because it means more typing.

Low throughput

I'm running fddf on Debian Jessie, and the I/O read rate (shown by iotop) never goes above 3 MB/s. The task isn't CPU-bound either: ~25% on both cores. By comparison, ls -R reads between 10 and 15 MB per second, and so does rsync on the same workload.

The directory I'm running fddf on contains a lot of small files (text files), a big amount of medium files (pictures or mp3) and a decent number of big files (movies or .iso images).

I have no idea how file I/O works on Linux, so I don't know how to speed this up.

[feature request] Sort duplicates by inodes/hardlinks

Suppose I got all dupes in a folder and made them all hard links. Then I made a copy of one of the hard links and want all three to be links to the same file. With the results as they are now, I would have to find all the hard links to all the dupes before deciding whether it's worth it.

This is not a hypothetical scenario; something like it happened on my PC. I solved it with a Python script, which worked because that script did no hard-link check on Windows.

So, how about output like that:

Size 1859 bytes:
   inode 56746546576434354:
        \\?\D:\temp\dupe1.ext
        \\?\D:\temp\dupe2.ext
   inode 6865347121004787:
        \\?\D:\temp\dupe3.ext

I don't know any tool that lists dupes like that, and that would be really helpful.

P.S.

\\?\ enables absolute paths with more than 260 characters on Windows.
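The grouping step of the proposed output could look like this stdlib-only sketch: within one group of identical files, bucket paths by (device, inode) so hard links to the same underlying file are listed together. The function name is hypothetical and the sketch is Unix-only; on Windows the analogous key would be the volume serial number plus file index.

```rust
use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

// Bucket one group of identical files by (device id, inode): paths in the
// same bucket are hard links to one underlying file, paths in different
// buckets hold separate copies of the data.
fn group_by_inode(group: &[PathBuf]) -> io::Result<HashMap<(u64, u64), Vec<PathBuf>>> {
    let mut buckets: HashMap<(u64, u64), Vec<PathBuf>> = HashMap::new();
    for path in group {
        let meta = fs::metadata(path)?;
        buckets
            .entry((meta.dev(), meta.ino()))
            .or_default()
            .push(path.clone());
    }
    Ok(buckets)
}
```

Printing each bucket under its own `inode …:` line would then produce the nested format shown above.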

Ability to detect duplicate folders

If all files of a folder have dupes in another folder, the output can get very verbose, and that's not exactly clear from looking at it. It would be very helpful if fddf could summarize that as folder dupes (or a subset). The primary use case for me is figuring out which files I can or should delete; if I could decide at the level of folders, that would reduce the time it takes to sort through all the dupes.

Btw, here's a result I got; it took 12 minutes and consumed 70 MB of RAM on Win 8.1 64-bit. Most files in that folder are small (<100 KB, and the larger ones aren't much larger):

Overall results:
    16963 groups of duplicate files
    32744 files are duplicates
    1.2 GiB of space taken by duplicates

Not hashing every file

I have a directory with many subdirectories and 2496 files in total.

But when I run fddf, it prints `Hashing ...` only 690 times and misses many files.

I checked the permissions on some of the files that were not hashed, and they look OK:

$ ls -al rootfs/lib/libdirectfb-1.7.so.6.0.0 
-rwxr-xr-x 1 user user 2449248 Aug 18 16:54 rootfs/lib/libdirectfb-1.7.so.6.0.0

I use this command to run the application:

fddf -v ./rootfs

Could not compile

You are (presumably) on nightly; I could not compile. It is easy to fix, but I thought I'd let you know.

Here are the error messages:

error[E0658]: non-reference pattern used to match a reference (see issue #42640)
   --> src/main.rs:227:9
    |
227 |         Select::Any => true,
    |         ^^^^^^^^^^^ help: consider using a reference: `&Select::Any`

error[E0658]: non-reference pattern used to match a reference (see issue #42640)
   --> src/main.rs:228:9
    |
228 |         Select::Pattern(p) => entry.file_name().to_str().map_or(false, |f| p.matches(f)),
    |         ^^^^^^^^^^^^^^^^^^ help: consider using a reference: `&Select::Pattern(p)`

error[E0658]: non-reference pattern used to match a reference (see issue #42640)
   --> src/main.rs:229:9
    |
229 |         Select::Regex(r) => entry.file_name().to_str().map_or(false, |f| r.is_match(f)),
    |         ^^^^^^^^^^^^^^^^ help: consider using a reference: `&Select::Regex(r)`

error: aborting due to 3 previous errors

error: Could not compile `fddf`.

BTW

  1. Thanks for letting me do things in a dilettantish way. It was a good learning experience.
  2. I was impressed by how easy structopt made it to deal with invalid patterns or regexes. I have to study how clap does it so I can use it in my own code. At least where stack unwinding isn't required, this is really very helpful.

fddf compile error

I wanted to look at the newest changes; after cloning the repository I ran

cargo build

and got

error[E0277]: the trait bound `std::path::PathBuf: std::str::FromStr` is not satisfied
   --> src/main.rs:178:10
    |
178 | #[derive(StructOpt)]
    |          ^^^^^^^^^ the trait `std::str::FromStr` is not implemented for `std::path::PathBuf`
    |
    = note: required by `std::str::FromStr::from_str`

error[E0619]: the type of this value must be known in this context
   --> src/main.rs:178:10
    |
178 | #[derive(StructOpt)]
    |          ^^^^^^^^^

error: aborting due to 2 previous errors

error: Could not compile `fddf`.

The Rust compiler is rustc 1.25.0 (84203cac6 2018-03-25).
