fclones's People

Contributors

dotysan, gelma, ivan, johnpyp, jrimbault, kapitainsky, koutheir, landfillbaby, msfjarvis, peddamat, pkolaczk, th1000s


fclones's Issues

How should I interpret progress bar sizes? Affected by hard links?

Here's my output as it's running currently:

[2021-01-25 11:58:30.332] fclones:  info: Started
[2021-01-25 11:58:41.402] fclones:  info: Scanned 40512 file entries
[2021-01-25 11:58:41.402] fclones:  info: Found 38084 (10.7 TB) files matching selection criteria
[2021-01-25 11:58:41.408] fclones:  info: Found 15854 (4.2 TB) candidates after grouping by size
[2021-01-25 11:58:41.414] fclones:  info: Found 15694 (3.3 TB) candidates after grouping by paths and file identifiers
[2021-01-25 12:00:32.283] fclones:  info: Found 2159 (3.3 TB) candidates after grouping by prefix
[2021-01-25 12:00:53.996] fclones:  info: Found 2159 (3.3 TB) candidates after grouping by suffix
Grouping by contents        [=>                                                ]   139.20GB/5.97TB

The size reporting in the log messages seems accurate given the data I'm running this tool on, but what confuses me is the 5.97TB total grouping progress. If we have 3.3TB of candidates, I would expect to see matching numbers.

I suspect this has something to do with the fact that much of the existing data consists of large files that are hard-linked and appear in two places; depending on how the size count handles those files, that could be the source of the discrepancy. I'm not sure whether this is just a reporting clarity issue, or whether there is actually room to speed up the hashing process - I'm not an expert, obviously, but I assume that once a hard-linked file has been hashed, there is no need to hash any other paths that point to the same data.
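
A minimal sketch of that hard-link idea (hypothetical, not fclones's actual implementation; all names below are made up): group candidate paths by their (device, inode) pair so each physical file is hashed only once and every hard-linked path reuses the result.

use std::collections::HashMap;
use std::fs;
use std::io;
use std::os::unix::fs::MetadataExt; // dev() and ino(), Unix-only
use std::path::{Path, PathBuf};

/// Placeholder for the real content hash; real code would stream the file
/// through a fast hash function instead of reading it all into memory.
fn hash_contents(path: &Path) -> io::Result<u64> {
    Ok(fs::read(path)?
        .iter()
        .fold(0u64, |h, b| h.wrapping_mul(31).wrapping_add(*b as u64)))
}

/// Hash every distinct (device, inode) pair once and reuse the result
/// for all hard-linked paths that point to the same data.
fn hash_unique_inodes(paths: &[PathBuf]) -> io::Result<HashMap<PathBuf, u64>> {
    let mut by_inode: HashMap<(u64, u64), u64> = HashMap::new();
    let mut result = HashMap::new();
    for path in paths {
        let meta = fs::metadata(path)?;
        let key = (meta.dev(), meta.ino());
        let hash = match by_inode.get(&key) {
            Some(h) => *h, // hard link to an already-hashed inode: reuse
            None => {
                let h = hash_contents(path)?; // first path for this inode
                by_inode.insert(key, h);
                h
            }
        };
        result.insert(path.clone(), hash);
    }
    Ok(result)
}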

Compiling to linux ARM

Great work - I already had a play with the utility on macOS and it works great.

Just feedback: I also tried to compile it on Manjaro on the RPi 4, and the compile failed on the hashing crate. I might ask the author of the fasthash-sys crate what would be required to allow it to compile on the ARM architecture.

fclones git:(master) cargo build --release
   Compiling fasthash-sys v0.3.2
   Compiling getrandom v0.1.14
   Compiling num_cpus v1.13.0
   Compiling atty v0.2.14
error: failed to run custom build command for `fasthash-sys v0.3.2`

Caused by:
  process didn't exit successfully: `/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-3bfb9e86593b1584/build-script-build` (exit code: 101)
--- stdout
TARGET = Some("aarch64-unknown-linux-gnu")
OPT_LEVEL = Some("3")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
CC_aarch64-unknown-linux-gnu = None
CC_aarch64_unknown_linux_gnu = None
HOST_CC = None
CC = None
HOST = Some("aarch64-unknown-linux-gnu")
TARGET = Some("aarch64-unknown-linux-gnu")
HOST = Some("aarch64-unknown-linux-gnu")
CFLAGS_aarch64-unknown-linux-gnu = None
CFLAGS_aarch64_unknown_linux_gnu = None
HOST_CFLAGS = None
CFLAGS = None
DEBUG = Some("false")
running: "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-Wno-implicit-fallthrough" "-Wno-unknown-attributes" "-msse4.2" "-maes" "-mavx" "-mavx2" "-DT1HA0_RUNTIME_SELECT=1" "-DT1HA0_AESNI_AVAILABLE=1" "-Wall" "-Wextra" "-o" "/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-fc57bf495c3381b2/out/src/fasthash.o" "-c" "src/fasthash.cpp"
cargo:warning=cc: error: unrecognized command line option ‘-msse4.2’
cargo:warning=cc: error: unrecognized command line option ‘-maes’
cargo:warning=cc: error: unrecognized command line option ‘-mavx’
cargo:warning=cc: error: unrecognized command line option ‘-mavx2’
exit code: 1

--- stderr
thread 'main' panicked at '

Internal error occurred: Command "cc" "-O3" "-ffunction-sections" "-fdata-sections" "-fPIC" "-Wno-implicit-fallthrough" "-Wno-unknown-attributes" "-msse4.2" "-maes" "-mavx" "-mavx2" "-DT1HA0_RUNTIME_SELECT=1" "-DT1HA0_AESNI_AVAILABLE=1" "-Wall" "-Wextra" "-o" "/home/stuart/rust_projects/fclones/target/release/build/fasthash-sys-fc57bf495c3381b2/out/src/fasthash.o" "-c" "src/fasthash.cpp" with args "cc" did not execute successfully (status code exit code: 1).

', /home/stuart/.cargo/registry/src/github.com-1ecc6299db9ec823/gcc-0.3.55/src/lib.rs:1672:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

warning: build failed, waiting for other jobs to finish...
error: build failed

-- feel free to close this issue - I just wanted to give feedback.

Write report to a file

Add an -o <file> option. This would allow for more flexibility when building pipelines, e.g. wrapping fclones with time.
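
A minimal sketch of how such an option could be wired up (hypothetical, not fclones's actual code; the function name and flag handling are made up): pick the report sink once, then write everything through a common Write handle.

use std::fs::File;
use std::io::{self, Write};
use std::path::PathBuf;

/// Pick the report destination: a file when `-o <file>` is given, stdout otherwise.
fn report_writer(output: Option<PathBuf>) -> io::Result<Box<dyn Write>> {
    match output {
        Some(path) => Ok(Box::new(File::create(path)?)),
        None => Ok(Box::new(io::stdout())),
    }
}

fn main() -> io::Result<()> {
    // `None` here would come from parsing the (hypothetical) -o flag.
    let mut out = report_writer(None)?;
    writeln!(out, "# report goes here")?;
    Ok(())
}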

ignore size and report duplicates

Is the implementation of finding duplicates based on size?
Can it find duplicates even if there's a size mismatch and report the file names?

Duplicate search between, but not within, two distinct directories?

Is it possible to use fclones to find duplicates between, but not within, two directory trees? Here's an example:

destination/
  2021/
    January/
      A.jpg

source/
  A1.jpg <-- copy of destination/2021/January/A.jpg (also same as A2.jpg)
  A2.jpg <-- copy of destination/2021/January/A.jpg (also same as A1.jpg)  
  B1.jpg <-- same as B2.jpg
  B2.jpg <-- same as B1.jpg

I want to identify A1.jpg and A2.jpg under source as duplicates of A.jpg in destination.

B1.jpg and B2.jpg are also duplicates but only under sources. They should be excluded from the match list because they don't match anything in destination.

FWIW, the use case is a source folder of images that have previously been processed by scripts to rename them and sort them into a destination directory structure (e.g. by year and month, or by other EXIF metadata). Then we come across a new folder of images, some of which may have been processed previously, and we want to know if we can safely delete them because we already have copies in the destination directory.

Can't compile on FreeBSD

Tried to compile from source on a FreeBSD jail, got these errors, tried again with the verbose flag to get more information.

Compiling fclones v0.17.0

error[E0063]: missing field `l_sysid` in initializer of `flock`
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/fclones-0.17.0/src/lock.rs:31:17
|
31 | let f = libc::flock {
| ^^^^^^^^^^^ missing `l_sysid`

error[E0063]: missing field `l_sysid` in initializer of `flock`
--> /root/.cargo/registry/src/github.com-1ecc6299db9ec823/fclones-0.17.0/src/lock.rs:47:17
|
47 | let f = libc::flock {
| ^^^^^^^^^^^ missing `l_sysid`

error: aborting due to 2 previous errors

For more information about this error, try `rustc --explain E0063`.
error: failed to compile `fclones v0.17.0`, intermediate artifacts can be found at /tmp/cargo-installPdO1Nf

Caused by:
could not compile `fclones`

See attached file for full message.
FreeBSD fclones Errors.docx

unable to compile on aarch64 musl

hello,
I'm unable to (cross)compile a musl binary on aarch64 ("cross" as in from ubuntu, still aarch64)
[also tried compiling native on alpine aarch64, same result]

$ cargo install --target aarch64-unknown-linux-musl fclones
[cut]
   Compiling reflink v0.1.3
error[E0308]: mismatched types
  --> .cargo/registry/src/github.com-1ecc6299db9ec823/reflink-0.1.3/src/sys/unix.rs:21:39
   |
21 |         libc::ioctl(dest.as_raw_fd(), IOCTL_FICLONE, src.as_raw_fd())
   |                                       ^^^^^^^^^^^^^ expected `i32`, found `u64`
   |
help: you can convert a `u64` to an `i32` and panic if the converted value doesn't fit
   |
21 |         libc::ioctl(dest.as_raw_fd(), IOCTL_FICLONE.try_into().unwrap(), src.as_raw_fd())
   |                                                    ++++++++++++++++++++

For more information about this error, try `rustc --explain E0308`.
error: could not compile `reflink` due to previous error
warning: build failed, waiting for other jobs to finish...
error: failed to compile `fclones v0.17.1`

compiling with glibc works correctly

(rust 1.57.0)

regards,
m

Add directory clone detection

Sometimes I keep two copies of the content of a CF card full of videos or photos.

It would be great to have detection of directories whose entire content is already present somewhere else.

`fclones dedupe` results in updates to mtimes on directories

Running fclones dedupe on some directory tree results in mtimes being updated for directories containing files that were deduplicated.

I don't know whether this should be addressed, because while mtimes may be desirable to preserve, the directories really were updated through file creation. This effect does make it less likely that I would want to use fclones dedupe on old directory trees with potentially informative mtimes, though.

Feedback after initial testing

Hello @pkolaczk, thanks a lot for this great tool. I love the idea of parallel processing and using the power of Rust here, and I have a couple of questions and possible feature requests that I'd like to discuss with you.

First of all, I tested this tool on a low-power DS215j NAS device.

CPU: MARVELL Armada 375 88F6720 - Dual Core - 800 MHz (ARMv7)
RAM: 512 MB
HDD: 6TB WD NAS Drive

Here are my questions after testing:

1. Is there a way to export the report to a file instead of printing to the console?

This is useful when there are more duplicates than the terminal window can handle. In my case I want to run the command in screen and come back later to get the results.

I tried the command below, but it doesn't give me any progress/status from the tool:
sudo /usr/bin/time --verbose ./fclones ~ -R --format JSON |& tee -a /volume2/duplicatesdata.json

2. Can the tool add a timestamp to the status updates?

This would help show how much time each phase took, something like:

2020-06-20T21:57:06 [INFO] - fclones:  info: Scanned 4687831 file entries
2020-06-21T05:57:06 [INFO] - fclones:  info: Found 3857155 (5.4 TB) files matching selection criteria
2020-06-21T08:57:06 [INFO] - fclones:  info: Found 3447623 (1.4 TB) candidates after grouping by size

3. Can the tool have an option to persist the analysis data to a file/data store instead of RAM?

In my case the device is slow and the HDD is big; with other tools I sometimes need to run a duplicate analysis for 6 days.
When I was trying this tool I lost power after leaving it running for 2 days, so this would let me run the tool again and continue the analysis.

4. Can the tool have an option to choose the hashing algorithm?

From what I can see here, MetroHash is a great hashing algorithm, but it is optimized specifically for x86-64 machines with SSE4.2.
Adding another algorithm that is not machine-specific would be a great addition.

Improve initialization speed - avoid traversing sysfs

Running fclones on a relatively small directory, I noticed its performance is surprisingly bad:

$ time fclones group ~/Downloads/ 
[2021-06-06 18:57:22.658] fclones:  info: Started grouping
[2021-06-06 18:57:23.091] fclones:  info: Scanned 967 file entries
[2021-06-06 18:57:23.091] fclones:  info: Found 873 (2.7 GB) files matching selection criteria
[2021-06-06 18:57:23.091] fclones:  info: Found 47 (9.0 MB) candidates after grouping by size
[2021-06-06 18:57:23.092] fclones:  info: Found 47 (9.0 MB) candidates after grouping by paths and file identifiers
[2021-06-06 18:57:23.097] fclones:  info: Found 45 (8.5 MB) candidates after grouping by prefix
[2021-06-06 18:57:23.105] fclones:  info: Found 45 (8.5 MB) candidates after grouping by suffix
[2021-06-06 18:57:23.125] fclones:  info: Found 45 (8.5 MB) redundant files
<...>
real	0m0.481s
user	0m0.271s
sys	0m0.195s

So it takes about 0.5 s to process a directory with fewer than 1000 files. I noticed that most of the time is spent in the "Initializing" phase, so I ran strace:

$ strace -c fclones group ~/Downloads/
<...>
% time     seconds  usecs/call     calls    errors syscall
------ ----------- ----------- --------- --------- ------------------
 34.07    0.349066           5     61719      1563 openat
 25.31    0.259249           4     60156           close
 23.58    0.241530           4     53660           newfstatat
  6.21    0.063639           6     10372           read
  4.34    0.044439           6      6691           readlinkat
  3.39    0.034740           7      4502         5 access
  1.92    0.019707         221        89        17 futex
  0.60    0.006150           5      1096           getdents64
<...>

So it appears fclones makes 60k openat, 60k close, and 54k newfstatat calls. This is very surprising.

Inspecting the openat syscalls, it seems most of them are traversing the /sys/ filesystem. Here is a fragment of the strace output (filtered to openat):

openat(AT_FDCWD, "/sys/devices/pci0000:00/0000:00:14.0/usb2/2-4/2-4:1.0/uevent", O_RDONLY|O_CLOEXEC) = 5
openat(AT_FDCWD, "/run/udev/data/+usb:2-4:1.0", O_RDONLY|O_CLOEXEC) = 5
openat(AT_FDCWD, "/", O_RDONLY|O_CLOEXEC|O_PATH|O_DIRECTORY) = 5
openat(5, "sys", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "bus", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "usb", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "1-4.4:1.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(5, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "..", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "devices", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "pci0000:00", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "0000:00:14.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "usb1", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "1-4", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5
openat(5, "1-4.4", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 6
openat(6, "1-4.4:1.0", O_RDONLY|O_NOFOLLOW|O_CLOEXEC|O_PATH) = 5

Warning message confusing when physical file data location not available

I get the following messages running fclones group yyy :

[2021-10-24 14:33:02.212] fclones: warn: Failed to fetch extents for file : Operation not supported (os error 95)

Maybe it's harmless; my problem is that I don't know what it means.
I am using version: fclones 0.17.0
system is xubuntu 20.04 (everything upgraded)
I run fclones on about 2 TiB of data - 200000 files of all sizes
filesystem is zfs (no mirror or raid but encrypted)
disk is spinning disk 4TiB

The warning occurs on 10 of the 200000 files.
Any ideas? Could the warning message be somewhat more verbose?
Can I just ignore it?

By the way - I have run a brief speed comparison on that data above - here are my results:
(Intel quad core - 16GB ram - disk cache fully loaded from previous operations)
fclones group xxx 34 min
rdfind xxx 81 min
jdupes -S -M -Q -r xxx 90 min
rmlint -T df xxx 134 min
Pretty impressive!!!

Problem with finding files on synology NAS share mounted as CIFS volume under Linux

Hello Piotr,

Thanks for your work on fclones,

I had planned to use it for deduplication of files on my NAS; however, I encountered a strange problem.

Here are the directory contents on my NAS:

'IMG_20210416_204824 (1).jpg'*  'IMG_20210416_204830 (2).jpg'*   IMG_20210416_204845.jpg*       'IMG_20210416_223757 (1).jpg'*  'IMG_20210416_224002 (2).jpg'*   IMG_20210416_224021.jpg*        IMG_20210416_232550.jpg*
'IMG_20210416_204824 (2).jpg'*   IMG_20210416_204830.jpg*       'IMG_20210416_204847 (1).jpg'*  'IMG_20210416_223757 (2).jpg'*   IMG_20210416_224002.jpg*       'IMG_20210416_224022 (1).jpg'*   Thumbs.db*
 IMG_20210416_204824.jpg*       'IMG_20210416_204832 (1).jpg'*  'IMG_20210416_204847 (2).jpg'*   IMG_20210416_223757.jpg*       'IMG_20210416_224015 (1).jpg'*  'IMG_20210416_224022 (2).jpg'*   VID_20210416_224914.mp4*
'IMG_20210416_204827 (1).jpg'*  'IMG_20210416_204832 (2).jpg'*   IMG_20210416_204847.jpg*       'IMG_20210416_223758 (1).jpg'*  'IMG_20210416_224015 (2).jpg'*   IMG_20210416_224022.jpg*        VID_20210416_232553.mp4*
'IMG_20210416_204827 (2).jpg'*   IMG_20210416_204832.jpg*       'IMG_20210416_223755 (1).jpg'*  'IMG_20210416_223758 (2).jpg'*   IMG_20210416_224015.jpg*       'IMG_20210416_224107 (1).jpg'*
 IMG_20210416_204827.jpg*       'IMG_20210416_204845 (1).jpg'*  'IMG_20210416_223755 (2).jpg'*   IMG_20210416_223758.jpg*       'IMG_20210416_224021 (1).jpg'*  'IMG_20210416_224107 (2).jpg'*
'IMG_20210416_204830 (1).jpg'*  'IMG_20210416_204845 (2).jpg'*   IMG_20210416_223755.jpg*       'IMG_20210416_224002 (1).jpg'*  'IMG_20210416_224021 (2).jpg'*   IMG_20210416_224107.jpg*

Note that the files with (1) or (2) in their names are definitely duplicates; I confirmed this with md5sum, and they also have the same size.

Directory is mounted as type cifs (rw,relatime,vers=3.1.1,cache=strict,username=agnieszka,uid=1000,forceuid,gid=1000,forcegid,addr=10.0.0.10,file_mode=0755,dir_mode=0755,soft,nounix,serverino,mapposix,rsize=4194304,wsize=4194304,bsize=1048576,echo_interval=60,actimeo=1)

fclones --version 0.10.2
linux version 5.4.102-rt53-MANJARO

While in this directory with the duplicates I run the following fclones command: fclones .

And I get this report:

[2021-04-17 20:51:17.333] fclones:  info: Started
[2021-04-17 20:51:18.125] fclones:  info: Scanned 1 file entries
[2021-04-17 20:51:18.125] fclones:  info: Found 0 (0 B) files matching selection criteria
[2021-04-17 20:51:18.125] fclones:  info: Found 0 (0 B) candidates after grouping by size
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by paths and file identifiers
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by prefix
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) candidates after grouping by suffix
[2021-04-17 20:51:18.126] fclones:  info: Found 0 (0 B) duplicate files

The same report is produced by fclones . --names '*.jpg'

It looks like fclones does not see these files correctly. I thought this was because of their long names with whitespace (sorry, these are names generated by my phone), so I renamed two duplicates to simpler names like a.jpg and b.jpg, but I got the same result - no duplicates found.

Interestingly, I traced fclones with strace and there is not a single strace entry showing that fclones reads any files in this location.

Finally, I copied all these files to a local directory on my disk and... same result - no duplicates found.

Please let me know if you need additional data to diagnose this problem.

Thanks in advance

Autodetect drive type and tune properly for HDD

Jody Bruchon found that the default performance on a single spinning drive was bad.

This doesn't surprise me, because all the settings like parallelism level, buffer sizes, etc. are tuned for SSDs, and they are really bad for spinning drives.
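
For illustration only, a sketch of what type-dependent tuning could look like; the enum, function and numbers below are hypothetical, not the values fclones actually uses:

/// Hypothetical disk classification; real code would detect this from the OS,
/// e.g. the rotational flag under /sys/block on Linux.
enum DiskKind {
    Ssd,
    Hdd,
}

/// Return (reader thread count, read buffer size) for a given disk type.
/// Illustrative values only.
fn tuning_for(kind: DiskKind, cpu_threads: usize) -> (usize, usize) {
    match kind {
        // SSDs handle many parallel readers well and don't need big buffers.
        DiskKind::Ssd => (cpu_threads, 64 * 1024),
        // A spinning drive seeks slowly: read with a single thread and a
        // large buffer so each seek is amortized over more data.
        DiskKind::Hdd => (1, 4 * 1024 * 1024),
    }
}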

Apply ordering to `fclones action --dry-run` results

I'm running this in order to test that my --priority and --keep-path values are right:

watch 'fclones remove --dry-run < results-file.out'

Unfortunately, it's not working because the results are not ordered. So the command is rerun, and the list items dance around every time watch updates.

I thought maybe concurrency was affecting it, so I tried --threads main:1 but that only seems to apply to the group action and not the others.

My workaround right now is to run the results through | sort but it's not ideal because I need to look for each result in the alphabetical list.

So my feature request is to add some ordering to the result list, preferably in the same order as the input list. Even if the ordering is not exposed in the CLI options in any way, it would be a help.

Pluggable logging and progress reporting

Currently logging and progress bar implementations are tightly coupled to the fclones engine. These should be replaced by traits + adapters so they can be swapped to different implementations e.g. in a GUI app.
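
A rough sketch of the suggested shape (hypothetical, not the existing fclones API; all names are made up): the engine talks to a trait object, and the CLI or a GUI supplies its own adapter.

use std::sync::Arc;

/// Progress reporting abstracted behind a trait, so the engine does not
/// depend on a terminal progress bar implementation.
pub trait ProgressTracker: Send + Sync {
    fn set_total(&self, total: u64);
    fn inc(&self, delta: u64);
    fn finish(&self);
}

/// Adapter that silently discards progress, e.g. for tests or a GUI
/// that polls state through some other channel.
pub struct NoProgress;

impl ProgressTracker for NoProgress {
    fn set_total(&self, _total: u64) {}
    fn inc(&self, _delta: u64) {}
    fn finish(&self) {}
}

/// The engine holds a trait object instead of a concrete progress bar.
pub struct Engine {
    progress: Arc<dyn ProgressTracker>,
}

impl Engine {
    pub fn new(progress: Arc<dyn ProgressTracker>) -> Self {
        Engine { progress }
    }

    pub fn run(&self, work_items: u64) {
        self.progress.set_total(work_items);
        for _ in 0..work_items {
            // ... do one unit of work ...
            self.progress.inc(1);
        }
        self.progress.finish();
    }
}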

Add an option to process files by programs which modify files in-place

Some programs, like exiv2, modify the file in place instead of writing a new file and leaving the original unmodified.
In such cases, using --transform is not possible without additional scripting to make a copy before modification.

A new option --transform-copy would first make a copy of each file into a temporary directory, and then invoke the external program on that copy.
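
A minimal sketch of what such an option could do internally (hypothetical, not fclones's actual code; the function and its parameters are made up): copy the file into a temporary directory, run the in-place program on the copy, and hash the copy instead of the original.

use std::fs;
use std::io;
use std::path::{Path, PathBuf};
use std::process::Command;

/// Copy `file` into `tmp_dir`, run an in-place transform command (e.g. exiv2)
/// on the copy, and return the copy's path so it can be hashed.
/// The original file is never touched.
fn transform_copy(file: &Path, tmp_dir: &Path, program: &str, args: &[&str]) -> io::Result<PathBuf> {
    let file_name = file
        .file_name()
        .ok_or_else(|| io::Error::new(io::ErrorKind::InvalidInput, "not a file"))?;
    let copy_path = tmp_dir.join(file_name);
    fs::copy(file, &copy_path)?;

    let status = Command::new(program)
        .args(args)
        .arg(&copy_path) // the external program modifies the copy in place
        .status()?;
    if !status.success() {
        return Err(io::Error::new(io::ErrorKind::Other, "transform command failed"));
    }
    Ok(copy_path)
}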

os error 123 in Windows 10

Running fclones * in Windows 10 results in an error (sorry for the error message in Polish ;-)

[2020-07-19 16:53:35.267] fclones.exe: error: Failed to stat C:\*: Nazwa pliku, nazwa katalogu lub składnia etykiety woluminu jest niepoprawna. [The filename, directory name, or volume label syntax is incorrect.] (os error 123)
[2020-07-19 16:53:35.269] fclones.exe:  info: Scanned 0 file entries
[2020-07-19 16:53:35.270] fclones.exe:  info: Found 0 (0 B) files matching selection criteria                           
[2020-07-19 16:53:35.272] fclones.exe:  info: Found 0 (0 B) candidates after grouping by size                           
[2020-07-19 16:53:35.274] fclones.exe:  info: Found 0 (0 B) candidates after pruning hard-links                         
[2020-07-19 16:53:35.277] fclones.exe:  info: Found 0 (0 B) candidates after grouping by prefix                         
[2020-07-19 16:53:35.278] fclones.exe:  info: Found 0 (0 B) candidates after grouping by suffix                         
[2020-07-19 16:53:35.280] fclones.exe:  info: Found 0 (0 B) duplicate files

Running fclones . -R works properly and also running under WSL works properly.

error: Failed to read file list: Malformed group header:

Problem: when I run the command "fclones group . | fclones remove" on Windows 10 (Build 19042.985), I get the error "Failed to read file list: Malformed group header: F:\Photos\Sorted Photos\2005\07._DSC00284.jpg."

I had run "fclones group ." and it worked flawlessly, so I naturally wanted to remove the duplicate files. After including the "| fclones remove", it gave me this error. Is it something to do with the file starting with a '.'?

Running in Windows Terminal using PowerShell.

Incremental mode

Persist hashes to a file in order to speed up subsequent runs or to avoid recomputing hashes when the previous run was interrupted.
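
One possible shape of such a cache (a hypothetical sketch, not a committed design; the type names are made up): key each entry by path, length and modification time so a stale entry is never reused. Loading and saving the map between runs is left out here.

use std::collections::HashMap;
use std::path::PathBuf;
use std::time::SystemTime;

/// Identifies a file version: if length or mtime changed since the last run,
/// the cached hash must not be reused.
#[derive(Hash, PartialEq, Eq)]
struct CacheKey {
    path: PathBuf,
    len: u64,
    modified: SystemTime,
}

/// In-memory view of the hash cache; persisting it to a file between runs
/// (e.g. via serde) is omitted from this sketch.
struct HashCache {
    entries: HashMap<CacheKey, u128>,
}

impl HashCache {
    /// Return the cached hash for an unchanged file, or compute and remember it.
    fn get_or_compute<F>(&mut self, key: CacheKey, compute: F) -> u128
    where
        F: FnOnce() -> u128,
    {
        *self.entries.entry(key).or_insert_with(compute)
    }
}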

Tests failing on file systems which don't support querying file creation (birth) time like zfs or f2fs

fclones fails to build on my Arch Linux f2fs partition.

failures:

---- dedupe::test::test_partition_respects_creation_time_priority stdout ----
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.674] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/ctime_priority/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_creation_time_priority' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:856:80
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- dedupe::test::test_partition_respects_drop_patterns stdout ----
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/drop/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_drop_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:923:68

---- dedupe::test::test_partition_respects_keep_patterns stdout ----
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.675] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/keep/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_keep_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:904:68

---- dedupe::test::test_run_dedupe_script stdout ----
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_3: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Failed to read creation time of file /home/et/yay/fclones/src/fclones-0.12.3/target/test/dedupe/partition/dedupe_script/file_1: creation time is not available for the filesystem
[2021-07-29 18:44:45.676] fclones-6cdeb7b3f6a11fd5:  warn: Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read.
thread 'dedupe::test::test_run_dedupe_script' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `2`', src/dedupe.rs:944:13


failures:
    dedupe::test::test_partition_respects_creation_time_priority
    dedupe::test::test_partition_respects_drop_patterns
    dedupe::test::test_partition_respects_keep_patterns
    dedupe::test::test_run_dedupe_script

test result: FAILED. 94 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 2.49s

error: test failed, to rerun pass '--lib'

--stdin Parameter Not Working

Greetings!

I've been playing a bit with fclones this morning (super cool tool, BTW) and wanted to use the --stdin parameter to read the list of files to analyze from the output of find. Based on the documentation it seems like passing the input to fclones group --stdin should work, but whenever I try this I always get an error: fclones: error: No input files

Here's a simple, trivial example of what I mean:

localhost~ % fclones --version
fclones 0.12.2
localhost~ % mkdir blah
localhost~ % cd blah
localhost~/blah % touch {1,2,3}.c
localhost~/blah % find . -name '*.c'
./1.c
./2.c
./3.c
localhost~/blah % find . -name '*.c' | fclones group --stdin
[2021-06-19 15:01:12.126] fclones: error: No input files

I'm not sure if I'm doing something incorrectly - any ideas? Thanks in advance!

No binary releases

Hey!

I'm maintaining the fclones and fclones-bin packages on AUR, and I saw that the 0.9.0 release doesn't have binary artifacts. Unfortunately, I couldn't bump the -bin package because of that.

Are you considering bringing back binary artifacts (e.g. fclones-$pkgver.tgz) with the upcoming releases?

The `--dry-run` output looks suspiciously like a usable shell script

The output of --dry-run for link and dedupe moves the original files out of the way via mv (just like fclones itself does), but then completely ignores possible failure of the next command and removes the backup in the following step.

If the calling shell does not have errexit set, this can lead to data loss (actually just filename loss) if the ln/cp fails.

I would suggest just printing the action for each line instead, and optionally emitting a shell header that is called and includes proper error handling.

Feature request: Provide macOS builds

fclones looks great, but installing a whole Rust build stack to test it out is a bit of a barrier. At some point it would be great if macOS-compatible builds were generated and provided as part of the regular release process.

Stream output to file

I'm running into an issue: I'm scanning 20 GB over the network, and during the last stage, grouping by contents, it would be great to get duplicates written out as they are found. For the sake of better UX, maybe only stream them when an argument is given to write the output to a file. This way, if I or something else interrupts the content hashing, I can still get a partial result.

Identical files under the same root are returned despite `--isolate` option

When running fclones group -I, it seems to be finding duplicate files underneath the same root (path argument on command line). For example, if I construct a tree like:

echo hi > source.txt
mkdir -p {a,b}/{1,2}
for i in {a,b}/{1,2}/test; do cp source.txt "${i}"; done

and then run fclones group -I a b, I get:

[2021-11-17 16:08:01.482] fclones:  info: Started grouping
[2021-11-17 16:08:02.019] fclones:  info: Scanned 10 file entries
[2021-11-17 16:08:02.019] fclones:  info: Found 4 (12 B) files matching selection criteria
[2021-11-17 16:08:02.019] fclones:  info: Found 3 (9 B) candidates after grouping by size
[2021-11-17 16:08:02.019] fclones:  info: Found 3 (9 B) candidates after grouping by paths and file identifiers
[2021-11-17 16:08:02.033] fclones:  info: Found 3 (9 B) candidates after grouping by prefix
[2021-11-17 16:08:02.033] fclones:  info: Found 3 (9 B) candidates after grouping by suffix
[2021-11-17 16:08:02.034] fclones:  info: Found 3 (9 B) redundant files
# Report by fclones 0.17.1
# Timestamp: 2021-11-17 16:08:02.036 -0500
# Command: fclones group -I a b
# Found 1 file groups
# 9 B (9 B) in 3 redundant files can be removed
e872d4a1bdc12e1262820a95eebb530a, 3 B (3 B) * 4:
    /tmp/tree/a/1/test
    /tmp/tree/a/2/test
    /tmp/tree/b/1/test
    /tmp/tree/b/2/test

Offer a way of deleting / hardlinking / softlinking duplicated files automatically

fclones should offer a way of deleting / hardlinking / softlinking duplicated files automatically.

In #25:

@pkolaczk wrote:

That's right, fclones doesn't offer any way of deleting files automatically yet. I believe this is a task for a different program (or a subcommand) that would take output of fclones.

and @piranna replied:

From a UNIX perspective, yes, it makes sense for that task to be done by another command, but it would be so tightly attached to the fclones output format... :-/ Maybe a shell script wrapper that offers an interface compatible with fdupes? :-) That would be easy to implement, but I'm not sure whether it should be hosted here in the fclones repo or be totally independent...

IMHO, a postprocessing script parsing the fclones output might require more complexity than adding a CLI switch. For instance, here's an (untested) Python implementation that leverages the CSV output (expected in fclones_out.csv) to replace duplicates with hard links:

#!/usr/bin/env python

import logging
from os import link, unlink
from os.path import isfile


def main() -> None:
    with open("fclones_out.csv") as f_handler:

        # Skip the header line (it starts with "size") and take the path
        # columns; note that this naive split breaks on paths containing commas.
        for duplicates in (
            fclone_output_line.rstrip("\n").split(",")[3:]

            for fclone_output_line in f_handler.readlines()

            if not fclone_output_line.startswith("size")
        ):
            src = duplicates[0]

            for dst in duplicates[1:]:
                logging.debug("%s -> %s", src, dst)

                # Replace the duplicate with a hard link to the first file.
                if isfile(dst):
                    unlink(dst)
                link(src, dst)


if __name__ == "__main__":
    logging.basicConfig(level=logging.DEBUG)
    main()

PS: I think this deserves a ticket on its own, feel free to delete it if you don't agree. :-)

Build fails: "non-exhaustive patterns `Removable` not covered"

Hello, I am unable to compile fclones due to an error. This happens with both the AUR package and manually running cargo build --release.
OS: Manjaro Linux x86_64
Rust/Cargo version 1.49.0

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:22:24
|
22  |             0 => match disk_type {
|                        ^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:41:15
|
41  |         match self.disk_type {
|               ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:49:23
|
49  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:57:23
|
57  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:69:23
|
69  |         FileLen(match self.disk_type {
|                       ^^^^^^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error[E0004]: non-exhaustive patterns: `Removable` not covered
--> src/device.rs:100:31
|
100 |                 let p = match disk_type {
|                               ^^^^^^^^^ pattern `Removable` not covered
| 
::: /home/timothy/.cargo/registry/src/github.com-1ecc6299db9ec823/sysinfo-0.15.4/src/common.rs:257:5
|
257 |     Removable,
|     --------- not covered
|
= help: ensure that all possible cases are being handled, possibly by adding wildcards or more match arms
= note: the matched value is of type `DiskType`

error: aborting due to 6 previous errors

For more information about this error, try `rustc --explain E0004`.
error: could not compile `fclones`

To learn more, run the command again with --verbose.
warning: build failed, waiting for other jobs to finish...
error: build failed

Does not build on OSX

Hi,

Your tool looks very promising, so I wanted to give it a go on Mac. Unfortunately I get build errors, mainly about an unresolved PosixFadviseAdvice.

short trace

error[E0433]: failed to resolve: use of undeclared type or module `PosixFadviseAdvice`

I have zero Rust skills, but if you have some advice I would gladly help you out.

Failure to read creation time on ZFS

It looks like stat does not report the creation time from ZFS properly, showing no "Birth" time. I assume whatever fclones uses behaves similarly and does not get a creation time reported. I'm still digging around for details; "Does ZFS store 'Birth Time' or 'Creation Time'?" is what I've uncovered so far.

failures:

---- dedupe::test::test_partition_respects_keep_patterns stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/keep/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_keep_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:904:68
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- dedupe::test::test_partition_respects_drop_patterns stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_drop_patterns' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:923:68

---- dedupe::test::test_partition_respects_creation_time_priority stdout ----
[2021-06-05 20:02:23.458] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.459] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/ctime_priority/file_1: creation time is not available for the filesystem
thread 'dedupe::test::test_partition_respects_creation_time_priority' panicked at 'called `Result::unwrap()` on an `Err` value: Error { message: "Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read." }', src/dedupe.rs:856:80

---- dedupe::test::test_run_dedupe_script stdout ----
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_3: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_2: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Failed to read creation time of file /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/dedupe_script/file_1: creation time is not available for the filesystem
[2021-06-05 20:02:23.466] fclones-fe24705dd771f261:  warn: Could not determine files to drop in group with hash 00000000000000000000000000000000 and len 0: Metadata of some files could not be read.
thread 'dedupe::test::test_run_dedupe_script' panicked at 'assertion failed: `(left == right)`
  left: `0`,
 right: `2`', src/dedupe.rs:944:13


failures:
    dedupe::test::test_partition_respects_creation_time_priority
    dedupe::test::test_partition_respects_drop_patterns
    dedupe::test::test_partition_respects_keep_patterns
    dedupe::test::test_run_dedupe_script

test result: FAILED. 92 passed; 4 failed; 0 ignored; 0 measured; 0 filtered out; finished in 65.13s
0 ✓ fryfrog@apollo ~ $ ls -alh /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
-rw-r--r-- 1 fryfrog fryfrog 0 Jun  5 20:02 /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
0 ✓ fryfrog@apollo ~ $ stat /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
  File: /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
  Size: 0               Blocks: 1          IO Block: 131072 regular empty file
Device: 19h/25d Inode: 2938048     Links: 1
Access: (0644/-rw-r--r--)  Uid: ( 1000/ fryfrog)   Gid: ( 1000/ fryfrog)
Access: 2021-06-05 20:02:23.449795401 -0700
Modify: 2021-06-05 20:02:23.449795401 -0700
Change: 2021-06-05 20:02:23.449795401 -0700
 Birth: -
0 ✓ fryfrog@apollo ~ $ sudo zdb -O rpool/ROOT/arch home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1

   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  2938048    1   128K    512      0     512    512    0.00  ZFS plain file

0 ✓ fryfrog@apollo ~ $ sudo zdb -ddddd rpool/ROOT/arch  2938048
Dataset rpool/ROOT/arch [ZPL], ID 394, cr_txg 20, 81.7G, 1547348 objects, rootbp DVA[0]=<0:2287a77000:1000> DVA[1]=<0:2875361000:1000> [L0 DMU objset] fletcher4 uncompressed unencrypted LE contiguous unique double size=1000L/1000P birth=140448231L/140448231P fill=1547348 cksum=11e787a4a1:3022d1624a43:45bff9d1ab72ff:480dc35cda0302c4

   Object  lvl   iblk   dblk  dsize  dnsize  lsize   %full  type
  2938048    1   128K    512      0     512    512    0.00  ZFS plain file
                                              176   bonus  System attributes
       dnode flags: USERUSED_ACCOUNTED USEROBJUSED_ACCOUNTED
       dnode maxblkid: 0
       path    /home/fryfrog/.cache/paru/clone/fclones/src/fclones-0.12.0/target/test/dedupe/partition/drop/file_1
       uid     1000
       gid     1000
       atime   Sat Jun  5 20:02:23 2021
       mtime   Sat Jun  5 20:02:23 2021
       ctime   Sat Jun  5 20:02:23 2021
       crtime  Sat Jun  5 20:02:23 2021
       gen     140447745
       mode    100644
       size    0
       parent  2938047
       links   1
       pflags  840800000004

Use ZFS checksums for faster comparison

There was a proposal for rdfind to use the existing ZFS checksums, which are created when a file is written to ZFS. This could result in much faster comparisons, especially for big files on ZFS.
I think this would be great enhancement for fclones.

Here is the original rdfind post.

Thank you for this nice tool!

Update benchmarks and add czkawka and rmlint to comparison

Hi,
I see that your app does some crazy optimizations for SSDs and HDDs, and I'm curious how fast it is in comparison to my app, Czkawka (it uses mostly simple optimizations and rather primitive algorithms, since I focus more on the GUI).

I'm almost sure that with a big number of duplicated files fclones will be faster, but I'm curious whether Czkawka will be faster on a second scan due to caching hash results.

Processing files before reporting

Hi, Piotr

I found something obvious - but nevertheless interesting.

I ran my regex against the fclones report to get 2 text files - a set of unique files and a set of duplicate files (minus one copy to use as an original).

I found that the set of unique files still had some duplicate images, which differed in hash and file size only because of EXIF data.

It seems the camera-sourced EXIF metadata was the superset, and a number of fields (maybe half of them) were dropped when the photos were imported into Apple iPhoto.

So that got a friend and me wondering how easy or hard it would be to strip images of EXIF data on the fly via exiftool and pipe them to fclones, which could then create the report ignoring the EXIF data (since it would no longer be there) - and then finally maybe parse the report again to sort by largest size first, based on the persistent size on disk.

The largest file (where the EXIF data differs) would indicate the richest data set to keep as the original, which would be easier to select by regex if it were sorted to the top of each set of duplicates.

Happy to have a play with this idea, but if you have any thoughts about it - specifically about feeding exiftool output into fclones - I would be keen to hear them.

Cheers,
Stu

The first answer here is suggesting a similar approach to the same kind of problem.
ref:https://softwarerecs.stackexchange.com/questions/51032/compare-two-image-files-for-identical-data-excluding-metadata

Publish on crates.io

It would be nice to cargo install fclones, have cargo track versions, and so on. This could be part of the CI pipeline, and/or there's cargo-release which handles tagging and so on.

Detect changes after `fclones group` to avoid copying the wrong data

If files change after an fclones group run without updating the timestamps and remain the same size, then the fclones link command (and others) can lead to data loss:

$ mkdir z; cd z; echo same > 1; echo same > 2; echo abcd > Z
$ cat ?
same
same
abcd
$ fclones group . -o log 
$ cp -a Z 1   # timestamp is kept
$ fclones link < log
$ cat ?
abcd
abcd
abcd

This could be avoided by also checking that the ctime of a file is older than the start of the group run, and if not re-checking or aborting.

Maybe even add a --paranoid option to check the content byte-by-byte before acting on it. But even in that case, I am not aware of any (Unix) way to guarantee exclusive write access to a file, so maybe mention that the checked data is expected not to change.
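
A sketch of the suggested ctime check (hypothetical, Unix-only, assuming post-1970 timestamps; the function name is made up): compare the file's ctime against the moment the group run started and refuse to act if the file changed afterwards.

use std::fs;
use std::io;
use std::path::Path;
use std::time::{Duration, SystemTime};

/// Return true if `path` has not been modified (content or metadata) since
/// `group_run_started`. On Unix, ctime changes on any write or metadata update.
fn unchanged_since(path: &Path, group_run_started: SystemTime) -> io::Result<bool> {
    use std::os::unix::fs::MetadataExt;
    let meta = fs::metadata(path)?;
    // Assumes a post-1970 timestamp, hence the unsigned casts.
    let ctime = SystemTime::UNIX_EPOCH
        + Duration::new(meta.ctime() as u64, meta.ctime_nsec() as u32);
    Ok(ctime <= group_run_started)
}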

Support JSON input for remove/link

$ fclones remove ... < dupes.json
fclones: error: Input error: Not a default fclones report. Formats other than the default one are not supported yet.

I found it very useful to process the fclones group JSON output with jq and would like to continue the workflow with fclones remove.

`fclones dedupe` does not preserve mtimes on Linux

fclones dedupe doesn't seem to preserve mtimes on Linux. Preserving mtimes seems like something that should be both possible and desirable, but please let me know if I missed something.

I tested this with btrfs on NixOS 21.11-pre.

# uname -a
Linux ra 5.10.76-hardened1 #1-NixOS SMP Wed Oct 27 07:56:57 UTC 2021 x86_64 GNU/Linux

# fclones --version
fclones 0.17.0

# cp -a /etc/passwd ./

# touch --date 2009-01-01 passwd

# l
total 4,096
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd

# cp -a passwd passwd.2

# l
total 8,192
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd.2
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd

# fclones group . | fclones dedupe
[2021-10-29 04:25:29.532] fclones:  info: Started grouping
[2021-10-29 04:25:29.540] fclones:  info: Scanned 3 file entries
[2021-10-29 04:25:29.540] fclones:  info: Found 2 (7.8 KB) files matching selection criteria
[2021-10-29 04:25:29.540] fclones:  info: Found 1 (3.9 KB) candidates after grouping by size
[2021-10-29 04:25:29.540] fclones:  info: Found 1 (3.9 KB) candidates after grouping by paths and file identifiers
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) candidates after grouping by prefix
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) candidates after grouping by suffix
[2021-10-29 04:25:29.552] fclones:  info: Found 1 (3.9 KB) redundant files
[2021-10-29 04:25:29.553] fclones:  info: Started deduplicating
[2021-10-29 04:25:29.561] fclones:  info: Processed 1 files and reclaimed up to 3.9 KB space

# l
total 8,192
-rw-r--r-- 1 at at 3,891 2009-01-01 00:00 passwd
-rw-r--r-- 1 at at 3,891 2021-10-29 04:25 passwd.2

I also tested a user xattr and it did seem to be preserved.
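
A sketch of how mtime preservation could be bolted onto the dedupe step (hypothetical, not fclones's code; assumes the third-party filetime crate): capture the mtime before the file is replaced and restore it afterwards.

use std::fs;
use std::io;
use std::path::Path;

use filetime::FileTime; // external crate assumed: filetime

/// Run a destructive operation on `path` (e.g. replacing it with a reflinked
/// copy) and restore the original modification time afterwards.
fn with_preserved_mtime<F>(path: &Path, op: F) -> io::Result<()>
where
    F: FnOnce(&Path) -> io::Result<()>,
{
    let mtime = FileTime::from_last_modification_time(&fs::metadata(path)?);
    op(path)?;
    filetime::set_file_mtime(path, mtime)
}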

Perform partial hashing in order of physical block placement

This paper reports that ordering accesses by inode id, or by physical block location retrieved with the FIEMAP ioctl API, can give substantial performance improvements.

These techniques could be applied for the partial hashing phase of fclones, where seek time and rotational latency are the major bottleneck.
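
A minimal sketch of the cheaper variant (hypothetical, Unix-only; the function name is made up): sort the candidates by inode number before the partial-hashing pass, which roughly approximates on-disk order. FIEMAP would give the exact physical offset of each file's first extent at the cost of an extra ioctl per file.

use std::fs;
use std::os::unix::fs::MetadataExt;
use std::path::PathBuf;

/// Order candidate files by inode number before partial hashing, as a cheap
/// approximation of physical placement on disk. Files whose metadata cannot
/// be read are pushed to the end.
fn order_by_inode(mut files: Vec<PathBuf>) -> Vec<PathBuf> {
    files.sort_by_cached_key(|p| fs::metadata(p).map(|m| m.ino()).unwrap_or(u64::MAX));
    files
}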
