duplicates

File duplicates finder

Sometimes you need to find duplicate files on your disk. This tool does that, using an MD5 hash to identify duplicates. You can also filter files by name and by minimum size (in bytes).

This program was initially designed and written by Mathieu Ancelin and extended by carfloresf. It had a great multi-threaded architecture, but I wanted to adapt it to my use case, which is deduplicating a large media collection. The original program hashed every file in full to find duplicates, which is unnecessary for large media files and painfully slow. On the flip side, partial hashes carry a theoretical risk of false duplicates, although this risk is low. Plain text files and source code files are the riskiest for partial hashing, so always use full hashing (-full) on those.

To make this program better suited to large files I made the following changes:

  • Manage potential duplicates based on size (faster, reduces the number of files that need to be hashed)
  • Hash only the first and last 4k block in a file (much faster for large files)
  • Implement the option to hard-link duplicates, which doesn't remove them but does free up space
  • Increase min file size to 64K (from 1 byte)
  • Implement max file size command line option
  • Add execution timing

The resulting program is very fast. It can scan and hash a folder of 10,200 camera images totaling 98 GB in 3.2 seconds on a 2012-era quad-core Xeon (the folder is on an NVMe SSD). A folder holding multiple snapshot backups of the same photos, 223 GB in over 50,000 files, is scanned and hashed in 1m05s (>50% duplicate rate).

usage

usage: duplicates [options...] path

  -h          Display the help message
  -name       Filename pattern
  -nostats    Do not output stats
  -single     Work in single threaded mode
  -min-size   Minimum size in bytes for a file (default: 64k)
  -max-size   Maximum size in bytes for a file (default: no maximum)
  -delete     Deletes duplicate files
  -link       Hard-links duplicate files
  -full       Hash the full file (safer, default: hash first and last 4k block of a file)

examples

$ duplicates /tmp
$ duplicates -link /tmp
$ duplicates -name .mp3 /tmp
$ duplicates -min-size 1 /tmp
$ duplicates -min-size 2056 -name .mp3 /tmp
$ duplicates -nostats -min-size 2056 -name .mp3 /tmp > duplicates.txt

install

  • from source
go get github.com/rlagerweij/duplicates
  • binaries

See releases on the side bar.

contributors

rlagerweij, mathieuancelin, pcolazurdo
