gartham / file-duplicate-checker

Program to scan and search for file duplicates. (~300MB/s)

Java 100.00%
File Duplication Checker

This is a command-line application that quickly checks whether files are duplicates of each other.

It takes a directory as an argument and scans that entire directory tree (it searches files in all subfolders as well as the parent folder). The program detects duplicate files even if they have different names or are in different (sub-)directories.

Calling

To run the program, just pass a directory as the only argument:

java -jar dupecheck.jar /some/folder

If the directory has spaces in its pathname, put the path in quotes:

java -jar dupecheck.jar "C:/My Documents"

Implementation

The implementation uses file sizes and hashing. It stores files mapped by their file size. If it finds two files with the same size, it generates and stores a hash of each of them. (Hashes are small, 32 bytes each, so they are easy to keep in memory.) Whenever it encounters another file that has the same contents as one it has already scanned, it compares the lengths, then the hashes, and finds a match.

Hashing is done using SHA-256. On a consumer computer where the CPU is the bottleneck, this program can easily hash and compare about 200MB/s of file data, in some cases reaching as fast as 300MB/s. The program can typically scan over much more file data than that, though, and works extremely quickly over non-duplicate files, because in typical scenarios non-duplicate files rarely have the same file size and so rarely need to be hashed.
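The size-then-hash approach described above can be sketched roughly as follows. This is a minimal illustration, not the project's actual source: the class name DuplicateSketch, its methods, and the choice to store digests as hex strings are all assumptions made for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.stream.Stream;

// Hypothetical sketch; names and structure are not taken from the project.
final class DuplicateSketch {
    // Sizes seen exactly once so far; the remembered file has not been hashed yet.
    private final Map<Long, Path> pending = new HashMap<>();
    // Sizes for which at least two files were seen, so all of them get hashed.
    private final Set<Long> hashedSizes = new HashSet<>();
    // SHA-256 digest (as hex) -> first file seen with that digest.
    private final Map<String, Path> byHash = new HashMap<>();

    /** Returns an earlier file with the same size and digest, or null if there is none. */
    Path check(Path file) {
        long size = sizeOf(file);
        if (!hashedSizes.contains(size)) {
            Path first = pending.putIfAbsent(size, file);
            if (first == null) {
                return null; // first file of this size: remember it, read nothing
            }
            // Second file of this size: hash the remembered file now.
            pending.remove(size);
            hashedSizes.add(size);
            byHash.put(sha256Hex(first), first);
        }
        // Stores the file under its digest unless an earlier file already has it.
        return byHash.putIfAbsent(sha256Hex(file), file);
    }

    /** Walks the whole tree under root and returns {earlier, duplicate} pairs. */
    List<Path[]> scan(Path root) {
        List<Path[]> dupes = new ArrayList<>();
        try (Stream<Path> walk = Files.walk(root)) {
            walk.filter(Files::isRegularFile).forEach(p -> {
                Path earlier = check(p);
                if (earlier != null) {
                    dupes.add(new Path[] {earlier, p});
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return dupes;
    }

    private static long sizeOf(Path file) {
        try {
            return Files.size(file);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Streams the file through SHA-256 and returns the digest as lowercase hex. */
    static String sha256Hex(Path file) {
        try (InputStream in = Files.newInputStream(file)) {
            MessageDigest md = MessageDigest.getInstance("SHA-256");
            byte[] buf = new byte[8192];
            for (int n; (n = in.read(buf)) != -1; ) {
                md.update(buf, 0, n);
            }
            StringBuilder hex = new StringBuilder();
            for (byte b : md.digest()) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }
}
```

In this sketch, a file whose size is unique in the tree is never opened, which is why non-duplicate files cost so little to scan. Calling new DuplicateSketch().scan(dir) on a directory would return one pair for each file that matches an earlier one.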

Storage

The program uses a HashMap to store each file it hashes. Each file's SHA-256 hash is stored as the key, and the file itself as the value. Right now, the hash code that the HashMap uses for each key is computed by a Java hash code implementation applied to the SHA-256 hash. This could easily be made faster by changing the key's hashCode() function to simply return the first 4 bytes of the SHA-256 hash instead of computing a separate Java hash code over it. (That assumes the first 4 bytes of a SHA-256 hash are less likely to collide than Java's computed hash code over the same hash, which is most likely the case.)
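One way the proposed change might look is a small key class that wraps the digest bytes and builds its hash code from the first 4 of them. This is only a sketch under assumptions: the class name DigestKey and the helper sha256Of are hypothetical and do not come from the project.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Arrays;

// Hypothetical HashMap key wrapping a SHA-256 digest; not the project's own class.
final class DigestKey {
    private final byte[] digest;

    DigestKey(byte[] digest) {
        if (digest.length < 4) {
            throw new IllegalArgumentException("digest must be at least 4 bytes");
        }
        this.digest = digest.clone(); // defensive copy keeps the key immutable
    }

    /** Convenience helper for the example: hashes a byte array with SHA-256. */
    static DigestKey sha256Of(byte[] data) {
        try {
            return new DigestKey(MessageDigest.getInstance("SHA-256").digest(data));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    /** Packs the first 4 digest bytes into an int, big-endian, with no further mixing. */
    @Override
    public int hashCode() {
        return ((digest[0] & 0xFF) << 24)
             | ((digest[1] & 0xFF) << 16)
             | ((digest[2] & 0xFF) << 8)
             |  (digest[3] & 0xFF);
    }

    /** Equality still compares the full digest, so a 4-byte clash is not a false match. */
    @Override
    public boolean equals(Object other) {
        return other instanceof DigestKey
            && Arrays.equals(digest, ((DigestKey) other).digest);
    }
}
```

With a key like this, a Map<DigestKey, File> would pick buckets from the leading digest bytes while equals() still checks all 32 bytes. Whether this is measurably faster than the existing hash code is not something the sketch demonstrates.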
