Coder Social home page Coder Social logo

shatag-scrub's Introduction

OVERVIEW

  Shatag is a tool for computing and caching file checksums, and do
remote duplicate detection. Files are compared on their SHA-256 hash
to find duplicates, and will use filesystem extended attributes to cache
the checksum values.

REQUIREMENTS

 - Python 3
 - argparse
 - pyxattr (until other backends become available)
 - If you want to use a database store: sqlite3 or psycopg2
 - If you want to use an http remote store: requests
 - If you want to use the http store server: the Bottle framework

QUICKSTART

  Before you attempt to use shatag, make sure the filesystem you will be working on has
support for user extended attributes. This is typically enabled by adding the user_xattr
mount option to the appropriate filesystem in /etc/fstab. You can also enable support
without unmounting with:

  sudo mount -o remount,user_xattr /mountpoint

  If you want to use a remote database, setup a configuration file ~/.shatagrc with contents like:

  database: "pg: dbname=shatag host=localhost user=shatag password=xxxsecretpasswordxxx"

  Next, start tagging some files with: 
  
  shatag -qrt /home/data

  Upload the proper tags to the database with:
  
  shatag -qrp /home/data  (you can combine the -t and -p flags in one command)

  You can start a daemon to automatically recompute tags when needed:

  shatagd -dr /home/data

  Check for duplicates in a directory with

  shatag -rl .

SUMMARY OF TOOLS

  shatag(1) is the main tool, and used for computing and displaying file checksums,
as well as reporting duplicate.

  shatagd(1) will monitor directories for file changes and update the checksums automatically.
it requires pyinotify.

  Refer to the man pages of these tools for more information.


TECHNICAL DETAILS

  shatag stores a single sha256 checksum in hexadecimal (64 ascii characters) in the "shatag.sha256" attribute,
and the modification time of the file at the moment the checksum was computed, as decimal unix time, in "shatag.ts".

 The checksum is considered valid if the modification time of the file is still equal to the timestamp stored in the attribute.
With respect to race conditions, shatag will always read the stored timestamp before the checksum, and the checksum before
the actual mtime; it will always write the checksum before the timestamp. In case of accidental parallel operation, shatag
will rather compute unneccesary checksums than consider an old checksum valid.

  The local sqlite database contains a single table:
  
  create table contents(hash text, name text, path text, primary key (hash,name,path));

  When querying the database, shatag will lookup the hash and count the matching paths at each location. The file will
be considered a dupe if any location has the file more than once.

TODO: 
    * Fix probable encoding quirks for filenames

shatag-scrub's People

Contributors

odyx avatar jwilk avatar fld avatar

Watchers

 avatar

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    ๐Ÿ–– Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. ๐Ÿ“Š๐Ÿ“ˆ๐ŸŽ‰

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google โค๏ธ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.