
bitrot's People

Contributors

ambv, benshep, liloman, msloth, p1r473, philipbl, senotrusov, vain, wzyboy, yang

bitrot's Issues

Ignoring files or directories, or specifying files to be scanned

I know this is quite a big feature, but it would be nice to have: I have a lot of small backup files in a directory, and bitrot is progressing really slowly (0.1% in ~20 hours).
One simple implementation would be for bitrot to accept a predefined file list, which could be generated by find or a similar tool that already has exclude options. I may start working on it if I have the time. :)
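
A minimal sketch of the file-list idea, not an existing bitrot feature. The helper name `read_file_list` and the `null_separated` parameter are hypothetical; the input is assumed to come from something like `find . -type f ! -name '*.bak' -print0`.

```python
import io


def read_file_list(stream, null_separated=False):
    """Yield paths from a binary stream, one per separator.

    Hypothetical helper, not part of bitrot: it only illustrates how a
    pre-generated list (for example from `find`) could replace the
    directory walk.
    """
    separator = b"\0" if null_separated else b"\n"
    for raw in stream.read().split(separator):
        if raw:  # skip the empty item left by a trailing separator
            # surrogateescape keeps undecodable file names round-trippable
            yield raw.decode("utf-8", errors="surrogateescape")


# Simulated output of `find ... -print0`; a real run would read sys.stdin.buffer.
sample = io.BytesIO(b"./a.txt\0./dir/b.txt\0")
paths = list(read_file_list(sample, null_separated=True))
print(paths)
```

NUL separation is the safer choice here because file names may legally contain newlines.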

bitrot for xmp files

Several free software photography tools like darktable store all the metadata, and the changes they make to the image, in an xmp file with the same name. The original file is never touched.
Would anybody be interested in giving bitrot a feature where the bitrot information is stored in these same xmp files instead of the SQLite database file?
The main benefit of this is that other apps like darktable or digikam would also be able to access and use the bitrot information from the xmp file. Bitrot could even be integrated and launched from these apps directly.
Most photographers don't store checksums of their photos, but just like everybody else, they suffer the consequences when bitrot happens. This way, as soon as bitrot occurs it can be detected, and the photographer can delete the corrupt file and restore a backup of just that file.

No such file or directory

Might be some kind of race condition when run over folders where files are actively and rapidly changing?

Traceback (most recent call last):
  File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
    execfile(__file__)
  File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
    chunk_size=args.chunk_size,
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 138, in run
    st = os.stat(p)
OSError: [Errno 2] No such file or directory: './.dropbox/instance1/filecache.dbx-journal'
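
If the cause is indeed a file disappearing between the directory walk and the `os.stat` call, one way to tolerate it is to treat a missing file as "skip". This is a sketch under that assumption, written for Python 3 (where errno 2 surfaces as `FileNotFoundError`); `stat_or_none` is a hypothetical name, not a bitrot function.

```python
import os
import tempfile


def stat_or_none(path):
    """Return os.stat(path), or None if the file vanished in the meantime."""
    try:
        return os.stat(path)
    except FileNotFoundError:
        # The path was listed by the walk but deleted before we could
        # stat it (for example a short-lived journal file).
        return None


with tempfile.TemporaryDirectory() as tmp:
    kept = os.path.join(tmp, "kept.txt")
    with open(kept, "w") as f:
        f.write("data")
    gone = os.path.join(tmp, "gone.txt")  # never created: simulates the deletion

    results = {os.path.basename(p): stat_or_none(p) is not None
               for p in (kept, gone)}
    print(results)
```

The caller would then skip any path for which `None` comes back instead of aborting the whole run.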

Improve performance using threads

Hi,

I've noticed that the run time grows with the number of files, so when I use it on a large number of files it takes very long.

cd /tmp; mkdir more ; cd more/
# create a 320KB file
dd if=/dev/zero of=masterfile bs=1 count=327680
# split it in 32768 files (instantly) + masterfile = 32769
split -b 10 -a 10 masterfile
# waiiiiiiiiiiiiiiiiiiiiiiiit
bitrot -v

I suppose it could be made to work with threads and improve a lot, because calculating the sha1 and inserting into sqlite for x files on a single thread is too old school nowadays. ;)

I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB/x where the files to be hashed are inserted, plus n threads to calculate and insert/update them (or a separate thread just for sqlite), could work: the threads collect files from the central buffer, n at a time. Sounds like a cool project. ;)

I'm using it for this tool. ;)

https://github.com/liloman/heal-bitrots

Cheers!
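
A rough sketch of the thread idea from this issue, using the standard library's `ThreadPoolExecutor` as the "central buffer plus n workers". The names `sha1_of` and `hash_all` are illustrative assumptions, not bitrot's code, and whether this helps depends on where the time actually goes (hashing, disk seeks, or database writes).

```python
import hashlib
import os
import tempfile
from concurrent.futures import ThreadPoolExecutor


def sha1_of(path, chunk_size=16384):
    """Return the hex SHA-1 digest of a file, read in binary chunks."""
    digest = hashlib.sha1()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def hash_all(paths, workers=4):
    """Hash paths on a thread pool and return {path: digest}."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        # map() preserves input order, so zip pairs each path with its digest.
        return dict(zip(paths, pool.map(sha1_of, paths)))


with tempfile.TemporaryDirectory() as tmp:
    paths = []
    for i in range(20):
        p = os.path.join(tmp, "part%02d" % i)
        with open(p, "wb") as f:
            f.write(b"0123456789")
        paths.append(p)
    digests = hash_all(paths)
    print(len(digests), "files hashed")
```

In this sketch the results come back to the calling thread, so any sqlite inserts can stay on a single thread, which matters because `sqlite3` connections refuse cross-thread use by default.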

IOError when lacking permissions

bitrot fails totally if it encounters a file it can't read.

File "/usr/local/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 247, in run_from_command_line
    run(verbosity=verbosity, test=args.test)
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 120, in run
    new_sha1 = sha1(p)
  File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 48, in sha1
    with open(path) as f:
IOError: [Errno 13] Permission denied: './file'

I think it should be more graceful. Either skip with/without logging or abort. Any opinion on this?
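
A sketch of the "skip with logging" option, assuming Python 3, where `IOError` is an alias of `OSError` (the traceback above is from Python 2.7). `hash_readable` is a hypothetical helper, and a missing file stands in for an unreadable one because both raise an `OSError` subclass.

```python
import hashlib
import os
import tempfile


def hash_readable(paths):
    """Hash every readable path; collect the others instead of raising.

    Returns (digests, skipped), where skipped maps path -> error text.
    """
    digests, skipped = {}, {}
    for path in paths:
        try:
            with open(path, "rb") as f:
                # Whole-file read keeps the sketch short; real code would chunk.
                digests[path] = hashlib.sha1(f.read()).hexdigest()
        except OSError as exc:
            # PermissionError and FileNotFoundError both land here.
            skipped[path] = exc.strerror
    return digests, skipped


with tempfile.TemporaryDirectory() as tmp:
    ok = os.path.join(tmp, "ok.txt")
    with open(ok, "wb") as f:
        f.write(b"hello")
    bad = os.path.join(tmp, "unreadable.txt")  # never created
    digests, skipped = hash_readable([ok, bad])

for path, reason in skipped.items():
    print("skipped %s: %s" % (os.path.basename(path), reason))
print("%d hashed, %d skipped" % (len(digests), len(skipped)))
```

Collecting the skipped paths rather than printing immediately leaves the choice between a warning summary and a non-zero exit code to the caller.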

“python_requires” should be set with “>=3”, as bitrot 1.0.0 is not compatible with all Python versions.

Currently, the keyword argument python_requires of setup() is not set, and thus it is assumed that this distribution is compatible with all Python versions.
However, I found it is not compatible with Python2. My local Python version is 2.7, and I encounter the following error when executing “pip install bitrot”

Collecting bitrot
  Downloading bitrot-1.0.0.tar.gz (11 kB)
    ERROR: Command errored out with exit status 1:
     command: /usr/local/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"'; __file__='"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-v9p1bP/bitrot/pip-egg-info
         cwd: /tmp/pip-install-v9p1bP/bitrot/
    Complete output (9 lines):
    Traceback (most recent call last):
      File "<string>", line 1, in <module>
      File "/tmp/pip-install-v9p1bP/bitrot/setup.py", line 39, in <module>
        from bitrot import VERSION
      File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 190, in <module>
        class Bitrot(object):
      File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 193, in Bitrot
        chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(),
    AttributeError: 'module' object has no attribute 'cpu_count'
    ----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.

I noticed that bitrot.py used the function os.cpu_count. os.cpu_count only exists in Python 3, resulting in installation failure of bitrot in Python2.

Way to fix:
modify setup() in setup.py, add python_requires keyword argument:

setup(…
     python_requires='>=3',
     …)

Thanks for your attention.
Best regards,
PyVCEchecker

Tests fail

Tests fail in Ubuntu 22.04

$ python --version
Python 3.8.16
$ ./test-bitrot.bats 
 ✓ bitrot command exists 
 ✓ bitrot detects new files in a tree dir 
 ✓ bitrot detects modified files in a tree dir 
 ✓ bitrot detects renamed files in a tree dir 
 ✓ bitrot detects delete files in a tree dir 
 ✓ bitrot detects new files and modified in a tree dir  
 ✗ bitrot detects new files, modified, deleted and moved in a tree dir 
   (in test file test-bitrot.bats, line 115)
     `[[ ${lines[13]}  = " from ./more-files-a.txt to ./more-files-a.txt2" ]]' failed
 ✓ bitrot detects new files, modified, deleted and moved in a tree dir 2 
 ✓ bitrot can operate with 3278 files easily in a dir (1) 
 ✓ bitrot can operate with 3278 files easily in a dir (2) 
 ✗ bitrot can detect rotten bits in a dir (1)
   (in test file test-bitrot.bats, line 191)
     `[[ ${lines[2]}   = "3301 entries in the database, 2 entries new:" ]]' failed
 ✓ bitrot can detect rotten bits in a dir (2) 
 ✓ Clean everything 

13 tests, 2 failures

Python 3.12 deprecation warning for date function

This is a great program, thanks for writing and maintaining it! I was about to write something myself, but was pleased to find an existing, tested tool that does pretty much exactly what I want :)

When I run on Windows with Python 3.12.0 I get the following deprecation warning. It's probably a fairly easy fix, and it's not at all urgent.

bitrot.py:73: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
  return datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S%z')
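
One possible replacement for the deprecated call, as a sketch only. The function name `utc_timestamp` is an assumption (the warning does not show the real name), and `datetime.timezone.utc` is used instead of `datetime.UTC` so the code also runs on Python versions before 3.11.

```python
import datetime


def utc_timestamp():
    """Return the current UTC time, formatted like the line in the warning."""
    now = datetime.datetime.now(datetime.timezone.utc)
    # With a timezone-aware object, %z expands to "+0000"; with the naive
    # object returned by utcnow() it expanded to an empty string.
    return now.strftime('%Y-%m-%d %H:%M:%S%z')


stamp = utc_timestamp()
print(stamp)
```

Note the side effect: the formatted string gains a `+0000` suffix. If stored timestamps must keep the old shape, dropping `%z` from the format string is one way to preserve it.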

Bitrot is not reading full file contents

After trying to figure out why this script was so much faster than expected, I stepped through the code and determined that all files are being opened and read in text mode. This means that the data which is being hashed for any given binary file is randomly truncated after a 0x1A byte, making the entire exercise moot.

This may be fixed by changing the open command to use 'rb' mode.

Unfortunately fixing this means that lots and lots of checksums will become invalid.

Example with added debug output:

dir
39 blah - Copy (2).dat
3 blah - Copy (3) - Copy.dat
3 blah - Copy (3).dat
112 blah - Copy.dat
11 blah.dat

python.exe -m bitrot --verbose
.\blah - Copy (2).dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah - Copy (3) - Copy.dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy (3).dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy.dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f

Note multiple duplicate checksums.

With binary read mode:

python.exe -m bitrot --verbose
File: .\blah - Copy (2).dat; Length: 39; Checksum: ddfb5399fc8f39f26e43f7e3807ae919ee88fe59
File: .\blah - Copy (3) - Copy.dat; Length: 3; Checksum: 4684f40f78d7474c93464241cf4a1ccaa012d7d3
File: .\blah - Copy (3).dat; Length: 3; Checksum: edf5298c70ff205a98c17fd199ddd610e9e2c7c6
File: .\blah - Copy.dat; Length: 112; Checksum: 733fdb8b5cc69814ff448b87af8b02681a749907
File: .\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f
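
A small sketch of the proposed 'rb' fix that also reports the number of bytes read, mirroring the debug output above. `sha1_and_length` is a hypothetical name and the payload is made up for illustration; it contains a 0x1A byte and a CRLF pair, the two things text mode can mangle.

```python
import hashlib
import os
import tempfile


def sha1_and_length(path, chunk_size=16384):
    """Return (bytes_read, hex_sha1) for a file opened in binary mode."""
    digest = hashlib.sha1()
    length = 0
    with open(path, "rb") as f:  # 'rb': no EOF-marker or newline translation
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            length += len(chunk)
            digest.update(chunk)
    return length, digest.hexdigest()


payload = b"abc\x1adef\r\nghi"
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "blah.dat")
    with open(path, "wb") as f:
        f.write(payload)
    length, checksum = sha1_and_length(path)

print(length, checksum)
```

Comparing the reported length with the size from `os.stat` would be a cheap sanity check that the whole file was hashed.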

Last good hash date

Thanks for the software. I had an issue with an external drive when connecting it to a Windows computer. I ran bitrot on the drive, and it did report some hash errors. The date of the 'last good hash checked' on those files was in 2022. But I've run bitrot on that drive since 2022. I would have expected the 'last good hash checked' to be the last time I ran bitrot before the error occurred.

Permission denied.

Not sure why it wouldn't have permission, but bitrot should probably handle permissions errors instead of crashing with a traceback :)

Traceback (most recent call last):
  File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
    execfile(__file__)
  File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
    chunk_size=args.chunk_size,
  File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 125, in run
    st = os.lstat(p)
OSError: [Errno 13] Permission denied: './BitTorrent Sync/Projects/prezto/.git/logs/HEAD'

High Memory Usage

Hi there,

Love this library, just found it and it seems to work exactly as I want, except for one issue. It just ran my box out of memory and caused it to crash.

I'm trying to check a fairly large batch of files (about 3.7TB worth or 1,206,600 files) and bitrot really chews through the RAM (causing my box to crash). All up it seems to need 4.3GB of RAM to run, which does seem like a lot.

I'd prefer not to split my checks into multiple smaller sets if at all possible, but obviously I can't have my system crashing.

Any ideas on what I can do to fix this issue?

My system is:
AMD64
Debian Stretch
Python3.8

Hanging with parallel multi-processor futures

Hi,
Ever since upgrading to 1.0, the program has been hanging when trying to hash around 4 TB on magnetic disks.
I am using -w 1 for only one worker.

I believe the current implementation of futures or the pool executor may be causing a deadlock or some sort of sleep condition.

However, multi-CPU processing is beyond my area of expertise.
Is anyone else seeing hangs when trying to hash many terabytes of data?

If I kill it while it's hung, I get this:

Traceback (most recent call last):
  File "c:\python3\lib\multiprocessing\process.py", line 315, in _bootstrap
    self.run()
  File "c:\python3\lib\multiprocessing\process.py", line 108, in run
    self._target(*self._args, **self._kwargs)
  File "c:\python3\lib\concurrent\futures\process.py", line 233, in _process_worker
    call_item = call_queue.get(block=True)
  File "c:\python3\lib\multiprocessing\queues.py", line 97, in get
    res = self._recv_bytes()
  File "c:\python3\lib\multiprocessing\connection.py", line 216, in recv_bytes
    buf = self._recv_bytes(maxlength)
  File "c:\python3\lib\multiprocessing\connection.py", line 305, in _recv_bytes
    waitres = _winapi.WaitForMultipleObjects(
KeyboardInterrupt
Traceback (most recent call last):
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1731, in <module>
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1725, in run_from_command_line

  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in run
    for future in as_completed(futures):
  File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in <listcomp>
    for future in as_completed(futures):
  File "c:\python3\lib\concurrent\futures\process.py", line 643, in submit
    self._queue_management_thread_wakeup.wakeup()
  File "c:\python3\lib\concurrent\futures\process.py", line 90, in wakeup
    self._writer.send_bytes(b"")
  File "c:\python3\lib\multiprocessing\connection.py", line 200, in send_bytes
    self._send_bytes(m[offset:offset + size])
  File "c:\python3\lib\multiprocessing\connection.py", line 280, in _send_bytes
    ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
KeyboardInterrupt
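
The second traceback shows the interrupt landing in `submit()` inside a list comprehension, which suggests every file is queued before any result is consumed. Whether that causes the hang is not established here. As a sketch of an alternative, the hypothetical helper `run_bounded` below keeps only a fixed number of futures in flight; it is demonstrated with a thread pool for simplicity.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_bounded(executor, func, items, max_in_flight=8):
    """Yield func(item) results, keeping at most max_in_flight futures pending.

    Results are yielded in completion order, not input order.
    """
    pending = set()
    for item in items:
        if len(pending) >= max_in_flight:
            # Block until at least one task finishes before submitting more.
            done, pending = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                yield future.result()
        pending.add(executor.submit(func, item))
    # Drain whatever is still running.
    done, _ = wait(pending)
    for future in done:
        yield future.result()


with ThreadPoolExecutor(max_workers=2) as pool:
    results = sorted(run_bounded(pool, lambda n: n * n, range(10),
                                 max_in_flight=3))
print(results)
```

The same helper accepts a `ProcessPoolExecutor`, but then `func` must be a picklable module-level function rather than a lambda.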

Rename handling logic broken

To be honest, I don't know why this rename handling logic is doing what it's doing, so you're probably the best person to remedy this, since you at least know what it's supposed to be doing. Anyway, I am regularly running into "column path is not unique" errors, which I've reproduced here just based on reading the code:

$ mkdir /tmp/bitrot/

$ cd /tmp/bitrot/

$ echo a > a

$ echo a > b

$ bitrot
Finished. 0.00 MiB of data read. 0 errors found.
2 entries in the database, 2 new, 0 updated, 0 renamed, 0 missing.

$ mv a c

$ mv b d

$ bitrot
 50.0%Traceback (most recent call last):
  File "/home/yang/.virtualenvs/bitrot/bin/bitrot", line 8, in <module>
    execfile(__file__)
  File "/home/yang/bitrot/bin/bitrot", line 30, in <module>
    run_from_command_line()
  File "/home/yang/bitrot/src/bitrot.py", line 265, in run_from_command_line
    chunk_size=args.chunk_size)
  File "/home/yang/bitrot/src/bitrot.py", line 151, in run
    (new_mtime, p_uni, update_ts, new_sha1))
sqlite3.IntegrityError: column path is not unique
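
The repro above involves two files with identical content being renamed in the same run. A sketch of one way to pair old and new paths so that each old path is consumed at most once follows; `match_renames` and its dictionary inputs are assumptions for illustration and do not reflect bitrot's actual data structures.

```python
import hashlib
from collections import defaultdict


def match_renames(stored, current):
    """Pair vanished paths with new paths that share the same hash.

    stored and current map path -> sha1. Each old path is used at most
    once, so identical files renamed together cannot collide.
    Returns (renames, missing, new).
    """
    missing_by_hash = defaultdict(list)
    for path in sorted(stored):
        if path not in current:
            missing_by_hash[stored[path]].append(path)

    renames, new = [], []
    for path in sorted(current):
        if path in stored:
            continue
        candidates = missing_by_hash.get(current[path])
        if candidates:
            renames.append((candidates.pop(0), path))
        else:
            new.append(path)

    missing = [p for paths in missing_by_hash.values() for p in paths]
    return renames, missing, new


h = hashlib.sha1(b"a\n").hexdigest()  # content written by `echo a`
stored = {"./a": h, "./b": h}         # database state after the first run
current = {"./c": h, "./d": h}        # files on disk after `mv a c; mv b d`
renames, missing, new = match_renames(stored, current)
print(renames, missing, new)
```

Because identical files are indistinguishable by hash, the pairing of `./a` with `./c` rather than `./d` is arbitrary; what matters is that no database row is renamed twice.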

Random stalls when running on large directories

bitrot normally shows progress as a running percentage shortly after checking bitrot.db integrity.
This running percentage always appears quickly for relatively small directories.

For large directories like my home directory on macOS 12.6.1, bitrot may or may not show this running percentage. When it does show it, all is well and bitrot executes as expected. When it does not show it (most of the time), bitrot stalls right after integrity checking and may never complete its execution.

I am a newbie in Python, so I cannot readily investigate, though I could help pinpoint the issue if given instructions.
