ambv / bitrot
Detects bit rotten files on the hard drive to save your precious photo and music collection from slow decay.
License: MIT License
I know this is quite a big feature, but it could be nice to have: I have a lot of small backup files in a directory, and bitrot progresses really slowly over them (0.1% in ~20 hours).
One simple implementation would be for bitrot to accept a predefined file list, which could be generated by find or a similar tool with exclusions already applied. I may start working on it if I have the time. :)
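A minimal sketch of how such an option might look. The `--file-list` flag name and `load_file_list` helper are hypothetical, not part of bitrot's actual CLI; the list could be produced by e.g. `find . -type f ! -path './backups/*' > paths.txt`.

```python
import argparse

# Hypothetical --file-list option: read paths from a file instead of
# walking the whole tree (flag name and helper are assumptions).
def load_file_list(path):
    """Read one path per line, skipping blank lines."""
    with open(path) as f:
        return [line.rstrip('\n') for line in f if line.strip()]

parser = argparse.ArgumentParser()
parser.add_argument('--file-list',
                    help='only scan the paths listed in this file')
```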
Several free software photography tools like darktable store all their metadata, and the changes they make to an image, in an xmp file with the same name. The original file is never touched.
Would anybody be interested in giving bitrot a feature where the bitrot information is stored in these same xmp files instead of the SQLite database file?
The main benefit of this is that other apps like darktable or digikam would also be able to access and use the bitrot information from the xmp file. Bitrot could even be integrated into and launched from these apps directly.
Most photographers don't store checksums of their photos, but just like everybody else, they suffer the consequences when bit rot happens. This way, as soon as bit rot occurs it can be detected, and the photographer can delete the corrupt file and restore a backup of just that file.
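A rough sketch of what recording the checksum in a sidecar could look like. The `bitrot` namespace URI here is invented for illustration; real XMP handling would go through a proper XMP library rather than bare ElementTree.

```python
import xml.etree.ElementTree as ET

# NS is an invented namespace URI for illustration only
NS = 'https://example.org/bitrot/1.0/'
RDF = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'

def write_checksum(xmp_path, sha1_hex):
    """Record a checksum as an attribute on the sidecar's rdf:Description."""
    ET.register_namespace('bitrot', NS)
    tree = ET.parse(xmp_path)
    desc = tree.getroot().find('.//{%s}Description' % RDF)
    desc.set('{%s}sha1' % NS, sha1_hex)
    tree.write(xmp_path, xml_declaration=True, encoding='utf-8')
```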
Might this be some kind of race condition when run over folders where files are actively and rapidly changing?
Traceback (most recent call last):
File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
execfile(__file__)
File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
chunk_size=args.chunk_size,
File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 138, in run
st = os.stat(p)
OSError: [Errno 2] No such file or directory: './.dropbox/instance1/filecache.dbx-journal'
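The stat call could be made defensive against files vanishing between the directory walk and the stat. A sketch (hypothetical helper, not bitrot's actual code):

```python
import os

def safe_stat(path):
    """Return the os.stat() result, or None if the file vanished mid-scan."""
    try:
        return os.stat(path)
    except FileNotFoundError:
        # the file was deleted between listing and stat -- skip it
        return None
```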
Hi,
I've noticed that runtime scales with the number of files. I'm trying to use it on a large number of files and it takes very long.
cd /tmp; mkdir more ; cd more/
# create a 320 KB file
dd if=/dev/zero of=masterfile bs=1 count=327680
# split it into 32768 files (instantly); + masterfile = 32769 files
split -b 10 -a 10 masterfile
# wait a very long time...
bitrot -v
I suppose it could be made to work with threads and improved a lot; calculating the sha1 and inserting into sqlite single-threaded for that many files is too old school for nowadays. ;)
I'm using an Intel i7, so I have plenty of spare threads to burn. I reckon something like a central buffer/MQ/DB where files to be hashed are queued, plus n threads to calculate and insert/update them (or just one extra thread dedicated to sqlite), could work: they collect files from the central buffer, n at a time. Sounds like a cool project. ;)
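The idea above can be sketched with a thread pool. The worker count and chunk size are illustrative; hashlib releases the GIL during `update()`, so plain threads already overlap disk reads with hashing:

```python
import hashlib
from concurrent.futures import ThreadPoolExecutor

def sha1_of(path, chunk_size=1 << 16):
    """Hash one file; returns (path, hexdigest)."""
    digest = hashlib.sha1()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return path, digest.hexdigest()

def hash_all(paths, workers=4):
    # hashlib releases the GIL while digesting, so threads overlap I/O
    # and hashing; results could then be batched into sqlite from a
    # single writer thread.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return dict(pool.map(sha1_of, paths))
```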
I'm using it for this tool. ;)
https://github.com/liloman/heal-bitrots
Cheers!
bitrot fails completely if it encounters a file it can't read.
File "/usr/local/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 247, in run_from_command_line
run(verbosity=verbosity, test=args.test)
File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 120, in run
new_sha1 = sha1(p)
File "/usr/local/lib/python2.7/site-packages/bitrot.py", line 48, in sha1
with open(path) as f:
IOError: [Errno 13] Permission denied: './file'
I think it should handle this more gracefully: either skip the file (with or without logging) or abort cleanly. Any opinions on this?
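The skip-with-logging option might look like this (a sketch; `hash_or_skip` is a hypothetical helper, not bitrot's API):

```python
import logging

log = logging.getLogger('bitrot')

def hash_or_skip(path, hasher):
    """Apply hasher(path), or log and skip the file if it cannot be read."""
    try:
        return hasher(path)
    except OSError as e:
        # EACCES, ENOENT, etc.: warn and move on instead of aborting the run
        log.warning('skipping %s: %s', path, e)
        return None
```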
Currently, the keyword argument python_requires of setup() is not set, and thus it is assumed that this distribution is compatible with all Python versions.
However, I found it is not compatible with Python 2. My local Python version is 2.7, and I encounter the following error when executing "pip install bitrot":
Collecting bitrot
Downloading bitrot-1.0.0.tar.gz (11 kB)
ERROR: Command errored out with exit status 1:
command: /usr/local/bin/python -c 'import sys, setuptools, tokenize; sys.argv[0] = '"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"'; __file__='"'"'/tmp/pip-install-v9p1bP/bitrot/setup.py'"'"';f=getattr(tokenize, '"'"'open'"'"', open)(__file__);code=f.read().replace('"'"'\r\n'"'"', '"'"'\n'"'"');f.close();exec(compile(code, __file__, '"'"'exec'"'"'))' egg_info --egg-base /tmp/pip-install-v9p1bP/bitrot/pip-egg-info
cwd: /tmp/pip-install-v9p1bP/bitrot/
Complete output (9 lines):
Traceback (most recent call last):
File "<string>", line 1, in <module>
File "/tmp/pip-install-v9p1bP/bitrot/setup.py", line 39, in <module>
from bitrot import VERSION
File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 190, in <module>
class Bitrot(object):
File "/tmp/pip-install-v9p1bP/bitrot/src/bitrot.py", line 193, in Bitrot
chunk_size=DEFAULT_CHUNK_SIZE, workers=os.cpu_count(),
AttributeError: 'module' object has no attribute 'cpu_count'
----------------------------------------
ERROR: Command errored out with exit status 1: python setup.py egg_info Check the logs for full command output.
I noticed that bitrot.py uses the function os.cpu_count. os.cpu_count only exists in Python 3, resulting in the installation failure of bitrot under Python 2.
Way to fix: modify setup() in setup.py to add the python_requires keyword argument:
setup(…
    python_requires='>=3',
…)
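Alternatively, if keeping Python 2 compatibility were the goal, a portable fallback sketch:

```python
import multiprocessing
import os

def default_workers():
    # os.cpu_count() is Python 3 only; fall back to
    # multiprocessing.cpu_count() on Python 2. os.cpu_count() may also
    # return None on exotic platforms, hence the `or 1`.
    count = getattr(os, 'cpu_count', multiprocessing.cpu_count)()
    return count or 1
```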
Thanks for your attention.
Best regards,
PyVCEchecker
Tests fail in Ubuntu 22.04
$ python --version
Python 3.8.16
$ ./test-bitrot.bats
✓ bitrot command exists
✓ bitrot detects new files in a tree dir
✓ bitrot detects modified files in a tree dir
✓ bitrot detects renamed files in a tree dir
✓ bitrot detects delete files in a tree dir
✓ bitrot detects new files and modified in a tree dir
✗ bitrot detects new files, modified, deleted and moved in a tree dir
(in test file test-bitrot.bats, line 115)
`[[ ${lines[13]} = " from ./more-files-a.txt to ./more-files-a.txt2" ]]' failed
✓ bitrot detects new files, modified, deleted and moved in a tree dir 2
✓ bitrot can operate with 3278 files easily in a dir (1)
✓ bitrot can operate with 3278 files easily in a dir (2)
✗ bitrot can detect rotten bits in a dir (1)
(in test file test-bitrot.bats, line 191)
`[[ ${lines[2]} = "3301 entries in the database, 2 entries new:" ]]' failed
✓ bitrot can detect rotten bits in a dir (2)
✓ Clean everything
13 tests, 2 failures
This is a great program, thanks for writing and maintaining it! I was about to write something myself, but was pleased to find an existing, tested tool that does pretty much exactly what I want :)
When I run it on Windows with Python 3.12.0 I get the following deprecation warning. It's probably a fairly easy fix, and it's not at all urgent.
bitrot.py:73: DeprecationWarning: datetime.datetime.utcnow() is deprecated and scheduled for removal in a future version. Use timezone-aware objects to represent datetimes in UTC: datetime.datetime.now(datetime.UTC).
return datetime.datetime.utcnow().strftime('%Y-%m-%d %H:%M:%S%z')
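The warning suggests its own fix; a timezone-aware sketch of that timestamp helper:

```python
import datetime

def ts():
    # timezone-aware replacement for the deprecated utcnow(); %z now
    # renders as +0000 because the object carries UTC tzinfo
    return datetime.datetime.now(datetime.timezone.utc).strftime(
        '%Y-%m-%d %H:%M:%S%z')
```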
It seems that the need for external state storage could be reduced if the most recent scan date, as well as the hash, were kept in xattrs.
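A sketch of the idea. The `user.bitrot.*` attribute names are invented for illustration; `os.setxattr`/`os.getxattr` are Linux-only, and not every filesystem supports `user.*` attributes, so the helpers degrade gracefully:

```python
import os

# Hypothetical attribute names, chosen for this sketch only
ATTR_SHA1 = 'user.bitrot.sha1'
ATTR_CHECKED = 'user.bitrot.checked'

def store(path, sha1_hex, checked_at):
    """Record hash and scan date in xattrs; False if unsupported."""
    if not hasattr(os, 'setxattr'):
        return False  # non-Linux platform
    try:
        os.setxattr(path, ATTR_SHA1, sha1_hex.encode())
        os.setxattr(path, ATTR_CHECKED, checked_at.encode())
        return True
    except OSError:
        return False  # filesystem without xattr support

def load(path):
    """Return (sha1, checked_at) or None if absent/unsupported."""
    if not hasattr(os, 'getxattr'):
        return None
    try:
        return (os.getxattr(path, ATTR_SHA1).decode(),
                os.getxattr(path, ATTR_CHECKED).decode())
    except OSError:
        return None
```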
After trying to figure out why this script was so much faster than expected, I stepped through the code and determined that all files are being opened and read in text mode. This means the data being hashed for any given binary file is randomly truncated after a 0x1A byte, making the entire exercise moot.
This can be fixed by changing the open() call to use 'rb' mode.
Unfortunately, fixing this means that lots and lots of checksums will become invalid.
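The essential change is opening in binary mode. A sketch of the fixed hashing helper:

```python
import hashlib

def sha1_file(path, chunk_size=16 * 1024):
    digest = hashlib.sha1()
    # 'rb' is the crucial part: on Windows, text mode treats 0x1A (Ctrl-Z)
    # as end-of-file and silently truncates binary data.
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()
```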
Example with added debug output:
dir
 39 blah - Copy (2).dat
  3 blah - Copy (3) - Copy.dat
  3 blah - Copy (3).dat
112 blah - Copy.dat
 11 blah.dat
python.exe -m bitrot --verbose
.\blah - Copy (2).dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah - Copy (3) - Copy.dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy (3).dat; Length: 1; Checksum: 5ba93c9db0cff93f52b521d7420e43f6eda2784f
.\blah - Copy.dat; Length: 0; Checksum: da39a3ee5e6b4b0d3255bfef95601890afd80709
.\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f
Note multiple duplicate checksums.
With binary read mode:
python.exe -m bitrot --verbose
File: .\blah - Copy (2).dat; Length: 39; Checksum: ddfb5399fc8f39f26e43f7e3807ae919ee88fe59
File: .\blah - Copy (3) - Copy.dat; Length: 3; Checksum: 4684f40f78d7474c93464241cf4a1ccaa012d7d3
File: .\blah - Copy (3).dat; Length: 3; Checksum: edf5298c70ff205a98c17fd199ddd610e9e2c7c6
File: .\blah - Copy.dat; Length: 112; Checksum: 733fdb8b5cc69814ff448b87af8b02681a749907
File: .\blah.dat; Length: 11; Checksum: 3a4d8abb5811f6b58b9755ca65ffc01d38f9153f
Thanks for the software. I had an issue with an external drive when connecting it to a Windows computer. I ran bitrot on the drive, and it did report some hash errors. The date of the 'last good hash checked' on those files was in 2022, but I've run bitrot on that drive since 2022. I would have expected the 'last good hash checked' to be the last time I ran bitrot before the error occurred?
Not sure why it wouldn't have permission, but bitrot should probably handle permission errors instead of crashing with a traceback :)
Traceback (most recent call last):
File "/Users/tailee/.virtualenvs/bitrot/bin/bitrot", line 10, in <module>
execfile(__file__)
File "/Users/tailee/Projects/bitrot/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 323, in run_from_command_line
chunk_size=args.chunk_size,
File "/Users/tailee/Projects/bitrot/src/bitrot.py", line 125, in run
st = os.lstat(p)
OSError: [Errno 13] Permission denied: './BitTorrent Sync/Projects/prezto/.git/logs/HEAD'
There was an exclusion list in one of the forks! Could you merge it into master?
https://github.com/benshep/bitrot
Thanks!
--quiet is being ignored for integrity-checking messages:
Checking bitrot.db integrity... ok.
Updating bitrot.sha512... done.
Hi there,
Love this library; I just found it and it seems to work exactly as I want, except for one issue: it just ran my box out of memory and caused it to crash.
I'm trying to check a fairly large batch of files (about 3.7 TB, or 1,206,600 files) and bitrot really chews through the RAM, crashing my box. In all it seems to need 4.3 GB of RAM to run, which does seem like a lot.
I'd prefer not to split my checks into multiple smaller sets if at all possible, but obviously I can't have my system crashing.
Any ideas on what I can do to fix this issue?
My system is:
AMD64
Debian Stretch
Python3.8
Maybe add a license file to the repo.
Hi,
Ever since upgrading to 1.0, I have been seeing the program hang when trying to hash around 4 TB on magnetic disks.
I am using -w 1 for only one worker.
I believe the current implementation of futures or the pool executor may be causing a deadlock or some sort of sleep condition. However, multi-CPU processing is beyond my area of expertise.
Is anyone else seeing hangs when trying to hash many terabytes of data?
If I kill it while it's hung, I get this:
Traceback (most recent call last):
File "c:\python3\lib\multiprocessing\process.py", line 315, in _bootstrap
self.run()
File "c:\python3\lib\multiprocessing\process.py", line 108, in run
self._target(*self._args, **self._kwargs)
File "c:\python3\lib\concurrent\futures\process.py", line 233, in _process_worker
call_item = call_queue.get(block=True)
File "c:\python3\lib\multiprocessing\queues.py", line 97, in get
res = self._recv_bytes()
File "c:\python3\lib\multiprocessing\connection.py", line 216, in recv_bytes
buf = self._recv_bytes(maxlength)
File "c:\python3\lib\multiprocessing\connection.py", line 305, in _recv_bytes
waitres = _winapi.WaitForMultipleObjects(
KeyboardInterrupt
Traceback (most recent call last):
File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1731, in <module>
File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 1725, in run_from_command_line
File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in run
for future in as_completed(futures):
File "C:\Startup\scripts\Bitrot\src\bitrot.py", line 810, in <listcomp>
for future in as_completed(futures):
File "c:\python3\lib\concurrent\futures\process.py", line 643, in submit
self._queue_management_thread_wakeup.wakeup()
File "c:\python3\lib\concurrent\futures\process.py", line 90, in wakeup
self._writer.send_bytes(b"")
File "c:\python3\lib\multiprocessing\connection.py", line 200, in send_bytes
self._send_bytes(m[offset:offset + size])
File "c:\python3\lib\multiprocessing\connection.py", line 280, in _send_bytes
ov, err = _winapi.WriteFile(self._handle, buf, overlapped=True)
KeyboardInterrupt
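The `<listcomp>` frame inside the `as_completed` loop in the traceback hints that work may be submitted while results are being consumed. A sketch of the safer submit-everything-first shape (shown with threads so it runs anywhere; the issue above concerns ProcessPoolExecutor, where the same pattern applies):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed

def run_all(work, items, workers=1):
    # Submit every task up front, then consume results; submitting new
    # work from inside the as_completed loop is a classic way to wedge
    # an executor.
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(work, item) for item in items]
        return [f.result() for f in as_completed(futures)]
```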
To be honest, I don't know why this rename-handling logic does what it does, so you're probably the best person to remedy this, since you at least know what it's supposed to do. Anyway, I regularly run into column path is not unique errors, which I've reproduced here just from reading the code:
$ mkdir /tmp/bitrot/
$ cd /tmp/bitrot/
$ echo a > a
$ echo a > b
$ bitrot
Finished. 0.00 MiB of data read. 0 errors found.
2 entries in the database, 2 new, 0 updated, 0 renamed, 0 missing.
$ mv a c
$ mv b d
$ bitrot
50.0%Traceback (most recent call last):
File "/home/yang/.virtualenvs/bitrot/bin/bitrot", line 8, in <module>
execfile(__file__)
File "/home/yang/bitrot/bin/bitrot", line 30, in <module>
run_from_command_line()
File "/home/yang/bitrot/src/bitrot.py", line 265, in run_from_command_line
chunk_size=args.chunk_size)
File "/home/yang/bitrot/src/bitrot.py", line 151, in run
(new_mtime, p_uni, update_ts, new_sha1))
sqlite3.IntegrityError: column path is not unique
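The repro above renames two files with identical content, so any rename detection keyed only on the hash maps both old rows to the same new path. One way the collision could be avoided is to pair missing paths with new paths sharing a hash strictly one-to-one (a hypothetical sketch, not the current implementation):

```python
def match_renames(missing, new):
    """Pair each missing path with at most one unclaimed new path of the
    same hash. Both arguments map path -> hash digest."""
    by_hash = {}
    for path, digest in new.items():
        by_hash.setdefault(digest, []).append(path)
    renames = []
    for old, digest in sorted(missing.items()):
        candidates = by_hash.get(digest)
        if candidates:
            # pop() claims the candidate so no two olds share a target
            renames.append((old, candidates.pop()))
    return renames
```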
bitrot normally shows progress as a running percentage shortly after checking bitrot.db integrity.
This running percentage always appears quickly for relatively small directories.
For large directories, like my home directory on macOS 12.6.1, bitrot may or may not show this running percentage. When it does show it, all is well and bitrot executes as expected. When it does not (most of the time), bitrot stalls right after integrity checking and may never complete its execution.
I am a newbie in Python, so I cannot readily investigate, though I could help pinpoint the issue given instructions.