libraryofcongress / bagit-python
Work with BagIt packages from Python.
Home Page: http://libraryofcongress.github.io/bagit-python
It would be nice if the default Bag-Software-Agent had a version in it. Currently it looks like:
bagit.py <http://github.com/libraryofcongress/bagit-python>
But maybe something like this would be better, for discovering bags created with (gasp) a version of bagit.py that has a bug?
bagit.py v1.5.3 <http://github.com/libraryofcongress/bagit-python>
Currently, if you pass a metadata string containing newlines to make_bag, those newlines end up in the bag-info.txt file and it can no longer be parsed. The immediate validation in _parse_tags then crashes, specifically due to the line
tag_value = parts[1].strip()
which assumes that the line.strip().split(':') call must have returned two parts (the actual crash is a list index out of range exception).
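The crash can be reproduced outside bagit.py. The following is a minimal sketch, not bagit.py's actual code: parse_tag_line, sanitize_tag_value and TagFileError are hypothetical stand-ins that show the unguarded index access being replaced by an explicit check, plus one way to keep CR/LF out of a value before it is written.

```python
class TagFileError(ValueError):
    """Hypothetical higher-level error for an unparseable tag file line."""


def sanitize_tag_value(value):
    # Collapse any CR, LF or CRLF into a single space so the value stays on one line.
    return " ".join(value.splitlines())


def parse_tag_line(line):
    # Split on the first colon only and check the result instead of indexing blindly.
    parts = line.strip().split(":", 1)
    if len(parts) != 2:
        raise TagFileError("cannot parse tag line: %r" % line)
    return parts[0].strip(), parts[1].strip()


if __name__ == "__main__":
    print(parse_tag_line("Contact-Name: Test Author"))
    print(sanitize_tag_value("first line\r\nsecond line"))
    try:
        # This is what the second half of a value containing a newline looks like.
        parse_tag_line("second line")
    except TagFileError as exc:
        print("rejected:", exc)
```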
So, two things are probably in order:
- make_bag should most likely strip out CR/LF characters, or otherwise reject inputs containing them
- _parse_tags should probably catch IndexError when trying to access parts[1], and be able to raise a higher-level exception indicating that the file cannot be parsed
This generates a deprecation warning every time Bag._open is called. The deprecation warning is not emitted on 2.x, but is emitted on 3.x.
See https://gist.github.com/mikedarcy/1c28a4a2acb7c7d4914a9924b49f47be for context.
I'm using Ubuntu 14.04, and the bench.py script seems to fail at the NASA download phase:
:~/bagit-python$ ./bench.py
fetching some images to bag up from nasa
Traceback (most recent call last):
File "./bench.py", line 24, in <module>
ftp.cwd('/photo_gallery/hi-res/planetary/mars/')
File "/usr/lib/python2.7/ftplib.py", line 562, in cwd
return self.voidcmd(cmd)
File "/usr/lib/python2.7/ftplib.py", line 254, in voidcmd
return self.voidresp()
File "/usr/lib/python2.7/ftplib.py", line 229, in voidresp
resp = self.getresp()
File "/usr/lib/python2.7/ftplib.py", line 224, in getresp
raise error_perm, resp
ftplib.error_perm: 550 /photo_gallery/hi-res/planetary/mars/: No such file or directory
bagit.py uses tempfile.mkdtemp() to generate the payload dir, which by definition limits to user-only read/write/search perms:
https://docs.python.org/2/library/tempfile.html#tempfile.mkdtemp
We work around this outside of bagit.py, but it might be worth considering reassigning more liberal perms to the payload upon completion, or at least optionally per user request on the bagit.py commandline (--group-readable or --perms 0755 or somesuch?).
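A workaround of this kind can be sketched with the standard library alone. This is an assumption about what such a post-processing step might look like, not an existing bagit.py option; the function name and the default modes are illustrative.

```python
import os
import stat


def make_group_readable(root, dir_mode=0o755, file_mode=0o644):
    """Apply more liberal permissions to a directory tree after bagging."""
    for dirpath, dirnames, filenames in os.walk(root):
        # Directories need the search (execute) bit so others can traverse them.
        os.chmod(dirpath, dir_mode)
        for name in filenames:
            os.chmod(os.path.join(dirpath, name), file_mode)


if __name__ == "__main__":
    import tempfile

    payload = tempfile.mkdtemp()  # created with user-only permissions
    make_group_readable(payload)
    print(oct(stat.S_IMODE(os.stat(payload).st_mode)))
```

The modes only have this meaning on POSIX systems; on Windows os.chmod can do little more than toggle the read-only flag.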
It would be nice if things like self.version > (1, ) worked.
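One way to get there, assuming the version is stored as a dotted string such as "0.97", is to convert it to a tuple of integers before comparing. parse_version is a hypothetical helper for illustration.

```python
def parse_version(version):
    """Turn a dotted version string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


if __name__ == "__main__":
    print(parse_version("0.97") > (1,))
    print(parse_version("1.0") > (1,))
```

Note that "0.97" becomes (0, 97), so this treats the part after the dot as an integer rather than a decimal fraction.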
Because the sha1 tagmanifest is generated for a bag-info.txt that has a date in it, this test only succeeds on the day of that date.
Scenario:
$ bagit.py --validate plant
2016-12-14 10:38:36,662 - INFO - plant is invalid: Missing manifest file
$ cd plant && mv manifest-SHA1.txt manifest-sha1.txt && cd ..
$ bagit.py --validate plant
2016-12-14 11:05:31,219 - INFO - plant is valid
This is happening if a bag is created with the bagit-java library and validated with this library.
Might be somewhat related to #81:
Implementations SHOULD issue a warning when multiple manifests are present which differ only in case or normalization form.
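A case-insensitive manifest lookup that also reports names differing only in case could look roughly like the sketch below. The helper names are hypothetical and this is not how bagit.py currently locates manifests.

```python
import os
import re
from collections import defaultdict

MANIFEST_RE = re.compile(r"^manifest-([a-z0-9]+)\.txt$", re.IGNORECASE)


def find_manifests(bag_dir):
    """Map each lower-cased algorithm name to the manifest file names found."""
    found = defaultdict(list)
    for name in sorted(os.listdir(bag_dir)):
        match = MANIFEST_RE.match(name)
        if match:
            found[match.group(1).lower()].append(name)
    return dict(found)


def case_collisions(found):
    """Return the algorithms that have more than one manifest file."""
    return {alg: names for alg, names in found.items() if len(names) > 1}
```

With this, manifest-SHA1.txt written by another tool is found under the key "sha1", and a bag containing both spellings can be flagged with a warning instead of being silently misread.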
Since it's cheap enough on computers made in the last decade, we should either change the default to SHA-256 or use more than one algorithm if there's any concern about compatibility with very old workflows.
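Computing more than one digest does not require reading each file twice. A minimal sketch using hashlib, with a hypothetical function name and an arbitrary default pair of algorithms:

```python
import hashlib


def multi_digest(path, algorithms=("sha256", "sha512"), chunk_size=1024 * 1024):
    """Compute several digests of one file in a single pass over its bytes."""
    hashers = {alg: hashlib.new(alg) for alg in algorithms}
    with open(path, "rb") as f:
        # Read in chunks so large payload files do not have to fit in memory.
        for chunk in iter(lambda: f.read(chunk_size), b""):
            for hasher in hashers.values():
                hasher.update(chunk)
    return {alg: hasher.hexdigest() for alg, hasher in hashers.items()}
```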
Hi,
I'd like to use bagit for archiving documents with quite complex metadata (an example is pasted below). Running make_bag as in the tutorial results in parsing errors, because the _make_tag_file function seems to assume that every value in the input dict is either a string or a list of strings.
I am not sure how this can be resolved: whether you expect this strict format on purpose, or whether the bagit-python code can be improved upon. I also didn't find anything specific in the bagit spec for this.
At first I thought this would be an easy transformation just flattening out all the nested dictionaries, then I discovered that there are also dictionaries inside lists, and that different data types (string, int, None) are used, which also wouldn't work in the present _make_tag_file function (the regex in line 720).
So what do you think: should the conversion be up to the client, should it be done inside bagit-python, or a mix of those?
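If the conversion is left to the client, one possible approach is to flatten the nested structure into string values before calling make_bag. This is only a sketch: the key-joining scheme (parent and child joined with a hyphen) is an arbitrary assumption, None values are dropped, and the result is lossy because list positions and types are not preserved.

```python
def flatten_metadata(value, prefix=""):
    """Yield (key, text) pairs from nested dicts, lists and scalars."""
    if isinstance(value, dict):
        for key in sorted(value):
            child = "%s-%s" % (prefix, key) if prefix else str(key)
            yield from flatten_metadata(value[key], child)
    elif isinstance(value, (list, tuple)):
        for element in value:
            yield from flatten_metadata(element, prefix)
    elif value is not None:
        yield prefix, str(value)


def to_bag_info(metadata):
    """Build a dict whose values are lists of strings."""
    info = {}
    for key, text in flatten_metadata(metadata):
        info.setdefault(key, []).append(text)
    return info


if __name__ == "__main__":
    record = {"recid": 1, "restriction": {"email": "blub@zoink"}, "version_history": None}
    print(to_bag_info(record))
```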
rd = {u'restriction': {u'email': u'blub@zoink'}, u'recid': 1, u'version_history': None, u'modification_date': u'2014-08-28T15:55:39', u'_id': 1, u'__meta_metadata__': {u'__additional_info__': {u'master_format': u'marc', u'namespace': u'recordext'}, u'__continuable_errors__': [], u'__aliases__': {}, u'__model_info__': {u'names': [u'__default__']}, u'modification_date': {u'function': [u'modification_date', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.898010', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': [u'modification_date', u'json_ext', u'dumps'], u'loads': [u'modification_date', u'json_ext', u'loads']}}, u'json_id': u'modification_date', u'type': u'derived'}, u'version_history': {u'function': [u'version_history', u'rules', u'calculated', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.861323', u'after': {u'memoize': 0}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'version_history', u'type': u'calculated'}, u'restriction': {u'function': u'UNKNOWN', u'timestamp': u'2014-08-28T15:56:02.620112', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'restriction', u'type': u'UNKNOWN'}, u'__errors__': [u"Rule Error - Unable to apply rule for virtual field 'newer_version'. 
\n"], u'creation_date': {u'function': [u'creation_date', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.898809', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': [u'creation_date', u'json_ext', u'dumps'], u'loads': [u'creation_date', u'json_ext', u'loads']}}, u'json_id': u'creation_date', u'type': u'derived'}, u'recid': {u'function': [u'001'], u'timestamp': u'2014-08-28T15:56:02.683490', u'after': {u'connect': [{u'connected_field': u'_id', u'update_function': None}]}, u'pid': 0, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'recid', u'type': u'creator'}, u'_id': {u'function': [u'_id', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.853278', u'after': {u'connect': [{u'connected_field': u'recid', u'update_function': None}]}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'_id', u'type': u'derived'}}, u'creation_date': u'2014-08-28T15:55:22'}
We should apply the tests specified in the spec section on Special directory characters to ensure that manifest or fetch.txt entries which attempt to escape the bag location cause an error:
../
/
~root/
C:\
\\?\C:\
Parallel issue for bagit-java: LibraryOfCongress/bagit-java#67
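A check along these lines could be sketched as follows. is_unsafe_path is a hypothetical helper, and it is deliberately conservative: it treats backslashes as separators on every platform, which would also reject a legitimate POSIX file name that contains a backslash.

```python
import re

_DRIVE_RE = re.compile(r"^[A-Za-z]:")


def is_unsafe_path(path):
    """Return True if a manifest or fetch.txt path could escape the bag."""
    normalized = path.replace("\\", "/")
    if normalized.startswith("/"):
        # Absolute POSIX paths, UNC paths and the \\?\ prefix all end up here.
        return True
    if normalized.startswith("~"):
        # Home directory expansion such as ~root/
        return True
    if _DRIVE_RE.match(normalized):
        # Windows drive letters such as C:\
        return True
    # Any parent-directory component, wherever it appears in the path.
    return ".." in normalized.split("/")
```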
Yesterday I created a very large bag (approximately 82,000 files; 300 GB) using bagit.py on my Mac. Afterwards, the bag validated without issue.
Today, I copied the bag to our Archivematica transfer server and ran bagit.py --validate on the bag. This resulted in many errors, seemingly related to diacritics/character encoding issues. Some of the sample warnings:
2016-01-20 09:34:44,183 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cerámica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists on filesystem but is not in manifest
2016-01-20 09:34:43,267 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cera ́mica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists in manifest but not found on filesystem
2016-01-20 09:34:43,303 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfilieri ́a 3D.dgn exists in manifest but not found on filesystem
2016-01-20 09:34:44,261 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn exists on filesystem but is not in manifest
When I look at the manifest via cat in a bash terminal, the paths appear exactly as they exist on the filesystem.
Thanks!
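The paired warnings above are consistent with the same name being stored in two Unicode normalization forms, which look identical in a terminal but compare as different strings. A small illustration, assuming the manifest holds the decomposed form and the filesystem reports the composed one (the path is shortened for the example):

```python
import unicodedata

# 'a' followed by a combining acute accent (decomposed, NFD)
manifest_path = "data/cera\u0301mica/file.pdf"
# a single precomposed character (composed, NFC)
filesystem_path = "data/cer\u00e1mica/file.pdf"


def same_path(a, b):
    """Compare two paths after normalizing both to NFC."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


raw_equal = manifest_path == filesystem_path
normalized_equal = same_path(manifest_path, filesystem_path)
print(raw_equal, normalized_equal)
```

Normalizing both sides before comparing is one possible fix; whether validation should do so is a design decision for the library.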
It might be nice to have python 3 support, preferably implemented such that bagit.py works under both, rather than having separate code bases.
It would be very useful to support the ability to pass an argument specifying which hash algorithm to use (i.e. sha1 instead of md5 – or both even).
In addition to validating local files, bagit should also be able to validate bags in an S3 bucket.
Any chance to add the latest release to pypi? Thanks.
add split bag by size like the java version of the library supports as requested by LibraryOfCongress/bagit-java#47
When creating a new bag, make_bag() takes the bag directory as argument. It assumes the current content of this bag directory to be the payload. It creates a new subdirectory data and moves all content of the bag directory into this new subdirectory.
It should be possible to create a bag without moving the payload, given that it is already in a suitable data subdirectory. This is particularly needed when only read access to the payload can be granted.
I submitted a pull request #67 to add such an option, but I got no response. Maybe it has not been noticed. So I submit this Issue to draw your attention.
See PR #103 for full description.
Right now bagit.py allows for Oxum validation or Oxum/completeness/hash validation. I have frequent use cases where I just need to check the completeness of a bag.
Currently I'm running a local fork of bagit-python that separates the completeness checking portion of the script and exposes it as its own option. There is a problem with how to expose that option since bagit.py --validate --complete gives a false impression of what the validation process is doing.
However, besides that problem, addressing this issue only requires separating Bag._validate_entries into two separate methods, updating the logic in Bag._validate_contents, and passing the appropriate argument flag.
Currently the module only allows one to do what the LOC Java library calls "bag in place". It would be very useful to have built-in the ability to specify one or more payloads as the "source" and to then specify a "destination" where the bag containing the payloads will be created.
Minor but important note – the hashes in the manifest should be generated from the source payloads, not the copied files in the bag.
Is there any reason that the only supported versions are 0.95, 0.96, and 0.97? Would anything bad happen if 0.93 and 0.94 were added to the list of supported versions in bagit.py?
I had a few bags created using bagit version 1.2.1. We are now using bagit 1.3.4 which creates the additional tagmanifest-md5.txt file.
The previously created bags are now failing validation because bagit is looking for a tagmanifest-md5.txt file and can't find one. I checked the source code and found that if the bagit version is >0.97 then it looks for a tagmanifest-md5.txt file while validating.
Python 2 will be officially unsupported at the end of the year.
FIXME comment containing “Python 2”
The function _find_tag_files() that selects files to be added to tagmanifest files, as added by PR #69, is broken. The intention of this function was to select all files in the bag directory, excluding only the payload directory and the tagmanifest files. What the logic in this function actually does is select all files, excluding files in any directory whose name ends with "data". This is broken in two different ways:
- If the name of the bag directory itself ends with "data", all files in this bag directory are excluded, although bag-info.txt, bagit.txt, and manifest-*.txt should in particular be added.
- If the payload contains subdirectories whose names do not end with "data", files in these subdirectories are selected for inclusion in the tagmanifest files, although these files, being part of the payload, should not be added.
This bug has been discovered by Kieran O'Leary in the discussion of PR #67.
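The intended selection, as described in this issue, can be sketched with os.walk. This is an illustration of the stated intention, not the fix that was merged; find_tag_files is a stand-in name.

```python
import os


def find_tag_files(bag_dir):
    """Yield tag files relative to bag_dir, skipping payload and tagmanifests."""
    for dirpath, dirnames, filenames in os.walk(bag_dir):
        at_top = dirpath == bag_dir
        if at_top:
            # Prune only the top-level payload directory, whatever the bag is called.
            dirnames[:] = [d for d in dirnames if d != "data"]
        for name in filenames:
            if at_top and name.startswith("tagmanifest-"):
                continue
            yield os.path.relpath(os.path.join(dirpath, name), bag_dir)
```

Because the comparison is made against the top-level directory only, a bag directory whose own name ends in "data" is handled like any other, and nothing below the payload directory is visited.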
It's been about a year since 1.5.4, and I'd love to see a release on PyPI with some of the bug fixes and Python 3.5 support.
Would be nice to include (perhaps as default) a tag manifest in addition to the manifest…
I see that v1.5.3 never made it to PyPI and was thinking perhaps now would be a good time to coordinate a Github and PyPI release for v1.5.4? Also, would anyone at LoC like to have permission to upload to PyPI for bagit-python?
I'm running bagit.py on an external bag in-place bag. This is a multicore machine.
bagit.py --validate --processes=4 F:\PATH_TO_BAG\
The error output is this
Traceback (most recent call last):
File "c:/Python27/Scripts/bagit.py", line 525, in _validate_entries
pool = multiprocessing.Pool(processes if processes else None, _init_worker)
File "c:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "c:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
self._repopulate_pool()
File "c:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
w.start()
File "c:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "c:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "c:\Python27\lib\multiprocessing\forking.py", line 199, in dump
ForkingPickler(file, protocol).dump(obj)
File "c:\Python27\lib\pickle.py", line 224, in dump
self.save(obj)
File "c:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "c:\Python27\lib\pickle.py", line 419, in save_reduce
save(state)
File "c:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "c:\Python27\lib\pickle.py", line 649, in save_dict
self._batch_setitems(obj.iteritems())
File "c:\Python27\lib\pickle.py", line 681, in _batch_setitems
save(v)
File "c:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "c:\Python27\lib\pickle.py", line 562, in save_tuple
save(element)
File "c:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "c:\Python27\lib\pickle.py", line 748, in save_global
(obj, module, name))
PicklingError: Can't pickle <function _init_worker at 0x0239F8F0>: it's not found as __main__._init_worker
I apologize if this has been addressed before, but I couldn't find anything anywhere about it specifically.
Thanks
@darshanrp and @Hwesta would you be willing to take a look at the new Bag.save method in v1.5.0 on master and see that it works the way you expect?
If you have a bagit.Bag instance named bag you should be able to update the bag.info dictionary to change the metadata and then call bag.save() to persist the change.
By default save will not regenerate manifests. This guards against regenerating manifests when the bag is accidentally in a corrupted state (invalid), which would have the result of masking the corruption.
If you have modified the bag by creating, updating or deleting payload files, you can use the manifests parameter: bag.save(manifests=True), which will save both the tag metadata and regenerate the manifests.
I was looking at the code to try to see if it's possible to create a bag with custom tag files, or update an existing bag to add them. I didn't see any code that looked like it would let me do this, although it does look like there is code for validating and finding custom tag files referenced in a tag manifest. I thought it wouldn't hurt to ask if there's a way to do this that I missed.
It would be useful to be able to modify bag metadata, and write it back to disk.
b = Bag("/path/to/bag")
b.info["Bag-Id"] = "12345"
b.save()
Updating the bag-info.txt will cause a tagmanifest to change as well, so that will also need to be written. Which raises the question of whether the other manifests should be written as well based on the current state of the Bag object. Perhaps even the bag version and encoding should be written to the bagit.txt as well?
If a user is bagging a nested directory (bagit.py ~/Desktop/bag_level1), and their current working directory is inside that directory (e.g. cwd = ~/Desktop/bag_level1/bag_level2), bagit.py fails with the following error.
...
line 233, in make_bag
os.chdir(old_dir)
FileNotFoundError: [Errno 2]
...
This happens because the current working directory is saved to old_dir, but then bag_level2 is moved to data/bag_level2
I have a couple of ideas on possible behavior.
if not os.path.exists(old_dir):
    os.chdir(self.path)
    log("Working directory is now here, because the previous one doesn't exist")
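As a standalone illustration of that idea, the fallback can be written as a small helper. restore_cwd and its arguments are hypothetical names, and falling back to the bag directory is just one of the possible behaviors.

```python
import logging
import os

LOGGER = logging.getLogger(__name__)


def restore_cwd(old_dir, fallback_dir):
    """Change back to old_dir, or to fallback_dir if old_dir no longer exists."""
    if os.path.isdir(old_dir):
        target = old_dir
    else:
        LOGGER.warning(
            "previous working directory %s no longer exists, using %s",
            old_dir,
            fallback_dir,
        )
        target = fallback_dir
    os.chdir(target)
    return target
```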
It would be useful to be able to instantiate a Bag, modify its metadata or payload, and then save it, which would regenerate bag-info.txt, manifest-md5.txt, etc. This will involve a fair bit of refactoring, to recast functions in bagit.py as methods on the Bag class.
Bag creation allows fixity generation to happen in parallel to take advantage of multi-cores. It would be nice if bag validation had the same option to generate fixity values in parallel.
Is it possible to use bagit-python from the command line to update bag manifests? Or is that only possible through a python script? My workflow involved adding additional information to bag-info.txt after bagging which causes a checksum mismatch with the tag manifest. I'd like to be able to re-generate the manifests after this step. Thanks!
We are using BagIt on drives that contain a variety of file types but mainly contain broadcast wave files and accompanying digital audio workstation files (Pro Tools, Nuendo, Logic, Digital Performer). I have run into an issue where Pro Tools Plugin settings files titled "Icon" are either not written to the manifest and throw a validation error, OR it throws an error indicating a file is in the manifest but not on the drive when running the bagit validate command. All files show up in terminal via the ls -la command and I have verified that all permissions are correct.
Below is the output from the bag creation and validation on a set of audio files and digital audio workstation files.
workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Cleaned\ Up\ Masters/
2014-02-06 16:12:52,878 - INFO - creating bag for directory Cleaned Up Masters/
2014-02-06 16:12:53,994 - INFO - creating data dir
2014-02-06 16:12:53,994 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/.DS_Store
2014-02-06 16:12:53,994 - INFO - moving Song 1 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 1
2014-02-06 16:12:53,995 - INFO - moving Song 2 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 2
2014-02-06 16:12:53,995 - INFO - moving Song 3 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 3
2014-02-06 16:12:53,995 - INFO - writing manifest-md5.txt
2014-02-06 16:12:53,995 - INFO - writing manifest with 2 processes
2014-02-06 16:18:56,256 - INFO - writing bagit.txt
2014-02-06 16:18:56,257 - INFO - writing bag-info.txt
workstation-a:BNA_1017971 administrator$ bagit.py --validate Cleaned\ Up\ Masters/
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem
exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/De-Essers/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Compressors/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Guitars/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Drums/Icon
exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon
exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Vocals/Icon
2014-02-06 16:44:17,422 - INFO - Cleaned Up Masters/ is invalid: invalid bag: data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settin exists on filesystem but is not in manifest ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/IconSettings/bombfactory BF2A/Icon
I copied the "Icon" files out of each of the three songs and put them into "Test Files" directory and ran the bagit create and validate commands. Below is the output:
workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Test\ Files/
2014-02-06 16:11:42,977 - INFO - creating bag for directory Test Files/
2014-02-06 16:11:42,978 - INFO - creating data dir
2014-02-06 16:11:43,008 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/.DS_Store
2014-02-06 16:11:43,009 - INFO - moving 1 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/1
2014-02-06 16:11:43,009 - INFO - moving 2 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/2
2014-02-06 16:11:43,009 - INFO - moving 3 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/3
2014-02-06 16:11:43,009 - INFO - writing manifest-md5.txt
2014-02-06 16:11:43,010 - INFO - writing manifest with 2 processes
2014-02-06 16:11:43,207 - INFO - writing bagit.txt
2014-02-06 16:11:43,207 - INFO - writing bag-info.txt
workstation-a:BNA_1017971 administrator$ bagit.py --validate Test\ Files/
2014-02-06 16:11:51,514 - WARNING - data/3/Icon exists in manifest but not found on filesystem
2014-02-06 16:11:51,514 - WARNING - data/1/Icon exists in manifest but not found on filesystem
exists on filesystem but is not in manifestcon
exists on filesystem but is not in manifestcon
exists on filesystem but is not in manifest ; data/3/Iconnvalid bag: data/3/Icon exists in manifest but not found on filesystem ; data/1/Icon exists in manifest but not found on filesystem ; data/1/Icon
Unfortunately, I cannot include any of the specific sessions and audio files listed in the first example but I can provide example "Icon" files for testing. They can be downloaded at the following link:
https://bmschace.box.com/bagittestfiles
Any help with this issue would be greatly appreciated.
Thanks!
Austin Lauritsen
Director of IT
BMS/Chace
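The overwritten log lines above (the message text printed over the start of the path) would be consistent with file names that end in a carriage return. That is an assumption, not something confirmed in this report. Under that assumption, the sketch below shows how a manifest parser that strips whitespace from each line would lose the trailing character, so the same file is reported once as missing and once as unexpected. parse_manifest_line is a simplified stand-in, not bagit.py's parser.

```python
def parse_manifest_line(line):
    """Naively split '<digest> <path>' after stripping surrounding whitespace."""
    digest, path = line.strip().split(None, 1)
    return digest, path


# Assumed: the name on disk ends with a carriage return character.
name_on_disk = "data/1/Icon\r"
manifest_line = "d41d8cd98f00b204e9800998ecf8427e  " + name_on_disk + "\n"

digest, parsed_path = parse_manifest_line(manifest_line)
paths_match = parsed_path == name_on_disk
print(repr(parsed_path), repr(name_on_disk), paths_match)
```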
Although bagit.py uses "UTF-8" as its go-to value for Tag-File-Character-Encoding, it actually isn't currently able to deal with non-ASCII characters coming in via make_bag's bag_info parameter (and possibly in other places). For example:
>>> import bagit
>>> bagit.make_bag("bagtest", {"Some-Key": unichr(40960)})
ERROR:root:'ascii' codec can't encode character u'\ua000' in position 10: ordinal not in range(128)
Traceback (most recent call last):
File "<stdin>", line 1, in <module>
File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 136, in make_bag
raise e
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 10: ordinal not in range(128)
This could be remedied in a few places by switching from the builtin open() function to the codecs.open() wrapper. For instance, the crash above is fixed by replacing
bag_info_txt = open("bag-info.txt", "wb")
with
bag_info_txt = codecs.open("bag-info.txt", "wb", "UTF-8")
bagit-python should issue warnings for the conditions described at
https://github.com/loc-rdc/bagitspec/blob/67c81c8f4189ed3fbeec5ad0d91a29879fdf3002/bagit.xml#L938-L960:
Was trying to jump to a specific release and realized that the tags present at the old repository haven't been pushed here. They would be nice to have!
It should be possible to provide bag configuration data (--source-organization=..., etc.) from one (or more) configuration files (like INI files). This simplifies several tasks for bag creators/providers. Existing command-line options can override config-file settings when necessary.
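The precedence described here (file first, command line on top) can be sketched with configparser and argparse. The [bag-info] section name, the load_bag_info helper and the two options shown are assumptions for illustration, not an existing bagit.py feature.

```python
import argparse
import configparser


def load_bag_info(config_text, argv):
    """Read defaults from INI text, then let command-line options override them."""
    config = configparser.ConfigParser()
    config.optionxform = str  # keep the case of keys such as Source-Organization
    config.read_string(config_text)
    info = dict(config["bag-info"]) if config.has_section("bag-info") else {}

    parser = argparse.ArgumentParser()
    parser.add_argument("--source-organization")
    parser.add_argument("--contact-name")
    opts = parser.parse_args(argv)

    # Only options actually given on the command line replace file values.
    if opts.source_organization is not None:
        info["Source-Organization"] = opts.source_organization
    if opts.contact_name is not None:
        info["Contact-Name"] = opts.contact_name
    return info


if __name__ == "__main__":
    ini = "[bag-info]\nSource-Organization = Example Library\nContact-Name = A. Archivist\n"
    print(load_bag_info(ini, ["--contact-name", "B. Bagger"]))
```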
When adding metadata with special characters, I got an error:
No handlers could be found for logger "bagit"
Traceback (most recent call last):
File "./housekeeping.py", line 57, in <module>
bag = bagit.make_bag(ship + folder, metadataContainer)
File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 146, in make_bag
_make_tag_file('bag-info.txt', bag_info)
File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 721, in _make_tag_file
f.write("%s: %s\n" % (h, txt))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 84: ordinal not in range(128)
Which seems to be here in this function:
def _make_tag_file(bag_info_path, bag_info):
    headers = list(bag_info.keys())
    headers.sort()
    with open(bag_info_path, 'w') as f:
        for h in headers:
            if isinstance(bag_info[h], list):
                for val in bag_info[h]:
                    f.write("%s: %s\n" % (h, val))
            else:
                txt = bag_info[h]
                # strip CR, LF and CRLF so they don't mess up the tag file
                txt = re.sub(r'\n|\r|(\r\n)', '', txt)
                f.write("%s: %s\n" % (h, txt))
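Maybe the file can be opened for writing Unicode (on Python 2 the builtin open() has no encoding argument, so this would need io.open or codecs.open):
with open(bag_info_path, 'w', encoding='utf-8') as f: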
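A self-contained version of that idea, written for Python 3 and using io.open so the encoding is explicit, might look like the sketch below. write_tag_file is a simplified stand-in for _make_tag_file, not a drop-in patch.

```python
import io
import re


def write_tag_file(path, bag_info):
    """Write 'Header: value' lines as UTF-8, one line per value."""
    with io.open(path, "w", encoding="utf-8", newline="\n") as f:
        for header in sorted(bag_info):
            values = bag_info[header]
            if not isinstance(values, list):
                values = [values]
            for value in values:
                # Strip CR and LF so a value cannot break the line structure.
                text = re.sub(r"[\r\n]+", "", str(value))
                f.write("%s: %s\n" % (header, text))
```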
from the digital curation google group:
Hi again everyone,
Thanks to Michael Shallcross, I got my initial problem sorted, but now I have run into another. Everything seems to work ok as long as I don't get fancy and attempt to use the option to calculate checksums in parallel. I'm on a quad core machine, so this should work, right? I'll paste the error below-- any insights greatly appreciated!
Thanks,
Mary Willoughby
Digital Library of Georgia
Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation. All rights reserved.
C:\Users\mwilloug>cd c:\bagit_python_test
c:\bagit_python_test>python bagit.py --processes 2 --validate bucket
2016-11-15 14:33:00,171 - ERROR - unable to calculate file hashes for c:\bagit_python_test\bucket
Traceback (most recent call last):
File "bagit.py", line 518, in _validate_entries
pool = multiprocessing.Pool(processes if processes else None, _init_worker)
File "C:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "C:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
self._repopulate_pool()
File "C:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
w.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Python27\lib\pickle.py", line 224, in dump
self.save(obj)
File "C:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python27\lib\pickle.py", line 425, in save_reduce
save(state)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
save(v)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 568, in save_tuple
save(element)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 754, in save_global
(obj, module, name))
PicklingError: Can't pickle <function _init_worker at 0x0000000002E20EB8>: it's not found as __main__._init_worker
Traceback (most recent call last):
File "bagit.py", line 945, in <module>
valid = bag.validate(processes=opts.processes, fast=opts.fast)
File "bagit.py", line 363, in validate
self._validate_contents(processes=processes, fast=fast)
File "bagit.py", line 443, in _validate_contents
self._validate_entries(processes) # *SLOW*
File "bagit.py", line 518, in _validate_entries
pool = multiprocessing.Pool(processes if processes else None, _init_worker)
File "C:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
return Pool(processes, initializer, initargs, maxtasksperchild)
File "C:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
self._repopulate_pool()
File "C:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
w.start()
File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
self._popen = Popen(self)
File "C:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
dump(process_obj, to_child, HIGHEST_PROTOCOL)
File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
ForkingPickler(file, protocol).dump(obj)
File "C:\Python27\lib\pickle.py", line 224, in dump
self.save(obj)
File "C:\Python27\lib\pickle.py", line 331, in save
self.save_reduce(obj=obj, *rv)
File "C:\Python27\lib\pickle.py", line 425, in save_reduce
save(state)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 655, in save_dict
self._batch_setitems(obj.iteritems())
File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
save(v)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 568, in save_tuple
save(element)
File "C:\Python27\lib\pickle.py", line 286, in save
f(self, obj) # Call unbound method with explicit self
File "C:\Python27\lib\pickle.py", line 754, in save_global
(obj, module, name))
pickle.PicklingError: Can't pickle <function _init_worker at 0x0000000002E20EB8>: it's not found as __main__._init_worker
c:\bagit_python_test>Traceback (most recent call last):
File "<string>", line 1, in <module>
File "C:\Python27\lib\multiprocessing\forking.py", line 381, in main
self = load(from_parent)
File "C:\Python27\lib\pickle.py", line 1384, in load
return Unpickler(file).load()
File "C:\Python27\lib\pickle.py", line 864, in load
dispatch[key](self)
File "C:\Python27\lib\pickle.py", line 886, in load_eof
raise EOFError
EOFError
I'm unable to get Bagger working on my machine (work laptop). I can run bagit-python, but I need to use a custom bagit profile. Thanks!
Hi,
We've started using bagit-python and it's very useful. As the manifest creation can take several hours, it would be great if there was a more detailed status update as it processed each file. Right now it just says something like "Writing manifest".
Thanks,
Kieran.
If a hash manifest has multiple entries for a file, bagit.py silently overwrites these hashes on top of one another.
https://github.com/LibraryOfCongress/bagit-python/blob/master/bagit.py#L593
This isn't explicitly against the BagIt spec (2.1.3 is the relevant section), but I think bagit.py should flag these situations as invalid when loading a manifest.
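Flagging the situation at load time could be as simple as refusing to overwrite an existing entry. A minimal sketch with hypothetical names (load_manifest, DuplicateEntryError), operating on manifest lines rather than on a Bag object:

```python
class DuplicateEntryError(ValueError):
    """Hypothetical error raised when a manifest lists the same path twice."""


def load_manifest(lines):
    """Parse '<digest> <path>' lines, rejecting repeated paths."""
    entries = {}
    for lineno, line in enumerate(lines, 1):
        line = line.rstrip("\n")
        if not line:
            continue
        digest, path = line.split(None, 1)
        if path in entries:
            raise DuplicateEntryError(
                "line %d: %s is listed more than once" % (lineno, path)
            )
        entries[path] = digest
    return entries
```

Whether two identical entries for the same path should be tolerated is a policy choice; this sketch rejects both identical and conflicting duplicates.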
I hope I'm not being blind here, but I don't see a way to read my metadata in 'bag-info.txt'
In [18]: a = bagit.make_bag('/home/jakecowton/temp/1', {'Contact-Name':"Jake Cowton", 'Contact-Email':"[email protected]", 'Author':"That Guy"})
In [19]: a.tags
Out[19]: {'BagIt-Version': '0.97', 'Tag-File-Character-Encoding': 'UTF-8'}
For some reason I (though presumably 'we') can only see the two default tags but not the ones we pass as our own nor the reserved tags listed on line 52-68 of bagit.py
Is this something you plan to develop or is there a reason for this?
When a bag has a tag manifest file created using an algorithm other than md5, such as the following bag:
bag-info.txt
bagit.txt
data
manifest-sha1.txt
tagmanifest-sha1.txt
Updating bag info results in an invalid bag:
>>> import bagit
>>> bag = bagit.Bag('/Users/twan/edeposit/edeposit/apps/signiant/tests/SR1-4444_20140611200739')
>>> bag.is_valid()
True
>>> bag.info['External-Identifier']='xyz'
>>> bag.save()
>>> bag.is_valid()
False
The problem appears to be that the code assumes the tag manifest file always uses md5:
# Update tag-manifest for changes to manifest & bag-info files
_make_tagmanifest_file('tagmanifest-md5.txt', self.path)
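Instead of a hard-coded file name, the update step could look at which tag manifests already exist and rewrite each one with its own algorithm. The sketch below is an assumption about how that might be done, with hypothetical helper names; for brevity it only hashes tag files at the top level of the bag.

```python
import hashlib
import os
import re

TAGMANIFEST_RE = re.compile(r"^tagmanifest-([A-Za-z0-9]+)\.txt$")


def existing_tagmanifests(bag_dir):
    """Map each algorithm to the tag manifest file name already in the bag."""
    found = {}
    for name in sorted(os.listdir(bag_dir)):
        match = TAGMANIFEST_RE.match(name)
        if match:
            found[match.group(1).lower()] = name
    return found


def refresh_tagmanifests(bag_dir):
    """Rewrite every existing tag manifest using the algorithm in its name."""
    tag_files = sorted(
        name
        for name in os.listdir(bag_dir)
        if os.path.isfile(os.path.join(bag_dir, name))
        and not TAGMANIFEST_RE.match(name)
    )
    for algorithm, manifest_name in existing_tagmanifests(bag_dir).items():
        lines = []
        for name in tag_files:
            with open(os.path.join(bag_dir, name), "rb") as f:
                digest = hashlib.new(algorithm, f.read()).hexdigest()
            lines.append("%s %s\n" % (digest, name))
        with open(os.path.join(bag_dir, manifest_name), "w") as out:
            out.writelines(lines)
```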