
bagit-python's People

Contributors

acdha, avivace, bhspitmonkey, bmannix, cclauss, chenrui333, darshanrp, dbrunton, dchud, dependabot[bot], edsu, fatmilktv, fgebhart, gwiedeman, hwesta, jobyh, johnscancella, kba, kieranjol, mikedarcy, mjgiarlo, nkrabben, nsoranzo, rlskoeser, ruebot, steffenfritz, tovrstra, zimeon

bagit-python's Issues

Bag-Software-Agent include version?

It would be nice if the default Bag-Software-Agent had a version in it. Currently it looks like:

bagit.py <http://github.com/libraryofcongress/bagit-python>

But maybe something like this would be better, for discovering bags created with (gasp) a version of bagit.py that has a bug?

bagit.py v1.5.3 <http://github.com/libraryofcongress/bagit-python>
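A minimal sketch of how such a value could be assembled. The names VERSION, PROJECT_URL and software_agent are hypothetical and not taken from bagit.py, which may obtain its version differently.

```python
# Hypothetical constants; the real module may obtain its version differently.
VERSION = "1.5.3"
PROJECT_URL = "http://github.com/libraryofcongress/bagit-python"


def software_agent(version=VERSION, url=PROJECT_URL):
    """Build a Bag-Software-Agent value that includes the version."""
    return "bagit.py v%s <%s>" % (version, url)


print(software_agent())
```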

Unhandled exception when a bag-info string contains newlines

Currently, if you pass a metadata string containing newlines to make_bag, those newlines end up in the bag-info.txt file, which can then no longer be parsed. The immediate validation in _parse_tags then crashes, specifically due to the line

tag_value = parts[1].strip()

which assumes that the line.strip().split(':') call must have returned two parts (the actual crash is a list index out of range exception).

So, two things are probably in order:

  • make_bag should most likely strip out CR/LF characters, or otherwise reject inputs containing them
  • _parse_tags should probably catch IndexError when trying to access parts[1], and be able to raise a higher-level exception indicating that the file cannot be parsed
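Both suggestions can be sketched roughly as follows. TagFileError, clean_tag_value and parse_tag_line are hypothetical names used only for illustration; they are not the actual bagit.py functions.

```python
import re


class TagFileError(Exception):
    """Hypothetical higher-level error for an unparseable tag file."""


def clean_tag_value(value):
    # Collapse any run of CR/LF characters into a single space.
    return re.sub(r"[\r\n]+", " ", value).strip()


def parse_tag_line(line):
    # Split on the first colon only, so values may themselves contain colons.
    parts = line.strip().split(":", 1)
    if len(parts) != 2:
        raise TagFileError("cannot parse tag line: %r" % line)
    return parts[0].strip(), parts[1].strip()
```

Checking the length of parts up front avoids the IndexError entirely, so there is nothing low-level left to catch.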

Bench.py ftplib error_perm 550

I'm using Ubuntu 14.04, and the bench.py script seems to fail at the NASA download phase:

:~/bagit-python$ ./bench.py 
fetching some images to bag up from nasa
Traceback (most recent call last):
  File "./bench.py", line 24, in <module>
    ftp.cwd('/photo_gallery/hi-res/planetary/mars/')
  File "/usr/lib/python2.7/ftplib.py", line 562, in cwd
    return self.voidcmd(cmd)
  File "/usr/lib/python2.7/ftplib.py", line 254, in voidcmd
    return self.voidresp()
  File "/usr/lib/python2.7/ftplib.py", line 229, in voidresp
    resp = self.getresp()
  File "/usr/lib/python2.7/ftplib.py", line 224, in getresp
    raise error_perm, resp
ftplib.error_perm: 550 /photo_gallery/hi-res/planetary/mars/: No such file or directory

--validate fails if algorithm name is not lowercase

Scenario:

  1. $ bagit.py --validate plant
    2016-12-14 10:38:36,662 - INFO - plant is invalid: Missing manifest file
  2. $ cd plant && mv manifest-SHA1.txt manifest-sha1.txt && cd ..
  3. $ bagit.py --validate plant
    2016-12-14 11:05:31,219 - INFO - plant is valid

This is happening if a bag is created with the bagit-java library and validated with this library.

Might be somewhat related to #81:

Implementations SHOULD issue a warning when multiple manifests are present which differ only in case or normalization form.
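One possible approach, sketched under the assumption that manifests are discovered by listing the bag directory: match the file name case-insensitively and key the result by the lowercase algorithm name. MANIFEST_RE and find_manifests are hypothetical names, not part of bagit.py.

```python
import re

# Matches payload manifests such as manifest-sha1.txt or manifest-SHA1.txt.
MANIFEST_RE = re.compile(r"^manifest-([a-z0-9]+)\.txt$", re.IGNORECASE)


def find_manifests(filenames):
    """Map lowercase algorithm name -> list of matching file names."""
    found = {}
    for name in filenames:
        match = MANIFEST_RE.match(name)
        if match:
            found.setdefault(match.group(1).lower(), []).append(name)
    return found
```

An algorithm that maps to more than one file name would be the case where the quoted warning applies.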

Change the default hash algorithms

Since it's cheap enough on computers made in the last decade, we should either change the default to SHA-256 or use more than one algorithm if there's any concern about compatibility with very old workflows.
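As an illustration of why several algorithms need not cost several reads, here is a sketch that feeds each block to every hasher in a single pass. hash_stream is a hypothetical helper, and the choice of SHA-256 plus SHA-512 is only an example.

```python
import hashlib
import io


def hash_stream(stream, algorithms=("sha256", "sha512"), block_size=1024 * 1024):
    """Read the stream once and return {algorithm: hex digest}."""
    hashers = {name: hashlib.new(name) for name in algorithms}
    while True:
        block = stream.read(block_size)
        if not block:
            break
        for hasher in hashers.values():
            hasher.update(block)
    return {name: hasher.hexdigest() for name, hasher in hashers.items()}


digests = hash_stream(io.BytesIO(b"hello"))
```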

How to treat/parse complex metadata with nested and non-string values

Hi,

I'd like to use bagit for archiving documents with quite complex metadata (an example is pasted below). Running make_bag as in the tutorial results in parsing errors, because the _make_tag_file function seems to assume that every value in the input dict is either a string or a list of strings.
I am not sure how this can be resolved: whether the strict format is intentional, or whether the bagit-python code can be improved. I also didn't find anything specific about this in the BagIt spec.

At first I thought this would be an easy transformation just flattening out all the nested dictionaries, then I discovered that there are also dictionaries inside lists, and that different data types (string, int, None) are used, which also wouldn't work in the present _make_tag_file function (the regex in line 720).

So what do you think: should the conversion be up to the client, should it be done inside bagit-python, or a mix of those?

rd = {u'restriction': {u'email': u'blub@zoink'}, u'recid': 1, u'version_history': None, u'modification_date': u'2014-08-28T15:55:39', u'_id': 1, u'__meta_metadata__': {u'__additional_info__': {u'master_format': u'marc', u'namespace': u'recordext'}, u'__continuable_errors__': [], u'__aliases__': {}, u'__model_info__': {u'names': [u'__default__']}, u'modification_date': {u'function': [u'modification_date', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.898010', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': [u'modification_date', u'json_ext', u'dumps'], u'loads': [u'modification_date', u'json_ext', u'loads']}}, u'json_id': u'modification_date', u'type': u'derived'}, u'version_history': {u'function': [u'version_history', u'rules', u'calculated', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.861323', u'after': {u'memoize': 0}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'version_history', u'type': u'calculated'}, u'restriction': {u'function': u'UNKNOWN', u'timestamp': u'2014-08-28T15:56:02.620112', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'restriction', u'type': u'UNKNOWN'}, u'__errors__': [u"Rule Error - Unable to apply rule for virtual field 'newer_version'. 
\n"], u'creation_date': {u'function': [u'creation_date', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.898809', u'after': {}, u'pid': None, u'ext': {u'json_ext': {u'dumps': [u'creation_date', u'json_ext', u'dumps'], u'loads': [u'creation_date', u'json_ext', u'loads']}}, u'json_id': u'creation_date', u'type': u'derived'}, u'recid': {u'function': [u'001'], u'timestamp': u'2014-08-28T15:56:02.683490', u'after': {u'connect': [{u'connected_field': u'_id', u'update_function': None}]}, u'pid': 0, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'recid', u'type': u'creator'}, u'_id': {u'function': [u'_id', u'rules', u'derived', 0, u'function'], u'timestamp': u'2014-08-28T15:56:02.853278', u'after': {u'connect': [{u'connected_field': u'recid', u'update_function': None}]}, u'pid': None, u'ext': {u'json_ext': {u'dumps': None, u'loads': None}}, u'json_id': u'_id', u'type': u'derived'}}, u'creation_date': u'2014-08-28T15:55:22'}
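If the conversion is left to the client, one possible flattening is sketched below. This is an assumption about what a client might do, not something bagit-python provides: nested keys are joined with dots, list items are numbered, and scalars are converted with str(). Empty dicts and lists produce no entries.

```python
def flatten(value, prefix=""):
    """Yield (key, string) pairs for arbitrarily nested dicts and lists."""
    if isinstance(value, dict):
        for key in sorted(value):
            child = "%s.%s" % (prefix, key) if prefix else str(key)
            for pair in flatten(value[key], child):
                yield pair
    elif isinstance(value, (list, tuple)):
        for index, item in enumerate(value):
            for pair in flatten(item, "%s.%d" % (prefix, index)):
                yield pair
    else:
        # Scalars (str, int, None, ...) become their string form.
        yield prefix, str(value)
```

The result of dict(flatten(rd)) would then contain only string values, at the cost of losing the original types.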

--validation warnings related to diacritics

Yesterday I created a very large bag (approximately 82,000 files; 300 GB) using bagit.py on my Mac. Afterwards, the bag validated without issue.

Today, I copied the bag to our Archivematica transfer server and ran bagit.py --validate on the bag. This resulted in many errors, seemingly related to diacritics/character encoding issues. Some of the sample warnings:

2016-01-20 09:34:44,183 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cerámica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists on filesystem but is not in manifest
2016-01-20 09:34:43,267 - WARNING - data/VILLANURBS/CD-DVD/07_200606/05_0929 VNU/VNU_planta arriba/cera ́mica/produccion/prueba/06 0119 ceramica VNU prueba 2 b.pdf exists in manifest but not found on filesystem

2016-01-20 09:34:43,303 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfilieri ́a 3D.dgn exists in manifest but not found on filesystem
2016-01-20 09:34:44,261 - WARNING - data/VILLANURBS/SCAPE/2005/05_0929 VNU/05_0929 VNU/VNU_planta arriba/cerramientos/fachada vidrio/dibujos/06 1116 perfiliería 3D.dgn exists on filesystem but is not in manifest

When I look at the manifest via cat in a bash terminal, the paths appear to match those on the filesystem.

Thanks!
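One plausible reading of the paired warnings is that the same path appears in two Unicode normalization forms: composed (NFC) on one side and decomposed (NFD) on the other. The sketch below shows how two such strings differ byte for byte yet compare equal once normalized; same_path is a hypothetical helper, not a bagit.py function.

```python
import unicodedata

composed = u"cer\u00e1mica"      # "á" as one code point (NFC)
decomposed = u"cera\u0301mica"   # "a" followed by a combining acute accent (NFD)


def same_path(a, b, form="NFC"):
    """Compare two paths after applying the same normalization form."""
    return unicodedata.normalize(form, a) == unicodedata.normalize(form, b)
```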

python3

It might be nice to have Python 3 support, preferably implemented such that bagit.py works under both versions, rather than having separate code bases.

support for sha1 creation

It would be very useful to support the ability to pass an argument specifying which hash algorithm to use (i.e. sha1 instead of md5 – or both even).

Make it possible to create a bag without moving the payload

When creating a new bag, make_bag() takes the bag directory as argument. It assumes the current content of this bag directory to be the payload. It creates a new subdirectory data and moves all content of the bag directory into this new subdirectory.

It should be possible to create a bag without moving the payload, given that it is already in a suitable data subdirectory. This is particularly needed when only read access to the payload can be granted.

I submitted a pull request #67 to add such an option, but I got no response. Maybe it has not been noticed. So I submit this Issue to draw your attention.

Allow for separate completeness check

Right now bagit.py allows for Oxum validation or Oxum/completeness/hash validation. I have frequent use cases where I just need to check the completeness of a bag.

Currently I'm running a local fork of bagit-python that separates the completeness checking portion of the script and exposes it as its own option. There is a problem with how to expose that option since bagit.py --validate --complete gives a false impression of what the validation process is doing.

However, besides that problem, addressing this issue only requires separating Bag._validate_entries into two separate methods, updating the logic in Bag._validate_contents, and passing the appropriate argument flag.
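A completeness-only check amounts to comparing two sets of paths without hashing anything. The sketch below illustrates the idea with standalone functions; payload_files and completeness_errors are hypothetical names and do not reflect how Bag._validate_entries is actually structured.

```python
import os


def payload_files(bag_dir):
    """Return the set of payload paths relative to the bag, using "/"."""
    found = set()
    data_dir = os.path.join(bag_dir, "data")
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            rel = os.path.relpath(os.path.join(root, name), bag_dir)
            found.add(rel.replace(os.sep, "/"))
    return found


def completeness_errors(manifest_paths, filesystem_paths):
    """Compare two sets of paths without hashing any file."""
    missing = sorted(set(manifest_paths) - set(filesystem_paths))
    extra = sorted(set(filesystem_paths) - set(manifest_paths))
    return missing, extra
```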

Support bagging to a destination other than the source

Currently the module only allows one to do what the LOC Java library calls "bag in place". It would be very useful to have a built-in way to specify one or more payloads as the "source" and then specify a "destination" where the bag containing the payloads will be created.

Minor but important note – the hashes in the manifest should be generated from the source payloads, not the copied files in the bag.
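One way to honor that note is to compute the digest from the bytes as they are read from the source, in the same loop that writes the copy. This is only a sketch; copy_and_hash is a hypothetical helper and SHA-256 is an arbitrary choice.

```python
import hashlib


def copy_and_hash(src_path, dst_path, algorithm="sha256", block_size=1024 * 1024):
    """Copy src to dst, hashing the bytes as they are read from the source."""
    hasher = hashlib.new(algorithm)
    with open(src_path, "rb") as src, open(dst_path, "wb") as dst:
        while True:
            block = src.read(block_size)
            if not block:
                break
            hasher.update(block)
            dst.write(block)
    return hasher.hexdigest()
```

Validating the finished bag against these digests would then also detect a faulty copy.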

support for 0.93, 0.94

Is there any reason that the only supported versions are 0.95, 0.96, and 0.97? Would anything bad happen if 0.93 and 0.94 were added to the list of supported versions in bagit.py?

Bag validation fails for bags created using bagit <=1.2.1

I had a few bags created using bagit version 1.2.1. We are now using bagit 1.3.4 which creates the additional tagmanifest-md5.txt file.
The previously created bags are now failing validation because bagit is looking for a tagmanifest-md5.txt file and can't find one. I checked the source code and found that if the bagit version is >0.97 then it looks for a tagmanifest-md5.txt file while validating.

Screenshot of my test run: [screenshot from 2014-03-21 16:31:36, image not preserved]

The selection of tag files is broken

The function _find_tag_files(), added by PR #69 to select the files that go into tagmanifest files, is broken. The intention of this function was to select all files in the bag directory, excluding only the payload directory and the tagmanifest files. What the logic actually does is select all files except those in any directory whose name ends with "data". This is broken in two different ways:

  1. if the bag directory itself ends with "data", all files in this bag directory are excluded, although bag-info.txt, bagit.txt, and manifest-*.txt should in particular be added.
  2. if the payload directory contains any subdirectories not ending with "data", files in these subdirectories are selected for inclusion in the tagmanifest files, although these files, being part of the payload, should not be added.

This bug has been discovered by Kieran O'Leary in the discussion of PR #67.
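The intended selection can be sketched as follows: prune only the top-level "data" directory and skip only top-level tagmanifest files. This find_tag_files is an illustrative stand-in written for this note, not the patched bagit.py function.

```python
import os


def find_tag_files(bag_dir):
    """Yield bag-relative paths of tag files.

    Skips the top-level payload directory ("data") and top-level
    tagmanifest-*.txt files; everything else is treated as a tag file.
    """
    for root, dirs, files in os.walk(bag_dir):
        at_top = os.path.abspath(root) == os.path.abspath(bag_dir)
        if at_top and "data" in dirs:
            dirs.remove("data")  # prune the payload directory only at the top
        for name in files:
            if at_top and name.startswith("tagmanifest-") and name.endswith(".txt"):
                continue
            yield os.path.relpath(os.path.join(root, name), bag_dir)
```

Because the test is on the position of the directory rather than on its name, a bag directory called "mydata" or a tag directory that itself contains a "data" folder is handled correctly.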

New release (1.5.5?) request

It's been about a year since 1.5.4, and I'd love to see a release on PyPI with some of the bug fixes and Python 3.5 support.

tag manifest option

Would be nice to include (perhaps as default) a tag manifest in addition to the manifest…

v1.5.4?

I see that v1.5.3 never made it to PyPI and was thinking perhaps now would be a good time to coordinate a GitHub and PyPI release for v1.5.4? Also, would anyone at LoC like to have permission to upload to PyPI for bagit-python?

Error on --validate --processes

I'm running bagit.py on an external in-place bag. This is a multicore machine.
bagit.py --validate --processes=4 F:\PATH_TO_BAG\

The error output is this

Traceback (most recent call last):
  File "c:/Python27/Scripts/bagit.py", line 525, in _validate_entries
    pool = multiprocessing.Pool(processes if processes else None, _init_worker)
  File "c:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "c:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
    self._repopulate_pool()
  File "c:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
    w.start()
  File "c:\Python27\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "c:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "c:\Python27\lib\multiprocessing\forking.py", line 199, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "c:\Python27\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "c:\Python27\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "c:\Python27\lib\pickle.py", line 419, in save_reduce
    save(state)
  File "c:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "c:\Python27\lib\pickle.py", line 649, in save_dict
    self._batch_setitems(obj.iteritems())
  File "c:\Python27\lib\pickle.py", line 681, in _batch_setitems
    save(v)
  File "c:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "c:\Python27\lib\pickle.py", line 562, in save_tuple
    save(element)
  File "c:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "c:\Python27\lib\pickle.py", line 748, in save_global
    (obj, module, name))
PicklingError: Can't pickle <function _init_worker at 0x0239F8F0>: it's not found as __main__._init_worker

I apologize if this has been addressed before, but I couldn't find anything anywhere about it specifically.

Thanks

review save method in v1.5.0

@darshanrp and @Hwesta would you be willing to take a look at the new Bag.save method in v1.5.0 on master and see that it works the way you expect?

If you have a bagit.Bag instance named bag you should be able to update the bag.info dictionary to change the metadata and then call bag.save() to persist the change.

By default save will not regenerate manifests. This guards against regenerating manifests when the bag is accidentally in a corrupted state (invalid), which would have the result of masking the corruption.

If you have modified the bag by creating, updating or deleting payload files, you can use the manifests parameter: bag.save(manifests=True), which will save the tag metadata and also regenerate the manifests.

Create or update bag to include optional/custom tag files?

I was looking at the code to try to see if it's possible to create a bag with custom tag files, or update an existing bag to add them. I didn't see any code that looked like it would let me do this, although it does look like there is code for validating and finding custom tag files referenced in a tag manifest. I thought it wouldn't hurt to ask if there's a way to do this that I missed.

add save() method

It would be useful to be able to modify bag metadata, and write it back to disk.

b = Bag("/path/to/bag")
b.info["Bag-Id"] = "12345"
b.save()

Updating the bag-info.txt will cause a tagmanifest to change as well, so that will also need to be written. Which raises the question of whether the other manifests should be written as well based on the current state of the Bag object. Perhaps even the bag version and encoding should be written to the bagit.txt as well?

Error when bag is created while working directory is inside dir to be bagged.

If a user is bagging a nested directory (bagit.py ~/Desktop/bag_level1), and their current working directory is inside that directory (e.g. cwd = ~/Desktop/bag_level1/bag_level2), bagit.py fails with the following error.

...
line 233, in make_bag
    os.chdir(old_dir)
FileNotFoundError: [Errno 2] 
...

This happens because the current working directory is saved to old_dir, but then bag_level2 is moved to data/bag_level2, so the saved path no longer exists.

I have a couple of ideas on possible behavior.

  1. error out and warn the user to change directories before bagging
  2. fail to another location, e.g
if not os.path.exists(old_dir):
    os.chdir(self.path)
    log("Working directory now here, because previous one doesn't exist")
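The second idea could also be packaged as a context manager, sketched here under the assumption that falling back silently is acceptable. working_directory and its fallback parameter are hypothetical names and not part of bagit.py.

```python
import contextlib
import os


@contextlib.contextmanager
def working_directory(path, fallback):
    """chdir into path, then return to the old cwd (or fallback if it is gone)."""
    old_dir = os.getcwd()
    os.chdir(path)
    try:
        yield
    finally:
        # The old cwd may have been moved, e.g. into data/, while we worked.
        os.chdir(old_dir if os.path.isdir(old_dir) else fallback)
```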

updating and saving bags

It would be useful to be able to instantiate a Bag, modify its metadata or payload, and then save it, which would regenerate bag-info.txt, manifest-md5.txt, etc. This will involve a fair bit of refactoring, to recast functions in bagit.py as methods on the Bag class.

parallelize fixity checking in Bag.validate()

Bag creation allows fixity generation to happen in parallel to take advantage of multi-cores. It would be nice if bag validation had the same option to generate fixity values in parallel.

Updating Bag Manifests

Is it possible to use bagit-python from the command line to update bag manifests? Or is that only possible through a python script? My workflow involved adding additional information to bag-info.txt after bagging which causes a checksum mismatch with the tag manifest. I'd like to be able to re-generate the manifests after this step. Thanks!

Bag Validation Issues

We are using BagIt on drives that contain a variety of file types but mainly contain broadcast wave files and accompanying digital audio workstation files (Pro Tools, Nuendo, Logic, Digital Performer). I have run into an issue where Pro Tools Plugin settings files titled "Icon" are either not written to the manifest and throw a validation error, OR the bagit validate command reports that a file is in the manifest but not on the drive. All files show up in terminal via the ls -la command and I have verified that all permissions are correct.

Below is the output from the bag creation and validation on a set of audio files and digital audio workstation files.

workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Cleaned\ Up\ Masters/
2014-02-06 16:12:52,878 - INFO - creating bag for directory Cleaned Up Masters/
2014-02-06 16:12:53,994 - INFO - creating data dir
2014-02-06 16:12:53,994 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/.DS_Store
2014-02-06 16:12:53,994 - INFO - moving Song 1 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 1
2014-02-06 16:12:53,995 - INFO - moving Song 2 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 2
2014-02-06 16:12:53,995 - INFO - moving Song 3 to /Volumes/BNA_1017971/Cleaned Up Masters/tmpYIWE2W/Song 3
2014-02-06 16:12:53,995 - INFO - writing manifest-md5.txt
2014-02-06 16:12:53,995 - INFO - writing manifest with 2 processes
2014-02-06 16:18:56,256 - INFO - writing bagit.txt
2014-02-06 16:18:56,257 - INFO - writing bag-info.txt

workstation-a:BNA_1017971 administrator$ bagit.py --validate Cleaned\ Up\ Masters/
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem
2014-02-06 16:27:52,810 - WARNING - data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem
 exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/De-Essers/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Compressors/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Guitars/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Drums/Icon
 exists on filesystem but is not in manifestg 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon
 exists on filesystem but is not in manifestg 3/Plug-In Settings/ChannelStrip/Vocals/Icon
2014-02-06 16:44:17,422 - INFO - Cleaned Up Masters/ is invalid: invalid bag: data/Song 3/Plug-In Settings/ChannelStrip/Full Mix Settings/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/De-Essers/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Drums/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/Purple MC77/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settings/bombfactory BF2A/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Guitars/Icon exists in manifest but not found on filesystem ; data/Song 3/Plug-In Settings/ChannelStrip/Compressors/Icon exists in manifest but not found on filesystem ; data/Song 1/StemsAndMultitrack/M1 FOR EXPORT/Plug-In Settin exists on filesystem but is not in manifest ; data/Song 3/Plug-In Settings/ChannelStrip/Vocals/IconSettings/bombfactory BF2A/Icon

I copied the "Icon" files out of each of the three songs and put them into "Test Files" directory and ran the bagit create and validate commands. Below is the output:

workstation-a:BNA_1017971 administrator$ bagit.py --contact-name 'Test Author' --processes 2 Test\ Files/
2014-02-06 16:11:42,977 - INFO - creating bag for directory Test Files/
2014-02-06 16:11:42,978 - INFO - creating data dir
2014-02-06 16:11:43,008 - INFO - moving .DS_Store to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/.DS_Store
2014-02-06 16:11:43,009 - INFO - moving 1 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/1
2014-02-06 16:11:43,009 - INFO - moving 2 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/2
2014-02-06 16:11:43,009 - INFO - moving 3 to /Volumes/BNA_1017971/Test Files/tmpOG1qwh/3
2014-02-06 16:11:43,009 - INFO - writing manifest-md5.txt
2014-02-06 16:11:43,010 - INFO - writing manifest with 2 processes
2014-02-06 16:11:43,207 - INFO - writing bagit.txt
2014-02-06 16:11:43,207 - INFO - writing bag-info.txt

workstation-a:BNA_1017971 administrator$ bagit.py --validate Test\ Files/
2014-02-06 16:11:51,514 - WARNING - data/3/Icon exists in manifest but not found on filesystem
2014-02-06 16:11:51,514 - WARNING - data/1/Icon exists in manifest but not found on filesystem
 exists on filesystem but is not in manifestcon
 exists on filesystem but is not in manifestcon
 exists on filesystem but is not in manifest ; data/3/Iconnvalid bag: data/3/Icon exists in manifest but not found on filesystem ; data/1/Icon exists in manifest but not found on filesystem ; data/1/Icon

Unfortunately, I cannot include any of the specific sessions and audio files listed in the first example, but I can provide example "Icon" files for testing. They can be downloaded at the following link:

https://bmschace.box.com/bagittestfiles

Any help with this issue would be greatly appreciated.

Thanks!
Austin Lauritsen
Director of IT
BMS/Chace

Unicode strings break make_bag

Although bagit.py uses "UTF-8" as its go-to value for Tag-File-Character-Encoding, it actually isn't currently able to deal with non-ASCII characters coming in via make_bag's bag_info parameter (and possibly in other places). For example:

>>> import bagit
>>> bagit.make_bag("bagtest", {"Some-Key": unichr(40960)})
ERROR:root:'ascii' codec can't encode character u'\ua000' in position 10: ordinal not in range(128)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 136, in make_bag
    raise e
UnicodeEncodeError: 'ascii' codec can't encode character u'\ua000' in position 10: ordinal not in range(128)

This could be remedied in a few places by switching from the builtin open() function to the codecs.open() wrapper. For instance, the crash above is fixed by replacing

bag_info_txt = open("bag-info.txt", "wb")

with

bag_info_txt = codecs.open("bag-info.txt", "wb", "UTF-8")
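The same idea in a self-contained form, using io.open rather than codecs.open because it behaves the same way on Python 2.7 and Python 3. write_tag_file is a hypothetical helper for illustration, not the bagit.py code.

```python
import io


def write_tag_file(path, tags):
    """Write {label: value} pairs to path, encoded as UTF-8."""
    with io.open(path, "w", encoding="utf-8") as handle:
        for label in sorted(tags):
            handle.write(u"%s: %s\n" % (label, tags[label]))
```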

Validation warnings for confusingly-similar files

bagit-python should issue warnings for the conditions described at
https://github.com/loc-rdc/bagitspec/blob/67c81c8f4189ed3fbeec5ad0d91a29879fdf3002/bagit.xml#L938-L960:

  • Implementations SHOULD discourage the creation of bags containing files which differ only in case.
  • Implementations MUST prevent the creation of bags containing files which differ only in normalization form.
  • BagIt implementations SHOULD tolerate differences in normalization form by comparing both the list of filesystem and manifest names after applying the same normalization form to both.
  • Implementations SHOULD issue a warning when multiple manifests are present which differ only in case or normalization form.
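Detecting such names can be sketched by grouping paths under a key that ignores both case and normalization form. confusable_groups is a hypothetical helper; it uses lower() as a simplification of full Unicode case folding.

```python
import unicodedata
from collections import defaultdict


def confusable_groups(paths):
    """Group paths that are equal after NFC normalization and lowercasing."""
    groups = defaultdict(list)
    for path in paths:
        key = unicodedata.normalize("NFC", path).lower()
        groups[key].append(path)
    return [sorted(group) for group in groups.values() if len(group) > 1]
```

Each returned group holds two or more names that a warning could list together.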

Push tags from the old repository

Was trying to jump to a specific release and realized that the tags present at the old repository haven't been pushed here. They would be nice to have!

config file

It should be possible to provide bag configuration data (--source-organization=..., etc.) from one or more configuration files (such as INI files). This would simplify several tasks for bag creators/providers. Existing command-line options could override config-file settings when necessary.
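A rough sketch of the merge order using Python 3's configparser. The [bag-info] section name, the load_bag_info function and the shape of the overrides dict are all assumptions made for illustration; bagit.py defines no such file format.

```python
import configparser


def load_bag_info(config_text, overrides=None):
    """Merge [bag-info] settings from INI text with command-line overrides."""
    parser = configparser.ConfigParser()
    parser.optionxform = str  # keep the case of labels such as Source-Organization
    parser.read_string(config_text)
    info = dict(parser["bag-info"]) if parser.has_section("bag-info") else {}
    # Command-line values win; None means the option was not given.
    info.update({k: v for k, v in (overrides or {}).items() if v is not None})
    return info
```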

Special character in Metadata

When adding metadata with special characters, I got an error:

No handlers could be found for logger "bagit"
Traceback (most recent call last):
  File "./housekeeping.py", line 57, in <module>
    bag = bagit.make_bag(ship + folder, metadataContainer)
  File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 146, in make_bag
    _make_tag_file('bag-info.txt', bag_info)
  File "/usr/local/lib/python2.7/dist-packages/bagit.py", line 721, in _make_tag_file
    f.write("%s: %s\n" % (h, txt))
UnicodeEncodeError: 'ascii' codec can't encode character u'\u0142' in position 84: ordinal not in range(128)

Which seems to be here in this function:

def _make_tag_file(bag_info_path, bag_info):
    headers = list(bag_info.keys())
    headers.sort()
    with open(bag_info_path, 'w') as f:
        for h in headers:
            if isinstance(bag_info[h], list):
                for val in bag_info[h]:
                    f.write("%s: %s\n" % (h, val))
            else:
                txt = bag_info[h]
                # strip CR, LF and CRLF so they don't mess up the tag file
                txt = re.sub(r'\n|\r|(\r\n)', '', txt)
                f.write("%s: %s\n" % (h, txt))

Maybe the file can be opened for writing Unicode (on Python 2.7 this needs io.open, since the builtin open() takes no encoding argument):
with io.open(bag_info_path, 'w', encoding='utf-8') as f:

Pickling error when using --processes on Windows

From the Digital Curation Google Group:


Hi again everyone,

Thanks to Michael Shallcross, I got my initial problem sorted, but now I have run into another. Everything seems to work ok as long as I don't get fancy and attempt to use the option to calculate checksums in parallel. I'm on a quad core machine, so this should work, right? I'll paste the error below-- any insights greatly appreciated!

Thanks,
Mary Willoughby

Digital Library of Georgia



Microsoft Windows [Version 6.1.7601]
Copyright (c) 2009 Microsoft Corporation.  All rights reserved.

C:\Users\mwilloug>cd c:\bagit_python_test

c:\bagit_python_test>python bagit.py --processes 2 --validate bucket
2016-11-15 14:33:00,171 - ERROR - unable to calculate file hashes for c:\bagit_python_test\bucket
Traceback (most recent call last):
  File "bagit.py", line 518, in _validate_entries
    pool = multiprocessing.Pool(processes if processes else None, _init_worker)
  File "C:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "C:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
    self._repopulate_pool()
  File "C:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
    w.start()
  File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Python27\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "C:\Python27\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python27\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 568, in save_tuple
    save(element)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 754, in save_global
    (obj, module, name))
PicklingError: Can't pickle <function _init_worker at 0x0000000002E20EB8>: it's not found as __main__._init_worker
Traceback (most recent call last):
  File "bagit.py", line 945, in <module>
    valid = bag.validate(processes=opts.processes, fast=opts.fast)
  File "bagit.py", line 363, in validate
    self._validate_contents(processes=processes, fast=fast)
  File "bagit.py", line 443, in _validate_contents
    self._validate_entries(processes)  # *SLOW*
  File "bagit.py", line 518, in _validate_entries
    pool = multiprocessing.Pool(processes if processes else None, _init_worker)
  File "C:\Python27\lib\multiprocessing\__init__.py", line 232, in Pool
    return Pool(processes, initializer, initargs, maxtasksperchild)
  File "C:\Python27\lib\multiprocessing\pool.py", line 159, in __init__
    self._repopulate_pool()
  File "C:\Python27\lib\multiprocessing\pool.py", line 223, in _repopulate_pool
    w.start()
  File "C:\Python27\lib\multiprocessing\process.py", line 130, in start
    self._popen = Popen(self)
  File "C:\Python27\lib\multiprocessing\forking.py", line 277, in __init__
    dump(process_obj, to_child, HIGHEST_PROTOCOL)
  File "C:\Python27\lib\multiprocessing\forking.py", line 199, in dump
    ForkingPickler(file, protocol).dump(obj)
  File "C:\Python27\lib\pickle.py", line 224, in dump
    self.save(obj)
  File "C:\Python27\lib\pickle.py", line 331, in save
    self.save_reduce(obj=obj, *rv)
  File "C:\Python27\lib\pickle.py", line 425, in save_reduce
    save(state)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 655, in save_dict
    self._batch_setitems(obj.iteritems())
  File "C:\Python27\lib\pickle.py", line 687, in _batch_setitems
    save(v)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 568, in save_tuple
    save(element)
  File "C:\Python27\lib\pickle.py", line 286, in save
    f(self, obj) # Call unbound method with explicit self
  File "C:\Python27\lib\pickle.py", line 754, in save_global
    (obj, module, name))
pickle.PicklingError: Can't pickle <function _init_worker at 0x0000000002E20EB8>: it's not found as __main__._init_worker

c:\bagit_python_test>Traceback (most recent call last):
  File "<string>", line 1, in <module>
  File "C:\Python27\lib\multiprocessing\forking.py", line 381, in main
    self = load(from_parent)
  File "C:\Python27\lib\pickle.py", line 1384, in load
    return Unpickler(file).load()
  File "C:\Python27\lib\pickle.py", line 864, in load
    dispatch[key](self)
  File "C:\Python27\lib\pickle.py", line 886, in load_eof
    raise EOFError
EOFError

Enhancement: Status update when calculating manifest

Hi,

We've started using bagit-python and it's very useful. Because manifest creation can take several hours, it would be great to have a more detailed status update as each file is processed. Right now it just says something like "Writing manifest".

Thanks,

Kieran.
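
One way the requested per-file status could look is sketched below. This is not bagit.py's code: the function name `hash_files_with_progress`, the logger name, and the log message format are all assumptions made for illustration. It walks a directory, hashes each file in chunks, and logs one line per file with a running count.

```python
import hashlib
import logging
import os

logging.basicConfig(level=logging.INFO,
                    format="%(asctime)s - %(levelname)s - %(message)s")
LOGGER = logging.getLogger("manifest-progress")  # hypothetical logger name


def hash_files_with_progress(data_dir, algorithm="sha256", chunk_size=1024 * 1024):
    """Return {relative_path: hexdigest}, logging one line per file.

    Hypothetical helper for illustration; not part of bagit.py.
    """
    # Collect the paths first so each log line can show "n of total".
    paths = []
    for root, _dirs, files in os.walk(data_dir):
        for name in files:
            paths.append(os.path.join(root, name))
    paths.sort()

    digests = {}
    for index, path in enumerate(paths, start=1):
        hasher = hashlib.new(algorithm)
        with open(path, "rb") as handle:
            # Read in chunks so large files are never loaded whole.
            for chunk in iter(lambda: handle.read(chunk_size), b""):
                hasher.update(chunk)
        rel = os.path.relpath(path, data_dir).replace(os.sep, "/")
        digests[rel] = hasher.hexdigest()
        LOGGER.info("(%d of %d) hashed %s", index, len(paths), rel)
    return digests
```

Counting the files up front costs one extra directory walk, but it lets each message report how much work remains, which is the part that matters for a job lasting hours.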

Retrieving Bag Metadata

I hope I'm not being blind here, but I don't see a way to read my metadata in 'bag-info.txt'.

In [18]: a = bagit.make_bag('/home/jakecowton/temp/1', {'Contact-Name':"Jake Cowton", 'Contact-Email':"[email protected]", 'Author':"That Guy"})
In [19]: a.tags
Out[19]: {'BagIt-Version': '0.97', 'Tag-File-Character-Encoding': 'UTF-8'}

For some reason I (though presumably 'we') can only see the two default tags, but neither the ones we pass as our own nor the reserved tags listed on lines 52-68 of bagit.py.

Is this something you plan to develop or is there a reason for this?
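
As a workaround sketch that does not depend on bagit.py's API at all, the metadata can be read straight from the file. The function name `read_bag_info` is hypothetical, and the parsing assumes the "Label: value" layout with indented continuation lines that the BagIt specification describes for bag-info.txt.

```python
import io
import os


def read_bag_info(bag_dir, encoding="utf-8"):
    """Return a list of (label, value) pairs from <bag_dir>/bag-info.txt.

    Hypothetical helper for illustration; not part of bagit.py.
    """
    pairs = []
    path = os.path.join(bag_dir, "bag-info.txt")
    with io.open(path, "r", encoding=encoding) as handle:
        for raw in handle:
            line = raw.rstrip("\r\n")
            if not line.strip():
                continue  # skip blank lines
            if line[0] in " \t" and pairs:
                # Indented lines continue the previous value.
                label, value = pairs[-1]
                pairs[-1] = (label, value + " " + line.strip())
            elif ":" in line:
                label, value = line.split(":", 1)
                pairs.append((label.strip(), value.strip()))
    return pairs
```

A list of pairs is returned rather than a dict because a label may legitimately appear more than once in bag-info.txt.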

Updating bag-info.txt makes bag invalid

When a bag has a tag manifest file created using an algorithm other than md5, such as the following bag:

bag-info.txt         
bagit.txt            
data                 
manifest-sha1.txt    
tagmanifest-sha1.txt

Updating bag info results in an invalid bag:

>>> import bagit
>>> bag = bagit.Bag('/Users/twan/edeposit/edeposit/apps/signiant/tests/SR1-4444_20140611200739')
>>> bag.is_valid()
True
>>> bag.info['External-Identifier']='xyz'
>>> bag.save()
>>> bag.is_valid()
False

The problem appears to be that the code assumes the tag manifest file always uses md5:

    # Update tag-manifest for changes to manifest & bag-info files
    _make_tagmanifest_file('tagmanifest-md5.txt', self.path)
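
A possible direction for a fix, sketched under assumptions: detect which tagmanifest-<alg>.txt files already exist and regenerate each one with its own algorithm, falling back to md5 only when none is present. The names `existing_tagmanifest_algorithms` and `rewrite_tagmanifests` are hypothetical and are not bagit.py functions. For brevity the sketch only covers tag files at the top level of the bag.

```python
import glob
import hashlib
import os
import re


def existing_tagmanifest_algorithms(bag_dir):
    """Return algorithm names taken from tagmanifest-<alg>.txt filenames."""
    algorithms = []
    for path in sorted(glob.glob(os.path.join(bag_dir, "tagmanifest-*.txt"))):
        match = re.match(r"^tagmanifest-(.+)\.txt$", os.path.basename(path))
        if match:
            algorithms.append(match.group(1))
    return algorithms


def rewrite_tagmanifests(bag_dir):
    """Regenerate each existing tag manifest using its own algorithm.

    Hypothetical helper; falls back to md5 only if no tag manifest exists.
    """
    algorithms = existing_tagmanifest_algorithms(bag_dir) or ["md5"]
    # Top-level tag files only; tag manifests never list themselves.
    tag_files = sorted(
        name for name in os.listdir(bag_dir)
        if os.path.isfile(os.path.join(bag_dir, name))
        and not name.startswith("tagmanifest-")
    )
    for algorithm in algorithms:
        lines = []
        for name in tag_files:
            with open(os.path.join(bag_dir, name), "rb") as handle:
                digest = hashlib.new(algorithm, handle.read()).hexdigest()
            lines.append("%s %s\n" % (digest, name))
        out_path = os.path.join(bag_dir, "tagmanifest-%s.txt" % algorithm)
        with open(out_path, "w") as out:
            out.writelines(lines)
    return algorithms
```

With the bag layout shown above, this would rewrite tagmanifest-sha1.txt in place and would not create a stray tagmanifest-md5.txt.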
