This is a software library intended to support the creation, manipulation, and validation of "bags" from the BagIt specification. It currently supports versions 0.93 through 1.0.
- Java 8
- gradle (for development only)
We strive to have great documentation! Thus this file follows the recommendations from https://www.divio.com/blog/documentation/. Editors and grammar aficionados are welcome and encouraged to edit this content to make it even better! See "Helping Contribute" below.
A "bag" is a way to transfer files from one location to another and verify that the files sent are complete (you received exactly what you were supposed to receive), and that they are correct (none of the bits have changed). While many people use "bags" for other purposes, they are tangential to the purpose of file transfer.
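The two properties described above, complete and correct, can be sketched with nothing but the JDK. The class below is purely illustrative and is not part of this library: the name ManifestCheckSketch, the in-memory maps, and the choice of SHA-256 are all assumptions made to keep the example self-contained. A real bag stores its checksums in manifest files on disk.

```java
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Map;

// Illustrative sketch only; this class is hypothetical and not part of the library.
final class ManifestCheckSketch {

  // Hex-encoded SHA-256 digest of the given bytes.
  static String sha256Hex(byte[] content) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(content)) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
    }
  }

  // Complete: the received paths are exactly the paths listed in the manifest.
  static boolean isComplete(Map<String, String> manifest, Map<String, byte[]> received) {
    return manifest.keySet().equals(received.keySet());
  }

  // Correct: every listed file is present and its checksum matches the manifest entry.
  static boolean isCorrect(Map<String, String> manifest, Map<String, byte[]> received) {
    for (Map.Entry<String, String> entry : manifest.entrySet()) {
      byte[] content = received.get(entry.getKey());
      if (content == null || !sha256Hex(content).equalsIgnoreCase(entry.getValue())) {
        return false;
      }
    }
    return true;
  }
}
```

Here the manifest maps a relative file path to its expected checksum, and received maps the same kind of path to the bytes that actually arrived. A single flipped bit makes isCorrect return false, while a missing or extra file makes isComplete return false.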
Typically you need to send files out of band (i.e., not over the internet) and you need to ensure that all the files are received correctly. Usually this is because the number of files being transferred is very large, the internet connection is too slow or unreliable, or there is no physical connection to the internet.
To save on transferring all the files (or multiple copies of the same file) you can use a fetch.txt file. This special file lists where those other files are located on the internet. This library does not handle trying to retrieve these files due to the complicated nature of retrieving files over the internet.
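Since the library leaves fetching to you, it can help to know what a fetch.txt line looks like. In the BagIt specification each line has the form "url length filepath", separated by whitespace, where length is a byte count or "-" when unknown. The following is a minimal sketch of reading one such line; the class FetchLineSketch and its fields are hypothetical and not part of this library.

```java
// Illustrative sketch only; this class is hypothetical and not part of the library.
final class FetchLineSketch {
  final String url;
  final long length; // size in bytes, or -1 when the line gives "-" (unknown)
  final String path; // where the file belongs inside the bag

  private FetchLineSketch(String url, long length, String path) {
    this.url = url;
    this.length = length;
    this.path = path;
  }

  // Splits one fetch.txt line into its three fields: url, length, filepath.
  static FetchLineSketch parse(String line) {
    // Limit of 3 keeps any spaces inside the file path intact.
    String[] parts = line.trim().split("\\s+", 3);
    if (parts.length != 3) {
      throw new IllegalArgumentException("Expected 'url length filepath' but got: " + line);
    }
    long length = "-".equals(parts[1]) ? -1L : Long.parseLong(parts[1]);
    return new FetchLineSketch(parts[0], length, parts[2]);
  }
}
```

After parsing, downloading parsed.url into parsed.path and then verifying the bag as usual is left to whatever transfer tool suits your environment.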
There is a special file called bag-info.txt (or package-info.txt for older versions) that is formatted for easy reading by humans. This file is just a list of key-value pairs and, with very few exceptions, has no bearing on the bag other than to give additional information.
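To make the key-value layout concrete, here is a minimal sketch of reading such a file. It assumes the "Label: value" form from the BagIt specification, where a line that starts with whitespace continues the previous value. The class BagInfoSketch is hypothetical; the library has its own reader, which this does not attempt to reproduce.

```java
import java.util.AbstractMap.SimpleEntry;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

// Illustrative sketch only; this class is hypothetical and not part of the library.
final class BagInfoSketch {

  // Returns the entries in file order; a list is used because labels may repeat.
  static List<Map.Entry<String, String>> parse(List<String> lines) {
    List<Map.Entry<String, String>> entries = new ArrayList<>();
    for (String line : lines) {
      if (line.trim().isEmpty()) {
        continue; // skip blank lines
      }
      boolean continuation = Character.isWhitespace(line.charAt(0));
      if (continuation && !entries.isEmpty()) {
        // Fold an indented line into the value of the previous entry.
        Map.Entry<String, String> last = entries.remove(entries.size() - 1);
        entries.add(new SimpleEntry<>(last.getKey(), last.getValue() + " " + line.trim()));
      } else {
        int colon = line.indexOf(':');
        if (colon < 0) {
          throw new IllegalArgumentException("Missing ':' in line: " + line);
        }
        entries.add(new SimpleEntry<>(line.substring(0, colon).trim(),
            line.substring(colon + 1).trim()));
      }
    }
    return entries;
  }
}
```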
The BagIt specification was first created by the Library of Congress because it needed a way to verify that donated material on hard drives was correct and complete.
Path rootDir = Paths.get("RootDirectoryOfExistingBag");
Bag bag = BagReader.read(rootDir);
Path outputDir = Paths.get("WhereYouWantToWriteTheBagTo");
BagWriter.write(bag, outputDir); //where bag is a Bag object
Path folder = Paths.get("FolderYouWantToBag");
boolean includeHiddenFiles = false;
Bag bag = BagWriter.bagInPlace(folder, Arrays.asList("sha512"), includeHiddenFiles);
There are three kinds of validations:
- Verify a bag is complete.
- Verify a bag is correct.
- Just check file count and byte size.
boolean ignoreHiddenFiles = true;
BagVerifier.isComplete(bag, ignoreHiddenFiles);
boolean ignoreHiddenFiles = true;
BagVerifier.isValid(bag, ignoreHiddenFiles); //where bag is a Bag object
This check may be removed in the future since it is mostly a hack of the bag metadata.
BagVerifier.quicklyVerify(bag); //where bag is a Bag object
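The BagIt specification records file count and byte size in the Payload-Oxum entry of bag-info.txt, written as OctetCount.StreamCount (for example 12.2 for two files totalling 12 bytes). Assuming that is the metadata the quick check relies on, the comparison can be sketched as follows. The class PayloadOxumSketch is hypothetical and only illustrates the idea; it is not how the library is necessarily implemented.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.stream.Stream;

// Illustrative sketch only; this class is hypothetical and not part of the library.
final class PayloadOxumSketch {

  // Walks the payload directory and returns "totalBytes.fileCount".
  static String compute(Path dataDir) {
    long octets = 0;
    long streams = 0;
    try (Stream<Path> paths = Files.walk(dataDir)) {
      for (Path path : (Iterable<Path>) paths::iterator) {
        if (Files.isRegularFile(path)) {
          octets += Files.size(path);
          streams++;
        }
      }
    } catch (IOException e) {
      throw new UncheckedIOException(e);
    }
    return octets + "." + streams;
  }

  // True when the declared Payload-Oxum value matches what is on disk.
  static boolean matches(String declared, Path dataDir) {
    return declared.trim().equals(compute(dataDir));
  }
}
```

A match only shows that the sizes and counts agree. It cannot detect changed bytes, which is why a full checksum validation is still needed to prove a bag is correct.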
Path folder = Paths.get("BagYouWantToCheck");
Set<BagitWarning> warnings = BagLinter.lintBag(folder);
Path rootDir = Paths.get("RootDirectoryOfExistingBag");
Bag bag = BagReader.read(rootDir);
InputStream jsonProfile = new URL("https://github.com/bagit-profiles/bagit-profiles/blob/1.1.0/bagProfileFoo.json").openStream();
assert BagLinter.checkAgainstProfile(jsonProfile, bag) == true;
The StandardHasher contains many well-known checksum algorithm implementations. However, there will be times when you want (or need) to use a different algorithm. The BagitChecksumNameMapping holds the mapping between BagIt checksum names and their implementations, and it is the only place you need to modify to change which implementation is used.
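import java.io.BufferedInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardOpenOption;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.Formatter;
// The library's Hasher interface must be imported as well; its package is not shown here.
public enum SHA3Hasher implements Hasher {
INSTANCE; // using enum to enforce singleton
private static final int _64_KB = 1024 * 64;
private static final int CHUNK_SIZE = _64_KB;
// "SHA3-256" ships with the default JDK providers from Java 9 onward; on Java 8 a third-party security provider is needed.
private static final String MESSAGE_DIGEST_NAME = "SHA3-256";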
private MessageDigest messageDigestInstance;
@Override
public String hash(Path path) throws IOException{
reset();
updateMessageDigest(path, messageDigestInstance);
return formatMessageDigest(messageDigestInstance);
}
@Override
public void update(byte[] bytes, int length){
messageDigestInstance.update(bytes, 0, length);
}
@Override
public String getHash(){
return formatMessageDigest(messageDigestInstance);
}
@Override
public void reset(){
messageDigestInstance.reset();
}
@Override
public String getBagitAlgorithmName(){
return "sha3256";
}
private static void updateMessageDigest(final Path path, final MessageDigest messageDigest) throws IOException{
try(final InputStream is = new BufferedInputStream(Files.newInputStream(path, StandardOpenOption.READ))){
final byte[] buffer = new byte[CHUNK_SIZE];
int read = is.read(buffer);
while(read != -1){
messageDigest.update(buffer, 0, read);
read = is.read(buffer);
}
}
}
private static String formatMessageDigest(final MessageDigest messageDigest){
try(final Formatter formatter = new Formatter()){
for (final byte b : messageDigest.digest()) {
formatter.format("%02x", b);
}
return formatter.toString();
}
}
@Override
public void initialize() throws NoSuchAlgorithmException{
messageDigestInstance = MessageDigest.getInstance(MESSAGE_DIGEST_NAME);
}
}
BagitChecksumNameMapping.add("sha3256", SHA3Hasher.INSTANCE);
This is beyond the scope of this project; however, please see https://github.com/bagit-profiles/bagit-profiles for in-depth documentation on profiles.
One of the inspirations for writing this library was to create a simple-to-use interface for creating, reading, writing, verifying, and linting BagIt specification bags. The code therefore tries to adhere to the best practices in Effective Java by Joshua Bloch, as well as to lessons from the team members' own experience.
- Install JDK 8+
- While Gradle may work when run from your IDE, this has not been tested and isn't really supported. Instead, run all Gradle commands from the command line using the gradlew.bat script in the root directory. git-bash generally seems to work, but sometimes crashes for unknown reasons.
- Before submitting a pull request, run ./gradlew.bat clean check and make sure there are no errors.
- Install JDK 8+
- While Gradle may work when run from your IDE, this has not been tested and isn't really supported. Instead, run all Gradle commands from the command line using the gradlew script in the root directory.
- Before submitting a pull request, run ./gradlew clean check and make sure there are no errors.
Many classes were not designed to be used outside this project. The rules are:
- A class or method that has no javadoc is not intended for use outside this project.
- A class in a package named "internal" is not intended for use outside this project.
- A class that is final was not designed to be extended by users outside this project.
All public interfaces and classes have javadocs detailing what the class's responsibilities are, and what the methods do and are used for (http://www.javadoc.io/doc).
To see a nice view of what code is covered by the various tests, check out coveralls.io. We strive to maintain 90% or better code coverage, knowing that testing language specifics (like getters and setters) is not helpful. Ideally we would also have 100% coverage of each branch condition, but again this is more an ideal than a hard requirement.
Because there are many test cases for using the BagIt specification correctly, the Library of Congress decided to create a suite of known issues as well as canonical basic bags for each specification version. These test cases are stored in a git repository and can be found at https://github.com/libraryofcongress/bagit-conformance-suite.git. We use these test cases to ensure we are correctly adhering to the BagIt specification.
Naming stuff is hard, and generally we humans are bad at naming similar things distinctly. To make it easier, here is a short list of definitions:
- bag - a group of files in a particular format that ensures the files received are correct and complete
- BagIt - the name of the specification (that is, how a bag is structured)
- bagit (notice the all-lowercase naming) - the original Java implementation, which also contained a command line utility by the same name
- Bagger - a Java desktop application (GUI) that used bagit to create bags
- Bagging - this library, which tries to correctly implement all versions of the BagIt specification in a clean and concise way
The short answer: no. The long answer is that having a Java command line utility has caused more confusion and frustration than it has helped. If you need a command line implementation, try taking a look at the Python implementation from the Library of Congress instead.
You don't, because a file cannot change after its checksum has been generated. If you add a line to the tag manifest containing the tag manifest's own checksum, the file changes, so its checksum changes and the recorded value is no longer valid.
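The circularity can be demonstrated in a few lines. The sketch below hashes some manifest text, appends a line recording that hash, and hashes again. The class name, the SHA-256 choice, and the tagmanifest-sha256.txt file name are assumptions made for illustration only.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Illustrative sketch only; this class is hypothetical and not part of the library.
final class TagManifestSelfReferenceSketch {

  // Hex-encoded SHA-256 digest of the given text, encoded as UTF-8.
  static String sha256Hex(String text) {
    try {
      MessageDigest digest = MessageDigest.getInstance("SHA-256");
      StringBuilder hex = new StringBuilder();
      for (byte b : digest.digest(text.getBytes(StandardCharsets.UTF_8))) {
        hex.append(String.format("%02x", b));
      }
      return hex.toString();
    } catch (NoSuchAlgorithmException e) {
      throw new IllegalStateException("SHA-256 should be available on every Java platform", e);
    }
  }

  // Records the manifest's own checksum inside it, then reports whether
  // that recorded checksum still matches the now-modified text.
  static boolean selfEntryStillMatches(String manifestText) {
    String recorded = sha256Hex(manifestText);
    String withSelfEntry = manifestText + recorded + "  tagmanifest-sha256.txt\n";
    return recorded.equals(sha256Hex(withSelfEntry));
  }
}
```

For any realistic input selfEntryStillMatches returns false: writing the checksum into the file is itself a change to the file.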
Checking or creating a bag while it is still compressed is not supported, and there are no plans to support it in the future. I would recommend that you use your favorite application to compress or decompress the bag and then work with it as normal.
Maybe; it really depends on what you are trying to achieve. Are you trying to create a super safe copy of your files that will prevent bitrot and other storage problems from ever happening? Then no, the BagIt specification won't help you, because it cannot keep files safe (that takes erasure codes, multiple copies, etc.); it only aids in checking completeness and correctness. If that is your use case, I would recommend that you use the BagIt specification when you receive files outside the internet, and then use other software to ensure those files stay safe from corruption.
If you find a bug in Bagging please let us know by submitting a bug report! When creating a bug report please try to include the following information which will greatly aid in resolving the issue faster:
- a small example showing the incorrect behavior
- the expected behavior
- the actual behavior
- the operating system being used
- version of Bagging being used
We would love to hear your ideas to make this library even better! First submit a new issue discussing what feature you would like added. Please include the following information when submitting:
- Current behavior (if applicable).
- Proposed behavior.
- Why this feature is useful.
- A small code example if possible.
It is impossible for this documentation to cover all questions that you might have. Therefore, if you don't understand something or would like more clarification, please submit an issue with your question. I will try to answer it as best I can and, if useful, the answer will be added to the FAQ section.
If you value this project, please consider contributing! All pull requests will be reviewed with the aim of incorporating them into this project. Don't know how to submit a pull request? No worries, check out GitHub's great documentation at https://help.github.com/articles/about-pull-requests/. You will also need to sign a document stating that you freely give all copyright over to this project for any submitted pull requests. Some of the items we will check when you submit a pull request are:
- Does the pull request follow the style of the rest of the project?
- Do all the tests and other code quality checks still pass?
- Does the pull request maintain the same level of code coverage?
- If adding new functionality, were test cases added for the base case and several edge cases?
- Was documentation updated (if applicable)?
From the start Bagging was built knowing that not all people speak English. If you are able to translate from English to another language we would love your help! Please see the link for Transifex to get started.
There is a very active community around digital archiving. One part of it is the Digital Curation Google Group (https://groups.google.com/d/forum/digital-curation), an open discussion list that reaches many of the contributors to and users of this project.
- Maintain interoperability with the BagIt specification.
- Fix bugs/issues reported (ongoing).
- Translate to various languages (ongoing).