league / rngzip Goto Github PK
View Code? Open in Web Editor NEWType-based XML compression
Home Page: http://contrapunctus.net/league/haques/rngzip/
License: Other
Type-based XML compression
Home Page: http://contrapunctus.net/league/haques/rngzip/
License: Other
Introduction ============ RNGzip is a schema-based compressor for XML. This file documents version 0.6 (July 2007) by Christopher League. See also: > <http://contrapunctus.net/league/haques/rngzip/> RNGzip uses information from the schema of an XML document to achieve better compression (usually) than is possible with a general-purpose tool such as gzip. We use the Relax NG schema language, which is flexible, yet formally well-defined. A separate tool called [trang](http://www.thaiopensource.com/relaxng/trang.html) can convert other schema formats to RNG. To compress an XML stream with RNGzip, it *must* be valid according to some schema. Documents that don’t match the schema are literally not compressible with this technique. More importantly, to decompress, you must provide a copy of _precisely the same_ schema that was used to compress the document. This is one of the major caveats for using RNGzip. When your schema evolves, be sure to give it a new name or version number; otherwise, your compressed data may become inaccessible. Here is a simple use case. The [National Center for Biotechnology Information](http://www.ncbi.nih.gov/) at the NIH provides lots of data, optionally available in various XML formats. Here is an example of a search result from PubMed: % head -20 pubmed01.xml <?xml version="1.0"?> <!DOCTYPE PubmedArticleSet PUBLIC "-//NLM//DTD PubMedArticle, 1st November 2004//EN" "http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd"> <PubmedArticleSet> <PubmedArticle> <MedlineCitation Owner="NLM" Status="In-Data-Review"> <PMID>16102004</PMID> <DateCreated> <Year>2005</Year> <Month>08</Month> <Day>16</Day> </DateCreated> <Article PubModel="Print"> <Journal> <ISSN>0950-382X</ISSN> <JournalIssue> <Volume>57</Volume> <Issue>5</Issue> <PubDate> <Year>2005</Year> <Month>Sep</Month> The file is 96,264 bytes long. If we compress it using gzip, the result is just 9,215 bytes, or about 90% smaller. This is pretty good already, but RNGzip can do better. Here is how it works. First, we must convert the PubMed DTD into the Relax NG format: % trang http://www.ncbi.nlm.nih.gov/entrez/query/DTD/pubmed_041101.dtd \ pubmed.rng Next, we zip it: % java -jar rngzip.jar -v -s pubmed.rng pubmed01.xml loading file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng building automaton... done pubmed01.xml 92.28% -- replaced with pubmed01.xml.rnz The size of the result is 7,431, which is about 92% smaller than the original and 19% smaller than the gzipped version. We can ask RNGzip to provide some information about this file, including the URL of the schema, which was embedded for convenience. % java -jar rngzip.jar -iv pubmed01.xml.rnz pubmed01.xml.rnz: 7431 bytes HUFFMAN/GZ/GZ file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng(e6133345) To decompress, we do not need to specify the schema again unless its location has changed. % java -jar rngzip.jar -dvp2 pubmed01.xml.rnz loading file:/Users/league/r/xml-group/src/rngzip/tests/ncbi/pubmed.rng building automaton... done pubmed01.xml.rnz 91.28% -- replaced with pubmed01.xml % head pubmed01.xml <?xml version="1.1" encoding="UTF-8"?> <PubmedArticleSet> <PubmedArticle> <MedlineCitation Owner="NLM" Status="In-Data-Review"> <PMID>16102004</PMID> <DateCreated> <Year>2005</Year> <Month>08</Month> <Day>16</Day> </DateCreated> Notice that the spacing has changed slightly, and the DOCTYPE has been dropped. Unlike a generic compressor, RNGzip does not store the ignorable spaces at all. Instead, it can optionally reformat the XML as it decompresses; the option `-p2` means to pretty-print with an indentation of two spaces. RNGzip also drops any DOCTYPE declarations or processing instructions. You may think that it is unfair, because gzip must preserve this information that RNGzip discards. But if we remove all the ignorable space from the XML first, it doesn't make much difference in the size of gzip's output anyway. % xmllint --noblanks pubmed01.xml > pubmed01.nb.xml % gzip -v pubmed01.nb.xml pubmed01.nb.xml: 86.5% -- replaced with pubmed01.nb.xml.gz The XML with ignorable spaces dropped is 62,377 bytes, and compressed by gzip it is 8,432. In general, the more ‘taggy’ an XML document is — and the more precisely those tags are constrained by the schema — the better the compression will be. Less taggy XML documents will essentially degenerate into the compression ratio of gzip, plus a small overhead. For example, the plays of Shakespeare have been marked up in XML by Jon Bosak. These represent mostly text, with light markup to indicate who is speaking, and where lines begin and end. In these examples, RNGzip does roughly the same as gzip, on average. % ls -l taming.xml* -rw-r--r-- league 200,918 Jul 15 1999 taming.xml -rw-r--r-- league 53,479 Jul 7 12:31 taming.xml.gz -rw-r--r-- league 52,820 Jul 7 12:31 taming.xml.rnz The compressed forms are both roughly 73% smaller than the original, with the RNGzip version just 1.2% smaller than the gzipped one. You may be aware that gzip does not yield the best compression ratio of any generic compressor; rather, it is a good trade-off between speed and size. The block-sorting compressor bzip2 tends to yield much smaller files, although it is slower than gzip. For example, taming.xml.bz2 is just 39,502 bytes, beating RNGzip by 25%! This is because RNGzip actually uses a generic compressor internally, and by default, it uses gzip (zlib). To make RNGzip more competitive against bzip2, a Java implementation of bzip2 could be substituted within RNGzip. Incidentally, on the PubMed example above, RNGzip (with zlib) still beats bzip2 by 9%. ---- » [Installation](INSTALL.html)
A declarative, efficient, and flexible JavaScript library for building user interfaces.
🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.
TypeScript is a superset of JavaScript that compiles to clean JavaScript output.
An Open Source Machine Learning Framework for Everyone
The Web framework for perfectionists with deadlines.
A PHP framework for web artisans
Bring data to life with SVG, Canvas and HTML. 📊📈🎉
JavaScript (JS) is a lightweight interpreted programming language with first-class functions.
Some thing interesting about web. New door for the world.
A server is a program made to process requests and deliver data to clients.
Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.
Some thing interesting about visualization, use data art
Some thing interesting about game, make everyone happy.
We are working to build community through open source technology. NB: members must have two-factor auth.
Open source projects and samples from Microsoft.
Google ❤️ Open Source for everyone.
Alibaba Open Source for everyone
Data-Driven Documents codes.
China tencent open source team.