Comments (24)

weltkante commented on June 16, 2024

Your 7z file contains data compressed with PPMD instead of the usual LZMA/LZMA2. PPMD (along with DELTA and BZIP2) is currently not supported by the library, but I'll take a look at how hard it would be to add decoders for them.

Also, I'm getting a NotImplementedException. Are you sure you are getting InvalidDataExceptions? If so, what version of the library are you using, and what does the stack trace look like?

from managed-lzma.

igvk commented on June 16, 2024

Thank you, Tobias!
You are right, it's a NotImplementedException. The InvalidDataException was due to a bug in my code.
Also, is it possible to use your library with an unseekable stream (as with the SharpCompress IReader interface)?
I need it to support reading not only from files, but also from an archive that is part of some other container file (for example, an archive inside another archive).

weltkante commented on June 16, 2024

The 7z file format is not designed to be used without seeking. The header is split and part of it is usually stored at the end of the file, after the compressed data. (That is actually normal for compression archive files because you can't write the complete header before having compressed everything.)

So someone needs to buffer the stream somewhere. I'm not doing that automatically because there are many ways to do it, and I want to leave it up to the call site to select the buffering best suited to it. Some choices you have:

  • you could make a copy of the input stream in advance or on the fly
  • the content could be buffered in memory, in the swap file, or on the disk
  • you may not want to buffer at all and instead implement seeking by reopening the source stream and skipping data; you could also keep multiple readers of the source stream open in some scenarios and switch between them, to optimize alternating seeks

These choices all have advantages and disadvantages, so there is no obvious best choice, and I decided it's better to leave the choice up to the calling program, which knows better what it needs.
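
As a rough illustration of the first two options (copying the input in advance, either into memory or onto disk), a caller-side helper could look like the sketch below. It uses only System.IO; the class and method names (SeekableBuffering, BufferInMemory, BufferOnDisk) are made up for this example and are not part of managed-lzma.

```csharp
using System;
using System.IO;

// Hypothetical caller-side helper, not part of the library.
public static class SeekableBuffering
{
    // Copies a forward-only stream into memory and returns a seekable
    // stream positioned at the start. The caller disposes the result.
    public static Stream BufferInMemory(Stream source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));
        if (source.CanSeek) return source; // already seekable, nothing to do

        var buffer = new MemoryStream();
        source.CopyTo(buffer);
        buffer.Position = 0;
        return buffer;
    }

    // Same idea, but spills to a temporary file that is deleted when the
    // returned stream is closed; meant for archives too large for memory.
    public static Stream BufferOnDisk(Stream source)
    {
        if (source == null) throw new ArgumentNullException(nameof(source));

        var temp = new FileStream(
            Path.GetTempFileName(), FileMode.Create, FileAccess.ReadWrite,
            FileShare.None, 81920, FileOptions.DeleteOnClose);
        try
        {
            source.CopyTo(temp);
            temp.Position = 0;
            return temp;
        }
        catch
        {
            temp.Dispose(); // also removes the temporary file
            throw;
        }
    }
}
```

The returned stream can then be handed to whatever expects a seekable input. Which variant fits depends on archive size and memory budget, which is exactly the trade-off described above.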

igvk commented on June 16, 2024

Yes, I understand. The main problem is that the header is usually stored at the end of the file.
OK, I will use only seekable streams for 7z.
The main request, then, is to support PPMd.
I tried the PPMd implementation in SharpCompress, and it works, but it is extremely slow (much more than 10 times slower than the native implementation). Is there any chance of making it comparable to native code in the managed library?
Another thing I am trying to figure out is how to get the file name of the current entry in the archive (inside the loop with dsReader)?

weltkante commented on June 16, 2024

> Another thing I am trying to figure out is how to get the file name of the current entry in the archive (inside the loop with dsReader)?

Oops, the ArchiveFileModelMetadataReader is incomplete in that regard. I was in a bit of a hurry to package up the first version of the NuGet package, so I missed that this class isn't fully functional. I'll try to get a fix uploaded later today. If you want to work around the issue, you could make your own subclass of ArchiveMetadataReader to capture and expose the filenames.

weltkante commented on June 16, 2024

I've created a separate issue for the file metadata problem and have uploaded an implementation that exposes the file metadata to the caller without having to implement a metadata reader. More details are in issue #13.

igvk commented on June 16, 2024

I've implemented 7-Zip (LZMA) reading according to the example.
And it works!
The API usage looks a little "verbose", but it's OK.

weltkante commented on June 16, 2024

That's (probably) intentional: I wanted a low-level API which doesn't impose unnecessary overhead on the caller and provides access to all the "special" features of the 7z format. This made for some odd design decisions, like the metadata reader API being separate from the content decoder. Also, as I mentioned in some places, I'll be working on SharpCompress integration, which will probably be a better fit for people who "just want it to work" and don't care about all the details of how the data is stored and loaded.

That said, if you have suggestions for improvement, feel free to open an issue for it and I'll consider it. Issues for questions/comments are also fine, I'll try to explain the design decisions there until I can move it into proper documentation.

About the PPMD compression, I've taken a first look and it seems reasonable to include the decoder in the library, but it will take a few days to translate into managed code. Can't say anything about performance yet. What framework are you on, full desktop framework, or something else?

igvk commented on June 16, 2024

The intention to include PPMD compression is great news!
I am using the .NET desktop framework, and it's OK to use the latest version (and unsafe code, if it improves performance).
As for design decisions, the main "strange" thing for me is the synchronization between metadata and content. In the example, the section index is always zero:
var mdFiles = mdModel.GetFilesInSection(0);
Is it correct?
Also, implicit increment of CurrentStreamIndex and using it to access metadata looks fragile.
But maybe I'm just not that used to the 7-Zip format.
Using the sample code to read the file I needed was actually easy.

weltkante commented on June 16, 2024

> In the example, the section index is always zero:
> var mdFiles = mdModel.GetFilesInSection(0);
> Is it correct?

That was a typo when updating the sample code, sorry, I've fixed it.

A 7z file consists of sections of encoded data, and each section can use a different compression scheme. It's usual in 7z to compress exe/dll files differently from text files, and some incompressible files are just stored and not compressed at all. These things go into different sections.

Within a section, many files are concatenated into one huge stream instead of being compressed individually as zip does. This improves the compression ratio considerably because similar files can reuse content from neighbouring files. The 7z frontend is smart enough to sort files by file extension (and also by filename) in the hope of placing similar files together, but that's not in the library and has to be done manually if you are compressing files.
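
A minimal sketch of doing that ordering by hand before compressing, assuming a plain list of paths. This only approximates the idea (extension first, then filename); it is not the actual ordering algorithm of the 7z frontend, and SolidOrdering is a made-up name for this example.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;

// Hypothetical helper: orders paths so that files with the same
// extension end up next to each other in the concatenated stream.
public static class SolidOrdering
{
    public static List<string> OrderForCompression(IEnumerable<string> paths)
    {
        return paths
            .OrderBy(p => Path.GetExtension(p), StringComparer.OrdinalIgnoreCase)
            .ThenBy(p => Path.GetFileName(p), StringComparer.OrdinalIgnoreCase)
            .ToList();
    }
}
```

For example, "b.txt", "a.dll", "a.txt" would be ordered as "a.dll", "a.txt", "b.txt", keeping the two text files adjacent.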

Since not every entry in the metadata has a stream, you have to perform a mapping from file metadata to streams. Empty files and "marker files" have no corresponding stream. So just iterating over the streams like the sample code does will not give you all files, only those which have actual content; the example code is bad for this purpose, I guess. If you want to extract all files, you'd have to iterate over the file metadata.

> Also, implicit increment of CurrentStreamIndex and using it to access metadata looks fragile.

It's not implicit: CurrentStreamIndex is incremented when you call DecodedSectionReader.NextStream. There is nothing fragile, since the DecodedSectionReader uses the metadata you pass in to subdivide the decoded section into streams. If you look at the implementation of the StreamCount property, you'll see it is just a convenience member forwarding the count from the metadata.

Technically you could do the subdivision yourself by creating an ArchiveSectionDecoder which decodes the whole section as one big stream. DecodedSectionReader is just a wrapper over that which uses the metadata to split the big stream into small streams, providing an iterator-like API to advance to the next stream (skipping data automatically if you didn't read everything).

And yeah, all this really needs to be documented, I'll get to it soon :-)

igvk commented on June 16, 2024

Ok, I understand.
Thanks for all the clarification.
So, if I want to iterate over the metadata, how do I do it correctly, and what is the right way to access the corresponding stream?

weltkante commented on June 16, 2024

With the last update I also added a section and stream index property (wrapped in a struct) on the file metadata. see here

I'll probably rename it because .StreamIndex.StreamIndex looks a bit odd, but for the time being you could use this to figure out what to load.

The "right way" to unpack everything is to make a list of files and sort it by section and by stream index, or you could just iterate over ArchiveFileModel.Files, which should already be sorted. Then use dual iterators, one iterating over the sorted file metadata and one iterating over the decoded streams, skipping over streams you don't need to unpack. (Note that some decoded streams may be "unused" and not appear in the file metadata. This should normally not happen but may happen in corner cases.) In particular, you'll notice that files with a Length of zero won't have a section/stream but will still be in the file metadata, as a notice for you to create an empty file. If you also want empty directories, you'll have to do a hierarchical traversal of the RootFolder.
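
The dual-iterator idea can be sketched independently of the library. The example below is only an illustration under stated assumptions: FileEntry is a hypothetical stand-in for the file metadata (it is not the library's ArchivedFile type), streamCounts stands in for the number of streams per section, and the method returns a list of planned actions instead of actually decoding anything.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical stand-in for the file metadata; not the real API.
public sealed class FileEntry
{
    public string Name;
    public int? Section; // null when the entry has no stream (empty file)
    public int? Stream;  // index of the stream within its section
}

public static class DualIteration
{
    // Walks the file entries and the (section, stream) positions side by
    // side. streamCounts[s] is the number of streams in section s.
    public static List<string> Plan(IEnumerable<FileEntry> files, IReadOnlyList<int> streamCounts)
    {
        var actions = new List<string>();
        var withStream = new List<FileEntry>();
        foreach (var file in files)
        {
            if (file.Section.HasValue && file.Stream.HasValue) withStream.Add(file);
            else actions.Add("create empty " + file.Name); // no data to decode
        }

        // Sort by section, then by stream index within the section.
        withStream.Sort((a, b) =>
        {
            int bySection = a.Section.Value.CompareTo(b.Section.Value);
            return bySection != 0 ? bySection : a.Stream.Value.CompareTo(b.Stream.Value);
        });

        int next = 0; // iterator over the sorted file entries
        for (int section = 0; section < streamCounts.Count; section++)
        {
            for (int stream = 0; stream < streamCounts[section]; stream++)
            {
                bool wanted = next < withStream.Count
                    && withStream[next].Section.Value == section
                    && withStream[next].Stream.Value == stream;
                if (wanted) actions.Add("extract " + withStream[next++].Name);
                else actions.Add("skip " + section + "/" + stream); // unused stream
            }
        }
        return actions;
    }
}
```

In a real unpacker the "extract" and "skip" branches would open or advance past the decoded stream, and the "create empty" branch would create a zero-length file.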

I'm aware that I probably should add convenience methods for this, but I didn't have time for that when I wrote the quick fix to expose the file metadata. I might get some work done next week.

igvk commented on June 16, 2024

Here is the code that I used for 7-zip reading:

            ArchiveFileModelMetadataReader mdReader = new ArchiveFileModelMetadataReader();
            ArchiveFileModel mdModel = mdReader.ReadMetadata(stream);
            PasswordStorage password = PasswordStorage.Create(streamname);
            ArchiveMetadata metadata = mdModel.Metadata;
            int numbrOfSections = metadata.DecoderSections.Length;
            for (int sectionIndex = 0; sectionIndex < numbrOfSections; sectionIndex++)
            {
                ImmutableList<ArchivedFile> mdFiles = mdModel.GetFilesInSection(sectionIndex);
                DecodedSectionReader dsReader = new DecodedSectionReader(stream, metadata, 0, password);
                for (int index = 0; index < mdFiles.Count; index++, dsReader.NextStream())
                {
                    ArchivedFile mdFile = mdFiles[index];
                    if (mdFile == null)
                        continue;
                    if ((mdFile.Attributes & FileAttributes.Directory) != 0)
                        continue;
                    string filename = mdFile.Name;
                    if (entryRegex != null && !entryRegex.IsMatch(filename))
                        continue;
                    try
                    {
                        using (Stream arcstream = dsReader.OpenStream())
                        {
                             . . .
                        }
                    }
                    catch (..)
                    {
                    }
                }
            }

Please advise whether it's correct.

weltkante commented on June 16, 2024

The quoted code has several issues:

  • you don't pass sectionIndex to DecodedSectionReader (probably missed that I changed that in the sample code)
  • you still iterate over section streams, missing empty files and empty directories
  • checking attributes for the "directory" flag is unnecessary because you are only iterating over files, it will never be set (unless the writer of the archive screwed up and set the directory flag on a file)

I'll write up some more complete samples instead of that test code, it should be ready later today.

weltkante commented on June 16, 2024

I've added a more complete unpack example here

While writing it I found some more problems. One of them can be worked around (the workaround is included in the sample); another is a bug in the section mappings produced by the metadata reader in certain cases.

I'll push fixes for these issues later (and update the NuGet package), but they won't affect the sample code, so I wanted to get that one out first (removing the workaround will be possible but optional).

weltkante commented on June 16, 2024

The NuGet package is updated, and the workaround for full filenames is removed from the sample.

I've also started working on the PPMD decoder; it should be possible to get it working sometime this week.

igvk commented on June 16, 2024

Thank you!
I'll get it and reimplement my code the new way.

igvk commented on June 16, 2024

As far as I understand, ArchiveFileModel.GetFilesInSection() doesn't expose empty files. What if I would like to see them during all the unpacking (in the order they are present in the archive)?
If I look into ArchiveFileModel.RootFolder, I see that its property Name is null, and it contains only one item - the first non-zero length file.

My second question: I see that Decoder.SkipOutputData() is not implemented yet. Is it possible to skip some files that I am not interested in decoding? DecodedSectionReader.NextStream() does not work in this case.

Here is the code I use:

            ArchiveFileModelMetadataReader archiveMetadataReader = new ArchiveFileModelMetadataReader();
            ArchiveFileModel archiveFileModel = archiveMetadataReader.ReadMetadata(stream);
            PasswordStorage passwordStorage = PasswordStorage.Create(streamname);
            ArchiveMetadata archiveMetadata = archiveFileModel.Metadata;
            int numberOfSections = archiveMetadata.DecoderSections.Length;
            for (int sectionIndex = 0; sectionIndex < numberOfSections; sectionIndex++)
            {
                DecodedSectionReader sectionReader = new DecodedSectionReader(stream, archiveMetadata, sectionIndex, passwordStorage);
                ImmutableList<ArchivedFile> sectionFiles = archiveFileModel.GetFilesInSection(sectionIndex);
                for (int index; (index = sectionReader.CurrentStreamIndex) < sectionFiles.Count; sectionReader.NextStream())
                {
                    ArchivedFile fileMetadata = sectionFiles[index];
                    if (fileMetadata == null)
                        continue;
                    string filename = fileMetadata.Name;
                    if (entryRegex != null ? !entryRegex.IsMatch(filename) : entryname != null && filename != entryname)
                        continue;
                    filename = fileMetadata.FullName;
                    try
                    {
                        using (Stream substream = sectionReader.OpenStream())
                        {
                        }
                    }
                    catch ()
                    {
                    }
                }
            }

Sample file is attached.
RIH3-7z.zip

weltkante commented on June 16, 2024

Removing the remaining NotImplementedExceptions is at the top of my to-do list; sorry for the inconvenience.

> Is it possible to skip some files that I am not interested in decoding?

As a workaround, you can read the stream and discard the data. That's less overhead than it sounds, since the files are concatenated in one compressed stream which has to be decoded anyway to skip ahead. (You can observe that "performance problem" with other 7z applications when extracting individual files: they have to decompress a whole section until they reach the content.)
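
A minimal sketch of that read-and-discard workaround, using only System.IO. StreamSkipping.Discard is a made-up helper name, not a library method; it works on any readable Stream, including one obtained from OpenStream().

```csharp
using System;
using System.IO;

// Hypothetical helper: skips data by reading it into a scratch buffer.
public static class StreamSkipping
{
    // Reads and throws away up to 'count' bytes. Returns how many bytes
    // were skipped, which is less than 'count' only if the stream ended.
    public static long Discard(Stream stream, long count)
    {
        if (stream == null) throw new ArgumentNullException(nameof(stream));
        if (count < 0) throw new ArgumentOutOfRangeException(nameof(count));

        var scratch = new byte[81920];
        long skipped = 0;
        while (skipped < count)
        {
            int want = (int)Math.Min(scratch.Length, count - skipped);
            int got = stream.Read(scratch, 0, want);
            if (got == 0) break; // end of stream
            skipped += got;
        }
        return skipped;
    }
}
```

To skip a whole entry, pass its length from the file metadata, or long.MaxValue to drain the stream to its end.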

> What if I would like to see empty files during all the unpacking (in the order they are present in archive)?

I don't think you can draw much value out of that, since the order of entries has no meaning to you or the user. As I mentioned earlier, during compression files are usually reordered by various criteria to improve compression rates, so you can't expect their order to carry any semantic meaning.

But I realize that a flat list is simpler to code against when you write an application which doesn't need to show any UI, so I'll be adding a way to get that list without having to traverse the tree structure.

> If I look into ArchiveFileModel.RootFolder, I see that its property Name is null, and it contains only one item - the first non-zero length file.

I can't reproduce that. With the attached 7z file, the RootFolder should contain an ArchivedFolder, which itself again contains an ArchivedFolder, and so on, down 3 levels; there should be 3 ArchivedFiles.

The hierarchical file metadata is designed for a UI application which presents an explorer-like interface, showing the content of individual folders and allowing navigation between them. So it is correct that the top-level folder has no name, because it is just a container for the top-level items.

igvk commented on June 16, 2024

Yes, I see this hierarchy now.
It just happens that, starting from RootFolder, all sub-items in the hierarchy (except for the last level) have the same FullName, equal to the first non-empty file name.
Yes, the Name property is different, and equal to the corresponding subdirectory name.

weltkante commented on June 16, 2024

> It just happens that, starting from RootFolder, all sub-items in the hierarchy (except for the last level) have the same FullName, equal to the first non-empty file name.

Oh, thanks, that is actually a bug I didn't notice :-)

weltkante commented on June 16, 2024

Updated the NuGet package with the PPMD decoder, (unoptimized) support for skipping ahead in streams, various bugfixes, and a change in the FileAttributes API to automatically filter out attributes which should not be set. If you were setting attributes, you may want to take another look at the sample; it should now be simpler.

igvk commented on June 16, 2024

I have tested PPMD decompression, and I am pleased to say that it works well.
It is not as fast as the native decompressor, but it performs much better than what I saw previously.

weltkante commented on June 16, 2024

Closing this since the basic support is there; I made a separate issue for cleaning up the implementation (#21).
