Comments (3)

xdanieldzd commented on September 9, 2024

Just added some functionality for this in the form of the FormatDetectionAttribute, plus an example in the P5BustupBIN container format.

You can now specify a static method to be executed during file detection. It should return true when the file appears to be valid (header values appear sane, etc.), or false when it appears invalid (e.g. the actual file size is less than what the header specifies). This should work fine in conjunction with the existing attributes as well. I'm open to suggestions on how to improve this, too; it might be a bit "quick and dirty" right now.
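
As a language-neutral illustration of such a detection function, here is a minimal Python sketch. It is not Scarlet's actual C# code: the 8-byte header layout (a little-endian entry count followed by a declared total size) and the name `looks_valid` are assumptions made up for this example.

```python
import io
import struct

# Hypothetical header layout, assumed for this sketch only:
# 4-byte little-endian entry count, then 4-byte little-endian declared file size.
HEADER_SIZE = 8


def looks_valid(stream):
    """Return True if the header values look sane, False otherwise.

    The stream position is restored before returning, so later
    detection steps see the stream exactly as they were given it.
    """
    start = stream.tell()
    try:
        actual_size = stream.seek(0, io.SEEK_END)  # seek() returns the new position
        stream.seek(0)
        header = stream.read(HEADER_SIZE)
        if len(header) < HEADER_SIZE:
            return False  # too short to even hold a header
        entry_count, declared_size = struct.unpack("<II", header)
        if entry_count == 0:
            return False  # an empty container is treated as implausible here
        # The header must not claim more data than the file actually has.
        return declared_size <= actual_size
    finally:
        stream.seek(start)
```

A file whose header claims 64 bytes but which is only 16 bytes long would be rejected by the final size comparison.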

from scarlet.

darkstar commented on September 9, 2024

Yes, it seems to work. However, the FormatDetection should not be run as an alternative ("OR") to the other detections, but as an additional step ("AND").

Two examples:

If the FilenamePattern indicates a match, and the FormatDetection function indicates no match, then the end result should be "no match". I ran into this when trying to define a match based on a file name pattern AND a format detection function.

If the filename does not match, but the FormatDetection indicates a match, the end result should be "no match" (or rather, in this case, the FormatDetection should probably not be run at all). This happens with the P5BustupBIN giving false positives on some Disgaea3 files, which have a totally different extension (.pac instead of .bin/.dds2), but the P5 ContainerFormat still tries to unpack them because the (rather simple/generic) format detection flags them as a match. See my PR for a slight improvement to the detection function. It is still not perfect; for the PAC files I chose to do a full file header verification to reduce false positives to a minimum.

So, to summarize, I think it should work like this:

  • The magic number attribute is a MUST. If it does not match, there is no use in trying to process the file with the plugin in question.
  • The detection function is a MUST. If it indicates a non-match, there is no need to try to process the file. It should be written to filter out "false positives" as much as possible.
  • The file name match is more of a hint to the program that not every plugin should be tried on every file. It "pre-selects" the possible plugins.

The rationale is this: there is probably little use in trying to, say, decompress an archive where you know the magic number is incorrect, or where you know (by some heuristic in the detection function) that it "looks" invalid. The developer could not possibly have foreseen how such a file should be handled; otherwise they would have put the correct magic number in, or changed the detection function.
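
The three rules above could be sketched roughly as follows. This is Python used purely as illustration, not Scarlet's implementation: the `Plugin` record, its field names, and the `matches` helper are all hypothetical, and each plugin is assumed to carry at most one pattern, one magic number, and one detection function.

```python
import fnmatch
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class Plugin:
    # Hypothetical stand-in for a format class and its attributes.
    name: str
    filename_pattern: Optional[str] = None
    magic: Optional[bytes] = None
    detect: Optional[Callable[[bytes], bool]] = None


def matches(plugin, filename, data):
    """AND together all checks the plugin defines."""
    # File name acts as a pre-selection: on a mismatch, the more
    # expensive checks below are not run at all.
    if plugin.filename_pattern is not None:
        if not fnmatch.fnmatchcase(filename.lower(), plugin.filename_pattern.lower()):
            return False
    # Magic number is a MUST when the plugin defines one.
    if plugin.magic is not None and not data.startswith(plugin.magic):
        return False
    # Detection function is a MUST when the plugin defines one.
    if plugin.detect is not None and not plugin.detect(data):
        return False
    return True
```

With this ordering, a `.pac` file never reaches the detection function of a plugin that only declares `*.bin`, which avoids the false positive described above.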

darkstar commented on September 9, 2024

After thinking about it a bit more, this might be a simpler heuristic:
First, check every MagicNumber attribute. At least one of them must match. If there is no such attribute, default to "matched".
Then, check every DetectionFunction attribute. At least one of them must match. If there is no such attribute, default to "matched".
If the AND of these two results is a match, check the filename pattern. If it also matches, use the corresponding plugin. If it does not match, "fail" the match, but let the user override this failure into a "matched" by providing a command line argument, e.g. "-ignoreFilenamePatterns". Or, maybe even better, automatically ignore the FilenamePattern if none of the plugins matches all three attribute types, and just try every plugin that matches at least the first two (MagicNumber, DetectionFunction).
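
A rough Python sketch of this revised heuristic, using the automatic fallback variant rather than the command line switch. Everything here is hypothetical illustration rather than Scarlet code: the `Plugin` record, the helper names, and the `PAC0` magic value in the usage note are invented. Treating a plugin without any filename pattern as "matched" is also an assumption, made by analogy with the other two attribute types.

```python
import fnmatch
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class Plugin:
    # Hypothetical stand-in for a format class; each list mirrors one attribute type.
    name: str
    magics: List[bytes] = field(default_factory=list)
    detectors: List[Callable[[bytes], bool]] = field(default_factory=list)
    patterns: List[str] = field(default_factory=list)


def content_matches(plugin, data):
    # At least one attribute of each type must match; a missing type defaults to "matched".
    magic_ok = not plugin.magics or any(data.startswith(m) for m in plugin.magics)
    detect_ok = not plugin.detectors or any(d(data) for d in plugin.detectors)
    return magic_ok and detect_ok


def name_matches(plugin, filename):
    # Assumption: no pattern at all also defaults to "matched".
    name = filename.lower()
    return not plugin.patterns or any(
        fnmatch.fnmatchcase(name, p.lower()) for p in plugin.patterns
    )


def select_plugins(plugins, filename, data):
    by_content = [p for p in plugins if content_matches(p, data)]
    by_all = [p for p in by_content if name_matches(p, filename)]
    # Fallback: if no plugin matches all three attribute types,
    # ignore the filename patterns and return every content match.
    return by_all or by_content
```

For example, given one plugin with a generic detector and patterns `*.bin`/`*.dds2`, and another with magic `PAC0` and pattern `*.pac`, a file named `x.pac` starting with `PAC0` selects only the second plugin, while the same bytes under the name `x.dat` fall back to every plugin whose content checks pass.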
