Coder Social home page Coder Social logo

MD5Sum for output, as an action option about sos HOT 30 CLOSED

gaow avatar gaow commented on August 30, 2024
MD5Sum for output, as an action option

from sos.

Comments (30)

BoPeng avatar BoPeng commented on August 30, 2024 2

hold on, trying really hard to make CI works again now.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Tests are passing..

Why cannot you do

[step]
output: whatever
sh:
   processing

sh: expand=True
  md5sum {_output} > {_output.with_suffix('md5')}

The only case that you cannot do this is when sh is in task, I think.

from sos.

gaow avatar gaow commented on August 30, 2024

Sure. It's just that we will have to do it for every workflow step. I think it warrants a shortcut for this feature. For example the decompress option in action download is such a shortcut

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

But bash is an action, which in theory does not know _output. The only valid place is

[step]
output: whatever

and maybe something like

[steo]
output: whatever, md5=True

Since we are essentially adding an option to file_target.target_signature, or calling something like file_target.save_md5sum(), it could be something like -S md5 (enhanced version of -s that saves md5 for all outputs, and potentially checks md5 for all inputs.

from sos.

gaow avatar gaow commented on August 30, 2024

I see ... Are you saying we can do

...
_output.save_md5sum()

at the end of a step? This is good enough I think. We have something like _input.zap() that has a similar interface

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

I tend to add -S md5 to specify the "type" of signatures. It is ill-defined since we cannot completely rely on md5 since we have other types of targets but in the future we could have something like -S redis://server to use a dedicated signature service.

The advantage of this approach is that we can relatively easily do

sos run workflow -s build -S md5

to generate md5 after the workflow has been completed, and

sos run workflow -s assert -S md5

to check md5.

However, sos is currently bloated all such features (_input.zap() is one of them) and I am hesitating if we should add additional features like this.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

The patch is actually quite simple. When -S md5 is specified, file_target will use complete md5 signature for signature checking, and save .md5 file as an artifact. This applies to all file_target objects and the usual -s ignore, -s build should apply.

from sos.

gaow avatar gaow commented on August 30, 2024

Thank you @BoPeng I tested it. It works well!

gw@cordata:~/tmp/05-May-2022$ sos run test.sos  -S md5
INFO: Running test: 
INFO: test is completed.
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is executed successfully with 1 completed step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos  -S md5
INFO: Running test: 
INFO: test (index=0) is ignored due to saved signature
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.
gw@cordata:~/tmp/05-May-2022$ sos run test.sos  
INFO: Running test: 
INFO: test (index=0) is ignored due to saved signature
INFO: test output:   2.txt
INFO: Workflow test (ID=w52da6a178fa12077) is ignored with 1 ignored step.

I like the behavior when -S md5 is dropped during a rerun the partial signature is still recognized and execution skipped.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

I like the behavior when -S md5 is dropped during a rerun the partial signature is still recognized and execution skipped.

This could be a bug if your signatures were built without -S md5. With -S md5, the signatures are built using complete md5 instead of partial md5 (using the first and last 500M data), then validation without -S md5 should fail. I could potentially use partial md5 for old signatures but then I will need to calculate md5 twice.

This is a proof-of-concept implementation. We can fine-tune its behavior without future development in mind.

from sos.

gaow avatar gaow commented on August 30, 2024

Hmm I see. I was under the impression that when we put in -S md5 we indeed calculate md5 twice, which is the safest and reasonable behavior -- if you agree then we can finalize on that and close this issue without future development in mind? I tried to test it and thought that was the case but now I realize my test above is incomplete!

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Making -S md5 the "add-on" feature that does not affect existing signature is indeed the safest option. I suppose calculating md5 twice is ok as long as we do not read the file twice.

You appeared to want to validate input file (and output file?) if the corresponding .md5 exists (#1454), are you still interested?

from sos.

gaow avatar gaow commented on August 30, 2024

@BoPeng your proposal in #1454 is hacky but efficient. I don't find it scary. Also, when -S md5 is on there is no need to validate output files because absence of md5sum file indicate problem. I'm not sure about input files to the start of a pipeline and I dont think we need to do anything about it. So with your proposal there plus this patch we should have 2 ways to solved #1454 and that's good enough to me, unless you have other proposals?

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Currently the patch generates md5 but missing .md5 does not indicate a failed step (at least not intentionally), which is what #1454 wants. The complexity comes with the two sets of signatures.

With -s default -S md5

  • If there is no signature and no md5. This is the initial run, sos generates the "old" database-based signature (for backward compatibility) and .md5 file for any file_target. This happens to all file_target, including input files.
  • If there is signature and no md5. This is the rerun of workflow without -S md5, we will generate md5, similar to -s build -S md5.
  • If there is no signature and existing md5: I assume that md5 is generated for validation and use it to validate content of the file. This is the case when your input files comes with md5, and perhaps the migration of sos projects (copy md5, but not sos signature database which stores on the machine sos runs, not with the project).
  • If there is signature and existing md5: both signatures will be used to validate.

Now, #1454 talks about a case when SoS crashes and partial output. Regularly sos would proceed (since there is no signature) but you would like to

  • If the output has no signature and no md5. remove it before starting. This is a completely different feature and is potentially dangerous.

from sos.

gaow avatar gaow commented on August 30, 2024

but missing .md5 does not indicate a failed step

I would argue that this is at least a very good proxy to a failed step and is enough to warrant concerns when it happens?

If there is no signature and existing md5

Good point. What's the behavior here? I suppose partial md5 should be built as a result of full md5 check and rerun otherwise?

If the output has no signature and no md5. remove it before starting.

Sorry i thought that is already the behavior? For example, if some steps append to an output using >> then it should start afresh anyways if it is not skipped. I think it is desired and not dangerous -- would you provide a user case when this is not a good idea?

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

If the output has no signature and no md5. remove it before starting.
Sorry i thought that is already the behavior?

You are right, your "sos crash" scenario confused me. I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not, so having .md5 is an indication only to users?

Then let us do this (with -S md5)

  1. When sos signature is read, if there is an md5 file , it will be used to validate the content when the file is used, regardless of existence of sos signature.
  2. When sos signature is written, a new .md5 file will be written.

from sos.

gaow avatar gaow commented on August 30, 2024

so having .md5 is an indication only to users?

Yes although as you proposed, SoS can also leverage that information when available.

I suppose when that happens, sos knows the file is corrupted and will remove it before starting, but users do not

Right, SoS would know this only because it does not have a signature under ~/.sos

Then let us do this (with -S md5)

Perfect!

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Two problems:

  1. When implementing getting both partial and full md5 using one-read, I identified a bug from the previous partial MD5 calculation. I fixed a bug in this PR but then all old signatures will be wrong (sos run will rerun, unless you use -s build).

  2. I cannot validate existing md5 all the time because that will cause md5 to be calculated all the time when files are accessed. I changed the behavior to only generate md5 if md5 files does not exist. However, existing md5 files will also be used for validating (e.g. when bypassing a step).

from sos.

gaow avatar gaow commented on August 30, 2024

I fixed a bug in this PR but then all old signatures will be wrong

Oh that's a bit unfortunate but it's good you caught that! I cannot speak for others, but our consortium is starting to run my pipeline this month but I think peopel are just starting. If we can make a release some time soon I'll ask people to update. My own group is small / flexible -- i'll just ask them to run with -s build.

However, existing md5 files will also be used for validating (e.g. when bypassing a step).

and when that fails the output and existing md5 will be removed anyways priorto generating the new output and md5 right? Then I dont see a problem.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Oh that's a bit unfortunate but it's good you caught that!

The bug caused the last 1M of data not counted towards the hash (so 15 M was used instead of 16M). It was an aribitrarily defined partial MD5 so it was never a problem. I can match the new implementation with the old one but it is a bit weird that way.

when that fails the output and existing md5 will be removed anyways priorto generating the new output and md5 right?

Currently

  1. we generate output, then create md5 when signature is saved.
  2. if we generate new output, since the new output is newer, the md5 will be updated.

This is technically not a problem but if sos crash in between, we may have corrupted or missing output file with old md5 file, so it is better to clear .md5 when we start a step.

from sos.

gaow avatar gaow commented on August 30, 2024

I can match the new implementation with the old one but it is a bit weird that way.

Again I cannot speak for the others but the new change is fine by me and I can inform SoS/DSC users i know of. I'll leave that for you to decide ...

so it is better to clear .md5 when we start a step.

Great -- with that i dont see other potential problems.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Great -- with that i dont see other potential problems.

Removing existing output files can be dangerous, can we change it to renaming to FILENAME.random_hash? If someone rerun some command by accident, at least the old output files are still there and can be restored. It can be a mess to have these .fa89erf.tmp files around but it should not be too difficult to remove them later on.

from sos.

gaow avatar gaow commented on August 30, 2024

Sorry please allow me to clarify -- if someone rerun some command by accident, and the output files are complete, the built-in signature will ensure that the file is skipped. If the output files are corrupted they should be removed anyways. Is there a situation a corrupted file is still useful thus warrant renaming? I guess the answer is yes because corrupted result is better than no result -- is that your only reason for it?

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

is that your only reason for it?

Say I want to run the pipeline with a different parameter but then realize that it will take another 3 days and I cannot afford it, then the old output is already gone.

It is a rare scenario, but if it is rare, it might be ok to have some .df8dfa.tmp files around, maybe.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Note that .fadf8d.tmp is currently used when sos stops abnormally. For example

[20]
output: 'somefile'
sh:
    generate somefile
sh:
   this step goes wrong.

The old behavior was to remove somefile when the step fails, but somefile.fadf8d.tmp may help debug what goes wrong for its next step.

from sos.

gaow avatar gaow commented on August 30, 2024

It is a rare scenario, but if it is rare, it might be ok to have some .df8dfa.tmp files around, maybe.

can we only do that when the md5 and file match (indicating no corruption) but scripts change? Or equally good is to remove the random temp after new results are generated. I want to say that i'm fine with occassional tmp files but definitely not cool if they are everywhere.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

remove the random temp after new results are generated

I think this is a good idea. Basically removing somefile.dfd8fd.tmp after somefile is generated.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

I gave up the idea since glob('filename.??????.tmp') is an expensive operation that is hard to tolerate with lots of output files.

I will merge the PR once the tests pass.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Clearing output before executing is more complicated than I expected. The reason is that we have

[a]
output: something
sos_run('b')

[b]
output: something

When this happens, we actually execute step b before a,. Removing something before executing a will remove the output generated by b.

I have reverted the clear-before-executing code to make the tests pass.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

Leave the remaining issues to a separate ticket #1498.

from sos.

BoPeng avatar BoPeng commented on August 30, 2024

I have released sos 0.23.0

from sos.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.