Coder Social home page Coder Social logo

Comments (8)

m1n0 avatar m1n0 commented on May 12, 2024

This is because we use Location (i.e the line/column and file path) when creating the check identity. It seems that using the line number is excessively specific and we should skip it, but what is the current state of the identity calculation discussions @tombaeyens @vijaykiran please? I know there were some loose ends on the topic.

from soda-core.

tombaeyens avatar tombaeyens commented on May 12, 2024

@m1n0 The solution is indeed to remove the location line/col. Only keep the file path.

But then we should ideally also produce an error log if 2 checks in the same file produce the same identity. In fact, that should be for all checks in a scan:

As soda cloud triggers generation of identity, ensure verification that no duplicate identities are created within 1 scan. If that is the case, print an error message pointing to all checks / locations that result in same check identities.

from soda-core.

m1n0 avatar m1n0 commented on May 12, 2024

is it possible to have two checks with the same identity? Shouldnt that fail even earlier, during parsing stage?

from soda-core.

tombaeyens avatar tombaeyens commented on May 12, 2024

Yes: When there are 2 checks with the same identity, it make sense to have it fail.

I do want to raise awareness for the background story of identity as it is a tricky matter. It's not just a mechanism like programmatic identity in python or java. This identity mechanism is created to try and create seamless syncing between checks in sourcecode and the check entities on Soda Cloud that build up history over time. So the goal is that if users make changes to the same check, that we should ideally be able to figure that out and keep the link with the previous existing check. So identity has the ability to remain the same, even if some detail of the check changes. Though in practice we even had to include the threshold. Anyways that just to sketch the background around identity.

from soda-core.

m1n0 avatar m1n0 commented on May 12, 2024

thanks for writing up the background, my suggestion then is that after parsing the checks and creating the identity for each of them we should make sure that there are no duplicate identities before we proceed with execution, would you agree?

from soda-core.

tombaeyens avatar tombaeyens commented on May 12, 2024

As agreed before we remove line and col information from the identity.
This might lead to duplicate check identities.

Important to note is that duplicate check identities only pose a problem for Soda Cloud sync. And duplicate identities should only be handled when pushing the scan results to Soda Cloud. All the rest should keep working as is.

So when duplicate identities occur, we should ensure that only 1 (probably the first is most convenient) of the check results is sent to Soda Cloud. All subsequent check results with the same check identity should not be sent to Soda Cloud.

When handling duplicate check identities we should proceed as much of the scan as possible, and not apply the fail fast principle.

Impl note: it's probably the simplest to add this check inside the Soda Cloud method at the end when sending the scan results. As that's where the identities are generated.

@vijaykiran Review what happens if similar checks are generated by a for each as there are already directly in the check file. And potentially update the identity generation if needed. If that logic is changed. Post a note here and tag me to make me aware.

from soda-core.

m1n0 avatar m1n0 commented on May 12, 2024

@tombaeyens this issue and your last comment above now seem to be solved.
In summary:

  • current identity implementation takes the whole check definition except the Name property and creates a hash of that.
  • line number and file are not used in the hash at all
  • for each checks create their own identity with a special non-sodacl syntax, table name is included so uniqueness is guaranteed.
  • we have quite extensive test coverage for these scenarios but we are missing a test for for each construct. I will add one and then we should be able to close this issue

from soda-core.

m1n0 avatar m1n0 commented on May 12, 2024

One scenario is not covered yet - checking for each unique identity between each table, but also unique compared to a same check created specifically for one of the tables covered by for each. PR coming for that soon.

from soda-core.

Related Issues (20)

Recommend Projects

  • React photo React

    A declarative, efficient, and flexible JavaScript library for building user interfaces.

  • Vue.js photo Vue.js

    🖖 Vue.js is a progressive, incrementally-adoptable JavaScript framework for building UI on the web.

  • Typescript photo Typescript

    TypeScript is a superset of JavaScript that compiles to clean JavaScript output.

  • TensorFlow photo TensorFlow

    An Open Source Machine Learning Framework for Everyone

  • Django photo Django

    The Web framework for perfectionists with deadlines.

  • D3 photo D3

    Bring data to life with SVG, Canvas and HTML. 📊📈🎉

Recommend Topics

  • javascript

    JavaScript (JS) is a lightweight interpreted programming language with first-class functions.

  • web

    Some thing interesting about web. New door for the world.

  • server

    A server is a program made to process requests and deliver data to clients.

  • Machine learning

    Machine learning is a way of modeling and interpreting data that allows a piece of software to respond intelligently.

  • Game

    Some thing interesting about game, make everyone happy.

Recommend Org

  • Facebook photo Facebook

    We are working to build community through open source technology. NB: members must have two-factor auth.

  • Microsoft photo Microsoft

    Open source projects and samples from Microsoft.

  • Google photo Google

    Google ❤️ Open Source for everyone.

  • D3 photo D3

    Data-Driven Documents codes.