Binary comparison and SHA-1 hashing about gdspy HOT 13 CLOSED

heitzmann commented on July 19, 2024

Binary comparison and SHA-1 hashing

from gdspy.

Comments (13)

explorer2011 commented on July 19, 2024

Hello -

In a GDS file, every library and every cell has time stamps, for example:

HEADER 600
BGNLIB 3/10/2017 14:05:01 3/10/2017 14:05:01
LIBNAME library
UNITS 0.001 1e-09

BGNSTR 3/10/2017 14:05:01 3/10/2017 14:05:01
STRNAME topcell

You can save any gds file to a text gds format (for example, using "klayout" - klayout.de - but there are many converters around), to check it.

So, binary GDS files generated at different times will always differ, even if they have identical geometry content.

So, smart GDS comparison utilities ignore the time stamps in GDS files.
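
As a minimal sketch of what "ignoring the time stamps" could look like on the binary side: the code below assumes the usual GDSII record layout (a 2-byte big-endian length that includes the 4-byte header, followed by a 2-byte record/data type) and that BGNLIB (0x0102) and BGNSTR (0x0502) each carry twelve 2-byte time stamp fields. The names `blank_timestamps`, `make_record` and `tiny_library` are made up for this example, and the demo stream is deliberately incomplete (no UNITS, no cells).

```python
import struct

BGNLIB, BGNSTR = 0x0102, 0x0502  # record type and data type as one 16-bit value


def blank_timestamps(data: bytes) -> bytes:
    """Return a copy of a GDSII byte stream with all time stamps zeroed."""
    out = bytearray(data)
    pos = 0
    while pos + 4 <= len(out):
        size, rec = struct.unpack(">HH", out[pos:pos + 4])
        if size < 4:  # null padding at the end of the file
            break
        if rec in (BGNLIB, BGNSTR):
            # Zero the payload, i.e. the time stamps that follow the header.
            out[pos + 4:pos + size] = bytes(size - 4)
        pos += size
    return bytes(out)


def make_record(rec: int, payload: bytes = b"") -> bytes:
    """Build one GDSII record (hypothetical helper for the demo only)."""
    return struct.pack(">HH", len(payload) + 4, rec) + payload


def tiny_library(year: int) -> bytes:
    """Build a stripped-down library stream whose only variable is the year."""
    stamp = struct.pack(">12H", year, 3, 10, 14, 5, 1, year, 3, 10, 14, 5, 1)
    return b"".join([
        make_record(0x0002, struct.pack(">H", 600)),  # HEADER
        make_record(BGNLIB, stamp),
        make_record(0x0206, b"library\0"),  # LIBNAME, padded to even length
        make_record(0x0400),  # ENDLIB
    ])


a, b = tiny_library(2017), tiny_library(2018)
print(a == b)                                      # False
print(blank_timestamps(a) == blank_timestamps(b))  # True
```

Zeroing the fields in place, rather than cutting them out, keeps every record length valid, so the result can still be opened by a GDS reader.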

There is a very nice paper about hashing of IC layouts - "DÉJÀ VU: An Entropy Reduced Hash Function for VLSI Layout Databases":

http://ieeexplore.ieee.org/document/7105901/

The algorithm described in that paper is implemented in a commercial Mentor Graphics tool
called "checksum_util". So, if you have Mentor Graphics licenses, you can use that tool.

If you don't have Mentor tools - why don't you use a simple XOR between two GDS files? Empty output means two files are identical geometrically (even though the shapes may be described differently, for example using a different number of points, or path vs rectangle, etc.). Non-empty output means the two GDS files are different.

You can write your own XOR utility, or use the XOR function in klayout, which works very well. You will get not only a pass/fail result, but will also be able to see which shapes differ between the two GDS files.
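
To illustrate the idea of a home-made XOR, here is a rough sketch restricted to axis-aligned rectangles on an integer grid. It is not how klayout does it, and real layouts need a proper polygon boolean engine. The function names and the `(x0, y0, x1, y1)` tuple format are assumptions made for this example.

```python
from itertools import product


def covered(rects, x, y):
    """True if the grid cell whose lower-left corner is (x, y) lies in any rectangle."""
    return any(x0 <= x < x1 and y0 <= y < y1 for x0, y0, x1, y1 in rects)


def xor_rectangles(rects_a, rects_b):
    """Return the regions covered by exactly one of the two layouts.

    Rectangles are (x0, y0, x1, y1) tuples in integer database units.
    An empty result means both layouts cover exactly the same area.
    """
    # Every rectangle edge becomes a cut line, so each resulting cell is
    # either fully inside or fully outside any given rectangle.
    xs = sorted({x for r in rects_a + rects_b for x in (r[0], r[2])})
    ys = sorted({y for r in rects_a + rects_b for y in (r[1], r[3])})
    diff = []
    for (x0, x1), (y0, y1) in product(zip(xs, xs[1:]), zip(ys, ys[1:])):
        if covered(rects_a, x0, y0) != covered(rects_b, x0, y0):
            diff.append((x0, y0, x1, y1))
    return diff


one_piece = [(0, 0, 10, 4)]
two_pieces = [(0, 0, 6, 4), (6, 0, 10, 4)]  # same area, drawn as two shapes
taller = [(0, 0, 10, 5)]

print(xor_rectangles(one_piece, two_pieces))  # []
print(xor_rectangles(one_piece, taller))      # [(0, 4, 10, 5)]
```

The first comparison comes back empty even though the shapes are described differently. The second returns the strip where the layouts disagree, which is the "see what's different" part.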

Regards,

Maxim

from gdspy.

rosivda commented on July 19, 2024

@explorer2011 Thanks for your input and for the reference to the article. It was very insightful. We have gained a better understanding of the scope of the issue that we are facing.

Until recently, we had been using boolean overlays to validate our layouts, just as you suggested. Due to the size and complexity of our layouts, we have started looking at other solutions. Following your comments, we have built a Python script that parses binary GDSII files, excludes BGNSTR timestamps, and builds a SHA-1 hash out of the remaining content. This now produces repeatable hashes in about 10 sec for binary .gds files of about 1GB in size. This has addressed our immediate need.

One residual problem that we are still trying to resolve is to determine why the .gds binaries generated from the same version of gdspy (1.0) on Windows and Mac are slightly different, resulting in a different hash value. Do you have an idea what type of system-specific information may be encoded into a binary .gds file to produce such a difference?

from gdspy.

explorer2011 commented on July 19, 2024

Hi Serge -

Thanks for your feedback!

I learned something new from you too. I didn't know about SHA-1; it seems to be a useful utility, easily available on Linux (sha1sum).

Please note that the same SHA-1 checksum does, for all practical purposes, guarantee identical content.
However, two identical (mask-wise) GDS files can generate different checksums, as identical geometrical shapes can be defined using different elements of the GDS file (rectangle/path) or using the same element types with a different number of points. The method suggested in the paper that I referred to is free from this artifact.
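
A small sketch of that effect, assuming the standard XY record encoding (record type 0x10, data type 0x03, 4-byte signed big-endian integers). The helper `xy_record` is made up for this example.

```python
import hashlib
import struct


def xy_record(points):
    """Pack integer vertices as a GDSII XY record (type 0x10, data type 0x03)."""
    flat = [c for p in points for c in p]
    payload = struct.pack(">%di" % len(flat), *flat)
    return struct.pack(">HH", len(payload) + 4, 0x1003) + payload


# The same 10 x 10 square, described twice.
square = [(0, 0), (10, 0), (10, 10), (0, 10), (0, 0)]
with_extra_point = [(0, 0), (5, 0), (10, 0), (10, 10), (0, 10), (0, 0)]

h1 = hashlib.sha1(xy_record(square)).hexdigest()
h2 = hashlib.sha1(xy_record(with_extra_point)).hexdigest()
print(h1 == h2)  # False: same mask shape, different bytes
```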

As for Windows and Mac GDS file differences - I don't know.
From what I remember about GDSII file format, it should not contain any OS related information.
The only thing that immediately comes to my mind is the end of line difference for strings/texts between Windows and Mac (or Unix) - I don't think this is the case here, in binary GDSII files, but I would suggest checking this.

I would also convert both files to text format (using klayout, for example), and compare the text GDS files, to see if there is any difference.

Did you do XOR on these two files?

There is a small possibility that grid snapping (converting from float coordinates to the grid with step size DBU (often 1nm)) is different in GDSPY on Mac and on Windows - if this is the case, this would be an issue with GDSPY, as it should produce exactly the same on-grid geometry on different systems.

Maxim

from gdspy.

heitzmann commented on July 19, 2024

Hello @rosivda, sorry for taking so long to reply. I've been busy at work lately.
As @explorer2011 mentioned, there should be no difference in files produced by different systems, none that I can think of. If you have an example that you can share (small python script + both GDSII), maybe it will be easier to find out the reason.

from gdspy.

rosivda commented on July 19, 2024

@heitzmann We are looking at all aspects of .gds hashing. So far, we have overcome the hash creation problem - our hashes now indeed represent the binary file content (minus timestamps). The binary files, however, are different on machines that run different OS, which was rather unexpected as we were hoping that the binaries would not differ. The difference appears to be originating from numerical rounding and not from any OS-specific info included in .gds files. Short of going the most generic route of looking at the entropy of the data suggested by @explorer2011, we are currently examining a few gdspy-specific approaches and we will update this thread with our findings shortly. As part of this effort, we are in the process of validating gdspy 1.0 -> 1.1.1 -> 1.1.2 versions.

from gdspy.

rosivda commented on July 19, 2024

@heitzmann We have now studied the hashing of .gds files extensively and have reached the following conclusions:

Identical Python code using the same versions of gdspy, numpy, python, etc. often produces binary .gds files that are different. This difference remains even when timestamps are taken out of the .gds files. The difference persists even when running on the same software/hardware (the same computer).
The fact that the binary files are different does not mean that the resultant geometries are different. We have used boolean comparison across platforms and are satisfied that the geometries are the same with some residual differences related to number rounding.
As the complexity of layouts increases, parallelization is becoming more prevalent. Order of execution hinders parallelization. In complex layouts, the sequence of assembly of geometrical elements (and their fracture) is not expected to be retained if the benefits of parallelization are to be reaped.

This leads us to the following conclusion. Employing hashing algorithms without considering the logic of what is being hashed does not work for .gds files. As far as hashing is concerned, there does not appear to be a simple solution here, short of looking at the most generic case of data entropy suggested in the link provided by @explorer2011. Of course, there is always a possibility of comparing two geometries using boolean geometry operations, though we have found this computationally impractical for our complex layouts.

from gdspy.

heitzmann commented on July 19, 2024

@rosivda I believe the reason you find different binary files for the same script running on the same computer is that GdsLibrary stores its cells in a dictionary, which does not preserve order. When write_gds iterates over all cells (https://github.com/heitzmann/gdspy/blob/master/gdspy/__init__.py#L3674) the order is not guaranteed to be the same in different runs.
If that is where the difference comes from (apart from time stamps), then you could pass the cells argument manually, something like:

cells = sorted(gdspy.current_library.cell_dict.values(), key=lambda c: c.name)

from gdspy.

explorer2011 commented on July 19, 2024

OK, a random order of writing cells to the GDS file does explain the binary GDS file differences. They should also be easy to see with the naked eye when comparing text versions of the GDS files (using tkdiff, for example).

But this point, mentioned by Serge, is worrisome: "the geometries are the same with some residual differences related to number rounding". The geometries should be exactly the same (after merge operation). If there is a difference due to number rounding or grid snapping - this is a big problem, in my opinion (in gdspy or elsewhere). Geometry in GDS file is integer, and geometries from two GDS files corresponding to the same design should be exactly the same, not within any tolerance.

Maxim

from gdspy.

heitzmann commented on July 19, 2024

That's odd, because the documentation does not indicate that the behavior of round depends on the OS, especially when rounding to an integer (as in the gdspy case).
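
For what it's worth, here is a hypothetical illustration of how rounding could still end up platform dependent without round itself being at fault: if an upstream calculation differs by one last bit between systems, a coordinate sitting right at a half-grid value can snap to a different integer. This is only a sketch of the mechanism, not a diagnosis of the files discussed here. The function `snap` is made up for the example and uses Python's built-in round.

```python
def snap(value_in_grid_units: float) -> int:
    """Snap a coordinate, already scaled to database units, to the integer grid."""
    return int(round(value_in_grid_units))


exact = 2.5              # falls exactly between two grid points
nudged = 2.5 + 2 ** -51  # the next representable double above 2.5

print(snap(exact))   # 2 (Python rounds exact halves to the nearest even integer)
print(snap(nudged))  # 3
```

The rounding is fully deterministic in both cases; only the input differs, and by the smallest amount a double can express at that magnitude.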

from gdspy.

rosivda commented on July 19, 2024

@heitzmann Great suggestion! Overnight, we have rebuilt our libraries on 6 different machines - 4 Windows 10 and 2 Mac OS X. This time, however, we used ordered collections. Then we hashed the results eliminating all timestamps. All the Windows machines produced identical hashes. So did the two Mac machines. In 50% of instances, hashes between Windows and Mac were the same. In the cases where hashes between Windows and Mac were not the same, the file sizes remained the same, except for one instance when both hashes and file sizes differed (528067070 vs 528067870 bytes).
For practical reasons, we are very satisfied with this outcome as we can now use a simple sys.platform call to compare to the platform-specific hashes, regardless of hardware.
@explorer2011 One lingering question does remain - what is causing the platform dependency of binaries? Our layouts are generated from complex optimizations calling third-party libraries. I can only speculate that somewhere upstream in numpy, scipy, etc., there is a platform-specific numerical difference, but I am not sure. While not a priority, I will re-open and update this thread down the road if we are able to answer this question definitively.
Thanks for all your help and suggestions.
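
A minimal sketch of the platform-specific lookup described above, using `sys.platform` as the key. The table `EXPECTED`, its placeholder values and the function `matches_reference` are assumptions for illustration only; real entries would be the digests recorded on each platform.

```python
import sys

# Hypothetical reference digests, one per platform; the values are placeholders.
EXPECTED = {
    "win32": "placeholder-digest-windows",
    "darwin": "placeholder-digest-macos",
}


def matches_reference(digest, expected=EXPECTED, platform=None):
    """Compare a digest against the reference stored for the current platform."""
    platform = sys.platform if platform is None else platform
    reference = expected.get(platform)
    if reference is None:
        raise KeyError("no reference hash stored for platform %r" % platform)
    return digest == reference
```

Raising on an unknown platform, instead of returning False, keeps a missing reference from being mistaken for a changed layout.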

from gdspy.

rosivda commented on July 19, 2024

As a small update on the issue, we have upgraded our hashing algorithm in such a way that, in addition to ignoring timestamps, it now identifies and sorts the cells inside binary .gds files before the SHA-1 hash is produced. We have found this to be very helpful in identifying .gds files with identical content even if their binaries are different. There is no longer a need to pre-sort cells while writing with gdspy. I intend to submit this hashing feature for consideration to be included in gdspy.

from gdspy.

basnijholt commented on July 19, 2024

@rosivda, it's 5 years later but I will still try; could you share that code that you use to generate hashes?

I am going through the same process of trying to get reliable hashing to work and found this issue.

from gdspy.

rosivda commented on July 19, 2024

@basnijholt It has been 5 years, but the hash difference between operating systems has not been resolved and we are no closer to understanding why there is an OS dependency in gds binaries. Our solution has been to maintain OS-specific libraries of hashes - we detect the operating system first and then store/compare the hash for that operating system. The hash generating code is below:

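import hashlib
import struct


def gdsii_hash(fname, engine=None):
    """
    Generate hash from a binary .gds file, ignoring cell time stamps
    and the order in which cells were written.
    :param fname: The file (or path) where the GDSII file is located.
    :param engine: Hashing algo (a hashlib object); defaults to SHA-1.
    :return: Hash string
    """

    with open(fname, 'rb') as fin:
        data = fin.read()
    contents = []
    pos = start = 0
    while pos + 4 <= len(data):
        # Each record starts with a 2-byte length and a 2-byte record/data type.
        size, rec = struct.unpack('>HH', data[pos:pos+4])
        if size < 4:
            # Null padding after ENDLIB; stop rather than loop forever.
            break
        if rec == 0x0502:
            # BGNSTR: skip the 4-byte header and the 24 bytes of time stamps.
            start = pos + 28
        elif rec == 0x0700:
            # ENDSTR: keep everything from STRNAME up to the end of the cell.
            contents.append(data[start:pos])
        pos += size
    h = hashlib.sha1() if engine is None else engine
    # Sort the cells so that the write order does not affect the hash.
    for x in sorted(contents):
        h.update(x)
    return h.hexdigest()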
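
One possible variation on the function above, offered as a sketch and not as part of the original code: hashing each cell separately, keyed by its STRNAME, so that a mismatch can be traced to specific cells. It assumes record codes 0x0606 for STRNAME and 0x0700 for ENDSTR. The helper names and the synthetic streams (which omit HEADER, BGNLIB and ENDLIB) exist only for this demo.

```python
import hashlib
import struct

BGNSTR, STRNAME, ENDSTR = 0x0502, 0x0606, 0x0700


def cell_digests(data: bytes) -> dict:
    """Map each cell name to the SHA-1 of its records, time stamps excluded."""
    digests = {}
    pos = start = 0
    name = ""
    while pos + 4 <= len(data):
        size, rec = struct.unpack(">HH", data[pos:pos + 4])
        if size < 4:  # null padding after ENDLIB
            break
        if rec == BGNSTR:
            start = pos + size  # skip the BGNSTR record and its time stamps
        elif rec == STRNAME:
            name = data[pos + 4:pos + size].rstrip(b"\0").decode("ascii")
        elif rec == ENDSTR:
            digests[name] = hashlib.sha1(data[start:pos]).hexdigest()
        pos += size
    return digests


def record(rec: int, payload: bytes = b"") -> bytes:
    """Build one GDSII record (helper for the synthetic demo data)."""
    return struct.pack(">HH", len(payload) + 4, rec) + payload


def cell(name: bytes, year: int) -> bytes:
    """Build an empty cell whose time stamps depend on the given year."""
    stamp = struct.pack(">12H", *([year, 1, 1, 0, 0, 0] * 2))
    return record(BGNSTR, stamp) + record(STRNAME, name) + record(ENDSTR)


# Two synthetic streams: same cells, different order and different time stamps.
first = cell(b"top_", 2017) + cell(b"sub_", 2017)
second = cell(b"sub_", 2018) + cell(b"top_", 2018)

print(first == second)                              # False
print(cell_digests(first) == cell_digests(second))  # True
```

Comparing the two dictionaries key by key then shows which cells were added, removed or changed.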

from gdspy.
