
jsonsubschema's Introduction

jsonsubschema

jsonsubschema checks if one JSON schema is a subschema (subtype) of another.

For any two JSON schemas s1 and s2, s1 <: s2 (reads s1 is subschema/subtype of s2) if every JSON document instance that validates against s1 also validates against s2.

jsonsubschema is very useful for analysing schema evolution and for ensuring that newer schema versions are backward compatible. It also enables static type checking in systems that use JSON schema to describe the data interfaces between their components.
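For example, checking that a newer schema version is backward compatible amounts to checking that the old schema is a subschema of the new one. A minimal sketch (the schemas here are only illustrative):

from jsonsubschema import isSubschema

old_schema = {"type": "string", "maxLength": 10}
new_schema = {"type": "string"}

# Every string of at most 10 characters is also a plain string,
# so the newer, more permissive schema is backward compatible.
print(isSubschema(old_schema, new_schema))  # expected: True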

The details of JSON subschema are covered in our ISSTA 2021 paper, which received a Distinguished Artifact Award:

@InProceedings{issta21JSONsubschema,
  author    = {Habib, Andrew and Shinnar, Avraham and Hirzel, Martin and Pradel, Michael},
  title     = {Finding Data Compatibility Bugs with JSON Subschema Checking},
  booktitle = {The ACM SIGSOFT International Symposium on Software Testing and Analysis (ISSTA)},
  year      = {2021},
  pages     = {620--632},
  url       = {https://doi.org/10.1145/3460319.3464796},
}

I) Obtaining the tool

Requirements

  • Python 3.8.*
  • Other Python dependencies will be installed during the setup process below

You can install jsonsubschema either from the GitHub source code or from the PyPI package.

A) Install from github source code

Execute the following:

git clone https://github.com/IBM/jsonsubschema.git 
cd jsonsubschema
python setup.py install
cd ..

B) Install from PyPI

Execute the following:

pip install jsonsubschema

II) Running subschema

JSON subschema provides two usage interfaces:

A) CLI interface

  1. Create two JSON schema examples by executing the following:
echo '{"type": ["null", "string"]}' > s1.json
echo '{"type": ["string", "null"], "not": {"enum": [""]}}' > s2.json
  2. Invoke the CLI by executing:
python -m jsonsubschema.cli s2.json s1.json
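The CLI checks subtyping in both directions. For the two example schemas above, s2 additionally rules out the empty string, so s2 is a subschema of s1 but not vice versa; the output should look roughly like this (the exact wording may vary between versions):

LHS <: RHS True
RHS <: LHS False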

B) python API

from jsonsubschema import isSubschema

def main():
	s1 = {'type': "integer"}
	s2 = {'type': ["integer", "string"]}
	
	print(f'LHS <: RHS {isSubschema(s1, s2)}')

if __name__ == "__main__":
	main()
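The same schemas used in the CLI example above can also be checked through the API; a small sketch:

from jsonsubschema import isSubschema

s1 = {"type": ["null", "string"]}
s2 = {"type": ["string", "null"], "not": {"enum": [""]}}

# s2 additionally excludes the empty string, so every document valid
# under s2 is also valid under s1, but not the other way around.
print(isSubschema(s2, s1))  # expected: True
print(isSubschema(s1, s2))  # expected: False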

License

jsonsubschema is distributed under the terms of the Apache 2.0 License, see LICENSE.txt.

Contributions

jsonsubschema is still at an early phase of development, and we welcome contributions. Contributors are expected to submit a 'Developer's Certificate of Origin', which can be found in DCO1.1.txt.

jsonsubschema's People

Contributors

alparibal, andrewhabib, hirzel, kmichel-aiven, michaelmior, mrlyc, shinnar


jsonsubschema's Issues

If RHS has optional fields, it isn't considered a subschema, inhibiting schema evolution

We are currently looking for a tool that can check schema compatibility.

One of the examples I used to test this software is to check that a producer with additional unused fields works with a consumer without them. This is a common schema evolution pattern, and it works with jsonsubschema. ✔️

However, a second schema evolution pattern we follow is for the consumer to add a field as optional and then to start using it as soon as the producer starts supplying it.

This does not seem to work with jsonsubschema.

Example consumer.json:

{
    "properties": {
        "required": {
            "title": "Required",
            "type": "string"
        },
        "onlyusefieldifpresent": {
            "anyOf": [
                {
                    "type": "string"
                },
                {
                    "type": "null"
                }
            ],
            "default": null,
            "title": "onlyusefieldifpresent"
        }
    },
    "required": [
        "required"
    ],
    "title": "Consumer",
    "type": "object"
}

and this producer.json:

{
    "properties": {
        "required": {
            "title": "Required",
            "type": "string"
        },
        "unusedfield": {
            "title": "Unusedoption",
            "type": "string"
        }
    },
    "required": [
        "required",
        "unusedfield"
    ],
    "title": "Producer",
    "type": "object"
}
>>> jsonsubschema producer.json consumer.json
LHS <: RHS False

Is this intentional behavior? Technically I suppose producer.json is not a subschema of consumer.json, but it does provide all of consumer.json's required fields.
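For reference, the same check can be reproduced through the Python API (a sketch, assuming the producer.json and consumer.json files above):

import json
from jsonsubschema import isSubschema

with open("producer.json") as f:
    producer = json.load(f)
with open("consumer.json") as f:
    consumer = json.load(f)

# Reported to evaluate to False. Note that a document which sets
# "onlyusefieldifpresent" to, say, a number validates against
# producer.json but not against consumer.json.
print(isSubschema(producer, consumer))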

Publish updated PyPI package

Hi, I wanted to ask if it is possible to update the PyPI package to v0.0.7? That would include the updated dependencies.

Displaying why one schema isn't a subset of another

Is it straightforward to provide a list of the fields that prevent one schema from being a subschema of another, i.e. debugging information?

Currently if you use this to validate schema compatibility, it does give a yes/no answer but it doesn't provide any context. In large schemas, you would have to manually dig around to figure out what broke.

This may be a large chunk of work and I'd understand if you didn't want to do it. I'm thinking of forking and providing this functionality in a separate project along with #24, but I thought I'd drop a note here first just in case these were features that you were considering.

greenery - infinite loop

Hey,

I think there is a problem in the greenery package that jsonsubschema uses. The method regex_meet in _utils.py can run into an infinite loop.

def regex_meet(s1, s2):
    if s1 and s2:
        ret = parse(s1) & parse(s2)
...

This error can be reproduced with the following arguments:

# assuming the greenery 3.x release from PyPI, where parse lives in greenery.lego
# (newer versions restructure the package)
from greenery.lego import parse

s1 = ".{0,}"
s2 = "commonjs|amd|umd|system|es6|es2015|esnext|none"
parse(s1) & parse(s2)

The problem is the intersection (&) of the two parsed arguments. This can lead to an infinite loop.

This is fixed in the current GitHub version of greenery; with those sources everything works fine. The problem lies in the version released on PyPI, on which jsonsubschema relies. The PyPI release was on 04-19-2018, while the last commit in the GitHub repository was on 04-10-2020. Maybe you can add a note, or change the dependency to point to greenery's GitHub repository?

Best Regards

performance problem

Hello,
I noticed a performance problem as soon as the schema contains the following structure:

... "anyOf": [ {"enum": ["aa", "bb", "cc"]}, {"pattern": "pattern1"}, {"pattern": "pattern2"}, {"pattern": "pattern3"}, ... ] ...

The performance can be massively improved by preprocessing the schema: all string enum values and patterns should be combined into a single pattern, as shown in the example below:

... "anyOf": [ {"pattern": "^aa$|^bb$|^cc$|pattern1|pattern2|pattern3"} ] ...

Currently, the library iteratively appends the enum values and regex patterns to a single regex and, in every iteration, computes the intersection between the current pattern and ".*". This is very expensive and results in poor performance for this specific kind of schema.

I added an example JSON file (anyOf.json) that shows the problem. On my machine, anyOf.json takes about 50-60 seconds to produce the result (LHS <: RHS and RHS <: LHS) when checking the file against itself (command jsonsubschema anyOf.json anyOf.json). With the preprocessing applied, it takes about 0.04 seconds. I also attached a Python script (smaller_anyOf.py) that contains the preprocessing. The script combines the string enum values and all patterns into a single pattern as shown in the example above.

AnyOf.zip

When transforming the string enum values into a regex, special regex characters (e.g. ".", "-", ...) are escaped so that the resulting regex matches the literal value.

... "enum": ["ab-c"] ...
will be transformed to
... "pattern": "^ab\\-c$" ...

Be careful, this can currently lead to another problem - see #6 .
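A sketch of such a preprocessing step (a hypothetical helper, not part of jsonsubschema; the attached smaller_anyOf.py is not reproduced here):

import re

def merge_anyof_branches(schema):
    # Collapse string-enum branches and pattern branches of "anyOf"
    # into a single pattern branch; leave all other branches untouched.
    if "anyOf" not in schema:
        return schema
    parts, rest = [], []
    for branch in schema["anyOf"]:
        if set(branch) == {"enum"} and all(isinstance(v, str) for v in branch["enum"]):
            # Anchor and escape each enum value so it is matched literally.
            parts.extend("^" + re.escape(v) + "$" for v in branch["enum"])
        elif set(branch) == {"pattern"}:
            parts.append(branch["pattern"])
        else:
            rest.append(branch)
    if parts:
        rest.append({"pattern": "|".join(parts)})
    merged = dict(schema)
    merged["anyOf"] = rest
    return merged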

Best Regards
Michael

pattern - error with escaped characters

Hello,

There is a problem when using escaped characters in a pattern. E.g.:

{ "type": "string", "anyOf": [ { "pattern": "^a\\-b$" }, { "pattern": "^b\\-c$" } ] }

Comparing this schema with itself (command jsonsubschema a.json a.json) will throw an error. More specifically, the greenery library throws the error. It seems like greenery cannot handle escaped characters. When you delete one of the patterns, everything works fine, because the problematic greenery method is not called. The error is also thrown for other escaped characters like "\ " (whitespace), "\.", ...
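For reference, the same comparison through the Python API (a sketch; the schema is the one above written as a Python dict):

from jsonsubschema import isSubschema

s = {
    "type": "string",
    "anyOf": [{"pattern": "^a\\-b$"}, {"pattern": "^b\\-c$"}],
}

# Reported to raise an error from greenery when the two patterns
# are intersected during the subschema check.
print(isSubschema(s, s))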

Best Regards
Michael

isSubschema returns false on added, but not required properties?

Hello!

I'm seeing some odd behavior and I'm wondering if I'm missing something obvious.

Let's say I have a JSON Schema object like the following:

{
    "properties": {
        "error_message": {
            "description": "The error message that will be thrown",
            "title": "Error Message",
            "type": "string"
        }
    },
    "required": ["error_message"],
    "title": "Throw Formatted Error Input",
    "type": "object"
}

And I add a property that is not required, like the following:

{
    "properties": {
        "error_message": {
            "description": "The error message that will be thrown",
            "title": "Error Message",
            "type": "string"
        },
        "not_required_field": {
            "description": "A new not required field",
            "title": "Not Required Field",
            "type": "string"
        }
    },
    "required": ["error_message"],
    "title": "Throw Formatted Error Input",
    "type": "object"
}

My expectation with regard to breaking-change compatibility is that this would not be considered a breaking change; that is, isSubschema would return True when checking (s1, s2), seeing as bodies that are valid for the first schema will still be valid for the second schema. Making them invalid would require adding not_required_field to the required array in the second schema. Without that, the first schema should be a valid subschema of the second schema.

However, comparing these two schemas makes isSubschema return False.
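A condensed reproduction of the comparison (a sketch; the titles and descriptions of the individual properties are omitted):

from jsonsubschema import isSubschema

s1 = {
    "type": "object",
    "title": "Throw Formatted Error Input",
    "required": ["error_message"],
    "properties": {
        "error_message": {"type": "string"},
    },
}

# The same schema, plus one optional (not required) string property.
s2 = {
    **s1,
    "properties": {
        **s1["properties"],
        "not_required_field": {"type": "string"},
    },
}

print(isSubschema(s1, s2))  # reported to print False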

I have looked through the attached paper and seen:

We observe that JSON schema types null and string are the two most prevalent schema types present in the dataset. Both types are fully supported in the subtype checking performed by jsonsubschema as indicated by the color code in Figure 8. The keywords properties and required for specifying constraints on a JSON object show up next in the order of the number of use cases. jsonsubschema fully supports properties, while the required keyword is supported whenever it is not used in union schemas or negated schemas. In general, disjunction of schemas happens rarely (366 occurrences among millions of occurrences of other keywords), while negated schemas are not used at all in our dataset.

Seeing as this is not a union or a negated schema, I would think that this would be a supported use case.

Am I missing something obvious? I did try to step through the lib with my debugger, but to no great success.

I appreciate the help (and the work on the lib)!
Best
