jorgecarleitao / json-deserializer
License: Apache License 2.0
Hi,
first of all, this is a really impressive creation. I love the trick of delaying number parsing; darn smart :D.
That said, I would suggest enabling SIMD when comparing against simd-json; otherwise it's a bit of a pointless comparison. The best way to do that without being too CPU-dependent is to use RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2", since those features are present on all modern CPUs.
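For reference, a run with those features enabled might look like this (the exact benchmark command is an assumption; adjust to however the repo's benchmarks are invoked):

```shell
# Enable AVX/AVX2/SSE4.2 codegen for the whole benchmark build;
# simd-json's vectorized code paths may not be compiled in otherwise.
RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2" cargo bench
```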
Hi, thanks for this crate (currently using it as part of Polars)! It looks like Cargo.toml specifies the license as Apache-2.0, but it'd be nice to include a license file in the repo (which Cargo should automatically include in the published crate if present).
I also noticed you're the owner of the arrow-format crate; it would be nice to do the same there.
Issue jorgecarleitao/arrow2#1426 is caused by the fact that values like 2e-42 are parsed as Number::Integer(b"2", b"-42") here, causing problems downstream.
Originally reported via pola-rs/polars#8424
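A self-contained sketch of the classification rule at stake (illustrative only, not the crate's actual code): a number token can only be treated as an integer when it has no fraction part and no negative exponent, so 2e-42 must not land in Number::Integer.

```rust
/// Returns true only when the token denotes an integer value:
/// no `.` fraction part, and no `-` sign after the `e`/`E` exponent marker.
fn is_integer_token(token: &[u8]) -> bool {
    let mut has_fraction = false;
    let mut negative_exponent = false;
    for (i, &b) in token.iter().enumerate() {
        match b {
            b'.' => has_fraction = true,
            b'e' | b'E' => {
                negative_exponent = token.get(i + 1) == Some(&b'-');
            }
            _ => {}
        }
    }
    !has_fraction && !negative_exponent
}

fn main() {
    assert!(is_integer_token(b"2"));
    assert!(is_integer_token(b"2e42")); // positive exponent still scales an integer
    assert!(!is_integer_token(b"2e-42")); // negative exponent: must be a float
    assert!(!is_integer_token(b"1.1"));
}
```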
[dependencies]
json-deserializer = "0.4.4"
use json_deserializer::Value;

fn main() {
    // The JSON document contains the escape sequence `\n`;
    // the expected decoded string contains a real newline.
    let expected = "你好,polars。\n";
    let json_data = r#""你好,polars。\n""#;
    let v = json_deserializer::parse(json_data.as_bytes()).unwrap();
    if let Value::String(s) = v {
        assert_eq!(s.as_ref(), expected)
    }
}
thread 'main' panicked at 'assertion failed: `(left == right)`
left: `"ä½\u{a0}好ï¼\u{8c}polarsã\u{80}\u{82}\n"`,
right: `"你好,polars。\n"`', src/main.rs:13:9
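The panic output looks like each UTF-8 byte was widened into its own char somewhere in the string path. A minimal sketch of that suspected failure mode (an assumption about the cause, not the crate's code):

```rust
/// Decodes string bytes one u8 at a time, which treats multi-byte UTF-8
/// sequences as separate Latin-1 characters and produces mojibake.
fn byte_by_byte(bytes: &[u8]) -> String {
    bytes.iter().map(|&b| b as char).collect()
}

fn main() {
    let bytes = "你好,polars。".as_bytes();
    let mojibake = byte_by_byte(bytes);
    // Proper UTF-8 decoding recovers the original text.
    let correct = std::str::from_utf8(bytes).unwrap();
    assert_ne!(mojibake, correct);
    println!("byte-by-byte: {mojibake}");
    println!("utf-8:        {correct}");
}
```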
I was testing potential JSON libraries to use and I noticed several issues with cases from json.org's JSON Checker; for example, these invalid documents from the checker's fail cases:
["Extra close"]]
{"Comma instead if closing brace": true,
That said, this library looks great and is quite efficient!
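The first case leaves bytes behind after a complete top-level value. A minimal sketch of the trailing-input check that catches it (an assumption about the fix, not this crate's code):

```rust
/// After the top-level value has been parsed, any remaining
/// non-whitespace byte (such as the second `]` in `["Extra close"]]`)
/// means the document is invalid JSON.
fn has_trailing_garbage(rest: &[u8]) -> bool {
    rest.iter().any(|b| !b" \t\r\n".contains(b))
}

fn main() {
    // After parsing `["Extra close"]`, one `]` byte remains:
    assert!(has_trailing_garbage(b"]"));
    // Trailing whitespace alone is fine:
    assert!(!has_trailing_garbage(b" \n"));
}
```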
This was first reported in polars. We are using arrow2 to infer the schema for ndjson data types, which looks like it is using this crate to deserialize JSON in the implementation of arrow2::io::ndjson::read::infer
serde_json parses the number correctly, as {"Value": Number(11000000000.0)}. json_deserializer errors out with MissingComa(43) when trying to parse scientific notation with a + symbol.
fn main() {
    let mut s = br#"{"Value":1.1e+10}"#.to_vec();
    let v = json_deserializer::parse(&mut s);
    println!("{:?}", v);
}
// Err(MissingComa(43))
Note: the - symbol parses as expected.
fn main() {
    let mut s = br#"{"Value":1.1e-10}"#.to_vec();
    let v = json_deserializer::parse(&mut s);
    println!("{:?}", v);
}
// Ok(Object({"Value": Number(Float([49, 46, 49], [45, 49, 48]))}))
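Per RFC 8259, the exponent part of a number is e/E followed by an optional + or - and then digits. A self-contained sketch of that grammar rule (illustrative only, not the crate's parser):

```rust
/// Returns the length of a valid exponent part at the start of `bytes`
/// (`e`/`E`, optional sign, one or more digits), or None if there isn't one.
fn exponent_len(bytes: &[u8]) -> Option<usize> {
    if !matches!(bytes.first(), Some(&b'e') | Some(&b'E')) {
        return None;
    }
    let mut i = 1;
    if matches!(bytes.get(i), Some(&b'+') | Some(&b'-')) {
        i += 1; // the `+` case is the one the reported parse misses
    }
    let digits = bytes[i..].iter().take_while(|b| b.is_ascii_digit()).count();
    if digits == 0 {
        None
    } else {
        Some(i + digits)
    }
}

fn main() {
    assert_eq!(exponent_len(b"e+10"), Some(4));
    assert_eq!(exponent_len(b"e-10"), Some(4));
    assert_eq!(exponent_len(b"e10"), Some(3));
    assert_eq!(exponent_len(b"e+"), None); // sign without digits is invalid
}
```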
Awesome work as usual @jorgecarleitao.
I was curious what the performance would be with serde_json's in-place deserialization, so I created a custom deserializer for parser::Value. I had to duplicate parser::Value to use converted numbers, since it wasn't possible to use this library's lazy numeric parser without replicating all of serde_json's parsing code.
The code for the custom deserializer is here. I updated the benchmarks to include this implementation, and also added a test for escaped strings.
Results from my macbook are below. serde_json_custom is the benchmark for the custom deserializer; string_escaped_chars is the benchmark for escaped strings.
string json_deserializer 2^10 time: [63.949 us 64.339 us 64.739 us]
string serde_json 2^10 time: [100.51 us 101.21 us 101.94 us]
string serde_json_custom 2^10 time: [41.162 us 41.491 us 41.923 us]
string simd_json 2^10 time: [25.250 us 25.574 us 25.932 us]
string_escaped_chars json_deserializer 2^10 time: [193.93 us 195.39 us 196.92 us]
string_escaped_chars serde_json 2^10 time: [130.26 us 131.22 us 132.27 us]
string_escaped_chars serde_json_custom 2^10 time: [137.42 us 138.47 us 139.55 us]
string_escaped_chars simd_json 2^10 time: [46.427 us 46.821 us 47.252 us]
I was surprised at the results, especially since nothing stuck out from inspecting the code. So I dug in further, and it seems to come down to the relatively large match expression in compute_length. Reworking it improved the numbers:
string json_deserializer 2^10 time: [39.574 us 39.719 us 39.868 us]
string serde_json 2^10 time: [103.74 us 104.84 us 105.99 us]
string serde_json_custom 2^10 time: [46.926 us 48.188 us 49.584 us]
string simd_json 2^10 time: [27.517 us 27.786 us 28.105 us]
string_escaped_chars json_deserializer 2^10 time: [144.19 us 145.30 us 146.43 us]
string_escaped_chars serde_json 2^10 time: [143.75 us 144.64 us 145.52 us]
string_escaped_chars serde_json_custom 2^10 time: [145.71 us 147.09 us 148.73 us]
string_escaped_chars simd_json 2^10 time: [47.152 us 47.652 us 48.162 us]
If you're able to replicate the results and are interested in taking the changes, I can open a PR. Changes are in this branch. Thanks!
As discussed in #6, I was trying to port the polars ndjson reader over to this crate instead of serde_json, and noticed a huge degradation in performance after the refactor.
This is on a dataset of 500_000 records containing data similar to this (more than happy to share the actual file if needed):
{
"number": 300,
"hash": "0xb3e37f7c14742bc54d08163792d38ada69c3951817b8dde6ef96776aa5c0f00c",
"parent_hash": "0x989b8bf2af0be6c18c9c95bfde81492e0b47bcc1c26d555bb7cea2d09e92c6c3",
"nonce": "0x424b554fa4a7a04f",
"sha3_uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
"logs_bloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
"transactions_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"state_root": "0x34e5b52497408cd2bbcb6992dee0292498a235ec7aca1b34f6cbccb396f85105",
"receipts_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
"miner": "0xbb7b8287f3f0a933474a79eae42cbca977791171",
"difficulty": 19753900789,
"total_difficulty": 5531721283386,
"size": 544,
"extra_data": "0x476574682f4c5649562f76312e302e302f6c696e75782f676f312e342e32",
"gas_limit": 5000,
"gas_used": 0,
"timestamp": 1438270848,
"transaction_count": 0,
"base_fee_per_gas": null
}
An unoptimized build using serde_json will load the json file in ~35s on my device. The unoptimized build using json_deserializer loads in ~55s.
The code changes are available here, just in case I did something wrong on my end in using json_deserializer:
fn parse_lines<'a>(bytes: &[u8], buffers: &mut PlIndexMap<String, Buffer<'a>>) -> Result<usize> {
    let lines: SplitLines = SplitLines::new(bytes, NEWLINE);
    for mut line in lines {
        let value = json_deserializer::parse(&mut line).map_err(|e| {
            PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
        })?;
        match value {
            Value::Object(value) => {
                buffers
                    .iter_mut()
                    .for_each(|(s, inner)| match value.get(s) {
                        Some(v) => inner.add(v).expect("inner.add(v)"),
                        None => inner.add_null(),
                    });
            }
            _ => {
                buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
            }
        }
    }
    let bytes_read = bytes.len();
    Ok(bytes_read)
}
The parser doesn't expose a way to get the number of bytes read back, so the SplitLines iterator does cause some overhead. I copied over the code to call the parse_value function directly, and created an optimized loop that doesn't use the custom SplitLines iterator. I saw some perf gains (~5s), but it is still very far off from the serde implementation.
The more optimized version:
// it's ndjson, so we know it should always be {obj}\n{obj}\n
fn parse_lines<'a>(
    bytes: &mut &[u8],
    buffers: &mut PlIndexMap<String, Buffer<'a>>,
) -> Result<usize> {
    use json_deserialize::Value;
    let total_bytes = bytes.len();
    let mut read = 0;
    loop {
        if read == total_bytes {
            break;
        }
        // it's a newline
        if json_deserialize::current_token(bytes)
            .map_err(|e| PolarsError::ComputeError(format!("EOF error: {}", e).into()))?
            == NEWLINE
        {
            read += 1;
            *bytes = &bytes[1..];
        }
        // it's the object
        else {
            let byte_size = bytes.len();
            let value = json_deserialize::parse_value(bytes).map_err(|e| {
                PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
            })?;
            read += byte_size - bytes.len();
            match value {
                Value::Object(value) => {
                    buffers
                        .iter_mut()
                        .for_each(|(s, inner)| match value.get(s) {
                            Some(v) => inner.add(v).expect("inner.add(v)"),
                            None => inner.add_null(),
                        });
                }
                _ => {
                    buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
                }
            };
        }
    }
    Ok(total_bytes)
}