json-deserializer's People

Contributors

cjermain, delicioushair, jorgecarleitao, mizardx, ncpenke, universalmind303

json-deserializer's Issues

enable SIMD when comparing to simd-json benchmarks

Hi,

First of all, this is a really impressive creation. I love the trick of delaying number parsing; darn smart :D.

That said, I would suggest enabling SIMD when comparing against simd-json; otherwise it's a bit of a pointless comparison. The best way to do that without being too CPU-dependent is to use RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2"; those features are present on all modern x86-64 CPUs.
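
For example, a minimal sketch of the invocation, assuming the comparison runs through cargo bench:

RUSTFLAGS="-C target-feature=+avx,+avx2,+sse4.2" cargo bench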

Include license file in repo and crate

Hi, thanks for this crate (currently using it as part of Polars)! It looks like Cargo.toml specifies the license as Apache-2.0, but it'd be nice to include a license file in the repo (which Cargo should automatically include in the published crate if present).

I also noticed you're the owner of the arrow-format crate; it would be nice to do the same there.
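
A minimal sketch of the relevant Cargo.toml fields (only one of `license` and `license-file` should be set; the comments reflect my understanding of Cargo's packaging behavior):

[package]
name = "json-deserializer"
license = "Apache-2.0"      # SPDX identifier
# license-file = "LICENSE"  # alternative to `license` for non-SPDX license texts

With a LICENSE file added at the repository root, recent Cargo versions should bundle it into the published .crate automatically.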

parsing special characters in strings produces invalid results

Originally reported via pola-rs/polars#8424

[dependencies]
json-deserializer = "0.4.4"

use json_deserializer::Value;

fn main() {
    let expected = "你好,polars。\n";
    let json_data = format!(r#""你好,polars。\n""#);
    let bytes = json_data.as_bytes();
    let v = json_deserializer::parse(bytes).unwrap();
    if let Value::String(s) = v {
        assert_eq!(s.as_ref(), expected)
    }
}
thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `"ä½\u{a0}好ï¼\u{8c}polarsã\u{80}\u{82}\n"`,
 right: `"你好,polars。\n"`', src/main.rs:13:9
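
For comparison, a minimal sketch using serde_json (assuming it as a dependency) decodes the same input correctly:

use serde_json::Value;

fn main() {
    // serde_json decodes the escaped \n without corrupting the
    // surrounding multi-byte UTF-8 characters
    let v: Value = serde_json::from_str(r#""你好,polars。\n""#).unwrap();
    assert_eq!(v.as_str().unwrap(), "你好,polars。\n");
}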

Invalid JSON panics or successfully parses

I was testing potential JSON libraries to use and I noticed several issues from json.org's JSON Checker:

  • fail8: an extra array close doesn't cause an error: ["Extra close"]]
  • fail27/fail25: unescaped tabs and newlines in string literals are accepted, even though they aren't conformant to the JSON spec.
  • fail32: a missing closing brace causes a panic: {"Comma instead if closing brace": true,
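
A minimal harness sketch covering both failure modes (the expectation is an Err in every case; catch_unwind distinguishes panics from incorrect Ok results):

fn main() {
    // Each input is invalid JSON and should be rejected with an Err,
    // rather than parsing successfully or panicking.
    let cases: &[&[u8]] = &[
        br#"["Extra close"]]"#,                          // fail8
        b"[\"unescaped\ttab\"]",                         // fail25-style
        br#"{"Comma instead if closing brace": true,"#,  // fail32
    ];
    for case in cases {
        match std::panic::catch_unwind(|| json_deserializer::parse(case)) {
            Ok(Ok(v)) => println!("BUG: parsed invalid input: {:?}", v),
            Ok(Err(e)) => println!("correctly rejected: {:?}", e),
            Err(_) => println!("BUG: panicked"),
        }
    }
}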

This library looks great and is quite efficient!

Bug: unable to parse scientific notation with `+` symbol

This was first reported in polars. We are using arrow2 to infer the schema for ndjson data, and arrow2 appears to use this crate to deserialize JSON in the implementation of arrow2::io::ndjson::read::infer.

Expected:

The number is parsed correctly. serde_json parses it as: {"Value": Number(11000000000.0)}

Actual:

json_deserializer errors out with MissingComa(43) when trying to parse scientific notation with a + symbol.

Steps to reproduce

fn main() {
    let s = br#"{"Value":1.1e+10}"#.to_vec();
    let v = json_deserializer::parse(&s);
    println!("{:?}", v);
}
// Err(MissingComa(43))

Note: the - symbol parses as expected.

fn main() {
    let s = br#"{"Value":1.1e-10}"#.to_vec();
    let v = json_deserializer::parse(&s);
    println!("{:?}", v);
}
// Ok(Object({"Value": Number(Float([49, 46, 49], [45, 49, 48]))}))
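
For reference, the JSON grammar allows an optional sign in the exponent (exp = ("e"|"E") ["+"|"-"] digits), and a minimal serde_json sketch accepts the + form:

fn main() {
    // serde_json accepts the '+' exponent sign, as the JSON grammar permits
    let v: serde_json::Value = serde_json::from_str(r#"{"Value":1.1e+10}"#).unwrap();
    println!("{:?}", v); // Object {"Value": Number(11000000000.0)}
}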

Performance comparison with in-place serde + possible improvements

Awesome work as usual @jorgecarleitao.

I was curious what the performance would be with serde_json's in-place deserialization so I created a custom deserializer for parser::Value. I had to duplicate parser::Value to use converted numbers since it wasn't possible to use this library's lazy numeric parser without replicating all of serde_json's parsing code.

The code for the custom deserializer is here. I updated the benchmarks to include this implementation, and also added a test for escaped strings.

Results from my macbook are below. serde_json_custom is the benchmark for the custom deserializer. string_escaped_chars is the benchmark for escaped strings.

string               json_deserializer  2^10   time: [63.949 us 64.339 us 64.739 us]
string               serde_json         2^10   time: [100.51 us 101.21 us 101.94 us]
string               serde_json_custom  2^10   time: [41.162 us 41.491 us 41.923 us]
string               simd_json          2^10   time: [25.250 us 25.574 us 25.932 us]
string_escaped_chars json_deserializer  2^10   time: [193.93 us 195.39 us 196.92 us]
string_escaped_chars serde_json         2^10   time: [130.26 us 131.22 us 132.27 us]
string_escaped_chars serde_json_custom  2^10   time: [137.42 us 138.47 us 139.55 us]
string_escaped_chars simd_json          2^10   time: [46.427 us 46.821 us 47.252 us]

I was surprised by the results, especially since nothing stuck out when inspecting the code. Digging in further, it seems to come down to the relatively large match expression in compute_length. Reworking it improved the numbers (a sketch of the kind of rework is included at the end of this issue):

string               json_deserializer  2^10   time: [39.574 us 39.719 us 39.868 us]
string               serde_json         2^10   time: [103.74 us 104.84 us 105.99 us]
string               serde_json_custom  2^10   time: [46.926 us 48.188 us 49.584 us]
string               simd_json          2^10   time: [27.517 us 27.786 us 28.105 us]
string_escaped_chars json_deserializer  2^10   time: [144.19 us 145.30 us 146.43 us]
string_escaped_chars serde_json         2^10   time: [143.75 us 144.64 us 145.52 us]
string_escaped_chars serde_json_custom  2^10   time: [145.71 us 147.09 us 148.73 us]
string_escaped_chars simd_json          2^10   time: [47.152 us 47.652 us 48.162 us]

If you're able to replicate the results and are interested in taking the changes, I can open a PR. The changes are in this branch. Thanks!
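
A minimal sketch of the kind of rework described above, assuming the hot path classifies bytes while scanning a string body; the names (build_table, CLASS, skip_string_body) are illustrative, not the library's actual internals:

// Replace a large per-byte match with a 256-entry classification table.
const fn build_table() -> [u8; 256] {
    let mut t = [0u8; 256];
    t[b'"' as usize] = 1;  // string delimiter
    t[b'\\' as usize] = 2; // escape introducer
    t
}

static CLASS: [u8; 256] = build_table();

// Returns the offset of the unescaped closing quote of a string body.
fn skip_string_body(bytes: &[u8]) -> usize {
    let mut i = 0;
    while i < bytes.len() {
        match CLASS[bytes[i] as usize] {
            1 => return i, // closing quote found
            2 => i += 2,   // skip the byte following an escape
            _ => i += 1,   // ordinary byte
        }
    }
    i
}

fn main() {
    let s = br#"abc\"def" rest"#;
    println!("closing quote at byte {}", skip_string_body(s)); // 8
}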

Performance: much slower than serde-json.

I was trying to port the polars ndjson reader over to this crate instead of serde_json, and noticed a huge degradation in performance after the refactor.

On a dataset of 500_000 records containing data similar to this (more than happy to share the actual file if needed):

{
  "number": 300,
  "hash": "0xb3e37f7c14742bc54d08163792d38ada69c3951817b8dde6ef96776aa5c0f00c",
  "parent_hash": "0x989b8bf2af0be6c18c9c95bfde81492e0b47bcc1c26d555bb7cea2d09e92c6c3",
  "nonce": "0x424b554fa4a7a04f",
  "sha3_uncles": "0x1dcc4de8dec75d7aab85b567b6ccd41ad312451b948a7413f0a142fd40d49347",
  "logs_bloom": "0x00000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000000",
  "transactions_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
  "state_root": "0x34e5b52497408cd2bbcb6992dee0292498a235ec7aca1b34f6cbccb396f85105",
  "receipts_root": "0x56e81f171bcc55a6ff8345e692c0f86e5b48e01b996cadc001622fb5e363b421",
  "miner": "0xbb7b8287f3f0a933474a79eae42cbca977791171",
  "difficulty": 19753900789,
  "total_difficulty": 5531721283386,
  "size": 544,
  "extra_data": "0x476574682f4c5649562f76312e302e302f6c696e75782f676f312e342e32",
  "gas_limit": 5000,
  "gas_used": 0,
  "timestamp": 1438270848,
  "transaction_count": 0,
  "base_fee_per_gas": null
}

An unoptimized build using serde_json will load the JSON file in ~35s on my device.
The unoptimized build using json_deserializer loads it in ~55s.

The code changes are available here, just in case I did something wrong on my end in using json_deserializer:

https://github.com/pola-rs/polars/compare/master...universalmind303:polars:ndjson-perf-enhancements?expand=1.

fn parse_lines<'a>(bytes: &[u8], buffers: &mut PlIndexMap<String, Buffer<'a>>) -> Result<usize> {
    let lines: SplitLines = SplitLines::new(bytes, NEWLINE);

    for mut line in lines {
        let value = json_deserializer::parse(&mut line).map_err(|e| {
            PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
        })?;
        match value {
            Value::Object(value) => {
                buffers
                    .iter_mut()
                    .for_each(|(s, inner)| match value.get(s) {
                        Some(v) => inner.add(v).expect("inner.add(v)"),
                        None => inner.add_null(),
                    });
            }
            _ => {
                buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
            }
        }
    }
    let bytes_read = bytes.len();

    Ok(bytes_read)
}

The parser doesn't expose a way to get the number of bytes read, so the SplitLines iterator does cause some overhead. I copied over the code to call the parse_value function directly and created an optimized loop that doesn't use the custom SplitLines iterator. That gained ~5s, but it is still very far off from the serde implementation.

The more optimized version:

// it's ndjson, so the input should always be {obj}\n{obj}\n
fn parse_lines<'a>(
    bytes: &mut &[u8],
    buffers: &mut PlIndexMap<String, Buffer<'a>>,
) -> Result<usize> {
    use json_deserialize::Value;
    let total_bytes = bytes.len();
    let mut read = 0;

    loop {
        if read == total_bytes {
            break;
        }
        // it's a newline
        if json_deserialize::current_token(bytes)
            .map_err(|e| PolarsError::ComputeError(format!("EOF error: {}", e).into()))?
            == NEWLINE
        {
            read += 1;
            *bytes = &bytes[1..];
        } 
        // it's the object
        else {
            let byte_size = bytes.len();
            let value = json_deserialize::parse_value(bytes).map_err(|e| {
                PolarsError::ComputeError(format!("unable to parse line: {}", e).into())
            })?;
            read += byte_size - bytes.len();

            match value {
                Value::Object(value) => {
                    buffers
                        .iter_mut()
                        .for_each(|(s, inner)| match value.get(s) {
                            Some(v) => inner.add(v).expect("inner.add(v)"),
                            None => inner.add_null(),
                        });
                }
                _ => {
                    buffers.iter_mut().for_each(|(_, inner)| inner.add_null());
                }
            };
        }
    }
    Ok(total_bytes)
}
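
To separate raw parser cost from the line-splitting logic, here is a minimal standalone timing sketch (assuming json_deserializer and serde_json as dependencies; the record is a trimmed version of the sample above):

use std::time::Instant;

fn main() {
    let line = br#"{"number":300,"gas_limit":5000,"base_fee_per_gas":null}"#;
    let n = 500_000;

    // time json_deserializer over n repetitions of one record
    let t = Instant::now();
    for _ in 0..n {
        let _ = json_deserializer::parse(line).unwrap();
    }
    println!("json_deserializer: {:?}", t.elapsed());

    // time serde_json's untyped Value parsing over the same input
    let t = Instant::now();
    for _ in 0..n {
        let _: serde_json::Value = serde_json::from_slice(line).unwrap();
    }
    println!("serde_json: {:?}", t.elapsed());
}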
